Phase I Specifications

 

 

The Phase I of this project is called as “Storage Manager”. It will be responsible for handling –

The “Storage Manager” should consist of following components –

  1. System Catalog
  2. Buffer Manager
  3. Iterator
  4. Top Level Interface

 

The description of each component is given below in detail –

 

  1. “System Catalog” – There should be one “System Catalog” in your system which will manage the meta information about relations and attributes. It will consist of mainly two components “Attribute Catalog” and “Relation Catalog”.
    1. Attribute Catalog will give us information about

                                                               i.      Attribute Name

                                                             ii.      Data Type

                                                            iii.      Is that attribute nullable?

                                                           iv.      Parent Relation Name

                                                             v.      An internal ID

                                                           vi.      Number of distinct values in the Attribute –V(R)

 

    1. Relation Catalog will give us information about

                                                               i.      Relation Name

                                                             ii.      Create Date

                                                            iii.      Modify Date

                                                           iv.      Number of tuples in the relation – T(R)

                                                             v.      An internal ID

                                                           vi.      The indexed column(s)

                                                          vii.      The index file name(s) (character string)

                                                        viii.      The number of data blocks that the relation spans – B(R)

 

 

The relation will span over three files -

- [1] System Catalog

Only one System catalog will exist for all the relations. It will contain all the meta information required for data storage, access, modification and querying as described above.

- [2] Data File

      This file will be used to store the data of a particular relation.

The basic operations on the data file will be:

 

- [3] Index Files

This file will be used to store the indexing information for particular table. The basic operations on the index file will be -

B.                              

    B. Buffer Manager

D.                              

Another important component associated with the Storage Manager is the Buffer Manager. The Buffer Manager is responsible for maintaining a main memory pool (cache) of blocks [either data or index or catalog]. When a block is needed the Buffer Manager is called. It will check if the main memory buffer pool contains the desired block. If so the main memory address is returned. If the desired block is not in the buffer pool, a physical I/O (i.e., a READ) must be issued to retrieve the block from disk or secondary storage. The retrieved block will be stored in a free location in the buffer pool and its main memory address will be returned. If there is no free block, in the Buffer Manager, you can use The Clock Algorithm for buffer replacement to find a block in the buffer that will be replaced by the new incoming block. Usually, the Buffer Manager maintains an internal hash table or index table for fast lookup in the buffer pool. This table associates disk blocks currently in the buffer with buffer locations. The Buffer Manager will also maintain some statistics about the number of logical block accesses (the number of requests for the data block) and the number of physical block accesses for data (the number of disk fetches performed). Every time the Buffer Manager gets a request for a block, it increments the logical block access counter and if the requested block is not in the buffer, then it will also increment the physical block access counter.

The Storage Manager should accommodate two basic access paths: Table Scan and Index Scan. For a Table Scan, the Storage Manager will be called repeatedly to return each block of the file in physical order. For an Index Scan, the Storage Manager will be called to find the address associated with a given key value. The block address will be returned. The block may be an address of a data block or an address of a block containing pointers to data blocks that contain records with the same key value. Finally the Storage Manager is called to read the necessary data block. The value returned by the Storage Manager is the main memory address of a block in the buffer pool. As previously mentioned, the requested block may be found in the buffer, thus eliminating a disk access.

The block size we will use is 4 Kb (4096 bytes). The record size for the relations will be determined by the number of columns and there respective widths. Records should not be split between blocks. We will have a buffer pool of 16Mb (4,000 blocks).

The buffer manager (bm) interface should probably look like this:

 

        byteArray = bm.read (diskBlockNum, byteOffset, numBytes);

               bm.write (diskBlockNum, byteOffset, byteArray);

               bm.pin(diskBlockNum);

bm.unpin(diskBlockNum);

               bm.flush();

 

The write does not have to actually write the block to disk until it is evicted by The Clock Algorithm or a flush is requested. flush will write all dirty blocks to disk. pin will prevent a block from being evicted. unpin will allow a block to be evicted.

 

    C.   "Iterator": For Phase I of this project, you have to implement the basic functionality of the iterator –

For example, If R is the relation then

 

 

    D. “Top Level Interface”: For Phase I, your system should accept the following commands. They can simply be implemented as methods to be called by an application programmer: 

 

         CREATE TABLE table_name(attr1 dataType1[,attr2 dataType2, ... ,attrN dataTypeN]);

 

         CREATE INDEX index_name ON table_name (attr) [NO DUPLICATE]; 

        

         INSERT INTO table_name [(attr1, attr2, ... attrN)] VALUES (attr1Val, attr2Val, ... attr3)

 

         SELECT * FROM TABLE table_name [WHERE attrN = value];

 

         SELECT * FROM INDEX table_name index_name;

 

         SELECT * FROM CATALOG catalog_name;

 

Note:

1. Square brackets in the syntax specifications indicate command components that are optional.

 

2. catalog_name can be either ATTRIBUTE_CATALOG or RELATION_CATALOG.

                        

3. For the CREATE TABLE command, every relation name must be unique and within a given relation, every attribute name must be unique.

 

4. For the CREATE INDEX command, the associated catalog information for the index is created. The default is that duplicate values may occur for the indexed attribute. If duplicates cannot occur, then NO DUPLICATES should be specified in the CREATE INDEX statement. It will be responsible for reading the relation and storing the key value/address pairs in the index file.

 

5. The three SELECT commands are also utility functions but are only needed to help verify that your Phase I is working. The SELECT * FROM TABLE command will print the contents of a relation, one row per line. The SELECT * FROM INDEX command will print the key values in the leaf nodes of the index along with their data block addresses. The SELECT * FROM CATALOG command will print the contents of the system catalogs.

 

Important Notes for Phase I:

              

      1. You will implement your DBMS system such that each relation will have separate data and index file.

     

     2. I will be doing automated testing of your code. Therefore please be consistent and remain consistent with the class/attribute/method nomenclature that I will ask for.

 

      3. Disk File Location-

 

       To get realistic timings, you want your DBMS program to run on the same machine where your disk file is located. If you are using your CoC account, your home directory is on some file server machine, not the machine you are logged into. It is ok to run your program against the file in your home directly, but you will not get realistic timings. To get realistic timings, first copy your disk file to the /tmp directory then run your program. The /tmp directory is mapped to the local disk of the machine you are logged into. Don't forget to copy back your disk file to your home directory when your done, otherwise you will loose it when you log out. Logging out cleans out the /tmp directory.