This assignment is Due on: Day: Wednesday, April 22, 1998 Time: not later than 6:00 PM WARNINGS: TURN IN THIS ASSIGNMENT ELECTRONICALLY USING "turnin". LATE ASSIGNMENTS WILL NOT BE ACCEPTED. ---------------------------------------------------------------------- The purpose of this homework is to reinforce explanation of the following C programming concepts: Use of opendir(3c) Use of readdir(3c) Use of telldir(3c) Use of seekdir(3c) Use of rewinddir(3c) Use of closedir(3c) Type definitions Linked lists Hashing You should have received a file called "dirhash.c" with this assignment. It provides driver code for opening directories and scanning through the files in them. Before beginning this homework, read pages 271-279, 288-291, and 367-369 in the Weiss book. THE SETUP: Most modern operating systems provide a method for organizing files into directories (this is a Good Thing). Unfortunately, directories tend to be stored as nothing more complex than an unordered list of filenames, and so a linear search of this list is required whenever a particular file is requested (this is potentially a Bad Thing). A linear search is suitable if the number of files is "small", but for large numbers of files, it can be inefficient. For this homework, you will be constructing a data structure that will provide a more efficient method to locate and access files (by name) in a directory. You will be designing and coding a file access module that will use hashing to locate files more quickly (note: it is assumed that you already know something about how hashing works - if you are not familiar with terms like "hash table", "hash function" "collision", "linear probing", etc, then it is recommended that you spend some time reviewing these concepts before you proceed). The Weiss book provides an example of how to hash things into an array, using a technique called "linear probing" to resolve collisions. When a collision occurs (from two values being hashed into the same spot in the hash table), a sequential search of the hash table is started, until the next vacant spot is located where the new value can be placed. While this method is simple enough to understand and program, it has a few drawbacks. For starters, scanning the table for a particular "MyValue" searches not only through the values that have the same hash value as MyValue (which is to be expected), but it may also search through other values that have different hash values from MyValue. In the worst case (as the table starts to fill up), this results in a linear search through the entire hash table, which at that point is basically nothing but an unordered list (which is exactly the situation we were trying to avoid by using hashing to begin with). Weiss addresses this problem by using a "Rehash" function, which dynamically grows the table whenever it gets 50% full. This method works well enough, but is very time consuming, because every value in the hash table must be rehashed into the larger table after the larger table is created (only garbage collection on a Common LISP interpreter is less efficient). THE PITCH: You will modify Weiss's algorithm by using "chaining" for your collision resolution, instead of linear probing. The chaining method creates a linked list of values that were hashed into the same slot, so instead of searching through (potentially) the entire array to find a value, you only have to search through a (presumed) short list of values that hashed to the same slot in the table as the one you are looking for. Each entry in the chain will be a structure that contains three things: (a) a pointer to a string that contains the name of a file in the currently open directory (this field is copied from the "d_name" field of the "struct dirent" whose pointer is returned by a call to readdir(3c)), (b) a long integer that tells where in the current directory stream the file is located (this field is copied from the value that is returned by a call to telldir(3c)) and (c) a pointer to the next entry in the chain. -----------------------Start Answer Here----------------------- [p0] 3.0 points Rewrite Weiss's typedef for "Cell" (shown in Figure 10.22 on page 275) to conform to the above description. You may choose whatever field names you wish for the structure members. ------------------------End Answer Here------------------------ Now that we have modified the basic structure of the Cells, we will need to construct a HashTbl that is more suitable for our particular application. Weiss uses the HaMakeEmpty() function to malloc(3c) space for the HashTbl initially, and then readjusts the table size as the number of Cells increases. For our purposes, a static declaration should suffice, since we will be incrementally creating new Cells in the linked list chains as we need them. -----------------------Start Answer Here----------------------- [p1] 3.0 points Rewrite Weiss's typedef for "HashTbl" (shown in Figure 10.22 on page 275) so that it defines a static array of HaInitSize Cells. You may change the definition of HaInitSize if you wish (to make it "compatible" with a static declaration), but do not change its value (i.e. leave it at 17). Now show a declaration for a variable (named "H") of type HashTbl, that is defined such that all pointers in each Cell are initialized to NULL. These Cells will be used as head nodes for the linked list chains that we will create later. ------------------------End Answer Here------------------------ Now that you have determined what the structure of the HashTbl will look like, write a function called Print(), using the following prototype: void Print( const HashTbl H ) This function shall print the HashTbl out in a readable form. For each entry/list in the HashTbl, your function should print the name of each hashed file, plus its location in the directory. Here is a suggestion on formatting your output: [ 0] [ 1] <4lib 512> <5bin 456> [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8]
[ 9] [10] [11] [12] <. 0> [13] [14] <.. 12> [15] [16] A couple of notes: 0. Exactly one complete entry/list from the table must be printed on each line. 1. You will always have a table entry for the "." and ".." files in each directory. If you use Weiss's Hash() function, then these will always hash into table entries 12 and 14, respectively. Their directory locations (as returned by telldir(3c) ) should be 0 and 12, respectively. 2. The above output was produced by running "dirhash" in the "/usr" directory on acmex. Other directories will produce different results, but you may use the above to "check" your program for correct behavior. -----------------------Start Answer Here----------------------- [p2] 8.0 points Insert the COMPLETE source code for your version of the Print() function here. ------------------------End Answer Here------------------------ Now we must modify the routines that find and insert things. Weiss shows several functions on page 277 to handle those tasks. You will rewrite each of the functions shown in Figure 10.24 to access the data as stored in the HashTbl data structure. First we will modify FindPos() to locate an entry in the table. The new version of FindPos() will have the following prototype: Cell * FindPos( const char *FileName, HashTbl H ); It will have similar functionality to Weiss's version, except that it will return a pointer to the actual Cell, instead of an index into the HashTbl. FindPos() will need to do the following tasks: 0. Hash() the FileName that was passed, to get the index of the head node (in the HashTbl) for the chain that contains it. 1. Search down the chain (which is a linked list) to see if a Cell with the requested FileName can be found. 2. If the Cell is found, then FindPos() returns a pointer to it. 3. If the Cell is NOT found, then FindPos() returns a pointer to the head node for that chain. -----------------------Start Answer Here----------------------- [p3] 8.0 points Insert the COMPLETE source code for your new version of the FindPos() function here. ------------------------End Answer Here------------------------ Now we must modify the Find() function to correctly process the information that is returned by FindPos(). The new version of Find() will have the following prototype: struct dirent * Find( DIR *DirStream, const char *FileName, const HashTbl H ); In this case, Find() will return a pointer to the directory entry structure for the requested FileName in the directory stream that is passed to it. To do this, Find() will need to do the following tasks: 0. Find the Cell in the HashTbl that contains the requested FileName. 1. Use the "long" field of that Cell to seekdir(3c) to the appropriate directory entry in the requested DirStream. 2. Read that directory entry and return a pointer to it. 3. If the Cell with the requested FileName cannot be found, then Find() returns NULL. -----------------------Start Answer Here----------------------- [p4] 8.0 points Insert the COMPLETE source code for your new version of the Find() function here. ------------------------End Answer Here------------------------ Now the only thing left is to rewrite Insert(). The new prototype will look like this: struct dirent * Insert( DIR *DirStream, const char *FileName, const HashTbl H ); Insert() will need to do the following: 0. Search the requested DirStream for the requested FileName. 1. If the FileName is found in the DirStream, and it is NOT in the requested HashTbl, then Insert() hashes it into the HashTbl along with its location within the DirStream (as returned by telldir(3c)). Insert() then returns a pointer to that directory entry (in the DirStream itself). Note that in this case a new Cell will be added to the linked list chain at the HashTbl entry for the FileName. 2. If the requested FileName is found in the DirStream, and it is ALREADY in the requested HashTbl, then Insert() just returns a pointer to that directory entry (in the DirStream itself). Note that in this case, the behavior of Insert() is exactly like that of Find(), for the same set of parameters. 3. If the requested FileName is NOT found in the requested DirStream, then no action is taken, and Insert() returns NULL. -----------------------Start Answer Here----------------------- [p5] 12.0 points Insert the COMPLETE source code for your new version of the Insert() function here. ------------------------End Answer Here------------------------ One of the advantages of doing collision resolution by chaining is that the Cells in a particular chain can be moved around with respect to each other, so that frequently accessed Cells stay nearer the front of the chain while "less popular" ones stay at the back. The most common method for doing this is to simply copy each Cell to the front of the linked list chain whenever somebody asks for it. This reduces the overall search time for that Cell if it is needed again at some time in the near future. Modify your FindPos() function so that when a Cell is found, it is moved to the front of the linked list chain before its pointer is returned. -----------------------Start Answer Here----------------------- [p6] 8.0 points Insert the COMPLETE source code for your improved version of the FindPos() function here. ------------------------End Answer Here------------------------