Spring 2003 Database Qualifier Exam

Answer any 7 questions. If you have to make any assumptions, state them clearly and proceed.

Question 1 - (Data Models, Mapping)

You are given a relational DBMS like Oracle or DB2. You have been asked to design an interface/mapping tool that allows users to view the data as ER and EER diagrams on top of this RDBMS. This tool will present the underlying data in the form of ER/EER schemas and will enable an ER model-based query language for specification of queries (could be like SQL or like QBE). Under this scenario, answer the following-

(a) Discuss what type of information you will need the dba/db designers to supply about the underlying tables so that they can be “reverse engineered’ into ER schemas. Give an example of a difficult reverse mapping situation and why it is difficult to “automatically” map it into ER without user intervention. Discuss what additional information you will need from humans to enable proper reverse engineering.
(b) Propose a design of a catalog that contains the relevant mapping details for the above reverse engineering.
(c) Show an example of an ER-schema-view-based query involving select/project/join and one that would require some form of aggregation and say narratively how it gets mapped onto the relational schema. (invent any notation for ER/EER-query specification that is similar to SQL or QBE on the fly! Do not worry about exact syntax).
 
Question 2 - (Relational Database Design)

Given a relational database with relations, R1, R2, …, Rn, and an associated set of Functional Dependencies, F1, F2, …, Fn. All the relations are in Boyce-Codd Normal Form (BCNF).  Now, consider each of the following relational algebra operators in turn: projection, selection, Cartesian product, natural join, outer join, difference, intersection, union, and outer union.  For each operator, determine whether it is possible to create a query result, which is not in BCNF.  Justify your answers, i.e., if it is not possible, then explain why not, and if it is possible, then give an example showing that the result is not in BCNF.  (Caution: remember that being in a normal form is an intentional property, not an extentional one.)

Question 3 - (Query Processing and Optimization)

Consider a relation R(A,B) stored using a partitioned hash organization.  The hash table contains 1024 blocks (buckets), and the tuples (records) are stored in these blocks. To store tuple (a,b), we apply hash function h1 for a, obtaining X bits. Then we apply hash function h2 on b, obtaining 10-X bits. The bits are concatenated, obtaining a 10 bit hash that identifies the block where (a,b) is placed. Suppose that 20% of the queries on R are of the form Q1: SELECT * from R where A=a, and 80% of the queries are of the form Q2: SELECT * from R where B=b where a and b are constants.
a) How many blocks are accessed by the Q1 queries? How many by the Q2 queries?  (hint: your answer will be a function of X.)
b) Write an expression that gives the expected number of blocks that must be accessed (on average).
c) What X value minimizes the expected number of IOs?
d) Can this hash support a query Q3: SELECT * from R where A>a1 AND A<a2? Why or why not? Explain your answer.
e) Describe a real world application that will benefit from partitioned hash organization.

Question 4 - (Physical Storage Structures and Performance)

You are given a file of records for telephone customers. The file is sorted on city, last name, and first name.  Each record is 200 bytes and there are 90 million records.  You are also given a Cheetah 36 disk from Seagate (see p.120 in the book “Fundamentals of Database Systems” by Elmasri/Navathe).

You are to write this sorted file to the disk as fast as possible.  Your initial and correct reaction probably is that the file must be written sequentially, but exactly what does “sequentially mean?  I.e. when you have written the first full cylinder of data, exactly where do you write next? You need to consider the real physical issue of writing to disk cylinder by cylinder.

1) Explain precisely how the file is written to disk using the physical characteristics of the Cheetah 36 to determine you steps.
2) Calculate how much time it will take to write the file.
 
Question 5 - (Concurrency Control)

Consider three transactions:
T1: r(a) r(b) w(a) w(b) w(c);
T2: r(c) r(b) w(b); and
T3: r(c) r(a) w(c).
Each of the following five questions asks for a schedule of a particular type, for transactions T1, T2, T3 (and no others):

(a) Give a conflict-serializable but not recoverable schedule generated by a locking-based scheduler (pessimistic concurrency control)
(b) Give a serializable, but not recoverable schedule generated by a validation schedule (i.e., using an optimistic concurrency control scheduler)
(c) Give a seralizable and recoverable schedule that does not avoid cascading rollback (ACR), generated by a validation scheduler.
(d) Give a seralizable, recoverable, ACR, and strict schedule, but not serial, generated by a validation scheduler.
(e) Give a scheduler that cannot be generated by a validation-based scheduler, but can be generated by a strict locking based scheduler.

You may represent your schedules in a 2-D grid with time axis flowing down the vertical axis and a separate column for each data value. Please show the validation or commit properties. There may be the case that there exists no schedule that meets the desired properties. If so, you may write “NO SCHEDULE EXISTS”. You are required to explain your answer.

Hints:
(1) A schedule S is recoverable if each transaction commits only after all transactions from which it reads have committed. A schedule S avoids cascading rollback (ACR) if each transaction may read only those values written by committed transactions. A schedule S is strict if each transaction may read and write only items previously written by committed transactions.

(2) How does the 2D grid look like? Consider the locking based schedule in question (a) and (e), your grid will have four columns: a, b, c, and commit point. The number of rows in this gird will be 14, namely the total number of actions: 6 for T1 (add commit), 4 for T2 (add commit) and 4 for T2 (add commit). If we assume that T1 starts first, then T2 starts. The first row of this grid is r1(a) under column a and nothing under the rest of the three columns. The second row is r2(c) under column c and nothing under the other columns, and so on. Similarly, the grid for validation schedule (question b, c, d) will have four columns: a, b, c, and validation point, and the number of rows will be 14 for validation schedule, namely the total number of actions: 6 for T1 (add validation point), 4 for T2 (add validation point) and 4 for T2 (add validation point).

(3) You may also use T1, T2, and T3 as the columns of the 2D grid if you are more used to that format. Make sure you include commit point or validation point in your schedule for each of the three transactions.

Question 6 - (Database Recovery)

Suppose the scheduler for a database system uses a strict 2-phase locking scheme for concurrency control. Also, the scheduler uses the tuple as the basic unit of locking.  Hence, a page may contain 2 (or more) distinct tuples that are write-locked by different transactions.  As for I/O operations, the basic unit of reading and writing is the page. We will assume that write-ahead logging is used by the recovery manager and in-place updating is done.  In general the recovery manager has four possible schemes for handling recovery:
∑ Steal/Force
∑ Steal/No-Force
∑ No-Steal/Force
∑ No-Steal/No-Force.
The first component, either Steal or No-Steal, refers to whether the Buffer Manager can write a page that has been updated by a non-committed transaction to disk.  If this can happen, then it is a Steal policy.  If not, then it is a No-Steal policy. The second component, Force or No-Force, refers to whether the recovery manager writes all pages updated by a committed transaction to disk when the transaction commits.  If updated pages are written at commit time, then it is called Force, otherwise it is No-Force.

(a) Explain whether or not each of the four schemes can be applied with our scheduler assumptions.

(b) For those schemes that are appropriate for our scheduler, explain how the recovery manager would handle the situation where a transaction decides to abort itself.

Question 7 - (Distributed Databases)  

Suppose we have 4 sites  (New York, Chicago, Los Angeles and Miami) in a distributed database environment.  We have one relation, the Year-To-Date-Sales table with columns Salesperson-ID, Region, Date, Sales and Commission.  The key for the table is the combination of Salesperson-ID and Date. So a given row represents the total sales of a particular salesperson for a given day of the year. The functional dependencies for this table all involve the key on the left-hand side of the dependency except for the following functional dependency: Salesperson-ID   Region. To improve performance the table is horizontally fragmented into 3 parts as follows:
 SELECT  * FROM Year-To-Date-Sales WHERE Region = 'east' for New York site.

SELECT  * FROM Year-To-Date-Sales WHERE Region = 'central' for Chicago site

SELECT  * FROM Year-To-Date-Sales WHERE Region = 'west' for Los Angeles site

We can assume that there are only 3 regions, as specified in the above SQL queries.  The table is fragmented as it is since each site needs to produce the sum of sales for their respective region. This would only involve local processing.  The Miami site is not responsible for storing the table.  You can also assume a uniform data distribution.

(a) Explain how the following query issued at the Miami site, could be efficiently processed:
      SELECT SUM (Sales), Date
      FROM Year-To-Date-Sales
      GROUP BY Date.

(b) Suppose we have another query which can be issued at any site as
      SELECT SUM (Sales)
      FROM Year-To-Date-Sales
      WHERE Salesperson-ID =? AND Date =?

The question marks (?) represent variables whose values are supplied by the user at run-time.

To improve query performance, the database system may maintain appropriate indexes.  Discuss the different options and the positive and negative points as to where the index(es) should be stored and whether the index(es) themselves should be fragmented.

Question 8 - (Technology and Applications)

A. The database technology has broadened considerably to enable new application areas in the 90’s and beyond. Consider three technology areas: data mining/OLAP, bioinformatics, and mobile database management. Pick one of these areas that you know most about.   Discuss
- what type of Data modeling, design, querying advances have been made in that area
- what DBMS functionality is lacking? What can we expect in the near future?
- List three priorities for research and justify why you consider those research problems to be important.
     
B. There is a lot of work currently going on with (a) XML and with (b) the area of Information Retrieval and Search engines. Pick any one. Write a paragraph on how this area impacts research in databases, what you see as major applications that will benefit from this technology and what are its limitations.