DATABASE QUALIFIER EXAM
SPRING 2002
April 19, 2002

Note: Please attempt 8 (eight) out of 9 (nine) questions. If you have to make assumptions, please state them clearly and proceed. If the question is unclear, just state how you interpreted it so the answer can be properly evaluated.

 

1. Data Modeling, Mapping, Languages

Imagine that you were going to implement an (extended) Entity-Relationship database management system. Ignore the query language and query optimization. Also, ignore physical sorting, indexing, etc. Instead, concentrate on the implementation of the data structures.

Which data structures would you use for:

- Entities?
- Properties of entities? Single-valued and multi-valued?
- Relationships? 1-1, 1-N, and N-M?
- Properties of relationships?
- Super-subtype (is-a) relationships? Several specializations of the same supertype?

Given the data structures above, now consider each of the modification, insert, delete and update, to data in the database. Please describe the semantics of each operation on each data structure. I.e., how would you insert a new entity? How would you insert an entity, which is already in a supertype into a subtype? How would you delete an entity from a supertype? Etc.


2. Relational DB Design

Consider the overall process of relational database design as performed in practice as well as the way it is approached in theory. Answer the following keeping both in mind.

  1. " It can be claimed that if the conceptual design is done completely and correctly in a model like the ER or extensions of ER and a mapping procedure (such as the one in Ch 9 of Elmasri/Navathe) is used diligently, the resulting database will require little normalization. " Defend this statement first.
  2. There may be intra-entity of inter-entity dependencies that may need further normalization. Come up with two examples that illustrate the need for such normalization. (show the ER model, show its mapping, point out why normalization is needed, then show the result of normalization).
  3. "BCNF can be called a stronger definition of 3NF." Justify in what way BCNF is stronger. Come up with an example (not in any text) to illustrate your point.
  4. What is ‘denormalization’ and why it may be done? Show two relations in BCNF, which may be denormalized into a relation that is not even in 2NF. Discuss the implications of your example denormalization on the subsequent processing of the resulting relation.

 

3. Query Processing

Consider the following query:

SELECT DISTINCT Ename
FROM EMPLOYEE
WHERE Title = 'programmer' AND Dept = 'production"

Assume the following:

10% of employees are programmers,
5% of employees are programmers who work for the production department,
20% of employees work for the production department,
there are 10 departments,
the EMPLOYEE relation has 1,000 pages with 10 tuples per page,
there is a 51 page buffer that can be used to process the query.

Find the best execution plan for the following cases:

(a) The only index is on Title and it is an unclustered 2-level B+ tree.
(b) The only index is a composite index on Dept,Title,Ename and it is a clustered B+ tree with 2 levels.
(c) The only index is a composite index on Dept,Ename,Title and it is a clustered B+ tree with 3 levels.

Make sure you discuss the different options and their I/O costs for processing the query in each of the three cases.

Also, be sure to specify any assumptions you make.

 

4. Physical Storage Structures

Stream Databases is the latest fad in databases.

A stream database may store, e.g. a click stream, transactions from a set of cash registers in a supermarket chain, time-series data from a scientific experiment, transactions in the stock market, or traffic at an ISP host.

Most likely, the stream will be stored, or logged, sequentially by simply appending each new event in the stream to the end of the log. Most likely, the events in the stream will have monotonically increasing time-stamps that represent the times events were logged. Many applications need to access the log of events to find the events that were logged at a particular point of time in the past or during a particular period of time in the past. In order to make the access to the log as efficient as possible for these applications, you are asked to analyze the many different indexing structures you know, including sparse and dense indices, single-level indices, B-trees, and hash structures.

For each one of the index structures, please describe all advantages and disadvantages of using this index structure to index stream databases. Please address at least the following issues for each structure: efficiency of logging, size of index, and efficiency of lookup.

 

5. Transactions and Concurrency Control

Consider the following schedule of transactions T1 and T2, operating on database objects x, y, z:

S: b, w1(x) r2(x)w2(y) c2 w1(z) e

Events b and e are the beginning and the end of the schedule S, respectively, and c is the commit. Assume that two phase locking protocol is used. Please answer the following questions, all asking where certain events may occur in this schedule. You may answer each question by giving the two events in the schedule S that must precede and follow the event in question. For example, transaction T1 must lock x between b and w1(x).

  1. Where can T1 unlock x?

  2. Where can T1 lock z?

  3. If undo logging is used, when can T2 write to the log on disk the undo information for its write to y?

  4. If red0 logging is used, when can T2 write to the log on disk the redo information for its write to y?

  5. The schedule S cannot possibly occur if the strict two phase locking is used. Assuming you can move T2‘s actions in S, where can w2(y) occur in the new schedule?

 

6. Database Recovery

To restore a database to a consistent state after a failure has occurred, a log and some combination of UNDO and REDO procedures are typically used. We can assume for this question that an UNDO-REDO recovery method is used where committed transactions are REDOne and uncommitted transactions are UNDOne.

In the case of physical logging, a before-image and after-image of the updated data item is stored in the log, i.e., a physical copy of the data item as it existed in the database before and after update.

In the case of logical logging, a log record would contain the name of an UNDO-REDO function and its parameters. For example, in the case of inserting a new record in a table, we might have the following REDO log record:

<INSERT, tablename, record>.

The associated UNDO record would look like

<DELETE, tablename, record>.

The motivation for logical logging is to reduce the amount of log data written and the associated I/O operations for the log. Even for a simple record insertion into a table, several changes may happen such as: moving records within a page, inserting entries into indexes for the Employee table and possibly firing trigger conditions that might change existing data values.

Answer the following with regard to a logical logging scheme:

(a) Are logical operations idempotent? Explain with an example. Also, if they are not, how can this be overcome.

(b) Can we assume that the logical operations are done atomically? Explain. Also, if they are not atomic, show what problem may be encountered during recovery and explain how we can overcome this problem.

 

7. Database Architectures

There have been several architectures proposed for databases:

- client/server
- fully distributed with possible replication (assume all types of transparencies)
- shared everything parallel architecture
- shared nothing parallel architecture

More recently we have seen use of mobile database architectures, web-based databases and peer-to-peer databases.

  1. describe the highlights of each of above in 3-4 sentences each
  2. describe your assessment of the current state of each from the standpoint of practical implementations in products.
  3. For each of them, point out one important outstanding research problem.

 

8. Distributed Databases

A retail company is distributed with offices in New York, Atlanta, Chicago, and Los Angeles. It wants to keep its ORDER data (order#, date, total_amount, customer#, warehouseId, …) and its CUSTOMER data (customer#, customer_info, last_order_date, …) with some optimal stategy for minimizing overall cost of processing inquires, orders, payments, etc. [NOTE: You are required to answer questions 1) and 4). But you only need to answer one of the two questions 2) and 3). ]

  1. Illustrate horizontal, vertical partitioning of the above data and argue why these partitioning strategies would be considered for a distributed database design.
  2. How could the company determine an initial optimal design of the database that distributes the data (a) without replication and (b) with replication? Please outline a procedure for the company to follow.
  3. What is semi-join? Construct an example with a sample query against the above retail company database. Use this example to illustrate a distributed query processing algorithm that involves replicated fragments and the use of semi-joins.
  4. Are these results relevant in today’s Web-based E-services? Why or Why not?

9. DB Technology and Applications

Database technology is undergoing a lot of evolution in terms of dealing with new applications enabled by the web and in e-commerce, entertainment and publishing, scientific applications in biology, GIS etc.

  1. Point out about five distinct demands that these applications place on DBMSs that are not usually considered in standard applications.
  2. Take one of two technologies; (i) object-oriented data models and database systems, (ii) XML and related technology. Comment on how , in your opinion, this technology tries to address the demand of applications.
  3. Point out what further research is necessary so as to meet the challenges with the technology you chose. (you can give a global answer to address all points in A above.)