Database Qualifier
Spring 2004

Please solve 7 of the following 8 problems.


1. DATA MODELS

1) It is said that while the relational model does a great job in terms of formalization and preciseness, it lacks conveying semantics of data.
∑ What are the reasons for this shortcoming (in other words , what are the inherent nature of the relational model that precludes the incorporation of additional semantics?).  
∑ What do current relational systems do in terms of letting the application developer “add further behavioral properties to data?  – support your answer with some examples.

2) (b) You are given a relational DBMS like Oracle or DB2. You have been asked to design an interface/mapping tool that allows users to view the data as ER and EER diagrams on top of this RDBMS. This tool will present the underlying data in the form of ER/EER schemas and will enable an an ER model-based query language. Under this scenario, answer the following-
∑ Discuss what type of information you will need the dba/db designers to supply about the underlying tables so that they can be “reverse engineered’ into ER and EER schemas.
∑ Comment on some features of the query language you would design for this tool.  Mention how it would differ from the standard relational language like SQL.

2. NORMALIZATION
Given the following relation:
Banker ( BranchName, CustomerName, BankerName )
With these functional dependencies:
BankerName Æ BranchName
CustomerName, BranchName Æ BankerName

The semantics is that each banker works is a particular branch, and each customer may have a personal banker assigned in each branch.

The Banker relation is one of those unpleasant relations where we need to decide whether to keep it in 3rd Normal Form (3NF) or to decompose it into two Boyce-Codd Normal Form (BCNF) relations with dependency loss.

1) Please show the BCNF decomposition.

Assume that the bank is now developing an application to support the hiring of a huge number of new bankers to work in branches of the bank.  None of these new bankers will be assigned as personal bankers for any customers.

2) Discuss the cost of the join between the two BCNF relations needed to make sure the lost dependency is enforced.
3) Inspired by 2), consider insertions, deletions, and updates in the two BCNF tables.  Decide for each case when we need to worry about the lost dependency and when we do not need to worry about it.
4) Consider if it is possible to use assertions in SQL to enforce the functional dependencies in the two BCNF relations.  Show examples, if possible.

3. QUERY PROCESSING

Consider the following query:

 SELECT DISTINCT Sname
 FROM Students
 WHERE Major = 'Database' AND Dept = 'Management"

Assume the following:
  10% of students are choosing database as their major,
  5% of students with database major are Management students,
  20% of students are from the management department, there are 10 departments, the Students relation has 3,000 pages with 10 tuples per page, there is a 51 page buffer that can be used to process the query.

Find the best execution plan for the following cases:
1) The only index is on Major and it is an unclustered 2-level B+ tree.
2) The only index is a composite index on Dept,Major,Sname and it is a clustered B+ tree with 2 levels.
3) The only index is a composite index on Dept,Sname,Major and it is a clustered B+ tree with 3 levels.

Please discuss the different options and their I/O costs for processing the query in each of the three cases. Hints: You need to specify any assumptions you make while answering the above questions.

4. PHYSICAL STORAGE STRUCTURES AND PERFORMANCE

Two common index structures for databases are hash tables and B-trees.

1) Briefly describe the basic differences between these two index structures. When might you prefer a hash table over a B-tree? When might you prefer a B-tree over a hash table?
2) If we have a frequently used B-tree, we may decide to cache some B-tree blocks in memory to avoid I/Os. However, the commonly used LRU cache replacement algorithm may not be best for managing a cache of B-tree blocks in some situations. Why not? Explain how we might modify pure LRU to better handle these situations.
3) If a B-tree is used as a secondary index, it might contain duplicate values. Two common ways of handling duplicates are:
a) Store duplicate values in the B-tree itself, one per data pointer, or
b) Create "buckets" of pointers, with each bucket containing the set of pointers that share the same indexed value. Store each duplicate value in the B-tree only once, but point to the bucket of pointers rather than directly to the data.
4) Briefly compare and contrast these two approaches. When might we prefer option a), and when might we prefer option b)?

5. CONCURRENCY CONTROL

1) Define informally what a recoverable schedule is. What is the main property that must hold?
2) If two-phase locking is used, when can write locks be released in a recoverable schedule? When is the earliest time that a read lock can be released? Please explain.
3) Let the memory-commit of a transaction T be the point when T ’s commit record is written to the in-memory log buffer. Let the disk-commit of T be the point when T ’s log record is flushed to disk. With so-called group commit, T ’s write locks are released after the in-memory commit. The client that submitted T is not notified of its commit until after the disk commits. Explain why a group commit system does not generate recoverable schedules. Explain why this is not a problem.
4) Describe 1-2 examples of Internet transactions that you know. Explain why the concurrency control mechanisms widely adopted in relational database systems are not directly applicable for Internet transactions today. Discuss the potential mechanisms that you think could be effective for maintaining the transactional consistency of Internet transactions.

6. DATABASE RECOVERY

Consider a client-server database system that follows a data-shipping approach. That is, the server maintains the stable database and each client places the data pages fetched from the server in the client's buffer and performs the desired modifications to the data. The data page is the basic unit of exchange between the server and the client.  The server also has its own buffer.  There is a mechanism for concurrency control but you do not need to consider it.  

For this question describe a redo-only recovery scheme with write-ahead logging for our client-server database. Be sure to justify any design decisions based on the criteria of efficiency and correctness. The following are a few questions to consider in your answer.

∑ Where is the log kept?
∑ Should you have a global log and a local log?  
∑ What is done when an application on a client wants to commit?   
∑ What is done when an application on a client aborts?  
∑ What is done to recover from a server crash?

Be sure to specify any reasonable assumptions you make.

7. DISTRIBUTED DATABASES  

Suppose there is a relation R that is accessed from 3 sites in our network.
The ith site (1 <= i <= 3) issues 2i queries and i updates per second. The cost of executing a query if there is a copy of R at the site issuing the query is C, while if there is no copy there, then the query must be sent to some remote site and the cost is 10C.  The cost of executing an update is D for the copy of R at the issuing site and 10D for every copy of R that is not at the issuing site.  We can assume that C < D.  
1) If we want to minimize costs, which site or sites should store a copy of relation R?. Justify your answer and  explain how you arrived at it, i.e., give a brief overview of the algorithm you used.
2) If we increase the number of sites to 1000, would you still want to use your algorithm?

8. DATABASE TECHNOLOGIES AND APPLICATIONS

Consider some current research areas in databases: (a) dealing with streaming data (b) incorporating sampling and other features for efficient processing (c) incorporating mobility and location oriented processing (d) incorporating aggregation and mining  functionality.

1) Discuss the current state of the art in database and related technology products with respect to the above features. Mention at least some products that you know with the functionalities mentioned.
2) Do you believe that future DBMSs should have all of the above features- give some pro and con arguments.
3) Discuss briefly what architectural and functional changes would be needed to current RDBMSs and what challenges exist to “productize” the research in above areas.