Database Qualifier
Spring 2004
Please solve 7 of the following 8 problems.
1. DATA MODELS
1) It is said that while the relational model does a great job in terms
of formalization and preciseness, it lacks conveying semantics of data.
∑ What are the reasons for this shortcoming (in other words , what are
the inherent nature of the relational model that precludes the
incorporation of additional semantics?).
∑ What do current relational systems do in terms of letting the
application developer “add further behavioral properties to data?
– support your answer with some examples.
2) (b) You are given a relational DBMS like Oracle or DB2. You have
been asked to design an interface/mapping tool that allows users to
view the data as ER and EER diagrams on top of this RDBMS. This tool
will present the underlying data in the form of ER/EER schemas and will
enable an an ER model-based query language. Under this scenario, answer
the following-
∑ Discuss what type of information you will need the dba/db designers
to supply about the underlying tables so that they can be “reverse
engineered’ into ER and EER schemas.
∑ Comment on some features of the query language you would design for
this tool. Mention how it would differ from the standard
relational language like SQL.
2. NORMALIZATION
Given the following relation:
Banker ( BranchName, CustomerName, BankerName )
With these functional dependencies:
BankerName Æ BranchName
CustomerName, BranchName Æ BankerName
The semantics is that each banker works is a particular branch, and
each customer may have a personal banker assigned in each branch.
The Banker relation is one of those unpleasant relations where we need
to decide whether to keep it in 3rd Normal Form (3NF) or to decompose
it into two Boyce-Codd Normal Form (BCNF) relations with dependency
loss.
1) Please show the BCNF decomposition.
Assume that the bank is now developing an application to support the
hiring of a huge number of new bankers to work in branches of the
bank. None of these new bankers will be assigned as personal
bankers for any customers.
2) Discuss the cost of the join between the two BCNF relations needed
to make sure the lost dependency is enforced.
3) Inspired by 2), consider insertions, deletions, and updates in the
two BCNF tables. Decide for each case when we need to worry about
the lost dependency and when we do not need to worry about it.
4) Consider if it is possible to use assertions in SQL to enforce the
functional dependencies in the two BCNF relations. Show examples,
if possible.
3. QUERY PROCESSING
Consider the following query:
SELECT DISTINCT Sname
FROM Students
WHERE Major = 'Database' AND Dept = 'Management"
Assume the following:
10% of students are choosing database as their major,
5% of students with database major are Management students,
20% of students are from the management department, there are 10
departments, the Students relation has 3,000 pages with 10 tuples per
page, there is a 51 page buffer that can be used to process the query.
Find the best execution plan for the following cases:
1) The only index is on Major and it is an unclustered 2-level B+ tree.
2) The only index is a composite index on Dept,Major,Sname and it is a
clustered B+ tree with 2 levels.
3) The only index is a composite index on Dept,Sname,Major and it is a
clustered B+ tree with 3 levels.
Please discuss the different options and their I/O costs for processing
the query in each of the three cases. Hints: You need to specify any
assumptions you make while answering the above questions.
4. PHYSICAL STORAGE STRUCTURES AND PERFORMANCE
Two common index structures for databases are hash tables and B-trees.
1) Briefly describe the basic differences between these two index
structures. When might you prefer a hash table over a B-tree? When
might you prefer a B-tree over a hash table?
2) If we have a frequently used B-tree, we may decide to cache some
B-tree blocks in memory to avoid I/Os. However, the commonly used LRU
cache replacement algorithm may not be best for managing a cache of
B-tree blocks in some situations. Why not? Explain how we might modify
pure LRU to better handle these situations.
3) If a B-tree is used as a secondary index, it might contain duplicate
values. Two common ways of handling duplicates are:
a) Store duplicate values in the B-tree itself, one per data pointer, or
b) Create "buckets" of pointers, with each bucket containing the set of
pointers that share the same indexed value. Store each duplicate value
in the B-tree only once, but point to the bucket of pointers rather
than directly to the data.
4) Briefly compare and contrast these two approaches. When might we
prefer option a), and when might we prefer option b)?
5. CONCURRENCY CONTROL
1) Define informally what a recoverable schedule is. What is the main
property that must hold?
2) If two-phase locking is used, when can write locks be released in a
recoverable schedule? When is the earliest time that a read lock can be
released? Please explain.
3) Let the memory-commit of a transaction T be the point when T ’s
commit record is written to the in-memory log buffer. Let the
disk-commit of T be the point when T ’s log record is flushed to disk.
With so-called group commit, T ’s write locks are released after the
in-memory commit. The client that submitted T is not notified of its
commit until after the disk commits. Explain why a group commit system
does not generate recoverable schedules. Explain why this is not a
problem.
4) Describe 1-2 examples of Internet transactions that you know.
Explain why the concurrency control mechanisms widely adopted in
relational database systems are not directly applicable for Internet
transactions today. Discuss the potential mechanisms that you think
could be effective for maintaining the transactional consistency of
Internet transactions.
6. DATABASE RECOVERY
Consider a client-server database system that follows a data-shipping
approach. That is, the server maintains the stable database and each
client places the data pages fetched from the server in the client's
buffer and performs the desired modifications to the data. The data
page is the basic unit of exchange between the server and the
client. The server also has its own buffer. There is a
mechanism for concurrency control but you do not need to consider it.
For this question describe a redo-only recovery scheme with write-ahead
logging for our client-server database. Be sure to justify any design
decisions based on the criteria of efficiency and correctness. The
following are a few questions to consider in your answer.
∑ Where is the log kept?
∑ Should you have a global log and a local log?
∑ What is done when an application on a client wants to commit?
∑ What is done when an application on a client aborts?
∑ What is done to recover from a server crash?
Be sure to specify any reasonable assumptions you make.
7. DISTRIBUTED DATABASES
Suppose there is a relation R that is accessed from 3 sites in our
network.
The ith site (1 <= i <= 3) issues 2i queries and i updates per
second. The cost of executing a query if there is a copy of R at the
site issuing the query is C, while if there is no copy there, then the
query must be sent to some remote site and the cost is 10C. The
cost of executing an update is D for the copy of R at the issuing site
and 10D for every copy of R that is not at the issuing site. We
can assume that C < D.
1) If we want to minimize costs, which site or sites should store a
copy of relation R?. Justify your answer and explain how you
arrived at it, i.e., give a brief overview of the algorithm you used.
2) If we increase the number of sites to 1000, would you still want to
use your algorithm?
8. DATABASE TECHNOLOGIES AND APPLICATIONS
Consider some current research areas in databases: (a) dealing with
streaming data (b) incorporating sampling and other features for
efficient processing (c) incorporating mobility and location oriented
processing (d) incorporating aggregation and mining
functionality.
1) Discuss the current state of the art in database and related
technology products with respect to the above features. Mention at
least some products that you know with the functionalities mentioned.
2) Do you believe that future DBMSs should have all of the above
features- give some pro and con arguments.
3) Discuss briefly what architectural and functional changes would be
needed to current RDBMSs and what challenges exist to “productize” the
research in above areas.