Spring 2003 Database Qualifier Exam
Answer any 7 questions. If you have to make any assumptions, state them clearly
and proceed.
Question 1 - (Data Models, Mapping)
You are given a relational DBMS like Oracle or DB2. You have been asked to
design an interface/mapping tool that allows users to view the data as ER
and EER diagrams on top of this RDBMS. This tool will present the underlying
data in the form of ER/EER schemas and will enable an ER model-based query
language for specification of queries (could be like SQL or like QBE). Under
this scenario, answer the following-
(a) Discuss what type of information you will need the dba/db designers to
supply about the underlying tables so that they can be “reverse engineered’
into ER schemas. Give an example of a difficult reverse mapping situation
and why it is difficult to “automatically” map it into ER without user intervention.
Discuss what additional information you will need from humans to enable proper
reverse engineering.
(b) Propose a design of a catalog that contains the relevant mapping details
for the above reverse engineering.
(c) Show an example of an ER-schema-view-based query involving select/project/join
and one that would require some form of aggregation and say narratively how
it gets mapped onto the relational schema. (invent any notation for ER/EER-query
specification that is similar to SQL or QBE on the fly! Do not worry about
exact syntax).
Question 2 - (Relational Database Design)
Given a relational database with relations, R1, R2, …, Rn, and an associated
set of Functional Dependencies, F1, F2, …, Fn. All the relations are in Boyce-Codd
Normal Form (BCNF). Now, consider each of the following relational
algebra operators in turn: projection, selection, Cartesian product, natural
join, outer join, difference, intersection, union, and outer union. For
each operator, determine whether it is possible to create a query result,
which is not in BCNF. Justify your answers, i.e., if it is not possible,
then explain why not, and if it is possible, then give an example showing
that the result is not in BCNF. (Caution: remember that being in a
normal form is an intentional property, not an extentional one.)
Question 3 - (Query Processing and Optimization)
Consider a relation R(A,B) stored using a partitioned hash organization.
The hash table contains 1024 blocks (buckets), and the tuples (records) are
stored in these blocks. To store tuple (a,b), we apply hash function h1 for
a, obtaining X bits. Then we apply hash function h2 on b, obtaining 10-X
bits. The bits are concatenated, obtaining a 10 bit hash that identifies
the block where (a,b) is placed. Suppose that 20% of the queries on R are
of the form Q1: SELECT * from R where A=a, and 80% of the queries are of
the form Q2: SELECT * from R where B=b where a and b are constants.
a) How many blocks are accessed by the Q1 queries? How many by the Q2 queries?
(hint: your answer will be a function of X.)
b) Write an expression that gives the expected number of blocks that must
be accessed (on average).
c) What X value minimizes the expected number of IOs?
d) Can this hash support a query Q3: SELECT * from R where A>a1 AND A<a2?
Why or why not? Explain your answer.
e) Describe a real world application that will benefit from partitioned hash
organization.
Question 4 - (Physical Storage Structures and Performance)
You are given a file of records for telephone customers. The file is sorted
on city, last name, and first name. Each record is 200 bytes and there
are 90 million records. You are also given a Cheetah 36 disk from Seagate
(see p.120 in the book “Fundamentals of Database Systems” by Elmasri/Navathe).
You are to write this sorted file to the disk as fast as possible.
Your initial and correct reaction probably is that the file must be written
sequentially, but exactly what does “sequentially mean? I.e. when you
have written the first full cylinder of data, exactly where do you write
next? You need to consider the real physical issue of writing to disk cylinder
by cylinder.
1) Explain precisely how the file is written to disk using the physical characteristics
of the Cheetah 36 to determine you steps.
2) Calculate how much time it will take to write the file.
Question 5 - (Concurrency Control)
Consider three transactions:
T1: r(a) r(b) w(a) w(b) w(c);
T2: r(c) r(b) w(b); and
T3: r(c) r(a) w(c).
Each of the following five questions asks for a schedule of a particular
type, for transactions T1, T2, T3 (and no others):
(a) Give a conflict-serializable but not recoverable schedule generated by
a locking-based scheduler (pessimistic concurrency control)
(b) Give a serializable, but not recoverable schedule generated by a validation
schedule (i.e., using an optimistic concurrency control scheduler)
(c) Give a seralizable and recoverable schedule that does not avoid cascading
rollback (ACR), generated by a validation scheduler.
(d) Give a seralizable, recoverable, ACR, and strict schedule, but not serial,
generated by a validation scheduler.
(e) Give a scheduler that cannot be generated by a validation-based scheduler,
but can be generated by a strict locking based scheduler.
You may represent your schedules in a 2-D grid with time axis flowing down
the vertical axis and a separate column for each data value. Please show
the validation or commit properties. There may be the case that there exists
no schedule that meets the desired properties. If so, you may write “NO SCHEDULE
EXISTS”. You are required to explain your answer.
Hints:
(1) A schedule S is recoverable if each transaction commits only after all
transactions from which it reads have committed. A schedule S avoids cascading
rollback (ACR) if each transaction may read only those values written by
committed transactions. A schedule S is strict if each transaction may read
and write only items previously written by committed transactions.
(2) How does the 2D grid look like? Consider the locking based schedule in
question (a) and (e), your grid will have four columns: a, b, c, and commit
point. The number of rows in this gird will be 14, namely the total number
of actions: 6 for T1 (add commit), 4 for T2 (add commit) and 4 for T2 (add
commit). If we assume that T1 starts first, then T2 starts. The first row
of this grid is r1(a) under column a and nothing under the rest of the three
columns. The second row is r2(c) under column c and nothing under the other
columns, and so on. Similarly, the grid for validation schedule (question
b, c, d) will have four columns: a, b, c, and validation point, and the number
of rows will be 14 for validation schedule, namely the total number of actions:
6 for T1 (add validation point), 4 for T2 (add validation point) and 4 for
T2 (add validation point).
(3) You may also use T1, T2, and T3 as the columns of the 2D grid if you
are more used to that format. Make sure you include commit point or validation
point in your schedule for each of the three transactions.
Question 6 - (Database Recovery)
Suppose the scheduler for a database system uses a strict 2-phase locking
scheme for concurrency control. Also, the scheduler uses the tuple as the
basic unit of locking. Hence, a page may contain 2 (or more) distinct
tuples that are write-locked by different transactions. As for I/O
operations, the basic unit of reading and writing is the page. We will assume
that write-ahead logging is used by the recovery manager and in-place updating
is done. In general the recovery manager has four possible schemes
for handling recovery:
∑ Steal/Force
∑ Steal/No-Force
∑ No-Steal/Force
∑ No-Steal/No-Force.
The first component, either Steal or No-Steal, refers to whether the Buffer
Manager can write a page that has been updated by a non-committed transaction
to disk. If this can happen, then it is a Steal policy. If not,
then it is a No-Steal policy. The second component, Force or No-Force, refers
to whether the recovery manager writes all pages updated by a committed transaction
to disk when the transaction commits. If updated pages are written
at commit time, then it is called Force, otherwise it is No-Force.
(a) Explain whether or not each of the four schemes can be applied with our
scheduler assumptions.
(b) For those schemes that are appropriate for our scheduler, explain how
the recovery manager would handle the situation where a transaction decides
to abort itself.
Question 7 - (Distributed Databases)
Suppose we have 4 sites (New York, Chicago, Los Angeles and Miami)
in a distributed database environment. We have one relation, the Year-To-Date-Sales
table with columns Salesperson-ID, Region, Date, Sales and Commission.
The key for the table is the combination of Salesperson-ID and Date. So a
given row represents the total sales of a particular salesperson for a given
day of the year. The functional dependencies for this table all involve the
key on the left-hand side of the dependency except for the following functional
dependency: Salesperson-ID Region. To improve performance the
table is horizontally fragmented into 3 parts as follows:
SELECT * FROM Year-To-Date-Sales WHERE Region = 'east' for New
York site.
SELECT * FROM Year-To-Date-Sales WHERE Region = 'central' for Chicago
site
SELECT * FROM Year-To-Date-Sales WHERE Region = 'west' for Los Angeles
site
We can assume that there are only 3 regions, as specified in the above SQL
queries. The table is fragmented as it is since each site needs to
produce the sum of sales for their respective region. This would only involve
local processing. The Miami site is not responsible for storing the
table. You can also assume a uniform data distribution.
(a) Explain how the following query issued at the Miami site, could be efficiently
processed:
SELECT SUM (Sales), Date
FROM Year-To-Date-Sales
GROUP BY Date.
(b) Suppose we have another query which can be issued at any site as
SELECT SUM (Sales)
FROM Year-To-Date-Sales
WHERE Salesperson-ID =? AND Date =?
The question marks (?) represent variables whose values are supplied by the
user at run-time.
To improve query performance, the database system may maintain appropriate
indexes. Discuss the different options and the positive and negative
points as to where the index(es) should be stored and whether the index(es)
themselves should be fragmented.
Question 8 - (Technology and Applications)
A. The database technology has broadened considerably to enable new application
areas in the 90’s and beyond. Consider three technology areas: data mining/OLAP,
bioinformatics, and mobile database management. Pick one of these areas that
you know most about. Discuss
- what type of Data modeling, design, querying advances have been made in
that area
- what DBMS functionality is lacking? What can we expect in the near future?
- List three priorities for research and justify why you consider those research
problems to be important.
B. There is a lot of work currently going on with (a) XML and with (b) the
area of Information Retrieval and Search engines. Pick any one. Write a paragraph
on how this area impacts research in databases, what you see as major applications
that will benefit from this technology and what are its limitations.