Generating Representative Synthetic Workloads


Sponsor

Zack Kurmas
kurmasz@cc.gatech.edu
226e CCB

Area Systems

Background
In order to evaluate a new disk array, we must use some workload. In other words, we must give the disk array something to do. The results of the evaluation depend heavily on the choice of workload.

In order to get the most accurate results, many people use a trace of a production workload. That is, they collect a trace of some disk array's activity and evaluate the new disk array using the collected trace. These traces are hard to collect, however. They tend to be large and difficult to store. In addition, for privacy reasons, people don't like to have their systems traced.

A potential solution is to use a "fake" or random workload. Such a workload, if designed properly, can be used to accurately evaluate the disk array, without the limitations of a real trace. (For a more complete comparison of real and synthetic workloads, see Generating Representative Synthetic Workloads: An Unsolved Problem Gregory R. Ganger. Appears in the Proceedings of the Computer Measurement Group (CMG) Conference, December 1995, pp. 1263-1269.) The problem is that we don't know how to accurately generate a synthetic workload.

Problem
Given some target workload, it is currently very difficult to generate a synthetic (i.e. random) workload that behaves the same way when executed on a disk array. This is because we don't know what properties the two workloads must share in order to have similar behavior.

Here are some examples of properties that workloads must share to have similar behavior:

Locality presents a special challenge because, although we know of many several methods of describing locality, we do not have a good method of generating a synthetic workload with the desired degree of locality. Your job is to design and implement a method of generating a synthetic workload with a specified degree of spatial and/or temporal locality.

I use a measure of spatial locality I call "Proximity". This metric measures four values:

  1. Percentage of I/Os whose location is adjacent to that of the previous I/O (i.e. a run.)
  2. Percentage of I/Os whose location is adjacent to that of at least one of the previous 25 I/Os.
  3. Percentage of I/Os whose location is within 64K of the previous I/O.
  4. Percentage of I/Os whose location is within 64K of at least one of the previous 25 I/Os.

Conte and Hwu proposed two measures of locality: The Inter-reference temporal distance, and the Inter-reference spatial distance. See me for a copy of the paper, if you are interested.

You are also welcome to propose your own measure of locality (spatial or temporal) and write the corresponding generator.

One obvious method is to simply brainstorm an algorithm and you are certainly welcome to take this approach. However, since this problem applies to many properties (not just locality), I would suggest trying to solve the problem using a genetic algorithm or genetic program. That way, your solution could be applied to other problems.

Deliverables
For each algorithm or genetic technique you try, you must provide


Your report must describe at least one successful, or three unsuccessful attempts.