Project 3: WWW Cache Replacement Policies

Introduction

Caches help with WWW requests just as they do with memory: whenever a WWW browser requests a web page, it can look first in the cache before retrieving the page from the Internet at large. Some caches are better than others; a poor cache will rarely have the requested page in it, while a good cache will frequently have the requested page.

In this project, you will implement different replacement policies in the Squid caching HTTP proxy, and then you will measure how well the different policies work on your own web browsing load.

The goals of this project are:

Requirements

You may work in groups of two.

Submit the following items:

NOTE: There will be about 5 points allocated for extra credit. Don't be sad if you get a 95, but if you want a full 100, you might do things like:

Be sure that your report describes anything extra you have done!

Programming Information

Compiling and Using Squid

Download Squid 2.5.STABLE5 from http://www.squid-cache.org. Extract it somewhere, and then make a separate copy of the directory to work in. (You will need the original version later in order to produce diffs when you submit your work.)

Compile and install Squid with the following sequence of commands:

To remove Squid later, either run make uninstall or simply run rm -rf ~/squid. Feel free to install into a directory other than ~/squid; the rest of this page is written as if you have installed Squid into ~/squid, so adjust the instructions accordingly.

You will need to configure Squid a little. Edit ~/squid/etc/squid.conf and make the following changes:

Replacement Strategies

A cache of any kind must have a replacement strategy. This is the strategy that is used whenever a new item is added to the cache, to decide which old item to remove. That is, the new item replaces some other object in the cache. Some common strategies are:

For this project, you are to implement three policies:

  1. first-in first-out -- remove the item which has been in the cache the longest.
  2. uniform random -- remove an item at random, where each item is given equal probability.
  3. LRU-weighted random -- remove an item at random, but give higher probabilities to items that a pure LRU (least recently used) strategy would prefer to remove.

You do not need to be mathematically exact in the last model; so long as some preference is given to less-recently-used items, that is fine.
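
As one possible reading of "some preference," the sketch below picks a victim with probability proportional to its position in LRU order. It is written in Python for illustration only and is not Squid code; the function name pick_lru_weighted_victim and the linear weighting are assumptions, and any weighting that favors less-recently-used items would satisfy the requirement equally well.

```python
import random


def pick_lru_weighted_victim(lru_order, rng=random):
    """Pick an item to evict from lru_order (least recently used first).

    The item at index i gets weight n - i, so the least recently used
    item is n times as likely to be picked as the most recently used one.
    This linear weighting is an illustrative assumption.
    """
    n = len(lru_order)
    if n == 0:
        raise ValueError("cannot pick a victim from an empty cache")
    weights = [n - i for i in range(n)]
    # random.choices draws with replacement according to the weights.
    return rng.choices(lru_order, weights=weights, k=1)[0]
```

Passing a seeded random.Random instance as rng makes runs repeatable, which helps when comparing policies on the same recorded load.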

Do not panic over the number of strategies! Once you get one of them to work, the others should be variations.
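
To illustrate why the policies can be variations on one another, here is a minimal Python sketch of a toy cache in which only the victim-selection step differs between first-in first-out and uniform random. It is not Squid code: the class TinyCache and its methods are hypothetical names, and a real Squid policy would live in C under src/repl.

```python
import random
from collections import OrderedDict


class TinyCache:
    """Toy fixed-capacity cache; only _choose_victim depends on the policy."""

    def __init__(self, capacity, policy, seed=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        if policy not in ("fifo", "random"):
            raise ValueError("policy must be 'fifo' or 'random'")
        self.capacity = capacity
        self.policy = policy
        self.items = OrderedDict()  # insertion order is arrival order
        self.rng = random.Random(seed)
        self.hits = 0
        self.misses = 0

    def _choose_victim(self):
        if self.policy == "fifo":
            # The first key in insertion order has been cached the longest.
            return next(iter(self.items))
        # Uniform random: every cached key is equally likely.
        return self.rng.choice(list(self.items))

    def request(self, key):
        """Return True on a hit, False on a miss (the key is then cached)."""
        if key in self.items:
            self.hits += 1  # neither policy reorders items on a hit
            return True
        self.misses += 1
        if len(self.items) >= self.capacity:
            del self.items[self._choose_victim()]
        self.items[key] = True
        return False
```

Feeding the same sequence of URLs to two instances with different policies and comparing their hits counters gives a quick sanity check before measuring the real Squid build.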

Implementing Replacement Strategies

To gain an overall understanding of the Squid code structure, first read the Squid Programmers Guide. Squid uses its own, customized file systems for performance purposes. The source code for each type of storage is in "src/fs/". You need to understand the following data structures and functions to begin work.

Feel free to implement your alternative replacement policies by modifying the code in src/repl/lru. However, be sure to add enough #ifdef directives that Squid can be compiled with each of the three replacement strategies you need to support. For extra credit, implement your replacement strategies so that they can be selected in squid.conf without a recompile.

Recording and Playing Back a Web Load

To record a web load, you should start up Squid and then configure your browser to use it. To do this, go to the "proxy" settings of your web browser and, for HTTP, tell it to use host "localhost" and port "3128". If your tool wants a URL for the proxy, then specify http://localhost:3128.

Additionally, look through your web browser's settings, disable its disk cache, and give it only a small memory cache. That way, the browser will send nearly all of its requests to your Squid server instead of answering them from its own cache.

Once you are set up like this, all of your web browsing should go through Squid. If you run "du -s ~/squid/var/cache", you will see the cache growing larger over time as you request more web pages. If you run "ls -l ~/squid/var/logs", you will see the log files growing.

After a day or so, you should have accumulated a thousand or more hits. (If not, then do some random web browsing until you do reach at least a few thousand hits!) Look through ~/squid/var/logs/access.log to see the URLs you have requested.

Part of this project is to convert the list of requests in access.log into a script that will repeat that series of requests. You may limit yourself to GET requests if you wish.

The details of your script are up to you. However, do look into the wget utility, which should be useful. Also, if you use wget, be sure to set the http_proxy environment variable to point to your Squid proxy.
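
One way to approach the conversion is sketched below in Python: it pulls the GET URLs out of access.log lines and emits a shell script of wget commands that go through the proxy. This is only a starting point under stated assumptions: the function names are hypothetical, the field positions assume the log layout of the sample line shown under "Interpreting the Log Files", and the proxy address assumes the localhost:3128 setup described above.

```python
import shlex


def get_urls(log_lines):
    """Return the URLs of GET requests, in the order they were logged."""
    urls = []
    for line in log_lines:
        fields = line.split()
        # Assumed layout: the method is the 6th field and the URL the 7th.
        if len(fields) >= 7 and fields[5] == "GET":
            urls.append(fields[6])
    return urls


def build_replay_script(urls, proxy="http://localhost:3128"):
    """Return the text of a shell script that re-requests each URL via wget."""
    lines = ["#!/bin/sh", "export http_proxy=" + shlex.quote(proxy)]
    for url in urls:
        # -q silences progress output; -O /dev/null discards the page body.
        # shlex.quote protects characters such as '&' and '?' from the shell.
        lines.append("wget -q -O /dev/null " + shlex.quote(url))
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file and running it with sh replays the load; keeping the URLs in log order matters because a replacement policy's hit rate depends on the request sequence.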

Interpreting the Log Files

The access.log file includes both what requests have been made to Squid and how those requests were handled. Each line includes one request. A typical line looks like this:

1079897254.896    569 127.0.0.1 TCP_MISS/200 33254 GET http://www.gatech.edu/ - DIRECT/130.207.165.120 text/html

This is a GET-style request for the page http://www.gatech.edu/. It was handled as a TCP_MISS, which means that the file was not in the cache and had to be downloaded.
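
The sample line can be split into named fields with a few lines of Python, as in the sketch below. The field names are labels chosen here for illustration, following the usual reading of Squid's native log format (timestamp, elapsed milliseconds, client address, result code and HTTP status, size in bytes, method, URL, ident, hierarchy information, content type); check them against your own log before relying on them.

```python
from typing import NamedTuple


class AccessEntry(NamedTuple):
    timestamp: float   # seconds since the Unix epoch
    elapsed_ms: int    # time Squid spent on the request
    client: str        # address of the requesting client
    result_code: str   # e.g. TCP_MISS or TCP_HIT
    status: int        # HTTP status code sent to the client
    size: int          # bytes delivered to the client
    method: str
    url: str
    ident: str
    hierarchy: str     # e.g. DIRECT/130.207.165.120
    content_type: str


def parse_access_line(line):
    """Parse one whitespace-separated access.log line into an AccessEntry."""
    f = line.split()
    if len(f) < 10:
        raise ValueError("expected at least 10 fields, got %d" % len(f))
    result_code, status = f[3].split("/", 1)
    return AccessEntry(float(f[0]), int(f[1]), f[2], result_code,
                       int(status), int(f[4]), f[5], f[6], f[7], f[8], f[9])
```

For the sample line, the result_code field is "TCP_MISS" and the url field is "http://www.gatech.edu/".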

There is software around to interpret Squid log files. Feel free to use it.
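
If you would rather compute a basic number yourself, a hit ratio can be derived directly from the result codes. The Python sketch below is a simplification, not a replacement for a full log analyzer: it assumes the result code is the fourth whitespace-separated field, as in the sample line above, and it counts any code containing "HIT" (such as TCP_HIT or TCP_MEM_HIT) as a hit. The function name hit_ratio is hypothetical.

```python
def hit_ratio(log_lines):
    """Return the fraction of logged requests whose result code is a hit."""
    hits = 0
    total = 0
    for line in log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        result_code = fields[3].split("/", 1)[0]
        total += 1
        # Simplifying assumption: any code containing "HIT" is a cache hit.
        if "HIT" in result_code:
            hits += 1
    return hits / total if total else 0.0
```

Running this on the access.log produced by each replayed load gives one comparable figure per replacement policy.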

Using diff

Please do not submit an entire tarball of your modified Squid. Instead, keep a copy of the original Squid archive separate from your hacked version, and generate a diff file with a command like this:

diff -ru squid-2.5.STABLE5 squid-hacked

Browse through your diff file before turning it in, to make sure that it includes all of the changes you have made.

CS 6210