Experimental Verification of the Millibottleneck Theory of Performance Bugs with the WISE Toolkit

TL;DR

In this tutorial, you will experimentally verify the millibottleneck theory of performance bugs using WISE, a newly released toolkit to conduct large-scale system experiments in the cloud. Specifically, you will provision a cloud, run a microservice application benchmark, and study the resulting system log data.

Introduction

Cloud computing is characterized by the on-demand availability of computing resources like CPU, network, and disk over the internet. Public clouds, such as Amazon Web Services (AWS), have enabled customers to instantly scale their IT infrastructures up and down according to present workload needs, freeing them from the complex task of managing costly datacenters that sit underutilized most of the time.

Virtualization technologies play a key role in the field of cloud computing. Public clouds sell computing resources in a “pay-as-you-go” model that can only be profitable under high utilization, with multiple guest systems sharing a single physical host; likewise, for their customers, high utilization is essential to maximize the return on investment (ROI) and is commonly achieved by running multiple server instances in containers that share a single system. However, virtualized systems are more susceptible to resource contention, making the performance of cloud applications less predictable.

Furthermore, the widespread adoption of virtualization technologies spurred the trend of splitting large programs comprising the functionalities of entire applications into many loosely coupled microservices. The inherently large number of RPC calls between microservices makes performance even less predictable: according to the millibottleneck theory of performance bugs, transient resource bottlenecks on the order of milliseconds (called millibottlenecks) can propagate through a distributed system via RPC calls and have their effects amplified, causing severe performance bugs despite their very short lifespan.

As a result, a well-known performance bug of cloud applications — especially of web-facing applications subject to bursty workloads — is the long-tail latency problem, characterized by a small number of requests taking seconds to return while the majority of requests take only a few milliseconds. This is a puzzling problem because requests with long response times (on the order of seconds) start to happen at low CPU utilization levels (around 40%), when none of the hardware resources is anywhere near saturation on average. Further, this problem has been reported to be bad for business: Amazon found that every 100 ms increase in page loading time correlates with roughly a 1% loss in sales; similarly, Google found that an additional 500 ms delay in returning search results could reduce revenues by up to 20%.

Even though companies want to trim the long tails of their response times at the 99.9th percentile and above, studying the performance of cloud applications is challenging. Detecting millibottlenecks is particularly hard due to their very short lifespan: according to the Sampling Theorem, phenomena lasting a fraction of a second are not detectable by observation tools with sampling periods of multiple seconds, which is the typical configuration of most cloud monitoring tools. Moreover, studying the propagation of millibottlenecks requires logging the messages exchanged between servers, with their arrival and departure timestamps, for each request — a level of detail that is not generally found in native server logs.

In this tutorial, you will experimentally verify the millibottleneck theory of performance bugs using WISE, a newly released toolkit to conduct large-scale system experiments in the cloud. Specifically, you will provision a cloud, run a microservice application benchmark, and study the resulting system log data.

Generating a new SSH key pair

For security reasons, generate a new SSH key pair, without a passphrase, to be used exclusively in your experiments:
ssh-keygen -t rsa -b 4096
Also, upload the newly generated SSH public key to your GitHub account (under Settings, SSH and GPG keys, New SSH key).
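Assuming you accepted the default file location (~/.ssh/id_rsa) when generating the key pair, you can print the public key and copy it into the GitHub form:
# Print the newly generated public SSH key
cat ~/.ssh/id_rsa.pub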

Creating a CloudLab Account

Your experiment will run in a cloud instantiated in CloudLab, a public infrastructure to build scientific clouds that provides bare-metal access and total control over provisioned computing resources.

To create your CloudLab account:
Once you submit your request to join project Infosphere, it will wait for approval. If the request is still pending after a couple of days, please contact the responsible TA.

Setting Bash as the Default Linux Shell

Once you gain access to CloudLab, it is time to set Bash as your default Linux shell:
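For reference, if you ever need to change the login shell directly on an individual node instead, the standard Linux way (assuming sudo access, which CloudLab grants on provisioned nodes) is:
# Set Bash as the login shell of the current user on this node
sudo chsh -s /bin/bash $(whoami)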

Instantiating a Cloud in CloudLab

You can now instantiate a cloud:
After approximately 20 minutes, the cloud will be ready to use.

Setting up the Experiment Controller Node

The execution of your experiment will be coordinated from node-0, listed on the experiment page. Specifically, the scripts that install and configure software dependencies, start the servers and monitors, and collect the resulting system log data on all nodes will run from node-0.
To clone the WISETutorial repository on node-0:
# Copy your newly generated private SSH key to node-0
scp -i [path to the SSH private key] [path to the SSH private key] [CloudLab username]@[hostname of node-0]:.ssh/id_rsa
# SSH into node-0
ssh -i [path to the SSH private key] [CloudLab username]@[hostname of node-0]
# Synchronize index files of the apt package manager
sudo apt-get update
# Install Git
sudo apt-get install -y git
# Clone the WISETutorial repository
git clone git@github.com:coc-gatech-newelba/WISETutorial.git
This repository contains all the scripts, client and server software, monitors, and parsers needed to run your experiment.
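If the git clone step fails with a permission error, you can verify that GitHub accepts the SSH key you copied to node-0 (this is a standard GitHub connectivity check, not part of the WISE scripts):
# Test SSH authentication against GitHub; a greeting with your username means it worked
ssh -T git@github.com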

Configuring the Microblog Application Benchmark

Your experiment will run a microblog application benchmark that is part of WISE. This benchmark generates HTTP requests to a Twitter-like web application, simulating user interactions such as creating and endorsing posts, subscribing to other users, viewing a personal inbox, and viewing recent posts. The HTTP requests arrive at an Apache server and are processed by a Python Flask application, which in turn sends requests to five other Thrift microservices.
These microservices store their state in a PostgreSQL database. Also, workers asynchronously process jobs offloaded to the queue.

Workload Configuration

A workload comprises one or more groups of clients that simulate user interactions. The duration of each user interaction (following a uniform or Gaussian distribution) and the next interaction to be performed (following a stochastic model) are configured in the file WISETutorial/experiment/conf/session.yml. The groups of clients are configured in the file WISETutorial/experiment/conf/workload.yml, which has parameters to control the number of clients simulating user requests and the duration of these client sessions, among others. You should take some time to understand this configuration, but do not update it for your first experiment.

Server Configuration

To configure the servers, edit the file WISETutorial/experiment/conf/config.sh, following its instructions to set the parameters passed to the benchmark execution scripts.
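Purely as an illustration of the kind of parameters involved (the variable names below are hypothetical, not the actual ones in config.sh, so always follow the instructions inside the real file):
# Hypothetical sketch only; config.sh documents the real parameter names
export DB_MAX_CONNECTIONS=100   # maximum number of concurrent database transactions
export SOMAXCONN=64             # backlog limit of listening sockets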

Running the Microblog Application Benchmark

The benchmark execution script WISETutorial/experiment/scripts/run.sh creates the needed disk partitions, makes filesystems, installs and configures software, initializes servers and monitors, runs the benchmark, and finally collects the resulting system logs. After configuring your experiment, start the benchmark execution:
# Enter the WISETutorial directory
cd WISETutorial
# Run the benchmark
./experiment/scripts/run.sh
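Because the full run takes a long time (approximately 40 minutes, as noted in the next section), you may want to protect it against SSH disconnections; one option, assuming the script requires no interactive input once started, is:
# Run the benchmark detached from the terminal and log its output
nohup ./experiment/scripts/run.sh > run.log 2>&1 &
# Follow the progress
tail -f run.log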

Parsing and Processing the Resulting Log Data

The experiment execution will take approximately 40 minutes. The resulting log files and graphs are compressed into a tarball called results.tar.gz in WISETutorial/experiment. To copy and extract it on your local machine:
# Copy the tarball to your local machine
scp -i [path to the SSH private key] [CloudLab username]@[hostname of node-0]:WISETutorial/experiment/results.tar.gz .
# Uncompress it to a directory called analysis
mkdir analysis
tar -xzf results.tar.gz -C analysis
In the analysis directory, you will find nine tarballs, one per server, each containing that server's log files: resource utilization logs (such as CPU, memory, and disk) and network event logs.
To uncompress the tarballs with the log files of each server:
# Uncompress the log files collected from the client server
mkdir -p analysis/client
tar -xzf analysis/log-client*.tar.gz -C analysis/client
# Uncompress the log files collected from the database server
mkdir -p analysis/db
tar -xzf analysis/log-db*.tar.gz -C analysis/db
# Uncompress the log files collected from the authentication server
mkdir -p analysis/auth
tar -xzf analysis/log-auth*.tar.gz -C analysis/auth
# Uncompress the log files collected from the inbox server
mkdir -p analysis/inbox
tar -xzf analysis/log-inbox*.tar.gz -C analysis/inbox
# Uncompress the log files collected from the microblog server
mkdir -p analysis/microblog
tar -xzf analysis/log-microblog*.tar.gz -C analysis/microblog
# Uncompress the log files collected from the queue server
mkdir -p analysis/queue
tar -xzf analysis/log-queue*.tar.gz -C analysis/queue
# Uncompress the log files collected from the subscription server
mkdir -p analysis/sub
tar -xzf analysis/log-sub*.tar.gz -C analysis/sub
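Equivalently, a short loop extracts all of them at once (this assumes the tarballs sit directly in analysis/, as extracted above):
# Extract every server's tarball into its own subdirectory
for s in client db auth inbox microblog queue sub; do
  mkdir -p analysis/$s
  tar -xzf analysis/log-$s*.tar.gz -C analysis/$s
done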
Take some time to skim through these log files and understand their format and content.

A quick reference on how to parse the resource log files follows:
# Parse CPU utilization logs collected from the database server
python parsers/cpu.py analysis/db/[CPU log file]
# Parse memory utilization logs collected from the database server
python parsers/mem.py analysis/db/[memory log file]
# Parse disk utilization logs collected from the database server
python parsers/disk.py analysis/db/[disk log file]
And to parse the network event log files (the numeric first argument is a TCP port: 80 for the Apache front end, 5432 for PostgreSQL):
# Calculate the number of requests per second
python parsers/requests_per_sec.py 80 analysis/client
# Calculate the response time distribution
python parsers/rt_dist.py 80 analysis/client
# Calculate the point-in-time response time
python parsers/rt_pit.py 80 analysis/client
# Calculate the queue length in the database server
python parsers/queue_length.py 5432 analysis/auth/ analysis/inbox/ \
    analysis/queue/ analysis/sub/ analysis/microblog/
These parsers have Python dependencies that can be satisfied by installing the packages declared in the parsers/requirements.txt file:
# Install the Python dependencies of the parsers
pip install -r parsers/requirements.txt
The output of these parsers is one or more *.data files, each containing a time series in a two-column format: the first column is the number of milliseconds elapsed since the monitor started, and the second column is the metric of interest. In the vis directory, there are corresponding Gnuplot scripts that you can use to plot these time series, but you are encouraged to use your preferred tools (e.g., Matplotlib).
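For a quick look without the provided scripts, a Gnuplot one-liner is enough (the file names cpu.data and cpu.png are illustrative; substitute whichever *.data file a parser produced):
# Plot a two-column .data file to a PNG (illustrative file names)
gnuplot -e "set terminal png size 1000,400; set output 'cpu.png'; set xlabel 'time since monitor start (ms)'; plot 'cpu.data' using 1:2 with lines notitle"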

Analyzing the Resulting Log Data

The following analysis is based on results obtained from an execution of this same experiment, with 150 clients generating a bursty workload that peaks at close to 60 requests per second. All workload parameters were set in WISETutorial/experiment/conf/workload.yml.
As expected, the long-tail latency problem can be observed in the distribution of response times.
The point-in-time response time graph shows the moments when these outliers happen.
Transient spikes in the amount of data written to the database server's disk are an indication of resource contention.
The CPU utilization graph of the database server likewise shows millibottlenecks.
As configured in WISETutorial/experiment/conf/config.sh, the database can process at most 100 transactions concurrently. Also, the SOMAXCONN parameter of the operating system was set to 64; this parameter caps the backlog of a listening socket. Therefore, at any given moment, the database server cannot have accepted and be processing more than 164 connections. When this limit is reached, packet retransmissions happen. The queue length graph of the database server, computed with a time window of 25 ms, shows that this limit is reached many times while all the clients are in full operation.
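If you want to inspect or change this backlog limit yourself on the database node, SOMAXCONN is exposed through sysctl on Linux:
# Show the current listening socket backlog limit
sysctl net.core.somaxconn
# Set it to 64, matching the configuration described above (affects sockets created afterwards)
sudo sysctl -w net.core.somaxconn=64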
To study the requests with very long response times and find their root causes, you can narrow down the search space by correlating them with CPU and I/O millibottlenecks in the database server and with queue overflows in the database server itself or in other servers.
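As a starting point, you could list when the outliers occurred and then inspect the resource graphs around those instants. A minimal sketch, assuming the point-in-time response time parser wrote its output to rt_pit.data (illustrative name) and that the metric column is in milliseconds:
# Print the timestamps (ms since the monitor started) of responses slower than 1 second
awk '$2 > 1000 { print $1 }' rt_pit.data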

Appendix A: Instantiating an OpenStack Cloud

OpenStack is a standard cloud software stack that can manage the creation of virtual machines, among other features, in a private cloud. CloudLab has a specific profile that installs and handles the basic configuration of OpenStack on the provisioned physical hosts. To instantiate your own OpenStack cloud:
CloudLab will then try to allocate the requested resources: one controller node to manage the cloud and three c8220 nodes where the virtual machines will be instantiated. When all the nodes become ready, appearing as green boxes in the topology view of CloudLab’s experiment page, CloudLab will start the installation and configuration of OpenStack on these nodes. At this point, you will receive an email informing you that your OpenStack instance is being set up. After approximately 50 minutes, your OpenStack instance will be ready and you will receive another email informing you that the setup has completed. While you wait, watch this awesome introduction to the basics of OpenStack. Also, take some time to explore its control panel, using the great CloudLab manual as a reference.

Launching Virtual Machines

When OpenStack is ready, launch the virtual machines on which the application benchmark will run:
The specified number of virtual machines (12 by default) will be quickly launched on the provisioned c8220 computing nodes. These virtual machines are named node1, node2, and so on. Virtual machine node1 is launched on computing node cp-1, node2 on cp-2, node3 on cp-3, node4 on cp-1, and so on.

Configuring the Experiment Controller Node in OpenStack

node1 is the experiment controller node that you will access to start running your experiment, so it needs a public IP. To associate a floating IP with node1:
You can now SSH into node1 using its public IP address as user ubuntu. Also, node1 will connect to the other virtual machines via SSH during the experiment, so copy your CloudLab private SSH key to node1:
# Copy your CloudLab private SSH key to node1
scp [path to the private SSH key] ubuntu@[public IP of node1]:.ssh
From node1, you can now SSH into the other virtual machines using their non-public IP addresses. Give it a try!
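For example, picking any non-public IP address shown in the OpenStack control panel (node2 here is just an example, and the ubuntu user is assumed to exist on all virtual machines):
# From node1, connect to another virtual machine
ssh ubuntu@[non-public IP of node2]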

Appendix B: Advanced Workload Configuration

As mentioned before, the groups of clients are configured in the file WISETutorial/experiment/conf/workload.yml. In that file, you can configure the level of burstiness in the workload for each group of clients, as shown below:
- sessionConfig: {{WISEHOME}}/experiment/conf/session.yml
  noConcurrentSessions: 100
  startTime: 0
  endTime: 360
  rampUpDuration: 15
  rampDownDuration: 15
  burstiness:
    - speedUpFactor: 2.0
      startTime: 60
      endTime: 90
    - speedUpFactor: 2.0
      startTime: 150
      endTime: 180
This group of 100 concurrent clients will start generating requests at the beginning of the experiment (startTime: 0) and stop after 6 minutes (endTime: 360). The clients will start sending requests progressively, and it will take 15 seconds for all of them to be in full operation (rampUpDuration: 15). Similarly, they will stop sending requests progressively (rampDownDuration: 15). During the intervals 60-90 and 150-180 seconds, these clients will send requests twice as fast (speedUpFactor: 2.0), creating a burst of requests that can be useful for the study of millibottlenecks.
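To experiment with heavier bursts, you could, for example, add a second group that joins only during part of the run. The snippet below follows the same schema as the example above; the specific values are arbitrary and only meant as a starting point:
- sessionConfig: {{WISEHOME}}/experiment/conf/session.yml
  noConcurrentSessions: 50
  startTime: 120
  endTime: 240
  rampUpDuration: 15
  rampDownDuration: 15
  burstiness:
    - speedUpFactor: 3.0
      startTime: 150
      endTime: 180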