Project 1: Scheduling with GTThreads

Due Thursday, September 18th at 11:59 pm.

There are no exceptions to the due date unless you have consent from the instructor.

Important clarifications for the priority scheduler have been made. You will need to implement priority scheduling on multiple cores.

I apologize profusely for the foul up; that was my fault. -Scott

There are also additional tips and hints that have been added to the bottom of this page in the Tips section. Additional hints may also be posted to the git.cc.class.cs6210 newsgroup. There are instructions on how to access news.gatech.edu at OIT's newsgroup site.

Goal

The goal of this project is to understand and implement simple coscheduling and priority policies along with processor affinity. To do so you must modify the provided GTThreads package, which is a simple user level thread scheduler.

Project Requirements

This project minimally requires the following in order to receive full credit:

Each of these is explained below. You may also be asked to provide a demo to the TA showing your results.

The project must be emailed to the TA, Scott McManus (smcmanus@cc.gatech.edu), by September 15th at 11:59 pm (i.e., midnight). If you email multiple submissions, then the last one submitted will be graded.

You may choose another threading package to modify other than GTThreads, but it must be approved ahead of time. That said, code that implements priority-based scheduling, processor affinity, or coscheduling is off-limits.

The code should be written to run on the warp cluster; there are a couple of dependencies in the code on Linux which have not been tested on other operating systems. If you have issues with this then contact the TA, Scott (smcmanus@cc), and we can try to work something out.

If you have any questions, then you are encouraged to consult with the TA. The earlier you ask, the better - others may have the same questions or issues, too.

Priority-Based Scheduling

You will first need to implement priority-based scheduling at the thread level. The priority scheduling should run in O(1) time, which means that switching threads should not involve searching the run queue. This also means your scheduler may order threads in the run queue when the thread is created or when the thread's priority is changed.

It should be possible to change the priority of your thread via the user-level library. If the thread is no longer the highest-priority thread in the run queue, then it should be returned to the run queue and the highest-priority thread should be run.

Next, you must extend your priority-based scheduling for a thread group. The API for setting the thread group is open, but at a minimum it must be possible to set the thread group at thread creation. If no thread group is specified, then all threads must execute in the same thread group. Mapping the thread to a thread group is important for coscheduling support, which is described later.

Finally, your priority scheduling algorithm must be able to run on multiple cores. Your implementation should always execute the highest-priority runnable threads on the cores. Since each thread group has the same priority among its members, you should be able to choose any thread from among the group to schedule. There are no requirements for how you choose a thread from within a thread group to run, but you should include information on how you chose the thread to run in your write up.

Full credit will be awarded if both per-thread priority-based scheduling and thread-group priority-based scheduling are implemented on both single and multiple cores.

Clarifications
The highest priority runnable thread must always be run. If an implementation creates a thread with the highest priority but there is not an immediate context switch to that thread, then a large number of points will be deducted from the priority scheduling grade. When two threads have the same priority (as in the case when two threads are in the same thread group), then the action taken is up to you. This is very important in some applications, so although there is flexibility in your API there is no flexibility on this issue.

When implementing thread group priorities, you will not need to create a mix of priorities for the threads and the thread groups. Once a thread is assigned to the thread group, the thread will take on the priority of the thread group. You also do not need to worry about common problems such as priority inversion with the mutexes.

Sample API for Priority-Based Scheduling

You have complete freedom in defining the API for setting the priority and thread group. One possibility is to do something similar to PThreads' sched_param, pthread_attr_getschedparam, and pthread_attr_setschedparam where the priority can be set. Keep in mind that the provided GTThreads package does not use thread attributes, so you will have to build in support for this kind of mechanism if you want to use it.

As a different example, the API for setting the thread group and priority could look something like this:

#define GTTHREAD_PRIO_MAX (255)
#define GTTHREAD_PRIO_MIN (1)
#define GTTHREAD_PRIO_DEFAULT (100)

typedef int gttgroup_t;

int gtthread_thread_set_prio(
  gtthread_t thread_id,
  int priority );

int gtthread_threadgroup_create(
  gttgroup_t* new_thread_group,
  int priority );

int gtthread_create (
  gtthread_t *thread_id,
  gttgroup_t thread_group,
  int start_function (void*),
  void* arg );

int gtthread_tgroup_set_prio( gttgroup_t thread_group, int priority );

In this example API, gtthread_thread_set_prio will set the priority for the given thread. The priority has a maximum value of 255, a minimum value of 1, and a default of 100. This gives some flexibility in the case that a priority of 0 is supplied to other functions.

The example gtthread_threadgroup_create function creates a thread group with the specified priority. If the priority argument is 0, then one possibility is to use the default priority (in this example, 100). The thread group can start at 1, leaving smaller thread groups to indicate other conditions. A default thread group of 0 with a default priority can be created.

In this example, the gtthread_create function now takes the thread group as an argument. If the thread group is 0, then this could indicate that the default thread group is to be used. (Note that this may help when you are first implementing the per-thread priority scheduling, as your scheduler can simply ignore the parameter or use the initial thread group of 0.)

Finally, in this example the thread group's priority can be set using the gtthread_tgroup_set_prio.

Note that all of this functionality may require that you change other functions in GTThreads, and you have complete freedom in how you choose to do this.

Processor Affinity

The next step is to implement processor affinity. Processor affinity refers to the preference of scheduling a thread on a single processor. When a thread becomes runnable, it can either be run on the same processor as before or it can be run on a different processor. Running on the same processor as before can have some benefits, but waiting for the same processor to become idle can also incur a penalty from a few different factors. This is discussed in depth in "Using Processor-Cache Affinity Information in Shared-Memory Multiprocessor Scheduling" by Squillante & Lazowska.

For this project, an idle processor will select the next thread based on the following algorithm:

There is already a structure, "current", which maps the CPU # to the list of tasks that are running on that CPU. You have complete freedom in choosing how to run the processor-affinity algorithm. If you have a cool algorithm in mind, then run it by the instructor or the TA first and include it in your write up.

The supplied GTThreads code already creates one kernel thread via the clone() call per available core, and the sched_affinity call does part of the work for setting up processes on other CPUs. What's left to you is to complete the algorithm as described.

The processor affinity scheduling is really a stepping stone to implementing coscheduling. For the purpose of this assignment, you can use round-robin scheduling for your global run queue.

The number of cores to run on is also implementation-dependent. The supplied GTThreads code uses the NUM_CPUS environment variable to set the number of cores/CPUs to run on. However, if you have a way to set this up that's easier for you, that's fine, too - just document how you did it.

In order to receive full credit for this section, you must write an affinity scheduler that maps a thread to a CPU/core as described. You must also describe how to set the number of CPUs/cores to use if you are not using NUM_CPUS.

Coscheduling

Coscheduling attempts to schedule a group of threads together to improve performance. One corollary is that the thread scheduler may try to avoid rescheduling one or more threads in the group when the thread is merely communicating with another thread.

The following example illustrates coscheduling. There are four groups of threads:

(The linked list image merely indicates how you may want to conceptualize your thread groups, but it's not required.)

There are also 8 processors available for use.
Time Instance 1
In the first instance, all of Group A and all of Group B run on the 8 processors, as they are the first thread groups in the run queue.

Time Instance 2
Now suppose that a single thread (say, B2) in threadgroup B decides to block. After a timeout, the scheduler decides to reschedule B2. Since we're running coscheduling, this means that all other threads in B2's thread group, thread group B, also must be rescheduled. This means that we can schedule thread group C's threads. Since there is room left over, we can also schedule thread group D's threads.

Time Instance 3
Now D yields, which leaves thread groups A and C running. The question now is what to do with the other two processors. In this case the scheduler gives a partial allocation to thread group B. When more processors become available, thread group B will claim them. This is the case regardless of if another thread group is created which only needs one processor. Thread group B must run next unless one of its threads blocks for a sufficiently long time, in which case the entire thread group would be rescheduled.

Time Instance 4
Finally, thread group A yields its processors. Again, Group B's threads must be scheduled on the available processors first. Group C's threads are already running, so Group D is scheduled.

img1.jpg
Many thanks to Min Lee for this picture and example

Notice that there is no concept of processor affinity; processes are scheduled on CPUs/cores as the processors become available. This is also different from gang scheduling, where all threads must be scheduled at the same time - in coscheduling, we can start running part of a task group (in this case, group B in Time Instance 3) when it is otherwise impossible to schedule all threads in the group. Likewise, coscheduling differs from simple round-robin or processor-affinity schedulers in that the thread scheduling on all processors are coordinated - processors in processor affinity schedulers don't care about the ordering in the run queue or the currently executing tasks on other processors.

This is not the last word on coscheduling, so you are encouraged to create a fancier algorithm for coscheduling. In the previous example, it may have made sense to have scheduled Group D instead of threads B1 and B2 in Time Instance 3. If you create such a scheduler, then please discuss it in your report.

In order to receive full credit for this section, you must implement the basic coscheduling described across thread groups.

Threading the Matrix Multiplication

A very rudimentary code for multiplying matrices is provided in parallel_example.c . The code generates random matrices and then creates a number of threads to multiply the matrices. The output of the multiplication corresponds to calculating C = A*B. Each thread calculates a fraction of the rows in C by calculating the same stripe of rows in A times the entire matrix B.

However, simply threading the code for a single matrix multiplication isn't always very interesting. In addition, you should create an application that threads two or more simultaneous matrix multiplications.

In order to receive full credit for this section, you must evaluate the runtime, excluding the matrix initialization, for at least two simultaneous matrix multiplications for at least four different number of cores (e.g., 1, 2, 4, and 4 cores). You should also create at least three cases for each number of CPUs/cores used: one case where the total number of threads equals the total number of cores, and two cases where there are more than one thread per core. For example, if your test case is for two cores, then you should have a test with 2 threads, and you could have one test with at least 6 threads and another test with at least 10 threads total.

Your analysis should also show the total runtime (excluding initialization) for the three algorithms implemented: priority-based scheduling, processor-affinity scheduling, and coscheduling. So far this makes a total of 36 data points - 3 algorithms, 4 cores evaluated per algorithm, and 3 different number of threads per core.

This code must also run and execute on the warp cluster. The code is specifically geared towards Linux on i386, so deviations are probably not a good idea. (In another words, Windows programming is out for now.)

Write Up of Results

After this project, you should have a better working understanding of performance related to threading issues. You will need to write up your results and describe the issues surrounding them.

To get full credit for the write up, you must include the following:

Any reasonable format is acceptable - text, Word, PS, or PDF are all fine.

Submission

You should include the following in a tarball or similar: You should send a copy of this to both the instructor and the TA, smcmanus@cc, by Monday, September 15th at 11:59 pm.

Notes on GTThreads

The provided GTThreads library is meant to give you access to a simple, extensible, user-level thread library. It only comes with a round-robin scheduler, but this should be enough to get you started with manipulating a scheduler.

At a high level, GTThreads clones the invoking process until it is executing on NUM_CPUS processors (where NUM_CPUS is the environment variable described earlier). The use of the clone function means that there is a heavyweight process, referred to as a kernel thread in GTThreads, that is running on each processor. The threads that applications schedule in GTThreads are then scheduled on top of those kernel-level threads.

The provided GTThreads package makes heavy use of signals. In particular, the SIGVTALRM (or virtual timer), is used to indicate when the timeslice is up for a given task. When the SIGVTALRM timer goes off, the handler will find the next thread in the run queue and start executing it.

Like PThreads, you are able to create threads and have threads yield. You also have mutex functions available to you - gtthread_mutex_init, gtthread_mutex_lock, and gtthread_mutex_unlock work similar to their PThread cousins (though without attributes).

In general, the function "gtthread_*" has a corresponding PThreads function named "pthread_*". The following functions are provided in GTThreads:

GTThreads also uses the following key variables:

Tips: The following functions and datatypes are also either used within GTThreads or may become useful to you. Many of these, especially the functions, have MAN pages associated with them, so you should be able to type "man " and get a fuller explanation of each function.

At some point processor P1 may want to tell processor P2 to do something - say, wake up, read its relevant structures, and make a context switch. For this reason, you should consider using the kill( process_id, signal# ) function to send a signal to a process. In GTThreads, each CPU/core has one heavyweight process (referred to in the code as a kernel thread) that runs the threads on that CPU. These processes can be found in the processor_kthread_map array, which maps the processor number to the heavyweight process that is running threads on that processor. For example, processor_kthread_map[1] may contain 613, indicating that process 613 is running the threads on processor #1. Then the kill function can be used to send any signal to the processes listed in processor_kthread_map. For example, kill( processor_kthread_map[1], SIGUSR1 ) would send the SIGUSR1 alarm to process 613.

The SIGUSR1 and SIGUSR2 signals may also help you out in addition to SIGVTALRM. The reason is that SIGUSR1 and SIGUSR2 are reserved for applications, so the kernel will never use these signals to indicate some condition. This is useful because you won't have to disambiguate what a signal means. You do not have to use them.

For more information on the kill and signal functions, look at their man pages. These should be available using "man kill" and "man signal". Sometimes this is ambiguous - kill, for example, is also a shell command. You can also use the "whatis kill" and "whatis signal" commands to find the right section of the man pages for kill and signal. For example, "whatis kill" might display this:

	smcmanus|greenland>whatis kill
	kill                 (1)  - terminate a process
	kill                 (1p)  - terminate or signal processes
	kill                 (2)  - send signal to a process
	kill                 (3p)  - send a signal to a process or a group of processes
	kill [builtins]      (1)  - bash built-in commands, see bash(1)
	kill                 (2)  - send signal to a process
  
The (1) and (1p) refer to shell commands, while the (2) section refers to the C function. Then "man -s 2 kill" will show you the relevant information about the kill function.

The given makefile creates an archive library and links in the sample programs. Feel free to change what kind of libraries are created; as long as your samples run, you can do whatever makes your coding easiest.

 

References

coscheduling : http://csg.csail.mit.edu/people/rudolph/Autobiography/papers/Cosched95IJPP.ps

Robert Love's book "Kernel Development" : http://www.amazon.com/Linux-Kernel-Development-Robert-Love/dp/0672325128

Previous GTThreads Implementation Sample GTThreads Library