The goal of this project is to understand and implement simple coscheduling and priority policies along with processor affinity. To do so you must modify the provided GTThreads package, which is a simple user level thread scheduler.
This project minimally requires the following in order to receive full credit:
The project must be emailed to the TA, Scott McManus (smcmanus@cc.gatech.edu), by September 15th at 11:59 pm (i.e., midnight). If you email multiple submissions, then the last one submitted will be graded.
You may choose another threading package to modify other than GTThreads, but it must be approved ahead of time. That said, code that implements priority-based scheduling, processor affinity, or coscheduling is off-limits.
The code should be written to run on the warp cluster; there are a couple of dependencies in the code on Linux which have not been tested on other operating systems. If you have issues with this then contact the TA, Scott (smcmanus@cc), and we can try to work something out.
If you have any questions, then you are encouraged to consult with the TA. The earlier you ask, the better - others may have the same questions or issues, too.
You will first need to implement priority-based scheduling at the thread level. The priority scheduling should run in O(1) time, which means that switching threads should not involve searching the run queue. This also means your scheduler may order threads in the run queue when the thread is created or when the thread's priority is changed.
It should be possible to change the priority of your thread via the user-level library. If the thread is no longer the highest-priority thread in the run queue, then it should be returned to the run queue and the highest-priority thread should be run.
Next, you must extend your priority-based scheduling for a thread group. The API for setting the thread group is open, but at a minimum it must be possible to set the thread group at thread creation. If no thread group is specified, then all threads must execute in the same thread group. Mapping the thread to a thread group is important for coscheduling support, which is described later.
Finally, your priority scheduling algorithm must be able to run on multiple cores. Your implementation should always execute the highest-priority runnable threads on the cores. Since each thread group has the same priority among its members, you should be able to choose any thread from among the group to schedule. There are no requirements for how you choose a thread from within a thread group to run, but you should include information on how you chose the thread to run in your write up.
Full credit will be awarded if both per-thread priority-based scheduling and thread-group priority-based scheduling are implemented on both single and multiple cores.
Clarifications
The highest priority runnable thread must always be run.
If an implementation creates a thread with the highest priority
but there is not an immediate context switch to that thread,
then a large number of points will be deducted from the
priority scheduling grade.
When two threads have the same priority (as in the case when
two threads are in the same thread group), then the action
taken is up to you.
This is very important in some applications, so although there
is flexibility in your API there is no flexibility on
this issue.
When implementing thread group priorities, you will not need to create a mix of priorities for the threads and the thread groups. Once a thread is assigned to the thread group, the thread will take on the priority of the thread group. You also do not need to worry about common problems such as priority inversion with the mutexes.
You have complete freedom in defining the API for setting the priority and thread group. One possibility is to do something similar to PThreads' sched_param, pthread_attr_getschedparam, and pthread_attr_setschedparam where the priority can be set. Keep in mind that the provided GTThreads package does not use thread attributes, so you will have to build in support for this kind of mechanism if you want to use it.
As a different example, the API for setting the thread group and priority could look something like this:
|
#define GTTHREAD_PRIO_MAX (255) |
In this example API, gtthread_thread_set_prio will set the priority for the given thread. The priority has a maximum value of 255, a minimum value of 1, and a default of 100. This gives some flexibility in the case that a priority of 0 is supplied to other functions.
The example gtthread_threadgroup_create function creates a thread group with the specified priority. If the priority argument is 0, then one possibility is to use the default priority (in this example, 100). The thread group can start at 1, leaving smaller thread groups to indicate other conditions. A default thread group of 0 with a default priority can be created.
In this example, the gtthread_create function now takes the thread group as an argument. If the thread group is 0, then this could indicate that the default thread group is to be used. (Note that this may help when you are first implementing the per-thread priority scheduling, as your scheduler can simply ignore the parameter or use the initial thread group of 0.)
Finally, in this example the thread group's priority can be set using the gtthread_tgroup_set_prio.
Note that all of this functionality may require that you change other functions in GTThreads, and you have complete freedom in how you choose to do this.
The next step is to implement processor affinity. Processor affinity refers to the preference of scheduling a thread on a single processor. When a thread becomes runnable, it can either be run on the same processor as before or it can be run on a different processor. Running on the same processor as before can have some benefits, but waiting for the same processor to become idle can also incur a penalty from a few different factors. This is discussed in depth in "Using Processor-Cache Affinity Information in Shared-Memory Multiprocessor Scheduling" by Squillante & Lazowska.
For this project, an idle processor will select the next thread based on the following algorithm:
There is already a structure, "current", which maps the CPU # to the list of tasks that are running on that CPU. You have complete freedom in choosing how to run the processor-affinity algorithm. If you have a cool algorithm in mind, then run it by the instructor or the TA first and include it in your write up.
The supplied GTThreads code already creates one kernel thread via the clone() call per available core, and the sched_affinity call does part of the work for setting up processes on other CPUs. What's left to you is to complete the algorithm as described.
The processor affinity scheduling is really a stepping stone to implementing coscheduling. For the purpose of this assignment, you can use round-robin scheduling for your global run queue.
The number of cores to run on is also implementation-dependent. The supplied GTThreads code uses the NUM_CPUS environment variable to set the number of cores/CPUs to run on. However, if you have a way to set this up that's easier for you, that's fine, too - just document how you did it.
In order to receive full credit for this section, you must write an affinity scheduler that maps a thread to a CPU/core as described. You must also describe how to set the number of CPUs/cores to use if you are not using NUM_CPUS.
Coscheduling attempts to schedule a group of threads together to improve performance. One corollary is that the thread scheduler may try to avoid rescheduling one or more threads in the group when the thread is merely communicating with another thread.
The following example illustrates coscheduling. There are four groups of threads:
There are also 8 processors available for use.
Time Instance 1
In the first instance, all of Group A and all of Group B run on the 8
processors, as they are the first thread groups in the run queue.
Time Instance 2
Now suppose that a single thread (say, B2) in threadgroup B decides
to block. After a timeout, the scheduler decides to reschedule B2.
Since we're running coscheduling, this means that all other threads
in B2's thread group, thread group B, also must be rescheduled.
This means that we can schedule thread group C's threads. Since there
is room left over, we can also schedule thread group D's threads.
Time Instance 3
Now D yields, which leaves thread groups A and C running. The question
now is what to do with the other two processors. In this case the
scheduler gives a partial allocation to thread group B. When more
processors become available, thread group B will claim them. This
is the case regardless of if another thread group is created which
only needs one processor. Thread group B must run next unless one
of its threads blocks for a sufficiently long time, in which case
the entire thread group would be rescheduled.
Time Instance 4
Finally, thread group A yields its processors. Again, Group B's
threads must be scheduled on the available processors first.
Group C's threads are already running, so Group D is scheduled.
Notice that there is no concept of processor affinity; processes are scheduled on CPUs/cores as the processors become available. This is also different from gang scheduling, where all threads must be scheduled at the same time - in coscheduling, we can start running part of a task group (in this case, group B in Time Instance 3) when it is otherwise impossible to schedule all threads in the group. Likewise, coscheduling differs from simple round-robin or processor-affinity schedulers in that the thread scheduling on all processors are coordinated - processors in processor affinity schedulers don't care about the ordering in the run queue or the currently executing tasks on other processors.
This is not the last word on coscheduling, so you are encouraged to create a fancier algorithm for coscheduling. In the previous example, it may have made sense to have scheduled Group D instead of threads B1 and B2 in Time Instance 3. If you create such a scheduler, then please discuss it in your report.
In order to receive full credit for this section, you must implement the basic coscheduling described across thread groups.
A very rudimentary code for multiplying matrices is provided in parallel_example.c . The code generates random matrices and then creates a number of threads to multiply the matrices. The output of the multiplication corresponds to calculating C = A*B. Each thread calculates a fraction of the rows in C by calculating the same stripe of rows in A times the entire matrix B.
However, simply threading the code for a single matrix multiplication isn't always very interesting. In addition, you should create an application that threads two or more simultaneous matrix multiplications.
In order to receive full credit for this section, you must evaluate the runtime, excluding the matrix initialization, for at least two simultaneous matrix multiplications for at least four different number of cores (e.g., 1, 2, 4, and 4 cores). You should also create at least three cases for each number of CPUs/cores used: one case where the total number of threads equals the total number of cores, and two cases where there are more than one thread per core. For example, if your test case is for two cores, then you should have a test with 2 threads, and you could have one test with at least 6 threads and another test with at least 10 threads total.
Your analysis should also show the total runtime (excluding initialization) for the three algorithms implemented: priority-based scheduling, processor-affinity scheduling, and coscheduling. So far this makes a total of 36 data points - 3 algorithms, 4 cores evaluated per algorithm, and 3 different number of threads per core.
This code must also run and execute on the warp cluster. The code is specifically geared towards Linux on i386, so deviations are probably not a good idea. (In another words, Windows programming is out for now.)
To get full credit for the write up, you must include the following:
The provided GTThreads library is meant to give you access to a simple, extensible, user-level thread library. It only comes with a round-robin scheduler, but this should be enough to get you started with manipulating a scheduler.
At a high level, GTThreads clones the invoking process until it is executing on NUM_CPUS processors (where NUM_CPUS is the environment variable described earlier). The use of the clone function means that there is a heavyweight process, referred to as a kernel thread in GTThreads, that is running on each processor. The threads that applications schedule in GTThreads are then scheduled on top of those kernel-level threads.
The provided GTThreads package makes heavy use of signals. In particular, the SIGVTALRM (or virtual timer), is used to indicate when the timeslice is up for a given task. When the SIGVTALRM timer goes off, the handler will find the next thread in the run queue and start executing it.
Like PThreads, you are able to create threads and have threads yield. You also have mutex functions available to you - gtthread_mutex_init, gtthread_mutex_lock, and gtthread_mutex_unlock work similar to their PThread cousins (though without attributes).
In general, the function "gtthread_*" has a corresponding PThreads
function named "pthread_*". The following functions are provided
in GTThreads:
GTThreads also uses the following key variables:
Tips:
The following functions and datatypes are also either used within GTThreads
or may become useful to you. Many of these, especially the functions, have
MAN pages associated with them, so you should be able to type
"man
At some point processor P1 may want to tell processor P2 to do something - say, wake up, read its relevant structures, and make a context switch. For this reason, you should consider using the kill( process_id, signal# ) function to send a signal to a process. In GTThreads, each CPU/core has one heavyweight process (referred to in the code as a kernel thread) that runs the threads on that CPU. These processes can be found in the processor_kthread_map array, which maps the processor number to the heavyweight process that is running threads on that processor. For example, processor_kthread_map[1] may contain 613, indicating that process 613 is running the threads on processor #1. Then the kill function can be used to send any signal to the processes listed in processor_kthread_map. For example, kill( processor_kthread_map[1], SIGUSR1 ) would send the SIGUSR1 alarm to process 613.
The SIGUSR1 and SIGUSR2 signals may also help you out in addition to SIGVTALRM. The reason is that SIGUSR1 and SIGUSR2 are reserved for applications, so the kernel will never use these signals to indicate some condition. This is useful because you won't have to disambiguate what a signal means. You do not have to use them.
For more information on the kill and signal functions, look at their man pages. These should be available using "man kill" and "man signal". Sometimes this is ambiguous - kill, for example, is also a shell command. You can also use the "whatis kill" and "whatis signal" commands to find the right section of the man pages for kill and signal. For example, "whatis kill" might display this:
smcmanus|greenland>whatis kill kill (1) - terminate a process kill (1p) - terminate or signal processes kill (2) - send signal to a process kill (3p) - send a signal to a process or a group of processes kill [builtins] (1) - bash built-in commands, see bash(1) kill (2) - send signal to a processThe (1) and (1p) refer to shell commands, while the (2) section refers to the C function. Then "man -s 2 kill" will show you the relevant information about the kill function.
The given makefile creates an archive library and links in the sample programs. Feel free to change what kind of libraries are created; as long as your samples run, you can do whatever makes your coding easiest.
coscheduling : http://csg.csail.mit.edu/people/rudolph/Autobiography/papers/Cosched95IJPP.ps
Robert Love's book "Kernel Development" : http://www.amazon.com/Linux-Kernel-Development-Robert-Love/dp/0672325128
Previous GTThreads Implementation Sample GTThreads Library