Shared Memory Programming with Multithreading

(CS 267, Feb 2 1995)

Multithreading is a programming style suitable for shared-memory MIMD machines, like the SGI Power Challenge or the Sun SPARCcenter 2000. These machines have a single address space, so that two processors loading location 37, say, from memory are both guaranteed to load the same value. Initially, a program consists of a single "thread" of control, as the main routine begins. The user controls parallelism by creating other "threads" of control, which are told to execute a subroutine of the user's choice. These threads can be thought of as (UNIX) processes, which are executed by the available parallel processors. If the user creates more threads than there are physical processors, then the threads are scheduled for execution much as UNIX schedules multiple independent processes to run. Threads share the same address space, and so the same code and most of the variables of the program. They may also synchronize with and wait for one another, so they can cooperate in parallel programs.

We begin by examining the multithreaded solution of the first Sharks and Fish problem, fish swimming in a current. The solutions are written in C with calls to the Solaris multithreading library. You should open the code in a separate window to follow along as we discuss it.

After definitions of various macros and types, we see a set of variables (fishes through g_dsum) which are declared outside the main routine; these are global variables that will be visible in all routines on all processors.

Now skip down to the main{} routine. After mallocing some space (in fishes) to hold the global array of data for NFISH fish, and some space to hold a little data for each of the threads that will be created (in thread_ptr), a mutex variable mul_lock is initialized by calling mutex_init. Mutex stands for mutual exclusion, and will later be used to guarantee that at most one thread can execute a particular sequence of code (a critical section) at a time. The second argument to mutex_init, sync_type = USYNC_PROCESS, indicates that other processes can access this mutex variable too; the third argument is ignored.

Similarly, barrier_init initializes a barrier variable ba. A barrier is a synchronization point in the code that all threads need to reach before any can continue. Threads reaching the barrier early wait until all have arrived. The second argument to barrier_init, NTHREADS, is the number of threads that will synchronize at the barrier. The third and fourth arguments are as above.

Now thread_ptr is initialized. It has one entry for each of the NTHREADS threads to be created. thread_ptr[i].chunk is set to i for i=0 to NTHREADS-1. thread_ptr[i].tid is used to store the system-assigned thread-id-number later.

The loop for (i = 1; i < NTHREADS; i++) {} actually creates the other NTHREADS-1 parallel threads (besides the main one executing main{}). This is done by calling thr_create. The first two arguments describe where the stack for the new thread is to be located and how large it is (0 indicates defaults). The third argument, move_fish, is the name of a procedure that the newly created thread will begin executing when it starts up; this is how parallelism is created. move_fish will be called with its argument equal to the fourth argument of thr_create, in this case i. In other words, the i-th thread created will be passed the value i; this will be used to divide up the work. The fifth argument indicates when the thread will start up, and on which processor it will run; 0 is a default. Finally, the system-assigned thread-id-number is returned in the last argument; this is needed below for synchronization and termination purposes.

Then the main thread also begins moving fish by calling move_fish itself. Finally, the program terminates by having the main thread call thr_join to wait for all the other threads to return.
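The create-then-join structure of main{} can be sketched as follows. This is not the course code: it assumes POSIX threads (pthread_create, pthread_join) in place of thr_create and thr_join, passes a pointer to the chunk number rather than the value i itself, and uses a stand-in move_fish that only records that its chunk ran. The struct thread_info, the visited array and run_all_chunks are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4                      /* hypothetical thread count */

struct thread_info {                    /* modelled on the notes' thread_ptr[] */
    int       chunk;                    /* which piece of the work this thread owns */
    pthread_t tid;                      /* system-assigned thread id */
};

static struct thread_info thread_ptr[NTHREADS];
static int visited[NTHREADS];           /* visited[c] = 1 once chunk c was handled */

/* Stand-in for move_fish: just records that its chunk ran. */
static void *move_fish(void *arg)
{
    int mychunk = *(int *)arg;
    visited[mychunk] = 1;
    return NULL;
}

/* Returns the number of chunks handled, or -1 if a thread call fails. */
int run_all_chunks(void)
{
    int i, count = 0;

    for (i = 0; i < NTHREADS; i++)
        thread_ptr[i].chunk = i;

    /* Create the other NTHREADS-1 threads; the main thread takes chunk 0. */
    for (i = 1; i < NTHREADS; i++)
        if (pthread_create(&thread_ptr[i].tid, NULL, move_fish,
                           &thread_ptr[i].chunk) != 0)
            return -1;

    move_fish(&thread_ptr[0].chunk);    /* main thread does its share too */

    /* Wait for every created thread to return before using the results. */
    for (i = 1; i < NTHREADS; i++)
        if (pthread_join(thread_ptr[i].tid, NULL) != 0)
            return -1;

    for (i = 0; i < NTHREADS; i++)
        count += visited[i];
    return count;
}
```

Calling run_all_chunks() from a main routine returns NTHREADS when every chunk has been processed; the joins are what make it safe to read visited[] afterwards.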

Now we examine the move_fish{} routine, which is right above main{}. The argument is stored in a local variable called mychunk. Thus, thread i has mychunk assigned to i, so the threads can divide up the work. The value mychunk is first used in all_init_fish{}, where it indicates that thread i should initialize num_fish = NFISH/NTHREADS fish positions and velocities starting at mychunk*num_fish. The argument fishes, which is the array of all the fish data, is global and so visible to all threads. The all_move_fish{} routine is similar. No synchronization is needed here; each thread can move the fish it is assigned independently of the other threads.
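The index arithmetic for dividing the fish among threads can be sketched like this. The fish_t record, the values of NFISH and NTHREADS, and the simplified signatures of all_init_fish and all_move_fish are assumptions made for illustration (the real routines also take the fishes array as an argument), and the sketch assumes NTHREADS divides NFISH evenly.

```c
#define NFISH    1000   /* hypothetical problem size */
#define NTHREADS 4      /* hypothetical thread count; assumed to divide NFISH */

typedef struct { double x, v; } fish_t;   /* hypothetical 1-D fish record */

static fish_t fishes[NFISH];              /* global, so visible to all threads */

/* Initialize only the fish owned by chunk mychunk. */
void all_init_fish(int mychunk)
{
    int num_fish = NFISH / NTHREADS;      /* fish per thread */
    int first = mychunk * num_fish;       /* first fish owned by this chunk */
    int i;

    for (i = first; i < first + num_fish; i++) {
        fishes[i].x = (double)i;          /* arbitrary starting position */
        fishes[i].v = 1.0;                /* arbitrary starting velocity */
    }
}

/* Advance only this chunk's fish by one time step of length dt. */
void all_move_fish(int mychunk, double dt)
{
    int num_fish = NFISH / NTHREADS;
    int first = mychunk * num_fish;
    int i;

    for (i = first; i < first + num_fish; i++)
        fishes[i].x += dt * fishes[i].v;
}
```

Because chunk c touches only indices c*num_fish through (c+1)*num_fish - 1, no two threads ever write the same array element, which is why these two routines need no locks or barriers.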

The next three parallel operations, computing max_acc, max_speed and sum_speed_sq, do require synchronization, and are performed by the routines all_reduce_to_all_dmax and all_reduce_to_all_dadd. These routines compute the global max (respectively sum) of their local arguments. It suffices to examine all_reduce_to_all_dadd{}. The first statement is barrier_wait(&ba), which causes all threads to wait until all have reached this statement. Then thread 0 initializes the global sum g_dsum.accum to zero and sets g_dsum.zeroed to 1 to indicate to other threads that g_dsum.accum has indeed been initialized. The other threads wait at the line "while ( !g_dsum.zeroed )" for this to occur. (myID contains the thread-id number of the calling thread.)

Then the pair of calls mutex_lock(&mul_lock) and mutex_unlock(&mul_lock) permits only one thread at a time to be executing the code between them (a so-called critical section). Here the global sum is actually incremented.

If we do not use mutual exclusion, we may have a race condition as two processors try to execute the critical section simultaneously, with the result that g_dsum can be computed incorrectly. For example, suppose for simplicity that we only have two threads, threads 1 and 2, that thread i wants to add dmax=i to g_dsum, and that g_dsum has been initialized to zero. Thus, the correct result is g_dsum = 1+2 = 3. Now suppose that threads 1 and 2 simultaneously enter the critical section. We claim that when both threads finish, g_dsum could equal 1, 2 or 3. To see why, consider the following sequence of events:

    thread 1 fetches g_dsum.accum = 0 from memory into a register
    thread 2 fetches g_dsum.accum = 0 from memory into a register
    thread 1 increments its register by 1
    thread 2 increments its register by 2
    thread 1 writes its register back to memory, setting g_dsum.accum to 1
    thread 2 writes its register back to memory, setting g_dsum.accum to 2

Clearly, by reversing the last two events, g_dsum.accum could have been set to 1 as well. The use of critical sections protected by mutex locks prevents these sorts of bugs, which are otherwise nondeterministic and hard to find.
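The three possible outcomes can be replayed deterministically in a single-threaded simulation, where each simulated thread has a private "register" and the fetch, increment and write-back steps are issued in a chosen order. This is only an illustrative sketch; struct sim_thread and all function names here are hypothetical, and accum stands in for g_dsum.accum.

```c
/* One simulated thread: a private register plus the amount it adds. */
struct sim_thread { double reg; double add; };

static double accum;                     /* stands in for g_dsum.accum */

static void load(struct sim_thread *t)  { t->reg = accum; }     /* fetch      */
static void bump(struct sim_thread *t)  { t->reg += t->add; }   /* increment  */
static void store(struct sim_thread *t) { accum = t->reg; }     /* write back */

/* The interleaving listed above: thread 2 writes last, so the sum is 2. */
double lost_update_t2_last(void)
{
    struct sim_thread t1 = {0.0, 1.0}, t2 = {0.0, 2.0};
    accum = 0.0;
    load(&t1);  load(&t2);
    bump(&t1);  bump(&t2);
    store(&t1); store(&t2);
    return accum;
}

/* Same steps with the last two writes reversed: the sum is 1. */
double lost_update_t1_last(void)
{
    struct sim_thread t1 = {0.0, 1.0}, t2 = {0.0, 2.0};
    accum = 0.0;
    load(&t1);  load(&t2);
    bump(&t1);  bump(&t2);
    store(&t2); store(&t1);
    return accum;
}

/* What a mutex enforces: one thread finishes before the other starts. */
double serialized(void)
{
    struct sim_thread t1 = {0.0, 1.0}, t2 = {0.0, 2.0};
    accum = 0.0;
    load(&t1); bump(&t1); store(&t1);
    load(&t2); bump(&t2); store(&t2);
    return accum;
}
```

The first two functions return 2 and 1 respectively, and only the serialized order returns the correct sum 3. On a real machine the order is chosen by timing rather than by the programmer, which is why the bug appears only intermittently.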

More about SunOS Multi-thread architecture

Here is a more complete description of the routines available to control multithreading. This discussion is taken from "SunOS Multi-thread architecture", M. Powell et al., USENIX, Winter 1991, and from "Writing Multithread Code in Solaris", S. Kleiman et al., SunSoft Inc.

There are actually two "levels" of threads available in SunOS; this is not true of all thread systems. The threads used above, which are the only ones the user has to know about, are scheduled entirely at the user level, so that the OS kernel does not have to get involved. This means they are relatively inexpensive to create, start, stop and synchronize, since anything involving the OS kernel is more expensive. Threads at the second level are called LWPs, or "lightweight processes". These are known to the kernel, and are consequently more expensive to create, etc. SunOS supports both because some applications are more effectively programmed with one than the other. For our purposes we will only consider the case where there is one thread per LWP, and one LWP per physical processor, though more general situations are possible.

It is important to remember that a thread shares all the instructions of the program that creates it, and all the data visible in its scope at the point of creation. Once created, a thread gets its own ID, and its own registers and stack; this allows it to execute independently of the program that created it, but to share data.

A thread is created by calling thr_create:

       int thr_create( void *stack_addr, 
                       unsigned int stack_size,
                       void (*func) (),
                       void *arg,
                       long flags,
                       thread_id_t *new_thread )

The first argument stack_addr indicates where the thread's stack is to be based, and stack_size indicates how large it may grow; defaults are available for both. When the thread starts running, it calls func(arg), which may be any routine the user wants. When func(arg) returns, the thread terminates. The flags describe how the thread will execute. For example, the thread may be suspended on creation, until a later thread_continue() call starts it. The thread may also either be allowed to "float", i.e. be executed by any available physical processor, or be tied down to execute on one of them; we will typically use the latter, creating only as many threads as physical processors, since it offers more control over parallelism. Finally, thr_create returns, through the pointer new_thread, a thread identifier uniquely identifying the thread it creates; this can be used for synchronization purposes by other routines.

The routine thread_wait allows one thread to synchronize with another by waiting for the thread specified by thread_id to exit:

   thread_id_t thread_wait( thread_id_t thread_id )

Threads can signal one another by calling the following routines:

   int thread_stop( thread_id_t thread_id )
   int thread_continue( thread_id_t thread_id )

Threads can synchronize using mutual exclusion (mutex) locks, condition variables, semaphores, or readers/writer locks, all of which are traditional OS constructs. We have already seen the use of mutex variables in the Sharks and Fish code. Here we illustrate the use of condition variables, which are used to wait until a particular condition is true. For example, consider the code sequence

     mutex_enter( &lock )
     ... 
     while ( not condition ) { cv_wait( &cond_var, &lock ) }
     ...
     mutex_exit( &lock )

If the condition is false, cv_wait will be called, and the thread will release the lock and suspend until reawakened by a signal. The signal is generated by another thread calling either cv_signal (to wake up one waiting thread) or cv_broadcast (to wake up all waiting threads). If, when reawakened, another thread has entered the critical section, the awakened thread must wait to reacquire the lock. Because the condition may have changed again by then, it is retested in the while loop before the thread proceeds.
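A complete wait-and-signal pair might look like the following sketch. It assumes POSIX threads (pthread_cond_wait, pthread_cond_signal) as a stand-in for cv_wait and cv_signal, and the variables ready, data and seen and the routines consumer and condvar_demo are hypothetical names chosen for illustration.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond_var = PTHREAD_COND_INITIALIZER;
static int ready = 0;      /* the "condition": has data been published? */
static int data  = 0;      /* value handed from producer to consumer */
static int seen  = 0;      /* what the consumer actually read */

static void *consumer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)                            /* retest after every wakeup */
        pthread_cond_wait(&cond_var, &lock);  /* releases lock while suspended */
    seen = data;                              /* the lock is held again here */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Starts a consumer, publishes a value, and returns what the consumer saw
   (or -1 if a thread call fails). */
int condvar_demo(void)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, consumer, NULL) != 0)
        return -1;

    pthread_mutex_lock(&lock);
    data  = 42;                       /* make the condition true ...        */
    ready = 1;
    pthread_cond_signal(&cond_var);   /* ... and wake one waiting thread    */
    pthread_mutex_unlock(&lock);

    if (pthread_join(tid, NULL) != 0)
        return -1;
    return seen;
}
```

The sketch works whichever thread runs first: if the producer signals before the consumer reaches the loop, the consumer finds ready already set and never waits. This is one reason the condition is kept in a variable protected by the lock rather than relying on the signal alone.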