
Parallel Programming-O-Rama

Friday, May 30th, 2008 at 5:04 pm Written by:

Suppose you have a function called process() which you want to apply to the members of an array, like so:

ProcessAllSingleThread(array<A> a)
{
    for (i = 0; i < a.size(); i++)
        a[i].process();
}

If it’s okay to execute the process() function simultaneously on entries of the array, the computation can be distributed across several processing cores using multiple threads, like so:

global ndone = 0;
global inext = 0;
bool ProcessAllMultiThread(array<A> a)    // several threads can enter this function
{
    while (true)
    {
        synchronized: 
        {
            if (inext == a.size())
                return ndone == a.size();   // false means other threads are working
            i = inext; 
            inext++; 
        }

        a[i].process(); 

        synchronized: 
            ndone++; 
    }
}

… or some other more complex (but still correct) form that chunks up groups of elements of the array a, for when the individual process() calls are so trivial that the synchronization overhead would overwhelm them one at a time; see the second sketch below.
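As a concrete illustration, here is roughly how the pseudocode above might come out in C++11 with std::mutex; the translation is a sketch of mine, not code from this post, and several threads (launched with std::thread) would all call the function:

#include <cstddef>
#include <mutex>
#include <vector>

struct A { void process() { /* the real per-element work */ } };

std::mutex mtx;          // guards inext and ndone
std::size_t inext = 0;   // index of the next unclaimed element
std::size_t ndone = 0;   // count of elements fully processed

bool ProcessAllMultiThread(std::vector<A>& a)   // reset inext and ndone first
{
    while (true)
    {
        std::size_t i;
        {
            std::lock_guard<std::mutex> lock(mtx);
            if (inext == a.size())
                return ndone == a.size();   // false: other threads still working
            i = inext++;
        }

        a[i].process();   // the only part that runs unlocked, in parallel

        {
            std::lock_guard<std::mutex> lock(mtx);
            ndone++;
        }
    }
}

Each worker thread just calls the function until it returns; the thread that gets true back knows the whole array is finished. The chunked variant takes the lock once per block of indices rather than once per element (the CHUNK value of 256 is an arbitrary guess, to be tuned):

const std::size_t CHUNK = 256;

bool ProcessAllChunked(std::vector<A>& a)   // reuses mtx, inext, ndone above
{
    while (true)
    {
        std::size_t begin, end;
        {
            std::lock_guard<std::mutex> lock(mtx);
            if (inext == a.size())
                return ndone == a.size();
            begin = inext;
            end = (begin + CHUNK < a.size() ? begin + CHUNK : a.size());
            inext = end;
        }

        for (std::size_t i = begin; i < end; i++)
            a[i].process();

        {
            std::lock_guard<std::mutex> lock(mtx);
            ndone += end - begin;
        }
    }
}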

The important thing is that it shouldn't matter to the everyday programmer how it is done. All we need to learn is how to encapsulate the simple single-threaded version in a way that makes it possible to engage the whole multithreaded system.

To do this effectively it helps to have a framework that is general enough to apply in most of the places you want to use it.

We can remove the dependency on an array, and base the framework on a class T that returns pointers of type A, making a loop like so:

while (true)
{
    synchronized:
        b = T.getnext()
    if (b == 0)
        return T.alldone()
    b.process()
    synchronized:
        T.release(b)
}
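In C++ that loop might become a worker function templated on the provider class. This is a sketch of mine: the getnext()/alldone()/release() interface is from the post, the locking arrangement is an assumption.

#include <mutex>

// T must supply: getnext() returning a pointer (null when nothing is
// claimable right now), alldone(), and release(b).
template <typename T>
bool RunWorker(T& t, std::mutex& mtx)
{
    while (true)
    {
        decltype(t.getnext()) b;
        {
            std::lock_guard<std::mutex> lock(mtx);
            b = t.getnext();
        }

        if (b == 0)
            return t.alldone();

        b->process();   // unlocked: this is where the parallelism happens

        {
            std::lock_guard<std::mutex> lock(mtx);
            t.release(b);
        }
    }
}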

The above ProcessAllMultiThread() fits exactly into this system, which is now flexible enough to handle cases where the elements are not quite independent of each other.

For example, if it’s unsafe to process a[i] and a[i + 1] at the same time, but a[i] and a[j] are safe for all abs(i - j) >= 2, then by including a variable processstate in the class A initially set to NOT_DONE, we can implement the two functions, like so:

T::getnext()
{
    for (i = 0; i < a.size(); i++)
    {
        if ((a[i].processstate == NOT_DONE) && 
            ((i + 1 == a.size()) || (a[i+1].processstate != PROCESSING)) && 
            ((i == 0) || (a[i-1].processstate != PROCESSING)))
        {
            a[i].processstate = PROCESSING;
            return a[i];
        }
    }
    return 0;   // nothing claimable right now; the worker falls back to alldone()
}

T::release(b)
{
    b.processstate = DONE;
}

These functions, subject to some obvious debugging and efficiency improvements, should fit into the parallelizing machinery and make it move forwards when you finally get your shiny new eight core computer and need to prove that the investment was worth it.
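Rendered fully in C++, the provider class for that example might look like the sketch below (my rendering; the alldone() body is an assumption, since the post doesn't show it). It plugs straight into a worker loop like RunWorker() above.

#include <cstddef>
#include <vector>

enum ProcessState { NOT_DONE, PROCESSING, DONE };

struct A
{
    ProcessState processstate;
    A() : processstate(NOT_DONE) {}
    void process() { /* the real per-element work */ }
};

class T
{
public:
    explicit T(std::vector<A>& arr) : a(arr) {}

    // Called under the lock: claim an element whose neighbours are
    // not currently being processed.
    A* getnext()
    {
        for (std::size_t i = 0; i < a.size(); i++)
        {
            if (a[i].processstate == NOT_DONE &&
                (i + 1 == a.size() || a[i + 1].processstate != PROCESSING) &&
                (i == 0 || a[i - 1].processstate != PROCESSING))
            {
                a[i].processstate = PROCESSING;
                return &a[i];
            }
        }
        return 0;   // nothing claimable just now
    }

    // Also called under the lock.
    void release(A* b) { b->processstate = DONE; }

    bool alldone() const
    {
        for (std::size_t i = 0; i < a.size(); i++)
            if (a[i].processstate != DONE)
                return false;
        return true;
    }

private:
    std::vector<A>& a;
};

A worker that finds nothing claimable exits through alldone(), so a false return tells the caller that elements remain; the linear scan in getnext() is exactly the sort of efficiency improvement mentioned above.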

Now this framework isn’t quite good enough for cases where there is an aggregation of a result. It needs to be taken further.

Suppose the process function returns a value, and you need the sum of these values. For example, you are measuring the area of a surface composed of a billion triangles with a loop of the form:

sum = 0
for (i = 0; i < a.size(); i++)
    sum += a[i].calcarea();

Let’s assume it’s okay to chunk this sum up into sub-sums, so you don’t need to lock the sum variable a billion times. The framework we might use is:

synchronized:
    w = T.getworkspace()
while (synchronized: w.loadnext())
{
    w.process();
    synchronized: 
        w.release(); 

    optional:
    {
        synchronized:
            w.commitdata();
    }
}
synchronized:
    w.commitdata()
synchronized:
    T.releaseworkspace(w)
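To make the workspace idea concrete for the area sum, here is one way it might come out in C++11. This is a sketch with invented names (SumAreasMT, the chunk size of 1024): each thread's private sub-sum is its workspace, committed under the lock exactly once at the end.

#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

struct Triangle { double calcarea() const { return 0.0; /* real formula */ } };

double SumAreasMT(const std::vector<Triangle>& a, unsigned nthreads)
{
    std::mutex mtx;
    std::size_t inext = 0;   // next unclaimed index, guarded by mtx
    double sum = 0.0;        // the aggregate, guarded by mtx

    auto worker = [&]() {
        double subsum = 0.0;   // this thread's private workspace
        while (true)
        {
            std::size_t begin, end;
            {
                std::lock_guard<std::mutex> lock(mtx);
                if (inext == a.size())
                    break;
                begin = inext;
                end = (begin + 1024 < a.size() ? begin + 1024 : a.size());
                inext = end;
            }
            for (std::size_t i = begin; i < end; i++)
                subsum += a[i].calcarea();
        }
        std::lock_guard<std::mutex> lock(mtx);   // the commitdata() step
        sum += subsum;
    };

    std::vector<std::thread> threads;
    for (unsigned k = 0; k < nthreads; k++)
        threads.emplace_back(worker);
    for (std::size_t k = 0; k < threads.size(); k++)
        threads[k].join();
    return sum;
}

Note that the sub-sums arrive in a nondeterministic order, so the floating-point total can differ in its last bits from run to run.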

I’ve actually got no knowledge of parallel processing, and I don’t want to know anything about it. What’s important to me is the idea that I can deliver my algorithm as a set of brick-like building blocks (i.e. small enough to hold in your hand, but large enough to make things with) so that someone who does have the time and inclination can put them together.

It’s very much like the sort routine. I’ve no interest in the quick-sort, shell-sort or bubble-sort at all. What I do know is that if I prepare an array and an operator<() function on its elements, then I can take advantage of whatever is the state of the art at the time.

The sort function is something that can obviously be parallelized by breaking the array up into large chunks and merging the sorted pieces. This only works when your operator<() is thread safe, which it almost always is.
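As an illustration of that chunk-and-merge idea, here is a two-way version over std::sort; this sketch (and the name sortMT2) is mine, not the sortMT() described below:

#include <algorithm>
#include <thread>
#include <vector>

// Sort the two halves on two threads, then merge the results.
// Only valid when operator<() is thread safe.
template <typename X>
void sortMT2(std::vector<X>& a)
{
    if (a.size() < 1000)   // small arrays aren't worth the thread overhead
    {
        std::sort(a.begin(), a.end());
        return;
    }
    typename std::vector<X>::iterator mid = a.begin() + a.size() / 2;
    std::thread left([&]() { std::sort(a.begin(), mid); });
    std::sort(mid, a.end());   // this thread takes the second half
    left.join();
    std::inplace_merge(a.begin(), mid, a.end());
}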

In our code we’ve created a new function called sortMT() that automatically engages all this machinery if it finds that the array has more than, say, a thousand elements. But now we have to go through the code changing sort() to sortMT() everywhere, because the non-threadable case is the default, even though it’s likely to be a very rare exception.

So why don’t we just make sort() use multiple threads, and establish a sortNonMT() function for these exceptional cases? Why is the default case the exceptional one, and not what we want to apply in the overwhelming majority of situations?

Good question. Maybe we’re not very confident that this multithreading thing is right yet, and this is how it shows.
