Christian's profileChristian's SpaceBlogLists Tools Help

Blog


    6/3/2008

    First experiments with Tasking in OpenMP 3.0

    OpenMP 3.0 is out, maybe a bit later than we hoped for, but I think that we got a solid standard document. At IWOMP 2008 a couple of weeks ago, there was an OpenMP tutorial which included a talk by Alex Duran (from UPC in Barcelona, Spain) on what is new in OpenMP 3.0 - which is really worth a look! My talk was on some OpenMP application experiences, including a case study on Windows, and I really think that many of our codes can profit from Tasks. Motivated by Alex' talk I tried the updated Nanos compiler and prepared a couple of examples for my lectures on Parallel Programming in Maastricht and Aachen. In this post I am walking through the simplest one: Computing the Fibonacci number in parallel.

    Before I dive into the code: If you really want to compute Fibonacci, use a different algorithmic approach. This version is taken just for the reason of simplicity... The recursive approach to compute the Fibonacci number n is:

    int main(int argc, char* argv[])
    {
       [...]
       fib(input);
       [...]
    }

    int fib(int n)
    {
       if (n < 2) return n;

       int x, y;
       x = fib(n - 1);
      
    y = fib(n - 2);
      
    return x+y;
    }

    Several approaches to parallelize such a recursive algorithm with OpenMP have been discussed in the past, including the use of Nested OpenMP. If you do it right, you will get the performance you are looking for, but for example the solution with Nested OpenMP typically requires vendor-specific control of the total number of threads created (OpenMP 3.0 has a solution for that as well). With Tasks in OpenMP 3.0, the code will look like this:

    int main(int argc, char* argv[])
    {
       [...]
    #pragma omp parallel
    {
    #pragma omp single
    {
       fib(input);
    }
    } /* end omp parallel */
       [...]
    }

    int fib(int n)
    {
       if (n < 2) return n;

       int x, y;
    #pragma omp task shared(x)
       x = fib(n - 1);
    #pragma omp task shared(y)
      
    y = fib(n - 2);
    #pragma omp taskwait
      
    return x+y;
    }

    Let me explain that a bit. The main() function now includes a Parallel Region with a perfectly nested Single construct. The Parallel Region will create the team of threads that will eventually execute all the Tasks. The Single construct has the effect, that only one thread will enter the fib() function, while the other threads will wait at the end of the Single construct. That single thread will then encounter the Task constructs, thus only one thread is creating the two tasks per recursion level. If we would not have had the Single construct, all the Team's threads would have entered fib() and each thread would have created two tasks - which is not the desired behavior. So the first lession is: Task is not a worksharing construct in the sense you know it.

    Inside fib(), I am writing shared(x) and shared(y). Isn't shared the default in OpenMP? Yes it is, but for the worksharing constructs of 2.5. With the Task construct, the following rules apply:

    • Static and Global variables are shared, Automatic Storage (= local) variables are private -> Well-known OpenMP scoping rules!
    • Orphaned Task variables are firstprivate
    • Non-orphaned Task variables inherit the shared attribute
    • -> Task Variables are firstprivate unless shared in the enclosing context

    Without shared(x, y), x and y would be firstprivate (because x and y are local in the Parallel Region and thus do not inherit the shared attribute) for the Task, thus a new instance would be created for each task and that would prohibit building the correct sum at each recursion level.

    And why do we need this taskwait here? This construct causes the encountering task to suspend until all child tasks are completed, but only direct childs and not descendants. We need this synchronization here, because otherwise x and y (local variables of automatic storage location!) might have gone out of scope before the created tasks have been executed.

    Ok, let's finally take a brief look at the performance. Again the disclaimer: This algorithm is bad. Besides that: If you compile this code squentially, it will run faster than with all the threads you can take - because in each recursion step only two substractions and one addition are computed, which is way cheaper than creating a task or a thread! But this code is so well-suited for explaining Tasking...

    speedup_fibonacci_tasks

    I am comparing three versions here:

    • omp_v1: This is the version as shown above. It does not scale well, why? Again, creating the tasks - that involves copying the data, putting the code+data package into a workqueue, scheduling the package onto a thread and finally the taskwait synchronization - is significantly more expensive than the actual computation. Once the tree of the spawned tasks has reached a certain size, this does not make sense anymore. This is addressed by version omp-v2.
    • omp_v2: I have used the OpenMP if-clause on the Task construct to not create additional tasks if (n < 30) to avoid some overhead. That means then the code is executed by the encountering task, but still the semantics stay the same. We got a speedup of slightly below 7 with 8 threads, but 7 is not 8! So I created version omp-v3.
    • omp_v3: This gives a speedup of about 8 with 8 threads. Here I used a C++ if branch to call a serial version of that function if (n < 30), which really avoids the overhead of the Task construct.

    If the actual compute function is more expensive (from a compute point of view), the difference between version omp-v2 and omp-v3 should vanish. I think there is a lot to say and explain about Tasks and many people started looking into this and bringing Tasking to the public, which is great! There is even more to say for this simple example, but for today I think this is enough, let me just give you the source of version omp-v3:

    int fib(int n)
    {
       if (n < 2) return n;
       if (n < 30) return (fib(n - 1) + fib(n - 2));

       int x, y;
    #pragma omp task shared(x)
       x = fib(n - 1);
    #pragma omp task shared(y)
      
    y = fib(n - 2);
    #pragma omp taskwait
      
    return x+y;
    }

    Technorati-Tags: ,,,

    Comments

    Please wait...
    Sorry, the comment you entered is too long. Please shorten it.
    You didn't enter anything. Please try again.
    Sorry, we can't add your comment right now. Please try again later.
    To add a comment, you need permission from your parent. Ask for permission
    Your parent has turned off comments.
    Sorry, we can't delete your comment right now. Please try again later.
    You've exceeded the maximum number of comments that can be left in one day. Please try again in 24 hours.
    Your account has had the ability to leave comments disabled because our systems indicate that you may be spamming other users. If you believe that your account has been disabled in error please contact Windows Live support.
    Complete the security check below to finish leaving your comment.
    The characters you type in the security check must match the characters in the picture or audio.
    Christian Terboven has turned off comments on this page.

    Trackbacks (4)

    The trackback URL for this entry is:
    http://terboven.spaces.live.com/blog/cns!EA3D3C756483FECB!316.trak
    Weblogs that reference this entry