Christian's profileChristian's SpaceBlogLists Tools Help

Blog


    8/15/2008

    More on Tasking in OpenMP 3.0

    We already played a lot with Tasking and are looking into how well it can be applied to *real* applications. In the “Introduction to Parallel Programming” lecture and also in the discussions following after some of my recent talks I noticed a strong interest in this new paradigm and I really hope that people will like it!

    I also heard users commenting that Tasking is missing some features, especially reduction-style operations. Well, this is probably true, and the language committee was kind of expecting this when the final 3.0 spec had been released. The intent was to get Tasking out and available by providing the most important set of features – exactly those that are required for building the higher levels ones. Personally, I (now) like this way! Sure, one could ask for more sophisticated task synchronization / ordering constructs and for things like reductions, but once a programmer has understood the Tasking concept I think he will be able to easily implement what is missing on it’s own – or realize that nothing is missing at all.

    There is a chance that an upcoming OpenMP 3.x standard might have reduction-style operations for tasks. Nevertheless, one can get this functionality already today and it is pretty easy, as you will see. I prepared an example (taken from a real application) of different strategies to get reduction-style operations with tasks. The example can be downloaded from my SkyDrive (omp-tasking-reduction.cpp), but I will discuss the code here as well. The idea is pretty simple: Each element of a self-defined dynamic linked list should be summed up to compute the global sum. This is not directly possible using Worksharing-constructs of OpenMP 2.5, as the number of list elements is unknown and thus the for-construct cannot be applied. As the task-construct of OpenMP 3.0 does not support reduction-style operations directly, some additional work is required to compute the global sum. Let’s take a look at the linked list first, which is rather self-explanatory:

    typedef int elem_t;       // data type
    
    template<class T>
    class CListItem {         // single element of a linked list
    public:
        CListItem() {
            this->next = 0;  
        } 
    
        T            value;   // member of "data type"
        CListItem    *next;   // pointer to next list element
    };
    
    typedef CListItem<elem_t>    item_t;  // list element with integer data type
    typedef item_t               *item_p; // pointer to list element

    Let’s also assume that we have a compute-intense function compute(double input), which returns a double and has to be called for each list element before the global sum can be computed. A serial approach computing the global sum would look like this:

    elem_t globalSum = (elem_t)0;        // global sum, please
    t1 = root; 
    
    // simply loop through the list
    while (t1 != 0)
    {
        t1->value = compute(t1->value);
        globalSum += t1->value;          // add each value to global sum
        t1 = t1->next;
    } 
    
    std::cout << "globalSum (serial): " << globalSum << std::endl; 

    In order to parallelize this with Tasking, we use the same approach as discussed in my previous post on OpenMP 3.0: Nesting a single-construct right in the Parallel Region. By doing so we end up with one thread running through the dynamic linked list and creating all the work tasks. The firstprivate(t1) clause is important, otherwise t1 would be shared and there would be a data race by using t1 inside and outside of the task without proper synchronization. We also need a Critical Region when updating the globalSum variable to prevent a data race again:

    elem_t globalSum = (elem_t)0;        // global sum, please
    t1 = root;
    
    #pragma omp parallel                 // create a team of threads
        {
    #pragma omp single                   // one thread only, please
            {
    
                while (t1 != 0)
                {
    
    #pragma omp task firstprivate(t1)    // copy t1 into the task
                    {                    // globalSum is shared here
                        t1->value = compute(t1->value);
    #pragma omp critical
                        {
                            globalSum += t1->value;   // avoid data races
                        }
                    }
    
                    t1 = t1->next;
                }
    
            } // omp single
        } // omp parallel
    
    std::cout << "globalSum (omp_tasking_v01): " << globalSum << std::endl;

    The downside of the approach presented above is obvious: There is one Critical Region for each list element, which could lead to a severe (negative) performance impact, depending on how compute-intense the compute() function is and on the number of list items. In general, one should avoid such a situation, in this case we can easily introduce an additional variable (here: helperSum) that we use to compute the global sum and make the globalSum variable firstprivate. That way each thread will have one private instance of globalSum and globalSum can be shared among the tasks. As each task can only be executed by one thread (no untied clause here), each thread will end up with a partial sum that can be summed up to the global sum after the iteration loop. The improvement is that we now only have as many Critical Regions executed as we have threads in the Team:

    elem_t globalSum = (elem_t)0;        // global sum, please
    elem_t helperSum = (elem_t)0;        // helper
    t1 = root;
    
    #pragma omp parallel firstprivate(globalSum)   // create a team of threads
        {
    #pragma omp single                   // one thread only, please
            {
    
                while (t1 != 0)
                {
    
    #pragma omp task firstprivate(t1) shared(globalSum)   // shared between tasks
                    {                                     // but private for threads
                        t1->value = compute(t1->value);
                        globalSum += t1->value;
                    }
    
                    t1 = t1->next;
                }
    
            } // omp single
    
    #pragma omp critical                 // compute global sum without data races
    {
        helperSum += globalSum;
    } // omp critical
    
        } // omp parallel
    
    globalSum = helperSum;
    std::cout << "globalSum (omp_tasking_v02): " << globalSum << std::endl;

    There are several approaches which are equivalent to the one discussed here. We could have declared helperSum as threadprivate, used helperSum instead of globalSum in the task to compute the partial sum and then build the final sum in the Critical Region similar as shown above. We also could have created an array to store one instance of globalSum for each thread and compute the final sum in a serial part after the Parallel Region. If you take a look at the OpenMP WIKI at Sun Microsystems, Yuan Lin is explaining this approach with a slightly different and simpler example, along with some other aspects of Tasking.

    But all these approaches do not help if you want to use untied tasks. If you add this clause to a task-construct, the OpenMP runtime is allowed to start the execution of a given task on thread A, break execution and later resume it on thread B. Obviously this will cause you problems when you use threadprivate data or thread-id-dependent accesses. But, there is a solution for that case:

    elem_t globalSum = (elem_t)0;        // global sum, please
    static elem_t helperSum = (elem_t)0;
    #pragma omp threadprivate(helperSum) // one instance for each thread
    t1 = root;
    
    #pragma omp parallel                 // create a team of threads
        {
    #pragma omp single                   // one thread only, please
            {
    
                while (t1 != 0)
                {
    
    #pragma omp task untied firstprivate(t1)
                    {
                        t1->value = compute(t1->value); // this is the compute-intense
                                                        // function, thread-switching
    #pragma omp task                                    // could happen - but not
                        {                               // here any more. thus there
                            helperSum += t1->value;     // will be computed a partial 
                        }                               // sum for each thread
                    }
    
                    t1 = t1->next;
                }
    
            } // omp single
    
    #pragma omp critical                 // compute global sum without data races
    {
        globalSum += helperSum;
    } // omp critical
    
        } // omp parallel
    
    std::cout << "globalSum (omp_tasking_v04): " << globalSum << std::endl;
    

    Nice, hm? Federico Massaioli came up with this “trick” on the omp-lang mailing list. While the outer task containing the compute-intense function is allowed to be interrupted and thread-switched because of the untied clause, the inner task updating the threadprivate variable is not!

    Ok, which version should you use? Please avoid critical regions when possible, so go for the second approach – unless you have to use thread-switching for performance, then you can use the approach discussed last. I do not (yet) have any precise performance experiments, as all Tasking-enabled compilers are still beta releases. By the way, Intel has opened the 11.0 beta program and Sun has released Sun Studio Express July 2008, both compilers offer Tasking support. Both compilers have individual advantages and disadvantages, but at the time of this writing the Sun compiler’s implementation of OpenMP 3.0 seems to be more stable, thus gets my recommendation!

    Technorati Tags: ,,,

    Comments

    Please wait...
    Sorry, the comment you entered is too long. Please shorten it.
    You didn't enter anything. Please try again.
    Sorry, we can't add your comment right now. Please try again later.
    To add a comment, you need permission from your parent. Ask for permission
    Your parent has turned off comments.
    Sorry, we can't delete your comment right now. Please try again later.
    You've exceeded the maximum number of comments that can be left in one day. Please try again in 24 hours.
    Your account has had the ability to leave comments disabled because our systems indicate that you may be spamming other users. If you believe that your account has been disabled in error please contact Windows Live support.
    Complete the security check below to finish leaving your comment.
    The characters you type in the security check must match the characters in the picture or audio.
    Christian Terboven has turned off comments on this page.

    Trackbacks (8)

    The trackback URL for this entry is:
    http://terboven.spaces.live.com/blog/cns!EA3D3C756483FECB!360.trak
    Weblogs that reference this entry