| Christian's profileChristian's SpaceBlogLists | Help |
|
Christian's SpaceWelcome to my personal blog on parallel programming (on Windows). My intent is to publish pieces of information that might be of interest to a larger audience, but would otherwise just hang around in my $HOME or get archived in our team's WIKI otherwise. 11/4/2008 My Blog has moved to terboven.wordpress.comSince about two years I am running this blog on Parallel Programming. As opposed to several friends and colleagues I liked Live Spaces, but with a growing number of readers the following three issues really bugged me:
Because of these reasons I did a brief evaluation phase and just now decided to switch to WordPress. Using Wei Wei's Live Space Mover I imported the last five blog posts (to have some content to be found by search engines) to this blog. I will leave this Live Space alive, but stop blogging to it and disable the comment option. Pleases visit my new blog at http://terboven.wordpress.com. I hope the people interested in my experiments and findings will follow. Online Video Coverage of PDC 2008Because of conflicting dates and my intent of keeping some scheduled events as they were planned for quite some time I decided against attending Microsoft’s PDC 2008 – which probably was the wrong decision :-(. The mainstream media is talking a lot about Windows Azure and Windows 7 - which are interesting news items for sure - but not closely related to my “business”, maybe except for the fact that Windows Server 2008 R2 will be capable of handling more cores than just 64. According to my opinion PDC 2008 brought us great news regarding the feature set of Visual Studio 2010 and Microsoft’s activities on multi-core (aka Shared-Memory parallel) programming and I hope to see and learn more about this during SC08 in about two weeks from now. Meanwhile you can do the following, if you are like me interesting in this stuff:
Technorati-Tags: PDC,Visual Studio 2010 11/2/2008 A performance tuning tale: Optimizing SMXV (sparse Matrix-Vector-Multiplication) on Windows [part 1 of 2]Since a while I am involved in several teaching activities on parallel programming and in my humble opinion this also includes talking about parallel computer architectures. As I am usually responsible for Shared-Memory parallel programming with OpenMP and TBB and the like, examples and exercises include learning about and tuning for the recent multi-core architectures we are using, namely Opteron-based and Xeon-based multi-socket systems. Well, understanding the perils of Shared-Memory parallel programming is not easy, but my impression is that several students are challenged when they are asked to carry the usual obstacles of parallel programming (e.g. load imbalance) forward to the context of different systems (e.g. UMA versus cc-NUMA). So this blog post has two goals: Examine and tune a sparse Matrix-Vector-Multiplication (SMXV) kernel on several architectures with (1) putting my oral explanations into text as a brief reference and (2) showing that one can do all the analysis and tuning work on Windows as well. From school you probably know how to do a Matrix-Vector-Multiplication for dense matrices. In the field of high performance technical computing, you typically have to deal with sparse linear algebra (unless you do a LINPACK :-) benchmark). In my example, the matrix is stored in CRS format and has the following structure: 01 #pragma omp parallel for private(sum, rowend, rowbeg, nz) How good is this parallelization for the matrix as shown above? Lets take a look at a two-socket quad-core Intel Xeon E5450-based system (3.0 GHz), Below, I am plotting the performance in MFLOP/s for one to eight threads using just the plain Debug configuration of Visual Studio 2008 in which OpenMP has been enabled:
In many cases, thread binding can be of some help to improve the performance. The result of thread binding is also shown as Debug w/ “scatter” binding – using this approach the threads are distributed over the machine as far away from each other as possible. For example with two threads, each thread is running on a separate socket. This strategy has the advantage of using the maximal possible cache size, but does not improve the performance significantly for this application (or: Windows is already doing a similarly good job with respect to thread binding). Nevertheless, I will use the scattered thread binding strategy in all following measurements. Now, what can we do? Let’s try compiler optimization: Switching to the Release configuration does not require any work from the user, but results in a pretty nice performance improvement. I usually enabled architecture-specific optimization as well (e.g. SSE-support is enabled in the ReleaseOpt configuration), but that does not result in any further performance improvement for this memory-bound application / benchmark. Anyway, as the compiler has optimized our code for example with respect to cache utilization, this also increases the performance when using more than one thread! In sequential execution (aka with one thread only) we get about 570 MFLOP/s. This is only a small fraction of the peak performance one core could deliver theoretically (1 core * 3 GHz * 4 instructions/sec = 12 GFLOP/s), but this is what you have to live with given the gap between CPU speed and memory speed. In order to improve the sequential performance, we would have to examine the matrix access pattern and re-arrange / optimize this with respect to the given cache hierarchy. But for now, I would rather like to think about the parallelization again: When you look at the matrix structure plot above, you will find that the density of nonzero elements is decreasing with the matrix rows counting. Our parallelization did not respect this, so we should expect to have a load imbalance limiting our parallelization. I used the Intel Thread Profiler (available on Windows as well as on Linux) to visualize this: The default for-loop scheduling in OpenMP is static (well, on all implementations I know), thus the iteration space is divided into as many chunks as we have threads, all of approximately equal size. So the first thread (T1 in the image above) gets the part of the matrix containing the more dense rows, thus it has more work to do than the other threads. Note: The reason why the Thread Profiler claims the threads two to four have “Barrier”-overhead instead of “Imbalance”-overhead is caused by my benchmark kernel, which looks slightly different than the code snippet above, but let’s ignore that differentiation here. So, what can we do about it? Right, OpenMP allows for pretty easy and efficient ways of influencing the for-loop scheduling strategy. We just have to extend the line 01 of the code snippet above to look like this: 01 #pragma omp parallel for private(sum, rowend, rowbeg, nz) schedule(guided)With guided scheduling, the initial chunks have an implementation-specific size which is decreased exponentially down to the chunksize specified, or 1 in our case. For the matrix with a structure as shown above, this results in a good load balance. So this is the performance we get including all optimization we discussed so far: We started with an non-optimized serial code delivering about 350 MFLOP/s and finished with a parallel code delivering about 1000 MFLOP/s! This is still far away from a linear scaling, but this is what you see in reality with complex (aka memory-bound) applications. Regarding these results, please note the following:
10/21/2008 Series of (German) HPC-related videos on Channel8Last week we had the Windows HPC Server 2008 launch event in Germany. Microsoft is doing a lot of marketing for this product, which one may like or dislike. However, HPC just recently got the attention by some Channel8 activists. Channel8 is a forum with (video) content produced by students meant for students, covering all sorts of Microsoft products and technologies as well as other interesting “hot” technologies topics. Following this link you will find five videos (as of October 21, 2008) on HPC (in German only). The first two videos are interviews with Wolfgang Nagel and Matthias Müller from the Center for Information Services and High Performance Computing at Dresden University covering HPC in general. The next two videos are interviews with Dieter an Mey and me from the Center for Computing and Communication at RWTH Aachen and we are talking about our experiences and activities with HPC on Windows. The fifth video shows excerpts of our interview again, but covers parallel programming with MPI and (especially) OpenMP in general.
9/22/2008 Debugging parallel programs with Visual Studio: MPI (using Allinea DDTLite)Just this week Allinea released it’s DDTLite plugin for Visual Studio 2008. I have been using a beta version for a couple of weeks now and in my humble opinion, DDTLite extends the MPI cluster debugger of Visual Studio 2008 with a must-have feature for any parallel debugger: Individual process control for MPI programs. With the capabilities provided by the MPI cluster debugger of Visual Studio, debugging MPI programs can be a pain as it is not possible to control MPI processes individually. That means if you select one process and execute it step-by-step, the other process will continue as well and there is no chance of stopping it from doing so (e.g. freezing as you can do with threads). This blog post is not intended to become an Allinea commercial, but I want to briefly demonstrate what DDTLite can do for you. In order to debug MPI programs, you have to go to the project properties, choose Debugging in the left column, and select the MPI Cluster Debugger as the debugger to launch. Additionally you have to provide the following options (listed below along with my advices):
That said, your configuration could look like this: If you then start the debugger (e.g. via F5), two MPI processes will be started and you can switch between them using the Processes window (you can enable that window via the menu: Debug –> Windows –> Processes): From the menu via Tools –> Options… –> Debugging (in the left column) you can set the option Break all processes when one process breaks to influence what happens when a breakpoint is encountered. For the case of debugging MPI programs, you probably want this option to be enabled! But – as already mentioned above – when all processes were interrupted after a breakpoint has been hit, you cannot continue with just one process step-by-step, as the other process will always do a step as well. And this is where DDTLite comes into play… After the plugin has been enabled (via the menu: Tools –> Add-in Manager…) you are presented with several additional windows, among this is the Selected Processes and Threads window to select and switch between processes and threads, as shown above. Via the Groups – Parallel View window you can select individual processes (in the screenshot above you can see that only the MPI process with rank 0, out of two MPI processes, is selected) and then control the selection (selecting a group of processes is possible as well) using the Visual Studio debugger as you do with a serial program. All MPI processes not currently selected stand still! There is more in DDTLite: For example you can select a variable and go to the Variable – Parallel View window to receive a list of variable values by MPI rank (the screenshot below shows the iMyRank member of a struct type named data, which denotes the MPI rank). Of course there are even more capabilities provided by DDTLite, but you can go to the product homepage and find out for yourself by grabbing a 30-day trial version (I used that trial to create the screenshots shown in this blog post). But I would like to add one additional note on the question of how many MPI processes you should use for debugging. Most parallel debuggers (including DDTLite and DDT) are advertised that they are capable of controlling hundreds (and even thousands) of MPI processes. I think that you will hardly ever need that! Instead, I bet that in 99% of the cases in which your MPI programs works fine with one and two processes but fails when using more, you will find the issue by using three or maybe five processes with your debugger. That is all you need for finding the usual off-by-one work-distribution error and similar things :-).
9/11/2008 C++0x: OpenMP loop parallelization without pragmas?!Some people are complaining that OpenMP’s approach of using pragmas to annotate a program is not very nice, as pragmas / the OpenMP directives are not well-integrated into the language. Personally, I like the OpenMP approach and think it has some specific advantages. But I am also very interested in researching how the OpenMP language bindings could be improved, especially for C++. This post is about using C++0x features to build parallelization constructs that have been praised in the context of other approaches (e.g. Parallel Extensions for C#, or Intel’s Threading Building Blocks), but using OpenMP constructs. Let’s consider the following sequential loop which is very similar to the example used in the Microsoft Parallel Extensions to the .NET Framework 3.5 (June 2008) documentation: 01 double dStart, dEnd; 02 for (int rep = 0; rep < iNumRepetitions; rep++) 03 { The experiment loop (line 05 to line 08) is executed iNumRepetitions times, the time is taken in line 04 and 09 using OpenMP time measurement functions (portability!), and the time required for each element can be controlled via iNumIterations. I will use that parametrization for my performance experiments – for now let’s just look at how this would be parallelized in OpenMP: #pragma omp parallel for shared(iNumElements, vec, iNumIterations) for (int i = 0; i < iNumElements; i++) { vec[i] = compute(vec[i], iNumIterations); } Pretty straight forward – as this parallel loop is perfectly balanced, we do not need the schedule clause here. How could that loop look like without using pragmas? Maybe as shown here: omp_pfor (0, iNumElements, [&](int i)
{
vec[i] = compute(vec[i], iNumIterations);
});
Do you like that? No OpenMP pragma is visible in the user’s code, he just has to specify the loop iteration space and the loop variable, the parallelization is done “under the hood”. The implementation of this is pretty simple using lambda functions of C++0x: template<typename F> void omp_pfor(int start, int end, F x) { #pragma omp parallel for for(int __i = start; __i < end; __i++) { x(__i); } } Of course I am still using OpenMP directives here, but they are hidden as an implementation detail. The actual loop body is passed as an argument to the omp_pfor lambda function, as well as the loop boundaries. Please note that this is just a very simple example, of course one can handle all types of loops that are currently supported in OpenMP 3.0 (any maybe even more) and STL-type algorithms! In this post I only talked about syntax, but there is more to it. A part of my research is looking into how programmers (especially from the background of computation engineering science in Aachen) can be provided with more powerful language-based tools to ease writing parallel and reusable code / components. I am always happy to discuss on such a topic – if you like the Live Space comment functionality as little as I do, just drop me a mail at christian@terboven.com. You can download this example code from my SkyDrive. In order to compile that code, I recommend using the latest Intel 11.0 beta compiler.
9/9/2008 Building and Using BOOST.MPI on Windows HPC Server 2008If you are a MPI programmer, you probably know that there are C++ bindings for MPI, which nowadays come with most MPI distributions. Personally, I find they are ugly and do not provide any advantage over using the plain C bindings. In addition, there are even some disadvantages in using the C++ bindings, as I encountered that they are causing problems for several MPI analysis tools (under some circumstances). If you are intending to use the MPI C++ bindings on Windows, you will find that MS-MPI does not come with them. So if you really need them, I would advise to use Intel MPI on Windows – but if you are interested in better C++ bindings for MPI, I would advise to take a look at the BOOST.MPI bindings. This brief blog post is intended to get you started building and using BOOST.MPI on Windows HPC Server 2008. I will provide just basic BOOST build instructions and point you to the “Getting Started on Windows”-guide at the BOOST homepage for more details. This is how we typically build BOOST on our systems, our student Christopher Schleiden figured out the nifty details:
Following this approach, the MPI bindings will be skipped! To enable the MPI library build process, you have to add --with-mpi to the bjam command line from above, or edit the user-config.jam file in subdirectory tools\build\v2 of your BOOST sources and add the line using mpi ; (notice the white spaces) to the bottom of that file - the latter is the approach preferred by me. If you are building on a Windows Compute Cluster 2003 machine, you will end up with the desired library. If you are building on a Windows HPC Server 2008 machine, you will receive the following error message: MPI auto-detection failed: unknown wrapper compiler mpic++ This is because the MPI configuration just looks for MS-MPI v1, which typically resides in directory C:\Program Files\Microsoft Compute Cluster Pack. In order to make the auto-configuration look for MS-MPI v2, you have to edit the file mpi.jam in subdirectory tools\build\v2\tools and replace line 235 with the following: local cluster_pack_path_native = "C:\\Program Files\\Microsoft HPC Pack 2008 SDK" ; Future versions of BOOST will probably look for the MS-MPI v2 automatically. Of course this works as well on a Windows HPC Server 2008 as on a workstation having the HPC Pack SDK installed (e.g. on my notebook running Vista 32-bit). Once the stage target is completed, you will find five additional files in your BOOST target directory:
You will find some examples for BOOST.MPI on page describing the BOOST.MPI bindings. In order to build those, you have to make the following changes to your project:
That’s all it takes! Note that if you have just compiled the release libraries of BOOST (as done in this example) and not the debug libraries, building your project using BOOST.MPI will fail (or cause you trouble), so for development purposes you should build both.
9/2/2008 Debugging parallel programs with Visual Studio: OpenMPI love Visual Studio (2008) for C++ development - and this is not the first time that I stated that, nor the only place. One of its specific strengths is the tight integration of the compiler and the debugger. In my opinion, VS is the best debugger for C++, as “it just works” in most cases, even with our most demanding applications. Sure, there are some issues (as with every debugger out there), but especially when it comes to working with multi-threaded codes VS has proven to be a pretty solid solution. This blog post will discuss the debugging capabilities for OpenMP programs written in C/C++ (well, and the obstacles you will encounter); a following post will discuss debugging MPI programs using the DDTlite Visual Studio plugin from Allinea as soon as the beta program goes public). For the sake of simplicity and just because I like it, I will take the Jacobian solver as an example program. The interesting part looks like follows: 01 while (data.iIterCount < data.iIterMax && residual > data.fTolerance) 02 { 03 residual = 0.0; 04 /* copy new solution into old */ 05 #pragma omp parallel 06 { 07 #pragma omp for 08 for (int j = 1; j < data.iRows - 1; j++) 09 { 10 for (int i = 1; i < data.iCols - 1; i++) 11 UOLD(j,i) = U(j,i); 12 } 13 14 /* compute stencil, residual and update */ 15 #pragma omp for reduction(+:residual) 16 for (int j = data.iRowFirst + 1; j <= data.iRowLast - 1; j++) 17 { 18 for (int i = 1; i < data.iCols - 1; i++) 19 { 20 /* … more code here …*/ 21 22 residual += fLRes * fLRes; 23 } 24 } 25 } /* end omp parallel */ 26 27 /* error check */ 28 data.iIterCount++; 29 residual = sqrt(residual) / (data.iCols * data.iRows); 30 } /* while */ Here we have one OpenMP Parallel Region with two Worksharing constructs in it, one of the Worksharing constructs has a reduction operation on it. There are two important features a parallel debugger has to provide for OpenMP:
In order to set the number of threads an OpenMP program should run with, you have several options. The most prominent one is to set the OMP_NUM_THREADS environment variable (right-click on the project, choose Properties, select Debugging from the Configuration Options in the left pane and add OMP_NUM_THREADS=value to the Environment field in the right pane). Other options are to add the num_threads() clause to the Parallel Region or use OpenMP’s API, but typically I prefer to use the environment variable. Obstacle: When you “approach” a Parallel Region in the debugger and select “Step Over (F10)” just at the line at which the Parallel Region begins (line 05 in my example above), the debugger will continue at the next executable line right after the Parallel Region (line 28 in my example above). Of course it will execute the Parallel Region, but it will not go through it line by line. If you select “Step Into (F11)”, it will ask for the file fork.cpp, as it tries to debug the OpenMP runtime creating the Team of threads. The Workaround is pretty simple: Just set a breakpoint on the first executable line inside the Parallel Region. Following the example above, you would set a breakpoint in line 08 and select “Step Over (F10)” (as you are probably not interested in debugging Microsoft’s OpenMP implementation and assumingly do not have the source to it). Once you arrived there, the Call Stack window will show you something pretty similar to this: Lets now examine what we have here, starting at the bottom of the list:
I wrote “When you switch to another thread”: There is a Thread window with which you can switch between the different threads. If you don’t have it active, you can get it via right-most button of the Debug toolbar (or via Ctrl+D,T): The Thread window is pretty straight-forward: It shows you which threads are current running and give some additional information on those. In the screenshot below you have two threads, namely the “Main Thread” (which is the Master in OpenMP) and one additional “Worker Thread”. It also indicates the current state and location of all threads. You can select a thread by just double-clicking it. Lets look at the first important feature: Control of individual threads. Once you are run into a breakpoint, the debugger will suspend all threads. And it will suspend them when the first thread has hit a breakpoint - the other threads might still be at a point before that (e.g. somewhere in the OpenMP runtime library so that you cannot select them). So far, so good, but when you select “Step Over (F10)” for instance, this can take a while and during that time span all threads are running. If you do not want that, you can right click on a thread (you can select multiple threads at once via the Ctrl key) and select “Freeze”. All “freezed” threads are marked with pause-symbol in the second column of the Threads window and the value in the Suspend column will be 1. Right-clicking on a thread and selecting “Thaw” will “unfreeze” it. Using this approach you can control each thread individually, which was the required capability. Please note: If an “unfreezed” thread encounters an OpenMP barrier it will only continue once all threads have reached that barrier. If you manually drag a thread past a barrier and any other thread encounters that barrier, it will wait there forever (or until you have dragged the other threads back in front of the barrier). An do not forget the implicit barriers at the end of any Worksharing constructs (e.g. in line 12 in the example above)! The second required feature was the examination of private variables. In order to examine this, create a breakpoint in line 22. The residual variable will have significantly different values right after the first few iterations of the j-loop. You can add this variable to a Watch window and switch between threads to examine the values, and Visual Studio correctly displays different values for different threads, which was the required capability. Besides that, you can of course use all the debugging capabilities you know from Visual Studio, such as editing the value of a variable. But to my feeling there is one thing missing: You cannot get a view of a private variable showing all its values from all different threads. This would clearly be a nice features, it can be found in Unix debuggers with dedicated support for OpenMP (e.g. DDT or TotalView). Hint: The “Show Threads in Source” option is a nice feature and once it is enabled, you get a better overview of the source code location at which the threads currently are (look out for two small wavelike lines in the breakpoint-column). You can enable this view by clicking on the second right-most button of the Debug toolbar.
Hint: You might try to add a thread-dependent condition to a breakpoint. You cannot do that in Visual Studio, but you can use the Filter option to achieve that functionality. Right-click on a breakpoint and select “Filter…”. The help text provided with the upcoming dialog window will guide you through setting thread-specific breakpoints. If you select “When Hit…”, you can access a thread’s id and name as well in the action to be taken. I guess that this is enough for a blog post and I hope it gave a compact overview of the OpenMP-related debugging capabilities provided by Visual Studio. In case you are interested in giving it a try but are looking for a simple OpenMP example program, you can grab the Jacobian solver mentioned here from my SkyDrive (C++-omp-jacobi.zip) with a pre-configured Visual Studio 2008 solution.
8/15/2008 More on Tasking in OpenMP 3.0We already played a lot with Tasking and are looking into how well it can be applied to *real* applications. In the “Introduction to Parallel Programming” lecture and also in the discussions following after some of my recent talks I noticed a strong interest in this new paradigm and I really hope that people will like it! I also heard users commenting that Tasking is missing some features, especially reduction-style operations. Well, this is probably true, and the language committee was kind of expecting this when the final 3.0 spec had been released. The intent was to get Tasking out and available by providing the most important set of features – exactly those that are required for building the higher levels ones. Personally, I (now) like this way! Sure, one could ask for more sophisticated task synchronization / ordering constructs and for things like reductions, but once a programmer has understood the Tasking concept I think he will be able to easily implement what is missing on it’s own – or realize that nothing is missing at all. There is a chance that an upcoming OpenMP 3.x standard might have reduction-style operations for tasks. Nevertheless, one can get this functionality already today and it is pretty easy, as you will see. I prepared an example (taken from a real application) of different strategies to get reduction-style operations with tasks. The example can be downloaded from my SkyDrive (omp-tasking-reduction.cpp), but I will discuss the code here as well. The idea is pretty simple: Each element of a self-defined dynamic linked list should be summed up to compute the global sum. This is not directly possible using Worksharing-constructs of OpenMP 2.5, as the number of list elements is unknown and thus the for-construct cannot be applied. As the task-construct of OpenMP 3.0 does not support reduction-style operations directly, some additional work is required to compute the global sum. Let’s take a look at the linked list first, which is rather self-explanatory: typedef int elem_t; // data type template<class T> class CListItem { // single element of a linked list public: CListItem() { this->next = 0; } T value; // member of "data type" CListItem *next; // pointer to next list element }; typedef CListItem<elem_t> item_t; // list element with integer data type typedef item_t *item_p; // pointer to list element Let’s also assume that we have a compute-intense function compute(double input), which returns a double and has to be called for each list element before the global sum can be computed. A serial approach computing the global sum would look like this: elem_t globalSum = (elem_t)0; // global sum, please t1 = root; // simply loop through the list while (t1 != 0) { t1->value = compute(t1->value); globalSum += t1->value; // add each value to global sum t1 = t1->next; } std::cout << "globalSum (serial): " << globalSum << std::endl; In order to parallelize this with Tasking, we use the same approach as discussed in my previous post on OpenMP 3.0: Nesting a single-construct right in the Parallel Region. By doing so we end up with one thread running through the dynamic linked list and creating all the work tasks. The firstprivate(t1) clause is important, otherwise t1 would be shared and there would be a data race by using t1 inside and outside of the task without proper synchronization. We also need a Critical Region when updating the globalSum variable to prevent a data race again: elem_t globalSum = (elem_t)0; // global sum, please t1 = root; #pragma omp parallel // create a team of threads { #pragma omp single // one thread only, please { while (t1 != 0) { #pragma omp task firstprivate(t1) // copy t1 into the task { // globalSum is shared here t1->value = compute(t1->value); #pragma omp critical { globalSum += t1->value; // avoid data races } } t1 = t1->next; } } // omp single } // omp parallel std::cout << "globalSum (omp_tasking_v01): " << globalSum << std::endl; The downside of the approach presented above is obvious: There is one Critical Region for each list element, which could lead to a severe (negative) performance impact, depending on how compute-intense the compute() function is and on the number of list items. In general, one should avoid such a situation, in this case we can easily introduce an additional variable (here: helperSum) that we use to compute the global sum and make the globalSum variable firstprivate. That way each thread will have one private instance of globalSum and globalSum can be shared among the tasks. As each task can only be executed by one thread (no untied clause here), each thread will end up with a partial sum that can be summed up to the global sum after the iteration loop. The improvement is that we now only have as many Critical Regions executed as we have threads in the Team: elem_t globalSum = (elem_t)0; // global sum, please elem_t helperSum = (elem_t)0; // helper t1 = root; #pragma omp parallel firstprivate(globalSum) // create a team of threads { #pragma omp single // one thread only, please { while (t1 != 0) { #pragma omp task firstprivate(t1) shared(globalSum) // shared between tasks { // but private for threads t1->value = compute(t1->value); globalSum += t1->value; } t1 = t1->next; } } // omp single #pragma omp critical // compute global sum without data races { helperSum += globalSum; } // omp critical } // omp parallel globalSum = helperSum; std::cout << "globalSum (omp_tasking_v02): " << globalSum << std::endl; There are several approaches which are equivalent to the one discussed here. We could have declared helperSum as threadprivate, used helperSum instead of globalSum in the task to compute the partial sum and then build the final sum in the Critical Region similar as shown above. We also could have created an array to store one instance of globalSum for each thread and compute the final sum in a serial part after the Parallel Region. If you take a look at the OpenMP WIKI at Sun Microsystems, Yuan Lin is explaining this approach with a slightly different and simpler example, along with some other aspects of Tasking. But all these approaches do not help if you want to use untied tasks. If you add this clause to a task-construct, the OpenMP runtime is allowed to start the execution of a given task on thread A, break execution and later resume it on thread B. Obviously this will cause you problems when you use threadprivate data or thread-id-dependent accesses. But, there is a solution for that case: elem_t globalSum = (elem_t)0; // global sum, please static elem_t helperSum = (elem_t)0; #pragma omp threadprivate(helperSum) // one instance for each thread t1 = root; #pragma omp parallel // create a team of threads { #pragma omp single // one thread only, please { while (t1 != 0) { #pragma omp task untied firstprivate(t1) { t1->value = compute(t1->value); // this is the compute-intense // function, thread-switching #pragma omp task // could happen - but not { // here any more. thus there helperSum += t1->value; // will be computed a partial } // sum for each thread } t1 = t1->next; } } // omp single #pragma omp critical // compute global sum without data races { globalSum += helperSum; } // omp critical } // omp parallel std::cout << "globalSum (omp_tasking_v04): " << globalSum << std::endl; Nice, hm? Federico Massaioli came up with this “trick” on the omp-lang mailing list. While the outer task containing the compute-intense function is allowed to be interrupted and thread-switched because of the untied clause, the inner task updating the threadprivate variable is not! Ok, which version should you use? Please avoid critical regions when possible, so go for the second approach – unless you have to use thread-switching for performance, then you can use the approach discussed last. I do not (yet) have any precise performance experiments, as all Tasking-enabled compilers are still beta releases. By the way, Intel has opened the 11.0 beta program and Sun has released Sun Studio Express July 2008, both compilers offer Tasking support. Both compilers have individual advantages and disadvantages, but at the time of this writing the Sun compiler’s implementation of OpenMP 3.0 seems to be more stable, thus gets my recommendation!
6/24/2008 Windows HPC at ISC08 in DresdenWhile writing this I am on my way back home from ISC08 in Dresden. This was the first time I went to ISC and in my opinion this conference provided a great opportunity to talk to hardware and software vendors alike and discuss some of our outstanding (technical) questions. We are NDAed, so I really do not intend to talk about the glory news I learned in our vendor meetings (the longer one stayed during the evening events, the more one could learn without an NDA :-) ), but I want to point out some news of the HPC on Windows world:
|
|||
|
|