Christian's profileChristian's SpaceBlogLists Tools Help

Christian Terboven

Occupation
Location
Computer Architecture. A Quantitative Approach
Die C++-Programmiersprache. Deutsche Übersetzung der Special Edition (Professionelle Programmierung)

Christian's Space

Welcome to my personal blog on parallel programming (on Windows). My intent is to publish pieces of information that might be of interest to a larger audience, but would otherwise just hang around in my $HOME or get archived in our team's WIKI otherwise.
11/4/2008

My Blog has moved to terboven.wordpress.com

Since about two years I am running this blog on Parallel Programming. As opposed to several friends and colleagues I liked Live Spaces, but with a growing number of readers the following three issues really bugged me:

  • Only people logged in with a Windows Live ID are allowed to comment. From the emails I got this stopped several people from commenting, but did not helped on my following complaint (as this functionality is intended assumingly).
  • Most of the comments I get are spam. While that is bad already, there is no function to authorize comments individually, thus I had to remove the spam manually, which is pretty tedious (even error-prone).
  • The advertising sometimes was inappropriate. I don't mind advertising on a free service, I also don't mind banners of dating services in general. But every now and then when I checked that a blog post is formatted correctly I found some inappropriate banners.

Because of these reasons I did a brief evaluation phase and just now decided to switch to WordPress. Using Wei Wei's Live Space Mover I imported the last five blog posts (to have some content to be found by search engines) to this blog. I will leave this Live Space alive, but stop blogging to it and disable the comment option.

Pleases visit my new blog at http://terboven.wordpress.com. I hope the people interested in my experiments and findings will follow.

Online Video Coverage of PDC 2008

Because of conflicting dates and my intent of keeping some scheduled events as they were planned for quite some time I decided against attending Microsoft’s PDC 2008 – which probably was the wrong decision :-(. The mainstream media is talking a lot about Windows Azure and Windows 7 - which are interesting news items for sure - but not closely related to my “business”, maybe except for the fact that Windows Server 2008 R2 will be capable of handling more cores than just 64.

According to my opinion PDC 2008 brought us great news regarding the feature set of Visual Studio 2010 and Microsoft’s activities on multi-core (aka Shared-Memory parallel) programming and I hope to see and learn more about this during SC08 in about two weeks from now. Meanwhile you can do the following, if you are like me interesting in this stuff:

  • Grab the Visual Studio 2010 and .NET Framework 4.0 CTP virtual machine image from this download site and play with it. From my experience with the Visual Studio 2008 beta program I know that although my laptop is dual-core and has 2 GB of memory, running the VM on that hardware does not make a lot of fun. I am so happy that just recently I virtually “found” a two-socket quad-core (Clovertown) machine that is not suited for production as it is a Intel Software Development test machine with a pre-series chipset and pre-series CPU. I released that machine instantly from doing some stupid software and system testing task :-).
  • Watch the PDC 2008 videos that are available online. Channel 9 has a section covering PDC 2008. That is nice if you are in front of a PC and connected to the Internet, but downloading all the interesting videos from that site is a pain. Greg Duncan has put together a site containing the download links to the videos (in various formats) and the PowerPoint slides here. Grep the content for “parallel” or “concur” or “Studio”!

Technorati-Tags: ,

11/2/2008

A performance tuning tale: Optimizing SMXV (sparse Matrix-Vector-Multiplication) on Windows [part 1 of 2]

Since a while I am involved in several teaching activities on parallel programming and in my humble opinion this also includes talking about parallel computer architectures. As I am usually responsible for Shared-Memory parallel programming with OpenMP and TBB and the like, examples and exercises include learning about and tuning for the recent multi-core architectures we are using, namely Opteron-based and Xeon-based multi-socket systems. Well, understanding the perils of Shared-Memory parallel programming is not easy, but my impression is that several students are challenged when they are asked to carry the usual obstacles of parallel programming (e.g. load imbalance) forward to the context of different systems (e.g. UMA versus cc-NUMA). So this blog post has two goals: Examine and tune a sparse Matrix-Vector-Multiplication (SMXV) kernel on several architectures with (1) putting my oral explanations into text as a brief reference and (2) showing that one can do all the analysis and tuning work on Windows as well.

From school you probably know how to do a Matrix-Vector-Multiplication for dense matrices. In the field of high performance technical computing, you typically have to deal with sparse linear algebra (unless you do a LINPACK :-) benchmark). In my example, the matrix is stored in CRS format and has the following structure:matrix_sparsity_plotThe CRS format stores just the nonzero elements of the matrix in three vectors: The val-vector contains the values of all nonzero elements, the col-vector has the same dimension as the val-vector and contains the column indices for each nonzero element, the row-vector is of the same length as there are rows in the matrix (+1) and points to the first nonzero element index (in val and col) for each matrix row. While there a several different format to save sparse matrices, the CRS format is well-suited for matrices without special properties and allows for an efficient implementation of the Matrix-Vector-Multiplication kernel. The intuitive approach for a parallel SMXV kernel may look as shown below. Let Aval, Acol and Arow be the vector-based implementations of val, col and row:

01  #pragma omp parallel for private(sum, rowend, rowbeg, nz)
02 for (i = 0; i < num_rows; i++){ 03 sum = 0.0; 04 rowend = Arow[i+1]; 05 rowbeg = Arow[i]; 06 for (nz=rowbeg; nz<rowend; ++nz)
07 { 08 sum += Aval[nz] * x[Acol[nz]]; 09 } 10 y[i] = sum; 11 }

How good is this parallelization for the matrix as shown above? Lets take a look at a two-socket quad-core Intel Xeon E5450-based system (3.0 GHz), Below, I am plotting the performance in MFLOP/s for one to eight threads using just the plain Debug configuration of Visual Studio 2008 in which OpenMP has been enabled:

smxv_intuitive_parallelizationThe speedup for two threads (about 1.7) is not too bad, but the best speedup of just 2.1 is achieved with eight threads. It does not pay off significantly to use more than four threads. This is because the Frontside Bus has an insuperable limit of about eight GB/s in total and using dedicated memory bandwidth benchmarks (e.g. STREAM) one can see that this limit can already be reached with four threads (sometimes even using just two threads). Since we are working with a sparse matrix, most accesses are quasi-random and neither the hardware prefetcher nor the compiler inserting prefetch instructions can help us any more.

In many cases, thread binding can be of some help to improve the performance. The result of thread binding is also shown as Debug w/ “scatter” binding – using this approach the threads are distributed over the machine as far away from each other as possible. For example with two threads, each thread is running on a separate socket. This strategy has the advantage of using the maximal possible cache size, but does not improve the performance significantly for this application (or: Windows is already doing a similarly good job with respect to thread binding). Nevertheless, I will use the scattered thread binding strategy in all following measurements. Now, what can we do? Let’s try compiler optimization:

smxv_compiler_optimization

Switching to the Release configuration does not require any work from the user, but results in a pretty nice performance improvement. I usually enabled architecture-specific optimization as well (e.g. SSE-support is enabled in the ReleaseOpt configuration), but that does not result in any further performance improvement for this memory-bound application / benchmark. Anyway, as the compiler has optimized our code for example with respect to cache utilization, this also increases the performance when using more than one thread!

In sequential execution (aka with one thread only) we get about 570 MFLOP/s. This is only a small fraction of the peak performance one core could deliver theoretically (1 core * 3 GHz * 4 instructions/sec = 12 GFLOP/s), but this is what you have to live with given the gap between CPU speed and memory speed. In order to improve the sequential performance, we would have to examine the matrix access pattern and re-arrange / optimize this with respect to the given cache hierarchy. But for now, I would rather like to think about the parallelization again: When you look at the matrix structure plot above, you will find that the density of nonzero elements is decreasing with the matrix rows counting. Our parallelization did not respect this, so we should expect to have a load imbalance limiting our parallelization. I used the Intel Thread Profiler (available on Windows as well as on Linux) to visualize this:

load_imbalance_ThreadProfiler_4threads

The default for-loop scheduling in OpenMP is static (well, on all implementations I know), thus the iteration space is divided into as many chunks as we have threads, all of approximately equal size. So the first thread (T1 in the image above) gets the part of the matrix containing the more dense rows, thus it has more work to do than the other threads. Note: The reason why the Thread Profiler claims the threads two to four have “Barrier”-overhead instead of “Imbalance”-overhead is caused by my benchmark kernel, which looks slightly different than the code snippet above, but let’s ignore that differentiation here.

So, what can we do about it? Right, OpenMP allows for pretty easy and efficient ways of influencing the for-loop scheduling strategy. We just have to extend the line 01 of the code snippet above to look like this:

01 #pragma omp parallel for private(sum, rowend, rowbeg, nz) schedule(guided)

With guided scheduling, the initial chunks have an implementation-specific size which is decreased exponentially down to the chunksize specified, or 1 in our case. For the matrix with a structure as shown above, this results in a good load balance. So this is the performance we get including all optimization we discussed so far:

smxv_binding_balancing

We started with an non-optimized serial code delivering about 350 MFLOP/s and finished with a parallel code delivering about 1000 MFLOP/s! This is still far away from a linear scaling, but this is what you see in reality with complex (aka memory-bound) applications. Regarding these results, please note the following:

  • We did not apply any dataset-specific optimization. That means if the matrix structure changes (which it does over the time of a program run in the application I took this benchmark from) we will still do well and not run into any new load balance. This is clearly an advantage of OpenMP over manual threading!
  • We did not apply any architecture-specific optimization. This code will deliver a reasonable performance on most machines. But we did not yet take a look at cc-NUMA machines (e.g AMD Opteron-based or Intel Nehalem-based systems), this will be done in part 2. On a cc-NUMA system, there is a lot of performance to win or to loose, depending on if you are doing everything right or making a mistake.
  • Was anything in here OS-specific? No, it wasn’t. I did the experiments on Windows, but could have done everything on Linux in exactly the same way. More on this in the next post as well…

10/21/2008

Series of (German) HPC-related videos on Channel8

Last week we had the Windows HPC Server 2008 launch event in Germany. Microsoft is doing a lot of marketing for this product, which one may like or dislike. However, HPC just recently got the attention by some Channel8 activists. Channel8 is a forum with (video) content produced by students meant for students, covering all sorts of Microsoft products and technologies as well as other interesting “hot” technologies topics.

Following this link you will find five videos (as of October 21, 2008) on HPC (in German only). The first two videos are interviews with Wolfgang Nagel and Matthias Müller from the Center for Information Services and High Performance Computing at Dresden University covering HPC in general. The next two videos are interviews with Dieter an Mey and me from the Center for Computing and Communication at RWTH Aachen and we are talking about our experiences and activities with HPC on Windows. The fifth video shows excerpts of our interview again, but covers parallel programming with MPI and (especially) OpenMP in general.

9/22/2008

Debugging parallel programs with Visual Studio: MPI (using Allinea DDTLite)

Just this week Allinea released it’s DDTLite plugin for Visual Studio 2008. I have been using a beta version for a couple of weeks now and in my humble opinion, DDTLite extends the MPI cluster debugger of Visual Studio 2008 with a must-have feature for any parallel debugger: Individual process control for MPI programs. With the capabilities provided by the MPI cluster debugger of Visual Studio, debugging MPI programs can be a pain as it is not possible to control MPI processes individually. That means if you select one process and execute it step-by-step, the other process will continue as well and there is no chance of stopping it from doing so (e.g. freezing as you can do with threads). This blog post is not intended to become an Allinea commercial, but I want to briefly demonstrate what DDTLite can do for you.

In order to debug MPI programs, you have to go to the project properties, choose Debugging in the left column, and select the MPI Cluster Debugger as the debugger to launch. Additionally you have to provide the following options (listed below along with my advices):

  • MPIRun Command: The location of mpirun. Specify the full path to the mpiexec program here, do not use “”, and do not omit the .exe extension.
  • MPIRun Arguments: Arguments to pass to mpirun, such as number of processes to start.
  • MPIShim Location: Location of mpishim.exe. As far as my experience goes, you avoid trouble if you copy mpishim.exe to a path that does not contain any white space (the original location is C:\Program Files\Microsoft Visual Studio 9.0\Common7\IDE\Remote Debugger\x86 on a 32-bit system), again do not omit the .exe extension.

That said, your configuration could look like this:

sshot_cluster_debugger_settings

If you then start the debugger (e.g. via F5), two MPI processes will be started and you can switch between them using the Processes window (you can enable that window via the menu: Debug –> Windows –> Processes):

sshot_debugger_processes_window

From the menu via Tools –> Options… –> Debugging (in the left column) you can set the option Break all processes when one process breaks to influence what happens when a breakpoint is encountered. For the case of debugging MPI programs, you probably want this option to be enabled! But – as already mentioned above – when all processes were interrupted after a breakpoint has been hit, you cannot continue with just one process step-by-step, as the other process will always do a step as well. And this is where DDTLite comes into play…

sshot_ddtlite_process_and_thread_selection

After the plugin has been enabled (via the menu: Tools –> Add-in Manager…) you are presented with several additional windows, among this is the Selected Processes and Threads window to select and switch between processes and threads, as shown above. Via the Groups – Parallel View window you can select individual processes (in the screenshot above you can see that only the MPI process with rank 0, out of two MPI processes, is selected) and then control the selection (selecting a group of processes is possible as well) using the Visual Studio debugger as you do with a serial program. All MPI processes not currently selected stand still!

There is more in DDTLite: For example you can select a variable and go to the Variable – Parallel View window to receive a list of variable values by MPI rank (the screenshot below shows the iMyRank member of a struct type named data, which denotes the MPI rank).

sshot_ddtlite_variable_parallel_view

Of course there are even more capabilities provided by DDTLite, but you can go to the product homepage and find out for yourself by grabbing a 30-day trial version (I used that trial to create the screenshots shown in this blog post). But I would like to add one additional note on the question of how many MPI processes you should use for debugging. Most parallel debuggers (including DDTLite and DDT) are advertised that they are capable of controlling hundreds (and even thousands) of MPI processes. I think that you will hardly ever need that! Instead, I bet that in 99% of the cases in which your MPI programs works fine with one and two processes but fails when using more, you will find the issue by using three or maybe five processes with your debugger. That is all you need for finding the usual off-by-one work-distribution error and similar things :-).

9/11/2008

C++0x: OpenMP loop parallelization without pragmas?!

Some people are complaining that OpenMP’s approach of using pragmas to annotate a program is not very nice, as pragmas / the OpenMP directives are not well-integrated into the language. Personally, I like the OpenMP approach and think it has some specific advantages. But I am also very interested in researching how the OpenMP language bindings could be improved, especially for C++. This post is about using C++0x features to build parallelization constructs that have been praised in the context of other approaches (e.g. Parallel Extensions for C#, or Intel’s Threading Building Blocks), but using OpenMP constructs.

Let’s consider the following sequential loop which is very similar to the example used in the Microsoft Parallel Extensions to the .NET Framework 3.5 (June 2008) documentation:

01   double dStart, dEnd;
02   for (int rep = 0; rep < iNumRepetitions; rep++)
03   {
04 dStart = omp_get_wtime(); 05 for (int i = 0; i < iNumElements; i++) 06 { 07 vec[i] = compute(vec[i], iNumIterations); 08 } 09 dEnd = omp_get_wtime();
10 }

The experiment loop (line 05 to line 08) is executed iNumRepetitions times, the time is taken in line 04 and 09 using OpenMP time measurement functions (portability!), and the time required for each element can be controlled via iNumIterations. I will use that parametrization for my performance experiments – for now let’s just look at how this would be parallelized in OpenMP:

#pragma omp parallel for shared(iNumElements, vec, iNumIterations)
        for (int i = 0; i < iNumElements; i++)
        {
            vec[i] = compute(vec[i], iNumIterations);
        }

Pretty straight forward – as this parallel loop is perfectly balanced, we do not need the schedule clause here. How could that loop look like without using pragmas? Maybe as shown here:

omp_pfor (0, iNumElements, [&](int i)
{
    vec[i] = compute(vec[i], iNumIterations);
});

Do you like that? No OpenMP pragma is visible in the user’s code, he just has to specify the loop iteration space and the loop variable, the parallelization is done “under the hood”. The implementation of this is pretty simple using lambda functions of C++0x:

template<typename F>
void omp_pfor(int start, int end, F x)
{
#pragma omp parallel for
    for(int __i = start; __i < end; __i++)
    {
        x(__i);
    }
}

Of course I am still using OpenMP directives here, but they are hidden as an implementation detail. The actual loop body is passed as an argument to the omp_pfor lambda function, as well as the loop boundaries. Please note that this is just a very simple example, of course one can handle all types of loops that are currently supported in OpenMP 3.0 (any maybe even more) and STL-type algorithms!

In this post I only talked about syntax, but there is more to it. A part of my research is looking into how programmers (especially from the background of computation engineering science in Aachen) can be provided with more powerful language-based tools to ease writing parallel and reusable code / components. I am always happy to discuss on such a topic – if you like the Live Space comment functionality as little as I do, just drop me a mail at christian@terboven.com.

You can download this example code from my SkyDrive. In order to compile that code, I recommend using the latest Intel 11.0 beta compiler.

9/9/2008

Building and Using BOOST.MPI on Windows HPC Server 2008

If you are a MPI programmer, you probably know that there are C++ bindings for MPI, which nowadays come with most MPI distributions. Personally, I find they are ugly and do not provide any advantage over using the plain C bindings. In addition, there are even some disadvantages in using the C++ bindings, as I encountered that they are causing problems for several MPI analysis tools (under some circumstances). If you are intending to use the MPI C++ bindings on Windows, you will find that MS-MPI does not come with them. So if you really need them, I would advise to use Intel MPI on Windows – but if you are interested in better C++ bindings for MPI, I would advise to take a look at the BOOST.MPI bindings.

This brief blog post is intended to get you started building and using BOOST.MPI on Windows HPC Server 2008. I will provide just basic BOOST build instructions and point you to the “Getting Started on Windows”-guide at the BOOST homepage for more details. This is how we typically build BOOST on our systems, our student Christopher Schleiden figured out the nifty details:

  • Download boost_1_36_0.zip (53.4 MB) and boost-jam-3.1.16-1-ntx86.zip (120 KB), which were the most current versions by the time of this writing (September 8, 2008).
  • Extract both archives, here I used X:\src\boost_1_36_0 as the destination path of the boost package itself and X:\src\bin as the destination path of the bjam tool.
  • Start a command shell and put the Visual Studio (2008) compiler in your path. You can easily get a suitable prompt via Start –> All Programs –> Microsoft Visual Studio 2008 –> Visual Studio Tools –> Visual Studio 2008 Command Prompt.
  • Put the bjam tool in your path, e.g. via set PATH=%PATH%;x:\src\bin.
  • Start the build process via X:\src\boost_1_36_0\boost_1_36_0> bjam --build-type=release --toolset=msvc --build-dir=x:\src\boost_1_36_0\build\90\32 --stagedir=x:\src\boost_1_36_0\stage\90-32 stage. This will take some time… A brief description on my options:
    • --build-type: You can choose between release and debug, or build both. Please take into account that the debug build will consume a significant amount of disc space.
    • --toolset: msvc stands for the Microsoft Visual Studio C/C++ compiler. On Windows you could also use the Intel C/C++ compiler and maybe even cygwin, but I never tried that.
    • --build-dir: In order to avoid trouble you should specify a directory to contain the intermediate files created during the build process. As we support Visual Studio version 2005 and 2008 for 32bit and 64bit targets, we created an appropriate naming scheme.
    • --stage-dir: This denotes the directory in which you intend to install boost.

Following this approach, the MPI bindings will be skipped! To enable the MPI library build process, you have to add --with-mpi to the bjam command line from above, or edit the user-config.jam file in subdirectory tools\build\v2 of your BOOST sources and add the line using mpi ; (notice the white spaces) to the bottom of that file - the latter is the approach preferred by me. If you are building on a Windows Compute Cluster 2003 machine, you will end up with the desired library. If you are building on a Windows HPC Server 2008 machine, you will receive the following error message:

MPI auto-detection failed: unknown wrapper compiler mpic++

This is because the MPI configuration just looks for MS-MPI v1, which typically resides in directory C:\Program Files\Microsoft Compute Cluster Pack. In order to make the auto-configuration look for MS-MPI v2, you have to edit the file mpi.jam in subdirectory tools\build\v2\tools and replace line 235 with the following:

local cluster_pack_path_native = "C:\\Program Files\\Microsoft HPC Pack 2008 SDK" ;

Future versions of BOOST will probably look for the MS-MPI v2 automatically. Of course this works as well on a Windows HPC Server 2008 as on a workstation having the HPC Pack SDK installed (e.g. on my notebook running Vista 32-bit). Once the stage target is completed, you will find five additional files in your BOOST target directory:

  • libboost_mpi-vc90-mt.lib
  • libboost_mpi-vc90-mt-1_36.lib
  • boost_mpi-vc90-mt.lib
  • boost_mpi-vc90-mt-1_36.dll
  • boost_mpi-vc90-mt-1_36.lib

You will find some examples for BOOST.MPI on page describing the BOOST.MPI bindings. In order to build those, you have to make the following changes to your project:

  • Add include path (C:\Program Files\Microsoft HPC Pack 2008 SDK\Include) and library path (C:\Program Files\Microsoft HPC Pack 2008 SDK\Lib\i386 for 32-bit applications or C:\Program Files\Microsoft HPC Pack 2008 SDK\Lib\amd64 for 64-bit applications) for MS-MPI v2 and add msmpi.lib to the linker input.
  • Add BOOST to your include path (in my example: X:\src\boost_1_36_0\boost_1_36_0) and to your library path (in my example: X:\src\boost_1_36_0\stage\90-32\lib).

That’s all it takes! Note that if you have just compiled the release libraries of BOOST (as done in this example) and not the debug libraries, building your project using BOOST.MPI will fail (or cause you trouble), so for development purposes you should build both.

9/2/2008

Debugging parallel programs with Visual Studio: OpenMP

I love Visual Studio (2008) for C++ development - and this is not the first time that I stated that, nor the only place. One of its specific strengths is the tight integration of the compiler and the debugger. In my opinion, VS is the best debugger for C++, as “it just works” in most cases, even with our most demanding applications. Sure, there are some issues (as with every debugger out there), but especially when it comes to working with multi-threaded codes VS has proven to be a pretty solid solution. This blog post will discuss the debugging capabilities for OpenMP programs written in C/C++ (well, and the obstacles you will encounter); a following post will discuss debugging MPI programs using the DDTlite Visual Studio plugin from Allinea as soon as the beta program goes public).

For the sake of simplicity and just because I like it, I will take the Jacobian solver as an example program. The interesting part looks like follows:

01         while (data.iIterCount < data.iIterMax && residual > data.fTolerance) 
02         {
03             residual = 0.0;
04             /* copy new solution into old */
05 #pragma omp parallel
06 {
07 #pragma omp for 
08             for (int j = 1; j < data.iRows - 1; j++)
09             {
10                 for (int i = 1; i < data.iCols - 1; i++)
11                     UOLD(j,i) = U(j,i);
12             }
13 
14             /* compute stencil, residual and update */
15 #pragma omp for reduction(+:residual)
16             for (int j = data.iRowFirst + 1; j <= data.iRowLast - 1; j++)
17             {
18                 for (int i = 1; i < data.iCols - 1; i++)
19                 {
20                     /* … more code here …*/
21 
22                     residual += fLRes * fLRes;
23                 }
24              }
25 } /* end omp parallel */
26 
27             /* error check */
28             data.iIterCount++;
29             residual = sqrt(residual) / (data.iCols * data.iRows);
30         } /* while */

Here we have one OpenMP Parallel Region with two Worksharing constructs in it, one of the Worksharing constructs has a reduction operation on it. There are two important features a parallel debugger has to provide for OpenMP:

  1. Control of individual threads. In OpenMP, the program execution starts with a single thread (named the Master or Initial thread). At each Parallel Region a Team of threads is created. As the thread creation is done by the OpenMP runtime, you (= the user) are not interested in this, but once all threads are created you are interested in controlling them individually!
  2. Examination of private variables. Each thread has its own copy of a private variable, thus the values can differ between threads. The debugger has to be able to display the different values for different threads!

In order to set the number of threads an OpenMP program should run with, you have several options. The most prominent one is to set the OMP_NUM_THREADS environment variable (right-click on the project, choose Properties, select Debugging from the Configuration Options in the left pane and add OMP_NUM_THREADS=value to the Environment field in the right pane). Other options are to add the num_threads() clause to the Parallel Region or use OpenMP’s API, but typically I prefer to use the environment variable.

Obstacle: When you “approach” a Parallel Region in the debugger and select “Step Over (F10)” just at the line at which the Parallel Region begins (line 05 in my example above), the debugger will continue at the next executable line right after the Parallel Region (line 28 in my example above). Of course it will execute the Parallel Region, but it will not go through it line by line. If you select “Step Into (F11)”, it will ask for the file fork.cpp, as it tries to debug the OpenMP runtime creating the Team of threads. The Workaround is pretty simple: Just set a breakpoint on the first executable line inside the Parallel Region.

Following the example above, you would set a breakpoint in line 08 and select “Step Over (F10)” (as you are probably not interested in debugging Microsoft’s OpenMP implementation and assumingly do not have the source to it). Once you arrived there, the Call Stack window will show you something pretty similar to this:

shot_debugging_openmp_callstack

Lets now examine what we have here, starting at the bottom of the list:

  • Three lines beginning with “jacobi-omp.exe”: This is our program code. jacobi-omp.exe!main() is the program entry point of my example. All lines that are below that are from initialization of the runtime library and initialization of static variables and kernel functions to setup the program context. In 99.9% of the cases you are using a debugger you can happily ignore those. Inside main() the function Jacobi() is called, which contains the code snippet shown above with the Parallel Region.
  • Four lines beginning with “vcomp90d.dll”: This is Microsoft’s OpenMP runtime library. VCOMP assumingly stands for Visual C/C++ OpenMP and 90 is the version number (Visual Studio 2008 is Visual Studio 9.0). Please note that many things I am telling you about this library are based on my experiments with it, as I do not have any additional sources than MSDN (which is pretty sparse regarding OpenMP details).
    • _vcomp_fork() + InvokeThreadTeam(): These functions create or awake the Team of threads for the current Parallel Region. You might have noticed the if_test variable that equals 1: OpenMP allows for an if() clause that can be added to a Parallel Region and if that expression evaluates to false, the Parallel Region will be executed with a Team of only one thread. You will find that this variable correlates to the if() clause.
    • _vcomp::ParallelRegion::HandlerThreadFunc() + _vcomp::fork_help(): These functions start the execution of the Parallel Region and take care of setting up any Worksharing constructs. You might have noticed the index variable that equals 0: It correlates to the OpenMP thread id. The Master thread has the id zero, the other Team members have positive ids starting with one.
  • jacobi-omp.exe!Jacobi$omp$1(): This name results from the way in which OpenMP is implemented. The code inside a Parallel Region is outlined into a new routine which is named accordingly, so here you have the first OpenMP Parallel Region of the Jacobi() function. This routine just contains your code and it called by all threads in the Team. When you switch to another thread, you will see that its user-part of the call stack just begins with such a routine, as its life begins somewhere in the OpenMP runtime. This is shown in the picture above for the second thread of a Team (see variable index).

shot_debugging_openmp_callstack_slave

I wrote “When you switch to another thread”: There is a Thread window with which you can switch between the different threads. If you don’t have it active, you can get it via right-most button of the Debug toolbar (or via Ctrl+D,T):

shot_debugging_toolbar 

The Thread window is pretty straight-forward: It shows you which threads are current running and give some additional information on those. In the screenshot below you have two threads, namely the “Main Thread” (which is the Master in OpenMP) and one additional “Worker Thread”. It also indicates the current state and location of all threads. You can select a thread by just double-clicking it.

shot_debugging_window_threads

Lets look at the first important feature: Control of individual threads. Once you are run into a breakpoint, the debugger will suspend all threads. And it will suspend them when the first thread has hit a breakpoint - the other threads might still be at a point before that (e.g. somewhere in the OpenMP runtime library so that you cannot select them). So far, so good, but when you select “Step Over (F10)” for instance, this can take a while and during that time span all threads are running. If you do not want that, you can right click on a thread (you can select multiple threads at once via the Ctrl key) and select “Freeze”. All “freezed” threads are marked with pause-symbol in the second column of the Threads window and the value in the Suspend column will be 1. Right-clicking on a thread and selecting “Thaw” will “unfreeze” it. Using this approach you can control each thread individually, which was the required capability.

Please note: If an “unfreezed” thread encounters an OpenMP barrier it will only continue once all threads have reached that barrier. If you manually drag a thread past a barrier and any other thread encounters that barrier, it will wait there forever (or until you have dragged the other threads back in front of the barrier). An do not forget the implicit barriers at the end of any Worksharing constructs (e.g. in line 12 in the example above)!

The second required feature was the examination of private variables. In order to examine this, create a breakpoint in line 22. The residual variable will have significantly different values right after the first few iterations of the j-loop. You can add this variable to a Watch window and switch between threads to examine the values, and Visual Studio correctly displays different values for different threads, which was the required capability. Besides that, you can of course use all the debugging capabilities you know from Visual Studio, such as editing the value of a variable.

But to my feeling there is one thing missing: You cannot get a view of a private variable showing all its values from all different threads. This would clearly be a nice features, it can be found in Unix debuggers with dedicated support for OpenMP (e.g. DDT or TotalView).

Hint: The “Show Threads in Source” option is a nice feature and once it is enabled, you get a better overview of the source code location at which the threads currently are (look out for two small wavelike lines in the breakpoint-column). You can enable this view by clicking on the second right-most button of the Debug toolbar.

shot_debugging_window_threads_showsource

 

Hint: You might try to add a thread-dependent condition to a breakpoint. You cannot do that in Visual Studio, but you can use the Filter option to achieve that functionality. Right-click on a breakpoint and select “Filter…”. The help text provided with the upcoming dialog window will guide you through setting thread-specific breakpoints. If you select “When Hit…”, you can access a thread’s id and name as well in the action to be taken.

I guess that this is enough for a blog post and I hope it gave a compact overview of the OpenMP-related debugging capabilities provided by Visual Studio. In case you are interested in giving it a try but are looking for a simple OpenMP example program, you can grab the Jacobian solver mentioned here from my SkyDrive (C++-omp-jacobi.zip) with a pre-configured Visual Studio 2008 solution.

8/15/2008

More on Tasking in OpenMP 3.0

We already played a lot with Tasking and are looking into how well it can be applied to *real* applications. In the “Introduction to Parallel Programming” lecture and also in the discussions following after some of my recent talks I noticed a strong interest in this new paradigm and I really hope that people will like it!

I also heard users commenting that Tasking is missing some features, especially reduction-style operations. Well, this is probably true, and the language committee was kind of expecting this when the final 3.0 spec had been released. The intent was to get Tasking out and available by providing the most important set of features – exactly those that are required for building the higher levels ones. Personally, I (now) like this way! Sure, one could ask for more sophisticated task synchronization / ordering constructs and for things like reductions, but once a programmer has understood the Tasking concept I think he will be able to easily implement what is missing on it’s own – or realize that nothing is missing at all.

There is a chance that an upcoming OpenMP 3.x standard might have reduction-style operations for tasks. Nevertheless, one can get this functionality already today and it is pretty easy, as you will see. I prepared an example (taken from a real application) of different strategies to get reduction-style operations with tasks. The example can be downloaded from my SkyDrive (omp-tasking-reduction.cpp), but I will discuss the code here as well. The idea is pretty simple: Each element of a self-defined dynamic linked list should be summed up to compute the global sum. This is not directly possible using Worksharing-constructs of OpenMP 2.5, as the number of list elements is unknown and thus the for-construct cannot be applied. As the task-construct of OpenMP 3.0 does not support reduction-style operations directly, some additional work is required to compute the global sum. Let’s take a look at the linked list first, which is rather self-explanatory:

typedef int elem_t;       // data type

template<class T>
class CListItem {         // single element of a linked list
public:
    CListItem() {
        this->next = 0;  
    } 

    T            value;   // member of "data type"
    CListItem    *next;   // pointer to next list element
};

typedef CListItem<elem_t>    item_t;  // list element with integer data type
typedef item_t               *item_p; // pointer to list element

Let’s also assume that we have a compute-intense function compute(double input), which returns a double and has to be called for each list element before the global sum can be computed. A serial approach computing the global sum would look like this:

elem_t globalSum = (elem_t)0;        // global sum, please
t1 = root; 

// simply loop through the list
while (t1 != 0)
{
    t1->value = compute(t1->value);
    globalSum += t1->value;          // add each value to global sum
    t1 = t1->next;
} 

std::cout << "globalSum (serial): " << globalSum << std::endl; 

In order to parallelize this with Tasking, we use the same approach as discussed in my previous post on OpenMP 3.0: Nesting a single-construct right in the Parallel Region. By doing so we end up with one thread running through the dynamic linked list and creating all the work tasks. The firstprivate(t1) clause is important, otherwise t1 would be shared and there would be a data race by using t1 inside and outside of the task without proper synchronization. We also need a Critical Region when updating the globalSum variable to prevent a data race again:

elem_t globalSum = (elem_t)0;        // global sum, please
t1 = root;

#pragma omp parallel                 // create a team of threads
    {
#pragma omp single                   // one thread only, please
        {

            while (t1 != 0)
            {

#pragma omp task firstprivate(t1)    // copy t1 into the task
                {                    // globalSum is shared here
                    t1->value = compute(t1->value);
#pragma omp critical
                    {
                        globalSum += t1->value;   // avoid data races
                    }
                }

                t1 = t1->next;
            }

        } // omp single
    } // omp parallel

std::cout << "globalSum (omp_tasking_v01): " << globalSum << std::endl;

The downside of the approach presented above is obvious: There is one Critical Region for each list element, which could lead to a severe (negative) performance impact, depending on how compute-intense the compute() function is and on the number of list items. In general, one should avoid such a situation, in this case we can easily introduce an additional variable (here: helperSum) that we use to compute the global sum and make the globalSum variable firstprivate. That way each thread will have one private instance of globalSum and globalSum can be shared among the tasks. As each task can only be executed by one thread (no untied clause here), each thread will end up with a partial sum that can be summed up to the global sum after the iteration loop. The improvement is that we now only have as many Critical Regions executed as we have threads in the Team:

elem_t globalSum = (elem_t)0;        // global sum, please
elem_t helperSum = (elem_t)0;        // helper
t1 = root;

#pragma omp parallel firstprivate(globalSum)   // create a team of threads
    {
#pragma omp single                   // one thread only, please
        {

            while (t1 != 0)
            {

#pragma omp task firstprivate(t1) shared(globalSum)   // shared between tasks
                {                                     // but private for threads
                    t1->value = compute(t1->value);
                    globalSum += t1->value;
                }

                t1 = t1->next;
            }

        } // omp single

#pragma omp critical                 // compute global sum without data races
{
    helperSum += globalSum;
} // omp critical

    } // omp parallel

globalSum = helperSum;
std::cout << "globalSum (omp_tasking_v02): " << globalSum << std::endl;

There are several approaches which are equivalent to the one discussed here. We could have declared helperSum as threadprivate, used helperSum instead of globalSum in the task to compute the partial sum and then build the final sum in the Critical Region similar as shown above. We also could have created an array to store one instance of globalSum for each thread and compute the final sum in a serial part after the Parallel Region. If you take a look at the OpenMP WIKI at Sun Microsystems, Yuan Lin is explaining this approach with a slightly different and simpler example, along with some other aspects of Tasking.

But all these approaches do not help if you want to use untied tasks. If you add this clause to a task-construct, the OpenMP runtime is allowed to start the execution of a given task on thread A, break execution and later resume it on thread B. Obviously this will cause you problems when you use threadprivate data or thread-id-dependent accesses. But, there is a solution for that case:

elem_t globalSum = (elem_t)0;        // global sum, please
static elem_t helperSum = (elem_t)0;
#pragma omp threadprivate(helperSum) // one instance for each thread
t1 = root;

#pragma omp parallel                 // create a team of threads
    {
#pragma omp single                   // one thread only, please
        {

            while (t1 != 0)
            {

#pragma omp task untied firstprivate(t1)
                {
                    t1->value = compute(t1->value); // this is the compute-intense
                                                    // function, thread-switching
#pragma omp task                                    // could happen - but not
                    {                               // here any more. thus there
                        helperSum += t1->value;     // will be computed a partial 
                    }                               // sum for each thread
                }

                t1 = t1->next;
            }

        } // omp single

#pragma omp critical                 // compute global sum without data races
{
    globalSum += helperSum;
} // omp critical

    } // omp parallel

std::cout << "globalSum (omp_tasking_v04): " << globalSum << std::endl;

Nice, hm? Federico Massaioli came up with this “trick” on the omp-lang mailing list. While the outer task containing the compute-intense function is allowed to be interrupted and thread-switched because of the untied clause, the inner task updating the threadprivate variable is not!

Ok, which version should you use? Please avoid critical regions when possible, so go for the second approach – unless you have to use thread-switching for performance, then you can use the approach discussed last. I do not (yet) have any precise performance experiments, as all Tasking-enabled compilers are still beta releases. By the way, Intel has opened the 11.0 beta program and Sun has released Sun Studio Express July 2008, both compilers offer Tasking support. Both compilers have individual advantages and disadvantages, but at the time of this writing the Sun compiler’s implementation of OpenMP 3.0 seems to be more stable, thus gets my recommendation!

Technorati Tags: ,,,

6/24/2008

Windows HPC at ISC08 in Dresden

While writing this I am on my way back home from ISC08 in Dresden. This was the first time I went to ISC and in my opinion this conference provided a great opportunity to talk to hardware and software vendors alike and discuss some of our outstanding (technical) questions. We are NDAed, so I really do not intend to talk about the glory news I learned in our vendor meetings (the longer one stayed during the evening events, the more one could learn without an NDA :-) ), but I want to point out some news of the HPC on Windows world:

  • Our HPC-Cluster that we installed in the beginning of 2008 made it to rank 100 in the current Top500 list of June 2008! We used the Windows HPC Server 2008 pre-beta2 on 256 dual-socket nodes (in total we had 2048 Xeon cores running at 3.0 GHz) and achieved 18.81 Terraflops. The theoretical peak performance of this machine is 24.58 Terraflops (= 256 nodes * 8 cores * 3.0 GHz * 4 results per cycle). Our LINPACK efficiency is about 77% - which is a bit too low for that machinery, but during the measurements we had issues with our InfiniBand fabric (hardware!). The Windows-Cluster at Umea University in Sweden reached an efficiency of about 86% (rank 39), so we aim to slightly improve our result until November 2008… The fastest Windows machine on earth ;-) is a Cluster at NCSA that reached 68.48 Terraflops reached (rank 23).
  • The Allinea people demoed their DDTlite for Windows, which is a plugin for Visual Studio 2008 (SP1) to enable MPI-debugging. Although it is possible to use plain Visual Studio 2005/2008 to debug MPI programs, you cannot control several processes individually unless you go the detach/attach way with multiple instances of VS running (we have a workshop on this in Aachen in a few weeks from now). With DDTlite you can control each MPI process individually, or you can create groups of selected processes to be controlled at once, and you have a nice (MPI) parallel stack view, and so on. I am really looking forward to their beta release!
  • Just recently the WinIB drivers in version 1.4.0 beta2 have been released. I talked to Xavier Pillons from Microsoft and he told me that they achieved less than 2.0 us latency with QDR InfiniBand. This is nice, but Linux is at about 1.0 us with QDR, so I hope that MSFT will improve with the final drivers.
  • I talked to people from Intel and PGI to improve the integration of their Fortran compilers in Visual Studio 2005/2008. Although I do not know of any nice Fortran IDE on Linux with parallel (OpenMP) debugging capabilities (so what you currently get is probably the best on the market), as a programmer grown up with C++ IDEs there are several things I feel are missing.
  • I was rather surprised finding our Case Study at the Microsoft booth, as I was unaware that it has already been published (we gave our approval to the text, but did not had seen the final formatting). The PDF is available here.
  • Several companies were demoing their products on Windows, for example Cluster Resources with their Moab Cluster Software Suite, or ANSYS Inc. with their flow solver product suite.