C++11 multithreading tutorial - part 2
Posted on February 27, 2012 by Sol

The code for this tutorial is on GitHub: https://github.com/sol-prog/threads.

In my last tutorial about using threads in C++11 we saw that the new C++11 thread syntax is remarkably clean compared with the POSIX pthreads syntax. Using a few simple concepts we were able to build a fairly complex image processing example while avoiding the subject of thread synchronization. In this second part of the introduction to multithreaded programming in C++11 we are going to see how we can synchronize a group of threads running in parallel.

We’ll start with a quick reminder of how we can create a group of threads in C++11. In the last tutorial we saw that we can store a group of threads in a classical C-style array. It is also entirely possible to store our threads in a std::vector, which is more in the spirit of C++11 and avoids the pitfalls of dynamic memory allocation with new and delete:

#include <iostream>
#include <thread>
#include <vector>

//This function will be called from a thread

void func(int tid) {
    std::cout << "Launched by thread " << tid << std::endl;
}

int main() {
    std::vector<std::thread> th;

    int nr_threads = 10;

    //Launch a group of threads
    for (int i = 0; i < nr_threads; ++i) {
        th.push_back(std::thread(func,i));
    }

    //Join the threads with the main thread
    for(auto &t : th){
        t.join();
    }

    return 0;
}

On Mac OS X Lion, the above program can be compiled with clang++ or with gcc-4.7 (gcc-4.7 was compiled from source):

clang++ -Wall -std=c++0x -stdlib=libc++ file_name.cpp

g++-4.7 -Wall -std=c++11 file_name.cpp

On a modern Linux system with gcc-4.6.x we can compile the code with:

g++ -std=c++0x -pthread file_name.cpp

Some real-life problems are embarrassingly parallel by nature and can be handled well with the simple syntax presented in the first part of this tutorial. Adding two arrays, multiplying an array by a scalar and generating the Mandelbrot set are classical examples of embarrassingly parallel problems.

Other problems by their nature require some level of synchronization between threads. Take for example the dot product of two vectors: take two vectors of equal length, multiply them element by element and add the result of each multiplication to a scalar variable. A naive parallelization of this problem is presented in the next code snippet:

#include <iostream>
#include <thread>
#include <vector>

...

void dot_product(const std::vector<int> &v1, const std::vector<int> &v2, int &result, int L, int R){
    for(int i = L; i < R; ++i){
        result += v1[i] * v2[i];
    }
}

int main(){
    int nr_elements = 100000;
    int nr_threads = 2;
    int result = 0;
    std::vector<std::thread> threads;

    //Fill two vectors with some constant values for a quick verification 
    // v1={1,1,1,1,...,1}
    // v2={2,2,2,2,...,2}
    // The result of the dot_product should be 200000 for this particular case
    std::vector<int> v1(nr_elements,1), v2(nr_elements,2);

    //Split nr_elements into nr_threads parts
    std::vector<int> limits = bounds(nr_threads, nr_elements);

    //Launch nr_threads threads:
    for (int i = 0; i < nr_threads; ++i) {
        threads.push_back(std::thread(dot_product, std::ref(v1), std::ref(v2), std::ref(result), limits[i], limits[i+1]));
    }


    //Join the threads with the main thread
    for(auto &t : threads){
        t.join();
    }

    //Print the result
    std::cout<<result<<std::endl;

    return 0;
}

The result of the above code should obviously be 200000; however, running the code a few times gives wrong results that also change from run to run:

sol $g++-4.7 -Wall -std=c++11 cpp11_threads_01.cpp
sol $./a.out
138832
sol $./a.out
138598
sol $./a.out
138032
sol $./a.out
140690
sol $

What happened? Look carefully at line 9 of the C++ code, result += v1[i] * v2[i];, where the variable result accumulates the products of v1[i] and v2[i]. Line 9 is a typical example of a race condition: the code runs in two parallel, asynchronous threads that both read and modify result without any coordination, so an update made by one thread can be overwritten by the other.

We can avoid this problem by making sure that only one thread at a time can modify this variable. For this we can use a mutex (short for mutual exclusion), a special-purpose object that acts like a lock: only one thread can hold it at any given moment, and the other threads wait at that point until it is released. This serializes the access to the code that modifies the result variable:

#include <iostream>
#include <thread>
#include <vector>
#include <mutex>

static std::mutex barrier;

...

void dot_product(const std::vector<int> &v1, const std::vector<int> &v2, int &result, int L, int R){
    int partial_sum = 0;
    for(int i = L; i < R; ++i){
        partial_sum += v1[i] * v2[i];
    }
    std::lock_guard<std::mutex> block_threads_until_finish_this_job(barrier);
    result += partial_sum;
}
...

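Line 6 creates a global mutex variable named barrier. On line 15, each thread, having finished its for loop, locks the mutex through a std::lock_guard before updating result; the lock is released automatically when the function returns. Notice that this time each thread accumulates into a locally declared variable, partial_sum, so the mutex is locked only once per thread instead of once per element. The rest of the code is unchanged.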

For this particular case we can actually find a simpler and more elegant solution: we can use an atomic type, a special kind of variable that allows safe concurrent reading and writing. Basically, the synchronization is done under the hood. As a side note, an atomic type supports only the atomic operations defined in the atomic header:

#include <iostream>
#include <thread>
#include <vector>
#include <atomic>

void dot_product(const std::vector<int> &v1, const std::vector<int> &v2, std::atomic<int> &result, int L, int R){
    int partial_sum = 0;
    for(int i = L; i < R; ++i){
        partial_sum += v1[i] * v2[i];
    }
    result += partial_sum;
}

int main(){
    int nr_elements = 100000;
    int nr_threads = 2;
    std::atomic<int> result(0);
    std::vector<std::thread> threads;

        ...

    return 0;
}

The atomic types and atomic operations are not available in Apple’s current clang++; however, you can use atomic types if you are willing to compile the latest clang++ from source, or you can use the latest gcc-4.7, also compiled from source.

If you are interested in learning more about the new C++11 syntax I would recommend reading Professional C++ (2nd edition) by M. Gregoire, N. A. Solter and S. J. Kleper or, if you are a C++ beginner, C++ Primer (5th Edition) by S. B. Lippman, J. Lajoie and B. E. Moo.
