C++11 multithreading tutorial
Posted on December 16, 2011 by Paul
The code for this tutorial is on GitHub: https://github.com/sol-prog/threads.
Perhaps one of the biggest changes to the language is the addition of multithreading support. Before C++11, it was possible to target multicore computers using OS facilities (pthreads on Unix-like systems) or libraries and APIs such as OpenMP and MPI.
This tutorial is meant to get you started with C++11 threads and not to be an exhaustive reference of the standard.
Creating and launching a thread in C++11 is as simple as including the <thread> header in your C++ source and constructing a std::thread object with the function you want to run. Let’s see how we can create a simple HelloWorld program with threads:
On Linux you can compile the above code with g++:
On a Mac with Xcode you can compile the above code with clang++:
On Windows you could use a commercial library, just::thread, to compile multithreaded code. Unfortunately they don’t supply a trial version of the library, so I wasn’t able to test it.
In a real-world application the “call_from_thread” function would do some work independently of the main function. In this particular code, the main function creates a thread and waits for it to finish at t1.join(). If you forget to join a thread, main may reach its end while the thread is still running; destroying a std::thread object that is still joinable calls std::terminate, so the program aborts regardless of whether “call_from_thread” has finished or not.
Compare the relative simplicity of the above code with equivalent code that uses POSIX threads:
Usually we will want to launch more than one thread at once and do some work in parallel. To do this we can create an array of threads instead of a single thread as in our first example. In the next example the main function creates a group of 10 threads that will do some work, and then waits for the threads to finish (there is also a POSIX version of this example in the GitHub repository for this article):
Remember that the main function is also a thread, usually named the main thread, so the above code actually runs 11 threads. This allows us to do some work in the main thread after we have launched the threads and before joining them; we will see this in an image processing example at the end of this tutorial.
What about using a function with parameters in a thread? C++11 lets us pass as many parameters as we need in the thread call. For example, we could modify the above code to receive an integer as a parameter (you can see the POSIX version of this example in the GitHub repository for this article):
The result of the above code on my system is:
You can see in the above result that threads do not run in any particular order once they are created. It is the programmer’s job to ensure that a group of threads won’t interfere with each other when modifying the same data. Also, the last lines are somewhat mangled because thread 4 hadn’t finished writing to stdout when thread 8 started. In fact, if you run the above code on your system you may get a completely different result, or even some mangled characters. This is because all 11 threads of this program compete for the same resource, which is stdout.
You can avoid some of the above problems by using locks in your code (std::mutex), which let you synchronize the way a group of threads shares a resource, or you could try to use separate data structures for your threads, if possible. We will talk about thread synchronization using atomic types and mutexes in the next tutorial.
In principle we have all we need in order to write more complex parallel code using only the above syntax.
In the next example I will try to illustrate the power of parallel programming by tackling a slightly more complex problem: removing the noise from an image with a blur filter. The idea is that we can reduce the noise in an image by replacing each pixel with some form of weighted average of the pixel and its neighbours.
This tutorial is not about optimal image processing, nor is the author an expert in this domain, so we will take a rather simple approach here. Our purpose is to illustrate how to write parallel code, not how to efficiently read/write images or convolve them with filters. For example, I’ve used the definition of the spatial convolution instead of the faster, but slightly more difficult to implement, convolution in the frequency domain using the Fast Fourier Transform.
For simplicity we will use a simple uncompressed image file format like PPM. Next we present the header file of a simple C++ class that allows you to read/write PPM images and to store them in memory as three arrays (for the R, G, B colours) of unsigned characters:
A possible way to structure our code is:
- Load an image to memory.
- Split the image into a number of chunks equal to the maximum number of threads supported by your system, e.g. on a quad-core computer we could use 8 threads.
- Launch number of threads - 1 (7 for a quad-core system); each one will process its chunk of the image.
- Let the main thread deal with the last chunk of the image.
- Wait until all threads have finished and join them with the main thread.
- Save the processed image.
Next we present the main function that implements the above algorithm (many thanks to wicked for suggesting some code improvements):
Please ignore the hard-coded name of the image file and the number of threads to launch; in a real-world application you should allow the user to enter these parameters interactively.
Now, in order to see parallel code at work we need to give it a significant amount of work; otherwise the overhead of creating and destroying threads will cancel out our effort to parallelize the code. The input image should be large enough to actually show an improvement in performance when the code is run in parallel. For this purpose I’ve used an image of 16000x10626 pixels, which occupies about 512 MB in PPM format:
I’ve added some noise to the above image in GIMP. The effect of the added noise can be seen in the next detail of the above picture:
Let’s see the above code in action:
As you can see from the above image, the noise level was reduced.
The results of running the last example code on a dual-core MacBook Pro from 2010 are presented in the next table:
On a dual-core machine this code achieves a perfect 2x speedup when run in parallel compared with running in serial mode (a single thread).
I’ve also tested the code on a quad-core Intel i7 machine with Linux. These are the results:
Apparently Apple’s clang++ is better at scaling a parallel program. However, this could be due to a combination of compiler and machine characteristics; it could also be because the MacBook Pro used for the tests has 8 GB of RAM versus only 6 GB for the Linux machine.
Read the second part of this tutorial: C++11 multithreading tutorial - part 2.
If you are interested in learning more about the new C++ syntax I would recommend reading Professional C++ (4th edition) by M. Gregoire, N. A. Solter, S. J. Kleper,
or, if you are a C++ beginner, you could read C++ Primer (5th Edition) by S. B. Lippman, J. Lajoie, B. E. Moo.
A good book for learning about C++11 multithreading support is C++ Concurrency in Action: Practical Multithreading by Anthony Williams.