Solarian Programmer

My programming ramblings

C++17 STL Parallel Algorithms - with GCC 9.1 and Intel TBB on Linux and macOS

Posted on May 9, 2019 by Paul

Thanks to Bartlomiej Filipek for the suggestion to try C++17 parallel algorithms with GCC 9 and Intel TBB on Linux.

GCC 9.1 has support for C++17 parallel algorithms by using the Intel TBB library. In this article, I will show you how to build Intel TBB from sources on your machine and how to sort a vector of random numbers in parallel using C++17 std::sort.

At the time of this writing, the latest stable version of Intel TBB is version 2019 Update 6. You can find the latest release of TBB on their GitHub repo. If a newer version was released, change the version number from the next instructions accordingly.

Let’s start by downloading Intel TBB:

1 # On Linux:
2 
3 wget https://github.com/intel/tbb/archive/2019_U6.tar.gz
4 tar zxvf 2019_U6.tar.gz
5 rm 2019_U6.tar.gz
6 
7 # On macOS
8 
9 curl -L https://github.com/intel/tbb/archive/2019_U6.tar.gz | tar xf -

Next, we are going to duplicate one of the existing build files and change the compiler from generic GCC to GCC 9.1:

 1 cd tbb-2019_U6
 2 
 3 # On Linux:
 4 
 5 cp build/linux.gcc.inc build/linux.gcc-9.1.inc
 6 nano build/linux.gcc-9.1.inc
 7 
 8 # On macOS
 9 
10 cp build/macos.gcc.inc build/macos.gcc-9.1.inc
11 nano build/macos.gcc-9.1.inc

At about line 15 you will find these two lines:

1 CPLUS ?= g++
2 CONLY ?= gcc

change them to:

1 CPLUS ?= g++-9.1
2 CONLY ?= gcc-9.1

save and close the file (with nano you need to press CTRL-X and answer Y when asked if you want to save it).

Next, we can build the library:

1 make compiler=gcc-9.1 stdver=c++17 tbb_build_prefix=my_tbb_build

Once the build is finished, you can install it with the next instructions:

 1 sudo mkdir /usr/local/tbb-2019_U6
 2 sudo cp -r include /usr/local/tbb-2019_U6/include
 3 sudo ln -s /usr/local/tbb-2019_U6/include/tbb /usr/local/include/tbb
 4 sudo cp -r build/my_tbb_build_release /usr/local/tbb-2019_U6/lib
 5 
 6 # On Linux:
 7 sudo ln -s /usr/local/tbb-2019_U6/lib/libtbb.so.2 /usr/local/lib/libtbb.so
 8 sudo ln -s /usr/local/tbb-2019_U6/lib/libtbbmalloc.so.2 /usr/local/lib/libtbbmalloc.so
 9 sudo ln -s /usr/local/tbb-2019_U6/lib/libtbbmalloc_proxy.so.2 /usr/local/lib/libtbbmalloc_proxy.so
10 echo 'export LD_LIBRARY_PATH=/usr/local/tbb-2019_U6/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
11 source ~/.bashrc
12 
13 # On macOS:
14 sudo ln -s /usr/local/tbb-2019_U6/lib/libtbb.dylib /usr/local/lib/libtbb.dylib
15 sudo ln -s /usr/local/tbb-2019_U6/lib/libtbbmalloc.dylib /usr/local/lib/libtbbmalloc.dylib
16 sudo ln -s /usr/local/tbb-2019_U6/lib/libtbbmalloc_proxy.dylib /usr/local/lib/libtbbmalloc_proxy.dylib
17 echo 'export DYLD_LIBRARY_PATH=/usr/local/tbb-2019_U6/lib:$DYLD_LIBRARY_PATH' >> ~/.bash_profile
18 source ~/.bash_profile

At this point, you should be able to use the C++17 parallel algorithms.

Here is a small C++17 test program, that will fill a vector with 5 millions random doubles, with values from 0 to 1. Next, we sort the vector sequentially and in parallel. Each test is repeated 10 times and the elapsed time, in milliseconds, is printed on the screen:

 1 #include <iostream>
 2 #include <vector>
 3 #include <random>
 4 #include <algorithm>
 5 #include <chrono>
 6 #include <execution>
 7 
 8 void printDuration(std::chrono::steady_clock::time_point start, std::chrono::steady_clock::time_point end, const char *message) {
 9         auto diff = end - start;
10         std::cout << message << ' ' << std::chrono::duration <double, std::milli> (diff).count() << " ms\n";
11 }
12 
13 template<typename T>
14 void test(const T &policy, const std::vector<double> &data, const int repeat, const char *message) {
15     for(int i = 0; i < repeat; ++i) {
16         std::vector<double> curr_data(data);
17 
18         const auto start = std::chrono::steady_clock::now();
19         std::sort(policy, curr_data.begin(), curr_data.end());
20         const auto end = std::chrono::steady_clock::now();
21         printDuration(start, end, message);
22     }
23     std::cout << '\n';
24 }
25 
26 int main() {
27     // Test samples and repeat factor
28     constexpr size_t samples{5'000'000};
29     constexpr int repeat{10};
30 
31     // Fill a vector with samples numbers
32     std::random_device rd;
33     std::mt19937_64 mre(rd());
34     std::uniform_real_distribution<double> urd(0.0, 1.0);
35 
36     std::vector<double> data(samples);
37     for(auto &e : data) {
38         e = urd(mre);
39     }
40 
41     // Sort data using different execution policies
42     std::cout << "std::execution::seq\n";
43     test(std::execution::seq, data, repeat, "Elapsed time");
44 
45     std::cout << "std::execution::par\n";
46     test(std::execution::par, data, repeat, "Elapsed time");
47 }

You can build the code, assuming that you saved the file as t0.cpp, by using one of the next lines:

1 g++-9.1 -std=c++17 -Wall -Wextra -pedantic t0.cpp -o t0 -ltbb
2 g++-9.1 -std=c++17 -Wall -Wextra -pedantic -O2 t0.cpp -o t0_opt -ltbb

in the above, t0_opt is optimized for speed, while t0 uses the default compiler settings.

Here is what I see on my machine (macOS MacBook Air 2018):

 1 ~/work $ g++-9.1 -std=c++17 -Wall -Wextra -pedantic t0.cpp -o t0 -ltbb
 2 ~/work $ ./t0
 3 std::execution::seq
 4 Elapsed time 2446.64 ms
 5 Elapsed time 2468.2 ms
 6 Elapsed time 2488.72 ms
 7 Elapsed time 2536.5 ms
 8 Elapsed time 2571.8 ms
 9 Elapsed time 2625.19 ms
10 Elapsed time 2636.66 ms
11 Elapsed time 2698.23 ms
12 Elapsed time 2726.94 ms
13 Elapsed time 2766.53 ms
14 
15 std::execution::par
16 Elapsed time 1296.66 ms
17 Elapsed time 1329.46 ms
18 Elapsed time 1320.61 ms
19 Elapsed time 1357.26 ms
20 Elapsed time 1474.66 ms
21 Elapsed time 1376.8 ms
22 Elapsed time 1366.08 ms
23 Elapsed time 1372.55 ms
24 Elapsed time 1390.08 ms
25 Elapsed time 1393.47 ms
26 
27 ~/work $
28 ~/work $ g++-9.1 -std=c++17 -Wall -Wextra -pedantic -O2 t0.cpp -o t0_opt -ltbb
29 ~/work $ ./t0_opt
30 std::execution::seq
31 Elapsed time 441.889 ms
32 Elapsed time 441.553 ms
33 Elapsed time 450.44 ms
34 Elapsed time 453.351 ms
35 Elapsed time 448.148 ms
36 Elapsed time 440.537 ms
37 Elapsed time 451.15 ms
38 Elapsed time 449.112 ms
39 Elapsed time 444.223 ms
40 Elapsed time 445.78 ms
41 
42 std::execution::par
43 Elapsed time 215.877 ms
44 Elapsed time 215.243 ms
45 Elapsed time 212.281 ms
46 Elapsed time 217.507 ms
47 Elapsed time 206.876 ms
48 Elapsed time 209.786 ms
49 Elapsed time 217.279 ms
50 Elapsed time 210.217 ms
51 Elapsed time 213.432 ms
52 Elapsed time 217.352 ms
53 
54 ~/work $ 

I’ve seen similar results on a Linux machine. The parallel, optimized version, is 2 - 2.5x faster than the serial version. For the non optimized version the difference between the parallel and the sequential version is slightly larger.


Show Comments