OpenMP on OS X
Posted on May 14, 2015 by Paul
At the time of this writing Clang, the default C++ compiler on OS X, doesn't have an OpenMP implementation. GCC, however, supports OpenMP 4, but you need to install it yourself if you want to use it. GCC 5 can be built from source for OS X if you want to have the latest and greatest, or you can install it with Homebrew.
OpenMP adds threading support to C or C++ code through pragmas. This has the advantage that you can take a serial code and parallelize it with minimal modifications.
As a side note, starting with C++11 you can use the thread header directly if you want to parallelize your code explicitly, and this approach is supported by Clang.
Let’s start with a simple C++14 function that calculates the sum of two vectors:
void vector_sum(std::vector<double> &sum, const std::vector<double> &va, const std::vector<double> &vb) {
    auto nr_elements = sum.size();
    for(decltype(nr_elements) i = 0; i < nr_elements; ++i) {
        sum[i] = va[i] + vb[i];
    }
}
We can parallelize the above code by adding a single line, the #pragma omp parallel for directive just before the loop, and including the omp.h header:
void parallel_sum(std::vector<double> &sum, const std::vector<double> &va, const std::vector<double> &vb) {
    auto nr_elements = sum.size();
    #pragma omp parallel for
    for(decltype(nr_elements) i = 0; i < nr_elements; ++i) {
        sum[i] = va[i] + vb[i];
    }
}
The rest of the code doesn't need to be changed.
We can measure the performance of the above code with the steady_clock from the chrono header:
// OpenMP parallel vector addition example
#include <vector>
#include <iostream>
#include <chrono>
#include <omp.h>

void parallel_sum(std::vector<double> &sum, const std::vector<double> &va, const std::vector<double> &vb);

int main() {
    auto nr_elements = 90'000'000;
    std::vector<double> sum(nr_elements), va(nr_elements, 1.0), vb(nr_elements, 2.0);

    auto start = std::chrono::steady_clock::now();
    parallel_sum(sum, va, vb);
    auto end = std::chrono::steady_clock::now();

    // Print the elapsed time
    auto time_diff = end - start;
    std::cout << std::chrono::duration<double, std::milli>(time_diff).count() << " ms" << std::endl;

    return 0;
}

// ... definition of parallel_sum from above
This code can be compiled with:
$ g++-5.1.0 -std=c++14 -pedantic -Wall test.cpp -o test -fopenmp
$ ./test
$
On a MacBook Air with a dual-core Intel i7 processor, the above code takes about 800 ms for a serial run versus 400 - 500 ms for a parallel run. If we use the -O3 optimization level the results get much closer: about 280 ms for a serial run versus about 170 ms for a parallel run. Please note that this example was chosen for the sake of simplicity; you will need a more CPU-intensive code if you want to see real differences between a serial and a parallel run.
If you want to learn more about OpenMP, I recommend reading Using OpenMP by B. Chapman, G. Jost and R. van der Pas.