Solarian Programmer

My programming ramblings

OpenMP on OS X

Posted on May 14, 2015 by Paul

At the time of this writing, Clang, the default C++ compiler on OS X, doesn’t have an OpenMP implementation. GCC, however, supports OpenMP 4, but you need to install it yourself if you want to use it. You can build GCC 5 from source on OS X if you want the latest and greatest, or you can install it with Homebrew.

OpenMP adds threading support to C or C++ code through pragmas. This has the advantage that you can take serial code and parallelize it with minimal modifications.

As a side note, starting with C++11, you can directly use the thread header if you want to explicitly parallelize your code, and this approach is supported by Clang.
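To make the side note concrete, here is a minimal sketch of what explicit parallelization with the thread header could look like for a vector sum. The function name threaded_sum, the nr_threads parameter and the chunking scheme are assumptions made for illustration; they are not part of the original post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not from the original post): adds va and vb into sum,
// splitting the index range into contiguous chunks, one chunk per thread.
void threaded_sum(std::vector<double> &sum, const std::vector<double> &va,
                  const std::vector<double> &vb, unsigned nr_threads = 0) {
	assert(va.size() == sum.size() && vb.size() == sum.size());
	if (nr_threads == 0) {
		// hardware_concurrency() may return 0 when the value is unknown
		nr_threads = std::max(1u, std::thread::hardware_concurrency());
	}
	const std::size_t nr_elements = sum.size();
	// Round up so that every element belongs to some chunk
	const std::size_t chunk = (nr_elements + nr_threads - 1) / nr_threads;

	std::vector<std::thread> workers;
	for (unsigned t = 0; t < nr_threads; ++t) {
		const std::size_t begin = std::min(nr_elements, t * chunk);
		const std::size_t end = std::min(nr_elements, begin + chunk);
		if (begin == end) break; // nothing left to distribute
		workers.emplace_back([&sum, &va, &vb, begin, end] {
			for (std::size_t i = begin; i < end; ++i) {
				sum[i] = va[i] + vb[i];
			}
		});
	}
	// Wait for all worker threads to finish before returning
	for (auto &w : workers) w.join();
}
```

Calling threaded_sum(sum, va, vb) picks the thread count from the hardware, while threaded_sum(sum, va, vb, 4) forces four threads. Depending on the platform you may need to add -pthread when compiling. Note how much bookkeeping is needed here compared with a single pragma.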

Let’s start with a simple C++14 function that calculates the sum of two vectors:

void vector_sum(std::vector<double> &sum, const std::vector<double> &va, const std::vector<double> &vb) {
	auto nr_elements = sum.size();
	for(decltype(nr_elements) i = 0; i < nr_elements; ++i) {
		sum[i] = va[i] + vb[i];
	}
}

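We can parallelize the above code by adding a single line of code, an OpenMP pragma placed right before the loop. The omp.h header is only required if you also call OpenMP runtime functions such as omp_get_num_threads: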

void parallel_sum(std::vector<double> &sum, const std::vector<double> &va, const std::vector<double> &vb) {
	auto nr_elements = sum.size();
	// Split the loop iterations among the available threads
	#pragma omp parallel for
	for(decltype(nr_elements) i = 0; i < nr_elements; ++i) {
		sum[i] = va[i] + vb[i];
	}
}

The rest of the code doesn’t need to be changed.

We can measure the performance of the above code with the steady_clock from the chrono header:

// OpenMP parallel vector addition example
#include <vector>
#include <iostream>
#include <chrono>
#include <omp.h>

void parallel_sum(std::vector<double> &sum, const std::vector<double> &va, const std::vector<double> &vb);

int main() {
	auto nr_elements = 90'000'000;
	std::vector<double> sum(nr_elements), va(nr_elements, 1.0), vb(nr_elements, 2.0);

	auto start = std::chrono::steady_clock::now();
	parallel_sum(sum, va, vb);
	auto end = std::chrono::steady_clock::now();

	// Print the elapsed time
	auto time_diff = end - start;
	std::cout << std::chrono::duration <double, std::milli> (time_diff).count() << " ms" << std::endl;

	return 0;
}

...

This code can be compiled with:

$ g++-5.1.0 -std=c++14 -pedantic -Wall test.cpp -o test -fopenmp
$ ./test
$
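If the -fopenmp flag is left out, GCC ignores the OpenMP pragmas and the program runs serially. The sketch below shows one way to check this from inside the program, relying on the standard _OPENMP macro, which a conforming compiler defines only when OpenMP is enabled. The helper names openmp_version and available_threads are made up for this illustration.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h> // only needed for the OpenMP runtime functions
#endif

// Hypothetical helper: returns the date (yyyymm) of the OpenMP specification
// implemented by the compiler, or 0 when the code was built without OpenMP.
long openmp_version() {
#ifdef _OPENMP
	return _OPENMP;
#else
	return 0;
#endif
}

// Hypothetical helper: upper bound on the number of threads a parallel region
// would use, or 1 when the code was built without OpenMP.
int available_threads() {
#ifdef _OPENMP
	const int n = omp_get_max_threads();
#else
	const int n = 1;
#endif
	assert(n >= 1);
	return n;
}
```

Printing openmp_version() and available_threads() at the start of main is a quick way to confirm that a timing run really used OpenMP. A result of 0 from openmp_version() means the pragmas were ignored.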

On a MacBook Air with a dual-core Intel i7 processor, for example, the above code takes about 800 ms for a serial run versus 400 to 500 ms for a parallel run. If we use the -O3 optimization level we get much closer results: about 280 ms for a serial run versus about 170 ms for a parallel run. Please note that this example was chosen for the sake of simplicity; you will need more CPU-intensive code if you want to see real differences between a serial and a parallel run.
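As a rough illustration of what a more CPU-intensive loop might look like, the sketch below does a fixed amount of arithmetic per element and combines the results with an OpenMP reduction clause. The function heavy_sum, the inner iteration count and the choice of std::sin are arbitrary assumptions for demonstration, and no timings are claimed for it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical workload (not from the original post): applies a small
// iterative computation to every element and returns the sum of the results.
double heavy_sum(const std::vector<double> &v) {
	assert(!v.empty());
	const std::size_t nr_elements = v.size();
	double total = 0.0;
	// Each thread accumulates a private partial sum; OpenMP adds them at the end
	#pragma omp parallel for reduction(+:total)
	for (std::size_t i = 0; i < nr_elements; ++i) {
		double x = v[i];
		for (int k = 0; k < 100; ++k) {
			x = std::sin(x) + 1.0; // arbitrary per-element work
		}
		total += x;
	}
	return total;
}
```

Because every element now costs many floating point operations instead of one addition, the loop is less likely to be limited by memory bandwidth. Keep in mind that the summation order differs between serial and parallel runs, so the results can differ slightly in the last digits.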

If you want to learn more about OpenMP, I would recommend reading Using OpenMP by B. Chapman, G. Jost and R. van der Pas.
