Solarian Programmer

My programming ramblings

Vector addition benchmark in C, C++ and Fortran

Posted on April 11, 2012 by Paul

If you were involved in working with large arrays and linear algebra, you’ve probably heard mantras like A Fortran implementation can usually achieve more performance than C or C++ or C++ is slow for scientific computations, don’t use it!. Until recently I honestly believed that, at least for elementary linear algebra operations, like vector addition, e.g. C can easily beat C++ (specifically that working with C-arrays should be faster than the more elaborate vector data structure from C++) and Fortran will beat both. Please be aware that I’m not talking about which language is inherently superior to another, but rather about what is the actual state of their implementation on personal computers running Linux and Windows.

In order to test the above myths I’ve implemented a few simple codes for vector, one dimensional arrays, addition in C, C++ and Fortran.

Adding two vectors is a simple operation you just need to pass through data only once and add element by element. A straightforward C implementation that adds 90 millions doubles is presented next:

 1 #include <stdio.h>
 2 #include <stdlib.h>
 3 #include <time.h>
 5 static size_t MM = 90000000;
 7 void add(double *a, double *b, double *c, size_t n){
 8     size_t i;
 9     for(i = 0; i < n; ++i) {
10         c[i] = a[i] + b[i];
11     }
12 }
14 int main() {
15     double *a, *b, *c;
16     size_t i;
17     clock_t start,end;
18     double diffs;
20     a = (double *) malloc(MM*sizeof(double));
21     b = (double *) malloc(MM*sizeof(double));
22     c = (double *) malloc(MM*sizeof(double));
25     for(i = 0; i < MM; ++i){
26         a[i] = 1.0/(double)(i+1);
27         b[i] = a[i];
28     }
30     start = clock();
31     add(a,b,c,MM);
32     end = clock();
33     diffs = (end - start)/(double)CLOCKS_PER_SEC;
35     printf("time = %lf\n",diffs);
37     free(a);
38     free(b);
39     free(c);
40     return 0;
41 }

The above code will simply report how much time it takes to add two 90 millions elements vectors on a given machine. If you have a C99 compatible compiler you can also use the restrict keyword:

1 void add(double *restrict a, double *restrict b, double *restrict c, size_t n){
2 ...
3 }

On a modern core i7 processor with 16 GB or RAM the above code takes about 0.6 seconds with no optimizations and about 0.35 seconds with maximum compiler optimization. There was practically no speed difference between the original code and the one with the restrict keyword.

Let’s see the equivalent C++ implementation with STL vectors instead of C-arrays:

 1 #include <iostream>
 2 #include <vector>
 3 #include <ctime>
 5 using namespace std;
 7 static size_t MM = 90000000;
 9 void add(const vector<double> &a, const vector<double> &b, vector<double> &c){
10     size_t n = a.size();
12     for(size_t i = 0; i < n; ++i) {
13         c[i] = a[i] + b[i];
14     }
15 }
17 int main() {
18     vector<double> a(MM), b(MM), c(MM);
20     for(size_t i = 0; i < MM; ++i){
21         a[i] = 1.0/(double)(i+1);
22         b[i] = a[i];
23     }
25     clock_t start = clock();
26     add(a,b,c);
27     clock_t end = clock();
28     double diffs = (end - start)/(double)CLOCKS_PER_SEC;
30     cout << "time = " << diffs << endl;
32     return 0;
33 }

The code is shorter and, at least to me, it looks cleaner. What about the performance ? Well … this code takes about 1.23 seconds with zero optimizations! More than double the time of the C version. However if you enable full compiler optimizations this will take 0.23 seconds!

Both tests were done on the same machine and same operating system (Windows 7 64 bits) with the same compiler (Visual Studio 2010), each test was done ten times and the results were averaged. Maybe Microsoft has a crappy C compiler, after all they have no support for C99 …

Repeating the above two tests with gcc-4.7 and g++-4.7 on a Linux machine (64 bits Linux Mint) I’ve obtained 0.755 seconds and 0.345 seconds (with optimizations) for the C version; versus 1.1 seconds and 0.22 seconds (with optimizations) for C++. Can you see a pattern here ?

Apparently the compiler, for both GCC and Visul Studio, does a better optimization job for C++ than for C.

In true C++11 spirit we can further modify the add function:

1 void add(const vector<double> &a, const vector<double> &b, vector<double> &c){
2     transform(a.begin(),a.end(),b.begin(),c.begin(),
3        [] (double a, double b) { return a + b; });
4 }

This time the code is slightly slower on Windows with Visual Studio 1.29 seconds and 0.25 seconds (with optimizations) versus Linux and gcc where the code takes about 2.23 seconds and 0.22 seconds (with optimizations).

What about Fortran ? First, I’ve used a trial version of Intel Fortran for both Linux and Windows, from past experiences I knew Intel Fortran is one of the fastest Fortran compilers available for Intel processors.

Next is a straightforward Fortran implementation of vector addition:

 1 subroutine add(a,b,c,n)
 2     integer :: n,i
 3     double precision :: a(n),b(n),c(n)
 5     do i = 1,n
 6         c(i) = a(i) + b(i)
 7     enddo
 8 end subroutine add
10 program test
11     integer,parameter :: MM = 90000000
12     double precision, allocatable :: a(:),b(:),c(:)
13     double precision :: start,finish
15     allocate(a(MM),b(MM),c(MM))
17     do i = 1,MM
18         a(i) = 1.0/dble(i)
19         b(i) = a(i)
20     enddo
22     call cpu_time(start)
23     call add(a,b,c,MM)
24     call cpu_time(finish)
26     print*,'time =',(finish-start)
28     deallocate(a,b,c)
29 end program test

If you are a modern Fortran priest you may notice that I’ve used an explicit loop for vector addition, I did this because, initially, I wanted to give a fair chance to C and C++. The above code scores, on Windows, about 0.66 seconds without compiler optimization which is way better than both C and C++ without optimizations. However, the optimized version takes about 0.29 seconds which is slightly worse than C++ but a much better score than C. The results on Linux are slightly worse.

In most modern Fortran books and tutorials you are advised to avoid the use of explicit loops for vector and array operations, the code is cleaner and the compiler can potentially use more optimizations for array operations. Having said that, we can replace the add subroutine with a single instruction:

 1 program test
 2     integer,parameter :: MM = 90000000
 3     double precision, allocatable :: a(:),b(:),c(:)
 4     double precision :: start,finish
 6     allocate(a(MM),b(MM),c(MM))
 8     do i = 1,MM
 9         a(i) = 1.0/dble(i)
10         b(i) = a(i)
11     enddo
13     call cpu_time(start)
14     c = a + b
15     call cpu_time(finish)
17     print*,'time =',(finish-start)
19     deallocate(a,b,c)
20 end program test

The above code is actually a bit slower slower than the one with the explicit do loop, it takes about 0.89 seconds and 0.29 seconds (with optimizations).

The six codes used for this benchmark are available on GitHub together with an Excel file that contains the recorded result for each run.

If you are interested in learning more about the new C++11 syntax I would recommend reading The C++ Programming Language by Bjarne Stroustrup.

or, Professional C++ by M. Gregoire, N. A. Solter, S. J. Kleper 4th edition:

If you need to brush your Fortran knowledge a good book is Modern Fortran Explained by M. Metcalf, J. Reid and M. Cohen:

Show Comments