Thursday, November 2, 2017
Floating Point Benchmark: C++ Language Added, Multiple Precision Arithmetic
I have posted a new edition of the floating point benchmark collection which adds the C++ language and compares the performance of four floating point implementations with different precisions: standard double (64 bit), long double (80 bit), GNU libquadmath (__float128, 128 bit), and the GNU MPFR multiple-precision library, tested at both 128 and 512 bit precision.

It is, of course, possible to compile the ANSI C version of the benchmark with a C++ compiler, as almost any ANSI C program is a valid C++ program, but this program is a complete rewrite of the benchmark algorithm in C++, using the features of the language as they were intended to improve the readability, modularity, and generality of the program. As with all versions of the benchmark, identical results are produced, to the last decimal place, and the results are checked against a reference to verify correctness.

This benchmark was developed to explore whether writing a program using the features of C++ imposes a speed penalty compared to the base C language, and also to explore the relative performance of four different implementations of floating point arithmetic and mathematical function libraries with different precisions. The operator overloading features of C++ make it possible to easily port code to multiple precision arithmetic libraries without the cumbersome and error-prone function calls such code requires in C.

The resulting program is object-oriented, with objects representing items such as spectral lines, surface boundaries in an optical assembly, a complete lens design, the trace of a ray of light through the lens, and an evaluation of the aberrations of the design compared to acceptable optical quality standards. Each object has methods which perform computation related to its contents. All floating point quantities in the program are declared as type Real, which is typedef-ed to the precision being tested, as illustrated in the sketch below.

The numbers supported by libquadmath and MPFR cannot be directly converted to strings by snprintf() format phrases, so when using these libraries auxiliary code is generated to use those packages' facilities for conversion to strings. In a run of the benchmark, which typically performs hundreds of thousands or millions of executions of the inner loop, this code only executes once, so it has negligible impact on run time.

I first tested the program with standard double arithmetic. As always, I do a preliminary run and time it, then compute an iteration count to yield a run time of around five minutes. I then perform five runs on an idle system, time them, and compute the mean run time. Next, the mean time is divided by the iteration count to compute microseconds per iteration. All tests were done with GCC/G++ 5.4.0. Comparing with a run of the ANSI C benchmark, the C++ time was 0.9392 of the C run time. Not only didn't we pay a penalty for using C++, we actually picked up around 6% in speed. Presumably, the cleaner structure of the code allowed the compiler to optimise a bit better, whereas the global variables in the original C program might have prevented some optimisations.

Next I tested with the long double data type, which uses the 80 bit internal representation of the Intel floating point unit, using the same iteration count as in the original double test. Here, the run time was 0.9636 that of C: still faster, and not that much longer than double. If the extra precision of long double makes a difference for your application, there's little cost in using it.
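As a concrete illustration of the structure described above, a sketch along the following lines shows how a single typedef selects the precision while the object methods remain unchanged. The class, member, and macro names here are hypothetical stand-ins, not code from the benchmark itself.

```cpp
#include <cstdio>

//  All floating point quantities are of type Real, typedef-ed to the
//  precision under test.  (The LONG_DOUBLE macro is an illustrative
//  assumption; the benchmark's own configuration may differ.)
#if defined(LONG_DOUBLE)
    typedef long double Real;
#else
    typedef double Real;
#endif

//  Hypothetical example of one of the benchmark's objects: a surface
//  boundary in an optical assembly, with a method which performs a
//  computation related to its contents.
class Surface {
public:
    Real curvature;             //  Curvature of the surface (1 / radius)
    Real indexOfRefraction;     //  Index of refraction of the following medium

    Surface(Real c, Real n) : curvature(c), indexOfRefraction(n) { }

    //  Paraxial optical power contributed by this surface for light
    //  arriving from a medium with index n1.  Only overloaded arithmetic
    //  operators are used, so the method compiles unchanged whichever
    //  type Real is defined as.
    Real power(Real n1) const {
        return (indexOfRefraction - n1) * curvature;
    }
};

int main() {
    Surface s(Real(0.02), Real(1.6164));
    std::printf("Surface power: %g\n", (double) s.power(Real(1.0)));
    return 0;
}
```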
Note that support for long double varies from compiler to compiler and architecture to architecture: whether it's available and, if so, what it means depends upon which compiler and machine you're using. These test results apply only to GCC on the x86 (actually x86_64) architecture.

GCC also provides a nonstandard data type, __float128, which implements 128 bit (quadruple precision) floating point arithmetic in software. The libquadmath library includes its own mathematical functions, whose names end in “q” (for example sinq instead of sin) and which must be called instead of the standard library functions, as well as a quadmath_snprintf function for editing numbers to strings. The benchmark contains conditional code and macro definitions to accommodate these changes (sketched below). This test was 31.0031 times slower than C. Here, we pay a heavy price for doing every floating point operation in software instead of using the CPU's built-in floating point unit. If you have an algorithm which requires this accuracy, it's important to perform the numerical analysis to determine where the accuracy is actually needed and employ quadruple precision only where necessary.

Finally, I tested the program using the GNU MPFR multiple-precision library, which is built atop the GMP package. I used the MPFR C++ bindings developed by Pavel Holoborodko, which overload the arithmetic operators and define versions of the mathematical functions that make integrating MPFR into a C++ program almost seamless. As with __float128, the output editing code must be rewritten to accommodate MPFR's toString() formatting mechanism. MPFR allows a user-selected precision and rounding mode. I always use the default round-to-nearest mode, but allow specifying the precision in bits by setting MPFR_PRECISION when the program is compiled.

I started with a precision of 128 bits, the same as __float128 above. The result was 189.72 times slower than C. The added generality of MPFR over __float128 comes at a steep price: clearly, if 128 bits suffices for your application, __float128 is the way to go.

Next, I wanted to see how run time scaled with precision. I rebuilt for 512 bit precision and reran the benchmark. Now we're 499.865 times slower than C, almost exactly 1/500 the speed. This is great to have if you really need it, but you'd be wise to use it sparingly.

The program produced identical output for all choices of floating point precision. By experimentation, I determined that I could reduce MPFR_PRECISION to as low as 47 bits without getting errors in the least significant digits of the results. At 46 bits and below, errors start to creep in.
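Here is a sketch of the kind of conditional glue described above, mapping the mathematical functions and the number-to-string editing onto libquadmath or MPFR. The macro names (USE_QUADMATH, USE_MPFR) and the editReal() helper are illustrative assumptions, not code taken from the benchmark source.

```cpp
#include <cstdio>
#include <string>

#if defined(USE_QUADMATH)
    #include <quadmath.h>               //  link with -lquadmath
    typedef __float128 Real;
    #define Sin(x)   sinq(x)
    #define Sqrt(x)  sqrtq(x)

    //  snprintf() cannot edit a __float128, so use quadmath_snprintf()
    static std::string editReal(Real v) {
        char buf[128];
        quadmath_snprintf(buf, sizeof buf, "%.20Qe", v);
        return std::string(buf);
    }
#elif defined(USE_MPFR)
    #include "mpreal.h"                 //  MPFR C++ bindings; link with -lmpfr -lgmp
    typedef mpfr::mpreal Real;
    #define Sin(x)   mpfr::sin(x)
    #define Sqrt(x)  mpfr::sqrt(x)

    //  Use the mpreal class's own toString() formatting facility
    static std::string editReal(const Real &v) {
        return v.toString();
    }
#else
    #include <cmath>
    typedef double Real;
    #define Sin(x)   std::sin(x)
    #define Sqrt(x)  std::sqrt(x)

    static std::string editReal(Real v) {
        char buf[64];
        std::snprintf(buf, sizeof buf, "%.11f", v);
        return std::string(buf);
    }
#endif

int main() {
#if defined(USE_MPFR)
    mpfr::mpreal::set_default_prec(MPFR_PRECISION);  //  precision in bits, set at compile time
#endif
    Real x = Sqrt(Real(2));
    std::printf("%s\n", editReal(Sin(x)).c_str());
    return 0;
}
```

A libquadmath build would link with -lquadmath, and an MPFR build with -lmpfr -lgmp; the plain double path needs nothing beyond the standard library.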
The relative performance of the various language implementations (with C taken as 1) is as follows. All language implementations of the benchmark listed below produced identical results to the last (11th) decimal place.

| Language | Relative Time | Details |
|---|---|---|
| C | 1 | GCC 3.2.3 -O3, Linux |
| JavaScript | 0.372 | Mozilla Firefox 55.0.2, Linux |
|  | 0.424 | Safari 11.0, MacOS X |
|  | 1.334 | Brave 0.18.36, Linux |
|  | 1.378 | Google Chrome 61.0.3163.91, Linux |
|  | 1.386 | Chromium 60.0.3112.113, Linux |
|  | 1.495 | Node.js v6.11.3, Linux |
| Chapel | 0.528 | Chapel 1.16.0, -fast, Linux |
|  | 0.0314 | Parallel, 64 threads |
| Visual Basic .NET | 0.866 | All optimisations, Windows XP |
| C++ | 0.939 | G++ 5.4.0, -O3, Linux, double |
|  | 0.964 | long double (80 bit) |
|  | 31.00 | __float128 (128 bit) |
|  | 189.7 | MPFR (128 bit) |
|  | 499.9 | MPFR (512 bit) |
| FORTRAN | 1.008 | GNU Fortran (g77) 3.2.3 -O3, Linux |
| Pascal | 1.027 | Free Pascal 2.2.0 -O3, Linux |
|  | 1.077 | GNU Pascal 2.1 (GCC 2.95.2) -O3, Linux |
| Swift | 1.054 | Swift 3.0.1, -O, Linux |
| Rust | 1.077 | Rust 0.13.0, --release, Linux |
| Java | 1.121 | Sun JDK 1.5.0_04-b05, Linux |
| Visual Basic 6 | 1.132 | All optimisations, Windows XP |
| Haskell | 1.223 | GHC 7.4.1 -O2 -funbox-strict-fields, Linux |
| Scala | 1.263 | Scala 2.12.3, OpenJDK 9, Linux |
| Ada | 1.401 | GNAT/GCC 3.4.4 -O3, Linux |
| Go | 1.481 | Go version go1.1.1 linux/amd64, Linux |
| Simula | 2.099 | GNU Cim 5.1, GCC 4.8.1 -O2, Linux |
| Lua | 2.515 | LuaJIT 2.0.3, Linux |
|  | 22.7 | Lua 5.2.3, Linux |
| Python | 2.633 | PyPy 2.2.1 (Python 2.7.3), Linux |
|  | 30.0 | Python 2.7.6, Linux |
| Erlang | 3.663 | Erlang/OTP 17, emulator 6.0, HiPE [native, {hipe, [o3]}] |
|  | 9.335 | Byte code (BEAM), Linux |
| ALGOL 60 | 3.951 | MARST 2.7, GCC 4.8.1 -O3, Linux |
| PL/I | 5.667 | Iron Spring PL/I 0.9.9b beta, Linux |
| Lisp | 7.41 | GNU Common Lisp 2.6.7, Compiled, Linux |
|  | 19.8 | GNU Common Lisp 2.6.7, Interpreted |
| Smalltalk | 7.59 | GNU Smalltalk 2.3.5, Linux |
| Ruby | 7.832 | Ruby 2.4.2p198, Linux |
| Forth | 9.92 | Gforth 0.7.0, Linux |
| Prolog | 11.72 | SWI-Prolog 7.6.0-rc2, Linux |
|  | 5.747 | GNU Prolog 1.4.4, Linux (limited iterations) |
| COBOL | 12.5 | Micro Focus Visual COBOL 2010, Windows 7 |
|  | 46.3 | Fixed decimal instead of computational-2 |
| Algol 68 | 15.2 | Algol 68 Genie 2.4.1 -O3, Linux |
| Perl | 23.6 | Perl v5.8.0, Linux |
| QBasic | 148.3 | MS-DOS QBasic 1.1, Windows XP Console |
| Mathematica | 391.6 | Mathematica 10.3.1.0, Raspberry Pi 3, Raspbian |