« October 28, 2017 | Main | November 6, 2017 »

Thursday, November 2, 2017

Floating Point Benchmark: C++ Language Added, Multiple Precision Arithmetic

I have posted a new edition of the floating point benchmark collection which adds the C++ language and compares the performance of four floating point implementations with different precisions: standard double (64 bit), long double (80 bit), GNU libquadmath (__float128, 128 bit), and the GNU MPFR multiple-precision library, tested at both 128 and 512 bit precision.

It is, of course, possible to compile the ANSI C version of the benchmark with a C++ compiler, as almost any ANSI C program is a valid C++ program, but this program is a complete rewrite of the benchmark algorithm in C++, using the features of the language as they were intended to improve the readability, modularity, and generality of the program. As with all versions of the benchmark, identical results are produced, to the last decimal place, and the results are checked against a reference to verify correctness.

This benchmark was developed to explore whether writing a program using the features of C++ imposed a speed penalty compared to the base C language, and also to explore the relative performance of four different implementations of floating point arithmetic and mathematical function libraries, with different precision. The operator overloading features of C++ make it possible to easily port code to multiple precision arithmetic libraries without the cumbersome and error-prone function calls such code requires in C.

The resulting program is object-oriented, with objects representing items such as spectral lines, surface boundaries in an optical assembly, a complete lens design, the trace of a ray of light through the lens, and an evaluation of the aberrations of the design compared to acceptable optical quality standards. Each object has methods which perform computation related to its contents. All floating point quantities in the program are declared as type Real, which is typedef-ed to the precision being tested.

The numbers supported by libquadmath and MPFR cannot be directly converted to strings by snprintf() format phrases, so when using these libraries auxiliary code is generated to use those packages' facilities for conversion to strings. In a run of the benchmark which typically runs hundreds of thousands or millions of executions of the inner loop, this code only executes once, so it has negligible impact on run time.

I first tested the program with standard double arithmetic. As always, I do a preliminary run and time it, then compute an iteration count to yield a run time of around five minutes. I then perform five runs on an idle system, time them, and compute the mean run time. Next, the mean time is divided by the iteration count to compute microseconds per iteration. All tests were done with GCC/G++ 5.4.0.

Comparing with a run of the ANSI C benchmark, the C++ time was 0.9392 of the C run time. Not only didn't we pay a penalty for using C++, we actually picked up around 6% in speed. Presumably, the cleaner structure of the code allowed the compiler to optimise a bit better whereas the global variables in the original C program might have prevented some optimisations.

Next I tested with a long double data type, which uses the 80 bit internal representation of the Intel floating point unit. I used the same iteration count as with the original double test.

Here, the run time was 0.9636 that of C, still faster, and not that much longer than double. If the extra precision of long double makes a difference for your application, there's little cost in using it. Note that support for long double varies from compiler to compiler and architecture to architecture: whether it's available and, if so, what it means depends upon which compiler and machine you're using. These test results apply only to GCC on the x86 (actually x86_64) architecture.

GCC also provides a nonstandard data type, __float128, which implements 128 bit (quadruple precision) floating point arithmetic in software. The libquadmath library includes its own mathematical functions which end in “q” (for example sinq instead of sin), which must be called instead of the standard library functions, and a quadmath_snprintf function for editing numbers to strings. The benchmark contains conditional code and macro definitions to accommodate these changes.

This was 31.0031 times slower than C. Here, we pay a heavy price for doing every floating point operation in software instead of using the CPU's built in floating point unit. If you have an algorithm which requires this accuracy, it's important to perform the numerical analysis to determine where the accuracy is actually needed and employ quadruple precision only where necessary.

Finally, I tested the program using the GNU MPFR multiple-precision library which is built atop the GMP package. I used the MPFR C++ bindings developed by Pavel Holoborodko, which overload the arithmetic operators and define versions of the mathematical functions which make integrating MPFR into a C++ program almost seamless. As with __float128, the output editing code must be rewritten to accommodate MPFR's toString() formatting mechanism. MPFR allows a user-selected precision and rounding mode. I always use the default round to nearest mode, but allow specifying the precision in bits by setting MPFR_PRECISION when the program is compiled. I started with a precision of 128 bits, the same as __float128 above. The result was 189.72 times slower than C. The added generality of MPFR over __float128 comes at a steep price. Clearly, if 128 bits suffices for your application, __float128, is the way to go.

Next, I wanted to see how run time scaled with precision. I rebuilt for 512 bit precision and reran the benchmark. Now we're 499.865 times slower than C—almost exactly 1/500 the speed. This is great to have if you really need it, but you'd be wise to use it sparingly.

The program produced identical output for all choices of floating point precision. By experimentation, I determined that I could reduce MPFR_PRECISION to as low as 47 without getting errors in the least significant digits of the results. At 46 bits and below, errors start to creep in.

The relative performance of the various language implementations (with C taken as 1) is as follows. All language implementations of the benchmark listed below produced identical results to the last (11th) decimal place.

Language Relative
Time
Details
C 1 GCC 3.2.3 -O3, Linux
JavaScript 0.372
0.424
1.334
1.378
1.386
1.495
Mozilla Firefox 55.0.2, Linux
Safari 11.0, MacOS X
Brave 0.18.36, Linux
Google Chrome 61.0.3163.91, Linux
Chromium 60.0.3112.113, Linux
Node.js v6.11.3, Linux
Chapel 0.528
0.0314
Chapel 1.16.0, -fast, Linux
Parallel, 64 threads
Visual Basic .NET 0.866 All optimisations, Windows XP
C++ 0.939
0.964
31.00
189.7
499.9
G++ 5.4.0, -O3, Linux, double
long double (80 bit)
__float128 (128 bit)
MPFR (128 bit)
MPFR (512 bit)
FORTRAN 1.008 GNU Fortran (g77) 3.2.3 -O3, Linux
Pascal 1.027
1.077
Free Pascal 2.2.0 -O3, Linux
GNU Pascal 2.1 (GCC 2.95.2) -O3, Linux
Swift 1.054 Swift 3.0.1, -O, Linux
Rust 1.077 Rust 0.13.0, --release, Linux
Java 1.121 Sun JDK 1.5.0_04-b05, Linux
Visual Basic 6 1.132 All optimisations, Windows XP
Haskell 1.223 GHC 7.4.1-O2 -funbox-strict-fields, Linux
Scala 1.263 Scala 2.12.3, OpenJDK 9, Linux
Ada 1.401 GNAT/GCC 3.4.4 -O3, Linux
Go 1.481 Go version go1.1.1 linux/amd64, Linux
Simula 2.099 GNU Cim 5.1, GCC 4.8.1 -O2, Linux
Lua 2.515
22.7
LuaJIT 2.0.3, Linux
Lua 5.2.3, Linux
Python 2.633
30.0
PyPy 2.2.1 (Python 2.7.3), Linux
Python 2.7.6, Linux
Erlang 3.663
9.335
Erlang/OTP 17, emulator 6.0, HiPE [native, {hipe, [o3]}]
Byte code (BEAM), Linux
ALGOL 60 3.951 MARST 2.7, GCC 4.8.1 -O3, Linux
PL/I 5.667 Iron Spring PL/I 0.9.9b beta, Linux
Lisp 7.41
19.8
GNU Common Lisp 2.6.7, Compiled, Linux
GNU Common Lisp 2.6.7, Interpreted
Smalltalk 7.59 GNU Smalltalk 2.3.5, Linux
Ruby 7.832 Ruby 2.4.2p198, Linux
Forth 9.92 Gforth 0.7.0, Linux
Prolog 11.72
5.747
SWI-Prolog 7.6.0-rc2, Linux
GNU Prolog 1.4.4, Linux, (limited iterations)
COBOL 12.5
46.3
Micro Focus Visual COBOL 2010, Windows 7
Fixed decimal instead of computational-2
Algol 68 15.2 Algol 68 Genie 2.4.1 -O3, Linux
Perl 23.6 Perl v5.8.0, Linux
QBasic 148.3 MS-DOS QBasic 1.1, Windows XP Console
Mathematica 391.6 Mathematica 10.3.1.0, Raspberry Pi 3, Raspbian

Posted at 22:45 Permalink