Thursday, November 2, 2017
Floating Point Benchmark: C++ Language Added, Multiple Precision Arithmetic
I have posted a new edition of the floating point benchmark collection which adds the C++ language and compares the performance of four floating point implementations with different precisions: standard double (64 bit), long double (80 bit), GNU libquadmath (__float128, 128 bit), and the GNU MPFR multiple-precision library, tested at both 128 and 512 bit precision.

It is, of course, possible to compile the ANSI C version of the benchmark with a C++ compiler, as almost any ANSI C program is a valid C++ program, but this program is a complete rewrite of the benchmark algorithm in C++, using the features of the language as they were intended to improve the readability, modularity, and generality of the program. As with all versions of the benchmark, identical results are produced, to the last decimal place, and the results are checked against a reference to verify correctness.

This benchmark was developed to explore whether writing a program using the features of C++ imposes a speed penalty compared to the base C language, and also to explore the relative performance of four different implementations of floating point arithmetic and mathematical function libraries with different precisions. The operator overloading features of C++ make it possible to easily port code to multiple precision arithmetic libraries without the cumbersome and error-prone function calls such code requires in C.

The resulting program is object-oriented, with objects representing items such as spectral lines, surface boundaries in an optical assembly, a complete lens design, the trace of a ray of light through the lens, and an evaluation of the aberrations of the design compared to acceptable optical quality standards. Each object has methods which perform computation related to its contents. All floating point quantities in the program are declared as type Real, which is typedef-ed to the precision being tested, as illustrated in the sketch below.

The numbers supported by libquadmath and MPFR cannot be directly converted to strings by snprintf() format phrases, so when using these libraries auxiliary code is generated to use those packages' facilities for conversion to strings. In a run of the benchmark, which typically performs hundreds of thousands or millions of executions of the inner loop, this code only executes once, so it has negligible impact on run time.

I first tested the program with standard double arithmetic. As always, I do a preliminary run and time it, then compute an iteration count to yield a run time of around five minutes. I then perform five runs on an idle system, time them, and compute the mean run time. Next, the mean time is divided by the iteration count to compute microseconds per iteration. All tests were done with GCC/G++ 5.4.0. Comparing with a run of the ANSI C benchmark, the C++ time was 0.9392 of the C run time. Not only didn't we pay a penalty for using C++, we actually picked up around 6% in speed. Presumably, the cleaner structure of the code allowed the compiler to optimise a bit better, whereas the global variables in the original C program might have prevented some optimisations.

Next I tested with the long double data type, which uses the 80 bit internal representation of the Intel floating point unit, using the same iteration count as in the original double test. Here, the run time was 0.9636 that of C: still faster, and not that much longer than double. If the extra precision of long double makes a difference for your application, there's little cost in using it.
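As a concrete illustration of the structure described above, a sketch along the following lines shows how a single typedef selects the precision while the object methods remain unchanged. The class, member, and macro names here are hypothetical stand-ins, not code from the benchmark itself.

```cpp
#include <cstdio>

//  All floating point quantities are of type Real, typedef-ed to the
//  precision under test.  (The LONG_DOUBLE macro is an illustrative
//  assumption; the benchmark's own configuration may differ.)
#if defined(LONG_DOUBLE)
    typedef long double Real;
#else
    typedef double Real;
#endif

//  Hypothetical example of one of the benchmark's objects: a surface
//  boundary in an optical assembly, with a method which performs a
//  computation related to its contents.
class Surface {
public:
    Real curvature;             //  Curvature of the surface (1 / radius)
    Real indexOfRefraction;     //  Index of refraction of the following medium

    Surface(Real c, Real n) : curvature(c), indexOfRefraction(n) { }

    //  Paraxial optical power contributed by this surface for light
    //  arriving from a medium with index n1.  Only overloaded arithmetic
    //  operators are used, so the method compiles unchanged whichever
    //  type Real is defined as.
    Real power(Real n1) const {
        return (indexOfRefraction - n1) * curvature;
    }
};

int main() {
    Surface s(Real(0.02), Real(1.6164));
    std::printf("Surface power: %g\n", (double) s.power(Real(1.0)));
    return 0;
}
```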
Note that support for long double varies from compiler to compiler and architecture to architecture: whether it's available and, if so, what it means depends upon which compiler and machine you're using. These test results apply only to GCC on the x86 (actually x86_64) architecture.

GCC also provides a nonstandard data type, __float128, which implements 128 bit (quadruple precision) floating point arithmetic in software. The libquadmath library includes its own mathematical functions, whose names end in “q” (for example sinq instead of sin) and which must be called instead of the standard library functions, as well as a quadmath_snprintf function for editing numbers to strings. The benchmark contains conditional code and macro definitions to accommodate these changes (sketched below). This test was 31.0031 times slower than C. Here, we pay a heavy price for doing every floating point operation in software instead of using the CPU's built-in floating point unit. If you have an algorithm which requires this accuracy, it's important to perform the numerical analysis to determine where the accuracy is actually needed and employ quadruple precision only where necessary.

Finally, I tested the program using the GNU MPFR multiple-precision library, which is built atop the GMP package. I used the MPFR C++ bindings developed by Pavel Holoborodko, which overload the arithmetic operators and define versions of the mathematical functions that make integrating MPFR into a C++ program almost seamless. As with __float128, the output editing code must be rewritten to accommodate MPFR's toString() formatting mechanism. MPFR allows a user-selected precision and rounding mode. I always use the default round-to-nearest mode, but allow specifying the precision in bits by setting MPFR_PRECISION when the program is compiled.

I started with a precision of 128 bits, the same as __float128 above. The result was 189.72 times slower than C. The added generality of MPFR over __float128 comes at a steep price: clearly, if 128 bits suffices for your application, __float128 is the way to go.

Next, I wanted to see how run time scaled with precision. I rebuilt for 512 bit precision and reran the benchmark. Now we're 499.865 times slower than C, almost exactly 1/500 the speed. This is great to have if you really need it, but you'd be wise to use it sparingly.

The program produced identical output for all choices of floating point precision. By experimentation, I determined that I could reduce MPFR_PRECISION to as low as 47 bits without getting errors in the least significant digits of the results. At 46 bits and below, errors start to creep in.
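Here is a sketch of the kind of conditional glue described above, mapping the mathematical functions and the number-to-string editing onto libquadmath or MPFR. The macro names (USE_QUADMATH, USE_MPFR) and the editReal() helper are illustrative assumptions, not code taken from the benchmark source.

```cpp
#include <cstdio>
#include <string>

#if defined(USE_QUADMATH)
    #include <quadmath.h>               //  link with -lquadmath
    typedef __float128 Real;
    #define Sin(x)   sinq(x)
    #define Sqrt(x)  sqrtq(x)

    //  snprintf() cannot edit a __float128, so use quadmath_snprintf()
    static std::string editReal(Real v) {
        char buf[128];
        quadmath_snprintf(buf, sizeof buf, "%.20Qe", v);
        return std::string(buf);
    }
#elif defined(USE_MPFR)
    #include "mpreal.h"                 //  MPFR C++ bindings; link with -lmpfr -lgmp
    typedef mpfr::mpreal Real;
    #define Sin(x)   mpfr::sin(x)
    #define Sqrt(x)  mpfr::sqrt(x)

    //  Use the mpreal class's own toString() formatting facility
    static std::string editReal(const Real &v) {
        return v.toString();
    }
#else
    #include <cmath>
    typedef double Real;
    #define Sin(x)   std::sin(x)
    #define Sqrt(x)  std::sqrt(x)

    static std::string editReal(Real v) {
        char buf[64];
        std::snprintf(buf, sizeof buf, "%.11f", v);
        return std::string(buf);
    }
#endif

int main() {
#if defined(USE_MPFR)
    mpfr::mpreal::set_default_prec(MPFR_PRECISION);  //  precision in bits, set at compile time
#endif
    Real x = Sqrt(Real(2));
    std::printf("%s\n", editReal(Sin(x)).c_str());
    return 0;
}
```

A libquadmath build would link with -lquadmath, and an MPFR build with -lmpfr -lgmp; the plain double path needs nothing beyond the standard library.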
The relative performance of the various language implementations (with C taken as 1) is as follows. All language implementations of the benchmark listed below produced identical results to the last (11th) decimal place.

| Language | Relative Time | Details |
|---|---|---|
| C | 1 | GCC 3.2.3 -O3, Linux |
| JavaScript | 0.372 | Mozilla Firefox 55.0.2, Linux |
|  | 0.424 | Safari 11.0, MacOS X |
|  | 1.334 | Brave 0.18.36, Linux |
|  | 1.378 | Google Chrome 61.0.3163.91, Linux |
|  | 1.386 | Chromium 60.0.3112.113, Linux |
|  | 1.495 | Node.js v6.11.3, Linux |
| Chapel | 0.528 | Chapel 1.16.0, -fast, Linux |
|  | 0.0314 | Parallel, 64 threads |
| Visual Basic .NET | 0.866 | All optimisations, Windows XP |
| C++ | 0.939 | G++ 5.4.0, -O3, Linux, double |
|  | 0.964 | long double (80 bit) |
|  | 31.00 | __float128 (128 bit) |
|  | 189.7 | MPFR (128 bit) |
|  | 499.9 | MPFR (512 bit) |
| FORTRAN | 1.008 | GNU Fortran (g77) 3.2.3 -O3, Linux |
| Pascal | 1.027 | Free Pascal 2.2.0 -O3, Linux |
|  | 1.077 | GNU Pascal 2.1 (GCC 2.95.2) -O3, Linux |
| Swift | 1.054 | Swift 3.0.1, -O, Linux |
| Rust | 1.077 | Rust 0.13.0, --release, Linux |
| Java | 1.121 | Sun JDK 1.5.0_04-b05, Linux |
| Visual Basic 6 | 1.132 | All optimisations, Windows XP |
| Haskell | 1.223 | GHC 7.4.1 -O2 -funbox-strict-fields, Linux |
| Scala | 1.263 | Scala 2.12.3, OpenJDK 9, Linux |
| Ada | 1.401 | GNAT/GCC 3.4.4 -O3, Linux |
| Go | 1.481 | Go version go1.1.1 linux/amd64, Linux |
| Simula | 2.099 | GNU Cim 5.1, GCC 4.8.1 -O2, Linux |
| Lua | 2.515 | LuaJIT 2.0.3, Linux |
|  | 22.7 | Lua 5.2.3, Linux |
| Python | 2.633 | PyPy 2.2.1 (Python 2.7.3), Linux |
|  | 30.0 | Python 2.7.6, Linux |
| Erlang | 3.663 | Erlang/OTP 17, emulator 6.0, HiPE [native, {hipe, [o3]}] |
|  | 9.335 | Byte code (BEAM), Linux |
| ALGOL 60 | 3.951 | MARST 2.7, GCC 4.8.1 -O3, Linux |
| PL/I | 5.667 | Iron Spring PL/I 0.9.9b beta, Linux |
| Lisp | 7.41 | GNU Common Lisp 2.6.7, Compiled, Linux |
|  | 19.8 | GNU Common Lisp 2.6.7, Interpreted |
| Smalltalk | 7.59 | GNU Smalltalk 2.3.5, Linux |
| Ruby | 7.832 | Ruby 2.4.2p198, Linux |
| Forth | 9.92 | Gforth 0.7.0, Linux |
| Prolog | 11.72 | SWI-Prolog 7.6.0-rc2, Linux |
|  | 5.747 | GNU Prolog 1.4.4, Linux (limited iterations) |
| COBOL | 12.5 | Micro Focus Visual COBOL 2010, Windows 7 |
|  | 46.3 | Fixed decimal instead of computational-2 |
| Algol 68 | 15.2 | Algol 68 Genie 2.4.1 -O3, Linux |
| Perl | 23.6 | Perl v5.8.0, Linux |
| QBasic | 148.3 | MS-DOS QBasic 1.1, Windows XP Console |
| Mathematica | 391.6 | Mathematica 10.3.1.0, Raspberry Pi 3, Raspbian |