Thursday, October 26, 2017
Floating Point Benchmark: Chapel Language Added
I have posted an update to my trigonometry-intense floating point benchmark which adds the Chapel language. Chapel (Cascade High Productivity Language) is a programming language developed by Cray, Inc. with the goal of integrating parallel computing into a language without cumbersome function calls or awkward syntax. The language implements both task-based and data-based parallelism: in the first, the programmer explicitly defines the tasks to be run in parallel, while in the second an operation is performed on a collection of data and the compiler and runtime system decide how to partition it among the computing resources available. Both symmetric multiprocessing with shared memory (as on contemporary “multi-core” microprocessors) and parallel architectures with local memory per processor and message passing are supported. Apart from its parallel processing capabilities, Chapel is a conventional object-oriented imperative programming language. Programmers familiar with C++, Java, and other such languages will quickly become accustomed to its syntax and structure.

Because this is the first parallel processing language in which the floating point benchmark has been implemented, I wanted to test its performance in both serial and parallel processing modes. Since the benchmark does not process large arrays of data, I used task parallelism to implement two kinds of parallel processing. The first is “parallel trace”, enabled by compiling with:

chpl --fast fbench.chpl --set partrace=true
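For a sense of what this looks like in the language, here is a hedged sketch of partrace-style task parallelism: several traces run as concurrent tasks under a cobegin, each writing its result into a variable the tasks share by reference. Names, wavelengths, and structure are illustrative, not the benchmark's actual code.

```
config param partrace = false;   // selected at compile time with --set partrace=true

// Hypothetical stand-in for propagating one spectral line through the
// lens assembly; the result comes back through the ref argument.
proc traceLine(wavelength: real, ref result: real) {
  result = wavelength;           // placeholder for the real computation
}

proc traceAll() {
  var r1, r2, r3, r4: real;
  if partrace {
    // Four tasks run concurrently; cobegin waits for all of them to finish.
    // The "with (ref ...)" task-intent clause lets the tasks write the
    // outer variables.
    cobegin with (ref r1, ref r2, ref r3, ref r4) {
      traceLine(480.0, r1);
      traceLine(510.0, r2);
      traceLine(587.6, r3);
      traceLine(656.3, r4);
    }
  } else {
    traceLine(480.0, r1);
    traceLine(510.0, r2);
    traceLine(587.6, r3);
    traceLine(656.3, r4);
  }
  // ... use r1..r4 to compute the aberrations ...
}
```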
The ray tracing process propagates light of four different wavelengths through the lens assembly and then uses the object distance and axis slope angle of the rays to compute various aberrations. When partrace is set to true, the computation of these rays is performed in parallel, with four tasks running in a “cobegin” structure. When all of the tasks are complete, their results, stored in shared memory passed to the tasks by reference, are used to compute the aberrations. The second option is “parallel iteration”, enabled by compiling with:
chpl --fast fbench.chpl --set pariter=n
where n is the number of tasks among which the specified iteration count will be divided. On a multi-core machine, this should usually be set to the number of processor cores available, which you can determine on most Linux systems with:
cat /proc/cpuinfo | grep processor | wc -l
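Sketched in Chapel (hypothetical code, not the benchmark's actual implementation), the pariter division of work amounts to a coforall over the tasks, each running its share of the iterations. Note that an option like --iterations works on the command line because Chapel automatically turns every config const into a command-line option.

```
config param pariter = 4;          // chpl --set pariter=n  (compile time)
config const iterations = 1000;    // ./fbench --iterations=n  (run time)

proc runBenchmark() {
  coforall t in 0..#pariter {      // one Chapel task per slice of the work
    var n = iterations / pariter;
    if t == 0 then                 // extra iterations go to one of the tasks
      n += iterations % pariter;
    for 1..n {
      // ... one complete ray-trace evaluation ...
    }
  }
}
```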
(If the number of tasks does not evenly divide the number of iterations, the extra iterations are assigned to one of the tasks.) The parallel iteration model might be seen as cheating, but in a number of applications, such as ray tracing for computer generated image rendering (as opposed to the ray tracing we do in the benchmark for optical design), a large number of computations are done which are independent of one another (for example, every pixel in a generated image is independent of every other), and the job can be parallelised by a simple “farm” algorithm which spreads the work over as many processors as are available. The parallel iteration model allows testing this approach with the floating point benchmark. If the benchmark is compiled without specifying partrace or pariter, it will run the task serially as in conventional language implementations. The number of iterations is specified on the command line when running the benchmark as:
./fbench --iterations=n
where n is the number to be run. After preliminary timing runs to determine the number of iterations, I ran the serial benchmark for 250,000,000 iterations, with run times in seconds of:
|      | user   | real   | sys    |
|------|--------|--------|--------|
|      | 301.00 | 235.73 | 170.46 |
|      | 299.24 | 234.26 | 169.27 |
|      | 297.93 | 233.67 | 169.40 |
|      | 301.02 | 236.05 | 171.08 |
|      | 298.59 | 234.45 | 170.30 |
| Mean | 299.56 | 234.83 | 170.10 |
Note that user plus sys time comes to roughly twice the real time, which suggests the “serial” run was not using just one thread. Watching the run with top showed:

    PR NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
    20  0  167900  2152  2012 S 199.7  0.0  0:12.54 fbench

Yup, almost 200% CPU utilisation. I then ran top -H to show threads and saw:

    PR NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
    20  0  167900  2152  2012 R  99.9  0.0  1:43.28 fbench
    20  0  167900  2152  2012 R  99.7  0.0  1:43.27 fbench

so indeed we had two threads. You can control the number of threads with the environment variable CHPL_RT_NUM_THREADS_PER_LOCALE, so I set:
export CHPL_RT_NUM_THREADS_PER_LOCALE=1
and re-ran the benchmark, verifying with top that it was now using only one thread. I got the following times:
|      | user   | real   | sys  |
|------|--------|--------|------|
|      | 235.46 | 235.47 | 0.00 |
|      | 236.52 | 236.55 | 0.02 |
|      | 235.06 | 235.07 | 0.00 |
|      | 235.17 | 235.20 | 0.02 |
|      | 236.20 | 236.21 | 0.00 |
| Mean | 235.68 | 235.70 | 0.01 |
Running the parallel trace (partrace) version while varying the number of threads gave:

| threads | real  | user   | sys    |
|---------|-------|--------|--------|
| 1       | 16.92 | 16.91  | 0.00   |
| 2       | 30.74 | 41.68  | 18.16  |
| 4       | 43.15 | 68.23  | 90.65  |
| 5       | 64.29 | 112.38 | 358.88 |
|      | user   | real  | sys   |
|------|--------|-------|-------|
|      | 342.27 | 48.95 | 39.84 |
|      | 339.50 | 48.10 | 39.76 |
|      | 343.01 | 49.34 | 42.19 |
|      | 342.08 | 48.78 | 39.90 |
|      | 338.83 | 47.70 | 37.30 |
| Mean | 341.14 | 48.57 | 39.79 |
| threads | real   | user   | sys  |
|---------|--------|--------|------|
| 1       | 459.76 | 458.86 | 0.08 |
| 16      | 33.08  | 523.78 | 0.12 |
| 32      | 17.17  | 530.21 | 0.34 |
| 64      | 25.35  | 816.64 | 0.43 |
| threads | real  | user   | sys  |
|---------|-------|--------|------|
| 32      | 17.12 | 528.46 | 0.29 |
| 64      | 14.00 | 824.79 | 0.66 |
| Language | Relative Time | Details |
|----------|---------------|---------|
| C | 1 | GCC 3.2.3 -O3, Linux |
| JavaScript | 0.372<br>0.424<br>1.334<br>1.378<br>1.386<br>1.495 | Mozilla Firefox 55.0.2, Linux<br>Safari 11.0, MacOS X<br>Brave 0.18.36, Linux<br>Google Chrome 61.0.3163.91, Linux<br>Chromium 60.0.3112.113, Linux<br>Node.js v6.11.3, Linux |
| Chapel | 0.528<br>0.0314 | Chapel 1.16.0, -fast, Linux<br>Parallel, 64 threads |
| Visual Basic .NET | 0.866 | All optimisations, Windows XP |
| FORTRAN | 1.008 | GNU Fortran (g77) 3.2.3 -O3, Linux |
| Pascal | 1.027<br>1.077 | Free Pascal 2.2.0 -O3, Linux<br>GNU Pascal 2.1 (GCC 2.95.2) -O3, Linux |
| Swift | 1.054 | Swift 3.0.1, -O, Linux |
| Rust | 1.077 | Rust 0.13.0, --release, Linux |
| Java | 1.121 | Sun JDK 1.5.0_04-b05, Linux |
| Visual Basic 6 | 1.132 | All optimisations, Windows XP |
| Haskell | 1.223 | GHC 7.4.1 -O2 -funbox-strict-fields, Linux |
| Scala | 1.263 | Scala 2.12.3, OpenJDK 9, Linux |
| Ada | 1.401 | GNAT/GCC 3.4.4 -O3, Linux |
| Go | 1.481 | Go version go1.1.1 linux/amd64, Linux |
| Simula | 2.099 | GNU Cim 5.1, GCC 4.8.1 -O2, Linux |
| Lua | 2.515<br>22.7 | LuaJIT 2.0.3, Linux<br>Lua 5.2.3, Linux |
| Python | 2.633<br>30.0 | PyPy 2.2.1 (Python 2.7.3), Linux<br>Python 2.7.6, Linux |
| Erlang | 3.663<br>9.335 | Erlang/OTP 17, emulator 6.0, HiPE [native, {hipe, [o3]}]<br>Byte code (BEAM), Linux |
| ALGOL 60 | 3.951 | MARST 2.7, GCC 4.8.1 -O3, Linux |
| PL/I | 5.667 | Iron Spring PL/I 0.9.9b beta, Linux |
| Lisp | 7.41<br>19.8 | GNU Common Lisp 2.6.7, Compiled, Linux<br>GNU Common Lisp 2.6.7, Interpreted |
| Smalltalk | 7.59 | GNU Smalltalk 2.3.5, Linux |
| Forth | 9.92 | Gforth 0.7.0, Linux |
| Prolog | 11.72<br>5.747 | SWI-Prolog 7.6.0-rc2, Linux<br>GNU Prolog 1.4.4, Linux (limited iterations) |
| COBOL | 12.5<br>46.3 | Micro Focus Visual COBOL 2010, Windows 7<br>Fixed decimal instead of computational-2 |
| Algol 68 | 15.2 | Algol 68 Genie 2.4.1 -O3, Linux |
| Perl | 23.6 | Perl v5.8.0, Linux |
| Ruby | 26.1 | Ruby 1.8.3, Linux |
| QBasic | 148.3 | MS-DOS QBasic 1.1, Windows XP Console |
| Mathematica | 391.6 | Mathematica 10.3.1.0, Raspberry Pi 3, Raspbian |
Posted at October 26, 2017 21:41