Crusoe Update: Linux Benchmarks

15 Sep 2021 - faintshadows

The final entry in 2021’s Crusoe saga.

Debian 8 is the last Debian release that works with i586 CPUs, and thankfully, it works on the Crusoe, with limited stability issues. Namely, if X is running and I try to run apt, the machine freezes. I don’t know why, but I’ll just install things before starting X, no big deal.

I was about to put the Crusoe back into hibernation, but I wanted to check some benchmarks under Linux first, since they likely won’t need SSE, or can at least be compiled to not use it.

I grabbed all benchmarks from https://linux-sunxi.org/Benchmarks.

LINPACK

foxpro@crusoe:~$ ./linpack
Enter array size (q to quit) [200]:  
Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.66  83.32%   2.78%  13.90%  155170.141
     128   1.32  83.33%   2.77%  13.90%  155020.165
     256   2.63  83.32%   2.77%  13.90%  155095.938
     512   5.27  83.34%   2.77%  13.88%  155007.179
    1024  10.53  83.32%   2.77%  13.91%  155128.325

I compiled it with cc -Ofast -o linpack linpack.c -lm -march=i586 -fomit-frame-pointer -mpreferred-stack-boundary=2 -falign-functions=0 -falign-jumps=0 -falign-loops=0 because of course I would. This is what Gentoo users recommended for the Crusoe back in the day, and there was actually a ~3% decrease in performance without all those flags, so what the hell, sure.

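“But faint, those flags are kinda pointless.” OK, but when you look at how the Crusoe works under the hood, those last 3 flags arguably help, because the Crusoe does its own re-aligning of code, so why have the compiler do it? (One caveat: going by the GCC docs, =0 asks for the machine-dependent default alignment and =1 is the value that actually turns alignment off, so some of the gain may really come from the other flags.) It’s not like I’m doing -funroll-all-loops.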

Fine, I’ll run LINPACK without the flags and you can see the difference:

foxpro@crusoe:~/bench$ cc -Ofast -o linpack linpack.c -lm
foxpro@crusoe:~/bench$ ./linpack 
Enter array size (q to quit) [200]:  
Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.70  79.80%   2.70%  17.50%  152346.352
     128   1.41  79.69%   2.77%  17.54%  151261.439
     256   2.78  79.90%   2.58%  17.53%  153236.933
     512   5.56  79.92%   2.58%  17.50%  153195.100
    1024  11.14  79.73%   2.58%  17.69%  153330.947

See!! It’s slower!
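
To put a number on "slower", here is a small Python sketch that only does arithmetic on the two tables above. It was not run on the Crusoe, and the variable and function names are made up for illustration. It compares the average KFLOPS of the two runs and the wall-clock time of the 1024-rep row.

```python
# KFLOPS columns copied from the two LINPACK tables above.
kflops_with_flags = [155170.141, 155020.165, 155095.938, 155007.179, 155128.325]
kflops_without_flags = [152346.352, 151261.439, 153236.933, 153195.100, 153330.947]

# Time(s) for the 1024-rep row in each table.
time_with_flags = 10.53
time_without_flags = 11.14


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def percent_change(old, new):
    """Signed change from old to new, as a percentage of old."""
    return (new - old) / old * 100.0


# Positive number = how much KFLOPS dropped when the flags were removed.
kflops_drop = -percent_change(mean(kflops_with_flags), mean(kflops_without_flags))

# Positive number = how much longer the 1024-rep run took without the flags.
time_increase = percent_change(time_with_flags, time_without_flags)

print(f"KFLOPS drop without flags:   {kflops_drop:.2f}%")
print(f"Extra wall time (1024 reps): {time_increase:.2f}%")
```

For these two particular runs that works out to roughly 1.6% fewer KFLOPS and a bit under 6% more wall-clock time. The time gap is larger than the KFLOPS gap, which lines up with the higher OVERHEAD column in the second table.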

Anyways, all benchmarks will be compiled with those flags to keep it even. If people don’t like that I guess I can re-run them later without the fun flags.

DHRYSTONE

Compiled with:

gcc dhry1.c cpuidc.o cpuida.o -lrt -lc -lm -march=i586 -fomit-frame-pointer -mpreferred-stack-boundary=2 -falign-functions=0 -falign-jumps=0 -falign-loops=0 -o dhry1

OK, I said I’d use all the same flags, but the instructions for this one said to build with no optimization first, then -O2, then -O3, so I did that. The rest of the flags are the same.

Output is truncated after the first run for brevity.

1.1 -O0

  ####################################################
  getDetails and MHz

  Assembler CPUID and RDTSC      
  CPU GenuineTMx86, Features Code 0080893F, Model Code 00000543
  Transmeta(tm) Crusoe(tm) Processor TM5800
  Measured - Minimum 793 MHz, Maximum 793 MHz
  Linux Functions
  get_nprocs() - CPUs 1, Configured CPUs 1
  get_phys_pages() and size - RAM Size  0.23 GB, Page Size 4096 Bytes
  uname() - Linux, crusoe, 3.16.0-6-586
  #1 Debian 3.16.56-1+deb8u1 (2018-05-08), i586

##########################################

Dhrystone Benchmark, Version 1.1 (Language: C or C++)

Optimisation    No Opt

       10000 runs   0.02 seconds 
      100000 runs   0.10 seconds 
      200000 runs   0.20 seconds 
      400000 runs   0.40 seconds 
      800000 runs   0.80 seconds 
     1600000 runs   1.61 seconds 
     3200000 runs   3.24 seconds 

Array2Glob8/7: O.K.       3200010

Microseconds for one run through Dhrystone:         1.01 
Dhrystones per Second:                          986560 
VAX  MIPS rating =                                561.50

1.1 -O2

Microseconds for one run through Dhrystone:         0.55 
Dhrystones per Second:                         1815156 
VAX  MIPS rating =                               1033.10

1.1 -O3

Microseconds for one run through Dhrystone:         0.44 
Dhrystones per Second:                         2279718 
VAX  MIPS rating =                               1297.51

2.1 -O0

There are two versions of the Dhrystone benchmark; here’s version 2.1.

Dhrystone Benchmark, Version 2.1 (Language: C or C++)

Optimisation    No Opt
Register option not selected

       40000 runs   0.05 seconds 
      400000 runs   0.41 seconds 
      800000 runs   0.82 seconds 
     1600000 runs   1.63 seconds 
     3200000 runs   3.27 seconds 

Final values (* implementation-dependent):

Int_Glob:      O.K.  5  Bool_Glob:     O.K.  1
Ch_1_Glob:     O.K.  A  Ch_2_Glob:     O.K.  B
Arr_1_Glob[8]: O.K.  7  Arr_2_Glob8/7: O.K.     3200010
Ptr_Glob->              Ptr_Comp:       *    134545776
  Discr:       O.K.  0  Enum_Comp:     O.K.  2
  Int_Comp:    O.K.  17 Str_Comp:      O.K.  DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob->         Ptr_Comp:       *    134545776 same as above
  Discr:       O.K.  0  Enum_Comp:     O.K.  1
  Int_Comp:    O.K.  18 Str_Comp:      O.K.  DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc:     O.K.  5  Int_2_Loc:     O.K.  13
Int_3_Loc:     O.K.  7  Enum_Loc:      O.K.  1  
Str_1_Loc:                             O.K.  DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc:                             O.K.  DHRYSTONE PROGRAM, 2'ND STRING

Microseconds for one run through Dhrystone:         1.02 
Dhrystones per Second:                          979200 
VAX  MIPS rating =                                557.31

2.1 -O2

Microseconds for one run through Dhrystone:         0.74 
Dhrystones per Second:                         1358044 
VAX  MIPS rating =                                772.93

2.1 -O3

Microseconds for one run through Dhrystone:         0.71 
Dhrystones per Second:                         1409295 
VAX  MIPS rating =                                802.10
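
If you're wondering where the "VAX MIPS rating" lines come from: it is just Dhrystones per second divided by 1757, the score conventionally credited to the VAX 11/780 as the 1 MIPS reference machine. A minimal Python sketch, using the numbers from the runs above (the names here are mine, not from the benchmark source):

```python
# Dhrystones per second divided by the VAX 11/780 reference score gives "VAX MIPS".
VAX_11_780_DHRYSTONES_PER_SEC = 1757

# "Dhrystones per Second" values copied from the six runs above.
dhrystones_per_second = {
    "1.1 -O0": 986560,
    "1.1 -O2": 1815156,
    "1.1 -O3": 2279718,
    "2.1 -O0": 979200,
    "2.1 -O2": 1358044,
    "2.1 -O3": 1409295,
}


def vax_mips(dps):
    """VAX MIPS rating for a given Dhrystones-per-second score."""
    return dps / VAX_11_780_DHRYSTONES_PER_SEC


def microseconds_per_run(dps):
    """Time for one pass through Dhrystone, in microseconds."""
    return 1_000_000 / dps


for label, dps in dhrystones_per_second.items():
    print(f"{label}: {vax_mips(dps):8.2f} VAX MIPS, "
          f"{microseconds_per_run(dps):.2f} us per run")
```

The results should agree with the reported ratings to within rounding.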

WHETSTONE

Same flags as the Dhrystone benchmarks.

-O0

          Single Precision C/C++ Whetstone Benchmark

Loop content                  Result              MFLOPS      MOPS   Seconds

N1 floating point     -1.12475025653839111       132.613              0.031
N2 floating point     -1.12274754047393799        81.167              0.353
N3 if then else        1.00000000000000000                 142.886    0.154
N4 fixed point        12.00000000000000000                 121.395    0.553
N5 sin,cos etc.        0.49904659390449524                   6.884    2.574
N6 floating point      0.99999988079071045        36.821              3.120
N7 assignments         3.00000000000000000                  67.113    0.587
N8 exp,sqrt etc.       0.75110864639282227                   3.031    2.614

MWIPS                                            213.303              9.986

-O2

          Single Precision C/C++ Whetstone Benchmark

Loop content                  Result              MFLOPS      MOPS   Seconds

N1 floating point     -1.12441420555114746       199.450              0.033
N2 floating point     -1.12241148948669434       118.126              0.389
N3 if then else        1.00000000000000000                 470.852    0.075
N4 fixed point        12.00000000000000000                 687.611    0.157
N5 sin,cos etc.        0.49904659390449524                   6.721    4.233
N6 floating point      0.99999988079071045       188.546              0.978
N7 assignments         3.00000000000000000                 671.471    0.094
N8 exp,sqrt etc.       0.75110864639282227                   3.276    3.884

MWIPS                                            347.433              9.844

-O3

          Single Precision C/C++ Whetstone Benchmark

Loop content                  Result              MFLOPS      MOPS   Seconds

N1 floating point     -1.12441420555114746       199.070              0.034
N2 floating point     -1.12239956855773926       234.379              0.201
N3 if then else        1.00000000000000000                 590.353    0.061
N4 fixed point        12.00000000000000000                 688.087    0.160
N5 sin,cos etc.        0.49904659390449524                   6.682    4.358
N6 floating point      0.99999988079071045       188.449              1.002
N7 assignments         3.00000000000000000                 717.757    0.090
N8 exp,sqrt etc.       0.75110864639282227                   3.277    3.973

MWIPS                                            354.306              9.878
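
A quick way to read the three Whetstone tables side by side is to divide each score by its -O0 value. The Python sketch below does that for the overall MWIPS figure and for the two math-library loops (N5 and N8). The numbers are copied from the tables above, and the helper name is made up for illustration.

```python
# Scores copied from the three Whetstone tables above, keyed by optimisation level.
mwips = {"-O0": 213.303, "-O2": 347.433, "-O3": 354.306}
n5_sin_cos_mops = {"-O0": 6.884, "-O2": 6.721, "-O3": 6.682}
n8_exp_sqrt_mops = {"-O0": 3.031, "-O2": 3.276, "-O3": 3.277}


def speedup(scores, level):
    """Score at the given level relative to the unoptimised (-O0) build."""
    return scores[level] / scores["-O0"]


for name, scores in [("MWIPS", mwips),
                     ("N5 sin,cos", n5_sin_cos_mops),
                     ("N8 exp,sqrt", n8_exp_sqrt_mops)]:
    for level in ("-O2", "-O3"):
        print(f"{name:12s} {level}: {speedup(scores, level):.2f}x vs -O0")
```

The overall score goes up by about 1.6x with optimisation, while N5 and N8 barely move. A plausible reading is that those two loops spend their time inside libm rather than in the compiled benchmark code, but that is a guess from the numbers, not something I profiled.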

NBENCH

Used some extra CFLAGS, as the Makefile requests:

-s -static -Wall -O3 -march=i586 -fomit-frame-pointer \
-mpreferred-stack-boundary=2 -falign-functions=0 -falign-jumps=0 \
-falign-loops=0 -funroll-loops

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          353.92  :       9.08  :       2.98
STRING SORT         :          22.378  :      10.00  :       1.55
BITFIELD            :      1.7021e+08  :      29.20  :       6.10
FP EMULATION        :           68.53  :      32.88  :       7.59
FOURIER             :          2952.1  :       3.36  :       1.89
ASSIGNMENT          :          10.309  :      39.23  :      10.18
IDEA                :          924.23  :      14.14  :       4.20
HUFFMAN             :          613.45  :      17.01  :       5.43
NEURAL NET          :           4.998  :       8.03  :       3.38
LU DECOMPOSITION    :          246.81  :      12.79  :       9.23
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 18.774
FLOATING-POINT INDEX: 7.011
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : GenuineTMx86 Transmeta(tm) Crusoe(tm) Processor TM5800 800MHz
L2 Cache            : 512 KB
OS                  : Linux 3.16.0-6-586
C compiler          : gcc version 4.9.2 (Debian 4.9.2-10+deb8u1) 
libc                : libc-2.19.so
MEMORY INDEX        : 4.580
INTEGER INDEX       : 4.765
FLOATING-POINT INDEX: 3.889
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
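
The summary indices in that output look like geometric means of the per-test indices. The Python sketch below recomputes them from the table. The grouping of tests into memory, integer and floating-point sets is my understanding of how nbench splits them, so treat it as an assumption; the names are illustrative. Small differences in the last digits are expected because the per-test values in the table are already rounded.

```python
import math

# Per-test indices copied from the nbench table above.
old_index = {  # relative to Pentium 90
    "NUMERIC SORT": 9.08, "STRING SORT": 10.00, "BITFIELD": 29.20,
    "FP EMULATION": 32.88, "FOURIER": 3.36, "ASSIGNMENT": 39.23,
    "IDEA": 14.14, "HUFFMAN": 17.01, "NEURAL NET": 8.03,
    "LU DECOMPOSITION": 12.79,
}
new_index = {  # relative to AMD K6/233
    "NUMERIC SORT": 2.98, "STRING SORT": 1.55, "BITFIELD": 6.10,
    "FP EMULATION": 7.59, "FOURIER": 1.89, "ASSIGNMENT": 10.18,
    "IDEA": 4.20, "HUFFMAN": 5.43, "NEURAL NET": 3.38,
    "LU DECOMPOSITION": 9.23,
}

# Assumed test groupings (not stated in the output itself).
FP_TESTS = ["FOURIER", "NEURAL NET", "LU DECOMPOSITION"]
ORIGINAL_INTEGER_TESTS = [name for name in old_index if name not in FP_TESTS]
LINUX_MEMORY_TESTS = ["STRING SORT", "BITFIELD", "ASSIGNMENT"]
LINUX_INTEGER_TESTS = ["NUMERIC SORT", "FP EMULATION", "IDEA", "HUFFMAN"]


def geometric_mean(values):
    """n-th root of the product, computed via logs for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def index_for(table, tests):
    """Geometric mean of the selected tests' indices."""
    return geometric_mean([table[name] for name in tests])


print(f"Original INTEGER INDEX:        {index_for(old_index, ORIGINAL_INTEGER_TESTS):.3f}")
print(f"Original FLOATING-POINT INDEX: {index_for(old_index, FP_TESTS):.3f}")
print(f"Linux MEMORY INDEX:            {index_for(new_index, LINUX_MEMORY_TESTS):.3f}")
print(f"Linux INTEGER INDEX:           {index_for(new_index, LINUX_INTEGER_TESTS):.3f}")
print(f"Linux FLOATING-POINT INDEX:    {index_for(new_index, FP_TESTS):.3f}")
```

Because these are geometric means, one outlier test (like ASSIGNMENT here) can't dominate an index the way it would with a plain average.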

Well, that’s it for my benchmarks under Linux. I have no frame of reference, so I’ll leave it up to you, the reader, to compare against your own hardware of this vintage.