I tried to bench qemu's FP performance ....
Posted: Sun May 12, 2024 4:21 pm
by Andrew_R
Command line:
qemu-system-ppc -m 1024 -M mac99,via=pmu -hda ~/QEMU/osx-tiger_10.4.11_installed-compressed.qcow -cpu G4 -boot c -accel tcg,tb-size=256 -g 1368x768x32 -display sdl,gl=on -device intel-hda -device hda-duplex -nic user,hostfwd=tcp:127.0.0.1:6001-:6000 -cdrom ~/ISO/ansib_bench.iso
Sadly, Mac OS X 10.4.11 does not seem to know how to deal with pure ISO 9660 without Joliet, so I just FTP'ed the archive into the guest.
Benchmark used: https://github.com/nfinit/ansibench
Host:
inxi
CPU: quad core AMD FX-4300 (-MT MCP-) speed/min/max: 3522/1400/3800 MHz
Kernel: 6.1.44-x64 x86_64 Up: 6d 23h 11m Mem: 5581.6/15996.4 MiB (34.9%)
Storage: 931.51 GiB (99.5% used) Procs: 265 Shell: Bash inxi: 3.3.12
QEMU version: git, slightly after the 9.0 release:
git commit 77bcaf5f222fb19667738dc2ca7dec6172d69db7
Results:
bin/linpackdp
Enter array size (q to quit) [100]: 1000
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Memory required: 7824K.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
1 11.39 96.58% 0.53% 2.90% 15159.735
Enter array size (q to quit) [100]: 1000
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Memory required: 7824K.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
1 11.38 96.57% 0.62% 2.81% 15159.735
Enter array size (q to quit) [100]:
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 100 X 100.
Memory required: 79K.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.98 74.49% 4.08% 21.43% 14683.983
128 1.95 76.41% 6.67% 16.92% 13958.848
256 3.88 75.52% 6.44% 18.04% 14222.222
512 7.78 80.72% 5.40% 13.88% 13500.498
1024 15.52 77.38% 6.31% 16.30% 13926.610
Enter array size (q to quit) [100]: q
bin/linpacksp
Enter array size (q to quit) [100]:
LINPACK benchmark, Single precision.
Machine precision: 6 digits.
Array size 100 X 100.
Memory required: 40K.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
32 0.69 81.16% 4.35% 14.49% 9581.921
64 1.39 77.70% 5.04% 17.27% 9831.885
128 2.80 77.86% 6.07% 16.07% 9622.692
256 5.63 83.13% 4.26% 12.61% 9192.416
512 11.21 80.29% 5.53% 14.18% 9402.642
Enter array size (q to quit) [100]: q
So, single precision is SLOWER than double precision on the emulated G4?! (Maybe gcc autovectorizes by default, and AltiVec gets accelerated on an AVX-capable host CPU?)
I wonder how the older hardfloat patches behaved. Does anyone still have a QEMU build with them applied?
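To put a number on the single- versus double-precision gap, the 100 x 100 result rows above can be averaged. This is only a minimal Python sketch that assumes the benchmark output has been pasted in as text; the helper name `mean_kflops` is made up for illustration and is not part of ansibench.

```python
import re

# Result rows copied from the 100 x 100 runs above.
# Columns: Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
DP_ROWS = """
64 0.98 74.49% 4.08% 21.43% 14683.983
128 1.95 76.41% 6.67% 16.92% 13958.848
256 3.88 75.52% 6.44% 18.04% 14222.222
512 7.78 80.72% 5.40% 13.88% 13500.498
1024 15.52 77.38% 6.31% 16.30% 13926.610
"""

SP_ROWS = """
32 0.69 81.16% 4.35% 14.49% 9581.921
64 1.39 77.70% 5.04% 17.27% 9831.885
128 2.80 77.86% 6.07% 16.07% 9622.692
256 5.63 83.13% 4.26% 12.61% 9192.416
512 11.21 80.29% 5.53% 14.18% 9402.642
"""

# One result row: an integer, a time, three percentages, and the KFLOPS value.
ROW = re.compile(
    r"^\s*(\d+)\s+([\d.]+)\s+([\d.]+)%\s+([\d.]+)%\s+([\d.]+)%\s+([\d.]+)\s*$"
)


def mean_kflops(text):
    """Average the KFLOPS column over every result row found in text."""
    values = [float(m.group(6)) for m in map(ROW.match, text.splitlines()) if m]
    if not values:
        raise ValueError("no LINPACK result rows found")
    return sum(values) / len(values)


dp = mean_kflops(DP_ROWS)
sp = mean_kflops(SP_ROWS)
print(f"DP mean: {dp:.0f} KFLOPS, SP mean: {sp:.0f} KFLOPS, SP/DP = {sp / dp:.2f}")
```

With these rows, single precision comes out at roughly two thirds of the double-precision rate.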
Re: I tried to bench qemu's FP performance ....
Posted: Sun May 12, 2024 7:51 pm
by Cat_7
anyone still have qemu with them applied around ..?
Sure, but only for Windows and macOS hosts:
macOS QEMU 7.1:
https://surfdrive.surf.nl/files/index.p ... s/download
Windows QEMU 6.2:
https://surfdrive.surf.nl/files/index.p ... 2/download
Please note that both also have screamer audio enabled, which may slow down the emulation. Both also have a 60 Hz screen refresh.
Best,
Cat_7
Re: I tried to bench qemu's FP performance ....
Posted: Mon May 13, 2024 6:28 am
by Andrew_R
Well, this is really strange!
I got 6.2.0 from
https://download.qemu.org/
applied the patch from
https://patchwork.kernel.org/project/qe ... ik.bme.hu/
and compiled with a simple
Code:
configure --target-list=ppc-softmmu
and the result was basically the same with hardfloat=true and hardfloat=false (for the G4 CPU):
Code:
Last login: Sun May 12 22:17:34 on console
Welcome to Darwin!
tim-cooks-power-mac-g4-agp-graphics:~ timcook$ cd ansibench
ansibench ansibench.tar.gz
tim-cooks-power-mac-g4-agp-graphics:~ timcook$ cd ansibench/
.git coremark hint mk stream utilities
README.md dhrystone linpack nbench tripforce whetstone
tim-cooks-power-mac-g4-agp-graphics:~ timcook$ cd ansibench/linpack/
tim-cooks-power-mac-g4-agp-graphics:~/ansibench/linpack timcook$ bin/linpacksp
Enter array size (q to quit) [100]:
LINPACK benchmark, Single precision.
Machine precision: 6 digits.
Array size 100 X 100.
Memory required: 40K.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
8 0.63 90.48% 7.94% 1.59% 2279.570
16 1.24 92.74% 5.65% 1.61% 2316.940
32 2.51 90.84% 4.78% 4.38% 2355.556
64 5.00 89.20% 6.20% 4.60% 2370.370
128 10.00 90.00% 5.80% 4.20% 2360.473
Enter array size (q to quit) [100]:
LINPACK benchmark, Single precision.
Machine precision: 6 digits.
Array size 100 X 100.
Memory required: 40K.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
8 0.62 88.71% 4.84% 6.45% 2436.774
16 1.25 88.00% 8.00% 4.00% 2355.554
32 2.54 90.16% 6.30% 3.54% 2307.482
64 5.04 89.68% 6.15% 4.17% 2340.927
128 10.00 89.40% 5.90% 4.70% 2372.857
Enter array size (q to quit) [100]: 1000
LINPACK benchmark, Single precision.
Machine precision: 6 digits.
Array size 1000 X 1000.
Memory required: 3914K.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
1 72.01 98.90% 0.60% 0.50% 2340.079
Enter array size (q to quit) [100]: q
tim-cooks-power-mac-g4-agp-graphics:~/ansibench/linpack timcook$ bin/linpackdp
Enter array size (q to quit) [100]:
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 100 X 100.
Memory required: 79K.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
8 0.67 94.03% 4.48% 1.49% 2141.414
16 1.37 88.32% 8.03% 3.65% 2141.414
32 2.72 90.44% 7.35% 2.21% 2125.313
64 5.45 91.74% 6.06% 2.20% 2121.326
128 10.88 92.83% 5.33% 1.84% 2117.353
Enter array size (q to quit) [100]: 1000
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Memory required: 7824K.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
1 80.34 99.07% 0.59% 0.35% 2094.263
Enter array size (q to quit) [100]: n
Too small.
Enter array size (q to quit) [100]: q
tim-cooks-power-mac-g4-agp-graphics:~/ansibench/linpack timcook$
Compiled 9.0.0 just to be sure - and performance is back to 10-15 Mflops ...
Note, this is i686 (!) build by gcc 11.2.0
I tried configuring 6.2.0 with these parameters:
Code:
--extra-cflags="-I/usr/X11R7/include -O3 -march=i686 -mtune=native -m32 -Wno-maybe-uninitialized -Wno-nested-externs -Wno-implicit-function-declaration"
but the result was basically the same. (Or are those parameters no longer accepted? QEMU's build system has used meson/ninja for some time, and this line dates from the 5.x.x days ...)
There is another, incomplete series from 2022:
https://www.mail-archive.com/qemu-devel ... 7756.html
but I have not tried it.
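As a sanity check on the tables above: the KFLOPS column can be reproduced from the other columns if one assumes that the benchmark factors a matrix of order n = (array size) / 2, counts 2n^3/3 + 2n^2 operations per solve, does two solves per repetition (rolled and unrolled), and divides by the time spent in DGEFA and DGESL only. These assumptions are inferred from the posted figures, not quoted from the ansibench source, so treat this Python sketch as a consistency check rather than a description of the code. The function name is made up for illustration.

```python
def linpack_kflops(array_size, reps, seconds, dgefa_pct, dgesl_pct):
    """Recompute KFLOPS from one result row under the assumptions stated above."""
    n = array_size // 2                               # assumed matrix order
    ops_per_solve = 2.0 * n**3 / 3.0 + 2.0 * n**2     # factor + solve operation count
    busy = seconds * (dgefa_pct + dgesl_pct) / 100.0  # time outside OVERHEAD
    return 2.0 * reps * ops_per_solve / (1000.0 * busy)


# (array size, reps, time, DGEFA %, DGESL %, reported KFLOPS), copied from the posts above.
ROWS = [
    (1000, 1, 72.01, 98.90, 0.60, 2340.079),   # 6.2.0, single precision
    (1000, 1, 80.34, 99.07, 0.59, 2094.263),   # 6.2.0, double precision
    (1000, 1, 11.39, 96.58, 0.53, 15159.735),  # 9.0 git, double precision
    (100, 64, 0.98, 74.49, 4.08, 14683.983),   # 9.0 git, double precision
]

for size, reps, secs, dgefa, dgesl, reported in ROWS:
    computed = linpack_kflops(size, reps, secs, dgefa, dgesl)
    diff_pct = 100.0 * (computed - reported) / reported
    print(f"{size:5d} x{reps:<3d} reported {reported:10.3f}  "
          f"computed {computed:10.3f}  ({diff_pct:+.3f}%)")
```

The recomputed values differ from the reported ones only by the rounding of the printed times and percentages, so the 6.2.0 and 9.0 figures are at least internally consistent.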
Re: I tried to bench qemu's FP performance ....
Posted: Mon May 13, 2024 7:02 am
by Andrew_R
Also, strictly in the spirit of "because it still works", I tried using the Mac OS X 10.4.11 XDarwin (XFree86 4.4.0) server to display a program we maintain, Cinelerra-GG.
I used the "-ac" flag to disable access control on the X server side (because I am too lazy to set up xauth), and on the host side I used e16 as the window manager:
Code:
DISPLAY=":1" starte16
Xlib: extension "RANDR" missing on display ":1.0".
X connection to :1.0 broken (explicit kill or server shutdown).
For a small (320x240) video it was even quite fast! Sadly, this program is quite FP intensive AND has a few places in the code where the /proc filesystem is queried for 1) the number of CPUs and 2) the name of the executable. I think I have found a solution to both problems; I am just waiting on MacPorts to fix the pulseaudio build ...
https://build.macports.org/builders/por ... lds/201811
Re: I tried to bench qemu's FP performance ....
Posted: Mon May 13, 2024 6:17 pm
by adespoton
Andrew_R wrote: Mon May 13, 2024 7:02 am
Also, strictly in the spirit of "because it still works", I tried using the Mac OS X 10.4.11 XDarwin (XFree86 4.4.0) server to display a program we maintain, Cinelerra-GG.
I used the "-ac" flag to disable access control on the X server side (because I am too lazy to set up xauth), and on the host side I used e16 as the window manager:
Code:
DISPLAY=":1" starte16
Xlib: extension "RANDR" missing on display ":1.0".
X connection to :1.0 broken (explicit kill or server shutdown).
For a small (320x240) video it was even quite fast! Sadly, this program is quite FP intensive AND has a few places in the code where the /proc filesystem is queried for 1) the number of CPUs and 2) the name of the executable. I think I have found a solution to both problems; I am just waiting on MacPorts to fix the pulseaudio build ...
https://build.macports.org/builders/por ... lds/201811
Let us know how that goes. There have been a number of FPU improvements in QEMU from 5.x to 9.x, but there are still a few issues with screamer emulation on 10.x, and also with audio emulation pre-Lion on x86. On the x86 side, OpenCore can be used to tweak some of the items to help, but on PPC we're dealing with the mac99 virtual machine, so if you find potential fixes, we should get them added to the default machine definition.
Re: I tried to bench qemu's FP performance ....
Posted: Mon May 13, 2024 7:59 pm
by mcayland
Andrew_R wrote: Mon May 13, 2024 6:28 am
Compiled 9.0.0 just to be sure - and performance is back to 10-15 Mflops ...
Note, this is i686 (!) build by gcc 11.2.0
Note that i686 builds give much worse performance than x86_64 builds because i686 has only half as many host CPU registers as x86_64 (I believe on i686 only 5 or 6 host CPU registers are available to the JIT). So I'd be interested to see the results of the same benchmark on an x86_64 build.
Re: I tried to bench qemu's FP performance ....
Posted: Tue May 14, 2024 1:48 pm
by Andrew_R
mcayland wrote: Mon May 13, 2024 7:59 pm
Andrew_R wrote: Mon May 13, 2024 6:28 am
Compiled 9.0.0 just to be sure - and performance is back to 10-15 Mflops ...
Note, this is i686 (!) build by gcc 11.2.0
Note that i686 builds give much worse performance than x86_64 builds because i686 has only half as many host CPU registers as x86_64 (I believe on i686 only 5 or 6 host CPU registers are available to the JIT). So I'd be interested to see the results of the same benchmark on an x86_64 build.
Code:
LINPACK benchmark, Single precision.
Machine precision: 6 digits.
Array size 1000 X 1000.
Memory required: 3914K.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
1 8.16 96.57% 0.61% 2.82% 21143.342
2 16.39 97.13% 0.67% 2.20% 20919.111
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Memory required: 7824K.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
1 6.10 95.74% 0.66% 3.61% 28514.739
2 11.83 97.21% 0.59% 2.20% 28983.002
This is on a Sandy Bridge laptop, QEMU 8.2.3, Slackware -current x86-64:
Code:
inxi
CPU: dual core Intel Core i5-2450M (-MT MCP-) speed/min/max: 1174/800/3100 MHz
Kernel: 6.1.62 x86_64 Up: 15m Mem: 1.13/5.72 GiB (19.7%) Storage: 654.77 GiB (90.8% used)
Procs: 176 Shell: Bash inxi: 3.3.34
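For a rough comparison with the earlier i686 run, the 1000 x 1000 figures can be set side by side. Note that the two runs differ in host CPU (FX-4300 vs. i5-2450M) and QEMU version (9.0 git vs. 8.2.3) as well as in build target, so the ratio below mixes all three effects and does not isolate i686 vs. x86_64. A minimal Python sketch using only numbers already posted in this thread:

```python
from statistics import mean

# KFLOPS values for array size 1000, copied from the posts above.
I686_DP = [15159.735, 15159.735]    # FX-4300 host, QEMU 9.0 git, i686 build
X86_64_DP = [28514.739, 28983.002]  # i5-2450M host, QEMU 8.2.3, x86_64 build
X86_64_SP = [21143.342, 20919.111]  # same x86_64 setup, single precision

dp_ratio = mean(X86_64_DP) / mean(I686_DP)
sp_over_dp = mean(X86_64_SP) / mean(X86_64_DP)

print(f"x86_64 DP / i686 DP: {dp_ratio:.2f}x")
print(f"x86_64 SP / x86_64 DP: {sp_over_dp:.2f}")
```

Double precision is a bit under twice as fast in the x86_64 setup, and single precision is still slower than double precision there, so that oddity does not appear to be specific to the i686 build.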
Re: I tried to bench qemu's FP performance ....
Posted: Wed May 15, 2024 5:22 pm
by Andrew_R
More benchmark data, using the Sandy Bridge laptop running x86-64 Slackware as the host:
Code:
Results 2.91
System Info
Xbench Version 1.3
System Version 10.4.11 (8S165)
Physical RAM 512 MB
Model PowerMac3,1
Processor PowerPC G4 @ 900 MHz
Version 7400 (Max) v2.9
L1 Cache 32K (instruction), 32K (data)
Bus Frequency 100 MHz
Drive Type QEMU HARDDISK
CPU Test 1.57
GCD Loop 144.54 7.62 Mops/sec
Floating Point Basic 1.67 39.66 Mflop/sec
AltiVec Basic 165.34 6.59 Gflop/sec
vecLib FFT 0.46 15.02 Mflop/sec
Floating Point Library 2.68 465.83 Kops/sec
Thread Test 19.53
Computation 91.16 1.85 Mops/sec, 4 threads
Lock Contention 10.94 470.58 Klocks/sec, 4 threads
Memory Test 17.62
System 16.41
Allocate 7.06 25.94 Kalloc/sec
Fill 38.81 1886.94 MB/sec
Copy 64.52 1332.71 MB/sec
Stream 19.03
Copy 66.12 1365.71 MB/sec [altivec]
Scale 13.28 274.31 MB/sec [altivec]
Add 17.97 382.87 MB/sec [altivec]
Triad 15.61 333.84 MB/sec [altivec]
Quartz Graphics Test 6.65
Line 9.12 607.11 lines/sec [50% alpha]
Rectangle 4.72 1.41 Krects/sec [50% alpha]
Circle 6.07 494.94 circles/sec [50% alpha]
Bezier 8.06 203.37 beziers/sec [50% alpha]
Text 7.05 440.71 chars/sec
OpenGL Graphics Test 3.39
Spinning Squares 3.39 4.29 frames/sec
User Interface Test 2.28
Elements 2.28 10.48 refresh/sec
Disk Test 32.76
Sequential 20.59
Uncached Write 11.68 7.17 MB/sec [4K blocks]
Uncached Write 56.80 32.14 MB/sec [256K blocks]
Uncached Read 11.67 3.42 MB/sec [4K blocks]
Uncached Read 186.75 93.86 MB/sec [256K blocks]
Random 80.11
Uncached Write 29.29 3.10 MB/sec [4K blocks]
Uncached Write 95.22 30.48 MB/sec [256K blocks]
Uncached Read 299.79 2.12 MB/sec [4K blocks]
Uncached Read 511.67 94.94 MB/sec [256K blocks]
Everyone likes it when their memory copy speed is that big.
Xbench website:
http://xbench.com/