I tried to bench qemu's FP performance ....

About Qemu-system-ppc, a PPC Mac emulator for Windows, macOS and Linux that can run Mac OS 9.0 up to Mac OS X 10.5

Andrew_R
Inquisitive Elf
Posts: 31
Joined: Tue Jul 06, 2021 1:32 am

I tried to bench qemu's FP performance ....

Post by Andrew_R »

Command line:
qemu-system-ppc -m 1024 -M mac99,via=pmu -hda ~/QEMU/osx-tiger_10.4.11_installed-compressed.qcow -cpu G4 -boot c -accel tcg,tb-size=256 -g 1368x768x32 -display sdl,gl=on -device intel-hda -device hda-duplex -nic user,hostfwd=tcp:127.0.0.1:6001-:6000 -cdrom ~/ISO/ansib_bench.iso
Sadly, Mac OS X 10.4.11 does not seem to know how to deal with a pure ISO 9660 image without Joliet, so I just FTP'ed the archive into the guest.

Benchmark used: https://github.com/nfinit/ansibench

Host:
inxi
CPU: quad core AMD FX-4300 (-MT MCP-) speed/min/max: 3522/1400/3800 MHz
Kernel: 6.1.44-x64 x86_64 Up: 6d 23h 11m Mem: 5581.6/15996.4 MiB (34.9%)
Storage: 931.51 GiB (99.5% used) Procs: 265 Shell: Bash inxi: 3.3.12
qemu version: git, just slightly after the 9.0 release:
git commit 77bcaf5f222fb19667738dc2ca7dec6172d69db7
Results:
bin/linpackdp
Enter array size (q to quit) [100]: 1000
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Memory required: 7824K.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
1 11.39 96.58% 0.53% 2.90% 15159.735

Enter array size (q to quit) [100]: 1000
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 1000 X 1000.
Memory required: 7824K.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
1 11.38 96.57% 0.62% 2.81% 15159.735

Enter array size (q to quit) [100]:
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 100 X 100.
Memory required: 79K.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.98 74.49% 4.08% 21.43% 14683.983
128 1.95 76.41% 6.67% 16.92% 13958.848
256 3.88 75.52% 6.44% 18.04% 14222.222
512 7.78 80.72% 5.40% 13.88% 13500.498
1024 15.52 77.38% 6.31% 16.30% 13926.610

Enter array size (q to quit) [100]: q

bin/linpacksp
Enter array size (q to quit) [100]:
LINPACK benchmark, Single precision.
Machine precision: 6 digits.
Array size 100 X 100.
Memory required: 40K.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
32 0.69 81.16% 4.35% 14.49% 9581.921
64 1.39 77.70% 5.04% 17.27% 9831.885
128 2.80 77.86% 6.07% 16.07% 9622.692
256 5.63 83.13% 4.26% 12.61% 9192.416
512 11.21 80.29% 5.53% 14.18% 9402.642

Enter array size (q to quit) [100]: q
So single precision is SLOWER than double precision on the emulated G4?! (Maybe gcc autovectorizes by default, and AltiVec emulation gets accelerated on an AVX-capable host CPU?)

I wonder how the older hardfloat patches behaved. Anyone still have qemu with them applied around?
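
For anyone who wants to cross-check the KFLOPS column in the tables above, here is a minimal Python sketch. The formula is an assumption inferred from the printed numbers, not checked against the ansibench source: it assumes the benchmark solves an n x n system with n equal to half the entered array size, uses the usual LINPACK operation count 2n^3/3 + 2n^2, counts each repetition twice (rolled and unrolled), and divides by the DGEFA + DGESL share of the time. Under those assumptions it reproduces the reported values to within rounding of the percentages.

```python
def linpack_kflops(array_size, reps, total_time_s, dgefa_pct, dgesl_pct):
    """Recompute the KFLOPS column from one row of the benchmark table."""
    # Assumption: the solved system is n x n with n = array size / 2.
    n = array_size // 2
    # Usual LINPACK operation count for factorization plus solve.
    ops = 2.0 * n**3 / 3.0 + 2.0 * n**2
    # Only the DGEFA + DGESL share of the wall time counts as solve time.
    solve_time_s = total_time_s * (dgefa_pct + dgesl_pct) / 100.0
    # Assumption: factor 2 because each rep runs once rolled, once unrolled.
    return 2.0 * reps * ops / (1000.0 * solve_time_s)


# Rows copied from the double-precision tables above:
# (array size, reps, time, DGEFA %, DGESL %, reported KFLOPS)
rows = [
    (1000, 1, 11.39, 96.58, 0.53, 15159.735),
    (100, 64, 0.98, 74.49, 4.08, 14683.983),
]

for size, reps, time_s, fa, sl, reported in rows:
    estimate = linpack_kflops(size, reps, time_s, fa, sl)
    print(f"size={size:4d} reps={reps:3d} "
          f"recomputed={estimate:10.1f} reported={reported:10.1f}")
```

If the assumptions hold, the "1000 X 1000" runs are really timing a 500 x 500 solve, which is worth keeping in mind when comparing against LINPACK figures from other tools.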
Cat_7
Expert User
Posts: 6557
Joined: Fri Feb 13, 2004 8:59 am
Location: Sittard, The Netherlands

Re: I tried to bench qemu's FP performance ....

Post by Cat_7 »

anyone still have qemu with them applied around ..?
Sure, but only for Windows and macOS hosts:
MacOS Qemu 7.1: https://surfdrive.surf.nl/files/index.p ... s/download
Windows Qemu 6.2: https://surfdrive.surf.nl/files/index.p ... 2/download

Please note both also have the screamer audio enabled which may slow down the emulation. Both also have 60Hz screen refresh.

Best,
Cat_7
Andrew_R
Inquisitive Elf
Posts: 31
Joined: Tue Jul 06, 2021 1:32 am

Re: I tried to bench qemu's FP performance ....

Post by Andrew_R »

Cat_7 wrote: Sun May 12, 2024 7:51 pm
anyone still have qemu with them applied around ..?
Sure, but only for windows and macOS hosts:
MacOS Qemu 7.1: https://surfdrive.surf.nl/files/index.p ... s/download
Windows Qemu 6.2: https://surfdrive.surf.nl/files/index.p ... 2/download

Please note both also have the screamer audio enabled which may slow down the emulation. Both also have 60Hz screen refresh.

Best,
Cat_7
Well, this is really strange!

I got 6.2.0 from https://download.qemu.org/

Applied the patch from https://patchwork.kernel.org/project/qe ... ik.bme.hu/

Compiled with a simple

configure --target-list=ppc-softmmu

and the result was basically the same with hardfloat=true and hardfloat=false (for the G4 CPU):

Last login: Sun May 12 22:17:34 on console
Welcome to Darwin!
tim-cooks-power-mac-g4-agp-graphics:~ timcook$ cd ansibench
ansibench         ansibench.tar.gz  
tim-cooks-power-mac-g4-agp-graphics:~ timcook$ cd ansibench/
.git       coremark   hint       mk         stream     utilities  
README.md  dhrystone  linpack    nbench     tripforce  whetstone  
tim-cooks-power-mac-g4-agp-graphics:~ timcook$ cd ansibench/linpack/
tim-cooks-power-mac-g4-agp-graphics:~/ansibench/linpack timcook$ bin/linpacksp 
Enter array size (q to quit) [100]:  
LINPACK benchmark, Single precision.
Machine precision:  6 digits.
Array size 100 X 100.
Memory required:  40K.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
       8   0.63  90.48%   7.94%   1.59%   2279.570
      16   1.24  92.74%   5.65%   1.61%   2316.940
      32   2.51  90.84%   4.78%   4.38%   2355.556
      64   5.00  89.20%   6.20%   4.60%   2370.370
     128  10.00  90.00%   5.80%   4.20%   2360.473

Enter array size (q to quit) [100]:  
LINPACK benchmark, Single precision.
Machine precision:  6 digits.
Array size 100 X 100.
Memory required:  40K.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
       8   0.62  88.71%   4.84%   6.45%   2436.774
      16   1.25  88.00%   8.00%   4.00%   2355.554
      32   2.54  90.16%   6.30%   3.54%   2307.482
      64   5.04  89.68%   6.15%   4.17%   2340.927
     128  10.00  89.40%   5.90%   4.70%   2372.857

Enter array size (q to quit) [100]:  1000
LINPACK benchmark, Single precision.
Machine precision:  6 digits.
Array size 1000 X 1000.
Memory required:  3914K.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
       1  72.01  98.90%   0.60%   0.50%   2340.079

Enter array size (q to quit) [100]:  q
tim-cooks-power-mac-g4-agp-graphics:~/ansibench/linpack timcook$ bin/linpackdp 
Enter array size (q to quit) [100]:  
LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 100 X 100.
Memory required:  79K.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
       8   0.67  94.03%   4.48%   1.49%   2141.414
      16   1.37  88.32%   8.03%   3.65%   2141.414
      32   2.72  90.44%   7.35%   2.21%   2125.313
      64   5.45  91.74%   6.06%   2.20%   2121.326
     128  10.88  92.83%   5.33%   1.84%   2117.353

Enter array size (q to quit) [100]:  1000
LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 1000 X 1000.
Memory required:  7824K.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
       1  80.34  99.07%   0.59%   0.35%   2094.263

Enter array size (q to quit) [100]:  n
Too small.
Enter array size (q to quit) [100]:  q
tim-cooks-power-mac-g4-agp-graphics:~/ansibench/linpack timcook$ 

I compiled 9.0.0 just to be sure, and performance is back to 10-15 Mflops ...

Note that this is an i686 (!) build made with gcc 11.2.0.

I tried configuring 6.2.0 with these parameters:

--extra-cflags="-I/usr/X11R7/include -O3 -march=i686 -mtune=native -m32 -Wno-maybe-uninitialized -Wno-nested-externs -Wno-implicit-function-declaration"

but the result was basically the same. (Or are those parameters not accepted anymore? qemu's build system has used meson/ninja for some time, and this line dates from the 5.x.x days ...)

There is another, incomplete series from 2022:
https://www.mail-archive.com/qemu-devel ... 7756.html
but I have not tried it.
Andrew_R
Inquisitive Elf
Posts: 31
Joined: Tue Jul 06, 2021 1:32 am

Re: I tried to bench qemu's FP performance ....

Post by Andrew_R »

Also, strictly on the "because it still works" principle, I tried using the Mac OS X 10.4.11 XDarwin (XFree86 4.4.0) server to display a program we maintain, Cinelerra-GG.

I used the "-ac" flag to disable access control on the X server side (because I am too lazy for xauth), and on the host side I used e16 as the window manager:

DISPLAY=":1" starte16
Xlib:  extension "RANDR" missing on display ":1.0".
X connection to :1.0 broken (explicit kill or server shutdown).

For a small (320x240) video it was even quite fast! Sadly, this program is quite FP-intensive AND has a few places in the code where the /proc filesystem is queried for 1) the number of CPUs and 2) the name of the executable. I think I found a solution to both problems; I'm just waiting on MacPorts to fix the pulseaudio build ...

https://build.macports.org/builders/por ... lds/201811
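
A side note on why the `hostfwd=tcp:127.0.0.1:6001-:6000` rule from the command line in the first post pairs with `DISPLAY=":1"` here: by X11 convention, display N listens on TCP port 6000 + N, so host display 1 maps to host port 6001, which QEMU forwards to port 6000 (display 0) in the guest. Below is a minimal sketch of that mapping. The helper name is made up for illustration, and whether a bare `:1` actually goes over TCP rather than a local socket depends on the X client library; `127.0.0.1:1` would be the explicit TCP form.

```python
X11_BASE_PORT = 6000  # X11 convention: display N listens on TCP port 6000 + N


def display_to_tcp_port(display):
    """Map an X11 DISPLAY string such as ':1', ':1.0' or 'host:1' to its TCP port.

    Hypothetical helper for illustration; it ignores IPv6 and other exotic forms.
    """
    # Everything after the last ':' is "<display>[.<screen>]".
    _host, _sep, rest = display.rpartition(":")
    display_number = int(rest.split(".", 1)[0])
    return X11_BASE_PORT + display_number


for display in (":1", ":1.0", "127.0.0.1:1", ":0"):
    print(f"DISPLAY={display!r} -> TCP port {display_to_tcp_port(display)}")
```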
adespoton
Forum All-Star
Posts: 4727
Joined: Fri Nov 27, 2009 5:11 am
Location: Emaculation.com

Re: I tried to bench qemu's FP performance ....

Post by adespoton »

Andrew_R wrote: Mon May 13, 2024 7:02 am also, strictly from "because it still works" I tried to use MacOS X 10.4.11 Xdarwin (Xfree 4.4.0) server) for displaying program we maintain - Cinelerra-GG:

I used "-ac" flag for disabling access control on xserver side (because I am too lazy for xauth) and on host side I used e16 as window manager:

DISPLAY=":1" starte16
Xlib:  extension "RANDR" missing on display ":1.0".
X connection to :1.0 broken (explicit kill or server shutdown).

For small (320x240) vid it was even quite fast! Sadly, this program is quite FP intensive AND does have few places in code where /proc filesystem queried for 1) number of cpus and 2) name of executable. i think I found solution to both problems, just waiting on Macports to fix pulseaudio build ...

https://build.macports.org/builders/por ... lds/201811
Let us know how that goes... there have been a number of FPU improvements in QEMU from 5.x to 9.x, but there are still a few issues with screamer emulation on 10.x, and also with audio emulation pre-Lion on x86. On the x86 side, OpenCore can be used to tweak some of the items to help, but on PPC we're dealing with the mac99 virtual machine, so if you find potential fixes, we should get them added to the default machine definition.
mcayland
Mac Mechanic
Posts: 155
Joined: Sun Nov 01, 2015 10:33 pm

Re: I tried to bench qemu's FP performance ....

Post by mcayland »

Andrew_R wrote: Mon May 13, 2024 6:28 am Compiled 9.0.0 just to be sure - and performance is back to 10-15 Mflops ...

Note, this is i686 (!) build by gcc 11.2.0
Note that i686 builds give much worse performance than x86_64 builds because i686 has only half the number of host CPU registers compared to x86_64 (I believe on i686 there are only 5 or 6 host CPU registers available for the JIT). So I'd be interested to see the results of the same benchmark on an x86_64 build.
Andrew_R
Inquisitive Elf
Posts: 31
Joined: Tue Jul 06, 2021 1:32 am

Re: I tried to bench qemu's FP performance ....

Post by Andrew_R »

mcayland wrote: Mon May 13, 2024 7:59 pm
Andrew_R wrote: Mon May 13, 2024 6:28 am Compiled 9.0.0 just to be sure - and performance is back to 10-15 Mflops ...

Note, this is i686 (!) build by gcc 11.2.0
Note that i686 builds give much worse performance than x86_64 builds because i686 has only half the number of host CPU registers compared to x86_64 (I believe on i686 there are only 5 or 6 available host CPU registers available for the JIT). So I'd be interested to see the results of the same benchmark on an x86_64 build.

LINPACK benchmark, Single precision.
Machine precision:  6 digits.
Array size 1000 X 1000.
Memory required:  3914K.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
       1   8.16  96.57%   0.61%   2.82%  21143.342
       2  16.39  97.13%   0.67%   2.20%  20919.111

LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 1000 X 1000.
Memory required:  7824K.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
       1   6.10  95.74%   0.66%   3.61%  28514.739
       2  11.83  97.21%   0.59%   2.20%  28983.002


This is on a Sandy Bridge laptop, qemu 8.2.3, Slackware -current x86-64:

inxi
CPU: dual core Intel Core i5-2450M (-MT MCP-) speed/min/max: 1174/800/3100 MHz
Kernel: 6.1.62 x86_64 Up: 15m Mem: 1.13/5.72 GiB (19.7%) Storage: 654.77 GiB (90.8% used)
Procs: 176 Shell: Bash inxi: 3.3.34
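
To put the 1000 X 1000 figures quoted so far side by side, here is a small Python sketch that computes the double/single precision ratio per setup and the double precision gain between the two builds on the FX-4300. The numbers are copied from the posts above; the dictionary keys are just labels chosen for this example. The two hosts differ, so only ratios taken on the same host say anything about qemu itself.

```python
# 1000 X 1000 LINPACK results quoted in this thread, in KFLOPS.
results = {
    "6.2.0+hardfloat patch, FX-4300 host": {"sp": 2340.079, "dp": 2094.263},
    "8.2.3 on x86-64 Slackware, i5-2450M host": {"sp": 21143.342, "dp": 28514.739},
}

# Double precision result from the first post (git just after 9.0, FX-4300 host).
# No 1000 X 1000 single precision run was posted for that build.
dp_git_after_9_0 = 15159.735


def dp_over_sp(entry):
    """Ratio above 1 means double precision was faster than single precision."""
    return entry["dp"] / entry["sp"]


for label, entry in results.items():
    print(f"{label}: DP/SP = {dp_over_sp(entry):.2f}")

# Same host, same benchmark, different qemu build.
dp_gain = dp_git_after_9_0 / results["6.2.0+hardfloat patch, FX-4300 host"]["dp"]
print(f"FX-4300 host, DP: git after 9.0 vs patched 6.2.0 = {dp_gain:.2f}x")
```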

Andrew_R
Inquisitive Elf
Posts: 31
Joined: Tue Jul 06, 2021 1:32 am

Re: I tried to bench qemu's FP performance ....

Post by Andrew_R »

More benchmark data, using the Sandy Bridge laptop running x86-64 Slackware as the host:

Results	2.91	
	System Info		
		Xbench Version		1.3
		System Version		10.4.11 (8S165)
		Physical RAM		512 MB
		Model		PowerMac3,1
		Processor		PowerPC G4 @ 900 MHz
			Version		7400 (Max) v2.9
			L1 Cache		32K (instruction), 32K (data)
			Bus Frequency		100 MHz
		Drive Type		QEMU HARDDISK
	CPU Test	1.57	
		GCD Loop	144.54	7.62 Mops/sec
		Floating Point Basic	1.67	39.66 Mflop/sec
		AltiVec Basic	165.34	6.59 Gflop/sec
		vecLib FFT	0.46	15.02 Mflop/sec
		Floating Point Library	2.68	465.83 Kops/sec
	Thread Test	19.53	
		Computation	91.16	1.85 Mops/sec, 4 threads
		Lock Contention	10.94	470.58 Klocks/sec, 4 threads
	Memory Test	17.62	
		System	16.41	
			Allocate	7.06	25.94 Kalloc/sec
			Fill	38.81	1886.94 MB/sec
			Copy	64.52	1332.71 MB/sec
		Stream	19.03	
			Copy	66.12	1365.71 MB/sec [altivec]
			Scale	13.28	274.31 MB/sec [altivec]
			Add	17.97	382.87 MB/sec [altivec]
			Triad	15.61	333.84 MB/sec [altivec]
	Quartz Graphics Test	6.65	
		Line	9.12	607.11 lines/sec [50% alpha]
		Rectangle	4.72	1.41 Krects/sec [50% alpha]
		Circle	6.07	494.94 circles/sec [50% alpha]
		Bezier	8.06	203.37 beziers/sec [50% alpha]
		Text	7.05	440.71 chars/sec
	OpenGL Graphics Test	3.39	
		Spinning Squares	3.39	4.29 frames/sec
	User Interface Test	2.28	
		Elements	2.28	10.48 refresh/sec
	Disk Test	32.76	
		Sequential	20.59	
			Uncached Write	11.68	7.17 MB/sec [4K blocks]
			Uncached Write	56.80	32.14 MB/sec [256K blocks]
			Uncached Read	11.67	3.42 MB/sec [4K blocks]
			Uncached Read	186.75	93.86 MB/sec [256K blocks]
		Random	80.11	
			Uncached Write	29.29	3.10 MB/sec [4K blocks]
			Uncached Write	95.22	30.48 MB/sec [256K blocks]
			Uncached Read	299.79	2.12 MB/sec [4K blocks]
			Uncached Read	511.67	94.94 MB/sec [256K blocks]

Everyone likes it when their memory copy speed is that big :)

Xbench website - http://xbench.com/