QEMU PowerPC emulation is S L O W

About Qemu-system-ppc, a PPC Mac emulator for Windows, macOS and Linux that can run Mac OS 9.0 up to Mac OS X 10.5

Moderators: Cat_7, Ronald P. Regensburg

Cat_7
Expert User
Posts: 6145
Joined: Fri Feb 13, 2004 8:59 am
Location: Sittard, The Netherlands

Re: QEMU PowerPC emulation is S L O W

Post by Cat_7 »

The emulation is quite a bit slower in Windows when compared to Linux/OSX.
Not much you can do. Perhaps try emulating a G3 CPU with -cpu G3.
But PearPC is even slower ;-)

Some improvements are on the way. I saw a ~10% disk speed increase with some queued patches for IDE/DMA transfers.

Best,
Cat_7
adespoton
Forum All-Star
Posts: 4227
Joined: Fri Nov 27, 2009 5:11 am
Location: Emaculation.com

Post by adespoton »

The ide/dma patches definitely speed things up; we've still got to do something about the FPU emulation, which causes knock-on slowdown effects in other processes and code. ProgrammingKid's single instruction patch provided significant performance improvements -- but he aimed this solely at the audio emulation. We probably have similar issues with displayPDF emulation, since we're not enabling hardware acceleration at this time.
mcayland
Mac Mechanic
Posts: 152
Joined: Sun Nov 01, 2015 10:33 pm

Post by mcayland »

From personal experience, I know that Windows is considerably slower than Linux/OS X for disk IO.

An anecdote from when I used to use WinXP as my desktop: we used to check out and build SVN trees at work for deployment - probably a project in the MBs with a reasonable number of files. Initially I was doing checkouts on my laptop which would take ~40s.

We then moved to CI and started checking out and building the same tree on a Linux box - the same checkout process took 4s. Yes, Windows was 10 times *slower* than Linux at disk IO. I spent some time on IRC with the SVN folks and found some interesting results:

1) Disabling the anti-virus halved the checkout time from 40s to 20s
2) NTFS was considerably slower than FAT32 (reduced checkout time to 12s)

I also believe that the Windows threading primitives weren't as efficient as their POSIX counterparts, but that wasn't something I was interested enough to benchmark at that point in time.

In short, if you like to experiment I'd expect you'd get the best results storing your qcow2 files on a separate FAT32 partition with anti-virus disabled (and also on a volume without encryption if enabled). Then again that was around 10 years ago so possibly things have improved in this area, but that was pretty much the point where I switched over to Linux as my primary desktop...
adespoton
Forum All-Star
Posts: 4227
Joined: Fri Nov 27, 2009 5:11 am
Location: Emaculation.com

Post by adespoton »

Actually, that raises a really good point: when I had on-access virus scanning enabled on the disk images, even on a Mac with an Apple SSD, disk I/O was slow. I excluded the images from on-access scanning, and saw I/O double. So on Windows, check to see what other software is getting involved in each disk read/write; this could be a big part of your speed issues. Being in a recognized Linux VM, the host OS probably just leaves those reads/writes to the hypervisor and doesn't handle the I/O much itself.
mcayland
Mac Mechanic
Posts: 152
Joined: Sun Nov 01, 2015 10:33 pm

Post by mcayland »

For disk IO I have a beta patch to add support for PCI virtio devices for OpenBIOS, so once this is in place all you'd need is for someone to write a virtio driver for OS 9/X and that would give you accelerated disk IO similar to how MOL currently works :)
MetalSnake
Granny Smith
Posts: 120
Joined: Fri Nov 09, 2007 5:42 pm

Post by MetalSnake »

Zacchi4k wrote: UPDATE:
QEMU on Linux (Lubuntu 16.04) (Virtual Machine): 1' 20.46''
QEMU on Windows (10 Pro): 1' 17.43''

So, as a recap, QEMU for Linux is just slightly faster than on Windows

Slightly slower, you mean? Or am I misinterpreting the numbers?
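
Reading both figures as elapsed times for the same task (lower is better, which the quoted post does not state explicitly), the comparison can be worked out directly. A minimal sketch in C; the helper names are made up for illustration:

```c
#include <assert.h>

/* Convert a stopwatch reading (minutes, seconds) to seconds. */
static double to_seconds(int minutes, double seconds)
{
    assert(minutes >= 0 && seconds >= 0.0 && seconds < 60.0);
    return minutes * 60.0 + seconds;
}

/* Percentage by which run b finished sooner than run a
   (negative if b took longer). */
static double percent_faster(double a_seconds, double b_seconds)
{
    assert(a_seconds > 0.0);
    return (a_seconds - b_seconds) / a_seconds * 100.0;
}
```

With the quoted numbers, to_seconds(1, 20.46) is 80.46 s for the Linux VM and to_seconds(1, 17.43) is 77.43 s for Windows, so percent_faster(80.46, 77.43) comes to roughly 3.8: the Windows run finished about 3 seconds ahead.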
adespoton
Forum All-Star
Posts: 4227
Joined: Fri Nov 27, 2009 5:11 am
Location: Emaculation.com

Post by adespoton »

I think it's in his GitHub repo now, but it's a beta patch; other things might break.
Programmingkid
Apple Corer
Posts: 243
Joined: Sun Jan 31, 2016 6:01 pm

Post by Programmingkid »

Emulation is a slow business. With an efficiency of about 10% we can't expect too much speed. I did come up with a theory on how to increase qemu-system-ppc's speed on x86 hardware. Right now PowerPC instructions are translated to Tiny Code Generator (TCG) operations, then to the host instruction set. The host instruction set for probably 95% of machines running QEMU is going to be x86. This is what it looks like:

PowerPC -> Tiny Code Generator -> x86.

This picture will give you a good idea just how much overhead we are dealing with:
Image

This is a three-phase process that has quite a lot of overhead. What I think would speed things up for us is changing over to a two-phase system. To this:

PowerPC -> x86

Seeing a speed-up of 30 to 50% would definitely be possible with this setup. There are way more x86 instructions than PowerPC instructions. This means that it might be possible to come close to replacing each PowerPC instruction with one or a couple of x86 instructions.

There are reasons why no one has tried this. One reason is that assembly language is not easy to learn and use. Learning the assembly language of one CPU is hard enough. Mastering the assembly language of both PowerPC and x86 would be quite a challenge for any programmer.

Another reason may be that translating PowerPC to x86 instructions is like trying to put a left shoe on a right foot. The shoe might be the right size, but just not the right shape. In other words, the equivalent x86 instructions may not be close enough to the PowerPC instructions to use.

This leads me to another theory. What if we just translated PowerPC instructions into C implementations. Something like this:

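#include <stdint.h>

uint32_t gpr[32];   /* guest general-purpose registers */

/* and rA, rS, rB: rA = rS & rB */
void ppc_and(int ra, int rs, int rb)
{
    gpr[ra] = gpr[rs] & gpr[rb];
}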
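
For a function like the one above to be called, something first has to pull the opcode and register numbers out of the raw 32-bit instruction word. The following is only a sketch of that step, not QEMU code: cpu_state, step and op31_table are hypothetical names, a directly indexed table stands in for the hash, and only the register forms of and/xor (primary opcode 31, extended opcodes 28 and 316) are handled, ignoring the record (Rc) bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guest state: 32 general-purpose registers only. */
typedef struct {
    uint32_t gpr[32];
} cpu_state;

typedef void (*handler_fn)(cpu_state *cpu, uint32_t insn);

/* Register fields as laid out in a 32-bit PowerPC instruction word. */
static unsigned field_s(uint32_t insn) { return (insn >> 21) & 31; }
static unsigned field_a(uint32_t insn) { return (insn >> 16) & 31; }
static unsigned field_b(uint32_t insn) { return (insn >> 11) & 31; }

/* and rA, rS, rB (Rc bit ignored in this sketch) */
static void op_and(cpu_state *cpu, uint32_t insn)
{
    cpu->gpr[field_a(insn)] = cpu->gpr[field_s(insn)] & cpu->gpr[field_b(insn)];
}

/* xor rA, rS, rB (Rc bit ignored in this sketch) */
static void op_xor(cpu_state *cpu, uint32_t insn)
{
    cpu->gpr[field_a(insn)] = cpu->gpr[field_s(insn)] ^ cpu->gpr[field_b(insn)];
}

/* Extended opcodes under primary opcode 31; unlisted slots stay NULL. */
static const handler_fn op31_table[1024] = {
    [28]  = op_and,
    [316] = op_xor,
};

/* Execute one instruction word. Returns 0 on success, -1 if the
   instruction has no handler in this sketch. */
static int step(cpu_state *cpu, uint32_t insn)
{
    unsigned primary = insn >> 26;
    unsigned extended = (insn >> 1) & 1023;

    assert(cpu != NULL);
    if (primary != 31 || op31_table[extended] == NULL)
        return -1;
    op31_table[extended](cpu, insn);
    return 0;
}
```

For example, the word 0x7CE64038 encodes and r6, r7, r8; passing it to step() stores gpr[7] & gpr[8] in gpr[6], and any word the table does not know returns -1.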

As soon as we see a PowerPC instruction, we just hash the instruction to a C function and execute it. This would be a simple system. Why not use it? Because no code optimizations would be performed. Tiny Code Generator can remove code from the instruction queue if it isn't needed. I think it could see a bunch of repeated code that had no side effects worth keeping and remove it. Something like this:

and r6, r7, r8
and r6, r7, r8
and r6, r7, r8
and r6, r7, r8
and r6, r7, r8
and r6, r7, r8
and r6, r7, r8
and r6, r7, r8

could be optimized to just this:

and r6, r7, r8
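
A toy version of that kind of cleanup, to make the idea concrete. This is not how TCG's optimizer is implemented; insn3 and drop_repeats are hypothetical, and the sketch assumes instructions with no side effects beyond writing their destination register (no condition-register update, no memory access):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-register instruction: dst = src1 OP src2. */
typedef struct {
    int opcode;
    int dst, src1, src2;
} insn3;

static int same_insn(const insn3 *x, const insn3 *y)
{
    return x->opcode == y->opcode && x->dst == y->dst &&
           x->src1 == y->src1 && x->src2 == y->src2;
}

/* Drop an instruction when it exactly repeats the last kept one and
   does not overwrite its own inputs. Compacts the array in place and
   returns the new length. */
static size_t drop_repeats(insn3 *code, size_t n)
{
    size_t out = 0;

    assert(code != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int pure = code[i].dst != code[i].src1 &&
                   code[i].dst != code[i].src2;
        if (out > 0 && pure && same_insn(&code[out - 1], &code[i]))
            continue;   /* would only recompute the value already in dst */
        code[out++] = code[i];
    }
    return out;
}
```

Applied to the eight identical and r6, r7, r8 lines above, it leaves a single instruction. A sequence such as add r6, r6, r8 repeated twice is left alone, because the second copy reads the result of the first.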

So what now? There is still one thing we haven't tried that looks very promising: the multithreaded Tiny Code Generator (MTTCG). This allows more host CPU cores to be used when emulating the guest processor. I'm pretty sure Power instructions have been made to use MTTCG. We just have to fix PowerPC to use it.

From: https://wiki.qemu.org/Features/tcg-multithread

Porting a guest architecture
Before MTTCG can be enabled for a guest the following changes must be made.
• Correctly translate atomic/exclusive instructions (see tcg_gen_atomic_)
• Ensure the translation step correctly handles barrier instructions (tcg_gen_mb)
• Define TCG_GUEST_DEFAULT_MO
• Audit instructions that modify system state
    • generally this means taking BQL (e.g. HELPER(set_cp_reg))
• Audit MMU management functions
    • cputlb provides an API for various tlb_flush_FOO operations
• Audit power/reset sequences
    • see for example target/arm/arm-powerctl.c
The work queue API async_[safe_]run_on_cpu provides a mechanism for one vCPU to queue work on another.
Once this work is done your final patch can update configure and enable TARGET_SUPPORTS_MTTCG
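
The first item on that list is easier to picture with an example. PowerPC's exclusive pair is lwarx/stwcx. (load word and reserve, store word conditional), and one common way to emulate such a pair on a host is with a compare-and-swap. The sketch below uses C11 atomics to show the idea; it is not taken from QEMU and does not use the tcg_gen_* API, and vcpu_reservation, load_reserve and store_conditional are hypothetical names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vCPU reservation state. */
typedef struct {
    _Atomic uint32_t *reserve_addr;  /* NULL when no reservation is held */
    uint32_t reserve_val;            /* value seen by the last load-reserve */
} vcpu_reservation;

/* Load-and-reserve: remember the address and the value that was read. */
static uint32_t load_reserve(vcpu_reservation *r, _Atomic uint32_t *addr)
{
    assert(r != NULL && addr != NULL);
    r->reserve_addr = addr;
    r->reserve_val = atomic_load(addr);
    return r->reserve_val;
}

/* Store-conditional: succeeds only if the reservation is for this
   address and memory still holds the value that was loaded. */
static bool store_conditional(vcpu_reservation *r, _Atomic uint32_t *addr,
                              uint32_t new_val)
{
    bool ok = false;

    assert(r != NULL && addr != NULL);
    if (r->reserve_addr == addr) {
        uint32_t expected = r->reserve_val;
        ok = atomic_compare_exchange_strong(addr, &expected, new_val);
    }
    r->reserve_addr = NULL;   /* the reservation is consumed either way */
    return ok;
}
```

Note the approximation: a compare-and-swap only checks that the value is unchanged, so a write that puts the same value back between the load and the store goes unnoticed, whereas real hardware would drop the reservation.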

Any volunteers to do this?
adespoton
Forum All-Star
Posts: 4227
Joined: Fri Nov 27, 2009 5:11 am
Location: Emaculation.com

Post by adespoton »

No time on my part to do this, but I do have something else we should keep in mind:
ARM emulation is likely to become a bigger thing in the future, with people using qemu on chromebooks, tablets, and possibly even phones. There's also the possibility that Apple will move to their own ARM-based chips at some point in the future.

So while optimizing for IAx86 and/or AMD-64 translation is a good idea, we have to make sure we don't optimize ourselves into a corner.

That said there's some stuff left to TCG that it seems to me would be better served with compile-time flags in qemu. After all, you're not going to be running qemu-system-ppc to emulate ARM, and you're not going to be running an AMD-64 build on an ARM host either.

So I agree that taking out that middle step and making it a compile-time switch makes more sense.

But then we come to the asm-level implementation... PPC<->ARM isn't too difficult, as they use similar register logic. But modern x86 uses a complex pipelining process and dedicated registers that just doesn't map all that well. Basically, RISC systems allow for more flexibility, and all of those edge cases have to be taken into account when translating to the limited x86 instruction set. Going the other direction is only an issue because the x86 pipelines are so heavily optimized that you'll see a performance hit trying to execute most x86 routines in PPC/ARM.

What may be needed is a compromise: the most commonly used PPC instructions are pretty much a 1:1 mapping to x86. So we could do the direct translation without too much difficulty, and just fall back to the current 3-step process if we hit an "unrecognized" instruction set (we'd have to pipeline and go by sets to get any real optimization, instead of by individual instructions).
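
One way to picture that compromise, deciding per block rather than per instruction: check whether every instruction in a block has a direct mapping, and only then take the short path. The names below are hypothetical, and the list of "directly mappable" opcodes (and, xor, or under primary opcode 31) is an assumption picked for illustration, not a measured set:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { PATH_DIRECT, PATH_GENERIC } translate_path;

/* Hypothetical list of instructions with a hand-written direct mapping. */
static bool has_direct_mapping(uint32_t insn)
{
    if ((insn >> 26) != 31)
        return false;
    switch ((insn >> 1) & 1023) {
    case 28:    /* and */
    case 316:   /* xor */
    case 444:   /* or  */
        return true;
    default:
        return false;
    }
}

/* Take the direct path only if the whole block qualifies; a single
   unrecognized instruction sends the block down the generic path. */
static translate_path choose_path(const uint32_t *block, size_t n)
{
    assert(block != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!has_direct_mapping(block[i]))
            return PATH_GENERIC;
    }
    return PATH_DIRECT;
}
```

The point of deciding at block level is that the two paths never have to be mixed inside one translated block, at the cost of losing the short path whenever a block contains even one unusual instruction.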

Other than those, I agree that MTTCG is a good option, as most modern cores are multithreaded themselves. We'd probably see more improvement than just distributing the work between available cores, as some instructions could be processed quickly without blocking on the more intensive but less time sensitive instructions running in a different thread.
Programmingkid
Apple Corer
Posts: 243
Joined: Sun Jan 31, 2016 6:01 pm

Post by Programmingkid »

adespoton wrote: No time on my part to do this, but I do have something else we should keep in mind:
ARM emulation is likely to become a bigger thing in the future, with people using qemu on chromebooks, tablets, and possibly even phones. There's also the possibility that Apple will move to their own ARM-based chips at some point in the future.

I have thought about that myself. There were some CPU tests showing the newest iPad's ARM-based CPU actually beating the Mac Pro's x86 CPU. Currently the only thing that I think comes close to an ARM desktop would be the Raspberry Pi 3.
adespoton wrote: So I agree that taking out that middle step and making it a compile-time switch makes more sense.

But then we come to the asm-level implementation... PPC<->ARM isn't too difficult, as they use similar register logic. But modern x86 uses a complex pipelining process and dedicated registers that just doesn't map all that well. Basically, RISC systems allow for more flexibility, and all of those edge cases have to be taken into account when translating to the limited x86 instruction set. Going the other direction is only an issue because the x86 pipelines are so heavily optimized that you'll see a performance hit trying to execute most x86 routines in PPC/ARM.

This is why implementing PowerPC instructions in C would probably be best. The compiler can do all the hard work for us.
adespoton wrote: Other than those, I agree that MTTCG is a good option, as most modern cores are multithreaded themselves. We'd probably see more improvement than just distributing the work between available cores, as some instructions could be processed quickly without blocking on the more intensive but less time sensitive instructions running in a different thread.

Maybe we should try both methods. MTTCG sounds like a really good idea and some of the benchmarks I have seen look really promising. Starting to eliminate Tiny Code Generator sounds really good also.

For eliminating Tiny Code Generator, I suggest we start with a proof-of-concept replacement instruction. My vote goes to ADD and SUBF. They are really easy to understand and implement in C. A simple test program can be made to verify that these instructions are working correctly. Implementing them in x86 assembly language might sound like a good idea, but there are problems with that approach that I think make it not worth the trouble. A C implementation might be fast enough.
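
As a starting point, the two instructions and the kind of checks a test program would make could look like this. It is a sketch for a 32-bit guest covering only the plain forms; the variants that also update XER or CR0 (addo, add., subfo, subf.) are left out, and the function names are made up:

```c
#include <assert.h>
#include <stdint.h>

/* add rD, rA, rB: rD = rA + rB, wrapping modulo 2^32. */
static uint32_t ppc_add(uint32_t ra, uint32_t rb)
{
    return (uint32_t)(ra + rb);
}

/* subf rD, rA, rB ("subtract from"): rD = rB - rA, which the
   architecture defines as ~rA + rB + 1. */
static uint32_t ppc_subf(uint32_t ra, uint32_t rb)
{
    return (uint32_t)(~ra + rb + 1u);
}

/* The checks a simple test program might run. */
static void check_add_subf(void)
{
    assert(ppc_add(2u, 3u) == 5u);
    assert(ppc_add(0xFFFFFFFFu, 1u) == 0u);      /* wraps around */
    assert(ppc_subf(3u, 10u) == 7u);             /* 10 - 3 */
    assert(ppc_subf(10u, 3u) == 0xFFFFFFF9u);    /* 3 - 10, wrapped */
}
```

Note the operand order: subf rD, rA, rB computes rB - rA, not rA - rB, which is an easy thing to get backwards in a first implementation.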

For those who want to know about CPU pipelining:
https://www.youtube.com/watch?v=eVRdfl4zxfI
adespoton
Forum All-Star
Posts: 4227
Joined: Fri Nov 27, 2009 5:11 am
Location: Emaculation.com

Post by adespoton »

ADD should be simple enough to do in ASM as well; or we could start with XOR which is essentially the same on both platforms, other than assumptions under x86 as to what registers are being referenced. And XOR is of course the elemental instruction... would make it relatively easy to try both ASM and C and see what the impact is.

My guess however is that for x86, C is probably going to be better, as the x86 pipelines expect C structures and optimize them; ASM wouldn't inherit any of those optimizations (but would already be pretty fast).

Useful references:
https://www.ibm.com/support/knowledgece ... ctions.htm

https://www.aldeid.com/wiki/X86-assembl ... ctions/xor
darthnvader
Mac Mechanic
Posts: 178
Joined: Sun Feb 07, 2016 4:40 pm

Post by darthnvader »

Well, running the Mac OS is one thing, running LinuxPPC is something else altogether.

It seems that I'm running on a 10 MHz system.
adespoton
Forum All-Star
Posts: 4227
Joined: Fri Nov 27, 2009 5:11 am
Location: Emaculation.com

Post by adespoton »

If you do an instruction trace in the monitor, can you narrow down which instructions are causing the slowdown?
darthnvader
Mac Mechanic
Posts: 178
Joined: Sun Feb 07, 2016 4:40 pm

Post by darthnvader »

Just a little testing on my end confirms what we have already found: FP and Vec performance is the bottleneck.

Integer performance is pretty good.

I was going to work on this after I got work done on GFX performance, but I may switch gears and try and get better CPU speed, then go back to GFX.

Really, the Summer of Code a few years back was the only time anyone cared much about running the Mac OS in Qemu. They got it to work, but it really didn't go beyond that. Judging from most of the work I see on the Qemu-PPC mailing list, the typical use case is KVM on actual Power architecture hosts.

I don't think anyone that is active in coding for Qemu-PPC really cares about emulating on x86 hosts.
Cat_7
Expert User
Posts: 6145
Joined: Fri Feb 13, 2004 8:59 am
Location: Sittard, The Netherlands

Post by Cat_7 »

The quirks still left after the GSOC were ironed out by Mark Cave-Ayland, and the vga/network updates were mainly done by Ben Herrenschmidt. There is no big focus, but Mark will get back to qemu-ppc in the new year. There are many small, but important issues to be fixed.

An overview of current issues is in this doc: https://docs.google.com/spreadsheets/d/ ... edit#gid=0

I understood Mark was aiming for a talk with Ben about the future of qemu-ppc. I haven't heard whether that talk took place, and thus do not know what (if anything) came out of it. But I gathered Mark was considering aiming at making the mac99p model (so with PMU) default.

Best,
Cat_7
darthnvader
Mac Mechanic
Posts: 178
Joined: Sun Feb 07, 2016 4:40 pm

Post by darthnvader »

Cat_7 wrote: A new Qemu build for Windows has landed. It seems to be showing considerable speed improvements ;-)
Possibly due to performance enhancements in the TCG just included in the source code.

MacBench 3.0 results in Mac OS 9.2:

From 2.11 (yesterday) to 2.12 pre (15-12-2017)
Processor: 137 > 195
Floating point: 43 > 51
Disk mix: 78 > 140
Graphics mix: 126 > 160

Find it here: viewtopic.php?f=34&t=9028

Under 10.4.11 with SkidMarksGT I got:

Int: 88 -> 109
FP: 9 -> 10
Vec: 7 -> 7

Under 9.2.2 with MacBench 5.0:

Processor: 129 -> 184
FP: 52 -> 57
Disk: 188 -> 242
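
To put the MacBench and SkidMarksGT figures on a common footing, the relative gains can be computed as below. This assumes higher scores are better, as the quoted post implies; percent_gain is a made-up helper:

```c
#include <assert.h>

/* Relative change from an old score to a new one, in percent.
   Positive means the new score is higher. */
static double percent_gain(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

For the quoted MacBench 3.0 run this gives about 42% for Processor (137 to 195), 19% for floating point (43 to 51), 79% for disk (78 to 140) and 27% for graphics (126 to 160). The SkidMarksGT Vec score (7 to 7) shows no change.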