The time is now: Sun Dec 17, 2017 8:18 am

PostPosted: Thu Oct 13, 2016 3:17 pm 
Offline
Student Driver

Joined: Tue Sep 23, 2014 12:00 pm
Posts: 22
QEMU PowerPC emulation is SLOW, at least on my PC. This goes for the "classic" qemu-system-ppc.exe, for qemu-system-ppc64.exe, and for the qemu-system-ppc-wip.exe from this forum.
Is it a known problem or is it just something with my computer?

Here are some specs:
CPU: AMD Athlon X4 860k Quad Core @3.7GHz
RAM: 8GB DDR3
OS: Windows 10 Pro x64


PostPosted: Thu Oct 13, 2016 4:31 pm 
Offline
Expert User
User avatar

Joined: Fri Feb 13, 2004 8:59 am
Posts: 4205
Location: Sittard, The Netherlands
The emulation is quite a bit slower in Windows when compared to Linux/OSX.
Not much you can do. Perhaps try emulating a G3 cpu with -cpu G3
But PearPC is even slower ;-)

Some improvements are on the way. I saw a ~10% disk speed increase with some lined up patches for ide/dma transfers.

Best,
Cat_7


PostPosted: Thu Oct 13, 2016 5:31 pm 
Offline
Forum All-Star
User avatar

Joined: Fri Nov 27, 2009 5:11 am
Posts: 1864
The ide/dma patches definitely speed things up; we've still got to do something about the FPU emulation, which causes knock-on slowdown effects in other processes and code. ProgrammingKid's single instruction patch provided significant performance improvements -- but he aimed this solely at the audio emulation. We probably have similar issues with displayPDF emulation, since we're not enabling hardware acceleration at this time.


PostPosted: Thu Oct 13, 2016 9:04 pm 
Offline
Student Driver

Joined: Tue Sep 23, 2014 12:00 pm
Posts: 22
Cat_7 wrote:
The emulation is quite a bit slower in Windows when compared to Linux/OSX.

Why is that? I (miraculously) managed to successfully understand how to use those darn makefiles and build a Linux version of Qemu, and I'm going to try it in Lubuntu tomorrow, so I can tell if I see any performance increase.
Cat_7 wrote:
Not much you can do. Perhaps try emulating a G3 cpu with -cpu G3
But PearPC is even slower ;-)

Uhmm, PearPC is still faster on my PC, even if I set both to emulate a G3 processor.


PostPosted: Fri Oct 14, 2016 1:01 pm 
Offline
Student Driver

Joined: Tue Sep 23, 2014 12:00 pm
Posts: 22
Okay, I just tried Qemu on a Lubuntu virtual machine, and it's actually faster on there than on my host OS.


PostPosted: Fri Oct 14, 2016 6:04 pm 
Offline
Granny Smith

Joined: Sun Nov 01, 2015 10:33 pm
Posts: 104
From personal experience, I know that Windows is considerably slower than Linux/OS X for disk IO.

An anecdote from when I used to use WinXP as my desktop: we used to check out and build SVN trees at work for deployment - probably a project in the MBs with a reasonable number of files. Initially I was doing checkouts on my laptop which would take ~40s.

We then moved to CI and started checking out and building the same tree on a Linux box - the same checkout process took 4s. Yes, Windows was 10 times *slower* than Linux at disk IO. I spent some time on IRC with the SVN folks and found some interesting results:

1) Disabling the anti-virus halved the checkout time from 40s to 20s
2) NTFS was considerably slower than FAT32 (reduced checkout time to 12s)

I also believe that the Windows threading primitives weren't as efficient as their POSIX counterparts, but that wasn't something I was interested enough to benchmark at that point in time.

In short, if you like to experiment I'd expect you'd get the best results storing your qcow2 files on a separate FAT32 partition with anti-virus disabled (and also on a volume without encryption if enabled). Then again that was around 10 years ago so possibly things have improved in this area, but that was pretty much the point where I switched over to Linux as my primary desktop...


PostPosted: Fri Oct 14, 2016 7:31 pm 
Offline
Forum All-Star
User avatar

Joined: Fri Nov 27, 2009 5:11 am
Posts: 1864
Actually, that raises a really good point: when I had on-access virus scanning enabled on the disk images, even on a Mac with an Apple SSD, disk I/O was slow. I excluded the images from on-access scanning, and saw I/O double. So on Windows, check to see what other software is getting involved in each disk read/write; this could be a big part of your speed issues. Being in a recognized Linux VM, the host OS probably just leaves those read/writes to the hypervisor and doesn't handle the I/O much itself.


PostPosted: Sat Oct 15, 2016 1:24 pm 
Offline
Student Driver

Joined: Tue Sep 23, 2014 12:00 pm
Posts: 22
UPDATE:
I've measured how long it took every "incarnation" of QEMU + PearPC to boot Mac OS X Tiger 10.4, from starting the program to when the Dock pops up, all from the same disk image. Here's what I found out:
QEMU on Linux (Lubuntu 16.04) (Virtual Machine): 1' 20.46''
QEMU on Windows (10 Pro): 1' 17.43''
PearPC on Windows (10 Pro): 37''
QEMU WIP wouldn't finish even after 2' 30''
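
To put the timings above side by side, the arithmetic can be sketched in a few lines of C. This is only an illustration: the helper names to_seconds and slowdown are made up here, and the inputs are just the boot times listed above.

```c
#include <assert.h>

/* Convert a minutes/seconds boot time into plain seconds.
 * to_seconds is a hypothetical helper, not part of QEMU or PearPC. */
double to_seconds(int minutes, double seconds)
{
    assert(minutes >= 0 && seconds >= 0.0);
    return minutes * 60.0 + seconds;
}

/* How many times longer run `a` took than run `b`. */
double slowdown(double a, double b)
{
    assert(b > 0.0);
    return a / b;
}
```

With these, to_seconds(1, 20.46) gives 80.46 s and to_seconds(1, 17.43) gives 77.43 s, so against PearPC's 37 s both QEMU runs took roughly 2.1 to 2.2 times as long to reach the Dock.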

So, as a recap, QEMU for Linux is just slightly faster than on Windows, but PearPC is definitely the fastest. However, it lacks proper network support and even support for changing CDs (even if the site claims otherwise, I just couldn't manage to get it to work).
Is there any hope that QEMU or QEMU WIP will be any faster in the future?


PostPosted: Sun Oct 23, 2016 3:32 pm 
Offline
Granny Smith

Joined: Sun Nov 01, 2015 10:33 pm
Posts: 104
For disk IO I have a beta patch to add support for PCI virtio devices for OpenBIOS, so once this is in place all you'd need is for someone to write a virtio driver for OS 9/X and that would give you accelerated disk IO similar to how MOL currently works :)


PostPosted: Mon Oct 24, 2016 12:30 pm 
Offline
Tinkerer

Joined: Fri Nov 09, 2007 5:42 pm
Posts: 98
Zacchi4k wrote:
UPDATE:
QEMU on Linux (Lubuntu 16.04) (Virtual Machine): 1' 20.46''
QEMU on Windows (10 Pro): 1' 17.43''

So, as a recap, QEMU for Linux is just slightly faster than on Windows


Slightly slower, you mean? Or am I misinterpreting the numbers?


PostPosted: Tue Nov 15, 2016 11:43 am 
Offline
Student Driver

Joined: Tue Sep 23, 2014 12:00 pm
Posts: 22
LOL what a gaffe I made xD


PostPosted: Tue Nov 15, 2016 11:44 am 
Offline
Student Driver

Joined: Tue Sep 23, 2014 12:00 pm
Posts: 22
mcayland wrote:
For disk IO I have a beta patch to add support for PCI virtio devices for OpenBIOS, so once this is in place all you'd need is for someone to write a virtio driver for OS 9/X and that would give you accelerated disk IO similar to how MOL currently works :)

Where can I get this?


PostPosted: Tue Nov 15, 2016 6:15 pm 
Offline
Forum All-Star
User avatar

Joined: Fri Nov 27, 2009 5:11 am
Posts: 1864
I think it's in his github repo now, but it's a beta patch; other things might break.


PostPosted: Fri Nov 17, 2017 12:59 am 
Offline
Apple Corer

Joined: Sun Jan 31, 2016 6:01 pm
Posts: 225
Emulation is a slow business. With an efficiency of about 10% we can't expect too much speed. I did come up with a theory on how to increase qemu-system-ppc's speed on x86 hardware. Right now PowerPC instructions are translated to Tiny Code Generator (TCG) operations, then to the host instruction set. The host instruction set for probably 95% of machines running QEMU is going to be x86. Visually, it looks like this:

PowerPC -> Tiny Code Generator -> x86.

This picture will give you a good idea just how much overhead we are dealing with:
Image

This is a three phase process that has quite a lot of overhead. What I think would speed things up for us is changing over to a two phase system. To this:

PowerPC -> x86

Seeing a speed-up of 30 to 50% would definitely be possible with this setup. There are way more x86 instructions than PowerPC instructions. This means that it might be possible to come close to replacing each PowerPC instruction with one or a couple of x86 instructions.

There are reasons why no one has tried this. One reason is assembly language is not easy to learn and use. So learning the assembly language of one CPU would be hard. Mastering the assembly language of both PowerPC and x86 would be quite a challenge for any programmer.

Another reason may be that translating PowerPC to x86 instructions might be like trying to put a left shoe on a right foot. The shoe might be the right size, but just not the right shape. In other words, the equivalent x86 instructions may not be close enough to the PowerPC instructions to use.

This leads me to another theory. What if we just translated PowerPC instructions into C implementations? Something like this:

/* and rD, rA, rB */
void and (int rd, int ra, int rb)
{
....
}
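
Fleshed out a little, the one-C-function-per-instruction idea might look like the sketch below. Everything in it is an assumption made for illustration: the GuestCPU struct, the handler names and the opcode numbers are invented and have nothing to do with QEMU's real code or the real PowerPC encodings.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical guest state: 32 general-purpose registers. */
typedef struct {
    uint32_t gpr[32];
} GuestCPU;

/* One C function per guest instruction (names are illustrative). */
typedef void (*OpHandler)(GuestCPU *cpu, int rd, int ra, int rb);

static void op_and(GuestCPU *cpu, int rd, int ra, int rb)
{
    cpu->gpr[rd] = cpu->gpr[ra] & cpu->gpr[rb];
}

static void op_xor(GuestCPU *cpu, int rd, int ra, int rb)
{
    cpu->gpr[rd] = cpu->gpr[ra] ^ cpu->gpr[rb];
}

/* Made-up opcode numbers, not the real PowerPC encodings. */
enum { OP_AND = 0, OP_XOR = 1, OP_COUNT };

/* Table lookup stands in for "hashing" an instruction to a function. */
static const OpHandler handlers[OP_COUNT] = { op_and, op_xor };

/* Look up the handler for an opcode and run it.
 * Returns 0 on success, -1 for an opcode with no handler. */
int execute(GuestCPU *cpu, int opcode, int rd, int ra, int rb)
{
    assert(rd >= 0 && rd < 32 && ra >= 0 && ra < 32 && rb >= 0 && rb < 32);
    if (opcode < 0 || opcode >= OP_COUNT)
        return -1;
    handlers[opcode](cpu, rd, ra, rb);
    return 0;
}
```

A call such as execute(&cpu, OP_AND, 6, 7, 8) then plays the role of "and r6, r7, r8". Note that every guest instruction pays for a table lookup and an indirect call, which is the kind of overhead a translator tries to avoid.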

As soon as we see a PowerPC instruction, we just hash the instruction to a C function and execute it. This would be a simple system. Why not use it? There would be no code optimizations performed. Tiny code generator can remove code from the instruction queue if it isn't needed. I think it could see a bunch of repeated code that had no side effects worth keeping and remove them. Something like this:

and r6, r7, r8
and r6, r7, r8
and r6, r7, r8
and r6, r7, r8
and r6, r7, r8
and r6, r7, r8
and r6, r7, r8
and r6, r7, r8

could be optimized to just this:

and r6, r7, r8
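
As a rough idea of what such a clean-up pass does, here is a small sketch in C. It is not how TCG's optimizer is written; the Insn struct and drop_repeats are hypothetical, and the pass assumes the instructions have no side effects beyond writing the destination register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoded three-register instruction. */
typedef struct {
    int opcode;   /* made-up opcode number */
    int rd;       /* destination register */
    int ra, rb;   /* source registers */
} Insn;

static int same_insn(const Insn *x, const Insn *y)
{
    return x->opcode == y->opcode && x->rd == y->rd &&
           x->ra == y->ra && x->rb == y->rb;
}

/* Drop an instruction when it is identical to the one just kept and
 * does not read its own destination, so running it again cannot
 * change any register. Assumes no other side effects (no condition
 * register or memory updates).
 * Compacts the array in place and returns the new length. */
size_t drop_repeats(Insn *code, size_t n)
{
    assert(code != NULL || n == 0);
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        const Insn *cur = &code[i];
        int reads_dest = (cur->rd == cur->ra || cur->rd == cur->rb);
        if (out > 0 && !reads_dest && same_insn(cur, &code[out - 1]))
            continue;            /* redundant repeat, skip it */
        code[out++] = *cur;
    }
    return out;
}
```

Fed the eight identical "and r6, r7, r8" instructions from above, this leaves a single one. A repeated instruction that reads its own destination is kept, because the sketch does not try to reason about whether repeating it is harmless.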

So what now? There is still one thing we haven't tried that looks very promising. That is multithreaded tiny code generator (mttcg). This allows for more host CPU cores to be used with emulating the guest processor. I'm pretty sure Power instructions have been made to use mttcg. We just have to fix PowerPC to use it.

From: https://wiki.qemu.org/Features/tcg-multithread

Porting a guest architecture
Before MTTCG can be enabled for a guest the following changes must be made.
• Correctly translate atomic/exclusive instructions (see tcg_gen_atomic_)
• Ensure the translation step correctly handles barrier instructions (tcg_gen_mb)
• Define TCG_GUEST_DEFAULT_MO
• Audit instructions that modify system state
    • generally this means taking BQL (e.g. HELPER(set_cp_reg))
• Audit MMU management functions
    • cputlb provides an API for various tlb_flush_FOO operations
• Audit power/reset sequences
    • see for example target/arm/arm-powerctl.c
The work queue API async_[safe_]run_on_cpu provides a mechanism for one vCPU to queue work on another.
Once this work is done your final patch can update configure and enable TARGET_SUPPORTS_MTTCG.
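
To give a feel for the first item in that list: on PowerPC the atomic/exclusive instructions are the lwarx/stwcx. pair (load with reservation, store conditional). One way an emulator could model them with several host threads is a compare-and-swap on the host, sketched below with C11 atomics. This is my own illustration, not code from QEMU; the Reservation struct and both function names are invented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vCPU reservation state. */
typedef struct {
    _Atomic uint32_t *addr;  /* reserved guest word, NULL if none */
    uint32_t value;          /* value seen by the load-reserve */
} Reservation;

/* Load-reserve: remember the address and the value that was read. */
uint32_t load_reserve(Reservation *r, _Atomic uint32_t *addr)
{
    assert(r != NULL && addr != NULL);
    r->addr = addr;
    r->value = atomic_load(addr);
    return r->value;
}

/* Store-conditional: succeed only if the reservation is for this
 * address and the word still holds the value that was loaded. */
bool store_conditional(Reservation *r, _Atomic uint32_t *addr, uint32_t newval)
{
    assert(r != NULL && addr != NULL);
    bool ok = false;
    if (r->addr == addr) {
        uint32_t expected = r->value;
        ok = atomic_compare_exchange_strong(addr, &expected, newval);
    }
    r->addr = NULL;   /* the reservation is consumed either way */
    return ok;
}
```

The compare-and-swap only checks that the value is unchanged, so a write that puts the same value back goes unnoticed. That is a known limit of this kind of approximation compared to a real hardware reservation.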

Any volunteers to do this?


PostPosted: Fri Nov 17, 2017 6:00 pm 
Offline
Forum All-Star
User avatar

Joined: Fri Nov 27, 2009 5:11 am
Posts: 1864
No time on my part to do this, but I do have something else we should keep in mind:
ARM emulation is likely to become a bigger thing in the future, with people using qemu on chromebooks, tablets, and possibly even phones. There's also the possibility that Apple will move to their own ARM-based chips at some point in the future.

So while optimizing for IAx86 and/or AMD-64 translation is a good idea, we have to make sure we don't optimize ourselves into a corner.

That said there's some stuff left to TCG that it seems to me would be better served with compile-time flags in qemu. After all, you're not going to be running qemu-system-ppc to emulate ARM, and you're not going to be running an AMD-64 build on an ARM host either.

So I agree that taking out that middle step and making it a compile-time switch makes more sense.

But then we come to the asm-level implementation... PPC<->ARM isn't too difficult, as they use similar register logic. But modern x86 uses a complex pipelining process and dedicated registers that just don't map all that well. Basically, RISC systems allow for more flexibility, and all of those edge cases have to be taken into account when translating to the limited x86 instruction set. Going the other direction is only an issue because the x86 pipelines are so heavily optimized that you'll see a performance hit trying to execute most x86 routines in PPC/ARM.

What may be needed is a compromise: the most commonly used PPC instructions are pretty much a 1:1 mapping to x86. So we could do the direct translation without too much difficulty, and just fall back to the current 3-step process if we hit an "unrecognized" instruction set (we'd have to pipeline and go by sets to get any real optimization, instead of by individual instructions).
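
The "go by sets" part of that compromise could be sketched like this: decide per block of instructions, and only take the direct path when every instruction in the block is one we know how to map. The opcode numbers, the direct_ok table and choose_path are all hypothetical; which PPC instructions really map cleanly to x86 would have to be worked out separately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Made-up opcode numbers for illustration only. */
enum { OP_ADD, OP_AND, OP_XOR, OP_FMADD, OP_COUNT };

/* Which opcodes the hypothetical direct PPC->x86 path understands. */
static const bool direct_ok[OP_COUNT] = {
    [OP_ADD] = true, [OP_AND] = true, [OP_XOR] = true, [OP_FMADD] = false
};

typedef enum { PATH_DIRECT, PATH_TCG } Path;

/* Decide per block: use the direct path only when every
 * instruction in the block is in the direct-mapped set. */
Path choose_path(const int *opcodes, size_t n)
{
    assert(opcodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int op = opcodes[i];
        if (op < 0 || op >= OP_COUNT || !direct_ok[op])
            return PATH_TCG;     /* fall back for the whole block */
    }
    return PATH_DIRECT;
}
```

A block of add/xor/and would go down the direct path, while a single unrecognized instruction anywhere in the block sends the whole block through the existing 3-step process.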

Other than those, I agree that MTTCG is a good option, as most modern cores are multithreaded themselves. We'd probably see more improvement than just distributing the work between available cores, as some instructions could be processed quickly without blocking on the more intensive but less time sensitive instructions running in a different thread.


PostPosted: Sat Nov 18, 2017 12:58 am 
Offline
Apple Corer

Joined: Sun Jan 31, 2016 6:01 pm
Posts: 225
adespoton wrote:
No time on my part to do this, but I do have something else we should keep in mind:
ARM emulation is likely to become a bigger thing in the future, with people using qemu on chromebooks, tablets, and possibly even phones. There's also the possibility that Apple will move to their own ARM-based chips at some point in the future.

I have thought about that myself. There were some CPU tests showing the newest iPad's ARM-based CPU actually beating the MacPro's x86 CPU. Currently the only thing that I think comes close to an ARM desktop would be the Raspberry Pi 3.

adespoton wrote:
So I agree that taking out that middle step and making it a compile-time switch makes more sense.

But then we come to the asm-level implementation... PPC<->ARM isn't too difficult, as they use similar register logic. But modern x86 uses a complex pipelining process and dedicated registers that just don't map all that well. Basically, RISC systems allow for more flexibility, and all of those edge cases have to be taken into account when translating to the limited x86 instruction set. Going the other direction is only an issue because the x86 pipelines are so heavily optimized that you'll see a performance hit trying to execute most x86 routines in PPC/ARM.

This is why implementing PowerPC instructions in C would probably be best. The compiler can do all the hard work for us.

adespoton wrote:
Other than those, I agree that MTTCG is a good option, as most modern cores are multithreaded themselves. We'd probably see more improvement than just distributing the work between available cores, as some instructions could be processed quickly without blocking on the more intensive but less time sensitive instructions running in a different thread.

Maybe we should try both methods. MTTCG sounds like a really good idea and some of the benchmarks I have seen look really promising. Starting to eliminate tiny code generator sounds really good also.

For eliminating Tiny Code Generator, I suggest we start with a proof-of-concept replacement instruction. My vote goes to ADD and SUBF. They are really easy to understand and implement in C. A simple test program can be made to verify that these instructions are working correctly. Implementing them in x86 assembly language might sound like a good idea, but there are problems with that approach that I think make it not worth the trouble. A C implementation might be fast enough.
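
A first cut of those two instructions in C could look like the sketch below. The GuestRegs struct and the function names are invented for this example, and only the plain forms are covered: no record (dot) forms that update the condition register and no overflow variants.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical guest register file; not QEMU's CPU state. */
typedef struct {
    uint32_t gpr[32];
} GuestRegs;

static void check_regs(int rd, int ra, int rb)
{
    assert(rd >= 0 && rd < 32);
    assert(ra >= 0 && ra < 32);
    assert(rb >= 0 && rb < 32);
}

/* add rD, rA, rB : rD = rA + rB (32-bit wraparound) */
void ppc_add(GuestRegs *r, int rd, int ra, int rb)
{
    check_regs(rd, ra, rb);
    r->gpr[rd] = r->gpr[ra] + r->gpr[rb];
}

/* subf rD, rA, rB : "subtract from", rD = rB - rA */
void ppc_subf(GuestRegs *r, int rd, int ra, int rb)
{
    check_regs(rd, ra, rb);
    r->gpr[rd] = r->gpr[rb] - r->gpr[ra];
}
```

The thing to watch with SUBF is the operand order: it subtracts rA from rB, not the other way around. A test program would want to cover that along with the wraparound cases.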

For those who want to know about CPU pipelining:
https://www.youtube.com/watch?v=eVRdfl4zxfI


PostPosted: Sat Nov 18, 2017 7:48 am 
Offline
Forum All-Star
User avatar

Joined: Fri Nov 27, 2009 5:11 am
Posts: 1864
ADD should be simple enough to do in ASM as well; or we could start with XOR which is essentially the same on both platforms, other than assumptions under x86 as to what registers are being referenced. And XOR is of course the elemental instruction... would make it relatively easy to try both ASM and C and see what the impact is.

My guess however is that for x86, C is probably going to be better, as the x86 pipelines expect C structures and optimize them; ASM wouldn't inherit any of those optimizations (but would already be pretty fast).

Useful references:
https://www.ibm.com/support/knowledgece ... ctions.htm

https://www.aldeid.com/wiki/X86-assembl ... ctions/xor


PostPosted: Wed Dec 13, 2017 6:52 am 
Offline
Granny Smith

Joined: Sun Feb 07, 2016 4:40 pm
Posts: 117
Well, running the Mac OS is one thing, running LinuxPPC is something else altogether.

It seems that I'm running on a 10 MHz system.


PostPosted: Wed Dec 13, 2017 4:52 pm 
Offline
Forum All-Star
User avatar

Joined: Fri Nov 27, 2009 5:11 am
Posts: 1864
If you do an instruction trace in the monitor, can you narrow down which instructions are causing the slowdown?


PostPosted: Fri Dec 15, 2017 1:03 pm 
Offline
Granny Smith

Joined: Sun Feb 07, 2016 4:40 pm
Posts: 117
Just a little testing on my end confirms what we have already found: FP and Vec performance is the bottleneck.

Integer performance is pretty good.

I was going to work on this after I got work done on GFX performance, but I may switch gears and try and get better CPU speed, then go back to GFX.

Really, the Summer of Code a few years back was the only time anyone cared much about running the Mac OS in Qemu. They got it to work, but it really didn't go beyond that. From most of the work I see on the Qemu-PPC mailing list, I take it that the typical use case is KVM on actual Power architecture hosts.

I don't think anyone that is active in coding for Qemu-PPC really cares about emulating on x86 hosts.


PostPosted: Fri Dec 15, 2017 1:46 pm 
Offline
Expert User
User avatar

Joined: Fri Feb 13, 2004 8:59 am
Posts: 4205
Location: Sittard, The Netherlands
The quirks still left after the GSOC were ironed out by Mark Cave-Ayland, and the vga/network updates were mainly done by Ben Herrenschmidt. There is no big focus, but Mark will get back to qemu-ppc in the new year. There are many small, but important issues to be fixed.

An overview of current issues is in this doc: https://docs.google.com/spreadsheets/d/ ... edit#gid=0

I understood Mark was aiming for a talk with Ben about the future of qemu-ppc. I haven't heard whether that talk took place, and thus do not know what (if anything) came out of it. But I gathered Mark was considering aiming at making the mac99p model (so with PMU) default.

Best,
Cat_7


PostPosted: Sat Dec 16, 2017 4:25 pm 
Offline
Granny Smith

Joined: Sun Feb 07, 2016 4:40 pm
Posts: 117
Cat_7 wrote:
A new Qemu build for Windows has landed. It seems to be showing considerable speed improvements ;-)
Possibly due to performance enhancements in the TCG just included in the source code.

MacBench 3.0 results in Mac OS 9.2:

From 2.11 (yesterday) to 2.12 pre (15-12-2017)
Processor: 137 > 195
Floating point: 43 > 51
Disk mix: 78 > 140
Graphics mix: 126 > 160

Find it here: viewtopic.php?f=34&t=9028


Under 10.4.11 with SkidMarksGT I got:

Int: 88->109
FP: 9->10
Vec: 7->7

Under 9.2.2 with MacBench 5.0:

Processor: 129->184
FP: 52->57
Disk: 188->242

