Planar Emulation Improvements

PainDictator · Mar 28th 2023

Hi,

I really like the new planar emulation. However, it is currently of limited use, as applications that write directly to the bitplanes have issues. More practically speaking, I cannot run Nuclear War and Pirates!, which were the two sample games using system screens that I tried. Both have issues, but for different reasons. Below I address them and suggest improvements that will probably help these games (and others like them) run on emulated screens. That would be a great achievement, at least for people like me who have a VGA monitor without 15 kHz support and like to run some games from time to time using an RTG card (a GPBAII++ in my case).

So the suggested improvements are:

1.) Option to force automatic fullscreen updates

Symptom: In games like Nuclear War, the screen (320x200x5) itself can currently (3.3.3) be retargeted to a planar screen, but most updates only become visible when the mouse moves over the respective area.

Problem: The game writes directly to the bitplanes of an OS screen. Since in the emulation these are only a backbuffer for the real display, nothing causes the real display to be updated.

Idea: Provide an option (via setenv) to enable periodic updates of the actual display from the backbuffer. There are performance considerations, though, as full-screen updates may be expensive (depending on resolution, bus speed, etc.). In most cases this will IMHO be used to redirect lores screens from classic Amiga applications, so the resulting bandwidth would be 320x200x60 (~3.7 MB/s) for a classic NTSC game, which is almost within reach of a Zorro II card. PAL numbers are similar. Still, this is already a little too much, and higher resolutions make things worse. Thus, I would propose a simple implementation that would help most games (particularly the likes of Nuclear War and Pirates!, with limited on-screen changes) and probably also many applications like DPaint:

- Besides the bitplanes provided to the application, also keep a full copy of the last values written to the graphics adapter

- Update the graphics card periodically (ideally on every graphics card retrace) from the back buffer by transferring only changed data

As the local RAM of accelerators is an order of magnitude faster than the Zorro II bus, this largely eliminates unnecessary bus transfers with moderate overhead and should be easy to implement.
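To illustrate the two bullet points above, here is a minimal sketch in Python (for readability, not as real 68k code). It treats the screen as a flat byte buffer and leaves out the planar-to-chunky step entirely; `update_changed`, the 4-byte block size and the three buffers are made-up names and values for illustration only, not how P96 actually works.

```python
def update_changed(back, shadow, card, block=4):
    """Copy only changed blocks from the back buffer to the card; return bytes sent."""
    sent = 0
    for off in range(0, len(back), block):
        chunk = back[off:off + block]
        if chunk != shadow[off:off + block]:
            card[off:off + block] = chunk    # the only access that crosses the slow bus
            shadow[off:off + block] = chunk  # remember what the card now holds
            sent += len(chunk)
    return sent


WIDTH, HEIGHT = 320, 200
back = bytearray(WIDTH * HEIGHT)   # buffer the application draws into
shadow = bytearray(back)           # last state sent to the card (local RAM)
card = bytearray(back)             # stand-in for the card's video memory

# Simulate the game changing a 16x16 pixel area.
for y in range(100, 116):
    row = y * WIDTH + 40
    back[row:row + 16] = b"\x07" * 16

sent_first = update_changed(back, shadow, card)
sent_second = update_changed(back, shadow, card)  # nothing changed since
print(sent_first, sent_second, len(back))
```

In this toy case only 256 of the 64000 bytes cross the bus on the first pass and none on the second, at the price of reading and comparing the whole frame in local memory.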

Still, it may be reasonable to restrict the effect of this option to certain resolutions or (even better) applications, but for me it would still be fine if it just affects all of them (when enabled).

Another way of improving performance / reducing the required bandwidth might be an additional option to skip every other frame, so that updates typically happen at 25/30 fps. This, however, might create artefacts, e.g., under a moving mouse pointer in Nuclear War.

2.) Support for Extra Halfbrite Screen

Symptom: Pirates can only be redirected (by NewMode) to EHB screens.

Problem: RTG screens currently do not offer EHB.

Idea: It might be possible to advertise planar P96 Screens as having EHB support. To get this to work, palette handling for a 6 plane EHB screen needs to be modified to reflect the additional 32 colors.
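For reference, the palette side of this can be sketched in a few lines. In EHB mode, colours 32-63 are the half-brightness versions of colours 0-31, i.e. each 4-bit component shifted right by one. The sketch below assumes 12-bit 0xRGB palette entries as on OCS/ECS; `ehb_palette` is a hypothetical helper name, and how P96 would actually load the resulting 64 entries into the card's colour table is not covered here.

```python
def ehb_palette(base):
    """Extend 32 base colours (12-bit 0xRGB) to the 64 colours of an EHB screen."""
    if len(base) != 32:
        raise ValueError("EHB needs exactly 32 base colours")
    # Halve each 4-bit component: shift right by one and mask off the bit
    # that would otherwise leak into the neighbouring component.
    half = [(c >> 1) & 0x777 for c in base]
    return list(base) + half


base = [0x000] * 32
base[1] = 0xFFF   # white
base[2] = 0xF80   # orange
pal = ehb_palette(base)
print(len(pal), hex(pal[33]), hex(pal[34]))
```

Whenever the application changes one of the 32 base colours, the matching entry 32 positions higher would have to be refreshed as well.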

What do you think?

Regards,

Pain

Jens · Mar 29th 2023

I need to get details from Thor about the way the emulation is implemented. There may be a way to use the MMU of the accelerator to find out which 4k-block needs to be transferred to the GFX card. We've used this technique on the Graffiti back in 1996, and it was *the* key to accelerated graphics in Mac emulators.

Jens

PainDictator · Mar 29th 2023

Quote from Jens

I need to get details from Thor about the way the emulation is implemented. There may be a way to use the MMU of the accelerator to find out which 4k-block needs to be transferred to the GFX card. We've used this technique on the Graffiti back in 1996, and it was *the* key to accelerated graphics in Mac emulators.

Jens

That would be great. I thought of something similar, but did not propose it because it is more complex and requires an MMU.

Thanks for your efforts!

Jens · Mar 29th 2023

You were talking about accelerator memory, and if it's not a 68020, it's very likely that you have an MMU. While I'm not a fan of activating the MMU "just because it's there" (like the latest OS tries to push on people), I see this as an application that really benefits the user. Normally, the MMU mainly benefits developers, but this is a real-world non-developer use case, which is rare on the Amiga.

Jens

PainDictator · Mar 29th 2023

Quote from Jens

You were talking about accelerator memory, and if it's not a 68020, it's very likely that you have an MMU. While I'm not a fan of activating the MMU "just because it's there" (like the latest OS tries to push on people), I see this as an application that really benefits the user. Normally, the MMU mainly benefits developers, but this is a real-world non-developer use case, which is rare on the Amiga.

Jens

I am a big fan of the concept of using the MMU for this. Anything below a 68030 will probably struggle too much with the planar-to-chunky overhead anyway.

McTrinsic · Mar 29th 2023

I second such a request.

Star Trek: 25th Anniversary might benefit as well. The game runs on an accelerated ECS system, e.g. an Amiga with an ACA500+/ACA1233.

Unfortunately the colors are wrong and it seems it uses EHB.

Jens · Mar 30th 2023

Trouble is that the planar emulation goes through chip ram, and the MMU does not see any Blitter operations. So no, my idea of using the MMU is not usable here.

EHB is indeed possible to implement.

PainDictator · Mar 30th 2023

Quote from Jens

Trouble is that the planar emulation goes through chip ram, and the MMU does not see any Blitter operations. So no, my idea of using the MMU is not usable here.

EHB is indeed possible to implement.

Blitter: Good point!

I wanted to add anyway that, after thinking about it, the 4K page granularity would probably be rather inefficient for lores screens (which typically come down to 16-20 4K pages) with few/moderate local updates. In such a scenario, the "compare memory" approach will be more efficient w.r.t. Zorro bandwidth, which is a big issue. The comparison can go down to the word level (though dword is probably a good compromise). Still, of course, many games/applications would have benefitted more from the MMU approach. Anyway, comparing lores frame data at full frame rate requires a combined 8MB/s of fast RAM bandwidth, plus the bandwidth for the chunky data writes on the bus. This is not exactly fun, but probably doable. I would expect >32MB/s memory bandwidth from a typical 68030 accelerator, so this would ideally consume roughly 1/4 of the frame time.
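The granularity argument can be made concrete with a small calculation. The sketch below counts how many 4K pages a small on-screen change would mark dirty, compared with the bytes that actually changed. It assumes non-interleaved bitplanes stored back to back and starting on a page boundary, which is an assumption made for illustration and not something known about the emulation's memory layout; the function names are made up.

```python
PAGE = 4096


def pages_touched(width, height, planes, x, y, w, h):
    """Count the 4K pages hit by a w*h pixel change on a planar screen."""
    bytes_per_row = width // 8
    plane_size = bytes_per_row * height
    pages = set()
    for p in range(planes):
        for row in range(y, y + h):
            first = p * plane_size + row * bytes_per_row + x // 8
            last = p * plane_size + row * bytes_per_row + (x + w - 1) // 8
            pages.update(range(first // PAGE, last // PAGE + 1))
    return len(pages)


def bytes_changed(planes, x, w, h):
    """Planar bytes actually modified by the same change (byte granularity)."""
    return planes * h * ((x + w - 1) // 8 - x // 8 + 1)


# A 16x16 pixel change in the middle of a 320x200x5 screen.
dirty_pages = pages_touched(320, 200, 5, x=152, y=92, w=16, h=16)
changed = bytes_changed(5, x=152, w=16, h=16)
print(dirty_pages, dirty_pages * PAGE, changed)
```

Under these assumptions, a change of 160 planar bytes dirties 7 pages (28672 bytes), because every bitplane contributes at least one page.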

I would like to add that updating only the changes (instead of full frames) is highly relevant because, unfortunately, a lores PAL/NTSC screen AFAIK requires just a little more bandwidth for full frame rate updates (up to ~3.9 MB/s) than is available on the Zorro II bus (<3.5 MB/s). However, since most games have some static border elements, these should be enough to bring most games into the Zorro II bandwidth range.
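The numbers above can be reproduced with a quick calculation. This sketch assumes 8-bit chunky output (one byte per pixel), 320x200 at 60 Hz for NTSC and 320x256 at 50 Hz for PAL, and takes the 3.5 MB/s figure from the post as an upper bound for Zorro II rather than as a measured value.

```python
MIB = 1024 * 1024


def chunky_rate(width, height, fps, bytes_per_pixel=1):
    """Bytes per second needed to rewrite the whole chunky frame every frame."""
    return width * height * fps * bytes_per_pixel


ntsc = chunky_rate(320, 200, 60)
pal = chunky_rate(320, 256, 50)
ZORRO2_LIMIT = 3.5 * MIB  # upper bound quoted in the post, not measured

for name, rate in (("NTSC 320x200@60", ntsc), ("PAL 320x256@50", pal)):
    verdict = "fits" if rate <= ZORRO2_LIMIT else "exceeds"
    print(f"{name}: {rate / MIB:.2f} MiB/s, {verdict} Zorro II")

# Fraction of the screen that may change per frame to stay within the limit.
pal_budget = ZORRO2_LIMIT / pal
print(f"PAL budget: {pal_budget:.0%} of the frame")
```

With these assumptions both modes land slightly above the limit, so even a modest share of static screen area would bring the required rate below it.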

Also, I thought about the "half frame rate" idea and think it is probably not very useful unless the work is actually distributed across frames (which might add more overhead), because otherwise frame times for the underlying application will vary dramatically and may cause nasty timing problems. Thus, it might not be a great idea at all.

EHB: Great!

Jens · Mar 31st 2023

Quote from PainDictator

In such a scenario, the "compare memory" approach will be more efficient

Really? I mean, reading Chipram is the time-consuming part. Writing to a GFX card (as long as it's not a Z2 card) should be as fast as the compare operation. After all, comparing means you have to fetch the data for comparison anyway, so I don't see the benefit at all.

Quote from PainDictator

I would expect >32MB/s memory bandwidth from a typical 68030 accelerator,

Only on fastmem, so that would require mirroring the gfx data in fastmem while copying changes to the gfx card. Not sure if the effort pays off, as it's two instead of one write access in the copy-loop. With the small caches of a 68030, you might end up slowing down the routine to a level where "just a plain frame copy" may work better.

My concept of improving CPU-to-Chipram performance to over 20MBytes/s for the A1200 Reloaded will definitely help, but I currently don't have the time to implement that...

Jens

PainDictator · Mar 31st 2023

Quote from Jens

Really? I mean, reading Chipram is the time-consuming part. Writing to a GFX card (as long as it's not a Z2 card) should be as fast as the compare operation. After all, comparing means you have to fetch the data for comparison anyway, so I don't see the benefit at all.

Only on fastmem, so that would require mirroring the gfx data in fastmem while copying changes to the gfx card. Not sure if the effort pays off, as it's two instead of one write access in the copy-loop. With the small caches of a 68030, you might end up slowing down the routine to a level where "just a plain frame copy" may work better.

My concept of improving CPU-to-Chipram performance to over 20MBytes/s for the A1200 Reloaded will definitely help, but I currently don't have the time to implement that...

Jens

Ouch, because of the P96 context I was misled into believing that the backbuffer for the planes could be in fast RAM. Because of the Blitter, this is wrong, particularly since the blitter is a large part of the problem we want to address. Thanks for pointing that out!

Just reading the whole lores frame from chip RAM takes IMO half a frame on an AGA system (and, even worse, a frame or slightly longer on ECS/OCS), so this is largely pointless, particularly for non-AGA systems. Still, for an accelerated AGA system (that is much more than 2x faster than the standard system), there might be almost half a frame left to do the work at higher speed and catch up. So it's not total nonsense.

Admittedly, I was mostly thinking of my rusty Zorro II card in an A2000 (68030/50), so this is (also w.r.t. chip ram bandwidth) the worst case anyway. I fully agree, depending on the system configuration, the compare may be more expensive than a full frame over the Zorro III bus.

Probably the use of partial updates really comes down to the scenario where a game/application with little/moderate screen updates runs on a Zorro II RTG system (which also implies slower chip RAM), where the partial updates should be enough to make the games work. Such games probably include the two example candidates, because there are usually only minor screen updates, while full-screen updates are rare and usually isolated. Still, it is exciting to get this to work ... not because we cannot live without Pirates!, but because it sort of concludes the work on planar emulation for RTG with a notable difference in compatibility. And again, I guess it might be the last thing DPaint is missing. Getting DPaint to run properly on a P96 A2000 is still a thing.

For the sake of completeness, I might still elaborate what I think the (updated) procedure for partial updates could be:

- We need 3 Buffers:

* chip ram back buffer for the OS screen bitplanes (receiving all the graphics operations from the application including non-os blitter calls)

* double buffer for the bitplane (!) data sent to the gfx adapter (last state and current state)

Code

AT EACH Frame
    FOR EACH BlockOfPixels
        PlanarData := READ BlockOfPixels FROM ChipRamBackbuffer ; from all bitplanes, e.g., 32 pixels (a DWORD per plane)
        WRITE PlanarData TO NewFastRamBuffer
        Changed := COMPARE PlanarData TO OldFastRamBuffer ; see if block of pixels has changed
        IF Changed
            COMPUTE ChunkyData
            WRITE ChunkyData TO GFXCard
    EXCHANGE NewFastRamBuffer, OldFastRamBuffer
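For anyone who wants to play with the procedure, here is a rough executable sketch in Python. It is only meant to show the data flow, not a realistic 68k routine. It deviates from the pseudocode in one point: instead of swapping two fast RAM buffers, it keeps a single shadow copy and rewrites it only for changed blocks. All function and variable names are made up, and the conversion is the naive bit-by-bit kind rather than an optimized planar-to-chunky routine.

```python
def p2c_block(words):
    """Convert one 32-pixel block (one 32-bit word per plane) to 32 chunky bytes."""
    out = bytearray(32)
    for px in range(32):
        bit = 31 - px                      # leftmost pixel is the most significant bit
        for plane, word in enumerate(words):
            out[px] |= ((word >> bit) & 1) << plane
    return out


def update_frame(planes, old, card):
    """planes/old: one list of 32-bit words per bitplane; card: chunky bytes."""
    blocks_sent = 0
    for blk in range(len(planes[0])):
        new_words = [plane[blk] for plane in planes]    # read from the back buffer
        if new_words != [o[blk] for o in old]:          # compare with last state
            card[blk * 32:(blk + 1) * 32] = p2c_block(new_words)
            for o, w in zip(old, new_words):            # remember what was sent
                o[blk] = w
            blocks_sent += 1
    return blocks_sent


DEPTH, BLOCKS = 5, 320 * 200 // 32
planes = [[0] * BLOCKS for _ in range(DEPTH)]   # stand-in for the chip RAM bitplanes
old = [[0] * BLOCKS for _ in range(DEPTH)]      # shadow copy in fast RAM
card = bytearray(320 * 200)                     # stand-in for the card's memory

planes[0][10] = 0x80000000   # leftmost pixel of block 10: bit set in plane 0
planes[2][10] = 0x80000001   # leftmost and rightmost pixel: bit set in plane 2
first = update_frame(planes, old, card)
second = update_frame(planes, old, card)
print(first, second, card[320], card[351])
```

Only block 10 is converted and written on the first pass; the second pass finds nothing to do, but it still has to read and compare every block.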

My hope would be that an existing planar-to-chunky conversion may have room (particularly on the register side, but maybe also w.r.t. pipeline timing) to be modified to include the additional memory accesses and comparisons.

I still think that, if it is low-hanging fruit, just enabling a plain and stupid copy might already help some people. The "partial update" overhead might be too much for many systems and is only beneficial in some cases. Also, it might ruin the efficiency of an optimized planar-to-chunky conversion routine (even of a specific copy of it). Still, I think it would actually work for games like Nuclear War or Pirates! with very limited changes, but it is just overhead if an application constantly makes significant updates. However, it might help some games and maybe DPaint as well ...

Just one thing: I am only remotely familiar with the 68k MMU. Is it possible to intercept specifically accesses to the blitter registers to actually go and pull the whole thing in favour of a software emulation?

PainDictator · Mar 31st 2023

Quote from PainDictator

Just reading the whole lores frame from chip RAM takes IMO half a frame on an AGA system (and, even worse, a frame or slightly longer on ECS/OCS), so this is largely pointless, particularly for non-AGA systems. Still, for an accelerated AGA system (that is much more than 2x faster than the standard system), there might be almost half a frame left to do the work at higher speed and catch up. So it's not total nonsense. Still, for a Zorro II card it takes another full frame to update the gfx card, which could in theory be at least partially interleaved, but still is ... umm ... an issue.

Jens · Apr 2nd 2023

Quote from PainDictator

Just one thing: I am only remotely familiar with the 68k MMU. Is it possible to intercept specifically accesses to the blitter registers to actually go and pull the whole thing in favour of a software emulation?

Yes, the MMU could be used to throw an exception on chip register accesses, but that takes quite a bit of time, so games that are full frame rate will most likely drop to 25FPS or even lower.

So while "possible", I wouldn't rate it "viable". The key problem remains: Find out which parts of the screen have changed since the last update.

Jens

PainDictator · Apr 2nd 2023

Quote from Jens

Yes, the MMU could be used to throw an exception on chip register accesses, but that takes quite a bit of time, so games that are full frame rate will most likely drop to 25FPS or even lower.

So while "possible", I wouldn't rate it "viable". The key problem remains: Find out which parts of the screen have changed since the last update.

Jens

Ok, makes sense. Also, the interrupt handling overhead might in some cases already break the timing anyway, if it was possible. Also, it is probably still not possible to isolate just the individual blitter registers, so the overhead would be added to various accesses across the board. Otherwise, my idea would have been to keep a log of the blitter operations since the last update and use them in addition to MMU-monitored CPU chip memory accesses to determine the changed memory regions.
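Purely to illustrate the blitter log idea (which, as said, is probably not viable in practice), the following sketch turns a list of logged blits into dirty row ranges for one bitplane. The log format (destination offset within the plane, width in bytes, number of rows, modulo in bytes) is a simplified, hypothetical one, not the real register layout; descending mode, line mode and the CPU side are ignored.

```python
def dirty_rows(blits, bytes_per_row, height):
    """Merge logged blits into sorted, non-overlapping (first_row, last_row) ranges."""
    spans = []
    for dest, width, rows, modulo in blits:
        stride = width + modulo                    # distance between blit rows
        last_byte = dest + (rows - 1) * stride + width - 1
        first_row = dest // bytes_per_row
        last_row = min(last_byte // bytes_per_row, height - 1)
        spans.append((first_row, last_row))
    merged = []
    for first, last in sorted(spans):
        if merged and first <= merged[-1][1] + 1:  # overlaps or touches previous span
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


# (destination offset within the plane, width in bytes, rows, modulo)
log = [
    (10 * 40 + 4, 4, 16, 36),     # 32x16 pixel blit starting at row 10
    (20 * 40 + 0, 2, 8, 38),      # 16x8 pixel blit starting at row 20
    (150 * 40 + 30, 6, 10, 34),   # 48x10 pixel blit starting at row 150
]
result = dirty_rows(log, bytes_per_row=40, height=200)
print(result)
```

Here the first two blits overlap and collapse into rows 10-27, while the third stays separate as rows 150-159, so only 28 of 200 rows would need to be converted and transferred.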

BTW: Sorry for quoting my whole previous message. No idea how that happened.