A few corrections which may be useful in understanding how to get the most out of an old system, using a compatible fast CPU:
Dave wrote:
One of the neat tricks of the 68EC020 is that it can use an 8, 16 or 32-bit bus. On the fly. The address decoder can be configured to not just select devices, but also let the CPU know the width of the data bus for that access. The CPU is natively 32-bit all the time, but can freely access 16- and 8-bit areas/devices.
It goes without saying that one access to 32 bits (a long word) is four times faster than four consecutive accesses to 8 bits (a byte), and that moving contents around memory being a big chunk of what computers do, wider is better. Think 4-lane freeway versus single lane country road.
Unfortunately out of the whole 68k series only the 68020 and EC variants can do this.
The main thing is that they do it in a very interesting manner.
The way bus signalling works on these CPUs is to start a cycle and then use the cycle termination signal, which is normally supplied by external devices, to determine what width of bus has just been accessed. In other words, it FIRST does a read or write to an address, then decides what to do next depending on what the external hardware has told it about how much of the data was used or supplied. The reason this is done is to avoid the need for double signalling (first to tell the CPU what size of bus it's looking at for the address it has supplied, then to perform the actual data transfer), which would result in a significant performance penalty.
In particular, and this is quite important in achieving maximum speed, the CPU starts off assuming a 32-bit wide bus, and waits for the external hardware to tell it what really happened. This external hardware can produce 4 possible states:
1) Do nothing and wait for me to tell you otherwise
2) You have just successfully completed a 32-bit wide data transfer, start the next one
3) You have just successfully completed a 16-bit wide data transfer, if you needed to make a wider one, start a new cycle to transfer the next 16 bits.
4) You have just successfully completed an 8-bit wide data transfer, if you needed to make a wider one, start a new cycle to transfer the next 8 bits.
In this manner the CPU can access nearly any combination of addresses and bus widths, the only restriction is that addresses vs. bus width must be aligned to long word addresses. I.e. the smallest unit of address space where the bus width must be constant (8, 16 or 32-bit wide) is 4 consecutive byte addresses.
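To make the cycle counting concrete, here is a minimal Python sketch of the handshake described above: the CPU starts a cycle, the hardware reports the width it actually handled, and the CPU repeats until the whole operand has been moved. The function name and the byte-based port sizes are my own illustration, and it assumes an operand aligned to the port, so misaligned transfers are not modelled.

```python
def transfer_cycles(operand_bytes, port_bytes):
    """Count bus cycles needed to move one aligned operand through a port.

    operand_bytes: size of the transfer the CPU wants (1, 2 or 4).
    port_bytes:    width the external hardware reports at cycle termination.
    """
    if operand_bytes not in (1, 2, 4) or port_bytes not in (1, 2, 4):
        raise ValueError("sizes must be 1, 2 or 4 bytes")
    remaining = operand_bytes
    cycles = 0
    while remaining > 0:
        # The CPU starts a cycle assuming the widest bus; only when the
        # cycle terminates does it learn how many bytes actually moved.
        cycles += 1
        remaining -= min(remaining, port_bytes)
    return cycles
```

In this model a long word costs 4 cycles on an 8-bit port, 2 on a 16-bit port and 1 on a 32-bit port, which is the 4:1 difference mentioned earlier.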
Now, when dealing with a system where the native CPU is 32-bit wide but the rest of the system is 16 or 8 bit wide, considerable speed-up can be gained by using an address decoding method known as shadowing. Unlike standard address decoding, which maps one physical device (such as memory or IO chips etc) to one area of the address space of the CPU, shadowing takes advantage of mapping multiple physical devices (again - memory systems, IO chips etc) to the same area of the memory space, and/or mapping one physical device to several areas of the CPU address space. Special rules are implemented on how these devices and addresses are accessed to gain various advantages (usually speed), and to prevent contention, i.e. a situation where each of these systems may attempt to store or supply different data (at which point the question would arise which data is correct), or hardware issues such as one piece of hardware attempting to drive a logic 1 onto a bus line while another tries to drive a logic 0.
Using shadowing, many tricks are possible, some of which are in present QL hardware, and I will attempt to explain how they work below.
One of the problems with the QL is that it is implemented on an 8-bit data bus. The video, main memory and all devices are structured as 8-bit. Even the ROMs are 8-bit. That means a QL is always running at a quarter of the speed of the same computer with a 32-bit bus.
So, let's get back to the 68EC020. It can access any width on the fly. We can upgrade the memory with 32-bit wide RAM, replace the ROMS with a pair of 16-bit wide EEPROMS or flash. Even with the crappy video, the stock QL would run a LOT faster.
The SGC does all this and more. It has a copy of the video RAM in its own fast memory space. It copies the ROMs containing QDOS into faster, wider RAM. All neat tricks I'm far from capable of copying. However, I think I can get 80% of the results with about 40% of the effort.
Actually, most of the tricks the SGC uses are rather easy to implement.
One small point to make is that usually 8-bit wide (and lately 16-bit wide, but also serial access 1 or 4-bit wide) firmware storage is used mainly for reasons of cost and space (be that EPROM, Flash, FRAM or any other non-volatile storage technology). One of the main reasons it comes to this is that modern non-volatile storage, although offering high memory capacity, does not offer great speed. This is somewhat of a 'chicken and egg' thing, as it was RAM that absolutely had to be made as wide as possible, since it offered the most speed increase, in computers that have traditionally been driven by firmware that was loaded (booted) from external devices. The actual non-volatile firmware would in the end only be used for booting the OS, and the hardware containing it would not be used at all once the OS was booted up and started. As a result an 'evolution' occurred where the hardware holding the boot firmware did not need to be fast, partly because one of the first things it would do would be to copy itself into fast RAM. As semiconductor non-volatile storage became larger, it replaced other types of external non-volatile memory such as discs, especially in embedded and ruggedized systems, but this started by it emulating discs, which have in principle (block) serial access. Code was never executed from this memory even though it might in principle support random read access. So, such memory was optimized to hold a lot of data, be reasonably fast for reading in order to copy to RAM those parts that would be executed, and to use the least possible board space and connection signals. This trend continues and will likely continue unless someone discovers commercially viable non-volatile RAM. In the meantime it is cheaper to use a package with fewer pins (i.e. a narrower bus) since it uses less space on the board and requires fewer signal lines to connect to it (again, using less space, so more is available for other signal lines).
The penalty is relatively small: it's the time needed to copy the relevant part to RAM.
In the particular case of the SGC, the ROM is initially (after reset) mapped to (if memory serves me right) address F00000h to F0BFFFh, i.e. the first 48k of the 16th megabyte. Now, the usual address is 000000h to 00BFFFh, the first 48k of the address space. The CPU requires that the first two long words in the address map (addresses 0 and 4) contain the address the stack pointer will be initially loaded with, and the address at which code execution is to start, so one would normally expect real data there at reset. This is usually done by mapping some sort of non-volatile storage to these addresses; in the case of the standard QL, this is the system ROM.
However, on the SGC, this area initially maps to the SGC ROM, so instead of the system ROM, it's the one on the SGC that provides the initial stack pointer and the address to start code execution from. This is done in order to create a copy of the system ROM in the SGC's RAM. The SGC ROM also maps to F10000h to F1FFFFh, and (if I remember correctly) part of it (up to 32k) appears at 018000h to 01FFFFh, the 32k just below the RAM addresses.
The actual SGC RAM (8M) maps to the first 8M of the CPU memory area, except for the parts mentioned above.
Now, this looks like a whole jumble of addresses, but here is how this works:
Initially, i.e. after reset, every byte of the SGC ROM appears at two addresses, 15 Megs apart - i.e. the first byte of the ROM appears at address 000000h and at address F10000h, the second at 000001h and F10001h etc. The first thing the CPU does is read addresses 000000h and 000004h and use the data there to load the stack pointer and program counter, i.e. these addresses tell it where to put the stack and at which address to look for its first instruction. And, in fact, the start address points to the very same SGC ROM, but to the 'alias' at address F10000h etc. Hardware within the glue PLD detects the CPU accessing any high address and from then onwards changes the way it decodes addresses so that the SGC ROM does not appear at addresses 000000h to 00FFFFh any more; instead, these addresses now access the corresponding addresses in RAM.
The code that executes from the ROM alias first copies the system ROM from its re-mapped addresses at F00000h to F0BFFFh to RAM at the addresses it would appear at in the normal QL - namely down at 000000h to 00BFFFh. Then it figures out what ROM it is and applies the necessary patches to make it run on a 68020 CPU. Next it accesses another high address (probably in the F40000h to FBFFFFh range) which sets the glue PLD to ignore write accesses to RAM for addresses 000000h to 00BFFFh, in essence making it look like ROM, except that this copy of the actual ROM is 32 bits wide and MUCH faster (approximately by a factor of 14 or so).
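The sequence above can be sketched as a small state machine in Python. This is only a toy model under stated assumptions: the class GlueDecoder, the function boot and the single LOCK_TRIGGER address are hypothetical names of mine, the addresses are the ones recalled from memory above rather than verified SGC values, only the 48k of RAM that ends up holding the ROM copy is modelled, and the ROM patching step is left out.

```python
ROM_SIZE = 0xC000          # 48k system ROM, per the text
SYS_ROM_ALIAS = 0xF00000   # where the text says the system ROM is re-mapped
SGC_ROM_ALIAS = 0xF10000   # high alias of the SGC's own ROM
HIGH_AREA = 0xF00000       # assumption: "any high address" means this and above
LOCK_TRIGGER = 0xF40000    # hypothetical address inside the quoted lock range


class GlueDecoder:
    """Toy model of an address decoder whose map changes with accesses."""

    def __init__(self, system_rom, sgc_rom):
        self.system_rom = system_rom
        self.sgc_rom = sgc_rom
        self.ram = bytearray(ROM_SIZE)   # only the shadow area is modelled
        self.rom_at_zero = True          # reset state: SGC ROM visible at 0
        self.low_write_protect = False

    def _note_access(self, addr):
        if addr >= HIGH_AREA:
            self.rom_at_zero = False     # low addresses now decode to RAM
        if addr == LOCK_TRIGGER:
            self.low_write_protect = True

    def read(self, addr):
        self._note_access(addr)
        if SYS_ROM_ALIAS <= addr < SYS_ROM_ALIAS + ROM_SIZE:
            return self.system_rom[addr - SYS_ROM_ALIAS]
        if SGC_ROM_ALIAS <= addr < SGC_ROM_ALIAS + len(self.sgc_rom):
            return self.sgc_rom[addr - SGC_ROM_ALIAS]
        if self.rom_at_zero and addr < len(self.sgc_rom):
            return self.sgc_rom[addr]
        if addr < len(self.ram):
            return self.ram[addr]
        return 0xFF                      # nothing modelled at this address

    def write(self, addr, value):
        self._note_access(addr)
        if addr < len(self.ram) and not self.low_write_protect:
            self.ram[addr] = value


def boot(decoder):
    decoder.read(SGC_ROM_ALIAS)          # first fetch from the alias remaps 0
    for offset in range(ROM_SIZE):       # copy the system ROM down into RAM
        decoder.write(offset, decoder.read(SYS_ROM_ALIAS + offset))
    decoder.read(LOCK_TRIGGER)           # RAM copy now behaves like ROM
```

Before boot() runs, a read of address 0 returns SGC ROM data; afterwards the same address returns the system ROM copy from RAM, and writes to that area are silently dropped.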
So, as you can see, one trick that can be done is making the decoder change its behaviour after a special address has been accessed. In this case multiple 'shadows' and aliases of the same physical memory are used, as well as temporarily disabling access, or one kind of access, to other memory to get a desired advantageous effect.
It should be noted that other ways of doing things are also possible. For instance, many systems use shadowing to implement faster copies of a ROM in RAM (i.e. a sort of ROM emulation), by mapping the ROM to an area normally used by RAM but only for reading, while writing to the same address still writes into RAM. Then, copying the contents of the ROM to RAM is done by reading each address within this shadow area and then writing it back to the same address, which actually reads the contents of the ROM but writes them back to RAM. When everything is copied, the ROM is completely disabled (so it is not accessible any more at all), leaving the RAM containing the copy of the ROM in its place, while writes to the same area are ignored to prevent corruption of the ROM copy in RAM. This sort of thing is normally done when we want the maximum RAM possible, so we want to fill as much of the available address map of the CPU with RAM, using only the minimum to implement ROM copies and IO spaces. Such an approach would be used in this proposed project when 16M of RAM is used, since that is the size of the complete address map available on 68EC020 series CPUs.
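A minimal Python sketch of this read-from-ROM, write-to-RAM scheme follows. The class and method names are hypothetical and the model ignores bus widths and timing; it only shows why reading and writing back the same address ends up copying the ROM into the RAM underneath it.

```python
class ShadowedRegion:
    """Toy model of a ROM shadowed by RAM at the same addresses."""

    def __init__(self, rom):
        self.rom = bytes(rom)
        self.ram = bytearray(len(rom))
        self.rom_enabled = True        # reads come from ROM while this is set
        self.write_protected = False   # writes reach RAM while this is clear

    def read(self, offset):
        return self.rom[offset] if self.rom_enabled else self.ram[offset]

    def write(self, offset, value):
        if not self.write_protected:
            self.ram[offset] = value   # writes always go to RAM, never ROM

    def shadow_and_lock(self):
        # Read each address (served by ROM) and write it straight back to
        # the same address (which lands in RAM).
        for offset in range(len(self.rom)):
            self.write(offset, self.read(offset))
        self.rom_enabled = False       # ROM disappears from the map
        self.write_protected = True    # RAM copy can no longer be corrupted
```

After shadow_and_lock(), every read is served from the RAM copy and a stray write to the region changes nothing.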
Now, the matter of video. This also uses shadowing on the SGC, but uses different maps and the read-write distinction in a different manner.
In particular, the SGC maps its own 32-bit RAM to the addresses of the video memory as well as the actual video memory (through the ZX8301 ULA). The difference is that it only ever WRITES to the ZX8301, i.e. the real video memory, so that you get a copy of the contents of the SGC RAM at the same address in the actual video memory, and are thus able to see the picture.
However, when the CPU reads from any address within the video memory area, it actually reads ONLY from its on-board RAM. Access speed is always limited by the slowest of these, so writing speed is governed by the speed at which the ZX8301 can accept data written into the original RAM, but since reading is only done from SGC RAM, this is MUCH faster - over 25x faster, since QL motherboard RAM is slower than all other accesses through the original 8-bit bus.
Here is where the peculiar way the 68EC020 does access comes right in - because it first assumes a 32-bit bus, if 32 bits of data are to be written, it initially provides the full 32 bits of valid data, even though it may turn out it's only going to need 8 or 16, because it's accessing an 8 or 16 bit bus. The initial 32 bits of data provided on writing to the video RAM addresses are used for writing into the SGC RAM, and only 8 bits go out to the original 8-bit bus. The 68EC020 then performs 3 more cycles to provide the remaining 3 bytes to the 8-bit QL bus, while the RAM waits. For reading, the RAM supplies all 32-bits at once, and the slow 8-bit bus is not used at all.
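As a rough illustration of the write-through behaviour just described, here is a small Python model. The names are hypothetical and it counts one slow-bus cycle per byte with no real timing; it assumes big-endian byte order, as on the 68k family.

```python
class ShadowedVideo:
    """Toy model: fast 32-bit shadow RAM plus a slow 8-bit screen RAM."""

    def __init__(self, size):
        self.fast_ram = bytearray(size)     # on-board 32-bit RAM
        self.screen_ram = bytearray(size)   # original RAM behind the ULA
        self.slow_bus_cycles = 0

    def write_long(self, offset, value):
        data = value.to_bytes(4, "big")
        # The first cycle presents all 32 bits, so the fast RAM can take
        # the whole long word at once.
        self.fast_ram[offset:offset + 4] = data
        # The 8-bit bus accepts one byte per cycle: four cycles in total.
        for i, byte in enumerate(data):
            self.screen_ram[offset + i] = byte
            self.slow_bus_cycles += 1

    def read_long(self, offset):
        # Reads never touch the slow bus.
        return int.from_bytes(self.fast_ram[offset:offset + 4], "big")
```

Writing one long word costs four slow-bus cycles in this model, while reading it back costs none, and both memories end up holding the same data.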
Now, this approach at first glance offers limited acceleration, but two significant facts exist that actually make this faster than one would expect in real life.
Since one pixel on the screen uses fewer bits than the actual width of either the 8 or the 16 bit bus, to draw one pixel the CPU must almost always first read a byte or two or four from the video area, modify the required bits that correspond to the pixels to be changed, then write this data back. Because of this, nearly all accesses to the video RAM, except when filling it with a pattern or, say, a stored window image, are read-modify-write code sequences. Reading is so much faster than writing in this case that it uses up next to no time compared to the bog standard QL, so every such operation will be at least twice as fast - things such as line and character drawing, and scrolling, for instance. But that's not all - the 68EC020 has something called a read-write buffer. Instead of waiting for some data to be written before it continues executing code, it stores data to be written in an internal buffer and its bus control circuits take care of it while the CPU goes on about its business. The only time it has to stall and wait is if it needs to write something again before the previous write has been completed. It will of course also stall if it has to read data or instructions via the external bus and has to wait for the previous transfer to complete. This is where two other things come in, namely the instruction cache and the instruction pipeline buffer. Because graphics operation code is quite repetitive (lots of loops), a CPU with cache can offer a significant speed advantage even with slow external memory - the cache will likely contain instructions for the CPU that were read from the RAM in some other pass through a loop, and the CPU will not need to interrupt a write transfer in progress to get them, so a number of instructions may execute while the actual data is written to slow RAM, instead of them being executed only once the write is finished.
If the next instruction is a jump up to 3 words backwards, the CPU will not even attempt a cache access but will find the instruction in its read buffer - this would be the case of a very tight loop used for raw data transfer. In any case, a degree of parallelism occurs, producing a speed advantage because transfers that would use the bus on lesser CPUs don't even appear to happen, so obviously speed penalties do not apply.
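A back-of-envelope model of the read-modify-write case, in Python, may help put numbers on this. It is deliberately crude: the function names are mine, it assumes a read and a write cost the same on the stock machine, and it ignores the write buffer and the cache, both of which add further overlap on top of what is shown here.

```python
def rmw_time(read_time, write_time):
    """Time for one read-modify-write, ignoring the modify step itself."""
    return read_time + write_time


def rmw_speedup(slow_access, read_ratio):
    """Ratio of stock to shadowed time for one read-modify-write.

    slow_access: time for one access over the original bus (any unit).
    read_ratio:  how many times faster a shadowed read is than a slow one.
    """
    stock = rmw_time(slow_access, slow_access)
    shadowed = rmw_time(slow_access / read_ratio, slow_access)
    return stock / shadowed
```

With the "over 25x" read figure quoted above, rmw_speedup(1.0, 25) comes out just under two from the read side alone, before any of the overlap from buffered writes and cached instructions is counted.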
An aside to this is that you can actually remove the old video RAM, or have faulty chips there, and the system will still work just fine (except the screen will be corrupt). One trick that can be used is to remove the upper 64k of RAM from an original motherboard - this will save some power.
Higher density DRAM can also be had that makes it possible to replace each bank of 8 64k DRAM chips by 2 chips with a 64k x 4 bit organisation. There are 8-bit DRAM chips but they are not easy to find - it's easier to find 16-bit ones and only use 8 bits, but these are packaged in a PLCC case so they are not easy to breadboard with.
An interesting point to mention is that shadowing of this kind is possible even on standard QLs, by expanding the logic used on many of the standard RAM expansion cards with extra RAM and decoders to implement the shadowing function.
A long time ago I made my own RAM expansion for the original QL because I could not afford a Miracle TrumpCard. However, I decided that I would rather have 768k total RAM and leave the IO extension space open. 3 banks of 256k were used to implement the 768k, but the first 128k of the first bank was used to shadow the video RAM and the second 128k of the first bank replaced the internal RAM completely. The add-on RAM could support the 68008 at full speed. This expansion ran noticeably faster than any other 8-bit CPU QL, and the reason was faster access to video, as well as to the system variables and tables, which are also located in slow RAM on the QL motherboard.
Later on, when I got my hands on some PLCC 68008s (which have 2 more address lines and a 4M address space, compared to the original DIP 68008 with only 1M), I used a 4 meg old style 8-bit SIMM to populate the whole 4M of the 68008 PLCC address space with RAM, leaving only a small space at the end for a floppy controller. I seem to remember I discovered a bug in an early version of Minerva, related to using more than 1 meg of RAM, which Tony and Lau then fixed.

This also ran the 68008 at about 9MHz (more than that and the ZX8301 would not access RAM correctly), but microdrives and net would not work.
Finally, a word on the ZX8301. It's very finicky when it comes to what it expects from the CPU. In particular, a faster CPU must be prevented from seeing the DTACKL signal from the ZX8301 for a certain time, because it will otherwise be too fast in finishing up the data transfer and will either remove data to be written before the ZX8301 actually writes it to RAM, or will assume the ZX8301 provided data from RAM before it actually did.
Fortunately, accesses to the ZX8301 control register as well as the ZX8302 do not have that problem, so using video RAM shadowing lets the designer debug this on the fly - the CPU RAM will provide correct data for the system to operate, even though the screen might show 'snow' - getting the correct timing will then clear up the picture, while the actual functionality will be unimpaired.