Now, as I said, the actual CPLD programming was never written as a program, but it was designed. One overwhelming reason for this was the fate of the two competing companies that were the only ones at the time producing sensible CPLDs for the job, AMD and Lattice: AMD completely pulled out of the CPLD business and sold its CPLD division to Lattice. A third company, Cypress Semiconductor, used to make their own compatible versions of the AMD CPLDs and promptly stopped once AMD pulled out. With AMD pulling out, their software support (programming language, editor, compiler, simulator) instantly went away, and so did Cypress'. Lattice, on the other hand, had completely different software and first decided to go the Cypress route (using a rather complicated language called VHDL so both their CPLD and FPGA products could use the same software), but did not support the AMD designs they inherited at all. In fact, they killed off the very ones I intended to use first! Incidentally, the CPLDs used on the Aurora suffered the same fate. Later on Lattice decided to go back and fuse their then top-of-the-line design with AMD's, which actually resulted in resurrecting one of AMD's CPLDs that could also be used for the GF. But - they changed the programming software AGAIN. So in a matter of about a year and a half they went through first dropping the CPLDs I wanted to use, then dropping any software and programming support, then changing the programming language TWICE while forgetting about the one AMD used altogether (so all in all 3 different programming languages that are not easily translatable), and then decided to bring back an AMD compatible CPLD - not the one I wanted to use (fortunately, the hardware was similar) - but only supported it with yet a third programming package and hardware.
At this point it is important to say something about CPLDs or, to unravel the acronym, Complex Programmable Logic Devices. There is another kind of programmable logic called the FPGA, or Field Programmable Gate Array, and the approach to implementing logic in these devices is quite different. FPGAs are in principle more capable as they can cram large amounts of complex logic in, to the point that today you could build a whole computer in one. But you pay for this with timing and actual speed that are unpredictable in advance, because an FPGA implements not only programmable logic, but also programmable connections between it. Often many connections are needed to get from one part of the chip to another, and things like wide buses are the bane of an FPGA's existence, as different bits of the bus may have to take (and almost as a rule do) different routes, which makes some bits faster and others slower! Because of this there is a rather high investment in the software that converts a program defining the logic into the logic itself, because it needs to find the best logic block placements in the chip in order to even be able to connect them, let alone connect them optimally. For this a very complex simulator and placer is used. Even when the logic looks like it should fit into an FPGA judging by how much of it there is, it may ultimately turn out it does not, because it cannot be connected together given the routing resources. And when it can, it may turn out it's not quick enough - both of which mean you may end up having to buy a larger and faster, i.e. more expensive, FPGA than you originally thought. Finally, a simple change in the logic may also upset the routing, so you can for instance get the same pinout but not the same speed - it does not happen often, but the point is you cannot know in advance. What is worse, at the level of the programming language it is extremely difficult and sometimes impossible to force the compiler to implement the logic in a fashion that best uses the peculiarities of the FPGA implementation to get the most out of it. Manufacturers discourage this, claiming it's bad for future compatibility, and they do have a point, but as one may expect, being able to sell you more expensive FPGAs is also a good motivator for them.
CPLDs are a completely different beast. They are not suitable for complex logic, but rather for lots of simple logic. Implementing any kind of registers, and heaven forbid memory, will use up CPLD resources in a flash - and the capacity will be a small fraction of what an FPGA can do in this regard. But when it comes to lots of repeating functions involving many signals, such as bus bits, CPLDs rule. This is because they do not have visible routing resources and can connect almost anything to anything - not quite, but far more than an FPGA can. Also, if they do have some sort of routing, it's made to be hidden as far as timing and speed are concerned - the speed rating of a modern CPLD is guaranteed, no matter what kind of logic you put in it. And because the logic it implements is fairly simple and uniform, it's actually possible to write the program so that it automatically takes advantage of the peculiarities of the CPLD implementation; if you can do this, you can use almost all the available logic in the CPLD - IF the language the manufacturer uses supports it. One problem here is again touting compatibility - so the same software can be used for CPLDs and FPGAs, but often (if not always) this removes the designer's ability to exploit the actual way the CPLD is designed. This is what happened when AMD went away: for a short time Cypress' alternative compatible CPLDs were available (but they would not take existing AMD programs even though the chips were essentially the same - you had to re-write them in VHDL, which is FAR more complicated since it's designed to describe any logic, not just CPLDs or FPGAs), then that chip was dropped, then re-instated without the ability to import old programs, and finally the same thing again, except at least you could re-write the program in something similar, yet still not exploit the actual hardware implementation. Really, it's no wonder I gave up.
But back to the QL, GF, and the 68020 project.
One thing that the GF was to implement was a way to do 32-bit transfers over the QL's expansion connector. This required some thought, but it all came together when I got the idea to use FC0, 1, 2, and E as extra address lines. Namely, together with the existing 20 address and 8 data lines, that makes 32 lines total, and this was enough to pique my interest. If you look at those as signal lines that can output or input any value, which is exactly how a CPLD treats them (since they would be connected directly to a CPLD), adding two signals that were not previously used could implement a full 32-bit bus - but multiplexed, so that the same 32 lines are first used for the address, and then freed for the peripheral to place data on them, or used for the CPU to output data on them. The whole thing would have to be asynchronous, because the only clock signal available on the bus is the original 7.5MHz. Admittedly, I don't know that anything used it, but that does not mean nothing did. It would have been nice to have a clock signal, because in theory knowing how the GF generates the cycles could give a fast enough peripheral a speed advantage, but then distributing a fast clock signal around a bus of indeterminate length and signal integrity would have been a serious problem. So, async it was.
In order to use the address, data and extra lines without any of the 'old 8-bit stuff' being aware of it, the line DSL must stay inactive. This line is normally the 'data strobe' and all existing QL peripherals, in fact the motherboard as well, use a low state on this signal as an 'enable' of sorts. If it's high, they ignore what's going on with the bus.
Obviously an 'alternative' to DSL had to be found that would have a similar function, but only for future 32-bit peripherals. An ideal candidate was found in the normally unused ASL signal, already defined as the address strobe that tells the system there is a stable address on the bus - the most logical choice, as it already has the right function. So, the GF would place an address on the 32-bit bus and activate ASL.
A bit more thought showed that with a 32-bit bus you do not need the lower address bits A0 and A1, but you do need byte selects, 4 of them, to tell peripherals which bytes the CPU will actually be writing to, out of the full 32 bits. These can be ignored when data is read, as the CPU simply ignores the bytes it does not need. And since the 32-bit bus could not occupy the whole of the 32-bit address space - because (a) A31..29 should be don't care for Qliberator, and (b) other things use up most of the remaining address map - 32M of addressing space was more than enough, and that needs only A2 to A24, i.e. 23 of the 32 lines, leaving 32 - 23 = 9 lines to be used for other clever things. 4 were used for byte selects, and I put some thought into using one more as a 'burst mode' signal (*)
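To make the arithmetic concrete, during the address phase the 32 multiplexed lines would break down roughly like this (the byte select naming is mine, and the assignment of the spare lines was never finalized):

    A2..A24    23 lines - addresses 32M bytes in long words
    BS0..BS3    4 lines - byte selects, one per byte lane
    spare       5 lines - for other clever things (burst mode etc.)
    total      32 lines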
When a 32-bit peripheral saw ASL activated, it would latch (internally store) the address and byte select signals, and then use the outputs of the latch as real address lines locally. From the RDWL signal and the byte selects it could also, in this phase, generate byte write enable signals (one for each of the 4 bytes) if desired. Having done so, it would activate DTACKL as usual, telling the GF that it can now remove the address and use the 32 lines for data. This is a slightly different use of DTACKL: normally it tells the CPU it can finish the bus cycle altogether, whereas here it says the CPU can finish the address phase of the bus cycle.
The GF would then either place the data on the lines (if it was doing a write), or expect the data to be placed on the lines by the peripheral (if it was doing a read), and activate a 32-bit data strobe. DSL could not be used for this, as it would erroneously activate standard QL 8-bit peripherals, so a different signal had to be used. This was found in the signal DBGL, which the QL does not use at all (it was intended for a planned but never produced buffered bus expansion box) - it's only pulled up on the motherboard by a resistor, which was just fine. So, the GF would activate (pull low) DBGL.
The peripheral would then either take the data from the bus or supply it, as the RDWL signal dictates, and once it had done so, it would return DTACKL high. This indicates that the data transfer is done. The GF would then deactivate both DBGL and ASL and, if it was outputting data, remove it from the bus.
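Putting the three paragraphs above together, a minimal sketch of one bus cycle as seen from the GF (master) side might look like the following pseudo-C - the set_low/set_high/wait_low/wait_high and line driving helpers are hypothetical stand-ins for what the CPLD state machine actually did:

    #include <stdint.h>

    /* Hypothetical helpers standing in for the CPLD logic */
    extern void drive_lines(uint32_t v);    /* master drives the 32 lines  */
    extern void release_lines(void);        /* master tri-states the lines */
    extern uint32_t read_lines(void);
    extern void set_low(int sig), set_high(int sig);
    extern void wait_low(int sig), wait_high(int sig);
    enum { ASL, DBGL, DTACKL };

    /* One 32-bit transfer: address phase, then data phase */
    uint32_t cycle32(uint32_t addr_and_byte_sels, uint32_t wdata, int is_write)
    {
        uint32_t rdata = 0;

        /* Address phase: A2..A24 plus byte selects on the 32 lines */
        drive_lines(addr_and_byte_sels);
        set_low(ASL);
        wait_low(DTACKL);          /* peripheral has latched the address */

        /* Data phase: the same 32 lines now carry data, strobed by DBGL */
        if (is_write)
            drive_lines(wdata);    /* master supplies the data           */
        else
            release_lines();       /* peripheral will supply the data    */
        set_low(DBGL);
        wait_high(DTACKL);         /* low-to-high edge: transfer is done */
        if (!is_write)
            rdata = read_lines();  /* sample what the peripheral drove   */

        /* Wind the cycle down */
        set_high(DBGL);
        set_high(ASL);
        release_lines();
        return rdata;
    }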
(*) Burst mode is a mode of operation of the bus where it outputs only one address and expects the peripheral to supply, or accept, multiple consecutive long words of data starting at that address. The 68030 and higher 68k CPUs use this technique (always 4 consecutive long words) to speed up transfers for various things, notably instruction prefetch, data prefetch and in some cases data cache flushing. The idea is to omit the overhead of providing an address for each long word of data, the transfers being implicitly consecutive. If you can get the CPU to do such transfers, the speed improvement is potentially quite impressive - what would take at least 8 cycles (and realistically 12 cycles) on the 68030 takes 5 cycles in burst mode. The fastest possible transfer on the 68060 takes 8 cycles in normal mode, and 5 cycles in burst mode. The speed improvement is more dramatic for certain types of memory devices, such as SDRAM - at least 12 and possibly 14 or 16 cycles for 4 long words on the 68060 normally, versus 7 cycles in burst mode.
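The cycle counts quoted above, side by side, for a transfer of 4 long words:

    transfer of 4 long words    normal          burst
    68030                       8-12 cycles     5 cycles
    68060                       8 cycles        5 cycles
    68060 with SDRAM            12-16 cycles    7 cycles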
This was only contemplated for the 32-bit bus mode because realistically only something like a video card would be able to implement it, but it was of dubious advantage: the 68060 is very clever when it comes to bus use, shadowing would take all the advantage away for reads (the GF's SDRAM could easily sustain at least twice the speed of even a burst transfer using the 32-bit bus protocol), and for writes the 060 implements internal write buffering (or more, if the data cache is enabled), so it would mostly continue executing instructions from its caches while the actual write took place, rendering the write 'invisible' to software execution. The 020 isn't capable of burst transfers in the first place, so this mode could not be implemented if the 32-bit protocol was used on a 020 board.
There are a couple of fine points here.
- because DSL is never activated, normal 8-bit QL peripherals do not pay any attention to any signals on the bus and remain inactive; the whole 32-bit thing happens without them even noticing. That being said, the bus does require some extra attention regarding implementation - relatively strong buffering is required to be able to drive the lines quickly with old slow peripherals loading them. Secondly, if speed and power are required, termination - in this case series termination - is a must. So, each line had a 33 ohm resistor in series. If this was done with discrete logic chips (and there are suitable ones), versions exist that have the termination built in. A ground plane connection was also to be implemented by using a 3-row connector (all lines in the third row would be used as ground, except opposite the power lines in the original 2 rows, and one line would be used to sense the presence of a 3-row backplane).
- the 32-bit peripheral is completely in command of how long each phase of the bus cycle takes, and instead of using a level on DTACKL to signal the timing, it uses level transitions: high (inactive) to low to end the address phase, and low to high (inactive) to end the data phase. Being able to re-use DTACKL actually saves 2 signal lines, and still implements totally flexible timing. In theory, the fastest transfer speed was about 5 CPU clocks for a write and 6 for a read - keep in mind the CPU clock was at least 40 and preferably 66 or more MHz. In reality, without a ground plane, about half of that could be expected - still about 24 times faster than the QL's original bus maximum theoretical speed (and the latter is only possible with something like the SGC, BTW).
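To put rough numbers on that: at a 66MHz CPU clock, 5 clocks per 32-bit write works out to 66/5 = 13.2M transfers per second, or about 53MB/s; halving that for realistic conditions without a ground plane gives roughly 26MB/s, and dividing by the quoted factor of 24 implies a theoretical maximum of just over 1MB/s for the original 8-bit protocol.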
The way it was designed, the implementation is completely independent of what the clock speed actually is; this makes it fairly easy to implement for slower CPUs, with proportionally less speed and relaxed signal integrity requirements.
- in order to get decent speed, the DTACKL signal must be actively driven both low and high, on the original bus it's only driven low, and a resistor on the motherboard is used to drive it high. When lines are long, the resistor has to fight increasingly larger capacitance and other parasitics, resulting in ever slower returning of DTACKL to high, which in the worst case leads to bus operation errors and system hangs. Driving it actively was never really defined on the QL but this was a chance to do so. The bus spec only defines that after DTACKL is returned to high, it has to be made 'high impedance' (so other peripherals can drive it) within a short time, and this is quite easy to implement in hardware. Active driving with proper termination also prevents ringing of the DTACK line, which is essential for proper recognition of it and preventing noise and transients from being taken as DTACKL transitions.
Here are some ideas of what could be implemented on a 020 board.
First, some major considerations - available address space and RAM size, some speed issues.
On the one hand, this is determined by which actual CPU is used: a full 68020 or an EC020. The EC020 is limited to 16M, while the full 020 has a full 32-bit address bus - but for reasons already mentioned, the QL can only use up to A28, which makes the maximum size of the address map 512M bytes.
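In plain numbers: the EC020's 24 address lines give 2^24 = 16M bytes, while using A0..A28 of the full 020 gives 2^29 = 512M bytes.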
On the other hand, having owned an 8M QXL, I can tell you that indeed, you DO want more RAM. No one ever had too much RAM! The reason you want this is simple - it's a must once the possibility of extended graphics appears. When WMAN/QPTR is used, never mind something like SMSQ/E, non-destructive windows are implemented, the images of which are kept in RAM, so you are looking at increased memory use, proportional to screen size and color depth. Let's look at that in a bit more detail. In theory it should not be too difficult to implement something like Q40/60 style graphics on this board (although it could get a bit slow...), and suddenly you are looking at up to 1024x512 windows, so 4x the original size, and - instead of 2 bits per pixel - 16 bits per pixel, so 8x the color depth. All in all, oh, 32x more memory used. 1024x512 at 16 bits per pixel is 1M of RAM. This sort of thing would be completely unusable on a 4M system. Aurora screen size is limited by its video RAM, which is 4x smaller, so expect 'only' 8x more memory used. Having used the 8M QXL at PC style resolutions, I can say that in some cases 8M would come up short - not alarmingly, but you could see the end of it. IMHO, if you are counting on better graphics, count on more RAM - it follows directly.
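A quick back-of-the-envelope helper shows where the figures above come from - just a sketch to make the scaling visible:

    #include <stdio.h>

    /* Frame buffer size in bytes for a given mode */
    static unsigned long fb_bytes(unsigned w, unsigned h, unsigned bpp)
    {
        return (unsigned long)w * h * bpp / 8;
    }

    int main(void)
    {
        /* Original QL: 512x256 at 2 bits per pixel = 32K   */
        printf("QL original : %luK\n", fb_bytes(512, 256, 2) / 1024);
        /* Q40/60 style: 1024x512 at 16 bits per pixel = 1M */
        printf("Q40/60 style: %luK\n", fb_bytes(1024, 512, 16) / 1024);
        return 0;
    }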
Although the use of a full 020 does not directly follow from the above, one advantage of the full 68020, even if used as an EC020 (extra address lines unused), is that a 33MHz clock rating is available. That being said, it's actually all the same chip, just speed-graded. Motorola was always very conservative with its specs, so it would come as no surprise that an EC020 rated at 25MHz could indeed work at 33MHz and quite possibly more, as long as its bus pins were lightly loaded and well managed, and some cooling was supplied. Some Atari enthusiasts were well known for running 10MHz bog standard 68000s at 16MHz, and 33MHz 68020 chips at up to 50MHz... which is also an interesting data point to consider.
Then there is the matter of RAM implementation. Static RAM is the simplest to use, the fastest, and uses the least power, BUT it is far from the highest density and especially not the lowest cost. You are stuck with densities of 8M bits per chip (so 2M bytes), which are targeted at low power applications, making them more expensive than usual - and one reason for this is that the 020 uses 5V logic. 3.3V logic is more commonly used today and higher density SRAM is available for it, but it requires logic level conversion in the grand majority of cases. The cost of 5V to 3V logic translation is not trivial in terms of speed, board space, signal integrity and money, although some of the cost may be recovered since the higher density devices are cheaper per bit. To give you an idea, a 4M bit (512k x 8) device costs about 3 Euro or less in quantity. Moving to 8M bit (1M x 8) raises the price to anything between double and triple, while 16M bit (2M x 8) devices are already around 18-20 Euro a piece. The up side is, they are FAST - they would easily handle the fastest overclocked 020 - and use very little power, some as low as 10mW per chip!
Unfortunately, the market has driven out the most logical alternative: regular 5V DRAM, which is the 020's 'natural counterpart'. The parts that are still manufactured are of the same density as SRAM, but cheaper - however, a DRAM controller (extra hardware) is needed. In most cases it's easier to source DRAM second-hand! Also, speed is lower than that of SRAM, by at least 30%.
Finally there is SDRAM, and although this is the most used RAM type today, 99.9% of the market is DDR SDRAM in various versions - and these cannot be effectively used on a 020. Plain SDRAM is however still available at quite high densities and fairly cheap, but it has all the disadvantages you can get from the other types: it needs a controller, which is more complicated than for regular DRAM, and in order to get good speed it has to be operated at 2x the CPU frequency (this in itself is no problem, the slowest available parts are 100MHz), which is what makes the controller even more complicated - we are talking serious CPLD work here. AND it's all 3.3V - logic level translation is a must.
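The trade-offs from the last three paragraphs, summed up:

    type       density/chip   relative cost    speed             extras needed
    5V SRAM    up to 2M x 8   highest          fastest           none
    5V DRAM    same as SRAM   lower            ~30% below SRAM   DRAM controller
    SDRAM      much higher    lowest per bit   good at 2x clock  complex controller,
                                                                 3.3V translation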
So it looks like rock, hard place, wall... or something like that.
Regardless of the memory type used, 16M can become a squeeze, especially if one is planning for a 32-bit video extension.
The second major consideration is what expansions (if any) are expected. This would define what options are to be implemented on the expansion connector, if there is one.
The size of the memory map is also relevant here as any peripherals to go onto an expansion bus, must be given an address space to work in.
We can use the SGC as an example, and keep in mind it only uses about 4.5M bytes of the 16M total - and in our case we are aiming for more.
Out of the 4.5M used, 4M is RAM, with a bit at the beginning that can't be used because it holds the usual QL bits such as the ROM, EPROM slot, extra code copied from the SGC EPROM, the IO register area, and the screen (actually two of them). Shadowing of the second screen can be turned off for a little speed boost - if only one screen is used, the second screen area holds the system variables and tables, which are very frequently accessed, so speeding that up lowers the system overhead. All the rest up to the end of the 4th megabyte is RAM.
There is an IO area at C00000h, which is the 12th megabyte (counting from zero!), but only parts of it are actually implemented. The top 256k at CC0000h is the actual IO area (on the original QL it was at C0000h), and because the old QL bus has only 20 address lines, it actually appears as C0000h on the QL bus. The Aurora completely uses it up for the video RAM.
There is also a special area, IIRC 128k in size, at C00000h (I am writing this from memory, I could be wrong...), and this is where the SGC ROM can be read, where the on-board hardware IO locations are, and also where the first 64k of the QL's address map is mirrored (right at C00000..C0FFFF) - not the RAM copy, but the actual thing. As far as I know you can only read from these addresses, which is a pity, as writing would make it possible to put a Flash ROM into the Aurora ROM socket and program it in-circuit, implementing a sort of mini-ROMdisq (maximum size 512k). As it is, both the actual ROM and the ROM slot can be read here.
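Pulling the above together - from memory, so treat the exact boundaries with caution - the SGC's map looks roughly like this:

    000000-00BFFFh  QL ROM
    00C000-00FFFFh  ROM (EPROM) slot
    010000-017FFFh  RAM copy of (parts of) the SGC ROM
    018000-01BFFFh  IO register area
    020000-02FFFFh  the two screens (the second doubles as system variables)
    030000-3FFFFFh  RAM, to the end of the 4th megabyte
    C00000-C1FFFFh  SGC ROM readback, on-board IO, mirror of the first 64k
    CC0000-CFFFFFh  the actual IO area (C0000h as seen on the 20-bit QL bus)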
Assuming the proposed 020 system has an expansion bus (and I will try to show it actually has to have one internally anyway) allocating an area of the memory map is essential, but it would be prudent to put it right at the end of the 16M available, so more is left for RAM.
Here are some ideas on what the requirements might be:
- We need a place for a Flash ROM or similar, to store the firmware (OS, drivers for extra on-board hardware, boot and initialization code that has to run before the OS, or patches for the OS). At the very least 128k is needed, preferably more. It should be noted that this hardware is basically needed only at boot, when the contents have to be copied into RAM, or when the Flash has to be updated. So, some memory map savings can be had if a mapping mechanism is implemented which temporarily switches out some other hardware and switches in the Flash when needed. In fact, a much larger Flash can be used if a paging mechanism is also implemented (see the sketch after this list).
- In theory you could connect an Aurora to this system, or at least something similar, like a clone that transplants its chips to a board that only does the extended graphics. Using that as a guide, one would expect at least 256k of memory space available on the expansion bus, mimicking the SGC except at a more convenient address. That being said, adding fast DTACK and RAM shadowing capabilities would significantly increase the feature set.
- Some addresses are traditionally available on the expansion bus in existing systems, such as the ROM slot (00C000..00FFFFh), and QL on-board IO area (018000..01BFFFh).
- On the high end, one might look at graphics capability close to or equal to the Q40/Q60. This requires about 1M of address space, and would need access to the full 32-bit bus, or it will simply be far too slow. However, this should best be a 'shadow' to some existing 1M of RAM for speed, and therefore actually takes up the same address space as RAM. Obviously, when it's used Aurora type graphics would not be needed any more except maybe for development purposes, but in any case if the Aurora video RAM has shadowing capability, it would not be used at the same time as this 'high spec' graphics board.
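As a sketch of how little the paging mechanism mentioned in the first point needs, seen from the software side - the register address and bit layout here are invented purely for illustration:

    #include <stdint.h>

    /* Hypothetical mapping control register, a byte somewhere in the IO area */
    #define MAP_CTRL   (*(volatile uint8_t *)0x01BF80)  /* invented address   */
    #define MAP_FLASH  0x80   /* 1 = Flash window switched in                 */
    #define MAP_PAGE   0x0F   /* selects one of 16 x 128k pages = 2M Flash    */

    /* Make 128k page 'n' of a large Flash appear in the window */
    static void flash_page_in(uint8_t n)
    {
        MAP_CTRL = MAP_FLASH | (n & MAP_PAGE);
    }

    /* Switch the Flash back out, restoring whatever normally lives there */
    static void flash_page_out(void)
    {
        MAP_CTRL = 0;
    }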
How about possible implementation?
One could use the last meg at F00000h to implement four 256k blocks: one for the Flash chip, and three as aliases of a QL/SGC-like IO space - one with the usual behaviour, one with maximum bus speed, and one with maximum speed plus RAM shadowing.
Another possibility, which uses less of the address space - two 256k blocks - would be an extended implementation of the SGC idea: one 256k block containing the Flash (perhaps with bank switching so a large one can be used), with 64 or 128k used for that and the rest free for peripherals, and the other 256k block mimicking the QL's/SGC's IO area, but with options to switch in memory shadowing and/or fast transfer support.
In any case some extra address lines (on FC0..2, E) could be used to distinguish the various aliases, so that clever peripherals to come can make use of this and map themselves only where needed, leaving the other aliases free.
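For the first variant, the decode itself is trivial - a sketch of what the CPLD (or a handful of discrete logic) would implement, expressed in C, with the ordering of the four blocks being my assumption:

    #include <stdint.h>

    /* Last megabyte at F00000h split into four 256k blocks:
       block 0 = Flash, blocks 1..3 = QL/SGC-like IO space aliases
       (normal / max bus speed / max speed + RAM shadowing). */
    enum region { REG_RAM, REG_FLASH, REG_IO_NORMAL, REG_IO_FAST, REG_IO_SHADOW };

    static enum region decode(uint32_t a)
    {
        if ((a & 0xF00000UL) != 0xF00000UL)  /* not in the last megabyte */
            return REG_RAM;
        switch ((a >> 18) & 3) {             /* A19..A18 pick the 256k block */
        case 0:  return REG_FLASH;
        case 1:  return REG_IO_NORMAL;
        case 2:  return REG_IO_FAST;
        default: return REG_IO_SHADOW;
        }
    }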
The next thing to think about is what happens when the system accesses addresses that traditionally appear on the bus and relate to bits and pieces of the original QL's hardware. In particular, the ROM slot addresses at 00C000..00FFFFh must appear on the bus if you want to use a Qubide set to the ROM port address. But then, it's a moot point if it, or a version of it, is already on board - and this is something that should be considered for various reasons, amongst others because IDE is natively 16-bit, so it would be really odd to force the 020 to access it as an 8-bit peripheral only to have the 8 bits converted back to 16 - all while not only a 16-bit but a 32-bit bus is available.
Then there is the old QL on-board IO area, at 018000..01BFFFh. Aurora/QIMI uses address 01BF00..01BFFFh, Aurora/QL (ZX8301/2) uses 018000..0180FFh. The rest is unused but it's an area ideally suited for actual IO devices and chips either on or off-board, because with a bit of help from the motherboard it's easy to decode and does not use up anything that could be used in better ways - most IO devices only need a few locations. More on this below.
Other candidates would be 010000..017FFFh, which Minerva supports as alternative ROM addresses (it looks for ROM headers at 010000h and 014000h), and possibly 01C000..01FFFFh.
These are perhaps best kept as write protected RAM. On the GC and SGC these are used as a RAM copy of (parts of) the SGC ROM and in any case, in order to run all the extra bits on this new 020 board, copying the drivers to RAM is a good idea. Since these addresses were not used at all on the original QL (they contain aliases of 18000..1BFFFh there), they cannot normally be used as system RAM and at best could appear on the expansion bus and be used by a non-standard IO or ROM board, quite superfluous considering there must already be a large Flash on board to contain an image of a QL ROM, and the extra software. So, why not make this area available as fast write protected RAM for ROM emulation, so OS extensions do not use up system RAM.
And now, the HOWEVER bit.
There is an important consideration that gets the required signals to the bus anyway, and this is buffering. In order to interface both slow and fast stuff to the 020, you REALLY want to keep their respective bus signals separate. For one thing, keeping the 020 bus lightly (and uniformly) loaded has very desirable consequences in the form of speed and signal integrity. 8-bit stuff must be connected to CPU data lines D24..D31, and a bunch of it requires a buffer anyway, because it would heavily load that part of the bus while the other 24 bits would see a lighter load. This is not only a consideration for the CPU: everything on the 32-bit bus that drives that part of the bus when the CPU reads from it also has to drive all the 8-bit stuff connected to it. You need to buffer the address lines too, because transactions for anything on the bus (much faster than QL standard!) would otherwise be seen on these address lines, so the CPU would be driving everything at the highest speed even if it can't take it. To prevent this, you need buffers that are only active when the 8-bit bus is accessed, making all the 8-bit stuff look like one single device to the CPU. In short, the 8-bit side requires signal buffers anyway - the same ones that the QL expansion bus connector would require.
What this boils down to is that you end up using something 99% similar to the QL bus (buffered from the CPU) to connect the various 8-bit peripherals that you have on board. Adding a connector to this existing bus is a no-brainer, and in fact you are then looking at all of your on-board peripherals (and I mean actual IO hardware like floppy controllers, ZX8301/2 etc, not extra ROMs) as peripherals on the QL bus - it's just that part of it is implemented 'on-board'. And, since on-board you may have extra signals available that do not normally appear on the QL bus, you might implement some special properties for some of the on-board peripherals, such as higher bus speed, or even 16-bit access (for IDE).
This brings us back to the GF spec. The exact same approach was to be used on the GF: because most of the peripheral chips were 8-bit (or could work from an 8-bit bus), and needed 5V to 3V logic conversion and handling of 32 to 8 bit bus size conversion, which the QL bus interface CPLD already did, they were connected to the QL side of the bus. This also severely reduces the number of long lines that must be routed to the actual peripherals - 8 bits of data and a relatively small number of address lines are enough. The actual addresses used fall within the original QL on-board IO area at 018000..01BFFFh. The GF would suppress DSL generation when these addresses were used, so they were actually invisible to other things on the bus, and it would also generate a special internal DTACKL for them, for optimum access speed.
The same thing can be done for this 020 board, except it could actually use DSL and implement DSMCL so external hardware intended to replace (and hopefully enhance!) on-board stuff could do so by using DSMCL as usual. Internal DTACKL generation is still possible so one could optimize speed (very interesting for IDE or ethernet). Implementing this would then require the QL on-board IO area accesses to appear on the expansion connector.
Implementing a 32-bit bus interface on a 020 board would be a chore using discrete chips, but it's not impossible. However, the only thing one could realistically expect to use on it is a graphics card, and for this it only makes sense to put it in the last meg of RAM (whatever that turns out to be, depending on what parts of the address map are taken up by the expansion connector stuff), where it would 'replace' or, even better, shadow existing RAM for writes only. Whatever this RAM will be, more than 1M is hardly a good idea, and only if the rest of the 16 available megs is used as RAM too - otherwise, RAM will become short. Although 16 bits per pixel is very attractive, there is the issue of non-standard resolutions, which are a huge headache on LCD monitors - and today, there is no other kind left. A 256 color mode could run in 1024x576, which is a wide-screen standard, or 1024x768, which is a regular 4:3 PC standard. Supporting wide screen resolutions is a serious consideration, as it's now nearly impossible to find a 4:3 LCD for a sane price.
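For reference: at 8 bits per pixel, 1024x576 needs 576K and 1024x768 needs 768K, so either fits in a 1M window with room to spare, while 1024x576 at 16 bits per pixel would already take 1152K - just over the megabyte.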
In any case, providing some sort of special connector for 32-bit graphics is a better idea than designing an actual 32-bit bus interface which first multiplexes the bus to pass through the IO connector, only to have it de-multiplexed for the actual hardware on the other side. A sort-of 32-bit version of DSMCL could be implemented on this connector so anything 32-bit on it could disable on-board 32-bit hardware if needed.
To do:
Some considerations regarding extra peripherals that were to be used on the GF, regarding interrupts and the like, multiprocessing (easy - the point is why and how), ideas on Flash/ROM management and peripheral stuff that came up while designing the GF.
Since it's so late it's actually early, I will leave that for the next post...