Fun things to do with an MC68EC020....

Nagging hardware related question? Post here!
User avatar
Dave
SandySuperQDave
Posts: 2816
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX
Contact:

Re: Fun things to do with an MC68EC020....

Post by Dave »

I have talked with George about it. If I implement it in hardware and more than a couple of people have a use for it, he seems inclined to make any changes that might be needed to optimize it for the platform.

I would love to see some native support for it spooned into Minerva 2.00 :)


Nasta
Gold Card
Posts: 463
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Fun things to do with an MC68EC020....

Post by Nasta »

Dave wrote:Here's where I'm at:
16MB 32-bit RAM (not all wired in) if used with an EC020 - up to 4GB with 68020
QL systems are realistically limited to 512M (that should really be plenty :P) since decoders must leave A29, 30, 31 as don't care. Otherwise Qliberator and anything compiled with it will not work (see the sketch below).
With Minerva (but not sure if this was fixed!) you might run into problems with large RAM sizes. The slave block table is limited to 16M (not sure), but either way this is purely a software problem; if RAM is bigger, the slave blocks will just occupy up to 16 megs. Not 100% sure, but I recall there was a problem at the 2M boundary - this has surely been fixed or Minerva would not work on the SGC.
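On the A29..31 point, here is a minimal sketch of what 'don't care' means for a decoder - my illustration, not code from any actual QL product:

Code:
#include <stdint.h>

/* A decoder that treats A29..A31 as don't care folds the CPU's 4G
   space into 512M aliases, which is what keeps Qliberator (and code
   compiled with it) happy, per the explanation above. */
static inline uint32_t fold_512m(uint32_t addr)
{
    return addr & 0x1FFFFFFFu;   /* keep A0..A28, ignore A29..A31 */
}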
128k 8-bit RAM
If you are using static RAM this can relatively easily be shadowed.
I rejigged the QL expansion port. Everything in the DIN connector is 100% compatible, but I put extra address lines on the R, G and B pins since nobody ever used those. I have now decided that is a bad idea, because if anyone produces a video card it might want those lines to feed video back. So I am working on using a small secondary connector alongside the original one.
Have a look at the GoldFire specs; they should be available on the Qlhardware Yahoo group. GF was to implement an extended bus, and I spent a LOT of time figuring out which lines could be used for more useful purposes, compared to standard. One which I clearly remember is the E line, which was, as far as I know, never used for anything. SP0-3 are also all grounded on the motherboard and, considering what you are putting on-board, highly unlikely to be used at all. Also, DBGL was never used.
One other way to extend the bus is to use a 3-row DIN connector - a 2-row one fits into the two bottom rows. On the GF I planned to use the third row almost exclusively for ground, and to prevent 2-row cards from being plugged in incorrectly, a jumper would be put across two pins of the third row, which one would manually remove to plug in a 3-row connector card (this was to be the Aurora II).


User avatar
Dave
SandySuperQDave
Posts: 2816
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX
Contact:

Re: Fun things to do with an MC68EC020....

Post by Dave »

Nasta wrote:QL systems are realistically limited to 512M (that should really be plenty :P) since decoders must leave A29, 30, 31 as don't care. Otherwise Qliberator and anything compiled with it will not work.
I can't imagine ever building a system with more than 16MB, since the most capacious SGC system has 4MB and people don't complain. More is physically possible, but.... What would you do with a 128MB QL?
128k 8-bit RAM
If you are using static RAM this can relatively easily be shadowed.
For you, maybe :) I have read your posts, and will many more times. There's a big difference between understanding the principles, and understanding the hardware and code practicalities of implementing those principles... I'll do the best I can :)
One which I clearly remember is the E line which was, as far as I know, never used for anything. SP0-3 are also all grounded on the motherboard and considering what you are putting on-board, highly unlikely to be used at all. Also DBGL was never used.
Ooooh, I think you just saved me a ton of time right there :)

I have been looking for a way to shoehorn in a couple of extra address lines, and two lines for decoders on expansions to tell the CPU how wide the bus is. I am a little confused by what you said about how the 020 determines the bus width. I have been using the two pins to have the address decoder explicitly tell the CPU the width of the bus at any given decoded address. Your way sounds more passive and automatic, and gets rid of some head-scratching logic and a GAL or two - IF I am understanding you correctly.


Nasta
Gold Card
Posts: 463
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Fun things to do with an MC68EC020....

Post by Nasta »

Dave wrote:
Nasta wrote: QL systems are realistically limited to 512M (that should really be plenty :P) since decoders must leave A29, 30, 31 as don't care. Otherwise Qliberator and anything compiled with it will not work.
I can't imagine ever building a system with more than 16MB, since the most capacious SGC system has 4MB and people don't complain. More is physically possible, but.... What would you do with a 128MB QL?
Well, that's the interesting thing: on a QL it would mean you could keep more or less everything resident, and if your system has static RAM, it can retain its data with very little power, meaning you really never need to switch the QL off.
128k 8-bit RAM
If you are using static RAM this can relatively easily be shadowed.
For you, maybe :) I have read your posts, and will many more times. There's a big difference between understanding the principles, and understanding the hardware and code practicalities of implementing those principles... I'll do the best I can :)
The 68020 (and EC020) is peculiar in its implementation of dynamic bus sizing, in that it sizes its bus 'post festum' - the 68020 makes no assumptions about the bus size at any address. The way the 020 does this is to initially assume that the bus is always 32 bits wide, for every cycle, even ones that are already in progress, such as a 32-bit transfer to a 32-bit bus.
In the more conventional way of doing bus sizing, the CPU would start a cycle by telling the system what size of data it wants to transfer, the system would respond by telling it what size of bus is present at the address the data resides at, and only then would the CPU actually start transferring data.
In order to save time and skip this initial phase, the 020 assumes the bus is 32 bits wide and follows a simple rule about which part of the 32-bit bus is used if the actual bus has a smaller width. So, when it starts a data transfer, it first assumes it's going to be 32 bits.
In the case at hand, what we want to look at is writing data - in the case of generating video, the only requirement is to write to the 128k of 8-bit RAM controlled by the 8301 ULA. However, in order to get proper memory functionality, we also need to be able to read from those same addresses. To make that faster, we would ideally like this memory area to be 32 bits wide. So, we can use a trick where 32-bit memory occupies the same addresses as the 8-bit RAM tied to the ZX8301 ULA. When data is written, it's written to both, but it's only read (MUCH faster) from the 32-bit RAM.
The particular assumption the 020 makes with regard to bus width actually makes this possible, and easier than it looks.
A careful look at the sections of the user's manual that deal with bus cycles (we only want the write cycles) shows that the 020 replicates certain bytes of data depending on where it is in the sequence of cycles required to transfer the data it wants transferred. The reason it does this is that it has to assume a 32-bit width for every cycle. In particular, the 020 assumes that an 8-bit bus is connected to CPU data bits 31..24, i.e. the top byte of the 32-bit long word.
When the 020 writes any data to the bus, it has to place bytes of data so that the correct data goes to the correct parts of the bus for any possible bus width and data size.
Here is what it does when it writes a long word to a 32-bit bus, 32-bit aligned, so in theory it can transfer it in one data transfer cycle. It puts the 32-bit data on its 32-bit bus, but incidentally, if you look at data bits 31..24, you will find exactly the right byte for an 8-bit bus, at the correct address. In fact, the 020 expects a 16-bit bus to be connected to data bits 31..16, and lo and behold, the data is again at the right address. Obviously, it's at the right address for a 32-bit bus - you just use the whole thing.
So, it starts off by attempting to write the whole 32 bits. Then it uses 2 signals that the system returns to tell it how much of the data was actually stored - and in our case, this would be 8 bits. At this point, because we have 32-bit and 8-bit memory in parallel, we can actually use the whole 32 bits and store them then and there into the 32-bit memory, while at the same time using just the top byte and storing it to the 8-bit RAM. And - because we want the 020 to go through the rest of the necessary cycles to transfer the remaining bytes to the 8-bit memory - we tell the 020 that the actual bus is 8 bits wide and only 8 bits were transferred.
What happens then is that the 020 responds with another cycle, telling the system it wants to transfer the 3 remaining bytes to the next address (offset +1). If we again tell the 020 only 8 bits were transferred, it does yet another cycle, telling us one word (2 bytes) remains to be transferred at the next address (offset +2). And finally, if we tell it only one byte was transferred, it does one more cycle, requesting the one remaining byte to be transferred at the next address (offset +3).
The fact that the 020 assumes a 32-bit width in EVERY one of these cycles (even though data was already partially transferred) makes the very complicated task of figuring out when data is suitable to be written to the 32-bit RAM (in our example it was obviously possible in the first transfer) almost trivial - the work is already done for us. It is safe to write the 32-bit memory in EVERY cycle.
The peculiar way the 020 approaches bus sizing actually requires it to replicate bytes to portions of the 32-bit bus so they always suit any bus size. The section on bus operation also gives a table for generating byte enable signals for a 32-bit bus, which are necessary when less than the full 32 bits of data are to be written to 32-bit memory. Not surprisingly, these signals also always fit every case.

So this is what actually happens:
Let's say the 020 wants to write 4 bytes ABCD over its 32-bit bus. A goes to bits 31..24, B to 23..16, C to 15..8, D to 7..0, i.e. literally as I write them.

This is what happens in our 4 cycles as I have outlined them above:

1st cycle: ABCD appears on the bus, and enables for all bytes are active, AAAA

This is where it would end if a 32-bit wide bus were indicated. However, we respond by telling the 020 the bus is 8 bits wide, and continue to do so for all addresses occupied by the 128k video RAM. The 020 responds with:

2nd cycle: BBCD appears on the bus, offset +1, and the 3 right-side enables are active, -AAA
3rd cycle: CDCD appears on the bus, offset +2, and the 2 right-side enables are active, --AA
4th cycle: DDDD appears on the bus, offset +3, and the rightmost enable is active, ---A

There are 2 things to note here:

1) Since 8-bit data for an 8-bit bus must appear on the top 8 bits of the bus (leftmost position), note that as the 020 generates 4 cycles to transfer the 4 bytes, the sequence A, B, C, D indeed appears on those bits as expected. Also, the address lines cycle through 4 consecutive addresses, as expected.

2) Simultaneously, the remaining bytes appear at exactly the right positions in all 32 bits of the bus, with the right byte enables set, for them to be simultaneously written to 32-bit RAM (indeed, if it were 16-bit RAM they would still appear in the right positions at the right time and with the proper address offsets). If we simply kept writing the 32-bit RAM as the cycles go, completely unaware that there is an 8-bit bus shadowed 'behind' it, all the data still ends up in exactly the right place. There is a peculiar side effect: byte A is written once, byte B twice, byte C three times, and byte D four times in succession, but the proper data ends up in the right place anyway. This side effect would matter if we were accessing certain kinds of IO hardware (which rely on written data being transferred elsewhere as it's being written, so instead of ABCD being transferred, it would end up being ABCDBCDCDD), but that sort of thing is never shadowed, because it usually does not make any sense, the side effect notwithstanding.

So... long winded, but there you have it. This, incidentally, is exactly how the SGC works.
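To make the four cycles concrete, here is a small C simulation of the sequence above - purely an illustration: the lane and enable patterns are the ABCD example from the post, and the two arrays stand in for the 32-bit shadow RAM and the 8-bit RAM behind the ULA:

Code:
#include <stdio.h>
#include <stdint.h>

static uint8_t ram32[4];  /* one aligned long word of 32-bit RAM    */
static uint8_t ram8[4];   /* same addresses, 8-bit RAM behind 8301  */

/* Byte lanes the 020 drives for a write, given the current offset
   (A1:A0). d[] is the original long word; the replication pattern
   is the ABCD, BBCD, CDCD, DDDD sequence shown above. */
static void drive_lanes(const uint8_t d[4], int offset, uint8_t lanes[4])
{
    switch (offset) {
    case 0: lanes[0]=d[0]; lanes[1]=d[1]; lanes[2]=d[2]; lanes[3]=d[3]; break;
    case 1: lanes[0]=d[1]; lanes[1]=d[1]; lanes[2]=d[2]; lanes[3]=d[3]; break;
    case 2: lanes[0]=d[2]; lanes[1]=d[3]; lanes[2]=d[2]; lanes[3]=d[3]; break;
    case 3: lanes[0]=d[3]; lanes[1]=d[3]; lanes[2]=d[3]; lanes[3]=d[3]; break;
    }
}

int main(void)
{
    const uint8_t data[4] = { 'A', 'B', 'C', 'D' };

    /* The system keeps answering "8-bit port", so the 020 performs
       four cycles, one per byte, at offsets 0..3. */
    for (int offset = 0; offset < 4; offset++) {
        uint8_t lanes[4];
        drive_lanes(data, offset, lanes);

        /* Byte enables: lanes offset..3 are valid (AAAA, -AAA, ...),
           so the 32-bit RAM can safely be written in EVERY cycle. */
        for (int lane = offset; lane < 4; lane++)
            ram32[lane] = lanes[lane];

        ram8[offset] = lanes[0];   /* 8-bit RAM sees only the top lane */
    }

    printf("ram32 = %.4s, ram8 = %.4s\n",
           (const char *)ram32, (const char *)ram8);
    return 0;
}

Running it prints ram32 = ABCD, ram8 = ABCD - both copies end up correct, with A written once and D four times, exactly as described.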
One which I clearly remember is the E line which was, as far as I know, never used for anything. SP0-3 are also all grounded on the motherboard and considering what you are putting on-board, highly unlikely to be used at all. Also DBGL was never used.
Ooooh, I think you just saved me a ton of time right there :)

I have been looking for a way to shoehorn in a couple of extra address lines, and two lines for decoders on expansions to tell the CPU how wide the bus is. I am a little confused by what you said about how the 020 determines the bus width. I have been using the two pins to have the address decoder explicitly tell the CPU the width of the bus at any given decoded address. Your way sounds more passive and automatic, and gets rid of some head-scratching logic and a GAL or two - IF I am understanding you correctly.
Actually, if you read the above, you will find that you still have to use the same approach, and indeed it's the best and fastest one. What you do is just add terms to your decoder that let it select an address space depending on the read/write signal and, in some cases, provide multiple chip selects as I outlined above, while signalling the bus width slightly differently.
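As a rough sketch of those decoder terms (addresses and names are made up for illustration; FC0..2 = 111 is the CPU space mentioned further below): reads of the video window select only the fast 32-bit RAM and report a 32-bit port, while writes select both RAMs and report an 8-bit port so the 020 serialises the bytes:

Code:
#include <stdint.h>
#include <stdbool.h>

enum port_width { PORT_8 = 8, PORT_32 = 32 };

struct decode_out {
    bool cs_ram32;          /* select the fast 32-bit (shadow) RAM   */
    bool cs_vram8;          /* select the 8-bit RAM behind the 8301  */
    enum port_width width;  /* width to report back to the 020       */
};

#define VRAM_BASE 0x20000u  /* hypothetical 128k video window        */
#define VRAM_SIZE 0x20000u

struct decode_out decode(uint32_t addr, bool is_write, unsigned fc)
{
    struct decode_out out = { false, false, PORT_32 };

    if (fc == 7)            /* FC0..2 = 111: CPU space - back off    */
        return out;

    if (addr - VRAM_BASE < VRAM_SIZE) {
        out.cs_ram32 = true;        /* shadow RAM selected either way */
        if (is_write) {
            out.cs_vram8 = true;    /* writes also go to the 8301     */
            out.width    = PORT_8;  /* force the byte-by-byte cycles  */
        }                           /* reads: full width, fast        */
    }
    return out;
}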

Just remembered 3 more lines you could use - FC0..2; these were normally not used as far as I remember. I seem to remember they were used to signal interrupt autovectoring on the original 68008, but in any case the 020 uses a separate pin called AVEC, which you can simply tie to ground so it will always use autovectoring, so they are not needed on the external bus. In fact I don't know of any peripheral that used these that would be needed in your system.
IIRC I intended to use FC0, 1, 2 and E to implement A20, 21, 22, 23 on the GF. Together with the 8 data bits, this comes to 32 lines, which could then be used to implement a full 32-bit multiplexed bus (the same 32 lines would be used to first transfer 32 bits of an address, then 32 bits of data, using another signal to tell the system which is which). I think I used AS (which is not used by QL hardware AFAIK) to signal the address or data phase. This was easy to do because the 68060 used on the GF has no dynamic bus sizing and also requires bus buffering, so all the address and data lines went into a PLD used to interface the 060 to the QL expansion bus, and one could do a lot of things with the signals since they were already all present for the internals of the PLD to work with them. Unfortunately, this is not at all easy with the 68020, and it would make the hardware complex to way beyond overkill.

Now, the actual FC0..2 signals on the CPU are needed to interface with the 68881/68882 FPU.
This is because FC0..2 are a kind of extra address line set which defines 8 instances of a 4G address space. Due to the internally separated address spaces for user and supervisor code and data - required for systems which support virtualisation and use an MMU - 4 of the 8 possible encodings on FC0..2 are taken. Another, FC0..2 = 111, is used for coprocessor communications; special instructions are used for this, and external decoders must have their outputs disabled when this combination appears, to prevent the FPU and other parts of the system from being enabled simultaneously.
AFAIK one of the 8 possible codes is left as 'user defined', but in order to use it, one needs special instructions implemented in the 020 and higher members of the 68k family, which is a pity because it could have been used to map all the various IO devices, flash ROMs which are copied to RAM, etc., without taking up the usual address space. Incidentally, the 68000, 68008 and 68010 can't directly use FC0..2, but it is possible to add an FPU even to the bog standard QL by mapping it as an IO device. It even supports dynamic bus sizing, and except for using FC0..2 natively (easy to spoof by tying its FC0..2 pins together and using them as an active high chip select), it works just like a regular 68k peripheral :) (BTW if you really want to do a lot of math, you can connect a whole array of them - the 020 supports multiple FPUs :) ).

BTW regarding SP0..3: these are connected to ground on the QL motherboard, and indeed it would be a good idea to keep them that way, as the bus as a whole really could use every ground pin it can get - sadly, the designers didn't think to spread them around within the connector, which would have improved signal integrity a lot. DBGL is a good candidate for a 'special chip select', as it's pulled up to VCC on the regular QL, so a peripheral using it won't misbehave if plugged into a regular QL that does not support it. ASL is best left unused (and possibly not connected, pulled high on the motherboard).

There is a LOT that could be done to tidy up the IO device area at 18000h. I did this on the Aurora and produced an extra chip select on the ROM port to handle it. I wish I had thought of this on the Qubide; it would have made the hardware simpler and more capable - with the nice feature that it could sit at the address of the ROM port and yet not disable it. Ask me about it :P
IIRC on the QL only the first and last 256 bytes of the IO device area are used - 18000..180FF for the ZX8301 and 8302, and 1BF00..1BFFF for QIMI.
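For reference, those addresses as constants (values straight from the paragraph above, with its IIRC-level accuracy; everything between the two windows is unused on a stock QL):

Code:
/* QL IO device area, per the description above */
#define QL_IO_BASE   0x18000u   /* start of the IO device area       */
#define QL_ZX830X_LO 0x18000u   /* ZX8301/ZX8302 registers...        */
#define QL_ZX830X_HI 0x180FFu   /* ...first 256 bytes of the window  */
#define QL_QIMI_LO   0x1BF00u   /* QIMI...                           */
#define QL_QIMI_HI   0x1BFFFu   /* ...last 256 bytes of the window   */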


User avatar
Dave
SandySuperQDave
Posts: 2816
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX
Contact:

Re: Fun things to do with an MC68EC020....

Post by Dave »

Nasta:

How much of the work you did with Goldfire could be retargeted to a regular 680x0? How easy would it be to retarget the custom logic to a readily available FPGA?

Even a 68020 board based on that technology would kick ass, even if it is maybe 1/50th of the speed it could have achieved!

Everyone else:

Those interested in hardware design, would you be interested in a truly open-source project to do this? We could each take a subsystem, or just contribute knowledge/ideas.... Particularly in the area of video...


Nasta
Gold Card
Posts: 463
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Fun things to do with an MC68EC020....

Post by Nasta »

OK, this is going to be a long one (isn't it always :) ), but you did ask for it.
What follows has a lot of references to the GF, a product that never saw the light of day, but that I spent a huge amount of time designing. More importantly, because it was to be a 'synthesis' of sorts of all previous expansions, there is a LOT of data here on how the QL works, how the GC and SGC go about implementing their extra functions and enhancements, as well as quite a lot of discussion of how 68k CPUs and QL hardware in general work, and of the requirements and functions of various addresses, hardware signals, etc. - lots of data that may not have been previously available.
Please note a lot of this is from memory, so I may be wrong in the details (and may have typos in hex values etc., too).

CPLD programming for the GF

Although the CPLD programming was never written as a 'program' in one of the CPLD/FPGA programming languages (reason explained below), I spent hundreds of hours on how it should work and why, so there is a lot of text and some snippets of schematics. Retargeting it for the 020 would be overkill, though, since the 020 already does most of that by itself.
One reason why a CPLD was a requirement on the GF was the use of LVTTL logic, which operates on 3.3V and does not tolerate 5V signals on its inputs. Since all fast devices that need to be connected directly to the CPU(s) operate from that same supply and only use 3.3V logic levels, while all IO devices, including the QL bus (with extra capabilities), can produce 5V logic levels, some sort of 'translation logic' was needed in between. And since the 060 CPU does not support bus sizing, extra logic was needed for that too; a CPLD, which can incidentally operate with both 5V and 3.3V logic, was the ideal 'glue'.
The 020 can indeed operate with both 3.3 and 5V logic, so it can actually work directly with the bus. However, because it's so fast, in order to get predictable behaviour on its many data, address and control lines, none of these should be let out onto the QL's bus directly, as one cannot guarantee what is or isn't connected there. This is one big difference between the GC and the SGC. The SGC was conceived as being able to drive some sort of expansion bus (not just the QL's motherboard), and incorporates bus buffers for that reason. One extra bit that was to be added on the GF was series termination resistors, to prevent bus line 'ringing'. This is a must since a CPLD even faster than the discrete logic chips used on the SGC was to drive the bus, and this was already shown to be a problem on the SGC. It should also be considered a must on the 020 board, if one expects the expansion bus to drive any sort of serious peripheral (not just a very small board where the bus could be well contained and have predictable behaviour).
(Ironically) I anticipated the GF to be the last CPU enhancement developed for the original QL hardware, so I had to think forward, and not only see if I could enhance the operation of hardware already out there (which is what one would expect from a rather expensive add-on!), but provide new functionality well outside the capability of the original QL bus.

Extended QL bus specifications:

- Extra address lines (on lines FC0, FC1, FC2, E). Small footnote: FC0..2 are all high when the so-called CPU address space is accessed; this is, amongst other things, used to communicate with a coprocessor such as the 68881 or 68882 FPU, and for interrupt acknowledge. Even with a 68020, FC0..2 = 111 should be decoded to prevent the usual address decoding mechanism from working, so if something external is to use CPU address space, this decoded signal can be routed to an unused pin on the bus instead of using 3 pins. These address lines expand the address space available on the bus to 16M, which is incidentally the complete address space of the 68EC020 - which begs further questions, as follows:

- The signal DSMCL is normally used on the QL to disable the existing decoders on the motherboard. It's normally pulled high by an external device when it detects an address it wants to use as its own. Because of the way 68k CPUs handle bus accesses, on the original QL all the available addresses appear to be used, so anything on the expansion bus behaves as if it's disabling the internal QL decoder and taking its place in the address map.
The QL's internal address decoder does not look at any address line beyond A17 (defining its own address space as 256k), so the entire memory map of the original board (said 256k) is repeated 4 times (i.e. has 4 aliases) within the 1M address space of the original 68008 CPU. So, if one wanted to implement a RAM expansion, DSMCL had to be pulled up by the expansion whenever addresses within the expansion space appeared, to prevent the QL motherboard hardware from interfering on the bus.
One interesting consequence is that because of the way this works, external hardware on the expansion connector can replace any internal hardware on the motherboard, simply by pulling DSMCL high when the appropriate address shows up on the bus.
One might ask: why this aliasing and repeating of addresses? Well, the 68k CPU family uses an asynchronous bus protocol - it's the addressed device which tells the CPU when the data transfer is done, using the DTACKL signal. Without a device to provide that signal, if you tried to read or write an address with 'nothing' there, there would be nothing to generate the signal, and the CPU would wait 'forever', i.e. the system would hang.
In some systems there is a timer which imposes a maximum on how long the CPU will wait, and then either terminates the access, with written data lost and read data undefined (a sort of garbage in, garbage out approach), or generates the BERRL signal, which terminates the access and raises an exception (similar to an interrupt), making it possible for the CPU to jump to predefined software code and log the error, tell the user, etc. Other systems simply set things up so that there is always something at every address, so there is always something to tell the CPU it's done - usually by simply aliasing the addresses the hardware uses as many times as needed to fill up the full address space. A 'technical specification' then defines which of the aliases the software should use, and this is guaranteed not to change - i.e. aliases are handled in software. This is the approach the QL uses, and BERRL is not used (which makes it another candidate for a bus pin used for a more interesting function).
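The QL's version of this can be shown in one line - a sketch of the fold its decoder effectively performs, given that it ignores everything above A17 (as described above):

Code:
#include <stdint.h>

/* The stock QL decoder ignores A18/A19, so every address in the
   68008's 1M space folds back into the 256k it actually decodes -
   which is exactly the 4-way aliasing described above. */
static inline uint32_t ql_motherboard_addr(uint32_t cpu_addr)
{
    return cpu_addr & 0x3FFFFu;   /* keep A0..A17 */
}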
Because the GF was to hugely expand the capability of the original bus, it could not rely only on 'default hardware' by aliasing like the QL does. Although the aliasing trick was still used partially, a 'timeout' circuit was provided that automatically terminates a cycle after a while (the time was chosen so that anything on the bus would always be faster, no matter how slow it was).
The GF, being a CPU board, never generates DSMCL, as it's actually trying to be an extension of the QL motherboard and was supposed to be able to operate connected to one. But, since it's an 'uber-CPU' board, it had its own internal mechanisms to handle hardware which may initially occupy some of the addresses used by the original QL motherboard, simply by not letting these accesses be seen on the QL bus. A good example is the area occupied by the ZX8301 video RAM, where, just like on the SGC and GC, it only writes to these addresses, never giving any indication of a read on the QL bus, because reads happen internally from the fast 32-bit RAM.
The Aurora, being a motherboard replacement, uses DSMCL as an input, which makes it possible for other external hardware to disable any part of its address map and use it for its own purposes, maintaining original QL compatibility.
The 020 board we are discussing here is a complete replacement - in a way, like a GF and Aurora together - so it has to, at least in some respects, behave like a regular QL. This includes the behaviour of the DSMCL line.
The determining factor here is: if the entire address space of the board (all 16M) is made available on the bus (for the moment disregarding the bus width issues), should it implement DSMCL so that external hardware can replace any on-board hardware? Most of the 020 map should (and probably will) be taken up by fast 32-bit stuff, mostly RAM. Although it is in theory possible to output the address to the bus, wait to see if the DSMCL line appears, and then, instead of doing a 32-bit access, defer to the bus and use it to do an access to an 8-bit piece of hardware there, this requires a huge slow-down of the 32-bit access when DSMCL does not appear. In fact, only to satisfy the timing of DSMCL that older peripherals generate, it would be so slowed down that it could have already completed the transfer by the time the old 68008 even started one - the 020 at 25MHz is potentially 4.45 times faster at transferring data even when doing it 8 bits at a time.
The way the GF handled this sort of thing was easy - it didn't. The fact that its internal memory map was much bigger, coupled with a slowdown that would truly be preposterous, put an end to that idea, as attractive as it may be.
IMHO, the same approach is probably the best for the 020 board, even if 'only' a 68EC020 is used, because it's not worth the slow-down and complication. There is an alternative way some of this functionality could be added; more on that in the next post.

- Fast bus transfers. Most modern and even some old QL peripherals would be capable of significantly higher data transfer rates were it not for the limit imposed by the requirement to emulate the 68008, and only because of compatibility with the ZX8301 video ULA. The ZX8302, in turn, actually behaves as a 'stupid peripheral' - a collection of IO ports - and is quite fast, while the ZX8301 is not only slow, but corrupts data if you do not adhere closely to original 68008 timing. This is because the ZX8301 takes over the RAM when fetching data from the video RAM area, and in theory it does this by delaying the DTACKL signal (i.e. making the CPU wait) while it's using the RAM itself. However, because it takes a while for the CPU to recognize DTACKL, the ZX8301 activates it before it has actually read or written the data, knowing that the data will be there in time when the CPU expects it, so as not to slow the CPU down more than is absolutely necessary.
With a faster CPU, the 'recognition time' of DTACKL becomes shorter, and upwards of a certain speed (around 9MHz) the CPU will assume data is there before it actually is when reading, or remove write data assuming it's already written when it's not, and the data will be corrupt. To prevent this on faster CPUs, DTACKL is delayed on its way to the faster CPU, to account for the DTACKL recognition lag of the 68008 and the behaviour of the ZX8301.
However, this delay is ALWAYS in place, because other existing peripherals must be assumed to expect the old 68008, even though they may be able to work faster. To get around this, the GF actually implemented several aliases of the whole QL address map (1M) which ultimately appeared as the same addresses on the external bus, but behaved differently as far as the DTACKL delay is concerned. In particular, the first one had the usual long delay imposed on DTACKL, and the last one had zero delay, expecting the peripheral to follow some basic timing rules and not generate DTACKL if it had not read or written the data properly, while the in-between ones had intermediate delay values.
The idea was to use a bit of software support to exploit this aliasing through a slightly modified 'peripheral discovery' routine in the OS. Normally, this goes around the addresses specified for peripherals in 16k increments looking for a 'magic number', 4AFB0001h, and then assumes it has found a standard peripheral ROM header. The idea was for the OS to scan only within the alias that implements the standard QL bus timing, to see what it finds, then report this and let the user choose what speed it should attempt, by pretending it found the ROM in the alias that provides that speed setting instead - and only then would it go and link the ROMs into the system. It was of course also possible for a clever peripheral to support this in its own software and let the user manipulate only the alias where it accessed the actual 'business end' hardware, such as IDE interface registers, in case these could or had to work at a different speed than the ROM on the same board, or for that matter other bits of hardware. Incidentally, the GF also implemented a 32-bit bus protocol, but because this was new, it had no need of backwards compatibility, hence it did not have extra aliases of that.
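For flavour, a rough sketch of such a discovery scan - only the 16k step and the 4AFB0001h magic come from the post; the function types are hypothetical stand-ins for however a real board would read the bus:

Code:
#include <stdint.h>

#define ROM_MAGIC 0x4AFB0001u
#define SLOT_SIZE 0x4000u            /* 16k per peripheral slot */

typedef uint32_t (*bus_read_fn)(uint32_t addr);
typedef void     (*found_fn)(uint32_t slot_addr);

/* Walk [base, base+len) in 16k steps; report every slot whose first
   long word is the ROM header magic. The OS would then link the ROM
   in - here, at whichever speed alias the user picked. */
static void scan_expansion(uint32_t base, uint32_t len,
                           bus_read_fn read_long, found_fn found)
{
    for (uint32_t a = base; a < base + len; a += SLOT_SIZE)
        if (read_long(a) == ROM_MAGIC)
            found(a);
}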
The problem with the EC020 here is the lack of address space to implement something like this on the same scale. However, one could use a smaller chunk of the available 16M and implement smaller IO address map aliases, with fewer delay settings. It would certainly be a good idea to implement the 'no wait' setting, since it is then possible to design relatively simple hardware on the 8-bit bus which can still work at an acceptably high speed, and just the right one for the hardware itself.

- Cache support. This is another potentially important thing to consider, given the slow nature of the original expansion bus. However, this is completely different WRT the 68020 compared to all later 68k CPUs. Namely, the 68030 and higher all have a data cache, which means that not only do they store previously read instructions internally like the 68020 does, they also store previously read and, more importantly, written data. With an instruction cache, you have to make sure that if something changes an area of RAM that contained instructions the CPU has previously read - like loading a new piece of code (which at that point is treated by the CPU as just raw data; it has no idea this is really code) - then any parts of code from that memory area in the cache will not be valid any more (as the actual code has changed with the new load), and the cache must be updated or, as a quick and dirty fix, invalidated in its entirety, forcing the CPU to re-read from memory.
The idea of the cache, incidentally, is to have a small piece of ultra-fast memory within the CPU which keeps a copy of the part of memory the CPU is frequently using at the moment, so that it does not have to access main memory, but this fast local memory instead. This makes the CPU faster because a huge part of actual code is loops, and when they are short enough, all the instructions within a loop fit inside the cache, resulting in the CPU being fed (in the case of the 020) instructions as fast as possible - and not only that, the bus can read or write data or pre-fetch further instructions in parallel. Users of the SGC can easily see how much of a difference it makes, even though the SGC has fast 32-bit RAM and only 256 bytes of cache. The cache can typically be 3 times faster at providing instructions than the RAM.
Newer 68k CPUs also did something similar with data, and this actually becomes a problem for simple IO. Not only may data in the cache not be the same as in actual memory if something re-writes the memory, the opposite can also be true - data in memory may not be the same as that with which the CPU works, because it does not only keep it in the cache for later reading, it also keeps it there for later writing, in case it gets further modified by the program it's running, so it does not have to write it to memory only to read it again to modify it. Instead, it only writes the data when it needs the space in the cache for other data, or when it's told to explicitly, either by a hardware signal or a software instruction. The problem with this is that data writing is delayed, and may also be out of sequence with the instructions, including ones that read data from nearby addresses.
It should be relatively obvious that addresses used by IO hardware should not be cached, because these are not used to execute programs from, and the hardware relies on the data reads and writes being sequential and often 'volatile' - IO hardware does not generally behave as memory, so it has no storage capacity. Once a byte is read, it 'goes away', usually making room for the next byte; the same goes for writing - or the data written is a command, and the data read is status. For such locations all caching should be disabled.
Unfortunately, the QL architecture poses some problems here. In particular, it specifies that each peripheral is recognized by the system using a ROM with a defined structure, which tells the system what to do by linking in software stored within it. 16k of address space is provided for each peripheral (although a peripheral can take the space of several such 16k chunks if needed), but the software itself 'knows' where the actual hardware it's handling is, or in fact whether there is any at all. This presents two problems. First, when ROM software is linked into the system, the CPU actually has to execute it from the ROM. In our case this is a problem because it's very slow. Unfortunately, it may even count on being slow to provide correct timing when accessing the hardware it handles, and there is no way to know this in advance. So, in some cases it would be interesting to have it cached as code. However, there may be a small bunch of addresses within the 16k allocated to each peripheral which is actually the hardware itself, and if there is a data cache, those should certainly NOT be cached. So: code cache optional, data cache off.
But then there is the second problem. A peripheral might use more than one slot and be something like a video card. Now, it is unlikely this will have an on-board ROM, but nevertheless it is possible. Since video speed is generally a rather important issue, it's more likely the code will be copied to RAM for fastest execution, so no code caching is necessary. However, such a peripheral will contain video memory - and in this case one thing we might want is data caching, since video data manipulation often requires several operations on data read from one address before its modified contents are actually written back to the same address. But then it actually HAS to be written, or we would not see it (for data caching this is called 'write-through mode'). This is no issue on the 020 CPU, as it has no data cache, but on the 060 - which has separate caches for data and code, 32 times the cache size of the 020 - not considering this would be a waste.
The policy adopted by the GF was to not code-cache any peripherals, although this is usually not a problem - addresses used as IO are never used to read instructions from (at least if the system is working properly!). Data caching was, however, an option, using yet more aliases. It was expected that any code for which speed was important would be copied to RAM from the external ROMs - in fact, this was preferred over any sort of speed-up of the ROM itself, including using wide ROMs, simply because it kept the requirements for the peripheral hardware simple and easy to implement, just the same as the original spec.
Still, there may be a point here for the 68EC020 - since address space is limited for an alias approach, it may be a good idea to use caching in case we do not want to take away, say, RAM space for external EPROMs, which would otherwise sit on the bus wasting space doing nothing once their content was copied to RAM. Actual IO hardware and things like video RAM do not (or need not) contain code, so leaving code caching on for the entire address map can be made safe.
Incidentally, the considerations regarding copying ROMs to RAM or not later became an idea for a Qubide II (which could use the ROM slot address but still keep the ROM slot available!) and a ROM management system - more on this some other time.

- RAM shadowing. This came up as a requirement specifically for the 'odd' peripheral described above, a video card. This was never really planned for by the original QL spec, and as the caching example shows, it has quite a different nature from a regular IO type peripheral such as a disc interface or various ports, since it's actually RAM. With the Aurora already on the market, squeezing the most out of it with a GF as an upgrade was a no-brainer. Provision for something faster than an SGC had already been made on the Aurora, and in fact it's a great pity the ALTERA glue logic CPLD on the SGC became unobtainium, because a very simple upgrade to it could have made the Aurora quite a bit faster when mated with an SGC. This is because the Aurora implements a proper DTACKL signal at the maximum speed it can, and this is about 2.5 times faster than the maximum speed of the old QL bus. A simple upgrade to the SGC CPLD could have implemented an alternative 256k IO area with no internal delay added to the DTACKL signal (as I said, this delay caters for the ZX8301), and made it possible for the Aurora to work at least twice as fast. In essence, this would have implemented the 'no delay' IO area alias as it was planned on the GF.
However, just like the original ZX8301 is shadowed in both the GC and SGC, with a significant improvement in speed, the same thing could be done with the Aurora, or for that matter anything similar to it. So, the GF added an alias of all the aliases it had for all kinds of IO, which would only write data to the bus while simultaneously shadowing it at the same offset in the top area of the RAM; reading that alias would actually read an alias of the RAM where the writes were shadowed. Again, the idea was for the OS to initially recognize an Aurora, let the user choose options such as shadowing and cache, and adjust the address at which it accesses the peripheral accordingly.

At this point one might ask: why use a part of RAM for shadowing if there is no actual hardware to shadow connected to the system? Well, here is where the QL's architecture and its OS actually help. If the IO area that has the side effect of writing data to RAM is never used (because no peripheral was found that may require it), that part of RAM is never changed as a result of this side effect; it behaves just as regular RAM would, and is safe to use that way. However, if the peripheral is found and the user chooses the shadowing option, the area of RAM where the shadow is falls right into the resident procedure memory heap, so all the OS has to do is allocate it and then ignore the allocation (since it's the hardware that will be using it automatically), to prevent anything else from allocating it for its own use - so only that part of RAM is lost. Initially, the idea was to use a RAM module on the GF. In case it was smaller than the maximum possible size, it would have aliases of itself within the total area available for RAM, which always put the shadow area at the top of the RAM. A bit of extra cleverness was required from the RAM test to detect the aliasing, but since a lot would have to change for the GF to work at its full capacity, this was not deemed a problem.
Assuming a clever enough memory map structure can be devised for the 68EC020, given the 16M maximum addressing limitation, this aliasing/shadowing trick can easily be implemented on a 020 board.

2) 32-bit bus protocol

(More about this in a subsequent post)

To sum it up, the GF used aliasing to implement special options like access speed, caching and shadowing for its peripherals. In general it implemented 3 IO areas, 2 of which were brand new, and for each of the 3, aliases to control the extra features as outlined above.
These were:
1) The original QL area, which is the original 1M address map of the QL; accessing it would generate an 8-bit access similar to the original 68008, the differences and side effects depending on the alias used. The actual access was indicated by the signal DSL, just like on the original QL.

2) An extended 8-bit area, which had a 16M address space. Accessing that would generate an 8-bit access similar to the original QL but without the DTACKL delay, so it operated at a preset speed, close to the 'no delay' option of the original QL area. This had aliases only for caching, and also used its own signal to indicate access; IIRC it was placed on what was previously used for HALT or BERR, not sure which. Just like normal, the DTACKL signal was used as an indication from the peripheral to finish the cycle. This was intended for something like a large flash ROM or a bank of them. The idea came from a feature of the Aurora, which can use a large EPROM to store extra software, but this needs some kind of driver or manager to be used. There is, BTW, a problem inherent to the SGC which prevents a flash ROM plugged into the Aurora ROM socket from being programmed in-circuit, which this (EC)020 board would be wise to avoid :)
There was also one more 16M area which was invisible externally; its addresses would normally be used as a shadowing alias. This is where the on-board flash ROM was located (to store the OS and other firmware as well as boot code), so up to 16M of on-board flash was supported.

3) A 32-bit bus area, which was 32M in size. This would generate a special multiplexed 32-bit protocol on the original QL connector. It also operated at the highest possible speed, since the protocol was defined to use an interlocking sequence of signals for all its phases, making it possible for the peripheral to control the duration of all bus phases so it could be optimized for speed and signal integrity. This also had aliases only for caching or shadowing. It was to use the ASL signal to indicate a cycle start, as this signal is normally not used by any peripheral - in fact, IIRC, the hardware manual explicitly says NOT to use it at all. It also used a second signal to differentiate between the address and data phases on the bus, in lockstep with the usual DTACKL signal from the peripheral.

All in all, the IO actually used a huge total of 256M of the memory map, due to the many aliases.
In particular, the shadowed/non-shadowed versions of the 32-bit area used 64M on their own, and these were duplicated as cached and non-cached versions - that's 128M right there. Then the 16M extended 8-bit area had its 16M on-board counterpart, and both were available in cached and non-cached versions, so 64M more was used there. The remaining 64M was used for the cached/non-cached, shadowed/non-shadowed (16 speed settings x 1M each) copies of the original QL area.
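The budget adds up as follows - sizes reconstructed from the description above; the post gives no base addresses, so none are invented here:

Code:
#define MB(x) ((x) * 1024ul * 1024ul)

/* 32-bit area:  32M x (shadowed/plain) x (cached/uncached) = 128M */
#define GF_IO_32BIT  (MB(32) * 2 * 2)
/* 8-bit areas: (16M external + 16M on-board) x cache alias =  64M */
#define GF_IO_8BIT   ((MB(16) + MB(16)) * 2)
/* QL area: 1M x 16 speeds x shadow alias x cache alias     =  64M */
#define GF_IO_QL     (MB(1) * 16 * 2 * 2)
/* grand total:                                               256M */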
Obviously, such heavy-handedness in dispensing areas of available address space is impossible with the 68EC020 CPU. However, it is possible to implement a smaller subset of all of this, since several options are not present on the 020, nor does one need to cater for an original QL motherboard, since this is a full replacement. My thoughts on this to follow...
Last edited by Nasta on Thu Apr 18, 2013 3:36 pm, edited 1 time in total.


Nasta
Gold Card
Posts: 463
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Fun things to do with an MC68EC020....

Post by Nasta »

Now, as I said, the actual CPLD programming was never written as a program, but it was designed. One overwhelming reason for this was that the two competing companies which were, at the time, the only ones producing sensible CPLDs that could be used - AMD and Lattice - ended up with AMD completely pulling out of the CPLD business and selling its CPLD division to Lattice. A third company, Cypress Semiconductor, used to make their own compatible versions of the AMD CPLDs and promptly stopped once AMD pulled out. With AMD pulling out, their software support (programming language, editor, compiler, simulator) instantly went away, and so did Cypress's. Lattice, on the other hand, had completely different software and first decided to go the Cypress route (using a rather complicated language called VHDL, so both their CPLD and FPGA products could use the same software), but did not at all support the AMD designs they inherited. In fact, they killed off the very ones I intended to use first! Incidentally, the CPLDs used on the Aurora suffered the same fate. Later on, Lattice decided to go back and fuse their then top-of-the-line design with AMD's, which actually resulted in resurrecting one of AMD's CPLDs which could also be used for the GF. But they changed the programming software AGAIN. So, in a matter of about a year and a half, they first dropped the CPLDs I wanted to use, then dropped any software and programming support, then changed the programming language TWICE while forgetting about the one AMD used altogether (all in all, 3 different programming languages that are not easily translatable), and then decided to bring back an AMD-compatible CPLD - but not the one I wanted to use (fortunately, the hardware was similar), and only supported by a yet third programming package and hardware.

At this point it is important to say something about CPLDs - to unravel the acronym, Complex Programmable Logic Devices. There is another kind of programmable logic called the FPGA, or Field Programmable Gate Array, and the approach to implementing logic in these devices is quite different. FPGAs are in principle more capable, as they can cram large amounts of complex logic in, to the point that today you could build a whole computer in one. But you pay for this by the timing and actual speed being unpredictable in advance, because an FPGA not only implements programmable logic, but also programmable connections between it. Often, to get from one part of the chip to another, many connections are needed, and things like wide buses are the bane of an FPGA's existence, as different bits of the bus may have to take (and almost as a rule do take) different routes, which makes some bits faster and others slower! Because of this there is a rather high investment in the software that converts the program defining the logic into the logic itself, because it needs to find the best logic block placements in the chip in order to even be able to connect them, let alone connect them optimally. For this, a very complex simulator and placer is used. Even though the logic may appear to fit into an FPGA by the sheer amount of it, it may ultimately turn out that it does not, because it cannot be connected together given the routing resources. And when it can, it may turn out it's not quick enough - both of which mean you may end up having to buy a larger and faster, i.e. more expensive, FPGA than you originally thought. Finally, a simple change in the logic may also end up upsetting the routing, so you can for instance get the same pinout but not the same speed - it does not happen often, but the point is you cannot know in advance. What is worse, at the level of the programming language it is extremely difficult and sometimes impossible to force the compiler to implement the logic in a fashion that best uses the peculiarities of the FPGA implementation to get the most out of it. Manufacturers discourage this, claiming it's bad for future compatibility, and they do have a point, but as one may expect, being able to sell you more expensive FPGAs is also a good motivator for them.
CPLDs are a completely different beast. They are not suitable for complex logic, but rather for lots of simple logic. Implementing any kind of registers - and heaven forbid memory - will use up CPLD resources in a flash, and the capacity will be a small fraction of what an FPGA can do in this regard. But when it comes to lots of repeating functions involving many signals, such as bus bits, CPLDs rule. This is because they do not have visible routing resources and can connect almost anything to anything - not quite, but far more than an FPGA can. Also, if they do have some sort of routing, it's made to be hidden as far as timing and speed are concerned - the speed rating of a modern CPLD is guaranteed, no matter what kind of logic you put in it. And, because the logic it implements is fairly simple and uniform, it's actually possible to write the program so that it automatically takes advantage of the peculiarities of the CPLD implementation, and if you can do this, you can use almost all the available logic in the CPLD - IF the language the manufacturer uses supports it. One problem here is, again, touting compatibility - so that the same software can be used for CPLDs and FPGAs - but often (if not always) this removes the ability of the designer to exploit the actual way the CPLD is designed. This is what happened when AMD went away: for a short time Cypress's alternative compatible CPLDs were available (but they would not take existing AMD programs even though the chips were essentially the same; you had to re-write them in VHDL, which is FAR more complicated since it's designed to describe any logic, not just CPLDs or FPGAs), then that chip was dropped, then re-instated without the ability to import old programs, and finally the same thing again, but at least you could re-write things in something similar - yet still not exploit the actual hardware implementation. Really, it's no wonder I gave up :(



But back to the QL, GF, and the 68020 project.

One thing the GF was to implement was a way to do 32-bit transfers over the QL's expansion connector. This required some thought, but it all came together when I got the idea to use FC0, 1, 2 and E as extra address lines. Namely, with the existing 20 address and 8 data lines, that makes 32 lines total, and this was enough to pique my interest. If you look at those as signal lines that can output or input any value, which is exactly how a CPLD treats them (since they would be connected directly to a CPLD), adding two signals that were not previously used could implement a full 32-bit bus - but multiplexed, so that the same 32 lines are first used for the address, and then made free for the peripheral to place data on them, or used by the CPU to output data on them. The whole thing would have to be asynchronous, because the only clock signal available on the bus is the original 7.5MHz one. Admittedly, I don't know that anything used it, but that does not mean nothing did. It would have been nice to have a clock signal, because in theory knowing how the GF generates the cycles could give a fast enough peripheral a speed advantage, but then distributing a fast clock signal around a bus of indeterminate length and signal integrity would have been a serious problem. So, async it was.
In order to use the address, data and extra lines without any of the 'old 8-bit stuff' being aware of it, the line DSL must stay inactive. This line is normally the 'data strobe', and all existing QL peripherals - in fact, the motherboard as well - use a low state on this signal as an 'enable' of sorts. If it's high, they ignore what's going on on the bus.

Obviously, an 'alternative' to DSL had to be found that would have a similar function, but only for future 32-bit peripherals. An ideal candidate was found in the normally unused ASL signal, which is already defined as the address strobe - the signal that tells the system there is a stable address on the bus - so it was the most logical choice, as it already has the right function. So, the GF would place an address on the 32-bit bus and activate ASL.
A bit more thought showed that with a 32-bit bus you do not need the lower address bits A0 and A1, but you do need byte selects - 4 of them - to tell peripherals which bytes the CPU will actually be writing, out of the full 32 bits. These can be ignored when data is read, as the CPU simply ignores the bytes it does not need. And, since the 32-bit bus could not occupy the whole of the 32-bit address space, because (a) A31..29 should be don't care for Qliberator, and (b) other things use up most of the remaining address map, 32 lines was more than enough to address 32M of space (A2 to A24 are needed for this), leaving 32 - 23 = 9 lines to be used for other clever things. 4 were used for byte selects, and I put some thought into using one more as a 'burst mode' signal (*).
When a 32-bit peripheral saw ASL activated, it would latch (internally store) the address and byte select signals, and then use the outputs of the latch as real address lines locally. From the RDWL signal and the byte selects it could also, in this phase, generate byte write enable signals (one for each of the 4 bytes) if desired. When it had done so, it would activate DTACKL as usual, telling the GF that it could now remove the address and use the 32 lines for data. This is a slightly different use of DTACKL, which is normally used to tell the CPU it can finish the bus cycle altogether; here it says it can finish the address phase of the bus cycle.
The GF would then either place the data on the lines (if it was doing a write), or expect the data to be placed on the lines by the peripheral (if it was doing a read), and activate a 32-bit data strobe. DSL could not be used for this, as it would erroneously activate standard QL 8-bit peripherals, so a different signal had to be used. This was found in the signal DBGL, which the QL does not use at all (it was intended for a planned but never produced buffered bus expansion box) and which is only pulled up on the motherboard by a resistor - which was just fine. So, the GF would activate (pull low) DBGL.
The peripheral would then either take the data from the bus or supply it, as the RDWL signal dictates, and when it had done so, it would return DTACKL high. This indicates that the data transfer is done. The GF would then deactivate both DBGL and ASL and, if it was outputting data, remove it from the bus.
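Putting the phases together, one write cycle of this protocol would run roughly as sketched below; set_lines(), assert_ASL() and friends are hypothetical stand-ins for the CPLD's actual pin handling, but the ordering follows the description above:

Code:
#include <stdint.h>
#include <stdbool.h>

extern void set_lines(uint32_t value);   /* drive the 32 shared lines */
extern void assert_ASL(bool on);         /* address strobe            */
extern void assert_DBGL(bool on);        /* 32-bit data strobe        */
extern void wait_DTACKL(bool level);     /* wait for a DTACKL edge    */

void gf32_write(uint32_t addr_and_selects, uint32_t data)
{
    set_lines(addr_and_selects);  /* A2..A24 + byte selects, etc.     */
    assert_ASL(true);             /* address phase begins             */
    wait_DTACKL(false);           /* peripheral has latched address   */

    set_lines(data);              /* same 32 lines reused for data    */
    assert_DBGL(true);            /* data phase begins                */
    wait_DTACKL(true);            /* peripheral has taken the data    */

    assert_DBGL(false);           /* cycle ends                       */
    assert_ASL(false);
}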

(*) burst mode is a mode of operation where the bus outputs only one address and expects the peripheral to supply, or accept, multiple consecutive long words of data starting at that address. The 68030 and higher 68k CPUs use this technique (always 4 consecutive long words) to speed up transfers for various things, notably instruction prefetch, data prefetch and in some cases data cache flushing. The idea is to omit the overhead of providing an address for each long word of data, implying consecutive transfers. If you can get the CPU to do such transfers, the speed improvement is potentially quite impressive - what would take at least 8 cycles (and realistically 12) on the 68030 takes 5 cycles in burst mode. The fastest possible transfer on the 68060 takes 8 cycles in normal mode and 5 in burst mode. The improvement is more dramatic for certain types of memory devices, such as SDRAM: at least 12 and possibly 14 or 16 cycles for 4 long words on the 68060 normally, versus 7 cycles in burst mode.
This was only contemplated for the 32-bit bus mode because realistically only something like a video card would be able to implement it, but it was of dubious advantage: the 68060 is very clever when it comes to bus use, shadowing would take all the advantage away for reads (the GF's SDRAM could easily sustain at least twice the speed of even a burst transfer using the 32-bit bus protocol), and for writes the 060 implements internal write buffering (or more, if the data cache is enabled), so it would mostly continue executing instructions from its caches while the actual write took place, rendering the write 'invisible' to software execution. The 020 isn't capable of burst transfers at all, so this mode could not be implemented if the 32-bit protocol was implemented on a 020 board.

There are a couple of fine points here.

- because DSL is never activated, normal 8-bit QL peripherals pay no attention to any signals on the bus and remain inactive; the whole 32-bit thing happens without them even noticing. That being said, the bus does require some extra attention regarding implementation - relatively strong buffering is required to be able to drive the lines quickly with old slow peripherals loading them. Secondly, if speed and power are required, termination - in this case series termination - is a must. So, each line had a 33 ohm resistor in series. If this was done with discrete logic chips (and there are suitable ones), there are versions that have the termination built in. A ground plane connection was also to be implemented by using a 3-row connector (all lines in the third row, except where the power lines are in the original 2 rows, would be used as ground, and one line would be used to sense the presence of a 3-row backplane).

- the 32-bit peripheral is completely in command of how long each phase of the bus cycle takes, and instead of using a level on DTACKL to signal the timing, it uses level transitions: high (inactive) to low to end the address phase, and low to high (inactive) to end the data phase. Being able to re-use DTACKL saves 2 signal lines and still implements totally flexible timing. In theory, the fastest transfer speed was about 5 CPU clocks for a write and 6 for a read - keep in mind the CPU clock was at least 40 and preferably 66 or more MHz. In reality, without a ground plane, about half of that could be expected - still about 24 times faster than the QL's original bus maximum theoretical speed (and the latter is only possible with something like the SGC, BTW).
The way it was designed, the implementation is completely independent of what the clock speed actually is; this is fairly easy to implement for slower CPUs, with proportionally less speed and relaxed requirements for signal integrity.

- in order to get decent speed, the DTACKL signal must be actively driven both low and high. On the original bus it's only driven low, and a resistor on the motherboard pulls it high. When lines are long, the resistor has to fight increasingly large capacitance and other parasitics, resulting in an ever slower return of DTACKL to high, which in the worst case leads to bus operation errors and system hangs. Active driving was never really defined on the QL, but this was a chance to do so. The bus spec only requires that after DTACKL is returned to high, it is made 'high impedance' (so other peripherals can drive it) within a short time, and this is quite easy to implement in hardware. Active driving with proper termination also prevents ringing of the DTACK line, which is essential for proper recognition and prevents noise and transients from being taken as DTACKL transitions.

Here are some ideas of what could be implemented on a 020 board.

First, some major considerations - available address space and RAM size, some speed issues.

On one side, this is determined by which actual CPU is used: a full 68020 or an EC020. The EC020 is limited to 16M, while the full 020 has a full 32-bit address bus, but for reasons already mentioned the QL can only use up to A28, which makes the maximum size of the address map 512M bytes.
On the other side, having owned an 8M QXL, I can tell you that indeed, you DO want more RAM. No-one ever had too much RAM! The reason is simple - it's a must once the possibility of extended graphics appears. When WMAN/QPTR is used, never mind something like SMSQ/E, non-destructive windows are implemented, the images of which are kept in RAM, so you are looking at increased memory use proportional to screen size and color depth. Let's look at that in a bit more detail. In theory it should not be too difficult to implement something like Q40/60 style graphics on this board (although it could get a bit slow...), and suddenly you are looking at up to 1024x512 windows, so 4x the original size, and - instead of 2 bits per pixel, 16 bits per pixel, so 8x the color depth. All in all, oh, 32x more memory used. 1024x512 at 16 bits per pixel is 1M of RAM. This sort of thing would be completely unusable on a 4M system. Aurora screen size is limited by its video RAM, which is 4x smaller, so expect 'only' 8x more memory used. Having used the 8M QXL at PC style resolutions, I can say that in some cases 8M would come up short - not alarmingly, but you could see the end of it. IMHO, if you are counting on better graphics, count on more RAM - it follows directly.
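The arithmetic, for anyone who wants to play with other modes (trivial, but it makes the point - numbers as in the paragraph above):

Code:
#include <stdio.h>

/* frame buffer size in kilobytes for a given mode */
static unsigned fb_kbytes(unsigned w, unsigned h, unsigned bpp)
{
    return w * h * bpp / 8 / 1024;
}

int main(void)
{
    printf("QL 512x256 @ 2bpp:      %u KB\n", fb_kbytes(512, 256, 2));   /* 32 KB   */
    printf("Q40/60 1024x512 @ 16bpp: %u KB\n", fb_kbytes(1024, 512, 16)); /* 1024 KB */
    printf("Wide 1024x576 @ 8bpp:    %u KB\n", fb_kbytes(1024, 576, 8));  /* 576 KB  */
    return 0;
}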

Although the use of a full 020 does not directly follow from the above, one advantage of the full 68020, even if used as an EC020 (extra address lines unused), is that a 33MHz clock rating is available. That said, it's all actually the same chip, selected. Motorola was always very conservative with its specs, so it would come as no surprise if an EC020 rated at 25MHz worked at 33MHz and quite possibly more, as long as its bus pins were lightly loaded and well managed, and some cooling was supplied. Some Atari enthusiasts were well known for running 10MHz bog standard 68000s at 16MHz and 33MHz 68020 chips at up to 50MHz... which is also an interesting data point to consider.

Then there is the matter of RAM implementation. Static RAM is the simplest to use, the fastest, and uses the least power, BUT it is far from the highest density and especially not the lowest cost. You are stuck with densities of 8M bits per chip (so 1M bytes), which are targeted at low power applications, making them more expensive than usual, and one reason for this is that the 020 uses 5V logic. 3.3V logic is more commonly used today and higher density SRAM is available for it, but it requires logic level conversion in the grand majority of cases. The cost of 5V to 3V logic translation is not trivial in terms of speed, board space, signal integrity and cost, although some of it may be recovered since the higher density devices are cheaper per bit. But to give you an idea: a 4M bit (512k x 8) device costs about 3 Euro or less in quantity. Moving to 8M bit (1M x 8) raises the price to anything between double and triple, while 16M bit (2M x 8) devices are already around 18-20 Euro a piece. The up side is, they are FAST - they would easily handle the fastest overclocked 020 - and they use very little power, some as low as 10mW per chip!
Unfortunately, the market has driven out the most logical alternative, regular 5V DRAM, which is the 020's 'natural counterpart'. The parts that are still manufactured are of the same density as SRAM, but cheaper - however, a DRAM controller (extra hardware) is needed. In most cases it's easier to source DRAM second-hand! Also, speed is lower than that of SRAM, by at least 30%.
Finally there is SDRAM, and although this is the most used RAM type today, 99.9% of the market is DDR SDRAM in various versions - these cannot be effectively used on a 020. Plain SDRAM is however still available at quite high densities and fairly cheap, but it has all the disadvantages of the other types combined: it needs a controller, which is more complicated than for regular DRAM, and in order to get good speed it has to be operated at 2x the CPU frequency (this in itself is no problem, the slowest available parts are 100MHz), which is what makes the controller even more complicated - we are talking serious CPLD work here. AND it's all 3.3V - logic level translation is a must.
So it looks like rock, hard place, wall... or something like that.

Regardless of the memory type used, 16M can become a squeeze, especially if one is planning for a 32-bit video extension.

The second major consideration is what expansions (if any) are expected. This would define what options are to be implemented on the expansion connector, if there is one.
The size of the memory map is also relevant here, as any peripherals on an expansion bus must be given an address space to work in.

We can use the SGC as an example, and keep in mind it only uses about 4.5M bytes of the 16M total - and in our case we are aiming for more.
Of the 4.5M used, 4M is RAM, with a bit at the beginning that can't be used because it holds the usual QL bits such as the ROM, EPROM slot, extra code copied from the SGC EPROM, the IO register area, and the screen (actually two of them). Shadowing of the second screen can be turned off for a little speed boost - if only one screen is used, the second screen area holds the system variables and tables, which are very frequently accessed, so speeding that up lowers the system overhead. All the rest up to the end of the 4th megabyte is RAM.
There is an IO area at C00000h, which is the 12th megabyte (counting from zero!), but only parts of it are actually implemented. The top 256k at CC0000 is the actual IO area (on the original it was at C0000), and because the old QL bus has only 20 address lines, it actually appears as C0000h on the QL bus. The Aurora completely uses it up for the video RAM.
There is also a special area, IIRC 128k in size, at C00000h (I am writing this from memory, I could be wrong...) and this is where the SGC ROM can be read, where the on-board hardware IO locations are, and also where the first 64k of the QL's address map is mirrored (right at C00000..C0FFFFh) - not the RAM copy, but the actual thing. As far as I know you can only read from these addresses, which is a pity, as writing would make it possible to put a Flash ROM into the Aurora ROM socket and program it in-circuit, implementing a sort of mini-ROMdisq (maximum size 512k). As it is, the actual ROM can be read here, as well as the ROM slot.
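Summing up the map so far as constants (my reading of the above; the special area details are from memory, as noted, so treat them as unverified):

Code:
/* SGC-style 16M memory map, as described above */
enum {
    QL_ROM       = 0x000000,  /* QL ROM, EPROM slot, etc.           */
    QL_ROM_SLOT  = 0x00C000,  /* ROM slot, 16k                      */
    QL_IO_AREA   = 0x018000,  /* on-board IO area, 16k              */
    SGC_RAM_TOP  = 0x400000,  /* RAM up to the end of megabyte 4    */
    SGC_SPECIAL  = 0xC00000,  /* SGC ROM readback + 64k mirror      */
    SGC_IO_AREA  = 0xCC0000   /* 256k IO area; C0000h on the QL bus */
};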

Assuming the proposed 020 system has an expansion bus (and I will try to show it actually has to have one internally anyway), allocating an area of the memory map for it is essential, but it would be prudent to put it right at the end of the 16M available, so more is left for RAM.

Here are some ideas on what the requirements might be:
- We need a place for a Flash ROM or similar, to store the firmware (OS, drivers for extra on-board hardware, boot and initialization code that has to run before the OS, or patches for the OS). At the very least 128k is needed, preferably more. It should be noted that this hardware is basically needed only at boot, when the contents have to be copied into RAM, or when the Flash has to be updated. So, some memory map savings can be had if a mapping mechanism is implemented which temporarily switches out some other hardware and switches in the flash when needed. In fact, a much larger flash can be used if a paging mechanism is also implemented.
- In theory you could connect an Aurora to this system, or at least something similar, like a clone that transplants its chips onto a board that only does the extended graphics. Using that as a guide, one would expect at least 256k of memory space available on the expansion bus, mimicking the SGC except at a more convenient address. That being said, adding fast DTACK and RAM shadowing capabilities would significantly increase the feature set.
- Some addresses are traditionally available on the expansion bus in existing systems, such as the ROM slot (00C000..00FFFFh), and QL on-board IO area (018000..01BFFFh).
- On the high end, one might look at graphics capability close to or equal to the Q40/Q60. This requires about 1M of address space, and would need access to the full 32-bit bus, or it will simply be far too slow. However, this should ideally be a 'shadow' of some existing 1M of RAM for speed, and therefore actually occupies the same address space as RAM. Obviously, when it's used, Aurora type graphics would not be needed any more, except maybe for development purposes; in any case, if the Aurora video RAM has shadowing capability, it would not be used at the same time as this 'high spec' graphics board.

How about a possible implementation?
One could use the last meg at F00000h to implement 4 256k blocks: one for the Flash chip, and 3 aliases of a QL/SGC-like IO space - one with the usual behaviour, one with maximum bus speed, and one with maximum speed and RAM shadowing.
Another possibility, which uses less of the address space (2 256k blocks), would be an extended implementation of the SGC idea: one 256k block containing the flash (perhaps with bank switching so a large one can be used - 64 or 128k used for that, the rest then free for peripherals), and a second 256k block that mimics the QL's/SGC's IO area, but with options to switch in memory shadowing and/or fast transfer support.
In any case some extra address lines (on FC0..2, E) could be used to distinguish the various aliases, so that clever peripherals to come can make use of them and map themselves only where needed, leaving the other aliases free.
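To illustrate the first variant, this is how little decoding it takes - a sketch only, the block order and names are my own invention, not a finalized map:

Code:
#include <stdint.h>

/* The last megabyte of the 16M map, split into four 256k blocks:
   one for the boot/firmware flash, plus three aliases of the same
   QL/SGC-like IO space with different access properties. */
enum block { BLK_FLASH, BLK_IO_STD, BLK_IO_FAST, BLK_IO_SHADOW, BLK_NONE };

static enum block decode(uint32_t a)
{
    if ((a & 0xF00000u) != 0xF00000u)
        return BLK_NONE;                /* not in the last meg */
    return (enum block)((a >> 18) & 3); /* A19..A18 pick the 256k block */
}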

The next thing to think about is what happens when the system accesses addresses that traditionally appear on the bus and relate to bits and pieces of the original QL's hardware. In particular, the ROM slot addresses at 00C000..00FFFFh must appear on the bus if you want to use a Qubide set to the ROM port address. But then, it's a moot point if it, or a version of it, is already on board - and this is something that should be considered for various reasons, among others because IDE is natively 16-bit, so it would be really odd to force the 020 to access it as an 8-bit peripheral only to have the 8 bits converted back to 16 - all while not only a 16-bit but a 32-bit bus is available.
Then there is the old QL on-board IO area, at 018000..01BFFFh. Aurora/QIMI uses addresses 01BF00..01BFFFh, Aurora/QL (ZX8301/2) uses 018000..0180FFh. The rest is unused, but it's an area ideally suited for actual IO devices and chips either on or off-board, because with a bit of help from the motherboard it's easy to decode and does not use up anything that could be used in better ways - most IO devices only need a few locations. More on this below.
Other candidates would be 010000..017FFFh, which Minerva supports as alternative ROM addresses (it looks for ROM headers at 010000h and 014000h), and possibly 01C000..01FFFFh.
These are perhaps best kept as write protected RAM. On the GC and SGC they are used as a RAM copy of (parts of) the SGC ROM, and in any case, in order to run all the extra bits on this new 020 board, copying the drivers to RAM is a good idea. Since these addresses were not used at all on the original QL (they contain aliases of 18000..1BFFFh there), they cannot normally be used as system RAM; at best they could appear on the expansion bus and be used by a non-standard IO or ROM board - quite superfluous considering there must already be a large Flash on board to contain an image of a QL ROM and the extra software. So, why not make this area available as fast write protected RAM for ROM emulation, so OS extensions do not use up system RAM.

And now, the HOWEVER bit :).

There is an important consideration that gets the required signals to the bus anyway, and this is buffering. In order to interface both slow and fast stuff to the 020, you REALLY want to keep their respective bus signals separate. For one thing, keeping the 020 bus lightly (and uniformly) loaded has very desirable consequences in the form of speed and signal integrity. 8-bit stuff must be connected to CPU data lines D24..D31, and a bunch of it requires a buffer anyway, because it would heavily load that part of the bus while the other 24 bits would see a lighter load. It's not only a consideration for the CPU: everything on the 32-bit bus driving that part of the bus when the CPU reads from it also has to drive all the 8-bit stuff connected to it. You need to buffer the address lines too, because all transactions for anything on the bus (much faster than QL standard!) would be seen on these address lines, so the CPU would be driving everything at the highest speed even if it can't take it. To prevent this, you need buffers that are only active when the 8-bit bus is accessed, making all the 8-bit stuff look like one single device to the CPU. In short, the 8-bit side requires signal buffers anyway - the same ones the QL expansion bus connector would require.

What this boils down to is that you end up using something 99% similar to the QL bus (buffered from the CPU) to connect the various 8-bit peripherals you have on board. Adding a connector to this existing bus is a no-brainer, and in fact you are then looking at all of your on-board peripherals (and I mean actual IO hardware like floppy controllers, ZX8301/2 etc, not extra ROMs) as peripherals on the QL bus - it's just that part of it is implemented 'on-board'. And, since on-board you may have extra signals available that do not normally appear on the QL bus, you might implement some special properties for some of the on-board peripherals, such as higher bus speed, or even 16-bit access (for IDE).

This brings us back to the GF spec. The exact same approach was to be used on the GF - because most of the peripheral chips were 8-bit (or could work from an 8-bit bus) and needed 5V to 3V logic conversion, as well as handling of 32 to 8 bit bus size conversion, which the QL bus interface CPLD already did, they were connected to the QL side of the bus. This also severely reduces the number of long lines that must be routed to the actual peripherals: 8 bits of data and a relatively small number of address lines are enough. The actual addresses used fall within the original QL on-board IO area at 018000..01BFFFh. The GF would suppress DSL generation when these addresses were used, so they were effectively invisible to other things on the bus, and it would also generate a special internal DTACKL for them, for optimum access speed.
The same thing can be done for this 020 board, except it could actually use DSL and implement DSMCL, so external hardware intended to replace (and hopefully enhance!) on-board stuff could do so by using DSMCL as usual. Internal DTACKL generation is still possible, so one could optimize speed (very interesting for IDE or ethernet). Implementing this would then require QL on-board IO area accesses to appear on the expansion connector.

Implementing a 32-bit bus interface on a 020 board would be a chore using discrete chips, but it's not impossible. However, the only thing one could realistically expect to use on it is a graphics card, and for this it only makes sense to put it in the last meg of RAM (whatever that turns out to be, depending on what parts of the address map are taken up by the expansion connector stuff), where it would 'replace' or, even better, shadow existing RAM for writes only. Whatever this RAM will be, more than 1M is hardly a good idea, and only if the rest of the 16 available megs is used as RAM too - otherwise, RAM will become short. Although 16 bits per pixel is very attractive, there is the issue of non-standard resolutions, which are a huge headache on LCD monitors - and today, there is no other kind left. A 256 color mode could run in 1024x576, which is a wide-screen standard, or 1024x768, which is a regular 4:3 PC standard. Supporting wide screen resolutions is a serious consideration, as it's now nearly impossible to find a 4:3 LCD for a sane price.
In any case, providing some sort of special connector for 32-bit graphics is a better idea than designing an actual 32-bit bus interface which first multiplexes the bus to pass through the IO connector, only to have it de-multiplexed for the actual hardware on the other side. A sort-of 32-bit version of DSMCL could be implemented on this connector so anything 32-bit on it could disable on-board 32-bit hardware if needed.

To do:
Some considerations regarding extra peripherals that were to be used on the GF - interrupts and the like; multiprocessing (easy - the point is why and how); and ideas on Flash/ROM management and peripheral stuff that came up while designing the GF.
Since it's so late it's actually early, I will leave that for the next post...


Nasta
Gold Card
Posts: 463
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Fun things to do with an MC68EC020....

Post by Nasta »

When the GF was designed, it was already clear that one would have to wait for some possible future ColdFire CPU that would put back all that was taken out of the 68k architecture to make a ColdFire. In particular, the debacle of the MCF5401 ColdFire MK1, which was actually a 68EC040 with a multiplexed bus, clearly showed the future - this was the last true 68k CPU. That left only the 68060 as the logical upgrade path. Even though the large package is a problem because of the limited space available for a QL expansion card, the 68060 actually offered a very welcome feature: rather low power consumption. But, looking into the future, this was to be it, for a long time. Because of this, I spent a lot of time thinking of ways to get the most performance out of the hardware, and one part of the effort was to look into the way existing peripherals worked.

In particular, one annoying thing with QL peripherals such as floppy controllers and, to a lesser extent, hard disc controllers, is that when they transfer data, the QL simply STOPS. With the old 68008 this was not so strange, as no form of DMA or other advanced data transfer hardware was implemented (such hardware can be a rather big problem when it comes to representing it as a fairly universal resource in a multitasking system!), and the CPU was too slow to effectively use interrupt driven IO to transfer data. As time went by, the drivers stayed the same even though faster CPUs could at least in principle support background data transfers, and the QL bus remained the same - a built-in bottleneck regardless of CPU speed.

As time further went by, the chips that implemented the peripherals we were all used to became obsolete, but at least when the GF was designed they could be replaced by chips used in then-common PCs, which still implemented the PC ISA bus protocol. The particular one chosen was rather clever, implementing a floppy controller, serial ports (MIDI compatible!), PS2 mouse and keyboard interfaces, non-volatile CMOS RAM and an RTC. The interesting thing was the floppy controller, because it was compatible with the one used on the GC and SGC, but it also implemented a FIFO buffer in its data path, so you did not have to service it whenever a single byte was to be read or written to the floppy, but rather only every 16 bytes at most. Because the overhead of transferring one byte of data to such a peripheral is easily an order of magnitude larger than the actual transfer, reducing the frequency at which it happens by more than an order of magnitude is a very welcome thing. The serial ports also implemented FIFOs and in fact relied on quick interrupt response to work properly (anyone who has ever attempted to use a PC serial port knows the importance of this and how badly it's implemented on the PC - 'fortunately' they removed the serial ports from modern PCs without ever solving this problem properly). In addition, the GF added a sound chip with MIDI (also with FIFOs) and an ethernet chip (this one actually had a built-in RAM buffer that operates similarly to a FIFO). It became obvious that 'a bit' more than just the EXTINTL external interrupt pin on the bus would be needed to get this working to its full extent.

Fortunately, some of this work had been done before, on the Q40/60. It implemented a fast poll interrupt (20kHz IIRC!) and it showed that the CPU could handle this approach with clever programming. And since the programming was after all done by Tony Tebby, you knew it was REALLY clever. The purpose of a fast poll interrupt is to support hardware that requires very quick but fairly simple interrupt processing - for instance, emptying or filling a FIFO, with a small overhead to test if it needs to be done at all. Interrupts on the QL did give some problems, as there were bugs in the servicing routines, so really only one level was truly usable out of the 3 available (the top level being NMI, the non-maskable interrupt). Every 68k CPU except the 68008 in a DIL case implements the full 7 levels, and so does the 68060 - and the 020, of course. On the GF I decided to use the extra levels for special things, and this was actually intended to double the existing interrupt structure, so that the usual slow poll (20ms) and external interrupt got 'fast' versions in the form of a fast poll interrupt and a fast external interrupt. A pin on the QL expansion bus was re-defined as the fast external interrupt; all the rest were only visible internally. The fast poll interrupt was not as fast as that of the Q40, because its frequency was not relevant to sound generation as it is on the Q40; however, the sound chip used could still be set up for compatibility with QLSSS. NMI was left unused (and could be generated from outside), and the level just below it was used for one more completely new feature: multiprocessing.
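Written out as one possible assignment of the seven 68k levels (the exact numbers here are my guess for illustration, not the actual GF allocation):

Code:
/* one plausible doubling-up of the QL interrupt structure */
enum irq_level {
    IRQ_SLOW_POLL = 2,   /* classic 20ms poll                   */
    IRQ_EXTINT    = 3,   /* classic EXTINTL external interrupt  */
    IRQ_FAST_POLL = 4,   /* new: fast poll, e.g. FIFO servicing */
    IRQ_FAST_EXT  = 5,   /* new: fast external interrupt pin    */
    IRQ_MP        = 6,   /* inter-CPU (multiprocessing) event   */
    IRQ_NMI       = 7    /* left unused, can come from outside  */
};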
Because PC ISA Plug And Play compatible IO chips were used, it was possible to route the various interrupt outputs of the chips to various interrupt levels in software. This feature is also extremely important for multiprocessor setups - again, more on this later.
In summary, new and faster peripherals needed a new approach to interfacing them to the CPU and especially to writing drivers - the universal problem on the QL. The way it was implemented made it possible for the 'old ways' to be used initially, but if one really wanted to get the most out of it, things had to be done differently - which should really have been no big problem, as drivers for something like Ethernet had never been written before anyway, so there were no compatibility issues.

The whole GF IO subsystem could easily be used on the 020 board, and this includes at least some of the interrupt resources. At one point I actually contemplated making a separate board with it, to be connected to a regular QL so that drivers could be written and tested before the GF was ever made. The actual chips, as I already mentioned, use a small part of the QL on-board IO area.

Before I go to the last section, the 'big one' - multiprocessing - let me come back to some ideas that came out of designing the GF, for revisions of older hardware.

Qubide II

This was basically the logic of the old Qubide plus a drive expander, all implemented in a single CPLD, adding bus termination to the IDE and QL sides. It was to support 3 drives, one of which could be an on-board CF (not hot swappable) or an on-board 2.5" hard drive, one was a hot-swap capable CF port, and one was a standard IDE port (the intention was to use it for a CD ROM drive or similar).
This was fairly easy to do as the Qubide has a provision to use an external decoder to decode up to 8 drive pairs, so a decoder for 3 was built-in.
However, it had another interesting feature, which I then kicked myself for not thinking of when I did the original Qubide - and the reason was that I did not know the structure of the QL on-board IO area. Not knowing that only a small portion of it was used by the QL and existing expansions, I actually specifically warned against setting the Qubide address to this area.
The Qubide behaves as a normal QL peripheral and it has a ROM which contains the driver. However, because 16k of addresses is allocated to each QL peripheral, a decoder implemented in a GAL puts the actual IDE control registers (these are physically located on the drive itself!) in a 256 byte block at the end of this 16k area, so the last 256 bytes of the ROM cannot be used to store the driver. Originally another part was omitted to make it possible to use the Miracle hard disc, the idea being to make it possible for its users to copy their data to an IDE hard drive by connecting both the Miracle hard drive and the Qubide (the latter set to the address of the ROM slot). This was later removed for Qubide 2.0, and only the bare minimum of 256 bytes was taken away from the total ROM capacity for the IDE registers.
Had I known the on-board IO area of the QL was so sparsely populated, I would have used 256 bytes of that for the IDE registers (at a fixed address) and had the full 16k available. Also, there would have been fewer jumpers - possibly just one, to select the ROM port area or the regular expansion area.

However, thinking of how the GF would handle an 8-bit flash ROM with firmware at boot time, an idea occurred to me. Because the Qubide is an expansion bus peripheral, it has access to the DSMCL line, and through it it can disable the ROM slot and map itself into those addresses instead - in fact, when the Qubide base address is set to the ROM slot, this is exactly what happens. What had not occurred to me before is that this could be a temporary condition - long enough for the contents of the Qubide ROM to be copied to RAM, which is what it does anyway whenever it detects it's running on a GC or SGC, because it then runs much faster.
Consider this scenario, which is how the Qubide II would work:
1) At reset, it maps itself into the addresses of the ROM slot, disabling it.
2) The system recognises it by the ROM header and calls the ROM initialization routine.
3) This routine copies the driver code into RAM. At this point it can use some sort of bank switching mechanism to access a ROM (or Flash) larger than the 16k of the memory map it currently occupies. It does so by using a register located at a fixed address within the QL's on-board IO area, which was previously not used.
4) When it's done, it jumps to the code it's just copied to RAM and continues executing there, linking in the driver, detecting drives etc as usual. You would see the usual Qubide startup text etc...
5) The driver copy in RAM then again uses the ROM (or Flash) paging register to completely remove the Qubide ROM from the ROM port address, which re-enables the actual ROM slot.
6) Finally, the driver copy in RAM mimics the ROM slot detection routine, looking for a ROM there, and if found does exactly what the OS would do: jumps to the init code, which links in the ROM if it's present.
7) When the init code above is finished, control is returned to the OS, and you have a working Qubide AND the ROM slot (whatever hardware is in it) - even on a GC or TC.
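Steps 3 and 5 are easy to picture in code. A sketch, assuming a 16k page window at the ROM slot and a write-only page register somewhere in the unused part of the on-board IO area - the register address, its layout and the 'page out' value are all hypothetical:

Code:
#include <stdint.h>
#include <string.h>

#define PAGE_REG   (*(volatile uint8_t *)0x018100) /* hypothetical register */
#define ROM_WINDOW ((const uint8_t *)0x00C000)     /* ROM slot, 16k window  */
#define PAGE_SIZE  0x4000u
#define PAGE_OUT   0xFFu   /* hypothetical 'remove from map' value */

/* copy 'pages' 16k pages of the flash into RAM, then page the
   flash out so the real ROM slot reappears (steps 3 and 5) */
static void copy_driver(uint8_t *ram_dest, unsigned pages)
{
    for (unsigned p = 0; p < pages; p++) {
        PAGE_REG = (uint8_t)p;         /* select 16k flash page */
        memcpy(ram_dest + p * PAGE_SIZE, ROM_WINDOW, PAGE_SIZE);
    }
    PAGE_REG = PAGE_OUT;               /* step 5: flash gone, ROM slot back */
}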

Because EPROMs went out of fashion and Flash ROM became cheap, the Qubide II was to use a Flash ROM - 512k bytes in fact, as this was the most cost-efficient size. It could provide the space for many ROM images, even something like a ROM-able SMSQ/E. It could also be reprogrammed in-circuit - because the Qubide is an expansion bus peripheral, it has access to the RDWL line (read/write control), which is missing on the ROM slot (to the dismay of many a hardware developer!), so both reading and writing are possible even though ROM slot addresses are used. And, because a large Flash is used, it has to be paged into the available 16k, 16k at a time. The consequence of the paging mechanism is that the Flash can also be completely 'paged out' of the memory map, returning ROM slot functionality.

At some point I actually thought of an EPROM manager or ROM slot manager based on the extended ROM slot on the Aurora, which added a read-write signal and an extra chip select that tells the hardware on it there is an access to the QL on-board IO area. The Aurora also does a complete decode of the on-board IO area, so there are no aliases of anything in the previously unused parts, which makes it possible to map simpler peripheral controllers and IO chips into locations there without the need to implement DSMCL, DTACKL and complete address decoding like on a full expansion bus card. The EXTINTL interrupt pin was also provided. The idea was to have each peripheral use a 256 byte area in the on-board IO space (64 in total fit into the 16k available, the first and last being used by QIMI, ZX8301/2 or Aurora) - more than enough for 99.9% of IO needs. This would have made simple peripherals very easy to implement, and in fact it may be one thing that could be included on the 020 board.
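As a sketch of how cheap the decoding gets with this scheme (C standing in for the logic, names mine):

Code:
#include <stdint.h>

/* 64 slots of 256 bytes inside the on-board IO area 018000..01BFFFh;
   returns -1 outside it. Slot 0 (ZX8301/2) and slot 63 (QIMI) are
   already spoken for. */
static int io_slot(uint32_t a)
{
    if ((a & 0xFFC000u) != 0x018000u)
        return -1;
    return (int)((a >> 8) & 63);
}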

The Aurora itself has a provision to use a large EPROM with paging; in its case 32k pages are used and the ROM appears as the usual QL ROM (the first 32k are paged, 16 pages total, for a capacity of up to 512k bytes). This was intended for a ROM-able version of SMSQ/E, but it never materialized. The 020 board would already contain a super-set of this functionality, as it's required for proper booting.


Nasta
Gold Card
Posts: 463
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Fun things to do with an MC68EC020....

Post by Nasta »

Finally, there is the most 'out there' feature of the GF - the second CPU.
One could rightly say there is a huge problem getting the programming done for the first one, let alone the second. But the potential offered is very high, for the price of another CPU socket and some PCB routing.

The basic idea was for both CPUs to share a single external bus, using their bus arbitration lines. A small piece of logic was used to ensure that if both CPUs wanted the bus at the same time, they would get it alternating on an access-by-access basis, so equal time.
This was actually quite easy to implement with the 060 CPU because it acts quite differently from, say, a 68000. The latter assumes it starts off 'owning' the bus, and if something else wants it, it has to 'take it away' and assume ownership. The 060 in turn assumes it does not own the bus, and initially sends a request to external hardware, which then grants the bus for as long as it sees fit. In a single CPU system, the bus request signal is connected to the bus grant signal, so whenever the 060 requests the bus it 'grants it to itself' and does what it needs to do. Implementing this with the 020 CPU is a bit different, but not too complicated.

Why would we want 2 CPUs to share a bus?
Well, let's start small. For one thing, QL hardware has no means of data transfer within the system (IO devices to RAM and back, RAM to RAM, etc) except by executing a program specifically for that purpose. Ordinarily such pieces of code are executed as parts of a device driver, sometimes under interrupt control. In some cases they are invoked by other events in the system - one example would be when the user presses CTRL-C and switches between programs, and WMAN restores the program's windows from RAM. When data transfer of this kind is needed, the CPU usually has to interrupt normal program execution and 'service the event'; how it actually does this depends on the event and what generated it. This requires an overhead. Servicing the actual event might require execution of many instructions in a sequence that must not be interrupted by the servicing of other events.

In systems with more complex hardware, there is often a dedicated piece of hardware called a DMA controller, which can be set up by the CPU to automatically transfer data 'in the background' without actually interrupting the CPU - only slowing it down while the controller executes its own accesses to the required hardware. Although this piece of hardware can be quite complicated, its basic operation is to transfer a block of data from one location in the memory map to another - with twists like address incrementing for source, destination or both, setting up the number of bytes to be transferred, and finally generating an interrupt when finished. The complexity comes from how many such transfers the device can do in parallel. Although in itself DMA does not improve data transfer speed, as this is usually limited by where the data must come from or go to (a port, disc drive, etc), it removes the CPU overhead associated with the instructions needed to implement the transfer, and with interrupting the CPU to start executing them, then again when the transfer is finished, to continue doing what was interrupted. In some cases the CPU is not interrupted at all, in others only for a short time to tell the DMA controller to 'start working' - all in all it uses up very little of its time to set up the transfer. At that point the DMA controller takes over: its own overhead needed to generate the addresses, read the data, write it elsewhere etc happens within it and is not visible to the CPU - the CPU only 'feels' the actual transfer on the bus, since it is stopped from accessing the bus while it's used by the DMA controller. However, the stopping is handled by hardware, many times faster than, say, an interrupt - and for clever CPUs it does not impede internal operations of the CPU itself, so in essence the CPU may continue running in parallel with DMA most of the time. In any case, if the DMA is limited to an equal share of the bus with the CPU itself, the maximum slowdown for the CPU is to half speed, and the minimum guaranteed speed of the DMA is again 50% of the maximum attainable - or as fast as the IO devices can handle data, whichever is slower.
In many cases, adding some sort of DMA feature can make fast IO devices - which would otherwise have to take complete control of the CPU to be handled in time - work in the background.
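For readers who have never met one, this is roughly what 'setting up' a DMA channel looks like from the CPU side - a generic sketch, no real chip implied, all register names made up:

Code:
#include <stdint.h>

struct dma_channel {            /* a typical memory-mapped register layout */
    volatile uint32_t src;      /* source address       */
    volatile uint32_t dst;      /* destination address  */
    volatile uint32_t count;    /* bytes to move        */
    volatile uint32_t ctrl;     /* go / inc / irq bits  */
};

#define DMA_GO      (1u << 0)
#define DMA_INC_SRC (1u << 1)
#define DMA_INC_DST (1u << 2)
#define DMA_IRQ_END (1u << 3)

static void dma_start(struct dma_channel *ch,
                      uint32_t from, uint32_t to, uint32_t n)
{
    ch->src   = from;
    ch->dst   = to;
    ch->count = n;
    ch->ctrl  = DMA_GO | DMA_INC_SRC | DMA_INC_DST | DMA_IRQ_END;
    /* the CPU carries on from here; the controller steals bus cycles
       and raises an interrupt when count reaches zero */
}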

Now, one problem with DMA is that it's a limited resource and not expandable - at least not easily. You get as many DMA 'channels' as you get, and that's it - usually a single digit number, often 4. And they tend to be allocated to specific hardware. It's very difficult to make this a 'universal' resource.
On the other hand, everything a DMA controller does is doable by a CPU, except that a CPU can be programmed to do it much more intelligently, if more slowly. For instance, a system might have several devices which need periodic checking to see if they have data ready or need more data sent to them. Checking for such conditions is usually very simple: it involves checking the status of certain bits at an address that belongs to an IO device - its status register, to be exact. Thereafter, if the correct bits are set, you take a certain number of bytes from yet another location (some sort of data register) into memory, or put a certain number of bytes from an address in memory into that location; if the combination of bits is not correct, you either do nothing or report an error in some way. This sort of thing can be coded quite simply, and in fact needs to be simple because it should execute quickly - the overhead of starting the code, setting up parameters and addresses, and checking status should be relatively small compared to the actual data transfer. This type of code is normally a part of the device driver, referred to as the hardware level task (note: NOT job, which is QDOS/SMSQ speak for a program - a task is something the OS does as a response to an event, in this case hardware needing to be serviced). In the example above, a particular task is invoked by polling, i.e. periodically checking the hardware to determine if it needs anything done. Another way would be for the hardware to directly cause an interrupt, which makes the CPU interrupt what it was doing (hence the name) and jump to code that services the hardware - so in theory this has a faster response time. In any case, if a CPU is doing this, lists of tasks can be made that are scanned periodically, with each task in the list handling its own hardware; or lists of task interrupts can be built that get scanned each time a particular level of interrupt is generated, each task then checking if its hardware needs to be serviced. A DMA controller in most cases cannot even approach this level of sophistication, although as DMA controllers go, the one from the 68k family is quite clever. In any case, what I have described is a 'task list'.
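A task list of this kind is a very simple data structure. A minimal sketch (field names are mine):

Code:
#include <stdbool.h>
#include <stddef.h>

struct task {
    struct task *next;
    bool (*ready)(void *hw);    /* reads the status register  */
    void (*service)(void *hw);  /* moves the waiting data     */
    void *hw;                   /* the device this task owns  */
};

/* run from the (fast) poll interrupt */
static void scan_tasks(struct task *list)
{
    for (struct task *t = list; t != NULL; t = t->next)
        if (t->ready(t->hw))    /* cheap check first...        */
            t->service(t->hw);  /* ...transfer only if needed  */
}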

Now, a finer point on this: tasks are NOT jobs. This is one distinction that appears nowhere in the QDOS/SMSQ documentation, which assumes the difference is known.
Tasks, as I explained, are programs that are started as a response to an event, i.e. reactions to events. They can be quite complex, but in general they are fairly simple and do the 'low level' stuff of hardware control. For instance, a device driver might set up a list of sectors that need to be transferred from a hard drive into RAM, for instance as a result of an application requesting a file to be loaded. For each sector holding a part of the file, the hardware has to be told to fetch that sector into its buffer, then the actual sector data has to be transferred from the buffer to an area of RAM, and so on for every sector, filling the part of RAM allocated for the file. However, the actual hardware governs the data transfer once it's been set up - the hardware signals that it has the data in its buffer, and this is an event that invokes the task to move the contents of the buffer into their appropriate place in RAM. The application has no idea about this and continues only once it's told the file has been loaded.
Jobs, on the other hand, are in 'QL speak' what is commonly referred to as 'applications'. These are programs that do the actual useful work with the data the tasks move around. Jobs execute under the control of two things: results of task completion, which can also be events in their own right (basically this determines data availability, ultimately determined by user input), and the available time allotted, which is shared among other jobs according to a priority scheme.
Unlike jobs, tasks may have priorities which are inherent - higher speed devices have higher priority. However, this is not usually user controlled. Neither can the user start or stop tasks directly - they are entirely driven by events from the hardware, and in a way their execution is a 'property' of the OS itself.
Jobs, however, are under complete user control, either through user input or by the user starting and stopping them at will, or setting their priority at will. Tasks and jobs pass data to each other through various means, which normally all boil down to areas in RAM. In the case of the QL, these can be raw data or code (as used by the LRESPR, EX, SAVE etc commands), file sectors (usually buffered in memory as slave blocks), 'pipes', which are FIFO buffers maintained in software, or other 'mailbox' type structures in RAM - the OS provides resources like the first three mentioned, but drivers can implement their own means. On the job end of the 'communication' is the 'application level', which is accessed through OS traps; on the task end, it's the hardware level, which is accessed as a result of a hardware event, such as an interrupt.

QDOS/SMSQ has a strict, although somewhat implied, separation of jobs and tasks, which is always necessary in real-time OS design. In contrast, unix-based OSes often use jobs as tasks, setting up small jobs called daemons to run at very high priority, polling hardware and servicing it if necessary - attempting to do the work of tasks. The problem here is that the concepts of priority and speed of execution have entirely different meanings for applications (jobs) and event handlers (tasks). QDOS wins here, but it's a pity there never was a 'concept and design' section in its docs; it would have made things much clearer.

The reason why I went to all this trouble to explain jobs and tasks is to be able to explain one of the ways 'things to do in an OS' can be distributed among two CPUs. When there is a clear enough job/task distinction, one CPU can handle the first, the other the second. In actuality, the job handling CPU does handle some tasks - at least one, in fact: the event that causes time slicing between jobs, the polling loop. And even that event can be partially off-loaded to the other CPU. So, while with DMA hardware a part of task processing is handled by it, with a whole other CPU things can be far more flexible, and in accordance with the way the OS usually works. Unlike DMA channels, which are limited in number, adding data transfer tasks to a list is only limited by available RAM and the speed the CPU handling them can attain, and quite a bit of sophistication is possible if, say, there are multiple lists organized by priority, so data transfer tasks are linked into them according to required transfer speeds.

Then there are tasks which are much more complex than simple data transfers. One good example would be handling background windows. A list of tasks that runs as a response to a system event can be set up, with the tasks being a bit more intelligent than mere shuffling of bytes, in that they understand window bitmap organization. Not only can a second CPU periodically refresh windows that are only partially visible from their 'real time' bitmap in RAM; going one step further would be actually implementing drawing functions on the second CPU. Many very clever things can be done with the same exact hardware - because it's a whole CPU, its function is entirely controlled by software, which we can make whatever we like.

Several things are needed in the hardware to implement such functionality:
1) A mechanism to arbitrate bus access between CPUs.
2) A mechanism for the 'main' CPU (the one running the OS in the usual manner) to initialize and start the second CPU under software control.
3) A mechanism for each CPU to cause events on the other one (this is used to signal the task CPU that something like a list or initial data or configuration is set up or a status for something is required by the main CPU, or that the task CPU has finished some task list or other operation).
4) A flexible interrupt structure so that each CPU can receive only those events from the hardware that it needs. The main (job) CPU would be closer to the old QL in this respect having a periodic (poll) interrupt for time slicing, and an interrupt outlined in (3). The task CPU would probably have more extensive interrupt sources, as these normally are parts of hardware it has to handle, as well as a slow and fast poll interrupt for polled tasks.
5) An area in RAM where the CPUs can communicate safely and in lockstep as needed (see the sketch just below). This is easy with the 020, more complex with CPUs that implement a data cache, as such communication requires that the RAM used for it always holds an actual updated copy of the data. On the GF this was handled by an alias of the entire RAM for which data caching is disabled.
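As promised in point 5, here is what the simplest form of such communication looks like - a one-way command mailbox in the uncached alias, with a completely hypothetical address and layout:

Code:
#include <stdint.h>

struct mailbox {
    volatile uint32_t cmd;   /* written by the main (job) CPU */
    volatile uint32_t done;  /* written by the task CPU       */
};

/* some fixed location inside the non-cached RAM alias (hypothetical) */
#define MBOX ((struct mailbox *)0x00F00000)

static void main_cpu_post(uint32_t cmd)
{
    MBOX->done = 0;
    MBOX->cmd  = cmd;
    /* here the main CPU would raise the inter-CPU event (point 3) */
    while (MBOX->done == 0)
        ;   /* the task CPU sets 'done' when it has handled the command */
}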

Some advanced systems use something called cache snooping, which lets the cache of a CPU that has been taken off the bus be treated as RAM, in order to automatically have the cache represent a real copy of RAM data written by the other CPU - which solves the cache integrity problem. The 020 does not have such a mode of operation, and since it has no data cache, things like loading code for the other CPU into RAM, and the code cache issues tied to that, are handled in software (the main/job CPU always loads code for the task CPU into RAM, stopping the task CPU and invalidating the contents of its cache before letting it execute the new code).

The ultimate step in multiprocessing on the QL would actually be real multiprocessing, where job execution happens on both CPUs. However, most systems that do this - and any QL OS extended to add this capability would be no exception - do dedicate some things to one CPU and others to the other, and this has to do with maintaining the various system tables and lists. Often one CPU (the primary) handles all the OS management, like memory allocation, job set-up and priority management, etc, while the other (secondary) handles tasks, while both execute jobs in the 'time remaining to them'. This is because it's FAR easier to have one CPU maintain complex data structures like job tables, task lists, memory allocation tables, slave blocks etc, than to reliably have two of them access these 'at the same time'. And despite the practical non-existence of QL system programmers, expanding a QL OS to do such a thing is far from impossible. Actually, although the QL OSes in their variety have various unpolished bits, a concept implementation or two that need revamping, and some compatibility issues that bog them down, they are superbly thought out, with quite clear limits on what each part does, how, and why exactly that division. The 'processing model' used by the QL is sufficiently delimited in its constituent elements that it's not too difficult to give some elements to one CPU and others to another - possibly even to more than two.
The actual hardware requirements are low enough to implement for next to no cost - even if never used.


User avatar
vanpeebles
Commissario Pebbli
Posts: 2854
Joined: Sat Nov 20, 2010 7:13 pm
Location: North East UK

Re: Fun things to do with an MC68EC020....

Post by vanpeebles »

Nasta your posts are amazing, each one is like a university assignment. Thank you for taking the time to post them :ugeek:

