C-one

Introducing JRISC, the new early startup processor
by Jens Schönfeld

Note: Jeri Ellsworth has abandoned JRISC. This information stays online only to keep the overview of the development history complete.

Here's the long-awaited news update after many months of intense work on the C-One. We were so close to releasing the boards with a working C64 core, but there was something that kept us from completing the early startup procedure: The size of the design. It simply did not fit in the "small" FPGA, the 1K30 that launches the whole system.

Many weeks of optimizing freed a few logic cells, but did not make the design fit. Taking a bigger FPGA was not an option, because the boards had already been produced a long time ago. Exchanging the FPGAs on 300 boards (some of which are already spread all around the globe at the developers' sites) would have been too costly, and too risky for those who want to do it on their own. Exchanging a 208-pin QFP package is something you can do at home, but you should have lots of experience and the right tools.

We dropped the idea of a bigger FPGA. This would have been the Wintel way of doing things: If it does not work, use a bigger/faster computer. We're Commodore people, and that means we're taking hardware restrictions as a challenge, not as a limit of our brains.

The lion's share of logic cells was eaten up by the 6502 processor, AKA the drive CPU. Although its cut-down version with no BCD support is 50% smaller than any other implementation you can find on the net, it was still too big to fit alongside the DRAM controller, the video controller, keyboard controller, floppy controller, audio engine and the early startup DMA engine that gets the 64K of early startup code into memory.

The other parts of the design are as small as can be. Neither a memory controller, nor video or the DMA engine can be considered "big". It's the CPU that takes up too much space, so Jeri tried to optimize that, and having squeezed everything out of it, it was clear that the 6502 had to go.

The new idea was to create a microcode engine that loads a 6502 emulator with the DMA engine. A new processor from scratch. During the design phase, Jeri discovered that the microcode engine she designed was nearly a full processor. Just add a few things here and there, cut out the microcode overhead, and all of a sudden we have a RISC processor, the JRISC. Even the DMA engine can be wiped, because the CPU has a small bootstrap that can load its code from the flash memory.

Before anyone complains about not having the 6502 any more, and all the work on that being done in vain: We still need the 6502 for the C64 "compatibility" core. The work was necessary, we're glad it's done, and we're confident that this implementation of a 6502 on an FPGA beats any other in size, speed and cycle-exact execution. It's just that the 6502 is not used in early startup any more, but the C64 will of course stay with this classic processor, and the "native C-One" will stay with the 65816, as promised in all the technical data that has been spread over the course of the project.

Now let's come to the new processor that has so many benefits over the 6502: JRISC is a 32-bit processor. It has the full address range of the SIMM module, and a little more (currently 27 bits, but that might change in order to optimize the design). No matter what size SIMM you put in the multimedia memory socket, it can all be used as program memory for the drive CPU. Up to 128 megs can be used as shared memory for program, data, video and audio.

JRISC has a von Neumann architecture. As opposed to the Harvard architecture, this has common memory areas for program and data. In other words: The same as you're used to from C64 or Amiga experience. I want to make this clear because the Harvard architecture, which separates code and data, is often used on microcontroller systems such as the PIC micro or the all-time classic, much-hated Intel 8031. Since the goal of the design was small size, one could consider JRISC a microcontroller, which it really isn't.

The basic architecture is about everything that JRISC and the 6502 have in common. You will have to re-think some aspects of processors if you haven't dealt with RISC processors before. RISC means "reduced instruction set computer", and it's the opposite of the CISC (complex instruction set computer), the concept of most popular processor architectures of today: x86, M68K, PPC, just to mention a few.

Those of you who know the CISC architecture might have real problems getting the idea of Jeri's new approach to having a small processor. Everyone who wants to take part in future discussions should read the preliminary documentation, and everyone is welcome to discuss the new aspects of the processor on this list. Ladies and Gentlemen, may I present: JRISC!

The document is preliminary, and it's already outdated in some aspects. Please do not base any work on this document!

Reducing the instruction set does not mean crippling the capabilities of the processor. Reading the document, you might have found the "shift right" opcode, but no shift left opcode. Such an operation must be done with other operations, but we'll come to that later. Let's first discuss the form of an opcode, and the power you can squeeze into one 32-bit word:

Every opcode is 32 bits wide. There's no smaller unit in the program memory. Every opcode is conditional, which means that the execution of each command depends on the state of the flags. 6502 programmers only know the branch commands like

BNE (branch not equal)
BEQ (branch equal)
BCS (branch carry set)
BCC (branch carry clear)

and so on. These are conditional commands: they jump if a certain condition is met. The drawback is that only one bit of the flags field can be checked, and the action taken can only be a jump to a different location in program memory. JRISC allows checking more than just one flag, giving fairly complex comparisons such as "greater than" or "less or equal" in just one opcode. Further, not only jump commands can be executed conditionally, but also any other command. I'll spare you any examples and leave them for the discussion later.
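To make the multi-flag idea a little more concrete, here is a minimal Python sketch of how a condition such as "greater than" can be decided from several flags at once. The flag names N, Z and V are taken from the register list further down; the condition names, the function names and the flag rules (the usual two's-complement conventions for a 32-bit subtraction) are assumptions for illustration only, not the actual JRISC condition encoding.

```python
MASK32 = 0xFFFFFFFF
SIGN = 0x80000000

def flags_after_sub(a, b):
    """Flags of a - b on 32-bit values (assumed, conventional rules)."""
    result = (a - b) & MASK32
    n = bool(result & SIGN)                 # Negative: top bit of the result
    z = result == 0                         # Zero: all bits clear
    # Overflow: operands had different signs and the result's sign differs from a
    v = bool((a ^ b) & (a ^ result) & SIGN)
    return {"N": n, "Z": z, "V": v}

# Hypothetical condition names; most of them look at more than one flag.
CONDITIONS = {
    "always":        lambda f: True,
    "equal":         lambda f: f["Z"],
    "greater_than":  lambda f: (not f["Z"]) and f["N"] == f["V"],
    "less_or_equal": lambda f: f["Z"] or f["N"] != f["V"],
}

def condition_met(name, a, b):
    """Would an opcode with this condition execute after comparing a with b?"""
    return CONDITIONS[name](flags_after_sub(a & MASK32, b & MASK32))
```

With these rules, condition_met("greater_than", 5, 3) is true and condition_met("greater_than", 3, 5) is false, each decided in a single test that a 6502 would need two branches for.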

Any opcode can choose whether or not to update the flags with the result of the operation it just performed. Further, any ALU (arithmetic-logical-unit) operation result can be stored or not. This makes comparisons possible without having to destroy the contents of a register. As opposed to the preliminary PDF document, this is not achieved by the "store result" bit, but by choosing register 0 as the destination register. It is now write-protected, which frees one bit in the opcode word for other purposes. To compare two registers, just subtract them, and choose register 0 as the destination. Set the "store flags" bit. Next you can use the flags as a condition for some command. For example, the Zero flag will be set if the two registers were equal.
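The compare-by-subtracting trick can be sketched in a few lines of Python. This is only a model of the behaviour described above, under the assumption that a write to register 0 is silently dropped; the class, method and parameter names are made up for this example.

```python
MASK32 = 0xFFFFFFFF

class Registers:
    """Register file sketch: register 0 always reads 0 and ignores writes."""
    def __init__(self):
        self._r = [0] * 16
        self.zero_flag = False

    def read(self, n):
        return 0 if n == 0 else self._r[n]

    def write(self, n, value):
        if n != 0:                      # register 0 is write-protected
            self._r[n] = value & MASK32

def sub(regs, dst, src_a, src_b, store_flags=False):
    """dst = src_a - src_b; update the Zero flag only if store_flags is set."""
    result = (regs.read(src_a) - regs.read(src_b)) & MASK32
    regs.write(dst, result)             # dropped when dst is register 0
    if store_flags:
        regs.zero_flag = (result == 0)
    return result

# Compare register 1 with register 2 without destroying either of them.
regs = Registers()
regs.write(1, 42)
regs.write(2, 42)
sub(regs, 0, 1, 2, store_flags=True)
```

After the last line the Zero flag is set, and registers 1 and 2 still hold 42.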

Speaking of registers, there are ten of them, and they're 32 bits wide. There are six more registers for special purposes, which emphasizes the load-store architecture of the processor: Simply everything is done in the registers, and after preparing data in them, you can store the data in memory, which leads to the next aspect of the CPU, the addressing modes.

The only immediate mode is loading data into a register. Immediate means that the data is located in the program itself, as a parameter to an opcode. Loading some value to a register not only takes the opcode, but also the data as a parameter, so the full command is two 32-bit words long (one for the opcode, and one for the data). This is the only command that takes two instruction words, all other commands only have one word. Other addressing modes are generated by the ALU: You can use ADD, SBC, AND, OR, NOT, shift and combinations of those to generate the target address where you load or store data.

The experienced readers among you might already see the power in this: Weird addressing modes like "indirect indexed masked", where the index can point over the whole memory space, are possible. The only thing that the processor lacks is a direct addressing mode; you always have to take the detour of loading the address into a register with an immediate load, and then using that register as a pointer to your memory location - this is as "direct" as it gets!
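As an illustration, here is one possible reading of "indirect indexed masked" in Python: fetch a pointer through a register, add an index register, mask the sum with a third register, and use the result as the address. The function name, the register assignment and the dictionary used as memory are hypothetical; the point is only that the address is computed by ordinary ALU steps (ADD, AND) rather than by a dedicated addressing mode.

```python
MASK32 = 0xFFFFFFFF

def load_indirect_indexed_masked(memory, regs, ptr_reg, index_reg, mask_reg):
    """Load a word from ((memory[ptr] + index) AND mask)."""
    base = memory[regs[ptr_reg]]                    # indirect: pointer read via a register
    address = (base + regs[index_reg]) & MASK32     # indexed: an ALU ADD
    address &= regs[mask_reg]                       # masked: an ALU AND
    return memory[address]

# "Direct" addressing as described above: first load the address into a
# register with an immediate load, then use that register as the pointer.
regs = [0] * 16
memory = {0x100: 0x2000, 0x2004: 99}
regs[1] = 0x100          # immediate load of the pointer's location
regs[2] = 0x14           # index
regs[3] = 0xFFFFFF0F     # mask
value = load_indirect_indexed_masked(memory, regs, 1, 2, 3)
```

Here the pointer 0x2000 plus index 0x14 gives 0x2014, the mask turns that into 0x2004, and the word stored there is loaded.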

There are three register fields in the bit pattern of an opcode. Two source fields, and one destination field. Each of them is 4 bits wide, so each of them can point to one of the possible 16 registers of the processor (restrictions apply, because only ten of them are general-purpose registers!). The two source registers are fed into the ALU, and the destination register field points to the register file where the result of the ALU operation shall be stored. You can choose the registers freely, so the two source registers can be the same, and even the destination register can be the same, because the ALU latches the values and therefore does not cause any glitches in the data consistency. The result is stored in a register file after the source data has been read - if you choose to store the result at all (see above).

The final part of a 32-bit word is the remaining 8 bits of "branch". This means that every opcode gives you the opportunity to jump after its execution. Since every opcode is conditional, this jump is also conditional, and the same condition code applies to the jump itself. The jump is relative: you can jump 128 locations forward, or 127 locations back. To emphasize the meaning of these last bits: Every opcode can perform two actions, where the second is always jumping. If you don't want to jump anywhere, just set these bits to the default pattern that makes the CPU fetch the next instruction, and that's all 0's.
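The range of 128 forward and 127 back fits an ordinary signed 8-bit offset that is counted from the next instruction, so that an offset of 0 simply means "fetch the next word". The following Python sketch shows that reading. It is an assumption derived from the numbers above, not a confirmed detail of the hardware, and the function names are invented for this example.

```python
def sign_extend8(field):
    """Interpret the 8 branch bits as a two's-complement number (-128..127)."""
    field &= 0xFF
    return field - 0x100 if field & 0x80 else field

def branch_target(pc, branch_field):
    """Address of the next instruction if the condition of the opcode is met."""
    return pc + 1 + sign_extend8(branch_field)
```

Under this reading, branch_target(100, 0x00) is 101 (no jump), branch_target(100, 0x7F) is 228 (128 locations forward), and branch_target(200, 0x80) is 73 (127 locations back).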

Let's shock you a little more: There's no JSR (jump to subroutine) and no RTS (return from subroutine) in the instruction set. How we still manage to jump to subroutines and return from them in a clean way requires knowledge about the six "special purpose" registers of JRISC:

Register number 0 always contains 0. It cannot be used for anything else, only as a constant 0. As mentioned earlier, the register is write-protected, so it can be used as a dummy destination for results that shall not be stored.

Registers 1-10 are general purpose.

Register 11: IRQ vector (upper 5 bits reserved)
Register 12: software stack (upper 5 bits reserved)
Register 13: program counter (upper 5 bits reserved)
Register 14: last address (upper 5 bits are flags: NZCVI, MSB first)
Register 15: current opcode fetched (do not modify!)
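The layout of register 14 can be written down as a small Python sketch: five flag bits at the top (N in bit 31 down to I in bit 27) and a 27-bit address below them. The bit positions follow directly from "upper 5 bits" and "MSB first"; the function names and the use of a set for the flags are choices made for this example only.

```python
ADDR_BITS = 27
ADDR_MASK = (1 << ADDR_BITS) - 1
FLAG_NAMES = "NZCVI"                     # bit 31 down to bit 27

def pack_r14(address, flags):
    """Combine a 27-bit address and a set of flag letters into one word."""
    word = address & ADDR_MASK
    for i, name in enumerate(FLAG_NAMES):
        if name in flags:
            word |= 1 << (31 - i)
    return word

def unpack_r14(word):
    """Split a register 14 value into (address, set of flag letters)."""
    flags = {name for i, name in enumerate(FLAG_NAMES) if word & (1 << (31 - i))}
    return word & ADDR_MASK, flags
```

For example, pack_r14(0x1234, {"Z", "C"}) gives 0x60001234, and unpack_r14 turns that word back into the address 0x1234 and the flags Z and C.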

Every instruction can find out where the previous instruction was located just by looking at the last address register (14). You can now make a subroutine-entry routine that stores the flags and the last address on the stack, and restores them before leaving the subroutine. The difference between a "real" JSR and this way is just that you're doing the necessary steps to find the way back into the main program instead of the CPU doing the work for you. Overall execution time will be the same compared to CISC units, because there's hardly a difference between the CPU doing the things on its own or the programmer telling it what to do.

The idea for doing subroutines with JRISC is to jump to the routine and have its first instruction catch the last address and the flags from register 14 and push them on the stack, then do the subroutine stuff. To exit and return to the location where it came from, just take that word from the stack, write the flags back to register 14, set the program counter to the old address and the CPU will automatically continue at the next address after the jump to the subroutine. Saving the flags is of course optional; the 6502 doesn't save flags on subroutine calls either.
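The call and return sequence can be modelled in Python as follows. This is a rough sketch under several assumptions that the text does not settle: the stack is assumed to grow downwards one word at a time, the start value of the stack pointer is arbitrary, and the step to "the next address after the jump" is written out explicitly as +1 where the real CPU is said to do it automatically. All names are invented for this example.

```python
ADDR_MASK = (1 << 27) - 1
FLAG_MASK = 0xF8000000

class Cpu:
    def __init__(self):
        self.memory = {}
        self.sp = 0x1000        # register 12, hypothetical start value
        self.pc = 0             # register 13
        self.r14 = 0            # register 14: flags + last address

    def jump(self, target):
        """A taken branch: remember where we came from, then go."""
        self.r14 = (self.r14 & FLAG_MASK) | (self.pc & ADDR_MASK)
        self.pc = target

def enter_subroutine(cpu):
    """First instruction of a subroutine: push flags and last address."""
    cpu.sp -= 1                          # assumed: stack grows downwards
    cpu.memory[cpu.sp] = cpu.r14

def leave_subroutine(cpu):
    """Pop the saved word, restore the flags and return behind the caller."""
    saved = cpu.memory[cpu.sp]
    cpu.sp += 1
    cpu.r14 = saved
    cpu.pc = (saved & ADDR_MASK) + 1     # continue after the calling instruction

cpu = Cpu()
cpu.pc = 0x40
cpu.r14 = 0x40000000                     # Z flag set in the main program
cpu.jump(0x200)                          # "JSR"
enter_subroutine(cpu)
cpu.r14 = 0                              # the subroutine clobbers the flags
leave_subroutine(cpu)                    # "RTS"
```

At the end the program counter is 0x41, the Z flag is set again, and the stack pointer is back at its start value.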

A similar workaround is needed for shifting left. Since the ALU does not support it directly, you have to build it from some knowledge about bit operations. Here's the theory: Shifting left is the same as multiplying by two. Multiplying by two means adding a number to itself, and that's what the ALU can do natively: Choose the same register for both sources of an ADD, and the result will be that register shifted one bit left. More theory: The result of a multiplication by two is always an even number. Even numbers represented in binary always have a cleared least significant bit, so with "add with carry" a set carry bit can never cause an additional overflow. Instead, the carry bit is "shifted" in at the bottom, another benefit. If the MSB of the to-be-shifted value was set, the carry bit will be set after this operation, giving a full rotate command with the few features the CPU has.
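The arithmetic behind this can be checked with a short Python sketch. It models a 32-bit adder with carry in and carry out; the function names are made up, and nothing here says how the real ALU encodes these operations.

```python
MASK32 = 0xFFFFFFFF

def add(a, b, carry_in=0):
    """32-bit add; returns (result, carry_out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32

def shift_left(value):
    """Shift left by one bit: add the value to itself."""
    return add(value, value)

def rotate_left_through_carry(value, carry):
    """Add with carry of the value to itself: the old carry enters at bit 0."""
    return add(value, value, carry)
```

For the value 0x80000001, shift_left returns 0x00000002 with the carry set (the old MSB), and rotate_left_through_carry with a set carry returns 0x00000003 with the carry set.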

The one big exception to the flexibility of conditionals and branching into subroutines from any command is the immediate load: The condition code must be set to "always", and a jump into a subroutine is not allowed either. This is because of the data being located right behind the opcode: If you use a condition code which is not met, the data would be executed as code, which would crash the CPU. This is also the reason for the no-JSR, because the return code of the subroutine would not know that it was called from an immediate load, and return into the data instead of the word after that. The final restriction is that the branch may not be set to all 0's, because this would also lead to the data being interpreted as code. Immediate load requires the user to set the branch destination to the word after the data word, or some other valid instruction word.
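Why the branch bits of an immediate load matter can be shown with a toy model in Python. Program memory is a list of (kind, argument) pairs, where the argument of an instruction is its branch offset counted from the next word, and the argument of a data word is its value. This representation, the names and the rule "offset 0 means next word" are assumptions for this example, not the real opcode format.

```python
def fetched_addresses(program, start=0, max_steps=32):
    """Return the addresses the CPU would fetch as instructions."""
    pc, fetched = start, []
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            break
        kind, arg = program[pc]
        fetched.append(pc)
        if kind == "data":          # a data word reached the decoder: crash
            break
        pc = pc + 1 + arg           # arg is the branch offset; 0 = next word
    return fetched

# Correct: the load branches over its data word to the next instruction.
good = [("load_imm", 1), ("data", 0xC0FFEE), ("op", 0)]
# Wrong: branch bits left at all 0's, so the data word is fetched as code.
bad = [("load_imm", 0), ("data", 0xC0FFEE), ("op", 0)]
```

fetched_addresses(good) yields [0, 2], while fetched_addresses(bad) yields [0, 1] and stops on the data word.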

Interrupts are a little tricky: They can only be served after instructions that have the branch destination set to 0. The IRQ service routine will have to do similar things as the subroutine entry code, because it has to take care of flags and the return address. The full detail of interrupts would go beyond the scope of a preliminary introduction of this processor, so please accept that it can do IRQs with some restrictions.

Speed: The PDF document from last week still states that every instruction takes 7 cycles to execute. This will be reduced in order to boost performance.

Code: All the startup code developed until now has been written in 6502 assembler. We don't want to re-write all that code, so the JRISC has some features that assist in emulating 8-bit processors in general, not only the 6502. We have already started to develop the necessary code and will document the new opcodes and ALU features in a few days. The next few days are reserved for implementing and testing some of the 6502 emulation code. I will be reading the mailing list and will try to answer as many questions as I can.

Happy Valentine's Day, everyone,

Jens Schönfeld