Sunday, September 06, 2015

Exploring Forth for Low Level Hacking

Background

Forth is an interactive, low level language which shares a lot in common with machine code. It allows low level access to the processor and its resources and can therefore be quick and powerful - in the right hands.
It allows a degree of interaction which has now been lost in the higher level compiled languages, but for the right applications it provides all the flexibility needed.

The following videos illustrate some aspects of Forth when used for controlling hardware.

Open Firmware




It Was Twenty Years Ago Today.......

In the mid-1990s Chuck Moore, Jeff Fox and others worked towards forth computing engines that would achieve burst speeds of around 500MIPS.

Chuck Moore developed custom VLSI devices - a series of  processors where the machine language instructions were essentially Forth primitives.  These processors all used a minimal instruction set - and were known as MISC processors.

Dr C.H. Ting, had also shown with his eForth model, that a working Forth could be composed from just 31 Forth primitives, and that all other definitions could be assembled from this core set. Thus a processor with a 5 bit instruction length could potentially be used for Forth execution.  Dr. Ting explored this further with a series of chip designs - where 5-bit instructions were packed into 16 bit or 32 bit words - allowing 3 or 6 instructions to be fetched from memory at a time, which better suited the slower RAM access.

Keeping the speed up...

When Forth is implemented on a  register based load-store architecture- such as the ARM, the overheads of running the Forth inner interpreter - in particular NEXT,  means that around 10 machine instructions need to be executed in order to execute a Forth primitive. This suggests that an ARM clocked at 100MHz will only achieve around 10MIPS.

Forth requires the right architecture in the processor in order to be able to execute the Forth primitives efficiently - preferably as single cycle instructions.  The processor should have a stack-based architecture, and the machine instructions should be directly map-able to the Forth primitives for efficient executing. Using this approach allows a simple Forth processor to be designed as a soft-core cpu for a FPGA - and maintain a performance of around 50 to 100 million Forth instructions per second. (Forth MIPS).

Affordable FPGAs

Whilst much of this work was done about 20 years ago using custom VLSI chips, progressive improvements to FPGAs, falling memory prices and greater access to sophisticated design and simulation tools has allowed the creation of FPGA soft-core microcontrollers to be in the reach of the hobbyist.  Low cost FPGA dev-boards are available for the $50 to $80 price range.

There have however been a number of stack machine cpu designs developed over recent years, several of which have been implemented on a low cost FPGA.  Notably ZPUino - by Alvaro Lopez, and J1 - by James Bowman, although several others exist.

James Bowman's J1 design is of interest because it is close in architectural design to Chuck Moore's 1985 Novix NC4000, but much simpler because the data and return stacks are implemented in on-chip RAM. This gives it the potential for 100 Forth MIPS - when implemented in a Xilinx Spartan 3E - and described in under 200 (160) lines of Verilog code.

J1 is incorporated into the Gameduino Shield - a gaming -  graphics and sound generator for Arduino. Versions are also available from Olimex - which include PS2 keyboard connector and additional 32MB SDRAM for extended resolution - although Olimex leave you high and dry when it comes to implementing firmware to make full use of the extra 32Mb!

https://www.olimex.com/Products/Modules/Video/MOD-VGA-32MB/open-source-hardware

The J1 Processor Model

The J1 processor is simple enough that it may be modelled in about 100 lines of C code. I used this model available from ddb's Github Repository 

The J1 model is created from James Bowman's original documentation "J1: a small Forth CPU Core for FPGAs" and is very similar to the verilog code that defines the J1 implementation in hardware.

More documentation at James's J1 site .  As can be seen, the J1 has been used in a variety of projects including the Gameduino shield - which is a graphics engine in the form of an Arduino shield.

The J1 has just 5 categories of instruction coded up into a 16 bit instruction word:

Literal                           a 15 bit literal pushed onto the data stack
Jump                            Jump to a 13 bit target address
Conditional Jump          Jump if T is zero to a 13 bit target address
Call                              Call a subroutine at a 13 bit target address
ALU

The ALU uses a 4 bit field to determine the its action, and there are additional bit fields to control access to the stacks and memory.

T -> N        Copy T to Next
R -> PC      Put the return stack into the PC to get a free Return
N -> [T]     Store Next at the location addressed by T
T  -> R       Copy T to Return stack

Additionally there are two, 2-bit, bit fields that allow for the increment and decrement of the data stack pointer, and the return stack pointer - this enables items placed further down the stack to be accessed.

Simulation

The instruction set of any proposed processor may be simulated in software. Once a model of the various stacks, registers and memory has been devised, it becomes a relatively straightforward task to create a C program, with text output, that simulates the operation of the cpu and instruction execution. Whilst the output of the simulator is either text or graphics, the process can be further developed to the point where any processor can emulate the instruction set of another - but with a vast speed penalty.

Fortunately the relatively simple J1 processor may be quite easily simulated in C - even using an Arduino.

The model consists of a 512 word memory  (As an Arduino Uno only has 2K of on chip RAM)

Snippets of machine language are loaded into the RAM during the setup() function - for example

 m[0] = 0x6000;       // NOP
 m[1] = 0x8020;      // LIT 20
 m[2] = 0x8010;      // LIT 10
 m[3] = 0x6400;      // ADD
 m[4] = 0x6700;      // NEG
 m[5] = 0x6000;      // NOP
 m[6] = 0x0001;      // JMP 0001

In this trivial example two literals are loaded onto the  stack,  added together, negated and the whole process is repeated as an endless loop - by the unconditional jump back to the beginning. This is definitely not a particularly good example to illustrate Forth, but it's a good test case to show that the processor model is correctly fetching, decoding and executing code, and that the "alu" and pc are working properly together.

The information contained in the machine instructions contains the following

Numerical constants or literals.  These are 15 bits packaged into a word that has bit 15 set - i.e. 0x8xxx in hexadecimal.

Target Addresses - there are signed 13 bit addresses, which are used to force the processor to branch to a new subroutine address or jump to a new address. The jump in unconditional, but the branch may be conditional - in that the top of the stack needs to equal zero for the branch to be executed. This gives the processor a branch range of +/- 8192 addresses.

ALU Instructions.

The ALU has 16 possible instructions as controlled by a 4-bit field.  Instructions of the type 0x6X00 are alu - where the X is the 4 bit instruction.


Code is Code

It might be worth stating that the entire operation of the processor is controlled by the various fields coded within the instruction.  This is what makes machine language very powerful, and yet very easy to make mistakes.  A single mistake in a field might send your processor off into an unintended area of RAM, where it can misinterpret your stored data as a program, and then start indiscriminately writing to RAM.  Invariable this ends up as a system crash.

As writing in machine code has always been a thankless task and prone to mistakes, it is best to spend time writing an assembler to help "assemble" programs from the processor's instruction set.

Assemblers use human readable mnemonics such as ADD, OR, JMP and allow numbers to be entered in decimal or hex. The assembler will use a text file which contains the source code, and which can be edited using a text editor. This can then be processed by the assembler to produce a binary or hex file that may then be loaded into the RAM of the processor.

Assembly language is the first layer of abstraction above the processor's own machine language.  As a tool it makes programming simpler, faster and less prone to mistakes.

These tools first stared becoming available in the early-1980s. Early 8-bit home-micros often had an assembler/disassembler available as part of it's toolkit.

An excellent reference book on Assemblers by David Salomon.

Forth as an Assembly Language

In the early 1960's, Charles Moore - the creator of Forth, realised that there may be a better way of writing programs, than the traditional assembler or high level language compiler method.

He knew that any program consisted of small snippets of code, each performing some small function within the program.  These functions and routines would be stitched together with calls and jumps to form the structure of the program.

He came up with the concept of the  Forth word,  where the word is the name of such a function - for example SQUARE.

Running on the processor was a small interpreter program, which could take the text input and compile it into executable machine code.

The word SQUARE could be written at the keyboard, or typed into a text file, and every time it was encountered it would perform the function of calculating the square of a number.

For this to work, SQUARE had to be created using the colon definition method of defining new words - which is written like this:

: SQUARE DUP  *  ;

: This colon is the word that tells the interpreter that this is a new definition
SQUARE this is the name of our new word, and that will be put into the dictionary
DUP is a forth word that duplicates the top word on the stack
* multiplies the top two entries on the stack, leaving the product on the stack
;  Semi-colon  - this denotes the end of the definition and a return to the inner interpreter

For a much fuller explanation of how this works - have a Read of Brad Rodriguez' excellent article "Moving Forth"

Suffice to say, that the Forth system provides the assembly, compilation and run-time execution environment needed for a self contained system, and it does it in a user interactive manner.

This video shows a typical Forth work session.

N.I.G.E Machine


In another post, I describe my project to combine a simulation of the J1 Processor with a set of simple graphical tools to allow assembly, disassembly and memory viewing.

No comments: