BOB is no slouch when it comes to simulating a virtual stack CPU!
Way back in 1991, when I was half the age I am now, I did my PCB design work using OrCAD on a 25MHz 486 desktop. The picture above is of my latest experimental PCB - a breakout board for the 216MHz STM32F746 ARM Cortex-M7 microcontroller. BOB (above) can emulate a 16-bit minimal instruction set processor faster than the 25MHz 486 box - and for about $20! Now that's progress.
Implementing a Stack Processor as a Virtual Machine
This post examines the role of a virtual machine: a program that runs on one processor in order to simulate another, so that it can perform operations the host processor cannot easily do natively. One example is Steve Wozniak's "Sweet 16" - a 16-bit bytecode interpreter he wrote to run on the 6502, to allow the Apple II to readily perform 16-bit maths and 16-bit memory transfers.
In his closing remarks, Woz wrote:
"And as a final thought, the ultimate modification for those who do not use the 6502 processor would be to implement a version of SWEET16 for some other microprocessor design. The idea of a low level interpretive processor can be fruitfully implemented for a number of purposes, and achieves a limited sort of machine independence for the interpretive execution strings. I found this technique most useful for the implementation of much of the software of the Apple II computer. I leave it to readers to explore further possibilities for SWEET16."
The main limitation of the VM approach is that execution is often one or two orders of magnitude slower than native machine code on the host, but with processors now available with clock speeds of 200MHz, this is not so much of a problem.
It is more than offset by the ability to design a processor with an instruction set that is hand-crafted for a particular application, or the means to explore different architectures and instruction sets, and to simulate these in software, before committing to FPGA hardware.
Whilst Woz's Sweet 16 was a 16-bit register-based machine, I had ideas more along the lines of a stack machine, because of its simpler architecture and low hardware resource requirement.
I had become interested in an interpreted bytecode language that I believed would be a good fit for a stack machine, and so in order to get the ball rolling, I needed a virtual stack machine to try out the language.
Earlier this year, I invested in a Papilio Duo FPGA board, and with this came access to a ZPUino soft-core stack processor - devised and much enhanced from an existing design, by Alvie Lopez. The advantage of the ZPUino was that it was one of the few soft core processors that had GCC available, and so the task of porting the Arduino flavour of C++ to it was not over arduous (for those accustomed to that sort of task - not me!).
However, porting C to a stack machine is never a very successful fit - as C prefers an architecture with lots of registers - such as ARM.
As a result, the ZPUino, whilst clocked at 6 times the speed of the standard Arduino, only achieved about twice the performance when running a Dhrystone Benchmark test - written in C. The other factor limiting ZPUino is that it executes code from the external RAM - and there is a time overhead in fetching instructions.
Despite these limitations, the ZPUino has been a useful tool to run simulators, as it supports VGA hardware and the Adafruit Graphics library - allowing text and video output from an Arduino-like environment.
The other stack processor that caught my attention is James Bowman's J1 Forth processor. This became available as an implementation on the Papilio Duo in early September to run on readily available FPGA hardware at speeds of up to 180MHz. So I have been working towards trying it out - first using a software simulator.
A J1 Simulator - written in C - and tried on a number of processors.
Back in the spring, I found a bit of C code that allowed a J1 processor to be run as a virtual machine on almost any processor.
Initially, I implemented it on Arduino, but I quickly moved to the faster ZPUino - which, as stated above, is a stack-based processor implemented on an FPGA. This was a stop-gap, whilst I was waiting for James to release his J1 in a form that I could use.
The simulator is about 100 lines of standard C code, and implements a 16-bit processor with integer maths and a 64K word addressing space.
I then wrote a test routine, in J1 assembler, consisting of just a simple loop - executing 7 instructions and incrementing (by 1) a 16-bit memory location, every time around the loop.
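To give a flavour of what the inner loop of such a simulator looks like, here is a minimal sketch of a 16-bit stack machine in C. It is not the J1 simulator mentioned above and it does not use the J1 instruction encoding: the opcodes (OP_LIT, OP_FETCH and so on), the vm_t structure, the tiny 256-word memory and the function names are all invented for illustration. The test program is a seven-instruction loop in the same spirit as the one described, fetching a memory cell, adding one, storing it back and jumping to the start.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical opcodes for a toy stack machine (NOT the J1 encoding). */
enum { OP_HALT = 0, OP_LIT, OP_FETCH, OP_STORE, OP_ADD, OP_JMP };

#define MEM_WORDS    256u  /* tiny memory, just for this sketch */
#define STACK_DEPTH  32u   /* must be a power of two (index is masked) */
#define COUNTER_ADDR 100u  /* memory cell the test loop increments */

typedef struct {
    uint16_t mem[MEM_WORDS];
    uint16_t stack[STACK_DEPTH];
    uint16_t sp;  /* number of items on the data stack */
    uint16_t pc;  /* address of the next word to fetch */
} vm_t;

static void push(vm_t *vm, uint16_t v)
{
    vm->stack[vm->sp & (STACK_DEPTH - 1u)] = v;
    vm->sp++;
}

static uint16_t pop(vm_t *vm)
{
    vm->sp--;
    return vm->stack[vm->sp & (STACK_DEPTH - 1u)];
}

/* Fetch the word at pc and advance pc (addresses wrap within MEM_WORDS). */
static uint16_t next_word(vm_t *vm)
{
    uint16_t w = vm->mem[vm->pc % MEM_WORDS];
    vm->pc++;
    return w;
}

/* Run until OP_HALT or until max_instr instructions have executed.
   Returns the number of instructions executed. */
uint32_t vm_run(vm_t *vm, uint32_t max_instr)
{
    uint32_t executed = 0;
    while (executed < max_instr) {
        uint16_t a, b;
        switch (next_word(vm)) {
        case OP_LIT:                       /* ( -- n ) operand follows opcode */
            push(vm, next_word(vm));
            break;
        case OP_FETCH:                     /* ( addr -- value ) */
            a = pop(vm);
            push(vm, vm->mem[a % MEM_WORDS]);
            break;
        case OP_STORE:                     /* ( value addr -- ) */
            a = pop(vm);
            b = pop(vm);
            vm->mem[a % MEM_WORDS] = b;
            break;
        case OP_ADD:                       /* ( a b -- a+b ), wraps at 16 bits */
            b = pop(vm);
            a = pop(vm);
            push(vm, (uint16_t)(a + b));
            break;
        case OP_JMP:                       /* target address follows opcode */
            vm->pc = next_word(vm);
            break;
        case OP_HALT:
        default:                           /* halt, or unknown opcode */
            return executed;
        }
        executed++;
    }
    return executed;
}

/* Load a seven-instruction loop that increments mem[COUNTER_ADDR]. */
void load_counter_loop(vm_t *vm)
{
    static const uint16_t prog[] = {
        OP_LIT, COUNTER_ADDR, OP_FETCH,   /* read the counter        */
        OP_LIT, 1, OP_ADD,                /* add one                 */
        OP_LIT, COUNTER_ADDR, OP_STORE,   /* write it back           */
        OP_JMP, 0                         /* and go round again      */
    };
    memset(vm, 0, sizeof *vm);
    memcpy(vm->mem, prog, sizeof prog);
}
```

Calling load_counter_loop() and then vm_run() with a budget of 7,000 instructions takes the loop round 1,000 times. Timing a much larger budget on the host and dividing the instruction count by the elapsed seconds is one way to arrive at a figure in instructions per second.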
Running this test code, the standard 16MHz Arduino managed 67,000 J1 instructions per second (Jips).
I then transferred the sketch to the ZPUino, running on the Papilio Duo board. This provided a useful boost in performance to about 152,000 Jips.
A 72MHz STM32F103 running the same code under STM32-Duino managed 404,000 Jips - about 6 times the speed of the Arduino, and a healthy performance boost.
The difference in performance between the 8-bit Arduino and the 32-bit STM32F103 can be explained partly by the 4.5 times increase in clock speed, and partly by the fact that a 32-bit microcontroller can implement a 16-bit virtual machine somewhat more efficiently than an 8-bit device, giving an additional boost of around 30% over clock speed scaling alone.
In addition, the test code only added one to the memory cell. If this were say adding a 16 bit value into that location - the 16 bit transfer would slow the 8-bit AVR down considerably.
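To illustrate why 16-bit arithmetic costs an 8-bit processor extra work, here is a small sketch in C that performs a 16-bit addition the way an 8-bit ALU has to: low bytes first, then high bytes plus the carry. The function name add16_bytewise is made up for this example, and it only models the arithmetic - it says nothing about the exact instruction sequence an AVR compiler would emit.

```c
#include <stdint.h>

/* Add two 16-bit values using only 8-bit-wide operands plus a carry,
   as an 8-bit ALU must. Illustrative only; the name is hypothetical. */
uint16_t add16_bytewise(uint16_t a, uint16_t b)
{
    uint8_t a_lo = (uint8_t)(a & 0xFFu), a_hi = (uint8_t)(a >> 8);
    uint8_t b_lo = (uint8_t)(b & 0xFFu), b_hi = (uint8_t)(b >> 8);

    /* Step 1: add the low bytes; bit 8 of the sum is the carry out. */
    uint16_t lo_sum = (uint16_t)(a_lo + b_lo);
    uint8_t  carry  = (uint8_t)(lo_sum >> 8);

    /* Step 2: add the high bytes plus the carry; overflow is discarded. */
    uint8_t hi = (uint8_t)(a_hi + b_hi + carry);

    return (uint16_t)(((uint16_t)hi << 8) | (lo_sum & 0xFFu));
}
```

A 32-bit host does the same job in a single add, and the 8-bit part also needs two byte-wide loads and two stores to move each 16-bit value to and from memory.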
I then proceeded to port the simulator to a 168MHz STM32F407 Discovery board, which returned a slightly puzzling 764,000 Jips.
This appeared to be a bit slow. In theory the '407 should run at 2.33 times the speed of the 72MHz part (168/72), which would put it at about 940,000 Jips. This needed further checking to ensure that a compiler optimisation issue was not holding it back.
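The clock-scaling arithmetic behind these comparisons is simple enough to write down. The two helper functions below are a sketch, with names of my own choosing, and assume nothing more than that throughput would scale linearly with clock frequency if nothing else changed.

```c
/* Throughput expected if performance scaled purely with clock frequency. */
double scale_by_clock(double jips_ref, double mhz_ref, double mhz_new)
{
    return jips_ref * (mhz_new / mhz_ref);
}

/* Measured throughput as a fraction of the clock-scaled projection. */
double fraction_of_projection(double jips_measured, double jips_projected)
{
    return jips_measured / jips_projected;
}
```

With the figures quoted above, scale_by_clock(404000, 72, 168) comes to roughly 942,700 Jips, and the measured 764,000 Jips is about 0.81 of that. Applied to the earlier step, scale_by_clock(67000, 16, 72) gives 301,500 Jips, so the 404,000 Jips measured on the STM32F103 is about 1.34 times the clock-scaled figure.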
I tried again with the various optimisation levels of the GCC compiler:
Optimisation -O0      733,333 Jips
Optimisation -O1    3,083,333 Jips
Optimisation -O2    3,333,333 Jips
Optimisation -O3    3,583,333 Jips
With only modest optimisation the '407 is returning around 3 million Jips!
Meet BOB - the fastest, newest kid on the block.
Back in the summer I made up a breakout board (BOB) for the 216MHz STM32F746 Cortex-M7 microcontroller. Whilst ST Microelectronics had released their $50 F7 Discovery board - complete with LCD - I wanted a very simple board, with the same pin-out as the previous F4 Discovery, so that I could make relative performance comparisons.
So, it's now time to port the J1 simulator onto the STM32F746 - and see how it performs.
The '746 is an M7 ARM and has a six-stage dual-issue pipeline - which means that it can, where the code allows, issue two instructions in the same clock cycle. This feature and the higher clock frequency should give it about a 2.2 times speed advantage over the '407.
With all this working, the '746 BOB board should be able to simulate the J1 at around 7.8 million J1 instructions per second - welcome back to the 1980s in terms of performance!
Whilst we can emulate the J1 in C at around 8 million Jips, the real J1 should manage nearly 200 million Jips, so when I get real J1 hardware up and running - it should really fly!
After a long day and half a night of battling with compilers, I just got the figures for the STM32F746 running the J1 interpreter at 216MHz. Initial measurements suggest that it's running at close to 15 million Jips with minor optimisation and about 27 million Jips with the most aggressive optimisation!
Optimisation -O0     9,000,000 Jips
Optimisation -O1    15,000,000 Jips
Optimisation -O3    27,000,000 Jips