Saturday, May 02, 2015

Benchmarking Arduino - and his Chums.

Background

The standard Arduino based on the ATmega328 is an 8-bit device with a 16MHz clock frequency and 2K bytes of RAM.

I have for some time been exploring more powerful alternatives to the Arduino - especially the 32 bit STM32Fxxx ARM Cortex M4 range of microcontrollers, and some softcore processors implemented in a FPGA.

The one thing that these processors have in common is that they can all be programmed using the Arduino IDE - so in theory, code written for the Arduino will run on all devices -almost without modification.

The flavour of C++ used by Arduino has become a kind of lingua franca for these widely varying processors, allowing access to vast knowledge base and range of libraries that permit the easy interfacing of hardware devices. In truth, if you wish to use an integrated device or sensor, then someone will already have created an Arduino library for it.

Since the earliest days of commercial computers, both manufacturers and users have had a strong interest in their computing performance. Computers were expensive, and computing time was equally expensive. Any way of increasing performance and reducing programming costs was sought after. As memory technologies improved, processor cycle times reduced to match the shorter access time of the memory.

When launched in 1965, the PDP-8 was capable of 312,500 12-bit additions per second.  How does the Arduino compare with that figure?

Into Practice

The Arduino is a great platform for trying things out.  Whilst not the fastest board available, it's resources are easily accessible, and the millis() and micros() timer functions allow simple benchmarking to be done. 

Remembering the claimed performance for the PDP-8, I decided to set up a simple addition test for Arduino. First I formed an array of 16 bit integers - remembering that in the Arduino there is only sufficient RAM space for about 500 16-bit words. Exceeding this gives a risk of overwriting some of the stack, heap and system variables.

I then loaded up the array with random integers 

void setup()
{  
  Serial.begin(115200);
  
  for(int i = 0; i <=500; i++)
  {
  m[i] = random(0,65535);
  }  

}

The main routine would then add two of the memory locations together, working it's way through the array. The time taken for the 500 iteration function was calculated using the micros() function.

Results were as follows:

1. Adding a constant to memory    1uS
2. Adding contents of two memory locations into a variable   1.4uS
3. Adding  contents of two memory locations and storing back into a third memory location 1.6uS

So based on this, the Arduino is performing addition of  memory located operands at between 2 to 3 times the speed of the 1965 PDP-8.

However  - we should bear in mind that at 16MHz, the Arduino is executing approximately 16 instructions per microsecond.  Whilst an 8-bit add is a single cycle instruction, by the time we have used it within a 16-bit add routine, and involved a memory access, the Arduino is taking roughly a microsecond to achieve a common operation in a typical program.  

I then conducted the same test on the 72MHz STM32F103 board programmed using Arduino_STM32.

1. Adding a constant to memory    0.156uS    (6.4X faster)
2. Adding contents of two memory locations into a variable  0.294uS  (4.76X faster)
3. Adding  contents of two memory locations and storing back into a third memory location 0.32uS (5X faster).

Next was the turn for the 96MHz ZPUino - a softcore running in a Xilinx Spartan 6 - on a Papilio Duo FPGA board.

1. Adding a constant to memory    0.58uS    (1.72X faster than Arduino)
2. Adding contents of two memory locations into a variable  0.708uS  (1.39X faster)
3. Adding  contents of two memory locations and storing back into a third memory location 0.706uS (1.41X faster).

The skeleton Arduino code for these addition speedtests is available on this Github Gist

The results for the ZPUino were a little disappointing - bearing in mind it is being clocked at 6X the speed of the Arduino. However it is a stack based processor, which uses external RAM. The external bytewide RAM will slow down RAM accesses and perhaps C does not compile efficiently to its stack based architecture.

Whetstone and Dhrystone Benchmarks

One solution is to use standard benchmark code, of which there are several well documented programmes, designed to test the various performance aspects of the processor. These include:

Dhrystone  - an integer arithmetic benchmark
Whetstone  - a floating point benchmark
CoreMark - for multi-core processors
LinPak      - for Linux based systems

Dhrystone - a fixed point benchmark

The Dhrystone benchmark code - suitable for small microcontrollers is available here - however the Dhrystone caused some me difficulties in converting it to an Arduino compatible format - particularly because of the shortage of RAM on the ATmega328.

Fortunately, I was contacted by Magnus of Saanlima Electronics - with an Arduino friendly version. You will however need an Arduino MEGA - because the Uno does not have enough RAM to run this benchmark.

http://www.saanlima.com/download/dhry21a.zip

Results:

Arduino MEGA1260 board 16MHz

Microseconds for one run through Dhrystone: 78.67
Dhrystones per Second:12711.22
VAX MIPS rating = 7.23


72MHz STM32F103 programmed using Arduino_STM32 

Microseconds for one run through Dhrystone: 11.66
Dhrystones per Second: 85762.68
VAX MIPS rating = 48.81
I then ported it to ZPUino2.0  - and after a little fiddling got the following:
Microseconds for one run through Dhrystone: 37.95
Dhrystones per Second: 26351.79
VAX MIPS rating = 15.00


Whetstone - a floating point benchmark

The Whetstone test code adapted for Arduino by Thomas Kirchner is here
When run on a standard Arduino 16MHz Duemillenove the Whetstone produced the following result

Starting Whetstone benchmark...
Loops: 1000 Iterations: 1 Duration: 81740 millisec.
C Converted Double Precision Whetstones: 1.22 MIPS

On the STM32F103 board with a 72MHz clock

Starting Whetstone benchmark...
Loops: 1000 Iterations: 1 Duration: 19691 millisec.
C Converted Double Precision Whetstones: 5.08 MIPS

So the STM32F103 appears to be running at approximately four times the speed of the Arduino. This speed increase is dominated by the faster clock on the STM32F103, and not that the ARM processor is executing the compiled code any more efficiently than the AVR.

Whilst an indication of processor performance, the benchmarks are a somewhat artificial test, and the actual performance of one processor compared to another will depend on the application. Additionally, the manner in which the compiler interprets the C source code and efficiently converts it into the native machine language of the processor has an effect on the overall processing speed.


Conclusions

The 16MHz Arduino can execute real code at around 1.22 million instructions per second.

Moving up to a 72MHz STM32F103 ARM will give about a 5X to 7X speed advantage over the Arduino. A lot of this is down to the faster clock, and some down to the fact that double precision arithmetic will be easier (less cycles) on a 32bit processor than an 8 bit device

Soft core FPGA processors are interesting, but may be constrained by the use of external RAM and the restriction of an 8 bit external bus hen accessing multi-byte words. It must therefore be possible to improve their performance, with the used of internal (on chip) RAM.




No comments: