Tuesday, March 5, 2013

Benchmarking the Pin Classes

Time to present some of the performance evaluation of the design of the Cosa Pin classes. How well do they work compared to the standard Arduino/Wiring functions for digital/analog read and write to pins?

This posting is also a short introduction to the iostream (Cosa/IOStream.hh) and trace support classes in Cosa (Cosa/Trace.hh). These are a set of powerful classes that allows easy redirection of basic output operations to several devices; UART, Textbox on Canvas, Virtual Wire Interface, IOBuffer, etc. Just by changing a few lines in your code you can get trace output to another IOStream device. There is also support for automatic prefix with file line number, function name, and log type.

The benchmark I want to present is CosaBenchmarkPins.ino. It measures the time to perform different types of read and write operations in both Arduino/Wiring and Cosa within the same sketch. This is possible as Cosa is written so that Arduino/Wiring code may be used together with Cosa. This help migrate Arduino/Wiring libraries for more object-oriented variants. More on this in coming posts.

The benchmark begins with a declaration of the pins that are going to be used.  These are instances of the Pin classes and defined as:

#include "Cosa/Pins.hh"

InputPin inPin(Board::D7);
OutputPin outPin(Board::D8);
OutputPin dataPin(Board::D9);
OutputPin clockPin(Board::D10);
AnalogPin analogPin(Board::A0);

The setup() starts by initiating the UART (Cosa's version of Arduino Serial) and binding it to the trace output stream with a short banner. Note that whenever possible string constants are defined in program memory using the GCC-AVR PSTR() macro. The banner is given as a program memory string.

#include "Cosa/Memory.h"
#include "Cosa/Trace.hh"
#include "Cosa/IOStream/Driver/UART.hh"

void setup()
  // Start the trace output stream on the serial port
  trace.begin(&uart, PSTR("CosaBenchmarkPins: started"));

  // Check amount of free memory and size of instances


The setup() also prints the amount of free memory and the sizes of some of the used data structures.

There is the first support for tracing: the macro TRACE() will expand to a print statement on the trace stream with a program memory string with the given expression and a print of the value. C++ overloading will select the correct IOStream print member function:
     CosaBenchmarkPins: started 
     free_memory() = 1673 
     sizeof(Event::Handler) = 2 
     sizeof(InputPin) = 4 
     sizeof(OutputPin) = 4 
     sizeof(AnalogPin) = 9  

All the details of using strings in program memory (PSTR) and trace print is hidden in the TRACE() macro.

The next step in setup() is to measure the time to execute an empty 1-M loop block. To force the compiler to generate this we insert a "nop" instruction into the loop as an assembly instruction.

For the measurement we use the Cosa Realtime Clock (Cosa/RTC.hh) static class. The class implements the Arduino micros() and millis() functions but uses the class as a name space (RTC::).

#include "Cosa/RTC.hh"
void setup()
  uint32_t start, stop;
  uint32_t us, base;

  // Start the timers

  // Measure the time to perform 1,000,000; empty loop block
  start = RTC::micros();
  for (uint16_t i = 0; i < 1000; i++)
    for (uint16_t j = 0; j < 1000; j++) {

      // Here is where the test code will be put
      __asm__ __volatile__("nop");
  stop = RTC::micros();
  base = (stop - start) / 1000L;
  INFO("Loop: %ul us per 1,000 nop loops\n", base);


There is a new trace support macro in this section, INFO(). It has a parameter list like printf_P() but the printout is controlled by a trace log mask. There are eight levels of log message; EMERGency, ALERT, CRITical, ERRor, WARNING, NOTICE, INFOrmation, DEBUG. Each log message, when active, will print the line number, function name, and the given message on the trace stream. Again the message string is automatically put in program memory to save SRAM. The typical output from the above INFO() statement is:

     70:setup:info:Loop: 504 us per 1,000 nop loops  

The compiler/preprocessor will provide the line number and function name when the sketch is built. The printout also contains the type of message. In this case "info". 

Now for the benchmark: It begins by measuring the time to perform 1M Arduino/Wiring digitalRead() and compare with Cosa inPin.is_set(), inPin >> var, Input::read().

  start = RTC::micros();
  for (uint16_t i = 0; i < 1000; i++)
    for (uint16_t j = 0; j < 1000; j++) {
      __asm__ __volatile__("nop");
  stop = RTC::micros();
  base = (stop - start) / 1000L;
  INFO("Arduino: %ul us per 1000 digitalRead(7)", base);

The Cosa version of the benchmark uses inPin.is_set(), etc, and the INFO-message below with the measurement. 

  INFO("Cosa(%ulX): %ul us per 1000 inPin.is_set()", base/us, us);

Running the full benchmark on an Arduino Nano (ATmega328P/16MHz) gives the results below :

CosaBenchmarkPins: started
free_memory() = 1557
sizeof(Event::Handler) = 2
sizeof(InputPin) = 4
sizeof(OutputPin) = 4
sizeof(AnalogPin) = 9
74:setup:info:Loop: 504 us per 1,000 nop loops

85:setup:info:Arduino: 4151 us per 1000 digitalRead(7)
95:setup:info:Cosa(6X): 629 us per 1000 inPin.is_set()
106:setup:info:Cosa(6X): 629 us per 1000 inPin >> var
116:setup:info:Cosa(7X): 567 us per 1000 InputPin::read(7)

128:setup:info:Arduino: 8302 us per 1000 digitalWrite(8, 1); digitalWrite(8, 0)
139:setup:info:Cosa(3X): 2327 us per 1000 outPin.write(1); outPin.write(0)
150:setup:info:Cosa(3X): 2327 us per 1000 outPin.set; outPin.clear()
161:setup:info:Cosa(3X): 2327 us per 1000 outPin << 1; outPin << 0
172:setup:info:Cosa(1X): 6541 us per 1000 OutputPin::write(8, 1); OutputPin::write(8, 0)

183:setup:info:Arduino: 8271 us per 1000 digitalWrite(8, !digitalRead(8))
193:setup:info:Cosa(3X): 2076 us per 1000 outPin.write(!outPin.read())
203:setup:info:Cosa(3X): 2076 us per 1000 outPin.is_set/clear/set()
215:setup:info:Cosa(3X): 2076 us per 1000 outPin >> var; outPin << !var
225:setup:info:Cosa(3X): 2076 us per 1000 outPin.set/is_clear()
235:setup:info:Cosa(6X): 1195 us per 1000 outPin.toggle()
245:setup:info:Cosa(2X): 3648 us per 1000 OutputPin::write(8, !OutputPin::read(8))
258:setup:info:Cosa(2X): 3743 us per 1000 OutputPin::read/write(8,0/1)
268:setup:info:Cosa(10X): 755 us per 1000 OutputPin::toggle(8)

283:setup:info:Arduino: 15 us per bit data transfer() digitalWrite()
297:setup:info:Cosa(5X): 3 us per bit data transfer() pin.write()
311:setup:info:Cosa(5X): 3 us per bit data transfer() pin.write/toggle()
325:setup:info:Cosa(1X): 11 us per bit data transfer() OutputPin::write()
339:setup:info:Cosa(3X): 4 us per bit data transfer() OutputPin::write/toggle()
373:setup:info:Cosa(7X): 2 us per bit data transfer() pin.write/toggle() unrolled

383:setup:info:Arduino: 17 us per bit data transfer() shiftOut()
393:setup:info:Cosa(4X): 4 us per bit data transfer() dataPin.write()

401:setup:info:Arduino: 112 us per analogRead()
408:setup:info:Cosa(1X): 112 us per analogPin.sample()
417:setup:info:Cosa(1X): 112 us per analogPin >> var
424:setup:info:Cosa(1X): 112 us per AnalogPin::sample()

[Update 2014-05-16]
Below are the latest benchmark measurement. A lot has happened since the initial post and Cosa has be optimized further. Some pin operations are reduced to only 1-6 instructions. See below:

CosaBenchmarkPins: started
free_memory() = 1576
sizeof(Event::Handler) = 2
sizeof(InputPin) = 4
sizeof(OutputPin) = 4
sizeof(AnalogPin) = 12
F_CPU = 16000000
I_CPU = 16

119:void loop():info:Measure the time to perform an empty loop block
127:void loop():info:nop:315 ns

129:void loop():info:Measure the time to perform an input pin read
138:void loop():info:inPin.is_set():125 ns
150:void loop():info:inPin >> var:125 ns
160:void loop():info:InputPin::read(7):62 ns
170:void loop():info:read digitalRead(7):62 ns

172:void loop():info:Measure the time to perform an output pin write
182:void loop():info:outPin.write():910 ns
195:void loop():info:outPin._write():690 ns
206:void loop():info:outPin.set/clear():910 ns
219:void loop():info:outPin._set/_clear():691 ns
230:void loop():info:outPin << val:910 ns
241:void loop():info:OutputPin::write(8, val):314 ns
252:void loop():info:digitalWrite(8, val):314 ns
263:void loop():info:outPin.toggle():502 ns
276:void loop():info:outPin._toggle():596 ns
287:void loop():info:OutputPin::toggle(8):62 ns

289:void loop():info:Measure the time to perform input pin read/output pin write
298:void loop():info:outPin.write(!inPin.read()):1633 ns
308:void loop():info:inPin.is_set();outPin.clear/set():1633 ns
320:void loop():info:inPin >> var; outPin << !var:1633 ns
330:void loop():info:outPin.set(inPin.is_clear()):1633 ns
340:void loop():info:OutputPin::write(8, !InputPin::read(7)):565 ns
353:void loop():info:OutputPin::read(7)/write(8,0/1):849 ns

355:void loop():info:Measure the time to perform 8-bit serial data transfer
363:void loop():info:pin.write(data,clk):18 us
376:void loop():info:pin.write();clock.write(1/0):27 us
391:void loop():info:pin._write();clock._write(1/0):22 us
404:void loop():info:pin.write/toggle():20 us
419:void loop():info:pin._write/_toggle():20 us
432:void loop():info:OutputPin::write():12 us
445:void loop():info:OutputPin::write/toggle():8 us
477:void loop():info:pin.write/toggle() unrolled:15 us

479:void loop():info:Measure the time to read analog pin
485:void loop():info:analogPin.sample():112 us
494:void loop():info:analogPin.sample_request/await():112 us
503:void loop():info:analogPin >> var:112 us
510:void loop():info:AnalogPin::sample():112 us

512:void loop():info:Measure the time to read analog pin with varying prescale
521:void loop():info:prescale(128):bits(10):analogPin.sample():112 us
521:void loop():info:prescale(64):bits(9):analogPin.sample():56 us
521:void loop():info:prescale(32):bits(8):analogPin.sample():30 us
521:void loop():info:prescale(16):bits(7):analogPin.sample():17 us
521:void loop():info:prescale(8):bits(6):analogPin.sample():10 us
521:void loop():info:prescale(4):bits(5):analogPin.sample():6 us
521:void loop():info:prescale(2):bits(4):analogPin.sample():5 us

No comments:

Post a Comment