Freesteel Blog » Programmable Realtime Unit, the learning curve edition

Programmable Realtime Unit, the learning curve edition

Friday, November 13th, 2015 at 1:08 pm Written by:

TomC did the work of porting the LinuxCNC-based controller of the triangle machine tool from an ancient heavy desktop with a parallel port to a Beaglebone Black running Machinekit. The good news is it’s all working again, and I can access the UI through X11 over a network over a USB serial port, so there’s some latency, but who cares.

We heard that the BB used a special unit to generate the realtime pulses, rather than relying on a somewhat bogus “realtime” linux build. We began our investigation of the code and documentation.

The Programmable Realtime Units (there are two) attached to the ARM processor are small processors that share a few kilobytes of memory between themselves and with the main processor and run at 200MHz in a very predictable manner, with each instruction taking one or two cycles. This provides a potential resolution of 5 nanoseconds, an order of magnitude finer than the 62.5 nanoseconds of the 16MHz arduino I was using for my anemometer experiments.

(My intuition is that this tech is very similar to GPUs, which have thousands of special purpose processors with their own tads of memory, shared memory, unique characteristics, and protocol for communicating with the main CPU.)

The Beaglebone has shedloads of pins of all kinds and has the complexity of Manhattan Island compared to the Arduino’s more understandable farmyard size. In terms of learning how to use these things, less is most definitely more: you’ll get far more done in a month with an Arduino than with a Beaglebone if you are a Dummy.

The standard blink exercise is too indirect. The following implementation (from graycat) provides a much better stepping stone to the hardware of the PRU.

from mmap import mmap
import time, struct

# page numbers from the 4973 page AM335x Sitara reference manual

# codes given in p182 table 2-3
GPIO1_offset = 0x4804c000
GPIO1_size   = 0x0fff
BIT28        = 1<<28   # bit 28 of GPIO1, which is header pin P9_12

# values from p4877 section 25.4.1
GPIO_OUTPUTENABLE = 0x134   # GPIO_OE: a 0 bit makes the pin an output
GPIO_CLEARDATAOUT = 0x190
GPIO_SETDATAOUT   = 0x194

# memory map the IO address space to a Python object
f = open("/dev/mem", "r+b" )
mem = mmap(f.fileno(), GPIO1_size, offset=GPIO1_offset)

# set flag for GPIO1 bit 28 to output
reg = struct.unpack("<L", mem[GPIO_OUTPUTENABLE:GPIO_OUTPUTENABLE+4])[0]
mem[GPIO_OUTPUTENABLE:GPIO_OUTPUTENABLE+4] = struct.pack("<L", reg & ~BIT28)

# set and clear the pin for bit 28 every 0.2 seconds
while True:
  mem[GPIO_SETDATAOUT:GPIO_SETDATAOUT+4]     = struct.pack("<L", BIT28)
  time.sleep(0.2)
  mem[GPIO_CLEARDATAOUT:GPIO_CLEARDATAOUT+4] = struct.pack("<L", BIT28)
  time.sleep(0.2)

The critical table from page 4877 is where those magic numbers are obtained:


The GPIO_SETDATAOUT and GPIO_CLEARDATAOUT registers are how the chip solves the problem caused by the bitpacking of all the pin values into one 32bit word: writing a 1 to a bit in either register sets or clears the corresponding pin and leaves all the others untouched. Otherwise, to set the bit we’d have to write:

GPIO_DATAOUT      = 0x13C
reg = struct.unpack("<L", mem[GPIO_DATAOUT:GPIO_DATAOUT+4])[0]
mem[GPIO_DATAOUT:GPIO_DATAOUT+4] = struct.pack("<L", reg | BIT28)

and risk causing masking over-writes on all the other 31 bits we didn’t want to be altering if an independent process changed it during the gap between line 2 and line 3 above.
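The difference between the two approaches can be shown without any hardware. What follows is a toy model in Python, not the real register interface: the FakeGpio class and the BIT5 pin belonging to a second process are made up for illustration, and the "unlucky interleaving" is simply written out in order.

```python
BIT28 = 1 << 28
BIT5 = 1 << 5   # some other pin, owned by a hypothetical second process

class FakeGpio:
    """Toy model of one GPIO bank: only the behaviour discussed above."""
    def __init__(self):
        self.dataout = 0

    def write_dataout(self, value):       # GPIO_DATAOUT: overwrites all 32 bits
        self.dataout = value & 0xFFFFFFFF

    def write_setdataout(self, mask):     # GPIO_SETDATAOUT: sets only the 1 bits
        self.dataout |= mask & 0xFFFFFFFF

    def write_cleardataout(self, mask):   # GPIO_CLEARDATAOUT: clears only the 1 bits
        self.dataout &= ~mask & 0xFFFFFFFF

# Read-modify-write with an unlucky interleaving
gpio = FakeGpio()
reg = gpio.dataout                  # we read the register...
gpio.write_setdataout(BIT5)         # ...the other process turns on its pin...
gpio.write_dataout(reg | BIT28)     # ...and our stale value turns it off again
rmw_lost_update = not (gpio.dataout & BIT5)

# The same interleaving using the set register
gpio = FakeGpio()
gpio.write_setdataout(BIT5)
gpio.write_setdataout(BIT28)
set_kept_update = bool(gpio.dataout & BIT5)

print(rmw_lost_update, set_kept_update)   # prints: True True
```

The read-modify-write version silently undoes the other process's change, while the set-register version cannot, because it never writes the bits it doesn't own.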

Moving on to the Machinekit code

The critical file is stepgen.c, which in some sophisticated way controls the PRU and its code in pru_generic.p for the purpose of generating precisely timed pulses for stepper motor or servo motor drivers.

These amazing programs require further study under freedom 1 of the free software definition. Nevertheless, there are some interesting comments at the head of the file:

PRU GPIO Write Timing Details
The actual write instruction to a GPIO pin using SBBO takes two PRU cycles (10 nS). However, the GPIO logic can only update every 40 nS (8 PRU cycles). This means back-to-back writes to GPIO pins will eventually stall the PRU, or you can execute 6 PRU instructions for ‘free’ when burst writing to the GPIO.

Latency from the PRU write to the actual I/O pin changing stat (normalized to PRU direct output pins = zero latency) when the PRU is writing to GPIO1 and L4_PERPort1 is idle measures 95 nS or 105 nS (apparently depending on clock synchronization)

PRU GPIO Posted Writes
When L4_PERPort1 is idle, it is possible to burst-write multiple values to the GPIO pins without stalling the PRU, as the writes are posted. With an unrolled loop (SBBO to GPIO followed by a single SET/CLR to R30), the first 20 write cycles (both instructions) took 15 nS each, at which point the PRU began to stall and the write cycle settled in to the 40 nS maximum update frequency.

PRU GPIO Read Timing Details
Reading from a GPIO pin when L4_PERPort1 is idle require 165 nS as measured using direct PRU I/O updates bracking a LBBO instruction. Since there is no speculative execution on the PRU, it is not possible to execute any instructions during this time, the PRU just stalls.

That final paragraph amazingly suggests a worse response time than the 16MHz AVR using its SBIS instruction, which can read and respond to a digital input within a single 62.5 nS processor cycle, unless the PRU can redeem itself through an interrupt feature, which of course buggers up any special timing loops I might set up. Maybe that’s what we need the second PRU for.
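To make the comparison concrete, here are the quoted figures converted into PRU and AVR clock cycles. This is only arithmetic on the numbers given above; the dictionary keys are my own labels.

```python
PRU_NS = 5.0      # one PRU cycle at 200 MHz
AVR_NS = 62.5     # one AVR cycle at 16 MHz

# timings quoted in the pru_generic.p comments
quoted_ns = {
    "GPIO write instruction (SBBO)": 10,
    "minimum GPIO update interval": 40,
    "GPIO read stall (LBBO)": 165,
}

for name, ns in quoted_ns.items():
    print(f"{name}: {ns} nS = {ns / PRU_NS:.0f} PRU cycles = {ns / AVR_NS:.2f} AVR cycles")

# fastest rate at which the GPIO logic accepts updates, in MHz
max_update_mhz = 1e3 / quoted_ns["minimum GPIO update interval"]
print(max_update_mhz)
```

So a single GPIO read stalls the PRU for 33 of its own cycles, which is between two and three cycles of the much slower AVR.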

Anyways, not being a Machinekit master, I tried some direct control of the PRU from Python using the amazing PyPRUSS library.

First things first, assuming the PRU assembly code is in a file called prucode.p, the Python test harness code is as follows:

# compile the file into prucode.bin
import subprocess, os
p = subprocess.Popen("/usr/bin/pasm -b prucode.p", shell=True)
pid, sts = os.waitpid(p.pid, 0)

# do this on the command line at start up if the device needs to be enabled
#    echo BB-BONE-PRU-01 > /sys/devices/bone_capemgr.9/slots

# run the complete cycle
import pypruss
pypruss.modprobe()                      # This only has to be called once per boot
pypruss.init()                          # Init the PRU
pypruss.open(0)                         # Open PRU event 0 which is PRU0_ARM_INTERRUPT
pypruss.pruintc_init()                  # Init the interrupt controller
pypruss.exec_program(0,"./prucode.bin") # Load firmware "prucode.bin" on PRU 0
pypruss.wait_for_event(0)               # Wait for event 0 which is connected to PRU0_ARM_INTERRUPT
pypruss.clear_event(0)                  # Clear the event
pypruss.pru_disable(0)                  # Disable PRU 0, this is already done by the firmware

The mainloop of the PRU code looks like this:

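  MOV r1, 0xF00000
  MOV r2, 1<<28
BLINK:
    MOV r3, GPIO1 | GPIO_SETDATAOUT   // address of the set register
    MOV r0, 8                         // loop 8 times
    SBBO r2, r3, 0, 4                 // go HIGH!!!!
DELAY1:
      SUB r0, r0, 1
    QBNE DELAY1, r0, 0
    //ADD r0, r0, 1  // commented slowdownop[1]

    MOV r3, GPIO1 | GPIO_CLEARDATAOUT // address of the clear register
    SBBO r2, r3, 0, 4                  // go LOW!!!!

    MOV r0, 4                          // loop 4 times
DELAY2:
      SUB r0, r0, 1
    QBNE DELAY2, r0, 0

    SUB r1, r1, 1
  QBNE BLINK, r1, 0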

The output on the scope is as follows:


So, that’s 100 nS for the HIGH and 80 nS for the LOW.

A high loop delay of 7 instead of 8 results in 90 nS HIGH and 80 nS LOW because each pass of the DELAY1 loop is two instructions long, or 10 nS. A DELAY1 of 9 gives 110 nS HIGH, and so on, so it’s all good, and you can extrapolate down to a theoretical delay of zero, leaving 20 nS for the subsequent MOV r3 and SBBO r2 after the loop before it goes LOW.

On the LOW side there are 40 nS that need to be accounted for outside the DELAY2 loop. In order of execution they are: MOV r0,4; SUB r1; QBNE BLINK; MOV r3; MOV r0,8; SBBO r2; which is 6 instructions that ought to add up to 30 nS, so two of them must be taking 2 cycles each to make up the 10 nS difference.

Luckily there’s a really useful set of training slides from Texas Instruments in 2009 where they specifically explain what’s going on as if to a human. Fancy that! Why the heck don’t they insert these prepared summaries for the purpose of teaching humans as an appendix to the official manuals?

It explains:

Nearly all instructions (with the exception of memory access) are single cycle execution.

That accounts for the Store Byte Burst (SBBO) instructions taking two cycles each. The remainder of the time is due to some of the MOV instructions requiring two cycles, and others completing in one.

Turns out that MOV r3, X is a pseudo instruction composed of:

LDI r3.w0, (X&0xFFFF)
LDI r3.w2, (X>>16)

This is obviously necessary because as each instruction is 32 bits long, you can’t fill it all with data, and the most you can load at a time is 16 bits into one or other of the words.

However, if you do LDI r3, X instead of LDI r3.w0, X it packs the top 16 bits with zeros, which is handy if X happens to be less than 65536, as the compiler PASM recognizes in the case of MOV r0,4 and MOV r0,8.
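The expansion rule can be written down as a small Python function. This is a sketch of the behaviour as described in the text, not PASM's actual source, and the function name expand_mov is made up for illustration.

```python
def expand_mov(reg, value):
    """Return the LDI instructions that MOV reg, value expands to,
    following the description above (an assumption-level model of PASM)."""
    value &= 0xFFFFFFFF
    if value < 0x10000:
        # fits in 16 bits: one LDI on the whole register zero-fills the top half
        return [f"LDI {reg}, 0x{value:04X}"]
    return [
        f"LDI {reg}.w0, 0x{value & 0xFFFF:04X}",   # low word
        f"LDI {reg}.w2, 0x{value >> 16:04X}",      # high word
    ]

for reg, value in [("r0", 8), ("r2", 1 << 28), ("r1", 0xF00000)]:
    ops = expand_mov(reg, value)
    print(f"MOV {reg}, 0x{value:X} -> {len(ops)} cycle(s): {ops}")
```

The length of the returned list is the cycle count: one for the small loop counters, two for the 32-bit mask and addresses.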

So, it’s all easy and adds up like that…
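Here is the tally written out as a Python sketch. The cycle counts are the ones deduced above (two cycles for SBBO and for a 32-bit MOV, one for everything else); the function names are mine, and the model assumes the pin changes a fixed time after each SBBO.

```python
CYCLE_NS = 5   # one PRU cycle at 200 MHz

# cycles per instruction, as worked out above
MOV16 = 1   # MOV of a value below 65536 (a single LDI)
MOV32 = 2   # MOV of a full 32-bit value (two LDIs)
SBBO = 2    # memory store
SUB = 1
QBNE = 1

def high_ns(delay1):
    # from the pin going HIGH to the pin going LOW:
    # the DELAY1 loop, then MOV r3 and the SBBO that clears the pin
    cycles = delay1 * (SUB + QBNE) + MOV32 + SBBO
    return cycles * CYCLE_NS

def low_ns(delay2):
    # from the pin going LOW to the pin going HIGH again:
    # MOV r0,4; DELAY2 loop; SUB r1; QBNE BLINK; MOV r3; MOV r0,8; SBBO r2
    cycles = MOV16 + delay2 * (SUB + QBNE) + SUB + QBNE + MOV32 + MOV16 + SBBO
    return cycles * CYCLE_NS

print(high_ns(8), high_ns(7), high_ns(9), low_ns(4))   # prints: 100 90 110 80
```

These match the scope readings for delay counts of 7, 8 and 9 on the HIGH side and 4 on the LOW side.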

Not so fast!

What happens when I uncomment the single cycle instruction at slowdownop[1]?


So about 50% of the loops are registering 100nS delay and the rest are giving 110nS delay instead of the 105nS delay I was hoping for.

When I zoom out the jitter is not cumulative.
It’s as if there’s an independent process that is carrying the GPIO_SETDATAOUT and GPIO_CLEARDATAOUT values to the physical GPIO seen by the oscilloscope that really only works on a 10nS cycle.

This isn’t so bad as it generally requires two-instruction countdown loops to control the delays as in the example above, although you can get to single-cycle resolution with a once-off branch across an optional single-cycle instruction that runs in series with the delay loop.

There’s probably no way to discover the phase of this GPIO update process against the PRU cycle, which is a pity.
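One way to check that this explanation hangs together is to simulate it. The sketch below assumes (this is a guess at the mechanism, not something taken from the TI documentation) that an edge requested by the PRU only appears on the pin at the next tick of a 10 nS grid. The function name and the phase parameter are made up for illustration.

```python
GRID_NS = 10   # assumed period of the GPIO update process

def observed_widths(high_ns, low_ns, pulses, phase_ns=0):
    """HIGH-pulse widths seen at the pin if every edge is delayed
    to the next tick of a GRID_NS grid."""
    def snap(t):
        # round up to the next grid tick using integer arithmetic
        return ((t + phase_ns + GRID_NS - 1) // GRID_NS) * GRID_NS

    period = high_ns + low_ns
    widths = []
    for k in range(pulses):
        rise = snap(k * period)             # requested rising edge
        fall = snap(k * period + high_ns)   # requested falling edge
        widths.append(fall - rise)
    return widths

print(observed_widths(105, 80, 6))   # prints: [110, 100, 110, 100, 110, 100]
print(observed_widths(100, 80, 6))   # prints: [100, 100, 100, 100, 100, 100]
```

With a requested HIGH time of 105 nS the model gives an even mix of 100 nS and 110 nS pulses whose error never accumulates, which is what the scope shows.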

Many thanks to hipstercircuits for parts of all these examples. In fact his example of an accelerating profile implemented by a table of precalculated values accessed by the PRU leads me to imagine a system where we feed a circular buffer of delay times, wire the signals into our 42Volt servo motors via an H-bridge and get them to play music or speak words.
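As a very rough idea of what such a table of delay times might contain for the music case, here is a Python sketch that converts note frequencies into counts for a two-instruction PRU delay loop. The note list and function name are hypothetical, and the fixed overhead outside the delay loop (tens of nanoseconds, as measured above) is ignored because it is negligible at audio frequencies.

```python
LOOP_NS = 10   # one pass of a two-instruction PRU delay loop (SUB + QBNE)

def half_period_loops(freq_hz):
    """Delay-loop count for half a cycle of a square wave at freq_hz."""
    half_period_ns = 1e9 / (2 * freq_hz)
    return round(half_period_ns / LOOP_NS)

# hypothetical note list: A4, C#5, E5
notes_hz = [440.0, 554.37, 659.25]
delay_table = [half_period_loops(f) for f in notes_hz]
print(delay_table)
```

Each entry fits comfortably in a 32-bit PRU register, so a table like this could be handed to the PRU through its shared memory.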

But while I have this test harness going, it’s worth corroborating the awful read functionality mentioned by the authors of pru_generic.p above and insert the following lines after the DELAY1 loop

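#define GPIO_DATAIN         0x138
MOV r3, GPIO1 | GPIO_DATAIN   // address of the input register
LBBO r2, r3, 0, 4             // read the pins (result discarded)
MOV r2, 1<<28                 // restore the bit mask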

The result is:
That's 300 nS (note the change in horizontal scale), or 170 nS in excess of what it ought to have been, which matches the observation. (I have no idea what he means by L4_PERPort1 being idle.)
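The arithmetic behind that figure, as a Python sketch. It assumes the inserted code costs three two-cycle instructions (loading the register address, the LBBO itself if it behaved like SBBO, and the MOV that restores r2); that breakdown is my reading of the "130 nS expected" figure rather than something stated outright.

```python
CYCLE_NS = 5

baseline_high_ns = 100   # HIGH time measured before the change

# expected cost in cycles of the inserted instructions (assumed breakdown)
added_cycles = {"address load (32-bit MOV)": 2, "LBBO": 2, "MOV r2 (32-bit)": 2}

expected_ns = baseline_high_ns + sum(added_cycles.values()) * CYCLE_NS
measured_ns = 300
excess_ns = measured_ns - expected_ns

print(expected_ns, excess_ns, excess_ns // CYCLE_NS)   # prints: 130 170 34
```

The 170 nS excess corresponds to 34 PRU cycles during which nothing else can run.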

This is a problem because factoring this kind of delay into the code is not going to happen. It feels like there's a bodge going on as the PRU has its clock brutally put on hold when it accesses certain segments of memory while the system calls out to a non-integrated unit to get the data before releasing it -- when the PRU could at the very least have been allowed to run asynchronously for 30 instruction cycles until the data was ready.

Indeed, it seems like there should be no reason for directly accessing the inputs at this level, because just like the brain, there are numerous special units for preprocessing the signals within the Enhanced Capture Module.

In particular, there's the Enhanced Quadrature Encoder Pulse module for handling all the signals returning from the servo motor. Here's one of the diagrams from the manual:


It's almost as if they've built a whole servo motor drive apart from the H-bridge into this one chip, where this unit takes the feedback and the PRU generates the complicated PWM drive cycle.
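For anyone unfamiliar with quadrature feedback, the counting job that such a module does in hardware can be sketched in software. This is the generic textbook decoding of two encoder channels A and B, not a model of the TI module's registers; the direction convention (which rotation counts as positive) is an arbitrary choice here.

```python
# Transition table for a quadrature encoder: index = (previous AB << 2) | current AB
# +1 = one step forward, -1 = one step back, 0 = no change or an invalid jump
_STEP = [0, +1, -1, 0,
         -1, 0, 0, +1,
         +1, 0, 0, -1,
         0, -1, +1, 0]

def count_quadrature(samples):
    """Accumulate position from a sequence of (A, B) logic levels."""
    position = 0
    prev = (samples[0][0] << 1) | samples[0][1]
    for a, b in samples[1:]:
        cur = (a << 1) | b
        position += _STEP[(prev << 2) | cur]
        prev = cur
    return position

forward = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
print(count_quadrature(forward), count_quadrature(forward[::-1]))   # prints: 4 -4
```

Doing this by polling the pins from the PRU would run straight into the read stall measured above, which is the argument for leaving it to the dedicated hardware.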

Even better, it seems like there are three of these independent units on board, so theoretically all three servo motor drives we currently buy in at the retail of $60 could be implemented with this one chip plus three H-bridges and a little bit of smart PRU code.

The advantage of getting all the servo motor drives into one unit isn't so much about the cost savings as the fact that they can potentially respond to one another.

So instead of each motor driver struggling independently to attend to position, when one falls out of tolerance due to the speed or forces encountered it can communicate to the other two axes to slow down and give it time to catch up so that the head of the machine remains exactly on course. Under the current configuration with completely independent motor drivers, this is not an option, so everything needs to be run at an absolute slower speed to avoid overloading things and maintain tolerance.
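A minimal sketch of that idea in Python, assuming each axis reports its following error (commanded position minus actual position) in encoder counts. The function name, the units and the simple proportional slowdown rule are all hypothetical; a real controller would need something more careful.

```python
def feed_scale(following_errors, tolerance):
    """Common speed factor (between 0 and 1) applied to all axes,
    given each axis's current position error in the same units as tolerance."""
    worst = max(abs(e) for e in following_errors)
    if worst <= tolerance:
        return 1.0               # every axis is keeping up: full speed
    return tolerance / worst     # slow all axes in proportion to the worst one

print(feed_scale([1, 0, -2], tolerance=2))   # prints: 1.0
print(feed_scale([8, 0, -1], tolerance=2))   # prints: 0.25
```

Because all three axes are scaled by the same factor, the tool stays on the programmed path and simply moves along it more slowly.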

On the other hand, independent motor drivers are simple to interface and can be sold as a commodity.

Rolling it all into one unit would probably result in far higher complexity (eg a 5000 page manual) and many fewer sales. As is clear, the hardware complexity is now supplied for less than a $35 beaglebone, so the only thing lacking is the software. This can only be done open source, for two reasons: the intense customization required, which source code access enables, and the lack of investment available for the years of risky unprofitability it would take to develop, given that established solutions already exist and all the work would be lost and wasted if a new venture failed to produce a wholly marketable product. Free software is the only realistic way that the potential of this complex tech gets unlocked and turned into productivity by incremental aggregation. It would help if this vital work wasn't consigned to the margins of the economy while all the money and therefore employment kept flowing to the likes of Autodesk to be squandered on financial games in which the organization of the effective service of human needs by technology is not the measure of success.

As Henry Ford wrote 100 years ago:

I do not know whether bad business is the result of bad financial methods or whether the wrong motive in business created bad financial methods, but I do know that, while it would be wholly undesirable to try to overturn the present financial system, it is wholly desirable to reshape business on the basis of service. Then a better financial system will have to come. The present system will drop out because it will have no reason for being. The process will have to be a gradual one.


  • 1. Frédéric SIMIAN replies at 23rd November 2015, 7:58 am :

    I’m not posting this message in the right place, but as your contact address doesn’t work anymore, I would just like to know where I can get the “Python package and scripts for linux” of your z-level slicer.

    Thank you in advance

    Frédéric SIMIAN

  • 2. Julian replies at 23rd November 2015, 6:31 pm :

    You should be able to find it at

  • 3. Damien DD replies at 14th February 2019, 5:46 pm :

    I’m starting to dig into the PRU assembly to develop some features I need in Machinekit and I found this page very useful. Thanks for sharing!

    (I have no idea what he means by L4_PERPort1 being idle.)

    I was also wondering about that and in case it can help anyone, I think I found the answer (or at least part of it) here.

    As we can see in the link, L4_PER is the interconnection layer that is used to communicate between PRU and GPIO1-3 and L4_WKUP is the same thing for GPIO0.
    My guess is that L4_PERPort1 means L4_PER[GPIO1].
    In the same link, we can see that “best-case” latency read is 34 PRU cycles (=170ns) for L4_PER[GPIO1-3] and 38 PRU cycles (=190ns) for L4_WKUP[GPIO0]. This explains the 170ns excess you observed in your setup.
