Running Julia baremetal on an Arduino

Preamble
<blink> an LED in C
A first piece of julia pseudocode
1. Datasheets & Memory Mapping
Compiling our code
1. Configuring LLVM
2. Defining an architecture
Looking at the binary
1. Atomicity
2. Inline LLVM-IR
<blink> an LED in Julia
Limitations
Links & references

Preamble

I don't really have much experience with microcontrollers. I've played around with some arduinos before and the main entry point for my home network is a Raspberry Pi, but that's about it for recent experience. I did take a single course on microcontrollers a few years back though. I am fascinated by them - they're low powered devices that we can program to make almost anything happen, as long as we're a little careful with resource management and don't shoot ourselves in the foot.

One thing that is always implicitly assumed when talking about julia is the requirement for a runtime and garbage collector. Most of the time, optimizing julia (or any code, really) comes down to two things:

minimize the time spent running code you didn't write
have as much code you want to run compiled to the native instructions of where you want to run it

Requirement 1) results more or less in "don't talk to runtime & GC if you don't have to" and 2) boils down to "make sure you don't run unnecessary code, like an interpreter" - i.e. statically compile your code and avoid dynamicness wherever you can.^[1]

I'm already used to 1) due to regular optimization when helping people on Slack and Discourse, and with better static compilation support inching ever closer over the past few years and me procrastinating writing my bachelor's thesis last week, I thought to myself:

Julia is based on LLVM and is basically already a compiled language.
You've got some old arduinos lying around.
You know those take in some AVR blob to run as their code.
LLVM has an AVR backend.

and the very next thought I had was "that can't be too difficult to get to work, right?".

This is the (unexpectedly short) story of how I got julia code to run on an arduino.

[1]

Funnily enough, once you're looking for it, you can find these concepts everywhere. For example, you want to minimize the number of times you talk to the linux kernel on an OS, since context switches are expensive. You also want to call into fast native code as often as possible, as is done in python by calling into C when performance is required.

<blink> an LED in C

So, what are we dealing with? Well, even arduino don't sell these anymore:

This is an Arduino Ethernet R3, a variation on the common Arduino UNO. It's the third revision, boasting an ATmega328p, an ethernet port, a slot for an SD card as well as 14 I/O pins, most of which are reserved. It has 32KiB of flash memory, 2KiB SRAM and 1KiB EEPROM. Its clock runs at a measly 16 MHz, there's a serial interface for an external programmer and it weighs 28g.

With this documentation, the schematic for the board, the datasheet for the microcontroller and a good amount of "you've done harder things before" I set out to achieve the simplest goal imaginable: Let the LED labeled L9 (see the lower left corner of the board in the image above, right above the on LED above the power connector) blink.

For comparison's sake and to have a working implementation to check our arduino with, here's a C implementation of what we're trying to do:

#include <avr/io.h>
#include <util/delay.h>

#define MS_DELAY 3000

int main (void) {
    DDRB |= _BV(DDB1);

    while(1) {
        PORTB |= _BV(PORTB1);

        _delay_ms(MS_DELAY);

        PORTB &= ~_BV(PORTB1);

        _delay_ms(MS_DELAY);
    }
}

This short piece of code does a few things. It first configures our LED-pin as an output, which we can do by setting bit DDB1^[2] in DDRB (which is a contraction of "Data Direction Register Port B" - it controls whether a given I/O pin is interpreted as input or output). After that, it enters an infinite loop, where we first set bit PORTB1 on PORTB to HIGH (or 1) to instruct our controller to power the LED. We then wait for MS_DELAY milliseconds, or 3 seconds. Then, we unpower the LED by setting the same PORTB1 bit to LOW (or 0) and wait again. Compiling & flashing this code like so^[3]:

avr-gcc -Os -DF_CPU=16000000UL -mmcu=atmega328p -c -o blink_led.o blink_led.c
avr-gcc -mmcu=atmega328p -o blink_led.elf blink_led.o
avr-objcopy -O ihex blink_led.elf blink_led.hex
avrdude -V -c arduino -p ATMEGA328P -P /dev/ttyACM0 -U flash:w:blink_led.hex

results in a nice, blinking LED.

These few shell commands compile our .c source code to an .o object file targeting our microcontroller, link it into an .elf, translate that to the Intel .hex format the controller expects and finally flash it to the controller with the appropriate settings for avrdude. Pretty basic stuff. It shouldn't be hard to translate this, so where's the catch?

Well, most of the code above is not even C, but C preprocessor directives tailored to do exactly what we mean to do. We can't make use of them in julia and we can't import those .h files, so we'll have to figure out what they mean. I haven't checked, but I think not even _delay_ms is a function.
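To make the macro business a bit more concrete: in avr-libc, _BV(bit) expands to (1 << (bit)), so the C statements above are ordinary mask operations. Here's a minimal sketch of the same thing in julia, run against a plain UInt8 value instead of a real register. The names bv, set_bits, clear_bits and PORTB1_BIT are made up for this illustration.

```julia
# Hypothetical julia stand-in for avr-libc's `_BV(bit)`, i.e. `(1 << (bit))`.
bv(bit::Integer) = 0x01 << bit

# Bit position of the LED pin within port B, as used in the C code above.
const PORTB1_BIT = 1

# Plain UInt8 values stand in for the registers here; no hardware is touched.
set_bits(reg::UInt8, mask::UInt8)   = reg | mask    # what `REG |= mask` does
clear_bits(reg::UInt8, mask::UInt8) = reg & ~mask   # what `REG &= ~mask` does

portb = 0b1000_0000                        # pretend some other pin is already high
portb = set_bits(portb, bv(PORTB1_BIT))    # LED on
portb = clear_bits(portb, bv(PORTB1_BIT))  # LED off again, other bits untouched
```

The read-modify-write shape matters: only the one bit changes, everything else in the register stays as it was.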

On top of this, we don't have a convenient existing avr-gcc to compile julia to AVR for us. However, if we manage to produce a .o file, we should be able to make the rest of the existing toolchain work for us - after all, avr-gcc can't tell the difference between a julia-created .o and an avr-gcc-created .o.

[2] Finding the right pin & port took a while. The documentation states that the LED is connected to "digital pin 9", which is supported by the label L9 next to the LED itself. It then goes on to say that on most of the arduino boards, this LED is placed on pin 13, which is used for SPI on mine instead. This is confusing, because the datasheet for our board connects this LED to pin 13 (PB1, port B bit 1) on the controller, which has a split trace leading to pin 9 of the J5 pinout. I mistakenly thought "pin 9" referred to the microcontroller, and tried to control the LED through PD5 (port D, bit 5) for quite some time, before I noticed my mistake. The upside was that I now had a known-good piece of code that I could compare to - even on the assembly level.

[3] -DF_CPU=16000000UL is required for _delay_ms to figure out how to translate from milliseconds to "number of cycles required to wait" in our loops. While it's nice to have, it's not really required - we only have to wait some visibly distinct amount to notice the blinking, and as such, I've skipped implementing this in the julia version.

A first piece of julia pseudocode

So with all that in mind, let's sketch out what we think our code should look like:

const DDRB = ??
const PORTB = ??

function main()
    set_high(DDRB, DDB1) # ??

    while true
        set_high(PORTB, PORTB1) # ??

        for _ in 1:500000
            # busy loop
        end

        set_low(PORTB, PORTB1) # ??

        for _ in 1:500000
            # busy loop
        end
    end
end

From a high level, it's almost exactly the same. Set bits, busy loop, unset bits, loop. I've marked all places where we have to do something, though we don't know exactly what yet, with ??. All of these places are a bit interconnected, so let's dive in with the first big question: how can we replicate what the C-macros DDRB, DDB1, PORTB and PORTB1 end up doing?

Datasheets & Memory Mapping

To answer this we first have to take a step back, forget that these are defined as macros in C and think back to what these represent. Both DDRB and PORTB reference specific I/O registers in our microcontroller. DDB1 and PORTB1 refer to the (zero-based) 1st bit of the respective register. In theory, we only have to set these bits in the registers above to make the controller blink our little LED. How do you set a bit in a specific register though? This has to be exposed to a high level language like C somehow. In assembly code we'd just access the register natively, but save for inline assembly, we can't do that in either C or julia.

When we take a look in our microcontroller datasheet, we can notice that there's a chapter 36. Register Summary from page 621 onwards. This section is a register reference table. It has an entry for each register, specifying an address, a name, the name of each bit, as well as the page in the datasheet where further documentation, such as initial values, can be found. Scrolling to the end, we find what we've been looking for:

Address	Name	Bit 7	Bit 6	Bit 5	Bit 4	Bit 3	Bit 2	Bit 1	Bit 0	Page
0x05 (0x25)	PORTB	PORTB7	PORTB6	PORTB5	PORTB4	PORTB3	PORTB2	PORTB1	PORTB0	100
0x04 (0x24)	DDRB	DDB7	DDB6	DDB5	DDB4	DDB3	DDB2	DDB1	DDB0	100

So PORTB is mapped to addresses 0x05 and 0x25, while DDRB is mapped to addresses 0x04 and 0x24. Which memory are those addresses referring to? We have EEPROM, flash memory as well as SRAM after all. Once again, the datasheet comes to our help: Chapter 8 AVR Memories has a short section on our SRAM memory, with a very interesting figure:

as well as this explanation:

The first 32 locations [of SRAM] address the Register File, the next 64 locations the standard I/O memory, then 160 locations of Extended I/O memory, and the next 512/1024/1024/2048 locations address the internal data SRAM.

So the addresses we got from the register summary actually correspond 1:1 to SRAM addresses^[4]. Neat!
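The two addresses per register in the summary table differ by exactly 0x20, which fits the quote above: the 32 locations of the register file come first, so an I/O address shifted by 32 gives the data-space address in parentheses. A small sketch of that arithmetic (io_to_data and IO_OFFSET are names I made up for this):

```julia
# The register file occupies the first 32 data-space locations, so I/O
# addresses (as used by `in`/`out`) are shifted by 0x20 when accessed
# through ordinary loads and stores.
const IO_OFFSET = 0x20

# Hypothetical helper: convert an I/O address from the register summary
# into the data-space address given in parentheses there.
io_to_data(io_addr::Integer) = Int(io_addr) + IO_OFFSET

ddrb_addr  = io_to_data(0x04)   # 36 == 0x24
portb_addr = io_to_data(0x05)   # 37 == 0x25
```

These decimal values are the ones a pointer-based version has to use.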

Translating what we've learned into code, our prototype now looks like this:

const DDRB  = Ptr{UInt8}(36) # 0x24, but julia only provides conversion methods for `Int`
const PORTB = Ptr{UInt8}(37) # 0x25

# The bits we're interested in are the same bit as in the datasheet
#                76543210
const DDB1   = 0b00000010
const PORTB1 = 0b00000010

function main_pointers()
    unsafe_store!(DDRB, DDB1)

    while true
        pb = unsafe_load(PORTB)
        unsafe_store!(PORTB, pb | PORTB1) # enable LED

        for _ in 1:500000
            # busy loop
        end

        pb = unsafe_load(PORTB)
        unsafe_store!(PORTB, pb & ~PORTB1) # disable LED

        for _ in 1:500000
            # busy loop
        end
    end
end
builddump(main_pointers, Tuple{})

We can write to our registers by storing some data at its address, as well as read from our register by reading from the same address.
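On a desktop there is nothing mapped at addresses 36 and 37, so the functions above can't be tried out directly. As a rough sketch of the same load/modify/store pattern, a one-byte array can stand in for the register. That substitution is an assumption for illustration only, and toggle_bit_demo is a made-up name.

```julia
# A one-byte buffer stands in for a memory-mapped register; on the
# microcontroller the pointer would be a fixed address instead.
function toggle_bit_demo()
    fake_register = UInt8[0x00]
    GC.@preserve fake_register begin
        reg = pointer(fake_register)                   # a Ptr{UInt8}, like DDRB/PORTB
        unsafe_store!(reg, unsafe_load(reg) | 0x02)    # set bit 1
        after_set = unsafe_load(reg)
        unsafe_store!(reg, unsafe_load(reg) & ~0x02)   # clear bit 1
        after_clear = unsafe_load(reg)
        (after_set, after_clear)
    end
end

toggle_bit_demo()   # (0x02, 0x00)
```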

In one fell swoop, we got rid of all of our ?? at once! This code now seemingly has everything the C version has, so let's start on the biggest unknown: how do we compile this?

[4]	This is in contrast to more high level systems like an OS kernel, which utilizes virtual memory and paging of sections of memory to give the illusion of being on the "baremetal" machine and handling raw pointers.

Compiling our code

Julia has for quite some time now run on more than just x86(_64) - it also has support for Linux as well as macOS on ARM. This is, in large part, possible due to LLVM supporting ARM. However, there is one other large space where julia code can run directly: GPUs. For a while now, the package GPUCompiler.jl has done a lot of work to compile julia down to NVPTX and AMDGPU, the NVIDIA and AMD specific architectures supported by LLVM. Because GPUCompiler.jl interfaces with LLVM directly, we can hook into this same mechanism to have it produce AVR instead - the interface is extensible!

Configuring LLVM

The default julia install does not come with the AVR backend of LLVM enabled, so we have to build both LLVM and julia ourselves. Be sure to do this on one of the 1.8 betas, like v1.8.0-beta3. More recent commits currently break GPUCompiler.jl with this, which should be fixed in the future as well.

Julia luckily already supports building its dependencies, so we just have to make a few changes to two Makefiles, enabling the backend

diff --git a/deps/llvm.mk b/deps/llvm.mk
index 5afef0b83b..8d5bbd5e08 100644
--- a/deps/llvm.mk
+++ b/deps/llvm.mk
@@ -60,7 +60,7 @@ endif
 LLVM_LIB_FILE := libLLVMCodeGen.a
 
 # Figure out which targets to build
-LLVM_TARGETS := host;NVPTX;AMDGPU;WebAssembly;BPF
+LLVM_TARGETS := host;NVPTX;AMDGPU;WebAssembly;BPF;AVR
 LLVM_EXPERIMENTAL_TARGETS :=
 
 LLVM_CFLAGS :=

and instruct julia not to use the prebuilt LLVM by setting a flag in Make.user:

USE_BINARYBUILDER_LLVM=0

Now, after running make to start the build process, LLVM is downloaded, patched & built from source and made available to our julia code. The whole LLVM compilation took about 40 minutes on my laptop. I honestly expected worse!

Defining an architecture

With our custom LLVM built, we can define everything that's necessary for GPUCompiler.jl to figure out what we want.

We start by importing our dependencies, defining our target architecture and its target triplet:

using GPUCompiler
using LLVM

#####
# Compiler Target
#####

struct Arduino <: GPUCompiler.AbstractCompilerTarget end

GPUCompiler.llvm_triple(::Arduino) = "avr-unknown-unknown"
GPUCompiler.runtime_slug(::GPUCompiler.CompilerJob{Arduino}) = "native_avr-jl_blink"

struct ArduinoParams <: GPUCompiler.AbstractCompilerParams end

We're targeting a machine that's running avr, with no known vendor and no OS - we're baremetal after all. We also provide a runtime slug to identify our binary by, and define a dummy struct to hold additional parameters for our target architecture. We don't require any, so we can just leave it empty and otherwise ignore it.
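For readers who haven't met target triples before, the string reads as architecture, vendor and operating system, separated by dashes. A tiny sketch that takes one apart (parse_triple is a hypothetical helper, not part of GPUCompiler.jl or LLVM.jl):

```julia
# Split an LLVM-style target triple into its three components.
function parse_triple(triple::AbstractString)
    arch, vendor, os = split(triple, '-'; limit=3)
    return (arch=String(arch), vendor=String(vendor), os=String(os))
end

parse_triple("avr-unknown-unknown")
```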

Since the julia runtime can't run on GPUs, GPUCompiler.jl also expects us to provide a replacement module for various operations we might want to do, like allocating memory on our target architecture or throwing exceptions. We're of course not going to do any of that, which is why we can just define an empty placeholder for these as well:

module StaticRuntime
    # the runtime library
    signal_exception() = return
    malloc(sz) = C_NULL
    report_oom(sz) = return
    report_exception(ex) = return
    report_exception_name(ex) = return
    report_exception_frame(idx, func, file, line) = return
end

GPUCompiler.runtime_module(::GPUCompiler.CompilerJob{<:Any,ArduinoParams}) = StaticRuntime
GPUCompiler.runtime_module(::GPUCompiler.CompilerJob{Arduino}) = StaticRuntime
GPUCompiler.runtime_module(::GPUCompiler.CompilerJob{Arduino,ArduinoParams}) = StaticRuntime

In the future, these calls may be used to provide a simple bump allocator or report exceptions via the serial bus for other code targeting the arduino. For now though, this "do nothing" runtime is sufficient.^[5]
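To give an idea of what such a bump allocator could look like, here's a minimal sketch that hands out addresses from a fixed range and never frees anything. It works on plain integers so it runs on a desktop, and it is not wired into GPUCompiler.jl in any way. The type, the function name and the chosen address range are all assumptions; a real layout would have to stay clear of globals and of the stack.

```julia
# Minimal bump allocator over a fixed address range.
mutable struct BumpAllocator
    next::UInt16   # next free address
    stop::UInt16   # one past the last usable address
end

# Returns the address of a fresh block of `sz` bytes, or 0 (the analogue of
# `C_NULL`) when the region is exhausted. Nothing is ever freed.
function bump_alloc!(a::BumpAllocator, sz::Integer)
    sz > a.stop - a.next && return 0x0000
    addr = a.next
    a.next += UInt16(sz)
    return addr
end

# Hypothetical region inside the ATmega328p's internal SRAM (0x0100-0x08FF).
heap = BumpAllocator(0x0100, 0x0500)

first_block  = bump_alloc!(heap, 16)     # 0x0100
second_block = bump_alloc!(heap, 16)     # 0x0110
too_big      = bump_alloc!(heap, 4096)   # 0x0000, region exhausted
```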

Now for the compilation. We first define a job for our pipeline:

function native_job(@nospecialize(func), @nospecialize(types))
    @info "Creating compiler job for '$func($types)'"
    source = GPUCompiler.FunctionSpec(
                func, # our function
                Base.to_tuple_type(types), # its signature
                false, # whether this is a GPU kernel
                GPUCompiler.safe_name(repr(func))) # the name to use in the asm
    target = Arduino()
    params = ArduinoParams()
    job = GPUCompiler.CompilerJob(target, source, params)
end

This then gets passed to our LLVM IR builder:

function build_ir(job, @nospecialize(func), @nospecialize(types))
    @info "Building LLVM IR for '$func($types)'"
    mi, _ = GPUCompiler.emit_julia(job)
    ir, ir_meta = GPUCompiler.emit_llvm(
                    job, # our job
                    mi; # the method instance to compile
                    libraries=false, # whether this code uses libraries
                    deferred_codegen=false, # is there runtime codegen?
                    optimize=true, # do we want to optimize the llvm?
                    only_entry=false, # is this an entry point?
                    ctx=JuliaContext()) # the LLVM context to use
    return ir, ir_meta
end

We first get a method instance from the julia runtime and ask GPUCompiler to give us the corresponding LLVM IR for our given job, i.e. for our target architecture. We don't use any libraries and we can't run codegen, but julia specific optimizations sure would be nice. They're also required for us, since they remove obviously dead code regarding the julia runtime, which we neither want nor can call into. If it remained in the IR, we'd error out when trying to build our ASM, due to the missing symbols.

After this, all that's left is emitting the AVR ASM:

function build_obj(@nospecialize(func), @nospecialize(types); kwargs...)
    job = native_job(func, types)
    @info "Compiling AVR ASM for '$func($types)'"
    ir, ir_meta = build_ir(job, func, types)
    obj, _ = GPUCompiler.emit_asm(
                job, # our job
                ir; # the IR we got
                strip=true, # should the binary be stripped of debug info?
                validate=true, # should the LLVM IR be validated?
                format=LLVM.API.LLVMObjectFile) # What format would we like to create?
    return obj
end

We're also going to strip out debug info since we can't debug anyway and we're additionally asking LLVM to validate our IR - a very useful feature!
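Since build_obj only returns the object code, it still has to end up in a file before avr-ld and friends can do anything with it. A minimal sketch of that step follows. write_object is a name I made up, and the placeholder string only stands in for real object code so the snippet runs without the custom toolchain.

```julia
# Hypothetical helper: persist the object code returned by `build_obj`
# so that external tools can pick it up.
function write_object(path::AbstractString, obj::AbstractString)
    open(path, "w") do io
        write(io, obj)   # returns the number of bytes written
    end
end

# With the toolchain from this post in place, usage would look like
#   write_object("jl_blink.o", build_obj(main_pointers, Tuple{}))
# Here, a placeholder string stands in for the real object code:
nbytes = write_object(joinpath(mktempdir(), "demo.o"), "\x7fELF")
```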

[5] The eagle eyed may notice that this is suspiciously similar to what one needs for Rust - something to allocate and something to report errors. This is no coincidence - it's the minimum required for a language that usually has a runtime that handles things like signals and allocation of memory for you. Spinning this further could lead one to think that Rust too is garbage collected, since you never have to call malloc and free yourself - it's all handled by the runtime & compiler, which inserts calls to these (or another allocator) in the appropriate places.

Looking at the binary

When calling this like build_obj(main_pointers, Tuple{}) (we don't pass any arguments to main), we receive a String containing binary data - this is our compiled object file:

obj = build_obj(main_pointers, Tuple{})

\x7fELF\x01\x01\x01\0\0\0\0\0\0\0\0\0\x01\0S\0\x01\0\0\0\0\0\0\0\0\0\0\0\xf8\0\0\0\x02\0\0\x004\0\0\0\0\0(\0\x05\0\x01\0\x82\xe0\x84\xb9\0\xc0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\a\0\0\0\0\0\0\0\0\0\0\0\x04\0\xf1\xff\0\0\0\0\0\0\0\0\0\0\0\0\x03\0\x02\0\e\0\0\0\0\0\0\0\x06\0\0\0\x12\0\x02\0?\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\f\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\x04\0\0\0\x03\x02\0\0\x04\0\0\0\0.rela.text\0__do_clear_bss\0julia_main_pointers\0.strtab\0.symtab\0__do_copy_data\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0/\0\0\0\x03\0\0\0\0\0\0\0\0\0\0\0\xa8\0\0\0N\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\0\0\0\0\x06\0\0\0\x01\0\0\0\x06\0\0\0\0\0\0\x004\0\0\0\x06\0\0\0\0\0\0\0\0\0\0\0\x04\0\0\0\0\0\0\0\x01\0\0\0\x04\0\0\0\0\0\0\0\0\0\0\0\x9c\0\0\0\f\0\0\0\x04\0\0\0\x02\0\0\0\x04\0\0\0\f\0\0\x007\0\0\0\x02\0\0\0\0\0\0\0\0\0\0\0<\0\0\0`\0\0\0\x01\0\0\0\x03\0\0\0\x04\0\0\0\x10\0\0\0
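Even before disassembling, the first few bytes of that blob already tell us something. Here's a rough sketch that decodes the start of the ELF header, with the first 20 bytes from above typed out by hand. The field offsets follow the ELF format; le16 is a made-up helper.

```julia
# First 20 bytes of the object shown above, written out as raw bytes.
header = UInt8[0x7f, 0x45, 0x4c, 0x46,   # magic: "\x7fELF"
               0x01,                     # EI_CLASS: 1 = 32-bit
               0x01,                     # EI_DATA: 1 = little endian
               0x01,                     # EI_VERSION
               zeros(UInt8, 9)...,       # OS ABI, ABI version, padding
               0x01, 0x00,               # e_type
               0x53, 0x00]               # e_machine ('S' == 0x53)

# Read a little-endian 16-bit field starting at 1-based index `i`.
le16(bytes, i) = UInt16(bytes[i]) | (UInt16(bytes[i + 1]) << 8)

is_elf    = header[1:4] == UInt8[0x7f, 0x45, 0x4c, 0x46]
e_type    = le16(header, 17)   # 1 = ET_REL, a relocatable object file
e_machine = le16(header, 19)   # 83 = EM_AVR
```

So this really is a 32-bit, little-endian, relocatable object for AVR, which matches the "elf32-avr" that avr-objdump reports.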

Let's take a look at the disassembly, to confirm that this is what we expect to see:

function builddump(fun, args)
    obj = build_obj(fun, args)
    mktemp() do path, io
        write(io, obj)
        flush(io)
        str = read(`avr-objdump -dr $path`, String)
    end |> print
end
builddump(main_pointers, Tuple{})


/tmp/jl_uOAUKI:     file format elf32-avr


Disassembly of section .text:

00000000 <julia_main_pointers>:
   0:	82 e0       	ldi	r24, 0x02	; 2
   2:	84 b9       	out	0x04, r24	; 4
   4:	00 c0       	rjmp	.+0      	; 0x6 <julia_main_pointers+0x6>
			4: R_AVR_13_PCREL	.text+0x4

Well that doesn't look good - where has all our code gone? All that's left is a single out followed by a single do-nothing relative jump. That's almost nothing if we compare to the equivalent C code:

$ avr-objdump -d blink_led.elf

[...]
00000080 <main>:
  80:	21 9a       	sbi	0x04, 1	; 4
  82:	2f ef       	ldi	r18, 0xFF	; 255
  84:	8b e7       	ldi	r24, 0x7B	; 123
  86:	92 e9       	ldi	r25, 0x92	; 146
  88:	21 50       	subi	r18, 0x01	; 1
  8a:	80 40       	sbci	r24, 0x00	; 0
  8c:	90 40       	sbci	r25, 0x00	; 0
  8e:	e1 f7       	brne	.-8      	; 0x88 <main+0x8>
  90:	00 c0       	rjmp	.+0      	; 0x92 <main+0x12>
  92:	00 00       	nop
  94:	29 98       	cbi	0x05, 1	; 5
  96:	2f ef       	ldi	r18, 0xFF	; 255
  98:	8b e7       	ldi	r24, 0x7B	; 123
  9a:	92 e9       	ldi	r25, 0x92	; 146
  9c:	21 50       	subi	r18, 0x01	; 1
  9e:	80 40       	sbci	r24, 0x00	; 0
  a0:	90 40       	sbci	r25, 0x00	; 0
  a2:	e1 f7       	brne	.-8      	; 0x9c <main+0x1c>
  a4:	00 c0       	rjmp	.+0      	; 0xa6 <main+0x26>
  a6:	00 00       	nop
  a8:	ec cf       	rjmp	.-40     	; 0x82 <main+0x2>
[...]

This sets the same bit as our code on 0x04 (remember, this was DDRB), initializes a loop variable spread over three registers, branches, jumps, sets and clears bits. Basically everything we'd expect our code to do as well, so what gives?
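As a sanity check on the C side, the constants in that delay loop can be traced back to the 3 second delay. The sketch below assumes the documented AVR cycle counts (ldi, subi, sbci and nop take one cycle, rjmp takes two, brne takes two when taken and one otherwise) and the 16 MHz clock passed via F_CPU.

```julia
f_cpu = 16_000_000                   # clock in Hz, from -DF_CPU=16000000UL

# 24-bit loop counter assembled from the three `ldi` immediates (r25:r24:r18).
counter = Int(0x927BFF)

setup    = 3 * 1                     # three ldi
per_iter = 1 + 1 + 1 + 2             # subi, sbci, sbci, brne (taken)
loop     = counter * per_iter - 1    # the final brne is not taken
padding  = 2 + 1                     # rjmp .+0 and nop

total_cycles = setup + loop + padding
delay_s      = total_cycles / f_cpu
```

That works out to 48,000,000 cycles, or 3 seconds per delay, which is what MS_DELAY asked for.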

In order to figure out what's going on, we have to remember that julia, LLVM and gcc are optimizing compilers. If they can deduce that some piece of code has no visible effect, for example because you're always overwriting previous loop iterations with known constants, the compiler is usually free to just delete the superfluous writes because you can't observe the difference anyway.

Here, I believe two things happened:

The initial unsafe_load from our pointer triggered undefined behavior, since the initial value of a given pointer is not defined. LLVM saw that, saw that we actually used the read value and eliminated both read & store due to it being undefined behavior and it being free to pick the value it "read" to be the one we wrote, making the load/store pair superfluous.
The now empty loops serve no purpose, so they got removed as well.

In C, you can solve this problem by using volatile. That keyword is a very strict way of telling the compiler "Look, I want every single read & write from and to this variable to happen. Don't eliminate any and don't shuffle them around (except for non-volatile, you're free to shuffle those around)". In contrast, julia doesn't have this concept at all - but we do have atomics. So let's use them to see if they're enough, even though semantically they're a tiny bit different^[6].

Atomicity

With the atomics, our code now looks like this:

const DDRB  = Ptr{UInt8}(36) # 0x24, but julia only provides conversion methods for `Int`
const PORTB = Ptr{UInt8}(37) # 0x25

# The bits we're interested in are the same bit as in the datasheet
#                76543210
const DDB1   = 0b00000010
const PORTB1 = 0b00000010

function main_atomic()
    ddrb = unsafe_load(PORTB) # note: this reads PORTB rather than DDRB, which is the `in r24, 0x05` in the disassembly below
    Core.Intrinsics.atomic_pointerset(DDRB, ddrb | DDB1, :sequentially_consistent)

    while true
        pb = unsafe_load(PORTB)
        Core.Intrinsics.atomic_pointerset(PORTB, pb | PORTB1, :sequentially_consistent) # enable LED

        for _ in 1:500000
            # busy loop
        end

        pb = unsafe_load(PORTB)
        Core.Intrinsics.atomic_pointerset(PORTB, pb & ~PORTB1, :sequentially_consistent) # disable LED

        for _ in 1:500000
            # busy loop
        end
    end
end

Note

This is not how you'd usually use atomics in julia! I'm using intrinsics in hopes of communicating with LLVM directly, since I'm dealing with pointers here. For more high-level code, you'd use @atomic operations on struct fields.

giving us the following assembly:


/tmp/jl_UfT1Rf:     file format elf32-avr


Disassembly of section .text:

00000000 <julia_main_atomic>:
   0:	85 b1       	in	r24, 0x05	; 5
   2:	82 60       	ori	r24, 0x02	; 2
   4:	a4 e2       	ldi	r26, 0x24	; 36
   6:	b0 e0       	ldi	r27, 0x00	; 0
   8:	0f b6       	in	r0, 0x3f	; 63
   a:	f8 94       	cli
   c:	8c 93       	st	X, r24
   e:	0f be       	out	0x3f, r0	; 63
  10:	85 b1       	in	r24, 0x05	; 5
  12:	a5 e2       	ldi	r26, 0x25	; 37
  14:	b0 e0       	ldi	r27, 0x00	; 0
  16:	98 2f       	mov	r25, r24
  18:	92 60       	ori	r25, 0x02	; 2
  1a:	0f b6       	in	r0, 0x3f	; 63
  1c:	f8 94       	cli
  1e:	9c 93       	st	X, r25
  20:	0f be       	out	0x3f, r0	; 63
  22:	98 2f       	mov	r25, r24
  24:	9d 7f       	andi	r25, 0xFD	; 253
  26:	0f b6       	in	r0, 0x3f	; 63
  28:	f8 94       	cli
  2a:	9c 93       	st	X, r25
  2c:	0f be       	out	0x3f, r0	; 63
  2e:	00 c0       	rjmp	.+0      	; 0x30 <julia_main_atomic+0x30>
			2e: R_AVR_13_PCREL	.text+0x18

At first glance, it doesn't look too bad. We have a little bit more code and we see some out instructions, so are we good? Unfortunately, no. There is only a single rjmp, meaning our nice busy loops got eliminated. I also had to insert those unsafe_load calls to not get a segfault during compilation. Further, the atomics seem to have ended up accessing some pretty weird addresses - they appear to read/write 0x3f (or address 63), which is mapped to SREG, the status register. Even weirder is what it's doing with the value it read:

8:	0f b6       	in	r0, 0x3f	; 63
a:	f8 94       	cli
...
e:	0f be       	out	0x3f, r0	; 63

First, reading SREG into r0, then clearing the interrupt bit, then writing the value we saved back out. I don't know how it got to this code, but I do know that it's not what we want. So atomics are not the way to go.
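One possible reading of that sequence, offered as an interpretation rather than something verified here: saving SREG, executing cli, doing the store and then restoring SREG looks like the usual AVR idiom for a short critical section, i.e. the store happens with interrupts disabled and the previous interrupt state is put back afterwards. The toy model below only mimics that sequence on a made-up Cpu struct, tracking nothing but SREG and a single memory cell.

```julia
# Toy model of the instruction sequence around each atomic store.
mutable struct Cpu
    sreg::UInt8   # status register
    cell::UInt8   # the memory location being written
end

const I_FLAG = 0b1000_0000   # global interrupt enable bit (bit 7 of SREG)

function atomic_store_model!(cpu::Cpu, value::UInt8)
    saved = cpu.sreg                                   # in  r0, 0x3f
    cpu.sreg &= ~I_FLAG                                # cli
    enabled_during_store = (cpu.sreg & I_FLAG) != 0
    cpu.cell = value                                   # st  X, r24
    cpu.sreg = saved                                   # out 0x3f, r0
    return enabled_during_store
end

cpu = Cpu(I_FLAG, 0x00)                                # interrupts enabled beforehand
enabled_during = atomic_store_model!(cpu, 0x02)
```

In this model the store happens with the interrupt flag cleared, and SREG ends up exactly as it started.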

[6]

"Atomic and volatile in the IR are orthogonal; “volatile” is the C/C++ volatile, which ensures that every volatile load and store happens and is performed in the stated order. A couple examples: if a SequentiallyConsistent store is immediately followed by another SequentiallyConsistent store to the same address, the first store can be erased. This transformation is not allowed for a pair of volatile stores.", LLVM Documentation - Atomics

Inline LLVM-IR

The other option we still have at our disposal is writing inline LLVM-IR. Julia has great support for such constructs, so let's use them:

const DDRB  = Ptr{UInt8}(36)
const PORTB = Ptr{UInt8}(37)
const DDB1   = 0b00000010
const PORTB1 = 0b00000010
const PORTB_none = 0b00000000 # We don't need any other pin - set everything low

function volatile_store!(x::Ptr{UInt8}, v::UInt8)
    return Base.llvmcall(
        """
        %ptr = inttoptr i64 %0 to i8*
        store volatile i8 %1, i8* %ptr, align 1
        ret void
        """,
        Cvoid,
        Tuple{Ptr{UInt8},UInt8},
        x,
        v
    )
end

function main_volatile()
    volatile_store!(DDRB, DDB1)

    while true
        volatile_store!(PORTB, PORTB1) # enable LED

        for _ in 1:500000
            # busy loop
        end

        volatile_store!(PORTB, PORTB_none) # disable LED

        for _ in 1:500000
            # busy loop
        end
    end
end

with our disassembly looking like:


/tmp/jl_3twwq9:     file format elf32-avr


Disassembly of section .text:

00000000 <julia_main_volatile>:
   0:	82 e0       	ldi	r24, 0x02	; 2
   2:	84 b9       	out	0x04, r24	; 4
   4:	90 e0       	ldi	r25, 0x00	; 0
   6:	85 b9       	out	0x05, r24	; 5
   8:	95 b9       	out	0x05, r25	; 5
   a:	00 c0       	rjmp	.+0      	; 0xc <julia_main_volatile+0xc>
			a: R_AVR_13_PCREL	.text+0x6

Much better! Our out instructions save to the correct register. Unsurprisingly, all loops are still eliminated. We could force the variable from busy looping to exist by writing its value somewhere in SRAM, but that's a little wasteful. Instead, we can go one step deeper with our nesting and have inline AVR assembly in our inline LLVM-IR:

const DDRB  = Ptr{UInt8}(36)
const PORTB = Ptr{UInt8}(37)
const DDB1   = 0b00000010
const PORTB1 = 0b00000010
const PORTB_none = 0b00000000 # We don't need any other pin - set everything low

function volatile_store!(x::Ptr{UInt8}, v::UInt8)
    return Base.llvmcall(
        """
        %ptr = inttoptr i64 %0 to i8*
        store volatile i8 %1, i8* %ptr, align 1
        ret void
        """,
        Cvoid,
        Tuple{Ptr{UInt8},UInt8},
        x,
        v
    )
end

function keep(x)
    return Base.llvmcall(
        """
        call void asm sideeffect "", "X,~{memory}"(i16 %0)
        ret void
        """,
        Cvoid,
        Tuple{Int16},
        x
    )
end

function main_keep()
    volatile_store!(DDRB, DDB1)

    while true
        volatile_store!(PORTB, PORTB1) # enable LED

        for y in Int16(1):Int16(3000)
            keep(y)
        end

        volatile_store!(PORTB, PORTB_none) # disable LED

        for y in Int16(1):Int16(3000)
            keep(y)
        end
    end
end

This slightly unorthodox "not even a nop" construct pretends to execute an instruction that has some side effect, using our input as an argument. I've changed the loop to run for fewer iterations because it makes the assembly easier to read.

Checking the disassembly we get...


/tmp/jl_xOZ5hH:     file format elf32-avr


Disassembly of section .text:

00000000 <julia_main_keep>:
   0:	82 e0       	ldi	r24, 0x02	; 2
   2:	84 b9       	out	0x04, r24	; 4
   4:	21 e0       	ldi	r18, 0x01	; 1
   6:	30 e0       	ldi	r19, 0x00	; 0
   8:	9b e0       	ldi	r25, 0x0B	; 11
   a:	40 e0       	ldi	r20, 0x00	; 0
   c:	85 b9       	out	0x05, r24	; 5
   e:	62 2f       	mov	r22, r18
  10:	73 2f       	mov	r23, r19
  12:	e6 2f       	mov	r30, r22
  14:	f7 2f       	mov	r31, r23
  16:	31 96       	adiw	r30, 0x01	; 1
  18:	68 3b       	cpi	r22, 0xB8	; 184
  1a:	79 07       	cpc	r23, r25
  1c:	6e 2f       	mov	r22, r30
  1e:	7f 2f       	mov	r23, r31
  20:	01 f4       	brne	.+0      	; 0x22 <julia_main_keep+0x22>
			20: R_AVR_7_PCREL	.text+0x16
  22:	45 b9       	out	0x05, r20	; 5
  24:	62 2f       	mov	r22, r18
  26:	73 2f       	mov	r23, r19
  28:	e6 2f       	mov	r30, r22
  2a:	f7 2f       	mov	r31, r23
  2c:	31 96       	adiw	r30, 0x01	; 1
  2e:	68 3b       	cpi	r22, 0xB8	; 184
  30:	79 07       	cpc	r23, r25
  32:	6e 2f       	mov	r22, r30
  34:	7f 2f       	mov	r23, r31
  36:	01 f4       	brne	.+0      	; 0x38 <julia_main_keep+0x38>
			36: R_AVR_7_PCREL	.text+0x2c
  38:	00 c0       	rjmp	.+0      	; 0x3a <julia_main_keep+0x3a>
			38: R_AVR_13_PCREL	.text+0xc

Huzzah! Pretty much everything we'd expect to see is here:

We write to 0x05 with out
We have some brne to busy loop with
We add something to some register for our looping
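A quick back-of-the-envelope check on the timing of this shortened loop, again assuming the documented AVR cycle counts and the 16 MHz clock. The loop body runs from offset 0x16 to the brne at 0x20.

```julia
f_cpu = 16_000_000        # clock in Hz

# adiw (2 cycles), cpi (1), cpc (1), mov (1), mov (1), brne (2 when taken)
cycles_per_iter = 2 + 1 + 1 + 1 + 1 + 2
iterations      = 3000

delay_ms = iterations * cycles_per_iter / f_cpu * 1000

# Iterations needed for a half-second delay at the same cost per iteration.
needed        = round(Int, 0.5 * f_cpu / cycles_per_iter)
fits_in_int16 = needed <= typemax(Int16)
```

With 3000 iterations each delay lasts only about 1.5 ms, far too fast to see. A visible half-second delay would need around a million iterations at this cost, which no longer fits into an Int16, so a wider counter or nested loops would be needed for that.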

Granted, the binary is not as small as the one we compiled with -Os from C, but it should work! The only remaining step is to get rid of all those .+0 jump labels, which would prevent us from actually looping. I've also enabled dumping of relocation labels (that's the R_AVR_7_PCREL stuff) - these are inserted by the compiler to make the code relocatable in an ELF file and are used by the linker during final linking. Now that we're probably ready to flash, we can link our code into a binary (thereby resolving those relocation labels) and flash it onto our arduino:

$ avr-ld -o jl_blink.elf jl_blink.o

$ avr-objcopy -O ihex jl_blink.elf jl_blink.hex

$ avrdude -V -c arduino -p ATMEGA328P -P /dev/ttyACM0 -U flash:w:jl_blink.hex
avrdude: AVR device initialized and ready to accept instructions

Reading | ################################################## | 100% 0.00s

avrdude: Device signature = 0x1e950f (probably m328p)
avrdude: NOTE: "flash" memory has been specified, an erase cycle will be performed
         To disable this feature, specify the -D option.
avrdude: erasing chip
avrdude: reading input file "jl_blink.hex"
avrdude: input file jl_blink.hex auto detected as Intel Hex
avrdude: writing flash (168 bytes):

Writing | ################################################## | 100% 0.04s

avrdude: 168 bytes of flash written

avrdude done.  Thank you.

and after flashing we get...

<blink> an LED in Julia

Now THAT is what I call two days well spent! The arduino is powered through the serial connector I use to flash programs on the right.

I want to thank everyone in the Julialang Slack channel #static-compilation for their help during this! Without them, I wouldn't have thought of the relocation labels in linking and their help was invaluable when figuring out what does and does not work when compiling julia to a, for this language, exotic architecture.

Limitations

Would I use this in production? Unlikely, but possibly in the future. It was finicky to get going and random segmentation faults during the compilation process itself are bothersome. But then again - none of this was part of a supported workflow, so I guess I'm happy that it has worked as well as it has! I do believe that this area will steadily improve - after all, it's already working well on GPUs and FPGAs (or so I'm told - Julia on an FPGA is apparently some commercial offering from a company). From what I know, this is the first julia code to run native & baremetal on any Arduino/ATmega based chip, which in and of itself is already exciting. Still, the fact that there is no such thing as a runtime for this (julia uses libuv for tasks - getting that on an arduino seems challenging) means you're mostly going to be limited to self-written or vetted code that doesn't rely on too advanced features, like a GC.

Some niceties I'd like to have are better custom-allocator support, to allow actual proper "heap" allocation. I haven't tried yet, but I think immutable structs (which are often placed on the stack already, which the ATmega328p does have!) should work out of the box.

I'm looking forward to trying out some I²C and SPI communication, but my gut tells me it won't be much different from writing this in C (unless we get custom allocator support or I use one of the malloc based arrays from StaticTools.jl, that is).