Over the years I’ve developed a number of embedded systems across many industries, and have learned a lot of painful lessons along the way. This document is an attempt to capture some of those, in the hopes others can avoid the same pitfalls on their path.
Failure to propagate an error code up the call chain usually means lots of painful debugging and Ouija-board use in the future.
In my early days, if a function failed, I’d just return “failed” to the caller. Things had gone sour, so why care about the details now that we were toast? Then I actually had to debug large, complex systems, and more importantly, build robust systems that would either work around a fault or attempt to recover. After that, I learned that every function should return a unique error code upon failure, and callers should propagate it up the call chain until it is printed/logged/displayed somewhere a human can see it.
Most systems don’t have more than 2^15 ways a function can fail, so it can be as simple as returning a negative 16-bit value to indicate failure; the cost is very low.
By propagating the unique error code up the chain, I can instantly find where the error occurred no matter how deep in the drivers it happened, rather than having to guess where that ERR_TIMEOUT failure was triggered.
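A minimal sketch of this pattern in C, with made-up error codes and a stubbed-out SPI read (a real driver would touch hardware):

```c
#include <stdint.h>

/* Each failure site gets its own negative 16-bit code, so the value that
 * reaches the log pinpoints exactly where things broke. Codes and the
 * IMU example are illustrative, not any real part's values. */
typedef int16_t err_t;

#define ERR_OK               0
#define ERR_IMU_SPI_TIMEOUT (-101)  /* SPI transfer to the IMU timed out    */
#define ERR_IMU_BAD_WHOAMI  (-102)  /* identity register read back wrong    */
#define ERR_IMU_SELFTEST    (-103)  /* IMU self-test failed                 */

/* Stand-in for a low-level transfer; simulated so the flow is runnable. */
static err_t spi_read_reg(uint8_t reg, uint8_t *val)
{
    (void)reg;
    *val = 0x68;            /* pretend the part answered correctly */
    return ERR_OK;
}

static err_t imu_check_identity(void)
{
    uint8_t who;
    err_t rc = spi_read_reg(0x75, &who);
    if (rc != ERR_OK)
        return rc;                     /* propagate, don't flatten to "failed" */
    if (who != 0x68)
        return ERR_IMU_BAD_WHOAMI;     /* unique code for this exact site */
    return ERR_OK;
}
```

When `-102` shows up in a log, there is exactly one line of code that could have produced it.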
Anything but the most low-level communications link can fail, and sometimes even those fail – be ready for it and never assume success.
For the most part EEPROMs live forever and don’t wear out, and we often assume that once a write completes, we’re good to go. Most code gets away with this approach, especially when we’ve done the analysis: with one write per boot cycle and 100,000 boot cycles, we’re nowhere near the wear limit. Seems fine.
But what happens when there’s a bug in your development code and you happen to spin in a loop continually writing a memory cell? Oops. Now that part is unexpectedly worn out.
Or maybe you write an I2C register, but a write-protect was left enabled, by accident or due to a race condition, and that critical write was silently ignored. Oops. You’ll never know unless you read it back.
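A sketch of write-then-verify, assuming hypothetical `eeprom_write`/`eeprom_read` primitives; here they are simulated with a RAM array so the logic stands on its own:

```c
#include <stdint.h>
#include <string.h>

/* Simulated EEPROM; a real driver would talk to the part over I2C/SPI. */
static uint8_t sim_eeprom[256];

static int eeprom_write(uint16_t addr, const uint8_t *buf, uint8_t len)
{
    memcpy(&sim_eeprom[addr], buf, len);
    return 0;
}

static int eeprom_read(uint16_t addr, uint8_t *buf, uint8_t len)
{
    memcpy(buf, &sim_eeprom[addr], len);
    return 0;
}

/* Write, then read back and compare; never assume the write took. */
static int eeprom_write_verified(uint16_t addr, const uint8_t *buf, uint8_t len)
{
    uint8_t check[32];
    if (len > sizeof(check))
        return -1;
    if (eeprom_write(addr, buf, len) != 0)
        return -2;
    if (eeprom_read(addr, check, len) != 0)
        return -3;
    if (memcmp(buf, check, len) != 0)
        return -4;   /* write-protect left on? worn cell? bus transient? */
    return 0;
}
```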
And of course there are all the single-bit errors that can happen due to transients on that lowly RS232 serial channel when a solenoid engages that happens to have its power wire adjacent to the serial wires.
Power-ups are messy
If only everything came up clean and ready to go when we powered up our systems. But some parts are sleepy and wake up slowly. Or they get grumpy when the slew rate of our power rails isn’t what they expected, or some pin toggles during their power-up, setting them off like someone who needs coffee before conversation in the morning.
So, on power up, always make sure that peripheral is actually ready to talk. Often they have their own POST to get through, and might not be fully responsive initially. If you don’t check, then maybe your power-up initialization works fine until you change the boot-up timing, and then unexpectedly something fails.
When parts have a way to do a reset, it’s a good practice to invoke that reset when the driver initializes. That way you know it’s clean, and the part is in a defined state. And that way if your system needs to re-initialize, you can call that driver again and again and it will be immune to the previous peripheral state.
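A sketch of that init pattern: reset first, then poll a hypothetical ready flag with a bounded timeout. The peripheral is simulated here so the flow can run off-target; a real driver would read a status register and delay between polls.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simulated peripheral: reports not-ready for a few polls after reset. */
static int fake_ready_after;

static void periph_reset(void)    { fake_ready_after = 3; }

static bool periph_is_ready(void)
{
    if (fake_ready_after > 0) {
        fake_ready_after--;
        return false;
    }
    return true;
}

static int periph_init(void)
{
    periph_reset();                    /* known state on every init */
    for (int tries = 0; tries < 100; tries++) {
        if (periph_is_ready())
            return 0;                  /* part finished its own POST */
        /* real code would delay here, e.g. 1 ms per poll */
    }
    return -1;                         /* use a unique timeout code in practice */
}
```

Because the reset is inside `periph_init`, the driver can be called again and again and remains immune to whatever state the part was left in.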
Floating pins create nasty bugs
So many times I’ve seen designs with floating GPIOs, rationalized with the argument, “We’ll initialize that pin on power-up, so it’s okay that it floats. It’s just for a short time, so it doesn’t matter.” This has burned countless designs again and again.
Systems must be stable and deterministic in reset. Non-negotiable.
From a firmware perspective, you should never need to race to initialize pins on power-up, and it’s good to test by adding delays during boot to expose race conditions and unstable boot states. Sometimes a floating pin manifests as excessive power consumption during boot, and adding a delay can help the hardware team find it.
Beware of assumed long term perfection
It’s common for some SPI peripherals to just emit endless streams of data formatted into packets of a fixed length. Thus once you start the streaming, you just pack every N bytes into a structure and you’re good to go for thousands, millions, billions(?) of packets. You’re synchronized, right? Yeah, kinda.
I’ve seen systems like this purr like well-oiled machines for countless streams, but it all hinges on never skipping a beat. Ever. Yet even a human heart sometimes skips a beat, and the same goes for embedded systems: sometimes we miss a byte, or someone messes up sending it, or it gets corrupted. The point is, never assume long-term synchronization; periodically confirm that all is well.
This could be an IMU, for example, that streams out a packet of 6 axes of 2-byte data (12 bytes) whenever read. Or an AFE reading out sensor channels in a block. In both of these cases I’ve seen a need to watch for loss of synchronization, and to have a detection/recovery mechanism in place. In one-way fiber-optic communication systems, this can mean forward-error-correction mechanisms and key-frames, for example.
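One way to sketch the recovery side, assuming a made-up 12-byte packet format whose first byte is a sync marker (not any real part’s format). When a sanity check fails, hunt the raw byte stream for two markers exactly one packet apart to relocate the frame boundary:

```c
#include <stdint.h>
#include <stddef.h>

#define PKT_LEN  12     /* e.g. 6 axes of 2-byte IMU data */
#define PKT_SYNC 0xA5   /* hypothetical sync marker byte  */

/* Return the offset of the first well-framed packet in buf, or -1 if none.
 * Requiring two markers one packet apart reduces false locks on data bytes
 * that happen to equal the marker. */
static int stream_resync(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + PKT_LEN < len; i++) {
        if (buf[i] == PKT_SYNC && buf[i + PKT_LEN] == PKT_SYNC)
            return (int)i;
    }
    return -1;
}
```

In steady state you would only spot-check framing every N packets; this search runs when that check fails.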
That lowly UART is your friend
Any system of consequence I’ve built has included a UART dedicated to the firmware team to do with as they please. Yes, there are lots of fancy JTAG debuggers, but the value of detailed debug info barked out that simple TxD channel can be immeasurable.
On most SoCs it takes just a couple register writes to enable a simple busy-wait UART transmitter, and at 115.2k-baud, it’s an acceptable time to wait for a byte to clock out. Later in the boot process a nice interrupt-driven driver can be turned on, but it’s helpful to have a dirt-simple way to send debug data immediately in the boot process.
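A sketch of that early-boot transmitter. The register names and status bit are invented (every SoC differs) and are modeled as plain variables here so the flow can be exercised off-target:

```c
#include <stdint.h>

/* Hypothetical memory-mapped UART, simulated as variables. On real
 * hardware these would be volatile pointers to SoC registers. */
static volatile uint32_t uart_status = (1u << 5);  /* TX holding reg empty */
static volatile uint32_t uart_txdata;
static int tx_count;                                /* for observation only */

#define UART_TX_EMPTY (1u << 5)

/* Busy-wait transmit: ~87 us per byte at 115200-8N1, fine for early boot. */
static void uart_putc_blocking(char c)
{
    while (!(uart_status & UART_TX_EMPTY))
        ;                               /* spin until the holding reg frees */
    uart_txdata = (uint8_t)c;
    tx_count++;
}

static void uart_puts(const char *s)
{
    while (*s)
        uart_putc_blocking(*s++);
}
```

Once the scheduler and interrupts are up, a proper interrupt-driven driver can take over; this one exists so the very first instructions of boot can talk.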
POST data is essential. Having each driver broadcast out configuration information makes it trivial to see when a system boots up in an unexpected state.
Find your neighbors on power-up
In POST, scanning each I2C bus to see which addresses respond has saved my debugging time many times over. It’s reasonably fast to do, and showing in the POST UART stream all of the I2C addresses that elicit a response can help debug a misconfiguration or missing component. I generally also display the value read from each address, as well as what peripheral I expected to be there.
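A sketch of such a scan. `i2c_probe` is a hypothetical primitive that would issue a start condition plus the address and report whether the part ACKed; here it is stubbed to simulate two responding parts:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Stubbed bus probe: pretend an OLED (0x3C) and an IMU (0x68) are present.
 * A real implementation would drive the I2C peripheral and check for ACK. */
static bool i2c_probe(uint8_t addr)
{
    return addr == 0x3C || addr == 0x68;
}

/* Scan the 7-bit address space, log responders, return how many answered.
 * 0x00-0x07 and 0x78-0x7F are reserved by the I2C spec, so skip them. */
static int i2c_scan(void)
{
    int found = 0;
    for (uint8_t addr = 0x08; addr <= 0x77; addr++) {
        if (i2c_probe(addr)) {
            printf("I2C: 0x%02X responded\n", addr);
            found++;
        }
    }
    return found;
}
```

Comparing the responder list against the expected bill-of-materials in the POST stream makes a missing or mis-strapped part obvious at a glance.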
Remember the movie Memento
Most SoCs have one or more registers that try to explain what just happened on boot. Did I just have a watchdog reset? Software reset? Hardware reset? Brown-out? Read those registers and display them in your POST UART stream. And log them later.
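A sketch of decoding such a register. The bit layout is invented; real SoCs document their own, and some require an explicit write to clear the flags after reading:

```c
#include <stdint.h>

/* Hypothetical reset-cause bits; consult your SoC's reference manual. */
#define RST_POR      (1u << 0)   /* power-on reset     */
#define RST_BROWNOUT (1u << 1)   /* brown-out detector */
#define RST_WATCHDOG (1u << 2)   /* watchdog bite      */
#define RST_SOFTWARE (1u << 3)   /* software-requested */

/* Map the raw register value to a string for the POST UART stream.
 * Check the scariest causes first, since several bits can be set at once. */
static const char *reset_cause_str(uint32_t cause)
{
    if (cause & RST_WATCHDOG) return "WATCHDOG";
    if (cause & RST_BROWNOUT) return "BROWNOUT";
    if (cause & RST_SOFTWARE) return "SOFTWARE";
    if (cause & RST_POR)      return "POWER-ON";
    return "UNKNOWN";
}
```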
That POST TxD stream is ephemeral, but useful – don’t lose it
The serial output stream on power-up can be super useful, but it’s fleeting. And sometimes while debugging, you want to know what happened at power up but that was long ago. Solution? Store it. Write every byte sent out the UART to a RAM buffer so you can replay it later. That buffer should be just large enough to hold all the POST messages, and then once boot has completed stop writing to it. If resources are too tight to keep it, just keep it as long as you can before you reallocate that RAM. Or after boot, write it out to some non-volatile memory that’s sitting idle.
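A sketch of that mirroring, with hypothetical names and an assumed 2 KB transcript size. Every byte headed out the UART also lands here until boot completes:

```c
#include <stddef.h>

/* Sized (hypothetically) to hold one full POST transcript. */
#define POST_LOG_SIZE 2048

static char   post_log[POST_LOG_SIZE];
static size_t post_log_len;
static int    post_log_frozen;       /* set once boot completes */

/* Hook this into the UART transmit path during boot. */
static void post_log_putc(char c)
{
    if (!post_log_frozen && post_log_len < POST_LOG_SIZE)
        post_log[post_log_len++] = c;
    /* a real build would also push c out the UART here */
}

/* Call at end of boot: the buffer now holds the replayable POST stream. */
static void post_log_freeze(void)
{
    post_log_frozen = 1;
}
```

A debug command can later dump `post_log`, or it can be copied to idle non-volatile memory if the RAM must be reclaimed.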
It’s good to know your history
Most embedded systems have non-volatile storage, which often holds things like MAC addresses, serial numbers, configuration details, and calibration constants. In a perfect world we’d know from the beginning everything we’d need to store, but reality is otherwise, so this table of non-volatile details evolves and grows. You don’t want to be saddled with legacy format decisions, so I’ve found it greatly helpful to track the previous FW version to drive migration. When a new FW version is loaded, it sees that the version changed and executes the appropriate migration code. New code isn’t hindered by old decisions, and configuration details aren’t lost.
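A sketch of that version-driven migration, with an illustrative config layout. The stored record carries the FW version that last wrote it, and migration steps chain in order until the layout matches the running firmware:

```c
#include <stdint.h>

/* Illustrative non-volatile record; field names are made up. */
typedef struct {
    uint16_t fw_version;    /* version that last wrote this record        */
    uint16_t cal_offset;    /* existed since v1                           */
    uint16_t cal_gain;      /* added in v2; v1 images never set it        */
} config_t;

#define FW_VERSION_CURRENT 2

/* Run at boot after loading the record from non-volatile memory. */
static void config_migrate(config_t *cfg)
{
    if (cfg->fw_version < 2) {
        cfg->cal_gain = 1000;        /* v1 -> v2: seed the new field's default */
    }
    /* future steps chain here: if (cfg->fw_version < 3) { ... } */
    cfg->fw_version = FW_VERSION_CURRENT;
}
```

Because each step only upgrades from the immediately preceding layout, a device skipping several firmware versions still walks through every migration in order.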
Think OOPy
Long story short: keep configuration and calibration data close to where it’s needed. This generally means keeping it in non-volatile memory on the device itself, rather than in some off-module table. That avoids elaborate systems for tracking, storing, transporting, and deploying such data. It’s simply written to the device during manufacturing/calibration and stays with the module, tucked close to the code that actually needs it.
I’ve seen countless man-years spent developing and maintaining systems that keep minimal data on the embedded device and rely on matching it against off-module databases.
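Keeping the data on-module can be as simple as a small self-describing record in NVM. The magic value and the simple additive checksum below are illustrative choices, not a standard; a real design might use a CRC:

```c
#include <stdint.h>

/* Illustrative factory-calibration record that travels with the module. */
typedef struct {
    uint32_t magic;        /* identifies a valid record           */
    int32_t  offset_uV;    /* written once at factory calibration */
    int32_t  gain_ppm;
    uint32_t checksum;     /* sum of the fields above             */
} cal_record_t;

#define CAL_MAGIC 0xCA11B5A7u

static uint32_t cal_checksum(const cal_record_t *r)
{
    return r->magic + (uint32_t)r->offset_uV + (uint32_t)r->gain_ppm;
}

/* Validate before use; fall back to safe defaults if the record is bad. */
static int cal_is_valid(const cal_record_t *r)
{
    return r->magic == CAL_MAGIC && r->checksum == cal_checksum(r);
}
```

The driver that consumes the calibration is the only code that needs to know this layout, which is exactly the point: no off-module database has to stay in sync with the hardware.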