Managing Software Updates

By Pilgrim - August 11, 2020

Introduction

Software isn't perfect. The software running on your deployed devices today has bugs, missing features and security holes - that's a fact of life. The good news is that with connected devices you can fix these problems, in the field, with a remote firmware update. 

With the passage of time, and increasing scale, the management of software versions becomes one of those eternal "war of attrition" tasks, which generally falls to the Ops team to manage. You're always aiming for 100%, but you can never quite get there. It's something that can either be done well or done badly, and the right tools and processes will deliver a measurably better experience to your customers, while reducing the load on your team.

The ideal to aim for

Let's consider the ideal scenario: from time to time a new version of firmware is released by the development team (or the device vendor, if you buy in your software stack). You deploy that new version across all your devices, it works well, and the job's done - until the next release. This chart shows the process in action:

[Chart: ideal firmware rollout - percentage of devices on each firmware version, by month]

At the start of 2019 all devices were running v0.3 or earlier, then sometime in March 2019 v0.4 was released and started to be deployed, achieving 100% penetration after a couple of months. Then v0.5 came along in July ... and so on. You can easily create charts of your own just like this in DevicePilot, by creating a KPI to measure the "percentage of devices in each value of" the firmware version, and then grouping by month.
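For readers who want to see what such a KPI computes, here is a minimal sketch in plain Python. It is not DevicePilot's API: the function name `version_share_by_month`, the tuple layout and the sample data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def version_share_by_month(observations):
    """Return {month: {version: percentage of devices}}.

    observations: iterable of (month, device_id, firmware_version) tuples.
    This layout is hypothetical, chosen only for the example.
    """
    # Keep one version per device per month (the last one seen wins).
    latest = {}
    for month, device_id, version in observations:
        latest[(month, device_id)] = version

    # Count devices per version within each month.
    counts = defaultdict(Counter)
    for (month, _device_id), version in latest.items():
        counts[month][version] += 1

    # Convert counts to percentages of that month's devices.
    shares = {}
    for month, counter in sorted(counts.items()):
        total = sum(counter.values())
        shares[month] = {v: 100.0 * n / total for v, n in sorted(counter.items())}
    return shares


# Made-up sample data: four devices observed over two months.
observations = [
    ("2019-03", "dev-a", "v0.3"),
    ("2019-03", "dev-b", "v0.3"),
    ("2019-03", "dev-c", "v0.4"),
    ("2019-03", "dev-d", "v0.3"),
    ("2019-04", "dev-a", "v0.4"),
    ("2019-04", "dev-b", "v0.4"),
    ("2019-04", "dev-c", "v0.4"),
    ("2019-04", "dev-d", "v0.3"),
]
shares = version_share_by_month(observations)
for month, by_version in shares.items():
    print(month, by_version)
```

In this sample, v0.4 goes from a quarter of the devices in March to three quarters in April, which is the kind of band a stacked chart would draw.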

Reality strikes

However, as anyone who has ever managed devices in the field will know, the above scenario is unfortunately a pipe dream! Here's a more realistic scenario:

[Chart: realistic firmware rollout - percentage of devices on each firmware version, by month]

As above, you started to roll out v0.4 in Mar-2019. But you never succeeded in getting every device onto the latest version of the day, and so two years later you have a mixture of all the firmware versions released during that time. That's a problem, because each firmware version has its own peculiarities, which makes it increasingly hard to deliver a consistent user experience, and to diagnose and resolve problems when they do occur. It can even make the ongoing process of upgrading harder, as it may not be possible to upgrade directly from a very old version of software to the latest version without going through some intermediate versions.
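To make the intermediate-version problem concrete, here is a small sketch that finds the shortest chain of upgrades between two versions using a breadth-first search. The `ALLOWED_UPGRADES` table is entirely hypothetical; in practice the permitted steps would come from your firmware vendor or development team.

```python
from collections import deque

# Hypothetical table of which direct upgrades are supported.
ALLOWED_UPGRADES = {
    "v0.3": ["v0.4"],
    "v0.4": ["v0.5", "v0.52"],
    "v0.5": ["v0.52"],
}


def upgrade_path(allowed, current, target):
    """Return the shortest list of versions from current to target, or None."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # Try every direct upgrade available from the end of this path.
        for next_version in allowed.get(path[-1], ()):
            if next_version not in seen:
                seen.add(next_version)
                queue.append(path + [next_version])
    return None  # No supported chain of upgrades exists.


path = upgrade_path(ALLOWED_UPGRADES, "v0.3", "v0.52")
no_path = upgrade_path(ALLOWED_UPGRADES, "v0.52", "v0.3")
print(path)
print(no_path)
```

With this table, a device still on v0.3 needs two steps (via v0.4) to reach v0.52, so every straggler on an old version costs more than one upgrade attempt.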

So you end up in the worst of all worlds: high support costs and poor customer experience.

Clearly this situation isn't desirable, so how does it happen?

There are many reasons why the latest firmware doesn't end up on all devices:

  • You can't upgrade a device if it's offline
  • ...or its battery is flat.
  • In most cases, you can't upgrade it while it's in use, so you might need to get the user's co-operation to identify a safe time to do the upgrade.
  • And sometimes the upgrade process fails for one reason or another. Ideally the device then reverts to the previous version, or falls back to the factory-installed firmware. At worst it becomes "bricked", requiring at least user intervention, and potentially a product replacement.
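The first three reasons above can be checked before an upgrade is even attempted. The sketch below is one way to express that check; the `DeviceStatus` fields and the 30% battery threshold are assumptions for illustration, not values from any particular product.

```python
from dataclasses import dataclass


@dataclass
class DeviceStatus:
    """Hypothetical snapshot of the state that matters for an upgrade."""
    online: bool
    battery_percent: float
    in_use: bool


def upgrade_blockers(status, min_battery_percent=30.0):
    """Return the reasons this device cannot be upgraded right now."""
    blockers = []
    if not status.online:
        blockers.append("offline")
    if status.battery_percent < min_battery_percent:
        blockers.append("battery too low")
    if status.in_use:
        blockers.append("in use")
    return blockers


ready = upgrade_blockers(DeviceStatus(online=True, battery_percent=80.0, in_use=False))
blocked = upgrade_blockers(DeviceStatus(online=False, battery_percent=10.0, in_use=True))
print(ready)
print(blocked)
```

Recording the blocker for each skipped device, rather than just skipping it, makes it possible to see later which reason dominates the long tail.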

By the way, congratulations if you spotted that in Jan-2020 the proportion of devices running new firmware actually decreased for a while (the blue v0.3 area stops shrinking and actually grows a little). How is this even possible? For a number of reasons: Sometimes devices may be intentionally downgraded ("rolled back") if a new firmware version is discovered to have issues. But also remember that our device estate is constantly growing in size, and the chart is showing us the results as a proportion of this growing estate. New devices could be running old software when they first come alive, because they've been stuck in a warehouse for a year.
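A quick worked example, with made-up numbers, shows how the share of old firmware can grow even while upgrades are succeeding:

```python
def share_percent(count, total):
    """Percentage of the estate represented by count devices."""
    return 100.0 * count / total


# Illustrative figures only: 1000 devices, 200 of them still on v0.3.
before = share_percent(200, 1000)

# During the month, 50 of those are upgraded, but 300 new devices come
# online straight from the warehouse, all still running v0.3.
after = share_percent(200 - 50 + 300, 1000 + 300)

print(f"v0.3 share before: {before:.1f}%")
print(f"v0.3 share after:  {after:.1f}%")
```

The upgrade campaign made progress in absolute terms, yet the v0.3 share rose from 20% to roughly 35% because the estate itself grew.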

You probably don't even want the latest version on all devices immediately. Sometimes minor firmware versions are released (e.g. v0.51 and v0.52 here) to patch a specific, urgent problem that only one important customer is experiencing. Because of the urgency, such a version may have undergone, shall we say, less-than-exhaustive testing. That is a calculated risk worth taking for that one customer, perhaps, but not across the whole device estate, where customers are perfectly happy with their current version. No upgrade is without risk: each carries some chance of regressions and/or of leaving some devices in a non-functional state.

Even regular, major firmware upgrades probably shouldn't be rolled out to all devices immediately, even if we could, because the real world will test your devices far more thoroughly than any in-house tests possibly can. It is therefore prudent to roll out to only a fraction of your customer base (perhaps 5%), and then wait a week or two while carefully watching all the service monitoring KPIs and customer reports, to ensure that there aren't any unexpected regressions. Combined with the challenge of getting any release to 100% of the estate, this tends to lead to an "S-shaped" curve for firmware deployment - cautiously slow at first, then reaching the majority as fast as possible, then a long tail of stragglers that are hard to upgrade:

[Chart: firmware rollout S-curve]
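One common way to pick the initial fraction is to hash each device ID into a stable bucket, so that the same devices stay in the cohort as the percentage is raised. The sketch below assumes string device IDs and a hypothetical campaign name; it is an illustration of the technique, not a description of any particular product's rollout mechanism.

```python
import hashlib


def in_rollout(device_id, campaign, percent):
    """Return True if this device falls inside the first percent of the rollout.

    The bucket depends only on the campaign name and device ID, so a device
    selected at 5% remains selected when the rollout widens to 50%.
    """
    digest = hashlib.sha256(f"{campaign}:{device_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


# Hypothetical estate and campaign name, for illustration only.
devices = [f"dev-{i}" for i in range(10_000)]
campaign = "v0.5-rollout"

cohort_5 = {d for d in devices if in_rollout(d, campaign, 5)}
cohort_50 = {d for d in devices if in_rollout(d, campaign, 50)}

print(len(cohort_5), "devices in the 5% cohort")
print(len(cohort_50), "devices in the 50% cohort")
```

Including the campaign name in the hash means each new release picks a different set of early devices, so the same customers are not always the first to meet a regression.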

Measure, measure, measure

Of course, a key concept behind managing software is the ability to measure how good it is - which means measuring the actual customer experience in the field. As I said above, the real world will test your software far better than any in-house tests can possibly achieve, and this is especially true for IoT software because devices are exposed to such a wide range of uncontrolled externalities, including environmental factors and 3rd-party equipment, as well as user behaviour. But we can take advantage of all this extra in-the-field testing if we actually gather all that data, live, from our entire estate of connected devices. 

Luckily, DevicePilot provides the ability to do exactly that - to measure the quality of each software version in the field, even though devices are constantly being upgraded. It can produce charts like the one below, which shows the up-time of devices by software version. Up-time is probably the simplest metric, but equally we could analyse for fault states or other criteria. We can see at a glance that v0.5 suffered a serious regression (though we should also consider statistical significance, if the number of devices running that version was small).

Think for a moment about how this is calculated. Right now a particular device might be working, but it was not working for most of the past month. This is why it's not sufficient to run your business based only on a live snapshot - you have to capture and analyse historical data, too, because performance is always measured over some period of time - one month in this case. And during that period, lots will change - including software versions on some devices.

So it's not as simple as just selecting devices according to their current firmware version, because, for example, a specific device might have been running v0.4 at the start of the period but then been upgraded to v0.5 halfway through, and its performance may have changed accordingly. So the "group by version" analysis needs to be dynamic in time - something that regular databases struggle to achieve, but which DevicePilot was built to do, because it's a natural question in Service Monitoring.

[Chart: uptime by firmware version]
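The time-dynamic grouping described above can be sketched as follows: each slice of up-time is credited to whichever version the device was running at that moment. This is an illustrative calculation only, not how DevicePilot is implemented; the data layout (intervals measured in days from the start of the period) and all names are assumptions.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two intervals (0 if they don't meet)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def uptime_by_version(devices, period_start, period_end):
    """Return {version: up-time percentage} over the reporting period.

    devices: list of dicts with hypothetical keys
      'versions': [(start, end, version), ...]  when each version was running
      'up':       [(start, end), ...]           when the device was working
    """
    running = defaultdict(float)  # total time spent on each version
    up = defaultdict(float)       # time spent working while on each version
    for device in devices:
        for v_start, v_end, version in device["versions"]:
            # Clip the version interval to the reporting period.
            v_start = max(v_start, period_start)
            v_end = min(v_end, period_end)
            if v_end <= v_start:
                continue
            running[version] += v_end - v_start
            for u_start, u_end in device["up"]:
                up[version] += overlap(v_start, v_end, u_start, u_end)
    return {v: 100.0 * up[v] / running[v] for v in sorted(running)}


# Made-up data for a 30-day period.
devices = [
    # Upgraded from v0.4 to v0.5 on day 15, then down from day 15 to day 24.
    {"versions": [(0, 15, "v0.4"), (15, 30, "v0.5")], "up": [(0, 15), (24, 30)]},
    # Stayed on v0.4 all month, down for the last three days.
    {"versions": [(0, 30, "v0.4")], "up": [(0, 27)]},
]
result = uptime_by_version(devices, 0, 30)
print(result)
```

Grouping by current version instead would charge the first device's whole month (21 working days out of 30, or 70%) to v0.5, hiding the fact that it worked perfectly on v0.4 and only 40% of the time after the upgrade.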

Conclusions

Managing software versions is an ongoing task for any device estate, and there is a balance to be struck between the effort, cost and inconvenience of getting as many devices as possible upgraded, versus the effort and reputational damage that comes from having a confusing array of ageing software versions running across the device estate. 

DevicePilot makes it easy to build charts like those shown here to fully understand the quality of software versions old and new. It also provides tools and techniques to proactively manage individual firmware upgrade campaigns, measuring their progress and spotting any regressions quickly.

End result: Happier customers, with less effort.
