When it comes to the internet of things (IoT), it’s time for developers to place a greater focus on the “things” — connected devices. The number of cellular IoT connections will more than double to upwards of 5 billion in 2027. And as they develop for the cloud and data centers, developers must start paying attention to the challenges that could arise with the explosion of connected devices.
With this backdrop, managing devices at scale is an ever-increasing problem. Each device is effectively a remote computer, and companies delivering IoT solutions will soon find themselves dealing with the problem of managing a large fleet of devices that need to be updated, secured and monitored.
These are the traditional problems of running production software, taken to a new scale: lack of visibility, devices running outdated software, devices with security vulnerabilities that need to be blocked and patched, and more.
Further, devices are often not physically accessible, so there is no hands-on way to recover from a bad update. Groups of devices could be “bricked” if they’re updated without a rollback mechanism. With all of these moving parts in play, there are four key things IoT developers need to keep in mind.
The first key to effective device management is making sure all devices are securely and uniquely identified. Even though devices can be thought of as “cattle” (meaning no individual device is itself important), in practice devices are often not interchangeable with one another. Rather, each device is a unique runtime, acting almost as a mini data center. It is therefore important to be able to identify each one uniquely at scale.
Once device identity is established, devices can be annotated with the right metadata for management and grouping.
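As a rough illustration of identity plus metadata, the sketch below derives a device ID from a hash of the device's public key and attaches free-form tags used for grouping. The ID scheme, the `Registry` class and the tag names are all assumptions made for this example, not a description of any particular product.

```python
import hashlib
from dataclasses import dataclass, field


def device_id_from_public_key(public_key_pem: bytes) -> str:
    """Derive a stable ID from the device's public key (assumed scheme)."""
    return hashlib.sha256(public_key_pem).hexdigest()[:16]


@dataclass
class Device:
    device_id: str
    tags: dict = field(default_factory=dict)  # metadata used for grouping


class Registry:
    """Hypothetical in-memory registry of enrolled devices."""

    def __init__(self) -> None:
        self._devices = {}

    def enroll(self, public_key_pem: bytes, **tags: str) -> Device:
        device = Device(device_id_from_public_key(public_key_pem), dict(tags))
        self._devices[device.device_id] = device
        return device

    def select(self, **wanted: str) -> list:
        """Return devices whose tags match every requested key/value pair."""
        return [
            d for d in self._devices.values()
            if all(d.tags.get(k) == v for k, v in wanted.items())
        ]


registry = Registry()
dev_a = registry.enroll(b"public-key-a", region="eu", model="gw-2")
dev_b = registry.enroll(b"public-key-b", region="us", model="gw-2")
eu_devices = registry.select(region="eu")  # only dev_a matches
```

Tying the ID to key material the device already holds means the same value can later be checked cryptographically, although a real enrollment flow would involve certificates and attestation that this sketch leaves out.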
Next, teams need to ensure management can work across many network topologies, including intermittently disconnected scenarios. Many devices sit behind double Network Address Translation (NAT) topologies and cannot accept remote connections, requiring a gateway or a secure tunneling agent that runs on the device.
Finally, you need to guarantee that such an agent mechanism running on the device is lightweight and can self-update with a high degree of trust.
When monitoring devices, it’s challenging to see the big picture across many data points while still identifying issues with individual devices. The “big picture” view can be achieved by collecting metrics on groups of devices and aggregating them by the metadata tags assigned to devices (or device groups).
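To make the group view concrete, here is a minimal sketch that rolls up one metric by a metadata tag. The sample layout, the `cpu_pct` metric and the `region` tag are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_tag(samples, tag):
    """Group CPU samples by the value of one metadata tag and summarize each group."""
    groups = defaultdict(list)
    for sample in samples:
        key = sample["tags"].get(tag, "untagged")
        groups[key].append(sample["cpu_pct"])
    return {
        key: {"count": len(values), "mean": mean(values), "max": max(values)}
        for key, values in groups.items()
    }


# Hypothetical samples reported by three devices.
samples = [
    {"device_id": "d1", "tags": {"region": "eu"}, "cpu_pct": 20.0},
    {"device_id": "d2", "tags": {"region": "eu"}, "cpu_pct": 40.0},
    {"device_id": "d3", "tags": {"region": "us"}, "cpu_pct": 90.0},
]
summary = aggregate_by_tag(samples, "region")
```

The per-group maximum is kept alongside the mean so that a single misbehaving device is not averaged out of the fleet-level view.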
Identifying issues with an individual device can be handled by adding “smartness” to the device agent, which can be updated dynamically. For example, the agent applies the rules assigned to it at runtime and can securely block untrusted behavior.
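One way to picture dynamically updated agent rules is as plain data that the agent evaluates against each event it sees. The `Rule` shape, the event fields and the `Agent` class below are assumptions for this sketch. A real agent would also verify that a rule update comes from a trusted source before accepting it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    attribute: str      # event attribute to inspect, e.g. "process"
    blocked: frozenset  # values that are not allowed for that attribute


class Agent:
    """Hypothetical on-device agent that enforces the current rule set."""

    def __init__(self) -> None:
        self._rules = []

    def update_rules(self, rules) -> None:
        # Replace the whole rule set; no agent restart is needed.
        self._rules = list(rules)

    def allow(self, event: dict) -> bool:
        """Return False if any rule blocks this event."""
        return not any(
            event.get(rule.attribute) in rule.blocked for rule in self._rules
        )


agent = Agent()
event = {"process": "cryptominer", "user": "root"}
allowed_before = agent.allow(event)  # no rules yet, so nothing is blocked
agent.update_rules([Rule("process", frozenset({"cryptominer"}))])
allowed_after = agent.allow(event)   # the new rule now blocks this event
```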
Another challenge is the ability to monitor devices with limited connectivity. This is often solved by aggregating data on the device and streaming it to a monitoring server when the device regains connectivity. But this can also be problematic, considering that many monitoring services don’t handle historical events well, particularly when you want issues to be visible as early as possible. A gateway service that is deployed in proximity to devices and acts as a monitoring proxy can help solve this problem.
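The on-device aggregation described above is often called store-and-forward. The following is a minimal sketch of it, under the assumption that the transport raises `ConnectionError` while the device is offline. The `StoreAndForward` and `FakeLink` names are invented for the example. Each event keeps its original timestamp so a gateway or server can place it correctly once it arrives.

```python
import time
from collections import deque


class StoreAndForward:
    """Buffer events while offline and flush them in order when the link returns."""

    def __init__(self, send, max_events: int = 1000) -> None:
        self._send = send                        # callable; raises ConnectionError when offline
        self._buffer = deque(maxlen=max_events)  # when full, the oldest events are dropped

    def record(self, name, value, now=None) -> None:
        ts = time.time() if now is None else now
        self._buffer.append({"name": name, "value": value, "ts": ts})
        self.flush()

    def flush(self) -> int:
        """Send buffered events oldest-first; stop at the first failure."""
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break
            self._buffer.popleft()
            sent += 1
        return sent

    def pending(self) -> int:
        return len(self._buffer)


class FakeLink:
    """Stand-in transport for the example: fails while `online` is False."""

    def __init__(self) -> None:
        self.online = False
        self.received = []

    def send(self, event) -> None:
        if not self.online:
            raise ConnectionError("link down")
        self.received.append(event)


link = FakeLink()
buffer = StoreAndForward(link.send)
buffer.record("temp_c", 21.5, now=100.0)  # offline: stays in the buffer
buffer.record("temp_c", 22.0, now=160.0)
pending_while_offline = buffer.pending()
link.online = True
flushed = buffer.flush()                  # both events are delivered in order
```

The bounded buffer is a deliberate trade-off for constrained devices: a long outage loses the oldest data rather than exhausting memory.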
Like any production deployment, the best practice of deploying to smaller groups of devices first applies. This can often be achieved with metadata attached to devices.
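A staged rollout driven by metadata might look like the sketch below. It assumes a hypothetical `ring` tag that marks canary devices, and it hashes the device ID to place every other device in a stable percentage bucket. The tag name and the hashing scheme are illustrative choices, not an established convention.

```python
import hashlib


def rollout_bucket(device_id: str, release: str) -> int:
    """Map a device to a stable bucket in the range 0..99 for a given release."""
    digest = hashlib.sha256(f"{release}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_rollout(device_id: str, tags: dict, release: str, percent: int) -> bool:
    """Canary devices always get the release; others join as `percent` grows."""
    if tags.get("ring") == "canary":
        return True
    return rollout_bucket(device_id, release) < percent


fleet = {f"dev-{i}": {"ring": "canary" if i < 3 else "general"} for i in range(200)}
wave_0 = {d for d, t in fleet.items() if in_rollout(d, t, "2.0.0", 0)}     # canaries only
wave_10 = {d for d, t in fleet.items() if in_rollout(d, t, "2.0.0", 10)}
wave_100 = {d for d, t in fleet.items() if in_rollout(d, t, "2.0.0", 100)}  # everyone
```

Because a device's bucket is fixed for a given release, raising the percentage only ever adds devices, so each wave is a superset of the one before it.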
Then, because of the sensitivity and potential risks of an upgrade, you need to guarantee two things through the agent software on the device. First, the update must be applied with minimal disruption to device service.
Second, the ability to roll back bad or failed updates is critical, especially for fleets too large to roll back manually or for devices that cannot be physically reached to “reset” a bad state. This makes auto-recovery from bad updates a must. Remote access to devices may help here, but again, it may not be practical for resetting a large set of devices.
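One common way to get both properties is a dual-slot (A/B) scheme: the new version is written to a standby slot while the active one keeps serving, and the device switches back automatically if a health check fails. The sketch below models only that control flow. The `ABUpdater` class and the `health_check` callback are assumptions for the example, and on real hardware the switch would typically involve the bootloader and a reboot.

```python
class ABUpdater:
    """Toy model of a dual-slot update with automatic rollback."""

    def __init__(self, current_version: str) -> None:
        self.slots = {"A": current_version, "B": None}
        self.active = "A"

    def _standby(self) -> str:
        return "B" if self.active == "A" else "A"

    @property
    def running_version(self):
        return self.slots[self.active]

    def apply(self, new_version: str, health_check) -> bool:
        """Install to the standby slot, switch over, and roll back if unhealthy."""
        previous = self.active
        target = self._standby()
        self.slots[target] = new_version  # active slot keeps serving during the write
        self.active = target              # switch-over is the only disruptive step
        try:
            healthy = bool(health_check(new_version))
        except Exception:
            healthy = False               # a crashing check counts as a failed update
        if not healthy:
            self.active = previous        # automatic rollback to the last good slot
        return healthy


updater = ABUpdater("1.0.0")
good = updater.apply("1.1.0", health_check=lambda version: True)
bad = updater.apply("1.2.0", health_check=lambda version: False)
version_after_bad_update = updater.running_version  # still the last good version
```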
Of course, the core of the device agent needs to be rock-solid yet minimal so the update agent itself stays up and auto-recovers from self-updates.
Remote access provides great troubleshooting and debugging capabilities when looking to identify issues on individual devices. Most often, problems arise across more than a single device as a result of either external changes or a version update. If monitoring data fails to provide details that are relevant and in context, a developer needs to access a troubled device and check the problem on the device itself.
In those cases remote access is indispensable, especially for otherwise inaccessible devices. Making remote access secure is important, as is making it easy for developers to use, for example by establishing connections through a web browser.
Across this entire workflow, automation is key. It requires you to create a fully automated software supply chain process to update your devices and apply runtime monitoring and security checks. This process can often be a natural extension to the existing software supply chain process — one that already exists for building and distributing your device software.
Keeping everything connected and maintaining fully automated processes allows you to achieve wonders with a small and nimble team, even for a large and complicated set of devices. Similar to enterprise “Super DevOps” groups that serve tens of thousands of developers with only a small team, trusted automation processes enable organizations to achieve the same for large, diverse fleets of devices.
All of this ultimately gives you traceability and visibility into all software running on your devices. For example, you can identify a security vulnerability on a device, block it at runtime using agent rules, and also quickly identify the software build that created the device software (and, potentially, other builds impacted by the same vulnerability).
You can then automatically build and release a new version that patches the security hole and distribute it automatically to regional update centers. From there you can roll out a gradual update to selected groups of devices to verify the fix.
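The traceability step in this workflow amounts to two lookups: from a vulnerable component to the builds that contain it, and from those builds to the devices running them. Here is a minimal sketch with made-up build records. The component name `libfoo`, the build IDs and the data layout are all hypothetical.

```python
# Hypothetical build metadata, as a software supply chain tool might record it.
BUILDS = {
    "build-101": {"libfoo": "2.3.0", "agent": "1.0.0"},
    "build-102": {"libfoo": "2.3.1", "agent": "1.0.0"},
    "build-103": {"libfoo": "2.4.0", "agent": "1.1.0"},
}

# Hypothetical record of which build each device is currently running.
DEVICE_BUILD = {"dev-a": "build-101", "dev-b": "build-102", "dev-c": "build-103"}


def builds_with(component, vulnerable_versions, builds=BUILDS):
    """Return IDs of builds that include a vulnerable version of the component."""
    return sorted(
        build_id
        for build_id, components in builds.items()
        if components.get(component) in vulnerable_versions
    )


def devices_running(build_ids, device_build=DEVICE_BUILD):
    """Return IDs of devices whose current build is in the given set."""
    wanted = set(build_ids)
    return sorted(d for d, build_id in device_build.items() if build_id in wanted)


bad_builds = builds_with("libfoo", {"2.3.0", "2.3.1"})
to_patch = devices_running(bad_builds)
```

The resulting device list can then serve as the target group for the gradual rollout of the patched version.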
In the end, the goal is for these four elements — device management, real-time monitoring, efficient software updates and remote access — to work together. Doing so effectively while embracing automation keeps developer teams nimble and end-users and customers happy, creating a win-win situation for team morale and overall business objectives.