Seamless migration: Securely transitioning large IoT fleets to AWS

Large-scale IoT fleet migrations to the cloud are among the most complex technical transformations that organizations face today. While the benefits of cloud migration are clear, successful implementation requires careful planning and execution. In a previous blog post, we elaborated on key reasons to migrate to AWS IoT Core. In this blog post, we’ll share a proven strategy for transitioning IoT fleets with hundreds of millions of devices to AWS IoT Core: we address common challenges, outline a specific migration scenario, and delve into the AWS IoT Core features that facilitate complex migrations.
Challenges with self-managed IoT messaging brokers
Many organizations begin their IoT journey with self-managed messaging brokers. While this approach offers initial control and flexibility, it often becomes increasingly challenging as device fleets expand. Understanding these challenges is crucial before embarking on a cloud migration journey.
High costs
The financial impact of maintaining and operating self-managed IoT infrastructure extends far beyond basic hosting costs. Organizations frequently struggle with inefficient capacity planning, requiring dedicated engineering teams to manage infrastructure. These teams must constantly balance competing priorities across different departments while maintaining system reliability. The overhead costs of monitoring, security, and compliance add another layer of complexity to the financial equation.
Compute matching
One of the most demanding aspects of managing IoT infrastructure is matching compute resources to workload demands. Peak usage periods require excess capacity to maintain performance, while low-usage periods result in wasteful resource allocation. This challenge becomes particularly acute when managing global deployments, where usage patterns vary by region and time zone. Organizations often find themselves either over-provisioning resources to ensure reliability or risking performance issues during unexpected usage spikes. Demand also varies with the phase of development: usage patterns during a proof of concept (PoC) differ from those at scale.
Unsolved security challenges
Security presents perhaps the most critical challenge in large-scale IoT deployments. Managing millions of connected devices requires sophisticated security protocols, including certificate management, real-time threat detection, update mechanisms, and secure data transmission. As regulatory requirements evolve, organizations must continuously update their security practices while maintaining uninterrupted service. This becomes increasingly complex as device fleets grow and geographic distribution expands.
Slow innovation
Perhaps the most significant hidden cost of self-managed brokers is their impact on innovation. Engineering teams spend considerable time maintaining existing infrastructure rather than developing new features or improving customer experiences. This maintenance burden often leads to delayed product launches and missed market opportunities, affecting the organization’s competitive position.
Customer scenario and requirements
Let’s consider a migration scenario that demonstrates how even complex IoT environments can successfully transition to AWS IoT Core.
Figure 1: Customer scenario before the migration
Architecture
Imagine a customer with the following setup, visualized in Figure 1:
- 10 million devices: Connecting daily from various locations worldwide.
- On-premises solution: Devices initially connect to an on-premises broker and to backend services that contain the logic for consumers such as internal or support applications.
- DNS server: Used for connecting to the self-managed MQTT broker.
- 80+ backend services: Distributed microservices architecture with 20-100 instances per service.
- API Gateway: Consuming applications interact with backend services through an API gateway.
Technical requirements for the new solution
The new solution must meet stringent technical requirements to ensure a seamless transition:
- Zero-touch device updates: The entire device fleet must transition without firmware modifications or manual interventions, as field updates are not feasible within the expected migration timelines. This is considered one of the most challenging migration requirements.
- Protocol compatibility: Seamless support for both MQTT 3.1.1 and MQTT 5 is essential, as the device fleet includes multiple generations of hardware running different protocol versions.
- Advanced message distribution: Backend services require shared subscription capabilities to maintain efficient load balancing and ensure consistent message processing across service instances.
AWS IoT Core features for complex migrations
AWS IoT Core offers a suite of features specifically designed to support challenging migrations like the one described above.
AWS IoT Core operates on a shared responsibility model that defines security and operational boundaries. AWS manages and secures the underlying infrastructure, including physical data centers, service maintenance, and service availability. Customers remain responsible for securing their applications, implementing device-level security, managing certificates, and developing their business logic on top of AWS IoT Core.
Figure 2: AWS IoT Core features
Here’s a look at some key capabilities (highlighted services are particularly relevant to the customer architecture):
- Identity service: Advanced device authentication using X.509 certificates, custom Certificate Authorities support, and fine-grained access control through AWS IoT policies.
- Device Gateway: Highly scalable connectivity supporting millions of concurrent connections, with multi-protocol support (HTTPS, MQTT, MQTT over WebSockets, and LoRaWAN), and automatic load balancing.
- Message broker: Low-latency message distribution with MQTT 3.1.1 and MQTT 5 support, shared subscriptions, and message retention capabilities.
- Registry: Comprehensive device catalog with flexible metadata management, dynamic thing groups, and integration with AWS IoT Device Management.
Key features for challenging migrations
AWS IoT Core offers a robust set of features designed to simplify complex IoT fleet migrations and address common challenges when upgrading to a managed AWS IoT Core solution. A key aspect of a phased migration is that these techniques enable the backend services and devices to migrate at their own pace, minimizing downtime and disruption. Let’s explore in more detail some essential capabilities relevant for the migration scenario depicted in the customer scenario section:
- Custom domain: This capability stands out as a crucial feature for large-scale migrations. It eliminates one of the most significant migration barriers by allowing organizations to use their existing domains with AWS IoT Core endpoints. This means devices can continue operating with their current configurations, significantly reducing the risk and complexity of the migration process. In addition, customers can configure TLS policies and versions as well as the protocols and ports for the endpoints in use.
- MQTT support (MQTT 3 and MQTT 5): In heterogeneous IoT deployments, devices often utilize different MQTT versions. AWS IoT Core supports both MQTT 3.1.1 and MQTT 5, enabling interoperability between devices using different MQTT versions. This ensures a smooth migration, without forcing you to upgrade all devices to the latest MQTT standard simultaneously.
- Bring your own certificate authority (CA): Maintaining existing security infrastructure is crucial during a migration. AWS IoT Core allows you to register your existing CA with AWS IoT Core, establishing a chain of trust between your devices and AWS IoT Core without requiring devices to re-enroll with new certificates. This eliminates the need for certificate rotation during migration.
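To make the custom domain feature more concrete, here is a minimal sketch of the request parameters one might pass to the AWS IoT `CreateDomainConfiguration` API. The domain name, certificate ARN, and the helper `build_domain_configuration` are hypothetical, and the TLS security policy name is an assumption that should be checked against the policies currently offered by AWS IoT Core.

```python
def build_domain_configuration(
    domain_name: str,
    server_cert_arn: str,
    tls_policy: str = "IotSecurityPolicy_TLS12_1_2_2022_10",  # assumed policy name
) -> dict:
    """Return keyword arguments for a CreateDomainConfiguration request.

    The server certificate is expected to be stored in AWS Certificate Manager
    and to cover the existing domain that devices already connect to.
    """
    if not server_cert_arn.startswith("arn:aws:acm:"):
        raise ValueError("expected an ACM certificate ARN")
    return {
        # Hypothetical naming convention: derive the configuration name from the domain.
        "domainConfigurationName": domain_name.replace(".", "-"),
        "domainName": domain_name,
        "serverCertificateArns": [server_cert_arn],
        "serviceType": "DATA",
        "tlsConfig": {"securityPolicy": tls_policy},
    }


# Hypothetical values for illustration only.
params = build_domain_configuration(
    "mqtt.example.com",
    "arn:aws:acm:eu-central-1:123456789012:certificate/example",
)
print(params["domainConfigurationName"])

# With boto3 installed and credentials configured, the request could be sent with:
#   boto3.client("iot").create_domain_configuration(**params)
```

Keeping the existing domain name is what allows devices to reconnect without any change to their configured endpoint.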
In recent months, AWS IoT Core has introduced new features that further enhance the migration process and improve overall functionality:
- Message enrichment with registry metadata: Propagate device attributes stored in the registry with every message, eliminating the need for AWS Lambda functions or compute instances to retrieve this information from other sources.
- Thing-to-connection association: A thing is an entry in the registry that contains attributes that describe a device. Policies determine which operations a device can perform in AWS IoT. This new feature enables thing policy variables for devices with any client ID format, resolving a critical migration blocker where client IDs didn’t conform to AWS IoT Core’s thing naming restrictions. Once configured, it also allows multiple client IDs per certificate and thing, providing flexibility without changing existing device configurations or ID formats.
- Client ID in just-in-time registration (JITR): Perform additional security validations during JITR by receiving client ID information.
- Custom client certificate validation: Enables custom certificate validation through AWS Lambda functions during device connection, supporting integration with external validation services like Online Certificate Status Protocol (OCSP) responders for enhanced security controls.
- Custom authentication with X.509 client certificates: Extend certificate validation through an AWS Lambda function that can also specify policies for the connected devices at runtime. This complements the existing Custom Authorizer feature, which offers a similar approach for JSON Web Tokens (JWTs) and username/password credentials.
- ALPN TLS extension removal: The Application Layer Protocol Negotiation (ALPN) extension is no longer required in the Transport Layer Security (TLS) handshake, removing a barrier for devices that lack ALPN support.
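As an illustration of how thing policy variables can scope permissions per device, the following sketch builds an AWS IoT policy document in Python. The topic prefix, account ID, and the helper `build_device_policy` are hypothetical. Whether the `iot:Connection.Thing.IsAttached` condition fits depends on how the thing association is configured in your account, so treat it as an assumption.

```python
import json


def build_device_policy(region: str, account_id: str, topic_prefix: str) -> dict:
    """Build an AWS IoT policy that scopes topics to the connected thing's name."""
    arn = f"arn:aws:iot:{region}:{account_id}"
    # Policy variable resolved by AWS IoT Core at evaluation time, not by Python.
    thing = "${iot:Connection.Thing.ThingName}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Any client ID may connect, as long as a thing is attached
                # to the connection (assumption, see lead-in).
                "Effect": "Allow",
                "Action": "iot:Connect",
                "Resource": f"{arn}:client/*",
                "Condition": {"Bool": {"iot:Connection.Thing.IsAttached": "true"}},
            },
            {
                "Effect": "Allow",
                "Action": ["iot:Publish", "iot:Receive"],
                "Resource": f"{arn}:topic/{topic_prefix}/{thing}/*",
            },
            {
                "Effect": "Allow",
                "Action": "iot:Subscribe",
                "Resource": f"{arn}:topicfilter/{topic_prefix}/{thing}/*",
            },
        ],
    }


policy = build_device_policy("eu-central-1", "123456789012", "devices")
print(json.dumps(policy, indent=2))
```

Because the thing name comes from the registry rather than the client ID, a single policy like this could serve devices whose client IDs keep their legacy format.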
These features offer greater flexibility, security, and efficiency for managing your IoT fleet in AWS IoT Core. By leveraging these key features, you can minimize the complexities and risks associated with migrating large IoT fleets, ensuring a seamless transition to a modern, scalable, and secure cloud-based IoT platform.
Target architecture
The target architecture involves transitioning the 10 million devices to connect to AWS IoT Core via Amazon Route 53 (or any DNS server). The backend services, API gateway, and consuming applications remain the same.
Figure 3: Target architecture
Migration strategy
The migration strategy is built on five key pillars designed to ensure a seamless transition. The process starts with maintaining a risk-free approach through careful planning and testing, while keeping operations controlled with thorough documentation and monitoring. The strategy emphasizes maintaining a minimal error surface through precise execution and validation steps.
Aligned with these strategy principles, we recommend a phased approach. Each phase has specific objectives and dependencies, allowing you to carefully monitor progress and adjust your approach as needed.
Let’s explore each phase in detail, highlighting the rationale behind the choices and providing a real-world example.
Phase 0: Preparation
The preparation phase sets the groundwork for a successful migration. During this critical stage, we focus on establishing a bridge between existing infrastructure and AWS IoT Core, ensuring uninterrupted operations throughout the migration process.
At the heart of this phase is the implementation of a republish layer. This crucial component acts as an intermediary, facilitating bidirectional communication between your self-managed broker and AWS IoT Core. Think of it as building a secure tunnel that allows messages to flow seamlessly between both systems.
Figure 4: Architecture of the Preparation Phase
The republish layer consists of two primary components:
- Device to backend (DTB): This component captures messages from devices connected to your self-managed broker and forwards them to AWS IoT Core. By implementing this path first, we can begin migrating backend services while devices stay connected to the self-managed broker.
- Backend to device (BTD): Working in parallel, this component ensures that messages from newly migrated backend services reach devices still connected to the self-managed broker. This bidirectional capability maintains system integrity throughout the migration process.
For optimal performance, we recommend implementing the republish layer using container services, such as Amazon Elastic Container Service (ECS), or other compute options based on your specific needs. The code for these components is straightforward: each one subscribes to a topic on one broker and publishes every received message to the other broker. Deploying on a container service lets you scale the number of instances up and down to match the needs of the migration.
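The subscribe-and-forward logic of the republish layer can be sketched as follows. This is a simplified illustration, not the implementation used in the scenario: `InMemoryBroker` is a hypothetical stand-in for a real MQTT client connection, matching is done by plain topic prefix instead of MQTT wildcards, and the `telemetry/` and `commands/` prefixes are assumed topic names.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str, bytes], None]


class InMemoryBroker:
    """Hypothetical stand-in for a broker connection (a real one would wrap an MQTT client)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.handlers: Dict[str, List[Handler]] = {}
        self.published: List[Tuple[str, bytes]] = []

    def subscribe(self, prefix: str, handler: Handler) -> None:
        self.handlers.setdefault(prefix, []).append(handler)

    def publish(self, topic: str, payload: bytes) -> None:
        self.published.append((topic, payload))
        for prefix, handlers in self.handlers.items():
            if topic.startswith(prefix):
                for handler in handlers:
                    handler(topic, payload)


def republish(source: InMemoryBroker, target: InMemoryBroker, prefix: str) -> None:
    """Forward every message under `prefix` from the source broker to the target broker."""
    source.subscribe(prefix, lambda topic, payload: target.publish(topic, payload))


self_managed = InMemoryBroker("self-managed")
iot_core = InMemoryBroker("aws-iot-core")

# Device to backend (DTB): device messages flow toward AWS IoT Core.
republish(self_managed, iot_core, "telemetry/")
# Backend to device (BTD): backend messages flow back to the self-managed broker.
republish(iot_core, self_managed, "commands/")

self_managed.publish("telemetry/device-1", b'{"temp": 21}')
iot_core.publish("commands/device-1", b'{"reboot": true}')
```

In this sketch, the two directions use disjoint topic prefixes, so a forwarded message is never picked up by the opposite path and sent back again.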
Phase 1: Backend migration
This phase focuses on migrating backend services from the self-managed broker to AWS IoT Core. Let’s understand how we leverage the republishing layer to migrate the backends step by step without losing any messages.
Device to backend republishing layer
During backend migration, maintaining consistent message distribution through shared subscriptions is critical to avoid overloading any of the existing or new subscribers. The republishing layer integrates seamlessly with existing instances using the same shared subscription pattern, ensuring balanced message consumption. As messages flow through this layer to AWS IoT Core and migrated backend instances, we carefully control the introduction of each component to prevent system overload. This measured approach enables gradual migration while preserving the original message distribution patterns and system stability.
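To illustrate why a republishing instance can join an existing shared subscription without disturbing the balance, the sketch below builds an MQTT 5 shared subscription filter (`$share/<group>/<filter>`) and simulates message distribution across group members. The group and member names are hypothetical, and round-robin delivery is an assumption made for illustration; real brokers may use a different distribution strategy.

```python
from itertools import cycle
from typing import Dict, Iterable, List


def shared_filter(group: str, topic_filter: str) -> str:
    """Build an MQTT 5 shared subscription filter for the given group."""
    if not group or any(char in group for char in "/+#"):
        raise ValueError("share name must be non-empty and must not contain '/', '+' or '#'")
    return f"$share/{group}/{topic_filter}"


def distribute(messages: Iterable[object], members: List[str]) -> Dict[str, int]:
    """Count deliveries when each message goes to exactly one group member."""
    counts = {member: 0 for member in members}
    for _message, member in zip(messages, cycle(members)):
        counts[member] += 1
    return counts


# Two existing backend instances plus one DTB republishing instance share the load.
members = ["backend-a-1", "backend-a-2", "dtb-republisher-1"]
subscription = shared_filter("service-a", "telemetry/#")
load = distribute(range(9), members)
print(subscription, load)
```

Each member receives only its share of the messages, so adding a republishing instance to the group reduces the load on the other members rather than duplicating traffic.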
Backend to device republishing layer
The backend to device (BTD) republishing layer is prepared and configured at the Amazon ECS cluster level, establishing connections to AWS IoT Core for message consumption. Unlike the device to backend (DTB) layer, all BTD republishing instances can be deployed simultaneously because each instance handles distinct device topics, which eliminates the risk of system overload. This enables faster backend migration while maintaining reliable message delivery to devices.
Figure 5: Architecture visualizing the Backend to Device Republishing Layer for the migration of service A
During backend migration, establishing an AWS IoT Core rule to persist messages to Amazon Simple Storage Service (S3) serves as a crucial safety net. This message backup enables recovery and reprocessing if unexpected issues occur during the transition, ensuring no device messages are lost.
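A backup rule of this kind could look like the following sketch, which builds the payload for the AWS IoT `CreateTopicRule` API. The bucket name, role ARN, topic filter, and rule name are hypothetical, and the object key layout is just one possible choice.

```python
def build_backup_rule(topic_filter: str, bucket: str, role_arn: str) -> dict:
    """Build a topic rule payload that stores every matching message in Amazon S3."""
    return {
        "sql": f"SELECT * FROM '{topic_filter}'",
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {
                "s3": {
                    "roleArn": role_arn,
                    "bucketName": bucket,
                    # Substitution templates are evaluated by the rules engine per message,
                    # so each message lands under its topic with a timestamp-based name.
                    "key": "${topic()}/${timestamp()}.json",
                }
            }
        ],
    }


# Hypothetical values for illustration only.
payload = build_backup_rule(
    "telemetry/#",
    "example-migration-backup",
    "arn:aws:iam::123456789012:role/example-iot-backup-role",
)
print(payload["sql"])

# With boto3 installed and credentials configured, the rule could be created with:
#   boto3.client("iot").create_topic_rule(
#       ruleName="backup_device_messages", topicRulePayload=payload)
```

Writing one object per message keeps reprocessing simple, although for high message rates a batching approach may be more cost-effective.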
With the republishing layer in place and thoroughly tested, the migration process follows a systematic pattern:
- Introduce the first DTB republishing instance
- Verify message flow through this instance to AWS IoT Core and back to devices
- Remove the corresponding unmigrated backend instance
- Progress incrementally through all backend instances
This methodical approach facilitates a smooth transition of all backend services to AWS IoT Core. The same strategy extends to other platform services, maintaining operational continuity throughout the process.
Figure 6: Architecture visualizing the completion of the backend migration to AWS IoT
Phase 2: Device migration
This phase requires particular attention to detail, as it directly impacts end-user experience and device connectivity.
The key to a successful device migration lies in implementing a weighted DNS routing strategy with a service like Amazon Route 53. Other routing strategies and DNS servers can be used as well. This approach allows for granular control over the transition:
- Begin with a small percentage (typically 1-2%) of traffic routed to AWS IoT Core.
- Monitor device connections, message delivery, exceeded throttling limits, and error rates using AWS IoT metrics and dimensions in Amazon CloudWatch.
- Gradually increase the percentage based on performance metrics.
- Maintain the ability to quickly revert traffic if needed.
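The steps above can be sketched as a small ramp-up helper combined with a Route 53 change batch for weighted records. The host names, the doubling schedule, and the 1% error threshold are assumptions for illustration, and the functions `weights_for_share`, `weighted_record`, and `next_share` are hypothetical helpers rather than part of any AWS SDK.

```python
from typing import Tuple


def weights_for_share(iot_core_percent: int) -> Tuple[int, int]:
    """Return (self_managed_weight, iot_core_weight) for a target traffic share."""
    if not 0 <= iot_core_percent <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return 100 - iot_core_percent, iot_core_percent


def weighted_record(name: str, set_id: str, target: str, weight: int, ttl: int = 60) -> dict:
    """Build one UPSERT change for a weighted CNAME record."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": set_id,
            "Weight": weight,
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        },
    }


def next_share(current_percent: int, error_rate: float,
               max_error_rate: float = 0.01, first_step: int = 2) -> int:
    """Double the share while errors stay low; revert all traffic otherwise."""
    if error_rate > max_error_rate:
        return 0  # roll back to the self-managed broker
    return min(100, max(current_percent * 2, first_step))


old_weight, new_weight = weights_for_share(2)
batch = {
    "Changes": [
        weighted_record("mqtt.example.com", "self-managed",
                        "broker.onprem.example.com", old_weight),
        weighted_record("mqtt.example.com", "aws-iot-core",
                        "example-ats.iot.eu-central-1.amazonaws.com", new_weight),
    ]
}
print(next_share(2, 0.001), next_share(4, 0.05))

# With boto3 installed and credentials configured, the batch could be applied with:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone ID>", ChangeBatch=batch)
```

Weighted DNS shifts the share of DNS responses rather than individual devices, so the observed split is approximate and depends on record TTLs and resolver caching. A short TTL keeps the rollback path fast.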
During this phase, we leverage AWS IoT Core’s just-in-time registration capabilities to automatically provision resources for connecting devices. This automation significantly reduces the operational overhead of managing large-scale migrations.
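A just-in-time registration handler might look like the following sketch, written as the core of an AWS Lambda function that reacts to the certificate registration event. The policy name and Region are hypothetical, and `RecordingClient` is a stand-in used here so the example runs without AWS access; in a real function the client would be `boto3.client("iot")`.

```python
from typing import Any, Dict, List, Tuple


def handle_jitr_event(event: Dict[str, Any], iot_client: Any,
                      policy_name: str, region: str) -> str:
    """Attach a policy to a newly registered certificate and activate it."""
    cert_id = event["certificateId"]
    account_id = event["awsAccountId"]
    cert_arn = f"arn:aws:iot:{region}:{account_id}:cert/{cert_id}"
    # Attach the policy first so the device has permissions as soon as it is active.
    iot_client.attach_policy(policyName=policy_name, target=cert_arn)
    iot_client.update_certificate(certificateId=cert_id, newStatus="ACTIVE")
    return cert_arn


class RecordingClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def attach_policy(self, **kwargs: Any) -> None:
        self.calls.append(("attach_policy", kwargs))

    def update_certificate(self, **kwargs: Any) -> None:
        self.calls.append(("update_certificate", kwargs))


# Example event with the fields a registration event is expected to carry.
event = {
    "certificateId": "abc123",
    "caCertificateId": "ca456",
    "certificateStatus": "PENDING_ACTIVATION",
    "awsAccountId": "123456789012",
}
client = RecordingClient()
arn = handle_jitr_event(event, client, "example-device-policy", "eu-central-1")
print(arn)
```

A production handler would typically add validation before activation, for example checking the certificate against an allow list, and could also create the corresponding thing in the registry.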
Figure 7: Architecture visualizing the Device Migration
After completing device migration, the republishing layer remains active, continuing to forward messages to the self-managed broker. This design provides a critical rollback path – should any issues arise, traffic can be immediately reverted to the self-managed broker while maintaining full message delivery between devices and backend services.
Phase 3: Cleanup
The cleanup phase marks the final step in the migration journey. The republishing layer naturally phases out first, creating a clean isolation of the self-managed broker. Once monitoring systems and dependent processes confirm zero traffic to the self-managed broker, and all systems operate smoothly through AWS IoT Core, the broker’s decommissioning completes the migration.
Figure 8: Architecture visualizing the finished migration matching the target architecture
This measured sequence ensures a graceful transition while maintaining system stability throughout the final migration phase.
Conclusion
Organizations can successfully migrate their large IoT fleets to AWS IoT Core by following the outlined phased approach and adhering to the five strategic pillars. This pattern reduces risk and provides failback mechanisms as safeguards throughout each migration step. The structured progression through preparation, backend migration, device migration, and cleanup phases ensures a methodical and secure transition, allowing both backend services and devices to migrate at their own pace while maintaining operational stability.
For a more detailed and interactive explanation of this migration journey, we invite you to watch our comprehensive walkthrough on the AWS IoT YouTube channel: Part 1 and Part 2. These videos provide additional insights and practical demonstrations of the concepts covered in this blog post. To learn about customers and partners that have migrated their solution to AWS IoT, please check out this blog post.
Remember, a successful IoT migration is not just about moving systems – it’s about building a foundation for future scalability while ensuring business continuity throughout the transition.
About the Authors
Andrea Sichel is a Principal Specialist IoT Solutions Architect at Amazon Web Services, where he helps customers navigate their cloud adoption journey in the IoT space. Driven by curiosity and a customer-first mindset, he works on developing innovative solutions while staying at the forefront of cloud technology. Andrea enjoys tackling complex challenges and helping organizations think big about their IoT transformations. Outside of work, Andrea coaches his son’s soccer team and pursues his passion for photography. When not behind the camera or on the soccer field, you can find him swimming laps to stay active and maintain a healthy work-life balance.
Katja-Maja Kroedel is a passionate Advocate for Databases and IoT at AWS, where she helps customers leverage the full potential of cloud technologies. With a background in computer engineering and extensive experience in IoT and databases, she works closely with customers to provide guidance on cloud adoption, migration, and strategy in these areas. Katja is passionate about innovative technologies and enjoys building and experimenting with cloud services like AWS IoT Core and Amazon RDS.