For a variety of reasons, organizations are moving their workloads to the cloud. Our research shows that one-third of organizations have their primary data lake platforms in the cloud, and most (86%) expect the majority of their data to be in the cloud at some point in the future. Organizations that already have the majority of their data in the cloud report gaining a competitive advantage, decreasing time to value, and improving communication and knowledge sharing.
Cloud-based deployments offer several benefits, some of which are amplified in big data scenarios. For example, time to value can improve significantly because there is no need to acquire, install, and configure hardware. For a large on-premises deployment involving hundreds of nodes, the initial capital outlay can be daunting, and even if an organization can acquire the equipment, ongoing maintenance and upgrades become a burden that detracts from its productivity. As data volumes grow, additional hardware must be acquired and configured to meet the increased demand. Cloud-based deployments offer elasticity and the ability to grow with the organization's requirements, scaling far beyond what most organizations could provision on premises. They also accommodate temporary increases in capacity: there is no need to build out a system for peak demand, because ephemeral workloads such as month-end processing spikes can be handled with a burst of additional resources that are released when no longer required.
But to take advantage of the cloud, an organization's data must be available in the cloud, and for big data workloads, merely moving the data there can be challenging. Moving hundreds of terabytes over standard network connections can take weeks. Few organizations can afford to shut down these systems for that long, so assuming the systems continue to operate, the process is complicated by the need to capture the changes made during the copy and apply them after the bulk transfer completes. Best practice is to run the new and old systems in parallel for a period, so synchronization must continue during that time as well.
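To put that transfer window in concrete terms, the back-of-the-envelope calculation below estimates how long a bulk copy might take over a dedicated link. The data volume, line rate, and efficiency factor are illustrative assumptions, not figures from any particular migration.

```python
# Rough estimate of bulk-transfer time over a dedicated network link.
# The 500 TB volume, 1 Gbps line rate, and 70% efficiency are assumptions.

data_tb = 500                      # data to move, in terabytes
link_gbps = 1                      # sustained link speed, in gigabits per second
efficiency = 0.7                   # assume ~70% usable throughput after protocol overhead

data_bits = data_tb * 8 * 10**12   # terabytes -> bits (decimal units)
seconds = data_bits / (link_gbps * 10**9 * efficiency)
days = seconds / 86_400

print(f"~{days:.0f} days to move {data_tb} TB at {link_gbps} Gbps "
      f"({efficiency:.0%} effective throughput)")
# ~66 days, roughly nine weeks, before accounting for changes made during the copy
```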
In order to make the bulk transfer process easier and more cost effective, cloud providers offer several alternatives. These include the physical shipment of petabyte-scale secure appliances. For extremely large migrations, some will provide a semi-trailer truck that can transfer up to 100 petabytes in a rugged shipping container. These solutions help minimize the amount of time required for the transfer.
A software-based live data migration tool provides another alternative. Depending on the implementation, software installed alongside the source and/or target systems copies the data from one system to the other while both remain running. The migration tool not only copies the data but also tracks the changes that occur during the copy so it can keep the two systems consistent. Tools that support "active-active replication" allow changes to occur in either environment, removing the distinction between source and target. This scenario is common in the world of relational databases; the challenge in the big data world is to do it at the required scale and adapt it to big data platforms, while respecting the bandwidth available for both the initial and the ongoing synchronization.
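As a rough illustration of the mechanics, and not any vendor's actual implementation, the sketch below shows the two phases such a tool typically combines: a bulk copy of the existing data plus an ongoing change feed that is replayed on the target until the two sides converge. The function names and the change-event structure are hypothetical.

```python
# Hypothetical sketch of live data migration: bulk copy plus continuous
# change replication. Function names and event fields are illustrative only.
import queue
import threading

change_log: "queue.Queue[dict]" = queue.Queue()   # change events captured on the source

def capture_changes(event: dict) -> None:
    """Called by the source system whenever data changes during the migration."""
    change_log.put(event)  # e.g. {"path": "/data/file1", "op": "append", "bytes": b"..."}

def bulk_copy(source_paths: list[str], copy_fn) -> None:
    """Phase 1: copy the existing data set while the source stays online."""
    for path in source_paths:
        copy_fn(path)

def replay_changes(apply_fn, stop: threading.Event) -> None:
    """Phase 2: keep applying captured changes to the target until both
    systems are consistent and cutover (or active-active use) can begin."""
    while not stop.is_set() or not change_log.empty():
        try:
            event = change_log.get(timeout=1)
        except queue.Empty:
            continue
        apply_fn(event)
```

In practice the change feed runs concurrently with the bulk copy, which is what allows the source system to keep operating throughout the migration.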
With such a tool, organizations solve multiple problems. The ongoing synchronization allows both systems to run in parallel during the testing phase to ensure the transfer has completed properly. A migration tool can help simplify a complicated project by minimizing the disruption to the existing system and avoiding the need for a “big bang” cutover to a new system. This potentially saves months of work planning and executing the migration and avoids the risk of downtime during such a cutover.
Beyond simplifying the migration process, these tools can also address backup and recovery for big data systems in the event of a failure. Multiple sites can house the same data simultaneously, each ready to become the primary system if a failure occurs. And if the systems are always consistent, workloads can be distributed among them to reduce the burden on any one system and improve performance. The systems can also be spread across on-premises and cloud deployments, reflecting the reality of most organizations today and allowing them to operate in a hybrid mode.
As organizations migrate, they should also take advantage of what the cloud offers rather than simply lifting and shifting their existing architectures and workloads. For instance, while migrating an HDFS data lake, an organization can consider moving to cloud-based object storage. It should also consider options that may not be available on premises, such as different instance sizes or specialized CPUs, and explore auto-scaling and pricing options to help maximize post-migration savings.
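For example, an analytics job that reads from an HDFS data lake often needs little more than a different path scheme to read the same data from cloud object storage, as the PySpark sketch below suggests. The cluster addresses, bucket name, and paths are placeholders, and the exact connector configuration depends on the platform.

```python
# Illustrative only: reading the same dataset from HDFS versus cloud object
# storage with PySpark. Host, bucket, and path names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-migration-check").getOrCreate()

# On-premises data lake: HDFS path
events_hdfs = spark.read.parquet("hdfs://namenode:8020/datalake/events")

# After migration: the same data in object storage (e.g. an S3-compatible store)
events_cloud = spark.read.parquet("s3a://example-bucket/datalake/events")

# A simple row-count comparison is one way to sanity-check a migrated dataset
print(events_hdfs.count(), events_cloud.count())
```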
Migrating big data workloads to the cloud presents challenges and opportunities. Organizations should consider their requirements as they evaluate alternatives. If minimizing downtime is an important consideration, software tools that provide live data migration rather than a backup-and-restore process may provide the best solution. Once the challenge of migration is tackled, organizations should expand their efforts beyond simple migration and optimize their cloud-based deployments as well.