Data lakes matter because they provide the opportunity to analyze data from all sources, not just one, and at any level of detail, which improves business outcomes. Participants in our research reported that using big data analytics enabled better communication and alignment in the organization. These organizations share more information and, as a result, work from a more consistent view of that information. They also reported being able to gain a competitive advantage, be more responsive to customers and go to market faster.
However, data lakes can be challenging to create and maintain. Data engineers, data stewards, data scientists and business analysts are all involved in creating data lakes, and each plays a different role in ingesting, enhancing and managing the stored data and information.
The process begins with identifying and ingesting all relevant data sources. As the data is ingested, it must be enhanced: ensuring its quality, combining it with additional data from other sources and preparing it to support various types of analyses. All of this must be governed to ensure compliance with internal and regulatory requirements, and the data and information must be cataloged so users can find it when needed.
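To make these steps concrete, here is a minimal Python sketch of a single ingestion step that checks quality, attaches lineage metadata and registers the result in a catalog. The CSV layout, the dict-backed catalog and names such as ingest_file are illustrative assumptions, not a reference to any particular product.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path

# Illustrative in-memory catalog; a real lake would use a metadata service.
catalog = {}

def ingest_file(path: Path, source: str) -> dict:
    """Ingest one CSV file: check quality, enrich with lineage, catalog it."""
    with path.open(newline="") as f:
        rows = list(csv.DictReader(f))

    # Quality check: reject files whose rows lack a required key field.
    if any(not row.get("id") for row in rows):
        raise ValueError(f"{path}: rows missing required 'id' field")

    # Enhancement: attach lineage metadata used later for governance.
    entry = {
        "source": source,
        "row_count": len(rows),
        "checksum": hashlib.sha256(path.read_bytes()).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

    # Cataloging: register the dataset so users can find it when needed.
    catalog[path.name] = entry
    return entry
```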
Organizations embarking on their data lake journey are often tempted to begin with manual processes and custom-coded scripts and routines, but this approach can be short-sighted. Our research shows that organizations relying on custom programming are the most likely to complain that the technology has insufficient capabilities and is hard to build, hard to use and unreliable. Organizations can address these issues by adopting tools, choosing either individual tools for each task or an integrated platform. Note, though, that separate tools can lead to silos of activity and information, requiring additional effort to bring those silos together to share the data as well as its business context.
People become silos as well. Amid all the efforts to democratize data through decentralized access, little attention has been given to fostering a community around the use and meaning of the data. Collaboration capabilities address that problem by enhancing interaction among all the people involved and the tasks they perform, enabling them to share knowledge and exchange ideas. Our research shows that 80 percent of organizations working with big data consider collaboration in the data preparation process important or very important. However, collaboration can be time-consuming if it is not integrated into the platform: participants in our research identified it as the data preparation task on which they spend the most time.
An integrated, end-to-end data lake management platform with all of these capabilities can address these issues, minimizing or eliminating the risks and delays of manual processes by providing a more robust, scalable and well-governed approach.
An integrated platform with machine learning techniques also yields a broader, more coordinated set of data and information, with fewer manual processes than would otherwise be possible. Machine learning algorithms can detect patterns of usage across the entire platform. For example, data lakes often include many types of unstructured data sources such as text files and log files. Our research shows that 49 percent of organizations are working with text from documents and the web and 42 percent are working with machine data captured in log files. Based on past patterns of usage and processing of these types of data, machine learning techniques can process new files without requiring users to open them and handle their contents manually. Since a data lake can contain hundreds or even thousands of such files, this level of automation can significantly reduce the manual labor needed to manage the data lake.
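As a rough illustration of this idea, the sketch below trains a simple classifier on the header lines of previously processed files and uses it to suggest a processing routine for a new file. The training examples, the pipeline labels and the function suggest_pipeline are hypothetical; a production platform would learn from far richer usage signals.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical history of past ingestions:
# (header line of the file, processing routine that was applied).
history = [
    ("timestamp level message host", "parse_log"),
    ("ts severity msg source", "parse_log"),
    ("title author body published", "extract_text"),
    ("subject sender body date", "extract_text"),
]

headers, pipelines = zip(*history)
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(headers)
model = MultinomialNB().fit(features, pipelines)

def suggest_pipeline(header_line: str) -> str:
    """Predict which routine to apply to a new file from its header line."""
    return model.predict(vectorizer.transform([header_line]))[0]

print(suggest_pipeline("time level msg node"))  # likely 'parse_log'
```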
In addition to reducing the total cost of ownership of the data lake, automation also improves governance. With fewer manual tasks, there is less risk of error. AI and automation can also be used to ensure that all sensitive data and information in the lake are restricted, masked or protected as appropriate. If similar data sources and fields have been protected previously, AI algorithms can detect that and protect the new data as it is ingested.
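One way such protection could work is sketched below under simple assumptions: compare incoming field names against field names that were protected in the past, and mask the values of any field that closely matches. The protected_fields list, the fuzzy-matching threshold and the hashing-based mask are all illustrative choices, not a description of any specific product's algorithm.

```python
import difflib
import hashlib

# Field names that were protected in earlier ingestions (illustrative).
protected_fields = ["ssn", "email", "phone_number", "credit_card"]

def mask(value: str) -> str:
    """Replace a sensitive value with a stable, irreversible token."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def protect_record(record: dict) -> dict:
    """Mask any field whose name closely resembles a known sensitive field."""
    protected = {}
    for field, value in record.items():
        similar = difflib.get_close_matches(
            field.lower(), protected_fields, n=1, cutoff=0.8)
        protected[field] = mask(str(value)) if similar else value
    return protected

# 'Email' fuzzily matches the previously protected 'email' field, so its
# value is masked on ingest; 'city' passes through untouched.
print(protect_record({"Email": "pat@example.com", "city": "Austin"}))
```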
If you are still relying on manual processes with custom-coded data lake routines or individual tools, you may be exposing your organization to a higher cost of ownership and to governance risks. Consider integrated data lake management with embedded machine learning capabilities to extract more valuable and transformative insights from your data lake.