Big data is valuable. Our benchmark research shows that using big data analytics results in better communication, better alignment of the business, a competitive advantage, better responsiveness and decreased time to market. But big data comes in many forms, from many sources and is not always easy to work with. These complexities have caused many organizations to become too reliant on IT for big data analytics. Nearly two-thirds (61%) of the organizations participating in our benchmark research told us they either rely on IT or require the assistance of IT to create big data analytics. Only one-quarter (24%) make direct, self-service access by line-of-business employees the primary way they provide big data analytics. Those organizations that do provide self-service access have the highest rates of satisfaction: 72 percent, compared with 54 percent when IT resources are required.
Our research shows that Business Intelligence (BI) tools are used by three-quarters of organizations to perform big data analytics, but using them is not always easy and not always effective. Most BI tools were not designed to deal with big data technologies or the volumes of data involved. Nor were these big data technologies designed for the interactive analytical queries that users demand. As a result, when organizations use BI tools on big data, poor performance and concurrency issues often cause data lakes to go unused or underutilized.
To provide business users with big data analytics via their BI tools, some organizations create extracts from the data lake. This approach, however, imposes limitations that reduce the effectiveness of the analytics and adds administrative burdens for IT. Many BI tools require a fixed set of dimensions and hierarchies determined as the data is loaded; changing them takes administrative effort, which inhibits users' flexibility in their analyses. Another problem is that extracts are typically summarized versions of the data in the lake rather than the full detail, so analyses cannot drill down as far. In addition, the extracts must be refreshed whenever the data in the lake is updated. These refresh cycles are not instantaneous, which means the data being analyzed is often out of date, and the cycles themselves must be managed by IT. The effort required to maintain these extracts can become an administrative nightmare.
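To make the trade-off concrete, the sketch below shows the kind of extract job described above, written in PySpark. The data lake path, the "sales" table and its columns, and the daily grain of the summary are illustrative assumptions, not a reference to any particular product.

```python
# Illustrative extract job: summarize data-lake detail into a BI extract.
# Paths, table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bi-extract-refresh").getOrCreate()

# Full-detail fact data in the lake (e.g., one row per transaction).
detail = spark.read.parquet("s3://data-lake/sales/")

# The extract is a summarized copy: anything below the day/region/product
# grain is lost, so analyses against it cannot drill any deeper.
extract = (
    detail.groupBy("sale_date", "region", "product")
          .agg(F.sum("amount").alias("total_amount"),
               F.count("*").alias("transactions"))
)

# Each refresh rewrites the copy; until the next scheduled run, BI users
# see data only as of the last refresh, and IT must manage the schedule.
extract.write.mode("overwrite").parquet("s3://bi-extracts/sales_daily/")
```

Every extract of this kind is another copy of the data to schedule, monitor and govern, which is the administrative burden noted above.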
Direct access to the data lake avoids many of these problems. All of the data can be analyzed at any level of detail, the data is always up to date and there is no additional copy of the data to govern and maintain. Historically, however, delivering adequate performance against the massive volumes of data in the lake has made direct access a challenge. Fortunately, the programming environments associated with big data technologies such as Spark have advanced to the point where software vendors can use them to create a new class of BI tools that operate directly on the data lake. These tools provide familiar environments in which users can perform analyses without interacting directly with the programming environments. Where necessary, these products can also include summarized information to accelerate query performance, and those summaries can be maintained automatically in the data lake without any additional administrative burden.
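As a rough illustration of the direct approach, the sketch below uses Spark SQL against the same hypothetical data-lake files: queries run at full detail with no separate copy, and an aggregate view of the kind such tools might maintain automatically is defined alongside. The paths, table, view and column names are assumptions for illustration only.

```python
# Illustrative direct access: query the data lake itself via Spark SQL.
# Paths, table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bi-on-data-lake").getOrCreate()

# Expose the full-detail data as a queryable table; no extract is created.
spark.read.parquet("s3://data-lake/sales/").createOrReplaceTempView("sales")

# An interactive, BI-style query runs against all of the detail,
# which is as current as the lake itself.
top_products = spark.sql("""
    SELECT product, SUM(amount) AS total_amount
    FROM sales
    GROUP BY product
    ORDER BY total_amount DESC
    LIMIT 10
""")
top_products.show()

# An aggregate of this kind is what a data-lake BI tool might build and
# refresh automatically to accelerate common queries, inside the lake
# rather than as an extract that IT must maintain by hand.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW sales_by_region_day AS
    SELECT region, sale_date, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region, sale_date
""")
```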
For organizations or users wedded to their existing BI tools, many of the benefits of this new class of big data BI tools are available via hybrids of the old and the new. The data models created directly in the data lake and many of the query acceleration techniques can be exposed via standard SQL interfaces. Users work with their familiar BI tools yet operate directly on the data lake. This approach also allows different parts of an organization to share the same data lake even if one group accesses the data with a traditional BI tool while another has adopted the more modern approach.
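As a sketch of this hybrid pattern, the snippet below assumes the data lake is exposed through a HiveServer2-compatible SQL endpoint (for example, the Spark Thrift Server) and uses the PyHive client to stand in for a BI tool's SQL connection; the host, port and table names are illustrative.

```python
# Illustrative hybrid access: a standard SQL connection to the data lake,
# the same kind of JDBC/ODBC-style connection a traditional BI tool uses.
# Host, port, and table names are hypothetical.
from pyhive import hive

conn = hive.connect(host="datalake-sql.example.com", port=10000,
                    database="default")
cursor = conn.cursor()

# The BI tool issues ordinary SQL; the query runs directly on the lake,
# so this group and users of newer data-lake BI tools share one copy of the data.
cursor.execute("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
""")
for region, total_amount in cursor.fetchall():
    print(region, total_amount)

cursor.close()
conn.close()
```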
As you consider your data lake strategy, focus on more than storing and managing data; focus on how you can utilize the data to improve the performance of the organization. Go beyond simple, summarized analyses of the data to provide access to all the detail in the data lake. Empower your line-of-business users with better choices for BI tool access and an architecture that provides direct access to analytics on the data lake so they can get the full value of the data in your organization.