Different Designs for Different Functions
Apache Spark and massively parallel processing (MPP) analytical databases are designed for different purposes. The first generation of “big data” architectures relied on the distributed Hadoop MapReduce framework for analytical processing. This framework was a breakthrough in that it increased the amount of data that could be processed, but it operated in batch mode, which limited its usefulness for interactive analysis. Spark removed the batch-processing limitation of MapReduce, making interactive analysis on big data practical. Spark also provides capabilities for streaming analytics and machine learning, but it does not include its own persistent storage layer.
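Because Spark has no storage layer of its own, an analysis typically begins by reading data from an external system such as HDFS or object storage. The PySpark sketch below illustrates that pattern; the Parquet path and column names are hypothetical, assumed only for illustration.

```python
# A minimal PySpark sketch: Spark reads from external storage (it has no
# persistent storage layer of its own) and runs an interactive query on it.
# The Parquet path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("interactive-analysis").getOrCreate()

# The data lives outside Spark, e.g., in HDFS or object storage.
events = spark.read.parquet("hdfs:///data/events")  # hypothetical path

# An interactive aggregation that MapReduce would have run as a batch job.
daily_counts = (
    events
    .groupBy(F.to_date("event_time").alias("day"))
    .agg(F.count("*").alias("events"))
    .orderBy("day")
)
daily_counts.show()

spark.stop()
```

Once loaded, the same DataFrame can be queried repeatedly from a notebook or shell without rerunning a batch job, which is the interactivity that MapReduce lacked.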
Distributed MPP systems, by contrast, are designed for scalable, high-performance analytical database operations. These systems spread processing across multiple compute resources to scale out and boost performance while maintaining transactional consistency, including support for data updates and deletes. Many applications, such as customer billing or financial systems, require the transactional consistency or repeatability that the relational database technology underlying MPP systems provides. These systems also use a range of optimization techniques to deliver very high performance across a wide variety of analyses, from those touching a small number of records to those scanning very large ones. And while the best MPP implementations are not limited to SQL processing, the wide availability of SQL skills and tools makes them easier to deploy and integrate into an organization’s information architecture.
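To make the transactional side concrete, the sketch below assumes Python’s psycopg2 driver and a PostgreSQL-compatible MPP warehouse such as Greenplum; the connection string, table, and column names are hypothetical. It shows an update and a delete that commit or roll back as a single unit of work, the kind of guarantee a billing application depends on.

```python
# A minimal sketch of a transactional update and delete against a
# PostgreSQL-compatible MPP warehouse (e.g., Greenplum).
# Connection details, tables, and columns are hypothetical.
import psycopg2

conn = psycopg2.connect("host=mpp-coordinator dbname=billing user=analyst")
try:
    with conn:  # psycopg2 commits on success, rolls back on exception
        with conn.cursor() as cur:
            # Apply a credit and purge the related disputed charges
            # as one atomic unit: both succeed or neither does.
            cur.execute(
                "UPDATE accounts SET balance = balance - %s "
                "WHERE account_id = %s",
                (49.99, 12345),
            )
            cur.execute(
                "DELETE FROM charges WHERE account_id = %s AND disputed",
                (12345,),
            )
finally:
    conn.close()
```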