Berkeley Data Analysis Stack (BDAS)
The Spark ecosystem has evolved into a comprehensive big data computing platform comprising a range of subprojects. At its core lies the Berkeley Data Analysis Stack (BDAS), a unified ecosystem for data analysis. The stack is built around the core framework Spark, on top of which sit two engines for SQL querying and analysis, Spark SQL and Shark. BDAS also encompasses a distributed machine learning system (MLbase) and its underlying machine learning library (MLlib), a graph processing framework (GraphX), a streaming framework (Spark Streaming), a sampling-based approximate query engine (BlinkDB), a distributed memory file system (Tachyon), and a resource management framework (Mesos).
Figure 1-1: Project Structure Diagram of BDAS
[illustration]
The BDAS subprojects provide a higher-level, richer computing paradigm on top of Spark, enabling users to leverage a broader range of tools and techniques for data analysis and processing.
Spark: The Core Component
Spark is the core component of BDAS: a distributed programming framework for big data processing. It extends the traditional MapReduce programming model with a richer set of operators, such as filter, join, and groupByKey. Spark abstracts distributed data as Resilient Distributed Datasets (RDDs); beneath this abstraction it implements efficient task scheduling, RPC, serialization, and compression, and exposes its API to the higher-level components of the stack. Spark is written in Scala, and its API draws heavily on Scala's functional programming style.
Figure 1-2: Spark Task Processing Flowchart
[illustration]
Spark processes data partitions in a distributed environment, breaking each job down into a directed acyclic graph (DAG) of stages for efficient scheduling.
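To make the operator model concrete, here is a minimal Scala sketch using the operators named above (filter, groupByKey, join); the application name and data are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("RDDSketch").setMaster("local[*]"))

// Hypothetical (customerId, amount) orders and (customerId, name) pairs.
val orders = sc.parallelize(Seq((1, 30.0), (2, 15.5), (1, 12.0), (3, 99.9)))
val names  = sc.parallelize(Seq((1, "Alice"), (2, "Bob"), (3, "Carol")))

val large   = orders.filter { case (_, amount) => amount > 20.0 } // keep large orders
val perUser = orders.groupByKey()                                 // all orders per customer
val joined  = large.join(names)                                   // attach customer names

// Actions such as collect() trigger execution of the DAG of stages.
perUser.collect().foreach(println)
joined.collect().foreach(println)
sc.stop()
```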
Shark: A Data Warehouse in Spark
Shark is a data warehouse built on top of Spark and Hive that provides a SQL interface for querying data stored in Hive. Although its development has been discontinued, Shark's structure and principles remain instructive. Users familiar with HiveQL or SQL can quickly run ad hoc and reporting queries through Shark, which compiles HiveQL into Spark tasks for distributed execution.
Spark SQL: SQL Queries on Large Data
Spark SQL provides a SQL interface for querying large data sets, similar in spirit to Shark. It relies on the Catalyst optimizer and implements its SQL operators on the Spark execution engine. Users can write SQL queries directly inside Spark programs; because each query compiles down to a set of Spark operators, SQL support enriches the Spark API while remaining compatible with a variety of persistent storage systems.
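A minimal Scala sketch of mixing SQL into a Spark program, using the SparkSession entry point from later Spark releases; the file path, table name, and columns are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLSketch")
  .master("local[*]")
  .getOrCreate()

val people = spark.read.json("people.json")  // infer schema from a JSON file
people.createOrReplaceTempView("people")     // register as a SQL-visible table

// Catalyst optimizes the query and compiles it into Spark operators.
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
spark.stop()
```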
Spark Streaming: Large-Scale Stream Processing
Spark Streaming processes data streams by dividing them into batches of RDDs based on a specified time slice, enabling large-scale stream processing. It offers a rich API for computing over data streams, and its throughput can exceed that of the established Storm framework.
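As a sketch of this micro-batch model, here is the classic streaming word count in Scala; the host, port, and batch interval are arbitrary choices for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each 5-second time slice of input becomes one RDD in the stream.
val conf = new SparkConf()
  .setAppName("StreamingWordCount")
  .setMaster("local[2]") // at least two threads: one receiver, one processor

val ssc   = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999) // e.g. fed by `nc -lk 9999`

val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()          // emit the counts for each batch
ssc.start()             // start receiving and processing
ssc.awaitTermination()
```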
GraphX: Large-Scale Graph Computation
GraphX is a graph computation framework based on the Bulk Synchronous Parallel (BSP) model. It provides a Pregel-like interface for large-scale, globally synchronized graph computation, and Spark's in-memory computation is a particular advantage when users run many iterative rounds over graph data.
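A minimal GraphX sketch in Scala, running PageRank over a toy graph; the vertices, edges, and convergence tolerance are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext(
  new SparkConf().setAppName("GraphXSketch").setMaster("local[*]"))

// Build a tiny property graph: (vertexId, attribute) pairs and typed edges.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph    = Graph(vertices, edges)

// pageRank runs Pregel-style iterations in memory until the ranks
// converge within the given tolerance.
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach(println)
sc.stop()
```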
Tachyon: A Distributed Memory File System
Tachyon is a distributed memory file system that can be thought of as an in-memory HDFS. It achieves higher performance by moving data storage out of the Java heap, avoiding garbage-collection overhead. Applications can share data through Tachyon at the RDD or file level, and Tachyon provides fault tolerance to keep that data reliable.
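A sketch of file-level sharing through Tachyon, assuming a Tachyon master reachable at master:19998 (its default port) and a Spark build with the Tachyon client on the classpath; the paths are invented. (Tachyon was later renamed Alluxio.)

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("TachyonSketch").setMaster("local[*]"))

val logs   = sc.textFile("tachyon://master:19998/logs/app.log")
val errors = logs.filter(_.contains("ERROR"))

// Other applications can now read this result directly from Tachyon memory.
errors.saveAsTextFile("tachyon://master:19998/shared/errors")
sc.stop()
```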
Mesos: Resource Management Framework
Mesos is a resource management framework that provides functionality similar to YARN. Computing frameworks such as Spark, MapReduce, and Tez can be plugged into Mesos and run side by side, with Mesos isolating their resources and tasks and scheduling them efficiently.
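A small sketch of how a Spark application attaches to a Mesos cluster; the host name is a placeholder, and 5050 is the Mesos master's default port.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SparkOnMesos")
  .setMaster("mesos://mesos-master:5050") // instead of a standalone master or YARN
val sc = new SparkContext(conf)

// Jobs submitted through this context are scheduled by Mesos alongside
// tasks from other frameworks sharing the same cluster.
```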
BlinkDB: Interactive SQL for Huge Amounts of Data
BlinkDB is an approximate query engine for interactive SQL over massive data sets. It lets users trade query accuracy for response time, completing approximate queries within a permissible error range. To achieve this, BlinkDB builds and maintains a set of multi-dimensional samples of the raw data through an adaptive optimization framework, then dynamically selects the sample that best meets a query's accuracy and response-time requirements.
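The BlinkDB papers illustrate this trade-off with SQL extended by error and time bounds, along the following lines; the table and column names here are hypothetical, and the exact clause syntax may differ across BlinkDB versions.

```sql
-- Bound the error: return an average within 10% of the true value,
-- with 95% confidence.
SELECT avg(sessionTime) FROM sessions
WHERE city = 'San Francisco'
ERROR 0.1 CONFIDENCE 95.0%;

-- Bound the time: return the best estimate available within 2 seconds.
SELECT avg(sessionTime) FROM sessions
WHERE city = 'San Francisco'
WITHIN 2 SECONDS;
```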