In this first Spark tutorial, we introduce Apache Spark and its core concept, the RDD.
What is Apache Spark?
- Apache Spark is an open-source, general-purpose cluster computing engine.
- Spark was created in 2009 in the AMPLab at the University of California, Berkeley, out of the need to prove out the concept of Mesos.
- It is designed to cover a wide range of workloads including batch applications, iterative algorithms, interactive queries and stream processing.
- One of the main features Spark offers for speed is the ability to run computations in memory.
- Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra.
The Spark Stack
Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, and interacting with storage systems.
Resilient Distributed Datasets (RDDs) are Spark’s main programming abstraction.
RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel.
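As a minimal sketch of what "distributed and manipulated in parallel" means in practice, the snippet below assumes a spark-shell session, where `sc` is the SparkContext the shell creates for you. The partition count of 4 and the variable names are arbitrary choices made for illustration.

```scala
// Assumes spark-shell, where `sc` (the SparkContext) is already defined.
// Distribute a local range across 4 partitions (4 is an arbitrary choice).
val numbers = sc.parallelize(1 to 100, 4)

// Number of partitions the RDD is split into.
val partitionCount = numbers.partitions.length

// map runs on each partition in parallel; reduce combines the partial results.
val doubledSum = numbers.map(_ * 2).reduce(_ + _)

println(s"partitions = $partitionCount, doubled sum = $doubledSum")
```

On a real cluster the partitions would live on different worker nodes; in local mode they are processed by local threads.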
Spark SQL allows querying data via SQL as well as Hive Query Language (HQL), and it supports many data sources such as Hive tables, Parquet, and JSON.
Shark was an older SQL-on-Spark project, which Spark SQL has since replaced.
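The following is a small sketch of querying data with SQL through Spark SQL. It assumes Spark 2.0 or later running in spark-shell, where a SparkSession named `spark` is predefined; the table name `people` and its rows are made up for illustration.

```scala
// Assumes spark-shell on Spark 2.0+, where `spark` (a SparkSession) is predefined.
import spark.implicits._

// Hypothetical in-memory data turned into a DataFrame with named columns.
val people = Seq(("Ann", 34), ("Bob", 19), ("Cid", 52)).toDF("name", "age")

// Register the DataFrame so it can be referenced from SQL.
people.createOrReplaceTempView("people")

// Run a SQL query; the result is another DataFrame.
val adults = spark.sql("SELECT name FROM people WHERE age >= 21 ORDER BY name")

val adultNames = adults.collect().map(_.getString(0))
println(adultNames.mkString(", "))
```

The same `spark.sql` call works against Hive tables, Parquet, or JSON once they are registered or loaded as DataFrames.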
Spark Streaming enables processing of live streams of data.
For example, it can process the log files generated by a web server.
Spark provides a library of common machine learning functionality called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.
GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations. Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API. It allows creating a directed graph with arbitrary properties attached to each vertex and edge.
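A tiny friend graph can be sketched as follows. This assumes spark-shell with the GraphX module on the classpath (it ships with the standard Spark distribution); the user names and the "friend" edge label are invented for the example.

```scala
// Assumes spark-shell, where `sc` is predefined and GraphX is available.
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Vertices: (id, property). Here the property is a hypothetical user name.
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))

// Directed edges with a property attached to each edge.
val friendships: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "friend"), Edge(2L, 3L, "friend")))

// Build the property graph from the two RDDs.
val graph = Graph(users, friendships)

val vertexCount = graph.numVertices
val edgeCount = graph.numEdges

// In-degree per vertex; vertices with no incoming edges are omitted.
val inDegrees = graph.inDegrees.collect().toMap
```

Because the vertices and edges are ordinary RDDs, the graph inherits the same partitioning and fault-tolerance behavior as the rest of Spark.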
Spark can run on a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler.
Storage Layer of Spark:
It’s important to remember that Spark does not require Hadoop; it simply has support for storage systems implementing the Hadoop APIs. Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat.
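Here is a minimal sketch of writing and reading text files through these storage APIs. It assumes spark-shell (`sc` predefined) and uses a temporary directory as a stand-in for HDFS or another Hadoop-compatible store; the directory name and sample lines are illustrative only.

```scala
// Assumes spark-shell, where `sc` is predefined.
import java.nio.file.Files

// A fresh temporary location; saveAsTextFile requires that the
// output directory does not exist yet, so we point at a new sub-path.
val outDir = Files.createTempDirectory("spark-demo").resolve("lines").toString

// Write an RDD out as text: one part file per partition.
sc.parallelize(Seq("first line", "second line", "third line"))
  .saveAsTextFile(outDir)

// Read the whole directory back as a single RDD of lines.
val reloaded = sc.textFile(outDir)
val lineCount = reloaded.count()
```

Replacing `outDir` with an `hdfs://...` URI would use the same two calls against HDFS.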
Why Spark? Or Spark Features:
- Easy to get started: Spark offers spark-shell, which gives you an easy head start on writing and running Spark applications from the command line.
- Unified engine for diverse workloads; it offers much more than just Map and Reduce.
- Spark can run applications in Hadoop clusters up to 100 times faster in memory, and up to 10 times faster on disk, than Hadoop MapReduce. It holds intermediate results in memory rather than writing them to disk, which is especially useful when you work on the same dataset multiple times. Spark tries to keep as much data as possible in memory and spills to disk only when needed, so part of a dataset can sit in memory while the rest stays on disk.
- It optimizes arbitrary operator graphs.
- Lazy evaluation of big data queries, which helps optimize the overall data processing workflow. Spark builds a directed acyclic graph (DAG) of computation stages and postpones any processing until an action actually requires a result. This gives Spark plenty of opportunities to apply low-level optimizations.
- Spark is written mainly in Scala and provides concise, consistent APIs in Scala, Java, and Python.
- It provides an interactive shell for Scala and Python.
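The in-memory and lazy-evaluation points above can be sketched together in spark-shell. This is only an illustration, assuming `sc` is the shell's SparkContext; the data and variable names are made up.

```scala
// Assumes spark-shell, where `sc` is predefined.
val numbers = sc.parallelize(1 to 1000000)

// Transformations only record what to do; no job runs here.
val evens = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// cache() marks the RDD to be kept in memory once computed; it is lazy too.
squares.cache()

// First action: runs filter and map, and keeps the computed
// partitions in memory where they fit.
val total = squares.count()

// Later actions can read the cached partitions instead of recomputing them.
val firstFew = squares.take(3)

// Print the lineage (the chain of RDDs Spark recorded for this result).
println(squares.toDebugString)
```

Without the `cache()` call, each action would re-run the filter and map from the source data.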
RDD is the core concept of Spark.
An RDD is an immutable, distributed collection of objects. Spark uses RDDs to achieve faster and more efficient MapReduce-style operations. RDDs are also fault tolerant, because an RDD knows how to recreate and recompute its datasets.
Because RDDs are immutable, a transformation never changes an RDD in place. It returns a new RDD, and the original RDD remains the same.
RDDs support two types of operations:
Transformation: A transformation does not return a single value; it returns a new RDD. Nothing is evaluated when you call a transformation function; it just takes an RDD and returns a new RDD.
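A short spark-shell sketch of this immutability, assuming `sc` is the shell's SparkContext and using made-up values:

```scala
// Assumes spark-shell, where `sc` is predefined.
val original = sc.parallelize(Seq(1, 2, 3))

// map does not touch `original`; it describes a new RDD derived from it.
val incremented = original.map(_ + 1)

val before = original.collect()    // still the original elements
val after = incremented.collect()  // the transformed copy

println(before.mkString(", "))
println(after.mkString(", "))
```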
Examples of Transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
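Several of these transformations can be chained into a small word count. The sketch assumes spark-shell (`sc` predefined); the two input sentences are invented for the example.

```scala
// Assumes spark-shell, where `sc` is predefined.
val lines = sc.parallelize(Seq("spark is fast", "spark is lazy"))

// flatMap: one line becomes many words.
val words = lines.flatMap(_.split(" "))

// map: pair each word with a count of 1.
val pairs = words.map(word => (word, 1))

// reduceByKey: add up the counts for each distinct word.
val counts = pairs.reduceByKey(_ + _)

// filter: keep only words that occur more than once.
val frequent = counts.filter { case (_, n) => n > 1 }

// Everything above only built new RDDs. collect() is an action,
// so this is the line that actually triggers the computation.
val result = frequent.collect().toMap
```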
Action: An action evaluates the RDD and returns a value. When an action is called on an RDD, all the pending data processing is computed at that time and the result is returned.
Examples of the Action operations are reduce, collect, count, first, take, countByKey, and foreach.
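The actions listed above can be tried on a small pair RDD. Again this is a sketch that assumes spark-shell with `sc` predefined; the keys and scores are illustrative.

```scala
// Assumes spark-shell, where `sc` is predefined.
val scores = sc.parallelize(Seq(("a", 3), ("b", 5), ("a", 7)))

val n = scores.count()                      // number of elements
val firstPair = scores.first()              // first element
val firstTwo = scores.take(2)               // first two elements, as an Array
val total = scores.map(_._2).reduce(_ + _)  // sum of all scores
val perKey = scores.countByKey()            // number of elements per key
val everything = scores.collect()           // all elements, brought to the driver

// foreach runs on the executors; on a cluster the output appears
// in the executor logs rather than in the shell.
scores.foreach(println)
```

Use `collect` with care on large datasets, since it copies the entire RDD into the driver's memory.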
Example of creating RDD:
scala> val lines = sc.textFile("README.md") // Create an RDD called lines
lines: spark.RDD[String] = MappedRDD[...]
scala> lines.count() // Count the number of items in this RDD
res0: Long = 127
scala> lines.first() // First item in this RDD, i.e. first line of README.md
res1: String = # Apache Spark
Here lines is an RDD, and count() and first() are two actions performed on it.
In the next tutorial we will learn how to install Spark on a Windows system.