In this tutorial we’ll learn about RDDs (Resilient Distributed Datasets), which are the core concept of Spark.
RDD is an immutable (read-only) collection of objects, distributed in the cluster.
An RDD can be created from data in storage or from another RDD by applying an operation to it.
- In the older MapReduce paradigm, the map and reduce operations were not efficient in terms of memory and speed, so RDDs were introduced to make MapReduce-style processing more efficient.
- Data sharing was very slow because every MapReduce job had to write its output to disk, and reusing data between computations likewise required going through disk.
- Due to replication, serialization and disk I/O, Hadoop spends about 90% of its time on read and write operations.
- In short, both iterative and interactive processing need faster data sharing.
Apache Spark supports in-memory operations, so jobs can run 10 to 100 times faster than equivalent Hadoop jobs.
An RDD can be created in two ways (both are sketched below):
- By parallelizing an existing collection
- By loading an external dataset, for example from HDFS
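As a rough sketch (assuming a spark-shell session where sc is the SparkContext; the file path is just a placeholder), the two ways look like this:

// Parallelize an existing Scala collection into an RDD
val numbersRDD = sc.parallelize(List(1, 2, 3, 4, 5))

// Load an external dataset; the path is hypothetical and could point to HDFS
val linesRDD = sc.textFile("hdfs:///data/sample.txt")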
Operations on RDD:
Two types of operations can be performed on an RDD.
Transformations: an RDD can be transformed from one form to another; map, filter, combineByKey, etc. are transformations, and each of them creates a new RDD.
If you have multiple operations to perform on the same data, you can keep that data in memory explicitly by calling the cache() or persist() functions.
Actions: actions return a final result; first, collect, reduce, count, etc. are actions.
No transformation is actually executed until an action is called, as the sketch below illustrates.
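A minimal sketch (the file name is hypothetical) showing that nothing runs until an action is called, and that cache() lets a second action reuse the data already computed:

val logsRDD = sc.textFile("logs.txt")                 // no data is read yet
val errorsRDD = logsRDD.filter(_.contains("ERROR"))   // transformation only, still nothing executed
errorsRDD.cache()                                      // ask Spark to keep this RDD in memory once computed
val total = errorsRDD.count()                          // the first action triggers the whole computation
val sample = errorsRDD.take(5)                         // the second action reuses the cached data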
RDDs holding key/value pairs are called pair RDDs. They are very useful for performing aggregations or counts by key in parallel across the nodes of the cluster.
A pair RDD can be created by calling a map() operation that emits key/value pairs, for example:
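A minimal sketch (with made-up input data):

val words = sc.parallelize(List("spark", "rdd", "spark", "action"))
// map() emits (key, value) tuples, turning the RDD into a pair RDD
val pairs = words.map(word => (word, 1))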
Transformations on Pair RDDs:
reduceByKey(), groupByKey(), combineByKey(), mapValues(), flatMapValues(), keys(), etc. are functions that operate on a single pair RDD, whereas subtractByKey(), join() and cogroup() are functions that operate on two pair RDDs.
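A small sketch of both kinds (the data is made up): reduceByKey() works on a single pair RDD, while join() combines two pair RDDs by key:

val sales  = sc.parallelize(List(("apple", 2), ("banana", 1), ("apple", 3)))
val prices = sc.parallelize(List(("apple", 10.0), ("banana", 5.0)))

// Single pair RDD: sum the values for each key
val totals = sales.reduceByKey(_ + _)   // ("apple", 5), ("banana", 1)

// Two pair RDDs: join them on the key
val joined = sales.join(prices)         // ("apple", (2, 10.0)), ("apple", (3, 10.0)), ("banana", (1, 5.0))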
Run the spark-shell command on the command line.
Then create an RDD from any text file.
Here media.txt is a file containing a list of Instagram URLs.
scala> val mediaRDD = sc.textFile("D:/instagram-scraper-master/media.txt")
mediaRDD: org.apache.spark.rdd.RDD[String] = D:/instagram-scraper-master/media.txt MapPartitionsRDD at textFile at <console>:21

scala> mediaRDD.count
res0: Long = 1013

scala> mediaRDD.take(2).foreach(println)
https://instagram.fbom1-1.fna.fbcdn.net/t50.2886-16/14790206_177359509381923_7967834812834643968_n.mp4
https://instagram.fbom1-1.fna.fbcdn.net/t50.2886-16/14833228_1020652531380366_8548718479509815296_n.mp4
Node.txt is a network file containing a node ID and its neighbors on each line.
scala> val nodeRDD = sc.textFile("D:/Node.txt")
nodeRDD: org.apache.spark.rdd.RDD[String] = D:/Node.txt MapPartitionsRDD at textFile at <console>:21
scala> val mapRDD = nodeRDD.map(_.split(" ")).map(v => (v(0).toInt, v(1).toInt))
mapRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD at map at <console>:...
scala> val result=mapRDD.countByKey()
result: scala.collection.Map[Int,Long] = Map(4 -> 1, 2 -> 1, 1 -> 3, 3 -> 2)
In this way we can apply the various pair RDD functions on a pair RDD, which makes it easy to perform aggregations by key; a couple of further examples on the same data are sketched below.
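Continuing the same session (just a sketch built on the mapRDD created above), we could also keep the per-node results distributed instead of collecting them to the driver:

// Group all neighbors of each node into a list
val neighbors = mapRDD.groupByKey().mapValues(_.toList)

// Count neighbors per node as an RDD (the same numbers countByKey() returned, but not collected to the driver)
val degrees = mapRDD.mapValues(_ => 1).reduceByKey(_ + _)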
In the next tutorial we’ll look at the RDD functions in detail.
In this very first tutorial on Spark we will introduce Apache Spark and its core concept, the RDD.
What is Apache Spark?
Apache Spark is an open-source, general-purpose cluster computing engine.
Spark was born out of the necessity to prove out the concept of Mesos, in the AMPLab at the University of California, Berkeley, in 2009.
It is designed to cover a wide range of workloads including batch applications, iterative algorithms, interactive queries and stream processing.
One of the main advantages Spark offers for speed is the ability to run computations in memory.
Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra.
The Spark Stack
Spark Core: it contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, etc.
Resilient Distributed Datasets (RDDs) are Spark’s main programming abstraction.
RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel.
Spark SQL: it allows querying data via SQL as well as the Hive Query Language (HQL), and it supports many data sources such as Hive tables, Parquet and JSON.
Shark was an older SQL-on-Spark project.
Spark Streaming: this component enables processing of live streams of data, for example the log files of a web server.
MLlib: Spark provides a library containing common machine learning functionality, called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering and collaborative filtering, as well as supporting functionality such as model evaluation and data import.
GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations. Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API. It allows creating a directed graph with arbitrary properties attached to each vertex and edge.
Cluster Managers: Spark can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler.
Storage Layer of Spark:
It’s important to remember that Spark does not require Hadoop; it simply has support for storage systems implementing the Hadoop APIs. Spark supports text files, Sequence Files, Avro, Parquet, and any other Hadoop InputFormat.
Why Spark? Or Spark Features:
Easy to get started – Spark offers the spark-shell, which gives a very easy head start to writing and running Spark applications on the command line.
Unified engine for diverse workloads – it is more than just map and reduce.
Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk. Spark holds intermediate results in memory rather than writing them to disk, which is very useful when you need to work on the same dataset multiple times. Spark will attempt to store as much data as possible in memory and then spill to disk; it can keep part of a dataset in memory and the remaining data on disk. This is where much of Spark’s performance advantage comes from.
It optimizes arbitrary operator graphs.
Lazy evaluation of big data queries helps with the optimization of the overall data processing workflow. Apache Spark uses a directed acyclic graph (DAG) of computation stages and postpones any processing until it is really required by an action. Spark’s lazy evaluation gives plenty of opportunities for low-level optimizations (a small sketch after this feature list illustrates the recorded lineage).
It is mainly written in Scala but it provides concise and consistent APIs in Scala, Java and Python.
It provides interactive shell for Scala and Python.
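As a quick illustration of the lazy, DAG-based evaluation mentioned above (the file name is hypothetical), toDebugString prints the lineage Spark has recorded without running anything:

val events = sc.textFile("events.log")                    // nothing is executed yet
val cleaned = events.filter(_.nonEmpty).map(_.toUpperCase)
println(cleaned.toDebugString)                            // shows the recorded lineage (DAG); still no job has run
cleaned.count()                                           // only this action makes Spark execute the whole chain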
RDD is the core concept of Spark.
It is an immutable distributed collection of objects. Spark uses RDDs to achieve faster and more efficient MapReduce-style operations. They are also fault tolerant, because an RDD knows how to recreate and recompute its dataset.
RDDs are immutable: you can modify an RDD with a transformation, but the transformation returns a new RDD while the original RDD remains the same.
RDD supports two types of operations:
Transformation: transformations don’t return a single value; they return a new RDD. Nothing gets evaluated when you call a transformation function; it just takes an RDD and returns a new RDD.
Examples of Transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
Action: an action operation evaluates and returns a value. When an action function is called on an RDD object, all the data processing queries are computed at that time and the resulting value is returned.
Examples of the Action operations are reduce, collect, count, first, take, countByKey, and foreach.
Example of creating RDD:
scala> val lines = sc.textFile("README.md") // Create an RDD called lines
lines: spark.RDD[String] = MappedRDD[...]
scala> lines.count() // Count the number of items in this RDD
res0: Long = 127
scala> lines.first() // First item in this RDD, i.e. first line of README.md
res1: String = # Apache Spark
Here lines is an RDD, and count() and first() are two actions performed on the lines RDD.
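To tie this back to transformations, here is a hedged continuation of the same shell session (the search term is arbitrary): filter() builds a new RDD lazily, and only the following action triggers the actual work.

scala> val sparkLines = lines.filter(line => line.contains("Spark")) // transformation: returns a new RDD, nothing computed yet
scala> sparkLines.count() // action: now README.md is scanned and matching lines are counted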
In the next tutorial we will learn how to install Spark on Windows system.