In this tutorial we'll learn about RDDs (Resilient Distributed Datasets), the core abstraction of Spark.
An RDD is an immutable (read-only) collection of objects, distributed across the cluster.
An RDD can be created from data in storage, or from another RDD by applying an operation to it.
- In the older MapReduce paradigm, the map and reduce operations were not efficient in terms of memory and speed, so RDDs were introduced to make this style of processing more efficient.
- Data sharing was very slow, because a MapReduce program has to write its output to disk; reusing data between computations also requires writing the output to disk.
- Due to replication, serialization and disk I/O, Hadoop can spend up to 90% of its time on read and write operations.
- In short, both iterative and interactive processing need faster data sharing.
Apache Spark supports in-memory operations, so a Spark job can run 10 to 100 times faster than an equivalent Hadoop job.
An RDD can be created in two ways (see the sketch after this list):
- By parallelizing an existing collection
- By loading an external dataset from HDFS
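For instance, both creation styles look like this in the spark-shell (a minimal sketch: the collection, the HDFS path and the RDD names are placeholders, not from the walkthrough below):
scala> val numbersRDD = sc.parallelize(List(1, 2, 3, 4, 5))                // parallelize an existing collection
scala> val linesRDD = sc.textFile("hdfs://namenode:9000/data/input.txt")   // load an external dataset from HDFS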
Operations on RDD:
Two types of operations can be performed on an RDD: transformations and actions.
Transformations turn an RDD from one form into another. map(), filter(), combineByKey() etc. are transformations, and each of them creates a new RDD.
If you have multiple operations to perform on the same data, you can keep that data in memory explicitly by calling the cache() or persist() functions.
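For example (a sketch reusing the hypothetical linesRDD from above; the filter condition is also an assumption):
scala> val errorsRDD = linesRDD.filter(_.contains("ERROR"))   // derive a new RDD from the same data
scala> errorsRDD.cache()     // mark the RDD to be kept in memory
scala> errorsRDD.count()     // the first action computes and caches it
scala> errorsRDD.first()     // later actions reuse the cached data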
Actions return a final result: first(), collect(), reduce(), count() etc. are actions.
Transformations are lazy: no transformation is actually executed until an action is called.
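A quick illustration of this laziness (a sketch with made-up numbers):
scala> val squaresRDD = sc.parallelize(1 to 5).map(x => x * x)   // transformation only: nothing runs yet
scala> squaresRDD.collect()   // action: triggers the computation, returns Array(1, 4, 9, 16, 25)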
An RDD whose elements are key/value pairs is called a pair RDD. Pair RDDs are very useful for performing aggregations or counts by key in parallel across the nodes of the cluster.
A pair RDD can be created by calling a map() operation which emits key/value pairs.
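For example (the words here are invented for illustration):
scala> val wordsRDD = sc.parallelize(List("spark", "rdd", "spark"))
scala> val pairRDD = wordsRDD.map(word => (word, 1))   // each element becomes a (key, value) pair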
Transformations on Pair RDDs:
reduceByKey(), groupByKey(), combineByKey(), mapValues(), flatMapValues(), keys() etc. are functions that operate on a single pair RDD, whereas subtractByKey(), join(), cogroup() etc. operate on two pair RDDs.
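A small sketch of both kinds, with invented (id, value) data:
scala> val ratings = sc.parallelize(List((1, 4.0), (2, 5.0), (1, 3.0)))
scala> val names = sc.parallelize(List((1, "alice"), (2, "bob")))
scala> ratings.reduceByKey(_ + _).collect()   // one pair RDD: total value per key
scala> ratings.join(names).collect()          // two pair RDDs: combines values that share a key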
Run the spark-shell command on the command line.
Then create an RDD from any text file.
Here, media.txt is a file containing a list of Instagram URLs.
scala> val mediaRDD = sc.textFile("D:/instagram-scraper-master/media.txt")
mediaRDD: org.apache.spark.rdd.RDD[String] = D:/instagram-scraper-master/media.txt MapPartitionsRDD at textFile at <console>:21
scala> mediaRDD.count
res0: Long = 1013
scala> mediaRDD.take(2).foreach(println)
https://instagram.fbom1-1.fna.fbcdn.net/t50.2886-16/14790206_177359509381923_7967834812834643968_n.mp4
https://instagram.fbom1-1.fna.fbcdn.net/t50.2886-16/14833228_1020652531380366_8548718479509815296_n.mp4
Node.txt: a network file with a node id and one of its neighbours on each line (for example, the line "1 2" means node 1 has neighbour 2).
scala> val nodeRDD = sc.textFile("D:/Node.txt")
nodeRDD: org.apache.spark.rdd.RDD[String] = D:/Node.txt MapPartitionsRDD at textFile at <console>:21
scala> val mapRDD = nodeRDD.map(_.split(" ")).map(v => (v(0).toInt, v(1).toInt))
mapRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD at map at <console>
scala> val result=mapRDD.countByKey()
result: scala.collection.Map[Int,Long] = Map(4 -> 1, 2 -> 1, 1 -> 3, 3 -> 2)
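Other pair RDD functions can be applied to the same mapRDD in exactly the same way; for example (a hypothetical follow-up, outputs omitted):
scala> mapRDD.groupByKey().mapValues(_.toList).collect()   // neighbour list per node
scala> mapRDD.keys.distinct().collect()                    // all node ids that have neighbours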
In this way we can apply the various pair RDD functions to a pair RDD, making it easy to perform aggregations by key.
In the next tutorial we'll look at all the RDD functions in detail.