“If the facts don’t fit the theory, change the facts.”
― Albert Einstein
Apache Spark is the newest kid on the Big Data block.
While re-using major components of the Apache Hadoop Framework, Apache Spark lets you execute big data processing jobs that do not neatly fit into the Map-Reduce paradigm. It provides support for many patterns similar to the Java 8 Streams functionality, while letting you run these jobs on a cluster.
Do you have a data processing job that works nicely with Java 8 Streams but needs more horsepower and memory than a single machine can provide? Apache Spark is your friend.
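To make the comparison with stream-style pipelines concrete, here is a minimal single-machine sketch in plain Python 3 of the filter, map, and reduce steps that Spark spreads across a cluster. The sample records and their layout are invented for illustration, and no Spark API is used.

```python
from functools import reduce

# Hypothetical sample records as (year, salary); not taken from any real data set.
records = [(2004, 300000), (2005, 500000), (2005, 750000), (2006, 400000)]

# filter: keep only the 2005 records
only_2005 = filter(lambda r: r[0] == 2005, records)
# map: project each record to its salary
salaries = map(lambda r: r[1], only_2005)
# reduce: add the salaries together, starting from 0
total = reduce(lambda acc, s: acc + s, salaries, 0)

print(total)
```

On a cluster, each of these steps would run on partitions of the data held by different workers rather than on one in-memory list.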
In this article, we delve into the basics of Apache Spark and show you how to set up a single-node cluster using the computing resources of Amazon EC2. For the purposes of the demonstration, we set up a single server and run the master and slave on the same node. A setup like this, which would fit just as well on a laptop, is good for getting your feet wet with Apache Spark.
Create AWS Instance
Setting up an AWS EC2 instance is quite straightforward, and we have covered it here while demonstrating how to set up a Hadoop Cluster. The procedure is the same up until the instance is running on EC2. Follow the steps in that guide until the instance is launched, then come back here to continue with Apache Spark.
Instance Setup
Once the instance is up and running on AWS EC2, we need to set up the requirements for Apache Spark.
Install Java
Install Java on the node using the Ubuntu package openjdk-8-jdk-headless:
sudo apt-get -y install openjdk-8-jdk-headless
Install Apache Spark
Next, head over to the Apache Spark website and download the latest version. At the time of writing, the latest version is 2.1.0. We have chosen the package pre-built for Hadoop 2.7 (the default).
Download and unpack the Apache Spark package.
mkdir ~/server
cd ~/server
wget <Link to Apache Spark Binary Distribution>
tar xvzf spark-2.1.0-bin-hadoop2.7.tgz
After unpacking, you have just one step left to complete the installation: setting JAVA_HOME.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
And that’s it for installation! The friendly folks at Apache Spark have certainly made our lives easy, haven’t they?
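Since the Spark scripts depend on JAVA_HOME, it can be worth checking the variable before starting any daemon. The following is a minimal Python 3 sketch; the helper name find_java is made up for this example and is not part of Spark.

```python
import os


def find_java(env=None):
    """Return the path of the java binary under JAVA_HOME, or None if unusable."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return None
    java_bin = os.path.join(java_home, "bin", "java")
    # Accept the path only if it is an existing, executable file.
    if os.path.isfile(java_bin) and os.access(java_bin, os.X_OK):
        return java_bin
    return None


print(find_java() or "JAVA_HOME is not set or does not point to a Java installation")
```

Keep in mind that export only affects the current shell session, so the check has to be run from the same shell that will start the master and slave.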
Startup Master
Let us now fire up the Apache Spark master. The master is in charge of the cluster. This is where you submit jobs, and this is where you go for the status of the cluster. Start the master as follows:
cd ~/server
./spark-2.1.0-bin-hadoop2.7/sbin/start-master.sh
Once the master is running, open port 8080 on the node's public DNS name in a browser to get a snapshot of the cluster.
The page lists the Spark URL for the cluster (in this article, spark://ip-172-31-30-53.us-west-1.compute.internal:7077). Copy it down, as you will need it to start the slave.
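The Spark URL has the form spark://host:port. As a small sanity check before pasting it into later commands, the sketch below splits such a URL into its parts using Python 3's standard library. The function name parse_master_url is hypothetical and exists only for this example.

```python
from urllib.parse import urlparse


def parse_master_url(url):
    """Split a spark://host:port URL into (host, port); raise ValueError otherwise."""
    parts = urlparse(url)
    if parts.scheme != "spark" or not parts.hostname or parts.port is None:
        raise ValueError("expected spark://host:port, got %r" % url)
    return parts.hostname, parts.port


# The URL used throughout this article.
host, port = parse_master_url("spark://ip-172-31-30-53.us-west-1.compute.internal:7077")
print(host, port)
```

A URL copied with a missing port or an http:// prefix raises ValueError here instead of failing later when the slave tries to connect.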
Slave Startup
Ensure that JAVA_HOME is set properly and run the following command.
cd ~/server
./spark-2.1.0-bin-hadoop2.7/sbin/start-slave.sh spark://ip-172-31-30-53.us-west-1.compute.internal:7077
And with that, your cluster should be functioning. Hit the status page again at port 8080 to check for it. Observe that you can see the slave under Workers, along with the number of cores available and the memory.
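If the status page does not load, a quick first check is whether anything is listening on the port at all. The sketch below is a generic TCP probe in Python 3, not a Spark feature. It assumes it is run on the node itself, so that "localhost" reaches the web UI on port 8080.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Assumption: this runs on the node itself; replace "localhost" otherwise.
print("web UI on 8080:", "open" if is_port_open("localhost", 8080) else "closed")
```

The same function can be pointed at the host and port from the Spark URL (7077 in this article) to see whether the master accepts connections from where the slave runs.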
Run Jobs Using Pyspark
Let us now run a job using the Python shell provided by Apache Spark. Starting up the shell needs the Spark Cluster URL mentioned earlier.
cd ~/server
./spark-2.1.0-bin-hadoop2.7/bin/pyspark --master spark://ip-172-31-30-53.us-west-1.compute.internal:7077
After a brief startup, you should see the pyspark prompt “>>>”.
For the purpose of testing, we are using a data file containing salaries of baseball players from 1985 through 2016. It contains 26429 records.
Here is a sample session with the pyspark shell using the file Salaries.csv:
>>> a = sc.textFile('Salaries.csv')
>>> a.count()
26429
>>> a.filter(lambda x : '2005' in x).count()
837
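Note that the filter above is a plain substring test, so it also matches any line where "2005" happens to appear inside another field, such as a salary figure. The sketch below shows a stricter predicate in plain Python 3. It assumes the year is the first comma-separated field, which is an assumption about the file layout that this article does not show, and the sample lines are invented.

```python
def is_year(line, year, column=0):
    """True if the comma-separated field at `column` equals `year` exactly."""
    fields = line.split(",")
    return len(fields) > column and fields[column].strip() == year


# Invented sample lines; the column order is an assumption, not taken from Salaries.csv.
sample = [
    "2005,AAA,XX,player01,500000",
    "2006,BBB,XX,player02,1200500",
    "1999,CCC,XX,player03,320050",
]

substring_hits = [line for line in sample if "2005" in line]
column_hits = [line for line in sample if is_year(line, "2005")]
print(len(substring_hits), len(column_hits))
```

In the pyspark shell, the same predicate could be passed to the RDD as a.filter(lambda x: is_year(x, '2005')).count(). Whether that changes the count for the real file depends on its contents, which were not checked here.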
Python Code to Run Jobs
Let us now see how to run some sample Python code on the Spark cluster. The following code is similar to the pyspark session above.
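from __future__ import print_function  # print() then works on Python 2 and 3
from pyspark import SparkContext
# Path to the input file, relative to where the script is run
dataFile = "../data/Salaries.csv"
# Connect to the master using the Spark URL noted earlier
sc = SparkContext("spark://ip-172-31-30-53.us-west-1.compute.internal:7077", "Simple App")
# Load the file as an RDD with one element per line
a = sc.textFile(dataFile)
print("Count of records:", a.count())
# Count the lines that contain the string '2005'
print("Count of 2005 records:", a.filter(lambda x: '2005' in x).count())
sc.stop()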
Along with a bunch of diagnostic output, the code prints:
Count of records: 26429
Count of 2005 records: 837
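The script above hard-codes the Spark URL and the data path, which ties it to one cluster. One way to loosen that is to read them from the command line. The following Python 3 sketch covers only the argument handling, using the standard argparse module. The option names --master, --data, and --year are illustrative choices, not something Spark defines for this script.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for the job script (option names are illustrative)."""
    parser = argparse.ArgumentParser(description="Count salary records on a Spark cluster")
    parser.add_argument("--master", required=True, help="Spark URL, e.g. spark://host:7077")
    parser.add_argument("--data", default="../data/Salaries.csv", help="path to the input file")
    parser.add_argument("--year", default="2005", help="year string to filter on")
    return parser.parse_args(argv)


# Example with an explicit argument list instead of sys.argv.
args = parse_args(["--master", "spark://ip-172-31-30-53.us-west-1.compute.internal:7077"])
print(args.master, args.data, args.year)
```

The parsed values would then replace the literals in the script, for example SparkContext(args.master, "Simple App") and sc.textFile(args.data).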
Summary
And that, my friends, is a simple and complete Apache Spark tutorial. We covered the basics of setting up Apache Spark on an AWS EC2 instance. We ran both the Master and Slave daemons on the same node. Finally, we demonstrated an interactive pyspark session as well as some Python code to run jobs on the cluster.