This tutorial covers Spark setup on Ubuntu 14.04:
- Installation of all Spark prerequisites
- Spark build and installation
- Basic Spark configuration
- Standalone cluster setup (one master and 2 slaves on a single machine)
Before installing Spark, we need:
- Ubuntu 14.04 LTS
- Python (you already have this)
- Git
Step 1: I have already explained how to install the JDK and Scala in my previous tutorials, so please refer to those.
Step 2: Installing Maven:
Run the following command:
sudo apt-get install maven
Be warned, a large download will take place.
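Before moving on to the build, it can help to confirm that every prerequisite is actually on the PATH. The following is a minimal sketch, not part of the original steps: the helper name check_tools is made up for illustration, and the list of tools simply mirrors the prerequisites named above.

```shell
# Report which of the given tools can be found on the PATH.
# Returns non-zero if at least one of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# The prerequisites named in this tutorial.
check_tools java scala mvn git python || echo "install the missing tools before building"
```

Any line starting with "missing:" points at a prerequisite that still needs to be installed.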
Step 3: Building and installing Spark:
Download the latest version of Spark from the official Apache Spark download page.
Now extract the archive and cd into the extracted directory in the terminal.
We’ll use Maven to build Spark.
But what do the parameters below actually mean?
Spark builds against Hadoop 1.0.4 by default, so if you want to read from HDFS (optional), pass the version of Hadoop that you have installed. If not, choose any version.
For more options and additional details, take a look at the official instructions on building Spark with Maven.
# first, to avoid the notorious ‘permgen’ error,
# increase the amount of memory available to the JVM:
export MAVEN_OPTS="-Xmx1300M -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
# then trigger the build:
mvn -Phadoop2-yarn -Dhadoop.version=2.0.5-alpha -Dyarn.version=2.0.5-alpha -DskipTests clean package
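If Hadoop is already installed, the value for -Dhadoop.version does not have to be guessed. The sketch below assumes that the first line of the output of `hadoop version` has the form "Hadoop 2.0.5-alpha"; the helper name parse_hadoop_version is made up for illustration.

```shell
# Extract the version number from the first line of `hadoop version` output,
# which is assumed to look like "Hadoop 2.0.5-alpha".
parse_hadoop_version() {
    sed -n '1s/^Hadoop \(.*\)$/\1/p'
}

if command -v hadoop >/dev/null 2>&1; then
    HADOOP_VERSION=$(hadoop version | parse_hadoop_version)
    echo "build with: -Dhadoop.version=$HADOOP_VERSION"
else
    echo "hadoop not found on the PATH; any supported version can be used"
fi
```

The printed value is the one to substitute into the mvn command.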
If you did everything right, the build should complete without a glitch in about 15 to 30 minutes, depending on your hardware (downloads take the majority of that time). The only notifications should be Scala deprecation and duplicate class warnings, both of which can be ignored.
You should see something like this by the end of the compilation process:
Since I have successfully built Spark with mvn, I have never used sbt (Scala Build Tool) to build it, but that option is also available to you.
OK, we have installed all the prerequisites and successfully built Spark. Now, it’s finally time to have some fun with it.
To get the ball rolling, start the Spark master:
# to start the Spark master on your localhost:
# outputs master’s class and log file info:
starting org.apache.spark.deploy.master.Master, logging to /home/mbo/spark/bin/../
To check out the master’s web console, open http://localhost:8080/
Now it’s time to configure and start the slave workers:
We will use the launch scripts that Spark provides to make our lives easier. First of all, there are a couple of configurations we need to set.
The launch scripts read the conf/slaves file to find the host names of the machines on which the slave nodes will run. All you have to do is list the host names, one per line. Since we are setting up everything on our own machine, we only need to add “localhost” to this file.
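A minimal sketch of creating that host file, assuming it is conf/slaves (the name used by the standalone launch scripts in Spark releases before 3.1) and that SPARK_HOME points at the directory where Spark was extracted and built. Both the SPARK_HOME default and the helper name write_slaves_file are assumptions made for illustration.

```shell
# SPARK_HOME is an assumption: the directory where Spark was extracted and built.
SPARK_HOME="${SPARK_HOME:-$HOME/spark}"

# Write the given host names, one per line, to conf/slaves under a Spark directory.
write_slaves_file() {
    spark_dir=$1
    shift
    printf '%s\n' "$@" > "$spark_dir/conf/slaves"
}

# Single-machine setup: the only worker host is localhost.
if [ -d "$SPARK_HOME/conf" ]; then
    write_slaves_file "$SPARK_HOME" localhost
fi
```

For a real multi-machine cluster, the same helper would take one argument per worker host.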
There is a set of variables that you can use to override the default values. This is done by putting the values in the “spark-env.sh” file. A template is available at “conf/spark-env.sh.template”, and you can copy it to create the spark-env.sh file. The variables that can be set are described in the template itself. We will add the following lines to the file.
Here SPARK_WORKER_MEMORY specifies the amount of memory to allocate to each worker. If this value is not given, the default is the total memory available minus 1 GB. Since we are running everything on our local machine, we wouldn’t want the slaves to use up all our memory. I am running on a machine with 8 GB of RAM, and since we are creating 2 slave nodes, we will give each of them 2 GB of RAM.
SPARK_WORKER_INSTANCES specifies the number of worker instances to run on each machine. Here it is set to 2, since we will only create 2 slave nodes.
SPARK_WORKER_DIR is the directory in which applications run; it holds both their logs and their scratch space. Make sure that the worker can write to this directory, that is, that its permissions are set properly.
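Putting the three settings together, the lines in conf/spark-env.sh could look like the sketch below. The memory and instance values follow the description above; the work directory path is a hypothetical choice, since the text does not name one, and the final mkdir is an extra step added here so that permission problems surface early.

```shell
# Lines for conf/spark-env.sh (created from conf/spark-env.sh.template).
export SPARK_WORKER_MEMORY=2g        # memory given to each worker
export SPARK_WORKER_INSTANCES=2      # number of worker processes on this machine

# Hypothetical location; any directory the worker can write to will do.
export SPARK_WORKER_DIR="${TMPDIR:-/tmp}/spark-work"

# Create the work directory up front so that permission problems show up early.
mkdir -p "$SPARK_WORKER_DIR"
```

With 2 workers at 2 GB each, the cluster claims 4 GB in total, which leaves half of an 8 GB machine for everything else.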
Once these configurations are ready, we are good to go. Let’s start with the master node.
Just execute the launch script for the master, “start-master.sh”.
Once the master has started, you should be able to access the web UI at http://localhost:8080.
Now you can proceed to start the slaves. This is done by running the “start-slaves.sh” launch script.
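The two launch steps can be sketched as follows. Where the scripts live depends on the Spark release: early releases keep them in bin/, later ones in sbin/, so the sketch checks both. SPARK_HOME and the helper name find_launch_script are assumptions made for illustration.

```shell
# SPARK_HOME is an assumption: the directory where Spark was extracted and built.
SPARK_HOME="${SPARK_HOME:-$HOME/spark}"

# Print the full path of a launch script, checking sbin/ first and then bin/.
find_launch_script() {
    for dir in sbin bin; do
        if [ -x "$SPARK_HOME/$dir/$1" ]; then
            echo "$SPARK_HOME/$dir/$1"
            return 0
        fi
    done
    echo "cannot find $1 under $SPARK_HOME" >&2
    return 1
}

# Start the master first, then the workers on the configured hosts.
for name in start-master.sh start-slaves.sh; do
    if script=$(find_launch_script "$name"); then
        "$script"
    fi
done
```

If a script cannot be found, the helper prints a message to standard error and nothing is started for that step.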
Note: In order to start the slaves, the master needs to be able to reach the slave machines through ssh. Since we are running everything on the same machine, your own machine must be accessible through ssh. To check whether an ssh server is installed, run “which sshd”. If it is not installed, install it with the following command.
sudo apt-get install openssh-server
You will also need to specify a password for root, since it will be requested when starting the slaves. If you do not have a root password set, use the following command to set one.
With the slaves successfully started, you now have a Spark cluster up and running. If everything went according to plan, the web UI for the master should show the two slave nodes.
Now let’s connect to the cluster from the interactive shell by executing the following command:
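How the master URL is passed depends on the Spark release, so the sketch below only builds the URL and prints both forms of the command instead of running them. The helper name master_url is made up for illustration, localhost is a placeholder, and 7077 is the default port of a standalone master.

```shell
# Build a standalone master URL of the form spark://HOST:PORT.
# The port defaults to 7077 when it is not given.
master_url() {
    echo "spark://$1:${2:-7077}"
}

# Placeholder values; substitute the ones shown in the master's web UI.
MASTER_URL=$(master_url localhost 7077)

# Spark 1.0 and later take the URL as a flag; earlier releases read it
# from the MASTER environment variable instead.
echo "./bin/spark-shell --master $MASTER_URL"
echo "MASTER=$MASTER_URL ./spark-shell"
```

Run whichever of the two printed commands matches your release from the Spark directory.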
You can find the IP and the PORT in the top left corner of the master’s web UI. Once you are connected, the web UI will show that there is an active task.