Using Spring XD to stream Tweets to Hadoop

1 – Download and Install Spring-XD

Spring-XD can be found at http://spring.io. This tutorial uses the 1.0.0.M3 milestone, so conventions may change in later releases. Follow the install instructions, then start Spring-XD and create a test stream to make sure everything is working.
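If you haven't started it yet, the single-node server and the shell are launched from separate terminal windows using the scripts that ship with the distribution (the same scripts are used again with Hadoop options in step 2):

./xd-singlenode

./xd-shell

From the shell prompt, create the test stream: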

stream create --name ticktock --definition "time | log"

That simple instruction should begin showing output in the server terminal window similar to:

2013-10-12 17:18:09
2013-10-12 17:18:10
2013-10-12 17:18:11
2013-10-12 17:18:12
2013-10-12 17:18:13
2013-10-12 17:18:14

Congrats, Spring-XD is running.
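Before moving on, you can tidy up the test stream from the shell; the destroy command is the same one we'll use again later:

stream destroy --name ticktock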

2 – Download and Install Hadoop

Hadoop can be installed as a single node by following the steps in my previous blog post at https://amankumarlabra.wordpress.com/2014/08/09/installing-single-node-hadoop-2-4-1-on-ubuntu/
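Before wiring Spring-XD to Hadoop, it's worth a quick check that the Hadoop daemons are actually up. With a standard single-node Hadoop 2.x install, jps should list something along these lines (process IDs omitted):

jps

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager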

Configuring Spring-XD to use Hadoop:

Step 1 – Edit the hadoop.properties file. Open XD_HOME/xd/config/hadoop.properties and enter the namenode config:

fs.default.name=hdfs://10.0.0.27:8020
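The host and port here need to match what the Hadoop installation itself advertises – on Hadoop 2.x that's the fs.defaultFS entry in core-site.xml, which for this setup would look something like:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://10.0.0.27:8020</value>
</property>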

Step 2 – Spin up the Spring-XD Service with Hadoop

In a terminal window, start the server from the XD_HOME/xd/ folder:

./xd-singlenode --hadoopDistro hadoop24

Step 3 – Spin up the Spring-XD Client with Hadoop

In a separate terminal window, start the shell from the XD_HOME/shell/ folder:

./xd-shell --hadoopDistro hadoop24

Then set the namenode for the client, using the IP address of the Hadoop namenode:

hadoop config fs --namenode hdfs://10.0.0.27:8020

Next, test out whether you can see HDFS with a command like:

hadoop fs ls /

You should see something like:

drwxr-xr-x   - hdfs   hdfs          0 2013-05-30 10:34 /apps
drwx------   - mapred hdfs          0 2013-10-12 17:06 /mapred
drwxrwxrwx   - hdfs   hdfs          0 2013-10-12 17:19 /tmp
drwxr-xr-x   - hdfs   hdfs          0 2013-06-10 14:39 /user

Once that’s confirmed we can set up a simple test stream. In this case, we can re-create TickTock but store it in HDFS.

stream create --name ticktockhdfs --definition "time | hdfs"

Leave it running for a few seconds, then destroy or undeploy the stream.

stream destroy --name ticktockhdfs
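Destroying removes the stream definition entirely. If you'd rather keep the definition around, undeploy stops the stream and deploy starts it again (we'll use undeploy on the Twitter stream later):

stream undeploy --name ticktockhdfs
stream deploy --name ticktockhdfs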

You can then view the small file that will have been generated in HDFS.

hadoop fs ls /xd/ticktockhdfs

Found 1 items
-rwxr-xr-x   3 root hdfs        420 2013-10-12 17:18 /xd/ticktockhdfs/ticktockhdfs-0.log

Which you can quickly examine with:

hadoop fs cat /xd/ticktockhdfs/ticktockhdfs-0.log

2013-10-12 17:18:09
2013-10-12 17:18:10
2013-10-12 17:18:11
2013-10-12 17:18:12
2013-10-12 17:18:13
2013-10-12 17:18:14

Cool, but not so interesting, so let’s get to Twitter.

3 – Create the Tweet Stream in Spring-XD

In order to stream in information from Twitter, you'll need to set up a Twitter developer app so you can get the necessary API keys. Once you have the keys, add them to XD_HOME/xd/config/twitter.properties. In our case, we'll take a look at the stream of current opinion on that current icon of popular culture: Miley Cyrus. The stream can be set up as follows with some simple tracking terms:

stream create --name cyrustweets --definition "twitterstream --track='mileycyrus, miley cyrus' | hdfs"
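Note that the stream won't produce anything until the credentials are in place. As a rough sketch, the twitter.properties entries look something like the following – the exact property names are an assumption on my part, so check the comments in the file that ships with your release:

consumerKey=<your consumer key>
consumerSecret=<your consumer secret>
accessToken=<your access token>
accessTokenSecret=<your access token secret>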

Let the stream run for a little while to build up some files. You can check in on the data with:

hadoop fs ls /xd/cyrustweets/

Found 12 items
-rwxr-xr-x   3 root hdfs    1002252 2013-10-12 19:33 /xd/cyrustweets/cyrustweets-0.log
-rwxr-xr-x   3 root hdfs    1000126 2013-10-12 19:33 /xd/cyrustweets/cyrustweets-1.log
-rwxr-xr-x   3 root hdfs    1004800 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-10.log
-rwxr-xr-x   3 root hdfs          0 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-11.log
-rwxr-xr-x   3 root hdfs    1003357 2013-10-12 19:33 /xd/cyrustweets/cyrustweets-2.log
-rwxr-xr-x   3 root hdfs    1000903 2013-10-12 19:33 /xd/cyrustweets/cyrustweets-3.log
-rwxr-xr-x   3 root hdfs    1000096 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-4.log
-rwxr-xr-x   3 root hdfs    1001072 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-5.log
-rwxr-xr-x   3 root hdfs    1001226 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-6.log
-rwxr-xr-x   3 root hdfs    1000398 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-7.log
-rwxr-xr-x   3 root hdfs    1001404 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-8.log
-rwxr-xr-x   3 root hdfs    1006052 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-9.log

The default rollover for the logs is 1MB, so there are a lot of files. You might want to increase that or change other options (see the sketch after the next command). After a cup of coffee or two, we should have some reasonable data to begin processing and refining. It took around 30 minutes to generate 100MB of log files – clearly a fairly popular topic. At this point, you can undeploy the stream so we can do some sample analysis:

stream undeploy --name cyrustweets
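If you want fewer, larger files on the next run, the hdfs sink takes a rollover option when the stream is created. A sketch, assuming the option is given in bytes (64MB here) – check the sink options for your release:

stream create --name cyrustweets --definition "twitterstream --track='mileycyrus, miley cyrus' | hdfs --rollover=67108864"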