One of the most referenced use cases for Hadoop is to collect social media interactions to better understand consumer sentiment about a topic, brand or company. Hadoop is an ideal platform for this, since it can inexpensively store large volumes of data, and is especially good at analyzing data whose structure isn’t understood in advance.
However, streaming data from social media sources is easier said than done, and it’s easy to be discouraged. After reading this tutorial, you should be able to repeat the process for yourself confidently!
In this tutorial I’ll show step-by-step how to use the Hortonworks HDP 2.1 Hadoop distribution (on Linux) and Apache Flume to collect Twitter tweets and store them in HDFS for later analysis. If you’re using a different Hadoop distribution, or Windows instead of Linux, the process will be nearly the same for you. However, some of the commands and file locations may differ slightly and require adjustments.
Before beginning, make sure you have a HortonWorks cluster installed, and your HDFS file system is operational. In addition, the node in the cluster that you’ll use for collecting tweets should have access to the Internet (obviously).
I’m using an HDP 2.1 sandbox running on the CentOS 6.5 distribution of Linux. I’ll be installing and configuring Flume on localhost, which is one of my data nodes that has the Hadoop client role deployed to it.
Creating a Twitter Application
You probably already have a Twitter account. If you don’t, create one on Twitter.com.
Next, browse to dev.twitter.com, log in with your Twitter ID, and read through some of the introductory API material on Twitter Apps. A “Twitter App” can take many forms. Any piece of software that interacts with Twitter on behalf of a user is an app. A mobile phone app you can write as an alternative to the one Twitter provides is a “Twitter App”. The Flume agent we’ll create is an “App” too.
Next browse to apps.twitter.com, and click the Create New App button.
You’ll be redirected to the management page for your new app. Switch to the API Keys tab, and click the create my access token button.
OK, you’re done! There are four pieces of information you need to copy from the form before we go to the Hadoop cluster to setup the Flume agent:
- API key
- API secret
- Access token
- Access token secret
These four keys will be added to a configuration file in Flume, enabling it to use your Twitter account to stream tweets. Copy the four into a text file for later reference.
The next step is to install Flume on the Hadoop cluster node you’ll use as the Flume agent. Pick a node that has a client role, since the agent will connect to the name node and send files to the HDFS data nodes as data is streamed.
Flume is easy to install with HDP. Just run the following Yum command as root (or use sudo if not logged in as root):
yum install flume
Once the install completes, the Flume code is in place and ready to use. But Flume needs agents that connect to data sources in order to do anything. Agents are written in Java, and you could write your own. Luckily for us, Cloudera provides a sample Twitter agent for Flume that we’ll use in this tutorial. Although shared by Cloudera, it will work with other Hadoop distributions as well.
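A quick way to confirm the install succeeded is to ask Flume for its version (this assumes the HDP package put the flume-ng script on your PATH, which is the default behavior):

```shell
# Print the installed Flume version; a version banner confirms the install worked
flume-ng version
```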
Download the Cloudera .jar from this link: http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar
Copy the .jar file to the /usr/lib/flume/lib folder on the node where you installed the Flume software.
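The download and copy steps can be done from a shell on that node. The sketch below assumes the default HDP 2.1 Flume library location mentioned above; adjust the paths if your layout differs:

```shell
# Download the Cloudera sample Flume Twitter source (URL from the link above)
wget http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar

# Copy the .jar into Flume's library directory so the agent can load the class
cp flume-sources-1.0-SNAPSHOT.jar /usr/lib/flume/lib/
```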
Now that the agent code is in place, we need to configure Flume to create an agent using the class in that .jar. We do this by updating the /etc/flume/conf/flume.conf file.
Download the sample flume.conf file Cloudera includes on GitHub as a base.
Then make the following changes. Note that the configuration file uses the terms “consumerKey” and “consumerSecret”. Twitter now calls these “API Key” and “API Secret”, respectively. Simply substitute in the keys you copied from the Twitter app configuration screen earlier.
The TwitterAgent.sources.Twitter.keywords property contains a comma-separated list of words used to select which tweets you want to add to HDFS.
The TwitterAgent.sinks.HDFS.hdfs.path property sets the HDFS path, relative to the name node, where the tweets should be saved. Be sure that the user running the Flume agent can write to this HDFS location. Note that in this tutorial I used the “root” user to run the agent. In a production solution, it would be better to create a dedicated user for the agent and assign that user only the permissions it needs. But for learning purposes in a lab environment, it’s OK to use root.
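Pulling these pieces together, the Twitter-related portion of /etc/flume/conf/flume.conf looks roughly like the sketch below. The angle-bracket values, the keyword list, and the HDFS path are placeholders of my own; substitute your four Twitter keys and the path you want to use:

```
# Agent components: one Twitter source, one memory channel, one HDFS sink
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# The Cloudera sample source class from the downloaded .jar
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your API key>
TwitterAgent.sources.Twitter.consumerSecret = <your API secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data, flume

# Write collected tweets to HDFS (path is an example; use your own)
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/root/tweets/
```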
Now that the configuration is complete, we can start the Flume agent. Since we want the agent to keep running even after we close our SSH session, start the process using nohup:
nohup flume-ng agent --conf-file /etc/flume/conf/flume.conf --name TwitterAgent >flume_twitteragent.log &
As the agent begins running, you can monitor progress by using the tail command against the log file (specified on the previous command line) with the “follow” flag:
tail -f flume_twitteragent.log
And of course the “acid test” is to look at the files being collected in HDFS:
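For example, assuming the agent is writing to the /user/root/tweets/ path (substitute whatever you set in hdfs.path), you can list the collected files like this:

```shell
# List the tweet files Flume has written to HDFS so far
hdfs dfs -ls /user/root/tweets/
```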