Flume works through agents that connect to data sources. Agents are written in Java, and you could write your own. Luckily for us, Cloudera provides a sample Twitter agent for Flume that we'll use in this tutorial. Although shared by Cloudera, it will work with other Hadoop distributions as well.
Copy the .jar file to the /usr/lib/flume/lib folder on the node where you installed the Flume software.
Now that the agent code is in place, we need to configure Flume to create an agent using the class in that .jar. We do this by updating the /etc/flume/conf/flume.conf file.
Then make the following changes. Note that the configuration file uses the terms "consumerKey" and "consumerSecret"; Twitter now calls these "API Key" and "API Secret", respectively. Simply substitute the keys from your Twitter app.
The TwitterAgent.sources.Twitter.keywords property contains a comma-separated list of words used to select which tweets should be added to HDFS.
The TwitterAgent.sinks.HDFS.hdfs.path property provides the HDFS path where the tweets should be saved. Be sure that the user running the Flume agent can write to this HDFS location.
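Putting the pieces together, a minimal flume.conf for this agent might look like the sketch below. The source class and property names follow Cloudera's published Twitter example; the keyword list is illustrative, and the placeholder values in angle brackets are the credentials from your Twitter app.

```properties
# Name the components of the agent
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source: the class from the Cloudera sample .jar
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <API Key>
TwitterAgent.sources.Twitter.consumerSecret = <API Secret>
TwitterAgent.sources.Twitter.accessToken = <Access Token>
TwitterAgent.sources.Twitter.accessTokenSecret = <Access Token Secret>
# Illustrative keyword filter; any comma-separated list works
TwitterAgent.sources.Twitter.keywords = hadoop, big data, flume

# HDFS sink: where the tweets land
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = /root/flume/tweets

# In-memory channel between source and sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
```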
Now that the configuration is complete, start the Flume agent. Since the agent needs to continue running even after the SSH session closes, start the process using nohup:
nohup flume-ng agent --conf-file /etc/flume/conf/flume.conf --name TwitterAgent >flume_twitteragent.log &
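One detail worth noting: as written, the redirection captures only stdout, so error messages would go to the terminal rather than the log; adding `2>&1` captures stderr as well. A minimal stand-in demonstrating the pattern (using `sh -c` in place of `flume-ng`, which requires a running cluster):

```shell
# Same nohup-and-redirect pattern; the inner command is a placeholder.
nohup sh -c 'echo starting agent; echo an error message >&2' > demo.log 2>&1 &
wait $!          # for the demo only; a real agent keeps running
cat demo.log     # both stdout and stderr lines are captured
```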
As the agent begins running, monitor its progress by using the tail command against the log file (specified on the previous command line) with the “follow” flag:
tail -f flume_twitteragent.log
And of course the “acid test” is to look at the files being collected in HDFS:
hadoop fs -ls /root/flume/tweets
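Each file the HDFS sink writes contains raw tweet JSON, one object per line, so the data is easy to inspect once you pull a file down. A quick sketch (Python, using a made-up minimal record for illustration; real tweets carry many more fields, but "text" and "user.screen_name" are among the standard ones):

```python
import json

# Made-up minimal record standing in for one line of a Flume-written file.
line = '{"text": "Learning Flume", "user": {"screen_name": "example"}}'

tweet = json.loads(line)
print(tweet["user"]["screen_name"], "said:", tweet["text"])
```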