Visualize the refined data with Excel

We will use Excel Professional Plus 2013 to access the refined sentiment data.

  • In Windows, open a new Excel workbook, then select Data > From Other Sources > From Microsoft Query.


  • On the Choose Data Source pop-up, select the Hortonworks ODBC data source you installed previously, then click OK. The Hortonworks ODBC driver enables you to access Hortonworks data with Excel and other Business Intelligence (BI) applications that support ODBC.


  • After the connection to the Sandbox is established, the Query Wizard appears. Select the “tweetsbi” table in the Available tables and columns box, then click the right arrow button to add the entire “tweetsbi” table to the query. Click Next to continue.


  • Select the “text” column in the “Columns in your query” box, then click the left arrow button to remove the text column.
  • After the “text” column has been removed, click Next to continue.
  • On the Filter Data screen, click Next to continue without filtering the data.


  • On the Sort Order screen, click Next to continue without setting a sort order.
  • Click Finish on the Query Wizard Finish screen to retrieve the query data from the Sandbox and import it into Excel.
  • On the Import Data dialog box, click OK to accept the default settings and import the data as a table.
  • The imported query data appears in the Excel workbook.


Now that we have successfully imported the Twitter sentiment data into Microsoft Excel, we can use the Excel Power View feature to analyze and visualize the data.

In this section we will see how sentiment varies by country.

  • In the Excel worksheet with the imported “tweetsbi” table, select Insert > Power View to open a new Power View report.


  • The Power View Fields area appears on the right side of the window, with the data table displayed on the left. Drag the handles or click the Pop Out icon to maximize the size of the data table.
  • In the Power View Fields area, clear the checkboxes next to the id and ts fields, then click Map on the Design tab in the top menu.


  • The map view displays a global view of the data.
  • Now let’s display the sentiment data by color. In the Power View Fields area, click sentiment, then select Add as Color.


  • Under SIZE, click sentiment, then select Count (Not Blank).


  • Now the map displays the sentiment data by color:
  • Orange: positive
  • Blue: negative
  • Red: neutral


  • Use the map controls to zoom in on Ireland. About half of the tweets have a positive sentiment score, as indicated by the color orange.


  • Next, use the map controls to zoom in on the sentiment data in China.


  • The United States is the biggest market, so let’s look at sentiment data there. The size of the United States pie chart indicates that a relatively large number of the total tweets come from the US. About half of the tweets in the US show neutral sentiment, with a relatively small amount of negative sentiment.



Tools for data visualization

Creating infographics can be time-consuming. But these tools make it easier.

It’s often said that data is the new world currency, and the web is the exchange bureau through which it’s traded. As consumers, we’re positively swimming in data; it’s everywhere, from food packaging labels to World Health Organisation reports. As a result, it’s becoming increasingly difficult for the designer to present data in a way that stands out from the mass of competing data streams.

One of the best ways to get your message across is to use a visualization to quickly draw attention to the key messages, and by presenting data visually it’s also possible to uncover surprising patterns and observations that wouldn’t be apparent from looking at stats alone.

1. Excel


You can actually do some pretty complex things with Excel, from ‘heat maps’ of cells to scatter plots. As an entry-level tool, it can be a good way of quickly exploring data, or creating visualizations for internal use, but the limited default set of colours, lines and styles make it difficult to create graphics that would be usable in a professional publication or website. Nevertheless, as a means of rapidly communicating ideas, Excel should be part of your toolbox.

2. Tableau


Tableau Desktop is data analysis that keeps up with you. It’s easy to learn, easy to use, and 10-100x faster than existing solutions. It’s built on breakthrough technology that translates pictures of data into optimized database queries. Use your natural ability to see patterns, identify trends and discover visual insights in seconds. No wizards, no scripts.

3. R


How many other pieces of software have an entire search engine dedicated to them? A statistical package used to parse large data sets, R is a very complex tool, and one that takes a while to understand, but has a strong community and package library, with more and more being produced.

4. jpGraph


If you need to generate charts and graphs server-side, jpGraph offers a PHP-based solution with a wide range of chart types. It’s free for non-commercial use, and features extensive documentation. By rendering on the server, this is guaranteed to provide a consistent visual output, albeit at the expense of interactivity and accessibility.

Refining the Raw Data

  • In the Hortonworks Sandbox virtual machine (VM) console window, press the Alt and F5 keys, then log in to the Sandbox using the following user name and password:

Login: root       Password: hadoop

After you log in, the command prompt will appear with the prefix [root@sandbox ~]#:

  • At the command prompt, type in the following command, then press the Enter key: hive -f hiveddl.sql

Lines of text appear as the script runs a series of MapReduce jobs. It will take a few minutes for the script to finish running. When the script has finished running, the time taken is displayed, and the normal command prompt appears.


The hiveddl.sql script has performed the following steps to refine the data:

  • Converted the raw Twitter data into a tabular format.
  • Used the dictionary file to score the sentiment of each Tweet by the number of positive words compared to the number of negative words, and then assigned a positive, negative, or neutral sentiment value to each Tweet.
  • Created a new table that includes the sentiment value for each Tweet.
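The scoring logic in the second step can be sketched in Python. This is an illustrative sketch only, not the actual hiveddl.sql implementation: the tiny word lists stand in for the real dictionary file, and the tie-breaking rule is an assumption.

```python
# Illustrative sketch of the sentiment-scoring step performed by
# hiveddl.sql: count positive vs. negative dictionary words in a
# tweet and assign a sentiment label. The word sets below are small
# stand-ins for the tutorial's dictionary file.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "poor", "hate", "sad"}

def score_tweet(text):
    words = text.lower().split()
    positives = sum(1 for w in words if w in POSITIVE)
    negatives = sum(1 for w in words if w in NEGATIVE)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"
```

For example, `score_tweet("I love this great phone")` returns "positive", while a tweet with no dictionary words scores "neutral".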

We can look at the data using the Hive command line. We can start Hive by typing hive at the prompt.


Remember to add the JSON SerDe .jar file so we can look at the tables.


The command “show tables;” lists the available tables. You can browse the data using the “select * from <table_name> limit 10;” command. The “limit 10” clause returns the first 10 records instead of the whole table.
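Putting these inspection steps together, a Hive session on the Sandbox might look like the following transcript. The SerDe JAR path is a placeholder you must adjust; the tweetsbi table name comes from the earlier steps.

```sql
hive> ADD JAR /path/to/json-serde.jar;       -- JSON SerDe needed to read the tables
hive> show tables;
hive> select * from tweetsbi limit 10;       -- first 10 records only
```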


We can also use HCatalog to view the results.


What is Sentiment Data?

Sentiment data is unstructured data that represents opinions, emotions, and attitudes contained in sources such as social media posts, blogs, online product reviews, and customer support interactions.

Potential Uses of Sentiment Data

Organizations use sentiment analysis to understand how the public feels about something at a particular moment in time, and also to track how those opinions change over time.

An enterprise may analyze sentiment about:

  • A product – For example, does the target segment understand and appreciate messaging around a product launch? What products do visitors tend to buy together, and what are they most likely to buy in the future?
  • A service – For example, a hotel or restaurant can look into its locations with particularly strong or poor service.
  • Competitors – In what areas do people see our company as better than (or weaker than) our competition?
  • Reputation – What does the public really think about our company? Is our reputation positive or negative?

The benefits that this data offers are:

  • Improves Customer Service:
    It gives useful insights into current and future customers’ purchase preferences, brand affiliations, topics of interest, opinions, points of view in discussions, likes and dislikes in products/services, and much more. This information lets organizations drastically improve their customer service and engagement strategies by building on the positive sentiments and formulating methods to combat negative sentiments.
  • Revives Brand:
    One of the best uses of this data is that it allows organizations to quantify perceptions of their brand, products and services, marketing campaigns, social engagement initiatives, online content, etc. Organizations can use this information to devise better and more effective branding and marketing strategies, and thus improve their brand reputation.
  • Beats Competition:
    This data also lets organizations know the sentiments surrounding their competitors, allowing them to benchmark their performance against that of their competitors.
  • Measures Effectiveness of Marketing Campaigns:
    Organizations can analyze changes in sentiment relative to specific campaigns, audiences, and social outlets, and quickly identify positive talking points around the brand to measure, inform, and evaluate marketing strategy.

Configuring Flume with Twitter App

Flume needs agents that connect to data sources in order to work. Agents are written in Java, and you could write your own. Luckily for us, Cloudera provides a sample Twitter agent for Flume that we’ll use in this tutorial. Although shared by Cloudera, it works with other Hadoop distributions as well.

Copy the .jar file to the /usr/lib/flume/lib folder on the node where you installed the Flume software.
Now that the agent code is in place, we need to configure Flume to create an agent using the class in that .jar. We do this by updating the /etc/flume/conf/flume.conf file.

Then make the following changes. Note that the configuration file uses the term “consumerKey” and “consumerSecret“. Twitter now calls these “API Key” and “API Secret“, respectively. Simply substitute in the keys from the Twitter app.
The TwitterAgent.sources.Twitter.keywords property contains a comma-separated list of words used to select which tweets should be added to HDFS.
The TwitterAgent.sinks.HDFS.hdfs.path provides the path from the name node where the tweets should be saved. Be sure that the user running the Flume agent can write to this HDFS file location.
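A minimal sketch of the relevant flume.conf properties might look like this. The property names follow Cloudera’s sample TwitterAgent configuration; the key values, keyword list, and HDFS URL are placeholders you must replace with your own.

```properties
# Credentials from your Twitter app (Twitter now calls the first two
# "API Key" and "API Secret")
TwitterAgent.sources.Twitter.consumerKey = <your API key>
TwitterAgent.sources.Twitter.consumerSecret = <your API secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>

# Comma-separated keywords that select which tweets are captured
TwitterAgent.sources.Twitter.keywords = hadoop, big data, flume

# HDFS location where tweets are written; the user running the Flume
# agent must have write access to this path
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://<namenode>:8020/root/flume/tweets
```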


Now that the configuration is complete, start the Flume agent. Since the agent needs to continue running even after the SSH session is closed, start the process using nohup:

nohup flume-ng agent --conf-file /etc/flume/conf/flume.conf --name TwitterAgent > flume_twitteragent.log &

As the agent begins running, monitor its progress by using the tail command against the log file (specified on the previous command line) with the “follow” flag:

tail -f flume_twitteragent.log

And of course the “acid test” is to look at the files being collected in HDFS:

hadoop fs -ls /root/flume/tweets


Installing Flume on HDP

The next step is to install Flume on the Hadoop cluster node you’ll use as the Flume agent. Pick a node that has a client role, as the agent will be connecting to the name node and sending files to HDFS nodes as data is streamed.
Flume is easy to install with HDP. Just run the following two Yum commands as root (or use sudo if not logged in as root):

yum install flume
yum install flume-node

After these two commands are complete, the Flume code is installed and ready to be used.

Creating a Twitter Application

  • You probably already have a Twitter account. If you don’t, create one on the Twitter site.
  • Next, browse to the Twitter developer site, log in with your Twitter ID, and read through some of the introductory API material on Twitter Apps. A “Twitter App” can take many forms: any piece of software that interacts with Twitter on behalf of a user is an app. A mobile phone app you can write as an alternative to the one Twitter provides is a “Twitter App”. The Flume agent we’ll create is an “App” too.
  • Next, browse to the app management page and click the Create New App button.


  • Next, fill in the basic app info form. The application “Name” must be globally unique across all Twitter apps for all users, so pick something unique. After filling in the info, agree to the terms of use and press the “Create App” button at the bottom of the form.


You’ll be redirected to the management page for your new app. Switch to the API Keys tab, and click the “create my access token” button.

OK, you’re done! There are four pieces of information you need to copy from the form before we go to the Hadoop cluster to set up the Flume agent:

  1. API key
  2. API secret
  3. Access token
  4. Access token secret

These four keys will be added to a configuration file in Flume, enabling it to use your Twitter account to stream tweets. Copy the four into a text file for later reference.
