Refining the Raw Data

    • In the Hortonworks Sandbox virtual machine (VM) console window, press the Alt and F5 keys, then log in to the Sandbox using the following user name and password:

Login: root\       Password: hadoop

After you log in, the command prompt will appear with the prefix [root@sandbox ~]#:

    • At the command prompt, type in the following command, then press the Enter key: hive -f hiveddl.sql

Lines of text appear as the script runs a series of MapReduce jobs. It will take a few minutes for the script to finish running. When the script has finished running, the time taken is displayed, and the normal command prompt appears.


The hiveddl.sql script has performed the following steps to refine the data:

  • Converted the raw Twitter data into a tabular format.
  • Used the dictionary file to score the sentiment of each Tweet by the number of positive words compared to the number of negative words, and then assigned a positive, negative, or neutral sentiment value to each Tweet.
  • Created a new table that includes the sentiment value for each Tweet.

We can look at the data using the Hive command line. We can start Hive by typing hive at the prompt.

23_start_hive (1)

Remember to add the json serde jar file so we can look at the tables.


The command “show tables” will show you the tables. You can browse the data using the “select * from

limit 10;” command. The limit 10 gives you the first 10 records instead of the whole table.


We can also use HCatalog to view the results,



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s