Tableau, Apache Hive and Spark

I have been reading some interesting things from the AMP group, especially about the Shark server. After following the easy install steps linked below and running the example, I was inspired to connect Tableau to it using the Cloudera ODBC Hive connector. It took a bit of fiddling around, but the example seems to work really well out of the box.

https://github.com/amplab/shark/wiki/Running-Shark-Locally

CREATE TABLE src(key INT, value STRING);
LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src;
SELECT COUNT(1) FROM src;
CREATE TABLE src_cached AS SELECT * FROM src;
SELECT COUNT(1) FROM src_cached;

Installing the server outside my firewall meant a little SSH tunnelling (don’t tell the firewall guys) to get to the example server on the public internet.

Image
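For readers not using PuTTY, the same kind of tunnel can be opened with the OpenSSH client. The sketch below only builds the command line; the user name, host name and port 10000 are hypothetical placeholders, not values taken from this setup, so substitute whatever port your Hive/Shark server actually listens on.

```python
def build_tunnel_command(user, host, local_port, remote_port):
    """Build an OpenSSH command that forwards local_port on this machine
    to remote_port on the remote server (standard `ssh -L` forwarding)."""
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward ports
        "-L", f"{local_port}:localhost:{remote_port}",
        f"{user}@{host}",
    ]


# Hypothetical values for illustration only.
command = build_tunnel_command("ubuntu", "shark.example.com", 10000, 10000)
print(" ".join(command))
```

With a tunnel like this open, the ODBC connector would point at localhost and the local port rather than at the remote server directly.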

The Cloudera Hive connection has to be set up as shown below if you are using a PuTTY SSH tunnel to the port that you have exposed on your Hive/Shark server.

Image

Image

You can now monitor two log files. The first is the log of the Shark server running on your Linux box, in this case an Ubuntu server.

Image

You can also monitor what Tableau is doing on the local machine. For this I use a custom Python script that watches the log files.

Image
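The script itself is not reproduced here, so the following is only a minimal sketch of how such a log watcher might look, not the script used above. It behaves roughly like `tail -f` with an optional keyword filter; the function names `read_new_lines` and `follow` are made up for this illustration.

```python
import time


def read_new_lines(handle):
    """Return the complete lines appended to an open file since the last call.
    A trailing partial line is left unread until its newline arrives."""
    lines = []
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line:
            break  # nothing more to read for now
        if not line.endswith("\n"):
            handle.seek(position)  # partial line: rewind and retry later
            break
        lines.append(line.rstrip("\n"))
    return lines


def follow(path, keyword="", poll_seconds=0.5):
    """Print new lines from the file at `path` forever, keeping only those
    containing `keyword` (case-insensitive). Stop with Ctrl+C."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        handle.seek(0, 2)  # jump to the end so only new entries are shown
        while True:
            for line in read_new_lines(handle):
                if keyword.lower() in line.lower():
                    print(line)
            time.sleep(poll_seconds)
```

Calling `follow(path_to_log, keyword="select")` on whichever log file you want to watch would then print only the entries that mention a SELECT as they arrive.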

As you move a control around in the Tableau workbook, you can watch the SQL query bounce from the local log to the Shark server and back.

HIVE

The Shark server, when running locally as a server, seems to log ‘OK’ a lot.

Image

Data-led decision making

Who does access to more and more data help? Who is best placed to access this growing stack of data and make inferences from it?

According to an interesting article in the Financial Times big data series, we generated 2 exabytes of data between the beginning of time and 2004. We now take 2-3 days to generate the same amount, and that will soon become 10-20 minutes. This is truly a disruptive amount of data. So much of the modern world is now addressable, and the rate of increase is itself increasing. This is the age of data for sure.

Least well placed are rural and small economies. They are often technically under-resourced, yet they perhaps hold some of the most impactful datasets. Rural settings are tranquil and beautiful, but they are also surrounded by price data and time series zipping around the place. Accessing data and making predictions in these settings is, I predict, an untapped market, and one that needs and would benefit from a synthesis of datasets.

In a series of blog posts I plan to uncover the underside of big data in the rural economy through real applications: to unearth how data-led decision making can make a difference, and to apply disruptive analytics at the very heart of our rural economy.

My first article, which I plan to publish very soon, is about log prices, something that is seasonal and affects so many folks.

Disruptive analytics

We all think that making decisions comes fairly naturally to us, from what kind of coffee to order to what kind of personality we enjoy being around. However, decision making is something our brains are poorly adapted to. Bernoulli gave us the expectation: the sum, over everything that might happen, of the likelihood of each outcome multiplied by the value of that outcome. It is only by evaluating this expectation that we can make an informed decision. That is not something your brain can compute easily on the fly, or at all for most mortals.
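To make the expectation concrete: it is E[X] = p1 × v1 + p2 × v2 + ..., one term per possible outcome. Here is a small sketch of that sum applied to a made-up choice between selling at a known price today and waiting for an uncertain one. All prices and probabilities are invented for illustration and are not real market data.

```python
def expected_value(outcomes):
    """Sum probability * value over a list of (probability, value) pairs."""
    outcomes = list(outcomes)
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * v for p, v in outcomes)


# Hypothetical choice: a certain price today versus an uncertain one later.
sell_now = [(1.0, 50.0)]
wait = [(0.6, 60.0), (0.4, 30.0)]

print("sell now:", expected_value(sell_now))
print("wait:    ", expected_value(wait))
```

Under these invented numbers, waiting is worth about 48 against a certain 50, so selling now is the better bet even though the most likely outcome of waiting is a higher price. That is exactly the kind of result intuition tends to get wrong.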

Many companies leverage this to gain a competitive edge; some even simulate this very equation on a grand scale to price things and figure out their true value.
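As a rough sketch of what "simulating the equation" can mean, the snippet below estimates an expectation by Monte Carlo: draw many random outcomes from a model and average them. The price model is a toy assumption (a 60% chance of a price of 60, otherwise 30), chosen so that the exact answer, 0.6 × 60 + 0.4 × 30 = 48, is easy to check by hand. Real pricing models are far richer than this.

```python
import random


def simulate_expected_value(draw_value, trials=100_000, seed=0):
    """Estimate an expectation by averaging `trials` random draws.
    `draw_value` takes a random.Random instance and returns one outcome."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(trials):
        total += draw_value(rng)
    return total / trials


def draw_wait_price(rng):
    """Toy model (an assumption): price is 60 with probability 0.6, else 30."""
    return 60.0 if rng.random() < 0.6 else 30.0


estimate = simulate_expected_value(draw_wait_price)
print("simulated expectation:", round(estimate, 2))
```

The appeal of simulation is that `draw_wait_price` can be swapped for a model too messy to sum by hand, while the averaging step stays the same.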

So what does all this mean? How can this help? Well, for most it means nothing. For those engaged in data-led decision making, though, this simple insight, coupled with larger and larger datasets, can give just enough of an edge to perhaps make the difference.