HDInsight (Azure Hadoop) JSON Hive files – Tables setup

October 5, 2015

On my previous post HDInsight (Azure Hadoop) JSON Hive files – Environment setup, we focused on the Environment when it’s one side of the solution when the other side is use it with the Echo-System.

The basic Echo-System design

Echo System Design

The main idea is that we might have several endpoints (component types) which can write different JSON files to a Blob Storage, when the HDInsight will point to this B-Storage to query it by the relevant tables by the relevant files. (the design is not limited to JSON format files, but to all formats).

How to query the HDInsight – Hive Cluster

The good news are, that we don’t need to login into the machine and run hive.cmd console application (the file located on C:\apps\dist\hive-{x.xx.x.x.x.x.x-xxxx}\bin\hive.cmd).

Microsoft Azure HDInsight Query Console

Azure supply very easy and approachable way to query the HDInsight cluster (and even more old school Hadoop UI). On the Azure Management Portal, navigate to your HDInsight cluster and press on the “Dashboard” link.

hdinsightDashboard

This will open the Microsoft Azure HDInsight Query Console website. you will need to login with the credentials supplied when creating the HDInsight cluster.

Microsoft Azure HDInsight Query Console

I’m not going to focus on the Hadoop & Yarn UI’s, as you already know it’s come with the basic Hadoop installation. in few words;

  • The Hadoop UI – Supply data about the Hadoop nodes, statistics about the way the work, maintenance and basic utilities.
  • The Yarn UI – Supply data about the actual jobs.
    The File Browser – As it say browsing the files on the cluster.
  • Hive Editor – Give us the ability to run any query we wants (Select, Create, Delete). on the end of the console you can find the “Job Session”. When pressing on it you will be navigated to the “Job History”.
  • Job History panel which provide the data about the Query, Status, Job log and …

Define the Hive table

For the example I will use semi-complex JSON input file. The file will contain 2 entries of Contact Information.

The file contains the entries one after the other and not as JSON Array! (lines 11-12)

{"id":123456,"name": "Bill","surname": "Taker","birthday": "1989-11-30 18:25:43.511","address":{"city":"NYC","state": "NY","country": "USA"}}
{"id":234243,"name": "Jack","surname": "Taker","birthday": "1991-11-12 23:00:13.457","address":{"city":"Atlantic City","state": "NJ","country": "USA"}}

***Notice, DO NOT USE WHITE SPACES OR BREAK LINES – EACH ENTRY IN SINGLE LINE!

Download contact_info_1.json file from here.

Define the external Blob container

I’ll use the HDInsight Blob Storage “demohdpmainstorage“, but I’ll create an additional container (aside to the “install” container) named contacts.

demohdpmainstorage hfds files

Place the JSON file in the contacts folder.

Define (Create) the folder on the Hive side

Because that we are using EXTERNAL Tables and not physical ones, we must;

  • Mark the table as an External one.
  • Point to the blob location using wasb:// url format.

The Table definition will include 3 parts:

  1. The input schema.
  2. The used JSON SerDe definition.
  3. The point to the files location.
DROP TABLE contacts;
 CREATE EXTERNAL TABLE contacts (
 id bigint,
 name string,
 surname string,
 birthday timestamp,
 address struct)
 ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
 LOCATION 'wasb://hdfsfiles@demohdpmainstorage.blob.core.windows.net/contacts';

The Location point to the ‘contacts‘ folder inside the ‘hdfsfiles‘ container, the Hive will use all the files inside this folder.
***Notice, the first file that it will failed to parse with the given SerDe will stop the Hive work and it will retrieve partial result until the problematic file. The Job’s log file will include the error.

Run the command via the HDInsight Query Console

We’ll use the HDInsight Query Console to run the “Create Table” command.

Run Create Command

On the ‘Job Session’ press the ‘View Details” link and see the result of this job (it will take a few seconds).

Run Create Command Result

We can see that the Job ended with status ‘Completed’, it’s very good.
We can view the log for more info, and us expected nothing to show (still it is a create command).

Run some queries

Query #1

When running the next command command.

select * from contacts;

We will get the next answer:

123456 	Bill 	Taker 	1989-11-30 18:25:43.511 <strong>	{"city":"NYC","state":"NY","country":"USA"}</strong>
234243	 Jack	 Taker	 1991-11-12 23:00:13.457<strong>	 {"city":"Atlantic City","state":"NJ","country":"USA"}</strong>

See that the ‘address’ filed it’s a JSON result.

Query #2

When running the next command command. (break down the address on the query structure)

select id, name, surname, birthday, address.city, address.state, address.country from contacts;

We will get the next answer:

123456 	Bill 	Taker 	1989-11-30 18:25:43.511	 NYC 	NY 	USA
234243 	Jack 	Taker 	1991-11-12 23:00:13.457 	Atlantic City 	NJ 	USA

Query #3

When running the next command command. Focus on the YEAR.

select id, name, surname, YEAR(birthday), address.city, address.state, address.country from contacts;

We will get the next answer:

123456 	Bill 	Taker 	1989 	NYC 	NY 	USA
234243 	Jack 	Taker 	1991 	Atlantic City 	NJ 	USA

Query #4

When running the next command command. Focus on WHERE statement.

select id, name, YEAR(birthday), address.city from contacts where surname='Taker' and address.state='NJ';

We will get the next answer:

234243	Jack	1991	Atlantic City

A short summary

We did SQL on HDInsight (Hadoop) Hive, when the sources (files) written to an external location on the Blob storage. This will give us the ability for flexibility and scaling in the future.

In the next Post we will see how to build Cubes based the HDInsight Hive with Microsoft Tabular.

Add comment
facebook linkedin twitter email

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

*