HDInsight (Azure Hadoop) JSON Hive files – Environment setup

October 4, 2015


When I was asked to design an ecosystem based on the Microsoft Azure cloud, I understood that my challenge was to stop thinking IaaS and start doing PaaS.

Of course, my first question was "why?", and my second question (after a little thinking) was "why not?"

  • I need minimal DevOps support.
  • I need minimal tech support for my system.
  • I need to pay only for what I’m using.
  • I can scale when I need.
  • And … (I believe you can raise one or more things of your own).

The question about using PaaS is not only "Is it ready?"; it's also "Are you (as a software architect) ready?" Ready to design the software, and less the infrastructure and services that the software needs?

Let's get back to the main topic of this post: in simple words, querying Hadoop via Hive over JSON files. In Azure cloud services, that means HDInsight Hive over JSON files.

I believe that those of you who know Hadoop with Hive/Pig/Spark, etc., just need the machines and the root password, and all the rest is history. But how do you do that on PaaS with Azure? More than that, we are talking about files in JSON format rather than CSV, when Hive by default cannot parse JSON files (not to mention hierarchical JSON data).

Where to start?

Fortunately, I was not alone here, and I found some very good posts on how to establish the environment. One of them is "How to use a Custom JSON SerDe with Microsoft Azure HDInsight".

Support Custom JSON Serialization

Let's start from the end: the non-obvious part of this requirement, for any Hive installation, is using JSON files. This is where the SerDe functionality comes in.

SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization and also interpreting the results of serialization as individual fields for processing.
A SerDe allows Hive to read in data from a table, and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.

You can find a very good JSON SerDe for Hive here: Hive-JSON-Serde. The magic happens in the "JsonSerDe.java" file, in the "serializeField" method.

Object serializeField(Object obj, ObjectInspector oi ){

        if(obj == null) {return null;}

        Object result = null;
        switch(oi.getCategory()) {
            case PRIMITIVE:
                PrimitiveObjectInspector poi = (PrimitiveObjectInspector)oi;
                switch(poi.getPrimitiveCategory()) {
                    case VOID:
                        result = null;
                        break;
                    case BOOLEAN:
                        result = (((BooleanObjectInspector)poi).get(obj)?
                                            Boolean.TRUE:
                                            Boolean.FALSE);
                        break;
                    case BYTE:
                        result = (((ByteObjectInspector)poi).get(obj));
                        break;
                    case DOUBLE:
                        result = (((DoubleObjectInspector)poi).get(obj));
                        break;
                    case FLOAT:
                        result = (((FloatObjectInspector)poi).get(obj));
                        break;
                    case INT:
                        result = (((IntObjectInspector)poi).get(obj));
                        break;
                    case LONG:
                        result = (((LongObjectInspector)poi).get(obj));
                        break;
                    case SHORT:
                        result = (((ShortObjectInspector)poi).get(obj));
                        break;
                    case STRING:
                        result = (((StringObjectInspector)poi).getPrimitiveJavaObject(obj));
                        break;
                    case UNKNOWN:
                        throw new RuntimeException("Unknown primitive");
                }
                break;
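            // The non-primitive categories below (MAP, LIST, STRUCT, UNION)
            // are what make hierarchical JSON data structures possible.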
            case MAP:
                result = serializeMap(obj, (MapObjectInspector) oi);
                break;
            case LIST:
                result = serializeList(obj, (ListObjectInspector)oi);
                break;
            case STRUCT:
                result = serializeStruct(obj, (StructObjectInspector)oi, null);
                break;
            case UNION:
                result = serializeUnion(obj, (UnionObjectInspector)oi);
        }
        return result;
    }

The MAP, LIST, STRUCT, and UNION cases give the ability to use a hierarchical JSON data structure, with STRUCT being the most flexible. In addition, you can change this file to support any other wildcards / custom collections.
The PRIMITIVE cases are the base field types that Hive supports.

Continuing from here, you can use any Java-supporting IDE to build those JARs, or use Maven for it (download it from here). Alternatively, use the files I already built following the Git repository's definitions.
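
If you take the Maven route, a minimal build sketch (assuming git and Maven are on your PATH, and that the repository URL below is the Hive-JSON-Serde project linked above):

# Clone and build the Hive-JSON-Serde project
git clone https://github.com/rcongiu/Hive-JSON-Serde.git
cd Hive-JSON-Serde
mvn package
# The built JAR(s) should land under a target folder; keep them handy for
# the storage upload step below.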

Set up the (Hadoop) HDInsight system

I know that Microsoft supplies a very cool and nice UI to manage all your Azure components; still, Azure PowerShell is more powerful and flexible to use (see How to install and configure Azure PowerShell for more info).

First, let's prepare the Azure system (a PowerShell sketch follows the list):

  • Create a storage account named demohdpmainstorage, with a container named install.
  • Create a storage account named demohdplibstorage, with a container named libs.
  • Copy the SerDe files into the demohdplibstorage/libs container.
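
Here is a minimal sketch of that preparation using the classic Azure PowerShell cmdlets (the same module the cluster script below uses). Storage account names must be globally unique, and the local JAR paths are examples only:

# Log in and pick the subscription
Add-AzureAccount
Select-AzureSubscription "<your subscription name>"

# Create the two storage accounts in your target data center
New-AzureStorageAccount -StorageAccountName "demohdpmainstorage" -Location "<your data center>"
New-AzureStorageAccount -StorageAccountName "demohdplibstorage" -Location "<your data center>"

# Create the containers
$mainKey = (Get-AzureStorageKey "demohdpmainstorage").Primary
$mainCtx = New-AzureStorageContext -StorageAccountName "demohdpmainstorage" -StorageAccountKey $mainKey
New-AzureStorageContainer -Name "install" -Context $mainCtx

$libKey = (Get-AzureStorageKey "demohdplibstorage").Primary
$libCtx = New-AzureStorageContext -StorageAccountName "demohdplibstorage" -StorageAccountKey $libKey
New-AzureStorageContainer -Name "libs" -Context $libCtx

# Upload the two SerDe JARs (local file names are hypothetical)
Set-AzureStorageBlobContent -File ".\json-serde-1.3.1-SNAPSHOT.jar" -Container "libs" -Context $libCtx
Set-AzureStorageBlobContent -File ".\json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar" -Container "libs" -Context $libCtx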


Below I supply the PowerShell code to create the HDInsight cluster:

  • Cluster name: demohdphive ($HDInsightClusterName).
  • Cluster main storage account: demohdpmainstorage, with container install ($PrimaryStorageAccount and $PrimaryStorageContainer).
  • Storage account for the JAR(s) deployment: demohdplibstorage, with container libs ($HiveLibStorageAccount and $HiveLibStorageContainer).
  • The cluster size will be 4 data nodes ($HDIClusterSize); it can be changed later, after the cluster is established.
##################### Begin Edits ####################
#region edits
param
(
    # NOTE: All the storage accounts and containers need to be created on the same data center as the HDInsight cluster and would need to be created prior to running the script
    # They can be created from the Azure Management Portal

    # This is the name of your Azure Subscription that will be used for provisioning Azure HDInsight
    [string]$PrimarySubscriptionName="<place here>",

    # This is the primary storage account that needs to be created on the same data center as your HDInsight Cluster
    # This needs to be pre-provisioned prior to running the script
    [string]$PrimaryStorageAccount="demohdpmainstorage",  #change it to your name

    # This is the name of the container in the primary storage account. This needs to be pre-provisioned prior to running this script.
    [string]$PrimaryStorageContainer="install",  #change it to your name

    # This is the additional storage account that is used to store the external JAR files for customizing the HDInsight Cluster. Needs to be pre-provisioned prior to running the script.
    [string]$HiveLibStorageAccount="demohdplibstorage",  #change it to your name

    # This is the name of the container in the additional storage account. Again, needs to be created prior to running the script.
    [string]$HiveLibStorageContainer="libs",  #change it to your name

    # This is the data center where the HDInsight cluster and storage accounts are provisioned
    [string]$HDInsightClusterLocation="<place here>",

    #This is the name of the HDInsight cluster
    [string]$HDInsightClusterName="demohdphive",  #change it to your name

    #This is the size of the cluster (# of data nodes)
    [int]$HDIClusterSize = 4,

    #This is the version of the HDInsight cluster.
    # Please refer to this page for more info https://azure.microsoft.com/en-us/documentation/articles/hdinsight-release-notes/
    [string]$HDInsightClusterVersion="3.2",

    # HDInsight cluster admin user name and password need to be specified here. This will be used for subsequent admin logins to the cluster for job submissions.
    [string]$MyHDInsightUserName = "root",
    [string]$MyHDInsightPwd = "Pass@Word2"
)
#endregion edits
#################### End Edits ####################### 

#region code

# Credentials
Add-AzureAccount
Select-AzureSubscription "$PrimarySubscriptionName"

$HdInsightPwd = ConvertTo-SecureString $MyHDInsightPwd -AsPlainText -Force
$creds= New-Object System.Management.Automation.PSCredential ($MyHDInsightUserName, $HdInsightPwd)

$PrimarySubscriptionID = (Get-AzureSubscription $PrimarySubscriptionName).SubscriptionId
$Key1 = Get-AzureStorageKey $PrimaryStorageAccount | %{ $_.Primary }
$HiveLibStorageAccountKey = Get-AzureStorageKey $HiveLibStorageAccount | %{ $_.Primary }

#This will be the credential used for the admin account of the HDInsight Cluster
#$creds = Get-Credential -Message "Please enter the admin account credentials for your HDInsight Cluster"

# Set Custom Configuration
$configvalues = new-object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightHiveConfiguration'

$configvalues.Configuration = @{ "hive.exec.compress.output"="true" }
$configvalues.AdditionalLibraries = new-object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightDefaultStorageAccount'
$configvalues.AdditionalLibraries.StorageAccountName = "$HiveLibStorageAccount.blob.core.windows.net"
$configvalues.AdditionalLibraries.StorageAccountKey = $HiveLibStorageAccountKey
$configvalues.AdditionalLibraries.StorageContainerName = $HiveLibStorageContainer

$SubId = (Get-AzureSubscription -Current).SubscriptionId

# Create Azure HDInsight Cluster

New-AzureHDInsightClusterConfig -ClusterSizeInNodes $HDIClusterSize -ClusterType Hadoop `
    | Set-AzureHDInsightDefaultStorage -StorageAccountName "$PrimaryStorageAccount.blob.core.windows.net" -StorageAccountKey $Key1 -StorageContainerName "$PrimaryStorageContainer" `
    | Add-AzureHDInsightConfigValues -Hive $configvalues `
    | New-AzureHDInsightCluster -Subscription $SubId -Credential $Creds -Name $HDInsightClusterName  -Location $HDInsightClusterLocation -Version $HDInsightClusterVersion

#endregion code

The script is available at this link.

Run the script via the Microsoft Azure PowerShell application, using the following command:
{Full path to the script folder}>.\hdinsightCluster.ps1
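
For example, assuming the script was saved as hdinsightCluster.ps1 and you only need to override the two "<place here>" parameters (the values below are placeholders):

PS C:\Scripts> .\hdinsightCluster.ps1 -PrimarySubscriptionName "My Subscription" -HDInsightClusterLocation "West Europe"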

It will take some time, ~10 minutes. You can monitor it via the Azure management portal.

After the cluster is finally created, you will get the result output in the PowerShell window.
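
You can also verify the cluster from PowerShell itself; a small sketch using the same classic HDInsight cmdlets as the script:

# Should report the cluster details with State "Running" once provisioning completes
Get-AzureHDInsightCluster -Name "demohdphive"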


Explore the HDInsight system

Remote Login

Go into the Azure portal, https://portal.azure.com, to your HDInsight cluster (in this post, demohdphive) and press the "Remote Desktop" image/link. Configure the remote login user and password; the expiration date is also mandatory.

After pressing the "Enable" link, the cloud will update the machine and the "Connect" link will become enabled. Press it and download the .rdp file to connect to the machine.

Browsing the machine

As you can see, it's a Windows Server 2012 R2 machine. Browse into the "C:\apps\dist" folder and you can see the base components that define this HDInsight cluster: the Java runtime it uses, the Hadoop version, the Hive version, the Pig version, logs, and more.

In the end, it's pretty similar to the standard installation. Yes, there is the Microsoft wrapping, but the shell files are the same, and the libs and JARs are as you know them. So, the manipulations you did on your Linux-based Hadoop can also be done here.


Under the "C:\apps\temp\HDInsightResources\Hive" folder, you can find the SerDe JARs that we added for Hive to use.
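
A quick way to confirm they are there from a PowerShell prompt on the head node:

Get-ChildItem "C:\apps\temp\HDInsightResources\Hive" -Filter "*.jar"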


Under "C:\apps\dist\hive-{x.xx.x.x.x.x.x-xxxx}\conf" you can find a file named hive-site.xml, which will contain the following entry:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///c:/apps/temp/hdinsightresources/hive/json-serde-1.3.1-snapshot-jar-with-dependencies.jar,file:///c:/apps/temp/hdinsightresources/hive/json-serde-1.3.1-snapshot.jar</value>
  <description />
</property>

*** The extra JARs are now loaded onto the machine and referenced from there, so there is no use for the "demohdplibstorage" storage account any more, and it can be deleted!
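
For example, with the classic cmdlets (make sure nothing else lives in that account first):

# Deletes the now-unused library storage account
Remove-AzureStorageAccount -StorageAccountName "demohdplibstorage"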

Actually, we could have avoided creating it in the first place: just copy those JARs to the machine, reference them via the hive-site.xml file, and restart the Hive service (using the scripts in the next paragraph).

Under the "C:\apps\dist\hive-{x.xx.x.x.x.x.x-xxxx}\bin" folder, you can find the two important scripts that manage the Hive stop/start functionality (a restart sketch follows the list):

  1. start_daemons.cmd
  2. stop_daemons.cmd
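
A minimal restart sketch from a PowerShell prompt on the head node. The versioned hive-* folder name varies, so it is discovered rather than hard-coded, and the source path of the JARs is hypothetical:

# Resolve the versioned Hive home folder without hard-coding the version
$HiveHome = (Get-ChildItem "C:\apps\dist" -Directory -Filter "hive-*")[0].FullName

# Copy the SerDe JARs into place (source path is an example), after editing
# hive.aux.jars.path in hive-site.xml to reference them, then bounce Hive
Copy-Item "C:\install\*.jar" "C:\apps\temp\HDInsightResources\Hive\"
& "$HiveHome\bin\stop_daemons.cmd"
& "$HiveHome\bin\start_daemons.cmd"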

Under the “C:\apps\dist\hive-{x.xx.x.x.x.x.x-xxxx}\logs” folder you can find the “hive.log” file.
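
To follow it live while Hive starts up, for example:

$HiveHome = (Get-ChildItem "C:\apps\dist" -Directory -Filter "hive-*")[0].FullName
Get-Content "$HiveHome\logs\hive.log" -Tail 50 -Wait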

*** Notice that in the log file, the only SerDe that is loaded is the default Hive SerDe; there is nothing about the custom one we added. (We'll talk about that later.)

A short summary

We are talking here about one PowerShell script, two JARs, and a Blob storage account, and we get a fully managed PaaS Hive/Hadoop on the Azure cloud. Cool!

In the next post, we will see how to configure the JSON tables, use files located in Blob storage (without loading them via Hive), and work with the built-in Microsoft Azure HDInsight Query Console. See HDInsight (Azure Hadoop) JSON Hive files – Tables setup.
