Big Data, Big Compute, Big Fun!

  • Hadoop deployment options in Microsoft Azure - architecture

    The last part of this series of articles describing the most common options for hosting Hadoop clusters on the Microsoft Azure platform focuses mostly on differences in deployment architecture. Interestingly enough, all three described options (HDI, HDP and CDH) use a slightly different approach to achieve basically the same goal: providing redundant, highly available infrastructure. Let's have a closer look:

    Azure HDInsight (HDI)

    Azure HDInsight offers managed Hadoop services running on top of either Windows Server 2012 R2 or Linux (Ubuntu 12.04) virtual machines. Users are free to define the size and number of VMs when provisioning the cluster, as well as to add and remove nodes on the fly.

    The underlying HDInsight infrastructure is fully redundant and consists of 2 head nodes, 2 secure gateway nodes (A2 instances, exposing the HDI management API), 3 ZooKeeper nodes (A1 instances) and a number of user-defined data nodes.
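
    To make the node layout concrete, here is a minimal sketch that adds up the fixed infrastructure nodes listed above and a user-chosen number of data nodes. The `HdiCluster` class is purely hypothetical (it is not part of any Azure SDK), and the "at least one data node" check is my own assumption.

```python
from dataclasses import dataclass

# Fixed infrastructure roles, as listed in the paragraph above.
HEAD_NODES = 2
GATEWAY_NODES = 2    # A2 instances exposing the HDI management API
ZOOKEEPER_NODES = 3  # A1 instances


@dataclass(frozen=True)
class HdiCluster:
    """Hypothetical model of an HDInsight cluster layout (not an Azure API)."""

    data_nodes: int  # chosen by the user at provisioning time

    def __post_init__(self):
        # Assumption for this sketch: a usable cluster has at least one data node.
        if self.data_nodes < 1:
            raise ValueError("a cluster needs at least one data node")

    @property
    def infrastructure_nodes(self) -> int:
        """Head, gateway and ZooKeeper nodes that exist regardless of cluster size."""
        return HEAD_NODES + GATEWAY_NODES + ZOOKEEPER_NODES

    @property
    def total_nodes(self) -> int:
        return self.infrastructure_nodes + self.data_nodes


cluster = HdiCluster(data_nodes=4)
print(cluster.infrastructure_nodes, cluster.total_nodes)
```

    Whatever the number of data nodes, the seven infrastructure nodes stay constant, which is why scaling out only touches the data-node count.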

    HDI does not rely on its own HDFS file system, as it is tightly integrated with Azure Storage (WASB):

    Source: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-high-availability/
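
    In practice, the WASB integration means that jobs address data through `wasb://` URIs instead of `hdfs://` paths. The helper below is a small sketch of how such a URI is put together; the function name and the container, account and path values are made up for illustration.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = False) -> str:
    """Build a WASB URI of the form wasb[s]://<container>@<account>.blob.core.windows.net/<path>."""
    scheme = "wasbs" if secure else "wasb"  # wasbs is the TLS-protected variant
    clean_path = path.lstrip("/")           # avoid a double slash after the host
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{clean_path}"


# Hypothetical container, storage account and path names.
print(wasb_uri("mycontainer", "mystorageaccount", "/example/data/sample.log"))
print(wasb_uri("mycontainer", "mystorageaccount", "example/data", secure=True))
```

    Because the data lives in the storage account rather than on the cluster's disks, it stays available after the cluster itself is deleted.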

    All communication with an HDI cluster is routed through the secure gateway, which exposes the following services:

    | Service            | Default configuration |
    |--------------------|-----------------------|
    | ODBC               | Enabled               |
    | JDBC               | Enabled               |
    | Ambari             | Enabled               |
    | Oozie              | Enabled               |
    | Templeton          | Enabled               |
    | RDP (Windows only) | Disabled              |
    | SSH (Linux only)   | Enabled               |
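
    The defaults listed above can also be written down as data, which makes it easy to answer "what can I reach on a freshly provisioned cluster?" for a given operating system. This is only an illustrative sketch: the dictionary layout and the `enabled_services` function are my own, not an HDInsight API.

```python
from typing import List

# Gateway services transcribed from the list above.
# Each entry: service name -> (enabled by default, OS restriction or None).
GATEWAY_SERVICES = {
    "ODBC": (True, None),
    "JDBC": (True, None),
    "Ambari": (True, None),
    "Oozie": (True, None),
    "Templeton": (True, None),
    "RDP": (False, "windows"),
    "SSH": (True, "linux"),
}


def enabled_services(os_name: str) -> List[str]:
    """Return the services enabled by default on a cluster running the given OS."""
    os_name = os_name.lower()
    return [
        name
        for name, (enabled, only_on) in GATEWAY_SERVICES.items()
        if enabled and only_on in (None, os_name)
    ]


print(enabled_services("Linux"))
print(enabled_services("Windows"))
```

    Note that a Windows cluster has no remote shell access at all until RDP is explicitly enabled.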

     

    Cloudera CDH Template

    The currently available CDH template is based on fixed-size VMs (D14 instances) and uses local, temporary SSD storage as the HDFS backbone. A typical cluster configuration consists of 1 or 2 head nodes, 1 management node and between 3 and 20 data nodes. Using SSD disks improves overall cluster performance, but it also constrains capacity (up to 800 GB per node) and requires the cluster to be running all the time, since data on temporary disks does not survive deallocation of the VMs. A sample configuration of a CDH cluster looks as follows:

    Cloudera runs on top of the CentOS 6.5 Linux distribution and provisions a single storage account that contains the OS drives for all VMs in the cluster. The CDH template also provides a dedicated virtual machine, the management node, which hosts Cloudera Manager, a proprietary all-in-one Hadoop management tool. All virtual machines are kept in separate cloud services and are connected to the same virtual network. Here's the list of all endpoints and services opened up by default:

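    | Virtual Machine | Service                | Port  |
    |-----------------|------------------------|-------|
    | Management Node | SSH                    | 22    |
    | Management Node | Management Web UI      | 7180  |
    | Management Node | Navigator              | 7187  |
    | Head Node       | HDFS Web UI            | 50070 |
    | Head Node       | Hive Metastore         | 9083  |
    | Head Node       | HiveServer2            | 10000 |
    | Head Node       | Hue                    | 8888  |
    | Head Node       | Oozie Server           | 11000 |
    | Head Node       | YARN JobHistory Server | 19888 |
    | Head Node       | YARN Resource Manager  | 8088  |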
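
    Coming back to the capacity constraint of the CDH template (up to 800 GB of local SSD per node, 3 to 20 data nodes), a quick back-of-the-envelope calculation shows how much HDFS space such a cluster can offer. The sketch assumes the HDFS default replication factor of 3, which the post does not state, and ignores space reserved for non-HDFS use.

```python
SSD_GB_PER_NODE = 800                   # per-node limit quoted in the post
MIN_DATA_NODES, MAX_DATA_NODES = 3, 20  # range supported by the template
HDFS_REPLICATION = 3                    # assumption: HDFS default replication factor


def hdfs_capacity_gb(data_nodes: int, replication: int = HDFS_REPLICATION):
    """Return (raw, usable) HDFS capacity in GB for a CDH cluster on D14 nodes."""
    if not MIN_DATA_NODES <= data_nodes <= MAX_DATA_NODES:
        raise ValueError(
            f"the template supports {MIN_DATA_NODES} to {MAX_DATA_NODES} data nodes"
        )
    raw = data_nodes * SSD_GB_PER_NODE
    usable = raw / replication  # every block is stored `replication` times
    return raw, usable


for nodes in (MIN_DATA_NODES, MAX_DATA_NODES):
    raw, usable = hdfs_capacity_gb(nodes)
    print(f"{nodes} data nodes: {raw} GB raw, about {usable:.0f} GB usable")
```

    Under these assumptions even the largest cluster tops out at roughly 5 TB of replicated data, which is worth keeping in mind when comparing it with options backed by Azure Storage.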

    Hortonworks HDP Template

    Azure Marketplace offers two flavors of Hortonworks Hadoop cluster templates: Hortonworks Data Platform and the single-VM Hortonworks Sandbox. Furthermore, HDP can be deployed either as an evaluation setup (fixed-size, well suited to testing a distributed environment) or as a standard setup (offering more cluster customization options). The default configuration recommends A7 instances for both master and worker virtual machines. As with CDH, described in the previous section, all VMs are placed in separate cloud services and connected to the same virtual network. This time, though, the cluster uses two separate storage accounts: one shared across the master nodes, and another used as the underlying layer for HDFS, spread across all data nodes:

    As HDP is managed mostly through Ambari, Ambari is the only major service exposed by default:

    | Virtual Machine | Service       | Port |
    |-----------------|---------------|------|
    | Master Node 1   | SSH           | 22   |
    | Master Node 1   | Ambari Server | 8443 |
    | Master Node 2   | SSH           | 22   |
    | Worker Node     | SSH           | 22   |

    This concludes my technical comparison of Azure-hosted Hadoop alternatives (see also general overview and detailed description of components). As with all cloud services, be warned that this is "just" a snapshot of current capabilities. Things are moving fast, and in the coming months new features will most likely be announced by both Microsoft and third-party vendors, such as native integration with Azure Storage (which is now officially part of Hadoop 2.7.0) or the recently announced Azure Data Lake.


  • Hadoop deployment options in Microsoft Azure - components

    In the second part of my comparison of different flavors of Azure-hosted Hadoop environments, it is time to present a detailed list of all available, preinstalled components. Bear in mind that the list covers only what is available out of the box; you can always customize your cluster either manually or by using Script Actions:

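    | Component           | HDI 3.2 | CDH 5.3.1 (Preview) | HDP 2.1    |
    |---------------------|---------|---------------------|------------|
    | Accumulo            | -       | -                   | 1.5.1      |
    | Ambari              | -       | -                   | 1.6        |
    | Cloudera Hue        | -       | 3.7.0               | -          |
    | Cloudera Impala     | -       | 2.1                 | -          |
    | Cloudera Search     | -       | 1.2.0               | -          |
    | Falcon              | -       | -                   | 0.5.0.2.1  |
    | Flume               | -       | 1.5.0               | 1.4.0      |
    | Ganglia             | -       | -                   | 3.5.0      |
    | HBase               | 0.98.4  | 0.98.1              | 0.98.0.2.1 |
    | HDFS                | 2.6.0   | 2.5.0               | 2.4.0.2.1  |
    | Hive                | 0.14    | 0.13.0              | 0.13.0.2.1 |
    | Knox                | -       | -                   | 0.4.0      |
    | Mahout              | 0.9     | 0.9                 | 0.9        |
    | Nagios              | -       | -                   | 3.5.0      |
    | Oozie               | 4.1.0   | 4.0.0               | 4.0.0.2.1  |
    | Phoenix             | 4.2.0   | -                   | 4.2.0      |
    | Pig                 | 0.14    | 0.12                | 0.12.1.2.1 |
    | Sentry              | -       | 1.3.0               | -          |
    | Solr                | -       | 4.4.0               | 4.7.2      |
    | Spark               | -       | 1.2                 | -          |
    | Sqoop               | 1.4.5   | 1.4.4               | 1.4.4.2.1  |
    | Storm               | 0.9.3   | -                   | 0.9.1.2.1  |
    | Tez                 | 0.5.2   | -                   | 0.4.0.2.1  |
    | Yarn + Map Reduce 2 | 2.6.0   | 2.5.0               | 2.4.0.2.1  |
    | ZooKeeper           | 3.4.6   | 3.4.5               | 3.4.5.2.1  |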
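
    A list this long is easier to reason about when queried. The sketch below transcribes the component versions listed above into a dictionary (with `None` standing for "-") and answers two questions: which components ship with all three offerings, and which are exclusive to one of them. The function names are my own.

```python
from typing import Dict, List, Optional, Tuple

DISTROS = ("HDI 3.2", "CDH 5.3.1", "HDP 2.1")

# Versions transcribed from the list above; None means "-" (not preinstalled).
COMPONENTS: Dict[str, Tuple[Optional[str], Optional[str], Optional[str]]] = {
    "Accumulo": (None, None, "1.5.1"),
    "Ambari": (None, None, "1.6"),
    "Cloudera Hue": (None, "3.7.0", None),
    "Cloudera Impala": (None, "2.1", None),
    "Cloudera Search": (None, "1.2.0", None),
    "Falcon": (None, None, "0.5.0.2.1"),
    "Flume": (None, "1.5.0", "1.4.0"),
    "Ganglia": (None, None, "3.5.0"),
    "HBase": ("0.98.4", "0.98.1", "0.98.0.2.1"),
    "HDFS": ("2.6.0", "2.5.0", "2.4.0.2.1"),
    "Hive": ("0.14", "0.13.0", "0.13.0.2.1"),
    "Knox": (None, None, "0.4.0"),
    "Mahout": ("0.9", "0.9", "0.9"),
    "Nagios": (None, None, "3.5.0"),
    "Oozie": ("4.1.0", "4.0.0", "4.0.0.2.1"),
    "Phoenix": ("4.2.0", None, "4.2.0"),
    "Pig": ("0.14", "0.12", "0.12.1.2.1"),
    "Sentry": (None, "1.3.0", None),
    "Solr": (None, "4.4.0", "4.7.2"),
    "Spark": (None, "1.2", None),
    "Sqoop": ("1.4.5", "1.4.4", "1.4.4.2.1"),
    "Storm": ("0.9.3", None, "0.9.1.2.1"),
    "Tez": ("0.5.2", None, "0.4.0.2.1"),
    "Yarn + Map Reduce 2": ("2.6.0", "2.5.0", "2.4.0.2.1"),
    "ZooKeeper": ("3.4.6", "3.4.5", "3.4.5.2.1"),
}


def in_all_distros() -> List[str]:
    """Components preinstalled in every one of the three offerings."""
    return [name for name, versions in COMPONENTS.items() if all(versions)]


def only_in(distro: str) -> List[str]:
    """Components preinstalled in the given offering and in no other."""
    idx = DISTROS.index(distro)
    return [
        name
        for name, versions in COMPONENTS.items()
        if versions[idx] and sum(v is not None for v in versions) == 1
    ]


print("Everywhere:", in_all_distros())
for distro in DISTROS:
    print(f"Only in {distro}:", only_in(distro))
```

    Run against this data, the query shows that every component preinstalled in HDI 3.2 is also available in at least one of the other two offerings, while CDH and HDP each bring their own exclusive tools.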

     

    The last part of the series will cover differences in technical architecture (to be posted soon!).


  • Hadoop deployment options in Microsoft Azure - general overview

    With the recently announced changes to the HDInsight service, I decided it was about time to compare and contrast the currently available Azure-based Hadoop hosting alternatives. I took a closer look at the newest release of HDInsight (3.2) and two Hadoop distributions available in Azure Marketplace: Cloudera's CDH 5.3.1 (still in preview) and Hortonworks Data Platform 2.1. Of course, customers are still free to take a pure IaaS approach and bring their own cluster configuration to the public cloud as well. In this first part, you'll find a short tabular summary of the most important differences. The second part will give you a better understanding of the available components and libraries, followed by a look at internal cluster architecture distinctions.

    | Feature | HDI 3.2 | CDH 5.3.1 (Preview) | HDP 2.1 |
    |---|---|---|---|
    | Infrastructure | | | |
    | Operating system | Windows Server, Linux | Linux | Linux |
    | Configuration flexibility | Broad choice between A3-A9 and D3-D14 instances | Requires at least 64 cores to provision (1 head + 3 data nodes based on D14 instances); can be modified once deployment is ready | Two configurations available: Evaluation (smaller, 5 x A3) and Standard (customizable, 9 x A7 by default) |
    | Scale-up | Yes, requires cluster reprovisioning | D14 VMs only | Yes |
    | Scale-out | Yes, adding or removing data nodes is supported by the API | Yes, requires manual provisioning and configuration | Yes, requires manual provisioning and configuration |
    | Features | | | |
    | Manageability | Azure portal, Query Console (Windows) or Ambari (Linux), PowerShell, Azure CLI | Azure portal, Cloudera Manager | Azure portal, Ambari |
    | Remote access | RDP (Windows), SSH (Linux) | SSH | SSH |
    | Unique value | Tight integration with Azure components (Storage, Data Factory, Machine Learning, DocumentDB) | Hue, Impala, Navigator | Ambari, Falcon |
    | Additional security | Custom hardening specific to Azure | HDFS encryption (optional) | No |
    | Authentication | Username and password | LDAP/Active Directory, Kerberos, SAML | Kerberos |
    | Storage | Integrated with Azure Storage by default | Requires HDFS. The HDFS cluster is placed on top of temporary SSD volumes attached to the virtual machines. | Supports both HDFS and Azure Storage. Uses many data disks (8 attached to each machine). |
    | Customization | ScriptAction | Cloudera packages | Manual |
    | Updates | Automatic patches, user controls major upgrades | Manual | Manual |
    | Licensing & support | | | |
    | Licensing | Included | Not included | Not included |
    | Pricing | Charged for master and data nodes; ZooKeeper and secure gateway nodes are free of charge | Charged per VM instance, requires constantly running HDFS infrastructure | Charged per VM instance, requires constantly running HDFS infrastructure |
    | Support | 1st party (Microsoft) | Mixed (Microsoft + Cloudera) | Mixed (Microsoft + Hortonworks) |
    | SLA | 99.9% (including full Hadoop stack) | 99.95% for IaaS only | 99.95% for IaaS only |
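
    The SLA figures are easier to compare when translated into allowed downtime. The following sketch does the arithmetic for a 30-day month; the month length and the function name are my own choices, and the percentages are the ones quoted above.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime per period that still satisfy the given SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


for label, sla in (
    ("HDI (full Hadoop stack)", 99.9),
    ("CDH / HDP (IaaS only)", 99.95),
):
    minutes = allowed_downtime_minutes(sla)
    print(f"{label}: {sla}% allows about {minutes:.1f} minutes per 30 days")
```

    The IaaS figure allows roughly half the downtime of the HDI figure, but it covers only the virtual machines, whereas the HDI SLA covers the Hadoop services running on top of them, so the two numbers are not directly comparable.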

     

