HDInsight Essentials - Second Edition
About this ebook
- Learn how to quickly provision a Hadoop cluster using Windows Azure Cloud Services
- Build an end-to-end application for a big data problem using open source software
- Discover more about modern data architecture with this guide, which helps you understand the transition from a legacy relational Enterprise Data Warehouse
If you want to discover one of the latest tools designed to produce stunning Big Data insights, this book features everything you need to get to grips with your data. Whether you are a data architect, a developer, or a business strategist, HDInsight adds value in everything from development and administration to reporting.
Book preview
HDInsight Essentials - Second Edition - Rajesh Nadipalli
Table of Contents
HDInsight Essentials Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Instant updates on new Packt books
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Hadoop and HDInsight in a Heartbeat
Data is everywhere
Business value of big data
Hadoop concepts
Brief history of Hadoop
Core components
Hadoop cluster layout
HDFS overview
Writing a file to HDFS
Reading a file from HDFS
HDFS basic commands
YARN overview
YARN application life cycle
YARN workloads
Hadoop distributions
HDInsight overview
HDInsight and Hadoop relationship
Hadoop on Windows deployment options
Microsoft Azure HDInsight Service
HDInsight Emulator
Hortonworks Data Platform (HDP) for Windows
Summary
2. Enterprise Data Lake using HDInsight
Enterprise Data Warehouse architecture
Source systems
Data warehouse
Storage
Processing
User access
Provisioning and monitoring
Data governance and security
Pain points of EDW
The next generation Hadoop-based Enterprise data architecture
Source systems
Data Lake
Storage
Processing
User access
Provisioning and monitoring
Data governance, security, and metadata
Journey to your Data Lake dream
Ingestion and organization
Transformation (rules driven)
Access, analyze, and report
Tools and technology for Hadoop ecosystem
Use case powered by Microsoft HDInsight
Problem statement
Solution
Source systems
Storage
Processing
User access
Benefits
Summary
3. HDInsight Service on Azure
Registering for an Azure account
Azure storage
Provisioning an HDInsight cluster
Cluster topology
Provisioning using Azure PowerShell
Creating a storage container
Provisioning a new HDInsight cluster
HDInsight management dashboard
Dashboard
Monitor
Configuration
Exploring clusters using the remote desktop
Running a sample MapReduce
Deleting the cluster
HDInsight Emulator for the development
Installing HDInsight Emulator
Installation verification
Using HDInsight Emulator
Summary
4. Administering Your HDInsight Cluster
Monitoring cluster health
Name Node status
The Name Node Overview page
Datanode Status
Utilities and logs
Hadoop Service Availability
YARN Application Status
Azure storage management
Configuring your storage account
Monitoring your storage account
Managing access keys
Deleting your storage account
Azure PowerShell
Access Azure Blob storage using Azure PowerShell
Summary
5. Ingest and Organize Data Lake
End-to-end Data Lake solution
Ingesting to Data Lake using HDFS command
Connecting to a Hadoop client
Getting your files on the local storage
Transferring to HDFS
Loading data to Azure Blob storage using Azure PowerShell
Loading files to Data Lake using GUI tools
Storage access keys
Storage tools
CloudXplorer
Key benefits
Registering your storage account
Uploading files to your Blob storage
Using Sqoop to move data from RDBMS to Data Lake
Key benefits
Two modes of using Sqoop
Using Sqoop to import data (SQL to Hadoop)
Organizing your Data Lake in HDFS
Managing file metadata using HCatalog
Key benefits
Using HCatalog Command Line to create tables
Summary
6. Transform Data in the Data Lake
Transformation overview
Tools for transforming data in Data Lake
HCatalog
Persisting HCatalog metastore in a SQL database
Apache Hive
Hive architecture
Starting Hive in HDInsight
Basic Hive commands
Apache Pig
Pig architecture
Starting Pig in HDInsight node
Basic Pig commands
Pig or Hive
MapReduce
The mapper code
The reducer code
The driver code
Executing MapReduce on HDInsight
Azure PowerShell for execution of Hadoop jobs
Transformation for the OTP project
Cleaning data using Pig
Executing Pig script
Registering a refined and aggregate table using Hive
Executing Hive script
Reviewing results
Other tools used for transformation
Oozie
Spark
Summary
7. Analyze and Report from Data Lake
Data access overview
Analysis using Excel and Microsoft Hive ODBC driver
Prerequisites
Step 1 – installing the Microsoft Hive ODBC driver
Step 2 – creating Hive ODBC Data Source
Step 3 – importing data to Excel
Analysis using Excel Power Query
Prerequisites
Step 1 – installing the Microsoft Power Query for Excel
Step 2 – importing Azure Blob storage data into Excel
Step 3 – analyzing data using Excel
Other BI features in Excel
PowerPivot
Power View and Power Map
Step 1 – importing Azure Blob storage data into Excel
Step 2 – launch map view
Step 3 – configure the map
Power BI Catalog
Ad hoc analysis using Hive
Other alternatives for analysis
RHadoop
Apache Giraph
Apache Mahout
Azure Machine Learning
Summary
8. HDInsight 3.1 New Features
HBase
HBase positioning in Data Lake and use cases
Provisioning HDInsight HBase cluster
Creating a sample HBase schema
Designing the airline on-time performance table
Connecting to HBase using the HBase shell
Creating an HBase table
Loading data to the HBase table
Querying data from the HBase table
HBase additional information
Storm
Storm positioning in Data Lake
Storm key concepts
Provisioning HDInsight Storm cluster
Running a sample Storm topology
Connecting to Storm using Storm shell
Running the Storm Wordcount topology
Monitoring status of the Wordcount topology
Additional information on Storm
Apache Tez
Summary
9. Strategy for a Successful Data Lake Implementation
Challenges on building a production Data Lake
The success path for a production Data Lake
Identifying the big data problem
Proof of technology for Data Lake
Form a Data Lake Center of Excellence
Executive sponsors
Data Lake consumers
Development
Operations and infrastructure
Architectural considerations
Extensible and modular
Metadata-driven solution
Integration strategy
Security
Online resources
Summary
Index
HDInsight Essentials Second Edition
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2013
Second edition: January 2015
Production reference: 1200115
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-942-9
www.packtpub.com
Credits
Author
Rajesh Nadipalli
Reviewers
Simon Elliston Ball
Anindita Basak
Rami Vemula
Commissioning Editor
Taron Pereira
Acquisition Editor
Owen Roberts
Content Development Editor
Rohit Kumar Singh
Technical Editors
Madhuri Das
Taabish Khan
Copy Editor
Rashmi Sawant
Project Coordinator
Mary Alex
Proofreaders
Ting Baker
Ameesha Green
Indexer
Rekha Nair
Production Coordinator
Melwyn D'sa
Cover Work
Melwyn D'sa
About the Author
Rajesh Nadipalli currently manages software architecture and delivery of Zaloni's Bedrock Data Management Platform, which enables customers to quickly and easily realize true Hadoop-based Enterprise Data Lakes. Rajesh is also an instructor and a content provider for Hadoop training, including Hadoop development, Hive, Pig, and HBase. In his previous role as a senior solutions architect, he evaluated big data goals for his clients, recommended a target state architecture, and conducted proof of concepts and production implementation. His clients include Verizon, American Express, NetApp, Cisco, EMC, and UnitedHealth Group.
Prior to Zaloni, Rajesh worked for Cisco Systems for 12 years and held a technical leadership position. His key focus areas have been data management, enterprise architecture, business intelligence, data warehousing, and Extract Transform Load (ETL). He has demonstrated success by delivering scalable data management and BI solutions that empower businesses to make informed decisions.
Rajesh authored the first version of the book HDInsight Essentials, Packt Publishing, released in September 2013, the first book in print for HDInsight, providing data architects, developers, and managers with an introduction to the new Hadoop distribution from Microsoft.
He has over 18 years of IT experience. He holds an MBA from North Carolina State University and a BSc degree in Electronics and Electrical from the University of Mumbai, India.
I would like to thank my family for their unconditional love, support, and patience during the entire process.
To my friends and coworkers at Zaloni, thank you for inspiring and encouraging me.
And finally a shout-out to all the folks at Packt Publishing for being really professional.
About the Reviewers
Simon Elliston Ball is a solutions engineer at Hortonworks, where he helps a wide range of companies get the best out of Hadoop. Before that, he was the head of big data at Red Gate, creating tools to make HDInsight and Hadoop easier to work with. He has also spoken extensively on big data and NoSQL at conferences around the world.
Anindita Basak works as a big data cloud consultant and a big data Hadoop trainer and is highly enthusiastic about Microsoft Azure and HDInsight along with the Hadoop open source ecosystem. She works as a specialist for Fortune 500 brands including cloud and big data based companies in the US. She has been playing with Hadoop on Azure since the incubation phase (http://www.hadooponazure.com). Previously, she worked as a module lead for the Alten group and as a senior system analyst at Sonata Software Limited, India, in the Azure Professional Direct Delivery group of Microsoft. She worked as a senior software engineer on implementation and migration of various enterprise applications on the Azure cloud in healthcare, retail, and financial domains. She started her journey with Microsoft Azure in the Microsoft Cloud Integration Engineering (CIE) team and worked as a support engineer in Microsoft India (R&D) Pvt. Ltd.
With more than 6 years of experience in the Microsoft .NET technology stack, she is solely focused on big data cloud and data science. As a Most Valued Blogger, she loves to share her technical experience and expertise through her blog at http://anindita9.wordpress.com and http://anindita9.azurewebsites.net. You can find more about her on her LinkedIn page and you can follow her at @imcuteani on Twitter.
She recently worked as a technical reviewer for the books HDInsight Essentials and Microsoft Tabular Modeling Cookbook, both by Packt Publishing. She is currently working on Hadoop Essentials, also by Packt Publishing.
I