Lecture 2
Reading Reference for Lecture 2
2 E6893 Big Data Analytics – Lecture 2: Big Data Platform © 2021 CY Lin, Columbia University
Reminder -- Apache Hadoop
The Apache™ Hadoop® project develops open-source software for reliable, scalable,
distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so delivering
a highly-available service on top of a cluster of computers, each of which may be prone to
failures.
http://hadoop.apache.org
Reminder -- Hadoop-related Apache Projects
• Ambari™: A web-based tool for provisioning, managing, and monitoring Hadoop
clusters. It also provides a dashboard for viewing cluster health and the ability to view
MapReduce, Pig, and Hive applications visually.
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for
large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc
querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel
computation.
• Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple
and expressive programming model that supports a wide range of applications,
including ETL, machine learning, stream processing, and graph computation.
• Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which
provides a powerful and flexible engine to execute an arbitrary DAG of tasks to
process data for both batch and interactive use-cases.
• ZooKeeper™: A high-performance coordination service for distributed applications.
Four distinctive layers of Hadoop
Common Use Cases for Big Data in Hadoop
D. deRoos et al, Hadoop for Dummies, John Wiley & Sons, 2014
Example: Business Value of Log Analysis – “Struggle Detection”
D. deRoos et al, Hadoop for Dummies, John Wiley & Sons, 2014
Reminder -- MapReduce example
http://www.alex-hanna.com
MapReduce Process on User Behavior via Log Analysis
D. deRoos et al, Hadoop for Dummies, John Wiley & Sons, 2014
Setting Up the Hadoop Environment
Data Storage Operations on HDFS
• Hadoop is designed to work best with a modest number of extremely large files.
• Average file sizes ➔ larger than 500MB.
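A rough sketch of why a modest number of large files suits HDFS: the NameNode keeps one in-memory object per file and per block, so many small files inflate its metadata. The 64 MB block size follows the slide on HDFS blocks; the figure of about 150 bytes per object is a commonly quoted rule of thumb and is only an assumption here, as are the function and constant names.

```python
import math

BLOCK_SIZE_MB = 64        # block size taken from the HDFS blocks slide
BYTES_PER_OBJECT = 150    # assumed rule-of-thumb NameNode memory per file or block object

def namenode_objects(total_mb, file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Count the file objects plus block objects the NameNode must track."""
    n_files = math.ceil(total_mb / file_size_mb)
    blocks_per_file = math.ceil(file_size_mb / block_size_mb)
    return n_files + n_files * blocks_per_file

ONE_TB_MB = 1024 * 1024
large = namenode_objects(ONE_TB_MB, 512)   # 1 TB stored as 512 MB files
small = namenode_objects(ONE_TB_MB, 1)     # 1 TB stored as 1 MB files

print("objects with 512 MB files:", large, "approx bytes:", large * BYTES_PER_OBJECT)
print("objects with 1 MB files:  ", small, "approx bytes:", small * BYTES_PER_OBJECT)
```

Under these assumptions the same terabyte needs over a hundred times more NameNode objects when it is stored as 1 MB files.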
Reminder -- Hadoop Distributed File System (HDFS)
http://hortonworks.com/hadoop/hdfs/
HDFS blocks
• A file is divided into blocks (default: 64MB in Hadoop 1.x, 128MB in Hadoop 2.x and later),
and each block is replicated in multiple places (default: 3 replicas).
• Dividing files into blocks is normal for a file system. E.g., the default block size in Linux
is typically 4KB. What sets HDFS apart is the scale.
• Hadoop was designed to operate at the petabyte scale.
• Every data block stored in HDFS has its own metadata and needs to be tracked by a
central server.
• When HDFS stores the replicas of the original blocks across the Hadoop cluster, it tries to
ensure that the block replicas are stored in different failure points.
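The block and replica ideas above can be sketched in a few lines. This is a simplified illustration, not HDFS's actual placement code: it assumes at least two racks, each with enough nodes, and simply puts one replica on one rack and the remaining replicas on the next rack. The rack and node names are made up.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file is divided into."""
    n_full, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * n_full
    if remainder:
        sizes.append(remainder)   # the last block holds only what is left
    return sizes

def place_replicas(block_index, nodes_by_rack, replication=3):
    """Pick nodes for one block so that its replicas span two racks (simplified)."""
    racks = sorted(nodes_by_rack)
    first_rack = racks[block_index % len(racks)]
    second_rack = racks[(block_index + 1) % len(racks)]
    placement = [nodes_by_rack[first_rack][0]]                 # one replica on the first rack
    placement += nodes_by_rack[second_rack][:replication - 1]  # the rest on another rack
    return placement

# Hypothetical two-rack cluster
cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}

blocks = split_into_blocks(200)
layout = [place_replicas(i, cluster) for i in range(len(blocks))]
print(blocks)
print(layout)
```

Losing a whole rack in this sketch still leaves at least one replica of every block on the other rack, which is the point of spreading replicas across failure points.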
HDFS is a User-Space-Level file system
Interaction between HDFS components
HDFS Federation
• Before Hadoop 2.0, the NameNode was a single point of failure and an operational
bottleneck.
• Before Hadoop 2, few Hadoop clusters were able to scale beyond 3,000 or 4,000 nodes.
• Multiple NameNodes can be used in Hadoop 2.x (HDFS High Availability
feature: one is in an Active state, the other is in a Standby state).
http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
High Availability of the NameNodes
• Active NameNode
• Standby NameNode – keeps the state of the block locations and block metadata in memory and
takes over the HDFS checkpointing responsibilities.
• JournalNode – if a failure occurs, the Standby NameNode reads all completed journal entries to
ensure the new Active NameNode is fully consistent with the state of the cluster.
• ZooKeeper – provides coordination and configuration services for distributed systems.
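The failover idea in this list can be illustrated with a toy model. This is only a sketch: the `Journal` class stands in for the JournalNodes as a single in-memory edit log, and all class and function names are hypothetical rather than Hadoop APIs.

```python
class NameNode:
    def __init__(self):
        self.namespace = {}   # path -> list of block ids
        self.applied = 0      # number of journal entries already applied

    def apply(self, entry):
        op, path, blocks = entry
        if op == "add":
            self.namespace[path] = blocks
        elif op == "delete":
            self.namespace.pop(path, None)
        self.applied += 1

class Journal:
    """Append-only edit log, standing in for the JournalNodes."""
    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)

def write(active, journal, entry):
    journal.append(entry)   # record the edit first
    active.apply(entry)     # then update the active NameNode's state

def fail_over(standby, journal):
    """Replay every completed journal entry the standby has not applied yet."""
    for entry in journal.entries[standby.applied:]:
        standby.apply(entry)
    return standby

active, standby, journal = NameNode(), NameNode(), Journal()
write(active, journal, ("add", "/logs/a", [1, 2]))
standby.apply(journal.entries[0])                 # standby has tailed one entry so far
write(active, journal, ("add", "/logs/b", [3]))
write(active, journal, ("delete", "/logs/a", None))

new_active = fail_over(standby, journal)          # the active NameNode is assumed to have failed
print(new_active.namespace)
```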
Several useful commands for HDFS
• For HDFS, the scheme name is hdfs, and for the local file system, the scheme name is
file.
• A file or directory in HDFS can be specified in a fully qualified way, such as:
hdfs://namenodehost/parent/child or hdfs://namenodehost
• The HDFS file system shell commands are similar to Linux file commands, with the following
general syntax: hdfs dfs -file_cmd (or, equivalently for HDFS paths, hadoop fs -file_cmd)
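As a small illustration of the fully qualified form, Python's standard `urllib.parse` can split such a URI into its scheme, host, and path. The local path `/tmp/data.txt` is a made-up example, and the helper name `describe` is hypothetical.

```python
from urllib.parse import urlparse

def describe(uri):
    """Split a fully qualified file URI into scheme, host, and path."""
    parts = urlparse(uri)
    return {"scheme": parts.scheme, "host": parts.netloc, "path": parts.path}

hdfs_file = describe("hdfs://namenodehost/parent/child")
local_file = describe("file:///tmp/data.txt")   # hypothetical local path

print(hdfs_file)
print(local_file)
```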
Several useful commands for HDFS -- II
YARN
Four distinctive layers of Hadoop
Hadoop execution
Limitation of original Hadoop 1
YARN’s application execution
http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/
MapReduce Use Case Example – flight data
• Data Source: Airline On-time Performance data set (flight data set).
– All the logs of domestic flights from the period of October 1987 to April 2008.
– Each record represents an individual flight where various details are
captured:
• Time and date of arrival and departure
• Originating and destination airports
• Amount of time taken to taxi from the runway to the gate.
Other datasets available from Statistical Computing
http://stat-computing.org/dataexpo/
Flight Data Schema
MapReduce Use Case Example – flight data
• Parallel way:
MapReduce application flow
MapReduce steps for flight data computation
FlightsByCarrier application
Create FlightsByCarrier.java:
FlightsByCarrier Mapper
FlightsByCarrier Reducer
Run the code
See Result
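The FlightsByCarrier flow (map each flight record to its carrier, group by carrier, sum the counts) can be imitated in plain Python to see the three phases side by side. This is a sketch, not the Java application from the slides: it assumes the carrier code sits in the ninth CSV column, and the sample records are made up for illustration.

```python
import csv
from collections import defaultdict

CARRIER_COLUMN = 8   # assumed position of the carrier code in each CSV record

def map_phase(line):
    """Emit (carrier, 1) for one flight record."""
    fields = next(csv.reader([line]))
    yield fields[CARRIER_COLUMN], 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(carrier, counts):
    """Sum the per-record counts for one carrier."""
    return carrier, sum(counts)

def run_job(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in sorted(shuffle(pairs).items()))

# Made-up records in the assumed column layout
SAMPLE = [
    "1987,10,14,3,741,730,912,849,PS,1451",
    "1987,10,15,4,729,730,903,849,PS,1451",
    "1987,10,17,6,741,730,918,849,UA,1451",
]

result = run_job(SAMPLE)
print(result)
```

On a cluster, the map calls run in parallel on different blocks of the input, and each reducer receives all values for the keys assigned to it.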
HBase
HBase is modeled after Google’s BigTable and written in Java. It is developed on top of
HDFS.
It provides a fault-tolerant way of storing large quantities of sparse data (small amounts of
information caught within a large collection of empty or unimportant data, such as finding
the 50 largest items in a group of 2 billion records, or finding the non-zero items
representing less than 0.1% of a huge collection).
HBase features compression, in-memory operation, and Bloom filters on a per-column basis.
An HBase system comprises a set of tables. Each table contains rows and columns, much
like a traditional database. Each table must have an element defined as a Primary Key
(the row key, in HBase terms), and all access attempts to HBase tables must use this
Primary Key. An HBase column represents an attribute of an object.
Characteristics of data in HBase
Sparse data
HDFS lacks random read and write access. This is where HBase comes into the picture. It's a
distributed, scalable, big data store, modeled after Google's BigTable. It stores data as
key/value pairs.
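A toy model of this key/value view of a sparse table: only cells that actually hold a value are stored, and every access goes through the row key. The class is purely illustrative and is not the HBase client API; the `family:qualifier` column naming and the sample values are assumptions for the example.

```python
from collections import defaultdict

class SparseTable:
    """Toy sparse table: row key -> {"family:qualifier": value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def stored_cells(self):
        """Number of cells physically stored; empty cells cost nothing."""
        return sum(len(row) for row in self._rows.values())

table = SparseTable()
table.put("user1", "info:name", "Ada")
table.put("user1", "info:email", "ada@example.com")
table.put("user2", "info:name", "Bob")           # user2 has no email cell at all

print(table.get("user1"))
print(table.get("user2", "info:email"))          # missing cell
print(table.stored_cells())
```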
HBase Architecture
HBase Example -- I
HBase Example -- II
HBase Example -- III
HBase Example -- IV
Apache Hive
Creating, Dropping, and Altering DB in Hive
Another Hive Example
Hive’s operation modes
Reference
Spark Stack
Spark Core
Home to the API that defines resilient distributed datasets (RDDs) - Spark’s main
programming abstraction.
An RDD represents a collection of items distributed across many compute nodes that can be
manipulated in parallel.
First language to use — Python
Spark’s Python Shell (PySpark Shell)
bin/pyspark
Test installation
Core Spark Concepts
• At a high level, every Spark application consists of a driver program that launches various
parallel operations on a cluster.
• The driver program contains your application’s main function and defines distributed
datasets on the cluster, then applies operations to them.
• In the preceding example, the driver program was the Spark shell itself.
Driver Programs
If we run the count() operation on a cluster, different machines might count lines in different
ranges of the file.
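The idea that different machines count different ranges of the file can be mimicked without a cluster. The sketch below splits a list of lines into contiguous ranges, counts each range separately (as a worker would), and adds the partial counts (as the driver would). It is a plain-Python illustration with hypothetical function names, not Spark code.

```python
def partition(lines, n_partitions):
    """Split lines into contiguous ranges, one per simulated worker."""
    size, extra = divmod(len(lines), n_partitions)
    parts, start = [], 0
    for i in range(n_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(lines[start:end])
        start = end
    return parts

def distributed_count(lines, n_partitions=3):
    partial_counts = [len(part) for part in partition(lines, n_partitions)]  # work done per worker
    return partial_counts, sum(partial_counts)                               # combined at the driver

lines = ["line %d" % i for i in range(10)]
partials, total = distributed_count(lines)
print(partials, total)
```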
Example filtering
Example — word count
Resilient Distributed Dataset (RDD) Basics
• Each RDD is split into multiple partitions, which may be computed on different nodes of the
cluster.
• Once created, RDDs offer two types of operations: transformations and actions.
[Code example: one line marked as a transformation, followed by one line marked as an action]
Transformations and actions differ in how Spark computes RDDs: transformations are
evaluated lazily, so Spark computes an RDD only the first time it is used in an action.
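The difference between a transformation and an action can be made visible with a toy class that records transformations and runs them only when an action is called. `ToyRDD` is a hypothetical stand-in written for this illustration; it is not Spark's RDD implementation, and the sample lines are made up.

```python
class ToyRDD:
    def __init__(self, source, steps=()):
        self._source = source   # a list standing in for the input data
        self._steps = steps     # recorded filter predicates, not yet run

    def filter(self, predicate):
        """Transformation: only records the step and returns a new ToyRDD."""
        return ToyRDD(self._source, self._steps + (predicate,))

    def _compute(self):
        for item in self._source:
            if all(step(item) for step in self._steps):
                yield item

    def first(self):
        """Action: runs the recorded steps until the first result is found."""
        return next(self._compute())

    def count(self):
        """Action: runs the recorded steps over all the data."""
        return sum(1 for _ in self._compute())

calls = []   # every predicate evaluation is logged here

def mentions_python(line):
    calls.append(line)
    return "Python" in line

lines = ToyRDD(["Spark runs on the JVM", "Python is supported", "so is Scala"])
python_lines = lines.filter(mentions_python)   # transformation: nothing is evaluated yet
calls_before_action = len(calls)
first_match = python_lines.first()             # action: evaluation happens now
calls_after_action = len(calls)

print(calls_before_action, first_match, calls_after_action)
```

Because the work is deferred, the action only scans as far as it needs to: here `first()` stops after the second line.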
Persistence in Spark
• By default, RDDs are computed each time you run an action on them.
• If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using
RDD.persist().
• RDD.persist() will then store the RDD contents in memory and reuse them in future actions.
• Persisting RDDs on disk instead of memory is also possible.
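• Not persisting by default may seem unusual, but it makes sense for big data: if an RDD
is not reused, there is no reason to spend memory storing it when Spark can stream through
the data once and compute the result.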
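The cost of recomputation, and what persisting saves, can be shown with a small counter. The class below is a toy written for this illustration (it only borrows the method name `persist()` from Spark); it recomputes its data on every action unless asked to keep it in memory.

```python
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the data from scratch
        self._persist = False
        self._cache = None
        self.computations = 0     # how many times the data was rebuilt

    def persist(self):
        self._persist = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        self.computations += 1
        data = list(self._compute())
        if self._persist:
            self._cache = data    # keep the contents in memory for later actions
        return data

    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())

def squares():
    return (x * x for x in range(5))

recomputed = ToyRDD(squares)
recomputed.count()
recomputed.collect()

persisted = ToyRDD(squares).persist()
persisted.count()
persisted.collect()

print(recomputed.computations, persisted.computations)
```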
Spark SQL
Using Spark SQL — Steps and Example
Query testtweet.json
Get it from the Learning Spark GitHub repository: https://github.com/databricks/learning-spark/tree/master/files
Machine Learning Library in Spark — MLlib
An example of using MLlib for a text classification task, e.g., identifying spammy emails.
Example: Spam Detection
Feature Extraction Example — TF-IDF
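A minimal sketch of TF-IDF in plain Python, to show what the feature values mean before handing the job to a library. It assumes raw term counts for TF and the smoothed inverse document frequency log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term. The tiny corpus is made up, and this is not MLlib code.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (assumed form): log((m + 1) / (df + 1))
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

# Made-up messages
docs = ["free money now", "meeting notes now", "free free offer"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", w)
```

Terms that occur in few documents get a higher weight than terms spread across the corpus, and repeating a term within one document scales its weight by the count.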
Questions?