Lec 7 Hadoop Intro

Introduction to big data

© Copyright IBM Corporation 2021


Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Unit objectives
• Explain the concept of big data.
• Describe the factors that contributed to the emergence of big data
processing.
• List the various characteristics of big data.
• List typical big data use cases.
• Describe the evolution from traditional data processing to big data
processing.
• List Apache Hadoop core components and their purpose.
• Describe the Hadoop infrastructure and the purpose of the main
projects.
• Identify what is a good fit for Hadoop and what is not.



Big data overview



Topics
• Big data overview
• Big data use cases
• Evolution from traditional data processing to big data processing
• Introduction to Apache Hadoop and the Hadoop infrastructure



Introduction to big data
• The big data tsunami.
• The Vs of big data (3Vs, 4Vs, 5Vs, and so on). The count depends on who does the counting.
• The infrastructure:
▪ Apache open source
▪ The distributions
▪ The add-ons
▪ Open Data Platform initiative (ODPi.org)
• Some basic terminology.



Big data: A tsunami that is hitting us
• We are witnessing a tsunami of data:
▪ Huge volumes
▪ Data of different types and formats
▪ Impacts on the business at new and ever-increasing speeds
• The challenges:
▪ Capturing, transporting, and moving the data
▪ Managing the data, the hardware that is involved, and the software
(open source and not)
▪ Processing: from munging the raw data to programming and providing insight into the data.
▪ Storing, safeguarding, and securing the data.

"Big data refers to non-conventional strategies and innovative technologies that are used by businesses and organizations to capture, manage, process, and make sense of a large volume of data."
• The industries that are involved.
• The future.



Some examples of big data
• Science:
▪ Astronomy
▪ Atmospheric science
▪ Genomics
▪ Biogeochemical
▪ Biological
▪ Other complex / interdisciplinary scientific research
• Social:
▪ Social networks
▪ Social data:
- Person to person and client to client (P2P and C2C): wish lists on Amazon.com, Craigslist
- Person to world (P2W): Twitter, Facebook, LinkedIn
• Medical records
• Commercial:
▪ Web, event, and database logs
▪ "Digital exhaust", which is the result of human interaction with the internet
▪ Sensor networks
▪ RFID
▪ Internet text and documents
▪ Internet search indexing
▪ Call detail records (CDRs)
▪ Photographic archives
▪ Video and audio archives
▪ Large-scale e-commerce
▪ Regular government business and commerce needs
▪ Military and homeland security surveillance


Types of big data

• Structured: Data that can be stored and processed in a fixed format, which is also known as a schema.
• Semi-structured: Data that does not have the formal structure of a data model (that is, a table definition in a relational DBMS), but has some organizational properties, like tags and other markers that separate semantic elements and make it easier to analyze, such as XML or JSON.
• Unstructured: Data that has an unknown form and cannot be stored in an RDBMS or analyzed unless it is transformed into a structured format. Text files and multimedia content like images, audio, and videos are examples of unstructured data. Unstructured data is growing quicker than other data; experts say that 80% of the data in an organization is unstructured.
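To make the three categories concrete, a minimal sketch in Python showing the same customer record in each form (the sample values are invented for illustration):

```python
import json

# Structured: a fixed schema, like a relational table row.
row = ("C-1001", "Sama", "2021-03-14")  # (customer_id, name, signup_date)

# Semi-structured: tags and markers organize the data, but fields can vary.
doc = json.loads('{"customer_id": "C-1001", "name": "Sama", "tags": ["new"]}')

# Unstructured: free text; structure must be extracted before analysis.
note = "Customer called on March 14th and asked about upgrading her plan."

print(row[0], doc["customer_id"], len(note.split()))
```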



The four classic dimensions of big data (the four Vs)

• Volume: Scale of data.
• Velocity: Analysis of streaming data.
• Variety: Different forms of data.
• Veracity: Uncertainty of data.

There is a fifth V, which is Value. It is the reason for working with big data: to obtain business insight.


An insight into big data analytic techniques

(Diagram: data science sits at the intersection of many disciplines and skills: domain knowledge, business strategy, communications, statistics, visualizations, neurocomputing, data mining, machine learning, pattern recognition, business analysis, presentation, KDD, AI, databases and data processing, problem solving, and inquisitiveness.)


Big data use cases



Topics
• Big data overview
• Big data use cases
• Evolution from traditional data processing to big data processing
• Introduction to Apache Hadoop and the Hadoop infrastructure



Big data analytics use case examples

• Big data exploration: Find, visualize, and understand all big data to improve decision making.
• Enhanced 360° view of the customer: Extend existing customer views by incorporating extra internal and external data sources.
• Security and intelligence extension: Lower risk, detect fraud, and monitor cybersecurity in real time.
• Operational analysis: Analyze various machine data for improved business results.
• Data warehouse modernization: Integrate big data and data warehouse capabilities to gain new business insights and increase operational efficiency.


Common use cases that are applied to big data
• Extract, transform, and load (ETL):
▪ Common to business intelligence and data warehousing.
▪ In big data, it changes to extract, load, and transform (ELT), as sketched at the end of this slide.
• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Predictive models
• Sentiment analysis
• Risk assessment

What do these workloads have in common? The nature of the data has the characteristics of some of the Vs: Volume, Velocity, and Variety.
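As a rough sketch of the ETL versus ELT difference in ordering (the function names are hypothetical; the point is only where the transform step happens):

```python
def transform(record):
    """Stand-in cleanup step, for example normalizing whitespace and case."""
    return record.strip().lower()

# ETL: transform each record first, then load it into the warehouse,
# which expects data that already fits its schema.
def etl(records, warehouse):
    warehouse.extend(transform(r) for r in records)

# ELT: load the raw records first; transform later, on demand, where
# cheap storage and parallel compute make late transformation practical.
def elt(records, data_lake):
    data_lake.extend(records)                 # land the raw data as-is
    return [transform(r) for r in data_lake]  # transform when needed

warehouse, lake = [], []
etl(["  Alpha ", " Beta"], warehouse)
print(warehouse)                      # ['alpha', 'beta']
print(elt(["  Alpha ", " Beta"], lake))
```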



Examples of business sectors that use big data
• Healthcare
• Financial
• Industry
• Agriculture



Use cases for big data: Healthcare
Healthcare transformation comes with many challenges



The Precision Medicine Initiative and big data
• Precision medicine:
▪ A medical model that proposes the customization of
healthcare, with medical decisions, practices, and
products tailored to the individual patient (Source:
https://en.wikipedia.org/wiki/Precision_medicine).
▪ Diagnostic testing is used for selecting the appropriate
and optimal therapies based on a patient’s genetic
content or other molecular or cellular analysis.
▪ Tools that are employed in precision medicine can
include molecular diagnostics, imaging, and analytics
software.
• The Precision Medicine Initiative (PMI) is a
$215 million investment in President Obama’s
Fiscal Year 2016 Budget to accelerate
biomedical research and provide clinicians
with new tools to select the therapies that
work best in individual patients.



Use cases for big data: Financial services
• Problem: Manage the several petabytes of data, growing at 40 - 100% per year, under increasing pressure to prevent fraud and complaints to regulators.
• How big data analytics can help:
▪ Fraud detection
▪ Credit issuance
▪ Risk management
▪ 360° view of the customer



Financial marketplace example: Visa
• Problem:
▪ Credit card fraud costs up to 7 cents per 100 dollars, which adds up to billions of dollars per year.
▪ Fraud schemes are constantly changing.
▪ Understanding the fraud pattern months after
the fact is only partially helpful, so fraud detection
models must evolve faster.
• If Visa could:
▪ Reinvent how to detect the fraud patterns.
▪ Stop new fraud patterns before they can
rack up significant losses.
• Solution:
▪ Revolutionize the speed of detection.
▪ Visa loaded two years of test records, or 73 billion transactions, amounting
to 36 TB of data into Hadoop. Their processing time fell from one month
with traditional methods to a mere 13 minutes.



Financial

• Credit Scoring in the Era of Big Data:
https://yjolt.org/credit-scoring-era-big-data
• Big Data Trends in Financial Services:
https://www.accesswire.com/575714/Big-Data-Trends-in-Financial-Services


“Data is the new oil”



Evolution from traditional data
processing to big data
processing



Topics
• Big data overview
• Big data use cases
• Evolution from traditional data processing to big data processing
• Introduction to Apache Hadoop and the Hadoop infrastructure



Traditional versus big data approaches to using data



System of units / Binary system of units
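The slide's chart contrasts decimal (SI) prefixes, which step by powers of 1,000, with binary prefixes, which step by powers of 1,024. A quick sketch of the practical difference:

```python
# Decimal (SI) units step by powers of 1000; binary units by powers of 1024.
SI = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
BINARY = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

# A disk sold as "1 TB" (decimal) holds noticeably less than 1 TiB (binary).
one_tb = SI["TB"]
print(f"1 TB = {one_tb:,} bytes = {one_tb / BINARY['TiB']:.2f} TiB")  # ~0.91 TiB
```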



Hardware improvements over the years
• CPU speeds:
▪ 1990: 44 MIPS at 40 MHz
▪ 2020: 2,356,230 MIPS at 4.35 GHz
• RAM memory:
▪ 1990: 640 KB conventional memory
(256 KB extended memory recommended)
▪ 2020: 16 GB at 3,200 MHz
• Disk capacity:
▪ 1990: 20 MB
▪ 2020: 80 TB
• Disk latency (speed of reads and writes):
Not much improvement in the last 7 - 10 years. Currently, ~80 MBps.
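Using the slide's own figures, a back-of-envelope sketch of why serial disk throughput is the bottleneck, and why spreading data across many disks and reading them in parallel matters:

```python
# Time to scan one 80 TB disk at ~80 MBps, using the slide's figures.
disk_bytes = 80 * 10**12          # 80 TB disk capacity
throughput = 80 * 10**6           # ~80 MB per second sequential read

seconds = disk_bytes / throughput
print(f"{seconds / 86400:.1f} days to scan one disk serially")   # ~11.6 days
print(f"{seconds / 86400 / 100:.2f} days across 100 disks in parallel")
```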



Parallel data processing
Different approaches:
▪ GRID computing: Spreads processing load ("CPU scavenging").
▪ Distributed workload: Hard to manage applications and impacts the developer.
▪ Parallel databases: Db2 DPF, Teradata, and Netezza (distribute the data).
• Distributed computing: Multiple computers appear as one supercomputer, communicate with each other by message passing, and operate together to achieve a common goal.
• Challenges: Heterogeneity, openness, security, scalability, concurrency, fault tolerance, and transparency.

"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers."
- Grace Hopper
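A toy illustration of the data-parallel idea (a single machine's process pool standing in for a cluster, which is an assumption for illustration only): partition the data, let workers compute partial results independently, and merge them at the end.

```python
from multiprocessing import Pool

def partial_sum(partition):
    """Each worker independently processes its own partition of the data."""
    return sum(partition)

if __name__ == "__main__":
    data = list(range(1_000_000))
    partitions = [data[i::4] for i in range(4)]   # split the work 4 ways
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, partitions)
    print(sum(partials))   # merge partial results into the final answer
```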



Online transactional processing system
• Online transactional processing (OLTP) enables the real-time execution
of large numbers of database transactions by large numbers of people,
typically over the internet.
• A database transaction is a change, insertion, deletion, or query of data
in a database. OLTP systems (and the database transactions they
enable) drive many of the financial transactions we make every day,
including online banking and ATM transactions, e-commerce and in-
store purchases, and hotel and airline bookings, among other
transactions.
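A minimal sketch of one such transaction using SQLite (illustrative only; production OLTP systems run many concurrent transactions against server-class databases):

```python
import sqlite3

# An OLTP-style transfer: both updates commit together or not at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

try:
    with conn:  # BEGIN ... COMMIT; rolls back automatically on error
        conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")
except sqlite3.Error:
    print("transaction rolled back")

print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 75.0), (2, 75.0)]
```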



Online analytical processing system
• Online analytical processing (OLAP) is software for performing
multidimensional analysis at high speeds on large volumes of data from
a data warehouse, data mart, or some other unified, centralized data
store.
• OLAP is optimized for conducting complex data analysis. OLAP
systems are designed for use by data scientists, business analysts, and
knowledge workers, and they support business intelligence (BI), data
mining, and other decision support applications.
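A minimal sketch of an analytical aggregation, again using SQLite with made-up sales data: a fact table rolled up across two dimensions, one cell of the "cube" at a time:

```python
import sqlite3

# An OLAP-style roll-up over a tiny fact table (sample data is invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("EMEA", "widgets", 120.0), ("EMEA", "gears", 80.0),
    ("APAC", "widgets", 200.0), ("APAC", "widgets", 50.0),
])

# Aggregate total sales per (region, product) cell of the cube.
for row in conn.execute(
        "SELECT region, product, SUM(amount) FROM sales "
        "GROUP BY region, product ORDER BY region"):
    print(row)
```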



Meaning of “real time” when applied to big data
• Subsecond response
When engineers say "real time", they are usually referring to subsecond response time. In this kind of real-time data processing, nanoseconds count. Extreme levels of performance are key to success.

• Human comfortable response time
"Thou shalt not bore or frustrate the users." The performance requirement for this kind of processing is usually a couple of seconds.

• Event-driven
If by "real time" you mean the opposite of scheduled, then you mean event-driven. Instead of happening at a particular time interval, event-driven data processing happens when a certain action or condition triggers it. The performance requirement is generally "before the next event happens".

• Streaming data processing
If by "real time" you mean the opposite of batch processing, then you mean streaming data processing. In batch processing, data is gathered, and all records or other data units are processed in one large bundle until they are done. In streaming data processing, the data is processed as it flows in, one unit at a time. After the data starts coming in, it generally does not end.

Source: "Four Really Real Meanings of Real-Time Data", http://blog.syncsort.com/2016/03/big-data/four-really-real-meanings-of-real-time
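A small sketch of the batch versus streaming contrast (the events generator is a made-up stand-in for an unbounded feed such as a message queue):

```python
from typing import Iterator

def events() -> Iterator[int]:
    """Stand-in event source; a real feed would generally never end."""
    yield from [3, 1, 4, 1, 5, 9]

# Batch: gather everything first, then process one large bundle.
batch = list(events())
print("batch total:", sum(batch))

# Streaming: process each unit as it arrives, keeping a running result.
running_total = 0
for value in events():
    running_total += value
    print("running total:", running_total)
```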



More comments on “real time”
• Real time is not a concept that is woven into the fabric of the universe:
It is a human construct. Essentially, real time refers to lags in data
arrival that are either below the threshold of perception or are so short
that they do not pose a barrier to immediate action.
• Decisions have various tolerances for protracted data arrival.
• Data latencies versus decision latencies.



Introduction to Apache
Hadoop and the Hadoop
infrastructure



Topics
• Big data overview
• Big data use cases
• Evolution from traditional data processing to big data processing
• Introduction to Apache Hadoop and the Hadoop infrastructure



A new approach is needed to process big data: Requirements
• Partial failure support
Failure of one component should not result in the failure of the entire
system.
• Data recoverability
The workload of a failed component should be assumed by another
functioning unit.
• Component recovery
A recovered component should rejoin the system without requiring a full
restart of the system.
• Consistency
Component failures during job execution should not affect the outcome of
the job.
• Scalability
▪ Adding load to the system should result in a graceful decline in performance of
individual jobs, not a failure of the system.
▪ Increasing resources should support a proportional increase in load capacity.
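A tiny sketch of the recoverability and consistency requirements (hypothetical worker functions, not Hadoop's actual scheduler): when one worker fails, its task is reassigned to a functioning worker, and the job's outcome is unchanged.

```python
def run_task(task, worker_id):
    """Pretend worker: worker 1 is 'down' to simulate a component failure."""
    if worker_id == 1:
        raise RuntimeError(f"worker {worker_id} failed")
    return task * 2  # the task's deterministic result

def run_with_recovery(task, workers):
    """Reassign the task until a functioning worker completes it."""
    for worker_id in workers:
        try:
            return run_task(task, worker_id)
        except RuntimeError as err:
            print(f"{err}; reassigning task {task}")
    raise RuntimeError("all workers failed")

# Worker 1's failure does not change the outcome of the job.
print([run_with_recovery(t, workers=[1, 2, 3]) for t in range(5)])
```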



Introduction to Apache Hadoop and the Hadoop
infrastructure
• Why? When? Where?
▪ Origins / History
▪ The Why of Hadoop
▪ The When of Hadoop
▪ The Where of Hadoop
• Hadoop architecture:
▪ MapReduce
▪ Hadoop Distributed File System (HDFS)
▪ Hadoop Common
• Hadoop infrastructure



Core Hadoop characteristics
• Applications are written in high-level language code.
• Work is performed in a cluster of commodity machines. Nodes talk to
each other as little as possible.
• Data is distributed in advance. Bring the computation to the data.
• Data is replicated for increased availability and reliability.
• Hadoop is fully scalable and fault-tolerant.



What is Apache Hadoop?
• Apache Hadoop is an open source software framework for reliable, scalable, and distributed computing of massive amounts of data.
▪ Hides the underlying system details and complexities from the user.
▪ Developed in Java.
• Consists of these subprojects:
▪ Hadoop Common
▪ HDFS
▪ Hadoop YARN
▪ MapReduce
▪ Hadoop Ozone
• Large Hadoop infrastructure with both open source and proprietary Hadoop-related projects, such as HBase, Apache ZooKeeper, and Apache Avro.
• Meant for heterogeneous commodity hardware.
• Hadoop is based on work that was done by Google in the late 1990s and early 2000s, specifically the papers describing the Google File System (GFS) (published in 2003) and MapReduce (published in 2004).


Why and where Hadoop is used and not used
• Hadoop is good for:
▪ Massive amounts of data through parallelism.
▪ A variety of data (structured, unstructured, and semi-structured).
▪ Inexpensive commodity hardware.
• Hadoop is not good for:
▪ Processing transactions (random access).
▪ When work cannot be parallelized.
▪ Low latency data access.
▪ Processing many small files.
▪ Intensive calculations with little data.
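One reason many small files hurt, as a back-of-envelope sketch assuming the commonly cited figure of roughly 150 bytes of NameNode heap per file-system object (an approximation, not an exact constant):

```python
# Rough NameNode memory estimate for many small files, assuming
# ~150 bytes of heap per file-system object (file entry + block entry).
files = 100_000_000          # one hundred million small files
bytes_per_object = 150       # commonly cited approximation
objects_per_file = 2         # one file entry + one block entry each

heap_bytes = files * bytes_per_object * objects_per_file
print(f"~{heap_bytes / 10**9:.0f} GB of NameNode heap")  # ~30 GB
```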



Apache Hadoop core components
• MapReduce
• HDFS
• YARN
• Hadoop Common



The two key components of Hadoop
• HDFS:
▪ Where Hadoop stores data.
▪ A file system that spans all the nodes in a Hadoop cluster.
▪ It links together the file systems on many local nodes to make them into one
large file system.
• MapReduce framework
How Hadoop understands and assigns work to the nodes (machines).
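A minimal sketch of the MapReduce programming model in plain Python, simulating the map, shuffle, and reduce phases in one process (real Hadoop distributes each phase across the cluster's nodes):

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit (key, value) pairs from each input record.
def mapper(line):
    for word in line.split():
        yield (word, 1)

# Reduce phase: aggregate all values that share a key.
def reducer(word, counts):
    return (word, sum(counts))

lines = ["big data big ideas", "hadoop processes big data"]

# Simulate the shuffle: sort all pairs by key so equal keys are
# adjacent, then group them for the reducers.
pairs = sorted(pair for line in lines for pair in mapper(line))
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reducer(word, (count for _, count in group)))
```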



Differences between RDBMS and Hadoop HDFS



Hadoop infrastructure: Large and constantly growing
• The Hadoop infrastructure includes components that support each stage of big data processing and supplement the core components:
▪ Constantly growing.
▪ It includes Apache open source projects and contributions from other companies.
• Hadoop-related projects:
▪ HBase: Data storage.
▪ Apache Hive: Query/SQL data access.
▪ Apache Pig: Data-flow data access.
▪ Apache Avro: Data serialization/RPC.
▪ Apache Sqoop: RDBMS connector for data integration.
▪ Apache Oozie: Workflow.
▪ Apache ZooKeeper: Coordination.
▪ Apache Chukwa: Monitoring.
▪ Apache Ambari: Cluster management and monitoring.
▪ Apache Spark: Data processing.

(Diagram: the stack's layers, bottom to top: data storage (HDFS distributed file system, HBase), data processing (MapReduce distributed processing, YARN cluster and resource management), data access (Apache Hive, Apache Pig, Apache Avro, Apache Sqoop), workflow and monitoring (Apache Oozie, Apache Chukwa), coordination (Apache ZooKeeper), and cluster management and monitoring (Apache Ambari).)
Think differently
As you start to work with Hadoop, you must think differently:
• There are different processing paradigms.
• There are different approaches to storing data.
• Think ELT rather than ETL.

Understanding the Hadoop infrastructure means embarking on a continuing learning process where self-education is an ongoing requirement.


Unit summary
• Explained the concept of big data.
• Described the factors that contributed to the emergence of big data
processing.
• Listed the various characteristics of big data.
• Listed typical big data use cases.
• Described the evolution from traditional data processing to big data
processing.
• Listed Apache Hadoop core components and their purpose.
• Described the Hadoop infrastructure and the purpose of the main
projects.
• Identified what is a good fit for Hadoop and what is not.



Review questions
1. True or False: The number of Vs of big data is exactly four.

2. Data that can be stored and processed in a fixed format is called:
A. Structured
B. Semi-structured
C. Unstructured
D. Machine generated

3. True or False: Agriculture is one of the industry sectors that are using big data and analytics to help improve and transform their industries.


Review questions (cont.)
4. Hadoop is good for:
A. Processing transactions (random access)
B. Massive amounts of data through parallelism
C. Processing lots of small files
D. Intensive calculations with little data
E. Low latency data access

5. True or False: One of Hadoop's main characteristics is that applications are written in low-level language code.


Review answers
1. False. The number of Vs depends on who does the counting (3Vs, 4Vs, 5Vs, and so on).

2. A. Structured: data that can be stored and processed in a fixed format.

3. True. Agriculture is one of the industry sectors that are using big data and analytics to help improve and transform their industries.


Review answers (cont.)
4. B. Massive amounts of data through parallelism.

5. False. Applications are written in high-level language code.
