Lec 7 Hadoop Intro

Introduction to big data

© Copyright IBM Corporation 2021


Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Unit objectives
• Explain the concept of big data.
• Describe the factors that contributed to the emergence of big data
processing.
• List the various characteristics of big data.
• List typical big data use cases.
• Describe the evolution from traditional data processing to big data
processing.
• List Apache Hadoop core components and their purpose.
• Describe the Hadoop infrastructure and the purpose of the main
projects.
• Identify what is a good fit for Hadoop and what is not.



Big data overview



Topics
• Big data overview
• Big data use cases
• Evolution from traditional data processing to big data processing
• Introduction to Apache Hadoop and the Hadoop infrastructure



Introduction to big data
• The big data tsunami.
• The Vs of big data (3Vs, 4Vs, 5Vs, and so on). The count depends on who does the counting.
• The infrastructure:
▪ Apache open source
▪ The distributions
▪ The add-ons
▪ Open Data Platform initiative (ODPi.org)
• Some basic terminology.



Big data: A tsunami that is hitting us
• We are witnessing a tsunami of data:
▪ Huge volumes
▪ Data of different types and formats
▪ Impacts on the business at new and ever-increasing speeds
• The challenges:
▪ Capturing, transporting, and moving the data
▪ Managing the data, the hardware that is involved, and the software
(open source and not)
▪ Processing: from munging the raw data to programming and providing insight into the data.
▪ Storing, safeguarding, and securing the data.

"Big data refers to non-conventional strategies and innovative technologies that are used by businesses and organizations to capture, manage, process, and make sense of a large volume of data."
• The industries that are involved.
• The future.



Some examples of big data
• Science:
▪ Astronomy
▪ Atmospheric science
▪ Genomics
▪ Biogeochemical
▪ Biological
▪ Other complex / interdisciplinary scientific research
• Social:
▪ Social networks
▪ Social data:
- Person to person and client to client (P2P and C2C): wish lists on Amazon.com, Craigslist
- Person to world (P2W): Twitter, Facebook, LinkedIn
• Medical records
• Commercial:
▪ Web, event, and database logs
▪ "Digital exhaust", which is the result of human interaction with the internet
▪ Sensor networks
▪ RFID
▪ Internet text and documents
▪ Internet search indexing
▪ Call detail records (CDRs)
▪ Photographic archives
▪ Video and audio archives
▪ Large-scale e-commerce
▪ Regular government business and commerce needs
▪ Military and homeland security surveillance


Types of big data

• Structured: Data that can be stored and processed in a fixed format, which is also known as a schema.
• Semi-structured: Data that does not have the formal structure of a data model (that is, a table definition in a relational DBMS), but has some organizational properties, like tags and other markers that separate semantic elements and make it easier to analyze, such as XML or JSON.
• Unstructured: Data that has an unknown form and cannot be stored in an RDBMS or analyzed unless it is transformed into a structured format. Text files and multimedia content like images, audio, and videos are examples of unstructured data. Unstructured data is growing quicker than other data; experts say that 80% of the data in an organization is unstructured.
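To make the three categories concrete, a minimal sketch in Python showing the same customer record in each form (the sample values are invented for illustration):

```python
import json

# Structured: a fixed schema, like a relational table row.
row = ("C-1001", "Sama", "2021-03-14")  # (customer_id, name, signup_date)

# Semi-structured: tags and markers organize the data, but fields can vary.
doc = json.loads('{"customer_id": "C-1001", "name": "Sama", "tags": ["new"]}')

# Unstructured: free text; structure must be extracted before analysis.
note = "Customer called on March 14th and asked about upgrading her plan."

print(row[0], doc["customer_id"], len(note.split()))
```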



The four classic dimensions of big data (the four Vs)

• Volume: Scale of data.
• Velocity: Analysis of streaming data.
• Variety: Different forms of data.
• Veracity: Uncertainty of data.

There is a fifth V, which is Value. It is the reason for working with big data: to obtain business insight.


An insight into big data analytic techniques

(Diagram: data science sits at the intersection of many disciplines and skills: domain knowledge, business strategy, communications, statistics, visualizations, neurocomputing, data mining, machine learning, pattern recognition, business analysis, presentation, KDD, AI, databases and data processing, problem solving, and inquisitiveness.)


Big data use cases



Topics
• Big data overview
• Big data use cases
• Evolution from traditional data processing to big data processing
• Introduction to Apache Hadoop and the Hadoop infrastructure



Big data analytics use case examples

• Big data exploration: Find, visualize, and understand all big data to improve decision making.
• Enhanced 360° view of the customer: Extend existing customer views by incorporating extra internal and external data sources.
• Security and intelligence extension: Lower risk, detect fraud, and monitor cybersecurity in real time.
• Operational analysis: Analyze various machine data for improved business results.
• Data warehouse modernization: Integrate big data and data warehouse capabilities to gain new business insights and increase operational efficiency.


Common use cases that are applied to big data
• Extract, transform, and load (ETL):
▪ Common to business intelligence and data warehousing.
▪ In big data, it changes to extract, load, and transform (ELT), as sketched at the end of this slide.
• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Predictive models
• Sentiment analysis
• Risk assessment

What do these workloads have in common? The nature of the data has the characteristics of some of the Vs: Volume, Velocity, and Variety.
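As a rough sketch of the ETL versus ELT difference in ordering (the function names are hypothetical; the point is only where the transform step happens):

```python
def transform(record):
    """Stand-in cleanup step, for example normalizing whitespace and case."""
    return record.strip().lower()

# ETL: transform each record first, then load it into the warehouse,
# which expects data that already fits its schema.
def etl(records, warehouse):
    warehouse.extend(transform(r) for r in records)

# ELT: load the raw records first; transform later, on demand, where
# cheap storage and parallel compute make late transformation practical.
def elt(records, data_lake):
    data_lake.extend(records)                 # land the raw data as-is
    return [transform(r) for r in data_lake]  # transform when needed

warehouse, lake = [], []
etl(["  Alpha ", " Beta"], warehouse)
print(warehouse)                      # ['alpha', 'beta']
print(elt(["  Alpha ", " Beta"], lake))
```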



Examples of business sectors that use big data
• Healthcare
• Financial
• Industry
• Agriculture



Use cases for big data: Healthcare
Healthcare transformation comes with many challenges



The Precision Medicine Initiative and big data
• Precision medicine:
▪ A medical model that proposes the customization of
healthcare, with medical decisions, practices, and
products tailored to the individual patient (Source:
https://en.wikipedia.org/wiki/Precision_medicine).
▪ Diagnostic testing is used for selecting the appropriate
and optimal therapies based on a patient’s genetic
content or other molecular or cellular analysis.
▪ Tools that are employed in precision medicine can
include molecular diagnostics, imaging, and analytics
software.
• The Precision Medicine Initiative (PMI) is a
$215 million investment in President Obama’s
Fiscal Year 2016 Budget to accelerate
biomedical research and provide clinicians
with new tools to select the therapies that
work best in individual patients.



Use cases for big data: Financial services
• Problem: Manage the several petabytes of data, growing at 40 - 100% per year, under increasing pressure to prevent fraud and complaints to regulators.
• How big data analytics can help:
▪ Fraud detection
▪ Credit issuance
▪ Risk management
▪ 360° view of the customer



Financial marketplace example: Visa
• Problem:
▪ Credit card fraud costs up to 7 cents per 100 dollars, which adds up to billions of dollars per year.
▪ Fraud schemes are constantly changing.
▪ Understanding the fraud pattern months after
the fact is only partially helpful, so fraud detection
models must evolve faster.
• If Visa could:
▪ Reinvent how to detect the fraud patterns.
▪ Stop new fraud patterns before they can
rack up significant losses.
• Solution:
▪ Revolutionize the speed of detection.
▪ Visa loaded two years of test records, or 73 billion transactions, amounting
to 36 TB of data into Hadoop. Their processing time fell from one month
with traditional methods to a mere 13 minutes.



Financial

• Credit Scoring in the Era of Big Data:
https://yjolt.org/credit-scoring-era-big-data
• Big Data Trends in Financial Services:
https://www.accesswire.com/575714/Big-Data-Trends-in-Financial-Services


“Data is the new oil”



Evolution from traditional data
processing to big data
processing



Topics
• Big data overview
• Big data use cases
• Evolution from traditional data processing to big data processing
• Introduction to Apache Hadoop and the Hadoop infrastructure



Traditional versus big data approaches to using data



System of units / Binary system of units
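The slide's chart contrasts decimal (SI) prefixes, which step by powers of 1,000, with binary prefixes, which step by powers of 1,024. A quick sketch of the practical difference:

```python
# Decimal (SI) units step by powers of 1000; binary units by powers of 1024.
SI = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
BINARY = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

# A disk sold as "1 TB" (decimal) holds noticeably less than 1 TiB (binary).
one_tb = SI["TB"]
print(f"1 TB = {one_tb:,} bytes = {one_tb / BINARY['TiB']:.2f} TiB")  # ~0.91 TiB
```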



Hardware improvements over the years
• CPU speeds:
▪ 1990: 44 MIPS at 40 MHz
▪ 2020: 2,356,230 MIPS at 4.35 GHz
• RAM memory:
▪ 1990: 640 KB conventional memory
(256 KB extended memory recommended)
▪ 2020: 16 GB at 3,200 MHz
• Disk capacity:
▪ 1990: 20 MB
▪ 2020: 80 TB
• Disk latency (speed of reads and writes):
Not much improvement in the last 7 - 10 years. Currently, ~80 MBps.
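Using the slide's own figures, a back-of-envelope sketch of why serial disk throughput is the bottleneck, and why spreading data across many disks and reading them in parallel matters:

```python
# Time to scan one 80 TB disk at ~80 MBps, using the slide's figures.
disk_bytes = 80 * 10**12          # 80 TB disk capacity
throughput = 80 * 10**6           # ~80 MB per second sequential read

seconds = disk_bytes / throughput
print(f"{seconds / 86400:.1f} days to scan one disk serially")   # ~11.6 days
print(f"{seconds / 86400 / 100:.2f} days across 100 disks in parallel")
```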



Parallel data processing
Different approaches:
▪ GRID computing: Spreads processing load ("CPU scavenging").
▪ Distributed workload: Hard to manage applications and impacts the developer.
▪ Parallel databases: Db2 DPF, Teradata, and Netezza (distribute the data).
• Distributed computing: Multiple computers appear as one supercomputer, communicate with each other by message passing, and operate together to achieve a common goal.
• Challenges: Heterogeneity, openness, security, scalability, concurrency, fault tolerance, and transparency.

"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers."
- Grace Hopper
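A toy illustration of the data-parallel idea (a single machine's process pool standing in for a cluster, which is an assumption for illustration only): partition the data, let workers compute partial results independently, and merge them at the end.

```python
from multiprocessing import Pool

def partial_sum(partition):
    """Each worker independently processes its own partition of the data."""
    return sum(partition)

if __name__ == "__main__":
    data = list(range(1_000_000))
    partitions = [data[i::4] for i in range(4)]   # split the work 4 ways
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, partitions)
    print(sum(partials))   # merge partial results into the final answer
```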



Online transactional processing system
• Online transactional processing (OLTP) enables the real-time execution
of large numbers of database transactions by large numbers of people,
typically over the internet.
• A database transaction is a change, insertion, deletion, or query of data
in a database. OLTP systems (and the database transactions they
enable) drive many of the financial transactions we make every day,
including online banking and ATM transactions, e-commerce and in-
store purchases, and hotel and airline bookings, among other
transactions.
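A minimal sketch of one such transaction using SQLite (illustrative only; production OLTP systems run many concurrent transactions against server-class databases):

```python
import sqlite3

# An OLTP-style transfer: both updates commit together or not at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

try:
    with conn:  # BEGIN ... COMMIT; rolls back automatically on error
        conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")
except sqlite3.Error:
    print("transaction rolled back")

print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 75.0), (2, 75.0)]
```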



Online analytical processing system
• Online analytical processing (OLAP) is software for performing
multidimensional analysis at high speeds on large volumes of data from
a data warehouse, data mart, or some other unified, centralized data
store.
• OLAP is optimized for conducting complex data analysis. OLAP
systems are designed for use by data scientists, business analysts, and
knowledge workers, and they support business intelligence (BI), data
mining, and other decision support applications.
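A minimal sketch of an analytical aggregation, again using SQLite with made-up sales data: a fact table rolled up across two dimensions, one cell of the "cube" at a time:

```python
import sqlite3

# An OLAP-style roll-up over a tiny fact table (sample data is invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("EMEA", "widgets", 120.0), ("EMEA", "gears", 80.0),
    ("APAC", "widgets", 200.0), ("APAC", "widgets", 50.0),
])

# Aggregate total sales per (region, product) cell of the cube.
for row in conn.execute(
        "SELECT region, product, SUM(amount) FROM sales "
        "GROUP BY region, product ORDER BY region"):
    print(row)
```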



Meaning of “real time” when applied to big data
• Subsecond response
When engineers say "real time", they are usually referring to subsecond response time. In this kind of real-time data processing, nanoseconds count. Extreme levels of performance are key to success.

• Human comfortable response time
"Thou shalt not bore or frustrate the users." The performance requirement for this kind of processing is usually a couple of seconds.

• Event-driven
If by "real time" you mean the opposite of scheduled, then you mean event-driven. Instead of happening at a particular time interval, event-driven data processing happens when a certain action or condition triggers it. The performance requirement is generally "before the next event happens".

• Streaming data processing
If by "real time" you mean the opposite of batch processing, then you mean streaming data processing. In batch processing, data is gathered, and all records or other data units are processed in one large bundle until they are done. In streaming data processing, the data is processed as it flows in, one unit at a time. After the data starts coming in, it generally does not end.

Source: "Four Really Real Meanings of Real-Time Data", http://blog.syncsort.com/2016/03/big-data/four-really-real-meanings-of-real-time
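A small sketch of the batch versus streaming contrast (the events generator is a made-up stand-in for an unbounded feed such as a message queue):

```python
from typing import Iterator

def events() -> Iterator[int]:
    """Stand-in event source; a real feed would generally never end."""
    yield from [3, 1, 4, 1, 5, 9]

# Batch: gather everything first, then process one large bundle.
batch = list(events())
print("batch total:", sum(batch))

# Streaming: process each unit as it arrives, keeping a running result.
running_total = 0
for value in events():
    running_total += value
    print("running total:", running_total)
```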



More comments on “real time”
• Real time is not a concept that is woven into the fabric of the universe:
It is a human construct. Essentially, real time refers to lags in data
arrival that are either below the threshold of perception or are so short
that they do not pose a barrier to immediate action.
• Decisions have various tolerances for protracted data arrival.
• Data latencies versus decision latencies.



Introduction to Apache
Hadoop and the Hadoop
infrastructure



Topics
• Big data overview
• Big data use cases
• Evolution from traditional data processing to big data processing
• Introduction to Apache Hadoop and the Hadoop infrastructure



A new approach is needed to process big data: Requirements
• Partial failure support
Failure of one component should not result in the failure of the entire
system.
• Data recoverability
The workload of a failed component should be assumed by another
functioning unit.
• Component recovery
A recovered component should rejoin the system without requiring a full
restart of the system.
• Consistency
Component failures during job execution should not affect the outcome of
the job.
• Scalability
▪ Adding load to the system should result in a graceful decline in performance of
individual jobs, not a failure of the system.
▪ Increasing resources should support a proportional increase in load capacity.
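A tiny sketch of the recoverability and consistency requirements (hypothetical worker functions, not Hadoop's actual scheduler): when one worker fails, its task is reassigned to a functioning worker, and the job's outcome is unchanged.

```python
def run_task(task, worker_id):
    """Pretend worker: worker 1 is 'down' to simulate a component failure."""
    if worker_id == 1:
        raise RuntimeError(f"worker {worker_id} failed")
    return task * 2  # the task's deterministic result

def run_with_recovery(task, workers):
    """Reassign the task until a functioning worker completes it."""
    for worker_id in workers:
        try:
            return run_task(task, worker_id)
        except RuntimeError as err:
            print(f"{err}; reassigning task {task}")
    raise RuntimeError("all workers failed")

# Worker 1's failure does not change the outcome of the job.
print([run_with_recovery(t, workers=[1, 2, 3]) for t in range(5)])
```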



Introduction to Apache Hadoop and the Hadoop
infrastructure
• Why? When? Where?
▪ Origins / History
▪ The Why of Hadoop
▪ The When of Hadoop
▪ The Where of Hadoop
• Hadoop architecture:
▪ MapReduce
▪ Hadoop Distributed File System (HDFS)
▪ Hadoop Common
• Hadoop infrastructure



Core Hadoop characteristics
• Applications are written in high-level language code.
• Work is performed in a cluster of commodity machines. Nodes talk to
each other as little as possible.
• Data is distributed in advance. Bring the computation to the data.
• Data is replicated for increased availability and reliability.
• Hadoop is fully scalable and fault-tolerant.



What is Apache Hadoop?
• Apache Hadoop is an open source software framework for reliable, scalable, and distributed computing of massive amounts of data.
▪ Hides the underlying system details and complexities from the user.
▪ Developed in Java.
• Consists of these subprojects:
▪ Hadoop Common
▪ HDFS
▪ Hadoop YARN
▪ MapReduce
▪ Hadoop Ozone
• Large Hadoop infrastructure with both open source and proprietary Hadoop-related projects, such as HBase, Apache ZooKeeper, and Apache Avro.
• Meant for heterogeneous commodity hardware.
• Hadoop is based on work that was done by Google in the late 1990s and early 2000s, specifically the papers describing the Google File System (GFS) (published in 2003) and MapReduce (published in 2004).


Why and where Hadoop is used and not used
• Hadoop is good for:
▪ Massive amounts of data through parallelism.
▪ A variety of data (structured, unstructured, and semi-structured).
▪ Inexpensive commodity hardware.
• Hadoop is not good for:
▪ Processing transactions (random access).
▪ When work cannot be parallelized.
▪ Low latency data access.
▪ Processing many small files.
▪ Intensive calculations with little data.
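One reason many small files hurt, as a back-of-envelope sketch assuming the commonly cited figure of roughly 150 bytes of NameNode heap per file-system object (an approximation, not an exact constant):

```python
# Rough NameNode memory estimate for many small files, assuming
# ~150 bytes of heap per file-system object (file entry + block entry).
files = 100_000_000          # one hundred million small files
bytes_per_object = 150       # commonly cited approximation
objects_per_file = 2         # one file entry + one block entry each

heap_bytes = files * bytes_per_object * objects_per_file
print(f"~{heap_bytes / 10**9:.0f} GB of NameNode heap")  # ~30 GB
```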



Apache Hadoop core components
• MapReduce
• HDFS
• YARN
• Hadoop Common



The two key components of Hadoop
• HDFS:
▪ Where Hadoop stores data.
▪ A file system that spans all the nodes in a Hadoop cluster.
▪ It links together the file systems on many local nodes to make them into one
large file system.
• MapReduce framework
How Hadoop understands and assigns work to the nodes (machines).
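A minimal sketch of the MapReduce programming model in plain Python, simulating the map, shuffle, and reduce phases in one process (real Hadoop distributes each phase across the cluster's nodes):

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit (key, value) pairs from each input record.
def mapper(line):
    for word in line.split():
        yield (word, 1)

# Reduce phase: aggregate all values that share a key.
def reducer(word, counts):
    return (word, sum(counts))

lines = ["big data big ideas", "hadoop processes big data"]

# Simulate the shuffle: sort all pairs by key so equal keys are
# adjacent, then group them for the reducers.
pairs = sorted(pair for line in lines for pair in mapper(line))
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reducer(word, (count for _, count in group)))
```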



Differences between RDBMS and Hadoop HDFS



Hadoop infrastructure: Large and constantly growing
• The Hadoop infrastructure includes components that support each stage of big data processing and supplement the core components:
▪ Constantly growing.
▪ It includes Apache open source projects and contributions from other companies.
• Hadoop-related projects:
▪ HBase: Data storage.
▪ Apache Hive: Query/SQL data access.
▪ Apache Pig: Data-flow data access.
▪ Apache Avro: Data serialization/RPC.
▪ Apache Sqoop: RDBMS connector for data integration.
▪ Apache Oozie: Workflow.
▪ Apache ZooKeeper: Coordination.
▪ Apache Chukwa: Monitoring.
▪ Apache Ambari: Cluster management and monitoring.
▪ Apache Spark: Data processing.

(Diagram: the stack's layers, bottom to top: data storage (HDFS distributed file system, HBase), data processing (MapReduce distributed processing, YARN cluster and resource management), data access (Apache Hive, Apache Pig, Apache Avro, Apache Sqoop), workflow and monitoring (Apache Oozie, Apache Chukwa), coordination (Apache ZooKeeper), and cluster management and monitoring (Apache Ambari).)
Think differently
As you start to work with Hadoop, you must think differently:
• There are different processing paradigms.
• There are different approaches to storing data.
• Think ELT rather than ETL.

Understanding the Hadoop infrastructure means embarking on a continuing learning process where self-education is an ongoing requirement.


Unit summary
• Explained the concept of big data.
• Described the factors that contributed to the emergence of big data
processing.
• Listed the various characteristics of big data.
• Listed typical big data use cases.
• Described the evolution from traditional data processing to big data
processing.
• Listed Apache Hadoop core components and their purpose.
• Described the Hadoop infrastructure and the purpose of the main
projects.
• Identified what is a good fit for Hadoop and what is not.



Review questions
1. True or False: The number of Vs of big data is exactly four.

2. Data that can be stored and processed in a fixed format is called:
A. Structured
B. Semi-structured
C. Unstructured
D. Machine generated

3. True or False: Agriculture is one of the industry sectors that are using big data and analytics to help improve and transform their industries.


Review questions (cont.)
4. Hadoop is good for:
A. Processing transactions (random access)
B. Massive amounts of data through parallelism
C. Processing lots of small files
D. Intensive calculations with little data
E. Low latency data access

5. True or False: One of Hadoop's main characteristics is that applications are written in low-level language code.


Review answers
1. False. The number of Vs depends on who does the counting (3Vs, 4Vs, 5Vs, and so on).

2. A. Structured: data that can be stored and processed in a fixed format.

3. True. Agriculture is one of the industry sectors that are using big data and analytics to help improve and transform their industries.


Review answers (cont.)
4. B. Massive amounts of data through parallelism.

5. False. Applications are written in high-level language code.
