Introduction To Big Data PDF

INTRODUCTION TO BIG DATA

Week 1
Agenda
• What is Big Data (evolution)
• Introduction to Big Data
– Problems with Traditional Large-Scale Systems
– Introduction to Distributed File Systems
– The current space of Hadoop
– Big Data Solution Landscape
• Industry Insight
– Motivator - Governance Compliance for Financial Services
– Healthcare
– Use Cases of Big Data analytics
• Big Data Technology Career Path
– Roles
– Adoption of Big Data tools
Evolution of data
Some interesting facts about Data
Introduction to Big Data
• What is Big Data
Big data is a term that describes the large volume of data – both
structured and unstructured – that inundates a business on a day-to-day
basis. But it’s not the amount of data that’s important. It’s what
organizations do with the data that matters. Big data can be analysed
for insights that lead to better decisions and strategic business moves.

• Big Data - Four Vs: Volume, Velocity, Variety, Veracity
The 4 Vs
• Volume
– Terabytes of data
– Transactions
– Social data
– Sensor data
• Velocity
– Torrents of data in near real-time
– Social data
– IoT, Sensor data
– Streaming data
– Machine to machine
• Variety
– Structured
– Text data
– Pictures, Visuals
– Audio
– Video
• Veracity
– Authenticity
– Trustworthiness
– Availability
– Accountability
The 5th V
Value
• Uncover hidden patterns, unknown correlations, customer preferences and other important information
• The value uncovered helps organizations and industries create new products and explore new markets
• Extend the value of a predictive model by subsequently uncovering a virtually unfathomable combination of additional variables
• Help companies streamline operations, improve marketing, enhance customer engagement and improve customer service
Problems with Traditional Large-Scale Systems
Traditional Grid Computing

• Distribute the processing
• Worker Nodes sharing the same storage system, acting as a bottleneck
Need for multi-core distribution
• CPU speed is no longer increasing as dramatically as it once did
• On the other hand, the prices of hardware disks have reduced drastically over the years
• Multi-core architectures are the norm


Big Data distributed computing stats
Putting it all together

• Storing a lot of data (Big Data) is inexpensive on commodity hardware
• Reading or writing 100 GB from a single disk takes 20 min and would use only 1/16th of the machine’s CPU resources

Distributed parallel processing changes this:

• Reading 100 GB off 64 disks would take only 30 secs and would fully use the CPU resources of only 4-8 machines
• The key: read sequentially from multiple disks in parallel
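
The figures above can be checked with a little arithmetic. The sketch below takes the slide's 20 minutes for 100 GB on one disk, derives the implied per-disk throughput, and divides the read evenly across 64 disks. The function names are illustrative, and perfect linear scaling is an assumption. Under that assumption the ideal time comes out a little under 19 seconds. The 30 secs quoted on the slide presumably leaves room for coordination overhead, but that reading is a guess, not something the slide states.

```python
def single_disk_throughput_mb_s(size_gb: float, seconds: float) -> float:
    """Implied sequential throughput in MB/s (decimal units: 1 GB = 1000 MB)."""
    return size_gb * 1000 / seconds


def ideal_parallel_read_seconds(single_disk_seconds: float, disks: int) -> float:
    """Read time if the data is spread evenly and every disk streams at full speed.

    Assumes perfect linear scaling with no network or coordination overhead.
    """
    return single_disk_seconds / disks


single = 20 * 60  # slide figure: 20 minutes for 100 GB on one disk
throughput = single_disk_throughput_mb_s(100, single)
parallel = ideal_parallel_read_seconds(single, 64)

print(f"{throughput:.1f} MB/s per disk")     # 83.3 MB/s per disk
print(f"{parallel:.2f} s across 64 disks")   # 18.75 s across 64 disks
```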
Introduction to Distributed Computing
Distributed computing is an environment in which a group of independent, geographically dispersed
computer systems work together on a complex problem: each solves one part of the problem, and the
partial results from all computers are then combined.

Network bandwidth becomes a bottleneck
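
The split, solve and combine pattern in this definition can be sketched in a few lines. This is only an illustration on one machine: threads stand in for the independent computers, and the function names and the sum-of-squares task are hypothetical choices, not part of the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut the input into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def solve_part(chunk):
    """Work done independently by one (simulated) computer."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(data, workers=4):
    """Split the problem, solve each part in parallel, then combine."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(solve_part, chunks))
    return sum(partials)  # combine step


result = distributed_sum_of_squares(list(range(1, 101)))
print(result)  # 338350
```

In a real cluster the chunks and the partial results travel over the network, which is why bandwidth shows up as the bottleneck.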
Characteristics of Distributed Computing
Resource Sharing

•Resources, Hardware, Data

Concurrency

•Multi-programming/ Multi-processing

Scalability

•Scalable to multiple computers

Fault Tolerant

•Hardware Redundancy, Software Recovery, Data Recovery

Transparency

•Transparency of access, location, concurrency, replication, failure, migration, performance, scaling
Big Data Grid Computing
• Distribute the data processing across multiple machines
• Each worker node to have a separate storage system
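
A rough way to see why separate storage per worker helps is to count the bytes that must cross the network. The sketch below compares the two models for an aggregation job. It assumes the data is already spread across node-local disks, and the figures (64 workers, 1 KB of partial result per worker) are hypothetical values chosen only for illustration.

```python
def shared_storage_network_bytes(dataset_bytes: int) -> int:
    """Shared storage: every byte is pulled from the central store by some worker."""
    return dataset_bytes


def local_storage_network_bytes(workers: int, partial_result_bytes: int) -> int:
    """Local storage: workers read their own disks; only partial results are sent."""
    return workers * partial_result_bytes


dataset = 100 * 10**9  # 100 GB, as in the earlier example
shared = shared_storage_network_bytes(dataset)
local = local_storage_network_bytes(workers=64, partial_result_bytes=1_000)

print(f"shared storage: {shared:,} bytes over the network")
print(f"local storage:  {local:,} bytes over the network")
```

The gap narrows for jobs whose intermediate results are large, so the comparison applies mainly to aggregation-style workloads.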
Big Data Solution Landscape
• 2005-2007: HDFS, MapReduce, Pig, Solr
• 2008: HDFS, MapReduce, Zookeeper, HBase, Pig, Solr
• 2009: HDFS, MapReduce, Zookeeper, HBase, Mahout, Hive, Pig, Solr
• 2010: HDFS, MapReduce, Zookeeper, HBase, Mahout, Hive, Avro, Pig, Sqoop, Solr
• 2011: HDFS, MapReduce, YARN, Zookeeper, HBase, Mahout, Hive, Avro, Sqoop, HCatalog, Oozie, Pig, Solr, Flume, Hue
• 2012: HDFS, MapReduce, YARN, Zookeeper, HBase, Mahout, Hive, Avro, Sqoop, HCatalog, Oozie, Tez, Spark, Pig, Solr, Flume, Kafka, Hue, Impala
• 2013-2016: HDFS, MapReduce, YARN, Zookeeper, HBase, Pig, Solr, Falcon, Mahout, Hive, Avro, Sqoop, HCatalog, Oozie, Tez, Spark, Sentry, Flume, Kafka, Kudu, Hue, Impala, Parquet, Knox
Industry Insight - Governance Compliance for Financial Services
Why is the banking sector aggressively adopting Big Data technologies?
Financial Services
• Fraud and Compliance
– Cyber attack prevention
– Regulatory compliance
– Criminal behaviour
– Credit card fraud detection
• EDW Optimization
– Offload expensive analytics
– Offload expensive data preparation at lower cost
– Data discovery
– Deal with various data types
• Risk Management
– Real-time risk alerting system
– Analyse credit risk and counterparty or third-party risk
– Use simulations that draw on huge volumes of data and require massive parallel computing power
Healthcare
Diagram labels: Healthcare Data Lake, Claims, Logs and Clinical Notes, EMR, Pharmacy
• Healthcare IoT
– Most of the data is unstructured and is created by healthcare IoT devices
– Devices monitor everything about the patient, from blood sugar level to heart rate
– Smart devices already in place can detect whether medicines are being taken regularly at home
– Lower costs and improve patient care
• Reducing Fraud, Waste and Abuse
– Prevent healthcare fraud by using predictive analytics. The Centers for Medicare & Medicaid Services prevented $210.7 million in healthcare fraud using this approach
– Identify fraud by analyzing large volumes of unstructured historical claims data and by using ML algorithms to detect anomalies and patterns
• Electronic Health Records (EHRs)
– Every patient has their own digital record, which includes demographics, medical history, allergies, laboratory test results, etc.
– Records are shared via secure information systems. Every record consists of one modifiable file, which means that doctors can implement changes over time with no paperwork and no danger of data replication
Industry Use Cases of Big Data Analytics
Key IT Considerations for Implementing a Big Data Solution
• Redundant Physical Infrastructure
– Performance
– Availability
– Scalability
– Flexibility
– Cost
• Security Infrastructure
– Data access
– Application access
– Data encryption
– Threat detection
• Operational Database
– Atomicity
– Consistency
– Isolation
– Durability
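
Of the four operational database properties, atomicity is the easiest to show in code. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The accounts table, the names and the transfer function are hypothetical and serve only to illustrate that a failed transaction leaves no partial changes behind.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)]
    )
    conn.commit()
    ok = transfer(conn, "alice", "bob", 30)       # succeeds and commits
    failed = transfer(conn, "alice", "bob", 500)  # fails and rolls back
    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    conn.close()
    return ok, failed, balances


print(demo())  # (True, False, {'alice': 70, 'bob': 30})
```

The second transfer debits the source account before the check fails, yet the rollback restores the balance, so no half-finished update is ever visible.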
Organizing Data Services and Tools
•A distributed file system
•Serialization services
•Coordination services
•Extract, transform and load (ETL)
•Workflow services
Big Data Analytic Providers
Forrester Wave™: Big Data Solutions, Q1 ’16
Big Data as a Career Path
Adoption of Tools and Job Titles
Adoption of Big Data Technology
THANK YOU
