Introduction To Big Data PDF
Week 1
Agenda
• What is Big Data (evolution)
• Introduction to Big Data
– Problems with Traditional Large-Scale Systems
– Introduction to Distributed File Systems
– The current space of Hadoop
– Big Data Solution Landscape
• Industry Insight
– Motivator - Governance Compliance for Financial Services
– Healthcare
– Use Cases of Big Data analytics
• Big Data Technology Career Path
– Roles
– Adoption of Big data tools
Evolution of data
Some interesting facts about Data
Introduction to Big Data
• What is Big Data
Big data is a term that describes the large volume of data – both
structured and unstructured – that inundates a business on a day-to-day
basis. But it’s not the amount of data that’s important. It’s what
organizations do with the data that matters. Big data can be analysed
for insights that lead to better decisions and strategic business moves.
Variety
• Structured
• Text data
• Pictures, Visuals
• Audio
• Video
Veracity
• Authenticity
• Trustworthiness
• Availability
• Accountability
The 5th V: Value
• Uncover hidden patterns, unknown correlations, customer preferences, and other important information
• The value uncovered helps organizations and industries create new products and explore new markets
• Extend the value of a predictive model by subsequently uncovering a virtually unfathomable combination of additional variables
• Help companies streamline operations, improve marketing, enhance customer engagement, and improve customer service
Problems with Traditional Large-Scale Systems
Traditional Grid Computing
• Reading 100 GB off 64 disks in parallel would take only about 30 seconds, and it would fully use
the CPU resources of only 4-8 machines
• The key – sequentially read from multiple disks in parallel
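The arithmetic behind the 30-second figure can be sketched as follows. The slide does not state a per-disk read speed; the 50 MB/s used here is an assumption chosen because it reproduces the quoted number, and the function name is hypothetical.

```python
def parallel_read_seconds(total_gb, disks, mb_per_sec_per_disk):
    """Seconds to read total_gb when it is split evenly across
    `disks` drives that are all read sequentially at the same time."""
    per_disk_mb = total_gb * 1000 / disks  # 1 GB taken as 1000 MB
    return per_disk_mb / mb_per_sec_per_disk


# Assumption: sustained sequential throughput of one disk (not from the slide).
ASSUMED_DISK_MB_PER_SEC = 50

single_disk = parallel_read_seconds(100, 1, ASSUMED_DISK_MB_PER_SEC)
sixty_four_disks = parallel_read_seconds(100, 64, ASSUMED_DISK_MB_PER_SEC)

print(f"1 disk:   {single_disk:.0f} s")
print(f"64 disks: {sixty_four_disks:.2f} s")
```

Under this assumption a single disk needs 2000 seconds (roughly 33 minutes), while 64 disks read in parallel need about 31 seconds.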
Introduction to Distributed Computing
Distributed computing is an environment in which a group of independent, geographically
dispersed computer systems cooperate to solve a complex problem: each system solves one part of the
problem, and the partial results from all systems are then combined.
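The split, solve, and combine pattern in this definition can be illustrated with a minimal sketch. Threads on one machine stand in for the independent computer systems, and the task (a sum of squares) and all function names are hypothetical choices for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Divide data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def solve_part(chunk):
    """Work done independently by one participant: a partial sum of squares."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(data, workers=4):
    """Split the problem, solve each part concurrently, then combine."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(solve_part, chunks))
    return sum(partials)  # combine step


result = distributed_sum_of_squares(list(range(1, 11)), workers=3)
print(result)
```

In a real distributed system each chunk would live on, or be shipped to, a separate machine, which is where network transfer cost enters the picture.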
Network bandwidth becomes a bottleneck
Characteristics of Distributed Computing
Resource Sharing
Concurrency
• Multi-programming / Multi-processing
Scalability
Fault Tolerant
Transparency
• Transparency of access, location, concurrency, replication, failure, migration, performance, scaling
Big Data Grid Computing
• Distribute the data processing across multiple machines
Financial Services
• Cyber attack prevention
• Regulatory compliance
• Criminal behaviour
• Credit card fraud detection

• Offload expensive analytics
• Offload expensive data preparation at lower cost
• Data discovery
• Deal with various data types

• Real-time risk alerting system
• Analyse credit risk, counterparty or third-party risk
• Utilizing simulations that use huge volumes of data and require massive parallel computing power
Healthcare
Healthcare Data Lake: Claims, Logs and Clinical Notes, EMR, Pharmacy
Healthcare IoT
• Most of the data is of the unstructured variety, created by healthcare IoT devices
• Devices monitor everything about the patient, from blood sugar level to heart rate, etc.
• Smart devices already in place can detect whether medicines are being taken regularly at home
• Lower costs and improve patient care
Reducing Fraud, Waste and Abuse
• Prevent healthcare fraud by using predictive analytics. The Centers for Medicare & Medicaid Services prevented $210.7 million in fraud using predictive analytics
• Identify fraud by analyzing large volumes of unstructured historical claims data and by using ML algorithms to detect anomalies and patterns
Electronic Health Records (EHRs)
• Every patient has their own digital record, which includes demographics, medical history, allergies, laboratory test results, etc.
• Records are shared via secure information systems. Every record consists of one modifiable file, which means that doctors can implement changes over time with no paperwork and no danger of data replication
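As a rough illustration of detecting anomalies in historical claims, the sketch below flags claim amounts that lie far from the median. It is a simple statistical stand-in (a modified z-score based on the median absolute deviation), not the ML algorithms the slide refers to. The claim amounts are made up, and the 3.5 threshold is a commonly used rule of thumb rather than anything from the source.

```python
from statistics import median


def flag_anomalies(amounts, threshold=3.5):
    """Return the amounts whose modified z-score exceeds `threshold`.

    The score is 0.6745 * |x - median| / MAD, where MAD is the median
    absolute deviation. It tolerates outliers better than mean/stdev.
    """
    mid = median(amounts)
    mad = median(abs(x - mid) for x in amounts)
    if mad == 0:
        return []  # all values (nearly) identical: nothing to flag
    return [x for x in amounts if 0.6745 * abs(x - mid) / mad > threshold]


# Hypothetical claim amounts; one is far outside the usual range.
claims = [120, 135, 110, 128, 140, 125, 2400, 131]
print(flag_anomalies(claims))
```

A flagged claim is only a candidate for review; a production system would combine many features (provider, procedure codes, timing) rather than the amount alone.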
Industry Use Cases of Big Data Analytics
Key IT Considerations for Implementing a Big Data Solution
• Redundant Physical Infrastructure
• Flexibility
• Cost
• Security Infrastructure
• Atomicity, Consistency, Isolation, Durability (ACID)
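Atomicity, the first of the four properties just listed, means a transaction either takes effect completely or not at all. The sketch below shows this with Python's built-in sqlite3 module; the accounts table, names, and balances are hypothetical and are not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # CHECK constraint failed; the credit was rolled back too


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
print(ok, failed, balances(conn))
```

The second transfer credits bob before the debit fails, yet bob's balance is unchanged afterwards because the whole transaction is rolled back.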
Organizing Data Services and Tools
• A distributed file system
• Serialization services
• Coordination services
• Extract, Transform, and Load (ETL)
• Workflow services
Big Data Analytic Providers
Forrester Wave™: Big Data Solutions, Q1 ’16
Big Data as a Career Path
Adoption of Tools and Job Titles
Adoption of Big Data Technology