Data Science and Big Data Analytics: Unit 1
314453 : DATA SCIENCE AND BIG DATA ANALYTICS
Teaching Scheme:
Lectures: 4 Hours/Week
Credits: 04
Examination Scheme:
In-Semester: 30 Marks
End-Semester: 70 Marks
Prerequisites:
Discrete Mathematics
Database Management Systems
Data Mining and Data Warehousing
Course Objectives
UNIT – I: Introduction to Data Science and Big Data
Big Data: Introduction
• Big Data refers to large volumes of data in structured or unstructured form.
• The rate of data generation has increased exponentially with the growing use of data-intensive technologies.
• Processing or analyzing such huge amounts of data is a challenging task.
• It requires new infrastructure and a new way of thinking about how business and the IT industry work.
What is Big Data?
There are many examples of "data", but what makes some of it "big"? The classic definition revolves around the three Vs: volume, velocity, and variety.
Volume: There is just a lot of it being generated, all the time. Things get interesting, and "big", when you can no longer fit it all on one computer. This is where ideas such as MapReduce and Hadoop come in: they all revolve around being able to process data that grows from Terabytes to Petabytes to Exabytes.
Velocity: Data is being generated very quickly. Can you even store it all? If not, then what do you get rid of and what do you keep?
Variety: Data comes in many different shapes. What does it mean to store them all so that you can work with or compare them?
Defining Big Data
6 V’s of Big Data
Case study: 6 V’s in clinical dataset
Data Volume
Data Velocity
Data Variety
Veracity – ambiguity and uncertainty in the data.
Viscosity – the inertia encountered when navigating through a data collection.
Virality – the speed at which data can spread through a network.
Problem of Data Explosion
• An International Data Corporation (IDC) study predicts that overall data will grow by 50 times by 2020.
• The digital universe is 1.8 trillion gigabytes (1 GB = 10^9 bytes) in size and is stored in 500 quadrillion (1 quadrillion = 10^15) files.
• There are nearly as many bits of information in the digital universe as there are stars in our physical universe.
• About 90% of this data is in unstructured form.
Issues in Big Data
• Issues related to the Characteristics
• Storage and Transfer Issues
• Data Management Issues
• Processing Issues
Issues related to the Characteristics
• Data Volume Issues
• Data Velocity Issues
• Data Variety Issues
• Worth of Data Issues
• Data Complexity Issues
Storage and Transfer Issues
• Current storage techniques and storage media are not adequate for handling Big Data effectively.
• Current technology limits disks to about 4 Terabytes (4 × 10^12 bytes), so 1 Exabyte (10^18 bytes) of data would take 250,000 disks.
• Accessing that data will also overwhelm the network.
• On a 1 Gbps network with an 80% effective transfer rate (roughly 100 MB/s sustained), transferring 1 Petabyte takes about 2,800 hours, so a sustained transfer of 1 Exabyte would take on the order of 2.8 million hours.
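A quick back-of-the-envelope check of these figures, sketched in Python (the 4 TB disk size and the 1 Gbps link at 80% efficiency are the assumptions from the bullets above):

EXABYTE = 10**18                   # bytes
DISK = 4 * 10**12                  # bytes per 4 TB disk (assumed above)
print(f"Disks for 1 EB: {EXABYTE // DISK:,}")         # 250,000

sustained = 0.80 * 1e9 / 8         # 1 Gbps at 80%, in bytes/s (~100 MB/s)
hours_per_pb = 10**15 / sustained / 3600
print(f"Hours to move 1 PB: {hours_per_pb:,.0f}")     # ~2,778
print(f"Hours to move 1 EB: {hours_per_pb * 1000:,.0f}")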
Data Management Issues
• Resolving issues of access, utilization, updating, governance, and reference (in publications) has proven to be a major stumbling block.
• At such volumes, it is impractical to validate every data item.
• New approaches and new research into data qualification and validation are needed.
• The richness of digital data representation prohibits a personalized methodology for data collection.
Processing Issues
• The processing issues are critical to handle.
• Example: 1 Exabyte = 1,000 Petabytes (1 PB = 10^15 bytes). Assuming a processor expends 100 instructions on one block at 5 GHz, each block takes 20 nanoseconds end to end; processing the ~10^18 blocks of 1 Exabyte sequentially would therefore take roughly 635 years.
• Effective processing of Exabytes of data will require extensive parallel processing and new analytics algorithms.
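The 635-year figure can be verified with the same kind of arithmetic (a minimal sketch; the 10^18 block count is the assumption implied by the example):

instructions_per_block = 100
clock_hz = 5e9                                  # 5 GHz
seconds_per_block = instructions_per_block / clock_hz   # 2e-8 s = 20 ns
blocks = 1e18                                   # block count implied above
years = seconds_per_block * blocks / (3600 * 24 * 365)
print(f"{seconds_per_block * 1e9:.0f} ns per block, about {years:.0f} years in total")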
Challenges in Big Data
• Privacy and Security
• Data Access and Sharing of Information
• Analytical Challenges
• Human Resources and Manpower
• Technical Challenges
Privacy and Security
• Privacy and security are sensitive issues with conceptual, technical, and legal significance.
• Most people are vulnerable to information theft.
• Privacy can be compromised in large data sets.
• Security is also critical to manage for data at this scale.
• Social stratification could be an important consequence.
Data Access and Sharing of Information
• Data should be available in an accurate, complete, and timely manner.
• Data management and governance become more complex with the need to make data open and available to government agencies.
• Expecting companies to share data with one another is often unrealistic.
Analytical Challenges
• Big Data brings with it some huge analytical challenges.
• Analyzing such huge data sets requires a wide range of advanced skills.
• The type of analysis to be performed on the data depends heavily on the results to be obtained.
Human Resources and Manpower
• Big Data needs to attract organizations and young professionals with diverse new skill sets.
• These skills include technical skills as well as research, analytical, interpretive, and creative ones.
• Organizations need to run training programs to develop these skills.
• Universities need to introduce Big Data into their curricula.
Technical Challenges
• Fault Tolerance: if a failure occurs, the damage done should remain within an acceptable threshold rather than forcing the whole task to restart from scratch.
• Scalability: requires a high degree of resource sharing, which is expensive, along with efficient handling of system failures.
• Quality of Data: Big Data work favors storing high-quality, relevant data rather than very large amounts of irrelevant data.
• Heterogeneous Data: both structured and unstructured data must be handled.
Advantages of Big Data
• Understanding and Targeting Customers
• Understanding and Optimizing Business Process
• Improving Science and Research
• Improving Healthcare and Public Health
• Optimizing Machine and Device Performance
• Financial Trading
• Improving Sports Performance
• Improving Security and Law Enforcement
Some Projects using Big Data
• Amazon.com handles millions of back-end operations and has databases of 7.8 TB, 18.5 TB, and 24.7 TB.
• Walmart is estimated to store more than 2.5 PB of data to handle 1 million transactions per hour.
• The Large Hadron Collider (LHC) generates 25 PB of data before replication and 200 PB after replication.
• The Sloan Digital Sky Survey collects data at a rate of about 200 GB per night and has amassed more than 140 TB of information.
• The Utah Data Center for cyber security is designed to store data on the Yottabyte (10^24 bytes) scale.
Is Big Data the same as Data Science?
Are Big Data and Data Science the same thing? I wouldn't say so...
Data Science can be done on small data sets, and not everything done using Big Data would necessarily be called Data Science. But there certainly is a substantial overlap!
[Venn diagram: the overlap between Big Data and Data Science]
Big Data Infrastructure: Hadoop/MapReduce
• Programming & data processing: Hive/Pig
• HBase and Cassandra
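To make the MapReduce model concrete, here is a minimal sketch of its map / shuffle / reduce phases in plain Python (a local simulation, not actual Hadoop; the word-count task and sample lines are illustrative):

from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reduce: aggregate all values shuffled to the same key.
    return word, sum(counts)

lines = ["big data needs new thinking",
         "data science uses big data"]

# Shuffle: group mapped values by key (Hadoop does this across nodes).
groups = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        groups[word].append(count)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result)   # {'big': 2, 'data': 3, 'needs': 1, ...}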
Big Data Learning Approaches
Classical Approach
Given: example inputs and their corresponding outputs.
Wanted: a model that maps inputs to outputs.
Examples of such data:
• Weather data
• Contract Data
• Financial reporting data
• Clinical trials data
• Social Media posts
• Survey data
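As a minimal sketch of this "given inputs and outputs, learn a model" setup (the numbers below are made up for illustration):

import numpy as np

# Given: example inputs X and outputs y (made-up values).
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Wanted: a model. Here, fit y ≈ a*x + b by ordinary least squares.
a, b = np.polyfit(X, y, deg=1)
print(f"learned model: y = {a:.2f}*x + {b:.2f}")
print("prediction for x = 5:", a * 5 + b)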
Big Data processing Architectures
• Centralized Processing
• Distributed Processing
• Client Server Architecture
• Cluster Architecture
Advantages of distributed processing include scalability, customization of processing and information management per operation, and parallel processing of data, which reduces latency. A toy illustration of the parallel-processing advantage follows below.
Disadvantages of distributed processing include data redundancy, process redundancy, resource overhead, and sheer data volume.
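The sketch below simulates the parallel-processing advantage locally, with Python worker processes standing in for distributed nodes (the four-way partitioning and the squared-sum workload are illustrative assumptions):

from multiprocessing import Pool

def process_partition(partition):
    # Stand-in for per-node work (e.g., a local aggregation).
    return sum(x * x for x in partition)

if __name__ == "__main__":
    data = list(range(1_000_000))
    partitions = [data[i::4] for i in range(4)]   # split across 4 workers
    with Pool(processes=4) as pool:
        partials = pool.map(process_partition, partitions)
    print(sum(partials))                          # combine partial results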
Big Data Processing Cycle
Big Data Processing Flow
Shared-everything architectures: all processors share a common memory and storage pool.
Shared-nothing architectures: each node has its own processor, memory, and disk, and nodes communicate only over the network.
Requirements for Big Data Infrastructure and Processing Architecture
• Data Processing Requirements:
– Data-model-less architecture
– Micro-batch processing
– Data collection in real time
– Minimal data transformation
– Multi-partition capability
– Efficient data reads
– Sharing data across multiple processing points
– Storing results in a file system or non-relational DBMS
Infrastructure Requirements
• Linear scalability
• High throughput
• Fault tolerance
• Automatic recovery
• Distributed data processing
• High degree of parallelism
• Programming language interfaces