Data Science and Big Data Analytics_ Unit_1

The document outlines a course on Data Science and Big Data Analytics, detailing its teaching scheme, objectives, outcomes, and syllabus. It covers fundamental concepts of Big Data, mathematical foundations, processing techniques, analytics, visualization, and the impact of Big Data on various sectors. Additionally, it discusses challenges, advantages, and projects utilizing Big Data, emphasizing the need for new infrastructure and skill sets in the field.

Uploaded by

Devika Rankhambe

Data Science and Big Data Analytics
314453 : DATA SCIENCE AND BIG DATA ANALYTICS

Teaching Scheme:
Lectures: 4 Hours/Week
Credits: 04
Examination Scheme:
In-Semester : 30 Marks
End-Semester: 70 Marks

Prerequisites:
Discrete Mathematics
Database Management Systems
Data Mining and Data Warehousing
Course Objectives

1. To introduce the basic need of Big Data and Data Science to handle huge amounts of data.
2. To understand the basic mathematics behind Big Data.
3. To understand the different Big Data processing techniques.
4. To understand and apply the analytical concepts of Big Data using R.
5. To visualize Big Data using different tools.
6. To understand the applications and impact of Big Data.
Course Outcomes

By the end of the course, students should be able to:

 Outline Big Data learning primitives.
 Learn and apply the different mathematical models behind Big Data.
 Demonstrate Big Data skills by developing industry or research applications.
 Analyze how each learning model comes from a different algorithmic approach and performs differently on different datasets.
Syllabus
• UNIT – I INTRODUCTION: DATA SCIENCE AND BIG DATA (08 Hours)
– Introduction to Data science and Big Data, Defining Data science and Big
Data, Big Data examples, Data explosion, Data volume, Data Velocity, Big
data infrastructure and challenges, Big Data Processing Architectures,
Data Warehouse, Re-Engineering the Data Warehouse, Shared everything
and shared nothing architecture, Big data learning approaches.

• UNIT – II MATHEMATICAL FOUNDATION OF BIG DATA (08 Hours)


– Probability theory, Tail bounds with applications, Markov chains and
random walks, Pair wise independence and universal hashing,
Approximate counting, Approximate median, The streaming models,
Flajolet Martin Distance sampling, Bloom filters, Local search and testing
connectivity, Enforce test techniques, Random walks and testing, Boolean
functions, BLR test for linearity.
Syllabus (Cont…)
• UNIT - III BIG DATA PROCESSING (08 Hours)
– Big Data technologies, Introduction to Google file system, Hadoop
Architecture, Hadoop Storage: HDFS, Common Hadoop Shell commands,
Anatomy of File Write and Read, NameNode, Secondary NameNode, and
DataNode, Hadoop MapReduce paradigm, Map Reduce tasks, Job, Task
trackers - Cluster Setup – SSH & Hadoop Configuration, Introduction to:
NOSQL, Textual ETL processing.

• UNIT – IV BIG DATA ANALYTICS (08 Hours)


– Data analytics life cycle, Data cleaning, Data transformation, Comparing
reporting and analysis, Types of analysis, Analytical approaches, Data
analytics using R, Exploring basic features of R, Exploring R GUI, Reading
data sets, Manipulating and processing data in R, Functions and packages
in R, Performing graphical analysis in R, Integrating R and Hadoop, Hive,
Data analytics.
Syllabus (Cont…)
• UNIT – V Big Data Visualization (08 Hours)
– Introduction to Data visualization, Challenges to Big data visualization,
Conventional data visualization tools, Techniques for visual data
representations, Types of data visualization, Visualizing Big Data, Tools used in
data visualization, Proprietary data visualization tools, Open-source data
visualization tools, Analytical techniques used in Big data visualization, Data
visualization with Tableau, Introduction to: Pentaho, Flare, Jasper Reports,
Dygraphs, Datameer Analytics Solution and Cloudera, Platfora, NodeBox,
Gephi, Google Chart API, Flot, D3, and Visually.

• UNIT – VI BIG DATA TECHNOLOGIES APPLICATION AND IMPACT (08 Hours)


– Social media analytics, Text mining, Mobile analytics, Roles and
responsibilities of Big data person, Organizational impact, Data analytics life
cycle, Data Scientist roles and responsibility, Understanding decision theory,
creating big data strategy, big data value creation drivers, Michael Porter's
value creation models, Big data user experience ramifications, Identifying
big data use cases.
Text Books
• 1. Krish Krishnan, Data warehousing in the age
of Big Data, Elsevier, ISBN: 9780124058910,
1st Edition.
• 2. DT Editorial Services, Big Data, Black Book,
DT Editorial Services, ISBN: 9789351197577,
2016 Edition.
Reference Books
• 1. Mitzenmacher and Upfal, Probability and Computing: Randomized
Algorithms and Probabilistic Analysis, Cambridge University press,
ISBN: 0521835402 (hardback).
• 2. Dana Ron, Algorithmic and Analysis Techniques in Property Testing,
School of EE.
• 3. Graham Cormode, Minos Garofalakis, Peter J. Haas and Chris Jermaine,
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches,
Foundations and Trends in Databases, DOI: 10.1561/1900000004.
• 4. A.Ohri, R for Business Analytics, Springer, ISBN:978-1-4614-4343-8.
• 5. Alex Holmes, Hadoop in practice, Dreamtech press,
ISBN:9781617292224.
• 6. Ambiga Dhiraj, Big Data, Big Analytics: Emerging Business Intelligence
and Analytic Trends for Today's Businesses, Wiley CIO Series.
• 7. Arvind Sathi, Big Data Analytics: Disruptive Technologies for Changing
the Game, IBM Corporation, ISBN:978-1-58347-380-1.
Reference Books (Cont…)
• 8. EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing,
Visualizing and Presenting Data.
• 9. Li Chen, Zhixun Su, Bo Jiang, Mathematical Problems in Data Science, Springer,
ISBN: 978-3-319-25127-1.
• 10. Philip Kromer and Russell Jurney, Big Data for Chimps, O'Reilly, ISBN: 9789352132447.
• 11. EMC Education services, Data Science and Big Data Analytics, EMC2 Wiley,
ISBN :9788126556533.
• 12. Mueller Massaron, Python for Data science, Wiley, ISBN :9788126557394.
• 13. EMC Education Services, Data Science and Big Data Analytics, Wiley India, ISBN:
9788126556533
• 14. Benoy Antony, Konstantin Boudnik, Cheryl Adams, Professional Hadoop, Wiley India,
ISBN: 9788126563029.
• 15. Mark Gardener, Beginning R: The Statistical Programming Language, Wiley India,
ISBN: 9788126541201.
• 16. Mark Gardener, The Essential R Reference, Wiley India, ISBN: 9788126546015.
• 17. Judith Hurwitz, Alan Nugent, Big Data For Dummies, Wiley India, ISBN :
9788126543281

UNIT – I
Introduction: Data Science and
Big Data
Big Data: Introduction
• Big Data is a large volume of data in structured or
unstructured form.
• The rate of data generation has increased
exponentially with the increasing use of data-intensive
technologies.
• Processing or analyzing this huge amount of data is a
challenging task.
• It requires new infrastructure and a new way of
thinking about how business and the IT industry
work.
What is Big Data?

There are many examples of "data", but what makes some of it
"big"? The classic definition revolves around the three Vs:
volume, velocity, and variety.

Volume: There is just a lot of it being generated all the
time. Things get interesting, and "big", when you can't fit it
all on one computer anymore. Many ideas here, such as
MapReduce and Hadoop, revolve around being able to process
data that grows from terabytes to petabytes to exabytes.

Velocity: Data is being generated very quickly. Can you
even store it all? If not, then what do you get rid of and
what do you keep?

Variety: The data takes many different shapes. What does it
mean to store it so that you can play with or compare it?
Defining Big Data
6 V’s of Big Data
Case study: 6 V’s in clinical dataset
Data Volume
Data Velocity
Data Variety

Ambiguity – uncertainty in the data.
Viscosity – the inertia encountered when navigating through a data collection.
Virality – the speed at which data can spread through a network.
Problem of Data Explosion
Problem of Data Explosion (Cont…)
• An International Data Corporation (IDC) study
predicts that overall data will grow 50 times
by 2020.
• The digital universe is 1.8 trillion gigabytes
(1 gigabyte = 10^9 bytes) in size and is stored
in 500 quadrillion (a quadrillion is 10^15) files.
• There are nearly as many bits of information in the
digital universe as stars in our physical universe.
• 90% of data is in unstructured form.
Issues in Big Data
• Issues related to the Characteristics
• Storage and Transfer Issues
• Data Management Issues
• Processing Issues
Issues related to the Characteristics
• Data Volume Issues
• Data Velocity Issues
• Data Variety Issues
• Worth of Data Issues
• Data Complexity Issues
Storage and Transfer Issues
• Current storage techniques and storage media are
not appropriate for effectively handling Big Data.
• Current technology limits disks to about 4 terabytes
(4 × 10^12 bytes), so 1 exabyte (10^18 bytes) of data
would take 250,000 disks.
• Accessing that data will also overwhelm the network.
• A sustained transfer of 1 exabyte over a 1 Gbps
network at an 80% effective transfer rate would take
roughly 2.8 million hours (about 317 years).
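These figures can be checked with a short back-of-envelope sketch (assumptions: 4 TB disks, 1 exabyte of data, and a 1 Gbps link running at 80% efficiency, as on this slide):

```python
# Back-of-envelope check of the storage and transfer figures above.

TB = 10**12          # terabyte, in bytes
EB = 10**18          # exabyte, in bytes

disk_capacity = 4 * TB
data_size = 1 * EB

# How many 4 TB disks does 1 EB of data need?
disks_needed = data_size // disk_capacity
print(f"Disks needed: {disks_needed:,}")   # 250,000

# How long does transferring 1 EB take over a 1 Gbps link
# running at 80% efficiency (800 Mbps effective)?
link_bps = 1e9 * 0.80
transfer_seconds = (data_size * 8) / link_bps
transfer_hours = transfer_seconds / 3600
transfer_years = transfer_seconds / (3600 * 24 * 365)
print(f"Transfer time: {transfer_hours:,.0f} hours (~{transfer_years:.0f} years)")
```

The point of the arithmetic is that moving exabyte-scale data over commodity networks is infeasible, which is why Big Data systems move computation to the data instead.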
Data Management Issues
• Resolving issues of access, utilization, updating,
governance, and reference (in publications) have
proven to be major stumbling blocks.
• At such volumes, it is impractical to validate every
data item.
• New approaches and research to data qualification
and validation are needed.
• The richness of digital data representation prohibits
a personalized methodology for data collection.
Processing Issues
• The processing issues are critical to handle.
• Example: 1 exabyte = 1,000 petabytes (10^18 bytes).
Assuming a 5 GHz processor expends 100 instructions
per byte, processing one byte takes 20 nanoseconds,
so end-to-end serial processing of 1 exabyte would
take roughly 635 years.
• Effective processing of exabytes of data will require
extensive parallel processing and new analytics
algorithms.
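The slide's estimate can be reproduced with a small sketch (the "100 instructions per byte" granularity is an assumption implied by the slide's numbers):

```python
# Serial-processing estimate: 100 instructions per byte on a
# 5 GHz processor, applied end-to-end to 1 exabyte of data.

clock_hz = 5e9
instructions_per_byte = 100
seconds_per_byte = instructions_per_byte / clock_hz   # 20 ns per byte

data_bytes = 10**18                                   # 1 exabyte
total_seconds = data_bytes * seconds_per_byte
years = total_seconds / (3600 * 24 * 365)
print(f"{seconds_per_byte * 1e9:.0f} ns per byte -> about {years:.0f} years serially")
```

A few hundred years of serial work is exactly why the next bullet calls for extensive parallel processing.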
Challenges in Big Data
• Privacy and Security
• Data Access and Sharing of Information
• Analytical Challenges
• Human Resources and Manpower
• Technical Challenges
Privacy and Security
• Privacy and security are sensitive topics with
conceptual, technical, as well as legal significance.
• Most people are vulnerable to information theft.
• Privacy can be compromised in large data sets.
• Security is also critical to handle in such large
data.
• Social stratification would be an important arising
consequence.
Data Access and Sharing of Information
• Data should be available in an accurate, complete,
and timely manner.
• The data management and governance process
becomes a bit complex with the added necessity of
making data open and available to government
agencies.
• Expecting companies to share data with each other
is awkward.
Analytical Challenges
• Big data brings along with it some huge
analytical challenges.
• Analysis of such huge data requires a large
number of advanced skills.
• The type of analysis to be done on the data
depends highly on the results to be obtained.
Human Resources and Manpower
• Big Data needs to attract organizations and
young talent with diverse new skill sets.
• These skills include technical ones as well as
research, analytical, interpretive, and creative ones.
• This requires training programs to be held by
organizations.
• Universities need to introduce curricula on
Big Data.
Technical Challenges
• Fault Tolerance: If a failure occurs, the damage
done should stay within an acceptable threshold
rather than forcing the whole task to restart from
scratch.
• Scalability: Requires a high level of resource
sharing, which is expensive, and dealing with
system failures in an efficient manner.
• Quality of Data: Big Data focuses on storing quality
data rather than very large amounts of irrelevant data.
• Heterogeneous Data: Structured and unstructured
data.
Advantages of Big Data
• Understanding and Targeting Customers
• Understanding and Optimizing Business Process
• Improving Science and Research
• Improving Healthcare and Public Health
• Optimizing Machine and Device Performance
• Financial Trading
• Improving Sports Performance
• Improving Security and Law Enforcement
Some Projects using Big Data
• Amazon.com handles millions of back-end operations and
has 7.8 TB, 18.5 TB, and 24.7 TB databases.
• Walmart is estimated to store more than 2.5 PB of data,
handling 1 million transactions per hour.
• The Large Hadron Collider (LHC) generates 25 PB of data
before replication and 200 PB after replication.
• The Sloan Digital Sky Survey continues at a rate of about 200
GB per night and has more than 140 TB of information.
• The Utah Data Center for Cyber Security is designed to store
data on the scale of yottabytes (10^24 bytes).
Is Big Data the same as Data Science?

Are Big Data and Data Science the same
thing?

I wouldn't say so...

Data Science can be done on small data sets.

And not everything done using Big Data would
necessarily be called Data Science.

But there certainly is a substantial overlap!
(Venn diagram: Data Science and Big Data shown as overlapping circles)
Big Data Infrastructure: Hadoop/MapReduce
Programming & Data Processing

 Architecture of Hadoop, HDFS, and YARN


 Programming on Hadoop

 Basic Data Processing: Sort and Join


 Information Retrieval using Hadoop
 Data Mining using Hadoop (K-means + Histograms)
 Machine Learning on Hadoop (EM)

 Hive/Pig
 HBase and Cassandra
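As a taste of the MapReduce paradigm listed above, here is a minimal word-count sketch in plain Python, with no Hadoop required. The map/shuffle/reduce function names mirror the paradigm's phases, not any specific Hadoop API:

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit a (word, 1) pair for every word in the input record.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one key.
    return (key, sum(values))

records = ["big data is big", "data science and big data"]
pairs = [p for r in records for p in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)   # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'and': 1}
```

On a real cluster the same map and reduce functions run in parallel across many machines, with HDFS holding the input splits and the framework performing the shuffle.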
Big Data Learning Approaches
Classical Approach
Given: Input and Model → Wanted: Output

Machine Learning Approach
Given: Input and Output → Wanted: Model
How Machine Learning Works
• Machine Learning builds a model from the data
• Supervised: Data and Labels
• Unsupervised: Data with no label

• The model is then used to:

• Predict the outcome of a system
• Recognize complicated patterns in new data points
• Classify inputs
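A minimal supervised example of "building a model from the data", using only the standard library. The linear model and the toy data are illustrative assumptions, not from the slides: we fit a least-squares line to labeled points, then use the learned model to predict new outcomes.

```python
def fit_line(xs, ys):
    # Learn a model y = a*x + b from labeled data (least squares).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

# Supervised setting: inputs (data) paired with labels.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # here y = 2x + 1 exactly

a, b = fit_line(xs, ys)

def predict(x):
    # Use the learned model on a new data point.
    return a * x + b

print(predict(10.0))   # 21.0
```

The same pattern scales up: replace the line with a richer model and the four points with a large dataset, and "fit then predict" is still the core loop.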
Big Data Processing Examples

• Weather data
• Contract Data
• Financial reporting data
• Clinical trials data
• Social Media posts
• Survey data
Big Data processing Architectures

• Centralized Processing
• Distributed Processing
• Client Server Architecture
• Cluster Architecture
The advantages of distributed processing are scalability,
customization of processing and management of information
based on operation, and parallel processing of data, which
reduces latencies.
The disadvantages of distributed processing are data redundancy,
process redundancy, resource overhead, and volume.
Big Data Processing Cycle
Big Data Processing Flow
Shared everything Architectures
Shared nothing Architectures
The requirements for Big Data infrastructure and
Processing Architecture
• Data Processing Requirements:
– Data-model-less architecture
– Micro-batch processing
– Data collection in real time
– Minimal data transformation
– Multi-partition capability
– Efficient data reads
– Sharing data across multiple processing points
– Storing results in a file system or non-relational DBMS
Infrastructure Requirements

• Linear scalability
• High throughput
• Fault tolerance
• Auto recovery
• Distributed data processing
• High degree of parallelism
• Programming language interfaces
