CSIC 221: Machine Learning & Data Analytics: Mayank Dave Professor Dept. of Computer Engineering

The document outlines the field of Machine Learning and Data Analytics, emphasizing the significance of data science in managing and interpreting vast amounts of data generated globally. It discusses the challenges of big data, the roles of data scientists, and various types of data and analytical techniques used in data science. Additionally, it highlights the importance of understanding data science principles across multiple disciplines to derive insights and predictions from data.

CSIC 221: Machine Learning & Data Analytics

Mayank Dave
Professor
Dept. of Computer Engineering
Outline
• Data, Big Data and Challenges
• Data Science
• Introduction
• Why Data Science
• Data Scientists
• What do they do?
• Major/Concentration in Data Science
• What courses to take.
Data All Around
• Lots of data is being collected and warehoused
• Web data, e-commerce
• Financial transactions, bank/credit transactions
• Online trading and purchasing
• Social Network
How Much Data Do We Have?
• In 2020, users generated 64.2 ZB of data, a number of bytes that exceeds the number of detectable stars in the cosmos (1 ZB = 10^21 bytes).
• Data creation reached about 147 ZB by the end of 2024.
• Google processes 20 PB a day (2008)
• Facebook has 60 TB of daily logs (in 2023, 4 PB per day)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• 1000 genomes project: 200 TB

• Cost of 1 TB of disk: $35

• Time to read 1 TB disk: about 3 hrs (at 100 MB/s)
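
The read-time figure follows from simple division. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a sustained sequential read rate; the function name is illustrative:

```python
def read_time_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to scan a disk once at a sustained sequential read rate."""
    return size_bytes / rate_bytes_per_s / 3600


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

hours = read_time_hours(1 * TB, 100 * MB)
print(f"{hours:.2f} hours")  # about 2.78 hours, i.e. roughly 3 hrs
```

Storage is cheap, but a single full scan of one disk already takes hours, which is one reason large data sets are spread across many disks and read in parallel.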
Big Data
• Big Data is any data that is expensive to manage and hard to extract value from
• Volume
  • The size of the data (also scope)
• Velocity
  • The latency of data processing relative to the growing demand for interactivity (accumulation)
• Variety and Complexity
  • The diversity of sources, formats, quality, and structures (structured and unstructured)
Big Data (The 3V Model)
• The “3V model” attempts to lay this out in a simple way.
• Each of these three Vs regarding data has dramatically increased in recent years.
• Specifically, the increasing volume of heterogeneous and unstructured (text, images, and video) data, as well as the possibilities emerging from their analysis, renders data science ever more essential.
Big Data and Data Science
• They are not the “same thing”
• If big data = crude oil
• Then big data is about extracting the “crude oil”, transporting it in “mega tankers”, siphoning it through “pipelines”, and storing it in “massive silos”
• Data science is about refining the “crude oil”
How much data is generated by us
• With around 5.35 billion internet users worldwide, each user accounts for roughly 61 GB of data daily on average (about 120 ZB per year, spread over 365 days and all users).
• Facebook produces approximately 4,000 TB daily and ranked among the most visited sites globally in 2023.
• X (Twitter) garners around 500 million tweets daily, which amounts to 560 GB of data.
• In 2024, on average, TikTok videos produce approximately 7.35 TB of data daily.
• YouTube hosted over 720,000 hours of videos daily in 2023, which is equivalent to about
4.3 PB of data.
• Google processes around 3.5 billion searches daily, amounting to 20 PB.
• In 2023, the average internet user created about 1.7 MB of data per second, equal to
146,880 MB daily.
• At that rate, a household of 4 would create about 587,520 MB of data daily (4 × 146,880 MB).
• In 2023, the world created around 120 zettabytes (ZB) of data, which gives a rough estimate of 329,000 petabytes (PB) per day (120 ZB / 365 days, with 1 ZB = 10^6 PB).
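
The per-day figures can be re-derived from the yearly total and the user count. A quick arithmetic check, assuming decimal units (1 ZB = 10^21 bytes, 1 PB = 10^15 bytes) and a 365-day year; the input numbers are the ones quoted on this slide, and the variable names are illustrative:

```python
ZB = 10**21
PB = 10**15
GB = 10**9
MB = 10**6

yearly_bytes = 120 * ZB  # data created worldwide in 2023 (figure from the slide)
users = 5.35e9           # internet users worldwide (figure from the slide)

daily_bytes = yearly_bytes / 365
per_user_daily = daily_bytes / users

print(f"world, per day:    {daily_bytes / PB:,.0f} PB")
print(f"per user, per day: {per_user_daily / GB:.1f} GB")

# The quoted per-second rate for an average user, scaled to one day.
per_user_from_rate = 1.7 * MB * 86_400
print(f"1.7 MB/s for a day: {per_user_from_rate / MB:,.0f} MB")
```

The two per-user estimates come from different sources and need not agree exactly, but both land in the range of tens to hundreds of gigabytes per day rather than terabytes.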
Types of Data We Have
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
  • Social Network, Semantic Web (RDF), …
• Streaming Data
  • You can only afford to scan the data once
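
The one-scan constraint on streaming data means statistics have to be updated incrementally rather than recomputed from stored values. A minimal sketch using Welford's method for a running mean and variance; the class name and the sample readings are illustrative assumptions:

```python
class RunningStats:
    """Mean and variance computed in a single pass (Welford's method)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        """Fold one new value into the statistics, then discard it."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stand-in for a stream
    stats.update(reading)

print(stats.n, stats.mean, stats.variance)
```

Memory use stays constant no matter how long the stream runs, because only three numbers are kept.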
What To Do With These Data?
• Aggregation and Statistics
• Data warehousing and OLAP
• Indexing, Searching, and Querying
• Keyword based search
• Pattern matching (XML/RDF)
• Knowledge discovery
• Data Mining
• Statistical Modeling
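
As an illustration of the indexing and keyword-search item, the sketch below builds a small inverted index (word to document ids) and answers queries that require every keyword to be present. The documents, function names, and whitespace tokenization are assumptions made for the example, not a description of any particular search engine:

```python
from collections import defaultdict


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each lowercase word to the ids of the documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    1: "big data needs data science",
    2: "data mining finds patterns",
    3: "statistical modeling of big data",
}
index = build_index(docs)
print(search(index, "big data"))  # documents 1 and 3
```

The index is built once and each query then touches only the posting sets of its keywords, instead of rescanning every document.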
What is Data Science?
• An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data
• Data science (DS) is a multidisciplinary field of study whose goal is to address the challenges of big data
• Data science principles apply to all data – big and small

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
What is Data Science?
• “Data science, also known as data-driven science, is an interdisciplinary field
of scientific methods, processes, algorithms and systems to extract
knowledge or insights from data in various forms, either structured or
unstructured, similar to data mining.”
• “Data science intends to analyze and understand actual phenomena with
‘data’.
• In other words, the aim of data science is to reveal the features or the
hidden structure of complicated natural, human, and social phenomena with
data from a different point of view from the established or traditional theory
and method.”
What is Data Science?
• Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data to help decision makers in many areas such as science, engineering, economics, politics, finance, and education
• Computer Science
  • Pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
• Mathematics
  • Mathematical modeling
• Statistics
  • Statistical and stochastic modeling, probability
Data Science
What is Data Science
• “Data science produces insights. Machine learning produces predictions.”
ML with Data Analytics
Types of Data Science
| Task | Description | Algorithms | Examples |
|---|---|---|---|
| Classification | Predict whether a data point belongs to one of the predefined classes. The prediction is based on learning from a known data set. | Decision trees, neural networks, Bayesian models, rule induction, k-nearest neighbors | Assigning voters to known buckets by political party, e.g., soccer moms. Bucketing new customers into one of the known customer groups. |
| Regression | Predict the numeric target label of a data point. The prediction is based on learning from a known data set. | Linear regression, logistic regression | Predicting the unemployment rate for next year. Estimating insurance premiums. |
| Anomaly detection | Predict whether a data point is an outlier compared to other data points in the data set. | Distance based, density based, LOF | Detecting fraudulent credit card transactions. Network intrusion detection. |
| Time series | Predict the value of the target variable for a future time frame based on historical values. | Exponential smoothing, ARIMA, regression | Sales forecasting, production forecasting, virtually any growth phenomenon that needs to be extrapolated. |
| Clustering | Identify natural clusters within the data set based on inherent properties within the data set. | k-means, density-based clustering (DBSCAN) | Finding customer segments in a company based on transaction, web, and customer call data. |
| Association analysis | Identify relationships within an itemset based on transaction data. | FP-Growth, Apriori | Finding cross-selling opportunities for a retailer based on transaction purchase history. |
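
As an illustration of the classification task in the table above, here is a minimal k-nearest-neighbors sketch that buckets a new customer into one of two known groups. The customer features, labels, and function name are hypothetical, and the features are left unscaled for brevity:

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical customers: (monthly visits, average basket in dollars) -> group
train = [
    ((1.0, 20.0), "occasional"),
    ((2.0, 25.0), "occasional"),
    ((1.5, 18.0), "occasional"),
    ((9.0, 80.0), "loyal"),
    ((10.0, 95.0), "loyal"),
    ((8.0, 85.0), "loyal"),
]

print(knn_predict(train, (2.0, 22.0)))  # its three nearest neighbors are "occasional"
print(knn_predict(train, (9.5, 90.0)))  # its three nearest neighbors are "loyal"
```

The "learning from a known data set" in the table corresponds to the labeled `train` list; k-nearest neighbors simply stores it and defers all work to prediction time.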
Data Science: What should we know?
• Core Algorithms
  • Classification: Decision Trees, Rule Induction, k-Nearest Neighbors, Naïve Bayesian, Artificial Neural Networks, Support Vector Machines, Ensemble Learners
  • Regression: Linear Regression, Logistic Regression
  • Association Analysis: Apriori, FP-Growth
  • Clustering: k-Means, DBSCAN, Self-Organizing Maps
• Process Basics: Data Science Process, Data Exploration, Model Evaluation
• Common Applications: Text Mining, Time Series Forecasting, Anomaly Detection, Feature Selection
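
Of the clustering algorithms listed, k-means is the easiest to sketch. The version below is a bare-bones Lloyd's iteration that, as a simplifying assumption, starts from the first k points instead of a random or k-means++ initialization; the sample points and function name are illustrative:

```python
from math import dist


def k_means(points, k, iterations=10):
    """Lloyd's algorithm; starts from the first k points (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

Unlike classification, no labels are supplied: the two groups emerge from the distances between the points alone.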
Solving Problems with Data
