CSIC 221: Machine Learning & Data Analytics: Mayank Dave Professor Dept. of Computer Engineering

The document outlines the field of Machine Learning and Data Analytics, emphasizing the significance of data science in managing and interpreting vast amounts of data generated globally. It discusses the challenges of big data, the roles of data scientists, and various types of data and analytical techniques used in data science. Additionally, it highlights the importance of understanding data science principles across multiple disciplines to derive insights and predictions from data.

CSIC 221: Machine Learning & Data Analytics

Mayank Dave
Professor
Dept. of Computer Engineering
Outline
• Data, Big Data and Challenges
• Data Science
• Introduction
• Why Data Science
• Data Scientists
• What do they do?
• Major/Concentration in Data Science
• What courses to take.
Data All Around
• Lots of data is being collected
and warehoused
• Web data, e-commerce
• Financial transactions, bank/credit transactions
• Online trading and purchasing
• Social Network
How Much Data Do We have?
• In 2020, users generated 64.2 ZB of data, which exceeded the number
of detectable stars in the cosmos. (1 ZB = 10^21 bytes)
• Data creation reached about 147 ZB by the end of 2024.
• Google processes 20 PB a day (2008)
• Facebook has 60 TB of daily logs (In 2023, 4PB per day)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• 1000 genomes project: 200 TB
• Cost of 1 TB of disk: $35
• Time to read 1 TB disk: 3 hrs (100 MB/s)
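The read-time figure follows from simple arithmetic. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant sequential throughput; the function name is illustrative and not from the slides.

```python
def hours_to_read(size_bytes: float, throughput_bytes_per_s: float) -> float:
    """Hours needed to scan size_bytes once at a constant throughput."""
    seconds = size_bytes / throughput_bytes_per_s
    return seconds / 3600


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

hours = hours_to_read(1 * TB, 100 * MB)
print(f"{hours:.2f} hours")  # 2.78 hours, which the slide rounds to 3
```

At this rate a single pass over even one terabyte takes hours, which is why scanning is treated as expensive at big-data scale.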
Big Data
• Big Data is any data that is expensive to manage and hard to
extract value from
• Volume
• The size of the data (also scope)
• Velocity
• The latency of data processing relative to the growing demand
for interactivity (accumulation)
• Variety and Complexity
• The diversity of sources, formats, quality, structures (structured and
unstructured).
Big Data (The 3V Model)
•The “3V model” attempts to lay this out in
a simple way.
•Each of these three Vs regarding data has
dramatically increased in recent years.
•Specifically, the increasing volume of
heterogeneous and unstructured (text,
images, and video) data, as well as the
possibilities emerging from their analysis,
renders data science ever more essential.
Big Data and Data Science
•They are not the “same thing”
•If Big data = crude oil
•Then, Big data is about extracting “crude oil”, transporting it in
“mega tankers”, siphoning it through “pipelines”, and storing it in
“massive silos”
•Data science is about refining the “crude oil”
How much data is generated by us
• With around 5.35 billion internet users worldwide, each person can potentially generate
approximately 15.87 TB of data daily.
• Facebook produces approximately 4,000 TB daily, ranking #1 in most visited sites globally
in 2023.
• X (Twitter) garners around 500 million tweets daily, which amounts to 560 GB of data.
• In 2024, on average, TikTok videos produce approximately 7.35 TB of data daily.
• YouTube hosted over 720,000 hours of videos daily in 2023, which is equivalent to about
4.3 PB of data.
• Google processes around 3.5 billion searches daily, amounting to 20 PB.
• In 2023, the average internet user created about 1.7 MB of data per second, equal to
146,880 MB daily.
• An average household of 4 can create about 506,736 MB of data daily.
• In 2023, the world created around 120 zettabytes (ZB), which gives a rough estimate of
337,080 petabytes (PB) of daily data.
• In context, there are around 5.35 billion internet users globally, meaning each user can
create about 15.87 TB of data daily.
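The daily figures above can be cross-checked with a few lines of arithmetic. This is a rough sketch, assuming decimal units and a 365-day year; the variable names are illustrative. Under these assumptions the yearly total works out to tens of gigabytes per user per day, so the per-user numbers quoted on the slide (15.87 TB, 146,880 MB) are best read as separate estimates rather than values derived from one another.

```python
ZB = 10**21
PB = 10**15
GB = 10**9

world_per_year = 120 * ZB   # 2023 world total quoted on the slide
users = 5.35e9              # internet users quoted on the slide

# Spread the yearly total evenly over days, then over users.
world_per_day = world_per_year / 365
per_user_per_day = world_per_day / users

# Per-user rate quoted on the slide: 1.7 MB per second.
mb_per_user_per_day = 1.7 * 86_400
mb_per_household_of_four = 4 * mb_per_user_per_day

print(f"world per day:       {world_per_day / PB:,.0f} PB")
print(f"per user per day:    {per_user_per_day / GB:,.1f} GB")
print(f"1.7 MB/s over a day: {mb_per_user_per_day:,.0f} MB")
print(f"household of four:   {mb_per_household_of_four:,.0f} MB")
```

The 1.7 MB/s rate reproduces the 146,880 MB figure exactly. The daily world total comes out slightly below the slide's 337,080 PB, and four users at that rate give more than the 506,736 MB quoted for a household, so those two numbers presumably rest on different assumptions.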
Types of Data We Have
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
• You can afford to scan the data once
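The single-scan constraint on streaming data can be illustrated with a one-pass statistic. The sketch below uses Welford's online algorithm for the mean and variance, which is one common choice and not something the slides prescribe; the function name and sample readings are hypothetical.

```python
from typing import Iterable, Tuple


def streaming_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance) in one pass over the stream."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


# A generator can only be consumed once, like a real stream.
readings = (float(v) for v in [2, 4, 4, 4, 5, 5, 7, 9])
count, mean, variance = streaming_mean_variance(readings)
print(count, mean, variance)
```

Only three numbers are kept in memory regardless of how long the stream is, so no element ever needs to be revisited.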
What To Do With These Data?
• Aggregation and Statistics
• Data warehousing and OLAP
• Indexing, Searching, and Querying
• Keyword based search
• Pattern matching (XML/RDF)
• Knowledge discovery
• Data Mining
• Statistical Modeling
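Aggregation, the first item above, is the basic operation behind OLAP-style roll-ups. A minimal sketch in plain Python, assuming hypothetical transaction records of the form (region, product, amount); a real data warehouse would do this with SQL or a dedicated engine.

```python
from collections import defaultdict

# Hypothetical transaction records: (region, product, amount)
sales = [
    ("north", "books", 120.0),
    ("north", "music", 80.0),
    ("south", "books", 200.0),
    ("south", "books", 50.0),
]


def roll_up(rows, key_index):
    """Sum the amount column grouped by the column at key_index."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


by_region = roll_up(sales, 0)
by_product = roll_up(sales, 1)
print(by_region)   # {'north': 200.0, 'south': 250.0}
print(by_product)  # {'books': 370.0, 'music': 80.0}
```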
What is Data Science?
•An area that manages, manipulates, extracts, and interprets
knowledge from tremendous amounts of data
•Data science (DS) is a multidisciplinary field of study whose goal is to
address the challenges of big data
•Data science principles apply to all data – big and small

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
What is Data Science?
•“Data science, also known as data-driven science, is an
interdisciplinary field of scientific methods, processes, algorithms
and systems to extract knowledge or insights from data in various
forms, either structured or unstructured, similar to data mining.”
• “Data science intends to analyze and understand actual phenomena with
‘data’.
• In other words, the aim of data science is to reveal the features or the
hidden structure of complicated natural, human, and social phenomena with
data from a different point of view from the established or traditional theory
and method.”
What is Data Science?
• Theories and techniques from many fields and disciplines are used to
investigate and analyze a large amount of data to help decision makers in
many industries such as science, engineering, economics, politics, finance,
and education
•Computer Science
• Pattern recognition, visualization, data warehousing, High performance
computing, Databases, AI
•Mathematics
• Mathematical Modeling
•Statistics
• Statistical and Stochastic modeling, Probability.
Data Science
What is Data Science
•“Data science produces insights.
•Machine learning produces
predictions”
ML with Data Analytics
Types of Data Science
| Tasks | Description | Algorithms | Examples |
|---|---|---|---|
| Classification | Predict if a data point belongs to one of the predefined classes. The prediction will be based on learning from a known data set. | Decision trees, neural networks, Bayesian models, induction rules, k-nearest neighbors | Assigning voters into known buckets by political parties, e.g., soccer moms. Bucketing new customers into one of the known customer groups. |
| Regression | Predict the numeric target label of a data point. The prediction will be based on learning from a known data set. | Linear regression, logistic regression | Predicting the unemployment rate for next year. Estimating insurance premiums. |
| Anomaly detection | Predict if a data point is an outlier compared to other data points in the data set. | Distance based, density based, LOF | Fraudulent transaction detection in credit cards. Network intrusion detection. |
| Time series | Predict the value of the target variable for a future time frame based on historical values. | Exponential smoothing, ARIMA, regression | Sales forecasting, production forecasting, virtually any growth phenomenon that needs to be extrapolated. |
| Clustering | Identify natural clusters within the data set based on inherent properties within the data set. | k-means, density-based clustering (DBSCAN) | Finding customer segments in a company based on transaction, web and customer call data. |
| Association analysis | Identify relationships within an itemset based on transaction data. | FP-Growth, Apriori | Finding cross-selling opportunities for a retailer based on transaction purchase history. |
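The classification row can be made concrete with k-nearest neighbors, one of the algorithms it lists. This is a minimal sketch using plain Euclidean distance and majority voting; the customer features, labels, and function name are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points.

    train is a list of (features, label) pairs; features are numeric tuples.
    """
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical known customers: (monthly_visits, average_basket) -> group
train = [
    ((1.0, 10.0), "occasional"),
    ((2.0, 12.0), "occasional"),
    ((1.5, 9.0), "occasional"),
    ((8.0, 40.0), "loyal"),
    ((9.0, 42.0), "loyal"),
    ((10.0, 38.0), "loyal"),
]

print(knn_predict(train, (8.5, 41.0)))  # loyal
print(knn_predict(train, (1.2, 11.0)))  # occasional
```

This matches the "bucketing new customers into one of the known customer groups" example: the groups are known in advance and the model only assigns new points to them.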
Data Science
What Core Algorithms should we know?
• Process Basics: Data Science Process, Data Exploration, Model Evaluation
• Classification: Decision Trees, Rule Induction, k-Nearest Neighbors, Naïve Bayesian, Artificial Neural Networks, Support Vector Machines, Ensemble Learners
• Regression: Linear Regression, Logistic Regression
• Association Analysis: Apriori, FP-Growth
• Clustering: k-Means, DBSCAN, Self-Organizing Maps
• Common Applications: Text Mining, Time Series Forecasting, Anomaly Detection, Feature Selection
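Of the clustering algorithms listed, k-Means is the simplest to sketch. The version below is a minimal illustration of Lloyd's iteration with caller-supplied starting centroids and a fixed number of iterations; the points and starting centroids are hypothetical, and production implementations add random initialisation and a convergence test.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid in place
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

Unlike classification, no labels are supplied: the two groups emerge only from the distances between points.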
Solving Problems with Data
