CSIC 221: Machine Learning & Data Analytics: Mayank Dave Professor Dept. of Computer Engineering
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
What is Data Science?
•“Data science, also known as data-driven science, is an
interdisciplinary field of scientific methods, processes, algorithms
and systems to extract knowledge or insights from data in various
forms, either structured or unstructured, similar to data mining.”
• “Data science intends to analyze and understand actual phenomena with
‘data’.
• In other words, the aim of data science is to reveal the features or the
hidden structure of complicated natural, human, and social phenomena with
data from a different point of view from the established or traditional theory
and method.”
What is Data Science?
• Theories and techniques from many fields and disciplines are used to
investigate and analyze large amounts of data to help decision makers in
many domains such as science, engineering, economics, politics, finance,
and education.
•Computer Science
• Pattern recognition, visualization, data warehousing, high-performance
computing, databases, AI
•Mathematics
• Mathematical Modeling
•Statistics
• Statistical and Stochastic modeling, Probability.
Data Science
What is Data Science
•“Data science produces insights.
•Machine learning produces predictions.”
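The distinction can be illustrated with a minimal sketch. The data below (hours studied versus exam score) and the helper names `fit_line` and `predict` are hypothetical, chosen only for illustration. The same least-squares fit yields an insight (a statement about the data already collected) and a prediction (an estimate for an unseen input).

```python
from statistics import mean

# Hypothetical toy data (not from the slides): hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(hours, scores)

# Insight: a statement about the data we already have.
print(f"Average score: {mean(scores):.1f}; "
      f"each extra hour is associated with about {slope:.1f} more points")

def predict(x):
    """Prediction: an estimate for an input we have not observed."""
    return slope * x + intercept

print(f"Predicted score after 6 hours: {predict(6.0):.1f}")
```

The fitted slope serves both purposes: read on its own it summarizes a relationship in the observed data, and plugged into `predict` it extrapolates to a new case.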
ML with Data Analytics
Types of Data Science
| Tasks | Description | Algorithms | Examples |
|---|---|---|---|
| Classification | Predict whether a data point belongs to one of the predefined classes. The prediction is based on learning from a known data set. | Decision trees, neural networks, Bayesian models, rule induction, k-nearest neighbors | Assigning voters to known buckets by political party, e.g., soccer moms. Bucketing new customers into one of the known customer groups. |
| Regression | Predict the numeric target label of a data point. The prediction is based on learning from a known data set. | Linear regression, logistic regression | Predicting the unemployment rate for next year. Estimating an insurance premium. |
| Anomaly detection | Predict whether a data point is an outlier compared to other data points in the data set. | Distance based, density based, LOF | Fraudulent transaction detection in credit cards. Network intrusion detection. |
| Time series | Predict the value of the target variable for a future time frame based on historical values. | Exponential smoothing, ARIMA, regression | Sales forecasting, production forecasting, virtually any growth phenomenon that needs to be extrapolated. |
| Clustering | Identify natural clusters within the data set based on inherent properties of the data set. | k-means, density-based clustering (DBSCAN) | Finding customer segments in a company based on transaction, web, and customer call data. |
| Association analysis | Identify relationships within an itemset based on transaction data. | FP-Growth, Apriori | Finding cross-selling opportunities for a retailer based on transaction purchase history. |
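To make the classification task concrete, here is a minimal sketch of k-nearest neighbors applied to the "bucketing new customers into known customer groups" example. The customer features, group labels, and the function name `knn_classify` are all hypothetical assumptions for illustration; a real project would more likely use a library implementation.

```python
from collections import Counter
from math import dist

# Hypothetical labeled customers (not from the slides):
# (annual spend in $1000s, visits per month) -> known customer group.
known = [
    ((1.0, 1.0), "occasional"),
    ((1.5, 2.0), "occasional"),
    ((2.0, 1.5), "occasional"),
    ((8.0, 9.0), "loyal"),
    ((9.0, 8.5), "loyal"),
    ((8.5, 10.0), "loyal"),
]

def knn_classify(point, labeled, k=3):
    """Return the majority label among the k nearest labeled points."""
    # Sort the known points by Euclidean distance to the new point.
    nearest = sorted(labeled, key=lambda item: dist(point, item[0]))[:k]
    # Each of the k neighbors votes with its label.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Bucket two new customers into the known groups.
print(knn_classify((7.5, 8.0), known))
print(knn_classify((2.0, 2.0), known))
```

The "learning from a known data set" in the task description corresponds here to simply storing the labeled points; the prediction for a new point comes from a vote among its closest stored neighbors.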
Data Science
What should we know?
•Process Basics
• Data Science Process, Data Exploration, Model Evaluation
•Core Algorithms
• Classification: Decision Trees, Rule Induction, k-Nearest Neighbors, Naïve Bayesian, Artificial Neural Networks, Support Vector Machines, Ensemble Learners
• Regression: Linear Regression, Logistic Regression
• Association Analysis: Apriori, FP-Growth
• Clustering: k-Means, DBSCAN, Self-Organizing Maps
•Common Applications
• Text Mining, Time Series Forecasting, Anomaly Detection
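As one concrete instance of the clustering algorithms listed above, the following is a minimal sketch of k-means. The sample points, the choice of initial centroids, and the function name `k_means` are hypothetical assumptions; practical implementations add smarter initialization and a convergence check.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid-update steps."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, clusters

# Hypothetical 2-D data with two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Assumption: start from one point in each group; real code would
# usually pick initial centroids at random or with k-means++.
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Unlike the classification sketch, no labels are supplied here: the groups emerge only from distances between the points, which is what distinguishes clustering from classification.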
Solving Problems with Data