Foundations of Data Science
(CS F320)
Prof. N. L. Bhanu Murthy
BITS Pilani
Hyderabad Campus
What is Science?
“Science” is derived from the Latin word scientia, meaning
"knowledge"
“Science is the study of the nature and behavior of natural things and
the knowledge that we obtain about them.” - Collins
BITS Pilani, Hyderabad Campus
Physical Science
Physical Science is the scientific study of non-living matter.
– Chemistry
The study of all forms of matter,
including how matter interacts
with other matter.
– Physics
The study of energy and
how it affects matter.
Data..
“Data is the New Oil” – World Economic Forum 2011
Data Science
Data science is an interdisciplinary field that uses methods,
processes, algorithms and systems to extract knowledge and
insights from data in various forms, both structured and
unstructured
“The ability to take data — to be able to understand it, to process it, to extract
value from it, to visualize it, to communicate it — that’s going to be a hugely
important skill in the next decades.”
Hal Varian, chief economist at Google and UC Berkeley professor of
information sciences, business, and economics
Data Science Pipeline
Machine Learning
Machine Learning is the study of
algorithms that
improve performance P
at some task T
with experience E
Tom Mitchell (1997)
Well-defined learning task: <P,T,E>
Regression
Supervised learning
Unsupervised Learning
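As a rough illustration of the <P, T, E> framing, the sketch below uses a toy setup that is not from the slides. Task T is labelling a number in [0, 1) as 1 when it is at least a hypothetical threshold of 0.37, experience E is n evenly spaced labelled examples, and performance P is the accuracy of a 1-nearest-neighbour rule on a fixed test grid. The threshold, the grids and the choice of learner are all assumptions made for illustration.

```python
# Toy illustration of <P, T, E>; threshold, grids and the 1-NN learner are assumptions.
THRESHOLD = 0.37  # hypothetical "true" decision boundary


def true_label(x):
    return 1 if x >= THRESHOLD else 0


def make_experience(n):
    """Experience E: n evenly spaced labelled examples from [0, 1)."""
    xs = [(i + 0.5) / n for i in range(n)]
    return [(x, true_label(x)) for x in xs]


def predict_1nn(train, x):
    """Task T: label x with the label of its nearest training example."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def accuracy(train, n_test=1000):
    """Performance P: fraction of test grid points labelled correctly."""
    test_xs = [j / n_test for j in range(n_test)]
    hits = sum(predict_1nn(train, x) == true_label(x) for x in test_xs)
    return hits / n_test


for n in (2, 10, 100):
    print(n, accuracy(make_experience(n)))
```

With only two examples the learner places the boundary far from the true threshold. With a hundred examples it is nearly exact, which is the sense in which P improves with E on T.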
Data Mining
Data mining (knowledge discovery from data)
“Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns or knowledge from huge
amount of data”
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Deviation Detection [Predictive]
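To make association rule discovery concrete, the sketch below computes the two standard measures for a rule X -> Y: support (the fraction of transactions containing both X and Y) and confidence (the fraction of transactions containing X that also contain Y). The basket data and item names are hypothetical and chosen only for illustration.

```python
# Hypothetical market-basket data; item names are made up for illustration.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def count(itemset, transactions):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the whole itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, the fraction that also have the consequent."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)


antecedent, consequent = {"diapers"}, {"beer"}
print(support(antecedent | consequent, TRANSACTIONS))
print(confidence(antecedent, consequent, TRANSACTIONS))
```

The `confidence` helper assumes the antecedent occurs in at least one transaction; a rule-mining algorithm would normally prune itemsets below a minimum support before reaching this step.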
Information Retrieval
[Figure: a query and a document collection are passed to the IR (retrieval) system, which returns an answer list.]
Example query: “RRR vs KGF 2: which is better in Hindi?”
Multimedia Information Retrieval (MIR)
Recommender systems
Web search
Probability Foundations of Data Science
“Probability is a mathematical tool to model uncertainty”
Frequentist vs Bayesian perspective of Probability
Probability distributions – Gaussian, Beta, Bernoulli and Binomial
Maximum likelihood and Bayesian Inference
Probabilistic perspective of Polynomial Curve Fitting
Bayesian Curve Fitting
Mixture of Gaussians and Probability Bounds
Nonparametric Methods - Nearest-neighbour methods
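The contrast between maximum likelihood and Bayesian inference can be sketched with a Bernoulli coin-toss model. The maximum likelihood estimate is the observed fraction of heads, while a Beta(a, b) prior yields a Beta(a + heads, b + tails) posterior. The prior parameters a = b = 2 and the data (three tosses, all heads) are assumptions chosen for illustration.

```python
def bernoulli_mle(heads, n):
    """Maximum likelihood estimate of P(heads): the observed fraction."""
    return heads / n


def beta_posterior(heads, n, a=2.0, b=2.0):
    """Posterior Beta(a + heads, b + tails) under an assumed Beta(a, b) prior."""
    return a + heads, b + (n - heads)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


heads, n = 3, 3  # hypothetical data: three tosses, all heads
print(bernoulli_mle(heads, n))
print(beta_mean(*beta_posterior(heads, n)))
```

On so little data the maximum likelihood estimate claims that tails can never occur, whereas the posterior mean stays strictly between 0 and 1 and moves toward the observed fraction as more tosses arrive.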
Decision & Information Theory Foundations
Minimizing Misclassification rate & expected loss
Inference and decision
Loss functions for regression
Relative Entropy and Mutual Information
Decision Tree
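Relative entropy and mutual information can be computed directly from their definitions for small discrete distributions. The sketch below measures both in bits and treats mutual information as the relative entropy between a joint table and the product of its marginals. It assumes q is non-zero wherever p is non-zero, and the example tables are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Relative entropy KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I(X; Y) in bits from a joint probability table given as a list of rows."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells us nothing about Y
identical = [[0.5, 0.0], [0.0, 0.5]]        # X determines Y completely
print(kl_divergence([0.5, 0.5], [0.25, 0.75]))
print(mutual_information(independent), mutual_information(identical))
```

Mutual information is zero for the independent table and one bit for the table where the two binary variables always agree.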
Computational Foundations of Data Science
Unconstrained/Constrained optimization
Equality/inequality constraints
Convex optimization
Lagrange multiplier
Primal/dual concept
Quadratic programming
Kernel Machines for Regression
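A small worked case may help with the Lagrange multiplier idea: maximize f(x, y) = 1 - x^2 - y^2 subject to the equality constraint x + y = 1. The problem is chosen here for illustration and is not taken from the slides. The sketch solves the stationarity conditions by hand in the comments and then checks the answer with a brute-force scan along the constraint line.

```python
def f(x, y):
    return 1 - x**2 - y**2


# Lagrangian: L(x, y, lam) = f(x, y) + lam * (x + y - 1)
# Stationarity conditions:
#   dL/dx   = -2x + lam   = 0
#   dL/dy   = -2y + lam   = 0
#   dL/dlam = x + y - 1   = 0
# The first two give x = y = lam / 2; the third then gives lam = 1.
lam = 1.0
x_star = y_star = lam / 2

# Brute-force check: scan points (t, 1 - t) on the constraint line.
candidates = [t / 1000 for t in range(-1000, 2001)]
best_t = max(candidates, key=lambda t: f(t, 1 - t))

print((x_star, y_star), f(x_star, y_star))
print(best_t)
```

The scan and the analytic solution agree on the point (0.5, 0.5), where the gradient of f is parallel to the gradient of the constraint.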
Curse of Dimensionality
Dimensionality Reduction
Principal Component Analysis
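Principal Component Analysis can be sketched in two dimensions without any libraries, because the eigenvalues of a symmetric 2x2 covariance matrix have a closed form. The data below are hypothetical points lying exactly on the line y = 2x, so a single principal direction should capture all of the variance. The function name `pca_2d` is an illustrative choice.

```python
import math


def pca_2d(points):
    """Return (eigenvalues, first principal direction) of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Covariance entries (dividing by n; dividing by n - 1 only rescales eigenvalues).
    a = sum((x - mx) ** 2 for x, _ in points) / n
    c = sum((y - my) ** 2 for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form eigenvalues of the symmetric matrix [[a, b], [b, c]].
    mid = (a + c) / 2
    rad = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + rad, mid - rad
    # Eigenvector for lam1; fall back to an axis when the matrix is already diagonal.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)


points = [(-2, -4), (-1, -2), (0, 0), (1, 2), (2, 4)]  # hypothetical data on y = 2x
(lam1, lam2), direction = pca_2d(points)
print(lam1, lam2, direction)
```

Because the second eigenvalue is zero here, projecting each point onto `direction` reduces the data from two dimensions to one with no loss of variance.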
Data Visualization
Mapping Data to Graphical Elements like
Histograms and Pie Charts
Box Plot
Percentile Plots
Empirical Cumulative Distribution Functions
Scatter Plots
Visualizing Spatio-temporal Data
OLAP and Multidimensional Data Analysis
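Of the plots listed above, the empirical cumulative distribution function is the simplest to compute by hand: F(x) is the fraction of observations less than or equal to x. The sketch below builds F from a small hypothetical sample and also returns the (value, cumulative fraction) pairs that a plotting library would draw as a step plot. The helper names are illustrative.

```python
from bisect import bisect_right


def make_ecdf(sample):
    """Return F where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def ecdf(x):
        return bisect_right(ordered, x) / n

    return ecdf


def ecdf_points(sample):
    """(value, cumulative fraction) pairs, ready to draw as a step plot."""
    ordered = sorted(sample)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


sample = [3, 1, 4, 1, 5]  # hypothetical measurements
F = make_ecdf(sample)
print(F(0), F(1), F(3.5), F(5))
print(ecdf_points(sample))
```

Reading the function the other way gives percentiles: the smallest x with F(x) >= 0.5 is a median of the sample.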
Data Preprocessing Techniques
Types of Data
Data Quality
Feature Extraction
Feature Subset Selection
Discretization and Binarization
Measures of Similarity and Dissimilarity
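Several of the preprocessing topics above reduce to a few lines of code. The sketch below shows equal-width discretization, one common form of binarization (one-hot encoding of a categorical value), and three standard proximity measures. All function names and example values are illustrative assumptions, and `equal_width_bins` assumes the values are not all identical.

```python
import math


def equal_width_bins(values, k):
    """Discretization: map each value to one of k equal-width bins (0 .. k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def binarize(value, categories):
    """Binarization: one-hot encode a categorical value."""
    return [1 if value == c else 0 for c in categories]


def euclidean(u, v):
    """Dissimilarity: straight-line distance between two numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Similarity: cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def jaccard(s, t):
    """Similarity: shared items divided by all items across two sets."""
    return len(s & t) / len(s | t)


print(equal_width_bins([0, 2.5, 5, 7.5, 10], 4))
print(binarize("red", ["red", "green", "blue"]))
print(euclidean([0, 0], [3, 4]), jaccard({"a", "b", "c"}, {"b", "c", "d"}))
```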
Introduction to Big Data & Analytics
Applications of Data Science
PageRank: The web as a behavioral dataset
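The idea behind PageRank can be sketched with power iteration on a tiny link graph: a page is important if important pages link to it. The three-page web, the damping factor of 0.85 and the fixed iteration count below are assumptions for illustration. The sketch also assumes every link target appears as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration on a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a baseline share from the random-jump term.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical three-page web
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page C ends up ranked highest because both other pages link to it, and the ranks sum to 1 so they can be read as a probability distribution over pages.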
Applications of Data Science
Sponsored search
Google earns around $50 bn/year from marketing, about 97%
of the company's revenue.
Sponsored search uses an auction – a pure competition for
marketers trying to win access to consumers.
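The slide does not name the auction format, so the sketch below assumes a simplified generalized second-price (GSP) auction, a format commonly described for sponsored search. Advertisers are ranked by bid, and each winner pays the bid of the advertiser ranked just below. Quality scores, reserve prices and budgets are ignored here, and the advertiser names and bids are hypothetical.

```python
def gsp_auction(bids, slots):
    """Simplified generalized second-price auction (an assumed format).

    bids: dict mapping advertiser -> bid per click.
    slots: number of ad positions available.
    Returns [(advertiser, price_per_click)] from the top slot down.
    """
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    results = []
    for i, (advertiser, _) in enumerate(ranked[:slots]):
        # Each winner pays the next-highest bid, or 0 if nobody is ranked below.
        price = ranked[i + 1][1] if i + 1 < len(ranked) else 0.0
        results.append((advertiser, price))
    return results


bids = {"shoes.example": 3.0, "boots.example": 2.0, "socks.example": 1.0}
print(gsp_auction(bids, slots=2))
```

Charging the next-highest bid instead of the winner's own bid means a winner's price is set by the competition for the slot, not by how much that winner chose to bid.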
Applications of Data Science
“In the 21st century, the candidate with [the] best
data, merged with the best messages dictated by that
data, wins.”
Andrew Rasiej, Personal Democracy Forum
Teaching & Evaluation (CS F320 – L P U – 3 0 3)
Evaluation Components & Criteria
Component                      Duration   Weightage  Date & Time  Mode
Mid Sem Test                   90 mins    30%        TBA          Closed
Class Participation            5-10 mins  10%        Surprise     Open
Programming Assignments (2-3)  -          20%        TBA          Open
Comprehensive                  180 mins   40%        TBA          Closed
Make-up Policy: Make-ups will be granted only in genuine cases and only with
prior permission.
Course Notices: All notices will be put up in CMS and students are strongly
advised to log in to CMS and look for notices quite often.
Chamber Consultation: Tuesday 5PM – 6PM
Text Book
T1. Christopher Bishop: Pattern Recognition and Machine Learning,
Springer International Edition.
Text Book
T2. Pang-Ning Tan and others: Introduction to Data Mining, Pearson Education,
2006.
Reference Book
R1. Avrim Blum, John Hopcroft, Ravindran Kannan: Foundations of Data
Science, CUP
Reference Book
R2. Tom M. Mitchell: Machine Learning, The McGraw-Hill Companies, Inc.
Thank You!!