Module 1
[21CS644]
Course Outline
•Introduction to Data Science
•Exploratory Data Analysis and the Data Science Process
•Feature Generation and Feature Selection
•Data Visualization and Data Exploration
•A Deep Dive into Matplotlib
Course Outcomes
CO1 (L3): Understand the different types of data and the skill set required for data science.
CO2 (L3): Apply different techniques for exploratory data analysis and the data science process.
CO3 (L4): Analyze feature selection algorithms and design a recommender system.
CO4 (L3): Apply data visualization tools and libraries to plot graphs.
CO5 (L4): Analyze the different plots and layouts in Matplotlib.
Learning Resources
Textbooks
1. Doing Data Science, Cathy O’Neil and Rachel Schutt, O'Reilly Media, Inc., 2013
2. Data Visualization Workshop, Tim Grobmann and Mario Dobler, Packt Publishing, ISBN 9781800568112
References
1. Mining of Massive Datasets, Anand Rajaraman and Jeffrey D. Ullman, Cambridge University Press, 2010
2. Data Science from Scratch, Joel Grus, Shroff Publisher / O’Reilly Media
3. A Handbook for Data Driven Design, Andy Kirk
Weblinks and Video Lectures
1. https://nptel.ac.in/courses/106/105/106105077/
2. https://www.oreilly.com/library/view/doing-data-science/9781449363871/toc01.html
3. http://book.visualisingdata.com/
4. https://matplotlib.org/
5. https://docs.python.org/3/tutorial/
6. https://www.tableau.com/
Assessment Details
Continuous Internal Evaluation (CIE) - 50 Marks
Three tests - 20 marks each (60 marks)
Two assignments - 10 marks each (20 marks)
One assignment (quiz) - 20 marks
Tests + assignments = 100 marks, scaled down to 50 marks
Final CIE marks = 50 marks
• The parameter μ is the mean (and median) and controls where the
distribution is centered (because this is a symmetric distribution), and the
parameter σ controls how spread out the distribution is. The normal density
is p(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)).
• This is the general functional form, but for specific real-world
phenomena these parameters take actual numerical values, which we
can estimate from the data.
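The estimation step can be sketched in Python (a minimal illustration assuming NumPy is available; the data here are simulated, not real measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data drawn from a normal distribution with mu=5, sigma=2
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Estimate the parameters from the data: the sample mean estimates mu,
# and the sample standard deviation estimates sigma
mu_hat = data.mean()
sigma_hat = data.std(ddof=1)
print(mu_hat, sigma_hat)  # both close to the true values 5 and 2
```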
• A random variable denoted by x or y can be assumed to have a
corresponding probability distribution, p(x) , which maps x to a
positive real number.
• In order to be a probability density function, we’re restricted to the set
of functions such that if we integrate p(x) to get the area under the curve,
it is 1, so it can be interpreted as a probability.
• For example, let x be the amount of time until the next bus arrives
(measured in minutes).
• x is a random variable because there is variation and uncertainty in the
amount of time until the next bus arrives.
• Suppose we know (for the sake of argument) that the time until the
next bus has the probability density function p(x) = 2e^(−2x).
• If we want to know the likelihood of the next bus arriving between
12 and 13 minutes, we find the area under the curve between 12 and 13,
i.e., ∫₁₂¹³ 2e^(−2x) dx.
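Both facts about this density, that the total area is 1 and that the interval probability is an area under the curve, can be checked numerically (a sketch assuming SciPy is available):

```python
import numpy as np
from scipy.integrate import quad

# The density assumed in the bus example: p(x) = 2e^(-2x)
def p(x):
    return 2 * np.exp(-2 * x)

# For p to be a valid density, the total area under the curve must be 1
total, _ = quad(p, 0, np.inf)

# Probability that the next bus arrives between 12 and 13:
# the area under the curve on that interval
prob, _ = quad(p, 12, 13)
print(total, prob)
```

Note that the interval probability has the closed form e^(−24) − e^(−26), an extremely small number: with this density the average wait is half a minute, so waiting 12 minutes is practically impossible.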
How do we know the right distribution to use?
There are two possible ways:
• Conduct an experiment where we show up at the bus stop at a random
time, measure how much time until the next bus, and repeat this
experiment over and over again.
• Then we look at the measurements, plot them, and approximate the
function as discussed.
• Because we are familiar with the fact that “waiting time” is a common
enough real-world phenomenon that a distribution called the
exponential distribution has been invented to describe it, we know
that it takes the form p(x) = λe^(−λx).
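The experiment described above can be sketched as follows (assuming NumPy; the waiting times are simulated with a known rate so the estimate can be checked against the truth):

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulate 5,000 repetitions of the bus-stop experiment, with
# exponentially distributed waiting times (true rate lambda = 2)
waits = rng.exponential(scale=1 / 2, size=5_000)

# For the exponential distribution p(x) = lambda * e^(-lambda * x),
# the maximum likelihood estimate of lambda is 1 / (sample mean)
lam_hat = 1 / waits.mean()
print(lam_hat)  # close to the true rate 2
```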
• In addition to denoting distributions of single random variables with
functions of one variable, we use multivariate functions called joint
distributions to do the same thing for more than one random
variable.
• So in the case of two random variables, for example, we could denote
our distribution by a function p(x, y) , and it would take values in the
plane and give us nonnegative values.
• In keeping with its interpretation as a probability, its (double) integral
over the whole plane would be 1.
• We also have the conditional distribution p(x|y), which is to be
interpreted as the density function of x given a particular value of y.
Example
• Suppose we have a set of user-level data for Amazon.com that lists, for
each user, the amount of money spent last month on Amazon, whether the
user is male or female, and how many items they looked at before adding
the first item to the shopping cart.
• If we consider X to be the random variable that represents the amount of
money spent, then we can look at the distribution of money spent across
all users and represent it as p(X).
• Take the subset of users who looked at more than five items before buying
anything, and look at the distribution of money spent among these users.
• Let Y be the random variable that represents the number of items looked
at; then p(X | Y > 5) would be the corresponding conditional distribution.
• A conditional distribution has the same properties as a regular
distribution: when we integrate it, it sums to 1, and it takes
nonnegative values.
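A simulation of this example (a sketch assuming NumPy; the user data and the distributions chosen are entirely hypothetical, not real Amazon data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
# Hypothetical users: Y = items viewed before the first add-to-cart,
# X = money spent, simulated so heavier browsers tend to spend more
items_viewed = rng.poisson(lam=4, size=n)                # Y
spend = rng.gamma(shape=2, scale=10 + 2 * items_viewed)  # X

# Marginal p(X): money spent across all users
mean_all = spend.mean()

# Conditional p(X | Y > 5): money spent among users who looked at
# more than five items before buying
mean_cond = spend[items_viewed > 5].mean()
print(mean_all, mean_cond)
```

In this simulation the conditional mean is higher than the marginal mean, because conditioning on Y > 5 restricts attention to the heavier browsers.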
• When we observe data points, i.e., (x1, y1), (x2, y2), ..., (xn, yn), we
are observing realizations of a pair of random variables.
• When we have an entire dataset with n rows and k columns, we are
observing n realizations of the joint distribution of those k random
variables.
Fitting a model
• Fitting a model means that you estimate the parameters of the model using
the observed data.
• You are using your data as evidence to help approximate the real-world
mathematical process that generated the data.
• Fitting the model often involves optimization methods and algorithms,
such as maximum likelihood estimation, to help get the parameters.
• When you estimate the parameters, they are actually estimators, meaning
they themselves are functions of the data.
• Once you fit the model, you can actually write it as
y = 7.2 + 4.5x, for example, which means that your best guess is that this
equation or functional form expresses the relationship between your two
variables, based on your assumption that the data followed a linear pattern.
• Fitting the model is when you start actually coding:
• your code will read in the data, and you’ll specify the functional form
that you wrote down on the piece of paper.
• Then R or Python will use built-in optimization methods to give you the
most likely values of the parameters given the data.
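A minimal sketch of this fitting step (assuming NumPy; the data are simulated from the linear form mentioned above, so the estimates can be checked against the true parameters):

```python
import numpy as np

rng = np.random.default_rng(7)
# Simulate data from y = 7.2 + 4.5x plus Gaussian noise
x = rng.uniform(0, 10, size=500)
y = 7.2 + 4.5 * x + rng.normal(scale=1.0, size=500)

# Least-squares fit of a degree-1 polynomial; under Gaussian noise this
# is also the maximum likelihood estimate of the parameters
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)
print(intercept_hat, slope_hat)  # close to 7.2 and 4.5
```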
Overfitting
• Overfitting is the term used to mean that you used a dataset to estimate
the parameters of your model, but your model isn’t that good at
capturing reality beyond your sampled data.
• You might know this because you have tried to use it to predict labels
for another set of data that you didn’t use to fit the model, and it
doesn’t do a good job, as measured by an evaluation metric such as
accuracy.
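This can be demonstrated with a small sketch (assuming NumPy; the data-generating process, model degrees, and sample sizes are all chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    # The true process is linear with noise
    x = rng.uniform(-1, 1, size=n)
    y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)
    return x, y

x_train, y_train = make_data(20)   # data used to fit the model
x_test, y_test = make_data(200)    # fresh data the model has never seen

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on a dataset
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

simple = np.polyfit(x_train, y_train, deg=1)    # matches the true form
overfit = np.polyfit(x_train, y_train, deg=15)  # chases the noise

# The overfit model looks better on the training data...
train_gap = mse(simple, x_train, y_train) - mse(overfit, x_train, y_train)
# ...but does worse on data it was not fit to
test_gap = mse(overfit, x_test, y_test) - mse(simple, x_test, y_test)
print(train_gap, test_gap)
```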