
Data Science & Visualization

[21CS644]
Course Outline
•Introduction to Data Science
•Exploratory Data Analysis and the
Data Science Process
•Feature Generation and Feature Selection
•Data Visualization and Data Exploration
•A Deep Dive into Matplotlib
Course Outcomes

CO# | CO Description | RBL
CO1 | Understand the different types of data and the skill sets needed for data science. | L3
CO2 | Apply different techniques for Exploratory Data Analysis and the Data Science Process. | L3
CO3 | Analyze feature selection algorithms and design a recommender system. | L4
CO4 | Apply data visualization tools and libraries to plot graphs. | L3
CO5 | Analyze the different plots and layouts in Matplotlib. | L4
Learning Resources
Textbooks
1. Doing Data Science, Cathy O'Neil and Rachel Schutt, O'Reilly Media, Inc., 2013
2. The Data Visualization Workshop, Tim Grobmann and Mario Dobler, Packt
Publishing, ISBN 9781800568112
References
1. Mining of Massive Datasets, Anand Rajaraman and Jeffrey D. Ullman,
Cambridge University Press, 2010
2. Data Science from Scratch, Joel Grus, O'Reilly Media / Shroff Publishers
3. A Handbook for Data Driven Design, Andy Kirk
Weblinks and Video Lectures
1. https://nptel.ac.in/courses/106/105/106105077/
2. https://www.oreilly.com/library/view/doing-data-science/9781449363871/toc01.html
3. http://book.visualisingdata.com/
4. https://matplotlib.org/
5. https://docs.python.org/3/tutorial/
6. https://www.tableau.com/
Assessment Details
Continuous Internal Evaluation (CIE) - 50 Marks
Three tests - 20 Marks each (60 Marks)
Two assignments - 10 Marks each (20 Marks)
One assignment (quiz) - 20 Marks
Tests + assignments = 100 Marks, scaled down to 50 Marks
Final CIE marks = 50 Marks

Semester End Examination (SEE) - 50 Marks


Final weightage = 50% of CIE + 50% of SEE
Minimum passing marks for theory - 40% of 50 = 20 Marks
Minimum passing marks for SEE - 35% of 100 (18 marks out of 50)
MODULE-1

Introduction to Data Science


Topics Covered
•What is Data Science?
•Big Data and Data Science hype, and getting past the hype
•Why now? Datafication, and the current landscape of perspectives
•Skill sets needed
•Statistical inference
•Populations and samples
•Probability distributions
•Fitting a model
What is Data Science?
• Data science is the study of data to extract meaningful insights for
business.
• It is a multidisciplinary approach that combines principles and
practices from the fields of mathematics, statistics, artificial
intelligence, and computer engineering to analyze large amounts of
data.
• This analysis helps data scientists to ask and answer questions like
what happened, why it happened, what will happen, and what can be
done with the results.
Why is data science important?
• Data science is important because it combines tools, methods, and
technology to generate meaning from data.
• Modern organizations are inundated with data; there is a proliferation of
devices that can automatically collect and store information. Online
systems and payment portals capture more data in the fields of
e-commerce, medicine, finance, and every other aspect of human life.
• We have text, audio, video, and image data available in vast quantities.
Big Data and Data Science Hype
• There’s a lack of definitions around the most basic terminology.
• What is “Big Data” anyway? What does “data science” mean?
• What is the relationship between Big Data and data science?
• Is data science the science of Big Data?
• Is data science only the stuff going on in companies like Google and Facebook
and tech companies?
• Why do many people refer to Big Data as crossing disciplines
(astronomy, finance, tech, etc.) and to data science as only taking place in
tech?
• Just how big is big? Or is it just a relative term?
• These terms are so ambiguous, they’re well-nigh meaningless.
• There is a lack of respect for the researchers in academia and industry labs
(statisticians, computer scientists, mathematicians, engineers, and
scientists of all types) who have been working on these problems for years.
• The hype makes it sound as if machine learning algorithms were just invented
last week and data was never "big" until Google came along.
• Many of the methods and techniques we're using, and the challenges we're
facing now, are part of the evolution of everything that's come before.
• This doesn't mean that there's no new and exciting stuff going on, but it's
important to show some basic respect for everything that came before.
• The hype is crazy—people throw around tired phrases straight out of the
height of the pre-financial crisis era like “Masters of the Universe” to
describe data scientists, and that doesn’t bode well.
• In general, hype masks reality and increases the noise-to-signal ratio.
• The longer the hype goes on, the more many of us will get turned off by it,
and the harder it will be to see what’s good underneath it all, if anything.
• Statisticians already feel that they are studying and working on the “Science
of Data.”
• We will make the case that data science is not just a rebranding of
statistics or machine learning but rather a field unto itself.
• Media often describes data science in a way that makes it sound as if it's
simply statistics or machine learning in the context of the tech industry.
• People have said to us, “Anything that has to call itself a science
isn’t.”
• Although there might be truth in that, it doesn't mean that the term
"data science" represents nothing; what it represents may not be a science
so much as a craft.
Getting Past the Hype
• there’s is a difference between industry and academia.
• But does it really have to be that way?
• Why do many courses in school have to be so intrinsically out of touch with
reality?
• Even so, the gap doesn’t represent simply a difference between industry
statistics and academic statistics.
• The general experience of data scientists is that, at their jobs, they have access
to a larger body of knowledge and methodology, as well as a process, which
we now define as the data science process, that has foundations in both
statistics and computer science.
Why Now?
• We have massive amounts of data about many aspects of our lives.
• Simultaneously, an abundance of inexpensive computing power.
• Shopping, communicating, reading news, listening to music, searching for
information, expressing our opinions—all this is being tracked online, as
most people know.
• What people might not know is that the "datafication" of our offline behavior
has started as well, mirroring the online data collection revolution.
• Put the two together, and there’s a lot to learn about our behavior and, by
extension, who we are as a species.
• It’s not just Internet data, though: it’s finance, the medical industry,
pharmaceuticals, bioinformatics, social welfare, government, education, retail, and
the list goes on.
• There is a growing influence of data in most sectors and most industries.
• In some cases, the amount of data collected might be enough to be considered
“big”; in other cases, it’s not.
• But it’s not only the massiveness that makes all this new data interesting (or poses
challenges).
• It’s that the data itself, often in real time, becomes the building blocks of data
products.
• On the Internet, this means Amazon recommendation systems, friend
recommendations on Facebook, film and music recommendations, and so on.
• In finance, this means credit ratings, trading algorithms, and models.
• In education, this is starting to mean dynamic personalized learning and
assessments coming out of places like Knewton and Khan Academy.
• In government, this means policies based on data
• Technology makes this possible
• Infrastructure for large-scale data processing, increased memory, and
bandwidth, as well as a cultural acceptance of technology in the fabric of
our lives.
• This wasn’t true a decade ago.
• Considering the impact of this feedback loop, we should start thinking
seriously about how it’s being conducted, along with the ethical and
technical responsibilities of the people responsible for the process.
Datafication
•Datafication is a process of “taking all aspects of life and
turning them into data.”
Eg:
•Google’s augmented-reality glasses datafy the gaze.
•Twitter datafies stray thoughts.
•LinkedIn datafies professional networks.
• We are being datafied, or rather our actions are; when we “like” someone or
something online, we are intending to be datafied.
• When we merely browse the Web, we are unintentionally, or at least passively,
being datafied through cookies that we might or might not be aware of.
• And when we walk around in a store, or even on the street, we are being datafied
in a completely unintentional way, via sensors, cameras, or Google glasses.
• This spectrum of intentionality ranges from us gleefully taking part in a social
media experiment we are proud of, to all-out surveillance and stalking.
• But it’s all datafication.
• Our intentions may run the gamut, but the results don’t.
• Once we datafy things, we can transform their purpose and turn the information
into new forms of value to increase efficiency.
The Current Landscape (with a Little History)

Drew Conway’s Venn diagram of data science: the intersection of hacking
skills, math and statistics knowledge, and substantive expertise.


Role of Social scientist in Data Science
• Both LinkedIn and Facebook are social network companies.
• Oftentimes a description or definition of a data scientist includes hybrid
statistician, software engineer, and social scientist.
• This made sense in the context of companies where the product was a social
product and still makes sense when we’re dealing with human or user behavior.
• Substantive expertise depends on the context of the problems you’re trying to
solve.
• If they’re social science problems like friend recommendations or people you
know or user segmentation, then by all means, bring on the social scientist!
• Social scientists also do tend to be good question askers and have other good
investigative qualities.
• A social scientist who also has the quantitative and programming chops makes a
great data scientist.
Data Science Jobs
•Job descriptions often require data scientists to be experts in computer
science, statistics, communication, and data visualization,
•along with extensive domain expertise.
•Nobody is an expert in everything, which is why it makes more
sense to create teams of people who have different profiles and
different expertise—together, as a team, they can specialize in all
those things.
A Data Science Profile
• Computer science
• Math
• Statistics
• Machine learning
• Domain expertise
• Communication and presentation skills
• Data visualization
Data Scientist Profiles and Skill Areas
Profiles
•Data Businessperson
•Data Creative
•Data Developer
•Data Researcher
Skill areas
•Business
•ML/Big Data
•Math
•Programming
•Statistics
Data Scientist in Academia and Industry
In Academia
• Currently, no one calls themselves a data scientist in academia,
• except to take on a secondary title for the sake of being a part of a “data science
institute” at a university,
• or for applying for a grant that supplies money for data science research.
• For the term “data science” to catch on in academia at the level of the faculty, and
as a primary title, the research area needs to be more formally defined.
• There is already a rich set of problems that could translate into many PhD theses.
• An academic data scientist is a scientist, trained in anything from social science
to biology, who works with large amounts of data and must grapple with
computational problems posed by the structure, size, messiness, complexity,
and nature of the data, while simultaneously solving a real-world problem.
In Industry
• What do data scientists look like in industry?
• It depends on the level of seniority and whether you’re talking about the
Internet/online industry in particular.
• The role of data scientist need not be exclusive to the tech world, but that’s
where the term originated.
• A chief data scientist should be setting the data strategy of the company,
which involves a variety of things: setting everything up from the
engineering and infrastructure for collecting data and logging, to privacy
concerns, to deciding what data will be user-facing, how data is going to be
used to make decisions, and how it’s going to be built back into the product.
• Manage a team of engineers and analysts and should communicate with
leadership across the company, including the CEO, CTO, and product
leadership.
• They are also concerned with patenting innovative solutions and setting research
goals.
• Data scientist is someone who knows how to extract meaning from and
interpret data, which requires both tools and methods from statistics and
machine learning, as well as being human.
• A data scientist spends a lot of time in the process of collecting, cleaning, and
munging data, because data is never clean.
• This process requires persistence, statistics, and software engineering
skills—skills that are also necessary for understanding biases in the data,
and for debugging logging output from code
• Once data gets into shape, a crucial part is exploratory data analysis, which
combines visualization and data sense.
• Find patterns, build models, and algorithms—some with the intention of
understanding product usage and the overall health of the product, and
others to serve as prototypes that ultimately get baked back into the
product.
• Design experiments, which is a critical part of data driven decision making.
• Communicate with team members, engineers, and leadership in clear
language and with data visualizations, so that colleagues who are not immersed
in the data themselves will still understand the implications.
Statistical Inference
• The world we live in is complex, random, and uncertain.
• At the same time, it’s one big data-generating machine.
• As we commute to work on subways and in cars, as our blood moves
through our bodies, as we’re shopping, emailing, procrastinating at
work by browsing the Internet and watching the stock market, as we’re
building things, eating things, talking to our friends and family about
things, while factories are producing products, this all at least
potentially produces data
• Data represents the traces of real-world processes,
• and exactly which traces we gather is decided by our data collection or
sampling method.
• After separating the process from the data collection, we can see clearly
that there are two sources of randomness and uncertainty.
• Namely, the randomness and uncertainty underlying the process itself,
and the uncertainty associated with your underlying data collection
methods.
• We simplify those captured traces into something more comprehensible,
something that somehow captures it all in a much more concise way; that
something could be mathematical models or functions of the data, known as
statistical estimators.
• This overall process of going from the world to the data, and then from
the data back to the world, is the field of statistical inference
Populations and Samples
• The word population immediately makes us think of the entire US population
of 300 million people,
• or the entire world’s population of 7 billion people.
• But put that image out of your head, because in statistical inference a
population isn’t used to describe only people.
• It could be any set of objects or units, such as tweets or photographs
or stars.
• If we could measure the characteristics or extract characteristics of all
those objects, we’d have a complete set of observations,
• Convention is to use N to represent the total number of observations in
the population.
• Suppose your population was all emails sent last year by employees at a
huge corporation, BigCorp.
• Then a single observation could be a list of things: the sender’s name,
the list of recipients, date sent, text of email, number of characters in the
email, number of sentences in the email, number of verbs in the email,
and the length of time until first reply.
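For concreteness, here is a minimal Python sketch of what one such observation might look like; the field names and values are illustrative assumptions, not an actual BigCorp schema.

```python
# Hypothetical single observation from the BigCorp email population.
# Field names and values are illustrative only.
text = "Please find the Q1 numbers attached."

observation = {
    "sender": "a.sharma",
    "recipients": ["b.rao", "c.iyer"],
    "date_sent": "2024-03-11T09:32:00",
    "text": text,
    "num_characters": len(text),
    "num_sentences": 1,
    "num_verbs": 2,
    "seconds_until_first_reply": 5400,
}

# The full population would be N such observations, one per email sent last year.
```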
Sample
• A subset of the units, of size n, is used to examine the observations and to
draw conclusions and make inferences about the population.
• There are different ways you might go about getting this subset of
data, and you want to be aware of this sampling mechanism
• It can introduce biases into the data, and distort it, so that the subset is
not a “mini-me” shrunk-down version of the population.
• Once that happens, any conclusions you draw will simply be wrong
and distorted
• In the BigCorp email example, you could:
• take the list of all employees, select 1/10th of those people at random, and
take all the email they ever sent; that would be your sample.
• or sample 1/10th of all email sent each day at random, and that would be
your sample.
• Both these methods are reasonable, and both yield the same sample size.
• But if you took each sample and counted how many email messages each person
sent, and used that to estimate the underlying distribution of emails sent by
all individuals at BigCorp, you might get entirely different answers (see the
sketch below).
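A minimal simulation of the two sampling schemes; the heavy-tailed per-person email volumes and all numbers are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "population": annual emails sent per person at BigCorp.
# The lognormal shape is only an assumption to make volumes heavy-tailed.
n_people = 10_000
emails_per_person = rng.lognormal(mean=5.0, sigma=1.0, size=n_people).astype(int)

# Scheme 1: sample 1/10th of the *people* and keep all of their email.
people_idx = rng.choice(n_people, size=n_people // 10, replace=False)
scheme1_counts = emails_per_person[people_idx]

# Scheme 2: sample 1/10th of all *emails* uniformly at random; each person's
# observed count is then Binomial(total sent, 0.1).
scheme2_counts = rng.binomial(emails_per_person, 0.1)

print("people-sampled mean emails per person:", scheme1_counts.mean())
print("email-sampled mean emails per person:", scheme2_counts.mean())
# The second scheme observes only ~10% of each person's mail, so the two
# samples imply very different per-person email distributions.
```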
Populations and Samples of Big Data
• Sampling solves some engineering challenges.
• In the Big Data world, the focus on enterprise solutions such as Hadoop for
handling the engineering and computational challenges caused by too much data
tends to overlook sampling as a legitimate solution.
• At Google, for example, software engineers, data scientists, and statisticians
sample all the time.
• In statistics we often model the relationship between a population and a sample with an
underlying mathematical process.
• we make simplifying assumptions about the underlying truth, the mathematical structure,
and shape of the underlying generative process that created the data.
• We observe only one particular realization of that generative process, which is that sample.
• The uncertainty created by such a sampling process has a name: the sampling
distribution.
New kinds of data
A strong data scientist needs to be versatile and comfortable dealing with a
variety of types of data, including:
• Traditional: numerical, categorical, or binary
• Text: emails, tweets, New York Times articles
• Records: user-level data, timestamped event data, JSON-formatted log
files
• Geo-based location data
• Network
• Sensor data
• Images
Terminology: Big Data
“Big” is a moving Target
• Constructing a threshold for Big Data such as 1 petabyte is
meaningless because it makes it sound absolute.
• Only when the size becomes a challenge is it worth referring to it as
“Big.”
• So it’s a relative term referring to when the size of the data outstrips
the state-of-the-art current computational solutions (in terms of
memory, storage, complexity, and processing speed) available to
handle it.
• So in the 1970s this meant something different than it does today
“Big” is when you can’t fit it on one machine.
• Different individuals and companies have different computational resources
available to them
• So for a single scientist, data is big if she can’t fit it on one machine, because
she has to learn a whole new host of tools and methods once that happens.
Big Data is a cultural phenomenon.
It describes how much data is part of our lives, precipitated by accelerated
advances in technology.
The 4 Vs:
Volume, variety, velocity, and value. Many people are circulating this as a way
to characterize Big Data.
Big Data revolution consists of three things:
• Collecting and using a lot of data rather than small samples
• Accepting messiness in your data
• Giving up on knowing the causes
Modeling
What is a model?
•A model represents the nature of reality through a particular lens, be it
architectural, biological, or mathematical.
•It is an artificial construction where all extraneous detail has been
removed or abstracted.
•Attention must always be paid to these abstracted details after a
model has been analyzed, to see what might have been overlooked.
Statistical modeling
•Draw a picture of what you think the underlying process
might be with your model.
• What comes first?
•What influences what?
•What causes what?
•What’s a test of that?
• Some prefer to express these kinds of relationships in terms of math.
• The mathematical expressions will be general enough that they have to
include parameters, but the values of these parameters are not yet known.
• In mathematical expressions, the convention is to use Greek letters for
parameters and Latin letters for data.
• So, for example, if you have two columns of data, x and y, and you think
there’s a linear relationship, you’d write down y = β0 +β1x.
• You don’t know what β0 and β1 are in terms of actual numbers yet, so
they’re the parameters
• Other people prefer pictures and will first draw a diagram of data flow,
possibly with arrows, showing how things affect other things or what
happens over time.
• This gives them an abstract picture of the relationships before choosing
equations to express them
But how do you build a model?
• Always good to start simple.
• There is a trade-off in modeling between simple and accurate.
• Simple models may be easier to interpret and understand.
• Oftentimes the crude, simple model gets you 90% of the way there and
only takes a few hours to build and fit.
• whereas getting a more complex model might take months and only get
you to 92%.
• Some of the building blocks of these models are probability
distributions.
Probability distribution
• Probability distributions are the foundation of statistical models.
• When we get to linear regression and Naive Bayes, you will see how this
happens in practice.
• The classical example is the height of humans, following a normal
distribution—a bell-shaped curve, also called a Gaussian distribution,
named after Gauss.
• Natural processes tend to generate measurements whose empirical shape
could be approximated by mathematical functions with a few parameters
that could be estimated from the data.
• Not all processes generate data that looks like a named distribution, but
many do.
Probability distributions
• They are to be interpreted as assigning a probability to a subset of possible
outcomes, and have corresponding functions.
• For example, the normal distribution is written as

  p(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
• The parameter μ is the mean and median and controls where the
distribution is centered (because this is a symmetric distribution), and the
parameter σ controls how spread out the distribution is.
• This is the general functional form, but for specific real-world
phenomenon, these parameters have actual numbers as values, which we
can estimate from the data
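A rough Python sketch of what “estimate from the data” means in practice; the simulated heights and their true parameters are assumptions used only for the demo.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulated human heights in cm; loc/scale are assumed only to generate data.
heights = rng.normal(loc=170.0, scale=8.0, size=5_000)

# Estimate the parameters from the data: sample mean and standard deviation.
mu_hat = heights.mean()
sigma_hat = heights.std(ddof=1)

print(f"mu_hat = {mu_hat:.2f}, sigma_hat = {sigma_hat:.2f}")
# Evaluate the fitted density at a point, e.g. p(180 cm).
print("p(180) =", norm.pdf(180.0, loc=mu_hat, scale=sigma_hat))
```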
• A random variable denoted by x or y can be assumed to have a
corresponding probability distribution, p(x) , which maps x to a
positive real number.
• In order to be a probability density function, we’re restricted to the set
of functions such that if we integrate p(x) to get the area under the curve,
it is 1, so it can be interpreted as probability.
• For example, let x be the amount of time until the next bus arrives
(measured in seconds).
• x is a random variable because there is variation and uncertainty in the
amount of time until the next bus
• Suppose we know (for the sake of argument) that the time until the
next bus has a probability density function of
  p(x) = 2e^(−2x)
• If we want to know the likelihood of the next bus arriving between
12 and 13 minutes, then we find the area under the curve between
12 and 13:

  P(12 ≤ x ≤ 13) = ∫₁₂¹³ 2e^(−2x) dx

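A short sketch of that computation in Python, using the closed-form antiderivative and a numeric quadrature cross-check; both only illustrate the idea, and with this particular density the probability of such a long wait is vanishingly small.

```python
import numpy as np
from scipy.integrate import quad

# Density from the example: p(x) = 2 * exp(-2x) for x >= 0.
def p(x):
    return 2.0 * np.exp(-2.0 * x)

# Area under the curve between 12 and 13, computed two ways.
closed_form = np.exp(-2 * 12) - np.exp(-2 * 13)   # F(13) - F(12), F(x) = 1 - e^(-2x)
numeric, _ = quad(p, 12, 13)

print(f"closed form: {closed_form:.3e}")
print(f"quadrature:  {numeric:.3e}")
# Both values are tiny (~3e-11): with rate 2, such long waits are very unlikely.
```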
How do we know the right distribution to use?
There are two possible ways:
• Conduct an experiment where we show up at the bus stop at a random
time, measure how much time until the next bus, and repeat this
experiment over and over again.
• Then we look at the measurements, plot them, and approximate the
function as discussed.
• Or, because we are familiar with the fact that “waiting time” is a common
enough real-world phenomenon, we can use a distribution that has been
invented to describe it, the exponential distribution, which takes the form
  p(x) = λe^(−λx)
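If we go the experimental route, fitting the exponential form to repeated measurements is straightforward. A minimal sketch, assuming simulated waiting times and using the fact that the maximum likelihood estimate of λ for an exponential distribution is 1 over the sample mean:

```python
import numpy as np

rng = np.random.default_rng(3)

# Repeated trips to the bus stop: simulated waiting times.
# The true rate (lambda = 0.2) is an assumption used only to generate the data.
true_lambda = 0.2
waits = rng.exponential(scale=1.0 / true_lambda, size=500)

# For an exponential distribution, the MLE of lambda is 1 / (sample mean).
lambda_hat = 1.0 / waits.mean()
print(f"estimated lambda: {lambda_hat:.3f} (true value: {true_lambda})")
```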
• In addition to denoting distributions of single random variables with
functions of one variable, we use multivariate functions called joint
distributions to do the same thing for more than one random
variable.
• So in the case of two random variables, for example, we could denote
our distribution by a function p(x, y) , and it would take values in the
plane and give us nonnegative values.
• In keeping with its interpretation as a probability, its (double) integral
over the whole plane would be 1.
• We also have the conditional distribution p(x|y), which is to be
interpreted as the density function of x given a particular value of y.
Example
• Suppose we have a set of user-level data for Amazon.com that lists, for
each user, the amount of money spent last month on Amazon,
whether the user is male or female, and how many items they looked at
before adding the first item to the shopping cart.
• If we consider X to be the random variable that represents the amount of
money spent, then we can look at the distribution of money spent across
all users, and represent it as p(X).
• We can also take the subset of users who looked at more than five items before
buying anything, and look at the distribution of money spent among these users.
• Let Y be the random variable that represents the number of items looked at;
then p(X | Y > 5) would be the corresponding conditional distribution.
• A conditional distribution has the same properties as a regular
distribution: when we integrate it, it sums to 1, and it has to take
nonnegative values.
• When we observe data points, i.e., (x1, y1), (x2, y2), ..., (xn, yn), we
are observing realizations of a pair of random variables.
• When we have an entire dataset with n rows and k columns, we are
observing n realizations of the joint distribution of those k random
variables.
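A small pandas sketch of the Amazon-style example. The data is synthetic, with spend deliberately made to depend on items viewed; all columns and numbers are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 10_000

# Synthetic user-level data standing in for the Amazon example.
items_viewed = rng.poisson(4, size=n)                            # Y
spend = rng.gamma(shape=2.0, scale=10.0 + 5.0 * items_viewed)    # X, in dollars
df = pd.DataFrame({"items_viewed": items_viewed, "spend": spend,
                   "gender": rng.choice(["F", "M"], size=n)})

# Marginal distribution of spend, p(X), summarized here by its mean.
print("mean spend, all users:        ", round(df["spend"].mean(), 2))

# Conditional distribution p(X | Y > 5): restrict to the matching subset.
subset = df[df["items_viewed"] > 5]
print("mean spend, viewed > 5 items: ", round(subset["spend"].mean(), 2))
```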
Fitting a model
• Fitting a model means that you estimate the parameters of the model using
the observed data.
• You are using your data as evidence to help approximate the real-world
mathematical process that generated the data.
• Fitting the model often involves optimization methods and algorithms,
such as maximum likelihood estimation, to help get the parameters.
• When you estimate the parameters, they are actually estimators, meaning
they themselves are functions of the data.
• Once you fit the model, you can actually write it as
y = 7.2 + 4.5x, for example, which means that your best guess is that this
equation, or functional form, expresses the relationship between your two
variables, based on your assumption that the data followed a linear pattern.
• Fitting the model is when you start actually coding:
• your code will read in the data, and you’ll specify the functional form
that you wrote down on the piece of paper.
• Then R or Python will use built-in optimization methods to give you the
most likely values of the parameters given the data.
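A minimal Python sketch of that workflow; the data is simulated so that the “true” coefficients roughly match the y = 7.2 + 4.5x example above (an assumption for the demo), and np.polyfit plays the role of the built-in optimizer.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated data roughly following y = 7.2 + 4.5x plus Gaussian noise.
x = rng.uniform(0, 10, size=200)
y = 7.2 + 4.5 * x + rng.normal(0, 2.0, size=200)

# Fit the specified functional form (a line): least squares gives the most
# likely parameter values under the usual Gaussian-noise assumption.
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)   # returns [slope, intercept]
print(f"fitted model: y = {beta0_hat:.2f} + {beta1_hat:.2f} x")
```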
Overfitting
• Overfitting is the term used to mean that you used a dataset to estimate
the parameters of your model, but your model isn’t that good at
capturing reality beyond your sampled data.
• You might know this because you have tried to use it to predict labels
for another set of data that you didn’t use to fit the model, and it
doesn’t do a good job, as measured by an evaluation metric such as
accuracy (see the sketch below).
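A minimal sketch of overfitting in action, assuming a sine-plus-noise generating process and polynomial models of increasing degree; all choices here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# One dataset to fit on, and a second from the same process to evaluate on.
def make_data(n):
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(x) + rng.normal(0, 0.3, size=n)   # assumed true process
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(30)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
# The high-degree fit drives training error toward zero while error on the
# held-out data typically blows up: that gap is overfitting.
```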
