DSV Module-2
Module-2:
Exploratory Data Analysis and the Data Science Process
Basic tools (plots, graphs and summary statistics) of EDA, Philosophy of EDA,
The Data Science Process, Case Study: Real Direct (online real estate firm).
Three Basic Machine Learning Algorithms: Linear Regression, k-Nearest
Neighbours (k-NN), k-means.
for example, highest frequency of clicks, or the post with the greatest number of
comments, or comments above some threshold, or some weighted average of many
metrics), you need to understand how the data is behaving, and the best way to do that
is by looking at it and getting your hands dirty.
Here are some references to help you understand best practices and historical context:
• Exploratory Data Analysis by John Tukey (Pearson)
• The Visual Display of Quantitative Information by Edward Tufte (Graphics Press)
• The Elements of Graphing Data by William S. Cleveland (Hobart Press)
1. First, we have the Real World. Inside the Real World are lots of people busy at
various activities. Some people are using Google+, others are competing in the
Olympics; there are spammers sending spam, and there are people getting their blood
drawn. Say we have data on one of these things.
2. Specifically, we’ll start with raw data—logs, Olympics records, Enron employee
emails, or recorded genetic material (note there are lots of aspects to these activities
already lost even when we have that raw data).
3. We want to process this to make it clean for analysis. So we build and use pipelines of
data munging: joining, scraping, wrangling, or whatever you want to call it. To do this
we use tools such as Python, shell scripts, R, or SQL, or all of the above.
4. Eventually we get the data down to a nice format, like something with columns: name
| event | year | gender | event time
5. Once we have this clean dataset, we should be doing some kind of EDA. In the course
of doing EDA, we may realize that it isn’t actually clean because of duplicates,
missing values, absurd outliers, and data that wasn’t actually logged or incorrectly
logged. If that’s the case, we may have to go back to collect more data or spend more
time cleaning the dataset.
6. Next, we design our model to use some algorithm like k-nearest neighbor (k-NN),
linear regression, Naive Bayes, or something else. The model we choose depends on
the type of problem we’re trying to solve, of course, which could be a classification problem,
a prediction problem, or a basic description problem.
7. We then can interpret, visualize, report, or communicate our results. This could take the form
of reporting the results up to our boss or coworkers or publishing a paper in a journal and
going out and giving academic talks about it.
8. Alternatively, our goal may be to build or prototype a “data product”; e.g., a spam classifier,
or a search ranking algorithm, or a recommendation system.
9. Now the key here that makes data science special and distinct from statistics is that this data
product then gets incorporated back into the real world, and users interact with that product,
and that generates more data, which creates a feedback loop.
10. This is very different from predicting the weather, say, where your model doesn’t influence
the outcome at all. For example, you might predict it will rain next week, and unless you have
some powers we don’t know about, you’re not going to cause it to rain. But if you instead
build a recommendation system that generates evidence that “lots of people love this book,”
say, then you will know that you caused that feedback loop.
11. A data product that is productionized and that users interact with is at one extreme and the
weather is at the other, but regardless of the type of data you work with and the “data
product” that gets built on top of it—be it public policy determined by a statistical model,
health insurance, or election polls that get widely reported and perhaps influence viewer
opinions—you should consider the extent to which your model is influencing the very
phenomenon that you are trying to observe and understand.
This model so far seems to suggest this will all magically happen without human
intervention. By “human” here, we mean “data scientist.” Someone has to make the decisions
about what data to collect, and why. That person needs to be formulating questions and
hypotheses and making a plan for how the problem will be attacked. And that
someone is the data scientist or our beloved data science team. Let's revise or at least add an
overlay to make clear that the data scientist needs to be involved in this process throughout,
meaning they are involved in the actual coding as well as in the higher-level process, as
shown in Figure 2-3.
• Ask a question.
• Do background research.
• Construct a hypothesis.
• Test your hypothesis by doing an experiment.
• Analyze your data and draw a conclusion.
• Communicate your results.
In both the data science process and the scientific method, not every problem requires one to
go through all the steps, but almost all problems can be solved with some combination of the
stages. For example, if your end goal is a data visualization (which itself could be thought of
as a data product), it’s possible you might not do any machine learning or statistical
modeling, but you’d want to get all the way to a clean dataset, do some exploratory analysis,
and then create the visualization.
You will likely find that asking clarifying questions about vocabulary gets you even more insight into
the underlying data problem.
3. Simulation is a useful technique in data science. It can be useful practice to simulate fake
datasets from a model to understand the generative process better, for example, and to debug
code.
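For instance, here is a minimal sketch of simulating a fake dataset from a known linear model and checking that we can recover its parameters; the coefficients and noise level below are arbitrary choices for illustration.
import numpy as np
import pandas as pd

# Assume a simple generative model: y = 2.5*x + 1 plus Gaussian noise.
# The coefficients and noise level are arbitrary, chosen only for illustration.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=500)
y = 2.5 * x + 1 + rng.normal(0, 2, size=500)
fake = pd.DataFrame({"x": x, "y": y})

# Basic checks on the simulated data: if our analysis code cannot roughly recover
# the known slope and intercept, something is wrong with the code, not the data.
print(fake.describe())
print(np.polyfit(fake["x"], fake["y"], deg=1))  # should be close to [2.5, 1.0]
Because we know the true generative process here, any surprising result points to a bug in our pipeline rather than a property of the data.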
There are some challenges they have to deal with as well, of course:
1. First off, there’s a law in New York that says you can’t show all the current housing
listings unless those listings reside behind a registration wall, so RealDirect requires
registration. On the one hand, this is an obstacle for buyers, but serious buyers are likely
willing to do it.
2. Moreover, places that don’t require registration, like Zillow, aren’t true competitors to
RealDirect because they are merely showing listings without providing any additional
service. Doug pointed out that you also need to register to use Pinterest, and it has tons of
users in spite of this.
RealDirect comprises licensed brokers in various established realtor associations, but even so
it has had its share of hate mail from realtors who don’t appreciate its approach to cutting
commission costs. In this sense, RealDirect is breaking directly into a guild.
On the other hand, if a realtor refused to show houses because they are being sold on
RealDirect, the potential buyers would see those listings elsewhere and complain. So the
traditional brokers have little choice but to deal with RealDirect even if they don’t like it. In
other words, the listings themselves are sufficiently transparent so that the traditional brokers
can’t get away with keeping their buyers away from these houses. Doug talked about key
issues that a buyer might care about—nearby parks, subway, and schools, as well as the
comparison of prices per square foot of apartments sold in the same building or block. This is
the kind of data they want to increasingly cover as part of the service of RealDirect.
ii. Sometimes “domain experts” have their own set of vocabulary. Did Doug use
vocabulary specific to his domain that you didn’t understand (“comps,” “open
houses,” “CPC”)? Sometimes if you don’t understand vocabulary that an expert is
using, it can prevent you from understanding the problem. It’s good to get in the habit
of asking questions because eventually you will get to something you do understand.
This involves persistence and is a habit to cultivate.
6. Doug mentioned the company didn’t necessarily have a data strategy. There is no industry
standard for creating one. As you work through this assignment, think about whether there is
a set of best practices you would recommend with respect to developing a data strategy for an
online business, or in your own domain.
Sample R code
Here’s some sample R code that takes the Brooklyn housing data in the preceding exercise
and cleans and explores it a bit. (The exercise asks you to do this for Manhattan.)
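The R listing itself is not reproduced in these notes. As a rough stand-in, here is a sketch of the same kind of cleaning and exploration steps in Python; the file name and column names used below are assumptions for illustration, not the actual exercise data.
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV export of the Brooklyn rolling-sales data; the file name and
# the column names used below are assumptions for illustration only.
bk = pd.read_csv("brooklyn_sales.csv")

# Clean: strip '$' and ',' from the price, coerce to numbers, drop absurd values.
bk["sale.price"] = pd.to_numeric(
    bk["sale.price"].astype(str).str.replace(r"[$,]", "", regex=True), errors="coerce")
bk["gross.sqft"] = pd.to_numeric(bk["gross.sqft"], errors="coerce")
bk = bk[(bk["sale.price"] > 0) & (bk["gross.sqft"] > 0)]

# Explore: summary statistics and a quick look at price per square foot.
print(bk["sale.price"].describe())
(bk["sale.price"] / bk["gross.sqft"]).hist(bins=50)
plt.xlabel("price per square foot")
plt.show()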
Algorithm:
An algorithm is a procedure or set of steps or rules to accomplish a task. Algorithms are one
of the fundamental concepts in, or building blocks of, computer science: the basis of the
design of elegant and efficient code, data preparation and processing, and software
engineering.
Some of the basic types of tasks that algorithms can solve are sorting, searching, and
graph-based computational problems.
Efficient algorithms that work sequentially or in parallel are the basis of pipelines to process
and prepare data. With respect to data science, there are at least three classes of algorithms
one should be aware of:
1. Data munging, preparation, and processing algorithms, such as sorting, MapReduce, or
Pregel.
We would characterize these types of algorithms as data engineering, and while we devote a
chapter to this, it’s not the emphasis of this book. This is not to say that you won’t be doing
data wrangling and munging—just that we don’t emphasize the algorithmic aspect of it.
2. Optimization algorithms for parameter estimation, including Stochastic Gradient Descent,
Newton’s Method, and Least Squares.
It is common to talk about the complexity of a regression model such as linear regression.
This refers to the number of coefficients used for the model.
When a coefficient is zero, the influence of the corresponding input variable is effectively
removed from the model's prediction (0 * x = 0). This becomes relevant when you look at
regularization methods, which change the learning algorithm to reduce the complexity of
regression models by putting pressure on the absolute size of the coefficients, driving some to
zero.
3. Gradient Descent
When one or more inputs are available, you can optimize the coefficient values by
iteratively minimizing the model's error on your training data. This operation is called
Gradient Descent. It starts with random values for each coefficient. The sum of squared
errors is calculated for each pair of input and output values. The learning rate is used as a
scale factor, and the coefficients are updated in the direction that reduces the error. The
process is repeated until a minimum sum of squared error is achieved or no further
improvement is possible.
When using this method, you must select the learning rate (alpha) parameter that will
determine the size of the improvement step to be taken for each iteration of the procedure.
Gradient descent is often taught using a linear regression model because it is relatively easy
to understand. In practice, it is useful when your dataset is very large in either the number of
rows or the number of columns and may not fit into memory.
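A minimal sketch of this procedure for simple linear regression follows; the toy data, learning rate, and number of iterations are arbitrary choices for illustration.
import numpy as np

# Toy data roughly following y = 2x + 1; all values are chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

b0, b1 = 0.0, 0.0      # start from arbitrary (here zero) coefficient values
alpha = 0.01           # learning rate: the size of each improvement step
for _ in range(5000):  # repeat until (approximately) no further improvement
    error = (b0 + b1 * x) - y          # prediction error for each training example
    b0 -= alpha * error.mean()         # move the intercept against the error gradient
    b1 -= alpha * (error * x).mean()   # move the slope against the error gradient

print(b0, b1)  # should end up close to the least-squares fit (about 1.1 and 2.0)
Note that the constant factor of 2 from differentiating the squared error is simply absorbed into the learning rate here, which does not change the behavior of the method.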
4. Regularization
There are extensions to the training of a linear model called regularization methods. The two
most common, Lasso regression (L1 regularization) and Ridge regression (L2 regularization),
aim to minimize the sum of the squared error of the model on the training data (as in ordinary
least squares) while also reducing the complexity of the model (such as the number or absolute
size of the coefficients).
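As a rough sketch of the effect, the snippet below fits ordinary least squares, Ridge (L2), and Lasso (L1) models from scikit-learn on synthetic data where only two of five inputs matter; the alpha values are arbitrary choices.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: five inputs, but only the first two actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    # Ridge shrinks all coefficients; Lasso can drive irrelevant ones to exactly
    # zero, effectively removing those inputs from the prediction (0 * x = 0).
    print(type(model).__name__, np.round(model.coef_, 3))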
A simple linear regression equation, for example weight = B0 + B1 * height, can be plotted as
a line in two dimensions. B0 is our starting point (the intercept), regardless of the height. We
can run through a bunch of heights from 100 to 250 centimeters, plug them into the equation,
and get weight values, creating our line.
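For instance, with assumed example coefficients B0 = 0.1 and B1 = 0.5 (arbitrary values chosen only for illustration):
# Assumed example coefficients, chosen only for illustration.
B0, B1 = 0.1, 0.5

for height in range(100, 251, 50):   # heights in centimeters
    weight = B0 + B1 * height        # weight = B0 + B1 * height
    print(height, "cm ->", round(weight, 1), "kg")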
Now that we know how to make predictions given a learned linear regression model, let’s
look at some rules of thumb for preparing our data to make the most of this type of model.
Figure 4: Dataset
Now, we'd like to classify a new data point, shown as a black dot (at point 60, 60), into the blue
or red class. We assume K = 3, i.e., the algorithm finds the three nearest data points, as shown
in Figure 5.
We can see in Figure 5 the three nearest neighbors of the black dot. Among those three, two
lie in the red class, hence the black dot will also be assigned to the red class.
Implementation in Python
As we know, the K-nearest neighbors (KNN) algorithm can be used for both classification
and regression. The following are recipes in Python for using KNN as a classifier and as a
regressor.
KNN as Classifier
First, start with importing the necessary Python packages −
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Next, download the iris dataset from its weblink as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, we need to read the dataset into a pandas DataFrame as follows −
dataset = pd.read_csv(path, names = headernames)
dataset.head()
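The snippet above only loads and inspects the data. Here is a rough sketch of the remaining classifier steps, using scikit-learn's train_test_split and KNeighborsClassifier; the 70/30 split and K = 5 are arbitrary choices.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Split features and labels, holding out 30% of the rows for testing.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit a KNN classifier with K = 5 neighbors and evaluate it on the held-out data.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))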
Applications of KNN
The following are some of the areas in which KNN can be successfully applied –
1. Banking: KNN can be used in a banking system to predict whether an individual is fit for
loan approval. Does that individual have characteristics similar to those of defaulters?
2. Calculation of credit ratings: KNN algorithms can be used to find an individual's credit
rating by comparing it to persons with similar characteristics.
3. Politics: With the assistance of KNN algorithms, we can classify potential voters into
different classes like "will vote," "will not vote," "will vote for the Congress Party," or "will
vote for the BJP Party."
4. Other areas where the KNN algorithm is often used are speech recognition, handwriting
detection, image recognition, and video recognition.
K-MEANS
The k-means algorithm is an iterative algorithm that attempts to divide the dataset into K
pre-defined, non-overlapping subgroups (clusters), where each data point belongs to only one
group. It tries to make data points within a cluster as similar as possible while keeping the
clusters as different (far apart) as possible. It assigns data points to a cluster in such a way that
the sum of the squared distances between the data points and the cluster's centroid (the
arithmetic mean of all data points belonging to that cluster) is at a minimum. The less variation
we have within clusters, the more homogeneous (similar) the data points are within the same
cluster.
Implementation
We will use a simple implementation of k-means here to illustrate some of the concepts.
Then we will use the scikit-learn implementation, which is more efficient and takes care of a
lot of things for us.
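Here is a minimal sketch of both on synthetic two-dimensional data with K = 3; the data and the value of K are arbitrary choices for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points around three arbitrary centers, for illustration only.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in centers])

# Simple k-means: alternate between assigning each point to its nearest centroid
# and recomputing each centroid as the mean of the points assigned to it.
centroids = X[rng.choice(len(X), size=3, replace=False)]
for _ in range(10):
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else centroids[k] for k in range(3)])

# The scikit-learn implementation does the same thing more robustly and efficiently.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(centroids)
print(km.cluster_centers_)
The guard against empty clusters in the simple loop hints at the kind of detail the scikit-learn implementation handles for us, along with smarter initialization and multiple restarts.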
Applications
K-means is very popular and is used in a variety of applications such as market segmentation,
document clustering, image segmentation, and image compression. Usually, when we
undertake a cluster analysis, the goal is either to:
1. Get a meaningful insight into the structure of the data we're dealing with.
2. Cluster-then-predict, where different models are built for different subgroups if we believe
there is wide variation in the behavior of different subgroups. An example of this is clustering
patients into different subgroups and developing a model for each subgroup to predict the risk
of heart attack (see the sketch below).
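Here is a rough sketch of the cluster-then-predict idea on synthetic data; the features, the number of clusters, and the choice of logistic regression are all assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Synthetic patient-like data: two numeric features and a binary outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Step 1: cluster the observations into subgroups.
clusters = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Step 2: fit a separate predictive model within each subgroup.
models = {}
for k in range(3):
    mask = clusters == k
    models[k] = LogisticRegression().fit(X[mask], y[mask])
    print("cluster", k, "training accuracy:", round(models[k].score(X[mask], y[mask]), 2))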