DSV Module-2
21CS44 DATA SCIENCE VISUALIZATION NOTES VTU

Data Science & Visualization Module-2

Module-2:
Exploratory Data Analysis and the Data Science Process
Basic tools (plots, graphs and summary statistics) of EDA, Philosophy of EDA,
The Data Science Process, Case Study: Real Direct (online real estate firm).
Three Basic Machine Learning Algorithms: Linear Regression, k-Nearest
Neighbours (k-NN), k-means.

Exploratory Data Analysis


a. “Exploratory data analysis” is an attitude, a state of flexibility, a willingness to look
for those things that we believe are not there, as well as those we believe to be there.
— John Tukey
b. Exploratory data analysis (EDA) is the first step toward building a model.
c. It’s traditionally presented as a bunch of histograms and stem-and-leaf plots.
d. EDA has no hypothesis and no modelling.
e. EDA is the process of understanding the data for the problem that we are solving.
f. The basic tools of EDA are plots, graphs, and summary statistics. Generally, it is a method of
systematically going through the data: plotting distributions of all variables (using box
plots), plotting time series of the data, transforming variables, looking at all pairwise
relationships between variables using scatterplot matrices, and generating summary
statistics for all of them (a short code sketch follows this list).
g. This includes computing pairwise relationships, the mean, minimum, maximum, upper and
lower quartiles, outliers, variance, and standard deviation (SD).
h. EDA provides intuition, shape and insight about the data generating process.
i. EDA is between the data and the data scientist. But as much as EDA is a set of tools, it’s also
a mindset. And that mindset is about your relationship with the data. You want to
understand the data—gain intuition, understand the shape of it, and try to connect
your understanding of the process that generated the data to the data itself. EDA
happens between you and the data and isn’t about proving anything to anyone else
yet.
j. But EDA is a critical part of the data science process and represents a philosophy or
way of doing statistics practiced by a strain of statisticians coming from the Bell Labs
tradition.
k. John Tukey, a mathematician at Bell Labs, developed exploratory data analysis in
contrast to confirmatory data analysis, which concerns itself with modeling and
hypotheses as described in the previous section.
l. In EDA, there is no hypothesis and there is no model. The “exploratory” aspect means
that your understanding of the problem you are solving, or might solve, is changing as
you go.
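
A minimal sketch of the basic tools mentioned in items f and g, written in Python with pandas and matplotlib; the file name data.csv and the choice of numeric columns are placeholders, not part of the original notes:

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

df = pd.read_csv("data.csv")              # hypothetical dataset

print(df.describe())                      # mean, min, max, quartiles, SD
df.boxplot(figsize=(10, 6))               # distributions of numeric variables (spot outliers)
scatter_matrix(df.select_dtypes("number"), figsize=(10, 10))   # all pairwise relationships
plt.show()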

Historical Perspective: Bell Labs


Bell Labs is a research lab going back to the 1920s that has made innovations in physics,
computer science, statistics, and math, producing languages like C++, and many Nobel Prize
winners as well.
i. There was a very successful and productive statistics group there, and among its many
notable members was John Tukey, a mathematician who worked on a lot of statistical
problems. He is considered the father of EDA and R (which started as the S language
at Bell Labs; R is the open-source version), and he was interested in trying to
visualize high-dimensional data.
ii. We think of Bell Labs as one of the places where data science was “born” because of
the collaboration between disciplines, and the massive amounts of complex data
available to people working there. It was a virtual playground for statisticians and
computer scientists, much like Google is today.
iii. In fact, in 2001, Bill Cleveland wrote “Data Science: An Action Plan for expanding
the technical areas of the field of statistics,” which described multidisciplinary
investigation, models, and methods for data (traditional applied stats), computing with
data (hardware, software, algorithms, coding), pedagogy, tool evaluation (staying on
top of current trends in technology), and theory (the math behind the data). You can
read more about Bell Labs in the book The Idea Factory by Jon Gertner (Penguin
Books).

Philosophy of Exploratory Data Analysis


i. Long before worrying about how to convince others, you first must understand what’s
happening yourself. — Andrew Gelman
ii. While at Google, Rachel was fortunate to work alongside two former Bell
Labs/AT&T statisticians—Daryl Pregibon and Diane Lambert, who also work in this
vein of applied statistics—and learned from them to make EDA a part of her best
practices.
iii. Yes, even with very large Google-scale data, they did EDA. In the context of data in
an Internet/engineering company, EDA is done for some of the same reasons it’s done
with smaller datasets, but there are additional reasons to do it with data that has been
generated from logs.
iv. There are important reasons anyone working with data should do EDA. Namely, to
gain intuition about the data; to make comparisons between distributions; for sanity
checking (making sure the data is on the scale you expect, in the format you thought it
should be); to find out where data is missing or if there are outliers; and to summarize
the data.
v. In the context of data generated from logs, EDA also helps with debugging the logging
process itself. For example, “patterns” you find in the data could be something wrong in the logging
process that needs to be fixed. If you never go to the trouble of debugging, you’ll
continue to think your patterns are real. The engineers we’ve worked with are always
grateful for help in this area.
vi. EDA helps to check that the product is performing as expected.
vii. EDA is done at beginning of the analysis of data.
viii. Although there’s lots of visualization involved in EDA, we distinguish between EDA
and data visualization in that EDA is done toward the beginning of analysis, and data
visualization, as it’s used in our vernacular, is done toward the end to communicate one’s
findings.
ix. EDA helps in informing and improving the development of algorithms.
Eg: the popularity of posts used by a ranking algorithm can be quantified using the number
of clicks, likes, and comments.
x. Doing EDA is far better than running an algorithm immediately on the dataset. With EDA,
you can also use the understanding you get to inform and improve the development of
algorithms. For example, suppose you are trying to develop a ranking algorithm that
ranks content that you are showing to users. To do this you might want to develop a
notion of “popular.” Before you decide how to quantify popularity (which could be,
for example, highest frequency of clicks, or the post with the greater number of
comments, or comments above some threshold, or some weighted average of many
metrics), you need to understand how the data is behaving, and the best way to do that
is looking at it and getting your hands dirty.
Here are some references to help you understand best practices and historical context:
• Exploratory Data Analysis by John Tukey (Pearson)
• The Visual Display of Quantitative Information by Edward Tufte (Graphics Press)
• The Elements of Graphing Data by William S. Cleveland (Hobart Press)

The Data Science Process

1. First, we have the Real World. Inside the Real World are lots of people busy at
various activities. Some people are using Google+, others are competing in the
Olympics; there are spammers sending spam, and there are people getting their blood
drawn. Say we have data on one of these things.
2. Specifically, we’ll start with raw data—logs, Olympics records, Enron employee
emails, or recorded genetic material (note there are lots of aspects to these activities
already lost even when we have that raw data).
3. We want to process this to make it clean for analysis. So we build and use pipelines of
data munging: joining, scraping, wrangling, or whatever you want to call it. To do this
we use tools such as Python, shell scripts, R, or SQL, or all of the above.
4. Eventually we get the data down to a nice format, like something with columns: name
| event | year | gender | event time
5. Once we have this clean dataset, we should be doing some kind of EDA. In the course
of doing EDA, we may realize that it isn’t actually clean because of duplicates,
missing values, absurd outliers, and data that wasn’t actually logged or incorrectly
logged. If that’s the case, we may have to go back to collect more data or spend more
time cleaning the dataset (a short code sketch of this munge-and-check step follows this list).
6. Next, we design our model to use some algorithm like k-nearest neighbor (k-NN),
linear regression, Naive Bayes, or something else. The model we choose depends on
the type of problem we’re trying to solve, of course, which could be a classification problem,
a prediction problem, or a basic description problem.
7. We then can interpret, visualize, report, or communicate our results. This could take the form
of reporting the results up to our boss or coworkers or publishing a paper in a journal and
going out and giving academic talks about it.
8. Alternatively, our goal may be to build or prototype a “data product”; e.g., a spam classifier,
or a search ranking algorithm, or a recommendation system.
9. Now the key here that makes data science special and distinct from statistics is that this data
product then gets incorporated back into the real world, and users interact with that product,
and that generates more data, which creates a feedback loop.
10. This is very different from predicting the weather, say, where your model doesn’t influence
the outcome at all. For example, you might predict it will rain next week, and unless you have
some powers we don’t know about, you’re not going to cause it to rain. But if you instead
build a recommendation system that generates evidence that “lots of people love this book,”
say, then you will know that you caused that feedback loop.
11. A data product that is productionized and that users interact with is at one extreme and the
weather is at the other, but regardless of the type of data you work with and the “data
product” that gets built on top of it—be it public policy determined by a statistical model,
health insurance, or election polls that get widely reported and perhaps influence viewer
opinions—you should consider the extent to which your model is influencing the very
phenomenon that you are trying to observe and understand.
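
A minimal sketch of steps 3 to 5 above in Python with pandas: munge raw data into the clean column format (name | event | year | gender | event time) and run quick EDA checks for duplicates, missing values, and absurd outliers. The file and column names are placeholders, not part of the original notes.

import pandas as pd

raw = pd.read_csv("olympics_raw.csv")                 # hypothetical raw records file

# Munging: keep and rename only the columns we care about (hypothetical names)
clean = raw.rename(columns=str.lower)[["name", "event", "year", "gender", "event_time"]]

# EDA sanity checks that may send us back to collecting or cleaning
print(clean.duplicated().sum())          # duplicate rows
print(clean.isna().sum())                # missing values per column
print(clean["event_time"].describe())    # scale check / absurd outliers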

A Data Scientist’s Role in This Process

This model so far seems to suggest this will all magically happen without human
intervention. By “human” here, we mean “data scientist.” Someone has to make the decisions
about what data to collect, and why. That person needs to be formulating questions and
hypotheses and making a plan for how the problem will be attacked. And that
someone is the data scientist or our beloved data science team. Let’s revise or at least add an
overlay to make clear that the data scientist needs to be involved in this process throughout,
meaning they are involved in the actual coding as well as in the higher-level process, as
shown in Figure 2-3.

Connection to the Scientific Method


We can think of the data science process as an extension of or variation of the scientific
method:

• Ask a question.
• Do background research.
• Construct a hypothesis.
• Test your hypothesis by doing an experiment.
• Analyze your data and draw a conclusion.
• Communicate your results.
In both the data science process and the scientific method, not every problem requires one to
go through all the steps, but almost all problems can be solved with some combination of the
stages. For example, if your end goal is a data visualization (which itself could be thought of
as a data product), it’s possible you might not do any machine learning or statistical
modeling, but you’d want to get all the way to a clean dataset, do some exploratory analysis,
and then create the visualization.

Thought Experiment: How Would You Simulate Chaos?


Most data problems start out with a certain amount of dirty data, ill-defined questions, and
urgency. As data scientists we are, in a sense, attempting to create order from chaos. The
class took a break from the lecture to discuss how they’d simulate chaos. Here are some ideas
from the discussion:
A Lorenzian water wheel, which is a Ferris wheel-type contraption with equally
spaced buckets of water that rotate around in a circle. Now imagine water being dripped into
the system at the very top. Each bucket has a leak, so some water escapes into whatever
bucket is directly below the drip. Depending on the rate of the water coming in, this system
exhibits a chaotic process that depends on molecular-level interactions of water molecules on
the sides of the buckets. Read more about it in this associated Wikipedia article. Many
systems can exhibit inherent chaos. Philippe M. Binder and Roderick V. Jensen have written
a paper entitled “Simulating chaotic behaviour with finite-state machines”, which is about
digital computer simulations of chaos.
An interdisciplinary program involving M.I.T., Harvard, and Tufts taught a technique
entitled “Simulating chaos to teach order”. They simulated an
emergency on the border between Chad and Sudan’s troubled Darfur region, with students
acting as members of Doctors Without Borders, International Medical Corps, and other
humanitarian agencies.
See also Joel Gascoigne’s related essay, “Creating order from chaos in a startup”.
Instructor Notes
1. Being a data scientist in an organization is often a chaotic experience, and it’s the data
scientist’s job to try to create order from that chaos. So, I wanted to simulate that chaotic
experience for my students throughout the semester. But I also wanted them to know that
things were going to be slightly chaotic for a pedagogical reason, and not due to my
ineptitude!
2. I wanted to draw out different interpretations of the word “chaos” to think about the
importance of vocabulary, and the difficulties caused in communication when people either
don’t know what a word means or have different ideas of what the word means. Data
scientists might be communicating with domain experts who don’t really understand what
“logistic regression” means, say, but will pretend to know because they don’t want to appear
stupid, or because they think they ought to know, and therefore don’t ask. But then the whole
conversation is not really a successful communication if the two people talking don’t really
understand what they’re talking about. Similarly, the data scientists ought to be asking
questions to make sure they understand the terminology the domain expert is using (be it an
astrophysicist, a social networking expert, or a climatologist). There’s nothing wrong with
not knowing what a word means, but there is something wrong with not asking! You will
likely find that asking clarifying questions about vocabulary gets you even more insight into
the underlying data problem.
3. Simulation is a useful technique in data science. It can be useful practice to simulate fake
datasets from a model to understand the generative process better, for example, and to debug
code.

Case Study: RealDirect


 How Does RealDirect Make Money?
 Exercise: RealDirect Data Strategy
 Sample R code
1) Doug Perlson, the CEO of RealDirect, has a background in real estate law, startups, and
online advertising. His goal with RealDirect is to use all the data he can access about real
estate to improve the way people sell and buy houses. Normally, people sell their homes
about once every seven years, and they do so with the help of professional brokers and
current data. But there’s a problem both with the broker system and the data quality.
2) RealDirect addresses both of them. First, the brokers. They are typically “free agents”
operating on their own—think of them as home sales consultants. This means that they guard
their data aggressively, and the really good ones have lots of experience. But in the grand
scheme of things, that really means they have only slightly more data than the inexperienced
brokers.
3) RealDirect is addressing this problem by hiring a team of licensed real-estate agents who
work together and pool their knowledge. To accomplish this, it built an interface for sellers,
giving them useful data-driven tips on how to sell their house. It also uses interaction data to
give real-time recommendations on what to do next.
4) The team of brokers also become data experts, learning to use information-collecting tools
to keep tabs on new and relevant data or to access publicly available information. For
example, you can now get data on co-op (a certain kind of apartment in NYC) sales, but
that’s a relatively recent change.
5) One problem with publicly available data is that it’s old news—there’s a three-month lag
between a sale and when the data about that sale is available. RealDirect is working on real-
time feeds on things like when people start searching for a home, what the initial offer is, the
time between offer and close, and how people search for a home online.
6) Ultimately, good information helps both the buyer and the seller. At least if they’re honest.

How Does RealDirect Make Money?


1. First, it offers a subscription to sellers—about $395 a month—to access the selling tools.
2. Second, it allows sellers to use RealDirect’s agents at a reduced commission, typically 2%
of the sale instead of the usual 2.5% or 3%. This is where the magic of data pooling comes in:
it allows RealDirect to take a smaller commission because it’s more optimized, and therefore
gets more volume.
The site itself is best thought of as a platform for buyers and sellers to manage their sale or
purchase process.

There are some challenges they have to deal with as well, of course:
1. First off, there’s a law in New York that says you can’t show all the current housing
listings unless those listings reside behind a registration wall, so RealDirect requires
registration. On the one hand, this is an obstacle for buyers, but serious buyers are likely
willing to do it.

2. Moreover, places that don’t require registration, like Zillow, aren’t true competitors to
RealDirect because they are merely showing listings without providing any additional
service. Doug pointed out that you also need to register to use Pinterest, and it has tons of
users in spite of this.
RealDirect comprises licensed brokers in various established realtor associations, but even so
it has had its share of hate mail from realtors who don’t appreciate its approach to cutting
commission costs. In this sense, RealDirect is breaking directly into a guild.
On the other hand, if a realtor refused to show houses because they are being sold on
RealDirect, the potential buyers would see those listings elsewhere and complain. So the
traditional brokers have little choice but to deal with RealDirect even if they don’t like it. In
other words, the listings themselves are sufficiently transparent so that the traditional brokers
can’t get away with keeping their buyers away from these houses. Doug talked about key
issues that a buyer might care about—nearby parks, subway, and schools, as well as the
comparison of prices per square foot of apartments sold in the same building or block. This is
the kind of data they want to increasingly cover as part of the service of RealDirect.

Exercise: RealDirect Data Strategy


You have been hired as chief data scientist at realdirect.com, and report directly to the CEO.
The company (hypothetically) does not yet have its data plan in place. It’s looking to you to
come up with a data strategy.
Here are a couple ways you could begin to approach this problem:
1. Explore its existing website, thinking about how buyers and sellers would navigate through
it, and how the website is structured/organized. Try to understand the existing business
model, and think about how analysis of RealDirect user-behavior data could be used to
inform decision-making and product development. Come up with a list of research questions
you think could be answered by data:
i. What data would you advise the engineers to log, and what would your ideal datasets
look like?
ii. How would data be used for reporting and monitoring product usage?
iii. How would data be built back into the product/website?
2. Because there is no data yet for you to analyze (typical in a startup when it’s still building
its product), you should get some auxiliary data to help gain intuition about this market. For
example, go to https://github.com/oreillymedia/doing_data_science. Click on Rolling Sales
Update (after the fifth paragraph). You can use any or all of the datasets here—start with
Manhattan August, 2012–August 2013.
i. First challenge: load in and clean up the data. Next, conduct exploratory data analysis
in order to find out where there are outliers or missing values, decide how you will
treat them, make sure the dates are formatted correctly, make sure values you think
are numerical are being treated as such, etc.
ii. Once the data is in good shape, conduct exploratory data analysis to visualize and
make comparisons (i) across neighborhoods, and (ii) across time. If you have
time, start looking for meaningful patterns in this dataset.
3. Summarize your findings in a brief report aimed at the CEO.
4. Being the “data scientist” often involves speaking to people who aren’t also data scientists,
so it would be ideal to have a set of communication strategies for getting to the information
you need about the data. Can you think of any other people you should talk to?
5. Most of you are not “domain experts” in real estate or online businesses.
i. Does stepping out of your comfort zone and figuring out how you would go about
“collecting data” in a different setting give you insight into how you do it in your own
field?

ii. Sometimes “domain experts” have their own set of vocabulary. Did Doug use
vocabulary specific to his domain that you didn’t understand (“comps,” “open
houses,” “CPC”)? Sometimes if you don’t understand vocabulary that an expert is
using, it can prevent you from understanding the problem. It’s good to get in the habit
of asking questions because eventually you will get to something you do understand.
This involves persistence and is a habit to cultivate.
6. Doug mentioned the company didn’t necessarily have a data strategy. There is no industry
standard for creating one. As you work through this assignment, think about whether there is
a set of best practices you would recommend with respect to developing a data strategy for an
online business, or in your own domain.

Sample R code
Here’s some sample R code that takes the Brooklyn housing data in the preceding exercise
and cleans and explores it a bit. (The exercise asks you to do this for Manhattan.)
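
The original R listing is not reproduced in these notes. As a stand-in, here is a rough Python sketch of the same load/clean/explore steps; the file name (assuming the Brooklyn rolling-sales sheet has been exported to CSV) and the column names are assumptions, not the original code:

import pandas as pd
import matplotlib.pyplot as plt

bk = pd.read_csv("rollingsales_brooklyn.csv")
bk.columns = [c.strip().lower().replace(" ", "_") for c in bk.columns]

# Clean: strip "$" and commas, make price and square footage numeric, drop zero-price sales
for col in ["sale_price", "gross_square_feet"]:
    bk[col] = pd.to_numeric(bk[col].astype(str).str.replace(r"[$,]", "", regex=True),
                            errors="coerce")
bk = bk[(bk["sale_price"] > 0) & (bk["gross_square_feet"] > 0)]

# Explore: sale price vs. size on log scales
plt.scatter(bk["gross_square_feet"], bk["sale_price"], s=5)
plt.xscale("log"); plt.yscale("log")
plt.xlabel("gross square feet"); plt.ylabel("sale price")
plt.show()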

Algorithm:
An algorithm is a procedure or set of steps or rules to accomplish a task. Algorithms are one
of the fundamental concepts in, or building blocks of, computer science: the basis of the
design of elegant and efficient code, data preparation and processing, and software
engineering.
Some of the basic types of tasks that algorithms can solve are sorting, searching, and
graph-based computational problems.
Efficient algorithms that work sequentially or in parallel are the basis of pipelines to process
and prepare data. With respect to data science, there are at least three classes of algorithms
one should be aware of:
1. Data munging, preparation, and processing algorithms, such as sorting, MapReduce, or
Pregel.
We would characterize these types of algorithms as data engineering, and while we devote a
chapter to this, it’s not the emphasis of this book. This is not to say that you won’t be doing
data wrangling and munging—just that we don’t emphasize the algorithmic aspect of it.
2. Optimization algorithms for parameter estimation, including Stochastic Gradient Descent,
Newton’s Method, and Least Squares.

3. Machine learning algorithms, which are used to predict, classify, or cluster. The three
covered in this module are linear regression, k-nearest neighbours (k-NN), and k-means.

Machine Learning Algorithms

Linear Regression for Machine Learning


Linear regression in statistics and machine learning is perhaps one of the most well-known
and well understood algorithms.
In this section you will discover the linear regression algorithm, how it operates, and how you
can best use it in your machine learning projects. You will learn:
• Why linear regression belongs to both statistics and machine learning.
• The other names by which linear regression is known.
• The representation used by a linear regression model and the learning algorithms used to
estimate its coefficients.
• How best to prepare your data when modeling with linear regression.
To grasp linear regression you don’t need to know any statistics or linear algebra. This is a
gentle, high-level introduction to the technique, giving you enough background to be able to
make effective use of it on your own problems.

Linear Regression Model Representation

Linear regression is a popular model partly because its representation is so simple. The
representation is a linear equation that combines a particular set of input values (x) with the
output (y) that is predicted for that set of input values. As such, both the input values (x) and
the output value (y) are numeric. For each input value or column, the linear equation assigns
one scale factor, called a coefficient, commonly represented by the Greek letter Beta (B). One
additional coefficient is also introduced, giving the line an extra degree of freedom (for
example, moving up and down on a two-dimensional plot); it is referred to as the intercept or
bias coefficient.
For example, in a simple regression problem (a single x and a single y), the form of the model
would be:
y = B0 + B1*x
In higher dimensions, when we have more than one input (x), the line is called a plane or a
hyperplane. The representation is therefore the form of the equation and the specific values
used for the coefficients (e.g. B0 and B1 in the example above).

It is common to talk about the complexity of a regression model such as linear regression.
This refers to the number of coefficients used for the model.

When the coefficient is zero, the influence of the input variable on the model is effectively
removed from the model prediction (0 * x = 0). This becomes relevant when you look at the
regularization methods that change the learning algorithm to reduce the complexity of the
regression models by putting pressure on the absolute size of the coefficients, driving some to
zero.

Linear Regression Learning Model


Learning a linear regression model means estimating the values of the coefficients used in the
representation with the data available to us. In this section, we will take a brief look at four
techniques for the preparation of a linear regression model. This is not enough information to
implement it from scratch, but enough to get a taste of the computation and the trade-offs
involved. There are a lot more techniques because the model is so well studied. Take note of
Ordinary Least Squares because it is the most common method used in general. Take note of
Gradient Descent as it is the most common technique taught in machine learning classes.

1. Simple Linear Regression


With simple linear regression, if we have a single input, we can use statistics to estimate the
coefficients. This requires you to calculate statistical properties from data such as mean,
standard deviations, correlations, and covariance. All of the data must be available in order
to traverse it and calculate these statistics.
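
A minimal sketch of this statistical approach in Python, using made-up data; the coefficients are computed as B1 = covariance(x, y) / variance(x) and B0 = mean(y) - B1 * mean(x):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # covariance(x, y) / variance(x)
b0 = y.mean() - b1 * x.mean()                    # intercept
print(b0, b1)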

2. Ordinary Least Squares


If we have more than one input, we can use Ordinary Least Squares to estimate the values of
the coefficients. The Ordinary Least Squares procedure seeks to minimize the sum of the
squared residuals. This means that, given a regression line through the data, we calculate the
distance from each data point to the regression line, square it, and sum all of these squared
errors together. This is the quantity that ordinary least squares tries to minimize. The
approach treats the data as a matrix and uses linear algebra operations to estimate the
optimum coefficient values. This means that all of the data must be available and you must
have enough memory to hold the data and perform the matrix operations.
It is unusual to perform the Ordinary Least Squares procedure yourself, unless it is done as a
linear algebra exercise. It's more likely you're going to call a procedure in a linear algebra
library. This procedure is very quick to calculate.
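
A minimal sketch of calling such a linear algebra routine in Python, using numpy's least-squares solver on made-up data with two inputs:

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])   # two input columns
y = np.array([5.0, 4.0, 11.0, 10.0])

# Add a column of ones so the intercept (B0) is estimated along with B1 and B2
Xb = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(coeffs)    # [B0, B1, B2]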

3. Gradient Descent
When there are one or more inputs, you can optimize the coefficient values by iteratively
minimizing the error of the model on your training data. This operation is called Gradient
Descent, and it starts with random values for each coefficient. The sum of
squared errors is calculated for each pair of input and output values. The learning rate is used
as a scale factor and the coefficients are updated in order to minimize the error. The process
is repeated until a minimum amount of squared error is achieved or no further improvement
is possible.
When using this method, you must select the learning rate (alpha) parameter that will
determine the size of the improvement step to be taken for each iteration of the procedure.
Gradient descent is often taught using a linear regression model because it is relatively easy
to understand. In practice, it is useful when you have a very large dataset in either the number
of rows or the number of columns that may not fit into your memory.
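
A minimal sketch of gradient descent for simple linear regression, using made-up data and an assumed learning rate; the coefficients start at random values and are repeatedly nudged in the direction that reduces the mean squared error:

import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

b0, b1 = rng.normal(size=2)          # random starting coefficients
alpha = 0.01                         # learning rate (chosen for illustration)

for _ in range(2000):                # fixed number of iterations for simplicity
    error = (b0 + b1 * x) - y        # prediction error on every training example
    b0 -= alpha * 2 * error.mean()           # gradient of the mean squared error w.r.t. B0
    b1 -= alpha * 2 * (error * x).mean()     # gradient of the mean squared error w.r.t. B1

print(b0, b1)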

4. Regularization
There are extensions to the training of a linear model called regularization methods. These
seek both to minimize the sum of the squared error of the model on the training data (using
ordinary least squares) and to reduce the complexity of the model (such as the number or
absolute size of the sum of all coefficients in the model).

Two common examples of regularization procedures for linear regression are:


1. Lasso Regression: where ordinary least squares is modified to also minimize the absolute
sum of the coefficients (called L1 regularization).
2. Ridge Regression: where ordinary least squares is modified to also minimize the sum of the
squared coefficients (called L2 regularization).
These methods are effective to use when there is collinearity in your input values and
ordinary least squares would overfit the training data.
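
A minimal sketch of the two regularized variants, assuming scikit-learn is available; alpha controls the strength of the penalty and the data is made up:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: penalizes the absolute sum of the coefficients
ridge = Ridge(alpha=0.1).fit(X, y)   # L2: penalizes the sum of the squared coefficients
print(lasso.coef_, ridge.coef_)
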
Now that you know some techniques to learn the coefficients in a linear regression model,
let's look at how we can use a model to make new data predictions.

Making Linear Regression Predictions

Since the representation is a linear equation, making predictions is as simple as solving the
equation for a specific set of inputs.
Let's use an example to make this concrete. Imagine that we predict weight (y) from height
(x). Our linear regression model for this problem would be:
y = B0 + B1 * x1
or
weight = B0 + B1 * height
Where B0 is the bias coefficient and B1 is the coefficient for the height column. We use a
learning technique to find a good set of coefficient values. Once found, we can plug in
different height values to predict the weight.
For example, let’s use B0 = 0.1 and B1 = 0.5. Let’s plug them in and calculate the weight (in
kilograms) for an individual with the height of 182 centimeters.
weight = 0.1 + 0.5 * 182
weight = 91.1

You can see that the above equation could be plotted as a line in two-dimensions. The B0 is
our starting point regardless of what height we have. We can run through a bunch of heights
from 100 to 250 centimeters and plug them into the equation to get weight values, creating
our line.
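
A minimal sketch of generating that line in Python with the assumed coefficients B0 = 0.1 and B1 = 0.5 from the example above (this is what a plot like Figure 3 shows):

import numpy as np
import matplotlib.pyplot as plt

heights = np.arange(100, 251)        # heights from 100 to 250 cm
weights = 0.1 + 0.5 * heights        # y = B0 + B1 * x

plt.plot(heights, weights)
plt.xlabel("height (cm)"); plt.ylabel("predicted weight (kg)")
plt.show()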

Figure 3: Sample Height vs Weight Linear Regression

Now that we know how to make predictions given a learned linear regression model, let’s
look at some rules of thumb for preparing our data to make the most of this type of model.

Preparing Data for Linear Regression


Linear regression has been studied extensively, and there is a lot of literature on how your
data needs to be structured to make the most of the model.
As such, there is a lot of sophistication in the way these requirements and expectations are
discussed, which can be intimidating. In practice, they can be used more as rules of thumb
when using Ordinary Least Squares regression, the most common linear regression
implementation.
Try using these heuristics to prepare your data differently and see what works best for your
problem.
1. Linear Assumption. Linear regression assumes that the relationship between input and
output is linear. It doesn't support anything else. This may be obvious, but it's a good
thing to remember when you have a lot of attributes. You may need to transform the
data to make the relationship linear (e.g. a log transform for an exponential
relationship).
2. Remove Noise. Linear regression assumes that the input and output variables are
not noisy. Consider using data cleaning operations that will make it easier for you to
expose and clarify the signal in your data. This is most important for the output
variable and, if possible, you want to remove outliers in the output variable (y).
3. Remove Collinearity. Linear regression will over-fit your data when you have
highly correlated input variables. Consider calculating pairwise correlations for
your input data and removing the most correlated.
4. Gaussian Distributions. Linear regression makes more reliable predictions if your
input and output variables have a Gaussian distribution. You may get some benefit
from transforms (e.g. log or Box-Cox) on your variables to make their distribution
look more Gaussian.
5. Rescale Inputs: Linear regression will often make more reliable predictions if you
rescale input variables using standardization or normalization.
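
A minimal sketch of rule 5 (rescaling inputs), assuming scikit-learn; standardization rescales each input column to zero mean and unit variance:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[160.0, 30.0], [170.0, 45.0], [182.0, 52.0], [195.0, 61.0]])  # made-up inputs
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)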

K-NEAREST NEIGHBORS (K-NN)


The k-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm that can be
used for both classification and regression predictive problems. However, it is mainly used
for classification problems in industry.
The following two properties define KNN well:
• Lazy learning algorithm − KNN is a lazy learning algorithm because it has no specialized
training phase and uses all of the data for training while classifying.
• Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm
because it assumes nothing about the underlying data.

Working of KNN Algorithm


K-nearest neighbors (KNN) algorithm uses 'feature similarity' to predict the values of new
data points, which means that a value will be assigned to the new data point based on how
closely the points in the training set match. We can understand how it works by following
steps –
Step 1 – We need a data set to implement any algorithm. So we have to load the training as
well as the test data during the first step of KNN.
Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to
consider. K can be any integer.
Step 3 − For each point of the test data, do the following −
3.1 − Calculate the distance between the test data and each row of training data using any
method, namely: Euclidean, Manhattan or Hamming distance. Euclidean is the most common
method used to calculate distance.
3.2 − Now, based on the distance values, sort the training rows in ascending order.
3.3 − Next, the top K rows will be selected from the sorted array.
3.4 − Now, assign a class to the test point based on the most frequent class of these rows.
Step 4 − End.
Example: The following is an example to understand the concept of K and the working of the
KNN algorithm. Suppose we have a dataset that can be plotted as shown in Figure 4.

Figure 4: Dataset
Now, we would like to classify the new data point (the black dot at (60, 60)) into either the
blue or the red class. Assuming K = 3, the algorithm finds the three nearest data points, as
shown in Figure 5.

We can see in Figure 5 the three nearest neighbors of the data point with the black dot. Among
those three, two lie in the red class, hence the black dot is also assigned to the red class.
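
A minimal from-scratch sketch of Steps 1 to 4 in Python, using Euclidean distance and K = 3 on a small made-up dataset in the spirit of Figures 4 and 5:

import numpy as np
from collections import Counter

train_X = np.array([[25, 40], [30, 45], [57, 57],    # "blue" points
                    [55, 60], [60, 55], [65, 65]])   # "red" points
train_y = np.array(["blue", "blue", "blue", "red", "red", "red"])
test_point = np.array([60, 60])
K = 3

distances = np.linalg.norm(train_X - test_point, axis=1)    # Step 3.1: Euclidean distances
nearest = np.argsort(distances)[:K]                         # Steps 3.2 and 3.3: sort, take top K
predicted = Counter(train_y[nearest]).most_common(1)[0][0]  # Step 3.4: majority class
print(predicted)   # "red": two of the three nearest points are red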

Implementation in Python
As we know, the k-nearest neighbors (KNN) algorithm can be used for both classification
and regression. The following is a recipe in Python for using KNN as a classifier.
KNN as Classifier
First, start with importing the necessary Python packages −
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Next, download the iris dataset from its weblink as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, we need to read the dataset into a pandas DataFrame as follows −
dataset = pd.read_csv(path, names = headernames)
dataset.head()
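
The remaining steps of the classifier recipe are not reproduced in these notes; a sketch of how it typically continues, assuming scikit-learn is available −
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X = dataset.iloc[:, :-1].values      # the four measurement columns
y = dataset.iloc[:, -1].values       # the Class column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)
print(accuracy_score(y_test, classifier.predict(X_test)))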


Pros and Cons of KNN


Pros
• It is a very simple algorithm to understand and interpret.
• It is very useful for non-linear data because the algorithm makes no assumptions about
the data.
• It is a versatile algorithm that can be used for both classification and regression.
• It is comparatively accurate, although there are better supervised learning models than
KNN.
Cons
• It is a computationally expensive algorithm, because it stores all of the training data.
• It requires high memory storage compared to other supervised learning algorithms.
• Prediction is slow when N is large.
• It is very sensitive to the scale of the data as well as to irrelevant features.

Applications of KNN
The following are some of the areas in which KNN can be successfully applied –
1. Banking system: KNN can be used in the banking system to predict whether an individual
is fit for loan approval. Does that individual have the same characteristics as one of the
defaulters?
2. Calculation of credit ratings: KNN algorithms can be used to find an individual's credit
rating by comparing it to persons with similar characteristics.
3. Politics: With the assistance of KNN algorithms, we can classify potential voters into
different classes like "Will vote," "Will not vote," "Will vote for the Congress Party," "Will
vote for the BJP Party."
4. Other areas where the KNN algorithm is often used are Speech Recognition, Handwriting
Detection, Image Recognition, and Video Recognition.

K-MEANS
K-means algorithm is an iterative algorithm that attempts to divide the dataset into K pre-
defined separate non-overlapping subgroups (clusters) where each data point belongs to only
one group. It tries to make the intra-cluster data points as similar as possible while keeping the
clusters as different (far) as possible. It assigns data points to a cluster in such a way that the
sum of the squared distance between the data points and the cluster's centroid (the arithmetic
mean of all data points belonging to that cluster) is at a minimum. The less variation we have
within clusters, the more homogeneous (similar) the data points are within the same cluster.

The way k-means algorithm works is as follows:


1. Specify the number of clusters, K.
2. Initialize the centroids by first shuffling the dataset and then randomly selecting K data
points for the centroids without replacement.
3. Keep iterating over the following steps until there is no change in the centroids, i.e. until the
assignment of data points to clusters does not change:
4. Calculate the sum of the squared distance between the data points and all the centroids.
5. Assign each data point to the nearest cluster (centroid).
6. Calculate the centroids for the clusters by taking the average of all data points belonging to
each cluster.

The approach k-means follows to solve the problem is called Expectation-Maximization.


The E-step assigns the data points to the nearest cluster. The M-step computes the
centroid of each cluster. Below is a breakdown of how we can solve it mathematically.
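
The full mathematical breakdown is not reproduced in these notes; the quantity k-means minimizes is the standard within-cluster sum of squares, written in LaTeX notation (with \mu_k denoting the centroid of cluster C_k) as

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

and each E-step/M-step pass can only decrease this quantity or leave it unchanged.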

Few things to note here:


• Since clustering algorithms, including k-means, use distance-based measurements to
determine the similarity between data points, it is recommended that the data be standardized
to have a mean of zero and a standard deviation of one, since almost always the features in
any dataset have different units of measurement, such as age vs. income.
• Given the iterative nature of k-means and the random initialization of the centroids at the
start of the algorithm, different initializations may lead to different clusters, because the
algorithm may get stuck in a local optimum and may not converge to the global optimum. It
is therefore recommended that the algorithm be run using different centroid initializations
and that the results of the run that yielded the lowest sum of squared distances be chosen.
• The assignment of examples not changing is the same thing as no change in within-cluster
variation.

Implementation
We will use a simple implementation of k-means here to illustrate some of the concepts.
Then we will use the sklearn implementation, which is more efficient and takes care of many
things for us.
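
The implementations themselves are not reproduced in these notes; a minimal sketch using scikit-learn's KMeans on made-up two-dimensional data −
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])  # three blobs

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # one centroid per cluster
print(kmeans.labels_[:10])       # cluster assignments of the first few points
print(kmeans.inertia_)           # within-cluster sum of squared distances (the objective above)
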
Applications
K-means algorithm is very popular and is used in a variety of applications such as market
segmentation, document clustering, image segmentation, image compression, etc. The goal
when we undertake a cluster analysis is usually either to:
1. Get a meaningful insight into the structure of the data we are dealing with; or
2. Cluster-then-predict, where different models will be built for different subgroups if we
believe there is a wide variation in the behavior of different subgroups. An example of this is
the clustering of patients into different subgroups and the development of a model for each
subgroup to predict the risk of heart attack.
