Data Science Lab 5
Class: BSCS-6A
Date: 5-3-2020
Table of Contents
Features Data Type
Scikit-learn
Linear Regression
How to Find the Regression Equation
How to Use the Regression Equation
How to Find the Coefficient of Determination
K Means
LAB TASKS
Bahria University, Islamabad Campus
Department of Computer Science
Lab 5: Prediction of data in Python
Introduction
The purpose of this lab is to get familiar with data science in Python. In this lab we explore
prediction techniques on data in Python, using examples. I encourage you to type all Python
commands on your own machine.
Tools/Software Requirement
Python, Jupyter Notebook
Scikit-learn
Scikit-learn is probably the most useful library for machine learning in Python. It is built on
NumPy, SciPy and matplotlib, and contains a lot of efficient tools for
machine learning and statistical modeling, including classification, regression,
clustering and dimensionality reduction.
Linear Regression
Linear regression is a predictive modeling technique. It is used whenever there is a linear relation
between the dependent and the independent variables.
y = b0 + b1 * x
where b0 is the intercept and b1 is the slope of the line.
It is used to estimate how much y will change when x changes by a certain amount.
As we see in the picture, a flower’s sepal length is mapped onto the x-axis and the petal length is
mapped on the y-axis.
In the table below, the xi column shows scores on the aptitude test. Similarly, the yi column
shows statistics grades. The last two columns show deviation scores - the difference between the
student's score and the average score on each test. The last two rows show sums and mean scores
that we will use to conduct the regression analysis.
And for each student, we also need to compute the squares of the deviation scores (the last two
columns in the table below).
Student   xi    yi    (xi - x̄)(yi - ȳ)
1         95    85    136
2         85    95    126
3         80    70    -14
4         70    65    96
5         60    70    126
Sum       390   385   470
Mean      78    77
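The sums in the table can be checked with a few lines of Python. The sketch below is only an illustration: the variable names (dx, dy, sum_dx_dy and so on) are chosen here and do not come from the lab material. It recomputes the means, the deviation scores and the three sums used later in the regression analysis.

```python
# Aptitude-test scores (x) and statistics grades (y) from the table above.
x = [95, 85, 80, 70, 60]
y = [85, 95, 70, 65, 70]

# Mean score on each test.
x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# Deviation scores: distance of each observation from its mean.
dx = [xi - x_mean for xi in x]
dy = [yi - y_mean for yi in y]

# Sums used in the regression analysis.
sum_dx_dy = sum(a * b for a, b in zip(dx, dy))  # sum of cross-products
sum_dx_sq = sum(a * a for a in dx)              # sum of squared x deviations
sum_dy_sq = sum(b * b for b in dy)              # sum of squared y deviations

print(x_mean, y_mean)                   # 78.0 77.0
print(sum_dx_dy, sum_dx_sq, sum_dy_sq)  # 470.0 730.0 630.0
```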
The regression equation is a linear equation of the form: ŷ = b0 + b1 * x. To conduct a regression
analysis, we need to solve for b0 and b1. Computations are shown below. Notice that all of our
inputs for the regression analysis come from the above three tables.
b1 = Σ [ (xi - x̄)(yi - ȳ) ] / Σ [ (xi - x̄)^2 ]
b1 = 470 / 730
b1 = 0.644
Once we know the value of the regression coefficient (b1), which is the slope, we can solve for
the regression intercept (b0):
b0 = ȳ - b1 * x̄
b0 = 77 - (0.644)(78)
b0 = 26.768
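The same computation can be written as a short script. This is a minimal sketch with variable names chosen here for illustration. Note that it keeps b1 unrounded, so the intercept comes out as about 26.781 rather than the 26.768 obtained above from the rounded value 0.644.

```python
# Aptitude-test scores (x) and statistics grades (y) from the worked example.
x = [95, 85, 80, 70, 60]
y = [85, 95, 70, 65, 70]

x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# Slope: sum of cross-products divided by sum of squared x deviations.
numerator = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
denominator = sum((xi - x_mean) ** 2 for xi in x)
b1 = numerator / denominator

# Intercept: the fitted line passes through the point (x_mean, y_mean).
b0 = y_mean - b1 * x_mean

print(round(b1, 3))  # 0.644
print(round(b0, 3))  # 26.781 (unrounded b1; the hand computation uses 0.644)
```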
Once you have the regression equation, using it is a snap. Choose a value for the independent
variable (x), perform the computation, and you have an estimated value (ŷ) for the dependent
variable.
In our example, the independent variable is the student's score on the aptitude test. The dependent
variable is the student's statistics grade. If a student made an 80 on the aptitude test, the estimated
statistics grade (ŷ) would be:
ŷ = b0 + b1 * x
ŷ = 26.768 + 0.644 * 80
ŷ = 26.768 + 51.52
ŷ = 78.288
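As a cross-check, the same line can be fitted with scikit-learn. The sketch below assumes scikit-learn and NumPy are installed and uses the LinearRegression estimator; the variable names are chosen here for illustration. Because the library does not round the slope, its results differ from the hand computation in the third decimal place of the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# scikit-learn expects the features as a 2-D array: one row per student,
# one column per feature (here only the aptitude score).
X = np.array([95, 85, 80, 70, 60]).reshape(-1, 1)
y = np.array([85, 95, 70, 65, 70])

model = LinearRegression().fit(X, y)

b0 = model.intercept_   # fitted intercept
b1 = model.coef_[0]     # fitted slope for the single feature

# Estimated statistics grade for an aptitude score of 80.
y_hat = model.predict(np.array([[80]]))[0]

print(round(b1, 3), round(b0, 3), round(y_hat, 2))
```

The slope agrees with 0.644, the intercept is about 26.781, and the estimate for a score of 80 is about 78.29.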
In this example, the aptitude test scores used to create the regression equation ranged from 60 to
95. Therefore, only use values inside that range to estimate statistics grades. Using values outside
that range (less than 60 or greater than 95) is problematic.
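One way to respect this restriction in code is to refuse inputs outside the fitted range. The sketch below is hypothetical: the function name predict_grade and the constants are chosen here for illustration, and the coefficients are the rounded values from the worked example.

```python
# Coefficients from the worked example (rounded values).
B0 = 26.768
B1 = 0.644

# Range of aptitude scores that was used to fit the equation.
X_MIN, X_MAX = 60, 95


def predict_grade(score):
    """Estimate the statistics grade for an aptitude score inside the fitted range."""
    if not X_MIN <= score <= X_MAX:
        # Extrapolating beyond the observed data is unreliable, so refuse it.
        raise ValueError(f"score {score} is outside the fitted range {X_MIN}-{X_MAX}")
    return B0 + B1 * score


print(round(predict_grade(80), 3))  # 78.288
```

Calling predict_grade(100) raises a ValueError instead of returning an extrapolated estimate.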
Whenever you use a regression equation, you should ask how well the equation fits the data. One
way to assess fit is to check the coefficient of determination (R^2), which can be computed from the
following formula.
R^2 = { ( 1 / N ) * Σ [ (xi - x̄) * (yi - ȳ) ] / ( σx * σy ) }^2
where N is the number of observations used to fit the model, Σ is the summation symbol, xi is the
x value for observation i, x̄ is the mean x value, yi is the y value for observation i, ȳ is the mean y
value, σx is the standard deviation of x, and σy is the standard deviation of y.
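Computations for the sample problem of this lesson are shown below. We begin by computing
the standard deviation of x (σx) and of y (σy):
σx = sqrt [ Σ ( xi - x̄ )^2 / N ] = sqrt( 730 / 5 ) = sqrt( 146 ) = 12.083
σy = sqrt [ Σ ( yi - ȳ )^2 / N ] = sqrt( 630 / 5 ) = sqrt( 126 ) = 11.225
Then we compute the coefficient of determination:
R^2 = [ ( 1 / 5 ) * 470 / ( 12.083 * 11.225 ) ]^2 = ( 94 / 135.632 )^2 = ( 0.693 )^2 = 0.48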
A coefficient of determination equal to 0.48 indicates that about 48% of the variation in statistics
grades (the dependent variable) can be explained by the relationship to math aptitude scores
(the independent variable). This would be considered a good fit to the data, in the sense that it
would substantially improve an educator's ability to predict student performance in statistics
class.
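The value 0.48 can be reproduced in Python. This is a minimal sketch using only the standard library, with variable names chosen here for illustration. It uses the population standard deviation (division by N).

```python
import math

# Aptitude-test scores (x) and statistics grades (y) from the worked example.
x = [95, 85, 80, 70, 60]
y = [85, 95, 70, 65, 70]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Population standard deviations (divide by N, not N - 1).
sigma_x = math.sqrt(sum((xi - x_mean) ** 2 for xi in x) / n)
sigma_y = math.sqrt(sum((yi - y_mean) ** 2 for yi in y) / n)

# Correlation coefficient r, then R^2 = r squared.
cross_sum = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
r = (1 / n) * cross_sum / (sigma_x * sigma_y)
r_squared = r ** 2

print(round(sigma_x, 3), round(sigma_y, 3))  # 12.083 11.225
print(round(r_squared, 2))                   # 0.48
```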
K Means
K-Means can be summarized as below:
Initialisation – K initial “means” (centroids) are generated at random
Assignment – K clusters are created by associating each observation with the nearest
centroid
Update – The centroid of each cluster becomes the new mean
The Assignment and Update steps are repeated until the centroids stop changing.
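The three steps above can be sketched directly in NumPy. This is an illustrative sketch, not the scikit-learn implementation: the function name k_means and its parameters are chosen here, the initial centroids are assumed to be K randomly chosen observations, and the loop is assumed to stop once the centroids no longer move.

```python
import numpy as np


def k_means(points, k, n_iter=100, seed=0):
    """Cluster the rows of `points` into k groups; return (centroids, labels)."""
    rng = np.random.default_rng(seed)

    # Initialisation: pick k distinct observations as the starting centroids
    # (an assumption; other random initialisations are possible).
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(n_iter):
        # Assignment: label each point with the index of its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Update: move each centroid to the mean of the points assigned to it.
        # A cluster that received no points keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels


# Two well-separated groups of three points each.
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = k_means(points, k=2)
print(labels)
```

The first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initialisation.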
Next, we cluster data using the K-Means algorithm with Python. As always, we need to start by
importing the required libraries.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
In this tutorial, we’ll generate our own data using the make_blobs function from
the sklearn.datasets module. The centers parameter specifies the number of clusters.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
                  random_state=0)
plt.scatter(X[:,0], X[:,1])
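Once the data is generated, a K-Means model can be fitted to it. The sketch below assumes a recent scikit-learn version. The choices n_clusters=4 (matching the centers parameter above), n_init=10 and random_state=0 are assumptions made here for a reproducible illustration, not values given in the lab.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as above: 300 points scattered around 4 centers.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit K-Means with 4 clusters; n_init=10 restarts the algorithm from
# 10 different random initialisations and keeps the best result.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# One centroid per cluster, each with two coordinates.
print(kmeans.cluster_centers_.shape)  # (4, 2)
```

The labels array can be passed as the c argument of plt.scatter to colour each point by its cluster, and kmeans.cluster_centers_ can be plotted on top to show the centroids.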
LAB TASKS
Task 1
Write the data type of each feature in bigsales.csv, using the types mentioned above.
Feature Name Data type
Task 2
The Housing dataset contains information collected by the U.S. Census Service concerning
housing in the area of Boston, Mass. There are 506 samples and 13 feature variables in this
dataset. The objective is to predict house prices from the given features
using linear regression. The following features will be considered for regression:
LSTAT: Percentage of lower status of the population
MEDV: Median value of owner-occupied homes in $1000s
Deliverables: Submit the Python files as a zip archive, along with the lab journal, before the next lab.