DSF Lab Exp Full

The document outlines a series of exercises focused on using Python for data analytics, including installation of packages like NumPy, SciPy, Pandas, and Matplotlib. It provides step-by-step procedures for creating and manipulating arrays, data frames, and various plots, as well as performing statistical calculations such as mean, mode, and standard deviation. Each exercise concludes with a verification of the output, demonstrating successful implementation of the tasks.

EX.NO: 1 DOWNLOAD, INSTALL AND EXPLORE THE FEATURES OF PYTHON FOR DATA ANALYTICS

PROCEDURE:
pip is the package manager for Python packages. We can use pip to install packages that
do not come with Python. The basic syntax of a pip command at the command prompt is:
pip <arguments>
Step1:

Step2:
pip comes pre-installed with Python 3.4 and later versions. To check whether pip is
installed, type the command below in the terminal.
pip --version
This command prints the version of pip if pip is already installed on the system.

Step3:
Before you upgrade, note the current pip version by running pip --version.
On Windows, open the command prompt and run the following
command to upgrade pip to the latest available version:
# Upgrade to latest available version
python -m pip install --upgrade pip

Step4:
Check the version of pip again; it should now show the updated version.

Step5:
Install NumPy:
pip install numpy
pip downloads the NumPy package and notifies you when it has been successfully installed.
On systems where the command is named pip3, use pip3 install numpy instead.

Step6:
Install SciPy:
pip install scipy
pip downloads the SciPy package and notifies you when it has been successfully installed.

Step7:
Install pandas:
pip install pandas
pip downloads the pandas package and notifies you when it has been successfully installed.

Step8:
Install matplotlib:
pip install matplotlib
pip downloads the matplotlib package and notifies you when it has been successfully installed.

Step9:
Install seaborn:
pip install seaborn

Step10:
Finally, type the command python to start the interpreter and import the packages using the import statement.
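
The import check in Step 10 can also be run as a short script instead of being typed interactively. The sketch below is one possible way to do it, assuming the five packages from Steps 5 to 9; the helper name installed_versions is illustrative and not part of the original exercise. It reports the version of each package, or flags it if the import fails.

```python
import importlib

# Import names of the packages installed in Steps 5 to 9
PACKAGES = ["numpy", "scipy", "pandas", "matplotlib", "seaborn"]

def installed_versions(names):
    """Return {name: version string, or None if the package cannot be imported}."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions

versions = installed_versions(PACKAGES)
for name, version in versions.items():
    print(name, "->", version if version is not None else "NOT INSTALLED")
```

Any package reported as NOT INSTALLED can be installed by repeating the corresponding pip install step.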

Result:
Thus Python was downloaded and installed, and its features for data analytics were
explored successfully.
EX.NO: 2 WORKING WITH NUMPY ARRAYS

ALGORITHM:

Step 1: Start the program
Step 2: import numpy as np
Step 3: Define a one-, two- or multi-dimensional array
Step 4: Perform operations on the NumPy arrays
Step 5: Display the output
Step 6: Stop the program

CODE:
1) To create an array from a list
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)

2) To create one dimensional array
import numpy as np
# the name "values" is used instead of "list", which would shadow the built-in type
values = [1, 2, 3, 4]
sample_array = np.array(values)
print("List in python:", values)
print("Numpy array in python:", sample_array)

3) To create multidimensional array
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
sample_array1 = np.array([list_1, list_2, list_3])
print("Numpy multidimensional array in python \n", sample_array1)

4) To add two array values


a1=np.arange(1,7).reshape(2,3)
a2=np.arange(8,14).reshape(2,3)
d=np.add(a1,a2)
print(d)

5) To print min,max values


print(np.max(sample_array1))
print(np.min(sample_array1))
6. To append a value to the end of an array
arr1 = np.array([1, 6, 7, 34, 100])
append_val = np.append(arr1, 99)  # returns a new array; arr1 itself is unchanged
print(append_val)

7. To print a range of values


arange_val=np.arange(11)
print(arange_val)

8. To print a range of values from start to stop (the stop value is excluded)


arange_val_start_and_stop=np.arange(1,10)
print(arange_val_start_and_stop)

9. To print a range of values using start, stop and step values


arange_val_start_stop_and_step=np.arange(1,10,2)
print(arange_val_start_stop_and_step)

10. To print the round values of an array


around_array=np.array([3.4, 1.1, 5.5, 6.7])
print(np.around(around_array))

11. To generate random floating-point samples

# four random floats drawn uniformly from [0, 1)
rand_vals = np.random.rand(4)
print(rand_vals)

12. To generate 2 dimensional array


array_2d = np.array([[1,2,3], [4,5,6]])
print(array_2d)

13. To print the shape of the array


print(array_2d.shape)
#Output will be (2,3), which indicates 2 rows and 3 columns

14. To print an array that contains only the value 0


array_zero_only = np.zeros((5,5),dtype='int') #5 rows and 5 column
print(array_zero_only)

15. To print the mean of the array


array1 = np.array([1,16,4,25,9])
mean_array = np.mean(array1)
print(mean_array)

16. To print the median of the array


array1 = np.array([1,16,4,25,9])
median_array = np.median(array1)
print(median_array)

17. Reshape the array

# Randomly generate a 4x3 integer matrix with values from 0 to 14
array1 = np.random.randint(15, size=(4, 3))
print("Before Reshape:", array1)
print(" ")
array_reshape = array1.reshape(3, 4)
print("After Reshape:", array_reshape)

18. Print array based on the condition


array=np.arange(1,11)
filter_val = array[np.where(array>5)]
print(filter_val)
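
The condition array>5 is itself an array of True/False values, and indexing with that boolean array keeps only the elements where it is True. The short sketch below (variable names are illustrative) shows the intermediate mask and that indexing with it directly gives the same filtered values as the np.where form.

```python
import numpy as np

array = np.arange(1, 11)

# Boolean array: True where the condition holds, False elsewhere
mask = array > 5

# Indexing with the mask keeps only the elements where mask is True
filtered = array[mask]

print(mask)
print(filtered)
```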

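19. To compare two arrays

array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([1, 2, 3, 4, 5])
# np.equal compares element by element and returns an array of booleans
print(np.equal(array1, array2))
# np.array_equal returns a single True only if shape and all elements match
print(np.array_equal(array1, array2))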

20. To repeat the value for the number of times


print_100_five_times = np.repeat('100', 5)
print(print_100_five_times)

21. To print the standard deviation and variance of the array

array1 = np.array([1, 2, 3, 4, 5, 6])
std_of_array1 = np.std(array1)
variance_of_array1 = np.var(array1)
print("Standard Deviation of the array is : ", std_of_array1)
print("Variance of the array is :", variance_of_array1)

RESULT:

Thus the Python program for creating NumPy arrays using different functions has been
written and the output has been verified.
Ex.No.3 WORKING WITH PANDAS DATA FRAMES

AIM:

To create a DataFrame using a single list or a list of lists, locate a row, create named indexes and locate rows by
named index.

CODE 1:
Create a DataFrame using a single list.
import pandas as pd
lst = ['python', 'For', 'first', 'year',
       'students', 'interesting',
       'programs']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)

CODE 2:

Locate Row:

import pandas as pd
lst = ['python', 'For', 'first', 'year',
       'students', 'interesting',
       'programs']

# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
# Locate the row whose index label is 0
print(df.loc[0])

CODE 3:
Named Indexes,Locate Named Indexes
import pandas as pd
lst = {
"list1":['python', 'For', 'first'],
"list2":['students', 'interesting', 'programs']
}
# Calling DataFrame constructor on list
df = pd.DataFrame(lst, index = ["day1", "day2", "day3"])
print(df)
print(df.loc["day2"])
RESULT:

Thus the Python program for creating DataFrames using different functions has been written and the
output has been verified.
Ex.No.4 BASIC PLOTS USING MATPLOTLIB

AIM:

To draw various plots such as line plot, bar graph, histogram, scatter plot, area
plot and pie chart.

CODE 1

LINE PLOT:
from matplotlib import pyplot as plt

# Plotting to our canvas
plt.plot([1,2,3],[4,5,1])

# Showing what we plotted
plt.show()

CODE 2:
BAR GRAPH:
from matplotlib import pyplot as plt
plt.bar([0.25,1.25,2.25,3.25,4.25],[50,40,70,80,20],
        label="BMW",width=.5)
plt.bar([.75,1.75,2.75,3.75,4.75],[80,20,20,50,60],
        label="Audi", color='r',width=.5)
plt.legend()
plt.xlabel('Days')
plt.ylabel('Distance (kms)')
plt.title('Information')
plt.show()

CODE 4:
HISTOGRAM
import matplotlib.pyplot as plt
population_age = [22,55,62,45,21,22,34,42,42,4,2,102,95,85,55,110,120,70,65,55,111,115,80,75,65,54,44,43,42,48]
bins = [0,10,20,30,40,50,60,70,80,90,100]
plt.hist(population_age, bins, histtype='bar', rwidth=0.8)
plt.xlabel('age groups')
plt.ylabel('Number of people')
plt.title('Histogram')
plt.show()

CODE 5:
AREA PLOT
import matplotlib.pyplot as plt
days = [1,2,3,4,5]
sleeping =[7,8,6,11,7]
eating = [2,3,4,3,2]
working =[7,8,7,2,2]
playing = [8,5,7,8,13]

# Empty line plots are drawn only so that each activity gets a legend entry
plt.plot([],[],color='m', label='Sleeping', linewidth=5)
plt.plot([],[],color='c', label='Eating', linewidth=5)
plt.plot([],[],color='r', label='Working', linewidth=5)
plt.plot([],[],color='k', label='Playing', linewidth=5)

plt.stackplot(days, sleeping,eating,working,playing, colors=['m','c','r','k'])

plt.xlabel('x')
plt.ylabel('y')
plt.title('Stack Plot')
plt.legend()
plt.show()

CODE 6:

PIE CHART
import matplotlib.pyplot as plt
days = [1,2,3,4,5]
sleeping =[7,8,6,11,7]
eating = [2,3,4,3,2]
working =[7,8,7,2,2]
playing = [8,5,7,8,13]
slices = [7,2,2,13]
activities = ['sleeping','eating','working','playing']
cols = ['c','m','r','b']
plt.pie(slices,
        labels=activities,
        colors=cols,
        startangle=90,
        shadow=True,
        explode=(0,0.1,0,0),
        autopct='%1.1f%%')
plt.title('Pie Plot')
plt.show()
RESULT: Thus the program for creating different plots using matplotlib has been done and the
output has been verified.

Ex.no 5 STATISTICAL AND PROBABILITY MEASURES

5. a) Frequency distributions

Aim:

To count the frequency of occurrence of words in a body of text, a task often needed during text
processing.

ALGORITHM

Step 1: Start the Program
Step 2: Create text file blake-poems.txt
Step 3: Import the word_tokenize function and gutenberg
Step 4: Write the code to count the frequency of occurrence of a word in a body of text
Step 5: Print the result
Step 6: Stop the process

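Program:
from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg

# Note: the Gutenberg corpus and the tokenizer data must be downloaded once
# with nltk.download() before this program can run

sample = gutenberg.raw("blake-poems.txt")
token = word_tokenize(sample)

# Keep the first 50 tokens
wlist = []
for i in range(50):
    wlist.append(token[i])

# For each of those tokens, count how often it occurs among the first 50
wordfreq = [wlist.count(w) for w in wlist]

# zip() returns an iterator, so convert it to a list before printing
print("Pairs\n" + str(list(zip(wlist, wordfreq))))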
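
The same counting idea can be tried without downloading the NLTK corpus. The sketch below is a stand-in under stated assumptions: it uses collections.Counter from the standard library, a simple regular-expression tokenizer instead of word_tokenize, and a made-up sentence instead of blake-poems.txt.

```python
import re
from collections import Counter

# Made-up sample text (stands in for the Gutenberg corpus file)
text = "the cat sat on the mat and the dog sat on the log"

# Very simple tokenizer: lower-case runs of letters and apostrophes
tokens = re.findall(r"[a-z']+", text.lower())

# Counter maps each distinct token to its number of occurrences
word_freq = Counter(tokens)

# Show the three most frequent tokens
for word, count in word_freq.most_common(3):
    print(word, count)
```

Unlike the list-based version, Counter stores each distinct word once, so the text is scanned a single time.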

Result:
Thus the frequency distribution program, which counts the frequency of occurrence of words in a body of
text using Python, was successfully completed.

5. b) MEAN, MODE, STANDARD DEVIATION


AIM:
To write a Python program to calculate the mean, mode and standard deviation.

ALGORITHM:
Step 1: Start the Program
Step 2: Write the code to calculate the mean, mode and standard deviation
Step 3: Print the result
Step 4: Stop the process

1. Mean:

The mean is the average of all numbers and is sometimes called the arithmetic mean. This code calculates
Mean or Average of a list containing numbers:

CODE
# mean of elements

# list of elements to calculate mean


n_num = [1, 2, 3, 4, 5]
n = len(n_num)

get_sum = sum(n_num)
mean = get_sum / n

print("Mean / Average is: " + str(mean))

2. Mode:
The mode is the number that occurs most often within a set of numbers. This code calculates the mode of a list
containing numbers:
# Python program to print
# mode of elements
from collections import Counter

# list of elements to calculate mode
n_num = [1, 2, 3, 4, 5, 5]
n = len(n_num)

# count the occurrences of each value
data = Counter(n_num)
get_mode = dict(data)
# keep every value whose count equals the highest count
mode = [k for k, v in get_mode.items() if v == max(list(data.values()))]

# if every element is a mode, all values occur equally often
if len(mode) == n:
    get_mode = "No mode found"
else:
    get_mode = "Mode is / are: " + ', '.join(map(str, mode))

print(get_mode)
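
For comparison, the standard library can produce the same result directly: statistics.multimode (available from Python 3.8) returns a list of every most-frequent value. A brief sketch on the same list:

```python
from statistics import multimode

n_num = [1, 2, 3, 4, 5, 5]

# multimode returns all values that share the highest count
modes = multimode(n_num)
print("Mode is / are:", ', '.join(map(str, modes)))

# When every value occurs once, all of them are returned,
# which corresponds to the "No mode found" case in the program above
print(multimode([1, 2, 3]))
```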

3. Standard Deviation
Standard deviation is a measure of spread in statistics. It quantifies the variation of a set of data
values. It is closely related to variance: the standard deviation is the square root of the variance, so it is
expressed in the same units as the data, whereas variance is expressed in squared units.
A low standard deviation indicates that the data are clustered close to the mean, whereas a high
standard deviation shows that the data are spread far from the mean.
# Python code to demonstrate stdev() function

# importing statistics module
import statistics

# creating a simple data set
sample = [1, 2, 3, 4, 5]

# Prints the sample standard deviation
# xbar is left at its default (None), so the mean is computed from the data
print("Standard Deviation of sample is % s "
      % (statistics.stdev(sample)))
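
statistics.stdev treats the data as a sample and divides by n - 1. When the data are the whole population, statistics.pstdev (which divides by n) is the matching function. The sketch below uses the same sample list to show both, and checks that each is the square root of the corresponding variance.

```python
import math
import statistics

sample = [1, 2, 3, 4, 5]

sample_sd = statistics.stdev(sample)       # divides by n - 1
population_sd = statistics.pstdev(sample)  # divides by n

print("Sample standard deviation:", sample_sd)
print("Population standard deviation:", population_sd)

# Standard deviation is the square root of the variance
print(math.isclose(sample_sd, math.sqrt(statistics.variance(sample))))
print(math.isclose(population_sd, math.sqrt(statistics.pvariance(sample))))
```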

Result:

Thus the Python code to calculate the mean, mode and standard deviation was successfully executed.

5. c) VARIABILITY

Aim:
To write a python program to calculate the variance.

ALGORITHM

Step 1: Start the Program


Step 2: Import statistics module from statistics import variance
Step 3: Import fractions as parameter values from fractions import Fraction as fr
Step 4: Create tuple of a set of positive and negative numbers
Step 5: Print the variance of each samples
Step 6: Stop the process
Program:
# Python code to demonstrate variance()
# function on varying range of data-types

# importing statistics module
from statistics import variance

# importing fractions as parameter values
from fractions import Fraction as fr

# tuple of a set of positive integers
# numbers are spread apart but not very much
sample1 = (1, 2, 5, 4, 8, 9, 12)

# tuple of a set of negative integers
sample2 = (-2, -4, -3, -1, -5, -6)

# tuple of a set of positive and negative numbers
# data-points are spread apart considerably
sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)

# tuple of a set of fractional numbers
sample4 = (fr(1, 2), fr(2, 3), fr(3, 4),
           fr(5, 6), fr(7, 8))

# tuple of a set of floating point values
sample5 = (1.23, 1.45, 2.1, 2.2, 1.9)

# Print the variance of each sample
print("Variance of Sample1 is % s " %(variance(sample1)))
print("Variance of Sample2 is % s " %(variance(sample2)))
print("Variance of Sample3 is % s " %(variance(sample3)))
print("Variance of Sample4 is % s " %(variance(sample4)))
print("Variance of Sample5 is % s " %(variance(sample5)))

Result:

Thus the python program to calculate the variance was successfully implemented.

5.d) NORMAL CURVES

Aim:
To create a normal curve using a Python program.

ALGORITHM

Step 1: Start the Program
Step 2: Import the scipy package and use scipy.stats
Step 3: Import packages numpy, matplotlib and seaborn
Step 4: Create the distribution
Step 5: Visualize the distribution
Step 6: Stop the process

Program:

# import required libraries
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

# Creating the distribution
data = np.arange(1,10,0.01)
pdf = norm.pdf(data , loc = 5.3 , scale = 1 )

# Visualizing the distribution
sb.set_style('whitegrid')
# x and y are passed as keywords, as recent seaborn versions require
sb.lineplot(x = data, y = pdf , color = 'black')
plt.xlabel('Heights')
plt.ylabel('Probability Density')
plt.show()

Output:

Result :

Thus a normal curve was created using a Python program.


5.e) CORRELATION AND SCATTER PLOTS


AIM
To write a Python program for correlation with scatter plots.

ALGORITHM

Step 1: Start the Program
Step 2: Create variables y1, y2
Step 3: Create variables x, y3 using a random function
Step 4: Plot the scatter plot
Step 5: Print the result
Step 6: Stop the process

Program:

# Scatterplot and Correlations
import numpy as np
import matplotlib.pyplot as plt

# Data
x = np.random.randn(100)
y1 = x*5 + 9
y2 = -5*x
y3 = np.random.randn(100)

# Plot
plt.rcParams.update({'figure.figsize': (10,8), 'figure.dpi': 100})
plt.scatter(x, y1, label=f'y1, Correlation = {np.round(np.corrcoef(x,y1)[0,1], 2)}')
plt.scatter(x, y2, label=f'y2 Correlation = {np.round(np.corrcoef(x,y2)[0,1], 2)}')
plt.scatter(x, y3, label=f'y3 Correlation = {np.round(np.corrcoef(x,y3)[0,1], 2)}')

# Plot
plt.title('Scatterplot and Correlations')
plt.legend()
plt.show()

Output

Result:

Thus the correlation and scatter plots program using Python was successfully completed.

5.f) CORRELATION COEFFICIENT

Aim:
To write a python program to compute correlation coefficient.

ALGORITHM

Step 1: Start the Program
Step 2: Import math package
Step 3: Define correlation coefficient function
Step 4: Calculate correlation using formula
Step 5: Print the result
Step 6: Stop the process

Program:

# Python Program to find correlation coefficient.
import math

# function that returns correlation coefficient.
def correlationCoefficient(X, Y, n):
    sum_X = 0
    sum_Y = 0
    sum_XY = 0
    squareSum_X = 0
    squareSum_Y = 0

    i = 0
    while i < n:
        # sum of elements of array X.
        sum_X = sum_X + X[i]

        # sum of elements of array Y.
        sum_Y = sum_Y + Y[i]

        # sum of X[i] * Y[i].
        sum_XY = sum_XY + X[i] * Y[i]

        # sum of square of array elements.
        squareSum_X = squareSum_X + X[i] * X[i]
        squareSum_Y = squareSum_Y + Y[i] * Y[i]

        i = i + 1

    # use formula for calculating correlation coefficient.
    corr = (n * sum_XY - sum_X * sum_Y) / math.sqrt(
        (n * squareSum_X - sum_X * sum_X) *
        (n * squareSum_Y - sum_Y * sum_Y))
    return corr

# Driver code
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]

# Find the size of array.
n = len(X)

# Function call to correlationCoefficient.
print('{0:.6f}'.format(correlationCoefficient(X, Y, n)))

Result:

Thus the computation of the correlation coefficient was successfully completed.


5.g) REGRESSION

Aim:
To write a python program for Simple Linear Regression

ALGORITHM

Step 1: Start the Program
Step 2: Import numpy and matplotlib package
Step 3: Define coefficient function
Step 4: Calculate cross-deviation and deviation about x
Step 5: Calculate regression coefficients
Step 6: Plot the Linear regression and define main function
Step 7: Print the result
Step 8: Stop the process

Program:

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m",
                marker = "o", s = 30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color = "g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \
          \nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Graph:

Result:

Thus the computation for Simple Linear Regression was successfully completed.

Ex.no : 6 a) USE THE STANDARD BENCHMARK DATA SET FOR PERFORMING THE
FOLLOWING UNIVARIATE ANALYSIS:
AIM:
To write a Python program for univariate analysis on UCI datasets.

ALGORITHM:
Step 1: Start the program
Step 2: Write the code
Step 3: Calculate the frequency, mean, median, mode, variance, standard deviation, skewness and
kurtosis.
Step 4: Stop the program

Frequency

import pandas as pd
import numpy as np
import statistics as st

# Load the data
df = pd.read_csv("data_desc.csv")
print(df.shape)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 10 columns):
Marital_status     600 non-null object
Dependents         600 non-null int64
Is_graduate        600 non-null object
Income             600 non-null int64
Loan_amount        600 non-null int64
Term_months        600 non-null int64
Credit_score       600 non-null object
approval_status    600 non-null object
Age                600 non-null int64
Sex                600 non-null object
dtypes: int64(5), object(5)
memory usage: 47.0+ KB
None

Mean

# numeric_only=True skips the text (object) columns; recent pandas versions raise an error without it
df.mean(numeric_only=True)

It is also possible to calculate the mean of a particular variable in the data, as shown below, where we
calculate the mean of the variables 'Age' and 'Income'.

print(df.loc[:,'Age'].mean())
print(df.loc[:,'Income'].mean())

It is also possible to calculate the mean of each row by specifying the (axis = 1) argument. The code
below calculates the row means and shows the first five rows.

df.mean(axis = 1, numeric_only=True)[0:5]

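Median

df.median(numeric_only=True)

Mode

df.mode()

Variance

df.var(numeric_only=True)

Standard Deviation

df.std(numeric_only=True)

Skewness and Kurtosis

Skewness

print(df.skew(numeric_only=True))

Kurtosis

print(df.kurt(numeric_only=True))

df.describe()
df.describe(include='all')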

Result:

Thus the Univariate and Multiple Regression Analysis using the diabetes data set from
UCI and Pima Indians Diabetes data set was completed and verified successfully.

6.b) BIVARIATE ANALYSIS:

AIM:

To write a program for bivariate analysis.

ALGORITHM:

Step 1: Start the program
Step 2: Write the code
Step 3: Bivariate analysis: linear and logistic regression modelling.
Step 4: Stop the program

Linear Regression

import matplotlib.pyplot as plt


from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()

Logistic Regression
import numpy
from sklearn import linear_model

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X,y)

def logit2prob(logr, X):
    # log-odds from the fitted coefficient and intercept
    log_odds = logr.coef_ * X + logr.intercept_
    odds = numpy.exp(log_odds)
    probability = odds / (1 + odds)
    return probability

print(logit2prob(logr, X))
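
logit2prob turns the model's log-odds into a probability through odds / (1 + odds), which is algebraically the same as the logistic (sigmoid) function 1 / (1 + e^(-z)). The sketch below checks that identity with the standard library only; the coefficient and intercept are made-up numbers for illustration, not the values fitted by logr.

```python
import math

def prob_from_odds(log_odds):
    """Probability via odds / (1 + odds), as in logit2prob."""
    odds = math.exp(log_odds)
    return odds / (1 + odds)

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1 / (1 + math.exp(-z))

coef, intercept = 1.2, -4.0   # hypothetical values, for illustration only
for x in (0.14, 3.78, 5.88):
    z = coef * x + intercept
    print(x, prob_from_odds(z), sigmoid(z))
```

A log-odds of 0 corresponds to a probability of 0.5, which is the usual decision boundary between the two classes.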

OUTPUT:

Multiple Regression analysis

Multiple regression works by considering the values of the available multiple independent
variables and predicting the value of one dependent variable.

import pandas as pd
from sklearn import linear_model
import statsmodels.api as sm

data = {'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,
                 2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
        'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
        'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,
                          2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
        'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,
                              6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
        'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,
                        1047,965,943,958,971,949,884,866,876,822,704,719]
        }

df = pd.DataFrame(data)

x = df[['interest_rate','unemployment_rate']]
y = df['index_price']

# with sklearn
regr = linear_model.LinearRegression()
regr.fit(x, y)

print('Intercept: \n', regr.intercept_)


print('Coefficients: \n', regr.coef_)

# with statsmodels
x = sm.add_constant(x) # adding a constant

model = sm.OLS(y, x).fit()


predictions = model.predict(x)

print_model = model.summary()
print(print_model)

Result:

Thus the Bivariate and Multiple Regression Analysis using the diabetes data set
from UCI and Pima Indians Diabetes data set was completed and verified successfully.

EX.NO 7. APPLY SUPERVISED LEARNING ALGORITHMS AND UNSUPERVISED LEARNING ALGORITHMS ON IRIS DATASET.
AIM: To write a program to apply supervised learning algorithms and unsupervised learning algorithms
on the iris dataset.
ALGORITHM:

WHAT IS SUPERVISED LEARNING?


Supervised learning is a machine learning task in which an algorithm is trained to find patterns using a
labelled dataset. The algorithm uses this training to make input-output inferences on future
datasets. In the same way that a teacher (supervisor) gives a student homework to learn from,
supervised learning gives an algorithm datasets so that it, too, can learn and make inferences.

pip install pandas


pip install matplotlib
pip install scikit-learn

from sklearn import datasets


import pandas as pd
import matplotlib.pyplot as plt

# Loading IRIS dataset from scikit-learn object into iris variable.


iris = datasets.load_iris()

# Prints the type/type object of iris


print(type(iris))
# <class 'sklearn.datasets.base.Bunch'>

# prints the dictionary keys of iris data


print(iris.keys())

# prints the type/type object of given attributes


print(type(iris.data), type(iris.target))

# prints the no of rows and columns in the dataset

print(iris.data.shape)

# prints the target set of the data


print(iris.target_names)

# Load iris training dataset


X = iris.data

# Load iris target set


Y = iris.target

# Convert datasets' type into dataframe


df = pd.DataFrame(X, columns=iris.feature_names)

# Print the first five tuples of dataframe.


print(df.head())

from sklearn import datasets


from sklearn.neighbors import KNeighborsClassifier

# Load iris dataset from sklearn


iris = datasets.load_iris()

# Create an instance of the KNN classifier class with 6 neighbors.
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the model with training data and target values


knn.fit(iris['data'], iris['target'])

# Provide data whose class labels are to be predicted


X=[
[5.9, 1.0, 5.1, 1.8],
[3.4, 2.0, 1.1, 4.8],
]

# Prints the data provided


print(X)

# Store predicted class labels of X


prediction = knn.predict(X)

# Prints the predicted class labels of X

print(prediction)
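
To make the idea behind KNeighborsClassifier concrete, here is a minimal pure-Python sketch of k-nearest-neighbours voting: measure the distance from the query to every training point, take the k closest, and return the most common label among them. The tiny training set and the function name knn_predict are made up for illustration; scikit-learn's implementation is far more optimized and configurable.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point, paired with its label
    distances = [(math.dist(point, query), label)
                 for point, label in zip(train_X, train_y)]
    # Keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over their labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D training data with two classes
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_X, train_y, (1.1, 1.0)))
print(knn_predict(train_X, train_y, (5.1, 5.0)))
```

The labels supplied with the training points are what make this a supervised method: the prediction is only as good as those known answers.
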
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
import numpy as np

# Load the diabetes dataset


diabetes = datasets.load_diabetes()

# Use only one feature for training


diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets


diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets


diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object


regr = linear_model.LinearRegression()

# Train the model using the training sets


regr.fit(diabetes_X_train, diabetes_y_train)

# Input data

print('Input Values')
print(diabetes_X_test)

# Make predictions using the testing set


diabetes_y_pred = regr.predict(diabetes_X_test)

# Predicted Data
print("Predicted Output Values")
print(diabetes_y_pred)

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='red', linewidth=1)

plt.show()

Unsupervised learning is a class of machine learning (ML) techniques used to find patterns in data. The
data given to unsupervised algorithms is not labelled, which means only the input variables (x) are
given, with no corresponding output variables. In unsupervised learning, the algorithms are left to
discover interesting structures in the data on their own.
On GitHub: iris_dataset.py

# Importing Modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Loading dataset
iris_df = datasets.load_iris()

# Available methods on dataset


print(dir(iris_df))

# Features
print(iris_df.feature_names)

# Targets
print(iris_df.target)

# Target Names
print(iris_df.target_names)

label = {0: 'red', 1: 'blue', 2: 'green'}

# Dataset Slicing
x_axis = iris_df.data[:, 0] # Sepal Length
y_axis = iris_df.data[:, 2] # Petal Length (column 2 of the iris data)

# Plotting
plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()

['DESCR', 'data', 'feature_names', 'target', 'target_names']

K-Means Clustering in Python

K-means implementation in Python on GitHub: clustering_iris.py

# Importing Modules
from sklearn import datasets
from sklearn.cluster import KMeans

# Loading dataset
iris_df = datasets.load_iris()

# Declaring Model
model = KMeans(n_clusters=3)

# Fitting Model
model.fit(iris_df.data)

# Predicting a single input


predicted_label = model.predict([[7.2, 3.5, 0.8, 1.6]])

# Prediction on the entire data


all_predictions = model.predict(iris_df.data)

# Printing Predictions
print(predicted_label)
print(all_predictions)
[0]
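
K-means repeats two steps until the cluster centres stop moving: assign each point to its nearest centre, then move each centre to the mean of its points. The sketch below shows that loop on made-up one-dimensional data with fixed starting centres. kmeans_1d is a hypothetical helper written for illustration, not scikit-learn's implementation, which among other things chooses its starting centres differently.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data: assign points, then move centroids to cluster means."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster (unchanged if empty)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Stop once the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Made-up data with two obvious groups, and fixed starting centroids
values = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(values, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

Because no labels are supplied, the cluster numbers are arbitrary: which group is called 0 or 1 depends only on the starting centres.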

Result:

Thus the program to apply supervised learning algorithms and unsupervised learning algorithms on the iris
dataset was completed successfully.

EX.NO:8 APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA SETS

To load and quickly visualize the Multiple Features Dataset [1] from the UCI repository, which
is available in mvlearn. This dataset can be a good tool for analyzing the effectiveness of
multiview algorithms. It contains 6 views of handwritten digit images, thus allowing for analysis
of multiview algorithms in multiclass or unsupervised tasks.

a. Normal curves

A probability distribution is a statistical function that describes the likelihood of obtaining the
possible values that a random variable can take. By this, we mean the range of values that a
parameter can take when we randomly pick values from it. Suppose we pick one adult at random
and ask what his or her height will be (assuming gender does not affect height).
There is no way to know the height in advance. But if we have the distribution of heights of adults
in the city, we can bet on the most probable outcome. A Normal Distribution is also known as a
Gaussian distribution or, popularly, the Bell Curve. People use these names interchangeably; they mean
the same thing. It is a continuous probability distribution.

Code:

import numpy as np
import matplotlib.pyplot as plt

# Creating a series of data in the range of 1-50.
x = np.linspace(1,50,200)

# Creating a function.
def normal_dist(x , mean , sd):
    # normal density: 1/(sd*sqrt(2*pi)) * exp(-0.5*((x-mean)/sd)**2)
    prob_density = 1/(sd*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-mean)/sd)**2)
    return prob_density

# Calculate mean and standard deviation.
mean = np.mean(x)
sd = np.std(x)

# Apply function to the data.
pdf = normal_dist(x,mean,sd)

# Plotting the results
plt.plot(x,pdf , color = 'red')
plt.xlabel('Data points')
plt.ylabel('Probability Density')
plt.show()
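
The normal probability density is f(x) = 1/(σ√(2π)) · exp(−(x − μ)²/(2σ²)). As a sanity check, the sketch below evaluates that formula with only the standard library and compares it with statistics.NormalDist (Python 3.8+). The function name normal_pdf and the values μ = 25.5 and σ = 14 are illustrative assumptions, not taken from the exercise.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mean, sd):
    """Normal density: 1/(sd*sqrt(2*pi)) * exp(-0.5*((x-mean)/sd)**2)."""
    return 1.0 / (sd * math.sqrt(2 * math.pi)) * math.exp(-0.5 * ((x - mean) / sd) ** 2)

mean, sd = 25.5, 14.0   # illustrative values only
reference = NormalDist(mu=mean, sigma=sd)

# The hand-written formula and the library value should agree at every point
for x in (1.0, 25.5, 50.0):
    print(x, normal_pdf(x, mean, sd), reference.pdf(x))
```

The density is highest at x = mean and falls off symmetrically on both sides, which gives the bell shape.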

b. Density and contour plots

Contour plots, also called level plots, are a tool for multivariate analysis and for visualizing 3-D
surfaces in 2-D space. If X and Y are the variables we want to plot, the response Z
is plotted as slices on the X-Y plane, which is why contours are sometimes referred to as Z-slices
or iso-response lines.

Contour plots are widely used to visualize density, altitude (for example, the heights of mountains) and
meteorological data. Because of this wide usage, matplotlib.pyplot provides a contour method
that makes it easy to draw contour plots.

Code:

import matplotlib.pyplot as plt
import numpy as np

feature_x = np.arange(0, 50, 2)
feature_y = np.arange(0, 50, 3)

# Creating 2-D grid of features
[X, Y] = np.meshgrid(feature_x, feature_y)

fig, ax = plt.subplots(1, 1)

Z = np.cos(X / 2) + np.sin(Y / 4)

# plots contour lines
ax.contour(X, Y, Z)

ax.set_title('Contour Plot')
ax.set_xlabel('feature_x')
ax.set_ylabel('feature_y')

plt.show()
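
The grid-building step in the contour example can be inspected on its own. The following is a
minimal sketch, assuming the same axis values as above, that shows how np.meshgrid turns the two 1-D
axes into 2-D arrays, so that Z holds one value for every (x, y) pair.

```python
import numpy as np

# Same axis values as in the contour example above.
feature_x = np.arange(0, 50, 2)   # 0, 2, ..., 48
feature_y = np.arange(0, 50, 3)   # 0, 3, ..., 48

# X holds feature_x in every row; Y holds feature_y in every column.
X, Y = np.meshgrid(feature_x, feature_y)
Z = np.cos(X / 2) + np.sin(Y / 4)

print("x values:", feature_x.size, "y values:", feature_y.size)
print("grid shape (rows = y, columns = x):", Z.shape)

# Confirm the layout of the two coordinate arrays.
print(np.array_equal(X[0], feature_x), np.array_equal(Y[:, 0], feature_y))
```

The number of rows follows the y axis and the number of columns follows the x axis, which is the
layout that contour expects for X, Y and Z.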

c. Correlation and scatter plots

Correlation means association. It is a measure of the extent to which two variables are related.

1. Positive Correlation: When two variables increase together and decrease together, they are
positively correlated. ‘1’ is a perfect positive correlation. For example, demand and profit are
positively correlated: the more demand there is for the product, the more profit.

2. Negative Correlation: When one variable increases as the other decreases, and vice versa, they
are negatively correlated. For example, as the distance between two magnets increases, their
attraction decreases, and vice versa. ‘-1’ is a perfect negative correlation.

3. Zero Correlation (No Correlation): When two variables don’t seem to be linked at all. ‘0’ is no
correlation. For example, the amount of tea you drink and your level of intelligence.
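
The three cases above can be illustrated numerically. The following is a minimal sketch using
np.corrcoef on small made-up arrays (the data and the helper name pearson_r are assumptions for
illustration, not taken from the concrete data set).

```python
import numpy as np

# Made-up illustrative data, not taken from the concrete data set.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_positive = 2.0 * x + 1.0                         # rises with x
y_negative = -3.0 * x + 10.0                       # falls as x rises
y_unrelated = np.array([1.0, 2.0, 3.0, 2.0, 1.0])  # no linear trend in x


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


r_positive = pearson_r(x, y_positive)
r_negative = pearson_r(x, y_negative)
r_zero = pearson_r(x, y_unrelated)

print(f"positive: {r_positive:.2f}")
print(f"negative: {r_negative:.2f}")
print(f"none:     {abs(r_zero):.2f}")
```

For a pandas data frame, the corr() method gives the same coefficient for every pair of numeric
columns at once.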

Code:

import pandas as pd

# Load the concrete data set and display it.
con = pd.read_csv('concrete.csv')
con

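# List the column names and preview the first rows.
list(con.columns)
con.head()

# Treat 'cement' as a categorical column and summarise it.
con['cement'] = con['cement'].astype('category')
con.describe(include='category')

import seaborn as sns

# Basic scatter plot of coarse aggregate against water.
sns.scatterplot(x="water", y="coarseagg", data=con);

# The same scatter plot with a title and an x-axis label.
ax = sns.scatterplot(x="water", y="coarseagg", data=con)
ax.set_title("Coarse aggregate vs. water")
ax.set_xlabel("water");

# Scatter plot with a fitted regression line.
sns.lmplot(x="water", y="coarseagg", data=con);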

d. Histograms:

A histogram represents data that has been grouped into ranges. It is an accurate method for the
graphical representation of the distribution of numerical data. It is a type of bar plot in which
the X-axis represents the bin ranges and the Y-axis gives the frequency of values in each bin.

Creating a Histogram

To create a histogram, first divide the whole range of values into a series of intervals (bins),
then count the values that fall into each interval. Bins are consecutive, non-overlapping intervals
of a variable. The matplotlib.pyplot.hist() function computes and draws the histogram of x.

Code:

from matplotlib import pyplot as plt
import numpy as np

# Creating dataset
a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11,
              20, 51, 5, 79, 31, 27])

# Creating histogram
fig, ax = plt.subplots(figsize=(10, 7))
ax.hist(a, bins=[0, 25, 50, 75, 100])

# Show plot
plt.show()
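
The binning step described above can also be checked without drawing anything. The following is a
minimal sketch, assuming the same sample values and bin edges as the histogram example, that uses
np.histogram to count how many values fall into each bin.

```python
import numpy as np

# Same sample values and bin edges as in the histogram example above.
a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])
bin_edges = [0, 25, 50, 75, 100]

# np.histogram returns the count per bin and the bin edges it used.
counts, edges = np.histogram(a, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{int(left):>3} to {int(right):>3}: {int(count)} values")

# Every value lands in exactly one bin, so the counts add up to the sample size.
print("total:", int(counts.sum()))
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes
both edges. The heights of the bars drawn by hist() correspond to these counts.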

Code:

import matplotlib.pyplot as plt
import numpy as np

# Creating dataset
np.random.seed(23685752)
N_points = 10000
n_bins = 20

# Creating distribution
x = np.random.randn(N_points)
y = .8 ** x + np.random.randn(N_points) + 25

# Creating histogram
fig, axs = plt.subplots(1, 1, figsize=(10, 7), tight_layout=True)
axs.hist(x, bins=n_bins)

# Show plot
plt.show()

e. Three dimensional plotting

Matplotlib was initially designed with only two-dimensional plotting in mind. Around the time of
the 1.0 release, 3-D utilities were built on top of the 2-D display, so 3-D plotting is available
today. 3-D plots are enabled by importing the mplot3d toolkit. This section deals with 3-D plots
using matplotlib.

Code:

from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()

# syntax for 3-D projection
ax = plt.axes(projection='3d')

# defining axes
z = np.linspace(0, 1, 100)
x = z * np.sin(25 * z)
y = z * np.cos(25 * z)

# colour each point by x + y and draw the scatter plot
c = x + y
ax.scatter(x, y, z, c=c)

# set the title and show the plot
ax.set_title('3d Scatter plot')
plt.show()
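
To see why the scatter plot looks like a spiral that widens with height, the coordinates can be
examined directly. The following is a minimal sketch, assuming the same parametric curve as above,
that checks the distance of each point from the z-axis.

```python
import numpy as np

# Same parametric curve as in the 3-D scatter example above.
z = np.linspace(0, 1, 100)
x = z * np.sin(25 * z)
y = z * np.cos(25 * z)

# Distance of each point from the z-axis.
radius = np.sqrt(x**2 + y**2)

# sin^2 + cos^2 = 1, so the radius equals the height z: the points spiral up a cone.
on_cone = np.allclose(radius, z)
print("radius equals height:", on_cone)

# The angle 25 * z sweeps 25 radians in total, which is about four full turns.
turns = 25 / (2 * np.pi)
print(f"number of turns: {turns:.2f}")
```

Changing the factor 25 changes how many turns the spiral makes, while the cone shape stays the same.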

Result:

Thus, various plotting functions were applied and explored on UCI data.