AD3301 DEV Lab Manual

The document is a laboratory manual for a course on Data Exploration and Visualization, detailing various exercises involving Python, R, and data visualization techniques. It includes instructions for installing necessary tools, performing exploratory data analysis, working with NumPy arrays and Pandas data frames, creating plots with Matplotlib, and visualizing data using interactive maps. Each exercise outlines the aim, procedure, and program code to execute the tasks successfully.



DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

LABORATORY MANUAL

COURSE CODE : AD3301

COURSE NAME : Data Exploration and Visualization

REGULATION : R2021

CLASS : II

SEMESTER : III

Ex.No:1 Installation of Data Analysis And Visualization Tool: Python


DATE

Packages that we will need

Python 3 and the following Python libraries/packages are needed for data exploration and visualization:

• jupyter
• jupyterlab
• numpy
• scipy
• pandas
• matplotlib
• seaborn
How to install Python and the packages
Install Anaconda, which provides a Python 3 environment together with all of the required packages listed above. After you have installed Anaconda, please verify the installation. Additional packages can be installed from the conda-forge channel if needed, for example:

$ conda install -c conda-forge altair vega_datasets

How to verify your installation

1. Open the Anaconda Navigator.


2. Find the JupyterLab tile and “launch” it.

3. In a new notebook cell, enter the following code:

import numpy
import scipy
import pandas
import matplotlib
import seaborn

print("all good")

4. Click the "play"/"run" icon to execute the cell. If everything is installed correctly, it prints "all good".
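To make the verification more informative, you can also print each package's version (a minimal sketch; all of these libraries expose __version__):

import numpy, scipy, pandas, matplotlib, seaborn

# Print the name and version of each required package
for pkg in (numpy, scipy, pandas, matplotlib, seaborn):
    print(pkg.__name__, pkg.__version__)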

Result:
Thus, the Python tool was installed and verified successfully.

Ex.No:2 Exploratory Data Analysis

DATE

Aim:
To perform exploratory data analysis (EDA) on a dataset.

Procedure:
1. Import the dataset
2. View the head of the data
3. View the basic information and description of the data
4. Find the unique values in the data and check for duplicates
5. Plot a graph of the unique values in the dataset
6. Verify the presence of null values and replace them
7. Visualize the needed data

Program:
#Load the required libraries
import pandas as pd
import numpy as np
import seaborn as sns

#Load the data


df = pd.read_csv('titanic.csv')

#View the data


df.head()

df.info()

df.describe()

#Find the duplicates

df.duplicated().sum()

#unique values
df['Pclass'].unique()
df['Survived'].unique()
df['Sex'].unique()

array([3, 1, 2], dtype=int64)


array([0, 1], dtype=int64)
array(['male', 'female'], dtype=object)

#Plot the unique values

sns.countplot(x='Pclass', data=df)

#Find null values

df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2

dtype: int64

#Replace null values

df.replace(np.nan, 0, inplace=True)

#Check the changes now


df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0

dtype: int64

#Filter data

df[df['Pclass']==1].head()

#Boxplot

df[['Fare']].boxplot()
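Replacing every null with 0 is crude: it also turns the missing Cabin strings into zeros. A gentler alternative (a sketch, not part of the original listing) is to fill only the numeric Age column with its median, applied before any blanket replacement:

# Fill missing ages with the median age; leaves other columns untouched
df['Age'] = df['Age'].fillna(df['Age'].median())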

Result:
Thus, the program to perform exploratory data analysis (EDA) on a dataset was executed successfully.

Ex.No:3.1 Numpy Arrays


DATE

Aim:
To write a program to work with NumPy arrays.
Procedure:
1. Create an array using NumPy
2. Access elements of the array
3. Retrieve elements using slicing
4. Perform computations on the array
Program:
import numpy as np

a = np.array([1, 2, 3]) # Create a rank 1 array


print(type(a))            # Prints "<class 'numpy.ndarray'>"
print(a.shape)            # Prints "(3,)"
print(a[0], a[1], a[2])   # Prints "1 2 3"
a[0] = 5                  # Change an element of the array
print(a)                  # Prints "[5 2 3]"

b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array


print(b.shape)                       # Prints "(2, 3)"
print(b[0, 0], b[0, 1], b[1, 0])     # Prints "1 2 4"
# Create the following rank 2 array with shape (3, 4)
# [[ 1 2 3 4]
# [ 5 6 7 8]
# [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
# [6 7]]
b = a[:2, 1:3]

# A slice of an array is a view into the same data, so modifying it


# will modify the original array.
print(a[0, 1])   # Prints "2"
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])   # Prints "77"
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])

print(a) # prints "array([[ 1, 2, 3],


# [ 4, 5, 6],
# [ 7, 8, 9],
# [1O, 11, 12]])"

# Create an array of indices


b = np.array([0, 2, 0, 1])

# Select one element from each row of a using the indices in b


print(a[np.arange(4), b]) # Prints "[ 1 6 7 11]"

# Mutate one element from each row of a using the indices in b


a[np.arange(4), b] += 10

print(a)
x = np.array([1, 2]) # Let numpy choose the datatype
print(x.dtype) # Prints "int64" (or "int32" on Windows builds of NumPy)
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array


# [[ 6.0  8.0]
# [10.0 12.0]]
print(x + y)
print(np.add(x, y))

x = np.array([[1,2],[3,4]])

print(np.sum(x)) # Compute sum of all elements; prints "10"


print(np.sum(x, axis=0)) # Compute sum of each column; prints "[4 6]"
print(np.sum(x, axis=1)) # Compute sum of each row; prints "[3 7]"

Output:
<class 'numpy.ndarray'>
(3,)
1 2 3
[5 2 3]
(2, 3)
1 2 4
2
77
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
[ 1  6  7 11]
[[11  2  3]
 [ 4  5 16]
 [17  8  9]
 [10 21 12]]
int32
[[ 6.  8.]
 [10. 12.]]
[[ 6.  8.]
 [10. 12.]]
10
[4 6]
[3 7]
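Because a slice is a view into the same data, an explicit copy is required whenever the original array must stay unchanged. A minimal sketch:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = a[:1, 1:3].copy()   # independent copy, not a view
b[0, 0] = 99
print(a[0, 1])          # still prints "2"; the original is untouched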

Result:
Thus, the program using NumPy arrays was executed successfully.

Ex.No:3.2 Pandas Data Frames


DATE

Aim:
To write a program for working with pandas DataFrames.
Procedure:
1. Import the pandas library
2. Construct a pandas DataFrame
3. Modify and drop columns in the DataFrame
4. Calculate the median of a column in the DataFrame
Program:
import pandas as pd
data = pd.DataFrame({"x1":["y", "x", "y", "x", "x", "y"], # Construct a pandas DataFrame
"x2":range(16, 22),
"x3":range(1, 7),
"x4":["a", "b", "c", "d", "e", "f"],
"x5":range(3O, 24, - 1)})
print(data)

data_row = data[data.x2 < 20] # Keep only the rows where x2 < 20


print(data_row) # Print pandas DataFrame subset
data_col = data.drop("x1", axis = 1) # Drop certain variable from DataFrame
print(data_col)


data_med = data["x5"].median() # Calculate median


print(data_med)

27.5
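The procedure also calls for modifying columns, which the listing above does not show. A minimal sketch, reusing the same DataFrame:

data["x6"] = data["x2"] * 2                  # add a derived column
data = data.rename(columns={"x4": "label"})  # rename an existing column
print(data)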

Result:
Thus, the program to work with pandas DataFrames was executed successfully.

Ex.No:3.3 Basic Plots Using Matplotlib


DATE

Aim:
To write a program to visualize basic plots using Matplotlib.
Procedure:
1. Import the matplotlib library
2. Define the x and y data
3. Label the axes
4. Visualize the data using a line plot
Program:

from matplotlib import pyplot as plt


import numpy as np
x = [20, 25, 37]
y = [25000, 40000, 60000]
plt.plot(x, y)
plt.xlabel("Age")
plt.ylabel("Salary")
plt.title("Salary by age")
plt.show()

Output:
(A line chart of salary against age is displayed.)
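The same data can be drawn with other basic Matplotlib plot types using an almost identical interface; a short sketch:

from matplotlib import pyplot as plt

x = [20, 25, 37]
y = [25000, 40000, 60000]

plt.bar(x, y)        # bar chart of salary by age
plt.xlabel("Age")
plt.ylabel("Salary")
plt.show()

plt.scatter(x, y)    # scatter plot of the same points
plt.show()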
Result:
Thus, the program to plot the basic plots using Matplotlib was executed.

Ex.No:4 Data Cleaning using R


DATE

Aim:
To explore variable and row filters and plotting features in R for cleaning and visualizing data.
Procedure:
1. Import the dplyr and ggplot2 libraries
2. Import the iris dataset
3. Rearrange the data using the dplyr select and filter functions
4. Visualize the selected data using plots
Program:

plot(iris$Sepal.Length)

Result:
Thus, the program for cleaning and visualizing the data using R was executed.

Ex.No: 5 Time Series


DATE

Aim:
To write a program to visualize time series analysis.
Procedure:
1. Import the temperature dataset
2. Import the pandas and matplotlib libraries
3. Visualize the data using a line plot, a histogram, and a boxplot
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading the dataset using read_csv
df = pd.read_csv("stock_data.csv",
parse_dates=True,
index_col="Date")

# displaying the first five rows of dataset


df.head()

from pandas import read_csv


from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze("columns") # .squeeze("columns") replaces the squeeze=True argument removed in pandas 2.0
print(series.head())

Date
1981-01-01 20.7
1981-01-02 17.9
1981-01-03 18.8
1981-01-04 14.6
1981-01-05 15.8
Name: Temp, dtype: float64

from pandas import read_csv


from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze("columns")
series.plot()
pyplot.show()

from pandas import read_csv


from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze("columns")
series.plot(style='k.')
pyplot.show()

from pandas import read_csv


from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze("columns")
series.hist()
pyplot.show()

from pandas import read_csv


from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze("columns")
groups = series.groupby(Grouper(freq='A'))
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values
years.boxplot()
pyplot.show()
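A related view (not part of the original listing) is to resample the daily series to monthly means before plotting, which smooths out day-to-day noise:

from pandas import read_csv
from matplotlib import pyplot

series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
                  parse_dates=True).squeeze("columns")
monthly = series.resample('M').mean()   # mean temperature per month
monthly.plot()
pyplot.show()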

Result:
Thus, the program to visualize time series analysis was executed.

Ex.No:6 Interactive Map Visualization


DATE

Aim:
To represent data on a map using various map datasets, with a mouse-rollover effect.

Procedure:
1. Import the libraries
2. Import the map dataset
3. Specify the width, height, and title for the mouse rollover
4. Visualize the map
Program:
pip install pyecharts
pip install echarts-countries-pypkg
pip install echarts-china-provinces-pypkg
pip install echarts-china-cities-pypkg
pip install echarts-china-counties-pypkg
import pyecharts
print(pyecharts.__version__)
import pandas as pd
from pyecharts.charts import Map
from pyecharts import options as opts
data = pd.read_excel('GDP.xlsx')
province = list(data["province"])
gdp = list(data["2019_gdp"])
data_list = [list(z) for z in zip(province, gdp)] # renamed from "list" to avoid shadowing the built-in
c = (
    Map(init_opts=opts.InitOpts(width="1000px", height="600px")) # Initialize map size
    .set_global_opts(
        title_opts=opts.TitleOpts(title="2019 Provinces in GDP Distribution unit: 100 million yuan"), # Configure title
        visualmap_opts=opts.VisualMapOpts(
            type_="scatter" # Scatter type
        )
    )
    .add("GDP", data_list, maptype="china") # Add the data; map type is China
)
c.render("Map1.html") # Write the chart to a self-contained HTML file
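render() writes a self-contained HTML file (Map1.html here); opening it in a browser shows the map, and hovering over a province triggers the rollover tooltip. When working inside a Jupyter notebook, the chart can usually be displayed inline instead (this assumes a reasonably recent pyecharts):

c.render_notebook()   # display the chart inline in a Jupyter notebook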

Output:
(An interactive choropleth map of China's provinces is written to Map1.html; hovering over a province shows its GDP.)

Result:
Thus, the program for mouse rollover in map visualization was executed successfully.

Ex.No:7 Cartographic Visualization


DATE

Aim:
To build cartographic visualization for multiple datasets involving states and districts in India.
Procedure:
1. Import Basemap and the related libraries
2. Import the state data and map
3. Add the title and attributes with Matplotlib and display the map

Program:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

map = Basemap()
map.drawcoastlines()
plt.savefig('test.png') # save before show(), which clears the current figure
plt.show()

pip install geopandas


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import shapefile as shp
from shapely.geometry import Point
sns.set_style('whitegrid')
fp = r'Maps_with_python\india-polygon.shp'
map_df = gpd.read_file(fp)
map_df_copy = gpd.read_file(fp)
map_df.head()

map_df.plot()

df = pd.read_csv('globallandslides.csv')

pd.set_option('display.max_columns', None)
df = df[df.country_name == "India"]
df["Year"] = pd.to_datetime(df["event_date"]).dt.year
ls_df = df[df.landslide_category == "landslide"] # keep only landslide events; referenced as ls_df below
ls_df["admin_division_name"].replace("Nāgāland", "Nagaland", inplace=True)
ls_df["admin_division_name"].replace("Meghālaya", "Meghalaya", inplace=True)
ls_df["admin_division_name"].replace("Tamil Nādu", "Tamil Nadu", inplace=True)
ls_df["admin_division_name"].replace("Karnātaka", "Karnataka", inplace=True)
ls_df["admin_division_name"].replace("Gujarāt", "Gujarat", inplace=True)
ls_df["admin_division_name"].replace("Arunāchal Pradesh", "Arunachal Pradesh", inplace=True)
state_df = ls_df["admin_division_name"].value_counts()
state_df = state_df.to_frame()
state_df.reset_index(level=0, inplace=True)
state_df.columns = ['State', 'Count']
state_df.at[15, "Count"] = 69
state_df.at[0, "State"] = "Jammu and Kashmir"
state_df.at[20, "State"] = "Delhi"
state_df = state_df.drop(7) # drop() returns a new frame; reassign to keep the change

#Merging the data


merged = map_df.set_index('st_nm').join(state_df.set_index('State'))
merged['Count'] = merged['Count'].replace(np.nan, 0)
merged.head()

#Create figure and axes for Matplotlib and set the title
fig, ax = plt.subplots(1, figsize=(10, 10))

ax.axis('off')
ax.set_title('Number of landslides in India state-wise', fontdict={'fontsize': '20', 'fontweight': '10'})
# Plot the figure
merged.plot(column='Count', cmap='YlOrRd', linewidth=0.8, ax=ax, edgecolor='0',
            legend=True, legend_kwds={'label': "Number of landslides"})

Result:
Thus, the program to display the cartographic visualization of India was executed successfully.

Ex.No:8 EDA on Wine Quality Data Set


DATE

Aim:
To write a python program for EDA on Wine Quality Data Set.
Procedure:
1. Import the libraries
2. Import the wine dataset
3. Perform EDA to display the information and description of the data
4. Analyse the alcohol content and visualize it
Program:
import pandas as pd
df_red = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", delimiter=";")
df_white = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", delimiter=";")
df_red.columns
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'], dtype='object')
df_red.iloc[100:110]

df_red.dtypes

fixed acidity float64


volatile acidity float64
citric acid float64
residual sugar float64
chlorides float64
free sulfur dioxide float64
total sulfur dioxide float64
density float64
pH float64
sulphates float64
alcohol float64
quality int64
dtype: object

df_red.describe()

df_red.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

import seaborn as sns


sns.set(rc={'figure.figsize': (14, 8)})
sns.countplot(df_red['quality'])

sns.distplot(df_red['alcohol']) # distplot is deprecated in recent seaborn; sns.histplot(df_red['alcohol'], kde=True) is the modern equivalent
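A further common EDA step (not part of the original listing) is a correlation heatmap of the numeric columns, which shows at a glance which properties move together:

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
sns.heatmap(df_red.corr(), annot=True, cmap='coolwarm')   # pairwise correlations
plt.show()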

Result:
Thus, the program to execute the EDA on wine dataset was executed successfully.

Ex.No:9 Case Study on a Data Set to present an Analysis Report


DATE

Aim:
To analyse the diabetes data set from UCI and the Pima Indians Diabetes data set by performing the following:
Procedure:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.

a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,


Skewness and Kurtosis.
import pandas as pd
import numpy as np
import statistics as st

# Load the data


df = pd.read_csv("data_desc.csv")
print(df.shape)
print(df.info())

Output:
(600, 10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 10 columns):
Marital_status 600 non-null object
Dependents 600 non-null int64
Is_graduate 600 non-null object
Income 600 non-null int64
Loan_amount 600 non-null int64
Term_months 600 non-null int64
Credit_score 600 non-null object
approval_status 600 non-null object
Age 600 non-null int64
Sex 600 non-null object
dtypes: int64(5), object(5)
memory usage: 47.0+ KB
None
Measures of Central Tendency
Measures of central tendency describe the center of the data, and are often represented by the
mean, the median, and the mode.

Mean
df.mean()

Output:
Dependents          0.748333
Income         705541.333333
Loan_amount    323793.666667
Term_months       183.350000
Age                49.450000
dtype: float64

It is also possible to calculate the mean of a particular variable in the data, as shown below, where we calculate the mean of the variables 'Age' and 'Income'.

print(df.loc[:,'Age'].mean())
print(df.loc[:,'Income'].mean())

Output:
49.45
705541.33

It is also possible to calculate the mean of the rows by specifying the (axis = 1) argument. The code
below calculates the mean of the first five rows.

df.mean(axis = 1)[0:5]

Output:
0     70096.0
1    161274.0
2    125113.4
3    119853.8
4    120653.8
dtype: float64

Median

df.median()

Output:
Dependents          0.0
Income         508350.0
Loan_amount     76000.0
Term_months       192.0
Age                51.0
dtype: float64

Mode

df.mode()

Output:
  Marital_status  Dependents Is_graduate  Income  Loan_amount  Term_months  Credit_score approval_status  Age Sex
0            yes           0         Yes   33330        70000        192.0  satisfactory             yes   55   M

Measures of Dispersion

The most popular measures of dispersion are standard deviation, variance, and the interquartile
range.
Standard Deviation

df.std()

Output:
Dependents          1.026362
Income         711421.814154
Loan_amount    724293.480782
Term_months        31.933949
Age                14.728511
dtype: float64

Variance

df.var()

Output:
Dependents     1.053420e+00
Income         5.061210e+11
Loan_amount    5.246010e+11
Term_months    1.019777e+03
Age            2.169290e+02
dtype: float64

Interquartile Range (IQR)

from scipy.stats import iqr


iqr(df['Age'])

Output:
25.0

Skewness

print(df.skew())

Output:
Dependents     1.169632
Income         5.344587
Loan_amount    5.006374
Term_months   -2.471879
Age           -0.055537
dtype: float64

The skewness values can be interpreted in the following manner (a small helper applying these rules is sketched after the list):


• Highly skewed distribution: If the skewness value is less than −1 or greater than +1.
• Moderately skewed distribution: If the skewness value is between −1 and −½ or between +½
and +1.
• Approximately symmetric distribution: If the skewness value is between −½ and +½.
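The aim also asks for kurtosis, which the listing above omits. A minimal sketch using the same DataFrame, together with a small helper that applies the skewness thresholds just listed:

print(df.kurt(numeric_only=True))   # kurtosis of each numeric column

def skew_label(s):
    # Classify a skewness value using the thresholds above
    if abs(s) > 1:
        return "highly skewed"
    if abs(s) > 0.5:
        return "moderately skewed"
    return "approximately symmetric"

print(df.skew(numeric_only=True).apply(skew_label))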

Putting Everything Together

df.describe()

Output:
        Dependents        Income   Loan_amount  Term_months         Age
count   600.000000  6.000000e+02  6.000000e+02   600.000000  600.000000
mean      0.748333  7.055413e+05  3.237937e+05   183.350000   49.450000
std       1.026362  7.114218e+05  7.242935e+05    31.933949   14.728511
min       0.000000  3.000000e+04  1.090000e+04    18.000000   22.000000
25%       0.000000  3.849750e+05  6.100000e+04   192.000000   36.000000
50%       0.000000  5.083500e+05  7.600000e+04   192.000000   51.000000
75%       1.000000  7.661000e+05  1.302500e+05   192.000000   61.000000
max       6.000000  8.444900e+06  7.780000e+06   252.000000   76.000000

df.describe(include='all')

b. Bivariate analysis: Linear and logistic regression modeling

Linear Regression

import matplotlib.pyplot as plt


from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()

OUTPUT:
(A scatter plot of the data points with the fitted regression line is displayed.)

Logistic Regression
import numpy
from sklearn import linear_model

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X,y)

def logit2prob(logr, X):


log_odds = logr.coef_ * X + logr.intercept_
odds = numpy.exp(log_odds)
probability = odds / (1 + odds)
return(probability)

print(logit2prob(logr, X))
OUTPUT:

[[0.60749955]
[0.19268876]
[0.12775886]
[0.00955221]
[0.08038616]
[0.07345637]
[0.88362743]
[0.77901378]
[0.88924409]
[0.81293497]
[0.57719129]
[0.96664243]]
Results Explained
3.78 0.61 The probability that a tumor with the size 3.78cm is cancerous is 61%.
2.44 0.19 The probability that a tumor with the size 2.44cm is cancerous is 19%.
2.09 0.13 The probability that a tumor with the size 2.09cm is cancerous is 13%.
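Besides probabilities, the fitted model can return hard class labels directly (a quick check, reusing logr from above; the 3.46 cm input is illustrative):

print(logr.predict(numpy.array([[3.46]])))   # predicts 0 (benign) or 1 (cancerous)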

c. Multiple Regression analysis

Multiple regression predicts the value of one dependent variable from the values of two or more independent variables.

import pandas as pd
from sklearn import linear_model
import statsmodels.api as sm

data = {'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,
                 2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
        'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
        'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,
                          2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
        'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,
                              6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
        'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,
                        1047,965,943,958,971,949,884,866,876,822,704,719]
        }

df = pd.DataFrame(data)

x = df[['interest_rate','unemployment_rate']]
y = df['index_price']

# with sklearn
regr = linear_model.LinearRegression()
regr.fit(x, y)

print('Intercept: \n', regr.intercept_)


print('Coefficients: \n', regr.coef_)

# with statsmodels
x = sm.add_constant(x) # adding a constant

model = sm.OLS(y, x).fit()


predictions = model.predict(x)

print_model = model.summary()
print(print_model)

OUTPUT:
Intercept:
1798.4039776258564
Coefficients:
[ 345.54008701 -250.14657137]

                            OLS Regression Results
==============================================================================
Dep. Variable:            index_price   R-squared:                       0.898
Model:                            OLS   Adj. R-squared:                  0.888
Method:                 Least Squares   F-statistic:                     92.07
Date:                Sat, 30 Jul 2022   Prob (F-statistic):           4.04e-11
Time:                        13:47:01   Log-Likelihood:                -134.61
No. Observations:                  24   AIC:                             275.2
Df Residuals:                      21   BIC:                             278.8
Df Model:                           2
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const             1798.4040    899.248      2.000      0.059     -71.685     3668.493
interest_rate      345.5401    111.367      3.103      0.005     113.940      577.140
unemployment_rate -250.1466    117.950     -2.121      0.046    -495.437       -4.856
==============================================================================
Omnibus:                        2.691   Durbin-Watson:                   0.530
Prob(Omnibus):                  0.260   Jarque-Bera (JB):                1.551
Skew:                          -0.612   Prob(JB):                        0.461
Kurtosis:                       3.226   Cond. No.                         394.
==============================================================================
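With the fitted sklearn model, predicting the index price for new input values takes one call (the values here are illustrative):

new_x = [[2.75, 5.3]]          # interest_rate, unemployment_rate
print(regr.predict(new_x))     # predicted index_price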

Result:
Thus, the Univariate, Bivariate and Multiple Regression Analysis using the diabetes data set from
UCI and Pima Indians Diabetes data set was completed and verified successfully.
