AD3301 DEV Lab Manual
LABORATORY MANUAL
REGULATION : R2021
CLASS : II
SEMESTER : III
Python 3 and the following Python libraries/packages are needed for data exploration and visualization:
• jupyter
• jupyterlab
• numpy
• scipy
• pandas
• matplotlib
• seaborn
How to install Python and the packages
Install Anaconda, which will give you a Python 3 environment and all of the required packages listed above. To verify the installation, run the following in a Python session:
import numpy
import scipy
import pandas
import matplotlib
import seaborn
print("all good")
Result:
Thus, Python and the required packages were installed and verified successfully.
Aim:
To perform exploratory data analysis (EDA) on a dataset.
Procedure:
1. Import the dataset
2. View the head of the data
3. View the basic information of data and description of data
4. Find the unique values in the data and check for duplicate rows
5. Plot a graph of the unique values in the dataset
6. Verify the presence of null values and replace the null values
7. Visualize the needed data
Program:
#Load the required libraries
import pandas as pd
import numpy as np
import seaborn as sns

#Import the dataset (Titanic passenger data; the file name is assumed)
df = pd.read_csv("titanic.csv")
df.head()

df.info()
df.describe()
df.duplicated().sum()
#unique values
df['Pclass'].unique()
df['Survived'].unique()
df['Sex'].unique()
sns.countplot(x='Pclass', data=df)   #count plot of the unique passenger classes
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
df.replace(np.nan, 0, inplace=True)   #replace the null values with 0
#Filter data
df[df['Pclass']==1].head()
#Boxplot
df[['Fare']].boxplot()
Result:
Thus, the program to perform exploratory data analysis (EDA) on a dataset was executed successfully.
Aim:
To write a program to work with NumPy arrays.
Procedure:
1. Create array using numpy
2. Access the element in the array
3. Retrieve element using slice operation
4. Perform computations on the arrays
Program:
import numpy as np
# Define a rank-2 array (the values are assumed from the slicing comment below)
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]
print(a)
x = np.array([1, 2]) # Let numpy choose the datatype
print(x.dtype) # Prints "int64"
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
x = np.array([[1,2],[3,4]])
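The elementwise additions and sum reductions whose results appear in the output below are not shown in the listing above; a minimal sketch consistent with that output (using the arrays x and y defined above) might be:

print(x + y)              # elementwise sum -> [[ 6.  8.] [10. 12.]]
print(np.add(x, y))       # the same elementwise sum using np.add
print(np.sum(x))          # sum of all elements of x -> 10
print(np.sum(x, axis=0))  # sum of each column -> [4 6]
print(np.sum(x, axis=1))  # sum of each row -> [3 7]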
Output:
<class 'numpy.ndarray'>
(3,)
1 2 3
[5 2 3]
(2, 3)
1 2 4
2
77
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
[ 1  6  7 11]
[[11  2  3]
 [ 4  5 16]
 [17  8  9]
 [10 21 12]]
int32
[[ 6.  8.]
 [10. 12.]]
[[ 6.  8.]
 [10. 12.]]
10
[4 6]
[3 7]
Result:
Thus, the program using NumPy arrays was executed successfully.
Aim:
To write a program for working with pandas data frames.
Procedure:
1. Import the pandas library
2. Construct a pandas DataFrame
3. Modify and drop columns in the DataFrame
4. Calculate the median in the DataFrame
Program:
import pandas as pd
data = pd.DataFrame({"x1":["y", "x", "y", "x", "x", "y"], # Construct a pandas DataFrame
                     "x2":range(16, 22),
                     "x3":range(1, 7),
                     "x4":["a", "b", "c", "d", "e", "f"],
                     "x5":range(30, 24, -1)})
print(data)
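The listing above only constructs and prints the DataFrame; the procedure also calls for modifying and dropping columns and for computing a median. A minimal sketch of those steps (assuming the DataFrame data built above) might be:

data["x3"] = data["x3"] * 2        # modify an existing column
data = data.drop(columns=["x4"])   # drop a column
print(data["x5"].median())         # median of column x5

The values in x5 run from 30 down to 25, so their median is 27.5, matching the output below.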
Output:
27.5
Result:
Thus, the program to work with a pandas DataFrame was executed successfully.
Aim:
To write a program to visualize basic plots using Matplotlib.
Procedure:
1. Import matplotlib library
2. Define the x and y values
3. Label the axes
4. Visualize the data using line plot
Program:
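The program listing is not reproduced in this copy of the manual; a minimal sketch following the procedure above (the x and y values are illustrative assumptions) might be:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]          # assumed sample data for the x axis
y = [2, 4, 6, 8, 10]         # assumed sample data for the y axis

plt.plot(x, y)               # draw the line plot
plt.xlabel("x values")       # label the x axis
plt.ylabel("y values")       # label the y axis
plt.title("Basic line plot")
plt.show()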
Output:
Result:
Thus, the program to plot the basic plots using Matplotlib was executed.
Aim:
To explore various variable and row filters and plotting features in R for cleaning and visualizing data.
Procedure:
1. Import dplyr and ggplot2 library
2. Import iris dataset
3. Using dplyr select and filter functions rearrange the data
4. Visualize the selected data using plots
Program:
plot(iris$Sepal.Length)
Result:
Thus, the program for cleaning and visualizing the data using R was executed.
Aim:
To write a program to visualize time series analysis.
Procedure:
1. Import the temperature dataset
2. Import the pandas and matplotlib libraries
3. Visualize the data using a line plot, histogram and boxplot
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading the temperature dataset using read_csv
# (the file name is an assumption; the CSV should contain Date and Temp columns)
df = pd.read_csv("daily_min_temperatures.csv",
                 parse_dates=True,
                 index_col="Date")
print(df["Temp"].head())   # preview the series; this produces the output below
Date
1981-01-01 20.7
1981-01-02 17.9
1981-01-03 18.8
1981-01-04 14.6
1981-01-05 15.8
Name: Temp, dtype: float64
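The procedure also calls for visualizing the series with a line plot, a histogram and a boxplot; those lines are not in the listing above. A minimal sketch, assuming the DataFrame df loaded above with a 'Temp' column, might be:

df["Temp"].plot(title="Daily minimum temperature")              # line plot of the series
plt.show()

df["Temp"].plot(kind="hist", title="Temperature distribution")  # histogram
plt.show()

df.boxplot(column="Temp")                                       # boxplot of the values
plt.show()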
Result:
Thus, the program to visualize time series analysis was executed.
Aim:
To represent data on a map using various map datasets with a mouse rollover effect.
Procedure:
1. Import the required libraries
2. Import the map dataset
3. Specify the width, height and title for the mouse rollover
4. Visualize the map
Program:
pip install pyecharts
pip install echarts-countries-pypkg
pip install echarts-china-provinces-pypkg
pip install echarts-china-cities-pypkg
pip install echarts-china-counties-pypkg
import pyecharts
print(pyecharts.__version__)
import pandas as pd
from pyecharts.charts import Map
from pyecharts import options as opts
data = pd.read_excel('GDP.xlsx')
province = list(data["province"])
gdp = list(data["2019_gdp"])
data_list = [list(z) for z in zip(province, gdp)]   # pair each province with its GDP value
c = (
    Map(init_opts=opts.InitOpts(width="1000px", height="600px"))   # initialize map size
    .set_global_opts(
        title_opts=opts.TitleOpts(title="2019 Provinces in GDP Distribution (unit: 100 million yuan)"),
        # configure the title
        visualmap_opts=opts.VisualMapOpts(
            type_="scatter"   # scatter type
        )
    )
    .add("GDP", data_list, maptype="china")   # pass the province/GDP list; map type is the China map
    .render("Map1.html")
)
Output:
Result:
Thus, the program for mouse rollover in map visualization was executed successfully.
Aim:
To build cartographic visualization for multiple datasets involving states and districts in India.
Procedure:
1. Import the required libraries (Basemap, matplotlib, pandas)
2. Import the state data and map
3. Using matplotlib add the title and attributes to display the map
Program:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import pandas as pd   # needed below for reading the landslide data
map = Basemap()
map.drawcoastlines()
plt.savefig('test.png')   # save before plt.show() so the saved image is not blank
plt.show()
map_df.plot()   # map_df is assumed to be a GeoDataFrame of Indian state boundaries (see the sketch before the plotting step)
df = pd.read_csv('globallandslides.csv')
pd.set_option('display.max_columns', None)
df = df[df.country_name=="India"]
df["Year"] = pd.to_datetime(df["event_date"]).dt.year
df = df[df.landslide_category=="landslide"]
ls_df = df   # the remaining steps operate on this filtered DataFrame as ls_df
ls_df["admin_division_name"].replace("Nāgāland", "Nagaland",inplace = True)
ls_df["admin_division_name"].replace("Meghālaya", "Meghalaya",inplace = True)
ls_df["admin_division_name"].replace("Tamil Nādu", "Tamil Nadu",inplace = True)
ls_df["admin_division_name"].replace("Karnātaka", "Karnataka",inplace = True)
ls_df["admin_division_name"].replace("Gujarāt", "Gujarat",inplace = True)
ls_df["admin_division_name"].replace("Arunāchal Pradesh", "Arunachal Pradesh",inplace = True)
state_df = ls_df["admin_division_name"].value_counts()
state_df = state_df.to_frame()
state_df.reset_index(level=0, inplace=True)
state_df.columns = ['State', 'Count']
state_df.at[15, "Count"] = 69
state_df.at[0, "State"] = "Jammu and Kashmir"
state_df.at[20, "State"] = "Delhi"
state_df = state_df.drop(7)   # assign back so the dropped row actually disappears
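The plotting step below uses a DataFrame named merged that is not constructed in this listing. A minimal sketch, assuming the Indian state boundaries are available as a GeoDataFrame (the shapefile name and its 'State' column are assumptions), might be:

import geopandas as gpd

map_df = gpd.read_file("india_states.shp")               # assumed shapefile of Indian state boundaries
merged = map_df.merge(state_df, on="State", how="left")  # attach landslide counts to each state
merged["Count"] = merged["Count"].fillna(0)              # states with no recorded landslides get 0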
#Create figure and axes for Matplotlib and set the title
fig, ax = plt.subplots(1, figsize=(10, 10))
ax.axis('off')
ax.set_title('Number of landslides in India state-wise', fontdict={'fontsize': '20', 'fontweight' : '10'})
# Plot the figure
merged.plot(column='Count', cmap='YlOrRd', linewidth=0.8, ax=ax, edgecolor='0',
            legend=True, legend_kwds={'label': "Number of landslides"})
Result:
Thus, the program to display the cartographic visualization of India was executed successfully.
Aim:
To write a python program for EDA on Wine Quality Data Set.
Procedure:
1. Import library
2. Import wine dataset
3. Perform EDA to display the information and description of the data
4. Analyse the alcohol content and visualize it
Program:
import pandas as pd
df_red = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", delimiter=";")
df_white = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", delimiter=";")
df_red.columns
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')
df_red.iloc[100:110]
df_red.dtypes
df_red.describe()
df_red.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity 1599 non-null float64
import seaborn as sns   # needed for the distribution plot
sns.distplot(df_red['alcohol'])
Result:
Thus, the program to execute the EDA on wine dataset was executed successfully.
Aim:
To analyse the diabetes data set from UCI and the Pima Indians Diabetes data set by performing the
following:
Procedure:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
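The lines that load the data set are not reproduced in this copy; a minimal sketch consistent with the output below (600 rows and 10 columns; the file name is a placeholder assumption) might be:

import pandas as pd

df = pd.read_csv("dataset.csv")   # hypothetical file name
print(df.shape)                   # (600, 10)
print(df.info())                  # column names, non-null counts and dtypes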
Output:
(600, 10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 10 columns):
Marital_status 600 non-null object
Dependents 600 non-null int64
Is_graduate 600 non-null object
Income 600 non-null int64
Loan_amount 600 non-null int64
Term_months 600 non-null int64
Credit_score 600 non-null object
approval_status 600 non-null object
Age 600 non-null int64
Sex 600 non-null object
dtypes: int64(5), object(5)
memory usage: 47.0+ KB
None
Measures of Central Tendency
Measures of central tendency describe the center of the data, and are often represented by the
mean, the median, and the mode.
Mean
df.mean()
python
Output:
1 Dependents 0.748333
2 Income 705541.333333
3 Loan_amount 323793.666667
4 Term_months 183.350000
5 Age 49.450000
6 dtype: float64
It is also possible to calculate the mean of a particular variable in the data, as shown below, where we
calculate the mean of the variables 'Age' and 'Income'.
print(df.loc[:,'Age'].mean())
print(df.loc[:,'Income'].mean())
python
Output:
1 49.45
2 705541.33
It is also possible to calculate the mean of the rows by specifying the (axis = 1) argument. The code
below calculates the mean of the first five rows.
df.mean(axis = 1)[0:5]
python
Output:
1 0 70096.0
2 1 161274.0
3 2 125113.4
4 3 119853.8
5 4 120653.8
6 dtype: float64
Median
df.median()
python
Output:
1 Dependents 0.0
2 Income 508350.0
3 Loan_amount 76000.0
4 Term_months 192.0
5 Age 51.0
6 dtype: float64
Mode
df.mode()
Python
Output:
Measures of Dispersion
The most popular measures of dispersion are standard deviation, variance, and the interquartile
range.
Standard Deviation
df.std()
python
Output:
1 Dependents 1.026362
2 Income 711421.814154
3 Loan_amount 724293.480782
4 Term_months 31.933949
5 Age 14.728511
6 dtype: float64
Variance
df.var()
python
Output:
1 Dependents 1.053420e+00
2 Income 5.061210e+11
3 Loan_amount 5.246010e+11
4 Term_months 1.019777e+03
5 Age 2.169290e+02
6 dtype: float64
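Interquartile Range
The listing for this step is missing from this copy; a line consistent with the output shown below (the IQR of 'Age', 61 - 36 = 25.0) might be:
print(df['Age'].quantile(0.75) - df['Age'].quantile(0.25))  # 75th percentile minus 25th percentile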
python
Output:
1 25.0
Skewness
print(df.skew())
python
Output:
1 Dependents 1.169632
2 Income 5.344587
3 Loan_amount 5.006374
4 Term_months -2.471879
5 Age -0.055537
6 dtype: float64
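Kurtosis
The procedure also lists kurtosis; the corresponding line does not appear in this copy, but pandas computes it analogously to skewness:
print(df.kurt())   # kurtosis of each numeric column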
df.describe()
python
Output:
|       | Dependents | Income       | Loan_amount  | Term_months | Age        |
|-------|------------|--------------|--------------|-------------|------------|
| count | 600.000000 | 6.000000e+02 | 6.000000e+02 | 600.000000  | 600.000000 |
| mean  | 0.748333   | 7.055413e+05 | 3.237937e+05 | 183.350000  | 49.450000  |
| std   | 1.026362   | 7.114218e+05 | 7.242935e+05 | 31.933949   | 14.728511  |
| min   | 0.000000   | 3.000000e+04 | 1.090000e+04 | 18.000000   | 22.000000  |
| 25%   | 0.000000   | 3.849750e+05 | 6.100000e+04 | 192.000000  | 36.000000  |
| 50%   | 0.000000   | 5.083500e+05 | 7.600000e+04 | 192.000000  | 51.000000  |
| 75%   | 1.000000   | 7.661000e+05 | 1.302500e+05 | 192.000000  | 61.000000  |
| max   | 6.000000   | 8.444900e+06 | 7.780000e+06 | 252.000000  | 76.000000  |
df.describe(include='all')
Linear Regression
import matplotlib.pyplot as plt
from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

# fit a least-squares regression line to the data
slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

# predicted y value for every x value on the fitted line
mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
OUTPUT:
Logistic Regression
import numpy
from sklearn import linear_model

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

# convert the model's log-odds for each observation into a probability
def logit2prob(logr, X):
    log_odds = logr.coef_ * X + logr.intercept_
    odds = numpy.exp(log_odds)
    probability = odds / (1 + odds)
    return probability

print(logit2prob(logr, X))
OUTPUT:
[[0.60749955]
[0.19268876]
[0.12775886]
[0.00955221]
[0.08038616]
[0.07345637]
[0.88362743]
[0.77901378]
[0.88924409]
[0.81293497]
[0.57719129]
[0.96664243]]
Results Explained
3.78 0.61 The probability that a tumor with the size 3.78cm is cancerous is 61%.
2.44 0.19 The probability that a tumor with the size 2.44cm is cancerous is 19%.
2.09 0.13 The probability that a tumor with the size 2.09cm is cancerous is 13%.
Multiple regression works by considering the values of the available multiple independent
variables and predicting the value of one dependent variable.
import pandas as pd
from sklearn import linear_model
import statsmodels.api as sm
data = {'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,
                 2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
        'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
        'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,
                          2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
        'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,
                              6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
        'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,
                        1047,965,943,958,971,949,884,866,876,822,704,719]
        }
df = pd.DataFrame(data)
x = df[['interest_rate','unemployment_rate']]
y = df['index_price']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(x, y)

print('Intercept:')
print(regr.intercept_)
print('Coefficients:')
print(regr.coef_)

# with statsmodels
x = sm.add_constant(x)        # adding a constant
model = sm.OLS(y, x).fit()    # fit the ordinary least squares model
print_model = model.summary()
print(print_model)
OUTPUT:
Intercept:
1798.4039776258564
Coefficients:
[ 345.54008701 -250.14657137]
Result:
Thus, the Univariate, Bivariate and Multiple Regression Analysis using the diabetes data set from
UCI and Pima Indians Diabetes data set was completed and verified successfully.