Dsf-Pyt-Lab Manual
AIM
Working with Numpy arrays
ALGORITHM
Step1: Start
Step2: Import numpy module
Step3: Print the basic characteristics and operations of the array
Step4: Stop
PROGRAM
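import numpy as np
# Sample array (assumed here; its definition is missing from this listing)
arr = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])
# Printing size (total number of elements) of array
print("Size of array: ", arr.size)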
OUTPUT
a = np.array([[1,2,3],[3,4,5],[4,5,6]])
print(a)
print("After slicing")
print(a[1:])
Output
[[1 2 3]
[3 4 5]
[4 5 6]]
After slicing
[[3 4 5]
[4 5 6]]
Output:
Our array is:
[[1 2 3]
[3 4 5]
[4 5 6]]
The items in the second column are:
[2 4 5]
The items in the second row are:
[3 4 5]
The items column 1 onwards are:
[[2 3]
[4 5]
[5 6]]
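The program that produced the output above is not included in this listing. A minimal sketch, assuming the same 3 x 3 array used in the slicing example and standard NumPy ellipsis indexing, could look like this (the variable names are illustrative):

```python
import numpy as np

# Same 3 x 3 array as in the slicing example (assumed input)
a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])
print("Our array is:")
print(a)

# Ellipsis (...) selects everything along the remaining axes
second_column = a[..., 1]      # all rows, column index 1
second_row = a[1, ...]         # row index 1, all columns
from_column_1 = a[..., 1:]     # all rows, columns 1 onwards

print("The items in the second column are:")
print(second_column)
print("The items in the second row are:")
print(second_row)
print("The items column 1 onwards are:")
print(from_column_1)
```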
Result:
Thus, working with NumPy arrays was successfully completed.
ALGORITHM
Step1: Start
Step2: Import numpy and pandas module
Step3: Create a dataframe using the dictionary
Step4: Print the output
Step5: Stop
PROGRAM
import numpy as np
import pandas as pd

data = np.array([['', 'Col1', 'Col2'],
                 ['Row1', 1, 2],
                 ['Row2', 3, 4]])
print(pd.DataFrame(data=data[1:, 1:],
                   index=data[1:, 0],
                   columns=data[0, 1:]))

# Take a 2D array as input to your DataFrame
my_2darray = np.array([[1, 2, 3], [4, 5, 6]])
print(pd.DataFrame(my_2darray))

# Take a dictionary as input to your DataFrame
my_dict = {1: ['1', '3'], 2: ['1', '2'], 3: ['2', '4']}
print(pd.DataFrame(my_dict))

# Take a DataFrame as input to your DataFrame
my_df = pd.DataFrame(data=[4, 5, 6, 7], index=range(0, 4), columns=['A'])
print(pd.DataFrame(my_df))

# Take a Series as input to your DataFrame
my_series = pd.Series({"United Kingdom": "London", "India": "New Delhi",
                       "United States": "Washington", "Belgium": "Brussels"})
print(pd.DataFrame(my_series))

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
# Use the `shape` property
print(df.shape)
# Or use the `len()` function with the `index` property
print(len(df.index))
Output:
     Col1 Col2
Row1    1    2
Row2    3    4
   0  1  2
0  1  2  3
1  4  5  6
   1  2  3
0  1  1  2
1  3  2  4
   A
0  4
1  5
2  6
3  7
                         0
United Kingdom      London
India            New Delhi
United States   Washington
Belgium           Brussels
(2, 3)
2
Result:
Thus the working with Pandas data frames was successfully completed.
AIM:
ALGORITHM
Step1: Start
Step2: import Matplotlib module
Step3: Create basic plots using Matplotlib
Step4: Print the output
Step5: Stop
Program:3a
# importing the required module
import matplotlib.pyplot as plt
# x axis values
x = [1,2,3]
# corresponding y axis values
y = [2,4,1]
# plotting the points
plt.plot(x, y)
# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('My first graph!')
# function to show the plot
plt.show()
Output:
Program:3b
import matplotlib.pyplot as plt
a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(a)
# o is for circles and r is for red
plt.plot(b, "or")
plt.plot(list(range(0, 22, 3)))
# naming the x-axis
plt.xlabel('Day ->')
# naming the y-axis
plt.ylabel('Temp ->')
c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label = '4th Rep')
# get current axes command
ax = plt.gca()
# get command over the individual
# boundary line of the graph body
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# set the range or the bounds of
# the left boundary line to fixed range
ax.spines['left'].set_bounds(-3, 40)
# show the legend for the labelled line and display the figure
plt.legend()
plt.show()
Output:
Program:3c
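import matplotlib.pyplot as plt
a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
c = [4, 2, 6, 8, 3, 20, 13, 15]
# The subplot setup is missing from this listing; a 2 x 2 grid is assumed
fig = plt.figure(figsize=(10, 10))
sub1 = plt.subplot(2, 2, 1)
sub2 = plt.subplot(2, 2, 2)
sub4 = plt.subplot(2, 2, 4)
sub1.plot(a, 'sb')   # blue squares
sub2.plot(b, 'or')   # red circles
sub4.plot(c, 'Dm')   # magenta diamonds
plt.show()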
Result:
Thus the Python program for basic plots using Matplotlib was successfully
completed.
ALGORITHM
Step 1: Start the Program
Step 2: Create text file blake-poems.txt
Step 3: Import the word_tokenize function and gutenberg
Step 4: Write the code to count the frequency of occurrence of a word in a body of text
Step 5: Print the result
Step 6: Stop the process
PROGRAM:
from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg
sample = gutenberg.raw("blake-poems.txt")
token = word_tokenize(sample)
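wlist = []
# keep only the first 50 tokens
for i in range(50):
    wlist.append(token[i])
# frequency of each of those tokens within the 50-token window
wordfreq = [wlist.count(w) for w in wlist]
# zip() is lazy in Python 3, so convert it to a list before printing
print("Pairs\n" + str(list(zip(wlist, wordfreq))))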
Output:
[('[', 1), ('Poems', 1), ('by', 1), ('William', 1), ('Blake', 1), ('1789', 1), (']', 1),
('SONGS', 2), ('OF', 3), ('INNOCENCE', 2), ('AND', 1), ('OF', 3), ('EXPERIENCE', 1),
('and', 1), ('THE', 1), ('BOOK', 1), ('of', 2), ('THEL', 1), ('SONGS', 2), ('OF', 3),
('INNOCENCE', 2), ('INTRODUCTION', 1), ('Piping', 2), ('down', 1), ('the', 1),
('valleys', 1), ('wild', 1), (',', 3), ('Piping', 2), ('songs', 1), ('of', 2),
('pleasant', 1), ('glee', 1), (',', 3), ('On', 1), ('a', 2), ('cloud', 1), ('I', 1),
('saw', 1), ('a', 2), ('child', 1), (',', 3), ('And', 1), ('he', 1), ('laughing', 1),
('said', 1), ('to', 1), ('me', 1), (':', 1), ('``', 1)]
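The list comprehension in the program calls wlist.count() once per token, which rescans the list each time. As an alternative sketch (not the manual's method), collections.Counter from the standard library tallies every word in one pass. The sample sentence below is an assumption standing in for the Gutenberg text, so the block runs without NLTK data.

```python
from collections import Counter

# Stand-in text (assumption); the lab program tokenizes blake-poems.txt instead
sample = "Piping down the valleys wild , Piping songs of pleasant glee ,"
tokens = sample.split()

# Count every token in a single pass
freq = Counter(tokens)

# Pair each token with its frequency, in order of appearance
pairs = [(w, freq[w]) for w in tokens]
print("Pairs")
print(pairs)
```

Counter also offers most_common(), which returns the tokens sorted by frequency.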
Result:
Thus the program to count the frequency of occurrence of words in a body of text,
which is often needed during text processing, and the Conditional Frequency
Distribution program using Python were successfully completed.
ALGORITHM
Program:
# Python code to demonstrate the variance() function
# on a varying range of data types
Output :
Result:
Thus the computation for variance was successfully completed.
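The program body for the variance exercise is missing from this listing. As a hedged sketch of what its comment describes, the standard-library statistics.variance() function computes the sample variance and accepts integers, floats, Fraction and Decimal values. The sample data below are illustrative assumptions.

```python
from decimal import Decimal
from fractions import Fraction
from statistics import variance

# Illustrative samples of different numeric types (assumed data)
ints = (1, 2, 3, 4, 5)
floats = (1.5, 2.5, 3.5)
fracs = (Fraction(1, 2), Fraction(3, 2), Fraction(5, 2))
decimals = (Decimal("0.5"), Decimal("1.5"), Decimal("2.5"))

# variance() returns the sample variance (it divides by n - 1)
print("Variance of ints:", variance(ints))
print("Variance of floats:", variance(floats))
print("Variance of fractions:", variance(fracs))
print("Variance of decimals:", variance(decimals))
```

For the population variance (dividing by n), statistics.pvariance() can be used instead.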
Ex. No.:4(d) NORMAL CURVE
Aim:
To create a normal curve using python program.
ALGORITHM
Step 1: Start the Program
Step 2: Import packages scipy and call function scipy.stats
Step 3: Import packages numpy, matplotlib and seaborn
Step 4: Create the distribution
Step 5: Visualizing the distribution
Step 6: Stop the process
Program:
# import required libraries
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
Result:
Thus the normal curve using python program was successfully
completed.
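The normal curve listing stops after the imports. As a minimal sketch of steps 4 and 5, the block below builds the points of a standard normal curve using only the standard library (statistics.NormalDist), a substitute for scipy.stats.norm chosen so the example runs without extra packages. The range of -4 to 4 and the step of 0.1 are assumptions.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1 (assumed parameters)
dist = NormalDist(mu=0.0, sigma=1.0)

# x values from -4.0 to 4.0 in steps of 0.1
xs = [i / 10 for i in range(-40, 41)]
# Probability density at each x; these are the points of the bell curve
ys = [dist.pdf(x) for x in xs]

peak_x = xs[ys.index(max(ys))]
print("Number of points:", len(xs))
print("Peak of the curve is at x =", peak_x)
```

Passing xs and ys to plt.plot(xs, ys) followed by plt.show() would draw the curve with Matplotlib.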
ALGORITHM
Step 1: Start the Program
Step 2: Create variable y1, y2
Step 3: Create variable x, y3 using random function
Step 4: Plot the scatter plot
Step 5: Print the result
Step 6: Stop the process
Program:
# Scatterplot and Correlations
import numpy as np
import matplotlib.pyplot as plt
# Data
x = np.random.randn(100)
y1 = x * 5 + 9
y2 = -5 * x
y3 = np.random.randn(100)
# Plot
plt.rcParams.update({'figure.figsize': (10, 8), 'figure.dpi': 100})
plt.scatter(x, y1, label=f'y1, Correlation = {np.round(np.corrcoef(x, y1)[0, 1], 2)}')
plt.scatter(x, y2, label=f'y2 Correlation = {np.round(np.corrcoef(x, y2)[0, 1], 2)}')
plt.scatter(x, y3, label=f'y3 Correlation = {np.round(np.corrcoef(x, y3)[0, 1], 2)}')
# Plot
plt.title('Scatterplot and Correlations')
plt.legend()
plt.show()
Output :
RESULT:
Thus the Correlation and scatter plots using python program was successfully
completed.
ALGORITHM
Step 1: Start the Program
Step 2: Import math package
Step 3: Define correlation coefficient function
Step 4: Calculate correlation using formula
Step 5: Print the result
Step 6 : Stop the process
Program:
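# Python Program to find correlation coefficient.
import math
# function that returns correlation coefficient.
def correlationCoefficient(X, Y, n):
    sum_X = 0
    sum_Y = 0
    sum_XY = 0
    squareSum_X = 0
    squareSum_Y = 0
    i = 0
    while i < n:
        # sum of elements of array X.
        sum_X = sum_X + X[i]
        # sum of elements of array Y.
        sum_Y = sum_Y + Y[i]
        # sum of X[i] * Y[i].
        sum_XY = sum_XY + X[i] * Y[i]
        # sum of squares of array elements.
        squareSum_X = squareSum_X + X[i] * X[i]
        squareSum_Y = squareSum_Y + Y[i] * Y[i]
        i = i + 1
    # use formula for calculating correlation
    # coefficient.
    corr = float(n * sum_XY - sum_X * sum_Y) / float(
        math.sqrt((n * squareSum_X - sum_X * sum_X) *
                  (n * squareSum_Y - sum_Y * sum_Y)))
    return corr
# Driver function
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
# number of elements in the arrays
n = len(X)
print('{0:.6f}'.format(correlationCoefficient(X, Y, n)))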
Output :
0.953463
Result:
Thus the computation for correlation coefficient was successfully completed.
Ex. No.: 4 (g) SIMPLE LINEAR REGRESSION
Aim:
To write a python program for Simple Linear Regression
ALGORITHM
Step 1: Start the Program
Step 2: Import numpy and matplotlib package
Step 3: Define coefficient function
Step 4: Calculate cross-deviation and deviation about x
Step 5: Calculate regression coefficients
Step 6: Plot the Linear regression and define main function
Step 7: Print the result
Step 8: Stop the process
Program:
import numpy as np
import matplotlib.pyplot as plt
# putting labels
plt.xlabel('x')
plt.ylabel('y')
def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} "
          "\nb_1 = {}".format(b[0], b[1]))
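main() calls estimate_coef(), which is not defined in this listing. A minimal sketch of it, assuming the ordinary least-squares formulas named in steps 4 and 5 (cross-deviation and deviation about x), is shown below in plain Python so that it also works on lists. The internal variable names are assumptions.

```python
def estimate_coef(x, y):
    """Return (b_0, b_1) for the least-squares line y = b_0 + b_1 * x."""
    n = len(x)
    # means of x and y
    m_x = sum(x) / n
    m_y = sum(y) / n
    # cross-deviation and deviation about x
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * m_x * m_y
    ss_xx = sum(xi * xi for xi in x) - n * m_x * m_x
    # regression coefficients: slope first, then intercept
    b_1 = ss_xy / ss_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)


# Same observations as in main()
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 3, 2, 5, 7, 8, 8, 9, 10, 12]
b = estimate_coef(x, y)
print("b_0 = {:.4f}, b_1 = {:.4f}".format(b[0], b[1]))
```

The fitted line can then be drawn with plt.scatter(x, y) for the points and plt.plot(x, [b[0] + b[1] * xi for xi in x]) for the line.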
Output :
Estimated coefficients:
b_0 = 1.23636363636
b_1 = 1.1696969697
Graph:
Result:
Thus the computation for Simple Linear Regression was successfully completed.
EX.NO 5. USE THE STANDARD BENCHMARK DATASET
FOR PERFORMING THE FOLLOWING:
A) UNIVARIATE ANALYSIS: FREQUENCY, MEAN, MEDIAN, MODE,
AIM: To explore various commands for doing univariate analytics on the UCI AND PIMA INDIANS DIABETES data set.
ALGORITHM:
STEP 2: To download the UCI AND PIMA INDIANS DIABETES data set using Kaggle.
STEP 3: To read data from UCI AND PIMA INDIANS DIABETES data set.
STEP 4: To find the mean, median, mode, variance, standard deviation, skewness and kurtosis.
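As an illustration of step 4, the following sketch computes each statistic on a small made-up sample using only the standard library. The data values are assumptions, and skewness and excess kurtosis are computed from population central moments, so they may differ slightly from the bias-corrected values that pandas reports.

```python
import statistics

# Made-up sample (assumption), not taken from the diabetes data set
data = [1, 2, 2, 3, 4, 4, 4, 6]

n = len(data)
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
variance = statistics.pvariance(data)   # population variance
std_dev = statistics.pstdev(data)       # population standard deviation

# Central moments used for skewness and kurtosis
m3 = sum((v - mean) ** 3 for v in data) / n
m4 = sum((v - mean) ** 4 for v in data) / n
skewness = m3 / std_dev ** 3
kurtosis = m4 / variance ** 2 - 3        # excess kurtosis (0 for a normal curve)

print("mean:", mean, "median:", median, "mode:", mode)
print("variance:", variance, "std dev:", round(std_dev, 3))
print("skewness:", round(skewness, 3), "kurtosis:", round(kurtosis, 3))
```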
PROGRAM:
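import numpy as np
import pandas as pd
import seaborn as sns
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')
df = pd.read_csv('C:/Users/kirub/Documents/Learning/Untitled Folder/diabetes.csv')
df.head()
df.shape
df.dtypes
df.describe().T
#kurtosis
df.kurtosis(axis=0,skipna=True)
df['Outcome'].kurtosis(axis=0,skipna=True)
#skewness
df.skew(axis=0,skipna=True)
#Pregnancy variable
# The left-hand sides of the next two assignments are missing in the original
# listing; the names are reconstructed from how they are used below.
preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(np.round(preg_proportion/sum(preg_proportion),3)*100,dtype=int)
# The third column name is truncated in the original; 'percentage' is an assumed name
preg = pd.DataFrame({'month':preg_month,'count_of_preg_prop':preg_proportion,'percentage':preg_proportion_perc})
preg.head(10)
sns.countplot(x=df['Outcome'])
sns.distplot(df['Pregnancies'])
sns.boxplot(data=df['Pregnancies'])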
OUTPUT:
RESULT: Exploring various commands for doing univariate analytics on the UCI AND
PIMA INDIANS DIABETES was successfully executed.
EX.NO:5. B) BIVARIATE ANALYSIS: LINEAR AND LOGISTIC REGRESSION MODELING
DATE:
AIM:
To explore the Linear and Logistic Regression model on the USA HOUSING
AND UCI AND PIMA INDIANS DIABETES data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: To download a data set, such as the housing data set, using Kaggle.
STEP 3: To read data from downloaded data set.
STEP 4: To find the linear and logistic regression model using the given data set.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
BIVARIATE ANALYSIS GENERAL PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('C:/Users/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')
# The figure setup is missing from this listing; a 3 x 2 grid is assumed
# because the code below indexes axes[2][1]
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(8, 6))
# count of each Pregnancies value
plot00=sns.countplot(x='Pregnancies',data=df,ax=axes[0][0],color='green')
axes[0][0].set_title('Count',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count',fontdict={'fontsize':7})
plt.tight_layout()
# the same count, split by Outcome
plot01=sns.countplot(x='Pregnancies',data=df,hue='Outcome',ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count',fontdict={'fontsize':7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
# box plot of Pregnancies for each Outcome
plot21 = sns.boxplot(x='Outcome',y='Pregnancies',data=df,ax=axes[2][1])
axes[2][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[2][1].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][1].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
plt.tight_layout()
plt.show()
OUTPUT:
plot01=sns.distplot(df[df['Outcome']==False]['BloodPressure'],ax=axes[0][1],color='green',label='Non Diab.')
sns.distplot(df[df.Outcome==True]['BloodPressure'],ax=axes[0][1],color='red',label='Diab')
OUTPUT:
plot0=sns.distplot(df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot1=sns.boxplot(df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('BloodPressure',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()
OUTPUT:
LINEAR REGRESSION MODELLING ON HOUSING DATASET
# Data manipulation libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()
USAhousing.columns
sns.pairplot(USAhousing)
sns.distplot(USAhousing['Price'])
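# numeric_only=True skips any non-numeric column (for example a text address),
# which corr() cannot handle in recent pandas versions
sns.heatmap(USAhousing.corr(numeric_only=True))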
X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
# print the intercept
print(lm.intercept_)
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
predictions = lm.predict(X_test)
plt.scatter(y_test,predictions)
sns.distplot((y_test-predictions),bins=50);
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
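To make the three error metrics concrete, the sketch below computes them by hand on a few made-up values. It mirrors what metrics.mean_absolute_error and metrics.mean_squared_error report; the numbers are assumptions and are unrelated to the housing data.

```python
import math

# Made-up actual and predicted values (assumption)
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Residuals: actual minus predicted
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / len(errors)    # mean absolute error
mse = sum(e * e for e in errors) / len(errors)     # mean squared error
rmse = math.sqrt(mse)                              # root mean squared error

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

RMSE is in the same units as the target (here, price), which usually makes it easier to interpret than MSE.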
OUTPUT:
RESULT:
Exploring various commands for doing Bivariate analytics on the USA HOUSING Dataset was
successfully executed.
Ex. No.: 7. APPLY SUPERVISED LEARNING ALGORITHMS
AND UNSUPERVISED LEARNING ALGORITHMS WITH ANY DATASET
ALGORITHM :
Decision Tree Construction: The algorithm recursively splits the dataset into subsets
based on the features that best separate the classes. It selects the feature that
provides the best split, usually based on criteria like Gini impurity or information
gain.
Decision Tree Pruning (Optional): After constructing the tree, pruning can be
applied to avoid overfitting. Pruning involves removing parts of the tree that do not
provide significant predictive power.
Prediction: To make predictions, the algorithm traverses the tree from the root node
to a leaf node, following the decision rules at each node based on the feature values.
The predicted class is the majority class of the instances in the leaf node.
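The split criterion described above can be made concrete with a small sketch. The helper names (gini, best_threshold) and the toy data are assumptions for illustration; library implementations such as scikit-learn's DecisionTreeClassifier are considerably more elaborate.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Try each midpoint between sorted values; return (threshold, weighted impurity)."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split is possible between equal values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= thr]
        right = [lab for v, lab in pairs if v > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (thr, score)
    return best


# Toy feature (for example a petal length in cm) and class labels (assumed data)
lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

threshold, impurity = best_threshold(lengths, species)
print("Best threshold:", threshold, "weighted Gini:", impurity)
```

A full tree repeats this search over every feature, splits on the best one, and recurses on the two resulting subsets until the leaves are pure or a stopping rule applies.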
PROGRAM :
OUTPUT :
Accuracy: 1.0
RESULT :
Running the program reports the accuracy of the trained Decision Tree classifier
on the test data from the Iris dataset.
EX.NO:8. APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS
ON UCI DATA SETS.
DATE:
AIM:
To apply and explore various plotting functions on UCI datasets.
ALGORITHM:
A. NORMAL CURVES
#seaborn package
import seaborn as sns
flights = sns.load_dataset("flights")
flights.head()
may_flights = flights.query("month == 'May'")
sns.lineplot(data=may_flights, x="year", y="passengers")
OUTPUT:
B. DENSITY AND CONTOUR PLOTS
iris = sns.load_dataset("iris")
sns.kdeplot(data=iris)
OUTPUT:
# histogram of dataframe
df = sns.load_dataset("titanic")
sns.histplot(data=df, x="age")
OUTPUT: