CS3361-Data Science Lab Manual - B.rethina Kumar
Ex.No : 1 Python Packages
Date :
AIM:
To download, install, and explore the features of the NumPy, SciPy,
Jupyter, and Pandas packages.
(i) Installing numpy
(ii) Installing scipy
(iii) Installing jupyter
(iv) Installing pandas
Procedure:
(i) Installing PIP On Windows
Step 1: Download PIP get-pip.py
i. Launch a command prompt
ii. Then, run the following command to download the get-pip.py file:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
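Once installation is complete, one way to confirm that the packages named in the aim are available is to ask Python whether it can locate each of them. This is a minimal sketch; the helper name is_installed is hypothetical and is not part of the manual.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None

# Package names taken from the aim of this exercise.
for name in ("numpy", "scipy", "jupyter", "pandas"):
    status = "installed" if is_installed(name) else "missing"
    print(name, ":", status)
```

A package reported as missing can then be installed with pip before continuing.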
---------------------------------------------------------------------------------------------------------
(2.2). Matrix Multiplication
Program :
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[10, 11, 12], [13, 14, 15]])
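# The * operator multiplies element by element; it is not the matrix product.
c = a * b
# The matrix product needs compatible shapes: (2, 3) @ (3, 2) gives (2, 2).
d = a @ b.T
print("a = ", a)
print("b = ", b)
print("Element-wise multiplication of a and b = ", c)
print("Matrix product of a and transpose of b = ", d)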
Output :
---------------------------------------------------------------------------------------------------------
(2.3). Scalar Multiplication of Matrix
Program :
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b=a*3
print("a = ", a)
print("b = a * 3 = ", b)
Output :
---------------------------------------------------------------------------------------------------------
(2.4). Matrix Transpose
Program :
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = a.T
print("a = \n", a)
print("Transpose of a = \n", b)
Output :
---------------------------------------------------------------------------------------------------------
(2.5). Array Datatype Conversion
Program :
import numpy as np
a = np.array([[2.5, 3.8, 1.5], [4.7, 2.9, 1.56]])
b = a.astype('int')
print("The array in float datatype =\n", a)
print("The array in int datatype =\n", b)
Output :
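Note on the conversion above: astype('int') discards the fractional part instead of rounding. The sketch below contrasts truncation with rounding on the same array; the variable names truncated and rounded are illustrative only.

```python
import numpy as np

a = np.array([[2.5, 3.8, 1.5], [4.7, 2.9, 1.56]])

# astype(int) drops the fractional part (truncates toward zero).
truncated = a.astype(int)

# np.rint rounds to the nearest integer first; exact halves go to the nearest even value.
rounded = np.rint(a).astype(int)

print("Truncated :\n", truncated)
print("Rounded :\n", rounded)
```

Here 3.8 becomes 3 when truncated but 4 when rounded, so the choice matters whenever the fractional parts carry information.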
---------------------------------------------------------------------------------------------------------
(2.6). Stacking of numpy arrays
Program :
import numpy as np
a1 = np.array([[1, 2, 3], [4, 5, 6]])
a2 = np.array([[7, 8, 9], [10, 11, 12]])
c = np.hstack((a1, a2))
d = np.vstack((a1, a2))
print("The two arrays are :\na1 =\n", a1, "\na2 =\n", a2)
print("\nHorizontal stacking :\n", c)
print("\nVertical stacking :\n", d)
Output :
---------------------------------------------------------------------------------------------------------
(2.7). Sequence generation
Program :
import numpy as np
lists = [x for x in range(0, 101, 2)]
a = np.array(lists)
print(a)
Output :
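The same sequence can also be generated by NumPy directly, without building a Python list first. This is a sketch of that alternative, not a replacement for the program above.

```python
import numpy as np

# Even numbers from 0 to 100; the stop value 101 itself is excluded.
a = np.arange(0, 101, 2)

# linspace takes the number of points instead of the step and includes both end points.
b = np.linspace(0, 100, 51)

print(a)
print(b)
```

np.arange returns integers here, while np.linspace returns floating-point values.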
---------------------------------------------------------------------------------------------------------
(2.8). Sorting an array
Program:
import numpy as np
a = np.array([[1, 4, 2], [3, 4, 6], [0, -1, 5]])
print("Array before sorting :")
print(a)
print("Flattened and sorted :")
print(np.sort(a, axis=None))
print("Sorting in row wise :")
print(np.sort(a, axis=1))
print("Sorting in column wise :")
print(np.sort(a, axis=0))
Output :
Result :
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ex.No : 3 Working With Pandas Dataframe
Date :
AIM:
To perform various operations on a DataFrame using the pandas module in
Python.
Output :
Output :
---------------------------------------------------------------------------------------------------------
(3.3). Creating dataframe from a series
Program
import pandas as pd
data = {'ONE' : pd.Series([10, 20, 30, 40], index=[1, 2, 3, 4]),
'TWO' : pd.Series([50, 60, 70, 80], index=[1, 2, 3, 4])}
df = pd.DataFrame(data)
print(df)
Output :
---------------------------------------------------------------------------------------------------------
(3.4). Sorting the dataframe
Program
import pandas as pd
data = {'Name' : ['name1', 'name2', 'name3'], 'Age' : [20, 21, 22]}
df = pd.DataFrame(data)
print("\nDataset before sorting :\n", df)
d_sort1 = df.sort_values(by='Name')
print("\nDataset after sorted by Name :\n", d_sort1)
d_sort2 = df.sort_values(by='Age')
print("\nDataset after sorted by Age :\n", d_sort2)
Output :
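sort_values sorts in ascending order by default and returns a new DataFrame. The sketch below, using the same data, shows a descending sort followed by renumbering of the rows; the name oldest_first is illustrative only.

```python
import pandas as pd

data = {'Name': ['name1', 'name2', 'name3'], 'Age': [20, 21, 22]}
df = pd.DataFrame(data)

# sort_values returns a new DataFrame; df itself is left unchanged.
oldest_first = df.sort_values(by='Age', ascending=False)

# reset_index(drop=True) renumbers the rows 0, 1, 2 after sorting.
oldest_first = oldest_first.reset_index(drop=True)

print(oldest_first)
```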
(3.5). Manipulation of data frame
(i) Selection of column :
Source Code :
import pandas as pd
data = {'ONE' : pd.Series([10, 20, 30, 40], index=[1, 2, 3, 4]),
'TWO' : pd.Series([50, 60, 70, 80], index=[1, 2, 3, 4])}
df = pd.DataFrame(data)
print("------------------------")
print(df)
print("------------------------")
print("Selecting column ONE")
print(df['ONE'])
print("------------------------")
print("Selecting column TWO")
print(df['TWO'])
print("------------------------")
Output :
(ii) Addition of column :
Program
import pandas as pd
data = {'ONE' : pd.Series([10, 20, 30, 40], index=[1, 2, 3, 4]),
'TWO' : pd.Series([50, 60, 70, 80], index=[1, 2, 3, 4])}
df = pd.DataFrame(data)
print("------------------------")
print("Data Frame before adding a new column")
print(df)
print("------------------------")
df['THREE'] = pd.Series([90, 100, 110, 120], index=[1, 2, 3, 4])
print("Data Frame after adding a new column\n", df)
print("------------------------")
Output :
(iii) Deletion of column
Source Code :
import pandas as pd
data = {'ONE' : pd.Series([0, 1, 2, 3], index=[1, 2, 3, 4]),
'TWO' : pd.Series([4, 5, 6, 7], index=[1, 2, 3, 4])}
print("-----------------------")
df = pd.DataFrame(data)
print("Original DataFrame :\n", df)
print("-----------------------")
del df['ONE']
print("DataFrame after deleting a column :\n", df)
print("-----------------------")
Output :
(iv) Selection of rows
Source Code :
import pandas as pd
data = {'ONE' : pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd']),
'TWO' : pd.Series([4, 5, 6, 7], index=['a', 'b', 'c', 'd'])}
print("-----------------------")
df = pd.DataFrame(data)
print("DataFrame :\n", df)
print("-----------------------")
print("row 'c' :")
print(df.loc['c'])
print("-----------------------")
Output :
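Rows can be selected either by index label (loc) or by integer position (iloc). The sketch below shows both on the same data; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'ONE': [0, 1, 2, 3], 'TWO': [4, 5, 6, 7]},
                  index=['a', 'b', 'c', 'd'])

by_label = df.loc['c']          # row whose index label is 'c'
by_position = df.iloc[2]        # third row, counting from 0
label_slice = df.loc['b':'c']   # label slices include both end points

print(by_label)
print(by_position)
print(label_slice)
```

Note that a label slice includes its end label, unlike an ordinary Python slice.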
(v) Addition of rows
Source Code :
import pandas as pd
df1 = pd.DataFrame([[1, 2], [3, 4]], columns = ['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a', 'b'])
print("---------------------")
print("df1 :")
print(df1)
print("---------------------")
print("df2 :")
print(df2)
print("---------------------")
print("df1 + df2 :")
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
df1 = pd.concat([df1, df2])
print(df1)
Output :
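When two frames are stacked row-wise, each keeps its own index labels, so the result above has the labels 0 and 1 twice. The sketch below uses pd.concat to show the difference made by ignore_index=True; the names kept and renumbered are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# Default: each frame keeps its own labels, giving 0, 1, 0, 1.
kept = pd.concat([df1, df2])

# ignore_index=True renumbers the rows 0 to 3.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(kept)
print(renumbered)
```

With duplicated labels, a later drop(0) removes every row labelled 0, which is why renumbering is often preferable.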
(vi) Deletion of rows
Program :
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a', 'b'])
print("DataFrame :")
print(df)
df = df.drop(0)
print("DataFrame after deleting the row 0 :")
print(df)
Output :
Result :
The various operations on a DataFrame using the pandas module in Python have
been implemented and executed successfully.
Program :
# The with statement closes the file automatically after reading.
with open(r'Data.txt') as T:
    print(T.read())
Output :
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(4.2) Reading the CSV file
Program
import pandas as pd
data = pd.read_csv(r'Data.csv')
print(data)
Output :
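read_csv accepts a file path or any file-like object. The sketch below reads from an in-memory string so that it runs without a file on disk; the CSV text is hypothetical and only stands in for Data.csv.

```python
import io
import pandas as pd

# Hypothetical file contents, used here in place of Data.csv.
csv_text = "Name,Age\nname1,20\nname2,21\n"

# The first line is taken as the header row by default.
data = pd.read_csv(io.StringIO(csv_text))

print(data)
print(data.dtypes)
```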
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(4.3) Reading the excel file
Program
import pandas as pd
data = pd.read_excel(r'Data.xlsx')
print(data)
Output :
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(4.4) Reading from web
Program:
import pandas as pd
url="https://en.wikipedia.org/wiki/Iris_flower_data_set"
# read_html returns a list containing one DataFrame per table found on the page
df=pd.read_html(url)
print(df)
Output:
[ Dataset order Sepal length ... Petal width Species
0 1 5.1 ... 0.2 I. setosa
1 2 4.9 ... 0.2 I. setosa
2 3 4.7 ... 0.2 I. setosa
3 4 4.6 ... 0.2 I. setosa
4 5 5.0 ... 0.3 I. setosa
.. ... ... ... ... ...
145 146 6.7 ... 2.3 I. virginica
146 147 6.3 ... 1.9 I. virginica
147 148 6.5 ... 2.0 I. virginica
148 149 6.2 ... 2.3 I. virginica
149 150 5.9 ... 1.8 I. virginica
Algorithm:
Descriptive analytics is the process of using current and historical data to identify trends
and relationships.
Data set: Iris Dataset
Source Code :
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# To read the dataset in python
Iris = pd.read_csv (r'C:\Users\8316\Desktop\Iris.csv')
print ("Iris Dataset : \n",Iris) # To print the dataset.
#The head function in Python displays the first five rows of the dataframe by default.
print ("Iris Dataset Head : \n",Iris.head())
#Shape Function to list the records and the features
print ("Iris Dataset Shape : \n",Iris.shape)
# The info() method prints a summary of the DataFrame itself and returns None,
# so it is called directly rather than inside print().
print ("Iris Dataset Info :")
Iris.info()
# Summaries for a dataset.
print ("Iris Dataset Describe : \n",Iris.describe())
# count() gives the number of non-missing values in each column.
print ("Iris Dataset Count : \n",Iris.count())
# Pandas groupby groups the data by category and applies a function to each group.
print ("Iris Dataset Group : \n",Iris.groupby('Species',as_index= False)["Id"].count())
# Sample mean for every numeric column (numeric_only=True skips the text column Species)
print ("Iris Dataset Mean : \n",Iris.mean(numeric_only=True))
# Sample median for every numeric column
print ("Iris Dataset Median : \n",Iris.median(numeric_only=True))
# Sample variance for every numeric column
print ("Iris Dataset Variance : \n",Iris.var(numeric_only=True))
# The different categories of Species
print ("Iris Dataset different categories : \n",Iris.Species.unique())
output:
Ex.No : 6 Bivariate Analysis and Multiple Regression analysis
Date:
AIM:
To perform bivariate analysis, such as linear and logistic regression modelling,
and multiple regression analysis using Python.
Dataset: Diabetes Dataset
(6.1) Linear Regression
Source Code :
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
diabetes_x, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_x = diabetes_x[:, np.newaxis, 2]
diabetes_x_train = diabetes_x[:-20]
diabetes_x_test = diabetes_x[-20:]
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
regr = linear_model.LinearRegression()
regr.fit(diabetes_x_train, diabetes_y_train)
diabetes_y_pred = regr.predict(diabetes_x_test)
print('Coefficients :\n', regr.coef_)
print('Mean squared error : %.2f'%mean_squared_error(diabetes_y_test,
diabetes_y_pred))
print('Coefficient of determination : %.2f'%r2_score(diabetes_y_test, diabetes_y_pred))
plt.scatter(diabetes_x_test, diabetes_y_test, color='black')
plt.plot(diabetes_x_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Output :
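The two scores printed by the program can be computed by hand, which shows what mean_squared_error and r2_score measure. The sketch below uses NumPy only; the observed and predicted values are hypothetical and are not taken from the diabetes dataset.

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Mean squared error: the average of the squared residuals.
mse = np.mean((y_true - y_pred) ** 2)

# R^2 = 1 - (residual sum of squares) / (total sum of squares about the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("Mean squared error :", mse)
print("Coefficient of determination :", r2)
```

An R^2 close to 1 means the predictions explain most of the variation in the observed values.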
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(6.2) Logistic Regression
Source Code :
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
DataPath = (r'C:\Users\8316\Downloads\diabetes.csv')
data = pd.read_csv(DataPath)
x=data.drop("Outcome",axis=1)
y=data["Outcome"]  # 1-D target column, as scikit-learn classifiers expect
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=0)
model=LogisticRegression()
model.fit(x_train,y_train)
y_predict=model.predict(x_test)
model_score=model.score(x_test,y_test)
#Logistic Regression Model Score
print("Logistic Regression Model Score = ",model_score)
#confusion matrix
print("Confusion Matrix : \n",metrics.confusion_matrix(y_test,y_predict))
sns.heatmap(metrics.confusion_matrix(y_test,y_predict), annot=True, fmt='d',
cmap='Blues')
plt.title("LogisticRegression Confusion Matrix")
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.savefig('confusion_matrix.png')
plt.show()
Output:
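The model score reported for a classifier is its accuracy, which can be read off the confusion matrix. The sketch below uses a hypothetical 2x2 matrix (not the result for the diabetes data), laid out as scikit-learn does, with actual classes in rows and predicted classes in columns.

```python
import numpy as np

# Hypothetical confusion matrix: rows are actual classes, columns are predicted classes.
cm = np.array([[50, 10],
               [5, 35]])

# For two classes the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of actual positives that are found

print("Accuracy :", accuracy)
print("Precision :", precision)
print("Recall :", recall)
```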
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(6.3) Multiple Regression analysis
Source Code:
import pandas as pd
from sklearn import linear_model
DataPath = (r'C:\Users\8316\Downloads\diabetes.csv')
df = pd.read_csv(DataPath)
df.head()
x=df[['Insulin','Glucose']]
y=df[['Outcome']]
regr=linear_model.LinearRegression()
regr.fit(x,y)
predicted=regr.predict([[500,200]])
print("Predicted Outcome = ", predicted)
Output:
Ex.No : 7 Exploring Various Plotting Functions using UCI data sets
Date:
(7.1). Constructing Normal Curve
Source Code :
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
import statistics
Output :
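Only the import lines of this program are shown. As a rough sketch of how the points of a normal curve can be computed, the block below uses the standard library's statistics.NormalDist; the sample values are hypothetical, and this is not the manual's own program.

```python
import statistics

# Hypothetical sample; the manual's own data for this program is not shown.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = statistics.mean(sample)
sd = statistics.stdev(sample)

# Normal distribution with the sample mean and sample standard deviation.
dist = statistics.NormalDist(mu=mean, sigma=sd)

# 101 evenly spaced x values covering three standard deviations on each side of the mean.
xs = [mean - 3 * sd + i * (6 * sd / 100) for i in range(101)]
ys = [dist.pdf(x) for x in xs]

# The points (xs, ys) trace the bell curve and can be drawn with plt.plot(xs, ys).
print("Mean :", mean, " Standard deviation :", sd)
print("Peak density :", max(ys))
```

The curve is highest at the mean and symmetric about it.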
(7.2) Constructing line plot, scatter plot, density plot and contour plot.
Source Code:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# read the csv data
DataPath = (r'C:\Users\8316\Downloads\diabetes.csv')
df = pd.read_csv(DataPath)
df.head()
#Line Plot for Diabetes Dataset
sns.lineplot(x=df['BloodPressure'], y=df['Age'], hue=df["Outcome"])
plt.title("Lineplot for Diabetes Dataset")
plt.show()
#Scatter Plot for Diabetes Dataset
sns.scatterplot(x=df['BloodPressure'], y=df['Age'], hue=df["Outcome"])
plt.title("Scatterplot for Diabetes Dataset")
plt.show()
#Density Plot for Diabetes Dataset
x=df["Insulin"]
# distplot is deprecated in recent seaborn releases; kdeplot draws the density curve.
sns.kdeplot(x)
plt.title("Density plot for Diabetes Dataset")
plt.show()
#Contour Plot for Diabetes Dataset
def f(x, y): return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x1=df["Age"]
x2=df["Outcome"]
X, Y = np.meshgrid(x1, x2)
Z = f(X, Y)
#plt.contour(X, Y, Z, colors='black');
plt.contour(X, Y, Z, 20, cmap='RdGy');
plt.title("Contour plot for Diabetes Dataset")
plt.xlabel("Age")
plt.ylabel("Outcome")
plt.show()
output :