CS-3361-Data-science-lab Manual

The document outlines a data science lab course (CS 3361) that includes exercises on installing and using libraries like Numpy and Pandas, working with data frames, and performing various analyses on datasets such as diabetes and Iris. It covers topics like univariate analysis, multiple regression, and visualization techniques including scatter plots, histograms, and geographic data visualization using Basemap. The document provides example code snippets for each exercise to facilitate learning and application of data science concepts.

Uploaded by

sumathi

CS 3361 Data Science Lab

Ex. No. 1 - Install Method

NumPy

NumPy is a numerical computing package for mathematics, science, and engineering. Many data
science packages use NumPy as a dependency.

Ex: pip install numpy


Output:

Ex: pip install pandas


Output:

Ex: pip install statsmodels


Output:
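
After the three installs, a quick check can confirm that each package is importable. The sketch below is not part of the original exercise; it uses only the standard library (`importlib.util.find_spec`), and the helper name `check_installed` is an assumption made for illustration.

```python
import importlib.util

def check_installed(packages):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

status = check_installed(["numpy", "pandas", "statsmodels"])
for name, ok in status.items():
    print(name, "installed" if ok else "missing")
```

Any package reported as missing can be installed again with the matching pip command above.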

Ex. No. 2 - Working with NumPy arrays

Example Code:

# importing numpy module
import numpy as np

# creating list (named list1 so the built-in list type is not shadowed)
list1 = [1, 2, 3, 4]

# creating numpy array
sample_array = np.array(list1)

print("List in python : ", list1)
print("Numpy Array in python :", sample_array)

Example:

# importing numpy module
import numpy as np

# creating list
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]

# creating numpy array
sample_array = np.array([list_1, list_2, list_3])

print("Numpy multi dimensional array in python\n", sample_array)
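
Once the lists are wrapped in an array, NumPy exposes the shape of the data and applies arithmetic element by element. The following is a short sketch built on the same three lists; the variable names `doubled`, `col_sums` and `row_sums` are introduced here for illustration.

```python
import numpy as np

list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
sample_array = np.array([list_1, list_2, list_3])

# shape is (rows, columns); ndim is the number of axes
print(sample_array.shape)   # (3, 4)
print(sample_array.ndim)    # 2

# arithmetic is applied element by element, unlike list * 2 which repeats the list
doubled = sample_array * 2
print(doubled[0])           # [2 4 6 8]

# sums along an axis: axis=0 adds down each column, axis=1 adds across each row
col_sums = sample_array.sum(axis=0)   # [15 18 21 24]
row_sums = sample_array.sum(axis=1)   # [10 26 42]
print(col_sums, row_sums)
```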



Ex. No. 3 - Working with Pandas data frames

Code:

import pandas as pd
import numpy as np

# a Series holding one missing value (np.nan)
sas = pd.Series([1, 3, 5, np.nan, 6])
sas

Code:

import pandas as pd

data = {'apple': [3, 2, 0],
        'orange': [3, 8, 9]}
purchase = pd.DataFrame(data)
purchase

# write the data frame to a CSV file in the working directory
purchase.to_csv('datasciencelab.csv')
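
By default `to_csv` also writes the row index as an unnamed first column, which matters when the file is read back. The sketch below shows the round trip; it writes to an in-memory buffer (`io.StringIO`) instead of a file so that it leaves nothing on disk, and the names `buffer` and `restored` are assumptions made for illustration.

```python
import io
import pandas as pd

purchase = pd.DataFrame({'apple': [3, 2, 0], 'orange': [3, 8, 9]})

# write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
purchase.to_csv(buffer)
print(buffer.getvalue())

# read it back; index_col=0 turns the saved first column into the row index again
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Passing `index=False` to `to_csv` leaves the index out of the file altogether.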

Ex. No. 4 - Reading data from text files, Excel and the web and exploring various
commands for doing descriptive analytics on the Iris data set.

Code:

import pandas as pd

data1 = pd.read_csv("Iris.csv")
data1.head()

data1.info()
data1.describe()
data1.isnull().sum()
data1.shape

# keep the first row of each species
data = data1.drop_duplicates(subset="Species")
data
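
The exercise title also mentions Excel and the web. pandas provides `pd.read_excel` for workbooks, and `pd.read_csv` accepts a URL in place of a file name. The sketch below stays with plain text so that it runs without any file: it reads a tiny hand-typed sample (illustrative values, not the real Iris data) from a string buffer and applies a few more descriptive commands. The column names other than `Species` are assumptions.

```python
import io
import pandas as pd

# A tiny hand-typed sample in an Iris-like column layout (illustrative values only)
text = """SepalLengthCm,SepalWidthCm,Species
5.1,3.5,Iris-setosa
4.9,3.0,Iris-setosa
7.0,3.2,Iris-versicolor
6.3,3.3,Iris-virginica
"""
sample = pd.read_csv(io.StringIO(text))

# frequency of each species
counts = sample["Species"].value_counts()
print(counts)

# mean of each numeric column per species
means = sample.groupby("Species").mean()
print(means)

# drop_duplicates keeps the first row seen for each species
first_rows = sample.drop_duplicates(subset="Species")
print(first_rows)
```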

Ex. No. 5 - Use the diabetes data set from UCI and Pima Indians Diabetes data
set for performing the following:

a. Univariate analysis: Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis.

Code:

import pandas as pd
import numpy as np
import statistics as st

# Load the data
df = pd.read_csv("diabetes.csv")
print(df.shape)

# info() prints its summary itself and returns None, so it is not wrapped in print()
df.info()

Measures of Central Tendency

Code:

df.mean()

Code:

print(df.loc[:, 'Age'].mean())
# the column list used later in this exercise has no 'Income', so 'Glucose' is used here
print(df.loc[:, 'Glucose'].mean())

Median

Code:

df.median()

Mode

Code:

df.mode()

Standard Deviation

Code:

df.std()

Variance

Code:

df.var()

Interquartile Range

Code:

from scipy.stats import iqr

iqr(df['Age'])

Skewness

Code:

print(df.skew())
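
The exercise heading also lists frequency and kurtosis. A minimal sketch of both is given below on a short made-up series of ages (not taken from the diabetes data); the same `value_counts`, `skew` and `kurt` calls apply to a column such as `df['Age']`, and `df.kurt()` works on the whole data frame.

```python
import pandas as pd

# Made-up ages, used only to show the calls
ages = pd.Series([21, 22, 22, 25, 29, 33, 41, 50, 22, 25], name="Age")

# Frequency: how often each value occurs
freq = ages.value_counts()
print(freq)

# Skewness > 0 indicates a longer right tail
print("skewness:", ages.skew())

# pandas reports excess kurtosis (a normal distribution gives about 0)
print("kurtosis:", ages.kurt())
```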

Code:

import pandas as pd

df = pd.read_csv('diabetes.csv')
df.head()

Code:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='whitegrid', context='notebook')
cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

Code:

import numpy as np

# correlation matrix of the selected columns
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.5)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
                 annot_kws={'size': 15}, yticklabels=cols, xticklabels=cols)
plt.show()
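
The transpose in `np.corrcoef(df[cols].values.T)` is needed because `np.corrcoef` treats each row as one variable. The sketch below, on synthetic data standing in for two numeric columns (the names `demo`, `a` and `b` are assumptions), shows that `DataFrame.corr()` yields the same Pearson matrix while keeping the column labels.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for two numeric columns
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = 2 * a + rng.normal(size=100)
demo = pd.DataFrame({"a": a, "b": b})

# np.corrcoef treats each ROW as a variable, hence the transpose
cm_numpy = np.corrcoef(demo.values.T)

# DataFrame.corr computes the same Pearson matrix and keeps the column labels
cm_pandas = demo.corr()

print(cm_pandas)
print(np.allclose(cm_numpy, cm_pandas.values))
```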

Code:

class LinearRegressionGD(object):

    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta          # learning rate
        self.n_iter = n_iter    # number of passes over the data

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)

X = df[['Age']].values
y = df['Pregnancies'].values

from sklearn.preprocessing import StandardScaler

sc_x = StandardScaler()
sc_y = StandardScaler()
X_std = sc_x.fit_transform(X)
# StandardScaler expects a 2-D array, so reshape y and flatten the result
y_std = sc_y.fit_transform(y.reshape(-1, 1)).flatten()

lr = LinearRegressionGD()
lr.fit(X_std, y_std)

plt.plot(range(1, lr.n_iter+1), lr.cost_)
plt.ylabel('SSE')
plt.xlabel('Epoch')
plt.show()
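
One way to check a gradient descent fit is to compare it with a value known in closed form. When both variables are standardized, the least-squares slope equals the Pearson correlation coefficient. The sketch below repeats the update rule of `LinearRegressionGD.fit` for a single feature on synthetic data (an assumption for illustration, not the diabetes data) and uses more iterations than the default so that the weights settle.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the diabetes data)
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(size=200)

# Standardize both variables to zero mean and unit variance
x_std = (x - x.mean()) / x.std()
y_std = (y - y.mean()) / y.std()

# Same update rule as LinearRegressionGD.fit, written out for one feature
w0, w1 = 0.0, 0.0
eta, n_iter = 0.001, 200
for _ in range(n_iter):
    errors = y_std - (w1 * x_std + w0)
    w1 += eta * (x_std * errors).sum()
    w0 += eta * errors.sum()

# For standardized data the least-squares slope equals the Pearson correlation
r = np.corrcoef(x, y)[0, 1]
print("gradient descent slope:", w1)
print("correlation coefficient:", r)
```

Because the update sums over all samples, the effective step size grows with the number of rows, so `eta` may need to be reduced for larger data sets.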

Code:

def lin_regplot(X, y, model):
    # scatter of the data with the fitted regression line on top
    plt.scatter(X, y, c='blue')
    plt.plot(X, model.predict(X), color='red')
    return None

lin_regplot(X_std, y_std, lr)
plt.xlabel('Age (standardized)')
plt.ylabel('Pregnancies (standardized)')
plt.show()

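Code:

# transform expects a 2-D array: one sample with one feature
age_std = sc_x.transform([[20]])
pregnancy_std = lr.predict(age_std)

# convert the standardized prediction back to the original scale
print("Pregnancy: %.3f" % sc_y.inverse_transform(pregnancy_std.reshape(-1, 1))[0, 0])
print('Slope: %.3f' % lr.w_[1])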

c. Multiple Regression analysis:

Code:

from sklearn.model_selection import train_test_split

# The source does not define X and Y; the columns below are an assumed choice
X = df[['Glucose', 'BMI', 'Age']]
Y = df['Pregnancies']

train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.3, random_state=99)
train_x.shape, train_y.shape

# scikit-learn has no MultipleRegression class; LinearRegression fits several predictors
from sklearn.linear_model import LinearRegression

le = LinearRegression()
le.fit(train_x, train_y)
y_pred = le.predict(test_x)
y_pred

result = pd.DataFrame({'Actual': test_y, 'Predict': y_pred})
result

Code:

print('coefficient', le.coef_)

print('intercept', le.intercept_)
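
The coefficients and intercept of a multiple regression are the least-squares solution of a linear system. The sketch below recovers them with NumPy's `np.linalg.lstsq` instead of scikit-learn, on synthetic data with known coefficients (illustrative values only); the names `design`, `true_coef` and `true_intercept` are assumptions.

```python
import numpy as np

# Synthetic predictors and target with known coefficients (illustrative only)
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
true_coef = np.array([1.5, -2.0, 0.5])
true_intercept = 4.0
Y = X @ true_coef + true_intercept + 0.1 * rng.normal(size=300)

# Append a column of ones so the last fitted weight is the intercept
design = np.column_stack([X, np.ones(len(X))])
weights, *_ = np.linalg.lstsq(design, Y, rcond=None)

coef, intercept = weights[:-1], weights[-1]
print("coefficient", coef)
print("intercept", intercept)
```

The fitted values land close to the coefficients used to generate the data, which is the behaviour to expect from `le.coef_` and `le.intercept_` as well.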

b. Also compare the results of the above analysis for the two data sets

Installing datacompy

pip install datacompy

Details :

datacompy takes two dataframes as input and gives us a human-readable report containing statistics that lets us
know the similarities and dissimilarities between the two dataframes. It will try to join two dataframes either on a
list of join columns, or on indexes.

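Code:

import datacompy

# df1 and df2 are the two diabetes data frames, loaded beforehand;
# join_columns must name a column that is present in both of them
compare = datacompy.Compare(df1, df2, join_columns='acct_id', abs_tol=0.0001,
                            rel_tol=0, df1_name='olddiabetes', df2_name='newdiabetes')
print(compare.report())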

OUTPUT:
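
When datacompy is not available, a rough comparison can be made with pandas alone. The sketch below is an alternative, not the datacompy report: it joins two small made-up frames on a shared key with `merge(..., indicator=True)` and lists the rows whose values differ. The frames, the `acct_id` key and the `Glucose` values are illustrative.

```python
import pandas as pd

# Two small made-up frames sharing a key column (names are illustrative)
df1 = pd.DataFrame({"acct_id": [1, 2, 3], "Glucose": [148, 85, 183]})
df2 = pd.DataFrame({"acct_id": [2, 3, 4], "Glucose": [85, 180, 89]})

# An outer join with indicator=True labels rows as left_only, right_only or both
merged = df1.merge(df2, on="acct_id", how="outer",
                   suffixes=("_old", "_new"), indicator=True)
print(merged)

# Among rows present in both frames, flag values that differ
both = merged[merged["_merge"] == "both"]
changed = both[both["Glucose_old"] != both["Glucose_new"]]
print(changed[["acct_id", "Glucose_old", "Glucose_new"]])
```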

Ex. No. 6 - Apply and explore various plotting functions on UCI data sets

a. Normal curves

Code:

import numpy as np
import matplotlib.pyplot as plt

# Creating a series of data in the range 1-50.
x = np.linspace(1, 50, 200)

# Creating a function for the normal probability density.
# The normalizing factor is 1 / (sd * sqrt(2 * pi)).
def normal_dist(x, mean, sd):
    prob_density = 1 / (sd * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((x - mean) / sd)**2)
    return prob_density

# Calculate mean and standard deviation.
mean = np.mean(x)
sd = np.std(x)

# Apply function to the data.
pdf = normal_dist(x, mean, sd)

# Plotting the results
plt.plot(x, pdf, color='red')
plt.xlabel('Data points')
plt.ylabel('Probability Density')
plt.show()
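
A normal density should enclose a total area of 1, which gives a quick numerical check on any implementation of `normal_dist`. The sketch below uses assumed values `mean = 25` and `sd = 5`, covers six standard deviations on each side, and integrates with a hand-written trapezoidal rule.

```python
import numpy as np

def normal_dist(x, mean, sd):
    """Normal probability density with the given mean and standard deviation."""
    return 1 / (sd * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((x - mean) / sd) ** 2)

mean, sd = 25.0, 5.0
# Cover six standard deviations on each side so almost all the mass is included
x = np.linspace(mean - 6 * sd, mean + 6 * sd, 2001)
pdf = normal_dist(x, mean, sd)

# Trapezoidal rule written out by hand
area = np.sum((pdf[1:] + pdf[:-1]) / 2 * np.diff(x))
print("area under the curve:", area)
print("peak density:", pdf.max(), "at x =", x[pdf.argmax()])
```

If the factor in front of the exponential is wrong, the area moves away from 1 even though the curve keeps its bell shape.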

b. Density and contour plots

Code:

import matplotlib.pyplot as plt
import numpy as np

feature_x = np.arange(0, 50, 2)
feature_y = np.arange(0, 50, 3)

# Creating 2-D grid of features
[X, Y] = np.meshgrid(feature_x, feature_y)

fig, ax = plt.subplots(1, 1)
Z = np.cos(X / 2) + np.sin(Y / 4)

# plots contour lines
ax.contour(X, Y, Z)
ax.set_title('Contour Plot')
ax.set_xlabel('feature_x')
ax.set_ylabel('feature_y')
plt.show()
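
The contour call depends on `np.meshgrid` turning the two 1-D feature vectors into matching 2-D grids. The sketch below, using the same `feature_x` and `feature_y`, prints the shapes involved so the layout of `X`, `Y` and `Z` is easier to see.

```python
import numpy as np

feature_x = np.arange(0, 50, 2)   # 25 values: 0, 2, ..., 48
feature_y = np.arange(0, 50, 3)   # 17 values: 0, 3, ..., 48

X, Y = np.meshgrid(feature_x, feature_y)

# Rows follow feature_y and columns follow feature_x
print(X.shape, Y.shape)   # (17, 25) (17, 25)

# Every row of X repeats feature_x; every column of Y repeats feature_y
print(np.array_equal(X[0], feature_x))      # True
print(np.array_equal(Y[:, 0], feature_y))   # True

# Z is evaluated at every grid point, so it has the same shape
Z = np.cos(X / 2) + np.sin(Y / 4)
print(Z.shape)            # (17, 25)
```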

c. Correlation and scatter plots

Code:

import pandas as pd

con = pd.read_csv('concrete.csv')
con
list(con.columns)
con.head()

con['cement'] = con['cement'].astype('category')
con.describe(include='category')

import seaborn as sns

sns.scatterplot(x="water", y="coarseagg", data=con);

ax = sns.scatterplot(x="water", y="coarseagg", data=con)
# title and axis label match the plotted columns
ax.set_title("Coarse aggregate vs. water")
ax.set_xlabel("water");

# scatter plot with a fitted regression line
sns.lmplot(x="water", y="coarseagg", data=con);

d. Histograms:
Creating a Histogram

Code:

from matplotlib import pyplot as plt
import numpy as np

# Creating dataset
a = np.array([22, 87, 5, 43, 56,
              73, 55, 54, 11,
              20, 51, 5, 79, 31,
              27])

# Creating histogram
fig, ax = plt.subplots(figsize=(10, 7))
ax.hist(a, bins=[0, 25, 50, 75, 100])

# Show plot
plt.show()
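
The bar heights of this histogram can be checked without drawing anything. `np.histogram` returns the counts per bin for the same data and the same bin edges; the sketch below prints them, and the names `counts` and `edges` are introduced here for illustration.

```python
import numpy as np

a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])

# np.histogram returns the number of values falling in each bin
counts, edges = np.histogram(a, bins=[0, 25, 50, 75, 100])

print(counts.tolist())   # [5, 3, 5, 2]
print(int(counts.sum())) # 15, one count per data point

# pair each count with the bin it belongs to
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(left, "to", right, ":", n)
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.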

Code:

import matplotlib.pyplot as plt
import numpy as np
from matplotlib import colors
from matplotlib.ticker import PercentFormatter

# Creating dataset
np.random.seed(23685752)
N_points = 10000
n_bins = 20

# Creating distribution
x = np.random.randn(N_points)
y = .8 ** x + np.random.randn(10000) + 25

# Creating histogram
fig, axs = plt.subplots(1, 1, figsize=(10, 7), tight_layout=True)
axs.hist(x, bins=n_bins)

# Show plot
plt.show()

e. Three dimensional plotting

Code:

from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()

# syntax for 3-D projection
ax = plt.axes(projection='3d')

# defining axes
z = np.linspace(0, 1, 100)
x = z * np.sin(25 * z)
y = z * np.cos(25 * z)
c = x + y
ax.scatter(x, y, z, c=c)

# syntax for plotting
ax.set_title('3d Scatter plot')
plt.show()

Ex. No. 7 - Visualizing Geographic Data with Basemap

Code:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);

fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            width=8E6, height=8E6,
            lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 12))
m = Basemap()
m.drawcoastlines()
m.drawcoastlines(linewidth=1.0, linestyle='dashed', color='red')
plt.title("Coastlines", fontsize=20)
plt.show()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import shapefile as shp
from shapely.geometry import Point

sns.set_style('whitegrid')

fp = r'Maps_with_python\india-polygon.shp'
map_df = gpd.read_file(fp)
map_df_copy = gpd.read_file(fp)

# a GeoDataFrame is drawn with its own plot() method rather than plt.plot()
map_df.plot()
plt.show()