OCS353 - Data Science Manual
Anna University
Regulation - 2021
CHENNAI – 600097
Bonafide Certificate
REG. No……………..…………………………….
Submitted for the Anna University practical examination held at KCG COLLEGE
during the period from July 2024 to Nov 2024 of the academic year 2024 - 2025.
PEO 2: Devise, implement and deploy software solutions for computational problems.
PEO 3: Build software solutions for the challenging problems in industry and research.
PO12: Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.
After successful completion of the B.Tech (IT) programme, the graduates will be able to:
PSO 3: Identify the need for sustainable development in software industries and follow the professional code of ethics.
EX.NO: 1
DATE:
Installing Python Packages for Data Analytics
AIM
To download, install and explore the features of Python packages for data analytics.
PANDAS
Pandas provides functions for visualizing and manipulating data tables. Many of its
functions allow efficient manipulation of data for the preliminary steps of data analysis problems.
Output:
Statsmodels
Statsmodels is a package for exploring data, estimating statistical models, and
performing statistical tests. It includes descriptive statistics, statistical tests, plotting
functions, and result statistics.
Output:
Scipy:
SciPy is a general-purpose package for mathematics, science, and engineering and
extends the base capabilities of NumPy.
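Once the packages have been downloaded (for example with pip install numpy scipy pandas statsmodels), their availability can be confirmed with a short check like the sketch below; it assumes only the standard package names.
Code:
# minimal availability check for the installed packages
import numpy
import scipy
import pandas
import statsmodels

for pkg in (numpy, scipy, pandas, statsmodels):
    # print each package name together with its installed version
    print(pkg.__name__, pkg.__version__)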
Output:
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus the Python packages for data analytics were installed successfully.
EX.NO: 2
DATE:
Working with Numpy arrays
AIM
To work with NumPy arrays.
NumPy stands for Numerical Python. It is a Python library used for working with
arrays. In Python, lists serve the purpose of arrays, but they are slow to process.
The NumPy array is a powerful N-dimensional array object with uses in linear algebra,
Fourier transforms, and random number capabilities. It provides an array object much
faster than traditional Python lists.
Types of Array:
1. One Dimensional Array
2. Multi-Dimensional Array
Example: One Dimensional Array
ALGORITHM
Step1: Start
Step2: Import numpy module
Step3: Print the basic characteristics and operations of the array
Step4: Stop
PROGRAM
# importing numpy module
import numpy as np
# creating a list
list1 = [1, 2, 3, 4]
# creating a numpy array from the list
sample_array = np.array(list1)
print("List in python : ", list1)
print("Numpy Array in python :", sample_array)
Multi-Dimensional Array:
PROGRAM
# importing numpy module
import numpy as np
# creating the lists (illustrative values; the original lines defining them are not shown)
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
# creating a multi-dimensional numpy array from the three lists
sample_array = np.array([list_1, list_2, list_3])
print("Numpy multi dimensional array in python\n", sample_array)
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus working with NumPy arrays was successfully completed.
EX.NO: 3 Create a Dataframe Using Pandas with
DATE: a List of Elements
AIM:
To create a DataFrame using Pandas from a single list or a list of lists, and to locate
rows using named indexes.
ALGORITHM
Step1: Start
Step2: import numpy and pandas module
Step3: Create a dataframe using the dictionary
Step4: Print the output
Step5: Stop
PROGRAM
A DataFrame can be created using a single list.
import pandas as pd
lst = ['python', 'For', 'first', 'year', 'students', 'interesting', 'programs']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
Output:
0
0 python
1 For
2 first
3 year
4 students
5 interesting
6 programs
CODE 2: Locate Row:
import pandas as pd
lst = ['python', 'For', 'first', 'year', 'students', 'interesting', 'programs']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
print(df.loc[0])
Output:
0
0 python
1 For
2 first
3 year
4 students
5 interesting
6 programs
0 python
Name: 0, dtype: object
CODE 3: Named Indexes:
import pandas as pd
lst = {"list1": ['python', 'For', 'first'], "list2": ['students', 'interesting', 'programs']}
# creating the DataFrame with named indexes and locating the row named "day2"
# (these two lines are completed to match the output shown below)
df = pd.DataFrame(lst, index=["day1", "day2", "day3"])
print(df)
print(df.loc["day2"])
OUTPUT:
list1 list2
day1 python students
day2 For interesting
day3 first programs
list1 For
list2 interesting
Name: day2, dtype: object
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus working with Pandas DataFrames was successfully completed.
EX.NO: 4
DATE: Basic plots using Matplotlib
AIM
To draw basic plots in Python program using Matplotlib
ALGORITHM
Step1: Start
Step2: import Matplotlib module
Step3: Create a Basic plots using Matplotlib
Step4: Print the output
Step5: Stop
Program: 3a
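The listing for Program 3a is not reproduced in this record; a minimal line plot consistent with the aim of this experiment is sketched below (the data values are illustrative assumptions).
Code:
import matplotlib.pyplot as plt
# illustrative data (assumed; the original listing is not shown)
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]
# drawing a basic line plot
plt.plot(x, y)
plt.xlabel('x - axis')
plt.ylabel('y - axis')
plt.title('Basic Line Plot')
plt.show()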
Output:
Program: 3b
import matplotlib.pyplot as plt
a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(a)
# o is for circles and r is
# for red
plt.plot(b, "or")
plt.plot(list(range(0, 22, 3)))
# naming the x-axis
plt.xlabel('Day ->')
# naming the y-axis
plt.ylabel('Temp ->')
c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label = '4th Rep')
# get current axes command
ax = plt.gca()
# get command over the individual
# boundary line of the graph body
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# set the range or the bounds of
# the left boundary line to a fixed range
ax.spines['left'].set_bounds(-3, 40)
# set the interval by which
# the x-axis sets the marks
plt.xticks(list(range(-3, 10)))
# set the intervals by which y-axis
# set the marks
plt.yticks(list(range(-3, 20, 3)))
# show the legend and render the plot
plt.legend()
plt.show()
Output:
Program: 3c
Output:
CODE 3d:
from matplotlib import pyplot as plt
plt.bar([0.25, 1.25, 2.25, 3.25, 4.25], [50, 40, 70, 80, 20], label="BMW", width=.5)
plt.bar([.75, 1.75, 2.75, 3.75, 4.75], [80, 20, 20, 50, 60], label="Audi", color='r', width=.5)
plt.legend()
plt.xlabel('Days')
plt.ylabel('Distance (kms)')
plt.title('Information')
plt.show()
OUTPUT:
CODE 3e:
HISTOGRAM
import matplotlib.pyplot as plt
population_age = [22, 55, 62, 45, 21, 22, 34, 42, 42, 4, 2, 102, 95, 85, 55, 110, 120, 70, 65, 55, 111, 115, 80, 75, 65, 54, 44, 43, 42, 48]
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
plt.hist(population_age, bins, histtype='bar', rwidth=0.8)
plt.xlabel('age groups')
plt.ylabel('Number of people')
plt.title('Histogram')
plt.show()
CODE 3f:
AREA PLOT
import matplotlib.pyplot as plt
days = [1,2,3,4,5]
sleeping =[7,8,6,11,7]
eating = [2,3,4,3,2]
working =[7,8,7,2,2]
playing = [8,5,7,8,13]
plt.plot([],[],color='m', label='Sleeping', linewidth=5)
plt.plot([],[],color='c', label='Eating', linewidth=5)
plt.plot([],[],color='r', label='Working', linewidth=5)
plt.plot([],[],color='k', label='Playing', linewidth=5)
plt.stackplot(days, sleeping,eating,working,playing, colors=['m','c','r','k'])
plt.xlabel('x')
plt.ylabel('y')
plt.title('Stack Plot')
plt.legend()
plt.show()
OUTPUT:
CODE 3g: PIE CHART
import matplotlib.pyplot as plt
days = [1, 2, 3, 4, 5]
sleeping = [7, 8, 6, 11, 7]
eating = [2, 3, 4, 3, 2]
working = [7, 8, 7, 2, 2]
playing = [8, 5, 7, 8, 13]
slices = [7, 2, 2, 13]
activities = ['sleeping', 'eating', 'working', 'playing']
cols = ['c', 'm', 'r', 'b']
plt.pie(slices, labels=activities, colors=cols, startangle=90, shadow=True, explode=(0, 0.1, 0, 0), autopct='%1.1f%%')
plt.title('Pie Plot')
plt.show()
OUTPUT
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus the basic plots using Matplotlib in Python were successfully completed.
EX.NO: 5 a Statistical and Probability measures
DATE: (Frequency Distributions)
AIM
To count the frequency of occurrence of each word in a body of text, as is often needed
during text processing.
ALGORITHM
Step1: Start
Step2: Import word_tokenize and the gutenberg corpus from NLTK
Step3: Load the sample text and tokenize it into words
Step4: Count the frequency of each word among the first 50 tokens
Step5: Print the word-frequency pairs
Step6: Stop
Program:
from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")
token = word_tokenize(sample)
wlist = []
# collect the first 50 tokens
for i in range(50):
    wlist.append(token[i])
# count how often each word occurs within the collected tokens
wordfreq = [wlist.count(w) for w in wlist]
# list() is needed in Python 3 so the pairs are printed, not a zip object
print("Pairs\n" + str(list(zip(wlist, wordfreq))))
Output:
[('[', 1), ('Poems', 1), ('by', 1), ('William', 1), ('Blake', 1), ('1789', 1), (']', 1), ('SONGS', 2), ('OF', 3),
('INNOCENCE', 2), ('AND', 1), ('OF', 3), ('EXPERIENCE', 1), ('and', 1), ('THE', 1), ('BOOK', 1),
('of', 2), ('THEL', 1), ('SONGS', 2), ('OF', 3), ('INNOCENCE', 2), ('INTRODUCTION', 1),
('Piping', 2), ('down', 1), ('the', 1), ('valleys', 1), ('wild', 1), (',', 3), ('Piping', 2), ('songs', 1), ('of', 2),
('pleasant', 1), ('glee', 1), (',', 3), ('On', 1), ('a', 2), ('cloud', 1), ('I', 1), ('saw', 1), ('a', 2), ('child', 1),
(',', 3), ('And', 1), ('he', 1), ('laughing', 1), ('said', 1), ('to', 1), ('me', 1), (':', 1), ('``', 1)]
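The result below also mentions a conditional frequency distribution; a minimal sketch using NLTK's ConditionalFreqDist over the same sample (conditioning each word on its first letter is an illustrative choice) could look like this:
Code:
from nltk.probability import ConditionalFreqDist
from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")
token = word_tokenize(sample)
# condition each word on its lowercased first character (illustrative)
cfd = ConditionalFreqDist((w[0].lower(), w.lower()) for w in token[:50] if w.isalpha())
# show the most common words beginning with 'p'
print(cfd['p'].most_common())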
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus the program to count the frequency of occurrence of a word in a body of text,
together with the conditional frequency distribution, was successfully completed
using Python.
EX.NO: 5 b Statistical and Probability measures
DATE: (Averages)
AIM
To compute weighted averages in Python, either by defining your own functions or by
using NumPy.
ALGORITHM
Step1: Start
Step2: Import the numpy module
Step3: Define the values and their corresponding weights
Step4: Compute the weighted average using np.average() with the weights argument
Step5: Print the result
Step6: Stop
Program:
# Method: using the NumPy average() function
import numpy as np
# illustrative values and weights (the dataset used to produce the
# recorded output below is not shown in this record)
sales = np.array([40000, 45000, 50000])
weights = np.array([0.3, 0.4, 0.3])
weighted_avg_m3 = round(np.average(sales, weights=weights), 2)
print(weighted_avg_m3)
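The aim also mentions computing the weighted average by defining your own function; a minimal sketch in plain Python, using the same illustrative arrays as above:
Code:
# Method: user-defined weighted average function
def weighted_average(values, weights):
    # sum of value*weight divided by the sum of the weights
    return round(sum(v * w for v, w in zip(values, weights)) / sum(weights), 2)

print(weighted_average([40000, 45000, 50000], [0.3, 0.4, 0.3]))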
Output:
44225.35
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus the computation of weighted averages in Python, both by defining our own
function and by using NumPy, was successfully completed.
EX.NO: 5 c Statistical and Probability measures
DATE: (Variability)
AIM
To write a python program to calculate the variance.
ALGORITHM
Step1: Start
Step2: Import variance from the statistics module and Fraction from the fractions module
Step3: Create tuples of sample values of varying data types
Step4: Compute the variance of each sample using variance()
Step5: Print the results
Step6: Stop
Program
# Python code to demonstrate variance()
# function on varying range of data-types
# importing variance from the statistics module
from statistics import variance
# importing Fraction as parameter values from the fractions module
from fractions import Fraction as fr
# tuple of a set of positive integers
# numbers are spread apart but not very much
sample1 = (1, 2, 5, 4, 8, 9, 12)
# tuple of a set of negative integers
sample2 = (-2, -4, -3, -1, -5, -6)
# tuple of a set of positive and negative numbers
# data-points are spread apart considerably
sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)
# tuple of a set of fractional numbers
sample4 = (fr(1, 2), fr(2, 3), fr(3, 4), fr(5, 6), fr(7, 8))
# tuple of a set of floating point values
sample5 = (1.23, 1.45, 2.1, 2.2, 1.9)
# Print the variance of each sample
print("Variance of Sample1 is %s" % (variance(sample1)))
print("Variance of Sample2 is %s" % (variance(sample2)))
print("Variance of Sample3 is %s" % (variance(sample3)))
print("Variance of Sample4 is %s" % (variance(sample4)))
print("Variance of Sample5 is %s" % (variance(sample5)))
Output:
Variance of Sample1 is 15.80952380952381
Variance of Sample2 is 3.5
Variance of Sample3 is 61.125
Variance of Sample4 is 1/45
Variance of Sample5 is 0.17613
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus the computation for variance was successfully completed.
EX.NO: 5 d Statistical and Probability measures
DATE: (Normal Curve)
AIM:
To create a normal curve using python program.
ALGORITHM
Step1: Start
Step2: Import norm from scipy.stats, along with numpy, matplotlib and seaborn
Step3: Create a range of data points
Step4: Compute the normal probability density over the data points
Step5: Plot the normal curve
Step6: Stop
Program:
# import required libraries
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
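The remainder of the listing is not reproduced in this record; a minimal completion that plots the standard normal curve with the libraries imported above (the data range is an assumption) is:
Code:
# a range of data points over which to evaluate the curve (assumed)
data = np.arange(-5, 5, 0.01)
# probability density of the standard normal distribution
pdf = norm.pdf(data, loc=0, scale=1)
sb.set_style('whitegrid')
plt.plot(data, pdf, color='red')
plt.xlabel('Data points')
plt.ylabel('Probability Density')
plt.title('Normal Curve')
plt.show()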
Output:
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus the normal curve using python program was successfully completed.
EX.NO: 5 e Statistical and Probability measures
DATE: (Correlation and scatter plots)
AIM
To write a python program for correlation with scatter plot.
ALGORITHM
Step1: Start
Step2: Import numpy and matplotlib
Step3: Create the data series x, y1, y2 and y3
Step4: Draw a scatter plot of x against each series, labelling each with its correlation coefficient from np.corrcoef()
Step5: Display the title, legend and plot
Step6: Stop
Program:
import numpy as np
import matplotlib.pyplot as plt

# illustrative data series (the original data-generation lines are not shown)
x = np.arange(20)
y1 = x + np.random.normal(0, 2, 20)
y2 = 100 - x * 4 + np.random.normal(0, 8, 20)
y3 = np.random.normal(50, 10, 20)

# Plot
plt.rcParams.update({'figure.figsize': (10, 8), 'figure.dpi': 100})
plt.scatter(x, y1, label=f'y1, Correlation = {np.round(np.corrcoef(x, y1)[0, 1], 2)}')
plt.scatter(x, y2, label=f'y2, Correlation = {np.round(np.corrcoef(x, y2)[0, 1], 2)}')
plt.scatter(x, y3, label=f'y3, Correlation = {np.round(np.corrcoef(x, y3)[0, 1], 2)}')
# Plot
plt.title('Scatterplot and Correlations')
plt.legend()
plt.show()
Output
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus the correlation and scatter plots using a Python program were successfully
completed.
EX.NO: 5 f Statistical and Probability measures
DATE: (Correlation coefficient)
AIM:
To write a python program to compute correlation coefficient.
ALGORITHM
Step1: Start
Step2: Import the math module and define the correlation coefficient function
Step3: Compute the sums of X, Y and X*Y, and the squared sums of X and Y
Step4: Apply the correlation coefficient formula
Step5: Print the result
Step6: Stop
Program:
import math

# function to compute the correlation coefficient
# (the lines above 'i = i + 1' are missing from the record and are
# completed here from the formula below)
def correlation_coefficient(X, Y):
    n = len(X)
    sum_X = sum_Y = sum_XY = squareSum_X = squareSum_Y = 0
    i = 0
    while i < n:
        # accumulate the sums needed by the formula
        sum_X += X[i]
        sum_Y += Y[i]
        sum_XY += X[i] * Y[i]
        squareSum_X += X[i] * X[i]
        squareSum_Y += Y[i] * Y[i]
        i = i + 1
    # use formula for calculating correlation coefficient
    corr = (float)(n * sum_XY - sum_X * sum_Y) / (float)(math.sqrt((n * squareSum_X - sum_X * sum_X) * (n * squareSum_Y - sum_Y * sum_Y)))
    return corr

# Driver function
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
print('%.6f' % correlation_coefficient(X, Y))
Output :
0.953463
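As a cross-check, the same value can be obtained from NumPy's built-in corrcoef function (a short sketch using the same X and Y):
Code:
import numpy as np
# off-diagonal entry of the 2x2 correlation matrix
print(np.corrcoef([15, 18, 21, 24, 27], [25, 25, 27, 31, 32])[0, 1])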
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus the computation for correlation coefficient was successfully completed.
EX.NO: 5 g Statistical and Probability measures
DATE: (Simple Linear Regression)
AIM
To write a python program for Simple Linear Regression.
ALGORITHM
Step 1: Start the Program
Step 2: Import numpy and matplotlib package
Step 3: Define coefficient function
Step 4: Calculate cross-deviation and deviation about x
Step 5: Calculate regression coefficients
Step 6: Plot the Linear regression and define main function
Step 7: Print the result
Step 8: Stop the process
Program:
import numpy as np
import matplotlib.pyplot as plt

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # plotting the predicted regression line
    plt.plot(x, b[0] + b[1] * x, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting the regression line
    plot_regression_line(x, y, b)
Output:
Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
Graph:
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus the computation for Simple Linear Regression was successfully completed.
EX.NO: 6 a Univariate Analysis
Frequency, Mean, Median, Mode, Variance,
DATE:
Standard Deviation, Skewness and Kurtosis.
AIM
To use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis.
DESCRIPTION:
This dataset is originally from the National Institute of Diabetes and Digestive and
Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a
patient has diabetes, based on certain diagnostic measurements included in the dataset.
Several constraints were placed on the selection of these instances from a larger database. In
particular, all patients here are females at least 21 years old of Pima Indian heritage. The
dataset consists of several medical predictor (independent) variables and one target
(dependent) variable, Outcome. Predictor variables include the number of pregnancies the
patient has had, their BMI, insulin level, age, and so on.
Descriptive Statistics is the building block of data science. Advanced analytics is often
incomplete without analyzing descriptive statistics of the key metrics. In simple terms,
descriptive statistics can be defined as the measures that summarize a given data, and these
measures can be broken down further into the measures of central tendency and the measures
of dispersion.
Measures of central tendency include mean, median, and the mode, while the measures of
variability include standard deviation, variance, and the interquartile range. In this guide, you
will learn how to compute these measures of descriptive statistics and use them to interpret
the data.
Mode
Standard Deviation
Variance
Interquartile Range
Skewness
Data
In this guide, we will be using fictitious data of loan applicants containing 600 observations and 10
variables, as described below:
1. Marital_status: Whether the applicant is married ("Yes") or not ("No").
2. Dependents: Number of dependents of the applicant.
3. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").
4. Income: Annual Income of the applicant (in USD).
5. Loan_amount: Loan amount (in USD) for which the application was submitted.
6. Term_months: Tenure of the loan (in months).
7. Credit_score: Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").
8. Age: The applicant’s age in years.
9. Sex: Whether the applicant is female (F) or male (M).
10. approval_status: Whether the loan application was approved ("Yes") or not ("No").
Code:
import pandas as pd
import numpy as np
import statistics as st
# Load the data
df = pd.read_csv("diabetes.csv")
print(df.shape)
print(df.info())
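The experiment title also lists frequency; a quick way to get frequency counts of a variable is the value_counts() function (a sketch, assuming the Outcome column of the diabetes data):
Code:
# frequency of each class of the target variable
print(df['Outcome'].value_counts())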
Measures of Central Tendency
Measures of central tendency describe the center of the data, and are often represented by the mean,
the median, and the mode.
Mean
Mean represents the arithmetic average of the data. The line of code below prints the mean of
the numerical variables in the data. From the output, we can infer that the average age of the applicant
is 49 years, the average annual income is USD 705,541, and the average tenure of loans is 183
months. The command df.mean(axis = 0) will also give the same output.
Code:
df.mean()
Code:
print(df.loc[:,'Age'].mean())
print(df.loc[:,'Income'].mean())
Median
In simple terms, median represents the 50th percentile, or the middle value of the data, that separates the
distribution into two halves. The line of code below prints the median of the numerical variables
in the data. The command df.median(axis = 0) will also give the same output.
Code:
df.median()
Mode
Mode represents the most frequent value of a variable in the data.
This is the only central tendency measure that can be used with categorical variables, unlike the
mean and the median which can be used only with quantitative data.
The line of code below prints the mode of all the variables in the data.
The .mode() function returns the most common value or most repeated value of a variable.
The command df.mode(axis = 0) will also give the same output.
Code:
df.mode()
Standard Deviation
Standard deviation is a measure that is used to quantify the amount of variation of a set of data
values from its mean. A low standard deviation for a variable indicates that the data points tend to be
close to its mean, and vice versa. The line of code below prints the standard deviation of all the
numerical variables in the data.
Code:
df.std()
Variance
Variance is another measure of dispersion. It is the square of the standard deviation and the covariance
of the random variable with itself. The line of code below prints the variance of all the numerical
variables in the dataset. The interpretation of the variance is similar to that of the standard deviation.
Code:
df.var()
Interquartile Range
The interquartile range (IQR) is the difference between the 75th and the 25th percentiles of the
data, and is another measure of dispersion. The lines of code below compute the IQR of the Age
variable using SciPy.
Code:
from scipy.stats import iqr
iqr(df['Age'])
Skewness
Another useful statistic is skewness, which is the measure of the symmetry, or lack of it, for a real-valued
random variable about its mean. The skewness value can be positive, negative, or undefined. In
a perfectly symmetrical distribution, the mean, the median, and the mode will all have the same value.
However, the variables in our data are not symmetrical, resulting in different values of the
central tendency. We can calculate the skewness of the numerical variables using the skew() function, as
shown below.
Code:
print(df.skew())
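The experiment title also lists kurtosis, which measures the heaviness of the tails of a distribution; pandas exposes it through the kurt() function (a short sketch):
Code:
# kurtosis of the numerical variables
print(df.kurt())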
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus the univariate analysis was performed on the given dataset successfully.
EX.NO: 6 b Bivariate Analysis -
DATE: Linear and logistic regression modelling.
AIM
To use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the Bivariate analysis linear and logistic regression modelling.
DESCRIPTION:
This dataset is originally from the National Institute of Diabetes and Digestive and
Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a
patient has diabetes, based on certain diagnostic measurements included in the dataset.
Several constraints were placed on the selection of these instances from a larger database. In
particular, all patients here are females at least 21 years old of Pima Indian heritage. The
dataset consists of several medical predictor (independent) variables and one target
(dependent) variable, Outcome. Predictor variables include the number of pregnancies the
patient has had, their BMI, insulin level, age, and so on.
Descriptive Statistics is the building block of data science. Advanced analytics is often
incomplete without analyzing descriptive statistics of the key metrics. In simple terms,
descriptive statistics can be defined as the measures that summarize a given data, and these
measures can be broken down further into the measures of central tendency and the measures
of dispersion.
Bivariate Regression Analysis is a type of statistical analysis that can be used during the
analysis and reporting stage of quantitative market research. It is often considered the simplest
form of regression analysis, and is also known as Ordinary Least-Squares regression or linear
regression.
Code:
import pandas as pd
df = pd.read_csv('diabetes.csv')
df.head()
Code:
import matplotlib.pyplot as plt
import seaborn as sns
# only the tail of the styling call survives in this record; sns.set is assumed
sns.set(style='whitegrid', context='notebook')
cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
# pairwise scatter plots of the selected columns (the plotting step is assumed)
sns.pairplot(df[cols], height=2.5)
plt.show()
Code:
import numpy as np
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.5)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 15}, yticklabels=cols, xticklabels=cols)
plt.show()
Code:
class LinearRegressionGD(object):
    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors ** 2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)

X = df[['Age']].values
y = df['Pregnancies'].values
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
X_std = sc_x.fit_transform(X)
# StandardScaler expects a 2-D array, so y is reshaped before scaling
y_std = sc_y.fit_transform(y[:, np.newaxis]).flatten()
lr = LinearRegressionGD()
lr.fit(X_std, y_std)
plt.plot(range(1, lr.n_iter + 1), lr.cost_)
plt.ylabel('SSE')
plt.xlabel('Epoch')
plt.show()
Code:
def lin_regplot(X, y, model):
    plt.scatter(X, y, c='blue')
    plt.plot(X, model.predict(X), color='red')
    return None

lin_regplot(X_std, y_std, lr)
plt.xlabel('Age (standardized)')
plt.ylabel('Pregnancies (standardized)')
plt.show()
Code:
# transform expects a 2-D array
age_std = sc_x.transform(np.array([[20.0]]))
pregnancy_std = lr.predict(age_std)
print("Pregnancy: %.3f" % sc_y.inverse_transform(pregnancy_std.reshape(-1, 1))[0][0])
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus the bivariate analysis was performed on the given dataset successfully.
EX.NO: 7 Apply and explore various plotting functions on any
DATE: data set
AIM
To apply and explore various plotting functions on a given data set.
PROCEDURE:
To load and quickly visualize the Multiple Features Dataset [1] from the UCI
repository, which is available in mvlearn. This dataset can be a good tool for analyzing
the effectiveness of multiview algorithms. It contains 6 views of handwritten digit images, thus
allowing for analysis of multiview algorithms in multiclass or unsupervised tasks.
a. Normal curves
A probability distribution is a statistical function that describes the likelihood of obtaining the
possible values that a random variable can take. By this, we mean the range of values that a
parameter can take when we randomly pick values from it. Suppose we were asked to pick one
adult at random and asked what his or her height would be (assuming gender does not affect
height): there is no way to know the height in advance, but if we have the distribution of heights
of adults in the city, we can bet on the most probable outcome. A Normal Distribution is also
known as a Gaussian distribution or, famously, the Bell Curve. People use these terms
interchangeably; they all mean the same thing. It is a continuous probability distribution.
Code:
import numpy as np
import matplotlib.pyplot as plt
# Creating a series of data in the range of 1-50.
x = np.linspace(1,50,200)
#Creating a Function.
def normal_dist(x, mean, sd):
    # normal probability density function
    prob_density = (1 / (sd * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / sd) ** 2)
    return prob_density
#Calculate mean and Standard deviation.
mean = np.mean(x)
sd = np.std(x)
#Apply function to the data.
pdf = normal_dist(x,mean,sd)
# Plotting the results
plt.plot(x, pdf, color='red')
plt.xlabel('Data points')
plt.ylabel('Probability Density')
plt.show()
OUTPUT:
b. Contour plots
Contour plots, also called level plots, are a tool for doing multivariate analysis and visualizing 3-D
plots in 2-D space. If we consider X and Y as our variables, the response Z is plotted as slices on
the X-Y plane, which is why contours are sometimes referred to as Z-slices or iso-responses.
Contour plots are widely used to visualize density, altitudes, or heights of mountains, as well as in
the meteorological department. Because of such wide usage, matplotlib.pyplot provides a method
contour() to make it easy for us to draw contour plots.
Code:
import matplotlib.pyplot as plt
import numpy as np
# (the opening lines of this listing are missing from the record;
# the grid and the function Z below are assumed for illustration)
feature_x = np.arange(0, 50, 2)
feature_y = np.arange(0, 50, 3)
# creating a 2-D grid of features
[X, Y] = np.meshgrid(feature_x, feature_y)
fig, ax = plt.subplots(1, 1)
Z = np.cos(X / 2) + np.sin(Y / 4)
# plots contour lines
ax.contour(X, Y, Z)
ax.set_title('Contour Plot')
ax.set_xlabel('feature_x')
ax.set_ylabel('feature_y')
plt.show()
OUTPUT:
c. Correlation and scatter plots
Correlation means an association; it is a measure of the extent to which two variables are
related.
1. Positive Correlation: When two variables increase together and decrease together, they
are positively correlated. '1' is a perfect positive correlation. For example, demand and profit
are positively correlated: the more the demand for the product, the more the profit.
2. Negative Correlation: When one variable increases as the other variable decreases, and
vice-versa, they are negatively correlated. For example, if the distance between two magnets
increases, their attraction decreases, and vice-versa. '-1' is a perfect negative correlation,
while '0' indicates no correlation.
Code:
import pandas as pd
con = pd.read_csv('concrete.csv')
list(con.columns)
con.head()
con['cement'] = con['cement'].astype('category')
con.describe(include='category')
import seaborn as sns
sns.scatterplot(x="water", y="coarseagg", data=con)
ax = sns.scatterplot(x="water", y="coarseagg", data=con)
ax.set_title("Water vs. Coarse Aggregate")
ax.set_xlabel("water")
sns.lmplot(x="water", y="coarseagg", data=con)
OUTPUT:
d. Histograms:
Creating a Histogram
To create a histogram, the first step is to create bins of the ranges, then distribute the whole
range of the values into a series of intervals, and count the values which fall into each of the
intervals. Bins are clearly identified as consecutive, non-overlapping intervals of variables.
The matplotlib.pyplot.hist() function is used to compute and create a histogram of x.
Code:
import numpy as np
import matplotlib.pyplot as plt
# Creating dataset
a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])
# Creating histogram
fig, ax = plt.subplots(figsize =(10, 7))
ax.hist(a, bins = [0, 25, 50, 75, 100])
# Show plot
plt.show()
OUTPUT:
CODE:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import colors
from matplotlib.ticker import PercentFormatter
# Creating dataset
np.random.seed(23685752)
N_points = 10000
n_bins = 20
# Creating distribution
x = np.random.randn(N_points)
y = .8 ** x + np.random.randn(10000) + 25
# Creating histogram
fig, axs = plt.subplots(1, 1,figsize =(10, 7),tight_layout = True)
axs.hist(x, bins = n_bins)
# Show plot
plt.show()
OUTPUT:
e. Three dimensional plotting
Matplotlib was originally designed with only two-dimensional plotting in mind. Around the time
of the 1.0 release, the 3-D utilities were developed on top of the 2-D ones, so we have a 3-D
implementation of data available today. The 3-D plots are enabled by importing the
mplot3d toolkit. In this section, we will deal with 3-D plots using matplotlib.
CODE:
import numpy as np
import matplotlib.pyplot as plt
# creating a 3-D axes (needed before plotting; assumed from the mplot3d usage above)
fig = plt.figure()
ax = plt.axes(projection='3d')
# defining axes
z = np.linspace(0, 1, 100)
x = z * np.sin(25 * z)
y = z * np.cos(25 * z)
c = x + y
# 3-D scatter plot coloured by x + y
ax.scatter(x, y, z, c=c)
plt.show()
OUTPUT:
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus the various plotting functions were applied on the given dataset successfully.
EX.NO: 8
Visualizing Geographic Data with Basemap
DATE:
AIM
To visualize geographic data with Basemap.
PROCEDURE:
One common type of visualization in data science is that of geographic data. Matplotlib's
main tool for this type of visualization is the Basemap toolkit, which is one of several
Matplotlib toolkits that live under the mpl_toolkits namespace. Admittedly, Basemap
feels a bit clunky to use, and often even simple visualizations take much longer to render
than you might hope. More modern solutions such as leaflet or the Google Maps API may
be a better choice for more intensive map visualizations. Still, Basemap is a useful tool for
Python users to have in their virtual toolbelts. In this section, we'll show several examples
of the type of map visualization that is possible with this toolkit.
Installation of Basemap is straightforward; if you're using conda you can type this and the
package will be downloaded:

$ conda install basemap
Code:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5)

fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None, width=8E6, height=8E6, lat_0=45, lon_0=-100)
m.etopo(scale=0.5, alpha=0.5)
# map (longitude, latitude) to (x, y) for plotting; Seattle's coordinates
# are assumed here from the plt.text call below
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 12))
m = Basemap()
m.drawcoastlines()
m.drawcoastlines(linewidth=1.0, linestyle='dashed', color='red')
plt.title("Coastlines", fontsize=20)
plt.show()
OUTPUT:
CODE:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import shapefile as shp
from shapely.geometry import Point
sns.set_style('whitegrid')
fp = r'Maps_with_python\india-polygon.shp'
map_df = gpd.read_file(fp)
map_df_copy = gpd.read_file(fp)
# GeoDataFrames are plotted with their own .plot() method
map_df.plot(markersize=5)
plt.show()
OUTPUT:
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus visualizing geographic data with Basemap was completed successfully.
EX.NO: 9
Exploratory Data Analysis
DATE:
AIM
To perform Exploratory Data Analysis on the Iris dataset.
Exploratory Data Analysis (EDA) is a technique to analyze data using visual
techniques. With this technique, we can get detailed information about the statistical
summary of the data, deal with duplicate values and outliers, and also spot trends or
patterns present in the dataset.
Iris Dataset
If you are from a data science background, you must be familiar with the Iris dataset. If
you are not, don't worry: we will discuss it here.
The Iris dataset is considered the "Hello World" of data science. It contains five columns,
namely Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a
flowering plant; researchers have measured various features of different iris flowers
and recorded them digitally.
After downloading the Iris.csv file, we will use the Pandas library to load it and convert
it into a dataframe. The read_csv() method is used to read CSV files.
Code:
import pandas as pd
data1=pd.read_csv("Iris.csv")
data1.head()
data1.info()
data1.describe()
data1.isnull().sum()
data1.shape
# keep one row per species to see the distinct species present
data = data1.drop_duplicates(subset="Species")
data
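Since EDA is primarily visual, a quick look at how the measurements separate the species can be added with seaborn (a sketch, assuming the standard Iris.csv column names):
Code:
import seaborn as sns
import matplotlib.pyplot as plt
# count of samples per species
sns.countplot(x='Species', data=data1)
plt.show()
# pairwise relationships between the measurements, coloured by species
sns.pairplot(data1.drop(columns=['Id'], errors='ignore'), hue='Species')
plt.show()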
                            MARKS ALLOTTED   MARKS SECURED
AIM & ALGORITHM/PROCEDURE          5
PROGRAM                           10
OUTPUT                             5
TOTAL                             20
Result:
Thus the Exploratory Data Analysis on the Iris dataset was completed successfully.