DSF Lab Exp Full
PROCEDURE:
Pip is the package manager for Python. We can use pip to install packages that
do not come with Python. The basic syntax of a pip command in the command prompt is:
pip 'arguments'
Step1:
Step2:
Pip comes pre-installed with Python 3.4 and later versions. To check whether pip is
installed, type the command below in the terminal.
pip --version
This command prints the pip version if pip is already installed on the system.
Step3:
Before upgrading, check the current pip version by running pip --version.
On Windows, open the Windows command prompt and run the following
command to upgrade pip to the latest available version:
# Upgrade to latest available version
python -m pip install --upgrade pip
Step4:
Check the version of pip again; it should now show the upgraded version.
Step5:
Install NumPy with pip (use pip3 on systems where pip points to Python 2):
pip install numpy
Pip downloads the NumPy package and notifies you when it has been successfully installed.
Step6:
Install SciPy with pip:
pip install scipy
Pip downloads the SciPy package and notifies you when it has been successfully installed.
Step7:
Install pandas with pip:
pip install pandas
Pip downloads the pandas package and notifies you when it has been successfully installed.
Step8:
Install matplotlib with pip:
pip install matplotlib
Pip downloads the matplotlib package and notifies you when it has been successfully installed.
Step9:
Install seaborn with pip:
pip install seaborn
Step10:
Finally, type the command python to start the interpreter and import the packages using the import command.
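The final check can also be scripted. The sketch below is one possible way to do it, assuming the five packages named in the steps above; the helper name check_packages is hypothetical and not part of the original procedure. It only tests whether each package can be located, without importing it.

```python
import importlib.util

# Packages installed in the steps above
PACKAGES = ["numpy", "scipy", "pandas", "matplotlib", "seaborn"]

def check_packages(names):
    """Return a dict mapping each package name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages(PACKAGES)
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

Any package reported as missing can be installed by repeating the matching pip install step.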
Result:
Thus Python and its packages for data analytics were downloaded, installed and
explored successfully.
EX.NO: 2 WORKING WITH NUMPY ARRAYS
ALGORITHM:
CODE:
1) To create an array
import numpy as np
arr = np.array([1, 2, 3, 4, 5])  # "arr" avoids shadowing the built-in name list
print(arr)
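The result below mentions creating arrays "using different functions". As a hedged illustration, the sketch shows a few common NumPy constructors; the variable names and values are chosen for this example and are not taken from the original exercise.

```python
import numpy as np

zeros = np.zeros(3)                # three zeros (float)
ones = np.ones((2, 2))             # 2x2 array of ones
steps = np.arange(0, 10, 2)        # start, stop (exclusive), step
points = np.linspace(0, 1, 5)      # 5 evenly spaced values from 0 to 1 inclusive
grid = np.arange(6).reshape(2, 3)  # 0..5 arranged as 2 rows by 3 columns

print(zeros, ones, steps, points, grid, sep="\n")
print(grid.shape, grid.dtype)
```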
RESULT:
Thus the Python program for creating NumPy arrays using different functions has been
done and the output has been verified.
Ex.No.3 WORKING WITH PANDAS DATA FRAMES
AIM:
To create a DataFrame using a single list or a list of lists, locate a row, create named indexes and
locate rows by named index.
CODE 1:
A DataFrame can be created using a single list.
import pandas as pd
lst = ['python', 'For', 'first', 'year',
'students', 'interesting',
'programs']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
CODE 2:
Locate Row:
import pandas as pd
lst = ['python', 'For', 'first', 'year',
'students', 'interesting',
'programs']
df = pd.DataFrame(lst)
# Locate the first row by its index label
print(df.loc[0])
CODE 3:
Named Indexes, Locate Named Indexes
import pandas as pd
lst = {
"list1":['python', 'For', 'first'],
"list2":['students', 'interesting', 'programs']
}
# Calling DataFrame constructor on list
df = pd.DataFrame(lst, index = ["day1", "day2", "day3"])
print(df)
print(df.loc["day2"])
RESULT:
Thus the Python program for creating DataFrames using different functions has been done and the
output has been verified.
Ex.No.4 BASIC PLOTS USING MATPLOTLIB
AIM:
To draw various plots like Line Plot, Bar Graph, Histogram, Scatter Plot, Area
Plot and Pie Chart.
CODE 1
LINE PLOT:
from matplotlib import pyplot as plt
# Plotting to our canvas (example data, assumed)
plt.plot([1, 2, 3], [4, 5, 1])
# Showing what we plotted
plt.show()
CODE 2:
BAR GRAPH:
from matplotlib import pyplot as plt
plt.bar([0.25,1.25,2.25,3.25,4.25],[50,40,70,80,20],
label="BMW",width=.5)
plt.bar([.75,1.75,2.75,3.75,4.75],[80,20,20,50,60],
label="Audi", color='r',width=.5)
plt.legend()
plt.xlabel('Days')
plt.ylabel('Distance (kms)')
plt.title('Information')
plt.show()
CODE 4:
HISTOGRAM
import matplotlib.pyplot as plt
population_age = [22,55,62,45,21,22,34,42,42,4,2,102,95,85,55,110,120,70,65,55,111,115,80,75,65,54,44,43,42,48]
bins = [0,10,20,30,40,50,60,70,80,90,100]
plt.hist(population_age, bins, histtype='bar', rwidth=0.8)
plt.xlabel('age groups')
plt.ylabel('Number of people')
plt.title('Histogram')
plt.show()
CODE 5:
AREA PLOT
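import matplotlib.pyplot as plt
days = [1,2,3,4,5]
sleeping = [7,8,6,11,7]
eating = [2,3,4,3,2]
working = [7,8,7,2,2]
playing = [8,5,7,8,13]
# Stack the four activities over the days; the labels feed plt.legend()
plt.stackplot(days, sleeping, eating, working, playing, labels=['Sleeping','Eating','Working','Playing'])
plt.xlabel('x')
plt.ylabel('y')
plt.title('Stack Plot')
plt.legend()
plt.show()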
CODE 6:
PIE CHART
import matplotlib.pyplot as plt
days = [1,2,3,4,5]
sleeping =[7,8,6,11,7]
eating = [2,3,4,3,2]
working =[7,8,7,2,2]
playing = [8,5,7,8,13]
slices = [7,2,2,13]
activities = ['sleeping','eating','working','playing']
cols = ['c','m','r','b']
plt.pie(slices,
    labels=activities,
    colors=cols,
    startangle=90,
    shadow=True,
    explode=(0,0.1,0,0),
    autopct='%1.1f%%')
plt.title('Pie Plot')
plt.show()
RESULT: Thus the program for creating different plots using matplotlib has been done and the
output has been verified.
5. a) Frequency distributions
Aim:
To count the frequency of occurrence of words in a body of text, which is often needed during text
processing.
ALGORITHM
Program:
from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg
word_tokenize(sample)
wlist = []
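The program above is only partly preserved. As a stand-in, the sketch below counts word frequencies with the standard library's collections.Counter rather than NLTK, so it is a substitute technique and not the original program; the sample sentence and the helper name word_frequencies are assumptions made for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lower-case the text, split it into words and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Example text (assumed); any string can be used instead
sample = "the quick brown fox jumps over the lazy dog and the fox runs"
freq = word_frequencies(sample)
print(freq.most_common(2))  # [('the', 3), ('fox', 2)]
```

With NLTK installed, nltk.FreqDist applied to the output of word_tokenize plays the same role as Counter here.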
Result:
Thus the Python program to count the frequency of occurrence of words in a body of
text, along with the Conditional Frequency Distribution program, was successfully
completed.
ALGORITHM:
Step 1: Start the Program
Step 2: Write the code to calculate Mean, Mode and Standard Deviation.
1. Mean:
The mean is the average of all numbers and is sometimes called the arithmetic mean. This code calculates
Mean or Average of a list containing numbers:
CODE
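# mean of elements
n_num = [1, 2, 3, 4, 5]  # example data (assumed)
n = len(n_num)
get_sum = sum(n_num)
mean = get_sum / n
print("Mean / Average is: " + str(mean))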
2. Mode:
The mode is the number that occurs most often within a set of numbers. This code calculates Mode of a list
containing numbers:
# Python program to print
# mode of elements
from collections import Counter
n_num = [1, 2, 3, 4, 5, 5]  # example data (assumed)
n = len(n_num)
data = Counter(n_num)
get_mode = dict(data)
mode = [k for k, v in get_mode.items() if v == max(list(data.values()))]
# If every value qualifies as a mode, no value occurs more often than the others
if len(mode) == n:
    get_mode = "No mode found"
else:
    get_mode = "Mode is / are: " + ', '.join(map(str, mode))
print(get_mode)
3. Standard Deviation:
Standard deviation is a measure of spread in statistics. It quantifies the variation of a set of data
values. It is closely related to variance: the standard deviation is the square root of the variance, so it
is expressed in the same units as the data, whereas the variance is in squared units.
A low standard deviation indicates that the data are clustered close to the mean, whereas a high
standard deviation shows that the data are spread far from the mean.
# Python code to demonstrate stdev() function
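A minimal sketch of such a demonstration, using the stdev() and pstdev() functions from Python's standard statistics module. The sample list is an assumption chosen for illustration.

```python
import statistics

sample = [1, 2, 3, 4, 5]  # example data (assumed)

# stdev() treats the data as a sample (divides by n - 1)
sample_sd = statistics.stdev(sample)
# pstdev() treats the data as the whole population (divides by n)
population_sd = statistics.pstdev(sample)

print("Sample standard deviation:", sample_sd)
print("Population standard deviation:", population_sd)
```

The sample value is slightly larger because dividing by n - 1 corrects for estimating the mean from the same data.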
Result:
Thus the Mean, Mode and Standard Deviation were successfully calculated using Python code.
5. c) VARIABILITY
Aim:
To write a python program to calculate the variance.
ALGORITHM
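A minimal sketch of a variance calculation, assuming a small example list. The function name variance and its sample flag are hypothetical; the standard library call statistics.variance is printed alongside as a cross-check.

```python
import statistics

def variance(values, sample=True):
    """Mean squared deviation from the mean.

    Divides by n - 1 for a sample and by n for a whole population.
    """
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return squared_dev / (n - 1 if sample else n)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # example data (assumed)
print("Sample variance:", variance(data))
print("Population variance:", variance(data, sample=False))
print("statistics.variance:", statistics.variance(data))
```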
Result:
Thus the python program to calculate the variance was successfully implemented.
a) Aim:
To create a normal curve using python program.
ALGORITHM
Program:
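import seaborn as sb
import matplotlib.pyplot as plt
# data (the x values) and pdf (their densities) are assumed to be defined earlier
sb.set_style('whitegrid')
sb.lineplot(x=data, y=pdf, color='black')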
plt.xlabel('Heights')
plt.ylabel('Probability Density')
Output:
Result :
ALGORITHM:
Program:
# Data
#Plot
# Plot
Output
Result:
Thus the Correlation and scatter plots using python program was successfully completed.
Aim:
To write a python program to compute correlation coefficient.
ALGORITHM
Program:
i = 0
while i < n:
    # sum of elements of array X.
    sum_X = sum_X + X[i]
    i = i + 1
# Driver function
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
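Only a fragment of the program survives above. The sketch below is one way the full calculation could look, assuming the Pearson formula r = (n*sum(XY) - sum(X)*sum(Y)) / sqrt((n*sum(X^2) - sum(X)^2) * (n*sum(Y^2) - sum(Y)^2)); the function name correlation_coefficient is hypothetical. It uses the X and Y lists from the driver code.

```python
import math

def correlation_coefficient(X, Y):
    """Pearson correlation coefficient computed from running sums."""
    n = len(X)
    sum_X = sum_Y = sum_XY = 0
    square_sum_X = square_sum_Y = 0
    for x, y in zip(X, Y):
        sum_X += x
        sum_Y += y
        sum_XY += x * y
        square_sum_X += x * x
        square_sum_Y += y * y
    numerator = n * sum_XY - sum_X * sum_Y
    denominator = math.sqrt((n * square_sum_X - sum_X ** 2) *
                            (n * square_sum_Y - sum_Y ** 2))
    return numerator / denominator

X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
r = correlation_coefficient(X, Y)
print(round(r, 6))  # 0.953463
```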
Result:
5.g) REGRESSION
Aim:
To write a python program for Simple Linear Regression
ALGORITHM
Program:
import numpy as np
import matplotlib.pyplot as plt
# putting labels
plt.xlabel('x')
plt.ylabel('y')
def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
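The estimate_coef helper that main() calls is not shown above. A minimal sketch of what it could look like, assuming ordinary least squares for the line y = b_0 + b_1 * x; the body is a reconstruction under that assumption, not the original code.

```python
import numpy as np

def estimate_coef(x, y):
    """Least-squares intercept b_0 and slope b_1 for y = b_0 + b_1 * x."""
    n = np.size(x)
    m_x, m_y = np.mean(x), np.mean(y)
    # Cross-deviation of x and y, and squared deviation of x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
b = estimate_coef(x, y)
print("b_0 = {:.4f}, b_1 = {:.4f}".format(b[0], b[1]))  # b_0 = 1.2364, b_1 = 1.1697
```

The same coefficients can be cross-checked with np.polyfit(x, y, 1), which returns the slope first and the intercept second.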
Result:
Thus the computation for Simple Linear Regression was successfully completed.
Ex.no : 6 a) USE THE STANDARD BENCHMARK DATA SET FOR PERFORMING THE
FOLLOWING UNIVARIATE ANALYSIS:
UNIVARIATE ANALYSIS:
AIM:
To write a Python program for univariate analysis on UCI datasets.
ALGORITHM:
Step 1: Start the program
Step 2: Write the coding
Step 3: calculate the Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and
Kurtosis.
Step 4: Stop the program
Frequency
import pandas as pd
import numpy as np
import statistics as st
Mean
df.mean()
It is also possible to calculate the mean of a particular variable in the data, as shown below, where we
calculate the mean of the variables 'Age' and 'Income'.
print(df.loc[:,'Age'].mean())
print(df.loc[:,'Income'].mean())
It is also possible to calculate the mean of the rows by specifying the (axis = 1) argument. The code
below calculates the mean of the first five rows.
df.mean(axis = 1)[0:5]
Median
df.median()
Mode
df.mode()
Variance
df.var()
Standard Deviation
df.std()
Skewness
print(df.skew())
df.describe()
df.describe(include='all')
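The calls above assume a DataFrame named df has already been loaded. The sketch below runs the same measures on a small hypothetical DataFrame; the 'Age' and 'Income' values are invented stand-ins for the benchmark data set, and the summary table layout is a choice made for this example.

```python
import pandas as pd

# Hypothetical stand-in for the benchmark data set
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 62],
    "Income": [30000, 42000, 58000, 61000, 75000],
})

summary = pd.DataFrame({
    "mean": df.mean(),
    "median": df.median(),
    "variance": df.var(),   # sample variance (ddof=1)
    "std": df.std(),        # sample standard deviation (ddof=1)
    "skewness": df.skew(),
    "kurtosis": df.kurt(),  # excess kurtosis
})
print(summary)
print(df["Age"].value_counts())  # frequency of each distinct value
```

Each row of summary is one variable and each column one measure, which makes the variables easy to compare side by side.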
Result:
Thus the Univariate and Multiple Regression Analysis using the diabetes data set from
UCI and Pima Indians Diabetes data set was completed and verified successfully.
AIM:
ALGORITHM:
Linear Regression
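import matplotlib.pyplot as plt
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
# Fit a least-squares line; only slope and intercept are used below
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()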
Logistic Regression
import numpy
from sklearn import linear_model
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
logr = linear_model.LogisticRegression()
logr.fit(X,y)
OUTPUT:
Multiple regression works by considering the values of the available multiple independent
variables and predicting the value of one dependent variable.
import pandas as pd
from sklearn import linear_model
import statsmodels.api as sm
data = {
    'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
    'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
    'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
    'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
    'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
}
df = pd.DataFrame(data)
x = df[['interest_rate','unemployment_rate']]
y = df['index_price']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(x, y)
# with statsmodels
x = sm.add_constant(x) # adding a constant
model = sm.OLS(y, x).fit()  # ordinary least squares fit
print_model = model.summary()
print(print_model)
Result:
Thus the Bivariate and Multiple Regression Analysis using the diabetes data set
from UCI and Pima Indians Diabetes data set was completed and verified successfully.
print(iris.data.shape)
# Declare an instance of the KNN classifier class with the number of neighbors.
knn = KNeighborsClassifier(n_neighbors=6)
print(prediction)
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
import numpy as np
# Input data
print('Input Values')
print(diabetes_X_test)
# Predicted Data
print("Predicted Output Values")
print(diabetes_y_pred)
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='red', linewidth=1)
plt.show()
Unsupervised learning is a class of machine learning (ML) techniques used to find patterns in data. The
data given to unsupervised algorithms is not labelled, which means only the input variables (x) are
given with no corresponding output variables. In unsupervised learning, the algorithms are left to
discover interesting structures in the data on their own.
# Importing Modules
from sklearn import datasets
import matplotlib.pyplot as plt
# Loading dataset
iris_df = datasets.load_iris()
# Features
print(iris_df.feature_names)
# Targets
print(iris_df.target)
# Target Names
print(iris_df.target_names)
# Dataset Slicing
x_axis = iris_df.data[:, 0] # Sepal Length
y_axis = iris_df.data[:, 2] # Petal Length
# Plotting
plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()
# Importing Modules
from sklearn import datasets
from sklearn.cluster import KMeans
# Loading dataset
iris_df = datasets.load_iris()
# Declaring Model
model = KMeans(n_clusters=3)
# Fitting Model
model.fit(iris_df.data)
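# Predicting a single input (example measurements, assumed)
predicted_label = model.predict([[7.2, 3.5, 0.8, 1.6]])
# Predicting the cluster of every sample in the data set
all_predictions = model.predict(iris_df.data)
# Printing Predictions
print(predicted_label)
print(all_predictions)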
[0]
Result:
Thus the program to apply supervised learning algorithms and unsupervised learning algorithms on the iris
dataset was written and executed successfully.
To load and quickly visualize the Multiple Features Dataset [1] from the UCI repository, which
is available in mvlearn. This dataset can be a good tool for analyzing the effectiveness of
multiview algorithms. It contains 6 views of handwritten digit images, thus allowing for analysis
of multiview algorithms in multiclass or unsupervised tasks.
a. Normal curves
A probability distribution is a statistical function that describes the likelihood of obtaining the
possible values that a random variable can take. By this, we mean the range of values that a
parameter can take when we randomly pick values from it. Suppose we pick one adult
at random and are asked what his or her height is (assuming gender does not affect height).
There is no way to know the height in advance. But if we have the distribution of heights of adults
in the city, we can bet on the most probable outcome.
A Normal Distribution is also known as a Gaussian distribution or, famously, the Bell Curve.
People use these names interchangeably; they all mean the same thing. It is a continuous
probability distribution.
Code:
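import numpy as np
import matplotlib.pyplot as plt
# Creating a series of data in the range of 1-50.
x = np.linspace(1,50,200)
#Creating a Function.
def normal_dist(x, mean, sd):
    # Normal probability density function
    prob_density = 1 / (sd * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((x - mean) / sd) ** 2)
    return prob_density
# Calculate mean and standard deviation.
mean = np.mean(x)
sd = np.std(x)
# Apply the function to the data.
pdf = normal_dist(x,mean,sd)
# Plotting the results
plt.plot(x,pdf , color = 'red')
plt.xlabel('Data points')
plt.ylabel('Probability Density')
plt.show()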
Contour plots, also called level plots, are a tool for multivariate analysis and for visualizing 3-D
surfaces in 2-D space. If X and Y are the variables we want to plot, the response Z
is plotted as slices on the X-Y plane, which is why contours are sometimes referred to as
Z-slices or iso-response values.
Contour plots are widely used to visualize density and altitude (for example, the heights of mountains),
as well as in meteorology. Because of this wide usage, matplotlib.pyplot provides a
contour method that makes it easy to draw contour plots.
Code:
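import matplotlib.pyplot as plt
import numpy as np
# Feature ranges (the original values were lost in extraction; these are assumed)
feature_x = np.arange(0, 50, 2)
feature_y = np.arange(0, 50, 3)
# Creating 2-D grid of features
[X, Y] = np.meshgrid(feature_x, feature_y)
fig, ax = plt.subplots(1, 1)
Z = np.cos(X / 2) + np.sin(Y / 4)
# plots contour lines
ax.contour(X, Y, Z)
ax.set_title('Contour Plot')
ax.set_xlabel('feature_x')
ax.set_ylabel('feature_y')
plt.show()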
Correlation means association. It is a measure of the extent to which two variables are related.
1. Positive Correlation: When two variables increase together and decrease together, they are
positively correlated. ‘1’ is a perfect positive correlation. For example, demand and profit are
positively correlated: the greater the demand for the product, the greater the profit.
2. Negative Correlation: When one variable increases as the other decreases,
and vice versa, they are negatively correlated. For example, as the distance between two magnets
increases, their attraction decreases, and vice versa. ‘-1’ is a perfect negative
correlation.
3. Zero Correlation (No Correlation): When two variables don’t seem to be linked at all. ‘0’ is
no correlation. For example, the amount of tea you drink and your level of intelligence.
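The three cases can be checked numerically. The sketch below uses np.corrcoef on small made-up arrays, one per case; the data are assumptions chosen so that the coefficients come out at the extremes.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient between the two inputs
positive = np.corrcoef(x, 2 * x + 1)[0, 1]              # y rises with x
negative = np.corrcoef(x, 10 - 3 * x)[0, 1]             # y falls as x rises
zero = np.corrcoef(x, np.array([2, 1, 0, 1, 2]))[0, 1]  # symmetric, no linear trend

print(round(positive, 6), round(negative, 6), round(zero, 6))
```

A coefficient of zero only rules out a linear relationship: in the third case y still depends on x through a V-shaped pattern.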
Code:
import pandas as pd
con = pd.read_csv('concrete.csv')
con
list(con.columns)
con.head()
con['cement'] = con['cement'].astype('category')
con.describe(include='category')
ax = sns.scatterplot(x="water", y="coarseagg",
ash") ax.set_xlabel("coarseagg");
a. Histograms:
A histogram is used to represent data grouped into ranges. It is an accurate
method for the graphical representation of a numerical data distribution. It is a type of bar plot where
the X-axis represents the bin ranges while the Y-axis gives the frequency.
Creating a Histogram
To create a histogram, first divide the whole range of the values into a series of intervals
(bins), then count the values that fall into each interval.
Bins are consecutive, non-overlapping intervals of a variable. The
matplotlib.pyplot.hist() function is used to compute and draw the histogram of x.
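The binning step can be seen without drawing anything. The sketch below uses np.histogram to count values per bin; the ages and bin edges are example values assumed for illustration. Passing the same data and bins to matplotlib.pyplot.hist() draws bars with these counts.

```python
import numpy as np

ages = np.array([22, 55, 62, 45, 21, 22, 34, 42, 42, 4])  # example data (assumed)
bins = [0, 20, 40, 60, 80]  # bin edges; every bin except the last excludes its right edge

counts, edges = np.histogram(ages, bins=bins)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")
```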
Code:
import numpy as np
# Creating dataset
27])
# Creating histogram
# Show plot
plt.show()
Code:
import matplotlib.pyplot as plt
import numpy as np
# Creating dataset
np.random.seed(23685752)
N_points = 10000
n_bins = 20
# Creating distribution
x = np.random.randn(N_points)
y = .8 ** x + np.random.randn(10000) + 25
# Creating histogram
plt.hist(x, bins=n_bins)
# Show plot
plt.show()
Matplotlib was originally designed with only two-dimensional plotting in mind. Around the
release of version 1.0, 3D utilities were built on top of the 2D machinery, so a 3D
implementation is available today. 3D plots are enabled by importing the mplot3d toolkit.
This section deals with 3D plots using matplotlib.
Code:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection ='3d')
# defining axes
z = np.linspace(0, 1, 100)
x = z * np.sin(25 * z)
y = z * np.cos(25 * z)
c = x + y
ax.scatter(x, y, z, c = c)
plt.show()
Result:
Thus various plotting functions were applied and explored on UCI data sets.