Data Science Lab Manual Full
EX.NO:1 DOWNLOAD, INSTALL AND EXPLORE THE FEATURES OF
NUMPY, SCIPY, JUPYTER, STATSMODELS AND PANDAS PACKAGES
Aim:
To download and install Python packages such as NumPy, SciPy, Jupyter,
Statsmodels and Pandas.
Python is available in several versions, and syntax and behaviour differ between them, most notably
between Python 2 and Python 3. Choose the version you need before downloading.
In a web browser, open the official Python site (www.python.org) and go to the Downloads section for
Windows.
All the available versions of Python are listed. Select the version you require and click
Download. Suppose we choose Python 3.9.1.
On clicking Download, the available executable installers are listed for different operating
systems. Choose the installer that matches your operating system and download
it. Suppose we select the Windows installer (64-bit).
Having downloaded the Python 3.9.1 Windows 64-bit installer, run it. Make sure to select both
checkboxes at the bottom and then click Install Now.
On clicking Install Now, the installation process starts.
The installation takes a few minutes to complete. Once it is successful, a
confirmation screen is displayed.
To verify that Python is successfully installed on your system, follow the given steps.
Pip is a package management system for Python software packages, so make sure that
you have it installed.
Pip allows you to install Python packages such as NumPy, SciPy, Matplotlib,
Pandas, Jupyter and Statsmodels using the command pip install packagename.
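After installing the packages, a short script can confirm which of them Python is able to find. The following is a minimal sketch, not part of the original exercise. It assumes that Jupyter can be detected through the import name jupyter_core; the helper is_installed and the PACKAGES list are names chosen only for this illustration.

```python
import importlib.util
import sys

# Import names of the packages used in this exercise.
# Assumption: Jupyter is detected through its "jupyter_core" module.
PACKAGES = ["numpy", "scipy", "pandas", "statsmodels", "matplotlib", "jupyter_core"]


def is_installed(module_name):
    """Return True if the module can be located (it is not imported)."""
    return importlib.util.find_spec(module_name) is not None


print("Python version:", sys.version.split()[0])

# Map each package name to True (found) or False (not found).
status = {name: is_installed(name) for name in PACKAGES}
for name, found in status.items():
    print(f"{name:14s} {'installed' if found else 'missing'}")
```

Any package reported as missing can then be installed with pip install packagename.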
Result:
Thus Python, pip and various Python packages have been successfully installed on our Windows
system.
EX.NO:2 WORKING WITH NUMPY ARRAYS
AIM:
ALGORITHM:
1. Download and install the NumPy package using the command pip install numpy.
2. Create one-dimensional and n-dimensional arrays in NumPy.
3. Apply linear algebra operations to n-dimensional arrays without using for-loops.
4. Examine the axis and shape properties of n-dimensional arrays. In NumPy, dimensions
are called axes.
5. For a matrix with n rows and m columns, the shape is (n, m).
6. The length of the shape tuple is the number of axes, ndim. The size property gives the total
number of elements of the array.
7. Create arrays or specify their datatype using standard Python types, then stop the program.
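The shape, ndim, size and dtype properties named in the steps above can be sketched as follows. This is a small illustration with made-up values, not one of the recorded programs. The variable names a, product and column_sums are chosen only for this example.

```python
import numpy as np

# A matrix with n = 2 rows and m = 3 columns, stored as floats.
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)

print(a.shape)   # (2, 3): one entry per axis
print(a.ndim)    # 2: the length of the shape tuple
print(a.size)    # 6: total number of elements
print(a.dtype)   # float64

# Linear algebra without explicit for-loops.
product = a @ a.T             # (2, 3) @ (3, 2) gives a (2, 2) matrix
column_sums = a.sum(axis=0)   # collapse axis 0 (rows): one sum per column
print(product)
print(column_sums)
```

Here axis=0 runs down the rows and axis=1 runs across the columns, which is why a.sum(axis=0) returns one value per column.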
PROGRAM EX.NO:1
import numpy as np
arr = np.array([1, 2, 3, 4, 5])  # one-dimensional array
print(arr)
print(type(arr))
Output:
[1 2 3 4 5]
<class 'numpy.ndarray'>
Ex.No:2
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
print(type(arr))
Output:
[[1 2 3]
 [4 5 6]]
<class 'numpy.ndarray'>
Multi-Dimensional Array:
Ex.No:3
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])  # three-dimensional array
print(arr)
Output:
[[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]]
Ex.No:4
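import numpy as np
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
sample_array = np.array([list_1, list_2, list_3])  # each list becomes one row
print("Numpy array :")
print(sample_array)
Output:
Numpy array :
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]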
Ex.No:5
import numpy as np
Ex.No:6
import numpy as np
Output:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
Output:
6
Result:
Thus the various array operations on NumPy arrays have been executed and verified successfully.
EX.NO:3 PANDAS DATA FRAMES
AIM:
To write Python programs that work with pandas Series and DataFrames.
ALGORITHM:
1. Download and install the pandas package using the command pip install pandas.
2. Use the function pandas provides to read data stored as a .csv file into a pandas DataFrame.
3. pandas supports many different file formats and data sources. Select a subset of the
data frame.
4. Create new rows and columns derived from existing data, and combine multiple tables.
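The steps above can be sketched in one short program. This is a minimal illustration, not one of the recorded programs. The CSV text is made up and read from memory with io.StringIO instead of from a file on disk, and the pulse table is a hypothetical second table used only to show merging.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a .csv file on disk.
csv_text = "day,calories,duration\nday1,420,50\nday2,380,40\nday3,390,45\n"
df = pd.read_csv(io.StringIO(csv_text))

# Select a subset: rows with more than 385 calories, two columns only.
subset = df.loc[df["calories"] > 385, ["day", "calories"]]

# Create a new column derived from existing columns.
df["calories_per_minute"] = df["calories"] / df["duration"]

# Combine with a second (hypothetical) table on the shared "day" column.
pulse = pd.DataFrame({"day": ["day1", "day2", "day3"], "pulse": [110, 117, 103]})
combined = df.merge(pulse, on="day")

print(subset)
print(combined)
```

With a real file, the call would be pd.read_csv("path/to/file.csv"); the rest of the program stays the same.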
PROGRAM
import pandas as pd
S = pd.Series([11, 28, 72, 3, 5, 8])  # Series built from a Python list
print(S)
OUTPUT:
0 11
1 28
2 72
3 3
4 5
5 8
dtype: int64
EX.NO: 2
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
OUTPUT:
0 1
1 7
2 2
dtype: int64
EX.NO:3
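import pandas as pd
# same data as in EX.NO:4, with the default integer index
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
print(df)
OUTPUT:
   calories  duration
0       420        50
1       380        40
2       390        45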
EX.NO:4
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
OUTPUT
      calories  duration
day1       420        50
day2       380        40
day3       390        45
EX.NO:5
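import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])  # labels as shown in the output
print(myvar)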
OUTPUT
x 1
y 7
z 2
dtype: int64
EX.NO:6
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
OUTPUT:
day1 420
day2 380
day3 390
dtype: int64
EX.NO:7
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)
OUTPUT:
day1 420
day2 380
dtype: int64
Result:
Thus the various data manipulation operations on pandas DataFrames have been
executed and verified successfully.
EX.NO:4
READING DATA FROM TEXT FILES, EXCEL AND THE WEB
AND EXPLORING VARIOUS COMMANDS FOR DOING
DESCRIPTIVE ANALYTICS ON THE IRIS DATA SET
AIM:
To read data from text files, Excel and the web, and to explore various commands for
descriptive analytics on the iris data set.
Steps:
Step 1:
Download the Iris data set from the Kaggle website and save it in the Documents folder.
Kaggle data set: https://www.kaggle.com/datasets/uciml/iris
Step 2:
Open the Jupyter notebook and type the following commands:
import pandas as pd
iris=pd.read_csv("Documents/iris.data.csv")
iris
Step 3:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_palette('husl')
import matplotlib.pyplot as plt
from subprocess import check_output
data = pd.read_csv('C:/Users/Welcome/Downloads/archive/Iris.csv')
data.head()
data.info()
data.describe()
data['Species'].value_counts()
tmp = data.drop('Id', axis=1)
g = sns.pairplot(tmp, hue='Species', markers='+')
plt.show()
g = sns.violinplot(y='Species', x='SepalLengthCm', data=data, inner='quartile')
plt.show()
g = sns.violinplot(y='Species', x='SepalWidthCm', data=data, inner='quartile')
plt.show()
g = sns.violinplot(y='Species', x='PetalLengthCm', data=data, inner='quartile')
plt.show()
g = sns.violinplot(y='Species', x='PetalWidthCm', data=data, inner='quartile')
plt.show()
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
In[45]: iris.species.unique()
Out[45]: array(['setosa', 'versicolor', 'virginica'], dtype=object)
In[59]: ax=iris[iris.species=='setosa'].plot.scatter(x='sepal_length',y='sepal_width',color='red',label='setosa')
iris[iris.species=='versicolor'].plot.scatter(x='sepal_length',y='sepal_width',color='green',label='versicolor',ax=ax)
iris[iris.species=='virginica'].plot.scatter(x='sepal_length',y='sepal_width',color='blue',label='virginica',ax=ax)
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Sepal Width")
ax.legend()
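Descriptive analytics can also be computed per species rather than for the whole table. The following is a minimal sketch using six made-up rows laid out like Iris.csv (they are not real measurements), read from memory so that the example runs without the downloaded file. The names sample and per_species are chosen only for this illustration.

```python
import io
import pandas as pd

# Six made-up rows in the same column layout as Iris.csv.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.0,3.4,1.5,0.2,Iris-setosa
2,5.2,3.6,1.4,0.2,Iris-setosa
3,6.0,2.8,4.5,1.4,Iris-versicolor
4,6.2,3.0,4.3,1.3,Iris-versicolor
5,6.8,3.1,5.6,2.1,Iris-virginica
6,7.0,3.3,5.8,2.3,Iris-virginica
"""
sample = pd.read_csv(io.StringIO(csv_text)).drop(columns="Id")

# Summary statistics for every numeric column.
print(sample.describe())

# Mean, spread and range of the sepal length within each species.
per_species = sample.groupby("Species")["SepalLengthCm"].agg(["mean", "std", "min", "max"])
print(per_species)
```

pd.read_csv also accepts a URL in place of a file path, and pd.read_excel reads Excel workbooks in the same way, so the same commands apply to data from the web or from Excel.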
Result:
Thus data was read from text files, Excel and the web, and various commands for
descriptive analytics were explored on the iris data set.
EX:NO:5 USE THE DIABETES DATASET FROM UCI AND PIMA
INDIANS DIABETES DATASET FOR PERFORMING
THE FOLLOWING:
Aim:
d. Also compare the results of the above analysis for the two datasets.
STEPS:
STEP 1: Download the Pima Indians Diabetes Dataset.
STEP 2: Open the Jupyter notebook and type the following commands.
PROGRAM:
Out[3]:
In[5]: dial1
Out[5]:
In[6]: dial0
Out[6]:
Out[8]:(65.104166666666667, 34.8958333333336)
In[9]:
plt.figure(figsize=(20,6))
plt.subplot(1,3,1)
sns.set_style("dark")
plt.title("Histogram for Pregnancies")
sns.histplot(pima.Pregnancies,kde=False)  # histplot replaces the deprecated distplot
plt.subplot(1,3,2)
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=pima.Outcome,y=pima.Pregnancies)
plt.figure(figsize=(20,6))
plt.subplot(1,3,1)
sns.histplot(pima.Glucose,kde=False)
plt.subplot(1,3,2)
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=pima.Outcome,y=pima.Glucose)
In[11]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pima=pd.read_csv('c:/users/welcome/downloads/diabetes.csv')
sns.set(color_codes=True)
pima
Screening of association Variables to study Bivariate relationship
"SkinThickness"]
Out[13]:
Heatmap chart
import statsmodels.api as sm
logit_model=sm.Logit(y,x)
result=logit_model.fit()
print(result.summary())
Result:
EX.NO:6
Aim:
To implement the following using UCI data sets:
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three-dimensional plotting
1. Drawing Plots
Matplotlib is a Python 2D plotting library and the most widely used library for data
visualization. It provides an extensive set of plotting APIs to create various plots such
as scatter, bar, box and distribution plots with custom styling and annotation. Detailed
documentation for Matplotlib can be found at https://matplotlib.org/.
Seaborn is also a Python data visualization library, based on Matplotlib. It provides a
high-level interface for drawing attractive and informative statistical charts; see
https://seaborn.pydata.org/.
Matplotlib is written in Python and makes use of NumPy arrays. Seaborn, which is built on top
of Matplotlib, is a library for making elegant charts in Python and is well integrated with pandas
DataFrames.
To create graphs and plots, we need to import the matplotlib.pyplot and seaborn modules.
To display the plots in the Jupyter notebook, we provide the directive %matplotlib inline,
which renders the plots inline in the notebook.
PROGRAM:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
2. Bar Chart
The bar chart is a frequency chart for a qualitative variable. It can be used to assess
the most frequently and least frequently occurring categories within a dataset. To draw a bar chart,
call barplot() of the seaborn library, passing the data frame in the data parameter.
A bar chart displays a set of categories on one axis and the percentages or frequencies of a variable
for those categories on the other axis. The height of each bar depends on the
frequency value. In a vertical bar chart, the X-axis represents categories and the Y-axis represents
frequencies; in a horizontal bar chart, it is the reverse. In a vertical bar chart, the bars
grow downwards below the X-axis for negative values. In a horizontal bar chart, the bars
grow leftwards from the Y-axis for negative values.
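The frequencies and percentages that a bar chart displays can be computed before plotting. The following is a minimal sketch with a made-up qualitative variable; the names who, counts and percentages are chosen only for this illustration.

```python
import pandas as pd

# Made-up qualitative variable.
who = pd.Series(["man", "woman", "man", "child", "man", "woman"])

counts = who.value_counts()                            # frequency of each category
percentages = who.value_counts(normalize=True) * 100   # share of each category in percent

print(counts)
print(percentages.round(1))
print("most frequent:", counts.idxmax(), "| least frequent:", counts.idxmin())
```

Each value in counts corresponds to the height of one bar in a vertical bar chart of this variable.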
Program:
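import seaborn as sns
import matplotlib.pyplot as plt
# assumed from the "who v/s fare barplot" comment in the box plot program
sns.barplot(x='who', y='fare', data=sns.load_dataset('titanic'))
plt.show()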
3. Pandas Histogram
Histograms are very useful in statistical analysis. They are generally used to represent the
frequency distribution of a numeric array, split into small equal-sized bins. As we use pandas to
work with tabular data, it is important to know how to work with histograms in a pandas
DataFrame. pandas.DataFrame.hist and pandas.DataFrame.plot.hist are two popular
functions that plot histograms directly from pandas DataFrames.
plt.hist(df['fare'])
Program:
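As a minimal sketch of the binning idea behind a histogram, the counts per bin can be computed with NumPy before any plotting is done. The fares below are made up for illustration, and the choice of four bins between 0 and 80 is an assumption of this example.

```python
import numpy as np

# Made-up fares, used only for illustration.
fares = np.array([7.25, 8.05, 13.0, 26.0, 30.0, 52.0, 71.3, 80.0])

# Four equal-width bins between 0 and 80; the last bin includes its right edge.
counts, edges = np.histogram(fares, bins=4, range=(0, 80))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

plt.hist and the pandas hist functions perform the same counting and then draw one bar per bin.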
4. Distribution Plot
A distribution or density plot depicts the distribution of data over a continuous interval. A
density plot is like a smoothed histogram, so it also gives insight into what the distribution of
the population might be.
sns.histplot(df['fare'], kde=True)  # distplot is deprecated in recent seaborn releases
Program:
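The idea of a density plot as a smoothed histogram can be sketched with a hand-written Gaussian kernel density estimate. This is an illustration under stated assumptions, not the method seaborn uses internally: the samples are randomly generated, the bandwidth of 2.0 is chosen by hand, and gaussian_kde_1d is a helper written only for this example.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth


rng = np.random.default_rng(0)
samples = rng.normal(loc=30.0, scale=5.0, size=200)   # made-up data
grid = np.linspace(0.0, 60.0, 601)

density = gaussian_kde_1d(samples, grid, bandwidth=2.0)

# A density curve encloses a total area close to 1.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth follows the individual samples more closely, while a larger one gives a smoother curve.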
5. Box Plot
A box plot is a visual representation of groups of numerical data through their
quartiles. It is also used to detect outliers in a data set. It captures the summary of
the data efficiently with a simple box and whiskers and allows us to compare easily across
groups. A box plot summarizes sample data using the 25th, 50th and 75th percentiles, which
are also known as the lower quartile, median and upper quartile.
A box plot consists of 5 things:
Minimum
First Quartile or 25%
Median (Second Quartile) or 50%
Third Quartile or 75%
Maximum
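The five values listed above can be computed directly with NumPy. The following is a minimal sketch on a made-up sample. The outlier check uses the common 1.5 x IQR rule, which is also the default whisker length in Matplotlib and seaborn box plots.

```python
import numpy as np

# Made-up sample with one unusually large value.
values = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12, 30])

minimum = values.min()
q1, median, q3 = np.percentile(values, [25, 50, 75])
maximum = values.max()

# Points beyond 1.5 times the interquartile range from the box are flagged.
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(minimum, q1, median, q3, maximum)
print("outliers:", outliers)
```

In the drawn box plot, the box spans q1 to q3, the line inside the box marks the median, and flagged points are shown individually beyond the whiskers.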
Program:
import seaborn as sns
import matplotlib.pyplot as plt
# load the titanic example dataset through the seaborn library
df = sns.load_dataset('titanic')
# box plot of the fare column
box = sns.boxplot(x=df['fare'])
plt.show()
To draw the box plot, call boxplot() of the seaborn library.
box = sns.boxplot(x=df['fare'])
6. Scatter Plot
A scatter plot is a means to represent data in a graphical format. A simple scatter plot makes
use of the Coordinate axes to plot the points, based on their values and reveal the correlation
present between the variables.
plt.scatter(x,y)
Program:
import numpy
import matplotlib.pyplot as plt
x=numpy.random.normal(5.0,1.0,1000)
y=numpy.random.normal(10.0,2.0,1000)
plt.scatter(x,y)
plt.show()
7. Pair Plot
Pair plots are an easier way to draw scatter plots when there are more than two variables. They
can be drawn using the pairplot() method.
Program:
import seaborn as sns
import matplotlib.pyplot as plt
df=sns.load_dataset('tips')
sns.pairplot(df, hue='sex')
plt.show()
8. Correlation and Heatmap
Correlation measures the strength and direction of the linear relationship
between two continuous random variables x and y. A positive correlation means the
variables increase or decrease together. A negative correlation means that if one variable
increases, the other decreases.
Correlation values can be computed using the corr() method of the DataFrame and
rendered using a heatmap.
Program:
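# import modules
import matplotlib.pyplot as mp
import pandas as pd
import seaborn as sb
# assumption: the tips dataset from the pair plot program is used as the data
data = sb.load_dataset('tips')
# displaying heatmap of the correlations between the numeric columns
sb.heatmap(data.corr(numeric_only=True), annot=True)
mp.show()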
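The sign of the correlation can be checked on a small table before drawing a heatmap. The following is a minimal sketch with made-up columns: y rises with x, and z falls as x rises. The names data and corr are chosen only for this illustration.

```python
import pandas as pd

# Made-up data: y increases with x, z decreases as x increases.
data = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [10, 8, 6, 4, 2],
})

corr = data.corr()   # Pearson correlation is the default method
print(corr)
```

The entry for x and y is positive and the entry for x and z is negative. A heatmap colours each cell of this matrix according to its value.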
Result:
Thus the program to draw various plots on UCI datasets was executed successfully.
EX:NO:7
VISUALIZING GEOGRAPHICAL DATA WITH BASEMAP
Aim:
Basemap
Basemap is a tool for creating maps in Python in a simple way. It is
a matplotlib extension, so it has all of matplotlib's features for creating data visualizations, and
adds geographical projections and some datasets so that coastlines, countries
and so on can be plotted directly from the library.
Basemap has some documentation, but some things are harder to find. The notes
below extend the original documentation and examples and cover many of the
Basemap possibilities.
To install the Basemap package, use pip install basemap. It is imported with
from mpl_toolkits.basemap import Basemap.
Basemap methods
1. Draw counties
Draws the USA counties from the layer included with the library.
linestyle sets the line type. By default it is solid, but it can be dashed or any other matplotlib
option.
zorder sets the layer position. By default, the order is set by Basemap.
PROGRAM:
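from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
# only the corner coordinates are given; other arguments keep their defaults
map = Basemap(llcrnrlon=-93.,llcrnrlat=40.,urcrnrlon=-75.,urcrnrlat=50.)
map.drawmapboundary(fill_color='aqua')
map.fillcontinents(color='#cc9955', lake_color='aqua')
map.drawcounties()
plt.show()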
Draw countries
Draws the country borders from the layer included with the library.
linestyle sets the line type. By default it is solid, but it can be dashed or any other matplotlib
option.
color is k (black) by default and follows matplotlib conventions.
zorder sets the layer position. By default, the order is set by Basemap.
Note that:
The resolution indicated when creating the Basemap instance gives the layer a
finer or coarser resolution.
The coastline is drawn by another function, and coasts are not drawn as country borders,
which makes it necessary to combine this method with others to get a good map.
PROGRAM
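from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
# assumed map: the orthographic projection used in the Fill continents program
map = Basemap(projection='ortho', lat_0=0, lon_0=0)
map.drawmapboundary(fill_color='aqua')
map.fillcontinents(color='coral',lake_color='aqua')
map.drawcountries()
plt.show()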
Draw map boundary
color sets the edge color and is k (black) by default. It follows matplotlib
conventions.
fill_color sets the color that fills the globe and is None by default. It follows
matplotlib conventions.
zorder sets the layer position. By default, the order is set by Basemap.
PROGRAM
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
plt.figure(0)
map= Basemap(projection='ortho',lon_0=0,lat_0=0,resolution='c')
map.drawmapboundary()
plt.figure(1)
map= Basemap(projection='sinu',lon_0=0,resolution='c')
map.drawmapboundary(fill_color='aqua')
plt.show()
Orthographic projection result
Draw states
Draws the state borders of the countries in the Americas from the layer included with the library.
It also draws the Australian states.
linestyle sets the line type. By default it is solid, but it can be dashed or any other matplotlib
option.
zorder sets the layer position. By default, the order is set by Basemap.
Note that:
The resolution is fixed and does not depend on the resolution parameter passed to the
class constructor.
The country border is not drawn, which creates a strange effect if the method is not
combined with drawcountries.
PROGRAM
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
map = Basemap(width=12000000,height=9000000,
rsphere=(6378137.00,6356752.3142),\
resolution='l',area_thresh=1000.,projection='lcc',\
lat_1=45.,lat_2=55,lat_0=50,lon_0=-107.)
map.drawmapboundary(fill_color='aqua')
map.fillcontinents(color='#ddaa66', lake_color='aqua')
map.drawcountries()
map.drawstates(color='0.5')
plt.show()
Etopo
Plots a relief image called etopo, taken from NOAA. The image has a 1'' arc
resolution, so when zooming in, the results are quite poor.
The scale is useful to downgrade the original image resolution to speed up the process. A
value of 0.5 halves each side of the image, dividing the number of pixels by 4.
The image is warped to the final projection, so all projections work properly with this
method.
PROGRAM
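from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
# only the corner coordinates are given; other arguments keep their defaults
map = Basemap(llcrnrlon=-10.5,llcrnrlat=33,urcrnrlon=10.,urcrnrlat=46.)
map.etopo()
map.drawcoastlines()
plt.show()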
Fill continents
color sets the continent color. By default it is a gray color; any matplotlib
color option can be used.
lake_color sets the color of the lakes. By default the lakes are not drawn, but you may set
it to aqua to plot them blue.
zorder sets the position of the layer relative to others. It can be used to hide (or show)
a contourf layer that should appear only on the sea, for instance.
PROGRAM
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
map = Basemap(projection='ortho',
lat_0=0, lon_0=0)
map.drawmapboundary(fill_color='aqua')
map.fillcontinents(color='coral',lake_color='aqua')
map.drawcoastlines()
plt.show()
Shadedrelief
Plots a shaded relief image taken from the www.shadedrelief.com web page. The
original image size is 10800x5400 pixels, which is quite big.
The scale is useful to downgrade the original image resolution to speed up the
process. A value of 0.5 halves each side of the image, dividing the number of pixels by 4.
The image is warped to the final projection, so all projections work properly with
this method.
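The effect of the scale value on the image size is simple arithmetic and can be checked with a short sketch. The helper scaled_size is written only for this illustration and is not a Basemap function.

```python
def scaled_size(width, height, scale):
    """Return the image size after multiplying both sides by scale."""
    return int(width * scale), int(height * scale)


original = (10800, 5400)                 # size of the shaded relief image in pixels
reduced = scaled_size(*original, scale=0.5)

# Ratio between the original and the reduced pixel counts.
ratio = (original[0] * original[1]) / (reduced[0] * reduced[1])
print(reduced, ratio)
```

Halving both sides gives a 5400x2700 image with one quarter of the pixels, which is why the drawing is faster.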
PROGRAM
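from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
# only the corner coordinates are given; other arguments keep their defaults
map = Basemap(llcrnrlon=-10.5,llcrnrlat=33,urcrnrlon=10.,urcrnrlat=46.)
map.shadedrelief()
map.drawcoastlines()
plt.show()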
Warpimage
The image must be in latlon projection, so the x size must be double the y size.
The image must cover the whole world, with the longitude starting at -180.
PROGRAM
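from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
# assumed map: the orthographic projection used in the Fill continents program
map = Basemap(projection='ortho', lat_0=0, lon_0=0)
tmpdir = '/tmp'
map.warpimage(tmpdir+'/resized.png')  # resized.png must already exist in tmpdir
map.drawcoastlines()
plt.show()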
Result
Thus the various plots using Basemap have been drawn successfully.