
SRIRAM ENGINEERING COLLEGE

Perumalpattu, Thiruvallur Dist-602024

in partial fulfilment for the award of the degree

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING –


CYBER SECURITY

NOVEMBER / DECEMBER 2024

CS3361 DATA SCIENCE LABORATORY MANUAL

II YEAR – III SEMESTER

PREPARED BY

ESTHER PRAVEENA. S
SYLLABUS

COURSE OBJECTIVES:


 To understand the python libraries for data science
 To understand the basic Statistical and Probability measures for data science.
 To learn descriptive analytics on the benchmark data sets.
 To apply correlation and regression analytics on standard data sets.
 To present and interpret data using visualization packages in Python.

LIST OF EXPERIMENTS:
1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas
packages.
2. Working with Numpy arrays
3. Working with Pandas data frames
4. Reading data from text files, Excel and the web and exploring various commands for doing descriptive
analytics on the Iris data set.
5. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and
Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
6. Apply and explore various plotting functions on UCI data sets.
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three dimensional plotting
7. Visualizing Geographic Data with Basemap

COURSE OUTCOMES:

At the end of this course, the students will be able to:


CO1: Make use of the python libraries for data science
CO2: Make use of the basic Statistical and Probability measures for data science.
CO3: Perform descriptive analytics on the benchmark data sets.
CO4: Perform correlation and regression analytics on standard data sets.
CO5: Present and interpret data using visualization packages in Python.
TABLE OF CONTENTS

S.NO NAME OF EXPERIMENT

1. DOWNLOAD, INSTALL AND EXPLORE THE FEATURES OF NUMPY, SCIPY, JUPYTER, STATSMODELS AND PANDAS PACKAGES

2. WORKING WITH NUMPY ARRAYS

3. WORKING WITH PANDAS DATA FRAMES

4. READING DATA FROM TEXT FILES, EXCEL AND THE WEB AND EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE ANALYSIS ON THE IRIS DATA SET

5. PERFORM THE FOLLOWING ANALYSIS ON THE UCI AND PIMA INDIANS DIABETES DATA SETS: (i) UNIVARIATE ANALYSIS (ii) BIVARIATE ANALYSIS (iii) MULTIPLE REGRESSION ANALYSIS (iv) COMPARISON OF RESULTS

6. APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS: (i) NORMAL CURVES (ii) DENSITY AND CONTOUR PLOTS (iii) CORRELATION AND SCATTER PLOTS (iv) HISTOGRAMS (v) 3D PLOTTING

7. VISUALIZING GEOGRAPHIC DATA WITH BASEMAP

EX. NO: 1

Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages

Procedure

Setting up your machine for data science in Python

Download and Install Anaconda

Installing Anaconda on Windows

For problem solvers, I recommend installing and using the Anaconda distribution of Python. This
section details the installation of the Anaconda distribution of Python on Windows 10. I think
the Anaconda distribution of Python is the best option for problem solvers who want to use
Python. Anaconda is free (although the download is large and can take time) and can be
installed on school or work computers where you don't have administrator access or the ability
to install new programs. Anaconda comes bundled with about 600 packages pre-installed,
including NumPy, Matplotlib and SymPy.

Follow the steps below to install the Anaconda distribution of Python on Windows.

Steps:
1. Visit Anaconda.com/downloads
2. Select Windows
3. Download the .exe installer
4. Open and run the .exe installer
5. Open the Anaconda Prompt and run some Python code

1. Visit the Anaconda downloads page

Go to the following link: Anaconda.com/downloads

The Anaconda Downloads Page will look something like this:

2. Select Windows

Select Windows where the three operating systems are listed.

3. Download

Download the most recent Python 3 release. At the time of writing, the most recent
release was the Python 3.6 Version. Python 2.7 is legacy Python. For problem solvers, select
the Python 3.6 version. If you are unsure if your computer is running a 64-bit or 32-bit version
of Windows, select 64-bit as 64-bit Windows is most common.

You may be prompted to enter your email. You can still download Anaconda if you click [No
Thanks] and don't enter your Work Email address.

The download is quite large (over 500 MB), so it may take a while for Anaconda to
download.

4. Open and run the installer

Once the download completes, open and run the .exe installer

At the beginning of the install, you need to click Next to confirm the installation.

Then agree to the license.

At the Advanced Installation Options screen, I recommend that you do not check "Add
Anaconda to my PATH environment variable"

5. Open the Anaconda Prompt from the Windows start menu

After the installation of Anaconda is complete, you can go to the Windows start menu and
select the Anaconda Prompt.

This opens the Anaconda Prompt. Anaconda is the Python distribution and the Anaconda
Prompt is a command line shell (a program where you type in commands instead of using a
mouse). The black screen and text that makes up the Anaconda Prompt doesn't look like much,
but it is really helpful for problem solvers using Python. At the Anaconda prompt, type python
and hit [Enter]. The python command starts the Python interpreter, also called the Python REPL
(for Read Evaluate Print Loop).

> python

Note the Python version. You should see something like Python 3.6.1. With the interpreter
running, you will see a set of greater-than symbols >>> before the cursor.

Now you can type Python commands. Try typing import this. You should see the Zen of Python
by Tim Peters

To close the Python interpreter, type exit() at the prompt >>>. Note the parentheses at
the end of the exit() command; the () is needed to stop the Python interpreter and get back out
to the Anaconda Prompt.
To close the Anaconda Prompt, you can either close the window with the mouse, or type exit,
no parentheses necessary.
When you want to use the Python interpreter again, just click the Windows Start button and
select the Anaconda Prompt and type python.

2. Download and install common packages for data science in Python

 Click the link below to download an environment file. This file contains a list of
common packages and libraries for doing data science in Python. Remember where you
save the file environment.yml. You'll need that path shortly. You don't need to open that
file right now.
o Windows
o OSX
 Once the download finishes, open the command line by doing the following:

o Windows - Hit "Start" and then type "Command Prompt" and use that
terminal.
o OSX - Type Cmd+Space and then enter Terminal in the search box to open
the terminal.

 Run the following commands, which will install the package and put you in the
tutorial environment.

o conda env create -f <PATH_TO_ENVIRONMENT.YML> - You'll need to


replace <PATH_TO_ENVIRONMENT.YML> with the actual path where the
file was downloaded. For OSX, that's often
(/Users/<USERNAME>/Downloads/environment.yml). For Windows, it is
usually C:/Users/<USERNAME>/Downloads/environment.yml. You'll have to
replace <USERNAME> with your username on your machine.

 That will download a set of packages that are commonly used for data science in
Python. When it finishes, you can activate the environment with the following
command:
o Windows - activate tutorial
o OSX - source activate tutorial

3. Run Jupyter notebook!


In this step, we'll make sure everything is working by running the Jupyter Notebook.
Jupyter Notebook is a tool for doing interactive data science work in your browser.

 In your command prompt with the tutorial environment activated (you'll be able to tell
because your command prompt will say (tutorial) at the start of it), type the following
command: jupyter notebook
 A browser window will open, showing the Jupyter environment. By default, you will be
in a file browser view.
 In the file browser, find where you have a Jupyter notebook. If you don't have materials
for a course or tutorial that you have downloaded, you can download a sample Jupyter
notebook and then open it in the file browser.
 Click on one of the notebook (*.ipynb) files to get started!

4. To stop Jupyter notebook:

 Hit Ctrl+c to stop the Jupyter notebook server running on your machine. (Make sure
to use Ctrl+s in the notebook to save it first!)

Features of the Python packages:

Python Libraries for Data Processing and Modeling


1. Pandas
Pandas is a free Python software library for data analysis and data handling. It was
created as a community library project and initially released around 2008. Pandas provides
various high-performance and easy-to-use data structures and operations for manipulating data
in the form of numerical tables and time series. Pandas also has multiple tools for reading and
writing data between in-memory data structures and different file formats. In short, it is perfect
for quick and easy data manipulation, data aggregation, reading, and writing the data as well as
data visualization. Pandas can also take in data from different types of files such as CSV, Excel,
etc., or a SQL database, and create a Python object known as a data frame. A data frame contains
rows and columns and it can be used for data manipulation with operations such as join, merge,
groupby, concatenate etc.
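As a quick illustration of these ideas (a minimal sketch with made-up data, not part of the original manual), a data frame can be built from a dictionary and summarised with groupby:

import pandas as pd

# Build a small DataFrame from a dictionary (illustrative data only)
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [10, 7, 12, 5],
})

# groupby aggregation, one of the operations mentioned above
print(sales.groupby("region")["units"].sum())

# Reading and writing files works similarly, e.g. sales.to_csv("sales.csv")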

2. NumPy
NumPy is a free Python software library for numerical computing on data that can be
in the form of large arrays and multi-dimensional matrices. These multidimensional matrices are
the main objects in NumPy where their dimensions are called axes and the number of axes is
called a rank. NumPy also provides various tools to work with these arrays and high-level
mathematical functions to manipulate this data with linear algebra, Fourier transforms, random
number crunchings, etc. Some of the basic array operations that can be performed using NumPy
include adding, slicing, multiplying, flattening, reshaping, and indexing the arrays. Other
advanced functions include stacking the arrays, splitting them into sections, broadcasting
arrays, etc.
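A minimal sketch of the basic array operations mentioned above (the array values are arbitrary examples):

import numpy as np

a = np.arange(6).reshape(2, 3)      # a rank-2 array with axes of length 2 and 3

print(a[:, 1])                      # slicing: the second column
print(a * 2)                        # element-wise multiplication
print(a.flatten())                  # flattening to one dimension
print(np.vstack([a, a]).shape)      # stacking two copies -> (4, 3)
print(a + np.array([10, 20, 30]))   # broadcasting a row across the array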

3. SciPy
SciPy is a free software library for scientific computing and technical computing on the
data. It was created as a community library project and initially released around 2001. SciPy
library is built on the NumPy array object and it is part of the NumPy stack which also includes
other scientific computing libraries and tools such as Matplotlib, SymPy, pandas etc. This
NumPy stack has users who also use comparable applications such as MATLAB, GNU Octave,
Scilab, etc. SciPy allows for various scientific computing tasks that
handle data optimization, data integration, data interpolation, and data modification using linear
algebra, Fourier transforms, random number generation, special functions, etc. Just like NumPy,
the multidimensional matrices are the main objects in SciPy, which are provided by the NumPy
module itself.
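A minimal sketch of two of these tasks, optimization and integration (the function being minimised and the integrand here are arbitrary examples):

import numpy as np
from scipy import optimize, integrate

# Minimise a simple one-variable function
res = optimize.minimize_scalar(lambda x: (x - 2) ** 2)
print(res.x)                        # close to 2

# Numerically integrate sin(x) from 0 to pi
value, error = integrate.quad(np.sin, 0, np.pi)
print(value)                        # close to 2.0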

4. Scikit-learn
Scikit-learn is a free software library for Machine Learning coding primarily in the
Python programming language. It was initially developed as a Google Summer of Code project
by David Cournapeau and originally released in June 2007. Scikit-learn is built on top of other
Python libraries like NumPy, SciPy, Matplotlib, Pandas, etc. and so it provides full
interoperability with these libraries. While Scikit-learn is written mainly in Python, it has also
used Cython to write some core algorithms in order to improve performance. You can
implement various supervised and unsupervised machine learning models such as
Classification, Regression, Support Vector Machines, Random Forests, Nearest Neighbors,
Naive Bayes, Decision Trees, Clustering, etc. with Scikit-learn.
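A minimal sketch of a supervised model in Scikit-learn, using its bundled Iris data (the choice of classifier here is just an example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)   # train a classification model
print(clf.score(X_test, y_test))                       # accuracy on held-out data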

5. TensorFlow
TensorFlow is a free end-to-end open-source platform that has a wide variety of tools,
libraries, and resources for Artificial Intelligence. It was developed by the Google Brain team
and initially released on November 9, 2015. You can easily build and train Machine Learning
models with high-level API’s such as Keras using TensorFlow. It also provides multiple levels
of abstraction so you can choose the option you need for your model. TensorFlow also allows
you to deploy Machine Learning models anywhere such as the cloud, browser, or your own
device. You should use TensorFlow Extended (TFX) if you want the full experience,
TensorFlow Lite if you want usage on mobile devices, and TensorFlow.js if you want to train
and deploy models in JavaScript environments. TensorFlow is available for Python and C APIs
and also for C++, Java, JavaScript, Go, Swift, etc. but without an API backward compatibility
guarantee. Third-party packages are also available for MATLAB, C#, Julia, Scala, R, Rust, etc.
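A minimal sketch of defining a model with the high-level Keras API (the layer sizes and input shape here are arbitrary, and real training data would be needed to fit it):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=10) would train it on real data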

6. Keras
Keras is a free and open-source neural-network library written in Python. It was
primarily created by François Chollet, a Google engineer, and initially released on 27 March
2015. Keras was created to be user friendly, extensible, and modular while being supportive of
experimentation in deep neural networks. Hence, it can be run on top of other libraries and
languages like TensorFlow, Theano, Microsoft Cognitive Toolkit, R, etc. Keras has multiple
tools that make it easier to work with different types of image and textual data for coding in
deep neural networks. It also has various implementations of the
building blocks for neural networks such as layers, optimizers, activation functions, objectives,
etc. You can perform various actions using Keras such as creating custom function layers,
writing functions with repeating code blocks that are multiple layers deep, etc.

Python Libraries for Data Visualization


1. Matplotlib
Matplotlib is a data visualization library and 2-D plotting library of Python. It was
initially released in 2003 and it is the most popular and widely-used plotting library in the
Python community. It comes with an interactive environment across multiple platforms.
Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook,
web application servers etc. It can be used to embed plots into applications using various GUI
toolkits like Tkinter, GTK+, wxPython, Qt, etc. So you can use Matplotlib to create plots,
bar charts, pie charts, histograms, scatterplots, error charts, power spectra, stemplots, and
whatever other visualization charts you want! The Pyplot module also provides a MATLAB-
like interface that is just as versatile and useful as MATLAB while being totally free and open
source.
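A minimal sketch of a Matplotlib plot through the Pyplot interface (the data here is generated just for the example):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

plt.plot(x, np.sin(x), label="sin(x)")   # a simple line plot
plt.xlabel("x")
plt.ylabel("y")
plt.title("A simple Matplotlib line plot")
plt.legend()
plt.show()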

2. Seaborn
Seaborn is a Python data visualization library that is based on Matplotlib and closely
integrated with the numpy and pandas data structures. Seaborn has various dataset-oriented
plotting functions that operate on data frames and arrays that have whole datasets within them.
Then it internally performs the necessary statistical aggregation and mapping functions to
create informative plots that the user desires. It is a high-level interface for creating beautiful
and informative statistical graphics that are integral to exploring and understanding data. The
Seaborn data graphics can include bar charts, pie charts, histograms, scatterplots, error charts,
etc. Seaborn also has various tools for choosing color palettes that can reveal patterns in the
data.
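A minimal sketch of a dataset-oriented Seaborn plot; it assumes an internet connection, since load_dataset() fetches Seaborn's small example "tips" dataset:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")          # a small example dataset shipped by Seaborn

# Seaborn aggregates internally and plots the mean total_bill per day
sns.barplot(x="day", y="total_bill", data=tips)
plt.show()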

3. Plotly
Plotly is a free open-source graphing library that can be used to form data visualizations.
Plotly (plotly.py) is built on top of the Plotly JavaScript library (plotly.js) and can be used to
create web-based data visualizations that can be displayed in Jupyter notebooks or web
applications using Dash or saved as individual HTML files. Plotly
provides more than 40 unique chart types like scatter plots, histograms, line charts, bar charts,
pie charts, error bars, box plots, multiple axes, sparklines, dendrograms, 3-D charts, etc. Plotly
also provides contour plots, which are not that common in other data visualization libraries.
In addition to all this, Plotly can be used offline with no internet connection.
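A minimal sketch using the high-level plotly.express interface (the data points are arbitrary examples):

import plotly.express as px

fig = px.scatter(x=[1, 2, 3, 4], y=[10, 11, 12, 13],
                 title="A simple Plotly scatter plot")
fig.show()                        # renders in a notebook or browser
# fig.write_html("plot.html")     # or save as a standalone HTML file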

4. GGplot
Ggplot is a Python data visualization library that is based on the implementation of
ggplot2 which is created for the programming language R. Ggplot can create data visualizations
such as bar charts, pie charts, histograms, scatterplots, error charts, etc. using high-level API. It
also allows you to add different types of data visualization components or layers in a single
visualization. Once ggplot has been told which variables to map to which aesthetics in the plot,
it does the rest of the work so that the user can focus on interpreting the visualizations and take
less time in creating them. But this also means that it is not possible to create highly customised
graphics in ggplot. Ggplot is also deeply connected with pandas so it is best to keep the data in
DataFrames.

EX NO: 2
Working with NumPy arrays

NUMPY:

NumPy is a Python library used for working with arrays. It also has functions for
working in the domain of linear algebra, Fourier transform, and matrices. NumPy was created in
2005 by Travis Oliphant. It is an open source project and you can use it freely. NumPy stands
for Numerical Python.

It is a general-purpose array-processing package. It provides a high-performance


multidimensional array object, and tools for working with these arrays.

It is the fundamental package for scientific computing with Python. It contains


various features including these important ones:

 A powerful N-dimensional array object


 Sophisticated (broadcasting) functions
 Tools for integrating C/C++ and Fortran code
 Useful linear algebra, Fourier transform, and random number capabilities

ALGORITHM

Step 1: Start
Step 2: Import numpy module
Step 3: Print the basic characteristics of array
Step 4: Stop

PROGRAM

import numpy as np
# Creating array object
arr = np.array( [[ 1, 2, 3], [ 4, 2, 5]] )
# Printing type of arr object
print("Array is of type: ", type(arr))

# Printing array dimensions (axes)
print("No. of dimensions: ", arr.ndim)
# Printing shape of array
print("Shape of array: ", arr.shape)
# Printing size (total number of elements) of array
print("Size of array: ", arr.size)
# Printing type of elements in array
print("Array stores elements of type: ", arr.dtype)

OUTPUT
Array is of type: <class 'numpy.ndarray'>
No. of dimensions: 2
Shape of array: (2, 3)
Size of array: 6
Array stores elements of type: int32

EX NO 3
Working with Pandas data frames

PANDAS:

Pandas is a Python library used to analyze data. A Pandas DataFrame is a 2-dimensional data
structure, like a 2-dimensional array, or a table with rows and columns. In Pandas, a DataFrame
can be created from lists, a dictionary, a list of dictionaries, etc.

ALGORITHM
Step1: Start
Step2: import numpy and pandas module
Step3: Create a dataframe using list of elements
Step4: Print the output
Step5: Stop

PROGRAM
# import pandas as pd
import pandas as pd
# list of strings
lst = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)

OUTPUT
0 A
1 B
2 C
3 D
4 E
5 F
6 G
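As noted above, a DataFrame can also be created from a dictionary or from a list of dictionaries. A minimal sketch (the column names and values are made up for the example):

import pandas as pd

# From a dictionary of equal-length lists
df1 = pd.DataFrame({"Name": ["Asha", "Bala"], "Marks": [88, 92]})

# From a list of dictionaries (missing keys become NaN)
df2 = pd.DataFrame([{"Name": "Asha", "Marks": 88}, {"Name": "Bala"}])

print(df1)
print(df2)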

Ex. NO.: 4
Reading data from text files, Excel and the web and exploring various
commands for doing descriptive analysis on the Iris dataset

Procedure:
Exploratory Data Analysis (EDA) is a technique to analyze data using some visual
techniques. With this technique, we can get detailed information about the statistical summary
of the data. We will also be able to deal with duplicate values and outliers, and also see some
trends or patterns present in the dataset.

Now let’s see a brief about the Iris dataset.

Iris Dataset
The Iris dataset is considered the Hello World of data science. It contains five columns,
namely Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a
flowering plant; researchers have measured various features of the different iris flowers and
recorded them digitally.

Note: This dataset can be downloaded from https://datahub.io/machine-learning/iris

You can download the Iris.csv file from the above link. Now we will use the Pandas
library to load this CSV file, and we will convert it into the dataframe. read_csv() method is
used to read CSV files.

Code:
import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
# Printing top 5 rows
df.head()

Output:
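The experiment title also calls for reading data from plain-text files, Excel workbooks and the web. A minimal sketch is shown below; the file names and URL are placeholders, not files supplied with this manual, and read_excel() needs an engine such as openpyxl installed:

import pandas as pd

# Plain-text file on disk (placeholder name); sep chooses the delimiter
df_txt = pd.read_csv("data.txt", sep="\t")

# Excel workbook (placeholder name)
df_xlsx = pd.read_excel("data.xlsx", sheet_name=0)

# Reading directly from the web: read_csv() accepts a URL (placeholder URL)
df_web = pd.read_csv("https://example.com/iris.csv")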

Getting Information about the Dataset


We will use the shape attribute to get the shape of the dataset.

Code:

df.shape

Output:

(150, 6)

We can see that the dataframe contains 6 columns and 150 rows.

Now, let's also look at the columns and their data types. For this, we will use the info() method.

Code:

df.info()

Output:

We can see that only one column has categorical data and all the other columns are
of the numeric type with non-Null entries.

Let’s get a quick statistical summary of the dataset using the describe() method.
The describe() function applies basic statistical computations on the dataset like extreme
values, count of data points, standard deviation, etc. Any missing value or NaN value is
automatically skipped. The describe() function gives a good picture of the distribution of data.

Code:

df.describe()

Output:

We can see the count of each column along with their mean value, standard deviation,
minimum and maximum values.

Checking Missing Values


We will check if our data contains any missing values or not. Missing values can occur
when no information is provided for one or more items or for a whole unit. We will use the
isnull() method.
Code:
df.isnull().sum()
Output:

We can see that no column has any missing values.

Checking Duplicates
Let’s see if our dataset contains any duplicates or not.
Pandas drop_duplicates() method helps in removing duplicates from the data frame.
Code:
data = df.drop_duplicates(subset ="Species",)

data

Output:

We can see that there are only three unique species. Let’s see if the dataset is balanced
or not i.e. all the species contain equal amounts of rows or not. We will use the
Series.value_counts() function. This function returns a Series containing counts of unique
values.

Code:

df.value_counts("Species")

output:

We can see that all the species contain an equal amount of rows, so we should not delete
any entries.

Data Visualization
Visualizing the target column
Our target column will be the Species column because at the end we will need the result
according to the species only. Let’s see a countplot for species.

Code:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='Species', data=df, )
plt.show()

Output:

Relation between variables

We will see the relationship between the sepal length and sepal width and also between
petal length and petal width.

Example 1: Comparing Sepal Length and Sepal Width

Code:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm',
hue='Species', data=df, )
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()

Output:

From the above plot, we can infer that –

 Species Setosa has smaller sepal lengths but larger sepal widths.
 Versicolor Species lies in the middle of the other two species in terms of sepal length
and width
 Species Virginica has larger sepal lengths but smaller sepal widths.

Example 2: Comparing Petal Length and Petal Width

Code:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x='PetalLengthCm', y='PetalWidthCm',
hue='Species', data=df, )
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()

output:

From the above plot, we can infer that –

 Species Setosa has smaller petal lengths and widths.


 Versicolor Species lies in the middle of the other two species in terms of petal length
and width
 Species Virginica has the largest of petal lengths and widths.
Let’s plot all the column’s relationships using a pairplot. It can be used for multivariate
analysis.

Code:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df.drop(['Id'], axis = 1),
hue='Species', height=2)

Output:

We can see many types of relationships from this plot, such as that the species Setosa has the
smallest petal widths and lengths. It also has the smallest sepal length but larger sepal
widths. Such information can be gathered about any other species.

Histograms

Histograms allow seeing the distribution of data for various columns. They can be used for uni-
as well as bi-variate analysis.

Code:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(10,10))
axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df['SepalLengthCm'], bins=7)
axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df['SepalWidthCm'], bins=5);
axes[1,0].set_title("Petal Length")
axes[1,0].hist(df['PetalLengthCm'], bins=6);

axes[1,1].set_title("Petal Width")
axes[1,1].hist(df['PetalWidthCm'], bins=6);
Output:

From the above plot, we can see that –

 The highest frequency of the sepal length is between 30 and 35 which is between 5.5
and 6
 The highest frequency of the sepal Width is around 70 which is between 3.0 and 3.5
 The highest frequency of the petal length is around 50 which is between 1 and 2
 The highest frequency of the petal width is between 40 and 50 which is between 0.0 and
0.5

Histograms with Distplot Plot

Distplot is used basically for the univariate set of observations and visualizes it through a
histogram, i.e. only one observation, and hence we choose one particular column of the dataset.

Code:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
plot = sns.FacetGrid(df, hue="Species")

plot.map(sns.distplot, "SepalLengthCm").add_legend()
plot = sns.FacetGrid(df, hue="Species")
plot.map(sns.distplot, "SepalWidthCm").add_legend()
plot = sns.FacetGrid(df, hue="Species")
plot.map(sns.distplot, "PetalLengthCm").add_legend()
plot = sns.FacetGrid(df, hue="Species")
plot.map(sns.distplot, "PetalWidthCm").add_legend()
plt.show()

Output:

From the above plots, we can see that –

 In the case of Sepal Length, there is a huge amount of overlapping.


 In the case of Sepal Width also, there is a huge amount of overlapping.
 In the case of Petal Length, there is a very little amount of overlapping.
 In the case of Petal Width also, there is a very little amount of overlapping.
So we can use Petal Length and Petal Width as the classification feature.

Handling Correlation

Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the dataframe.
Any NA values are automatically excluded. Any non-numeric data type columns in the
dataframe are ignored.

Code:
data.corr(method='pearson')

Output:

Heatmaps

The heatmap is a data visualization technique that is used to analyze the dataset as colors in
two dimensions. Basically, it shows a correlation between all numerical variables in the dataset.
In simpler terms, we can plot the above-found correlation using the heatmaps.

Code:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.corr(method='pearson').drop(['Id'], axis=1).drop(['Id'], axis=0), annot=True)
plt.show()

Output:

From the above graph, we can see that –

 Petal width and petal length have high correlations.


 Petal length and sepal width have good correlations.
 Petal Width and Sepal length have good correlations.

Box Plots
We can use boxplots to see how the categorical values are distributed with the other numerical
values.

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

def graph(y):
    sns.boxplot(x="Species", y=y, data=df)

plt.figure(figsize=(10, 10))

# Adding the subplots at the specified grid positions
plt.subplot(221)
graph('SepalLengthCm')
plt.subplot(222)
graph('SepalWidthCm')
plt.subplot(223)
graph('PetalLengthCm')
plt.subplot(224)
graph('PetalWidthCm')
plt.show()

Output:

From the above graph, we can see that –

 Species Setosa has the smallest features and less distributed with some outliers.
 Species Versicolor has the average features.
 Species Virginica has the highest features

EX NO:5
Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the
following:

a. Univariate analysis: frequency, Mean, Median, Mode, Variance,


Standard Deviation, Skewness and Kurtosis
b. Bivariate analysis: Linear and logistic regression modelling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets
a. Univariate analysis: frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis

Source Code

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

df = pd.read_csv('D:/Studentdetails.csv')

def UVA_numeric(data):
    var_group = data.columns
    size = len(var_group)
    plt.figure(figsize=(7 * size, 3), dpi=400)
    for j, i in enumerate(var_group):
        # descriptive statistics of the variable
        mini = data[i].min()
        maxi = data[i].max()
        ran = data[i].max() - data[i].min()
        mean = data[i].mean()
        median = data[i].median()
        st_dev = data[i].std()
        skew = data[i].skew()
        kurt = data[i].kurtosis()
        # points one standard deviation either side of the mean
        points = mean - st_dev, mean + st_dev
        plt.subplot(1, size, j + 1)
        sns.distplot(data[i], hist=True, kde=True)
        sns.lineplot(points, [0, 0], color='black', label="std_dev")
        sns.scatterplot([mini, maxi], [0, 0], color='orange', label="min/max")
        sns.scatterplot([mean], [0], color='red', label="mean")
        sns.scatterplot([median], [0], color='blue', label="median")
        plt.xlabel('{}'.format(i), fontsize=20)
        plt.ylabel('density')
        plt.title('std_dev = {}; kurtosis = {};\nskew = {}; range = {}\nmean = {}; median = {}'.format(
            (round(points[0], 2), round(points[1], 2)),
            round(kurt, 2),
            round(skew, 2),
            (round(mini, 2), round(maxi, 2), round(ran, 2)),
            round(mean, 2),
            round(median, 2)))

UVA_numeric(df)

Output:
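The same univariate measures can also be computed directly with Pandas. A minimal sketch, assuming a local copy of the diabetes data with a numeric column named 'Glucose':

import pandas as pd

df = pd.read_csv("diabetes.csv")        # assumed local copy of the data set
col = df["Glucose"]                     # assumed numeric column

print("frequency:\n", col.value_counts().head())
print("mean:", col.mean())
print("median:", col.median())
print("mode:", col.mode()[0])
print("variance:", col.var())
print("standard deviation:", col.std())
print("skewness:", col.skew())
print("kurtosis:", col.kurtosis())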

b. Bivariate analysis: Linear and logistic regression modeling

Linear Regression

Code:

# Import the libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
diabetes = datasets.load_diabetes()
diabetes

Output:

Code:

print(diabetes.DESCR)
Output:

Code:

# columns

diabetes.feature_names

Output:

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

Code:

# Now we will split the data into the independent and dependent variables
X = diabetes.data
Y = diabetes.target
X.shape, Y.shape
Output:

((442, 10), (442,))

Code:

Y
Output:

Code:

# We will split the data into training and testing data


from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X,Y,test_size=0.3,random_state=99)
train_x.shape, train_y.shape

Output:

((309, 10), (309,))

Code

# Linear Regression
from sklearn.linear_model import LinearRegression

le = LinearRegression()
le.fit(train_x, train_y)
y_pred = le.predict(test_x)
y_pred

Output:

Code:

result = pd.DataFrame({'Actual': test_y, 'Predict' : y_pred})


result

output:

Code:

# we will check the accuracy

print('coefficient', le.coef_)

print('intercept', le.intercept_)

Output

coefficient [ 40.66018999 -313.29560706 517.1785363 386.06685795 -604.64498104


275.32058758 3.91393457 172.38010275 661.95935148 62.25715134]
intercept 155.59114167162846

Code:

from sklearn.metrics import mean_squared_error, r2_score

# mean squared error
print(mean_squared_error(test_y, y_pred))
# r2 score
print(r2_score(test_y, y_pred))

Output:

3157.9566009965824
0.4545737971700595

Logistic Regression Modelling

Code:

import numpy as np # linear algebra


import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('/kaggle/input/diabetes-dataset/diabetes2.csv')
df.head()

Output:

Code

df.info()

Output:

Code:

df.describe()

Output:

Code:

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Output:

Code:

sns.countplot(x='Outcome',data=df)

Output:

Code:

sns.distplot(df['Age'].dropna(),kde=True)

Output:

Code:

df.corr()

Output:

Code:

sns.heatmap(df.corr())

Output:

Code:

x = df.drop('Outcome', axis=1)
y = df['Outcome']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(x_train, y_train)
Output:

Code:

predictions = logmodel.predict(x_test)

from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

Output:

Code:

from sklearn.metrics import confusion_matrix


confusion_matrix(y_test,predictions)

Output:

array([[136, 14], [ 36, 45]])

c. Multiple Regression analysis

Code:

import numpy as np
from sklearn.linear_model import LinearRegression
x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]]
y = [4, 5, 20, 14, 32, 22, 38, 43]
x,y=np.array(x),np.array(y)
model=LinearRegression().fit(x,y)
r_sq = model.score(x,y)
print('coefficient of determination:', r_sq)

print('intercept:', model.intercept_)
print('slope:', model.coef_)
y_pred = model.predict(x)

print('predicted response:', y_pred)

Output:
coefficient of determination: 0.8615939258756775
intercept: 5.52257927519819
slope: [0.44706965 0.25502548]
predicted response: [ 5.77760476 8.012953
12.73867497 17.9744479 23.97529728 29.4660957
38.78227633 41.27265006]

d. Comparison of the data sets

Code:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

# disable warnings

import warnings

warnings.filterwarnings('ignore')

code:

data=pd.read_csv('D:/diabetes.csv')

data1=pd.read_csv('D:/pima_diabetes.csv')

code:

print(data.columns)

print(data1.columns)

Output:

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

code:

import seaborn as sns

corrmat = data.corr()

f, ax = plt.subplots(figsize=(12, 9))

sns.heatmap(corrmat, cbar=True, annot=True, square=True, vmax=.8);


import seaborn as sns

corrmat = data1.corr()

f, ax = plt.subplots(figsize=(12, 9))

sns.heatmap(corrmat, cbar=True, annot=True, square=True, vmax=.8);

Output:

code:

sns.set()

cols=['Pregnancies','Glucose','BloodPressure','Insulin','BMI','DiabetesPedigreeFunction','Age',
'Outcome']

sns.pairplot(data[cols], size = 2.5)

plt.show();

sns.set()

cols=['Pregnancies','Glucose','BloodPressure','Insulin','BMI','DiabetesPedigreeFunction','Age',
'Outcome']

sns.pairplot(data1[cols], size = 2.5)

plt.show()

Output:

EX. No. :6

Apply and explore various plotting functions

a. Normal curves

b. Density and contour plot

c. Correlation and Scatter plots

d. Histogram

e. Three dimensional Plotting

a. Normal curves

Program:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

df = pd.read_csv('Marks.csv')

def UVA_numeric(data):
    var_group = data.columns
    size = len(var_group)
    plt.figure(figsize=(7 * size, 3), dpi=400)
    # looping for each variable
    for j, i in enumerate(var_group):
        # calculating descriptives of variable
        mini = data[i].min()
        maxi = data[i].max()
        ran = data[i].max() - data[i].min()
        mean = data[i].mean()
        median = data[i].median()
        st_dev = data[i].std()
        skew = data[i].skew()
        kurt = data[i].kurtosis()
        # calculating points of standard deviation
        points = mean - st_dev, mean + st_dev
        # plotting the variable with every information
        plt.subplot(1, size, j + 1)
        sns.distplot(data[i], hist=True, kde=True)
        sns.lineplot(points, [0, 0], color='black', label="std_dev")
        sns.scatterplot([mini, maxi], [0, 0], color='orange', label="min/max")
        sns.scatterplot([mean], [0], color='red', label="mean")
        sns.scatterplot([median], [0], color='blue', label="median")
        plt.xlabel('{}'.format(i), fontsize=20)
        plt.ylabel('density')
        plt.title('std_dev = {}; kurtosis = {};\nskew = {}; range = {}\nmean = {}; median = {}'.format(
            (round(points[0], 2), round(points[1], 2)),
            round(kurt, 2),
            round(skew, 2),
            (round(mini, 2), round(maxi, 2), round(ran, 2)),
            round(mean, 2),
            round(median, 2)))

UVA_numeric(df)

Output:

b. Density and contour plot

Program:

from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

data = np.arange(1, 10, 0.01)
pdf = norm.pdf(data, loc=5.3, scale=1)
sb.set_style('whitegrid')
sb.lineplot(data, pdf, color='black')
plt.xlabel('Heights')
plt.ylabel('Probability Density')

Output:

Text(0, 0.5, 'Probability Density')
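The program above covers the density curve; a contour plot can be drawn with Matplotlib's contour()/contourf() functions. A minimal sketch over a grid of an arbitrary two-variable function (not tied to any dataset in this manual):

import numpy as np
import matplotlib.pyplot as plt

# Evaluate z = f(x, y) on a regular grid
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = np.exp(-(X ** 2 + Y ** 2))       # a simple bell-shaped surface

plt.contourf(X, Y, Z, levels=20, cmap='viridis')                  # filled contours
plt.colorbar(label='z')
plt.contour(X, Y, Z, levels=10, colors='black', linewidths=0.5)   # contour lines
plt.title('Density and contour plot of exp(-(x^2 + y^2))')
plt.show()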

c. Correlation and scatterplot


code:

import pandas as pd
con = pd.read_csv('Data/ConcreteStrength.csv')
con

Output:

Renaming columns
Recall that the column names in the "ConcreteStrength" file are problematic: they are too long to
type repeatedly, have spaces, and include special characters like ".". Although we could change
the name of the columns in the underlying spreadsheet before importing, it is generally more
practical/less work/less risk to leave the organization's spreadsheets and files as they are and
write some code to fix things prior to analysis. In this way, you do not have to start over when
an updated version of the data is handed to you. Let's start by listing the column names. A Pandas
DataFrame object exposes a list of columns through the columns property. Here I use the
list() type conversion method to convert the results to a simple list (which prints nicer):
Code:
list(con.columns)
Output:

The rename() method for data frames is straightforward. Here I define a standard Python dictionary (of the
form {key1: value1, key2: value2, ...}) and assign it to the "columns" axis. Remember that the
inplace=True argument is required if we want to make changes to the underlying data frame.

Code:
con.rename(columns={'Fly ash': 'FlyAsh', 'Coarse Aggr.': 'CoarseAgg', 'Fine Aggr.': 'FineAgg',
                    'Air Entrainment': 'AirEntrain', 'Compressive Strength (28-day)(Mpa)': 'Strength'},
           inplace=True)
con.head()

Output:

As before, we should convert any obvious categorical variables to categories:


Code:
con['AirEntrain'] = con['AirEntrain'].astype('category')
con.describe(include='category')
Output:

Scatterplots

Scatterplots are a fundamental graph type—much less complicated than histograms and
boxplots. As such, we might use the Matplotlib library instead of the Seaborn library. But
since we have already used Seaborn, I will stick with it here. Just know that there are many
ways to create scatterplots and other basic graphs in Python.
To create a bare-bones scatterplot, we must do four things:
1. Load the seaborn library
2. Specify the source data frame

3. Set the x axis, which is generally the name of a predictor/independent variable
4. Set the y axis, which is generally the name of a response/dependent variable

Code:

import seaborn as sns


sns.scatterplot(x="FlyAsh", y="Strength", data=con);

Output:

Adding labels
To this point, we have not said much about decorating Seaborn charts with labels and other
details. This is because Seaborn does a pretty good job by default. But if we do need to clean
up our charts a bit, here is the thing to know: the Seaborn chart methods return an object (of
type AxesSubplot, whatever that is) for which properties can be set.
Here I assign the results of the scatterplot() call to a variable called ax and then set various
properties of ax. I end the last line of the code block with a semicolon to suppress return values:

Code:

ax = sns.scatterplot(x="FlyAsh", y="Strength", data=con)


ax.set_title("Concrete Strength vs. Fly ash")
ax.set_xlabel("Fly ash")

Output:

Adding a best fit line

As we saw with SAS Enterprise Guide and R, it is sometimes useful to add a best fit line (with
confidence intervals around the slope) to a scatterplot. But let’s be clear: this is not one of these
situations. It is obvious from the scatterplot above that the relationship between concrete
strength and fly ash is only weakly linear.
The easiest way to “add” a best-fit line to a scatterplot is to use a different plotting method.
Seaborn’s lmplot() method (where “lm” stands for “linear model”) is one possibility:
Code:
sns.lmplot(x="FlyAsh", y="Strength", data=con);
Output:

Coefficient of correlation
Code:

from scipy import stats


stats.pearsonr(con['Strength'], con['FlyAsh'])
Output:
(0.4063870105954507, 2.0500713273946373e-05)

Correlation matrix

Code:

cormat = con.corr()
round(cormat,2)

Output:

d. Histogram

Code:

import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_excel('https://github.com/datagy/Intro-to-Python/raw/master/sportsdata.xls', usecols=['Age'])
print(df.describe())
plt.hist(df['Age'])

Output:

e. Three dimensional plotting

Code:
# importing mplot3d toolkits
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
# syntax for 3-D projection
ax = plt.axes(projection ='3d')
# defining axes
z = np.linspace(0, 1, 100)
x = z * np.sin(25 * z)
y = z * np.cos(25 * z)
c=x+y
ax.scatter(x, y, z, c = c)
# syntax for plotting
ax.set_title('3d Scatter plot geeks for geeks')
plt.show()

Output:

Code:

# importing libraries
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
# defining surface and axes
x = np.outer(np.linspace(-2, 2, 10), np.ones(10))
y = x.copy().T
z = np.cos(x ** 2 + y ** 3)
fig = plt.figure()
# syntax for 3-D plotting
ax = plt.axes(projection ='3d')
# syntax for plotting
ax.plot_surface(x, y, z, cmap ='viridis', edgecolor ='green')
ax.set_title('Surface plot geeks for geeks')
plt.show()

Output:

EX. NO.: 7

Visualizing Geographic Data with Basemap

Procedures
Geographic data
One common type of visualization in data science is that of geographic data. Matplotlib's main
tool for this type of visualization is the Basemap toolkit, which is one of several Matplotlib
toolkits which lives under the mpl_toolkits namespace. Admittedly, Basemap feels a bit clunky
to use, and often even simple visualizations take much longer to render than you might hope.
More modern solutions such as leaflet or the Google Maps API may be a better choice for more
intensive map visualizations. Still, Basemap is a useful tool for Python users to have in their
virtual toolbelts. In this section, we'll show several examples of the type of map visualization
that is possible with this toolkit.

Installation of Basemap is straightforward; if you're using conda you can type this and the
package will be downloaded:

$ conda install basemap

We add just a single new import to our standard boilerplate:


Code:
%matplotlib inline

import numpy as np

import matplotlib.pyplot as plt


from mpl_toolkits.basemap import Basemap
Once you have the Basemap toolkit installed and imported, geographic plots are just a few lines
away (the graphics in the following also requires the PIL package in Python 2, or the pillow
package in Python 3):
plt.figure(figsize=(8, 8))

m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)

m.bluemarble(scale=0.5)

output:

The meaning of the arguments to Basemap will be discussed momentarily.


The useful thing is that the globe shown here is not a mere image; it is a fully-functioning
Matplotlib axes that understands spherical coordinates and which allows us to easily overplot
data on the map! For example, we can use a different map projection, zoom-in to North America
and plot the location of Seattle. We'll use an etopo image (which shows topographical features
both on land and under the ocean) as the map background:
Code:
fig = plt.figure(figsize=(8, 8))

m = Basemap(projection='lcc', resolution=None,

width=8E6, height=8E6,

lat_0=45, lon_0=-100,)

m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting

x, y = m(-122.3, 47.6)

plt.plot(x, y, 'ok', markersize=5)

plt.text(x, y, ' Seattle', fontsize=12)

output:

This gives you a brief glimpse into the sort of geographic visualizations that are possible with
just a few lines of Python. We'll now discuss the features of Basemap in more depth, and
provide several examples of visualizing map data. Using these brief examples as building
blocks, you should be able to create nearly any map visualization that you desire.

Map Projections

The first thing to decide when using maps is what projection to use. You're probably familiar
with the fact that it is impossible to project a spherical map, such as that of the Earth, onto a
flat surface without somehow distorting it or breaking its continuity. These projections have
been developed over the course of human history, and there are a lot of choices! Depending on
the intended use of the map projection, there are certain map features (e.g., direction, area,
distance, shape, or other considerations) that are useful to maintain.

The Basemap package implements several dozen such projections, all referenced by a short
format code. Here we'll briefly demonstrate some of the more common ones.

We'll start by defining a convenience routine to draw our world map along with the longitude
and latitude lines:

code

from itertools import chain

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)

    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))

    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)

    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')

Cylindrical projections
The simplest of map projections are cylindrical projections, in which lines of constant latitude
and longitude are mapped to horizontal and vertical lines, respectively. This type of mapping
represents equatorial regions quite well, but results in extreme distortions near the poles. The
spacing of latitude lines varies between different cylindrical projections, leading to different
conservation properties, and different distortion near the poles. In the following figure we show
an example of the equidistant cylindrical projection, which chooses a latitude scaling that
preserves distances along meridians. Other cylindrical projections are the Mercator
(projection='merc') and the cylindrical equal area (projection='cea') projections.

Code:

fig = plt.figure(figsize=(8, 6), edgecolor='w')

m = Basemap(projection='cyl', resolution=None,

llcrnrlat=-90, urcrnrlat=90,

llcrnrlon=-180, urcrnrlon=180, )

draw_map(m)

Output:

The additional arguments to Basemap for this view specify the latitude (lat) and longitude (lon)
of the lower-left corner (llcrnr) and upper-right corner (urcrnr) for the desired map, in units of
degrees.

Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians (lines of constant
longitude) remain vertical; this can give better properties near the poles of the projection. The
Mollweide projection (projection='moll') is one common example of this, in which all
meridians are elliptical arcs. It is constructed so as to preserve area across the map: though there
are distortions near the poles, the area of small patches reflects the true area. Other pseudo-
cylindrical projections are the sinusoidal (projection='sinu') and Robinson (projection='robin')
projections.

Code:

fig = plt.figure(figsize=(8, 6), edgecolor='w')

m = Basemap(projection='moll', resolution=None,

lat_0=0, lon_0=0)

draw_map(m)

Output:

The extra arguments to Basemap here refer to the central latitude (lat_0) and longitude (lon_0)
for the desired map.

Perspective projections

Perspective projections are constructed using a particular choice of perspective point, similar
to if you photographed the Earth from a particular point in space (a point which, for some
projections, technically lies within the Earth!). One common example is the orthographic
projection (projection='ortho'), which shows one side of the globe as seen from a viewer at a
very long distance. As such, it can show only half the globe at a time. Other perspective- based
projections include the gnomonic projection (projection='gnom') and stereographic projection
(projection='stere'). These are often the most useful for showing small portions of the map.

Here is an example of the orthographic projection:


Code:
fig = plt.figure(figsize=(8, 8))

m = Basemap(projection='ortho', resolution=None,

lat_0=50, lon_0=0)

draw_map(m)

Output:

Conic projections
A Conic projection projects the map onto a single cone, which is then unrolled. This can lead
to very good local properties, but regions far from the focus point of the cone may become very
distorted. One example of this is the Lambert Conformal Conic projection (projection='lcc'),
which we saw earlier in the map of North America. It projects the map onto a cone arranged
in such a way that two standard parallels (specified in Basemap by lat_1 and lat_2) have well-
represented distances, with scale decreasing between them and increasing outside of them. Other
useful conic projections are the equidistant conic projection (projection='eqdc') and the Albers
equal-area projection (projection='aea'). Conic projections, like perspective projections, tend to
be good choices for representing small to medium patches of the globe.

code
fig = plt.figure(figsize=(8, 8))

m = Basemap(projection='lcc', resolution=None,

lon_0=0, lat_0=50, lat_1=45, lat_2=55,

width=1.6E7, height=1.2E7)

draw_map(m)

output:

Other projections
If you're going to do much with map-based visualizations, I encourage you to read up on other
available projections, along with their properties, advantages, and disadvantages. Most likely,
they are available in the Basemap package. If you dig deep enough into this topic, you'll find
an incredible subculture of geo-viz geeks who will be ready to argue fervently in support of
their favorite projection for any given application!

Drawing a Map Background


Earlier we saw the bluemarble() and shadedrelief() methods for projecting global images on the
map, as well as the drawparallels() and drawmeridians() methods for drawing lines of constant
latitude and longitude. The Basemap package contains a range of useful functions for drawing
borders of physical features like continents, oceans, lakes, and rivers, as well as political
boundaries such as countries and US states and counties. The following are some of the
available drawing functions that you may wish to explore using IPython's help features:

 Physical boundaries and bodies of water

o drawcoastlines(): Draw continental coast lines


o drawlsmask(): Draw a mask between the land and sea, for use with projecting
images on one or the other
o drawmapboundary(): Draw the map boundary, including the fill color for
oceans.
o drawrivers(): Draw rivers on the map

o fillcontinents(): Fill the continents with a given color; optionally fill lakes with
another color

 Political boundaries

o drawcountries(): Draw country boundaries


o drawstates(): Draw US state boundaries
o drawcounties(): Draw US county boundaries

 Map features

o drawgreatcircle(): Draw a great circle between two points


o drawparallels(): Draw lines of constant latitude
o drawmeridians(): Draw lines of constant longitude
o drawmapscale(): Draw a linear scale on the map

 Whole-globe images

o bluemarble(): Project NASA's blue marble image onto the map


o shadedrelief(): Project a shaded relief image onto the map
o etopo(): Draw an etopo relief image onto the map
o warpimage(): Project a user-provided image onto the map

For the boundary-based features, you must set the desired resolution when creating a Basemap
image. The resolution argument of the Basemap class sets the level of detail in boundaries,
either 'c' (crude), 'l' (low), 'i' (intermediate), 'h' (high), 'f' (full), or None if no boundaries will be
used. This choice is important: setting high-resolution boundaries on a global map, for example,
can be very slow.

Here's an example of drawing land/sea boundaries, and the effect of the resolution parameter.
We'll create both a low- and high-resolution map of Scotland's beautiful Isle of Skye. It's
located at 57.3°N, 6.2°W, and a map of 90,000 × 120,000 kilometers shows it well:

code

fig, ax = plt.subplots(1, 2, figsize=(12, 8))

for i, res in enumerate(['l', 'h']):
    m = Basemap(projection='gnom', lat_0=57.3, lon_0=-6.2,
                width=90000, height=120000, resolution=res, ax=ax[i])
    m.fillcontinents(color="#FFDDCC", lake_color='#DDEEFF')
    m.drawmapboundary(fill_color="#DDEEFF")
    m.drawcoastlines()
    ax[i].set_title("resolution='{0}'".format(res))

output:

Notice that the low-resolution coastlines are not suitable for this level of zoom, while high-
resolution works just fine. The low level would work just fine for a global view, however, and
would be much faster than loading the high-resolution border data for the entire globe! It might
require some experimentation to find the correct resolution parameter for a given view: the best
route is to start with a fast, low-resolution plot and increase the resolution as needed.

Plotting Data on Maps

Perhaps the most useful piece of the Basemap toolkit is the ability to over-plot a variety of data
onto a map background. For simple plotting and text, any plt function works on the map; you
can use the Basemap instance to project latitude and longitude coordinates to (x, y) coordinates
for plotting with plt, as we saw earlier in the Seattle example.

In addition to this, there are many map-specific functions available as methods of the Basemap
instance. These work very similarly to their standard Matplotlib counterparts, but have an
additional Boolean argument latlon, which if set to True allows you to pass raw latitudes and
longitudes to the method, rather than projected (x, y) coordinates.

Some of these map-specific methods are:

 contour()/contourf() : Draw contour lines or filled contours

 imshow(): Draw an image
 pcolor()/pcolormesh() : Draw a pseudocolor plot for irregular/regular meshes
 plot(): Draw lines and/or markers.
 scatter(): Draw points with markers.
 quiver(): Draw vectors.
 barbs(): Draw wind barbs.
 drawgreatcircle(): Draw a great circle.

We'll see some examples of a few of these as we continue. For more information on these
functions, including several example plots, see the online Basemap documentation.

Example: California Cities

Recall that in Customizing Plot Legends, we demonstrated the use of size and color in a scatter
plot to convey information about the location, size, and population of California cities. Here,
we'll create this plot again, but using Basemap to put the data in context.

We start with loading the data, as we did before:


Code:

import pandas as pd

cities = pd.read_csv('data/california_cities.csv')

# Extract the data we're interested in

lat = cities['latd'].values

lon = cities['longd'].values

population = cities['population_total'].values

area = cities['area_total_km2'].values

# 1. Draw the map background

fig = plt.figure(figsize=(8, 8))

m = Basemap(projection='lcc', resolution='h',
            lat_0=37.5, lon_0=-119,
            width=1E6, height=1.2E6)
m.shadedrelief()
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')

# 2. scatter city data, with color reflecting population

# and size reflecting area

m.scatter(lon, lat, latlon=True, c=np.log10(population), s=area, cmap='Reds', alpha=0.5)

# 3. create colorbar and legend

plt.colorbar(label=r'$\log_{10}({\rm population})$')

plt.clim(3, 7)

# make legend with dummy points

for a in [100, 300, 500]:
    plt.scatter([], [], c='k', alpha=0.5, s=a, label=str(a) + ' km$^2$')

plt.legend(scatterpoints=1, frameon=False, labelspacing=1, loc='lower left')

output:
