
LIST OF EXPERIMENTS

S. No. Name of the Experiment

1 Installation of Data Analysis and Visualization Tool

2 Exploratory Data Analysis on Email Dataset

3. a Basic Numpy Operations

3. b Basic Arithmetic Operations with Numpy Arrays

4 Working with Pandas Dataframes

5 Data Cleaning and Visualization

6 Time Series Analysis

7 Visualizing Geographic Data with Basemap

8 Exploratory Data Analysis on Wine Quality Data Set

9 Exploratory Data Analysis on Haberman Data Set


EX. NO.:1 INSTALLATION OF DATA ANALYSIS AND VISUALIZATION
TOOL

DATE:
AIM:
To download and install Python tool and explore the features of NumPy, SciPy, Jupyter,
Statsmodels, and Pandas packages in Python.
INTRODUCTION:
Python is an open-source, object-oriented, and cross-platform programming language.
Compared to programming languages like C++ or Java, Python is very concise. It allows us to
build a working software prototype in a very short time. It has become the most used language
in the data scientist's toolbox. It is also a general-purpose language, and it is very flexible due
to a variety of available packages that solve a wide spectrum of problems and necessities. To
install the necessary packages, use ‘pip’.
Anaconda (http://continuum.io/downloads) is a Python distribution offered by Continuum
Analytics that includes nearly 200 packages, among them NumPy, SciPy, pandas, Jupyter,
Matplotlib, Scikit-learn, and NLTK. It is a cross-platform distribution (Windows, Linux, and
Mac OS X) that can be installed on machines alongside other existing Python distributions and
versions. Its base version is free; add-ons that contain advanced features are charged
separately. Anaconda introduces ‘conda’, a binary package manager, as a command-line tool
to manage your package installations. Anaconda's goal is to provide an enterprise-ready Python
distribution for large-scale processing, predictive analytics, and scientific computing.

STEPS:
1. Download Anaconda
2. Install Anaconda
3. Start Anaconda
4. Install data science packages
1. Download Anaconda
This step downloads the Anaconda Python package for the Windows platform.

Anaconda is a free and easy-to-use environment for scientific Python.

1. Visit the Anaconda homepage.


2. Click “Anaconda” from the menu and click “Download” to go to the download page.
3. Choose the download suitable for your platform (Windows, OSX, or Linux):
 Choose Python 3.5
 Choose the Graphical Installer
2. Install Anaconda
This step installs the Anaconda Python software on the system.

This step assumes that you have sufficient administrative privileges to install software on
the system.
1. Double click the downloaded file.
2. Follow the installation wizard.
3. Start Anaconda
Anaconda comes with a suite of graphical tools called Anaconda Navigator. Start Anaconda
Navigator by opening it from the application launcher.

First, start with the Anaconda command-line environment, conda.
Conda is fast and simple, error messages are easy to spot, and you can quickly confirm that
your environment is installed and working correctly.

1. Open a terminal (command line window).


2. Confirm conda is installed correctly, by typing:
conda -V

3. Confirm Python is installed correctly by typing:


python -V

4. Install data science packages

Packages can be installed with either pip or conda. To install a generic <package-name>
package with pip, run the following command:

$> pip install <package-name>

To install the same package with conda, run:

$> conda install <package-name>


To install a particular version of a package:

$> conda install <package-name>=1.11.0


To install multiple packages at once, list all their names:

$> conda install <package-name-1> <package-name-2>


To update a package that you previously installed, you can keep on using conda:

$> conda update <package-name>


To update all installed packages at once, use the --all argument:

$> conda update --all


To uninstall packages using conda:

$> conda remove <package-name>
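
For example, a typical session that installs a specific pandas version and later updates it
might look like this (the version number here is only illustrative):

$> conda install pandas=1.3.4
$> conda update pandas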


a. NumPy

NumPy is the true analytical workhorse of the Python language. It provides the user with
multidimensional arrays, along with a large set of functions to perform a wide range of
mathematical operations on these arrays. Arrays are blocks of data arranged along multiple
dimensions, which implement mathematical vectors and matrices. Characterized by optimal
memory allocation, arrays are useful not just for storing data, but also for fast matrix operations
(vectorization), which are indispensable when solving ad hoc data science problems.

$> conda install numpy
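
A minimal sketch (with arbitrary example values, assuming NumPy is installed) of the
vectorized array arithmetic described above:

import numpy as np

# element-wise arithmetic on whole arrays, with no explicit Python loop
prices = np.array([10.0, 12.5, 9.8])
quantities = np.array([3, 2, 5])
print(prices * quantities)            # [30. 25. 49.]
print((prices * quantities).sum())    # 104.0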


b. SciPy

SciPy completes NumPy's functionalities, offering a larger variety of scientific algorithms for
linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier
transformation, and much more.

$> conda install scipy
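
As a small illustration (assuming SciPy is installed), the linear algebra routines can solve a
system of equations in a single call:

import numpy as np
from scipy import linalg

# solve the linear system A x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(linalg.solve(A, b))   # [2. 3.]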


c. Statsmodels

Statsmodels is a complement to SciPy's statistical functions. It features generalized linear
models, discrete choice models, time series analysis, and a series of descriptive statistics, as
well as parametric and nonparametric tests.
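
Statsmodels can be installed in the same way as the other packages:

$> conda install statsmodels

A minimal sketch (with made-up data) of the kind of model fitting statsmodels provides:

import numpy as np
import statsmodels.api as sm

# ordinary least squares fit of y on x (with an intercept term)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])
model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.params)   # estimated intercept and slope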

d. Pandas

The pandas package deals with everything that NumPy and SciPy cannot do. Thanks to its
specific data structures, namely DataFrames and Series, pandas allows us to handle complex
tables of data of different types and time series. It enables the easy and smooth loading of data
from a variety of sources. The data can then be sliced and diced, its missing elements handled,
and the result added to, renamed, aggregated, reshaped, and finally visualized.

$> conda install pandas
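
A minimal sketch (with made-up values) of the tabular handling described above:

import pandas as pd

# a small table with one missing value, then a quick aggregation
df = pd.DataFrame({'city': ['Oslo', 'Hamar', 'Oslo'],
                   'sales': [120.0, None, 80.0]})
print(df.dropna())                        # drop the row with the missing value
print(df.groupby('city')['sales'].sum())  # total sales per city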


e. Jupyter

A scientific approach requires the fast experimentation of different hypotheses in a
reproducible fashion. Initially named IPython and limited to working only with the Python
language, Jupyter was created to address the need for an interactive command shell for several
languages (based on the shell, web browser, and application interface), featuring graphical
integration, customizable commands, rich history (in the JSON format), and computational
parallelism for enhanced performance.

$> conda install jupyter
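
Once installed, the notebook interface can be started from the same terminal:

$> jupyter notebook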

RESULT:
Thus, the Python tool for data analysis and visualization, and packages such as NumPy, SciPy,
Jupyter, Statsmodels, and Pandas were downloaded, installed, and explored.
EX. NO.:2 EXPLORATORY DATA ANALYSIS ON EMAIL DATASET

DATE:

AIM:
To export all emails as a dataset, import them inside a pandas dataframe, visualize
them, and identify the different types of insights.

ALGORITHM:
Step 1: Import data from your own Gmail accounts in mbox format.
a. Log in to your personal Gmail account.
b. Go to the following link: https://takeout.google.com/settings/takeout.
c. Deselect all the items except Gmail option.
d. Select ‘Send download link by email’, ‘One-time archive’ options and hit ‘create
archive’.
e. Find the email archive in mbox format in your inbox and download it.
Step 2: Load the required libraries and the dataset. The mailbox module used for reading
mbox files is part of the Python standard library.
Step 3: Perform data transformation by extracting the required fields such as subject, from,
date, to, label, and thread.
Step 4: Perform missing value analysis and drop the irrelevant column.
Step 5: Analyze the email communication data to gain various insights from it.
Step 6: Display the outputs.
Step 7: Stop the program.

PROGRAM:
# Load the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv

# Load the dataset


import mailbox
mboxfile = "PATH TO DOWNLOADED MBOX FILE"
mbox = mailbox.mbox(mboxfile)
mbox
# View the list of available keys
for key in mbox[0].keys():
    print(key)
# Create a CSV file with only the required attributes
with open('mailbox.csv', 'w') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(['subject', 'from', 'date', 'to', 'label', 'thread'])
    for message in mbox:
        writer.writerow([
            message['subject'],
            message['from'],
            message['date'],
            message['to'],
            message['X-Gmail-Labels'],
            message['X-GM-THRID']
        ])
# Load the CSV file
dfs = pd.read_csv('mailbox.csv', names=['subject', 'from', 'date', 'to', 'label', 'thread'])

# Convert the date


# Check the datatypes of each column
dfs.dtypes

# Convert the date field into an actual DateTime argument


dfs['date'] = dfs['date'].apply(lambda x: pd.to_datetime(x, errors='coerce', utc=True))

# Remove NaN values from the field


dfs = dfs[dfs['date'].notna()]

# Save the dataframe into a separate CSV file


dfs.to_csv('gmail.csv')

# Descriptive statistics
dfs.info()

# Check the first ten entries of the email dataset


dfs.head(10)

# Drop the irrelevant column ‘to’ from the dataframe


dfs.drop(columns='to', inplace=True)

# Display the first 10 entries


dfs.head(10)

# Number of emails sent during a given timeframe
# Use the parsed dates as the dataframe index so that the time-based summaries below work
dfs = dfs.set_index('date')

print(dfs.index.min().strftime('%a, %d %b %Y %I:%M %p'))
print(dfs.index.max().strftime('%a, %d %b %Y %I:%M %p'))
print(dfs['label'].value_counts())
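
# NOTE (assumption): the plotting and grouping code further below expects derived columns
# ('year', 'timeofday', 'dayofweek') that this listing does not otherwise create.
# A minimal sketch of how they could be derived from the date index:
dfs['year'] = dfs.index.year
dfs['timeofday'] = dfs.index.hour + dfs.index.minute / 60
dfs['dayofweek'] = dfs.index.day_name()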

# Number of times of the day the emails are sent and received with Gmail
# Create two sub-dataframe—one for sent emails and another for received emails
sent = dfs[dfs['label']=='sent']
received = dfs[dfs['label']=='inbox']

# Import the required libraries


import datetime
import pytz  # datetime and pytz are used inside the plotting helper defined below
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from scipy import ndimage
import matplotlib.gridspec as gridspec
import matplotlib.patches as mpatches

# Create a function that takes a dataframe as an input and creates a plot


def plot_todo_vs_year(df, ax, color='C0', s=0.5, title=''):
    ind = np.zeros(len(df), dtype='bool')
    est = pytz.timezone('US/Eastern')
    df[~ind].plot.scatter('year', 'timeofday', s=s, alpha=0.6, ax=ax, color=color)
    ax.set_ylim(0, 24)
    ax.yaxis.set_major_locator(MaxNLocator(8))
    ax.set_yticklabels([datetime.datetime.strptime(str(int(np.mod(ts, 24))), "%H").strftime("%I %p")
                        for ts in ax.get_yticks()])
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.set_title(title)
    ax.grid(ls=':', color='k')
    return ax

# Plot both received and sent emails


fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 4))
plot_todo_vs_year(sent, ax[0], title='Sent')
plot_todo_vs_year(received, ax[1], title='Received')

# Find the busiest day of the week in terms of emails


counts = dfs.dayofweek.value_counts(sort=False)
counts.plot(kind='bar')

# Find the most active days for receiving and sending emails separately
sdw = sent.groupby('dayofweek').size() / len(sent)
rdw = received.groupby('dayofweek').size() / len(received)
df_tmp = pd.DataFrame(data={'Outgoing Email': sdw, 'Incoming Email':rdw})
df_tmp.plot(kind='bar', rot=45, figsize=(8,5), alpha=0.5)
plt.xlabel('');
plt.ylabel('Fraction of weekly emails');
plt.grid(ls=':', color='k', alpha=0.5)

OUTPUT:
Load the dataset:
<mailbox.mbox at 0x7f124763f5c0>

List of keys that are present in the extracted dataset:


X-GM-THRID
X-Gmail-Labels
Delivered-To
Received
X-Google-Smtp-Source
X-Received
ARC-Seal
ARC-Message-Signature
ARC-Authentication-Results
Return-Path
Received
Received-SPF
Authentication-Results
DKIM-Signature
DKIM-Signature
Subject
From
To
Reply-To
Date
MIME-Version
Content-Type
X-Mailer
X-Complaints-To
X-Feedback-ID
List-Unsubscribe
Message-ID
Datatypes of each column:
subject object
from object
date object
to object
label object
thread float64
dtype: object
Descriptive Statistics:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 37554 entries, 1 to 78442
Data columns (total 6 columns):
subject 37367 non-null object
from 37554 non-null object
date 37554 non-null datetime64[ns, UTC]
to 36882 non-null object
label 36962 non-null object
thread 37554 non-null object
dtypes: datetime64[ns, UTC](1), object(5)
memory usage: 2.0+ MB
Dataframe after transforming data by extracting the required fields:

Dataframe after performing refactoring and dropping:

Number of emails sent during a given timeframe (Tue, 24 May 2011 11:04 AM, to Fri,
20 Sep 2019 03:04 PM):
Tue, 24 May 2011 11:04 AM
Fri, 20 Sep 2019 03:04 PM
inbox 32952
sent 4602
Name: label, dtype: int64
Plot showing an overview of the time of day of email activity:

Find the busiest day of the week in terms of emails:

Find the most active days for receiving and sending emails:

RESULT:
Thus, the program for creating the email dataset, visualizing it, and deriving different insights
from the data was executed, and the output was verified successfully.
EX. NO.:3.a BASIC NUMPY OPERATIONS

DATE:

AIM:

To perform basic NumPy operations in python for

(i) creating different types of NumPy arrays and displaying basic information, such
as the data type, shape, size, and strides
(ii) creating an array using built-in NumPy functions
(iii) performing file operations with NumPy arrays

ALGORITHM:

Step 1: Start the program.

Step 2: Import the NumPy Library.

Step 3: Define a one-dimensional array, two-dimensional array, and three-dimensional array.

Step 4: Print the memory address, the shape, the data type, and the stride of the array.
Step 5: Then, create an array using built-in NumPy functions.
Step 6: Perform file operations with NumPy arrays.

Step 7: Display the output.

Step 8: Stop the program.


PROGRAM:

(i) Creation of different types of Numpy arrays and displaying basic information

# Importing numpy
import numpy as np

# Defining 1D array
my1DArray = np.array([1, 8, 27, 64])
print(my1DArray)

# Defining and printing 2D array


my2DArray = np.array([[1, 2, 3, 4], [2, 4, 9, 16], [4, 8, 18, 32]])
print(my2DArray)

#Defining and printing 3D array


my3Darray = np.array([[[ 1, 2 , 3 , 4],[ 5 , 6 , 7 ,8]], [[ 1, 2, 3, 4],[ 9, 10, 11, 12]]])
print(my3Darray)
# Print out memory address
print(my2DArray.data)

# Print the shape of array


print(my2DArray.shape)

# Print out the data type of the array


print(my2DArray.dtype)

# Print the stride of the array.


print(my2DArray.strides)
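# For this 3x4 int32 array the strides are (16, 4): moving to the next row skips
# 16 bytes (4 elements x 4 bytes each), and moving to the next column skips 4 bytes.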

(ii) Creation of an array using built-in NumPy functions

# Array of ones
ones = np.ones((3,4))
print(ones)

# Array of zeros
zeros = np.zeros((2,3,4),dtype=np.int16)
print(zeros)

# Array with random values


np.random.random((2,2))

# Empty array
emptyArray = np.empty((3,2))
print(emptyArray)

# Full array
fullArray = np.full((2,2),7)
print(fullArray)

# Array of evenly-spaced values


evenSpacedArray = np.arange(10,25,5)
print(evenSpacedArray)
# Array of evenly-spaced values
evenSpacedArray2 = np.linspace(0,2,9)
print(evenSpacedArray2)

(iii) Performing file operations with NumPy arrays


import numpy as np
#initialize an array
arr = np.array([[[11, 11, 9, 9], [11, 0, 2, 0]], [[10, 14, 9, 14], [0, 1, 11, 11]]])

# open a binary file in write mode


file = open("arr", "wb")
# save array to the file
np.save(file, arr)

# close the file


file.close()
# open the file in read binary mode
file = open("arr", "rb")

#read the file to numpy array


arr1 = np.load(file)
# close the file and print the array that was read back
file.close()
print(arr1)

OUTPUT:

(i) Creation of different types of Numpy arrays and displaying basic information

[ 1 8 27 64]

[[ 1 2 3 4]

[ 2 4 9 16]

[ 4 8 18 32]]

[[[ 1 2 3 4]

[ 5 6 7 8]]
[[ 1 2 3 4]
[ 9 10 11 12]]]

<memory at 0x00000247AE2A0A00>

(3, 4)

int32

(16, 4)

(ii) Creation of an array using built-in NumPy functions

[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
[[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]]

[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]]]

[[0. 0.]

[0. 0.]
[0. 0.]]

[[7 7]

[7 7]]

[10 15 20]

[0. 0.25 0.5 0.75 1. 1.25 1.5 1.75 2. ]

(iii) Performing file operations with NumPy arrays

[[[11 11 9 9]

[11 0 2 0]]

[[10 14 9 14]

[ 0 1 11 11]]]

RESULT:

Thus, the program to implement NumPy operations with arrays using Python has been executed
and the output was verified successfully.
EX. NO.:3.b BASIC ARITHMETIC OPERATIONS WITH NUMPY ARRAYS

DATE:
AIM:
To implement arithmetic operations with NumPy arrays using python.
ALGORITHM:

Step 1: Start the program.

Step 2: Import the NumPy Library.

Step 3: Initialize the NumPy arrays to two different variables.

Step 4: Perform the arithmetic operations on the two arrays using NumPy.

Step 5: Display the output.

Step 6: Stop the program.


PROGRAM:
import numpy as np
a = np.arange(9, dtype = np.float_).reshape(3,3)

print ('First array:')


print (a)
print ('\n')

print ('Second array:')


b = np.array([10,10,10])
print (b )
print ('\n')

print ('Add the two arrays:')


print (np.add(a,b))
print ('\n')

print ('Subtract the two arrays:')


print (np.subtract(a,b))
print ('\n')

print ('Multiply the two arrays:')


print (np.multiply(a,b))
print ('\n')

print ('Divide the two arrays:')


print (np.divide(a,b))
OUTPUT:
First array:
[[ 0. 1. 2.]
[ 3. 4. 5.]
[ 6. 7. 8.]]

Second array:
[10 10 10]

Add the two arrays:


[[ 10. 11. 12.]
[ 13. 14. 15.]
[ 16. 17. 18.]]

Subtract the two arrays:


[[-10. -9. -8.]
[ -7. -6. -5.]
[ -4. -3. -2.]]

Multiply the two arrays:


[[ 0. 10. 20.]
[ 30. 40. 50.]
[ 60. 70. 80.]]

Divide the two arrays:


[[ 0. 0.1 0.2]
[ 0.3 0.4 0.5]
[ 0.6 0.7 0.8]]
RESULT:
Thus, the program to implement NumPy arithmetic operations with arrays using Python has
been executed and the output was verified successfully.
EX. NO.:4 WORKING WITH PANDAS DATAFRAMES

DATE:

A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns); that is, the data is aligned in a tabular fashion
in rows and columns. A DataFrame consists of three principal components: the data, the rows,
and the columns. In pandas, data structures can be created in two ways: as Series or as
DataFrames.

AIM:

(i) To create a dataframe from a series


(ii) To create a dataframe from a dictionary
(iii) To create a dataframe from n-dimensional arrays
(iv) To load a dataset from an external source into a pandas dataframe

ALGORITHM:

Step 1: Start the program.

Step 2: Import the NumPy and pandas packages.

Step 3: Create a dataframe for the list of elements (numbers, dictionary, and n-dimensional
arrays)

Step 4: Load a dataset from an external source into a pandas dataframe

Step 5: Display the output.

Step 6: Stop the program.

PROGRAM:

(i) CREATION OF A DATAFRAME FROM A SERIES


import numpy as np
import pandas as pd
print("Pandas Version:", pd.__version__)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
series = pd.Series([2, 3, 7, 11, 13, 17, 19, 23])
print(series)
series_df = pd.DataFrame({
'A': range(1, 5),
'B': pd.Timestamp('20190526'),
'C': pd.Series(5, index=list(range(4)), dtype='float64'),
'D': np.array([3] * 4, dtype='int64'),
'E': pd.Categorical(["Depression", "Social Anxiety", "Bipolar Disorder", "Eating Disorder"]),
'F': 'Mental health',
'G': 'is challenging'
})
print(series_df)

(ii) CREATION OF A DATAFRAME FROM DICTIONARY


import numpy as np
import pandas as pd
dict_df = [{'A': 'Apple', 'B': 'Ball'},{'A': 'Aeroplane', 'B':'Bat', 'C': 'Cat'}]
dict_df = pd.DataFrame(dict_df)
print(dict_df)

(iii) CREATION OF A DATAFRAME FROM N-DIMENSIONAL ARRAYS


import numpy as np
import pandas as pd
sdf = {'County':['Ostfold', 'Hordaland', 'Oslo', 'Hedmark', 'Oppland', 'Buskerud'],
'ISO-Code':[1,2,3,4,5,6],
'Area': [4180.69, 4917.94, 454.07, 27397.76, 25192.10, 14910.94],
'Administrative centre': ["Sarpsborg", "Oslo", "City of Oslo", "Hamar", "Lillehammer",
"Drammen"]}
sdf = pd.DataFrame(sdf)
print(sdf)

(iv) LOADING A DATASET FROM AN EXTERNAL SOURCE INTO A PANDAS


DATAFRAME

import numpy as np
import pandas as pd

columns=['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',


'occupation', 'relationship', 'ethnicity', 'gender', 'capital_gain', 'capital_loss', 'hours_per_week',
'country_of_origin','income']
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', names=columns)
df.head(10)

OUTPUT:
(i) Creation of a dataframe from a series
Pandas Version: 1.3.4
0 2
1 3
2 7
3 11
4 13
5 17
6 19
7 23
dtype: int64
A B C D E F G
0 1 2019-05-26 5.0 3 Depression Mental health is challenging
1 2 2019-05-26 5.0 3 Social Anxiety Mental health is challenging
2 3 2019-05-26 5.0 3 Bipolar Disorder Mental health is challenging
3 4 2019-05-26 5.0 3 Eating Disorder Mental health is challenging

(ii) Creation of a dataframe from a dictionary


A B C
0 Apple Ball NaN
1 Aeroplane Bat Cat

(iii) Creation of a dataframe from n-dimensional array


County ISO-Code Area Administrative centre
0 Ostfold 1 4180.69 Sarpsborg
1 Hordaland 2 4917.94 Oslo
2 Oslo 3 454.07 City of Oslo
3 Hedmark 4 27397.76 Hamar
4 Oppland 5 25192.10 Lillehammer
5 Buskerud 6 14910.94 Drammen

RESULT:
Thus, the programs for creating and loading pandas dataframes using Python have been
implemented and the output was verified successfully.
EX. NO.:5 DATA CLEANING AND VISUALIZATION

DATE:

AIM:
To explore missing values for cleaning data and visualize missing values on sample
dataframe.

ALGORITHM:

Step 1: Import the required libraries.

Step 2: Create a dataframe and add missing values to dataframe.

Step 3: Analyze the missing values in the dataframe using various commands.

Step 4: Drop the rows and columns that contain completely missing values.

Step 5: Visualize missing values.

Step 6: Terminate the program.

# Loading Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a dataframe with pandas

data = np.arange(15, 30).reshape(5, 3)


dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
columns=['store1', 'store2', 'store3'])
dfx

# Add missing values to dataframe


dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
dfx.loc['apple', 'store4'] = 20.0  # assign with .loc to avoid chained indexing
dfx
# Identify NaN values
dfx.isnull()
# Count the number of NaN values in each store
dfx.isnull().sum()
# Find the total number of missing values
dfx.isnull().sum().sum()
# Display data values in a column by dropping missing values
b=dfx.store4[dfx.store4.notnull()]
# Remove the rows
dfx.store4.dropna()
# Drop only those rows whose values are entirely NaN
dfx.dropna(how='all')
# Drop only those columns whose values are entirely NaN
dfx.dropna(how='all', axis=1)
# Visualizing the missing value with seaborn
sns.heatmap(dfx.isnull(), yticklabels=False, annot=True)

OUTPUT:
Create a dataframe with pandas:

Add missing values to dataframe:


Identify NaN values:

Count the number of NaN values in each store:


store1 1
store2 1
store3 1
store4 5
store5 7
dtype: int64
Find the total number of missing values:
15
Display data values in a column by dropping missing values:
apple 20.0
watermelon 18.0
Name: store4, dtype: float64

Remove the rows:


apple 20.0
watermelon 18.0
Name: store4, dtype: float64

Drop only those rows entire values are entirely NaN:


Drop only those columns entire values are entirely NaN:

Visualizing the missing value with seaborn:

RESULT:
Thus, missing value analysis for data cleaning was explored and visualized, and the output
was verified.
EX. NO.:6 TIME SERIES ANALYSIS

DATE:

AIM:
To perform time series analysis on the open power system dataset and to apply various
visualization techniques.

ALGORITHM:
Step 1: Load the relevant libraries and time series dataset.
Step 2: Perform descriptive analysis on the dataset.
Step 3: Convert the Date column to Datetime format.
Step 4: Add columns such as year, month, and days of the week to the dataset.
Step 5: Generate line plots of the full time series of Germany's daily electricity consumption.
Step 6: Generate line plots for analyzing the electricity consumption for a single year and a
particular month.
Step 7: Generate box plots by grouping the data by months and days of the week.
Step 8: Terminate the program.
PROGRAM:
# Load the required library and the time series dataset
import pandas as pd

df_power = pd.read_csv("https://raw.githubusercontent.com/jenfly/opsd/master/opsd_germany_daily.csv")
df_power.columns
# Check last 10 entries inside the dataframe
df_power.tail(10)
# View the data types of each column in df_power dataframe
df_power.dtypes
# Convert object to datetime format
df_power['Date'] = pd.to_datetime(df_power['Date'])
# Verify the conversion of the Date column to Datetime format
df_power.dtypes
# Change the index of our dataframe to the Date column:
df_power = df_power.set_index('Date')
df_power.tail(3)
# Add columns with year, month, and weekday name
df_power['Year'] = df_power.index.year
df_power['Month'] = df_power.index.month
df_power['Weekday Name'] = df_power.index.day_name()  # weekday_name was removed in newer pandas versions
# Display a random sampling of 5 rows
df_power.sample(5, random_state=0)
# Visualizing time series
# Import the seaborn and matplotlib libraries:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(11, 4)})
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 150
# Generate a line plot of the full time series of Germany's daily electricity consumption
df_power['Consumption'].plot(linewidth=0.5)
# Use the dots to plot the data for all the other columns
cols_to_plot = ['Consumption', 'Solar', 'Wind']
axes = df_power[cols_to_plot].plot(marker='.', alpha=0.5,
linestyle='None',figsize=(14, 6), subplots=True)
for ax in axes:
    ax.set_ylabel('Daily Totals (GWh)')
# Investigate a single year
ax = df_power.loc['2016', 'Consumption'].plot()
ax.set_ylabel('Daily Consumption (GWh)');
# Examine the month of December 2016
ax = df_power.loc['2016-12', 'Consumption'].plot(marker='o', linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)');
# Group the data by months and visualize the data using box plots
fig, axes = plt.subplots(3, 1, figsize=(8, 7), sharex=True)
for name, ax in zip(['Consumption', 'Solar', 'Wind'], axes):
    sns.boxplot(data=df_power, x='Month', y=name, ax=ax)
    ax.set_ylabel('GWh')
    ax.set_title(name)
    if ax != axes[-1]:
        ax.set_xlabel('')
# Group the consumption of electricity by the day of the week, and visualize the data using a box plot
sns.boxplot(data=df_power, x='Weekday Name', y='Consumption');
OUTPUT:

View the data types of each column in df_power dataframe


Date object
Consumption float64
Wind float64
Solar float64
Wind+Solar float64
dtype: object
Verify the conversion of the Date column to Datetime format
Date datetime64[ns]
Consumption float64
Wind float64
Solar float64
Wind+Solar float64
dtype: object
Change the index of our dataframe to the Date column

Display a random sampling of 5 rows


Generate a line plot of the full time series of Germany's daily electricity consumption

Plot the data for all the other columns using dots

Investigate a single year

Examine the month of December 2016


Group the data by months and visualize the data using box plots

Group the consumption of electricity by the day of the week, and visualize the data using a box plot

RESULT:
Thus, time series analysis on the Open Power System dataset was performed along with various
visualizations, and the output was verified.
EX. NO.:7 VISUALIZING GEOGRAPHIC DATA WITH BASEMAP

DATE:

AIM:

To visualize California cities data using basemap.

ALGORITHM:
Step 1: Install basemap using the command: $ conda install basemap
Step 2: Load the relevant libraries, such as numpy, pandas, matplotlib, and basemap.
Step 3: Read the ‘california_cities.csv' data and extract it.
Step 4: Draw the map background.
Step 5: Scatter city data, with color reflecting population and size reflecting area.
Step 6: Make a legend with dummy points.
Step 7: Create a colorbar and legend.

PROGRAM:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
cities = pd.read_csv('california_cities.csv')
# Extract the data
lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values
# Draw the map background
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution='h', lat_0=37.5, lon_0=-119, width=1E6,
height=1.2E6)
m.shadedrelief()
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')
# scatter city data, with color reflecting population and size reflecting area
m.scatter(lon, lat, latlon=True, c=np.log10(population), s=area, cmap='Reds', alpha=0.5)
# create colorbar and legend
plt.colorbar(label=r'$\log_{10}({\rm population})$')
plt.clim(3, 7)
# make legend with dummy points
for a in [100, 300, 500]:
    plt.scatter([], [], c='k', alpha=0.5, s=a, label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False, labelspacing=1, loc='lower left');

OUTPUT:

RESULT:
Thus, the map showing information about the location, size, and population of California cities
was displayed using the basemap package and the output is verified.
EX. NO.:8 EXPLORATORY DATA ANALYSIS ON WINE QUALITY DATA SET

DATE:

AIM:
To analyze and explore the wine quality dataset and identify the correlations of assessing the
quality of wine with various plots.

ALGORITHM:
Step 1: Load the relevant libraries, such as numpy, pandas, matplotlib, and seaborn.
Step 2: Read the wine quality data and create two different dataframes, namely, df_red for
holding the red wine data and df_white for holding the white wine data.
Step 3: Analyze descriptive statistics on the dataset.
Step 4: Generate plots showing the variation in the quality of wine with respect to alcohol
concentration and correlations among the features.
Step 5: Show the outputs.
Step 6: Terminate the program.

PROGRAM:
# Load the pandas library and create two different dataframes, namely, df_red for holding the
# red wine data and df_white for holding the white wine data
import pandas as pd
df_red = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", delimiter=";")
df_white = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", delimiter=";")
# Check the name of the available columns:
df_red.columns
# Descriptive statistics
# Display the entries from the 100th to 110th rows from the red wine dataframe
df_red.iloc[100:110]

# Display the datatypes for each column


df_red.dtypes

# Describe the dataframe


df_red.describe()

# Check for missing values and display information about the data
df_red.info()
# Analyzing red wine
# Identify the quality of red wine
import seaborn as sns
sns.set(rc={'figure.figsize': (14, 8)})
sns.countplot(df_red['quality'])

# Finding correlated columns


sns.pairplot(df_red)

# Generate the heatmap graph


sns.heatmap(df_red.corr(), annot=True, fmt='.2f', linewidths=2)

# Plot the alcohol distribution graph


sns.distplot(df_red['alcohol'])

# Plot the variation of the quality of wine with respect to alcohol concentration
sns.boxplot(x='quality', y='alcohol', data = df_red)

# Plot the variation of the quality of wine with respect to alcohol concentration without outliers
sns.boxplot(x='quality', y='alcohol', data=df_red, showfliers=False)

# Correlation between the alcohol column and pH values


sns.jointplot(x='alcohol',y='pH',data=df_red, kind='reg')

# Quantify the correlation using Pearson regression


from scipy.stats import pearsonr
def get_correlation(column1, column2, df):
    pearson_corr, p_value = pearsonr(df[column1], df[column2])
    print("Correlation between {} and {} is {}".format(column1, column2, pearson_corr))
    print("P-value of this correlation is {}".format(p_value))

# Correlation between alcohol and pH


get_correlation('alcohol','pH', df_red)

OUTPUT:
Name of the available columns:
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides',
'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol',
'quality'], dtype='object')
Display the entries from the 100th to 110th rows from the red wine dataframe:

Display the datatypes for each column:


fixed acidity float64
volatile acidity float64
citric acid float64
residual sugar float64
chlorides float64
free sulfur dioxide float64
total sulfur dioxide float64
density float64
pH float64
sulphates float64
alcohol float64
quality int64
dtype: object
Describe the dataframe:

Check for missing values and display information about the data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity 1599 non-null float64
volatile acidity 1599 non-null float64
citric acid 1599 non-null float64
residual sugar 1599 non-null float64
chlorides 1599 non-null float64
free sulfur dioxide 1599 non-null float64
total sulfur dioxide 1599 non-null float64
density 1599 non-null float64
pH 1599 non-null float64
sulphates 1599 non-null float64
alcohol 1599 non-null float64
quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

Correlation between different columns of the red wine dataframe:


Heatmap showing the correlation between different columns:

Alcohol distribution graph:


A box plot showing the variation of the quality of wine with respect to alcohol
concentration:

A box plot showing the variation of the quality of wine with respect to alcohol
concentration without outliers:
Joint plot illustrating the correlation between alcohol concentration and the pH values:

Correlation between alcohol and pH:


Correlation between alcohol and pH is 0.20563250850549825
P-value of this correlation is 9.96449774146556e-17

RESULT:
Thus, exploratory data analysis on the wine quality dataset was performed and the output was verified.
EX. NO.:9 EXPLORATORY DATA ANALYSIS ON HABERMAN DATA SET

DATE:

INTRODUCTION:
Haberman Dataset

The dataset contains cases from a study that was conducted between 1958 and 1970 at the
University of Chicago's Billings Hospital on the survival of patients who had undergone
surgery for breast cancer.

Attribute Information

 Age of patient at the time of operation (numerical)


 Patient's year of operation (year - 1900, numerical)
 Number of positive axillary nodes detected (numerical)
 Survival status (class attribute)
 1 = the patient survived 5 years or longer
 2 = the patient died within 5 years

AIM:
To analyze and explore the Haberman dataset using visualization techniques and to present
an analysis report.

ALGORITHM:
Step 1: Load the relevant libraries, such as numpy, pandas, matplotlib, and seaborn.
Step 2: Read the Haberman data.
Step 3: Analyze descriptive statistics on the dataset.
Step 4: Generate pair plots, histograms and box plots to analyze the survival status of patients.
Step 5: Show the outputs.
Step 6: Terminate the program.

PROGRAM:
# importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# loading the dataset


columns=['age', 'operation_year', 'lymph_nodes', 'survival_status']
data=pd.read_csv("haberman.csv",names=columns)

# Description of data
data.describe()

# Information of the dataset


data.info()

# Shape of the data


data.shape

# Columns of the data


data.columns

# Display top 5 elements of the dataset


data.head()

# Display 5 values from the bottom


data.tail()

# Counting the frequency of unique values of survival_status


data["survival_status"].value_counts()

# Converting the survival status into categorical variable


def change_param(x):
    if x == 1:
        return 'yes'
    return 'no'

chVar = lambda x: change_param(x)
data["survival_status"] = pd.DataFrame(data.survival_status.apply(change_param))

# Counting the frequency of unique values of survival_status


data["survival_status"].value_counts()

Pair Plots
sns.set_style("whitegrid")
sns.pairplot(data,hue="survival_status",height=5)
plt.show()

Histogram
# histogram for age
sns.FacetGrid(data,hue="survival_status",height=5) \
.map(sns.distplot,"age")
plt.title("Histogram for Age")
plt.xlabel('AGE')
plt.ylabel('VALUE')
plt.legend()
plt.show();
# histogram for lymph_nodes
sns.FacetGrid(data,hue="survival_status",height=5) \
.map(sns.distplot,"lymph_nodes")
plt.title("Histogram for Lymph Nodes")
plt.legend()
plt.xlabel('LYMPH NODES')
plt.ylabel('VALUE')
plt.show();

# histogram for operation_year


sns.FacetGrid(data,hue="survival_status",height=5) \
.map(sns.distplot,"operation_year");
plt.title("Histogram for Operation Year")
plt.xlabel('OPERATION YEAR')
plt.ylabel('VALUE')
plt.legend()
plt.show();

Box Plots
# box plot for age vs survival status
sns.boxplot(x='survival_status',y="age",data=data,hue="survival_status")
plt.title("Survival Status vs Age")
plt.ylabel("Age")
plt.xlabel("Survival Status")
plt.legend(loc="center")
plt.show()

# box plot for op year vs survival status


sns.boxplot(x='survival_status',y="operation_year",data=data,hue="survival_status")
plt.title("Survival Status vs Operation year")
plt.ylabel("Operation Year")
plt.xlabel("Survival Status")
plt.legend(loc="center")
plt.show()

# box plot for lymph nodes vs survival status


sns.boxplot(x='survival_status',y="lymph_nodes",data=data,hue="survival_status")
plt.title("Survival Status vs Lymph Nodes")
plt.ylabel("Lymph Nodes")
plt.xlabel("Survival Status")
plt.legend(loc="center")
plt.show()

OUTPUT:
Description of data:
Observations:
The minimum and maximum ages of the patients were 30 and 83, respectively.
The minimum and maximum years of operation were 58 and 69 (i.e., 1958 and 1969), respectively.
The minimum and maximum numbers of lymph nodes were 0 and 52, respectively.
The 25th, 50th, and 75th percentile values of age are 44, 52, and 60.75, respectively.
The 25th, 50th, and 75th percentile values of lymph_nodes are 0, 1, and 4, respectively.
The 25th, 50th, and 75th percentile values of operation_year are 60, 63, and 65.75, respectively.

Information of the dataset:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 306 non-null int64
1 operation_year 306 non-null int64
2 lymph_nodes 306 non-null int64
3 survival_status 306 non-null int64
dtypes: int64(4)
memory usage: 9.7 KB

Observations:

1. The dataset does not contain missing values.

2. The survival status indicates whether the patient survived or not, so it should be
converted into a categorical variable.

Shape of the data:


(306, 4)

Columns of the data:


Index(['age', 'operation_year', 'lymph_nodes', 'survival_status'], dtype='object')

Display top 5 elements of the dataset:

Display 5 values from the bottom:


Counting the frequency of unique values of survival_status:
1 225
2 81
Name: survival_status, dtype: int64

Observation:
The two survival classes are not evenly distributed, so the given dataset is an imbalanced dataset.

Counting the frequency of unique values of survival_status after converting it into a
categorical variable:
yes 225
no 81
Name: survival_status, dtype: int64

Pair Plots:

Observation:

From the above pair plots, we can conclude that plotting lymph_nodes against operation_year
gives a slightly better separation of the data points.

Histograms:
Observation:

The number of lymph nodes is densely concentrated between 0 and 6 (roughly).

Box Plots

RESULT:

Thus, the Haberman dataset was analyzed as a case study using various EDA and visualization
techniques, and the output was verified.
