EX. NO.:1 INSTALLATION OF PYTHON AND EXPLORATION OF PACKAGES
DATE:
AIM:
To download and install the Python tool and explore the features of the NumPy, SciPy, Jupyter,
Statsmodels, and Pandas packages in Python.
INTRODUCTION:
Python is an open-source, object-oriented, and cross-platform programming language.
Compared to programming languages like C++ or Java, Python is very concise. It allows us to
build a working software prototype in a very short time. It has become the most used language
in the data scientist's toolbox. It is also a general-purpose language, and it is very flexible due
to a variety of available packages that solve a wide spectrum of problems and necessities. To
install the necessary packages, use ‘pip’.
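For example, the packages explored in this exercise can be installed from the command prompt with a single command (exact versions will vary by machine):
pip install numpy scipy jupyter statsmodels pandas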
Anaconda (http://continuum.io/downloads) is a Python distribution offered by Continuum
Analytics that includes nearly 200 packages, comprising NumPy, SciPy, pandas, Jupyter,
Matplotlib, Scikit-learn, and NLTK. It is a cross-platform distribution (Windows, Linux, and
Mac OS X) that can be installed on machines with other existing Python distributions and
versions. Its base version is free; add-ons that contain advanced features are charged
separately. Anaconda introduces ‘conda’, a binary package manager, as a command-line tool
to manage your package installations. Anaconda's goal is to provide an enterprise-ready Python
distribution for large-scale processing, predictive analytics, and scientific computing.
STEPS:
1. Download Anaconda
2. Install Anaconda
3. Start Anaconda
4. Install data science packages
1. Download Anaconda
This step downloads the Anaconda Python package for the Windows platform.
This step assumes that you have sufficient administrative privileges to install software on
the system.
2. Install Anaconda
1. Double-click the downloaded file.
2. Follow the installation wizard.
3. Start Anaconda
Anaconda comes with a suite of graphical tools called Anaconda Navigator. Start Anaconda
Navigator by opening it from the application launcher.
First, start with the Anaconda command-line environment, called conda. Conda is fast and
simple, error messages are easy to spot, and you can quickly confirm that your environment is
installed and working correctly.
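For example, the following commands confirm that conda is installed and show the version of a package (the exact output differs from machine to machine):
conda --version
conda list numpy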
a. NumPy
NumPy is the true analytical workhorse of the Python language. It provides the user with
multidimensional arrays, along with a large set of functions for performing a multiplicity of
mathematical operations on these arrays. Arrays are blocks of data arranged along multiple
dimensions, which implement mathematical vectors and matrices. Characterized by optimal
memory allocation, arrays are useful not just for storing data, but also for fast matrix operations
(vectorization), which are indispensable when solving ad hoc data science problems.
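As a minimal sketch of vectorization (the array values are illustrative), element-wise and matrix operations run without explicit Python loops:
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])
print(a + b)           # element-wise addition
print(a * b)           # element-wise multiplication
print(a @ b)           # matrix multiplication
print(a.mean(axis=0))  # column means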
b. SciPy
SciPy completes NumPy's functionalities, offering a larger variety of scientific algorithms for
linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier
transformation, and much more.
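As a minimal sketch of SciPy's linear algebra routines (the matrix and vector are illustrative), the system Ax = b can be solved directly:
import numpy as np
from scipy import linalg
# Solve the linear system Ax = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print(x)  # [2. 3.]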
d. Pandas
The pandas package deals with everything that NumPy and SciPy cannot do. Thanks to its
specific data structures, namely DataFrames and Series, pandas allows us to handle complex
tables of data of different types and time series. It enables the easy and smooth loading of data
from a variety of sources. The data can then be sliced, diced, cleaned of missing elements,
added, renamed, aggregated, reshaped, and finally visualized.
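As a minimal sketch of this workflow (the file name and column names are illustrative assumptions):
import pandas as pd
# Load a table from a CSV file (hypothetical file with columns region, month, revenue)
df = pd.read_csv('sales.csv')
df = df.rename(columns={'revenue': 'revenue_usd'})
df['revenue_usd'] = df['revenue_usd'].fillna(0)  # handle missing elements
# Aggregate and visualize
summary = df.groupby('region')['revenue_usd'].sum()
summary.plot(kind='bar')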
RESULT:
Thus, the Python tool for data analysis and visualization, and packages such as NumPy, SciPy,
Jupyter, Statsmodels, and Pandas were downloaded, installed, and explored.
EX. NO.:2 EXPLORATORY DATA ANALYSIS ON EMAIL DATASET
DATE:
AIM:
To export all emails as a dataset, import them inside a pandas dataframe, visualize
them, and identify the different types of insights.
ALGORITHM:
Step 1: Export data from your own Gmail account in mbox format.
a. Log in to your personal Gmail account.
b. Go to the following link: https://takeout.google.com/settings/takeout.
c. Deselect all the items except the Gmail option.
d. Select the ‘Send download link by email’ and ‘One-time archive’ options and hit ‘Create
archive’.
e. Find the email archive in mbox format in your inbox and download it.
Step 2: Load the required libraries and dataset. Install the mailbox package by using the
command, “pip install mailbox”.
Step 3: Perform data transformation by extracting the required fields such as subject, from,
date, to, label, and thread.
Step 4: Perform missing value analysis and drop the irrelevant columns.
Step 5: Analyze the email communication data to gain various insights from it.
Step 6: Display the outputs.
Step 7: Stop the program.
PROGRAM:
# Load the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv
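# NOTE: the lines below are a minimal sketch of loading the mbox archive into
# the 'dfs' dataframe used later; the file name is an assumption, and real
# 'X-Gmail-Labels' values are comma-separated strings that may need to be
# normalized to 'sent'/'inbox' before the analysis below.
import mailbox
mbox = mailbox.mbox('All mail Including Spam and Trash.mbox')
rows = []
for message in mbox:
    rows.append({'subject': message['subject'],
                 'from': message['from'],
                 'date': message['date'],
                 'to': message['to'],
                 'label': message['X-Gmail-Labels'],
                 'thread': message['X-GM-THRID']})
dfs = pd.DataFrame(rows)
# Parse dates and derive the day of the week used in the plots below
dfs['date'] = pd.to_datetime(dfs['date'], errors='coerce', utc=True)
dfs['dayofweek'] = dfs['date'].dt.day_name()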
# Descriptive statistics
dfs.info()
# Analyze when emails are sent and received with Gmail
# Create two sub-dataframes: one for sent emails and another for received emails
sent = dfs[dfs['label']=='sent']
received = dfs[dfs['label']=='inbox']
# Find the most active days for receiving and sending emails separately
sdw = sent.groupby('dayofweek').size() / len(sent)
rdw = received.groupby('dayofweek').size() / len(received)
df_tmp = pd.DataFrame(data={'Outgoing Email': sdw, 'Incoming Email':rdw})
df_tmp.plot(kind='bar', rot=45, figsize=(8,5), alpha=0.5)
plt.xlabel('');
plt.ylabel('Fraction of weekly emails');
plt.grid(ls=':', color='k', alpha=0.5)
OUTPUT:
Load the dataset:
<mailbox.mbox at 0x7f124763f5c0>
Number of emails sent during a given timeframe (Tue, 24 May 2011 11:04 AM, to Fri,
20 Sep 2019 03:04 PM):
Tue, 24 May 2011 11:04 AM
Fri, 20 Sep 2019 03:04 PM
inbox 32952
sent 4602
Name: label, dtype: int64
Plot showing an overview of the time of day of email activity:
Find the most active days for receiving and sending emails:
RESULT:
Thus, the program for creating an email dataset, visualizing it, and getting different insights
from the data was executed, and the output was verified successfully.
EX. NO.:3.a BASIC NUMPY OPERATIONS
DATE:
AIM:
(i) To create different types of NumPy arrays and display basic information, such as the
data type, shape, size, and strides
(ii) To create an array using built-in NumPy functions
(iii) To perform file operations with NumPy arrays
ALGORITHM:
Step 1: Start the program and import the NumPy library.
Step 2: Create different types of NumPy arrays (1-D, 2-D, and 3-D).
Step 3: Display the arrays.
Step 4: Print the memory address, the shape, the data type, and the strides of the array.
Step 5: Then, create an array using built-in NumPy functions.
Step 6: Perform file operations with NumPy arrays.
(i) Creation of different types of Numpy arrays and displaying basic information
# Importing numpy
import numpy as np
# Defining 1D array
my1DArray = np.array([1, 8, 27, 64])
print(my1DArray)
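# NOTE: the lines below are a sketch reconstructing the code implied by the
# 2D/3D array outputs and array attributes shown in the OUTPUT section;
# they are not from the original listing.
# Defining 2D and 3D arrays and displaying basic information
my2DArray = np.array([[1, 2, 3, 4], [2, 4, 9, 16], [4, 8, 18, 32]])
print(my2DArray)
my3DArray = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]],
                      [[1, 2, 3, 4], [9, 10, 11, 12]]])
print(my3DArray)
# Memory address, shape, data type, and strides of the 2D array
print(my2DArray.data)
print(my2DArray.shape)
print(my2DArray.dtype)
print(my2DArray.strides)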
# Array of ones
ones = np.ones((3,4))
print(ones)
# Array of zeros
zeros = np.zeros((2,3,4),dtype=np.int16)
print(zeros)
# Empty array
emptyArray = np.empty((3,2))
print(emptyArray)
# Full array
fullArray = np.full((2,2),7)
print(fullArray)
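# NOTE: sketch of the remaining built-in array constructors implied by the
# outputs below; the random values will differ on each run.
# Array of evenly spaced values
print(np.arange(10, 25, 5))
# 3D array of random integers between 0 and 14
print(np.random.randint(0, 15, size=(2, 2, 4)))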
OUTPUT:
(i) Creation of different types of Numpy arrays and displaying basic information
[ 1 8 27 64]
[[ 1 2 3 4]
[ 2 4 9 16]
[ 4 8 18 32]]
[[[ 1 2 3 4]
[ 5 6 7 8]]
[[ 1 2 3 4]
[ 9 10 11 12]]]
<memory at 0x00000247AE2A0A00>
(3, 4)
int32
(16, 4)
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
[[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]]
[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]]]
[[0. 0.]
[0. 0.]
[0. 0.]]
[[7 7]
[7 7]]
[10 15 20]
[[[11 11 9 9]
[11 0 2 0]]
[[10 14 9 14]
[ 0 1 11 11]]]
RESULT:
Thus, the program to implement NumPy operations with arrays using Python was executed
and the output was verified successfully.
EX. NO.:3.b BASIC ARITHMETIC OPERATIONS WITH NUMPY ARRAYS
DATE
AIM:
To implement arithmetic operations with NumPy arrays using Python.
ALGORITHM:
Step 1: Start the program and import the NumPy library.
Step 2: Create two NumPy arrays.
Step 3: Display the two arrays.
Step 4: Perform the arithmetic operations on the two arrays using NumPy.
Step 5: Display the results.
Step 6: Stop the program.
PROGRAM:
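The original listing is missing here; the sketch below is consistent with the output fragment that follows (the second array, [10 10 10], is from the source; the first array is an assumption):
# Importing numpy
import numpy as np
# Defining the two arrays
first = np.array([1, 2, 3])
second = np.array([10, 10, 10])
print("First array:")
print(first)
print("Second array:")
print(second)
# Performing the arithmetic operations
print(first + second)    # addition
print(second - first)    # subtraction
print(first * second)    # multiplication
print(second / first)    # division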
OUTPUT:
Second array:
[10 10 10]
RESULT:
Thus, the program to implement arithmetic operations with NumPy arrays using Python was
executed and the output was verified successfully.
EX. NO.:4 CREATING AND LOADING PANDAS DATAFRAMES
DATE:
AIM:
To create and load pandas dataframes using Python.
ALGORITHM:
Step 1: Import the numpy and pandas libraries.
Step 2: Create a pandas series and display it.
Step 3: Create a dataframe from a list of elements (numbers, dictionaries, and n-dimensional
arrays).
Step 4: Display the outputs.
Step 5: Stop the program.
PROGRAM:
import numpy as np
import pandas as pd
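# NOTE: the lines below are a sketch reconstructed from the outputs that
# follow; the dataframe values are read off the printed output and the exact
# construction is an assumption.
print("Pandas Version:", pd.__version__)
# (i) Creating a series from a list of prime numbers
series = pd.Series([2, 3, 7, 11, 13, 17, 19, 23])
print(series)
# (ii) Creating a dataframe from a dictionary of heterogeneous columns
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': pd.Timestamp('20190526'),
                   'C': 5.0,
                   'D': 3,
                   'E': pd.Categorical(['Depression', 'Social Anxiety',
                                        'Bipolar Disorder', 'Eating Disorder']),
                   'F': 'Mental health',
                   'G': 'is challenging'})
print(df)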
OUTPUT:
(i) Creation of a dataframe from a series
Pandas Version: 1.3.4
0 2
1 3
2 7
3 11
4 13
5 17
6 19
7 23
dtype: int64
A B C D E F G
0 1 2019-05-26 5.0 3 Depression Mental health is challenging
1 2 2019-05-26 5.0 3 Social Anxiety Mental health is challenging
2 3 2019-05-26 5.0 3 Bipolar Disorder Mental health is challenging
3 4 2019-05-26 5.0 3 Eating Disorder Mental health is challenging
RESULT:
Thus, the programs for creating and loading pandas dataframes using Python have been
implemented and the output was verified successfully.
EX. NO.:5 DATA CLEANING AND VISUALIZATION
DATE:
AIM:
To explore missing values for cleaning data and to visualize the missing values in a sample
dataframe.
ALGORITHM:
Step 1: Load the relevant libraries, such as pandas, numpy, matplotlib, and seaborn.
Step 2: Create a sample dataframe containing missing values.
Step 3: Analyze the missing values in the dataframe using various commands.
Step 4: Drop the rows and columns that contain completely missing values.
Step 5: Visualize the missing values.
Step 6: Stop the program.
PROGRAM:
# Loading the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
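# NOTE: a minimal sketch of the remaining steps; the sample dataframe below is
# an assumption, chosen to contain missing values.
df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [np.nan, np.nan, np.nan, np.nan],
                   'C': [1.0, np.nan, 3.0, 4.0]})
# Count the missing values per column
print(df.isnull().sum())
# Drop the columns and rows that contain completely missing values
df = df.dropna(axis=1, how='all').dropna(axis=0, how='all')
# Visualize the missing values as a heatmap
sns.heatmap(df.isnull(), cbar=False)
plt.show()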
OUTPUT:
Create a dataframe with pandas:
RESULT:
Thus, the missing value analysis for cleaning data was explored and visualized, and the output
was verified.
EX. NO.:6 TIME SERIES ANALYSIS
DATE:
AIM:
To perform time series analysis on the open power system dataset and to apply various
visualization techniques.
ALGORITHM:
Step 1: Load the relevant libraries and time series dataset.
Step 2: Perform descriptive analysis on the dataset.
Step 3: Convert the Date column to Datetime format.
Step 4: Add columns such as year, month, and days of the week to the dataset.
Step 5: Generate line plots of the full time series of Germany's daily electricity consumption.
Step 6: Generate line plots for analyzing the electricity consumption for a single year and a
particular month.
Step 7: Generate box plots by grouping the data by months and days of the week.
Step 8: Terminate the program.
PROGRAM:
# Load the required library and the time series dataset
import pandas as pd
df_power = pd.read_csv("https://raw.githubusercontent.com/jenfly/opsd/master/opsd_germany_daily.csv")
df_power.columns
# Check last 10 entries inside the dataframe
df_power.tail(10)
# View the data types of each column in df_power dataframe
df_power.dtypes
# Convert object to datetime format
df_power['Date'] = pd.to_datetime(df_power['Date'])
# Verify the conversion of the Date column to Datetime format
df_power.dtypes
# Change the index of our dataframe to the Date column:
df_power = df_power.set_index('Date')
df_power.tail(3)
# Add columns with year, month, and weekday name
df_power['Year'] = df_power.index.year
df_power['Month'] = df_power.index.month
df_power['Weekday Name'] = df_power.index.day_name()  # .weekday_name in pandas < 1.0
# Display a random sampling of 5 rows
df_power.sample(5, random_state=0)
# Visualizing time series
# Import the seaborn and matplotlib libraries:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(11, 4)})
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 150
# Generate a line plot of the full time series of Germany's daily electricity consumption
df_power['Consumption'].plot(linewidth=0.5)
# Use the dots to plot the data for all the other columns
cols_to_plot = ['Consumption', 'Solar', 'Wind']
axes = df_power[cols_to_plot].plot(marker='.', alpha=0.5, linestyle='None',
                                   figsize=(14, 6), subplots=True)
for ax in axes:
    ax.set_ylabel('Daily Totals (GWh)')
# Investigate a single year
ax = df_power.loc['2016', 'Consumption'].plot()
ax.set_ylabel('Daily Consumption (GWh)');
# Examine the month of December 2016
ax = df_power.loc['2016-12', 'Consumption'].plot(marker='o', linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)');
# Group the data by months and visualize the data using box plots
fig, axes = plt.subplots(3, 1, figsize=(8, 7), sharex=True)
for name, ax in zip(['Consumption', 'Solar', 'Wind'], axes):
    sns.boxplot(data=df_power, x='Month', y=name, ax=ax)
    ax.set_ylabel('GWh')
    ax.set_title(name)
    if ax != axes[-1]:
        ax.set_xlabel('')
# Group the consumption of electricity by the day of the week, and visualize it using a box plot
sns.boxplot(data=df_power, x='Weekday Name', y='Consumption');
OUTPUT:
Plot of the data for all the other columns using dots:
Grouping of the electricity consumption by the day of the week, visualized using a box plot:
RESULT:
Thus, time series analysis on the Open Power System dataset was performed with various
visualization techniques, and the output was verified.
EX. NO.:7 VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
DATE:
AIM:
To display a map showing the location, size, and population of California cities using the
basemap package.
ALGORITHM:
Step 1: Install basemap using the command: $ conda install basemap
Step 2: Load the relevant libraries, such as numpy, pandas, matplotlib, and basemap.
Step 3: Read the ‘california_cities.csv’ data and extract the required columns.
Step 4: Draw the map background.
Step 5: Scatter city data, with color reflecting population and size reflecting area.
Step 6: Make a legend with dummy points.
Step 7: Create a colorbar and legend.
PROGRAM:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
cities = pd.read_csv('california_cities.csv')
# Extract the data
lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values
# Draw the map background
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution='h', lat_0=37.5, lon_0=-119,
            width=1E6, height=1.2E6)
m.shadedrelief()
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')
# scatter city data, with color reflecting population and size reflecting area
m.scatter(lon, lat, latlon=True, c=np.log10(population), s=area, cmap='Reds', alpha=0.5)
# create colorbar and legend
plt.colorbar(label=r'$\log_{10}({\rm population})$')
plt.clim(3, 7)
# make legend with dummy points
for a in [100, 300, 500]:
    plt.scatter([], [], c='k', alpha=0.5, s=a, label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False, labelspacing=1, loc='lower left');
OUTPUT:
RESULT:
Thus, the map showing information about the location, size, and population of California cities
was displayed using the basemap package, and the output was verified.
EX. NO.:8 EXPLORATORY DATA ANALYSIS ON WINE QUALITY DATA SET
DATE:
AIM:
To analyze and explore the wine quality dataset and identify how the quality of wine
correlates with the other features, using various plots.
ALGORITHM:
Step 1: Load the relevant libraries, such as numpy, pandas, matplotlib, and seaborn.
Step 2: Read the wine quality data and create two different dataframes, namely, df_red for
holding the red wine data and df_white for holding the white wine data.
Step 3: Analyze descriptive statistics on the dataset.
Step 4: Generate plots showing the variation in the quality of wine with respect to alcohol
concentration and correlations among the features.
Step 5: Show the outputs.
Step 6: Terminate the program.
PROGRAM:
# Load the pandas library and create two different dataframes, namely, df_red for holding the
# red wine data and df_white for holding the white wine data
import pandas as pd
df_red = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", delimiter=";")
df_white = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", delimiter=";")
# Check the name of the available columns:
df_red.columns
# Descriptive statistics
# Display the entries from the 100th to 110th rows from the red wine dataframe
df_red.iloc[100:110]
# Check for missing values and display information about the data
df_red.info()
# Analyzing red wine
# Identify the quality of red wine
import seaborn as sns
sns.set(rc={'figure.figsize': (14, 8)})
sns.countplot(x='quality', data=df_red)
# Plot the variation of the quality of wine with respect to alcohol concentration
sns.boxplot(x='quality', y='alcohol', data = df_red)
# Plot showing the variation of the quality of wine with respect to alcohol concentration,
# without outliers
sns.boxplot(x='quality', y='alcohol', data=df_red, showfliers=False)
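The output section below includes a joint plot of alcohol concentration against pH; a sketch of code that would produce it (the regression kind is an assumption):
# Joint plot illustrating the correlation between alcohol concentration and the pH values
sns.jointplot(x='alcohol', y='pH', data=df_red, kind='reg')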
OUTPUT:
Name of the available columns:
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides',
'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol',
'quality'], dtype='object')
Display the entries from the 100th to 110th rows from the red wine dataframe:
Check for missing values and display information about the data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity 1599 non-null float64
volatile acidity 1599 non-null float64
citric acid 1599 non-null float64
residual sugar 1599 non-null float64
chlorides 1599 non-null float64
free sulfur dioxide 1599 non-null float64
total sulfur dioxide 1599 non-null float64
density 1599 non-null float64
pH 1599 non-null float64
sulphates 1599 non-null float64
alcohol 1599 non-null float64
quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
A box plot showing the variation of the quality of wine with respect to alcohol
concentration without outliers:
Joint plot illustrating the correlation between alcohol concentration and the pH values:
RESULT:
Thus, the analysis of the wine quality dataset was explored, and the output was verified.
EX. NO.:9 EXPLORATORY DATA ANALYSIS ON HABERMAN DATA SET
DATE:
INTRODUCTION:
Haberman Dataset
The dataset contains cases from a study that was conducted between 1958 and 1970 at the
University of Chicago's Billings Hospital on the survival of patients who had undergone
surgery for breast cancer.
Attribute Information:
1. Age of the patient at the time of operation (numerical)
2. Patient's year of operation (year minus 1900, numerical)
3. Number of positive axillary lymph nodes detected (numerical)
4. Survival status: 1 = the patient survived 5 years or longer; 2 = the patient died within
5 years.
AIM:
To analyze and explore the Haberman dataset using visualization techniques and to present
an analysis report.
ALGORITHM:
Step 1: Load the relevant libraries, such as numpy, pandas, matplotlib, and seaborn.
Step 2: Read the Haberman data.
Step 3: Analyze descriptive statistics on the dataset.
Step 4: Generate pair plots, histograms and box plots to analyze the survival status of patients.
Step 5: Show the outputs.
Step 6: Terminate the program.
PROGRAM:
# importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
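# NOTE: the loading step is missing from the original listing; the lines below
# are a sketch in which the file name, column names, and the change_param
# mapping are assumptions reconstructed from the code and outputs that follow.
data = pd.read_csv("haberman.csv",
                   names=["age", "operation_year", "lymph_nodes", "survival_status"])
# Map the numeric survival status codes (1/2) to readable labels
def change_param(x):
    return "survived" if x == 1 else "died"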
# Description of data
data.describe()
# Convert the survival status codes to readable labels
chVar = lambda x: change_param(x)
data["survival_status"] = pd.DataFrame(data.survival_status.apply(change_param))
Pair Plots
sns.set_style("whitegrid")
sns.pairplot(data,hue="survival_status",height=5)
plt.show()
Histogram
# histogram for age
sns.FacetGrid(data,hue="survival_status",height=5) \
.map(sns.distplot,"age")
plt.title("Histogram for Age")
plt.xlabel('AGE')
plt.ylabel('VALUE')
plt.legend()
plt.show();
# histogram for lymph_nodes
sns.FacetGrid(data,hue="survival_status",height=5) \
.map(sns.distplot,"lymph_nodes")
plt.title("Histogram for Lymph Nodes")
plt.legend()
plt.xlabel('LYMPH NODES')
plt.ylabel('VALUE')
plt.show();
Box Plots
# box plot for age vs survival status
sns.boxplot(x='survival_status',y="age",data=data,hue="survival_status")
plt.title("Survival Status vs Age")
plt.ylabel("Age")
plt.xlabel("Survival Status")
plt.legend(loc="center")
plt.show()
OUTPUT:
Description of data:
Observations:
The minimum and maximum ages of the patients were 30 and 83, respectively.
The minimum and maximum years of operation were 58 and 69, respectively.
The minimum and maximum numbers of lymph nodes found were 0 and 52, respectively.
The 25th, 50th, and 75th percentile values of age are 44, 52, and 60.75, respectively.
The 25th, 50th, and 75th percentile values of lymph_nodes are 0, 1, and 4, respectively.
The 25th, 50th, and 75th percentile values of operation_year are 60, 63, and 65.75, respectively.
Observation:
The distribution of the two survival classes is far from even, so the given dataset is an
imbalanced dataset.
Pair Plots:
Observation:
From the above pair plots, we can conclude that the plot of lymph_nodes against
operation_year gives a slightly better separation of the data points.
Histograms:
Observation:
Box Plots
RESULT:
Thus, a case study on the Haberman dataset was carried out, the various EDA and visualization
techniques were applied, and the output was verified.