Unit - 1 EDA

The document outlines the objectives and fundamentals of Exploratory Data Analysis (EDA), emphasizing its significance in understanding data through various techniques including univariate, bivariate, and multivariate analysis. It details the phases of data analysis, including data collection, processing, cleaning, and visualization, while highlighting the importance of data types and measurement scales. Additionally, it discusses the role of EDA in data science projects and the use of software tools for effective data exploration and visualization.


AD3301 - DATA EXPLORATION AND VISUALIZATION

L T P C
3 0 2 4

OBJECTIVES:

TO OUTLINE AN OVERVIEW OF EXPLORATORY DATA ANALYSIS.


TO IMPLEMENT DATA VISUALIZATION USING MATPLOTLIB.
TO PERFORM UNIVARIATE DATA EXPLORATION AND ANALYSIS.
TO APPLY BIVARIATE DATA EXPLORATION AND ANALYSIS.
TO USE DATA EXPLORATION AND VISUALIZATION TECHNIQUES FOR MULTIVARIATE
AND TIME SERIES DATA.

UNIT I EXPLORATORY DATA ANALYSIS    9

EDA FUNDAMENTALS – UNDERSTANDING DATA SCIENCE – SIGNIFICANCE OF EDA – MAKING SENSE OF DATA – COMPARING EDA WITH CLASSICAL AND BAYESIAN ANALYSIS – SOFTWARE TOOLS FOR EDA – VISUAL AIDS FOR EDA – DATA TRANSFORMATION TECHNIQUES – MERGING DATABASES, RESHAPING AND PIVOTING,
UNIT - 1
EXPLORATORY DATA
ANALYSIS
TOPIC - 1
EXPLORATORY DATA
ANALYSIS
FUNDAMENTALS
DATA?
DATA - Collection of objects,
events and facts in the form
of numbers, text, audio &
videos.
How do we get meaningful & useful information from data?
EXPLORATORY DATA
ANALYSIS (EDA)
EDA - Process of investigating datasets, explaining subjects, extracting the information enfolded in the data, and visualizing the outcomes
WHY IS EDA IMPORTANT?
• Just like everything in this world, data has its imperfections.
• Raw data is usually skewed, may have outliers, or may have too many missing values.
• A model built on such data results in sub-optimal performance.
TOPIC - 2
UNDERSTANDING
DATA SCIENCE
UNDERSTANDING DATA
SCIENCE
Data Science involves cross-disciplinary knowledge from data, statistics, computer science, and mathematics
DATA SCIENCE
PROJECT FLOW
Data Science Project Flow with EDA
as part of data preparation
PHASES OF DATA
ANALYSIS
CRISP-DM (Cross Industry Standard Process for Data Mining) is one of the frameworks in data mining.

1. DATA COLLECTION
2. DATA PROCESSING
3. DATA CLEANING
4. EDA
5. MODELLING & ALGORITHMS
STAGES OF DATA
ANALYSIS
STAGES (PILLARS) OF EDA

CRISP-DM (Cross Industry Standard Process for Data Mining) frameworks in data mining

1. DATA REQUIREMENTS
2. DATA COLLECTION
3. DATA PROCESSING
4. DATA CLEANING
5. EDA
6. MODELLING & ALGORITHMS
7. DATA PRODUCT
8. COMMUNICATION
DATA REQUIREMENTS
• There can be various sources of data for an organization.
• It is important to comprehend what type of data needs to be collected, curated, and stored for the organization.

DEMENTIA PATIENT - CASE STUDY

An application tracking the sleeping pattern of patients suffering from dementia requires what types of sensor data storage?
01. SLEEP DATA

02. HEART RATE FROM THE PATIENTS

03. ELECTRO-DERMAL ACTIVITIES

04. USER ACTIVITY PATTERNS

All of these data points are required to correctly diagnose the mental state of the person.
TYPES OF DATA

NUMERICAL (numbers)

CATEGORICAL (collection of information divided into groups)
DATA COLLECTION
• Data can be collected from several
objects on several events using
different types of sensors and storage
tools.
• Data collected from several sources
must be stored in the correct format
DATA PROCESSING
• Preprocessing involves the process of
pre-curating the dataset before actual
analysis.
• Common tasks involve correctly
exporting the dataset, placing them
under the right tables, structuring
them, and exporting them in the
correct format
DATA CLEANING
• Preprocessed data is still not ready for detailed
analysis.
• It must be correctly transformed for an
incompleteness check, duplicate check, error
check, and missing value check.
• Finding inaccuracies in the dataset,
understanding the overall data quality,
removing duplicate items, and filling in the
missing values.
• These tasks are performed in the data cleaning
stage.
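These cleaning tasks can be sketched in pandas; the patient records below are made-up illustration data, not from the course material:

```python
import pandas as pd
import numpy as np

# Hypothetical patient records with a duplicate row and missing values
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4],
    "age": [34.0, 51.0, 51.0, np.nan, 46.0],
    "weight": [70.0, 82.5, 82.5, 64.0, np.nan],
})

df = df.drop_duplicates()                   # duplicate check: remove duplicate items
print(df.isna().sum())                      # incompleteness check: missing values per column
df["age"] = df["age"].fillna(df["age"].median())         # fill in missing ages
df["weight"] = df["weight"].fillna(df["weight"].mean())  # fill in missing weights
print(df)
```

After these steps the frame has no duplicates and no missing values, and is ready for the EDA stage.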
EDA
• Exploratory data analysis is the stage where we actually start to understand the message contained in the data.
• It should be noted that several types
of data transformation techniques
might be required during the process
of exploration.
MODELING & ALGORITHM
• From a data science perspective,
generalized models can exhibit
relationships among different
variables, such as correlation or
causation.
• These models involve one or more
variables that depend on other
variables to cause an event.
For example, when buying pens:
Total price of pens (Total) = price of one pen (UnitPrice) * number of pens bought (Quantity)
Total = UnitPrice * Quantity
• Total price is dependent on the unit
price.
• Hence, the total price is referred to as
the dependent variable, and the unit
price is referred to as an independent
variable.
• A model always describes the relationship between independent and dependent variables.
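The pen example can be written as a tiny Python function, with Total as the dependent variable computed from the independent ones:

```python
def total_price(unit_price, quantity):
    """Total (dependent variable) is determined by UnitPrice and Quantity."""
    return unit_price * quantity

print(total_price(10.0, 3))  # 30.0
```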
DATA PRODUCT
• Any computer software that uses data as
inputs, produces outputs, and provides
feedback based on the output to control the
environment is referred to as a data product.
• A data product is generally based on a model
developed during data analysis.
• For example, a recommendation model that
inputs user purchase history and recommends
a related item that the user is highly likely to
buy.
COMMUNICATION
• This stage deals with disseminating the results
to end stakeholders to use the results for
business intelligence.
• One of the most notable steps in this stage is
data visualization.
• Visualization deals with information relay
techniques such as tables, charts, summary
diagrams, and bar charts to show the analyzed
result.
TOPIC - 3
SIGNIFICANCE OF
EDA
SIGNIFICANCE OF EDA
• Different fields of science, economics, engineering, and marketing accumulate and store data primarily in databases.
• Appropriate and well-established decisions
should be made using the data collected.
• It is practically impossible to make sense of
datasets containing more than a handful of
data points without the help of computer
programs.
SIGNIFICANCE OF EDA
• To get insights out of the collected data and to make further
decisions, data mining is performed where we go through
distinctive analysis processes.
• Exploratory data analysis is key, and usually the first exercise
in data mining.
• It allows us to visualize data to understand it as well as to
create hypotheses for further analysis.
• EDA actually reveals ground truth about the content without
making any underlying assumptions.
• In fact, data scientists use this process to actually understand what types of modeling and hypotheses can be created.
KEY COMPONENTS OF EDA

01. SUMMARIZING DATA - PANDAS
02. STATISTICAL ANALYSIS - SCIPY
03. VISUALIZATION OF DATA - MATPLOTLIB & PLOTLY
STEPS IN EDA

1. PROBLEM DEFINITION
2. DATA PREPARATION
3. DATA ANALYSIS
4. DEVELOPMENT & REPRESENTATION OF RESULTS
PROBLEM DEFINITION
• Before trying to extract useful insight from
the data, it is essential to define the
business problem to be solved.
• The main tasks involved in problem
definition are defining the main objective of
the analysis, defining the main
deliverables, outlining the main roles and
responsibilities, obtaining the current
status of the data, defining the timetable,
and performing cost/benefit analysis.
• Based on such a problem definition, an
execution plan can be created
DATA PREPARATION
• This step involves methods for preparing the dataset before actual analysis:
• Define the sources of data
• Define data schemas and tables
• Understand the main characteristics of the data
• Clean the dataset
• Delete non-relevant datasets
• Transform the data
• Divide the data into required chunks for analysis
DATA ANALYSIS
• This is one of the most crucial steps, dealing with descriptive statistics and analysis of the data:
• Summarizing the data
• Finding the hidden correlations and relationships among the data
• Developing predictive models
• Evaluating the models
• Calculating the accuracies
TECHNIQUES USED FOR DATA SUMMARIZATION

• Summary Tables
• Graphs
• Descriptive Statistics
• Inferential Statistics
• Correlation Statistics
• Grouping
• Searching
• Mathematical Models

DEVELOPMENT & REPRESENTATION OF THE RESULTS
• This step involves presenting the dataset to the target audience in the form of graphs, summary tables, maps, and diagrams.
• Common plot types: Scatter Plots, Character Plots, Histograms, Box Plots, Residual Plots, Mean Plots
Some commonly used plots for EDA are:

• Histograms: To check the distribution of a specific


variable
• Scatter plots: To check the dependency between
two variables
• Feature correlation plot (heatmap): To understand
the dependencies between multiple variables
• Time series plots: To identify trends and seasonality in time-dependent data
EXPLORATORY TOOLS
PYTHON
ENTERPRISE APPLICATIONS
TOPIC - 4
MAKING SENSE OF
DATA
MAKING SENSE OF DATA

• Different disciplines store different kinds of data for different purposes.

Medical Researchers - Patient Data
Universities - Students' & Teachers' Data
Real estate industries - House & Building datasets
• A dataset contains many observations about a particular object.
• For instance, a dataset about patients in a hospital can contain
many observations.
• A patient can be described by:
Patient identifier (ID)
Name
Date of Birth
Address
Email
Gender
Weight
• Each of these features that describes a patient is a variable.
• Each observation can have a specific value for each of these
variables.
PATIENT INFORMATION DATABASE

CATEGORIES OF DATA?

MOST DATASETS BROADLY FALL INTO TWO GROUPS

NUMERICAL

CATEGORICAL

NUMERICAL DATA?
NUMERICAL DATA
• This data has a sense of measurement involved in it; for
example, a person's age, height, weight, blood pressure,
heart rate, temperature, number of teeth, number of bones,
and the number of family members.
• This data is often referred to as quantitative data in statistics.

NUMERICAL DATA: DISCRETE DATA, CONTINUOUS DATA
DISCRETE DATA?
DISCRETE DATA
• This is data that is countable and its values can be listed
out.
• For example, if we flip a coin, the number of heads in 200
coin flips can take values from 0 to 200 (finite) cases.
• A variable that represents a discrete dataset is referred
to as a discrete variable.
• The discrete variable takes a fixed number of distinct
values.
• For example, the Country variable can have values such
as Nepal, India, Norway, and Japan. It is fixed.
• The Rank variable of a student in a classroom can take
values from 1, 2, 3, 4, 5 and so on.
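The coin-flip example can be sketched in Python; the seed and flip count here are arbitrary choices for illustration:

```python
import random

random.seed(1)
flips = [random.choice("HT") for _ in range(200)]
heads = flips.count("H")  # discrete variable: only whole numbers 0..200 are possible
print(heads)
```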
CONTINUOUS DATA?
CONTINUOUS DATA
• A variable that can have an infinite number of numerical
values within a specific range is classified as continuous
data.
• A variable describing continuous data is a continuous
variable.
• For example, what is the temperature of your city today? Can the set of possible values be finite?
• Similarly, the weight variable in the previous section is a
continuous variable.
TRY IT!!

Check the preceding table and determine which of the variables are discrete and which are continuous. Can you justify your answer?
CATEGORICAL DATA?
CATEGORICAL DATA
• This type of data represents the characteristics of an object; for example, gender, marital status, type of address, or categories of the movies.
• This data is often referred to as qualitative datasets in statistics.
TYPES OF CATEGORICAL DATA
• Gender (Male, Female, Other, or Unknown)

• Marital Status (Divorced, Legally Separated, Married,


Never Married, Domestic Partner, Unmarried,
Widowed, or Unknown)

• Movie genres (Action, Adventure, Comedy, Crime,


Drama, Fantasy, Historical, Horror, Mystery,
Philosophical, Political, Romance, Saga, Satire,
Science Fiction, Social, Thriller, Urban, or Western)
CATEGORICAL DATA
Blood type (A, B, AB, or O)

Types of drugs (Stimulants, Depressants,


Hallucinogens, Dissociatives, Opioids, Inhalants, or
Cannabis)
• A variable describing categorical data is referred to as a
categorical variable.
• These types of variables can have one of a limited
number of values.
TYPES OF CATEGORICAL VARIABLES?

CATEGORICAL VARIABLE
• DICHOTOMOUS VARIABLE
• POLYTOMOUS VARIABLE
DICHOTOMOUS VARIABLE
• A binary categorical variable can take exactly
two values and is also referred to as a
dichotomous variable.
• For example, when you create an experiment,
the result is either success or failure.
• Hence, results can be understood as a binary
categorical variable.
POLYTOMOUS VARIABLES
• Polytomous variables are categorical variables that can take more than two possible values.
• For example, marital status can have several values, such as divorced, legally separated, married, never married, domestic partner, unmarried, widowed, and unknown.
• Since marital status can take more than two possible values, it is a polytomous variable.
MEASUREMENT SCALES

NOMINAL

ORDINAL

INTERVAL

RATIO
NOMINAL?
NOMINAL

• In nominal, the scales are generally referred to


as labels.
• These scales are mutually exclusive and do
not carry any numerical importance.
• Nominal scales are considered qualitative
scales.
ORDINAL?
ORDINAL
• The main difference between the ordinal and nominal scales is the order.
• In ordinal scales, the order of the values is a significant factor.
• Example: the Likert scale.
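In pandas, an ordinal scale such as the Likert scale can be represented with an ordered categorical; the survey answers below are invented for illustration:

```python
import pandas as pd

# Hypothetical Likert-scale survey answers stored as an ordered categorical
answers = pd.Series(pd.Categorical(
    ["Agree", "Neutral", "Strongly agree", "Disagree", "Agree"],
    categories=["Strongly disagree", "Disagree", "Neutral",
                "Agree", "Strongly agree"],
    ordered=True,
))
# Because the scale is ordinal, comparisons respect the order of the labels
print(answers.min(), "->", answers.max())
```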
INTERVAL?
INTERVAL
• Interval scales are widely used in statistics, for example, in the measures of central tendency (mean, median, mode) and standard deviations.
RATIO?
RATIO
• Ratio scales contain order, exact values, and an absolute zero, which makes them usable in descriptive and inferential statistics.
• Mathematical operations, measures of central tendency, and measures of dispersion such as the coefficient of variation can also be computed from such scales.
• Examples include measures of energy, mass, length, duration, electrical energy, plane angle, and volume.
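A small NumPy sketch of the coefficient of variation on ratio-scale data; the mass values are made up for illustration:

```python
import numpy as np

masses = np.array([12.0, 15.5, 9.8, 14.2])  # ratio-scale data: a true zero exists
mean = masses.mean()
cv = masses.std() / mean  # coefficient of variation is meaningful only on ratio scales
print(mean, cv)
```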
TOPIC - 5
COMPARING EDA
WITH CLASSICAL &
BAYESIAN ANALYSIS
COMPARING EDA WITH CLASSICAL AND
BAYESIAN ANALYSIS

1. CLASSICAL DATA ANALYSIS

2. EXPLORATORY DATA ANALYSIS

3. BAYESIAN DATA ANALYSIS


TOPIC - 6
SOFTWARE TOOLS
AVAILABLE FOR EDA
SOFTWARE TOOLS FOR EDA?
SOFTWARE TOOLS AVAILABLE FOR
EDA
PYTHON

R PROGRAMMING

WEKA

KNIME
PYTHON
• This is an open source programming language
widely used in data analysis, data mining, and
data science

R PROGRAMMING LANGUAGE
• R is an open source programming language that
is widely utilized in statistical computation and
graphical data analysis
WEKA
• This is an open source data mining package that
involves several EDA tools and algorithms

KNIME
• This is an open source tool for data analysis and
is based on Eclipse
GETTING
STARTED
WITH EDA
1. Write a Python program to read/write files
2. Write a Python program for error handling
3. Write a program using object-oriented concepts in Python
READING/WRITING TO FILES

filename = "datamining.txt"
file = open(filename, mode="r", encoding="utf-8")
lines = file.readlines()
print(lines)
file.close()
ERROR HANDLING
try:
    Value = int(input("Type a number between 47 and 100:"))
except ValueError:
    print("You must type a number between 47 and 100!")
else:
    if (Value > 47) and (Value <= 100):
        print("You typed: ", Value)
    else:
        print("The value you typed is incorrect!")
OBJECT-ORIENTED CONCEPT
class Disease:
    def __init__(self, disease='Depression'):
        self.type = disease
    def getName(self):
        print("Mental Health Diseases {0}".format(self.type))

d1 = Disease('Social Anxiety Disorder')
d1.getName()
WHAT IS NUMPY?
NUMPY
• NumPy is a python library that can be used to
perform a variety of mathematical operations
on arrays.
• It adds powerful data structures to Python that
guarantee efficient calculations with arrays and
matrices.
IMPORTING NUMPY
import numpy as np

CREATING DIFFERENT TYPES OF NUMPY ARRAYS
# Defining and printing 1D array
my1DArray = np.array([1, 8, 27, 64])
print(my1DArray)

# Defining and printing 2D array
my2DArray = np.array([[1, 2, 3, 4], [2, 4, 9, 16], [4, 8, 18, 32]])
print(my2DArray)

# Defining and printing 3D array
my3Darray = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]], [[1, 2, 3, 4], [9, 10, 11, 12]]])
print(my3Darray)
1 DIMENSIONAL ARRAY
• A One-Dimensional Array is the simplest form of an
Array in which the elements are stored linearly and
can be accessed individually by specifying the index
value of each element stored in the array.

2 DIMENSIONAL ARRAY
DISPLAYING BASIC INFORMATION - DATA TYPE, SHAPE, SIZE, AND STRIDES OF A NumPy ARRAY
# Print out memory address
print(my2DArray.data)
# Print the shape of the array
print(my2DArray.shape)
# Print out the data type of the array
print(my2DArray.dtype)
# Print the strides of the array
print(my2DArray.strides)
BUILT-IN NUMPY FUNCTIONS?
BUILT-IN NUMPY FUNCTIONS
#Array of ones
ones = np.ones((3,4))
# Array of ones with integer data type
ones = np.ones((3, 4), dtype=int)
print(ones)
# Array of zeros
zeros = np.zeros((2,3,4),dtype=np.int16)
print(zeros)
It creates a 3-dimensional array with a shape of (2,
3, 4), meaning it has 2 blocks, each containing 3 rows and 4
columns. The dtype=np.int16 parameter specifies that the
array elements should be of 16-bit integer data type.
BUILT-IN NUMPY FUNCTIONS
# Array with random values
np.random.random((2,2))
It creates a 2x2 array with random values between
0 & 1.
# Empty array
emptyArray = np.empty((3,2))
print(emptyArray)
It creates a 2-dimensional array with a shape of (3,
2).
# Full array
fullArray = np.full((2,2),7)
print(fullArray)
Each element in the array is 7 because you specified
7 as the fill value.
BUILT-IN NUMPY FUNCTIONS

# Array of evenly-spaced values (start, stop, step)
evenSpacedArray = np.arange(10, 25, 5)
print(evenSpacedArray)
# Array of evenly-spaced values (start, stop, number of samples)
evenSpacedArray2 = np.linspace(0, 2, 9)
print(evenSpacedArray2)
PANDAS

Pandas Library is used to get meaningful insight from the data.

import numpy as np
import pandas as pd
print("Pandas Version:", pd.__version__)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
PANDAS

In pandas, we can create data structures in two ways:
• Series
• Dataframes
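A minimal sketch of the two structures:

```python
import pandas as pd

# A Series is a one-dimensional labelled array
s = pd.Series([2, 3, 5], index=["a", "b", "c"])

# A DataFrame is a two-dimensional table whose columns are Series sharing one index
df = pd.DataFrame({"primes": s, "doubled": s * 2})
print(df)
```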
CREATE A DATAFRAME FROM A SERIES
series = pd.Series([2, 3, 7, 11, 13, 17, 19, 23])
print(series)
# Creating data frame from Series
series_df = pd.DataFrame({
'A': range(1, 5),
'B': pd.Timestamp('20190526'),
'C': pd.Series(5, index=list(range(4)), dtype='float64'),
'D': np.array([3] * 4, dtype='int64'),
'E': pd.Categorical(["Depression", "Social Anxiety", "Bipolar Disorder", "Eating Disorder"]),
'F': 'Mental health',
'G': 'is challenging' })
print(series_df)
CREATE A DATAFRAME FROM A DICTIONARY

# Creating data frame from Dictionary
dict_df = [{'A': 'Apple', 'B': 'Ball'}, {'A': 'Aeroplane', 'B': 'Bat', 'C': 'Cat'}]
dict_df = pd.DataFrame(dict_df)
print(dict_df)
CREATE A DATAFRAME FROM N-DIMENSIONAL
ARRAYS
# Creating a dataframe from ndarrays
sdf = {
'County':['Østfold', 'Hordaland', 'Oslo', 'Hedmark', 'Oppland',
'Buskerud'],
'ISO-Code':[1,2,3,4,5,6],
'Area': [4180.69, 4917.94, 454.07, 27397.76, 25192.10, 14910.94],
'Administrative centre': ["Sarpsborg", "Oslo", "City of Oslo",
"Hamar", "Lillehammer", "Drammen"]
}
sdf = pd.DataFrame(sdf)
LOAD A DATASET FROM AN EXTERNAL
SOURCE INTO A PANDAS DATAFRAME

columns = ['age', 'workclass', 'fnlwgt', 'education',


'education_num', 'marital_status', 'occupation', 'relationship',
'ethnicity', 'gender', 'capital_gain', 'capital_loss',
'hours_per_week', 'country_of_origin', 'income']
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-
learning-databases/adult/adult.data',names=columns)
df.head(10)
df.info()? loc & iloc?

df.info() - Displays the rows, columns, data types, and memory used by the dataframe.
• loc() and iloc() are methods used for slicing data from the Pandas DataFrame.
• They help in the convenient selection of data from the DataFrame in Python.
• They are used in filtering the data according to some conditions.
df.iloc[10]?
# Selects a row
df.iloc[10]

Select 10 rows using iloc?
# Selects 10 rows
df.iloc[0:10]

Select the last 2 rows using iloc?
# Selects the last 2 rows
df.iloc[-2:]
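A small sketch contrasting loc (label-based) with iloc (position-based) on a toy DataFrame; the data is invented for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 32, 47], "city": ["Oslo", "Hamar", "Bergen"]},
    index=["p1", "p2", "p3"],
)
print(df.loc["p2"])              # loc: select by index label
print(df.iloc[1])                # iloc: select by integer position (same row here)
adults = df.loc[df["age"] > 30]  # loc also supports boolean filtering
print(adults)
```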
If the values are greater than zero, we
change the color to black (the default
color); if the value is less than zero, we
change the color to red; and finally,
everything else would be colored green.
Define a Python function to accomplish
this
def colorNegativeValueToRed(value):
    if value < 0:
        color = 'red'
    elif value > 0:
        color = 'black'
    else:
        color = 'green'
    return 'color: %s' % color
SCIPY?

MATPLOTLIB?