Unit - 1 EDA
EXPLORATION & VISUALIZATION
AD3301 DATA EXPLORATION AND VISUALIZATION (L T P C: 3 0 2 4)
OBJECTIVES:
DATA
• NUMERICAL (numbers)
• CATEGORICAL (collection of
information divided into groups)
DATA COLLECTION
• Data can be collected from several
objects on several events using
different types of sensors and storage
tools.
• Data collected from several sources
must be stored in the correct format
DATA PROCESSING
• Preprocessing involves the process of
pre-curating the dataset before actual
analysis.
• Common tasks involve correctly
exporting the dataset, placing the data
under the right tables, structuring
it, and exporting it in the
correct format.
DATA CLEANING
• Preprocessed data is still not ready for detailed
analysis.
• It must be correctly transformed for an
incompleteness check, duplicate check, error
check, and missing value check.
• Finding inaccuracies in the dataset,
understanding the overall data quality,
removing duplicate items, and filling in the
missing values are the tasks performed in the
data cleaning stage.
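The duplicate check and missing value check described above can be sketched with pandas. This is a minimal illustration, not a prescribed workflow: the `records` table, its column names, and the choice to fill a missing age with the median are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical patient records; column names and values are illustrative only.
records = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "age": [34.0, 51.0, 51.0, None],
    "gender": ["F", "M", "M", "F"],
})

# Missing value check: count the missing entries in each column.
missing_per_column = records.isna().sum()

# Duplicate check: count rows that exactly repeat an earlier row.
duplicate_rows = int(records.duplicated().sum())

# Remove the duplicates, then fill the missing age with the median age
# (median filling is one possible choice, assumed here for illustration).
cleaned = records.drop_duplicates().reset_index(drop=True)
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print(cleaned)
```

Other fill strategies (mean, a constant, or dropping the row) may suit a given dataset better; the right choice depends on the data.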
EDA
• Exploratory data analysis is the stage
where we actually start to understand
the message contained in the data.
• It should be noted that several types
of data transformation techniques
might be required during the process
of exploration.
MODELING & ALGORITHM
• From a data science perspective,
generalized models can exhibit
relationships among different
variables, such as correlation or
causation.
• These models involve one or more
variables that depend on other
variables to cause an event.
For example, when buying pens, the
total price of the pens (Total) equals the
price for one pen (UnitPrice) multiplied by
the number of pens bought (Quantity):
Total = UnitPrice * Quantity
• The total price depends on the unit
price.
• Hence, the total price is referred to as
the dependent variable, and the unit
price is referred to as an independent
variable.
• A model always describes the
relationship between
independent and dependent
variables.
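The pen-price relationship can be written as a small function, which makes the dependence explicit: the output changes only when an input changes. The function name `total_price` and the sample values are hypothetical, chosen for illustration.

```python
def total_price(unit_price: float, quantity: int) -> float:
    """Return Total, the dependent variable, from UnitPrice and Quantity."""
    return unit_price * quantity


# Hypothetical values: 12 pens at 5.0 each.
total = total_price(unit_price=5.0, quantity=12)
print(total)
```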
DATA PRODUCT
• Any computer software that uses data as
inputs, produces outputs, and provides
feedback based on the output to control the
environment is referred to as a data product.
• A data product is generally based on a model
developed during data analysis.
• For example, a recommendation model that
takes a user's purchase history as input and
recommends a related item that the user is
highly likely to buy is a data product.
COMMUNICATION
• This stage deals with disseminating the results
to end stakeholders to use the results for
business intelligence.
• One of the most notable steps in this stage is
data visualization.
• Visualization deals with information relay
techniques such as tables, charts, summary
diagrams, and bar charts to show the analyzed
result.
TOPIC - 3
SIGNIFICANCE OF EDA
3. DATA ANALYSIS
4. DEVELOPMENT &
REPRESENTATION OF RESULTS
PROBLEM DEFINITION
• Before trying to extract useful insight from
the data, it is essential to define the
business problem to be solved.
• The main tasks involved in problem
definition are defining the main objective of
the analysis, defining the main
deliverables, outlining the main roles and
responsibilities, obtaining the current
status of the data, defining the timetable,
and performing cost/benefit analysis.
• Based on such a problem definition, an
execution plan can be created
DATA PREPARATION
• This step involves methods for preparing the dataset before
actual analysis:
• Define the sources of data
• Define data schemas and tables
• Understand the main characteristics of the data
• Clean the dataset
• Delete non-relevant datasets
• Transform the data
TECHNIQUES USED FOR DATA SUMMARIZATION
• Graphs
• Descriptive Statistics
• Inferential Statistics
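As a small illustration of descriptive statistics, the sketch below summarizes a sample with Python's standard `statistics` module. The `ages` list is a made-up sample, not data from the course material.

```python
import statistics

# Hypothetical sample of patient ages (illustrative values only).
ages = [23, 35, 35, 41, 52, 60]

summary = {
    "mean": statistics.mean(ages),      # arithmetic average
    "median": statistics.median(ages),  # middle value of the sorted sample
    "mode": statistics.mode(ages),      # most frequent value
    "stdev": statistics.stdev(ages),    # sample standard deviation
}

for name, value in summary.items():
    print(name, value)
```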
PATIENT INFORMATION DATABASE
Features recorded for each patient include Address, Email, and Gender.
• Each of these features that describes a patient is a variable.
• Each observation can have a specific value for each of these
variables.
CATEGORIES OF DATA
MOST DATASETS BROADLY FALL INTO TWO GROUPS
• NUMERICAL DATA
• CATEGORICAL DATA
NUMERICAL DATA
• This data has a sense of measurement involved in it; for
example, a person's age, height, weight, blood pressure,
heart rate, temperature, number of teeth, number of bones,
and the number of family members.
• This data is often referred to as quantitative data in
statistics.
• Numerical data is of two types: discrete data and
continuous data.
DISCRETE DATA
• This is data that is countable and its values can be listed
out.
• For example, if we flip a coin 200 times, the number of
heads can take any value from 0 to 200, a finite set of cases.
• A variable that represents a discrete dataset is referred
to as a discrete variable.
• The discrete variable takes a fixed number of distinct
values.
• For example, the Country variable can have values such
as Nepal, India, Norway, and Japan. The set of values is fixed.
• The Rank variable of a student in a classroom can take
values such as 1, 2, 3, 4, 5, and so on.
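The coin-flip example can be simulated to show why the head count is discrete: whatever the outcome, it is one of the 201 whole numbers from 0 to 200. This is a minimal sketch; the fixed seed and the fair-coin probability of 0.5 are assumptions made for the example.

```python
import random

random.seed(0)  # arbitrary fixed seed so the run is repeatable

n_flips = 200
# Each flip counts as a head when the random draw falls below 0.5.
heads = sum(random.random() < 0.5 for _ in range(n_flips))

# Every possible head count can be listed out, which is what makes it discrete.
possible_values = list(range(n_flips + 1))

print("Heads observed:", heads)
print("Number of possible values:", len(possible_values))
```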
CONTINUOUS DATA
• A variable that can have an infinite number of numerical
values within a specific range is classified as continuous
data.
• A variable describing continuous data is a continuous
variable.
• For example, what is the temperature of your city today?
Can the set of possible values be finite?
• Similarly, the weight variable in the previous section is a
continuous variable.
TRY IT!!
Check the preceding table and determine which of the variables are
discrete and which of the variables are continuous. Can you justify your
CATEGORICAL DATA
CATEGORICAL VARIABLE
• DICHOTOMOUS VARIABLE
• POLYTOMOUS VARIABLE
DICHOTOMOUS VARIABLE
• A binary categorical variable can take exactly
two values and is also referred to as a
dichotomous variable.
• For example, when you run an experiment,
the result is either success or failure.
• Hence, results can be understood as a binary
categorical variable.
POLYTOMOUS VARIABLES
• Polytomous variables are categorical variables that
can take more than two possible values.
• For example, marital status can have several values,
such as divorced, legally separated, married, never
married, domestic partner, unmarried, widowed,
and unknown.
• Since marital status can take more than two possible
values, it is a polytomous variable.
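The distinction between dichotomous and polytomous variables can be sketched by counting distinct observed values. The helper `classify_categorical` is hypothetical, and it inspects only the values present in a sample, so a variable whose sample happens to show two values could still be polytomous in general.

```python
def classify_categorical(values):
    """Label a categorical sample by its number of distinct observed values."""
    distinct = set(values)
    if len(distinct) == 2:
        return "dichotomous"
    if len(distinct) > 2:
        return "polytomous"
    return "undetermined"  # fewer than two distinct values observed


# Hypothetical samples for illustration.
experiment_results = ["success", "failure", "success", "success"]
marital_status = ["married", "divorced", "widowed", "married", "unknown"]

print(classify_categorical(experiment_results))
print(classify_categorical(marital_status))
```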
MEASUREMENT SCALES
NOMINAL
ORDINAL
INTERVAL
RATIO
NOMINAL
Likert scale
INTERVAL
R PROGRAMMING
WEKA
KNIME
PYTHON
• Python is an open source programming language
widely used in data analysis, data mining, and
data science.
R PROGRAMMING LANGUAGE
• R is an open source programming language that
is widely utilized in statistical computation and
graphical data analysis
WEKA
• Weka is an open source data mining package that
includes several EDA tools and algorithms.
KNIME
• This is an open source tool for data analysis and
is based on Eclipse
GETTING STARTED WITH EDA
1. Write a Python program to read/write files?
2. Write a Python program for error handling?
3. Write an object-oriented concept program using Python?
READING/WRITING TO FILES
filename = "datamining.txt"
# The with block closes the file automatically, even if an error occurs
with open(filename, mode="r", encoding="utf-8") as file:
    # Iterate over the file one line at a time
    for line in file:
        print(line, end="")
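The slide covers reading only, so here is a minimal sketch of the writing half, followed by reading the same file back. The temporary folder and the two sample lines are assumptions made so the example does not touch any existing file.

```python
import os
import tempfile

lines_to_write = ["first line\n", "second line\n"]

# A temporary folder keeps the example from overwriting real files.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "datamining.txt")

    # mode="w" creates the file, or truncates it if it already exists.
    with open(path, mode="w", encoding="utf-8") as file:
        file.writelines(lines_to_write)

    # Read the file back to confirm what was written.
    with open(path, mode="r", encoding="utf-8") as file:
        lines_read = file.readlines()

print(lines_read)
```

Using mode="a" instead of mode="w" would append to an existing file rather than replace its contents.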
ERROR HANDLING
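try:
    Value = int(input("Type a number between 47 and 100:"))
except ValueError:
    # int() raises ValueError when the input is not a whole number
    print("You must type a number between 47 and 100!")
else:
    if (Value > 47) and (Value <= 100):
        print("You typed: ", Value)
    else:
        # The slide is cut off here; this message is an assumed completion
        print("The number is outside the allowed range!")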
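The same try/except/else pattern can be wrapped in a function so it can be exercised without keyboard input. The name `parse_number` and its `low`/`high` parameters are hypothetical; the bounds mirror the condition used above (greater than 47, at most 100).

```python
def parse_number(text, low=47, high=100):
    """Return text as an int when low < value <= high, otherwise None."""
    try:
        value = int(text)
    except ValueError:
        # The text is not a whole number.
        return None
    else:
        # This branch runs only when int() succeeded.
        if low < value <= high:
            return value
        return None


print(parse_number("50"))
print(parse_number("abc"))
print(parse_number("47"))
```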
OBJECT-ORIENTED CONCEPT
class Disease:
    def __init__(self, disease='Depression'):
        # Store the disease name on the instance
        self.type = disease

    def getName(self):
        print("Mental Health Diseases {0}".format(self.type))

d1 = Disease('Social Anxiety Disorder')
d1.getName()
NUMPY
• NumPy is a Python library that can be used to
perform a variety of mathematical operations
on arrays.
• It adds powerful data structures to Python that
guarantee efficient calculations with arrays and
matrices.
IMPORTING NUMPY
• import numpy as np
2 DIMENSIONAL ARRAY
DISPLAYING BASIC INFORMATION - DATA
TYPE, SHAPE, SIZE, AND STRIDES OF A
NumPy ARRAY
# Print the buffer object that points to the start of the array's data
print(my2DArray.data)
# Print the shape of the array
print(my2DArray.shape)
# Print the data type of the array
print(my2DArray.dtype)
# Print the strides of the array (bytes to step in each dimension)
print(my2DArray.strides)
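The statements above use `my2DArray` without showing how it is created. A minimal sketch, assuming a 2 x 4 array of 64-bit integers (the values and dtype are illustrative, not taken from the slides):

```python
import numpy as np

# Assumed example array: 2 rows, 4 columns, 64-bit integers.
my2DArray = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.int64)

print(my2DArray.shape)    # rows and columns
print(my2DArray.dtype)    # element data type
print(my2DArray.size)     # total number of elements
print(my2DArray.strides)  # bytes to step along each dimension
```

With 8-byte elements stored row by row, moving to the next row skips 4 elements (32 bytes) and moving to the next column skips one element (8 bytes).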
BUILT-IN NUMPY FUNCTIONS
# Array of ones (default float data type)
ones = np.ones((3,4))
# Array of ones with integer data type
ones = np.ones((3, 4), dtype=int)
print(ones)
# Array of zeros
zeros = np.zeros((2,3,4),dtype=np.int16)
print(zeros)
It creates a 3-dimensional array with a shape of (2,
3, 4), meaning it has 2 blocks, each containing 3 rows and 4
columns. The dtype=np.int16 parameter specifies that the
array elements should be of 16-bit integer data type.
# Array with random values
np.random.random((2,2))
It creates a 2x2 array with random values in the
half-open interval [0, 1).
# Empty array
emptyArray = np.empty((3,2))
print(emptyArray)
It creates a 2-dimensional array with a shape of (3,
2). The elements are not initialized, so the array holds
whatever values were already in memory.
# Full array
fullArray = np.full((2,2),7)
print(fullArray)
Each element in the array is 7 because you specified
7 as the fill value.
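The array-creation functions above can be checked together in one short script. This is a sketch that only confirms the shapes, data types, and value ranges described in the text; the variable names are chosen for illustration.

```python
import numpy as np

ones = np.ones((3, 4), dtype=int)            # 3 x 4 array of integer ones
zeros = np.zeros((2, 3, 4), dtype=np.int16)  # 2 blocks of 3 x 4 zeros
random_values = np.random.random((2, 2))     # 2 x 2 values in [0, 1)
full = np.full((2, 2), 7)                    # 2 x 2 array filled with 7

print(ones.shape, ones.dtype)
print(zeros.shape, zeros.dtype)
print(random_values.shape)
print(full)
```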
BUILT-IN NUMPY FUNCTIONS
import numpy as np
import pandas as pd
print("Pandas Version:", pd.__version__)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
PANDAS