
UNIT-1

EXPLORATORY DATA ANALYSIS


SYLLABUS
• EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data – Comparing EDA with classical and Bayesian analysis – Software tools for EDA – NumPy – Pandas – SciPy – Matplotlib
1.1 Exploratory Data Analysis Fundamentals

Data
• Data encompasses a collection of discrete objects, numbers, words, events, facts, measurements, observations, or even descriptions of things. Such data is collected and stored during the events and processes occurring in several disciplines, including biology, economics, engineering, marketing, and others.
Information
• Processing data elicits useful information, and processing such information generates useful knowledge.
EDA
• Exploratory Data Analysis (EDA) is a process of examining the available dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures.

The primary aim of EDA


• The primary aim of EDA is to examine what data can tell us before actually going through formal modeling or hypothesis formulation. John Tukey promoted EDA among statisticians to examine and discover the data and create new hypotheses that could be used to develop new approaches to data collection and experimentation.
1.2 Understanding data science

Data science involves cross-disciplinary knowledge from computer science, data, statistics, and mathematics. There are several phases of data analysis, including data requirements, data collection, data processing, data cleaning, exploratory data analysis, modeling and algorithms, and data product and communication.
PHASES OF DATA ANALYSIS
• Data requirements
• Data collection
• Data processing
• Data cleaning
• Exploratory data analysis
• Modeling and algorithms
• Data product
• Communication
• Data requirements: There can be various sources of data for an organization. It is important to
comprehend what type of data is required for the organization to be collected, curated, and stored.

• For example, an application tracking the sleeping pattern of patients suffering from dementia requires several types of sensor data to be stored, such as sleep data, the patient's heart rate, electro-dermal activity, and user activity patterns. All of these data points are required to correctly diagnose the mental state of the person. Hence, these are mandatory requirements for the application. In addition, the data must be categorized as numerical or categorical, and the format of storage and dissemination must be defined.

• Data collection: Data collected from several sources must be stored in the correct format and
transferred to the right information technology personnel within a company. As mentioned
previously, data can be collected from several objects on several events using different types of
sensors and storage tools.
• Data processing: Preprocessing involves pre-curating the dataset before actual analysis. Common tasks involve correctly exporting the dataset, placing it under the right tables, structuring it, and exporting it in the correct format.

• Data cleaning: Preprocessed data is still not ready for detailed analysis. It must be
correctly transformed for an incompleteness check, duplicates check, error check, and
missing value check. These tasks are performed in the data cleaning stage, which
involves responsibilities such as matching the correct record, finding inaccuracies in
the dataset, understanding the overall data quality, removing duplicate items, and
filling in the missing values.
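A minimal pandas sketch of these cleaning tasks (the column names, values, and fill strategy are illustrative assumptions, not part of the original example):
import numpy as np
import pandas as pd
# Hypothetical raw data containing a duplicate record and missing values
raw = pd.DataFrame({
    'patient_id': [1, 1, 2, 3, 4],
    'heart_rate': [72, 72, np.nan, 85, 90],
    'sleep_hours': [6.5, 6.5, 7.2, np.nan, 5.8]
})
# Remove duplicate items and fill in the missing values
cleaned = raw.drop_duplicates()
cleaned = cleaned.fillna({'heart_rate': cleaned['heart_rate'].mean(),
                          'sleep_hours': cleaned['sleep_hours'].median()})
print(cleaned)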
• EDA: Exploratory data analysis, is the stage where we actually start to understand the message contained in the
data. It should be noted that several types of data transformation techniques might be required during the process
of exploration.
• Modeling and algorithm: From a data science perspective, generalized models or mathematical formulas can
represent or exhibit relationships among different variables, such as correlation or causation. These models or
equations involve one or more variables that depend on other variables to cause an event.
• For example, when buying, say, pens, the total price of pens (Total) = price for one pen (UnitPrice) * the number of pens bought (Quantity). Hence, our model would be Total = UnitPrice * Quantity. Here, the total price depends on the unit price and the quantity bought. Hence, the total price is referred to as the dependent variable, and the unit price and quantity are referred to as independent variables. In general, a model describes the relationship between independent and dependent variables. Inferential statistics deals with quantifying relationships between particular variables.
• The Judd model for describing the relationship between data, model, and error still holds true: Data = Model +
Error.
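As a small illustration of this model and the Judd decomposition, the following sketch uses made-up numbers (the noise term simply stands in for the Error component):
import numpy as np
unit_price = 10.0                            # price of one pen (independent variable)
quantity = np.array([1, 2, 3, 4, 5])         # number of pens bought (independent variable)
model = unit_price * quantity                # Total = UnitPrice * Quantity
observed = model + np.random.normal(0, 0.5, size=quantity.size)  # hypothetical observed totals
error = observed - model                     # Data = Model + Error
print(model)
print(error)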
• Data Product: Any computer software that uses data as inputs, produces outputs, and
provides feedback based on the output to control the environment is referred to as a data
product. A data product is generally based on a model developed during data analysis, for
example, a recommendation model that inputs user purchase history and recommends a
related item that the user is highly likely to buy.

• Communication: This stage deals with disseminating the results to end stakeholders to use
the result for business intelligence. One of the most notable steps in this stage is data
visualization. Visualization deals with information relay techniques such as tables, charts,
summary diagrams, and bar charts to show the analyzed result.

1.3 The significance of EDA

• Different fields of science, economics, engineering, and marketing accumulate and


store data primarily in electronic databases. Appropriate and well-established
decisions should be made using the data collected. It is practically impossible to
make sense of datasets containing more than a handful of data points without the
help of computer programs. To be certain of the insights that the collected data
provides and to make further decisions, data mining is performed, in which we go
through distinctive analysis processes. Exploratory data analysis is key, and is usually
the first exercise in data mining. It
allows us to visualize data to understand it as well as to create hypotheses for further
analysis. The exploratory analysis centers around creating a synopsis of data or
insights for the next steps in a data mining project.
• Key components of exploratory data analysis include summarizing data, statistical analysis, and visualization
of data. Python provides expert tools for exploratory analysis, with pandas for summarizing; scipy, along with
others, for statistical analysis; and matplotlib and plotly for visualizations.
• Steps in EDA
• Problem definition: Before trying to extract useful insight from the data, it is essential to define the business
problem to be solved. The problem definition works as the driving force for a data analysis plan execution. The
main tasks involved in problem definition are defining the main objective of the analysis, defining the main
deliverables, outlining the main roles and responsibilities, obtaining the current status of the data, defining the
timetable, and performing cost/benefit analysis. Based on such a problem definition, an execution plan can be
created.
• Data preparation: This step involves methods for preparing the dataset before actual analysis. In this step, we
define the sources of data, define data schemas and tables, understand the main characteristics of the data, clean
the dataset, delete non-relevant datasets, transform the data, and divide the data into required chunks for analysis.
• Data analysis: This is one of the most crucial steps that deals with descriptive statistics
and analysis of the data. The main tasks involve summarizing the data, finding the hidden
correlation and relationships among the data, developing predictive models, evaluating the
models, and calculating the accuracies. Some of the techniques used for data
summarization are summary tables, graphs, descriptive statistics, inferential statistics,
correlation statistics, searching, grouping, and mathematical models.
• Development and representation of the results: This step involves presenting the dataset
to the target audience in the form of graphs, summary tables, maps, and diagrams. This is
also an essential step as the result analyzed from the dataset should be interpretable by the
business stakeholders, which is one of the major goals of EDA. Most of the graphical
analysis techniques include scattering plots, character plots, histograms, box plots, residual
plots, mean plots, and others.
1.4 Making sense of data
• Numerical data

This data has a sense of measurement involved in it; for example, a person's age, height, weight, blood pressure,
heart rate, temperature, number of teeth, number of bones, and the number of family members. This data is
often referred to as quantitative data in statistics. The numerical dataset can be either discrete or continuous
types.

• Discrete data

• This is data that is countable and its values can be listed out. For example, if we flip a coin, the number of
heads in 200 coin flips can take values from 0 to 200 (finite) cases. A variable that represents a discrete
dataset is referred to as a discrete variable. The discrete variable takes a fixed number of distinct values. For
example, the Country variable can have values such as Nepal, India, Norway, and Japan. It is fixed. The Rank
variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
• Continuous data
• A variable that can have an infinite number of numerical values within a specific range is classified as continuous data. A variable describing continuous data is a continuous variable.
• For example, what is the temperature of your city today? Temperature can take any value within a range, so its possible values cannot be listed out as a finite set. Similarly, the weight variable in the previous section is a continuous variable.
Categorical data

• Categorical data
This type of data represents the characteristics of an object; for example, gender, marital status, type
of address, or categories of the movies. This data is often referred to as qualitative datasets in
statistics. To understand clearly, here are some of the most common types of categorical data you can
find in data:
Gender (Male, Female, Other, or Unknown)
Marital Status (Annulled, Divorced, Interlocutory, Legally Separated, Married, and so on)
Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror, Mystery, Philosophical, Political, Romance, Saga, Satire, Science Fiction, Social, Thriller, Urban, or Western)
Blood type (A, B, AB, or O)
Types of drugs (Stimulants, Depressants, Hallucinogens, Dissociatives, Opioids, Inhalants, or Cannabis)
• A variable describing categorical data is referred to as a categorical variable.

• A binary categorical variable can take exactly two values and is also referred to as a dichotomous variable.
For example, when you create an experiment, the result is either success or failure. Hence, results can be
understood as a binary categorical variable.

• Polytomous variables are categorical variables that can take more than two

possible values
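A short illustrative sketch using pandas categoricals (the variable names and values are assumed for the example):
import pandas as pd
# Dichotomous (binary) categorical variable: exactly two possible values
result = pd.Categorical(["success", "failure", "success"],
                        categories=["success", "failure"])
# Polytomous categorical variable: more than two possible values
blood_type = pd.Categorical(["A", "O", "AB", "B", "O"],
                            categories=["A", "B", "AB", "O"])
print(result.categories.size)      # 2
print(blood_type.categories.size)  # 4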
Measurement scales

•Nominal
•Ordinal
•Interval
•Ratio
Nominal
• Nominal scales are used for labeling variables without any quantitative value. The scales are generally referred to as labels, and they are mutually exclusive and do not carry any numerical importance.

• Nominal scales are considered qualitative scales and the measurements that are taken using qualitative scales are
considered qualitative data.

• Examples

• What is your gender?

• The languages that are spoken in a particular country

• Biological species

• Parts of speech in grammar (noun, pronoun, adjective, and so on)

• Taxonomic ranks in biology (Archaea, Bacteria, and Eukarya)


How to analyse nominal data
• Frequency is the number of times each label occurs within the dataset.
• Proportion can be calculated by dividing the frequency by the total
number of events.
• Then, you could compute the percentage of each proportion.
• And to visualize the nominal dataset, use either a pie chart or a bar
chart.
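A hedged sketch of these steps using pandas and matplotlib (the gender values are an assumed example):
import pandas as pd
import matplotlib.pyplot as plt
gender = pd.Series(["Male", "Female", "Female", "Other", "Female", "Male"])
frequency = gender.value_counts()           # how often each label occurs
proportion = frequency / frequency.sum()    # frequency divided by the total number of events
percentage = proportion * 100               # percentage of each proportion
print(percentage)
frequency.plot(kind='bar')                  # use kind='pie' for a pie chart
plt.show()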
Ordinal

• In ordinal scales, the order of the values is a significant factor. An easy tip to remember the ordinal scale is that it sounds like an order.
A Likert scale
A Likert scale is a rating scale used to measure opinions, attitudes, or behaviors. It consists of a statement
or a question, followed by a series of five or seven answer statements. Respondents choose the option that
best corresponds with how they feel about the statement or question.
The median item is allowed as the measure of central tendency; however, the average
is not permitted.
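A small sketch showing why the median, rather than the mean, is reported for a five-point Likert item (the responses are hypothetical):
import pandas as pd
# 1 = Strongly disagree ... 5 = Strongly agree (ordinal codes, not true numerical values)
responses = pd.Series([4, 5, 3, 4, 2, 5, 4, 1, 4, 3])
print(responses.median())   # an acceptable measure of central tendency for ordinal data
# responses.mean() would treat the ordinal codes as interval data, which is not permitted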
Interval

• In interval scales, both the order and the exact differences between the values are significant.
• Interval scales are widely used in statistics, for example, in measures of central tendency: mean, median, mode, and standard deviation.
Ratio

• Ratio scales contain order, exact values, and absolute zero, which
makes it possible to be used in descriptive and inferential statistics.
These scales provide numerous possibilities for statistical analysis.
• Mathematical operations, the measure of central tendencies, and
the measure of dispersion and coefficient of variation can also be
computed from such scales.
• Examples include measures of energy, mass, length, duration, electrical energy, plane angle, and volume.
A summary of the data types and scale measures:
• Nominal – qualitative labels with no order; analyzed with frequencies, proportions, percentages, and the mode.
• Ordinal – ordered categories; the median is a valid measure of central tendency, but the mean is not.
• Interval – ordered values with meaningful differences but no absolute zero; mean, median, mode, and standard deviation apply.
• Ratio – ordered values with meaningful differences and an absolute zero; all mathematical operations and descriptive and inferential statistics apply.
1.5 Comparing EDA with classical and Bayesian analysis
• Classical data analysis: For the classical data
analysis approach, the problem definition and
data collection step are followed by model
development, which is followed by analysis and
result communication.
• Exploratory data analysis approach: For the EDA
approach, it follows the same approach as classical
data analysis except the model imposition and the data
analysis steps are swapped. The main focus is on the
data, its structure, outliers, models, and visualizations.
Generally, in EDA, we do not impose any
deterministic or probabilistic models on the data.
• Bayesian data analysis approach: The Bayesian approach incorporates prior probability distribution knowledge into the analysis steps.
• Prior probability, in Bayesian statistics, is the probability of an
event before new data is collected. This is the best rational
assessment of the probability of an outcome based on the
current knowledge before an experiment is performed.
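A minimal sketch of how a prior probability is updated with new data using Bayes' rule (all numbers are invented for illustration):
# Posterior = Likelihood * Prior / Evidence
prior = 0.3                          # probability of the hypothesis before new data
likelihood = 0.8                     # probability of the observed data if the hypothesis is true
evidence = 0.8 * 0.3 + 0.2 * 0.7     # total probability of observing the data
posterior = likelihood * prior / evidence
print(posterior)                     # updated belief after the data is collected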
1.6 Software tools available for EDA

• Python: This is an open source programming language widely used in data analysis, data mining, and data science.
• R programming language: R is an open source programming language.
• Weka: This is an open source data mining package that involves
several EDA tools and algorithms
• KNIME: This is an open source tool for data analysis and is based on
Eclipse
Numpy
1. For importing numpy, we will use the following code:
import numpy as np
2. For creating different types of numpy arrays
# Defining 1D array
my1DArray = np.array([1, 8, 27, 64])
print(my1DArray)
# Defining and printing 2D array
my2DArray = np.array([[1, 2, 3, 4], [2, 4, 9, 16], [4, 8, 18, 32]])
print(my2DArray)
#Defining and printing 3D array
my3Darray = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]], [[1, 2, 3, 4], [9, 10, 11, 12]]])
print(my3Darray)
3. For displaying basic information, such as the data type, shape, size, and
strides of a NumPy array, we will use the following code:

# Print out memory address
print(my2DArray.data)
# Print the shape of the array
print(my2DArray.shape)
# Print out the data type of the array
print(my2DArray.dtype)
# Print the strides of the array
print(my2DArray.strides)
4. For creating an array using built-in NumPy functions, we will use the following code

# Array of ones
ones = np.ones((3,4))
print(ones)
# Array of zeros
zeros = np.zeros((2,3,4),dtype=np.int16)
print(zeros)
# Array with random values
np.random.random((2,2))
# Empty array
emptyArray = np.empty((3,2))
print(emptyArray)
# Full array
fullArray = np.full((2,2),7)
print(fullArray)
5. For NumPy arrays and file operations, we will use the
following code:
# Save a numpy array into file
x = np.arange(0.0,50.0,1.0)
np.savetxt('data.out', x, delimiter=',')
# Loading numpy array from text
z = np.loadtxt('data.out', unpack=True)
print(z)
# Loading numpy array using genfromtxt method
my_array2 = np.genfromtxt('data.out',
skip_header=1,
filling_values=-999)
print(my_array2)
6. For inspecting NumPy arrays, we will use the following
code:

# Print the number of `my2DArray`'s dimensions
print(my2DArray.ndim)
# Print the number of `my2DArray`'s elements
print(my2DArray.size)
# Print information about `my2DArray`'s memory layout
print(my2DArray.flags)
# Print the length of one array element in bytes
print(my2DArray.itemsize)
# Print the total consumed bytes by `my2DArray`'s elements
print(my2DArray.nbytes)
Broadcasting is a mechanism that permits NumPy to
operate with arrays of different shapes when performing
arithmetic operations:

# Rule 1: Two dimensions are compatible if they are equal
# Rule 2: Two dimensions are also compatible when one of them is 1
# Rule 3: Arrays can be broadcast together if they are compatible in all dimensions
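A short example of these rules in action (the array shapes and values are arbitrary):
import numpy as np
x = np.ones((3, 4))                  # shape (3, 4)
row = np.array([1, 2, 3, 4])         # shape (4,): stretched across the 3 rows (Rule 2)
col = np.array([[10], [20], [30]])   # shape (3, 1): stretched across the 4 columns (Rule 2)
print(x + row)                       # result has shape (3, 4)
print(x + col)                       # result has shape (3, 4)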
For seeing NumPy mathematics at work, we will use
the following example:
# Basic operations (+, -, *, /, %)
x = np.array([[1, 2, 3], [2, 3, 4]])
y = np.array([[1, 4, 9], [2, 3, -2]])
# Add two arrays
add = np.add(x, y)
print(add)
# Subtract two arrays
sub = np.subtract(x, y)
print(sub)
# Multiply two arrays
mul = np.multiply(x, y)
print(mul)
# Divide x by y
div = np.divide(x, y)
print(div)
# Calculate the remainder of x and y
rem = np.remainder(x, y)
print(rem)
Let's now see how we can create a subset and slice an
array using an index:

x = np.array([10, 20, 30, 40, 50])
# Select items at index 0 and 1
print(x[0:2])
# Select the item at rows 0 and 1, column 1 from the 2D array
y = np.array([[1, 2, 3, 4], [9, 10, 11, 12]])
print(y[0:2, 1])
# Specifying conditions
biggerThan2 = (y >= 2)
print(y[biggerThan2])
Pandas
• Wes McKinney open sourced the pandas library (https://github.com/wesm), which has been widely used in data science.
1. To set default parameters
import numpy as np
import pandas as pd
print("Pandas Version:", pd.__version__)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
2. Create data structures in two ways: series and dataframes.
Check the following snippet to understand how we can create a dataframe from
series, dictionary, and n-dimensional arrays.

• create a dataframe from a series:


series = pd.Series([2, 3, 7, 11, 13, 17, 19, 23])
print(series)
# Creating dataframe from Series
series_df = pd.DataFrame({
'A': range(1, 5),
'B': pd.Timestamp('20190526'),
'C': pd.Series(5, index=list(range(4)), dtype='float64'),
'D': np.array([3] * 4, dtype='int64'),
'E': pd.Categorical(["Depression", "Social Anxiety", "Bipolar Disorder", "Eating Disorder"]),
'F': 'Mental health',
'G': 'is challenging'
})
print(series_df)
create a dataframe for a dictionary:

# Creating dataframe from Dictionary


dict_df = [{'A': 'Apple', 'B': 'Ball'}, {'A': 'Aeroplane', 'B': 'Bat', 'C': 'Cat'}]
dict_df = pd.DataFrame(dict_df)
print(dict_df)
create a dataframe from n-dimensional arrays:

# Creating a dataframe from ndarrays


sdf = {
'County':['Ostfold', 'Hordaland', 'Oslo', 'Hedmark', 'Oppland',
'Buskerud'],
'ISO-Code':[1,2,3,4,5,6],
'Area': [4180.69, 4917.94, 454.07, 27397.76, 25192.10,
14910.94],
'Administrative centre': ["Sarpsborg", "Oslo", "City of Oslo",
"Hamar", "Lillehammer", "Drammen"]
}
sdf = pd.DataFrame(sdf)
print(sdf)
load a dataset from an external source into a
pandas DataFrame

columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'ethnicity',
           'gender', 'capital_gain', 'capital_loss', 'hours_per_week',
           'country_of_origin', 'income']
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                 names=columns)
df.head(10)
The following code displays the rows, columns, data types, and memory used by the dataframe:
df.info()
• select rows and columns in any dataframe:
# Selects a row
df.iloc[10]
# Selects the first 10 rows
df.iloc[0:10]
# Selects a range of rows
df.iloc[10:15]
# Selects the last 2 rows
df.iloc[-2:]
# Selects every other row in columns 3 and 4
df.iloc[::2, 3:5].head()
SciPy

SciPy is an open source scientific library for Python. It depends on the NumPy library, which provides efficient n-dimensional array manipulation functions. For statistical analysis, we mainly use the scipy.stats module from the SciPy library.
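A minimal sketch of statistical analysis with scipy.stats (the sample values and the tested mean are invented):
import numpy as np
from scipy import stats
sample = np.array([2.1, 2.5, 2.8, 3.0, 3.2, 3.6, 3.9])
print(stats.describe(sample))                              # count, min/max, mean, variance, skewness, kurtosis
t_stat, p_value = stats.ttest_1samp(sample, popmean=3.0)   # one-sample t-test
print(t_stat, p_value)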
Matplotlib

Matplotlib provides a huge library of customizable plots, along with a comprehensive set of backends. It can be utilized to create professional reporting applications, interactive analytical applications, complex dashboard applications, web/GUI applications, embedded views, and many more.
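A minimal sketch of a customizable Matplotlib plot (the data is arbitrary):
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label='sin(x)')   # a simple line plot
plt.xlabel('x')
plt.ylabel('y')
plt.title('A simple Matplotlib line plot')
plt.legend()
plt.show()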
