Program-1

The document outlines a program for performing Exploratory Data Analysis (EDA) on a dataset, focusing on computing descriptive statistics and visualizing data distributions. Key tasks include calculating mean, median, mode, standard deviation, variance, and range for numerical data, detecting outliers using IQR, and visualizing categorical data with bar or pie charts. It emphasizes the importance of Python libraries like pandas, NumPy, Matplotlib, and Seaborn for data handling and visualization.

Uploaded by

Kasi Lingamn
Copyright
© All Rights Reserved

Practical Insights into Data Analysis and Machine Learning

PROGRAM - 1

Develop a program to load a dataset and select one numerical column. Compute mean, median,
mode, standard deviation, variance, and range for a given numerical column in a dataset.
Generate a histogram and boxplot to understand the distribution of the data. Identify any
outliers in the data using IQR. Select a categorical variable from a dataset. Compute the
frequency of each category and display it as a bar chart or pie chart.

Objective
To perform exploratory data analysis (EDA) on a dataset by computing
descriptive statistics and visualizing the distribution of numerical and categorical
variables.

1. Introduction
This program focuses on Exploratory Data Analysis (EDA) by computing key statistical measures
and visualizing data. It begins by loading a dataset, selecting a numerical column, and computing
essential statistics such as mean, median, mode, standard deviation, variance, and range. Outliers
are detected using the Interquartile Range (IQR) method, and the data distribution is analyzed
through histograms and boxplots. Additionally, a categorical column is selected, where the
frequency of each category is computed and visualized using a bar chart or pie chart. This process
provides insights into data trends, anomalies, and key statistical properties before moving on to
further analysis or modeling.

To execute this program, a strong foundation in Python programming is required,
particularly proficiency in pandas for loading datasets, handling missing values, and computing
statistical metrics. Knowledge of NumPy is beneficial for numerical computations such as variance
and standard deviation. Understanding statistical concepts like central tendency, dispersion, and
outlier detection using IQR is crucial. Additionally, familiarity with Matplotlib and Seaborn is
needed for data visualization, including histograms, boxplots, bar charts, and pie charts. Experience
with Jupyter Notebook, Google Colab, or IDEs like VS Code or PyCharm can aid in executing the
code effectively. Lastly, data preprocessing techniques and understanding categorical data analysis
will help in summarizing categorical variables efficiently. The following sections will provide
insights into the concepts mentioned above.

1.1 Data Handling

Pandas - The pandas library in Python is essential for handling and analyzing structured data
efficiently. It provides data structures such as DataFrames and Series to manage datasets and
perform various operations. Below are some key functionalities:

• Loading Datasets with pandas.read_csv(): The read_csv() function is used to load
datasets from CSV files into a pandas DataFrame.
• Selecting Columns in a DataFrame: Indexing with a column name, as in df['column_name'],
returns that column as a Series.
• Pandas provides built-in functions to compute descriptive statistics for numerical columns:
Mean (Average): .mean()
Median (Middle Value): .median()
Mode (Most Frequent Value): .mode()
Standard Deviation: .std()
Variance: .var()
Minimum: .min()
Maximum: .max()
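
The functions listed above can be sketched together as follows. This is a minimal illustration, not the manual's own program: the CSV text, the column names (name, department, salary), and the variable names are all made up for the example, and an in-memory string stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file such as "data.csv".
csv_text = """name,department,salary
Asha,Sales,42000
Ravi,IT,55000
Meena,IT,61000
John,Sales,42000
Sara,HR,48000
"""

# read_csv() accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Selecting one column by name returns a pandas Series.
salary = df["salary"]

stats = {
    "mean": salary.mean(),
    "median": salary.median(),
    "mode": salary.mode().tolist(),  # mode() returns a Series because ties are possible
    "std": salary.std(),             # sample standard deviation (divides by N - 1) by default
    "var": salary.var(),             # sample variance (divides by N - 1) by default
    "min": salary.min(),
    "max": salary.max(),
    "range": salary.max() - salary.min(),
}

for name, value in stats.items():
    print(f"{name}: {value}")
```

With a real dataset, the io.StringIO(...) argument would be replaced by the path to the CSV file.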

NumPy - The NumPy library is fundamental for performing mathematical and statistical
operations on large datasets, as it provides efficient handling of arrays and matrices. It also provides
the essential functions for mean, variance, standard deviation, range, and more.

• The quantile function np.quantile() in NumPy is used to compute the value below which
a given fraction of observations fall. Quantiles help in understanding the distribution of
data by splitting it into equal intervals. The most commonly used quantiles are quartiles
(which split data into four parts), but quantiles can be calculated for any fraction between
0 and 1. The related function np.percentile() takes the same request as a percentage (0 to 100).
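
A short sketch of np.quantile() on an assumed toy array (the integers 1 to 10). With its default settings NumPy interpolates linearly between neighbouring data points, so a quartile need not be a value that occurs in the data.

```python
import numpy as np

# Assumed toy data: the integers 1 through 10.
data = np.arange(1, 11)

# q is a fraction between 0 and 1; passing a list returns one value per fraction.
q1, q2, q3 = np.quantile(data, [0.25, 0.5, 0.75])

# np.percentile() performs the same computation with q expressed in percent.
p90 = np.percentile(data, 90)
q90 = np.quantile(data, 0.9)

print("Q1:", q1, "Q2:", q2, "Q3:", q3)
print("90th percentile:", p90)
```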

1.2 Descriptive Statistics

1.2.1 Measures of Central Tendency: Mean, Median, and Mode.

Mean (Arithmetic Mean)
• The mean is the average value of a dataset. It is calculated by summing all the values in a
numerical dataset and dividing by the total number of values.

$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$$
where:
Xi - Each individual value in the dataset.
N - Total number of values.

Median
• The median is a measure of central tendency that represents the middle value in a dataset
when the data is arranged in ascending (or descending) order.
• The median is far less sensitive to extreme values (outliers) than the mean, which makes it a
good measure of central tendency for data that may contain outliers.
• If there is an odd number of values, the median is the middle value. If there is an even
number of values, the median is the average of the two middle values.

Example: Let's consider the following dataset: 5, 2, 8, 1, 9, 4.
• Order the data: 1, 2, 4, 5, 8, 9
• Find the middle value: There are 6 data points (an even number), so the median is the average
of the two middle values (4 and 5).

Median = (4 + 5) / 2 = 4.5
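
The same example can be checked with Python's standard statistics module. The extra value 100 appended below is an assumption added only to show how an outlier shifts the mean much more than the median.

```python
import statistics

data = [5, 2, 8, 1, 9, 4]

# Mean: sum of all values divided by the number of values.
mean_value = sum(data) / len(data)

# median() sorts internally and averages the two middle values when N is even.
median_value = statistics.median(data)

# Assumed extra value: one extreme observation appended to the same data.
with_outlier = data + [100]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

print("mean:", mean_value, "median:", median_value)
print("with outlier -> mean:", mean_with_outlier, "median:", median_with_outlier)
```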

Mode
• The mode is another measure of central tendency that represents the most frequent value in
a dataset. In simpler terms, it's the value that appears most often.
• Unlike the mean, which is only applicable to numerical data, the mode can be used for both
numerical and categorical data. For example, you can find the mode of a set of colors or a
set of names.
• A dataset can have one mode (unimodal), two modes (bimodal), or more than two modes
(multimodal). If all values appear with the same frequency, there is no mode.
• The mode may not always be a good representation of the centre of the data, especially if
the data is skewed or has a wide range of values.

Example 1: Single Mode
Dataset: 3, 4, 10, 10, 15, 19
• The number 10 appears twice, while all other numbers appear once.
• Mode = 10

Example 2: Multiple Modes
Dataset: 3, 4, 4, 10, 10, 15, 19
• The numbers 4 and 10 each appear twice, while all other numbers appear once.
• Mode = 4, 10

Example 3: No Mode
Dataset: 3, 4, 10, 15, 19, 25
• Each number appears only once.
• No mode exists in this dataset.
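
The three cases can be reproduced with a small helper. find_modes is a hypothetical function written for this illustration; it follows the convention used in this section, where a dataset whose values all share the same frequency has no mode. Note that pandas' Series.mode() behaves differently in that situation: it returns every value.

```python
from collections import Counter


def find_modes(values):
    """Return the most frequent value(s), or [] when all values are equally frequent."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Convention from the text: identical frequencies everywhere means "no mode".
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(find_modes([3, 4, 10, 10, 15, 19]))     # single mode
print(find_modes([3, 4, 4, 10, 10, 15, 19]))  # two modes
print(find_modes([3, 4, 10, 15, 19, 25]))     # no mode
print(find_modes(["red", "blue", "red"]))     # works for categorical data too
```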

1.2.2 Measures of Dispersion

Standard Deviation
• The standard deviation (SD) measures how much the values in a dataset deviate from the
mean.
• A high standard deviation means the data points are more spread out.
• A low standard deviation indicates that data points are close to the mean.
• It is sensitive to outliers. Extreme values in the dataset can increase the standard deviation.

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2}$$
where:
Xi - Each individual value in the dataset.
μ - Mean of the dataset
N - The total number of values.
σ - Standard deviation

Variance
• Variance measures how spread out the data points are from the mean. It is the average of
the squared differences from the mean.
• A higher variance means the data points are more spread.
• A lower variance means they are closer to the mean.
• Because the differences are squared, variance is strongly affected by extreme values (outliers).

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$$
where:
Xi - Each individual value in the dataset.
μ - Mean of the dataset
N - The total number of values.
σ² - Variance
Variance (σ²) and standard deviation (σ) are directly related. The standard deviation is the square
root of the variance.
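
A minimal sketch of both measures on an assumed five-value dataset. It computes the population form described above (average of the squared differences, dividing by N) by hand and compares it with the library defaults. One detail worth knowing: NumPy's np.var() and np.std() divide by N by default, while pandas' .var() and .std() divide by N - 1 (the sample form) unless ddof=0 is passed.

```python
import numpy as np
import pandas as pd

# Assumed toy data for illustration.
values = np.array([3, 6, 4, 9, 2], dtype=float)
mu = values.mean()

# Population variance: average of the squared differences from the mean.
variance_manual = ((values - mu) ** 2).sum() / len(values)
std_manual = variance_manual ** 0.5

# NumPy defaults to the population form (ddof=0).
variance_np = np.var(values)
std_np = np.std(values)

# pandas defaults to the sample form (ddof=1, divides by N - 1).
series = pd.Series(values)
variance_sample = series.var()
variance_pd_population = series.var(ddof=0)

print("population variance:", variance_manual, "std:", std_manual)
print("sample variance (pandas default):", variance_sample)
```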

Range
• The range is the difference between the maximum and minimum values in a dataset.
• It gives you a basic sense of the data's spread at a glance.
• The range is highly influenced by extreme values (outliers). A single outlier can make the
data appear far more spread out than most of the values actually are.

Example:
Dataset: 3, 6, 4, 9, 2
• Maximum value: 9
• Minimum value: 2
• Range: 9 - 2 = 7
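
The example above in NumPy, as a brief sketch. np.ptp() ("peak to peak") is NumPy's built-in for max minus min; the value 100 appended afterwards is an assumption used only to show the effect of a single outlier.

```python
import numpy as np

data = np.array([3, 6, 4, 9, 2])

value_range = data.max() - data.min()
same_range = np.ptp(data)  # peak-to-peak: maximum minus minimum

# Assumed extra value: one outlier stretches the range dramatically.
with_outlier = np.append(data, 100)
range_with_outlier = np.ptp(with_outlier)

print("range:", value_range, "range with outlier:", range_with_outlier)
```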

1.2.3 Outlier Detection

Interquartile Range (IQR)
• It is a measure of statistical dispersion that describes the spread of the middle 50% of a
dataset. It gives a clearer picture of how the typical values are spread than the range does.
• It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1).
• The IQR is particularly useful for identifying variability in a dataset while being resistant
to the influence of outliers, making it a robust measure of spread.

IQR = Q3 – Q1
Where:
• Q1 = First quartile (25th percentile): The value below which 25% of the data falls.
• Q3 = Third quartile (75th percentile): The value below which 75% of the data falls.

Example: Consider the dataset: [3, 7, 8, 5, 12, 14, 21, 13, 18]
1. Sort the data: [3, 5, 7, 8, 12, 13, 14, 18, 21]
2. Find the median (Q2): The median is 12.
3. Find Q1: The lower half is [3, 5, 7, 8]. The median of this half is (5 + 7) / 2 = 6.
4. Find Q3: The upper half is [13, 14, 18, 21]. The median of this half is (14 + 18) / 2 = 16.
5. Calculate IQR: IQR = Q3 − Q1 = 16 − 6 = 10.

• The IQR of 10 means that the middle 50% of the data lies within a range of 10 units.
• Values below Q1−1.5×IQR or above Q3+1.5×IQR are often considered outliers.

Using the dataset above:


• Lower Bound: Q1−1.5×IQR = 6−1.5×10 = −9
• Upper Bound: Q3+1.5×IQR = 16+1.5×10 = 31

Any value below −9 or above 31 would be considered an outlier. In this dataset, there are no
outliers.
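
The worked example can be reproduced as below. This is a sketch under stated assumptions: quartiles_by_halves and iqr_outliers are hypothetical helper names, and the quartiles are taken as medians of the lower and upper halves, as in the steps above. Library routines may define quartiles differently; np.quantile() with its default linear interpolation returns Q1 = 7 and Q3 = 14 for the same data, so the bounds depend on the method chosen. The value 60 is an assumed extra observation used to trigger an outlier.

```python
import statistics

import numpy as np


def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves (middle value excluded if N is odd)."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    return q1, q3


def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    q1, q3 = quartiles_by_halves(values)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    outliers = [v for v in values if v < lower_bound or v > upper_bound]
    return lower_bound, upper_bound, outliers


data = [3, 7, 8, 5, 12, 14, 21, 13, 18]
q1, q3 = quartiles_by_halves(data)
lower_bound, upper_bound, outliers = iqr_outliers(data)
print("Q1:", q1, "Q3:", q3, "bounds:", lower_bound, upper_bound, "outliers:", outliers)

# Assumed extra observation to show a detected outlier.
_, _, outliers_extended = iqr_outliers(data + [60])
print("outliers after adding 60:", outliers_extended)

# For comparison: NumPy's default (linear interpolation) quartiles of the same data.
q1_np, q3_np = np.quantile(data, [0.25, 0.75])
print("NumPy Q1:", q1_np, "NumPy Q3:", q3_np)
```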

1.3 Data Visualization

1.3.1 Histograms
• A histogram is a representation of the distribution of data.
• A histogram is a type of bar chart that shows how frequently different values occur in your
numerical data.
• The data is divided into ranges called bins or intervals, and each bar on the histogram
represents one of these bins. The height of each bar shows how many data points fall within
that bin.

Image Source: https://cdn.serc.carleton.edu/images/mathyouneed/geomajors/histograms/histogram_skew.webp

• Right skew (positive skew): The tail of the histogram extends longer to the right. This often
means there are some higher values that are pulling the mean up, but most of the data is
concentrated on the lower end.
• Left skew (negative skew): The tail extends longer to the left. This suggests there are some
lower values pulling the mean down, but most of the data is on the higher end.
• Symmetric: The data is evenly distributed around the centre (mean/median), and the histogram
often resembles a bell-shaped curve (e.g., normal distribution).

Advantages of Histogram
► The visual representation helps to understand the underlying patterns in data.
► It shows the central tendency and variability of the data.
► It reveals whether the data is skewed (asymmetrical). This is important because skewed data can
affect the interpretation of other statistical measures, like the mean.
► Outliers, which are extreme values, often stand out on a histogram as isolated bars far from
the main distribution. This makes them easier to detect.
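
A minimal Matplotlib sketch of a histogram. The data is synthetic (values drawn from an exponential distribution with a fixed seed, chosen only because it is right-skewed), and the Agg backend line is an assumption that lets the script run without a display; it should be dropped in Jupyter or Colab.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic right-skewed data for illustration only.
rng = np.random.default_rng(seed=0)
values = rng.exponential(scale=10.0, size=500)

fig, ax = plt.subplots(figsize=(6, 4))

# hist() returns the count per bin, the bin edges, and the drawn bars.
counts, bin_edges, _ = ax.hist(values, bins=20, edgecolor="black")

# Mark the mean and median; in right-skewed data the mean lies to the right of the median.
ax.axvline(values.mean(), color="red", linestyle="--", label="mean")
ax.axvline(np.median(values), color="green", linestyle="-", label="median")

ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of a right-skewed sample")
ax.legend()
fig.tight_layout()
```

To view the figure, call plt.show() in an interactive session or fig.savefig("histogram.png") in a script.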

1.3.2 Box plot


• A box plot is a graphical summary of the distribution of numerical data, highlighting key
statistics like the median, quartiles, and potential outliers.
• It's based on the five-number summary of your data:
► Minimum: The smallest value in the dataset.
► Lower quartile (Q1): The value that separates the lowest 25% of data from the rest.
► Median (Q2): The middle value when the data is ordered from least to greatest.
► Upper quartile (Q3): The value that separates the highest 25% of the data from the rest.
► Maximum: The largest value in the dataset.

• The Box: Represents the interquartile range (IQR), which contains the middle 50% of
the data.
• The Whiskers: Extend from Q1 down to the smallest value and from Q3 up to the largest
value that are not outliers.
• The Median Line: A line inside the box marks the median.
• Outliers: Values that lie more than 1.5 × IQR below Q1 or above Q3 are plotted separately
as dots.

Advantages of Box plot


► A box plot displays five key statistics in a single visual, making it easy to understand large
datasets.
► Outliers (extreme values) are clearly marked outside the whiskers, which helps in detecting
unusual data points.
► Box plots are excellent for comparing the distributions of multiple datasets.
► The shape of the box plot indicates whether the data is symmetrical, left-skewed, or
right-skewed. If the median line sits closer to the lower edge of the box (Q1) and the upper
whisker is longer, the data is right-skewed; the reverse pattern suggests left skew. This
helps in understanding the shape of the data distribution.
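
A minimal Matplotlib sketch of a box plot. The data is an assumed small sample that includes one extreme value (60), so that a point is drawn beyond the whiskers. Matplotlib computes the quartiles with linear interpolation, so its box edges can differ slightly from quartiles worked out by hand. The Agg backend line is an assumption for running without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen; remove this line in a notebook
import matplotlib.pyplot as plt

# Assumed sample with one extreme value.
data = [3, 7, 8, 5, 12, 14, 21, 13, 18, 60]

fig, ax = plt.subplots(figsize=(4, 5))

# whis=1.5 is the default: whiskers stop at the last points within 1.5 * IQR of the box.
result = ax.boxplot(data, whis=1.5)

# The points drawn individually beyond the whiskers ("fliers") are the outliers.
flier_values = list(result["fliers"][0].get_ydata())
median_value = result["medians"][0].get_ydata()[0]

ax.set_ylabel("Value")
ax.set_title("Box plot with one outlier")
print("outliers drawn:", flier_values, "median:", median_value)
```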

1.3.3 Bar chart

• A bar chart (or bar graph) is a visual representation of data using rectangular bars of varying
lengths or heights. Each bar represents a category or group, and the length or height of the
bar corresponds to the value or frequency of that category.
• Bar charts are one of the most commonly used tools in data visualization because they are
simple, intuitive, and effective for comparing data across categories.
• Bar charts can be used for a wide range of data types, including categorical, numerical, and
ordinal data.

Advantages of Bar Charts:


► They present data in a straightforward and easy-to-interpret manner.
► Bar charts can represent both small and large datasets effectively.
► Bar charts are visually appealing and can be customized with colors, labels, and annotations.
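
A short sketch of the frequency-then-bar-chart workflow for a categorical variable. The department values are made up for illustration; value_counts() is the pandas method that counts how often each category occurs, sorted from most to least frequent. The Agg backend line is an assumption for running without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical column.
departments = pd.Series(["IT", "Sales", "IT", "HR", "IT", "Sales"], name="department")

# Frequency of each category, most frequent first.
counts = departments.value_counts()

fig, ax = plt.subplots(figsize=(5, 3.5))
ax.bar(counts.index.tolist(), counts.values)
ax.set_xlabel("Department")
ax.set_ylabel("Frequency")
ax.set_title("Employees per department")
fig.tight_layout()

print(counts)
```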

1.3.4 Pie chart


• A pie chart is a circular statistical graphic that is divided into slices to represent the
proportions of different categories in a dataset.
• Each slice of the pie corresponds to a specific category, and the size of the slice is
proportional to the quantity or percentage it represents.
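
A minimal pie chart sketch with Matplotlib. The category counts are assumed values for illustration; the autopct format string asks Matplotlib to print each slice's percentage. The Agg backend line is an assumption for running without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Assumed category frequencies.
counts = pd.Series({"IT": 3, "Sales": 2, "HR": 1})

# Each slice's share of the whole; the shares add up to 1.
proportions = counts / counts.sum()

fig, ax = plt.subplots(figsize=(4, 4))
ax.pie(
    counts.values,
    labels=counts.index.tolist(),
    autopct="%1.1f%%",  # label each slice with its percentage
    startangle=90,
)
ax.set_aspect("equal")  # keep the pie circular
ax.set_title("Share of employees by department")

print(proportions)
```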

1.3.5 Data Visualization Libraries

Matplotlib

• Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python.
• Supports a wide range of plot types: line plots, bar charts, histograms, scatter plots, pie
charts, 3D plots, and more. Suitable for both simple and complex visualizations.
• Highly customizable with control over colors, labels, fonts, and grid styles. Fine-tune every
aspect of a plot, including axes, ticks, and legends.
• Works with various backends for rendering plots in different formats (PNG, PDF, SVG)
and environments (Jupyter notebooks, web apps).
• Matplotlib integrates with NumPy and pandas, two popular Python libraries for numerical
computing and data analysis, which makes it easy to plot data directly from arrays and
DataFrames.

Seaborn
• Seaborn is a library for making statistical graphics in Python. It builds on top of Matplotlib
and integrates closely with pandas data structures.
• Seaborn helps to explore and understand the data. Its plotting functions operate on
dataframes and arrays containing whole datasets and internally perform the necessary
semantic mapping and statistical aggregation to produce informative plots.
• Seaborn excels at creating plots that summarize and visualize statistical relationships within
data. It goes beyond basic plotting to offer tools for understanding distributions,
relationships, and patterns.

1.4 Program

.
.
.

Viva Questions

General Questions:
• What is Exploratory Data Analysis (EDA), and why is it important?
• What are the key steps involved in performing EDA on a dataset?
• What is the difference between descriptive and inferential statistics?

Data Handling:
• What is the role of the pandas library in Python for data analysis?
• How do you load a dataset using pandas? Can you explain the read_csv() function?
• What are the key differences between a DataFrame and a Series in pandas?

Descriptive Statistics:
• What are the measures of central tendency, and why are they important?
• How do you calculate the mean, median, and mode of a dataset?
• What is the difference between mean and median? When would you use one over the other?
• What is the mode, and can a dataset have more than one mode?
• What is standard deviation, and what does it tell us about a dataset?
• How is variance different from standard deviation?
• What is the range of a dataset, and how is it calculated?
• What is the Interquartile Range (IQR), and how is it used to detect outliers?

Outlier Detection:
• What is an outlier, and how can it affect your analysis?
• How do you detect outliers using the IQR method?
• What are the lower and upper bounds in the IQR method, and how are they calculated?

Data Visualization:
• What is a histogram, and what kind of information does it provide?
• How do you interpret a histogram that is right-skewed or left-skewed?
• What is a box plot, and what information does it convey?
• How do you identify outliers in a box plot?
• What are the advantages of using a bar chart for data visualization?
• When would you use a pie chart instead of a bar chart?
• What are the key differences between Matplotlib and Seaborn?

Statistical Concepts:
• What is the difference between a population and a sample in statistics?
• What is the significance of the normal distribution in statistics?
• What is skewness, and how does it affect the interpretation of data?
• What is kurtosis, and how does it relate to the shape of a distribution?

Advanced Questions:
• How would you use EDA to prepare data for machine learning models?
• What is the role of EDA in feature engineering?
• How can you use EDA to identify relationships between variables in a dataset?
• What are some limitations of using only descriptive statistics for data analysis?
• How would you use EDA to identify potential biases in a dataset?
