Program-1
Objective
To perform exploratory data analysis (EDA) on a dataset by computing
descriptive statistics and visualizing the distribution of numerical and categorical
variables.
1. Introduction
This program focuses on Exploratory Data Analysis (EDA) by computing key statistical measures
and visualizing data. It begins by loading a dataset, selecting a numerical column, and computing
essential statistics such as mean, median, mode, standard deviation, variance, and range. Outliers
are detected using the Interquartile Range (IQR) method, and the data distribution is analyzed
through histograms and boxplots. Additionally, a categorical column is selected, where the
frequency of each category is computed and visualized using a bar chart or pie chart. This process
provides insights into data trends, anomalies, and key statistical properties before moving on to
further analysis or modeling.
Pandas - The pandas library in Python is essential for handling and analyzing structured data
efficiently. It provides data structures such as DataFrames and Series to manage datasets and
perform various operations on them.
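As a minimal sketch of these two data structures, the snippet below builds a small DataFrame and pulls one column out as a Series. The data and the column names `age` and `city` are made up for illustration and are not part of the lab dataset.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "age": [23, 35, 31, 35, 42],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Pune"],
})

ages = df["age"]            # selecting one column gives a Series
print(type(df).__name__)    # DataFrame
print(type(ages).__name__)  # Series

# Descriptive statistics of a numerical column.
print(ages.mean(), ages.median(), ages.mode().tolist())

# Frequency of each category in a categorical column.
city_counts = df["city"].value_counts()
print(city_counts)
```

In practice the DataFrame would usually come from a file, for example through `pd.read_csv()`, rather than from a literal dictionary.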
NumPy - The NumPy library is fundamental for performing mathematical and statistical
operations on large datasets, as it provides efficient handling of arrays and matrices. It also provides
the essential functions for mean, variance, standard deviation, range, and more.
• The quantile function np.quantile() in NumPy is used to compute the value below which
a given percentage of observations fall. Quantiles help in understanding the distribution of
data by splitting it into equal intervals. The most commonly used quantiles are quartiles
(which split data into four parts), but quantiles can be calculated for any percentage.
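The following sketch shows how np.quantile() might be called to obtain the three quartiles and an arbitrary percentile. The values are made up. By default NumPy interpolates linearly between neighbouring data points when the requested position falls between two of them.

```python
import numpy as np

data = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18])  # made-up values

# q is given as a fraction between 0 and 1, not as a percentage.
q1, q2, q3 = np.quantile(data, [0.25, 0.50, 0.75])
print(q1, q2, q3)  # 6.0 10.0 14.0

# Any other fraction works too, e.g. the 90th percentile.
p90 = np.quantile(data, 0.90)
print(p90)
```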
Mean
$$\mu = \frac{1}{N} \sum_{i=1}^{N} X_i$$
where:
Xi - Each individual value in the dataset.
N - Total number of values.
μ - Mean of the dataset
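The arithmetic mean is the sum of all values Xi divided by the number of values N. A minimal sketch with illustrative values, computed once by hand and once with the standard library:

```python
from statistics import mean

data = [3, 6, 4, 9, 2]  # illustrative values

# Sum of all values divided by the number of values.
manual_mean = sum(data) / len(data)
print(manual_mean)   # 4.8
print(mean(data))    # 4.8
```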
Median
• The median is a measure of central tendency that represents the middle value in a dataset
when the data is arranged in ascending (or descending) order. It's a useful statistic because
it's less sensitive to extreme values (outliers) than the mean (average).
• The median is not influenced by extreme values in the dataset. This makes it a good measure
of central tendency when dealing with data that may have outliers.
• If there are an odd number of values, the median is the middle value. If there are an even
number of values, the median is the average of the two middle values.
Median = (4 + 5) / 2 = 4.5
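The odd and even cases can be checked with Python's statistics module. The lists below are made up; the even-length one was chosen so that its two middle values are 4 and 5, as in the calculation above. The last lines illustrate how little a single extreme value moves the median compared with the mean.

```python
from statistics import mean, median

odd_data = [7, 1, 5, 3, 9]        # five values: the middle one is the median
even_data = [9, 4, 1, 7, 5, 3]    # six values: average of the two middle ones

print(median(odd_data))   # 5
print(median(even_data))  # 4.5

# Replace the largest value with an extreme one.
with_outlier = [1, 3, 5, 7, 900]
print(median(with_outlier))  # 5, unchanged
print(mean(with_outlier))    # pulled far above most of the data
```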
Mode
• The mode is another measure of central tendency that represents the most frequent value in
a dataset. In simpler terms, it's the value that appears most often.
• Unlike the mean, which is only applicable to numerical data, the mode can be used for both
numerical and categorical data. For example, you can find the mode of a set of colors or a
set of names.
• A dataset can have one mode (unimodal), two modes (bimodal), or more than two modes
(multimodal). If all values appear with the same frequency, there is no mode.
• The mode may not always be a good representation of the centre of the data, especially if
the data is skewed or has a wide range of values.
Example 3: No Modes
Dataset: 3, 4, 10, 15, 19, 25
• Each number appears only once.
• No mode exists in this dataset.
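A small sketch of a mode finder that follows the convention used here, where a dataset whose values all occur equally often has no mode. The helper name `modes` is hypothetical, and all test values except the no-mode dataset above are made up.

```python
from collections import Counter

def modes(values):
    """Return the sorted list of modes, or [] when all values are equally frequent."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    # Several distinct values that all share the same frequency: no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3]))             # [2]        unimodal
print(modes([1, 1, 2, 2, 3]))          # [1, 2]     bimodal
print(modes([3, 4, 10, 15, 19, 25]))   # []         no mode
print(modes(["red", "blue", "red"]))   # ['red']    categorical data
```

Note that the standard library's statistics.multimode() uses a different convention and returns every value when all of them are equally frequent.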
Standard Deviation
• The standard deviation (SD) measures how much the values in a dataset deviate from the
mean.
• A high standard deviation means the data points are more spread out.
• A low standard deviation indicates that data points are close to the mean.
• It is sensitive to outliers. Extreme values in the dataset can increase the standard deviation.
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2}$$
where:
Xi - Each individual value in the dataset.
μ - Mean of the dataset
N - The total number of values.
σ - Standard deviation
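A minimal sketch of the population standard deviation (dividing by N, as in the symbol list above), computed step by step and then checked against statistics.pstdev(). The data values are illustrative.

```python
from math import sqrt
from statistics import pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values

mu = sum(data) / len(data)                       # mean
squared_diffs = [(x - mu) ** 2 for x in data]    # squared deviation of each value
sigma = sqrt(sum(squared_diffs) / len(data))     # root of the average squared deviation

print(mu, sigma)      # 5.0 2.0
print(pstdev(data))   # 2.0
```

When the data is a sample rather than the whole population, the sum is usually divided by N - 1 instead; statistics.stdev() does that.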
5 Practical Insights into Data Analysis and Machine Learning -----------------------------------------
Variance
• Variance measures how spread out the data points are from the mean. It is the average of
the squared differences from the mean.
• A higher variance means the data points are more spread.
• A lower variance means they are closer to the mean.
• Because the differences are squared, variance is highly sensitive to extreme values (outliers).
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2$$
where:
Xi - Each individual value in the dataset.
μ - Mean of the dataset
N - The total number of values.
σ² - Variance
Variance (σ²) and standard deviation (σ) are directly related. The standard deviation is the square
root of the variance.
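The relationship between the two measures, and the effect of one extreme value on the variance, can be sketched as follows. The data is illustrative and the population forms (dividing by N) are used.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values

variance = pvariance(data)   # average of the squared differences from the mean
print(variance)              # 4

# The standard deviation is the square root of the variance.
print(isclose(sqrt(variance), pstdev(data)))  # True

# One extreme value inflates the variance sharply because differences are squared.
with_outlier = data + [50]
print(pvariance(with_outlier))
```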
Range
• The range is the difference between the maximum and minimum values in a dataset.
• It gives you a basic sense of the data's spread at a glance.
• The range is highly influenced by extreme values (outliers). A single outlier can make the
range much larger than the spread of the rest of the data.
Example:
Dataset: 3, 6, 4, 9, 2
• Maximum value: 9 Minimum value: 2
• Range: 9 - 2 = 7
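The same calculation in Python, followed by a made-up extra value of 100 to show how strongly a single outlier changes the range:

```python
data = [3, 6, 4, 9, 2]

data_range = max(data) - min(data)
print(data_range)  # 7

# Adding one extreme value (hypothetical) stretches the range dramatically.
with_outlier = data + [100]
outlier_range = max(with_outlier) - min(with_outlier)
print(outlier_range)  # 98
```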
Interquartile Range (IQR)
IQR = Q3 − Q1
where:
• Q1 = First quartile (25th percentile): The value below which 25% of the data falls.
• Q3 = Third quartile (75th percentile): The value below which 75% of the data falls.
• The IQR of 10 means that the middle 50% of the data lies within a range of 10 units.
• Values below Q1−1.5×IQR or above Q3+1.5×IQR are often considered outliers.
Any value below −9 or above 31 would be considered an outlier. In this dataset, there are no
outliers.
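The bounds quoted above are consistent with Q1 = 6 and Q3 = 16, since 6 − 1.5 × 10 = −9 and 16 + 1.5 × 10 = 31. The sketch below applies the same rule to a made-up dataset that does contain an outlier. The helper name `iqr_outliers` is hypothetical, and the quartiles use NumPy's default linear interpolation, so other quartile conventions may give slightly different bounds.

```python
import numpy as np

def iqr_outliers(values):
    """Return (lower bound, upper bound, list of outliers) using the 1.5 x IQR rule."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(arr, [0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = arr[(arr < lower) | (arr > upper)]
    return float(lower), float(upper), outliers.tolist()

# Made-up data: Q1 = 8, Q3 = 16, IQR = 8.
lower, upper, outliers = iqr_outliers([4, 6, 8, 10, 12, 14, 16, 18, 60])
print(lower, upper, outliers)  # -4.0 28.0 [60.0]
```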
1.3.1 Histograms
• A histogram is a representation of the distribution of data.
• A histogram is a type of bar chart that shows how frequently different values occur in your
numerical data.
• The data is divided into ranges called bins or intervals, and each bar on the histogram
represents one of these bins. The height of each bar shows how many data points fall within
that bin.
• Right skew (positive skew): The tail of the histogram extends longer to the right. This often
means there are some higher values that are pulling the mean up, but most of the data is
concentrated on the lower end.
• Left skew (negative skew): The tail extends longer to the left. This suggests there are some
lower values pulling the mean down, but most of the data is on the higher end.
• Symmetric: The data is evenly distributed around the centre, so the mean and median
roughly coincide. Such a histogram often resembles a bell-shaped curve (e.g., normal
distribution).
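To make the idea of bins concrete, here is a minimal sketch that counts values per equal-width bin and prints a text histogram. The helper name `histogram_counts` and the data are made up, and values are assumed to lie between `low` and `high`. Comparing the mean with the median gives a rough hint of skew: a mean above the median suggests a longer right tail.

```python
from statistics import mean, median

def histogram_counts(values, bins, low, high):
    """Count how many values fall into each of `bins` equal-width intervals."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The last bin also includes its right edge, so `high` itself is counted.
        idx = min(int((v - low) / width), bins - 1)
        counts[idx] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]   # made up, with a long tail to the right
counts = histogram_counts(data, bins=4, low=0, high=12)

for i, c in enumerate(counts):
    print(f"{i * 3:>2} to {(i + 1) * 3:>2}: " + "#" * c)

print(mean(data), median(data))  # the mean exceeds the median: a hint of right skew
```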
Advantages of Histogram
► The visual representation helps reveal the underlying patterns in the data.
► It shows the central tendency and variability of the data.
► It shows whether the data is skewed (asymmetrical). This is important because skewed data
can affect the interpretation of other statistical measures, like the mean.
► Outliers, which are extreme values, often stand out on a histogram as isolated bars far from
the main distribution. This makes them easier to detect.
• The Box: Represents the interquartile range (IQR), which contains the middle 50% of
the data.
• The Whiskers: Extend from Q1 down to the smallest value and from Q3 up to the largest
value that are not outliers.
• The Median Line: A line inside the box marks the median.
• Outliers: Values that lie more than 1.5 × IQR below Q1 or above Q3 are plotted separately
as dots.
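The numbers behind these box plot elements can be computed directly. The sketch below is one possible way to do it with the standard library. The helper name `boxplot_stats` and the data are made up, and `method="inclusive"` is chosen so the quartiles interpolate linearly between data points; plotting libraries may use a slightly different quartile convention.

```python
from statistics import quantiles

def boxplot_stats(values):
    """Compute the box, whisker ends, and outliers under the 1.5 x IQR rule."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],     # smallest value that is not an outlier
        "whisker_high": inside[-1],   # largest value that is not an outlier
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

stats = boxplot_stats([4, 6, 8, 10, 12, 14, 16, 18, 60])
print(stats)
```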
• A bar chart (or bar graph) is a visual representation of data using rectangular bars of varying
lengths or heights. Each bar represents a category or group, and the length or height of the
bar corresponds to the value or frequency of that category.
• Bar charts are one of the most commonly used tools in data visualization because they are
simple, intuitive, and effective for comparing data across categories.
• Bar charts can be used for a wide range of data types, including categorical, numerical, and
ordinal data.
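Before a bar chart can be drawn, the frequency of each category has to be counted. A minimal sketch with a made-up categorical column, printing a text version of the bars and computing the proportions a pie chart would show:

```python
from collections import Counter

colours = ["red", "blue", "red", "green", "blue", "red"]  # hypothetical column

freq = Counter(colours)

# One text "bar" per category, longest first.
for category, count in freq.most_common():
    print(f"{category:<6} {'#' * count} ({count})")

# Share of each category, as a pie chart would display it.
share = {category: count / len(colours) for category, count in freq.items()}
print(share)
```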
Matplotlib
Seaborn
• Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib
and integrates closely with pandas data structures.
• Seaborn helps to explore and understand the data. Its plotting functions operate on
dataframes and arrays containing whole datasets and internally perform the necessary
semantic mapping and statistical aggregation to produce informative plots.
• Seaborn excels at creating plots that summarize and visualize statistical relationships within
data. It goes beyond basic plotting to offer tools for understanding distributions,
relationships, and patterns.
1.4 Program
.
.
.
Viva Questions
General Questions:
• What is Exploratory Data Analysis (EDA), and why is it important?
• What are the key steps involved in performing EDA on a dataset?
• What is the difference between descriptive and inferential statistics?
Data Handling:
• What is the role of the pandas library in Python for data analysis?
• How do you load a dataset using pandas? Can you explain the read_csv() function?
• What are the key differences between a DataFrame and a Series in pandas?
Descriptive Statistics:
• What are the measures of central tendency, and why are they important?
• How do you calculate the mean, median, and mode of a dataset?
• What is the difference between mean and median? When would you use one over the other?
• What is the mode, and can a dataset have more than one mode?
• What is standard deviation, and what does it tell us about a dataset?
• How is variance different from standard deviation?
• What is the range of a dataset, and how is it calculated?
• What is the Interquartile Range (IQR), and how is it used to detect outliers?
Outlier Detection:
• What is an outlier, and how can it affect your analysis?
• How do you detect outliers using the IQR method?
• What are the lower and upper bounds in the IQR method, and how are they calculated?
Data Visualization:
• What is a histogram, and what kind of information does it provide?
• How do you interpret a histogram that is right-skewed or left-skewed?
• What is a box plot, and what information does it convey?
• How do you identify outliers in a box plot?
• What are the advantages of using a bar chart for data visualization?
• When would you use a pie chart instead of a bar chart?
• What are the key differences between Matplotlib and Seaborn?
Statistical Concepts:
• What is the difference between a population and a sample in statistics?
• What is the significance of the normal distribution in statistics?
• What is skewness, and how does it affect the interpretation of data?
• What is kurtosis, and how does it relate to the shape of a distribution?
Advanced Questions:
• How would you use EDA to prepare data for machine learning models?
• What is the role of EDA in feature engineering?
• How can you use EDA to identify relationships between variables in a dataset?
• What are some limitations of using only descriptive statistics for data analysis?
• How would you use EDA to identify potential biases in a dataset?