Program-1

The document outlines a program for performing Exploratory Data Analysis (EDA) on a dataset, focusing on computing descriptive statistics and visualizing data distributions. Key tasks include calculating mean, median, mode, standard deviation, variance, and range for numerical data, detecting outliers using IQR, and visualizing categorical data with bar or pie charts. It emphasizes the importance of Python libraries like pandas, NumPy, Matplotlib, and Seaborn for data handling and visualization.

Uploaded by

Kasi Lingamn

Practical Insights into Data Analysis and Machine Learning

PROGRAM - 1

Develop a program to load a dataset and select one numerical column.

Compute mean, median, mode, standard deviation, variance, and range for
a given numerical column in a dataset. Generate a histogram and boxplot to
understand the distribution of the data. Identify any outliers in the data
using IQR. Select a categorical variable from a dataset. Compute the
frequency of each category and display it as a bar chart or pie chart.

Objective
To perform exploratory data analysis (EDA) on a dataset by computing
descriptive statistics and visualizing the distribution of numerical and categorical
variables.

1. Introduction
This program focuses on Exploratory Data Analysis (EDA) by computing key statistical measures
and visualizing data. It begins by loading a dataset, selecting a numerical column, and computing
essential statistics such as mean, median, mode, standard deviation, variance, and range. Outliers
are detected using the Interquartile Range (IQR) method, and the data distribution is analyzed
through histograms and boxplots. Additionally, a categorical column is selected, where the
frequency of each category is computed and visualized using a bar chart or pie chart. This process
provides insights into data trends, anomalies, and key statistical properties before moving on to
further analysis or modeling.

To execute this program, a strong foundation in Python programming is required,
particularly proficiency in pandas for loading datasets, handling missing values, and computing
statistical metrics. Knowledge of NumPy is beneficial for numerical computations such as variance
and standard deviation. Understanding statistical concepts like central tendency, dispersion, and
outlier detection using IQR is crucial. Additionally, familiarity with Matplotlib and Seaborn is
needed for data visualization, including histograms, boxplots, bar charts, and pie charts. Experience
with Jupyter Notebook, Google Colab, or IDEs like VS Code or PyCharm can aid in executing the
code effectively. Lastly, data preprocessing techniques and understanding categorical data analysis
will help in summarizing categorical variables efficiently. The following sections will provide
insights into the concepts mentioned above.

1.1 Data Handling

Pandas - The pandas library in Python is essential for handling and analyzing structured data
efficiently. It provides data structures such as DataFrames and Series to manage datasets and
perform various operations. Below are some key functionalities:

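• Loading Datasets with pandas.read_csv(): The read_csv() function is used to load
datasets from CSV files into a pandas DataFrame.
• Selecting Columns in a DataFrame: A single column is selected by name, for example
df['column_name'] (both names are placeholders), which returns a pandas Series.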
• Pandas provides built-in methods to compute descriptive statistics for numerical columns:
Mean (Average): .mean()
Median (Middle Value): .median()
Mode (Most Frequent Value): .mode() (returns a Series, because there can be ties)
Standard Deviation: .std() (sample standard deviation by default, ddof=1)
Variance: .var() (sample variance by default, ddof=1)
Minimum: .min()
Maximum: .max()
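
The pandas operations listed above can be sketched as follows. This is a minimal illustration, not the manual's own program: the CSV text, the column names (name, department, salary), and the values are all made up for the example, and the data is read from an in-memory string so the snippet runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,department,salary
Asha,HR,30000
Ravi,IT,45000
Meena,IT,45000
John,Sales,38000
Sara,HR,52000
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Selecting one column by name returns a Series.
salary = df["salary"]

stats = {
    "mean": salary.mean(),
    "median": salary.median(),
    "mode": salary.mode().tolist(),  # mode() returns a Series; ties give several values
    "std": salary.std(),             # sample standard deviation (ddof=1 by default)
    "var": salary.var(),             # sample variance (ddof=1 by default)
    "range": salary.max() - salary.min(),
}

for name, value in stats.items():
    print(f"{name}: {value}")
```

Note that pandas divides by N - 1 in .std() and .var() unless ddof=0 is passed, so the results differ slightly from a population formula that divides by N.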

NumPy - The NumPy library is fundamental for performing mathematical and statistical
operations on large datasets, as it provides efficient handling of arrays and matrices. It also provides
the essential functions for mean, variance, standard deviation, range, and more.

• The quantile function np.quantile() in NumPy is used to compute the value below which
a given fraction of observations fall. Quantiles help in understanding the distribution of
data by splitting it into groups that contain equal shares of the observations. The most
commonly used quantiles are quartiles (which split data into four parts), but quantiles can
be calculated for any fraction between 0 and 1.

1.2 Descriptive Statistics

1.2.1 Measures of Central Tendency: Mean, Median, and Mode.

Mean (Arithmetic Mean) -

• The mean is the average value of a dataset. It is calculated by summing all the values in a
numerical dataset and dividing by the total number of values.

$$\mu = \frac{1}{N} \sum_{i=1}^{N} X_i$$

where:
μ - Mean of the dataset.
Xi - Each individual value in the dataset.
N - Total number of values.

Median
• The median is a measure of central tendency that represents the middle value in a dataset
when the data is arranged in ascending (or descending) order.
• The median is far less sensitive to extreme values (outliers) than the mean (average), which
makes it a good measure of central tendency for data that may contain outliers.
• If there are an odd number of values, the median is the middle value. If there are an even
number of values, the median is the average of the two middle values.

Example: Let's consider the following dataset: 5, 2, 8, 1, 9, 4.

• Order the data: 1, 2, 4, 5, 8, 9
• Find the middle value:
There are 6 data points (even number), so the median is the average of the two middle values
(4 and 5).

Median = (4 + 5) / 2 = 4.5
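
The worked example can be checked with Python's standard statistics module. The second half of the sketch appends an extra value of 100 (an assumption added purely for illustration) to show how an outlier moves the mean much more than the median.

```python
import statistics

data = [5, 2, 8, 1, 9, 4]

mean_value = statistics.mean(data)      # sum is 29, divided by 6 values
median_value = statistics.median(data)  # average of the two middle values, 4 and 5

# Illustrative assumption: add one extreme value and recompute.
with_outlier = data + [100]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

print(mean_value, median_value)
print(mean_with_outlier, median_with_outlier)
```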

Mode
• The mode is another measure of central tendency that represents the most frequent value in
a dataset. In simpler terms, it's the value that appears most often.
• Unlike the mean, which is only applicable to numerical data, the mode can be used for both
numerical and categorical data. For example, you can find the mode of a set of colors or a
set of names.
• A dataset can have one mode (unimodal), two modes (bimodal), or more than two modes
(multimodal). If all values appear with the same frequency, there is no mode.
• The mode may not always be a good representation of the centre of the data, especially if
the data is skewed or has a wide range of values.

Example 1: Single Mode

Dataset: 3, 4, 10, 10, 15, 19
• The number 10 appears twice, while all other numbers appear once.
• Mode = 10

Example 2: Multiple Modes

Dataset: 3, 4, 4, 10, 10, 15, 19
• The numbers 4 and 10 each appear twice, while all other numbers appear once.
• Mode = 4, 10

Example 3: No Mode
Dataset: 3, 4, 10, 15, 19, 25
• Each number appears only once.
• No mode exists in this dataset.
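
The three examples can be reproduced with a short helper. The function name modes is hypothetical and written only for this sketch; it follows the convention used above, where a dataset whose values are all equally frequent has no mode. Library routines may follow a different convention: pandas Series.mode() and statistics.multimode() would return every value for Example 3.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s) in sorted order.

    Returns an empty list when the input is empty or when two or more
    distinct values all occur equally often (the "no mode" convention).
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


single = modes([3, 4, 10, 10, 15, 19])       # Example 1
multiple = modes([3, 4, 4, 10, 10, 15, 19])  # Example 2
none = modes([3, 4, 10, 15, 19, 25])         # Example 3

print(single, multiple, none)
```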

1.2.2 Measures of Dispersion

Standard Deviation
• The standard deviation (SD) measures how much the values in a dataset deviate from the
mean.
• A high standard deviation means the data points are more spread out.
• A low standard deviation indicates that data points are close to the mean.
• It is sensitive to outliers. Extreme values in the dataset can increase the standard deviation.

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2}$$

where:
Xi - Each individual value in the dataset.
μ - Mean of the dataset
N - The total number of values.
σ - Standard deviation

Variance
• Variance measures how spread out the data points are from the mean. It is the average of
the squared differences from the mean.
• A higher variance means the data points are more spread.
• A lower variance means they are closer to the mean.
• Because the deviations are squared, variance is strongly affected by extreme values (outliers).

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2$$

where:
Xi - Each individual value in the dataset.
μ - Mean of the dataset
N - The total number of values.
σ² - Variance

Variance (σ²) and standard deviation (σ) are directly related. The standard deviation is the square
root of the variance.
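
A minimal sketch of these two measures, computed step by step from the definition (average of squared differences from the mean, dividing by N) and then compared with NumPy's built-in functions. The dataset is made up for the example. The last line shows the sample variance (dividing by N - 1), which is what pandas returns by default.

```python
import numpy as np

# Made-up data chosen so the results are round numbers.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mu = data.mean()                      # mean of the dataset
variance = np.mean((data - mu) ** 2)  # average squared deviation (divides by N)
std_dev = np.sqrt(variance)           # standard deviation is the square root

# NumPy's built-ins use the same population formula by default (ddof=0).
numpy_variance = np.var(data)
numpy_std = np.std(data)

# Sample variance divides by N - 1 instead of N.
sample_variance = np.var(data, ddof=1)

print(mu, variance, std_dev, sample_variance)
```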

Range
• The range is the difference between the maximum and minimum values in a dataset.
• It gives you a basic sense of the data's spread at a glance.
• The range is highly influenced by extreme values (outliers). A single outlier can make the
range much larger than the spread of the rest of the data.

Example:
Dataset: 3, 6, 4, 9, 2
• Maximum value: 9 Minimum value: 2
• Range: 9 - 2 = 7
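
The same calculation in code, followed by a variation that appends one extreme value (100, an assumption added for illustration) to show how a single outlier inflates the range.

```python
data = [3, 6, 4, 9, 2]
data_range = max(data) - min(data)

# Illustrative assumption: one extreme value is added to the dataset.
with_outlier = data + [100]
range_with_outlier = max(with_outlier) - min(with_outlier)

print(data_range, range_with_outlier)
```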

1.2.3 Outlier Detection

Interquartile Range (IQR)

• It is a measure of statistical dispersion that describes the spread of the middle 50% of a
dataset. Because it ignores the lowest and highest quarters of the data, it gives a clearer
picture of how the central values vary.
• It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1).
• The IQR is resistant to the influence of outliers, making it a robust measure of spread.

IQR = Q3 – Q1
Where:
• Q1 = First quartile (25th percentile): The value below which 25% of the data falls.
• Q3 = Third quartile (75th percentile): The value below which 75% of the data falls.

Example: Consider the dataset: [3,7,8,5,12,14,21,13,18]

1. Sort the Data: [3,5,7,8,12,13,14,18,21]
2. Find the Median (Q2): The median is 12.
3. Find Q1: The lower half is [3,5,7,8]. The median of this half is (5 + 7) / 2 = 6.
4. Find Q3: The upper half is [13,14,18,21]. The median of this half is (14 + 18) / 2 = 16.
5. Calculate IQR: IQR = Q3 − Q1 = 16 − 6 = 10.

• The IQR of 10 means that the middle 50% of the data lies within a range of 10 units.
• Values below Q1−1.5×IQR or above Q3+1.5×IQR are often considered outliers.

Using the dataset above:

• Lower Bound: Q1−1.5×IQR = 6−1.5×10 = −9
• Upper Bound: Q3+1.5×IQR = 16+1.5×10 = 31

Any value below −9 or above 31 would be considered an outlier. In this dataset, there are no
outliers.
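
The worked example can be sketched in code as below. The helper quartiles_median_of_halves is a hypothetical name written for this illustration; it follows the hand method used above (Q1 and Q3 are the medians of the lower and upper halves, with the overall median left out when the number of values is odd). Library functions may use a different quartile convention: np.quantile() with its default linear interpolation gives Q1 = 7 and Q3 = 14 for the same data, so the bounds depend on the method chosen.

```python
import statistics

import numpy as np


def quartiles_median_of_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    For an odd number of values the overall median is excluded from
    both halves, matching the hand calculation in the text.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return statistics.median(lower), statistics.median(upper)


data = [3, 7, 8, 5, 12, 14, 21, 13, 18]

q1, q3 = quartiles_median_of_halves(data)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_bound or x > upper_bound]

# For comparison: NumPy's default (linear interpolation) quartiles.
np_q1, np_q3 = np.quantile(data, [0.25, 0.75])

print(q1, q3, iqr, lower_bound, upper_bound, outliers)
print(np_q1, np_q3)
```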

1.3 Data Visualization

1.3.1 Histograms
• A histogram is a bar-style chart that represents the distribution of numerical data by
showing how frequently values occur.
• The data is divided into ranges called bins or intervals, and each bar on the histogram
represents one of these bins. The height of each bar shows how many data points fall within
that bin.
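
The binning step can be illustrated without drawing anything. The sketch below uses np.histogram() on made-up data to count how many values fall into each of four equal-width bins; these counts are the bar heights a histogram would show.

```python
import numpy as np

# Made-up data.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Four equal-width bins covering the interval from 1 to 9.
counts, edges = np.histogram(data, bins=4, range=(1, 9))

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```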

Image Source: https://cdn.serc.carleton.edu/images/mathyouneed/geomajors/histograms/histogram_skew.webp

• Right skew (positive skew): The tail of the histogram extends longer to the right. This often
means there are some higher values that are pulling the mean up, but most of the data is
concentrated on the lower end.
• Left skew (negative skew): The tail extends longer to the left. This suggests there are some
lower values pulling the mean down, but most of the data is on the higher end.

• Symmetric: The data is evenly distributed around the centre (mean/median), and the
histogram often resembles a bell-shaped curve (e.g., normal distribution).
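
The three shapes can also be recognised numerically. In this sketch, with made-up data, the mean is compared with the median and the sign of pandas' Series.skew() is checked: positive for right skew, negative for left skew, and about zero for symmetric data.

```python
import pandas as pd

# Made-up datasets: one long right tail, its mirror image, and a symmetric one.
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])
left_skewed = pd.Series([20, 19, 19, 18, 18, 18, 17, 1])
symmetric = pd.Series([1, 2, 3, 4, 5])

for name, series in [("right", right_skewed),
                     ("left", left_skewed),
                     ("symmetric", symmetric)]:
    print(name, series.mean(), series.median(), series.skew())
```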

Advantages of Histogram
► The visual representation helps to understand the underlying patterns in data.
► It shows the central tendency and variability of the data.
► It reveals whether the data is skewed (asymmetrical). This is important because skewed
data can affect the interpretation of other statistical measures, like the mean.
► Outliers, which are extreme values, often stand out on a histogram as isolated bars far from
the main distribution. This makes them easier to detect.

1.3.2 Box plot

• A box plot is a graphical representation of data and it provides the summary of the
distribution of numerical data, highlighting key statistics like the median, quartiles, and
potential outliers.
• It's based on the five-number summary of your data:
► Minimum: The smallest value in dataset.
► Lower quartile (Q1): The value that separates the lowest 25% of data from the rest.
► Median (Q2): The middle value when the data is ordered from least to greatest.
► Upper quartile (Q3): The value that separates the highest 25% of the data from the rest.
► Maximum: The largest value in dataset.

• The Box: Represents the interquartile range (IQR), which contains the middle 50% of
the data.
• The Whiskers: Extend from Q1 down to the smallest value and from Q3 up to the largest
value that are not outliers.
• The Median Line: A line inside the box marks the median.
• Outliers: Values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are plotted separately as dots.
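
The quantities a box plot draws can be computed directly, as in the sketch below. The data is made up, and the quartiles come from np.quantile() with its default linear interpolation, which is one convention among several. Matplotlib's boxplot uses the same 1.5 × IQR whisker rule by default.

```python
import numpy as np

# Made-up data with one clearly extreme value.
data = np.array([3, 5, 7, 8, 12, 13, 14, 18, 21, 60])

q1, median, q3 = np.quantile(data, [0.25, 0.50, 0.75])
iqr = q3 - q1

# Fences: anything outside them is treated as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that lie inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```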

Advantages of Box plot

► A box plot displays five key statistics in a single visual, making it easy to understand large
datasets.
► Outliers (extreme values) are clearly marked outside the whiskers, which helps in
detecting unusual data points.
► Box plots are excellent for comparing the distributions of multiple datasets.
► The shape of the box plot indicates whether the data is symmetrical, left-skewed, or right-
skewed. If the median line sits closer to one edge of the box, the data is skewed toward the
opposite side: a median near Q1 suggests right skew, and a median near Q3 suggests left skew.

1.3.3 Bar chart

• A bar chart (or bar graph) is a visual representation of data using rectangular bars of varying
lengths or heights. Each bar represents a category or group, and the length or height of the
bar corresponds to the value or frequency of that category.
• Bar charts are one of the most commonly used tools in data visualization because they are
simple, intuitive, and effective for comparing data across categories.
• Bar charts can be used for a wide range of data types, including categorical, numerical, and
ordinal data.
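
The bar heights for a categorical column are simply the frequency of each category. A minimal sketch with pandas follows; the color values are made up for the example.

```python
import pandas as pd

# Made-up categorical data.
colors = pd.Series(["red", "blue", "red", "green", "blue", "red"], name="color")

# value_counts() returns the frequency of each category, most frequent first.
frequency = colors.value_counts()

print(frequency)
```

If Matplotlib is installed, calling frequency.plot(kind="bar") on this result draws the corresponding bar chart.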

Advantages of Bar Charts:

► They present data in a straightforward and easy-to-interpret manner.
► Bar charts can represent both small and large datasets effectively.
► Bar charts are visually appealing and can be customized with colors, labels, and annotations.

1.3.4 Pie chart

• A pie chart is a circular statistical graphic that is divided into slices to represent the
proportions of different categories in a dataset.
• Each slice of the pie corresponds to a specific category, and the size of the slice is
proportional to the quantity or percentage it represents.
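
The slice sizes follow directly from the category proportions. The sketch below, using made-up department labels, converts frequencies into percentages and into slice angles out of 360 degrees.

```python
import pandas as pd

# Made-up categorical data.
departments = pd.Series(["IT", "HR", "IT", "Sales", "IT", "HR", "IT", "Sales"])

# normalize=True returns proportions instead of raw counts.
share_percent = departments.value_counts(normalize=True) * 100

# Each slice's angle is its share of the full circle.
slice_degrees = share_percent * 3.6

print(share_percent)
print(slice_degrees)
```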

1.3.5 Data Visualization Libraries

Matplotlib

• Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python.
• Supports a wide range of plot types: line plots, bar charts, histograms, scatter plots, pie
charts, 3D plots, and more. Suitable for both simple and complex visualizations.
• Highly customizable with control over colors, labels, fonts, and grid styles. Fine-tune every
aspect of a plot, including axes, ticks, and legends.
• Works with various backends for rendering plots in different formats (PNG, PDF, SVG)
and environments (Jupyter notebooks, web apps).
• Matplotlib integrates with NumPy and pandas, two popular Python libraries for
numerical computing and data analysis. This makes it easy to plot data from arrays and
DataFrames. Compatible with Jupyter notebooks for interactive plotting.
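
As a small sketch of Matplotlib in use, the snippet below draws a histogram and a box plot side by side for a made-up dataset. It assumes Matplotlib is installed, selects the non-interactive Agg backend so it runs without a display, and writes the image to an in-memory buffer rather than a file. In a notebook, plt.show() would normally be used instead.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt

# Made-up data with one extreme value.
data = [3, 5, 7, 8, 12, 13, 14, 18, 21, 60]

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: five bins, with bar outlines for readability.
ax_hist.hist(data, bins=5, edgecolor="black")
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("Value")
ax_hist.set_ylabel("Frequency")

# Box plot: points beyond 1.5 x IQR from the box are drawn individually.
ax_box.boxplot(data)
ax_box.set_title("Box plot")

fig.tight_layout()

# Render to memory instead of writing a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```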

Seaborn
• Seaborn is a library for making statistical graphics in Python. It builds on top of Matplotlib
and integrates closely with pandas data structures.
• Seaborn helps to explore and understand the data. Its plotting functions operate on
dataframes and arrays containing whole datasets and internally perform the necessary
semantic mapping and statistical aggregation to produce informative plots.
• Seaborn excels at creating plots that summarize and visualize statistical relationships within
data. It goes beyond basic plotting to offer tools for understanding distributions,
relationships, and patterns.

1.4 Program

.
.
.

Viva Questions

General Questions:
• What is Exploratory Data Analysis (EDA), and why is it important?
• What are the key steps involved in performing EDA on a dataset?
• What is the difference between descriptive and inferential statistics?

Data Handling:
• What is the role of the pandas library in Python for data analysis?
• How do you load a dataset using pandas? Can you explain the read_csv() function?
• What are the key differences between a DataFrame and a Series in pandas?

Descriptive Statistics:
• What are the measures of central tendency, and why are they important?
• How do you calculate the mean, median, and mode of a dataset?
• What is the difference between mean and median? When would you use one over the other?
• What is the mode, and can a dataset have more than one mode?
• What is standard deviation, and what does it tell us about a dataset?
• How is variance different from standard deviation?
• What is the range of a dataset, and how is it calculated?
• What is the Interquartile Range (IQR), and how is it used to detect outliers?

Outlier Detection:
• What is an outlier, and how can it affect your analysis?
• How do you detect outliers using the IQR method?
• What are the lower and upper bounds in the IQR method, and how are they calculated?

Data Visualization:
• What is a histogram, and what kind of information does it provide?
• How do you interpret a histogram that is right-skewed or left-skewed?
• What is a box plot, and what information does it convey?
• How do you identify outliers in a box plot?
• What are the advantages of using a bar chart for data visualization?
• When would you use a pie chart instead of a bar chart?
• What are the key differences between Matplotlib and Seaborn?

Statistical Concepts:
• What is the difference between a population and a sample in statistics?
• What is the significance of the normal distribution in statistics?
• What is skewness, and how does it affect the interpretation of data?
• What is kurtosis, and how does it relate to the shape of a distribution?

Advanced Questions:
• How would you use EDA to prepare data for machine learning models?
• What is the role of EDA in feature engineering?
• How can you use EDA to identify relationships between variables in a dataset?
• What are some limitations of using only descriptive statistics for data analysis?
• How would you use EDA to identify potential biases in a dataset?
