DAV Unit 1
4. Ordinal Scale: An ordinal scale classifies data into distinct categories in which ranking is implied. For example:
Faculty rank: Professor, Associate Professor, Assistant Professor
Student grades: A, B, C, D, E, F
5. Interval Scale: An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity, but the measurements do not have a true zero point. For example:
Temperature in Fahrenheit and Celsius.
Years
6. Ratio Scale: A ratio scale is an ordered scale in which the difference between measurements is a meaningful quantity and the measurements have a true zero point. Hence, we can perform arithmetic operations on ratio-scale data. For example: weight, age, salary, etc.
Importance of data:
1. Informed Decision-Making
Data allows individuals, businesses, and organizations to make decisions based on facts and insights rather
than intuition or guesswork. By analyzing relevant data, decision-makers can understand trends, patterns, and
correlations that guide strategic choices.
2. Improving Efficiency and Productivity
Data-driven approaches enable businesses and organizations to streamline operations, identify bottlenecks,
and optimize workflows. This can lead to improved productivity, reduced waste, and better resource
management.
3. Innovation and Research
Data is at the core of scientific research, product development, and innovation. Researchers use data to test
hypotheses, validate theories, and make new discoveries. In industries such as technology and healthcare, data
is critical for developing new products, services, and treatments.
4. Customer Insights
In business, customer data is crucial for understanding consumer behavior, preferences, and trends. This
allows companies to tailor their marketing strategies, improve customer service, and develop products that
better meet customer needs.
5. Competitive Advantage
Organizations that effectively gather, analyze, and leverage data often have a competitive advantage. Data can
provide insights into market conditions, customer needs, and competitor activities, enabling organizations to
stay ahead in a rapidly changing environment.
6. Risk Management
Data helps identify, assess, and mitigate risks. By analyzing historical data, companies can anticipate potential
problems, track performance indicators, and make proactive decisions to minimize risks and costs.
7. Personalization
Data enables businesses to personalize experiences for customers. For example, websites and apps use data
about user behavior to offer customized recommendations, targeted advertising, and personalized content.
8. Predictive Analysis
Data is often used in predictive analytics to forecast future trends. This is particularly useful in industries such
as finance, marketing, and healthcare, where predicting future events can inform strategy and planning.
9. Compliance and Reporting
In many industries, data is essential for regulatory compliance. Organizations need accurate data for reporting
purposes, ensuring that they meet legal, financial, and environmental standards.
10. Performance Measurement
Data enables organizations to track key performance indicators (KPIs) and assess how well they are meeting
their objectives. This is important for monitoring progress, making adjustments, and ensuring that goals are
being achieved.
11. Automation and AI
Data is foundational to automation, machine learning, and artificial intelligence. Algorithms rely on large
datasets to train models, recognize patterns, and make decisions autonomously, leading to improved efficiency
and decision-making.
12. Global Connectivity and Communication
With the advent of the internet and digital technology, data flows across borders, enabling global connectivity,
communication, and collaboration. It plays a central role in connecting people, businesses, and governments
worldwide.
Transportation: The transportation sector can employ data analytics to evaluate logistics data, spot trends in transportation routes, and improve those routes. Data analytics can help transportation businesses
cut expenses and speed up delivery times.
Supporting R&D: Guide research and development efforts with data-driven insights.
Tech companies often rely on data analysis to drive innovation in product development, such as creating new
features based on user feedback and usage patterns.
Supporting Evidence-Based Research
In academic and scientific fields, data analysis supports:
Research Validation: Test hypotheses and validate research findings with statistical methods.
Publication of Findings: Present data-driven evidence in research papers and studies.
Researchers in fields like epidemiology use data analysis to study disease patterns and evaluate public health
interventions.
Mitigating Risks
Data analysis helps in identifying and managing risks:
Risk Assessment: Analyze potential risks and vulnerabilities in business operations.
Fraud Detection: Use data analysis techniques to detect fraudulent activities.
Financial institutions, for instance, use data analysis to identify suspicious transactions and prevent fraud.
Fostering Competitive Advantage
Data analysis provides a competitive edge by:
Benchmarking: Compare performance against competitors and industry standards.
Strategic Planning: Develop strategic plans based on data-driven insights and competitive analysis.
Businesses that leverage data analysis can gain a competitive advantage by making better strategic decisions
and staying ahead of market trends.
Python language basics –
Python is one of the most popular programming languages today, known for its simplicity and extensive features. Its
clean and straightforward syntax makes it beginner-friendly, while its powerful libraries and frameworks make it a strong choice for developers.
Python is a high-level, interpreted language with easy-to-read syntax.
Python is used in various fields like web development, data science, artificial intelligence and automation.
First Python Program to Learn Python Programming
Here is a simple Python program that prints a string. We recommend you edit the code and try printing your own name.
# Python Program to print a sample string
print("Welcome to Python Tutorial")
Output
Welcome to Python Tutorial
1. Getting Started with Python Programming
Welcome to the getting started with Python programming section! Here, we'll cover the essential topics you need to kickstart your journey in Python programming, from syntax and keywords to comments, variables, and indentation.
2. Input/Output
In this section of the Python guide, we will learn input and output operations in Python, which are crucial for interacting with users and processing data effectively. We cover everything from printing a simple line with the print() function to advanced formatting techniques and efficient methods for receiving user input; a short sketch follows the topic list below.
Print output using print() function
Print without new line
sep parameter in print()
Output Formatting
Taking Input in Python
Taking Multiple Inputs from users
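As a quick tie-together of these input/output topics, here is a minimal sketch; the prompt strings and variable names are illustrative, not taken from a prescribed exercise:
# print() with a custom separator and without a trailing newline
print("Data", "Analytics", sep="-", end="")
print(" and Visualization")
# Output formatting using an f-string
name = "Ravi"
marks = 87.5
print(f"{name} scored {marks:.1f} marks")
# Taking input (input() always returns a string) and multiple inputs in one line
age = int(input("Enter your age: "))
x, y = input("Enter two numbers separated by a space: ").split()
print("Sum =", int(x) + int(y))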
3. Python Data Types
Python offers a versatile collection of data types, including strings, numbers, Booleans, lists, tuples, sets, dictionaries, and arrays. In this section, we will learn about each data type in detail; a short sketch follows the list below.
Data Types
Strings
Numbers
Boolean
Python List
Python Tuples
Python Sets
Python Dictionary
Python Arrays
Type Casting
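The short sketch below illustrates these data types and type casting; the values and variable names are illustrative:
from array import array
s = "hello"                            # string
n = 42                                 # number (int); 3.14 would be a float
flag = True                            # Boolean
marks = [65, 72, 80]                   # list (mutable, ordered)
point = (3, 4)                         # tuple (immutable)
unique_ids = {1, 2, 2, 3}              # set -> {1, 2, 3}, duplicates removed
student = {"name": "Ravi", "age": 20}  # dictionary of key-value pairs
arr = array('i', [1, 2, 3])            # array of integers (array module)
# Type casting
print(int("10") + 5)                   # 15 (string cast to int)
print(float(n))                        # 42.0
print(list(point))                     # [3, 4]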
4. Python Operators
In this section on Python operators, we cover everything from performing basic arithmetic operations to evaluating complex logical expressions. We'll learn comparison operators for making decisions based on conditions, then explore bitwise operators for low-level manipulation of binary data. Additionally, we'll look at assignment operators for efficient variable assignment and updating, along with membership and identity operators; a short sketch follows the list below.
Arithmetic operators
Comparison Operators
Logical Operators
Bitwise Operators
Assignment Operators
Membership & Identity Operators - "in", and "is" operator
5. Python Conditional Statement
Python conditional statements are important in programming, enabling dynamic decision-making and code branching. In this section of the Python tutorial, we'll explore Python's conditional logic, from basic if...else statements to nested conditions and the concise ternary operator; a short sketch follows the list below.
If else
Nested if statement
if-elif-else Ladder
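The sketch below shows an if...else, a nested if, an if-elif-else ladder, and the ternary form; the marks value and grade boundaries are illustrative:
marks = 72
# if...else
if marks >= 40:
    print("Pass")
else:
    print("Fail")
# Nested if
if marks >= 40:
    if marks >= 75:
        print("Distinction")
# if-elif-else ladder
if marks >= 75:
    grade = "A"
elif marks >= 60:
    grade = "B"
elif marks >= 40:
    grade = "C"
else:
    grade = "F"
print("Grade:", grade)
# Ternary (conditional expression)
result = "Pass" if marks >= 40 else "Fail"
print(result)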
Introduction to pandas
Pandas is a powerful, open-source Python library used for data manipulation and analysis. It consists of data structures and functions for performing efficient operations on data. This section gives an overview of Pandas and covers its fundamentals.
What is Pandas Library in Python?
Pandas is a powerful and versatile library that simplifies the tasks of data manipulation in Python. Pandas is well-
suited for working with tabular data, such as spreadsheets or SQL tables.
The Pandas library is an essential tool for data analysts, scientists, and engineers working with structured data in
Python.
What is Python Pandas used for?
The Pandas library is widely used for data science because it works in conjunction with the other libraries that make up the data science ecosystem.
It is built on top of the NumPy library which means that a lot of the structures of NumPy are used or replicated in
Pandas.
The data produced by Pandas is often used as input for plotting functions in Matplotlib, statistical analysis in SciPy,
and machine learning algorithms in Scikit-learn.
Why should you use the Pandas library? Pandas is one of the best tools available in Python to analyze, clean, and manipulate data.
Here is a list of things that we can do using Pandas.
Data set cleaning, merging, and joining.
Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
Columns can be inserted and deleted from Data Frame and higher-dimensional objects.
Powerful group by functionality for performing split-apply-combine operations on data sets.
Data Visualization.
Getting Started with Pandas
Let’s see how to start working with the Python Pandas library:
Installing Pandas
The first step in working with Pandas is to check whether it is already installed on the system. If not, we need to install it using the pip command.
Follow these steps to install Pandas:
Step 1: Type ‘cmd’ in the search box and open it.
Step 2: Using the cd command, navigate to the folder where Python and pip are installed.
Step 3: After locating it, type the command:
pip install pandas
For more details, refer to the official Pandas installation documentation.
Importing Pandas
After Pandas has been installed on the system, you need to import the library. It is generally imported as follows:
import pandas as pd
Note: Here, pd is an alias for Pandas. It is not necessary to import the library using an alias; it just helps in writing less code every time a method or property is called.
Data Structures in Pandas Library
Pandas generally provide two data structures for manipulating data. They are:
Series
DataFrame
Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python
objects, etc.). The axis labels are collectively called indexes.
A Pandas Series is essentially a single column in an Excel sheet. Labels need not be unique but must be of a hashable type.
The object supports both integer and label-based indexing and provides a host of methods for performing operations
involving the index.
Creating a Series
Pandas Series is created by loading the datasets from existing storage (which can be a SQL database, a CSV file, or an
Excel file).
Pandas Series can be created from lists, dictionaries, scalar values, etc.
Example: Creating a series using the Pandas Library.
import pandas as pd
import numpy as np
# Creating empty series
ser = pd.Series()
print("Pandas Series: ", ser)
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print("Pandas Series:\n", ser)
Output
Pandas Series: Series([], dtype: float64)
Pandas Series:
0 g
1 e
2 e
3 k
4 s
dtype: object
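As mentioned above, a Series can also be created from a list, a dictionary, or a scalar value. Here is a small additional sketch; the labels and values are illustrative:
import pandas as pd
# From a list with custom index labels
ser_list = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(ser_list)
# From a dictionary: the keys become the index
ser_dict = pd.Series({"Jan": 105, "Feb": 95, "Mar": 105})
print(ser_dict)
# From a scalar: the value is repeated for every index label
ser_scalar = pd.Series(5, index=[0, 1, 2])
print(ser_scalar)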
Pandas DataFrame
Pandas DataFrame is a two-dimensional data structure with labeled axes (rows and columns).
Creating DataFrame
Pandas DataFrame is created by loading the datasets from existing storage (which can be a SQL database, a CSV file,
or an Excel file).
Pandas DataFrame can be created from lists, dictionaries, a list of dictionaries, etc.
Example: Creating a DataFrame Using the Pandas Library
import pandas as pd
# Calling DataFrame constructor
df = pd.DataFrame()
print(df)
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
Output:
Empty DataFrame
Columns: []
Index: []
0
0 Geeks
1 For
2 Geeks
3 is
4 portal
5 for
6 Geeks
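A DataFrame is more commonly built from a dictionary of equal-length lists, where each key becomes a column. Here is a small sketch; the column names and values are illustrative:
import pandas as pd
data = {
    "Name": ["Ravi", "Anu", "Kiran"],
    "Marks": [87, 92, 78]
}
df = pd.DataFrame(data)
print(df)                   # a 3-row, 2-column DataFrame
print(df["Marks"].mean())   # column access plus a simple aggregation (approx. 85.67)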
How to run the Pandas Program in Python?
The Pandas program can be run from any text editor, but it is recommended to use Jupyter Notebook for this, as
Jupyter gives you the ability to execute code in a particular cell rather than the entire file.
Jupyter also provides an easy way to visualize Pandas DataFrame and plots.
Jupyter notebook:
Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live
code, equations, visualizations, and narrative text. It is a popular tool among data scientists, researchers, and educators
for interactive computing and data analysis. The name "Jupyter" is derived from the three core programming
languages it originally supported: Julia, Python, and R.
What is Jupyter Notebook?
Jupyter Notebook gets its name from the three fundamental programming languages it initially supported: Julia, Python, and R. It now supports more than 40 programming languages, making it a flexible option for a range of computational tasks. Because the notebook interface is web-based, users interact with it through their web browsers.
Components of Jupyter Notebook
The Jupyter Notebook is made up of the three components listed below.
1. The notebook web application
It is an interactive web application that allows you to write and run code.
Users of the notebook web application can:
Edit code in the browser with automatic syntax highlighting and indentation.
Run code from the browser.
View the output of computations in rich media formats such as HTML, LaTeX, PNG, and PDF.
Create and use interactive JavaScript widgets.
Write mathematical formulas in Markdown cells.
2. Kernels
The independent processes launched by the notebook web application are known as kernels, and they are used to
execute user code in the specified language and return results to the notebook web application.
The following languages are available for the Jupyter Notebook kernel:
Python
R
Julia
Ruby
Scala
node.js
3. Notebook documents
Notebook documents contain all content viewable in the notebook web application, including the inputs and outputs of computations, text, mathematical equations, graphs, and images.
Types of cells in Jupyter Notebook
1. Code Cell: A code cell's contents are interpreted as statements in the current kernel's programming language. Code cells support Python by default because the default Jupyter kernel (IPython) is a Python kernel. When a code cell is executed, its output is shown below the code; the output can be text, an image, a Matplotlib plot, or a set of HTML tables.
2. Markdown Cell: Markdown cells provide documentation for the notebook and enhance its readability. These cells support full Markdown formatting, including bold and italic text, headers, ordered and unordered (bulleted) lists, hyperlinks, tables, and images, among others.
3. Raw NBConvert Cell: A raw cell lets you write content directly; the notebook kernel does not evaluate these cells, and their contents pass through unchanged when the notebook is converted with nbconvert.
4. Heading Cell: A separate heading cell type is no longer supported by Jupyter Notebook. When you choose "Heading" from the cell-type drop-down menu, a panel pops open advising you to write headings in Markdown cells instead.
Key features of Jupyter Notebook
Several programming languages are supported.
Integration of Markdown-formatted text.
Rich outputs, such as tables and charts, are displayed.
Flexibility in terms of language switching (kernels).
Installing and Running Jupyter Notebook
Step 1: Make sure Python and pip are installed, and upgrade pip if required:
python -m pip install --upgrade pip
Step 2: Install Jupyter Notebook using pip in the terminal:
pip install notebook
Step 3: Use the jupyter notebook command in the terminal to run the notebook:
jupyter notebook
After you type the command, the Jupyter home page should open in your default browser.
Common Types of Charts in Data Visualization:
1. Bar Chart
Use Case: Ideal for comparing discrete categories or groups.
Data Type: Categorical data.
Example: Displaying sales by region, product categories, etc.
2. Line Chart
Use Case: Great for visualizing trends over time or continuous data.
Data Type: Time series data or continuous variables.
Example: Showing stock price changes over a month.
3. Pie Chart
Use Case: Shows proportions of a whole.
Data Type: Categorical data where the parts represent portions of the total.
Example: Market share of different brands in an industry.
4. Histogram
Use Case: Used for showing the distribution of continuous data.
Data Type: Continuous data.
Example: Displaying the distribution of ages in a population.
5. Scatter Plot
Use Case: Best for showing relationships or correlations between two continuous variables.
Data Type: Continuous variables.
Example: Plotting height vs. weight to analyze correlation.
6. Box Plot
Use Case: Ideal for showing the distribution of data, highlighting outliers, and showing spread.
Data Type: Continuous data.
Example: Displaying the salary range in different departments.
7. Heatmap
Use Case: Shows data density or intensity across two variables (color-coded).
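As a rough illustration of how some of these chart types are produced in Python, here is a small Matplotlib sketch; the sample data and labels are made up:
import matplotlib.pyplot as plt
regions = ["North", "South", "East", "West"]
sales = [250, 310, 180, 220]
months = [1, 2, 3, 4, 5]
price = [100, 105, 95, 110, 120]
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].bar(regions, sales)              # bar chart: comparing discrete categories
axes[0].set_title("Sales by Region")
axes[1].plot(months, price, marker="o")  # line chart: trend over time
axes[1].set_title("Price over Months")
plt.tight_layout()
plt.show()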
MEASURES OF CENTRAL TENDENCY:
Consider the monthly salary of an employee recorded for five months:
Month      Salary
January    $105
February   $95
March      $105
April      $105
May        $100
Suppose we want to express the salary of the employee using a single value rather than 5 different values for 5 months. The value used to represent the salary data for these 5 months is referred to as a measure of central tendency. The three possible measures of central tendency for the above data are:
Mean: The mean salary can be used as one of the measures of central tendency, i.e., x̄ = (105 + 95 + 105 + 105 + 100)/5 = $102.
Mode: If we use the most frequently occurring value to represent the above data, i.e., $105, the measure of
central tendency would be mode.
Median: If we use the central value of the ordered set of salaries, given as $95, $100, $105, $105, $105, i.e., $105, then the measure of central tendency here would be the median.
We can use the following table for reference to check the best measure of central tendency suitable for a
particular type of variable:
Type of Variable                 Best Measure of Central Tendency
Nominal                          Mode
Ordinal                          Median
Interval/Ratio (not skewed)      Mean
Let us study the following measures of central tendency, their formulas, usage, and types in detail below.
Mean
Median
Mode
Mean as a Measure of Central Tendency
The mean (or arithmetic mean), often called the average, is probably the measure of central tendency you are most familiar with. The mean is simply the sum of all the components in a group or collection, divided by the number of components.
We generally denote the mean of a given data set by x̄, pronounced "x bar". The formulas to calculate the mean for ungrouped and grouped data are given as,
For a set of observations: Mean = Sum of the terms/Number of terms
For a set of grouped data: Mean, x̄ = Σfx/Σf
where,
x̄ = the mean value of the set of given data.
f = frequency of each class
x = mid-interval value of each class
Example: The weights of 8 boys in kilograms: 45, 39, 53, 45, 43, 48, 50, 45. Find the mean weight for the
given set of data.
Therefore, the mean weight of the group:
Mean = Sum of the weights/Number of boys
= (45 + 39 + 53 + 45 + 43 + 48 + 50 + 45)/8
= 368/8
= 46
Thus, the mean weight of the group is 46 kilograms.
When Not to Use the Mean as the Measure of Central Tendency?
Using the mean as the measure of central tendency has one major disadvantage: the mean is particularly sensitive to outliers. This is the case when some values in the data are unusually large or small compared to the rest of the data.
Median as a Measure of Central Tendency
The median, one of the measures of central tendency, is the middle-most observation of a data set after the data have been arranged in ascending order. The major advantage of using the median as a measure of central tendency is that it is less affected by outliers and skewed data. We can calculate the median for grouped or ungrouped data using the median formulas below.
For ungrouped data: For an odd number of observations, Median = [(n + 1)/2]th term. For an even number of observations, Median = [(n/2)th term + ((n/2) + 1)th term]/2
For grouped data: Median = l + [((n/2) - c)/f] × h
where,
l = Lower limit of the median class
c = Cumulative frequency of the class preceding the median class
h = Class size
n = Number of observations
Median class = Class where n/2 lies
Let us use the same example given above to find the median now.
Example: The weights of 8 boys in kilograms: 45, 39, 53, 45, 43, 48, 50, 45. Find the median.
Solution:
Arranging the given data set in ascending order: 39, 43, 45, 45, 45, 48, 50, 53
Total number of observations = 8
For an even number of observations, Median = [(n/2)th term + ((n/2) + 1)th term]/2
⇒ Median = (4th term + 5th term)/2 = (45 + 45)/2 = 45
Mode as a Measure of Central Tendency
Mode is one of the measures of central tendency, defined as the value that appears most often in the given data, i.e., the observation with the highest frequency. The mode for grouped or ungrouped data can be calculated using the mode formulas given below,
Mode for ungrouped data: Most recurring observation in the data set.
Mode for grouped data: Mode = L + h × (fm − f1) / [(fm − f1) + (fm − f2)]
where,
L is the lower limit of the modal class
h is the size of the class interval
fm is the frequency of the modal class
f1 is the frequency of the class preceding the modal class
f2 is the frequency of the class succeeding the modal class
Example: The weights of 8 boys in kilograms: 45, 39, 53, 45, 43, 48, 50, 45. Find the mode.
Solution:
The mode is the most frequently occurring observation in the given set:
Mode = 45
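The three worked examples above can be checked with Python's built-in statistics module; this is a minimal sketch using the same weights:
import statistics
weights = [45, 39, 53, 45, 43, 48, 50, 45]
print(statistics.mean(weights))    # 46
print(statistics.median(weights))  # 45.0
print(statistics.mode(weights))    # 45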
Empirical Relation Between Measures of Central Tendency
The three measures of central tendency, i.e., mean, median, and mode, are closely connected by the following relation (called the empirical relationship).
2Mean + Mode = 3Median
For instance, if we are asked to calculate the mean, median, and mode of continuous grouped data, then we
can calculate mean and median using the formulae as discussed in the previous sections and then find mode
using the empirical relation.
Example: The median and mode for a given data set are 56 and 54 respectively. Find the approximate
value of the mean for this data set.
2Mean + Mode = 3Median
2Mean = 3Median - Mode
2Mean = 3 × 56 - 54
2Mean = 168 - 54 = 114
Mean = 57
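The same arithmetic can be checked with a couple of lines of Python; the values 56 and 54 come from the example above:
median, mode = 56, 54
mean = (3 * median - mode) / 2
print(mean)   # 57.0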
Measures of Central Tendency and Type of Distribution
Any data set is a distribution of 'n' observations. The best measure of central tendency for a given data set depends on the type of distribution. Some types of distributions in statistics are given as,
Normal Distribution
Skewed Distribution
Let us understand how the type of distribution can affect the values of different measures of central tendency.
Measures of Central Tendency for Normal Distribution
Here is the frequency distribution table for a set of data:
Observation 6 9 12 15 18 21
Frequency 5 10 15 10 5 0
Plotting a histogram for the above data gives a symmetrical distribution. Finding the mean, median, and mode for this
data-set, we observe that the three measures of central tendency mean, median, and mode are all located in the
center of the distribution graph. Thus, we can infer that in a perfectly symmetrical distribution, the mean and
the median are the same. The above example has one mode, i.e., it is a unimodal set, and therefore the mode is the same as the mean and median. In a symmetrical distribution that has two modes, i.e., a bimodal set, the two modes would be different from the mean and median.
Measures of Central Tendency for Skewed Distribution
For skewed distributions, if the distribution of data is skewed to the left, the mean is less than the median,
which is often less than the mode. If the distribution of data is skewed to the right, then the mode is often less
than the median, which is less than the mean. Let us understand each case using different examples.
Measures of Central Tendency for Right-Skewed Distribution
Consider the following data-set and plot the histogram for the same to check the type of distribution.
Observation 6 9 12 15 18 21
Frequency 17 19 8 5 3 2
The given data set is an example of a right, or positively, skewed distribution. Calculating the three measures of central tendency, we find mean = 10, median = 9, and mode = 9. We therefore infer that if the distribution of data is skewed to the right, then the mode is less than the mean, and the median generally lies between the values of the mode and the mean.
Measures of Central Tendency for Left-Skewed Distribution
Consider the following data-set and plot the histogram for the same to check the type of distribution.
Observation 6 9 12 15 18 21
Frequency 2 13 5 10 15 19
The given data set is an example of a left, or negatively, skewed distribution. Calculating the three measures of central tendency, we find mean = 15.75, median = 18, and mode = 21. We therefore infer that if the distribution of data is skewed to the left, then the mode is greater than the median, which is greater than the mean.
The above observations can be summarized as follows.
The three most common measures of central tendency are mean, median, and mode.
Mean is simply the sum of all the components in a group or collection, divided by the number of components.
The value of the middle-most observation obtained after arranging the data in ascending order is called the
median of the data.
The value which appears most often in the given data i.e. the observation with the highest frequency is called
the mode of data.
The three measures of central tendency i.e. mean, median and mode are closely connected by the following
relations (called an empirical relationship): 2Mean + Mode = 3Median
DISPERSION:
Measures of Dispersion are used to represent the scattering of data. These are the numbers that show the
various aspects of the data spread across various parameters.
Let’s learn about the measures of dispersion in statistics, their types, formulas, and examples in detail.
Dispersion in Statistics
Dispersion in statistics is a way to describe how spread out or scattered the data is around an average value. It
helps to understand if the data points are close together or far apart.
Dispersion shows the variability or consistency in a set of data. There are different measures of dispersion like
range, variance, and standard deviation.
Measure of Dispersion in Statistics
Measures of dispersion quantify the scattering of the data; they tell us how the values are distributed in the data set. In statistics, measures of dispersion are the parameters used to describe how spread out the values of the data are.
These measures of dispersion capture variation between different values of the data.
Types of Measures of Dispersion
Measures of dispersion can be classified into the following two types:
Absolute Measure of Dispersion
Relative Measure of Dispersion
These measures of dispersion can be further divided into various categories. Absolute measures are expressed in the same units as the original data, whereas relative measures are unit-free ratios.
RANGE:
In statistics, the range refers to the difference between the highest and lowest values in a dataset. It provides a simple measure of the spread or dispersion of the data: calculating the range involves subtracting the minimum value from the maximum value.
Range is a fundamental statistical concept that helps us understand the spread or variability of data within a
dataset. Range in Statistics provides valuable insights into the extent of variation among the values in a dataset. Range
quantifies the difference between the highest and lowest values in the dataset.
We can use the following steps to calculate the range:
Identify the maximum value (the largest value) in your dataset.
Identify the minimum value (the smallest value) in your dataset.
Subtract the minimum value from the maximum value to find the range.
Range = Maximum Value − Minimum Value
Example : Consider a dataset of exam scores for a class:
Scores: 85, 92, 78, 96, 64, 89, 75. Find the range.
Solution:
Maximum Value = 96
Minimum Value = 64
Range = 96 - 64 = 32
So, the range of the exam scores is 32.
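These steps translate directly into Python; the sketch below reuses the exam scores from the example above:
scores = [85, 92, 78, 96, 64, 89, 75]
data_range = max(scores) - min(scores)
print(data_range)   # 32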
Advantages
1. Easy to understand: The concept of range is simple and easy to grasp for people unfamiliar with statistics.
It's essentially the difference between the highest and lowest values in a dataset, making it intuitive.
2. Quick to calculate: Computing the range involves only finding the maximum and minimum values in the
dataset and subtracting them, making it a fast measure to calculate.
3. Provides a basic measure of variability: Despite its simplicity, the range gives a basic indication of the spread
or variability of the data. A larger range suggests greater variability, while a smaller range suggests less
variability.
Disadvantages
1. Sensitivity to outliers: The range is heavily influenced by extreme values (outliers) in the dataset. A single
outlier can greatly inflate the range, potentially giving a misleading picture of the variability of the majority
of the data.
2. Does not consider distribution: The range does not take into account the distribution of values within the
dataset. Two datasets with the same range can have very different distributions, leading to different
interpretations of variability.
3. Limited information: While the range provides a basic measure of variability, it does not provide any
information about the distribution's shape or central tendency. Other measures such as the interquartile
range, variance, or standard deviation offer more comprehensive insights into the dataset's characteristics.
4. Sample size dependency: The range does not account for sample size, so datasets with different sample sizes
may have similar ranges even if their variability differs significantly. This can lead to misinterpretations,
especially when comparing datasets of different sizes.
VARIANCE:
Variance is a measure of how the data is spread around the mean, or average value, of the data set. It describes the distribution of data in the dataset and how much the values differ from the mean. The symbol used for variance is σ², which is the square of the standard deviation. There are two types of variance used in statistics:
Sample Variance
Population Variance
Population Variance
Population variance measures how the data points in an entire population are spread out around the population mean, while sample variance estimates this spread from a sample drawn out of the population. In this section, we will learn about sample and population variance, their formulas, and their properties in detail.
Population Variance Formula
The formula for population variance is written as,
σ² = Σ(xᵢ – x̄)²/n
where,
x̄ is the mean of population data set
n is the total number of observations
Population variance is mainly used when the entire population’s data is available for analysis.
Sample Variance
If the population is very large, it becomes difficult to calculate the population variance of the data set. In that case, we take a sample from the given data and find the variance of that sample, which is called the sample variance. While calculating it, we make sure to use the sample mean, i.e., the mean of the sample data set, not the population mean. Sample variance is defined from the squared differences between each sample data point and the sample mean, with the sum divided by n − 1.
Sample Variance Formula
The formula for sample variance is given by,
s² = Σ(xᵢ – x̄)²/(n – 1)
where,
x̄ is the mean of the sample data set
n is the total number of observations
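Both variances can be computed with Python's built-in statistics module, which separates population variance from sample variance. The sketch below reuses the five salary values from the earlier central-tendency example:
import statistics
data = [105, 95, 105, 105, 100]
print(statistics.pvariance(data))   # population variance: sum of squared deviations / n = 80/5 = 16
print(statistics.variance(data))    # sample variance: sum of squared deviations / (n - 1) = 80/4 = 20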
Absolute Measure of Dispersion     Related Formula
Mean Deviation                     Mean Deviation = Σ|x – a|/n, where a is the central value (mean, median, or mode) and n is the number of observations
Quartile Deviation                 Quartile Deviation = (Q3 – Q1)/2, where Q3 is the third quartile and Q1 is the first quartile
Coefficient of Dispersion
Coefficients of dispersion are calculated when comparing two series that differ greatly in their averages, or that are measured in different units. The coefficient of dispersion is denoted by C.D.
Coefficient of Mean Deviation = (Mean Deviation)/μ, where μ is the central value about which the mean deviation is calculated.
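As a rough sketch, the mean deviation, quartile deviation, and coefficient of mean deviation can be computed with NumPy; the data values are illustrative, and np.percentile is used here to obtain the quartiles:
import numpy as np
data = np.array([85, 92, 78, 96, 64, 89, 75])
mean = data.mean()
mean_deviation = np.abs(data - mean).mean()   # mean deviation about the mean: Σ|x - a|/n with a = mean
q1, q3 = np.percentile(data, [25, 75])
quartile_deviation = (q3 - q1) / 2            # (Q3 - Q1)/2
coeff_mean_deviation = mean_deviation / mean  # (Mean Deviation)/μ with μ = mean
print(mean_deviation, quartile_deviation, coeff_mean_deviation)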