
CSE(DATA SCIENCE) R20 (DAV)

Unit-1: Introduction to Data Analytics


Data and its importance
Data is a word we hear everywhere nowadays. In general, data is a collection of facts, information, and statistics and
this can be in various forms such as numbers, text, sound, images, or any other format.
According to the Oxford dictionary, “Data is distinct pieces of information, usually formatted in a special way.” Data can be
measured, collected, reported, and analyzed, whereupon it is often visualized using graphs, images, or other analysis
tools. Raw data (“unprocessed data”) may be a collection of numbers or characters before it has been “cleaned” and
corrected by researchers. It must be corrected to remove outliers, instrument errors, or data-entry errors. Data
processing commonly occurs in stages, and therefore the “processed data” from one stage could also be considered the
“raw data” of subsequent stages. Field data is data that’s collected in an uncontrolled “in situ” environment.
Experimental data is the data that is generated within the observation of scientific investigations. Data can be
generated by:
 Humans
 Machines
 Human-Machine combines.
Data can be generated anywhere information is created, and it is stored in structured or unstructured formats.
What is Information?
Information is data that has been processed, organized, or structured in a way that makes it meaningful, valuable and
useful. It is data that has been given context, relevance and purpose. It gives knowledge, understanding and insights
that can be used for decision-making, problem-solving, communication and various other purposes.
Categories of Data
Data can be categorized into two main parts:
 Structured Data: This type of data is organized into a specific format, making it easy to search, analyze,
and process. Structured data is found in relational databases and includes information like numbers, dates, and
categories.
 Unstructured Data: Unstructured data does not conform to a specific structure or format. It may include
text documents, images, videos, and other data that is not easily organized or analyzed without
additional processing.
Types of Data
Generally, data can be classified into two parts:
1. Categorical Data: Categorical data consists of values that fall into a defined category, for example:
 Marital Status
 Political Party
 Eye colour
2. Numerical Data: Numerical data can further be classified into two categories:
 Discrete Data: Discrete data takes distinct numerical values, for example the
number of children or defects per hour.
 Continuous Data: Continuous data takes values from a continuous range, for
example weight or voltage.
3. Nominal Scale: A nominal scale classifies data into several distinct categories in which no ranking is
implied. For example: gender, marital status.

4. Ordinal Scale: An ordinal scale classifies data into distinct categories in which a ranking is implied. For
example:
 Faculty rank: Professor, Associate Professor, Assistant Professor
 Student grades: A, B, C, D, E, F
5. Interval Scale: An interval scale is an ordered scale in which the difference between measurements
is a meaningful quantity, but the measurements do not have a true zero point. For example:
 Temperature in Fahrenheit and Celsius
 Years
6. Ratio Scale: A ratio scale is an ordered scale in which the difference between measurements is a
meaningful quantity and the measurements have a true zero point. Hence, we can perform arithmetic
operations on ratio-scale data. For example: weight, age, salary, etc.

Importance of data:
1. Informed Decision-Making
 Data allows individuals, businesses, and organizations to make decisions based on facts and insights rather
than intuition or guesswork. By analyzing relevant data, decision-makers can understand trends, patterns, and
correlations that guide strategic choices.
2. Improving Efficiency and Productivity
 Data-driven approaches enable businesses and organizations to streamline operations, identify bottlenecks,
and optimize workflows. This can lead to improved productivity, reduced waste, and better resource
management.
3. Innovation and Research
 Data is at the core of scientific research, product development, and innovation. Researchers use data to test
hypotheses, validate theories, and make new discoveries. In industries such as technology and healthcare, data
is critical for developing new products, services, and treatments.
4. Customer Insights
 In business, customer data is crucial for understanding consumer behavior, preferences, and trends. This
allows companies to tailor their marketing strategies, improve customer service, and develop products that
better meet customer needs.
5. Competitive Advantage
 Organizations that effectively gather, analyze, and leverage data often have a competitive advantage. Data can
provide insights into market conditions, customer needs, and competitor activities, enabling organizations to
stay ahead in a rapidly changing environment.
6. Risk Management
 Data helps identify, assess, and mitigate risks. By analyzing historical data, companies can anticipate potential
problems, track performance indicators, and make proactive decisions to minimize risks and costs.
7. Personalization
 Data enables businesses to personalize experiences for customers. For example, websites and apps use data
about user behavior to offer customized recommendations, targeted advertising, and personalized content.
8. Predictive Analysis
 Data is often used in predictive analytics to forecast future trends. This is particularly useful in industries such
as finance, marketing, and healthcare, where predicting future events can inform strategy and planning.
9. Compliance and Reporting
 In many industries, data is essential for regulatory compliance. Organizations need accurate data for reporting
purposes, ensuring that they meet legal, financial, and environmental standards.
10. Performance Measurement
 Data enables organizations to track key performance indicators (KPIs) and assess how well they are meeting
their objectives. This is important for monitoring progress, making adjustments, and ensuring that goals are
being achieved.
11. Automation and AI
 Data is foundational to automation, machine learning, and artificial intelligence. Algorithms rely on large
datasets to train models, recognize patterns, and make decisions autonomously, leading to improved efficiency
and decision-making.
12. Global Connectivity and Communication
 With the advent of the internet and digital technology, data flows across borders, enabling global connectivity,
communication, and collaboration. It plays a central role in connecting people, businesses, and governments
worldwide.

DATA ANALYTICS AND ITS TYPES:


Data analytics is an important field that involves the process of collecting, processing, and interpreting data to
uncover insights and help in making decisions. Data analytics is the practice of examining raw data to identify trends,
draw conclusions, and extract meaningful information. This involves various techniques and tools to process and
transform data into valuable insights that can be used for decision-making.
What is Data Analytics?
In this new digital world, data is being generated in enormous amounts, which opens new paradigms. With
high computing power and large amounts of data available, we can use this data to make data-driven decisions.
The main benefit of data-driven decisions is that they are grounded in observed past trends that have led to
successful outcomes.
In short, we can say that data analytics is the process of manipulating data to extract useful trends and hidden patterns
that can help us derive valuable insights to make business predictions.
Understanding Data Analytics
Data analytics encompasses a wide array of techniques for analyzing data to gain valuable insights that can enhance
various aspects of operations. By scrutinizing information, businesses can uncover patterns and metrics that might
otherwise go unnoticed, enabling them to optimize processes and improve overall efficiency.
For instance, in manufacturing, companies collect data on machine runtime, downtime, and work queues to analyze
and improve workload planning, ensuring machines operate at optimal levels.
Beyond production optimization, data analytics is utilized in diverse sectors. Gaming firms utilize it to design reward
systems that engage players effectively, while content providers leverage analytics to optimize content placement and
presentation, ultimately driving user engagement.
Types of Data Analytics
There are four major types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics

Data Analytics and its Types


Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses data to determine the
probable outcome of an event or the likelihood of a situation occurring. Predictive analytics draws on a variety of statistical
techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts
to make predictions about future events. Techniques used for predictive analytics are:
 Linear Regression
 Time Series Analysis and Forecasting
 Data Mining
Basic Cornerstones of Predictive Analytics
 Predictive modeling
 Decision Analysis and optimization
 Transaction profiling
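The predictive idea can be sketched in a few lines of Python. The monthly sales figures below are purely illustrative (not from the text); a simple linear regression fits a trend line to historical data and extrapolates it:

```python
import numpy as np

# Hypothetical monthly sales figures (illustrative only)
months = np.array([1, 2, 3, 4, 5, 6])
sales = np.array([100, 120, 138, 160, 181, 199])

# Fit a straight line (simple linear regression) to the historical data
slope, intercept = np.polyfit(months, sales, 1)

# Extrapolate the fitted trend to predict sales for month 7
prediction = slope * 7 + intercept
print(round(prediction, 1))  # about 219.7
```

In practice, time-series methods also account for seasonality and noise; a straight-line fit is only the simplest case.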
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It examines
past performance by mining historical data to understand the causes of past success or
failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of
analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects
into groups. Unlike a predictive model that focuses on predicting the behavior of a single customer, Descriptive
analytics identifies many different relationships between customer and product.
Common examples of Descriptive analytics are company reports that provide historic reviews like:
 Data Queries
 Reports
 Descriptive Statistics
 Data dashboard
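As a minimal sketch of descriptive statistics, the summary figures behind such reports can be computed with Python's standard library (the sales numbers are illustrative):

```python
import statistics

# Hypothetical figures from a historical sales report (illustrative only)
sales = [100, 120, 138, 160, 181, 199]

# Descriptive statistics summarize what has already happened
print("mean:", round(statistics.mean(sales), 2))    # central tendency
print("median:", statistics.median(sales))          # middle value
print("stdev:", round(statistics.stdev(sales), 1))  # spread around the mean
```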
Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and machine learning to
make a prediction and then suggests decision options to take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that benefit from the predictions
and showing the decision maker the implications of each decision option. Prescriptive analytics not only anticipates
what will happen and when it will happen, but also why it will happen. Further, prescriptive analytics can suggest decision
options for taking advantage of a future opportunity or mitigating a future risk, and illustrate the implications of each
decision option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by using analytics to leverage
operational and usage data combined with data of external factors such as economic data, population demography, etc.
Diagnostic Analytics
In this analysis, we generally use historical data to answer a question or solve a
problem. We try to find dependencies and patterns in the historical data of the particular problem.
For example, companies use this analysis because it gives great insight into a problem, provided they keep detailed
information at their disposal; otherwise, data collection may have to be repeated for every individual problem, which is
very time-consuming. Common techniques used for diagnostic analytics are:
 Data discovery
 Data mining
 Correlations
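A correlation check, one of the techniques listed above, can be sketched as follows (the advertising and sales numbers are invented for illustration):

```python
import numpy as np

# Hypothetical historical data: ad spend vs. sales (illustrative only)
ad_spend = [10, 15, 20, 25, 30]
sales = [110, 135, 165, 185, 210]

# Pearson correlation coefficient: a value close to +1 suggests a strong
# positive relationship worth investigating as a possible explanation
r = np.corrcoef(ad_spend, sales)[0, 1]
print(round(r, 3))
```

Note that a strong correlation alone does not prove causation; it only points the diagnosis in a direction.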
The Role of Data Analytics
Data analytics plays a pivotal role in enhancing operations, efficiency, and performance across various industries by
uncovering valuable patterns and insights. Implementing data analytics techniques can provide companies with a
competitive advantage. The process typically involves four fundamental steps:
 Data Mining : This step involves gathering data and information from diverse sources and transforming them
into a standardized format for subsequent analysis. Data mining can be a time-intensive process compared to
other steps but is crucial for obtaining a comprehensive dataset.
 Data Management : Once collected, data needs to be stored, managed, and made accessible. Creating a
database is essential for managing the vast amounts of information collected during the mining process. SQL
(Structured Query Language) remains a widely used tool for database management, facilitating efficient
querying and analysis of relational databases.
 Statistical Analysis : In this step, the gathered data is subjected to statistical analysis to identify trends and
patterns. Statistical modeling is used to interpret the data and make predictions about future trends. Open-
source programming languages like Python, as well as specialized tools like R, are commonly used for
statistical analysis and graphical modeling.
 Data Presentation : The insights derived from data analytics need to be effectively communicated to
stakeholders. This final step involves formatting the results in a manner that is accessible and understandable
to various stakeholders, including decision-makers, analysts, and shareholders. Clear and concise data
presentation is essential for driving informed decision-making and driving business growth.
Steps in Data Analysis
 Define Data Requirements : This involves determining how the data will be grouped or categorized. Data
can be segmented based on various factors such as age, demographic, income, or gender, and can consist of
numerical values or categorical data.
 Data Collection : Data is gathered from different sources, including computers, online platforms, cameras,
environmental sensors, or through human personnel.
 Data Organization : Once collected, the data needs to be organized in a structured format to facilitate
analysis. This could involve using spreadsheets or specialized software designed for managing and analyzing
statistical data.
 Data Cleaning : Before analysis, the data undergoes a cleaning process to ensure accuracy and reliability.
This involves identifying and removing any duplicate or erroneous entries, as well as addressing any missing
or incomplete data. Cleaning the data helps to mitigate potential biases and errors that could affect the analysis
results.
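The cleaning step can be sketched with pandas; the tiny dataset below (names and ages, with a duplicated row and a missing value) is invented for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with a duplicated row and a missing age (illustrative)
raw = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meena"],
    "age":  [25.0, 30.0, 30.0, np.nan],
})

clean = raw.drop_duplicates()                       # remove the repeated entry
clean = clean.fillna({"age": clean["age"].mean()})  # impute the missing age with the mean
print(len(clean))  # 3 rows remain
```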
Usage of Data Analytics
There are some key domains and strategic planning techniques in which Data Analytics has played a vital role:
 Improved Decision-Making – If we have data supporting a decision, we can implement it with a higher
probability of success. For example, if a certain decision or plan has led to better
outcomes, there will be no doubt about implementing it again.
 Better Customer Service – Churn modeling is the best example of this, in which we try to predict or identify
what leads to customer churn and change those things accordingly, so that customer attrition is as
low as possible, which is a most important factor in any organization.
 Efficient Operations – Data analytics can help us understand the demands of a situation and what
should be done to get better results; we can then streamline our processes, which in turn leads to
efficient operations.
 Effective Marketing – Market segmentation techniques are implemented to find the marketing
approaches that will help us increase sales, leading to effective marketing strategies.
Future Scope of Data Analytics
 Retail : To study sales patterns, consumer behavior, and inventory management, data analytics can be applied
in the retail sector. Data analytics can be used by retailers to make data-driven decisions regarding what
products to stock, how to price them, and how to best organize their stores.
 Healthcare : Data analytics can be used to evaluate patient data, spot trends in patient health, and create
individualized treatment regimens. Data analytics can be used by healthcare companies to enhance patient
outcomes and lower healthcare expenditures.
 Finance : In the field of finance, data analytics can be used to evaluate investment data, spot trends in the
financial markets, and make wise investment decisions. Data analytics can be used by financial institutions to
lower risk and boost the performance of investment portfolios.
 Marketing : By analyzing customer data, spotting trends in consumer behavior, and creating customized
marketing strategies, data analytics can be used in marketing. Data analytics can be used by marketers to boost
the efficiency of their campaigns and their overall impact.
 Manufacturing : Data analytics can be used to examine production data, spot trends in production methods,
and boost production efficiency in the manufacturing sector. Data analytics can be used by manufacturers to
cut costs and enhance product quality.

 Transportation : To evaluate logistics data, spot trends in transportation routes, and improve transportation
routes, the transportation sector can employ data analytics. Data analytics can help transportation businesses
cut expenses and speed up delivery times.

The Importance of Data Analysis


Data analysis is pivotal in various aspects of modern life and business. Here’s a detailed look at why it is so
crucial:
Informed Decision-Making
One of the primary reasons data analysis is important is its role in informed decision-making. By analyzing
data, organizations can:
 Understand Performance: Analyze sales figures, customer feedback, and operational metrics to gauge
performance and make strategic decisions.
 Predict Outcomes: Use historical data to forecast future trends, such as sales projections or market demand.
 Evaluate Strategies: Assess the effectiveness of marketing campaigns, business strategies, and operational
changes.
For example, a retail company might analyze customer purchasing patterns to optimize inventory levels and
improve sales strategies.
Improving Business Efficiency
Data analysis helps businesses streamline operations and reduce costs by:
 Identifying Inefficiencies: Detect inefficiencies in processes and workflows through data analysis.
 Optimizing Resources: Allocate resources more effectively based on data-driven insights.
 Enhancing Productivity: Implement process improvements and performance metrics based on data findings.
A manufacturing company, for instance, might use data analysis to optimize its supply chain, leading to
reduced costs and improved production efficiency.
Identifying Market Trends
Businesses and organizations use data analysis to stay ahead of market trends:
 Market Research: Analyze market data to identify emerging trends, customer preferences, and competitive
landscapes.
 Trend Analysis: Use data to understand shifts in consumer behavior and adapt strategies accordingly.
A fashion retailer might analyze sales data and social media trends to predict upcoming fashion trends and
adjust their product offerings.
Enhancing Customer Experience
Data analysis is crucial for improving customer experiences:
 Personalization: Analyze customer data to offer personalized products, services, and recommendations.
 Feedback Analysis: Use data from customer feedback and reviews to improve service quality and address
issues.
For example, an e-commerce platform might analyze user behavior to offer personalized product
recommendations and improve user experience.
Driving Innovation
Data analysis fuels innovation by:
 Identifying Opportunities: Spot new business opportunities and areas for improvement.

 Supporting R&D: Guide research and development efforts with data-driven insights.
Tech companies often rely on data analysis to drive innovation in product development, such as creating new
features based on user feedback and usage patterns.
Supporting Evidence-Based Research
In academic and scientific fields, data analysis supports:
 Research Validation: Test hypotheses and validate research findings with statistical methods.
 Publication of Findings: Present data-driven evidence in research papers and studies.
Researchers in fields like epidemiology use data analysis to study disease patterns and evaluate public health
interventions.
Mitigating Risks
Data analysis helps in identifying and managing risks:
 Risk Assessment: Analyze potential risks and vulnerabilities in business operations.
 Fraud Detection: Use data analysis techniques to detect fraudulent activities.
Financial institutions, for instance, use data analysis to identify suspicious transactions and prevent fraud.
Fostering Competitive Advantage
Data analysis provides a competitive edge by:
 Benchmarking: Compare performance against competitors and industry standards.
 Strategic Planning: Develop strategic plans based on data-driven insights and competitive analysis.
Businesses that leverage data analysis can gain a competitive advantage by making better strategic decisions
and staying ahead of market trends.

Applications of Data Analysis


Data analysis has diverse applications across various sectors:
Business and Marketing
 Customer Segmentation: Identify different customer groups for targeted marketing strategies.
 Campaign Effectiveness: Measure the success of marketing campaigns and adjust tactics as needed.
Healthcare
 Patient Care: Analyze patient data for better diagnosis and treatment plans.
 Public Health: Study health trends and manage public health initiatives.
Finance
 Investment Decisions: Analyze market trends and financial data for investment opportunities.
 Risk Management: Assess financial risks and develop strategies for risk mitigation.
Education
 Student Performance: Analyze student performance data to improve educational outcomes.
 Curriculum Development: Use data to develop and refine educational programs.
Government and Public Policy
 Policy Analysis: Evaluate the impact of public policies and programs.

 Resource Allocation: Use data to allocate resources effectively in public services.


Challenges in Data Analysis
Despite its importance, data analysis faces several challenges:
Data Quality Issues
 Incomplete Data: Missing or incomplete data can lead to inaccurate analysis.
 Data Accuracy: Ensuring the accuracy and reliability of data sources.
Data Privacy Concerns
 Confidentiality: Protecting sensitive data from unauthorized access and breaches.
 Compliance: Adhering to data protection regulations and standards.
Complexity of Data Management
 Data Integration: Combining data from various sources can be complex and time-consuming.
 Data Storage: Managing large volumes of data and ensuring efficient storage solutions.
Skills and Expertise Requirements
 Technical Skills: Requires expertise in statistical analysis, data management, and programming.
 Interpretation Skills: Analyzing data is not enough; interpreting results accurately is crucial.

Future Trends in Data Analysis


The field of data analysis is evolving with several emerging trends:
Artificial Intelligence and Machine Learning
 Advanced Analytics: AI and ML techniques are used for more sophisticated data analysis and predictive
modeling.
 Automation: Automating repetitive tasks and processes through AI-driven tools.
Big Data Analytics
 Scalability: Handling and analyzing vast amounts of data from diverse sources.
 Advanced Techniques: Utilizing advanced algorithms for big data processing and analysis.
Real-Time Data Processing
 Immediate Insights: Analyzing data in real-time for timely decision-making.
 Applications: Real-time data analysis is used in finance, transportation, and social media.
Increased Emphasis on Data Ethics
 Ethical Practices: Ensuring ethical practices in data collection, analysis, and usage.
 Transparency: Promoting transparency and accountability in data analysis processes.

Python language basics –
Python is one of the most popular programming languages today, known for its simplicity and extensive features. Its
clean and straightforward syntax makes it beginner-friendly, while its powerful libraries and frameworks make it
ideal for developers.
 Python is a high-level, interpreted language with easy-to-read syntax.
 Python is used in various fields like web development, data science, artificial intelligence and automation.
First Python Program to Learn Python Programming
Here is a simple Python code, printing a string. We recommend you to edit the code and try to print your own name.
Python

# Python Program to print a sample string
print("Welcome to Python Tutorial")

Output
Welcome to Python Tutorial
1. Getting Started with Python Programming
Welcome to the getting started with Python programming section! Here, we'll cover the essential topics you need to
kickstart your journey in Python programming. From syntax and keywords to comments, variables, and indentation.
2. Input/Output
In this section of the Python guide, we will learn input and output operations in Python. These are crucial for interacting with
users and processing data effectively, from printing a simple line using the print() function to exploring advanced
formatting techniques and efficient methods for receiving user input.
 Print output using print() function
 Print without new line
 sep parameter in print()
 Output Formatting
 Taking Input in Python
 Taking Multiple Inputs from users
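The output techniques listed above can be sketched briefly (the name and score are placeholder values):

```python
# sep controls the separator between print() arguments
print("Data", "Analytics", "Unit-1", sep=" | ")  # Data | Analytics | Unit-1

# end replaces the default trailing newline
print("no newline", end=" ")
print("here")  # both fragments appear on one line

# output formatting with an f-string
name, score = "Asha", 92.5
line = f"{name} scored {score:.1f}%"
print(line)  # Asha scored 92.5%
```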
3. Python Data Types
Python offers versatile collections of data types, including lists, strings, tuples, sets, dictionaries, and arrays. In this
section, we will learn about each data type in detail.


Data Types
 Strings
 Numbers
 Boolean
 Python List
 Python Tuples
 Python Sets
 Python Dictionary
 Python Arrays
 Type Casting
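Type casting, the last item above, converts values between these built-in types; a minimal sketch:

```python
# Casting between built-in types
n = int("42")        # string -> int
f = float(n)         # int -> float
s = str(f)           # float -> string
chars = list("abc")  # string -> list of characters
print(n, f, s, chars)  # 42 42.0 42.0 ['a', 'b', 'c']
```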
4. Python Operators
In this section on Python operators, we will cover everything from performing basic arithmetic operations to evaluating complex
logical expressions. We'll learn comparison operators for making decisions based on conditions, and then explore
bitwise operators for low-level manipulation of binary data. Additionally, we'll understand assignment operators for
efficient variable assignment and updating, as well as membership and identity operators.
 Arithmetic operators
 Comparison Operators
 Logical Operators
 Bitwise Operators
 Assignment Operators
 Membership & Identity Operators - "in", and "is" operator
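Each operator family above can be demonstrated in a few lines:

```python
a, b = 10, 3

# Arithmetic: floor division, modulus, exponent
print(a // b, a % b, a ** b)  # 3 1 1000

# Comparison combined with a logical operator
print(a > b and b > 0)        # True

# Bitwise: 1010 & 0011 = 0010 (2), 1010 | 0011 = 1011 (11)
print(a & b, a | b)           # 2 11

# Membership operator
nums = [1, 2, 3]
print(2 in nums)              # True
```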
5. Python Conditional Statement
Python conditional statements are important in programming, enabling dynamic decision-making and code branching.
In this section of the Python tutorial, we'll explore Python's conditional logic, from basic if...else statements to nested
conditions and the concise ternary operator.
 If else
 Nested if statement
 if-elif-else Ladder

 If Else on One Line


 Ternary Condition
 Match Case Statement
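The ladder and ternary forms above can be sketched together (the marks value is a placeholder):

```python
marks = 72

# if-elif-else ladder assigning a grade
if marks >= 90:
    grade = "A"
elif marks >= 75:
    grade = "B"
elif marks >= 60:
    grade = "C"
else:
    grade = "D"

# The same kind of decision written as a one-line ternary expression
status = "Pass" if marks >= 40 else "Fail"
print(grade, status)  # C Pass
```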
6. Python Loops
Here, we'll explore Python loop constructs, including the for and while loops, along with essential loop control
statements like break, continue, and pass. Additionally, we'll uncover the concise elegance of list and dictionary
comprehensions for efficient data manipulation. By mastering these loop techniques, you'll streamline your code for
improved readability and performance.
 For Loop
 While Loop
 Loop control statements (break, continue, pass)
 List Comprehension
 Dictionary Comprehension
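The loop constructs and comprehensions listed above can be sketched as:

```python
# for loop with continue (skip) and break (stop early)
total = 0
for n in range(1, 10):
    if n % 2 == 0:
        continue      # skip even numbers
    if n > 7:
        break         # stop once past 7
    total += n        # accumulates 1 + 3 + 5 + 7
print(total)          # 16

# list and dictionary comprehensions
squares = [n * n for n in range(5)]
square_map = {n: n * n for n in range(3)}
print(squares, square_map)  # [0, 1, 4, 9, 16] {0: 0, 1: 1, 2: 4}
```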
7. Python Functions
Python Functions are the backbone of organized and efficient code in Python. Here, in this section of Python 3 tutorial
we'll explore their syntax, parameter handling, return values, and variable scope. From basic concepts to advanced
techniques like closures and decorators. Along the way, we'll also introduce versatile functions like range(), and
powerful tools such as *args and **kwargs for flexible parameter handling. Additionally, we'll delve into functional
programming with map, filter, and lambda functions.
 Python Function Global and Local Scope Variables
 Use of pass Statement in Function
 Return statement in Python Function
 Python range() function
 *args and **kwargs in Python Function
 Python closures
 Python ‘Self’ as Default Argument
 Decorators in Python
 Map Function
 Filter Function
 Reduce Function
 Lambda Function
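A few of the function features above, sketched together (the summarize function is a made-up example):

```python
# *args collects extra positional arguments; **kwargs collects keyword arguments
def summarize(*args, **kwargs):
    return sum(args), kwargs.get("label", "total")

value, label = summarize(1, 2, 3, label="score")
print(label, value)  # score 6

# lambda functions with map and filter
nums = [1, 2, 3, 4, 5]
doubled = list(map(lambda x: x * 2, nums))
evens = list(filter(lambda x: x % 2 == 0, nums))
print(doubled, evens)  # [2, 4, 6, 8, 10] [2, 4]
```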

Introduction to pandas
Pandas is a powerful, open-source Python library used for data manipulation and analysis. It
consists of data structures and functions to perform efficient operations on data.
This tutorial gives an overview of Pandas, covering the fundamentals of Python Pandas.

What is Pandas Library in Python?
Pandas is a powerful and versatile library that simplifies the tasks of data manipulation in Python. Pandas is well-
suited for working with tabular data, such as spreadsheets or SQL tables.
The Pandas library is an essential tool for data analysts, scientists, and engineers working with structured data in
Python.
What is Python Pandas used for?
The Pandas library is generally used for data science, but have you wondered why? This is because the Pandas library
is used in conjunction with other libraries that are used for data science.
It is built on top of the NumPy library which means that a lot of the structures of NumPy are used or replicated in
Pandas.
The data produced by Pandas is often used as input for plotting functions in Matplotlib, statistical analysis in SciPy,
and machine learning algorithms in Scikit-learn.
You might be wondering why you should use the Pandas library. Python's Pandas library is the best tool to analyze,
clean, and manipulate data.
Here is a list of things that we can do using Pandas.
 Data set cleaning, merging, and joining.
 Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
 Columns can be inserted and deleted from Data Frame and higher-dimensional objects.
 Powerful group by functionality for performing split-apply-combine operations on data sets.
 Data Visualization.
Getting Started with Pandas
Let’s see how to start working with the Python Pandas library:
Installing Pandas
The first step in working with Pandas is to check whether it is already installed on the system. If not, we need to
install it using the pip command.
Follow these steps to install Pandas:
Step 1: Type ‘cmd’ in the search box and open it.
Step 2: Using the cd command, locate the folder where the Python pip executable is installed (or make sure pip is available on your PATH).
Step 3: After locating it, type the command:
pip install pandas
For more detail, refer to the official Pandas installation guide.
Importing Pandas
After Pandas has been installed on the system, you need to import the library. The module is generally imported as follows:
import pandas as pd
Note: Here, pd is an alias for Pandas. It is not necessary to import the library under an alias; it just saves typing every time a method or property is called.
Data Structures in Pandas Library
Pandas generally provide two data structures for manipulating data. They are:
 Series

 DataFrame
Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python
objects, etc.). The axis labels are collectively called indexes.
A Pandas Series is comparable to a single column in an Excel sheet. Labels need not be unique but must be of a hashable
type.
The object supports both integer and label-based indexing and provides a host of methods for performing operations
involving the index.

Creating a Series
Pandas Series is created by loading the datasets from existing storage (which can be a SQL database, a CSV file, or an
Excel file).
Pandas Series can be created from lists, dictionaries, scalar values, etc.
Example: Creating a series using the Pandas Library.
import pandas as pd
import numpy as np
# Creating empty series
ser = pd.Series()
print("Pandas Series: ", ser)
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print("Pandas Series:\n", ser)
Output
Pandas Series: Series([], dtype: float64)
(In recent versions of pandas, an empty Series defaults to dtype object rather than float64.)
Pandas Series:
0 g
1 e
2 e
3 k
4 s
dtype: object
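A Series can also be given a custom label index, or be built from a dictionary; both label-based and integer-position indexing then work, as sketched below (the data is invented):

```python
import pandas as pd

# Series from a list with a custom label index
marks = pd.Series([88, 92, 79], index=["maths", "physics", "chemistry"])
print(marks["physics"])   # label-based indexing -> 92
print(marks.iloc[0])      # integer-position indexing -> 88

# Series from a dictionary: the keys become the index labels
prices = pd.Series({"pen": 10, "book": 50})
print(prices["book"])     # 50
```
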
Pandas DataFrame
Pandas DataFrame is a two-dimensional data structure with labeled axes (rows and columns).

Creating DataFrame
Pandas DataFrame is created by loading the datasets from existing storage (which can be a SQL database, a CSV file,
or an Excel file).
Pandas DataFrame can be created from lists, dictionaries, a list of dictionaries, etc.
Example: Creating a DataFrame Using the Pandas Library
import pandas as pd
# Calling DataFrame constructor
df = pd.DataFrame()
print(df)
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
Output:
Empty DataFrame
Columns: []
Index: []
0
0 Geeks
1 For
2 Geeks
3 is
4 portal
5 for
6 Geeks
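A more typical DataFrame is built from a dictionary of lists, where the keys become column labels (the data here is invented for illustration):

```python
import pandas as pd

# DataFrame from a dictionary of lists; keys become column labels
data = {"name": ["Asha", "Ravi", "Meena"],
        "age": [21, 22, 20]}
df = pd.DataFrame(data)

print(df.shape)           # (3, 2) -> 3 rows, 2 columns
print(list(df.columns))   # ['name', 'age']
print(df["age"].max())    # 22
```
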
How to run the Pandas Program in Python?
The Pandas program can be run from any text editor, but it is recommended to use Jupyter Notebook for this, as
Jupyter gives you the ability to execute code in a particular cell rather than the entire file.
Jupyter also provides an easy way to visualize Pandas DataFrame and plots.

Jupyter notebook:
Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live
code, equations, visualizations, and narrative text. It is a popular tool among data scientists, researchers, and educators
for interactive computing and data analysis. The name "Jupyter" is derived from the three core programming
languages it originally supported: Julia, Python, and R.
What is Jupyter Notebook?
Jupyter Notebook takes its name from the three core languages it originally supported: Julia, Python, and R. It now supports more than 40 programming languages, making it a flexible option for a wide range of computational jobs. Because the notebook interface is web-based, users interact with it through their web browsers.
Components of Jupyter Notebook

The Jupyter Notebook is made up of the three components listed below.
1. The notebook web application
It is an interactive web application that allows you to write and run code.
Users of the notebook online application can:
 Automatic syntax highlighting and indentation when editing code in the browser.
 Run code from the browser.
 View the output of computations in rich media formats such as HTML, LaTeX, PNG, and PDF.
 Create and use interactive JavaScript widgets.
 Write mathematical formulas in Markdown cells.
2. Kernels
The independent processes launched by the notebook web application are known as kernels, and they are used to
execute user code in the specified language and return results to the notebook web application.
The following languages are available for the Jupyter Notebook kernel:
 Python
 R
 Julia
 Ruby
 Scala
 node.js
3. Notebook documents
All content viewable in the notebook online application, including calculation inputs and outputs, text, mathematical
equations, graphs, and photos, is represented in the notebook document.
Types of cells in Jupyter Notebook
1. Code Cell: A code cell's contents are interpreted as statements in the current kernel's programming language.
Code cells support Python by default because the default Jupyter kernel is IPython, a Python kernel. When a
cell is executed, its output is shown below the code and can be text, an image, a matplotlib plot, or a set of
HTML tables.
2. Markdown Cell: Markdown cells give the notebook documentation and enhance its aesthetic appeal. This
cell has all formatting options, including the ability to bold and italicize text, add headers, display sorted or
unordered lists, bulleted lists, hyperlinks, tabular contents, and images, among others.
3. Raw NBConvert Cell: A raw cell holds content exactly as written; the notebook kernel does not evaluate these
cells.
4. Heading Cell: Standalone heading cells are no longer supported by Jupyter Notebook. When you choose a
heading type from the drop-down menu, a prompt appears telling you to write headings in Markdown cells instead.
Key features of Jupyter Notebook
 Support for several programming languages.
 Integration of Markdown-formatted text.
 Display of rich outputs such as tables and charts.
 Flexibility in switching languages (kernels).
 Export options for sharing and collaboration.
 Adaptability and extensibility via add-ons.
 Integration of interactive widgets with data-science libraries.
 Live code execution with quick feedback.
 Widely used in scientific research and education.
Getting Started with Jupyter Notebook
The easiest way to install Jupyter Notebook is through the terminal:
Step 1: Install the latest version of Python (https://www.python.org/downloads/).
Step 2: Update pip from the command prompt:
python -m pip install --upgrade pip
Step 3: Install Jupyter Notebook:
pip install notebook
Step 4: Run the notebook:
jupyter notebook
After you type the command, the Jupyter home page should open in your default browser.


Applications of Jupyter Notebook
1. Data science workflows: Organizing and recording the steps involved in data analysis.
2. Making slide displays and reports for presentations.
3. Data analysis involves investigating and displaying data.
4. The creation and evaluation of machine learning models.
5. NLP: Text analysis and natural language processing.
Notebook Extensions
Extensions for Jupyter Notebook are add-ons or modules that improve the environment's functionality. Jupyter
Notebook is further enhanced and customizable by these extensions, which offer more capabilities and settings. The
Jupyter JavaScript API and the page's DOM are both accessible to extensions.
Although Jupyter Notebooks come with lots of built-in abilities, extensions let you add more. Actually, Jupyter
supports four different kinds of extensions:
 Kernel
 IPython kernel
 Notebook
 Notebook server
You can download jupyter_contrib_nbextensions, one of the most well-liked extension sets, from GitHub. This is
actually a set of pip-installed extensions made available by the Jupyter community.
Keyboards Shortcuts
Working with code and markdown cells in Jupyter Notebook requires the use of keyboard shortcuts to increase
productivity. Here are several significant Jupyter Notebook keyboard shortcuts:
 Ctrl + Enter: Run the current cell.
 Y: Switch to the Code cell type.
 M: Markdown cell type change.
 D, D: Delete the current cell by pressing D twice.
 R: Switch to Raw cell type.
 S: Save the notebook and create a checkpoint.
 Z: Undelete a deleted cell.
 A: Add a new cell above the existing one.
 H: Display the keyboard shortcuts list.
 B: Add a new cell below the existing one.
Advantages
1. Supports interactive experimentation and step-by-step code execution for data exploration.
2. Multilingual: Supports a variety of programming languages.
3. Rich Documentation: Enables the creation of code-, text-, and visualization-filled notebooks.
4. Data Visualization: Works nicely with libraries for data visualization.
5. Support from the community: Gains from a vibrant community and a large ecosystem.
Disadvantages
1. Learning curve: Beginners may find it difficult.
2. Version control: Complicated when controlling notebook versions.
3. Resource Consuming: May use a lot of system resources.
4. Not Suited to Large Projects: Awkward for large software development projects.
5. Dependency Management: Managing dependencies takes more work.

Visual representation of the data


In data analytics and visualization, various types of visual representations are commonly used to display different
kinds of data. Here’s an overview of common visualizations and the types of data they are best suited for:

1. Bar Chart
 Use Case: Ideal for comparing discrete categories or groups.
 Data Type: Categorical data.
 Example: Displaying sales by region, product categories, etc.
2. Line Chart
 Use Case: Great for visualizing trends over time or continuous data.
 Data Type: Time series data or continuous variables.
 Example: Showing stock price changes over a month.
3. Pie Chart
 Use Case: Shows proportions of a whole.
 Data Type: Categorical data where the parts represent portions of the total.
 Example: Market share of different brands in an industry.
4. Histogram
 Use Case: Used for showing the distribution of continuous data.
 Data Type: Continuous data.
 Example: Displaying the distribution of ages in a population.
5. Scatter Plot
 Use Case: Best for showing relationships or correlations between two continuous variables.
 Data Type: Continuous variables.
 Example: Plotting height vs. weight to analyze correlation.
6. Box Plot
 Use Case: Ideal for showing the distribution of data, highlighting outliers, and showing spread.
 Data Type: Continuous data.
 Example: Displaying the salary range in different departments.
7. Heatmap
 Use Case: Shows data density or intensity across two variables (color-coded).
 Data Type: Numerical or matrix data.
 Example: Representing website traffic by time of day and day of the week.
8. Bubble Chart
 Use Case: Similar to scatter plots, but with an additional dimension represented by the size of the bubble.
 Data Type: Three continuous variables.
 Example: Visualizing sales (x-axis), profits (y-axis), and number of customers (bubble size).
9. Area Chart
 Use Case: Shows the cumulative total of a variable over time, emphasizing the volume.
 Data Type: Time series data.
 Example: Displaying total sales growth over several years.
10. Radar Chart (Spider Chart)
 Use Case: Useful for comparing multiple variables in a way that highlights relative strengths and weaknesses.
 Data Type: Multi-variable comparisons.
 Example: Comparing the performance of different products across various attributes.
11. Tree Map
 Use Case: Display hierarchical data in a compact and space-efficient way.
 Data Type: Hierarchical or categorical data.
 Example: Displaying file sizes within a folder structure.
12. Waterfall Chart
 Use Case: Used to visualize incremental changes and how they contribute to a total.
 Data Type: Sequential data.
 Example: Tracking monthly revenue gains and losses.
13. Violin Plot
 Use Case: Shows the distribution of data across multiple categories while combining aspects of box plots and
kernel density plots.
 Data Type: Continuous data across categories.
 Example: Comparing distribution of exam scores by class.
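As a rough sketch of how two of these chart types are produced in Python, the snippet below draws a histogram and a scatter plot with Matplotlib, rendering off-screen (the data and output file name are invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

ages = [22, 25, 25, 30, 31, 35, 40, 41, 41, 41, 55]
heights = [150, 155, 160, 165, 170, 172, 175, 178, 180, 182, 185]
weights = [50, 53, 57, 62, 66, 68, 72, 75, 78, 81, 85]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: distribution of a continuous variable
ax1.hist(ages, bins=5)
ax1.set_title("Age distribution")

# Scatter plot: relationship between two continuous variables
ax2.scatter(heights, weights)
ax2.set_xlabel("Height")
ax2.set_ylabel("Weight")

fig.savefig("charts.png")
```
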

Measures of Central Tendency
Measures of central tendency describe a set of data by identifying the central position in the data set as a
single representative value. There are generally three measures of central tendency, commonly used in
statistics- mean, median, and mode. Mean is the most common measure of central tendency used to describe a
data set.
We come across new data every day. We find them in newspapers, articles, in our bank statements, mobile and
electricity bills. Now the question arises whether we can figure out some important features of the data by
considering only certain representatives of the data. This is possible by using measures of central tendency. In
the following sections, we will look at the different measures of central tendency and the methods to calculate
them.
What are Measures of Central Tendency?
Measures of central tendency are the values that describe a data set by identifying the central position of the
data. There are 3 main measures of central tendency - Mean, Median and Mode.
 Mean- Sum of all observations divided by the total number of observations.
 Median- The middle or central value in an ordered set.
 Mode- The most frequently occurring value in a data set.
Measures of Central Tendency Definition
The central tendency is defined as the statistical measure that can be used to represent the entire distribution or
a dataset using a single value called a measure of central tendency. Any of the measures of central tendency
provides an accurate description of the entire data in the distribution.
Measures of Central Tendency Example
Let us understand the concept of the measures of central tendency using an example. The monthly salary of an
employee for the 5 months is given in the table below,

Month      Salary
January    $105
February   $95
March      $105
April      $105
May        $100
Suppose, we want to express the salary of the employee using a single value and not 5 different values for 5
months. This value that can be used to represent the data for salaries for 5 months here can be referred to as
the measure of central tendency. The three possible ways to find the central measure of the tendency for the
above data are,
Mean: The mean salary can be used as one of the measures of central tendency, i.e., x̄ = (105
+ 95 + 105 + 105 + 100)/5 = $102.
Mode: If we use the most frequently occurring value, i.e., $105, to represent the above data, the measure of
central tendency is the mode.
Median: If we use the central value, i.e., $105, of the ordered set of salaries $95, $100, $105, $105,
$105, the measure of central tendency is the median.
We can use the following table for reference to check the best measure of central tendency suitable for a
particular type of variable:


Type of Variable               Best Suitable Measure of Central Tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median

Let us study the following measures of central tendency, their formulas, usage, and types in detail below.
 Mean
 Median
 Mode
Mean as a Measure of Central Tendency
The mean (or arithmetic mean), often called the average, is most likely the measure of central tendency
that you are most familiar with. The mean is simply the sum of all the values in a group or collection,
divided by the number of values.
We generally denote the mean of a given data-set by x̄, pronounced “x bar”. The formula to calculate the mean
for ungrouped data to represent it as the measure is given as,
For a set of observations: Mean = Sum of the terms/Number of terms
For a set of grouped data: Mean, x̄ = Σfx/Σf
where,
 x̄ = the mean value of the set of given data.
 f = frequency of each class
 x = mid-interval value of each class
Example: The weights of 8 boys in kilograms: 45, 39, 53, 45, 43, 48, 50, 45. Find the mean weight for the
given set of data.
Therefore, the mean weight of the group:
Mean = Sum of the weights/Number of boys
= (45 + 39 + 53 + 45 + 43 + 48 + 50 + 45)/8
= 368/8
= 46
Thus, the mean weight of the group is 46 kilograms.
When Not to Use the Mean as the Measure of Central Tendency?
Using mean as the measure of central tendency brings out one major disadvantage, i.e., mean is particularly
sensitive to outliers. This is for the case when the values in a data are unusually larger or smaller compared to
the rest of the data.
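This sensitivity is easy to demonstrate with Python's statistics module: adding a single extreme salary drags the mean far upward while the median barely moves (the figures reuse the salary example above, plus an invented outlier):

```python
from statistics import mean, median

salaries = [95, 100, 105, 105, 105]
print(mean(salaries), median(salaries))   # 102 105

# One extreme value (an outlier) is added
with_outlier = salaries + [1000]
print(mean(with_outlier))    # 251.66..., pulled far from the typical salary
print(median(with_outlier))  # 105.0, almost unchanged
```
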
Median as a Measure of Central Tendency
The median is the middle-most observation of a data set after the data has been arranged in ascending order. The
major advantage of using the median as a measure of central tendency is that it is less affected by outliers and
skewed data. We can calculate the median for different types of data, grouped or ungrouped, using the median
formulas below.
For ungrouped data: For odd number of observations, Median = [(n + 1)/2]th term. For even number of
observations, Median = [(n/2)th term + ((n/2) + 1)th term]/2
For grouped data: Median = l + [((n/2) - c)/f] × h
where,
l = Lower limit of the median class
c = Cumulative frequency of the class preceding the median class
h = Class size
n = Number of observations
Median class = Class where n/2 lies
Let us use the same example given above to find the median now.
Example: The weights of 8 boys in kilograms: 45, 39, 53, 45, 43, 48, 50, 45. Find the median.
Solution:
Arranging the given data set in ascending order: 39, 43, 45, 45, 45, 48, 50, 53
Total number of observations = 8
For even number of observation, Median = [(n/2)th term + ((n/2) + 1)th term]/2
⇒ Median = (4th term + 5th term)/2 = (45 + 45)/2 = 45
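The same calculation can be checked in Python, both with the statistics module and by hand from the even-count formula:

```python
from statistics import median

weights = [45, 39, 53, 45, 43, 48, 50, 45]

# statistics.median sorts internally and averages the two
# middle values for an even number of observations
print(median(weights))  # 45.0

# Doing it by hand, as in the formula above
s = sorted(weights)                        # [39, 43, 45, 45, 45, 48, 50, 53]
n = len(s)
manual = (s[n // 2 - 1] + s[n // 2]) / 2   # (4th term + 5th term) / 2
print(manual)                              # 45.0
```
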
Mode as a Measure of Central Tendency
Mode is one of the measures of the central tendency, defined as the value which appears most often in the
given data, i.e. the observation with the highest frequency is called the mode of data. The mode for grouped
data or ungrouped data can be calculated using the mode formulas given below,
Mode for ungrouped data: the most frequently occurring observation in the data set.
Mode for grouped data: Mode = L + h × (fm − f1)/[(fm − f1) + (fm − f2)]
where,
 L is the lower limit of the modal class
 h is the size of the class interval
 fm is the frequency of the modal class
 f1 is the frequency of the class preceding the modal class
 f2 is the frequency of the class succeeding the modal class
Example: The weights of 8 boys in kilograms: 45, 39, 53, 45, 43, 48, 50, 45. Find the mode.
Solution:
Since the mode is the most occurring observation in the given set.
Mode = 45
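In Python, statistics.mode returns the most frequent observation, and statistics.multimode (Python 3.8+) returns every value tied for the highest frequency:

```python
from statistics import mode, multimode

weights = [45, 39, 53, 45, 43, 48, 50, 45]
print(mode(weights))   # 45, the most frequent observation

# multimode returns all values tied for the highest frequency
print(multimode([1, 1, 2, 2, 3]))  # [1, 2]
```
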
Empirical Relation Between Measures of Central Tendency
The three measures of central tendency i.e. mean, median, and mode are closely connected by the following
relations (called an empirical relationship).
2Mean + Mode = 3Median
For instance, if we are asked to calculate the mean, median, and mode of continuous grouped data, then we
can calculate mean and median using the formulae as discussed in the previous sections and then find mode
using the empirical relation.
Example: The median and mode for a given data set are 56 and 54 respectively. Find the approximate
value of the mean for this data set.
2Mean + Mode = 3Median
2Mean = 3Median - Mode
2Mean = 3 × 56 - 54
2Mean = 168 - 54 = 114
Mean = 57
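The rearrangement above can be written out directly:

```python
# Empirical relation: 2*Mean + Mode = 3*Median
median_val = 56
mode_val = 54

# Rearranged: Mean = (3*Median - Mode) / 2
mean_val = (3 * median_val - mode_val) / 2
print(mean_val)  # 57.0
```
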
Measures of Central Tendency and Type of Distribution
Any data set is a distribution of 'n' number of observations. The best measure of the central tendency of any
given data depends on this type of distribution. Some types of distributions in statistics are given as,
 Normal Distribution
 Skewed Distribution
Let us understand how the type of distribution can affect the values of different measures of central tendency.
Measures of Central Tendency for Normal Distribution
Here is the frequency distribution table for a set of data:

Observation 6 9 12 15 18 21

Frequency 5 10 15 10 5 0

The histogram of this distribution is symmetrical. Finding the mean, median, and mode for this
data set, we observe that all three measures of central tendency are located at the
center of the distribution. Thus, we can infer that in a perfectly symmetrical distribution, the mean and
the median are the same. The above example has one mode (it is a unimodal set), so the
mode is the same as the mean and median. In a symmetrical distribution that has two modes (a bimodal set),
the two modes would differ from the mean and median.
Measures of Central Tendency for Skewed Distribution
For skewed distributions, if the distribution of data is skewed to the left, the mean is less than the median,
which is often less than the mode. If the distribution of data is skewed to the right, then the mode is often less
than the median, which is less than the mean. Let us understand each case using different examples.

Measures of Central Tendency for Right-Skewed Distribution
Consider the following data-set and plot the histogram for the same to check the type of distribution.

Observation 6 9 12 15 18 21

Frequency 17 19 8 5 3 2

We observe that the given data set is an example of a right (positively) skewed distribution. Calculating the three
measures of central tendency, we find mean = 10, median = 9, and mode = 9. We therefore infer that if the
distribution of data is skewed to the right, the mode is less than the mean, and the median generally lies
between the mode and the mean.
Measures of Central Tendency for Left-Skewed Distribution
Consider the following data-set and plot the histogram for the same to check the type of distribution.

Observation 6 9 12 15 18 21

Frequency 2 13 5 10 15 19

We observe that the given data set is an example of a left (negatively) skewed distribution. Calculating the three
measures of central tendency, we find mean = 15.75, median = 18, and mode = 21. We therefore infer that if
the distribution of data is skewed to the left, the mode is greater than the median, which is greater than
the mean.
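Both sets of figures quoted above can be reproduced by expanding the frequency tables into raw observations and applying the standard functions:

```python
from statistics import mean, median, mode

def expand(observations, frequencies):
    """Expand a frequency table into the raw list of observations."""
    data = []
    for x, f in zip(observations, frequencies):
        data.extend([x] * f)
    return data

obs = [6, 9, 12, 15, 18, 21]

right_skewed = expand(obs, [17, 19, 8, 5, 3, 2])
print(mean(right_skewed), median(right_skewed), mode(right_skewed))  # 10 9.0 9

left_skewed = expand(obs, [2, 13, 5, 10, 15, 19])
print(mean(left_skewed), median(left_skewed), mode(left_skewed))     # 15.75 18.0 21
```
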
Important Notes on Measures of Central Tendency:

 The three most common measures of central tendency are mean, median, and mode.

 Mean is simply the sum of all the components in a group or collection, divided by the number of components.
 The value of the middle-most observation obtained after arranging the data in ascending order is called the
median of the data.
 The value which appears most often in the given data i.e. the observation with the highest frequency is called
the mode of data.
 The three measures of central tendency i.e. mean, median and mode are closely connected by the following
relations (called an empirical relationship): 2Mean + Mode = 3Median

DISPERSION:

Measures of Dispersion are used to represent the scattering of data. These are the numbers that show the
various aspects of the data spread across various parameters.

Let’s learn about the measures of dispersion in statistics, their types, formulas, and examples in detail.
Dispersion in Statistics
Dispersion in statistics is a way to describe how spread out or scattered the data is around an average value. It
helps to understand if the data points are close together or far apart.
Dispersion shows the variability or consistency in a set of data. There are different measures of dispersion like
range, variance, and standard deviation.
Measure of Dispersion in Statistics
Measures of Dispersion measure the scattering of the data. It tells us how the values are distributed in the data
set. In statistics, we define the measure of dispersion as various parameters that are used to define the various
attributes of the data.
These measures of dispersion capture variation between different values of the data.
Types of Measures of Dispersion
Measures of dispersion can be classified into the following two types :
 Absolute Measure of Dispersion
 Relative Measure of Dispersion
These measures of dispersion can be further divided into various categories. Absolute measures are expressed
in the same units as the data, while relative measures are unit-free ratios.


Let’s learn about them in detail.


Absolute Measure of Dispersion
The measures of dispersion that are measured and expressed in the units of data themselves are called
Absolute Measure of Dispersion. For example – Meters, Dollars, Kg, etc.
Some absolute measures of dispersion are:
Range: It is defined as the difference between the largest and the smallest value in the distribution.
Mean Deviation: It is the arithmetic mean of the difference between the values and their mean.
Standard Deviation: It is the square root of the arithmetic average of the square of the deviations measured
from the mean.
Variance: It is defined as the average of the square deviation from the mean of the given data set.
Quartile Deviation: It is defined as half of the difference between the third quartile and the first quartile in a
given data set.
Interquartile Range: The difference between the upper (Q3) and lower (Q1) quartiles is called the Interquartile
Range. Its formula is given as Q3 – Q1.
Relative Measure of Dispersion
We use relative measures of dispersion to measure the two quantities that have different units to get a better
idea about the scattering of the data.
Here are some of the relative measures of dispersion:
Coefficient of Range: It is defined as the ratio of the difference between the highest and lowest value in a
data set to the sum of the highest and lowest value.
Coefficient of Variation: It is defined as the ratio of the standard deviation to the mean of the data set. We use
percentages to express the coefficient of variation.
Coefficient of Mean Deviation: It is defined as the ratio of the mean deviation to the value of the central
point of the data set.
Coefficient of Quartile Deviation: It is defined as the ratio of the difference between the third quartile and
the first quartile to the sum of the third and first quartiles.
RANGE:

In statistics, the range is the difference between the highest and lowest values in a dataset. It provides a simple
measure of the spread or dispersion of the data: the range quantifies the extent of variation among the values, and
is calculated by subtracting the minimum value from the maximum value.
We can use following steps for range calculation:
 Identify the maximum value (the largest value) in your dataset.
 Identify the minimum value (the smallest value) in your dataset.
 Subtract the minimum value from the maximum value to find the range.
Range = Maximum value − Minimum value
Example : Consider a dataset of exam scores for a class:
Scores: 85, 92, 78, 96, 64, 89, 75, find the range?
Solution:
Maximum Value = 96
Minimum Value = 64
Range = 96 - 64 = 32
So, the range of the exam scores is 32.
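The three steps translate directly into Python:

```python
scores = [85, 92, 78, 96, 64, 89, 75]

# Step 1 and 2: identify the maximum and minimum values
highest = max(scores)   # 96
lowest = min(scores)    # 64

# Step 3: subtract the minimum from the maximum
data_range = highest - lowest
print(data_range)  # 32
```
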
Advantages
1. Easy to understand: The concept of range is simple and easy to grasp for people unfamiliar with statistics.
It's essentially the difference between the highest and lowest values in a dataset, making it intuitive.
2. Quick to calculate: Computing the range involves only finding the maximum and minimum values in the
dataset and subtracting them, making it a fast measure to calculate.
3. Provides a basic measure of variability: Despite its simplicity, the range gives a basic indication of the spread
or variability of the data. A larger range suggests greater variability, while a smaller range suggests less
variability.
Disadvantages
1. Sensitivity to outliers: The range is heavily influenced by extreme values (outliers) in the dataset. A single
outlier can greatly inflate the range, potentially giving a misleading picture of the variability of the majority
of the data.
2. Does not consider distribution: The range does not take into account the distribution of values within the
dataset. Two datasets with the same range can have very different distributions, leading to different
interpretations of variability.
3. Limited information: While the range provides a basic measure of variability, it does not provide any
information about the distribution's shape or central tendency. Other measures such as the interquartile
range, variance, or standard deviation offer more comprehensive insights into the dataset's characteristics.
4. Sample size dependency: The range does not account for sample size, so datasets with different sample sizes
may have similar ranges even if their variability differs significantly. This can lead to misinterpretations,
especially when comparing datasets of different sizes.

VARIANCE:
Variance measures how the data is spread about the mean (average) of the data set. It describes the
distribution of the data and defines how much the values differ from the mean.
The symbol used for variance is σ2; it is the square of the standard deviation.
There are two types of variance used in statistics:
 Sample Variance
 Population Variance

Population Variance
Population variance describes how each data point in an entire population fluctuates or is spread out
about the population mean.
Population Variance Formula
The formula for population variance is written as,
σ2 = ∑ (xi – μ)2/n
where,
 μ is the mean of the population data set
 n is the total number of observations
Population variance is mainly used when the entire population’s data is available for analysis.

Sample Variance
If the population data is very large, it becomes difficult to calculate the population variance. In that case,
we take a sample from the given data set and find its variance, which is called the sample variance.
While calculating it, we make sure to use the sample mean, i.e., the mean of the sample data set, not the
population mean. The sample variance is the sum of the squared differences between the sample data points and the
sample mean, divided by (n − 1).
Sample Variance Formula
The formula of Sample variance is given by,
s2 = ∑ (xi – x̄)2/(n – 1)
where,
 x̄ is the mean of sample data set
 n is the total number of observations
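Python's statistics module implements both formulas: pvariance divides by n (population), while variance divides by n − 1 (sample). A small check with invented data:

```python
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean = 5, sum of squared deviations = 32

# Population variance: divide by n
print(pvariance(data))   # 32 / 8 = 4

# Sample variance: divide by (n - 1), Bessel's correction
print(variance(data))    # 32 / 7, about 4.571
```
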

Formulas for Absolute Measures of Dispersion

Range: H – S, where H is the largest value and S is the smallest value.
Variance: Population variance σ2 = Σ(xi – μ)2/n; sample variance s2 = Σ(xi – x̄)2/(n – 1), where μ is the
population mean, x̄ is the sample mean, and n is the number of observations.
Standard Deviation: S.D. = √(σ2).
Mean Deviation: M.D. = Σ|x – a|/n, where a is the central value (mean, median, or mode) and n is the number of
observations.
Quartile Deviation: (Q3 – Q1)/2, where Q3 is the third quartile and Q1 is the first quartile.

Coefficient of Dispersion
Coefficients of dispersion are calculated when comparing two series whose averages differ greatly, or two series
measured in different units. The coefficient of dispersion is denoted C.D.

Formulas for Relative Measures of Dispersion

Coefficient of Range: (H – S)/(H + S)
Coefficient of Variation: (S.D./Mean) × 100
Coefficient of Mean Deviation: (Mean Deviation)/a, where a is the central value about which the mean deviation
was calculated
Coefficient of Quartile Deviation: (Q3 – Q1)/(Q3 + Q1)
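As a sketch, two of these coefficients can be computed with the statistics module (the data is invented):

```python
from statistics import mean, pstdev

data = [12, 15, 11, 14, 13]

# Coefficient of Variation = (standard deviation / mean) * 100, in percent
cv = pstdev(data) / mean(data) * 100
print(round(cv, 2))  # 10.88

# Coefficient of Range = (H - S) / (H + S)
H, S = max(data), min(data)
print(round((H - S) / (H + S), 4))  # 0.1538
```
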
