PSAI Unit 1

Data Science

Data science is a multidisciplinary field.


Data science can be defined as the study of data to extract meaningful
insights, key patterns, emerging trends, and hidden relationships from data.
As noted above, data science is a multidisciplinary field that combines
various disciplines and practices, such as mathematics, statistics, AI, ML,
IoT, and computer science/engineering, to analyze large amounts of data.
Data science is the field of study that deals with extracting knowledge and key
insights from noisy data, then turning those insights into actions that
business organizations can take.

Data science is a combination of three main subjects: mathematics,
computer science, and business knowledge.
The mathematics relevant to data science covers topics such as probability and
statistics, discrete mathematics, linear algebra, and calculus, while computer
science topics such as Python, R, AI, and ML are essential for data collection,
cleaning, analysis, and visualization. Additionally, knowledge of algorithms
and data structures is critical for developing efficient data processing
pipelines. Business knowledge means understanding the specific field or
industry in which data science is applied; it is essential for interpreting
results, identifying significant variables, and making informed decisions, and
can include economics, finance, healthcare, marketing, etc.

Why is Data Science important?

Data Science is important because it combines tools, techniques, methods, and
technology to generate meaningful insights from data.

Discovering Insights: Through data analysis, we derive answers to our
inquiries from the available data. Data science, including data analysis,
plays a vital role in uncovering valuable information from datasets,
addressing queries, and potentially predicting future outcomes or unknowns.
This discipline employs scientific methods, techniques, algorithms, and
frameworks to extract knowledge and understanding from vast datasets.

Decision Making: Data permeates every aspect of modern organizations,
serving as a cornerstone for success by facilitating evidence-based
decision-making rooted in factual data, statistical analysis, and emerging
trends. With the expanding volume and significance of data, the emergence of
data science has become inevitable. This interdisciplinary field of IT has
propelled data scientist roles to the forefront, making them highly sought-after
in the 21st century job market.

Optimized Processes: By analyzing data, organizations can identify
inefficiencies in their processes and systems, leading to optimizations that
improve productivity, reduce costs, and enhance overall performance.

Personalization: Data science enables personalized experiences for customers
and users by analyzing their preferences, behavior, and interactions with
products or services. This leads to tailored recommendations, targeted
marketing campaigns, and enhanced customer satisfaction.

Innovation and Growth: Data science fuels innovation by unlocking new
insights, uncovering opportunities, and identifying untapped markets. It
empowers organizations to develop innovative products and services that
meet evolving customer needs and drive growth.

Competitive Advantage: In today's data-driven world, organizations that
leverage data science effectively gain a competitive edge. By extracting
actionable insights from data, they can outperform competitors, adapt to
market changes quickly, and stay ahead in their respective industries.

History of Data Science

The history of data science can be traced back to the emergence of statistics
and computer science, with roots extending into various fields such as
mathematics, economics, and information theory.

Data science continues to evolve as a discipline using statistics to make
predictions.

The foundations of data analysis can be found in the development of
statistical methods by pioneers like Francis Galton, Karl Pearson, and Ronald
Fisher in the late 19th and early 20th centuries. These statisticians laid the
groundwork for analyzing and interpreting data through techniques such as
regression analysis and hypothesis testing.
The use of statistics is deeply rooted within the field of data science: data
science started with statistics and continues to evolve, incorporating major
domains such as ML, AI, and IoT.

The advent of computers in the mid-20th century revolutionized data
processing and analysis. Early computer scientists like Alan Turing and John
von Neumann made significant contributions to the development of
computational theory and algorithms, which became essential for handling
large datasets.

While the term "data science" was coined earlier, it gained prominence in the
early 21st century as organizations began to recognize the value of
data-driven decision-making. Companies and organizations popularized data
science by leveraging data analytics to improve their products and services.

The proliferation of data, often referred to as big data, the evolution of
technology, the growth of the internet, and the increasing connectivity of
devices through IoT have all contributed to the generation and availability
of massive amounts of data. This influx of data has opened up new
opportunities for businesses and organizations.

Data science is used to study data in four main ways:

Descriptive Analysis
Descriptive Analysis examines data to gain insights into what happened in the
past. It is characterized by pie charts, bar charts, graphs.
Diagnostic Analysis
Diagnostic Analysis is a deep-understanding or detailed examination to
understand why something happened. It is characterized by data mining, and
correlations

Predictive Analysis
Predictive analytics uses historical data to make accurate predictions about
data patterns that may emerge in the future. It is characterized by techniques
such as machine learning, forecasting, pattern matching and predictive
modeling.
Prescriptive Analysis
Prescriptive Analysis takes predictive data to another level. It not only
provides what is likely to happen in the future, but also proposes the most
effective course of action in response to that result.

Data Science Real Life Examples

Recommendations: Companies like Amazon and Netflix use data science
algorithms to analyze user behavior and preferences, providing personalized
recommendations for products, movies, or TV shows.

Health Predictive Analytics: Healthcare providers utilize data science to
predict patient outcomes, identify high-risk individuals, and optimize
treatment plans. For example, predictive models can help detect diseases early
or predict readmission rates for hospitalized patients.

Social Media Analysis: Organizations analyze social media data using data
science techniques to understand customer sentiment, track trends, and
inform marketing strategies. Sentiment analysis algorithms can gauge public
opinion about products or brands based on social media posts and comments.

INTRODUCTION TO STATISTICS

Statistics is the study and manipulation of data, including ways to gather,
review, analyze, and draw conclusions from it. Statistics is the examination
of the collection, analysis, interpretation, presentation, and organization
of data.

Statistics is a mathematical discipline focused on gathering, analyzing,
interpreting, presenting, and arranging data in a specific manner. It
encompasses the processes of data collection, classification,
representation for clarity, and subsequent analysis. Additionally,
statistics involves drawing conclusions from sample data obtained
through surveys or experiments. It involves the analysis of numerical
data, enabling the extraction of meaningful conclusions from the
collected and analyzed data sets.

Statistics is a field of applied mathematics that generally deals with
the collection of data, tabulation of data, data analytics, and
presentation of the data.

Statistics can be defined in terms of the following activities:


Collection of data
Organization of data
Analysis of data
Interpretation of data
Presentation/Organization of data

In Which Fields is Statistics Used?

Science and Research: Statistics plays a crucial role in scientific
research, including fields such as biology, chemistry, physics, and
environmental science. It helps researchers analyze experimental data,
test hypotheses, and draw conclusions about the natural world.
Economics and Finance: In economics, statistics is used to analyze
economic trends, evaluate market behavior, and make forecasts. In
finance, it is applied in risk management, investment analysis, and
portfolio optimization.
Healthcare and Medicine: Statistics is integral to medical research,
clinical trials, epidemiology, and public health. It helps healthcare
professionals assess the effectiveness of treatments, study disease
patterns, and make evidence-based decisions.
Business and Marketing: Businesses use statistics to analyze market
trends, consumer behavior, and sales data. It aids in making informed
decisions related to pricing strategies, product development, and
marketing campaigns.

Social Sciences: Statistics is widely used in sociology, psychology,
anthropology, and political science to analyze social phenomena,
conduct surveys, and study human behavior.

Education: In education, statistics is used for assessing student
performance, evaluating teaching methods, and conducting research
on educational outcomes.

Engineering: Engineers use statistics for quality control, reliability
analysis, and experimental design in fields such as manufacturing, civil
engineering, and aerospace engineering.

Government and Policy Making: Statistics is essential for
governments and policymakers to formulate policies, allocate
resources, and measure the impact of interventions in areas such as
education, healthcare, and social welfare.

Environmental Science: Statistics is applied in environmental
monitoring, ecological studies, and climate research to analyze data on
pollution levels, biodiversity, and climate change.

Sports Analytics: In sports, statistics is used to analyze player
performance, predict outcomes of games, and optimize strategies for
teams.

Limitations of Statistics

Statistics possesses significant strengths, yet it's imperative to
acknowledge its limitations. Key constraints encompass:

Data quality: The efficacy of statistical analysis hinges on data
quality. Inaccurate, incomplete, or biased data yields flawed analysis
outcomes.

Sampling: Statistical analyses often rely on samples rather than full
populations, potentially limiting the generalizability of findings.

Assumptions: Statistical methods operate based on certain
assumptions about data, such as normal distribution. Failure to meet
these assumptions can skew analysis results.

Correlation vs. causation: While statistics can highlight correlations
between variables, establishing causation is beyond its scope.

Misinterpretation: Statistics can be prone to misuse or
misinterpretation, leading to erroneous conclusions.

Five Stages of Statistics


Collection of data
Organization of data
Analysis of data
Interpretation of data
Presentation/Organization of data

Data Collection
Data collection is the very first step in statistical analysis.
Collection of data is the process of collecting or gathering data or
information from multiple sources to answer research questions
and problem statements. Data collection involves gathering and
analyzing information from various sources to address research
inquiries, assess outcomes, and predict trends and probabilities. This
crucial phase is integral to research, analysis, and decision-making
across diverse fields such as social sciences, business, and healthcare.
The type of data you need to collect is whatever is relevant to the problem statement.

Organization of data
Once data is collected, organization is the next step: deciding how to
arrange the collected data before proceeding further. Organization of
data refers to the process of arranging data in a systematic and
structured manner to enhance analysis and interpretation.
The systematic arrangement of gathered or raw data to enhance
comprehensibility is termed data organization. By organizing data,
researchers facilitate subsequent statistical analyses, enabling
comparisons among similar datasets.

Analysis of data
Analysis is the process of taking the large volumes of collected data and then
using statistics and other data analysis techniques to identify trends,
patterns, and new insights in the data. Data analysis involves using
mathematical techniques to extract useful information from a dataset.
There are various methods of data analysis in statistics, including
descriptive statistics, inferential statistics, regression analysis, and
hypothesis testing.

Interpretation of data
Interpretation is about getting a better understanding of the data and
becoming familiar with it. Data interpretation involves analyzing data and deriving
significant insights through various analytical methods. It assists
researchers in categorizing, manipulating, and summarizing data to
inform sound business decisions. The ultimate objective of data
interpretation projects is to formulate effective marketing strategies or
broaden the client user base.

Visualization of data
Data visualization/presentation is the art and science of transforming
raw data into a visual format that's easy to understand. It is like
turning numbers and statistics into a captivating story so that your
audience can quickly grasp what the data is all about.
Data visualization serves as a crucial statistical instrument for visually
representing data through means like charts, graphs, and maps. Its
primary function is to simplify the comprehension, analysis, and
interpretation of intricate data sets.

Functions of Statistics

Definiteness
In statistics, definiteness entails presenting facts and figures in a
precise manner, which enhances the logical coherence and
persuasiveness of a statement compared to mere description.

Reduces the Complexity of data
Statistics simplifies the complexity inherent in raw data, which can
initially be difficult to comprehend. Through the application of various
statistical measures such as graphs, averages, dispersions, skewness,
kurtosis, correlation, and regression, we transform the data into a more
understandable and intelligible form. These measures facilitate
interpretation and inference drawing. Consequently, statistics plays a
pivotal role in expanding one's knowledge and understanding.

Facilitates comparison
Comparing different sets of observations is a fundamental aspect of
statistics, essential for drawing meaningful conclusions. The primary
objective of statistics is to facilitate comparisons between past and
present results, thereby discerning the causes of changes that have
occurred and predicting the impact of such changes in the future.

Testing Hypotheses
Formulating and testing hypotheses is an important function of
statistics. This helps in developing new theories. So statistics examines
the truth and helps in innovating new ideas.

Statistics

Types of Statistics

Descriptive Statistics:
● Estimates of central tendency: mean, median, mode, trimmed mean (TM), weighted mean (WM)
● Estimates of variability: range, standard deviation (SD), variance, IQR
● Shape of distribution: skewness, kurtosis
● Data visualization: boxplot, bar graph, histogram

Inferential Statistics:
● Population, sample
● Hypothesis testing
● Z-test, t-test, F-test
● ANOVA
● Chi-square test

Descriptive statistics
Descriptive statistics is a branch of statistics that involves
summarizing, organizing, and presenting data in a meaningful and
concise manner.
Descriptive statistics involves describing and analyzing a data set's
main characteristics and features without drawing any conclusions
beyond the data itself.
The primary purpose of descriptive statistics is to provide a clear and
precise description and summary of the data, enabling the researcher
to gain insights and identify hidden patterns, trends, and distributions
within the dataset.
Descriptive statistics involves a graphical representation of data
through charts, graphs, maps, plots.
Descriptive statistics can be defined as a branch of statistics used to
summarize the characteristics of a sample using certain quantitative
techniques. It helps to provide simple and accurate summaries of
samples and observations using measures like mean, median,
variance, graphs and charts. Univariate descriptive statistics are used
to describe data that contains only one variable. On the other hand,
bivariate and multivariate descriptive statistics are used to describe
multivariate data.

Descriptive statistics is sometimes also referred to as summary
statistics.

Descriptive Statistics Types

● Measures of central tendency [mean, median, mode, trimmed mean, weighted mean]
● Measures of dispersion [range, variance, standard deviation, IQR]
● Skewness and distributions
● Data visualization techniques
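A minimal Python sketch of the descriptive measures listed above, using the standard library plus NumPy; the sample values and the weights are invented purely for illustration.

```python
# Descriptive statistics on a small, made-up sample.
import statistics
import numpy as np

data = [12, 15, 15, 18, 21, 24, 30, 95]   # 95 is an artificial outlier

# Measures of central tendency
print("mean:", statistics.mean(data))
print("median:", statistics.median(data))
print("mode:", statistics.mode(data))
# Trimmed mean: drop the lowest and highest value (12.5% each end) before averaging
trimmed = sorted(data)[1:-1]
print("trimmed mean:", statistics.mean(trimmed))
# Weighted mean: the weights below are assumed, not from the notes
weights = [1, 1, 1, 2, 2, 2, 1, 1]
print("weighted mean:", np.average(data, weights=weights))

# Measures of dispersion
print("range:", max(data) - min(data))
print("sample variance:", statistics.variance(data))
print("sample std dev:", statistics.stdev(data))
q1, q3 = np.percentile(data, [25, 75])
print("IQR:", q3 - q1)
```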

Inferential Statistics
While descriptive statistics provides us with tools to summarize and
describe data, inferential statistics allows us to draw inferences and make
conclusions about populations based on sample data. Inferential statistics
helps to develop a good understanding of the population data by analyzing the
samples obtained from it. It helps in making generalizations about the
population by using various analytical tests and tools.
The procedure involves choosing a sample and then applying tools such as
regression analysis and hypothesis testing.
Statistical inference is the branch of statistics concerned with drawing
conclusions and/or making decisions concerning a population based
only on sample data.

Descriptive statistics and inferential statistics are two branches of
statistics: the former describes the data, while the latter helps us make
inferences from the data.

Inferential statistics includes:

● Population
● Sample
● Hypothesis testing
● Z-test
● F-test
● T-test
● ANOVA
● Chi-square test
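A hedged sketch of one test from the list above: a one-sample t-test checking whether a sample mean differs from a claimed population mean. The sample values and the claimed mean of 50 are invented for illustration; SciPy is assumed to be available.

```python
# One-sample t-test on made-up data.
from scipy import stats

sample = [48.2, 51.5, 49.8, 50.9, 47.6, 52.3, 49.1, 50.4]
claimed_population_mean = 50.0

t_stat, p_value = stats.ttest_1samp(sample, claimed_population_mean)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Conventional reading: if p < 0.05, reject the null hypothesis that the
# population mean equals 50; otherwise the sample gives no evidence against it.
alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")
```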

Example of Inferential Statistics

Suppose you are cooking a recipe and you want to taste it before
serving it to your guests to get an idea of the dish. You would never eat
the full dish to get that idea; rather, you taste a very small portion of
your dish with a spoon.

● So here you are only doing exploratory analysis, getting an idea of
what you cooked from the sample in your hand.
● Next, if you generalize that your dish requires some extra sugar
or salt, then you are making an inference.
● To get a valid and correct inference, the portion of the dish that
you tasted should be representative of the whole dish. Otherwise
the conclusion will be wrong.

Statistics is a powerful and indispensable tool in various fields,


providing a systematic framework for collecting, analyzing,
interpreting, and presenting data. Through descriptive statistics, it
enables the summarization and visualization of data, offering valuable
insights into patterns, trends, and relationships. Additionally, inferential
statistics empowers researchers to make informed decisions and
predictions about populations based on sample data. However, the
validity of statistical conclusions relies heavily on the quality of data
collection, appropriate methodologies, and sound statistical reasoning.
In essence, statistics serves as a fundamental pillar for
evidence-based decision-making, problem-solving, and understanding
complex phenomena in the modern world.

Population & Sample
Population
Population refers to the entire set of observations, events, objects
about which you want to gather information. It makes a Data pool for
study.
Population refers to the entire group of individuals, items, data points
that we are interested in studying.
It represents the entire group that you are interested in studying.
Population includes every possible unit (or) element that falls within the
scope of your study.
The population is the study's relevant data.

Types of Populations:
Finite Population: A population is considered finite if it consists of a
distinct and countable number of elements. For example, the
population of students in this class would be a finite population.

Infinite Population: An infinite population is one that is too large to
count or is theoretically infinite. For instance, the population of all
possible measurements of human heights is effectively infinite.

Key points

Parameter:A parameter is a numerical characteristic of a population.


It is often represented by Greek letters. For example, the population
mean (μ) and population standard deviation (σ) are parameters.

Census vs. Sample:A census involves collecting data from every


individual in the population. However, due to practical constraints,
researchers often use a sample, which is a subset of the population, to
make inferences about the population as a whole.

Inferential Statistics:
Inferential statistics can be defined as a field of statistics that uses
analytical tools for inferring conclusions about a population by
examining random samples. The goal of inferential statistics is to make
conclusions about a population. In inferential statistics, a statistic is
taken from the sample data (e.g., the sample mean) that used to make
inferences about the population parameter (e.g., the population mean).

Example of Population

All the members of parliament form the population (545); the
members who are in the central cabinet form a sample (29).

Sample

A sample is defined as a smaller and more manageable representation
of a larger population: a subset of the population that contains the
characteristics of the population.

A sample is an unbiased subset of the population that best represents the
whole data.

When you conduct research about a group of people, it's rarely
possible to collect data from every person in that group. Instead, you
select a sample. The sample is the group of individuals who will actually
participate in the research.

Sampling is the process in statistics and data analysis of selecting a
subset of the data from a larger population or dataset in order to make
conclusions about the population.

The subset is chosen in such a way that the sample represents all the
characteristics of the population, meaning it shouldn't be biased.

The main goal of sampling is to make conclusions about the population
without studying every individual object, event, or item in the
population.

Choosing a sample from a population is essential for several
reasons.

Feasibility: Sometimes it is practically impossible to study the whole
population due to factors such as time, cost, and logistics; in this
situation, sampling allows researchers to collect information that is
representative of the population.

Resource efficiency: Conducting research (or) Collecting data from an
entire population can be time consuming and expensive.

Time Constraints: Studying the total population is often impractical
and very time consuming. Sampling enables the researcher to get
results quickly once the sample is analyzed or tested.

Reducing Bias: Well-designed sampling procedures help reduce bias in the sample.

Representative sample
Representative sample is the subset of a population that accurately
reflects the characteristics of the entire population.
The selection process for a representative sample aims to include the
individuals, objects, data points from various subgroups in the
population proportionally to their presence in the population.

Random Sample
A random sample is also a subset of the population, chosen through a
random selection process in which each member of the population has
an equal chance of being selected into the sample.
Random Sample aims to eliminate bias in the selection process,
ensuring that every individual element in the population has an equal
opportunity to be selected into the sample.

Probability sampling
Every member of the population has an equal chance of being
selected, which ensures that the sample is unbiased and representative.
It is a sampling technique that involves randomly selecting a group of
people from a larger population.
Non-Probability sampling
The non-probability sampling method is a method in which the
researcher selects the sample based on subjective judgment rather
than random selection.

Sampling with replacement (SWR): selected items are returned to the
population and can be selected again for the sample. In sampling
without replacement (SWOR), selected items are not returned and
cannot be drawn again.

Random sampling is a process in which each available member of the
population being sampled has an equal chance of being chosen for the
sample at each draw. The sample that results is called a simple random
sample.
In other words, random sampling is the process in which each
available observation or item of the population has an equal probability
or equal chance of getting into the sample in a random manner; this is
called simple random sampling.

There are two types of sampling: replacement sampling, where after
each draw observations are returned to the population for possible
future reselection, and non-replacement sampling, where observations
are not available for future draws once they have been selected.
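A small Python sketch of simple random sampling with and without replacement (SWR / SWOR), using only the standard library; the numbered "population" and the fixed seed are invented for illustration.

```python
# Simple random sampling with and without replacement.
import random

random.seed(42)                     # fixed seed so the sketch is reproducible
population = list(range(1, 101))    # a population of 100 numbered units

# Sampling WITHOUT replacement: each unit can appear at most once
swor = random.sample(population, k=10)

# Sampling WITH replacement: a unit may be drawn more than once
swr = random.choices(population, k=10)

print("SWOR:", swor)
print("SWR :", swr)
```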

Probability sampling types:

Systematic Sampling:
Systematic sampling is essentially the same as random sampling,
except that it’s usually a little easier to carry out. Everyone in the
population is numbered, but instead of random numbers being
generated, people are randomly selected at regular intervals.

Ex: All the company’s employees are listed alphabetically. From the first 10
numbers, a starting point is selected at random; say the starting point is
number 6. Starting from number 6, you select every 10th employee on the list
(6, 16, 26, 36, 46, 56, and so on) until you have a sample size of 100
people.
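A Python sketch of the systematic-sampling example above: employees are listed, a random start between 1 and 10 is picked, then every 10th employee is taken. The employee IDs and list size are invented; only the interval logic comes from the example.

```python
# Systematic sampling: random start, then fixed interval.
import random

random.seed(0)
employees = [f"emp_{i:04d}" for i in range(1, 1001)]   # 1000 listed employees

interval = 10                                # sampling interval (every 10th)
start = random.randint(1, interval)          # random starting point, e.g. 6
sample = employees[start - 1::interval]      # start, start+10, start+20, ...

print("starting point:", start)
print("sample size:", len(sample))           # 1000 / 10 = 100 employees
print(sample[:5])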

Stratified Sampling
Stratified sampling involves dividing the population into
subpopulations that may differ in important ways. It allows you to draw
more precise conclusions by ensuring that every subgroup is properly
represented in the sample.

To use this sampling method, you divide the population into


subgroups (called strata) based on the relevant characteristics (e.g.,
gender identity, age range, income bracket, job role).

A stratum is a homogeneous group within the population whose members
share similar characteristics.

EX: The number of female employees is 800, and the number of male
employees is 200. You want the sample to be representative of the
company’s gender balance, so you divide the population into two
groups based on gender. Using random sampling within each group,
you select 80 female employees and 20 male employees. This gives
you a sample of 100 people.
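A Python sketch of the stratified example above: 800 female and 200 male employees, with a sample of 100 drawn in proportion to each stratum (80 and 20). The individual employee records are invented; the 80/20 split comes from the example.

```python
# Proportional stratified sampling.
import random

random.seed(1)
strata = {
    "female": [f"F{i}" for i in range(800)],
    "male":   [f"M{i}" for i in range(200)],
}
total = sum(len(members) for members in strata.values())
sample_size = 100

sample = []
for name, members in strata.items():
    n = round(sample_size * len(members) / total)   # proportional allocation
    sample.extend(random.sample(members, n))        # SRS within each stratum
    print(name, "->", n, "selected")

print("total sample size:", len(sample))
```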

Clustered Sampling
Cluster sampling is a method of probability sampling in which a large
population is divided into smaller groups called clusters, and clusters
are randomly selected to make up the sample. This method is most
commonly used when the sample size and population size are very
large.

Ex: Cluster sampling is more useful when a survey needs to be
conducted over a larger population. When the population is too large
for you to survey it directly, that’s where cluster sampling comes in.
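A sketch of one-stage cluster sampling in Python: the population is grouped into clusters (here, invented branches), a few clusters are chosen at random, and every member of the chosen clusters enters the sample. The cluster names and members are assumptions for illustration only.

```python
# One-stage cluster sampling on made-up clusters.
import random

random.seed(2)
clusters = {
    "branch_A": ["a1", "a2", "a3", "a4"],
    "branch_B": ["b1", "b2", "b3"],
    "branch_C": ["c1", "c2", "c3", "c4", "c5"],
    "branch_D": ["d1", "d2"],
}

chosen = random.sample(list(clusters), k=2)     # randomly pick 2 clusters
sample = [member for name in chosen for member in clusters[name]]

print("chosen clusters:", chosen)
print("sample:", sample)
```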

Population Parameter:
● A population parameter in statistics refers to a measure or
characteristic of the entire population being studied.
● For a given population it is a fixed, constant value.
● Population parameters are generally unknown and are inferred from
sample data.
● Population size is denoted by “N”
● Population Mean - μ
● Population Standard Deviation - σ
● Population Variance - σ²

Sample Statistic:
● A sample statistic refers to a measure or characteristic that is
calculated from data collected from a sample of individuals within a
population.
● For a given sample it is a fixed, computed value, although it varies
from sample to sample.
● Sample statistics are calculated to infer the population
parameters.
● Sample size is denoted by “n”
● Sample Mean - x̄
● Sample Standard Deviation - s
● Sample Variance - s²
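A Python sketch contrasting population parameters with sample statistics, using the standard library; the normally distributed population values, the seed, and the sizes N = 10,000 and n = 100 are assumptions for illustration.

```python
# Population parameters vs. sample statistics on simulated data.
import random
import statistics

random.seed(3)
population = [random.gauss(50, 10) for _ in range(10_000)]   # N = 10,000
sample = random.sample(population, 100)                      # n = 100

# Parameters: computed over the whole population (usually unknown in practice)
mu = statistics.mean(population)
sigma = statistics.pstdev(population)        # population standard deviation

# Statistics: computed from the sample, used to estimate the parameters
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)                 # sample standard deviation (n-1)

print(f"mu ≈ {mu:.2f}, sigma ≈ {sigma:.2f}")
print(f"x̄ ≈ {x_bar:.2f}, s ≈ {s:.2f}")
```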

Collection of Data

In statistics, data collection is the act of gathering information from
multiple sources to find a solution to a research problem. It helps you
evaluate the results of the problem. Data collection methods allow
people to determine answers to relevant questions. Most organizations
use data collection techniques to make assumptions about future
probabilities and trends.

The term "dataset" is used to describe a collection of data. Typically, a


dataset is organized around a particular theme or purpose, consisting
of a structured collection of related data points. These data points can
represent various types of information, depending on the context, such
as numerical values, text, or categorical data. Datasets are
fundamental to data analysis, machine learning, and statistical
research, serving as the raw material that researchers and analysts use
to derive insights, build models, and make decisions.

Key Characteristics of Datasets Include:

Structure: Datasets are often structured in a way that facilitates


analysis. This can involve organizing the data into rows and columns in
a table, where rows typically represent individual records (such as
survey responses or transaction details), and columns represent
specific variables (such as age, income, purchase amount).

Purpose: The purpose of collecting a dataset can range from academic


research to business intelligence, policy formulation, and beyond. The
goal of data collection often influences the design of the dataset,
including what information is collected and how it is structured.

Understanding how to work with datasets, including collection,


cleaning, analysis, and interpretation, is a critical skill across many
disciplines, reflecting the central role of data in the modern world.
Applications of Datasets:

Statistical Analysis: Researchers and analysts use datasets to perform


statistical tests and analyses, helping to understand trends, relationships, and
patterns within the data.

Machine Learning and AI: In the field of machine learning, datasets are used
to train and evaluate models. The quality and relevance of the dataset
significantly impact the performance of these models.

Decision Making: Businesses and organizations rely on datasets to inform


strategic decisions, such as market entry, product development, and customer
relationship management.

Policy and Planning: Governments and public institutions use datasets to


inform policy decisions, urban planning, and resource allocation, aiming to
address societal needs effectively.
Quantitative Data and Qualitative Data

When it comes to conducting data research it is essential to grasp the


distinctions between qualitative data & quantitative data. This
understanding will enable you to employ collection, hypothesis and
analysis methods.
At its easiest, Data can be broken down into two distinctive categories:
Quantitative Data and Qualitative Data. But what’s the contrast
between the two? And when do you have to employ them? And how
can you employ them together?
So let’s demystify the complexities by completely understanding them,
similarities and differences between qualitative and quantitative data
and how they are both crucial to the success of any data research and
analysis.
Key takeaways:
Quantitative data refers to any information that can be quantified,
counted, or measured and given a numerical value. Qualitative data is
expressive and descriptive in nature, expressed in terms of language
rather than numerical values.

Quantitative data can be counted, measured, and expressed using


numerical values. Qualitative data is descriptive and conceptual.
Qualitative data can be categorized based on traits and
characteristics.

Quantitative data
Quantitative data refers to data that contains any information that
can be quantified — that is, numbers. If it can be counted or
measured, and given a numerical value, it's quantitative in nature.
Quantitative data can tell you "how many", "how much", or "how
often".
Examples:
How many people attended last week's webinar? How much revenue
did the company generate in 2019? How often do certain customer
groups use online banking?
Quantitative Data is more Objective in Nature, and can only be
expressed in numericals.
Characteristics of Quantitative Data

Precision

Numerical data allows for precise measurement and unit of analysis,


providing more accurate results than other data collection forms.

Arithmetic Operations

Arithmetic Operations can certainly be applied to quantitative data.

Quantifiability

Quantitative data can be easily Quantified and Analyzed Statistically.

How to Collect the Quantitative data

Closed-ended questions are typically used for collecting quantitative
data. These questions only offer a small number of possible answers,
such as multiple-choice, rating scales, or yes/no responses. The
structured format of closed-ended questions makes it simple to analyze
and interpret the data they yield statistically.

Quantitative data can be analyzed and interpreted to determine the


relationship between variables because Quantitative data is obtained
through experiments with quantifiable results.

When to use the Quantitative Data

Quantitative data is appropriate and useful in various situations where


numerical information is needed to measure, compare, or analyze
phenomena.

Measurement and Comparison:

Quantitative data allows for precise measurement and comparison of


variables.For instance, in scientific research, quantitative data is
crucial for measuring physical quantities like temperature, weight, or
time. It enables researchers to quantify the extent of change or
differences accurately.
Statistical Analysis:

Quantitative data is essential for conducting statistical analyses, such


as hypothesis testing or regression analysis. These methods rely on
numerical values to establish relationships, make predictions, or draw
conclusions. For example, in market research, quantitative data can be
used to analyze consumer preferences by examining survey responses
on a numerical scale.

Qualitative data

Qualitative data is descriptive, expressed in terms of feelings and
language rather than numerical values. Qualitative data is data that
cannot be numbered, measured, or effectively communicated using
numbers. It is collected from text, audio, and images and shared
through data visualization tools such as word clouds, timelines, graph
databases, concept maps, and infographics.

Qualitative data can tell you "why" and "what".


Qualitative data is a form of data that provides rich, detailed insights
into human behavior, experiences, and perceptions. Qualitative data is
concerned with words, narratives, and the meanings people attach to
their experiences.

Examples:
Interview Transcripts: Verbatim transcripts of interviews with people,
capturing their spoken words and reactions to open-ended questions.
Observation Notes: Detailed notes taken during the observation of a
specific phenomenon or event, describing what was seen, heard, or
experienced.
Photographs and Videos: Visual data that can be analyzed for content,
context, and the emotions conveyed within the images or recordings.

Characteristics:

Subjective:

Qualitative data often reflects the perspectives and interpretations of


individuals.
Richness: It captures the depth and complexity of human experiences.

Interpretation: Analysis involves Identifying the Patterns, themes.

How to Collect the Qualitative data


With high-quality information, you improve the quality of
decision-making. But you also improve the quality of the expected
results with each effort. Qualitative data collection methods are
exploratory. They tend to focus more on gaining insights and
understanding why by digging deeper. Although qualitative data cannot
be easily quantified, measuring or analyzing it can become a problem.
Because of this lack of measurability, qualitative data collection
methods are largely unstructured, or only structured to some extent.
Let’s explore the most common methods used for the collection of
qualitative data:

Open ended questions are typically used for collecting qualitative


data.

Open-ended questions are free-form survey questions that permit
and encourage respondents to reply in an open text format, based
on their full knowledge, feelings, and understanding.

There are several methods you can use to collect qualitative data,
including:
● Experiments.
● Controlled observations.
● Surveys: paper, kiosk, mobile, questionnaires.
● Longitudinal studies.
● Polls.
● Telephone interviews.
● Face-to-face interviews.

When to use the Qualitative Data

Qualitative data are usually collected in various research and data


collection contexts when the goal is to understand and investigate
complex phenomena, to gain insight into people's experiences,
perceptions and behaviors, or to create multifaceted descriptions of a
topic. Here are some common situations and research methods where
qualitative data collection is appropriate:

● Exploratory Research:

Qualitative data collection is often used at the beginning of a research


project to explore a topic in depth. This helps researchers formulate
hypotheses and refine research questions for subsequent quantitative
studies.

● Content Analysis:

Qualitative data can be collected through the analysis of text, audio, or


visual materials, such as documents, social media posts, or media
content. Content analysis helps identify patterns and themes within the
data.

Qualitative Data vs. Quantitative Data

● Qualitative data is a method of developing a better understanding of human and social sciences, describing human behavior and personality in descriptive language. Quantitative data is generated using logical, statistical, and mathematical techniques; it is measured and counted.
● Qualitative: subjective approach. Quantitative: objective approach.
● Qualitative: open-ended questions. Quantitative: closed-ended questions.
● Qualitative data is found in detailed descriptions. Quantitative data is found in RDBMS, spreadsheets, and tables.
● Arithmetic operations cannot be applied to qualitative data; they can be applied to quantitative data.
● Qualitative data is hard to analyze; quantitative data is easy to analyze.
● Qualitative data answers "Why" and "What"; quantitative data answers "How many", "How much", "How often".
● Qualitative: fast accumulation. Quantitative: slow accumulation.
● Qualitative data can be stored in MongoDB or other NoSQL stores; quantitative data can be stored in SQL.
● Qualitative data is referred to as "unstructured data"; quantitative data is referred to as "structured data".

Structured Data

What is Data
Data is a collection of facts or statistics about an individual. Data can be
written or observed; it can be a number, an image, a graph, or a symbol. For
instance, an individual price, weight, address, age, name, temperature, date,
or distance can be data.
Data is just a piece of information. It doesn't mean anything on its own. You
need to analyze, organize, and interpret it to make sense of it. Data can be
simple—and may even seem useless until it is analyzed, organized, and
interpreted.

What is Information
Information can be defined as knowledge acquired through observation,
communication, investigation, or education. In other words, information can
be defined as the outcome of the analysis and interpretation of data. While
data refers to the individual facts, figures, or charts, information refers to
the perception of these facts.

For example, a set of data might contain temperature readings in a place


over a number of years. Without any context, these temperatures are
meaningless. However, if you analyze and organize this data, you can
identify seasonal temperature trends or even more general climate patterns.
Only when your data is organized and structured in a meaningful way can it
provide useful information for others.

There are two main types of data:

● Quantitative data
● Qualitative data

Structured data:

Structured data is a way of organizing and presenting information in a


standardized format. Structured data is data that has been structured
and formatted in a manner that allows it to be read and understood by
both humans and computers. This is usually accomplished through the
utilization of a defined schema or data model that serves as a
framework for the data.Structured data enables the easy storage of
data, easy data retrieval and analysis of the data.
Easy Storage: Structured data is predefined within a set of constraints,
making it easier to organize and store.
Easy retrieval: With structured data, information can be easily
retrieved either on its own or in combination with data from other
fields.
Data Analysis: The defined structure and clear organization of
elements make it easier to analyze and draw insights from the data.

Structured data can be found in various places

DATABASES: Structured data is commonly stored in databases, which


provide a structured framework for organizing and storing data.
Relational databases, such as MySQL and PostgreSQL, use tables with
rows and columns to store structured data.
SPREADSHEETS: Structured data can also be stored in spreadsheets
such as Microsoft Excel and Google Sheets. Each column contains a
particular attribute or field, and each row contains a record or data instance.

It's important to note that structured data is intentionally


organized and formatted for easier processing and analysis.

Structured data is often represented in rows and columns, similar


to a table format. This tabular structure allows for organized and
systematic storage of information.
By organizing structured data into rows and columns, businesses can
efficiently store and manage large amounts of information, making it
accessible for various operations, including searching, filtering, sorting,
and performing calculations or analysis.
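A minimal Python sketch of structured, tabular data and the searching, filtering, and sorting operations just described; pandas is assumed to be available, and the product records and column names are invented.

```python
# Structured (tabular) data: rows are records, columns are attributes.
import pandas as pd

products = pd.DataFrame(
    {
        "product_name": ["Pen", "Notebook", "Backpack", "Lamp"],
        "price": [1.50, 3.20, 24.99, 15.00],
        "category": ["Stationery", "Stationery", "Bags", "Home"],
    }
)

# Filtering on a column (searching)
print(products[products["category"] == "Stationery"])

# Sorting by a column
print(products.sort_values("price", ascending=False))

# A simple calculation per category
print(products.groupby("category")["price"].mean())
```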
Characteristics of Structured Data

Tabular data:

Tabular format is the representation of structured data in a table. Row


data represents a record or data instance, while column data
represents a particular attribute or property of the data.

It offers a well-structured framework that facilitates effective data


analysis and comprehension. This format is widely used to store and
display a wide range of data, including customer data, sales data,
surveys, and much more.

Fixed schema:
Structured data usually has a built-in, predefined schema: a set of
rules that defines how data is stored and organized in a database or
dataset. This means that the structure of the data is defined in advance,
and each column has a defined data type and meaning.
Consistency:
Structured data plays a vital role in maintaining consistency in data
entry, which in turn makes performing operations like searching and
sorting much easier. For example, if a dataset of products is structured
with consistent fields like "product name," "price," and "category," it
becomes more efficient to search for a specific product or sort the
data based on price or category.
Relational Nature:
Structured data is relational in nature. It refers to data that is organized
and represented in a fixed format, typically using tables and relationships
between those tables. Relational databases are commonly used to store
and manage structured data, where data is structured into tables with
rows and columns. The relationships between the tables are defined using
keys, such as primary keys and foreign keys, to establish links.
Examples
Relational Databases: Relational databases organize data into tables
with predefined columns and rows. They use a structured query language
(SQL) to retrieve and manipulate data.
Excel Spreadsheets: Excel spreadsheets organize data into rows and
columns, allowing users to input, sort, and analyze data in a structured
format.
Importance of Structured Data

Data Integration:

Structured data allows you to integrate data across multiple data


sources. By using common data models and data formats, structured
data allows you to seamlessly combine disparate data sets. This
integration improves data analysis by providing an integrated view
and allowing you to identify patterns, relationships, and correlations
across data sets.
Data Analysis:
Structured data enables powerful data analysis tools, including data
mining, statistics and machine learning, to uncover meaningful
insights, trends and patterns that inform decision-making. Structured
data enables organizations to gain insights into customer behavior,
market dynamics, operational efficiencies and more.

Scalability and efficiency:

Structured data provides scalability and efficiency, particularly when


dealing with large datasets. Structured databases and data
management systems are designed to handle substantial volumes of
data with optimized storage and retrieval mechanisms. This scalability
and efficiency contribute to faster data processing, improved
performance, and reduced storage costs.
Structured Data vs. Unstructured Data

● Structured data has a predefined format; unstructured data does not.
● Structured data is stored in RDBMS or Excel; unstructured data is stored in MongoDB or other NoSQL stores.
● Structured data is easy to analyze; unstructured data is hard to analyze.
● Structured data is quantitative; unstructured data is qualitative.
● Arithmetic operations can be performed on structured data but not on unstructured data.
● Structured data is schema dependent; unstructured data is not schema dependent.
● Examples of structured data: customer information, transaction records, inventory lists, financial data. Examples of unstructured data: emails, social media posts, multimedia files, sensor data.
● Structured data is easily analyzed using traditional statistical methods and data mining techniques; unstructured data requires advanced techniques like natural language processing (NLP) and machine learning for analysis.
● Use cases for structured data: business intelligence, data analytics, financial reporting. Use cases for unstructured data: sentiment analysis, social media monitoring, text mining.
Types of Structured data

Structured data can be divided into two main types:

numerical data and categorical data.

Numerical Data

Numerical data refers to the data that is in the form of numbers, and
not in any language or descriptive form. Often referred to as
quantitative data, numerical data is collected in number form and
stands different from any form of number data types due to its ability
to be statistically and arithmetically calculated.

It doesn’t involve any natural language description and is quantitative


in nature.

Can Perform Arithmetic operations.

It can also be Counted & Measured.

Can apply Statistical Measures such as [Central Tendency, Variability]

Examples:

● Age of individuals: 20, 35, 52, 67, etc.


● Temperature in Celsius or Fahrenheit: 25°C, 98.6°F, etc.
● Height of individuals: 165 cm, 6 feet, etc.
● Weight of objects or individuals: 50 kg, 150 lbs, etc.
● Scores on a test or exam: 80 out of 100, 7.5 out of 10, etc.
● Income levels: $40,000, $75,000, etc.
● Time duration: 2 hours, 30 minutes, etc.
● Number of items sold: 100 units, 500 pieces, etc.
● Ratings on a scale: 4 out of 5, 8 out of 10, etc.

● Sales figures: $10,000, $1 million, etc.


Categorical Data

Categorical data refers to a data type that can be stored and identified,
classified based on the names or labels given to them. The data
collected in the categorical form is also known as qualitative data.

Each dataset can be grouped and labeled depending on their matching


qualities, under only one category. This makes the categories
mutually exclusive.

Cannot Perform Arithmetic Operations on Categorical Data.

Only limited statistical measures, such as the mode and frequency counts, can be applied to categorical data.

Examples

● Gender: Male, Female, Other


● Marital Status: Single, Married, Divorced, Widowed
● Hair Color: Brown, Black, Blonde, Red
● Eye Color: Blue, Brown, Green, Hazel
● Education Level: High School, Bachelor's Degree, Master's
Degree, PhD
● Blood Type: A, B, AB, O
● Clothing Size: Small, Medium, Large, Extra Large
● Favorite Color: Red, Blue, Green, Yellow, Purple
● Occupation: Teacher, Engineer, Doctor, Lawyer, Artist
● Vehicle Type: Car, Truck, Motorcycle, Bicycle

Numerical data (quantitative data) is again divided into two more types, namely:

● Discrete Data
● Continuous Data
Discrete Data:

Data that can only take on certain values are discrete data. Discrete
data, also called discrete variables, are sets of data that accept only
specific values. It is typically represented as whole numbers or
integers.

Discrete variables are represented by discrete data and can only be
counted in a limited amount of time; these variables are countable
rather than measurable.

Features of discrete data

Countable: Discrete data can only take on specific, countable values.


This means that the data can be counted and represented by whole
numbers or integers.

Non-continuous: Unlike continuous data, which can take on any


value within a range, discrete data can only have specific values.

Categorical or ordinal: Discrete data can be categorical or ordinal in


nature. Categorical data represents different categories or groups.

Estimates of Central Tendency: For discrete data, central tendency


measures like the mean, median, and mode, Trimmed Mean,
weighted Median can be computed. The typical or central value of the
data is revealed by these measures.

Examples

how many tickets are sold in a given day.

how many students are enrolled in your course.

the quantity of workers in an organization.

how many computers there are in every department.

the quantity of clients who purchased various goods.

The quantity of groceries you purchase every week.


Continuous Data

Continuous data is defined as a data collection in which observations


are collected over time and a series of values are continuously
updated. It is in contrast to discrete data, where observations are
measured at specific intervals and are separated by gaps.

These numbers are not always as clean and tidy as those in discrete
data, as they are normally gathered from exact measurements.
Measuring a particular subject over time allows us to define a range
within which we can reasonably expect to gather more data.

In statistical analysis and machine learning, continuous data is often


used to model and analyze real-world phenomena. It is commonly
represented by graphs and is utilized to study patterns, trends, and
relationships among variables.

Features of Continuous data

Continuous data changes over time and can take different values at
different time intervals.

Continuous data comprises random variables that may or may not be


whole numbers.

Continuous data is measured using data analysis methods such as line


graphs and skews.

Examples of Continuous Data

Daily wind speed

Freezer temperature

Weight of newborn babies

Length of customer service calls

Product box measurements and weight


Categorical data

Categorical data refers to the data, which is classified, categorized


based on the names, labels given to them. Categorical data refers to a
type of data that represents characteristics or qualities of individuals
or objects. It includes data that is divided into distinct categories or
groups, where each data point belongs to one specific category.

Categorical measurements are expressed in terms of natural


language descriptions, but not in terms of numbers.

The easy way to determine whether the given data is categorical or


numerical data is to calculate the average. If you are able to calculate
the average, then it is considered to be numerical data. If you cannot
calculate the average, then it is considered to be a categorical data.

Categorical data is qualitative. That is, it describes an event using a


string of words rather than numbers. Categorical data is analyzed
using mode

EXAMPLES

gender (male, female)

geographic location (north, south, east, west)

occupation (doctor, lawyer, teacher, etc.)

Countries(India, USA, UK, Russia, Japan).

There are two categories of categorical data, namely:
nominal data and ordinal data.

Nominal Data

Nominal data is a type of categorical data where the categories or


groups have no inherent order or ranking. In other words, the labels
or names assigned to the groups are just labels and do not carry any
quantitative value.

Nominal data are categorical, the categories are being mutually


exclusive without any overlap. The categories of nominal data are
purely descriptive in nature, they are not associated with any
quantitative or numeric value. Nominal data can never be quantified.
Nominal data cannot be put into any order. None of the categories can
be greater than or worth more than one another. The mean of nominal
data cannot be calculated even if the data is arranged in alphabetical
order. The mode is the only measure of central tendency for nominal
data. In most cases, nominal data is alphabetical.

Example:

If we have a variable called "color" with categories such as red, blue,


and green, these categories are considered nominal data. Each
category is distinct, but they do not have an inherent order or
numerical value associated with them.

Which city do you live in?

Answer: (Kurnool, Hyderabad, Bangalore). You cannot impose an order on these
cities, and these cities are not associated with any numerical value.

Ordinal Data

Ordinal data is a kind of qualitative data that classifies variables into


ordered categories. which have an order or rank based like from high
to low or low to high. The data can be arranged in an order, such as
from smallest to largest or least preferred to most preferred, but the
differences between the values may not be evenly spaced.
● Ordinal data are non-numeric or categorical but may use
numerical figures as categorizing labels.

● Ordinal data are always ranked in some natural order.

● Ordinal data can be used to calculate summary statistics,


e.g., frequency distribution, median, and mode, range of
variables.

Examples

Rank economic status of countries according to Income level range.

● Low income countries ($1,025 or less)

● Lower middle income countries ($1,026 - $4,035)

● Upper middle income countries ($4,036 - $12,475)

● High income countries ($12,476 or greater)

Rank the education level

● Elementary

● High School

● Intermediate

● Graduation

● Post Graduation

● Doctorate

When analyzing nominal data, common techniques include:

● Frequency distribution

● Mode
● Chi-square test

When analyzing ordinal data, common techniques include:

● Descriptive statistics

● Rank correlations
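A short Python sketch of the analysis techniques above: frequency counts and the mode for nominal data, and an ordered categorical plus median for ordinal data. The survey responses and the education-level ordering are invented for illustration; pandas is assumed to be available.

```python
# Analyzing nominal and ordinal data.
from collections import Counter
import pandas as pd

# Nominal: frequency distribution and mode
colors = ["red", "blue", "blue", "green", "red", "blue"]
counts = Counter(colors)
print("frequencies:", counts)
print("mode:", counts.most_common(1)[0][0])

# Ordinal: an ordered categorical so ranking and the median make sense
levels = ["Elementary", "High School", "Intermediate",
          "Graduation", "Post Graduation", "Doctorate"]
education = pd.Categorical(
    ["High School", "Graduation", "Doctorate", "High School", "Intermediate"],
    categories=levels, ordered=True,
)
print("ordered categories:", list(education.categories))
# Median level: middle value of the responses once they are ranked
ranked = pd.Series(education).sort_values().reset_index(drop=True)
print("median level:", ranked.iloc[len(ranked) // 2])
```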
Data Visualization

Data visualization is the process of transforming complex data sets
into captivating visuals so that any audience can understand the story
the data is telling.

Data visualization is the graphical representation of information and
data in a pictorial or graphical format. Data visualization tools provide
an accessible way to see and understand trends, patterns, and outliers
in data. Data visualization tools and technologies are essential to
analyzing massive amounts of information and making data-driven
decisions. The concept of using pictures to understand data has been
used for centuries. General types of data visualization are charts,
tables, graphs, maps, and dashboards.

Why Data Visualization is important:

Data Visualization Discovers the Trends in Data


The most important thing that data visualization does is discover the
trends in data. After all, it is much easier to observe data trends when
all the data is laid out in front of you in a visual form as compared to
data in a table.

Data Visualization Tells a Data Story


Data visualization is also a medium to tell a data story to the viewers.
The visualization can be used to present the data facts in an
easy-to-understand form while telling a story and leading the viewers
to an inevitable conclusion. This data story, like any other type of story,
should have a good beginning, a basic plot, and an ending that it is
leading towards. For example, if a data analyst has to craft a data
visualization for company executives detailing the profits of various
products, then the data story can start with the profits and losses of
multiple products and move on to recommendations on how to tackle
the losses.
Benefits of data visualization

Data visualization can be used in many contexts in nearly every field,


like public policy, finance, marketing, retail, education, sports, history,
and more. Here are the benefits of data visualization:
Storytelling: People are drawn to colors and patterns in clothing, arts
and culture, architecture, and more. Data is no different—colors and
patterns allow us to visualize the story within the data.

Accessibility: Information is shared in an accessible,
easy-to-understand manner for a variety of audiences.

Visualize relationships: It’s easier to spot the relationships and


patterns within a data set when the information is presented in a graph
or chart.

Exploration: More accessible data means more opportunities to


explore, collaborate, and inform actionable decisions.

BOX PLOT:

A box plot is a graphical method for visualizing the distribution of data,
gaining insights, and making informed decisions. It is a type of chart
that depicts a group of numerical data through its quartiles, and it is
commonly used to understand the distribution of the data and to detect
outliers in the dataset.
A box plot summarizes the data using the five-number summary.

Box Plot represents the following points in a dataset.

● Minimum Value
● First Quartile (Q1 or 25th Percentile)
● Second Quartile (Q2 or 50th Percentile)
● Third Quartile (Q3 or 75th Percentile)
● Maximum Value

Q1 - the median of the values between the minimum value (min) and the median (Q2).
Q2 - the median of the entire dataset.
Q3 - the median of the values between the median (Q2) and the maximum value (max).
The horizontal lines extending from the box are the whiskers.

Q1 is also called the 25th percentile.
Q2 is also called the 50th percentile.
Q3 is also called the 75th percentile.

Outliers: Points lying beyond the minimum and maximum values are
outliers
Interquartile range: It is Q3-Q1. It is the spread or range of the middle
50% of the data.
Whiskers: From Minimum Value to Q1 is the first 25% of data
From Q3 to Maximum value is the last 25% of the data

● Median (Q2/50th percentile): The middle value of the data set


● First Quartile (Q1/25th percentile): The middle number between
the smallest number (not the “minimum”) and the median of the
data set
● Third Quartile (Q3/75th percentile): The middle value between
the median and the highest value (not the “maximum”) of the
dataset
● Interquartile Range (IQR): 25th to the 75th percentile
● Whiskers (horizontal lines)
● Outliers (points beyond min and max point)
● “Minimum”: Q1 - 1.5*IQR
● “Maximum”: Q3 + 1.5*IQR

Box plot Distributions

Box Plot for a Normal Distribution:

Median (Q2): In a normal distribution, the median (Q2) is equal to the


mean. It is located at the center of the box.

Quartiles (Q1 and Q3): The quartiles divide the distribution into four
equal parts, with approximately 25% of the data falling between each
quartile. In a standard normal distribution (mean = 0, standard
deviation = 1), Q1 is approximately -0.675 and Q3 is approximately
0.675.

Interquartile Range (IQR): The box in a box plot represents the


interquartile range, which spans from Q1 to Q3. For a standard normal
distribution, the IQR is approximately 1.35.

Whiskers: In a standard box plot, the whiskers extend to the most
extreme data points that lie within 1.5 times the IQR of Q1 and Q3.
For a standard normal distribution, these fences sit at roughly ±2.7
standard deviations from the mean.

Outliers: For a normal distribution, only about 0.7% of values fall
beyond the whisker fences, so in a sample drawn from a normal
distribution very few points are flagged as outliers.

A box plot for a normal distribution is symmetric: the median lies at
the center of the box and the two whiskers are of roughly equal length.

Boxplot for Skewed Distributions:


Median (Q2): The median (Q2) represents the midpoint of the dataset. In a
skewed distribution, the median is less affected by extreme values than the
mean. Within the box, it typically sits closer to the quartile on the
shorter-tail side, making the box appear asymmetric.

Quartiles (Q1 and Q3): The quartiles divide the dataset into four equal parts,
with approximately 25% of the data falling between each quartile. In a skewed
distribution, Q1 and Q3 might not be equidistant from the median due to the
skewness.

Interquartile Range (IQR): The box in a box plot represents the interquartile
range, which spans from Q1 to Q3. The length of the box indicates the spread of
the middle 50% of the data.

Whiskers: The whiskers extend from the quartiles to the smallest and largest
values within 1.5 times the IQR from Q1 and Q3, respectively. However, in
skewed distributions, the whisker lengths may vary due to the asymmetric
nature of the data.

Outliers: Outliers are data points that fall beyond the whiskers. In skewed
distributions, outliers tend to occur more frequently on the side of the
longer tail of the distribution.

Example:

17,17,18,19,20,22,23,25,33,64,10,5

Sort the dataset first.


5,10,17,17,18,19,20,22,23,25,33,64
Quartile positions and values: (3rd, 4th) → (17, 17); (6th, 7th) → (19, 20); (9th, 10th) → (23, 25)

5 10 17 17 18 19 20 22 23 25 33 64

25% 25% 25% 25%

Position of the number, for given percentile (Pn) =Percentile(N+1)/100


N= No. of items in the dataset
If the above result comes in float (in decimals) then, take the mean of 2
numbers of Pn and P(n+1)
If the result comes in integer, then take the value of Pn
In our example N = 12
Position number for 25th percentile= 25(12+1)/100= 3.25 (3,4)
The result is a decimal number, we will take mean of 3rd and 4th
number
Position number for 50th percentile = 50(12+1)/100 = 6.5 (6, 7)
The result is a decimal number, so we take the mean of the 6th and 7th
numbers.
Position number for 75th percentile = 75(12+1)/100 = 9.75 (9, 10)
The result is a decimal number, so we take the mean of the 9th and 10th
numbers.
Position for the 25th percentile = 3.25
Position for the 50th percentile = 6.5
Position for the 75th percentile = 9.75

Q1 = 17
Q2 = 19.5
Q3 = 24

Interquartile range (IQR)=Q3-Q1=24-17=7


Minimum Value= Q1-1.5 * IQR=17-1.5*7=6.5
Maximum Value= Q3 +1.5 * IQR=24+1.5*7=34.5
Outliers of the dataset= 5 & 64
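The same five-number summary can be checked in Python. The sketch below
implements the percentile rule used above (position = percentile × (N + 1) / 100,
averaging the two neighbouring values when the position is fractional); note that
library routines such as numpy.percentile may use a slightly different
interpolation rule and return slightly different quartiles.

def percentile_value(sorted_data, pct):
    # 1-based position according to the (N + 1) rule used in this example
    pos = pct * (len(sorted_data) + 1) / 100
    if pos == int(pos):
        return sorted_data[int(pos) - 1]
    lower = sorted_data[int(pos) - 1]   # value at the position just below
    upper = sorted_data[int(pos)]       # value at the position just above
    return (lower + upper) / 2          # average the two neighbours

data = sorted([17, 17, 18, 19, 20, 22, 23, 25, 33, 64, 10, 5])
q1 = percentile_value(data, 25)                          # 17.0
q2 = percentile_value(data, 50)                          # 19.5
q3 = percentile_value(data, 75)                          # 24.0
iqr = q3 - q1                                            # 7.0
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 6.5 and 34.5
outliers = [x for x in data if x < low_fence or x > high_fence]
print(q1, q2, q3, iqr, outliers)                         # 17.0 19.5 24.0 7.0 [5, 64]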

Bar Graph

Grouped Data

A bar graph can be defined as a graphical representation of data,


quantities, or numbers using bars or strips. They are used to compare
and contrast different types of data, frequencies, or other measures of
distinct categories of data.

For example, consider a dataset of ages of individuals. Instead of


listing every single age, the data can be grouped into intervals like
0-10, 11-20, 21-30, and so on. Then, the frequency of individuals falling
into each interval can be determined.

Ungrouped data

Ungrouped data is essentially the raw data you collect in its original
form without being organized into classes or categories. It's a
collection of individual data points that have not yet been subjected to
any processing or summarization. Working with ungrouped data
means dealing with each observation individually.
Ex: A list of exact ages of students in a class (e.g., 21, 23, 22, 20).


Bar graphs can be effectively used to represent both grouped and


ungrouped data, providing a visual comparison of quantities across
different categories or groups. The choice between using a bar graph
for grouped or ungrouped data depends on the nature of the data and
the specific insights you wish to glean from it.

Bar Graphs with Ungrouped Data

When using bar graphs to represent ungrouped data, each bar


represents an individual data point or category. This is particularly
useful for categorical data where you might compare the frequency or
magnitude of various categories directly. For instance, a bar graph
could display the number of students opting for different majors in a
college, with each major (a category) represented by a separate bar.

Bar Graphs with Grouped Data

For grouped data, bar graphs can show the distribution of data across
different intervals or groups. In this context, each bar represents a
group rather than an individual observation. For example, if you have
data on the ages of individuals in a population, you might group these
ages into 10-year intervals (0-9, 10-19, etc.) and use a bar graph to
show the number of individuals within each age group. This helps in
understanding the distribution of data across different ranges, making
it easier to identify patterns like skewness or bimodality.

The types of bar charts are as follows:

1. Vertical bar chart


2. Horizontal bar chart

(Figure: example bar graphs for grouped data and ungrouped data.)
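A minimal matplotlib sketch of the two cases (the counts here are
hypothetical, used only for illustration):

import matplotlib.pyplot as plt

# Ungrouped (categorical) data: one bar per category
majors = ["CSE", "ECE", "Mechanical", "Civil"]
students = [120, 90, 60, 45]

# Grouped data: one bar per age interval
age_groups = ["0-9", "10-19", "20-29", "30-39"]
counts = [14, 25, 40, 21]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(majors, students)                       # vertical bar chart
ax1.set_title("Ungrouped: students per major")
ax2.bar(age_groups, counts)
ax2.set_title("Grouped: individuals per age interval")
plt.tight_layout()
plt.show()

For a horizontal bar chart, ax.barh() can be used in place of ax.bar().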

Histogram

A histogram is a type of bar graph that represents the distribution of


numerical data by showing the number of data points that fall within a
range of values, known as bins or intervals. Unlike regular bar graphs
that are used for categorical data, histograms are specifically
designed to show the distribution of quantitative data, making them a
fundamental tool in statistical analysis for understanding the shape,
spread, variability, and central tendency of data.

Key Characteristics of Histograms

● Bins: The range of values is divided into intervals called bins. Bins
are usually of equal width; the number of bins chosen depends on
the data distribution and affects how the histogram looks.
● Frequency: The height of each bar represents the frequency or
count of data points within each bin. Alternatively, histograms
can represent relative frequency, showing the proportion of data
points in each bin relative to the total dataset.
● No Gaps: In histograms, bars are placed adjacent to each other
without gaps to emphasize the continuous nature of the data.

Uses of Histograms

● Understanding Distribution: Histograms are ideal for showing


how data are distributed across different ranges. They can help
identify patterns such as normal distribution, skewness, or
bimodality.
● Detecting Outliers: Isolated or sparsely populated bins far from the
main body of the data may indicate outliers or unusual data points.
● Comparing Data Sets: Overlaying histograms for different
datasets can provide insights into changes or differences in
distributions.
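
A minimal histogram sketch with numpy and matplotlib (the ages here are
simulated, purely to illustrate bins and frequencies):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
ages = rng.normal(loc=35, scale=10, size=500)   # simulated ages

plt.hist(ages, bins=10, edgecolor="black")      # 10 equal-width bins, no gaps
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Distribution of simulated ages")
plt.show()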

Estimates of Central Tendency

In the discipline of statistics, location estimates—also referred to as


measures of central tendency—are essential. They help us summarize
and understand the central value, or typical point, of a dataset.
Estimates of location are statistical measures that indicate where the
center or typical value of a dataset lies, making the data more
manageable for analysis by providing a single point that describes
where most of the data is concentrated.

Descriptive statistics

Descriptive statistics refers to a branch of statistics that involves


summarizing, organizing, and presenting data in a meaningful and
concise format.
It focuses on describing and analyzing a dataset's main features and
characteristics without making any generalizations or inferences to a
larger population.

The primary aim of descriptive statistics is to provide a clear and
concise summary of the data, enabling researchers or analysts to gain
insights and to understand the patterns, trends, and distributions
within the dataset.
This summary typically includes measures of central tendency (mean,
median, mode, trimmed mean, weighted mean), measures of
dispersion (range, variance, standard deviation), and the shape of the
distribution (skewness, kurtosis).

Descriptive statistics also involves a graphical representation of data


through charts, graphs, and tables, which can further aid in visualizing
and interpreting the information. Common graphical techniques
include histograms, bar charts, pie charts, scatter plots, and box plots.
Mean
The mean is one of the basic measures of central tendency, apart from
the mode and median. Mean is defined as the average of the given set
of values. It denotes the equal distribution of values for a given data
set.
Mean can also be referred to as the arithmetic mean.

Formula: Mean (μ) = (x1 + x2 + ... + xn) / n = (Σ xi) / n
xi = individual items
n = number of data points

Example:
x = 45, 95, 12, 52, 47, 35, 65, 88, 22
n = 9
μ = (45 + 95 + 12 + 52 + 47 + 35 + 65 + 88 + 22) / 9 = 461 / 9 ≈ 51.22
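A quick check of this calculation with Python's standard library:

import statistics

x = [45, 95, 12, 52, 47, 35, 65, 88, 22]
print(statistics.mean(x))   # 51.222..., i.e. 461 / 9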

The mean is influenced by outliers or Extreme Values.

Outliers
An outlier is an extreme value which is significantly different from the
other values in a dataset. When calculating the mean or average, these
extreme values can heavily impact the result.

Due to its calculation utilizing the sum of all the values, the mean takes
into consideration the value of each data point. As a result, if there
are outliers within the dataset, they can excessively influence the
average. Outliers can either increase or decrease the mean,
depending on their position relative to the other values.

It is important to be cautious when calculating the mean, especially in
datasets with outliers. In such cases, consider other measures of
central tendency, such as the median or mode, which are less affected
by outliers.

Example:
x= 15,85,95,75,42,12,1,93,501
x = 1, 12, 15, 42, 75, 85, 93, 95, 501 [1 and 501 are outliers]
Trimmed Mean
In statistics, the trimmed mean is a measure of central tendency that is
used when the dataset has outliers.

The trimmed mean is a statistical measure that calculates a dataset’s


average after removing a certain percentage of extreme values from
both ends. By excluding outliers, this statistical measure can give a
more precise representation of a dataset’s typical value or central
value of the dataset.
A 10% trimmed mean excludes the highest 10% of values and the
lowest 10%. In other words, it uses the middle 80%.

When summarizing a dataset, the mean is frequently the preferred
statistic to use, since it provides a rough estimate of the typical value
of the data. On the other hand, the presence of outliers can greatly
affect the mean, leading to a distorted picture of the typical values.

Trimmed Mean = ( Σ x(i), for i = p + 1 to n − p ) / (n − 2p)
where x(1), ..., x(n) are the sorted values and p is the number of values
trimmed from each end.
Example
Given Dataset=25000,23000,22720,18000,7202,39009,32007,21003,
1002,990
Step 1) sort the values
990, 1002,7202,18000,21003,22720,23000,25000,32007,39009
Step 2)Cut the extreme value at both ends by 10%.
Remove 990, 39009
1002,7202,18000,21003,22720,23000,25000,32007

Step 3) find the average or mean for remaining values.

X = 1002,7202,18000,21003,22720,23000,25000,32007
N=8

Trimmed Mean = 18,741.75


A trimmed mean eliminates the influence of extreme values. Trimmed
means are widely used, and in many cases they are preferable to the
ordinary mean.
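The same 10% trimmed mean can be computed with SciPy, which trims the
stated proportion from each end before averaging (a small sketch, assuming
SciPy is available):

from scipy import stats

data = [25000, 23000, 22720, 18000, 7202, 39009, 32007, 21003, 1002, 990]
print(stats.trim_mean(data, 0.1))   # 18741.75 (drops the lowest and highest value)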

Weighted Mean
The weighted mean is a measure of central tendency for a set of data
where each observation is given a weight.
A weighted average is a mean that assigns different levels of
importance to the values within a dataset. The regular mean, also
known as the arithmetic mean, assigns equal importance to all
observations. In writing, the weighted average is often referred to as
the weighted mean.
When you need to take into account the relative importance of values
in a dataset, utilize a weighted mean. Simply put, you are assigning
varying degrees of importance to the values during the calculations.
For example, if we are taking the average from multiple sensors and
one of the sensors is less accurate, then we might down-weight the
data from that sensor.
Calculating the weighted average involves multiplying each data point
by its weight and summing those products. Then sum the weights for
all data points. Finally, divide the sum of the weight-value products by
the sum of the weights.

Weighted Mean (x̄w) = ( Σ wi·xi ) / ( Σ wi )

Example

Category        Weight (wi)   Score (xi)
Home work       25            88
Quiz            30            71
Test            10            97
Final Exam      35            90

Σ wi·xi = 25*88 + 30*71 + 10*97 + 35*90 = 2200 + 2130 + 970 + 3150 = 8450
Σ wi = 25 + 30 + 10 + 35 = 100
Weighted mean = 8450 / 100 = 84.5
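The same result with numpy.average, which accepts a weights argument:

import numpy as np

scores = [88, 71, 97, 90]    # HW, Quiz, Test, Final Exam
weights = [25, 30, 10, 35]
print(np.average(scores, weights=weights))   # 84.5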

Median
The median is the value in the middle of a group. The point where half
the data is greater and half the data is lesser. The median is a way to
condense multiple data points into a single representative value.
The median is straightforward to calculate: arrange the data in
ascending order and identify the middlemost data point as the median.

Moreover, the calculation of the median depends upon the quantity of


data points available. The median is the middlemost data for an odd
number of data, and the average of the two middle values for an even
number of data.

The median is a measure of central tendency that is not influenced


by outliers.
The median is the middle number on a sorted list of the data. If there is
an even number of data values, the middle value is one that is not
actually in the data set, but rather the average of the two values that
divide the sorted data into upper and lower halves. Compared to the
mean, which uses all observations, the median depends only on the
values in the center of the sorted data.

Median for an odd number of data points (M) = the ((n + 1) / 2)th term
Median for an even number of data points (M) = the average of the
(n / 2)th and the (n / 2 + 1)th terms
Example = odd data set
7,14,5,19,26,42,13
5,7,13,14,19,26,42
N= 7
Median =14

Even data set


8,21,14,36,17,2,56,41
2,8,14,17,21,36,41,56

Median = (17 + 21) / 2 = 38 / 2 = 19
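Both cases can be checked with Python's statistics module, which averages the
two middle values automatically for an even-sized dataset:

import statistics

odd_data = [7, 14, 5, 19, 26, 42, 13]
even_data = [8, 21, 14, 36, 17, 2, 56, 41]
print(statistics.median(odd_data))    # 14
print(statistics.median(even_data))   # 19.0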

Robust Estimate
The median is not the only robust estimate of location. In fact, a
trimmed mean is widely used to avoid the influence of outliers.
Robust estimates are statistical measures that are not significantly
influenced by outliers or extreme values in the data. These estimates
are designed to be more resistant to the impact of outliers and can
provide a more accurate reflection of the central tendency.

Some Examples of Robust Estimates


Median: The median is a robust measure of central tendency. It
represents the middle value of a dataset when arranged in ascending
or descending order. Unlike the mean, the median is not influenced by
extreme values or outliers.
Trimmed mean: A trimmed mean is calculated by removing a certain
percentage of the extreme values or outliers from the dataset and
taking the average of the remaining values. This reduces the impact of
outliers on the estimate.
In conclusion, estimates of location, including the mean, median, and
mode, are fundamental tools in statistics for summarizing and
understanding data. They help us identify central tendencies and make
informed decisions in various domains. It's crucial to choose the
appropriate measure based on the characteristics of your data.

The mode can be used for both categorical and numerical data, while
the mean and median apply to numerical data (discrete or continuous).


Estimates of Variability

A measure of variability is a summary statistic that defines the amount


of dispersion in a dataset.
Dispersion refers to how scattered the data points are, that is, how
spread out the values are.

A statistical measure that determines the middle point of a distribution


using a single value is called a measure of central tendency. The
purpose of a central tendency is to find out a single value that is most
typical of the entire group. It is a value that should represent the entire
dataset as accurately as possible.
However, they don't tell the whole story, and here's why you must
consider variability:
Measures of variability define the extent to which data points
deviate from the center. In the field of statistics, variability,
dispersion, and spread are all different words used to describe the
extent of the distribution's width.
Variance refers to the differences exhibited by the data points within
the dataset, relative to each other or to the mean.
Some degree of variation is unavoidable. Everywhere you look, there is
variability. The duration of your daily commute to work fluctuates
slightly. When you repeatedly order your favorite dish at a restaurant,
it never quite tastes the same each time.
The Spread of Data: Central tendency alone can be misleading because
it doesn't tell you anything about how spread out the data points are.
Variability measures, such as range, variance, and standard deviation,
provide critical information about the dispersion or spread of data.
Incomplete Information: Relying solely on central tendency can
obscure important details within the data. For example, two datasets
may have the same mean, but very different patterns of variation.

Risk Assessment: In many engineering applications, understanding


variability is crucial for assessing risk. By analyzing the spread of data,
you can estimate the probability of extreme events or failures. This is
essential for designing systems that can withstand worst-case
scenarios.
A low dispersion suggests that the data points have a tendency to
form a close cluster around the center(mean), in this case,
distribution width will be small.
A high level of dispersion indicates that objects have a tendency to
fall far away from the center(mean), in this case, distribution width
will be large.
A dataset with lower variability implies greater consistency among
its values.
When there is a greater degree of variability, the data points will
exhibit more dissimilarity and spread out.

Dispersion can be quantified using various statistical measures,


including:
Range:
We will begin with the range as it is the easiest measure of variability
to compute. The range of a dataset can be defined as the numerical
difference between the largest and smallest values present within that
dataset.
Example:

Dataset 1    Dataset 2
20           11
21           16
22           19
25           23
26           29
29           32
33           45
34           47
38           53
43           67

Dataset 1: 43-20 = 23
Dataset 2: 67-11 = 56
Dataset 2 has a broader range and, hence, more variability than
dataset 1.
The Range is highly sensitive to outliers. If there is an
exceptionally high or low number among the values, it has an
impact on the entire range of data.
Mean Absolute Deviation (MAD)
The mean absolute deviation of a dataset is calculated by finding the
average distance between each data point and the mean. It provides
insight into the range of values within a dataset.
Step 1: Calculate the mean.
Step 2: Calculate how far each data point is from the mean using
positive distances; these are called absolute deviations.
Step 3: Add those deviations together.
Step 4: Divide the sum by the number of data points.

Example
Data: 25, 15, 20, 17, 22, 28, 27
Mean = 154 / 7 = 22
Absolute deviations: |25 - 22| = 3, |15 - 22| = 7, |20 - 22| = 2,
|17 - 22| = 5, |22 - 22| = 0, |28 - 22| = 6, |27 - 22| = 5
MAD = (3 + 7 + 2 + 5 + 0 + 6 + 5) / 7 = 28 / 7 = 4
A large MAD indicates that the data are more spread out relative to
the mean; a small MAD indicates that the data are less spread out
relative to the mean.
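A minimal numpy sketch of the mean absolute deviation for the example above:

import numpy as np

x = np.array([25, 15, 20, 17, 22, 28, 27])
mad = np.mean(np.abs(x - x.mean()))   # average absolute distance from the mean
print(mad)                            # 4.0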

Median Absolute Deviation(MAD)


The median absolute deviation (MAD) is a robust measure of the
variability of a univariate sample of quantitative data.
Example
Data: 3, 8, 27, 6, 5, 2, 15, 32, 9
Sorted: 2, 3, 5, 6, 8, 9, 15, 27, 32
The median is 8.
Absolute deviations from the median: |2 - 8| = 6, |3 - 8| = 5, |5 - 8| = 3,
|6 - 8| = 2, |8 - 8| = 0, |9 - 8| = 1, |15 - 8| = 7, |27 - 8| = 19, |32 - 8| = 24
Sorted absolute deviations: 0, 1, 2, 3, 5, 6, 7, 19, 24
MAD = median of the absolute deviations = 5
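The median absolute deviation can be computed the same way, replacing the
mean with the median at both steps:

import numpy as np

x = np.array([3, 8, 27, 6, 5, 2, 15, 32, 9])
mad = np.median(np.abs(x - np.median(x)))   # median of absolute deviations from the median
print(mad)                                  # 5.0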

Variance
Variance is a measure of how far an individual value or data point
falls from the mean. To find the variance, calculate the average of
the squared differences between each data point and the mean.
Variance is denoted by the symbol σ².
Step 1: Calculate the mean.
Step 2: Subtract the mean from each data point.
Step 3: The results are the deviations from the mean (some may be negative).
Step 4: Square the deviations.
Step 5: Find the mean of the squared values.

Example: 8,11,15,18,20,22,32,44,55
Mean = 225/9 =25
8-25=-17
11-25=-14
15-25= -10
18-25=-7
20-25=-5
22-25=-3
32-25=7
44-25=19
55-25=30

Sum of squared differences = 289 + 196 + 100 + 49 + 25 + 9 + 49 + 361 + 900 = 1978

Variance = mean of the squared differences = 1978 / 9 ≈ 219.78

σ² = Σ(xi − μ)² / n         [population variance]

s² = Σ(xi − x̄)² / (n − 1)   [sample variance]
Standard Deviation
The standard deviation measures the average deviation of each data
point from the mean. A dataset with values grouped closely together
will result in a smaller standard deviation. Conversely, if the values
are more dispersed, the standard deviation will be larger due to
the increased standard distance.

The standard deviation can be defined as the square root of the


variance.

Standard deviation is denoted by σ.

For the example above, σ = √219.78 ≈ 14.82.

σ = √( Σ(xi − μ)² / n )         [population standard deviation]

s = √( Σ(xi − x̄)² / (n − 1) )   [sample standard deviation]
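A quick numpy check of the variance and standard deviation above (ddof=0
gives the population formulas; ddof=1 would give the sample versions):

import numpy as np

x = np.array([8, 11, 15, 18, 20, 22, 32, 44, 55])
print(np.var(x, ddof=0))   # ~219.78  (population variance)
print(np.std(x, ddof=0))   # ~14.82   (population standard deviation)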

Interquartile Range
The interquartile range represents the central 50% of the data.
Consider the median, which separates the dataset into two equal
halves. You can also split the data into quarters. The boundaries of
these quarters are known as quartiles and are represented in
ascending order as Q1, Q2, and Q3. The lowest 25% of the dataset's
values lie below Q1, and the top 25% of the values lie above the upper
quartile, Q3. The interquartile range is the middle 50% of the data,
lying between the lower and upper quartiles, i.e., between Q1 and Q3.

The interquartile range is a Robust Estimate of variability, just like the


median is a Robust Estimate of central tendency.

Example
11,18,22,40,41,62,70
Step 1: ¼(n+1) = ¼(7+1)= 8/4= 2 —> Q1(25%)
Step 2 : ½(n+1)= ½(7+1) = 8/2 =4 —>Q2(50%)
Step :¾(n+1) = ¾(7+1)= 24/4= 6 —> Q3(75%)
Q1(25%)= 18
Q2(50%)= 40
Q3(75%)= 62
IQR = Q3 - Q1 (i.e., the 75th percentile minus the 25th percentile)
IQR = 62 - 18 = 44
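A minimal numpy sketch for the quartiles and IQR. The method="weibull"
option (available in NumPy 1.22 or newer) matches the (n + 1)-based positions
used above; numpy's default method can give slightly different quartile values.

import numpy as np

x = np.array([11, 18, 22, 40, 41, 62, 70])
q1 = np.percentile(x, 25, method="weibull")   # 18.0
q3 = np.percentile(x, 75, method="weibull")   # 62.0
print(q1, q3, q3 - q1)                        # 18.0 62.0 44.0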

The most appropriate measure of spread varies based on the


individual characteristics of the dataset and the specific goals of the
analysis. The range is a basic measure that shows the complete spread
of the data, but it can be greatly affected by extreme values. The
interquartile range (IQR) is a useful tool for identifying outliers
because it is not strongly influenced by extreme values. Contrarily, the
standard deviation quantifies the level of variation or spread within a
set of values and is especially valuable for evaluating the symmetry
and distribution shape, particularly when the data conforms to a
bell-shaped normal distribution.

Hence, the most appropriate measure to use relies on the type of data
and the particular analytical goals. One must carefully take into
account the assumptions and constraints of each measurement,
making sure that the selected measurement is compatible with the
data's traits and the analysis's objectives.
Probability Axioms

Consider statements such as "there is an 80% chance that India will win
the match" or "there is a 10% chance that the price of gold will rise
tomorrow." Here, probability expresses a chance, which means that the
event is uncertain. What we have done is assign a numerical value to
that uncertainty, and this numerical value is the probability.

Probability quantifies the chance that an event will happen. In


numerous real-world scenarios, predicting an event's outcome is
necessary, where the result may be certain or uncertain. Under these
circumstances, we describe the event's likelihood of happening or not
happening in terms of probability.

Probability Formula:
Probability can be defined as the ratio of the number of favorable
outcomes to the total number of outcomes of an event. For an
experiment having 'n' number of outcomes, the number of favorable
outcomes can be denoted by x. The formula to calculate the probability
of an event is as follows.

Probability(Event) = Favorable Outcomes (x) / Total Outcomes (n)

P(A) = N(A) / N(S)

Probability of Event:
The probability of an event is a measure of the likelihood that the event
will occur, expressed as a number between 0 and 1.
An event with a probability of 1 is considered certain to happen, while
an event with a probability of 0 is certain not to happen.
Terminology:
Experiment: An activity whose outcomes are not known is an
experiment. Every experiment has a few favorable outcomes and a
few unfavorable outcomes. The historic experiments of Thomas Alva
Edison had more than a thousand unsuccessful attempts before he
could make a successful attempt to invent the light bulb.

Random Experiment: A random experiment is an experiment for


which the set of possible outcomes is known, but which particular
outcome will occur on a particular execution of the experiment
cannot be said prior to performing the experiment. Tossing a coin,
rolling a die, and drawing a card from a deck are all examples of
random experiments.

Trial: The numerous attempts in the process of an experiment are


called trials. In other words, any particular performance of a
random experiment is called a trial. For example, tossing a coin is a
trial.

Event: A trial with a clearly defined outcome is an event. For


example, getting a tail when tossing a coin is termed as an event.

Equally likely Outcomes: An experiment in which each of the


outcomes has an equal probability, such outcomes are referred to as
equally likely outcomes. In the process of rolling a six-faced dice, the
probability of getting any number is equal.

Sample Space: It is the set of all possible outcomes of an
experiment. On rolling a die, the possible outcomes are 1, 2, 3, 4, 5,
and 6. These outcomes make up the sample space: S = {1, 2, 3, 4, 5, 6}

Mutually Exclusive Events: Two events such that the happening of


one event prevents the happening of another event are referred to
as mutually exclusive events. In other words, two events are said to
be mutually exclusive events, if they cannot occur at the same time.
For example, tossing a coin can result in either heads or tails. Both
cannot be seen at the same time.

Types of Probability
These are 3 major types of probability :
1. Theoretical Probability
2. Experimental Probability
3. Axiomatic Probability

Theoretical Probability:
Theoretical or classical probability is based on the assumption that all
outcomes in a sample space are equally likely. It doesn't require
experimental data or subjective judgment but instead uses a priori
reasoning to calculate probabilities.
Experimental probability:
Experimental or empirical probability is based on actual experiments
and observations. Instead of assuming all outcomes are equally likely,
it calculates the probability of an event based on how often the event
occurs relative to the total number of trials.
Axiomatic Probability: (Probability Axioms)
Axiomatic probability, developed by Russian mathematician Andrey
Kolmogorov in the 1930s, is a more formal and rigorous approach to
probability that is based on set theory. It establishes probability on a
firm theoretical foundation through a set of axioms (basic rules) that
all probability measures must follow.

Axioms of Probability
There are three axioms of probability that make the foundation of
probability theory-
Axiom 1: Probability of Event
The first axiom states that the probability of an event is always between
0 and 1. A probability of 1 indicates that the event is certain to occur,
and a probability of 0 indicates that the event cannot occur.

Axiom 2: Probability of Sample Space


For sample space, the probability of the entire sample space is 1.

Axiom 3: Mutually Exclusive Events


The third axiom states that the probability of the union of two mutually
exclusive (disjoint) events is the sum of their individual probabilities.

Axiom 1: For any given event X, the probability of that event must be
greater than or equal to 0. Thus,
0 ≤ P(X) ≤ 1
Axiom 2: We know that the sample space S of the experiment is the set
of all the outcomes. This means that the probability of any one
outcome happening is 100 percent i.e P(S) = 1. Intuitively this means
that whenever this experiment is performed, the probability of getting
some outcome is 100 percent.
P(S) = 1
Axiom 3: For an experiment with two events A and B, if A and B are
mutually exclusive, then
P(A ∪ B) = P(A) + P(B)
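A small simulation (a sketch with a fair six-sided die) illustrates the
difference between theoretical and experimental probability and checks
Axiom 3 for two mutually exclusive events:

import random

trials = 100_000
rolls = [random.randint(1, 6) for _ in range(trials)]

# Theoretical probability of rolling a 6: favorable / total outcomes
p_theory = 1 / 6
# Experimental probability: relative frequency over many trials
p_exp = rolls.count(6) / trials
print(round(p_theory, 4), round(p_exp, 4))   # both close to 0.1667

# Axiom 3: for the disjoint events "roll a 1" and "roll a 2",
# P(1 or 2) equals P(1) + P(2)
p1 = rolls.count(1) / trials
p2 = rolls.count(2) / trials
p1_or_2 = sum(r in (1, 2) for r in rolls) / trials
print(abs(p1_or_2 - (p1 + p2)) < 1e-9)       # True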
