
Inferential Statistics

Inferential statistics is concerned with drawing inferences and conclusions about a population from the data collected from it (typically a sample).
Analysis of Variance (ANOVA)

ANOVA is a statistical technique that compares means across two or more groups. It tests
whether there are statistically significant differences between the groups' means. ANOVA
calculates both within-group variance (variation within each group) and between-group
variance (variation between the group means) to determine whether any observed
differences are likely due to chance or represent true differences between groups.

 Null Hypothesis (H0): A statement of no effect or no difference, which researchers aim to test against.
 Alternative Hypothesis (H1): A statement indicating the presence of an effect or difference.
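
As a hedged illustration, the sketch below runs a one-way ANOVA on three invented groups using scipy.stats.f_oneway, which computes the F statistic from the between-group and within-group variances. The group values are made-up example data.

from scipy import stats

# Hypothetical scores for three groups (made-up example data)
group_a = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
group_b = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
group_c = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

# One-way ANOVA: tests H0 that all group means are equal
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

print("F statistic:", f_stat)
print("p-value:", p_value)

# Reject H0 at the 5% significance level if p < 0.05
if p_value < 0.05:
    print("Statistically significant difference between group means")
else:
    print("No statistically significant difference detected")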

Difference between Descriptive and Inferential Statistics

Descriptive Statistics:

 It gives information about raw data, describing the data in some manner.
 It helps in organizing, analyzing, and presenting data in a meaningful manner.
 It is used to describe a situation.
 It explains already known data and is limited to a sample or population of small size.
 It can be achieved with the help of charts, graphs, tables, etc.

Inferential Statistics:

 It makes inferences about the population using data drawn from the population.
 It allows us to compare data and make hypotheses and predictions.
 It is used to explain the chance of occurrence of an event.
 It attempts to reach conclusions about the population.
 It can be achieved by probability.

Correlation Coefficient

Correlation is a statistical measure that indicates the extent to which two or more variables
fluctuate in relation to each other. The correlation coefficient is measured on a scale that varies from
+ 1 through 0 to – 1. Complete correlation between two variables is expressed by either + 1 or -1.
When one variable increases as the other increases the correlation is positive; when one decreases as
the other increases it is negative. Complete absence of correlation is represented by 0.
import numpy as np

# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

# Calculate the correlation coefficient: np.corrcoef returns the 2x2
# correlation matrix, and the off-diagonal entry is the Pearson r
correlation_matrix = np.corrcoef(x, y)
correlation = correlation_matrix[0, 1]

print("Correlation coefficient:", correlation)  # -1.0 for this data


The Pearson correlation coefficient r is calculated as:

r = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / √( Σ(Xᵢ - X̄)² × Σ(Yᵢ - Ȳ)² )

Where:

Xᵢ and Yᵢ represent individual paired values of the two variables.

X̄ and Ȳ represent the means (averages) of the X and Y variables, respectively.

Σ denotes the sum of the values across all data points.

Alternatively, the formula can be written in terms of the covariance and standard deviations of X and Y:

r = cov(X, Y) / (σₓ * σᵧ)

Where:

cov(X, Y) represents the covariance between X and Y.

σₓ and σᵧ represent the standard deviations of X and Y, respectively.
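
As an illustrative check (not part of the original text), the snippet below computes r from the covariance and standard deviations and compares it with np.corrcoef, reusing the sample data from the earlier example.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

# r = cov(X, Y) / (sigma_x * sigma_y); ddof must match between the
# covariance and the standard deviations
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Compare with numpy's built-in Pearson correlation
r_numpy = np.corrcoef(x, y)[0, 1]

print("Manual r:", r_manual)   # -1.0
print("numpy r: ", r_numpy)    # -1.0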

Regression Coefficient

The term regression is used when you try to find the relationship between variables. In machine learning and in statistical modeling, that relationship is used to predict the outcome of events.
Least Square Method

Linear regression uses the least square method.

The concept is to draw a line through the plotted data points. The line is positioned so that the sum of the squared vertical distances from the data points to the line is minimized.

These distances are called "residuals" or "errors".

In a scatter plot of the data with the fitted line, the residuals are the vertical distances from each data point to the drawn mathematical function.
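
As a minimal sketch on invented data, a least-squares line can be fitted with numpy's polyfit:

import numpy as np

# Illustrative data: y roughly follows 2*x + 1 with some noise
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8, 13.2])

# np.polyfit with degree 1 finds the slope and intercept that
# minimize the sum of squared residuals
slope, intercept = np.polyfit(x, y, 1)

# Residuals: vertical distances between observed and fitted values
fitted = slope * x + intercept
residuals = y - fitted

print("slope:", slope, "intercept:", intercept)
print("residuals:", residuals)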

Regression Analysis – Linear Model Assumptions

Linear regression analysis is based on six fundamental assumptions:

1. The dependent and independent variables show a linear relationship.
2. The independent variable is not random.
3. The mean of the residuals (errors) is zero.
4. The variance of the residuals (errors) is constant across all observations (homoscedasticity).
5. The residuals (errors) are not correlated across observations.
6. The residuals (errors) follow a normal distribution.

Regression Analysis – Simple Linear Regression

Simple linear regression is a model that assesses the relationship between a dependent
variable and an independent variable. The simple linear model is expressed using the
following equation:
Y = a + bX + ϵ

Where:

 Y – Dependent variable
 X – Independent (explanatory) variable
 a – Intercept
 b – Slope
 ϵ – Residual (error)

Regression Analysis – Multiple Linear Regression

Multiple linear regression analysis is essentially similar to the simple linear model, with the
exception that multiple independent variables are used in the model. The mathematical
representation of multiple linear regression is:

Y = a + bX1 + cX2 + dX3 + ϵ

Where:

 Y – Dependent variable
 X1, X2, X3 – Independent (explanatory) variables
 a – Intercept
 b, c, d – Slopes
 ϵ – Residual (error)
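
As an illustrative sketch (the feature values are hypothetical), a multiple linear regression with three explanatory variables can be fitted with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: each row holds the three explanatory variables X1, X2, X3
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 0.0],
    [3.0, 4.0, 1.0],
    [4.0, 3.0, 2.0],
    [5.0, 5.0, 4.0],
])
y = np.array([10.0, 8.0, 15.0, 16.0, 23.0])

model = LinearRegression()
model.fit(X, y)

print("Intercept (a):", model.intercept_)
print("Slopes (b, c, d):", model.coef_)
print("Prediction for [2, 2, 2]:", model.predict([[2.0, 2.0, 2.0]]))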

Logistic Regression: Logistic regression is used when the dependent variable is binary (two
possible outcomes). It models the probability of a particular outcome occurring.

In business, logistic regression is employed for tasks like predicting customer churn (yes/no),
whether a customer will purchase a product (yes/no), or whether a loan applicant will default
on a loan (yes/no).
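
A minimal, hypothetical churn-style sketch with scikit-learn's LogisticRegression (the tenure values and churn labels are invented):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: months as a customer vs. whether they churned (1 = yes)
months = np.array([[1], [2], [3], [4], [5], [6], [8], [10], [12], [15]])
churned = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

model = LogisticRegression()
model.fit(months, churned)

# Predicted probability of churn for a customer with 3 months' tenure
print("P(churn | 3 months):", model.predict_proba([[3]])[0, 1])
print("Predicted class:", model.predict([[3]])[0])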

Polynomial Regression: Polynomial regression is used when the relationship between the
independent and dependent variables follows a polynomial curve and is not linear.
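
A minimal sketch fitting a quadratic (degree-2) polynomial with numpy on invented data:

import numpy as np

# Illustrative data following a roughly quadratic pattern
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([1.1, 2.9, 9.2, 19.1, 33.0, 51.2])

# Fit a degree-2 polynomial: y = c2*x^2 + c1*x + c0
coefficients = np.polyfit(x, y, 2)
poly = np.poly1d(coefficients)

print("Coefficients (c2, c1, c0):", coefficients)
print("Prediction at x = 6:", poly(6))
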
Model development in predictive analytics is the process of creating algorithms to analyze
data and make predictions:

1. Data collection and preparation: Identify relevant data sets and remove any redundant or
noisy data

2. Algorithm development: Create algorithms to pair with the data

3. Training: Train the model to learn the data so it can make predictions

4. Model assessment: Test the accuracy of the model's predictions and make adjustments as
needed
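
As a hedged end-to-end sketch of the four steps above, using scikit-learn's bundled diabetes dataset purely as stand-in data:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# 1. Data collection and preparation (here: a bundled example dataset)
X, y = load_diabetes(return_X_y=True)

# 2./3. Algorithm development and training on a training split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

# 4. Model assessment on held-out data
predictions = model.predict(X_test)
print("R^2 on test data:", r2_score(y_test, predictions))
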
Predictive modeling is a data mining technique that uses historical and current data to predict
future outcomes. It's used in many industries and applications, including:
 Marketing to gauge customer responses

 Financial analysis to estimate stock market trends and events

 Fraud detection

 Customer segmentation

 Disease diagnosis

 Stock price prediction


The accuracy of predictive models depends on several factors, including the quality of the data, the choice of variables, and the model's assumptions.

Some common predictive modeling techniques include:

 Regression
 Neural networks
 Decision trees

Missing values

Missing values are common when working with real-world datasets.

Types of Missing Data


 Missing Completely at Random (MCAR).

 Missing at Random (MAR).

 Not Missing at Random (NMAR).

Missing Data that's Missing Completely at Random (MCAR)

These are data that are missing completely at random. That is, the missingness is independent of the data; there is no discernible pattern to this type of data missingness. This means that you cannot predict whether the value was missing due to specific circumstances or not. The values are just completely missing at random.

Missing Data that's Missing at Random (MAR)

These types of data are missing at random but not completely at random: the data's missingness is determined by the observed data.

Consider, for instance, that you built a smart watch that can track people's heart rates every hour. You then distributed the watch to a group of individuals to wear so you could collect data for analysis. After collecting the data, you discovered that some values were missing, which was due to some people being reluctant to wear the watch at night. As a result, we can conclude that the missingness was caused by the observed data.

Missing Data that's Not Missing at Random (NMAR)

These are data that are not missing at random; this type of missingness is also known as non-ignorable. In other words, the missingness of the missing data is determined by the variable of interest itself.

A common example is a survey in which students are asked how many cars they own. In this case, some students may purposefully fail to complete the survey, resulting in missing values.

Handling missing values generally falls into two categories:

 Deletion

 Imputation

One of the most prevalent methods for dealing with missing data is deletion, and one of the most commonly used deletion approaches is list-wise deletion.

 In the list-wise deletion method, you remove a record or observation from the dataset if it contains any missing values.
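
A minimal pandas sketch of list-wise deletion; the small DataFrame is invented for illustration:

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age":    [25, 30, np.nan, 45, 50],
    "income": [50000, np.nan, 42000, 61000, 58000],
    "score":  [0.7, 0.5, 0.9, np.nan, 0.8],
})

# List-wise deletion: drop every row that contains at least one missing value
df_listwise = df.dropna()

print(df_listwise)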

Another frequent general method for dealing with missing data is to fill in the missing value with a substituted value.

Regression imputation

The regression imputation method involves creating a model to predict the observed values of a variable based on other variables, and then using that model to fill in the missing values of the variable. This technique is used for the MAR and MCAR categories when the features in the dataset are dependent on one another, for example using a linear regression model.
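
One way to approximate regression imputation in scikit-learn is IterativeImputer, which regresses each feature with missing values on the other features. The data below are invented, and the imputer is an iterative variant of the idea described above, shown here as a hedged sketch.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the imputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Hypothetical data where the second column is roughly 2x the first
X = np.array([
    [1.0, 2.1],
    [2.0, np.nan],
    [3.0, 6.2],
    [4.0, 7.9],
    [5.0, np.nan],
])

# Each feature with missing values is regressed on the other features
imputer = IterativeImputer(estimator=LinearRegression(), max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
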
Simple Imputation

This method involves using a numerical summary of the variable in which the missing value occurred, that is, the feature's central tendency summary, such as the mean, median, or mode.
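
A minimal sketch with scikit-learn's SimpleImputer on an invented column:

import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with missing values
X = np.array([[4.0], [8.0], [np.nan], [6.0], [np.nan], [10.0]])

# Replace missing values with the column mean (the strategy can also be
# "median" or "most_frequent" for the mode)
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled.ravel())  # missing entries become 7.0, the mean of 4, 8, 6, 10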

KNN Imputation

KNN imputation is a more refined alternative to simple imputation. It operates by replacing a missing value with the mean of the values of its nearest neighbors.
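
A minimal sketch with scikit-learn's KNNImputer on invented data:

import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data with a missing value in the second column
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 4.0],
    [8.0, 9.0],
])

# Fill the missing value with the mean of its 2 nearest neighbors
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)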

What is an outlier?

In data analytics, outliers are values within a dataset that vary greatly from the others—
they’re either much larger, or significantly smaller. Outliers may indicate variabilities in a
measurement, experimental errors, or a novelty.

In a real-world example, the average height of a giraffe is about 16 feet tall. However, there
have been recent discoveries of two giraffes that stand at 9 feet and 8.5 feet, respectively.
These two giraffes would be considered outliers in comparison to the general giraffe
population.

Types of outliers

There are two kinds of outliers:

 A univariate outlier is an extreme value that relates to just one variable. For
example, Sultan Kösen is currently the tallest man alive, with a height of 8ft, 2.8
inches (251cm). This case would be considered a univariate outlier as it’s an extreme
case of just one factor: height.

 A multivariate outlier is a combination of unusual or extreme values for at least two variables. For example, if you’re looking at both the height and weight of a group of adults, you might observe that one person in your dataset is 5ft 9 inches tall—a measurement that would fall within the normal range for this particular variable. You may also observe that this person weighs 110lbs. Again, this observation alone falls within the normal range for the variable of interest: weight. However, when you consider these two observations in conjunction, you have an adult who is 5ft 9 inches and weighs 110lbs—a surprising combination. That’s a multivariate outlier.

Besides the distinction between univariate and multivariate outliers, outliers can also be categorized as any of the following:

 Global outliers (otherwise known as point outliers) are single data points that lie far from the rest of the data distribution.

 Contextual outliers (otherwise known as conditional outliers) are values that significantly deviate from the rest of the data points in the same context, meaning that the same value may not be considered an outlier if it occurred in a different context. Outliers in this category are commonly found in time series data. If an e-commerce company experiences a sudden increase in orders in the middle of the night, it would be a contextual outlier.

 Collective outliers are a subset of data points that are completely different with respect to the entire dataset. A group of customers who consistently make purchases that are significantly larger than those of the rest of the customers would be a collective outlier.

Common causes of outliers in datasets:

 Human error while manually entering data, such as a typo
 Intentional errors, such as dummy outliers included in a dataset to test detection methods
 Sampling errors that arise from extracting or mixing data from inaccurate or various sources
 Data processing errors that arise from data manipulation or unintended mutations of a dataset
 Measurement errors as a result of instrumental error
 Experimental errors, from the data extraction process or from experiment planning or execution
 Natural outliers, which occur “naturally” in the dataset, as opposed to being the result of one of the errors listed above. These naturally occurring outliers are known as novelties.

Outliers using visualizations

In data analytics, analysts create data visualizations to present data graphically in a meaningful and impactful way, in order to present their findings to relevant stakeholders. These visualizations can easily show trends, patterns, and outliers from a large set of data in the form of maps, graphs, and charts.

Identifying outliers with box plots

Visualizing data as a box plot makes it very easy to spot outliers. A box plot shows the “box”, which indicates the interquartile range (from the lower quartile to the upper quartile, with the line in the middle indicating the median value), and “whiskers” that typically extend to the most extreme values within 1.5 times the interquartile range of the quartiles. Any data points plotted beyond the whiskers are flagged as outliers. If the box sits closer to the maximum whisker, the most prominent outliers tend to appear on the minimum side; likewise, if the box sits closer to the minimum whisker, the most prominent outliers tend to appear on the maximum side.
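
The same rule that box-plot whiskers commonly follow (1.5 times the interquartile range) can be applied directly in code; the data below are illustrative:

import numpy as np

# Illustrative data with one obvious outlier (120)
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 120])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles fall outside the whiskers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = data[(data < lower_bound) | (data > upper_bound)]

print("Bounds:", lower_bound, upper_bound)
print("Outliers:", outliers)  # expected: [120]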

Outliers using statistical methods

A data analyst may also use statistical methods to detect outliers. This is especially useful in machine learning modeling, which can often be improved by identifying, understanding, and, in some cases, removing outliers.

1. Outliers with DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering method used in machine learning and data analytics applications. It groups data points that lie in dense regions of the feature space, and it can also be applied to detect outliers.

In the classic DBSCAN illustration, the points around A are core points (they have enough neighbors within the chosen radius). Points B and C are not core points, but they are density-reachable from the cluster around A and thus belong to that cluster. Point N is noise, since it is neither a core point nor reachable from a core point; such noise points can be treated as outliers.
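
A hedged sketch using scikit-learn's DBSCAN, where points labeled -1 are noise and can be treated as outliers; the data and the eps/min_samples settings are illustrative.

import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative 2D data: two dense clusters plus one far-away point
X = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [1.1, 1.2],
    [8.0, 8.1], [8.2, 7.9], [7.9, 8.0], [8.1, 8.2],
    [25.0, 80.0],   # isolated point
])

# eps: neighborhood radius; min_samples: points needed to form a core point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

print("Cluster labels:", labels)
print("Outliers (label -1):", X[labels == -1])
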
2. Outliers by finding the Z-Score

Computing a z-score describes each data point by placing it in relation to the mean and standard deviation of the whole group of data points: positive standard scores correspond to raw scores above the mean, whereas negative standard scores correspond to raw scores below the mean. After standardization, the scores have a mean of 0 and a standard deviation of 1.

Outliers are found from z-score calculations by looking for data points that are too far from 0 (the mean). In many cases, the “too far” threshold is ±3: anything above +3 or below -3 is considered an outlier.
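
A minimal sketch of z-score-based outlier detection with scipy (the data are synthetic, with one injected extreme value):

import numpy as np
from scipy import stats

# Illustrative data: 50 values around 10, plus one injected extreme value
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(loc=10, scale=1, size=50), [60.0]])

# z-score of each point: (value - mean) / standard deviation
z_scores = stats.zscore(data)

# Common rule of thumb: |z| > 3 marks a point as an outlier
outliers = data[np.abs(z_scores) > 3]
print("Outliers:", outliers)  # the injected value of 60 is flagged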

3. Outliers with the Isolation Forest algorithm

Isolation Forest—otherwise known as iForest—is another anomaly detection algorithm. The authors of the algorithm used two quantitative properties of anomalous data points—that they are “few” in quantity and have “different” attribute-values from those of normal instances—to isolate outliers from normal data points in a dataset.

To find these outliers, the Isolation Forest builds “isolation trees” from the data, and outliers are the points that have shorter average path lengths in the trees than the rest of the points.
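
A hedged sketch with scikit-learn's IsolationForest on synthetic data; the contamination setting is an assumption about the expected fraction of outliers.

import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative 2D data: a dense blob plus two far-away points
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
               [[6.0, 6.0], [-7.0, 5.0]]])

# contamination is the assumed fraction of outliers in the data
model = IsolationForest(contamination=0.02, random_state=0)
labels = model.fit_predict(X)   # -1 for outliers, 1 for inliers

print("Detected outliers:", X[labels == -1])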

Segmentation

Segmentation analysis is a powerful data analytics technique that helps you identify and
understand different groups of customers, users, or prospects based on their characteristics,
preferences, or behaviors. By segmenting your data, you can tailor your marketing, product,
or service strategies to each group and optimize your performance and outcomes.

Criteria-based segmentation

This technique involves dividing your data into segments based on predefined criteria that are
relevant to your business goals, such as demographics, psychographics, geographic, or
behavioral variables. For example, you can segment your customers by age, income, lifestyle,
location, or purchase frequency. The main advantage of this technique is that it is easy to
implement and interpret, and it can help you target specific segments with customized
messages or offers. The main disadvantage is that it can be too simplistic or arbitrary, and it
may not capture the underlying patterns or relationships in your data.
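
As an illustrative sketch (the customer fields and the age/frequency bands are invented), criteria-based segments can be assigned with pandas:

import pandas as pd

# Hypothetical customer data
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "age":         [22, 35, 47, 58, 31, 64],
    "purchases_per_month": [1, 4, 2, 6, 3, 1],
})

# Predefined criteria: age bands and purchase-frequency bands
customers["age_segment"] = pd.cut(
    customers["age"],
    bins=[0, 30, 50, 120],
    labels=["young", "middle", "senior"],
)
customers["frequency_segment"] = pd.cut(
    customers["purchases_per_month"],
    bins=[0, 2, 4, 100],
    labels=["low", "medium", "high"],
)

print(customers)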

Cluster analysis

This technique involves using statistical methods to group your data into segments based on
their similarities or differences in multiple dimensions, without relying on predefined criteria.
For example, you can use cluster analysis to find out how your customers are clustered based
on their preferences, needs, or behaviors across various product or service attributes. The
main advantage of this technique is that it can reveal hidden insights and patterns in your
data, and it can help you discover new or niche segments that you may not have considered
before. The main disadvantage is that it can be complex and challenging to implement and
validate, and it may require advanced data analytics skills and tools.
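
A minimal cluster-analysis sketch with scikit-learn's KMeans on invented customer features; the number of clusters is an assumption that would normally be validated, for example with an elbow plot or silhouette scores.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features: [annual spend, visits per month]
X = np.array([
    [200,  1], [250,  2], [300,  1],      # low-spend, infrequent
    [900,  8], [950, 10], [1000, 9],      # high-spend, frequent
    [500,  4], [550,  5], [600,  4],      # mid-range
])

# Standardize so both features contribute comparably, then cluster
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(X_scaled)

print("Segment assignments:", segments)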

Hybrid segmentation

This technique involves combining criteria-based and cluster analysis techniques to create
segments that are both meaningful and actionable. For example, you can use criteria-based
segmentation to create broad segments based on your business objectives, and then use
cluster analysis to refine or subdivide those segments based on your data characteristics. The
main advantage of this technique is that it can leverage the strengths of both techniques and
overcome their limitations, and it can help you create segments that are more accurate and
relevant to your data and goals. The main disadvantage is that it can be time-consuming and
resource-intensive, and it may require more testing and evaluation to ensure its validity and
reliability.

Automated Data Preparation


Data preparation is the process of transforming raw data into a clean and structured format, ready for analysis. It involves several tasks, including data cleaning, integration, transformation, and enrichment. Automated data preparation tools can perform these tasks with minimal human intervention.

Key Components of Automated Data Preparation


Automated data preparation involves several key components, each addressing different
aspects of the data preparation process:

1. Data Ingestion
Data ingestion is the process of collecting and importing data from various sources into a
centralized repository. Automated data ingestion tools can connect to multiple data sources,
such as databases, cloud storage, APIs, and streaming platforms, to extract and load data in
real-time or batch mode.

2. Data Cleaning
Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values
in the dataset. Automated data cleaning tools use algorithms and rules to detect and correct
common data quality issues, such as:

 Duplicate Records: Identifying and removing duplicate records to ensure data uniqueness.

 Missing Values: Handling missing values through imputation, interpolation, or deletion.

 Inconsistent Data: Standardizing inconsistent data formats, such as date and time formats, units of measurement, and categorical values.

 Outliers: Detecting and handling outliers that may skew analysis results.
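
A hedged pandas sketch of these cleaning steps on an invented DataFrame:

import numpy as np
import pandas as pd

# Hypothetical raw data with duplicates, missing values, and mixed formats
df = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4],
    "amount":   [100.0, 100.0, np.nan, 250.0, 4000.0],
    "country":  ["US", "US", "us", "U.S.", "DE"],
})

df = df.drop_duplicates()                                           # duplicate records
df["amount"] = df["amount"].fillna(df["amount"].median())           # missing values
df["country"] = df["country"].replace({"us": "US", "U.S.": "US"})   # inconsistent data

# Simple outlier flag using the 1.5 * IQR rule
q1, q3 = df["amount"].quantile([0.25, 0.75])
df["amount_outlier"] = (df["amount"] < q1 - 1.5 * (q3 - q1)) | (df["amount"] > q3 + 1.5 * (q3 - q1))

print(df)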

3. Data Transformation
Data transformation involves converting data from one format or structure to another to meet
the requirements of the analysis or target system. Automated data transformation tools can
perform a range of tasks, including:

 Data Normalization: Scaling numerical data to a standard range to ensure comparability.

 Data Aggregation: Summarizing data at different levels of granularity for analysis.

 Data Pivoting: Converting rows to columns and vice versa to restructure the dataset.

 Feature Engineering: Creating new features or variables from existing data to enhance the analysis.
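
A hedged pandas sketch of these transformations on invented sales data:

import pandas as pd

# Hypothetical monthly sales data
sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "region":  ["North", "South", "North", "South"],
    "revenue": [100.0, 150.0, 120.0, 130.0],
})

# Data normalization: scale revenue to the 0-1 range
rev = sales["revenue"]
sales["revenue_scaled"] = (rev - rev.min()) / (rev.max() - rev.min())

# Data aggregation: total revenue per region
totals = sales.groupby("region", as_index=False)["revenue"].sum()

# Data pivoting: months become columns
pivoted = sales.pivot(index="region", columns="month", values="revenue")

# Feature engineering: a new variable derived from existing ones
sales["above_average"] = sales["revenue"] > sales["revenue"].mean()

print(totals)
print(pivoted)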

4. Data Integration
Data integration involves combining data from multiple sources into a unified dataset.
Automated data integration tools can perform tasks such as:

 Data Merging: Combining data from different sources based on common keys or identifiers.

 Data Blending: Integrating data from disparate sources to create a comprehensive view.

 Data Harmonization: Aligning data from different sources to ensure consistency and coherence.

5. Data Enrichment
Data enrichment involves enhancing the dataset with additional information or context to
improve analysis and decision-making. Automated data enrichment tools can:

 Append External Data: Incorporate data from external sources, such as demographics, weather, or market data.

 Geocoding: Add geographic coordinates to address data for spatial analysis.

 Text Enrichment: Extract and add relevant information from unstructured text data using natural language processing techniques.
