Inferential Statistics
Drawing inferences and conclusions about a broader population from sample data, rather than merely describing the data at hand
Analysis of Variance (ANOVA)
ANOVA is a statistical technique that compares means across two or more groups. It tests
whether there are statistically significant differences between the groups' means. ANOVA
calculates both within-group variance (variation within each group) and between-group
variance (variation between the group means) to determine whether any observed
differences are likely due to chance or represent true differences between groups.
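As a rough illustration (the three groups of scores below are invented, and SciPy is assumed to be available), a one-way ANOVA can be run with scipy.stats.f_oneway:
from scipy import stats
# Hypothetical scores for three groups (made-up data)
group_a = [85, 90, 88, 92, 87]
group_b = [78, 82, 80, 79, 81]
group_c = [90, 94, 91, 93, 95]
# One-way ANOVA: tests whether the group means differ significantly
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. below 0.05) suggests at least one group mean differs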
Correlation Coefficient
Correlation is a statistical measure that indicates the extent to which two or more variables
fluctuate in relation to each other. The correlation coefficient is measured on a scale that varies from
+1 through 0 to -1. Complete correlation between two variables is expressed by either +1 or -1.
When one variable increases as the other increases, the correlation is positive; when one decreases as
the other increases, it is negative. Complete absence of correlation is represented by 0.
import numpy as np
# Sample data: y decreases steadily as x increases
x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])
# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between x and y
correlation_matrix = np.corrcoef(x, y)
correlation = correlation_matrix[0, 1]
print(correlation)  # -1.0, a perfect negative correlation
Alternatively, the formula can be written in terms of the covariance and standard deviations of
X and Y:
r = Cov(X, Y) / (σX · σY)
Where:
Cov(X, Y) – Covariance of X and Y
σX, σY – Standard deviations of X and Y
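For illustration, the same coefficient can be computed directly from the covariance matrix; the arrays below mirror the earlier example and are made up:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])
# Sample covariance of x and y divided by the product of their
# sample standard deviations (ddof=1 matches np.cov's default)
cov_xy = np.cov(x, y)[0, 1]
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r)  # -1.0, identical to np.corrcoef(x, y)[0, 1]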
Regression Coefficient
The term regression is used when you try to find the relationship between variables. In
Machine Learning and in statistical modeling, that relationship is used to predict the outcome
of events.
Least Square Method
The concept is to draw a line through the plotted data points, positioned so that it minimizes
the sum of the squared vertical distances (residuals) from the line to all of the data points.
In the accompanying figure, the red dashed lines represent these distances from the data points
to the fitted line. Linear regression by least squares relies on the following assumptions:
1. The dependent and independent variables show a linear relationship.
2. The independent variable is not random.
3. The expected value of the residual (error) is zero.
4. The variance of the residual (error) is constant across all observations (homoscedasticity).
5. The residual (error) values are not correlated across observations.
6. The residual (error) values follow the normal distribution.
Simple linear regression is a model that assesses the relationship between a dependent
variable and an independent variable. The simple linear model is expressed using the
following equation:
Y = a + bX + ϵ
Where:
Y – Dependent variable
X – Independent (explanatory) variable
a – Intercept
b – Slope
ϵ – Residual (error)
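A minimal least-squares fit of this model with NumPy, using made-up observations; np.polyfit estimates the slope b and intercept a, and the residuals ϵ are whatever the line does not explain:
import numpy as np
# Hypothetical observations (made-up data)
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
# Fit Y = a + bX by ordinary least squares; polyfit returns
# coefficients from highest degree to lowest: [b, a]
b, a = np.polyfit(X, Y, deg=1)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
# Predicted values and residuals (errors)
Y_hat = a + b * X
residuals = Y - Y_hat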
Multiple linear regression analysis is essentially similar to the simple linear model, with the
exception that multiple independent variables are used in the model. The mathematical
representation of multiple linear regression is:
Y = a + bX1 + cX2 + dX3 + ϵ
Where:
Y – Dependent variable
X1, X2, X3 – Independent (explanatory) variables
a – Intercept
b, c, d – Slopes
ϵ – Residual (error)
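A sketch of the multiple linear model using scikit-learn (assumed to be available); the three feature columns X1, X2, X3 and the target values are invented for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression
# Hypothetical dataset: columns are X1, X2, X3 (made-up values)
X = np.array([
    [1, 4, 7],
    [2, 5, 8],
    [3, 7, 9],
    [4, 8, 11],
    [5, 10, 12],
])
Y = np.array([10, 14, 19, 23, 28])
# Fit Y = a + bX1 + cX2 + dX3 by least squares
model = LinearRegression().fit(X, Y)
print("intercept a:", model.intercept_)
print("slopes b, c, d:", model.coef_)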
Logistic Regression: Logistic regression is used when the dependent variable is binary (two
possible outcomes). It models the probability of a particular outcome occurring.
In business, logistic regression is employed for tasks like predicting customer churn (yes/no),
whether a customer will purchase a product (yes/no), or whether a loan applicant will default
on a loan (yes/no).
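A minimal churn-style sketch with scikit-learn's LogisticRegression; the single feature (months since last purchase) and the 0/1 churn labels are invented for illustration:
import numpy as np
from sklearn.linear_model import LogisticRegression
# Hypothetical feature: months since the customer's last purchase
X = np.array([[1], [2], [3], [6], [8], [10], [12], [15]])
# 1 = customer churned, 0 = customer stayed
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
model = LogisticRegression().fit(X, y)
# Predicted probability of churn for a customer inactive for 9 months
print(model.predict_proba([[9]])[0, 1])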
Polynomial Regression: Polynomial regression is used when the relationship between the
independent and dependent variables follows a polynomial curve and is not linear.
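As a short sketch, a degree-2 polynomial can be fitted with NumPy; the data below are invented and follow a roughly quadratic curve:
import numpy as np
# Hypothetical data with a curved (quadratic-looking) relationship
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.0, 4.9, 10.1, 17.2, 25.8, 37.1])
# Fit y = c2*x^2 + c1*x + c0 by least squares
coeffs = np.polyfit(x, y, deg=2)
print("coefficients:", coeffs)
# Evaluate the fitted polynomial at new points
print(np.polyval(coeffs, [7, 8]))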
Model development in predictive analytics is the process of creating algorithms to analyze
data and make predictions:
1. Data collection and preparation: Identify relevant data sets and remove any redundant or
noisy data
3. Training: Train the model to learn the data so it can make predictions
4. Model assessment: Test the accuracy of the model's predictions and make adjustments as
needed
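A compressed sketch of the training and assessment steps above, using scikit-learn's bundled Iris dataset so the example stays self-contained; the choice of a decision tree model is only an example:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Data collection and preparation (here: a bundled example dataset)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
# Training: the model learns from the training split
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Model assessment: test prediction accuracy on held-out data
print(accuracy_score(y_test, model.predict(X_test)))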
Predictive modeling is a data mining technique that uses historical and current data to predict
future outcomes. It's used in many industries and applications, including:
Marketing to gauge customer responses
Fraud detection
Customer segmentation
Disease diagnosis
Common predictive modeling techniques include:
Regression
Neural networks
Decision trees
Missing values
Missing values are common when working with real-world datasets. Missingness generally falls
into three categories.
Missing Completely at Random (MCAR)
Here the fact that a value is missing is entirely independent from the data. There is no
discernible pattern to this type of data missingness. This means that you cannot predict whether
the value was missing due to any specific characteristic of the observation.
Missing at Random (MAR)
These types of data are missing at random but not completely at random. The data's missingness
can be explained by other variables that were observed in the dataset.
Consider for instance that you built a smart watch that can track people's heart rates every
hour. Then you distributed the watch to a group of individuals to wear so you can collect data
for analysis.
After collecting the data, you discovered that some data were missing, which was due to
some people being reluctant to wear the wristwatch at night. As a result, we can conclude that
the missingness is explained by an observable circumstance (not wearing the watch at night),
so these data are missing at random (MAR).
Missing Not at Random (MNAR)
These are data that are not missing at random; they are also known as non-ignorable data. In
other words, the missingness of the missing data is determined by the value of the variable of
interest itself.
A common example is a survey in which students are asked how many cars they own. In this
case, some students may purposefully fail to complete the survey, resulting in missing values.
Handling missing values generally falls into two categories:
Deletion
Imputation
One of the most prevalent methods for dealing with missing data is deletion, and one of the
most commonly used deletion approaches is listwise deletion. In the listwise deletion method,
you remove a record or observation from the dataset if it contains one or more missing values.
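A minimal pandas sketch of listwise deletion on a made-up table:
import numpy as np
import pandas as pd
# Hypothetical dataset with missing values (NaN)
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [50000, 62000, np.nan, 71000],
})
# Listwise deletion: drop any row that contains a missing value
df_complete = df.dropna()
print(df_complete)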
Another frequent general method for dealing with missing data is to fill in the missing values
with substituted estimates; this approach is known as imputation.
Regression imputation
The regression imputation method involves creating a model to predict the observed values of
a variable based on other variables, and then using the model to fill in the missing values of
that variable.
This technique is utilized for the MAR and MCAR categories when the features in the dataset
are dependent on one another, for example by fitting a linear regression model that predicts the
variable with missing values from the other variables.
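One way to sketch regression imputation is scikit-learn's IterativeImputer, which models each feature with missing values as a function of the other features (by default using a Bayesian ridge regression); the data below are invented:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
# Hypothetical features whose columns depend on one another
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [4.0, 8.2],
])
# Each missing entry is predicted from the other column(s)
imputer = IterativeImputer(random_state=0)
print(imputer.fit_transform(X))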
Simple Imputation
This method involves utilizing a numerical summary of the variable where the missing value
occurred (that is, using the feature's central tendency summary, such as the mean, median, or
mode) to fill in the missing values.
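A short sketch with scikit-learn's SimpleImputer, filling missing entries with the column mean (median or most_frequent work the same way); the data are made up:
import numpy as np
from sklearn.impute import SimpleImputer
X = np.array([
    [7.0, 2.0],
    [np.nan, 3.0],
    [4.0, np.nan],
    [8.0, 5.0],
])
# Replace each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))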
KNN Imputation
This method involves replacing missing data with the mean of the values from its nearest neighbors.
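A sketch using scikit-learn's KNNImputer on invented data; each missing entry is replaced by the mean of its two nearest rows:
import numpy as np
from sklearn.impute import KNNImputer
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])
# Missing values are filled from the n_neighbors closest rows
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))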
What is an outlier?
In data analytics, outliers are values within a dataset that vary greatly from the others—
they’re either much larger, or significantly smaller. Outliers may indicate variabilities in a
measurement, experimental errors, or a novelty.
In a real-world example, the average height of a giraffe is about 16 feet tall. However, there
have been recent discoveries of two giraffes that stand at 9 feet and 8.5 feet, respectively.
These two giraffes would be considered outliers in comparison to the general giraffe
population.
Types of outliers
A univariate outlier is an extreme value that relates to just one variable. For
example, Sultan Kösen is currently the tallest man alive, with a height of 8ft, 2.8
inches (251cm). This case would be considered a univariate outlier as it’s an extreme
case of just one factor: height.
A multivariate outlier, by contrast, is an unusual combination of values on two or more
variables, such as a height and weight pairing that is extreme even though neither value is
extreme on its own.
Besides the distinction between univariate and multivariate outliers, outliers can be categorized
as any of the following:
Global outliers (otherwise known as point outliers) are single data points that lay
far from the rest of the data distribution.
Collective outliers are a subset of data points that are collectively very different
with respect to the entire dataset. For example, a group of customers who consistently make
purchases that are significantly larger than those of the rest of the customers would form a
collective outlier.
Outliers can arise for several reasons, including:
Sampling errors that arise from extracting or mixing data from inaccurate or varied
sources
Data processing errors that arise from data manipulation, or unintended mutations of a
dataset
Natural outliers, which occur “naturally” in the dataset, as opposed to being the result
of an error of the kinds listed above. These naturally occurring outliers are known as novelties
Visualizing data as a box plot makes it very easy to spot outliers. A box plot shows the
“box”, which indicates the interquartile range (from the lower quartile to the upper quartile,
with the line in the middle indicating the median value), and any outliers are shown outside of
the “whiskers” of the plot, which extend to the smallest and largest values that are not
considered outliers. If the box skews closer to the maximum whisker, the prominent outlier
would be the minimum value; likewise, if the box skews closer to the minimum-valued
whisker, the prominent outlier would then be the maximum value.
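The same quartile logic can be applied numerically: a common rule flags values more than 1.5 times the interquartile range beyond the quartiles. A small sketch on made-up data:
import numpy as np
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])
# Interquartile range (IQR) = Q3 - Q1
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [102]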
A data analyst may use a statistical method to assist with machine learning modeling, which
can be improved by identifying, understanding, and—in some cases—removing outliers.
Density-based clustering can also flag outliers: in DBSCAN, points that do not belong to any
cluster are labelled as noise. The above illustration is of a DBSCAN cluster analysis. Points
around A are core points. Points B and C are not core points, but are density-connected via the
cluster of A (and thus belong to this cluster). Point N is noise, since it is neither a core point
nor reachable from a core point.
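A minimal DBSCAN sketch with scikit-learn; points labelled -1 are the noise points (outlier candidates). The coordinates and the eps/min_samples settings are arbitrary choices for illustration:
import numpy as np
from sklearn.cluster import DBSCAN
# Two tight clusters plus one far-away point (made-up 2D data)
X = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
    [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],
    [25.0, 25.0],   # isolated point
])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)            # -1 marks noise
print(X[labels == -1])   # the isolated point is flagged as noise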
Outliers by finding the Z-Score
Computing a z-score places any data point in relation to the mean and standard deviation of
the whole group of data points. Positive standard scores correspond to raw scores above the
mean, whereas negative standard scores correspond to raw scores below the mean. After
standardization, the scores have a mean of 0 and a standard deviation of 1.
Outliers are found from z-score calculations by observing the data points that are too far from
0 (the mean). In many cases, the “too far” threshold is ±3: anything above +3 or below -3 is
considered an outlier.
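A small z-score sketch on made-up values, using the ±3 threshold described above:
import numpy as np
data = np.array([50, 52, 49, 51, 53, 50, 48, 54,
                 52, 51, 49, 50, 53, 47, 55, 150])
# Standardize: subtract the mean, divide by the standard deviation
z_scores = (data - data.mean()) / data.std()
# Flag points whose |z| exceeds 3 as outliers
outliers = data[np.abs(z_scores) > 3]
print(outliers)  # [150]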
Another method is the Isolation Forest, which builds “isolation trees” by repeatedly
partitioning the data at random; outliers show up as the points with shorter average path
lengths than the rest of the observations, because they are easier to isolate.
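A brief Isolation Forest sketch with scikit-learn; fit_predict labels outliers as -1. The data and the contamination setting (the expected fraction of outliers) are made up:
import numpy as np
from sklearn.ensemble import IsolationForest
# Mostly similar 2D points, plus one anomaly (made-up data)
X = np.array([
    [2.0, 2.1], [2.1, 1.9], [1.9, 2.0], [2.2, 2.1],
    [2.0, 1.8], [1.8, 2.2], [2.1, 2.0], [9.0, 9.0],
])
model = IsolationForest(contamination=0.15, random_state=0)
labels = model.fit_predict(X)   # -1 = outlier, 1 = inlier
print(X[labels == -1])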
Segmentation
Segmentation analysis is a powerful data analytics technique that helps you identify and
understand different groups of customers, users, or prospects based on their characteristics,
preferences, or behaviors. By segmenting your data, you can tailor your marketing, product,
or service strategies to each group and optimize your performance and outcomes.
Criteria-based segmentation
This technique involves dividing your data into segments based on predefined criteria that are
relevant to your business goals, such as demographics, psychographics, geographic, or
behavioral variables. For example, you can segment your customers by age, income, lifestyle,
location, or purchase frequency. The main advantage of this technique is that it is easy to
implement and interpret, and it can help you target specific segments with customized
messages or offers. The main disadvantage is that it can be too simplistic or arbitrary, and it
may not capture the underlying patterns or relationships in your data.
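A tiny pandas sketch of criteria-based segmentation, bucketing made-up customers by predefined age brackets and a simple purchase-frequency rule:
import pandas as pd
# Hypothetical customer data
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [22, 35, 47, 58, 31],
    "purchases_per_year": [2, 14, 6, 1, 9],
})
# Predefined age brackets
customers["age_segment"] = pd.cut(
    customers["age"], bins=[0, 30, 50, 100],
    labels=["young", "middle", "senior"])
# Simple behavioral criterion
customers["frequent_buyer"] = customers["purchases_per_year"] >= 6
print(customers)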
Cluster analysis
This technique involves using statistical methods to group your data into segments based on
their similarities or differences in multiple dimensions, without relying on predefined criteria.
For example, you can use cluster analysis to find out how your customers are clustered based
on their preferences, needs, or behaviors across various product or service attributes. The
main advantage of this technique is that it can reveal hidden insights and patterns in your
data, and it can help you discover new or niche segments that you may not have considered
before. The main disadvantage is that it can be complex and challenging to implement and
validate, and it may require advanced data analytics skills and tools.
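A minimal k-means sketch with scikit-learn (one common clustering method); the two behavioral features and the choice of three clusters are illustrative assumptions:
import numpy as np
from sklearn.cluster import KMeans
# Hypothetical features: annual spend and visits per month (made-up)
X = np.array([
    [200, 2], [220, 3], [250, 2],
    [900, 10], [950, 12], [880, 11],
    [500, 6], [520, 5], [480, 7],
])
# Group customers into three segments by similarity
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(X)
print(segments)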
Hybrid segmentation
This technique involves combining criteria-based and cluster analysis techniques to create
segments that are both meaningful and actionable. For example, you can use criteria-based
segmentation to create broad segments based on your business objectives, and then use
cluster analysis to refine or subdivide those segments based on your data characteristics. The
main advantage of this technique is that it can leverage the strengths of both techniques and
overcome their limitations, and it can help you create segments that are more accurate and
relevant to your data and goals. The main disadvantage is that it can be time-consuming and
resource-intensive, and it may require more testing and evaluation to ensure its validity and
reliability.
1. Data Ingestion
Data ingestion is the process of collecting and importing data from various sources into a
centralized repository. Automated data ingestion tools can connect to multiple data sources,
such as databases, cloud storage, APIs, and streaming platforms, to extract and load data in
real-time or batch mode.
2. Data Cleaning
Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values
in the dataset. Automated data cleaning tools use algorithms and rules to detect and correct
common data quality issues, such as:
Outliers: Detecting and handling outliers that may skew analysis results.
3. Data Transformation
Data transformation involves converting data from one format or structure to another to meet
the requirements of the analysis or target system. Automated data transformation tools can
perform a range of tasks, including:
Data Pivoting: Converting rows to columns and vice versa to restructure the
dataset.
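A small pandas sketch of pivoting (rows to columns) and its reverse, melt, on a made-up sales table:
import pandas as pd
# Hypothetical long-format sales data
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 90, 110],
})
# Pivot: one row per region, one column per quarter
wide = sales.pivot(index="region", columns="quarter", values="revenue")
print(wide)
# Melt: back from columns to rows
long = wide.reset_index().melt(id_vars="region", value_name="revenue")
print(long)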
4. Data Integration
Data integration involves combining data from multiple sources into a unified dataset.
Automated data integration tools can perform tasks such as:
Data Merging: Combining data from different sources based on common keys
or identifiers.
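A short pandas merge sketch joining two made-up tables on a shared customer_id key:
import pandas as pd
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ada", "Ben", "Cara"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [250, 40, 95],
})
# Combine the two sources on the common key
merged = pd.merge(customers, orders, on="customer_id", how="inner")
print(merged)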
5. Data Enrichment
Data enrichment involves enhancing the dataset with additional information or context to
improve analysis and decision-making. Automated data enrichment tools can:
Text Enrichment: Extract and add relevant information from unstructured text
data using natural language processing techniques.