Data Analytics MSE
Analysis of Variance (ANOVA) is a procedure used by statisticians to check for a potential
difference in a scale-level dependent variable across a nominal-level variable having two
or more categories. It was developed by Ronald Fisher in 1918, and it extends the t-test and
z-test, which can only compare a nominal-level variable with exactly two categories.
Types of ANOVA
One-way ANOVA - A one-way ANOVA has only one independent variable, which may have any number
of categories. For example, to assess differences in IQ by country, you can compare data from
two or more countries.
Two-way ANOVA - A two-way ANOVA uses two independent variables, for example, to assess
differences in IQ by country (variable 1) and gender (variable 2). Here you can also examine the
interaction between the two independent variables. Such interactions may indicate that differences
in IQ are not uniform across the levels of an independent variable. For example, females may have
a higher IQ score than males overall, and this gap may be much larger in Europe than in America.
Two-way ANOVAs are also termed factorial ANOVAs and can be balanced as well as unbalanced.
Balanced refers to having the same number of participants in each group, whereas unbalanced refers
to having different numbers of participants in each group. The following special kinds of ANOVA can
be used to handle unbalanced groups (a code sketch follows this list).
Hierarchical approach (Type 1) - used if the data was not intentionally unbalanced and there is some
type of hierarchy between the factors.
Classical experimental approach (Type 2) - used if the data was not intentionally unbalanced and there
is no hierarchy between the factors.
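To make the two-way (factorial) ANOVA concrete, here is a minimal Python sketch using the statsmodels library. The data frame, its column names (iq, country, gender), and all values are made up purely for illustration; typ=2 requests the Type 2 (classical experimental) sums of squares mentioned above.

```python
# Minimal two-way ANOVA sketch (made-up data).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "iq":      [98, 102, 110, 95, 104, 108, 99, 101, 107, 96, 103, 109],
    "country": ["EU", "EU", "EU", "US", "US", "US"] * 2,
    "gender":  ["F"] * 6 + ["M"] * 6,
})

# C(...) marks a categorical factor; '*' expands to both main effects
# plus their interaction (iq ~ country + gender + country:gender).
model = ols("iq ~ C(country) * C(gender)", data=df).fit()

# typ=2 corresponds to the classical experimental (Type 2) approach.
print(sm.stats.anova_lm(model, typ=2))
```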
Set up the null and alternative hypotheses, where the null hypothesis states that there is no
significant difference among the groups and the alternative hypothesis assumes that there is a
significant difference among the groups.
Compare the p-value of the F-ratio with the established alpha (significance) level.
If the null hypothesis is rejected, conclude that the group means are not all equal. A worked
example of these steps follows.
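As a concrete illustration, here is a minimal one-way ANOVA sketch using SciPy's f_oneway; the three groups of scores are invented example data, and the 0.05 alpha is just a conventional choice.

```python
# One-way ANOVA sketch following the steps above (made-up scores).
from scipy import stats

group1 = [85, 90, 88, 75, 95]
group2 = [70, 78, 82, 74, 80]
group3 = [92, 88, 91, 89, 94]

f_stat, p_value = stats.f_oneway(group1, group2, group3)

alpha = 0.05  # chosen significance level
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0; the group means are not all equal")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```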
Suppose we have two groups of students (Group A and Group B) and we want to test
whether there is a significant difference in their exam scores. The null hypothesis is
that there is no difference between the two groups, and the alternative hypothesis is
that there is a difference.
Permutation Test:
The permutation test is a technique that involves shuffling the labels of the
observations and computing the test statistic many times to obtain the null
distribution of the test statistic.
1. Compute the observed test statistic: In this case, the test statistic is the
difference in means between the two groups.
2. Combine the data: Combine the scores of both groups into a single dataset.
3. Shuffle the labels: Randomly shuffle the group labels (A or B) for each
observation in the combined dataset.
4. Compute the test statistic: Calculate the difference in means between the
shuffled groups.
5. Repeat steps 3 and 4 many times (e.g. 1000 times) to obtain the null
distribution of the test statistic.
6. Compare the observed test statistic with the null distribution: Calculate the p-value by
counting the proportion of times the shuffled test statistic was greater than or equal to
the observed test statistic (a code sketch of these steps follows).
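Below is a minimal NumPy sketch of these six steps for the Group A / Group B example. The scores, the random seed, and the 1000 iterations are all illustrative choices, and the p-value is computed one-sided as described in step 6.

```python
# Permutation test sketch: shuffle labels, recompute difference in means.
import numpy as np

rng = np.random.default_rng(0)            # fixed seed for reproducibility
group_a = np.array([85, 90, 88, 75, 95])  # made-up exam scores
group_b = np.array([70, 78, 82, 74, 80])

observed = group_a.mean() - group_b.mean()   # step 1: observed statistic
pooled = np.concatenate([group_a, group_b])  # step 2: combine the data
n_a = len(group_a)

n_iter = 1000
count = 0
for _ in range(n_iter):
    shuffled = rng.permutation(pooled)                    # step 3: shuffle labels
    stat = shuffled[:n_a].mean() - shuffled[n_a:].mean()  # step 4: recompute
    if stat >= observed:                                  # step 6 (one-sided)
        count += 1

print(f"observed diff = {observed:.2f}, p-value = {count / n_iter:.3f}")  # steps 5-6
```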
Randomization Test:
The randomization test is a type of permutation test that involves randomly re-
assigning the observations to groups rather than shuffling the labels.
1. Compute the observed test statistic: In this case, the test statistic is the
difference in means between the two groups.
2. Combine the data: Combine the scores of both groups into a single dataset.
3. Randomly assign the observations to groups: Randomly assign the
observations to either Group A or Group B.
4. Compute the test statistic: Calculate the difference in means between the two
groups.
5. Repeat steps 3 and 4 many times (e.g. 1000 times) to obtain the null
distribution of the test statistic.
6. Compare the observed test statistic with the null distribution: Calculate the p-value by
counting the proportion of times the random test statistic was greater than or equal to
the observed test statistic (see the sketch below).
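The sketch below follows the re-assignment description above: each observation is independently assigned to Group A or Group B, so group sizes can vary between iterations (the rare draws that leave one group empty are skipped). The data and seed are again made up.

```python
# Randomization test sketch: re-assign each observation to a group at random.
import numpy as np

rng = np.random.default_rng(1)
group_a = np.array([85, 90, 88, 75, 95])  # same made-up scores as before
group_b = np.array([70, 78, 82, 74, 80])

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])

n_iter = 1000
count = 0
done = 0
while done < n_iter:
    labels = rng.integers(0, 2, size=len(pooled))  # 0 -> Group A, 1 -> Group B
    if labels.min() == labels.max():
        continue  # skip the rare draw that leaves one group empty
    stat = pooled[labels == 0].mean() - pooled[labels == 1].mean()
    if stat >= observed:
        count += 1
    done += 1

print(f"observed diff = {observed:.2f}, p-value = {count / n_iter:.3f}")
```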
Features of Modern Data Analytics Tools:
1. Data integration and storage: Modern data analytics tools allow businesses to
collect, integrate, and store data from a variety of sources. This can include
structured data (such as customer information) as well as unstructured data
(such as social media posts).
2. Data exploration and visualization: Once the data has been collected and
stored, modern analytics tools allow businesses to explore the data through
various visualizations such as charts, graphs, and maps. This helps to identify
patterns, trends, and outliers in the data.
3. Machine learning and predictive modeling: Modern analytics tools use
machine learning algorithms to identify patterns and make predictions based
on historical data. This can be used to make informed decisions about future
actions.
4. Real-time analytics: Many modern analytics tools allow businesses to analyze
data in real-time. This can be especially useful for businesses that need to
make quick decisions based on changing circumstances.
5. Collaboration and sharing: Modern analytics tools allow teams to collaborate
on data analysis projects and share insights with each other. This can improve
decision-making and lead to better outcomes for the business.
6. Cloud-based deployment: Many modern analytics tools are cloud-based,
meaning that businesses can access them from anywhere with an internet
connection. This makes it easier for teams to work together and for businesses
to scale their analytics capabilities as needed.
Regression Analysis:
Example: If I want to predict what type of people buy wine, I would find data on people who buy
wine: their age, height, financial status, etc. By analyzing this data, I can build a model to predict
whether a person would buy wine or not.
So regression analysis is used to predict the behavior of a dependent variable (whether a person
buys wine) based on the behavior of one or more independent variables (age, height, financial
status).
It is mainly used for prediction, forecasting, time-series modeling, and determining the
cause-and-effect relationship between variables. A code sketch of the wine example above follows.
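As a sketch of the wine example: buying or not buying is a binary outcome, so a logistic regression (a member of the regression family) is a natural fit here. All column meanings and values below are invented for illustration.

```python
# Wine-purchase sketch: logistic regression on made-up data.
from sklearn.linear_model import LogisticRegression
import numpy as np

# Features per person: [age in years, income in thousands].
X = np.array([[25, 30], [45, 80], [35, 55], [52, 90], [23, 25], [40, 70]])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = bought wine, 0 = did not

model = LogisticRegression().fit(X, y)

# Predict for a hypothetical 30-year-old earning 60k.
print(model.predict([[30, 60]]))        # predicted class
print(model.predict_proba([[30, 60]]))  # class probabilities
```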
o Independent Variable: The factors which affect the dependent variable, or which are used
to predict its values, are called independent variables, also known as predictors.
o Outliers: An outlier is an observation with either a very low or a very high value in
comparison to the other observed values. An outlier may distort the results, so it should be
avoided.
o Multicollinearity: If the independent variables are highly correlated with each other, this
condition is called multicollinearity. It should not be present in the dataset, because it
creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not
with the test dataset, the problem is called overfitting. And if our algorithm does not
perform well even with the training dataset, the problem is called underfitting (see the
sketch after this list).
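A common way to spot overfitting in practice is to compare training and test scores, as in this sketch; the synthetic dataset and the unpruned decision tree are arbitrary choices made purely to illustrate the gap.

```python
# Overfitting check sketch: compare train vs. test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_tr, y_tr)  # unpruned trees tend to overfit

# A large train/test gap suggests overfitting; two low scores suggest underfitting.
print("train accuracy:", model.score(X_tr, y_tr))
print("test accuracy: ", model.score(X_te, y_te))
```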
As mentioned above, regression analysis helps in the prediction of a continuous variable. There are
various scenarios in the real world where we need future predictions, such as weather conditions,
sales, and marketing trends, and for such cases we need a technique that can make accurate
predictions. Regression analysis is such a technique: a statistical method used in machine learning
and data science. Below are some other reasons for using regression analysis:
Regression estimates the relationship between the target and the independent variables.
By performing a regression, we can determine the most important factor, the least important
factor, and how strongly each factor affects the target, as in the sketch below.
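For instance, after fitting a linear regression one can read the coefficients to judge how strongly each factor drives the target. The data below (advertising spend and price predicting sales) is made up, and comparing coefficient sizes is only meaningful when the inputs are on comparable scales.

```python
# Factor-importance sketch: coefficients of a fitted linear regression.
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[10, 5.0], [20, 4.8], [30, 4.1], [40, 3.6], [50, 3.2]])  # [ad spend, price]
y = np.array([120, 150, 190, 220, 260])                                # sales

model = LinearRegression().fit(X, y)

print("coefficients:", model.coef_)   # per-factor effect on the target
print("intercept:", model.intercept_)
```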