Data Analytics MSE


1. What is the purpose of regression analysis?
2. What is classification? Explain in detail.
3. Explore Naive Bayes programming in detail.
4. Explain logistic regression with an example.
5. Compare different classification methods.
6. Explain analysis of variance in detail.
Analysis of variance (ANOVA) is a statistical method used in data analytics to test whether there are
significant differences between the means of two or more groups. ANOVA is commonly used in
experimental designs; it analyzes the variance within and between groups to determine whether
the differences between the means are statistically significant.

ANOVA is a procedure used by statisticians to check for a potential difference in a scale-level
dependent variable across the categories of a nominal-level variable having two or more categories.
It was developed by Ronald Fisher in 1918 and extends the t-test and z-test, which compare only a
nominal-level variable with just two categories.

Types of ANOVA

ANOVAs are mainly of three types:

One-way ANOVA - A one-way ANOVA has only one independent variable, which can have two or
more levels. For example, to assess differences in IQ by country, you can compare data from two or
more countries.

Two-way ANOVA - A two-way ANOVA uses two independent variables, for example, to assess
differences in IQ by country (variable 1) and gender (variable 2). Here you can also examine the
interaction between the two independent variables. Such interactions may indicate that differences
in IQ are not uniform across the levels of an independent variable. For example, females may have
higher IQ scores than males overall, and this advantage may be larger in Europe than in America.

Two-way ANOVAs are also termed factorial ANOVAs and can be balanced as well as unbalanced.
Balanced refers to having the same number of participants in each group, whereas unbalanced refers
to having different numbers of participants in each group. The following special kinds of ANOVA can
be used to handle unbalanced groups:

Hierarchical approach (Type 1) - if the data was not intentionally unbalanced and there is some kind
of hierarchy between the factors.

Classical experimental approach (Type 2) - if the data was not intentionally unbalanced and there is
no hierarchy between the factors.

Full regression approach (Type 3) - if the data was intentionally unbalanced to reflect the population.


N-way or Multivariate ANOVA - An N-way ANOVA has multiple independent variables. For example,
to assess differences in IQ by country, gender, age, etc. simultaneously, an N-way ANOVA is
deployed.

ANOVA Test Procedure

The general steps to carry out an ANOVA are as follows.

Set up the null and alternative hypotheses, where the null hypothesis states that there is no
significant difference among the groups and the alternative hypothesis assumes that there is a
significant difference among the groups.

Calculate the F-ratio and the probability of F.

Compare the p-value of the F-ratio with the established alpha or significance level.

If the p-value of F is less than the significance level (e.g. 0.05), reject the null hypothesis.

If the null hypothesis is rejected, conclude that the group means are not all equal.
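
As an illustration, here is a minimal sketch of a one-way ANOVA in Python using scipy.stats.f_oneway; the three groups of scores are invented for the example.

```python
# A minimal sketch of a one-way ANOVA with SciPy, using made-up sample data.
from scipy import stats

# Hypothetical scores for three groups (one independent variable, three levels).
group_a = [85, 90, 88, 75, 95]
group_b = [70, 65, 80, 72, 68]
group_c = [90, 92, 85, 88, 94]

# f_oneway returns the F-ratio and its p-value.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

alpha = 0.05  # significance level
if p_value < alpha:
    print("Reject the null hypothesis: the group means are not all equal.")
else:
    print("Fail to reject the null hypothesis.")
```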

7. What is data analytics? Illustrate in brief with an example.

https://www.guru99.com/what-is-data-analysis.html
8. Explain probability distribution methods with examples.
Probability distribution methods are used in statistics to describe the probability of
different outcomes or events that may occur within a population or sample. There
are several different types of probability distributions, each of which is used to model
different types of data. Some common probability distribution methods include:

1. Normal distribution: The normal distribution is used to model data that is
normally distributed, or "bell-shaped". In this distribution, the mean, median,
and mode are all equal and the distribution is symmetrical. One example of a
variable that might follow a normal distribution is the height of individuals in a
population.
2. Binomial distribution: The binomial distribution is used to model the
probability of a binary outcome, such as success or failure. For example, the
binomial distribution could be used to model the probability of flipping a coin
and getting heads or tails.
3. Poisson distribution: The Poisson distribution is used to model the probability
of rare events occurring within a specific time period. For example, it could be
used to model the number of car accidents that occur on a particular road in a
given month.
4. Exponential distribution: The exponential distribution is used to model the
time between events occurring in a Poisson process. For example, it could be
used to model the time between earthquakes or the time between customer
arrivals at a store.
5. Uniform distribution: The uniform distribution is used to model data that is
evenly distributed across a range of values. For example, it could be used to
model the probability of rolling a certain number on a fair die.
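
To make these distributions concrete, here is a small sketch that draws samples from each of them with NumPy; all parameter values (heights, coin flips, accident rates, and so on) are illustrative assumptions.

```python
# A small sketch drawing samples from the five distributions above with NumPy.
import numpy as np

rng = np.random.default_rng(42)

normal   = rng.normal(loc=170, scale=10, size=1000)  # heights in cm
binomial = rng.binomial(n=10, p=0.5, size=1000)      # heads in 10 coin flips
poisson  = rng.poisson(lam=3, size=1000)             # accidents per month
expo     = rng.exponential(scale=2.0, size=1000)     # time between arrivals
uniform  = rng.integers(low=1, high=7, size=1000)    # rolls of a fair die (1-6)

print("normal mean  ~", normal.mean())   # close to 170
print("poisson mean ~", poisson.mean())  # close to 3 (mean equals lambda)
```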

9. How are permutation and randomization tests performed? Illustrate with an example.
Permutation and randomization tests are non-parametric statistical methods that do
not require any assumptions about the distribution of the data. They are used when
we have a null hypothesis about the equality of two or more populations and want to
test whether there is sufficient evidence to reject it.

Here's an example of how permutation and randomization tests can be performed:

Suppose we have two groups of students (Group A and Group B) and we want to test
whether there is a significant difference in their exam scores. The null hypothesis is
that there is no difference between the two groups, and the alternative hypothesis is
that there is a difference.

Permutation Test:

The permutation test is a technique that involves shuffling the labels of the
observations and computing the test statistic many times to obtain the null
distribution of the test statistic.

Here are the steps to perform a permutation test:

1. Compute the observed test statistic: In this case, the test statistic is the
difference in means between the two groups.
2. Combine the data: Combine the scores of both groups into a single dataset.
3. Shuffle the labels: Randomly shuffle the group labels (A or B) for each
observation in the combined dataset.
4. Compute the test statistic: Calculate the difference in means between the
shuffled groups.
5. Repeat steps 3 and 4 many times (e.g. 1000 times) to obtain the null
distribution of the test statistic.
6. Compare the observed test statistic with the null distribution: Calculate the p-
value by counting the proportion of times the shuffled test statistic was
greater than or equal to the observed test statistic.
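
Here is a minimal sketch of these six steps in Python, assuming hypothetical exam scores for the two groups; a two-sided p-value is computed by counting shuffled statistics at least as extreme as the observed one.

```python
# A minimal sketch of a permutation test with NumPy and hypothetical scores.
import numpy as np

rng = np.random.default_rng(0)
group_a = np.array([78, 85, 92, 88, 75])
group_b = np.array([70, 65, 80, 72, 68])

# Step 1: observed test statistic (difference in means).
observed = group_a.mean() - group_b.mean()

# Step 2: combine the data into a single dataset.
combined = np.concatenate([group_a, group_b])
n_a = len(group_a)

# Steps 3-5: shuffle the labels many times and recompute the statistic.
n_perm = 1000
null_stats = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(combined)
    null_stats[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

# Step 6: p-value = proportion of shuffled statistics at least as extreme.
p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(f"observed diff = {observed:.2f}, p = {p_value:.3f}")
```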

Randomization Test:
The randomization test is a type of permutation test that involves randomly re-
assigning the observations to groups rather than shuffling the labels.

Here are the steps to perform a randomization test:

1. Compute the observed test statistic: In this case, the test statistic is the
difference in means between the two groups.
2. Combine the data: Combine the scores of both groups into a single dataset.
3. Randomly assign the observations to groups: Randomly assign the
observations to either Group A or Group B.
4. Compute the test statistic: Calculate the difference in means between the two
groups.
5. Repeat steps 3 and 4 many times (e.g. 1000 times) to obtain the null
distribution of the test statistic.
6. Compare the observed test statistic with the null distribution: Calculate the p-
value by counting the proportion of times the random test statistic was
greater than or equal to the observed test statistic.

10. Summarize modern data analytics tools in detail.


Modern data analytics tools are designed to help businesses and organizations make
informed decisions by analyzing large amounts of data. These tools have become
increasingly important as the amount of data generated by businesses and
individuals continues to grow. Here are some of the key features of modern data
analytics tools:

1. Data integration and storage: Modern data analytics tools allow businesses to
collect, integrate, and store data from a variety of sources. This can include
structured data (such as customer information) as well as unstructured data
(such as social media posts).
2. Data exploration and visualization: Once the data has been collected and
stored, modern analytics tools allow businesses to explore the data through
various visualizations such as charts, graphs, and maps. This helps to identify
patterns, trends, and outliers in the data.
3. Machine learning and predictive modeling: Modern analytics tools use
machine learning algorithms to identify patterns and make predictions based
on historical data. This can be used to make informed decisions about future
actions.
4. Real-time analytics: Many modern analytics tools allow businesses to analyze
data in real-time. This can be especially useful for businesses that need to
make quick decisions based on changing circumstances.
5. Collaboration and sharing: Modern analytics tools allow teams to collaborate
on data analysis projects and share insights with each other. This can improve
decision-making and lead to better outcomes for the business.
6. Cloud-based deployment: Many modern analytics tools are cloud-based,
meaning that businesses can access them from anywhere with an internet
connection. This makes it easier for teams to work together and for businesses
to scale their analytics capabilities as needed.

11. How would you explain the key statistical concepts in inference?


Statistical inference is a branch of statistics that involves using statistical
methods to make conclusions or predictions about a population based on a
sample of data. There are several key concepts that are important in
statistical inference, including:

1. Population: The population is the group of individuals, objects, or
measurements that we are interested in studying. It is usually too
large or too expensive to collect data from every member of the
population, so we collect data from a sample instead.
2. Sample: A sample is a subset of the population that we actually
collect data from. The goal of statistical inference is to use the
information in the sample to make conclusions or predictions about
the population.
3. Parameter: A parameter is a characteristic of the population, such as
the population mean or standard deviation. We usually don't know
the value of the parameter, so we use the sample data to estimate it.
4. Statistic: A statistic is a characteristic of the sample, such as the
sample mean or standard deviation. We can use the sample statistic
to estimate the population parameter.
5. Sampling distribution: The sampling distribution is the distribution of
all possible sample statistics that could be obtained from a
population. It helps us to understand the uncertainty or variability in
our estimates.
6. Hypothesis testing: Hypothesis testing is a method of making
decisions about the population based on the sample data. We start
with a null hypothesis that there is no difference or no effect, and we
use the sample data to calculate a test statistic. We then compare the
test statistic to a critical value or calculate a p-value to determine
whether we should reject the null hypothesis in favor of an alternative
hypothesis.
7. Confidence intervals: Confidence intervals are a range of values that
we are fairly certain contains the true value of the population
parameter. We use the sample data to calculate the confidence
interval and specify a level of confidence (such as 95% or 99%).
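
As a brief illustration of the last concept, here is a small sketch that computes a 95% confidence interval for a population mean from a hypothetical sample, using SciPy's t distribution.

```python
# A short sketch: 95% confidence interval for a mean from a hypothetical sample.
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7])

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
# 95% CI based on the t distribution with n - 1 degrees of freedom.
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"sample mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```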

By understanding these key concepts, we can use statistical inference to
make conclusions or predictions about a population based on a sample of
data. This is a powerful tool for decision-making in a variety of fields,
including business, healthcare, and social sciences.

1. What is the purpose of regression analysis?


In simple words: the purpose of regression analysis is to predict an outcome based on historical
data. The historical data is understood using regression analysis, and this understanding helps us
build a model to predict an outcome. Because it helps us predict, it is called a predictive analysis
model.

Example: if I want to predict what type of people buy wine, I would find data on people who buy
wine: their age, height, financial status, etc. By analyzing this data I can build a model to predict
whether a person would buy wine or not.

So regression analysis is used to predict the behavior of a dependent variable (whether a person
buys wine) based on the behavior of a few or many independent variables (age, height, financial
status).

 It is mainly used for prediction, forecasting, time-series modeling, and determining the
cause-and-effect relationship between variables.

Some examples of regression are:

o Prediction of rain using temperature and other factors

o Determining Market trends

o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:

o Dependent Variable: The main factor in regression analysis that we want to predict or
understand is called the dependent variable. It is also called the target variable.

o Independent Variable: The factors which affect the dependent variable, or which are used
to predict its values, are called independent variables, also called predictors.

o Outliers: An outlier is an observation with either a very low or a very high value in
comparison to the other observed values. An outlier may distort the result, so it should be
avoided.

o Multicollinearity: If the independent variables are highly correlated with each other, the
condition is called multicollinearity. It should not be present in the dataset, because it
creates problems when ranking the most influential variable.

o Underfitting and Overfitting: If our algorithm works well with the training dataset but not
with the test dataset, the problem is called overfitting. And if our algorithm does not
perform well even on the training dataset, the problem is called underfitting.

Why do we use Regression Analysis?

As mentioned above, regression analysis helps in the prediction of a continuous variable. There are
various real-world scenarios where we need future predictions, such as weather conditions, sales,
and marketing trends; for such cases we need a technique that can make predictions accurately.
Regression analysis is such a technique: a statistical method used in machine learning and data
science. Below are some other reasons for using regression analysis:

Regression estimates the relationship between the target and the independent variable.

It is used to find the trends in data.

It helps to predict real/continuous values.

By performing regression, we can determine the most important factor, the least important factor,
and how each factor affects the others.
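
As an illustration of these points, here is a minimal sketch of a regression model in Python with scikit-learn; the temperature and rainfall figures are invented for the example, echoing the rain-prediction use case above.

```python
# A minimal sketch of regression for prediction using scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Independent variable: temperature (C); dependent variable: rainfall (mm).
X = np.array([[20], [22], [25], [27], [30], [32]])
y = np.array([80, 75, 60, 55, 40, 35])

model = LinearRegression().fit(X, y)
print("coefficient:", model.coef_[0])  # estimated effect of temperature
print("intercept:", model.intercept_)

# Predict rainfall for a new temperature reading.
print("predicted rainfall at 28 C:", model.predict([[28]])[0])
```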

2. What is classification? Explain in detail.


The Classification algorithm is a supervised learning technique that is used to identify the
category of new observations on the basis of training data. In classification, a program
learns from the given dataset or observations and then classifies new observations into a
number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc.
Classes can be called targets/labels or categories.
Unlike regression, the output variable of classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the classification algorithm is a supervised
learning technique, it takes labeled input data, which means the input comes with the
corresponding output.
In a classification algorithm, a discrete output function (y) is mapped to the input variable (x):
y = f(x), where y is a categorical output.
The best example of an ML classification algorithm is Email Spam Detector.
The main goal of the Classification algorithm is to identify the category of a given dataset,
and these algorithms are mainly used to predict the output for the categorical data.
Classification can be pictured with two classes, Class A and Class B: the observations
within each class have features that are similar to each other and dissimilar to those of the
other class.
The algorithm which implements the classification on a dataset is known as a classifier.
There are two types of Classifications:
Binary Classifier: If the classification problem has only two possible outcomes, it is called a
binary classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier: If a classification problem has more than two outcomes, it is called a
multi-class classifier.
Examples: classification of types of crops, classification of types of music.
Learners in Classification Problems:
In the classification problems, there are two types of learners:
Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the
test dataset. In the lazy learner's case, classification is done on the basis of the most related
data stored in the training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, case-based reasoning
Eager Learners: Eager learners develop a classification model based on a training dataset
before receiving a test dataset. Opposite to lazy learners, eager learners take more time in
learning and less time in prediction. Examples: Decision Trees, Naïve Bayes, ANN.
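
Here is a small sketch contrasting the two kinds of learners, using scikit-learn and its built-in Iris dataset; the exact accuracies will vary with the train/test split.

```python
# A small sketch: a lazy learner (K-NN) versus an eager learner (decision tree).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Lazy learner: K-NN stores the training data and defers work to prediction time.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Eager learner: the decision tree builds its model up front, during training.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("K-NN accuracy:", knn.score(X_test, y_test))
print("Decision tree accuracy:", tree.score(X_test, y_test))
```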

3. Explore Naive Bayes programming in detail.


Naive Bayes is a machine learning algorithm that is widely used for classification tasks. It is
based on Bayes' theorem, which states that the probability of a hypothesis (or class) given
the observed evidence (or features) is proportional to the probability of the evidence given
the hypothesis times the prior probability of the hypothesis.
Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem
and used for solving classification problems.
It is mainly used in text classification that includes a high-dimensional training dataset.
Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it
helps in building fast machine learning models that can make quick predictions.
It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
Some popular examples of the Naïve Bayes algorithm are spam filtering, sentiment analysis,
and classifying articles.
Bayes' Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is
true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naïve Bayes' Classifier:
The working of the Naïve Bayes classifier can be understood through the following steps:
Convert the given dataset into frequency tables.
Generate a likelihood table by finding the probabilities of the given features.
Use Bayes' theorem to calculate the posterior probability.
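
Here is a minimal sketch of these steps for text classification with scikit-learn: CountVectorizer builds the frequency tables and MultinomialNB applies Bayes' theorem with the naive independence assumption. The tiny spam/ham corpus is made up for illustration.

```python
# A minimal sketch of Naive Bayes text classification with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free money claim now", "lunch with the team"]
labels = ["spam", "ham", "spam", "ham"]

# Frequency tables: word counts per message.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# MultinomialNB computes posterior probabilities via Bayes' theorem.
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["claim your free prize"])))  # ['spam']
```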
4. Logistic regression in machine learning, with an example
Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. But
instead of giving the exact values 0 and 1, it gives probabilistic values which lie between
0 and 1.
Logistic regression is quite similar to linear regression, except in how each is used. Linear
regression is used for solving regression problems, whereas logistic regression is used for
solving classification problems.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.

Here is an example of how logistic regression can be used:


Suppose we have a dataset of students and we want to predict whether a student will pass
or fail an exam based on their study time. We have data on 100 students, including the
number of hours they studied and whether they passed or failed the exam. Our goal is to
build a model that can predict whether a new student will pass or fail based on their study
time.
We can use logistic regression to build a model that predicts the probability of passing the
exam, given the number of hours studied. We can start by plotting the data on a graph, with
the x-axis representing the number of hours studied and the y-axis representing the
pass/fail outcome (0 for fail, 1 for pass). We can then fit a logistic function to the data, which
will give us a curve that represents the probability of passing the exam as a function of the
number of hours studied.
Once we have fitted the logistic function to the data, we can use it to predict the probability
of passing the exam for a new student based on their study time. For example, if a new
student studies for 5 hours, we can use the logistic function to predict the probability of
passing the exam. If the probability is above a certain threshold (usually 0.5), we can predict
that the student will pass the exam, otherwise we can predict that they will fail.
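
Here is a sketch of this pass/fail example in Python with scikit-learn; the study-hour data is hypothetical.

```python
# A sketch of the pass/fail example above with scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hours studied and pass (1) / fail (0) outcomes.
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

# The fitted logistic function yields a probability between 0 and 1.
prob_pass = model.predict_proba([[5]])[0, 1]
print(f"P(pass | 5 hours) = {prob_pass:.2f}")
print("prediction (threshold 0.5):", model.predict([[5]])[0])
```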

5. Compare different types of classification methods.


Types of ML Classification Algorithms:
Classification algorithms can be divided into two main categories:
Linear models:
o Logistic Regression
o Support Vector Machines
Non-linear models:
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
There are several types of classification methods in machine learning, each with their own
strengths and weaknesses. Here are some of the most common types of classification
methods:
Logistic Regression: Logistic regression is a type of regression analysis used for predicting
binary outcomes. It is a simple and widely used method for binary classification tasks.
Naive Bayes: Naive Bayes is a probabilistic algorithm that uses Bayes theorem to predict the
probability of a class given a set of features. It is often used in text classification tasks, such
as spam filtering or sentiment analysis.
Decision Trees: Decision trees are a tree-like model where each internal node represents a
test on an attribute, each branch represents the outcome of the test, and each leaf node
represents a class label. Decision trees are easy to interpret and can handle both categorical
and numerical data.
Random Forest: Random forest is an ensemble learning method that uses multiple decision
trees to improve classification accuracy. It is often used when the dataset is large and
complex.
Support Vector Machines (SVM): SVM is a powerful method for classification that is often
used for image recognition, text classification, and bioinformatics. It works by finding a
hyperplane that separates the classes with the largest margin.
Neural Networks: Neural networks are machine learning models loosely modeled after the
human brain. They are powerful and flexible, but can be difficult to interpret.
K-Nearest Neighbors (KNN): KNN is a simple and easy-to-implement algorithm that works by
finding the k nearest neighbors to a given data point, and classifying the data point based on
the most common class label among its neighbors.
