Data Analytics
Unlike regression, the output variable of classification is a category rather than a
value, such as "Green or Blue", "fruit or animal", etc. Because classification is a
supervised learning technique, it takes labeled input data, meaning each input comes
with a corresponding output.
The main goal of a classification algorithm is to identify the category that a new
observation belongs to; such algorithms are mainly used to predict outputs for
categorical data.
Classification can be illustrated with a diagram showing two classes, Class A and
Class B: the points within each class have features similar to one another and
dissimilar to the points in the other class.
Binary Classifier: If the classification problem has only two possible outcomes, it is
called a binary classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier: If a classification problem has more than two outcomes, it is
called a multi-class classifier.
Example: Suppose I want to predict what type of people buy wine. I would gather
data on people who buy wine: their age, height, financial status, etc. By analyzing
this data, I can build a model that predicts which type of buyer a given person is.
Logistic regression is one of the most popular machine learning algorithms and
comes under the supervised learning technique. It is used for predicting a
categorical dependent variable from a given set of independent variables.
Logistic regression is much like linear regression except in how it is used: linear
regression is used for solving regression problems, whereas logistic regression is
used for solving classification problems.
In logistic regression, instead of fitting a straight regression line, we fit an
"S"-shaped logistic function whose output is bounded between the two extreme
values, 0 and 1.
The curve of the logistic function indicates the likelihood of an outcome, such as
whether cells are cancerous or not, or whether a mouse is obese or not based on its
weight.
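For reference, the S-shaped curve comes from the logistic (sigmoid) function. In a
generic single-feature form (the coefficient names b0 and b1 here are illustrative):

p(x) = 1 / (1 + e^(-(b0 + b1*x)))

where p(x) is the predicted probability of the positive class and b0, b1 are learned
from the training data. Because the exponential term is always positive, the output
always lies strictly between 0 and 1.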
Logistic regression is a significant machine learning algorithm because it can
provide probabilities and classify new data using both continuous and discrete
datasets.
Suppose we have a dataset of students and we want to predict whether a student will
pass or fail an exam based on their study time. We have data on 100 students,
including the number of hours they studied and whether they passed or failed the exam.
Our goal is to build a model that can predict whether a new student will pass or fail
based on their study time.
We can use logistic regression to build a model that predicts the probability of passing
the exam, given the number of hours studied. We can start by plotting the data on a
graph, with the x-axis representing the number of hours studied and the y-axis
representing the pass/fail outcome (0 for fail, 1 for pass). We can then fit a logistic
function to the data, which will give us a curve that represents the probability of passing
the exam as a function of the number of hours studied.
Once we have fitted the logistic function to the data, we can use it to predict the
probability of passing the exam for a new student based on their study time. For
example, if a new student studies for 5 hours, we can use the logistic function to predict
the probability of passing the exam. If the probability is above a certain threshold
(usually 0.5), we predict that the student will pass the exam; otherwise, we predict
that they will fail.
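As an illustration, here is a minimal sketch of this workflow in Python, assuming
scikit-learn is available; the study-hours data below is invented, not taken from
the 100-student dataset described above:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied and pass (1) / fail (0) outcomes
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Fit the S-shaped logistic curve to the data
model = LogisticRegression()
model.fit(hours, passed)

# Predicted probability of passing for a new student who studied 5 hours
prob_pass = model.predict_proba([[5]])[0][1]
print(f"P(pass | 5 hours) = {prob_pass:.2f}")

# Apply the usual 0.5 threshold to get a pass/fail prediction
print("Prediction:", "pass" if prob_pass >= 0.5 else "fail")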
1. Decision Trees: Decision trees are a popular classification method that works by
recursively splitting the data into smaller subsets based on the values of input
features. The algorithm chooses the best feature to split the data based on a
criterion such as Gini index or entropy. Decision trees are easy to interpret and
visualize, making them a popular choice for problems that require transparency
and explainability. However, they can be prone to overfitting and may not perform
well on data with complex relationships.
2. Naive Bayes: Naive Bayes is a probabilistic classification method based on
Bayes' theorem. It assumes that the input features are conditionally independent
given the class variable. Naive Bayes is computationally efficient and can handle
high-dimensional data well. It is often used in natural language processing tasks
such as sentiment analysis and spam filtering. However, it assumes that the
input features are independent, which may not hold true in many real-world
scenarios.
3. Logistic Regression: Logistic regression is a linear classification method that
models the probability of an observation belonging to a particular class. It uses
a logistic function to map a linear combination of the input features to the
probability of the output variable. Logistic regression is computationally
efficient and easy to interpret. However, it assumes that the decision boundary
is linear, which may not be the case in many real-world scenarios, so it handles
non-linear relationships poorly unless the input features are transformed.
4. Support Vector Machines (SVM): SVM is a powerful classification method that
constructs a hyperplane or a set of hyperplanes in a high-dimensional space to
separate the data into different classes. It works by maximizing the margin
between the hyperplanes and the closest data points of each class. SVM can
handle non-linear relationships and can be used for both binary and multi-class
classification problems. However, SVM can be computationally intensive and
may not perform well on noisy or overlapping data.
5. Random Forest: Random Forest is an ensemble learning method that constructs
a multitude of decision trees at training time and outputs the class that is the
mode of the classes predicted by the individual trees. Random Forest can handle
high-dimensional data, noisy data, and non-linear relationships between input
features and output variables. It is also robust to overfitting and can handle
missing values. However, it can be difficult to interpret the results of a Random
Forest, and it may not perform well on imbalanced data.
6. k-Nearest Neighbors (k-NN): k-NN is a lazy learning classification method that
uses distance metrics to find the k-nearest neighbors of a new observation in the
training data and assigns it the class that is most frequent among its k-nearest
neighbors. k-NN is simple to implement and can handle non-linear relationships
and multi-class classification problems. However, it can be sensitive to irrelevant
features and noisy data, and the choice of the value of k can significantly affect
its performance.
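To make these trade-offs concrete, here is a minimal sketch that trains all six
methods on the same synthetic dataset, assuming scikit-learn is available (the
dataset and parameter choices are illustrative, not prescriptive):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

# 5-fold cross-validated accuracy for each method
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")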
Types of ANOVA
Two-way ANOVA - Two-way ANOVA uses two independent variables; for
example, to assess differences in IQ by country (variable 1) and
gender (variable 2). Here you can also examine the interaction between the two
independent variables. Such an interaction may indicate that differences in IQ
are not uniform across the levels of an independent variable. For example,
females may have higher IQ scores than males overall, and the gap may be much
larger in Europe than in America.
If the null hypothesis is rejected, we conclude that the group means are not all equal.
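A minimal two-way ANOVA sketch in Python, assuming pandas and statsmodels are
available; the IQ, country, and gender values below are invented for illustration:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical observations: IQ score with country and gender factors
df = pd.DataFrame({
    "iq":      [98, 102, 110, 95, 101, 107, 99, 104, 112, 96, 103, 108],
    "country": ["Europe", "Europe", "Europe", "America", "America", "America"] * 2,
    "gender":  ["F"] * 6 + ["M"] * 6,
})

# Two-way ANOVA with main effects and an interaction term between the two factors
model = ols("iq ~ C(country) * C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))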
Suppose a retail company wants to increase its sales revenue. They can
use different types of data analytics to achieve their goal.
Descriptive Analytics: They can analyze the sales data from the previous
year to identify which products sold the most, which stores had the highest
sales revenue, and which marketing campaigns were the most successful.
By using different types of data analytics, the retail company can gain
insights into their business operations and make data-driven decisions that
can help them increase their sales revenue.
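For the descriptive analytics step, a minimal pandas sketch; the store, product,
and revenue figures below are hypothetical:

import pandas as pd

# Hypothetical sales records from the previous year
sales = pd.DataFrame({
    "store":   ["North", "North", "South", "South"],
    "product": ["Shoes", "Hats", "Shoes", "Hats"],
    "revenue": [12000, 3000, 9000, 7500],
})

# Which products sold the most, and which stores earned the most revenue
print(sales.groupby("product")["revenue"].sum().sort_values(ascending=False))
print(sales.groupby("store")["revenue"].sum().sort_values(ascending=False))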
Permutation Test:
The permutation test is a technique that involves shuffling the labels of the
observations and computing the test statistic many times to obtain the null
distribution of the test statistic.
Randomization Test:
1. Compute the observed test statistic: In this case, the test statistic
is the difference in means between the two groups.
2. Combine the data: Combine the scores of both groups into a
single dataset.
3. Randomly assign the observations to groups: Randomly assign
the observations to either Group A or Group B.
4. Compute the test statistic: Calculate the difference in means
between the two groups.
5. Repeat steps 3 and 4 many times (e.g. 1000 times) to obtain the
null distribution of the test statistic.
6. Compare the observed test statistic with the null distribution:
Calculate the p-value by counting the proportion of times the
random test statistic was greater than or equal to the observed test
statistic.
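Steps 1-6 translate directly into a short NumPy sketch; the two groups' scores
below are placeholders:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical scores for the two groups
group_a = np.array([85.0, 90, 78, 92, 88])
group_b = np.array([75.0, 80, 82, 79, 77])

# Step 1: observed test statistic (difference in means)
observed = group_a.mean() - group_b.mean()

# Step 2: combine both groups into a single dataset
combined = np.concatenate([group_a, group_b])
n_a = len(group_a)

# Steps 3-5: reshuffle the group labels many times to build the null distribution
n_iter = 1000
null_stats = np.empty(n_iter)
for i in range(n_iter):
    rng.shuffle(combined)
    null_stats[i] = combined[:n_a].mean() - combined[n_a:].mean()

# Step 6: p-value = proportion of shuffled statistics >= the observed one
p_value = np.mean(null_stats >= observed)
print(f"observed diff = {observed:.2f}, p-value = {p_value:.3f}")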
Data quality: Big data often contains incomplete, inconsistent, and inaccurate data, which can
negatively affect the accuracy of analysis results. It is essential to ensure that data is of high
quality before conducting any analysis.
Data security: As big data contains sensitive and confidential information, data security is a
significant concern. The potential for data breaches and cyber-attacks must be addressed
through robust security measures.
Scalability: As the volume of data grows, traditional data processing tools and techniques may
not be able to handle the workload. Big data analytics systems must be scalable to
accommodate growth.
Data integration: Big data often comes from various sources, and integrating all the data can be
a challenge. Integrating data from various sources is necessary to get a complete picture of the
data and obtain accurate analysis results.
Data analysis: Analyzing big data requires sophisticated algorithms and tools, which can be
difficult to implement and manage. Data scientists need to have the right expertise to use these
tools effectively.
Infrastructure: Big data requires a robust infrastructure to store, process, and analyze data. It
can be expensive to set up and maintain such an infrastructure.
Interpretation and visualization: The insights generated from big data must be interpreted and
visualized to make them understandable and actionable. This requires expertise in data
visualization and communication.
Regulatory compliance: Big data analytics must comply with regulatory requirements regarding
data privacy, data protection, and data governance.
However, unsupervised learning also has some limitations. One of the main
challenges is the lack of interpretability. Because unsupervised learning
algorithms do not have a specific outcome or target variable, it can be
difficult to interpret the results and understand the underlying patterns or
relationships in the data. Additionally, because the algorithm is not provided
with any feedback, it can be difficult to evaluate the accuracy of the model.
5. Evaluating the rules: The final step is to evaluate the association rules to
determine their usefulness and relevance. This may involve calculating
metrics such as support, confidence, and lift, which measure the strength
and significance of the rules.
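To illustrate those three metrics, here is a minimal sketch on a made-up set of
transactions; the items and the example rule {bread} -> {milk} are hypothetical:

# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / n

# Rule: {bread} -> {milk}
antecedent, consequent = {"bread"}, {"milk"}
supp = support(antecedent | consequent)
conf = supp / support(antecedent)
lift = conf / support(consequent)
print(f"support={supp:.2f}, confidence={conf:.2f}, lift={lift:.2f}")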
https://developers.google.com/machine-learning/clustering/clustering-algorithms
20. How data can be created for analytics through active learning?
To create data for analytics through active learning, you can follow these
steps:
1. Identify the problem: Determine the problem you want to solve through
analytics. For example, you might want to predict customer churn or
identify fraudulent transactions.
2. Collect initial data: Collect a small set of labeled data to train your model.
This initial data can be obtained through various sources such as historical
data or data from domain experts.
3. Select an active learning strategy: Choose a method for deciding which
unlabeled samples the model should query next, such as uncertainty sampling.
4. Train the model: Train the model using the initial labeled data and the
selected active learning algorithm.
5. Query informative samples: Use the strategy to pick the unlabeled samples
the model is least certain about.
6. Label the queried samples: Obtain labels for the selected samples, for
example from human annotators or domain experts.
7. Add labeled samples to training data: Add the labeled samples to the
training data and retrain the model, repeating the query-label-retrain loop
until the model reaches the desired accuracy.
By using active learning to create data for analytics, you can significantly
reduce the cost and time required for data labeling while achieving high
accuracy in your predictive models.
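Putting the steps together, here is a minimal uncertainty-sampling sketch using
scikit-learn; the pool data, batch size, and number of rounds are invented for
illustration (in a real project, step 6 would involve human annotators rather
than revealing held-back labels):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical pool: pretend only a few labels are known up front
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
labeled = list(rng.choice(len(X), size=10, replace=False))
unlabeled = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(5):
    # Step 4: train on the currently labeled data
    model.fit(X[labeled], y[labeled])

    # Steps 5-6: query the 5 pool samples the model is least certain about
    probs = model.predict_proba(X[unlabeled])
    uncertainty = 1 - probs.max(axis=1)
    queried = np.argsort(uncertainty)[-5:]

    # Step 7: "label" them (here we simply reveal y) and add to training data
    for idx in sorted(queried, reverse=True):
        labeled.append(unlabeled.pop(idx))

print("labeled examples after 5 rounds:", len(labeled))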
21. What is Logistic Regression? What kind of problems can be
solved using logistic regression? Explain the advantages and
disadvantages of logistic regression.
Naive Bayes and Logistic Regression are two commonly used classification
algorithms in machine learning. Although both are used for classification
problems, they have several differences in their approach and
performance.