Q) Concept of Data Analytics
For example:
"Studies" =>"Study"
2) Experience (E):
i) As the name suggests, experience is the knowledge gained from the data
points provided to the algorithm or model.
ii) Once provided with the dataset, the model runs iteratively and learns
some inherent pattern. The learning thus acquired is called Experience (E).
iii) Supervised, unsupervised and reinforcement learning are some ways to
learn or gain experience. The experience gained by our ML model or
algorithm is used to solve the Task (T).
Advantages of k-NN:
1. The k-NN algorithm is simple and easy to implement.
2. k-NN is a versatile algorithm: it can be used for classification as
well as regression.
3. k-NN is very useful for nonlinear data because it makes no assumptions
about the data.
Disadvantages of k-NN:
1. The k-NN algorithm gets significantly slower as the number of examples
and/or predictors (independent variables) increases.
2. k-NN is computationally expensive at prediction time because it stores
all the training data and compares every query against it.
3. The k-NN algorithm requires high memory storage.
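The points above can be illustrated with a minimal sketch of k-NN in plain Python. The function names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are hypothetical, and Euclidean distance is an assumption; the notes do not specify a distance measure.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training rows closest to `query` (Euclidean distance)."""
    # There is no training step: every prediction scans all stored examples,
    # which is why k-NN slows down and needs more memory as the data grows.
    return sorted(train, key=lambda row: math.dist(row[0], query))[:k]

def knn_classify(train, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: average of the k nearest numeric targets."""
    targets = [target for _, target in k_nearest(train, query, k)]
    return sum(targets) / len(targets)

# Hypothetical toy data: (features, label) pairs.
points = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
          ((6.0, 6.0), "B"), ((7.0, 7.0), "B"), ((6.5, 5.5), "B")]
print(knn_classify(points, (1.2, 1.4)))

# Hypothetical toy data: (features, numeric target) pairs.
values = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
print(knn_regress(values, (2.0,), k=3))
```

The same neighbour search serves both tasks; only the final step (vote or average) changes, which is what makes k-NN usable for classification and regression alike.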
Advantages of SVM:
1. SVM often achieves high accuracy.
2. SVM works well in high-dimensional spaces.
3. It is effective in cases where the number of dimensions is greater
than the number of samples.
4. It uses a subset of training points in the decision function (called
support vectors), so it is also memory efficient.
Disadvantages of SVM:
1. SVMs have a high training time, so in practice they are not well suited
to large datasets.
2. SVM also does not perform very well when the dataset is noisy, i.e.
when the target classes overlap.
Q) Semi-supervised Learning
1) Semi-supervised learning is an important category that lies between
supervised and unsupervised machine learning.
2) Semi-supervised learning algorithms are neither fully supervised nor
fully unsupervised; they fall between the two.
3) Semi-supervised learning is an approach to machine learning that
combines a small amount of labeled data with a large amount of unlabeled
data during training.
4) It falls between unsupervised learning (with no labeled training data)
and supervised learning (with only labeled training data).
5) Unlabeled data, when used in conjunction with a small amount of labeled
data, can produce considerable improvement in learning.
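One way to picture combining a few labeled points with many unlabeled ones is self-training, sketched below. Self-training is only one semi-supervised approach, chosen here for illustration; the notes do not name a specific method. The 1-D nearest-centroid classifier, the `margin` confidence rule, and all names and data are assumptions.

```python
def class_centroids(labeled):
    """Mean feature value per class, from (x, label) pairs."""
    sums, counts = {}, {}
    for x, label in labeled:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """Grow the labeled set with confidently pseudo-labeled points.

    Assumes at least two classes appear in `labeled`.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        cents = class_centroids(labeled)
        confident, uncertain = [], []
        for x in pool:
            ranked = sorted(cents, key=lambda c: abs(x - cents[c]))
            best, runner_up = ranked[0], ranked[1]
            # Confidence: how much closer x is to the best centroid
            # than to the second-best one.
            gap = abs(x - cents[runner_up]) - abs(x - cents[best])
            if gap >= margin:
                confident.append((x, best))
            else:
                uncertain.append(x)
        if not confident:
            break  # no point was labeled confidently this round
        labeled.extend(confident)
        pool = uncertain
    return class_centroids(labeled), pool

# Two labeled points and six unlabeled ones (hypothetical data).
labeled = [(1.0, "low"), (9.0, "high")]
unlabeled = [1.5, 2.0, 2.5, 8.0, 8.5, 5.0]
final_centroids, still_unlabeled = self_train(labeled, unlabeled)
print(final_centroids, still_unlabeled)
```

The class centroids are refined using points that never had a human label, while the ambiguous point midway between the classes is left unlabeled rather than guessed.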
Advantages of Semi-supervised Machine Learning:
1. It is easy to understand and simple to implement.
2. It reduces the amount of annotated data used.
3. It is a stable algorithm.
4. It has high efficiency.
Disadvantages of Semi-supervised Machine Learning:
1. Iteration results are not stable.
2. It is not applicable to network-level data.
3. It has low accuracy.
Q] REGRESSION MODELS
1) Regression helps us understand the relationship between various data
points and find hidden patterns in the data.
2) Regression is one of the most powerful and popular statistical tools
and learning techniques; it helps discover the best relationship between a
dependent variable and an independent variable.
3) The goal of regression analysis is to model the expected value of a
dependent variable y in terms of the value of an independent variable x.
4) Regression analysis is a set of statistical methods used to estimate
relationships between a dependent variable (target) and one or more
independent variables (predictors).
Q] Linear Regression
1) Linear regression is one of the easiest and most popular machine
learning algorithms. It is a statistical method used for predictive
analysis.
2) Linear regression is one of the most representative machine learning
methods for building value-prediction models from training data.
3) Linear regression maps an independent variable to a dependent variable
by a linear equation. In many cases an independent variable has a
deterministic mapping to a dependent variable.
4) Linear regression may be defined as a statistical model that analyzes
the linear relationship between a dependent variable and a given set of
independent variables.
5) A linear relationship between variables means that when the value of
one or more independent variables changes (increases or decreases), the
value of the dependent variable also changes accordingly (increases or
decreases).
6) Linear regression shows the linear relationship between the
independent variable (X-axis) and the dependent variable (Y-axis), hence
the name linear regression.
Linear regression can be further divided into two types of algorithm:
1. Simple Linear Regression: If a single independent variable is used to
predict the value of a numerical dependent variable, the algorithm is
called simple linear regression.
2. Multiple Linear Regression: If more than one independent variable is
used to predict the value of a numerical dependent variable, the algorithm
is called multiple linear regression.
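As a small illustration of simple linear regression, the sketch below fits a line y = intercept + slope * x by ordinary least squares. The function name `fit_simple_linear` and the data are hypothetical; least squares is the standard fitting criterion but is not named in the notes.

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
intercept, slope = fit_simple_linear(xs, ys)
print(intercept, slope)

# Predict the dependent variable for a new value of x.
prediction = intercept + slope * 6.0
print(prediction)
```

Multiple linear regression follows the same idea with one coefficient per independent variable, which in practice means solving a small linear system instead of a single ratio.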
Q] Polynomial Regression
1) Polynomial regression, like linear regression, uses the relationship
between the variables x and y to find the best way to draw a line through
the data points.
2) The dataset used for training in polynomial regression is non-linear in
nature. Polynomial regression makes use of a linear regression model to
fit complicated, non-linear functions and datasets.
3) Polynomial regression is a regression algorithm that models the
relationship between a dependent variable (y) and an independent variable
(x) as an nth-degree polynomial.
4) The polynomial regression equation is:
$y = b_0 + b_1 x_1 + b_2 x_1^2 + \dots + b_n x_1^n$.
When data points are arranged in a non-linear fashion, we need the
polynomial regression model.
5) The fitted curve bends because of higher-order terms such as $x_1^2$.
This means that no straight line could fit all the data points. However,
by transforming the straight line into a polynomial form, the curve can be
made to follow the points closely.
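The idea that polynomial regression reuses a linear model can be sketched as follows: each x is expanded into the powers x^0 .. x^n, and the coefficients b0 .. bn are then found by ordinary least squares. This is a minimal pure-Python sketch; the helper names `solve` and `fit_polynomial`, the use of the normal equations, and the data are assumptions made for illustration.

```python
def solve(a, b):
    """Solve the square system a * coeffs = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [b0, b1, ..., bn] for y = b0 + b1*x + ... + bn*x^n."""
    # The model is non-linear in x but still linear in the coefficients,
    # so linear-regression machinery (normal equations) applies.
    rows = [[x ** p for p in range(degree + 1)] for x in xs]
    size = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)]
           for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(size)]
    return solve(xtx, xty)

# Hypothetical data generated from y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1 + 2 * x + 3 * x * x for x in xs]
coeffs = fit_polynomial(xs, ys, degree=2)
print([round(c, 6) for c in coeffs])
```

With noisy real data the curve will not pass through every point; a higher degree fits the training points more closely but can overfit.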
Q] Logistic Regression
1) Logistic regression is a classification algorithm used in machine
learning to predict a binary or categorical outcome. It predicts the
probability that a given input belongs to a specific class, and this
output probability ranges between 0 and 1.
2) Binary Output: Often used when the target variable has two possible
outcomes, like yes/no, true/false, or success/failure.
3) Prediction: It doesn't give exact outcomes but probabilities. A
threshold is applied to convert these probabilities into binary outputs.
4) S-shaped Curve: Instead of a straight line, logistic regression uses a
logistic function, which has an S-curve or sigmoid shape. This curve
allows logistic regression to smoothly transition from probabilities near
0 to probabilities near 1.
5) Classification Problems: While linear regression is for continuous
outcomes, logistic regression is for classification problems, including
binary and multinomial classifications.
6) Applications: It has a wide range of applications, from medical
diagnosis (predicting disease presence) to email filtering (spam or
not-spam).
7) When the output has more than two categories, logistic regression can
be extended to multinomial logistic regression, which handles multiple
discrete categories by using techniques like "one-vs-rest" or "softmax".
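The sigmoid curve, the probability threshold, and softmax mentioned above can be sketched in a few lines. The weights and bias below are hypothetical fixed numbers, not values learned from data, and the function names are chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one input."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(weights, bias, features, threshold=0.5):
    """Apply a threshold to turn the probability into a binary output."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

def softmax(scores):
    """Multinomial case: turn one score per class into probabilities."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical model with a single feature.
weights, bias = [1.5], -3.0
print(predict_proba(weights, bias, [4.0]), predict_label(weights, bias, [4.0]))
print(predict_proba(weights, bias, [1.0]), predict_label(weights, bias, [1.0]))

# Hypothetical scores for three classes.
class_probs = softmax([2.0, 1.0, 0.1])
print(class_probs)
```

Lowering or raising `threshold` trades false negatives against false positives, which matters in uses such as medical diagnosis.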
Q) Concept of Clustering
1) Clustering is an unsupervised learning technique in machine learning
used to find natural groupings within a dataset. Unlike supervised
learning, which focuses on predicting a specific target variable,
clustering aims to identify patterns or relationships among data points
without predefined labels.
2) Purpose: The main goal is to group data points into clusters based on
similarity, allowing the identification of patterns or relationships
within the data.
3) Applications: Clustering is widely used to understand data structure,
group similar items, and explore the data for meaningful patterns. It has
applications in customer segmentation, image processing, market research,
and more.
4) How it Works: Clustering algorithms analyze the data to find
similarities among samples and then group these similar data points into
clusters. Clusters consist of objects that are more similar to each other
than to objects in other clusters.
5) Clustering Techniques: There are various clustering algorithms, each
with its own approach to forming clusters. Popular methods include
k-means, hierarchical clustering, DBSCAN, and Gaussian Mixture Models.
6) Number of Clusters: Some algorithms require the user to specify the
number of clusters in advance (like k-means), while others determine the
number of clusters from the data automatically (like DBSCAN).
Use Case Example: A common real-world example is customer
segmentation, where customers are grouped based on purchasing
behavior. This can help businesses identify customer profiles and tailor
marketing strategies accordingly.
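A customer-segmentation grouping like the one described can be sketched with k-means in plain Python. The customer features (visits per month, average spend), the starting centroids, and the function names are all hypothetical; real data would normally be scaled first so that one feature does not dominate the distance.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new.append(old)  # keep the centroid of an empty cluster in place
            continue
        new.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

# Hypothetical customers: (visits per month, average spend).
customers = [(1.0, 20.0), (2.0, 25.0), (1.5, 22.0),
             (9.0, 80.0), (10.0, 85.0), (9.5, 90.0)]
# k = 2 is chosen in advance; two customers serve as starting centroids.
labels, centers = k_means(customers, [customers[0], customers[3]])
print(labels)
print(centers)
```

Each resulting cluster index can then be read as a customer profile, for example occasional low spenders versus frequent high spenders.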
Q] Apriori
-> The Apriori algorithm is a widely used algorithm for generating
frequent item sets. It is a seminal algorithm proposed by R. Agrawal and
R. Srikant in 1994.
-> The name of the algorithm is based on the fact that it uses prior
knowledge of frequent item set properties. It is a classic algorithm for
learning association rules.
-> The Apriori algorithm is simple and easy to execute, and is used to
mine all frequent item sets in a database.
-> Apriori Property: If an item set is infrequent, then all its supersets
must also be infrequent. So, according to the Apriori property, if {A} is
an infrequent item set, then all its supersets such as {A,B}, {A,C},
{A,B,C}, etc. will also be infrequent.
-> This is called anti-monotonicity because the property is monotonic in
the context of failing a test.
-> The Apriori algorithm uses a step-wise search approach, which means
that k-item sets are used to discover (k+1)-item sets.
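The step-wise search and the Apriori property can be sketched as below. This is a minimal illustration rather than an optimized implementation; the function name `apriori`, the `min_support` parameter (an absolute count of transactions), and the basket data are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {item set: support count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Keep only candidates contained in at least `min_support` transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    current = count({frozenset([item]) for item in items})  # frequent 1-item sets
    frequent = dict(current)
    k = 1
    while current:
        # Join step: build (k+1)-item candidates from frequent k-item sets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step (Apriori property): a candidate with any infrequent
        # k-item subset cannot be frequent, so drop it without counting.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical market-basket transactions.
baskets = [{"bread", "milk"}, {"bread", "butter", "milk"}, {"bread", "butter"},
           {"milk", "butter"}, {"bread", "milk", "butter"}]
result = apriori(baskets, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

In this data every single item and every pair is frequent, but the three-item set appears in only two baskets and is therefore rejected at the last level.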
Supervised Learning vs. Unsupervised Learning
1) Supervised: Both input and output variables are provided, on the basis
of which the output can be predicted, and the probability of its
correctness is higher.
   Unsupervised: Only input variables are provided and no output variables
are available, so the outcome or resultant learning depends on one's
intellectual observation.
2) Supervised: It is treated as a highly accurate and trustworthy method,
so its accuracy and correctness are better compared to unsupervised
learning.
   Unsupervised: It is a comparatively less accurate and trustworthy
method.
3) Supervised: Algorithms are trained using labeled data.
   Unsupervised: Algorithms are trained using unlabeled data.
4) Supervised: The model takes direct feedback to check whether it is
predicting the correct output or not.
   Unsupervised: The model does not take any feedback.
5) Supervised: The model predicts the output.
   Unsupervised: The model finds the hidden patterns in data.
6) Supervised: Input data is provided to the model along with the output.
   Unsupervised: Only input data is provided to the model.
7) Supervised: The goal is to train the model so that it can predict the
output when it is given new data.
   Unsupervised: The goal is to find the hidden patterns and useful
insights from the unknown dataset.
8) Supervised: It needs supervision to train the model.
   Unsupervised: It does not need any supervision to train the model.
Data Analysis Data Analytics
7) It supports inferential
analysis.