
Q] CONCEPT OF DATA ANALYTICS

1) Advancement in data science has created opportunities to sort, manage and analyze large or massive amounts of data more effectively and efficiently. 2) Data science is closely related to the fields of data mining and machine learning, but it is broader in scope. 3) The term data comprises facts, observations and raw information. Data by itself has little meaning if it is not processed; processed data in a meaningful form is known as information.
4) Analytics is used for the discovery, interpretation, and communication of meaningful patterns and/or insights in data. The term analytics is used to refer to any data-driven decision-making.
5) Data analytics may analyze many varieties of data to provide views into patterns and insights that would not be possible to obtain manually.
6) Data analytics (DA) is the science of examining raw data with the purpose of drawing conclusions from that information.

Q] Roles in Data Analytics


1. Data Analyst: A Data Analyst explores and interprets large datasets to identify trends, patterns, and relationships. They often create visualizations and reports to help stakeholders understand data insights, facilitating decision-making and problem-solving.
2. Data Scientist: A Data Scientist handles large and complex datasets, often using machine learning and deep learning techniques to generate business insights. They work with advanced tools and algorithms to analyze high-dimensional data, aiming to convert complex information into actionable intelligence.
3. Data Architect: A Data Architect designs and implements database systems and data models, ensuring the data infrastructure meets the needs of the organization. They provide support for various tools and platforms, enabling data engineers to work efficiently with data. Their primary focus is on structuring and optimizing data storage and flow.
4. Data Engineer: A Data Engineer is responsible for building and maintaining the data architecture for analytics projects. They create and manage data pipelines to ensure data is collected, processed, and made available in a form suitable for analysis. Data engineers play a critical role in ensuring data quality and accessibility.

Q] Lifecycle of Data Analytics:-


1) The data analytics lifecycle is a process that consists of six basic stages/phases (data discovery, data preparation, model planning, model building, communicate results, and operationalize) that define how information is created, gathered, processed, used and analyzed for organizational goals. 2) The six phases of the data analytics lifecycle are followed one after another to complete one cycle. 3) The lifecycle of data analytics provides a framework for the best performance of each phase from the creation of the project until its completion. 4) Data Analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information. A brief summary of the lifecycle, covering the key steps from data discovery to operationalization, follows.

i)Phase 1: Data Discovery


In the data discovery phase, the goal is to establish project objectives and
identify the key questions to be answered through data analysis. This
phase involves defining the purpose of the data analytics project and
outlining the expected outcomes. It's about understanding what the
business is trying to achieve and mapping out the data required for
achieving those goals.
ii)Phase 2: Data Preparation
During data preparation, data is cleaned, transformed, and conditioned for
analysis. This involves collecting data from various sources, processing it
to ensure quality, and preparing it for modeling. The data is often loaded
into a sandbox environment for processing and exploration. Techniques
like ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) are
used to manage data flow.
iii)Phase 3: Model Planning
Model planning is where the team decides which analytical techniques
and workflows to use for building models. This phase includes evaluating
the quality of the data and selecting appropriate algorithms and statistical
methods for the project. It sets the stage for the model building phase.
iv)Phase 4: Model Building
In this phase, the team creates the datasets for training and testing and
begins building the model according to the plan established in the
previous phase. The model is developed, tested, and refined in a real-time
environment. The aim is to produce a robust model that can provide the
insights required to meet the project objectives.
v)Phase 5: Communicate Results
This phase focuses on reviewing the results to determine whether the project was successful. The team and stakeholders evaluate the key findings and summarize the insights gained from the data.
vi)Phase 6: Operationalize
In the final phase, the team delivers the final reports, code and technical documents, and the model is typically deployed (often first as a pilot) so that its insights can be used in day-to-day operations.
Advantages of Data Analytics:
1) Data analytics provides insights and evidence to support decision-making, reducing guesswork. 2) Analyzing data can reveal inefficiencies and help optimize processes, leading to cost savings. 3) Businesses can use data to understand customer behavior and preferences, enabling personalized experiences. 4) Data analytics can identify potential risks and vulnerabilities, allowing for proactive risk mitigation.
Disadvantages of Data Analytics:
1) Poor-quality or inaccurate data can lead to incorrect conclusions and flawed decisions. 2) Collecting and analyzing data can raise privacy issues and increase the risk of data breaches. 3) Relying too heavily on data-driven insights can stifle creativity and intuition. 4) Integrating data from different sources and systems can be complex and prone to errors.
Q] TYPES OF DATA ANALYTICS
Organizations from almost every sector are generating a large volume of
data on a regular basis. Merely collecting large amounts of data will not
serve any purpose and cannot be used directly for the profit of the
company/organization. Organizations can extract very useful information
from this data which can further support complex decision making hence,
there is a need for data analytics. The art and science of refining data to
fetch useful insight which further helps in decision making is known as
Analytics.
There are four types of data analytics: 1. Descriptive Analytics: What happened? 2. Diagnostic Analytics: Why did it happen? 3. Predictive Analytics: What will happen? 4. Prescriptive Analytics: How can we make it happen?
1)Descriptive Analytics
i) Descriptive analytics examines raw data or content to answer the question, What happened?, by analyzing valuable information found in the available past (historical) data. ii) The goal of descriptive analytics is to provide insights into the past leading to the present, using descriptive statistics, interactive explorations of the data, and data mining. Descriptive analytics enables learning from the past and assessing how the past might influence future outcomes. iii) Descriptive analytics is valuable as it enables organizations to learn from past practices and helps them see how those practices may impact future results.
Example: An organization's records give a past review of its financials, operations, customers, stakeholders, sales and so on.
2) Diagnostic Analytics
i) Diagnostic analytics is a form of analytics which examines data to answer the question, Why did it happen? ii) It is a kind of root-cause analysis that focuses on processes and causes, key factors and unseen patterns. iii) The goal/objective of diagnostic analytics is to find the root cause of issues. It can be accomplished by techniques like data discovery, correlations, data mining and drill-down. iv) Diagnostic analytics tries to gain a deeper understanding of the reasons behind the pattern of data found in the past. Here, business/organizational intelligence comes into play by digging down to find the root cause of the pattern or nature of the data obtained. For example, with diagnostic analysis, a data analyst will be able to find why the performance of each player on the Indian hockey team has risen (or declined) over the past nine months.
3) Predictive Analytics
Predictive analytics is a branch of advanced analytics that uses historical
data, statistical algorithms, and machine learning to forecast future
outcomes. It provides insights into what might happen based on trends
and patterns observed in existing data. ii)The primary goal of predictive
analytics is to help organizations anticipate future events, allowing them
to take proactive actions to mitigate risks, capitalize on opportunities, or
optimize processes. iii) Predictive analytics employs various techniques, including: 1) statistical modeling, 2) machine learning, 3) data mining. iv) Predictive analytics has a wide range of applications, such as: i) Demand Forecasting: predicting customer demand to manage inventory and production; ii) Risk Assessment: identifying potential risks and predicting the likelihood of adverse events. v) It is important to note that predictive analytics does not guarantee future outcomes; it only estimates how likely they are based on historical patterns.
4) Prescriptive Analytics
Prescriptive analytics goes beyond predicting future outcomes by also
suggesting actions to benefit from the predictions and showing the
decision maker the implications of each decision option. ii)Prescriptive
analytics not only anticipates what will happen and when it will happen,
but also why it will happen. iii) Further, prescriptive analytics can
suggest decision options on how to take advantage of a future
opportunity or mitigate a future risk and illustrate the implication of each
decision option. iv) In practice, prescriptive analytics can continually
and automatically process new data to improve prediction accuracy and
provide better decision options.
Example: In the healthcare industry, we can use prescriptive analytics to
manage the patient population by measuring the number of patients who
are clinically obese.
Stemming vs. Lemmatization:
1) Stemming is faster because it chops words without knowing the context of the word in the given sentence, whereas lemmatization is slower than stemming but knows the context of the word before proceeding.
2) Stemming is a rule-based approach, whereas lemmatization is a dictionary-based approach.
3) The accuracy of stemming is lower; lemmatization is more accurate as compared to stemming.
4) When converting a word into its root form, stemming may create a non-existent word with no meaning, whereas lemmatization always gives a dictionary-meaning word as the root form.
5) Stemming is preferred when the meaning of the word is not important for the analysis (example: spam detection), whereas lemmatization is recommended when the meaning of the word is important for the analysis (example: question answering).
6) For example, stemming gives "Studies" => "Studi", while lemmatization gives "Studies" => "Study".

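To make the "Studies" example concrete, here is a minimal sketch using the NLTK library (assumed to be installed, with the WordNet corpus downloaded); the choice of NLTK is an assumption, as the notes do not name a specific tool:

```python
# Minimal sketch (assumes: pip install nltk, then nltk.download("wordnet") for the lemmatizer).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                   # -> 'studi'  (rule-based chopping; not a real word)
print(lemmatizer.lemmatize("studies", pos="n"))  # -> 'study'  (dictionary-based; real root word)
```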
Q] What is machine learning


Machine Learning (ML) is a field of computer science that studies algorithms and techniques for automating solutions to complex problems that are hard to program using conventional programming methods. Though data science includes machine learning as one of its fundamental areas of study, machine learning in itself is a vast research area that requires good skills and experience to master. The basic idea of machine learning is to allow machines (computers) to independently learn from the wealth of data that is fed as input into the machine. The field of machine learning deals with all those algorithms that help machines to get self-trained in this process.
Machine learning techniques are broadly categorized into supervised
machine learning, unsupervised machine learning, and reinforcement
learning.
1. Supervised Machine Learning is sometimes described as "learn from the past to predict the future". Supervised machine learning is a field of learning where the machine learns with the help of a supervisor or instructor. 2. Unsupervised Machine Learning: here the machine learns without any supervision. The goal of unsupervised learning is to model the structure of the data to learn more about the data. 3. Reinforcement Machine Learning happens through interaction with the environment. If we assume ourselves to be a program, and with every encounter with the environment the program eventually starts learning, then the process is called reinforcement learning.

Q] Model of machine learning:-


1) Machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data. For example, to predict the value of a farmhouse, the computer would learn patterns from past farmhouse sales.
2) The formal definition of ML is "a computer program is said to learn from
experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves
with experience E." 3)The definition is basically focusing on three
parameters, also the main components of any learning algorithm, namely
Task (T), Performance (P) and experience (E).
ML is a field of AI consisting of learning algorithms that improve their performance (P) at executing some task (T) over time with experience (E). Together, these parameters represent a machine learning model.
1. Task (T): From the perspective of the problem, we may define the task T as the real-world problem to be solved. ii) The problem can be anything, like finding the best house price in a specific location or finding the best marketing strategy, etc. iii) On the other hand, if we talk about machine learning, the definition of task is different because it is difficult to solve ML-based tasks by a conventional programming approach.

2) Experience (E):
As the name suggests, it is the knowledge gained from the data points provided to the algorithm or model. ii) Once provided with the dataset, the model will run iteratively and will learn some inherent pattern. The learning thus acquired is called Experience (E). iii) Supervised, unsupervised and reinforcement learning are some ways to learn or gain experience. The experience gained by our ML model or algorithm will be used to solve the Task (T).

3. Performance (P): An ML algorithm is supposed to perform a task and gain experience with the passage of time. ii) The measure which tells whether the ML algorithm is performing as per expectation or not is its performance (P). iii) P is basically a quantitative metric that tells how a model is performing the Task (T) using its Experience (E).
Advantages of Machine Learning:
1. It is used in a variety of applications such as the banking and financial sector, healthcare, retail, publishing and social media, robot locomotion, game playing, etc. 2. It has the capability to handle multi-dimensional and multi-variety data in dynamic or uncertain environments. 3. It allows time-cycle reduction and efficient utilization of resources. 4. Tools such as RapidMiner help increase the usability of algorithms for various applications. 5. The process of automation of tasks is easily possible.
Disadvantages of Machine Learning:
1) It is impossible to make immediate accurate predictions with a machine
learning system. 2). Machine learning needs a lot of training data for
future prediction. 3). Interpretation of results is also a major challenge to
determine effectiveness of machine learning algorithms. 4). Use of low-
quality data leads to the problems related to data preprocessing and
feature extraction.

Q) Uses of Machine Learning


1. Speech Recognition: Speech recognition is a process of converting voice instructions into text, also known as computer speech recognition. At present, machine learning algorithms are widely used in various applications of speech recognition. Google Assistant, Cortana and Alexa use speech recognition technology to follow voice instructions.
2. Image Recognition: This is one of the most common applications of machine learning, used to identify objects, persons, places, digital images, etc. Popular examples include Google Lens, which identifies objects through a camera, and Facebook's auto friend-tagging suggestion feature.
3. Stock Market Trading: Machine learning is widely used in stock market trading. In the stock market there is always a risk of ups and downs in shares, so machine learning models are used to predict these trends.
Q] Definition of Deep Learning
Deep learning is defined as, a class of machine learning algorithms that
uses multiple layers to progressively extract higher-level features from the
raw input.
For example, in image processing, lower layers may identify edges, while
higher layers may identify the concepts relevant to a human such as digits
or letters or faces.
Deep Learning (DL) is a machine learning technique that constructs Neural Networks (NNs), or Artificial Neural Networks (ANNs), to mimic the structure and function of the human brain.
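As a rough illustration of the layered idea above (lower layers learning simple features, higher layers learning more abstract ones), here is a minimal sketch using TensorFlow/Keras, one of the tools listed below; the layer sizes and the 784-value input (e.g., a flattened 28x28 image) are illustrative assumptions, not taken from the notes:

```python
# Minimal sketch of a small multi-layer network (assumes TensorFlow/Keras is installed).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                     # e.g. a flattened 28x28 image (assumption)
    tf.keras.layers.Dense(128, activation="relu"),    # lower layer: simpler features
    tf.keras.layers.Dense(64, activation="relu"),     # higher layer: more abstract features
    tf.keras.layers.Dense(10, activation="softmax"),  # output layer: e.g. digit classes 0-9
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```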
The following are some of the deep learning tools available in the market today: 1. TensorFlow is one of the best frameworks; it is used for natural language processing, text classification and summarization, speech recognition and translation, and more. It is flexible and has a comprehensive list of libraries and tools which let us build and deploy ML applications. 2. Microsoft Cognitive Toolkit is most effective for image, speech and text-based data. 3. Caffe is a deep learning tool built for scale; it emphasizes speed, modularity and expression. It provides interfaces with C, C++, Python and MATLAB and is especially relevant for convolutional neural networks.
Advantages of Deep Learning:
1. In DL the features are automatically deduced and optimally tuned for
desired outcome. 2. In DL, the same neural network-based approach can be applied to many different applications and data types. 3. The deep learning architecture is flexible and can be adapted to new problems in the future. 4. DL provides maximum utilization of unstructured data to obtain insights from it. 5. DL has the ability to perform feature engineering by itself.
In this approach, an algorithm scans the data to identify features which
correlate and then combine them to promote faster learning without being
told to do so explicitly.
Disadvantages of Deep Learning:
1. DL requires a very large amount of data in order to perform better than other techniques. 2. DL is extremely expensive to train due to complex data models. 3. There is no standard theory in DL to guide users in selecting the right deep learning tools, as this requires knowledge of the training method and other parameters.

Q] Applications/Uses of Deep Learning


1] Automatic Speech Recognition: Large-scale automatic speech
recognition is the first and most convincing successful case of deep
learning.
3] Medical Image Analysis: Deep learning has been shown to produce competitive results in medical applications such as cancer cell classification, lesion detection, organ segmentation and image enhancement.
4] Mobile Advertising: Finding the appropriate mobile audience for mobile advertising is always challenging. Deep learning has been used to interpret large, many-dimensional advertising datasets.
6. Financial Fraud Detection: Deep learning is being successfully applied to
financial fraud detection, tax evasion detection and anti-money
laundering. 7. Military: The United States (US) Department of Defense
applied deep learning to train robots in new tasks through observation.
Q] Definition of Artificial Intelligence
Artificial Intelligence (AI) is a wide-ranging branch of computer science
concerned with building smart machines capable of performing tasks that
typically require human intelligence.
Advantages of Artificial Intelligence:
1) AI machines are highly reliable and can perform the same action multiple times with high accuracy. 2) AI machines or systems are less prone to errors and have high accuracy, as they take decisions based on prior experience or information. 3) AI systems can operate at very high speed with fast decision-making; because of this, an AI system can beat a chess champion in a game of chess. 4) AI can be very useful for public utilities, such as a self-driving car which can make the journey safer.
Disadvantages of Artificial Intelligence:
1) The hardware and software requirements of AI are very costly, as it requires a lot of maintenance to meet current world requirements. 2) Humans are creative and can imagine new ideas, but AI machines cannot yet match this power of human intelligence and cannot be creative and imaginative. 3) With the increase in technology, people are getting more dependent on devices and hence are losing some of their mental capabilities.
Q] Applications of Artificial Intelligence
1. Robotics: Artificial Intelligence has a remarkable role in robotics. With the help of AI, we can create intelligent robots. 2. Finance: The finance industry is implementing automation, chatbots, adaptive intelligence, algorithmic trading, and machine learning into financial processes. 3. Natural
Language Processing: It is possible to interact with the computer that
understands natural language spoken by humans. 4). Expert Systems:
There are some applications which integrate machine, software, and
special information to impart reasoning and advising. They provide
explanation and advice to the users.

Q] THE MODELING PROCESS


A model is an abstraction of reality. A model is the representation of a
relationship between variables in a dataset. 2)A model describes how one
or more variables in the data are related to other variables. 3)Modeling is
a process in which a representative abstraction is built from the observed
dataset. 4) For example, based on credit score, income level and
requested loan amount, a model can be developed to determine the
interest rate of a personal loan. For this task, previously known
observational data such as credit score, income level, loan amount and
interest rate are needed.
The process of modeling comprises the following four steps: 1. Feature engineering and selecting a model. 2. Training the model. 3. Validating the model. 4. Testing the model on new data. 5) The first three steps in modeling are usually repeated because we will most likely not build an optimal model for the project on the first try. 6) As such, we will build several models and then select the one that performs best on the testing data set (which is unseen data).
Q] Training the Model
With the right and accurate predictors in place and a modeling technique in
mind, we can progress to model training. In model training we present the
model data from which it can learn. 2) After we have created the right
predictors (engineering features) and have selected the appropriate
modeling technique for the project, we can now proceed to train the
model on a training data set. 3) A training data set is a data sample that
we select for the model to learn to perform actions from. 4) Popular
modeling techniques are available to be implemented in almost any
programming language that we may choose, including Python. 5) These techniques essentially allow us to train our model using a few simple lines of code. 6) Training time, i.e. the number of hours needed to train the model, also plays a vital role in determining the selection of the model. It is directly related to the accuracy of the obtained model.
Q) Validating the Model
1) Validation of the model is extremely important because it determines whether the model works in real-life conditions. 2) Once a model is trained, it is time to test whether it can be extrapolated to reality, i.e., model validation. 3) Data science has many modeling techniques, and the question is which one is the right one to use. A good model has the following two properties: 1. it has good predictive power, and 2. it generalizes well to data it hasn't seen. 4) To achieve the above properties we define an error measure and a validation strategy. 5) Two common error measures in machine learning are the classification error rate for classification problems and the mean squared error for regression problems.
i) The classification error rate is the percentage of observations in the test data set that your model mislabeled; lower is better. ii) The mean squared error measures how big the average error of your prediction is; squaring the errors penalizes large mistakes more heavily. A number of validation strategies exist, including common ones such as a simple split into training and validation sets and k-fold cross-validation.
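The two error measures above can be computed in a few lines; a minimal sketch with scikit-learn (assumed available) and made-up labels:

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification error rate: fraction of mislabeled observations (lower is better).
y_true_cls = [0, 1, 1, 0, 1]
y_pred_cls = [0, 1, 0, 0, 1]
print(1 - accuracy_score(y_true_cls, y_pred_cls))   # 0.2 -> 20% of observations mislabeled

# Mean squared error for a regression problem: squaring penalizes large errors.
y_true_reg = [3.0, 5.0, 2.5]
y_pred_reg = [2.5, 5.0, 4.0]
print(mean_squared_error(y_true_reg, y_pred_reg))   # average of the squared differences
```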
Q] Supervised Learning:
Based on the ML tasks, supervised learning algorithms can be divided into
two classes namely, Classification and Regression.
1) Classification:
The key objective of classification-based tasks is to predict categorical output labels or responses for the given input data. The output will be based on what the model has learned in its training phase.
2) Regression:
The key objective of regression-based tasks is to predict output labels or responses which are continuous numeric values, for the given input data. The output will be based on what the model has learned in its training phase.

Advantages of Supervised learning:


1. Supervised learning model helps us to solve various real-world
problems such as fraud detection. 2. With the help of supervised learning,
the model can predict the output on the basis of prior experiences. 3. In
supervised learning, we can have an exact idea about the classes of
objects.

Disadvantages of supervised learning:


1. Supervised learning cannot predict the correct output if the test data is
different from the training dataset. 2. Training required in supervised
learning consumes lots of time. 3. Supervised learning models are not
suitable for handling complex tasks.

Advantages of k-NN:
1. The k-NN algorithm is simple and easy to implement. 2. The k-NN is a versatile algorithm, as we can use it for classification as well as regression. 3. The k-NN is very useful for nonlinear data because there is no assumption about the data in this algorithm.

Disadvantages of k-NN:
1. The k-NN algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases. 2. The k-NN algorithm is computationally a bit expensive because it stores all the training data. 3. The k-NN algorithm requires high memory storage.

Advantages of Decision Tree:


1. Decision trees are simple to understand and interpret.
2. Decision trees are able to handle both numerical and categorical
data. 3. Decision trees work well with large datasets. 4. Decision
trees are fast and accurate.

Disadvantages of Decision Tree:


1. A small change in the training data can result in a large change in the tree and consequently in the final predictions. 2. Decision tree performance is not good if there are lots of uncorrelated variables in the data set. 3.
Decision trees are generally easy to use, but making them, particularly
huge ones with numerous divisions or branches, is complex.

Advantages of SVM:
1. SVM offers great accuracy. 2. SVMs work well in high-dimensional spaces. 3. It is effective in cases where the number of dimensions is greater
than the number of samples. 4. It uses a subset of training points in the
decision function (called support vectors), so it is also memory efficient.

Disadvantages of SVM:
1. SVMs have a high training time; hence, in practice they are not suitable for large datasets. 2. SVM also does not perform very well when the data set has more noise, i.e., when target classes are overlapping.

Advantages of Naïve Bayes:


1. Naïve Bayes is a fast and easy ML algorithm for predicting the class of a dataset. 2. Naïve Bayes will converge faster than discriminative models
like logistic regression. 3. It can make probabilistic predictions and can
handle continuous as well as discrete data. 4. Naïve Bayes
requires less training data.

Disadvantages of Naïve Bayes:


1. If categorical variable has a category (in test data set), which was not
observed in training data set, then model will assign a 0 (zero) probability
and will be unable to make a prediction. 2. Naive Bayes assumes that all
features are independent or unrelated, so it cannot learn the relationship
between features.

1. Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities. A clustering problem is one where we want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
2. Association: An association rule is an unsupervised learning method
which is used for finding the relationships between variables in the large
database. It determines the set of items that occurs together in the
dataset. Association rule makes marketing strategy more effective.

Advantages of Unsupervised Learning:


1. Unsupervised learning is preferable as it is easy to get unlabeled data
in comparison to labeled data. 2. Unsupervised learning is used for more
complex tasks as compared to supervised learning because, in
unsupervised learning, we don't have labeled input data.

Disadvantages of Unsupervised Learning:


1. The result of the unsupervised learning algorithm might be less accurate, as the input data is not labeled and the algorithm does not know the exact output in advance. 2. Unsupervised learning is intrinsically more difficult than supervised learning as it does not have corresponding output labels.
Advantages of k-means Clustering Algorithm:
1. The k-means algorithm is simple, easy to understand and easy to implement.
2. The k-means algorithm is the most popular clustering algorithm,
because it provides easily interpretable clustering results. 3. The k-means
algorithm is fast and efficient in terms of computational cost.

Disadvantages of k-means Clustering Algorithm:


1. In the k-means algorithm it is difficult to predict the number of clusters, i.e., the value of k. 3. The output of the k-means algorithm is strongly impacted by initial inputs like the number of clusters. 4. It is very sensitive to rescaling.

Q) Semi-supervised Learning
1) Semi-supervised learning is an important category that lies between supervised and unsupervised machine learning. 2) Semi-supervised learning algorithms or methods are neither fully supervised nor fully unsupervised; they basically fall between the two, i.e., supervised and unsupervised learning methods. 3) Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. 4) Semi-supervised learning falls between unsupervised learning (with no labelled training data) and supervised learning (with only labeled training data). 5) Unlabeled data, when used in conjunction with a small amount of labeled data, can produce a considerable improvement in learning.
Advantages of Semi-supervised Machine Learning:
1. It is easy to understand and simple to implement. 2. It reduces the
amount of annotated data used. 3. It is a stable algorithm. 4. It has high
efficiency.
Disadvantages of Semi-supervised Machine Learning :
1. Iteration results are not stable. 2. It is not applicable to network-level
data. 3. It has low accuracy.

Q] REGRESSION MODELS
Regression helps us to understand the relationship between various data
points and helps us to find hidden patterns among the data. 2) Regression
is one of the most powerful and popular statistical tool or a learning
technique that helps to discover the best relationship between a
dependent variable and an independent variable. 3) The goal of
regression analysis is to model the expected value of a dependent
variable y in terms of the value of an independent variable x. 4)
Regression analysis is a set of statistical processes for estimating the
relationships among variables. 5)Regression analysis is a set of statistical
methods used to estimate relationships between a dependent variable
(target) and one or more independent variables (predictor).

Q] Linear Regression
Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis. 2)
Linear regression is the most representative machine learning method to
build models for value prediction and classification from training data. 3)
Linear regression maps an independent variable to a dependent variable
by a linear equation. Many times an independent variable can have a
deterministic mapping to a dependent variable. 4) Linear regression may
be defined as the statistical model that analyzes the linear relationship
between a dependent variable and a given set of independent variables.
5)Linear relationship between variables means that when the value of one
or more independent variables will change (increase or decrease), the
value of dependent variable will also change accordingly (increase or
decrease). 6) Linear regression shows the linear relationship between the
independent variable (X- axis) and the dependent variable (Y-axis), hence
called linear regression.
Linear regression can be further divided into two types of the
algorithm:
1. Simple Linear Regression: If a single independent variable is used to
predict the value of a numerical dependent variable, then such a linear
regression algorithm is called simple linear regression. 2). Multiple Linear
Regression: If more than one independent variable is used to predict the
value of a numerical dependent variable, then such a linear regression
algorithm is called multiple linear regression.
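A minimal sketch of simple linear regression with scikit-learn (assumed available); the data points are made up so that y roughly follows y = 2x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # single independent variable (X-axis)
y = np.array([3, 5, 7, 9, 11])            # dependent variable (Y-axis)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)      # slope ~2.0 and intercept ~1.0
print(model.predict([[6]]))               # predicted value for a new input, ~13.0
```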
Q] Polynomial Regression
1) Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the best way to draw a line through
the data points. 2) The dataset used in Polynomial regression for training
is of non-linear nature. It makes use of a linear regression model to fit the
complicated and non-linear functions and datasets. 3) Polynomial
regression is a regression algorithm that models the relationship between
a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. 4) The polynomial regression equation is: y = b0 + b1x + b2x² + ... + bnxⁿ. Where data points are arranged in a non-linear fashion, we need the polynomial regression model. 5) When the curve is nearly exponential due to the presence of higher-degree terms such as x², there is no way a straight line could fit all the data points. However, by transforming the linear line into a polynomial form, the curve is made to pass through (or close to) all the points.
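A minimal sketch of this transformation using scikit-learn's PolynomialFeatures (assumed available); the quadratic data is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25])              # non-linear (quadratic) relationship: y = x^2

poly = PolynomialFeatures(degree=2)          # adds the x^2 (and bias) columns
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)    # still a linear model, but in the transformed features
print(model.predict(poly.transform([[6]])))  # ~36.0
```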
Q] Logistic Regression
Logistic regression is a classification algorithm used in machine learning to
predict a binary or categorical outcome. It predicts the probability that a
given input belongs to a specific class, and this output probability ranges
between 0 and 1. 2) Binary Output: Often used when the target variable
has two possible outcomes, like yes/no, true/false, or success/failure 3)
Prediction: It doesn't give exact outcomes but probabilities. A threshold is
applied to convert these probabilities into binary outputs. 4) S-shaped
Curve: Instead of a straight line, logistic regression uses a logistic
function, which has an S-curve or sigmoid shape. This curve allows logistic
regression to smoothly transition from probabilities near 0 to probabilities
near 1. 5) Classification Problems: While linear regression is for continuous
outcomes, logistic regression is for classification problems, including
binary and multinomial classifications. 6) Applications: It has a wide range
of applications, from medical diagnosis (predicting disease presence) to
email filtering (spam or not-spam). 7) When the output has more than two
categories, logistic regression can be extended to multinomial logistic
regression, which handles multiple discrete categories by using
techniques like "one-vs-rest" or "softmax."
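A minimal sketch of binary logistic regression with scikit-learn (assumed available); the pass/fail data and the "hours studied" feature are made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])   # e.g. hours studied (assumption)
y = np.array([0, 0, 0, 1, 1, 1])               # binary outcome: fail (0) / pass (1)

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[3.5]]))            # sigmoid output: probability of each class
print(model.predict([[3.5]]))                  # threshold turns the probability into a 0/1 label
```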

Q) concept of clustering
Clustering is an unsupervised learning technique in machine learning used
to find natural groupings within a dataset. Unlike supervised learning,
which focuses on predicting a specific target variable, clustering aims to
identify patterns or relationships among data points without predefined
labels. 2) Purpose: The main goal is to group data points into clusters
based on similarity, allowing the identification of patterns or relationships
within the data. 3) Applications: Clustering is widely used to understand
data structure, group similar items, and explore the data for meaningful
patterns. It has applications in customer segmentation, image processing,
market research, and more. 4) How it Works: Clustering algorithms
analyze the data to find similarities among samples and then group these
similar data points into clusters. Clusters consist of objects that are more
similar to each other than to objects in other clusters. 5) Clustering
Techniques: There are various clustering algorithms, each with its
approach to forming clusters. Popular methods include k-means,
hierarchical clustering, DBSCAN, and Gaussian Mixture Models. 6) Number
of Clusters: Some algorithms require the user to specify the number of
clusters in advance (like k-means), while others can determine the optimal
number of clusters automatically (like DBSCAN).
Use Case Example: A common real-world example is customer
segmentation, where customers are grouped based on purchasing
behavior. This can help businesses identify customer profiles and tailor
marketing strategies accordingly.
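A minimal sketch of the customer-segmentation example with k-means in scikit-learn (assumed available); the two features and their values are made up:

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row: [annual spend, number of visits] for one customer (illustrative values).
X = np.array([[100, 2], [120, 3], [110, 2],
              [900, 30], [950, 28], [870, 32]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assigned to each customer
print(kmeans.cluster_centers_)   # centroid (typical profile) of each cluster
```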

Q) CONCEPT OF REINFORCEMENT LEARNING


1) Reinforcement Learning (RL) is a feedback-based Machine Learning (ML) technique in which an agent learns to behave in an environment by performing actions and seeing the results of those actions. 2) RL is an agent-based, goal-seeking technique where an AI agent tries to determine the best action to take in a given environment depending on a reward. 3) The agent has access to data which correspond to the various states in an environment and a label for each action. 4) A deep learning network may be used to take in an observation or state array and output probabilities for each action (or label). 5) The most popular implementation of RL is Google's AlphaGo AI, which defeated a top-ranked human Go player. 6) Practical applications of RL include route optimization strategies for a self-driving vehicle, for example. Most such applications are experimental as of this publication.
Basic Terms used in Reinforcement Learning:
1. Agent is an entity that can perceive/explore the environment and act
upon it. 2. Environment is a situation in which an agent is present or
surrounded by. In RL, we assume the stochastic environment, which
means it is random in nature. 3. Actions are the moves taken by an agent
within the environment.
4. State is a situation returned by the environment after each action taken
by the agent.
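The terms above (agent, environment, action, state, reward) can be tied together with tabular Q-learning, one common RL algorithm; the tiny 5-state corridor environment and all constants below are made-up illustrations, not from the notes:

```python
import random

n_states, n_actions = 5, 2             # states 0..4; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(state, action):
    """Environment: moving right eventually reaches the goal state, which gives reward 1."""
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        action = random.randrange(n_actions) if random.random() < epsilon else Q[state].index(max(Q[state]))
        next_state, reward = step(state, action)
        # Q-learning update: move the estimate towards reward + discounted best future value
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # action 1 (right) ends up with the higher value in every state
```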

Advantages of Reinforcement Machine :


1. RL is used to solve complex problems that cannot be solved by conventional techniques. 2) The solutions obtained by RL can be very accurate. 3) The RL model undergoes a rigorous training process that can take time; this can help to correct any errors. 4) Due to RL's learning ability, it can be used with neural networks; this can be termed deep reinforcement learning.

Disadvantages of Reinforcement Machine Learning:


1. RL needs a lot of data and a lot of computation. 2) Too much reinforcement learning can lead to an overload of states, which can diminish the results. 3) The RL algorithm is not preferable for solving simple problems; applying it to simpler problems would not be appropriate. 4) RL needs lots of data to feed the model, which consumes time and a lot of computational power.

SOCIAL MEDIA ANALYTICS


A social network is a type of complex network and can be described as a social structure composed of a set of social actors or users and the inter-relations and social interactions between them. These social networks are useful to study the relationships between individuals, groups, social units or societies. Social media analytics is the process of collecting, tracking and analyzing data from social networks. Social media analytics relies on new and established statistical and machine learning techniques to derive meaning from large amounts of textual and numeric data.
Benefits of Social Media Analytics:
1. The continuous monitoring, capturing and analyzing of social media data can become valuable information for decision-making. 2. A key benefit of social media analytics is that it gives us the ability to track and analyze the growth of the community on social media sites and the activities and behavior of the people using the sites. 3. Governments from around the world are starting to realize the potential of data analytics in making timely and effective decisions.
Q) Social Media Analytics Process
Social media analytics is a process that involves three key stages: data
capturing, data understanding, and data presentation. Here's an overview
of each stage:
1. Data Capturing - This is the initial stage where data is gathered from
various social media platforms like Facebook, Instagram, Twitter, and
others. The focus is on collecting relevant information, such as user
reactions, comments, likes, dislikes, and reviews about a topic, product, or
brand. 2) The data is typically unstructured and requires preprocessing to
clean, integrate, transform, reduce, and discretize it. This step ensures
that only relevant and useful information is retained for further analysis.
2. Data Understanding: - After capturing the data, this stage involves
analyzing it to gain meaningful insights. Noise removal and preprocessing
continue here to ensure the data is accurate for analysis. 2) Techniques
like statistical analysis, machine learning, deep learning, and natural
language processing are applied to transform the raw data into useful
information. 3) This stage is crucial, as it determines the quality of insights
and the accuracy of the final results. Proper analysis techniques are
essential to avoid misleading outcomes.
3. Data Presentation: - In this final stage, the results from the data
understanding stage are summarized and presented using data
visualization tools to make the output easily interpretable. 2) Effective
data visualization is key to presenting the insights in a way that is simple
and easily understood by stakeholders. 3) Interactive data visualization
allows for easier identification of patterns, outliers, and key insights,
aiding decision-making. 4) If the results are unsatisfactory, the process
may need to revisit the data capturing or data understanding stages to
improve analysis and outcomes.

Q] Seven Layers of social media analytics


The seven layers of social media analytics refer to different aspects of
social media data that are analyzed to derive business insights and make
informed decisions. Here's a brief overview of these seven layers:
1. (Layer 1) **Text**: Analyzes textual content like posts, tweets, comments, and status updates to understand sentiment, opinions, and user feedback.
2. (Layer 2) **Networks**: Examines social connections and relationships, such as friendships and followers, using graph theory to identify influential users and network structures.
3. (Layer 3) **Actions**: Focuses on user interactions like likes, shares, comments, and event creation to measure popularity, trends, and user engagement.
4. (Layer 4) **Mobile**: Looks at user engagement with mobile applications, including in-app interactions, to understand user behavior and improve marketing strategies.
5. (Layer 5) **Hyperlinks**: Analyzes hyperlinks (in-links and out-links) to study web navigation patterns, internet traffic sources, and connection structures.
6. (Layer 6) **Location**: Explores geographic information, tracking user locations and mapping data to understand spatial patterns for business intelligence.
7. (Layer 7) **Search Engines**: Investigates historical search engine data to analyze search trends, keyword popularity, and optimize search engine marketing (SEM) and search engine optimization (SEO).

Q] Social Media Analytics Life Cycle


The Social Media Analytics Life Cycle comprises six key steps designed to
extract valuable insights from social media data. Here's a brief summary
of each step: 1. **Identification**: - This step involves determining
which social media data sources to use for analysis, based on specific
business objectives. The goal is to identify relevant data among the vast
amount of information available across various social media platforms. 2.
**Extraction**: - Once the data sources are identified, the appropriate
Application Programming Interfaces (APIs) and specialized tools are used
to extract the needed data. Data extraction must comply with privacy and
ethical considerations. 3. **Cleaning**: - In this step, the extracted data is
preprocessed to remove noise, fill in missing values, smooth out
fluctuations, and remove outliers or inconsistencies. This process ensures
that the data is accurate and ready for analysis. 4. **Analyzing**: - This
step involves using various analytical tools and techniques to extract
meaningful insights from the cleaned data. The choice of analysis
methods depends on the business objectives and the nature of the data.
5. **Visualization**: - After analyzing the data, the results are presented in
a visual format, such as graphs, charts, or word clouds. Data visualization
makes it easier to understand and communicate the insights derived from
the analysis.
Q] Linked Prediction:-
Link prediction is the problem of predicting the existence of a link between
two entities in a social network. • The link prediction problem is one
common research issue in social network analysis and mining. • The link
prediction issue studies a static snapshot of the nodes and edges of a
social network at a given time Ti and based on the study, predicts the
future links of the social network for a future rime T2. • The link prediction
problem is a common feature found in many social networking sites for
possible friends' suggestions as found on Facebook or Twitter. • This
feature, in turn, allows a user to increase the personal or professional
friends circle to broaden the social links and connections. • This will
increase the social networking activities as each user will be then
connected to more users on the social network.
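A minimal sketch of link prediction on a toy friendship graph with NetworkX (assumed available); it scores a currently missing edge with the Jaccard coefficient, one common neighbourhood-based heuristic:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")])

# Score the non-existing edge (A, D): a higher score suggests the link is more
# likely to appear at a future time T2 (A and D already share neighbours B and C).
for u, v, score in nx.jaccard_coefficient(G, [("A", "D")]):
    print(u, v, round(score, 2))
```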
Q] The Bag of Words (BoW)
- **Definition**: BoW is a representation of text data as a vector, where
each dimension corresponds to a unique word in a dataset, and the value
in each dimension indicates the frequency of that word in a given
document or text segment. - **Tokenization**: To create a BoW
representation, the text is first tokenized into individual words or terms.
This process creates a "bag" of words, ignoring word order and structure. -
**Usage**: The resulting vectors from BoW are used as input for machine
learning algorithms in tasks like text classification, sentiment analysis, and
topic modeling. - **Characteristics**: BoW is simple and effective for
capturing word frequency.
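A minimal sketch of a Bag-of-Words representation with scikit-learn's CountVectorizer (assumed available); the two documents are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the product is good", "the product is not good"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # tokenizes each document and counts word frequencies

print(vectorizer.get_feature_names_out())   # vocabulary: one dimension per unique word
print(X.toarray())                          # word-frequency vector for each document
```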
Q] n-Grams:-
An n-gram means a sequence of n words. A piece of text containing M words can be broken into a collection of M − n + 1 n-grams. 2) We can create a bag-of-words out of n-grams, run TF-IDF on them, or model them with a Markov chain, just as if they were normal words. 3) The problem with n-grams is that there are so many potential ones out there. Most n-grams that appear in a piece of text will occur only once, with the frequency decreasing the larger n is. The general approach here is to only look at n-grams that occur more than a certain number of times in the corpus.
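A minimal sketch in plain Python that extracts word bigrams (n = 2) and counts them, illustrating both the M − n + 1 count and the fact that most n-grams occur only once; the sentence is made up:

```python
from collections import Counter

def ngrams(words, n):
    """A text of M words yields M - n + 1 n-grams."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = "the quick brown fox jumps over the lazy dog".split()
bigrams = ngrams(tokens, 2)

print(len(tokens), len(bigrams))   # 9 words -> 8 bigrams
print(Counter(bigrams))            # almost every bigram occurs only once
```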
Q] Stop word
Stop words are common words in a language that are typically removed
from text during Natural Language Processing (NLP) and Text Mining
because they carry little useful information for certain tasks. Common
examples in English include "a", "an", "the", "is", "are", "of", and similar
words. 2)In text analysis, removing stop words helps focus on more
significant terms that better represent the meaning or context of the text.
This process is useful in applications like Bag-of-Words (BoW), TF-IDF, and
n-grams, where the goal is to extract relevant features from text. 3)
However, there's no universal list of stop words, as what counts as a stop
word may vary depending on the context and purpose of the analysis. In
some cases, stop words are predefined in NLP libraries; in others, they are
identified manually from a specific dataset or corpus. 4) While removing
stop words can be helpful, it may sometimes cause issues, especially with
n-grams.
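A minimal sketch of stop-word removal using NLTK's predefined English list (assumed installed, with the corpus downloaded via nltk.download("stopwords")); the sentence is made up:

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = "the service of this hotel is an example of great quality".split()

filtered = [w for w in tokens if w not in stop_words]
print(filtered)   # keeps the more significant terms, e.g. ['service', 'hotel', 'example', 'great', 'quality']
```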
Q] Trend Analytics:-
Trend analytics involves analyzing data over a period of time to predict
future trends or events. This approach is used in various fields, like
business, finance, and project management, to make informed decisions
based on observed patterns. The fundamental
concept is that by examining historical data, one can anticipate future
outcomes.
There are three main types of trend analysis:-- 1) Geographic Trend
Analysis: This type focuses on trends within or across specific geographic
locations. It examines patterns influenced by geography, like culture,
climate, or food habits. Geographic trend analysis is limited to specific
regions and often easier to interpret. 2) Intuitive Trend Analysis: This
method relies on the analyst's intuition or logical explanations due to a
lack of substantial data. It involves predicting future trends based on
behavioral patterns and logical reasoning. However, this approach can be
prone to biases and is harder to interpret. 3) Temporal Trend Analysis: This
analysis examines changes over time to predict future events. Time-series
analysis is a common method used in temporal trend analysis, where data
is arranged in chronological order. It is widely used to model trends and
forecast future outcomes in various domains.
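A minimal sketch of temporal trend analysis: a 3-month moving average over a time series arranged in chronological order, using pandas (assumed available); the monthly sales figures are made up:

```python
import pandas as pd

sales = pd.Series(
    [100, 105, 98, 110, 120, 118, 130, 128, 140, 138, 150, 155],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),  # data in chronological order
)

trend = sales.rolling(window=3).mean()   # 3-month moving average smooths short-term fluctuations
print(trend.tail())                      # a rising trend line suggests higher values ahead
```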
Q] Data mining
Data mining is defined as "extracting or mining knowledge from massive amounts of data". Data mining is an interdisciplinary subfield
of computer science and statistics with an overall goal to extract
information (with intelligent methods) from a data set and transform the
information into a comprehensible structure for further use.
Step 1: Data Cleaning: In this step, noisy and inconsistent data is removed and/or cleaned. Step 2: Data Integration: In this step, multiple
data sources are combined. Step 3: Data Selection: In this step, data
relevant to the analysis task are retrieved from the dataset. Step 4: Data
Transformation: In this step, data is transformed or consolidated into
forms appropriate for mining by performing aggregation or summary
operations. Step 5: Data Mining: In this step, intelligent methods are
applied in order to extract data patterns. Step 6: Pattern Evaluation: In
this step, data patterns are evaluated.
Advantages of Data Mining: 1. Data mining is a quick process that makes it easy for new users to analyze enormous amounts of data in a short time. 2. The data mining technique enables organizations to obtain knowledge-based data. 3. Compared with other statistical data applications, data mining is efficient and cost-effective.
Disadvantages of Data Mining
1. A data mining system may violate the privacy of its users, which is why it lacks in matters of safety and security of its users. 3. Data mining techniques are not 100 percent accurate and may cause serious consequences in certain conditions. 4. Data mining involves lots of
technology in use for the data collection process. Every data generated
needs its own storage space as well as maintenance. This can greatly
increase the implementation cost.
Q] Frequent Pattern (FP) Growth Algorithm
A number of improvement techniques, such as partitioning, hash-based techniques, transaction reduction, sampling and dynamic item set counting, address the efficiency of Apriori. The shortcomings of the Apriori algorithm can also be overcome using the FP (Frequent Pattern)-growth algorithm. The FP-growth algorithm adopts a divide-and-conquer strategy as follows: first, it compresses the database representing frequent items into a frequent pattern tree, or FP-tree, which retains the item set association information.

Advantages of FP-growth Algorithm:


1. The FP-growth algorithm scans the database only twice which helps in
decreasing computation cost. 2. The FP-growth algorithm uses divide and
conquer method so the size of subsequent conditional FP-tree is reduced.
3. The FP-growth method transforms the problem of finding long frequent
patterns into searching for shorter ones in much smaller conditional
databases recursively.
Disadvantages of FP-growth Algorithm:
1. The FP-growth algorithm is difficult to be used in an interactive mining
process as users may change the support threshold according to the rules
which may lead to repetition of the whole mining process. 2. The FP-
growth algorithm is not suitable for incremental mining. 3. When the
dataset is large, it is sometimes unrealistic to construct a main memory
based FP-tree.

Q] Apriori
The Apriori algorithm is a widely used algorithm for generating frequent item sets. It is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994. The name of the algorithm is based on the fact that it uses prior knowledge of frequent item set properties. It is a classic algorithm for learning association rules. The Apriori algorithm is easy to execute and very simple; it is used to mine all frequent item sets in a database. Apriori property: if an item set is infrequent, then all its supersets must also be infrequent. So, according to the Apriori property, if {A} is an infrequent item set, then all its supersets like {A,B}, {A,C}, {A,B,C}, etc., will also be infrequent. This is called anti-monotonicity because the property is monotonic in the context of failing a test. The Apriori algorithm uses a stepwise (level-wise) search approach, meaning k-item sets are used to discover (k+1)-item sets.
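A minimal sketch of mining frequent item sets with Apriori, assuming the third-party mlxtend library is installed; the shopping-basket transactions and the 50% support threshold are made up:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [["bread", "milk"],
                ["bread", "butter", "milk"],
                ["milk", "butter"],
                ["bread", "butter"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Item sets appearing in at least 50% of transactions; by the Apriori property,
# supersets of an infrequent item set are pruned and never generated.
print(apriori(onehot, min_support=0.5, use_colnames=True))
```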
Supervised Learning vs. Unsupervised Learning:
1) In supervised learning both input and output variables are provided, on the basis of which the output can be predicted and the probability of its correctness is higher. In unsupervised learning only input variables are provided and no output variables are available, so the outcome or resultant learning depends on one's own intellectual observation.
2) Supervised learning is treated as a highly accurate and trustworthy method, so its accuracy and correctness are better as compared to unsupervised learning, which is a comparatively less accurate and trustworthy method.
3) Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
4) A supervised learning model takes direct feedback to check whether it is predicting the correct output or not; an unsupervised learning model does not take any feedback.
5) A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in data.
6) In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
7) The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find hidden patterns and useful insights from the unknown dataset.
8) Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision to train the model.
Data Analysis vs. Data Analytics:
1) The process of extracting information from raw data is called data analysis; the process of extracting meaningful, valuable insights from raw data is called data analytics.
2) Data analysis is described as a traditional or generic form of analytics; data analytics is described as a particularized form of analytics.
3) Data analysis includes several stages like the collection of data and then the inspection of business data; in data analytics, raw data is first defined in a meaningful manner, then data cleaning and conversion are done to get meaningful information from the raw data.
4) Data analysis supports decision-making by analyzing enterprise data; data analytics analyzes the data by focusing on insights into business data.
5) Data analysis uses various tools to process data, such as Tableau, Python, Excel, etc.; data analytics uses different tools to analyze data, such as RapidMiner, OpenRefine, NodeXL, KNIME, etc.
6) Descriptive analysis cannot be performed on data analysis; a descriptive analysis can be performed on data analytics.
7) Data analysis does not deal with inferential analysis; data analytics supports inferential analysis.
