Data Analytics
Descriptive Analytics:
• Descriptive analytics is a simple, surface-level type of analysis that looks at what has happened in the past.
• The two main techniques used in descriptive analytics are data aggregation and data mining: the data analyst first gathers the data and presents it in a summarized format (that’s the aggregation part) and then “mines” the data to discover patterns.
• The data is then presented in a way that can be easily understood by a wide audience (not just data experts).
• It’s important to note that descriptive analytics doesn’t try to explain the historical data or establish cause-and-effect relationships.
• At this stage, it’s simply a case of determining and describing the “what happened”.
Diagnostic Analytics:
• While descriptive analytics looks at the “what”, diagnostic analytics explores the “why”.
• When running diagnostic analytics, data analysts will first seek to identify anomalies within the data, that is, anything that cannot be explained by the data in front of them.
• For example: If the data shows that there was a sudden drop in sales for the month of March, the data analyst will need to investigate the cause.
• To do this, they will identify any additional data sources that might tell them more about why such anomalies arose.
• Finally, the data analyst will try to uncover causal relationships.
• For example, looking at any events that may correlate or correspond with the decrease in sales. At this stage, data analysts may use probability theory, regression analysis, filtering, and time-series data analytics.
Prescriptive Analytics:
• Building on predictive analytics, prescriptive analytics advises on the actions and decisions
that should be taken.
• Prescriptive analytics shows you how you can take advantage of the outcomes that have been
predicted.
• When conducting prescriptive analysis, data analysts will consider a range of possible scenarios
and assess the different actions the company might take.
• Prescriptive analytics is one of the more complex types of analysis, and may involve working
with algorithms, machine learning, and computational modeling procedures.
• However, the effective use of prescriptive analytics can have a huge impact on the company’s
decision-making process.
Typical Data Analyst Responsibilities:
• Manage the delivery of user satisfaction surveys and report on results using data visualization software
• Work with business line owners to develop requirements, define success metrics, manage and
execute analytical projects, and evaluate results
• Monitor practices, processes, and systems to identify opportunities for improvement
• Proactively communicate and collaborate with stakeholders, business units, technical teams and
support teams to define concepts and analyze needs and functional requirements
• Translate important questions into concrete analytical tasks
• Gather new data to answer client questions, collating and organizing data from multiple sources
• Apply analytical techniques and tools to extract and present new insights to clients using reports
and/or interactive dashboards
• Collaborate with data scientists and other team members to find the best product solutions
• Establish data processes, define data quality criteria, and implement data quality processes
• Take ownership of the codebase, including suggestions for improvements and refactoring
• Build data validation models and tools to ensure data being recorded is accurate
• Work as part of a team to evaluate and analyze key data that will be used to shape future business
strategies
The Data Analysis Process:
• The first step is to identify why you are conducting analysis and what question or challenge
you hope to solve.
• At this stage, you’ll take a clearly defined problem and come up with a relevant question or
hypothesis you can test. You’ll then need to identify what kinds of data you’ll need and where
it will come from. For example: A potential business problem might be that customers aren’t
subscribing to a paid membership after their free trial ends. Your research question could then be
“What strategies can we use to boost customer retention?”
• Data analysts will usually gather structured data from primary or internal sources, such as CRM
software or email marketing tools.
• They may also turn to secondary or external sources, such as open data sources.
• These include government portals, tools like Google Trends, and data published by major
organizations such as UNICEF and the World Health Organization.
• The original dataset may contain duplicates, anomalies, or missing data, all of which could distort how the data is interpreted, so these need to be removed.
• Data cleaning can be a time-consuming task, but it’s crucial for obtaining accurate results (a minimal pandas sketch appears after this list).
• How you analyze the data will depend on the question you’re asking and the kind of data you’re
working with, but some common techniques include regression analysis, cluster analysis, and
time-series analysis
• This final step in the process is where data is transformed into valuable business insights.
• Depending on the type of analysis conducted, you’ll present your findings in a way that others can understand, for example in the form of a chart or graph.
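As referenced above, here is a minimal sketch of basic data cleaning with pandas; the DataFrame, column names, and outlier threshold are hypothetical, and a real project would tailor these rules to its own dataset.

import pandas as pd

# Hypothetical customer data with a duplicate row, a missing value, and an outlier.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "monthly_spend": [120.0, 85.5, 85.5, None, 9999.0],
})

df = df.drop_duplicates()                 # remove duplicate rows
df = df.dropna(subset=["monthly_spend"])  # drop rows with missing values
df = df[df["monthly_spend"] < 1000]       # drop an implausibly large value
print(df)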
Data Analytics Techniques:
• Regression analysis:
o This method is used to estimate or “model” the relationship between a set of variables.
o Regression analysis is mainly used to make predictions.
• Factor analysis (Dimension Reduction):
o This technique helps data analysts to uncover the underlying variables that drive people’s behavior and the choices they make.
o Ultimately, it condenses the data in many variables into a few “super-variables”, making the data easier to work with.
o For example: If you have three different variables which represent customer satisfaction, you might use factor analysis to condense these variables into just one all-encompassing customer satisfaction score.
• Cohort analysis:
o A cohort is a group of users who have a certain characteristic in common within a specified
time period
o For example, all customers who purchased using a mobile device in March may be
considered as one distinct cohort.
o In cohort analysis, customer data is broken up into smaller groups or cohorts; so, instead
of treating all customer data the same, companies can see trends and patterns over time
that relate to particular cohorts.
o In recognizing these patterns, companies are then able to offer a more targeted service.
• Cluster analysis:
o This technique is all about identifying structures within a dataset.
o Cluster analysis essentially segments the data into groups that are internally homogeneous and externally heterogeneous.
o In other words, the objects in one cluster must be more similar to each other than they are to the objects in other clusters.
o Cluster analysis enables you to see how data is distributed across a dataset where there are no existing predefined classes or groupings.
o In marketing, for example, cluster analysis may be used to identify distinct target groups
within a larger customer base.
• Time-series analysis:
o Time-series data is a sequence of data points which measure the same variable at different
points in time.
o Time-series analysis, then, is the collection of data at specific intervals over a period of time in order to identify trends and cycles, enabling data analysts to make accurate forecasts for the future.
o If you wanted to predict the future demand for a particular product, you might use time-series analysis to see how the demand for this product typically looks at certain points in time.
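As a rough illustration of time-series analysis, the sketch below computes a 3-month moving average over a made-up monthly demand series with pandas; the dates and values are invented.

import pandas as pd

# Made-up monthly demand figures; a 3-month rolling mean highlights the trend.
demand = pd.Series(
    [120, 135, 150, 140, 160, 175, 170, 190],
    index=pd.date_range("2023-01-01", periods=8, freq="MS"),
)
trend = demand.rolling(window=3).mean()
print(trend)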
Data Analytics Tools:
• Microsoft Excel:
o It is a software program that enables you to organize, format, and calculate data using formulas within a spreadsheet system.
o Microsoft Excel may be used by data analysts to run basic queries and to create pivot tables, graphs, and charts.
o Excel also features a macro programming language called Visual Basic for Applications
(VBA).
• Tableau
o It is a popular business intelligence and data analytics software which is primarily used as
a tool for data visualization.
o Data analysts use Tableau to simplify raw data into visual dashboards, worksheets, maps, and charts.
o This helps to make the data accessible and easy to understand, allowing data analysts to effectively share their insights and recommendations.
• SAS (Statistical Analysis Software):
o It is a command-driven software package used for carrying out advanced statistical analysis and data visualization.
o SAS is one of the most widely used software packages in the industry.
• RapidMiner
o It is a software package used for data mining (uncovering patterns), text mining, predictive
analytics, and machine learning.
o Used by both data analysts and data scientists.
o RapidMiner comes with a wide range of features, including data modeling, validation, and automation.
• Power BI:
o It is a business analytics solution that lets you visualize your data and share insights across your organization.
o Similar to Tableau, Power BI is primarily used for data visualization.
o While Tableau is built for data analysts, Power BI is a more general business intelligence tool.
Qualitative Data:
• Qualitative data is data that can’t be measured or counted in the form of numbers. These types of data are sorted by category, not by number.
• These data consist of audio, images, symbols, or text.
• The gender of a person, i.e., male, female, or others, is qualitative data.
Qualitative data is further classified into two parts:
Nominal Data:
• Nominal Data is used to label variables without any order or quantitative value.
• The colour of hair can be considered nominal data, as one colour can’t be compared with another
colour.
Ordinal Data:
• Ordinal data have a natural ordering, where a number is present in some kind of order by its position on the scale.
• These data are used for observations like customer satisfaction, happiness, etc., but we can’t do any arithmetical tasks on them.
• Ordinal data is qualitative data for which the values have some kind of relative position.
• These kinds of data can be considered as “in-between” qualitative and quantitative data.
• Ordinal data only shows sequences and cannot be used for statistical analysis.
Quantitative Data:
• Quantitative data can be expressed in numerical values, which makes it countable and suitable for statistical data analysis.
• It answers questions like “how much,” “how many,” and “how often.”
• For example, the price of a phone, a computer’s RAM, and the height or weight of a person all fall under quantitative data.
• Quantitative data can be used for statistical manipulation.
Examples of Quantitative Data:
Continuous Data:
• Height of a person
• Speed of a vehicle
• “Time-taken” to finish the work
• Wi-Fi frequency
• Market share price
Difference between Discrete and Continuous Data
Discrete Data:
• Countable and finite; they are whole numbers or integers.
• Represented mainly by bar graphs.
• The values cannot be divided into smaller subdivisions.
• There are spaces between the values.
• Examples: total students in a class, number of days in a week, size of a shoe, etc.
Continuous Data:
• Measurable; they are in the form of fractions or decimals.
• Represented in the form of a histogram.
• The values can be divided into smaller subdivisions.
• They form a continuous sequence.
• Examples: temperature of a room, the weight of a person, length of an object, etc.
Machine Learning Models:
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
• Regression
• Classification
• Clustering
• Association Rule
• Dimensionality Reduction
Supervised Learning Models:
• Supervised Learning is the simplest machine learning model to understand in which input data
is called training data and has a known label or result as an output.
• It works on the principle of input-output pairs.
• It requires creating a function that can be trained using a training data set and then applied to unknown data to make predictions.
• Supervised learning is task-based and tested on labeled data sets.
Regression
Linear Regression:
• Linear regression is the simplest machine learning model in which we try to predict one output
variable using one or more input variables.
• The representation of linear regression is a linear equation, which combines a set of input values (x) with the predicted output (y) for that set of input values.
• It is represented in the form of a line:
Y = bX + c
• The main aim of the linear regression model is to find the line that best fits the data points.
• Linear regression is extended to multiple linear regression (find a plane of best fit) and
polynomial regression (find the best fit curve).
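A minimal sketch of fitting Y = bX + c with scikit-learn; the toy data points are invented and roughly follow y = 2x + 1.

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that roughly follows y = 2x + 1.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression().fit(X, y)
print("slope (b):", model.coef_[0])
print("intercept (c):", model.intercept_)
print("prediction for x = 6:", model.predict([[6]])[0])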
Decision Tree
• Decision trees are the popular machine learning models that can be used for both regression and
classification problems.
• A decision tree uses a tree-like structure of decisions along with their possible consequences and
outcomes.
• In this, each internal node is used to represent a test on an attribute, and each branch is used to represent the outcome of the test.
• The more nodes a decision tree has, the more accurate the result will be.
• The advantage of decision trees is that they are intuitive and easy to implement, but they lack
accuracy.
• Decision trees are widely used in operations research, specifically in decision analysis,
strategic planning, and mainly in machine learning.
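A short illustrative decision tree classifier using scikit-learn's bundled iris dataset; the max_depth setting is an arbitrary choice to keep the tree small.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a shallow tree; each internal node tests one attribute of the flowers.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.predict(X[:5]))  # predicted classes for the first five samples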
Random Forest
• Random Forest is an ensemble learning method which consists of a large number of decision trees.
• Each decision tree in a random forest predicts an outcome, and the prediction with the majority
of votes is considered as the outcome.
• A random forest model can be used for both regression and classification problems.
• For the classification task, the outcome of the random forest is taken from the majority of votes.
• Whereas in the regression task, the outcome is taken from the mean or average of the predictions
generated by each tree.
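A brief random forest sketch with scikit-learn; the wine dataset and the number of trees (n_estimators=100) are arbitrary illustrative choices.

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# 100 decision trees each vote; the majority class becomes the prediction.
X, y = load_wine(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:5]))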
Classification:
Classification models are the second type of supervised learning technique; they are used to generate conclusions from observed values in categorical form. For example, a classification model can identify whether an email is spam or not, or whether a buyer will purchase a product or not.
Classification algorithms categorize the output into different groups or classes. In classification, a classifier model is designed that classifies the dataset into different categories, and each category is assigned a label.
• Binary classification: If the problem has only two possible classes, it is called a binary classification problem. For example: cat or dog, Yes or No.
• Multi-class classification: If the problem has more than two possible classes, it is a multi-class
classifier.
Some popular classification algorithms are as below:
a) Logistic Regression
Logistic Regression is used to solve classification problems in machine learning. It is similar to linear regression but is used to predict categorical variables. It can predict the output as Yes or No, 0 or 1, True or False, etc. However, rather than giving exact values, it provides probabilistic values between 0 and 1.
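A minimal sketch, assuming a made-up one-feature dataset (hours of product usage versus purchase), showing how logistic regression returns both a predicted class and a probability between 0 and 1.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature: hours of product usage; label: purchased (1) or not (0).
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.0]]))        # predicted class (0 or 1)
print(clf.predict_proba([[2.0]]))  # probabilities between 0 and 1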
b) Support Vector Machine (SVM)
Support Vector Machine, or SVM, is a popular machine learning algorithm which is widely used for classification and regression tasks. However, it is primarily used to solve classification problems.
The main aim of SVM is to find the best decision boundary in an N-dimensional space that can segregate the data points into classes; this best decision boundary is known as the hyperplane. SVM selects the extreme vectors that help define the hyperplane, and these vectors are known as support vectors.
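A small illustrative SVM classifier with a linear kernel in scikit-learn; the iris dataset and the kernel choice are arbitrary stand-ins.

from sklearn.datasets import load_iris
from sklearn.svm import SVC

# A linear-kernel SVM; the fitted hyperplane separates the classes and
# clf.support_vectors_ holds the support vectors it was built from.
X, y = load_iris(return_X_y=True)
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict(X[:5]))
print(len(clf.support_vectors_), "support vectors")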
c) Naïve Bayes
Naïve Bayes is another popular classification algorithm used in machine learning. It is called so because it is based on Bayes’ theorem and follows the naïve (independence) assumption between the features, which is given as:
P(A|B) = [P(B|A) × P(A)] / P(B)
Each naïve Bayes classifier assumes that the value of a specific feature is independent of any other feature. For example, if a fruit needs to be classified based on color, shape, and taste, then a yellow, oval, and sweet fruit will be recognized as a mango. Here, each feature is treated as independent of the other features.
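A minimal Gaussian naïve Bayes sketch with scikit-learn; the iris dataset is just a convenient stand-in for the fruit example above.

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Gaussian naive Bayes treats each feature as independent given the class.
X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)
print(nb.predict(X[:5]))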
Unsupervised Learning Models:
Unsupervised machine learning models implement the learning process in the opposite way to supervised learning, which means the model learns from an unlabeled training dataset. Based on the unlabeled dataset, the model predicts the output. Using unsupervised learning, the model learns hidden patterns from the dataset by itself, without any supervision.
Unsupervised learning models are mainly used to perform three tasks, which are as follows:
• Clustering
Clustering is an unsupervised learning technique that involves grouping the data points into different clusters based on similarities and differences. The objects with the most similarities remain in the same group, and they have few or no similarities with the objects in other groups.
Clustering algorithms are widely used in different tasks such as image segmentation, statistical data analysis, market segmentation, etc.
Some commonly used clustering algorithms are K-means clustering, hierarchical clustering, DBSCAN, etc.
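A short K-means sketch with scikit-learn on six made-up 2-D points; the points are chosen so that two clusters are obvious.

import numpy as np
from sklearn.cluster import KMeans

# Six made-up 2-D points forming two obvious groups; K-means finds them without labels.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two cluster centres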
Reinforcement Learning Models:
In reinforcement learning, the algorithm learns actions for a given set of states that lead to a goal state. It is a feedback-based learning model that takes feedback signals after each state or action by interacting with the environment. This feedback works as a reward (positive for each good action and negative for each bad action), and the agent’s goal is to maximize the positive rewards to improve its performance.
The behavior of the model in reinforcement learning is similar to human learning, as humans learn things by experience, using feedback from interacting with the environment.
Below are some popular algorithms that come under reinforcement learning:
Q-Learning:
It aims to learn the policy that can help the AI agent take the best action for maximizing the reward under a specific circumstance. It maintains a Q-value for each state-action pair that indicates the reward for following a given state path, and it tries to maximize the Q-value.
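A toy tabular Q-learning sketch; the three-state environment, rewards, and hyperparameters (alpha, gamma, epsilon) are all invented for illustration.

import numpy as np

# Invented 3-state chain: the agent starts in state 0 and state 2 is the goal.
n_states, n_actions = 3, 2          # actions: 0 = stay, 1 = move right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def step(state, action):
    """Hypothetical environment: reaching the goal state earns a reward of +1."""
    next_state = min(state + action, n_states - 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

rng = np.random.default_rng(0)
for _ in range(500):
    state = 0
    while state != n_states - 1:
        # epsilon-greedy choice between exploring and exploiting the current Q-values
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state, reward = step(state, action)
        # Q-update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)  # learned Q-values per state-action pair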
Missing Imputations:
Imputation:
• Imputation is a technique used for replacing the missing data with some substitute value to retain
most of the data/information of the dataset.
We use imputation because missing data can cause the following issues:
1. Incompatible with most of the Python libraries used in Machine Learning: While using the libraries for ML (the most common is sklearn), they don’t have a provision to automatically handle missing data, which can lead to errors.
2. Distortion in the Dataset: A huge amount of missing data can cause distortions in the variable distribution, i.e., it can increase or decrease the value of a particular category in the dataset.
3. Affects the Final Model: The missing data can cause a bias in the dataset and can lead to a faulty analysis by the model.
Mean or Median Imputation:
Pros:
• Easy and fast.
• Works well with small numerical datasets.
Cons:
• Doesn’t factor the correlations between features. It only works on the column level.
• Will give poor results on encoded categorical features (do NOT use it on categorical features).
• Not very accurate.
• Doesn’t account for the uncertainty in the imputations.
Most Frequent (Mode) Imputation:
Cons:
• It also doesn’t factor the correlations between features.
• It can introduce bias in the data.
Zero or Constant Imputation:
As the name suggests, it replaces the missing values with either zero or any constant value you specify.
Pros:
• Easy to implement.
• We can use it in production.
• It retains the importance of “missing values” if it exists.
Cons:
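A minimal sketch of mean and zero/constant imputation using scikit-learn's SimpleImputer; the small matrix with missing values is made up.

import numpy as np
from sklearn.impute import SimpleImputer

# A small made-up matrix with missing values (np.nan).
X = np.array([[7.0, 2.0], [4.0, np.nan], [np.nan, 6.0]])

mean_imputer = SimpleImputer(strategy="mean")
print(mean_imputer.fit_transform(X))        # missing values replaced by column means

constant_imputer = SimpleImputer(strategy="constant", fill_value=0)
print(constant_imputer.fit_transform(X))    # zero/constant imputation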
Business Model:
• A Business Model can be defined as a representation of a business or solution that often includes a graphic component along with supporting text and relationships to other components.
• A Business Model is a structured model, just like a blueprint for the final product to be developed.
• It gives structure and dynamics for planning.
• It also provides the foundation for the final product.
• With the help of modelling techniques, we can create a complete description of existing and
proposed organizational structures, processes, and information used by the enterprise.
• Analyzing requirements is a part of the business modelling process, and it forms the core focus area.
• Functional Requirements are gathered during the “Current state”.
• These requirements are provided by the stakeholders regarding the business processes, data, and
business rules that describe the desired functionality which will be designed in the Future State.