Predictive Analytics

Predictive analytics uses data and statistical algorithms to forecast future events by analyzing historical data for patterns and trends. It aids organizations in decision-making, risk management, and enhancing customer experiences across various industries. The process involves defining requirements, gathering and preparing data, developing and evaluating models, and deploying them for practical use.


What is Predictive Analytics?

Predictive analytics, in simple terms, is the use of data and statistical and machine learning algorithms
to make educated guesses or forecasts about future events or outcomes. It involves analyzing
historical data to identify patterns, trends, and relationships that can help predict what might
happen in the future.

Imagine you have a record of past sales data for a retail store. By applying predictive analytics,
you can analyze this data to predict future sales trends. This might involve considering factors
like previous sales patterns, seasonal trends, economic indicators, and other relevant data to
make informed predictions about future sales volumes.

In essence, predictive analytics helps businesses and organizations make better decisions by
using data-driven insights to anticipate future events and make proactive adjustments based
on those predictions.

Why Predictive Analytics?

Organizations and individuals conduct predictive analytics for several important reasons:
1. Anticipating Future Trends: Predictive analytics allows you to forecast future events,
trends, and outcomes. This can be invaluable for businesses in industries like finance,
retail, and healthcare, as well as in various other fields. For example, retailers can use
predictive analytics to anticipate consumer demand and adjust inventory accordingly.
2. Optimizing Decision-Making: Predictive analytics provides data-driven insights that
can help improve decision-making. By making predictions based on historical data,
organizations can optimize their strategies and resources. For instance, a healthcare
provider might use predictive analytics to identify patients at risk of certain medical
conditions and provide early interventions.
3. Risk Management: Many industries use predictive analytics for risk assessment and
management. Insurance companies, for instance, use it to assess policyholders' risk
profiles and set premiums accordingly. In finance, predictive analytics can help detect
fraudulent transactions.
4. Enhancing Customer Experience: Predictive analytics helps organizations better
understand their customers' preferences and behaviors. By analyzing data on past
interactions, companies can personalize marketing campaigns, recommend products,
and tailor services to individual customers.
5. Increasing Efficiency and Cost Savings: Predictive analytics can help organizations
streamline operations and reduce costs. For example, manufacturing companies might
use it to optimize supply chain logistics and minimize production downtime.
6. Improving Healthcare: Healthcare providers can use predictive analytics to predict
disease outbreaks, allocate resources effectively, and improve patient outcomes. For
instance, predictive models can help hospitals predict patient readmissions and take
preventive actions.
7. Asset Management and Maintenance: In industries like energy, transportation, and
manufacturing, predictive analytics can predict equipment failures and maintenance
needs. This proactive approach can save money by reducing downtime and preventing
costly breakdowns.
8. Fraud Detection and Prevention: In finance and e-commerce, predictive analytics can
identify patterns indicative of fraudulent activity. This helps organizations take
immediate action to prevent or mitigate fraud.
9. Resource Allocation: In fields such as education and government, predictive analytics
can be used to allocate resources efficiently. Schools might use it to predict student
performance and provide additional support to those at risk of falling behind.
10. Personalization: In the digital world, predictive analytics powers recommendation
systems. Platforms like Netflix and Amazon use predictive algorithms to recommend
content and products to users based on their past preferences and behavior.

In essence, predictive analytics helps organizations make more informed decisions, reduce
uncertainty, and gain a competitive edge by leveraging data and statistical models to foresee
future outcomes and trends. It's a valuable tool for strategic planning, risk management, and
improving overall operational efficiency.

How to deploy Predictive Analytics?

1. Define the Requirements – defining what you want to predict or achieve
Understand the business problem you're trying to solve. Is it managing inventory? Reducing fraud? Predicting sales? The goal could be anything from forecasting sales to identifying potential machine breakdowns.
2. Gather Data – collecting relevant data
This could be historical data related to your goal, such as past sales records, customer
information, or equipment sensor data.
3. Clean and Prepare Data – correcting errors, handling missing values and outliers, and structuring the data for analysis
Data can be messy. Clean it by correcting errors and handling missing values and outliers, then organize and structure it in a way that's suitable for analysis. Identify the variables (features) in your data that you believe are related to your goal.
4. Develop the model – selecting algorithm and training the model
Select a predictive analytics method or algorithm. Start simple if you're new to this –
linear regression is a good choice for beginners. More complex problems may require
machine learning algorithms like decision trees or neural networks.
Use historical data to train your predictive model. This means teaching the model to
learn patterns and relationships in the data.
5. Evaluate the model – testing the model's performance on new, unseen data
(validation data)
After training, test the model's performance on new, unseen data (validation data).
You want to ensure that your model can make accurate predictions.
If the model's performance isn't satisfactory, make adjustments. This could involve
tweaking parameters, adding more data, or choosing a different algorithm.
6. Deploy the model – putting the model into action
Once you're satisfied with your model's performance, it's time to put it into action.
This could mean integrating it into your website, software, or decision-making
processes.
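
The six steps above map naturally onto a few lines of code. Below is a minimal end-to-end sketch in Python, assuming pandas and scikit-learn are available and that a hypothetical file "sales.csv" exists with a numeric target column named "sales" (both names are placeholders, not from this handout):

```python
# A minimal sketch of the workflow above; "sales.csv" and the "sales" column
# are hypothetical placeholders for your own historical data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Step 2: gather data
df = pd.read_csv("sales.csv")

# Step 3: clean and prepare (dropping rows with missing values is a simple start)
df = df.dropna()
X = df.drop(columns=["sales"])   # features
y = df["sales"]                  # target

# Step 5 requires unseen data, so hold some out before training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 4: develop (train) the model -- start simple with linear regression
model = LinearRegression().fit(X_train, y_train)

# Step 5: evaluate on the held-out validation data
print("R-squared on validation data:", r2_score(y_test, model.predict(X_test)))

# Step 6: deploy -- e.g., call model.predict(new_data) inside your application
```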
Regression

Regression is a machine learning algorithm commonly used for predicting numeric values
based on input features. It's a fundamental algorithm in the field of supervised learning,
particularly for tasks where the target variable is continuous or numeric.

Supervised learning is a type of machine learning where an algorithm learns from labeled
training data to make predictions or decisions. In supervised learning, the algorithm is
provided with a dataset in which the correct answers, or labels, are already known. The goal
is for the algorithm to learn the mapping or relationship between input data and the
corresponding output labels. Once trained, the algorithm can make predictions or classify
new, unseen data accurately.

Common Types:
• Linear Regression: Assumes a linear relationship between the features and the target
variable. It's a simple and interpretable algorithm that fits a straight line to the data.
This is like drawing a straight line through data points to make predictions. It's used
when you want to predict a numeric value, like predicting the price of a house based
on its size.
• Multiple Regression: Extends linear regression to multiple input features.
• Logistic Regression: Although it has "regression" in its name, it's used for binary
classification problems, not regression. The goal is to predict a binary outcome,
typically "0" or "1" (e.g., yes/no, true/false, spam/not spam), based on input
features. It models the probability of an input belonging to a particular class, so
it's more accurate to think of it as a classification algorithm: it deals with
discrete class labels, not continuous numeric values as traditional regression does.
It's a fundamental and widely used algorithm in machine learning for tasks such as
spam detection, medical diagnosis, and more (a small sketch follows below).
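
To make the classification behavior concrete, here is a minimal sketch using scikit-learn. The feature and the tiny spam/not-spam dataset are invented purely for illustration:

```python
# A small sketch of logistic regression as a binary classifier; the data
# below is made up for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature: number of suspicious words in an email; label: 1 = spam, 0 = not spam
X = np.array([[0], [1], [2], [5], [6], [8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# The model outputs a probability of belonging to each class, then a label
print(clf.predict_proba([[4]]))  # [[P(not spam), P(spam)]]
print(clf.predict([[4]]))        # a discrete class label, not a numeric value
```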

Can regression be used with values that have been converted from categorical to numeric?

Yes, but it's important to understand the implications and limitations of doing so.
When you convert categorical variables to numeric values, you typically create dummy
variables or use encoding techniques to represent the categories as numbers. Here are two
common approaches:

1. One-Hot Encoding: This method creates binary (0 or 1) dummy variables for each
category in the original categorical variable. Each dummy variable represents the
presence or absence of a category. For example, if you have a "Color" variable with
categories "Red," "Blue," and "Green," you would create three binary columns:
"IsRed," "IsBlue," and "IsGreen."
2. Label Encoding: Label encoding assigns a unique numeric value to each category. For
example, "Red" might be encoded as 1, "Blue" as 2, and "Green" as 3.

Key considerations when using linear regression with such numeric representations of
categorical variables:

1. Loss of Information: Converting categorical variables to numeric values may result in
a loss of information. Linear regression assumes a linear relationship between
variables, which may not hold if the numeric values assigned to categories are
arbitrary.
2. Interpretation: Interpreting the coefficients of dummy variables in linear regression
can be challenging. For one-hot encoding, the coefficient of each dummy variable
represents the change in the dependent variable associated with that category
compared to the reference category (usually the omitted category). For label encoding,
interpreting coefficients can be even more challenging because the numeric values
may not have meaningful units or orders.
3. Assumptions: Linear regression assumes certain statistical properties of the data, such
as linearity, independence of errors, and constant variance of errors. These
assumptions may not hold when categorical variables are converted to numeric form,
especially with label encoding.
4. Alternative Models: Depending on the nature of your data and the relationships
between variables, other models like logistic regression (for classification tasks) or
decision trees/random forests may be more suitable when dealing with categorical
variables.

Linear Regression

The representation is a linear equation that combines a specific set of input values (x), the solution to which is the predicted output (y) for that set of input values. As such, both the input (x) and the output (y) values are numeric.

In a simple linear regression model, the equation takes the form:


Y = β0 + β1X

• Y is the predicted or dependent variable.
• β0 is the intercept or constant term, representing the expected value of Y when X is zero.
• β1 is the coefficient of the independent variable X, representing the change in Y for a one-unit change in X.
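
The two coefficients have simple closed-form least-squares formulas. A minimal NumPy sketch, with invented data points chosen so the fit is easy to eyeball:

```python
# Fitting Y = b0 + b1*X by ordinary least squares using the standard
# closed-form formulas; the five data points are invented for illustration.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Slope: covariance of X and Y divided by the variance of X
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
# Intercept: the line passes through the point of means
b0 = Y.mean() - b1 * X.mean()

print(f"Y = {b0:.2f} + {b1:.2f} * X")  # roughly Y = 0.30 + 1.94 * X here
```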

Predictive Analytics using MS Excel

You can perform basic predictive analytics using Microsoft Excel, although it has limitations
compared to dedicated data science and analytics tools.

Scenario #1: Suppose you want to predict a student's final exam score based on the number
of hours they studied. (See sample dataset. Please note that this is a small and synthetic
dataset for illustration purposes. In real-world applications, you would typically work with
larger and more diverse datasets to build and evaluate your regression models.)

Step 1: Prepare Your Data
1. Open your dataset in Excel.
Step 2: Create a Scatterplot
2. Highlight the data in both columns.
3. Go to the "Insert" tab and select "Scatter" under the "Charts" group. Choose "Scatter". You should now see a scatterplot of your data points.
Step 3: Add the Trendline
4. Right-click on one of the data points in the chart and select "Add Trendline."
5. In the "Format Trendline" pane that appears on the right, check the box for "Linear." You'll see the linear trendline added to your scatterplot, and Excel will also display the equation for the line and the R-squared value, which indicates how well the line fits the data.
Step 4: Interpret the Results
6. The equation of the line is in the form "y = mx + b," where "y" is the predicted final exam score, "x" is the hours studied, "m" is the slope (coefficient), and "b" is the intercept. For example, your equation might be "Final Exam Score = 7.5 * Hours Studied + 45," which means that for every one-unit change in X (hours studied), Y changes by 7.5.

7. You can use this equation to make predictions. For instance, if a student studies for 7
hours, you can plug this value into the equation to estimate their final exam score:
"Final Exam Score = 7.5 * 7 + 45 = 97.5."
8. The R-squared (R²) value, also known as the coefficient of determination, is a statistical
measure that assesses the goodness of fit of a regression model. It quantifies the
proportion of the variance in the dependent variable (target) that is explained by the
independent variables (features) in the model. In other words, R-squared tells you how
well the model fits the data.

R-squared values range from 0 to 1, where 0 indicates that the model does not explain
any of the variance in the dependent variable, and 1 indicates that the model perfectly
explains all the variance. The closer R² is to 1, the better the model explains the data.
Limitations: R-squared does not tell you whether the coefficients of the independent
variables are statistically significant or whether the model is appropriate for making
accurate predictions. It also doesn't provide information about the direction or
strength of relationships between variables.
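
To see what Excel's trendline computes behind the scenes, here is a minimal NumPy sketch of the same fit and R-squared calculation. The hours/score values are invented (chosen so the fitted line lands near the example equation above) and are not the handout's sample dataset:

```python
# Reproducing a linear trendline fit and R-squared outside Excel with NumPy;
# the data points below are invented for illustration.
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6])
score = np.array([52, 61, 67, 75, 83, 90])

m, b = np.polyfit(hours, score, 1)   # slope (m) and intercept (b) of the line
pred = m * hours + b

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((score - pred) ** 2)
ss_tot = np.sum((score - score.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"Score = {m:.2f} * Hours + {b:.2f},  R² = {r2:.3f}")
# prints approximately: Score = 7.54 * Hours + 44.93,  R² = 0.998
```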

Multiple Regression

The multiple regression model aims to find the best linear relationship between the
independent variables and the dependent variable.

In multiple regression, the equation takes the form:

y = B0 + B1x1 + B2x2 + … + Bnxn

where B0 is the intercept (constant term); B1, B2, …, Bn are the coefficients for each
independent variable; x1, x2, …, xn are the inputs (values of the independent variables);
and y is the output (target or dependent) variable.

Scenario #2: There is a fast-food chain company in the Philippines. The company wants to
predict the sales through advertisements by considering the following factors – TV, Radio,
Newspaper. (See sample dataset. Please note that this is a small and synthetic dataset for
illustration purposes. In real-world applications, you would typically work with larger and
more diverse datasets to build and evaluate your regression models.)
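
Before reproducing this in Excel, here is a minimal scikit-learn sketch of the same multiple regression fit. The advertising figures are invented stand-ins for the TV/Radio/Newspaper dataset, not the handout's actual sample:

```python
# Multiple regression on synthetic advertising data (values invented).
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: TV, Radio, Newspaper ad spend; target: Sales
X = np.array([[230, 38, 69], [44, 39, 45], [17, 46, 69], [151, 41, 58],
              [180, 11, 58], [8, 2, 1], [57, 33, 23]])
y = np.array([22.1, 10.4, 9.3, 18.5, 12.9, 4.8, 11.8])

model = LinearRegression().fit(X, y)
print("Intercept (B0):", model.intercept_)
print("Coefficients (B1, B2, B3):", model.coef_)

# Predict sales for a new advertising mix of TV=100, Radio=30, Newspaper=20
print("Predicted sales:", model.predict([[100, 30, 20]]))
```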

To perform a multiple regression in Excel, we first need to enable Excel's Analysis ToolPak add-in. The Analysis ToolPak is an add-in program that provides data analysis tools for statistical and engineering analysis.

To add it to your workbook, follow these steps.

Step 1 – Excel Options
Go to File -> Options.
Step 2 – Locate the Analysis ToolPak
Go to Add-ins on the left panel -> Manage Excel Add-ins -> Go, then check "Analysis ToolPak" in the dialog that appears and click OK.
You have successfully added the Analysis ToolPak in Excel! You can check it by going to the Data tab in the Ribbon.

Let’s start building our predictive model in Excel!

Step 3 – Go to Data -> Data Analysis
Click Data Analysis in the Data tab, select Regression, and press OK.

Step 4 – Select Options
Select the options necessary for our analysis, such as:
• Input Y Range – the range of the dependent variable (Sales)
• Input X Range – the range of the independent variables (Columns C:E)
• Output Range – the range of cells where you want to display the results (select anywhere in the sheet)
Check Residuals, then click OK.

Analyzing our Predictive Model’s Results in Excel

In the summary, we have four types of output:


• Regression statistics table

The regression statistics table tells us how well the line of best fit defines the linear
relationship between the independent and dependent variables. Two of the most
important measures are the R squared and Adjusted R squared values.
The R-squared statistic is the indicator of goodness of fit, which tells us how much of
the variance is explained by the line of best fit. The R-squared value ranges from 0 to 1.
If the R-squared value is 0.953, our line is able to explain 95.3% of the variance, a good
sign: 95.3% of the variation in Sales is explained by the independent variables. The
closer to 1, the better the regression line fits the data.
But there is a problem: as we keep adding more variables, the R-squared value will keep
increasing even if a variable has no real effect. Adjusted R-squared corrects for this
and is a much more reliable metric.

• ANOVA table

This table breaks down the sum of squares into its components to give details of
variability within the model.

It includes a very important metric, Significance F (the p-value of the overall F-test),
which tells us whether the model is statistically significant. We use this to check if
the results are reliable (statistically significant).
In a nutshell, a significant result means that our results are likely not due to
randomness but to an underlying relationship. The most commonly used threshold for the
p-value is 0.05. If the value is less than this, we are good to go; otherwise, we would
need to choose another set of independent variables.

If Significance F is greater than 0.05, it's probably better to stop using this set of
independent variables. Delete a variable with a high p-value (greater than 0.05) and
rerun the regression until Significance F drops below 0.05.

• Regression coefficients table

The Coefficient table breaks down the components of the regression line in the form
of coefficients. We can understand a lot from these.
A positive coefficient means that for each unit increase in the attribute, the response
variable increases by ___ units (check coefficient).

A negative coefficient means that for each unit increase in the attribute, the response
variable decreases by ___ units (check coefficient).

• Residual Table

The residual table reflects how much each predicted value differs from the actual value.
It consists of the values predicted by our model and the corresponding residuals (actual
minus predicted).
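
Outside Excel, the whole summary can be reproduced with the statsmodels library (an assumption on our part; the handout itself uses only Excel). This sketch refits the synthetic advertising data from the earlier sketch and prints one report containing R-squared, adjusted R-squared, the F-statistic with its p-value (Excel's "Significance F"), the coefficient table with p-values, and the residuals:

```python
# Reproducing Excel's regression summary with statsmodels on the same
# synthetic advertising data (values invented for illustration).
import numpy as np
import statsmodels.api as sm

X = np.array([[230, 38, 69], [44, 39, 45], [17, 46, 69], [151, 41, 58],
              [180, 11, 58], [8, 2, 1], [57, 33, 23]], dtype=float)
y = np.array([22.1, 10.4, 9.3, 18.5, 12.9, 4.8, 11.8])

X_const = sm.add_constant(X)        # adds the intercept term B0
results = sm.OLS(y, X_const).fit()

print(results.summary())            # R², adjusted R², F p-value, coefficients
print(results.resid)                # residuals: actual minus predicted
```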

How can we Improve our Model?


If the p-value for a variable is more than 0.05, you may remove that variable from
your analysis. Follow all the steps mentioned above, but do not include the variable
that exceeds the 0.05 threshold. (Check the regression statistics to see whether the
value of adjusted R-squared has improved.)

Then write out the regression model in its equation form, y = B0 + B1x1 + … + Bnxn,
using the coefficients from the output.
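
The same refinement loop is a few lines with statsmodels. This sketch continues the synthetic advertising data from above and pretends, for illustration, that Newspaper turned out to have a p-value above 0.05:

```python
# Drop a high-p-value variable, refit, and compare adjusted R-squared
# (synthetic data; the choice of Newspaper as the weak variable is assumed).
import numpy as np
import statsmodels.api as sm

X = np.array([[230, 38, 69], [44, 39, 45], [17, 46, 69], [151, 41, 58],
              [180, 11, 58], [8, 2, 1], [57, 33, 23]], dtype=float)
y = np.array([22.1, 10.4, 9.3, 18.5, 12.9, 4.8, 11.8])

full = sm.OLS(y, sm.add_constant(X)).fit()

# Suppose the Newspaper column (index 2) has a p-value above 0.05:
reduced = sm.OLS(y, sm.add_constant(np.delete(X, 2, axis=1))).fit()

print("Adjusted R² (full):   ", full.rsquared_adj)
print("Adjusted R² (reduced):", reduced.rsquared_adj)
# If the reduced model's adjusted R-squared is higher, keep the simpler
# model and write out its equation: Sales = B0 + B1*TV + B2*Radio
```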
