DADS302 Unit-05
DADS302
EXPLORATORY DATA ANALYSIS
Unit 5
Predictive Analysis
Table of Contents

SL No | Topic | SAQ/Activity No | Fig/Table/Graph No | Page No
1 | Introduction | - | - | 3–4
  | 1.1 Objectives | - | - | -
2 | Working | 1 | - | 5–7
3 | What Justifies The Use Of Predictive Analysis | - | - | 8
4 | Model Types | 2 | - | 9–11
5 | Applications | - | - | 12–13
6 | Tools For Predictive Analysis | 3 | - | 14
7 | Regression | 4, 5 | 1 | 15–19
8 | Regression Line Fitting – Python Code | 6–16 | - | 20–26
9 | Activity | - | - | 27
10 | Summary | - | - | 27
11 | Glossary | - | - | 28
12 | Concept Map | 17 | - | 28
13 | Study Notes & Did You Know | - | - | 28–29
14 | Case Study | 18–21 | - | 29–32
15 | Terminal Questions | - | - | 32
16 | Self-Assessment Answers | - | - | 32
17 | Terminal Questions Answers | - | - | 32–36
18 | References | - | - | 36
1. INTRODUCTION
➔ For instance, data mining analyses big data sets to find patterns in them. Text analysis does the same, but for passages of text rather than structured data sets.
➔ With predictive analytics, data patterns from the past and present are examined to see whether they are likely to recur. Predictive analysis can also increase operational savings and reduce risk.
➔ Predictive analytics is a combination of several statistical technologies and approaches, encompassing data mining, predictive modelling, and machine learning. To anticipate the future, this process thoroughly examines both historical and current data. The goal of predictive analytics is not merely to understand the past, but also to make a more accurate assessment of what might occur in the future.
➔ The objective of predictive analytics, then, is to identify patterns and trends in the past and present in order to make predictions about the future. Some companies have created proprietary, allegedly industry-specific, predictive analytics solutions; other firms rely on outside solutions to become proficient in predictive analytics.
In either situation, there is a methodical, seven-step flow that you may use to carry out predictive analytics projects.
1.1 Objectives
2. WORKING
Fig 1: Working
1. Initiative Definition
Understanding the main goal of carrying out a predictive analytics project is a crucial
component of project definition. Clarity is required when answering queries like, "What are
you trying to model?" Organizations will be able to acquire the correct value-driver from this
endeavour by seeking answers to these questions.
2. Collection of Data
Data is the only thing predictive analytics needs in order to function effectively. To execute the appropriate algorithms, however, you require a sizable amount of data for slicing and dicing. There shouldn't be a problem gathering data if a documented method for doing so already exists. Firms without such systems, on the other hand, must first set up a data aggregation tool to help them gather raw data from various sources.
3. Cleaning of Data
It is crucial for the company to clean and sanitise the data before the investigation begins. As part of the cleaning process, data from many sources must be combined into a single, complete database with consistent formatting. This helps guarantee that the analysis using the predictive analytics technology is effective and delivers the correct value-driver.
5. Building a Model
Organizations can begin developing a predictive model for forecasting future events
once the cleaned data has been thoroughly reviewed. The software programme will generate
a number of models; therefore, the organisation must choose the best (in terms of accuracy)
to forecast occurrences.
6. Deployment
The chosen model must then be put into use on a daily basis after the models have
been developed and refined. Daily use is connected to the project definition from step 1 once
more. For instance, if the model is used to forecast data security events by examining
computer event logs, it must be used to monitor ongoing operations and produce reports of
potential security gaps in order to prevent security breaches. Because of the model chosen,
businesses may be able to resolve some challenges proactively.
7. Monitoring
After the models are put into use, it is crucial to continuously monitor them and make any
necessary course modifications. Unchecked over-reliance on this strategy has the potential
to be disastrous.
3. WHAT JUSTIFIES THE USE OF PREDICTIVE ANALYSIS

1. Finding Fraud
Combining different models can aid in the identification of any suspicious tendencies
and the prevention of illegal activity. High-performance predictive analytics solutions may
examine the network in real-time to look for any fraud-causing irregularities as
cybersecurity concerns continue to develop.
3. Optimization of Operations
Inventory forecasting and resource management are other examples that illustrate its importance. Airlines, for instance, use predictive analysis to determine the cost of their tickets, and hotels can forecast their expected number of customers in order to maximise occupancy, which increases revenue.
Key Lessons:
1. Predictive analytics forecasts future performance using statistics and modelling strategies.
2. Predictive approaches are used in fields and industries like marketing and insurance to
make crucial choices.
3. Predictive models support the creation of investment portfolios, video game development,
voice-to-text message translation, and customer service judgments.
4. Despite the fact that machine learning and predictive analytics are two distinct fields,
people frequently mix them up.
5. Decision trees, regression, and neural networks are a few examples of predictive
modelling types.
4. MODEL TYPES
Decision Tree
• Decision trees may be helpful if you want to understand how someone makes
decisions. This kind of model divides the data into various groups according to
various factors, such as price or market capitalization. It resembles a tree, complete
with individual branches and leaves, as its name suggests. Branches represent the
options, while individual leaves stand for a specific choice.
• The simplest models are decision trees because they are the most straightforward to
comprehend and analyse. When you have to make a decision quickly, they are also
incredibly helpful.
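The unit does not prescribe a library for building decision trees, but as a minimal sketch, scikit-learn's DecisionTreeClassifier can fit one in a few lines. The tiny price/market-capitalisation dataset below is invented purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [price, market capitalisation] -> decision (1 = buy, 0 = skip)
X = np.array([[10, 200], [12, 250], [11, 220],
              [90, 900], [95, 1100], [88, 950]])
y = np.array([1, 1, 1, 0, 0, 0])

# A shallow tree: each branch is a yes/no question about one factor,
# and each leaf holds the final decision
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print(tree.predict([[15, 240]])[0])  # a low-price, low-cap case -> 1
```

Because the classes here are perfectly separated by either factor, a shallow tree suffices, which mirrors the point above: decision trees are simple to fit and to read.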
Regression
Regression is the most often used model in statistical analysis. Use it when there is a linear relationship between the inputs and you want to find patterns in vast data sets. This method determines the formula describing the relationship between all the inputs in the dataset. Regression can be used, for instance, to determine how a security's performance may be influenced by its price and other important variables.
Neural Networks
As a sort of predictive analytics, neural networks were created by modelling the functioning
of the human brain. Using artificial intelligence and pattern recognition, this model is capable
of handling complex data interactions. Use it if you need to overcome a number of obstacles,
such as when you have an excessive amount of data available, when you lack the necessary
formula to help you identify a relationship between the inputs and outputs in your dataset,
or when you need to make predictions rather than provide an explanation.
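As a hedged sketch of the neural-network case described above (the unit names no library, so scikit-learn's MLPRegressor stands in here, and the synthetic sine data is an assumption), a small network can learn an input-output relationship for which no formula is given to the model:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic data with a nonlinear input-output relationship and no obvious formula
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel()

# Two hidden layers of 32 neurons each; the sizes and iteration count are arbitrary choices
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
net.fit(X, y)

print(net.score(X, y))  # R² on the training data
```

Note that the network predicts well without ever being given the sine formula, which is exactly the situation the paragraph above describes.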
● The use of predictive analysis has many advantages. Applying this kind of analysis can be helpful to entities when they need to make predictions about outcomes for which no alternative (and obvious) solutions are available.
● Models can be used by investors, financial experts, and company executives to help lower risk. For instance, by taking certain criteria into account, such as age, capital, and aspirations, an investor and their advisor can use certain models to help create an investment portfolio with minimal risk to the investor.
● Using models also has a significant impact on cost savings. Businesses can predict a product's chance of success or failure prior to its release, or set aside funds ahead of the manufacturing process by employing predictive strategies for production enhancements.
Challenges
● Due to perceived disparities in its results, the use of predictive analytics has been questioned and, in some circumstances, legally limited. Most frequently this involves predictive models that result in statistical discrimination against racial or ethnic groups.
5. APPLICATIONS
Customer inflow and outflow in the entertainment and hospitality industries depend on a
number of variables, all of which affect how many employees a venue or hotel needs at any
given time. Understaffing could lead to a poor customer experience, overworked personnel,
and expensive errors while overstaffing costs money.
A team created a multivariate regression model that took several characteristics into account to forecast the number of hotel check-ins on a given day. Thanks to this technique, Caesars was able to staff its hotels and casinos to the best of its capacity and prevent overstaffing.
Predictive analytics is the process of looking at past behavioural data and applying it to make future predictions.
• In marketing, predictive analytics can be used to anticipate seasonal sales trends and
organise promotions accordingly.
• When the conditions for an impending malfunction are satisfied, the algorithm is activated to notify a worker who can stop the device, potentially saving the business thousands, if not millions, of dollars in lost revenue from damaged goods and repair expenses. Instead of forecasting malfunction scenarios months or years in advance, this approach provides real-time predictions.
Some algorithms even suggest solutions and improvements to prevent future issues and boost productivity, saving resources like time, money, and effort. This is an illustration of prescriptive analytics, which is frequently used in conjunction with other forms of analytics to address problems.
7. REGRESSION
• Regression looks for connections between different variables. You may, for instance,
watch multiple workers at one company to see how their pay varies depending on factors
like experience, education, role, location, and so on.
To put it another way, you need to find a function that accurately maps some traits or
variables to others.
Linear Regression
One of the most significant and often employed regression techniques is likely linear
regression. It's one of the easiest regression techniques. The results are simple to interpret,
which is one of its key benefits.
Performance of Regression
• The actual responses yᵢ, i = 1, …, n, vary partly because of their dependence on the predictors xᵢ. However, the output also has an additional intrinsic variance.
• SSR = 0 is equivalent to R² = 1: the predicted and actual responses fit each other perfectly, so the model fits the data perfectly.
Fig 4: Regression
Underfitting
When a model, typically as a result of its inherent simplicity, is unable to effectively represent the dependencies among data, underfitting occurs. Such a model frequently produces a low R² with known data and generalises poorly when used with new data.
Overfitting
When a model picks up on both random fluctuations and data dependencies, overfitting
results. In other words, a model becomes too adept at learning from the available data.
Overfitting frequently occurs in complex models, which have numerous features or terms.
When used with known data, these models typically produce high R2. However, when
applied to fresh data, they frequently don't generalise well and have considerably lower R2.
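The contrast between underfitting and overfitting can be sketched in a few lines. The data below is synthetic (an assumed linear relationship plus noise); a degree-9 polynomial is used only to force overfitting on a small sample:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

def true_response(x):
    return 2 * x.ravel() + 1  # the underlying (linear) relationship

x_train = np.linspace(0, 1, 12).reshape(-1, 1)       # known data
x_test = np.linspace(0.02, 0.98, 50).reshape(-1, 1)  # fresh data
y_train = true_response(x_train) + rng.normal(0, 0.3, 12)
y_test = true_response(x_test) + rng.normal(0, 0.3, 50)

for degree in (1, 9):
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    r2_known = model.score(poly.transform(x_train), y_train)  # R² on known data
    r2_fresh = model.score(poly.transform(x_test), y_test)    # R² on fresh data
    print(f"degree {degree}: known R2 = {r2_known:.2f}, fresh R2 = {r2_fresh:.2f}")
```

The complex degree-9 model scores higher on the known data but drops considerably on the fresh data, which is the overfitting pattern described above.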
• A basic scientific Python library called NumPy enables a wide range of high-
performance operations on both single-dimensional and multidimensional arrays.
Numerous mathematical operations are also provided. Naturally, it is open-source.
• A popular Python machine learning library called scikit-learn was created on top of
NumPy and a few additional libraries. It offers the tools for data preprocessing,
dimensionality reduction, regression implementation, classification, clustering, and
more. Scikit-Learn is also open-source, just like NumPy.
Self-Assessment Questions - 1
8. REGRESSION LINE FITTING – PYTHON CODE

For the vast majority of regression methodologies and implementations, these steps are more or less generic. You'll discover how to carry out these steps for a variety of situations in the remaining sections of this unit.
The class LinearRegression from sklearn.linear_model and the package numpy must first be imported.
The numpy.ndarray array type is the base data type for NumPy. The remainder of this unit refers to instances of the numpy.ndarray class as arrays.
To execute linear and polynomial regression and provide appropriate predictions, use
LinearRegression.
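The figure that showed this step is not reproduced here; the imports it described are presumably:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
```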
Fig 7: Data
The second step is defining data to work with. The inputs (regressors, 𝑥) and output
(response, 𝑦) should be arrays or similar objects. This is the simplest way of providing data
for regression:
The input (x) and the output (y) are now both arrays. The input array must be two-dimensional; more precisely, it must have one column and as many rows as required. As a result, you should call the .reshape() method on x, and that is exactly what the .reshape() argument (-1, 1) specifies.
Fig 8: arrays
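The data figure is not reproduced here. As a sketch, one set of sample arrays consistent with the coefficient values quoted later in this unit (b0 ≈ 5.63, b1 = 0.54) is:

```python
import numpy as np

# Hypothetical sample data (chosen to be consistent with b0 ≈ 5.63 and b1 = 0.54)
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))  # two-dimensional: one column, six rows
y = np.array([5, 20, 14, 32, 22, 38])                   # one-dimensional output

print(x.shape)  # (6, 1)
print(y.shape)  # (6,)
```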
Step 3: Create the model and fit it
The following step is to build a linear regression model and fit the data to the model.
Create a LinearRegression class instance to represent the regression model:
The model described above uses the default settings for all parameters.
Now is the time to use the model. You must first call .fit() on it:
Using the current input and output, x and y, as the arguments, .fit() computes the ideal values of the weights b0 and b1. In other words, .fit() fits the model. It returns self, the model itself, which is why you can replace the previous two statements with the following one:
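The code figure for this step is missing; a sketch of both forms, using the assumed sample data from the earlier step, would be:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))  # assumed sample inputs
y = np.array([5, 20, 14, 32, 22, 38])                   # assumed sample outputs

model = LinearRegression()  # every parameter left at its default value
model.fit(x, y)             # computes the optimal weights b0 and b1

# .fit() returns the model itself, so the two statements above collapse into one:
model = LinearRegression().fit(x, y)
```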
.score() applied to the model returns the coefficient of determination, R²:
The model's attributes are .intercept_, which represents the coefficient b0, and .coef_, which represents b1:
b0 has a value of around 5.63. This means that when x is zero, your model predicts a response of 5.63. According to the value b1 = 0.54, the predicted response increases by 0.54 when x is increased by one.
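A hedged sketch of this step follows; the six-point sample data is an assumption, chosen so that the fitted coefficients match the b0 ≈ 5.63 and b1 = 0.54 values quoted above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))  # assumed sample data
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)

print(round(model.score(x, y), 4))  # coefficient of determination R² -> 0.7159
print(round(model.intercept_, 2))   # b0 -> 5.63
print(model.coef_)                  # b1 -> [0.54]
```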
You can also provide y as a two-dimensional array. You'll receive a similar outcome in this situation. Here's how it might appear:
This example, as you can see, is quite similar to the previous one, except that in this case .intercept_ is a one-dimensional array with the single element b0, and .coef_ is a two-dimensional array with the single element b1.
In this instance, you would multiply each component of x by model.coef_ and then add
model.intercept_ to the result.
Only the output's dimensions are different from the previous example. The predicted
response has changed from having a single dimension to a two-dimensional array.
Both approaches give the same outcome if you reduce the number of dimensions in x to one. To achieve this, substitute x.reshape(-1), x.flatten(), or x.ravel() for x when multiplying it by model.coef_.
Regression models are frequently used in practice to make forecasts, so you can calculate the outputs for fresh inputs using fitted models:
Here, the new response, y_new, is produced by applying .predict() to the new regressor, x_new. This example uses NumPy's arange() function to create an array with the entries from 0, inclusive, up to but excluding 5, that is, 0, 1, 2, 3, and 4.
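The prediction step can be sketched as follows, again reusing the assumed six-point sample data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))  # assumed sample data
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)

x_new = np.arange(5).reshape((-1, 1))  # the entries 0, 1, 2, 3, 4 as one column
y_new = model.predict(x_new)
print(y_new)  # ≈ [5.63 6.17 6.71 7.25 7.79]
```

Each predicted value is simply model.intercept_ plus model.coef_ times the new input, stepping up by 0.54 per unit of x.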
9. ACTIVITY
ACTIVITY A
In the past ten years, corporations have increasingly used predictive modelling. Businesses use it to find problems and opportunities, anticipate customer behaviour and trends, and improve decision-making. One of the frequent use cases that highlights the significance of predictive modelling in machine learning is fraud detection. Real-time remote analytic capabilities can enhance fraud detection scenarios and boost security effectiveness.
……………………………………………………………………………………………………………………
The objective is to merge various data sets, identify anomalies, and stop illegal activity.
10. SUMMARY
A type of technology called predictive analytics generates forecasts regarding some future
unknowns. It uses a variety of methodologies, including artificial intelligence (AI), data
mining, machine learning, modelling, and statistics to arrive at these conclusions. Data
mining, predictive modelling, and machine learning are all included in predictive analytics,
which is a combination of several statistical technologies and approaches. In order to
anticipate the future, this process thoroughly examines both historical and current data.
Predictive analytics is valued by businesses for a variety of reasons, including its ability to solve challenging problems and identify new opportunities. The simplest type of linear regression is simple, or single-variate, linear regression, since only one independent variable, x = x₁, is involved.
11. GLOSSARY
Data Mining - In data mining, large data sets are sorted through in order to find patterns and relationships that may be used in data analysis to help solve business challenges.
Decision Tree - A decision tree is a graph that uses the branching approach to show each potential result for a certain input.
13. STUDY NOTES & DID YOU KNOW

Big data is frequently mentioned in relation to predictive analytics. For instance, engineering data is gathered through global sensors, instruments, and networked systems. Transactional data, sales figures, client complaints, and marketing data are examples of business system data that can be found at a corporation. On the basis of this rich source of data, organisations make data-driven decisions more frequently.
90% of the data in the global datasphere is replicated, and only 10% is original. According to one of the articles posted on CIO, between 80 and 90% of the data in the global digital realm is unstructured. To download all the data from the internet today, a user would need 181 million years.
14. CASE STUDY

Let's begin by importing the dataset and the required Python libraries:
There are 10 attributes in the dataset, and each row represents a district. Let's now employ the info() method to quickly describe the data, including the total number of rows, the type of each attribute, and the number of non-null values.
The dataset has 20,640 instances. Because there are only 20,433 non-null values for the total_bedrooms attribute, 207 districts are missing that value. We will have to address that later.
With the exception of the ocean_proximity attribute, all attributes are numeric. Since its type is object, it could hold any kind of Python object. Using the value_counts() method, you can determine which categories are present in that column and how many districts fall into each category:
housing.ocean_proximity.value_counts()
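Since the housing CSV itself is not included in this unit, the sketch below uses a tiny hypothetical stand-in DataFrame just to show what info() and value_counts() report (the column names mirror the real dataset; the values are invented):

```python
import pandas as pd

# A tiny hypothetical stand-in for the housing data (the real CSV is not included here)
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 3.8, None, 2.1],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "ISLAND"],
})

housing.info()  # row count, dtype of each column, and non-null counts
print(housing["ocean_proximity"].value_counts())
```

On the real dataset, info() is exactly how the 20,433 non-null total_bedrooms count mentioned above would be spotted.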
Plotting a histogram for each numeric attribute is another easy way to determine the type of data you're working with.
Fig 20: Visualization
Most median income values cluster between 1.5 and 6, but some extend well beyond 6, so let's take a deeper look at the histogram of median income.
It is crucial to have enough instances of each stratum in your dataset; otherwise, the
estimation of a stratum's value may be skewed. This means that each layer should be large
enough and that there shouldn't be too many strata.
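Strata are typically built by binning the income values into a small number of categories. As a hedged sketch (the income values below are hypothetical and the bin edges are an assumed, commonly used choice for this dataset), pd.cut does the bucketing:

```python
import numpy as np
import pandas as pd

# Hypothetical median-income values; the bin edges are an assumed choice
income = pd.Series([1.2, 2.4, 3.1, 3.9, 4.8, 5.5, 6.3, 9.8])
income_cat = pd.cut(income,
                    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                    labels=[1, 2, 3, 4, 5])

# Each label is a stratum; the counts show whether every stratum is well populated
print(income_cat.value_counts().sort_index())
```

Keeping the number of strata small, five here, is what ensures each stratum stays large enough for an unbiased estimate.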
1. What can
2. Numpy and Pandas
• Neural Networks
2. Define Regression.
A statistical method called regression links a dependent variable to one or more
independent (explanatory) variables. A regression model can demonstrate whether changes
in one or more of the explanatory variables are related to changes in the dependent variable.
➔ Detection of Fraud:
As concerns about cybersecurity rise, there are numerous examples of predictive analytics, and fraud detection is the most crucial. These models can notice odd behaviour and identify system anomalies to pinpoint hazards.
For instance, experts can provide the system with previous data regarding cyberthreats and risks. The appropriate staff will receive a notification when the predictive analytics programme spots anything similar, limiting hackers' access and closing gaps that might put the system in danger.
➔ Medical Diagnosis:
The healthcare industry benefits the most from predictive analysis. Health information is crucial to fully understand the medical past and present of any patient, and predictive analytics models contribute to the knowledge of the problem by providing an accurate diagnosis based on historical data.
Using certain health metrics, predictive analytics helps practitioners determine the underlying causes of diseases. As a result, they have rapid analytics, which enables them to start treatment right away. The spread of negative health effects can be halted using predictive analytics algorithms.
➔ Content Suggestion:
One of the most accessible and apparent predictive analytics examples is content suggestion.
Entertainment firms can forecast what viewers will watch based on their past behaviour
through algorithms and models.
Which businesses employ predictive analytics, you ask? The most obvious answer is Netflix. The entertainment company uses predictive algorithms to recommend material to customers based on genre, keywords, ratings, and other factors. The intelligent system predicts user behaviour using highly sophisticated analytics.
18. REFERENCES
• https://www.techfunnel.com/hr-tech/predictive-analytics/
• https://www.ibm.com/analytics/predictive-analytics
• https://www.investopedia.com/terms/p/predictive-analytics.asp
• https://www.geeksforgeeks.org/linear-regression-python-implementation/
• https://thecleverprogrammer.com/2020/12/29/house-price-prediction-with-python/
• https://blogs.sap.com/2021/07/09/7-real-world-use-cases-of-predictive-analytics/