Unit-5 DS


Introduction to Big Data Platform

Big data defined


What exactly is big data?
Big data is data that contains greater variety, arriving in increasing volumes
and with more velocity. This is also known as the three Vs.
Put simply, big data is larger, more complex data sets, especially from new data sources.
These data sets are so voluminous that traditional data processing software just can’t manage
them. But these massive volumes of data can be used to address business problems you
wouldn’t have been able to tackle before.
The three Vs of big data

Volume: The amount of data matters. With big data, you’ll have to process high volumes of
low-density, unstructured data. This can be data of unknown value, such as Twitter data
feeds, clickstreams on a web page or a mobile app, or sensor-enabled equipment. For some
organizations, this might be tens of terabytes of data. For others, it may be hundreds of
petabytes.

Velocity: Velocity is the fast rate at which data is received and (perhaps) acted on.
Normally, the highest velocity of data streams directly into memory versus being written to
disk. Some internet-enabled smart products operate in real time or near real time and will
require real-time evaluation and action.

Variety: Variety refers to the many types of data that are available. Traditional data types
were structured and fit neatly in a relational database. With the rise of big data, data comes
in new unstructured data types. Unstructured and semistructured data types, such as text,
audio, and video, require additional preprocessing to derive meaning and support metadata.

Platform
What is a big data platform?
The constant stream of information from various sources is becoming more intense[4],
especially with the advance in technology. And this is where big data platforms come in to
store and analyze the ever-increasing mass of information.

A big data platform is an integrated computing solution that combines numerous software
systems, tools, and hardware for big data management. It is a one-stop architecture that
solves all the data needs of a business regardless of the volume and size of the data at hand.
Due to their efficiency in data management, enterprises are increasingly adopting big data
platforms to gather tons of data and convert them into structured, actionable business
insights[5].

Currently, the marketplace is flooded with numerous Open source and commercially
available big data platforms. They boast different features and capabilities for use in a big
data environment.

The Five ‘V’s of Big Data


Big Data is simply a catchall term used to describe data too large and complex to store in
traditional databases. The “five ‘V’s” of Big Data are:
 Volume – The amount of data generated
 Velocity - The speed at which data is generated, collected and analyzed
 Variety - The different types of structured, semi-structured and unstructured data
 Value - The ability to turn data into useful insights
 Veracity - Trustworthiness in terms of quality and accuracy
2.Challenges of Conventional Systems

 Big data is the storage and analysis of large data sets.


 These are complex data sets that can be both structured or unstructured.
 They are so large that it is not possible to work on them with traditional analytical
tools.
 One of the major challenges of conventional systems was the uncertainty of the Data
Management Landscape.
 Big data is continuously expanding, there are new companies and technologies that
are being developed every day.
 A big challenge for companies is to find out which technology works best for them
without the introduction of new risks and problems.
 These days, organizations are realising the value they get out of big data analytics and
hence they are deploying big data tools and processes to bring more efficiency in their
work environment.

Storage
With vast amounts of data generated daily, the greatest challenge is storage (especially when
the data is in different formats) within legacy systems. Unstructured data cannot be stored in
traditional databases.

Processing

Processing big data refers to the reading, transforming, extraction, and formatting of useful
information from raw information. The input and output of information in unified formats
continue to present difficulties.

Security

Security is a big concern for organizations. Non-encrypted information is at risk of theft or


damage by cyber-criminals. Therefore, data security professionals must balance access to
data against maintaining strict security protocols.
3.Intelligent Data Analysis
Intelligent Data Analysis Definition
Intelligent Data Analysis (IDA) is an interdisciplinary study that is concerned with the
extraction of useful knowledge from data, drawing techniques from a variety of fields, such
as artificial intelligence, high-performance computing, pattern recognition, and statistics.
Data intelligence platforms and data intelligence solutions are available from data
intelligence companies such as Data Visualization Intelligence, Strategic Data Intelligence,
Global Data Intelligence.

1. What is Intelligent Data Analysis?

Intelligent data analysis refers to the use of analysis, classification, conversion, extraction,
organization, and reasoning methods to extract useful knowledge from data. This data
analytics intelligence process generally consists of the data preparation stage, the data mining
stage, and the result validation and explanation stage.

Data preparation involves the integration of required data into a dataset that will be used for
data mining; data mining involves examining large databases in order to generate new
information; result validation involves the verification of patterns produced by data mining
algorithms; and result explanation involves the intuitive communication of results.

2. What is Big Data Intelligence?

Big data intelligence involves the use of Artificial Intelligence and Machine Learning to
make big data analytics actionable and transform big data into insights, and provides
engagement capabilities for data scientists, enterprise analytics strategists, data intelligence
warehouse architects, and implementation and development experts. Enterprise data
intelligence is used in business intelligence operations, analyzing sales, evaluating
inventories, and building customer data intelligence.

4.Nature of Data

The nature of data


Data is the plural of datum, so it is always treated as plural. We can find data in all the
situations of the world around us, in structured or unstructured form, in continuous or
discrete conditions, in weather records, stock market logs, photo albums, music playlists,
or our Twitter accounts. In fact, data can be seen as the essential raw material of any kind
of human activity. According to the Oxford English Dictionary:
Data are known facts or things used as basis for inference or reckoning.
We can see data in two distinct ways: Categorical and Numerical:

Categorical data are values or observations that can be sorted into groups or categories. There
are two types of categorical values, nominal and ordinal. A nominal variable has no intrinsic
ordering to its categories. For example, housing is a categorical variable having two
categories (own and rent). An ordinal variable has an established ordering. For example, age
as a variable with three orderly categories (young, adult, and elder).
Numerical data are values or observations that can be measured. There are two kinds of
numerical values, discrete and continuous. Discrete data are values or observations that can
be counted and are distinct and separate. For example, the number of lines of code in a program. Continuous
data are values or observations that may take on any value within a finite or infinite interval.
For example, an economic time series such as historic gold prices.
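
As a small illustration of these two ways of seeing data, the pandas sketch below uses invented values: housing is nominal, age group is ordinal, line counts are discrete, and gold prices are continuous.

```python
# A minimal sketch of categorical vs. numerical data with invented values.
import pandas as pd

df = pd.DataFrame({
    "housing":    pd.Categorical(["own", "rent", "own"]),                                # nominal
    "age_group":  pd.Categorical(["young", "adult", "elder"],
                                 categories=["young", "adult", "elder"], ordered=True),  # ordinal
    "code_lines": [120, 87, 342],            # discrete numerical (countable)
    "gold_price": [1923.5, 1918.2, 1931.0],  # continuous numerical (any value in an interval)
})

print(df.dtypes)
print(df["age_group"].cat.ordered)   # True: an ordinal variable has an established ordering
```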
The kinds of datasets used in this book are as follows:
 E-mails (unstructured, discrete)
 Digital images (unstructured, discrete)
 Stock market logs (structured, continuous)
 Historic gold prices (structured, continuous)
5.Analytic Processes and Tools

How big data analytics works


Big data analytics refers to collecting, processing, cleaning, and analyzing large datasets to
help organizations operationalize their big data.

1. Collect Data
Data collection looks different for every organization. With today’s technology, organizations
can gather both structured and unstructured data from a variety of sources — from cloud
storage to mobile applications to in-store IoT sensors and beyond. Some data will be stored in
data warehouses where business intelligence tools and solutions can access it easily. Raw or
unstructured data that is too diverse or complex for a warehouse may be assigned metadata
and stored in a data lake.

2. Process Data
Once data is collected and stored, it must be organized properly to get accurate results on
analytical queries, especially when it’s large and unstructured. Available data is growing
exponentially, making data processing a challenge for organizations. One processing option is
batch processing, which looks at large data blocks over time. Batch processing is useful when
there is a longer turnaround time between collecting and analyzing data. Stream processing
looks at small batches of data at once, shortening the delay time between collection and
analysis for quicker decision-making. Stream processing is more complex and often more
expensive.
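
To make the contrast concrete, here is a minimal, tool-agnostic Python sketch (the function names are invented for illustration): the batch version computes one result after the whole block of data is available, while the streaming version emits a running result as each record arrives.

```python
# Batch vs. stream processing sketch: the same aggregation done over a stored
# block of data versus incrementally over a stream of records.

def batch_average(readings):
    """Batch processing: the whole block of data is available before computing."""
    return sum(readings) / len(readings)

def stream_average(readings):
    """Stream processing: yield a running average as each record arrives."""
    total, count = 0.0, 0
    for value in readings:
        total += value
        count += 1
        yield total / count

data = [12.0, 15.0, 11.0, 14.0]
print(batch_average(data))            # one result after the full batch
for running in stream_average(data):
    print(running)                    # an updated result after every record
```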

3. Clean Data
Data big or small requires scrubbing to improve data quality and get stronger results; all data
must be formatted correctly, and any duplicative or irrelevant data must be eliminated or
accounted for. Dirty data can obscure and mislead, creating flawed insights.
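
A minimal pandas sketch of this scrubbing step is shown below; the column names and values are hypothetical and chosen only to illustrate deduplication, format fixes, and handling of bad records.

```python
import pandas as pd

# Hypothetical raw records: a duplicate row, an unparseable spend value,
# and a missing signup date.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "signup_date": ["2023-01-05", "2023-01-06", "2023-01-06", None, "2023-01-09"],
    "spend":       ["100", "250", "250", "75", "not available"],
})

clean = (
    raw.drop_duplicates()                                   # remove duplicate rows
       .assign(
           signup_date=lambda d: pd.to_datetime(d["signup_date"], errors="coerce"),
           spend=lambda d: pd.to_numeric(d["spend"], errors="coerce"),  # bad values become NaN
       )
       .dropna(subset=["signup_date", "spend"])             # drop records that cannot be repaired
)
print(clean)
```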

4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics processes
can turn big data into big insights. Some of these big data analysis methods include:
Data mining sorts through large datasets to identify patterns and relationships by identifying
anomalies and creating data clusters.
Predictive analytics uses an organization’s historical data to make predictions about the
future, identifying upcoming risks and opportunities.
Deep learning imitates human learning patterns by using artificial intelligence and machine
learning to layer algorithms and find patterns in the most complex and abstract data.

Big data analytics tools and technology

Big data analytics cannot be narrowed down to a single tool or technology. Instead, several
types of tools work together to help you collect, process, cleanse, and analyze big data. Some
of the major players in big data ecosystems are listed below.

Hadoop is an open-source framework that efficiently stores and processes big datasets on
clusters of commodity hardware. This framework is free and can handle large amounts of
structured and unstructured data, making it a valuable mainstay for any big data operation.
NoSQL databases are non-relational data management systems that do not require a fixed
schema, making them a great option for big, raw, unstructured data. NoSQL stands for “not
only SQL,” and these databases can handle a variety of data models.
MapReduce is an essential component of the Hadoop framework, serving two functions. The
first is mapping, which filters and distributes data to the various nodes within the cluster. The
second is reducing, which organizes and reduces the results from each node to answer a query.
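
The toy Python sketch below mimics the idea of the two phases (it is not the Hadoop API itself): each input line is mapped to (word, 1) pairs, the pairs are grouped by key as the framework’s shuffle would do, and the reducer sums the counts per word.

```python
# A toy word count in the MapReduce style, written in plain Python.
from collections import defaultdict

lines = ["big data platforms", "big data analytics", "data pipelines"]

# Map: emit (key, value) pairs from every input record
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group values by key, as the framework would do between phases
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine the grouped values into one result per key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # {'big': 2, 'data': 3, 'platforms': 1, 'analytics': 1, 'pipelines': 1}
```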
YARN stands for “Yet Another Resource Negotiator.” It is another component of second-
generation Hadoop. The cluster management technology helps with job scheduling and
resource management in the cluster.
Spark is an open source cluster computing framework that uses implicit data parallelism and
fault tolerance to provide an interface for programming entire clusters. Spark can handle both
batch and stream processing for fast computation.
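
A minimal PySpark sketch is shown below, assuming the pyspark package is installed and a local Spark session can be created; it expresses the same word count against Spark’s RDD API so the work could be distributed across a cluster.

```python
# Assumes pyspark is installed and a local Spark runtime is available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

lines = spark.sparkContext.parallelize(
    ["big data platforms", "big data analytics", "data pipelines"]
)
counts = (
    lines.flatMap(lambda line: line.split())   # map each line to its words
         .map(lambda word: (word, 1))          # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # sum the counts per word
)
print(counts.collect())
spark.stop()
```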
Tableau is an end-to-end data analytics platform that allows you to prep, analyze,
collaborate, and share your big data insights. Tableau excels in self-service visual analysis,
allowing people to ask new questions of governed big data and easily share those insights
across the organization.

The big benefits of big data analytics


 The ability to analyze more data at a faster rate can provide big benefits to an
organization, allowing it to more efficiently use data to answer important questions.
Big data analytics is important because it lets organizations use colossal amounts of
data in multiple formats from multiple sources to identify opportunities and risks,
helping organizations move quickly and improve their bottom lines. Some benefits of
big data analytics include:

 Cost savings. Helping organizations identify ways to do business more efficiently


 Product development. Providing a better understanding of customer needs
 Market insights. Tracking purchase behavior and market trends

The big challenges of big data

Big data brings big benefits, but it also brings big challenges, such as new privacy and security
concerns, accessibility for business users, and choosing the right solutions for your business
needs. To capitalize on incoming data, organizations will have to address the following:
Making big data accessible: Collecting and processing data becomes more difficult as the
amount of data grows. Organizations must make data easy and convenient for data owners of
all skill levels to use.
Maintaining quality data: With so much data to maintain, organizations are spending more
time than ever before scrubbing for duplicates, errors, absences, conflicts, and
inconsistencies.
Keeping data secure: As the amount of data grows, so do privacy and security concerns.
Organizations will need to strive for compliance and put tight data processes in place before
they take advantage of big data.
Finding the right tools and platforms: New technologies for processing and analyzing big
data are developed all the time. Organizations must find the right technology to work within
their established ecosystems and address their particular needs. Often, the right solution is
also a flexible solution that can accommodate future infrastructure changes.

6.Analysis vs Reporting
Key differences between analytics vs reporting

Understanding the differences between analytics and reporting can significantly benefit your
business. If you want to use both to their full potential and not miss out on essential parts of
either one, knowing the difference between the two is important. Some key differences are:

Analytics:
 Analytics is the method of examining and analyzing summarized data to make business decisions.
 Questioning the data, understanding it, investigating it, and presenting it to the end users are all part of analytics.
 The purpose of analytics is to draw conclusions based on data.
 Analytics is used by data analysts, scientists, and business people to make effective decisions.

Reporting:
 Reporting is an action that includes all the needed information and data, put together in an organized way.
 Identifying business events, gathering the required information, organizing, summarizing, and presenting existing data are all part of reporting.
 The purpose of reporting is to organize the data into meaningful information.
 Reporting is provided to the appropriate business leaders so they can perform effectively and efficiently within a firm.

Analytics and reporting can be used to reach a number of different goals. Both of these can be
very helpful to a business if they are used correctly.

7.Modern Data Analytic Tools


The top 10 analytics tools in big data are described below.

1. APACHE Hadoop

It’s a Java-based open-source platform that is used to store and process big data. It is
built on a cluster system that allows data to be processed efficiently and in parallel, and it
can handle both structured and unstructured data across anything from a single server to
many computers. Hadoop also offers cross-platform support for its users. Today, it is one of
the most widely used big data analytics tools and is used by many tech giants such as Amazon,
Microsoft, IBM, etc.
Features of Apache Hadoop:

 Free to use and offers an efficient storage solution for businesses.


 Offers quick access via HDFS (Hadoop Distributed File System).
 Highly flexible and integrates easily with MySQL and JSON.
 Highly scalable as it can distribute a large amount of data in small segments.
 It works on low-cost commodity hardware such as JBOD (just a bunch of disks).

2. Cassandra

APACHE Cassandra is an open-source NoSQL distributed database that is used to manage
large amounts of data. It’s one of the most popular tools for data analytics and has been
praised by many tech companies due to its high scalability and availability without
compromising speed and performance. It is capable of delivering thousands of operations
every second and can handle petabytes of data with almost zero downtime. It was
created at Facebook in 2008 and later released as an open-source project.
Features of APACHE Cassandra:
 Data Storage Flexibility: It supports all forms of data i.e. structured,
unstructured, semi-structured, and allows users to change as per their needs.
 Data Distribution System: Easy to distribute data with the help of replicating
data on multiple data centers.
 Fast Processing: Cassandra has been designed to run on efficient commodity
hardware and also offers fast storage and data processing.
 Fault tolerance: If any node fails, it is replaced without any delay.

3. Qubole

It’s an open-source big data tool that helps in fetching data across the value chain using ad-hoc
analysis and machine learning. Qubole is a data lake platform that offers end-to-end service
with reduced time and effort for moving data pipelines. It is capable of
configuring multi-cloud services such as AWS, Azure, and Google Cloud. Besides, it is
claimed to lower the cost of cloud computing by up to 50%.
Features of Qubole:
 Supports ETL process: It allows companies to migrate data from multiple
sources in one place.
 Real-time Insight: It monitors users’ systems and allows them to view real-time
insights.
 Predictive Analysis: Qubole offers predictive analysis so that companies can
act accordingly and target more acquisitions.
 Advanced Security System: To protect users’ data in the cloud, Qubole uses an
advanced security system and aims to guard against future breaches.
Besides, it also allows encrypting cloud data against any potential threat.

4. Xplenty
It is a data analytics tool for building data pipelines with minimal code. It offers a
wide range of solutions for sales, marketing, and support. With the help of its interactive
graphical interface, it provides solutions for ETL, ELT, etc. The best part of using Xplenty
is its low investment in hardware and software, and it offers support via email, chat,
telephone, and virtual meetings. Xplenty is a platform to process data for analytics over
the cloud and brings all the data together.
Features of Xplenty:
 Rest API: A user can perform nearly any operation through its REST API.
 Flexibility: Data can be sent to, and pulled from, databases, warehouses, and
Salesforce.
 Data Security: It offers SSL/TLS encryption, and the platform verifies
algorithms and certificates regularly.
 Deployment: It offers integration apps for both cloud and on-premises use and
supports deploying integration apps over the cloud.

5. Spark

APACHE Spark is another framework that is used to process data and perform numerous
tasks on a large scale. It processes data across multiple computers by distributing the work
among them. It is widely used among data analysts as it offers easy-to-use APIs for pulling
data and is capable of handling multiple petabytes of data as well. Spark set a record by
processing 100 terabytes of data in just 23 minutes, breaking Hadoop’s previous world
record of 71 minutes. This is a major reason why big tech giants are moving toward Spark,
and it is highly suitable for ML and AI today.
Features of APACHE Spark:
 Ease of use: It allows users to work in their preferred language (Java, Python,
etc.).
 Real-time Processing: Spark can handle real-time streaming via Spark
Streaming.
 Flexible: It can run on Mesos, Kubernetes, or in the cloud.

6. Mongo DB

MongoDB came into the limelight in 2010. It is a free, open-source, document-oriented
(NoSQL) database that is used to store high volumes of data. It uses collections and
documents for storage; a document consists of key-value pairs, which are considered the
basic unit of MongoDB. It is popular among developers due to its availability for multiple
programming languages such as Python, JavaScript, and Ruby.
Features of Mongo DB:

 Written in C++: It’s a schema-less database and can hold a variety of documents
inside.
 Simplifies the Stack: With the help of MongoDB, a user can easily store files
without any disturbance in the stack.
 Master-Slave Replication: Data can be written to and read from the master, and
replicas can be called on for backup.
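
As a small sketch of this document model, the snippet below assumes the pymongo driver is installed and a MongoDB server is running on localhost:27017; the database, collection, and field names are made up for illustration.

```python
# Assumes pymongo is installed and MongoDB is listening on localhost:27017.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["demo_db"]
players = db["players"]                      # a collection of documents

# Documents are key-value structures and need no fixed schema
players.insert_one({"name": "A. Kumar", "points": [14, 15, 18], "team": "Blue"})
players.insert_one({"name": "B. Singh", "position": "guard"})   # different fields are fine

for doc in players.find({"team": "Blue"}):
    print(doc["name"], doc["points"])

client.close()
```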

7. Apache Storm

Apache Storm is a robust, user-friendly tool used for data analytics, especially in small
companies. The best part about Storm is that it has no programming language barrier
and can work with any of them. It was designed to handle pools of large data with fault-
tolerant and horizontally scalable methods. When we talk about real-time data
processing, Storm leads the chart because of its distributed real-time big data processing
system, which is why many tech giants today use APACHE Storm in their systems.
Some of the most notable names are Twitter, Zendesk, NaviSite, etc.
Features of Storm:

 Data Processing: Storm processes data even if a node gets disconnected.
 Highly Scalable: It maintains performance even as the load increases.
 Fast: APACHE Storm is extremely fast and can process up to 1 million
100-byte messages per second on a single node.

8. SAS

Today it is one of the best tools for statistical modeling used by data analysts. Using SAS,
a data scientist can mine, manage, extract, or update data in different variants from
different sources. The Statistical Analysis System, or SAS, allows a user to access data
in any format (SAS tables or Excel worksheets). Besides that, it also offers a cloud platform
for business analytics called SAS Viya, and to build a strong grip on AI and ML, SAS has
introduced new tools and products.
Features of SAS:
 Flexible Programming Language: It offers easy-to-learn syntax and vast
libraries, which make it suitable even for non-programmers.
 Vast Data Format Support: It supports many programming languages,
including SQL, and can read data in any format.
 Encryption: It provides end-to-end security with a feature
called SAS/SECURE.

9. Data Pine

Datapine is an analytical tool used for BI and was founded back in 2012 in Berlin, Germany.
In a short period of time, it has gained much popularity in a number of countries, and it’s
mainly used for data extraction (for small-to-medium companies fetching data for close
monitoring). With the help of its enhanced UI design, anyone can visit and check the data
as per their requirements. It is offered in 4 different price brackets, starting from $249 per
month, and dashboards are offered by function, industry, and platform.
Features of Data pine:
 Automation: To cut down on manual work, datapine offers a wide array of AI
assistants and BI tools.
 Predictive Tool: datapine provides forecasting/predictive analytics; using
historical and current data, it derives future outcomes.
 Add on: It also offers intuitive widgets, visual analytics & discovery, ad hoc
reporting, etc.

10. Rapid Miner

It’s a fully automated visual workflow design tool used for data analytics. It’s a no-code
platform, so users aren’t required to write code to segregate data. Today, it is being heavily
used in many industries such as ed-tech, training, research, etc. Though it’s an open-source
platform, it has a limitation of 10,000 data rows and a single logical processor.
With the help of RapidMiner, one can easily deploy ML models to the web or mobile
(only when the user interface is ready to collect real-time figures).
Features of Rapid Miner:
 Accessibility: It allows users to access 40+ types of files (SAS, ARFF, etc.) via

URL
 Storage: Users can access cloud storage facilities such as AWS and Dropbox
 Data validation: Rapid miner enables the visual display of multiple results in
history for better evaluation.
Conclusion:
Big data has been in the limelight for the past few years and will continue to dominate the
market in almost every sector and for every market size. The demand for big data is booming
at an enormous rate, and ample tools are available in the market today; all you need is the
right approach and to choose the best data analytics tool for the project’s requirements.

8.Statistical Concepts:
Sampling Distribution
What is sampling distribution?
A sampling distribution is the probability distribution of a statistic (such as the mean)
obtained from repeated samples drawn from a larger population. Its primary purpose is to
establish representative results from small samples of a comparatively larger population.
Since the population is too large to analyze in full, you can select smaller groups and
repeatedly sample or analyze them. The gathered data, or statistic, is used to calculate the
likely occurrence, or probability, of an event.
Using a sampling distribution simplifies the process of making inferences, or conclusions,
about large amounts of data.
Types of distributions
There are three standard types of sampling distributions in statistics:
1. Sampling distribution of mean
The most common type is the sampling distribution of the mean. It focuses on calculating the
mean of every sample group chosen from the population and plotting the data points. The
resulting graph is approximately a normal distribution whose center is the mean of the
sampling distribution, which approximates the mean of the entire population.
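
A short numpy simulation of this idea is sketched below (the population values are invented): many samples are drawn from a skewed population, the mean of each sample is recorded, and those sample means cluster around the population mean in a roughly normal shape.

```python
# Simulating the sampling distribution of the mean with an invented population.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10.0, size=10_000)   # a skewed population

sample_means = [rng.choice(population, size=50, replace=False).mean()
                for _ in range(2_000)]

print("population mean:       ", round(population.mean(), 2))
print("mean of sample means:  ", round(np.mean(sample_means), 2))
print("std. error of the mean:", round(np.std(sample_means), 2))
```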
2. Sampling distribution of proportion
This sampling distribution focuses on proportions in a population. You select samples and
calculate their proportions. The mean of the sample proportions from each group
approximates the proportion of the entire population.
3. T-distribution
A T-distribution is a sampling distribution that involves a small population or one where you
don't know much about it. It is used to estimate the mean of the population and other statistics
such as confidence intervals, statistical differences and linear regression. The T-distribution
uses a t-score to evaluate data that wouldn't be appropriate for a normal distribution.
The formula for t-score is:

t = ( x̄ – μ ) / ( s / √n )

In the formula, x̄ is the sample mean, μ is the population mean, s is the sample standard
deviation, and n is the sample size.
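
A small worked computation of this formula is given below; the sample values and hypothesized population mean are invented, and scipy’s one-sample t-test is used only as a cross-check.

```python
# Computing a t-score by hand and cross-checking it with scipy.
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.6, 12.0, 11.5, 12.3, 12.2, 11.9])
mu = 12.5                                   # hypothesized population mean

x_bar = sample.mean()
s = sample.std(ddof=1)                      # sample standard deviation
n = len(sample)

t_manual = (x_bar - mu) / (s / np.sqrt(n))
t_scipy, p_value = stats.ttest_1samp(sample, popmean=mu)

print(round(t_manual, 3), round(t_scipy, 3), round(p_value, 4))
```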
Re-Sampling
What is resampling?
Resampling is a series of techniques used in statistics to gather more information about a
sample. This can include retaking a sample or estimating its accuracy. With these additional
techniques, resampling often improves the overall accuracy and estimates any uncertainty
within a population.
Sampling vs. resampling
Sampling is the process of selecting certain groups within a population to gather data.
Resampling often involves performing similar testing methods on samples drawn from within
that group. This can mean re-testing the same sample, or reselecting samples that can provide
more information about a population. There are several differences between sampling and
resampling, including:
Methods
Resampling uses methods like the bootstrapping technique and permutation tests (a short
bootstrap sketch follows the list of sampling methods below). With sampling, there are four
main methods:
Simple random sampling: Simple random sampling is when every person or data piece
within a population or a group has an equal chance of selection. You might generate random
numbers or have another random selection process.
Systematic sampling: Systematic sampling is often still random, but people might receive
numbers or values at the start. The person running the experiment then selects an interval
to divide the group, such as every third person.
Stratified sampling: Stratified sampling is when you divide the main population into several
subgroups based on certain qualities. This can mean collecting samples from groups of
different ages, cultures or other demographics.
Cluster sampling: Cluster sampling is similar to stratified sampling, as you can divide
populations into separate subgroups. Rather than coordinated groups with similar qualities,
you select these groups randomly, often causing differences in results.
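
Here is the bootstrap sketch referenced above, one of the resampling methods mentioned earlier; the observed values are invented, and the data are resampled with replacement many times to estimate the uncertainty of the sample mean.

```python
# A brief bootstrap sketch with invented observations.
import numpy as np

rng = np.random.default_rng(42)
observed = np.array([23.0, 19.5, 25.1, 22.4, 20.8, 24.6, 21.3, 22.9])

boot_means = np.array([
    rng.choice(observed, size=observed.size, replace=True).mean()   # resample with replacement
    for _ in range(5_000)
])

low, high = np.percentile(boot_means, [2.5, 97.5])
print("observed mean:", round(observed.mean(), 2))
print("95% bootstrap interval:", round(low, 2), "-", round(high, 2))
```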
Statistical Inference

Statistical Inference Definition


Statistical inference is the process of analysing the result and making conclusions from data
subject to random variation. It is also called inferential statistics. Hypothesis testing
and confidence intervals are the applications of the statistical inference. Statistical inference
is a method of making decisions about the parameters of a population, based on random
sampling. It helps to assess the relationship between the dependent and independent
variables. The purpose of statistical inference is to estimate the uncertainty or sample-to-sample
variation. It allows us to provide a probable range of values for the true values of something
in the population. The components used for making statistical inference are:
 Sample Size
 Variability in the sample
 Size of the observed differences

Types of Statistical Inference


There are different types of statistical inferences that are extensively used for making
conclusions. They are:

 One sample hypothesis testing


 Confidence Interval
 Pearson Correlation
 Bi-variate regression
 Multi-variate regression
 Chi-square statistics and contingency table
 ANOVA or T-test

Statistical Inference Procedure


The procedure involved in inferential statistics are:

 Begin with a theory


 Create a research hypothesis
 Operationalize the variables
 Recognize the population to which the study results should apply
 Formulate a null hypothesis for this population
 Accumulate a sample from the population and continue the study
 Conduct statistical tests to see if the collected sample properties are adequately
different from what would be expected under the null hypothesis to be able to reject
the null hypothesis
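
The sketch below walks through the tail end of this procedure with scipy, using invented samples: the null hypothesis is that two groups share the same mean, and it is rejected only if the p-value falls below a pre-chosen significance level.

```python
# A hypothesis-testing sketch with invented data and a two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(loc=50.0, scale=5.0, size=40)   # sample from population A
group_b = rng.normal(loc=53.0, scale=5.0, size=40)   # sample from population B

alpha = 0.05                                         # pre-chosen significance level
t_stat, p_value = stats.ttest_ind(group_a, group_b)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: the sample means differ more than chance allows.")
else:
    print("Fail to reject the null hypothesis.")
```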

Statistical Inference Solution


Statistical inference solutions make efficient use of statistical data relating to groups of
individuals or trials. They deal with all aspects of data, including the collection, investigation,
and analysis of data and the organization of the collected data. With statistical inference
solutions, people can acquire knowledge after starting their work in diverse fields. Some
statistical inference facts are:
 It is common to assume that the observed sample consists of independent
observations from a population of a known type, such as Poisson or normal
 Statistical inference is used to evaluate the parameter(s) of the assumed
model, such as a normal mean or binomial proportion

Importance of Statistical Inference


Inferential statistics is important for examining data properly. To make an accurate
conclusion, proper data analysis is important to interpret the research results. It is majorly
used for predicting future observations in different fields. It helps us to make
inferences about the data. Statistical inference has a wide range of applications in different
fields, such as:

 Business Analysis
 Artificial Intelligence
 Financial Analysis
 Fraud Detection
 Machine Learning
 Share Market
 Pharmaceutical Sector

Prediction Error
In statistics, prediction error refers to the difference between the predicted values made by
some model and the actual values.
Prediction error is often used in two settings:
1. Linear regression: Used to predict the value of some continuous response variable.
We typically measure the prediction error of a linear regression model with a metric known as
RMSE, which stands for root mean squared error.
It is calculated as:
RMSE = √( Σ(ŷi – yi)² / n )
where:
Σ is a symbol that means “sum”
ŷi is the predicted value for the ith observation
yi is the observed value for the ith observation
n is the sample size

2. Logistic Regression: Used to predict the value of some binary response variable.
One common way to measure the prediction error of a logistic regression model is with a
metric known as the total misclassification rate.
It is calculated as:
Total misclassification rate = (# incorrect predictions / # total predictions)
The lower the value for the misclassification rate, the better the model is able to predict the
outcomes of the response variable.
The following examples show how to calculate prediction error for both a linear regression
model and a logistic regression model in practice.
Example 1: Calculating Prediction Error in Linear Regression

Suppose we use a regression model to predict the number of points that 10 players will score
in a basketball game.
The following table shows the predicted points from the model vs. the actual points the
players scored:

Player:     1   2   3   4   5   6   7   8   9   10
Predicted:  14  15  18  19  25  18  12  12  15  22
Actual:     12  15  20  16  20  19  16  20  16  16
We would calculate the root mean squared error (RMSE) as:
RMSE = √( Σ(ŷi – yi)² / n )
RMSE = √( ((14-12)² + (15-15)² + (18-20)² + (19-16)² + (25-20)² + (18-19)² + (12-16)² + (12-20)² + (15-16)² + (22-16)²) / 10 )
RMSE = 4
The root mean squared error is 4. This tells us that the average deviation between the
predicted points scored and the actual points scored is 4.
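
The same arithmetic can be checked with a few lines of numpy, using the predicted and actual values from the table above.

```python
# Reproducing the RMSE from the worked example above.
import numpy as np

predicted = np.array([14, 15, 18, 19, 25, 18, 12, 12, 15, 22])
actual    = np.array([12, 15, 20, 16, 20, 19, 16, 20, 16, 16])

rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)   # 4.0
```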



Example 2: Calculating Prediction Error in Logistic Regression
Suppose we use a logistic regression model to predict whether or not 10 college basketball
players will get drafted into the NBA.
The following table shows the predicted outcome for each player vs. the actual outcome (1 =
Drafted, 0 = Not Drafted):
We would calculate the total misclassification rate as:
Total misclassification rate = (# incorrect predictions / # total predictions)
Total misclassification rate = 4/10
Total misclassification rate = 40%
The total misclassification rate is 40%.

This value is quite high, which indicates that the model doesn’t do a very good job of
predicting whether or not a player will get drafted.
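
Since the per-player table is not reproduced here, the short numpy sketch below uses invented outcomes chosen so that 4 of the 10 predictions are wrong, which reproduces the 40% rate computed above.

```python
# Total misclassification rate with invented draft outcomes (1 = drafted, 0 = not drafted).
import numpy as np

predicted = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
actual    = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0])

misclassification_rate = np.mean(predicted != actual)   # fraction of incorrect predictions
print(f"Total misclassification rate: {misclassification_rate:.0%}")   # 40%
```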
