
UNIT 1

INTRODUCTION TO BUSINESS ANALYTICS

ANALYTICS AND DATA SCIENCE:


Definition:
Analytics is the science of working with data by applying models and statistical techniques to uncover insights.
These insights help us solve a wide range of problems. When we work with data to find insights and solve business-related problems in particular, we are doing Business Analytics.
The tools used for analytics range from spreadsheets to predictive analytics software for complex business problems. The process involves using these tools to draw out patterns and identify relationships. Next, new questions are asked, and the iterative process starts again and continues until the business goal is achieved.
Business analytics draws on several methodologies, such as data mining, statistical analysis, and predictive analytics, to analyze and transform data into useful information.
Business analytics is also used to identify and anticipate trends and outcomes. With the help of
these results, it becomes easier to make data-driven business decisions.
The use of business analytics is very popular in some industries such as healthcare, hospitality,
and any other business that has to track or closely monitor its customers. Many high-end
business analytics software solutions and platforms have been developed to ingest and process
large data sets.
Business Analytics Examples
Some of the examples of Business Analytics are:
• A simple example of Business Analytics would be working with data to find out what would
be the optimal price point for a product that a company is about to launch. While doing this
research, there are a lot of factors that it would have to take into consideration before arriving
at a solution.
• Another example would be applying Business Analytics techniques to identify how many customers, and which ones, are likely to cancel their subscriptions.

• One of the most appreciated examples of Business Analytics is working with available data to figure out how and why the tastes and preferences of customers who visit a particular restaurant regularly change over time.
Components of Business Analytics
Modern business strategies are centred on data. Business Analytics, Machine Learning, Data Science, etc. are used to arrive at solutions for complex and specific business problems. Even though each of these has various components, the core components remain similar. The following are the core components of Business Analytics:
• Data Storage: Data is stored by computers in a way that allows it to be used again in the future. Keeping this data on storage devices is known as data storage. Object storage, block storage, etc. are some of the storage products and services.
• Data Visualization: This is the process of graphically representing the information or insights drawn from the analysis of data. Data visualization makes it easier to communicate outputs to management in simple terms.
• Insights: Insights are the outputs and inferences drawn from the analysis of data by applying business analytics techniques and tools.
• Data Security: One of the most important components of Business Analytics is data security. It involves monitoring for and identifying malicious activity on the network. Real-time data and predictive modelling techniques are used to identify vulnerabilities in the system.

TYPES OF BUSINESS ANALYTICS


There are various types of analytics that are performed on a daily basis across many companies.
Let's understand each one of them in this section.
Descriptive Analytics
Whenever we are trying to answer questions such as "what were the sales figures last year?" or "what has occurred before?", we are basically doing descriptive analysis. In descriptive analysis, we describe or summarize the past data and transform it into easily comprehensible forms, such as charts or graphs.
Example:
Let's take the example of DMart. We can look at a product's sales history and find out which products have sold more or which products are in high demand by looking at the sales trends, and based on that analysis we can decide to stock that item in larger quantities for the coming year.
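As an illustration only, this kind of descriptive summary could be produced with a few lines of Python (pandas); the product names and quantities below are hypothetical:

import pandas as pd

# Hypothetical sales history: one row per transaction (illustrative data only)
sales = pd.DataFrame({
    "product":  ["rice", "oil", "soap", "rice", "rice", "oil"],
    "quantity": [10, 4, 6, 12, 8, 5],
    "month":    ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
})

# Descriptive analytics: summarise what has already happened
units_per_product = sales.groupby("product")["quantity"].sum().sort_values(ascending=False)
print(units_per_product)                          # which products sold the most overall
print(sales.groupby("month")["quantity"].sum())   # monthly sales trend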

Predictive Analytics
Predictive analytics is exactly what it sounds like. It is that side of business analytics where
predictions about a future event are made. An example of predictive analytics is calculating the
expected sales figures for the upcoming fiscal year. Predictive analytics is mainly used to set expectations and to follow proper processes and measures to meet those expectations.
Example:
The best examples are the Amazon and Netflix recommender systems. You might have noticed that whenever you buy a product from Amazon, at checkout it shows you a recommendation saying that customers who purchased this item also purchased another product. That recommendation is based on customers' past purchase behaviour: by looking at past purchases, analysts create associations between products, which is why a recommendation appears when you buy any product.
The next example is Netflix. When you watch movies or web series on Netflix, you can see that Netflix provides you with a lot of recommended movies or web series. That recommendation is based on past data and past trends: it identifies which movies or series have gained a lot of public interest and creates recommendations based on that.
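To make the idea concrete, the sketch below shows a deliberately simplified co-purchase recommendation in Python; the orders are hypothetical, and real recommender systems at Amazon or Netflix are far more sophisticated than this simple counting approach:

from collections import Counter
from itertools import combinations

# Hypothetical past orders (illustrative data only)
orders = [
    {"phone", "phone case"},
    {"phone", "phone case", "charger"},
    {"laptop", "mouse"},
    {"phone", "charger"},
]

# Count how often each pair of products appears in the same order
pair_counts = Counter()
for order in orders:
    for a, b in combinations(sorted(order), 2):
        pair_counts[(a, b)] += 1

def recommend(product, top_n=2):
    """Return the products most frequently co-purchased with `product`."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == product:
            scores[b] += count
        elif b == product:
            scores[a] += count
    return [item for item, _ in scores.most_common(top_n)]

print(recommend("phone"))   # e.g. ['phone case', 'charger']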

Prescriptive Analytics
In the case of prescriptive analytics, we make use of simulation, data modelling, and optimization algorithms to find answers to questions such as "what needs to be done?". It is used to provide solutions and identify the potential results of those solutions. This field of business analytics has surfaced recently and is on a steep rise, since it gives multiple solutions, with their possible effectiveness, to the problems faced by businesses. Let's say Plan A fails or there aren't enough resources to execute it; there is still Plan B, Plan C, etc., in hand.
Example:
The best example is Google's self-driving car: by looking at past trends and forecast data, it identifies when to turn or when to slow down, working much like a human driver.

ANALYTICS LIFE CYCLE


In the early 1990s, data mining was evolving from toddler to adolescent. As a community, we spent a lot of time getting the data ready for the fairly limited tools and computing power. The CRISP-DM process (Cross-Industry Standard Process for Data Mining) that emerged as a result is still valid today in the era of Big Data and Stream Analytics.

Business Understanding
Focuses on understanding the project objectives and requirements from a business perspective. The analyst formulates this knowledge as a data mining problem and develops a preliminary plan.
Data Understanding
Starting with initial data collection, the analyst proceeds with activities to get familiar with the
data, identify data quality problems & discover first insights into the data. In this phase, the
analyst might also detect interesting subsets to form hypotheses for hidden information
Data Preparation
The data preparation phase covers all activities to construct the final dataset from the initial
raw data
Modelling
The analyst evaluates, selects & applies the appropriate modelling techniques. Since some techniques, like neural nets, have specific requirements regarding the form of the data, there can be a loop back here to data preparation.
Evaluation
The analyst builds & chooses models that appear to have high quality based on the loss functions that were selected. The analyst then tests them to ensure that they generalise against unseen data. Subsequently, the analyst also validates that the models sufficiently cover all key business issues. The end result is the selection of the champion model(s).

Deployment
Generally, this will mean deploying a code representation of the model into an operational system. This also includes mechanisms to score or categorise new unseen data as it arises. The mechanism should use the new information in the solution of the original business problem. Importantly, the code representation must also include all the data preparation steps leading up to modelling, so that the model will treat new raw data in the same manner as during model development.
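As a minimal sketch of this idea in Python (using scikit-learn with hypothetical training data; CRISP-DM itself does not prescribe any particular tooling), the data preparation steps and the model can be bundled into one deployable pipeline so that new raw data is scored exactly as data was treated during development:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import joblib

# Hypothetical training data: raw features (age, income) and a binary target
X_train = [[25, 40000.0], [47, 92000.0], [35, 61000.0], [52, 30000.0]]
y_train = [0, 1, 1, 0]

# The same preparation (scaling) and the model are bundled together
model = Pipeline([
    ("prep", StandardScaler()),        # data prep step applied to every input
    ("clf", LogisticRegression()),     # the model itself
])
model.fit(X_train, y_train)

# Deployment: persist the whole pipeline, then score new, unseen raw data with it
joblib.dump(model, "champion_model.joblib")
scorer = joblib.load("champion_model.joblib")
print(scorer.predict([[41, 55000.0]]))   # a new raw record goes through prep + model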

BUSINESS PROBLEM DEFINITION


The Business Understanding phase is about understanding what the business wants to solve. Important tasks within this phase, according to Data Science Project Management practice, include:
1. Determine the business question and objective: What to solve from the business perspective, what the customer wants, and the definition of the business success criteria (Key Performance Indicators, or KPIs). For a fresher, research what kind of situation the company would face and try to build your project on top of it.
2. Situation Assessment: You need to assess resource availability, project requirements, risks, and the cost-benefit of the project. While you might not know the situation within the company if you are not hired yet, you could assess it based on your research and explain what your assessment is based on.
3. Determine the project goals: These are the success criteria from the technical, data-mining perspective. You could set them based on model metrics, availability time, or anything else, as long as you can explain them; what is important is that they are logically sound.
4. Project plan: Try to create a detailed plan for each project phase and what kind of tools you
would use.
Determine the business question and objective:
The first thing you must do in any project is to find out exactly what you're trying to accomplish! That's less obvious than it sounds. Many data miners have invested time on data analysis, only to find that their management wasn't particularly interested in the issue they were investigating. You must start with a clear understanding of
• A problem that your management wants to address
• The business goals

• Constraints (limitations on what you may do, the kinds of solutions that can be used, when the
work must be completed, and so on)
• Impact (how the problem and possible solutions fit in with the business)
Deliverables for this task include three items (usually brief reports focusing on just the main
points):
• Background: Explain the business situation that drives the project. This item, like many that
follow, amounts only to a few paragraphs.
• Business goals: Define what your organization intends to accomplish with the project. This is
usually a broader goal than you, as a data miner, can accomplish independently. For example,
the business goal might be to increase sales from a holiday ad campaign by 10 percent year
over year.
• Business success criteria: Define how the results will be measured. Try to get clearly defined
quantitative success criteria. If you must use subjective criteria (hint: terms like gain
insight or get a handle on imply subjective criteria), at least get agreement on exactly who will
judge whether or not those criteria have been fulfilled.
Assessing your situation
This is where you get into more detail on the issues associated with your business goals. Now
you will go deeper into fact-finding, building out a much fleshier explanation of the issues
outlined in the business goals task.
Deliverables for this task include five in-depth reports:
• Inventory of resources: A list of all resources available for the project. These may include
people (not just data miners, but also those with expert knowledge of the business problem,
data managers, technical support, and others), data, hardware, and software.
• Requirements, assumptions, and constraints: Requirements will include a schedule for completion, legal and security obligations, and requirements for acceptable finished work. This is the point to verify that you'll have access to appropriate data!
• Risks and contingencies: Identify causes that could delay completion of the project, and
prepare a contingency plan for each of them. For example, if an Internet outage in your office
could pose a problem, perhaps your contingency could be to work at another office until the
outage has ended.

• Terminology: Create a list of business terms and data-mining terms that are relevant to your
project and write them down in a glossary with definitions (and perhaps examples), so that
everyone involved in the project can have a common understanding of those terms.
• Costs and benefits: Prepare a cost-benefit analysis for the project. Try to state all costs and benefits in dollar (euro, pound, yen, and so on) terms. If the benefits don't significantly exceed the costs, stop and reconsider this analysis and your project.
Defining your project goals
Reaching the business goal often requires action from many people, not just the data miner. So
now, you must define your little part within the bigger picture. If the business goal is to reduce
customer attrition, for example, your data-mining goals might be to identify attrition rates for
several customer segments, and develop models to predict which customers are at greatest risk.
Deliverables for this task include two reports:
• Project goals: Define project deliverables, such as models, reports, presentations, and
processed datasets.
• Project success criteria: Define the project technical criteria necessary to support the business
success criteria. Try to define these in quantitative terms (such as model accuracy or predictive
improvement compared to an existing method). If the criteria must be qualitative, identify the
person who makes the assessment.
Project plan
Now you specify every step that you, the data miner, intend to take until the project is
completed and the results are presented and reviewed.
Deliverables for this task include two reports:
• Project plan: Outline your step-by-step action plan for the project. Expand the outline with a schedule for completion of each step, required resources, inputs (such as data or a meeting with a subject matter expert), and outputs (such as cleaned data, a model, or a report) for each step, and dependencies (steps that can't begin until this step is completed). Explicitly state that certain steps must be repeated (for example, modeling and evaluation usually call for several back-and-forth repetitions).
• Initial assessment of tools and techniques: Identify the required capabilities for meeting your
data-mining goals and assess the tools and resources that you have. If something is missing,
you have to address that concern very early in the process.

DATA COLLECTION
Data is a collection of facts, figures, objects, symbols, and events gathered from different
sources. Organizations collect data to make better decisions. Without data, it would be difficult
for organizations to make appropriate decisions, and so data is collected at various points in
time from different audiences.
For instance, before launching a new product, an organization needs to collect data on product demand, customer preferences, competitors, etc. If data is not collected beforehand, the organization's newly launched product may fail for many reasons, such as low demand or an inability to meet customer needs.
Although data is a valuable asset for every organization, it does not serve any purpose until
analyzed or processed to get the desired results.
Information collected as numerical facts through observation is known as raw data. There are two types of data: primary data and secondary data. The two types are described below.
1. Primary Data
When an investigator collects data himself or herself, with a definite plan or design, the data is known as primary data. Generally, results derived from primary data are accurate because the researcher gathers the information first-hand. One of the disadvantages of primary data collection, however, is the expense associated with it: primary data research is very time-consuming and expensive.
2. Secondary Data
Data that the investigator does not collect initially but instead obtains from published or unpublished sources is secondary data. Secondary data is collected by an individual or an institution for some purpose and is used by someone else in another context. It is worth noting that although secondary data is cheaper to obtain, it raises concerns about accuracy. As the data is second-hand, one cannot fully rely on the information being authentic.
Data Collection: Methods
Data collection is the process of gathering and analysing data, using suitable techniques, to answer research questions and validate findings. It is done to diagnose a problem and learn about its outcomes and future trends. When a question needs to be answered, data collection methods help estimate the likely result.

We must collect reliable data from the correct sources to make the calculations and analysis easier. There are two types of data collection methods, and the choice depends on the kind of data being collected. They are:
1. Primary Data Collection Methods
2. Secondary Data Collection Methods
Types of Data Collection
Students require primary or secondary data while doing their research. Both primary and
secondary data have their own advantages and disadvantages. Both the methods come into
the picture in different scenarios. One can use secondary data to save time and primary data
to get accurate results.
Primary Data Collection Method
Primary or raw data is obtained directly from the first-hand source through experiments,
surveys, or observations. The primary data collection method is further classified into two
types, and they are given below:
1. Quantitative Data Collection Methods
2. Qualitative Data Collection Methods
Quantitative Data Collection Methods
The term “Quantity” tells us a specific number. Quantitative data collection methods express
the data in numbers using traditional or online data collection methods. Once this data is
collected, the results can be calculated using Statistical methods and Mathematical tools.
Some of the quantitative data collection methods include
Time Series Analysis
The term time series refers to a sequence of values of a variable recorded at equal time intervals; the overall movement in these values is known as a trend. Using these patterns, an organization can predict the demand for its products and services for the projected time period.
Smoothing Techniques
In cases where the time series lacks significant trends, smoothing techniques can be used. They
eliminate a random variation from the historical demand. It helps in identifying patterns and
demand levels to estimate future demand. The most common methods used in smoothing
demand forecasting techniques are the simple moving average method and the weighted
moving average method.
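A short pandas sketch of both smoothing methods on a hypothetical monthly demand series (the window size and weights are illustrative assumptions) might look like this:

import pandas as pd

# Hypothetical monthly demand figures (illustrative data only)
demand = pd.Series([120, 135, 128, 150, 142, 160, 155],
                   index=pd.period_range("2023-01", periods=7, freq="M"))

# Simple moving average: each value is the unweighted mean of the last 3 months
sma = demand.rolling(window=3).mean()

# Weighted moving average: more recent months count more (weights sum to 1)
weights = [0.2, 0.3, 0.5]
wma = demand.rolling(window=3).apply(lambda window: (window * weights).sum())

forecast_next_month = wma.iloc[-1]   # naive forecast from the latest smoothed value
print(sma.round(1), wma.round(1), forecast_next_month, sep="\n")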

Barometric Method
Also known as the leading indicators approach, researchers use this method to speculate future
trends based on current developments. When the past events are considered to predict future
events, they act as leading indicators.
Qualitative Data Collection Methods
The qualitative method does not involve any mathematical calculations. This method is
closely connected with elements that are not quantifiable. The qualitative data collection
method includes several ways to collect this type of data, and they are given below:
Interview Method
As the name suggests, data collection is done through the verbal conversation of
interviewing the people in person or on a telephone or by using any computer-aided
model. This is one of the most often used methods by researchers. A brief description of
each of these methods is shown below:
Personal or Face-to-Face Interview: In this type of interview, questions are asked directly to the respondent in person. The researcher can also use online surveys to take note of the answers.
Telephonic Interview: This method is done by asking questions on a telephonic call. Data
is collected from the people directly by collecting their views or opinions.
Computer-Assisted Interview: The computer-assisted type of interview is the same as a
personal interview, except that the interviewer and the person being interviewed will be
doing it on a desktop or laptop. Also, the data collected is directly updated in a database to
make the process quicker and easier. In addition, it eliminates a lot of paperwork to be done
in updating the collection of data.
Questionnaire Method of Collecting Data
The questionnaire method is nothing but conducting surveys with a set of quantitative
research questions. These survey questions are done by using online survey questions
creation software. It also ensures that the people9s trust in the surveys is legitimised. Some
types of questionnaire methods are given below:
Web-Based Questionnaire: The interviewer can send a survey link to the selected
respondents. Then the respondents click on the link, which takes them to the survey
questionnaire. This method is very cost-efficient and quick, which people can do at their
own convenient time. Moreover, the survey has the flexibility of being done on any device.
So, it is reliable and flexible.

Mail-Based Questionnaire: Questionnaires are sent to the selected audience via email. At times, incentives are also given for completing the survey, which is the main attraction. The advantages of this method are that the respondent's name remains confidential to the researchers and that there is flexibility in when the survey is completed.
Observation Method
As the word 'observation' suggests, in this method data is collected directly through observation. This can be done by counting the number of people or the number of events in a particular time frame. Generally, it's effective in small-scale scenarios. The primary skill needed here is observing and arriving at the numbers correctly. Structured observation is the type of observation method in which a researcher looks for certain specific behaviours.
Document Review Method
The document review method is a data aggregation method used to collect data from existing
documents with data about the past. There are two types of documents from which we can
collect data. They are given below:
Public Records: Data held by an organisation, such as annual reports and sales information from past months, is used for future analysis.
Personal Records: As the name suggests, documents about an individual, such as type of job, designation, and interests, are taken into account.
Secondary Data Collection Method
The data collected by another person other than the researcher is secondary data. Secondary
data is readily available and does not require any particular collection methods. It is
available in the form of historical archives, government data, organisational records etc.
This data can be obtained directly from the company or the organization where the research
is being organised or from outside sources.
The internal sources of secondary data include company documents, financial statements, annual reports, team member information, and reports received from customers or dealers. The external sources include information from books, journals, magazines, the census taken by the government, and information available on the internet about the research topic. The main advantage of this data collection method is that the data is easy to collect, since it is readily accessible.

The secondary data collection methods, too, can involve both quantitative and qualitative techniques. Secondary data is easily available and hence less time-consuming and less expensive to obtain than primary data. However, with secondary data collection methods, the authenticity of the data gathered cannot be verified.
Collection of Data in Statistics
There are various ways to represent data after gathering it, but the most popular method is to tabulate the data using tally marks and then represent it in a frequency distribution table. The frequency distribution table is constructed using the tally marks. Tally marks are a form of numerical system used for counting: vertical lines are used for the counting, and a cross line placed over four vertical lines represents a group of five.

Example:
Consider a jar containing pieces of bread of different colours. Construct a frequency distribution table for the colours observed.
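Since the original jar illustration cannot be reproduced here, the short Python sketch below assumes a hypothetical list of observed colours and builds the frequency distribution table programmatically:

from collections import Counter

# Hypothetical observations of bread-piece colours drawn from the jar
observations = ["brown", "white", "brown", "multigrain", "white",
                "brown", "white", "white", "brown", "multigrain"]

frequency = Counter(observations)   # tallies each colour, like tally marks

print(f"{'Colour':<12}{'Frequency':>10}")
for colour, count in frequency.items():
    print(f"{colour:<12}{count:>10}")
print(f"{'Total':<12}{sum(frequency.values()):>10}")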

DATA PREPARATION
Data preparation is the process of gathering, combining, structuring and organizing data so it
can be used in business intelligence (BI), analytics and data visualization applications. The
components of data preparation include data preprocessing, profiling, cleansing, validation and
transformation; it often also involves pulling together data from different internal systems and
external sources.

Data preparation work is done by information technology (IT), BI and data management teams
as they integrate data sets to load into a data warehouse, NoSQL database or data lake
repository, and then when new analytics applications are developed with those data sets. In
addition, data scientists, data engineers, other data analysts and business users increasingly use
self-service data preparation tools to collect and prepare data themselves.
Data preparation is often referred to informally as data prep. It's also known as data wrangling,
although some practitioners use that term in a narrower sense to refer to cleansing, structuring
and transforming data; that usage distinguishes data wrangling from the data pre-
processing stage.
Purposes of data preparation
One of the primary purposes of data preparation is to ensure that raw data being readied for
processing and analysis is accurate and consistent so the results of BI and analytics
applications will be valid. Data is commonly created with missing values, inaccuracies or other
errors, and separate data sets often have different formats that need to be reconciled when
they're combined. Correcting data errors, validating data quality and consolidating data sets are
big parts of data preparation projects.
Data preparation also involves finding relevant data to ensure that analytics applications deliver
meaningful information and actionable insights for business decision-making. The data often
is enriched and optimized to make it more informative and useful -- for example, by blending
internal and external data sets, creating new data fields, eliminating outlier values and
addressing imbalanced data sets that could skew analytics results.
In addition, BI and data management teams use the data preparation process to curate data sets
for business users to analyse. Doing so helps streamline and guide self-service BI applications
for business analysts, executives and workers.
What are the benefits of data preparation?
Data scientists often complain that they spend most of their time gathering, cleansing and
structuring data instead of analysing it. A big benefit of an effective data preparation process
is that they and other end users can focus more on data mining and data analysis -- the parts of
their job that generate business value. For example, data preparation can be done more quickly,
and prepared data can automatically be fed to users for recurring analytics applications.
Done properly, data preparation also helps an organization do the following:
• Ensure the data used in analytics applications produces reliable results;
• Identify and fix data issues that otherwise might not be detected;
• Enable more informed decision-making by business executives and operational workers;
• Reduce data management and analytics costs;
• Avoid duplication of effort in preparing data for use in multiple applications; and
• Get a higher ROI from BI and analytics initiatives.
Effective data preparation is particularly beneficial in big data environments that store a
combination of structured, semi structured and unstructured data, often in raw form until it's
needed for specific analytics uses. Those uses include predictive analytics, machine learning
(ML) and other forms of advanced analytics that typically involve large amounts of data to
prepare. For example, in an article on preparing data for machine learning, Felix Wick,
corporate vice president of data science at supply chain software vendor Blue Yonder, is quoted
as saying that data preparation "is at the heart of ML."
Steps in the data preparation process
Data preparation is done in a series of steps. There's some variation in the data preparation
steps listed by different data professionals and software vendors, but the process typically
involves the following tasks:
1. Data discovery and profiling. This step involves exploring the collected data to better understand what it contains and what needs to be done to prepare it for the intended uses. To help with that, data profiling identifies patterns, relationships and other attributes in the data, as well as inconsistencies, anomalies, missing values and other issues, so they can be addressed.
What is data profiling?
Data profiling refers to the process of examining, analyzing, reviewing and summarizing data
sets to gain insight into the quality of data. Data quality is a measure of the condition of data
based on factors such as its accuracy, completeness, consistency, timeliness and accessibility.
Additionally, data profiling involves a review of source data to understand the data's structure,
content and interrelationships.
This review process delivers two high-level values to the organization: first, it provides a high-level view of the quality of its data sets; and second, it helps the organization identify potential data projects.
Given those benefits, data profiling is an important component of data preparation programs. By helping organizations identify quality data, it is an important precursor to data processing and data analytics activities.
Moreover, an organization can use data profiling and the insights it produces to continuously
improve the quality of its data and measure the results of that effort.
Data profiling may also be known as data archaeology, data assessment, data discovery or data
quality analysis.

Organizations use data profiling at the beginning of a project to determine if enough data has
been gathered, if any data can be reused or if the project is worth pursuing. The process of data
profiling itself can be based on specific business rules that will uncover how the data set aligns
with business standards and goals.
Types of data profiling
There are three types of data profiling.
• Structure discovery. This focuses on the formatting of the data, making sure everything is
uniform and consistent. It uses basic statistical analysis to return information about the validity
of the data.
• Content discovery. This process assesses the quality of individual pieces of data. For example,
ambiguous, incomplete and null values are identified.
• Relationship discovery. This detects connections, similarities, differences and associations
among data sources.
What are the steps in the data profiling process?
Data profiling helps organizations identify and fix data quality problems before the data is
analyzed, so data professionals aren't dealing with inconsistencies, null values or incoherent
schema designs as they process data to make decisions.
Data profiling statistically examines and analyzes data at its source and when loaded. It also
analyzes the metadata to check for accuracy and completeness.
It typically involves either writing queries or using data profiling tools.
A high-level breakdown of the process is as follows:
1. The first step of data profiling is gathering one or multiple data sources and the associated
metadata for analysis.
2. The data is then cleaned to unify structure, eliminate duplications, identify interrelationships
and find anomalies.
3. Once the data is cleaned, data profiling tools will return various statistics to describe the data
set. This could include the mean, minimum/maximum value, frequency, recurring patterns,
dependencies or data quality risks.
For example, by examining the frequency distribution of different values for each column in a
table, a data analyst could gain insight into the type and use of each column. Cross-column
analysis can be used to expose embedded value dependencies; inter-table analysis allows the
analyst to discover overlapping value sets that represent foreign key relationships between
entities.
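As a minimal illustration, this kind of column-level and cross-column profiling can be sketched with pandas; the customer table below is hypothetical:

import pandas as pd

# Hypothetical source data to be profiled (illustrative only)
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],
    "country":     ["IN", None, "US", "US", "US"],
    "order_value": [120.0, 85.5, 300.0, 42.0, 42.0],
})

# Column-level statistics: counts, min/max, mean, unique values, etc.
print(customers.describe(include="all"))

# Data quality indicators: missing values and fully duplicated rows
print(customers.isnull().sum())
print(customers.duplicated().sum())

# Frequency distribution of the values in one column
print(customers["country"].value_counts(dropna=False))

# Simple cross-column check: how many distinct countries per customer_id?
print(customers.groupby("customer_id")["country"].nunique())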
Benefits of data profiling
Data profiling returns a high-level overview of data that can result in the following benefits:
• leads to higher-quality, more credible data;
• helps with more accurate predictive analytics and decision-making;
• makes better sense of the relationships between different data sets and sources;
• keeps company information centralized and organized;
• eliminates errors, such as missing values or outliers, that add costs to data-driven projects;
• highlights areas within a system that experience the most data quality issues, such as data
corruption or user input errors; and
• produces insights surrounding risks, opportunities and trends.
Data profiling challenges
Although the objectives of data profiling are straightforward, the actual work involved is quite
complex, with multiple tasks occurring from the ingestion of data through its warehousing.
That complexity is one of the challenges organizations encounter when trying to implement
and run a successful data profiling program.
The sheer volume of data being collected by a typical organization is another challenge, as is
the range of sources -- from cloud-based systems to endpoint devices deployed as part of an
internet-of-things ecosystem -- that produce data.
The speed at which data enters an organization creates further challenges to having a successful
data profiling program.
These data prep challenges are even more significant in organizations that have not adopted
modern data profiling tools and still rely on manual processes for large parts of this work.
On a similar note, organizations that don't have adequate resources -- including trained data
professionals, tools and the funding for them -- will have a harder time overcoming these
challenges.
However, those same elements make data profiling more critical than ever to ensure that the
organization has the quality data it needs to fuel intelligent systems, customer personalization,
productivity-boosting automation projects and more.
Examples of data profiling
Data profiling can be implemented in a variety of use cases where data quality is important.
For example, projects that involve data warehousing or business intelligence may require
gathering data from multiple disparate systems or databases for one report or analysis.
Applying data profiling to these projects can help identify potential issues and corrections that
need to be made in extract, transform and load (ETL) jobs and other data integration processes
before moving forward.
Additionally, data profiling is crucial in data conversion or data migration initiatives that
involve moving data from one system to another. Data profiling can help identify data quality
issues that may get lost in translation or adaptions that must be made to the new system prior
to migration.
The following four methods, or techniques, are used in data profiling:
• column profiling, which assesses tables and quantifies entries in each column;
• cross-column profiling, which features both key analysis and dependency analysis;
• cross-table profiling, which uses key analysis to identify stray data as well as semantic and
syntactic discrepancies; and
• data rule validation, which assesses data sets against established rules and standards to validate
that they're being followed.
Data profiling tools
Data profiling tools replace much, if not all, of the manual effort of this function by discovering
and investigating issues that affect data quality, such as duplication, inaccuracies,
inconsistencies and lack of completeness.
These technologies work by analyzing data sources and linking sources to their metadata to
allow for further investigation into errors.
Furthermore, they offer data professionals quantitative information and statistics around data
quality, typically in tabular and graph formats.
Data management applications, for example, can manage the profiling process through tools
that eliminate errors and apply consistency to data extracted from multiple sources without the
need for hand coding.
Such tools are essential for many, if not most, organizations today as the volume of data they
use for their business activities significantly outpaces even a large team's ability to perform this
function through mostly manual efforts.
Data profile tools also generally include data wrangling, data gap and metadata discovery
capabilities as well as the ability to detect and merge duplicates, check for data similarities and
customize data assessments.
Commercial vendors that provide data profiling capabilities include Datameer, Informatica,
Oracle and SAS. Open source solutions include Aggregate Profiler, Apache Griffin, Quadient
DataCleaner and Talend.

2. Data cleansing. Next, the identified data errors and issues are corrected to create complete and
accurate data sets. For example, as part of cleansing data sets, faulty data is removed or fixed,
missing values are filled in and inconsistent entries are harmonized.
What is data cleansing?
Data cleansing, also referred to as data cleaning or data scrubbing, is the process of fixing
incorrect, incomplete, duplicate or otherwise erroneous data in a data set. It involves identifying
data errors and then changing, updating or removing data to correct them. Data cleansing
improves data quality and helps provide more accurate, consistent and reliable information for
decision-making in an organization.
Data cleansing is a key part of the overall data management process and one of the core
components of data preparation work that readies data sets for use in business intelligence (BI)
and data science applications. It's typically done by data quality analysts and engineers or other
data management professionals. But data scientists, BI analysts and business users may also
clean data or take part in the data cleansing process for their own applications.
Data cleansing vs. data cleaning vs. data scrubbing
Data cleansing, data cleaning and data scrubbing are often used interchangeably. For the most
part, they're considered to be the same thing. In some cases, though, data scrubbing is viewed
as an element of data cleansing that specifically involves removing duplicate, bad, unneeded
or old data from data sets.
Data scrubbing also has a different meaning in connection with data storage. In that context,
it's an automated function that checks disk drives and storage systems to make sure the data
they contain can be read and to identify any bad sectors or blocks.

Why is clean data important?


Business operations and decision-making are increasingly data-driven, as organizations look
to use data analytics to help improve business performance and gain competitive advantages
over rivals. As a result, clean data is a must for BI and data science teams, business executives,
marketing managers, sales reps and operational workers. That's particularly true in retail,
financial services and other data-intensive industries, but it applies to organizations across the
board, both large and small.
If data isn't properly cleansed, customer records and other business data may not be accurate
and analytics applications may provide faulty information. That can lead to flawed business
decisions, misguided strategies, missed opportunities and operational problems, which
ultimately may increase costs and reduce revenue and profits. IBM estimated that data quality
issues cost organizations in the U.S. a total of $3.1 trillion in 2016, a figure that's still widely
cited.
What kind of data errors does data scrubbing fix?
Data cleansing addresses a range of errors and issues in data sets, including inaccurate, invalid,
incompatible and corrupt data. Some of those problems are caused by human error during the
data entry process, while others result from the use of different data structures, formats and
terminology in separate systems throughout an organization.
The types of issues that are commonly fixed as part of data cleansing projects include the
following:
• Typos and invalid or missing data. Data cleansing corrects various structural errors in data
sets. For example, that includes misspellings and other typographical errors, wrong numerical
entries, syntax errors and missing values, such as blank or null fields that should contain data.
• Inconsistent data. Names, addresses and other attributes are often formatted differently from
system to system. For example, one data set might include a customer's middle initial, while
another doesn't. Data elements such as terms and identifiers may also vary. Data cleansing
helps ensure that data is consistent so it can be analyzed accurately.
• Duplicate data. Data cleansing identifies duplicate records in data sets and either removes or
merges them through the use of deduplication measures. For example, when data from two
systems is combined, duplicate data entries can be reconciled to create single records.
• Irrelevant data. Some data -- outliers or out-of-date entries, for example -- may not be relevant
to analytics applications and could skew their results. Data cleansing removes redundant data
from data sets, which streamlines data preparation and reduces the required amount of data
processing and storage resources.
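The following brief pandas sketch shows how issues of these kinds might be fixed on a small hypothetical data set (the column names and cleansing rules are illustrative assumptions, not a prescribed method):

import pandas as pd

# Hypothetical raw customer records with typical quality problems
raw = pd.DataFrame({
    "name":  ["Asha Rao", "asha rao", "Vikram N", None],
    "city":  ["Chennai", "chennai", "Coimbatore ", "Madurai"],
    "spend": [1500.0, 1500.0, None, 420.0],
})

clean = raw.copy()

# Inconsistent data: standardise text formatting across records
clean["name"] = clean["name"].str.strip().str.title()
clean["city"] = clean["city"].str.strip().str.title()

# Missing values: fill numeric gaps (here with the column median) and drop rows without a name
clean["spend"] = clean["spend"].fillna(clean["spend"].median())
clean = clean.dropna(subset=["name"])

# Duplicate data: merge records that are identical after standardisation
clean = clean.drop_duplicates()

print(clean)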
What are the steps in the data cleansing process?
The scope of data cleansing work varies depending on the data set and analytics requirements.
For example, a data scientist doing fraud detection analysis on credit card transaction data may
want to retain outlier values because they could be a sign of fraudulent purchases. But the data
scrubbing process typically includes the following actions:
1. Inspection and profiling. First, data is inspected and audited to assess its quality level and
identify issues that need to be fixed. This step usually involves data profiling, which documents
relationships between data elements, checks data quality and gathers statistics on data sets to
help find errors, discrepancies and other problems.

2. Cleaning. This is the heart of the cleansing process, when data errors are corrected and
inconsistent, duplicate and redundant data is addressed.
3. Verification. After the cleaning step is completed, the person or team that did the work should
inspect the data again to verify its cleanliness and make sure it conforms to internal data quality
rules and standards.
4. Reporting. The results of the data cleansing work should then be reported to IT and business
executives to highlight data quality trends and progress. The report could include the number
of issues found and corrected, plus updated metrics on the data's quality levels.
The cleansed data can then be moved into the remaining stages of data preparation, starting
with data structuring and data transformation, to continue readying it for analytics uses.
Characteristics of clean data
Various data characteristics and attributes are used to measure the cleanliness and overall
quality of data sets, including the following:
• accuracy
• completeness
• consistency
• integrity
• timeliness
• uniformity
• validity
Data management teams create data quality metrics to track those characteristics, as well as
things like error rates and the overall number of errors in data sets. Many also try to calculate
the business impact of data quality problems and the potential business value of fixing them,
partly through surveys and interviews with business executives.
The benefits of effective data cleansing
Done well, data cleansing provides the following business and data management benefits:
• Improved decision-making. With more accurate data, analytics applications can produce
better results. That enables organizations to make more informed decisions on business
strategies and operations, as well as things like patient care and government programs.
• More effective marketing and sales. Customer data is often wrong, inconsistent or out of
date. Cleaning up the data in customer relationship management and sales systems helps
improve the effectiveness of marketing campaigns and sales efforts.
• Better operational performance. Clean, high-quality data helps organizations avoid
inventory shortages, delivery snafus and other business problems that can result in higher costs,
lower revenues and damaged relationships with customers.
• Increased use of data. Data has become a key corporate asset, but it can't generate business
value if it isn't used. By making data more trustworthy, data cleansing helps convince business
managers and workers to rely on it as part of their jobs.
• Reduced data costs. Data cleansing stops data errors and issues from further propagating in
systems and analytics applications. In the long term, that saves time and money, because IT
and data management teams don't have to continue fixing the same errors in data sets.
Data cleansing and other data quality methods are also a key part of data governance programs,
which aim to ensure that the data in enterprise systems is consistent and gets used properly.
Clean data is one of the hallmarks of a successful data governance initiative.
Data cleansing challenges
Data cleansing doesn't lack for challenges. One of the biggest is that it's often time-consuming,
due to the number of issues that need to be addressed in many data sets and the difficulty of
pinpointing the causes of some errors. Other common challenges include the following:
• deciding how to resolve missing data values so they don't affect analytics applications;
• fixing inconsistent data in systems controlled by different business units;
• cleaning up data quality issues in big data systems that contain a mix of structured, semi
structured and unstructured data;
• getting sufficient resources and organizational support; and
• dealing with data silos that complicate the data cleansing process.
Data cleansing tools and vendors
Numerous tools can be used to automate data cleansing tasks, including both commercial
software and open-source technologies. Typically, the tools include a variety of functions for
correcting data errors and issues, such as adding missing values, replacing null ones, fixing
punctuation, standardizing fields and combining duplicate records. Many also do data matching
to find duplicate or related records.
Tools that help cleanse data are available in a variety of products and platforms, including the
following:
• specialized data cleaning tools from vendors such as Data Ladder and WinPure;
• data quality software from vendors such as Datactics, Experian, Innovative Systems, Melissa,
Microsoft and Precisely;
• data preparation tools from vendors such as Altair, DataRobot, Tableau, Tibco Software and
Trifacta;
• data management platforms from vendors such as Alteryx, Ataccama, IBM, Informatica, SAP,
SAS, Syniti and Talend;
• customer and contact data management software from vendors such as Redpoint Global,
RingLead, Synthio and Tye;
• tools for cleansing data in Salesforce systems from vendors such as Cloudingo and Plauti; and
• open-source tools, such as DataCleaner and OpenRefine
3. Data structuring. At this point, the data needs to be modeled and organized to meet the
analytics requirements. For example, data stored in comma-separated values (CSV) files or
other file formats has to be converted into tables to make it accessible to BI and analytics tools.
4. Data transformation and enrichment. In addition to being structured, the data typically must
be transformed into a unified and usable format. For example, data transformation may involve
creating new fields or columns that aggregate values from existing ones. Data enrichment
further enhances and optimizes data sets as needed, through measures such as augmenting and
adding data.
What is data transformation?
Data transformation is the process of converting data from one format, such as a database file,
XML document or Excel spreadsheet, into another.
Transformations typically involve converting a raw data source into a cleansed, validated and
ready-to-use format. Data transformation is crucial to data management processes that include
data integration, data migration, data warehousing and data preparation.
The process of data transformation can also be referred to as extract/transform/load (ETL). The
extraction phase involves identifying and pulling data from the various source systems that
create data and then moving the data to a single repository. Next, the raw data is cleansed, if
needed. It's then transformed into a target format that can be fed into operational systems or
into a data warehouse, a data lake or another repository for use in business intelligence and
analytics applications. The transformation may involve converting data types, removing
duplicate data and enriching the source data.
Data transformation is crucial to processes that include data integration, data management, data
migration, data warehousing and data wrangling.
It is also a critical component for any organization seeking to leverage its data to generate
timely business insights. As the volume of data has proliferated, organizations must have an
efficient way to harness data to effectively put it to business use. Data transformation is one
element of harnessing this data, because -- when done properly -- it ensures data is easy to
access, consistent, secure and ultimately trusted by the intended business users.
What are the key steps in data transformation?
The process of data transformation, as noted, involves identifying data sources and types;
determining the structure of transformations that need to occur; and defining how fields will
be changed or aggregated. It includes extracting data from its original source, transforming it
and sending it to the target destination, such as a database or data warehouse. Extractions can
come from many locations, including structured sources, streaming sources or log files from
web applications.
Data analysts, data engineers and data scientists are typically in charge of data transformation
within an organization. They identify the source data, determine the required data formats and
perform data mapping, as well as execute the actual transformation process before moving the
data into appropriate databases for storage and use.
Their work involves five main steps:
1. data discovery, in which data professionals use data profiling tools or profiling
scripts to understand the structure and characteristics of the data and also to
determine how it should be transformed;
2. data mapping, during which data professionals connect, or match, data fields from
one source to data fields in another;
3. code generation, a part of the process where the software code required to
transform the data is created (either by data transformation tools or the data
professionals themselves writing script);
4. execution of the code, where the data undergoes the transformation; and
5. review, during which data professionals or the business/end users confirm that the
output data meets the established transformation requirements and, if not, address
and correct any anomalies and errors.
These steps fall in the middle of the ETL process for organizations that use on-premises
warehouses. However, scalable cloud-based data warehouses have given rise to a slightly
different process called ELT for extract, load, transform; in this process, organizations can
load raw data into data warehouses and then transform data at the time of use.
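A compact, illustrative sketch of these steps in an ETL style (extract from a CSV file, transform with pandas, load into a SQLite table) is shown below; the file, column and table names are assumptions:

import sqlite3
import pandas as pd

# Extract: pull raw data from a source file (hypothetical path and columns)
raw = pd.read_csv("sales_raw.csv")   # e.g. columns: order_id, region, amount, order_date

# Transform: cleanse the raw data and reshape it into the target format
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["region"] = raw["region"].str.upper()
monthly = (raw
           .assign(month=raw["order_date"].dt.to_period("M").astype(str))
           .groupby(["month", "region"], as_index=False)["amount"].sum())

# Load: write the transformed data into the target repository
with sqlite3.connect("warehouse.db") as conn:
    monthly.to_sql("monthly_sales", conn, if_exists="replace", index=False)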
What are the benefits and challenges of data transformation?
Organizations across the board need to analyze their data for a host of business operations,
from customer service to supply chain management. They also need data to feed the increasing
number of automated and intelligent systems within their enterprise.

To gain insight into and improve these operations, organizations need high-quality data in
formats compatible with the systems consuming the data.
Thus, data transformation is a critical component of an enterprise data program because it
delivers the following benefits:
• higher data quality;
• reduced number of mistakes, such as missing values;
• faster queries and retrieval times;
• fewer resources needed to manipulate data;
• better data organization and management; and
• more usable data, especially for advanced business intelligence or analytics.
The data transformation process, however, can be complex and complicated. The challenges
organizations face include the following:
• high cost of transformation tools and professional expertise;
• significant compute resources, with the intensity of some on-premises
transformation processes having the potential to slow down other operations;
• difficulty recruiting and retaining the skilled data professionals required for this
work, with data professionals some of the most in-demand workers today; and
• difficulty of properly aligning data transformation activities to the business's data-
related priorities and requirements.
Reasons to do data transformation
Organizations must be able to mine their data for insights in order to successfully compete in
the digital marketplace, optimize operations, cut costs and boost productivity. They also require
data to feed systems that use artificial intelligence, machine learning, natural language
processing and other advanced technologies.
To gain accurate insights and to ensure accurate operations of intelligent systems, organizations
must collect data and merge it from multiple sources and ensure that integrated data is high
quality.
This is where data transformation plays the star role, by ensuring that data collected from one
system is compatible with data from other systems and that the combined data is ultimately
compatible for use in the systems that require it. For example, databases might need to be
combined following a corporate acquisition, transferred to a cloud data warehouse or merged
for analysis.
Examples of data transformation
There are various data transformation methods, including the following:

• aggregation, in which data is collected from multiple sources and stored in a single
format;
• attribute construction, in which new attributes are added or created from existing
attributes;
• discretization, which involves converting continuous data values into sets of data
intervals with specific values to make the data more manageable for analysis;
• generalization, where low-level data attributes are converted into high-level data
attributes (for example, converting data from multiple brackets broken up by ages
into the more general "young" and "old" attributes) to gain a more comprehensive
view of the data;
• integration, a step that involves combining data from different sources into a single
view;
• manipulation, where the data is changed or altered to make it more readable and
organized;
• normalization, a process that converts source data into another format to limit the
occurrence of duplicated data; and
• smoothing, which uses algorithms to reduce "noise" in data sets, thereby helping
to more efficiently and effectively identify trends in the data.
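A few of these methods can be illustrated with a short pandas sketch; the ages, amounts, bin edges and scaling choices below are assumptions chosen only for illustration:

import pandas as pd

# Hypothetical source records
data = pd.DataFrame({
    "age":    [19, 25, 37, 52, 68],
    "amount": [120.0, 310.0, 95.0, 400.0, 150.0],
    "region": ["south", "south", "north", "north", "south"],
})

# Discretization: continuous ages become interval buckets
data["age_band"] = pd.cut(data["age"], bins=[0, 30, 60, 120],
                          labels=["young", "middle", "old"])

# Generalization: low-level ages mapped to broader "young"/"old" attributes
data["age_group"] = data["age"].apply(lambda a: "young" if a < 40 else "old")

# Normalization: rescale amounts to the 0-1 range (min-max scaling)
data["amount_norm"] = (data["amount"] - data["amount"].min()) / (
    data["amount"].max() - data["amount"].min())

# Aggregation: combine records into a single summarised view per region
summary = data.groupby("region", as_index=False)["amount"].sum()

print(data)
print(summary)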
Data transformation tools
Data professionals have a number of tools at their disposal to support the ETL process. These
technologies automate many of the steps within data transformation, replacing much, if not all,
of the manual scripting and hand coding that had been a major part of the data transformation
process.
Both commercial and open-source data transformation tools are available, with some options
designed for on-premises transformation processes and others catering to cloud-based
transformation activities.
Moreover, some data transformation tools are focused on the data transformation process itself,
handling the string of actions required to transform data. However, other ETL tools on the
market are part of platforms that offer a broad range of capabilities for managing enterprise
data.
Options include IBM InfoSphere DataStage, Matillion, SAP Data Services and Talend.
5. Data validation and publishing. In this last step, automated routines are run against the data
to validate its consistency, completeness and accuracy. The prepared data is then stored in a
data warehouse, a data lake or another repository and either used directly by whoever prepared
it or made available for other users to access.
What is data validation?
Data validation is the practice of checking the integrity, accuracy and structure of data before
it is used for a business operation. The results of data validation can provide data for analytics,
business intelligence or training a machine learning model. Validation can also be used
to ensure the integrity of data for financial accounting or regulatory compliance.
Data can be examined as part of a validation process in a variety of ways, including data type,
constraint, structured, consistency and code validation. Each type of data validation is designed
to make sure the data meets the requirements to be useful.
Data validation is related to data quality. Data validation can be a component to measure data
quality, which ensures that a given data set is supplied with information sources that are of the
highest quality, authoritative and accurate.
Data validation is also used as part of application workflows, including spell checking and rules
for strong password creation.
Why validate data?
For data scientists, data analysts and others working with data, validating it is very important.
The output of any given system can only be as good as the data the operation is based on. These
operations can include machine learning or artificial intelligence models, data analytics reports
and business intelligence dashboards. Validating the data ensures that it is accurate,
which means that any system relying on the validated data set will be accurate as well.
Data validation is also important for data to be useful for an organization or for a specific
application operation. For example, if data is not in the right format to be consumed by a
system, then the data can't be used easily, if at all.
As data moves from one location to another, different needs for the data arise based on the
context for how the data is being used. Data validation ensures that the data is correct for
specific contexts. The right type of data validation makes the data useful.
What are the different types of data validation?
Multiple types of data validation are available to ensure that the right data is being used. The
most common types include the following (a simple code sketch follows the list):
• Data type validation is common and confirms that the data in each field, column,
list, range or file matches a specified data type and format.
• Constraint validation checks to see if a given data field input fits a specified
requirement within certain ranges. For example, it verifies that a data field has a
minimum or maximum number of characters.
• Structured validation ensures that data is compliant with a specified data format,
structure or schema.
• Consistency validation makes sure data styles are consistent. For example, it
confirms that all values are listed to two decimal points.
• Code validation is similar to a consistency check and confirms that codes used for
different data inputs are correct. For example, it checks a country code or North
American Industry Classification System (NAICS) codes.
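As a rough illustration of these validation types, the following Python sketch checks a single record against a handful of rules; the field names and the specific constraints are illustrative assumptions, not a standard.

```python
# A minimal sketch of the validation types above, applied to one record.
# All field names and rules below are illustrative assumptions.
import re

VALID_COUNTRY_CODES = {"US", "IN", "DE"}  # reference list for code validation

def validate_record(record):
    errors = []

    # Data type validation: quantity must be an integer.
    if not isinstance(record.get("quantity"), int):
        errors.append("quantity must be an integer")

    # Constraint validation: customer_id must be 6-10 characters long.
    cid = str(record.get("customer_id", ""))
    if not (6 <= len(cid) <= 10):
        errors.append("customer_id must be 6-10 characters")

    # Structured validation: date must match the YYYY-MM-DD schema.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(record.get("order_date", ""))):
        errors.append("order_date must be YYYY-MM-DD")

    # Consistency validation: price is stored with exactly two decimal places.
    if not re.fullmatch(r"\d+\.\d{2}", str(record.get("price", ""))):
        errors.append("price must have two decimal places")

    # Code validation: country code must come from the reference list.
    if record.get("country") not in VALID_COUNTRY_CODES:
        errors.append("unknown country code")

    return errors

print(validate_record({"quantity": 3, "customer_id": "C1234567",
                       "order_date": "2023-05-01", "price": "19.99", "country": "US"}))
```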
How to perform data validation
Among the most basic and common ways that data is used is within a spreadsheet program
such as Microsoft Excel or Google Sheets. In both Excel and Sheets, the data validation process
is a straightforward, integrated feature. Excel and Sheets both have a menu item listed as Data
> Data Validation. By selecting the Data Validation menu, a user can choose the specific data
type or constraint validation required for a given file or data range.
ETL (Extract, Transform and Load) and data integration tools typically integrate data
validation policies to be executed as data is extracted from one source and then loaded into
another. Popular open source tools, such as dbt, also include data validation options and are
commonly used for data transformation.
Data validation can also be done programmatically in an application context for an input value.
For example, as an input variable is sent, such as a password, it can be checked by a script to
make sure it meets constraint validation for the right length.
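A minimal sketch of that kind of programmatic check might look like the following; the specific password rules (length, an uppercase letter, a digit) are assumptions chosen only for illustration.

```python
# A sketch of programmatic constraint validation for a password input.
# The rules below are illustrative assumptions, not a security standard.
import re

def is_valid_password(password: str) -> bool:
    return (
        8 <= len(password) <= 64                       # length constraint
        and re.search(r"[A-Z]", password) is not None  # at least one uppercase letter
        and re.search(r"\d", password) is not None     # at least one digit
    )

print(is_valid_password("weak"))        # False
print(is_valid_password("Str0ngPass"))  # True
```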
Data preparation can also incorporate or feed into data curation work that creates and oversees
ready-to-use data sets for BI and analytics. Data curation involves tasks such as indexing,
cataloging and maintaining data sets and their associated metadata to help users find and access
the data. In some organizations, data curator is a formal role that works collaboratively with
data scientists, business analysts, other users and the IT and data management teams. In others,
data may be curated by data stewards, data engineers, database administrators or data scientists
and business users themselves.
What are the challenges of data preparation?
Data preparation is inherently complicated. Data sets pulled together from different source
systems are highly likely to have numerous data quality, accuracy and consistency issues to
resolve. The data also must be manipulated to make it usable, and irrelevant data needs to be
weeded out. As noted above, it's a time-consuming process: The 80/20 rule is often applied to
analytics applications, with about 80% of the work said to be devoted to collecting and
preparing data and only 20% to analyzing it.
In an article on common data preparation challenges, Rick Sherman, managing partner of
consulting firm Athena IT Solutions, detailed the following seven challenges along with advice
on how to overcome each of them:
• Inadequate or non-existent data profiling. If data isn't properly profiled, errors,
anomalies and other problems might not be identified, which can result in flawed
analytics.
• Missing or incomplete data. Data sets often have missing values and other forms of
incomplete data; such issues need to be assessed as possible errors and addressed if
so.
• Invalid data values. Misspellings, other typos and wrong numbers are examples of
invalid entries that frequently occur in data and must be fixed to ensure analytics
accuracy.
• Name and address standardization. Names and addresses may be inconsistent in
data from different systems, with variations that can affect views of customers and
other entities.
• Inconsistent data across enterprise systems. Other inconsistencies in data sets
drawn from multiple source systems, such as different terminology and unique
identifiers, are also a pervasive issue in data preparation efforts.
• Data enrichment. Deciding how to enrich a data set -- for example, what to add to it -
- is a complex task that requires a strong understanding of business needs and
analytics goals.
• Maintaining and expanding data prep processes. Data preparation work often
becomes a recurring process that needs to be sustained and enhanced on an ongoing
basis.

HYPOTHESIS GENERATION
Data scientists work with data sets small and large, and are tellers of stories. These stories have
entities, properties and relationships, all described by data. Their apparatus and methods open
up opportunities for data scientists to identify, consolidate and validate hypotheses with data,
and to use these hypotheses as starting points for their data narratives. Hypothesis generation is a
key challenge for data scientists. Hypothesis generation and by extension hypothesis
refinement constitute the very purpose of data analysis and data science.
Hypothesis generation for a data scientist can take numerous forms, such as:
Importantly, the data scientist must not already know the outcome of the hypothesis that has been
generated based on any evidence.
"A hypothesis may be simply defined as a guess. A scientific hypothesis is an intelligent
guess." - Isaac Asimov
Hypothesis generation is a crucial step in any data science project. If you skip this or skim
through this, the likelihood of the project failing increases exponentially.

Hypothesis Generation vs. Hypothesis Testing


Hypothesis generation is a process beginning with an educated guess, whereas
hypothesis testing is a process to conclude whether the educated guess is true or false, or
whether the relationship between the variables is statistically significant.
The latter part can be used for further research using statistical proof. A hypothesis
is accepted or rejected based on the significance level and the test score of the test used for
testing the hypothesis.
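To make the distinction concrete, here is a minimal sketch of the testing side using a two-sample t-test from SciPy; the generated hypothesis ("pooled trips take longer than single trips") and the sample durations are made-up assumptions for illustration.

```python
# Hypothesis testing sketch: is the difference in mean trip duration significant?
# The two samples below are made-up durations (in minutes) for illustration.
from scipy import stats

single_trips = [12, 15, 14, 10, 18, 16, 13]
pooled_trips = [20, 22, 19, 25, 21, 23, 18]

t_stat, p_value = stats.ttest_ind(pooled_trips, single_trips)

alpha = 0.05  # significance level
if p_value < alpha:
    print(f"p={p_value:.4f}: reject the null hypothesis; the difference is significant")
else:
    print(f"p={p_value:.4f}: fail to reject the null hypothesis")
```

Hypothesis generation supplies the guess; the test above only decides whether the observed difference is statistically significant at the chosen level.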

How Does Hypothesis Generation Help?


Here are 5 key reasons why hypothesis generation is so important in data science:
• Hypothesis generation helps in comprehending the business problem as we dive deep into
inferring the various factors affecting our target variable
• You will get a much better idea of the major factors that are responsible for solving
the problem
• It identifies the data that needs to be collected from various sources, which is key in
converting your business problem into a data science problem
• It improves your domain knowledge if you are new to the domain, as you spend time
understanding the problem
• It helps you approach the problem in a structured manner
When Should you Perform Hypothesis Generation?


The million-dollar question: when should you perform hypothesis generation?
• Hypothesis generation should be done before looking at the dataset or collecting the data
• You will notice that if you have done your hypothesis generation adequately, you
would have included all the variables present in the dataset in your hypothesis
generation
• You might also have included variables that are not present in the dataset

Case Study: Hypothesis Generation on "RED Taxi Trip Duration Prediction"

Let us now look at the "RED Taxi Trip Duration Prediction" problem statement
and generate a few hypotheses that would affect our taxi trip duration, to understand
hypothesis generation.
Here's the problem statement:
To predict the duration of a trip so that the company can assign the cabs that are free for the
next trip. This will help in reducing the wait time for customers and will also help in earning
customer trust.
Let's begin!
Hypothesis Generation Based on Various Factors
1. Distance/Speed based Features
Let us try to come up with a formula that relates to trip duration and helps us generate various
hypotheses for the problem:
Time = Distance / Speed
Distance and speed play an important role in predicting the trip duration.

We can notice that the trip duration is directly proportional to the distance travelled and
inversely proportional to the speed of the taxi. Using this, we can come up with hypotheses
based on distance and speed (a short feature-engineering sketch follows this list).
• Distance: More the distance travelled by the taxi, the more will be the trip duration.
• Interior drop point: Drop points to congested or interior lanes could result in an
increase in trip duration
• Speed: Higher the speed, the lower the trip duration
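As referenced above, here is a minimal feature-engineering sketch for the distance hypothesis, computing a haversine distance from pickup and drop-off coordinates; the column names are assumptions about how such a dataset might be laid out.

```python
# A minimal sketch of a distance-based feature using the haversine formula.
# The coordinate column names are assumptions about the dataset layout.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

trips = pd.DataFrame({
    "pickup_lat":  [40.7580, 40.6413],
    "pickup_lon":  [-73.9855, -73.7781],
    "dropoff_lat": [40.7484, 40.7580],
    "dropoff_lon": [-73.9857, -73.9855],
})

trips["distance_km"] = haversine_km(trips["pickup_lat"], trips["pickup_lon"],
                                    trips["dropoff_lat"], trips["dropoff_lon"])
print(trips)
```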

2. Features based on Car


Cars come in various types, sizes and brands, and these features of the car could be vital for the
commute, not only for the safety of the passengers but also for the trip duration. Let
us now generate a few hypotheses based on the features of the car.
• Condition of the car: Good conditioned cars are unlikely to have breakdown issues
and could have a lower trip duration
• Car Size: Small-sized cars (Hatchback) may have a lower trip duration and larger-
sized cars (XUV) may have higher trip duration based on the size of the car and
congestion in the city

3. Type of the Trip


Trip types can differ based on trip vendors: it could be an outstation trip, or a single or
pool ride. Let us now define a hypothesis based on the type of trip used.
• Pool Car: Trips with pooling can lead to higher trip duration as the car reaches
multiple places before reaching your assigned destination

4. Features based on Driver Details


A driver is an important person when it comes to commute time. Various factors about the
driver can help in understanding the reasons behind trip duration, and here are a few
hypotheses based on them.

• Age of driver: Older drivers could be more careful and could contribute to higher trip
duration
• Gender: Female drivers are likely to drive slowly and could contribute to higher trip
duration
• Driver experience: Drivers with very less driving experience can cause higher
trip duration
• Medical condition: Drivers with a medical condition can contribute to higher trip
duration

5. Passenger details
Passengers can influence the trip duration knowingly or unknowingly. We usually come
across passengers requesting drivers to increase the speed as they are getting late and there
could be other factors to hypothesize which we can look at.
• Age of passengers: Senior citizens as passengers may contribute to higher trip
duration as drivers tend to go slow in trips involving senior citizens
• Medical conditions or pregnancy: Passengers with medical conditions contribute
to a longer trip duration
• Emergency: Passengers with an emergency could contribute to a shorter trip duration
• Passenger count: Higher passenger count leads to shorter duration trips due to
congestion in seating
6. Date-Time Features
The day and time of the week are important, as New York is a busy city and could be highly
congested during office hours or on weekdays. Let us now generate a few hypotheses on the
date- and time-based features (a short pandas sketch follows these hypotheses).
Pickup Day:
• Weekends could contribute to more outstation trips and could have a higher trip
duration
• Weekdays tend to have higher trip duration due to high traffic
• If the pickup day falls on a holiday, then the trip duration may be shorter
• If the pickup day falls on a festive week, then the trip duration could be lower due to
lesser traffic
Time:
• Early morning trips have a lesser trip duration due to lesser traffic
• Evening trips have a higher trip duration due to peak hours
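As mentioned above, here is a minimal pandas sketch that derives the date-time features behind these hypotheses; the pickup_datetime column name and the peak-hour window are assumptions for illustration.

```python
# A minimal sketch of extracting the date-time features hypothesized above.
# The pickup_datetime column name is an assumption about the dataset schema.
import pandas as pd

trips = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"],
})
trips["pickup_datetime"] = trips["pickup_datetime"].pipe(pd.to_datetime)

trips["pickup_hour"]    = trips["pickup_datetime"].dt.hour
trips["pickup_weekday"] = trips["pickup_datetime"].dt.dayofweek      # 0 = Monday
trips["is_weekend"]     = trips["pickup_weekday"].isin([5, 6])
trips["is_peak_hour"]   = trips["pickup_hour"].between(17, 20)       # assumed evening peak

print(trips)
```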
7. Road-based Features
Roads are of different types, and the condition of the road or obstructions on the road
are factors that can't be ignored. Let's form some hypotheses based on these factors.
• Condition of the road: The duration of the trip is more if the condition of the road is
bad
• Road type: Trips in concrete roads tend to have a lower trip duration
• Strike on the road: Strikes carried out on roads in the direction of the trip causes the
trip duration to increase
8. Weather Based Features
Weather can change at any time and could possibly impact the commute if the weather turns
bad. Hence, this is an important feature to consider in our hypothesis.
• Weather at the start of the trip: Rainy weather condition contributes to a higher trip
duration
After writing down your hypotheses and looking at the dataset, you will notice that you
have covered most of the features present in the data set.
There is also a possibility that you might have to work with fewer features, because some
features on which you have generated hypotheses are not currently being captured/stored by
the business and are not available.
Always go ahead and capture data from external sources if you think that the data is
relevant for your prediction (e.g., getting weather information).
It is also important to note that since hypothesis generation is an estimated guess, the
hypothesis generated could come out to be true or false once exploratory data analysis and
hypothesis testing are performed on the data.

MODELING:
After all the cleaning, formatting and feature selection, we will now feed the data to the chosen
model. But how does one select a model to use?
How to choose a model?
IT DEPENDS. It all depends on what the goal of your task or project is, and this should already
be identified in the Business Understanding phase. (A short model-comparison sketch follows the steps below.)
Steps in choosing a model
1. Determine the size of the training data: if you have a small dataset with few
observations and a high number of features, choose high bias/low variance
algorithms (Linear Regression, Naïve Bayes, Linear SVM). If your dataset is large and
has a high number of observations compared to the number of features, choose
low bias/high variance algorithms (KNN, Decision Trees).
2. Accuracy and/or interpretability of the output: if your goal is inference, choose
restrictive models, as they are more interpretable (Linear Regression, Least Squares). If your
goal is higher accuracy, choose flexible models (Bagging, Boosting, SVM).
3. Speed or training time: always remember that higher accuracy and larger
datasets mean longer training time. Examples of algorithms that are easy to run and
implement are Naïve Bayes and Linear and Logistic Regression. Examples of algorithms
that need more time to train are SVM, Neural Networks and Random Forests.
4. Linearity: first check the linearity of your data by fitting a linear line or by
running a logistic regression; you can also check the residual errors. Higher errors
mean that the data is not linear and needs complex algorithms to fit. If the data is linear,
you can choose Linear Regression, Logistic Regression or Support Vector Machines. If it is
non-linear: Kernel SVM, Random Forest, Neural Nets.
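As referenced earlier, here is a minimal sketch of the linearity check in step 4: compare a restrictive linear model against a more flexible one on the same data and see which generalizes better. The synthetic dataset and the choice of scikit-learn estimators are assumptions for illustration.

```python
# Compare a linear model against a flexible model on the same data.
# The synthetic classification dataset is an illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

linear_score = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
flexible_score = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5).mean()

print(f"Logistic regression accuracy: {linear_score:.3f}")
print(f"Random forest accuracy:       {flexible_score:.3f}")
# If the flexible model is clearly better, the data is probably not linear.
```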
Parametric vs. Non-Parametric Machine Learning Models
Parametric Machine Learning Algorithms
Parametric ML algorithms are algorithms that simplify the mapping function to a known form. They are
often called the "Linear ML Algorithms".
Parametric ML Algorithms
• Logistic Regression
• Linear Discriminant Analysis
• Perceptron
• Naïve Bayes
• Simple Neural Networks
Benefits of Parametric ML Algorithms
• Simpler: easy-to-understand methods and easy-to-interpret results
• Speed: very fast to learn from the data provided
• Less data: they do not require as much training data
Limitations of Parametric ML Algorithms
• Limited complexity: suited only to simpler problems
• Poor fit: the methods are unlikely to match the underlying mapping function
Non-Parametric Machine Learning Algorithms
Non-parametric ML algorithms are algorithms that do not make assumptions about the form
of the mapping function. They are good to use when you have a lot of data, no prior knowledge,
and you don't want to worry too much about choosing the right features. (A brief comparison
sketch follows this section.)
Non-Parametric ML Algorithms
• K-Nearest Neighbors (KNN)
• Decision Trees like CART
• Support Vector Machines (SVM)
Benefits of Non-Parametric ML Algorithms
• Flexibility: capable of fitting a large number of functional forms
• Power: they make no assumptions about the underlying function
• Performance: able to give higher-performance models for predictions
Limitations of Non-Parametric ML Algorithms
• Needs more data: requires a large training dataset
• Slower processing: they often have far more parameters, which means that training time
is much longer
• Overfitting: there is a higher risk of overfitting the training data, and it is harder to
explain why specific predictions were made.
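As referenced above, here is a brief sketch contrasting a parametric model (a fixed number of learned coefficients) with a non-parametric one (predictions computed from the stored training data); the synthetic dataset is an assumption for illustration.

```python
# Contrast a parametric model (fixed number of learned coefficients) with a
# non-parametric one (relies on the stored training samples at prediction time).
# The synthetic dataset is an illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

parametric = LogisticRegression(max_iter=1000).fit(X, y)
print("Learned coefficient shape:", parametric.coef_.shape)  # (1, 5) regardless of data size

non_parametric = KNeighborsClassifier(n_neighbors=5).fit(X, y)
# KNN makes no assumption about the mapping function; predictions are computed
# from the stored training samples at query time, so "training" is mostly storage.
print("KNN accuracy on training data:", non_parametric.score(X, y))
```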
In the process flow above, Data Modeling is broken down into four tasks, each with its
projected outcome or output. Simply put, the Data Modeling phase consists of the following tasks:
1. Selecting modeling techniques
The wonderful world of data mining offers lots of modeling techniques, but not all of them
will suit your needs. Narrow the list based on the kinds of variables involved, the selection of
techniques available in your tools, and any business considerations that are important to you.
For example, many organizations favour methods with output that's easy to interpret, so
decision trees or logistic regression might be acceptable, but neural networks would probably
not be accepted.
Deliverables for this task include two reports:
• Modeling technique: Specify the technique(s) that you will use.
• Modeling assumptions: Many modeling techniques are based on certain
assumptions. For example, a model type may be intended for use with data that has a
specific type of distribution. Document these assumptions in this report.

2. Designing tests

The test in this task is the test that you'll use to determine how well your model works. It may
be as simple as splitting your data into a group of cases for model training and another group
for model testing.
Training data is used to fit mathematical forms to the data model, and test data is used during
the model-training process to avoid overfitting: making a model that's perfect for one dataset,
but no other. You may also use holdout data, data that is not used during the model-training
process, for an additional test.


The deliverable for this task is your test design. It need not be elaborate, but you should at least
take care that your training and test data are similar and that you avoid introducing any bias
into the data.
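A minimal sketch of such a test design, assuming scikit-learn and a synthetic dataset, might split the data into training, test and holdout sets like this (the 60/20/20 proportions are an illustrative assumption):

```python
# Split the data into training, test and holdout sets for the test design.
# The synthetic dataset and the 60/20/20 split are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, random_state=1)

# First carve out 20% as holdout data the model never sees during training.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(X, y, test_size=0.2, random_state=1)

# Then split the remainder into training and test sets (75/25 of the rest).
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_train), len(X_test), len(X_holdout))  # 600 200 200
```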
3. Building model(s)

Modeling is what many people imagine to be the whole job of the data miner, but it's just one
task of dozens! Nonetheless, modeling to address specific business goals is the heart of the
data-mining profession.
Deliverables for this task include three items:
• Parameter settings: When building models, most tools give you the option of
adjusting a variety of settings, and these settings have an impact on the structure of the
final model. Document these settings in a report.
• Model descriptions: Describe your models. State the type of model (such as linear
regression or neural network) and the variables used. Explain how the model is
interpreted. Document any difficulties encountered in the modeling process.
• Models: This deliverable is the models themselves. Some model types can be easily
defined with a simple equation; others are far too complex and must be transmitted in
a more sophisticated format.

4. Assessing model(s)

Now you will review the models that you've created, from a technical standpoint and also from
a business standpoint (often with input from business experts on your project team). A short
sketch of ranking candidate models follows the deliverables list below.
Deliverables for this task include two reports:
• Model assessment: Summarizes the information developed in your model review. If
you have created several models, you may rank them based on your assessment of their
value for a specific application.
• Revised parameter settings: You may choose to fine-tune settings that were used to
build the model and conduct another round of modeling and try to improve your results.
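As mentioned above, here is a minimal sketch of ranking several candidate models for the model assessment report; the candidate estimators and the synthetic data are assumptions for illustration.

```python
# Rank several candidate models by cross-validated accuracy for the
# model assessment report. Candidates and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=12, random_state=7)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=7),
    "naive_bayes": GaussianNB(),
}

scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```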
VALIDATION:
Why data validation?
Data validation happens immediately after data preparation/wrangling and before
modeling. This is because during data preparation there is a high possibility of things going wrong,
especially in complex scenarios.
Data validation ensures that modeling happens on the right data. Faulty data as input to
the model would generate faulty insights!
How is data validation done?
Data validation should be done by involving at least one external person who has a
proper understanding of the data and the business.
It is usually the client who is technically good enough to check the data. Once we go through
data preparation, and just before data modeling, we usually create data visualizations and give
the newly prepared data to the client.
The client, with the help of SQL queries or other tools, tries to validate that the output
contains no errors.
Combining CRISP-DM/ASUM-DM with the agile methodology, steps can be taken in
parallel, meaning you do not have to wait for the green light from data validation to do the
modeling. But once you get feedback from the domain expert that there are faults in the data,
you need to correct the data by re-doing the data preparation and then re-model the data.
What are the common causes leading to a faulty output from data preparation?
Common causes are:
1. Lack of proper understanding of the data, so that the logic of the data preparation
is not correct.
2. Common bugs in programming/data preparation pipeline that led to a faulty output.
EVALUATION:
The evaluation phase includes three tasks. These are
• Evaluating results
• Reviewing the process
• Determining the next steps

Task: Evaluating results


At this stage, you'll assess the value of your models for meeting the business goals that started
the data-mining process. You'll look for any reasons why the model would not be satisfactory
for business use. If possible, you'll test the model in a practical application, to determine
whether it works as well in the workplace as it did in your tests.
Deliverables for this task include two items:
• Assessment of results (for business goals): Summarize the results with respect to the
business success criteria that you established in the business-understanding phase.
Explicitly state whether you have reached the business goals defined at the start of the
project.
• Approved models: These include any models that meet the business success criteria.

Task: Reviewing the process


Now that you have explored data and developed models, take time to review your process. This
is an opportunity to spot issues that you might have overlooked and that might draw your
attention to flaws in the work that you've done while you still have time to correct the problem
before deployment. Also consider ways that you might improve your process for future
projects.
The deliverable for this task is the review of process report. In it, you should outline your
review process and findings and highlight any concerns that require immediate attention, such
as steps that were overlooked or that should be revisited.
Task: Determining the next steps


The evaluation phase concludes with your recommendations for the next move. The model
may be ready to deploy, or you may judge that it would be better to repeat some steps and try
to improve it. Your findings may inspire new data-mining projects.
Deliverables for this task include two items:
• List of possible actions: Describe each alternative action, along with the strongest
reasons for and against it.
• Decision: State the final decision on each possible action, along with the reasoning
behind the decision.

INTERPRETATION
Data interpretation is the process of assigning meaning to the collected information and
determining the conclusions, significance and implications of the findings.

Data Interpretation Examples


Data interpretation is the final step of data analysis. This is where you turn results into
actionable items. To better understand it, here is an instance of interpreting data:
Let's say you've segmented your user base into four age groups. A company can then notice
which age group is most engaged with its content or product. Based on bar charts or pie charts,
it can either develop a marketing strategy to make the product more appealing to non-involved
groups or develop an outreach strategy that expands on its core user base.
Steps Of Data Interpretation
Data interpretation is conducted in 4 steps:
• Assembling the information you need (like bar graphs and pie charts);
• Developing findings or isolating the most relevant inputs;
• Developing conclusions;
• Coming up with recommendations or actionable solutions.
Considering how these findings dictate the course of action, data analysts must be accurate
with their conclusions and examine the raw data from multiple angles. Different variables may
allude to various problems, so having the ability to backtrack data and repeat the analysis
using different templates is an integral part of a successful business strategy.
What Should Users Question During Data Interpretation?
To interpret data accurately, users should be aware of potential pitfalls present within this
process. You need to ask yourself if you are mistaking correlation for causation. If two things
occur together, it does not indicate that one caused the other.
The 2nd thing you need to be aware of is your own confirmation bias. This occurs when you
try to prove a point or a theory and focus only on the patterns or findings that support that
theory while discarding those that do not.
The 3rd problem is irrelevant data. To be specific, you need to make sure that the data you
have collected and analyzed is relevant to the problem you are trying to solve.

Data Interpretation Methods


Data analysts or data analytics tools help people make sense of the numerical data that has been
aggregated, transformed, and displayed. There are two main methods for data interpretation:
quantitative and qualitative.
Qualitative Data Interpretation Method
This is a method for breaking down or analyzing so-called qualitative data, also known as
categorical data. It is important to note that no bar graphs or line charts are used in this method.
Instead, it relies on text. Because qualitative data is collected through person-to-person
techniques, it isn't easy to present using a numerical approach.
Surveys are used to collect data because they allow you to assign numerical values to answers,
making them easier to analyze. If we rely solely on the text, it would be a time-consuming and
error-prone process. This is why it must be transformed.
Quantitative Data Interpretation Method
This data interpretation method is applied when we are dealing with quantitative or numerical data.
Since we are dealing with numbers, the values can be displayed in a bar chart or pie chart.
There are two main types: Discrete and Continuous. Moreover, numbers are easier to analyze
since they involve statistical modeling techniques like mean and standard deviation.
Mean is an average value of a particular data set obtained or calculated by dividing the sum of
the values within that data set by the number of values within that same set.
Standard deviation is a technique used to ascertain how responses align with or deviate
from the average value, or mean. It relies on the mean to describe the consistency of the
replies within a particular data set. You can use it when calculating the average pay for a
certain profession and then displaying the upper and lower values in the data set.
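A minimal sketch of these two statistics, using made-up salary figures and NumPy, might look like this:

```python
# Compute the mean and standard deviation of a small, made-up salary sample.
import numpy as np

salaries = np.array([52000, 58000, 61000, 49000, 75000], dtype=float)

mean = salaries.mean()        # sum of values divided by their count
std = salaries.std(ddof=1)    # sample standard deviation

print(f"Mean salary: {mean:.0f}")
print(f"Typical range: {mean - std:.0f} to {mean + std:.0f}")  # lower and upper values
```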
As stated, some tools can do this automatically, especially when it comes to quantitative data.
Whatagraph is one such tool, as it can aggregate data from multiple sources using different
system integrations. It will also automatically organize and analyze the data, which can later be
displayed in pie charts, line charts or bar charts, however you wish.
Benefits Of Data Interpretation


Multiple data interpretation benefits explain its significance within the corporate world,
medical industry, and financial industry:
Informed decision-making
The managing board must examine the data to take action and implement new methods. This
emphasizes the significance of well-analyzed data as well as a well-structured data collection
process.
Anticipating needs and identifying trends
Data analysis provides users with relevant insights that they can use to forecast trends, based
on customer concerns and expectations.
For example, a large number of people are concerned about privacy and the leakage of
personal information. Products that provide greater protection and anonymity are more likely
to become popular.
Clear foresight
Companies that analyze and aggregate data better understand their own performance and how
consumers perceive them. This provides them with a better understanding of their
shortcomings, allowing them to work on solutions that will significantly improve their
performance.

DEPLOYMENT AND ITERATIONS:


The deployment phase includes four tasks. These are
• Planning deployment (your methods for integrating data-mining discoveries into use)
• Planning monitoring and maintenance
• Reporting final results
• Reviewing final results

Task: Planning deployment


When your model is ready to use, you will need a strategy for putting it to work in your
business.
The deliverable for this task is the deployment plan. This is a summary of your strategy for
deployment, the steps required, and the instructions for carrying out those steps.

Task: Planning monitoring and maintenance


Data-mining work is a cycle, so expect to stay actively involved with your models as they are
integrated into everyday use.
The deliverable for this task is the monitoring and maintenance plan. This is a summary of your
strategy for ongoing review of the model's performance. You'll need to ensure that it is being
used properly on an ongoing basis, and that any decline in model performance will be detected.
Task: Reporting final results
Deliverables for this task include two items:
• Final report: The final report summarizes the entire project by assembling all the
reports created up to this point, and adding an overview summarizing the entire project
and its results.
• Final presentation: A summary of the final report is presented in a meeting with
management. This is also an opportunity to address any open questions.
Task: Review project
Finally, the data-mining team meets to discuss what worked and what didn't, what would be
good to do again, and what should be avoided!
This step, too, has a deliverable, although it is only for the use of the data-mining team, not the
manager (or client). It's the experience documentation report.
This is where you should outline any work methods that worked particularly well, so that they
are documented to use again in the future, and any improvements that might be made to your
process. It's also the place to document problems and bad experiences, with your
recommendations for avoiding similar problems in the future.
Iterations are done to upgrade the performance of the system.
The outcomes of the decisions, actions and conclusions drawn from the model are documented
and updated in the database. This helps in changing and upgrading the performance of the
existing system.

Some queries are updated in the database, such as "Were the decision and action impactful?",
"What was the return on investment?" and "How did the analysis group compare with the
control group?". The performance-based database is continuously updated once new
insight or knowledge is extracted.
