Data Analytics Process
Given the considerable amount of data industries collect nowadays, they need to adopt
the right analytics strategies for better decision-making. In this conceptual blog, we will first
build your understanding of the data analysis process and then explain each of its steps in
depth.
Data analysis is the practice of turning historical data into meaningful insights for better
decision-making, using techniques such as statistical analysis and data visualizations for
storytelling. Let's apply the complete data analysis process to the following data analytics
project for better understanding.
Imagine an insurance company whose business model is to decide whether or not to
compensate its clients, based on the type of insurance they have subscribed to (auto or home)
and the detailed brief submitted to support their claims.
The company has noticed a 30% customer churn over the past few months. Realizing this
issue, it seeks a data analyst's expertise to identify the root cause of the problem so that it
does not keep losing customers. The manager suspects the churn is due to the time agents
take to process clients' requests.
The job of a data analyst is to understand the business problem, collect the appropriate
data, and process and explore it to extract the information that helps the insurance
company make smart business decisions.
As a data analyst, you might find it challenging to make the best use of your data. Following
the data analysis process and best practices for each new or existing data analysis project will
help you make the most out of the data for the business.
Data Analysis Process Step 1 - Define and Understand the Business Problem
In the use case, the company stated that the delay in request processing might cause customer
churn. This is not the exact problem but a statement. The goal of a data analyst in the first
step of the data analysis process is to get a clarification on the problem from the business. To
do so, data analysts schedule a meeting with the following people from the Business and the
Data Consulting team.
Business Team
Head of the Insurance Company, who is responsible for the coordination of both auto
and home insurance departments.
Here's how the discussion between the Business and Data Consulting teams could proceed:
Business team: We want to know why we are currently facing this level of customer churn.
Data team: Currently facing, meaning you did not have this problem in the past?
Business team: No, because in the past we only had the auto insurance department.
Data team: Could you please describe the request processing workflow?
Business team: The customers send their request, we check that all the required documents
are complete, and only then do we proceed.
Data team: How did you staff the teams when you added the home insurance department?
Business team: We just trained some people from auto insurance to join the new department.
etc...
At the end of such a discussion, the Data team develops a better understanding of the
business problem and can adopt analytics strategies that facilitate the process.
Avoiding as much technical jargon as possible during this phase is also important. Your goal
is to harness your soft skills and domain knowledge as much as you can for a smooth
discussion with the business.
Understanding a business problem also includes defining Key Performance Indicators (KPIs)
to track the performance of the deliverables. Different licensed and open-source tools exist
for this, as shown below:
Tools Description
A free visualization tool that can store real-time metrics; it is handy when
dealing with time-series use cases.
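Before reaching for a dedicated tool, the churn KPI itself is simple to compute. Here is a minimal pandas sketch; the column names (`month`, `client_id`, `churned`) and the sample figures are illustrative assumptions, not data from the use case.

```python
import pandas as pd

# Hypothetical monthly client snapshots: one row per client per month,
# with a flag marking whether the client cancelled during that month.
snapshots = pd.DataFrame({
    "month": ["2023-01"] * 4 + ["2023-02"] * 4,
    "client_id": [1, 2, 3, 4, 1, 2, 3, 5],
    "churned": [0, 0, 1, 0, 0, 1, 0, 0],
})

def monthly_churn_rate(df: pd.DataFrame) -> pd.Series:
    """Share of clients lost in each month."""
    return df.groupby("month")["churned"].mean()

rates = monthly_churn_rate(snapshots)
```

Tracking this one number per month is exactly what lets the team say "30% churn over the past few months" with confidence, and later verify whether their fixes move it.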
Data Analysis Process Step 2 - Collect the Right Data
Once the data analyst understands the business problem, the next step is to take inventory of
the existing information and collect a data set that fits the analytics use case. This can be
first-party data, third-party data, or data from open data repositories.
First-party data is the data accessible within the company; third-party data is the data the
company buys from external sources.
The collected data must be legally and technically exploitable, reliable, and sufficiently up
to date with respect to the stated problem.
We can imagine that we have the following sources of data available for our use case:
Requests' statistics
o Request conversion rate: the number of clients' requests that made it to the next
step after the first submission.
Clients' attributes
o Each client's feedback on the handling of their previous requests.
Insurance data
This data gathered by the Data Engineer is then used further in the data analysis process by
Data Analysts and Data Scientists.
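Once gathered, these sources typically need to be joined into one analysis table keyed on the client. The following sketch assumes tiny stand-ins for the sources; the column names (`client_id`, `conversion_rate`, `feedback`, `department`) are invented for illustration.

```python
import pandas as pd

# Illustrative stand-ins for the collected data sources.
requests = pd.DataFrame({"client_id": [1, 2, 3],
                         "conversion_rate": [0.8, 0.5, None]})
clients = pd.DataFrame({"client_id": [1, 2, 3],
                        "feedback": ["ok", "slow", "slow"]})
insurance = pd.DataFrame({"client_id": [1, 2, 3],
                          "department": ["auto", "home", "home"]})

# Left-join everything onto the client table so each row describes
# one client across all sources.
dataset = (clients
           .merge(requests, on="client_id", how="left")
           .merge(insurance, on="client_id", how="left"))
```

A left join keeps every client even when a source has no matching row, which surfaces missing data early instead of silently dropping clients.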
Commonly Used Data Collection and Storage Tools in the Data Analysis Process
The Data Engineer is responsible for creating the right data pipelines to gather and store these
data in a data warehouse or a data lake using different big data technologies such as Scala,
PostgreSQL, Python, etc.
Tools Description
Scala: One of the main reasons for using Scala is its ability to provide
parallelization features for processing large data sets, which can be
very useful when collecting data from multiple sources.
Data Analysis Process Step 3 - Clean the Data
Data cleaning is one of the major steps in the data analysis process, and a data analyst
spends around 70 to 90% of their time on it. This step takes that much time because
high-quality data brings benefits across the organization, such as:
Easier decision-making, because the correct key performance indicators can be created
from the raw data.
Improved team productivity, because teams no longer need to allocate time to dealing
with incorrect data.
Handling missing values
o In our data analytics process example, we can replace the missing request
conversion rate with the median value specific to each department.
Normalizing variables
o In the insurance data, the auto and home departments can require the same ID
document under different names, ID_auto and ID_home, which can be normalized to a
single ID.
Replacing dates with durations, to know how long each client has been using the
company's service and how long each agent has been in a specific department.
o The client's age when subscribing to the company's insurance service for the
first time.
o The total number of requests made by each client.
o The agents' arrival date can be replaced by their seniority: the longer the
period, the more senior the agent.
Cleaning textual data
o The clients' raw textual data might contain some grammatical errors, so
running it through a spell-checking step can help standardize it.
Data cleaning can be done using programming languages such as Python, R, etc. The
previous list of operations is not exhaustive but specific to our use case, for a better
understanding of the process.
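As a minimal Python sketch of the cleaning operations above, assuming a toy claims table whose column names (`department`, `ID_auto`, `ID_home`, `conversion_rate`, `subscription_date`) are invented for illustration:

```python
import pandas as pd

# Toy claims table; all column names are illustrative assumptions.
df = pd.DataFrame({
    "department": ["auto", "auto", "home", "home"],
    "ID_auto": ["A1", "A2", None, None],
    "ID_home": [None, None, "H1", "H2"],
    "conversion_rate": [0.9, None, 0.4, 0.6],
    "subscription_date": pd.to_datetime(
        ["2020-01-01", "2021-06-01", "2022-03-01", "2019-09-01"]),
})

# 1. Impute missing conversion rates with the per-department median.
df["conversion_rate"] = (df.groupby("department")["conversion_rate"]
                           .transform(lambda s: s.fillna(s.median())))

# 2. Normalize ID_auto / ID_home into a single ID column.
df["ID"] = df["ID_auto"].fillna(df["ID_home"])
df = df.drop(columns=["ID_auto", "ID_home"])

# 3. Replace the subscription date by a tenure in days.
today = pd.Timestamp("2023-01-01")  # fixed reference date for the example
df["tenure_days"] = (today - df["subscription_date"]).dt.days
df = df.drop(columns=["subscription_date"])
```

Note that `groupby(...).transform` keeps the original row order and index, so the imputed column drops straight back into the frame.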
There are many tools for data cleaning, but the focus here is on the open-source ones, as
shown below.
Tools Description
A distributed processing system used by data scientists to reduce the
cost and time required for the Extract, Transform and Load (ETL)
process, thanks to its ability to deal with several petabytes of data at a time.
Data Analysis Process Step 4 - Analysing the Data for Interpretations and Insights
A data analyst is likely to feel relieved once done with cleaning the data. Now comes the
time to exercise curiosity, analytical skills, and data storytelling, using different data
visualization tools and techniques and statistical analysis approaches to answer the business
problem appropriately.
The data analysis process you will go through depends on the business problem you are
trying to solve. Most business problems fall into the following five data analysis categories:
What happened? --> Descriptive Analysis
This is, most of the time, the first question the business team wants answered before
diving into any other exploration.
Referring to our use case, the insurance company can use descriptive analytics to understand
what has happened over the past few months, testing hypotheses in order to accept or reject
the null hypothesis, which here corresponds to the insurance manager's claim.
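As an illustration, the manager's claim could be framed as a two-sample test: do churned clients experience longer processing delays than retained ones? The sketch below uses SciPy's Welch t-test on invented delay figures; the numbers are assumptions, not real data.

```python
from scipy import stats

# Hypothetical processing delays (in days) for churned vs retained clients.
churned_delays = [30, 35, 40, 38, 33, 36]
retained_delays = [10, 12, 9, 11, 14, 10]

# H0: mean processing delay is the same for both groups.
# Welch's t-test (equal_var=False) does not assume equal variances.
t_stat, p_value = stats.ttest_ind(churned_delays, retained_delays,
                                  equal_var=False)
reject_null = p_value < 0.05
```

A small p-value here would mean the delay difference is unlikely to be chance, lending statistical weight to the manager's intuition rather than relying on it alone.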
Understanding the distribution of the variables in your data by examining their shape:
whether they are right-skewed, left-skewed, or normally distributed, etc.
Detecting eventual outliers that might exist in the data set, and the relationships between
the different variables.
An efficient understanding of those trends and relationships in the data can guide the tasks
that need to be performed, whether it is clustering, classification, regression analysis, etc.
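Both checks above are a few lines of pandas. The sample below is a hypothetical processing-delay series with one extreme value, used only to illustrate skewness and the common 1.5 x IQR outlier rule:

```python
import pandas as pd

# Illustrative processing-delay sample (days) with one extreme value.
delays = pd.Series([5, 6, 7, 6, 5, 8, 7, 6, 60])

# Shape of the distribution: a positive skew means a long right tail.
skewness = delays.skew()

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = delays.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = delays[(delays < q1 - 1.5 * iqr) | (delays > q3 + 1.5 * iqr)]
```

Spotting a heavy right tail like this in real delay data would immediately suggest that a minority of very slow requests, not the average request, may be driving client frustration.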
Once the data analyst has an idea of the task, they can proceed with the scientific literature
review phase, which aims to benchmark state-of-the-art machine learning, artificial
intelligence, or even statistical solutions for the use case.
Data Analysis Process Step 5 - Communicate the Findings
Data analysts can communicate their findings to the business using different business
analytics solutions and open-source tools.
In our use case, the data analyst might conclude that the customer churn is due to the delay
created during pre-processing.
However, the analysis might also surface additional facts beyond the insurance manager's
direct observation:
(1) agents spend more time checking the completeness of the requests' documents
instead of focusing on analysing whether a given request is worth the compensation.
These findings mean that the Data team needs to provide the business team with the right
recommendations to mitigate customer churn.
Data Analysis Process Step 6 - Build and Deploy the Model
Choosing the right model depends on the results of the data analysis. Failing to do so will
ultimately lead to choosing the wrong models.
As a data analyst, you can make the following recommendations to address the previously
identified facts. In addition, a new discussion will be required to set the key success and
performance indicators for the data analysis project.
The document completion issue might be solved by creating a conversational chatbot agent
that focuses on the following actions:
Checking the documents attached to each request, and instantly notifying the clients
whether the list of requested documents is complete or not.
Once the documents are complete, a second machine learning model is responsible for
submitting the request to the right department when its confidence score satisfies a given
threshold defined by the business team.
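The routing rule itself is simple to express. This is a minimal sketch, assuming a threshold of 0.85 (the actual value would be set by the business team) and a hypothetical `route_request` helper:

```python
# Minimal sketch of the routing rule: a request whose predicted department
# comes with a confidence score above the business-defined threshold is
# submitted automatically; anything else goes to a human agent.
THRESHOLD = 0.85  # assumed value; the business team defines the real one

def route_request(predicted_department: str, confidence: float) -> str:
    if confidence >= THRESHOLD:
        return f"submit:{predicted_department}"
    return "review:human_agent"
```

Keeping the threshold as a named, business-owned parameter matters: raising it trades automation volume for fewer misrouted requests, and that trade-off is a business decision, not a modelling one.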
Deploying such a solution also requires considering:
The infrastructure that will host the model, as well as its dependencies on existing
applications.
Change management, to identify how the current team will efficiently and comfortably
interact with the model.
Data Analysis Process Step 7 - Monitor the Model Performance
Machine learning models are not traditional applications, so monitoring their performance
over time is crucial. You can collect user and business feedback to improve them.
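A minimal sketch of such monitoring, assuming a hypothetical `AccuracyMonitor` that tracks a rolling accuracy over recent predictions and flags when it falls below a floor:

```python
from collections import deque

class AccuracyMonitor:
    """Track rolling accuracy over the most recent predictions."""

    def __init__(self, window: int = 100, floor: float = 0.8):
        self.results = deque(maxlen=window)  # keeps only the last `window` outcomes
        self.floor = floor

    def record(self, prediction, actual) -> None:
        self.results.append(prediction == actual)

    def accuracy(self) -> float:
        return sum(self.results) / len(self.results)

    def needs_attention(self) -> bool:
        return self.accuracy() < self.floor

# Illustrative run: 3 correct predictions out of 5 trips the alert.
monitor = AccuracyMonitor(window=5, floor=0.8)
for pred, actual in [(1, 1), (0, 0), (1, 0), (1, 0), (0, 0)]:
    monitor.record(pred, actual)
```

The sliding window is the point: a model can look fine on lifetime accuracy while quietly degrading on recent data, and only the windowed view catches that drift.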
We hope this article has given you a complete overview of the data analysis lifecycle. There
might be more or fewer steps from one data analysis project to another, but a data analyst
will likely come across at least the first five steps when solving a real-world business
problem. Use the complete data analytics project plan template to help you efficiently plan
your next data analysis project.