Ba Unit 1a
2 MARK
What is data?
Data in business analytics refers to the information that is collected, processed, and analyzed
to gain insights and make informed decisions. It is the collective information related to a
company and its operations. This can include any statistical information, raw analytical data,
customer feedback data, sales numbers and other sets of information.
1. Discovery
This first phase involves getting the context around your problem: you need to know what
problem you are solving and what business outcomes you wish to see.
You should begin by defining your business objective and the scope of the work. Work out
which data sources will be available and useful to you (for example, Google Analytics,
Salesforce, your customer support ticketing system, or any marketing campaign information
you might have available), and perform a gap analysis comparing the data required to solve
your business problem with the data you actually have, working out a plan to get any data you
still need.
Once your objective has been identified, you should formulate an initial hypothesis. Design
your analysis so that it will determine whether to accept or reject this hypothesis. Decide in
advance what the criteria for accepting or rejecting the hypothesis will be to ensure that your
analysis is rigorous and follows the scientific method.
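To make the accept/reject criterion concrete, here is a minimal sketch assuming a hypothetical comparison of daily conversion rates between an old and a new campaign. It uses a two-sample t-test in Python with a significance level chosen in advance; the figures and the 0.05 threshold are placeholders, not part of any real analysis.

# Minimal sketch: fix the accept/reject criterion before running the analysis.
# The conversion figures below are hypothetical placeholders.
from scipy import stats

ALPHA = 0.05  # significance threshold agreed during Discovery

control = [0.12, 0.10, 0.11, 0.13, 0.12, 0.09]  # daily conversion rate, old campaign
variant = [0.15, 0.14, 0.16, 0.13, 0.17, 0.15]  # daily conversion rate, new campaign

t_stat, p_value = stats.ttest_ind(variant, control)

if p_value < ALPHA:
    print(f"p = {p_value:.4f} < {ALPHA}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {ALPHA}: fail to reject the null hypothesis")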
2. Data preparation
In the next stage, you need to decide which data sources will be useful for the analysis, collect
the data from all these disparate sources, and load it into a data analytics sandbox so it can be
used for prototyping.
When loading your data into the sandbox area, you will need to transform it. The two
main types of transformations are preprocessing transformations and analytics
transformations. Preprocessing means cleaning your data to remove things like nulls, defective
values, duplicates, and outliers. Analytics transformations can mean a variety of things, such
as standardizing or normalizing your data so it can be used more effectively with certain
machine learning algorithms, or preparing your datasets for human consumption (for example,
transforming machine labels into human-readable ones, such as “sku123” → “T-Shirt,
brown”).
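As a rough illustration, the following Python sketch (using pandas) shows both kinds of transformation on a made-up dataset; the column names, SKU codes, and label mapping are hypothetical.

# Minimal sketch of preprocessing vs. analytics transformations with pandas.
# Column names and the SKU-to-product mapping are hypothetical placeholders.
import pandas as pd

df = pd.DataFrame({
    "sku":   ["sku123", "sku123", "sku456", None],
    "sales": [120, 120, 95, None],
})

# Preprocessing: remove nulls and exact duplicates.
df = df.dropna().drop_duplicates()

# Analytics transformations: map machine labels to human-readable ones and
# min-max normalize the numeric column for use with ML algorithms later.
label_map = {"sku123": "T-Shirt, brown", "sku456": "Mug, white"}
df["product"] = df["sku"].map(label_map)
df["sales_norm"] = (df["sales"] - df["sales"].min()) / (df["sales"].max() - df["sales"].min())

print(df)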
Depending on whether your transformations take place before or after the loading stage,
this whole process is known as either ETL (extract, transform, load) or ELT (extract, load,
transform). You can set up your own ETL pipeline to deal with all of this, or use an integrated
customer data platform to handle the task all within a unified environment.
It is important to note that the sub-steps detailed here don’t have to take place in separate
systems. For example, if you have all data sources in a data warehouse already, you can simply
use a development schema to perform your exploratory analysis and transformation work in
that same warehouse.
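As a small example of the ETL ordering, the sketch below assumes a hypothetical CSV extract and uses a local SQLite database to stand in for the analytics sandbox; in an ELT setup the raw data would be loaded first and transformed afterwards, typically in SQL.

# Minimal ETL-style sketch: extract from a CSV, transform in pandas, load into
# a local SQLite "sandbox". The file and table names are hypothetical.
import sqlite3
import pandas as pd

raw = pd.read_csv("raw_sales.csv")           # Extract (hypothetical source file)
clean = raw.dropna().drop_duplicates()       # Transform before loading -> ETL

with sqlite3.connect("sandbox.db") as conn:  # Load into the sandbox
    clean.to_sql("sales_clean", conn, if_exists="replace", index=False)

# In ELT, the raw extract is loaded first and the same cleaning is expressed
# as SQL, for example in a development schema of the warehouse.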
3. Model planning
A model in data analytics is a mathematical or programmatic description of the
relationship between two or more variables. It allows us to study the effects of different
variables on our data and to make statistical assumptions about the probability of an event
happening.
The main categories of models used in data analytics are SQL models, statistical
models, and machine learning models. A SQL model can be as simple as the output of a SQL
SELECT statement, and these are often used for business intelligence dashboards. A statistical
model describes the relationship between two or more variables (some data warehouses build
advanced statistical functions for this directly into their SQL processing), and a machine
learning model uses algorithms to recognize patterns in data and must be trained on other data
to do so. Machine learning models are often used when the analyst doesn’t have enough
information to solve the problem with simpler methods.
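For instance, the simplest kind of SQL model is just the output of a SELECT statement that a dashboard reads. The sketch below runs such a query from Python against a local SQLite database; the table and column names are hypothetical.

# Minimal sketch of a "SQL model": the output of a SELECT statement that a
# BI dashboard could display. Table and column names are hypothetical.
import sqlite3
import pandas as pd

conn = sqlite3.connect("sandbox.db")
monthly_revenue = pd.read_sql_query(
    """
    SELECT strftime('%Y-%m', order_date) AS month,
           SUM(amount)                   AS revenue
    FROM   sales
    GROUP  BY month
    ORDER  BY month
    """,
    conn,
)
print(monthly_revenue)
conn.close()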
You need to decide which models you want to test, operationalize, or deploy. To choose
the most appropriate model for your problem, you will need to do an exploration of your
dataset, including some exploratory data analysis to find out more about it. This will help guide
you in your choice of model because your model needs to answer the business objective that
started the process and work with the data available to you.
Do you want the outcome to be qualitative or quantitative? If your question expects a
quantitative answer (for example, “How many sales are forecast for next month?” or “How
many customers were satisfied with our product last month?”) then you should use a regression
model. However, if you expect a qualitative answer (for example, “Is this email spam?”, where
the answer can be Yes or No, or “Which of our five products are we likely to have the most
success in marketing to customer X?”), then you may want to use a classification or clustering
model.
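To illustrate the difference, here is a small scikit-learn sketch: a regression model for a quantitative question and a classification model for a qualitative one. The tiny training sets are made-up placeholders.

# Minimal sketch: regression for a quantitative answer, classification for a
# qualitative one. The training data below is a made-up placeholder.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Quantitative: "How many sales are forecast for next month?"
months = [[1], [2], [3], [4], [5], [6]]
sales  = [100, 110, 125, 130, 142, 155]
reg = LinearRegression().fit(months, sales)
print("Forecast for month 7:", reg.predict([[7]])[0])

# Qualitative: "Is this email spam?" (features: word count, number of links)
emails  = [[120, 0], [30, 5], [200, 1], [15, 8]]
is_spam = [0, 1, 0, 1]
clf = LogisticRegression().fit(emails, is_spam)
print("Spam?", bool(clf.predict([[25, 6]])[0]))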
Is accuracy or speed of the model particularly important? If so, check whether your
chosen model will perform well. The size of your dataset will be a factor when evaluating the
speed of a particular model.
Is your data unstructured? Unstructured data cannot be easily stored in either relational or graph
databases and includes free text data such as emails or files. This type of data is most suited to
machine learning.
Have you analyzed the contents of your data? Analyzing the contents of your data can
include univariate analysis or multivariate analysis (such as factor analysis or principal
component analysis). This allows you to work out which variables have the largest effects and
to identify new factors (combinations of existing variables) that have a large impact.
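As an example of multivariate analysis, the sketch below runs principal component analysis on a hypothetical numeric dataset and prints how much of the variance each combined factor explains.

# Minimal PCA sketch on hypothetical data: how much variance do a few
# combined factors (principal components) explain?
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                               # 100 rows, 5 variables
X[:, 3] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=100)   # inject a correlation

pca = PCA(n_components=3)
pca.fit(StandardScaler().fit_transform(X))
print("Explained variance ratio:", pca.explained_variance_ratio_)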
5. Communicating results
You must communicate your findings clearly, and it can help to use data visualizations to
achieve this. Any communication with stakeholders should include a narrative, a list of key
findings, and an explanation of the value your analysis adds to the business. You should also
compare the results of your model with your initial criteria for accepting or rejecting your
hypothesis to explain to them how confident they can be in your analysis.
6. Operationalizing
Once the stakeholders are happy with your analysis, you can execute the same model outside
of the analytics sandbox on a production dataset.
You should monitor the results of this to check if they lead to your business goal being
achieved. If your business objectives are being met, deliver the final reports to your
stakeholders, and communicate these results more widely across the business.
DATA SCIENCE
Data Science is a combination of mathematics, statistics, machine learning, and
computer science. It involves collecting, analyzing, and interpreting data to extract insights
that help decision-makers make informed decisions.
Data Science is used in almost every industry today to predict customer behavior
and trends and to identify new opportunities. Businesses can use it to make informed decisions
about product development and marketing. It is used as a tool to detect fraud and optimize
processes. Governments also use Data Science to improve efficiency in the delivery of public
services.
Nowadays, organizations are overwhelmed with data. Data Science helps to extract
meaningful insights from it by combining various methods, technologies, and tools.
In fields such as e-commerce, finance, medicine, and human resources, businesses come across
huge amounts of data, and Data Science tools and technologies help them process all of it.
In simple terms, Data Science helps to analyze data and extract meaningful insights from it by
combining statistics & mathematics, programming skills, and subject expertise.
DATA COLLECTION
Data collection is the methodological process of gathering information about a specific
subject. It’s crucial to ensure your data is complete during the collection phase and that it’s
collected legally and ethically. In the data life cycle, data collection is the second step. After
data is generated, it must be collected to be of use to your team. After that, it can be processed,
stored, managed, analyzed, and visualized to aid in your organization’s decision-making.
Primary and secondary methods of data collection are two approaches used to gather
information for research or analysis purposes. Let's explore each method in detail:
1. Primary Data Collection:
Primary data collection involves the collection of original data directly from the source or
through direct interaction with the respondents. This method allows researchers to obtain
firsthand information specifically tailored to their research objectives. There are various
techniques for primary data collection, such as surveys, experiments, interviews, and observation.
2. Secondary Data Collection:
Secondary data collection involves using existing data collected by someone else for a purpose
different from the original intent. Researchers analyze and interpret this data to extract relevant
information. Secondary data can be obtained from various sources such as published sources,
government records, and past research studies.
DATA PREPARATION
Data preparation is the sorting, cleaning, and formatting of raw data so that it can be
better used in business intelligence, analytics, and machine learning applications.
Data comes in many formats, but for the purpose of this guide we’re going to focus on data
preparation for the two most common types of data: numeric and textual.
Numeric data preparation is a common form of data standardization. A good example would
be if you had customer data coming in and the percentages are being submitted as both
percentages (70%, 95%) and decimal amounts (.7, .95) – smart data prep, much like a smart
mathematician, would be able to tell that these numbers are expressing the same thing, and
would standardize them to one format.
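A minimal sketch of this kind of standardization, assuming values arrive either as percentage strings or as decimal numbers (the input list is a hypothetical placeholder):

# Standardize values submitted as "70%", "95%", 0.7, or .95 into one decimal format.
def to_decimal(value):
    """Convert percentage strings and decimal numbers to a 0-1 float."""
    if isinstance(value, str) and value.strip().endswith("%"):
        return float(value.strip().rstrip("%")) / 100
    return float(value)

raw = ["70%", 0.95, ".7", "95%"]      # hypothetical mixed-format input
print([to_decimal(v) for v in raw])   # [0.7, 0.95, 0.7, 0.95]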
Textual data preparation addresses a number of grammatical and context-specific text
inconsistencies so that large archives of text can be better tabulated and mined for useful
insights.
Text tends to be noisy: sentences, and the words they are made up of, vary with language,
context, and format (an email vs. a chat log vs. an online review). So, when preparing our text
data, it is useful to ‘clean’ the text by removing repetitive words and standardizing meaning.
For example, if you receive a text input of:
‘My vacuum’s battery died earlier than I expected this Saturday morning’
A very basic text preparation algorithm would omit the unnecessary and repetitive words
leaving you with:
‘Vacuum’s’ [subject] died [active verb] earlier [problem] Saturday morning [time]’
This stripped-down sentence format is now much easier to tabulate and analyze.
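In the same spirit, a very basic cleaning step can be sketched as lowercasing the sentence and dropping words from a small stop-word list (the list below is a hypothetical placeholder, not a standard one):

# Very basic text-preparation sketch: lowercase and drop common stop words,
# keeping the informative terms. The stop-word list is a hypothetical placeholder.
STOP_WORDS = {"my", "than", "i", "expected", "this"}

def clean(text):
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(clean("My vacuum's battery died earlier than I expected this Saturday morning"))
# ["vacuum's", 'battery', 'died', 'earlier', 'saturday', 'morning']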
1. Gather data
The data preparation process begins with finding the right data. This can come from an existing
data catalog, or data sources can be added ad hoc.
Once data has been cleansed, it must be validated by testing for errors in the data preparation
process up to this point. Often, an error in the system will become apparent during this
validation step and will need to be resolved before moving forward.
That’s where self-service data preparation tools like Talend Data Preparation come in. Cloud-
native platforms with machine learning capabilities simplify the data preparation process. This
means that data scientists and business analysts can focus on analyzing data instead of just
cleaning it.
But it also allows business professionals who may lack advanced IT skills to run the process
themselves. This makes data preparation more of a team sport, rather than a drain on valuable
IT resources and cycles.
To get the best value out of a self-service data preparation tool, look for a platform with
design and productivity features like automatic documentation, versioning, and
operationalizing into ETL processes. Popular self-service data preparation tools include:
1. Talend
Talend’s self-service data preparation tool is a fast and accessible first step for any business
seeking to improve its data prep approach. And they offer a series of informative basic guides
to data prep!
2. OpenRefine
Combining a powerful no-code GUI with easy Python compatibility,
OpenRefine is a favorite for no-code users and Python literates alike. Regardless of your coding
skill level, its complex data-filtering capacity can be a boon to any business. Plus, it’s free.
3. Paxata
Alternatively, Paxata offers a sophisticated, ‘data governing’ approach to data
preparation, promising to clean and effectively govern datasets at scale.