
CCW331 - BUSINESS ANALYTICS

COURSE OBJECTIVES:
∙ To understand the Analytics Life Cycle.
∙ To comprehend the process of acquiring Business Intelligence.
∙ To understand the various types of analytics used for Business Forecasting.
∙ To model supply chain management for analytics.
∙ To apply analytics to the different functions of a business.

UNIT I INTRODUCTION TO BUSINESS ANALYTICS 6


Analytics and Data Science – Analytics Life Cycle – Types of Analytics – Business Problem Definition –
Data Collection – Data Preparation – Hypothesis Generation – Modeling – Validation and Evaluation –
Interpretation – Deployment and Iteration

UNIT II BUSINESS INTELLIGENCE 6


Data Warehouses and Data Mart - Knowledge Management –Types of Decisions - Decision Making Process
- Decision Support Systems – Business Intelligence –OLAP – Analytic functions

UNIT III BUSINESS FORECASTING 6


Introduction to Business Forecasting and Predictive analytics - Logic and Data Driven Models –Data Mining
and Predictive Analysis Modelling –Machine Learning for Predictive analytics.

UNIT IV HR & SUPPLY CHAIN ANALYTICS 6


Human Resources – Planning and Recruitment – Training and Development - Supply chain network -
Planning Demand, Inventory and Supply – Logistics – Analytics applications in HR & Supply Chain -
Applying HR Analytics to make a prediction of the demand for hourly employees for a year.

UNIT V MARKETING & SALES ANALYTICS 6


Marketing Strategy, Marketing Mix, Customer Behaviour – Selling Process – Sales Planning – Analytics
applications in Marketing and Sales - Predictive analytics for customers' behaviour in marketing and sales.

30 PERIODS

UNIT I INTRODUCTION TO BUSINESS ANALYTICS


Analytics and Data Science – Analytics Life Cycle – Types of Analytics – Business
Problem Definition – Data Collection – Data Preparation – Hypothesis Generation –
Modeling – Validation and Evaluation – Interpretation – Deployment and Iteration

INTRODUCTION
ANALYTICS AND DATA SCIENCE

ANALYTICS:

The word analytics has come to the foreground in the last decade or so. The
growth of the internet and information technology has made analytics highly
relevant in the current age. Analytics is a field which combines data, information
technology, statistical analysis, quantitative methods and computer-based models
into one.

All of these are combined to provide decision makers with all the possible
scenarios, so that they can make well-thought-out, researched decisions. The
computer-based models ensure that decision makers are able to see the
performance of a decision under various scenarios.

Meaning

Business analytics (BA) is a set of disciplines and technologies for solving
business problems using data analysis, statistical models and other quantitative
methods. It involves an iterative, methodical exploration of an organization's data,
with an emphasis on statistical analysis, to drive decision-making.

At its core, business analytics involves a combination of the following:


● identifying new patterns and relationships with data mining;

● using quantitative and statistical analysis to design business models;

● conducting A/B and multi-variable testing based on findings;

● forecasting future business needs, performance, and industry trends with predictive modelling; and

● communicating your findings in easy-to-digest reports to colleagues, management, and customers.

Definition

⮚ Business analytics (BA) refers to the skills, technologies, and practices for
continuous iterative exploration and investigation of past business
performance to gain insight and drive business planning. Business
analytics focuses on developing new insights and understanding of
business performance based on data and statistical methods.

⮚ Business Analytics is the process of transforming data into insights to
improve business decisions. Data management, data visualization,
predictive modelling, data mining, forecasting, simulation, and
optimization are some of the tools used to create insights from data.
❖ Scope of Business Analytics

Business analytics has a wide range of applications and uses. It can be used
for descriptive analysis, in which data is utilized to understand the past and present
situation. This kind of descriptive analysis is used to assess the current market position
of the company and the effectiveness of previous business decisions.

It is used for predictive analysis, which is typically used to forecast future business
performance.

Business analytics is also used for prescriptive analysis, which is utilized to
formulate optimization techniques for stronger business performance.

For example, business analytics is used to determine the pricing of various products
in a departmental store based on past and present information.
❖ How business analytics works

Before any data analysis takes place, BA starts with several foundational processes:
● Determine the business goal of the analysis.
● Select an analysis methodology.
● Get business data to support the analysis, often from various systems and sources.
● Cleanse and integrate data into a single repository, such as a data
warehouse or data mart.

❖ Need/Importance of Business Analytics

▪ Business analytics is a methodology and a tool for making sound
commercial decisions. Hence it impacts the functioning of the whole
organization. Business analytics can therefore help improve the profitability
of the business, increase market share and revenue, and provide better
returns to shareholders.
▪ Facilitates better understanding of available primary and secondary data,
which again affect operational efficiency of several departments.
▪ Provides a competitive advantage to companies. In this digital age the flow of
information is almost equal for all players; it is how this information is
utilized that makes a company competitive. Business analytics combines
available data with various well-thought-out models to improve business
decisions.
▪ Converts available data into valuable information. This information can be
presented in any required format, comfortable to the decision maker.

For starters, business analytics is the tool your company needs to make
accurate decisions. These decisions are likely to impact your entire organization as
they help you to improve profitability, increase market share, and provide a
greater return to potential shareholders.

While some companies are unsure what to do with large amounts of data,
business analytics works to turn this data into actionable insights that improve
the decisions you make as a company.

Essentially, the four main ways business analytics is important, no matter the industry, are:
▪ Improves performance by giving your business a clear picture of what is
and isn’t working
▪ Provides faster and more accurate decisions
▪ Minimizes risks as it helps a business make the right choices regarding
consumer behaviour, trends, and performance
▪ Inspires change and innovation by answering questions about the consumer.

❖ Essentials of business analytics


Business analytics has many use cases, but when it comes to commercial
organizations, BA is typically used to:
● Analyze data from a variety of sources. This could be anything from cloud
applications to marketing automation tools and CRM software.
● Use advanced analytics and statistics to find patterns within datasets. These
patterns can help you predict trends in the future and access new insights
about the consumer and their behaviour.
● Monitor KPIs and trends as they change in real-time. This makes it easy
for businesses to not only have their data in one place but to also come to
conclusions quickly and accurately.
● Support decisions based on the most current information. With BA
providing such a vast amount of data that you can use to back up your
decisions, you can be sure that you are fully informed for not one, but
several different scenarios.

DATA SCIENCE

● Data are individual facts, statistics, or items of information, often numeric.


● In a more technical sense, data are a set of values of qualitative or quantitative
variables about one or more persons or objects.
● Data is various kinds of information formatted in a particular way. Therefore,
data collection is the process of gathering, measuring, and analyzing accurate
data from a variety of relevant sources to find answers to research problems,
answer questions, evaluate outcomes, and forecast trends and probabilities.
● Data Science is not a singular field. It is a quantitative field that shares its
background with math, statistics, and computer programming. With the help of
data science, organizations are equipped to make careful, data-driven decisions.

Data Science Lifecycle


The Data Science Lifecycle revolves around the use of machine learning and different analytical
strategies to produce insights and predictions from data in order to achieve a business objective. The
complete process includes a number of steps such as data cleaning, preparation, modelling, and model
evaluation. It is a lengthy procedure and may take several months to complete. So, it is essential to have a
generic structure to follow for each and every problem at hand. The globally recognized structure for
solving any analytical problem is the Cross Industry Standard Process for Data Mining, or CRISP-DM,
framework.

Let us understand the need for Data Science.


Earlier, data was much smaller in volume and generally available in a well-structured form that we could
save easily in Excel sheets and process efficiently with Business Intelligence tools. But today we deal with
enormous amounts of data: roughly 3.0 quintillion bytes of records are produced each and every day, which
ultimately results in an explosion of records and data. According to recent research, it is estimated that about
1.9 MB of data is created every second by a single individual.

So it is a very big challenge for any organization to deal with such a massive amount of data being
generated every second. Handling and evaluating this data requires very powerful, complex algorithms
and technologies.

The following are some primary motives for the use of Data science technology:

1. It helps to convert large quantities of raw and unstructured records into meaningful insights.
2. It can assist in making predictions in areas such as surveys, elections, etc.
3. It also helps in automating transportation, such as building self-driving cars, which we can say are the
future of transportation.
4. Companies are shifting towards data science and opting for this technology. Amazon, Netflix, etc.,
which cope with big quantities of data, use data science algorithms to provide a better customer
experience.
1. Business Understanding:
The complete cycle revolves around the business goal. What will you solve if you do not have a
specific problem? It is extremely important to understand the business objective clearly, because that will
be the ultimate aim of the analysis. Only after gaining a proper understanding can we set the precise goal
of the analysis in sync with the business objective. You need to understand whether the customer wants to
minimize losses, or predict the price of a commodity, and so on.

2. Data Understanding:
After business understanding, the subsequent step is data understanding. This involves collecting
all the available data. Here you need to work closely with the business team, as they are aware of what
data is present, what data could be used for this business problem, and other details. This step includes
describing the data, its structure, its relevance, and its data types. Explore the data using graphical plots.
Basically, extract any information that you can about the data by simply exploring it.
3. Preparation of Data:
Next comes the data preparation stage. This consists of steps like selecting the relevant data,
integrating the data by merging the data sets, cleaning it, treating missing values by either eliminating
or imputing them, treating erroneous data by eliminating it, and also checking for outliers using box
plots and handling them. Constructing new data means deriving new features from existing ones.
Format the data into the preferred structure and remove unwanted columns and features. Data
preparation is the most time-consuming but arguably the most important step in the complete life cycle.
Your model will only be as good as your data.
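As a concrete illustration of these steps, below is a minimal data-preparation sketch in Python using pandas. The table, column names, and values are hypothetical, chosen only to show missing-value treatment, box-plot-style outlier checks, and feature derivation.

import pandas as pd

# A small hypothetical sales table; in practice this comes from your data sources.
df = pd.DataFrame({
    "region": ["North", "South", None, "East", "West"],
    "units_sold": [120, 95, 110, 4000, 105],   # 4000 looks like an outlier
    "unit_price": [9.5, 10.0, 9.8, 9.9, None],
})

# Treat missing values: impute a category, or eliminate the row.
df["region"] = df["region"].fillna("Unknown")
df = df.dropna(subset=["unit_price"])

# Check for outliers using the same quartile logic as a box plot (1.5 * IQR).
q1, q3 = df["units_sold"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["units_sold"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Derive a new element from present ones, then eliminate an undesired column.
df["revenue"] = df["units_sold"] * df["unit_price"]
df = df.drop(columns=["unit_price"])
print(df)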
4. Exploratory Data Analysis:
This step involves getting some idea about the solution and the factors affecting it before
constructing the actual model. The distribution of data within individual variables is explored
graphically using bar graphs, and relations between different features are captured through graphical
representations like scatter plots and heat maps. Many data visualization techniques are used
extensively to explore every feature individually and in combination with other features.
5. Data Modeling:
Data modeling is the heart of data analysis. A model takes the prepared data as input and
gives the desired output. This step involves choosing the appropriate kind of model, depending on
whether the problem is a classification problem, a regression problem, or a clustering problem. After
selecting the model family, from among the algorithms in that family we need to carefully choose the
algorithms to implement. We need to tune the hyperparameters of each model to achieve the desired
performance. We also need to make sure there is the right balance between performance and
generalizability: we do not want the model to memorize the data and perform poorly on new data.

6. Model Evaluation:
Here the model is evaluated to check whether it is ready to be deployed. The model is tested
on unseen data and evaluated on a carefully thought-out set of evaluation metrics. We also need to
make sure that the model conforms to reality. If we do not obtain a satisfactory result in the evaluation,
we have to re-iterate the complete modelling process until the desired level of the metrics is achieved. Any
data science solution, or machine learning model, just like a human, must evolve: it must be able to
improve itself with new data and adapt to a new evaluation metric. We can construct more than one model
for a given phenomenon, but many of them may be imperfect. Model evaluation helps us select and
construct an ideal model.
7. Model Deployment:
After rigorous evaluation, the model is finally deployed in the preferred form and channel.
This is the last step in the data science life cycle. Each step in the data science life cycle described above
must be worked upon carefully. If any step is performed improperly, it will affect the subsequent step and
the complete effort goes to waste. For example, if data is not collected properly, you will lose records and
will not be able to build an ideal model. If data is not cleaned properly, the model will not work. If the
model is not evaluated properly, it will fail in the real world. Right from business understanding to model
deployment, every step has to be given appropriate attention, time, and effort.
ANALYTICS LIFE CYCLE

Data Analytics Lifecycle :

The data analytics lifecycle is designed for Big Data problems and data science projects. The
cycle is iterative, to represent a real project. To address the distinct requirements for performing
analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks
involved with acquiring, processing, analyzing, and repurposing data.

Phase 1: Discovery –

● The data science team learns and investigates the problem.


● Develop context and understanding.
● Come to know about data sources needed and available for the project.
● The team formulates initial hypotheses that can later be tested with data.
Phase 2: Data Preparation –

● Steps to explore, preprocess, and condition data prior to modeling and analysis.
● It requires the presence of an analytic sandbox; the team executes extract, load, and transform processes
to get data into the sandbox.
● Data preparation tasks are likely to be performed multiple times and not in a predefined order.
● Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.
Phase 3: Model Planning –
● The team explores the data to learn about the relationships between variables and subsequently selects
the key variables and the most suitable models.
● In this phase, the data science team develops data sets for training, testing, and production purposes.
● Several tools commonly used for this phase are MATLAB and STATISTICA.
Phase 4: Model Building –
● The team builds and executes models based on the work done in the model planning phase.
● The team also considers whether its existing tools will suffice for running the models or if it needs a
more robust environment for executing them.
● Free or open-source tools – R and PL/R, Octave, WEKA.
● Commercial tools – MATLAB, STATISTICA.
Phase 5: Communicate Results –
● After executing the model, the team needs to compare the outcomes of modeling to the criteria
established for success and failure.
● The team considers how best to articulate findings and outcomes to the various team members and
stakeholders, taking into account caveats and assumptions.
● The team should identify key findings, quantify the business value, and develop a narrative to
summarize and convey findings to stakeholders.
Phase 6: Operationalize –
● The team communicates the benefits of the project more broadly and sets up a pilot project to deploy
the work in a controlled way before broadening the work to a full enterprise of users.
● This approach enables the team to learn about the performance and related constraints of the model in
a production environment on a small scale, and to make adjustments before full deployment.
● The team delivers final reports, briefings, and code.
● Free or open-source tools – Octave, WEKA, SQL, MADlib.
TYPES OF ANALYTICS

Types of Business Analytics


There are mainly four types of Business Analytics; each type is increasingly
complex, and each brings us closer to applying real-time and future-situation
insights. The four types of business analytics are discussed below.
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
1. Descriptive Analytics
It summarizes an organisation’s existing data to understand what has happened in
the past or is happening currently. Descriptive Analytics is the simplest form of
analytics as it employs data aggregation and mining techniques. It makes data
more accessible to members of an organisation such as investors, shareholders,
marketing executives, and sales managers.

It can help identify strengths and weaknesses and provides an insight into
customer behaviour too. This helps in forming strategies that can be developed in
the area of targeted marketing.

2. Diagnostic Analytics
This type of Analytics helps shift focus from past performance to the current
events and determine which factors are influencing trends. To uncover the root
cause of events, techniques such as data discovery, data mining and drill-down are
employed. Diagnostic analytics makes use of probabilities and likelihoods to
understand why events may occur. Techniques such as sensitivity analysis and
training algorithms are employed for classification and regression.

3. Predictive Analytics
This type of Analytics is used to forecast the possibility of a future event with the
help of statistical models and ML techniques. It builds on the result of descriptive
analytics to devise models to extrapolate the likelihood of items. To run predictive
analysis, Machine Learning experts are employed. They can achieve a higher level
of accuracy than by business intelligence alone.
One of the most common applications is sentiment analysis. Here, existing data
collected from social media is used to provide a comprehensive picture of a
user's opinion. This data is analysed to predict their sentiment (positive, neutral
or negative).

4. Prescriptive Analytics
Going a step beyond predictive analytics, it provides recommendations for the
next best action to be taken. It suggests all favourable outcomes according to a
specific course of action and also recommends the specific actions needed to
deliver the most desired result. It mainly relies on two things, a strong feedback
system and a constant iterative analysis. It learns the relation between actions and
their outcomes. One common use of this type of analytics is to create
recommendation systems.

Business Analytics Tools

Business Analytics tools help analysts to perform the tasks at hand and generate
reports which may be easy for a layman to understand. These tools can be
obtained from open source platforms, and enable business analysts to manage their
insights in a comprehensive manner. They tend to be flexible and user-friendly.
Some widely used business analytics tools and techniques are described below.

● Python is very flexible and can also be used in web scripting. It is
mainly applied when there is a need to integrate the analyzed data
with a web application or when the statistics are to be used in a
production database. The IPython (Jupyter) Notebook makes it easy to
work with Python and data. One can share notebooks with other people
without requiring them to install anything, which reduces code-organizing
overhead.

● SAS has a user-friendly GUI and can churn through terabytes
of data with ease. It comes with an extensive documentation and
tutorial base which can help early learners get started seamlessly.

● R is open-source software and is completely free to use, making it easy
for individual professionals or students starting out to learn. Graphical
capability, or data visualization, is the strongest forte of R, which has
access to packages like ggplot2, RGIS, lattice, and ggvis, among others,
that provide superior graphical competency.

● Tableau is among the most popular and advanced data visualization tools in
the market. Storytelling and presenting data insights in a comprehensive
way has become one of the trademarks of a competent business analyst,
and Tableau is a great platform for developing customized visualizations
in no time, thanks to its drag-and-drop features.

Python, R, SAS, Excel, and Tableau have all got their unique places when it
comes to usage.
BUSINESS PROBLEM DEFINITION

It defines the problem that a company is facing. It also involves an intricate analysis of the
problem, the details relevant to the situation, and a solution that can solve the problem. This is a simple yet
effective way to present a problem and its solution concisely. In other words, it is a communication tool that
helps you visualize and minimize the gap between what's ideal and what's real; or, to put it in business
lingo, between the expected performance and the real performance.

A business problem statement is a compact communication tool that helps you convey what you
want to change.

How to write Business Problem Statement?

Before writing a business problem statement, it is crucial to conduct a complete analysis of the
problem and everything related to it. You should have the knowledge to describe your problem and also to
suggest a solution to it. To make things easy for you, we have explained the four key aspects that will help
you write your business problem statement. They include:

1. Define the problem


Defining the problem is the primary aspect of a business problem statement. Summarize your
problem in simple, layman's terms. It is highly recommended to avoid industry lingo and
buzzwords. Write a summary three to five sentences long, and avoid writing more than that.
2. Provide the problem analysis

Adding statistics and results from surveys, industry trends, customer demographics, staffing reports,
etc., helps the reader understand the problem distinctly. These references should describe your
problem and its effects on various attributes of your business.
Avoid adding too many numbers in your problem statement, and include only the absolute necessary
statistics. It’s best to include not more than three significant facts.
3. Propose a solution
Your business problem statement should conclude with a solution to the problem that was
previously described. The solution should describe how the current state can be improved.

Avoid including elaborate actions and steps in a problem statement. These can be explained further
when you write a project plan.

4. Consider the audience


When you start writing your business problem statement, or any formal document, it is important to be
aware of the reader. Write your problem statement keeping in mind the reader’s knowledge about the
situation, requirements, and expectations.
Although intuitive knowledge does have its place, it is wiser to first consider and mention the facts you have
learned based on your research and propose solutions accordingly.
How to Develop a Business Problem Statement

A popular method that is used while writing a problem statement is the 5W2H (What, Why, Where,
Who, When, How, How much) method. These are the questions that need to be asked and answered while
writing a business problem statement.

Let’s understand them in detail.

● What: What is the problem that needs to be solved? Include the root cause of the problem. Mention
other micro problems that are connected with the macro ones.
● Why: Why is it a problem? Describe the reasons why it is a problem. Include supporting facts and
statistics to highlight the trouble.
● Where: Where is the problem observed? Mention the location and the specifics of it. Include the
products or services in which the problem is seen.
● Who: Who is impacted by this problem? Define and mention the customers, the staff, departments,
and businesses affected by the problem.
● When: When was the problem first observed? Talk about the timeline. Explain how the intensity of
the problem has changed from the time it was first observed.
● How: How is the problem observed? Mention the indications of the problem. Talk about the
observations you made while conducting problem analysis.
● How much: How often is the problem observed? If you have identified a trend during your research,
mention it. Comment on the error rate and the frequency and magnitude of the problem.

Business Problem Statement Framework

A problem statement consists of four main components. They are:

● The problem: The problem statement begins with mentioning and explaining the current state.
● Who it affects: Mention the people who are affected by the problem.
● How it impacts: Explain the impacts of the problem.
● The solution: Your problem statement ends with a proposed solution.

DATA COLLECTION

Data

● Knowledge is power, information is knowledge, and data is information in
digitized form, at least as defined in IT. Hence, data is power.
● Data are individual facts, statistics, or items of information, often numeric.
In a more technical sense, data are a set of values of qualitative or
quantitative variables about one or more persons or objects
● Data is various kinds of information formatted in a particular way.
Therefore, data collection is the process of gathering, measuring, and
analyzing accurate data from a variety of relevant sources to find answers
to research problems, answer questions, evaluate outcomes, and forecast
trends and probabilities.
● Accurate data collection is necessary to make informed business decisions,
ensure quality assurance, and keep research integrity.
● The concept of data collection isn’t a new one, as we’ll see later, but the
world has changed. There is far more data available today, and it exists in
forms that were unheard of a century ago. The data collection process has
had to change and grow with the times, keeping pace with technology.
● Data collection breaks down into two methods: 1. Primary & 2. Secondary

❖ Data Collection
Data collection is the process of acquiring, collecting, extracting, and storing the
voluminous amount of data which may be in the structured or unstructured form
like text, video, audio, XML files, records, or other image files used in later stages
of data analysis. In the process of big data analysis, “Data collection” is the initial
step before starting to analyze the patterns or useful information in data. The data
which is to be analyzed must be collected from different valid sources.

The actual data is then further divided mainly into two types known as:
1. Primary data
2. Secondary data

1.Primary Data
The data which is raw, original, and extracted directly from official sources is
known as primary data. This type of data is collected directly by performing
techniques such as questionnaires, interviews, and surveys. The data collected
must be according to the demands and requirements of the target audience on
which the analysis is performed; otherwise it would be a burden in data processing.
Few methods of collecting primary data:
⮚ Interview method:
The data collected during this process is obtained by interviewing the target audience;
the person conducting the interview is called the interviewer, and the person who answers
is known as the interviewee. Some basic business or product related questions are asked and
noted down in the form of notes, audio, or video, and this data is stored for
processing. Interviews can be both structured and unstructured, like personal interviews
or formal interviews through telephone, face to face, email, etc.
⮚ Survey method:
The survey method is the process of research where a list of relevant questions is
asked and answers are noted down in the form of text, audio, or video. Surveys
can be conducted in both online and offline modes, such as through website
forms and email. The survey answers are then stored for analysis.
Examples are online surveys or surveys through social media polls.
⮚ Observation method:
The observation method is a method of data collection in which the researcher
keenly observes the behaviour and practices of the target audience using some
data collecting tool and stores the observed data in the form of text, audio, video,
or other raw formats. In this method, the data is collected directly by posing a few
questions to the participants. For example, observing a group of customers and
their behaviour towards products. The data obtained is then sent for
processing.
⮚ Projective Technique
Projective data gathering is an indirect interview, used when potential respondents
know why they're being asked questions and hesitate to answer. For instance,
someone may be reluctant to answer questions about their phone service if a cell
phone carrier representative poses the questions. With projective data gathering,
the interviewees get an incomplete question, and they must fill in the rest, using
their opinions, feelings, and attitudes.

⮚ Delphi Technique.
The Oracle at Delphi, according to Greek mythology, was the high priestess of
Apollo’s temple, who gave advice, prophecies, and counsel. In the realm of data
collection, researchers use the Delphi technique by gathering information from a
panel of experts. Each expert answers questions in their field of specialty, and the
replies are consolidated into a single opinion.

⮚ Focus Groups.
Focus groups, like interviews, are a commonly used technique. The group consists
of anywhere from a half-dozen to a dozen people, led by a moderator, brought
together to discuss the issue.

⮚ Questionnaires.
Questionnaires are a simple, straightforward data collection method. Respondents
get a series of questions, either open or close-ended, related to the matter at hand.
⮚ Experimental method:
The experimental method is the process of collecting data through performing
experiments, research, and investigation. The most frequently used experiment
methods are CRD, RBD, LSD, FD.
● CRD- Completely Randomized Design is a simple experimental design
used in data analytics, based on randomization and replication. It is
mostly used for comparing experiments.
● RBD- Randomized Block Design is an experimental design in which the
experiment is divided into small units called blocks. Random experiments are
performed on each of the blocks and results are drawn using a technique
known as analysis of variance (ANOVA); a minimal ANOVA sketch follows
this list. RBD originated in the agriculture sector.
● LSD – Latin Square Design is an experimental design that is similar to
CRD and RBD but contains rows and columns. It is an arrangement of
N x N squares with an equal number of rows and columns, containing letters
that occur exactly once in each row and column. Hence the differences can be
found easily, with fewer errors in the experiment. A Sudoku puzzle is an
example of a Latin square design.
● FD- Factorial Design is an experimental design in which each experiment
has two or more factors, each with several possible values; trials are performed
over the combinations of these factor levels to study their individual and
combined effects.
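As mentioned under RBD, results from such designed experiments are commonly analyzed with ANOVA. Below is a minimal one-way ANOVA sketch in Python with SciPy; strictly speaking, one-way ANOVA matches a completely randomized design (a blocked design such as RBD calls for a two-way analysis), and the treatment measurements are invented for illustration.

from scipy import stats

# Responses observed under three treatments (hypothetical values).
treatment_a = [20.1, 19.8, 21.0, 20.5]
treatment_b = [22.4, 23.0, 21.9, 22.7]
treatment_c = [19.5, 19.9, 20.2, 19.7]

# Analysis of variance: do the treatment means differ significantly?
f_stat, p_value = stats.f_oneway(treatment_a, treatment_b, treatment_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g., < 0.05) suggests at least one treatment mean differs.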

2.Secondary data:

Secondary data is data which has already been collected and is reused for
some valid purpose. This type of data is derived from previously recorded
primary data, and it has two types of sources, named internal and external.
i. Internal source:
These types of data can easily be found within the organization, such as market
records, sales records, transactions, customer data, accounting resources, etc. The
cost and time consumed in obtaining internal sources is low.
● Financial Statements
● Sales Reports
● Retailer/Distributor/Deal Feedback
● Customer Personal Information (e.g., name, address, age, contact info)
● Business Journals
● Government Records (e.g., census, tax records, Social Security info)
● Trade/Business Magazines
● The internet

ii. External source:


The data which can’t be found at internal organizations and can be gained through
external third party resources is external source data. The cost and time
consumption is more because this contains a huge amount of data. Examples of
external sources are Government publications, news publications, Registrar
General of India, planning commission, international labour bureau, syndicate
services, and other non-governmental publications.
iii. Other sources:
● Sensor data: With the advancement of IoT devices, the sensors of these
devices collect data which can be used for sensor data analytics to track the
performance and usage of products.
● Satellite data: Satellites collect terabytes of images and data on a daily
basis through surveillance cameras, which can be used to extract
useful information.
● Web traffic: Due to fast and cheap internet facilities, the many formats of data
uploaded by users on different platforms can be collected, with their
permission, for data analysis. Search engines also provide their data on the
keywords and queries searched most often.

❖ Data Collection Tools

1. Word Association.
The researcher gives the respondent a set of words and asks them what comes to
mind when they hear each word.
2. Sentence Completion.
Researchers use sentence completion to understand what kind of ideas the
respondent has. This tool involves giving an incomplete sentence and seeing how
the interviewee finishes it.
3. Role-Playing.
Respondents are presented with an imaginary situation and asked how they would
act or react if it was real.
4. In-Person Surveys.
The researcher asks questions in person.
5. Online/Web Surveys.
These surveys are easy to accomplish, but some users may be unwilling to answer
truthfully, if at all.
6. Mobile Surveys.
These surveys take advantage of the increasing proliferation of mobile technology.
Mobile collection surveys rely on mobile devices like tablets or smartphones to
conduct surveys via SMS or mobile apps.

7. Phone Surveys.
No researcher can call thousands of people at once, so they need a third party to
handle the chore. However, many people have call screening and won’t answer.

8. Observation.
Sometimes, the simplest method is the best. Researchers who make direct
observations collect data quickly and easily, with little intrusion or third-party
bias. Naturally, it’s only effective in small-scale situations.
DATA PREPARATION
❖ Data Preparation

Data preparation is about constructing a dataset from one or more data
sources to be used for exploration and modeling. It is a solid practice to
start with an initial dataset to get familiar with the data, to discover first
insights into the data, and to gain a good understanding of any possible data
quality issues. Data preparation is often a time-consuming process and
heavily prone to errors. The old saying "garbage in, garbage out" is
particularly applicable to data science projects where the gathered data
contains many invalid, out-of-range, and missing values. Analyzing data that
has not been carefully screened for such problems can produce highly
misleading results. Thus, the success of data science projects heavily
depends on the quality of the prepared data.
Data
Data is information, typically the results of measurement (numerical) or
counting (categorical). Variables serve as placeholders for data.
There are two types of variables, numerical and categorical.

A numerical or continuous variable is one that can accept any value
within a finite or infinite interval (e.g., height, weight, temperature, blood
glucose). There are two types of numerical data, interval and ratio. Data
on an interval scale can be added and subtracted but cannot be
meaningfully multiplied or divided, because there is no true zero. For
example, we cannot say that one day is twice as hot as another day. On
the other hand, data on a ratio scale has a true zero and can be added,
subtracted, multiplied or divided (e.g., weight).

A categorical or discrete variable is one that can accept two or more
values (categories). There are two types of categorical data, nominal and
ordinal. Nominal data does not have an intrinsic ordering in the
categories; for example, "gender" with two categories, male and female.
In contrast, ordinal data does have an intrinsic ordering in the categories;
for example, "level of energy" with three ordered categories (low,
medium and high).
Dataset
A dataset is a collection of data, usually presented in tabular form. Each column
represents a particular variable, and each row corresponds to a given member of
the data.
There are some alternatives for columns, rows and values.
● Columns, Fields, Attributes, Variables
● Rows, Records, Objects, Cases, Instances, Examples, Vectors
● Values, Data

In predictive modeling, predictors or attributes are the input variables,
and the target or class attribute is the output variable, whose value is
determined by the values of the predictors and the function of the
predictive model.
Database
Database collects, stores and manages information so users can retrieve,
add, update or remove such information. It presents information in tables
with rows and columns. A table is referred to as a relation in the sense
that it is a collection of objects of the same type (rows). Data in a table
can be related according to common keys or concepts, and the ability to
retrieve related data from related tables is the basis for the term relational
database. A Database Management System (DBMS) handles the way
data is stored, maintained, and retrieved. Most data science toolboxes
connect to databases through ODBC (Open Database Connectivity) or
JDBC (Java Database Connectivity) interfaces.
SQL (Structured Query Language) is a database computer language for
managing and manipulating data in relational database management
systems (RDBMS).

SQL Data Definition Language (DDL) permits database tables to be
created, altered or deleted. We can also define indexes (keys), specify
links between tables, and impose constraints between database tables.

● CREATE TABLE : creates a new table
● ALTER TABLE : alters a table
● DROP TABLE : deletes a table
● CREATE INDEX : creates an index
● DROP INDEX : deletes an index

SQL Data Manipulation Language (DML) is a language which enables
users to access and manipulate data.

● SELECT : retrieves data from the database
● INSERT INTO : inserts new data into the database
● UPDATE : modifies data in the database
● DELETE : deletes data from the database
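The DDL and DML statements above can be tried out directly from Python with the built-in sqlite3 module. The table, columns, and rows in this sketch are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")  # an in-memory database for experimenting
cur = conn.cursor()

# DDL: create a table and an index.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("CREATE INDEX idx_city ON customers(city)")

# DML: insert, update, retrieve, and delete rows.
cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Asha", "Chennai"))
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Madurai", "Asha"))
rows = cur.execute("SELECT id, name, city FROM customers").fetchall()
print(rows)  # [(1, 'Asha', 'Madurai')]
cur.execute("DELETE FROM customers WHERE id = ?", (1,))
conn.commit()
conn.close()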
ETL (Extraction, Transformation and Loading)
ETL extracts data from data sources and loads it into data destinations using a set
of transformation functions.

● Data extraction provides the ability to extract data from a variety
of data sources, such as flat files, relational databases, streaming
data, XML files, and ODBC/JDBC data sources.
● Data transformation provides the ability to cleanse, convert,
aggregate, merge, and split data. Before data transformation we
need to do the following steps:
Data cleansing – the process of eliminating missing values and noisy data.
Data integration – the process of merging and aggregating the data;
constructing new data and deriving new elements from present ones;
formatting the data into the preferred structure; and eliminating
undesirable columns and features.
● Data loading provides the ability to load data into destination
databases via update, insert or delete statements, or in bulk.
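A minimal ETL sketch in Python with pandas follows, assuming a hypothetical flat-file source sales_raw.csv (with region and amount columns) and a SQLite destination table; all names are illustrative only.

import sqlite3
import pandas as pd

# Extract: read from a flat-file data source (path is hypothetical).
df = pd.read_csv("sales_raw.csv")

# Transform: cleanse missing values, then aggregate by region.
df = df.dropna(subset=["region", "amount"])
summary = df.groupby("region", as_index=False)["amount"].sum()

# Load: write the result into a destination database table.
with sqlite3.connect("warehouse.db") as conn:
    summary.to_sql("sales_by_region", conn, if_exists="replace", index=False)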
HYPOTHESIS GENERATION

What is Hypothesis Generation?

Hypothesis

A hypothesis is nothing but an assumption or a supposition made about a specific
population parameter, such as any measurement or quantity about the population that is fixed
and that can be used as a value of the distribution variable.

"A hypothesis may be simply defined as a guess. A scientific hypothesis is an intelligent
guess."

Hypothesis generation is an educated "guess" about the various factors that are impacting the
business problem that needs to be solved using machine learning. When framing a hypothesis, the
data scientist must not already know the outcome; the hypothesis is generated before looking at
the evidence.

Hypothesis generation is a crucial step in any data science project. If you skip this or skim
through this, the likelihood of the project failing increases exponentially.

Hypothesis generation is a process beginning with an educated guess, whereas hypothesis
testing is a process to conclude whether the educated guess is true or false, or whether the
relationship between the variables is statistically significant or not.
Reasons for hypothesis generation

5 key reasons why hypothesis generation is important

● Hypothesis generation helps in comprehending the business problem, as we dive deep into
inferring the various factors affecting our target variable
● You will get a much better idea of the major factors that are responsible for the problem
● It identifies the data that needs to be collected from various sources, which is key to converting
your business problem into a data science problem
● It improves your domain knowledge, if you are new to the domain, as you spend time
understanding the problem
● It helps you approach the problem in a structured manner

Types of Hypothesis

⮚ Simple Hypothesis
⮚ Complex Hypothesis
⮚ Null Hypothesis
⮚ Alternate Hypothesis
⮚ Statistical Hypothesis

Simple Hypothesis

A simple hypothesis, also known as a basic hypothesis, proposes that
an independent variable is accountable for a corresponding dependent
variable. In simpler words, the occurrence of the independent variable results
in the existence of the dependent variable. Generally, simple hypotheses
are thought of as true, and they create a causal relationship between the
two variables.

One example of a simple hypothesis: exercising daily leads to weight loss.

Complex Hypothesis

This type of hypothesis is also termed a modal. It holds for a
relationship between two independent variables that result in a
dependent variable; that is, the combination of independent variables
results in the dependent variable.

An example of this kind of hypothesis: "adults who don't drink
and smoke are less likely to have liver-related problems."

Null Hypothesis

A null hypothesis is created when a researcher thinks that there is
no connection between the variables being observed.

An example of this kind of hypothesis: "A student's
performance is not impacted if they drink tea or coffee before classes."

Alternate Hypothesis

If a researcher wants to disprove a null hypothesis, then the
researcher has to develop an opposite assumption, known as an
alternative hypothesis.

For example: beginning your day with tea instead of coffee can keep
you more alert.

Statistical Hypothesis

This kind of hypothesis is most common in systematic investigations
that involve a huge target audience.

For example: in Louisiana, 45% of students have middle-income parents.

Hypothesis testing
Hypothesis testing involves drawing inferences about two contrasting
propositions (each called a hypothesis) relating to the value of one or more population
parameters, such as the mean, proportion, standard deviation, or variance.
Null hypothesis
One of these propositions (called the null hypothesis) describes the existing
theory or a belief that is accepted as valid unless strong statistical evidence exists to
the contrary. The null hypothesis is denoted by H0
Alternative hypothesis
The second proposition (called the alternative hypothesis) is the complement
of the null hypothesis; it must be true if the null hypothesis is false. The alternative
hypothesis is denoted by H1.
Using sample data, we either reject the null hypothesis and conclude that the
sample data provide sufficient statistical evidence to support the alternative
hypothesis, or we fail to reject the null hypothesis and conclude that the sample data
do not support the alternative hypothesis.
If we fail to reject the null hypothesis, then we can only accept as valid the
existing theory or belief, but we can never prove it.

Hypothesis-Testing Procedure
Conducting a hypothesis test involves several steps:
1. Identifying the population parameter of interest and formulating the
hypotheses to test
2. Selecting a level of significance, which defines the risk of drawing an
incorrect conclusion when the assumed hypothesis is actually true
3. Determining a decision rule on which to base a conclusion
4. Collecting data and calculating a test statistic
5. Applying the decision rule to the test statistic and drawing a conclusion

We apply this procedure to two different types of hypothesis tests; the first
involving a single population (called one-sample tests) and, later, tests involving more
than one population (multiple-sample tests).
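As a minimal sketch of this procedure for a one-sample test, suppose H0: the population mean fill volume is 500 ml, and H1: it differs. The Python example below uses SciPy; the sample values and significance level are invented for illustration.

from scipy import stats

sample = [498.2, 501.1, 497.5, 499.0, 496.8, 500.4, 498.9, 497.2]
alpha = 0.05                                  # level of significance (step 2)

t_stat, p_value = stats.ttest_1samp(sample, popmean=500)  # test statistic (step 4)

# Decision rule (steps 3 and 5): reject H0 if the p-value is below alpha.
if p_value < alpha:
    print(f"p = {p_value:.4f}: reject H0 in favour of H1")
else:
    print(f"p = {p_value:.4f}: fail to reject H0")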
MODELING
Model
Many decision problems can be formalized using a model. A model is an
abstraction or representation of a real system, idea, or object. Models capture the most
important features of a problem and present them in a form that is easy to interpret. A
model can be as simple as a written or verbal description of some phenomenon, a
visual representation such as a graph or a flowchart, or a mathematical or spreadsheet
representation.

New Product Sales Over Time


Three Forms of a Model
The sales of a new product, such as a first-generation iPad, Android phone, or 3-D
television, often follow a common pattern. We might represent this in one of the three
following ways:
1. A simple verbal description of sales might be: The rate of sales starts small as early
adopters begin to evaluate a new product and then begins to grow at an increasing rate over
time as positive customer feedback spreads. Eventually, the market begins to become
saturated and the rate of sales begins to decrease.
2. A sketch of sales as an S-shaped curve over time, as shown in Figure , is a visual
model that conveys this phenomenon.
3. Finally, analysts might identify a mathematical model that characterizes this curve.
Several different mathematical functions do this; one is called a Gompertz curve and has the
formula S = a·e^(b·e^(c·t)), where S = sales, t = time, e is the base of natural logarithms, and a, b,
and c are constants. Of course, you would not be expected to know this; that's what analytics
professionals do. Such a mathematical model provides the ability to predict sales
quantitatively, and to analyze potential decisions by asking "what if?" questions.
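A short Python sketch of the Gompertz model follows; the constants a, b, and c are invented purely to show the S-shaped behaviour (b and c are negative, so S rises toward the saturation level a).

import math

def gompertz_sales(t, a=10_000, b=-5.0, c=-0.4):
    """Predicted cumulative sales at time t; S approaches a as t grows."""
    return a * math.exp(b * math.exp(c * t))

for t in range(0, 13, 3):
    print(f"t = {t:2d}  S = {gompertz_sales(t):8.0f}")
# Sales start small, accelerate, then flatten: the S-shaped curve described above.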
A simple descriptive model is a visual representation called an influence diagram
because it describes how various elements of the model influence, or relate to, others.
An influence diagram is a useful approach for conceptualizing the structure of a model and
can assist in building a mathematical or spreadsheet model.
The elements of the model are represented by circular symbols called nodes. Arrows called
branches connect the nodes and show which elements influence others.
Influence diagrams are quite useful in the early stages of model building when we need to
understand and characterize key relationships.
Below figure shows how to construct simple influence diagrams
From basic business principles, we know that the total cost of producing a fixed
volume of a product is comprised of fixed costs and variable costs. Thus, a simple influence
diagram that shows these relationships is given in Figure
An Influence Diagram Relating Total Cost to Its Key Components

We can develop a more detailed model by noting that the variable cost depends on the
unit variable cost as well as the quantity produced. The expanded model is shown in below
Figure . In this figure, all the nodes that have no branches pointing into them are inputs to the
model. We can see that the unit variable cost and fixed costs are data inputs in the model. The
quantity produced, however, is a decision variable because it can be controlled by the
manager of the operation. The total cost is the output (note that it has no branches pointing
out of it) that we would be interested in calculating. The variable cost node links some of the
inputs with the output and can be considered as a “building block” of the model for total cost.

The figure shows how to build a mathematical model by drawing upon the influence diagram.
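A minimal Python sketch of the resulting mathematical model, Total Cost = Fixed Cost + (Unit Variable Cost x Quantity Produced), is shown below; the input numbers are invented for a simple "what if?" analysis.

def total_cost(fixed_cost: float, unit_variable_cost: float, quantity: int) -> float:
    variable_cost = unit_variable_cost * quantity   # the "building block" node
    return fixed_cost + variable_cost               # the output node

# Fixed cost and unit variable cost are data inputs; quantity is the decision variable.
for quantity in (500, 1000, 1500):
    cost = total_cost(fixed_cost=50_000, unit_variable_cost=125.0, quantity=quantity)
    print(f"Quantity {quantity}: total cost = {cost:,.0f}")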
Decision Models
A decision model is a logical or mathematical representation of a problem or business
situation that can be used to understand, analyze, or facilitate making a decision.
Most decision models have three types of input:
1. Data, which are assumed to be constant for purposes of the model. Some examples
would be costs, machine capacities, and intercity distances.
2. Uncontrollable variables, which are quantities that can change but cannot be directly
controlled by the decision maker. Some examples would be customer demand, inflation rates,
and investment returns. Often, these variables are uncertain.
3. Decision variables, which are controllable and can be selected at the discretion of the
decision maker. Some examples would be production quantities, staffing levels, and
investment allocations.
Decision models characterize the relationships among the data, uncontrollable
variables, and decision variables, and the outputs of interest to the decision maker.

Modeling
With the help of modelling techniques, we can create a complete description of existing and
proposed organizational structures, processes, and information used by the enterprise.
Business Model is a structured model, just like a blueprint for the final product to be
developed. It gives structure and dynamics for planning. It also provides the foundation for
the final product.

Predictive modeling is the process by which a model is created to predict
an outcome. If the outcome is categorical, it is called classification, and if
the outcome is numerical, it is called regression. Descriptive modeling, or
clustering, is the assignment of observations into clusters so that
observations in the same cluster are similar. Finally, association rules can
find interesting associations amongst observations.
VALIDATION AND EVALUATION

Model Evaluation
Model evaluation is an integral part of the model development process. It
helps to find the best model that represents our data and shows how well the
chosen model will work in the future. Evaluating model performance with
the data used for training is not acceptable in data science, because it can
easily generate overoptimistic and overfitted models.
There are two methods of evaluating models in data science,
⮚ Hold-Out
⮚ Cross-Validation.
To avoid overfitting, both methods use a test set (not seen by the model) to evaluate model
performance.
Hold-Out
In this method, the (usually large) dataset is randomly divided into three subsets:
1. Training set is a subset of the dataset used to build predictive models.
2. Validation set is a subset of the dataset used to assess the
performance of the model built in the training phase. It provides a test
platform for fine-tuning a model's parameters and selecting the
best-performing model. Not all modeling algorithms need a validation
set.
3. Test set, or unseen examples, is a subset of the dataset used to assess the
likely future performance of a model. If a model fits the training
set much better than it fits the test set, overfitting is probably the
cause.
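A minimal hold-out sketch with scikit-learn is shown below, splitting a synthetic dataset 60/20/20 into training, validation, and test sets; the proportions are an assumption, not a fixed rule.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# First carve off 20% as the unseen test set, then 25% of the rest as the
# validation set, leaving a 60/20/20 split overall.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200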

Cross-Validation
When only a limited amount of data is available, to achieve an unbiased
estimate of the model performance we use k-fold cross-validation. In
k-fold cross-validation, we divide the data into k subsets of equal size.
We build models k times, each time leaving out one of the subsets from
training and using it as the test set. If k equals the sample size, this is called
"leave-one-out".

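A short k-fold cross-validation sketch with scikit-learn (k = 5) follows; the model and data here are placeholders for whichever you actually use.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# Five folds: each observation is used for testing exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())  # average score and its variability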
Model evaluation can be divided into two sections:


● Classification Evaluation
Classification is about predicting the class labels given input data

Evaluation Metrics for Classification


1. Accuracy
2. Precision (P)
3. Recall (R)
4. F1 score (F1)
5. Area under the ROC (Receiver Operating Characteristic) curve or simply Area Under
Curve (AUC)
6. Log loss
7. Precision at k (P@k)
8. Average precision at k (AP@k)
9. Mean average precision at k (MAP@k)
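A minimal sketch computing the first few of these metrics with scikit-learn; the true and predicted labels are invented for illustration.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))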

● Regression Evaluation
Regression refers to predictive modeling problems that involve predicting a
numeric value.
● It is different from classification, which involves predicting a class label. Unlike
classification, you cannot use classification accuracy to evaluate the predictions
made by a regression model.

Evaluation Metrics for Regression


1. Mean absolute error (MAE)
2. Mean squared error (MSE)
3. Root mean squared error (RMSE)
4. Root mean squared logarithmic error (RMSLE)
5. Mean percentage error (MPE)
6. Mean absolute percentage error (MAPE)
7. R-square (R^2)
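A matching sketch for the regression metrics, again with invented values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.5, 2.1, 7.8]
y_pred = [2.8, 6.0, 2.5, 7.1]

mse = mean_squared_error(y_true, y_pred)
print("MAE :", mean_absolute_error(y_true, y_pred))
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))  # root mean squared error
print("R^2 :", r2_score(y_true, y_pred))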

INTERPRETATION

Data interpretation is the process of reviewing data and arriving at relevant
conclusions using various analytical research methods. Data analysis assists
researchers in categorizing, manipulating, and summarizing data to answer critical
questions.
DEPLOYMENT AND ITERATION

Model Deployment

The concept of deployment in data science refers to the application of a
model for prediction on new data. Building a model is generally not
the end of the project. Even if the purpose of the model is to increase
knowledge of the data, the knowledge gained will need to be organized and
presented in a way that the customer can use. Depending on the
requirements, the deployment phase can be as simple as generating a report
or as complex as implementing a repeatable data science process. In many
cases, it will be the customer, not the data analyst, who carries out the
deployment steps. For example, a credit card company may want to deploy
a trained model or set of models (e.g., neural networks, meta-learners) to
quickly identify transactions which have a high probability of being
fraudulent. However, even if the analyst will not carry out the deployment
effort, it is important for the customer to understand up front what actions
will need to be carried out in order to actually make use of the created
models.
Model deployment methods:
In general, there are four ways of deploying models in data science:
1. Data science tools (or cloud)
2. Programming languages (Java, C, VB, …)
3. Databases and SQL scripts (T-SQL, PL/SQL, …)
4. PMML (Predictive Model Markup Language)

An example of using a data mining tool (Orange) to deploy a decision tree model.
