Big Data Analytics - Chapter 1


Introduction

Main points
• Overview of Big Data, Big Data Characteristics
• Different Types of Data
• Data Analytics
• Data Analytics Life Cycle

Big Data is a field dedicated to the analysis, processing, and storage of large collections of data that frequently originate from disparate sources. Big Data solutions and practices are typically required when traditional data analysis, processing, and storage technologies and techniques are insufficient.
The analysis of Big Data datasets is an interdisciplinary endeavor that
blends mathematics,
statistics, computer science and subject matter expertise. This mixture of
skillsets and
perspectives has led to some confusion as to what comprises the field of
Big Data and its
analysis, for the response one receives will be dependent upon the
perspective of whoever
is answering the question. The boundaries of what constitutes a Big Data
problem are also
changing due to the ever-shifting and advancing landscape of software
and hardware
technology.
Data within Big Data environments generally accumulates within the enterprise via applications and sensors, as well as from external sources. Data processed
by a Big Data
solution can be used by enterprise applications directly or can be fed into
a data warehouse
to enrich existing data there. The results obtained through the processing
of Big Data can
lead to a wide range of insights and benefits, such as:
• operational optimization
• actionable intelligence
• identification of new markets
• accurate predictions
• fault and fraud detection
• more detailed records
• improved decision-making
• scientific discoveries
Data Analysis
Data analysis is the process of examining data to find facts, relationships,
patterns,
insights and/or trends. The overall goal of data analysis is to support
better decision making.
A simple data analysis example is the analysis of ice cream sales data in
order to determine how the number of ice cream cones sold is related to
the daily temperature. The results of such an analysis would support
decisions related to how much ice cream a store should order in relation
to weather forecast information. Carrying out data analysis helps
establish patterns and relationships among the data being analyzed.
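
To make the ice cream example concrete, the short Python sketch below computes the correlation between daily temperature and the number of cones sold. The column names and figures are purely illustrative, not part of the original example.

```python
# A minimal sketch of the ice cream example: relating daily temperature
# to the number of cones sold. Values are made up for illustration.
import pandas as pd

sales = pd.DataFrame({
    "temperature_c": [18, 21, 24, 27, 30, 33],
    "cones_sold":    [120, 150, 200, 260, 310, 380],
})

# A simple Pearson correlation quantifies how strongly the two move together.
correlation = sales["temperature_c"].corr(sales["cones_sold"])
print(f"Correlation between temperature and cones sold: {correlation:.2f}")
```

A strong positive correlation like this would support ordering decisions based on the weather forecast.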

Data Analytics
Data analytics is a broader term that encompasses data analysis. Data
analytics is a
discipline that includes the management of the complete data lifecycle,
which encompasses collecting, cleansing, organizing, storing, analyzing
and governing data. The term includes the development of analysis
methods, scientific techniques and automated tools. In Big Data
environments, data analytics has developed methods that allow data
analysis to occur through the use of highly scalable distributed
technologies and frameworks that are capable of analyzing large volumes
of data from different sources. The Big Data analytics lifecycle generally
involves identifying, procuring, preparing and analyzing large amounts of
raw, unstructured data to extract meaningful information that can serve
as an input for identifying patterns, enriching existing enterprise data and
performing large-scale searches.
Different kinds of organizations use data analytics tools and techniques in
different ways.
Take, for example, these three sectors:
• In business-oriented environments, data analytics results can lower
operational costs
and facilitate strategic decision-making.
• In the scientific domain, data analytics can help identify the cause
of a phenomenon to improve the accuracy of predictions.
• In service-based environments like public sector organizations,
data analytics can help strengthen the focus on delivering high-
quality services by driving down costs.
Data analytics enables data-driven decision-making with scientific backing, so that decisions are based on factual data rather than on past experience or intuition alone. There are four general categories of
analytics that are distinguished by the results they produce:
• descriptive analytics
• diagnostic analytics
• predictive analytics
• prescriptive analytics
The different analytics types leverage different techniques and analysis
algorithms. This implies that there may be varying data, storage and
processing requirements to facilitate the delivery of multiple types of
analytic results.

Descriptive Analytics
Descriptive analytics are carried out to answer questions about events
that have already occurred. This form of analytics contextualizes data to
generate information.
Sample questions can include:
• What was the sales volume over the past 12 months?
• What is the number of support calls received as categorized by
severity and
geographic location?
• What is the monthly commission earned by each sales agent?
It is estimated that 80% of generated analytics results are descriptive in nature. In terms of value, descriptive analytics provide the least worth and require a relatively basic skillset. Descriptive analytics are often carried
out via ad-hoc reporting or dashboards, as shown in Figure 1.5. The
reports are generally static in nature and display historical data that is
presented in the form of data grids or charts. Queries are executed on
operational data stores from within an enterprise, for example a Customer Relationship Management (CRM) system or an Enterprise Resource Planning (ERP) system.
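
As an illustration (not part of the original text), the following Python sketch shows a descriptive-analytics style query: summarizing a hypothetical orders table into monthly sales totals, the kind of static report a dashboard might display.

```python
# A hedged sketch of a descriptive analytics report: monthly sales volume
# computed from a hypothetical orders table (illustrative data only).
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-01-28",
                                  "2024-02-10", "2024-03-05"]),
    "amount":     [250.0, 125.5, 410.0, 90.0],
})

# Group historical transactions by month and sum them -- a typical static report.
monthly_sales = orders.groupby(orders["order_date"].dt.to_period("M"))["amount"].sum()
print(monthly_sales)
```
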
Diagnostic Analytics
Diagnostic analytics aim to determine the cause of a phenomenon that
occurred in the past
using questions that focus on the reason behind the event. The goal of
this type of analytics is to determine what information is related to the
phenomenon in order to enable answering questions that seek to
determine why something has occurred.
Such questions include:
• Why were Q2 sales less than Q1 sales?
• Why have there been more support calls originating from the
Eastern region than
from the Western region?
• Why was there an increase in patient re-admission rates over the
past three months?
Diagnostic analytics provide more value than descriptive analytics but
require a more advanced skillset. Diagnostic analytics usually require
collecting data from multiple sources and storing it in a structure that
lends itself to performing drill-down and roll-up analysis, as shown in
Figure 1.6. Diagnostic analytics results are viewed via interactive
visualization tools that enable users to identify trends and patterns. The
executed queries are more complex compared to those of descriptive
analytics and are performed on multidimensional data held in analytic
processing systems.
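
As a hedged illustration, the sketch below shows roll-up and drill-down style views of hypothetical support-call data, similar in spirit to what interactive diagnostic analytics tools provide.

```python
# A minimal sketch of roll-up and drill-down views for diagnostic analytics:
# support calls broken down by region and severity (hypothetical data).
import pandas as pd

calls = pd.DataFrame({
    "region":   ["East", "East", "West", "West", "East"],
    "severity": ["High", "Low", "High", "Low", "High"],
    "calls":    [40, 25, 15, 20, 35],
})

# Roll-up: total calls per region.
rollup = calls.groupby("region")["calls"].sum()
# Drill-down: each region broken out by severity.
drilldown = calls.pivot_table(index="region", columns="severity",
                              values="calls", aggfunc="sum")
print(rollup)
print(drilldown)
```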

Predictive Analytics
Predictive analytics are carried out in an attempt to determine the
outcome of an event that might occur in the future. With predictive
analytics, information is enhanced with meaning to generate knowledge
that conveys how that information is related. The strength and magnitude
of the associations form the basis of models that are used to generate
future predictions based upon past events. It is important to understand
that the models used for predictive analytics have implicit dependencies
on the conditions under which the past events occurred. If these
underlying conditions change, then the models that make predictions
need to be updated.
Questions are usually formulated using a what-if rationale, such as the
following:
• What are the chances that a customer will default on a loan if they
have missed a monthly payment?
• What will be the patient survival rate if Drug B is administered
instead of Drug A?
• If a customer has purchased Products A and B, what are the
chances that they will also purchase Product C?
Predictive analytics try to predict the outcomes of events, and predictions
are made based on patterns, trends and exceptions found in historical
and current data. This can lead to the identification of both risks and
opportunities.
This kind of analytics involves the use of large datasets comprised of
internal and external data and various data analysis techniques. It
provides greater value and requires a more advanced skillset than both
descriptive and diagnostic analytics. The tools used generally abstract
underlying statistical intricacies by providing user-friendly front-end
interfaces.
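
The following sketch illustrates the idea of a predictive model using scikit-learn. The single feature (missed payments) and the tiny dataset are assumptions chosen only to show the mechanics; a realistic model would use many more variables and far more data.

```python
# A hedged sketch of a predictive model: estimating the probability of loan
# default from the number of missed monthly payments (illustrative data).
import numpy as np
from sklearn.linear_model import LogisticRegression

missed_payments = np.array([[0], [0], [1], [2], [3], [4]])
defaulted       = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(missed_payments, defaulted)

# Predicted chance of default for a customer who has missed one payment.
prob_default = model.predict_proba([[1]])[0, 1]
print(f"Estimated default probability: {prob_default:.2f}")
```

Note that, as the text states, such a model is only valid while the conditions that produced the historical data still hold.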

Prescriptive Analytics
Prescriptive analytics build upon the results of predictive analytics by
prescribing actions that should be taken. The focus is not only on which
prescribed option is best to follow, but why. In other words, prescriptive
analytics provide results that can be reasoned about because they embed
elements of situational understanding. Thus, this kind of analytics can be
used to gain an advantage or mitigate a risk.
Sample questions may include:
• Among three drugs, which one provides the best results?
• When is the best time to trade a particular stock?
Prescriptive analytics provide more value than any other type of analytics
and correspondingly require the most advanced skillset, as well as
specialized software and tools. Various outcomes are calculated, and the
best course of action for each outcome is suggested. The approach shifts
from explanatory to advisory and can include the simulation of various
scenarios.
This sort of analytics incorporates internal data with external data.
Internal data might include current and historical sales data, customer
information, product data and business rules. External data may include
social media data, weather forecasts and government produced
demographic data. Prescriptive analytics involve the use of business rules
and large amounts of internal and external data to simulate outcomes
and prescribe the best course of action.
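
As an illustrative sketch of the prescriptive approach, the code below simulates a few candidate discount actions under made-up assumptions about demand and margin, then recommends the action with the best expected outcome.

```python
# A minimal sketch of the prescriptive idea: simulate candidate actions under
# uncertainty and recommend the one with the best expected outcome.
# Actions, costs, and probabilities are invented for illustration only.
import random

random.seed(42)

def simulate_profit(discount_rate, runs=10_000):
    """Crude Monte Carlo: demand rises with the discount, margin falls."""
    total = 0.0
    for _ in range(runs):
        demand = random.gauss(1000 * (1 + 2 * discount_rate), 100)
        margin = 5.0 * (1 - discount_rate)
        total += demand * margin
    return total / runs

actions = {"no_discount": 0.0, "small_discount": 0.10, "large_discount": 0.25}
expected = {name: simulate_profit(rate) for name, rate in actions.items()}
best = max(expected, key=expected.get)
print(expected)
print(f"Recommended action: {best}")
```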

Business Intelligence (BI)


BI enables an organization to gain insight into the performance of an
enterprise by
analyzing data generated by its business processes and information
systems. The results of
the analysis can be used by management to steer the business in an
effort to correct
detected issues or otherwise enhance organizational performance. BI
applies analytics to
large amounts of data across the enterprise, which has typically been
consolidated into an
enterprise data warehouse to run analytical queries. As shown in Figure
1.9, the output of
BI can be surfaced to a dashboard that allows managers to access and
analyze the results
and potentially refine the analytic queries to further explore the data.
Big Data Characteristics

Volume
The anticipated volume of data that is processed by Big Data solutions is
substantial and ever-growing. High data volumes impose distinct data
storage and processing demands, as well as additional data preparation,
curation and management processes.
Typical data sources that are responsible for generating high data
volumes can include:
• online transactions, such as point-of-sale and banking
• scientific and research experiments, such as the Large Hadron
Collider and Atacama Large Millimeter/Submillimeter Array telescope
• sensors, such as GPS sensors, RFIDs, smart meters and telematics
• social media, such as Facebook and Twitter
Velocity
In Big Data environments, data can arrive at fast speeds, and enormous
datasets can accumulate within very short periods of time. From an
enterprise’s point of view, the velocity of data translates into the amount
of time it takes for the data to be processed once it enters the
enterprise’s perimeter. Coping with the fast inflow of data requires the
enterprise to design highly elastic and available data processing solutions
and corresponding data storage capabilities. Depending on the data source, velocity may not always be high.
Variety
Data variety refers to the multiple formats and types of data that need to
be supported by Big Data solutions. Data variety brings challenges for
enterprises in terms of data integration, transformation, processing, and
storage. Figure 1.14 provides a visual representation of data variety,
which includes structured data in the form of financial transactions, semi-
structured data in the form of emails and unstructured data in the form of
images.
Veracity
Veracity refers to the quality or fidelity of data. Data that enters Big Data
environments needs to be assessed for quality, which can lead to data
processing activities to resolve invalid data and remove noise. In relation
to veracity, data can be part of the signal or noise of a dataset. Noise is
data that cannot be converted into information and thus has no value,
whereas signals have value and lead to meaningful information. Data with
a high signal-to-noise ratio has more veracity than data with a lower ratio.
Data that is acquired in a controlled manner, for example via online
customer registrations, usually contains less noise than data acquired via
uncontrolled sources, such as blog postings. Thus the signal-to-noise ratio
of data is dependent upon the source of the data and its type.
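
A minimal sketch of a basic veracity check is shown below: records falling outside an assumed plausible range are treated as noise, and the resulting signal-to-noise ratio is reported. The field names and bounds are illustrative assumptions.

```python
# A hedged sketch of a veracity check: separate signal from noise by flagging
# implausible sensor readings (field names and bounds are assumptions).
import pandas as pd

readings = pd.DataFrame({"sensor_id": [1, 2, 3, 4, 5],
                         "temperature_c": [21.5, 22.0, -999.0, 23.1, 150.0]})

valid = readings["temperature_c"].between(-40, 60)   # assumed plausible range
signal = readings[valid]
noise = readings[~valid]

print(f"Signal records: {len(signal)}, noise records: {len(noise)}")
print(f"Signal-to-noise ratio: {len(signal) / max(len(noise), 1):.1f}")
```
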
Value
Value is defined as the usefulness of data for an enterprise. The value
characteristic is intuitively related to the veracity characteristic in that the
higher the data fidelity, the more value it holds for the business. Value is
also dependent on how long data processing takes because analytics
results have a shelf-life; for example, a 20 minute delayed stock quote
has little to no value for making a trade compared to a quote that is 20
milliseconds old. As demonstrated, value and time are inversely related.
The longer it takes for data to be turned into meaningful information, the
less value it has for a business. Stale results inhibit the quality and speed
of informed decision-making. Figure 1.15 provides two illustrations of how
value is impacted by the veracity of data and the timeliness of generated
analytic results.

Different Types of Data


The data processed by Big Data solutions can be human-generated or
machine-generated, although it is ultimately the responsibility of
machines to generate the analytic results. Human-generated data is the
result of human interaction with systems, such as online services and
digital devices.
Structured Data
Structured data conforms to a data model or schema and is often stored
in tabular form. It is used to capture relationships between different
entities and is therefore most often stored in a relational database.
Structured data is frequently generated by enterprise applications and
information systems like ERP and CRM systems. Due to the abundance of
tools and databases that natively support structured data, it rarely
requires special consideration in regards to processing or storage.
Examples of this type of data include banking transactions, invoices, and
customer records.
Unstructured Data
Data that does not conform to a data model or data schema is known as
unstructured data. It is estimated that unstructured data makes up 80%
of the data within any given enterprise. Unstructured data has a faster
growth rate than structured data. This form of data is either textual or
binary and often conveyed via files that are self-contained and non-
relational. A text file may contain the contents of various tweets or blog
postings. Binary files are often media
files that contain image, audio or video data. Technically, both text and
binary files have a structure defined by the file format itself, but this
aspect is disregarded, and the notion of being unstructured is in relation
to the format of the data contained in the file itself. Special purpose logic
is usually required to process and store unstructured data. For example,
to play a video file, it is essential that the correct codec (coder-decoder) is
available. Unstructured data cannot be directly processed or queried
using SQL. If it is required to be stored within a relational database, it is
stored in a table as a Binary Large Object (BLOB). Alternatively, a Not-only
SQL (NoSQL) database is a non-relational database that can be used to
store unstructured data alongside structured data.
Semi-structured Data
Semi-structured data has a defined level of structure and consistency, but
is not relational in nature. Instead, semi-structured data is hierarchical or
graph-based. This kind of data is commonly stored in files that contain
text. For instance, XML and JSON files are common forms of semi-
structured data. Due to the textual nature of this data and its
conformance to some level of structure, it is more easily processed than
unstructured data.
Examples of common sources of semi-structured data include electronic
data interchange (EDI) files, spreadsheets, RSS feeds and sensor data.
Semi-structured data often has special pre-processing and storage
requirements, especially if the underlying format is not text-based. An
example of pre-processing of semi-structured data would be the
validation of an XML file to ensure that it conformed to its schema
definition.
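
As a small illustration, the sketch below parses a hypothetical XML fragment with Python's standard library and extracts individual fields, showing the hierarchical nature of semi-structured data. Full validation against an XML schema definition would require additional tooling beyond this sketch.

```python
# A minimal sketch of handling semi-structured data: parsing a small,
# made-up XML fragment and pulling out individual fields.
import xml.etree.ElementTree as ET

xml_doc = """
<order id="1001">
    <customer>Jane Doe</customer>
    <items>
        <item sku="A12" qty="2"/>
        <item sku="B07" qty="1"/>
    </items>
</order>
"""

root = ET.fromstring(xml_doc)
print("Order:", root.get("id"), "for", root.findtext("customer"))
for item in root.iter("item"):
    print("  item", item.get("sku"), "x", item.get("qty"))
```
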
Metadata
Metadata provides information about a dataset’s characteristics and
structure. This type of data is mostly machine-generated and can be
appended to data. The tracking of metadata is crucial to Big Data
processing, storage and analysis because it provides information about
the pedigree of the data and its provenance during processing. Examples
of metadata include:
• XML tags providing the author and creation date of a document
• attributes providing the file size and resolution of a digital
photograph
Big Data solutions rely on metadata, particularly when processing semi-
structured and
unstructured data.

Big Data Analytics Lifecycle


Big Data analysis differs from traditional data analysis primarily due to
the volume, velocity and variety characteristics of the data being
processed. To address the distinct requirements for performing analysis on
Big Data, a step-by-step methodology is needed to organize the activities
and tasks involved with acquiring, processing, analyzing and repurposing
data. The upcoming sections explore a specific data analytics lifecycle
that organizes and manages the tasks and activities associated with the
analysis of Big Data.
From a Big Data adoption and planning perspective, it is important that in
addition to the lifecycle, consideration be made for issues of training,
education, tooling and staffing of a data analytics team.
The Big Data analytics lifecycle can be divided into the following nine stages:
1. Business Case Evaluation
2. Data Identification
3. Data Acquisition & Filtering
4. Data Extraction
5. Data Validation & Cleansing
6. Data Aggregation & Representation
7. Data Analysis
8. Data Visualization
9. Utilization of Analysis Results

Business Case Evaluation


Each Big Data analytics lifecycle must begin with a well-defined business
case that presents a clear understanding of the justification, motivation
and goals of carrying out the analysis. The Business Case Evaluation
stage requires that a business case be created, assessed and approved
prior to proceeding with the actual hands-on analysis tasks.
An evaluation of a Big Data analytics business case helps decision-makers
understand the business resources that will need to be utilized and which
business challenges the analysis will tackle. The further identification of
KPIs during this stage can help determine assessment criteria and
guidance for the evaluation of the analytic results. If KPIs are not readily
available, efforts should be made to make the goals of the analysis
project SMART, which stands for specific, measurable, attainable, relevant
and timely. Based on business requirements that are documented in the
business case, it can be determined whether the business problems being
addressed are really Big Data problems.
In order to qualify as a Big Data problem, a business problem needs to be
directly related to one or more of the Big Data characteristics of volume,
velocity, or variety. Note also that another outcome of this stage is the
determination of the underlying budget required to carry out the analysis
project. Any required purchase, such as tools, hardware and training,
must be understood in advance so that the anticipated investment can be
weighed against the expected benefits of achieving the goals. Initial
iterations of the Big Data analytics lifecycle will require more up-front
investment in Big Data technologies, products and training compared to
later iterations where these earlier investments can be repeatedly
leveraged.
Data Identification
The Data Identification stage is dedicated to identifying the datasets
required for the analysis project and their sources. Identifying a wider
variety of data sources may increase the probability of finding hidden
patterns and correlations. For example, to provide insight, it can be
beneficial to identify as many types of related data sources as possible,
especially when it is unclear exactly what to look for.
Depending on the business scope of the analysis project and nature of the
business problems being addressed, the required datasets and their
sources can be internal and/or external to the enterprise. In the case of
internal datasets, a list of available datasets from internal sources, such
as data marts and operational systems, is typically compiled and
matched against a predefined dataset specification.
In the case of external datasets, a list of possible third-party data
providers, such as data markets and publicly available datasets, is
compiled. Some forms of external data may be embedded within blogs or
other types of content-based web sites, in which case they may need to
be harvested via automated tools.
Data Acquisition and Filtering
During the Data Acquisition and Filtering stage the data is gathered from
all of the data sources that were identified during the previous stage. The
acquired data is then subjected to automated filtering for the removal of
corrupt data or data that has been deemed to have no value to the
analysis objectives. Depending on the type of data source, data may
come as a collection of files, such as data purchased from a third-party
data provider, or may require API integration, such as with Twitter. In
many cases, especially where external, unstructured data is concerned,
some or most of the acquired data may be irrelevant (noise) and can be
discarded as part of the filtering process. Data classified as “corrupt” can
include records with missing or nonsensical values or invalid data types.
Data that is filtered out for one analysis may possibly be valuable for a
different type of analysis.
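
A minimal sketch of automated filtering is shown below: records with missing or nonsensical values in a hypothetical dataset are removed before analysis. The field names and thresholds are assumptions for illustration.

```python
# A hedged sketch of the filtering step: drop records with missing keys or
# nonsensical values from a hypothetical acquired dataset.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [101, 102, None, 104, 105],
    "age":         [34, -5, 29, 41, 52],          # -5 is a nonsensical value
    "purchase":    [59.99, 120.00, 35.50, None, 19.99],
})

filtered = (
    raw.dropna(subset=["customer_id", "purchase"])   # drop missing keys/values
       .query("age >= 0 and age <= 120")             # drop impossible ages
)
print(filtered)
```
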
Data Extraction
Some of the data identified as input for the analysis may arrive in a
format incompatible with the Big Data solution. The need to address
disparate types of data is more likely with data from external sources. The
Data Extraction lifecycle stage is dedicated to extracting disparate data and transforming it into a format that the underlying Big Data solution can use for the purpose of the data analysis. The extent of extraction and
transformation required depends on the types of analytics and
capabilities of the Big Data solution. For example, extracting the required
fields from delimited textual data, such as with webserver log files, may
not be necessary if the underlying Big Data solution can already directly
process those files. Similarly, extracting text for text analytics, which
requires scans of whole documents, is simplified if the underlying Big
Data solution can directly read the document in its native format. In other cases, further transformation may be needed to separate combined data into the individual fields required by the Big Data solution.
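
As an illustration, the sketch below extracts structured fields from simplified, assumed web-server log lines of the kind mentioned above (timestamp, status code, and requested path).

```python
# A minimal sketch of the extraction step: pulling structured fields out of
# delimited log lines. The log format shown is a simplified assumption.
log_lines = [
    "2024-03-01T10:15:32 200 /index.html",
    "2024-03-01T10:15:40 404 /missing.html",
]

records = []
for line in log_lines:
    timestamp, status, path = line.split(" ", 2)
    records.append({"timestamp": timestamp, "status": int(status), "path": path})

print(records)
```
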
Data Validation and Cleansing
Invalid data can skew and falsify analysis results. Unlike traditional
enterprise data, where the data structure is pre-defined and data is pre-
validated, data input into Big Data analyses can be unstructured without
any indication of validity. Its complexity can further make it difficult to
arrive at a set of suitable validation constraints. Big Data solutions often
receive redundant data across different datasets. This redundancy can be
exploited to explore interconnected datasets in order to assemble
validation parameters and fill in missing valid data.
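
The following sketch illustrates how redundancy across datasets can be exploited to fill in missing values: a customer's missing city in one hypothetical dataset is recovered from another dataset that carries the same information. Dataset and field names are illustrative assumptions.

```python
# A hedged sketch of cleansing via redundancy: fill a missing value in one
# dataset using the matching field from another (hypothetical) dataset.
import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "city": ["Pune", None, "Delhi"]})
orders = pd.DataFrame({"customer_id": [2, 3],
                       "ship_city": ["Mumbai", "Delhi"]})

merged = crm.merge(orders, on="customer_id", how="left")
merged["city"] = merged["city"].fillna(merged["ship_city"])
print(merged[["customer_id", "city"]])
```
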
Data Aggregation and Representation
Data may be spread across multiple datasets, requiring that datasets be
joined together via common fields, for example date or ID. In other cases,
the same data fields may appear in multiple datasets, such as date of
birth. Either way, a method of data reconciliation is required or the
dataset representing the correct value needs to be determined.
Performing this stage can become complicated because of differences in:
• Data Structure – Although the data format may be the same, the
data model may be different.
• Semantics – A value that is labeled differently in two different
datasets may mean the same thing, for example “surname” and
“last name.”
The large volumes processed by Big Data solutions can make data
aggregation a time and effort-intensive operation. Reconciling these
differences can require complex logic that is executed automatically
without the need for human intervention.
Future data analysis requirements need to be considered during this
stage to help foster data reusability. Whether data aggregation is required
or not, it is important to understand that the same data can be stored in
many different forms. One form may be better suited for a particular type
of analysis than another. For example, data stored as a BLOB would be of
little use if the analysis requires access to individual data fields.
A data structure standardized by the Big Data solution can act as a
common denominator that can be used for a range of analysis techniques
and projects. This can require establishing a central, standard analysis
repository, such as a NoSQL database.
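
As a hedged illustration, the sketch below joins two hypothetical datasets on a common ID and reconciles a semantic difference ("surname" versus "last_name") before the join. The column names and rows are assumptions.

```python
# A minimal sketch of data aggregation: join two datasets on a common field
# after reconciling differently labeled but identical columns.
import pandas as pd

hr = pd.DataFrame({"emp_id": [1, 2], "surname": ["Shah", "Iyer"]})
payroll = pd.DataFrame({"emp_id": [1, 2], "last_name": ["Shah", "Iyer"],
                        "salary": [50000, 62000]})

# Reconcile semantics: treat "last_name" and "surname" as the same field.
payroll = payroll.rename(columns={"last_name": "surname"})
combined = hr.merge(payroll, on=["emp_id", "surname"], how="inner")
print(combined)
```
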
Data Analysis
The Data Analysis stage is dedicated to carrying out the actual analysis
task, which typically involves one or more types of analytics. This stage
can be iterative in nature, especially if the data analysis is exploratory, in
which case analysis is repeated until the appropriate pattern or
correlation is uncovered.
• The exploratory analysis approach will be explained shortly, along
with confirmatory analysis. Depending on the type of analytic result
required, this stage can be as simple as querying a dataset to
compute an aggregation for comparison. On the other hand, it can
be as challenging as combining data mining and complex statistical
analysis techniques to discover patterns and anomalies or to
generate a statistical or mathematical model to depict relationships
between variables.
• Confirmatory data analysis is a deductive approach where the cause
of the phenomenon being investigated is proposed beforehand. The
proposed cause or assumption is called a hypothesis. The data is
then analyzed to prove or disprove the hypothesis and provide
definitive answers to specific questions. Data sampling techniques
are typically used. Unexpected findings or anomalies are usually
ignored since a predetermined cause was assumed.
Exploratory data analysis is an inductive approach that is closely
associated with data mining. No hypothesis or predetermined
assumptions are generated. Instead, the data is explored through
analysis to develop an understanding of the cause of the phenomenon.
Although it may not provide definitive answers, this method provides a
general direction that can facilitate the discovery of patterns or
anomalies.
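
To illustrate the contrast, the sketch below applies a confirmatory test (a t-test of a stated hypothesis, using SciPy) and an exploratory summary to made-up quarterly sales samples; both the data and the hypothesis are assumptions for illustration.

```python
# A hedged sketch contrasting confirmatory and exploratory analysis on
# invented quarterly sales samples.
import pandas as pd
from scipy import stats

q1 = [100, 110, 95, 105, 98]
q2 = [90, 92, 88, 94, 91]

# Confirmatory: a stated hypothesis (Q1 and Q2 mean sales differ) is tested.
t_stat, p_value = stats.ttest_ind(q1, q2)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Exploratory: no hypothesis, just descriptive summaries to spot patterns.
print(pd.DataFrame({"Q1": q1, "Q2": q2}).describe())
```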

Data Visualization
The ability to analyze massive amounts of data and find useful insights
carries little value if the only ones that can interpret the results are the
analysts. The Data Visualization stage is dedicated to using data
visualization techniques and tools to graphically communicate the
analysis results for effective interpretation by business users.
Business users need to be able to understand the results in order to
obtain value from the analysis and subsequently have the ability to
provide feedback, as indicated by the dashed line leading from stage 8
back to stage 7. The results of completing the Data Visualization stage
provide users with the ability to perform visual analysis, allowing for the
discovery of answers to questions that users have not yet even
formulated.
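
As a minimal illustration, the sketch below renders hypothetical monthly sales results as a simple chart using matplotlib, the kind of output a business user might inspect on a dashboard.

```python
# A minimal sketch of the visualization stage: presenting analysis results
# as a simple chart (illustrative data only).
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 145, 170, 190]

plt.figure(figsize=(6, 3))
plt.bar(months, sales)
plt.title("Monthly Sales (units)")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.tight_layout()
plt.show()
```
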
Utilization of Analysis Results
Subsequent to analysis results being made available to business users to
support business decision-making, such as via dashboards, there may be
further opportunities to utilize the analysis results. The Utilization of
Analysis Results stage is dedicated to determining how and where
processed analysis data can be further leveraged. Depending on the
nature of the analysis problems being addressed, it is possible for the
analysis results to produce “models” that encapsulate new insights and
understandings about the nature of the patterns and relationships that
exist within the data that was analyzed. A model may look like a
mathematical equation or a set of rules. Models can be used to improve
business process logic and application system logic, and they can form
the basis of a new system or software program.
Common areas that are explored during this stage include the following:
• Input for Enterprise Systems – The data analysis results may be automatically or manually fed directly into enterprise systems to enhance and optimize their behaviors and performance. For example, an online store can be fed processed customer-related analysis results that may impact how it generates product recommendations. New models may be used to improve the programming logic within existing enterprise systems or may form the basis of new systems.
• Business Process Optimization – The identified patterns, correlations and anomalies discovered during the data analysis are used to refine business processes. An example is consolidating transportation routes as part of a supply chain process. Models may also lead to opportunities to improve business process logic.
• Alerts – Data analysis results can be used as input for existing alerts or may form the basis of new alerts. For example, alerts may be created to inform users via email or SMS text about an event that requires them to take corrective action.

Differences between BI and Data Science.


Business Intelligence (BI) and Data Science are both crucial fields that
deal with data analysis, but they serve different purposes and employ
different techniques. Here’s a detailed comparison highlighting the key
differences between BI and Data Science:
1. Definition
• Business Intelligence (BI):
o BI refers to the processes and technologies used to collect,
analyze, and present business data to support decision-making.
It focuses on historical and current data to provide insights into
business performance.
• Data Science:
o Data Science is an interdisciplinary field that uses scientific
methods, algorithms, and systems to extract knowledge and
insights from structured and unstructured data. It involves
predictive modeling, machine learning, and advanced analytics
to forecast future outcomes.
2. Purpose
• BI:
o The primary purpose of BI is to provide actionable insights for
business decisions based on historical and current data. It
emphasizes descriptive analytics, allowing organizations to
understand their performance and operations.
• Data Science:
o Data Science aims to uncover hidden patterns, make
predictions, and derive insights from data. It often focuses on
answering complex questions, forecasting future trends, and
supporting innovative solutions through advanced analytics.
3. Data Types and Scope
• BI:
o Typically deals with structured data from various business
operations, such as sales figures, financial metrics, and
customer feedback. It emphasizes summarizing and visualizing
this data through dashboards and reports.
• Data Science:
o Works with both structured and unstructured data, including
text, images, and videos. Data Science incorporates various
data types and often requires data cleaning, transformation,
and complex analysis to derive insights.
4. Techniques and Tools
• BI:
o Relies on tools for data visualization, reporting, and
dashboards. Common BI tools include Tableau, Power BI,
QlikView, and Excel. The techniques focus on descriptive
analytics and reporting (e.g., aggregating data, generating
summaries).
• Data Science:
o Utilizes a broader range of techniques, including statistical
analysis, machine learning, data mining, and predictive
modeling. Data scientists often use programming languages
such as Python or R and tools like Jupyter Notebooks,
TensorFlow, and Apache Spark.
5. Skill Sets
• BI:
o BI professionals typically have skills in data visualization,
reporting, and understanding business processes. They often
possess knowledge of SQL, data warehousing, and business
analysis.
• Data Science:
o Data scientists require a strong foundation in mathematics,
statistics, programming, and algorithms. They should also be
proficient in machine learning, data mining, and data wrangling
techniques.
6. Output
• BI:
o The output of BI includes dashboards, reports, and data
visualizations that help stakeholders understand past and
current performance. BI outputs are often straightforward,
focusing on what has happened in the business.
• Data Science:
o The output of Data Science may include predictive models,
machine learning algorithms, and statistical analyses. These
outputs are often used to forecast future trends, identify
patterns, and recommend actions.
7. Time Orientation
• BI:
o Primarily focuses on historical and current data to inform
business decisions. It helps organizations understand past
performance and make decisions based on that information.
• Data Science:
o Often looks to the future, using historical data to make
predictions and provide insights about potential future events
or trends. Data Science seeks to explore "what could happen"
rather than just "what has happened."
8. Business Application
• BI:
o Commonly applied in reporting and analyzing business
performance, tracking key performance indicators (KPIs), and
facilitating data-driven decision-making across various business
functions.
• Data Science:
o Applied in more complex scenarios, such as developing
recommendation systems, performing customer segmentation,
fraud detection, and automating processes through machine
learning.
Conclusion
While Business Intelligence focuses on analyzing past and present
data to provide insights for immediate decision-making, Data
Science delves into more advanced analytics, aiming to predict
future trends and uncover hidden patterns in data. BI typically
emphasizes visualization and reporting, while Data Science involves
complex modeling and algorithms. Both fields are complementary
and play vital roles in helping organizations leverage data for better
decision-making and strategic planning.
Aspect | Business Intelligence (BI) | Data Science
Definition | Processes and technologies for collecting and analyzing business data to support decision-making. | Interdisciplinary field using scientific methods to extract insights from data.
Purpose | To provide actionable insights based on historical and current data. | To uncover patterns, make predictions, and derive insights from data.
Data Types and Scope | Primarily deals with structured data (e.g., sales figures, financial metrics). | Works with both structured and unstructured data (e.g., text, images).
Techniques and Tools | Relies on data visualization and reporting tools (e.g., Tableau, Power BI). | Utilizes statistical analysis, machine learning, and programming languages (e.g., Python, R).
Skill Sets | Skills in data visualization, reporting, and business analysis; knowledge of SQL. | Strong foundation in mathematics, statistics, and programming; expertise in machine learning.
Output | Dashboards, reports, and data visualizations showing past and current performance. | Predictive models, machine learning algorithms, and statistical analyses for future trends.
Time Orientation | Focuses on historical and current data for decision-making. | Looks to the future to predict trends and analyze potential outcomes.
Business Application | Commonly used for tracking KPIs, analyzing business performance, and reporting. | Applied in complex scenarios like recommendation systems, customer segmentation, and fraud detection.
