BI-Class Notes - Unit 1-4-5

The document provides an introduction to Business Intelligence (BI), covering its definition, history, components, and various analytics types such as descriptive, predictive, prescriptive, and streaming analytics. It emphasizes the importance of BI in making effective and timely business decisions and outlines the BI life cycle, including data collection, warehousing, analysis, and reporting. Additionally, it discusses BI architecture, including core components and tools that support data integration and visualization for informed decision-making.

UNIT 1
Business Intelligence Introduction [6 Hours]
• Definition
• History of Business Intelligence
• Leveraging Data and Knowledge for BI
• BI Components
• Business Intelligence and Business Analytics
• BI Life Cycle
• Business Intelligence Architectures
• Effective and Timely Decisions
Definition & History of Business Intelligence
• In 1865, Richard Millar Devens presented the phrase "Business Intelligence" (BI) in the "Cyclopædia of Commercial and Business Anecdotes."
• He used it to describe how Sir Henry Furnese, a banker, profited from information by gathering and acting on it before his competition.
• More recently, in 1958, an article was written by an IBM computer scientist named Hans Peter Luhn, describing the potential of gathering business intelligence (BI) through the use of technology.
• Business intelligence, as it is understood today, uses technology to gather and analyze data, translate it into useful information, and act on it "before the competition."
• Essentially, the modern version of BI focuses on technology as a way to make decisions quickly and efficiently, based on the right information at the right time.
History
• In 1968, only individuals with extremely specialized skills could translate data into usable information.
• At this time, data from multiple sources was normally stored in silos, and research was typically presented in a fragmented, disjointed report that was open to interpretation.
• Edgar Codd recognized this as a problem, and published a paper in 1970 that altered how people thought about databases.
• His proposal of a "relational database model" gained tremendous popularity and was adopted worldwide.
History
• Decision support systems (DSS) were the first database management systems to be developed.
• Many historians suggest the modern version of business intelligence evolved from the DSS database.
• The number of BI vendors grew in the 1980s, as business people discovered the value of business intelligence.
• An assortment of tools was developed during this time to access and organize data in simpler ways.
• OLAP, executive information systems, and data warehouses were some of the tools developed to work with DSS.
Business Intelligence and Business Analytics
• Currently, the two terms are used interchangeably.
• Both describe the general practice of using data in making informed, intelligent business decisions.
• The term business intelligence has evolved to depend on a range of technologies that provide useful insights.
• Conversely, analytics represents the tools and processes that can translate raw data into actionable, useful information for decision-making purposes.
• Different forms of analytics have been developed, including streaming analytics, which works in real time.
Descriptive Analytics
• Descriptive analytics describes, or summarizes, data and is focused primarily on historical information.
• This type of analytics describes the past, allowing for an understanding of how previous behaviors affect the present.
• Descriptive analytics can be used to explain how a company operates and to describe different aspects of the business.
• In the best-case scenario, descriptive analytics tells a story with a relevant theme and provides useful information.
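To make the idea concrete, here is a minimal descriptive-analytics sketch in Python, assuming a hypothetical sales table with region and revenue columns; it only summarizes historical data, which is all descriptive analytics does.

```python
# A minimal descriptive-analytics sketch (hypothetical sales data).
import pandas as pd

# Historical transactions -- the "past" that descriptive analytics summarizes.
sales = pd.DataFrame({
    "region":  ["North", "South", "North", "East", "South"],
    "revenue": [1200.0, 800.0, 1500.0, 950.0, 700.0],
})

# Summarize the past: totals, averages and counts per region.
summary = sales.groupby("region")["revenue"].agg(["sum", "mean", "count"])
print(summary)
```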
Predictive Analytics
• Predictive analytics is used to predict the future.
• This type of analytics uses statistical data to supply companies with useful insights about upcoming changes, such as identifying sales trends, purchasing patterns, and forecasting customer behavior.
• The business uses of predictive analytics normally include anticipating sales growth at the end of the year, what products customers might purchase simultaneously, and forecasting inventory totals.
• Credit scores offer an example of this type of analytics, with financial services using them to determine a customer's probability of making payments on time.
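As a hedged illustration of the forecasting idea, the sketch below fits a simple linear trend to twelve months of made-up sales figures and predicts the next month; real predictive analytics would use richer models and validated data.

```python
# A minimal predictive-analytics sketch: forecasting next month's sales
# from past monthly totals with a simple linear trend (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)          # months 1..12 as features
sales = np.array([100, 104, 110, 113, 120, 125,   # observed monthly sales
                  131, 135, 142, 147, 153, 160])

model = LinearRegression().fit(months, sales)     # learn the trend from history
forecast = model.predict([[13]])                  # predict month 13
print(f"Forecast for month 13: {forecast[0]:.1f}")
```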
Prescriptive Analytics
• Prescriptive analytics is a relatively new field and is still a little hard to work with.
• This type of analytics "prescribes" several different possible actions and guides people toward a solution.
• Prescriptive analytics is designed to provide advice.
• Essentially, it predicts multiple futures and allows organizations to assess many possible outcomes, based upon their actions.
• In the best-case scenario, prescriptive analytics will predict what will happen, why it will happen, and provide recommendations.
• Larger companies have used prescriptive analytics to successfully optimize scheduling, revenue streams, and inventory, in turn improving the customer experience.
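Since prescriptive analytics is essentially optimization-driven advice, a minimal sketch might phrase a production decision as a linear program; the products, profits, and capacity limits below are entirely made up.

```python
# A minimal prescriptive-analytics sketch: recommending how much of two
# products to produce to maximize profit under capacity limits (toy numbers).
from scipy.optimize import linprog

# Profit per unit of products A and B (linprog minimizes, so negate).
profit = [-40, -30]

# Constraints: machine hours (2A + 1B <= 100) and labor hours (1A + 2B <= 80).
A_ub = [[2, 1], [1, 2]]
b_ub = [100, 80]

result = linprog(profit, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
a, b = result.x
print(f"Recommended plan: produce {a:.0f} of A and {b:.0f} of B")  # 40 and 20
```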
Streaming Analytics
• Streaming analytics is the real-time processing of data. It is designed to constantly calculate, monitor, and manage data-based statistical information, and respond immediately.
• The process deals with recognizing and responding to specific situations as they happen.
• Streaming analytics has significantly improved the development and use of business information.
• Data for streaming analytics can come from a variety of sources, including mobile phones, the Internet of Things (IoT), market data, transactions, and mobile devices (tablets, laptops).
Streaming Analytics
• It connects management to external data sources, allowing applications to combine and merge data into an application flow, or update external databases with processed information, quickly and efficiently. Streaming analytics supports:
• Minimizing damage caused by social media meltdowns, security breaches, airplane crashes, manufacturing defects, stock exchange meltdowns, customer churn, etc.
• Analyzing routine business operations in real time
• Finding missed opportunities with big data
• The option to create new business models, revenue streams, and product innovations
• Some examples of streaming data are social media feeds, real-time stock trades, up-to-the-minute retail inventory management, and ride-sharing apps.
• For instance, when a customer calls Lyft, streams of data are joined to create a seamless user experience. The application merges real-time location tracking, pricing, and traffic data to provide the customer with the nearest available driver, pricing, and a time estimate to the destination, using both historical and real-time data.
• Streaming analytics has become an extremely useful tool for short-term coordination, as well as for developing business intelligence over the long term.
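A minimal streaming-analytics sketch, assuming a simulated event feed: it maintains a rolling statistic over a sliding window and responds the moment a threshold is crossed, mirroring the constant calculate-monitor-respond loop described above.

```python
# A minimal streaming-analytics sketch: a rolling average over an event
# stream with an immediate response when a threshold is crossed (toy data).
from collections import deque

def monitor(stream, window=3, threshold=100.0):
    recent = deque(maxlen=window)          # keep only the most recent events
    for value in stream:
        recent.append(value)
        avg = sum(recent) / len(recent)    # statistic maintained continuously
        if avg > threshold:                # respond as the situation happens
            print(f"ALERT: rolling average {avg:.1f} exceeds {threshold}")

# Simulated feed, e.g., per-minute transaction amounts.
monitor([90, 95, 102, 120, 130, 80])
```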
BI Life Cycle
• Business Intelligence is a strategic initiative which helps organizations measure the effectiveness of their plans in the market.
• A successful company must know how to plan and how to address a BI strategy so that the project or projects involved in the process have maximum profitability.
• Company managers, together with each project manager, should adopt a specific methodology based on the needs they know they have.

BI vs Analytics
❖ BI uses past data for current business operating decisions. Meanwhile, Analytics uses past data for planning future business decisions.
❖ BI is descriptive (demographic answers and performance answers), while Analytics is predictive (predictive answers and recommendations).
❖ BI discusses reporting and dashboards. Analytics talks about future-looking probability.
Business Intelligence Lifecycle
1. Data Collection
• Because BI uses data to obtain information, data collection can be done in two ways: primary (such as interviews) and secondary (such as searching the internet).
2. Data Warehousing
• Data warehousing is a storage area for the data that has been collected, which is large in volume.
3. Data Analysis & Q.A.
• The raw data is then analyzed by entering a query or question to derive useful information and conclusions.
4. Reporting, Dashboards, KPIs, Trends
• After entering the query or question, the information is returned in the form of reporting, dashboards, KPIs, and trends.
5. Business Decision
• Once the visualization is available, the final step is to make the decision by considering the information obtained.
Business Intelligence Architectures

Business intelligence architecture components and diagram

A BI architecture can be deployed in an on-premises data center or in the cloud.

In either case, it contains a set of core components that collectively support the different stages of the BI process, from data collection, integration, data storage and analysis to data visualization, information delivery and the use of BI data in business decision-making.

The core components of a BI architecture include the following:
• Source systems.
• These are all of the systems that capture and hold the transactional and
operational data identified as essential for the enterprise BI program.
• For example, this can include enterprise resource planning, customer
relationship management, flat files, application programming interfaces,
finance, manufacturing and supply chain management systems as well
as secondary sources, such as market data and customer databases from
outside information providers.
• As a result, both internal and external data sources are often
incorporated into a BI architecture.
• Important criteria in the data source selection process include data
relevancy, data currency, data quality and the level of detail in the
available data sets.
• In addition, a combination of structured, semi-structured and
unstructured data types might be required to meet the data analysis and
decision-making needs of executives and other end users.
Business Intelligence Architectures
• Data integration and cleansing tools.
• To effectively analyze the collected data for a BI program, an organization
must integrate and consolidate different data sets to create unified views of
them.
• The most widely used data integration technology for BI applications is
extract, transform and load (ETL) software, which pulls data from source
systems in batch processes.
• A variant of ETL is extract, load and transform, a technology in which data
is extracted and loaded as-is and transformed later for specific BI uses.
• Other methods include real-time data integration, such as change data
capture and streaming integration to support real-time analytics applications,
and data virtualization, which combines data from different source systems
virtually.
• A BI architecture typically also includes data profiling and data cleansing
tools that are used to identify and fix data quality issues.
• They help BI and data management teams provide clean, consistent data
that's suitable for BI uses.
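To illustrate the extract-transform-load pattern described above, here is a minimal ETL sketch in Python, assuming a hypothetical CSV feed and a SQLite stand-in for the warehouse; production BI stacks would use dedicated ETL software, but the phases are the same.

```python
# A minimal ETL sketch: extract from a (stand-in) source file, transform,
# and load into a SQLite table standing in for the warehouse.
import csv
import io
import sqlite3

raw = io.StringIO("id,amount\n1, 100 \n2,250\n")   # stand-in for a source file

# Extract: pull rows from the source system.
rows = list(csv.DictReader(raw))

# Transform: clean and convert types before loading.
cleaned = [(int(r["id"]), float(r["amount"].strip())) for r in rows]

# Load: write the unified result into the warehouse (here, SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
print(conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0])  # 350.0
```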
The core components of a BI architecture include the following:
• Analytics data stores.
• This encompasses the various repositories where BI data is stored and
managed.
• The primary repository is a data warehouse, which usually stores
structured data in a relational, columnar or multidimensional database
and makes it available for querying and analysis.
• An enterprise data warehouse can also be tied to smaller data marts set
up for individual departments and business units with data that's specific
to their BI needs.
• BI architectures often include an operational data store (ODS) that's an
interim repository for data before it goes into a data warehouse. An
ODS can also be used to run analytical queries against recent transaction
data. Depending on the size of a BI environment, a data warehouse, data
mart and an ODS can be deployed on a single database server or
separate business intelligence systems.
• A well-planned architecture should specify which of the different data
stores is best suited for particular BI uses.
The core components of a BI architecture include the following:
• BI and data visualization tools.
• The tools used to analyze data and present information to
business users include a suite of technologies that can be
built into a BI architecture -- for example, ad hoc query,
data mining and online analytical processing software.
• The growing adoption of self-service BI tools enables
business analysts and managers to run queries themselves
instead of relying on the members of the BI team to do
that for them.
• BI software also includes data visualization tools that can
be used to create graphical representations of data in the
form of charts, graphs and other types of visualizations
designed to illustrate trends, patterns and outlier
elements in data sets.
The core components of a BI architecture include the following:
• Dashboards, portals and reports.
• These information delivery tools give users visibility into the
results of BI and analytics applications with built-in data
visualizations and, often, self-service capabilities to do
additional data analysis.
• For example, BI dashboards and online portals can be
designed to provide real-time data access with configurable
views and give users the ability to drill down into data.
Reports tend to present data in a more static format.
• Other components that increasingly are part of a BI
architecture include data preparation software used to
structure and organize data for analysis and a metadata
repository, a business glossary and a data catalog, which can
help users find relevant data and understand its lineage and
meaning.
The core components of a BI architecture include the following:
• BI architecture tools
• BI architecture tools facilitate the centralization of
data collection as well as data analysis and
visualization.
• These tools play an integral role in empowering
businesses to make informed decisions and extract
insights from extensive data sets.
• Tools:
• Microsoft Power BI
• Oracle Business Intelligence
• SAS Business Intelligence
• Tableau
[Figure: Business Intelligence Architecture With Components]

[Figure: BI Architecture Framework In Modern Business]

[Figure: Architecture of Business Intelligence]

[Figure: Main differences between business intelligence and data warehousing]
Effective and Timely Decisions
• Business intelligence may be defined as a set of mathematical models and analysis methodologies that exploit the available data to generate information and knowledge useful for complex decision-making processes.
Effective and Timely Decisions
• In complex organizations, public or private, decisions are made on a continual basis.
• The ability of these knowledge workers to make decisions, both as individuals and as a community, is one of the primary factors that influence the performance and competitive strength of a given organization.
• Such decisions require a more rigorous attitude based on analytical methodologies and mathematical models.
• Example: retention in the mobile phone industry, where low customer loyalty is also known as customer attrition or churn. Given a budget adequate to pursue a customer retention campaign, the problem becomes:
• choosing those customers to be contacted so as to optimize the effectiveness of the campaign, and
• targeting the best group of customers, thus reducing churn and maximizing customer retention.
The main purpose of business intelligence systems is to provide
knowledge workers with tools and methodologies that allow them to
make effective and timely decisions.

• Effective decisions.
• The application of rigorous analytical methods allows decision makers to rely on information and knowledge; the ensuing in-depth examination and thought lead to a deeper awareness and comprehension of the underlying logic of the decision-making process.
• Timely decisions.
• If decision makers can rely on a business intelligence
system facilitating their activity, we can expect that the
overall quality of the decision-making process will be
greatly improved.
[Unit 4] Data Warehousing
❖ Definition of data warehouse
❖ Data marts
❖ Data quality
❖ Data warehouse architecture
❖ ETL tools
❖ Metadata
❖ Schemas Used in Data Warehouses: Star, Snowflake and fact constellation
❖ Cubes and multidimensional analysis
❖ Hierarchies of concepts
❖ OLAP operations, OLAP vs OLTP
❖ Materialization of cubes of data
Metadata
• The standard definition of metadata is "data about the data," which unfortunately is not a particularly enlightening description.
• It is useful to think of metadata as a catalog of the intellectual capital that surrounds the creation, management, and use of a collection of information.
• That can range from simple observations about the number of columns in a database table to complex descriptions about the way that data flowed from multiple sources into the target data warehouse.
Metadata
• From relatively humble beginnings as the data dictionary associated with mainframe database tables, the concept of metadata has evolved over time to become a major component of a BI program.
• Essentially, metadata is a sharable master key to all the information that is feeding the business analytics, from the extraction and population of the central repository to the provisioning of data out of the warehouse and onto the screens of the business clients.
• Metadata are data about data (e.g., see Sen, 2004; and
Zhao, 2005).
• Metadata describe the structure of and some meaning about
data, thereby contributing to their effective or ineffective
use.
• Metadata are generally defined in terms of usage as
technical or business metadata.
• Pattern is another way to view metadata.
• According to the pattern view, we can differentiate between
• syntactic metadata (i.e., data describing the syntax of data),
• structural metadata (i.e., data describing the structure of the data),
and
• semantic metadata (i.e., data describing the meaning of the data in
a specific domain).
Metadata
• The primary purpose of metadata should be to provide context to the reported data.
• In many ways, metadata assist in the conversion of data and information into knowledge.
• Zhao (2005) described five levels of metadata management maturity: (1) ad hoc, (2) discovered, (3) managed, (4) optimized, and (5) automated.
• The design, creation, and use of metadata (descriptive or summary data about data) and its accompanying standards may involve ethical issues.
The Importance of Metadata
The management of metadata is probably one of the most critical tasks associated
with a successful BI program, for a number of reasons.
❖ Metadata encapsulates both the logical and physical business
knowledge required to transform disparate data sets into a coherent
warehouse.
❖ Metadata captures the structure and meaning of the data that is being
fed into the warehouse.
❖ The recording of operational metadata provides a road map for
deriving an information audit trail.
❖ One can capture differences associated with how data is manipulated
over time (as well as the corresponding business rules), which is critical
with data warehouses whose historical data spans large periods of time.
❖ Metadata provides the means for tracing the evolution of information
as a way to validate and verify results derived from an analytical
process.
Metadata is divided into two areas:
• technical metadata, which describes the data mechanics, and
• business metadata, which describes the business perception of that same information.
Technical Metadata
• Technical metadata describes the structure of information, whether it is the data sourcing the warehouse or the data in the warehouse itself.
• Technical metadata characterizes the structure of data, the way that data move, and how they are transformed as they move from one location to another.
• This may incorporate some or all of the following.
• Connectivity metadata, which describes the ways that
data consumers interact with the database system,
including the names used to establish connections,
database names, data source names, whether connections
can be shared, and the connection timeout.
• Table information, including table names; the
description of what is modeled by each table; in which
database the table is stored; the physical location, size,
and growth rate of the table; the data sources that feed
each table; update histories (including the date of last
update and of last refresh); the results of the last update;
candidate keys; foreign keys; the degrees of the foreign
key cardinality (e.g., 1:1 versus 1:many); referential
integrity constraints; functional dependencies; and
indexes.
• Record structure information, which describes the
structure of the record; overall record size; whether the
record is a variable or static length; all column names,
types, descriptions, and sizes; source of values that populate
each column; whether a column is an automatically
generated unique key; null status; domain restrictions; and
validity constraints.
• Record manipulation metadata, which includes record
creation time, time of last update, the last person to modify
the record, and the results of the last modification.
• Index metadata, which describes what indexes exist, on
which tables those indexes are made, the columns that are
used to perform the indexing, whether nulls are allowed,
and whether the index is automatically or manually updated.
• Data practitioners, which enumerates the staff members who
work with data, their contact information (e.g., telephone number,
e-mail address), and the objects to which they have access.
• Security and access metadata, which identifies the owner of the
data, the ownership paradigm, who may access the data and with
which permissions (e.g., read-only versus modify)
• Data model metadata, which captures entity-relationship
diagrams, dimensional layouts and star join structures, logical
data models, and physical data models
• Physical features metadata, such as the size of tables, the
number of records in each table, and the maximum and minimum
record sizes if the records are of variable length
• Reference metadata, such as defined enumerated data domains,
value ranges, likely values (for reasonableness tests), and
mappings between data domains
• Management metadata, such as the history of a data table or
database, stewardship information, and responsibility matrices
• Transformation metadata, which describes the data sources
that feed into the data warehouse, the ultimate data destination,
and, for each destination data value, the set of transformations
used to materialize the datum and a description of the
transformation
• Process metadata, which describes the information flow and
sequence of extraction and transformation processing, including
data profiling, data cleansing, standardization, and integration
• Supplied data metadata, which, for all supplied data sets, gives
the name of the data set, the name of the supplier, the names of
individuals responsible for data delivery, the delivery mechanism
(including time, location, and method), the expected size of the
supplied data, the data sets that are sourced using each supplied
data set, and any transformations to be applied upon receiving the data.
Business Metadata
Business metadata incorporates much of the same information as technical
metadata, as well as:
• Metadata that describes the structure of data as perceived by business
clients
• Descriptions of the methods for accessing data for client
analytical applications
• Business meanings for tables and their attributes
• Data ownership characteristics and responsibilities
• Data domains and mappings between those domains, for validation
• Aggregation and summarization directives
• Reporting directives
• Security and access policies
• Business rules that describe constraints or directives associated with data
within a record or between records as joined through a join condition
The Metadata Repository
• Metadata is data, which means that it can be modeled and managed the same way other data is managed.
• As the primary source of knowledge about the inner workings of the BI environment, it is important to build and maintain a metadata repository that is available to all knowledge workers involved in the BI program.
• Whether the metadata repository is physically centralized or distributed across multiple systems, and however it is accessed, it is important to provide a mechanism for publishing metadata.
• The existence of disparate data systems that contribute information to the BI environment complicates this process, because each system may have its own methods for managing its own metadata.
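As a small illustration of "metadata is data", the sketch below models technical metadata for a warehouse table as an ordinary Python record and keeps it in a queryable repository; the field names are illustrative assumptions, not a standard.

```python
# A minimal sketch of a metadata repository: table metadata modeled and
# managed like any other data (field names are illustrative).
from dataclasses import dataclass

@dataclass
class TableMetadata:
    name: str
    database: str
    source_systems: list            # lineage: which systems feed this table
    columns: dict                   # column name -> type
    last_refresh: str = "unknown"

repository = {}                     # the metadata repository itself

repository["sales"] = TableMetadata(
    name="sales",
    database="warehouse_db",
    source_systems=["crm", "erp"],
    columns={"id": "INTEGER", "amount": "REAL", "region": "TEXT"},
    last_refresh="2024-01-31",
)

# Publishing metadata: any knowledge worker can look up structure and lineage.
print(repository["sales"].source_systems)
```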
Management Issues
• As a manager, it is important to know that the area of
data warehousing is not just about building BI
frameworks. As is typical with any loosely structured
technology, the amount of buzz surrounding data
warehousing seems to be inversely proportional to the
number of truly successful implementations, and my
guess is that the number of available experts on data
warehousing is probably equal to the number of failed
data warehousing projects. The significant
management issues associated with the topics in this
chapter deal with aspects of this.
[Unit 5] Data Mining and Application of BI
• Data mining
• Definition of data mining
• Models and methods for data mining
• Classical statistics & OLAP
• Applications of data mining
• Representation of input data
• Data mining process
• Applications of BI: Data Warehousing Helps MultiCare Save More Lives
• Smarter Insurance: Infinity P&C Improves Customer Service and Combats Fraud with Predictive Analytics
Data Mining
• The term data mining indicates the process of exploration and analysis of a dataset, usually of large size, in order to find regular patterns, to extract relevant knowledge and to obtain meaningful recurring rules.
Definition of Data Mining
• Data mining activities constitute an iterative process aimed at the analysis of large databases, with the purpose of extracting information and knowledge that may prove accurate and potentially useful for knowledge workers engaged in decision making and problem solving.
• The term data mining refers therefore to the overall process consisting of data gathering and analysis, development of inductive learning models and adoption of practical decisions and consequent actions based on the knowledge acquired.
• Data mining activities can be subdivided into two major investigation streams: interpretation and prediction.
o Interpretation. The purpose of interpretation is to identify regular patterns in the data and to express them through rules and criteria that can be easily understood by experts in the application domain.
o Prediction. The purpose of prediction is to anticipate the value that a random variable will assume in the future.
Models and Methods for Data Mining
• There are several learning methods that are available to perform the different data mining tasks.
• A number of techniques originated in the field of computer science, such as classification trees or association rules, and are referred to as machine learning or knowledge discovery in databases.
Applications of Data Mining
• Data mining methodologies can be applied to a variety of domains, from marketing and manufacturing process control to the study of risk factors in medical diagnosis, from the evaluation of the effectiveness of new drugs to fraud detection.
▪ Relational marketing: identification of customer segments, prediction of the rate of positive responses.
▪ Fraud detection: identifying illegal use of credit cards and bank checks.
▪ Risk evaluation: estimating the risk connected with future decisions.
▪ Text mining: representing unstructured data, in order to classify articles, books, documents.
▪ Image recognition: the treatment and classification of digital images. It is useful to recognize written characters, compare and identify human faces, apply correction filters to photographic equipment and detect suspicious behaviors through surveillance video cameras.
▪ Web mining: intended for the analysis of so-called clickstreams – the sequences of pages visited and the choices made by a web surfer.
▪ Medical diagnosis: learning models are an invaluable tool within the medical field for the early detection of diseases using clinical test results.
Key properties of Data Mining:
1. Automatic discovery of patterns
2. Prediction of likely outcomes
3. Creation of actionable information
4. Focus on large datasets and databases
Classical Statistics & OLAP
• OLAP
  – extraction of details and aggregate totals from data
  – information: distribution of incomes of home loan applicants
• Statistics
  – verification of hypotheses formulated by analysts
  – validation: analysis of variance of incomes of home loan applicants
• Data mining
  – identification of patterns and recurrences in data
  – knowledge: characterization of home loan applicants and prediction of future applicants
Data Mining vs OLAP

Data Mining: refers to the field of computer science which deals with the extraction of data, trends and patterns from huge sets of data.
OLAP: a technology for immediate access to data with the help of multidimensional structures.

Data Mining: deals with detailed transaction-level data.
OLAP: deals with the data summary.

Data Mining: is discovery-driven.
OLAP: is query-driven.

Data Mining: is used for future data prediction.
OLAP: is used for analyzing past data.

Data Mining: handles a huge number of dimensions.
OLAP: handles a limited number of dimensions.

Data Mining: bottom-up approach.
OLAP: top-down approach.

Data Mining: is an emerging field.
OLAP: is widely used.
Alternative names for Data Mining:
1. Knowledge discovery (mining) in databases (KDD)
2. Knowledge extraction
3. Data/pattern analysis
4. Data archaeology
5. Data dredging
6. Information harvesting
7. Business intelligence
Data Mining Process
• Definition of objectives. Data mining analyses are carried out in specific application domains and are intended to provide decision makers with useful knowledge.
Data Mining Process
• Data gathering and integration. Once the objectives of the investigation have been identified, the gathering of data begins. Data may come from different sources and therefore may require integration.
• Exploratory analysis. In the third phase of the data mining process, a preliminary analysis of the data is carried out with the purpose of getting acquainted with the available information and carrying out data cleansing.
• Attribute selection. In the subsequent phase, the relevance of the different attributes is evaluated in relation to the goals of the analysis (selecting appropriate attributes means choosing columns such as roll no, rank, name, state).
• Model development and validation. Once a high-quality dataset has been assembled and possibly enriched with newly defined attributes, pattern recognition and predictive models can be developed (and validated against training sets).
• Prediction and interpretation. Knowledge workers may be able to use the model to draw predictions and acquire a more in-depth knowledge of the phenomenon of interest.
Tasks of Data Mining
• Anomaly detection (outlier/change/deviation detection) – the identification of unusual data records that might be interesting, or data errors that require further investigation.
• Association rule learning (dependency modeling) – searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis (see the sketch after this list).
• Clustering – the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
• Classification – the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
• Regression – attempts to find a function which models the data with the least error.
Tasks of Data Mining
• Summarization – providing a more compact representation of the data set, including visualization and report generation.
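As promised above, a minimal market-basket sketch: it counts how often product pairs co-occur in toy transactions. Real association-rule mining (e.g., the Apriori algorithm) adds support and confidence thresholds on top of exactly this kind of counting.

```python
# A minimal market-basket sketch: counting co-occurring product pairs
# across transactions (toy data).
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "diapers"},
    {"bread", "milk", "butter", "diapers"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):   # every product pair in the basket
        pair_counts[pair] += 1

# The most frequently co-purchased pairs suggest candidate rules.
print(pair_counts.most_common(3))
```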
Knowledge Discovery in Databases (KDD)
Data Cleaning – in this step, noise and inconsistent data are removed.
Data Integration – in this step, multiple data sources are combined.
Data Selection – in this step, data relevant to the analysis task are retrieved from the database.
Data Transformation – in this step, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
Data Mining – in this step, intelligent methods are applied in order to extract data patterns.
Pattern Evaluation – in this step, data patterns are evaluated.
Knowledge Presentation – in this step, knowledge is represented to the user.
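A minimal sketch chaining the KDD steps as plain functions over toy records, one stage per step listed above; the record layout and the trivial "pattern" mined at the end are illustrative assumptions.

```python
# A minimal KDD pipeline sketch over toy records.
def clean(records):                       # Data Cleaning: drop inconsistent rows
    return [r for r in records if r.get("amount") is not None]

def select(records):                      # Data Selection: keep relevant fields
    return [{"region": r["region"], "amount": r["amount"]} for r in records]

def transform(records):                   # Data Transformation: aggregate by region
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return totals

def mine(totals):                         # Data Mining: a trivial "pattern"
    return max(totals, key=totals.get)    # region with the highest total

data = [{"region": "N", "amount": 10}, {"region": "S", "amount": None},
        {"region": "N", "amount": 5}, {"region": "S", "amount": 20}]
print(mine(transform(select(clean(data)))))   # Knowledge Presentation: "S"
```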
Major Issues in Data Mining:
• Mining different kinds of knowledge in databases – the needs of different users are not the same, and different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction – the data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on returned results.
• Incorporation of background knowledge – background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining – a data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results – once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.
• Handling noisy or incomplete data – data cleaning methods are required that can handle noise and incomplete objects while mining the data regularities. If data cleaning methods are not available, the accuracy of the discovered patterns will be poor.
• Pattern evaluation – this refers to the interestingness of the problem. Patterns that merely represent common knowledge or lack novelty are not interesting; discovered patterns must therefore be evaluated for interestingness.
• Efficiency and scalability of data mining algorithms – in order to effectively extract information from huge amounts of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms – factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update databases without having to mine the data again from scratch.
Representation of Input Data
• Input to a data mining analysis takes the form of a two-dimensional table, called a dataset, irrespective of the actual logic and material representation adopted to store the information in files, databases, data warehouses and data marts used as data sources.
• The rows in the dataset correspond to the observations recorded in the past and are also called examples, cases, instances or records.
• The columns represent the information available for each observation and are termed attributes, variables, characteristics or features.
• Attributes contained in a dataset can be categorized as categorical or numerical.
❖ Categorical. Categorical attributes assume a finite number of distinct values.
❖ Numerical. Numerical attributes assume a finite or infinite number of values and lend themselves to subtraction or division operations.
• Counts. Counts are categorical attributes in relation to which a specific property can be true or false.
• Nominal. Nominal attributes are categorical attributes without a natural ordering, such as the province of residence.
• Ordinal. Ordinal attributes, such as education level, are categorical attributes that lend themselves to a natural ordering.
• Discrete. Discrete attributes are numerical attributes that assume a finite number or a countable infinity of values.
• Continuous. Continuous attributes are numerical attributes that assume an uncountable infinity of values.
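The sketch below assembles a tiny illustrative dataset whose columns exercise each attribute type just listed: nominal, ordinal, discrete and continuous.

```python
# A minimal sketch of a dataset with categorical and numerical attributes
# (rows = observations, columns = attributes), using illustrative data.
import pandas as pd

dataset = pd.DataFrame({
    "province":  ["MH", "GJ", "MH"],                # nominal (no natural order)
    "education": pd.Categorical(["BSc", "MSc", "PhD"],
                                categories=["BSc", "MSc", "PhD"],
                                ordered=True),      # ordinal (natural order)
    "children":  [0, 2, 1],                         # discrete numerical
    "income":    [41200.5, 58300.0, 47950.25],      # continuous numerical
})
print(dataset.dtypes)
```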
Analysis Methodologies
• A first fundamental distinction is between supervised and unsupervised learning processes.
• Supervised learning.
• In a supervised (or direct) learning analysis, a target attribute represents the class to which each record belongs (a set of rules is defined and observed).
• Unsupervised learning.
• Unsupervised (or indirect) learning analyses are not guided by a target attribute.
• Here one is interested in identifying clusters of records that are similar within each cluster and different from members of other clusters.
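A minimal sketch contrasting the two methodologies on the same toy points, assuming scikit-learn is available: the classifier is guided by a target attribute, while k-means discovers clusters with no target at all.

```python
# A minimal supervised-vs-unsupervised sketch on the same toy points.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]   # observations (rows) with 2 attributes
y = [0, 0, 1, 1]                       # target attribute: the known class

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)   # supervised: uses y
print(clf.predict([[2, 1]]))                          # -> [0]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # unsupervised: no y
print(km.labels_)                      # cluster memberships discovered from X
```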
Seven Basic Data Mining Tasks
• Characterization and discrimination (arrange data according to characterization and divide it by a set of rules, e.g., students vs. professors, or by class or subject)
• Classification (each observation is described by a given number of attributes whose value is known)
• Regression (used when the target variable takes on continuous values, e.g., predicting the sales of a product before launch)
• Time series analysis (predicting the value of the target variable for one or more future periods, based on the history of that variable)
• Association rules (identifying interesting and recurring associations between groups of records of a dataset, e.g., who buys what, and how many times)
• Clustering (the term cluster refers to a homogeneous subgroup existing within a population)
• Description and visualization (representation is justified by the remarkable conciseness of the information achieved through a well-designed chart)
• Characterization and Discrimination
• For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders.
• Classification
• Checking whether it is raining or not; fraud detection.
• Regression
• Predicting house prices; predicting the weather; predicting the impact of SAT/GRE scores on college admissions.
• Time series analysis
• Stock market analysis.
Applications of BI
• Data Warehousing Helps MultiCare Save More Lives
• Smarter Insurance: Infinity P&C Improves Customer Service and Combats Fraud with Predictive Analytics
