
Business Intelligence Notes – Unit-1-1

What is BI?
Business Intelligence (BI) is about getting the right information to the right decision makers
at the right time.
BI is an enterprise-wide platform that supports reporting, analysis, and decision making.

In 1989, Howard Dresner (Gartner Group) defined BI as: “BI is a set of concepts and
methodologies to improve decision making in business through use of facts and fact-based
systems”.
❑ The goal of BI is improved decision making.
Decisions were made before BI existed too; the use of BI should lead
to improved decision making.
❑ BI is more than just technologies.
It is a group of concepts and methodologies.
❑ It is fact based.
Decisions are no longer made on gut feeling or purely on a hunch. They have to
be backed by facts.
◼ BI uses a set of processes, technologies, and tools:
- To transform raw data into meaningful information.
◼ BI mines the information:
- To provide knowledge, and uses the knowledge gained to provide beneficial insights.
These insights then lead to impactful decision making, which in turn provides business benefits
such as:

• increased profitability,
• increased productivity,
• reduced cost,
• improved operations, etc.

Dr. Raghunandan K R Department of CSE NMAMIT



The transformation of raw data to business benefits through BI may be depicted as:
raw data → information → knowledge → insights → decisions → business benefits

◼ Business intelligence (BI) comprises the strategies and technologies used by
enterprises for the data analysis of business information.
◼ BI technologies provide historical, current, and predictive views of business
operations.
◼ Common functions of business intelligence technologies include:
➢ Reporting
➢ Online Analytical Processing (OLAP)
➢ Analytics
➢ Dashboard Development
➢ Data Mining
➢ Complex Event Processing
➢ Business Performance Management
➢ Benchmarking
➢ Text Mining
➢ Predictive Analytics
➢ Prescriptive Analytics


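◼ Descriptive Analytics: What happened?
◼ Predictive Analytics: What will happen?
◼ Prescriptive Analytics: How can we make it happen?
Do you think “BI only deals with analysis of past data”? As the three kinds of analytics show, BI also looks ahead through predictive and prescriptive views.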

◼ The term Business Intelligence (BI) refers to technologies, applications and practices
for the collection, integration, analysis, and presentation of business information.
◼ The purpose of Business Intelligence is to support better business decision making.

Why Business Intelligence?


➢ To gain competitive advantage in the marketplace (based on better,
faster, and fact-based decisions)

➢ To retain your customers by making relevant recommendations of products and
services to them

➢ To improve employee productivity by identifying and removing bottleneck processes

➢ To optimize your service offerings

➢ To bring NEW value to your business


What can BI deliver?

➢ BI can deliver a “single version of the truth”.
➢ BI can deliver information that is actionable.
➢ BI can glean “insights” from corporate data.

Types of Digital Data


◼ Data growth has seen exponential acceleration since the advent of the computer and
Internet.
◼ In fact, the computer and Internet duo has imparted the digital form to data.
◼ Digital data can be classified into three forms:
❑ Unstructured.
❑ Semi-structured.
❑ Structured.
Unstructured data: This is data which does not conform to a data model or is not in a
form which can be used easily by a computer program. About 80-90% of an organization's data
is in this format.
For example, memos, chat room transcripts, PowerPoint presentations, images, videos, letters,
research reports, white papers, the body of an email, etc.
Semi-structured data: This is data which does not conform to a data model but has some
structure. However, it is not in a form which can be used easily by a computer program.
For example, emails, XML, markup languages like HTML, etc. Metadata for this data is available
but is not sufficient.
Structured data: This is data which is in an organized form (e.g., in rows and columns) and
can be easily used by a computer program. Relationships exist between entities of data, such
as classes and their objects.
Data stored in databases is an example of structured data.
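To make the three forms concrete, here is a minimal Python sketch. All sample values (patient IDs, the email address, the free-text note) are made up for illustration. It shows that a program can query structured data directly, can navigate semi-structured data but must cope with missing fields, and can only guess at the meaning of unstructured text.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns under a fixed header, so querying is direct.
structured = "patient_id,temperature_f\nD,98.6\nE,101.2\n"
rows = list(csv.DictReader(io.StringIO(structured)))
fever_ids = [r["patient_id"] for r in rows if float(r["temperature_f"]) > 100.0]

# Semi-structured: tags give some structure, but a field may simply be absent.
semi_structured = (
    "<email><to>doctor@example.com</to>"
    "<body>Patient E has a fever.</body></email>"
)
email = ET.fromstring(semi_structured)
recipient = email.findtext("to")
subject = email.findtext("subject")  # None: this email has no <subject> tag

# Unstructured: free text; a naive keyword test misses the meaning.
unstructured = "Saw patient E this morning, running a temperature, seemed tired."
mentions_fever = "fever" in unstructured.lower()

print(fever_ids, recipient, subject, mentions_fever)
```

Note that the keyword test on the free-text note fails even though the note describes a fever, which is the kind of difficulty the definitions above refer to.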
Distribution of data
The percentage distribution of the three forms of data is shown in Figure 2.1.
◼ Usually, data is in the unstructured format, which makes extracting information from
it difficult.
◼ According to Merrill Lynch, 80-90% of business data is either unstructured or
semi-structured.
◼ Gartner also estimates that unstructured data constitutes 80% of all enterprise
data.


GETTING TO KNOW STRUCTURED DATA

Example of structured data.


➢ The patient index card shown above is in a structured form. All the fields in the
patient index card are also structured fields.
➢ “GoodLife” nurses make electronic records for every patient who visits the
hospital. These records are stored in a relational database.
➢ For example, a nurse records the body temperature and blood pressure of a patient,
D, and enters them in the hospital database.
➢ The doctor who is treating patient D searches the database to find the patient's body
temperature.
➢ The doctor is able to locate the desired information easily because the hospital data is
structured and is stored in a relational database.
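The nurse/doctor scenario can be sketched with Python's built-in sqlite3 module. The table name, column names, and readings below are assumptions made for illustration; the notes do not describe the actual GoodLife schema.

```python
import sqlite3

# In-memory database standing in for the hospital's relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE patient_vitals (
           patient_id         TEXT NOT NULL,
           recorded_at        TEXT NOT NULL,
           body_temperature_f REAL NOT NULL,
           blood_pressure     TEXT NOT NULL
       )"""
)

# The nurse enters the readings for patient D (hypothetical values).
conn.execute(
    "INSERT INTO patient_vitals VALUES (?, ?, ?, ?)",
    ("D", "2024-01-15 09:30", 99.1, "120/80"),
)

# The doctor looks up the most recent body temperature of patient D.
temperature = conn.execute(
    """SELECT body_temperature_f
       FROM patient_vitals
       WHERE patient_id = ?
       ORDER BY recorded_at DESC
       LIMIT 1""",
    ("D",),
).fetchone()[0]

print(temperature)
```

Because every record has the same named fields, the lookup is a one-line query rather than a search through free text.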

Characteristics of Structured Data

◼ Structured data is organized in semantic chunks (entities) with similar entities grouped
together to form relations or classes.
◼ Entities in the same group have the same descriptions, i.e. attributes.
◼ Descriptions for all entities in a group:
• have the same defined format,
• have a predefined length,
• follow the same order.
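One way to picture "same format, predefined length, same order" is a fixed-width binary record. The sketch below uses Python's struct module; the record layout (a 4-byte ID, a 12-byte name, a 4-byte temperature) is an assumption chosen for illustration, not something taken from these notes.

```python
import struct

# Assumed layout: unsigned int ID, 12-byte name, 32-bit float temperature.
# "<" means little-endian with no padding, so every record is 4 + 12 + 4 bytes.
RECORD = struct.Struct("<I12sf")

records = [
    RECORD.pack(1, b"Asha", 98.6),
    RECORD.pack(2, b"Ravi", 101.2),
]

# Every entity has the same length and the same attribute order.
sizes = [len(r) for r in records]

# Reading a record back: attributes come out in the defined order.
patient_id, raw_name, temperature = RECORD.unpack(records[1])
name = raw_name.rstrip(b"\x00").decode("ascii")

print(RECORD.size, sizes, patient_id, name, round(temperature, 1))
```

Relational tables give the same guarantees at a higher level: each row in a table has the same columns, in the same order, with declared types.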


Where Does Structured Data Come From?

◼ Data coming from databases such as Access, OLTP systems, and SQL databases, as well as from
spreadsheets such as Excel, is all in the structured format.
◼ To summarize, structured data:
❑ Consists of fully described data sets.
❑ Has clearly defined categories and sub-categories.
❑ Is placed neatly in rows and columns.
❑ Goes into records, and hence the database is
regulated by a well-defined structure.
❑ Can be indexed easily, either by the DBMS itself or
manually.
◼ Working with structured data is easy when it comes to storage, scalability, security,
and update and delete operations:
❑ Storage: Both predefined and user-defined data types help with
the storage of structured data.
❑ Scalability: Scalability is not generally an issue as
the data grows.
❑ Security: Ensuring security is easy.
❑ Update and Delete: Updating, deleting, etc. is easy due to the
structured form.


◼ Ease of Retrieval of structured data (Hassle-Free Retrieval)


It is easy to retrieve desired information from structured data because of the following
features (Figure 2.6):
◼ Retrieving information: A well-defined structure helps in easy retrieval of data.
◼ Indexing and searching: Data can be indexed based not only on a text string but also
on other attributes. This enables streamlined search.
◼ Mining data: Structured data can be easily mined and knowledge can be extracted
from it.
◼ BI operations: BI works extremely well with structured data. Hence data mining,
warehousing, etc. can be easily undertaken.
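The point about indexing on attributes other than a text string can be illustrated with a small sqlite3 sketch. The table, the index name idx_temp, and the readings are hypothetical. EXPLAIN QUERY PLAN is SQLite's way of reporting how it intends to run a query; its exact wording varies between SQLite versions, so the sketch only prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vitals (patient_id TEXT, body_temperature_f REAL)")
conn.executemany(
    "INSERT INTO vitals VALUES (?, ?)",
    [("A", 98.4), ("B", 101.3), ("C", 99.0), ("D", 102.1)],
)

# Index on a numeric attribute, not on a text string.
conn.execute("CREATE INDEX idx_temp ON vitals (body_temperature_f)")

# Range search on the indexed attribute.
fever = conn.execute(
    "SELECT patient_id FROM vitals "
    "WHERE body_temperature_f > ? ORDER BY patient_id",
    (100.0,),
).fetchall()

# Ask SQLite how it plans to run the search.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT patient_id FROM vitals "
    "WHERE body_temperature_f > ?",
    (100.0,),
).fetchall()

print(fever)
for step in plan:
    print(step[-1])  # human-readable description of the plan step
```

On a table this small the index makes no practical difference; the benefit appears as the number of rows grows.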

GETTING TO KNOW UNSTRUCTURED DATA


Unstructured data is data which cannot be stored in the form of rows and columns as
in a database and does not conform to any data model, i.e., it is difficult to determine the
meaning of the data.
◼ It does not follow any rules or semantics.
◼ It can be of any type and is hence unpredictable.
Where Does Unstructured Data Come From?
Anything in a non-database form is unstructured data. It can be classified into two broad
categories:

• Bitmap objects: For example, image, video, or audio files.
• Textual objects: For example, Microsoft Word documents, emails, or
Microsoft Excel spreadsheets.

A lot of unstructured data is also noisy text, such as chats, emails, and SMS texts.
The language of noisy text differs significantly from the standard form of the language.
Sources of Unstructured Data:
Web pages
Images (JPEG, GIF, PNG, etc.)
Videos
Memos
Reports
Word documents and PowerPoint presentations
Surveys


The characteristics of unstructured data are depicted in the following figure.

Characteristics of Unstructured Data:

• Data neither conforms to a data model nor has any structure.
• Data cannot be stored in the form of rows and columns as in databases.
• Data does not follow any semantics or rules.
• Data lacks any particular format or sequence.
• Data has no easily identifiable structure.
• Due to the lack of identifiable structure, it cannot be used by computer programs easily.


How to Manage Unstructured Data?


1. Indexing:
• In a Relational Database Management System (RDBMS), data is indexed to enable
faster search and retrieval.
• An index is an identifier, built from some value in the data, that represents the
larger record in the data set.
• In the absence of an index, the whole data set/document has to be scanned to
retrieve the desired information.
• Indexing unstructured data is difficult because this data neither has any
pre-defined attributes nor follows any pattern or naming convention.
• Text can be indexed based on a text string, but in the case of non-text files, e.g.
audio/video, etc., indexing depends on file names.
2. Tags/Metadata:
• Using metadata, data in a document, etc. can be tagged. This enables search and
retrieval.
• But in unstructured data, this is difficult as little or no metadata is available.
• The structure of the data has to be determined, which is very difficult as the data itself has
no particular format and comes from more than one source.

3. Classification/Taxonomy:
• Taxonomy is classifying data on the basis of the relationships that exist between
data.
• Data can be arranged in groups and placed in hierarchies based on the taxonomy
prevalent in an organization. However, classifying unstructured data is difficult as
identifying relationships between data is not an easy task.
• In the absence of any structure or metadata or schema, identifying accurate
relationships and classifying is not easy.
• Since the data is unstructured, naming conventions or standards are not consistent
across an organization, thus making it difficult to classify data.

4. CAS (Content Addressable Storage):
• It stores data based on its metadata. It assigns a unique
name to every object stored in it.
• The object is retrieved based on its content and not its location.
• It is used extensively to store emails, etc.
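The idea behind content addressable storage can be sketched in a few lines of Python. This sketch assumes the unique name is a SHA-256 hash of the object's bytes, which is a common design but is not stated in these notes; the ContentStore class is hypothetical and keeps everything in memory.

```python
import hashlib


class ContentStore:
    """Toy in-memory content addressable store (illustration only)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content, not from a location.
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
email = b"Subject: Q3 report\n\nPlease find the report attached."

first_address = store.put(email)
second_address = store.put(email)  # same content, so same address

print(first_address == second_address, len(store))
```

Storing the same email twice yields the same address and a single stored copy, which is one reason this style of storage suits archives of emails and similar fixed content.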

How to Store Unstructured Data?


Challenges faced:


• Storage space: It is difficult to store and manage unstructured data. A lot of space
is required to store such data. It is difficult to store images, videos, audios, etc.
• Scalability: As the data grows, scalability becomes an issue and the cost of storing
such data grows.
• Retrieve information: Even if unstructured data is stored, it is difficult to retrieve
and recover information from it.
• Security: Ensuring security is difficult due to varied sources of data.
• Update and delete: Updating and deleting unstructured data are difficult because there is
no clear structure.
• Indexing and searching: Indexing unstructured data is difficult and error-prone as
the structure is not clear and attributes are not pre-defined. As a result, the search
results are not very accurate. Indexing becomes all the more difficult as the
volume of data grows.
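The indexing challenge can be seen in a minimal inverted index over text, sketched below. The three documents are invented examples. The index maps each word to the documents containing it; noisy text such as "srvr" is indexed under a different key from "server", so the search misses a relevant document.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


# Hypothetical unstructured sources: a memo, a chat message, and an email.
documents = {
    "memo1": "Server outage reported on Monday",
    "chat7": "srvr down again?? pls chk",
    "mail3": "The outage was caused by a failed disk",
}

index = build_inverted_index(documents)
outage_hits = sorted(index["outage"])
server_hits = sorted(index["server"])  # the chat message is not found

print(outage_hits, server_hits)
```

The chat message is about the same server problem but is not returned for "server", which illustrates why search results over unstructured data are often inaccurate.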
Solutions to Storage Challenges of Unstructured Data
A few possible solutions are described below:

◼ Changing format:
❑ Unstructured data may be converted to formats which are easily managed,
stored and searched.
❑ For example, IBM is working on providing a solution which will convert audio,
video, etc. to text.
◼ Developing new hardware:


❑ New hardware needs to be developed to support unstructured data. It may
either complement the existing storage devices or be a stand-alone device for
unstructured data.
◼ Storing in RDBMS/BLOBs:
❑ Unstructured data may be stored in relational databases which support
BLOBs (Binary Large Objects).

❑ While unstructured data such as a video or image file cannot be stored
neatly in a relational column, there is no such problem when it comes to
storing its metadata, such as the date and time of its creation, the owner or
author of the data, etc.
◼ Storing in XML (eXtensible Markup Language) format:
❑ Unstructured data may be stored in XML format which tries to give some
structure to it by using tags and elements.
◼ CAS (Content Addressable Storage):
❑ It organizes files based on their metadata and assigns a unique name to every
object stored in it.
❑ The object is retrieved based on its content and not its location.
❑ It is used extensively to store emails, etc.
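As an illustration of the RDBMS/BLOB option above, the sketch below stores a binary object in a BLOB column next to its metadata, using Python's sqlite3 module. The table layout, the owner name, the timestamp, and the stand-in bytes are all assumptions for illustration; the bytes are not a real image.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE media (
           id         INTEGER PRIMARY KEY,
           owner      TEXT NOT NULL,
           created_at TEXT NOT NULL,
           content    BLOB NOT NULL
       )"""
)

# Stand-in for the bytes of an image file (hypothetical content).
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47]) + b"\x00" * 16

conn.execute(
    "INSERT INTO media (owner, created_at, content) VALUES (?, ?, ?)",
    ("radiology", "2024-01-15T09:30:00", image_bytes),
)

# The metadata is structured, so it can be searched like any other column.
row = conn.execute(
    "SELECT owner, created_at, length(content) FROM media WHERE owner = ?",
    ("radiology",),
).fetchone()

# The content itself comes back as opaque bytes.
stored = conn.execute("SELECT content FROM media WHERE id = 1").fetchone()[0]

print(row, stored == image_bytes)
```

The database can filter on owner or creation time, but it knows nothing about what the image shows; the BLOB remains unstructured.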

How to Extract Information from Stored Unstructured Data?


Challenges faced:
◼ Interpretation: Unstructured data is not easily interpreted by
conventional search algorithms.
◼ Tags: As the data grows, it is not possible to put tags manually.
◼ Indexing: Designing algorithms to understand the meaning of the documents and then
tagging or indexing them accordingly is difficult.
◼ Deriving meaning: Computer programs cannot automatically derive
meaning/structure from unstructured data.
◼ File formats: The increasing number of file formats makes it difficult to interpret data.
◼ Classification/Taxonomy: Different naming conventions make it difficult to
classify data.


Possible solutions to the challenges:


◼ Tags: Unstructured data can be stored in a virtual repository and be automatically
tagged. For example, Documentum provides this type of solution.
◼ Text mining: Text mining tools help in grouping as well as classifying unstructured data
and assist in analyzing by considering grammar, context, synonyms, etc.
◼ Application platforms: Application platforms like XOLAP help extract information from
email and XML-based documents.
◼ Classification/Taxonomy: Taxonomies within the organization can be managed
automatically to organize data in hierarchical structures.
◼ Naming conventions/standards: Following naming conventions or standards across an
organization greatly improves storage, retrieval, indexing, and search.

UIMA: A Possible Solution for Unstructured Data


◼ UIMA (Unstructured Information Management Architecture) is an open source
platform from IBM which integrates different kinds of analysis engines to provide a
complete solution for knowledge discovery from unstructured data.
◼ In UIMA (depicted in figure 2.14), the analysis engines enable integration and analysis
of unstructured information and bridge the gap between structured and unstructured
data.
❑ UIMA stores information in a structured format.
❑ The structured resources can then be mined, searched, and put to other uses.
❑ The information obtained from structured sources is also used for subsequent
analysis of unstructured data.


❑ Various analysis engines analyze unstructured data
in different ways such as:
1. Breaking up of documents into separate words
2. Grouping and classifying according to taxonomy
3. Detecting parts of speech, grammar and synonyms
4. Detecting events and times
5. Detecting relationships between various elements
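Two of the analysis steps listed above (breaking a document into words, and detecting times) can be imitated with plain regular expressions. The sketch below does not use the UIMA API; the sample sentence and the patterns are assumptions chosen only to show how unstructured text can be turned into structured annotations.

```python
import re

# Hypothetical piece of unstructured text.
text = (
    "Meeting with the supplier at 10:30 on 2024-03-05. "
    "Invoice delayed until 17:00."
)

# Step 1: break the document into separate words.
words = re.findall(r"[A-Za-z]+", text)

# Step 4 (simplified): detect clock times and ISO-style dates.
times = re.findall(r"\b\d{1,2}:\d{2}\b", text)
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)

# The result is structured: named fields holding lists of values.
annotations = {"words": words, "times": times, "dates": dates}

print(annotations["times"], annotations["dates"], len(annotations["words"]))
```

Real analysis engines go much further (parts of speech, synonyms, relationships), but the output has the same character: structured annotations that can be stored, searched, and mined.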

GETTING TO KNOW SEMI-STRUCTURED DATA


Characteristics:

Characteristics of semi-structured data are summarized below:

❑ It is organized into semantic entities.
❑ Similar entities are grouped together.
❑ Entities in the same group may not have the same
attributes.
❑ The order of attributes is not necessarily important.
❑ Not all attributes are always required.
❑ The size of the same attribute in a group may differ.
❑ The type of the same attribute in a group may differ.
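These characteristics are easy to see in a small set of JSON records. JSON is used here only as a convenient example of a semi-structured format; the customer records are invented.

```python
import json

# Three entities in the same group ("customers") with differing attributes.
raw = """[
  {"name": "Asha", "email": "asha@example.com", "phone": 9845012345},
  {"phone": "080-2345-6789", "name": "Ravi"},
  {"name": "Meera", "email": ["meera@example.com", "m.k@example.org"]}
]"""

customers = json.loads(raw)

# Entities in the same group do not share the same attribute set.
attribute_sets = [set(customer) for customer in customers]

# The same attribute can have different types in different entities.
phone_types = {
    type(customer["phone"]).__name__
    for customer in customers
    if "phone" in customer
}

# Optional attributes have to be handled explicitly by the program.
emails = [customer.get("email") for customer in customers]

print(attribute_sets, phone_types, emails)
```

The second record lists its attributes in a different order and has no email, and the phone attribute is a number in one record and a string in another.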


Where Does Semi-Structured Data Come From?

How to Manage Semi-Structured Data?


Listed below are a few ways in which semi-structured data is managed and stored:

• Schemas: These can be used to describe the structure of data. Schemas define the
constraints, content of the document, etc. The problem with schemas is that
requirements are ever changing, and the changes required in data also lead to changes
in the schema.
• Graph-based data models: These can be used to describe data. This is a schema-less
approach and is also known as self-describing, as data is presented in such a way that
it explains itself. The relationships and hierarchies are represented in the form of a
tree-like structure where the vertices contain the object or entity and the leaves
contain data.
• XML: This is widely used to store and exchange semi-structured
data. It allows the user to define tags to store
data in hierarchical or nested forms.
Schemas in XML are not tightly coupled to data.
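The graph-based, self-describing idea can be sketched with nested Python dictionaries and lists: inner nodes carry labels for objects or entities, and leaves carry the data. The hospital record below is a made-up example, and the leaves helper is a hypothetical name.

```python
def leaves(node, path=()):
    """Yield (path, value) pairs for every leaf of a nested dict/list tree."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from leaves(child, path + (label,))
    elif isinstance(node, list):
        for position, child in enumerate(node):
            yield from leaves(child, path + (position,))
    else:
        # A leaf: the path of labels describes what the value means.
        yield path, node


# No separate schema: the labels travel with the data.
hospital = {
    "patient": {
        "name": "D",
        "vitals": {"temperature_f": 98.6, "blood_pressure": "120/80"},
        "visits": ["2024-01-10", "2024-02-02"],
    }
}

leaf_list = list(leaves(hospital))
for path, value in leaf_list:
    print(path, "->", value)
```

Each value is reached by a path of labels, which is what makes the data explain itself without an external schema.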


Disadvantages of Semi-structured data


The lack of a fixed, rigid schema makes the data difficult to store.
Interpreting the relationship between data is difficult as there is no separation of the schema
and the data.
Queries are less efficient as compared to structured data.
Complexity: Semi-structured data can be more complex to manage and process than
structured data, as it may contain a wide variety of formats, tags, and metadata. This can make
it more difficult to develop and maintain data models and processing pipelines.
Lack of standardization: Semi-structured data often lacks the standardization and consistency
of structured data, which can make it more difficult to ensure data quality and accuracy. This
can also make it harder to compare and analyze data across different sources.
Reduced performance: Processing semi-structured data can be more resource-intensive than
processing structured data, as it often requires more complex parsing and indexing
operations. This can lead to reduced performance and longer processing times.
Limited tooling: While there are many tools and technologies available for working with
structured data, there are fewer options for working with semi-structured data. This can make
it more challenging to find the right tools and technologies for a particular use case.
Data security: Semi-structured data can be more difficult to secure than structured data, as it
may contain sensitive information in unstructured or less-visible parts of the data. This can
make it more challenging to identify and protect sensitive information from unauthorized
access.

Problems faced in storing semi-structured data


• Data usually has an irregular and partial structure. Some sources have an implicit
structure of data, which makes it difficult to interpret the relationship between data.
• Schema and data are usually tightly coupled, i.e., they are not only linked together but
are also dependent on each other. The same query may update both schema and data, with
the schema being updated frequently.
• The distinction between schema and data is very uncertain or unclear. This complicates
the design of the structure of data.
• Storage cost is high as compared to structured data.
Possible solutions for storing semi-structured data
• Data can be stored in a DBMS specially designed to store semi-structured data.


• XML is widely used to store and exchange semi-structured data. It allows its user to
define tags and attributes to store the data in hierarchical form.
Schema and Data are not tightly coupled in XML.
• Object Exchange Model (OEM) can be used to store and exchange semi-structured
data. OEM structures data in the form of a graph.
• An RDBMS can be used to store the data by mapping the data to a relational schema and
then mapping it to a table.
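The RDBMS option can be sketched as follows: XML records are mapped to a relational schema, and an attribute missing from a record becomes NULL. The XML document, the tag names, and the table layout are assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: the second patient has no temperature.
xml_text = """
<patients>
  <patient id="1"><name>Asha</name><temperature>98.6</temperature></patient>
  <patient id="2"><name>Ravi</name></patient>
</patients>
"""
root = ET.fromstring(xml_text)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT, temperature REAL)"
)

for element in root.findall("patient"):
    temperature = element.findtext("temperature")  # None when the tag is absent
    conn.execute(
        "INSERT INTO patient VALUES (?, ?, ?)",
        (
            int(element.get("id")),
            element.findtext("name"),
            float(temperature) if temperature is not None else None,
        ),
    )

rows = conn.execute(
    "SELECT id, name, temperature FROM patient ORDER BY id"
).fetchall()
print(rows)
```

The mapping works only for attributes the relational schema anticipates; a tag that appears later in the XML would require a schema change, which is the trade-off of this approach.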
How to Extract Information from Semi-Structured Data?
Challenges faced:

Possible solutions:
◼ Indexing: Indexing data in a graph-based model enables quick search.
◼ OEM: This data modeling technique allows for the data to be stored in a graph-based
data model which is easier to index and search.
◼ XML: It allows data to be arranged in a hierarchical or tree-like structure which
enables indexing and searching.
◼ Mining tools: Various mining tools are available which search data based on graphs,
schemas, structures, etc.
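As a small illustration of searching and indexing a tree-structured document, the sketch below uses the limited XPath support in Python's xml.etree.ElementTree and then builds a simple topic index. The catalog and its titles are invented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical catalog stored as XML.
catalog = ET.fromstring(
    "<library>"
    "<book year='2011'><title>Intro to Dashboards</title><topic>BI</topic></book>"
    "<book year='2005'><title>Warehouse Basics</title><topic>DW</topic></book>"
    "<book year='2011'><title>Reporting in Practice</title><topic>BI</topic></book>"
    "</library>"
)

# Search by a child element's text and by an attribute value.
bi_titles = [b.findtext("title") for b in catalog.findall("./book[topic='BI']")]
titles_2005 = [b.findtext("title") for b in catalog.findall("./book[@year='2005']")]

# Build an index once (topic -> titles) so later lookups avoid a full scan.
topic_index = defaultdict(list)
for book in catalog.iter("book"):
    topic_index[book.findtext("topic")].append(book.findtext("title"))

print(bi_titles, titles_2005, dict(topic_index))
```

The tags make it possible to ask precise questions of the data, which is what distinguishes this from searching free text.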


XML: A Solution for Semi-Structured Data Management


◼ XML (eXtensible Markup Language) is an open-standard markup language written in
plain text.
◼ It is independent of hardware and software. It is designed to store and transport
data over the Internet. It allows data to be stored in a hierarchical/nested fashion.
◼ In XML, the user can define tags to store data. It also enables separation of content
(eXtensible Markup Language) and presentation (eXtensible Stylesheet Language).
◼ XML is slowly emerging as a solution for semi-structured data management.
◼ XML has no pre-defined tags.
◼ XML is known as self-describing, as data can exist without a
schema and a schema can be added later.
◼ A schema can be described in a DTD or in XML Schema.
Characteristics of XML language are as follows:

• XML (eXtensible Markup Language) is slowly emerging as a
standard for exchanging data over the Web.

• It enables separation of content (eXtensible Markup Language) and presentation
(eXtensible Stylesheet Language).

• DTDs (Document Type Definitions) provide partial schemas for XML documents.
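The points about user-defined tags and schema-free data can be tried out with Python's standard xml.etree.ElementTree module. The tag names (patient, temperature, bloodPressure) and the values are invented for this sketch; XML itself predefines none of them.

```python
import xml.etree.ElementTree as ET

# Build a document with user-defined tags; no schema is declared anywhere.
patient = ET.Element("patient", attrib={"id": "D"})
ET.SubElement(patient, "temperature", attrib={"unit": "F"}).text = "98.6"
ET.SubElement(patient, "bloodPressure").text = "120/80"

# Serialize to plain text, the form in which XML is stored and exchanged.
xml_text = ET.tostring(patient, encoding="unicode")
print(xml_text)

# Any XML parser can read it back; the tags describe the data.
parsed = ET.fromstring(xml_text)
temperature = float(parsed.findtext("temperature"))
unit = parsed.find("temperature").get("unit")
blood_pressure = parsed.findtext("bloodPressure")

print(temperature, unit, blood_pressure)
```

The document is well-formed and usable without any DTD or XML Schema; a schema can be added later to validate that every patient record follows an agreed structure.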
