Chapter 2


Chapter 2: General Enterprise Data Flow

Now that a basic understanding of Business Analytics has been established, it is time to learn how data
actually travels inside an enterprise so that it can be transformed into usable information and, in turn,
actionable items.

Business Analytics implementation in an enterprise is not a venture that can be done alone. It is a
collaborative effort that involves multiple teams from multiple departments constantly communicating
with each other in order to figure out the information needed by the main stakeholders. Determining
that information is the starting point of an implementation, and it dictates which data is actually
relevant to the desired analysis.

How does data get transformed into information in an enterprise?


The following is an illustration of a typical enterprise System Landscape:

Data Sources
An enterprise’s data needs grow as the business scales up. As a result, the machines (servers and
clients) bought a few years ago might no longer be enough to meet today’s needs. Yet they are so
system-critical that taking them offline, even briefly, could leave business users unable to transact,
which in turn makes the data reported to management inaccurate or unavailable. There are three main
categories of data sources in an enterprise.

ERP Systems
In an ideal world, ALL of an enterprise’s data is fed into its ERP System and all reports are obtained
directly from it. However, the real world makes this difficult because, in the end, ERP Systems are still
just machines with limited capacity and processing power. An ERP System might also not be able to
address an enterprise’s needs as the company grows larger. This means that the company will then have
to procure a new ERP System, or upgrade its current one, which requires a significant investment.

An ERP System makes extensive use of Master Data to help keep track of Business Partners and Items.
Maintenance of Master Data is usually assigned to key people, who manage its creation and updating.
When new equipment is bought or an existing ERP System is upgraded, the company might also need to
schedule some downtime to implement the change. The ERP System is unavailable during these periods,
so they need to be scheduled ahead of time, and concerned parties need to be informed so they can
work around them (adding System Memory, for example, requires the system to be shut down before
new Memory Modules can be installed).

Other Databases
Sometimes, due to geographical or cost constraints, a branch of the company might be physically
impossible to connect to the corporate network. This means that they can’t use the ERP System without
resorting to workarounds. One such workaround is to maintain a separate database that records all
transactions for the day. At the end of the day, the database will upload the collected data to the ERP
system.

In other instances, databases might be part of a legacy system that is still being used. It might be
integrated into a Business Process that is system-critical, and current Cost/Time/Technical constraints
mean that it can’t be assimilated into the ERP system just yet. In order to decommission such systems,
the business process and the data they produce must be integrated into the ERP. If this is impossible,
then an Enterprise Data Warehouse will be required to consolidate their data. This will require
additional time and manpower, as it is a project that calls for specialized knowledge of both the legacy
system AND the ERP/EDW (this is an example of Data Migration).

Flat Files
As mentioned before, in a perfect world, all of an enterprise’s data is going to be present in the ERP, for
instant extraction and reporting. However, in reality, there is a process in place so that data within it
cannot be tampered with. Transactions will usually have an approval process to help keep out doubtful
and fraudulent records, while Master Data is managed by key employees. However, there are some
instances where a branch is in such a remote location that an internet connection is not available. This is
where Flat Files come in. Transactions for that branch will be recorded in a flat file, later to be sent to
the Head Office for processing and consolidation.

Flat files are usually Excel or delimited text files that business users create in order to make their own
reports when needed. Delimited text files are usually either tab-delimited or comma-separated value
(CSV) files. These files can still be opened in Excel, though tab-delimited files might need a few extra
steps before they can be read (because they are text files, Notepad will also do). In order to keep an
accurate enterprise-wide report, these files have to be formatted in such a way that they can be
uploaded back into the ERP or Enterprise Data Warehouse.
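
To make the difference between the two delimited formats concrete, here is a minimal Python sketch that reads the same records from a comma-separated and a tab-delimited source. The column names (txn_id, item, qty) and the sample rows are assumptions made up for illustration; they are not taken from any real system.

```python
import csv
import io

# Hypothetical branch transactions; column names and values are assumptions.
CSV_TEXT = "txn_id,item,qty\n1,Widget,3\n2,Gadget,5\n"
TSV_TEXT = "txn_id\titem\tqty\n1\tWidget\t3\n2\tGadget\t5\n"


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


csv_rows = read_delimited(CSV_TEXT, ",")
tsv_rows = read_delimited(TSV_TEXT, "\t")

# Both sources carry the same records; only the delimiter differs.
print(csv_rows == tsv_rows)
print(csv_rows[0])
```

Note that every value comes back as text. Converting quantities and amounts to numbers is a separate step that has to happen before the data is uploaded.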

Enterprise Data Warehouse


While the ERP system has some built-in reporting functionality, it is far from a complete solution. The
most obvious limitations are the fact that custom reports are difficult to create, and data visualization
capabilities are lacking, if present at all. What’s more, the reporting functionality will also consume
system memory in order to be processed. This can have an adverse impact on its ability to transact,
especially if large, detailed reports (per customer or per item, or worse, both) are needed. An Enterprise
Data Warehouse is needed in order to work around these limitations.

The Enterprise Data Warehouse is built in order to consolidate the disparate data sources so that only
the data necessary for reporting will actually be used. Consolidating data is an important aspect of
Business Analytics because, first and foremost, above even facilitating data analysis, Business Analytics
is concerned with delivering “a single version of the truth”: an accurate representation of the business
from any viewpoint. From an implementation standpoint, this will require the following:

1. New hardware that will become the server hosting the Data Warehouse. It must be connected to the
corporate network.
2. A dedicated project team from the Enterprise Side made up of Business Users.
3. A dedicated project team either from the Enterprise IT Team or an external organization who will be
responsible for setting up the environment.

Building an Enterprise Data Warehouse is a massive undertaking that can take weeks, months, even
years to complete, depending on how large the target scope is. In order to build an Enterprise Data
Warehouse:

1. The Business Users will need to determine the reports they want to derive from their data sources.
2. The Business Users will then convene with the IT Team in order to iron out the technical requirements
(Blueprinting). This includes providing information on business processes and where the data can be
obtained. This could take a few days to a few weeks.
3. Once the IT Team has worked out the actual requirements needed by the Business Users, it is time to
implement the EDW to those specifications.
4. Testing will follow for data accuracy with the help of the Business Users.

Because the EDW is on separate hardware, it usually follows a daily “load schedule” in which the
previous day’s transactions are loaded into it. The load is scheduled during off-peak hours, usually
midnight or very early morning, because those are the times when the source systems, the ERP
especially, are not being used.
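
The daily load can be sketched roughly as follows. This is only an illustration under stated assumptions: two in-memory SQLite databases stand in for the ERP and the EDW, and the table names (txn, txn_history), the columns and the function load_previous_day are all hypothetical. A real load would be run by a scheduler and a dedicated tool rather than by a script like this.

```python
import sqlite3
from datetime import date, timedelta


def load_previous_day(source, warehouse, run_date):
    """Copy the transactions dated the day before run_date into the warehouse."""
    previous_day = (run_date - timedelta(days=1)).isoformat()
    rows = source.execute(
        "SELECT txn_id, txn_date, amount FROM txn WHERE txn_date = ?",
        (previous_day,),
    ).fetchall()
    warehouse.executemany("INSERT INTO txn_history VALUES (?, ?, ?)", rows)
    warehouse.commit()
    return len(rows)


# Two in-memory databases stand in for the ERP and the EDW (an assumption).
erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE txn (txn_id INTEGER, txn_date TEXT, amount REAL)")
erp.executemany(
    "INSERT INTO txn VALUES (?, ?, ?)",
    [(1, "2024-03-01", 100.0), (2, "2024-03-02", 250.0), (3, "2024-03-02", 75.5)],
)

edw = sqlite3.connect(":memory:")
edw.execute("CREATE TABLE txn_history (txn_id INTEGER, txn_date TEXT, amount REAL)")

# A run on 3 March picks up only the transactions dated 2 March.
loaded = load_previous_day(erp, edw, date(2024, 3, 3))
print(loaded)
```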

Note that the EDW is at its core a large database. If the scope starts becoming too large, it may be
advisable to create another one that has its own purpose but uses the same Data Sources. The
hardware may be powerful enough to host multiple Data Warehouses on the same machine. SAP
Business Warehouse is a tool to help build Data Warehouses, as is SAP Data Services. Note that the
actual implementation is highly technical, so the Business Users are not expected to help build the
EDW. Rather, the IT Team might defer to them occasionally to ensure correctness and accuracy, and to
clarify anything that did not come up during Blueprinting.

Reporting Tools
Once the Enterprise Data Warehouse has consolidated and sorted out the individual data elements
required by the Business Users, it is time to recombine them into reports that the Business Users can
analyze to keep track of the status of the business. Because the Enterprise Data Warehouse is
essentially a large Database, it is likely that technical column names are still used instead of more
common, Business-friendly terms. For example, a database column that represents a Business Partner’s
last name might be called “INDIV_FNM” or some such. A name like this is not immediately meaningful
from the Business User’s perspective. To help alleviate this, a Semantic Layer is set up as a sort of
“translator” that lets Business Users see technical terms as business terms, so they can immediately
understand what the data is.
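
At its simplest, the “translator” idea amounts to a lookup from technical names to business names. The sketch below is a deliberately minimal illustration, not how any particular BA tool implements its Semantic Layer. Apart from INDIV_FNM, which is the example used in the text, the column names, labels and sample values are assumptions.

```python
# Hypothetical mapping from technical column names to business-friendly terms.
SEMANTIC_LAYER = {
    "INDIV_FNM": "Last Name",        # the example column from the text
    "CUST_ID": "Customer ID",        # assumed for illustration
    "TXN_AMT": "Transaction Amount"  # assumed for illustration
}


def to_business_terms(row, mapping):
    """Rename the keys of a record; unmapped columns keep their technical name."""
    return {mapping.get(column, column): value for column, value in row.items()}


technical_row = {"INDIV_FNM": "Santos", "CUST_ID": "CUST1000", "TXN_AMT": 1250.0}
business_row = to_business_terms(technical_row, SEMANTIC_LAYER)
print(business_row)
```

The data itself is untouched. Only the labels shown to the Business User change, which is why the same warehouse can serve several audiences with different vocabularies.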

One other bottleneck in report creation and Data Analysis is the complexity of extracting data from the
EDW. It used to be that the Business User had to request data from the IT Team, which introduces a lot
of delay. For one, the actual extraction might take some time. For another, the IT person may not be
well-versed in Business Lingo, which affects the quality of the data; if the extract is wrong, it has to be
redone. All of those delays, and that’s not even counting the time spent waiting for response E-mails!

One of the defining features of today’s Business Analytics Tools is what’s called Self-Service BI. In
addition to the Semantic Layer, reporting tools are created with an easy-to-understand interface (usually
drag-and-drop actions make up the majority of interactions) so that the Business User is empowered to
create their own reports. Self-Service BI covers everything from extraction from the Data Warehouse to
Report Creation to Publishing, with little or no help from IT. This helps with the timely flow of
information, as reports can be created quickly by the Business User alone.

Another aspect of Business Intelligence is the quick dissemination of reports to their intended
audiences. It is for this reason that specialized tools also usually come with the ability to log in to a
platform where reports can be published. Examples of Reporting Tools (BA Tools, really) will be covered
in a future section.

3-Tier Architecture
An enterprise does not have just one landscape. It cannot depend on a single landscape because it
needs a contingency plan for when something inevitably breaks. Imagine what would happen if, for
example, an incorrect configuration were pushed into the system and caused it to crash. Since there
would no longer be a system to transact in, the whole company’s data would be at a standstill. To
prevent such an accident, companies will ideally have three landscapes: Development (DEV), Quality
Assurance (QAS, pronounced “kwas”) and Production (PRD, pronounced “prod”, as in the first syllable).

PRD is the most critical of the three, as it contains “live data”. It is the system that is used in the day-to-
day transactions of the company. A lot of redundancies might be required for this landscape, as it is
needed for the proper function of the enterprise. As such, its physical hardware tends to be the most
powerful of the three. Downtime for it must be reduced as much as possible due to its operational
importance.

DEV, as its name states, is for development purposes. When a new report needs to be created or a
change in configuration needs to be made, it should be done here first. If the report runs (the data is
correct and it completes in a timely manner) or the configuration does not result in catastrophic
failure, the changes are rolled up and promoted to QAS. Only if everything is in order in QAS after
further testing will they finally be promoted to PRD.

These three landscapes do not need to be aligned with regard to data. That is to say, the data in PRD
need not be present in DEV for development purposes. However, ideally, all reports, objects and
software configurations should match. This is because in a BA environment, as long as a report is
properly configured, it will be able to get the updated data as long as the “object” representing that data
is present. Also, promotion from PRD going back to QAS and DEV isn’t permitted.
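
The promotion rule described above (DEV to QAS, then QAS to PRD, never backwards) can be written down as a small check. This is only a sketch of the rule as stated in the text. The function name can_promote and the choice to reject a direct DEV-to-PRD jump are assumptions, the latter based on the statement that changes reach PRD only after passing QAS.

```python
# Landscapes in promotion order, as described in the text.
PROMOTION_PATH = ["DEV", "QAS", "PRD"]


def can_promote(source, target):
    """Allow only a single forward step along DEV -> QAS -> PRD."""
    return PROMOTION_PATH.index(target) - PROMOTION_PATH.index(source) == 1


for source, target in [("DEV", "QAS"), ("QAS", "PRD"), ("PRD", "QAS"), ("DEV", "PRD")]:
    print(source, "->", target, can_promote(source, target))
```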

Some enterprises have a 4th, off-premises landscape known as Disaster Recovery (DR). This is essentially
a copy of PRD that is kept separate from the other three landscapes. It acts as a contingency when
PRD suffers a catastrophic failure (usually through accidents and Acts of God). As such, it is
important to keep the data between the two aligned, to minimize downtime.

Data Reliability
The one inviolable rule when working with numbers and computers is this: “Garbage In, Garbage Out”.
Some people say “Numbers don’t lie”, but that is incomplete, because the veracity of the numbers must
be established before any calculations or definitive statements are made. This is a constant challenge
with Analytics. As the data travels and transforms through the enterprise, something might get lost or
unintentionally changed, and such anomalies can have a significant impact on the correctness of the
reports being produced. After all, if one item is inaccurate, are the other items that came with it also
affected? It is for this reason that Clear Communication is a must, not just within the company, but with
everyone involved in a Business Analytics Project. Sometimes these anomalies can be easily traced;
other times, not so much. The following are just some ways inconsistencies can be introduced:

1. Inconsistent Terminology – A department might refer to an SKU as a “Product” and another might
refer to it as “Material”. This extends to more than just the labels. The “Product” department might be
using only the first 5 characters of the SKU’s Code for their reporting, while the “Material” department
might need the whole 20-character string for their own reporting. In that case, both must be present
and accounted for.

2. Rounding Errors and Truncation – Consider the number of decimal places a given piece of numeric
data has. As it travels from the Source to the EDW to the Reporting Tool, it will have to be encoded into
different formats. Potential side effects include Rounding Errors. This could cause final numbers to
deviate from the source.

As can be seen in the above example, simply rounding the Price column had a significant enough effect
that the displayed total changed. And that is for just four items. Imagine the impact if the whole
ERP system were considered! Discrepancies like these are usually given special consideration if they
can be proven to come from rounding errors.

Truncation has the same effect (though more pronounced). Instead of rounding the number, decimal
places are outright omitted:

Be careful how decimal places are considered.

3. NULLs and Zeroes – Null Values represent “nothing”. However, in computing, Nulls and Zeroes are
considered as different entities. This can have an impact on the evaluation of conditional formulas and
averages.

4. Incorrect Inputs – this is where the concept of “Garbage In, Garbage Out” is very apparent. While ERP
Systems usually have a built-in way to reject incorrect inputs (inputting letters in a field that only accepts
numbers), some legacy systems don’t have this functionality. Even worse still are the “technically
correct” inputs that get accepted but are gibberish (nonsense data and fields left blank that shouldn’t
be). Data cleanup to ensure consistency is a lot of work, and should only be done as a last resort. The
best way to avoid Garbage Inputs is to put policies in place that will ensure correctness.

5. Outright Data Discrepancies – A company usually makes some tactical decisions (particularly in
marketing) where its products and services are joined together into promos and bundles in order to take
advantage of a gap in the market or a season and increase sales. Since the bundles consist of different
products, they also have an impact on inventory. In other cases, a trial run of a new product is made
available to the market to test its viability. Either situation means that a new Item should be present
in the ERP in order to reflect the numbers properly. However, because these are special cases that had
to be created quickly, they are often tracked for internal use only by the departments responsible.
They will have to be pushed into the ERP later in order to get a more accurate reading on the enterprise
as a whole.
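
Items 2 and 3 in the list above (rounding, truncation, and NULLs versus zeroes) can be checked with a few lines of Python. The sketch below uses made-up prices, quantities and sales figures. They are assumptions chosen for illustration and are not the figures from the chapter’s own examples. Python’s None stands in for a database NULL.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

# Hypothetical (price, quantity) pairs; prices carry three decimal places.
ITEMS = [
    (Decimal("10.125"), 3),
    (Decimal("4.335"), 10),
    (Decimal("7.455"), 4),
    (Decimal("2.995"), 7),
]
CENT = Decimal("0.01")


def total(items, rounding=None):
    """Sum price * quantity, optionally cutting each price to two decimals first."""
    result = Decimal("0")
    for price, qty in items:
        if rounding is not None:
            price = price.quantize(CENT, rounding=rounding)
        result += price * qty
    return result


exact_total = total(ITEMS)                    # prices used as stored
rounded_total = total(ITEMS, ROUND_HALF_UP)   # prices rounded to 2 places
truncated_total = total(ITEMS, ROUND_DOWN)    # extra decimals dropped

print(exact_total, rounded_total, truncated_total)

# NULLs versus zeroes: None stands in for a database NULL.
monthly_sales = [100, None, 50, None]
known = [v for v in monthly_sales if v is not None]

avg_ignoring_nulls = sum(known) / len(known)
avg_nulls_as_zero = sum(v if v is not None else 0 for v in monthly_sales) / len(monthly_sales)

print(avg_ignoring_nulls, avg_nulls_as_zero)
```

With these sample figures the exact total is 124.51, the total from rounded prices is 124.63, and the total from truncated prices is 124.39. The average is 75.0 when NULLs are skipped and 37.5 when they are counted as zeroes, so the way “nothing” is treated changes the reported number.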

Again, prevention is better than cure. It is always better to prevent an issue from arising in the first
place, rather than trying to fix it when it does show up.

Relational Databases and the Star Schema
The Relational Model is the first data model that can be fully described mathematically. All data is
represented in terms of tuples (rows/records) made up of attributes (fields/columns) and grouped into
relations (tables). It is the most common way to store and access enterprise data, typically through
some form of Structured Query Language (SQL).

The usage of primary and foreign keys denotes relationships between tables. Data can be obtained from
multiple tables to produce one tuple of data by JOINing tables via their keys. SQL, initially pushed as the
standard language for relational databases, deviates from the relational model in several places. The
current ISO SQL standard doesn't mention the relational model or use relational terms or concepts.
However, it is possible to create a database conforming to the relational model using SQL if one does
not use certain SQL features.

Storing data obviously takes up space, and the more space needed to store data, the more expensive it
is for the enterprise to maintain (mainly, it will need to purchase additional hard drives and other
storage media). Take for example the following table:

Having to repeatedly state the Customer Name and Product Description in every transaction takes
up a lot of space. It is for this reason that these kinds of data are represented by an ID-Description pair,
which saves space and makes it easier to create relations and JOIN tables together. The above table can
then be expressed as the following:

Instead of relying on one table to show all of the data, we break it down into three tables. Please note
that the names were arbitrarily assigned:

1. TXN – records all transactions that are encoded into the system.
2. CUS_MAS – stores all customer information.
3. PROD_MAS – stores all product information.

If we want to see the name of the customers who bought products, we only need to JOIN the TXN and
CUS_MAS tables. This saves space because we don’t have to show the customer names all the time in
the TXN Table, and allows for more flexible reporting.
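
The JOIN between TXN and CUS_MAS can be tried out with Python’s built-in sqlite3 module. The three table names come from the text. The column names (CUST_ID, CUST_NAME, PROD_ID, PROD_DESC, TXN_ID, QTY) and the sample rows are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CUS_MAS (CUST_ID TEXT PRIMARY KEY, CUST_NAME TEXT);
    CREATE TABLE PROD_MAS (PROD_ID TEXT PRIMARY KEY, PROD_DESC TEXT);
    CREATE TABLE TXN (
        TXN_ID INTEGER PRIMARY KEY,
        CUST_ID TEXT REFERENCES CUS_MAS (CUST_ID),
        PROD_ID TEXT REFERENCES PROD_MAS (PROD_ID),
        QTY INTEGER
    );
    -- Hypothetical sample data.
    INSERT INTO CUS_MAS VALUES ('C001', 'Acme Trading'), ('C002', 'Bayside Supply');
    INSERT INTO PROD_MAS VALUES ('P001', 'Widget'), ('P002', 'Gadget');
    INSERT INTO TXN VALUES (1, 'C001', 'P001', 3), (2, 'C002', 'P001', 1), (3, 'C001', 'P002', 5);
""")

# The customer name is stored once in CUS_MAS and looked up through CUST_ID.
rows = conn.execute("""
    SELECT t.TXN_ID, c.CUST_NAME, t.QTY
    FROM TXN AS t
    JOIN CUS_MAS AS c ON c.CUST_ID = t.CUST_ID
    ORDER BY t.TXN_ID
""").fetchall()
print(rows)
```

Each customer name is typed in exactly once, no matter how many transactions refer to it. Adding PROD_MAS to the query with a second JOIN on PROD_ID would bring in the product description the same way.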

The Star Schema

A schema or logical data model is a representation of the abstract structure of domain information. It is
often expressed as a diagram, and is used as foundation to designing database structures. There are
many different kinds of schemas, but the most-commonly used one in enterprise computing is the Star
Schema.

A Star Schema is the simplest approach used in designing enterprise data warehouses. It consists of
a Fact Table (usually just one) referencing any number of Dimension Tables.

A Fact Table records measurements for a specific event. Fact Tables are typically referred to as
Transaction Tables and contain very granular numeric data. In addition to this numeric data (typically
amounts and quantities), a Fact Table also contains surrogate keys that define its relationships to many
Dimension Tables, which contain descriptive data. In an enterprise, an “event” can be any sale that
occurs. In other words, enterprise measures can be derived from the fact tables.

A Dimension Table, by contrast, will contain fewer records than a Fact Table. Dimension Tables don’t
contain transactions; rather, they contain descriptive information like Customer Information, Addresses,
Date and Time, etc. The data they contain is sometimes referred to as Master Data. In an enterprise,
there will be dedicated custodians for this kind of data, and a strict process must be followed to add or
edit it, as changes can alter the view of the enterprise data.

Relationships in the schema (JOINing tables) are dictated by Keys within the tables. Keys ensure that each
row of data within the table is unique. These are typically “ID” columns whose values increment
automatically as more rows are populated, using some sort of algorithm. A key can consist of a single
column or multiple columns, to ensure uniqueness. Keys can also be Primary or Foreign, depending on
which table they are viewed from. For example, in the figure below, think of each table as a building, and
we are in F_SALES. In F_SALES, there is the CUST_ID column, which is a part of F_SALES’s primary key. The
same column name is present in D_CUSTOMER, where it is the primary key that identifies each customer.
In standard relational terms, CUST_ID in F_SALES is therefore also a foreign key: each value in it refers to
a row in D_CUSTOMER. It is important to note that what matters when JOINing multiple tables is
not the name of the columns, but rather their contents. For example, if CUST_ID in F_SALES contains the
“CUST1000” value, then that same value should also be in CUST_ID in D_CUSTOMER. The names are
kept similar so that the Database Administrators and designers will have an easier time maintaining the
RDBMS.
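
A very small Star Schema can be built and queried with Python’s sqlite3 module to show how a measure is derived from the Fact Table and described by a Dimension Table. F_SALES, D_CUSTOMER, CUST_ID and the “CUST1000” value come from the text. D_PRODUCT, the remaining columns and all of the sample rows are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive data, one row per customer or product.
    CREATE TABLE D_CUSTOMER (CUST_ID TEXT PRIMARY KEY, CUST_NAME TEXT);
    CREATE TABLE D_PRODUCT (PROD_ID TEXT PRIMARY KEY, PROD_DESC TEXT);

    -- Fact table: granular measures plus keys pointing at the dimensions.
    CREATE TABLE F_SALES (
        CUST_ID TEXT REFERENCES D_CUSTOMER (CUST_ID),
        PROD_ID TEXT REFERENCES D_PRODUCT (PROD_ID),
        SALE_DATE TEXT,
        QTY INTEGER,
        AMOUNT INTEGER,
        PRIMARY KEY (CUST_ID, PROD_ID, SALE_DATE)
    );

    -- Hypothetical sample data.
    INSERT INTO D_CUSTOMER VALUES ('CUST1000', 'Acme Trading'), ('CUST1001', 'Bayside Supply');
    INSERT INTO D_PRODUCT VALUES ('PROD01', 'Widget'), ('PROD02', 'Gadget');
    INSERT INTO F_SALES VALUES
        ('CUST1000', 'PROD01', '2024-03-01', 3, 300),
        ('CUST1000', 'PROD02', '2024-03-01', 1, 150),
        ('CUST1001', 'PROD01', '2024-03-02', 2, 200);
""")

# Measure (total sales amount) from the fact table, labelled by the dimension.
sales_by_customer = conn.execute("""
    SELECT c.CUST_NAME, SUM(f.AMOUNT)
    FROM F_SALES AS f
    JOIN D_CUSTOMER AS c ON c.CUST_ID = f.CUST_ID
    GROUP BY c.CUST_NAME
    ORDER BY c.CUST_NAME
""").fetchall()
print(sales_by_customer)
```

The JOIN works because the “CUST1000” value stored in F_SALES also exists in D_CUSTOMER. Renaming the column in one of the tables would not break the relationship, but a mismatch in the stored values would.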
