Data Governance:
A conceptual framework in order to prevent your Data Lake from becoming a
Data Swamp
Charikleia Paschalidi
2015
Paschalidi Charikleia
Master Program
Master of Science in Information Security
Copyright of this thesis is retained by the authors and the Luleå University of Technology.
Ideas contained in this thesis remain the intellectual property of the authors and their
supervisor, except where explicitly otherwise referenced.
All rights reserved. No part of this thesis may be reproduced, transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise, or stored in a retrieval
system without the prior written consent of the author and the Luleå University of Technology
(Department of Computer Science, Electrical and Space Engineering).
Contact Information:
Project Member:
Paschalidi Charikleia
E-mail: [email protected]
University Advisor:
Devinder Thapa
E-Mail: [email protected]
Information Security is nowadays becoming a very popular subject of discussion among both
academics and organizations. Proper Data Governance is the first step towards an effective
Information Security policy. As a consequence, more and more organizations are changing their
approach to data, treating it as an asset in order to extract as much value as possible from it.
Living in an IT-driven world leads many researchers to approach Data Governance by borrowing
IT Governance frameworks.
The aim of this thesis is to contribute to this research by conducting Action Research in a large
Financial Institution in the Netherlands that is currently releasing a Data Lake in which all data
will be gathered and stored in a secure way. During this research, a framework for implementing
proper Data Governance in the Data Lake is introduced.
The results were promising and indicate that, under specific circumstances, this framework
could be very beneficial not only for this particular institution, but for every organisation that
would like to avoid confusion and apply Data Governance to its activities.
Keywords:
Data Governance, Data Lake, Standards, Policies, Business, Risk and Compliance, IBM
tooling, Hadoop, Data Integration Layer (DIL), Data Quality, Scrum, Agile Methods.
1 Introduction
Data is often described as the oil of the 21st century. According to IBM, “approximately more
than three exabytes of digital data are created daily around the world” (Zhu et al., IBM, 2011).
In the age of smartphones and social media, our exposure to data is so great that we are
drowning in it. Companies have therefore understood how significant big data is, and they are
shifting their focus from software and hardware to the data they process, in order to gain a
competitive advantage by knowing more than their competitors (Kemp IT Law, 2015, p. 1-169).
Big Data refers to the collection, analysis and use of the huge volumes of data generated by
our digital lives. A few years ago, companies either did not have access to such data or did not
know what to do with it. Nowadays, though, companies use it to reveal new insights into
Business Processes, Demand or Customer Sentiment, as well as to anticipate and react to
changes. Banks, for instance, are using Big Data to gain better insight into their customers, so
that they can provide them with better services (Taylor, Nevitt & Carnie, 2012).
However, having access to data is not enough. According to Dave Coplin, Director of
Research and Futurologist at Microsoft, “Data itself is worthless. It is what you do with it that
has value.” In order to extract the value of data, we need to explore the data first. This
requires the recognition of data as a corporate asset and the implementation of some form of
Data Governance for effective Data Management.
The purpose of this thesis is to introduce a road map for developing a framework for
the Data Governance of Big Data. Since most of the knowledge that exists today comes
from IT Governance, this thesis aims to separate those two fields and focus on Data
Governance itself.
1.1 Problem Area
The problem derives from the increasing importance for businesses of focusing on their data.
It is common for the majority of organisations, although they are beginning to realize the
importance of their data, not to know what to do with it. Nowadays, when our digital life is as
intense and important as our everyday life and the mountains of data we are exposed to keep
growing, companies need to know what their “Big Data” is, as well as how to use it efficiently.
According to the Financial Times, “Companies find data useful, because they can reveal new
insights into Business Processes, Demand or Customer Sentiment that help them anticipate or
react to changes”. Banks, for instance, use Big Data to obtain a clearer picture of their
customers. In that way, they can deliver more appropriate and customized services to their
clients, as well as spot fraud more easily (Taylor, Nevitt & Carnie, 2012). Therefore, the need
for Data Governance is currently more important than ever.
In this paper, an attempt is made to construct a framework for implementing Data Governance
in a large Financial Institution in the Netherlands, which is implementing a new Data Lake
solution for Commercial Banking. Currently, no such framework is in place and there is a lot
of confusion inside the company about who is responsible for the data and what they should do
with it.
The company is a leading Commercial Bank in Belgium, the Netherlands and Luxembourg, as
well as in Central and Eastern Europe, and has strong global franchises in Specialized Finance
and Financial Markets. Its clients are supported through an extensive network in more than 40
countries, and its headquarters are in Amsterdam, the Netherlands. The Bank’s customer base
includes individuals, small and medium-sized businesses, large corporations, institutions and
governments.
Currently, Commercial Banking (CB) is releasing the GDIL (Global Data Integration Layer),
which is the first version of the CB Data Lake, focused on IT-Business interaction, with the
help and tooling of IBM. DIL is a central post-transaction data integration hub for Commercial
Banking, rather than a point-to-point solution for every application. It is intended to reduce
interface complexity and customization needs on packages, and it is both real-time and
batch-based. In addition, DIL should not be considered a wide data warehouse, since it does
not keep historical data.
Ideally, the data in a bank is well-organized, but in reality it is challenging to find the right
data and obtain it. What has happened so far within the institution is that copies of the needed
data are created in order to send them to the departments that request them. The main goal of
the Data Lake is to get rid of those copies, which are both expensive and time-consuming.
Building the Data Lake will be demand-driven, based on use cases as they arise. In other
words, every project that arises and requires Data Lake functionality will contribute to the
implementation of the Data Lake solution.
The whole Data Lake solution consists of three parts, each of them with its own
responsibilities:
How should a conceptual framework be built in order to prevent a Data Lake from
becoming a Data Swamp?
In order to answer the research question, I first need to research data governance, in order to
find out what is already known on the topic in both the academic and the business world.
The second step will be to explore what is already known inside the company on the subject,
find the gaps and fill them in.
Later on, I intend to research the tools IBM provides in order to successfully implement
Data Governance in the Data Lake.
The next step, which is also the practical part of the thesis, is to interview both sides, those
who send as well as those who receive data from the data lake, in order to understand which
data are meaningful and should exist in the lake.
Eventually, taking the steps above into consideration, I will need to come up with a
conceptual framework and define roles in every part of the lake that will help to successfully
implement Data Governance in the Data Lake.
1.5.1 Assumptions
The study assumes that the people currently working on either side of the Data Lake in the
company are willing to cooperate and answer the questions asked honestly and within the
time-frame of the assignment. Another assumption is that the steps can be aligned with the
Scrum methodology in order to be properly implemented into the everyday activities of the
team.
1.5.2 Delimitations
As mentioned above, there is only limited research on Data Governance. Most studies either
transfer knowledge from IT Governance to Data Governance or use resources from
practitioners, analysts and consultants, due to the lack of academic resources. As a result,
collecting all the data needed and coming up with a specific questionnaire might be
challenging.
In addition, the people working in the team should be willing to adjust to the new framework
and the agile way of working in order to implement the steps suggested in the theoretical
framework.
2 Literature Review
During the research, an analysis of the questionnaires will take place in order to understand
people’s awareness of Data Governance in the organisation, as well as the Data Governance
maturity level the organisation itself is at. Later on, I will be actively involved in the process
of implementing Data Governance in the Data Lake as a member of the DIL department, and I
will propose a conceptual framework for implementing Data Governance.
The following chapter of this research contains the description of the methodology that was
used during the development of the thesis. Here, the approach taken by the author is defined,
as well as the way the approach was executed.
After gaining enough knowledge from a variety of articles and papers concerning Governance
and Big Data, and after experiencing the team’s way of working myself, I had to come up
with a Data Governance Framework applicable to the specific department and align the
procedure with the Scrum methodology. I therefore decided to disregard articles that would not
add value to this thesis due to their overly specific content. The papers selected for this
research were chosen by reviewing their titles and abstracts.
The Traditional Data Governance methods mentioned above often suffer from certain
common problems and limitations. Those limitations are listed below:
1 Data governance does not usually fit into the overall IT governance effort.
2 Data governance efforts are ignored. Development teams often prefer to “work around”
an organization’s data group. In order for data governance efforts to succeed,
development teams must truly collaborate.
3 Data governance is too difficult to conceive. Development teams usually report that the
data group within their organization is too difficult to work with.
4 Data governors are often too slow to respond. As a consequence, developers tend to do
what they themselves believe is best.
5 Data governors are not considered to provide value, because of the additional
bureaucracy involved with traditional approaches.
3 Research Method
The method used to answer the question is mostly qualitative. Qualitative research involves
the use of qualitative data, such as interviews, documents, and participant observation data to
understand and explain social phenomena.
According to Myers (2009), qualitative research methods were developed in the social
sciences to enable researchers to study social and cultural phenomena (Myers, M. D., June
1997, pp. 241-242). Examples of qualitative methods are action research, case study research
and ethnography. Qualitative data sources include observation and participant observation
(fieldwork), interviews and questionnaires, documents and texts, and the researcher’s
impressions and reactions.
The reason I decided to do qualitative research is that it was the most appropriate method
for this study, because qualitative methods are designed to help researchers understand people
and the social and cultural contexts within which they live. According to Kaplan and Maxwell
(1994), the goal of understanding a phenomenon from the point of view of the participants and
its particular social and institutional context is largely lost when textual data are quantified
(Kaplan, B. and Maxwell, J.A., 1994, pp. 45-68).
There are several different qualitative research methods, which require different skills,
assumptions and practices. Some of these methods are action research, case study research,
ethnography and grounded theory.
The first one, Action Research, according to Rapoport (1970), “aims to contribute both to the
practical concerns of people in an immediate problematic situation and to the goals of social
science by joint collaboration within a mutually acceptable ethical framework” (Rapoport,
R.N., 1970, pp. 499-513). More specifically, Avison et al (1999) describe this method as: “an
iterative process involving researchers and practitioners acting together on a particular cycle
of activities, including problem diagnosis, action intervention, and reflective learning”. (p. 94)
Case Study Research is defined by Yin (2002) as an empirical study that investigates a
contemporary phenomenon within its real-life context, especially when the boundaries
between phenomenon and context are not clearly evident (Yin, R. K., 2002).
There are many different qualitative research methods that were initially taken into
consideration for this study. At first, Case Study Research seemed the most appropriate one.
The main reason for that was the fact that, at the beginning of the project, a lot of interviews
took place in order to get an understanding of people’s comprehension of Data Governance
and participation in the Data Lake.
Later on, though, during the study, Case Study Research proved to be an unsuitable method,
mainly because of the nature of the research. The goal of the study was not only observation
but also intervention and active participation in the creation of the framework needed to
implement Data Governance in the DIL department of the Financial Institution.
Therefore, the most suitable method for this particular study was considered to be the Action
Research. The main reasons that led me to this conclusion are the following:
1. My main goal was to come up with results that would be beneficial not only for me
and my research but also for the company itself, since the study refers to an existing and
interesting case. The research is being executed within the Financial Institution, in a
real-life context and in collaboration with practitioners.
2. The knowledge that I would expect to gain from the study would be immediately
applied in the organisation. The main purpose of the research was to actively
participate in the creation of a framework that would help towards an effective Data
Governance implementation in the Data Lake.
3. Therefore, in order to achieve that, I would have to work closely with the members of
the department as part of the team and make observations of people's conception and
understanding when it comes to Data Governance. At the same time, I would get
immediate feedback from stakeholders in order to improve the framework in progress.
5. Last but not least, consent from the Financial Institution's side was ensured in order
to perform the thesis, as well as easy access to the people who were interviewed, since
I am currently working in this organisation and am part of the team.
1. Diagnosing: In this initial phase, the researcher needs to understand the company's current
situation and become familiar with the internal way of working. This is the most crucial phase
of the Action Research, since it is the starting point on which the proposed changes and
framework were based. The diagnostic phase is described in more detail in chapters 5.1
and 5.2, explaining the current situation of the Financial Institution as well as the reason why
there is a need for the Data Lake.
2. Action planning: The action planning phase involves the actions necessary to solve the
problems the Institution had at the beginning of the research. Those problems were identified
in the Diagnosing phase, guided by the theoretical framework (Chapter 4). In other words,
planning establishes the target for change and the approach to change (Baskerville, 1999,
p. 15). The actions needed to implement Data Governance in the Data Lake are described in
section 5.3, where the IBM launch approach and tooling are introduced as a way of
implementing Data Governance properly.
3. Action taking: In this phase, the actual implementation of the proposed framework proved
to be a challenge due to confusion and misunderstandings between the team members.
Therefore, this phase consists mostly of interviews, trainings, meetings and further research in
order to understand people's confusion and lack of understanding, gather information about
what actions should be taken, and exchange opinions about the right approach to the
suggested framework.
In addition, the team adopted the Agile Scrum methodology of working, which increased the
transparency of the team, since everybody had to inform the rest of the team members daily
about his or her tasks. Although this transparency caused a lot of confusion at the beginning,
it improved the team's collaboration over time and helped the team use the right people for
the right tasks.
4. Evaluating: Part of the interviews aimed to evaluate people's views, inside and outside the
team, about Data Governance. A clear understanding of what the people who work for DIL,
as well as the parties who come into contact with the Data Lake, think of the solution provided
is presented in section 5.4.
In addition, an effort was made to create a profile of each participant's prior knowledge of Big
Data, Data Governance and Data Management, as well as of their personal involvement with
the Data Lake, in order to be able to interpret their views on the proposed framework more
effectively.
5. Specifying learning: In this phase, a reflection on the previous steps needs to be made, as
the actual results of the process need to be finalized. The results of the research seemed
quite promising, and a framework on both a higher and a lower level is introduced in chapter
6.
c) It might not be the appropriate solution in case the researcher has only a limited amount
of time at his or her disposal. (Simonsen, 2009, p. 10)
4 Theoretical Framework
4.1 Big Data
There are, so far, various definitions of Big Data, since the phenomenon is of both a technical
and a sociological nature. In most cases, people use the term to describe the mountains of data
they are exposed to, mostly generated by our digital lives. It often involves the collection and
analysis of data of high volume and low-value information. The three main characteristics of
Big Data are the so-called 3 Vs: volume, velocity and variety (McAfee, A., and Brynjolfsson,
E., 2012, pp. 59-66), although a fourth V has been introduced by the Oxford Internet Institute:
Veracity (Oxford Internet Institute, 2014). These data seem insignificant if we look at each
piece separately, but their meaning increases dramatically when we look at them as a whole.
Although no standard definition of Big Data exists, nobody can deny its existence.
According to Nicola Askham, Data Governance Coach, people think of data the same way
they think of air: they take it for granted and do not understand its added value until they
lose it.
Nowadays, Big Data is everywhere: in every online login, every use of an application, every
online purchase, etc. (CRISAN, ZBUCHEA & MORARU, 2014). According to Boyd and
Crawford, “We live in the age of Big Data” (Boyd, D. and K. Crawford, 2012, pp. 662-679).
These data are of massive scale and complexity.
The advantages of Big Data include, among others, the following (Oxford Internet Institute,
2014):
Advocating and Facilitating: The fact that Big Data can be massively produced in real
time and without any restrictions in size provides the advantage of showing the
granular detail of a given problem and therefore helps in creating interactive tools and
engaging the people involved in the problem, so that they gain a deep understanding of
the situation.
Describing and predicting: Nowadays, researchers can combine social data and real-
time information that have never been combined before in order to provide high-
resolution, dynamic data sources and methods of analysis.
Accountability and Transparency: Big Data, if managed correctly, helps us get better
insight into the information we need and facilitates accountability and transparency in
real life.
At the same time, some concerns usually arise when it comes to Big Data. Those concerns
have to do mainly with the quality of data analysis and the protection of privacy and intimacy.
Therefore, the governance of data is of great importance.
More specifically, the Data Lake is a new concept, used to describe the evolutionary form of
data repositories. It is a thinking framework as well as a technical solution, and it defines how
to receive and deliver data, how to separate them into different repositories based on their
special characteristics, and how to secure them. In addition, the Data Lake also defines how to
access data for either reporting or exploration and analytic modelling, and the tools and the
way to implement them.
The concept of the data lake is that data will be delivered once, avoiding the recreation of
copies across departments and divisions within a company. A combination of real-time and
batch delivery of data exists, support for unstructured data is provided when using new
technologies, and new technologies enable bringing the analytics to the data instead of
bringing the data to the analytics. The main differences between traditional Data
Warehouses and the Data Lake are described in table 1 (Daniel E. O’Leary, 2014).
Receives data from the System of Records, as well as from other Data Lakes, in their
native format.
Implements the transformation into an Esperanto language which, according to the IBM Data
Base Workbench, has already been decided to be English.
Delivers data to the Receiving System in their preferred/native format (a minimal sketch of
this receive-translate-deliver idea follows the list below).
The new sources describe other information besides the data managed by the system
of records (internal sources, log files from customer interactions, etc.)
Systems used by data scientists and business analysts and their rules and models are
described in the Decision Model Management.
Information sources that need to be shared are handled by the Information Owner.
There is also a Governance, Risk and Compliance platform used to demonstrate
compliance with regulations and business policies.
Line of Business Insight applications are designed to provide reports, search and
simple analytics capabilities that are being controlled by the business.
In addition, there is a Data Lake Operations team responsible for managing the data
lake operations.
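To make the receive, translate and deliver responsibilities listed above more concrete, the
following minimal Python sketch shows how a record arriving in a source system's native field
names could be mapped to a common English ("Esperanto") vocabulary and then re-mapped to a
receiving system's preferred format. All field names and mappings are hypothetical and only
illustrate the idea, not the institution's actual data model.

```python
# Minimal sketch of the receive -> translate -> deliver idea (hypothetical field names).

# Mapping from a (hypothetical) source system's native fields to common English terms.
SOURCE_TO_COMMON = {"klantnummer": "customer_id", "bedrag": "amount", "valuta": "currency"}

# Mapping from the common terms to a (hypothetical) receiving system's preferred fields.
COMMON_TO_TARGET = {"customer_id": "CUST_ID", "amount": "TXN_AMOUNT", "currency": "CCY"}


def to_common(record: dict) -> dict:
    """Translate a record from its native format into the common 'Esperanto' vocabulary."""
    return {SOURCE_TO_COMMON[k]: v for k, v in record.items() if k in SOURCE_TO_COMMON}


def to_target(record: dict) -> dict:
    """Translate a record from the common vocabulary into the receiving system's format."""
    return {COMMON_TO_TARGET[k]: v for k, v in record.items() if k in COMMON_TO_TARGET}


native = {"klantnummer": "12345", "bedrag": 99.50, "valuta": "EUR"}
print(to_target(to_common(native)))  # {'CUST_ID': '12345', 'TXN_AMOUNT': 99.5, 'CCY': 'EUR'}
```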
In the case of the Financial Institution, Data Management is a key building block supporting
its strategy. The objective is to ensure that the company has trustworthy data that can enable:
Next Generation Digital Bank: Data is an asset which is equally as important as people
and systems in order to become the next generation digital bank.
In our case, besides Data Management, we also have to concentrate on Metadata Management.
Metadata is information about our data, and it is crucial in order to understand the value of
our initial, raw data. IBM Corporation refers to Metadata Management as a procedure of
cataloguing information about data objects. Since the majority of organizations do not know
how to make use of their data, they tend to spread it across different systems and departments.
As a consequence, the users who own, access and manage those systems cannot communicate
with each other easily. The definition given by IBM Redbooks is that “Metadata management
refers to the tools, processes, and environment that are provided so that organizations can
reliably and easily share, locate, and retrieve information from these systems” (Jackie Zhu et
al., Metadata Management with IBM InfoSphere Information Server, IBM Redbooks, October
2011).
In other words, metadata management is the process that organizations need in order to make
sure that the final reports and analyses come from the right data sources, complete and of high
quality. It is the tools, processes and environment that are provided to enable an organisation
to answer the question, “How do we know what we know about our data?” (Jackie Zhu et al.,
Metadata Management with IBM InfoSphere Information Server, IBM Redbooks, October
2011). In order to achieve that, metadata management has to be provided in every step of the
implementation process.
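As an illustration of what cataloguing information about data objects can look like in practice,
the following minimal Python sketch records a business term together with its definition, owner,
source system and lineage, so that the question "How do we know what we know about our data?"
can be answered from one place. The structure and all names are assumptions made for this
example; they do not describe the IBM InfoSphere catalog itself.

```python
# Minimal sketch of cataloguing metadata about a data object (all names hypothetical).
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str                 # business term, e.g. "Customer Account"
    definition: str           # agreed business definition
    owner: str                # accountable data owner
    source_system: str        # System of Records the data originates from
    lineage: list = field(default_factory=list)  # systems the data has passed through
    cia_rating: str = "unclassified"             # confidentiality/integrity/availability


catalog: dict[str, CatalogEntry] = {}


def register(entry: CatalogEntry) -> None:
    """Add or update an entry so users can locate and understand the data."""
    catalog[entry.name] = entry


register(CatalogEntry("Customer Account", "A contractual account held by a customer.",
                      owner="Data Owner CB", source_system="SoR-Payments",
                      lineage=["SoR-Payments", "DIL"]))
print(catalog["Customer Account"].lineage)
```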
There are two dominant theories about Data Governance. The first one, the so-called
“modernist” theory, holds that within any policy domain, the structure of the institution is
what shapes the consideration of policies, the interaction of the actors, and the evaluation of
the outcomes. The second one, the “interpretive” approach, claims that governance theory
itself should provide an explanation of the reasons on which people act, as well as make sense
of them (Dr Alberto Asquer, 2013).
The aim of this thesis is to contribute to the “interpretive” approach to the theory of
governance by investigating the understanding of Data Governance and the tooling provided
within a particular organizational institution, and by coming up with a framework for data
governance implementation. The main reason to contribute to this second approach is that a
theory of governance that relies heavily on institutions, structures, and processes seems well
equipped to account for the reproduction or adjustment of social practices only within a
relatively stable context, where individual beliefs, preferences, and meanings may not
significantly affect the established institutional and social order.
Usually, Data Governance is handled by a group of professionals inside the company who
form the Data Governance team. The members of that team may differ in each organisation
(Sarsfield, S., 2015). In the case of the Financial Institution, there is a newly formed team,
called Global Data Management, which suggests the following teams and roles for the
upcoming solutions:
Global Data Council: A team that has the mandate to agree upon or amend the data
strategy framework, which comprises data governance, definitions, architecture,
quality, processes, sustained awareness, lineage, organisation structure, privacy,
security, etc. It consists of a mix of data suppliers, data users, data custodians,
architects and the CDO, and it meets every three months to discuss the solutions.
Operational Data Council: The team that makes decisions concerning data
management and steers programs on a tactical level. It consists of a mix of data
suppliers, data users, data custodians and architects, who meet once every two weeks.
Data owner: The data owner is the one who is responsible and accountable (has legal
rights and complete control over) for the data within his systems. He/she decides how
the data is acquired and used. He/she is aware of where the data is further distributed
within ING.
Data steward: The Data steward is the one who assists the data owner in his day-to-
day activities in order to ensure the data is managed and of the required quality as
defined and agreed in ING’s data policies. The role of data steward is optional. It is at
the discretion of the data owner whether or not to assign this role.
Data user: The data user is the one who needs data from one or more data owners in
order to aggregate, further process, store, or to create meaningful reports for different
purposes (management reporting, regulatory reporting, statutory reporting, customer
intelligence, etc.)
Data custodian: Data custodian is the one who helps the data user by providing the
organisation and systems to gather data from one or many data owners, store and
further process as per the requirements of the data user.
Data definition owner: The data definition owner is the one who defines standard
definitions for the data attributes in the data categories he/she is responsible for, in
order to facilitate their exchange across the entire organisation. He/she does so by
carefully aligning the requirements of the data users and the data owners.
Data quality officer: Data quality officer helps the data definition owner to set data
quality requirements for every data element based on the requirements of the data
users and translates them to the data owners and data stewards.
Data architect: The data architect is the one who designs the data models and data
lineage effectively and efficiently to ensure that the data policies and procedures are
implemented and adhered to. He/she also identifies redundant systems/processes that
can be eliminated.
Data security officer: Data security officer is the one who ensures that the data is
secure in terms of storage, exchanges, etc. and that cybercrime threats are adequately
taken care of.
Data privacy officer: The data privacy officer is the one who ensures that any data
exchange always respects privacy requirements, legal and otherwise, whether the
exchange is global-to-local or local-to-local.
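One simple way to make the roles listed above operational is to record, per data set, which role
has been assigned to whom and to check that the mandatory ones have been filled in. The sketch
below is purely illustrative; the data set, people and the choice of which roles are mandatory
are hypothetical assumptions, not the institution's actual assignments.

```python
# Illustrative sketch only: recording which governance roles are assigned for a data set.
# Role names follow the list above; people and data set names are hypothetical.
governance_roles = {
    "Payments transactions feed": {
        "data_owner": "Head of Payments Operations",
        "data_steward": None,                 # optional, at the discretion of the data owner
        "data_user": "Finance Reporting Team",
        "data_custodian": "DIL Team",
        "data_definition_owner": "Global Data Management",
        "data_quality_officer": "Payments DQ Officer",
    }
}


def missing_mandatory_roles(data_set: str) -> list[str]:
    """Return mandatory roles that have not yet been assigned for a data set."""
    mandatory = ["data_owner", "data_user", "data_custodian", "data_definition_owner"]
    roles = governance_roles.get(data_set, {})
    return [r for r in mandatory if not roles.get(r)]


print(missing_mandatory_roles("Payments transactions feed"))  # [] -> all mandatory roles set
```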
The types of technologies that are usually used in Data Governance are the following
(Sarsfield, S., 2015):
Preventative: This technology stops bad-quality data that should not exist in the lake
from entering the organization. In this way, any possible disruption is limited. Tools
of this type are type-ahead, workforce management and the data quality dashboard.
Diagnostic and health: Organizations use this technology when the damage
is already done. The data has existed in storage for many years and therefore there is
a need for data profiling as well as batch data quality.
Infrastructure: The tools that could be used here are metadata, ETL, Master Data
Management, enterprise-class data quality tools and data monitoring.
Enrichment: The tools used here are services and data sources.
4.4.1 Data Governance Challenges
According to the IBM Data Governance Council, the most important challenges data
governance faces nowadays are the following (The IBM Data Governance blueprint,
Leveraging best practices and proven technologies, May 2007):
In order for a company to be able to identify any Data Governance issues, the IBM Data
Governance Council suggests six key questions that the organisation should answer (The IBM
Data Governance blueprint, Leveraging best practices and proven technologies, May 2007):
1. Unaware: At this initial level of maturity, there is no data management initiative, and the
organisation does not understand the need to govern its data. The processes that take
place during this phase are unpredictable and poorly controlled, with no strict rules. Data
might exist in multiple files and formats, stored across multiple systems under multiple
names, and no attempt to catalog it has ever been made. Most companies nowadays are
beyond this level, since the need for Data Governance is widely known.
2. Reactive: At this level, the scope is very limited and there are no specialist data
governance or data quality tools. Instead, the organisation relies on a central person to
implement Data Governance. The success of the organisation at this level depends on
the technical analyst who is responsible for the “technical” aspects of the data.
3. Proactive: Moving on to the third level of the maturity model, the whole culture of the
organisation starts to change. Financial organisations in particular, which produce few
products other than data, really need to focus on their data and treat it as an asset. In
the proactive stage, people have already started worrying about data, acquiring special
tools for managing it and having a master data management (MDM) initiative. At this stage,
the organisation is also looking for a single point of truth and starts investigating what is
causing all the existing data issues. More importantly, people start understanding the
importance of staff training, and IT and business start working together, exchanging
information about data. The challenge of this level, though, is that data and business
processes remain separate, slowing innovation.
4. Managed: At the managed maturity level, all data is managed appropriately. The company
does not need to manage all its data to the same degree; it only has to manage it according
to how critical it is. In order to classify data according to its importance, risk controls and
monitoring need to be in place.
5. Optimized: This is the highest level any organisation can reach, and in order to achieve
it, all the previous steps need to occur one by one so as to help the business grow.
Here, the business is driven by standardized processes and manages to extract the most
value from its data.
It is up to each company to decide how far it wants to progress along the Data Governance
maturity model. Some organisations may not aim for the optimized maturity level, for
instance, but they can still use the model in order to move from the level they are currently at
to the level they want to reach.
In this paper, the focus is mainly on the proactive and managed levels of Data Governance
maturity. Proactive is the stage at which the company currently is, and the managed level is
the target. In other words, the Financial Institution has started realizing that there is a shift in
focus towards data and how to manage and protect it safely. It considers data as an asset and
tries to train its employees towards MDM. What the organisation wants to achieve is the
managed maturity level of Data Governance, where data is handled appropriately and
according to how critical it is, by knowing its CIA ratings.
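The following small sketch illustrates, under assumed and purely illustrative thresholds, how
CIA ratings could be used to decide how critically a data set must be managed, in the spirit of
the managed maturity level described above.

```python
# Sketch, under assumed thresholds, of classifying a data set by its CIA ratings and
# deriving how critically it must be managed (the thresholds are illustrative only).
RATINGS = {"low": 1, "medium": 2, "high": 3}


def criticality(confidentiality: str, integrity: str, availability: str) -> str:
    """Derive a management level from the highest of the three CIA ratings."""
    highest = max(RATINGS[confidentiality], RATINGS[integrity], RATINGS[availability])
    if highest == 3:
        return "managed with strict risk controls and continuous monitoring"
    if highest == 2:
        return "managed with periodic quality and access reviews"
    return "managed with baseline controls"


print(criticality("high", "medium", "low"))
```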
5 Research Context
5.1 The Institution’s current situation
During the diagnostic phase of the research, the current situation of the organization is being
examined.
The financial institution is currently releasing the Global Data Integration Layer (DIL), which
is the first version of the data lake, and it is mandatory for every new post-transaction feed
from the System of Records. One of the first uses of the data lake is the feed to Finance. DIL
might also need to combine some of the data it receives from multiple sources into a single
destination feed, creating Deep Data. In figure 4, the steps DIL has to follow are described in
detail.
As already mentioned, the feed to Finance is one of the main uses of the Data Lake. Figures
5 and 6 describe how the feed to Finance is done right now within the company, and how the
company plans to do it in the future using the Data Lake.
The Institution's plan for the implementation of the Data Lake is to build the template once
but run it many times, implementing it properly per country with standard installation
guides. It is intended to contain an Extract-Transform-Load (ETL) framework that will help
with the housekeeping and with integration with the Catalog and the lineage. The catalog should
describe all the data existing in the repositories inside the Data Lake, together with its
definition and origin (lineage).
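As a hypothetical illustration of the housekeeping and lineage integration mentioned above, the
sketch below records a lineage entry every time an ETL job moves a data set, so that the catalog
can later show the data's origin. The job, data set and system names are invented for the
example and do not describe the institution's actual framework.

```python
# Hypothetical sketch of the "housekeeping" idea: every time the ETL framework moves data,
# it records where the data came from, so the catalog can show definition and origin (lineage).
from datetime import datetime, timezone

lineage_log: list[dict] = []


def record_movement(dataset: str, source: str, target: str, job: str) -> None:
    """Append one lineage record for a data movement performed by an ETL job."""
    lineage_log.append({
        "dataset": dataset,
        "source": source,
        "target": target,
        "job": job,
        "moved_at": datetime.now(timezone.utc).isoformat(),
    })


record_movement("payments_feed", source="SoR-Payments", target="DIL staging", job="ingest_payments")
record_movement("payments_feed", source="DIL staging", target="Deep Data", job="combine_payments")

# The origin of the data set can then be reconstructed from the log:
print([rec["source"] for rec in lineage_log if rec["dataset"] == "payments_feed"])
```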
5.2 Why does the Financial Institution need the Data Lake?
The Financial Institution understood the need for implementing a Data Lake solution after it
realized the great challenge of managing the huge amount of data it has to deal with. The
organisation adopted a Service Oriented Architecture (SOA), which made it successful in
reusing services across the whole company. The reuse of data assets, however, became
difficult, as the organisation realized that data could not always be discovered easily and, in
the cases where it could, it had already been replicated many times. The result of this process
was data of low value and a time-consuming way of working.
The interaction with customers is mainly based on the availability of great amounts of
digitalized data, which is increasing rapidly and needs to be of high quality. The quality,
though, cannot be assessed if the business departments cannot access the lake easily. Many
consumers depend on the same sources of data and therefore, having a common Data Lake
will enable all the clients to use and combine the same data, while the distribution burden
becomes lower, as data only needs to be fed into the lake once.
What is more, the fact that all the data co-exist in the same area helps in dealing with them in
real time and makes them more valid, as the relevance of data depends on context and
timeliness. This also helps with the increasingly challenging task of securing data assets,
especially when they need to be combined. Data coming from different sources and with
different definitions, but still having to cover the needs of different lines of business, creates
the need to combine this data and transform it towards a common, comprehensible data model
with standardized definitions.
5.3 IBM Launch Approach and Tooling for Data Governance in the Data
Lake
The following paragraphs refer to the Action Planning phase of the Action Research method,
where the actions and tooling needed to solve the existing problems of the Institution are
introduced.
According to IBM Corporation, there are certain steps the company needs to take before
implementing the Data Lake Foundation described above. Those steps are to initially prepare
the infrastructure, systems, software, etc., set up subscription tables, staging areas and the code
hub, design the usage of the catalog, identify the smallest possible environment the Lake can
run on, and prepare a demo approach from the user perspective. The demo scenario includes
user stories for the next two phases, an explanation of IT Management and a data lineage from
the business side (Start the build of the Lake, May 13th, 2014, Smarter Analytics, IBM
Corporation).
After following the steps described above, phase 1 begins. As illustrated in figure 7 below, the
Data Lake operation steps are the following:
1. Advertise – After pre-filling the catalog with known business terms, give some extra
role-based catalog information
6. Access – Publish information from both operational status and information warehouse
repositories
Figure 7: Data Lake foundation launch (Phase 1) – IBM Corporation, 2014
The second phase of the Data Lake foundation is the implementation of Big Data on top.
This is done by loading data from Information Ingestion to the Deep Data repository and
showing the lineage with the catalog and the Hadoop environment that is being used (IBM
Corporation).
The goals that the Institution could achieve by using InfoSphere DataStage are described
below (a generic sketch of this extract-transform-load pattern follows the list):
Design the data flows that extract information from multiple sources, transform this
data using specific transformation rules, and load the data to target databases.
Connect the applications of the Institution directly to the target systems to ensure that
the data is relevant, complete, and accurate.
Reduce development time and improve the consistency of design and deployment by
using prebuilt functions.
Minimize the project delivery cycle by working with a common set of tools across
InfoSphere Information Server.
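The sketch below is a generic Python illustration (not InfoSphere DataStage code) of the
extract-transform-load pattern and the reuse of small, prebuilt-style transformation functions
that the goals above describe; all sources and rules are hypothetical.

```python
# Generic Python sketch (not InfoSphere DataStage) of the extract -> transform -> load
# pattern described above, using small reusable transformation functions.
def extract(sources: list[list[dict]]) -> list[dict]:
    """Combine records from multiple (hypothetical) sources into one stream."""
    return [record for source in sources for record in source]


def standardise_currency(record: dict) -> dict:
    """Reusable rule: make sure every record carries an upper-case ISO currency code."""
    record["currency"] = record.get("currency", "EUR").upper()
    return record


def load(records: list[dict], target: list[dict]) -> None:
    """Load the transformed records into the (hypothetical) target store."""
    target.extend(records)


source_a = [{"id": 1, "currency": "eur"}]
source_b = [{"id": 2}]
target_db: list[dict] = []

load([standardise_currency(r) for r in extract([source_a, source_b])], target_db)
print(target_db)
```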
The questions were mostly focused on the understanding that people working for the different
Systems of Records or Receiving parties have about Data Governance. Those people were
asked about the nature of the data they use and send to the Data Lake, its business value, its
CIA (Confidentiality, Integrity, Availability) ratings, etc. They were also asked about the
official roles people have in their department, such as an Information Owner or somebody
who is officially responsible for the data they handle, any other additional roles, and whether
they have certain policies and standards concerning the way they handle the data.
Last but not least, those people were also asked about their department: how big their
department is, how security awareness and education are disseminated, the procedures they
follow in order to evaluate their data, whether they use any particular program for backup and
recovery, etc.
The Data Lake Architects were also asked some extra questions, useful for gaining full
insight into the company's awareness. Those questions are the following:
Do you think that everybody in the DIL department is aware of their own roles and
responsibilities?
Do you have any specific policies and standards already in place?
What is the program for backup and recovery of your data?
What is the plan to keep different kinds of metadata up-to-date?
The company is already aware that the implementation and use of the Data Lake will not be
easy, as it lacks a lot of crucial factors that would help in the process. The main problems are
that there is no official information ownership and, as a result, no clear responsibility for the
data assets, no cost efficiency and no single source of truth. Furthermore, there is not yet a
common data language to be used across all the departments that transact with the Data Lake.
In general, since the organisation is at a very early stage of the implementation, there is
confusion about all the new concepts that people need to get used to. The most important task
is to properly implement Data Governance in the Data Lake.
No access to the Data Lake should be given to any unauthorised party, and there should be a
Data Lake Security Mechanism that checks access rights. Even when a party is authorized to
access the Data Lake, it still has to deliver the data in the agreed format in the Ingestion space.
All of this needs to be defined by the Data Lake itself. What is more, an Esperanto language
should be defined across the departments, as well as across the other Data Lakes in the
company, in order to exchange information. Last but not least, a set of processes, roles and
responsibilities needs to be implemented in order to manage data and data exchange between
all parties and divisions that use systems, data lakes and reports across the organisation.
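A minimal sketch of the access-rights check described above might look as follows; the parties
and permitted actions are hypothetical and only illustrate the idea that no unauthorised party
can ingest into or consume from the Data Lake.

```python
# Minimal sketch of the access-rights check described above: no party reads from or writes
# to the Data Lake unless it is on an agreed authorisation list (names are hypothetical).
AUTHORISED = {
    "SoR-Payments": {"ingest"},
    "Finance Reporting": {"consume"},
}


def check_access(party: str, action: str) -> None:
    """Raise an error if the party is not authorised for the requested action."""
    if action not in AUTHORISED.get(party, set()):
        raise PermissionError(f"{party} is not authorised to {action} via the Data Lake")


check_access("SoR-Payments", "ingest")        # allowed
# check_access("Marketing", "consume")        # would raise PermissionError
```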
6 Data Governance Implementation
A second challenge that the company faces regarding Data Governance in the Data Lake is
the adoption of real-time processing. Such financial institutions mostly make use of
batch-oriented solutions. Payments used to be processed over a couple of days, while banking
used to be done during office hours. In modern everyday life, though, people make online
payments at any time, from anywhere in the world. As a consequence, the banking operation
processes needed to change accordingly.
Another great challenge the organisation faces is the fact that the data is not easily accessible
to business users. In addition, in order for the institution to develop advanced analytics
algorithms, it is necessary to give broad access to raw data, not only to business units, but to
data scientists as well.
In order to overcome those challenges and make the most of this Data Lake solution, it is
essential that proper Data Governance exists from the very beginning. The Data Lake is a
completely new concept for the company, and since it is at an early stage, proper Data
Governance needs to be implemented in order for it to start working properly and to avoid
confusion, duplication of data, etc. As mentioned earlier, the maturity levels of Data governance
are: 1. Unaware, 2. Reactive, 3. Proactive, 4. Managed and 5. Optimized, and the company
needs to move from the third to the fourth level.
More specifically, in DIL there are specific rules and processes that need to be in place at a
strategic, tactical and operational level in order for the Data Integration Layer to work
properly.
Data Management and Warehouse: At this stage, the company usually has one
source of structured data, and it is developed to accommodate only a particular type of
question, which is already known upfront. It is a classic decision-support method which
Extracts, Transforms and Loads (ETL) data into an alternative database environment,
and the data is kept at the level of the event occurring.
Data Stage/Hadoop System: At this tactical level, we already have a Data Lake: a
classic database approach in which users can access data from multiple sources and
combine raw data in order to make deep data, in both batch-based and real-time fashion.
Real-Time Predictive Data: This last stage is the ideal stage where the organisation
would like to be. Especially nowadays, when the success of every business depends on
how quickly it can react to conditions and trends, the ability of an organisation to
analyse data in real time is crucial (a small sketch contrasting batch and real-time
handling follows below).
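The following sketch contrasts, with hypothetical payment events, the batch-oriented style the
institution comes from with the real-time style it is moving towards; it is an illustration of the
two processing modes rather than a description of the bank's systems.

```python
# Sketch contrasting batch-oriented and real-time handling of (hypothetical) payment events.
def process(payment: dict) -> None:
    print(f"processed payment {payment['id']}")


def batch_run(payments: list[dict]) -> None:
    """Batch style: collect everything and process once, e.g. at end of day."""
    for payment in payments:
        process(payment)


def on_payment(payment: dict) -> None:
    """Real-time style: process each payment the moment it arrives."""
    process(payment)


batch_run([{"id": 1}, {"id": 2}])  # overnight run
on_payment({"id": 3})              # immediate, as required for online payments
```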
The Input model consists of an Input Protocol that describes the entry criteria according to
which DIL will accept data into the Data Lake, as well as the procedures the members of the
DIL team should follow. The Output model, on the other side, consists of an Operational
Level Agreement between the parties that come into contact with DIL, in order to formalize the
interface between the System of Records and the Receiving Application of the Solution.
The receiving parties may need only a subset of the data and can therefore make any
modifications needed (new business logic) on their side. The only change that should be
possible in DIL is the combination of Systems of Records into the Data Stage.
If an application complies with the entry criteria, the DIL department agrees to accept the
data into the Data Lake. The model below describes all the detailed steps that should be
followed, from the moment there is a request for data to enter until the data is finally stored
in the Data Lake. All those procedures are also aligned with the Scrum agile methodology
used within the Financial Institution.
Analyze the solution and data on demand
In order for data to enter the Data Lake, there must first be an initiative, either from the
System of Records or from the Receiving application (demand-driven). The moment there is
a request, the requested data should be analyzed.
As mentioned above, the data needs to be of a certain quality in terms of accuracy, timeliness,
completeness and credibility.
In addition, the data should be delivered to the DIL department in its raw format. Having the
raw data available provides the Financial Institution CB with the capability of having a single
point of truth and developing advanced analytics algorithms.
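To illustrate how the quality dimensions above could be checked before data is accepted, the
sketch below applies a few simple, assumed rules for accuracy (asserted by the supplying System
of Records), timeliness, completeness and credibility. The rules and field names are
illustrative assumptions only, not the DIL entry criteria themselves.

```python
# Sketch of checking the quality dimensions named above (accuracy is assumed to be asserted
# by the supplying System of Records; the other checks are simple illustrative rules).
from datetime import date


def meets_entry_criteria(feed: dict) -> list[str]:
    """Return a list of problems; an empty list means the feed may enter the Data Lake."""
    problems = []
    if not feed.get("accuracy_confirmed_by_sor"):
        problems.append("accuracy not confirmed by the System of Records")
    if feed.get("business_date") != date.today().isoformat():
        problems.append("feed is not timely")
    if any(rec.get("amount") is None for rec in feed.get("records", [])):
        problems.append("feed is incomplete: missing amounts")
    if not feed.get("data_owner"):
        problems.append("no accountable data owner identified (credibility)")
    return problems


feed = {"accuracy_confirmed_by_sor": True, "business_date": date.today().isoformat(),
        "records": [{"amount": 10.0}], "data_owner": "Head of Payments Operations"}
print(meets_entry_criteria(feed))  # [] -> accepted
```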
Identify Stakeholders
The next step of the Intake solution is to identify the stakeholders who have a role in managing
the data exchange. There must be clear roles and responsibilities defined on each side of the
information exchange before any further actions happen.
More precisely, a Data Owner should be assigned to be responsible for the data entering the
lake. In case the request is demand-driven, a data user from the receiving application should
be assigned and made responsible for the data and its transformation.
The role of Data Steward is optional, and it is at the discretion of the Data Owner whether to
assign this role or not. Data Custodians are obliged to translate the requirements of the data
users into data requirements for the Data Owners, as well as to arrange the delivery of data
and the data quality agreements with both the Data Owners and the Data Users.
The actions that have to be executed in the Refine Solution are the following:
Determine the availability of the data:
At this stage, both the Information Owner and the Data User have agreed on the conditions
under which the data should be used, and the DIL department starts the procedures required in
order to obtain the data. The availability and accessibility of the data determine those actions;
they usually depend on the place the data is stored, the people who are involved, the privacy
policies, etc.
After the refinement meetings, the DIL team should hold planning sessions in order for the
Product Owner, together with the whole Scrum team (DIL), to agree on the Sprint goals and
the priorities of the Product Backlog. The most important or urgent epics should always be at
the top of the Backlog board, refined for the next Sprint.
The steps that should be followed during the planning session are the ones below:
The RACI chart below indicates the roles of each member that takes part in the Initiation
Solution:
Table 2: RACI chart - Roles in the Initiation Solution
R = Responsible
A = Accountable
C = Consulted
I = Informed
6.2.2 Output model
Besides the Service Procedures and Agreements, the Data Integration Layer (DIL) department
should set a framework of Operational Responsibilities that should be properly assigned and
responsibly performed during the whole Solution process.
The Output model consists of an Operational Level Agreement between the parties that come
into contact with DIL. The aim of an OLA is to formalize the interface between the System of
Records and the Receiving Application of the Solution. The Data Integration Layer is only
responsible for receiving and delivering data, and holds no responsibility for the data itself.
Therefore, the Data Lake cannot deliver any data before such an agreement is signed.
An OLA describes the interface within which the solution takes place, the conditions of its
operation, and the support that is provided during the whole process. Moreover, it outlines the
responsibilities of the parties involved and aims to deliver a set of standard deliverables
between them.
In order for an OLA to be complete, clear roles and responsibilities of all the parties involved
should be assigned. Those roles and responsibilities are described below:
In case a file is delivered with the wrong data because of an issue in a system, the SoR is
informed and should resend the file with the correct format and data.
Every EOD (End Of Day) delivery will be delivered separately; it will not be clustered into one
delivery in case of issues.
Changes in the data will be limited as much as possible, as the data needs to be placed in the
Data Lake in its raw format.
Assisting in guaranteeing the continuity of the automated processes required to obtain the data
from the System of Records.
Reporting any possible changes in the usage of the data they are receiving.
Investigating and planning to implement notifications to the System of Records in case they do
not receive the requested data within the timelines agreed.
DIL Responsibilities:
The main role of the Data Lake is to receive data from the System of Records, deliver it to the
Receiving System and implement the transformations to/from the common Esperanto language.
At an operational level, DIL should follow the steps below (a hypothetical sketch of the resend
rule is given after the list):
In case a file is delivered with the wrong data because of an issue in a system, the SoR is
informed and should resend the file with the correct format and data.
Every EOD delivery will be delivered separately; it will not be clustered into one delivery in
case of issues.
Changes in the DIL data will be limited as much as possible and will be stored in a different
place within the lake, where all the deep data is stored.
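As a purely hypothetical illustration of the first operational rule above, the sketch below
rejects a delivery that does not match an agreed format and notifies the System of Records so
that it can resend the file; the format label and file name are invented for the example.

```python
# Hypothetical sketch of the operational rule above: if a delivered file turns out to contain
# wrong data, the System of Records is informed and asked to resend the file.
def validate_file(file: dict) -> bool:
    """Illustrative check: the file must declare the agreed format version."""
    return file.get("format") == "agreed-v1"


def receive_delivery(file: dict, notify_sor) -> bool:
    """Accept a delivery, or notify the System of Records so it can resend the file."""
    if validate_file(file):
        return True
    notify_sor(f"File {file.get('name', '?')} rejected: wrong format or data, please resend")
    return False


receive_delivery({"name": "eod_payments.csv", "format": "agreed-v0"}, notify_sor=print)
```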
The RACI chart below indicates who is accountable or responsible for every action:
R = Responsible
A = Accountable
C = Consulted
I = Informed
The use of the conceptual framework has helped the DIL team follow certain steps, which were
not clear before, in order to properly implement Data Governance in the Data Lake. This is
really important, especially in this very early phase, because the earlier they start
implementing Data Governance properly, the easier it will get later on. In addition, the people
in the team, especially the DevOps team, are no longer confused and understand the reason
behind every action they take. Having the whole picture in mind, and knowing who is
responsible for each part of the procedure, improves effectiveness and collaboration within the
team and between teams.
Despite the fact that the results look promising enough, and that the framework is already
being used by the Institution, research on Data Governance should never stop. Data is
becoming more and more important for organizations and, since we live in a world that is
continuously changing, different frameworks, more suitable for future circumstances, might be
needed later on.
References:
Agile/Lean Data Governance Best Practices.
http://agiledata.org/essays/dataGovernance.html#Traditional
Alur, N., Takahashi, C., Toratani, S., & Vasconcelos, D. (n.d.). IBM InfoSphere DataStage Data Flow
and Job Design (p. 658). ISBN-13: 9780738431116, ISBN-10: 0738431117
Anderson, J., Aydin, C., & Jay, C. (1994). Qualitative Research Methods for Evaluating Computer
Information Systems. In Evaluating Health Care Information Systems: Methods and Applications (pp.
45-68). Sage, Thousand Oaks, CA.
Asquer, A. (2013, January 1). The Governance of Big Data: Perspectives and Issues. ICCP
2013 Conference: First International Conference on Public Policy. Financial and Management Studies,
SOAS, University of London
Avison, D., Lau, F., Myers, M., & Nielsen, P. (1999). Action research. Communications of the ACM,
42(1), 94-97
Baskerville, R. (1993). Information systems security design methods: Implications for information
systems development. ACM Computing Surveys (CSUR), 25(4), 375-414.
Big data and positive social change in the developing world: A white paper for practitioners and
researchers. (2014). Bellagio Big Data Workshop Participants, Oxford: Oxford Internet Institute.
Boyd, D., & Crawford, K. (2012). Critical Questions for Big Data: Provocations for a Cultural,
Technological and Scholarly Phenomenon. Information, Communication & Society, 662-679.
Cate, F. (2010). Data Tagging for New Information Governance Models. IEEE Computer and
Reliability Societies.
Chessell, M., Nguyen, N., Van Kessel, R., & Van Der Starre, R. (2014). Governing and Managing Big
Data for Analytics and Decision Makers. IBM Redbooks.
Crisan, C., Zbuchea, A., & Moraru, S. (2014). Big Data: The Beauty or the Beast. Management,
Finance, and Ethics, Strategica 2014 Conference. doi:10.13140/2.1.2709.7282
Fisher, T. (2009). The data asset: How smart companies govern their data for business success (Vol.
24). John Wiley & Sons.
Fu, X., Wojak, A., Neagu, D., Ridley, M., & Travis, K. (2011). Data Governance in predictive
toxicology: A review. Journal of Cheminformatics.
James, M. (2011, April 15). The Backlog Refinement Meeting (or Backlog Grooming).
Khatri, V., & Brown, C. (2010). Designing Data Governance. Communications of the Acm, 53(1).
Leedy, P., & Ormrod, J. (2005). Practical research. Upper Saddle River, NJ: Prentice Hall.
Legal aspects of managing Big Data. (2015). Computer Law & Security Review, 31(1), 1-169.
Levin, D. (n.d.). The opening of vision: Nihilism and the postmodern situation.
Loshin, D. (2013). Data Governance for Master Data Management and Beyond. SAS Institute Inc.
World Headquarters White Paper.
Martin, P., & Turner, B. (1986). Grounded Theory and Organizational Research. The Journal of
Applied Behavioral Science, 22(2), 141-157.
May, T. (1997). Social research: Issues, methods and process (2nd ed.). Trowbridge: Redwood Books.
Myers, M. (2013, September 3). Qualitative Research in Information Systems. Association for
Information Systems (AISWorld) Section on Qualitative Research in Information Systems, 241-242.
O’Leary, D. (2014). Embedding AI and Crowdsourcing in the Big Data Lake. AI Innovation in
Industry, University of Southern California.
Rapoport, R. (1970). Three Dilemmas in Action Research. In Human Relations (Vol. 23:6, pp. 499-
513).
Roland, R. (1985). Research Methods in Information Systems (pp. 193-201). Amsterdam: North-Holland.
Russom, P. (2006). Taking data quality to the enterprise through data governance. The Data
Warehousing Institute.
Sarsfield, S. (2015, February 2). Data Governance Imperative. Cambs, GBR: IT Governance.
Retrieved from ProQuest ebrary.
Schiffman, L., & Kanuk, L. (1997). Consumer Behaviour. London: Prentice Hall.
Seiner, R. (2012, December 1). Applying a Maturity Model to Data Governance. The Data
Administration Newsletter.
Simonsen, J. (2009). IRIS 32, Inclusive Design (pp. 1-11). Molde University College.
Siponen, M. (2002). Designing secure information systems and software: Critical evaluation of the
existing approaches and a new paradigm.
Start the build of the Lake. (2014, May 13). Smarter Analytics, IBM Corporation.
Tallon, P. (2013). Corporate Governance of Big Data: Perspectives on Value, Risk, and Cost. Loyola
University Maryland, 13, 0018-9162.
Taylor, P., Nevitt, C., & Carnie, K. (2012, December 11). The rise of Big Data. Financial Times.
The IBM Data Governance blueprint, Leveraging best practices and proven technologies. (2007, May
1).
Trope, R., Power, E., Polley, V., & Morley, B. (2007). A Coherent Strategy for Data Security through
Data Governance. IEEE Security & Privacy.
Weber, K., Otto, B., & Osterle, H. (2009). One Size Does Not Fit All – A Contingency Approach to
Data Governance. ACM Journal of Data and Information Quality, 1(1).
Wood, C. (1990). Principles of secure information systems design. Computers & Security, 9(1), 13-24.
Zhu, J. (2011). Metadata Management with IBM InfoSphere Information Server. IBM Redbooks.