
INTRODUCTION TO DATA ANALYTICS FOR BUSINESS


UNIVERSITY OF COLORADO BOULDER

Thinking about analytical problems

 Specifically, we think backwards, meaning we start with the decision we want to make, then we consider the information needed to make that decision. It's an idea closely related to the hypothesis-driven approach used in scientific research, where we start with something specific we think is true, then design an experiment to test the hypothesis that it really is true.
 To do this I need context; I need to understand enough about the business and about how decisions are made to know what is really important.
 The first question to ask is: will the analysis influence the decision?
 The second question is: what would the analysis output look like?
 What methods and tools are needed?
 Simpler is often better
 Where will the data come from?
 In some cases, you may need to reach outside your business for the data
 But in terms of how you attack the analytic problem in front of you, remember to think backwards: start with the decision you want to make, determine what analysis outputs would help make that decision, design the analysis that creates those outputs, and determine what data is needed for the analysis and how to get it. Then, and only then, are you ready to begin.

Conceptual business model

 A conceptual business model is a diagram that illustrates how an industry or business functions. It shows important elements in the business and maps out how those elements relate to each other.
 Businesses change constantly

The information-action value chain

 It's critical that you have a good working understanding of where the data you use comes from and what real-world phenomena that data describes. It's also important that you understand how the results of your analyses will be used to make decisions and, ultimately, how they will lead to some specific action that is taken in the marketplace.
 The better you are at understanding the value of each step, the more effective you will
be as an analyst. The way we illustrate this idea is through a framework we call the
information-action value chain. For those of you not familiar with the term value
chain, it's an idea that describes a sequential process where each step adds some sort
of value to an object or an idea relative to a desired end point or outcome
 Relational database
 There is a wide selection of methods, techniques, and tools available to you to perform data analysis, but they broadly fall into three categories: descriptive analytics, predictive analytics, and prescriptive analytics.
 Descriptive Analytics, as its name suggests, helps us describe what things look like now or what happened in the past. The idea, of course, is to use that information to better understand the business environment and how it works, and to apply that knowledge, along with business acumen, to make better decisions going forward. Descriptive Analytics can take the form of simple aggregations or cross tabulations of data; simple statistical measures like means, medians, and standard deviations; more sophisticated statistics like distributions, confidence intervals, and hypothesis tests; or advanced association or clustering algorithms.
 Predictive Analytics helps us take what we know about what happened in the past and use that information to predict what will happen in the future. This almost always involves the application of advanced statistical methods or other numerical techniques, such as linear or logistic regression, tree-based algorithms, neural networks, and simulation techniques such as Monte Carlo simulation.
 The last class of analytics is what we call Prescriptive Analytics. This type of analysis explicitly links analysis to decision making by providing recommendations on what we should do or what choice we should make to achieve a certain outcome. It usually involves the integration of numerical optimization techniques with business rules and even financial models.
 Both predictive and prescriptive analytics often use descriptive analytics techniques in the exploratory phase or to provide inputs to those models.
 Prescriptive analytic techniques might help us understand which customers we should target to maximize the return on our investment. In summary, there are a variety of techniques available to you for accomplishing your analysis, depending on the nature of the business need and whether you're trying to understand what happened, what will happen, or what you should do in the future (a small illustrative sketch appears at the end of this section).
 No matter how good your analysis is and no matter how promising your plan is, there's a good chance you'll get nowhere fast if you can't effectively communicate your results and sell your proposal.
 Pay attention to the quality of your materials, whether they're slides, documents, or other artifacts. Right or wrong, people tend to equate poor-quality materials with poor-quality analysis. The best analysis and presentation can be derailed by something as simple as a typo or a bad number on a slide.
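To make the distinction between descriptive and predictive analytics a bit more concrete, here is a minimal Python sketch using pandas and scikit-learn. The customer table, column names, and numbers are made up for illustration; they are not from the course.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Hypothetical customer data: age, number of past purchases, and whether
    # the customer responded to a marketing offer (1 = yes, 0 = no).
    customers = pd.DataFrame({
        "age":       [23, 35, 47, 52, 31, 44, 29, 60],
        "purchases": [1, 4, 7, 9, 2, 6, 3, 11],
        "responded": [0, 0, 1, 1, 0, 1, 0, 1],
    })

    # Descriptive analytics: simple aggregations and a cross tabulation that
    # describe what the data looks like now.
    print(customers[["age", "purchases"]].describe())
    print(pd.crosstab(customers["responded"], customers["purchases"] > 5))

    # Predictive analytics: a logistic regression that uses past behavior to
    # estimate the probability that a new customer will respond.
    model = LogisticRegression()
    model.fit(customers[["age", "purchases"]], customers["responded"])
    new_customer = pd.DataFrame({"age": [40], "purchases": [5]})
    print(model.predict_proba(new_customer)[:, 1])

A prescriptive step would typically sit on top of these predicted probabilities, for example an optimization that chooses which customers to target given a fixed marketing budget.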

Real world events and characteristics

 First, we'll talk about people. Who they are and what they do. Then we'll talk about
things, objects, and the environment.
 People have characteristics that describe them, like age, gender, nationality, ethnicity, race, marital and familial status, educational level, socioeconomic status, and housing status; the list goes on and on. People also have preferences, beliefs, attitudes, and motivations that help define who they are.
 We often group these characteristics into a few broad categories that you might encounter in a business context, namely demographics, psychographics, and technographics. Demographics broadly describe population-level characteristics like age, gender, nationality, etc., and are the most widely used characteristics in a lot of different types of analysis. Psychographics speak more to people's opinions, attitudes, and interests. They include preferences, likes, and dislikes, and tend to reveal insights about why people do the things they do.
 Technographics are really a subset of psychographics which focus on how people
approach technology and what their motivations and attitudes are about using new
and existing technologies.
 Certainly there are other categories of attributes. In some areas, like healthcare, the notion of personal attributes can go a lot deeper, to include a whole host of physical attributes that might be important to a business organization. Some of these characteristics also imply events: age implies birthdays, marital status might imply a wedding or anniversary, education implies graduations. All of these related events could be of interest to the business and to the analyst. In addition to characteristics, people also have identifiers. They have names, addresses, telephone numbers, email addresses, Facebook and Twitter handles, and all sorts of unique attributes that might be used to identify them in the real world.
 We also have to consider physical location (where people live, how they travel and move, etc.) and virtual location (online presence, web browsing).
 In many industries, some of the most important and frequently used information is around transactions or events that involve an exchange between people or businesses. Far and away, the most common transaction of interest in business analytics is a purchase, the event where someone buys a product or service that our company is selling. But there is a wide array of other types of transactions, like investments, transfers, execution of contracts, accounting entries, or more detailed elements of purchase transactions, like placed orders or payment processing events.
 Both natural and non-natural events can have a significant impact on a business. Think about the impact of weather on air travel, or the impact of a major event like the Super Bowl or World Cup on television viewership or internet usage. Major world events can obviously have an enormous impact on people as well as businesses. The most important takeaway from all of this, of course, is the recognition that just about everything you will be looking at as a data analyst starts as something in the real world.

Data capture by source systems

 The truth is, there are thousands, if not more, of different types of systems that capture data.
 We can group many of those systems into five broad categories that you're most likely to encounter in your company. Specifically, we'll talk about core enterprise systems, customer and people systems, product and presence systems, technical operations systems, and external source systems.
 Let's start with core enterprise systems. These are usually large-scale systems that tie in directly to the financial operations of a company. They include things like billing and invoicing systems that help manage purchase transactions and the collection of payments. They also include enterprise resource planning, or ERP, systems, which are really broad systems that help manage business processes on the back end of the business. As the name suggests, these systems usually focus on the resources of the company, whether they're financial assets, materials, or production capacity. Supply chain management systems focus specifically on the flow and storage of goods and services through a system. They help track raw materials and products from their points of origin to their points of consumption and provide insights into throughput and inventory levels.
 Let's move on to customer and people systems. These systems can also be critical to the business but focus more on people and organizations both inside and outside the company. The most expansive of these systems are customer relationship management, or CRM, systems.
 Next in our list of source categories are product and presence systems. We're really grouping these together for convenience. They all tend to be support systems that help different parts of a business but are only loosely related to each other.
 Technical operations systems are usually very tactical, helping to monitor processes or other systems to make sure they are functioning properly and to identify issues when they occur.

2nd week: Analytical Technologies

Data storage and databases

 In this module, we'll focus on the applications and tools that are used to store, extract
and analyze data.
 In this video, we'll pick up where we left off and talk about the various ways in which data can be stored. Given that we're potentially capturing massive amounts of data in our source systems, it's natural to ask: where the heck does all this stuff go? Well, it turns out that each source usually has its own storage system to hold data relevant to that system. Unfortunately, that isn't necessarily ideal for us as analysts, for a few reasons.
 First, it's likely that the source's storage system is optimized for functional performance, not for data extraction and analysis. As an example, you may have seen the terms online transactional processing (OLTP) and online analytical processing (OLAP). These terms refer to storage systems that are optimized for business operations and transactions versus those that are optimized for analytics.
 While it's possible to perform analytics on transactional systems, it's often much easier
to do it on analytical systems.
 The second challenge with source storage systems is that they often contain a lot more information than we really need for analytics. It's not uncommon for a source database to contain all sorts of internal working data that really doesn't have a use outside the system's operation. We prefer not to have to carry all that extra data into our analytical environment.
 Risky to access directly: we also need to remember that the source system can be critical to the day-to-day operation of the business. We may not want to risk slowing down or even crashing those systems by allowing direct access to system data by analytical applications.
 Retention times vary; data may not be stored locally for long. Finally, because source
systems often deal with very high volumes of data, they may not store data for very
long in order to optimize the overall performance of that system. This means that if we
want the data, or some subset of the data, to be available for a longer period of time,
we need to grab it and put it in a longer-term storage location.
 As we mentioned earlier, we do sometimes connect directly to source systems, especially when we need access to real-time data. We may even intercept data as it streams through a connection point.
 However, a more common solution is to gather data into a separate storage location.
This may be a central data repository, where data is physically colocated. It could also
be a virtual repository, where the data is physically located in different places but
appears to the user as though it's in a common location. And it could be a combination
of these two things or a semi-centralized repository.
 There are many ways to store data, but here we'll cover the two broad mechanisms
that you're most likely to see in the world of analytics, data files and databases.
 It's useful to first talk about file systems. A file system is basically just the digital equivalent of an organized file cabinet: I take pieces of information, put them into a folder, and perhaps put that folder into a larger folder. Think about your own computer; this is how you probably store most things on your PC or Mac. The nice thing about a file system is that I can put pretty much anything I want there and just note its name and location so I can find it later. File systems are attractive in that they can handle all sorts of information, including what we call unstructured data. We can store documents, spreadsheets, pictures, music, video, you name it. The file system doesn't really care what's in the file.
 One important example of a file system is the Hadoop Distributed File System, or HDFS, which is a big data manifestation of the file system concept. HDFS uses massively parallel processing on relatively inexpensive infrastructure to efficiently store very large amounts of information without much regard to the data type. We'll talk a bit more about big data technologies in a later video. So, what about the data files themselves? There are many types of data files (think of all the different extensions that files on your computer have), but there are a few file types that come up most often in the world of data analytics.
 The first is what's called a delimited text file. Normally, a delimited text file contains data that represents a two-dimensional table with columns and rows. The data itself is stored as text, with breaks between the columns and rows identified using specific characters or formatting codes called delimiters. The most common delimiters are commas, tabs, and pipes (the pipe is the vertical line character you see on your keyboard). A short sketch of reading a delimited file appears at the end of this section.
 You'll often see comma-delimited files with the extension CSV, which stands for comma-separated values. Tab- and pipe-delimited files usually just have the TXT text file extension.
 The nice thing about text files is that they are understood by a wide variety of systems and analytical tools, so it's pretty straightforward to move data from one environment to another using this file format.
 A second file type is an Extensible Markup Language, or XML, file. XML is a flexible structure for encoding documents and data that was developed in the late 90s, primarily to facilitate data sharing over the Internet. However, it has a wide range of applications, from web pages to applications to messaging systems. The nice thing about XML is that it is a common standard and it allows for a more complex structuring of data than something like a delimited text file. The downside is that it requires a more sophisticated interface to interpret the data and structure it for analysis.
 A third type of file is a log file. Log files are generally used to capture event data from a system and are common in machine data, messaging, and web analytics applications. Log files may or may not follow a standard structure, and they generally require something called a parser to read and interpret the file. The advantage of log files is that they are very flexible; they can capture just about any data structure you want. However, this comes at the expense of a much more complicated process for reading and using the data. In fact, there are specific software tools that specialize in parsing log files (a tiny example of a regular-expression parser appears at the end of this section).
 The last type of data file we'll discuss is really a class of files that are specific to common data analysis tools. Most tools have their own proprietary file formats for storing data, along with other information called metadata, which describes calculations, operations, or other attributes of the data itself. Far and away, the most common of these is the Microsoft Excel spreadsheet file. Even though there are a lot of sophisticated analysis tools out there, the reality is that a disproportionate amount of actual analytical work is done in Excel. And that's not necessarily a bad thing: just about everyone knows how to do basic operations in Excel, and there are certain things that are quite frankly just easier to do in Excel, making it a very flexible tool for both manipulating and sharing data and analytical results. However, Excel is just one of many specialized file formats. Tools like SAS, SPSS, Tableau, and a whole host of other applications each have their own specific formats for storing data in standalone files. Broadly, this means you may need to use that specific application to open and use those files. But increasingly, applications are opening up a bit and building in the ability to ingest and interpret other file types. It really depends on the tool.
 Now let's move on to databases. A database is simply an organized collection of data. When we say database, we're usually referring to both the structure and design of a data environment, as well as the data itself. A database seeks to store data in a more complex way than what could be achieved in a data file. Specifically, a database usually stores a number of different data entities with some unifying information about how those entities are arranged or related. This enables access to a wider array of information in one common environment, versus storing that information in multiple data files that may or may not be tied together. Usually, a database is constructed using a Database Management System, or DBMS. A Database Management System is a software application used for creating, maintaining, and accessing databases.
 There are a variety of different types of databases, but far and away the most common is the Relational Database.
 The basic concept behind relational databases is that we store information in two-dimensional tables and then define specific relationships among those tables. It turns out that this can be a really efficient and effective way of storing data that is pretty easy to understand, which contributes to its popularity. The idea was developed by a computer scientist named E.F. Codd at IBM in 1969 and 1970. Given that it's been nearly 50 years since then, it's pretty impressive that the relational database remains the dominant paradigm in data storage.
 That having been said, the relational database is not the only type of database. In fact,
there are a number of emerging types of databases that are being used to handle
special types of data, store unstructured data, or improve performance in the era of
big data.
 Let's talk about four common alternative databases: Graph Databases, Document Stores, Columnar Databases, and Key-Value Stores.
 A graph database is based on graph theory, or the study of pairwise relationships between objects. These databases tend to work well with highly interconnected data, like relationships between people or locations, and have applications in physical and social network analysis. A document store, as its name suggests, is generally designed to store documents, along with key pieces of metadata describing those documents. It's useful for storing unstructured data or different data types in a way that's a little more useful than a typical file system. Columnar databases are storage mechanisms that seek to improve the performance of data access by focusing on columns of data tables, versus the row-based approach of relational database systems. It turns out that when we write information into a database, we usually do it row by row, which makes sense when each row of data represents something like a purchase transaction or a new customer. However, when we extract data from databases, we're usually more interested in summarizing some attribute across all rows. Columnar databases tend to be much more efficient at this type of data extraction operation.
 Key-value stores are very simple but powerful ways of storing data. They store information in very small pairs: typically, a key and a value. This method of storing data is very flexible, as it doesn't require the extensive design and structure of other database types. Without getting into too much detail, it turns out that storing data in this way is very efficient, uses less memory, and can be used to achieve very high levels of speed in certain types of operations. However, it also requires more sophisticated programming to manipulate and extract data.
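To make the file formats above a little more concrete, here are two small Python sketches. The file name, columns, and log format are hypothetical examples, not files provided with the course.

Reading a pipe-delimited text file:

    import csv

    # flights.txt is assumed to look like:
    # FLIGHT_NUMBER|DEPARTURE_AIRPORT|ARRIVAL_AIRPORT|PASSENGERS
    # 1402|DEN|ORD|156
    with open("flights.txt", newline="") as f:
        reader = csv.reader(f, delimiter="|")   # use "," for CSV or "\t" for tab-delimited
        header = next(reader)                   # the first row holds the column names
        for row in reader:
            print(dict(zip(header, row)))       # e.g. {'FLIGHT_NUMBER': '1402', ...}

Parsing one line of a log file with a tiny regular-expression parser:

    import re

    # Assumed log format: timestamp, severity level, free-text message.
    line = "2017-03-02 14:05:21 ERROR payment service timed out"
    pattern = r"(?P<ts>\S+ \S+) (?P<level>\w+) (?P<message>.*)"
    match = re.match(pattern, line)
    if match:
        print(match.group("ts"), match.group("level"), match.group("message"))

A key-value store works on the same principle as a Python dictionary: each small record is written and read back by its key, for example scores = {"customer_123": 0.87}.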

Big Data & Cloud Computing

 Generally speaking, big data refers to the idea that certain emergent data sources like
machine data and web-based data generate a huge amount of information that can't
be easily processed by traditional tools. Consequently, big data also refers to the broad
class of tools that are also used to capture, store, process, and analyze these high-
volume data sources.
 Cloud computing, on the other hand, really just speaks to where these operations happen. Traditionally, most computing operations were done with machines and software owned and operated by the organizations that needed them, located either on-premises or in their own data centers. With cloud computing, organizations can effectively rent hardware capacity, software, and services from a third party that runs everything in the third party's data centers. Big data and cloud computing are certainly complementary technologies, and there can be substantial advantages to executing big data operations in the cloud. However, it's important to recognize that we can have big data without the cloud, and we can have the cloud even if we don't have big data.
 Let's start with big data. Again, big data encompasses a variety of technologies that seek to handle large data sources. In fact, there are a dizzying number of tools out there that in some way play in the big data space, and that number is increasing all the time. Take a look at this big data landscape developed by Matt Turck and Jim Hao from FirstMark Capital in 2016. This is a pretty intimidating list of tools. You may recognize some of the names, but most of these you've probably never heard of, and these are only the most popular ones. We couldn't possibly cover all these areas in the course, but what we will do is simplify the world of big data a bit to get a sense for the major functions that are relevant to data analytics. However, you may want to pause and get a copy of this landscape to reference as we proceed through the rest of the discussion. We'll primarily be interested in the infrastructure and analytics parts of this landscape.
 When data sources produce a very large amount of data, particularly at a very rapid pace, specialized tools are needed to quickly move, interpret, or transform the data. Usually the objective is to get the data into some form of unstructured or structured storage as quickly as possible to make it available for downstream applications like analytics. However, there are also tools that seek to combine this initial processing with real-time analytics, especially when the data is what we call streaming data. A simple way to think about streaming data, not surprisingly, is to actually think of it like a stream or river. Imagine that I am sitting on the bank of a river and there are fish swimming by, one by one. If I wanted to know how many fish swim down the river or what types of fish they are, I could do this in a couple of ways. I could catch all the fish, put them in separate buckets by type, then count them up at convenient intervals. That's how the traditional data warehouse and business intelligence approach works. However, if I am fast enough, I can just count and categorize the fish as they swim by and never put them into buckets at all. Streaming data processing and analytics work a little more like that (see the toy sketch at the end of this section). But streaming data is only one type of data I may need to work with; I may need rapid access to a source database or need to deal with large chunks of data in log files or other files.
 If you look at that data landscape, you'll see a few boxes labelled Spark, data transformation, data integration, and real-time. These tools focus on the ingestion and manipulation of data. You may also come across some other tools and languages, like Hive, Pig, NiFi, Kafka, and Scala, that are used in these operations.
 Let's move on to data storage. Earlier in the course, we talked about the various ways we can store unstructured and structured data, including file systems and databases, and we briefly described relational databases as well as graph databases, document stores, columnar databases, and key-value stores. There are a variety of big data tools available to store very large amounts of data using each of these storage paradigms. You'll see these in the big data landscape as Hadoop on-premise, Hadoop in the cloud, NoSQL databases, NewSQL databases, graph databases, MPP databases, and cloud EDW.
 Often, what happens in a real data environment is that some unstructured storage, like the Hadoop Distributed File System or HDFS, is used to efficiently store large amounts of unstructured or semi-structured data. Then additional processing is done to drive some subset of that data into a more structured database. This processing is done using some of the same tools we discussed earlier for data ingestion and manipulation.
 The data ingestion and data storage options in the world of big data are pretty broad, but the number of options for analytics is absolutely mind-boggling; see the entire section labeled Analytics in the big data landscape. In a later video we're going to take a closer look at some of the more common tools used for analytics. But broadly speaking, big data analytics tools seek to perform the same types of descriptive, predictive, and prescriptive analytics that we perform in any environment, but add the ability to do it on very large datasets and potentially in real time. One of the reasons that there are so many different tools in this space is that there are so many different types of questions we might seek to answer, depending on the industry we're in or the problem we're trying to solve. There's lots of room for players to develop highly specialized solutions for unique problems. Of course, there are also a number of bigger players, many with familiar names, that seek to provide broad analytical platforms that enable a number of different types of analysis in one package. Again, you'll want to understand how your organization chooses to approach analytics and what tools are used in your environment.
 The last area we'll touch on in big data is infrastructure. Infrastructure broadly refers to the hardware, networking, management, and security around a data environment. With some oversimplification, you could think of infrastructure as the machines and the connections between them that support the big data environment. With big data comes the need to store very large amounts of data and the need to move that data into, within, and out of the data environment. There are special tools that help to do this very efficiently and at manageable cost. We won't get into a lot of detail here, but the Hadoop framework, HDFS, and a processing algorithm called MapReduce (sketched briefly at the end of this section) are examples of how big data technologies enable efficient infrastructure. Basically, these tools allow data to be stored and distributed across inexpensive off-the-shelf hardware using massively parallel processing, which allows very high speed and massive data storage at a fraction of the cost of traditional hardware and storage.
 In many ways, the cloud is a bit easier to get one's head around than big data. Remember, the cloud is not so much about what we are doing but where we are doing it. In a traditional data environment, I might buy my own machines and connect them to my own network, purchase licenses for major pieces of software to run on those machines, and then develop my own applications in that environment using custom coding or other utility applications. With cloud computing, I can take one or more of these functions and essentially rent them from a third-party supplier. So, why might I do that? Well, it turns out that most infrastructure and core software needs are scale businesses, meaning the bigger my data environment, the more efficient and cost effective it is to provide for those needs. Companies like Amazon, Microsoft, IBM, Google, and Salesforce, just to name a few, have massive amounts of scale and can very cheaply build and maintain incredibly large environments, and rent space in those environments to smaller companies that don't have the same scale advantages.
 Let's introduce a few terms that underpin the idea of cloud computing: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). You're probably a little familiar with Software as a Service. Any application that you're using via the web is probably an example of Software as a Service. In this model, the software itself is hosted on machines provided by the software provider, and you interact with it through some client interface, which may be a small application stored on your machine or simply a web browser. Software as a Service is really targeted at end users who are interested in accomplishing a certain function. Platform as a Service takes this idea a bit further. Instead of just providing software, a Platform as a Service offering provides something more like a development environment where you can develop your own applications. However, just like Software as a Service, all or most of the platform elements are hosted in the platform provider's environment. Platform as a Service is targeted at developers who want to focus on developing applications versus managing the development environment itself. Infrastructure as a Service is the most extensive of these hosted service offerings. With Infrastructure as a Service, customers are given access to the raw building blocks of the data environment, like processing capacity, storage, connectivity, security, etc. This allows an organization to build a highly customized environment in which both application development and application use can occur without worrying a lot about the overhead associated with setting up a physical data environment. Here's a nice diagram which summarizes what things are done by your organization versus a third-party provider under each of these paradigms.
 So let's talk a little bit about why cloud computing might be attractive. First, the environment can be really easy and cheap to set up. In fact, many of the major cloud providers allow limited use of just about all their products for free. Secondly, the cloud makes it really simple and fast to scale processing, storage, and application capacity. This can be critical to businesses like startups that start small and scale quickly. Additionally, using cloud services can remove a lot of distraction and overhead from the customer organization, since much of the administration, backup, redundancy, and disaster recovery responsibilities are handled by the service provider. This means that organizations can run leaner and be more focused on their core business functions.
 You might be thinking that if the cloud is so great, why doesn't everybody use it? It's a good question, and there are a number of different factors that might drive a company to do it themselves. The first is control and security: organizations may not be comfortable trusting a third party to provide critical services or to store sensitive data. The second is inertia: most companies have invested a lot of money in building out their own data environments, and it may not be cost effective or a high priority to migrate these environments to the cloud unless the benefits of doing so are quite large. Along the same lines, companies have invested in people who understand and manage these data environments, and it simply may not be worth it to try and retrain the workforce to work under a different paradigm, or to risk losing the critical knowledge that individuals have about the business should they leave. It's also possible that the technical needs of an organization are so unique that it would be hard to replicate them in a generic cloud environment, although in reality, cloud environments are becoming so flexible that they can accommodate just about anything. A key consideration that can influence the attractiveness of a cloud solution is where the source data comes from. To get the data from the cloud, you have to move it to the cloud. If a company has very large data sources that sit largely within its own data centers, it may be easier and more cost effective to create local data environments versus trying to push all that data to the cloud. The last reason is scale itself. Once an organization's data environment gets large enough, it can begin to offer more of the same scale advantages that cloud services provide, in addition to increased control and security. Very large data companies, Dropbox for example, have even moved away from cloud environments to their own high-scale environments.
 Earlier we noted that big data and cloud computing can work together, although they don't have to. It turns out that most cloud services companies offer versions of many of the big data technologies we discussed earlier as part of their software offerings.
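As a toy illustration of the batch versus streaming distinction described above, and of the MapReduce idea mentioned under infrastructure, here are two minimal Python sketches. The fish and word data are simulated; real tools such as Kafka, Spark Streaming, or Hadoop MapReduce apply the same ideas at much larger scale.

    from collections import Counter
    from itertools import groupby

    def event_stream():
        # Stand-in for records arriving one by one from a real-time source.
        for fish in ["trout", "salmon", "trout", "bass", "salmon", "trout"]:
            yield fish

    # Batch style: put everything into a "bucket" first, then count it.
    bucket = list(event_stream())
    print(Counter(bucket))

    # Streaming style: update running counts as each record "swims by",
    # without ever storing the full stream.
    running = Counter()
    for fish in event_stream():
        running[fish] += 1
        print(dict(running))    # counts are available continuously, not just at the end

    # MapReduce in miniature: map each word to a (key, 1) pair, group by key,
    # then reduce by summing. Real MapReduce distributes these steps across
    # many inexpensive machines working in parallel.
    words = "big data big cloud data data".split()
    mapped = sorted((w, 1) for w in words)                        # map + shuffle/sort
    counts = {key: sum(v for _, v in group)
              for key, group in groupby(mapped, key=lambda kv: kv[0])}
    print(counts)    # {'big': 2, 'cloud': 1, 'data': 3}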

Virtualization, Federation, and In-Memory computing

 When we talk about storing data and making it available for data analytics, we usually describe a process where data is physically moved from various source systems into a common location like a database. This usually happens using something called an extract, transform, and load, or ETL, process. As the name suggests, an ETL process extracts data from one location, transforms the data in some way, and then loads the data into a new location, again generally some type of database (a minimal ETL sketch appears at the end of this section). Furthermore, when we want to analyze data in a database, we generally pull a specific data set from the database and perform analytics using some other tool. We might even reload some data back into the database, like scores from a statistical model, to use in business operations.
 The reason we do things this way is largely driven by the computing and storage resources available in our data and analytics environment. Because ETL operations can be very processing intensive, we try to do them in the background before data is needed by analysts. Also, because disk storage tends to be a lot cheaper than memory, we tend to keep data on disk and use disk operations for access, even though putting data in memory would make the access much faster. There's a reason why the disk storage on your PC is measured in terabytes but the memory is measured in gigabytes: memory is a lot more expensive. The downside of this approach is that every time we move data from one place to another, we increase the likelihood that something might go wrong in the process, and we insert some amount of delay between the time the data is created and when it is available for access.
 However, as processing power and memory become cheaper, some additional options have emerged for both storage and access for analytics. In this video, we're going to briefly cover four of these emerging ideas: two related to data storage, data virtualization and data federation, and two related to data access for analytics, in-memory computing and in-database analytics. The reason we cover this in a course on analytics is that it's important for you as the data analyst to understand what mechanisms are used in your data environment. This will allow you to better interpret what you're seeing, why you're seeing it, and how relevant your findings are based on when and where the data came from. The more you know about your data and your data environment, the more effective you can be.
 So let's start with data virtualization and data federation, which are related but slightly different concepts. The idea behind data virtualization is that we keep source data where it is for each source, but we make it look like all the data is in one place and we allow users to access that data using one common interface. With data virtualization, we don't necessarily seek to change the data or integrate data from multiple sources, but we make it a lot simpler for users to get it without having to worry about the details of the underlying data format and technology. One advantage of data virtualization is that we can avoid having to store data in multiple places, namely in the source system and in some target database. Another advantage is that changes in source data are usually reflected immediately in the user access layer, since I don't need to wait for ETL processes to run and move the data from one place to another. It's also easier to alter the access layer should there be changes in the structure of the underlying source data. However, data virtualization does have some limitations. First and foremost, while it removes a data layer in the environment, it adds a processing layer, and it can take longer to run data extraction operations since this additional layer must translate user instructions into whatever language is appropriate for the sources in question. Furthermore, if any data cleansing or complex transformation operations are required, those processes add to the processing load and can further slow down access. In these cases, it may actually be better to use more traditional ETL processes. Again, data virtualization alone only makes data look like it's in one place. It doesn't necessarily make sense of how data from different sources relate to each other, which is one of the primary advantages of constructing a centralized database in the first place.
 This is where data federation comes in. With data federation, not only do we make it
look like data is in one place, but we actually fit that data into a common integrated
data model. We perform all the same transformations and establish all the same
relationships among data entities that we would do in a physical database, but we do
it all virtually. That is, without ever actually moving the data. The advantages of data
federation are similar to those of data virtualization with the added benefit of
presenting a more integrated view of data from multiple sources to the user. Of
course, this comes at the cost of even more complex processing that can result in
slower performance when data is accessed or extracted. Both data virtualization and
data federation are usually accomplished using specialized software applications that
connect to a variety of different source systems. While they eliminate the need to
move data using ETL processes, they still require development and maintenance to
establish those connections and present a unified view of data to users.
 Data virtualization and data federation can be attractive in environments where
resources are limited, the velocity of changes is very rapid, little transformation or
integration is required, or when sources have very high quality data or store a lot of
history themselves. However, they become less attractive as the volume or complexity
of transformations increase, or when there is a need to store historical data outside
the source.
 The other two ideas we want to discuss, in-memory computing and in-database
analytics, are a little different in that they seek to maximize the performance of the
analytical operations versus minimizing data movement in physical storage. With in-
memory computing, all the data needed for analysis is actually loaded into a computer
or server's random access memory, or RAM, where it can be accessed very quickly.
Typically, a whole data structure, including relationships between data entities, is
stored and available for analytical purposes. The advantage of this approach is
obviously the speed. As an analyst, I can apply complex techniques to the data in much
less time than it would take were I to try and access data stored on disk locally or on a
remote server. And once the data is in memory, I can try a lot of different things
without having to wait too long between each attempt. This enables analytical efforts
that require exploration and trial and error to accomplish. To do this however, I need
to get the data into memory. Unless my total data volume is pretty small, it's far too
expensive to store all of it in memory and to store it there all the time. So what I
usually need to do is execute a one time load of some manageable subset of data into
memory for analysis. Depending on how complex my data set is and how many entities
are involved, this load can take quite a while. Many of the most popular data
visualization exploration tools use some form of in-memory computing. As do a
number of specialized data appliances that combine database and data access
operations. In-database analytics also seeks to speed up analytics, but in kind of the
opposite web as in-memory computing. Instead of moving the data to a place where
an analytical application can manipulate it quickly, with in-database analytics, we
move specific analytical operations back into the database. Where they can be quickly
executed as data is loaded into the database itself, either using ETL or other custom
procedures. So when would I do this? Let's say I had developed a predictive or
prescriptive model that helped to detect fraud, or which triggered certain actions like
stock trades or price changes based on real time data. By incorporating this model
directly into the database, I can drastically reduce the time lag between the input
events and output actions based on that model.
 In some situations, even milliseconds can matter, so I want to minimize any delay in
action. In-database analytics are ideal in this type of situation.
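The ETL pattern described at the start of this section can be sketched in a few lines of Python. This is a minimal illustration only; the file name, column names, and target table are hypothetical, and production ETL tools add scheduling, error handling, and much more.

    import sqlite3
    import pandas as pd

    # Extract: read raw data from a (hypothetical) delimited source file.
    raw = pd.read_csv("daily_sales.csv")

    # Transform: clean and reshape the data for analysis.
    raw["sale_date"] = pd.to_datetime(raw["sale_date"])
    daily = raw.groupby("sale_date", as_index=False)["amount"].sum()

    # Load: write the transformed data into a target database table.
    conn = sqlite3.connect("analytics.db")
    daily.to_sql("daily_sales_summary", conn, if_exists="replace", index=False)
    conn.close()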

Relational database

 The applications that run relational databases are called relational database
management systems, or RDBMS. While there are emerging database types, if you
want to be an effective analyst in most organizations, you'll almost certainly need to
understand what relational databases are, how they work, and how to extract data
from them. In relational databases, we store information in tables, and then define
specific relationships among those tables. A table is a two-dimensional structure that
stores data in rows and columns. Most relational databases are row oriented, meaning
that the ideas or items described in the table are stored in rows, with the columns of
the tables containing attributes that describe the ideas or items of interest.
 Let's look at an example of a database table. Consider this table, which we'll call FLIGHTS, that contains information about commercial airline flights. Each row describes one flight, and each column describes an attribute of that flight. In this case, we have a flight identifier, an aircraft identifier we call the tail number, a flight number, departure and arrival airports, departure and arrival times, and the number
of passengers on-board the aircraft. Sometimes we refer to the rows of a table as
records, a term that has its origins in the idea that one would store information about
something in a physical record. Kind of like how physical files were the ancestors of the
digital files we store in our computers. Similarly, we can describe columns as fields or
attributes to reflect that they contain data that describes the record. We'll use rows
and records, and columns, fields, and attributes interchangeably as we go forward.
Each row of a table in a relational database must be unique. In other words, there
can't be duplicate rows in any table. To ensure that each row is unique, we define
something called a primary key for the table. The primary key is a column or set of
columns that are guaranteed to be unique for every row in the table. There are a few
ways that we can define a primary key. If a primary key can be constructed using
attributes that occur naturally in a data record, we call this a natural key. Sometimes
this can be done using only one attribute. In other cases, it takes more than one
attribute to uniquely define a row. In that case, we call the combination of attributes a
composite key. Sometimes it's easier just to define a new column in a database and
force it to be unique, often by simply numbering each row and incrementing that
number as new rows are added. A key defined using this approach is called a surrogate
key. A surrogate key usually doesn't have a meaning outside the database. It's
something we've added simply to facilitate the unique storage of data records. In our
FLIGHTS table, the FLIGHT_ID field uniquely defines each row, but it doesn't really have
a meaning outside the database. So this is an example of a surrogate key. However, I
could have constructed a composite key using the combination of FLIGHT_NUMBER,
DEPARTURE_AIRPORT, and DEPARTURE_TIME, which also turns out to be unique.
Because all of these attributes are naturally occurring in the record, this composite key
would also be a natural key. When I start talking about multiple tables in a database
and how they're related, it's useful to depict them visually using a shorthand like this.
Here we list out all columns in the table vertically instead of horizontally with the table
name at the top. This shorthand lets us see all the types of information contained in
the table in a compact form. And as we'll see in a moment, it also lets us more easily
show how tables are related to each other.
 We can also incorporate information about the primary key of our table into the
shorthand, like this. I've simply added a PK to the right of the FLIGHT_ID field, which
indicates that FLIGHT_ID is the primary key of this table. One thing you might also
notice is that all of the column names in our examples use a continuous string of
characters. When I have multiple words or abbreviations in my column names, I've
placed an underscore between them instead of a space. Most relational database
systems expect both column names and table names to be in this type of format, so
keep that in mind as we go forward. As we noted earlier, the real power of the
relational database is the way it links ideas together by identifying relationships
between tables. To do this, we define something called a foreign key, which is a
column or columns that establish a logical link between tables. Usually the way this
works is that a foreign key in one table matches the primary key in another table.
 Let's start with our FLIGHTS table, using the shorthand we just introduced. Suppose we have another table in the database, called PLANES, that contains information about specific aircraft, like the TAIL_NUMBER, the AIRLINE name, the AIRCRAFT_TYPE, the FLEET_TYPE, and the number of seats on the plane. The primary key of the PLANES table is TAIL_NUMBER, which is unique for each aircraft. Since this does have a meaning outside the database, this is also a natural primary key. (A small sketch of how these tables and keys might be declared appears at the end of this section.)
might notice that TAIL_NUMBER is common to both the PLANES table and our FLIGHTS
table. We can establish a linkage between these tables by identifying TAIL_NUMBER as
a foreign key in the FLIGHTS table. Visually, we can draw the linkage using a line that
connects these two field names. Let's take this idea a step further by adding another
table, called AIRPORTS, which has information about each airport, including the
COUNTRY, STATE, CITY, POSTAL_CODE, LATITUDE, and LONGITUDE. Again, the airport
name or code is the natural primary key that uniquely defines each row of the table. In
this case, it looks like there are two fields in our FLIGHTS table that contain the same
type of information as AIRPORT, DEPARTURE_AIRPORT and ARRIVAL_AIRPORT. Both of
these fields can be identified as foreign keys that link the FLIGHTS table to the
AIRPORTS table. In essence, we've linked these tables twice. Note that the column
names in the tables don't need to be the same to establish a foreign key relationship.
All that needs to be true is that they represent the same idea, and that they are of the
same data type. Let's add one more table to our example. Here's a table called
CITY_PAIRS, which provides information about the route that a flight might take,
including the DISTANCE between points and a field called REGIONALITY that might describe something like a domestic route versus an international route. In this table, the combination of DEPARTURE_AIRPORT and ARRIVAL_AIRPORT defines the unique row of the table, so this is an example of a composite natural primary key. The
CITY_PAIRS table also seems to have relationships to the FLIGHTS table, but the
relationship is a little different this time. Specifically, it looks like I need both the
DEPARTURE_AIRPORT and the ARRIVAL_AIRPORT to establish a link between the
tables. So that's exactly how I describe the relationship. Unlike the AIRPORTS table, I
don't link to the CITY_PAIRS table twice using a single foreign key, but rather I link it
once using two foreign keys. Okay, what we have at this point is the beginning of
something called a logical data model, or a visualization of how a database is
structured. The data model for a large database would obviously have many more
tables with many more linkages among those tables. As a data analyst, it's always a
good idea to try and get your hands on a copy of the data model so you can see what
types of data are available in the database and how to make sense of the database
itself. This is particularly important when you start thinking about how to extract data
for analysis. You'll need to understand the database structure to write effective
queries against the database. We'll talk more about that in module three.
 One of the things you may have noticed in our example is that we broke different ideas
into different tables. There are a number of different reasons we might do this. We
might find that different types of data come from different sources, or that some data
changes rapidly while other data changes slowly. But broadly speaking, the notion of
trying to break data into different ideas is called normalization. It turns out that
normalizing data eliminates redundancy by ensuring that we store unique data only
once versus multiple times. For example, what if we stored the AIRCRAFT_TYPE for
each aircraft in our FLIGHTS table instead of in the PLANES table? A single aircraft
would likely take many flights, so every time there was a flight for that aircraft in the
FLIGHTS table, I'd have the same aircraft type, which is a bit redundant. By putting
AIRCRAFT_TYPE in a separate PLANES table, I only need to store AIRCRAFT_TYPE once,
namely, in the unique row for that aircraft. In relational database design, there is a
special degree of normalization called third normal form, or 3NF. When a database is
in third normal form, all unnecessary redundancy has been removed from the
database.
 While the definition of third normal form is a bit technical, it was paraphrased by Bill Kent from IBM in 1982 as follows: every non-key attribute must provide a fact about the key, the whole key, and nothing but the key. What this means is that for a database to be in third normal form, each table must only contain information that in some way directly describes its primary key. Consider this example. Here we have a table that describes orders, with order ID as the primary key. The attributes of this table are Date, Customer_ID, Product_Name, and Product_Type. In this example, all columns seem to describe the order directly except Product_Type, which actually seems to describe the product. So this table would not be in third normal form, since Product_Type is not directly describing the order.
 Okay, now we have a basic understanding of what a relational database is, and how
and why we separate different ideas into tables and relate them to each other. So why
are relational databases so popular? Here are a few reasons. First, they allow us to
group data logically around discrete ideas. This makes sense conceptually, and allows
the model to be understood more easily. Secondly, as we mentioned earlier, they can
minimize the amount of duplicate data stored in a database, thereby reducing storage
requirements. They also minimize the number of places where changes to data need
to be made. Consider the AIRCRAFT_TYPE example we used earlier. Let's say I needed
to correct the AIRCRAFT_TYPE for an aircraft. It would be a lot easier to change that
value once in the PLANES table than to have to change it in every record related to
that aircraft in the FLIGHTS table. Additionally, in highly transactional systems where
lots of updates or additions are made, using a relational database can improve the
overall performance of the database. Finally, and perhaps most importantly to the
data analyst, a relational database model is incredibly flexible in terms of how data can
be queried and extracted. If constructed correctly, a relational model presents very
few limitations on the types of data sets that can be obtained from the database.
However, there are some downsides to relational models. The more normalized the
database becomes, the more complex the data extraction and analysis operations
become, as more tables need to be joined together to get consolidated data sets. This
is the cost of the flexibility that the model provides. It can also require more effort to
integrate fundamentally new data domains or data sources, as they must be
architected to fit within the current database model. Sometimes it can be hard to fit
new ideas into a model that didn't anticipate them. To get around some of these
limitations, some organizations opt for data models that are not fully normalized. It
turns out that the more we find ourselves joining the same tables together for
repeated operations, the more it makes sense to denormalize information. In other
words, we create structures that are not in third normal form to facilitate things like
reporting and analytics. The tradeoff of this approach is less efficient storage of data
for more efficient interrogation of data. Usually, this is accomplished by creating tables
downstream of a normalized database model, or through the use of something called a database view, which makes data appear denormalized even though it is actually stored in normalized tables (a minimal sketch of a view appears just after this bullet). With views, the tables are joined
dynamically in the background, but the user only sees the denormalized view of those
tables. The idea of denormalization is often taken further through the construction of
data marts, or alternate non-relational data structures, like cubes. A data mart can be
thought of as a smaller, specialized database that is established for a specific user
group or function, like finance or marketing campaign analysis. Data marts usually
contain a subset of the information stored in a larger database and may be
normalized, denormalized, or somewhere in between.
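 As a minimal sketch of the view idea, assume the FLIGHTS and PLANES tables share a TAIL_NUMBER key (the FLIGHT_ID, DEPARTURE_DATE, and TAIL_NUMBER column names are assumptions for illustration):
CREATE VIEW FLIGHT_DETAILS AS
SELECT f.FLIGHT_ID,
       f.DEPARTURE_DATE,
       p.AIRCRAFT_TYPE                -- stored once in PLANES, surfaced per flight here
FROM FLIGHTS f
LEFT JOIN PLANES p
  ON f.TAIL_NUMBER = p.TAIL_NUMBER;
Queries against FLIGHT_DETAILS see a denormalized result, while the underlying data stays normalized in FLIGHTS and PLANES.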
 A cube is basically an n-dimensional table. So instead of a two-dimensional table with
columns and rows, I might have a three-dimensional structure with columns, rows and
stacks, or even higher-dimensional structures that can't be easily visualized. Cubes are
not relational database structures but are used pretty commonly in downstream
business intelligence tools. Okay, we've covered quite a bit in this video. Let's recap.
We started by describing relational databases as a collection of tables with
relationships defined among them. We talked about how the uniqueness of rows in a
table is defined using a primary key, how that key can be natural or surrogate, and
how a composite key can be constructed using multiple fields in the table. We
described how foreign key relationships establish the linkages between tables, and
saw a few different ways in which those linkages can occur. We learned what it means
for a data model to be in third normal form, and discussed the advantages and
disadvantages of normalization and the relational database itself. Finally, we
presented a few examples of how data is actually denormalized to facilitate repeated
data extraction operations in reporting and analytics. These ideas will be key in module
three, where we'll actually learn how to extract data for analysis from relational
databases.

Data tools landscape

 Keep in mind that these tools are evolving all the time, and most vendors are trying to
expand their solutions to include more functionality.
 Let's start by defining some classes of tools that you're likely to encounter. Namely,
database systems, standard reporting tools, dashboarding tools, data visualization
tools, data exploration tools, and statistical modeling and advanced programming
tools. Broadly speaking, this ordering of tools begins with the most IT or development
centric and moves towards tools that are more analytic or exploratory in nature.
 We talked quite a bit about database systems in our previous videos. As a reminder,
these tools are used to create, maintain, and extract data from databases. In this video,
we'll focus on the most popular options for relational database systems.
 Standard reporting tools are used to provide stable repetitive use of data (used for
stable, repetitive display or manipulation of data, often for business end users).
Usually standard reports are created once we've already identified a specific way of
looking at data that we think is particularly useful or insightful. We use reporting tools
to automate the generation of these reports on some periodic basis, monthly, weekly, daily, or hourly, so we don't have to do it manually. These reports may or may not provide some limited manipulation functions like filtering or drill-down capability, and they're usually directed toward business end users. Standard reporting tools were some of the first business intelligence tools created and have been around for quite a while, although the level of sophistication and usability has increased substantially over time.
 The idea of dashboarding is an extension of standard reporting (used for more
dynamic, but still repetitive display of information, especially summary or executive).
As more and more standard reports are created in an organization, it becomes more
difficult to isolate the most important pieces of information that an executive or other
decision maker might need to make sense of the business. One solution to this
problem is to take a subset of reports and present them in one simplified view that
allows the most important metrics to be quickly identified and interpreted.
Dashboards also tend to be a bit more dynamic and may present more timely
information than some standard reports. As the name suggests, the analogy here is
the dashboard in your car, which allows you to see the most important things that are
going on as you drive. Executive dashboarding emerged from the executive
information systems of the 1980s, but really gained popularity in the 1990s. And like standard reporting tools, the sophistication of dashboarding tools continues to evolve.
 While we may or may not spend time developing reports or dashboards, the idea of data visualization is squarely within the domain of the data analyst. Data visualization is the process of arranging data in such a way that we can more easily see what's going on and draw conclusions based on what we see (used for more interactive evaluation of information using advanced visual representations). The idea of data visualization has been around for hundreds of years, but it's only been in the last decade or so that a class of sophisticated tools explicitly designed to make sense of large amounts of complex data has emerged. These tools both facilitate the aggregation and manipulation of data and provide a spectrum of advanced visualization techniques to the user. In fact, these tools are quickly becoming the workhorse applications in many business analytics organizations.
 Data exploration is an intelligent extension of the idea of data visualization (expands on visualization with advanced navigation or ‘cues’ for ‘next step’ analysis). Data exploration tools seek to proactively guide the data analyst by automatically scanning data and providing cues or suggestions on what the data analyst might look at next. They also provide advanced navigation tools that allow the analyst to efficiently
explore a data set. These capabilities are most often built into some of the same tools
that specialize in data visualization.
 The last class of tools we'll introduce are statistical modeling and advanced programming tools (used for running advanced algorithms, usually statistical, on data sets). These tools are used to execute highly sophisticated analytical procedures on
data, often using statistical techniques. They're the core tools of data scientists, and a
key part of the data analyst toolkit as well. They range from highly integrated interface
driven software packages to raw programming environments where analysts can
manipulate data directly using one or more programming languages.

The tools of the data analyst

 We'll start by looking at three broad methodologies a data analyst might employ to
access and analyze data.
 Let's call the first method the intermediate file approach. In this approach we extract
data from a database or other location where data is stored, and we export the data
we need into a standalone file like a text file or Excel file. This often involves writing
SQL code against the database to extract just the data we need. We then import the
data into an analytical tool like Excel, a business intelligence tool, or a statistical
software package or programming environment. Once the data is in the analytical
environment I can execute whatever type of analysis is desired. Note that this
approach assumes that all the data I need is already integrated in one database
environment. So why might I use this approach? For starters, you might not be able to
connect your analytical tools directly to your database due to security or stability
concerns. Or you may not have time to set up all the permissions and network
connections required to make a direct connection. It might be the case that you need
to extract the data at one point in time and analyze it later or analyze it offline. You
might also want to drive the same data set into multiple analytical tools. In these
cases, it's more convenient to store the data in an intermediate file and import it for
analysis. Finally, by breaking access into two steps I have a little more control and
visibility into each step, which might be useful for data validation and quality control.
 A second method might be called the direct connection approach. With this approach
we connect our analytics tool directly to a database or other data source using what's
called an open database connectivity, or ODBC connection, or some other application
programming interface, or API, connection. Broadly, APIs are standard mechanisms for
exchanging information between programs, and ODBC is one special case of an API
used to connect to databases. Most analytical tools have the ability to connect directly
to the most common database systems and a number of other common data sources.
In this approach, I use the analytical tool interface to set up a connection and identify
the data I want to access. Usually this involves the same ideas as SQL queries on a
database. The tool executes the extraction operation in the background, and I can then proceed with whatever analysis is desired. There are a couple of nice things about the
direct connection approach. First, it cuts out a couple of steps in the process since I
don't need to export and import data using an intermediate file. Secondly, a direct
connection can be set up to automatically refresh as the underlying data changes.
Meaning that my analytical work can be refreshed as well to reflect the most recent
information. On the downside, it can require a bit of set up to establish the actual
connection to the database. And it's a little bit harder to be sure that the extraction
process is happening as intended. Offline analysis can be a bit trickier as well, since we need to make sure we have a copy of our data stored locally.
 The last method we'll cover is what we'll call the downstream integration approach.
It's quite often the case in data analytics that the information we need is located in a
bunch of different locations and formats. While it would be nice to have everything in
one warehouse, it doesn't always make sense to spend the time and energy to put
everything there prior to analysis. Especially if we're not entirely sure those sources
will turn out to be important. The newer and more complex the analysis, the more
likely it is that we'll need to do at least some integration in the analytical environment.
In cases like this, we generally use a broader set of API and ODBC connections in our analytical tool of choice to connect to several sources concurrently, and use additional functionality in the tool to integrate the data and construct analytical data sets. We then proceed as we would using the other methods. Of course, there are a number of
other approaches that we might take when working with data to get the results we're
looking for, including hybrids of the ones we've discussed here. For example, I might
perform certain manipulations in Excel, and then import the results into a more
sophisticated analysis tool. I can also do the opposite, using an advanced tool to isolate
some set of data that I want to incorporate into Excel. Perhaps into a business or
financial model. The approach you take in any situation will depend on what it is
you're trying to do. But one thing we haven't really discussed is when you'd want to
use one tool versus another. This is a really complicated question and the answer
depends not only on the capabilities of the tool itself but on the skills of the analysts,
the nature of the data environment and even the organization in which an analyst
works. In our video on data and analysis tools we broadly discussed what functions
each type of tool is designed to perform, but we also saw that there's quite a bit of
overlap in the capabilities of different tools. What one analyst finds really easy to do in
one application another analyst might find more intuitive in a different application.
 That having been said, here are a few ideas that you can start with that I've drawn
from my own experience in leading analytical teams. But you'll have to discover what
works best in your environment. Let's start with Excel. Excel is really great for quick and dirty analyses, basic charts and graphs, and for when you want to share your analysis with business partners who don't have access to more sophisticated tools.
Excel also makes certain types of manipulations really easy, like calculations that
depend on multiple rows of data. And it's a great environment for trial and error
around really complicated calculations since you can see every formula in every cell. It
also turns out that an awful lot of financial modeling and business case development is
done in Excel. When analysis outputs are used as the inputs to these Excel models, it
can be convenient just to start in that environment. On the flipside, Excel is not
necessarily a great tool for sharing data broadly or for developing standard reports or
dashboards. There's also a size limitation. The baseline Excel product can only handle
about 1 million rows of data, which may not be large enough for some analysis.
Although with Microsoft's Power Pivot plug-in, larger data sets can be handled. Even with less than 1 million rows, performance can be sluggish on anything but a high-end machine.
 Let's move on to business intelligence tools, which include standard reporting, data visualization, and data exploration tools. These tools are a good choice for a wide variety of analytical needs and are intended to make complex manipulation of data easier and faster than other tools. It goes without saying that if the analysis requires extensive exploration or advanced visualization techniques, tools suited to those operations will produce better results. Additionally, some organizations will have tools that
permanently sit on top of a data environment and provide pre-built data structures
like cubes, or predefined calculations that facilitate repeated analytic operations.
Business intelligence tools are also preferable in cases when the output of analysis will
be shared broadly or turned into a standard report since they typically include more
advanced scheduling and distribution functionality. Statistical modeling and advanced
programming tools are the obvious choice when we need to do highly sophisticated
analysis, especially using advanced analytic techniques. There are ways of
incorporating more advanced techniques into both Excel and other applications using
plugins or complementary applications. In fact, we'll be using some of these in later courses. However, few of these work really well at scale, where we need to
analyze very large datasets, or where we need to drive the results of our analysis back
into business operations. There are also cases where we want to have a really high
degree of flexibility, where we need very fast performance or where we want to build
analytics directly into other software or data processes. In these cases, we might want
to code our analytics from scratch using a programming language adept at performing
data manipulation. The downside of this approach is that we have to build things from
scratch. We give up a lot of the simplification that's provided by more user friendly
tools.
 So what should you take away from this discussion? First, you should recognize that
there are a lot of different ways of navigating through the analytical process. But there
are a few common ways that we'll apply more often than not. Secondly, when it comes
to tools, there's always more than one way to accomplish anything. But there are
some broad considerations that can help you pick which tool is right for the job.
3rd week: Data extraction using SQL

Introduction to SQL

 So what is SQL? Like relational databases themselves, SQL was developed in the early
1970s to help users manipulate and extract data from those databases (A
programming language designed to manipulate and extract data from a Relational
Database Management System (RDBMS), developed in the early 70’s at IBM). It's a
language that's based on relational algebra, which is a set of mathematical operations
that speak to how things are related, like intersections, unions and differences. In
1986, SQL was adopted as an American National Standards Institute, or ANSI,
standard. And in 1987, it was adopted as an International Organization for Standardization, or
ISO, standard. Note that there are different proprietary versions of the SQL language
used in different database systems. However, it turns out that the differences are very
subtle. And they are almost all identical when it comes to basic syntax. This is one of
the reasons SQL is such a highly transferable skill. In this course, we'll be focusing on
SQL queries or pieces of code that extract data from database tables (Data
manipulation or Data definition operations are used to create or alter the database
itself). However, SQL is actually a much broader language, which can be used to both
create and manipulate data within a database, using data definition or data
manipulation operations.
 Most relational databases are row oriented, meaning that the ideas or items described
in a table are stored in rows with the columns of the table containing attributes that
describe the ideas or items of interest.
 The idea behind a SQL query is to extract just the data we want from a database table
or set of tables. Let's start by outlining a short list of common commands that can be
used on one table. Here we have a very short list of the most common commands
used in SQL queries. Select, from, where, group by, having, and order by. In fact, the
majority of queries on single tables can be constructed using just these commands and
a little creativity.
o The SELECT command defines which attributes, columns, or fields I want to
extract from the table. Normally I'm not interested in all the attributes in a
table, so select allows me to bring back only the ones I need.
o The FROM command defines the table from which I want to extract the data.
The SELECT and FROM commands work together and are required in every
SQL query. All other commands are optional.
o The WHERE command adds filters that restrict which rows of data are extracted from the table. Similar to the way the SELECT command only returns the columns I want, the WHERE command only returns data based on the rows I want
included.
o Sometimes, I don't actually want the rows themselves, but some aggregation
of the rows, for example, maybe I have a table that lists purchase transactions,
but what I really want is a summary of purchase transactions by month. The
GROUP BY command is used to define the level of aggregation I want in the
output data set.
o If I do want aggregated data, and I want to filter the output set further based
on those aggregations, I use the HAVING command. The HAVING command is
similar to the WHERE command, except that it operates on aggregated rows of
data versus the underlying rows of the database table. This might be a little
confusing at first, but the difference will become clearer when we run through
some examples.
o Finally, the ORDER BY command allows me to define how I want the output set
to be sorted.
o There's one special case we want to introduce before we move on to the next
command. Let's say that I wanted all the columns from a table. I can include
each column name in my select statement, but this can get pretty tedious,
especially if my table has a large number of columns. Luckily, SQL provides us a
shorthand way of identifying all columns in the table, using what's called a
wildcard character. In this case, an asterisk or star. If I want to return all
columns of the table, I just use SELECT * FROM TABLE_NAME
o The syntax of the WHERE command is simply WHERE followed by a set of conditions defined using relational algebra, like a column equaling a certain value or another column, being greater than or less than a certain value or another column, or even being a member of a specific list of values. If you have experience in Excel, this is similar to how the filter function works in restricting rows of a spreadsheet. Note that a string value like 'Store' is enclosed in single quotation marks. In SQL, string values (those containing letters, or a mix of numbers and letters) are always enclosed in single quotation marks.
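 Pulling these commands together, here is a minimal sketch of a single-table query against the transactions table used in the examples in this course (the 'Store' channel value is an assumption for illustration):
SELECT channel, product, price        -- only the columns I need
FROM transactions                     -- the table I'm extracting from
WHERE channel = 'Store'               -- string values go in single quotation marks
ORDER BY price DESC;                  -- sort the output set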

Aggregating and sorting data in SQL

 In this video, we're going to focus on data aggregations in SQL, using the group by
command, filtering on aggregate values using the having command, and sorting data
using the order by command. So what do we mean when we say data aggregations?
An aggregation basically takes the values in multiple rows of data and returns one
value. Effectively collapsing all those rows into a single row containing a measure of
interest. In the world of SQL we're really talking about one of the following types of
familiar operations. Each of which operates on a table field.
 The SUM function calculates the arithmetic sum across a set of field values. Similarly, the average, or AVG, function calculates the average across a set of field values. Both the sum and average functions require that the data in the field be
numeric. They won't work on other types of data like strings or dates. This is not the
case for the min and the max functions, which return the minimum or maximum
values from a set of field values. The min and max functions will work with most data
types, including numbers, strings, or dates. The COUNT function returns the number of values present in a set of field values. Some SQL versions provide a few more options for
aggregate functions. But the ones we introduced here are pretty universal across all
versions you're likely to encounter. For all these functions the syntax is similar. I have
the function name immediately followed by the field name in parentheses. They also work pretty much as you'd think they would, with one caveat.
 There's a special type of value that can exist within a field called a null. A null basically
represents the absence of data. In other words, there is no value in that field. It shouldn't be confused with a zero or a blank value, which are actually real values in numeric or text fields. This distinction is important, because aggregate functions treat nulls differently than they do other values. Specifically, all aggregate functions ignore nulls
except for one special case of the count function. If we use the wild card character star
in our count function, we return the number of rows in a data set whether or not there
are nulls present. That means that if I use COUNT(field_name) and COUNT(*), I'll get two different answers if there are nulls in the field of interest. Now that we've covered the
types of aggregate functions available to us, let's talk about how we use them in
queries.
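 As a minimal sketch of that caveat, assume the transactions table has a price column in which some rows are null:
SELECT COUNT(*)     AS num_rows,      -- counts every row, whether or not price is null
       COUNT(price) AS num_prices     -- ignores rows where price is null
FROM transactions;
If any price values are null, num_prices will come back smaller than num_rows.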
 If we're interested in aggregate values across an entire table, the syntax is pretty
simple. We just use an aggregate function instead of a field name in our select
statement. For example, consider the transactions table we introduced in a previous
video. Since each row represents one transaction, if we wanted to count the total
number of transactions from the table we'd use this query. As you can see we just
select count star from transactions. When we run this code we get the following which
tells me there are nine rows in this table.
 SQL Does provide us a way to specify or change a column name in our query through
the use of something called an alias. Let's say that we wanted to call the number of
rows in the transactions table num_rows. We could write a query that looks like this.
Here I have select count(*) as num_rows from transactions. The as modifier tells the
SQL engine that I want to rename a field as something else. The use of aliases is not
limited to aggregate functions. I can rename any field to something else if I want to.
We can also use aliases to rename tables in our queries.
 As a side note, I'm using the underscore character in my alias because SQL expects
column and table names to be a continuous set of characters, meaning I can't use
spaces or most punctuation marks in those names. If you put a space between words
in a table name, the SQL engine will interpret it as two separate things and will most
likely produce an error. So again we can use any of these aggregate functions on the
whole table. However, a much more common use of aggregates is in conjunction with the GROUP BY statement. As the name suggests, the GROUP BY statement allows me to group data and provide summary aggregates for each group. Conceptually, the way this works is that I choose some way to group data that is based on one or more of the fields in my table. Then I determine what aggregates I want based on one or more other fields in my table. The syntax of the SQL code looks like this.
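 A minimal sketch of that syntax, assuming the transactions table has channel and price columns (the 100 threshold is just an illustrative value):
SELECT channel,                       -- the grouping field
       SUM(price) AS total_sales,     -- one aggregate value per group
       COUNT(*)   AS num_transactions
FROM transactions
GROUP BY channel
HAVING SUM(price) > 100;              -- optional filter on the aggregated rows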
 The important thing to note about this syntax is that anything in the select statement
that's not an aggregate function must also appear in the group by statement.

Extracting data from multiple Tables

 Before we jump in let's revisit the idea of aliases that we introduced when we talked
about data aggregations. We saw that we could rename any column or aggregate in
our select by adding as, and a new name like this. Select count star as num_rows from
transactions. It turns out that I can also use aliases to rename whole tables as well as
columns. This will be important when we start talking about multiple tables, especially
when those tables have column names that are the same. Using aliases helps us
specify which table a column belongs to. (a.Table_Name is another way to use
aliases). The syntax for table aliases is a little different than for column aliases, but the
idea is the same. Consider one of our earlier example queries: SELECT channel, product, price FROM transactions. If I wanted to clearly specify the table associated with each column, I could have written the query this way: SELECT transactions.channel, transactions.product, transactions.price FROM transactions. Note that I more completely described each field in the select statement using the convention table_name.field_name. This makes the table association really clear. But as you might imagine, it can get tedious to type out the whole table name every time I reference a
column. This is where table aliases come in. Let's say I simply want to refer to the transactions table as table a. I could write the same query this way: SELECT a.channel, a.product, a.price FROM transactions a. In this case, the a following the table name is my alias for the transactions table. Note that unlike column aliases, I don't need to use the word AS between the table and its alias. I can if I want to, but it's usually more efficient just to use a space.
 Remember that in a relational database the linkage between tables is defined by a
foreign key. That is, we have the same column or at least the same type of column in
both tables. Often, the foreign key in one table links to a primary key in another table,
but it doesn't have to be the case. In this short-hand diagram, you could see logically
how two tables, A and B, are linked using the fields A.Key and B.Key. In SQL, we use an
operation called the JOIN to allow us to access more than one table at a time. There
are a number of different types of joins that can be applied. Depending on what type
of output we're hoping to get. We'll cover the three most common types of joins first,
outer joins, inner joins and left joins. This venn diagram illustrates logically what these
joins do. The inner join only returns rows of data where there is a common key value
match. In other words, when the specific values in the key field are the same in both
tables. The full outer join returns all rows of data from both tables, whether or not
there is a key value match between them. Finally, the left join returns all rows of data
in one table and adds data from any rows in the second table where there is a key
value match. Another way to think about what joins do is that they try to match rows
of data between tables and sort of line them up using the key relationship. With an
inner join, I only give data back if the rows line up properly. With an outer join, I get
back both rows that line up and those that don't. Meaning that I'll have some rows on
my output dataset that only had data on one side, and other rows that only had data
on the other side in addition to those with data on both sides. The left join is sort of in
the middle. It makes sure that I have all rows of data from one side, but may or may
not have data from the other side. Some examples of joins should make this clear, but before we do that let's look at the syntax of SQL code when using a join clause. Assume I want data from two tables named TABLE_1 and TABLE_2, respectively. The syntax for a full outer join looks like this: I select some number of columns from TABLE_1 and TABLE_2 using the aliases a and b that I define in the subsequent clauses. I select from TABLE_1, aliased as a, and FULL OUTER JOIN TABLE_2, aliased as b. Finally, I define the field from each table on which I want the join to take place by saying ON a.key = b.key. The syntax for an inner join and for a left join is similar, except that I replace FULL OUTER JOIN with INNER JOIN and LEFT JOIN respectively. For example: SELECT a.*, b.* FROM transactions a LEFT JOIN products b ON a.product = b.product. I've chosen a left join here, because I want to start with all my transactions on the left, and only join in data from the products table, if there is a match. In this case, there's a match for
each transaction, and I get the following output data set. Here I have all the data from
the original transactions table, but I've added the material and medium columns from
the products table. Note that each time the same product comes up in the transactions table, the same information from the products table is included in the output table. So what if I had chosen an inner join instead? In this case, it would have made no difference, as there was a match in the products table for every type of product in the transactions table. Or what if we were missing information in the products table? For example, what if my products table were missing magazines? With a left join, all the transactions would still be in the output. But the values for the last two columns would be missing in those
rows, like this. But if I used an inner join, the rows for magazines would be missing
entirely. If I needed to see information about all transactions, I wouldn't get what I
needed. With the left join, I'm able to see exactly what information I'm missing.
 In data analysis, this is often precisely the type of thing I'm looking for. Personally, I
have a strong bias towards using left joins over inner joins when possible, especially for data analytics. I can always throw away data later, but it's much harder to spot missing data problems if the rows have already been removed in your query.
 Generally speaking, however, full outer joins are much less common than left joins and inner joins. One thing you may have noticed in my output dataset is that even though I selected all columns from both tables using a.* and b.*, the key field I joined on only appeared once. Some SQL engines are smart enough to realize that including the same values from both tables is redundant. Others will in fact return values for each table in separate columns. We've used one column here for simplicity.
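 Written out, the three join types from this discussion look like this, using the transactions and products tables from the example (column lists abbreviated with the wildcard):
-- Left join: keep every transaction, add product data where it matches
SELECT a.*, b.*
FROM transactions a
LEFT JOIN products b ON a.product = b.product;

-- Inner join: keep only transactions that have a matching product
SELECT a.*, b.*
FROM transactions a
INNER JOIN products b ON a.product = b.product;

-- Full outer join: keep rows from both tables, matched up where possible
-- (not every SQL engine supports this one)
SELECT a.*, b.*
FROM transactions a
FULL OUTER JOIN products b ON a.product = b.product;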

Stacking data with UNION command

 When we learned about joining tables together, we really focused on how to enrich
rows of data by adding columns from different tables. You can think of this as kind of a
left right operation. We have some data on the left and some data on the right, and we
want to bring them together. We do this using a join operation. We can also think
about bringing tables together top to bottom. Meaning that if we have two sets of
data with similar types of rows, we can sort of stack the data on top of each other to
get one taller data set with more rows. To do this, we use the UNION command. The
syntax of the UNION command looks like this. Here I have two simple SELECT from
clauses with the UNION command between them. The key requirement for performing
a UNION is that the top and bottom data sets need to have the same number of
columns. And those columns need to represent the same set of ideas in the same
order. So in this case, the type of data in FIELD_A from the top SELECT clause needs to
match the type of data in FIELD_D on the bottom SELECT clause. The type of data in
FIELD_B needs to match the type of data in FIELD_E and the type of data in FIELD_C
needs to match the type of data in FIELD_F. Note that I said the type of data, not the
column name. The column names do not have to match. But the type of data in those
columns do have to match. So if FIELD_A is a number then FIELD_D also has to be a
number. If FIELD_C is a date then FIELD_F also has to be a date. If the column names
are different, the SQL engine will generally just use the column names from the first
SELECT statement in the data output.
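 A minimal sketch of that syntax, with the field names from the description and hypothetical table names TABLE_1 and TABLE_2:
SELECT FIELD_A, FIELD_B, FIELD_C
FROM TABLE_1
UNION                                  -- stacks the second result set under the first
SELECT FIELD_D, FIELD_E, FIELD_F
FROM TABLE_2;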
 Of course, it also makes sense that the columns represent the same idea. It may not
make sense to have customer name and a product type in the same column, even
though they are both text fields. But, strictly speaking, SQL won't prevent me from
doing that.
 Sometimes the reason I store data in different tables in the first place is because each source might have a number of elements unique to that particular source. I wouldn't necessarily want to try to force all those different elements into one common table.

Extending SQL queries using operators

 Operators are words or symbols that we use in our code to define some sort of
condition among data elements. Most of these are probably familiar to you in concept
but here we'll talk about how to use them in our SQL. Specifically we'll talk about three
types of operators, comparison operators, arithmetic operators, and logical operators.
Here's a summary of the operators we'll cover in each category. If you're just listening,
don't worry we'll list each of them as we go forward.
 The first type of operators we'll talk about are comparison operators. Comparison
operators help determine whether a condition between two fields, or functions of fields, is true or false.
 We touched on this in some of our earlier videos where we talked about WHERE and HAVING statements. In fact, that's primarily where these types of operators are used, to
establish some criteria by which rows or aggregate rows are filtered from a data set.
To use comparison operators we place them between two fields, functions of fields, or fixed values.
 The second type of operators are arithmetic operators, namely plus for addition,
minus for subtraction, star for multiplication and forward slash for division. Most SQL
engines also support the modulus operator, represented as a percent sign which
returns the remainder of one value divided by another value. We can use arithmetic
operators in a couple of different ways. We can use them in conjunction with
comparison operators in WHERE and HAVING statements to construct more complex conditions. We can also use them in a SELECT statement to create new calculated fields; here, for example, I add FIELD_A to FIELD_B in each row and put the result in a new field called FIELD_N. Be careful not to confuse this type of operation with aggregation functions like the SUM function, which are designed to aggregate data across rows when a GROUP BY command is used. Simple arithmetic operators work within one row of data across columns. However, I can use arithmetic operations with aggregation functions as well, like this.
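 A minimal sketch of both uses, with hypothetical table and field names:
-- Row-level arithmetic: a new calculated column for each row
SELECT FIELD_A + FIELD_B AS FIELD_N
FROM TABLE_1;

-- Arithmetic combined with aggregation, one ratio per group
SELECT FIELD_C,
       SUM(FIELD_A) / SUM(FIELD_B) AS ratio
FROM TABLE_1
GROUP BY FIELD_C;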
 The third type of operators we'll cover are logical operators. There are actually quite a few logical operators available in SQL, including some that are unique to specific SQL engines. Just about all of them are primarily used in WHERE or HAVING clauses as we are trying to define specific conditions for row or aggregate filtering. Here we will discuss a few of the most common logical operators that you're likely to use in day-to-day query writing. We'll start with two familiar logical operators called Boolean operators, AND and OR. We generally use these operators in a WHERE or HAVING clause where we want to include more than one condition in the clause. When we use AND, it means that all conditions in our statement need to be true. When we use OR, it means that at least one of the conditions in our statement needs to be true. I can represent the situation visually using a Venn diagram. In the diagram, the shaded portion represents the condition under which some overall statement is true.
 The next logical operator we'll discuss is the IN operator. The IN operator allows us to set up a condition to determine whether a field value or an expression is contained within a specific list of possible values. You can think of the IN operator as shorthand for a long list of OR conditions I might include in a WHERE or HAVING clause. For example, I could write this: WHERE FIELD_A = 'AAA' OR FIELD_A = 'BBB' OR FIELD_A = 'CCC'. Or I could write this: WHERE FIELD_A IN ('AAA', 'BBB', 'CCC'). As the list of possible values gets large, using the IN operator makes my code a lot simpler. It's also useful
when my list of possible values is defined by a sub-query. We discuss this case in a
separate video.
 Another time saving operator is the between operator, which allows us to set up a
condition where a field or expression is between two other values or expressions.
Again, this is really just a shorthand for a compound condition using comparison
operators and the AND operator. For example, I could write this: WHERE FIELD_A >= 10 AND FIELD_A <= 100. Or I could write
this: WHERE FIELD_A BETWEEN 10 AND 100. The LIKE operator is a powerful operator
that has a lot of flexibility. But its syntax is also a bit more involved. The like operator
basically searches for a specific set of characters in a string or text field and returns a
true result if there is a match. The tricky part about using like is that I usually need to
incorporate one or more wild card characters that indicate where in the string I expect
the pattern to occur. The two most common wild card characters we use are the
following. The percent symbol means any string of zero or more characters. The
underscore character means any single character. It's probably easiest to illustrate
how they work with a few examples.
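 A few illustrative sketches of LIKE conditions (the patterns are hypothetical):
WHERE FIELD_A LIKE 'ART%'     -- values that start with ART
WHERE FIELD_A LIKE '%ART'     -- values that end with ART
WHERE FIELD_A LIKE '%ART%'    -- values that contain ART anywhere
WHERE FIELD_A LIKE 'ART_'     -- ART followed by exactly one more character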
 One special condition will add to our discussion of logical operators is the IS NULL
condition. This is a special condition that is true when a field or expression is a null
which you may recall is a special database value that indicates the absence of data.
The syntax is pretty simple. For example, I use IS NULL in a WHERE clause like this,
WHERE FIELD_A IS NULL. In analytics we use this condition quite a bit to look for holes in our data, where the fields or rows do not have the data we need. The last
logical operator we will talk about is the NOT operator. NOT basically reverses the
logical meaning of other logical operators. Technically, I can use it with pretty much any operator, but there are only a few cases where it really makes sense. The first is in conjunction with the AND operator. Going back to the Venn diagrams, there's a third case you might be interested in, namely the one where one condition is true or the other condition is true, but not both conditions. Here's an example of a query that uses NOT with an AND operator to achieve this type of situation.
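 A minimal sketch of that kind of query, with hypothetical conditions; it keeps rows where one condition or the other is true, but excludes rows where both are true:
SELECT *
FROM TABLE_1
WHERE (FIELD_A = 'AAA' OR FIELD_B = 'BBB')
  AND NOT (FIELD_A = 'AAA' AND FIELD_B = 'BBB');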

Using SQL subqueries

 A subquery, also called a nested query or inner query, is a complete SQL query that exists within a larger SQL query, which we call the main query or outer query. There are a few reasons we might use subqueries. In data analytics, we're often trying to pull data in some unique way, often for the first time. As we think through the best way to pull the data, we might have multiple steps that we want to isolate and test to make sure they're doing exactly what we want them to do. Building queries in pieces from the inside out can allow us to more effectively test each step and get to our final
output more quickly. In other cases, there are operations that are very difficult or even
impossible to do without sub-queries, so we must use them if we want to accomplish
the task. Finally, depending on the nature of the database design and the hardware
and software involved, queries with subqueries might run faster and more efficiently
than those without them. Of course, the opposite can be true as well. You'll need to
learn what works best in your specific data environment. Before we get to the syntax
of subqueries, let's talk a little bit about what they actually do. We know that a SQL
query returns some set of data, essentially a two dimensional data table with some
number of columns and rows. If our query only has one column, it effectively returns a
one-dimensional list of values. When we use a sub-query in a larger SQL query,
we're basically just replacing a table name or list of values with the results of our sub-
query. Let's look at two of the most common uses of sub-queries.
o The first is in the WHERE clause of a query, when we're using the IN operator to filter a data set based on some column value being in a list of specific values. Without a subquery, the code might look like this: SELECT * FROM TABLE_A WHERE FIELD_A IN (VALUE_1, VALUE_2, ... VALUE_N). Let's say that
instead of a static list, I wanted to use a list from another query in the WHERE statement. I'd simply replace the list with a subquery, like this: SELECT * FROM TABLE_A WHERE FIELD_A IN (SELECT FIELD_B FROM TABLE_B). Just like the list itself, we put the subquery inside parentheses. Also remember that the IN operator is looking for a list of values, so my subquery must return only one column; otherwise the SQL engine won't know how to interpret it.
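 As a more concrete sketch using the tables from the join examples (the 'Paper' value is an assumption), this pulls every transaction for products made of a particular material:
SELECT *
FROM transactions
WHERE product IN (SELECT product       -- the inner query returns a single column
                  FROM products
                  WHERE material = 'Paper');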
 In fact the more complex the query, the more value you may find in using subqueries.
As we mentioned earlier, the ability to break down a query into testable parts can help ensure that your query is doing exactly what you want it to do.

4th week: Real world analytical organizations

Analytical organizations – roles

 We've outlined the overall process for how data is generated, captured, stored, extracted,
analyzed, and driven towards a business action. However, we haven't spent a lot of
time talking about who actually performs these operations. Now hopefully you as a
data analyst, will be doing many of these things. But there are a lot of other people
involved in creating and maintaining the data environment that you rely on to be
successful. In this video we'll outline some of the most common roles in data and
analytics organizations. Understanding who these people are and what they do in your
organization will help you both understand the environment better and help you to
identify the relationships you'll need to build to be most effective.
 Our value chain discussion in module one assumed a somewhat linear pathway
between events in the real world and action in the market place that's fueled by
information. However, real organizations aren't constructed around information alone.
It's more likely that your organization would be organized around functional groups
like marketing or finance, customer groups like enterprises or small businesses, or lines of business like a software group versus a hardware group. Furthermore, there's often
some sort of separation between organizations that focus on information technology,
or IT, and organizations that focus on the customer-facing side of the business. With
these different structures and divisions come people with specialized skills or
knowledge in each group. This presents a bit of a trade-off between technical skills and contextual acumen. It's often the case that resources with more technical skills tend to do the heavy lifting in the data environment, while those with more context tend to interpret data and make decisions. When we think about data and analytics we find
ourselves at the intersection of these worlds. We're one part IT, one part business and
we're one part technical and one part contextual. However, we can still think about
division of responsibilities in terms of the outputs that are produced.
 Let's outline the major functional activities that take place in a real data environment.
Specifically: data architecture, data management, reporting, ad-hoc analysis, and modeling. You'll notice that these functions are similar but not exactly the same as the classes of analytic tools we outlined in module two. The reason is that here we're focusing on who does the work, not how the work gets done. Each of these functions may use multiple classes of tools to get the job done. Let's define each of these functions.
o Data architecture refers to the design of the data environment to meet the needs of the enterprise.
o Data management involves the actual building and maintenance of the data
environment.
o Reporting, as we discussed in module two enables standard periodic
renderings of specific metrics or data relationships.
o Ad-hoc analytics broadly refers to directed analyses that seek to answer a
specific question, particularly one that is new or infrequent. If we find
ourselves doing the same thing repeatedly, we're really doing reporting.
However, there's a natural linkage between ad-hoc analytics and reporting. For
example, I may seek to answer a question that's never been asked before. I'd
execute an ad-hoc analysis to do that. If the analysis reveals something
interesting and I determine I'd like to see the result each week or each month,
I'd streamline and automate the results into a standard report.
o Finally, modeling refers to advanced analysis or application of data using higher-order techniques, including statistical procedures. Broadly, the three more analytic functions in this list, namely reporting, ad-hoc analysis, and modeling, may be described as data mining, a term that refers to the process of extracting useful information from data, or mining insights. Let's add one more subjective distinction to this list, though there are always exceptions to the rule. It turns out that the functions at the ends of the spectrum tend to rely a little more on creativity, design thinking, and problem solving, while those toward the center tend to rely more on operational and management skills. Think about building a house; the same type of idea applies. The architect who designs the house and the interior decorator who finishes it tend to have somewhat more creative personalities. The master plumbers, electricians, and carpenters who actually build out the structures and systems in the house are equally skilled but more focused on structural repeatability and robustness of function. The same things are true in our data environment.
 Now that we have a sense of the broad functions performed in the data environment, let's talk about the specific teams or roles that support all those functions. We'll start
with the more technical IT centric roles and move towards more analytical and
business related roles.
o Let's start with a couple of highly technical IT support areas, one is
infrastructure. Infrastructure teams manage the physical hardware and
connections that exist both inside the company and which link to the outside
world. Most of this activity will likely be transparent to users of the data but
it's critical to the operation of the data environment. Another area is system
and application development and administration. These teams build and
maintain systems that capture information for the business. They may also
provide ancillary functions like corporate IT that help to administer software
and other tools. Both of these types of teams are almost always located in an
IT organization or an organization devoted to software development.
o There are another set of technical roles that are more directly associated with
the data environment. We'll call these technical data management and
business intelligence delivery roles. The first of these is the data architect. The
data architect is responsible for the actual design of the data environment and
is usually the person responsible for structuring the data models used in
enterprise databases for data storage and access. This role is normally
found in an IT organization either in a data warehousing team or larger
enterprise architecture team. A second role is that of the database
administrator, or DBA. The DBA is broadly responsible for the database itself, including creation of the database and maintenance of the database to ensure stability, accessibility, and efficient performance. One important role the DBA can also play is helping analysts or other database users to tune their queries to run efficiently. This can be really helpful if you find your queries are failing or
taking a long time to run. A third role in this area is that of the ETL developer,
or more generally, a data integration developer. As a reminder, ETL stands for
extract, transform and load, or the process of taking data from one place,
manipulating it and placing it somewhere else. These developers are largely
responsible for populating a database and making sure the data is loaded
correctly into the various database structures. Both DBAs and ETL developers are also usually located in an IT organization, often within a data warehousing team. The last role we'll discuss in this area is the business intelligence, or BI, developer. The BI developer sits right on the boundary of what most organizations consider an IT function. This role can take a few different forms, but generally the BI developer manages some of the more technical aspects of a business intelligence tool set, including maintenance, and is often responsible
for the technical implementation and distribution of standard reports.
o Let's move on to some roles more closely aligned with data manipulation and
analysis. The first is the database analyst, who is someone who has the skills to
access the database directly usually by writing SQL queries, and who may have
the ability to do at least some analysis on the data. Some organizations
establish a layer of database analysts between the data environment and the more contextual analysts in the business. Consequently, these roles sometimes actually exist within an IT organization, although best practice is increasingly to locate them within business functions. A data analyst may or may
not access the database directly, but usually has enough additional context
about the business to execute a wide range of analyses on the data and draw conclusions. This is the central role around which most data analytics functions revolve in many organizations. The modeler is a more skilled extension of the data analyst. The modeler usually spends most of his or her time performing predictive and prescriptive analytics on data using sophisticated techniques, which is somewhat more advanced than a basic data analyst role. Both the
data analyst and the modeler are normally part of a business side organization,
either a functional team or a cross functional team dedicated to analytics. The
last role in this area is a bit different and is a role that is often misunderstood.
This is the role of the business analyst. Business analysis is not really a data
analytics function. It's the process of analyzing how a business works, normally
with the goal of identifying ways in which a process or business system can be
improved. Sometimes business analysis incorporates data, but unlike data
analytics it's not really the core objective. With these roles in particular, you
might be thinking that the distinctions are pretty fine, and that we seem to
have spoken to all these ideas already in the course. You're exactly right. Even though these roles have been separated in the past, what we're finding is that the best analytics are performed by individuals who can move seamlessly across all of them. They have the technical ability to pull data like the database analyst, but they also have the strong context and business understanding of the
business analyst. Not to mention the analytical chops to accomplish data
analytics and modeling. This idea is actually a pretty nice segue to the last role we'll discuss in this video, the data scientist. It turns out the term data scientist is used to describe a lot of different roles, depending on the organization. Sometimes it is used as an alternative to the modeler role we discussed earlier, especially when the organization is seeking someone with very deep knowledge of statistical procedures, like a PhD in statistics. Increasingly, however, organizations are using the term data scientist to describe exactly the type of blended role that blurs the lines between traditional roles. This person has a broad analytical skill set, but also has a high degree of context. They apply a scientific approach but are also master communicators who can explain how analytical findings translate to business action.

Analytical organizations – structures

 In this video, we're going to talk about how analytics teams in general are situated
within the larger organization. It turns out that how your organization thinks about
analytics can have a big impact not only on the types of analysis you perform, but on
the efficiencies of those analytics, as well as the speed at which your skills and
knowledge can evolve.
 The material in this video is derived both from our experience in advising companies
on how to develop high performing analytical capabilities. As well as building and
leading analytic teams of our own. For the purposes of this discussion we're really
talking about some combination of reporting, ad hoc analytics and modeling functions.
And roles like database analysts, data analysts, modelers and data scientists, although
other roles could be included as well.
 How analytic teams are structured within an organization tends to hinge upon one
basic question. How centralized or decentralized should these organizations be?
Should analytics activities be gathered under one team or should they be embedded
within many teams? As you might imagine, the answer depends on a number of
different factors. What we'll do here is present four different models for analytical
organizations that speak to different degrees of centralization. We'll describe each
model and outline some of its major pros and cons. To illustrate, we'll use a visual representation like this. What we have here is one circle in the center that represents functions that are centralized, with a ring of satellite circles that represent non-centralized teams based on the overall organizational structure of the business. These might be areas like marketing or finance in a functionally organized company, or teams aligned to customer groups or lines of business if the company is organized that way. We use dark shading to illustrate where analytics functions take place, and we'll show collaboration between the peripheral teams and the center using dotted lines. So, let's get started.
o We'll begin with a fully centralized model, where some set of analytical
activities are accomplished using one centralized team. For example, an
enterprise analytics team might serve the needs of marketing, finance, operations, customer care, etc. with respect to reporting, ad hoc analysis, and statistical modeling. A centralized model has a few key advantages. First, we
can usually achieve a higher level of consistency when analysis is done by a
single team, since it's easier to ensure that common methods are used from
one analysis to the next. It's also easier to ensure that the priorities of the
team, including what analyses are done and when, are aligned with the overall
needs of the enterprise versus the needs of only one group. Finally, analysis
tends to become very efficient in a centralized team since it gets a lot of practice, especially when the analytics are less contextual and more procedural. This usually means we get more analytical output out of fewer resources. However, these advantages come at a cost. In a centralized model, the team that executes the analysis is usually not the same team that asked for the analysis. Collaboration is required, and the requesting organization may not get priority over other needs. In this case, the centralized team is less responsive to peripheral organizations, and it's harder for those organizations to control their own destinies. A second disadvantage is around context: someone who sits in marketing and does marketing all the time is going to have a higher degree of marketing context than an analyst in the centralized organization, who may work on a lot of different things. Finally, while the centralized model requires fewer people, it does rely on some consistency in workload. It's harder to fill the plate of a centralized analytical team with other, non-analytical activities when the workload is light.
o The second model called the allocated model seeks to improve on the
responsiveness of the analytical organization while retaining most of the
benefits of a centralized approach. In this model, analytical activity is still accomplished using a centralized team, but within that team specific capacity is reserved for one or more of the peripheral functions. Usually, this means that the peripheral organization gets to make prioritization decisions around what work is done and when, up to the limits of their allocated capacity. Some additional capacity is reserved in the centralized team to handle ebbs and flows in peripheral demand, or to direct additional effort toward overall enterprise priorities. Again, the main benefit of this approach is improved responsiveness to the organization requesting analysis. It can also have the benefit of improving the context of the analyst group, especially when individuals are allocated to a single function for an extended period of time. At the same time, it's still possible to maintain a good level of consistency in methods, and the team still has some discretion to balance efforts and support broader priorities. The challenge with the allocated approach is that it can be hard to match the level of allocation desired with the overall needs of the enterprise. Inevitably, some organizations will end up with too much or too little capacity relative to the workload they receive. This approach also tends to require a slightly larger resource pool, since we remove some of the flexibility in how those resources can be deployed.
o Let's say we take this one step further and actually distribute our analytical
resources across functional teams, but put structures or processes in place to allow some degree of coordination across those teams. What we have in this case is called the coordinated model. In this model, the staffing and priorities of the analytical resources are completely controlled by functional teams. However, these teams are tied together by some set of governance structures, standard methodologies, or communities like user groups or centers of excellence. For example, let's say we had separate analytics teams within marketing, finance, and operations, but those teams regularly convened in a users group and participated in an enterprise-level data governance
program. This would be a coordinated model. The benefits of this approach build on those of the allocated model. With full control of resource priority, functional analytics teams can be highly responsive. And since they are integrated into specific functions, a high level of context within those functions can be achieved. At the same time, coordination across teams helps to ensure that there is some degree of consistency in method and approach. However, coordination across teams is notoriously difficult to achieve. It's also much more likely that work will be duplicated across organizations, or that different groups will come up with different answers to the same question. You'll hear this described as multiple versions of the truth, which can be a big drain on an organization's effectiveness. Finally, the coordinated approach tends to require more resources than more centralized structures, since each organization staffs to handle its own peaks in workload.
o The last organizational model we'll discuss is the distributed model, where analytic activities are wholly accomplished within peripheral organizations with little or no coordination. For example, the business and consumer divisions of a large bank might have completely separate analytics teams that work in
different locations and on different problems. The advantages of this model
are similar to the coordinated model. Namely, a very high degree of
responsiveness and context could be achieved. The team also has complete
flexibility in how they accomplish analytics since they don't necessarily need to
adhere to centralized standards. On the downside, there's little guarantee of
consistency in methods or even data sources. It's much more likely the efforts
may be duplicated, and this approach generally requires the largest number of
resources, since there are few mechanisms to identify overlap and streamline activities.
 So which of these models is preferred? Well, that really depends. There are
organizations that have found success using each of these models, and even
combinations of these models. Rather than ranking the models, why don't we look for
some factors that tend to make each model more or less viable in an organization? The
most significant factor that influences our organizational model is the size of the
company. It turns out that analytical organizations really start performing well when
they reach a critical mass of resources. The specific number varies, but it's usually at least a few analysts. Having multiple perspectives on the team generally encourages knowledge sharing, learning, and development of skills across team members. More senior members can help teach more junior members, and newer members may have new perspectives that complement the knowledge base of more tenured members. The
larger an organization is, the easier it is to create multiple teams with critical mass,
hence a higher likelihood that less centralized models will work well. It's entirely
possible that you'll find yourself as the only analyst on the team. Does this mean
you're set up for failure? Not necessarily, it just means that you'll need to work harder
to find other people to learn from, or you might need to get really good at self
teaching. The second factor influencing organizational design is how different the
analytical methods are across the organization. The more similar the methods, the more there is to be gained from a centralized model where many different resources can be brought together to execute the analysis. If the methods are completely different or require highly specialized knowledge, something more decentralized may work just fine. Depending on the nature of the work, the physical location of an organization's
functions can also have a meaningful impact on the viability of the model. When an
organization and its functions are centrally located, it's much easier to manage things
like responsiveness and collaboration with a centralized team. When organizational
groups are geographically distributed, especially at a global scale, it can make more
sense to embed some or all analytical functions in those groups via a decentralized
design. Even though telepresence technologies make remote collaboration much
easier, differences in time zones alone are often enough to push organizations in this
direction. The last factor we'll discuss is our old friend, context. As we noted earlier,
more decentralized models tend to place analysts closer to the business functions they
serve, which can build a higher degree of context in that area. Depending on how much
context and specialization is needed, a decentralized model might be more effective in
certain cases. More often than not, when companies grapple with how to set up
analytics organizations, it's usually a battle between responsiveness and scale. Most
organizations want the efficiency that some degree of centralization provides, but
want to retain the context and control of more decentralized models. For that reason,
many companies end up in either the allocated or coordinated structures. Of course,
this assumes the company has actually thought through how to best organize for
analytics. This may or may not be the case. It's entirely possible and even probable
that an organization is failing to reach its potential because of an ineffective structure.
This is where you have the opportunity to leverage what we've learned here. As you
explore your organization and get familiar with where and how analytics are done, you
can assess what's working well and where there might be opportunity. And over time,
you can help to influence the way your organization evolves.
 You probably won't be able to do this on day one, but there are some things you can
do that are consistent with the ideas we've covered. Let's say you do find yourself as
the lone analyst on a functional team. There's no reason why you couldn't seek out other analysts in other functions and put together an informal knowledge-sharing group.
The point here is that you can use the creativity and inquisitive nature that drew you
to analytics in the first place to help make your analytics organization the best it can
be. Now that you understand the common ways analytics are structured in an
organization and what factors drive those structures, you're well prepared to help lead
your organization in the right direction.

Data governance

 The idea of data governance is to put some structure around how data is managed and used in an organization by establishing rules and processes around a variety of data-related operations and decisions. In this video, we'll cover some of the most common areas addressed by data governance and how data governance might be set up in an organization. Let's start by discussing four major functions of data governance: Establishing & Maintaining Standards; Establishing Accountability for Data; Managing & Communicating Data Development; and Providing Information about the Data Environment.
o A primary role of data governance is to establish and maintain standards
around data. This can take a few different forms. The first, is identifying what
sources are preferred for each type of data or metric used in an organization.
There's an idea called Master Data Management, or MDM, which identifies the
most critical data within an organization and ensures there is a clear
understanding of where that data should come from and where it should be
stored. A related idea is that of common reference data. Generally speaking,
reference data provides sets of allowable values for certain data attributes, or
provides additional descriptive information about key ideas in the company's
data environment. Sometimes this data is loosely referred to as lookup data or dimensional data (a small illustration follows this item). Data governance helps to ensure that reference data is complete and accurate. Data governance also helps to establish common
definitions and calculations. The same term might have different meanings
across the organization. And different teams might use slightly different
calculations to arrive at the same metric. Governance helps to ensure that
everyone is on the same page and does things the same way. The last set of
controls are around data access and compliance. A governance process can help define who should have access to data under what circumstances, and is often applied in support of more general Sarbanes-Oxley (SOX) controls and data privacy concerns.
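To make the idea of reference data concrete, here is a minimal, hypothetical sketch in Python: a small lookup table of allowed state codes with descriptive attributes. The table contents, column names, and the customer record are illustrative assumptions, not material from the course.

    # A minimal, hypothetical sketch of reference (lookup) data: a table of
    # allowed codes plus descriptive attributes that other records can point to.
    STATE_REFERENCE = {
        "CO": {"state_name": "Colorado", "region": "West"},
        "CA": {"state_name": "California", "region": "West"},
        "NY": {"state_name": "New York", "region": "Northeast"},
    }

    customer_record = {"customer_id": 7421, "state_code": "CO"}

    # Governance keeps this reference complete and accurate so that codes in
    # data records can always be resolved to a known, described value.
    code = customer_record["state_code"]
    if code in STATE_REFERENCE:
        ref = STATE_REFERENCE[code]
        print(f"{code} -> {ref['state_name']} ({ref['region']})")
    else:
        print(f"Unknown state code {code}; flag for the data steward to review.")

In this framing, master data management would declare a table like this the preferred source for state codes, and governance would assign a steward to keep it complete and accurate.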
o The second major role of data governance is to establish and maintain
accountability for data. We'll talk a bit more about how data governance
programs are structured in a few minutes. But usually organizations assign
responsibility for specific data domains to individuals called data stewards.
Data stewards are generally accountable for ensuring that their area has the
correct definitions and are responsible for the overall state of their data
domain. Governance can also help identify who is responsible for addressing
various types of data quality issues.
o The third role of data governance, is to help manage the overall process of
data development and to communicate changes to the data environment. Lots
of teams use data, and every one of them probably has a laundry list of
additions or modifications they'd like to see implemented. However, there's
usually not enough capacity to accomplish them all and there needs to be
some way of prioritizing the work that needs to get done. Governance can
help by providing a process for vetting, assessing, and prioritizing which data
projects are undertaken, usually by rationalizing those projects against the
overall business priorities of the enterprise. Because data environments are
constantly evolving, there also needs to be some mechanism for letting the
users of the data know when new data is added. Or some change or
improvement is made. Having a well structured data governance approach can
facilitate communication about data and make sure everyone is informed and
aware of the changes.
o The last role that data governance plays, is in providing information about the
data environment itself. There's a broad class of activities called metadata
management, which helps to keep track of metadata, or data about data.
Given that we've gone through all the trouble of creating standard definitions
and calculations, it's generally useful to formally document them and provide
that documentation to the enterprise. We also might want to provide
information about the lineage of data and metrics, which traces where data
elements come from. Or keep a history of changes that have been made to a
data environment. All of these would fall under metadata management. We
also might want to provide information about the quality of certain data
domains or metrics. Governance mechanisms can help serve as a clearing
house for this type of information. Metadata can speak to the what and where
of the data environment, but it can also indicate how good the information is.
Likewise, governance can help keep track of the who, including tracking who
data stewards are and who may be involved in other data governance
functions. Users can consult this information to determine who to contact with
questions or concerns about the data.
 So that's what data governance does, let's switch gears a bit and talk about how it
works. There can be a lot of variance in how data governance is implemented within
an organization. However, there are a few characteristics that are almost always
present in a successful program.
o The first is cross-functional representation. The whole point of data
governance is to get everyone on the same page. To do that, everyone needs
to be involved. The best governance structures have broad participation across
technical and nontechnical teams, usually via something like a data
governance council that brings those groups together and addresses
governance issues.
o The second, is an ongoing process and schedule. A data governance council
doesn't do much good if it never convenes, or doesn't convene often enough.
Or if it doesn't make any decisions, or if it has no mechanism to execute on
decisions. A sound data governance program provides the structure.
o The third common element, is a set of defined roles. Someone needs to act as
the de facto leader of the program. This may be a Chair of the Governance Council or other leader. Earlier, we discussed the role of data stewards. Some
form of data stewardship or ownership is critical to a successful governance
program.
 Beyond these ideas, data governance structures can take many forms and you may see
some functions implemented in different ways. For example, sometimes an organization will formally staff a data governance team that develops and coordinates processes and handles things like metadata management or data quality. However, more often than not, data governance is executed virtually, with responsibilities rolled into the normal job functions of those on the cross-functional team. Likewise, the drive for data governance can come from different parts of the organization. In some cases, the function is executed out of IT. Other times it's driven by an analytics team. It's also quite common to see data governance driven by a functional group, like finance or operations. This doesn't change the cross-functional nature of the activity, but who takes the lead can say a lot about how the organization
thinks about data.
 Finally, some organizations may adopt more formal tools to assist in their data
governance efforts, while others take a less formal or manual approach. There are
robust software tools that can help with master data management, metadata
management, or data quality. But not all organizations find them necessary. It's not
uncommon to see organizations build their own tools or use informal documentation
methods, like Wikis or even standalone documents, to manage data governance
activities. At this point, we've covered the major roles that data governance programs play in organizations, as well as how those functions are executed, with some detail around what good programs have in common and how they tend to differ. Why is this
important to you as the data analyst? It's really all about knowledge, context, and the
ability for you to have confidence in the data you're using for analysis. Being tied into
your company's data governance program, or helping to create one if it doesn't exist,
can help you rapidly learn about what's available, understand what's good and what's not, and keep abreast of new additions or changes to the data environment. This in turn will ensure that you're always using the best data available for the job, and will help you produce the best insights you can and have confidence in your results.

Data privacy

 Information is at the core of just about everything we do in the world of data analytics.
And an awful lot of the data we use is in some way, shape or form related to real
people. Increasingly, concerns are being raised in the public about the amount of information that is stored and accessible by both private organizations and governments. This includes the risk of identity theft and other adverse consequences for consumers and citizens, should data that they consider private somehow be made public or used for purposes that are unsanctioned by the individuals in question. In our roles as data analysts, one of the most critical questions we have to ask is how can we, or how should we, use data. In this video we're going to take a broad look at the idea of data privacy by introducing four levels of standards that guide how we use data that might be considered sensitive. However, we're not going to go into a lot of
detail for a couple of reasons. First, the set of laws and regulations that govern data
privacy is extensive and very complex and those regulations differ depending on where
you are. Secondly, the data privacy landscape is changing very rapidly and what's true
today might not be true tomorrow. Nonetheless, what we will do is give you a sense
for some common definitions and the types of regulations that are out there. Our
discussion will be slanted towards the data privacy environment in the United States
but the same basic ideas will apply more globally.
 Let's outline these four levels of standards. The top level is legal standards, which are established by law, order, or rule to compel treatment of certain classes of data. Legal standards must be followed by any organizations subject to them. There's not a lot of choice in the matter, and consequences can be severe if legal standards are not followed. The second level is ethical standards. These standards are established by industry or professional organizations which seek to achieve some level of non-legally
binding treatment of information. There can be consequences for violating these
standards but they are usually imposed outside of the courts. The third level of
standards is policy standards, which are internal standards established by an
organization to guide its own treatment of data, usually through something like a
privacy policy. The company decides how to enforce these standards. The last level of
standards is simply what we might call good judgment. Even if some action is not prohibited by legal, ethical, or policy standards, we should always ask ourselves: is this really a good idea, and what might the consequences of using data in a certain way be?
 We're going to go into each of these areas in a bit more detail but we'll spend the most
time discussing a few types of data and the legal standards attached to them.
 Let's start with something called Personally Identifiable Information or PII. Like most
terms associated with data privacy, PII has a long definition. As defined by the US
National Institute of Standards and Technology, or NIST, PII includes any information about an individual maintained by an agency, including: one, any information that can be used to distinguish or trace an individual's identity, such as name, social security number, date and place of birth, mother's maiden name, or biometric records; and two, any other information that is linked or linkable to an individual, such as medical,
educational, financial, and employment information. Here are some examples of what
is considered PII. All or part of someone's name, including maiden name. Any
identification number, address information, personal physical characteristics including
images. And any number of things that may be linked to one or more of these
definitive identifiers. The linked data part of the PII definition is particularly interesting,
as it includes just about anything that I could conceivably link to an individual. In the
area of Internet connectivity and big data the ability to link information across
desperate domains has never been greater. In fact, both the National Institution of
Standards and the US Office of Management and Budget, OMB have recognized how
easy it might be to identifying individuals. Let's read part of their findings. A common
misconception is that PII only includes data that can be used to directly identify or
contact an individual or personal data that is especially sensitive. The OMB and NIST
definition of PII is broader. The definition is also dynamic and can depend on context.
Data elements that may not identify an individual directly (for example, age, height, birth date) may nonetheless constitute PII if those data elements can be combined with or without additional data to identify an individual. In other words, if the data are linked or can be linked to a specific individual, the data are potentially PII. Moreover, what
can be personally linked to an individual may depend on what technology is available
to do so. As technology advances, computer programs may scan the Internet with wider scope to create a mosaic of information that may be used to link information to an individual in ways that were not previously possible. This is often referred to as the mosaic effect. The implications of this mosaic effect are significant. Let's put a really fine point on this with a little math: it takes a surprisingly small number of data points to uniquely identify a very large number of people (a small worked example follows this item). Of course, those data points need to be the right ones, but you can see how this mosaic effect can have very real implications for data privacy. What's really interesting about PII is
that while there are quite a few legal standards they tend to be narrowly associated
with specific government agencies or specific use cases especially in the United States.
International standards are a bit more stringent, but there's surprisingly little overarching legislation that restricts how personal information can be used.
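 Here is the worked example referenced above: a minimal back-of-the-envelope sketch in Python of why a few attributes can act as an identifier. The attribute cardinalities and the population figure are rough assumptions for illustration, not figures from the course or from any regulation.

    # Rough cardinalities of three seemingly innocuous attributes (assumptions).
    us_population = 330_000_000            # approximate US population

    attribute_cardinalities = {
        "zip_code": 42_000,                # roughly the number of US ZIP codes
        "birth_date": 365 * 100,           # any birthday over a ~100-year span
        "gender": 2,
    }

    # Count the distinct combinations these attributes can form.
    combinations = 1
    for values in attribute_cardinalities.values():
        combinations *= values

    print(f"Distinct combinations: {combinations:,}")                 # ~3.1 billion
    print(f"Combinations per person: {combinations / us_population:.1f}")

 Because there are roughly ten times more combinations than people, most combinations map to at most one person, which is why a handful of fields like these can often single someone out even though none of them identifies anyone on its own.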
 The second type of information we'll discuss is consumer financial information, or CFI.
CFI is defined in the US by the Gramm-Leach-Bliley Act, also known as the Financial
Services Modernization Act of 1999, as follows. CFI is any information that is not
publicly available. And that a consumer provides to a financial institution to obtain a
financial product or service from the institution. Results from a transaction between
the consumer and the institution involving a financial product or service. Or that a
financial institution otherwise obtains about a customer in connection with providing a
financial product or service. This definition is further incorporated into a variety of
Federal Trade Commission and Securities and Exchange Commission guidelines, as well
as into the Fair Credit Reporting Act. Regulations around CFI are a little more concrete
than those around PII. Here are some general features of CFI legislation. First, it generally applies to financial institutions and those who collect nonpublic personal information from customers, consumers, or financial institutions. The rules include a number of specific provisions on how account numbers and other specific pieces of information must be treated. However, most of the rules are around disclosure versus prescription of what's allowed or not allowed. This means that they don't so much restrict what we can do with information, but rather outline what we need to tell customers about that information and what options customers must have to restrict use of their information. An important detail here is that these regulations default
to an opt-out posture, meaning that if customers don't want their information used in certain ways, they must actively opt out of those uses.
 Let's move on to another type of customer information called customer proprietary
network information, or CPNI. CPNI is collected by telecommunications companies about a customer's telephone calls. It includes the time, date, duration, and destination number of each call; the type of network a customer subscribes to; and any other information that appears on the customer's telephone bill. Importantly, this definition does not explicitly include non-telephone activity like web browsing, although there are varying legal opinions on whether this type of information is covered under CPNI
regulations. CPNI regulations are generally governed by the US Telecommunications
Act of 1996 and the 2007 Federal Communications Commission or FCC CPNI Order.
There are also broader statutes like the Electronic Communications Privacy Act of 1986
and the Communications Assistance for Law Enforcement Act of 1994 or CALEA which
speak to the conditions under which the government can access this and other types of electronic data. Here are some key provisions of CPNI legislation. First, it limits the information which carriers may provide to third-party marketing firms without first securing the affirmative consent of their customers. It also defines when and how customer service representatives may share call details. It establishes notification and reporting obligations for carriers, as well as identity verification procedures, including a specific requirement that the verification process must include a match between information provided by a person and what is shown in a company's systems. There
are a couple of interesting details to these rules. For one thing, they do allow a
company to freely share information with any other communications company which
is a pretty broad set of players. Secondly, like CFI rules, CPNI regulations take an opt-
out posture by default.
 The last type of information we'll talk about is Protected Health Information or PHI.
PHI is considered one of the most sensitive types of information and consequently it's
among those tightly controlled and regulated. In the US, PHI is defined under the
Health Insurance Portability and Accountability Act of 1996, or HIPAA. The definition has three parts and reflects the detail involved. One, PHI is created or received by a health
care provider, health plan, employer, or health care clearinghouse. Two, it relates to
the past, present or future physical or mental health or condition of an individual, the
provision of health care to an individual or the past, present or future payment for the
provision of health care to an individual. And which either identifies the individual or
with respect to which there is a reasonable basis to believe the information can be
used to identify the individual. And three, is maintained in electronic media, or
transmitted or maintained in any other form or medium. The provisions of HIPAA around PHI are pretty complex, and we won't get into the details here. However, they are broadly covered under a privacy rule, which speaks to the safeguards that must be taken to protect PHI in any form, and a security rule, which provides additional measures that must be taken when information is stored electronically. The rules apply to health care providers, health plans, and health care clearinghouses, and they include a lot of specific provisions around how certain types of data need to be treated, including the stripping out of identifiable information and other precautions.
There are a couple of important exclusions to HIPAA regulations. First, they exclude
education records covered by the Family Educational Rights and Privacy Act. They also
exclude employment records held by a covered entity in its role as employer.
 In the business world it turns out that some of the more relevant ethics and standards
bodies operate in the area of marketing, which makes sense as we're generally
interacting with customers through some sort of marketing activity or interface. These include the Direct Marketing Association, which provides broad guidelines on how to interact with customers; the Digital Advertising Alliance, which adds guidance on first-party data collection; and the Network Advertising Initiative, which addresses third-party data collection and the practice of sharing data through data exchanges.
However, the ability for these types of organizations to enforce their standards is
much weaker. Companies typically comply out of choice, not out of necessity. But it's
generally good practice to comply with these guidelines anyway.

Data quality

 If you've spent any time around computers you've almost certainly heard the term
garbage in, garbage out, meaning that the quality of your outputs is only as good as the quality of your inputs. This principle absolutely applies when we talk about data analytics. Your ability to execute an analysis and have confidence in the results has a huge dependence on the overall quality of the data you use. However, data quality can be a pretty tricky thing to manage, and usually requires a fairly relentless focus on preventing, detecting, and remediating issues.
 So what exactly is data quality? There are two overarching definitions that we might apply. The first, and the one you see in most technical articles or standards documents, is the fitness-for-use or meets-requirements definition. This definition basically says that data quality is the degree to which data can be used for its intended
purpose. The second definition is a bit more philosophical and suggests that data
quality is the degree to which data accurately represents the real world. If you've been
paying attention in this course, you know how fond we are of understanding what is
happening in the real world, so we really like that definition. However, in the real
world you also have to get things done. So we think there's a lot of value to the first
definition as well. The good news is, we really don't have to choose. We can do our
best to make our data as representative of the real world as practical. And we can
decide when we've gotten close enough that the decisions we make and the actions
we take are sound ones.
 We can also add a bit more detail around what it means for data to be of high quality.
There are a few characteristics that generally help to define good data; a short sketch of how these might be checked in practice follows this list.
o The first is completeness or a measure of whether or not we have all the data
we expect to have. Are we capturing all events we should be capturing? When
we capture an event, do we have all the attributes of that event that we
expect to have? If we use reference data, are all the values in that reference
data accounted for? A related idea is uniqueness, which is basically the
opposite of completeness. For example, if I record one event, am I sure I've
recorded it only once and not multiple times?
o A second idea is accuracy, a measure of whether the data we have is an accurate representation of the idea it's trying to capture. If the data point is a
number, is it the right number? If it's a string, is it the right string and is it
spelled correctly? Are timestamps and other attributes correctly captured?
The concept of consistency is an extension of accuracy. Do I capture the same
data the same way every time? Or if I capture it in two different places, do I
have the same values?
o A third measure is what we might call conformance or validity: whether stored data conforms to the syntax, coding, and other specifications of a data model.
Is data stored in the correct format? If codes are used for attributes, are they
the expected codes? Are pieces of data named using the conventions that
have been established for a system or database?
o A fourth measure is timeliness, which speaks to whether data is captured or
made available soon enough after a real world event for it to be useful. You
might hear the term data latency to describe how long it takes for data to be
available for something like reporting or analytics. For example, if you need to
make a same day decision and the data isn't available until the next day, the
data is of low usefulness for that purpose, and of low quality per our fitness for
use definition.
o The fifth and final measure we'll include is provenance, which is the degree to
which we have visibility into the origins of the data. This is kind of a second-
order measure, but speaks to how much confidence we have that the data
we're looking at is real and is accurate.
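 Here is the sketch referenced above: a minimal example, using Python and pandas, of how a few of these characteristics might be checked automatically. The table, column names, reference values, and thresholds are hypothetical and purely illustrative.

    import pandas as pd

    # A tiny, made-up orders table with deliberate quality problems.
    orders = pd.DataFrame({
        "order_id": [1001, 1002, 1002, 1004],              # 1002 is duplicated
        "state": ["CO", "CA", "CA", "ZZ"],                 # "ZZ" is not a valid code
        "amount": [25.0, None, 40.0, 12.5],                # one amount is missing
        "event_time": pd.to_datetime(
            ["2024-01-02 09:00", "2024-01-02 09:05",
             "2024-01-02 09:05", "2024-01-02 11:30"]),
        "loaded_time": pd.to_datetime(
            ["2024-01-02 10:00", "2024-01-02 10:00",
             "2024-01-02 10:00", "2024-01-03 12:00"]),
    })
    valid_states = {"CO", "CA", "NY"}                      # reference data

    missing_amounts = orders["amount"].isna().sum()                # completeness
    duplicate_ids = orders["order_id"].duplicated().sum()          # uniqueness
    invalid_states = (~orders["state"].isin(valid_states)).sum()   # conformance/validity
    latency_hrs = (orders["loaded_time"]
                   - orders["event_time"]).dt.total_seconds() / 3600   # timeliness

    print(f"missing amounts: {missing_amounts}, duplicate ids: {duplicate_ids}, "
          f"invalid state codes: {invalid_states}, max latency: {latency_hrs.max():.1f} hours")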
 It makes sense that if I'm somehow able to measure each of these characteristics of
my data, I'll get a pretty good idea of how good that data is. So let's talk a little bit
more about how and where we might take these measures and what things we might do to manage data quality.
 It's useful to bring back the first part of the information action value chain we
discussed in module one, which you may recall looks like this. We start with events
and characteristics in the real world, capture data in source systems, store data,
extract data, and execute our analysis. We can also look at this framework from more
of a systems point of view like this.
 The most effective way to address data quality issues is to prevent them from ever
happening in the first place by controlling how data is captured at the source. How we do this depends on exactly how the data is captured. One of the biggest drivers of bad data is errors introduced via manual data entry by people, whether they're customers, other outside partners, or our own employees. To help minimize these types of errors, organizations might build in validation mechanisms or auto-populate certain pieces of information. For example, an online form might force you to enter a valid phone number in a specific format, make you use a dropdown box to choose the state where you live, or even prepopulate your city based on the ZIP code you enter. It also might not let you submit or proceed unless all the required fields are filled in. Generally speaking, the less information that is typed into free-form fields, the higher the quality of the data will be; a small sketch of this kind of entry validation follows.
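 Here is the sketch referenced above: a minimal, hypothetical example in Python of the kind of entry validation a source system might perform before accepting a record. The field names, phone format, and allowed values are assumptions made for illustration.

    import re

    VALID_STATES = {"CO", "CA", "NY", "TX"}               # dropdown-style reference list
    PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")    # e.g., 303-555-0100
    REQUIRED_FIELDS = ["name", "phone", "state", "zip"]

    def validate_form(form):
        """Return a list of validation errors; an empty list means the entry is accepted."""
        errors = []
        for field in REQUIRED_FIELDS:                      # required-field check
            if not form.get(field):
                errors.append(f"{field} is required")
        if form.get("phone") and not PHONE_PATTERN.match(form["phone"]):
            errors.append("phone must look like 303-555-0100")       # format check
        if form.get("state") and form["state"] not in VALID_STATES:
            errors.append("state must come from the allowed list")   # reference check
        return errors

    print(validate_form({"name": "A. Analyst", "phone": "303-555-0100",
                         "state": "CO", "zip": "80309"}))   # [] -> accepted
    print(validate_form({"name": "", "phone": "5550100", "state": "XX", "zip": "80309"}))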
 Other capture mechanisms might be driven by the design of the source system itself.
Sometimes there are bugs in source applications that result in bad data. In these cases
the best solution is to find the bugs and fix them. In both cases, we seek to improve data quality by preventing problems from the start.
 However, even the best data capture mechanisms aren't perfect, and it's inevitable
that some bad data will make it into the source system. If we're smart, we can still try
and catch it before it makes its way downstream to other systems. The way we do this
is with some set of automated checks that run against the data within a source system and look for one or more of the quality characteristics we discussed earlier. In some cases it may be possible to automatically correct data or fill in missing values. In other
cases, errors can be flagged and picked up by some remediation or maintenance
process.
 Again, catching issues early is almost always better, but it's not always possible. Our
next opportunity to enforce quality is in the process used to bring data into a common
location, usually a database of some sort. There are a couple things we typically do in
ETL processes or other data loading operations that help with quality. First, we apply what are called audit, balance, and control operations to our jobs. These operations generally make sure that the transfer process itself happens as intended, and that we don't actually introduce data quality problems as those jobs run. There are a lot of different ways these operations can be set up, but usually they involve constructing summary metrics on both sides of the transfer and ensuring that they balance, as sketched below.
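 As a concrete illustration of the balancing idea, here is a minimal sketch in Python that compares control totals from a source extract with the rows actually loaded into a target. The record layout and values are assumptions for illustration only.

    # Source extract and the rows that actually arrived in the target table.
    source_rows = [{"order_id": 1, "amount": 25.0},
                   {"order_id": 2, "amount": 40.0},
                   {"order_id": 3, "amount": 12.5}]
    target_rows = [{"order_id": 1, "amount": 25.0},
                   {"order_id": 2, "amount": 40.0}]        # one record was lost in transit

    def control_totals(rows):
        """Build simple summary metrics: row count and summed amount."""
        return {"row_count": len(rows),
                "total_amount": round(sum(r["amount"] for r in rows), 2)}

    source_totals = control_totals(source_rows)
    target_totals = control_totals(target_rows)

    if source_totals != target_totals:
        # In a real job this would raise an alert or fail the load for remediation.
        print(f"Out of balance: source={source_totals}, target={target_totals}")
    else:
        print("Load balanced: control totals match.")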
 Secondly, we can actually write ETL code in such a way that data is standardized, forced into a common format, or even filled in using reference information as it's loaded. We can even take this a step further and enforce what's called referential integrity in our database, which basically means that all reference data contained in data records must have known values in related reference tables. This helps ensure that no unknown attribute values can enter the database. Of course, it's also a good idea to set some sort of flag or alert when an unknown value is observed and filtered out. Despite our best efforts, it's still possible that bad data has made its way into our database, or worse yet, that we somehow introduce some errors in moving it from one place to another. Just like we did in our source systems, we can set up automated checks for data quality characteristics in our database, but there are a couple of differences. First, because I'm potentially getting data from multiple places, and because I may store it for a lot longer in my database, I may be able to use some of that other data to make sense of what I'm seeing from one particular source system. This allows my data quality checks to be a bit more sophisticated, and it's not uncommon to see statistical techniques like those used in statistical process control to detect and alert when a metric has exceeded some normal threshold, as in the sketch below.
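 Here is the sketch referenced above: a minimal, hypothetical control-limit check in Python that flags a daily metric when it falls outside a band derived from recent history. The history, today's value, and the three-sigma threshold are illustrative assumptions, not a prescribed method.

    import statistics

    daily_order_counts = [1020, 980, 1005, 995, 1010, 990, 1000]   # recent history
    today_count = 650                                              # value just loaded

    mean = statistics.mean(daily_order_counts)
    stdev = statistics.stdev(daily_order_counts)
    lower, upper = mean - 3 * stdev, mean + 3 * stdev              # control limits

    if not (lower <= today_count <= upper):
        # In practice this would trigger an alert for investigation, not just a print.
        print(f"Alert: today's count {today_count} is outside [{lower:.0f}, {upper:.0f}]")
    else:
        print("Metric within normal range.")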
 On the other hand, because more steps have happened between the event itself and
the data in my database, my checks may not be as precise at telling me exactly where
the problem is. I may need to backtrack through the source system to isolate the issue.
 Now, if we've done a good job we should catch the vast majority of issues long before
they end up in a downstream report or analysis. However, something will inevitably
slip through, and every time we manipulate data we have the potential for new errors.
It turns out that we can implement checks similar to those we've discussed at each
downstream step in the process, including our reporting and analytics.
 At the end of the line is what we might call the eyeball check. The person reviewing a
report or interpreting analysis needs to know enough about the business to recognize
when data looks fishy. While we certainly don't want to rely on that to catch quality
issues, it is the method of last resort. So given all the options we have for where we
execute data quality, the question becomes, which ones do we use? The answer is
simple: all of them. The best data quality programs use a multi-faceted approach that
puts quality controls at every step of the process. They also integrate these checks into
a larger coordinated process that ensures data quality issues are fully investigated and
remediated by the correct team. This is one reason why data quality is often
integrated into an organization's larger data governance process. This way, the same
accountability structures and communication mechanisms can be leveraged to identify
and resolve issues. As an analyst you have an important role to play in data quality. In
addition to being one of those performing the eyeball check, you will be spending a lot of time poring through the data, and you'll almost certainly come across something that doesn't look right from time to time. When you do, take the initiative and get it resolved using whatever structures are present in your organization.
