
Foundation of Data Analysis

Discover the essential principles and techniques of data analysis with this comprehensive ebook. Whether you're a beginner or an experienced professional, this book provides a solid foundation in statistical concepts, data visualization, and data interpretation. Learn how to effectively collect, clean, and analyze data to make informed business decisions. Packed with real-world examples and practical exercises, this ebook is the perfect resource for anyone looking to sharpen their data analysis

Data analysis

Introduction
Definition of Data and Data Analysis
Decision Intelligence
Understand the Data ecosystem
EMC's data analysis process
SAS's iterative process
Big data analytics process
Data-driven decision making
Data and gut instinct
Key data analyst skills
Analytical thinking for effective outcomes
Explore core analytical skills
Data life cycle
Case Study
More on the phases of data analysis
The ask phase
The prepare phase
The process phase
The analyze phase
The share phase
The act phase
Key data analyst tools
Spreadsheets
Databases and query languages
Visualization tools
Choose the right tool for the job
Step-by-Step: Make spreadsheets your friend
SQL in action
What is a query?
Example of a query
Multiple columns in a query
Endless SQL possibilities
Capitalization, indentation, and semicolons
WHERE conditions
SELECT all columns
Comments
Example of a query with comments
Aliases
Example of a query with aliases
Putting SQL to work as a data analyst
Plan a data visualization
Build your data visualization toolkit
Spreadsheets (Microsoft Excel or Google Sheets)
Visualization software (Tableau)
Programming language (R with RStudio)
Consider fairness
Data analyst roles and job descriptions
Decoding the job description
Job specializations by industry


Introduction

Definition of Data and Data Analysis

Data is a collection of facts (pictures, words, numbers, and so on).
Data analysis is the collection, transformation, and organization of data in order to draw
conclusions, make predictions, and drive informed decision-making.

The six steps of the data analysis process that you have been learning in this program are:

1. Ask: business challenge, objective, or question
2. Prepare: data generation, collection, storage, and data management
3. Process: data cleaning and data integrity
4. Analyze: data exploration, visualization, and analysis
5. Share: communicating and interpreting results
6. Act: putting insights to work to solve the problem

These six steps apply to any data analysis.

Ask
First up, the analysts needed to define what the project would look like and what would
qualify as a successful result. So, to determine these things, they asked effective questions
and collaborated with leaders and managers who were interested in the outcome of their
people analysis. These were the kinds of questions they asked:

● What do you think new employees need to learn to be successful in their first year on
the job?
● Have you gathered data from new employees before? If so, may we have access to
the historical data?
● Do you believe managers with higher retention rates offer new employees something
extra or unique?
● What do you suspect is a leading cause of dissatisfaction among new employees?
● By what percentage would you like employee retention to increase in the next fiscal
year?

Prepare
It all started with solid preparation. The group built a timeline of three months and decided
how they wanted to relay their progress to interested parties. Also during this step, the
analysts identified what data they needed to achieve the successful result they defined in
the previous step. In this case, the analysts chose to gather the data from an online survey
of new employees. These were the things they did to prepare:
● They developed specific questions to ask about employee satisfaction with different
business processes, such as hiring and onboarding, and their overall compensation.
● They established rules for who would have access to the data collected. In this case,
anyone outside the group wouldn't have access to the raw data, but could view
summarized or aggregated data. For example, an individual's compensation wouldn't
be available, but salary ranges for groups of individuals would be viewable.
● They finalized what specific information would be gathered, and how best to present
the data visually. The analysts brainstormed possible project- and data-related issues
and how to avoid them.
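
The access rule above (raw data stays inside the group, and only salary ranges for groups are viewable) can be sketched in a few lines of Python. This is a minimal illustration, not the analysts' actual method: the records, the department grouping, and the `MIN_GROUP_SIZE` threshold are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical raw survey records: (employee_id, department, annual_salary).
raw_records = [
    ("e01", "Sales", 52000),
    ("e02", "Sales", 61000),
    ("e03", "Sales", 58000),
    ("e04", "Engineering", 88000),
    ("e05", "Engineering", 95000),
    ("e06", "Support", 47000),
]

MIN_GROUP_SIZE = 2  # assumed threshold; smaller groups are withheld


def summarize_salary_ranges(records, min_group_size=MIN_GROUP_SIZE):
    """Return per-department salary ranges without exposing individual rows."""
    by_department = defaultdict(list)
    for _employee_id, department, salary in records:
        by_department[department].append(salary)

    summary = {}
    for department, salaries in by_department.items():
        if len(salaries) < min_group_size:
            continue  # a one-person "range" would reveal that person's salary
        summary[department] = {
            "count": len(salaries),
            "min": min(salaries),
            "max": max(salaries),
        }
    return summary


salary_summary = summarize_salary_ranges(raw_records)
for department, stats in sorted(salary_summary.items()):
    print(department, stats)
```

Only the summary would be shared outside the group. The single Support employee is left out because a range built from one person would expose that person's compensation.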

Process
The group sent the survey out. Great analysts know how to respect both their data and the
people who provide it. Since employees provided the data, it was important to make sure all
employees gave their consent to participate. The data analysts also made sure employees
understood how their data would be collected, stored, managed, and protected. Collecting
and using data ethically is one of the responsibilities of data analysts. In order to maintain
confidentiality and protect and store the data effectively, these were the steps they took:

● They restricted access to the data to a limited number of analysts.
● They cleaned the data to make sure it was complete, correct, and relevant. Certain
data was aggregated and summarized without revealing individual responses.
● They uploaded raw data to an internal data warehouse for an additional layer of
security.

Analyze
Then, the analysts did what they do best: analyze! From the completed surveys, the data
analysts discovered that an employee’s experience with certain processes was a key
indicator of overall job satisfaction. These were their findings:

● Employees who experienced a long and complicated hiring process were most likely
to leave the company.
● Employees who experienced an efficient and transparent evaluation and feedback
process were most likely to remain with the company.

The group knew it was important to document exactly what they found in the analysis, no
matter what the results. To do otherwise would diminish trust in the survey process and
reduce their ability to collect truthful data from employees in the future.

Share
Just as they made sure the data was carefully protected, the analysts were also careful
in sharing the report. This is how they shared their findings:

● They shared the report with managers who met or exceeded the minimum number of
direct reports with submitted responses to the survey.
● They presented the results to the managers to make sure they had the full picture.
● They asked the managers to personally deliver the results to their teams.

This process gave managers an opportunity to communicate the results with the right
context. As a result, they could have productive team conversations about next steps to
improve employee engagement.

Act
The last stage of the process for the team of analysts was to work with leaders within their
company and decide how best to implement changes and take actions based on the
findings. These were their recommendations:

● Standardize the hiring and evaluation process for employees based on the most
efficient and transparent practices.
● Conduct the same survey annually and compare results with those from the previous
year.

A year later, the same survey was distributed to employees. Analysts anticipated that a
comparison between the two sets of results would indicate that the action plan worked.
Turns out, the changes improved the retention rate for new employees and the actions taken
by leaders were successful!


4 EXAMPLES OF BUSINESS ANALYTICS IN ACTION

Decision Intelligence

Decision Intelligence is a combination of applied data science and the social and managerial
sciences. It is all about harnessing the power and beauty of data.

Understand the Data ecosystem


An ecosystem is a group of elements that interact with one another. Data ecosystems are
made up of various elements that interact with one another in order to produce, manage,
store, organize, analyze, and share data. These elements include hardware and software
tools, and the people who use them. Data can also be found in something called the cloud.
The cloud is a place to keep data online, rather than on a computer hard drive. So instead of
storing data somewhere inside your organization's network, that data is accessed over the
internet. So the cloud is just a term we use to describe the virtual location.

EMC's data analysis process


EMC Corporation's data analytics process is cyclical with six steps:
1. Discovery
2. Preprocessing data
3. Model planning
4. Model building
5. Communicate results
6. Operationalize

EMC Corporation is now Dell EMC. This model, created by David Dietrich, reflects the
cyclical nature of typical business projects. The phases aren’t static milestones; each
step connects and leads to the next, and eventually repeats. Key questions help analysts
test whether they have accomplished enough to move forward and ensure that teams have
spent enough time on each of the phases and don’t start modeling before the data is ready.
It is a little different from the data analysis process on which this program is based, but it has
some core ideas in common: the first phase focuses on discovering and asking
questions; data has to be prepared before it can be analyzed and used; and then findings
should be shared and acted on.

For more information, refer to this e-book, Data Science & Big Data Analytics.

SAS's iterative process


An iterative data analysis process was created by a company called SAS, a leading data
analytics solutions provider. It can be used to produce repeatable, reliable, and predictive
results:

1. Ask
2. Prepare
3. Explore
4. Model
5. Implement
6. Act
7. Evaluate

SAS emphasizes the cyclical nature of its model by visualizing it as an infinity
symbol. Its process has seven steps, many of which mirror the other models, like ask,
prepare, model, and act. But this process is also a little different; it includes a step after the
act phase designed to help analysts evaluate their solutions and potentially return to the ask
phase again.

For more information, refer to Managing the Analytics Life Cycle for Decisions at Scale

Project-based data analytics process


A project-based data analytics process has five simple steps:

1. Identifying the problem
2. Designing data requirements
3. Preprocessing data
4. Performing data analysis
5. Visualizing data

This data analytics project process was developed by Vignesh Prajapati. It doesn’t include
the sixth phase, or the act phase. However, it still covers a lot of the same steps described
above. It begins with identifying the problem, continues with preparing and processing data
before analysis, and ends with data visualization. For more information, refer to
Understanding the data analytics project life cycle.

Big data analytics process

Authors Thomas Erl, Wajid Khattak, and Paul Buhler proposed a big data analytics process
in their book, Big Data Fundamentals: Concepts, Drivers & Techniques. Their process
is divided into nine steps:

1. Business case evaluation
2. Data identification
3. Data acquisition and filtering
4. Data extraction
5. Data validation and cleaning
6. Data aggregation and representation
7. Data analysis
8. Data visualization
9. Utilization of analysis results

This process appears to have three or four more steps than the previous models. But in
reality, the authors have just broken down what has been referred to as preparation and
process into smaller steps. The process emphasizes the individual tasks required for
gathering, preparing, and cleaning data before the analysis phase. For more information,
refer to Big Data Adoption and Planning Considerations.

Data-driven decision making


Data-driven decision-making is defined as using facts to guide business strategy.
It empowers organizations in many different industries to make better decisions. The first
step in data-driven decision-making is figuring out the business need. Usually, this is a
problem that needs to be solved. For example, a problem could be a new company needing
to establish better brand recognition, so it can compete with bigger, more well-known
competitors. Or maybe an organization wants to improve a product and needs to figure out
how to source parts from a more sustainable or ethically responsible supplier. Or, it could be
a business trying to solve the problem of unhappy employees and low levels of engagement,
satisfaction, and retention. Whatever the problem is, once it's defined, a data analyst finds
data, analyzes it, and uses it to uncover trends, patterns, and relationships. Sometimes the
data-driven strategy will build on what's worked in the past. Other times, it can guide a
business to branch out in a whole new direction.
Let's look at a real-world example. Think about a music or movie streaming service. How do
these companies know what people want to watch or listen to, and how do they provide it?
Using data-driven decision-making, they gather information about what their customers are
currently listening to, analyze it, then use the insights they've gained to make suggestions for
things people will most likely enjoy in the future. This keeps customers happy and coming
back for more, which in turn means more revenue for the company. Another example of
data-driven decision-making can be seen in the rise of e-commerce. It wasn't long ago that
most purchases were made in a physical store, but the data showed people's preferences
were changing. So a lot of companies created entirely new business models that remove the
physical store, and let people shop right from their computers or mobile phones with products
delivered right to their doorstep. In fact, data-driven decision-making can be so powerful, it
can make entire business methods obsolete.

By ensuring that data is built into every business strategy, data analysts play a critical role in
their companies' success, but it's important to note that no matter how valuable data-driven
decision-making is, data alone will never be as powerful as data combined with human
experience, observation, and sometimes even intuition. To get the most out of data-driven
decision-making, it's important to include insights from people who are familiar with the
business problem. These people are called subject matter experts, and they have the ability
to look at the results of data analysis and identify any inconsistencies, make sense of gray
areas, and eventually validate choices being made.

Data and gut instinct


There are other factors that influence the decision-making process. You may have read
mysteries where the detective used their gut instinct, and followed a hunch that helped them
solve the case. Gut instinct is an intuitive understanding of something with little or no
explanation. This isn’t always something conscious; we often pick up on signals without even
realizing. You just have a “feeling” it’s right.

1. Why gut instinct can be a problem


At the heart of data-driven decision making is data. Therefore, it's essential that data
analysts focus on the data to ensure they make informed decisions. If you ignore data by
preferring to make decisions based on your own experience, your decisions may be biased.
But even worse, decisions based on gut instinct without any data to back them up can cause
mistakes.

Consider an example of a restaurant entrepreneur partnering with a well-known chef to
develop a new restaurant in a bustling part of the city’s central shopping district. The chef
has several restaurants across the city. Banking on the chef's reputation, the restaurant
entrepreneur and chef followed gut instinct and created another uniquely themed restaurant.
However, after months of planning and preparation, fundraising efforts fell short of what was
needed to open the restaurant. The property will go back on the market to be sold at a loss.
Had the entrepreneur done more research, they would've found data showing that prospective
customers in this new restaurant location were very different from those at the chef's other
restaurants.

2. Data + business knowledge = mystery solved


Blending data with business knowledge, plus maybe a touch of gut instinct, will be a
common part of your process as a junior data analyst. The key is figuring out the exact mix
for each particular project. A lot of times, it will depend on the goals of your analysis. That is
why analysts often ask, “How do I define success for this project?”
In addition, try asking yourself these questions about a project to help find the perfect
balance:
● What kind of results are needed?
● Who will be informed?
● Am I answering the question being asked?
● How quickly does a decision need to be made?

For instance, if you are working on a rush project, you might need to rely on your own
knowledge and experience more than usual. There just isn’t enough time to thoroughly
analyze all of the available data. But if you get a project that involves plenty of time and
resources, then the best strategy is to be more data-driven. It’s up to you, the data analyst,
to make the best possible choice. You will probably blend data and knowledge a million
different ways over the course of your data analytics career. And the more you practice, the
better you will get at finding that perfect blend.

Key data analyst skills

Analytical skills are qualities and characteristics associated with solving problems using
facts. There are a lot of aspects to analytical skills, but we'll focus on five essential points.
They are curiosity, understanding context, having a technical mindset, data design, and data
strategy.

Curious people usually seek out new challenges and experiences. This leads to knowledge.

Context is the condition in which something exists or happens. This can be a structure or an
environment. A simple way of understanding context is by counting to 5. One, two, three,
four, five. All of those numbers exist in the context of one through five. But what if a friend of
yours said to you, one, two, four, five, three? Well, the three would be out of context. Let's look
at another example. Have you ever shuffled a deck of cards and noticed the joker? If you're
playing a game that doesn't include jokers, identifying that card means you understand it's
out of context.

A technical mindset involves the ability to break things down into smaller steps or
pieces and work with them in an orderly and logical way. For instance, when paying your
bills, you probably already break down the process into smaller steps. When you take
something that seems like a single task, like paying your bills, and break it into smaller steps with
an orderly process, that's using a technical mindset.

Data design is how you organize information. As a data analyst, design typically has to do
with an actual database. But, again, the same skills can easily be applied to everyday life. For
example, think about the way you organize the contacts in your phone. That's actually a type of
data design. Maybe you list them by first name instead of last, or maybe you use email
addresses instead of their names. What you're really doing is designing a clear, logical list that
lets you call or text a contact in a quick and simple way.

The fifth and final element of analytical skills, last but definitely not least, is data strategy.
Data strategy is the management of the people, processes, and tools used in data analysis.
Let's break that down. You manage people by making sure they know how to use the right data
to find solutions to the problem you're working on.

Scenario: Use data to create better movies


The movies Mega-Pik has released recently aren’t having the impact they used to. Five of their
last six releases barely broke even at the box office, and the sixth film lost a lot of money. The
lead executives at Mega-Pik have noticed that their competitors went through a similar slump,
but recovered when they started producing remakes of past successes and marketing them to a
new audience.

Mega-Pik is interested in following this trend. They want to do this based on data-driven
strategies, so they hire your analytics company to help them make popular movies again.
Specifically, they ask for exploratory data analysis (EDA) to help them understand what
audiences have liked in the past and determine if the successes of those films can be replicated.

You and your team develop the following objectives for Mega-Pik’s EDA:

● Identify key factors that contribute to a movie's opening weekend success.
● Understand the relationship between a movie's budget and its revenue.
● Determine which genres are most successful.

The right dataset
Your company collects, cleans, and organizes the following relevant information into a
dataset:
● Movie name
● Release date
● Opening night revenue
● Opening weekend revenue
● Budget (cost to create)
● Marketing costs
● Ratings
● Genre

Now you’ll examine how inherent data analysis skills can help you guide Mega-Pik to make
data-driven decisions about which movies to produce.

Curiosity
If you worked for the company performing this data analysis, what kinds of questions would
you ask based on the data and how it relates to the objectives of the EDA? Curiosity is
critical here, because it will help you come up with questions you can answer.

For example, you might wonder if there’s a relationship between a movie’s budget and
the revenue it generates on opening night or over the opening weekend. You might also be
curious about combining columns to make new metrics, such as which genres tend to
perform better on opening weekend, both overall and in the seasons in which the
movies were released. You might even ask if there should be additional columns of data
that you don’t already have, such as audience demographics.

Curiosity is a skill that drives analysts to discover just how much information they can
coax out of the data in expected or unexpected ways. Keep in mind that curiosity
isn’t the only skill that compels analysts to ask probing questions about their data.
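
One of the questions above, combining columns to see which genres perform better on opening weekend overall and by season, can be sketched as follows. The movie rows and revenue figures are invented for illustration. The `season_of` helper assumes Northern Hemisphere meteorological seasons, which is only one possible convention.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical rows: (movie name, release date, genre,
# opening weekend revenue in USD millions).
movies = [
    ("Movie A", date(2021, 7, 2), "Family", 40.0),
    ("Movie B", date(2021, 12, 17), "Family", 30.0),
    ("Movie C", date(2022, 7, 8), "Family", 50.0),
    ("Movie D", date(2022, 1, 14), "Drama", 12.0),
    ("Movie E", date(2022, 7, 22), "Drama", 8.0),
]


def season_of(release_date):
    """Derive a season from the release date (Northern Hemisphere assumption)."""
    month = release_date.month
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "autumn"


def average_opening_by(rows, key_func):
    """Average opening weekend revenue for each group produced by key_func."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_func(row)].append(row[3])
    return {key: mean(values) for key, values in groups.items()}


by_genre = average_opening_by(movies, lambda row: row[2])
by_genre_and_season = average_opening_by(
    movies, lambda row: (row[2], season_of(row[1]))
)

print(by_genre)
print(by_genre_and_season)
```

The season column does not exist in the dataset; it is derived from the release date, which is the kind of new metric the paragraph describes.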

Understanding context
Context is crucial for any kind of meaningful data analysis. By contextualizing data, you
start to understand why the data shows what it does. Factors including the time of year a
movie is released, holidays, and competing events can all have an effect on revenue,
which is the gauge Mega-Pik uses to determine success. Audience demographics
such as age, gender, education, and income levels can help you understand who is going
to the movies. This context might clarify which genres or storylines are most interesting to
movie-goers.

Analysts determine context by looking for patterns or anomalies in a dataset. It also
helps to understand the entertainment industry, which provides a whole other set of
contextual clues. For example, family films typically generate more revenue when
children are on vacation from school. This provides important context about the
relationship between genre and revenue over a short timeframe. To understand the
relationship between family films and revenue, you might have to search over a
time period of more than one year to avoid inaccurate conclusions based on school
schedule. Further, the “season” in which children are on vacation from school differs by
country, which is another contextual clue you have to take into account. An accurate
analysis of this data needs to come from cross-referencing all of the various contexts,
including external data or historical trends.

Technical mindset
As you have discovered, having a technical mindset means approaching problems (and
datasets) in a systematic and logical manner. This starts with the way you clean, organize,
and prepare your data. It can also guide the tools or software you use to break down data
and help you identify and fix incorrect data that can skew your analysis.

Remember that problems aren't always technical, but a technical mindset is the skill
that you use to break down any complex issue into manageable parts. Focusing on
implementing a process, regardless of what that looks like, is a great first step to
exercising your technical mindset.

Data design
The skill of data design is an extension of your technical mindset. It deals with how
information is organized. Suppose the dataset here is presented in a spreadsheet. You
would be able to shift the cells to organize the data to find different patterns. For example,
you might organize the data by revenue and then by genre, which could reveal that
comedies are more profitable than dramas. Basically, how you choose to structure your
data makes analysis easier and more insightful.
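
Organizing the same rows in different orders, as described above, can be sketched with Python's built-in `sorted`. The rows and revenue figures are hypothetical, and a spreadsheet's sort feature would do the same job.

```python
# Hypothetical rows: movie name, genre, opening weekend revenue (USD millions).
rows = [
    {"name": "Movie A", "genre": "Drama", "revenue": 12.0},
    {"name": "Movie B", "genre": "Comedy", "revenue": 35.0},
    {"name": "Movie C", "genre": "Comedy", "revenue": 28.0},
    {"name": "Movie D", "genre": "Drama", "revenue": 30.0},
]

# Highest revenue first; ties are broken alphabetically by genre.
by_revenue_then_genre = sorted(rows, key=lambda r: (-r["revenue"], r["genre"]))

# Grouped view: genre first, then highest revenue within each genre.
by_genre_then_revenue = sorted(rows, key=lambda r: (r["genre"], -r["revenue"]))

order_1 = [r["name"] for r in by_revenue_then_genre]
order_2 = [r["name"] for r in by_genre_then_revenue]

print(order_1)
print(order_2)
```

The two orderings hold identical data. The first makes the top earners easy to spot, while the second makes it easier to compare genres side by side.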

Data strategy
Data strategy is the management of the people, processes, and tools used in data
analysis. In this scenario, think of it as the approach you use to analyze your dataset.
One element might be the tools you use. If Mega-Pik wants a relatively simple
dashboard, you might use Google Sheets or Excel because there are only a few columns
of data. On the other hand, if Mega-Pik wants a dashboard where information updates every
time new data comes in, you’d need a robust tool like Tableau.

The data strategy you select should be based on the dataset and the deliverables. Think
about a data strategy as a kind of resource allocation: the tools, time, and effort that you
put into a project will vary based on what you need to accomplish. One strategy you might
use for this case study is to prioritize any analyses that would directly affect the next
quarter's revenue. The way you allocate resources can lead you to quicker, more
actionable insights.

Analytical thinking for effective outcomes


Analytical thinking involves identifying and defining a problem and then solving it by using data in
an organized, step-by-step manner. There are five key aspects of analytical thinking:
visualization, strategy, problem-orientation, correlation, and finally, big-picture and
detail-oriented thinking.

In data analytics, visualization is the graphical representation of information. Some
examples include graphs, maps, or other design elements. Visualization is important because
visuals can help data analysts understand and explain information more effectively. Think about it
like this. If you are trying to explain the Grand Canyon to someone, using words would be much
more challenging than showing them a picture. A visualization of the Grand Canyon would help
you make your point much quicker.

With so much data available, having a strategic mindset is key to staying focused and on track.
Strategizing helps data analysts see what they want to achieve with the data and how they can
get there. Strategy also helps improve the quality and usefulness of the data we collect. By
strategizing, we know all our data is valuable and can help us accomplish our goals.

Data analysts use a problem-oriented approach in order to identify, describe, and solve
problems. It's all about keeping the problem top of mind throughout the entire project. For
example, say a data analyst is told about the problem of a warehouse constantly running out of
supplies. They would move forward with different strategies and processes. But the number one
goal would always be solving the problem of keeping inventory on the shelves.

A correlation is like a relationship. You can find all kinds of correlations in data. But as you start
identifying correlations in data, there's one thing you always want to keep in mind: Correlation
does not equal causation. In other words, just because two pieces of data are both trending in
the same direction, that doesn't necessarily mean they are related.
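
To make the warning concrete, here is a small sketch that computes a Pearson correlation coefficient for two series that both happen to rise over the same months. The series names and numbers are invented for illustration, and Pearson's r is just one common way to measure correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly figures that both trend upward over six months.
website_visits = [100, 120, 140, 160, 180, 200]
office_coffee_orders = [10, 13, 14, 17, 18, 21]

r = pearson(website_visits, office_coffee_orders)
print(round(r, 3))
```

The coefficient comes out close to 1 simply because both series grow over time. Nothing in the calculation says that website traffic drives coffee orders or the other way around, so answering that question takes context and further analysis.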

The final piece of the analytical thinking puzzle is big-picture and detail-oriented thinking.
This means being able to see the big picture as well as the details. A jigsaw puzzle is a great
way to think about this. Big-picture thinking is like looking at a complete puzzle. You can enjoy
the whole picture without getting stuck on every tiny piece that went into making it. It helps you
zoom out and see possibilities and opportunities. This leads to exciting new ideas or innovations.
On the flip side, detail-oriented thinking is all about figuring out all of the aspects that will help
you execute a plan. In other words, the pieces that make up your puzzle.

Explore core analytical skills


What is the root cause of a problem? A root cause is the reason why a problem occurs. If we
can identify and get rid of a root cause, we can prevent that problem from happening again.
A simple way to wrap your head around root causes is with the process called the Five
Whys. In the Five Whys you ask "why" five times to reveal the root cause. The fifth and final
answer should give you some useful and sometimes surprising insights. Here's an example
of the Five Whys in action.

Let's say you wanted to make a blueberry pie but couldn't find any blueberries.

You've recently explored a case involving lacking the necessary ingredients to bake pies; now,
you’ll go more in-depth with some business applications of the Five Whys technique to do root
cause analysis.

Boost customer service


An online grocery store was receiving numerous customer service complaints about poor
deliveries. To address this problem, a data analyst at the company asked their first “why?”

Why #1. “Customers are complaining about poor grocery deliveries. Why?”
The data analyst began by reviewing the customer feedback more closely. They noted the
vast majority of complaints dealt with products arriving damaged. So, they asked “why?”
again.

Why #2. “Products are arriving damaged. Why?”


To answer this question, the data analyst continued exploring the customer feedback. It
turned out that many customers said products were not packaged properly.

Why #3. “Products are not packaged properly. Why?”


After asking their third “why,” the data analyst did some further detective work. They
ultimately learned that their company’s grocery packers were not adequately trained on
packing procedures.

Why #4. “Grocery packers are not adequately trained. Why?”


This “why” enabled the data analyst to uncover that nearly 35% of all packers were new to
the company. They had not yet had the chance to complete all required training, yet they
were already being asked to pack groceries for customer orders.

Why #5. “Packers have not completed required training. Why?”

This final “why?” led the data analyst to find out that the human resources department had
not provided necessary training to any newly hired packers. This was because HR was in
the middle of reworking the training program. Rather than training new hires using the old
system, they had provided them with a quick one-page guide, which was insufficient.

So, in this example, the root cause of the problem was that HR had not completed the
training program updates and was using a less-thorough guide to train new packers.
Fortunately, this was a problem that the grocer could control. And thanks to the data
analyst’s work, the company provided more support to the HR department to complete the
training and retrain all newly hired grocery packers!
N

Advance quality control


An irrigation company was experiencing an increase in the number of defects in their water
pumps. The company's data team used the five whys to analyze the situation:

Why #1. “There has been an increase in the number of defects in water pumps. Why?”
To answer this question, the data team set up a meeting with shop floor engineers. They
asked for some insights into machine performance and manufacturing processes. After
some exploration, it was discovered that the machines used to produce the pumps were not
properly calibrated.

Why #2. “The machines are not properly calibrated. Why?”


After more brainstorming with the engineering team, it was determined that the machines

were miscalibrated during the last maintenance cycle.

Why #3. “The machines were miscalibrated during maintenance. Why?”


Next, the data team investigated the procedures involved with machine calibration. They
found out that the current method was inappropriate for the machines.

Why #4. “The calibration method is inappropriate for the machines. Why?”
This “why” led them to discover that the company had recently installed new software in their
machines. Because it was a minor software upgrade, the engineers didn’t realize it would
affect calibration. They didn’t have the information they needed to properly calibrate the
upgraded machines.
Why #5. “The engineers don’t have the information they need to calibrate the upgraded
machines. Why?”
The fifth and final “why” turned up even more evidence: The installation team had upgraded
machine software, but had failed to share the corresponding calibration procedures with the
engineers.

So, in this example, the root cause of the problem was that the engineers lacked important
information about how to calibrate the machines using the new software system. The
solution was found, and the irrigation company was able to implement it right away. Soon,
the engineers had the necessary calibration instructions, and the pump defects were
eliminated!

Another question commonly asked by data analysts is, where are the gaps in our
process? For this, many people will use something called gap analysis. Gap analysis lets
you examine and evaluate how a process works currently in order to get where you want to
be in the future. Businesses conduct gap analysis to do all kinds of things, such as improve
a product or become more efficient. The general approach to gap analysis is understanding
where you are now compared to where you want to be.

Data life cycle
First, let's spend a little time understanding the data life cycle. No, data isn't actually alive,
but it does have a life cycle. How do data analysts bring data to life? Well, it starts with the
right data analysis tool. These include spreadsheets, databases, query languages, and
visualization software.

The stages of the data life cycle are plan, capture, manage, analyze, archive, and destroy.

The data life cycle provides a generic or common framework for how data is
managed. You may recall that variations of the data analysis life cycle were
described in Origins of the data analysis process. The same can be done for the
data life cycle. The rest of this reading provides a glimpse of how government,
finance, and education institutions can view data life cycles a little differently.

Planning

Let's start with the first phase, planning. This actually happens well before starting an
analysis project. During planning, a business decides what kind of data it needs, how it will
be managed throughout its life cycle, who will be responsible for it, and the optimal
outcomes. For example, let's say an electricity provider wanted to gain insights into how to
save people energy. In the planning phase, they might decide to capture information on how
much electricity its customers use each year, what types of buildings are being powered, and
what types of devices are being powered inside of them. The electricity company would also
decide which team members will be responsible for collecting, storing, and sharing that data.
All of this happens during planning, and it helps set up the rest of the project.

Capture
The next phase is when you capture data. This is where data is collected from a variety of
different sources and brought into the organization. With so much data being created
every day, the ways to collect it are truly endless. One common method is getting data from
outside resources. For example, if you were doing data analysis on weather patterns, you'd
probably get data from a publicly available source like the National Climatic Data Center.
Another way to get data is from a company's own documents and files, which are usually
stored inside a database. While we've mentioned databases before, we haven't gone into
too much detail about what they are. A database is a collection of data stored in a computer
system. In the case of our electricity provider, the business would probably record usage
data for its customers within a database that it owns. As a quick note, when you
maintain a database of customer information, ensuring data integrity, credibility, and privacy
are all important concerns.

Manage

Here we're talking about how we care for our data, how and where it's stored, the tools used
to keep it safe and secure, and the actions taken to make sure that it's maintained properly.
This phase is very important to data cleansing, which we'll cover later on.

Analyze
Next it's time to analyze your data. This is where data analysts really shine. In this phase,
the data is used to solve problems, make great decisions, and support business goals. For
example, one of our electricity company's goals might be to find ways to help customers
save energy.

Archive
Archiving means storing data in a place where it's still available, but may not be used again.
During analysis, analysts handle huge amounts of data. Can you imagine if we had to sort
through all of the available data that's out there, even if it was no longer useful and relevant to
our work? It makes way more sense to archive it than to keep it around.

Destroy
Returning to the electricity provider example, the company would have data stored on multiple
hard drives. To destroy it, the company would use secure data erasure software. If there were
any paper files, they would be shredded too. This is important for protecting a company's
private information, as well as private data about its customers.


Case Study

1. U.S. Fish and Wildlife Service


The U.S. Fish and Wildlife Service uses the following data life cycle:
● Plan
● Acquire
● Maintain
● Access
● Evaluate
● Archive
For more information, refer to U.S. Fish and Wildlife's Data Management Life Cycle
page.
2. The U.S. Geological Survey (USGS)
The USGS uses the data life cycle below:
● Plan
● Acquire
● Process
● Analyze
● Preserve
● Publish/share
Several cross-cutting or overarching activities are also performed during each stage of their life
cycle:
● Describe (metadata and documentation)
● Manage quality
● Backup and secure

For more information, refer to the USGS Data Lifecycle page.
3. Financial institutions
Financial institutions may take a slightly different approach to the data life cycle as described in
The Data Life Cycle, an article in Strategic Finance magazine:
● Capture
● Qualify
● Transform
● Utilize
● Report
● Archive
● Purge
4. Harvard Business School (HBS)
One final data life cycle informed by Harvard University research has eight stages:
● Generation
● Collection
● Processing
● Storage
● Management
● Analysis
● Visualization
● Interpretation
For more information, refer to 8 Steps in the Data Life Cycle.

More on the phases of data analysis


Each step in the data analysis process (ask, prepare, process, analyze, share, and
act) plays a crucial role in extracting meaningful insights from data. As you navigate through
each phase, from asking the right questions to taking informed actions, you harness the true
power of data. In this reading, you’ll explore how the data analysis process guides this
program.
The ask phase
At the start of any successful data analysis, the data analyst:
● Takes the time to fully understand stakeholder expectations
● Defines the problem to be solved
● Decides which questions to answer in order to solve the problem
Qualifying stakeholder expectations means determining who the stakeholders are, what they
want, when they want it, why they want it, and how best to communicate with them. Defining
the problem means looking at the current state and identifying the ways in which it’s different
from the ideal state. With expectations qualified and the problem defined, you can derive
questions that will help achieve these goals.

The prepare phase


In the preparation phase, the emphasis is on identifying and locating data you can use to
answer your questions. You'll also discover why it's so important that data and results are
objective and unbiased. In other words, any decisions made from an analysis should always
be based on facts and be fair and impartial.

The process phase

In this phase, the aim is to refine the data. Data analysts find and eliminate any errors and
inaccuracies that can get in the way of results. This usually means:
● Cleaning data
● Transforming data into a more useful format
● Combining two or more datasets to make information more complete
● Removing outliers (data points that could skew the information)

After data analysts process data, they check the data they prepared to make sure it's
complete and correct. This phase is all about getting the details right. Accordingly, the data
analyst will refine strategies for verifying and sharing their data cleaning with stakeholders. In
an upcoming course, you’ll use spreadsheets and structured query language, or SQL, to
clean data.

The analyze phase


With a solid foundation of well-defined questions and clean data, you’ll delve into the analyze
phase. This is when you turn the data you’ve gathered, prepared, and processed into
actionable information. Data analysts use many powerful tools in their work. In one
upcoming course you'll continue using two of them: spreadsheets and SQL. In another
upcoming course you’ll explore using the programming language R to work with and analyze
data.

The share phase


This phase is exactly what it sounds like: It’s time to share what you’ve learned with your
stakeholders! In this part of the program, you'll learn how data analysts interpret results and
share them with others to help stakeholders make effective, data-driven decisions. In the
share phase, visualization is a data analyst's best friend. So, an upcoming course will
highlight why visualization is essential to getting others to understand what your data is
telling you. In another upcoming course, you’ll learn how to visualize data with R.

The act phase


The data analysis journey culminates in the act phase, when data insights are put to work.
For you, this action involves preparing for your job search and having the chance to
complete a case study project.

Key data analyst tools


To put it simply, a spreadsheet is a digital worksheet. It stores, organizes, and sorts data.
This is important because the usefulness of your data depends on how well it's structured.
When you put your data into a spreadsheet, you can see patterns, group information, and
easily find the information you need. Spreadsheets also have some really useful features
called formulas and functions. A formula is a set of instructions that performs a specific
calculation using the data in a spreadsheet. Formulas can do basic things like add, subtract,
multiply, and divide, but they don't stop there. You can also use formulas to find the average
of a number set, look up a particular value, return the sum of a set of values that meets a
particular rule, and so much more. A function is a preset command that automatically
performs a specific process or task using the data in a spreadsheet.
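
To make the three formula uses above concrete, here is a minimal Python sketch of the same calculations: an average, a lookup, and a sum over values that meet a rule. The rows and column names are invented for illustration and are not from the original reading.

```python
from statistics import mean

# Hypothetical rows standing in for spreadsheet cells (invented sample data).
rows = [
    {"item": "apples", "region": "north", "sales": 120},
    {"item": "pears", "region": "south", "sales": 80},
    {"item": "plums", "region": "north", "sales": 100},
]

# Like an average formula: the mean of a set of numbers.
average_sales = mean(row["sales"] for row in rows)

# Like a lookup formula: find one particular value by its key.
pears_sales = next(row["sales"] for row in rows if row["item"] == "pears")

# Like a conditional sum: add up only the values that meet a rule.
north_total = sum(row["sales"] for row in rows if row["region"] == "north")

print(average_sales, pears_sales, north_total)
```

In a spreadsheet, each of these would typically be a single built-in function applied to a range of cells.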

As you are learning, the most common programs and solutions used by data analysts
include spreadsheets, query languages, and visualization tools. In this reading, you will learn
more about each one. You will cover when to use them, and why they are so important in
data analytics.

Spreadsheets
Data analysts rely on spreadsheets to collect and organize data. Two popular spreadsheet
applications you will probably use a lot in your future role as a data analyst are Microsoft
Excel and Google Sheets.


Spreadsheets structure data in a meaningful way by letting you
● Collect, store, organize, and sort information
● Identify patterns and piece the data together in a way that works for each specific
data project
● Create excellent data visualizations, like graphs and charts

Databases and query languages
A database is a collection of structured data stored in a computer system. Some popular
Structured Query Language (SQL) programs include MySQL, Microsoft SQL Server, and
BigQuery.

Query languages
● Allow analysts to isolate specific information from a database(s)
● Make it easier for you to learn and understand the requests made to databases
● Allow analysts to select, create, add, or download data from a database for analysis

Visualization tools
Data analysts use a number of visualization tools, like graphs, maps, tables, charts, and
more. Two popular visualization tools are Tableau and Looker.
These tools
● Turn complex numbers into a story that people can understand
● Help stakeholders come up with conclusions that lead to informed decisions and
effective business strategies
● Have multiple features
- Tableau's simple drag-and-drop feature lets users create interactive graphs in
dashboards and worksheets
- Looker communicates directly with a database, allowing you to connect your data
right to the visual tool you choose

A career as a data analyst also involves using programming languages, like R and Python,
which are used a lot for statistical analysis, visualization, and other data analysis.

Choose the right tool for the job

As a data analyst, you will usually have to decide which program or solution is right for the
particular project you are working on. In this reading, you will learn more about how to
choose which tool you need and when.

Depending on which phase of the data analysis process you’re in, you will need to use
different tools. For example, if you are focusing on creating complex and eye-catching
visualizations, then the visualization tools we discussed earlier are the best choice. But if you
are focusing on organizing, cleaning, and analyzing data, then you will probably be choosing
between spreadsheets and databases using queries. Spreadsheets and databases both
offer ways to store, manage, and use data. The basic content for both tools is sets of
values. Yet, there are some key differences, too:

Spreadsheets | Databases
Accessed through a software application | Accessed using a query language
Structured data in a row and column format | Structured data using rules and relationships
Organizes information in cells | Organizes information in complex collections
Provides access to a limited amount of data | Provides access to huge amounts of data
Manual data entry | Strict and consistent data entry
Generally one user at a time | Multiple users
Controlled by the user | Controlled by a database management system

Step-by-Step: Make spreadsheets your friend
Example 1: Get started
Enter basic data:
1. Begin with a new spreadsheet.
2. Select cell A2.
3. Enter your first name.
4. Select cell B2.
5. Enter your last name.

Adjust the size of rows and columns:


To make the text fit in the rows and columns, adjust their sizes. Use either of the following
methods:
1. If your name is longer than the width of the column, select and drag the right edge
of the corresponding column until it fits.
2. To wrap text, select the cells, columns, or rows with text that you want to reformat.
3. Select the Format menu.
4. Under Wrapping, select Wrap.

Example 2: Add labels
Add labels, or attributes, to help you keep track of the data:
1. Select cell A1.
2. Enter First Name.
3. Select cell B1.
4. Enter Last Name.
5. Select cells A1 and B1. To do this, select a single cell and drag your cursor over to
the other cell to include it in the selection.
6. From the toolbar, select the bold icon.

Example 3: Add more attributes and data


Add more attributes and data to your spreadsheet:
1. Select cell C1 and enter Siblings.
2. Select cell D1 and enter Favorite Color.
3. Select cell E1 and enter Favorite Dessert.
4. Select all three cells and make them bold by selecting the bold icon from the
toolbar.
5. Adjust the columns to fit the new text.
6. Enter the corresponding data in cells C2, D2, and E2 (your number of siblings,
favorite color, and favorite dessert).
7. Add data about two more people in rows 3 and 4. These can be people you know or
people you’ve just made up.

Example 4: Organize your data


One way to organize your data is by sorting it.
1) Select all columns that contain data. There are a few ways to select multiple cells:

A. To select nonadjacent cells and/or cell ranges, hold the Command (Mac) or Ctrl
(PC) key and select the cells.
B. To select a range of cells, hold the Shift key and either drag your cursor over which
cells you want to include or use the arrow keys to select a range.
C. Select a single cell and drag your cursor over the cells you want to include in your
selection.
2) Select the Data menu.
3) Select Sort range, then select Advanced range sorting options.
4) In the Advanced range sorting options window, select the checkbox for Data has
header row. Make sure that A to Z is selected.
5) Select the Sort by drop-down menu, then select Siblings.
6) Select Sort. This will organize the spreadsheet by the number of siblings, from
lowest to highest.

Example 5: Use a formula

Spreadsheets enable data professionals to analyze data. In this example, you use
a formula to calculate a sum.
1. Select the next empty cell in the Siblings column (C5).
2. Enter the formula =C2+C3+C4.
3. Press Enter on your keyboard to complete the formula.
4. The formula calculates the total number of siblings.
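
For readers who want to see the same logic outside a spreadsheet, the sort from Example 4 and the sum from Example 5 can be sketched in Python. The three people below are made up, in the same spirit as the made-up rows the exercise allows.

```python
# Made-up people, mirroring the columns built in Examples 1 to 3.
people = [
    {"first_name": "Ana", "last_name": "Lopez", "siblings": 3},
    {"first_name": "Ben", "last_name": "Kim", "siblings": 0},
    {"first_name": "Cy", "last_name": "Ode", "siblings": 2},
]

# Example 4: sort the range by Siblings, lowest to highest.
by_siblings = sorted(people, key=lambda person: person["siblings"])

# Example 5: the equivalent of the formula =C2+C3+C4.
total_siblings = sum(person["siblings"] for person in people)

print([person["first_name"] for person in by_siblings], total_siblings)
```

The sort produces a new ordering without changing the stored values, much like sorting a range leaves each row's contents intact.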

SQL in action
Just as humans use different languages to communicate with others, so do computers.
Structured Query Language (or SQL, often pronounced “sequel”) enables data analysts to talk
to their databases. SQL is one of the most useful data analyst tools, especially when working
with large datasets in tables. It can help you investigate huge databases, track down text
(referred to as strings) and numbers, and filter for the exact kind of data you need, much
faster than a spreadsheet can.


What is a query?
A query is a request for data or information from a database. When you query databases,
you use SQL to communicate your question or request. You and the database can always
exchange information as long as you speak the same language.

Every programming language, including SQL, follows a unique set of guidelines known as
syntax. Syntax is the predetermined structure of a language that includes all required words,
symbols, and punctuation, as well as their proper placement. As soon as you enter your
search criteria using the correct syntax, the query starts working to pull the data you’ve
requested from the target database.


Most basic SQL queries follow the same structure:
● Use SELECT to choose the columns you want to return.
● Use FROM to choose the tables where the columns you want are located.
● Use WHERE to filter for certain information.

A SQL query is like filling in a template. You will find that if you are writing a SQL query from
scratch, it is helpful to start a query by writing the SELECT, FROM, and WHERE keywords in the
following format:
SELECT
FROM
WHERE
Next, enter the table name after the FROM; the table columns you want after the SELECT;
and, finally, the conditions you want to place on your query after the WHERE. Make sure to
add a new line and indent when adding these, as shown below:

SELECT | Specifies the columns from which to retrieve data
FROM | Specifies the table from which to retrieve data
WHERE | Specifies criteria that the data must meet

Following this method each time makes it easier to write SQL queries. It can also help you
make fewer syntax errors.

Example of a query
Here is how a simple query would appear in BigQuery, a data warehouse on the Google
Cloud Platform.

The above query uses three commands to locate customers with the first_name 'Tony':
1. SELECT the column named first_name
2. FROM a table named customer_name (in a dataset named customer_data).
(The dataset name is always followed by a dot, and then the table name.)
3. But only return the data WHERE the first_name is 'Tony'
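
The query described in these three steps can be tried locally. The sketch below uses Python's built-in sqlite3 module rather than BigQuery, so it is an approximation: an attached in-memory database named customer_data stands in for the BigQuery dataset, and the sample rows are invented. Only the table and column names come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no BigQuery-style datasets, so an attached in-memory database
# named customer_data plays that role here (an assumption of this sketch).
conn.execute("ATTACH DATABASE ':memory:' AS customer_data")
conn.execute(
    "CREATE TABLE customer_data.customer_name "
    "(customer_id INTEGER, first_name TEXT, last_name TEXT)"
)
# Invented sample rows for illustration.
conn.executemany(
    "INSERT INTO customer_data.customer_name VALUES (?, ?, ?)",
    [(1, "Tony", "Magnolia"), (2, "Maria", "Chavez"), (3, "Tony", "Chen")],
)

# SELECT the column, FROM dataset.table, WHERE the filter applies.
query = """
SELECT
    first_name
FROM
    customer_data.customer_name
WHERE
    first_name = 'Tony'
"""
tony_rows = conn.execute(query).fetchall()
print(tony_rows)
```

With this sample data, the result contains only the first_name value for each matching customer, which shows why a single-column query of this kind is not very informative.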

The results from the query might be similar to the following:



As you can conclude, this query had the correct syntax, but wasn't very useful after the data
was returned.

Multiple columns in a query


Of course, as a data professional, you will need to work with more data beyond customers
named Tony. Multiple columns that are chosen by the same SELECT command can be
indented and grouped together.
If you are requesting multiple data fields from a table, you need to include these columns in
your SELECT command. Each column is separated by a comma as shown below:


The above query uses three commands to locate customers with the first_name 'Tony'.
1. SELECT the columns named customer_id, first_name, and last_name
2. FROM a table named customer_name (in a dataset named customer_data).
(The dataset name is always followed by a dot, and then the table name.)
3. But only return the data WHERE the first_name is 'Tony'


The only difference between this query and the previous one is that more data columns are
selected. The previous query selected first_name only while this query selects
customer_id and last_name in addition to first_name. In general, it is a more efficient
use of resources to select only the columns that you need. For example, it makes sense to
select more columns if you will actually use the additional fields in your WHERE clause. If you
have multiple conditions in your WHERE clause, they may be written like this:
Notice that unlike the SELECT command, which uses a comma to separate fields / variables /
parameters, the WHERE command uses the AND operator to connect conditions. As you
become a more advanced writer of queries, you will make use of other connectors /
operators such as OR and NOT.

Here is a BigQuery example with multiple fields used in a WHERE clause:

The above query uses three commands to locate customers with a valid customer_id (greater
than 0) whose first_name is 'Tony' and last_name is 'Magnolia'.
1. SELECT the columns named customer_id, first_name, and last_name
2. FROM a table named customer_name (in a dataset named customer_data).
(The dataset name is always followed by a dot, and then the table name.)
3. But only return the data WHERE customer_id is greater than 0, first_name is
Tony, and last_name is Magnolia.

Note that one of the conditions is a logical condition that checks to see if customer_id is
greater than zero.
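
As a rough local illustration of a WHERE clause with several conditions joined by AND, the following sketch again uses Python's sqlite3 module in place of BigQuery. The attached database standing in for the customer_data dataset and all of the sample rows, including the row with an ID of 0, are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An attached in-memory database stands in for the BigQuery dataset (assumption).
conn.execute("ATTACH DATABASE ':memory:' AS customer_data")
conn.execute(
    "CREATE TABLE customer_data.customer_name "
    "(customer_id INTEGER, first_name TEXT, last_name TEXT)"
)
# Invented rows; the row with customer_id 0 represents an invalid ID.
conn.executemany(
    "INSERT INTO customer_data.customer_name VALUES (?, ?, ?)",
    [
        (0, "Tony", "Magnolia"),
        (1, "Tony", "Magnolia"),
        (2, "Tony", "Chen"),
        (3, "Maria", "Magnolia"),
    ],
)

# Columns are separated by commas; conditions are connected with AND.
query = """
SELECT
    customer_id,
    first_name,
    last_name
FROM
    customer_data.customer_name
WHERE
    customer_id > 0
    AND first_name = 'Tony'
    AND last_name = 'Magnolia'
"""
magnolia_rows = conn.execute(query).fetchall()
print(magnolia_rows)
```

A row is returned only when all three conditions hold, so the row with customer_id 0 is excluded even though both names match.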
If only one customer is named Tony Magnolia, the results from the query could be:

Endless SQL possibilities
Capitalization, indentation, and semicolons
You can write your SQL queries in all lowercase and don’t have to worry about extra spaces
between words. However, using capitalization and indentation can help you read the
information more easily. Keep your queries neat, and they will be easier to review or
troubleshoot if you need to check them later on.

Notice that the SQL statement shown above has a semicolon at the end. The semicolon is a
statement terminator and is part of the American National Standards Institute (ANSI) SQL-92
standard, which is a recommended common syntax for adoption by all SQL databases.
However, not all SQL databases have adopted or enforce the semicolon, so it’s possible you
may come across some SQL statements that aren’t terminated with a semicolon. If a
statement works without a semicolon, it’s fine.

WHERE conditions
In the query shown above, the SELECT clause identifies the column you want to pull data
from by name, field1, and the FROM clause identifies the table where the column is
located by name, table. Finally, the WHERE clause narrows your query so that the database
returns only the data with an exact value match or the data that matches a certain condition
that you want to satisfy.

For example, if you are looking for a specific customer with the last name Chavez, the WHERE
clause would be:

WHERE field1 = 'Chavez'

However, if you are looking for all customers with a last name that begins with the letters
“Ch,” the WHERE clause would be:

WHERE field1 LIKE 'Ch%'


You can conclude that the LIKE operator is very powerful because it allows you to tell the
database to look for a certain pattern! The percent sign % is used as a wildcard to match zero
or more characters. In the example above, both Chavez and Chen would be returned. Note
that in some databases an asterisk * is used as the wildcard instead of a percent sign %.
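
The difference between an exact match and a LIKE pattern can be checked with a small sketch using Python's sqlite3 module. The table name customers and the sample last names are hypothetical; field1 is the column name used in the text. Note that the % wildcard in standard SQL also matches zero characters, so a value of exactly 'Ch' matches 'Ch%'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "customers" is a hypothetical table name; field1 holds last names as in the text.
conn.execute("CREATE TABLE customers (field1 TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?)",
    [("Chavez",), ("Chen",), ("Ch",), ("Lopez",)],
)

# Exact match: only the row equal to 'Chavez'.
exact = conn.execute(
    "SELECT field1 FROM customers WHERE field1 = 'Chavez'"
).fetchall()

# Pattern match: every value that starts with 'Ch'.
pattern = conn.execute(
    "SELECT field1 FROM customers WHERE field1 LIKE 'Ch%' ORDER BY field1"
).fetchall()

print(exact, pattern)
```

One caveat for this sketch: SQLite's LIKE is case-insensitive for ASCII letters by default, which may differ from other databases.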

SELECT all columns


Can you use SELECT * ?
In the example, if you replace SELECT field1 with SELECT * , you would be selecting all of
the columns in the table instead of the field1 column only. From a syntax point of view, it is a
correct SQL statement, but you should use the asterisk * sparingly and with caution.
Depending on how many columns a table has, you could be selecting a tremendous amount
of data. Selecting too much data can cause a query to run slowly.

Comments
Some tables aren’t designed with descriptive enough naming conventions. In the example,
field1 was the column for a customer’s last name, but you wouldn’t know it by the name. A
better name would have been something such as last_name. In these cases, you can place
comments alongside your SQL to help you remember what the name represents. Comments
are text placed between certain characters, /* and */, or after two dashes (--), as shown
below.

Comments can be added outside of a statement as well as within a statement.
You can use this flexibility to provide an overall description of what you are going to
do, step-by-step notes about how you achieve it, and why you set different
parameters/conditions.

The more comfortable you get with SQL, the easier it will be to read and understand
queries at a glance. Still, it never hurts to have comments in a query to remind
yourself of what you’re trying to do. This also makes it easier for others to
understand your query if your query is shared. As your queries become more and
more complex, this practice will save you a lot of time and energy when you need to
understand complex queries you wrote months or years ago.

Example of a query with comments


Here is an example of how comments could be written in BigQuery:
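
A runnable approximation of such a commented query is sketched below with Python's sqlite3 module instead of BigQuery. The customers table, the field2 column, and the sample row are hypothetical; field1 is the column name used earlier in the text. SQLite accepts both the two-dash style and the /* */ style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with poorly named columns, as discussed in the text.
conn.execute("CREATE TABLE customers (field1 TEXT, field2 TEXT)")
conn.execute("INSERT INTO customers VALUES ('Chavez', 'Maria')")

query = """
-- Pull the names of all customers (a comment before the statement)
SELECT
    field1, -- last name
    field2  /* first name (hypothetical column) */
FROM
    customers
"""
commented_rows = conn.execute(query).fetchall()
print(commented_rows)
```

The comments have no effect on the result; the database ignores them and returns the same rows as the uncommented query would.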
In the above example, a comment has been added before the SQL statement to
explain what the query does. Additionally, a comment has been added next to each
of the column names to describe the column and its use. Two dashes -- are
generally supported, so it is best to use -- and be consistent with it. You can use #
in place of -- in the above query, but # is not recognized in all SQL versions; for
example, PostgreSQL doesn’t treat # as a comment marker. You can also place comments
between /* and */ if the database you are using supports it.
As you develop your skills professionally, depending on the SQL database you use,
you can pick the appropriate comment delimiting symbols you prefer and stick with
those as a consistent style. As your queries become more and more complex, the
practice of adding helpful comments will save you a lot of time and energy when you need to
understand queries that you may have written months or years prior.

Aliases
You can also make it easier on yourself by assigning a new name or alias to the
column or table names to make them easier to work with (and avoid the need for
comments). This is done with a SQL AS clause. In the example below, aliases are
used for both a table name and a column. Within the database, the table is called
actual_table_name and the column in that table is called actual_column_name.
They are aliased as my_table_alias and my_column_alias, respectively. These
aliases are good for the duration of the query only. An alias doesn’t change the
actual name of a column or table in the database.

Example of a query with aliases
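
A minimal sketch of such a query, run through Python's sqlite3 module, is shown here. The table, column, and alias names are the ones used in the text; the sample value is invented. The second query checks the stored table name to confirm that the alias lasted only for the duration of the first query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actual_table_name (actual_column_name TEXT)")
conn.execute("INSERT INTO actual_table_name VALUES ('example value')")

# AS assigns temporary names to the column and the table.
query = """
SELECT
    my_table_alias.actual_column_name AS my_column_alias
FROM
    actual_table_name AS my_table_alias
"""
cursor = conn.execute(query)
result_columns = [column[0] for column in cursor.description]
alias_rows = cursor.fetchall()

# The table keeps its real name in the database after the query runs.
table_names = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(result_columns, alias_rows, table_names)
```

The result set is labeled with the alias, while the database itself still lists the table under its actual name.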

Putting SQL to work as a data analyst

Imagine you are a data analyst for a small business and your manager asks you for
some employee data. You decide to write a query with SQL to get what you need
from the database.

You want to pull all the columns: empID, firstName, lastName, jobCode, and
salary. Because you know the database isn’t that big, instead of entering each
column name in the SELECT clause, you use SELECT *. This will select all the
columns from the Employee table in the FROM clause.

Now, you can get more specific about the data you want from the Employee table. If
you want all the data about employees working in the 'SFI' job code, you can use a
WHERE clause to filter the data based on this additional requirement.

Here, you use:

A portion of the resulting data returned from the SQL query might look like this:
Suppose you notice a large salary range for the 'SFI' job code. You might like to
flag all employees in all departments with lower salaries for your manager. Because
interns are also included in the table and they have salaries less than $30,000, you
want to make sure your results give you only the full-time employees with salaries
that are $30,000 or less. In other words, you want to exclude interns with the 'INT'
job code who also earn less than $30,000. The AND operator enables you to test for
both conditions.
You create a SQL query similar to below, where <> means "does not equal":
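
The three queries in this scenario can be sketched with Python's sqlite3 module. The Employee table and its column names follow the text; the four employees, the 'OFC' job code, and the salary figures are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee "
    "(empID INTEGER, firstName TEXT, lastName TEXT, jobCode TEXT, salary INTEGER)"
)
# Invented employees; 'OFC' is a hypothetical job code.
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ana", "Lopez", "SFI", 52000),
        (2, "Ben", "Kim", "SFI", 29000),
        (3, "Cy", "Ode", "INT", 20000),
        (4, "Dee", "Roy", "OFC", 28000),
    ],
)

# All columns for every employee.
everyone = conn.execute("SELECT * FROM Employee ORDER BY empID").fetchall()

# Only employees in the 'SFI' job code.
sfi = conn.execute(
    "SELECT * FROM Employee WHERE jobCode = 'SFI' ORDER BY empID"
).fetchall()

# Non-interns earning $30,000 or less; <> means "does not equal".
low_paid = conn.execute(
    "SELECT * FROM Employee "
    "WHERE jobCode <> 'INT' AND salary <= 30000 "
    "ORDER BY empID"
).fetchall()

print(len(everyone), [row[0] for row in sfi], [row[0] for row in low_paid])
```

Because both conditions must be true, the intern is left out of the final result even though the intern's salary is below the threshold.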

The resulting data from the SQL query might look like the following (interns with the
job code INT aren't returned):

With quick access to this kind of data using SQL, you can provide your manager with
tons of different insights about employee data, including whether employee salaries
across the business are equitable. Fortunately, the query shows that only two additional
employees might need a salary adjustment, and you share the results with your
manager.
Pulling the data, analyzing it, and implementing a solution might ultimately help
improve employee satisfaction and loyalty. That makes SQL a pretty powerful tool.

Plan a data visualization


Because of the importance of data visualization, most data analytics tools (such as
spreadsheets and databases) have a built-in visualization component while others (such as
Tableau) specialize in visualization as their primary value-add. In this reading, you will
explore the steps involved in the data visualization process and a few of the most common
data visualization tools available.
Steps to plan a data visualization

Let’s go through an example of a real-life situation where a data analyst might need
to create a data visualization to share with stakeholders. Imagine you’re a data
analyst for a clothing distributor. The company helps small clothing stores manage
their inventory, and sales are booming. One day, you learn that your company is
getting ready to make a major update to its website. To guide decisions for the
website update, you’re asked to analyze data from the existing website and sales
records. Let’s go through the steps you might follow.

Step 1: Explore the data for patterns

First, you ask your manager or the data owner for access to the current sales
records and website analytics reports. This includes information about how
customers behave on the company’s existing website, basic information about who
visited, who bought from the company, and how much they bought.

While reviewing the data, you notice a pattern among those who visit the company’s
website most frequently: they share a common geography and spend larger amounts on
purchases. With further analysis, this information might explain why sales are so
strong right now in the northeast, and help your company find ways to make them even
stronger through the new website.

Step 2: Plan your visuals


Next, it is time to refine the data and present the results of your analysis. Right now,
you have a lot of data spread across several different tables, which isn’t an ideal way
to share your results with management and the marketing team. You will want to
create a data visualization that explains your findings quickly and effectively to your
target audience. Since you know your audience is sales oriented, you already know
that the data visualization you use should:
● Show sales numbers over time
● Connect sales to location
● Show the relationship between sales and website use
● Show which customers fuel growth
Step 3: Create your visuals
Now that you have decided what kind of information and insights you want to display,
it is time to start creating the actual visualizations. Keep in mind that creating the
right visualization for a presentation or to share with stakeholders is a process. It
involves trying different visualization formats and making adjustments until you get
what you are looking for. In this case, a mix of different visuals will best communicate
your findings and turn your analysis into the most compelling story for stakeholders.
So, you can use the built-in chart capabilities in your spreadsheets to organize the
data and create your visuals.
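
Before charting, the raw records usually need to be summarized into the series each visual will plot. The sketch below shows one way that preparation step might look in Python; the records, field order, and region names are made-up assumptions, and a spreadsheet pivot table would do the same job.

```python
from collections import defaultdict

# Hypothetical sales records: (month, region, amount). All values are made up.
sales = [
    ("2023-01", "Northeast", 1200.0),
    ("2023-01", "South", 400.0),
    ("2023-02", "Northeast", 1500.0),
    ("2023-02", "South", 450.0),
    ("2023-02", "Northeast", 300.0),
]


def total_by(records, key_index):
    """Sum the amount (last field) grouped by the field at key_index."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[-1]
    return dict(totals)


sales_by_month = total_by(sales, 0)   # could feed a line chart of sales over time
sales_by_region = total_by(sales, 1)  # could feed a bar chart or map of sales by location
print(sales_by_month)
print(sales_by_region)
```

Each summary corresponds to one of the planned visuals: totals per month for sales over time, and totals per region for connecting sales to location.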


Build your data visualization toolkit


There are many different tools you can use for data visualization.
● You can use the visualization tools in your spreadsheet to create simple
visualizations such as line and bar charts.
● You can use more advanced tools such as Tableau that allow you to integrate
data into dashboard-style visualizations.
● If you’re working with the programming language R you can use the
visualization tools in RStudio.
Your choice of visualization tool will depend on several factors, including the size of
your data and the process you used to analyze it (spreadsheets, databases and
queries, or programming languages). For now, just consider the basics.

Spreadsheets (Microsoft Excel or Google Sheets)


In our example, the built-in charts and graphs in spreadsheets made the process of
creating visuals quick and easy. Spreadsheets are great for creating simple
visualizations like bar graphs and pie charts, and even provide some advanced
visualizations like maps, and waterfall and funnel diagrams (shown in the following
figures).

But sometimes you need a more powerful tool to truly bring your data to life. Tableau
and RStudio are two examples of widely used platforms that can help you plan,
create, and present effective and compelling data visualizations.

Visualization software (Tableau)


Tableau is a popular data visualization tool that lets you pull data from nearly any
system and turn it into compelling visuals or actionable insights. The platform offers
built-in visual best practices, which makes analyzing and sharing data fast, easy, and
(most importantly) useful. Tableau works well with a wide variety of data and includes
interactive dashboards that let you and your stakeholders click to explore the data.

You can start exploring Tableau from the How-to Video resources. Tableau Public is
free, easy to use, and full of helpful information. The Resources page is a
one-stop-shop for how-to videos, examples, and datasets for you to practice with. To
explore what other data analysts are sharing on Tableau, visit the Viz of the Day
page where you will find beautiful visuals ranging from an overview of the
Lighthouses of Greece to Who’s Talking in Popular Films.
Programming language (R with RStudio)
A lot of data analysts work with a programming language called R. Most people who
work with R end up also using RStudio, an integrated development environment (IDE),
for their data visualization needs. As with Tableau, you can create dashboard-style
data visualizations using RStudio.

Visit the RStudio website to learn more about RStudio.

You could easily spend days exploring all the resources provided at RStudio.com,
but the RStudio Cheatsheets and the RStudio Visualize Data Primer are great places
to start. When you have more time, check out the webinars and videos, which offer
advice and helpful perspectives for both beginners and advanced users.

Consider fairness

Previously, you learned that part of a data professional’s responsibility is to make
certain that their analysis is fair. Fairness means ensuring your analysis doesn't
create or reinforce bias. This can be challenging, but if the analysis is not objective,
the conclusions can be misleading and even harmful. In this reading, you’re going to
explore some best practices you can use to guide your work toward a fairer
analysis.

Following are some strategies that support fair analysis:

Best practice: Consider all of the available data
Explanation: Part of your job as a data analyst is to determine what data is going to
be useful for your analysis. Often there will be data that isn’t relevant to what you’re
focusing on or doesn’t seem to align with your expectations. But you can’t just ignore
it; it’s critical to consider all of the available data so that your analysis reflects the
truth and not just your own expectations.
Example: A state’s Department of Transportation is interested in measuring traffic
patterns on holidays. At first, they only include metrics related to traffic volumes and
the fact that the days are holidays. But the data team realizes they failed to consider
how weather on these holidays might also affect traffic volumes. Considering this
additional data helps them gain more complete insights.

Best practice: Identify surrounding factors
Explanation: As you’ll learn throughout these courses, context is key for you and your
stakeholders to understand the final conclusions of any analysis. Similar to
considering all of the data, you also must understand surrounding factors that could
influence the insights you’re gaining.
Example: A human resources department wants to better plan for employee vacation
time in order to anticipate staffing needs. HR uses a list of national bank holidays as
a key part of the data-gathering process. But they fail to consider important holidays
that aren’t on the bank calendar, which introduces bias against employees who
celebrate them. It also gives HR less useful results because bank holidays may not
necessarily apply to their actual employee population.

Best practice: Include self-reported data
Explanation: Self-reporting is a data collection technique where participants provide
information about themselves. Self-reported data can be a great way to introduce
fairness in your data collection process. People bring conscious and unconscious
bias to their observations about the world, including about other people. Using
self-reporting methods to collect data can help avoid these observer biases.
Additionally, separating self-reported data from other data you collect provides
important context to your conclusions!
Example: A data analyst is working on a project for a brick-and-mortar retailer. Their
goal is to learn more about their customer base. This data analyst knows they need
to consider fairness when they collect data; they decide to create a survey so that
customers can self-report information about themselves. By doing that, they avoid
bias that might be introduced with other demographic data collection methods. For
example, if they had sales associates report their observations about customers,
they might introduce any unconscious bias the employees had to the data.

Best practice: Use oversampling effectively
Explanation: When collecting data about a population, it’s important to be aware of
the actual makeup of that population. Sometimes, oversampling can help you
represent groups in that population that otherwise wouldn’t be represented fairly.
Oversampling is the process of increasing the sample size of nondominant groups in
a population. This can help you better represent them and address imbalanced
datasets.
Example: A fitness company is releasing new digital content for users of their
equipment. They are interested in designing content that appeals to different users,
knowing that different people may interact with their equipment in different ways. For
example, part of their user-base is age 70 or older. In order to represent these users,
they oversample them in their data. That way, decisions they make about their
fitness content will be more inclusive.

Best practice: Think about fairness from beginning to end
Explanation: To ensure that your analysis and final conclusions are fair, be sure to
consider fairness from the earliest stages of a project to when you act on the data
insights. This means that data collection, cleaning, processing, and analysis are all
performed with fairness in mind.
Example: A data team kicks off a project by including fairness measures in their
data-collection process. These measures include oversampling their population and
using self-reported data. However, they fail to inform stakeholders about these
measures during the presentation. As a result, stakeholders leave with skewed
understandings of the data. Learning from this experience, they add key information
about fairness considerations to future stakeholder
presentations.
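
The oversampling practice described above can be illustrated with a small sketch. The version below uses simple random oversampling with replacement, which duplicates existing rows from the smaller group until the groups are the same size; this is one common way to address an imbalanced dataset, whereas in survey work oversampling more often means collecting additional responses from the group. The field name age_group and the group sizes are assumptions for illustration.

```python
import random


def oversample(records, group_key, seed=0):
    """Randomly duplicate rows from smaller groups until every group is as
    large as the largest one (simple random oversampling)."""
    rng = random.Random(seed)
    groups = {}
    for row in records:
        groups.setdefault(row[group_key], []).append(row)
    target = max(len(rows) for rows in groups.values())
    balanced = []
    for rows in groups.values():
        balanced.extend(rows)
        # Draw extra rows with replacement to reach the target size.
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced


# Hypothetical user records: 8 users under 70 and 2 users aged 70 or older.
users = [{"age_group": "under 70"}] * 8 + [{"age_group": "70+"}] * 2
balanced = oversample(users, "age_group")

counts = {}
for row in balanced:
    counts[row["age_group"]] = counts.get(row["age_group"], 0) + 1
print(counts)
```

Because the added rows are duplicates rather than new observations, it is worth recording that oversampling was applied so that stakeholders can interpret the results with that context.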

Data analyst roles and job descriptions


As technology continues to advance, being able to collect and analyze the data from
that new technology has become a huge competitive advantage for a lot of
businesses. Everything from websites to social media feeds is filled with fascinating
data that, when analyzed and used correctly, can help inform business decisions. A
company’s ability to thrive now often depends on how well it can leverage data,
apply analytics, and implement new technologies.

This is why skilled data analysts are some of the most sought-after professionals in
the world. A study conducted by IBM estimates that there are over 380,000 job
openings in the Data Analytics field in the United States*. Because the demand is so
strong, you’ll be able to find job opportunities in virtually any industry. Do a quick
search on any major job site and you’ll notice that every type of business, from zoos
to health clinics to banks, is seeking talented data professionals. Even if the job title
doesn’t use the exact term “data analyst,” the job description for most roles involving
data analysis will likely include a lot of the skills and qualifications you’ll gain by the
end of this program. In this reading, we’ll explore some of the data analyst-related
roles you might find in different companies and industries.

* Burning Glass data, Feb 1, 2021 - Jan 31, 2022, US

Decoding the job description


The data analyst role is one of many job titles that contain the word “analyst.”
To name a few others that sound similar but may not be the same role:
● Business analyst: analyzes data to help businesses improve processes,
products, or services
● Data analytics consultant: analyzes the systems and models for using data
● Data engineer: prepares and integrates data from different sources for
analytical use
● Data scientist: uses expert skills in technology and social science to find
trends through data analysis
● Data specialist: organizes or converts data for use in databases or software
systems
● Operations analyst: analyzes data to assess the performance of business
operations and workflows
Data analysts, data scientists, and data specialists sound very similar but focus on
different tasks. As you start to browse job listings online, you might notice that
companies’ job descriptions seem to combine these roles or look for candidates who
may have overlapping skills. The fact that companies often blur the lines between
them means that you should take special care when reading the job descriptions and
the skills required.
The table below illustrates some of the overlap and distinctions between them:


We used the role of data specialist as one example of many specializations within
data analytics, but you don’t have to become a data specialist! Specializations can
take a number of different turns. For example, you could specialize in developing
data visualizations and likewise go very deep into that area.

Job specializations by industry


We learned that the data specialist role concentrates on in-depth knowledge of
databases. In similar fashion, other specialist roles for data analysts can focus on
in-depth knowledge of specific industries. For example, in a job as a business
analyst you might wear some different hats than in a more general position as a data
analyst. As a business analyst, you would likely collaborate with managers, share
your data findings, and maybe explain how a small change in the company’s project
management system could save the company 3% each quarter. Although you would
still be working with data all the time, you would focus on using the data to improve
business operations, efficiencies, or the bottom line.
Other industry-specific specialist positions that you might come across in your data
analyst job search include:
● Marketing analyst—analyzes market conditions to assess the potential sales
of products and services
● HR/payroll analyst—analyzes payroll data for inefficiencies and errors
● Financial analyst—analyzes financial status by collecting, monitoring, and
reviewing data
● Risk analyst—analyzes financial documents, economic conditions, and client
data to help companies determine the level of risk involved in making a
particular business decision
● Healthcare analyst—analyzes medical data to improve the business aspect of
hospitals and medical facilities
