Foundation of Data Analysis
Introduction
Definition of Data and Data Analysis
Decision Intelligence
Understand the Data ecosystem
EMC's data analysis process
SAS's iterative process
Big data analytics process
Data-driven decision making
Data and gut instinct
Key data analyst skills
Analytical thinking for effective outcomes
Explore core analytical skills
Data life cycle
Case Study
More on the phases of data analysis
The ask phase
The prepare phase
The process phase
The analyze phase
The share phase
Spreadsheets
Databases and query languages
Visualization tools
Choose the right tool for the job
What is a query?
Example of a query
Introduction
Definition of Data and Data Analysis
Data is a collection of facts, such as pictures, words, and numbers.
Data analysis is the collection, transformation, and organization of data in order to draw
conclusions, make predictions, and drive informed decision-making.
The six steps of the data analysis process that you have been learning in this program are:
Ask
First up, the analysts needed to define what the project would look like and what would
qualify as a successful result. So, to determine these things, they asked effective questions
and collaborated with leaders and managers who were interested in the outcome of their
people analysis. These were the kinds of questions they asked:
● What do you think new employees need to learn to be successful in their first year on
the job?
● Have you gathered data from new employees before? If so, may we have access to
the historical data?
● Do you believe managers with higher retention rates offer new employees something
extra or unique?
● What do you suspect is a leading cause of dissatisfaction among new employees?
● By what percentage would you like employee retention to increase in the next fiscal
year?
Preparation
It all started with solid preparation. The group built a timeline of three months and decided
how they wanted to relay their progress to interested parties. Also during this step, the
analysts identified what data they needed to achieve the successful result they identified in
the previous step - in this case, the analysts chose to gather the data from an online survey
of new employees. These were the things they did to prepare:
● They developed specific questions to ask about employee satisfaction with different
business processes, such as hiring and onboarding, and their overall compensation.
● They established rules for who would have access to the data collected - in this case,
anyone outside the group wouldn't have access to the raw data, but could view
summarized or aggregated data. For example, an individual's compensation wouldn't
be available, but salary ranges for groups of individuals would be viewable.
● They finalized what specific information would be gathered, and how best to present
the data visually. The analysts brainstormed possible project- and data-related issues
and how to avoid them.
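The access rule described above (raw values hidden, group-level ranges visible) can be sketched in a few lines of Python. This is only an illustration: the team names, figures, and the `min_group_size` threshold are all made up, and the case study does not say how the analysts actually implemented the rule.

```python
from collections import defaultdict

# Hypothetical raw survey records: (team, annual compensation).
# All names and figures are made up for illustration.
raw_records = [
    ("Engineering", 98000),
    ("Engineering", 112000),
    ("Engineering", 105000),
    ("Sales", 61000),
    ("Sales", 74000),
    ("Legal", 90000),
]

def salary_ranges(records, min_group_size=2):
    """Return {group: (lowest, highest)} for groups with enough members.

    Groups smaller than min_group_size are left out, since a range over
    one person would reveal that person's compensation.
    """
    grouped = defaultdict(list)
    for group, salary in records:
        grouped[group].append(salary)
    return {
        group: (min(values), max(values))
        for group, values in grouped.items()
        if len(values) >= min_group_size
    }

summary = salary_ranges(raw_records)
for group, (low, high) in sorted(summary.items()):
    print(f"{group}: {low} to {high}")
```

People outside the group would only ever see `summary`, never `raw_records`.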
Process
The group sent the survey out. Great analysts know how to respect both their data and the
people who provide it. Since employees provided the data, it was important to make sure all
employees gave their consent to participate. The data analysts also made sure employees
understood how their data would be collected, stored, managed, and protected. Collecting
and using data ethically is one of the responsibilities of data analysts. In order to maintain
confidentiality and protect and store the data effectively, these were the steps they took:
security.
Analyze
Then, the analysts did what they do best: analyze! From the completed surveys, the data
analysts discovered that an employee’s experience with certain processes was a key
indicator of overall job satisfaction. These were their findings:
● Employees who experienced a long and complicated hiring process were most likely
to leave the company.
● Employees who experienced an efficient and transparent evaluation and feedback
process were most likely to remain with the company.
The group knew it was important to document exactly what they found in the analysis, no
matter what the results. To do otherwise would diminish trust in the survey process and
reduce their ability to collect truthful data from employees in the future.
Share
Just as they made sure the data was carefully protected, the analysts were also careful when
sharing the report. This is how they shared their findings:
● They shared the report with managers who met or exceeded the minimum number of
direct reports with submitted responses to the survey.
● They presented the results to the managers to make sure they had the full picture.
● They asked the managers to personally deliver the results to their teams.
This process gave managers an opportunity to communicate the results with the right
context. As a result, they could have productive team conversations about next steps to
improve employee engagement.
Act
The last stage of the process for the team of analysts was to work with leaders within their
company and decide how best to implement changes and take actions based on the
findings. These were their recommendations:
● Standardize the hiring and evaluation process for employees based on the most
efficient and transparent practices.
● Conduct the same survey annually and compare results with those from the previous
year.
A year later, the same survey was distributed to employees. Analysts anticipated that a
comparison between the two sets of results would indicate that the action plan worked.
Turns out, the changes improved the retention rate for new employees and the actions taken
Decision Intelligence
Decision Intelligence is a combination of applied data science and the social and managerial
An ecosystem is a group of elements that interact with one another. Data ecosystems are
made up of various elements that interact with one another in order to produce, manage,
store, organize, analyze, and share data. These elements include hardware and software
tools, and the people who use them. Data can also be found in something called the cloud.
The cloud is a place to keep data online, rather than on a computer hard drive. So instead of
storing data somewhere inside your organization's network, that data is accessed over the
internet. So the cloud is just a term we use to describe the virtual location.
EMC Corporation is now Dell EMC. This model, created by David Dietrich, reflects the
cyclical nature of typical business projects. The phases aren’t static milestones; each
step connects and leads to the next, and eventually repeats. Key questions help analysts
test whether they have accomplished enough to move forward and ensure that teams have
spent enough time on each of the phases and don’t start modeling before the data is ready.
It is a little different from the data analysis process on which this program is based, but it has
some core ideas in common: the first phase is interested in discovering and asking
questions; data has to be prepared before it can be analyzed and used; and then findings
should be shared and acted on.
For more information, refer to this e-book, Data Science & Big Data Analytics
1. Ask
2. Prepare
3. Explore
4. Model
5. Implement
6. Act
7. Evaluate
The SAS model emphasizes the cyclical nature of their model by visualizing it as an infinity
symbol. Its process has seven steps, many of which mirror the other models, like ask,
prepare, model, and act. But this process is also a little different; it includes a step after the
act phase designed to help analysts evaluate their solutions and potentially return to the ask
phase again.
For more information, refer to Managing the Analytics Life Cycle for Decisions at Scale
This data analytics project process was developed by Vignesh Prajapati. It doesn’t include
the sixth phase, or the act phase. However, it still covers a lot of the same steps described. It
begins with identifying the problem, preparing and processing data before analysis, and
ends with data visualization. For more information, refer to Understanding the data
analytics project life cycle
Authors Thomas Erl, Wajid Khattak, and Paul Buhler proposed a big data analytics process
in their book, Big Data Fundamentals: Concepts, Drivers & Techniques. Their process
suggests phases divided into nine steps:
1. Business case evaluation
2. Data identification
3. Data acquisition and filtering
4. Data extraction
5. Data validation and cleaning
6. Data aggregation and representation
7. Data analysis
8. Data visualization
9. Utilization of analysis results
This process appears to have three or four more steps than the previous models. But in
reality, they have just broken down what has been referred to as preparation and process
into smaller steps. It emphasizes the individual tasks required for gathering, preparing, and
cleaning data before the analysis phase. For more information, refer to Big Data
Adoption and Planning Considerations
problem that needs to be solved. For example, a problem could be a new company needing
to establish better brand recognition, so it can compete with bigger, more well-known
competitors. Or maybe an organization wants to improve a product and needs to figure out
how to source parts from a more sustainable or ethically responsible supplier. Or, it could be
a business trying to solve the problem of unhappy employees, low levels of engagement,
satisfaction and retention. Whatever the problem is, once it's defined, a data analyst finds
data, analyzes it and uses it to uncover trends, patterns and relationships. Sometimes the
data-driven strategy will build on what's worked in the past. Other times, it can guide a
business to branch out in a whole new direction.
Let's look at a real-world example. Think about a music or movie streaming service. How do
these companies know what people want to watch or listen to, and how do they provide it?
Using data-driven decision-making, they gather information about what their customers are
currently listening to, analyze it, then use the insights they've gained to make suggestions for
things people will most likely enjoy in the future. This keeps customers happy and coming
back for more, which in turn means more revenue for the company. Another example of
data-driven decision-making can be seen in the rise of e-commerce. It wasn't long ago that
most purchases were made in a physical store, but the data showed people's preferences
were changing. So a lot of companies created entirely new business models that remove the
physical store, and let people shop right from their computers or mobile phones with products
delivered right to their doorstep. In fact, data-driven decision-making can be so powerful, it
can make entire business methods obsolete.
By ensuring that data is built into every business strategy, data analysts play a critical role in
their companies' success, but it's important to note that no matter how valuable data-driven
decision-making is, data alone will never be as powerful as data combined with human
experience, observation, and sometimes even intuition. To get the most out of data-driven
decision-making, it's important to include insights from people who are familiar with the
business problem. These people are called subject matter experts, and they have the ability
to look at the results of data analysis and identify any inconsistencies, make sense of gray
areas, and eventually validate choices being made.
There are other factors that influence the decision-making process. You may have read
mysteries where the detective used their gut instinct, and followed a hunch that helped them
solve the case. Gut instinct is an intuitive understanding of something with little or no
explanation. This isn’t always something conscious; we often pick up on signals without even
realizing. You just have a “feeling” it’s right.
analysts focus on the data to ensure they make informed decisions. If you ignore data by
preferring to make decisions based on your own experience, your decisions may be biased.
But even worse, decisions based on gut instinct without any data to back them up can cause
mistakes.
For instance, if you are working on a rush project, you might need to rely on your own
knowledge and experience more than usual. There just isn’t enough time to thoroughly
analyze all of the available data. But if you get a project that involves plenty of time and
resources, then the best strategy is to be more data-driven. It’s up to you, the data analyst,
to make the best possible choice. You will probably blend data and knowledge a million
different ways over the course of your data analytics career. And the more you practice, the
better you will get at finding that perfect blend.
Curious people usually seek out new challenges and experiences. This leads to knowledge.
Context is the condition in which something exists or happens. This can be a structure or an
yours said to you, one, two, four, five, three? Well, the three will be out of context. Let's look
at another example. Have you ever shuffled a deck of cards and noticed the joker? If you're
playing a game that doesn't include jokers, identifying that card means you understand it's
out of context.
A technical mindset involves the ability to break things down into smaller steps or
pieces and work with them in an orderly and logical way. For instance, when paying your
bills, you probably already break down the process into smaller steps. When you take
something that seems like a single task, like paying your bills, and break it into smaller steps with
an orderly process, that's using a technical mindset.
Data design is how you organize information. As a data analyst, design typically has to do
with an actual database. But, again, the same skills can easily be applied to everyday life. For
example, think about the way you organize the contacts in your phone. That's actually a type of
data design. Maybe you list them by first name instead of last, or maybe you use email
addresses instead of their names. What you're really doing is designing a clear, logical list that
lets you call or text a contact in a quick and simple way.
Last, but definitely not least, the fifth and final element of analytical skills is data strategy.
Data strategy is the management of the people, processes, and tools used in data analysis.
Let's break that down. You manage people by making sure they know how to use the right data
to find solutions to the problem you're working on.
Mega-Pik is interested in following this trend. They want to do this based on data-driven
strategies, so they hire your analytics company to help them make popular movies again.
Specifically, they ask for exploratory data analysis (EDA) to help them understand what
audiences have liked in the past and determine if the successes of those films can be replicated.
You and your team develop the following objectives for Mega-Pik’s EDA:
● Release date
● Opening night revenue
● Marketing costs
● Ratings
● Genre
Now you’ll examine how inherent data analysis skills can help you guide Mega-Pik to make
data-driven decisions about which movies to produce.
Understanding context
Context is crucial for any kind of meaningful
data analysis. By contextualizing data, you
start to understand why the data shows what
it does. Factors including the time of year a
movie is released, holidays, and competing
events can all have an effect on revenue,
which is the gauge Mega-Pik uses to
determine success. Audience demographics
such as age, gender, education, and income
levels can help you understand who is going
to the movies. This context might clarify which
genres or storylines are most interesting to
movie-goers.
Analysts determine context by looking for
patterns or anomalies in a dataset. It also
information is organized. Suppose the dataset
here is presented in a spreadsheet. You
would be able to shift the cells to organize the
data to find different patterns. For example,
you might organize the data by revenue and
then by genre, which could reveal that
comedies are more profitable than dramas.
Basically, how you choose to structure your
data makes analysis easier and more
insightful.
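As a rough sketch of that kind of reorganizing, the Python below orders a handful of rows by genre and then by revenue, and averages revenue per genre. The film titles and revenue figures are invented for illustration and are not Mega-Pik data.

```python
from collections import defaultdict

# Made-up rows standing in for a spreadsheet; titles and revenue
# figures (in millions) are hypothetical.
movies = [
    {"title": "Film A", "genre": "Drama", "revenue": 40},
    {"title": "Film B", "genre": "Comedy", "revenue": 75},
    {"title": "Film C", "genre": "Drama", "revenue": 55},
    {"title": "Film D", "genre": "Comedy", "revenue": 60},
]

# Order rows by genre, then by revenue (highest first) within each genre.
by_genre_then_revenue = sorted(movies, key=lambda m: (m["genre"], -m["revenue"]))

# Average revenue per genre, to compare genres at a glance.
revenue_by_genre = defaultdict(list)
for movie in movies:
    revenue_by_genre[movie["genre"]].append(movie["revenue"])
average_revenue = {g: sum(v) / len(v) for g, v in revenue_by_genre.items()}

for movie in by_genre_then_revenue:
    print(movie["genre"], movie["title"], movie["revenue"])
print(average_revenue)
```

With these invented numbers the comedies average higher revenue than the dramas, which is the sort of pattern a different ordering can bring to the surface.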
Data strategy
Data strategy is the management of the
people, processes, and tools used in data
analysis. In this scenario, think of it as the
approach you use to analyze your dataset.
One element might be the tools you use. If
Mega-Pik wants a relatively simple
With so much data available, having a strategic mindset is key to staying focused and on track.
Strategizing helps data analysts see what they want to achieve with the data and how they can
get there. Strategy also helps improve the quality and usefulness of the data we collect. By
strategizing, we know all our data is valuable and can help us accomplish our goals.
Data analysts use a problem-oriented approach in order to identify, describe, and solve
problems. It's all about keeping the problem top of mind throughout the entire project. For
example, say a data analyst is told about the problem of a warehouse constantly running out of
supplies. They would move forward with different strategies and processes. But the number one
goal would always be solving the problem of keeping inventory on the shelves.
A correlation is like a relationship. You can find all kinds of correlations in data. But as you start
identifying correlations in data, there's one thing you always want to keep in mind: Correlation
does not equal causation. In other words, just because two pieces of data are both trending in
the same direction, that doesn't necessarily mean that one causes the other.
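To make this concrete, here is a small Python sketch that computes the Pearson correlation coefficient for two invented series. The numbers are made up: both series rise month over month, so the coefficient is high, yet nothing in the calculation says that one drives the other (a third factor, such as summer weather, could explain both).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up monthly figures: both trend upward over the same period.
ice_cream_sales = [20, 25, 30, 35, 40, 45]
sunburn_cases = [2, 3, 4, 5, 6, 7]

r = pearson(ice_cream_sales, sunburn_cases)
print(round(r, 2))
```

A coefficient near 1 only says the two series move together. Deciding whether there is a causal link takes context and further investigation.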
The final piece of the analytical thinking puzzle: big-picture thinking. This means being able to
see the big picture as well as the details. A jigsaw puzzle is a great way to think about this.
Big-picture thinking is like looking at a complete puzzle. You can enjoy the whole picture without
getting stuck on every tiny piece that went into making it. It helps you zoom out and see
possibilities and opportunities. This leads to exciting new ideas or innovations. On the flip side,
detail-oriented thinking is all about figuring out all of the aspects that will help you
execute a plan. In other words, the pieces that make up your puzzle.
What is the root cause of a problem? A root cause is the reason why a problem occurs. If we
can identify and get rid of a root cause, we can prevent that problem from happening again.
A simple way to wrap your head around root causes is with the process called the Five
Whys. In the Five Whys you ask "why" five times to reveal the root cause. The fifth and final
answer should give you some useful and sometimes surprising insights. Here's an example
Let's say you wanted to make a blueberry pie but couldn't find any blueberries. You've
recently explored a case involving lacking the necessary ingredients to bake pies; now, you’ll
go more in-depth with some business applications of the five whys technique to do root
cause analysis.
An online grocery store was receiving numerous customer service complaints about poor
deliveries. To address this problem, a data analyst at the company asked their first “why?”
Why #1. “Customers are complaining about poor grocery deliveries. Why?”
The data analyst began by reviewing the customer feedback more closely. They noted the
vast majority of complaints dealt with products arriving damaged. So, they asked “why?”
again.
were already being asked to pack groceries for customer orders.
Why #5. “Packers have not completed required training. Why?”
This final “why?” led the data analyst to find out that the human resources department had
not provided necessary training to any newly hired packers. This was because HR was in
the middle of reworking the training program. Rather than training new hires using the old
system, they had provided them with a quick one-page guide, which was insufficient.
So, in this example, the root cause of the problem was that HR had not completed the
training program updates and was using a less-thorough guide to train new packers.
Fortunately, this was a problem that the grocer could control. And thanks to the data
analyst’s work, they provided more support to the HR department to complete the training
and retrain all newly hired grocery packers!
An irrigation company was experiencing an increase in the number of defects in their water
pumps. The company's data team used the five whys to analyze the situation:
Why #1. “There has been an increase in the number of defects in water pumps. Why?”
To answer this question, the data team set up a meeting with shop floor engineers. They
asked for some insights into machine performance and manufacturing processes. After
some exploration, it was discovered that the machines used to produce the pumps were not
properly calibrated.
Why #4. “The calibration method is inappropriate for the machines. Why?”
This “why” led them to discover that the company had recently installed new software in their
machines. Because it was a minor software upgrade, the engineers didn’t realize it would
affect calibration. They didn’t have the information they needed to properly calibrate the
upgraded machines.
Why #5. “The engineers don’t have the information they need to calibrate the upgraded
machines. Why?”
The fifth and final “why” turned up even more evidence: The installation team had upgraded
machine software, but had failed to share the corresponding calibration procedures with the
engineers.
So, in this example, the root cause of the problem was that the engineers lacked important
information about how to calibrate the machines using the new software system. The
solution was found, and the irrigation company was able to implement it right away. Soon,
the engineers had the necessary calibration instructions, and the pump defects were
eliminated!
Another question commonly asked by data analysts is, where are the gaps in our
process? For this, many people will use something called gap analysis. Gap analysis lets
you examine and evaluate how a process works currently in order to get where you want to
be in the future. Businesses conduct gap analysis to do all kinds of things, such as improve
a product or become more efficient. The general approach to gap analysis is understanding
where you are now compared to where you want to be.
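In its simplest numeric form, a gap is the target value minus the current value. The Python sketch below shows that arithmetic for two hypothetical metrics. The metric names and percentages are invented, and real gap analysis usually covers qualitative factors as well.

```python
# Hypothetical metrics: where the process is now and where it should be.
current = {"on_time_delivery_pct": 82, "customer_satisfaction_pct": 74}
target = {"on_time_delivery_pct": 95, "customer_satisfaction_pct": 90}

# The gap for each metric is target minus current.
gaps = {metric: target[metric] - current[metric] for metric in target}

# List metrics from largest gap to smallest to suggest where to focus first.
priorities = sorted(gaps, key=gaps.get, reverse=True)

print(gaps)
print(priorities)
```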
Data life cycle
First, let's spend a little time understanding the data life cycle. No, data isn't actually alive,
but it does have a life cycle. How do data analysts bring data to life? Well, it starts with the
right data analysis tool. These include spreadsheets, databases, query languages, and
visualization software.
The stages of the data life cycle are plan, capture, manage, analyze, archive, and destroy.
The data life cycle provides a generic or common framework for how data is
managed. You may recall that variations of the data analysis life cycle were
described in Origins of the data analysis process. The same can be done for the
data life cycle. The rest of this reading provides a glimpse of how government,
finance, and education institutions can view data life cycles a little differently.
Planning
Let's start with the first phase, planning. This actually happens well before starting an
analysis project. During planning, a business decides what kind of data it needs, how it will
be managed throughout its life cycle, who will be responsible for it, and the optimal
outcomes. For example, let's say an electricity provider wanted to gain insights into how to
save people energy. In the planning phase, they might decide to capture information on how
much electricity its customers use each year, what types of buildings are being powered, and
what types of devices are being powered inside of them. The electricity company would also
decide which team members will be responsible for collecting, storing, and sharing that data.
All of this happens during planning, and it helps set up the rest of the project.
Capture
The next phase is when you capture data. This is where data is collected from a variety of
different sources and brought into the organization. With so much data being created
every day, the ways to collect it are truly endless. One common method is getting data from
outside resources. For example, if you were doing data analysis on weather patterns, you'd
probably get data from a publicly available source like the National Climatic Data Center.
Another way to get data is from a company's own documents and files, which are usually
stored inside a database. While we've mentioned databases before, we haven't gone into
too much detail about what they are. A database is a collection of data stored in a computer
system. In the case of our electricity provider, the business would probably measure data
usage among its customers within a database that it owns. As a quick note, when you
maintain a database of customer information, ensuring data integrity, credibility, and privacy
are all important concerns.
Manage
Here we're talking about how we care for our data, how and where it's stored, the tools used
to keep it safe and secure, and the actions taken to make sure that it's maintained properly.
This phase is very important to data cleansing, which we'll cover later on.
Analyze
Next it's time to analyze your data. This is where data analysts really shine. In this phase,
the data is used to solve problems, make great decisions, and support business goals. For
example, one of our electricity company's goals might be to find ways to help customers
save energy.
Archive
Archiving means storing data in a place where it's still available, but may not be used again.
During analysis, analysts handle huge amounts of data. Can you imagine if we had to sort
through all of the available data that's out there, even if it was no longer useful and relevant to
our work? It makes way more sense to archive it than to keep it around.
Destroy
They would have data stored on multiple hard drives. To destroy it, the company would use
secure data erasure software. If there were any paper files, they would be shredded too. This is
important for protecting a company's private information, as well as private data about its
customers.
Case Study
1. The U.S. Fish and Wildlife Service
The U.S. Fish and Wildlife Service uses the following data life cycle:
● Plan
● Acquire
● Maintain
● Access
● Evaluate
● Archive
For more information, refer to U.S. Fish and Wildlife's Data Management Life Cycle
page.
2. The U.S. Geological Survey (USGS)
The USGS uses the data life cycle below:
● Plan
● Acquire
● Process
● Analyze
● Preserve
● Publish/share
Several cross-cutting or overarching activities are also performed during each stage of their life
cycle:
● Describe (metadata and documentation)
● Manage quality
● Backup and secure
For more information, refer to the USGS Data Lifecycle page.
3. Financial institutions
Financial institutions may take a slightly different approach to the data life cycle as described in
The Data Life Cycle, an article in Strategic Finance magazine:
● Capture
● Qualify
● Transform
● Utilize
● Report
● Archive
● Purge
4. Harvard Business School (HBS)
One final data life cycle informed by Harvard University research has eight stages:
● Generation
● Collection
● Processing
● Storage
● Management
● Analysis
● Visualization
● Interpretation
For more information, refer to 8 Steps in the Data Life Cycle.
act—plays a crucial role in extracting meaningful insights from data. As you navigate through
each phase, from asking the right questions to taking informed actions, you harness the true
power of data. In this reading, you’ll explore how the data analysis process guides this
program.
The ask phase
At the start of any successful data analysis, the data analyst:
● Takes the time to fully understand stakeholder expectations
● Defines the problem to be solved
● Decides which questions to answer in order to solve the problem
Qualifying stakeholder expectations means determining who the stakeholders are, what they
want, when they want it, why they want it, and how best to communicate with them. Defining
the problem means looking at the current state and identifying the ways in which it’s different
from the ideal state. With expectations qualified and the problem defined, you can derive
questions that will help achieve these goals.
be based on facts and be fair and impartial.
The process phase
In this phase, the aim is to refine the data. Data analysts find and eliminate any errors and
inaccuracies that can get in the way of results. This usually means:
● Cleaning data
● Transforming data into a more useful format
● Combining two or more datasets to make information more complete
● Removing outliers (data points that could skew the information)
After data analysts process data, they check the data they prepared to make sure it's
complete and correct. This phase is all about getting the details right. Accordingly, the data
analyst will refine strategies for verifying and sharing their data cleaning with stakeholders. In
an upcoming course, you’ll use spreadsheets and structured query language, or SQL, to
clean data.
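The four tasks listed for the process phase can be sketched in Python as follows. Everything here is illustrative: the employee IDs, scores, and delivery times are made up, and the 1.5 × IQR rule used to flag outliers is one common convention chosen for this example, not a method prescribed by the program.

```python
from statistics import quantiles

# Two made-up datasets keyed by a hypothetical employee ID.
survey = {"e1": " 4 ", "e2": "5", "e3": "", "e4": "3"}
tenure_months = {"e1": 12, "e2": 14, "e3": 9, "e4": 11}

# Cleaning: strip stray whitespace and drop empty responses.
# Transforming: convert the remaining text values to integers.
scores = {k: int(v.strip()) for k, v in survey.items() if v.strip()}

# Combining: join the two datasets on the shared key.
combined = {
    k: {"score": scores[k], "tenure": tenure_months[k]}
    for k in scores
    if k in tenure_months
}

def remove_outliers(values):
    """Keep values within 1.5 * IQR of the quartiles (an assumed convention)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if low <= v <= high]

# Removing outliers: one delivery time is far from the rest.
delivery_minutes = [30, 32, 31, 29, 33, 30, 240]
typical = remove_outliers(delivery_minutes)

print(combined)
print(typical)
```

Whether a flagged value should really be removed is a judgment call. An extreme value can be an error, or it can be the most interesting point in the dataset.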
With a solid foundation of well-defined questions and clean data, you’ll delve into the analyze
phase. This is when you turn the data you’ve gathered, prepared, and processed into
actionable information. Data analysts use many powerful tools in their work. In one
upcoming course you'll continue using two of them: spreadsheets and SQL. In another
upcoming course you’ll explore using the programming language R to work with and analyze
data.
calculation using the data in a spreadsheet. Formulas can do basic things like add, subtract,
multiply and divide, but they don't stop there. You can also use formulas to find the average
of a number set, look up a particular value, return the sum of a set of values that meets a
particular rule, and so much more. A function is a preset command that automatically
performs a specific process or task using the data in a spreadsheet.
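For readers who think in code, the three operations mentioned above (average, lookup, conditional sum) roughly correspond to the spreadsheet functions AVERAGE, VLOOKUP, and SUMIF. The Python below mimics each one on two made-up columns. The item names, amounts, and the `lookup` helper are hypothetical, and the sketch is an analogy rather than how a spreadsheet works internally.

```python
# Made-up spreadsheet columns: item names and amounts.
items = ["pens", "paper", "ink", "paper"]
amounts = [4, 10, 25, 6]

# Like AVERAGE: the mean of a set of numbers.
average = sum(amounts) / len(amounts)

def lookup(key, keys, values):
    """Like a lookup function: return the value on the first matching row."""
    for k, v in zip(keys, values):
        if k == key:
            return v
    return None

ink_amount = lookup("ink", items, amounts)

# Like SUMIF: add up only the values whose row meets a rule.
paper_total = sum(a for i, a in zip(items, amounts) if i == "paper")

print(average, ink_amount, paper_total)
```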
As you are learning, the most common programs and solutions used by data analysts
include spreadsheets, query languages, and visualization tools. In this reading, you will learn
more about each one. You will cover when to use them, and why they are so important in
data analytics.
Spreadsheets
Data analysts rely on spreadsheets to collect and organize data. Two popular spreadsheet
applications you will probably use a lot in your future role as a data analyst are Microsoft
data project
● Create excellent data visualizations, like graphs and charts.
Query languages
● Allow analysts to isolate specific information from one or more databases
● Make it easier for you to learn and understand the requests made to databases
● Allow analysts to select, create, add, or download data from a database for analysis
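As a minimal sketch of what a query looks like in practice, the Python below uses the standard-library `sqlite3` module to build a tiny in-memory database and ask it for specific rows. The `customers` table, its columns, and its contents are invented for this example.

```python
import sqlite3

# In-memory database with a made-up table; the table and column
# names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers (first_name, city) VALUES (?, ?)",
    [("Ana", "Hanoi"), ("Binh", "Hue"), ("Chi", "Hanoi")],
)

# The query: select one column, but only for rows that match a condition.
rows = conn.execute(
    "SELECT first_name FROM customers WHERE city = ? ORDER BY first_name",
    ("Hanoi",),
).fetchall()
conn.close()

print(rows)
```

The SELECT statement isolates just the information needed (first names of customers in one city) without pulling the whole table.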
Visualization tools
Data analysts use a number of visualization tools, like graphs, maps, tables, charts, and
more. Two popular visualization tools are Tableau and Looker.
These tools
● Turn complex numbers into a story that people can understand
● Help stakeholders come up with conclusions that lead to informed decisions and
effective business strategies
● Have multiple features
which are used a lot for statistical analysis, visualization, and other data analysis.
Choose the right tool for the job
As a data analyst, you will usually have to decide which program or solution is right for the
particular project you are working on. In this reading, you will learn more about how to
choose which tool you need and when.
Depending on which phase of the data analysis process you’re in, you will need to use
different tools. For example, if you are focusing on creating complex and eye-catching
visualizations, then the visualization tools we discussed earlier are the best choice. But if you
are focusing on organizing, cleaning, and analyzing data, then you will probably be choosing
between spreadsheets and databases using queries. Spreadsheets and databases both
offer ways to store, manage, and use data. The basic content for both tools is sets of
values. Yet, there are some key differences, too:
Spreadsheets | Databases
Structured data in a row and column format | Structured data using rules and relationships
Provides access to a limited amount of data | Provides access to huge amounts of data
1. If your name is longer than the width of the column, select and drag the right edge
of the corresponding column until it fits.
2. To wrap text, select the cells, columns, or rows with text that you want to reformat.
3. Select the Format menu.
4. Under Wrapping, select Wrap.
Example 2: Add labels
Add labels, or attributes, to help you keep track of the data:
1. Select cell A1.
2. Enter First Name.
3. Select cell B1.
4. Enter Last Name.
5. Select cells A1 and B1. To do this, select a single cell and drag your cursor over to
the other cell to include it in the selection.
toolbar.
5. Adjust the columns to fit the new text.
6. Enter the corresponding data in cells C2, D2, and E2 (your number of siblings,
favorite color, and favorite dessert).
7. Add data about two more people in rows 3 and 4. These can be people you know or
people you’ve just made up.
A. To select nonadjacent cells and/or cell ranges, hold the Command (Mac) or Ctrl
(PC) key and select the cells.
B. To select a range of cells, hold the Shift key and either drag your cursor over which
cells you want to include or use the arrow keys to select a range.
C. Select a single cell and drag your cursor over the cells you want to include in your
selection.
2) Select the Data menu.
3) Select Sort range, then select Advanced range sorting options.
4) In the Advanced range sorting options window, select the checkbox for Data has
header row. Make sure that A to Z is selected.
5) Select the Sort by drop-down menu, then select Siblings.
6) Select Sort. This will organize the spreadsheet by the number of siblings, from
lowest to highest.
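The same sort can be sketched outside a spreadsheet. The Python example below is only an illustration: the three people and the column labels other than Siblings are made up, and the header row is kept out of the sort, which is what the "Data has header row" checkbox does.

```python
# Hypothetical rows mirroring the practice sheet: a header row plus three people.
header = ["First Name", "Last Name", "Siblings", "Favorite Color", "Favorite Dessert"]
people = [
    ["Ana", "Lopez", 3, "green", "flan"],
    ["Ben", "Okafor", 0, "blue", "pie"],
    ["Chi", "Tran", 1, "red", "mochi"],
]

# Sort only the data rows, keyed on the Siblings column (A to Z means ascending).
siblings_col = header.index("Siblings")
sorted_people = sorted(people, key=lambda row: row[siblings_col])

for row in [header] + sorted_people:
    print(row)
```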
Spreadsheets enable data professionals to analyze data. In this example, the instructor uses
a formula to calculate a sum.
1. Select the next empty cell in the Siblings column (C5).
2. Enter the formula =C2+C3+C4.
3. Press Enter on your keyboard to complete the formula.
4. The formula calculates the total number of siblings.
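As a rough illustration of what the formula does, this Python sketch treats cell addresses as dictionary keys. The sibling counts are made-up values, and the =SUM(C2:C4) comparison is an aside that is not part of the steps above.

```python
# Column C of the practice sheet, keyed by cell address; the values are made up.
cells = {"C2": 3, "C3": 0, "C4": 1}

# Equivalent of typing =C2+C3+C4 into cell C5.
cells["C5"] = cells["C2"] + cells["C3"] + cells["C4"]

# Equivalent of =SUM(C2:C4), which adds every cell in the range.
total_with_sum = sum(cells[f"C{row}"] for row in range(2, 5))

print(cells["C5"], total_with_sum)
```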
SQL in action
Just as humans use different languages to communicate with others, so do computers.
Structured Query Language (or SQL, often pronounced “sequel”) enables data analysts to talk
to their databases. SQL is one of the most useful data analyst tools, especially when working
with large datasets in tables. It can help you investigate huge databases, track down text
(referred to as strings) and numbers, and filter for the exact kind of data you need—much
faster than a spreadsheet can.
What is a query?
A query is a request for data or information from a database. When you query databases,
you use SQL to communicate your question or request. You and the database can always
exchange information as long as you speak the same language.
Every programming language, including SQL, follows a unique set of guidelines known as
syntax. Syntax is the predetermined structure of a language that includes all required words,
symbols, and punctuation, as well as their proper placement. As soon as you enter your
search criteria using the correct syntax, the query starts working to pull the data you’ve
requested.
A SQL query is like filling in a template. You will find that if you are writing a SQL query from
scratch, it is helpful to start a query by writing the SELECT, FROM, and WHERE keywords in the
following format:
SELECT
FROM
WHERE
Next, enter the table name after the FROM; the table columns you want after the SELECT;
and, finally, the conditions you want to place on your query after the WHERE. Make sure to
add a new line and indent when adding these, as shown below:
WHERE Specifies criteria that the data must meet
Following this method each time makes it easier to write SQL queries. It can also help you
make fewer syntax errors.
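Here is a minimal sketch of the template filled in and run. It assumes an in-memory SQLite database (through Python's built-in sqlite3 module) as a stand-in for whatever database you use; the products table, its columns, and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for a real database; table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("desk", "furniture", 120.0), ("lamp", "lighting", 25.0), ("chair", "furniture", 60.0)],
)

# Start from the SELECT / FROM / WHERE skeleton, then fill in each indented line.
query = """
SELECT
    name, price
FROM
    products
WHERE
    category = 'furniture'
"""
furniture = conn.execute(query).fetchall()
print(furniture)
conn.close()
```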
Example of a query
Here is how a simple query would appear in BigQuery, a data warehouse on the Google
Cloud Platform.
The above query uses three commands to locate customers with the first_name 'Tony':
1. SELECT the column named first_name
2. FROM the table that holds the customer data. (The dataset name is always followed by a
dot, and then the table name.)
3. But only return the data WHERE the first_name is 'Tony'
As you can conclude, this query had the correct syntax, but it wasn't very useful after the data
was returned: every row in the result contains only the first name 'Tony'.
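The sketch below reproduces that situation under stated assumptions: SQLite in memory stands in for BigQuery, so there is no dataset prefix, and the customer table with its three rows is hypothetical. It also shows how selecting more columns makes the same filter informative.

```python
import sqlite3

# SQLite in memory stands in for BigQuery; the table name and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Tony", "Magnolia"), (2, "Tony", "Reyes"), (3, "Mia", "Chen")],
)

# Selecting only first_name returns the value we already filtered on.
first_names = conn.execute(
    "SELECT first_name FROM customer WHERE first_name = 'Tony'"
).fetchall()
print(first_names)

# Adding columns to SELECT tells us which customers are named Tony.
details = conn.execute(
    "SELECT customer_id, first_name, last_name FROM customer WHERE first_name = 'Tony'"
).fetchall()
print(details)
conn.close()
```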
The above query uses three commands to locate customers with the first_name, 'Tony'.
The only difference between this query and the previous one is that more data columns are
selected. The previous query selected first_name only while this query selects
select more columns if you will actually use the additional fields in your WHERE clause. If you
have multiple conditions in your WHERE clause, they may be written like this:
Notice that unlike the SELECT clause, which uses commas to separate fields, the WHERE
clause uses the AND operator to connect conditions. As you become a more advanced
writer of queries, you will make use of other operators such as OR and NOT.
Here is a BigQuery example with multiple fields used in a WHERE clause:
The above query uses three commands to locate customers with a valid (greater than 0),
2. FROM the table that holds the customer data. (The dataset name is always followed by a dot, and then the table name.)
3. But only return the data WHERE customer_id is greater than 0, first_name is
Note that one of the conditions is a logical condition that checks to see if customer_id is
greater than zero.
If only one customer is named Tony Magnolia, the results from the query could be:
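A query of this shape can be sketched as follows. This is an illustration only: SQLite in memory stands in for BigQuery, and the customer table, its customer_id values, and its rows are made up.

```python
import sqlite3

# Hypothetical customer table; one row has an invalid customer_id of 0.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1967, "Tony", "Magnolia"), (0, "Tony", "Magnolia"), (2041, "Tony", "Reyes")],
)

# Commas separate the selected fields; AND connects the three conditions.
query = """
SELECT
    customer_id, first_name, last_name
FROM
    customer
WHERE
    customer_id > 0
    AND first_name = 'Tony'
    AND last_name = 'Magnolia'
"""
matches = conn.execute(query).fetchall()
print(matches)
conn.close()
```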
Endless SQL possibilities
Capitalization, indentation, and semicolons
You can write your SQL queries in all lowercase and don’t have to worry about extra spaces
between words. However, using capitalization and indentation can help you read the
information more easily. Keep your queries neat, and they will be easier to review or
troubleshoot if you need to check them later on.
Notice that the SQL statement shown above has a semicolon at the end. The semicolon is a
statement terminator and is part of the American National Standards Institute (ANSI) SQL-92
standard, which is a recommended common syntax for adoption by all SQL databases.
However, not all SQL databases have adopted or enforce the semicolon, so it’s possible you
may come across some SQL statements that aren’t terminated with a semicolon. If a
statement works without a semicolon, it’s fine.
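To see that formatting is a readability choice, this sketch runs one query twice in SQLite (used here as a stand-in database, with a hypothetical customer table): once in lowercase on a single line with no semicolon, and once capitalized, indented, and terminated with a semicolon. Only the keywords are case-insensitive; the quoted value 'Tony' is compared as written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (first_name TEXT)")  # hypothetical table
conn.executemany("INSERT INTO customer VALUES (?)", [("Tony",), ("Mia",)])

# Lowercase keywords, one line, no semicolon.
compact = conn.execute("select first_name from customer where first_name = 'Tony'").fetchall()

# Capitalized keywords, indentation, and a terminating semicolon.
formatted = conn.execute(
    """
    SELECT
        first_name
    FROM
        customer
    WHERE
        first_name = 'Tony';
    """
).fetchall()

print(compact == formatted)
conn.close()
```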
WHERE conditions
In the query shown above, the SELECT clause identifies the column you want to pull data
from by name, field1, and the FROM clause identifies the table where the column is
located by name, table. Finally, the WHERE clause narrows your query so that the database
returns only the data with an exact value match or the data that matches a certain condition
However, if you are looking for all customers with a last name that begins with the letters
“Ch,” the WHERE clause would be:
database to look for a certain pattern! The percent sign % is used as a wildcard to match zero
or more characters. In the example above, both Chavez and Chen would be returned. Note
that in some databases an asterisk * is used as the wildcard instead of a percent sign %.
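A small sketch of the pattern match, run in SQLite as a stand-in database. The column name field1 comes from the text; the table name customers and the sample last names are hypothetical. The bare value Ch is included to show that % also matches when nothing follows the prefix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# field1 holds last names, as in the text; the table name customers is hypothetical.
conn.execute("CREATE TABLE customers (field1 TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?)",
    [("Chavez",), ("Chen",), ("Ch",), ("Cruz",), ("Sanchez",)],
)

# 'Ch%' matches values that start with Ch, followed by zero or more characters.
rows = conn.execute("SELECT field1 FROM customers WHERE field1 LIKE 'Ch%'").fetchall()
starts_with_ch = sorted(name for (name,) in rows)
print(starts_with_ch)
conn.close()
```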
Comments
Some tables aren’t designed with descriptive enough naming conventions. In the example,
field1 was the column for a customer’s last name, but you wouldn’t know it by the name. A
better name would have been something such as last_name. In these cases, you can place
comments alongside your SQL to help you remember what the name represents. Comments
are text placed between certain characters, /* and */, or after two dashes (--), as shown
below.
Comments can also be added outside of a statement as well as within a statement.
You can use this flexibility to provide an overall description of what you are going to
do, step-by-step notes about how you achieve it, and why you set different
parameters/conditions.
The more comfortable you get with SQL, the easier it will be to read and understand
queries at a glance. Still, it never hurts to have comments in a query to remind
yourself of what you’re trying to do. Comments also make it easier for others to
understand your query if it is shared. As your queries become more and
more complex, this practice will save you a lot of time and energy when you revisit
queries you wrote months or years ago.
generally supported. So it is best to use -- and be consistent with it. You can use #
in place of -- in the above query, but # is not recognized in all SQL versions; for
example, PostgreSQL doesn’t recognize #. You can also place comments between /*
and */ if the database you are using supports it.
As you develop your skills professionally, depending on the SQL database you use,
you can pick the comment delimiters you prefer and stick with
them as a consistent style. As your queries become more and more complex, the
practice of adding helpful comments will save you a lot of time and energy when you need to
understand queries that you may have written months or years prior.
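The following sketch shows both comment styles inside and outside a statement. It runs in SQLite, which accepts -- and /* */ comments; the customers table is hypothetical, and other databases may differ in which delimiters they accept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (field1 TEXT)")  # hypothetical table
conn.executemany("INSERT INTO customers VALUES (?)", [("Chavez",), ("Nguyen",)])

# The comments are part of the SQL text and are ignored when the query runs.
query = """
-- Pull the last names of all customers
/* This block comment sits outside the statement
   and could describe the overall goal. */
SELECT
    field1  -- field1 holds the customer's last name
FROM
    customers
"""
last_names = conn.execute(query).fetchall()
print(last_names)
conn.close()
```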
Aliases
You can also make it easier on yourself by assigning a new name or alias to the
column or table names to make them easier to work with (and avoid the need for
comments). This is done with a SQL AS clause. In the example below, aliases are
used for both a table name and a column. Within the database, the table is called
aliases are good for the duration of the query only. An alias doesn’t change the
actual name of a column or table in the database.
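Here is a minimal sketch of aliasing, run in SQLite as a stand-in database. The table name tbl_2022_cust and the alias c are hypothetical; field1 and last_name follow the naming discussed earlier. The second query confirms that the stored column name is unchanged after the aliased query runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Deliberately unhelpful names; the table name is a hypothetical stand-in.
conn.execute("CREATE TABLE tbl_2022_cust (field1 TEXT)")
conn.executemany("INSERT INTO tbl_2022_cust VALUES (?)", [("Chavez",), ("Chen",)])

# AS gives the column and the table friendlier names for this query only.
cursor = conn.execute(
    """
    SELECT
        c.field1 AS last_name
    FROM
        tbl_2022_cust AS c
    """
)
result_columns = [col[0] for col in cursor.description]
print(result_columns)

# The stored column keeps its original name.
stored_columns = [row[1] for row in conn.execute("PRAGMA table_info(tbl_2022_cust)")]
print(stored_columns)
conn.close()
```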
Example of a query with aliases
Imagine you are a data analyst for a small business and your manager asks you for
some employee data. You decide to write a query with SQL to get what you need
from the database.
You want to pull all the columns: empID, firstName, lastName, jobCode, and
salary. Because you know the database isn’t that big, instead of entering each
column name in the SELECT clause, you use SELECT *. This will select all the
columns from the Employee table in the FROM clause.
Now, you can get more specific about the data you want from the Employee table. If
you want all the data about employees working in the 'SFI' job code, you can use a
WHERE clause to filter the data based on this additional requirement.
A portion of the resulting data returned from the SQL query might look like this:
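A sketch of this step, using SQLite in memory as a stand-in database. The Employee table and its column names come from the text; the four rows, including the 'ACC' job code, are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee "
    "(empID INTEGER, firstName TEXT, lastName TEXT, jobCode TEXT, salary INTEGER)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ana", "Lopez", "SFI", 52000),
        (2, "Ben", "Okafor", "INT", 28000),
        (3, "Chi", "Tran", "SFI", 29000),
        (4, "Dee", "Park", "ACC", 30000),
    ],
)

# SELECT * returns every column; WHERE keeps only rows in the 'SFI' job code.
sfi_rows = conn.execute("SELECT * FROM Employee WHERE jobCode = 'SFI'").fetchall()
for row in sfi_rows:
    print(row)
conn.close()
```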
Suppose you notice a large salary range for the 'SFI' job code. You might like to
flag all employees in all departments with lower salaries for your manager. Because
interns are also included in the table and they have salaries less than $30,000, you
want to make sure your results give you only the full-time employees with salaries
that are $30,000 or less. In other words, you want to exclude interns with the 'INT'
job code who also earn less than $30,000. The AND operator enables you to test for
both conditions.
You create a SQL query similar to below, where <> means "does not equal":
The resulting data from the SQL query might look like the following (interns with the
job code INT aren't returned):
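A sketch of such a filter, again with SQLite as a stand-in database and made-up rows. The Employee table and its column names come from the text; the specific employees and the 'ACC' job code are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee "
    "(empID INTEGER, firstName TEXT, lastName TEXT, jobCode TEXT, salary INTEGER)"
)
# Made-up rows: one intern, and two non-interns at or under $30,000.
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ana", "Lopez", "SFI", 52000),
        (2, "Ben", "Okafor", "INT", 28000),
        (3, "Chi", "Tran", "SFI", 29000),
        (4, "Dee", "Park", "ACC", 30000),
    ],
)

# Both conditions must hold: not an intern, and salary of $30,000 or less.
low_paid = conn.execute(
    """
    SELECT
        *
    FROM
        Employee
    WHERE
        jobCode <> 'INT'
        AND salary <= 30000
    """
).fetchall()
for row in low_paid:
    print(row)
conn.close()
```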
With quick access to this kind of data using SQL, you can provide your manager with
many different insights about employee data, including whether employee salaries
across the business are equitable. Fortunately, the query shows that only two
additional employees might need a salary adjustment, and you share the results with your
manager.
Pulling the data, analyzing it, and implementing a solution might ultimately help
improve employee satisfaction and loyalty. That makes SQL a pretty powerful tool.
Let’s go through an example of a real-life situation where a data analyst might need
to create a data visualization to share with stakeholders. Imagine you’re a data
analyst for a clothing distributor. The company helps small clothing stores manage
their inventory, and sales are booming. One day, you learn that your company is
getting ready to make a major update to its website. To guide decisions for the
website update, you’re asked to analyze data from the existing website and sales
records. Let’s go through the steps you might follow.
Step 1: Explore the data for patterns
First, you ask your manager or the data owner for access to the current sales
records and website analytics reports. This includes information about how
customers behave on the company’s existing website, basic information about who
visited, who bought from the company, and how much they bought.
While reviewing the data you notice a pattern among those who visit the company’s
website most frequently: geography and larger amounts spent on purchases. With
further analysis, this information might explain why sales are so strong right now in
the northeast—and help your company find ways to make them even stronger
you have a lot of data spread across several different tables, which isn’t an ideal way
to share your results with management and the marketing team. You will want to
create a data visualization that explains your findings quickly and effectively to your
target audience. Since you know your audience is sales oriented, you already know
that the data visualization you use should:
● Show sales numbers over time
● Connect sales to location
● Show the relationship between sales and website use
● Show which customers fuel growth
Step 3: Create your visuals
Now that you have decided what kind of information and insights you want to display,
it is time to start creating the actual visualizations. Keep in mind that creating the
right visualization for a presentation or to share with stakeholders is a process. It
involves trying different visualization formats and making adjustments until you get
what you are looking for. In this case, a mix of different visuals will best communicate
your findings and turn your analysis into the most compelling story for stakeholders.
So, you can use the built-in chart capabilities in your spreadsheets to organize the
data and create your visuals.
visualizations like bar graphs and pie charts, and even provide some advanced
visualizations like maps, and waterfall and funnel diagrams (shown in the following
figures).
But sometimes you need a more powerful tool to truly bring your data to life. Tableau
and RStudio are two examples of widely used platforms that can help you plan,
create, and present effective and compelling data visualizations.
(most importantly) useful. Tableau works well with a wide variety of data and includes
an interactive dashboard that lets you and your stakeholders click to explore the data
interactively.
You can start exploring Tableau from the How-to Video resources. Tableau Public is
free, easy to use, and full of helpful information. The Resources page is a
one-stop-shop for how-to videos, examples, and datasets for you to practice with. To
explore what other data analysts are sharing on Tableau, visit the Viz of the Day
page where you will find beautiful visuals ranging from an overview of the
Lighthouses of Greece to Who’s Talking in Popular Films.
Programming language (R with RStudio)
A lot of data analysts work with a programming language called R. Most people who
work with R end up also using RStudio, an integrated development environment (IDE),
for their data visualization needs. As with Tableau, you can create dashboard-style
data visualizations using RStudio.
Check out their website to learn more about RStudio.
You could easily spend days exploring all the resources provided at RStudio.com,
but the RStudio Cheatsheets and the RStudio Visualize Data Primer are great places
to start. When you have more time, check out the webinars and videos which offer
advice and helpful perspectives for both beginners and advanced users.
Consider fairness
certain that their analysis is fair. Fairness means ensuring your analysis doesn't
create or reinforce bias. This can be challenging, but if the analysis is not objective,
the conclusions can be misleading and even harmful. In this reading, you’re going to
explore some best practices you can use to guide your work toward a fairer
analysis.
Explanation: ... own expectations.
Example: ... additional data helps them gain more complete insights.
Best practice: Identify surrounding factors
Explanation: As you’ll learn throughout these courses, context is key for you and your
stakeholders to understand the final conclusions of any analysis. Similar to considering all
of the data, you also must understand surrounding factors that could ...
Example: A human resources department wants to better plan for employee vacation time in
order to anticipate staffing needs. HR uses a list of national bank holidays as a key part of
the data-gathering process. But they fail to consider important holidays that aren’t on the
... population.
Best practice: Use oversampling effectively
Explanation: When collecting data about a population, it’s important to be aware of the
actual makeup of that population. Sometimes, oversampling can help you represent groups in
that population that otherwise wouldn’t be represented fairly. Oversampling is the process
of increasing the sample size of nondominant groups in a ...
Example: A fitness company is releasing new digital content for users of their equipment.
They are interested in designing content that appeals to different users, knowing that
different people may interact with their equipment in different ways. For example, part of
their user-base is age 70 or older. In order to represent these users, ...
Best practice: Think about fairness from beginning to end
Explanation: To ensure that your analysis and final ...
Example: A data team kicks off a project by including ...
apply analytics, and implement new technologies.
This is why skilled data analysts are some of the most sought-after professionals in
the world. A study conducted by IBM estimates that there are over 380,000 job
openings in the Data Analytics field in the United States*. Because the demand is so
strong, you’ll be able to find job opportunities in virtually any industry. Do a quick
search on any major job site and you’ll notice that businesses of every type, from zoos
to health clinics to banks, are seeking talented data professionals. Even if the job title
doesn’t use the exact term “data analyst,” the job description for most roles involving
data analysis will likely include a lot of the skills and qualifications you’ll gain by the
end of this program. In this reading, we’ll explore some of the data analyst-related
roles you might find in different companies and industries.
* Burning Glass data, Feb 1, 2021 - Jan 31, 2022, US
The data analyst role is one of many job titles that contain the word “analyst.”
To name a few others that sound similar but may not be the same role:
● Business analyst—analyzes data to help businesses improve processes,
products, or services
● Data analytics consultant—analyzes the systems and models for using data
analytical use
● Data scientist—uses expert skills in technology and social science to find
trends through data analysis
We used the role of data specialist as one example of many specializations within
data analytics, but you don’t have to become a data specialist! Specializations can
take a number of different turns. For example, you could specialize in developing
data visualizations and likewise go very deep into that area.
your data findings, and maybe explain how a small change in the company’s project
management system could save the company 3% each quarter. Although you would
still be working with data all the time, you would focus on using the data to improve
business operations, efficiencies, or the bottom line.
Other industry-specific specialist positions that you might come across in your data
analyst job search include:
● Marketing analyst—analyzes market conditions to assess the potential sales
of products and services
● HR/payroll analyst—analyzes payroll data for inefficiencies and errors
● Financial analyst—analyzes financial status by collecting, monitoring, and
reviewing data
● Risk analyst—analyzes financial documents, economic conditions, and client
data to help companies determine the level of risk involved in making a
particular business decision
● Healthcare analyst—analyzes medical data to improve the business aspect of
hospitals and medical facilities