Course 1: Data, Data, Everywhere
The analysts organized those tasks and activities around the six phases of the
data analysis process:
1. Ask
2. Prepare
3. Process
4. Analyze
5. Share
6. Act
The analysts asked questions to define both the issue to be solved and what
would equal a successful result. Next, they prepared by building a timeline and
collecting data with employee surveys that were designed to be inclusive. They
processed the data by cleaning it to make sure it was complete, correct, relevant,
and free of errors and outliers. They analyzed the clean employee survey data.
Then the analysts shared their findings and recommendations with team leaders.
Afterward, leadership acted on the results and focused on improving key areas.
Data Ecosystem
Data ecosystems are made up of various elements that interact with one
another in order to produce, manage, store, organize, analyze, and share
data.
These elements include hardware and software tools, and the people who
use them.
Data can also be found in the cloud, which is a virtual location accessed
over the internet.
As a data analyst, it is your job to harness the power of the data
ecosystem, find the right information, and provide the team with analysis
that helps them make smart decisions.
Examples of data ecosystems include retail stores, human resources
departments, and agricultural companies.
Data scientists create new questions using data, while analysts find
answers to existing questions by creating insights from data sources.
Data analysis is the collection, transformation, and organization of data in
order to draw conclusions, make predictions, and drive informed decision-
making.
Data analytics is the science of data and encompasses everything from
the job of managing and using data to the tools and methods that data
workers use each and every day.
Data can be used in everyday life (e.g. fitness trackers, product reviews)
and in business (e.g. learning about customers, improving processes,
helping employees)
Data-driven decision-making is using facts to guide business strategy
First step is to define the business need (e.g. brand recognition, product
improvement, employee satisfaction)
Data analyst finds data, analyzes it, and uses it to uncover trends,
patterns, and relationships
Examples of data-driven decision-making: music/movie streaming
services, e-commerce, mobile phones
Data analysts play a critical role in their companies' success, but data
alone is not as powerful as data combined with human experience,
observation, and intuition
Subject matter experts can identify inconsistencies, make sense of gray
areas, and validate choices being made
The data analysis life cycle is the process of going from data to decision. Data goes
through several phases as it gets created, consumed, tested, processed, and
reused.
The process presented as part of the Google Data Analytics Certificate is one
that will be valuable to you as you keep moving forward in your career:
1. Ask: Business Challenge/Objective/Question
2. Prepare: Data generation, collection, storage, and data management
3. Process: Data cleaning/data integrity
4. Analyze: Data exploration, visualization, and analysis
5. Share: Communicating and interpreting results
6. Act: Putting your insights to work to solve the problem
EMC Corporation's data analytics life cycle is cyclical with six steps:
1. Discovery
2. Pre-processing data
3. Model planning
4. Model building
5. Communicate results
6. Operationalize
EMC Corporation is now Dell EMC. This model, created by David Dietrich, reflects
the cyclical nature of real-world projects. The phases aren’t static milestones; each
step connects and leads to the next, and eventually repeats. Key questions help
analysts test whether they have accomplished enough to move forward and ensure
that teams have spent enough time on each of the phases and don’t start modeling
before the data is ready.
An iterative life cycle was created by a company called SAS, a leading data
analytics solutions provider. It can be used to produce repeatable, reliable, and
predictive results:
1. Ask
2. Prepare
3. Explore
4. Model
5. Implement
6. Act
7. Evaluate
The SAS model emphasizes its cyclical nature by visualizing it as an
infinity symbol. Their life cycle has seven steps, many of which we have seen in the
other models, like Ask, Prepare, Model, and Act. But this life cycle is also a little
different; it includes a step after the act phase designed to help analysts evaluate
their solutions and potentially return to the ask phase again.
This data analytics project life cycle was developed by Vignesh Prajapati. It doesn’t
include the sixth phase, or what we have been referring to as the Act phase.
However, it still covers a lot of the same steps as the life cycles we have already
described. It begins with identifying the problem, preparing and processing data
before analysis, and ends with data visualization.
Authors Thomas Erl, Wajid Khattak, and Paul Buhler proposed a big data
analytics life cycle in their book, Big Data Fundamentals: Concepts, Drivers &
Techniques. Their life cycle is divided into nine steps.
This life cycle appears to have three or four more steps than the previous life cycle
models. But in reality, they have just broken down what we have been referring to as
Prepare and Process into smaller steps. It emphasizes the individual tasks required
for gathering, preparing, and cleaning data before the analysis phase.
Glossary of terms
Data analyst: Someone who collects, transforms, and organizes data in order to
drive informed decision-making
Data ecosystem: The various elements that interact with one another in order to
produce, manage, store, organize, analyze, and share data
Data science: A field of study that uses raw data to create new ways of modeling
and understanding the unknown
Module 2
Embrace your data analyst skills
Five essential analytical skills:
1. Curiosity: a desire to know more about something, asking the right questions
2. Understanding context: understanding where information fits into the “big picture”
3. Having a technical mindset: breaking big things into smaller steps
4. Data design: thinking about how to organize data and information
5. Data strategy: thinking about the people, processes, and tools used in data analysis
Gap analysis is used to examine and evaluate how a process currently works
with the goal of getting to where you want to be in the future.
Module 3
Data has its own life cycle, which consists of the following stages: plan,
capture, manage, analyze, archive and destroy.
Planning involves deciding what kind of data is needed, how it will be
managed, who will be responsible for it, and the optimal outcomes.
Capture involves collecting data from a variety of sources and bringing it
into the organization.
Manage involves caring for the data, storing it, keeping it safe and secure,
and taking actions to maintain it properly.
Analyze involves using the data to solve problems, make decisions, and
support business goals.
Archive involves storing data in a place where it is still available, but may
not be used again.
Destroy involves using secure data erasure software to delete data from
hard drives and shredding paper files.
Different variations of the data life cycle
The U.S. Fish and Wildlife Service uses the following data life cycle:
1. Plan
2. Acquire
3. Maintain
4. Access
5. Evaluate
6. Archive
U.S. Geological Survey (USGS)
1. Plan
2. Acquire
3. Process
4. Analyze
5. Preserve
6. Publish/Share
Several cross-cutting or overarching activities are also performed during each stage of the USGS life cycle.
Financial Institutions
1. Capture
2. Qualify
3. Transform
4. Utilize
5. Report
6. Archive
7. Purge
Harvard Business School (HBS)
1. Generation
2. Collection
3. Processing
4. Storage
5. Management
6. Analysis
7. Visualization
8. Interpretation
Historical data is important to both the U.S. Fish and Wildlife Service and the USGS,
so their data life cycle focuses on archiving and backing up data. Harvard's interests
are in research and teaching, so its data life cycle includes visualization and
interpretation even though these are more often associated with a data analysis life
cycle. The HBS data life cycle also doesn't call out a stage for purging or destroying
data. In contrast, the data life cycle for finance clearly identifies archive and purge
stages. To sum it up, although data life cycles vary, one data management principle
is universal. Govern how data is handled so that it is accurate, secure, and available
to meet your organization's needs.
Data analysis is the process of analyzing data, and is not a life cycle
This program is split into six courses based on the steps of data analysis:
ask, prepare, process, analyze, share, and act
The ask phase involves defining the problem to be solved and
understanding stakeholder expectations
The prepare phase involves collecting and storing data, and identifying
which kinds of data are most useful
The process phase involves finding and eliminating errors and
inaccuracies, and cleaning data
The analyze phase involves using tools to transform and organize data to
draw useful conclusions
The share phase involves interpreting results and sharing them with
others to help stakeholders make decisions
The act phase involves putting insights to work to solve the original
business problem and preparing for a job search
Spreadsheets vs. databases:
Spreadsheets are software applications; databases are data stores accessed using a query language (e.g., SQL).
Spreadsheets structure data in a row and column format; databases structure data using rules and relationships.
Spreadsheets organize information in cells; databases organize information in complex collections.
Spreadsheets provide access to a limited amount of data; databases provide access to huge amounts of data.
Spreadsheets rely on manual data entry; databases use strict and consistent data entry.
Spreadsheets generally support one user at a time; databases support multiple users.
Spreadsheets are controlled by the user; databases are controlled by a database management system.
Encouragement to review the videos and readings and test out what has
been learned
Depending on which phase of the data analysis process you’re in, you will need to use
different tools. For example, if you are focusing on creating complex and eye-catching
visualizations, then the visualization tools we discussed earlier are the best choice. But if you
are focusing on organizing, cleaning, and analyzing data, then you will probably be choosing
between spreadsheets and databases using queries. Spreadsheets and databases both
offer ways to store, manage, and use data. The basic content for both tools is a set of values.
Yet, there are some key differences, as summarized in the comparison above.
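To make the contrast concrete, here is a minimal sketch of how a question that would take a lot of manual filtering in a spreadsheet can be answered with one database query. The sales table and its columns are hypothetical, not from the course:

-- Hypothetical: find the 10 largest orders shipped to Texas in a sales table
-- that could hold millions of rows, far more than a spreadsheet handles comfortably.
-- (LIMIT works in BigQuery, MySQL, and PostgreSQL; other databases use TOP or FETCH FIRST.)
SELECT order_id, customer_id, order_total
FROM sales
WHERE ship_state = 'TX'
ORDER BY order_total DESC
LIMIT 10;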
Glossary of terms
Analytical skills: Qualities and characteristics associated with using facts to solve
problems.
Analytical thinking: The process of identifying and defining a problem, then solving it
by using data in an organized, step-by-step manner
Data: A collection of facts
Data analysis: The collection, transformation, and organization of data in order to
draw conclusions, make predictions, and drive informed decision-making
Data analyst: Someone who collects, transforms, and organizes data in order to
draw conclusions, make predictions, and drive informed decision-making
Data analytics: The science of data
Data design: How information is organized
Data-driven decision-making: Using facts to guide business strategy
Data ecosystem: The various elements that interact with one another in order to
produce, manage, store, organize, analyze, and share data
Data science: A field of study that uses raw data to create new ways of modeling
and understanding the unknown
Data strategy: The management of the people, processes, and tools used in data
analysis
Data visualization: The graphical representation of data
Database: A collection of data stored in a computer system
Data set: A collection of data that can be manipulated or analyzed as one unit
Formula: A set of instructions used to perform a calculation using the data in a
spreadsheet
Function: A preset command that automatically performs a specified process or task
using the data in a spreadsheet
Query: A request for data or information from a database
Query language: A computer programming language used to communicate with a
database
Stakeholders: People who invest time and resources into a project and are
interested in its outcome
Structured Query Language: A computer programming language used to
communicate with a database
Spreadsheet: A digital worksheet
SQL: (Refer to Structured Query Language)
What is a query?
A query is a request for data or information from a database. When you query
databases, you use SQL to communicate your question or request. You and the
database can always exchange information as long as you speak the same
language.
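As a rough sketch of what that looks like (the table and column names below are made up for illustration), a basic SQL query has a few standard clauses:

SELECT customer_id, signup_date     -- the information you are requesting
FROM customers                      -- the table that holds the data
WHERE signup_date >= '2021-01-01'   -- the condition the rows must meet
ORDER BY signup_date;               -- how to sort the results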
Take a moment to appreciate all the work you have done in this course. You
identified a question to answer, and systematically worked your way through the
data analysis process to answer that question—just like professional data
analysts do every day!
In reviewing the data analysis process so far, you have already performed a lot of
these steps. Here are some examples to think about before you begin writing
your learning log entry:
Module 5
Data analytics helps businesses make better decisions. It all starts with a business task and the
question it's trying to answer. With the skills you'll learn throughout this program, you'll be able to
ask the right questions, plan out the best way to gather and analyze data, and then present it visually to
arm your team so they can make an informed, data-driven decision. That makes you critical to the
success of any business you work for. Data is a powerful tool.
Data analysts have a responsibility to make sure their analyses are fair
Fairness means ensuring that analysis does not create or reinforce bias
There is no one standard definition of fairness
Conclusions based on data can be true and unfair
Example of a company that is notorious for being a boys club and wants to
see which employees are doing well
Data shows that men are the only people succeeding at this company
The conclusion that they should hire more men is true but unfair because it
ignores other systemic factors that are contributing to this problem
Ethical data analyst can look at the data and conclude that the company
culture is preventing some employees from succeeding
Harvard data scientists developing a mobile platform to track patients at
risk of cardiovascular disease in the Stroke Belt
Team of analysts and social scientists to provide insights on human bias
and social context
Collected self-reported data in a separate system to avoid potential for
racial bias
Oversampled non-dominant groups to ensure the model included them
Fairness was a top priority every step of the way to collect data and create
conclusions that didn't negatively impact the communities studied
As technology continues to advance, being able to collect and analyze the data
from that new technology has become a huge competitive advantage for a lot of
businesses. Everything from websites to social media feeds is filled with
fascinating data that, when analyzed and used correctly, can help inform
business decisions. A company’s ability to thrive now often depends on how well
it can leverage data, apply analytics, and implement new technologies.
Businesses constantly have issues to explore, questions to answer, or problems to solve. It's easy for
these things to get mixed up. Here's a way to keep them straight when we talk about them in data
analytics. An issue is a topic or subject to investigate. A question is designed to discover information,
and a problem is an obstacle or complication that needs to be worked out.
Course 2
Ask Questions to Make Data-Driven Decisions
Module 1
Taking Action with Data
1. Ask
It’s impossible to solve a problem if you don’t know what it is. These are some
things to consider:
2. Prepare
You will decide what data you need to collect in order to answer your questions
and how to organize it so that it is useful. You might use your business task to
decide:
3. Process
Clean data is the best data and you will need to clean up your data to get rid of
any possible errors, inaccuracies, or inconsistencies. This might mean:
What data errors or inaccuracies might get in my way of getting the best
possible answer to the problem I am trying to solve?
How can I clean my data so the information I have is more consistent?
4. Analyze
You will want to think analytically about your data. At this stage, you might sort
and format your data to make it easier to:
Perform calculations
Combine data from multiple sources
Create tables with your results
5. Share
How can I make what I present to the stakeholders engaging and easy to
understand?
What would help me understand this if I were the listener?
6. Act
Now it’s time to act on your data. You will take everything you have learned from
your data analysis and put it to use. This could mean providing your stakeholders
with recommendations based on your findings so they can make data-driven
decisions.
How can I use the feedback I received during the share phase (step 5) to
actually meet the stakeholder’s needs and expectations?
These six steps can help you to break the data analysis process into smaller,
manageable parts, which is called structured thinking. This process involves
four basic activities:
When you are starting out in your career as a data analyst, it is normal to feel
pulled in a few different directions with your role and expectations. Following
processes like the ones outlined here and using structured thinking skills can
help get you back on track, fill in any gaps and let you know exactly what you
need.
1. Making predictions.
This problem type involves using data to make an informed decision about how
things may be in the future.
For example, a hospital system might use remote patient monitoring to predict
health events for chronically ill patients. The patients would take their health
vitals at home every day, and that information combined with data about their
age, risk factors, and other important details could enable the hospital's algorithm
to predict future health problems and even reduce future hospitalizations.
2. Categorizing things.
This means assigning information to different groups or clusters based on
common features.
An example of this problem type is a manufacturer that reviews data on shop
floor employee performance. An analyst may create a group for employees who
are most and least effective at engineering, another group for employees who are
most and least effective at repair and maintenance, another for assembly, and
many more groups or clusters.
3. Spotting something unusual.
This problem type involves identifying data that is different from the norm so it
can be flagged and investigated. The smart watch example later in these notes,
where software sets off an alarm when health readings don't trend normally, is a
spotting-something-unusual problem.
4. Identifying themes
Identifying themes takes categorization a step further by grouping information
into broader concepts.
Going back to our manufacturer that has just reviewed data on the shop floor
employees: first, these people are grouped by the types of tasks they perform. But now a data
analyst could take those categories and group them into the broader concept of
low productivity and high productivity. This would make it possible for the
business to see who is most and least productive, in order to reward top
performers and provide additional support to those workers who need more
training.
5. Discovering connections
Discovering connections enables data analysts to find similar challenges faced by different entities, and
then combine data and insights to address them.
Here's what I mean; say a scooter company is experiencing an issue with the
wheels it gets from its wheel supplier. That company would have to stop
production until it could get safe, quality wheels back in stock. But meanwhile,
the wheel companies encountering the problem with the rubber it uses to make
wheels, turns out its rubber supplier could not find the right materials either. If all
of these entities could talk about the problems they're facing and share data
openly, they would find a lot of similar challenges and better yet, be able to
collaborate to find a solution.
6. Finding patterns.
Data analysts use data to find patterns by using historical data to understand
what happened in the past and is therefore likely to happen again. E-commerce
companies use data to find patterns all the time. Data analysts look at transaction
data to understand customer buying habits at certain points in time throughout
the year. They may find that customers buy more canned goods right before a
hurricane, or they purchase fewer cold-weather accessories like hats and gloves
during warmer months. The e-commerce companies can use these insights to
make sure they stock the right amount of products at these key times.
Making predictions
A company that wants to know the best advertising method to bring in new customers is an
example of a problem requiring analysts to make predictions. Analysts with data on location, type
of media, and number of new customers acquired as a result of past ads can't guarantee future
results, but they can help predict the best placement of advertising to reach the target audience.
Spotting something unusual
A company that sells smart watches that help people monitor their health would be interested in
designing their software to spot something unusual. Analysts who have analyzed aggregated
health data can help product developers determine the right algorithms to spot and set off alarms
when certain data doesn't trend normally.
Identifying themes
User experience (UX) designers might rely on analysts to analyze user interaction data. Similar to
problems that require analysts to categorize things, usability improvement projects might require
analysts to identify themes to help prioritize the right product features for improvement. Themes
are most often used to help researchers explore certain aspects of data. In a user study, user
beliefs, practices, and needs are examples of themes.
By now you might be wondering if there is a difference between categorizing things and
identifying themes. The best way to think about it is: categorizing things involves assigning items
to categories; identifying themes takes those categories a step further by grouping them into
broader themes.
Discovering connections
A third-party logistics company working with another company to get shipments delivered to
customers on time is a problem requiring analysts to discover connections. By analyzing the wait
times at shipping hubs, analysts can determine the appropriate schedule changes to increase the
number of on-time deliveries.
Finding patterns
SMART questions
Effective questions follow the SMART methodology.
That means they're specific, measurable, action-oriented, relevant and time-
bound.
Let's break that down.
Specific questions are simple, significant and focused on a single topic or a few
closely related ideas. This helps us collect information that's relevant to what
we're investigating. If a question is too general, try to narrow it down by focusing
on just one element.
For example, instead of asking a closed-ended question like, “Are kids getting
enough physical activity these days?” ask, “What percentage of kids achieve the
recommended 60 minutes of physical activity at least five days a week?”
That question is much more specific and can give you more useful information.
Here's an example that breaks down the thought process of turning a problem
question into one or more SMART questions using the SMART method:
On a scale of 1-10 (with 10 being the most important) how important is your
car having four-wheel drive?
What are the top five features you would like to see in a car package?
What features, if included with four-wheel drive, would make you more
inclined to buy the car?
How much more would you pay for a car with four-wheel drive?
Has four-wheel drive become more or less popular in the last three years?
Reports and dashboards are both useful for data visualization. But there
are pros and cons for each of them. A report is a static collection of data given to
stakeholders periodically. A dashboard on the other hand, monitors live,
incoming data. Let's talk about reports first. Reports are great for giving
snapshots of high level historical data for an organization. There are some
downsides to keep in mind too. Reports need regular maintenance and aren't
very visually appealing. Because they aren't automatic or dynamic, reports don't
show live, evolving data. For a live reflection of incoming data, you'll want to
design a dashboard. Dashboards are great for a lot of reasons, they give your
team more access to information being recorded, you can interact through data
by playing with filters, and because they're dynamic, they have long-term value.
But dashboards do have some cons too. For one thing, they take a lot of time to
design and can actually be less efficient than reports, if they're not used very
often. If the base table breaks at any point, they need a lot of maintenance to get
back up and running again. Dashboards can sometimes overwhelm people with
information too. If you aren't used to looking through data on a dashboard, you
might get lost in it.
A pivot table is a data summarization tool that is used in data processing.
Pivot tables are used to summarize, sort, re-organize, group, count, total, or
average data stored in a database. It allows its users to transform columns into
rows and rows into columns.
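A similar summary can be produced in a database with a query. The sketch below is a hypothetical SQL counterpart to a pivot table that averages and counts sales by region and product line; the sales table and its columns are assumptions for illustration:

SELECT region,
       product_line,
       AVG(sale_amount) AS avg_sale,   -- like a pivot table's "average of sale_amount"
       COUNT(*) AS num_sales           -- like a pivot table's "count of rows"
FROM sales
GROUP BY region, product_line;         -- like the pivot table's row and column groupings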
Dashboards are powerful visual tools that help you tell your data story. A
dashboard organizes information from multiple datasets into one central location,
offering huge time-savings. Data analysts use dashboards to track, analyze, and
visualize data in order to answer questions and solve problems.
* it is important to remember that changed data is pulled into dashboards
automatically only if the data structure is the same. If the data structure changes,
you have to update the dashboard design before the data can update live.
Creating a dashboard
Here is a process you can follow to create a dashboard:
1. Identify the stakeholders who need to see the data and how they will use it
Use these tips to help make your dashboard design clear, easy to follow, and
simple:
This is optional, but a lot of data analysts like to sketch out their
dashboards before creating them.
You have a lot of options here and it all depends on what data story you
are telling. If you need to show a change of values over time, line charts or bar
graphs might be the best choice. If your goal is to show how each part
contributes to the whole amount being reported, a pie or donut chart is probably
a better choice.
Filters show certain data while hiding the rest of the data in a dashboard.
This can be a big help to identify patterns while keeping the original data intact. It
is common for data analysts to use and share the same dashboard, but manage
their part of it with a filter.
Types of dashboards
Strategic dashboards
Operational dashboards
Analytical dashboards
Mathematical Thinking
Small data can be really small. These kinds of data tend to be made up of
data sets concerned with specific metrics over a short, well-defined period of
time.
Big data, on the other hand, involves larger, less specific data sets covering a
longer period of time. They usually have to be broken down to be analyzed. Big
data is useful for looking at large-scale questions and problems, and it helps
companies make big decisions.
A lot of organizations deal with data overload and way too much unimportant
or irrelevant information.
Important data can be hidden deep down with all of the non-important data,
which makes it harder to find and use. This can lead to slower and more
inefficient decision-making time frames.
The data you need isn’t always easily accessible.
Current technology tools and solutions still struggle to provide measurable
and reportable data. This can lead to unfair algorithmic bias.
There are gaps in many big data business solutions.
Now for the good news! Here are some benefits that come with big data:
When large amounts of data can be stored and analyzed, it can help
companies identify more efficient ways of doing business and save a lot of
time and money.
Big data helps organizations spot the trends of customer buying patterns and
satisfaction levels, which can help them create new products and solutions
that will make customers happy.
By analyzing big data, businesses get a much better understanding of current
market conditions, which can help them stay ahead of the competition.
As in our earlier social media example, big data helps companies keep track
of their online presence—especially feedback, both good and bad, from
customers. This gives them the information they need to improve and protect
their brand.
When thinking about the benefits and challenges of big data, it helps to think
about the three Vs: volume, variety, and velocity. Volume describes the
amount of data. Variety describes the different kinds of data. Velocity describes
how fast the data can be processed. Some data analysts also consider a fourth
V: veracity. Veracity refers to the quality and reliability of the data. These are all
important considerations related to processing huge, complex data-sets.
Module 3
Click the gray triangle above row number 1 and to the left of Column A to
select all cells in the spreadsheet.
From the main menu, click Home, and then click Conditional Formatting to
select Highlight Cell Rules > More Rules.
For Select a Rule Type, choose Use a formula to determine which cells to
format.
Click the Format button, select the Fill tab, select yellow (or any other color),
and then click OK.
Problem Domain
The specific area of analysis that encompasses every area of activity affecting or
affected by the problem
Deliverables are items or tasks you will complete before you can finish
the project.
Milestones are significant tasks you will confirm along your timeline to
help everyone know the project is on track
Deliverables: What work is being done, and what things are being created
as a result of this project? When the project is complete, what are you
expected to deliver to the stakeholders? Be specific here. Will you collect
data for this project? How much, or for how long?
Avoid vague statements. For example, “fixing traffic problems” doesn’t specify
the scope. This could mean anything from filling in a few potholes to building a
new overpass. Be specific! Use numbers and aim for hard, measurable goals
and objectives. For example: “Identify top 10 issues with traffic patterns within the
city limits, and identify the top 3 solutions that are most cost-effective for reducing
traffic congestion.”
Milestones: This is closely related to your timeline. What are the major
milestones for progress in your project? How do you know when a given part
of the project is considered complete?
Milestones can be identified by you, by stakeholders, or by other team members
such as the Project Manager. Smaller examples might include incremental steps
in a larger project like “Collect and process 50% of required data (100 survey
responses)”, but may also be larger examples like ”complete initial data analysis
report” or “deliver completed dashboard visualizations and analysis reports to
stakeholders”.
Timeline: Your timeline will be closely tied to the milestones you create for
your project. The timeline is a way of mapping expectations for how long
each step of the process should take. The timeline should be specific enough
to help all involved decide if a project is on schedule. When will the
deliverables be completed? How long do you expect the project will take to
complete? If all goes as planned, how long do you expect each component of
the project will take? When can we expect to reach each milestone?
Reports: Good SOWs also set boundaries for how and when you’ll give
status updates to stakeholders. How will you communicate progress with
stakeholders and sponsors, and how often? Will progress be reported
weekly? Monthly? When milestones are completed? What information will
status reports contain?
At a minimum, any SOW should answer all the relevant questions in the above
areas. Note that these areas may differ depending on the project. But at their
core, the SOW document should always serve the same purpose by containing
information that is specific, relevant, and accurate. If something changes in the
project, your SOW should reflect those changes.
SOWs should also contain information specific to what is and isn’t considered
part of the project. The scope of your project is everything that you are expected
to complete or accomplish, defined to a level of detail that doesn’t leave any
ambiguity or confusion about whether a given task or item is part of the project or
not.
Notice how the previous example about studying traffic congestion defined its
scope as the area within the city limits. This doesn’t leave any room for confusion
— stakeholders need only to refer to a map to tell if a stretch of road or
intersection is part of the project or not. Defining requirements can be trickier
than it sounds, so it’s important to be as specific as possible in these documents,
and to use quantitative statements whenever possible.
For example, assume that you’re assigned to a project that involves studying the
environmental effects of climate change on the coastline of a city: How do you
define what parts of the coastline you are responsible for studying, and which
parts you are not?
In this case, it would be important to define the area you’re expected to study
using GPS locations, or landmarks. Using specific, quantifiable statements will
help ensure that everyone has a clear understanding of what’s expected.
“The best thing you can do for the fairness and accuracy of your data, is to make
sure you start with an accurate representation of the population, and collect the
data in the most appropriate and objective way. Then, you'll have the facts you
can pass on to your team.”
Context can turn raw data into meaningful information. It is very important for
data analysts to contextualize their data. This means giving the data perspective
by defining it. To do this, you need to identify:
Who: The person or organization that created, collected, and/or funded the
data collection
What: The things in the world that data could have an impact on
Where: The origin of the data
When: The time when the data was created or collected
Why: The motivation behind the creation or collection
How: The method used to create or collect it
Module 4
Your data analysis project should answer the business task and create
opportunities for data-driven decision-making. That's why it is so important to
focus on project stakeholders. As a data analyst, it is your responsibility to
understand and manage your stakeholders’ expectations while keeping the
project goals front and center.
You might remember that stakeholders are people who have invested time,
interest, and resources into the projects that you are working on. This can be a
pretty broad group, and your project stakeholders may change from project to
project. But there are three common stakeholder groups that you might find
yourself working with: the executive team, the customer-facing team, and the
data science team.
Let’s get to know more about the different stakeholders and their goals. Then
we'll learn some tips for communicating with them effectively.
1. Executive team
For example, you might find yourself working with the vice president of human
resources on an analysis project to understand the rate of employee absences. A
marketing director might look to you for competitive analyses. Part of your job will
be balancing what information they will need to make informed decisions with
their busy schedule.
But you don’t have to tackle that by yourself. Your project manager will be
overseeing the progress of the entire team, and you will be giving them more
regular updates than someone like the vice president of HR. They are able to
give you what you need to move forward on a project, including getting approvals
from the busy executive team. Working closely with your project manager can
help you pinpoint the needs of the executive stakeholders for your project, so
don’t be afraid to ask them for guidance.
2. Customer-facing team
Let’s say a customer-facing team is working with you to build a new version of a
company’s most popular product. Part of your work might involve collecting and
sharing data about consumers’ buying behavior to help inform product features.
Here, you want to be sure that your analysis and presentation focuses on what is
actually in the data-- not on what your stakeholders hope to find.
When you're working with each group of stakeholders- from the executive team,
to the customer-facing team, to the data science team, you'll often have to go
beyond the data. Use the following tips to communicate clearly, establish trust,
and deliver your findings across groups.
Discuss goals. Stakeholder requests are often tied to a bigger project or goal.
When they ask you for something, take the opportunity to learn more. Start a
discussion. Ask about the kind of results the stakeholder wants. Sometimes, a
quick chat about goals can help set expectations and plan the next steps.
Feel empowered to say “no.” Let’s say you are approached by a marketing
director who has a “high-priority” project and needs data to back up their
hypothesis. They ask you to produce the analysis and charts for a presentation
by tomorrow morning. Maybe you realize their hypothesis isn’t fully formed and
you have helpful ideas about a better way to approach the analysis. Or maybe
you realize it will take more time and effort to perform the analysis than
estimated. Whatever the case may be, don’t be afraid to push back when you
need to.
Stakeholders don’t always realize the time and effort that goes into collecting and
analyzing data. They also might not know what they actually need. You can help
stakeholders by asking about their goals and determining whether you can
deliver what they need. If you can’t, have the confidence to say “no,” and provide
a respectful explanation. If there’s an option that would be more helpful, point the
stakeholder toward those resources. If you find that you need to prioritize other
projects first, discuss what you can prioritize and when. When your stakeholders
understand what needs to be done and what can be accomplished in a given
timeline, they will usually be comfortable resetting their expectations. You should
feel empowered to say no-- just remember to give context so others understand
why.
Plan for the unexpected. Before you start a project, make a list of potential
roadblocks. Then, when you discuss project expectations and timeline with your
stakeholders, give yourself some extra time for problem-solving at each stage of
the process.
Know your project. Keep track of your discussions about the project over email
or reports, and be ready to answer questions about how certain aspects are
important for your organization. Get to know how your project connects to the
rest of the company and get involved in providing the most insight possible. If you
have a good understanding about why you are doing an analysis, it can help you
connect your work with other goals and be more effective at solving larger
problems.
Start with words and visuals. It is common for data analysts and stakeholders
to interpret things in different ways while assuming the other is on the same
page. This illusion of agreement has been historically identified as a cause of
projects going back-and-forth a number of times before a direction is finally
nailed down. To help avoid this, start with a description and a quick visual of what
you are trying to convey. Stakeholders have many points of view and may prefer
to absorb information in words or pictures. Work with them to make changes and
improvements from there. The faster everyone agrees, the faster you can
perform the first analysis to test the usefulness of the project, measure the
feedback, learn from the data, and implement changes.
Effective Communication
After the next report is completed, you can also send out a project update
offering more information. The email could look like this:
Limitations with Data
Telling a Story
Compare the same types of data: Data can get mixed up when you chart it
for visualization. Be sure to compare the same types of data and double
check that any segments in your chart definitely display different metrics.
Visualize with care: A 0.01% drop in a score can look huge if you zoom in
close enough. To make sure your audience sees the full story clearly, it is a
good idea to set your Y-axis to 0.
Leave out needless graphs: If a table can show your story at a glance, stick
with the table instead of a pie chart or a graph. Your busy audience will
appreciate the clarity.
Test for statistical significance: Sometimes two data-sets will look
different, but you will need a way to test whether the difference is real and
important. So remember to run statistical tests to see how much confidence
you can place in that difference.
Pay attention to sample size: Gather lots of data. If a sample size is small,
a few unusual responses can skew the results. If you find that you have too
little data, be careful about using it to form judgments. Look for opportunities
to collect more data, then chart those trends over longer periods.
Course 3: Prepare Data for Exploration
Module 1
Decide whether you will collect the data using your own resources or receive (and
possibly purchase) it from another party. Data that you collect yourself is
called first-party data.
Data sources
If you don’t collect the data using your own resources, you might get data from
second-party or third-party data providers. Second-party data is collected
directly by another group and then sold. Third-party data is sold by a provider
that didn’t collect the data themselves. Third-party data might come from a
number of different sources.
Solving your business problem
Datasets can show a lot of interesting information. But be sure to choose data
that can actually help solve your problem question. For example, if you are
analyzing trends over time, make sure you use time series data — in other
words, data that includes dates.
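For example, if your time series data lives in a database, a query like the hedged sketch below could group it by month to show a trend over time. The orders table and order_date column are hypothetical, and date-literal syntax can vary slightly by database:

SELECT EXTRACT(YEAR FROM order_date) AS order_year,
       EXTRACT(MONTH FROM order_date) AS order_month,
       COUNT(*) AS orders                        -- how many orders happened each month
FROM orders
WHERE order_date >= DATE '2020-01-01'            -- keep only the time frame you care about
GROUP BY EXTRACT(YEAR FROM order_date), EXTRACT(MONTH FROM order_date)
ORDER BY order_year, order_month;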
If you are collecting your own data, make reasonable decisions about sample
size. A random sample from existing data might be fine for some projects. Other
projects might need more strategic data collection to focus on certain criteria.
Each project has its own needs.
Time frame
If you are collecting your own data, decide how long you will need to collect it,
especially if you are tracking trends over a long period of time. If you need an
immediate answer, you might not have time to collect new data. In this case, you
would need to use historical data that already exists.
Structured data
Unstructured data
Data-modeling techniques
Data modeling can help you explore the high-level details of your data and
how it is related across the organization’s information systems. Data modeling
sometimes requires data analysis to understand how the data is put together;
that way, you know how to map the data. And finally, data models make it easier
for everyone in your organization to understand and collaborate with you on your
data. This is important for you and everyone on your team!
*A text data type, or string data type, is a sequence of characters and
punctuation that contains textual information.
*A Boolean data type is a data type with only two possible values: true or false.
In this reading, you will explore the basics of Boolean logic and learn how to use
multiple conditions in a Boolean statement. These conditions are created with
Boolean operators, including AND, OR, and NOT. These operators are similar to
mathematical operators and can be used to create logical statements that filter
your results. Data analysts use Boolean statements to do a wide range of data
analysis tasks, such as creating queries for searches and checking for conditions
when writing programming code.
Imagine you are shopping for shoes, and are considering certain preferences:
You will buy the shoes only if they are pink and grey
You will buy the shoes if they are entirely pink or entirely grey, or if they are
pink and grey
You will buy the shoes if they are grey, but not if they have any pink
Below are Venn diagrams that illustrate these preferences. AND is the center of
the Venn diagram, where two conditions overlap. OR includes either condition.
NOT includes only the part of the Venn diagram that doesn't contain the
exception.
Your condition is “If the color of the shoe has any combination of grey and pink,
you will buy them.” The Boolean statement would break down the logic of that
statement to filter your results by both colors. It would say “IF (Color=”Grey”)
AND (Color=”Pink”) then buy them.” The AND operator lets you stack multiple
conditions.
Below is a simple truth table that outlines the Boolean logic at work in this
statement. In the Color is Grey column, there are two pairs of shoes that meet
the color condition. And in the Color is Pink column, there are two pairs that
meet that condition. But in the If Grey AND Pink column, there is only one pair of
shoes that meets both conditions. So, according to the Boolean logic of the
statement, there is only one pair marked true. In other words, there is one pair of
shoes that you can buy.
The OR operator
The OR operator lets you move forward if either one of your two conditions is
met. Your condition is “If the shoes are grey or pink, you will buy them.” The
Boolean statement would be “IF (Color=”Grey”) OR (Color=”Pink”) then buy
them.” Notice that any shoe that meets either the Color is Grey or the Color is
Pink condition is marked as true by the Boolean logic. According to the truth
table below, there are three pairs of shoes that you can buy.
Finally, the NOT operator lets you filter by subtracting specific conditions from the
results. Your condition is "You will buy any grey shoe except for those with any
traces of pink in them." Your Boolean statement would be “IF (Color="Grey")
AND (Color=NOT “Pink”) then buy them.” Now, all of the grey shoes that aren't
pink are marked true by the Boolean logic for the NOT Pink condition. The pink
shoes are marked false by the Boolean logic for the NOT Pink condition. Only
one pair of shoes is excluded in the truth table below.
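In SQL, these operators show up in the WHERE clause. Here is a hedged sketch that mirrors the shoe example, assuming a hypothetical shoes table where the colors are stored as Boolean columns is_grey and is_pink:

SELECT * FROM shoes WHERE is_grey = TRUE AND is_pink = TRUE;        -- AND: the shoe must have both colors
SELECT * FROM shoes WHERE is_grey = TRUE OR is_pink = TRUE;         -- OR: either color is enough
SELECT * FROM shoes WHERE is_grey = TRUE AND NOT (is_pink = TRUE);  -- NOT: grey shoes with no trace of pink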
For data analysts, the real power of Boolean logic comes from being able to
combine multiple conditions in a single statement. For example, if you wanted to
filter for shoes that were grey or pink, and waterproof, you could construct a
Boolean statement such as: “IF ((Color = “Grey”) OR (Color = “Pink”)) AND
(Waterproof=“True”).” Notice that you can use parentheses to group your
conditions together.
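Written against the same hypothetical shoes table used in the sketch above (with an added waterproof column), that combined condition could look like this:

SELECT *
FROM shoes
WHERE (is_grey = TRUE OR is_pink = TRUE)   -- parentheses group the color conditions, as noted above
  AND waterproof = TRUE;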
Whether you are doing a search for new shoes or applying this logic to your
database queries, Boolean logic lets you create multiple conditions to filter your
results. And now that you know a little more about how Boolean logic is used,
you can start using it!
In wide data, every data subject has a single row with multiple columns to
hold the values of various attributes of the subject.
Long data is data in which each row is one time point per subject, so
each subject will have data in multiple rows.
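As a small illustration (the quarterly_sales table and its columns are made up), standard SQL can reshape wide data into long data by stacking one SELECT per column with UNION ALL:

-- Wide format: one row per store, one column per quarter
--   quarterly_sales(store_id, q1_sales, q2_sales, q3_sales)
-- Long format: one row per store per quarter
SELECT store_id, 'Q1' AS quarter, q1_sales AS sales FROM quarterly_sales
UNION ALL
SELECT store_id, 'Q2' AS quarter, q2_sales AS sales FROM quarterly_sales
UNION ALL
SELECT store_id, 'Q3' AS quarter, q3_sales AS sales FROM quarterly_sales;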
Transforming Data
Module 2
There are three more types of data bias: observer bias, interpretation bias, and
confirmation bias. We'll learn how to avoid them.
Interpretation bias can lead to two people seeing or hearing the exact same
thing and interpreting it in a variety of different ways, because they have different
backgrounds and experiences.
The four types of data bias we covered, sampling bias, observer bias,
interpretation bias, and confirmation bias, are all unique, but they do have one
thing in common. They each affect the way we collect, and make sense of the
data. Unfortunately, they're also just a small sample, pun intended, of the types
of bias you may encounter in your career as a data analyst.
The more high quality data we have, the more confidence we can have in our decisions.
Let's learn how we can go about finding and identifying good data sources.
First things first, we need to learn how to identify them. A process I like to call ROCCC, R-O-
C-C-C.
Like a good friend, good data sources are reliable. With this data you can trust
that you're getting accurate, complete and unbiased information that's been
vetted and proven fit for use. Okay. Onto O. O is for original. There's a good
chance you'll discover data through a second or third party source. To make sure
you're dealing with good data, be sure to validate it with the original source. Time
for the first C. C is for comprehensive. The best data sources contain all critical
information needed to answer the question or find the solution. Think about it like
this. You wouldn't want to work for a company just because you found one great
online review about it. You'd research every aspect of the organization to make
sure it was the right fit. It's important to do the same for your data analysis. The
next C is for current. The usefulness of data decreases as time passes. If you
wanted to invite all current clients to a business event, you wouldn't use a 10-
year-old client list. The same goes for data. The best data sources are
current and relevant to the task at hand. The last C is for cited. If you've ever told
a friend where you heard that a new movie sequel was in the works, you've cited
a source. Citing makes the information you're providing more credible. When
you're choosing a data source, think about three things.
Data ethics refers to well-founded standards of right and wrong that
dictate how data is collected, shared, and used.
Ownership. This answers the question who owns data? It isn't the
organization that invested time and money collecting, storing, processing, and
analyzing it. It's individuals who own the raw data they provide, and they have
primary control over its usage, how it's processed and how it's shared.
Transaction transparency, which is the idea that all data processing
activities and algorithms should be completely explainable and understood by the
individual who provides their data. This is in response to concerns over data
bias, which, as we discussed earlier, is a type of error that systematically skews
results in a certain direction. Biased outcomes can lead to negative
consequences. To avoid them, it's helpful to provide transparent analysis
especially to the people who share their data.
Consent. This is an individual's right to know explicit details about how and
why their data will be used before agreeing to provide it. They should know
answers to questions like why is the data being collected? How will it be
used? How long will it be stored? The best way to give consent is probably a
conversation between the person providing the data and the person requesting
it.
Currency. Individuals should be aware of financial transactions
resulting from the use of their personal data and the scale of these
transactions. If your data is helping to fund a company's efforts, you should know
what those efforts are all about and be given the opportunity to opt out
When talking about data, privacy means preserving a data subject's
information and activity any time a data transaction occurs.
When referring to data, openness refers to free access, usage and sharing
of data. Sometimes we refer to this as open data, but it doesn't mean we
ignore the other aspects of data ethics we covered.
*Interoperability is the ability of data systems and services to openly connect and share
data
You have been learning about the importance of privacy in data analytics. Now, it
is time to talk about data anonymization and what types of data should be
anonymized. Personally identifiable information, or PII, is information that can
be used by itself or with other data to track down a person's identity.
Healthcare and financial data are two of the most sensitive types of data. These
industries rely a lot on data anonymization techniques. After all, the stakes are
very high. That’s why data in these two industries usually goes through de-
identification, which is a process used to wipe data clean of all personally
identifying information.
Telephone numbers
Names
License plates and license numbers
Social security numbers
IP addresses
Medical records
Email addresses
Photographs
Account numbers
For some people, it just makes sense that this type of data should be
anonymized. For others, we have to be very specific about what needs to be
anonymized. Imagine a world where we all had access to each other’s
addresses, account numbers, and other identifiable information. That would
invade a lot of people’s privacy and make the world less safe. Data
anonymization is one of the ways we can keep data private and secure!
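As a very rough sketch of the idea (not a compliant de-identification procedure), a query could keep only the fields needed for analysis and replace a direct identifier with a hash. The patients table is hypothetical, and it assumes a SQL dialect that provides an MD5() function, such as MySQL or PostgreSQL:

SELECT
  MD5(email) AS email_hash,   -- pseudonym in place of the raw email address
  age,
  diagnosis_code
FROM patients;
-- Real de-identification also has to handle quasi-identifiers (ZIP code, birth date, etc.)
-- and follow the relevant regulations; this only illustrates the general idea.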
In data analytics, open data is part of data ethics, which has to do with using
data ethically. Openness refers to free access, usage, and sharing of data. But
for data to be considered open, it has to:
Be available and accessible to the public as a complete dataset
Be provided under terms that allow it to be reused and redistributed
Allow universal participation so that anyone can use, reuse, and redistribute
the data
Data can only be considered open when it meets all three of these standards.
One of the biggest benefits of open data is that credible databases can be used
more widely. Basically, this means that all of that good data can be leveraged,
shared, and combined with other data. This could have a huge impact on
scientific collaboration, research advances, analytical capacity, and decision-
making. But it is important to think about the individuals being represented by the
public, open data, too.
The key aspects of universal participation are that everyone must be able
to use, reuse, and redistribute open data. Also, no one can place
restrictions on data to discriminate against a person or group.
The benefits of open data include making good data more widely available
and combining data from different fields of knowledge.
Module 3
Relational databases
A relational database is a database that contains a series of tables that can be connected to show
relationships. Basically, they allow data analysts to organize and link data based on what the data has
in common.
In a non-relational table, you will find all of the possible variables you might be interested in
analyzing all grouped together. This can make it really hard to sort through. This is one reason why
relational databases are so common in data analysis: they simplify a lot of analysis processes and
make data easier to find and use across an entire database.
Database Normalization
Normalization is a process of organizing data in a relational database. For example, creating tables
and establishing relationships between those tables. It is applied to eliminate data redundancy,
increase data integrity, and reduce complexity in a database.
A primary key is an identifier in a table that references a column in which every value is unique;
it guarantees that each row can be identified unambiguously. By contrast, a foreign key is a field
within a table that is a primary key in another table. A table can
have only one primary key, but it can have multiple foreign keys. These keys are what create the
relationships between tables in a relational database, which helps organize and connect data across
multiple tables in the database.
Some tables don't require a primary key. For example, a revenue table can have multiple foreign keys
and not have a primary key. A primary key may also be constructed using multiple columns of a table.
This type of primary key is called a composite key. For example, if customer_id and location_id are
two columns of a composite key for a customer table, the values assigned to those fields in any given
row must be unique within the entire table.
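Here is a minimal sketch of how these keys are declared in SQL. The customers and orders tables are hypothetical examples, not from the course:

CREATE TABLE customers (
  customer_id   INT PRIMARY KEY,      -- primary key: each value is unique in this table
  customer_name VARCHAR(100)
);

CREATE TABLE orders (
  order_id    INT PRIMARY KEY,        -- this table's own primary key
  customer_id INT,                    -- foreign key: a primary key from the customers table
  order_date  DATE,
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);
-- A composite key would simply list more than one column, for example
-- PRIMARY KEY (customer_id, location_id).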
SQL? You’re speaking my language
Databases use a special language to communicate called a query language. Structured Query
Language (SQL) is a type of query language that lets data analysts communicate with a database. So,
a data analyst will use SQL to create a query to view the specific data that they want from within the
larger set. In a relational database, data analysts can write queries to get data from the related tables.
SQL is a powerful tool for working with databases.
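For example, using the hypothetical customers and orders tables sketched earlier, a query on related tables might join them through their shared key:

SELECT c.customer_name,
       o.order_id,
       o.order_date
FROM customers AS c
JOIN orders AS o
  ON o.customer_id = c.customer_id       -- the foreign key links back to the primary key
WHERE o.order_date >= DATE '2021-01-01';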
As a data analyst, there are three common types of metadata that you'll come across: descriptive,
structural, and administrative.
Descriptive metadata is metadata that describes a piece of data and can be used to identify it
at a later point in time. For instance, the descriptive metadata of a book in a library would include
the code you see on its spine, known as its unique International Standard Book Number, also called
the ISBN. (Description of the data)
Structural metadata is metadata that indicates how a piece of data is organized and
whether it's part of one or more than one data collection; in other words, it indicates exactly how
many collections the data lives in. Let's head back to the library. An example of structural metadata
would be how the pages of a book are put together to create different chapters. It's important to note
that structural metadata also keeps track of the relationship between two things. For example, it can
show us that the digital document of a book manuscript was actually the original version of a now
printed book. (How and where data is collected)
Administrative metadata is metadata that indicates the technical source and details of a digital asset.
The date and time a photo was taken is one example: when we looked at the metadata inside the photo,
that was administrative metadata. It shows you the type of file it was, the date and time it was taken,
and much more. (Technical source of the data)
Take a look at any data you find. What is it? Where did it come from? Is it useful? How do you know?
This is where metadata comes in to provide a deeper understanding of the data. To put it simply,
metadata is data about data. In database management, it provides information about other data and
helps data analysts interpret the contents of the data within a database.
Regardless of whether you are working with a large or small quantity of data, metadata is the mark of
a knowledgeable analytics team, helping to communicate about data across the business and making it
easier to reuse data. In essence, metadata tells the who, what, when, where, which, how, and why of
data.
Elements of metadata
Before looking at metadata examples, it is important to understand what type of information metadata
typically provides.
Examples of metadata
In today’s digital world, metadata is everywhere, and it is becoming a more common practice to
provide metadata on a lot of media and information you interact with. Here are some real-world
examples of where to find metadata:
Photos
Whenever a photo is captured with a camera, metadata such as camera filename, date, time, and
geolocation are gathered and saved with it.
Emails
When an email is sent or received, there is lots of visible metadata such as subject line, the sender, the
recipient and date and time sent. There is also hidden metadata that includes server names, IP
addresses, HTML format, and software details.
Websites
Every web page has a number of standard metadata fields, such as tags and categories, site creator’s
name, web page title and description, time of creation and any iconography.
Digital files
Usually, if you right click on any computer file, you will see its metadata. This could consist of file
name, file size, date of creation and modification, and type of file.
Books
Metadata is not only digital. Every book has a number of standard metadata on the covers and inside
that will inform you of its title, author’s name, a table of contents, publisher information, copyright
description, index, and a brief description of the book’s contents.
Data as you know it
Knowing the content and context of your data, as well as how it is structured, is very valuable in your
career as a data analyst. When analyzing data, it is important to always understand the full picture. It is
not just about the data you are viewing, but how that data comes together. Metadata ensures that you
are able to find, use, preserve, and reuse data in the future. Remember, it will be your responsibility to
manage and make use of data in its entirety; metadata is as important as the data itself.
Metadata creates a single source of truth by keeping things consistent and uniform. Metadata also
makes data more reliable by making sure it's accurate, precise, relevant, and timely. This also makes it
easier for data analysts to identify the root causes of any problems that might pop up. A metadata
repository is a database specifically created to store metadata. Metadata repositories make it easier
and faster to bring together multiple sources for data analysis. They do this by describing the state and
location of the metadata, the structure of the tables inside, and how data flows through the repository.
Metadata is stored in a single, central location, and it gives the company
standardized information about all of its data. This is done in two ways. First, metadata includes
information about where each system is located and where the data sets are located within those
systems. Second, the metadata describes how all of the data is connected between the various
systems. Another important aspect of metadata is something called data governance. Data
governance is a process to ensure the formal management of a company’s data assets. This gives an
organization better control of their data and helps a company manage issues related to data security
and privacy, integrity, usability, and internal and external data flows. It's important to note that data
governance is about more than just standardizing terminology and procedures. It's about the roles and
responsibilities of the people who work with the metadata every day.
two basic types of data used by data analysts: internal and external. Internal data is data that lives
within a company's own systems. It's typically also generated from within the company. You may also
hear internal data described as primary data. External data is data that lives and is generated outside an
organization. It can come from a variety of places, including other businesses, government sources,
the media, professional associations, schools, and more. External data is sometimes called secondary
data. Gathering internal data can be complicated. Depending on your data analytics project, you might
need data from lots of different sources and departments, including sales, marketing, customer
relationship management, finance, human resources, and even the data archives. But the effort is
worth it. Internal data has plenty of advantages for a business. Open data is another important external
source, and there are lots of reasons for open data initiatives. One is to make government activities more transparent, like letting the public see
where money is spent. It also helps educate citizens about voting and local issues. Open data also
improves public service by giving people ways to be a part of public planning or provide feedback to
the government. Finally, open data leads to innovation and economic growth by helping people and
companies better understand their markets.
BigQuery
SELECT is the section of a query that indicates what data you want SQL to return to you
FROM is the section of a query that indicates which table the desired data comes from.
WHERE is the section of a query that indicates any filters you’d like to apply to your dataset
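Putting those three together, a minimal query might look like this (the dataset, table, and column names are made up for illustration):

SELECT
  station_name,
  ride_count
FROM
  bike_data.daily_rides
WHERE
  ride_count > 100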
Module 4
There are plenty of best practices you can use when organizing data, including naming conventions,
foldering, and archiving older files.
We've talked about file naming before, which is also known as naming conventions. These are
consistent guidelines that describe the content, date, or version of a file in its name. Basically, this
means you want to use logical and descriptive names for your files to make them easier to find and
use.
Speaking of easily finding things, organizing your files into folders helps keep project-related files
together in one place. This is called foldering.
Relational databases can help you avoid data duplication and store your data more efficiently.
Work out and agree on file naming conventions early on in a project to avoid renaming files
again and again.
Align your file naming with your team's or company's existing file-naming conventions.
Ensure that your file names are meaningful; consider including information like project name and
anything else that will help you quickly identify (and use) the file for the right purpose.
Include the date and version number in file names; common formats are YYYYMMDD for dates
and v## for versions (or revisions).
Create a text file as a sample file with content that describes (breaks down) the file naming
convention and a file name that applies it.
Avoid spaces and special characters in file names. Instead, use dashes, underscores, or capital
letters. Spaces and special characters can cause errors in some applications.
Remember these tips for staying organized as you work with files:
Create folders and subfolders in a logical hierarchy so related files are stored together.
Separate ongoing from completed work so your current project files are easier to find. Archive
older files in a separate folder, or in an external storage location.
If your files aren't automatically backed up, manually back them up often to avoid losing
important work.
To separate current from past work and reduce clutter, data analysts create archives.
The process of structuring folders broadly at the top, then breaking down those folders into more
specific topics, is creating a hierarchy.
Data security means protecting data from unauthorized access or corruption by adopting safety
measures.
In order to do this, companies need to find ways to balance their data security measures with their data
access needs.
Luckily, there are a few security measures that can help companies do just that. The two we will talk
about here are encryption and tokenization.
Encryption uses a unique algorithm to alter data and make it unusable by users and applications that
don’t know the algorithm. This algorithm is saved as a “key” which can be used to reverse the
encryption; so if you have the key, you can still use the data in its original form.
Tokenization replaces the data elements you want to protect with randomly generated data referred to
as a “token.” The original data is stored in a separate location and mapped to the tokens. To access the
complete original data, the user or application needs to have permission to use the tokenized data and
the token mapping. This means that even if the tokenized data is hacked, the original data is still safe
and secure in a separate location.
Encryption and tokenization are just some of the data security options out there. There are a lot of
others, like using authentication devices for AI technology.
As a junior data analyst, you probably won’t be responsible for building out these systems. A lot of
companies have entire teams dedicated to data security or hire third party companies that specialize in
data security to create these systems. But it is important to know that all companies have a
responsibility to keep their data secure, and to understand some of the potential systems your future
employer might use.
*a mentor is a professional who shares their knowledge, skills, and experience to help you develop
and grow. They can be trusted advisors, sounding boards, critics, resources or all of the above
*A sponsor is a professional advocate who's committed to moving a sponsee's career forward with an
organization. To understand the difference between these two roles, think of it like this. A mentor
helps you skill up, a sponsor helps you move up. Having the support of a sponsor is like having a
safety net. They can give you the confidence to take risks at work,
like asking for a new assignment or promotion.
Course 4
Module 1
Data Integrity and Analytics Objectives
Consider the following data issues and suggestions on how to work around them.
Workaround: Gather the data on a small scale to perform a preliminary analysis, and then request additional time to complete the analysis after you have collected more data.
Example: If you are surveying employees about what they think about a new performance and bonus plan, use a sample for a preliminary analysis. Then, ask for another 3 weeks to collect the data from all employees.

Workaround: If there isn’t time to collect data, perform the analysis using proxy data from other datasets. This is the most common workaround.
Example: If you are analyzing peak travel times for commuters but don’t have the data for a particular city, use the data from another city with a similar size and demographic.

Workaround: Adjust your analysis to align with the data you already have.
Example: If you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only.

Workaround: If you have the wrong data because requirements were misunderstood, communicate the requirements again.
Example: If you need the data for female voters and received the data for male voters, restate your needs.

Workaround: If you can’t correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won’t cause systematic bias.
Example: If your dataset was translated from a different language and some of the translations don’t make sense, ignore the data with bad translation and go ahead with the analysis of the other data.
* Important note: sometimes data with errors can be a warning sign that the data isn’t reliable. Use
your best judgment.
Use the following decision tree as a reminder of how to deal with data errors or not enough
data:
When you use a sample, you use a part of a population that's representative of the population. The
goal is to get enough information from a small group within a population to make predictions or
conclusions about the whole population. The sample size helps determine the degree to which you can
be confident that your conclusions accurately represent the population. Using a sample for analysis is
more cost-effective and takes less time. If done carefully and thoughtfully, you can get the same results
using a sample instead of trying to hunt down every single cat owner to find out their favorite cat toys.
There is a potential downside, though. When you only use a small sample of a population, it can lead
to uncertainty: you can't really be 100 percent sure that your statistics are a complete and accurate
representation of the population. This can lead to sampling bias, which we covered earlier in the
program. Sampling bias is when a sample isn't representative of the population as a whole, meaning
some members of the population are over-represented or underrepresented. Using random sampling
can help address some of these issues. Random sampling is a way of selecting a sample from a
population so that every possible type of the sample has an equal chance of being chosen.
Terminology Definitions
Population: The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company.
Sample: A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population.
Margin of error: Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population.
Confidence level: How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study.
Confidence interval: The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error.
Statistical significance: The determination of whether your result could be due to random chance or not. The greater the significance, the less due to chance.
When figuring out a sample size, here are things to keep in mind:
Don’t use a sample size less than 30. It has been statistically proven that 30 is the smallest sample size
where an average result of a sample starts to represent the average result of a population.
The confidence level most commonly used is 95%, but 90% can work in some cases.
Increase the sample size to meet specific needs of your project:
For a higher confidence level, use a larger sample size
To decrease the margin of error, use a larger sample size
For greater statistical significance, use a larger sample size
Note: Sample size calculators use statistical formulas to determine a sample size. More about these are
coming up in the course! Stay tuned.
This recommendation is based on the Central Limit Theorem (CLT) in the field of probability and
statistics. As sample size increases, the results more closely resemble the normal (bell-shaped) distribution
from a large number of samples. A sample of 30 is the smallest sample size for which the CLT is still valid.
Researchers who rely on regression analysis – statistical methods to determine the relationships between
controlled and dependent variables – also prefer a minimum sample of 30.
For example, if you live in a city with a population of 200,000 and get 180,000 people to respond to a
survey, that is a large sample size. But without actually doing that, what would an acceptable, smaller
sample size look like?
Would 200 be alright if the people surveyed represented every district in the city?
A sample size of 200 might be large enough if your business problem is to find out how residents
felt about the new library
A sample size of 200 might not be large enough if your business problem is to determine how
residents would vote to fund the library
You could probably accept a larger margin of error surveying how residents feel about the new library
versus surveying residents about how they would vote to fund it. For that reason, you would most
likely use a larger sample size for the voter survey.
Business scenario: A new car model was just launched a few days ago and the auto dealership can’t wait until the end of the month for sales data to come in. They want sales projections now.
How proxy data can be used: The analyst proxies the number of clicks to the car specifications on the dealership’s website as an estimate of potential sales at the dealership.

Business scenario: A brand new plant-based meat product was only recently stocked in grocery stores and the supplier needs to estimate the demand over the next four years.
How proxy data can be used: The analyst proxies the sales data for a turkey substitute made out of tofu that has been on the market for several years.
Confidence level: The probability that your sample size accurately reflects the greater
population.
Margin of error: The maximum amount that the sample results are expected to differ from those
of the actual population.
Population: This is the total number you hope to pull your sample from.
Sample: A part of a population that is representative of the population.
Estimated response rate: If you are running a survey of individuals, this is the percentage of
people you expect will complete your survey out of those who received the survey.
Imagine you are playing baseball and that you are up at bat. The crowd is roaring, and you are getting
ready to try to hit the ball. The pitcher delivers a fastball traveling about 90-95mph, which takes about
400 milliseconds (ms) to reach the catcher’s glove. You swing and miss the first pitch because your
timing was a little off. You wonder if you should have swung slightly earlier or slightly later to hit a
home run. That time difference can be considered the margin of error, and it tells us how close or far
your timing was from the average home run swing.
The margin of error is also important in marketing. Let’s use A/B testing as an example. A/B testing
(or split testing) tests two variations of the same web page to determine which page is more successful
in attracting user traffic and generating revenue. User traffic that gets monetized is known as the
conversion rate. A/B testing allows marketers to test emails, ads, and landing pages to find the data
behind what is working and what isn’t working. Marketers use the confidence interval (determined
by the conversion rate and the margin of error) to understand the results.
For example, suppose you are conducting an A/B test to compare the effectiveness of two different
email subject lines to entice people to open the email. You find that subject line A: “Special offer just
for you” resulted in a 5% open rate compared to subject line B: “Don’t miss this opportunity” at 3%.
Does that mean subject line A is better than subject line B? It depends on your margin of error. If the
margin of error was 2%, then subject line A’s actual open rate or confidence interval is somewhere
between 3% and 7%. Since the lower end of the interval overlaps with subject line B’s results at 3%,
you can’t conclude that there is a statistically significant difference between subject line A and B.
Examining the margin of error is important when making conclusions based on your test results.
Confidence level: A percentage indicating how likely your sample accurately reflects the greater
population
Population: The total number you pull your sample from
Sample: A part of a population that is representative of the population
Margin of error: The maximum amount that the sample results are expected to differ from those
of the actual population
In most cases, a 90% or 95% confidence level is used. But, depending on your industry, you might
want to set a stricter confidence level. A 99% confidence level is reasonable in some industries, such
as the pharmaceutical industry.
Key takeaway
Margin of error is used to determine how close your sample’s result is to what the result would likely
have been if you could have surveyed or tested the entire population. Margin of error helps you
understand and interpret survey or test results in real life. Calculating the margin of error is
particularly helpful when you are given the data to analyze. After using a calculator to calculate the
margin of error, you will know how much the sample results might differ from the results of the entire
population.
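As a hedged sketch (this formula is not spelled out in the course text, and it assumes a simple random sample with a result expressed as a proportion), the margin of error is commonly estimated as

MOE = z \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}

where \hat{p} is the sample proportion, n is the sample size, and z is the z-score for the chosen confidence level (about 1.96 for 95%). For example, a 5% open rate measured on a hypothetical 1,000 emails would give a margin of error of roughly 1.96 \times \sqrt{0.05 \times 0.95 / 1000} \approx 0.014, or about 1.4 percentage points.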
Module 2
Can you guess what inaccurate or bad data costs businesses every year? Thousands of dollars?
Millions? Billions? Well, according to IBM, the yearly cost of poor-quality data is $3.1 trillion in
the US alone.
It's not a new system implementation or a computer technical glitch. The most common factor is
actually human error.
Dirty data can be the result of someone typing in a piece of data incorrectly; inconsistent
formatting; blank fields; or the same piece of data being entered more than once, which creates
duplicates. Dirty data is data that's incomplete, incorrect, or irrelevant to the problem you're
trying to solve.
When you work with dirty data, you can't be sure that your results are correct. In fact, you can
pretty much bet they won't be. Earlier, you learned that data integrity is critical to reliable
data analytics results, and clean data helps you achieve data integrity. Clean data is data that's
complete, correct, and relevant to the problem you're trying to solve. When you work with clean
data, you'll find that your projects go much more smoothly.
Clean data is incredibly important for effective analysis. If a piece of data is entered into a
spreadsheet or database incorrectly, or if it's repeated, or if a field is left blank, or if data formats
are inconsistent, the result is dirty data. Small mistakes can lead to big consequences in the long
run.
Let's talk about some people you'll work with as a data analyst
Data engineers transform data into a useful format for analysis and give it a reliable
infrastructure. This means they develop, maintain, and test databases, data processors and related
systems.
Data warehousing specialists develop processes and procedures to effectively store and organize
data. They make sure that data is available, secure, and backed up to prevent loss.
*It's always a good idea to examine and clean data before beginning analysis
Data cleaning becomes even more important when working with external data, especially if it
comes from multiple sources
A null is an indication that a value does not exist in a data set. Note that it's not the same as a
zero. In the case of a survey, a null would mean the customers skipped that question. A zero
would mean they provided zero as their response. To do your analysis, you would first need to
clean this data. Step one would be to decide what to do with those nulls. You could either filter
them out and communicate that you now have a smaller sample size, or you can keep them in and
learn from the fact that the customers did not provide responses. There's lots of reasons why this
could have happened. Maybe your survey questions weren't written as well as they could be.
Types of dirty data include duplicate data, outdated data, incomplete data, incorrect/inaccurate data, and inconsistent data.

Outdated data
Description: any data that is old and should be replaced with newer and more accurate information.
Possible causes: people changing roles or companies, or software and systems becoming obsolete.
Potential harm to businesses: inaccurate insights, decision-making, and analytics.

Incomplete data
Description: any data that is missing important fields.
Possible causes: improper data collection or incorrect data entry.
Potential harm to businesses: decreased productivity, inaccurate insights, or inability to complete essential services.

Incorrect/inaccurate data
Description: any data that is complete but inaccurate.
Possible causes: human error during data input, fake information, or mock data.
Potential harm to businesses: inaccurate insights or decision-making based on bad information, resulting in revenue loss.

Inconsistent data
Clean data depends largely on the data integrity rules that an organization follows, such as spelling and punctuation guidelines. Nulls are empty fields. These kinds of dirty data require a little more work than just fixing a spelling error or changing a format.
There are some other types of dirty data as well. The first has to do with labeling. To understand
labeling, imagine trying to get a computer to correctly identify panda bears among images of all
different kinds of animals. You need to show the computer thousands of images of panda bears, all
labeled as panda bears. Any incorrectly labeled picture, like one labeled simply as a bear, will cause a
problem.
A field is a single piece of information from a row or column of a spreadsheet.
Field length is a tool for determining how many characters can be keyed into a field. Assigning a
certain length to the fields in your spreadsheet is a great way to avoid errors.
Data validation is a tool for checking the accuracy and quality of data before adding or importing
it.
Consistency is achieved by ensuring measures are formatted or structured the same way across
systems.
Clean data is essential to data integrity and reliable solutions and decisions.
However, before removing unwanted data, it's always a good practice to make a copy of the data
set. That way, if you remove something that you end up needing in the future, you can easily
access it and put it back in the data set.
Irrelevant data is data that doesn't fit the specific problem that you're trying to solve.
Extra spaces can cause unexpected results when you sort, filter, or search through your data
Cleaning Data from Different Sources
A merger is an agreement that unites two organizations into a single new one.
Data merging is the process of combining two or more datasets into a single dataset. This
presents a unique challenge because when two totally different datasets are combined, the
information is almost guaranteed to be inconsistent and misaligned.
In data analytics, compatibility describes how well two or more datasets are able to work
together.
Not checking for spelling errors: Misspellings can be as simple as typing or input errors. Most
of the time the wrong spelling or common grammatical errors can be detected, but it gets harder
with things like names or addresses. For example, if you are working with a spreadsheet table of
customer data, you might come across a customer named “John” whose name has been input
incorrectly as “Jon” in some places. The spreadsheet’s spellcheck probably won’t flag this, so if
you don’t double-check for spelling errors and catch this, your analysis will have mistakes in it.
Forgetting to document errors: Documenting your errors can be a big time saver, as it helps
you avoid those errors in the future by showing you how you resolved them. For example, you
might find an error in a formula in your spreadsheet. You discover that some of the dates in one
of your columns haven’t been formatted correctly. If you make a note of this fix, you can
reference it the next time your formula is broken, and get a head start on troubleshooting.
Documenting your errors also helps you keep track of changes in your work, so that you can
backtrack if a fix didn’t work.
Not checking for misfielded values: A misfielded value happens when the values are entered
into the wrong field. These values might still be formatted correctly, which makes them harder to
catch if you aren’t careful. For example, you might have a dataset with columns for cities and
countries. These are the same type of data, so they are easy to mix up. But if you were trying to
find all of the instances of Spain in the country column, and Spain had mistakenly been entered
into the city column, you would miss key data points. Making sure your data has been entered
correctly is key to accurate, complete analysis.
Overlooking missing values: Missing values in your dataset can create errors and give you
inaccurate conclusions. For example, if you were trying to get the total number of sales from the
last three months, but a week of transactions were missing, your calculations would be
inaccurate. As a best practice, try to keep your data as clean as possible by maintaining
completeness and consistency.
Only looking at a subset of the data: It is important to think about all of the relevant data when
you are cleaning. This helps make sure you understand the whole story the data is telling, and
that you are paying attention to all possible errors. For example, if you are working with data
about bird migration patterns from different sources, but you only clean one source, you might
not realize that some of the data is being repeated. This will cause problems in your analysis later
on. If you want to avoid common errors like duplicates, each field of your data requires equal
attention.
Losing track of business objectives: When you are cleaning data, you might make new and
interesting discoveries about your dataset-- but you don’t want those discoveries to distract you
from the task at hand. For example, if you were working with weather data to find the average
number of rainy days in your city, you might notice some interesting patterns about snowfall, too.
That is really interesting, but it isn’t related to the question you are trying to answer right now.
Being curious is great! But try not to let it distract you from the task at hand.
Not fixing the source of the error: Fixing the error itself is important. But if that error is
actually part of a bigger problem, you need to find the source of the issue. Otherwise, you will
have to keep fixing that same error over and over again. For example, imagine you have a team
spreadsheet that tracks everyone’s progress. The table keeps breaking because different people
are entering different values. You can keep fixing all of these problems one by one, or you can
set up your table to streamline data entry so everyone is on the same page. Addressing the source
of the errors in your data will save you a lot of time in the long run.
Not analyzing the system prior to data cleaning: If we want to clean our data and avoid future
errors, we need to understand the root cause of the dirty data. Imagine you are an auto
mechanic. You would find the cause of the problem before you started fixing the car, right? The
same goes for data. First, you figure out where the errors come from. Maybe it is from a data
entry error, not setting up a spell check, lack of formats, or from duplicates. Then, once you
understand where bad data comes from, you can control it and keep your data clean.
Not backing up your data prior to data cleaning: It is always good to be proactive and create
your data backup before you start your data clean-up. If your program crashes, or if your changes
cause a problem in your dataset, you can always go back to the saved version and restore it. The
simple procedure of backing up your data can save you hours of work-- and most importantly, a
headache.
Not accounting for data cleaning in your deadlines/process: All good things take time, and
that includes data cleaning. It is important to keep that in mind when going through your process
and looking at your deadlines. When you set aside time for data cleaning, it helps you get a more
accurate estimate for ETAs for stakeholders, and can help you know when to request an adjusted
ETA.
Data cleaning is essential for accurate analysis and decision-making. Common mistakes to avoid when
cleaning data include spelling errors, misfielded values, missing values, only looking at a subset of the
data, losing track of business objectives, not fixing the source of the error, not analyzing the system prior
to data cleaning, not backing up your data prior to data cleaning, and not accounting for data cleaning in
your deadlines/process. By avoiding these mistakes, you can ensure that your data is clean and accurate,
leading to better outcomes for your business.
"Remove duplicates" is a tool that automatically searches for and eliminates duplicate entries
from a spreadsheet. Choose "Data has header row" because our spreadsheet has a row at the very
top that describes the contents of each column
In data analytics, a text string is a group of characters within a cell, most often composed of
letters. An important characteristic of a text string is its length, which is the number of characters
in it.
split is a tool that divides a text string around the specified character and puts each fragment into
a new and separate cell. Split is helpful when you have more than one piece of data in a cell and
you want to separate them out
Split text to columns is also helpful for fixing instances of numbers stored as text. Sometimes
values in your spreadsheet will seem like numbers, but they're formatted as text. This can happen
when copying and pasting from one place to another or if the formatting's wrong
Delimiter is a term for a character that indicates the beginning or end of a data item, such as a
comma
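For example, in Google Sheets (assuming cell A2 contains the hypothetical text "Lopez,Michael" with a comma as the delimiter):

=SPLIT(A2, ",") returns "Lopez" and "Michael" in two separate, adjacent cells.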
Functions can optimize your efforts to ensure data integrity. As a reminder, a function is a set of
instructions that performs a specific calculation using the data in a spreadsheet.
COUNTIF is a function that returns the number of cells that match a specified value. Basically, it
counts the number of times a value appears in a range of cells.
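For example, with a hypothetical range:

=COUNTIF(B2:B100, "Blue") returns the number of cells in B2:B100 whose value is exactly "Blue", which makes it easy to spot values that appear more or less often than expected.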
LEN is a function that tells you the length of the text string by counting the number of characters
it contains. This is useful when cleaning data if you have a certain piece of information in your
spreadsheet that you know must have a certain length. For example, this association uses six-digit
member identification codes. If we'd just imported this data and wanted to be sure our codes
are all the correct number of digits, we'd use LEN. The syntax of LEN is equals LEN, open
parenthesis, the range, and the close parenthesis.
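For example, if the six-digit member identification codes are in column A (the cell reference here is just an assumption for illustration):

=LEN(A2) returns the number of characters in cell A2, so any result other than 6 flags a code that needs a closer look.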
RIGHT is a function that gives you a set number of characters from the right side of a text string.
The syntax is equals RIGHT, open parenthesis, the range, a comma and the number of characters
we want. Then, we finish with a closed parenthesis
The syntax of LEFT is equals LEFT, open parenthesis, the range, a comma, and a number of
characters from the left side of the text string we want. Then, we finish it with a closed
parenthesis.
MID is a function that gives you a segment from the middle of a text string. The syntax for MID
is equals MID, open parenthesis, the range, then a comma. When using MID, you always need to
supply a reference point. In other words, you need to set where the function should start. After
that, place another comma, and how many middle characters you want.
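As a quick illustration (assuming cell A2 contains the hypothetical code "PA-2023-0456"):

=LEFT(A2, 2) returns "PA"
=RIGHT(A2, 4) returns "0456"
=MID(A2, 4, 4) returns "2023", starting at the fourth character and taking four characters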
CONCATENATE is a function that joins together two or more text strings. The syntax is
equals CONCATENATE, then an open parenthesis; inside the parentheses, indicate each text string
you want to join, separated by commas.
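For example, if cell A2 contains "Lopez" and cell B2 contains "Michael" (hypothetical contents):

=CONCATENATE(A2, ", ", B2) returns "Lopez, Michael"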
TRIM is a function that removes leading, trailing, and repeated spaces in data. Sometimes when
you import data, your cells have extra spaces, which can get in the way of your analysis
The syntax for TRIM is equals TRIM, open parenthesis, your range, and closed parenthesis.
TRIM fixed the extra space
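For example, if cell A2 contains "  Lopez   Michael " with stray spaces:

=TRIM(A2) returns "Lopez Michael", with the leading, trailing, and repeated spaces removed.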
Workflow automation
In this reading, you will learn about workflow automation and how it can help you work faster and
more efficiently. Basically, workflow automation is the process of automating parts of your work.
That could mean creating an event trigger that sends a notification when a system is updated. Or it
could mean automating parts of the data cleaning process. As you can probably imagine, automating
different parts of your work can save you tons of time, increase productivity, and give you more
bandwidth to focus on other important aspects of the job.
What can be automated?
Automation sounds amazing, doesn’t it? But as convenient as it is, there are still some parts of
the job that can’t be automated. Let's take a look at some things we can automate and some
things that we can’t.
Task: Communicating with your team and stakeholders
Can it be automated? No
Why? Communication is key to understanding the needs of your team and stakeholders as you complete the tasks you are working on. There is no replacement for person-to-person communications.

Task: Presenting your findings
Can it be automated? No
Why? Presenting your data is a big part of your job as a data analyst. Making data accessible and understandable to stakeholders and creating data visualizations can’t be automated for the same reasons that communications can’t be automated.

Task: Data exploration
Can it be automated? Partially
Why? Sometimes the best way to understand data is to see it. Luckily, there are plenty of tools available that can help automate the process of visualizing data. These tools can speed up the process of visualizing and understanding the data, but the exploration itself still needs to be done by a data analyst.
Data mapping is the process of matching fields from one database to another. This is very
important to the success of data migration, data integration, and lots of other data management
activities.
Compatibility describes how well two or more data sets are able to work together.
Module 3
Spreadsheets and SQL both have their advantages and disadvantages. For example, spreadsheets offer built-in spell check and other useful functions, while SQL offers fast and powerful functionality.
A byte is a collection of 8 bits. Data measurements scale up from there; for example, a zettabyte (ZB) is 1,024 exabytes, roughly all the data on the internet in 2019 (~4.5 ZB).
Module 4
Verification is a process to confirm that a data cleaning effort was well-executed and the
resulting data is accurate and reliable. It involves rechecking your clean dataset, doing some
manual clean ups if needed, and taking a moment to sit back and really think about the original
purpose of the project. That way, you can be confident that the data you collected is credible and
appropriate for your purposes.
Making sure your data is properly verified is so important because it allows you to double-check
that the work you did to clean up your data was thorough and accurate.
Verification lets you catch mistakes before you begin analysis.
Another big part of the verification process is reporting on your efforts.
Open communication is a lifeline for any data analytics project. Reports are a super effective
way to show your team that you're being 100 percent transparent about your data cleaning.
Reporting is also a great opportunity to show stakeholders that you're accountable, build trust
with your team, and make sure you're all on the same page about important project details.
A changelog is a file containing a chronologically ordered list of modifications made to a
project. It's usually organized by version and includes the date followed by a list of
added, improved, and removed features.
Changelogs are very useful for keeping track of how a dataset evolved over the course of a
project. They're also another great way to communicate and report on data to others
Verification is a critical part of any analysis project. Without it you have no way of knowing that
your insights can be relied on for data-driven decision-making. Think of verification as a stamp
of approval.
It also involves manually cleaning data to compare your expectations with what's actually
present. The first step in the verification process is going back to your original unclean data set
and comparing it to what you have now.
Review the dirty data and try to identify any common problems. For example, maybe you had a
lot of nulls. In that case, you check your clean data to ensure no nulls are present. To do that, you
could search through the data manually or use tools like conditional formatting or filters.
Another key part of verification involves taking a big-picture view of your project. This is an
opportunity to confirm you're actually focusing on the business problem that you need to solve
and the overall project goals and to make sure that your data is actually capable of solving that
problem and achieving those goals.
Documenting Results and the Cleaning Process
Documenting the cleaning process helps you recall the errors that were cleaned, inform others of the
changes (even when the data errors aren't fixable), and determine the quality of the data to be used in
analysis.
Embrace Changelogs
What do engineers, writers, and data analysts have in common? Change.
Engineers use engineering change orders (ECOs) to keep track of new product design details and
proposed changes to existing products. Writers use document revision histories to keep track of
changes to document flow and edits. And data analysts use changelogs to keep track of data
transformation and cleaning. Here are some examples of these:
Google Sheets: 1. Right-click the cell and select Show edit history. 2. Click the left-arrow < or right-arrow > to move backward and forward in the history as needed.
Microsoft Excel: 1. If Track Changes has been enabled for the spreadsheet, click Review. 2. Under Track Changes, click the Accept/Reject Changes option to accept or reject any change made.
BigQuery: Bring up a previous version (without reverting to it) and figure out what changed by comparing it to the current version.
A changelog can build on your automated version history by giving you an even more detailed record
of your work. This is where data analysts record all the changes they make to the data. Here is another
way of looking at it. Version histories record what was done in a data change for a project, but don't
tell us why. Changelogs are super useful for helping us understand the reasons changes have been
made. Changelogs have no set format and you can even make your entries in a blank document. But if
you are using a shared changelog, it is best to agree with other data analysts on the format of all your
log entries.
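For instance, a shared changelog entry might look something like this (a made-up example that follows the version-plus-date pattern described above, not a required format):

Version 1.1 (20230614)
Added: a region_code column pulled from the sales system
Improved: standardized all dates to the YYYYMMDD format
Removed: 32 duplicate customer records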
Finally, a changelog is important for when lots of changes to a spreadsheet or query have been made.
Imagine an analyst made four changes and the change they want to revert to is change #2. Instead of
clicking the undo feature three times to undo change #2 (and losing changes #3 and #4), the analyst
can undo just change #2 and keep all the other changes. Now, our example was for just 4 changes, but
try to think about how important that changelog would be if there were hundreds of changes to keep
track of.
1. A company has official versions of important queries in their version control system.
2. An analyst makes sure the most up-to-date version of the query is the one they will change. This
is called syncing.
3. The analyst makes a change to the query.
4. The analyst might ask someone to review this change. This is called a code review and can be
informally or formally done. An informal review could be as simple as asking a senior analyst to
take a look at the change.
5. After a reviewer approves the change, the analyst submits the updated version of the query to a
repository in the company's version control system. This is called a code commit. A best practice
is to document exactly what the change was and why it was made in a comments area. Going
back to our example of a query that pulls daily revenue, a comment might be: Updated revenue to
include revenue coming from the new product, Calypso.
6. After the change is submitted, everyone else in the company will be able to access and use this
new query when they sync to the most up-to-date queries stored in the version control system.
7. If the query has a problem or business needs change, the analyst can undo the change to the
query using the version control system. The analyst can look at a chronological list of all changes
made to the query and who made each change. Then, after finding their own change, the analyst
can revert to the previous version.
8. The query is back to what it was before the analyst made the change. And everyone at the
company sees this reverted, original query, too.
Some of the most common errors involve human mistakes like mistyping or misspelling, flawed
processes like poor design of a survey form, and system issues where older systems integrate data
incorrectly.