What Is Data - Coursera
What is data?
Since we’ve spent some time discussing what data science is, we should spend some time looking at what exactly
data is.
Definitions of “data”
First, let’s look at what a few trusted sources consider data to be.
First up, we’ll look at the Cambridge English Dictionary, which states that data is:
These are slightly different definitions, and they get at different components of what data is. Both agree that data is
values, numbers, or facts, but the Cambridge definition focuses on the actions that surround data: data is
collected, examined, and, most importantly, used to inform decisions. We’ve focused on this aspect before, when we
talked about how the most important part of data science is the question and how all we are doing is using data to
answer that question.
The Wikipedia definition focuses more on what data entails. And although it is a fairly short definition, we’ll take a
second to parse this and focus on each component individually.
So, the first thing to focus on is “a set of values” - to have data, you need a set of items to measure. In
statistics, this set of items is often called the population. The set as a whole is what you are trying to discover
something about. For example, the set of items required to answer your question might be all websites, all people
coming to websites, or all people getting a particular drug. But in general, it’s
a set of things that you’re going to make measurements on.
The next thing to focus on is “variables” - variables are measurements or characteristics of an item. For example,
you could be measuring the height of a person, or the amount of time a person stays on a
website. On the other hand, it might be a more qualitative characteristic you are trying to measure, like what a
person clicks on while visiting a website, or whether you think the person visiting is male or female.
Finally, we have both qualitative and quantitative variables. Qualitative variables are, unsurprisingly, information
about qualities. They are things like country of origin, sex, or treatment group. They’re usually described by words,
not numbers, and they are not necessarily ordered. Quantitative variables, on the other hand, are information about
quantities. Quantitative measurements are usually described by numbers and are measured on a continuous,
ordered scale; they’re things like height, weight, and blood pressure.
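To make the distinction concrete, here is a minimal Python sketch that treats each record as one item from the population and each key as a variable. The records, the field names (`country`, `sex`, `height_cm`, `weight_kg`), and the helper functions are all hypothetical and exist only for illustration; the rule "numbers are quantitative, everything else is qualitative" is a simplifying assumption.

```python
from collections import Counter
from statistics import mean

# Hypothetical records; every value here is made up for illustration.
people = [
    {"country": "USA", "sex": "F", "height_cm": 165.0, "weight_kg": 60.0},
    {"country": "Canada", "sex": "M", "height_cm": 180.0, "weight_kg": 82.5},
    {"country": "USA", "sex": "M", "height_cm": 175.0, "weight_kg": 77.5},
]


def variable_kind(values):
    """Label a variable quantitative if every value is a number, else qualitative."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if is_number else "qualitative"


def summarize(records):
    """Summarize each variable: a mean for quantities, category counts for qualities."""
    summary = {}
    for name in records[0]:
        values = [record[name] for record in records]
        kind = variable_kind(values)
        if kind == "quantitative":
            summary[name] = (kind, mean(values))
        else:
            summary[name] = (kind, dict(Counter(values)))
    return summary


for name, (kind, value) in summarize(people).items():
    print(f"{name}: {kind} -> {value}")
```

Note that the type-based rule is only a rough heuristic: a variable stored as a number, such as a postal code or a treatment group ID, is still qualitative, so in practice you decide the kind from what the variable means, not from how it is stored.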
https://www.coursera.org/learn/data-scientists-tools/ungradedWidget/WETHi/what-is-data 1/6
5/2/22, 5:12 PM What Is Data? | Coursera
An example of a structured dataset - a spreadsheet of individuals (first initial, last name) and their country of
origin, sex, height, and weight
Unfortunately, this is rarely how data is presented to you. The datasets we commonly encounter are much messier,
and it is our job to extract the information we want, corral it into something tidy like the imagined table above, analyze
it appropriately, and often, visualize our results.
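As a rough illustration of what "corralling" messy data into a tidy table can involve, the Python sketch below cleans a few inconsistent text entries into records with the same columns as the spreadsheet described above. The raw entries, the pipe-delimited format, and the helper names are assumptions invented for this example; real messy data will need its own rules.

```python
import re

# Hypothetical raw entries; the format and the values are invented for illustration.
raw_entries = [
    "J. Smith | united states | F | 165 cm | 60kg",
    "  a. jones|Canada|m|180cm|82.5 kg ",
]

FIELDS = ("name", "country", "sex", "height_cm", "weight_kg")


def first_number(text):
    """Return the first number found in text as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number in {text!r}")
    return float(match.group())


def tidy(entry):
    """Turn one messy, pipe-delimited entry into a record with consistent fields."""
    parts = [part.strip() for part in entry.split("|")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    name, country, sex, height, weight = parts
    return {
        "name": name.title(),             # consistent capitalization
        "country": country.title(),
        "sex": sex.upper(),               # one code per category
        "height_cm": first_number(height),  # strip the unit, keep the quantity
        "weight_kg": first_number(weight),
    }


tidy_rows = [tidy(entry) for entry in raw_entries]
for row in tidy_rows:
    print(row)
```

After this step every row has the same variables in the same form, which is what makes the later analysis and visualization steps straightforward.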
individual’s genome. In this case, this data was interpreted into expression data, and produced a plot called a
“volcano plot”.
A volcano plot is produced at the end of a long process to wrangle the raw FASTQ data into interpretable
expression data
The US population is stratified by sex and age to produce a population pyramid plot
Here is the US census website and some tools to help you examine it, but if you aren’t from the US, I urge you to
check out your home country’s census bureau (if available) and look at some of the data there!
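The stratification behind a population pyramid can be sketched in a few lines of Python: group people by age band and sex, then count each group. The `(sex, age)` pairs below are made up and are not census figures, and the ten-year band width and the function names are assumptions chosen for this example.

```python
from collections import Counter

# Hypothetical (sex, age) pairs; these are NOT real census figures.
residents = [
    ("F", 4), ("M", 7), ("F", 8),
    ("F", 23), ("M", 25), ("F", 29),
    ("M", 61), ("F", 67),
]

BAND_WIDTH = 10  # assumed width of each age band, in years


def age_band(age, width=BAND_WIDTH):
    """Map an age to its band label, e.g. 67 -> '60-69' for a width of 10."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def stratify(people):
    """Count people in each (age band, sex) stratum."""
    return Counter((age_band(age), sex) for sex, age in people)


counts = stratify(residents)

# Oldest band first, as in a population pyramid.
bands = sorted(
    {band for band, _ in counts},
    key=lambda band: int(band.split("-")[0]),
    reverse=True,
)

# Crude text pyramid: males on the left, females on the right.
for band in bands:
    males = counts[(band, "M")]
    females = counts[(band, "F")]
    print(f"{'#' * males:>5} | {band:^7} | {'#' * females}")
```

With real census data the counts would come from the published tables rather than from individual records, but the idea of splitting the population by sex and age band is the same.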
The DeepDream software is trained on your image and a famous painting, and your provided image is then
rendered in the style of the famous painter.
There is another fun Google initiative involving image analysis, where you help provide data to Google’s machine
learning algorithm… by doodling!
Summary
In this lesson we focused on data - both in defining it and in exploring what data may look like and how it can be
used.
First, we looked at two definitions of data, one that focuses on the actions surrounding data, and another on what
comprises data. The second definition embeds the concepts of populations and variables, and distinguishes
between quantitative and qualitative data.
Second, we examined different sources of data that you may encounter, and emphasized the lack of tidy datasets.
Examples of messy datasets, where raw data needs to be wrangled into an interpretable form, include
sequencing data, census data, and electronic medical records. Finally, we returned to the relationship
between data and your question and emphasized the importance of question-first strategies. You could have all the
data you could ever hope for, but if you don’t have a question to start with, the data is useless.