What Is Data - Coursera

The document defines data as a set of values that represent measurements of qualitative or quantitative variables from a population. Common types of messy real-world data include sequencing data, census information, electronic medical records, images, and website traffic logs, which require processing to extract useful information. While data is important, a good data scientist focuses first on asking meaningful questions and then seeks relevant data to answer them.

Uploaded by

Artatrana Dash

5/2/22, 5:12 PM What Is Data? | Coursera

What is data?
Since we’ve spent some time discussing what data science is, we should spend some time looking at what exactly
data is.

Definitions of “data”
First, let’s look at what a few trusted sources consider data to be.
We’ll start with the Cambridge English Dictionary, which states that data is:

Information, especially facts or numbers, collected to be examined and considered and used to help decision-making.

Second, we’ll look at the definition provided by Wikipedia, which is:

A set of values of qualitative or quantitative variables.

These are slightly different definitions and they get at different components of what data is. Both agree that data is
values or numbers or facts, but the Cambridge definition focuses on the actions that surround data - data is
collected, examined and most importantly, used to inform decisions. We’ve focused on this aspect before - we’ve
talked about how the most important part of data science is the question and how all we are doing is using data to
answer the question. The Cambridge definition focuses on this.
The Wikipedia definition focuses more on what data entails. And although it is a fairly short definition, we’ll take a
second to parse this and focus on each component individually.
So, the first thing to focus on is “a set of values” - to have data, you need a set of items to measure from. In
statistics, this set of items is often called the population. The set as a whole is what you are trying to discover
something about. For example, that set of items required to answer your question might be all websites or it might be
the set of all people coming to websites, or it might be a set of all people getting a particular drug. But in general, it’s
a set of things that you’re going to make measurements on.
The next thing to focus on is “variables” - variables are measurements or characteristics of an item. For example,
you could be measuring the height of a person, or the amount of time a person stays on a
website. On the other hand, it might be a more qualitative characteristic you are trying to measure, like what a
person clicks on while visiting a website, or whether you think the person visiting is male or female.
Finally, we have both qualitative and quantitative variables. Qualitative variables are, unsurprisingly, information
about qualities. They are things like country of origin, sex, or treatment group. They’re usually described by words,
not numbers, and they are not necessarily ordered. Quantitative variables on the other hand, are information about
quantities. Quantitative measurements are usually described by numbers and are measured on a continuous,
ordered scale; they’re things like height, weight and blood pressure.

https://www.coursera.org/learn/data-scientists-tools/ungradedWidget/WETHi/what-is-data 1/6

A summary of the concepts present in the Wikipedia definition of data

So, taking this whole definition into consideration, data is made up of measurements (either qualitative or
quantitative) on a set of items - not a bad definition.
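
To make the pieces of this definition concrete, here is a minimal Python sketch of a tiny "population" with two qualitative and two quantitative variables. The people, field names, and values are invented purely for illustration, and the rule used to label each variable (numbers are quantitative, everything else is qualitative) is a simplifying assumption rather than a general test.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical record layout; all values below are made up.
    name: str
    country: str       # qualitative: described by words, not ordered
    sex: str           # qualitative
    height_cm: float   # quantitative: measured on a continuous scale
    weight_kg: float   # quantitative


# The "set of items" we take measurements on.
population = [
    Person("A. Lee", "Canada", "F", 165.0, 60.5),
    Person("B. Okafor", "Nigeria", "M", 180.2, 82.0),
    Person("C. Silva", "Brazil", "F", 158.4, 55.3),
]


def variable_kind(values):
    """Label a variable by the type of its values (a simplification)."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if numeric else "qualitative"


kinds = {
    field: variable_kind([getattr(person, field) for person in population])
    for field in ("country", "sex", "height_cm", "weight_kg")
}

for field, kind in kinds.items():
    print(f"{field}: {kind}")
```

Note that the type-based rule breaks down when a qualitative variable is stored as a number (for example, a treatment group coded as 1 or 2), so in practice the label comes from what the variable means, not from how it is stored.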

What can data look like? (rarely)

When we were going over the definitions, our examples of variables and measurements (country of origin, sex,
height, weight) were pretty basic; you can easily envision them in a nice-looking spreadsheet, with
individuals along one side of the table, and the information for those variables along the other side.


An example of a structured dataset - a spreadsheet of individuals (first initial, last name) and their country of
origin, sex, height, and weight
Unfortunately, this is rarely how data is presented to you. The data sets we commonly encounter are much messier,
and it is our job to extract the information we want, corral it into something tidy like the imagined table above, analyse
it appropriately, and often, visualize our results.

More common types of messy data

Here are just some of the data sources you might encounter. We’ll briefly look at what a few of these data sets
often look like or how they can be interpreted. The one thing they have in common is the messiness of the data - you
have to work to extract the information you need to answer your question.
Sequencing data
Population census data
Electronic medical records (EMR), other large databases
Geographic information system (GIS) data (mapping)
Image analysis and image extrapolation
Language and translations
Website traffic
Personal/Ad data (e.g. Facebook, Netflix predictions, etc.)

Messy data: Sequencing

One type of data that I work with regularly is sequencing data. This data is generally first encountered in the FASTQ
format, the raw file format produced by sequencing machines. These files are often hundreds of millions of lines
long, and it is our job to parse this into an understandable and interpretable format and infer something about that
individual’s genome. In this case, the data was interpreted into expression data and used to produce a plot called a
“volcano plot”.

A volcano plot is produced at the end of a long process to wrangle the raw FASTQ data into interpretable
expression data
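
As a rough illustration of the very first step in that process, the sketch below parses FASTQ text into records. It assumes the common layout of four lines per read (an `@` header, the sequence, a `+` separator, and a quality string) and Phred+33 quality encoding; the two reads are invented, and names such as `parse_fastq` are hypothetical helpers, not part of any sequencing toolkit. Real pipelines go far beyond this before reaching expression data.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ block (assumed layout)."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if qual is None:
            raise ValueError(f"truncated record after {header!r}")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], seq, qual)


def mean_quality(record: FastqRecord, offset: int = 33) -> float:
    """Average Phred score, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in record.quality) / len(record.quality)


# Two made-up reads, for illustration only.
sample = """@read1
GATTACA
+
IIIIIII
@read2
ACGT
+
!!II
"""

records = list(parse_fastq(sample.splitlines()))
for rec in records:
    print(rec.read_id, rec.sequence, mean_quality(rec))
```

Reading in fixed blocks of four lines, rather than scanning for lines that begin with `@`, matters because a quality string can itself start with `@`.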

Messy data: Census information

One rich source of information is countrywide censuses. In these, almost all members of a country answer a set of
standardized questions and submit these answers to the government. When you have that many respondents, the
data is large and messy; but once this large database is ready to be queried, the answers embedded in it are
important. Here we have a very basic result of the last US census - in which all respondents are divided by sex and
age, and this distribution is plotted in this population pyramid plot.


The US population is stratified by sex and age to produce a population pyramid plot
Here is the US census website and some tools to help you examine it, but if you aren’t from the US, I urge you to
check out your home country’s census bureau (if available) and look at some of the data there!
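
The stratification behind a population pyramid can be sketched in a few lines of Python. The respondents below are invented and the ten-year age bands are an assumed choice; a real census would supply millions of records and its own published age groupings.

```python
from collections import Counter

# Made-up respondents, for illustration only.
respondents = [
    {"sex": "F", "age": 34},
    {"sex": "M", "age": 36},
    {"sex": "F", "age": 7},
    {"sex": "M", "age": 71},
    {"sex": "F", "age": 38},
    {"sex": "M", "age": 3},
]


def age_band(age: int, width: int = 10) -> str:
    """Map an age to a band label such as '30-39' (band width is an assumption)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Count respondents in each (age band, sex) cell.
counts = Counter((age_band(r["age"]), r["sex"]) for r in respondents)

# Print one row per band, oldest first, as a pyramid plot would stack them.
bands = sorted({band for band, _ in counts}, key=lambda b: int(b.split("-")[0]))
for band in reversed(bands):
    print(f"{band:>7}  M={counts[(band, 'M')]}  F={counts[(band, 'F')]}")
```

Each printed row corresponds to one horizontal bar pair in the pyramid, with the male count on one side and the female count on the other.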

Messy data: Electronic medical records (EMR)

Electronic medical records are increasingly prevalent as a way to store health information, and more and more
population based studies are using this data to answer questions and make inferences about populations at large, or
as a method to identify ways to improve medical care. For example, if you are asking about a population’s common
allergies, you will have to extract many individuals’ allergy information, and put that into an easily interpretable table
format where you will then perform your analysis.
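
The allergy example above can be sketched as follows. The record layout, the free-text `allergies` field, and its separators are all assumptions made for illustration; real EMR systems store this information in many different ways, and the patients here are invented.

```python
import re
from collections import Counter

# Hypothetical raw records with inconsistent free-text allergy entries.
raw_records = [
    {"patient_id": 1, "allergies": "Penicillin; peanuts"},
    {"patient_id": 2, "allergies": "none"},
    {"patient_id": 3, "allergies": "PEANUTS,  latex "},
    {"patient_id": 4, "allergies": ""},
]


def tidy_allergies(records):
    """Return one (patient_id, allergy) row per reported allergy."""
    rows = []
    for rec in records:
        # Split on either separator, then normalize case and whitespace.
        for item in re.split(r"[;,]", rec["allergies"]):
            name = item.strip().lower()
            if name and name != "none":
                rows.append((rec["patient_id"], name))
    return rows


rows = tidy_allergies(raw_records)
allergy_counts = Counter(name for _, name in rows)

for name, n in allergy_counts.most_common():
    print(f"{name}: {n}")
```

Once the data is in this one-row-per-observation form, the analysis itself (here, counting the most common allergies) is a single line.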

Messy data: Image analysis/extrapolation

A more complex data source to analyse is images and videos. There is a wealth of information coded in an image or
video, and it is just waiting to be extracted. An example of image analysis that you may be familiar with is when you
upload a picture to Facebook and not only does it automatically recognize faces in the picture, but then suggests
who they may be. A fun example you can play with is the DeepDream software that was originally designed to detect
faces in an image, but has since moved on to more artistic pursuits.


The DeepDream software is trained on your image and a famous painting and your provided image is then
rendered in the style of the famous painter
There is another fun Google initiative involving image analysis, where you help provide data to Google’s machine
learning algorithm… by doodling!

Data is of secondary importance

Recognizing that we’ve spent a lot of time going over what data is, we need to reiterate: data is important, but it is
secondary to your question. A good data scientist asks questions first and seeks out relevant data second.
Admittedly, often the data available will limit, or perhaps even enable, certain questions you are trying to ask. In
these cases, you may have to reframe your question or answer a related question, but the data itself does not drive
the question asking.

Summary
In this lesson we focused on data - both in defining it and in exploring what data may look like and how it can be
used.
First, we looked at two definitions of data, one that focuses on the actions surrounding data, and another on what
comprises data. The second definition embeds the concepts of populations and variables, and distinguishes
between quantitative and qualitative data.
Second, we examined different sources of data that you may encounter, and emphasized the lack of tidy datasets.
Examples of messy datasets, where raw data needs to be wrangled into an interpretable form, can include
sequencing data, census data, electronic medical records, etc.
Finally, we returned to our beliefs on the relationship between data and your question and emphasized the
importance of question-first strategies. You could have all the
data you could ever hope for, but if you don’t have a question to start, the data is useless.
