Data Sources

Uploaded by

IUB VIBES

DATACAMP CHAPTER 2

Data Sources
Data science for everyone
Course Instructor
Anam Shahid
Data sources
Previously, you learned about the data science workflow. In this chapter, we'll focus on the first step: data collection and
storage.

The data science workflow

Before we can start deriving insights from data, we first need to collect the data from different sources.

Sources of data
We generate vast amounts of data every day simply by surfing the internet, tracking a run, or paying by card in
a shop. The companies behind these services collect this data internally and use it to make
data-driven decisions. On the other hand, there are also many free, open data sources, meaning the data can
be freely used, shared, and built on by anyone. Note that companies sometimes share parts of their data with the wider
public as well. Let's first take a look at company data sources.

A. Company data
Some of the most common company sources of data are web events, survey data, customer data, logistics data, and
financial transactions.

1. Web data
When you visit a web page or click on a link, usually this information is tracked by companies in order to calculate
conversion rates or monitor the popularity of different pieces of content. The following information is captured: the name
of the event, which could mean the URL of the page visited or an identifier for the element that was clicked, the
timestamp of the event, and an identifier for the user that performed the action.
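
As a minimal sketch of the three fields described above, a web event can be modeled as a small record. The field names, the sample events, and the definition of conversion used here (the share of visitors to a page who also clicked a given element) are illustrative assumptions, not something the course prescribes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class WebEvent:
    """One tracked event: what happened, when, and who did it."""
    name: str            # URL visited or identifier of the clicked element
    timestamp: datetime  # when the event occurred
    user_id: str         # identifier of the user who performed the action


# Hypothetical sample events, invented for illustration.
events = [
    WebEvent("/pricing", datetime(2024, 1, 5, 9, 0), "u1"),
    WebEvent("button:signup", datetime(2024, 1, 5, 9, 2), "u1"),
    WebEvent("/pricing", datetime(2024, 1, 5, 10, 0), "u2"),
    WebEvent("/pricing", datetime(2024, 1, 6, 11, 30), "u3"),
    WebEvent("button:signup", datetime(2024, 1, 6, 11, 31), "u3"),
    WebEvent("/pricing", datetime(2024, 1, 6, 12, 0), "u4"),
]


def conversion_rate(events, page, target):
    """Share of users who visited `page` and also triggered `target`."""
    visitors = {e.user_id for e in events if e.name == page}
    converted = {e.user_id for e in events if e.name == target} & visitors
    return len(converted) / len(visitors) if visitors else 0.0


rate = conversion_rate(events, "/pricing", "button:signup")
print(f"Conversion rate: {rate:.0%}")
```

Here two of the four visitors to the pricing page clicked the sign-up button, so the rate is one half.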
2. Survey data
Data can also be collected by asking people for their opinions in surveys. This can be, for example, in the form of a face-
to-face interview, online questionnaire, or a focus group.

3. Net Promoter Score

You've likely answered a survey question like the one shown in the image. This is a very common type of survey data used by companies:
the Net Promoter Score, or NPS, which asks how likely a user is to recommend a product to a friend or colleague.
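
The notes do not say how the score is computed. Under the standard definition (answers on a 0 to 10 scale, 9 to 10 counted as promoters, 0 to 6 as detractors, and the score being the percentage of promoters minus the percentage of detractors), a minimal sketch could look like this. The sample responses are hypothetical.

```python
def net_promoter_score(responses):
    """Return the NPS (between -100 and 100) for a list of 0-10 ratings."""
    if not responses:
        raise ValueError("at least one response is required")
    promoters = sum(1 for r in responses if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in responses if r <= 6)  # ratings of 0 to 6
    # Ratings of 7 or 8 (passives) count only toward the total.
    return 100 * (promoters - detractors) / len(responses)


# Hypothetical survey answers.
responses = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]
nps = net_promoter_score(responses)
print(nps)
```

With five promoters and three detractors among ten answers, the score is 20.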

B. Open data
There are multiple ways to access open data. Two of them are APIs and public records.

1. Public data APIs

Let's begin with APIs. API stands for Application Programming Interface. It's an easy way of requesting data from a third
party over the internet. Many companies have public APIs to let anyone access their data. Some notable APIs include
Twitter, Wikipedia, Yahoo! Finance, and Google Maps, but there are many, many more.

Tracking a hashtag
Let's look at an example using the Twitter API. Suppose we want to track Tweets with the hashtag #DataFramed,
DataCamp's wonderful podcast on data science. We can use the Twitter API to request all Tweets with this hashtag. At this
point, we have many options for analysis. We could perform a sentiment analysis on the text of each Tweet to get an
idea of how much people like our podcast. We could simply track how often #DataFramed appears each week. We could
also combine this data with our downloads data and see whether positive Tweets are correlated with more downloads.
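
The weekly tracking idea can be sketched without calling any real API. The list below is a hypothetical stand-in for what an API request might return, and the field names `created_at` and `text` are assumptions made for this example.

```python
from collections import Counter
from datetime import date

# Hypothetical tweets, standing in for the result of an API request.
tweets = [
    {"created_at": date(2024, 1, 1), "text": "Loved the new #DataFramed episode!"},
    {"created_at": date(2024, 1, 3), "text": "Listening to #dataframed on my commute"},
    {"created_at": date(2024, 1, 4), "text": "No hashtag here"},
    {"created_at": date(2024, 1, 9), "text": "#DataFramed on data pipelines was great"},
]


def weekly_hashtag_counts(tweets, hashtag):
    """Count tweets mentioning `hashtag` per ISO (year, week)."""
    tag = hashtag.lower()
    counts = Counter()
    for tweet in tweets:
        # Naive matching: split on whitespace, so "#DataFramed!" with
        # trailing punctuation attached would not be counted.
        words = tweet["text"].lower().split()
        if tag in words:
            year, week, _ = tweet["created_at"].isocalendar()
            counts[(year, week)] += 1
    return counts


counts = weekly_hashtag_counts(tweets, "#DataFramed")
for (year, week), n in sorted(counts.items()):
    print(year, week, n)
```

The same per-week counts could then be joined with download numbers to look for a correlation.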
2. Public records
Public records are another great way of gathering data. They can be collected and shared by international organizations
like the World Bank, the UN, or the WTO; by national statistical offices, which use census and survey data; or by government
agencies, which make information about, for example, the weather, the environment, or the population publicly available. For
example, in the US, data.gov has health, education, and commerce data available for free download. In the EU,
data.europa.eu has similar data.

Data types
You now know where to collect data. But what does that data look like? In this topic we'll talk about the different types of
data.

Why care about data types?

You might wonder why it's important to know what type of data you have collected. This will be essential later on in the
data science process. For instance, it's especially relevant when you want to store the data, which we'll talk about in the
next topic, as not all types of data can be stored in the same place. Furthermore, when you're visualizing or analyzing the data,
it's important to know the type of data you are dealing with. Not all visualizations or analyses can be performed with all
data types. So, let's dive in.

Quantitative vs qualitative data

There are two general types of data: qualitative and quantitative data. It’s important to understand the key differences
between the two. Quantitative data can be counted, measured, and expressed using numbers. Qualitative data is descriptive
and conceptual. Qualitative data can be observed but not measured. Now that we know the differences, let’s dive into each
type of data with a real-world example.
1. Quantitative data
Quantitative data can be expressed in numbers. For example, the fridge is 60 inches tall, has two apples in it, and costs
1000 dollars.

2. Qualitative data
Qualitative data, on the other hand, are things that can be observed but not measured like: the fridge is red, was built in
Italy, and might need to be cleaned out because it smells like fish.
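
The fridge from both examples can be written as one record and split by kind. This is only a sketch: the field names are invented, and treating every numeric value as quantitative is a simplifying assumption (a numeric code such as a postal code would be misclassified).

```python
# The fridge from the examples above, as one record.
fridge = {
    "height_inches": 60,
    "apples_inside": 2,
    "price_usd": 1000,
    "color": "red",
    "built_in": "Italy",
    "smell": "fish",
}


def split_by_kind(record):
    """Separate numeric (quantitative) fields from descriptive (qualitative) ones."""
    quantitative, qualitative = {}, {}
    for field, value in record.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            quantitative[field] = value
        else:
            qualitative[field] = value
    return quantitative, qualitative


quantitative, qualitative = split_by_kind(fridge)
print("Quantitative:", quantitative)
print("Qualitative:", qualitative)
```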

Other data types

Other than the traditional quantitative and qualitative data, there are many other data types that are becoming more and
more important. There is image data, text data, geospatial data, network data, and many more. Note that these other data
types aren't mutually exclusive with quantitative and qualitative data: often they are a special
mix of the two. Let's look at some examples.

1. Other data types: Image data

Digital images are everywhere. An image is made up of pixels. These pixels contain information about color and intensity.
Typically, the pixels are stored in computer memory. In the example you can see that if we zoom in on the image we can
distinguish the different pixels.
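
To make the idea of pixels concrete, here is a tiny hypothetical image held as a grid of (red, green, blue) values. The plain channel average used for the grayscale conversion is a simplification chosen for this sketch; real conversions usually weight the channels differently.

```python
# A hypothetical 2x3 image: each pixel is an (R, G, B) tuple, 0-255 per channel.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(0, 0, 0), (128, 128, 128), (255, 255, 255)],
]


def to_grayscale(image):
    """Replace each pixel by the integer average of its three channels."""
    return [[sum(pixel) // 3 for pixel in row] for row in image]


height, width = len(image), len(image[0])
gray = to_grayscale(image)
print(f"{width}x{height} image, grayscale rows: {gray}")
```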

2. Other data types: Text data

Emails, documents, reviews, social media posts, and so on. As you can imagine, text data can be found in many places.
This data can be stored and analyzed to find relevant insights.

3. Other data types: Geospatial data

Geospatial data are data with location information. In the example you can see that many different types of information
can be captured using geospatial data. For a specific region we can keep track of where the roads, the buildings, and
vegetation are. This is especially useful for navigation apps like Waze and Google Maps.
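
A minimal sketch of geospatial records: each feature carries a kind plus a latitude and longitude, and a region is approximated by a bounding box. The coordinates and field names are hypothetical, and real geospatial data usually stores roads and buildings as lines and polygons rather than single points.

```python
# Hypothetical features: a kind plus a (latitude, longitude) position.
features = [
    {"kind": "building", "lat": 52.5200, "lon": 13.4050},
    {"kind": "road", "lat": 52.5210, "lon": 13.4100},
    {"kind": "vegetation", "lat": 52.6000, "lon": 13.5000},
]


def within_region(features, south, west, north, east):
    """Keep the features whose position falls inside a bounding box."""
    return [
        f for f in features
        if south <= f["lat"] <= north and west <= f["lon"] <= east
    ]


nearby = within_region(features, south=52.51, west=13.40, north=52.53, east=13.42)
kinds = [f["kind"] for f in nearby]
print(kinds)
```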
4. Other data types: Network data
Network data consists of the people or things in a network, depicted by circles, and the relationships between them,
depicted by lines. Here you can see an example of a social network. You can easily see who knows whom.
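
One common way to hold such a network in code is an adjacency mapping from each person to the people they know. The names and relationships below are hypothetical, and the relationships are assumed to be mutual.

```python
# Hypothetical social network: each pair is one relationship
# (a line in the picture) between two people (the circles).
edges = {("Ana", "Ben"), ("Ana", "Chen"), ("Ben", "Chen"), ("Chen", "Dee")}


def build_adjacency(edges):
    """Turn undirected edges into a person -> set of acquaintances mapping."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


network = build_adjacency(edges)
most_connected = max(network, key=lambda person: len(network[person]))
print(most_connected, sorted(network[most_connected]))
```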

Recap
In this topic, we looked at the most common data types: quantitative data, qualitative data, image data, text data,
geospatial data, and network data. These can all serve as inputs for your data science analysis. But before doing
that, the data needs to be stored. That's what we'll cover in the next topic.

Data storage and retrieval

Previously in this chapter, you learned about different data sources and data types.

The data science workflow

Now, let's discuss efficient ways of storing and retrieving the data that was collected. As you can see this is still part of the
first step in the data science workflow we defined before.

Things to consider when storing data

When storing data there are multiple things to take into consideration. First, we need to determine where we want to
store the data (location). Then, we need to know what kind of data we are storing (data type). And lastly, we need to
consider how we can retrieve our data from storage. Let's take a closer look.
A. Location:
1. Parallel storage solutions
Data science projects could require large amounts of data. At this point the data probably can't be stored on a single
computer anymore. In order to make sure that all data is saved and easy to access, it is stored across many different
computers. Large companies often have their own set of storage computers, called a “cluster” or a “server”, on premises.

2. The cloud
Alternatively, you could pay another company to store data for you. This is referred to as “cloud storage”. Common
cloud storage providers include Microsoft Azure, Amazon Web Services (AWS), and Google Cloud. These services
provide more than just data storage; they can also help your organization with data analytics, machine learning, and deep
learning. For now, we’ll just focus on data storage.

B. Types of data storage:

Different types of data require different storage solutions. Some data is unstructured, like email, text, video and audio
files, web pages, and social media messages. This type of data is often stored in a type of database called a Document
Database.

More commonly, data can be expressed as tables of information, like what you might find in a spreadsheet. A database
that stores information in tables is called a Relational Database. Both of these types of databases can be found on the
cloud storage providers that were mentioned earlier.
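
The contrast between the two storage styles can be sketched with Python's standard library. A JSON string stands in for what a document database might hold, and an in-memory SQLite database stands in for a relational database; both are stand-ins chosen for this example, and the records are hypothetical.

```python
import json
import sqlite3

# Document style: one self-describing record; fields may vary per document.
review_doc = {
    "author": "sam",
    "text": "Great fridge, a bit loud.",
    "tags": ["fridge", "noise"],  # nested data fits naturally
}
stored_document = json.dumps(review_doc)  # serialized form, as it might be stored

# Relational style: a fixed set of typed columns, one row per record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, state TEXT)")
conn.execute("INSERT INTO customers (name, state) VALUES (?, ?)", ("Sam", "Montana"))
conn.commit()

row = conn.execute("SELECT id, name, state FROM customers").fetchone()
restored = json.loads(stored_document)
conn.close()

print(row)
print(restored["tags"])
```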

C. Retrieval: Data querying

Once data has been stored in a Document Database or a Relational Database, we’ll need to access it. At a basic level,
we’ll want to be able to request a specific piece of data, such as “All of the images that were created on March 3rd” or
“All of the customer addresses in Montana”. In addition, we might even want to do some analysis, such as summing,
counting, or averaging data.
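
Both kinds of request can be sketched in SQL against an in-memory SQLite database. The table layout and the rows are hypothetical; only the request "all of the customer addresses in Montana" comes from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (name TEXT, address TEXT, state TEXT, order_total REAL)"
)
# Hypothetical rows.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        ("Ana", "1 Main St, Helena", "Montana", 120.0),
        ("Ben", "9 Oak Ave, Billings", "Montana", 80.0),
        ("Chen", "5 Pine Rd, Boise", "Idaho", 50.0),
    ],
)

# Request a specific piece of data: all customer addresses in Montana.
addresses = [
    address
    for (address,) in conn.execute(
        "SELECT address FROM customers WHERE state = ? ORDER BY name", ("Montana",)
    )
]

# Simple analysis: count, sum, and average for one state.
summary = conn.execute(
    "SELECT COUNT(*), SUM(order_total), AVG(order_total) "
    "FROM customers WHERE state = ?",
    ("Montana",),
).fetchone()
conn.close()

print(addresses)
print(summary)
```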

Each type of database has its own way of being queried. Relational Databases mainly use SQL, which stands for
“Structured Query Language”. Document Databases belong to the NoSQL family, which stands for “Not only SQL”, and
each typically provides its own query language or API.

Data Pipelines
Let’s learn about data pipelines. So far we've learned about data collection and storage, but how can we scale all this? This
is where data pipelines come in.

Data collection and storage

Data engineers work to collect and store data so that others, like analysts and data scientists, can access it for their
work, whether that's visualization or building machine learning models.

How do we scale?
But how do we scale this? Consider the different data sources you learned about - what if we're collecting data from more
than one data source? And then, what if these data sources have different types of data? For example, consider real-time
streaming data, which is data that is continuously being generated, like tweets from all around the world. This makes
storing this incoming data complicated, because as a data engineer, you want to make sure data is organized and easy to
access.

What is a data pipeline?

Enter the data pipeline. A data pipeline moves data through defined stages, for example, from data ingestion through an
API to loading data into a database. A key feature is that pipelines automate this movement. Data is constantly coming in,
and it would be tedious to ask a data engineer to manually run programs to collect and store it. Instead, a data engineer
schedules tasks to run hourly or daily, or sets them up to be triggered by an event. Because of this automation, data pipelines
need to be monitored. Luckily, alerts can be generated automatically, for example, when 95% of storage capacity has been
reached or when an API responds with an error. Data pipelines aren't necessary for all data science projects, but they are
when working with a lot of data from different sources. There isn't a set way to make a pipeline - pipelines are highly
customized depending on your data, storage options, and ultimate usage of the data. ETL, which stands for extract,
transform, and load, is a popular framework for data pipelines. Let's explore it with a case study.
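
The extract, transform, and load stages, together with the 95% storage alert mentioned above, can be sketched in plain Python. Everything specific here is an assumption made for illustration: the smart-home temperature readings are hypothetical, `extract` returns hard-coded records instead of calling an API, and an in-memory SQLite table stands in for the target database.

```python
import sqlite3

STORAGE_ALERT_THRESHOLD = 0.95  # alert at 95% of capacity, as in the text


def extract():
    """Stand-in for ingestion from an API: returns hypothetical raw records."""
    return [
        {"user": "ana", "temp_f": "68.0"},
        {"user": "ben", "temp_f": "71.6"},
        {"user": "chen", "temp_f": None},  # incomplete reading
    ]


def transform(records):
    """Drop incomplete records and convert Fahrenheit strings to Celsius."""
    cleaned = []
    for record in records:
        if record["temp_f"] is None:
            continue
        celsius = (float(record["temp_f"]) - 32) * 5 / 9
        cleaned.append((record["user"], round(celsius, 1)))
    return cleaned


def load(conn, rows):
    """Write the transformed rows into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (user TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def storage_alert(used_bytes, capacity_bytes):
    """Return True when usage has reached the alert threshold."""
    return used_bytes / capacity_bytes >= STORAGE_ALERT_THRESHOLD


def run_pipeline(conn):
    """One pipeline run; a scheduler or an event would call this."""
    load(conn, transform(extract()))


conn = sqlite3.connect(":memory:")
run_pipeline(conn)
loaded = conn.execute("SELECT user, temp_c FROM readings ORDER BY user").fetchall()
conn.close()
print(loaded)
```

Keeping each stage in its own function means a scheduler only needs to call `run_pipeline`, and each stage can be checked separately.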

Case study: smart home

1. Extract

2. Transform and Load

3. Automation
Once we've set up all those steps, we automate. For example, we can say that every time we get a tweet, we transform it in
a certain way and store it in a specific table in our database. There are tools specialized for this; one of the most popular is
called Airflow.
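
The "every time we get a tweet" rule amounts to an event handler. The sketch below uses plain Python rather than Airflow, with hypothetical tweets, an invented table layout, and a simple loop standing in for a real stream of incoming events.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (author TEXT, text TEXT, n_words INTEGER)")


def on_tweet(conn, tweet):
    """Event handler: transform one incoming tweet and store it."""
    text = tweet["text"].strip().lower()  # transform: normalize the text
    row = (tweet["author"], text, len(text.split()))
    conn.execute("INSERT INTO tweets VALUES (?, ?, ?)", row)  # load
    conn.commit()


# Hypothetical incoming events; a real pipeline would receive them from a stream.
incoming = [
    {"author": "ana", "text": "  My smart thermostat is GREAT "},
    {"author": "ben", "text": "Smart lights again offline"},
]
for tweet in incoming:
    on_tweet(conn, tweet)

stored = conn.execute(
    "SELECT author, text, n_words FROM tweets ORDER BY rowid"
).fetchall()
conn.close()
print(stored)
```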

Reference link

https://campus.datacamp.com/courses/data-science-for-everyone/data-collection-and-storage-2?ex=1
