Data Sources
Data science for everyone
Course Instructor
Anam Shahid
Data Sources
Previously, you learned about the data science workflow. In this chapter, we'll focus on the first step: data collection and
storage.
Sources of data
We generate vast amounts of data every day simply by surfing the internet, tracking a run, or paying by card in
a shop. The companies behind these services collect this data internally and use it to make data-driven decisions.
There are also many free, open data sources, meaning the data can be freely used, shared, and built on by anyone.
Note that companies sometimes share parts of their data with the wider public as well. Let's first take a look at
company data sources.
A. Company data
Some of the most common company sources of data are web events, survey data, customer data, logistics data, and
financial transactions.
1. Web data
When you visit a web page or click on a link, companies usually track this information in order to calculate
conversion rates or monitor the popularity of different pieces of content. Each event typically captures the following
information: the name of the event, which could be the URL of the page visited or an identifier for the element that
was clicked; the timestamp of the event; and an identifier for the user who performed the action.
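As a rough illustration, the three fields of a web event and a conversion-rate calculation built on them could be sketched as follows. The class name WebEvent, the function conversion_rate, the event names, and the sample events are all hypothetical and chosen only for this example; real tracking systems define their own schemas.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class WebEvent:
    """One tracked event: what happened, when, and who did it."""
    name: str            # URL visited or identifier of the clicked element
    timestamp: datetime  # when the event occurred
    user_id: str         # identifier for the user who performed the action


def conversion_rate(events, visit_name, goal_name):
    """Share of users who triggered `visit_name` and also triggered `goal_name`."""
    visitors = {e.user_id for e in events if e.name == visit_name}
    converted = {e.user_id for e in events if e.name == goal_name} & visitors
    return len(converted) / len(visitors) if visitors else 0.0


# Made-up sample events, for illustration only.
events = [
    WebEvent("/pricing", datetime(2024, 1, 1, 9, 0), "u1"),
    WebEvent("signup_button", datetime(2024, 1, 1, 9, 2), "u1"),
    WebEvent("/pricing", datetime(2024, 1, 1, 10, 0), "u2"),
    WebEvent("/pricing", datetime(2024, 1, 2, 8, 30), "u3"),
    WebEvent("/pricing", datetime(2024, 1, 2, 8, 45), "u4"),
]

rate = conversion_rate(events, "/pricing", "signup_button")
print(f"Conversion rate: {rate:.0%}")  # prints "Conversion rate: 25%"
```

Here one of the four users who visited the pricing page also clicked the sign-up button, so the conversion rate is 25%.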
2. Survey data
Data can also be collected by asking people for their opinions in surveys. This can be, for example, in the form of a face-
to-face interview, online questionnaire, or a focus group.
B. Open data
There are multiple ways to access open data. Two of them are APIs and public records.
Tracking a hashtag
Let's look at an example using the Twitter API. Suppose we want to track Tweets with the hashtag #DataFramed,
DataCamp's wonderful podcast on data science. We can use the Twitter API to request all Tweets with this hashtag. At
this point, we have many options for analysis. We could perform a sentiment analysis on the text of each Tweet to get
an idea of how much people like our podcast. We could simply track how often the hashtag #DataFramed appears each
week. We could also combine this data with our downloads data to see whether positive Tweets are correlated with more
downloads.
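The simplest of these analyses, counting how often the hashtag appears each week, might look like the sketch below. It assumes the Tweets have already been fetched and reduced to dictionaries with a "created_at" date and a "text" string; that record shape, the function name weekly_hashtag_counts, and the sample Tweets are assumptions for illustration, not the Twitter API's actual response format.

```python
from collections import Counter
from datetime import date


def weekly_hashtag_counts(tweets, hashtag):
    """Count tweets mentioning `hashtag`, grouped by ISO (year, week)."""
    needle = hashtag.lower()
    counts = Counter()
    for tweet in tweets:
        # Lower-case each word and strip trailing punctuation before comparing.
        words = {w.strip(".,!?;:") for w in tweet["text"].lower().split()}
        if needle in words:
            iso = tweet["created_at"].isocalendar()
            counts[(iso[0], iso[1])] += 1
    return counts


# Made-up sample tweets, for illustration only.
tweets = [
    {"created_at": date(2024, 1, 1), "text": "Loved the latest #DataFramed episode"},
    {"created_at": date(2024, 1, 3), "text": "New #dataframed is out!"},
    {"created_at": date(2024, 1, 8), "text": "Listening to #DataFramed."},
    {"created_at": date(2024, 1, 9), "text": "Unrelated tweet about #python"},
]

counts = weekly_hashtag_counts(tweets, "#DataFramed")
for (year, week), n in sorted(counts.items()):
    print(f"{year}-W{week:02d}: {n}")
```

Grouping by ISO week keeps the weekly buckets consistent across year boundaries; a sentiment score or downloads figure could be joined on the same (year, week) key.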
2. Public records
Public records are another great way of gathering data. They can be collected and shared by international organizations
like the World Bank, the UN, or the WTO; by national statistical offices, which use census and survey data; or by
government agencies, which make information about, for example, the weather, environment, or population publicly
available. In the US, data.gov has health, education, and commerce data available for free download. In the EU,
data.europa.eu has similar data.
Data types
You now know where to collect data. But what does that data look like? In this topic we'll talk about the different types of
data.
2. Qualitative data
Qualitative data, on the other hand, are things that can be observed but not measured. For example: the fridge is red,
was built in Italy, and might need to be cleaned out because it smells like fish.
Recap
In this topic, we looked at the most common data types: quantitative data, qualitative data, image data, text data,
geospatial data, and network data. These can all serve as inputs for your data science analysis. But before you can
analyze the data, it needs to be stored. That's what we'll cover in the next topic.
2. The cloud
Alternatively, you could pay another company to store data for you. This is referred to as “cloud storage”. Common
cloud storage providers include Microsoft Azure, Amazon Web Services (AWS), and Google Cloud. These services
provide more than just data storage; they can also help your organization with data analytics, machine learning, and
deep learning. For now, we’ll just focus on data storage.
More commonly, data can be expressed as tables of information, like what you might find in a spreadsheet. A database
that stores information in tables is called a Relational Database. Both of these types of databases can be found on the
cloud storage providers that were mentioned earlier.
Each type of database has its own query language: Document Databases mainly use NoSQL, while Relational
Databases mainly use SQL. SQL stands for “Structured Query Language” and NoSQL stands for “Not only SQL”.
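To make the contrast concrete, here is a minimal sketch. The relational half uses Python's built-in sqlite3 module with an in-memory database. The document half is only a stand-in: it stores JSON strings in a Python list rather than using a real document database. The table name, field names, and sample records are made up for this example.

```python
import json
import sqlite3

# Relational: rows in a table with fixed columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Grace", "New York"), (3, "Alan", "London")],
)
london = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("London",)
).fetchall()
conn.close()
print(london)  # [('Ada',), ('Alan',)]

# Document-style: each record is a self-contained, possibly nested document.
# Documents need not share the same fields (only the first has "tags").
documents = [
    {"id": 1, "name": "Ada", "address": {"city": "London"}, "tags": ["vip"]},
    {"id": 2, "name": "Grace", "address": {"city": "New York"}},
]
stored = [json.dumps(doc) for doc in documents]  # stand-in for a document store
in_london = [
    doc["name"]
    for doc in map(json.loads, stored)
    if doc["address"]["city"] == "London"
]
print(in_london)  # ['Ada']
```

The relational query relies on every row having the same columns, while the document lookup navigates a nested structure that can vary from record to record.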
Data Pipelines
Let’s learn about data pipelines. So far we've learned about data collection and storage, but how can we scale all this? This
is where data pipelines come in.
How do we scale?
Consider the different data sources you learned about. What if we're collecting data from more than one source? And
what if these sources provide different types of data? For example, consider real-time streaming data, which is
generated continuously, like tweets from all around the world. This complicates storing the incoming data because, as
a data engineer, you want to make sure the data is organized and easy to access.
3. Automation
Once we've set up all those steps, we automate them. For example, we can specify that every time we receive a tweet,
we transform it in a certain way and store it in a specific table in our database. There are tools that specialize in
this; one of the most popular is Apache Airflow.
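The idea of "receive a tweet, transform it, store it in a table" can be sketched in plain Python. This is not Airflow code: it assumes the pipeline steps are extract, transform, and load, and every function name, the table layout, and the sample tweets are hypothetical. A scheduler such as Airflow would take over calling run_pipeline at the right times.

```python
import sqlite3


def extract():
    """Stand-in for a real data source; returns made-up raw tweets."""
    return [
        {"author": "alice", "text": "  Loving #DataFramed  "},
        {"author": "bob", "text": "Data pipelines are neat"},
    ]


def transform(raw):
    """Clean the text and derive a simple feature (its length)."""
    text = raw["text"].strip()
    return (raw["author"], text, len(text))


def load(conn, rows):
    """Store the transformed rows in a specific table."""
    conn.executemany(
        "INSERT INTO tweets (author, text, length) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def run_pipeline(conn):
    """Run extract, transform, and load once; return the number of rows stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets (author TEXT, text TEXT, length INTEGER)"
    )
    rows = [transform(raw) for raw in extract()]
    load(conn, rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = run_pipeline(conn)
stored = conn.execute(
    "SELECT author, text, length FROM tweets ORDER BY author"
).fetchall()
print(loaded, stored)
```

Keeping each step in its own function means a step can be changed or rerun independently, which is the property that automation tools build on.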
Reference link
https://campus.datacamp.com/courses/data-science-for-everyone/data-collection-and-storage-2?ex=1