2 Data Engineering (Storing Data)
Storing Data
Source: https://campus.datacamp.com/courses/understanding-data-engineering/storing-data?ex=9
Let's continue our exploration of the world of data engineering. This lecture focuses on
storage. In this lesson, we're going to learn more about how data is structured.
1. Structured data
Structured data is easy to search and organize. Data is entered following a rigid
structure, like a spreadsheet with a fixed set of columns. Each column takes values of
a certain type, like text, date, or decimal. This makes it easy to form relations, which is why
structured data is organized in what is called a relational database. About 20% of all data
is structured. SQL, which stands for Structured Query Language, is used to query such data.
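To make the idea of typed columns concrete, here is a minimal sketch using Python's built-in sqlite3 module. The employees table, its columns, and the sample rows are illustrative assumptions, not data from the course.

```python
import sqlite3

# In-memory database; the table, columns and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        first_name  TEXT NOT NULL,
        hired_on    TEXT NOT NULL,   -- ISO date string, e.g. 2019-04-01
        salary      REAL NOT NULL,
        office      TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ada", "2019-04-01", 5200.0, "Brussels"),
        (2, "Linus", "2021-09-15", 4100.5, "New York"),
        (3, "Grace", "2020-01-20", 6100.0, "Brussels"),
    ],
)

# Every row has the same columns, so searching is a single SQL query.
brussels_staff = conn.execute(
    "SELECT first_name FROM employees WHERE office = ? ORDER BY first_name",
    ("Brussels",),
).fetchall()
print(brussels_staff)
```

Because the structure is rigid, a row with a missing or extra column simply cannot be inserted, which is what keeps the data easy to search.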
Relational database
Because it's structured, we can easily relate this table to other structured data. For
example, if there's another table holding information about offices, we can connect the
two tables on the office column. Tables that can be connected that way form a relational database.
Table 2: Office Table
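Connecting two tables on a shared column can be sketched as follows. This is a minimal example with Python's sqlite3 module; the offices and employees tables and their contents are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical tables that share an "office" column.
conn.executescript(
    """
    CREATE TABLE offices (
        office  TEXT PRIMARY KEY,
        country TEXT NOT NULL
    );
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        first_name  TEXT NOT NULL,
        office      TEXT NOT NULL REFERENCES offices(office)
    );
    INSERT INTO offices VALUES ('Brussels', 'Belgium'), ('New York', 'USA');
    INSERT INTO employees VALUES (1, 'Ada', 'Brussels'), (2, 'Linus', 'New York');
    """
)

# The JOIN connects each employee to the matching office row.
rows = conn.execute(
    """
    SELECT e.first_name, o.country
    FROM employees AS e
    JOIN offices AS o ON e.office = o.office
    ORDER BY e.employee_id
    """
).fetchall()
print(rows)
```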
2. Semi-structured data
Semi-structured data resembles structured data but allows more flexibility. It is still
relatively easy to organize. Its values also have
different types and can be grouped to form relations, although this is not as
straightforward as with structured data: you have to pay for that flexibility at some point.
Semi-structured data is typically stored in NoSQL databases (as opposed to SQL) and usually
uses the JSON or XML file formats.
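As a rough illustration of that trade-off, the sketch below parses a few JSON records with Python's json module. The song records and field names are invented for the example. Notice that the records do not all carry the same keys, so the code has to handle missing fields and build relations by hand.

```python
import json

# Hypothetical records: they share some keys, but not all of them.
raw = """
[
  {"title": "Song A", "artist": "Artist X", "genre": "rock", "tags": ["live", "90s"]},
  {"title": "Song B", "artist": "Artist Y"},
  {"title": "Song C", "artist": "Artist X", "genre": "jazz", "featuring": ["Artist Z"]}
]
"""
songs = json.loads(raw)

# Optional fields need a default, since nothing guarantees they exist.
genres = [song.get("genre", "unknown") for song in songs]

# Relating records is possible but manual: group titles by artist.
titles_by_artist = {}
for song in songs:
    titles_by_artist.setdefault(song["artist"], []).append(song["title"])

print(genres)
print(titles_by_artist)
```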
3. Unstructured data
At Spotflix, unstructured data consists of lyrics, songs, pictures (album pictures and
artists' profile pictures), and videos (music videos, etc.).
Adding some structure
At Spotflix, we could use machine learning algorithms to parse song spectrums and analyze
beats per minute, chord progressions, and genres to help categorize songs. Or we could
have artists give additional information when they upload their songs. Having them add
the genre and some tags would turn it into semi-structured data and would make
searching and organizing easier.
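Here is a small sketch of that second option. The upload records, the find_by_genre helper, and the field names are all hypothetical. The lyrics themselves stay free text, while the genre and tags supplied by the artist make each upload searchable.

```python
# Hypothetical uploads: free-text lyrics plus artist-supplied metadata.
uploads = [
    {"lyrics": "la la la ...", "genre": "pop", "tags": ["summer", "dance"]},
    {"lyrics": "oh oh oh ...", "genre": "rock", "tags": ["live"]},
    {"lyrics": "na na na ...", "genre": "pop", "tags": ["acoustic"]},
]


def find_by_genre(items, genre):
    """Return the uploads whose metadata matches the requested genre."""
    return [item for item in items if item["genre"] == genre]


pop_uploads = find_by_genre(uploads, "pop")
print(len(pop_uploads))
```

Without the genre field, answering the same question would require analyzing the lyrics or the audio itself.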
Summary
All right, now you know what is characteristic of structured data, semi-structured data
and unstructured data, the differences between the three, and you're able to give
examples for each of them.
SQL databases
We've mentioned SQL several times by now, so how about we spend a bit more time on
this language that is so fundamental in data engineering?
SQL
SQL stands for Structured Query Language. SQL is to databases what English is to pop
music. It's the preferred language for querying an RDBMS (Relational Database
Management System), that is, a system that gathers several tables, like the Employees
table from the previous lesson, where tables are related to each other. More on that
in a moment. SQL has two main advantages: it allows you to access many records at
once, and to group, filter, or aggregate them. Most programming languages let you do that,
but SQL was one of the first, which is why it's been so influential. It's a little bit like the Beatles
and pop music. It's also very close to English, which makes it easy to write and
understand. As you already know, data engineers use SQL to create, maintain, and
update databases, while data scientists use SQL to query, filter, group, and aggregate
data in the tables of databases.
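Filtering, grouping, and aggregating many records at once can be sketched in a single query. The example below runs SQL through Python's sqlite3 module; the employees table and the salary figures are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical data, only to have something to aggregate.
conn.executescript(
    """
    CREATE TABLE employees (name TEXT, office TEXT, salary REAL);
    INSERT INTO employees VALUES
        ('Ada',   'Brussels', 5000),
        ('Grace', 'Brussels', 6000),
        ('Linus', 'New York', 4000),
        ('Guido', 'New York', 3000);
    """
)

# WHERE filters rows, GROUP BY groups them, COUNT and AVG aggregate each group.
avg_by_office = conn.execute(
    """
    SELECT office, COUNT(*) AS headcount, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary >= 3500
    GROUP BY office
    ORDER BY office
    """
).fetchall()
print(avg_by_office)
```

The query reads close to an English sentence, which is part of why SQL is considered easy to write and understand.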
Database schema
So far, we've looked at tables individually, but databases are made of many tables. The
database schema governs how tables are related. A database schema is the skeleton
structure that represents the logical view of the entire database. It defines how the data
is organized and how the tables relate to one another. It also specifies all the
constraints that are to be applied to the data.
Example:
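As one possible illustration (not the course's own example), a small schema could be sketched like this with Python's sqlite3 module. The artists and albums tables are assumptions. The schema declares both the relation between the tables and the constraints on the data, and the database then rejects rows that break them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: each album must point to an existing artist.
conn.executescript(
    """
    CREATE TABLE artists (
        artist_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL UNIQUE
    );
    CREATE TABLE albums (
        album_id  INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        artist_id INTEGER NOT NULL REFERENCES artists(artist_id)
    );
    """
)

conn.execute("INSERT INTO artists VALUES (1, 'Artist X')")
conn.execute("INSERT INTO albums VALUES (10, 'Album A', 1)")

# Artist 99 does not exist, so the foreign key constraint should reject this row.
try:
    conn.execute("INSERT INTO albums VALUES (11, 'Album B', 99)")
    constraint_enforced = False
except sqlite3.IntegrityError:
    constraint_enforced = True

# The schema itself can be inspected through SQLite's catalog table.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(constraint_enforced, tables)
```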
Finally, there are several implementations of SQL, like SQLite, MySQL, PostgreSQL,
Oracle SQL, and SQL Server. How they differ is outside the scope of this course, but they are
pretty similar. Switching from one to the other is like switching from a QWERTY
keyboard to an AZERTY one, or switching from British English to American English. A
few things change, but most things stay the same.
Summary
You now understand why SQL is the language of reference for RDBMS, how data
engineers and data scientists use it differently, can give an example of a database
schema, and can cite several SQL implementations.
Remember the data pipelines lesson at the end of the previous lecture? We quickly
mentioned data lakes. Throughout the course we also mentioned databases several times,
and we mentioned data warehouses. So what are these, and what is the difference?
A data warehouse is a system that stores highly structured information from various
sources. Data warehouses typically store current and historical data from one or more
systems. Some examples of data warehouses are Amazon Redshift, Google BigQuery,
Snowflake, and IBM Db2 Warehouse.
A data lake is a repository of data from disparate sources that is stored in its original,
raw format. Like data warehouses, data lakes store large amounts of current and
historical data. What sets data lakes apart is their ability to store data in a variety of
formats, including JSON, BSON, CSV, and TSV.
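To give a feel for what "a variety of formats" means in practice, here is a minimal sketch that reads one raw JSON record and one raw CSV record and brings them into a common shape. The field names and values are invented, and the raw data is held in strings rather than in real files to keep the example self-contained.

```python
import csv
import io
import json

# Hypothetical raw data as it might sit in a data lake, in two formats.
raw_json = '{"user": "u1", "song": "Song A", "seconds": 215}'
raw_csv = "user,song,seconds\nu2,Song B,180\n"

# JSON keeps its types, so the record can be used directly.
records = [json.loads(raw_json)]

# CSV values arrive as strings, so numeric fields must be converted.
for row in csv.DictReader(io.StringIO(raw_csv)):
    row["seconds"] = int(row["seconds"])
    records.append(dict(row))

total_seconds = sum(record["seconds"] for record in records)
print(total_seconds)
```

A data warehouse would expect this kind of cleanup to happen before the data is loaded, whereas a data lake stores the raw inputs and leaves the conversion for later.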
Summary
All right! Now you know the characteristics of data lakes, data warehouses and
databases, how they differ, and why a data catalog is useful and necessary.