
2 Data Engineering (Storing Data)

Uploaded by

fatimamaryam882
Understanding Data Engineering

Storing Data

Course Instructor: Anam Shahid

Source:
https://campus.datacamp.com/courses/understanding-data-engineering/storing-data?ex=9
Storing Data
Let's continue our exploration of the world of data engineering. This lecture focuses on
storage. In this lesson, we're going to learn more about how data is structured.

1. Structured data
Structured data is easy to search and organize. Data is entered following a rigid
structure, like a spreadsheet with set columns. Each column takes values of
a certain type, like text, date, or decimal. This makes it easy to form relations, which is
why it's organized in what is called a relational database. About 20% of all data is structured.
SQL, which stands for Structured Query Language, is used to query such data.

Employee table (Example of structured data):


Here is an example of structured data. This is an extract of Spotflix's employee table.
The table is well-organized and easy to read. You can see it follows a model: each row
represents an employee and each column holds a specific piece of information about that
employee (team, role). Each column needs to be of a certain type. The index is a number, and
acts as a unique ID, because two employees may have the same first name, last name, or
both. The penultimate column holds logical values: values can only be true or false. For
example, Rick Sanchez is part-time. The rest of the columns are text.

Table 1: Employee Table

Relational database
Because it's structured, we can easily relate this table to other structured data. For
example, if there's another table holding information about offices, we can connect the two
on the office column. Tables that can be connected that way form a relational database.
Table 2: Office Table

Figure 1 Connections of both above tables
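
To make the idea of connecting tables concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names and the office column follow the lesson; the address column and every row are made-up placeholders, not Spotflix's actual data.

```python
import sqlite3

# In-memory database. Table and column names follow the lesson,
# but the "address" column and all rows are invented placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offices (
        office  VARCHAR(255) PRIMARY KEY,
        address VARCHAR(255)
    );
    CREATE TABLE employees (
        employee_id INT PRIMARY KEY,
        first_name  VARCHAR(255),
        office      VARCHAR(255)
    );
    INSERT INTO offices VALUES ('Office A', '1 Example Street'),
                               ('Office B', '2 Sample Avenue');
    INSERT INTO employees VALUES (1, 'Alex', 'Office A'),
                                 (2, 'Sam',  'Office B'),
                                 (3, 'Kim',  'Office A');
""")

# Connect the two tables on the shared "office" column.
rows = conn.execute("""
    SELECT e.first_name, o.address
    FROM employees AS e
    JOIN offices AS o ON e.office = o.office
    ORDER BY e.employee_id
""").fetchall()

for first_name, address in rows:
    print(first_name, "works at", address)
```

Each employee row picks up the address of its office without that address being repeated in the employees table, which is the practical benefit of relating tables.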

2. Semi-structured data
Semi-structured data resembles structured data, but allows more freedom. It's still
relatively easy to organize, because it follows a model, but that model is more flexible. It
also has different types and can be grouped to form relations, although this is not as
straightforward as with structured data: you have to pay for that flexibility at some point.
Semi-structured data is typically stored in NoSQL databases (as opposed to SQL) and usually
leverages the JSON or XML file formats.

Favorite artists JSON file (Example of semi-structured data):


Here is an example of a JSON (JavaScript Object Notation) file storing the favorite
artists of each Spotflix user. As you can see, the model is consistent: each user ID
contains the user's last and first name, and their favorite artists. However, the number of
favorite artists may differ: I have four, Sara has two and Lis has three favorite artists.
A single row in a relational table doesn't allow that kind of flexibility, but
semi-structured formats let you do it.
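
As a rough sketch of what such a file could look like, the snippet below parses a small JSON document with Python's json module. Only the shape (a user ID, the names, and a list of artists of varying length) follows the lesson; the user IDs, last names and artist names are invented, and the first entry assumes the narrator is called Julian.

```python
import json

# Placeholder document: IDs, last names and artist names are invented.
# The list lengths (4, 2, 3) match the counts given in the lesson.
raw = """
{
  "user_1": {"last_name": "Doe", "first_name": "Julian",
             "favorite_artists": ["Artist A", "Artist B", "Artist C", "Artist D"]},
  "user_2": {"last_name": "Roe", "first_name": "Sara",
             "favorite_artists": ["Artist E", "Artist F"]},
  "user_3": {"last_name": "Poe", "first_name": "Lis",
             "favorite_artists": ["Artist G", "Artist H", "Artist I"]}
}
"""

users = json.loads(raw)

# Every user has the same keys, but the list of artists varies in length.
counts = {user["first_name"]: len(user["favorite_artists"]) for user in users.values()}
print(counts)
```
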
3. Unstructured data
Unstructured data is data that does not follow a model and can't be contained in a
row-and-column format. This makes it difficult to search and organize. It's usually text,
sound, pictures or videos. It's usually stored in data lakes, although it can also appear in
data warehouses or databases. Most of the data around us is unstructured.
Unstructured data can be extremely valuable, but because it's hard to search and
organize, this value could not be extracted until recently, with the advent of machine
learning and artificial intelligence.

Examples of Unstructured data:

At Spotflix, unstructured data consists of:

1. Lyrics
2. Songs
3. Pictures: album pictures and artists' profile pictures
4. Videos: music videos, etc.

Adding some structure
At Spotflix, we could use machine learning algorithms to parse song spectrums and analyze
beats per minute, chord progressions and genres, which would help categorize songs. Or, we
could have artists give additional information when they upload their songs. Having them add
the genre, and some tags, would make it semi-structured data, and would make
searching and organizing easier.
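
The second option can be illustrated with a small sketch, assuming each uploaded audio file comes with a genre and a few tags. The titles, file names, genres and tags below are hypothetical, as is the find_by_genre helper.

```python
# Hypothetical catalog: the audio files themselves stay unstructured,
# but the metadata attached at upload time can be searched.
songs = [
    {"title": "Song A", "audio_file": "song_a.mp3", "genre": "rock", "tags": ["live", "guitar"]},
    {"title": "Song B", "audio_file": "song_b.mp3", "genre": "jazz", "tags": ["instrumental"]},
    {"title": "Song C", "audio_file": "song_c.mp3", "genre": "rock", "tags": []},
]


def find_by_genre(catalog, genre):
    """Return the titles of all songs whose metadata lists the given genre."""
    return [song["title"] for song in catalog if song["genre"] == genre]


rock_titles = find_by_genre(songs, "rock")
print(rock_titles)
```

Without the genre field, answering the same question would require analyzing the audio itself.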

Summary

All right, now you know what is characteristic of structured data, semi-structured data
and unstructured data, the differences between the three, and you're able to give
examples for each of them.

Q. Can you correctly define structured, semi-structured and unstructured data?

SQL databases
We've mentioned SQL several times by now, so how about we spend a bit more time on
this language that is so fundamental in data engineering?

SQL
SQL stands for Structured Query Language. SQL is to databases what English is to pop
music. It's the preferred language to query an RDBMS (Relational Database
Management System), basically a system that gathers several tables like the Employees
table from the previous lesson, where tables are related to each other. More on that
in a moment. SQL has two main advantages: it allows you to access many records at
once, and to group, filter or aggregate them. Most programming languages let you do that,
but SQL was the first, which is why it's been so influential. It's a little bit like the Beatles
and pop music. It's also very close to English, which makes it easy to write and
understand. As you already know, data engineers use SQL to create, maintain and
update databases, while data scientists use SQL to query, filter, group and aggregate
data in the tables of databases.

Example (Remember the employees table)


We're not going to learn SQL in this course. However, looking at some examples will
help your understanding. Let's look at a data engineering example first: creating a table.
Take a moment to refresh your memory of Spotflix's employee table. Remember that the
first column holds non-decimal numbers, the penultimate one stores logical values,
and the others hold text.

SQL for data engineers


We can create such a table using SQL. We type the command CREATE TABLE, and
declare the name of the table, "employees". Then we proceed to create the first column,
employee_id, and specify the type of data expected, integers, which means this column
will only accept whole numbers, without any decimals. We then create the second
column, first_name, and specify it should be text (VARCHAR stands for "variable
characters"). The 255 here means that the value entered can't be more
than 255 characters long. We do the same for last_name, role and
team. We declare full_time as a Boolean, which is the type for logical values. This
column can only hold zero for false or one for true. Office is declared as VARCHAR as
well because it's text. Data engineers then run other statements to update the table and
write records into it.
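
The statement described above can be sketched as follows. This is a reconstruction from the description, not the original slide, and it runs against an in-memory SQLite database through Python's sqlite3 module, which is an assumption (the lesson does not name a specific system). Rick Sanchez being part-time comes from the lesson; the other values in the inserted row are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names and types follow the description in the lesson.
conn.execute("""
    CREATE TABLE employees (
        employee_id INT,
        first_name  VARCHAR(255),
        last_name   VARCHAR(255),
        role        VARCHAR(255),
        team        VARCHAR(255),
        full_time   BOOLEAN,
        office      VARCHAR(255)
    )
""")

# Writing a record: only the name and the part-time status come from the lesson.
conn.execute(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, "Rick", "Sanchez", "Example role", "Example team", False, "Example office"),
)

# List the column names the table ended up with.
columns = [row[1] for row in conn.execute("PRAGMA table_info(employees)")]
print(columns)
```

Note that SQLite accepts the VARCHAR(255) declaration but does not enforce the length limit; other systems such as MySQL or PostgreSQL do.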
SQL for data scientists
Data scientists will then use SQL to query data in the tables. For example, if Julian
wants to get the first and last name of all the employees whose role title contains the
keyword "Data", he can SELECT the first and last name, FROM the employees table,
WHERE the role title contains "Data". The percentage signs on each side of "Data"
mean "Data" can appear anywhere in the role title.
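
The query described above can be sketched like this, again using an in-memory SQLite database as an assumption. The three employees are invented placeholders chosen so that two roles contain "Data" and one does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (first_name VARCHAR(255), last_name VARCHAR(255), role VARCHAR(255))"
)

# Placeholder rows, not Spotflix's actual employees.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Alex", "Doe", "Data Engineer"),
        ("Sam", "Roe", "Senior Data Scientist"),
        ("Kim", "Poe", "Accountant"),
    ],
)

# The % signs let "Data" appear anywhere in the role title.
matches = conn.execute(
    "SELECT first_name, last_name FROM employees "
    "WHERE role LIKE '%Data%' ORDER BY first_name"
).fetchall()
print(matches)
```
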

Database schema
So far, we've looked at tables individually, but databases are made of many tables. The
database schema governs how tables are related. A database schema is the skeleton
structure that represents the logical view of the entire database. It defines how the data
is organized and how the tables relate to one another. It also specifies all the
constraints that are to be applied to the data.
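
To show what a constraint in a schema can look like, here is a minimal sketch in which the office column of employees must refer to an existing row in offices. The two-table layout echoes the earlier employee and office tables; the exact columns, the rows and the choice of SQLite are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    CREATE TABLE offices (
        office VARCHAR(255) PRIMARY KEY
    );
    CREATE TABLE employees (
        employee_id INT PRIMARY KEY,
        first_name  VARCHAR(255) NOT NULL,
        office      VARCHAR(255) REFERENCES offices(office)
    );
    INSERT INTO offices VALUES ('Office A');
    INSERT INTO employees VALUES (1, 'Alex', 'Office A');
""")

# 'Office Z' does not exist in offices, so the schema rejects this row.
try:
    conn.execute("INSERT INTO employees VALUES (2, 'Sam', 'Office Z')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("rejected:", rejected)
```

The schema, not the application code, is what prevents an employee from being assigned to an office that does not exist.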

Example:

Finally, there are several implementations of SQL, like SQLite, MySQL, PostgreSQL,
Oracle SQL and SQL Server. How they differ is outside the scope of this course, but they are
pretty similar. Switching from one to the other is like switching from a QWERTY
keyboard to an AZERTY one, or switching from British English to American English. A
few things change, but most things stay the same.
Summary
You now understand why SQL is the language of reference for RDBMS and how data
engineers and data scientists use it differently. You can also give an example of a
database schema and cite several SQL implementations.

Data warehouses and Data lakes


Now it's time to clarify some concepts.

Warehouses with stunning view on the lake

Remember the data pipelines lesson at the end of the previous lecture? We quickly
mentioned data lakes. Throughout the course we also mentioned databases several times,
as well as data warehouses. So what are these and what is the difference?

First, let's look at our data pipeline again.

1. Data lakes and data warehouses


As the data pipeline graph shows, the data lake is where all the collected raw data gets
stored, just as it was uploaded from the different sources. It's unprocessed and messy.
While the data lake stores all the data, the data warehouse stores specific data for a
specific use. For example, users and their subscription type, or all the listening sessions
for behavioral analysis. For this reason, a data lake can take petabytes of data, but
warehouses are usually pretty small (small on the scale of big data, I mean). A warehouse
can still be way bigger than your external hard drive.

A data lake can store any kind of data, whether it's structured, semi-structured or
unstructured. This means that it does not enforce any model on the way the data is stored,
which makes it cost-effective. Data warehouses enforce a structured format, which makes
them more costly to manipulate. However, the data lake's lack of structure also means its
contents are very difficult to analyze. Some big data analytics using deep learning can be
implemented to discover hidden patterns and trends, but that's about it, and it should
probably be a last resort. The data warehouse, on the other hand, is optimized for
analytics to drive business decisions.

Because no model is enforced in data lakes and any structure can be stored, it is necessary
to keep a data catalog up to date. Data lakes are used by data scientists for real-time
analytics on big data, while data warehouses are used by analysts for ad-hoc (a Latin phrase
describing something created especially for a particular occasion), read-only queries like
aggregation and summarization.
Alternatively, the three can be defined as follows:

A database stores the current data required to power an application.

A data warehouse is a system that stores highly structured information from various
sources. Data warehouses typically store current and historical data from one or more
systems. Some examples of data warehouses are Amazon Redshift, Google BigQuery,
Snowflake and IBM Db2 Warehouse.

A data lake is a repository of data from disparate sources that is stored in its original,
raw format. Like data warehouses, data lakes store large amounts of current and
historical data. What sets data lakes apart is their ability to store data in a variety of
formats, including JSON, BSON, CSV and TSV.

2. Data catalog for data lakes


A data catalog is a source of truth that compensates for the lack of structure in a data
lake. Among other things, it keeps track of where the data comes from, how it is used,
who is responsible for maintaining it, and how often it gets updated. It's good practice in
terms of data governance (managing the availability, usability, integrity and security of
the data), and it guarantees the reproducibility of the processes in case anything
unexpected happens, or if someone wants to reproduce an analysis from the very
beginning, starting with the ingestion of the data. Because of the very flexible way data
lakes store data, a data catalog is necessary to prevent the data lake from becoming a data
swamp. It's good practice to have a data catalog referencing any data that moves
through your organization, so that we don't have to rely on tribal knowledge. This
makes us autonomous, and makes working with the data more scalable. We can go
from finding data to preparing it without having to rely on a human source of information
every time we have a question.
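
As a rough illustration, a catalog entry could record the four things listed above: where the data comes from, how it is used, who maintains it, and how often it is updated. The CatalogEntry class, its field names and every value below are hypothetical; real data catalogs are usually dedicated tools rather than a Python dictionary.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """One hypothetical catalog record describing a dataset in the lake."""
    dataset: str
    source: str            # where the data comes from
    used_for: str          # how it is used
    owner: str             # who is responsible for maintaining it
    update_frequency: str  # how often it gets updated


# Placeholder entries; none of these values come from the lesson.
catalog = {
    "listening_sessions": CatalogEntry(
        dataset="listening_sessions",
        source="mobile and desktop apps",
        used_for="behavioral analysis",
        owner="data engineering team",
        update_frequency="hourly",
    ),
    "album_pictures": CatalogEntry(
        dataset="album_pictures",
        source="artist uploads",
        used_for="album pages",
        owner="content team",
        update_frequency="on upload",
    ),
}

# Finding out who to ask no longer depends on tribal knowledge.
entry = catalog["listening_sessions"]
print(entry.owner, "maintains", entry.dataset, "which is updated", entry.update_frequency)
```
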
3. Database vs. data warehouse
Let's take a step back. We've used the term database several times. Where does it fit
in? Database is a very general term that can be loosely defined as organized data
stored and accessed on a computer. A data warehouse is one type
of database.

Summary
All right! Now you know the characteristics of data lakes, data warehouses and
databases, how they differ, and why a data catalog is useful and necessary.
