2 Data Engineering (Storing Data)

Uploaded by

fatimamaryam882

Understanding Data Engineering

Storing Data

Course Instructor: Anam Shahid

Source:
https://campus.datacamp.com/courses/understanding-data-engineering/storing-data?ex=9
Storing Data
Let's continue our exploration of the world of data engineering. This lecture focuses on
storage. In this lesson, we're going to learn more about how data is structured.

1. Structured data
Structured data is easy to search and organize. Data is entered following a rigid
structure, like a spreadsheet with set columns. Each column takes values of
a certain type, like text, date, or decimal. This makes it easy to form relations, which is
why it's organized in what is called a relational database. About 20% of all data is
structured. SQL, which stands for Structured Query Language, is used to query such data.

Employee table (Example of structured data):

Here is an example of structured data. This is an extract of Spotflix's employee table.
The table is well-organized and easy to read. It follows a model: each row
represents an employee and each column holds a specific piece of information about that
employee (team, role). Each column needs to be of a certain type. The index is a number and
acts as a unique ID, because two employees may have the same first name, last name, or
both. The penultimate column holds logical values: values can only be true or false. For
example, Rick Sanchez is part-time. The rest of the columns are text.

Table 1: Employee Table

Relational database
Because it's structured, we can easily relate this table to other structured data. For
example, if there's another table holding information about offices, we can connect the
two on the office column. Tables that can be connected that way form a relational database.

Table 2: Office Table

Figure 1: Connections between the two tables above
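
To make the idea of connecting tables on a shared column concrete, here is a minimal sketch using Python's built-in sqlite3 module. The rows, the address column and the helper name employees_with_office are assumptions made up for illustration; the real Spotflix tables are not reproduced in these notes.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offices (office TEXT PRIMARY KEY, address TEXT);
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        first_name  TEXT,
        office      TEXT
    );
    INSERT INTO offices VALUES
        ('Office A', '1 Example Street'),
        ('Office B', '2 Sample Avenue');
    INSERT INTO employees VALUES
        (1, 'Jane', 'Office A'),
        (2, 'John', 'Office B'),
        (3, 'Mary', 'Office A');
""")


def employees_with_office(connection):
    """Relate each employee to the office row sharing the same office value."""
    query = """
        SELECT e.first_name, o.office, o.address
        FROM employees AS e
        JOIN offices AS o ON e.office = o.office
        ORDER BY e.employee_id
    """
    return connection.execute(query).fetchall()


rows = employees_with_office(conn)
for row in rows:
    print(row)
```

Each printed row combines columns from both tables, which is only possible because the two tables share the office column.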

2. Semi-structured data
Semi-structured data resembles structured data but allows more freedom. It's
relatively easy to organize and fairly structured. Values also have
different types and can be grouped to form relations, although this is not as
straightforward as with structured data: you have to pay for that flexibility at some point.
Semi-structured data is usually stored in NoSQL databases (as opposed to SQL) and usually
leverages the JSON or XML file formats.

Favorite artists JSON file (Example of semi-structured data):

Here is an example of a JSON (JavaScript Object Notation) file storing the favorite
artists of each Spotflix user. As you can see, the model is consistent: each user ID
contains the user's last and first name, and their favorite artists. However, the number of
favorite artists may differ: I have four, Sara has two and Lis has three favorite artists.
A relational table with a fixed set of columns doesn't allow that kind of flexibility, but
semi-structured formats let you do it.
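
Since the original JSON file is not reproduced in these notes, here is a small sketch of what such a document could look like, parsed with Python's built-in json module. The user IDs, the key names (last_name, first_name, favorite_artists) and the artist names are assumptions for illustration only.

```python
import json

# Hypothetical document: same keys for every user, but lists of different lengths.
raw = """
{
  "user_1": {"last_name": "Doe", "first_name": "Sam",
             "favorite_artists": ["Artist A", "Artist B", "Artist C", "Artist D"]},
  "user_2": {"last_name": "Roe", "first_name": "Sara",
             "favorite_artists": ["Artist A", "Artist E"]},
  "user_3": {"last_name": "Poe", "first_name": "Lis",
             "favorite_artists": ["Artist B", "Artist F", "Artist G"]}
}
"""

users = json.loads(raw)

# Count the favorite artists per user; the counts differ from user to user.
counts = {user_id: len(user["favorite_artists"]) for user_id, user in users.items()}
print(counts)
```

The model stays consistent (every user has the same three keys), while the length of the artist list is free to vary.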
3. Unstructured data
Unstructured data is data that does not follow a model and can't be contained in a
row-and-column format. This makes it difficult to search and organize. It's usually text,
sound, pictures or videos. It's usually stored in data lakes, although it can also appear in
data warehouses or databases. Most of the data around us is unstructured.
Unstructured data can be extremely valuable, but because it's hard to search and
organize, this value could not be extracted until recently, with the advent of machine
learning and artificial intelligence.

Examples of unstructured data:

At Spotflix, unstructured data consists of:

1. Lyrics
2. Songs
3. Pictures: album pictures and artist profile pictures
4. Videos: music videos, etc.

Adding some structure
At Spotflix, we could use machine learning algorithms to parse song spectrums and analyze
beats per minute, chord progressions and genres to help categorize songs. Or we could
have artists give additional information when they upload their songs. Having them add
the genre and some tags would make it semi-structured data, and would make
searching and organizing easier.

Summary

All right, now you know what is characteristic of structured data, semi-structured data
and unstructured data, the differences between the three, and you're able to give
examples for each of them.

Q. Can you correctly define structured, semi-structured and unstructured data?

SQL databases
We've mentioned SQL several times by now, so how about we spend a bit more time on
this language that is so fundamental in data engineering?

SQL
SQL stands for Structured Query Language. SQL is to databases what English is to pop
music. It's the preferred language to query an RDBMS, or Relational Database
Management System: basically a system that gathers several tables, like the Employees
table from the previous lesson, where the tables are related to each other. More on that
in a moment. SQL has two main advantages: it allows you to access many records at
once, and to group, filter or aggregate them. Most programming languages let you do that,
but SQL was one of the first, which is why it's been so influential. It's a little bit like
the Beatles and pop music. It's also very close to English, which makes it easy to write and
understand. As you already know, data engineers use SQL to create, maintain and
update databases, while data scientists use SQL to query, filter, group and aggregate
data in the tables of databases.

Example (Remember the employees table)

We're not going to learn SQL in this course. However, looking at some examples will
help your understanding. Let's look at a data engineering example first: creating a table.
Take a moment to refresh your memory of Spotflix's employee table. Remember that the
first column holds whole numbers, the penultimate one stores logical values,
and the others hold text.

SQL for data engineers

We can create such a table using SQL. We type the command CREATE TABLE and
declare the name of the table, "employees". Then we create the first column,
employee_id, and specify the type of data expected: integers, which means this column
will only accept whole numbers, without any decimals. We then create the second
column, first_name, and specify it should be text (VARCHAR stands for "variable
characters"). The 255 here means that the value entered can't be more
than 255 characters long. We do the same for last name, role and
team. We declare full_time as a Boolean, which is the type for logical values. This
column can only hold zero for false or one for true. Office is declared as VARCHAR as
well because it's text. Data engineers then run other statements to update the table and
write records into it.
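
The statement described above can be sketched as follows. This is an approximation reconstructed from the description, not the original slide: the column names last_name, role, team and office, the INT keyword and the inserted row are assumptions. It is run through Python's built-in sqlite3 module so it can be tried without installing a database server.

```python
import sqlite3

# CREATE TABLE statement reconstructed from the description above.
CREATE_EMPLOYEES = """
CREATE TABLE employees (
    employee_id INT,
    first_name  VARCHAR(255),
    last_name   VARCHAR(255),
    role        VARCHAR(255),
    team        VARCHAR(255),
    full_time   BOOLEAN,
    office      VARCHAR(255)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(CREATE_EMPLOYEES)

# A data engineer would then write records into the table (hypothetical row).
conn.execute(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, "Jane", "Doe", "Data Engineer", "Data", False, "Office A"),
)

# Read back the column names and the stored logical value.
columns = [row[1] for row in conn.execute("PRAGMA table_info(employees)")]
stored_full_time = conn.execute("SELECT full_time FROM employees").fetchone()[0]
print(columns)
print(stored_full_time)
```

One caveat about this sketch: SQLite accepts the VARCHAR(255) and BOOLEAN declarations but does not enforce the length limit, and it stores the logical value as the integer 0 or 1. Other SQL implementations are stricter about declared types.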
SQL for data scientists
Data scientists will then use SQL to query data in the tables. For example, if Julian
wants to get the first and last name of all the employees whose role title contains the
keyword "Data", he can SELECT the first and last name FROM the employees table
WHERE the role title contains "Data". The percentage signs on each side of "Data"
mean "Data" can appear anywhere in the role title.
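
The query described above can be sketched like this, again through Python's built-in sqlite3 module. The sample employees and the exact column names are assumptions for illustration; only the SELECT / FROM / WHERE ... LIKE pattern comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INT, first_name VARCHAR(255), "
    "last_name VARCHAR(255), role VARCHAR(255))"
)
# Hypothetical rows: two roles contain "Data", one does not.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Jane", "Doe", "Data Engineer"),
        (2, "John", "Roe", "Senior Data Scientist"),
        (3, "Mary", "Major", "Accountant"),
    ],
)

# % on both sides lets "Data" appear anywhere in the role title.
matches = conn.execute(
    """
    SELECT first_name, last_name
    FROM employees
    WHERE role LIKE '%Data%'
    ORDER BY employee_id
    """
).fetchall()
print(matches)
```

Note that in SQLite, LIKE ignores case for ASCII letters by default, so '%data%' would match the same rows; other implementations may be case-sensitive.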

Database schema
So far, we've looked at tables individually, but databases are made of many tables. The
database schema governs how tables are related. A database schema is the skeleton
structure that represents the logical view of the entire database. It defines how the data
is organized and how the relations among the tables are defined. It also specifies all the
constraints that are to be applied to the data.
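
As a rough illustration of a schema that defines both relations and constraints, here is a sketch of a two-table schema run through Python's built-in sqlite3 module. The table layout, the constraint choices and the sample values are assumptions, not the schema shown in the course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: each employee must point at an existing office.
conn.executescript("""
    CREATE TABLE offices (
        office  VARCHAR(255) PRIMARY KEY,
        address VARCHAR(255) NOT NULL
    );
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        first_name  VARCHAR(255) NOT NULL,
        office      VARCHAR(255) REFERENCES offices (office)
    );
    INSERT INTO offices VALUES ('Office A', '1 Example Street');
    INSERT INTO employees VALUES (1, 'Jane', 'Office A');
""")

# Inserting an employee whose office does not exist violates the schema.
try:
    conn.execute("INSERT INTO employees VALUES (2, 'John', 'Office Z')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The REFERENCES clause is what ties the two tables together, while PRIMARY KEY and NOT NULL are examples of constraints the schema places on individual columns.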

Example:

Finally, there are several implementations of SQL, such as SQLite, MySQL, PostgreSQL,
Oracle SQL and SQL Server. How they differ is outside the scope of this course, but they are
pretty similar. Switching from one to the other is like switching from a QWERTY
keyboard to an AZERTY one, or switching from British English to American English. A
few things change, but most things stay the same.
Summary
You now understand why SQL is the language of reference for RDBMS and how data
engineers and data scientists use it differently. You can also give an example of a database
schema and cite several SQL implementations.

Data warehouses and Data lakes

Now it's time to clarify some concepts.

Warehouses with a stunning view of the lake

Remember the data pipelines lesson at the end of the previous lecture? We quickly
mentioned data lakes. Throughout the course we have also mentioned databases several
times, as well as data warehouses. So what are these, and what is the difference?

First, let's look at our data pipeline again.

1. Data lakes and data warehouses

As the data pipeline graph shows, the data lake is where all the collected raw data gets
stored, just as it was uploaded from the different sources. It's unprocessed and messy.
While the data lake stores all the data, the data warehouse stores specific data for a
specific use: for example, users and their subscription type, or all the listening sessions
for behavioral analysis. For this reason, a data lake can hold petabytes of data, while
warehouses are usually pretty small, at least on the scale of big data. A warehouse can
still be way bigger than your external hard drive.

A data lake can store any kind of data, whether it's structured, semi-structured or
unstructured. This means that it does not enforce any model on the way the data is stored,
which makes it cost-effective. Data warehouses enforce a structured format, which makes
them more costly to manipulate. However, the data lake's lack of structure also means its
contents are very difficult to analyze. Some big data analytics using deep learning can be
implemented to discover hidden patterns and trends, but that's about it, and it should
probably be a last resort. The data warehouse, on the other hand, is optimized for
analytics to drive business decisions.

Because no model is enforced in data lakes and any structure can be stored, it is
necessary to keep a data catalog up to date. Data lakes are used by data scientists for
real-time analytics on big data, while data warehouses are used by analysts for ad-hoc
(a Latin phrase describing something created especially for a particular occasion),
read-only queries like aggregation and summarization.
Alternatively, the three terms can be defined as follows:

A database stores the current data required to power an application.

A data warehouse is a system that stores highly structured information from various
sources. Data warehouses typically store current and historical data from one or more
systems. Some examples of data warehouses are Amazon Redshift, Google BigQuery,
Snowflake and IBM Db2 Warehouse.

A data lake is a repository of data from disparate sources that is stored in its original,
raw format. Like data warehouses, data lakes store large amounts of current and
historical data. What sets data lakes apart is their ability to store data in a variety of
formats, including JSON, BSON, CSV and TSV.

2. Data catalog for data lakes

A data catalog is a source of truth that compensates for the lack of structure in a data
lake. Among other things, it keeps track of where the data comes from, how it is used,
who is responsible for maintaining it, and how often it gets updated. It's good practice in
terms of data governance (managing the availability, usability, integrity and security of
the data), and it helps guarantee the reproducibility of processes in case anything
unexpected happens, or if someone wants to reproduce an analysis from the very
beginning, starting with the ingestion of the data. Because of the very flexible way data
lakes store data, a data catalog is necessary to prevent the data lake from becoming a data
swamp. It's good practice to have a data catalog referencing any data that moves
through your organization, so that we don't have to rely on tribal knowledge. This
makes us autonomous and makes working with the data more scalable. We can go
from finding data to preparing it without having to rely on a human source of information
every time we have a question.
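
To give a feel for what a catalog keeps track of, here is a toy sketch in Python. It is not how real data catalog tools work; the CatalogEntry fields simply mirror the four items listed above, and the dataset name, owner and other values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One catalog record; the fields mirror what a catalog keeps track of."""
    dataset: str            # name of the dataset in the data lake
    source: str             # where the data comes from
    used_for: str           # how the data is used
    owner: str              # who is responsible for maintaining it
    update_frequency: str   # how often it gets updated


# Toy in-memory catalog keyed by dataset name.
catalog = {}


def register(entry):
    """Add or replace the catalog record for a dataset."""
    catalog[entry.dataset] = entry


def who_owns(dataset):
    """Answer 'who maintains this data?' without asking a colleague."""
    return catalog[dataset].owner


# Hypothetical entry for raw listening sessions stored in the lake.
register(CatalogEntry(
    dataset="listening_sessions_raw",
    source="application event logs",
    used_for="behavioral analysis",
    owner="data engineering team",
    update_frequency="hourly",
))

print(who_owns("listening_sessions_raw"))
```

Looking up the owner in the catalog replaces the "tribal knowledge" step of finding the right person to ask.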
3. Database vs. data warehouse
Let's take a step back. We've used the term database several times. Where does it fit
in? Database is a very general term that can be loosely defined as organized data
stored and accessed on a computer. A data warehouse is a type of database.

Summary
All right! Now you know the characteristics of data lakes, data warehouses and
databases, how they differ, and why a data catalog is useful and necessary.
