Microsoft Azure Data Fundamentals

Uploaded by

Dang Nhat

MICROSOFT AZURE DATA FUNDAMENTALS

Dang Nhat Microsoft Azure Data Fundamentals

TABLE OF CONTENTS
1. INTRODUCTION TO DATA ENGINEERING

1. CORE DATA CONCEPTS

1.1. Data formats
1.1.1. Structured data
Structured data is data that adheres to a fixed schema, so all of the data has the
same fields or properties. Most commonly, the schema for structured data entities
is tabular - in other words, the data is represented in one or more tables that consist
of rows to represent each instance of a data entity, and columns to represent
attributes of the entity.
Structured data is often stored in a database in which multiple tables can reference
one another by using key values in a relational model.
1.1.2. Semi-structured data
Semi-structured data is information that has some structure, but which allows for
some variation between entity instances. For example, while most customers may
have an email address, some might have multiple email addresses, and some might
have none at all.
One common format for semi-structured data is JavaScript Object Notation (JSON).
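As a minimal sketch of that variation, the snippet below parses two hypothetical
customer records in which one customer has several email addresses and the other
has none. The field names and values are invented for illustration and do not come
from any particular system.

```python
import json

# Hypothetical customer records; field names are illustrative only.
raw = """
[
  {"id": 1, "name": "Ana", "emails": ["ana@example.com", "ana@work.example.com"]},
  {"id": 2, "name": "Ben"}
]
"""

customers = json.loads(raw)

# Entities share some structure, but not every field is present on every one,
# so absent fields are read with a default.
email_counts = {c["name"]: len(c.get("emails", [])) for c in customers}
print(email_counts)
```

Reading optional fields with a default is one common way to cope with records that
do not all carry the same attributes.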
1.1.3. Unstructured data
Not all data is structured or even semi-structured. For example, documents,
images, audio and video data, and binary files might not have a specific structure.
This kind of data is referred to as unstructured data.
1.1.4. Data stores
Organizations typically store data in structured, semi-structured, or unstructured
format to record details of entities (for example, customers and products), specific
events (such as sales transactions), or other information in documents, images, and
other formats. The stored data can then be retrieved for analysis and reporting
later.
There are two broad categories of data store in common use:
• File stores
• Databases
1.2. File storage
The ability to store data in files is a core element of any computing system. Files can
be stored in local file systems on the hard disk of your personal computer, and on
removable media such as USB drives; but in most organizations, important data files
are stored centrally in some kind of shared file storage system. Increasingly, that
central storage location is hosted in the cloud, enabling cost-effective, secure, and
reliable storage for large volumes of data.
The specific file format used to store data depends on a number of factors, including:
• The type of data being stored (structured, semi-structured, or unstructured).
• The applications and services that will need to read, write, and process the data.
• The need for the data files to be readable by humans, or optimized for efficient
storage and processing.
1.2.1. Delimited text files
Data is often stored in plain text format with specific field delimiters and row
terminators. The most common format for delimited data is comma-separated
values (CSV) in which fields are separated by commas, and rows are terminated by
a carriage return / new line. Optionally, the first line may include the field names.
Other common formats include tab-separated values (TSV) and space-delimited (in
which tabs or spaces are used to separate fields), and fixed-width data in which
each field is allocated a fixed number of characters.
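To illustrate, here is a small sketch that reads comma-separated text whose first
line holds the field names, using Python's standard csv module. The sample data is
hypothetical.

```python
import csv
import io

# Hypothetical CSV content: the first line holds the field names,
# fields are separated by commas, and rows end with a newline.
text = "id,name,price\n1,Widget,2.50\n2,Gadget,10.00\n"

rows = list(csv.DictReader(io.StringIO(text)))

# Every value is read as a string; convert types explicitly where needed.
total = sum(float(r["price"]) for r in rows)
print(rows[0]["name"], total)
```

The same reader should handle tab-separated data if delimiter="\t" is passed to
csv.DictReader.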
1.2.2. JavaScript Object Notation (JSON)
JSON is a ubiquitous format in which a hierarchical document schema is used to
define data entities (objects) that have multiple attributes. Each attribute might be
an object (or a collection of objects), making JSON a flexible format that's good
for both structured and semi-structured data.
1.2.3. Extensible Markup Language (XML)
XML is a human-readable data format that was popular in the 1990s and 2000s. It's
largely been superseded by the less verbose JSON format, but there are still some
systems that use XML to represent data. XML uses tags enclosed in angle brackets
(<.../>) to define elements and attributes.
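As a rough sketch of elements and attributes, the snippet below parses a small
hypothetical XML document with Python's standard xml.etree.ElementTree module. The
tag and attribute names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: each <customer> element carries an "id" attribute
# and a nested <name> element.
doc = """
<customers>
  <customer id="1"><name>Ana</name></customer>
  <customer id="2"><name>Ben</name></customer>
</customers>
"""

root = ET.fromstring(doc)

# Elements are found by tag name; attributes are read with .get().
names_by_id = {c.get("id"): c.find("name").text for c in root.findall("customer")}
print(names_by_id)
```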
1.2.4. Binary Large Object (BLOB)
Ultimately, all files are stored as binary data (1s and 0s), but in the
human-readable formats discussed above, the bytes of binary data are mapped to
printable characters (typically through a character encoding scheme such as ASCII
or Unicode). Some file formats, however, particularly for unstructured data, store
the data as raw binary that must be interpreted by applications and rendered.
Common types of data stored as binary include images, video, audio, and
application-specific documents.
When working with data like this, data professionals often refer to the data files as
BLOBs (Binary Large Objects).
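The difference between encoded text and raw binary can be sketched in a few lines
of Python. The text sample is hypothetical; the binary sample is the 8-byte
signature that begins every PNG file, used here only as a familiar example of bytes
that an application has to interpret.

```python
# Text formats map bytes to printable characters through an encoding.
text = "id,name\n1,Ana\n"
encoded = text.encode("utf-8")
round_trips = encoded.decode("utf-8") == text

# Raw binary has no such mapping; an application must know the layout.
# These are the 8 bytes that begin every PNG file.
blob = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])
is_png = blob.startswith(b"\x89PNG\r\n\x1a\n")

print(len(encoded), round_trips, is_png)
```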
1.2.5. Optimized file formats
While human-readable formats for structured and semi-structured data can be
useful, they're typically not optimized for storage space or processing. Over time,
some specialized file formats that enable compression, indexing, and efficient
storage and processing have been developed.
Some common optimized file formats you might see include Avro, ORC, and
Parquet:
• Avro is a row-based format created by Apache. Each Avro file contains a header
that describes the structure of the records it holds. This header is stored as
JSON, and the records themselves are stored as binary information. An
application uses the information in the header to parse the binary data and
extract the fields it contains. Avro is a good format for compressing data and
minimizing storage and network bandwidth requirements.
• ORC (Optimized Row Columnar format) organizes data into columns rather
than rows. It was developed by Hortonworks for optimizing read and write
operations in Apache Hive (Hive is a data warehouse system that supports fast
data summarization and querying over large datasets). An ORC file contains
stripes of data. Each stripe holds a group of rows, with the values stored
column by column. A stripe contains an index into the rows in the stripe, the
data for each row, and a footer that holds statistical information (count, sum,
max, min, and so on) for each column.
• Parquet is another columnar data format. It was created by Cloudera and
Twitter. A Parquet file contains row groups. Data for each column is stored
together in the same row group. Each row group contains one or more chunks of
data. A Parquet file includes metadata that describes the set of rows found in
each chunk. An application can use this metadata to quickly locate the correct
chunk for a given set of rows, and retrieve the data in the specified columns for
these rows. Parquet specializes in storing and processing nested data types
efficiently. It supports very efficient compression and encoding schemes.
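The row-based versus columnar distinction can be sketched with plain Python
structures. This is only an illustration of the two layouts, assuming three
hypothetical records; it does not reproduce the actual Avro, ORC, or Parquet
encodings.

```python
# Three hypothetical records with the same fields.
records = [
    {"id": 1, "city": "Hanoi", "amount": 10.0},
    {"id": 2, "city": "Hue", "amount": 4.5},
    {"id": 3, "city": "Hanoi", "amount": 7.25},
]

# Row-based layout (the idea behind Avro): each record's values sit together.
row_layout = [tuple(r.values()) for r in records]

# Columnar layout (the idea behind ORC and Parquet): each column's values sit together.
column_layout = {key: [r[key] for r in records] for key in records[0]}

# An aggregate over one column touches only that column's values,
# and per-column statistics such as min and max can be kept alongside it.
amounts = column_layout["amount"]
stats = {"count": len(amounts), "sum": sum(amounts), "min": min(amounts), "max": max(amounts)}
print(stats)
```

Keeping a column's values together is also what tends to make compression
effective, since neighbouring values have the same type and often repeat.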
1.3. Databases
1.3.1. Relational databases
Relational databases are commonly used to store and query structured data. The
data is stored in tables that represent entities, such as customers, products, or sales
orders. Each instance of an entity is assigned a primary key that uniquely identifies
it, and these keys are used to reference the entity instance in other tables. For
example, a customer's primary key can be referenced in a sales order record to
indicate which customer placed the order. This use of keys to reference data
entities enables a relational database to be normalized, which in part means the
elimination of duplicate data values so that, for example, the details of an
individual customer are stored only once, not for each sales order the customer
places. The tables are managed and queried using Structured Query Language
(SQL), which is based on an ANSI standard, so it's similar across multiple
database systems.
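The customer and sales order example can be sketched with SQLite, chosen here only
because it ships with Python; the source does not name a specific database product.
The table names, column names, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE sales_order (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    total       REAL NOT NULL
);
INSERT INTO customer (id, name) VALUES (1, 'Ana');
INSERT INTO sales_order (id, customer_id, total) VALUES (100, 1, 25.0), (101, 1, 40.0);
""")

# The customer's name is stored once; each order refers to it by key.
rows = conn.execute("""
    SELECT c.name, COUNT(*), SUM(o.total)
    FROM sales_order AS o
    JOIN customer AS c ON c.id = o.customer_id
    GROUP BY c.name
""").fetchall()
print(rows)
conn.close()
```

Because the orders hold only the customer's key, changing the customer's name would
mean updating a single row in the customer table.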
1.3.2. Non-relational databases
Non-relational databases are data management systems that don't apply a
relational schema to the data. Non-relational databases are often referred to as
NoSQL databases, even though some support a variant of the SQL language.
There are four types of non-relational database in common use:
• Key-value databases, in which each record consists of a unique key and an
associated value, which can be in any format.
• Document databases, which are a specific form of key-value database in which
the value is a JSON document (which the system is optimized to parse and
query).
• Column family databases, which store tabular data comprising rows and
columns, but in which you can divide the columns into groups known as column
families. Each column family holds a set of columns that are logically related.
• Graph databases, which store entities as nodes with links to define relationships
between them.
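The four models can be sketched with plain in-memory Python structures. This is an
illustration of how the data is shaped in each model, not the API of any real
database; all keys, names, and values are hypothetical.

```python
import json

# Key-value: the store treats the value as opaque; here it is raw bytes.
key_value = {"session:42": b"\x01\x02\x03"}

# Document: the value is a JSON document that can be parsed and queried by field.
documents = {"cust:1": json.dumps({"name": "Ana", "emails": ["ana@example.com"]})}
ana_emails = json.loads(documents["cust:1"])["emails"]

# Column family: each row groups logically related columns under a family name.
column_family = {
    "cust:1": {
        "identity": {"name": "Ana"},
        "contact": {"email": "ana@example.com", "phone": "555-0100"},
    }
}
contact = column_family["cust:1"]["contact"]

# Graph: entities are nodes, and edges define relationships between them.
nodes = {"ana": "Employee", "ben": "Employee", "sales": "Department"}
edges = [("ana", "WORKS_IN", "sales"), ("ben", "REPORTS_TO", "ana")]
reports_to_ana = [src for src, rel, dst in edges if rel == "REPORTS_TO" and dst == "ana"]

print(ana_emails, contact["email"], reports_to_ana)
```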
1.4. Transactional data processing
