Ultimate Azure Data Engineering: Build Robust Data Engineering Systems on Azure with SQL, ETL, Data Modeling, and Power BI for Business Insights and Crack Azure Certifications (English Edition)
About this ebook
Book Description: Embark on a comprehensive journey into Azure data engineering with “Ultimate Azure Data Engineering”. Starting with foundational topics like SQL and relational database concepts, you'll progress to comparing data engineering practices in Azure versus on-premises environments. Next, you will dive deep into Azure cloud fundamentals, learning how to effectively manage heterogeneous data sources and implement robust Extract, Transform, Load (ETL) concepts using Azure Data Factory, and mastering the orchestration of data workflows and pipeline automation.
The book then moves on to explore advanced database design strategies and best practices for optimizing data performance and ensuring stringent data security. You will learn to visualize data insights using Power BI and apply these skills to real-world scenarios.
Whether you're aiming to excel in your current role or preparing for Azure data engineering certifications, this book equips you with practical knowledge and hands-on expertise to thrive in the dynamic field of Azure data engineering.
Table of Contents
1. Introduction to Data Engineering
2. Understanding SQL and RDBMS Concepts
3. Data Engineering: Azure Versus On-Premises
4. Azure Cloud Concepts
5. Working with Heterogeneous Data Sources
6. ETL Concepts
7. Database Design and Modeling
8. Performance Best Practices and Data Security
9. Data Visualization and Application in Real World
10. Data Engineering Certification Guide
Index
Book preview
Ultimate Azure Data Engineering - Ashish Agarwal
CHAPTER 1
Introduction to Data Engineering
Introduction
This chapter introduces you to the fundamental concepts, techniques, and tools required for a solid understanding of data engineering. You will learn about the modern data ecosystem and the roles that data engineers, data analysts, and data scientists play. The data engineering ecosystem is made up of various components: many data types, formats, and data sources, along with the sequences of data processing steps, commonly called data pipelines, that transform raw data into analytics-ready data. Data pipelines can be batch or streaming, depending on the frequency and latency of data ingestion and processing. They enable data engineers to automate data workflows, ensure data quality and reliability, and deliver timely and accurate data: they collect data from a variety of sources, turn it into analytics-ready data, and make it available to data consumers for analysis and decision-making. This data is processed and stored in data repositories such as relational databases, non-relational databases, data warehouses, data marts, data lakes, and big data stores. Data integration platforms bring together diverse data sources to create a single perspective for the user.
Structure
In this chapter, we will discuss the following topics:
Basic Concepts of Data Engineering
Difference Between Data Engineering, Data Analysis, and Data Science
Data Engineering
Data Analyst
Data Scientist
Modern Data Ecosystem
Source Systems, Formats, and Data Types
Basics of ETL/ELT Concepts
Extract
Transform
Load
Relational and Non-relational Databases
Data Warehouse and Data Marts
Data Lake, Big Data Store, Lakehouse, and Delta Lake
Basic Concepts of Data Engineering
Data engineering refers to building systems that collect data from a variety of source systems and then make it useful for analysis and decision-making after appropriate cleaning, validation, and transformation. Most of the time, this data is used to support further analysis and data science, which frequently involve data processing, cleansing, validation, transformation, and machine learning, and typically require significant computing and storage to make the data usable.
Data engineers collaborate closely with teams of business intelligence engineers, data scientists, and data analysts.
Figure 1.1: Data Engineering Architecture Overview
The preceding figure illustrates what a typical data engineering project’s overall architecture looks like. The very first step is understanding the source systems from which we expect the data to flow into the system.
The complexity, accessibility, and availability of the source systems play a crucial role while designing the data pull or extraction process of the raw data, which will eventually be used for Data Ingestion and further processing.
We will be discussing the various types of data sources and systems in depth in the later chapters of this book.
Difference Between Data Engineering, Data Analysis, and Data Science
When it comes to defining various job roles related to the data world in any small, medium, or large enterprise, there are basically three major categories to consider: data engineers, data analysts, and data scientists.
At a high level, these roles sound very similar to each other and are often considered the same in the data community, especially by aspiring engineers looking to pursue a career in the data space.
In this section, we will learn about the key aspects of each of the roles and what makes them different from each other.
Data Engineering
Data engineers develop and optimize the systems that enable data scientists and analysts to do their jobs.
Every business relies on its data being dependable and easy for users to access. The data engineer makes certain that all data is appropriately received, converted, stored, and made available to other users.
Roles and responsibilities
Data engineers lay the groundwork for data analysts and scientists to build upon. To manage data at a very large scale, data engineers frequently employ sophisticated tools and approaches while building data pipelines. Data engineering has a far stronger emphasis on software development skills than the other two job pathways.
In larger organizations, data engineers may concentrate on using data tools, maintaining databases, or building and managing data pipelines. Regardless of the specific role, a skilled data engineer allows data scientists or analysts to concentrate on finding analytical solutions rather than transferring data from one source to another.
A data engineer’s mindset frequently places greater emphasis on building and optimizing. Examples of the kinds of projects a data engineer could be working on include:
Construction of APIs for data consumption
Integrating fresh or external datasets into pipelines of data already in use
Applying feature modifications to machine learning models using fresh data
Continually testing and monitoring the system to guarantee optimum functioning
Data Analyst
Data analysts add value to their organizations by gathering data, analyzing it to find answers to problems, and conveying the findings to assist management in making choices. Data cleansing, analysis, and data visualization are frequent tasks carried out by data analysts.
The term data analyst may be used differently depending on the industry (for example, "business analyst," "business intelligence analyst," "operations analyst," or "database analyst"). Regardless of title, the data analyst is a generalist who can integrate into a variety of roles and teams to support others in making better data-driven decisions.
A traditional company might become data-driven with the help of a data analyst. Their main duty is to help others keep track of their progress and stay focused on what matters most.
Roles and responsibilities
How can a marketer utilize analytics data to aid in the rollout of their next campaign? How can a salesperson choose the right demographics to target? How can a CEO comprehend the fundamental causes of current business growth? The data analyst responds to each of these queries by doing analysis and presenting the findings.
In the larger area of data, data analyst roles are frequently "entry-level" occupations, but not all analysts are at this level. Data analysts are essential for businesses that separate technical and business functions because they are skilled communicators who are also knowledgeable about technological tools.
A skilled data analyst will remove uncertainty from business choices and contribute to the success of the entire organization. By merging several reports, analyzing fresh data, and translating the results, the data analyst acts as a useful link between various teams. This, in turn, enables the organization to keep an accurate pulse on its expansion.
The precise abilities needed will vary based on the needs of the firm; however, the following are some typical tasks:
Preparing and organizing data
Using descriptive statistics to gain a broader perspective on the data
Examining intriguing trends in the data
Making dashboards and visualizations that help the business interpret data and make decisions
Delivering the findings of technical analysis to internal or external teams or commercial clients
Both the technical and non-technical aspects of an organization benefit greatly from the work of the data analyst. The analyst promotes stronger team connections by conducting exploratory analysis or describing executive dashboards.
Data Scientist
A data scientist is an expert who uses their knowledge of statistics and available machine learning algorithms to develop machine learning models, make predictions, and provide crucial business insights.
Similar to a data analyst, a data scientist still needs to be able to clean, analyze, and visualize data. A data scientist can also train and improve machine learning models and will have greater depth and competence in these areas.
Roles and responsibilities
A data scientist is someone who can add a great deal of value by addressing more complicated and open-ended problems, making use of their expertise in cutting-edge statistics and algorithms. The scientist concentrates on making accurate forecasts for the future, while the analyst concentrates on comprehending facts from both past and present viewpoints.
By applying both supervised (such as classification and regression) and unsupervised learning (such as clustering, neural networks, and anomaly detection) techniques to their machine learning models, the data scientist will be able to unearth hidden insights. They essentially develop mathematical models that will enable them to recognize trends and make precise forecasts more effectively.
Examples of work done by data scientists include the following:
Assessing statistical models to assess the reliability of the analyses
Creating more accurate forecasting algorithms with machine learning
Testing and ongoing improvement of machine learning model accuracy
Creating data visualizations to highlight the findings of sophisticated analysis
Data scientists approach and view the world from a completely new viewpoint. The data scientist will pose new queries and develop models to make predictions based on fresh data, whereas an analyst may describe trends and interpret those findings in business terms.
Modern Data Ecosystem
A data ecosystem is a collection of business applications and infrastructure that is used to gather and analyze data. It allows businesses to develop improved marketing, pricing, and operational strategies by helping them better understand their consumers.
Data engineers, data analysts, and data scientists all play a part in the modern data ecosystem. The data engineering ecosystem consists of a variety of parts: many data sources, formats, and types. Data pipelines collect information from many sources, turn it into data that is suitable for analysis, and then make it accessible to data consumers for analysis and decision-making.
This data is processed and stored in data repositories such as relational databases, non-relational databases, data warehouses, data marts, data lakes, and big data stores. For the benefit of data consumers, data integration platforms aggregate several types of data into a single perspective. Building data platforms, creating data stores, and collecting, importing, wrangling, querying, and analyzing data are all parts of a typical data engineering lifecycle.
It also includes data governance, compliance, security, monitoring, and tuning system performance to ensure the system operates optimally.
The rapid evolution of technology has led to many heterogeneous data formats, categorized into two major types: structured and unstructured data. These appear in many forms, for example, textual data, images, video streams, chats and conversation logs, real-time events, social media feeds, legacy system outputs, and many more.
Given so many diverse and continuously evolving data sources, we need a robust data engineering system to turn this data into insights that enterprises can use for effective decision-making.
Source Systems, Formats, and Data Types
When it comes to source systems in modern times, we have not only a wide variety of them but also great disparity in their formats and in the way they store, manage, and transfer data and extract or export it to downstream systems.
The flexibility of the source systems in the modern data ecosystem to produce or generate a variety of data formats requires a solid data engineering architecture to be in place for a seamless and easy-to-manage process.
In this section, you will learn about the best practices and standards followed to manage this complexity.
Source Systems
The phrase source systems should not be applied arbitrarily to some systems and not others. When we refer to source systems, we mean the data sources that feed a particular data warehouse; that is our starting point when using the phrase. Many businesses follow the Common Data Model, also referred to as CDM, and their systems are mostly interconnected so that they may also serve as source systems for one another.
However, when we discuss data-generating systems, we may differentiate between those that produce new data and those that do not. A cash register is an example of a system that generates data because, as it scans items, it creates new data. These records then tell the shop which products are leaving the store, when, and at what price. The business could delete the data from the register once the day is done, the customers have left, and the register is balanced, but we often do not want to, because this data can be utilized for many other purposes.
The data-generating system becomes a source system for one or more data repositories, such as a data warehouse, data lake, data mart, or Delta Lake, when we decide to preserve the data. Based on the information in the data warehouse, we may carry out a wide range of analyses and business initiatives (for example, inventory management, supply chain management, earnings analyses, multi-purchase analyses, and more).
Several instances of sources that provide data include:
Geospatial Data: Combined with an app user’s location, this information can lay the groundwork for several new services: we can notify the user that one of our cafés is within driving distance and offer a special discount along with that message.
HR Systems: This is the information that comes from employee management programs, which hold data for the overall organization across the entire employee lifecycle: hiring, onboarding, attendance management, payroll, transfers, and termination. Examples of such systems are Workday, Taleo, Tally, and more.
Hospital Management Systems (HMS): These are software applications that handle the administrative, clinical, and financial functions of hospitals and other healthcare facilities. They can store and process data such as patient records, medical histories, prescriptions, lab reports, billing, insurance, inventory, and more. Examples of such systems are Epic, Cerner, Meditech, and others. By analyzing this data, hospitals can enhance the quality of care, lower costs, increase efficiency, and comply with regulations.
Reminder Programs: These programs remind clients when they don’t pay their bills on time. By analyzing this data, we may perform credit scoring and handle cases according to the payment histories of the customers.
Banking and Financial Systems: These are software applications that deal with the main functions of banks and other financial institutions, such as deposit accounts, loans, investments, payments, transfers, and more. They can store and process data, such as customer information, transaction records, balances, interest rates, fees, and others. Systems, such as Oracle FLEXCUBE, Temenos T24, Finacle, and others, are a few examples. Banks and financial institutions can use this data to better their products and services, manage risks, follow regulations, and increase customer satisfaction and loyalty.
Telemetry, Monitoring, and Security Systems: These are software applications that track and examine data from devices or systems that are hard or impossible to reach, such as satellites, aircraft, vehicles, power plants, and more. They can report data like location, speed, temperature, pressure, fuel level, performance, faults, and so on. Systems, such as LabVIEW, PRTG Network Monitor, SolarWinds, and more, are a few examples. This data can help telemetry, monitoring, and security systems improve operations, avoid and fix failures, ensure safety, and give feedback and control.
Wearable Devices or Machines: These are small gadgets that track our physical activity, heart rate, SpO2, walking, running, or sleep activity, and accordingly notify us by analyzing this data over a certain period.
Internet of Things (IoT): Now, a growing number of gadgets can send sensor data. This information focuses on how the devices — which might be anything from hearing aids to televisions — are used. The information can then be applied to new product development or service enhancements.
CRM Programs: These systems store call and conversation histories from clients. This is essential consumer data that may be used to examine complaint behavior and determine what the company needs to improve. Additionally, it might reveal which clients use a lot of service resources and are consequently less valuable. It serves as an input for improving customer management procedures.
ERP Systems: This data comprises accounting management systems, which record financial transactions for the organization using accounting forms. If we wish to reveal correlations between initiatives and whether outcomes were as anticipated, it can be tied to KPI data.
Billing Software Systems: These systems print invoices to specific clients. We may do segmentations based on behavior, values, and other criteria by looking at this data.
Data on Social Media: This information may be used to gauge the mood of both individuals and groups. It may be quite helpful for employees who manage corporate social relationships since it will provide information on how the public and important influencers see the organization. This type of market surveillance may be initiated by analyzing the positive and negative terms that are linked with a company using text mining.
Wikipedia: Such knowledge bases can help intelligent bots that conduct customer interactions comprehend complex relationships. For instance, references to a person’s name may be connected to a particular product, place, company, or historical event.
Source Data Formats
The various source systems that we talked about in the preceding section have their own way of sending data to the downstream systems, and this is called data push from the source systems.
Another way of taking data from the preceding source systems is the data pull using any ETL/ELT tools.
Most source systems are designed so that the data they hold is structured, and hence even the extracted data is in a structured format; still, some source systems may output unstructured data.
Unstructured data is typically an unorganized form of data with no predefined format or structure attached to it, which makes it harder to read, process, or analyze. Examples of unstructured data are images, audio files, voice notes, video files, and binary large objects (BLOBs), which mostly capture the qualitative aspects of the data.
Unlike unstructured data, structured data is much more organized, consistent, easy to store and search, and quantitative in manner.
In the modern data ecosystem, we also have the concept of semi-structured data, which is a mix of both structured and unstructured data. A good example of semi-structured data is an image captured on a digital device that also stores details about the image, such as date and time, location, and people tags.
So, in the preceding example, the photo itself becomes unstructured data, while the details related to the photo or what we can call metadata of the photo can be stored as structured data in the form of a table.
Another example is an online survey form in which we fill out details such as name, city, age, and address, and are also asked to upload documents as proof, such as PAN cards or other identity proofs.
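As a minimal sketch of how such a mix might be stored, assuming SQL Server 2016 or later and hypothetical table and column names, the structured metadata can live in typed columns, the free-form details in a JSON document, and the photo itself in a binary column:

-- Structured columns hold the photo's metadata; a JSON column holds
-- semi-structured details; a binary column holds the unstructured image.
CREATE TABLE dbo.Photos (
    PhotoId    INT IDENTITY(1,1) PRIMARY KEY,
    CapturedAt DATETIME2      NOT NULL,      -- structured: date and time
    City       NVARCHAR(100)  NULL,          -- structured: location
    Details    NVARCHAR(MAX)  NULL           -- semi-structured: JSON document
        CONSTRAINT CK_Photos_Details CHECK (ISJSON(Details) = 1),
    ImageBlob  VARBINARY(MAX) NULL           -- unstructured: the photo itself
);

-- JSON_VALUE extracts individual fields from the semi-structured part.
SELECT PhotoId, CapturedAt,
       JSON_VALUE(Details, '$.camera.model') AS CameraModel
FROM   dbo.Photos;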
The following are some examples of various source data formats:
There could be source database systems, such as mainframes, Oracle, Sybase, Teradata, or SQL Server, which store data in a structured format.
There could be applications that send data in the form of files, for example, CSV files, text files, Excel files, XML files, JSON files, and more (a minimal loading sketch follows this list).
There could be systems that hold data in the form of images, videos, PDF files, and more.
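As a hedged sketch of ingesting one such file-based source, the following T-SQL loads a CSV file into a staging table with BULK INSERT; the file path, table name, and file layout are hypothetical, and the staging table is assumed to already exist with columns matching the file:

-- Load a comma-separated file into an existing staging table.
BULK INSERT stg.SalesRaw
FROM 'C:\data\sales_2024.csv'
WITH (
    FIRSTROW        = 2,      -- skip the header row
    FIELDTERMINATOR = ',',    -- columns separated by commas
    ROWTERMINATOR   = '\n'    -- one record per line
);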
Before designing any data engineering solution, it is crucial to understand the source systems: the schema of the data they output, how frequently that data is changed or updated, and how frequently new data arrives in the system or gets archived away from it. Only then can we design a robust architecture.
Data Types
Data types determine the sorts of data that can be stored in database objects such as tables. Every table is made up of columns, and each column has a name and a data type; both are specified when the table is created. The data type tells the database what to anticipate in each column and establishes the potential types of interactions. Use the int data type, for instance, if you want a column to contain only integers.
SQL contains several data types, each having its own importance and usability as per the needs of the data to be stored. Defining the right data types is equally important for effective memory management and data storage (considering space usage and requirements) and, most importantly, for the usability of the data when implementing business rules and statistical functions to analyze and process the data and build business metrics for decision-making.
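To make this concrete, here is a small illustrative table (hypothetical names, SQL Server syntax) in which each column's data type matches the values it is meant to hold:

CREATE TABLE dbo.Employees (
    EmployeeId INT            NOT NULL PRIMARY KEY, -- whole numbers only
    FullName   NVARCHAR(200)  NOT NULL,             -- variable-length Unicode text
    HireDate   DATE           NOT NULL,             -- calendar date, no time part
    Salary     DECIMAL(10, 2) NOT NULL,             -- exact numeric, two decimal places
    IsActive   BIT            NOT NULL DEFAULT 1    -- flag stored as 0 or 1
);

An attempt to insert, say, text into EmployeeId would be rejected outright, which is exactly the kind of early validation that well-chosen data types provide.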
We will be discussing in detail each data type in SQL Server and other major database systems used across the industries, along with the implementation scenarios in the later chapters of the book.
Basics of ETL/ELT Concepts
Data is extracted, converted (cleaned, sanitized, and scrubbed), and then loaded into an output data container during the three-step extract, transform, and load (ETL) process. It is possible to combine data from one or more sources and output it to one or more locations. ETL processing is normally carried out by software programs, although system administrators can also perform it manually. ETL software often automates the entire procedure and can be executed manually, automatically, as a batch of tasks, or on a recurring basis.
An ETL system that has been appropriately built takes data from source systems, enforces data type and data validity criteria, and ensures the data is structurally compliant with the output requirements. For application developers to create applications and end users to make decisions, certain ETL systems may also supply data in a presentation-ready format.
An ETL process is one of the most crucial and integral parts of any data engineering system architecture.
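As a minimal sketch of the three steps in plain T-SQL, with all table and column names hypothetical (a production pipeline would typically be orchestrated by a tool such as Azure Data Factory):

-- Extract: pull raw rows from the source into a staging table.
SELECT OrderId, CustomerId, OrderAmount, OrderDate
INTO   stg.Orders
FROM   src.Orders;

-- Transform: clean and validate the staged data.
UPDATE stg.Orders
SET    OrderAmount = 0
WHERE  OrderAmount IS NULL;            -- replace missing amounts

DELETE FROM stg.Orders
WHERE  OrderDate > GETDATE();          -- drop rows with impossible future dates

-- Load: move the analytics-ready rows into the target table.
INSERT INTO dw.FactOrders (OrderId, CustomerId, OrderAmount, OrderDate)
SELECT OrderId, CustomerId, OrderAmount, OrderDate
FROM   stg.Orders;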
Alternatively, another way to integrate data is extract, load, and transform (ELT), where the data is first moved to the destination and then transformed there. This approach leverages the computing power and storage space of the target system to speed up the data transfer and store multiple versions of the raw data. The main difference between ETL and ELT is the order of the steps and the location of the data transformation: ETL transforms the data before it reaches the destination, within the ETL tool or on a separate server, while ELT transforms the data after moving it to the destination.
Moreover, ETL often needs a more precise definition of the data models and schemas beforehand, while ELT allows for more agility and adaptability. The decision between ETL and ELT depends on several factors, such as the type and volume of data sources, business needs, available resources, and analytics objectives.
In the modern data ecosystem, ELT is especially suitable for cloud-based data warehouses and data lakes.
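As a hedged sketch of the ELT pattern, assume the raw rows have already been landed as-is in a table dbo.RawOrders inside an Azure Synapse dedicated SQL pool (names hypothetical); the transformation then runs inside the destination using its own compute, for example with CREATE TABLE AS SELECT:

-- Transform inside the destination: build a cleaned table from the raw one.
CREATE TABLE dbo.CleanOrders
WITH (DISTRIBUTION = HASH(CustomerId))           -- Synapse requires a distribution choice
AS
SELECT OrderId,
       CustomerId,
       COALESCE(OrderAmount, 0) AS OrderAmount,  -- replace missing amounts
       CAST(OrderDate AS DATE)  AS OrderDate
FROM   dbo.RawOrders
WHERE  OrderDate <= GETDATE();                   -- drop impossible future dates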