Data Ingestion with Python
Cookbook
Gláucia Esppenchutz
BIRMINGHAM—MUMBAI
Data Ingestion with Python Cookbook
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, without the prior written permission of the publisher, except in the case
of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information
presented. However, the information contained in this book is sold without warranty, either express
or implied. Neither the author nor Packt Publishing, nor its dealers and distributors, will be held liable
for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and
products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot
guarantee the accuracy of this information.
ISBN 978-1-83763-260-2
www.packtpub.com
This book means a lot to me and wouldn't have been possible without my loving husband, Lincoln, and his support and understanding during this challenging endeavor. I want to thank all my friends who didn't let me give up and always boosted my spirits, along with my grandmother, who always believed in me, helped me, and said I would do big things one day. Finally, I want to thank my beloved four-pawed best friend, Minduim, now at peace, for “helping” me to write this book.
– Gláucia Esppenchutz
Contributors
I want to thank my patient and beloved husband and my friends. Thanks also to my mentors in the
Python open source community and the DataBootCamp founders, who guided me at the beginning
of my journey.
Thanks to the Packt team, who helped me through some hard times; you were terrific!
About the reviewers
Bitthal Khaitan is currently working as a big data and cloud engineer with CVS Health, a Fortune 4 organization. He has a demonstrated history of working in the cloud, data, and analytics industry for over 12 years. His primary certified skills are Google Cloud Platform (GCP), the big data ecosystem (Hadoop, Spark, and so on), and data warehousing on Teradata. He has worked in all phases of the SDLC of DW/BI and big data projects, with strong expertise in the US healthcare, insurance, and retail domains. He actively helps new graduates with mentoring, resume reviews, and job-hunting tips in the data engineering domain. Over 20,000 people follow Bitthal on LinkedIn. He is currently based in Dallas, Texas, USA.
Jagjeet Makhija is a highly accomplished technology leader with over 20 years of experience. They are not only skilled in various domains, including AI, data warehouse architecture, and business analytics, but also have a strong passion for staying ahead of technology trends such as AI and ChatGPT.
Jagjeet is recognized for their significant contributions to the industry, particularly in complex proof
of concepts and integrating Microsoft products with ChatGPT. They are also an avid book reviewer
and have actively shared their extensive knowledge and expertise through presentations, blog articles,
and online forums.
Krishnan Raghavan is an IT professional with over 20 years of experience in the area of software development and delivery excellence across multiple domains and technologies, ranging from C++ to Java, Python, data warehousing, and big data tools and technologies. Krishnan tries to give back to the community by being part of the GDG – Pune volunteer group, helping the team organize events. When not working, Krishnan likes to spend time with his wife and daughter, as well as reading fiction, non-fiction, and technical books. Currently, he is unsuccessfully trying to learn how to play the guitar. You can connect with Krishnan at krishnan@gmail.com or via LinkedIn: www.linkedin.com/in/krishnan-raghavan.
I would like to thank my wife, Anita, and daughter, Ananya, for giving me the time and space to
review this book.
Table of Contents

Preface

Chapter 1, Introduction to Data Ingestion
Technical requirements
Installing Python
Installing PySpark
Configuring Docker for MongoDB
Configuring Docker for Airflow
Creating schemas

Chapter 2, Principles of Data Access – Accessing Your Data
Technical requirements
Implementing governance in a data access workflow
Accessing databases and data warehouses
Accessing SSH File Transfer Protocol (SFTP) files
Retrieving data using API authentication
Managing encrypted files
Accessing data from AWS using S3
Accessing data from GCP using Cloud Storage
Further reading

Chapter 3, Data Discovery – Understanding Our Data before Ingesting It
Technical requirements
Documenting the data discovery process
Configuring OpenMetadata

Chapter 4, Reading CSV and JSON Files and Solving Problems
Technical requirements
Reading a CSV file
Reading a JSON file
Creating a SparkSession for PySpark
Using PySpark to read CSV files
Using PySpark to read JSON files
Further reading

Chapter 5, Ingesting Data from Structured and Unstructured Databases
Technical requirements
Configuring a JDBC connection

Chapter 6, Using PySpark with Defined and Non-Defined Schemas
Technical requirements
Applying schemas to data ingestion

Chapter 7, Ingesting Analytical Data
Technical requirements
Ingesting Parquet files
Ingesting Avro files
Applying schemas to analytical data
Filtering data and handling common issues
Ingesting partitioned data
Applying reverse ETL
Selecting analytical data for reverse ETL

Chapter 8, Designing Monitored Data Workflows

Chapter 9, Putting Everything Together with Airflow
Technical requirements
Installing Airflow
Configuring Airflow
Creating DAGs
Creating custom operators
Configuring sensors
Creating connectors in Airflow

Chapter 10, Logging and Monitoring Your Data Ingest in Airflow
Technical requirements
Installing and running Airflow
Creating basic logs in Airflow
Storing log files in a remote location
Configuring logs in airflow.cfg
Designing advanced monitoring
Using notification operators
Using SQL operators for data quality
Further reading

Chapter 11, Automating Your Data Ingestion Pipelines
Technical requirements
Installing and running Airflow
Scheduling daily ingestions

Chapter 12, Using Data Observability for Debugging, Error Handling, and Preventing Downtime
Technical requirements
Docker images
Setting up StatsD for monitoring
Setting up Prometheus for storing metrics
Creating an observability dashboard
Setting custom alerts or notifications

Index

Each recipe follows the standard subsections described in the Preface: Getting ready, How to do it…, How it works…, There's more…, and See also.
Preface

What this book covers

Chapter 2, Principles of Data Access – Accessing Your Data, explores data access concepts related to data
governance, covering workflows and management of familiar sources such as SFTP servers, APIs,
and cloud providers. It also provides examples of creating data access policies in databases, data
warehouses, and the cloud.
Chapter 3, Data Discovery – Understanding Our Data Before Ingesting It, teaches you the significance of
carrying out the data discovery process before data ingestion. It covers manual discovery, documentation,
and using an open-source tool, OpenMetadata, for local configuration.
Chapter 4, Reading CSV and JSON Files and Solving Problems, introduces you to ingesting CSV and JSON
files using Python and PySpark. It demonstrates handling varying data volumes and infrastructures
while addressing common challenges and providing solutions.
Chapter 5, Ingesting Data from Structured and Unstructured Databases, covers fundamental concepts
of relational and non-relational databases, including everyday use cases. You will learn how to read
and handle data from these models, understand vital considerations, and troubleshoot potential errors.
Chapter 6, Using PySpark with Defined and Non-Defined Schemas, delves deeper into common PySpark
use cases, focusing on handling defined and non-defined schemas. It also explores reading and
understanding complex logs from Spark (PySpark core) and formatting techniques for easier debugging.
Chapter 7, Ingesting Analytical Data, introduces you to analytical data and common formats for reading
and writing. It explores reading partitioned data for improved performance and discusses Reverse
ETL theory with real-life application workflows and diagrams.
Chapter 8, Designing Monitored Data Workflows, covers logging best practices for data ingestion, facilitating error identification and debugging. Techniques such as monitoring file size, row count, and object count feed improved dashboards, alerts, and insights.
Chapter 9, Putting Everything Together with Airflow, consolidates the previously presented information
and guides you in building a real-life data ingestion application using Airflow. It covers essential
components, configuration, and issue resolution in the process.
Chapter 10, Logging and Monitoring Your Data Ingest in Airflow, explores advanced logging and
monitoring in data ingestion with Airflow. It covers creating custom operators, setting up notifications,
and monitoring for data anomalies. Configuration of notifications for tools such as Slack is also covered
to stay updated on the data ingestion process.
Chapter 11, Automating Your Data Ingestion Pipelines, focuses on automating data ingests using
previously learned best practices, enabling reader autonomy. It addresses common challenges with
schedulers or orchestration tools and provides solutions to avoid problems in production clusters.
Chapter 12, Using Data Observability for Debugging, Error Handling, and Preventing Downtime,
explores data observability concepts, popular monitoring tools such as Grafana, and best practices
for log storage and data lineage. It also covers creating visualization graphs to monitor data source
issues using Airflow configuration and data ingestion scripts.
For almost all recipes in this book, you can use a Jupyter notebook to execute the code. Even though it is not mandatory to install it, this tool can help you test the code and experiment with it, thanks to its friendly interface.
If you are using the digital version of this book, we advise you to type the code yourself or access
the code via the GitHub repository (link available in the next section). Doing so will help you
avoid any potential errors related to the copying and pasting of code.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file
extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Then
we proceeded with the with open statement.”
Any command-line input or output is written as follows:
$ python3 --version
Python 3.8.10
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words
in menus or dialog boxes appear in the text like this. Here is an example: “Then, when we selected
showString at NativeMethodAccessorImpl.java:0, which redirected us to the
Stages page.”
Sections
In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How
it works..., There’s more..., and See also).
To give clear instructions on how to complete a recipe, we use these sections as follows:
Getting ready
This section tells you what to expect in the recipe and describes how to set up any software or any
preliminary settings required for the recipe.
How to do it…
This section contains the steps required to follow the recipe.
How it works…
This section usually consists of a detailed explanation of what happened in the previous section.
There’s more…
This section consists of additional information about the recipe in order to make you more knowledgeable
about the recipe.
See also
This section provides helpful links to other useful information for the recipe.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the
subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would
be grateful if you would provide us with the location address or website name. Please contact us at
[email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you
are interested in either writing or contributing to a book, please visit authors.packtpub.com.
You can download a free PDF copy of this book at https://packt.link/free-ebook/9781837632602.
In this part, you will be introduced to the fundamentals of data ingestion and data engineering, covering the basic definition of an ingestion pipeline, the common types of data sources, and the technologies involved. This part begins with Chapter 1, Introduction to Data Ingestion.

Introduction to Data Ingestion
Technical requirements
The commands inside the recipes of this chapter use Linux syntax. If you don't use a Linux-based system, you may need to adapt the commands.
You can find the code from this chapter in this GitHub repository: https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook.
Note
Windows users might get an error message such as Docker Desktop requires a newer WSL kernel version. This can be fixed by following the steps here: https://docs.docker.com/desktop/windows/wsl/.
Getting ready
Let’s create a folder for our project:
1. First, open your system command line. Since I use the Windows Subsystem for Linux (WSL),
I will open the WSL application.
2. Go to your home directory and create a folder as follows:
$ mkdir my-project
3. Next, check whether Python is already installed by running the version command:
$ python --version
Depending on your operating system, you might or might not have output here – for example, WSL 20.04 users might have the following output:
Command 'python' not found, did you mean:
  command 'python3' from deb python3
  command 'python' from deb python-is-python3
4. If your Python path is configured to use the python command, you will see output similar to this:
Python 3.9.0
Sometimes, your Python path might be configured to be invoked using python3. You can try it using the following command:
$ python3 --version
5. Now, let’s check our pip version. This check is essential, since some operating systems have
more than one Python version installed:
$ pip --version
If your operating system (OS) uses a Python version below 3.8 or doesn't have the language installed, proceed to the How to do it… steps; otherwise, you are ready to start the following Installing PySpark recipe.
How to do it…
We are going to use the official installer from Python.org. You can find the link for it here: https://www.python.org/downloads/:
1. Download the installer version compatible with your OS from the downloads page.
Note
For Windows users, it is important to check your OS version, since Python 3.10 may not yet be compatible with Windows 7, or with your processor type (32-bit or 64-bit).
2. After downloading the installation file, double-click it and follow the instructions in the wizard window. To avoid complexity, choose the recommended settings displayed.
The following screenshot shows how it looks on Windows:
3. If you are a Linux user, you can install it from the source using the following commands:
$ wget https://www.python.org/ftp/python/3.9.1/Python-3.9.1.tgz
$ tar -xzf Python-3.9.1.tgz
$ cd Python-3.9.1
$ ./configure --enable-optimizations
$ make -j 9
$ sudo make altinstall
After installing Python, you should be able to execute the pip command. If not, refer to the pip official documentation page here: https://pip.pypa.io/en/stable/installation/.
How it works…
Python is an interpreted language, and its interpreter can be extended with functions written in C or C++. The language package also comes with several built-in libraries and, of course, the interpreter. The interpreter works like a Unix shell and can be found in the /usr/local/bin directory: https://docs.python.org/3/tutorial/interpreter.html.
Lastly, note that many Python third-party packages in this book require the pip command to be
installed. This is because pip (an acronym for Pip Installs Packages) is the default package manager
for Python; therefore, it is used to install, upgrade, and manage the Python packages and dependencies
from the Python Package Index (PyPI).
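As a small illustration (the package choice here is arbitrary and not from the book), pip fetches a package from PyPI that you can then import and use in Python:

# Install a third-party package from PyPI (run in a shell):
# $ pip install requests

import requests

# A trivial request to confirm the package was installed correctly
response = requests.get("https://pypi.org/simple/")
print(response.status_code)  # 200 means PyPI answered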
There’s more…
Even if you don’t have any Python versions on your machine, you can still install them using the
command line or HomeBrew (for macOS users). Windows users can also download them from the
MS Windows Store.
Note
If you choose to download Python from the Windows Store, ensure you use an application
made by the Python Software Foundation.
See also
You can use pip to install convenient third-party applications, such as Jupyter. This is an open source, web-based, interactive (and user-friendly) computing platform, often used by data scientists and data engineers. You can install it from the official website here: https://jupyter.org/install.
Installing PySpark
To process, clean, and transform vast amounts of data, we need a tool that provides resilient and distributed processing, and that's why PySpark is a good fit. It provides an API over the Spark library that lets you use its capabilities from Python.
Getting ready
Before starting the PySpark installation, we need to check the Java version on our operating system:
1. First, check the installed Java version:
$ java -version
If everything is correct, you should see output similar to that shown in step 3, with OpenJDK version 1.8 or higher. However, some systems don't have any Java version installed by default, and to cover this, we need to proceed to step 2.
2. Now, we download the Java Development Kit (JDK).
Go to https://www.oracle.com/java/technologies/downloads/, select your OS, and download the most recent version of the JDK. At the time of writing, it is JDK 19.
The download page of the JDK will look as follows:
Execute the downloaded file by double-clicking it to start the installation process. The following window will appear:
Note
Depending on your OS, the installation window may appear slightly different.
Click Next for the following two questions, and the application will start the installation. You don't need to worry about where the JDK will be installed. By default, the installer is configured to be compatible with other tools' installations.
3. Next, we again check our Java version. When executing the command again, you should see
the following version:
$ java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (build 1.8.0_292-8u292-b10-
0ubuntu1~20.04-b10)
OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)
How to do it…
Here are the steps to perform this recipe:
1. First, install PySpark from PyPI using pip:
$ pip install pyspark
If the command runs successfully, the installation output's last lines will look like this:
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.2
2. Execute the pyspark command to open the interactive shell. When executing the pyspark
command in your command line, you should see this message:
$ pyspark
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more
information.
22/10/08 15:06:11 WARN Utils: Your hostname, DESKTOP-DVUDB98
resolves to a loopback address: 127.0.1.1; using 172.29.214.162
instead (on interface eth0)
22/10/08 15:06:11 WARN Utils: Set SPARK_LOCAL_IP if you need to
bind to another address
22/10/08 15:06:13 WARN NativeCodeLoader: Unable to load native-
hadoop library for your platform... using builtin-java classes
where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-
defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For
SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/
You can observe some interesting messages here, such as the Spark version and the Python version used by PySpark.
3. Finally, we exit the interactive shell as follows:
>>> exit()
$
How it works…
As seen at the beginning of this recipe, Spark is a robust framework that runs on top of the JVM. It is also an open source tool for creating resilient, distributed processing over vast amounts of data. With
the growth in popularity of the Python language in the past few years, it became necessary to have a
solution that adapts Spark to run alongside Python.
PySpark is an interface that interacts with Spark APIs via Py4J, dynamically allowing Python code to
interact with the JVM. We first need to have Java installed on our OS to use Spark. When we install
PySpark, it already comes with Spark and Py4J components installed, making it easy to start the
application and build the code.
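As a quick smoke test, here is a minimal sketch (not taken from the book; the application name is arbitrary) that starts a local SparkSession through the Py4J bridge and runs a tiny DataFrame operation:

from pyspark.sql import SparkSession

# Build (or reuse) a local SparkSession; Py4J forwards these calls to the JVM
spark = (
    SparkSession.builder
    .appName("pyspark-smoke-test")  # arbitrary application name
    .master("local[*]")             # use all local cores
    .getOrCreate()
)

# Create a tiny DataFrame to confirm the installation works end to end
df = spark.createDataFrame([(1, "ingest"), (2, "monitor")], ["id", "task"])
df.show()

spark.stop()

If the two-row table prints, Spark, Py4J, and Python are wired together correctly.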
There’s more…
Anaconda is a convenient way to install PySpark and other data science tools. This tool encapsulates all the manual processes and has a friendly interface for interacting with and installing Python components, such as NumPy, pandas, or Jupyter.
For more detailed information about how to install Anaconda and other powerful commands, refer to https://docs.anaconda.com/.
It is possible to configure and use virtualenv with PySpark, and Anaconda does it automatically if you choose this type of installation. However, for the other installation methods, we need to take some additional steps to make our Spark cluster (local or on a server) use it, which include indicating the virtualenv's /bin/ folder and where your PySpark path is.
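As a rough illustration of those steps (the virtualenv path here is hypothetical), you can point both the driver and the executors at the virtualenv's interpreter through environment variables before creating the session:

import os
from pyspark.sql import SparkSession

# Hypothetical virtualenv location; replace with your own /bin/ folder
VENV_PYTHON = "/home/user/my-venv/bin/python"

# Tell Spark which Python interpreter the driver and executors should use
os.environ["PYSPARK_PYTHON"] = VENV_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = VENV_PYTHON

spark = SparkSession.builder.master("local[*]").getOrCreate()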
See also
There is a nice article about this topic, Using VirtualEnv with PySpark, by jzhang, here: https://community.cloudera.com/t5/Community-Articles/Using-VirtualEnv-with-PySpark/ta-p/245932.
Configuring Docker for MongoDB
Getting ready
Following the good practice of code organization, let's start by creating a folder inside our project directory to store the MongoDB Docker image and data:
my-project$ mkdir mongo-local
my-project$ cd mongo-local
How to do it…
Here are the steps to try out this recipe:
Note
If you are a WSL user, an error might occur if you use the WSL 1 version instead of version 2. You can easily fix this by following the steps here: https://learn.microsoft.com/en-us/windows/wsl/install.
1. First, we create and run the MongoDB container in detached mode. A typical invocation (your flags may differ) publishes MongoDB's default port, 27017:
my-project/mongo-local$ docker run --name mongodb-local -p 27017:27017 -d mongo
2. We then check our server. To do this, we can use the command line to see which Docker containers are running:
my-project/mongo-local$ docker ps
We can even check the Docker Desktop application to see whether our container is running:
Figure 1.6 – The Docker Desktop view of the MongoDB container running
3. Finally, we need to stop our container. We need to use the container ID to stop it, which we saw previously when listing the running containers. We will rerun it in Chapter 5:
my-project/mongo-local$ docker stop 427cc2e5d40e
How it works…
MongoDB's architecture uses the concept of distributed processing, where the main node interacts with clients' requests, such as queries and document manipulation. It distributes the requests automatically among its shards, where each shard is a subset of a larger data collection.
Since we may also have other running projects or software applications inside our machine, isolating
any database or application server used in development is a good practice. In this way, we ensure
nothing interferes with our local servers, and the debug process can be more manageable.
This Docker image setting creates a MongoDB server locally and even allows us to make additional
changes if we want to simulate any other scenario for testing or development.
The commands we used are standard Docker CLI commands: docker run creates and starts the container, docker ps lists the running containers, and docker stop halts a container by its ID.
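As a quick sanity check (a sketch, not from the book; it assumes the container publishes MongoDB's default port, as in the docker run command shown earlier), you can connect to the local server with pymongo:

from pymongo import MongoClient

# Connect to the MongoDB server exposed by the local container
client = MongoClient("mongodb://localhost:27017/")

# Insert and read back a test document to confirm the server responds
db = client["test_db"]
db["test_collection"].insert_one({"status": "ok"})
print(db["test_collection"].find_one({"status": "ok"}))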
There’s more…
For frequent users, manually configuring other parameters for the MongoDB container, such as the
version, image port, database name, and database credentials, is also possible.
A version of this image with example values is also available as a docker-compose file in the official documentation here: https://hub.docker.com/_/mongo.
The docker-compose file for MongoDB looks similar to this:
# Use your own values for username and password
version: '3.1'

services:

  mongo:
    image: mongo
    restart: always
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example

  mongo-express:
    image: mongo-express
    restart: always
    ports:
      - 8081:8081
    environment:
      ME_CONFIG_MONGODB_ADMINUSERNAME: root
      ME_CONFIG_MONGODB_ADMINPASSWORD: example
      ME_CONFIG_MONGODB_URL: mongodb://root:example@mongo:27017/
See also
You can check out the complete MongoDB Docker Hub documentation here: https://hub.docker.com/_/mongo.
Configuring Docker for Airflow
Airflow requires some additional configuration steps. Thankfully, the Apache Foundation also provides a docker-compose file that contains all the other requirements to make Airflow work. We just need to complete a few more steps.
Getting ready
Let’s start by initializing our Docker application on our machine. You can use the desktop version or
the CLI command.
Make sure you are inside your project folder for this. Create a folder to store Airflow internal components
and the docker-compose.yaml file:
my-project$ mkdir airflow-local
my-project$ cd airflow-local
How to do it…
1. First, we fetch the docker-compose.yaml file directly from the Airflow official docs:
my-project/airflow-local$ curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.0/docker-compose.yaml'
Note
Check the most stable version of this docker-compose file when you download it, since
new, more appropriate versions may be available after this book is published.
Note
If you have any error messages related to the AIRFLOW_UID variable, you can create a .env
file in the same folder where your docker-compose.yaml file is and define the variable
as AIRFLOW_UID=50000.
Next, we initialize the Airflow metadata database and create the default user by running the following command:
my-project/airflow-local$ docker-compose up airflow-init
After executing the command, you should see output similar to this:
Creating network "airflow-local_default" with the default driver
Creating volume "airflow-local_postgres-db-volume" with default
driver
Pulling postgres (postgres:13)...
13: Pulling from library/postgres
(...)
Status: Downloaded newer image for postgres:13
Pulling redis (redis:latest)...
latest: Pulling from library/redis
bd159e379b3b: Already exists
(...)
Status: Downloaded newer image for redis:latest
Pulling airflow-init (apache/airflow:2.3.0)...
2.3.0: Pulling from apache/airflow
42c077c10790: Pull complete
(...)
Status: Downloaded newer image for apache/airflow:2.3.0
Creating airflow-local_postgres_1 ... done
Creating airflow-local_redis_1 ... done
Creating airflow-local_airflow-init_1 ... done
Attaching to airflow-local_airflow-init_1
(...)
airflow-init_1 | [2022-10-09 09:49:26,250] {manager.py:213} INFO - Added user airflow
airflow-init_1 | User "airflow" created with role "Admin"
(...)
airflow-local_airflow-init_1 exited with code 0
6. Then, we need to check the Docker processes. Using the following CLI command, you will see the containers running:
my-project/airflow-local$ docker ps
In the Docker Desktop application, you can also see the same containers running but with a
more friendly interface:
7. Next, we access the Airflow UI by opening http://localhost:8080 in a browser.
8. Then, we log in to the Airflow platform. Since it's a local application used for testing and learning, the default credentials (username and password) for administrative access in Airflow are airflow.
When logged in, the following screen will appear:
9. Then, we stop our containers. We can keep them stopped until we reach Chapter 9, where we will explore data ingestion in Airflow:
my-project/airflow-local$ docker-compose stop
How it works…
Airflow is an open source platform that allows batch data pipeline development, monitoring, and
scheduling. However, it requires other components, such as an internal database, to store metadata to
work correctly. In this example, we use PostgreSQL to store the metadata and Redis to cache information.
All this can be installed directly in our machine environment one by one. Even though it seems quite simple, it may not be, due to compatibility issues with the OS, other software versions, and so on.
Docker can create an isolated environment and provide all the requirements to make it work. With docker-compose, it becomes even simpler, since we can declare dependencies between the components so that each one is only created once the others are healthy.
You can also open the docker-compose.yaml file we downloaded for this recipe and take a look
to explore it better. We will also cover it in detail in Chapter 9.
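To give a taste of what this setup will run, here is a minimal DAG sketch (illustrative only; real ingestion DAGs are built in Chapter 9) that Airflow could schedule daily:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    # Placeholder task; a real pipeline would ingest data here
    print("Hello from Airflow!")

with DAG(
    dag_id="hello_ingestion",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    hello = PythonOperator(task_id="say_hello", python_callable=say_hello)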
See also
If you want to learn more about how this docker-compose file works, you can look at the Apache Airflow official Docker documentation on the Apache Airflow documentation page: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html.
Creating schemas
Schemas are considered blueprints of a database or table. While some databases strictly require
schema definition, others can work without it. However, in some cases, it is advantageous to work
with data schemas to ensure that the application data architecture is maintained and can receive the
desired data input.
Getting ready
Let’s imagine we need to create a database for a school to store information about the students, the
courses, and the instructors. With this information, we know we have at least three tables so far.
In this recipe, we will cover how schemas work using the Entity Relationship Diagram (ERD), a visual
representation of relationships between entities in a database, to exemplify how schemas are connected.
How to do it…
Here are the steps to try this:
1. We define the type of schema. An ERD of the school database (the students, courses, and instructors entities and their relationships) helps us understand how to go about this.
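To make the idea concrete, here is a hedged sketch (not from the book; field names and types are illustrative) of how the students entity from our school example could be declared as an explicit schema in PySpark:

from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Illustrative schema for the 'students' entity of the school database
students_schema = StructType([
    StructField("student_id", IntegerType(), False),   # primary key
    StructField("name", StringType(), False),
    StructField("course_id", IntegerType(), True),     # relates to courses
    StructField("instructor_id", IntegerType(), True), # relates to instructors
])

A DataFrame created or read with this schema helps ensure the ingested data matches the declared structure, which is exactly the guarantee discussed above.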