Data Ingestion with Python
Cookbook
Gláucia Esppenchutz
BIRMINGHAM—MUMBAI
Data Ingestion with Python Cookbook
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, without the prior written permission of the publisher, except in the case
of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information
presented. However, the information contained in this book is sold without warranty, either express
or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable
for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and
products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot
guarantee the accuracy of this information.
ISBN 978-1-83763-260-2
www.packtpub.com
This book represents a lot and wouldn’t be possible without my loving husband, Lincoln, and his
support and understanding during this challenging endeavor. I want to thank all my friends that
didn’t let me give up and always boosted my spirits, along with my grandmother, who always believed,
helped, and said I would do big things one day. Finally, I want to thank my beloved and four-pawed
best friend, who is at peace, Minduim, for “helping” me to write this book.
– Gláucia Esppenchutz
Contributors
I want to thank my patient and beloved husband and my friends. Thanks also to my mentors in the
Python open source community and the DataBootCamp founders, who guided me at the beginning
of my journey.
Thanks to the Packt team, who helped me through some hard times; you were terrific!
About the reviewers
Bitthal Khaitan is currently working as a big data and cloud engineer with CVS Health, a Fortune
4 organization. He has a demonstrated history of working in the cloud, data and analytics industry
for 12+ years. His primary certified skills are Google Cloud Platform (GCP), the big data ecosystem
(Hadoop, Spark, etc.), and data warehousing on Teradata. He has worked in all phases of the SDLC
of DW/BI and big data projects with strong expertise in the USA healthcare, insurance and retail
domains. He actively helps new graduates with mentoring, resume reviews, and job hunting tips in
the data engineering domain. Over 20,000 people follow Bitthal on LinkedIn. He is currently based
out of Dallas, Texas, USA.
Jagjeet Makhija is a highly accomplished technology leader with over 20 years of experience. They are not only skilled in various domains, including AI, data warehouse architecture, and business analytics, but also have a strong passion for staying ahead of technology trends such as AI and ChatGPT.
Jagjeet is recognized for their significant contributions to the industry, particularly in complex proof
of concepts and integrating Microsoft products with ChatGPT. They are also an avid book reviewer
and have actively shared their extensive knowledge and expertise through presentations, blog articles,
and online forums.
Krishnan Raghavan is an IT professional with over 20 years of experience in the area of software development and delivery excellence across multiple domains and technologies, ranging from C++ to Java, Python, data warehousing, and big data tools and technologies. Krishnan tries to give back to the community by being part of the GDG – Pune Volunteer Group, helping the team organize events.
When not working, Krishnan likes to spend time with his wife and daughter, as well as reading fiction,
non-fiction, and technical books. Currently, he is unsuccessfully trying to learn how to play the guitar.
You can connect with Krishnan at krishnan@gmail.com or via LinkedIn: www.linkedin.com/in/krishnan-raghavan
I would like to thank my wife, Anita, and daughter, Ananya, for giving me the time and space to
review this book.
Table of Contents
Preface  xv
2  Principles of Data Access – Accessing Your Data  31
Technical requirements  31
Implementing governance in a data access workflow  32
    Getting ready 32, How to do it… 33, How it works… 34, See also 34
Accessing databases and data warehouses  34
    Getting ready 35, How to do it… 35, How it works… 37, There’s more… 38, See also 39
Accessing SSH File Transfer Protocol (SFTP) files  39
    Getting ready 39, How to do it… 41, How it works… 43, There’s more… 43, See also 44
Retrieving data using API authentication  44
    Getting ready 45, How to do it… 47, How it works… 48, There’s more… 49, See also 52
Managing encrypted files  52
    Getting ready 52, How to do it… 53, How it works… 54, There’s more… 55, See also 56
Accessing data from AWS using S3  56
    Getting ready 56, How to do it… 59, How it works… 62, There’s more… 63, See also 63
Accessing data from GCP using Cloud Storage  64
    Getting ready 64, How to do it… 66, How it works… 68, There’s more… 70
Further reading  70
3  Data Discovery – Understanding Our Data before Ingesting It  71
Technical requirements  71
Documenting the data discovery process  71
    Getting ready 72, How to do it… 73, How it works… 77
Configuring OpenMetadata  77
    Getting ready 77
4  Reading CSV and JSON Files and Solving Problems  95
Technical requirements  95
Reading a CSV file  96
    Getting ready 96, How to do it… 96, How it works… 98, There’s more… 98, See also 99
Reading a JSON file  99
    Getting ready 100, How to do it… 100, How it works… 100, There’s more… 101, See also 103
Creating a SparkSession for PySpark  103
    Getting ready 103, How to do it… 104, How it works… 105, There’s more… 106, See also 107
Using PySpark to read CSV files  108
    Getting ready 108, How to do it… 108, How it works… 109, There’s more… 110, See also 114
Using PySpark to read JSON files  114
    Getting ready 114, How to do it… 115, How it works… 116, There’s more… 117, See also 117
Further reading  117
5  Ingesting Data from Structured and Unstructured Databases  119
Technical requirements  119
Configuring a JDBC connection  120
    There’s more… 127, See also 129
6  Using PySpark with Defined and Non-Defined Schemas  159
Technical requirements  159
Applying schemas to data ingestion  160
    How to do it… 169, How it works… 170
7  Ingesting Analytical Data  181
Technical requirements  181
Ingesting Parquet files  182
    Getting ready 182, How to do it… 183, How it works… 184, There’s more… 185, See also 185
Ingesting Avro files  185
    Getting ready 186, How to do it… 186, How it works… 188, There’s more… 190, See also 190
Applying schemas to analytical data  191
    Getting ready 191, How to do it… 191, How it works… 194, There’s more… 194, See also 195
Filtering data and handling common issues  195
    How it works… 197, There’s more… 198, See also 200
Ingesting partitioned data  200
    Getting ready 200, How to do it… 201, How it works… 201, There’s more… 203, See also 204
Applying reverse ETL  204
    Getting ready 204, How to do it… 205, How it works… 206, There’s more… 207, See also 207
Selecting analytical data for reverse ETL  207
    Getting ready 207, How to do it… 208, How it works… 209, See also 210
9  Putting Everything Together with Airflow  243
Technical requirements  244
Installing Airflow  244
Configuring Airflow  244
    Getting ready 244, How to do it… 245, How it works… 247, See also 248
Creating DAGs  248
    Getting ready 248, How to do it… 249, How it works… 253, There’s more… 254, See also 255
Creating custom operators  255
    Getting ready 255, How to do it… 257, How it works… 260, There’s more… 262, See also 262
Configuring sensors  262
    Getting ready 262, How to do it… 263, How it works… 264, See also 265
Creating connectors in Airflow  265
    Getting ready 266, How to do it… 266, How it works… 269, There’s more… 270, See also 270
10  Logging and Monitoring Your Data Ingest in Airflow  281
Technical requirements  281
Installing and running Airflow  282
Creating basic logs in Airflow  283
    Getting ready 284, How to do it… 284, How it works… 287, See also 289
Storing log files in a remote location  289
    Getting ready 289, How to do it… 290, How it works… 298, See also 299
Configuring logs in airflow.cfg  299
    Getting ready 299, How to do it… 299, How it works… 301, There’s more… 303, See also 304
Designing advanced monitoring  304
    Getting ready 304, How to do it… 306, How it works… 308, There’s more… 309, See also 309
Using notification operators  309
    Getting ready 310, How to do it… 312, How it works… 315, There’s more… 318
Using SQL operators for data quality  318
    Getting ready 318, How to do it… 320, How it works… 321, There’s more… 323, See also 323
Further reading  324
11  Automating Your Data Ingestion Pipelines  325
Technical requirements  325
Installing and running Airflow  326
Scheduling daily ingestions  326
    Getting ready 327
12  Using Data Observability for Debugging, Error Handling, and Preventing Downtime  349
Technical requirements  349
Docker images  350
Setting up StatsD for monitoring  351
    Getting ready 351, How to do it… 351, How it works… 353, See also 354
Setting up Prometheus for storing metrics  354
    Getting ready 354, How to do it… 354, How it works… 356, There’s more… 357
    … Getting ready 358, How to do it… 358, How it works… 361, There’s more… 363
Creating an observability dashboard  363
    Getting ready 363, How to do it… 363, How it works… 369, There’s more… 370
Setting custom alerts or notifications  370
    Getting ready 371, How to do it… 371, How it works… 377
Index  379
Chapter 2, Data Access Principles – Accessing Your Data, explores data access concepts related to data
governance, covering workflows and management of familiar sources such as SFTP servers, APIs,
and cloud providers. It also provides examples of creating data access policies in databases, data
warehouses, and the cloud.
Chapter 3, Data Discovery – Understanding Our Data Before Ingesting It, teaches you the significance of
carrying out the data discovery process before data ingestion. It covers manual discovery, documentation,
and using an open-source tool, OpenMetadata, for local configuration.
Chapter 4, Reading CSV and JSON Files and Solving Problems, introduces you to ingesting CSV and JSON
files using Python and PySpark. It demonstrates handling varying data volumes and infrastructures
while addressing common challenges and providing solutions.
Chapter 5, Ingesting Data from Structured and Unstructured Databases, covers fundamental concepts
of relational and non-relational databases, including everyday use cases. You will learn how to read
and handle data from these models, understand vital considerations, and troubleshoot potential errors.
Chapter 6, Using PySpark with Defined and Non-Defined Schemas, delves deeper into common PySpark
use cases, focusing on handling defined and non-defined schemas. It also explores reading and
understanding complex logs from Spark (PySpark core) and formatting techniques for easier debugging.
Chapter 7, Ingesting Analytical Data, introduces you to analytical data and common formats for reading
and writing. It explores reading partitioned data for improved performance and discusses Reverse
ETL theory with real-life application workflows and diagrams.
Chapter 8, Designing Monitored Data Workflows, covers logging best practices for data ingestion,
facilitating error identification, and debugging. Techniques such as monitoring file size, row count,
and object count enable improved monitoring of dashboards, alerts, and insights.
Chapter 9, Putting Everything Together with Airflow, consolidates the previously presented information
and guides you in building a real-life data ingestion application using Airflow. It covers essential
components, configuration, and issue resolution in the process.
Chapter 10, Logging and Monitoring Your Data Ingest in Airflow, explores advanced logging and
monitoring in data ingestion with Airflow. It covers creating custom operators, setting up notifications,
and monitoring for data anomalies. Configuration of notifications for tools such as Slack is also covered
to stay updated on the data ingestion process.
Chapter 11, Automating Your Data Ingestion Pipelines, focuses on automating data ingests using
previously learned best practices, enabling reader autonomy. It addresses common challenges with
schedulers or orchestration tools and provides solutions to avoid problems in production clusters.
Chapter 12, Using Data Observability for Debugging, Error Handling, and Preventing Downtime,
explores data observability concepts, popular monitoring tools such as Grafana, and best practices
for log storage and data lineage. It also covers creating visualization graphs to monitor data source
issues using Airflow configuration and data ingestion scripts.
For almost all recipes in this book, you can use a Jupyter notebook to execute the code. Even though it is not mandatory to install it, this tool can help you test the code and experiment with it, thanks to its friendly interface.
If you are using the digital version of this book, we advise you to type the code yourself or access
the code via the GitHub repository (link available in the next section). Doing so will help you
avoid any potential errors related to the copying and pasting of code.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file
extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Then
we proceeded with the with open statement.”
A block of code is set as follows:
$ python3 --version
Python 3.8.10
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words
in menus or dialog boxes appear in the text like this. Here is an example: “Then, when we selected
showString at NativeMethodAccessorImpl.java:0, which redirected us to the
Stages page.”
Sections
In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How
it works..., There’s more..., and See also).
To give clear instructions on how to complete a recipe, use these sections as follows:
Getting ready
This section tells you what to expect in the recipe and describes how to set up any software or any
preliminary settings required for the recipe.
How to do it…
This section contains the steps required to follow the recipe.
How it works…
This section usually consists of a detailed explanation of what happened in the previous section.
There’s more…
This section consists of additional information about the recipe in order to make you more knowledgeable
about the recipe.
See also
This section provides helpful links to other useful information for the recipe.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the
subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen.
If you have found a mistake in this book, we would be grateful if you would report this to us. Please
visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata
Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would
be grateful if you would provide us with the location address or website name. Please contact us at
[email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you
are interested in either writing or contributing to a book, please visit authors.packtpub.com.
https://packt.link/free-ebook/9781837632602
In this part, you will be introduced to the fundamentals of data ingestion and data engineering, covering the basic definition of an ingestion pipeline, the common types of data sources, and the technologies involved.
This part has the following chapters:
Technical requirements
The commands inside the recipes of this chapter use Linux syntax. If you don’t use a Linux-based
system, you may need to adapt the commands:
You can find the code from this chapter in this GitHub repository: https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook.
Note
Windows users might get an error message such as Docker Desktop requires a newer WSL kernel version. This can be fixed by following the steps here: https://docs.docker.com/desktop/windows/wsl/.
Getting ready
Let’s create a folder for our project:
1. First, open your system command line. Since I use the Windows Subsystem for Linux (WSL),
I will open the WSL application.
2. Go to your home directory and create a folder as follows:
$ mkdir my-project
Depending on your operating system, you might or might not have output here – for example, Ubuntu 20.04 users on WSL might have the following output:
Command 'python' not found, did you mean:
command 'python3' from deb python3
command 'python' from deb python-is-python3
If your Python path is configured to use the python command, you will see output similar
to this:
Python 3.9.0
Sometimes, your Python path might be configured to be invoked using python3. You can
try it using the following command:
$ python3 --version
5. Now, let’s check our pip version. This check is essential, since some operating systems have
more than one Python version installed:
$ pip --version
If your operating system (OS) uses a Python version below 3.8.x or doesn’t have the language
installed, proceed to the How to do it steps; otherwise, you are ready to start the following Installing
PySpark recipe.
How to do it…
We are going to use the official installer from Python.org. You can find the link for it here: https://www.python.org/downloads/:
Note
For Windows users, it is important to check your OS version, since Python 3.10 may not yet be compatible with Windows 7 or with your processor type (32-bit or 64-bit).
2. After downloading the installation file, double-click it and follow the instructions in the wizard
window. To avoid complexity, choose the recommended settings displayed.
The following screenshot shows how it looks on Windows:
3. If you are a Linux user, you can install it from the source using the following commands:
$ wget https://www.python.org/ftp/python/3.9.1/Python-3.9.1.tgz
$ tar -xzf Python-3.9.1.tgz
$ cd Python-3.9.1
$ ./configure --enable-optimizations
$ make -j 9
After installing Python, you should be able to execute the pip command. If not, refer to the pip official documentation page here: https://pip.pypa.io/en/stable/installation/.
How it works…
Python is an interpreted language, and its interpreter extends several functions written in C or C++. The language package also comes with several built-in libraries and, of course, the interpreter.
The interpreter works like a Unix shell and can be found in the /usr/local/bin directory: https://docs.python.org/3/tutorial/interpreter.html.
Lastly, note that many Python third-party packages in this book require the pip command to be
installed. This is because pip (an acronym for Pip Installs Packages) is the default package manager
for Python; therefore, it is used to install, upgrade, and manage the Python packages and dependencies
from the Python Package Index (PyPI).
There’s more…
Even if you don’t have any Python version on your machine, you can still install it using the command line or Homebrew (for macOS users). Windows users can also download it from the MS Windows Store.
Note
If you choose to download Python from the Windows Store, ensure you use an application
made by the Python Software Foundation.
See also
You can use pip to install convenient third-party applications, such as Jupyter. This is an open source,
web-based, interactive (and user-friendly) computing platform, often used by data scientists and data
engineers. You can install it from the official website here: https://jupyter.org/install.
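For example, with pip already available you can install and launch Jupyter straight from the command line (the jupyter metapackage shown here also pulls in the classic Notebook interface):
$ pip install jupyter
$ jupyter notebook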
Installing PySpark
To process, clean, and transform vast amounts of data, we need a tool that provides resilience and
distributed processing, and that’s why PySpark is a good fit. It provides an API for the Spark library that lets you use its features from Python.
Getting ready
Before starting the PySpark installation, we need to check the Java version on our operating system:
1. First, check the installed Java version:
$ java -version
If everything is correct, you should see the version information as the output of the command, with OpenJDK version 8 or higher. However, some systems don’t have any Java version installed by default, and to cover this, we need to proceed to step 2.
2. Now, we download the Java Development Kit (JDK).
Go to https://www.oracle.com/java/technologies/downloads/, select
your OS, and download the most recent version of JDK. At the time of writing, it is JDK 19.
The download page of the JDK will look as follows:
Execute the downloaded application. Click on the application to start the installation process.
The following window will appear:
Note
Depending on your OS, the installation window may appear slightly different.
Click Next for the following two questions, and the application will start the installation.
You don’t need to worry about where the JDK will be installed. By default, the application is configured to be compatible with other tools’ installations.
3. Next, we again check our Java version. When executing the command again, you should see
the following version:
$ java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (build 1.8.0_292-8u292-b10-
0ubuntu1~20.04-b10)
OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)
How to do it…
Here are the steps to perform this recipe:
1. Install PySpark using pip:
$ pip install pyspark
If the command runs successfully, the installation output’s last line will look like this:
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.2
2. Execute the pyspark command to open the interactive shell. When executing the pyspark
command in your command line, you should see this message:
$ pyspark
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more
information.
22/10/08 15:06:11 WARN Utils: Your hostname, DESKTOP-DVUDB98
resolves to a loopback address: 127.0.1.1; using 172.29.214.162
instead (on interface eth0)
22/10/08 15:06:11 WARN Utils: Set SPARK_LOCAL_IP if you need to
bind to another address
22/10/08 15:06:13 WARN NativeCodeLoader: Unable to load native-
hadoop library for your platform... using builtin-java classes
where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-
defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For
SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
You can observe some interesting messages here, such as the Spark version and the Python
used from PySpark.
3. Finally, we exit the interactive shell as follows:
>>> exit()
$
How it works…
As seen at the beginning of this recipe, Spark is a robust framework that runs on top of the JVM. It is also an open source tool for resilient and distributed processing of vast amounts of data. With the growth in popularity of the Python language in the past few years, it became necessary to have a solution that adapts Spark to run alongside Python.
PySpark is an interface that interacts with Spark APIs via Py4J, dynamically allowing Python code to
interact with the JVM. We first need to have Java installed on our OS to use Spark. When we install
PySpark, it already comes with Spark and Py4J components installed, making it easy to start the
application and build the code.
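To see this interaction in practice, here is a minimal sketch (the application name and sample data are arbitrary) that starts a session, creates a tiny DataFrame, and shuts the session down:
from pyspark.sql import SparkSession

# getOrCreate() launches the JVM-side Spark process and connects to it through Py4J
spark = SparkSession.builder.appName("pyspark-check").getOrCreate()

# A tiny DataFrame just to confirm that Python and the JVM are talking to each other
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()
If this prints a two-row table, the Java, Spark, and Py4J pieces described previously are all in place.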
There’s more…
Anaconda is a convenient way to install PySpark and other data science tools. This tool encapsulates all
manual processes and has a friendly interface for interacting with and installing Python components,
such as NumPy, pandas, or Jupyter:
For more detailed information about how to install Anaconda and other powerful commands, refer to https://docs.anaconda.com/.
It is possible to configure and use virtualenv with PySpark, and Anaconda does it automatically if you choose this type of installation. However, for the other installation methods, we need to take some additional steps to make our Spark cluster (local or on a server) use it, which include pointing to the virtualenv’s /bin/ folder and to your PySpark path.
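As a rough sketch of what that looks like outside Anaconda (the virtualenv path below is only an example), you can point both the driver and the workers at the environment’s interpreter before creating a session:
import os

# Make Spark use the Python interpreter from the virtualenv for the driver and the workers
# (replace the path with the location of your own environment).
os.environ["PYSPARK_PYTHON"] = "/home/user/my-venv/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/home/user/my-venv/bin/python"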
See also
There is a nice article about this topic, Using VirtualEnv with PySpark, by jzhang, here: https://community.cloudera.com/t5/Community-Articles/Using-VirtualEnv-with-PySpark/ta-p/245932.
Configuring Docker for MongoDB
Getting ready
Following the good practice of code organization, let’s start by creating a folder inside our project directory to store the MongoDB Docker image and data as follows:
my-project$ mkdir mongo-local
my-project$ cd mongo-local
How to do it…
Here are the steps to try out this recipe:
Note
If you are a WSL user, an error might occur if you use the WSL 1 version instead of version 2.
You can easily fix this by following the steps here: https://learn.microsoft.com/en-us/windows/wsl/install.
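For reference, a minimal way to start the container looks like the following line; the container name, image tag, and published port here are assumptions for illustration, not necessarily the exact values this recipe uses:
my-project/mongo-local$ docker run -d --name mongo-local -p 27017:27017 mongo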
We then check our server. To do this, we can use the command line to see which Docker containers are running:
my-project/mongo-local$ docker ps
We can even check on the Docker Desktop application to see whether our container is running:
Figure 1.6 – The Docker Desktop vision of the MongoDB container running
3. Finally, we need to stop our container. We need to use the container ID to do so, which we saw previously when listing the running containers. We will rerun it in Chapter 5:
my-project/mongo-local$ docker stop 427cc2e5d40e
How it works…
MongoDB’s architecture uses the concept of distributed processing, where the main node interacts with clients’ requests, such as queries and document manipulation. It distributes the requests automatically among its shards, which are subsets of a larger data collection.
Since we may also have other running projects or software applications inside our machine, isolating
any database or application server used in development is a good practice. In this way, we ensure
nothing interferes with our local servers, and the debug process can be more manageable.
This Docker image setting creates a MongoDB server locally and even allows us to make additional
changes if we want to simulate any other scenario for testing or development.
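As a quick illustration of using that local server, the following sketch connects with the pymongo driver (installed separately, for example with pip install pymongo) and assumes the container was started without authentication and with the default port 27017 published; the database and collection names are arbitrary:
from pymongo import MongoClient

# Connect to the MongoDB container exposed on the default local port
client = MongoClient("mongodb://localhost:27017/")
db = client["local_test"]

# Write and read back a single document to confirm the server is reachable
db["ping"].insert_one({"status": "ok"})
print(db["ping"].find_one({"status": "ok"}))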
The commands we used are as follows:
There’s more…
For frequent users, manually configuring other parameters for the MongoDB container, such as the
version, image port, database name, and database credentials, is also possible.
A version of this image with example values is also available as a docker-compose file in the official documentation here: https://hub.docker.com/_/mongo.
The docker-compose file for MongoDB looks similar to this:
# Use your own values for username and password
version: '3.1'
services:
mongo:
image: mongo
restart: always
environment:
MONGO_INITDB_ROOT_USERNAME: root
MONGO_INITDB_ROOT_PASSWORD: example
mongo-express:
image: mongo-express
restart: always
ports:
- 8081:8081
environment:
ME_CONFIG_MONGODB_ADMINUSERNAME: root
ME_CONFIG_MONGODB_ADMINPASSWORD: example
ME_CONFIG_MONGODB_URL: mongodb://root:example@mongo:27017/
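If you save this content as a docker-compose.yaml file (for example, inside the mongo-local folder), both services can be started with docker compose up -d (or docker-compose up -d on older Docker installations) and stopped again with docker compose down.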
See also
You can check out the complete MongoDB documentation on Docker Hub here: https://hub.docker.com/_/mongo.
However, there are some additional steps to configure our Airflow. Thankfully, the Apache Foundation
also has a docker-compose file that contains all other requirements to make Airflow work. We
just need to complete a few more steps.
Getting ready
Let’s start by initializing our Docker application on our machine. You can use the desktop version or
the CLI command.
Make sure you are inside your project folder for this. Create a folder to store Airflow internal components
and the docker-compose.yaml file:
my-project$ mkdir airflow-local
my-project$ cd airflow-local
How to do it…
1. First, we fetch the docker-compose.yaml file directly from the Airflow official docs:
my-project/airflow-local$ curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.0/docker-compose.yaml'
Note
Check the most stable version of this docker-compose file when you download it, since
new, more appropriate versions may be available after this book is published.
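For reference, the quick-start guide that accompanies this docker-compose.yaml file in the official Airflow documentation usually continues along these lines (the commands below come from that guide, not from this recipe's remaining steps):
my-project/airflow-local$ mkdir -p ./dags ./logs ./plugins
my-project/airflow-local$ echo -e "AIRFLOW_UID=$(id -u)" > .env
my-project/airflow-local$ docker compose up airflow-init
my-project/airflow-local$ docker compose up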