Linux System Administration for the 2020s
The Modern Sysadmin Leaving Behind the Culture of Build and Maintain
Kenneth Hitchcock
Hampshire, UK
Table of Contents
Chapter 5: Automation
    Automation in Theory
    Idempotent Code
    Knowing When and When Not to Automate
    State Management
    Automation Tooling
    Automation Scripting Languages
    Automation Platforms
    Automation in Estate Management Tools
    Ansible Automation Platform
    Making the Decision
    Automation with Management Tools
    State Management
    Enterprise Products
    Use Case Example
    Setting Up a SOE
    Automate the Automation
    Self-Healing
    When to Self-Heal
    How to Implement Self-Healing
    Automation Best Practices
    Do Not Reinvent the Wheel, Again …
    Things to Avoid
    Shell Scripts
    Restarting Services When Not Required
    Using Old Versions
    Correct Version Documentation
    Good Practices
    Summary
Chapter 6: Containers
    Getting Started
    Virtual Machine vs. Container
    Container History
    Container Runtimes
    Container Images
    Containers in Practice
    Prerequisites
    Creating Containers
    Custom Images and Containers
    Container Practices
    Cloud Native
    Good Practices
    Bad Practices
    Container Development
    Development Considerations
    Container Tooling
    DevSecOps
    DevSecOps Tooling
    GitOps
    GitOps Toolbox
    Container Orchestration
    What Does It Do?
    Why Not Use Podman?
    Orchestration Options
    Summary
Part III: Day Two Practices and Keeping the Lights On
Chapter 7: Monitoring
    Linux Monitoring Tools
    Process Monitoring
    Disk and IO
    CPU
    Memory
    Virtual Memory
    Network
    Graphical Tools
    Historical Monitoring Data
    Central Monitoring
    Nagios
    Prometheus
    Thanos
    Enterprise Monitoring
    Dashboards
    Dashboarding Tools
    Grafana
    Application Monitoring
    Tracing Tools
    Exposing Metrics
    Summary
Chapter 8: Logging
    Linux Logging Systems
    Rsyslog
    Fluentd
    Plugin Based
    Used at Scale
    Installation
    Configuration
    Understanding Logs
    Where Are the Log Files
    How to Read Log Files
    Infrastructure Logs
    Application Logs
    Increasing Verbosity
    Log Maintenance
    Log Management Tools
    Log Forwarding
    Central Logging Systems
    Summary
Chapter 9: Security
    Linux Security
    Standard Linux Security Tools
    Recommended Linux Security Configurations
    DevSecOps
    What Is It?
    Everyone Is Responsible for Security
    Tools
    System Compliance
    System Hardening
    Vulnerability Scanning
    Linux Scanning Tools
    Container Image Scanning Tools
    Container Platform Scanning Tools
    Summary
Index
About the Author
Kenneth Hitchcock is a principal consultant working for Red Hat, with over 20 years of experience in IT.
Ken has spent the last 11 years predominantly focused on Red Hat products, certifying as a Red Hat Architect along the way. The last decade has been pivotal in shaping Ken's understanding of how large Linux estates should be managed, and in the spirit of openness, he was inspired to share his knowledge and experiences in this book.
Originally from Durban, South Africa, Ken now lives in the south of England, where he hopes not only to continue to inspire all he meets but also to keep improving himself and the industry he works in.
About the Technical Reviewer
Zeeshan Shamim has been an IT professional in various capacities, from management to DevOps, for the past 15-odd years. He has worked in roles ranging from support to DevOps/sysadmin at organizations from big telecom firms to banks and is a proponent of open source technologies.
Acknowledgments
This book is based on all the experience and training I received over the
years, all of which started while working for Justin Garlick and Alasdair
Mackenzie. Thank you for all the opportunities to learn and for giving me
the foundation to get started in the open source world.
My eventual move to Red Hat opened opportunities to work with larger teams and allowed me to learn so much from so many great, influential people across the various Red Hat teams. I am grateful for the guidance and friendship of Dan Hawker, Will McDonald, Vic Gabrie, Martin Sumner, Chris Brown, Paulo Menon, Zeeshan Shamim, and so many others I have not named, part of a truly special group of people who are always willing to help and who make working at Red Hat so special. Thank you all for showing what it means to be open.
Introduction
This book is divided into four main parts, each designed to expand on the previous one. More subjects are introduced as you go along; some may require further reading, and others are explained here. By the end of the book, you will either feel reassured that you are doing things the right way or have a thousand ideas on how to improve.
Part 1
If you are reading this book with existing Linux knowledge, use Part 1 as a refresher or an opportunity to see things from a different perspective. It is entirely possible there is something you do not know or have forgotten.
For the reader new to Linux, this is not a book that will teach you all the foundational skills or bring you to the same level as readers with years of experience; that will require more effort on your part. It will, however, give you the keywords and subjects you will need to explore further on your own.
Anyone who has ever been exposed to something new will understand
the statement “you don’t know what you don’t know.” Part 1 is there to give
you the breadcrumbs to these unknowns; further chapters will give you
a bit more. The value in Part 1 will come from the structure it gives; it will
show you what to learn and where to focus to build a solid foundation.
Advanced users with many years of experience will most likely breeze through the first chapters without gaining anything new. All I can offer you is a different perspective on how I believe a solid Linux knowledge foundation can be laid.
Part 2
Part 2 explores how to improve your ways of working with Linux systems and hopefully gives you a few shortcuts along the way. It draws on much of my experience as a consultant over the last ten years, along with some views on what I have seen a little more recently. The ultimate goal of this part is to bring you up to speed with the latest estate management trends and tools. Everyone reading this book should gain some benefit from what I am sharing with you.
It starts with new tooling and the new ways of working most organizations have begun to adopt. Using these new tools, we will delve into estate management and how Linux systems have been and should be provisioned. We will look into backing up and restoring platforms, with a good understanding of the disaster recovery options available today. We will visit the good and bad practices people commonly follow and how to avoid the bad ones. We will then discuss best practices for running an efficient environment.
Just as we discussed community and enterprise Linux distros in Chapter 1, we will discuss community and enterprise estate management tools. We will look at how these tools can be leveraged to build a solution that can be truly inspirational.
Good estate management requires a high degree of automation; in Chapter 5, we will explore in-depth automation concepts and practices to achieve far higher productivity than was possible when building systems ten years ago.
Finally, we will discuss different aspects of containerization: when the right time is, and what should and should not be containerized.
Part 3
Day two operations are the most important aspect of keeping your Linux estate running. They are the nuts and bolts of tailoring your Linux estate to your organization's needs. Part 3 focuses on some of the most important day two configurations needed for a Linux system to be supported by your organization.
The chapters leading up to Part 3 were not heavily focused on how to use these tools, but rather on what they are for and what you could spend your time looking into further.
In Part 3, however, we are going to focus a bit more on traditional Linux system administration day two operations. We will look at monitoring, logging, security, and how to plan system maintenance. Some of the tooling you may already be using; some you may not have seen before. These chapters will explore what I have seen in the industry over the last decade and discuss some interesting new ways of working.
Part 4
The goal of Part 4 is to help you understand how a problem should be seen and analyzed before taking any action, allowing for effective troubleshooting instead of guessing where the issues could be. Part 4 will give you a solid theoretical foundation to use when taking on difficult problems.
When a problem goes beyond our understanding, or we just don't have the time to spend days trying to find the root cause, we need to ask others for help. Learning the correct way to ask for help will save frustration when the community does not respond as you would have liked.
Finally, we will briefly delve into a few advanced administration tools
that can be used to give you more information about your system.
PART I
Laying the Foundation
Before delving into intermediate or advanced topics around managing large or small Linux estates, it is very important that we establish the baseline skills required to fully appreciate this book. That is the main purpose of Part 1.
CHAPTER 1
Linux at a Glance
Where did Linux come from? Where is Linux going? Why should you not
be afraid of using Linux?
These are important questions to anyone new to Linux or anyone
who is looking to understand more about this amazing operating system.
Linux has changed, and continues to change, the world; the opportunities Linux has already brought are astounding, but what it still has to offer is what truly excites me. Together with open source communities, Linux will continue to evolve, grow, and encourage innovation from millions of developers creating new projects across the globe. With the open, collaborative nature of the open source world, we will be capable of anything. No problem can be too big.
During this first chapter, we take a look at the differences between
community and enterprise Linux distributions. We will discuss why
enterprise Linux is preferred by some and why community distributions
are preferred by others. We will look at the different approaches some distributions have taken to how the operating system should be managed and understand why so many variations of distributions have emerged. Finally, I hope to help you understand the possible reasons why someone would use a community or an enterprise Linux distribution.
Open Source
Open source does not mean “free.” The fact that the software has no cost
does not mean the software has no value; it means the source code is open
and not locked away in a proprietary vault somewhere.
If the software has no cost, then what’s the point, right? How can
someone make money from it?
This is where companies like SUSE, Canonical, and Red Hat make their
money. They sell subscriptions for the support of their distributions but
don’t actually sell the software. You can use Red Hat Enterprise Linux, for
example, without a subscription, and you can update the operating system
from community repositories with no problem. You can’t, however, ask
Red Hat to support you. For that, you need to pay.
1. Just for Fun is the name of the book written by Linus Torvalds.
2. Free Software Foundation
Linux Is Everywhere
Almost everything we use today, from smartphones to laptop computers to the kiosk terminals we buy our movie tickets on at the cinema, shares one thing in common: they all use Linux. Well, almost everything. I still see the odd Windows "BSOD" when I walk through the London Underground. My point is, we use Linux, or see it, almost daily without knowing it.
Linux is often used in train stations and airports for advertising boards, but
did you realize the entertainment systems used on your flights are often
Linux driven too? Maybe not the best example if you spent an eight-hour
flight with no entertainment.
Linux systems like these are easier to develop and improve, and as
the communities who maintain them are constantly working on bug fixes
or new projects, this kind of development model drives innovation and
constantly encourages investment from larger organizations to develop
new ideas.
Hardware vendors are also beginning to understand why open source is better and are constantly looking into ways to make use of open source tooling. This is nowhere more evident than in the mobile phone market.
In 2013, Android had 75% of the smartphone market share; today, that number is still around 72%. That is an extraordinarily large share of the global smartphone market. Over five billion people use smartphones; that is, about two thirds of the planet's population use a mobile device, and 72% of those devices run Android. This means that almost half of the world's population is using Linux right now.
Smart TVs, tablets, home automation devices, and IoT devices are not
to be excluded either. Open source software has enabled these platforms
and gadgets to grow increasingly more popular. Companies like Google,
Amazon, and Philips are a few that have released really good products for
simple home automation. Even the least technical people now have the ability to configure their homes so that lights come on on a schedule or with motion.
It still seems like something from a sci-fi film when I see that a kettle
can be set to boil water in the morning before even getting out of bed.
If that doesn’t interest you, imagine a smart device with a robotic arm
controlled by a virtual chef that can be commanded to cook your dinner
from a menu.
It is not just the innovation in our homes that is impressive, it is the
automation that is going to change the world that excites me. I recently saw
new automation tooling to manage a vegetable garden. The software used
can detect and command a robotic system to remove weeds, water the
vegetables, and spray pesticide.
These innovative devices and ideas have deep roots in open source and Linux. The availability of thousands, if not hundreds of thousands, of developers and hobbyists has shown that collaboration far outperforms the development efforts of any proprietary software company.
These examples I mentioned are small scale now, but imagine at full
scale; imagine hundreds of acres of farmland being automated to grow
food or automated restaurants with robotic chefs that can cook anything
you can select off a menu. Yes, there is always the human factor that could
face the brunt of this innovation, and there too is an answer for that. By
automating and innovating ourselves out of jobs, we are building systems
and platforms to feed and clothe us. Just like our technology is evolving, so
must we. Where the farmer toiled in the field, they now can spend the time
enhancing the machine learning that drives the automation. The farmer
now can spend more time with their family or innovating better farming
techniques.
By following open source practices and giving back to the community, farmers can expand and feed more people. Building community projects and sharing them with the planet only increases our ability to eradicate starvation.
It’s these innovations that make the future bright and open new doors.
Community
The “community” is a name generally given to a collective of people
who do not work for a single organization to develop a product. Well, I
suppose that is not entirely true. Some organizations like Red Hat sponsor
communities to develop and work on community products to act as their
upstream variants. These communities do prefer to focus more on the
term “project” than “product.”
Upstream
Another word thrown around in the open source world is “upstream.”
“Upstream” is a term used to describe what an enterprise product is
based on. This doesn’t mean the enterprise product is a direct copy of
the upstream product either. The upstream is considered more of the
“bleeding edge” or innovation breeding ground, typically used to prove
and test new product features before being pushed into enterprise
products.
If a product sponsor, like Red Hat, likes one of the upstream features,
they take the code from the community and work the new feature into
the enterprise equivalent. These features are then tested and reworked
to ensure they are enterprise grade before releasing new versions to
customers.
It is worth mentioning that the enterprise products often have different
names than their “upstream” equivalents. Take the example of Fedora
Linux. Fedora is considered the upstream for Red Hat Enterprise Linux
or RHEL.
Community Contributors
For a community to exist or provide a product, the community needs contributors. Community contributors are typically software developers, hobbyists, or people who enjoy building and developing projects in their spare time. These contributors dedicate their spare time to giving back code for anyone to use. For them, it's all about sharing their work and getting it out to as many people as possible.
Note One other thing to note is that giving back to the community
doesn’t only mean writing code. Being part of a community could be
anything, provided you can contribute to the project or community
in a meaningful way. It could be a monetary donation or giving up
some of your time to host a meetup. Anything that can help grow the
project will have value.
Common Distributions
At the time of writing, there were around 600 Linux distros available, give
or take a few. Many are forks of more well-known distros, and some are
forks of forks of forks.
As mentioned a few times already, open source is all about being open
and having code available for anyone to use. This is why it’s possible for
anyone to create a new distro; in fact, there are distros that help you create
distros.
One thing all distros have in common is the Linux kernel, so it all still comes down to what Linus is releasing. You are welcome to try to create your own kernel; I'm sure many have tried, but sometimes it is best not to reinvent the wheel, especially if it is still working well enough. There may come a time when the kernel needs to be reengineered, but until then we will trust in Linus.
The kernel is one of the few things that is the same across most, if not all, Linux distros. The differences that do exist between distros are largely around package management. Distros like Red Hat Enterprise Linux, Fedora, and CentOS/Rocky use the RPM-based package management system. Distros like Debian and Ubuntu use the deb-based package management system.
Another less-known packaging system that seems to be getting a bit of traction is Pacman. Pacman is currently used by the gaming distro SteamOS and by Manjaro.
With all the distros available, it’s important to know where each distro
came from and what that distro was built for. As mentioned earlier, Fedora
is regarded as the “upstream” for Red Hat Enterprise Linux, but this is
not what it was initially intended for. Fedora was first released in 2002 by
Warren Togami as an undergraduate project. The goal of the project was
for Fedora to act as a repository for third-party products to be developed
and tested on a non-Red Hat platform.
Other distros have been built purely for security, like Kali Linux, which is built and configured for penetration testing. A distro like Puppy Linux was built as a cut-down distro, allowing users to run a "lighter" Linux on older, slower hardware.
As a small taste of that mind map of all the Linux distros available, the following is a small part of the distro family tree for RPM-based distros. This table does not take into account the forks of forks of forks that have happened from these.
Before Committing
The best approach for any new user moving to Linux, who may not be familiar with Linux or open source tools for that matter, is to take a staged approach. Start by changing to open source tools on your current operating system. Change to the Firefox browser or use Thunderbird for email. Find open source alternatives to the products you currently use and get familiar with them. Once you are familiar with the new tools, switching to Linux will seem less of a culture shock.
Tip Make a list of all the tools you use and search the Internet
for open source alternatives, then switch one by one. Try different
products if some are not right for you or spin up a virtual machine
with a distro you think works for you and test your alternative
tools there.
Try Ubuntu
Ubuntu, for example, is a good choice for someone new to Linux. The install is simple, the methods of creating install media are not overly complex, and a one-minute Google search will find answers on how to create bootable media. The installation itself does not require much thought; the defaults work quite well and will leave the user with a suitable installation.
Ubuntu configuration could involve a small learning curve for the brand-new user. Ubuntu does, however, have a nice app store where you can find almost anything.
Applications like Wine and Lutris work quite well on Ubuntu, which
means gaming is possible with less frustration. Lutris itself is a very useful
tool in that it wraps configuration required for games to run on Linux
quite nicely. The scripts are easily found in Lutris and can be added with
relative ease.
My advice for any new user is to start with Ubuntu or something very
similar. Get familiar with how Linux works in general. Learn about
systemd, and understand how firewalls are configured.
Get familiar with installing drivers for hardware not included in the
kernel, like graphics cards. Spend time on discussion boards learning how
to figure things out for yourself.
Push yourself and learn how to configure your Linux distro as a web server or a Plex server for your home as a fun starter project.
3. Read the Freaking Manual
These distros should be used when you are comfortable working out
issues on your own and don’t need too much guidance. You should be well
versed in finding meaningful errors and understand where to increase
verbosity when needed.
Compiling from source and rebuilding kernel modules should also be tasks you have done before; in some cases, getting applications or drivers to work will involve this kind of work. Distros like Arch Linux and Gentoo should be left to the die-hard fans who wish to set themselves a challenge, so do not take them on lightly if you are predisposed to frustration.
If banks did not use enterprise products and opted for community-based products, they would need to wait for the community to fix a vulnerability when it is reported. This can sometimes take a couple of hours, or a couple of days, if not weeks.
If the vulnerability was a particularly bad one, it could cost the bank
more than any software subscription ever would. It could even spell the
end for some organizations if they were to be breached because of a
vulnerability waiting to be fixed by the community.
Enterprise Linux companies may make money from the software they support, but do not think for a minute that they do not help the communities develop their products. Enterprise Linux companies have become extremely important to the communities from which they get most of their "upstream" products. Red Hat, as an example, not only uses "upstream" projects like Fedora for RHEL but also supports many, many other "upstream" projects. It is this support that grows the products and promotes adoption throughout the whole industry.
Red Hat
Red Hat has a large portfolio of enterprise products from Red Hat
Enterprise Linux all the way through to the OpenShift container platform
they use as their hybrid cloud solution.
Red Hat has been developing solutions since its start in 1993 and hasn't stopped trying to release the next best enterprise product. Red Hat is constantly setting the trend in enterprise open source solutions; when Red Hat has not been actively developing new products, it has been acquiring companies that have. An example of this is the recent acquisition of StackRox.
Red Hat builds their business around three main product categories
that drive their business and customer adoption.
Automation
With the acquisition of Ansible in October 2015, Red Hat strengthened
their offerings to the market with one of the best automation products
yet. Red Hat not only made Ansible enterprise grade but also took the
previously proprietary Ansible Platform (Ansible Tower) and open sourced
it. The community version of Ansible Platform is called AWX.
Ansible continues to grow in popularity and remains one of the most actively developed automation products in the community, with new modules being added constantly to improve the product.
Hybrid Cloud
The cloud is something all of us now know about. It is nothing new; most
organizations are actively looking at cloud options if they have not already
moved or are planning the move for future roadmaps. Red Hat is no
different.
Red Hat over the years has become very good at finding the next big thing. This was the case with the acquisition of Makara in 2010. What made Makara so special was the PaaS (Platform as a Service) solution they were developing. In May 2011, OpenShift was announced from this acquisition, and in 2012 OpenShift was open sourced.
Canonical
Canonical was founded in the UK by Mark Shuttleworth in 2004. Canonical is best known for its community Linux distro, Ubuntu. Very much like Red Hat, Canonical offers paid support subscriptions for its products. Ubuntu, however, is not split like Red Hat Enterprise Linux and Fedora; there is only the community Ubuntu product. Canonical offers support and break/fix where it can but does not have a separate enterprise distro the way Red Hat does.
Canonical, like Red Hat, has a portfolio consisting of more than just Linux support. Canonical offers products in the following categories.
Linux Support
The first and obvious part of Canonical’s business is around their support
for Ubuntu. As discussed before, Ubuntu is only developed by the
community and supported by Canonical for a price.
Cloud
Canonical offers support for Kubernetes, which is a container
orchestration product similar to OpenShift. For their private cloud
solution, Canonical supports and helps install OpenStack. Both products
provide cloud capabilities for Canonical.
Internet of Things
One area where Canonical differs from both Red Hat and SUSE is its support for IoT devices and embedded Ubuntu. More companies are looking for a Linux distro for their "smart" devices and appliances, and Canonical has an edge in this market as one of the only enterprise Linux companies to provide this level of support.
SUSE
SUSE, the third and by no means the last enterprise Linux company, currently has a slightly wider portfolio than Canonical, though not quite that of its closest competitor, Red Hat. SUSE, like Red Hat, has its own enterprise Linux distro. The community version of SUSE Linux Enterprise is called openSUSE. The enterprise version has support subscriptions from SUSE Linux Enterprise Desktop through to SUSE Linux Enterprise for IBM Power.
SUSE, as mentioned, has a slightly wider portfolio than Canonical. Currently, SUSE has two product categories driving its business, which may be an unfair oversimplification of its products.
vulnerabilities and fix them before they become public. Communities tend to be reactive and are always behind the curve when releasing security patches, something large organizations like banks prefer not to have. You and I can turn our labs off if we are concerned about a security issue. Banks do not have that privilege.
Knowledge Check
For the best use of this book, you are expected to know the basics of Linux
system administration. This would include things like
Summary
In this chapter, the following subjects were introduced:
PART II
Strengthening Core Skills
Now that Part 1 is complete and has set out where you should be as a Linux system administrator, Part 2 will focus on building new skills in areas you may not have been exposed to yet.
CHAPTER 2
New Tools to Improve the Administrative Experience
Task Management
The Linux operating system is, in essence, a series of files and processes working together to assist the user in completing computational requests. These processes occasionally need to be managed. As a user of Linux, it is recommended that you understand how processes can be started, stopped, and, when required, killed, sometimes forcibly.
Starting a Process
Starting a process can be done in a number of ways; the most common one you will use is starting a service. Starting an Apache web server, for example, usually involves starting the httpd service. This service spawns a few httpd processes, depending on your configuration. A service, however, is really nothing more than a script or a set of commands that call a binary followed by parameters. When looking at your process, the parameters are often listed after it.
Starting the Apache web server, as mentioned, requires a service command. With most Linux distros, this will be a systemctl command. To check if the service has started, you can replace the start parameter with the status parameter, or you can check what processes are running that match the name httpd.
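A minimal illustration on a RHEL-family system (on Debian-based distros, Apache is packaged and run as apache2, so the service name will differ):
# systemctl start httpd      # start the service
# systemctl status httpd     # confirm it started
# ps -ef | grep httpd        # or check for running httpd processes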
Top
Top is installed by default on almost every Linux distro I have ever
used. Executing the command “top” should give a similar output to the
following:
# top
top - 21:51:30 up 35 days, 22:34, 1 user, load average: 4.80, 5.38, 3.13
Tasks: 423 total, 1 running, 421 sleeping, 0 stopped, 1 zombie
%Cpu(s): 8.8 us, 6.9 sy, 0.0 ni, 81.9 id, 0.0 wa, 1.8 hi, 0.6 si, 0.0 st
Alternatives to Top
There are a few alternatives to “top” if you want to try something different
(Table 2-1). Personally, I have tried and used a few but always default to
top, mostly as the systems I work on are not my own. If you have not tried
the alternatives to top in Table 2-1 before, I recommend you install and see
for yourself if they add any benefit to your way of working.
nmon
nmon is another very useful tool to help diagnose issues on your system. It
is typically not installed by default but can be installed on most platforms.
nmon has a very clear method of showing CPU, memory, disk, and kernel,
to name a few. It’s definitely a tool I would recommend using.
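As a quick, hedged example, assuming your distro carries nmon in its repositories (on RHEL it typically comes from EPEL), installing and launching it looks like this:
# dnf install nmon     # or: apt install nmon on Debian/Ubuntu
# nmon                 # then press c, m, or d to toggle the CPU, memory, and disk views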
Killing Processes
Occasionally, there may be a need to kill a process; this could be for
anything from a hung thread to a process with a memory leak. Before you
kill this process, always ask yourself if killing the process is the best way
of terminating your process. I do understand that sometimes there is no
other option, and the task must be killed. However, never start by forcefully
killing a process. Always start by trying to use service commands like
systemctl or similar. Some applications or utilities have their own custom
tools that can also be used. Read the official documentation or man pages
to see if there is a recommended method.
I have experienced in the past that some system administrators do not
always understand the implications of killing a process. I once worked with
a system administrator who thought it was a good idea to forcibly kill a
PostgreSQL database process as his main method of stopping the database
service. This not only scared me but showed me the system administrator
really did not understand the knock-on effect he could be inflicting on
himself if he persisted with this behavior. As a consultant working with him
at the time, I explained why this was a horrible idea and then stepped him
through proper procedure.
If you ever do have to kill a process, always try to follow these steps:
1. Get the process ID by using a process viewing tool like "top" or "ps."
There are numerous other signal options for the kill command, each
used for different situations. The “kill -l” command will give you a list of
all signals that can be used.
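As a minimal sketch of that escalation, assuming a misbehaving httpd service (substitute your own service name and PID):
# systemctl stop httpd    # first, try a graceful service stop
# ps -ef | grep httpd     # find the PID if a process is still hanging around
# kill -15 <PID>          # ask the process to terminate cleanly (SIGTERM)
# kill -9 <PID>           # force it only as a last resort (SIGKILL)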
Zombie Processes
Let's first understand what a zombie process is. A zombie process is a process that has terminated but whose exit status has not yet been collected by its parent, so the kernel keeps its process table entry around in the EXIT_ZOMBIE state. The entry is normally cleaned up when the parent process executes the wait() system call to read the dead process's exit status and any other information. Once wait() has completed, the EXIT_ZOMBIE entry is cleared. When this does not happen, it is usually down to either the parent process misbehaving or bad coding.
I once heard a very simplistic explanation for killing a zombie process.
“You cannot kill something that is already dead.”
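Zombies are easy to spot: top's Tasks line shows a zombie count (as in the earlier output), and ps marks them with a Z state. A quick, illustrative way to list them:
# ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'    # list zombie entries and their parent PIDs
Since the zombie itself is already dead, the remedy is to get the parent process to reap it, or to restart the parent, rather than trying to kill the zombie.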
Background Tasks
Services, when started, create background running tasks, largely because no system administrator wants to keep an active session running all the time, and because it would be plain silly to do so.
Background tasks can be viewed with tools like top or by running the ps command, but what do you need to do to send a task to the background? What happens when you start a long-running process and need to do other tasks? You could open a new window or console and run the task there. However, a better approach would be to send the current task to the background. The basic steps for sending a currently running task to the background are illustrated next.
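A minimal sketch, assuming an interactive bash shell, with sleep standing in for your own long-running command:
# sleep 600 &         # start a long-running task directly in the background with &
# sleep 600           # or start it in the foreground as usual...
(press Ctrl+Z)        # ...then suspend it, which stops the task and returns the prompt
# bg                  # resume the suspended task in the background
# jobs                # list background jobs and their job numbers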
If you want to bring the background task back to the foreground, you
simply execute the “fg” command.
Screen
A highly popular multitasking tool used by many is "screen." Most Linux system administrators will have used, or at least know about, "screen" and will most likely already know the basics. For those new to Linux, "screen" is a tool that allows the user to create sessions that run as background processes. The user can disconnect and reconnect to a session as they wish, which means a long-running script or process can be left running in a screen session while the user disconnects and goes home. In the past, the task or process would have been tied to the user's session, and once the user disconnected, the task would be killed.
Screen is found in most distros and can be installed quite simply by installing a package named "screen."
To use screen in a very basic way, all you need to know are the commands listed in Table 2-2.
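As an illustration of the basics (the session name is your own choice; the key binding shown is screen's default):
# screen -S patching        # start a new named session
(press Ctrl+a, then d)      # detach from the session, leaving it running
# screen -ls                # list running screen sessions
# screen -r patching        # reattach to the named session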
Tmux
With "screen" less commonly available by default, a newer tool being used is "tmux." Like "screen," "tmux" allows a user to disconnect and reconnect to a session, except that "tmux" has quite a rich set of features. I personally now use "tmux" on all my Linux platforms. The commands have become muscle memory, and I often feel lost when I work on a system without "tmux." It sounds strange to say that, but as Linux system administrators, we are often asked to troubleshoot issues, and this involves being able to multitask. We may need a window running a watch command with another window tailing a log. Flipping between these windows can be chaotic when you have tons of applications running. So to avoid this, using "tmux" allows me
to create a split screen and new windows within tmux. I am able to flip
between sessions, and best of all, I don’t have to leave the comfort of the
command line.
Very much like "screen," there are a few basic commands you need to know to start using "tmux." From there, you can expand your understanding by reading the man pages or the help.
Table 2-3 lists some of the common commands you will use in your day-to-day activities.
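A hedged sketch of the tmux equivalents, using tmux's default Ctrl+b prefix and a session name of your own choosing:
# tmux new -s patching          # start a new named session
(press Ctrl+b, then d)          # detach, leaving the session running
# tmux ls                       # list sessions
# tmux attach -t patching       # reattach to the named session
(press Ctrl+b, then %)          # split the current window into two panes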
Ansible Introduction
The role of the Linux system administrator has evolved over the last decade into more of an automation engineer role. More system administrators are writing automation code than ever before, and the traditional Linux system administration role is slowly becoming less important than it used to be. You may be reading this book either because you are trying to learn what you should be doing to stay relevant in the fast-moving Linux world or because you are new to Linux and want to learn how to start.
Installing Ansible
Installing Ansible is fortunately not too complex compared to some other automation tools. This makes good sense, as one of the driving factors for using Ansible is its easier learning curve. Ansible can be installed in two ways.
Package Management
The first and simplest way to install Ansible is through your distro's package management system, like dnf or apt.
Simply installing the Ansible package will work on most community distros, as they generally have Ansible available in their standard repositories. Enterprise distros like Red Hat Enterprise Linux, however, require separate subscriptions and access to different repositories. For those distros, ensure you follow the official documentation on how to enable the required repositories.
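For illustration, on Fedora or a Debian-family distro the install is a single command (on RHEL you would first enable the appropriate Ansible repository for your version, as per the official documentation):
# dnf install ansible     # Fedora, or RHEL with the correct repositories enabled
# apt install ansible     # Debian/Ubuntu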
Pip
Another way to install Ansible is through the Python preferred installer program, commonly known as pip. No subscriptions or extra repositories are required other than getting pip itself installed. Once pip is installed, Ansible can be installed via the pip install command.
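A minimal example, assuming Python 3 and pip are already present (the --user flag keeps the install in your home directory rather than system-wide):
# python3 -m pip install --user ansible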
Configuring Ansible
The heart of Ansible is the YAML you write that executes a task. For this,
there is very little that you need to configure. If you installed Ansible via a
package management system like dnf or apt, you will have configuration
files created for you. If you installed via pip or downloaded binaries, you
will need to create configuration files yourself.
The Ansible configuration file is called ansible.cfg and can be used
to customize Ansible within its limits. As an example, you can configure
where plugins or inventory files are stored if you wish to configure a
nonstandard environment.
Ansible can also be told where to find the configuration file by setting the ANSIBLE_CONFIG environment variable.
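As a minimal sketch (the project path and the settings shown are examples, not requirements), you could create a per-project configuration and point Ansible at it like this:
# mkdir -p ~/ansible-project
# cat > ~/ansible-project/ansible.cfg <<'EOF'
[defaults]
# use a project-local inventory file and roles directory
inventory = ./inventory
roles_path = ./roles
EOF
# export ANSIBLE_CONFIG=~/ansible-project/ansible.cfg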
Ansible Inventory
Before using Ansible, you need to know how to target systems that Ansible
will execute commands or tasks on. In Ansible, we do this with the help
of an inventory file. If Ansible is installed from a package management
system, the default inventory file created is /etc/ansible/hosts. This file
can be used as is, or you can edit your ansible.cfg to tell Ansible where
the inventory file can be located. Another common method is to specify the inventory when executing the "ansible" or "ansible-playbook" commands, using the "-i" parameter followed by the path to the inventory file. A very basic inventory file, grouping hosts into a webserver group and a database group, looks like this:
[webserver]
servera
serverb
[database]
serverc
Running Ansible
The Ansible command-line tooling is made up of a few binaries. The two most commonly used are "ansible" and "ansible-playbook." The "ansible" command can be used to execute single ad hoc commands directly against a host, whereas the "ansible-playbook" command is used to execute playbooks, which can contain many Ansible tasks. An ad hoc Ansible command used to ping all hosts in your inventory file can be run as shown next.
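As an illustration (the -i flag is only needed when you are not using the default inventory at /etc/ansible/hosts):
# ansible all -m ping                     # ping every host in the default inventory
# ansible all -i ./inventory -m ping      # the same, using an explicit inventory file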
Playbooks
Once you have graduated from running ad hoc commands, you will want to progress to creating playbooks. Simply put, a playbook is a way of running multiple Ansible tasks one after another. A playbook needs to start by specifying the host or group that the tasks will execute on. A variable file or list of variables can also be added, but for a very simple playbook this is not really required. The following is a basic example of a playbook:
---
- name: "Install webserver"
  hosts: webserver
  tasks:
    - name: "Install httpd"
      yum:
        name: "httpd"
        state: present
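Assuming the playbook above is saved as webserver.yaml (the filename is arbitrary) and the inventory file shown earlier is in place, it is run with the ansible-playbook command:
# ansible-playbook -i ./inventory webserver.yaml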
Roles
Playbooks can become quite complex, and often there are times when
you will want to reuse code. This is where Ansible roles become useful.
An Ansible role is a way of using Ansible to do a specific job. This could
be as simple as installing a package or as complex as deploying an entire
cloud platform. Typically, a well-written Ansible role should execute
without issue, out of the box. A default variable should be configured,
so if the user does not set anything, the role will still run. A good Ansible
role should also include a README.md file with instructions on how to
use the role. The role should also include metadata that can be used by
Ansible Galaxy.
[Role name]
-> [tasks]
--> main.yaml
-> [defaults]
--> main.yaml
-> [handlers]
--> main.yaml
-> [meta]
--> main.yaml
-> [vars]
--> main.yaml
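You rarely create this layout by hand; the ansible-galaxy command can scaffold it for you (the role name below is just an example):
# ansible-galaxy init my_webserver_role     # creates a role skeleton including the directories shown above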
Modules
Ansible modules are another important aspect of Ansible not many
people understand. If an Ansible role can be seen as a toolbox, the Ansible
modules can be considered the nuts and bolts.
In the "Playbooks" section a few pages back, there was a playbook example. In that example, I used the "yum" module to install the "httpd" package. This "yum" module is part of the standard Ansible collections and does not require any additional installation. The "yum" module in this example tells the system the play is being executed on (the "webserver" group) to use the "yum" binary to install the "httpd" package.
Some modules are much more complex than the “yum” module and
can be more complex to use. Fortunately, Ansible documentation is fairly
good and offers a good explanation of all the parameters and options
a module generally has. To view the documentation, you can either do
a quick Internet search or use the command-line help for Ansible. An
example can be to look at the help for the yum module:
# ansible-doc yum
from all corners contributing code to promote Ansible adoption. It's not only vendors but system users like ourselves who have been creating Ansible modules for almost anything you can think of.
Ansible Galaxy
Ansible Galaxy is an excellent way of sharing your Ansible content with the world. This not only gets your name out for others to recognize but also adds to the ever-growing library of Ansible content that can be used by everyone.
When confronted with the need to write a new Ansible role or module, always start by searching Ansible Galaxy for anything you could use or at least start with.
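You can browse galaxy.ansible.com directly, or search and install from the command line; the role name below is a placeholder for whatever your search turns up:
# ansible-galaxy role search apache             # search Galaxy for roles matching "apache"
# ansible-galaxy role install <author>.<role>   # install a chosen role into your roles path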
Web Consoles
Linux system administration has traditionally consisted of logging in to a system via ssh and running various commands to configure the platform as required. This can still be done today, but with the growth of Linux, system configuration was always going to evolve to include easier-to-use methods to accommodate newer users while they are learning.
Cockpit
Anyone who has built and configured Linux servers will know very well that desktops tend not to be used much on server platforms. Most of the time, all that is required is ssh and whatever software is needed for the system to do its job. Cockpit provides a lightweight web console for those occasions when a graphical view of a running system is useful.
Installation
As with most software installations on Linux, it is recommended to install Cockpit using your package management system. On Red Hat Enterprise Linux, this would be yum or dnf, and on Ubuntu you would use apt. The following is the installation for RHEL or Fedora:
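# dnf install cockpit
On Ubuntu, the equivalent would be apt install cockpit.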
Configuration
Once installed, ensure that the “cockpit” service has been enabled and
started:
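# systemctl enable --now cockpit.socket
Cockpit is socket activated, so enabling cockpit.socket is enough; the service itself starts on the first connection to the console.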
Finally, ensure that if you have your firewall running, port 9090 has
been opened for tcp traffic:
# firewall-cmd --list-all
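If port 9090 is not listed, firewalld on RHEL and Fedora ships a predefined cockpit service you can enable (it maps to 9090/tcp):
# firewall-cmd --permanent --add-service=cockpit
# firewall-cmd --reload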
Using Cockpit
Once installed and configured, you should be able to open a web browser
and enter the hostname or IP address of your Linux system with port 9090:
https://<Hostname or IP>:9090
The web console should now open and ask for credentials. The username and password can be those of any local user, and if you know the root password, you can use that too.
Within Cockpit, you can configure your network interfaces, add storage via NFS or iSCSI, and view logs. There are a few nice features, like the ability to join domains from the web console and to view which applications are installed on your system. Most of the configuration is self-explanatory and simple enough to understand.
Limitations
One of the main limitations of Cockpit is that it is a web console of the
running Linux server, which simply means the server has to be running
and the “cockpit” service needs to be working. You cannot use “cockpit”
to resolve boot issues and cannot use “cockpit” to install any virtual
machines.
Alternatives to Cockpit
Where there is one project in the open source world, you can be positive
there will be many more similar to it. This is exactly the same with Linux
web console administration. While Cockpit is the common option to use
and does provide quite nice features, it would be worth mentioning some
alternatives that can be used.
Webmin
Very much like Cockpit, Webmin allows you to configure various options
like adding new users or starting and stopping services. The
downside to Webmin is the slower product update release cycle. Where
Cockpit aims to release versions every two weeks, Webmin can go long
periods without updates. This can, however, also be seen as a good thing.
The best advice would be to compare these products for yourself and see what
works best for you.
Ajenti
Another really nice web console alternative to Cockpit is Ajenti. Very much
like Webmin and Cockpit, Ajenti provides a clean and easy-to-use web
console that allows the user to configure the Linux platform it is installed
on. The same limitations are present in all of these web consoles: they only provide configuration for the system they are installed on.
Text Consoles
If web consoles are not for you, then text UI tools may be something more
up your alley. Text consoles or “tui” tools provide quick configuration
options for the user when they are not familiar with all the command-line
parameters. A good example is configuring authentication on RHEL. In the early days of RHEL, you would need to dig
out the help and work out all the various parameters you would need to
get your command to be successful. Running a text UI for authentication
now gives you all the options that can be selected or deselected. You don’t
need to remember any parameters other than connection details. The
configuration is quicker and simpler with less room for error, something I
would advise when working under pressure.
Installing
There is not just one package to install for all tui consoles. Each application
would need to provide its own “tui” if the standard “tui” packages are not
installed.
Using the Linux system’s package management system, try installing
the “tui” by adding “-tui” at the end of the package name. An example of
this could be trying to use the NetworkManager text UI. If the package is
not installed, try installing the following:
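# dnf install NetworkManager-tui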
Using
Text consoles are simple enough to use and are generally self-explanatory.
The best way to get to know them is to start using them. I personally have
been using the NetworkManager “tui” to configure networking,
mostly because it is faster and I have to remember less.
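Once the package is installed, the NetworkManager text UI can be launched with the following command:
# nmtui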
Summary
In this chapter, you were introduced to the following:
CHAPTER 3
Estate Management
The last chapter discussed new tooling or ways of working, and this
chapter will continue in the same vein. However, it will look at the bigger
picture of your estate: what things you should be doing and what things
you should be avoiding.
Not only will this chapter look at how Linux estates have been managed in the past and how this can be improved on,
we will also delve into system building, system patching, and tooling you
should be using or should consider using. With this, we will briefly discuss
management software that can be used today. The main idea of this
chapter is to introduce you to how you can change your ways of working to
make your life easier when managing these large estates.
We not only will discuss technical aspects of Linux estate management
but will also look into ways to conduct proper planning and how to avoid
those awful midnight patching cycles requiring engineers to work out of
hours. This chapter will explore ideas to streamline as much as possible of the manual work that traditional Linux system administration used to require. We
will discuss how you can start cultural change within your organization
and how you can start driving conversations to promote innovation and
spend less time firefighting.
Toward the end of this chapter, we will discuss common bad practices
Linux system administrators sometimes do. We will then end the chapter
with some recommended good practices Linux system administrators
should start doing if not already.
Outdated Skills
More often than I would like, I meet Linux system administrators who
are not keeping their skills up to date. Some of these individuals only have
themselves to blame, and some unfortunately are not supported by the
organization they work for. Examples of this include not being familiar
with new changes and known issues in new releases of the products they
administer, often due to lack of training. Other examples include not
staying up to date with changes in the market and new trends being used
for platform management. Other issues involve organizations not wanting
Linux system administrators to introduce new technologies within their
environments. This often means the Linux system administrators would
need to experiment in home labs or sandbox environments, which some
might not have access to. For whatever reason, the Linux sysadmins’ skills
become outdated and leave them stranded with the organization that
refuses to support them.
Over Engineering
This brings me to something I think we all have been guilty of sometimes:
making a project overly complex for the simple requirement it was meant
to address. If I had a penny for every time I saw something that was just
completely overengineered for a simple task, I would be a wealthy man. Just
take a step back and ask yourself, do I really need to add all this complexity
into this script or into this piece of work? If the answer is no, then cut the
excess and keep the task simple. Remember the “KISS” acronym if you have been
guilty of overengineering in the past: “Keep it simple, silly.”
Shell Scripting
We all have written our fair share of shell scripts, and, yes, sometimes
it cannot be avoided, and for those situations you just have to grin and
bear it. However, when possible, try to use newer automation tooling to
manage your systems or execute the task you wanted to do. Management
software can also be leveraged to manage systems without the use of shell
scripts. Changing your approach to using alternatives to shell scripting will
begin your transformation into larger estate management from a central
location. Not to mention, less maintenance for you to do on old shell
scripts.
New starters to your organization will also not need to hassle you to
understand what your script does; you could just redirect them to read the
official documentation on the management tool you are using.
Today, there are fewer and fewer good reasons to write shell scripts other
than quick wrappers or quick scripts to test something. Scripts
should not be used for anything permanent and most definitely not for
anything in production.
Snowflakes
Every snowflake that falls to earth is unique; at least this is what I have
been told and read. This sometimes is how Linux estates have evolved.
Linux system administrators have built systems where each one has
become its own unique snowflake. Each Linux system in the estate
becomes so different, with its own unique and often overly complex
configuration, that no one in the organization knows what the system
does or how to rebuild it. These systems scare me more than anything.
They require so much effort to work out what is required when they need
to be rebuilt, and they become a liability if platforms need to be failed
over to a disaster recovery site.
Build Process
The Linux build process is normally something you think about or spend
a fair bit of time doing when you are building or managing a medium to
large estate. For the home hobbyist or individual user, you tend to not
be too bothered with this process and tend to just build your system
manually. In larger estates, it is not uncommon to be asked to build ten or
a hundred systems for different reasons. Building these systems manually
is not a good option anymore with how the industry is evolving.
The days of building systems that cannot be replaced are over. Systems
are now being treated more like cloud instances. If it breaks, drop it and
redeploy. This process makes sense as it saves time and energy. There is no need
to try to firefight the problem on the spot or even fix the root cause there
and then. Just drop the system and redeploy. Most systems send logs
externally, so troubleshooting and root cause analysis can continue later
when production is up and running.
Twenty years ago, this way of thinking would have got you some
interesting looks, and if management had heard, it could have got you escorted
out of the building. Fortunately, times have changed and today this way of
thinking is encouraged.
To understand how to improve, we need to understand what we are
doing wrong. For this, let us discuss different methods of deploying Linux
systems, what makes them worthwhile doing, and what makes them
something to avoid.
Network Install
Booting off your network through an NFS server is another method of
deploying a Linux system manually. This would still require boot media
where you will need to redirect the install to a network location. This install
method would require an NFS system running with the install media
exported and available for use. You would still need to run through the
install manually, and you would still need to ensure you do not just select
defaults if you are building a production system. This can be streamlined
if you have a kickstart file but will require some further understanding
of how kickstart files are written. Fortunately, we live in a world where
information is available freely on the Internet. There are many examples of
kickstart files and many forum questions and answers to get you started.
Templates
Even though there are ways of imaging physical machines, it won’t be
something you do very often. Virtual machines, however, are another story
altogether. As you are reading this as a Linux sysadmin, I'm assuming
you already understand the process of creating a virtual machine from
a template and have most likely done a few clones in the past. If not,
the process is quite straightforward. A Linux system is generally created
and installed manually on the virtualization platform. Once built and
configured, the virtual machine is then converted to a virtual machine
image or appliance, depending on the virtualization platform you are
using. This image can be locked or converted into a template to avoid accidental changes to the base image.
PXE Server
For Linux systems to build from a network, you will need to have a system
listening for build requests. This is known as a PXE boot server. This
system effectively allows a “fresh” system to boot from its network adapter
and prompt a user to select what they wish to deploy, that is, if you have
multiple builds you use. Default options can also be configured for a
system to automatically build without user intervention.
Typically, the network boot server would be something like a Red
Hat Satellite server or a Foreman server. If you choose to use your own
DHCP server, it will need to redirect the “next server” option to these
systems once a network address has been allocated. Personally, I would
recommend using the DHCP servers that
come with Satellite or Foreman. It makes life a bit easier to manage and
avoids having to get central DHCP systems configured. It also can reduce
complexity with firewall configuration if the traffic needs to span firewalls.
Satellite and Foreman can also be configured to listen on different network
interfaces, allowing DHCP and DNS segregation if you are concerned
about unwanted DHCP or DNS impact on your network.
Kickstart
Once a system has booted into the network installation, the PXE server
would need to be configured to deliver the installation instructions. This is
known as the kickstart file. Kickstart files are basically answer files for the
Linux installer. These files can be used for network installations and for
ISO or USB device installations.
A good kickstart file should be configured to deploy a basic installation
of the Linux distro you are using. With the main focus being around disk
layout and basic package installation, keeping a kickstart file simple
will allow you to use the same kickstart file for a variety of different
system types.
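As an example of how a kickstart file is typically consumed during a network installation, a RHEL or Fedora installer can be pointed at a kickstart file hosted on a web server with a boot parameter such as the following (the URL is a placeholder):
inst.ks=http://<Hostname or IP>/ks.cfg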
Hypervisor API
With the network installation option on the previous installation method,
you required the use of a PXE boot system. Deploying Linux systems from
templates, however, does not require the Linux operating system to be
installed as that was already done in the template. What you will need is a
method of speaking to the hypervisor that currently hosts the template for
your deployment.
This can be done with tooling such as the following:
• Ansible modules
• Puppet or similar
With one of the preceding methods, you can automate the provisioning
process for your hypervisor to create a virtual machine from a template.
The template creation can also be automated to pull down images from
the Internet if you want to streamline further.
Ansible Examples
For the preceding process, I personally recommend using Ansible.
Ansible offers an easier learning curve and is the current market trend
for automation. The Ansible modules are already available with Ansible,
and additional modules can be added by installing the required Ansible
collections. Other than the modules, there are more than enough examples
available online to show you how to automate almost anything you can
think of. In fact, if this is something you wish to pursue, have a look at my
GitHub repositories for some basic examples on Ansible. One particular
Ansible role I have been involved in developing over the last couple years
is one called “ansible-role-cornerstone.” This Ansible role helps the user
build virtual machines in VMware and Libvirt and also allows the user to
provision cloud instances in AWS and Azure.
https://github.com/kenhitchcock
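As a rough sketch of what provisioning from a template can look like, the following playbook uses the community.vmware collection to clone a virtual machine from a vSphere template. The hostnames, credentials, and object names are placeholders and would need to match your environment:
---
- name: "Provision a virtual machine from a template"
  hosts: localhost
  gather_facts: false
  tasks:
    - name: "Clone a new virtual machine from an existing template"
      community.vmware.vmware_guest:
        hostname: "<vCenter hostname>"
        username: "<vCenter username>"
        password: "<vCenter password>"
        validate_certs: false
        datacenter: "<Datacenter name>"
        folder: "<VM folder>"
        name: "<New VM name>"
        template: "<Template name>"
        state: poweredon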
Using Images
If the template deployment model is something you wish to pursue, it will
be worth understanding which approach is best to use: using a golden
image or using a catalog of images.
Golden Image
The “golden image” model involves one image or template created as the
base starting point for all systems. This image would be the central source
of truth that everything can be built with. Let’s have a look at some reasons
to use this approach vs. to not use it.
Use It
• One image to manage and maintain.
Don’t Use It
• If your image is not 100% correct, you can end up
with an estate of Linux systems that would need to be
rebuilt.
Image Catalog
This catalog can be virtual machine templates, images, or kickstart files. All
will have similar advantages and disadvantages.
If you decide to use a catalog of images or kickstart files, it is
recommended that you keep track of what the images are used for and
how they can be used for variations. The following is a basic example:
• [MySQL image]
• [PostgreSQL image]
This method of managing your build process does seem like a good
idea on the surface, but it is important to understand the additional work it
can bring. Let’s break this down into some advantages and disadvantages
for comparison.
Advantages
• Catalog of prebuilt images or kickstart files that can be
used to build systems quickly. Allowing a repeatable
build every time.
Disadvantages
• If you use templates for virtual machine cloning,
the templates will need to be patched and could be
forgotten.
The user portal should be able to integrate with systems like change
management platforms, where build requests can automatically be sent for
approval. Once a job has been approved, the portal should have the ability
to speak to an automation platform to kick off build jobs. Once the jobs are
complete, the user should be notified.
Using “t-shirt sizes” for system builds within the user portal will reduce
complexity for end users. There will still be a need to determine
what resources the user's build requires, but this can be done
with documentation or notes during the user request process.
System Patching
System patching is one of the most important jobs a system administrator
should be doing. Most companies or organizations that require
accreditation can be fined or worse if systems are not patched regularly.
For this reason, patching is planned and executed regularly with most
organizations. This often means patching or updates need to be done out
of hours and within certain maintenance windows, a painful task for the
poor sysadmin assigned the job, especially when the work ends up being at 1 a.m.
Let's first understand what the different update types are and then
understand how patching and updates can be managed in a streamlined
manner to potentially reduce out-of-hours work.
Update Types
Linux updates for enterprise distributions are made available to customers
who pay for subscriptions. We have gone over this before, but to
reiterate: these subscriptions are what divide enterprise from community distributions.
Enterprise Linux companies like Red Hat and SUSE will constantly be
releasing updates. These updates come in two forms: package updates
and errata.
Package Updates
Package updates make up the bulk of most system patching cycles. The
updates tend to be new features or new versions of the installed package.
Normally, during the update cycle, the package manager or package
install files would make backups of any configuration files that may be
impacted. However, never take it for granted that this will be done. I once
came across an issue where a product was updated and overwrote the
customizations made in the configuration file. My advice would be to
always ensure you have backups in place before any update cycles are run.
Errata
Another update type commonly found with Linux updates is the errata
update. Errata are the bug fixes and security updates. These make up
probably the most important type of update you will need to install.
It is in these errata you will receive the important security fixes when
vulnerabilities are discovered in a package or file. These errata must be
applied to any production environment as soon as possible.
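On RHEL-based systems, for example, available security errata can be listed and applied directly with the package manager:
# dnf updateinfo list --security
# dnf update --security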
Note As the sysadmin of your Linux estate, ensure that you are
getting all alert emails with errata releases. Being told as soon
as a new errata is released will help you plan the patching cycle,
especially if the errata contains security fixes.
Staging
When applying patches and errata to your organization’s systems, it is
vital to know that the new updates do not cause any adverse effects to any
running systems. To reduce that risk, it makes sense to stage your updates.
What I mean by this is that your patching should have a flow from your
lowest priority systems to your highest priority systems. You would start
by patching your lowest priority systems, like a sandbox environment. Run
automated testing or have a testing team sign the platform off to confirm
nothing has broken during the latest patch cycle. Once you have the
confidence that nothing has been affected by the new patch releases, you
can proceed with your next environment (Figure 3-3).
The example flow diagram only takes into consideration tests after the
sandbox environment and does not consider packages that differ
between platforms. For this setup, it would be beneficial to have
automated tests all the way through to preproduction environments. As
preproduction should be a mirror of production, you are best positioned
after preproduction has been patched to know if anything will break in
production after patching has been completed.
“yum update” on the systems they use. Hopefully, this is never the case,
but sometimes mistakes can happen, and it is useful to be able to avoid
unnecessary downtime by not having updates available in the first place.
Table 3-1 lists a few of the commonly used patch management systems
used today.
Red Hat Satellite: Used for RHEL 6 and up systems. Can be used for patching and system provisioning plus more.
SUSE Manager: Used for SUSE systems and can be used for patching and system provisioning.
Note In the next chapter, I will discuss the Satellite server a bit
more in detail. If you want to know more, I recommend reading some
of the official documentation for more information.
Planning
Patch planning sounds like something you would do in your sleep, and as
most organizations do their patching out of hours, this is probably what
happens when the systems are being updated.
Having a solid plan for system patching is almost as important as the
patching itself. This plan would allow all systems to be patched in a timely
fashion and avoid the risk of systems not getting updates in time and being
exposed to the vulnerabilities the updates were designed to fix.
Rollback
On the rare occasion that a system patch causes more harm
than it fixes, you will need to know how to roll back the system to a working
state. This can be done in a few ways.
Restore Snapshot
Virtual machines can be snapshotted, and typically it is a quick process.
Restoring from a snapshot can sometimes take longer if the snapshot has
been running for a longer time, but typically this process takes seconds.
Reinstallation of Packages
The slightly more irritating approach would be the removal and
reinstallation of the defective package. The problem with this approach
is the discovery process to find the offending package would take most
of your time if you are patching a large number of packages and systems.
Although this would solve the problem, you will be spending time you may
not have during your patch window.
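On RHEL-based systems, the package manager's transaction history can help track down and revert the offending update, for example:
# dnf history list
# dnf history info <transaction ID>
# dnf history undo <transaction ID>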
Redeployment of System
The sledgehammer approach would be to blow away the system and
redeploy. Something that can be done in lower priority systems like
sandpit but definitely not something most organizations will do in
production.
When relying on backups for recovery, you will need to know what directories and files are important to back up. You will
need to understand how to restore from these backups and finally what the
best options are for faster recovery. Directories typically worth backing up include the following:
/etc
/home
/root
/usr
/opt
/srv
/var (be sure to exclude logs or anything large not required)
Disaster Recovery
As an organization, it is vital that production platforms remain up as
much as possible. This could involve many different solutions and should
involve redundancy at all levels. When those plans fail in the completely
unprepared scenario, there needs to be a plan to recover from disaster.
The goal of disaster recovery is not to ensure all single points of failure are
covered but rather how to return to production.
Stretched Clusters
Technically, this is not a disaster recovery solution, but it does allow
data centers to fail over between each other, reducing downtime and
giving the ability to switch data centers when maintenance is required.
This solution, however, does require infrastructure that can be clustered.
Everything from storage through to networking equipment will need to be
configured in such a way that failover is possible.
Infrastructure As Code
As most organizations have already started to embrace the world of
automation, this method of disaster recovery should not seem completely
strange.
If everything deployed and configured in your estate is automated, all
that would need to be recovered to continue operating would be the code
to execute your automation. If this is backed up and restored across data
centers or cloud platforms, the automation could then be run to rebuild
all systems required by your organization. This approach would require a
high degree of organization and would involve a strict build process that
only allows systems to be built from code.
There is the element of actual data that would need to be restored in
the event of disaster, which in itself would need a complete book written
on the subject to address all the complexities involved in creating the
perfect solution.
Cloud
Very much like having another data center, using cloud platforms like AWS
or Azure can provide an excellent platform for disaster recovery. Having
an entire cloud platform automated to build a replica of our on-premise
systems could provide an ideal fast failover. Ideally, this platform, if not
used for production, could be turned off to save costs. Then, in the event of
disaster, the cloud environment could be powered on and traffic redirected
while issues on-premise are resolved. This solution would require significant
investment on your part to ensure configuration is replicated from the
on-premise systems, and you would still need to work out how data can be
replicated to ensure no data is lost. Out of all the disaster recovery options,
this one could be one of the cheaper options, as once the platform is built,
it could be powered off. Factoring in only data costs and the cost of reserving
IP addresses, the cloud platform could potentially lie dormant until
required.
Firewall Disabled
Local Linux firewalls can be a pain to maintain and configure when
running thousands of systems, but their importance cannot be stressed
enough. Just as patching systems in a secure network reduces the risk of
further damage if an intruder does manage to breach your network,
local Linux firewalls provide another obstacle for the would-be intruder.
Automate the firewall configuration on build and use configuration
management platforms to ease the pain of managing these firewalls. They
could make the difference one day.
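A minimal sketch of what this could look like with Ansible, assuming the ansible.posix collection is installed and that ssh is the only service you want to allow:
---
- name: "Enforce a baseline local firewall policy"
  hosts: all
  become: true
  tasks:
    - name: "Ensure firewalld is installed"
      ansible.builtin.package:
        name: firewalld
        state: present

    - name: "Ensure firewalld is running and enabled"
      ansible.builtin.service:
        name: firewalld
        state: started
        enabled: true

    - name: "Allow ssh through the local firewall"
      ansible.posix.firewalld:
        service: ssh
        permanent: true
        immediate: true
        state: enabled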
Running As Root
Logging in to a system as root is not something anyone should be doing in
a production environment. Production does not always mean customer-
facing systems either. Development environments with developers
actively working can also be regarded as production. Logging in directly
as root removes any audit trails and gives full permissions for someone to
accidentally cause an issue. Always log in with your own credentials and su
to root if you need to. This practice needs to be followed by everyone and
not just standard users.
Good Practices
The following are some of my personal opinions on what constitutes estate
management good practices.
Source Control
Anything you develop to manage your estate should be stored in a source
control platform like Git. The code should be peer reviewed and
should absolutely never be executed in production until rigorous testing
has been done. We all have the best intentions when we write code and
can often be blinded by the mistakes we make. A second pair of eyes can
sometimes make all the difference.
Summary
In this chapter, you were introduced to the following topics and
discussion points:
CHAPTER 4
Estate Management
Tools
Managing larger Linux estates can be challenging if not done properly.
Trying to manage thousands of Linux systems following techniques and
tooling from 20 years ago will leave you in a heap of trouble, not least
when compliance scanning shows holes in your environment. Not only
will you find massive amounts of security vulnerabilities that could give any
security person heart palpitations, it will also leave you with a depressing
amount of remediation work.
To avoid these issues, the use of management software is highly
recommended. Even with a modest number of Linux systems to manage,
management platforms will only make life easier. The day-to-day tasks can
be automated, the build process streamlined, and the dreaded security
remediation offloaded to the management platform to handle for you.
Some tooling does come with a cost, and for that reason, it is important
to also know what community options are available. Very much like we
discussed in the earlier chapters, we will do a similar comparison. The
idea behind this chapter is to get you familiar with management platforms,
what they are used for, and how they can make your life easier as a Linux
sysadmin.
Management Systems
There are two kinds of management systems we will look at in this chapter:
Linux platform management systems and automation platforms. For each
type of management system, I will explain what the system does and the
basic concepts of the tool. To be very clear from the start, this book is not
an official guide on how to use these platforms. All I am trying to do is get
you familiar with what the tools do and how they could benefit you.
Obviously, you don’t always need all the preceding functionality, but
it does help to have the features available in case you start evolving your
ways of working. An example of this could be your organization’s decision
to start using more cloud facilities. Having a tool with cloud provisioning
abilities will save you having to use another platform or writing your own.
Red Hat Satellite: The premier enterprise Linux estate management tool from Red Hat. Used to manage estates of RHEL 6 and upward. The product has been around for almost two decades at the time of writing.
Foreman: A community product used for managing the Linux system build process. Foreman is the upstream for Red Hat Satellite 6.
Katello: A community product that provides content management for Foreman. Katello is another product used by Red Hat Satellite 6 as its upstream equivalent.
Pulp: A community product that manages package repositories for Linux systems. Pulp, like Foreman and Katello, is another upstream product for Red Hat Satellite 6.
SUSE Manager: The enterprise product from SUSE to manage SUSE platforms. SUSE Manager is based on the community product Uyuni, which in itself is a fork of the Spacewalk project.
Spacewalk: Spacewalk has in the past been used as the upstream for Red Hat Satellite 5. Today, it remains a community Linux platform management tool that has been abandoned by its developers.
Uyuni: A community platform management system that provides system provisioning and patch management capabilities. Configuration management is handled by SaltStack, and it features the ability to run compliance scanning. Uyuni is a fork of the Spacewalk product with integrated SaltStack. Uyuni is also the upstream for SUSE Manager.
EuroLinux: Another community Linux estate management tool which appears to be also forked from Spacewalk with SaltStack integration.
The Decision
The product you use will be heavily weighted by your organization’s
needs. Often, regulatory compliance will dictate if you use enterprise vs.
community products. The features the product should have tend to be dictated
by decision makers above you who do not understand what you as a Linux
sysadmin do or what the products do, leaving you potentially with a
product that is more of a hindrance than a help.
My advice is to build a case for the product you feel is correct not only
for you but for your organization. For that, you will need to be decisive
and show a clear reason or reasons why the product you prefer to use
is the best for the job. If the product
is an enterprise product, you will also need to justify costs and prove
it is better than its competitors. Depending on your company’s way of
working, a presentation with advantages and disadvantages should be a
useful exercise, possibly with a comparison of features between different
products.
To support your decision and to build your case, you must be confident
the product is the right product for you. To do this, you must be familiar
with it and understand its limitations. This can be achieved by doing the
following:
• Have the vendor demo the product: If the product is
a paid-for product, ask the vendor to come visit you.
Request a demo of the product to be shown to you and
your company’s decision makers. This will increase
your chances of getting the product you want if it has
more visibility within your organization. Decision
makers will then have all the information available to
them to make an informed decision.
Satellite Server
The first and probably the one most people will know is the Red Hat
flagship management system: Red Hat Satellite server. Originally released
in 2002, Satellite was based on the upstream Spacewalk community project
until Satellite 6.x was released in 2014. Since then, Satellite 6 has been
based on a number of upstream products all combined together to provide
the latest Red Hat platform management system.
Satellite 5
Red Hat Satellite 5.x worked quite well as an overall Linux estate
management system. Satellite provided patch management, system
deployment, compliance scanning, configuration management, and
general estate management functionality.
Some interesting points I always ended up spending more time on
were around system deployment and configuration management. Both
were problematic in one way or another to use.
Configuration Management
Over the course of its life, Satellite 5.x improved from version to version
but had one major issue: its configuration management system. This
attempt at configuration management was just awful. The configuration
management used a concept of storing configuration files that would be
pushed to client systems. Unfortunately, this had the habit of sprawling
into chaos as more and more config files were stored. The config could be
versioned, but it was extremely painful to manage and often ended up in a
real mess.
System Deployment
The system deployment used in early Satellite used Cobbler along with
PXE boot mechanisms. Getting the deployment system to work sometimes
proved to be quite challenging at times. I spent many hours tweaking
config to get systems to deploy only to later find out I didn’t set correct
permissions or I was missing a package. Later versions improved and
became easier to install. Possibly a combination of me gaining experience
and the documentation improving.
Satellite 6
The current major release of the Satellite server is version 6.x. Satellite 6.x
is based on a combination of products including the following:
• Foreman
• Katello
• Pulp
• Hammer
• Candlepin
Content Management
The best feature for me that was introduced with Satellite 6 was the
complete overhaul of the content management system. Previously in
Satellite 5 and in Spacewalk systems, the content was segregated into
“channels.” These “channels” required cloning from one to the other
to create a staging flow for you to apply content from dev to test to
production. If this doesn’t make sense, don’t worry; it confused enough
people when I tried to show them in the past. I will break it down a bit
more in the “Spacewalk” section a bit later to explain a bit more.
Content Views
Fortunately, Satellite 6.x has provided a better solution with the help of
Katello. The new system no longer uses “channels” but instead uses a new
concept called “content views.” A “content view” is a collection of content
that Satellite can provide to a system. This content can contain puppet
modules, Ansible roles, or standard yum repositories. Where a “content
view” really shines is in its ability to be versioned. This means, as new
packages are downloaded on Satellite or new puppet modules are added,
the “content views” previously versioned are unaffected, meaning any
systems allocated to those “content views” will not see the new content.
This is perfect if you wish to stage your content through your life cycles.
Life Cycles
Content views are useful to group content, but they do need to be used by
systems. To do that, registered systems are added to different life cycles.
These life cycles can be called whatever you like, but generally they are
given boring names as follows:
Library (Default) ➤ Development ➤ Test UAT ➤ Pre Production ➤
Production
Content views are then associated with life cycles, which in turn are
associated with systems.
The basic flow in Figure 4-1 shows how content views are updated,
published, and pushed into different life cycle environments, thus allowing
the migration of content like package updates or errata to be moved from
test through to production environments.
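As a rough illustration, assuming the hammer CLI is available on the Satellite server, publishing a new content view version and promoting it to a life cycle environment could look something like the following (all names are placeholders):
# hammer content-view publish --name "<Content view>" --organization "<Organization>"
# hammer content-view version promote --content-view "<Content view>" --organization "<Organization>" --version "<Version>" --to-lifecycle-environment "<Life cycle environment>"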
System Provisioning
With the introduction of using Foreman instead of Cobbler, the Linux
deployment process has been simplified in some ways and complicated
in others. The complexity has mostly been brought in around ensuring
that organizations and locations have been configured for all components
of the provisioning process. Things like “operating systems” and “network
subnets” all need to be added to the correct “organization” and “location.”
Once you have gotten your head around the grouping issues, the rest of
the configuration becomes a bit more straightforward than the previous
Cobbler configuration.
One major thing to note about the system provisioning process is
that when you deploy your Satellite, you do need to ensure that you add
the features for system deployment. The official Red Hat documentation
explains the process quite clearly and provides all the parameters you
will need. If you prefer to not use the documentation, you can also look
at the “satellite-installer --help” command for more parameters.
They are self-explanatory and should make sense when you see them.
My recommendation is to stick to the official documentation when you
install your Satellite for the first time. Once you have one done, the help
command is useful to remind you of what you need.
System Patching
Patching systems registered to Satellite is no different from previous Satellite
versions. The systems still need to run “yum update” to get the latest
content. Remote execution can also be used like it used to be used in
Satellite 5.x for mass execution across the entire estate. Personally, I would
recommend using your automation platform to do this, but this would be
up to you on how you wish to manage your estate.
Configuration Management
Configuration management has drastically been improved with Satellite
6 from Satellite 5. Early versions of Satellite 6.x only used “Puppet” for
configuration management and SOE (Standard Operating Environment).
One downside of Puppet is that Puppet requires puppet agents running
on client systems. These agents often need to be configured to check in to
the Satellite Puppet master to ensure they are kept in line with expected
configuration.
The Puppet content would be stored within the “content view”
associated with the client system registered to the Satellite server. The
Puppet client would then check in with the Satellite Puppet master and
would then run through the content available to check if anything new
needs to be applied or corrected. If the Puppet agent was stopped on the
client system, the configuration would not be applied.
Later versions of Satellite 6 introduced Ansible as another option for
configuration management, which, very similar to Puppet configuration,
required a “content view” to contain all the Ansible roles and configuration
you wished your system to be configured with.
The Puppet or Ansible configuration would also be version controlled
with “content views” and would also require publishing and promoting for
updated content to be made available to the systems registered to Satellite.
SUSE Manager
SUSE Manager 4.x is the latest SUSE Linux platform management tool.
SUSE Manager 4.x is based on the community product Uyuni (ya - uni).
Uyuni
Uyuni was originally forked from the Spacewalk project but has diverged
so much from the original that it has become a tool of its own, which is
refreshing to know as Spacewalk has been around for a long time and had
its fair share of issues.
Support
Uyuni and SUSE Manager have their own challenges I’m sure, and those
with more experience with this tool may know all about them. The SUSE
Manager configuration is not drastically too dissimilar to Red Hat Satellite
and is self-explanatory. Both products have excellent documentation and
provide enterprise support.
Foreman
Foreman is one of the main upstream projects for Red Hat Satellite 6.x.
The main function of Foreman is to assist with the provisioning of Linux
systems; however, Foreman does have the ability to be extended in its
functionality by adding plugins.
Provision Hypervisors
One nice feature of Foreman is that it has the ability to provision not only
Linux platforms but also virtualization hypervisors. A very handy ability if
you are looking at automating your estate to the “nth” degree.
Plugins
Foreman, however, by itself does not assist in content management or
configuration management like Satellite 6.x. For that, you will need to
combine Foreman with extra plugins. The plugins range from Katello for
content management all the way through to configuration management with tools like Puppet or Ansible.
Spacewalk
As the management tool that started it all for Linux platform management,
Spacewalk deserves to be mentioned, if only briefly.
Abandoned
Spacewalk unfortunately has been abandoned by its developers and has
been left to others to fork and evolve the project. Uyuni is one such project
that has taken what Spacewalk started and is currently building quite a
nice-looking tool. Canonical, the Ubuntu distro company, is another that is
using a Spacewalk variation for its platform management.
Network Provisioning
Spacewalk also introduced more people to Cobbler and kickstart
deployment, taking Linux system building to a new era of deployment.
Having the ability to boot a system off the network and selecting a kickstart
file to use really took some of the pain away from running around with
physical media. The fact that the kickstart files would automate the install
made life even easier, and those who were up for the challenge could
create their own snippets of code to configure the newly built system that
bit further.
Environment Staging
Environment management with channels meant that package cloning
could ensure that environments did not get updates unless the Linux
sysadmins deemed it so. This was the same with errata and bug fixes.
Provisioning Tools
Another type of management system that can be used to manage
your estate is a dedicated provisioning tool. Technically, Foreman is a
provisioning tool and so is Satellite, but a dedicated provisioning tool
would act as a single interface into all aspects of your portfolio. If you
deploy both on-premise and in the cloud, having a provisioning tool would mean
that all operations could be executed from one location.
Cloudforms
Red Hat Cloudforms is an enterprise provisioning tool based on the
ManageIQ upstream project. Red Hat acquired ManageIQ in December
of 2012 and continued to drive the adoption of Cloudforms through the
next decade.
State Machines
With the integration options available, virtual machines and cloud
instances can be created with what is known as a “state machine.” A “state
machine” is written with Ruby on Rails code to provide the automation
steps required to build and configure your virtual machine or cloud
instance. In the newer releases of Cloudforms, the ability to use Ansible
instead of Ruby on Rails has become available.
A rather large downside of state machine development was the
complex setup required to get it to work. This was not something that came
out of the box and often required someone with experience to assist in
getting it to work. Even then, the process was still complicated.
Chargeback
Another really nice facility is the ability to control chargeback for any
systems that are being built in the estate. Cost centers or similar can be
configured to manage estate costs and can be billed to different teams or
departments.
Request Approvals
When users request a new system or platform, approvals can be configured
that need to be passed before any automation can be executed. Multiple
layers of approvals can also be configured, allowing change control teams
to approve builds. Something very useful if you want to stay in control of
what is built.
Advantages
• Very simple to install as Cloudforms is deployed from a
template appliance.
Disadvantages
• Cloudforms has a steep technical learning curve.
Terraform
Terraform is another interesting product to use if you wish to provision to
different platforms. Provided by HashiCorp, Terraform is an open source
infrastructure as code solution that has the ability to provision across
multiple environments such as AWS or Azure.
Products Available
HashiCorp provides a few options for using Terraform outside of their
enterprise support.
Community CLI
There is the standard community CLI option that is available to everyone
who wishes to learn and use it. Most if not all Terraform functionality is
available for you to start using the platform from day one.
Pipeline Tooling
Jenkins or Tekton could be a useful way of connecting to the management
system's API to automate deployments or system patching. The API calls
can be triggered, and the events could be caught; from that, different logic
could be used to determine next actions. This could be an interesting way
to introduce self-healing capabilities in your estate.
Automation Platforms
Using Ansible or similar is a cleaner and better approach to contacting
the management system's API. Ansible, for example, has plenty of modules
available that already speak to different management tools through
their API. An example of this is the new Satellite modules that are now
available for users to automate Satellite configuration. Something useful
for managing your patching cycles when you can automate the promotion
and publishing of content views.
Shell Scripts
Not the best solution but it is something you could use if all you wanted
to do was automate some basic tasks. Personally, I would not take this
approach; I would rather write some Ansible to do the work for me.
Summary
In this chapter, you were introduced to the following:
CHAPTER 5
Automation
This is the first chapter in which we will target a specific discipline,
automation.
In this chapter, we will delve into the dark arts of manipulating systems
by the hundreds if not thousands. We will discuss what the best tools are
and why you should use them or avoid them. We will look at how these
tools differ from each other so you can make an informed decision of
which tool works best for you. We will then look at what the market trends
are for these products and why some people prefer one tool over the other.
This chapter discusses automation in general and does not focus
on one particular product. The idea is to understand the concepts of
automation and how they should be applied in the best possible way.
We will discuss topics such as “when you should automate vs. when you
should not.” We will explore using techniques to automate automation and
when that should be done.
Finally, we will end the chapter discussing best practices and using
shell scripting, in which we will discuss different shell scripting languages
that can be used.
A
utomation in Theory
Automation should not be anything new to most people reading this.
There has always been some form of automation in what we have done in
the past, be it custom shell scripts or some management tool scheduled to
kick off a job.
Automation has evolved quite a bit over the last decade, with new
tooling and automation platforms being developed. The days of using
custom scripts that are executed from cron jobs are coming to an end if
not already. Complex automation solutions are now managing everything
from system builds to self-healing systems.
Automation does not only have to be technical either. Most
organizations are now looking at solutions to automate business processes
along with their technical estate management. Building in automation
to create change requests or raise support tickets is becoming more a
requirement than a luxury. The time and effort saved is what brings more
organizations to the realization that having no full-stack automation in the
pipeline spells disaster for keeping up with competitors.
Idempotent Code
The number one thing that all automation should be adhering to is
ensuring that the code written is idempotent. This effectively means that
the code will only make a change if the state does not match the required
state from the automation platform.
An example of this could be updating a system package. If the system
has a package installed that is already at the latest version, you would not
want the automation task to do anything except confirm the package is
at the version requested. By not doing anything but confirming the state
of the package, the system remains untouched. If the package required a
service restart, the reinstallation or updating of the package could have
resulted in a tiny outage. This is possibly a poor example as handlers can
also be used to ensure there are no outages.
Always remember when writing automation code:
“Is my code idempotent?”
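As a small sketch of what idempotent automation looks like in practice (assuming a RHEL-like system where the NTP client package is called chrony), both of the following Ansible tasks only make a change when the system is not already in the desired state:
---
- name: "Idempotent configuration example"
  hosts: all
  become: true
  tasks:
    - name: "Ensure chrony is installed"
      ansible.builtin.package:
        name: chrony
        state: present

    - name: "Ensure the chronyd service is enabled and running"
      ansible.builtin.service:
        name: chronyd
        state: started
        enabled: true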
Reasons to Automate
To automate should be the default today; however, if you need reasons,
here are a few worth mentioning:
• Infrastructure as code
• Code as documentation
• Reduced risk
• To encourage innovation
These preceding points are not really good reasons, but more excuses
in my personal opinion. The world of estate management and estate
building is rapidly changing today, and not automating should not be an
option. As Linux sysadmins, our job has changed whether we like it or not.
We no longer are Linux sysadmins, we are now automation engineers.
State Management
Another very important thing to understand about different automation
platforms is their ability to manage system state. Some platforms only check
state when code is being executed for a specific task, whereas other platforms
constantly check the systems they manage for their current state to see if
anything has been changed. When state change is detected, the configuration
is updated to match the desired state from the automation platform.
Automation Tooling
We can all agree automation is not going away anytime soon, and to not be
left behind, it is important to understand what tooling you should be using.
Adopting automation practices will require a set of tools and development
languages that you will need to learn; which ones to use will require you to
make an informed decision.
Over the next few pages, we will discuss the different options available
for you today and discuss what makes them good or bad to use.
YAML
“YAML Ain’t Markup Language” is the main language used by automation
platforms such as Ansible and SaltStack. YAML is one of the easier
scripting languages to learn as most of the syntax is quite simple to
understand and remember.
YAML in Action
The following two examples use the popular automation platforms Ansible
and SaltStack. Both examples provide the same result, which is to install
the “httpd” package.
Ansible
---
- name: "Build Linux Web server"
  hosts: webservers
  become: true
  tasks:
    - name: "Install latest apache httpd package"
      ansible.builtin.yum:
        name: httpd
        state: latest
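Assuming the play above is saved as webservers.yml and an inventory file defines the webservers group, it could be run with something like:
# ansible-playbook -i inventory webservers.yml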
SaltStack
websetup:
  pkg:
    - installed
    - pkgs:
      - apache2
Ruby
Ruby is a high-level all-purpose programming language that was designed
to be a true object-oriented language. Ruby is similar in ways to Perl and
Python except how it handles its instance variables. Ruby keeps these
variables private to the class and only exposes them through accessor
methods like “attr_writer” or “attr_reader.”
The following is a basic example of Ruby code. It is not often you would
write Ruby code for automation tasks unless you needed to write a new
function or something along those lines:
#!/usr/bin/ruby
def build(opt1 = "Linux", opt2 = "Windows")
  puts "The system that will be built is #{opt1}"
  puts "The system that will be built is #{opt2}"
end
build
Puppet DSL
Puppet, which is covered later in this chapter, uses its own declarative, Ruby-like language for configuration. The following is a basic example of a Puppet class:
class helloworld (
  $file_path = undef
){
  notify { "Hello world!": message => "I am in the ${environment} environment" }
  unless $file_path == undef {
    file { $file_path :
      ensure  => file,
      content => "I am in the ${environment} environment",
    }
  }
}
Python
Python is used in a few ways to automate tasks. You can use Python to
write your own scripts in the same vein that you could write Ruby or
any other scripting language. Python, however, tends to be used mostly
for the modules or functions used by the likes of Ansible and SaltStack.
Below is a snippet of some basic Python code showing a simple function:
def add(x, y):
    result = x + y
    return result
Shell Scripting
Shell scripting can be used for automation but is not recommended,
mostly due to the fact that other automation languages and tooling have
such a rich array of modules that connect to most platforms through
their API.
Shell scripting by default does not really work well as an idempotent
scripting language. To implement idempotence would mean a fair bit of
extra coding.
Automation Platforms
With a basic idea of what the automation scripting languages look like, it
now makes sense to talk about some of the automation tooling you can use
that leverages the previously discussed scripting languages.
Reasons to Use
• Already in place
• Save time
Reasons to Avoid
• Added complexity
Agentless
Ansible does not require any agents to manage its client systems.
Connections are made to client systems via ssh on Linux or Unix-based
systems and WinRM if connecting to a Windows system. Authentication
can be done by entering the system’s password when executing the Ansible
or through the use of ssh keys. Most Linux sysadmins tend to use the ssh
option, mostly as the connections to the Linux or Unix systems will be
seamless and not prompt for a password. This could be a major irritation if
you are running a playbook against a hundred systems.
If ssh keys are not a possibility, there are other options that can be
used, but this would require all the systems to authenticate to a central
location. Else you will be stuck with entering different passwords for
different systems.
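For example, generating a key pair and copying the public key to a managed system can be done with standard OpenSSH tooling:
# ssh-keygen -t ed25519
# ssh-copy-id <user>@<Hostname or IP>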
Using Ansible
Ansible is available in both enterprise and community versions. Both
enterprise and community products have two “types” of Ansible that
can be installed. There is the graphical interface that can be used and a
command-line version.
Command Line
In Chapter 2, we briefly discussed how the Ansible command-line tool
can be installed and used. We covered how there are added benefits of
installing with a package management system like Yum vs. installing via
pip. We also discussed how to run some basic commands.
The Ansible command line is often referred to as Ansible Core, but the
naming might be changing if not changed already. Red Hat is working hard
at making Ansible better all the time and is constantly working on how
Ansible can be used or integrated with other products; for this reason, the
name may change to fit the usage.
One thing to remember: If you choose to use the command-line
version of Ansible only, it does not matter too much if you use the
community or enterprise version from a functionality point of view. Both
products have the same functionality last time I checked. The biggest issue
would come around support if you needed help.
• Flexible
• Scalable
• Enterprise support
• New features being developed that may not make
it to AWX
AWX
When Red Hat acquired Ansible, Ansible Tower had not been open
sourced. To correct this, Red Hat developers and engineers worked on
open sourcing Ansible Tower as fast as possible. The resulting product is
the AWX project.
AWX is almost the same as its enterprise equivalent (Ansible
Tower), except for a few enterprise features. One example is that role-based
authentication has been excluded from the community version. If a user
requires these features, they would need to purchase an Ansible Tower
license.
• Flexible
SaltStack
Another python-based configuration management tool that can be used
is SaltStack. SaltStack comes in both community versions and enterprise
supported versions.
• Direct SSH
• Agent and server
Remote Execution
In a similar way to how Ansible connects to systems and runs ad hoc
commands, SaltStack has the ability to execute remote execution
commands. This functionality is very similar to what the Satellite server
currently does and what the Spacewalk server used to do.
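As a hedged illustration (the target patterns are examples), remote execution from the Salt master looks like this:
# salt '*' cmd.run 'uptime'
# salt 'web*' pkg.install httpd
The first command runs uptime on every minion; the second installs a package only on minions whose IDs start with "web".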
Configuration Management
SaltStack is a bit more traditional in how configuration management
has been done in the past. Configuration is managed on the SaltStack
master and pushed to systems that have changed or require updated
configuration. The SaltStack master system controls the state of its clients
(minions) that it manages by understanding both the state each
system needs to be in and the events that have been triggered on the minion
systems the master is watching. If anything changes that should not,
the SaltStack master reverts the unauthorized changes.
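A minimal sketch of a Salt state file, assuming a hypothetical /srv/salt/ssh.sls and an sshd_config file stored on the master:
# /srv/salt/ssh.sls
openssh-server:
  pkg.installed

/etc/ssh/sshd_config:
  file.managed:
    - source: salt://ssh/files/sshd_config
    - mode: '0600'

sshd:
  service.running:
    - enable: True
    - watch:
      - file: /etc/ssh/sshd_config
Applying this state (for example, with salt '*' state.apply ssh) brings minions to the declared configuration, and scheduled runs keep them there.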
• Modular approach.
• Massively flexible.
• Scalable.
• Feature rich.
Puppet
Puppet is a Ruby-based configuration management system that was used
quite a bit more before the introduction of Ansible.
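A minimal Puppet manifest sketch (the class name is illustrative), showing the declarative style the Puppet agent enforces:
# ssh.pp
class profile::ssh {
  package { 'openssh-server':
    ensure => installed,
  }
  service { 'sshd':
    ensure  => running,
    enable  => true,
    require => Package['openssh-server'],
  }
}
Each agent run compares the system against this declared state and corrects any drift it finds.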
• Idempotent platform.
Chef
Probably one of the most widely used products for configuration
management behind Puppet is Chef. Chef is a Ruby-based product like
Puppet and works in a server agent architecture. Chef originally was a
mixture of proprietary and open source components; however, since April
2019 Chef declared they would be open sourcing everything that is Chef.
True to their word, at the time of writing Chef has a community
version of their configuration management tool available for download
and use.
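A minimal Chef recipe sketch (recipe and package names are illustrative), again Ruby-based like Puppet:
# recipes/webserver.rb
package 'httpd' do
  action :install
end

service 'httpd' do
  action [:enable, :start]
end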
Managed Service
A managed service is available from Chef for the organization or small
business that does not wish to build any on-premise systems. There are
additional costs involved as with any managed service.
On-Premise
Community
As Chef is now open source, the community open source version of Chef is
available for download and use but does come with the standard warning
of no support.
• Flexible
Market Trends
Looking at what the current trends are in the market or with your
competitors can help you make a slightly more informed decision. I'm
not advocating that you follow the herd for better or worse; I'm suggesting
you look at what is working or not working for others. Market trends
do not tell you how much effort has gone into setting the environment
up, nor do they show you the return on investment, but they do give you a
better idea of whether one product is being used more than another. The last thing
you want to do is invest time into something that will get replaced in
6–12 months.
• Documentation.
These points will help when you present your findings to your
organization's stakeholders and justify the time spent in trialing the different
products.
These vendors can neglect to explain the training required or the time it
will take to get everything in a position to be useful. This can mislead new
users into thinking the platform comes configured out of the box, which
almost always ends with the platform not being used. When you spend your
time testing and trialing products, ensure you factor in the effort that will
be required to get your organization in a position where they can be using
the product as effectively as possible.
State Management
One very important feature you want from your automation platform or
estate management tool is the ability to keep your estate’s state managed.
You want to configure your estate management or automation platform
tool to monitor and correct any configuration drift that may occur.
Whether it happens by accident or through malicious intent, you want to make
sure that you have no nasty surprises the next time there is a system reboot.
Controlling the state of all your systems will ensure your systems run
exactly as they were when first built and tested. This is crucial in reducing
firefighting further down the line.
Enterprise Products
Estate management tools that provide the best standard operating
environment configuration capabilities are usually enterprise products
unfortunately. Red Hat’s “Satellite server” product or SUSE’s “SUSE
Manager” product is among the best choices for a Linux estate today. Both
will either include Puppet or SaltStack. These products are quite good at
allowing you to manage the “state” of your systems in your estate. Their
implementation is quite different and will require some upskilling time.
The positive thing though is the documentation is quite good, and as you
are paying for the service, support is also available if you need help with
anything. Most support companies will do their best to help, up to a point.
Enterprise support does not mean the vendor will provide professional
services for free, but they will do their best to keep you happy. In the
end, you are paying for a product, and it is in their interest that you keep
using it.
The Mistake
Where this use case is interesting is when a simple mistake is made.
A new Linux sysadmin has been given the task of debugging a login
issue on one of the preproduction systems in the estate. During the
debugging process, a simple typo was accidentally entered into the
ssh_config file. Unfortunately, the typo, if enforced, could cause the sshd
service to fail on restart. As the Linux sysadmin is unaware of this change
to the ssh_config file, they do not restart any services as they don’t believe
anything was changed. Why would they restart services, especially when they
don't want to cause any unwanted outages, no matter how small?
Safety Net
Due to the fact that the organization was smart enough to have a
configuration management tool listening for unwanted configuration
changes, the sysadmin’s typo never matured into a problem.
The victim system’s local Puppet agent checked in with the Puppet
master on the Red Hat Satellite server shortly after the Linux sysadmin’s
typo and brought the configuration back in line with what was deemed
to be the correct configuration when Puppet was configured for the
environment.
If it were not for the safety net, a simple mistake like a typo could have
caused an outage and delayed the debugging of any problems that followed.
If this were a production system and an outage was extended because of
sloppy work, there could have been bigger implications for the unfortunate
Linux sysadmin.
Yes, in this example the issue could have been worked around by logging in to
a console, but what if the change had been something a bit more serious,
like the GRUB configuration? The system might not have booted after its
scheduled reboot, effectively creating a problem where there never should
have been one in the first place.
Setting Up a SOE
To avoid issues similar to the use case explained, it is vital that the estate
configuration state is managed. To have a successful SOE platform
configured, you will need to make sure you have your estate management
tool configured in accordance with good practices. This will require proper
planning, preparation, and in some cases organization culture change.
Source Control
Any code that is being written to manage your estate should go
through a proper code development process. This means the following
should happen:
Phased Testing
Phased testing or staged testing is the process of testing your automation
and configuration management through different environments before
reaching production. The approach should be similar to the following.
Code Development
• Code developed and tested in a sandbox environment.
Code Promotion
• Code can then be promoted into the first live
testing environment. Usually, this is a development
environment or similar. Remember that development
environments are still live, as they have users on
them, so caution is recommended. If possible, avoid
environments that cannot have downtime.
Self-Healing
Having your platform fully automated is an amazing achievement that any
Linux sysadmin should be proud of. Taking the next step is what will bring
your worth to your organization to a whole new level.
Building a platform that can heal itself when disaster strikes is the
next major advancement all Linux sysadmins should start working toward. In the
past, giving your platform the ability to recognize system outages and
apply solutions without you having to lift a finger was something you only
saw in sci-fi movies. Today, you can do this with an array of tools or
self-developed techniques.
Self-Healing Layers
There are a few layers at which self-healing can occur. There is a hardware
layer, the “platform” layer, and finally the application layer.
Each of these layers has its own areas of failure and has its own
methods of recovery. If we start from the bottom up, let’s look at the
hardware layer.
You would need clever logic that would need to pay attention to
platform monitoring and logging tools. These tools can bring the attention
of your self-healing platform to alerts, events, and logs. Not only should all
hardware be reporting the health of its own components, hardware should
also be looking for clues on how hardware closest to it is performing.
In the event of hardware failure, the self-healing system should
have automated decision points to run specific actions. In the event of
a completely unknown situation occurring, a fail-safe option should
be triggered which could be as simple as flagging a major incident and
getting a human involved. Learning from these incidents will improve
your platform, so don’t be disheartened while you are still perfecting the
platform. It is impossible to predict every possible scenario.
The basic flow of your automation healing for your hardware should be
similar to the following.
Reporting
Automated Recovery
When to Self-Heal
Building your automation and self-healing platform should be based on
when you want self-healing to occur. Do you want to pay attention only
when a failure has occurred, or do you want to watch for signs that a
problem is imminent?
Do you want to be proactive or reactive?
I’m sure you can agree that it is much better to catch a problem before
it can become a bigger issue later on. By being proactive, you can then
have your self-healing system resolve issues in a less intrusive manner.
This could allow your system or systems to properly shut down and
rebuild. Failovers can then be scheduled when traffic has decreased, and
nodes can then be drained gracefully instead of forcefully. By allowing
rebuilds to occur in a controlled manner, outages can be scheduled to
accommodate organizational processes like proper change control. With
this flow, everything can be automated including the admin work to get
platform change approval.
Gates
Just like you need gates to control the water flow in a canal, you will need
gates to control your self-healing environments. These gates should be
stop points to validate what has just happened. You wouldn’t want to start
a lengthy process of remediation if a simple solution was not checked first.
That is, has a reboot worked?
Machine Learning
Another way of developing your own self-healing platform could be done
by teaching it to understand what to do under certain conditions. For
this, you need to build a neural network of scenarios. That could include
looking at logs, alerts, and events. Your self-healing logic should then
weigh the percentage of data that is matched to a condition that will
trigger a sequence of automation tasks. Tasks can then be run from estate
management tools and automation platforms. Most of the tools available
today also have the ability to integrate with service desk platforms to raise
incidents and even have the capability to initiate communication with a
standby engineer.
Machine learning though is a steep learning curve and requires a
certain aptitude to master. If it is something you are interested in, spend
some time playing with a few basic examples and take it from there.
The Internet is a great place to learn from and is full of people with nice
examples you can try.
One important note about starting with machine learning is that
you will need some hefty number-crunching hardware to train your
machine learning models. If you are fortunate enough to have a decent GPU, you
could start with that. Adding more GPUs will decrease your training
time, but with the general shortage of GPUs today, you could face an uphill
struggle with this one.
Off-the-Shelf Products
If you have the budget and don’t have the time or patience to build your
own platform, you could look at some of the “off-the-shelf” options.
These products can provide a fair bit of functionality but may not give
you everything you may need. Pay close attention to what is included out
of the box and what you need to configure yourself. It could be a glorified
automation engine that still requires you to write the logic yourself or,
worse, forces you to buy the vendor's professional services to build the logic for you.
Dynatrace
One product I have heard about in the past is Dynatrace. I know very little
about the product as I have never used it before and am not making any
recommendations based on what I know, so I will leave it up to you to
research and check for yourself. I only remember Dynatrace because I
attended a presentation at Red Hat Summit in Boston a couple years back.
During the presentation, the presenter explained how they used
Dynatrace to manage their own estate, and while the presentation was
happening, there was an outage with one of their web servers (possibly
staged for effect). The Dynatrace platform basically kicked in and started
running its diagnostics and potential remediation work. It did, however,
notify the presenter that there was a problem, which I guess any monitoring
platform could do anyway.
The impression I got though was that Dynatrace could do some smart
things and could potentially be a good option. Like I mentioned, do your
own checking and testing before jumping in with both feet.
Code Libraries
Most if not all major automation platforms have vast libraries of online
examples and code you can download. If your first step is to make a
decision on which automation platform to use, your next step should be to
find the equivalent online library of code or modules.
Ansible
In the Ansible world, you can make use of the Ansible Galaxy. The Ansible
Galaxy has a vast array of different Ansible roles you can search for and use
by the tags people have associated with them.
https://galaxy.ansible.com/
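For example (the role shown is a well-known community role, used purely for illustration):
# ansible-galaxy search nginx
# ansible-galaxy install geerlingguy.nginx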
Puppet
Like Ansible, Puppet has a great library similar to Ansible Galaxy called
Puppet Forge.
https://forge.puppet.com/
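For example (module name purely for illustration):
# puppet module install puppetlabs-apache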
SaltStack
SaltStack doesn’t quite have the same setup but does have a GitHub
location called SaltStack formulas with tons of content.
https://github.com/saltstack-formulas
Metadata
Once you get yourself in a position to contribute code back to the
community you have chosen to follow, make sure you understand how to
format and build your code so the correct metadata can be used to catalog
your work. Nothing is more frustrating than providing code that no one
can find or use.
Using examples from your chosen provider can help get you started.
Download an example and “borrow” the code from there.
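Staying with the Ansible example, Galaxy metadata lives in the role's meta/main.yml; the values below are purely illustrative:
# roles/myrole/meta/main.yml
galaxy_info:
  author: your_name
  description: Install and configure nginx
  license: MIT
  min_ansible_version: "2.9"
  galaxy_tags:
    - nginx
    - webserver
dependencies: []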
Things to Avoid
A few things to avoid when automating, no matter what platform you
decide to use.
Shell Scripts
If possible, use modules and code provided by your automation tool
of choice. Ansible as an example has a rich library of Ansible modules
available. Not everything is installed by default anymore but can be
installed with Ansible collections. Similar approaches may be available
with other platforms. Always investigate what the best way to run a task is
before you resort to a shell script.
Shell scripts tend to be non-idempotent and would require further
code to ensure they were. Using a prebuilt module that speaks to a
platform API or similar would handle all the extra coding for you and leave
you with neat clean code.
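To illustrate in Ansible terms (task names and the package are examples), compare a shell task with the equivalent purpose-built module:
# Avoid: a shell task is not idempotent on its own
- name: Install nginx with a shell command
  ansible.builtin.shell: dnf install -y nginx

# Prefer: the module only makes a change when one is needed
- name: Install nginx with the dnf module
  ansible.builtin.dnf:
    name: nginx
    state: present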
Good Practices
As there are things you should always avoid when writing automation
code, there are also a few good practices you should include in your
working methods.
Debugging
Remember to remove extra debugging steps or tasks added to your code.
Debug output may be useful when you are testing and developing your code,
but it can look untidy in production. It could also earn you some unwanted
questions from people who do not understand why there are red lines
everywhere when they execute your automation tasks.
Cleaning up debug output is also well recommended when you are sharing code
online through different sharing portals. Getting used to doing it when you
start is the best way to continue doing it.
Source Control
Commit your code to GitHub/GitLab or whichever git provider you prefer,
but do it often and always before you stop for the day. Not only is your code
kept safe, but you also build a portfolio of your work for others to use.
Summary
In this chapter, we discussed the following:
CHAPTER 6
Containers
This chapter goes in a slightly different direction from previous chapters.
This will be the first chapter where we discuss organization workload
and how to manage that workload. Previously, we discussed platforms,
automation, and general Linux system administration. We will touch a
bit more on platforms toward the end of this chapter but will be more
structured around the major topic of this chapter: containers.
This chapter will delve into the world of containerization and how you
can manage workloads within them. We will discuss what containers are,
how you can get started with them, what you should be doing to manage
them, and the dos and don'ts. Finally, we will end the chapter on how you
can manage a full estate of containers using tools available today.
The goal of this chapter is to help you get a basic understanding of
containers and the orchestration tools you can use to manage them.
Getting Started
As a Linux sysadmin, you have most likely already heard of containers; you
may already be using them in your organization.
Containers are the next major evolution in Linux estates. It is
important that as a Linux sysadmin you are fully aware of what they are,
how they are built, and, most importantly, how they are managed.
Simply put, a container is a set of one or more processes and files
managed within its own isolated environment.
Container History
The idea of a container is not a new concept at all and as a concept has
been around longer than Linux itself. Containers started their conceptual
journey in the late 1970s and early 1980s with the first introduction of using
chroot to create isolated environments. Later in the early 2000s, Solaris and
FreeBSD expanded the idea with practical implementations of platforms
that provided segregation.
Container Runtimes
Container runtimes are what make it possible to run a container on your
system. A container runtime allows the container to speak to the host
kernel and run processes.
The original container runtimes were simple and could run in isolated
environments, but over time these runtimes have become more complex
and have evolved where multiple layers are required to manage containers
in complex environments. For you to understand the full flow of how
containers are created and managed today, there are three categories you
need to understand about container runtimes:
• Container engines
Native Runtimes
Native OCI runtimes run their processes on the same kernel of the host
system where the OCI runtime is running.
Note Due to the fact that the host shares its kernel with the native
runtime, there is a concern that a compromised container could
impact the host it is running on. For this reason, you should always
understand all the security issues that you could potentially be
building into your containers.
Some examples of native OCI runtimes are runc, crun, and containerd.
Sandbox Runtimes
Virtual Runtimes
There are two main CRI options today that are capable of doing the
preceding steps. They are containerd and CRI-O.
Containerd
A high-level runtime developed by Docker with runc under the covers,
Containerd contains all the functionality of a CRI and is regarded as a good
example of a CRI.
CRI-O
CRI-O is a slimmer implementation of a CRI. Red Hat is currently
supporting the integration of CRI-O into Kubernetes and their OpenShift
product. Docker was removed in favor of moving to a CRI type
architecture, thus enabling the flexibility of switching low-level runtimes.
Container Engines
The final category of container runtimes you need to understand is the
layer where you can actually do some container creation. This layer is the
container engine. Just like a virtual machine requires a hypervisor to run
on, containers require a container engine.
From the diagram in the “Virtual Machine vs. Container” section, you
can see where the container engine layer exists between containers and
the operating system. This is the container engine.
Table 6-1 lists the two common container engines that are used today;
they are Docker and Podman. Throughout this chapter, we will be using
Podman as the container engine for any container examples or exercises.
Docker: Released in March of 2013. One of the first mass-used container runtimes.
Podman: Unlike Docker, Podman does not run an underlying daemon to run containers.
Docker
Today when I speak to people about containers, they often still refer to
containers as “Docker containers.” Docker was the first real container
engine most people used; many still use Docker and still swear by it.
If you are a Docker or Podman person, it does not matter too much if
you are just using it on your laptop or test lab; in the end, all you want to do
is create a container based on an image.
Docker, however, has become a bit more difficult to install since I first
used it. In the past, the Docker binaries could be installed with dnf or
yum, but now you may need to have separate repositories enabled or have
special subscriptions. If Docker is the choice you wish to go with, you will
need to read the documentation.
I have managed to install “Docker” on my Fedora system using the
following command:
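(The exact package depends on your Fedora release and repositories; the distribution-packaged Moby engine shown here is an assumption on my part.)
# dnf install -y moby-engine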
Once Docker has been installed, you may want to read the man pages
on how Docker is used.
You will need to understand how to start a container, find out if the
container is running, and how to delete the container when you are done.
Table 6-2 lists some of the docker parameter options that can be used.
Podman
Podman came a while after Docker and is similar to Docker in how
containers are created and managed. One major difference between
Podman and Docker is that Podman does not require a service or daemon
to be running. Docker relies on a daemon that manages runc containers on
its behalf, whereas Podman launches runc containers directly.
All Docker commands should work with Podman; the help and man
pages from Podman will also be a great source of information when you
are starting.
Podman and Docker can use the same images and Dockerfiles, so if
you find any Docker examples they should work with Podman too.
Installing Podman is as simple as running the install command
for your local package management system. In the case of Fedora, the
command to install Podman is
# dnf install podman -y
Once installed, the Podman man pages are a good place to start:
# man podman
If the man pages are too long to read and you just want to get
started, run
# podman help
Similar to Docker, you can search for images, and you can list local
images and containers. If you have any Dockerfiles, you can use those
to build custom images if you like, and most importantly you can create
containers.
Podman is simple enough to get your head around, and there are
plenty of examples for both Podman and Docker online.
If you are not familiar with either Docker or Podman, do not worry too
much. We will be running through some practical examples for you to try
shortly.
Container Images
If you were to build a virtual machine, you would need to create a “virtual
machine shell” in your hypervisor, boot the virtual machine, and install
an operating system. Containers, as they share libraries with the operating
system, typically do not need their own operating system installed.
Instead, container images are created with the files and libraries required
for the container to run its workload.
Container Registries
You can imagine that container image variations could grow quite large;
just by thinking of a few examples alone, you can see the number growing.
For that reason, it is important to store these images for later use. No one
will want to create a new image each time they have a particular workload
they wish to deploy. If you have had any experience building application
servers, you will understand that some configuration can be quite time
consuming. Having to repeat the configuration process for each new
environment is not something I would recommend.
This is where container registries become useful; they not only
store the custom images that you create for your organization but also
the libraries of downloaded images for particular workloads you may
have. Instead of building a php image, for example, you could find a php
container image with everything available to run your php application.
Container registries are available to you in a few ways. There are
cloud or Internet registries where you can pull images you may need. You
can then customize these images and push them to your private cloud
repositories if you choose, or you can push them to your local on-premise
registries.
Cloud Registries
Cloud registries are a great way to work with images if you have a small
estate to manage. Just like it does not make sense to build an estate
management platform for a small number of systems in your estate, the
same is true for a small container estate. If all you are using containers for
is some basic applications that do not change often, hosting your images in
a cloud registry makes perfect sense.
Companies like IBM and Google have cloud registry options for
you to host your container images. Depending on your organization's
requirements, Google could be a good place to start. They offer a $300 free
tier for testing Google services, which includes registry options. After the
trial is finished, there will of course be a cost involved; just like evaluating
estate management tools, you will need to work out what works best
for you.
Local Registries
If you have a deeper requirement for container image storage, you may
want to consider looking at on-premise container registry options. The
options available will drastically depend on the level of service you will
need. This could be as simple as just having a place to store container
images in a disconnected or air-gapped environment, or it could be more
complex in that you require image scanning for security reasons. Whatever
your requirements, trial and test to confirm what works for you.
With community products, you get basic functionality and in most cases
quite nice features. With enterprise products, you gain all the goodness
around security and compliance scanning.
When choosing, you will need to consider everything again from price
to features.
Containers in Practice
Now with the basics of containers covered in theory, let's get some hands-
on experience with creating container workloads.
Prerequisites
For this section of the book, you will need to have the following available to
you if you wish to test some of the configuration that will be discussed.
Shopping List
• A Linux system with root privileges
System Prep
Before we can create any of the containers or configurations, you need to
prep the system you will be using.
Install Packages
Install the Podman, Docker, or whichever runtime packages using the
official documentation if you are not familiar with the process. In most
cases, this will be as simple as running “dnf install podman -y” or “apt-
get -y install podman”.
Note For this section, I will be using Podman; for that reason, it may
be worth using Podman to avoid extra config or google searching.
Creating Containers
Over the next few pages, we will cover the basic hands-on experience you
will need to start working with basic containers. The goal for this section is
not to make you a container specialist but more to show you how to create
a simple container environment to gain experience with.
Searching for an image with a command such as
# podman search nginx
would list all the available images that contain the nginx keyword. Pay
attention to the number of stars and whether the image is from an
official source:
INDEX      NAME                      DESCRIPTION                STARS   OFFICIAL
docker.io  docker.io/library/nginx   Official build of Nginx.   15732   [OK]
...output reduced
# podman images
Running a Container
If you have found the container image you wish to use and have managed to
download or pull it successfully, you can run a basic container instance of that
image on your test system. To run a basic nginx container from the previously
downloaded nginx image, you run a command similar to the following (mapping
host port 8080 to the container's port 80):
# podman run -d -p 8080:80 docker.io/library/nginx
95bf289585a8caef7e9b9ae6bac0918e99aaac64d46b461180484c8dd1efa0a4
The "-d" option in the command tells podman to detach from the
running container and leave it to run in the background. The "-p" option
maps a host port to the port the container listens on.
Running Containers
Once you have created your container, you may want to see if it is running.
The simplest way of doing this is to run a container list command as per
the following:
# podman ps
From the list, you can see all the containers you have managed to start
on your local system. The nginx example is running on all interfaces and
listening on port 8080.
Figure 6-2.
The screenshot in Figure 6-2 shows the nginx serving requests on the
localhost on port 8080.
Tagging Images
The first thing you will need to do when you have local images downloaded
is to tag them with your internal podman registry. This way, you can
instruct podman to push images to the local registry instead of a remote
registry. Think of it as a way to change the path of a container image.
To tag an image, you run the podman tag command. If we take the
example of nginx that we have been using so far, we can tag the nginx
image with the following command:
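(Assuming a local registry listening on localhost:5000, the registry image's default port, the command would be similar to this.)
# podman tag docker.io/library/nginx localhost:5000/nginx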
Pushing Images
With the nginx image tagged, the next step is to push or upload the
nginx image to the local repository. This can be done with the following
command:
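(Again assuming the localhost:5000 test registry; --tls-verify=false is only acceptable for a local registry running without TLS.)
# podman push localhost:5000/nginx --tls-verify=false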
Remote Registries
If you created a podman registry on a different host and exposed the
registry on the network interface instead of the loopback address, you can
tag and push your images to that address too if you wish. Just be sure to
open any firewall ports to allow traffic through to the podman registry.
The same can be said for any on-premise image registry; as long as you
have the ability and permissions to push images, the podman tagging and
push commands will allow you to use local registries.
Customize an Image
So far, all we have done is use container images as they are without adding
any of our own customizations.
What good would a web server be without any content, right? The
same can be said for container images; what's the point of running an nginx
or apache web server if you don't host any web content on it?
Let's have a look at how to add our own custom content to a
web server.
Dockerfile
To understand how to add some basic customizations to a container
image, we will need to use a build file. This build file is most commonly
referred to as a Dockerfile. Both Podman and Docker can use Dockerfiles.
These Dockerfiles are used to create any customizations you would like
in your container image. Think of these files as image install files.
To use a Dockerfile, all you need to do is create a new file called
Dockerfile. Do not change the name or add any extensions. The file needs
to exist in the current directory, or you need to specify the location when
you run the podman build command.
Example
Like before, let’s run through an example. For this example, we are going
to build a CentOS image with apache httpd installed on it. Once the web
server packages are installed, the example will pull down an example
HTML file from my GitHub account. Finally, we will run a new container
with the new image.
Dockerfile
Next, you will need to create a Dockerfile. Remember that the Dockerfile
should be named exactly as “Dockerfile.” Ensure that you are in the same
directory as your Dockerfile when you try to build your new image.
The Dockerfile in my example will pull the latest CentOS image it can
find if you have not already pulled one. Once the image is available, yum
will install both the “httpd” and “git” packages. These will make up all the
packages required for our custom image. Feel free to add anything else you
want to use like PHP. Once the packages are installed, a git clone will pull
down the source code for our web content and move it to the /var/www/
html directory for the web server to use. In this example, I wrote a very
basic HTML page. This can be anything you wish, so replace it with your own
content if you want to try something a bit different.
FROM centos:latest
RUN yum -y install httpd git; \
git clone https://github.com/kenhitchcock/basicwebapp.git; \
mv basicwebapp/index.html /var/www/html/index.html
CMD ["/usr/sbin/httpd", "-D", "FOREGROUND"]
EXPOSE 80
Build Image
To build the image based on our earlier Dockerfile, ensure you are in the
same directory as the Dockerfile and then run the podman build command:
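(The tag used here is an assumption, chosen to match the image name that appears in the podman ps output a little further on.)
# podman build -t localhost/centos .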
Create Container
With the newly built container image and the custom content, we can run
and test the new image:
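(Reusing the tag from the build step and publishing port 80, as reflected in the output that follows; both values are assumptions.)
# podman run -d -p 80:80 localhost/centos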
To double-check that your container has actually started, you can run the
following command:
# podman ps
CONTAINER ID IMAGE COMMAND
CREATED STATUS PORTS NAMES
08832f29f46e localhost/centos:latest /usr/sbin/httpd -...
24 hours ago Up 12 minutes ago 0.0.0.0:80->80/tcp elated_jepsen
Delete Container
To delete a container, you first need to stop the container, and then you can
delete it. This can be done with similar commands to the following:
# podman stop 08832f29f46e
# podman rm 08832f29f46e
08832f29f46edab6bdd41227a542bf494f926831d099a0a83ee8838bfe71fdf9
Container Practices
With a better understanding of what containers are and how to manage
them, you now need to understand what constitutes good and bad practices.
Cloud Native
The first thing you need to understand when working with containerized
workloads is what cloud native means.
The simplest explanation of cloud native is that it is the practice
of using cloud technologies to deploy workloads in a lightweight and
fast manner.
Cloud-native tooling typically involves automation, scalable
platforms in private or public clouds, containers, service meshes, and
generally immutable infrastructure. The use of these tools and many
others can enable high rates of production releases. Netflix is an
excellent example of this: Netflix pushes around 100 production releases
a day through lightweight, fast workloads that are streamlined to
production by using automation and other tooling.
Good Practices
Keep It Small
The number one rule to running any container or cloud-native workload is
to keep the workload as small as possible. It is not recommended to create
workloads that are in the gigabyte size range. The smaller the workload,
the better for deployment and scalability. If your workload demands
a larger footprint, you potentially need to rearchitect how the workload is
written. This could mean breaking the workload down into microservices and
working from there.
Always push back if you are forced to create large workload deployments.
The benefits of running smaller workloads will pay off in the long term.
Dynamic Deployment
Workload deployment should never be done manually. Code should be
committed to your source control and pushed through to production.
Make use of pipelining tools, source control webhooks, and anything else
that can trigger workload deployment.
A basic example of what this should look like can be seen in Figure 6-3.
Figure 6-3.
Scalable
Any workload or application that will be deployed in a cloud type
environment or be considered cloud native must be scalable. The ability to
scale up when demand increases is vital to good cloud working practices.
If the workload you are deploying cannot scale dynamically, you need to
consider rearchitecting the workload. Not being able to scale dynamically
is symptomatic of dated workloads and potentially old code.
“Does It Cloud”?
Just because you are deploying into a cloud platform does not mean
your workload is cloud native. There are many other things that make
workloads cloud native, but the three questions you should ask when you
want to know if your workload is for the cloud are
With the preceding questions, you can now ask yourself, “does it
cloud”? If your answer is no to any of the questions, you have work to do
before you can migrate or deploy to a cloud environment effectively.
Do not fall into the trap of trying to turn virtual machines into
containers or of lifting them into cloud-native style hyperscalers. Just because
you can does not always mean you should. The pitfalls of doing this will come back
to bite you later on when you are not able to take advantage of the benefits
of cloud computing. Large workloads can be inefficient and wasteful,
negating the cost saving you may have been expecting.
If you need big workloads, then container or cloud platforms may
not be what you need right now. Take a step back and look carefully at
the workload first, refactor code, and break monoliths down into smaller
applications that can “cloud” as the cloud was intended.
Bad Practices
There are many good practices and many bad practices; the bad ones are often
referred to as antipatterns. Here are a couple of the more common practices
that should be avoided when possible.
Different Images
The temptation can be to use different images for different environments,
as it seems like a more secure method of building workload images.
However, building test images for test, development images for
development, and production images for production opens the possibility
for differences to occur that are not tested and signed off. It is very possible
that the image used in your test environment has no vulnerabilities
while the image used in production does. For this reason, migrate the
application-baked images through your environments. This way, you
ensure security checks are done, and code is properly tested and most
importantly signed off for production use (Figure 6-4).
Figure 6-4.
The basic idea on how images are tested and promoted should be
similar to Figure 6-4.
Container Development
In this chapter so far, we have touched briefly on how containers can be
developed. We have explored some simple good and bad practices and
hopefully given you a good idea what cloud native is. For this section, let’s
understand how you can create a meaningful workload using container
development.
Development Considerations
Coding Languages
Writing code for containers is no different than writing code for your local
development environment or laptop. You can still choose and use your
favorite development language and can still push code to your favorite
source control platform. There is no hard and fast rule that says you cannot
use one particular language or the other. However, not all development
languages are created equally. Using older languages may not translate
to the cloud as effectively as newer ones. Before starting to write any new
application, spend some time looking at some of the following options.
Table 6-3 lists a few development language options that are used today
within containerized applications.
Code Editor
To write useful code, you need to practice and have an editor that works
well enough without breaking the bank. There are a few available you can
use, but it always comes down to personal preference and what features
you are willing to live without. Table 6-4 lists a few code editor options that
can be used.
Tip VSCode is free to use, has great plugins, and is quite simple to
use. Before you spend too much time with other editors, try VSCode
and change if you find something better.
Source Control
No matter which source control platform you wish to use, just make
sure you use one. Not using source control is a massive mistake for any
developer or organization. You lose the ability to peer review code in an
effective centralized manner, and you risk the loss of code. It is not worth
taking the risk. Table 6-5 lists source control options that can be used to
control your source code.
Note Git is probably the most popular source control system today.
Get familiar with it asap.
Container Tooling
Once you have your code developed and container ideas in place, you
will want to start working on streamlining your container image creation.
There are many ways to do this, both right and not so right. You will also
have a fair few tools you can choose from.
CI/CD
The first area to look into is your container delivery system. This is known
as your continuous integration and continuous delivery system. These will
help deploy your workload into your various environments and give you
the flexibility to do much more with your container images or workload
deployment. Table 6-6 lists a few options available for CI/CD pipelines.
Jenkins: Popular free open source tool that's easy enough to use and has loads of plugin options
TeamCity: Integration with Visual Studio, useful for Windows development and testing. Has both free and proprietary options
GitLab: Has the ability to build and run tasks directly from your GitLab repositories
Travis CI: Can automatically detect commits in GitHub and run tests on a hosted Travis CI platform
Tekton: Another open source CI/CD tool that supports deployments across different cloud or on-premise platforms
Jenkins Example
Jenkins is one of the more popular pipelining tools to use today and is
free to use for testing. To see what Jenkins pipeline code looks like, the
following is a basic example using pseudo code:
node {
def app
stage('Clone repository') {
/* Basic comment about cloning code*/
checkout scm
}
stage('Build image') {
/* Build your container image */
app = docker.build("jenkinsproject/helloworld")
}
stage('Test image') {
/* Run your unit testing of some type */
app.inside {
sh 'echo "Tests passed"'
}
}
stage('Push image') {
/* With a verified image, push your image to a registry */
docker.withRegistry('https://someregistry.com',
'registry-credentials') {
app.push("${env.BUILD_NUMBER}")
app.push("latest")
}
}
}
From this Jenkins example, you can see that stages are used in the
pipeline; you can add as many as you like for different tasks. You may want
to add a stage for security image scanning as an example. Ideally, you want
to build in as much automation and testing as possible.
Image Registry
As you develop your applications and container content, you will need a
place to store these images. It is ok if you want to test and build when you
need to, but as a good practice, it is recommended to start storing your
container images as you start building your application portfolio. This
practice is highly recommended if you are going to be deploying anything
into a live environment.
Previously in this chapter, we discussed how to build a podman
image registry; to extend on that, look at providing storage to ensure your
containers are not ephemeral. Podman, for instance, has the ability to
create volumes; those volumes can be mounted in your container when
you create them.
Using orchestration platforms like OpenShift or Kubernetes can
provide image registries but are often ephemeral by default. Ensure you
have storage volumes mounted so you do not lose any of your images.
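As a hedged Podman sketch (volume and container names are illustrative), a local registry backed by a persistent volume could look like this:
# podman volume create registry-data
# podman run -d -p 5000:5000 -v registry-data:/var/lib/registry --name local-registry docker.io/library/registry:2
The volume keeps pushed images on disk even if the registry container itself is removed and recreated.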
Tip VSCode is a great option if you need something that’s free and
easy to use. Overall, for me it’s a winner, but test it for yourself.
Linting Tools
Before pushing or committing any type of code, be it YAML or Dockerfiles,
make use of linting tools. For Dockerfiles, there is a nice online linting tool
into which you can copy and paste your Dockerfile content to be checked:
www.fromlatest.io/#/
DevSecOps
A keyword in today's world of platform management is DevOps. DevOps
is a vital set of practices and tools that bridge the gap between developers
and operational teams. DevSecOps is an addition to this concept, where
everyone is responsible for security.
DevSecOps Tooling
DevSecOps empowers both developers and operational teams to
understand security requirements and build security into their tooling.
Pipelines
In a standard situation where there are no DevOps or DevSecOps practices
used, security teams are required to scan and report issues every time
a new system or platform is built. Security teams are responsible for
the organization’s security and the ones who would have to answer the
difficult questions if a breach is ever experienced. For this reason, they are
meticulous in their scanning and ensuring no vulnerabilities are exposed
in live environments. This process can involve additional security tools
and can take time to be completed. This can also be a frustrating job if new
platforms or systems are released constantly.
By following DevSecOps practices, security considerations can be built
into pipeline or image building tools. With this process, developers and
operational teams take responsibility for security, thus greatly reducing
back and forth with security teams.
Security Gates
With security built into pipeline tools like Jenkins, security gates can be
built where if an image fails a security scan for whatever reason, the build
process can be stopped, allowing remediation to occur before being
released into a live environment.
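Continuing the earlier Jenkins pseudo code, a security gate can be an extra stage that fails the build when a scan fails; the scanner command here is a placeholder for whichever tool you use:
stage('Scan image') {
    /* Placeholder scanner: a non-zero exit code fails the build,
       stopping the image before it reaches a live environment */
    sh 'image-scanner --fail-on critical jenkinsproject/helloworld'
}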
GitOps
Another keyword in today’s estate management and container platform
management is GitOps.
“GitOps is an operational framework that takes DevOps best practices
used for application development such as version control, collaboration,
compliance, and CI/CD tooling, and applies them to infrastructure
automation.”
https://about.gitlab.com/topics/gitops/
GitOps Toolbox
Some useful tools that can help you along your GitOps learning are as
follows. There are many other tools and variations you can use, but as this
subject can be one for a book on its own, I have mentioned only a few.
Git
The first step to using GitOps is to start using Git. This can be GitLab,
Bitbucket, or GitHub, any Git platform that allows the ability for CI/CD
pipelines to detect merge requests.
Infrastructure As Code
Technically not a tool; however, everything you write to automate or
configure your platform should be in the form of code. That could be
YAML for your OpenShift or Kubernetes configurations or Ansible to build
a new system. Everything should be built or configured from code; no
manual configuration should be used anywhere.
Pipeline Tools
Choose your pipeline tool and configure it to detect merge or pull requests
in your git environment. Every time a new change is made, the pipeline
should be kicked off to build or deploy new application versions or build
new systems.
ArgoCD
Another GitOps tool being used more and more is ArgoCD. ArgoCD helps
with GitOps workflows and can be used as a stand-alone tool or as a part of
your CI/CD pipeline.
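A minimal sketch of an ArgoCD Application definition (the repository URL, path, and namespaces are assumptions):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: basicwebapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/your-org/basicwebapp.git
    targetRevision: main
    path: manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: basicwebapp
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
With automated sync enabled, ArgoCD continually reconciles the cluster with what is committed in Git.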
Figure 6-6.
Figure 6-6 shows the basic flow an ArgoCD configuration should take.
Container Orchestration
A few containers can quickly spiral into hundreds if not thousands in
environments where applications are being deployed on a regular basis. To
manage this kind of growth, the need for container orchestration becomes
more important. Tools like Kubernetes, Docker Swarm, and OpenShift
provide the ability for administrators to manage large estates of container
workloads and ensure their availability. Each tool has its own advantages
and disadvantages and could be discussed at such length that it would take
many more chapters to cover; however, as we are not focusing too much
on container orchestration at the moment, let's just touch on the basics
for now.
• Scalable
• Flexible
• Secure
• Automated
• Easy to use
Orchestration Options
Kubernetes
Kubernetes, or K8s, is an open source project that was originally developed
by Google and based on their original “Borg” system (cluster manager
system).
Red Hat was one of the first contributors to Kubernetes before it was
officially launched.
In 2015, Google donated the Kubernetes project to the CNCF (Cloud
Native Computing Foundation).
Kubernetes Forks
As Kubernetes is open sourced, there are many downstream variations
of Kubernetes today like Red Hat’s OpenShift, VMware’s version of
Kubernetes, and many cloud platforms like AWS and Azure providing their
own managed services.
Master Components
Kubernetes has a few fundamental cluster components that enable it to
provide the orchestration for pods and the containers within.
• The ETCD key value database that stores all the cluster
configuration
Nodes
Nodes are the workers of the Kubernetes clusters. They are responsible
for hosting the container workload users deploy. Nodes consist of a few
subcomponents:
• Container runtime.
Namespaces
Daemonsets
Pods
Containers are run within pods; these pods are what are spawned on
worker nodes. Typically, one container is run within one pod, but this is
not a hard and fast rule.
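A minimal Pod definition as a sketch (names and image are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  labels:
    app: nginx
spec:
  containers:
    - name: nginx
      image: docker.io/library/nginx:latest
      ports:
        - containerPort: 80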
Services
Services are what binds multiple pods of the same application together.
When multiple pods are spawned on different worker nodes, you need
to balance traffic between them. A service is the layer that provides that
“service.”
Volumes
By default, all containers are ephemeral, which means they have no way
to store their data after a pod restart or recreation. By mounting volumes
or persistent volumes to pods, you are able to recover data from previously
destroyed or restarted pods.
Configmaps
OpenShift
Before OpenShift was OpenShift, it was a PaaS product by a company
called Makara. Red Hat acquired Makara in 2010 for the PaaS platform
which was proprietary at the time based on Linux container technology.
Early OpenShift
Prior to OpenShift 3.0, the Red Hat PaaS platform was proprietary and
custom developed. It took two years after the acquisition for Red Hat to
release the first open sourced version and then three years after that to
move away from the custom platform to a more “mature” Kubernetes at
the time.
OpenShift 3.0 was the first release where Red Hat used Docker for the
container runtime and Kubernetes for the orchestration layer.
OpenShift 3.11 was the last minor release of OpenShift 3 and the last
version where Docker was used as the container runtime.
Current OpenShift
Red Hat currently has OpenShift 4.9 generally available for public use. The
detachment of the “hardcoded” Docker has allowed OpenShift 4.x to move
to a container runtime interface approach where any low-level container
runtime can be used.
OpenShift has matured to become the leading container orchestration
platform for the enterprise and thus has become the number one
container orchestration product for many organizations. Red Hat's
continued investment keeps growing OpenShift, both through new
functionality and through acquisitions.
Advanced Cluster Security (StackRox), Advanced Cluster Management,
monitoring, logging, and many other enterprise-grade features make
OpenShift the go-to product for any serious hybrid cloud organization.
OpenShift Components
As OpenShift is based on Kubernetes, most of the components are very
similar and named in a very similar manner. There are of course some
variations; namespaces in Kubernetes, for example, are referred to as
projects in OpenShift.
Product
Enterprise
Security
OpenShift has been built with security in mind, opening the adoption for
more security conscious organizations. The recent acquisition of StackRox
has only strengthened this argument even more.
Web Console
Many More
Without listing all the differences, there are other features like image
management and enterprise storage solutions that Red Hat OpenShift
provides over Kubernetes. If you are interested, you should do as
recommended with most products in this book. Build a proof of concept
and compare the differences for yourself.
Summary
In this chapter, you were introduced to the following:
• What cloud native means and the various good and bad
practices of using containers
PART III
Monitoring
What are some of the most important features any new Linux system must
have before it can be accepted into your organization’s production or live
estate? The common answers given are monitoring, logging, and security.
For good reason too: any system that is not being monitored, logged, or
secured is just a recipe for disaster and in almost every single case will be
rejected by any serious operations team.
This chapter will take a deeper look at one of the first things a Linux
system should have: monitoring. We will discuss tools that have been used
in the past and what tools are available out of the box with most Linux
distros. We will then look at some of the newer tools and trends that have
been used within the last five years.
Finally, we will discuss what developers and applications require from
a monitoring point of view, how applications can be better monitored, and
how to initiate discussions with developers on how to develop applications
to support this. This chapter will not give you all the answers to all the
different monitoring use cases. It will give you ideas on what you could be
doing and what tooling could help with some of your current monitoring
questions. It may even create a few questions you did not realize you
needed to ask.
Process Monitoring
Default Process Commands, ps and top
By default, on most if not all Linux distros, you will find both the “top” and
“ps” commands. They not only show you all the processes running on your
system but also give you the process ID number that can be used to kill a
defunct or hung process.
If you are not sure a particular process is running, for instance, the
apache web service, you can run a command similar to the following:
# ps -ef | grep httpd
The “top” or the alternative commands could also be used, but you
may struggle to search through the list for your process. Using “ps” and
“grep” will give you a quicker and cleaner output.
Pstree
A quick and nice tool to see all the processes and the parents of each
process is the “pstree” command. The following is a basic output of pstree
with a reduced output:
# pstree
systemd─┬─ModemManager───3*[{ModemManager}]
├─NetworkManager───2*[{NetworkManager}]
├─abrt-dbus───2*[{abrt-dbus}]
├─3*[abrt-dump-journ]
├─abrtd───2*[{abrtd}]
... [reduced for length]
├─thermald───{thermald}
├─udisksd───4*[{udisksd}]
├─upowerd───2*[{upowerd}]
├─uresourced───2*[{uresourced}]
└─wpa_supplicant
Resource-Hungry Processes
The “ps” command is useful for another reason. I’m sure you have
experienced a process that has been CPU or memory intensive. Finding
that offending process can sometimes be a bit tricky if you are trying
to figure out how processes are consuming resources from the “top” or
similar commands. The following are two “ps” commands you can use to
find the top five CPU- and memory-intensive processes.
Memory-Intensive Processes
# ps -auxf | sort -nr -k 4 | head -5
CPU-Intensive Processes
# ps -auxf | sort -nr -k 3 | head -5
Tip Look at the ps --help and ps man pages for more options to
use with ps.
Disk and IO
There could be a situation where you have slow disk performance or disks filling up. Useful tools for disk and IO monitoring that are still in use today include "iostat," "iotop," "du," and "df."
# iostat
Linux 5.13.4-200.fc34.x86_64 (localhost.localdomain)  22/11/21  _x86_64_  (4 CPU)

Device    tps   kB_read/s  kB_wrtn/s  kB_dscd/s  kB_read    kB_wrtn     kB_dscd
dm-0      1.16  1.79       59.42      24.16      16712024   554638596   225476964
nvme0n1   1.15  1.79       59.43      24.24      16724756   554741532   226295700
zram0     0.20  0.16       0.64       0.00       1482776    6013392     0
# iotop
Total DISK READ: 0.00 B/s | Total DISK WRITE: 108.61 K/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 3.19 K/s
TID  PRIO  USER  DISK READ  DISK WRITE  SWAPIN  IO>  COMMAND
733973 be/4 ken 0.00 B/s 57.50 K/s 0.00 % 0.00 %
chrome --type=utility --utility-sub-type=network.mojom.
NetworkService --field-trial-han~be2ad25, --shared-files=v8_
context_snapshot_data:100 --enable-crashpad [ThreadPoolForeg]
729143 be/4 root 0.00 B/s 51.11 K/s 0.00 % 0.00 %
[kworker/u8:6-btrfs-endio-write]
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 %
systemd rhgb --system --deserialize 51
2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 %
[kthreadd]
3 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 %
[rcu_gp]
4 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 %
[rcu_par_gp]
du and df
These are used to show disk usage and where disks are mounted:
# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 12G 0 12G 0% /dev
tmpfs 12G 181M 12G 2% /dev/shm
tmpfs 4.7G 2.0M 4.7G 1% /run
/dev/dm-0 238G 93G 144G 40% /
tmpfs 12G 61M 12G 1% /tmp
/dev/dm-0 238G 93G 144G 40% /home
/dev/nvme0n1p1 976M 272M 638M 30% /boot
tmpfs 2.4G 216K 2.4G 1% /run/user/1000
# du -h /etc
0 /etc/.java/.systemPrefs
8.0K /etc/.java/deployment
8.0K /etc/.java
0 /etc/NetworkManager/conf.d
0 /etc/NetworkManager/dispatcher.d/no-wait.d
0 /etc/NetworkManager/dispatcher.d/pre-down.d
0 /etc/NetworkManager/dispatcher.d/pre-up.d
0 /etc/NetworkManager/dispatcher.d
0 /etc/NetworkManager/dnsmasq-shared.d
0 /etc/NetworkManager/dnsmasq.d
28K /etc/NetworkManager/system-connections
32K /etc/NetworkManager
... [reduced for length]
80K /etc/gimp/2.0
80K /etc/gimp
28K /etc/pcp/derived
28K /etc/pcp
37M /etc/
CPU
CPU statistics on your system can be checked using a number of tools both
shipped with your distro and tools that you can install quite easily. The
following are two of the more common tools used.
Top
Most Linux sysadmins will use the top command and press the “1” key.
This will give a similar output to the following:
From the preceding output, you can see that I have four cores in my laptop. The load average is around 1.33, which means that, on average, 1.33 of my four CPU cores are busy running or waiting on processes.
mpstat
Another useful command for CPU statistics is the mpstat command. The
“mpstat” command displays activities for each available CPU.
To see all the stats per CPU, you can run the following command:
# mpstat -P ALL
Memory
A few ways to check system memory include looking at the /proc/meminfo
file or running commands like “free” or “top.” The following are a few
things you can do on your system to understand more about its memory.
Free
Basic utility that gives all the information required about your
system’s memory:
# free -h
total used free shared buff/cache available
Mem: 23Gi 12Gi 3.2Gi 1.2Gi 7.2Gi 8.8Gi
Swap: 8.0Gi 1.9Gi 6.1Gi
Page Size
If you need to find out your system's page size, you can use the following command:
# getconf PAGESIZE
pmap
Another useful tool is the “pmap” utility. “pmap” reports a memory map of a
process. “pmap” can be quite useful to find causes of memory bottlenecks.
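A minimal sketch, assuming 1234 is the process ID you are investigating (the -x flag adds extended per-mapping detail):
# pmap -x 1234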
Virtual Memory
It does occasionally happen that you need to investigate issues around
virtual memory or slabinfo. The “vmstat” tool is useful for this kind of
investigation.
vmstat
The vmstat tool can be run to give you different information about
your system.
Running a basic vmstat command as follows:
# vmstat
will give you the following output, which is explained in Table 7-1:
Network
Tools to monitor network configuration or traffic are really useful when
you need to troubleshoot issues or confirm a port is listening for traffic.
The following are a few tools I have used in the past.
Netstat
One of the first tools I tend to use when I need to check if a port is listening
for traffic is the “netstat” command.
The “netstat” command will show you network connections,
interface statistics, and more. The most common commands I use to check
what ports are listening are as follows:
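A commonly used invocation to list all listening TCP and UDP ports together with the owning process is:
# netstat -tulpn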
ss
To get some quick information about socket statistics, you can use the “ss”
command.
To view all TCP and UDP sockets on a Linux system with ss, you can
use the following command:
# ss -t -a
iptraf-ng
If you prefer to use interactive tools to view network statistics, you can use the "iptraf-ng" command. "iptraf-ng" is useful to monitor various network statistics including TCP info, UDP counts, interface load info, IP checksum errors, and loads of other useful information (Figure 7-1).
# iptraf-ng
Figure 7-1.
Figure 7-1 is what you are presented with when you open iptraf-ng.
This tool has helped me on a few occasions where I needed to monitor
the traffic out of a particular interface. It is not installed by default but
definitely worth using if you are not already.
From the main screen, you can select to monitor IP traffic; from there,
you select the interface you want to monitor, then watch the connectivity
over that interface.
Tcpdump
Most if not all network engineers will use Wireshark to monitor traffic on their network. The "tcpdump" command allows the Linux sysadmin to dump traffic on a particular network interface, all interfaces, or for a particular service like DHCP or DNS.
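A sketch of a capture written to a file, assuming the interface is eth0 and filtering for DNS traffic (both the interface and the filter are illustrative):
# tcpdump -i eth0 -w capture.pcap port 53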
The output file from the preceding command can then be opened using the Wireshark tool (Figure 7-2).
Figure 7-2.
NetHogs
If you are experiencing high bursts of network load on a Linux system and want to know what could be responsible, you can try the "nethogs" tool to see which PID is consuming the bandwidth (Figure 7-3).
# nethogs
Figure 7-3.
iftop
This is a simple command very similar to “top” but for interface
information (Figure 7-4).
# iftop
Figure 7-4.
From Figure 7-4, you can see the layout of iftop is similar to the regular
top output, except more focused on network data.
Graphical Tools
Gnome System Monitor
Linux desktops like Gnome are not without their own monitoring tools you
can use. Those familiar with Windows will know about “task manager,” a
simple tool that gives you a basic rundown of what processes are running
and the current performance of your system. The Gnome system monitor
is not massively different. The first tab gives you a process list, the second
tab lists your CPU and memory resources being used, and the last tab gives
you a breakdown of your mounted filesystems (Figure 7-5).
Figure 7-5.
From Figure 7-5, you can see all processes currently running on your
Linux system.
Ksysguard
If the Gnome tool does not work for you, you can also use the KDE tool
called ksysguard. The difference between the Gnome system monitor and
the KDE ksysguard tool is that ksysguard has the ability to monitor remote
systems. New tabs can be created, and different resources from remote
systems can be monitored. It is useful as a quick and simple monitoring tool that requires little to no effort to configure (Figure 7-6).
Figure 7-6.
Similar to the Gnome system monitor, you can also view all the
processes running on your system, as demonstrated in Figure 7-6.
The tools mentioned so far are aimed at current system activity and real-time checking. Trying to use them for root cause analysis after an issue has occurred will leave you with limited options; for that, you need tools that keep historical metrics.
Sar
A useful tool for querying historical system metrics is "sar." The "sar" utility is installed with the sysstat package. Along with "sar," the sysstat package has a few other utilities like iostat, mpstat, and nfsiostat, to name a few.
The “sar” utility stores system statistics and metrics within local
system files that can be queried later for system statistics. The sar files can
be found at the following location:
/var/log/sa/
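A quick sketch of querying those files, assuming the sysstat collector has been gathering data and sa15 is the file for the 15th of the month (-u reports CPU utilization):
# sar -u -f /var/log/sa/sa15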
Performance Co-Pilot
A set of utilities that, in my opinion, is a bit better to use than "sar" is installed with the pcp package. The pcp package installs a few useful tools for metric querying and metric collection. Table 7-3 lists the tools installed with the pcp package.
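As a brief illustration, two of the pcp tools in action, assuming the pmcd collector service is running:
# pmstat
# pminfo | grep mem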
vnstat
Not to forget network metrics, the vnstat tool is another useful tool to keep
historical network information. vnstat keeps a log of hourly, daily, and
monthly network traffic for the selected interface or interfaces.
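A minimal sketch, assuming the vnstat daemon has been collecting data and the interface of interest is eth0 (-d shows daily totals):
# vnstat -i eth0 -d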
Central Monitoring
With a good understanding now of local monitoring and metric collection
tools, we can now move on to central monitoring tools available in the
open source world. These are the tools that can be used to monitor your
entire estate from a single location with historical data being kept for
potential root cause analysis later down the line.
Nagios
The first tool that many people may know and have come to use at some
point is Nagios. Nagios is another of those recursive open source names: Nagios stands for "Nagios Ain't Gonna Insist On Sainthood."
Versions
Nagios has both a community and a paid-for product that can be installed
on most Linux distros. CentOS and RHEL, however, are the supported
platforms at this stage for the Enterprise Nagios XI product. Nagios Core,
however, can be installed on quite a few different Linux distros. It’s always
best to discuss these options with the vendor if you ever decide to use the
paid-for product.
Core
The community supported edition of Nagios is the Core release which
gives you the basic monitoring capabilities of Nagios but requires you to
use community forums for help and support.
Nagios XI
The enterprise or paid-for solution of Nagios comes with the standard core
components plus more. This also includes all the support for the product
via phone and email.
Agent Based
Nagios consists of a server and agent-based deployment with a few options
around agents that can be used.
NRPE
Nagios Remote Plugin Executor (NRPE) uses scripts that are hosted on the
client systems. NRPE can monitor resources like disk usage, system load,
or total number of logged in users. Nagios periodically polls the agent on
the remote systems using the check_nrpe plugin.
NRDP
NRDP or Nagios Remote Data Processor is another Nagios agent you can
use. NRDP comes with a flexible data transport mechanism and processor
allowing NRDP to be easily extended and customized. NRDP uses
standard ports and protocols (HTTP and XML) and can be used in place of
NSCA (Nagios Service Check Acceptor).
NSClient++
A Windows agent for Nagios, NSClient++ listens on TCP port 12489. The Nagios plugin that is used to collect information from this add-on is called check_nt.
NSClient++ is similar to NRPE, as it allows Nagios to monitor memory
usage, CPU load, disk usage, etc.
NCPA
The final agent that can be used is the NCPA agent. The NCPA or Nagios
Cross Platform Agent is an open source project maintained by Nagios
Enterprises.
NCPA can be installed on Windows and Linux. Unlike other agents,
NCPA makes use of the API to gather information and metrics for Nagios.
Active checks are done through the API of the “NCPA Listener” service,
while passive checks are sent via the “NCPA Passive” service.
Nagios Forks
There are a number of forks from Nagios that can also be used. Some of the
forks of Nagios are as follows:
• Icinga
• Naemon
• Shinken
All will share a similarity with Nagios but over time have evolved into
their own solutions. Icinga, for instance, has been developing its own
features for well over a decade now.
Installation
The installation for Nagios can be done in a few ways and is well
documented on the Nagios documentation site:
Prometheus
Prometheus is an open source alerting and event monitoring system that
stores data in a time series database. Prometheus is a central location
for metric data to be stored and is usually paired with other software to
provide an overall monitoring solution.
Exporters
Exporters are what get data into Prometheus's time series database.
Multiple exporters can be used on client or server systems. There are
dedicated exporters for different purposes; in the case of getting node
information, there is a dedicated node_exporter that will export local
system metrics like CPU or memory utilization.
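As a sketch, assuming node_exporter is running on a host at 192.168.0.10 on its default port of 9100, the scrape job added to prometheus.yml might look like this:
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['192.168.0.10:9100']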
Alert Tool
Any monitoring platform worth its salt must have a way to tell Linux sysadmins when there is a problem. This is typically your alerting tool. A useful open source tool is Alertmanager, which can be used to trigger alerts based on Prometheus metrics.
Dashboarding
Even though Prometheus does have a web UI that can be used to query
metrics, it makes more sense to send metrics to a dashboarding tool.
Grafana, for instance, is a good choice for this and is one of the more
popular open source tools available today.
Query Language
PromQL is the query language used to create dashboards and alerts.
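For example, a PromQL expression to chart non-idle CPU usage from node_exporter metrics (assuming node_exporter is the data source) could be:
rate(node_cpu_seconds_total{mode!="idle"}[5m])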
Installation
As with Nagios, I would recommend installing Prometheus by following the official documentation, which is very clear and well thought through. The installation steps are simple enough if you want to do it manually, but I would still advise the automated method. The Internet is full of Ansible roles that will do it for you, or, if you prefer the prebuilt option, there are container images that can be used.
Kubernetes or OpenShift
Platforms like Kubernetes or OpenShift can also have Prometheus
deployed on them, but they tend to be used for the platform itself. You
would need to create a new namespace and deploy your own Prometheus
and Grafana to use for external system monitoring.
Configuration
Once installed, Prometheus does not require much configuration to get
started. A simple YAML file normally named prometheus.yaml can be used
for all configurations. A basic configuration from the official Prometheus
site is as follows:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
Global
The global section is for Prometheus-wide configuration, for instance, general settings that tell Prometheus how often to scrape its targets.
Rule_files
The rule_files section is for custom rules we want Prometheus to use. The
example configuration in this case does not have any rule_files to use.
Scrape_configs
The scrape_configs section tells Prometheus what targets to scrape. In the configuration example, localhost will be contacted on port 9090 and metrics will be read from its /metrics endpoint.
Starting Prometheus
Typically, monitoring platforms should be started from a service, and
Prometheus can be configured to do so too. When starting Prometheus,
you should have at least one parameter specified, and that is the name of
the Prometheus configuration file you are using.
To start Prometheus manually, you can run the following command
from the Prometheus installed directory:
# ./prometheus --config.file=prometheus.yml
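To run Prometheus as a service, a minimal systemd unit could be used; this is a sketch that assumes Prometheus was unpacked into /opt/prometheus and runs as a dedicated prometheus user:
[Unit]
Description=Prometheus
After=network-online.target

[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target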
Thanos
Prometheus monitoring is quite good on its own and can provide
everything you might want from a simple monitoring platform, except
maybe long historical data or high availability.
This is where Thanos can be utilized. Thanos has been designed to
provide a highly available solution that can keep an unlimited metric
retention from multiple Prometheus deployments.
Thanos is based on Prometheus and requires at least one Prometheus
instance within the same network as itself. Thanos manages the metric
collection and querying through a series of components.
Sidecar
A sidecar is the component that allows Thanos to connect to a Prometheus
instance. It can then read data for querying or for uploading to cloud
storage.
Store Gateway
This allows the querying of metric data inside a cloud object
storage bucket.
Compactor
This compresses or compacts data and applies retention on the data stored
in a cloud storage bucket.
Receiver
This is the component responsible for receiving data from Prometheus's remote-write function. The receiver can also expose the metrics or upload them to cloud storage.
Ruler/Rule
This is used to evaluate recordings and alerting rules against data in Thanos.
Querier
This makes use of Prometheus’s v1 API to pull and query data from
underlying components.
Query Frontend
By using Prometheus’s v1 API, the query frontend can evaluate PromQL
queries against all instances at once.
Figure 7-7.
Enterprise Monitoring
Monitoring for large organizations with different teams is normally a
contentious subject, mostly because different teams all want to use a tool
that suits them better. There are some excellent proprietary Windows
tools, and then there are quite good open source Linux tools too that can
be used. As this book is focused on open source technologies and the
adoption of Linux, let’s have a brief look at some open source enterprise
monitoring tools that you could use.
Zabbix
A great enterprise-grade monitoring tool that can be used to monitor your
estate is Zabbix. Zabbix prides itself on being able to monitor anything from server platforms through to network systems. Zabbix is
a server- and agent-based system but can also monitor some facilities
without the use of an agent.
Enterprise Support
Zabbix has a paid support facility that can be used for enterprise support,
or you can support yourself through community forums.
Installation
The installation is relatively simple and is well documented on the Zabbix
website. They have a really nice way of presenting the installation steps
through a series of selection boxes based on your preferences.
Useful Features
There are a few really nice features that Zabbix can provide. Examples of
these include the ability to monitor Java-based applications directly over
JMX, the ability to monitor virtual machines with VMware tooling, and the
ability to integrate with systems management tools like Puppet or Chef.
CheckMk
Another good enterprise monitoring tool is CheckMk. CheckMk is a
scalable solution like Zabbix that can monitor a wide variety of systems
from standard Linux platforms through to IoT devices.
Enterprise Support
CheckMk offers both a free version with unlimited monitoring where you
support yourself and an enterprise paid-for solution with added features.
Installation
The major enterprise Linux distros are supported, and the CheckMk
documentation has well-documented steps for whichever distro you
are using.
Useful Features
CheckMk has been building their platform with the future in mind. They
have built in the facilities to monitor Docker, Kubernetes, and Azure, to
name a few.
The overall solution is scalable and will work well in large
organizations with a distributed layout (multiple data centers).
Automation has been one of the main development points to ensure that
configuration and setup is as simple as possible.
OpenNMS
The first monitoring tool I ever installed was OpenNMS many years back
when I first got into open source technologies. While researching for this book, I was pleased to see that not only is OpenNMS still an actively developed product, but it also looks pretty impressive.
Enterprise Support
Like most enterprise platforms, there are generally two options: a “free”
version with community support and an enterprise paid version.
Installation
The installation of OpenNMS is not as simple as maybe some of the other
tools available but in the same breath is not drastically difficult to install
either. The official documentation is clear enough and does step you
through everything you need to do. There is also a good community forum
for questions if you get stuck.
Useful Features
One feature that really jumps out is that OpenNMS uses Grafana as a
dashboarding tool, which, in my opinion, was an excellent move, largely
due to the fact that more and more of today’s users are developing their
own dashboards.
OpenNMS metrics can also be collected with a wide variety of methods
including JMX, WMI, HTML, XML, and more.
Dashboards
One aspect of monitoring that is almost as important as the metrics being
collected is the ability to view metrics in a format that makes sense. This is
where dashboarding tools are vital.
Over the years, I have come across a few monitoring tools that were, and still are, very good but just look awful in a browser. With some application monitoring tools, I also found the dashboards very difficult to configure: resizing windows was a nightmare, and connecting external tools was often not possible.
It seems that I was not the only one to suffer with these tools, and some smart people have started developing dedicated dashboarding tools that can integrate with a variety of data sources.
Dashboarding Tools
Table 7-4 lists a few dashboarding tools that can be used today.
Grafana
As Grafana is the most popular tool today, it is worth exploring what
Grafana has to offer.
What Is Grafana
Grafana is an open source plugin-based dashboarding tool that has a wide
range of data source options that can be used to display metrics without
duplicating any data. Grafana can be deployed on almost all platforms
used today, from Windows through to Debian (Figure 7-8).
Figure 7-8.
Using Grafana
There are a few ways to use Grafana:
Cloud Service
If you do not want to run your own Grafana instance on-premise, you
can run your dashboarding in the cloud. The free forever plan includes
Grafana, 10K Prometheus series, 50 GB logs, and more.
On-Premise Installation
Grafana can be deployed in a few ways:
Data Sources
Before you can create a dashboard, you will need to have a source from
where metric data will be pulled. These are your data sources. You will
need to create a data source before attempting to create a dashboard.
Grafana supports a number of data sources that include some of the
following:
• Alertmanager
• AWS CloudWatch
• Azure Monitor
• Elasticsearch
• InfluxDB
• MySQL
• PostgreSQL
• Prometheus
• Jaeger
Dashboard Creation
Once you have your data source, you are ready to start creating
dashboards. Grafana has the ability to create many different dashboards,
and these can be created from the main Grafana screen when you
first log in.
Dashboards can be imported and exported if you wish to download
prebuilt dashboards or if you wish to share your configuration.
Panels
Once you have your first dashboard, you will want to start creating your
metric visualization. For this, dashboards use panels. Multiple panels can
be used to display the metrics of your choice from your preconfigured data
source. Each panel has a query editor specific to the data source selected
in the panel (Figure 7-9).
Figure 7-9.
Rows
To arrange all your panels, you need to create rows; rows are your logical
dividers for all your panels. Panels can be dragged into different rows for
simple organization.
Save
Always remember to save your dashboards when you have added new panels or rows. If you navigate away to another dashboard without saving, your changes will be lost.
Application Monitoring
A special kind of monitoring that can be a bit trickier and often more expensive in both resource and time is application monitoring. Application monitoring requires both infrastructure tooling and developers who build their applications to expose metrics that can be monitored.
Tracing Tools
Tracing tools are used to “trace” the execution path of an application and
its transaction by the use of specialized logging. Typically, these are used
by developers to aid in pinpointing where a particular issue occurs.
Tracing should not be confused with event monitoring. Event monitoring is primarily used by Linux sysadmins for high-level troubleshooting and is normally not too "noisy." With tracing, noise is good: the more information there is, the more accurately troubleshooting can narrow down the root cause.
There are a few tracing tools available today that can be used.
Proprietary platforms like AppDynamics are excellent tools with rich
features but come with hefty price tags. Fortunately, there are also open
source alternatives, and as we are primarily focused on all that is open
source, we can just move past those that are not.
Jaeger
Originally open sourced by Uber, Jaeger is inspired by the OpenZipkin and
Dapper projects used for monitoring and troubleshooting microservices-
based distributed systems. With that, Jaeger promises to help solve the
following issues:
Zipkin
Before Jaeger, Zipkin was developed as an open source project based on
the Google Dapper project. Zipkin is a Java-based application that provides
an interface for users to view tracing data from a range of data backends.
Zipkin supports transport mechanisms like RabbitMQ and Kafka.
Zipkin can be deployed as a container or run locally by downloading
the latest binaries. All of these steps are well documented on the Zipkin
official site.
Exposing Metrics
Monitoring tools are only as good as the data they can collect. For standard
platform monitoring, the metrics can be pulled using agents which in
turn speak to the system they are running on to return the data they need.
Applications, however, need to expose the data from within the application
so the monitoring agent can pass the data to the monitoring platform.
From there, alerts can be configured along with any dashboards.
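Once an application (or an exporter running alongside it) exposes metrics over HTTP, they can be checked quickly from the command line; a sketch assuming a metrics endpoint listening on port 8080:
# curl http://localhost:8080/metrics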
Summary
In this chapter, you were introduced to the following:
CHAPTER 8
Logging
In this chapter, we focus on a topic where we spend most of our time
troubleshooting as a Linux sysadmin: logs.
We will explore different logging systems that you can use, how to
read logs, how to increase the information we get from logs, and how we
look after our systems so logs do not cause us more issues. Finally, we will
explore how logs should be offloaded to external logging systems in a neat
and secure manner.
Rsyslog
Installed by default on all Linux systems and almost always used, Rsyslog is an incredibly fast logging system with the ability to receive logs from almost everything running on a Linux platform. Not only can Rsyslog receive logs from just about everywhere, it can also offload logs to numerous destinations, from flat files through to MongoDB.
Modular
Rsyslog has been designed in a modular way, allowing users to choose
what they want to use with rsyslog. There are a number of modules
currently available that range from snmp trap configuration through to
kernel logging. For a full list of all the different modules you could use, look
at the rsyslog official website:
www.rsyslog.com/doc/v8-stable/configuration/modules/index.html
Installation
If for some very strange reason rsyslog is not installed by default, you can
install from your standard package management system, like dnf or apt:
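On a RHEL-based or Debian-based distro respectively, the installation would be something like:
# dnf install rsyslog
# apt install rsyslog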
You can also run an rsyslog container if you choose, which could be used as a central logging system, although more thought will be needed around storage and connectivity.
Service
The rsyslog service is enabled and started by default but can be stopped or
disabled in the standard systemd manner:
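For example, to stop or disable the service:
# systemctl stop rsyslog
# systemctl disable rsyslog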
Configuration Files
The configuration files for rsyslog are handled through two configuration locations: the main /etc/rsyslog.conf file and the /etc/rsyslog.d/ drop-in directory. The configuration itself is made up of the following parts:
• Global directives
• Templates
• Rules
Global Directives
These are the general global configuration settings for rsyslog. Examples include enabling and disabling additional modules and setting library locations.
Templates
Templates give you the ability to format how you want logs to be recorded
and allow dynamic file name generation. It’s a useful configuration if you
are building a central rsyslog system and want to record the hostname of
the system sending logs.
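A sketch of a template in the legacy rsyslog syntax that writes each sending host's logs to its own file (the template name and path are illustrative):
$template RemoteLogs,"/var/log/remote/%HOSTNAME%/%PROGRAMNAME%.log"
*.* ?RemoteLogs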
Rules
Rules consist of selectors and actions. These are the fields that set what will
be logged and where the logs will be sent.
Selector Field
The selector field consists of two parts, the facility and priority. These two
parts are divided by the “.” character.
The following entries are valid facility types: auth, authpriv,
cron, daemon, kern, lpr, mail, news, syslog, user, uucp, and local0
through local7.
The following entries are valid priorities that can be used: debug, info,
notice, warning, err, crit, alert, emerg.
Action Field
The action field typically specifies where the log will be written. However, other actions can also be applied to a particular selector if you choose. Examples of this could be writing to a database or sending the log files to a remote logging system.
Actions can be quite flexible too; different protocols, ports, and
interfaces can be configured to send logs to remote systems. It’s useful if
you run a dedicated logging network to not impact a production network.
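Putting the selector and action fields together, a couple of typical rules from an rsyslog.conf look like the following (the remote server address is illustrative):
authpriv.*                                  /var/log/secure
*.info;mail.none;authpriv.none;cron.none    /var/log/messages
*.* @@192.168.0.1:514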
Fluentd
Fluentd is an open source project that was originally created by a company
called Treasure Data.
Plugin Based
Written in C and Ruby, Fluentd gives the user the ability to be flexible in
how Fluentd can be used. With over 125 plugins for both input and output,
Fluentd can be used with almost any system or cloud provider available.
Used at Scale
Running a large-scale environment with Fluentd is entirely possible, with users reporting that Fluentd can handle over 50,000 systems sending data.
Installation
Fluentd can be installed in a few ways: standard package installation,
installed from source, or run from a container.
Prerequisites
Before installing Fluentd, there are a few prerequisites that are required:
• Configure NTP.
Manual Installation
Depending on your system, installing Fluentd can be done by either
running a script that matches your distro or installing the required
Ruby gems. The recommendation is to use the gem installation for
nonsupported platforms, and for supported platforms such as RHEL, to
install using the scripts provided by Fluentd.
The official documentation should always be followed for the
detailed steps.
Container Deployment
Fluentd can also be deployed as a container and is often deployed in this
fashion. The official documentation does highlight all the steps in detail
that need to be followed for a successful deployment.
The basic high-level steps are as follows:
Note There will be more steps than just the preceding steps. Also,
don’t forget your firewalls.
Configuration
The main configuration file for Fluentd is the fluentd.conf file.
Configuration parameters can be found in the official online
documentation or man pages. A basic configuration file looks similar to
the following:
<source>
  @type http
  port 9880
  bind 0.0.0.0
</source>

<match **>
  @type stdout
</match>
Understanding Logs
Having logs available is the first step to finding or preventing a problem.
Understanding what the logs are actually telling you is another very
important step.
Warning Do not open large log files that are many gigabytes in
size using tools like vim on a production system. The file contents will
consume large portions of memory and potentially cause you issues.
Copy the large logs to a different system to avoid any issues.
Infrastructure Logs
Logs that tell you all about your Linux system's events and services are your infrastructure logs. These logs are the standard logs that
are configured in your rsyslog.conf configuration file and tell you all about
what your system is doing in the background. These are the logs that will
be used to troubleshoot any system issues and can be used to look for
issues before they occur.
Important Logs
Logs that should be monitored for system issues are as follows.
/var/log/messages
This log is used to store all the generic events and information about your
system. Other distros like Ubuntu or Debian use a log file named syslog
instead. This log should always be one of the first places you check if you
need to troubleshoot any issues. It may not have all the information but
can get you started when you do not know where to begin.
/var/log/secure
This log file is used for authentication events. This log or the /var/log/
auth.log in Ubuntu and Debian is the best place to start troubleshooting
authentication failures or login attempts.
/var/log/boot.log
This one is fairly straightforward in its purpose. This is used to
troubleshoot boot-related issues. It’s a useful log to use to see how long a
system has been down for.
/var/log/dmesg
This is used to log information about system hardware changes and
failures. It’s very useful if you are having problems detecting new hardware
being added or removed.
/var/log/yum.log
If using a distro that uses yum as its package management system, you can
see a history of all packages added, updated, or removed.
/var/log/cron
This is a simple log to capture all cron-related tasks that have run
successfully or failed.
Application Logs
Depending on your application or what application server you use, the log
files could be stored anywhere. Application developers need to ensure that
important information is logged to troubleshoot issues or track events. The
ability to increase or decrease verbosity should also be included.
Good Practice
Some good practices for application logging should include the following.
Security
Logs that contain sensitive information should be secured when keeping a
long history. Permissions to the log directory should be locked to users and
groups authorized to read the logs. The use of ACLs could help to keep the
logs secure.
Warn or Above
Logs in production should never be left in debug mode. Log levels should normally be set to warning or error. This keeps the logs small and ensures they only report when an issue is looming or an error has occurred. Setting the log level to something more verbose can leave you with your /var/log disk filling up.
Increasing Verbosity
When problems occur, there may be a need to get more information than what is provided by default. Most applications and platforms support log levels similar to the following, ordered from least to most verbose:
• Fatal
• Error
• Warn
• Info
• Debug
• Trace
The default setting for a production application or platform should normally be "Warn" or "Error." As previously advised, it is not recommended to run debug logging in production.
However, when the rare need does arise, setting the log verbosity to "Debug" will certainly log more information, but it will be limited to what the application or platform deems a debugging message. To get the information you need, it is best to start from "Warn" and work down toward "Trace" until you have the information you need. Once done, always set the logging level back to "Warn" or "Error."
Log Maintenance
A good Linux sysadmin ensures that all logs are rotated and archived when
not being used. A great Linux sysadmin builds log maintenance into all
system configuration automation and never has to worry about it again.
If you have never managed logs before, then the chances are logrotate has been doing it for you. Logrotate is configured through the following locations:
• /etc/logrotate.conf
• /etc/logrotate.d
Installation
Logrotate is installed by default on all enterprise Linux distros and most
community distros.
Logrotate provides documentation through its man pages that will give
you more than enough information to get started, including examples.
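As an illustration, a drop-in file such as /etc/logrotate.d/myapp (the application name and log path are hypothetical) could look like this:
/var/log/myapp/*.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}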
Log Forwarding
Log forwarding is the preferred option for most people today. Enterprise
tools like Fluentd are a great way to offload local logs to a central location.
It removes the need for local systems to retain logs for extended periods
and reduces the disk footprint.
Elastic Stack
Also known as ELK stack, Elastic Stack is made up of four tools listed in
Table 8-1.
Fluentd
Fluentd can be used as a replacement for local logging, as previously
discussed in this chapter, and can also be used as a centralized logging
platform. To use Fluentd as a central logging platform, you need to have
two elements in your network.
Log Forwarders
A log forwarder monitors logs on a local system, filters the information that
is required, and then sends the information to a central system. In the case
of Fluentd, this would be a log aggregator.
The Fluentd project provides a lightweight log forwarder called Fluent Bit, which is the forwarder Fluentd recommends using.
An example of a Fluentd forwarder configuration would look similar to
the following:
<source>
  @type forward
  port 24224
</source>

<source>
  @type http
  port 8888
</source>

<match example.**>
  @type forward
  <server>
    host 192.168.100.1
    port 24224
  </server>
  <buffer>
    flush_interval 60s
  </buffer>
</match>
Log Aggregators
The destination for log forwarders would be the log aggregators. They are
made up of daemons constantly running and accepting log information to
store. The logs can then be exported or migrated to cloud environments for
off-site storage.
A Fluentd log aggregator configuration example could look similar to the following:
<source>
  @type forward
  port 24224
</source>

# Output
<match example.**>
  # Do some stuff with the log information
</match>
Rsyslog
If you do not want to use anything outside of what is provided on a
standard enterprise Linux distro, you can stick to using Rsyslog for
centralized logging.
Rsyslog Aggregator
Very similar to how Fluentd is configured with log forwarders and aggregators, Rsyslog can be configured to do the same. Rsyslog can be
configured to send and receive logs over either tcp or udp. Rsyslog can also
be configured to send and receive logs securely using certificates.
Remember to configure NTP on all systems involved so that log timestamps remain consistent. To receive logs over tcp, the central Rsyslog server needs the tcp input module loaded and a listening port configured in its rsyslog.conf file:
$ModLoad imtcp
$InputTCPServerRun 514
Rsyslog Forwarders
To send logs to a central Rsyslog server, you need to also configure the
rsyslog.conf file on your Linux client systems to send logs to the central
server. A simple configuration to send all logs centrally using tcp is as
follows:
*.* @@192.168.0.1:514
As with the central Rsyslog server, once the configuration file rsyslog.conf has been updated, the rsyslog service will need to be restarted.
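On a systemd-based distro, the restart is the standard systemctl command:
# systemctl restart rsyslog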
Summary
In this chapter, you were introduced to the following:
CHAPTER 9
Security
Security is one of the most important subjects that can be discussed as a
Linux sysadmin. All organizations require at least a minimal amount of
security to avoid being exposed or sabotaged by random hackers looking
for an easy target.
Larger organizations like banks need to focus heavily on security
and need to ensure that they are protected at all costs. This will involve
ensuring systems are hardened to the nth degree.
This chapter will focus on how security can be enforced and how
systems can be checked to ensure they comply not only with good security
practices but also meet compliance regulations. In this chapter, we will
explore different tools that can be used from the open source community
to build secure platforms and how to validate that systems are indeed as
secure as possible.
Finally, we will discuss DevSecOps and how the change in culture can
improve security. We will look at how today's DevSecOps practices have
improved the process in securing Linux systems.
Linux Security
The traditional approach to building and configuring a secure Linux
environment would have been to make use of firewalls, SELinux, and in
some cases antivirus software.
Firewall
The basic description of a Linux firewall is that it is built on the kernel's Netfilter framework, which provides packet filtering hooks into the network stack at the kernel level.
To configure the ruleset for Netfilter, you require a ruleset creation
tool. By default, all enterprise Linux systems have a firewall ruleset tool
installed, with the exception of certain cloud image versions. These images tend to be more cut down and do not always include the firewall tooling, the assumption being that protection should be handled at the cloud orchestration layer.
Iptables
Previous Linux distros and a few that have decided to not move forward
with systemd still use the firewall ruleset configuration tool known as
iptables. Iptables can get complex, but if you have a basic understanding
and know how to check if a rule has been enabled, you are well on your
way already (Table 9-1).
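A quick way to check which rules are currently active, with packet counters and without DNS lookups:
# iptables -L -n -v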
Firewalld
If you are using an enterprise version of Linux, chances are you most likely
will be using systemd. With systemd, you will be using firewalld as the
ruleset configuration tool for Netfilter.
Firewalld was designed to be simpler and easier to use than iptables.
Firewalld like iptables has a few commands all Linux system administrators
should know. Table 9-2 lists some basic commands to remember.
Tip When possible, use firewall-cmd and never disable the firewall service. Instead, understand what ports are required and open only those ports rather than leaving an entire system open.
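For example, to open a single port permanently and reload the rules (the port shown is illustrative):
# firewall-cmd --permanent --add-port=443/tcp
# firewall-cmd --reload
# firewall-cmd --list-all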
SELinux
Another measure of security used on most Linux distros is SELinux,
originally conceptualized and worked on by the US National
Security Agency.
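Two quick commands to confirm the current SELinux state on a system:
# getenforce
# sestatus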
There are two kinds of intrusion detection that can be used for any
server estate: host-based intrusion detection and network-based intrusion
detection. For the purposes of this book, we will only discuss what we can
deploy on our Linux platforms.
Minimal Install
Install your Linux server with the minimal packages selected. It is better to
start from a basic build and add the packages you need after. Less is more
when it comes to secure Linux server building.
Disk Partitions
Table 9-5 lists all the separate disk partitions that should be configured
with the respective mount options.
Partition           Mount Options
/var
/var/log
/var/log/audit
/var/tmp            Mount to same disk as /tmp
/tmp                nodev, nosuid, noexec
/home               nodev
/dev/shm            nodev, nosuid, noexec
removable media     nodev, nosuid, noexec
Disk Encryption
Only consider the use of disk encryption if the server can easily be taken
out of a data center or server room. This would apply to laptops or any
portable systems. A common disk encryption tool that can be used
is LUKS.
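A sketch of encrypting and opening a spare partition with LUKS; the device name is illustrative, and luksFormat destroys any existing data on it:
# cryptsetup luksFormat /dev/sdb1
# cryptsetup luksOpen /dev/sdb1 secure_data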
No Desktop
Do not install a Linux desktop or “X Windows System.” If it is installed,
remove both the desktop and the “X Windows System” packages.
ACLs
Configure specific permissions to disks and files using ACLs for users
that need access to the system. Do not open system-wide permissions for
nonadmin users.
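For example, to grant a single user read access to an application log directory without widening its general permissions (the user and path are hypothetical):
# setfacl -R -m u:appuser:rX /var/log/myapp
# getfacl /var/log/myapp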
Intrusion Detection
Install and configure an intrusion detection tool like Aide or Fail2ban. If
using Aide, be sure to copy the database to a secure location off the server
that is being monitored. This can be used later for comparison purposes.
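If using Aide, the database is initialized once, put in place, and then compared against on later checks; a minimal sketch assuming the default RHEL database paths:
# aide --init
# mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz
# aide --check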
DevSecOps
All these security steps are only as good as the people who apply them.
If an organization has not embraced the fact that security is not just the
responsibility of the security team, there will be opportunities for the
unsavory types out there when a security slip-up occurs. This is why there
has to be an evolution in the cultural view of security.
One of the biggest changes in organizations culturally over the last
few years has been exactly this. The understanding that everyone is
responsible for security.
What Is It?
In the same vein that DevOps is a set of practices and tools designed
to bring development and operational teams together by embracing
development practices, DevSecOps aims to align everyone within an
organization with security practices and tools (Figure 9-1). Put simply, it represents the idea that everyone is responsible for security.
Figure 9-1. Venn diagram where the different teams meet to create
DevSecOps
Tools
Linux sysadmins, developers, and users need to be conscious that anything
new that is added to a Linux estate must meet security requirements.
Running scans and tests manually will not support this cultural shift and will leave you in a position where security is swept to the side.
All these security checks need to be done in an automated manner.
When security issues are detected, the process should be stopped and
remediated. If building a new container image, for instance, it is pointless
building an insecure container image and wasting storage. It is best to stop,
fix the issue, and rerun the build.
Security Gates
One good way to incorporate DevSecOps practices would be to build
security gates into pipeline tools such as Jenkins or Tekton (Figure 9-3).
Figure 9-3. Code is pushed, checked, and baked with a pulled image
Third-Party Tools
Using third-party tools to scan and check code is highly recommended for your security gates. Products like SonarQube have the ability to scan for vulnerabilities and check code for syntax issues.
System Compliance
There are numerous reasons for systems to be compliant. The ability to store credit card details, for one, dictates how systems must be secured within financial organizations. Failure to do so would mean financial penalties or worse.
For systems to be compliant, there are hardening requirements that
need to be followed. These requirements need to be applied to all systems
and have evidence provided when audit time arises.
System Hardening
Hardening a Linux system is a process of removing any potential attack
surfaces your system may have.
There are many areas where a system can be left exposed for a would-be attacker to use. For example, a recently discovered vulnerability in the sudoedit command was shown to allow a non-root user to run privileged commands without authorization.
Finding these kinds of vulnerabilities and remediating them before
being exposed is the most important thing we as Linux sysadmins can do.
Reducing the chance of the problem happening in the first place is even
more important when building hundreds if not thousands of systems. This
is why system hardening and system vulnerability scanning are vital to
ensuring your systems are as secure as possible before going live.
Hardening Standards
There are a number of standards that can be used today to harden your
Linux estate. The two main ones used are CIS and STIGs. Both are very
similar, largely due to the fact that there are only so many security tweaks
one can do. Both, however, do serve as a good starting point to secure your
platforms to a good standard.
There are a few other standards that must also be followed for different
organizations such as NIST 800-53 for US federal agencies and PCI DSS
for financial organizations or anyone that wants to store credit/debit card
details. These standards are typically applied over and above the STIGs or
CIS guidelines.
Hardening Linux
There are a few ways to harden your Linux systems.
Manual Configuration
The last way I would ever harden a Linux system would be by doing it
manually. The sheer number of hardening steps that need to be followed
would keep you busy till the cows come home. Most hardening guides are
well over 100 pages long and are far from a riveting read.
If the need arose that a system had to be hardened by hand, then the best resource to follow would be one of the hardening guides available on the Internet, like CIS.
Each of the different hardening guides comes with commands to determine whether a system is vulnerable and, if the vulnerability is in fact present, also provides the remediation commands. Your friend in this case would be copy and paste until you reach the end of the extensive guide.
My advice would be to push back as hard as possible on doing
anything manually. The time it would take would far exceed what time it
would take to set up the next method of hardening a system.
Automation
Automation is your friend. The Internet is awash with content written by
Linux sysadmins like you who need to harden systems. Chances are you will find some Ansible or Puppet content that will do exactly what you want. You will also have the added benefit of the process being repeatable, which could be very handy when your boss tells you to harden another five systems.
OpenSCAP
Where Internet-downloaded automation might fail you slightly is if you
need to replicate configuration from a different already hardened system.
There may be a particular system that has specific hardening that does not
have all hardening applied for a good reason.
How would you then go about running your standard hardening to
accommodate the same settings?
For this use case, you can make use of OpenSCAP. OpenSCAP has the
ability to scan a system or systems and generate a report of the system’s
configuration. This configuration can be compared to another system, and
a subsequent report can be run to list the differences.
The absolutely amazing thing about OpenSCAP is that it can also
generate Ansible or Puppet code to remediate the differences for you,
saving you from having to write your own automation.
OpenSCAP can be run with CIS profiles out of the box and can also
use other profiles. Most if not all will present you with the remediation of
vulnerabilities through Ansible or Puppet.
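As a sketch, scanning a RHEL 8 system against the CIS profile and producing an HTML report; the data stream path assumes the scap-security-guide package is installed:
# oscap xccdf eval \
    --profile xccdf_org.ssgproject.content_profile_cis \
    --report /tmp/report.html \
    /usr/share/xml/scap/ssg/content/ssg-rhel8-ds.xml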
Vulnerability Scanning
Keeping an eye on your estate and ensuring that there are no vulnerabilities is vital if you want to avoid any nasty surprises waiting for you.
OpenSCAP
OpenSCAP is another very good vulnerability scanning tool that is more
than just a scanning tool as previously discussed. OpenSCAP has the
ability to use multiple profiles and can be fully customized to scan based
on your organization's requirements.
ClamAV
If you need an open source antivirus, ClamAV can assist with the detection
of viruses, trojans, and many other types of malware. ClamAV can be used
to scan personal emails or files for any malicious content. ClamAV can also
serve as a server-side scanner.
The “paid-for” ClamAV product does an automatic and regular update
of its database, in order to be able to detect recent threats. The community
product requires some further configuration with cron jobs.
Harbor
Technically a container image registry, Harbor is an open source project that provides role-based access to its registry along with the ability to scan images for vulnerabilities. VMware has adopted Harbor as the container registry for its Tanzu Kubernetes platform.
Role-Based Access
Harbor secures artifacts with policies and role-based access control,
ensuring images are scanned and free from vulnerabilities.
Trivy
Harbor prior to version 2.2 used Clair as its vulnerability scanner but
has since moved on to use Trivy. Harbor can also be connected to more
than one vulnerability scanner. By connecting Harbor to more than one
scanner, you widen the scope of your protection against vulnerabilities.
JFrog Xray
JFrog Xray is a vulnerability scanning tool provided by JFrog. Xray is
natively integrated with Artifactory to scan for vulnerabilities and software
license issues. Xray is able to scan all supported package types from
binaries to container images.
Deep Scanning
Deep scanning allows Xray to scan for any threats recursively through
dependencies of packages or artifacts in Artifactory, before being released
for live deployments.
Clair
Clair (from a French term that means clear) is an open source project
which offers static security and vulnerability scanning for container images
and application containers.
Supported Images
The currently supported images that Clair can scan for vulnerabilities
include all the major enterprise distros discussed in this book. They are as
follows:
• Ubuntu
• Debian
• RHEL
• SUSE
• Oracle
Clair also supports the following images that are being used today in
different environments:
• Alpine
• AWS Linux
• VMware Photon
• Python
Enterprise Version
Clair is currently the vulnerability scanning tool that is used within the
Red Hat Quay (pronounced “kway” not “key”) product. Clair provides an
enterprise-grade vulnerability scanning tool for the Red Hat supported
container registry.
Continuous Scanning
Clair scans every image pushed to Quay and continuously scans images to
provide a real-time view of known vulnerabilities in your containers.
Dashboard
Clair too has a detailed dashboard showing the state of container images
stored within Quay.
Pipeline
Working in DevSecOps methodology, the Clair API can be leveraged in
pipeline tooling like Jenkins or Tekton to scan images being created during
the baking phase.
Vulnerability Scanning
Red Hat ACS provides the ability to find and fix vulnerabilities in containers running within Kubernetes or OpenShift platforms.
Compliance Scanning
Supported by informative dashboarding, Red Hat ACS can scan containers
and images to ensure they meet compliance requirements from standards
like CIS, PCI, or NIST, to name a few examples.
Network Segmentation
Ability to enforce network policies and tighter segmentation of allowed
network traffic in and out of Kubernetes or OpenShift environments.
Risk Profiling
All risks detected from deployments within Kubernetes or OpenShift can
be viewed in a priority list for remediation.
Configuration Management
Configuration management in Red Hat ACS is used not only to manage the security and vulnerabilities of container workloads within Kubernetes or OpenShift; it can also harden the cluster components themselves.
Falco
Created by Sysdig, Falco is another open source threat detection solution
for Kubernetes and OpenShift type environments. Falco can detect any unexpected behavior in applications and alerts you about threats at runtime.
Falco has the following features.
Immediate Alerting
Reduce risk to your estate with immediate alerts allowing quicker
remediation of vulnerabilities.
Aqua Security
Aqua Security is designed to protect applications that are built
using cloud-native containers and being deployed into hybrid cloud
infrastructure like Kubernetes or OpenShift.
Aqua Security has the following features.
Developer Guidance
Aqua Security guides developers in building container images that are
secure and clean by ensuring they don’t have any known vulnerabilities
in them. Aqua Security even checks that the container images being
developed do not have any known passwords or secrets and any kind of
security threat that could make those images vulnerable.
Informative Dashboarding
Aqua Security has a clear and useful dashboard that provides real-
time information about the platform being managed with all the issues
discovered. If any vulnerability is found, Aqua Security reports the issues
back to the developer with recommendations on what is required to fix the
vulnerable images.
Summary
In this chapter, you were introduced to the following:
CHAPTER 10
Maintenance Tasks
and Planning
Any Linux sysadmin will be familiar with the dreaded maintenance
required for Linux estates. In this chapter, we will discuss the various
maintenance jobs that should be done when managing a Linux estate.
We will look at what actual maintenance work should be done, when the
maintenance jobs should be run, and how to plan maintenance to cause
the least amount of downtime.
This chapter will also briefly look at how maintenance tasks and
bureaucratic tasks can be synced to reduce the overall pain that
routine maintenance can sometimes bring. Finally, we will discuss
how automation should be used to improve the overall maintenance
experience for everyone involved.
Patching
The number one reason for maintenance will be patching and system updates. It cannot be stressed enough how important this process is; it should never be neglected. Patching not only delivers package updates and bug fixes but also provides vulnerability remediation.
Staging
It is never a good idea to patch your production/live or customer-facing
environments before confirming that nothing will break with the current
round of updates.
This is why a staged approach to patching should always be taken.
Determine the order in which you want to patch your environments and
patch them in stages.
Figure 10-1 is an example of a patching order I have typically used in
the past.
Sandbox
Start with a sandbox-type environment containing at least one system that runs the same, or similar, applications as your production environment. This environment should not be user facing or require change control approval to work on. The entire environment should be disposable and automated. The sandbox is your environment; it exists to prove that configuration changes will not cause issues in other environments.
Automated Testing
If possible, make use of automated testing to prove that updates or patch configuration have not broken the functionality of your test application. There are numerous options available, both open source and proprietary, that can be used to automate application testing. Speak to your organization's developers or reach out to whoever provides your applications. They will more than likely give you recommendations on what you should use.
Here are some options you can also look into that could help:
• Selenium
• Katalon Studio
• Robotium
Automated Patching
If your patching process is well planned and platform testing can be
automated, there really is little stopping you from automating your actual
patching grunt work.
An automated patching run would typically cover steps such as the following, as sketched after the list:
• Prechecks
• Reboots
• Automated testing
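As a minimal sketch on a RHEL-family system, the grunt work could be as simple as the following sequence; in practice, you would wrap these steps in your automation or pipeline tooling rather than run them by hand, and the health check URL at the end is purely a placeholder:
# dnf check-update
# dnf -y update
# needs-restarting -r || systemctl reboot
# curl -fsS http://localhost/healthz
The first command records what will change, the update applies it, needs-restarting (from dnf-utils) triggers a reboot only when the updated packages require one, and the final command stands in for whatever automated test proves the application still works.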
Rollback
Pipelines or workflow tooling can also apply rollbacks if problems are detected, ensuring that when the maintenance window closes, nothing is left in a problematic state. Automation is great, but building in as much risk management as possible will save you from having to explain why an environment was broken by your automation.
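On RHEL-family systems, for example, the package manager's transaction history offers one limited form of rollback for package changes (it does not revert configuration files), while snapshots of virtual machines or LVM volumes give a fuller safety net:
# dnf history list
# dnf history info last
# dnf history undo last -y
The undo command reverses the most recent transaction, which is useful when a freshly applied update turns out to be the culprit.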
Hint You want as much risk reduction in place as possible so that your automation is not blamed for system outages. Automation is what makes your life easier, and it should be safeguarded from the naysayers.
Filesystem
One area that can grow and cause concern over time if not well maintained is your filesystems. They store not only logs but also the files users leave behind in their home directories. Paying attention to your filesystems before they become a problem is crucial to avoiding preventable outages.
Cleanup
During your system maintenance, it is definitely worth running through your filesystems and checking whether any files are no longer required. Removing these files, along with any temporary files, is recommended.
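As an example, the following are the kind of quick cleanup checks that can be run during a maintenance window; the retention periods shown are only illustrative:
# du -xh --max-depth=1 / | sort -h | tail
# journalctl --vacuum-time=30d
# dnf clean all
# find /tmp -type f -mtime +14 -print
The first command highlights the largest top-level directories, the second trims old journal entries, the third clears the package manager cache, and the last lists temporary files older than two weeks so they can be reviewed before deletion.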
Check for Errors
Once you have checked for unused files and cleared as much as possible, it
is well worth running a filesystem health check. This will help identify any
possible underlying issues before they become a problem further down
the line.
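A couple of examples, assuming XFS or ext4 filesystems; the device names are illustrative, and repair tools should only ever be pointed at unmounted filesystems, which is why the read-only flags below are deliberate:
# xfs_repair -n /dev/mapper/rhel-home
# e2fsck -n /dev/sdb1
# smartctl -H /dev/sda
The first two perform dry-run checks of XFS and ext4 filesystems respectively, and smartctl (from the smartmontools package) reports the overall SMART health of the underlying disk.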
Firewall
Firewall checks are just there to ensure no unexpected new rules have made their way onto your Linux system. Technically, these should be managed by configuration management tools, but in cases where you do not have SaltStack or Puppet running, checking the firewall is a quick and simple task. Doing it during a maintenance window just means you can remove any unwanted changes, provided you are covered by change control.
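Assuming firewalld or plain nftables is in use, a quick review can be as simple as:
# firewall-cmd --list-all
# nft list ruleset
The first shows the active firewalld zone with its allowed services and ports; the second dumps the full nftables ruleset for comparison against what you expect to be there.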
Backups
This one really goes without saying. Backups should be done for any
systems that cannot be rebuilt from code and done within acceptable time
frames. Virtual machines can be backed up in their entirety, but physical
systems will need to have specific directories backed up based on the
function of the server.
During your maintenance window, double-check that all backups have
been running and that a recent backup is in place.
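A quick sanity check might look like the following; /srv/backups is a hypothetical backup location, so substitute whatever path or tooling your backup solution actually uses:
# find /srv/backups -type f -mtime -1 -ls
# systemctl --failed
The first confirms that something was actually written to the backup location within the last 24 hours, and the second catches any failed units, including backup timers or services that have silently stopped running.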
How often your estate should have maintenance done will depend on a
few factors:
Two of the preceding points would indicate you have bigger problems
in your estate than maintenance; dealing with those first would be the
recommendation before trying to solve symptoms.
Structure
By having a regular maintenance window for each environment, you can plan and structure how updates are applied and tested. Doing this drastically reduces the possibility of issues in your live environments, from both bugs and vulnerabilities.
Automation
I'm sure I have mentioned automation enough times for you to be sick of the term, and I'm also fairly confident most of you are already automating by now.
To state the obvious, automating maintenance is about as important
as automating your build process. The following are items you should be
automating:
• Backups
• Patching
• Disk cleanup
• Disk checking
• Software removal
4. Disk cleanups.
5. Configuration checks.
Blue/Green
This method would involve switching traffic to either blue or green and
then patching the nonlive environment.
The blue/green approach does give you the ability to update directly in your live environments if you wish, as you are technically not updating the "live" side. Provided you do all your due diligence and ensure the environment you have patched is fully operational before switching back, you should never experience an outage.
Once you have completed maintenance on one side, you can apply the same maintenance to the second. As you have already proven there are no issues, you should be perfectly fine to proceed with your second site.
I would personally still recommend taking a staged approach, but if you are pressed for time, you at least have the protection of a second environment you can switch back to.
Failover
Running multiple data centers is another common approach to reducing
single points of failure. Failing your live traffic to your second data center
will allow maintenance to happen with zero downtime to live traffic.
The same principles of maintenance should apply before failing back
to your primary data center and patching your secondary site.
Maintenance Planning
The execution is only as good as the plan. There are a couple of important
things to consider for any maintenance planning.
If you have automated the process, then even better. Your environment can then stay up to date and run as smoothly as possible, reducing the need to constantly fix issues and allowing you more time to focus on the more exciting things.
Bite-Size Chunks
If you have a large amount of maintenance to do and have not automated
the process yet, break your maintenance down into bite-size chunks.
It is better to run multiple small maintenance windows than one large window.
Art of Estimating
Be careful how you calculate the time required to complete a task. It is better to overestimate and finish early than to underestimate and put yourself under pressure. Speak to other Linux sysadmins for help with estimating when planning.
Process Automation
There are a few different products and projects available today to assist you
in automating processes, but one worth mentioning is Red Hat Process
Automation Manager, or PAM.
Warning Red Hat PAM is the tool you will need to use to develop
your process tasks, just as you would with Ansible when automating
technical tasks.
Summary
In this chapter, you were introduced to the following:
PART IV
See, Analyze,
Then Act
Troubleshooting and asking for help are probably the most important
skills a good Linux sysadmin should have in their armory. Most of us learn
these skills through trial and error with experience or hopefully through
the experience of others with books like mine. The following chapters
show you how to approach and solve difficult problems.
CHAPTER 11
Troubleshooting
Troubleshooting can be a difficult skill to master if you do not understand the correct approach. Just digging through logs or configuration files may help resolve simple issues, but understanding how to find the root cause of an issue is where the real skill comes in. In this chapter, we will look at how a problem should be approached, how it should be analyzed, and finally how you should act on the information you have gathered. Taking the time to understand before guessing is paramount to solving your problem more quickly and efficiently.
Once we have been through how a problem should be approached, we will discuss the proper etiquette that should be used when asking questions in the community. Learning not to ask others to do your work for you, or at least framing your questions so that it is clear you have tried, is the first step. In this chapter, we will go through the best way to ask for help.
Finally, we will address the poor troubleshooting habits you should try to avoid.
Root cause analysis is about understanding why a problem occurred in the first place. Too often, bandages are applied to symptoms, and the underlying issues are not fixed. Fix the root cause and you save yourself all the pain later.
Explain to Yourself
We have known for decades that explaining a problem to ourselves can greatly increase our chances of solving it. By explaining the problem to yourself, you gain new knowledge about the issue, you ask yourself questions, and you challenge what you understand about the issue. Speak aloud if it helps and keep talking through the problem. Don't stew in silence; find a quiet room if you have to, and thrash out the issue.
Rubber Duck
If after explaining the problem to yourself you still don’t have something
tangible, grab an inanimate object (rubber duck) and explain your
problem to it. Just the process of explaining for a second time may help.
Another Person
If the rubber duck option fails, try explaining your problem to another person; they don't even have to be technical. In fact, it may be better if they are not. This forces you to simplify your explanation so they understand, and in the process it may help you uncover something you overlooked because it was so simple.
Use Tools
Using a whiteboard or scrap paper when explaining will also allow you to
get the ideas and thoughts out of your head. Rereading the explanation
back to yourself may add further clarity.
Example
If we take a scenario where your organization’s internal intranet won’t load
after the evening maintenance, the “whys” shown in Figure 11-1 can be asked.
Figure 11-1. Example of the flow that whys should follow when
trying to determine an underlying issue
The intranet is down. Why? Updates were applied and the system
rebooted. Why? The web server was not listening on any ports. Why? The
web server service will not start after boot. Why? There’s a syntax error in
the configuration file. Why? The final why brings you to your root cause: someone had made a change to the web server configuration and did not test the syntax. The change was never applied by restarting the web service, and only when the server was rebooted after the system updates were applied did the true problem manifest itself.
In this example, the problem was down to an undocumented change
that happened to the intranet web server. The change was never tested in
a syntax check, and the service was never restarted to apply the change.
After the server updates and reboot, the web server tried to start on boot
but failed due to a syntax error.
Hypothesis Building
The workflow shown in Figure 11-2 can help with your root cause analysis
by building a series of hypotheses.
Causality
When building your hypothesis, avoid falling into the trap of not understanding the cause and effect of a component, for example, blaming the kernel version because a new graphics card failed to load. Even though the kernel is responsible for device drivers, it still requires the driver from the hardware manufacturer to be compiled into the kernel or loaded as a module.
Remediation
Finally, with your theory proved and a solution prepared, the live
environment can be remediated with little to no risk.
Training
If you are fortunate enough to have access to training materials, check if
there is anything you may have been taught in exercises that could assist.
Proper Grammar
Use proper grammar and spelling where possible. If English is not your
first language, then do the best you can and start your question with
something similar to this:
“I am sorry for my bad English, I hope my question makes sense.”
Try to avoid using slang and use the proper spelling of words where possible. Remember, you are asking for help; make sure your question is as clear as it can be.
Spelling
Spell-check your questions using whatever tool you have available. Google
Docs has decent spell-checking (I hope, as this book was written in it), and
it’s free.
Either write or copy your question to a new document and check for
both grammar and spelling mistakes.
A Better Question
Radeon RX 5700xt driver will not work with Fedora 34
I am currently trying to install the Radeon RX 5700XT graphics card in
my fresh install of Fedora 34. After reading the official documentation on the
AMD site and checking the help of the install command, I am still not able to
find a solution.
I have tried running the commands
./amdgpu-install-pro --opencl=pro,legacy
Tip Do not rush your question; take time to ask the question in a
clear and nonvague manner. People will respond to a question that
has been asked by someone who took the time and effort to ask
correctly.
Correct Area
People really do not like being asked questions about subjects that are not
relevant to the area you are asking your question in. Make sure you select
the correct forum or chat room or even support email before you ask your
question.
The general polite response will redirect you to the correct place, but someone with less patience may give a slightly more sarcastic answer. So, to avoid embarrassment, or at least wasting your time, ask your questions where they should be asked.
Forums
If you have a problem with a particular product or project, check if they have a forum you can ask questions on. Be sure to first check whether your question has already been asked.
Support Cases
If your problem is around an enterprise product that you or your
organization is paying subscriptions for, raise a support case with their
help desks. This is after all what you are paying for.
Be sure, however, to be very clear about what your problem is. Attach log files and, where possible, diagnostic outputs. Just by adding all the relevant files, you can sometimes get your problem solved quicker.
Live Debugging
Do not debug in live environments; anyone who says that live debugging
is ok is treading on very thin ice. All it takes is one syntax error or one
configuration file to be left in debug mode to cause an outage.
There is a reason why test environments and nonproduction
environments are built. Use them to find the root cause, not your live
environment.
Ghosts
Not everyone understands the term "red herring"; it is used in the UK to describe a misleading clue, something that leads you to chase a problem that does not really exist. A phantom. Avoid hunting for something that is unlikely to be the root cause of your problem, and keep applying logical thinking when searching for your root cause.
Summary
In this chapter, we explored the following about troubleshooting:
CHAPTER 12
Advanced
Administration
This final chapter of Linux System Administration for the 2020s is going to
explore ways that you, the Linux sysadmin, can dig deeper into the Linux
operating system to find the information you need.
This chapter will start by looking into system analysis and help you
understand how to get more information from your Linux system without
having to spend hours doing so. We will discuss what tools can be used to
both extract and decipher system information for you to get your answers
that bit quicker.
When system analysis tools and techniques do not give you all the
information you need, the use of additional tools is required to get more.
We will spend the remainder of this chapter looking at how you can extract
the last drops of information out of your Linux operating system.
System Analysis
As a Linux system administrator, you will have spent time looking through
configuration files and general system health to try to pinpoint the source
of a user's problem. This process can normally be painful and can take
time you do not want to spend. Having the correct tools can go a long way
in helping get to the bottom of an issue and allow you to focus on more
interesting things.
Here are some quick tools you can use to get information about a
Linux system.
Sosreport
With all enterprise Linux systems, "sosreport" is used to extract information for support teams. Sosreport is a plugin-based tool that can be run with different parameters to export different information. Its output is often requested by enterprise support teams when support cases are raised and is always worth uploading with any new case.
Sosreports are an archive of the problematic system configuration
and logs. Support teams are able to use the sosreport to better
understand the problems being experienced without requesting different
configuration files.
A sosreport can be created without specifying any parameters as
follows, but can also have additional parameters passed to cut down the
output or increase what is extracted:
# sosreport
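For example, assuming a reasonably recent sos package, options like the following can list the available plugins, limit the collection, or tag the archive with a support case number (the plugin names and case ID shown are illustrative):
# sosreport --list-plugins
# sosreport -o networking,logs --batch --case-id 01234567
The --batch option skips the interactive prompts, which is handy when gathering reports through automation.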
As a Linux sysadmin, you may wish to use sosreports for your own
diagnosis queries. Sosreports can be extracted manually if you wish to look
into a user's problem from your own test system.
If manual extraction of sosreports does not interest you, there are tools
that can be used to extract and summarize the configuration within the
reports.
xsos
One such tool is xsos, developed and maintained by community members;
xsos can take sosreport inputs and create a nice summary of the system.
For support staff, this saves more time than most realize as there is no need
to extract or sift through configuration files for a quick overview.
To run a basic test of xsos against the local system, you can run a command such as the following, which prints all sections:
# xsos -a
The preceding command will only output details from the system you are running it on. If you want to view a sosreport instead, you will need to install the xsos tool and pass it the path to your extracted sosreport.
The basic xsos report will output the following areas:
• Kdump configuration
• CPU details
• Storage
• LSPCI
System Information
All the device information about your Linux system can be found in the "/proc" directory. Within "/proc", there are different files like "meminfo" or "cpuinfo" which will show you the relevant information about each component. The "cpuinfo" file, for instance, will show you all the information about the CPUs attached to your Linux system, including the CPU flags.
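For example:
# grep -m1 'model name' /proc/cpuinfo
# grep -m1 flags /proc/cpuinfo
# head /proc/meminfo
These print the CPU model, the CPU feature flags, and the first memory counters respectively, without needing any extra tools installed.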
Shortcut Tools
If digging through “/proc” files is not for you, the tools listed in Table 12-1
can also be used to get basic information about your Linux system. Being
familiar with these tools will allow you to gain quick access to device
information when you need to diagnose any issues quickly.
lshw Will list a full summary of all hardware recognized by your system
lscpu Summary of all CPU information, similar to running
# cat /proc/cpuinfo
lsblk Quick list of all storage devices attached
lsusb List of all USB devices plugged into your Linux system
lspci Lists all the PCI controllers and devices plugged into PCI slots
lsscsi Lists all the scsi and sata devices attached to your system
More Details
If the shortcut tools do not give you enough detail, the tools listed in Table 12-2 can give you that bit more.
System Tracing
Learning what is happening under the covers is sometimes exactly what is needed when you are stuck on a stubborn issue. There are a few tools that can help you, as the Linux sysadmin, get these lower-level details.
Strace
An extremely useful tool to see what is happening with a process or
running application is “strace.” Strace can be run as a prefix to a command
or application and can also be attached to a running “pid.”
Installation
Strace is available in the common repositories of almost all Linux distros. In the case of Fedora, strace can be installed as follows:
# dnf install strace
The following command will show you everything that happens when
you run the free command:
# strace free -h
Output to a File
A very useful thing to do when using strace is to send the output to a file; from there, you can search for strings or values.
To output a strace run to a file, you can use a command similar to the following:
# strace -o /tmp/free.trace free -h
The output file can then be viewed in a text editor, which in some cases may even display different calls in different colors to make interpreting the output slightly easier.
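Strace can also filter system calls and attach to an already running process, which is often more useful than tracing everything; the PID shown is obviously a placeholder:
# strace -e trace=openat,read,write free -h
# strace -f -p 1234 -o /tmp/pid1234.trace
The -e option restricts the trace to the listed system calls, -p attaches to a running process, and -f follows any child processes it forks.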
Systemtap
Another nice tool to extract information from your Linux system is "systemtap." Systemtap provides a scripting language that uses files with the ".stp" extension and can be used to diagnose complex performance or functional problems on kernel-based Linux platforms.
Installation
Systemtap can be installed manually or by using the automated installation method.
Manual Install
The basic packages needed for systemtap are systemtap and systemtap-runtime. On a RHEL system, the following command will install these packages:
# yum install systemtap systemtap-runtime
Automated Install
Stap-prep is a simple utility that will work out the requirements for
systemtap and install them for you. To use stap-prep, you need to install
the package “systemtap-devel”.
Once you have installed the systemtap-devel package, run the
command stap-prep. The required files for the current running kernel will
be installed.
Systemtap Users
If you are using the normal Linux kernel module backend, you can run "stap" as root. However, if you want to allow other users to create and run systemtap scripts, the following groups must be created and the relevant users added to them (an example follows the list):
• stapusr
• stapdev
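Assuming those groups are present, an existing account can be granted access with usermod; the username here is illustrative:
# usermod -aG stapusr,stapdev jsmith
Members of stapusr can run precompiled instrumentation modules, while stapdev additionally allows compiling and running new scripts.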
Systemtap Scripts
On all systems where systemtap is installed, you will have access to
example scripts. These can be found at the following location:
/usr/share/systemtap/examples/
A good example to start with is the disktop script:
/usr/share/systemtap/examples/io/disktop.stp
What this script does is probe the kernel for information about the block devices attached:
# stap -v /usr/share/systemtap/examples/io/disktop.stp
Once the script is running, you will see it report any disk operations it observes.
To test this, run a dd command similar to the following in a new window:
# dd if=/dev/zero of=/tmp/stap-test.img bs=1M count=512
Cross Instrumentation
Often in live environments, it may not be possible to install all the systemtap packages to run probes or tests. For this reason, it is possible to create systemtap modules on one system and execute them on another that only has the systemtap-runtime package installed.
This allows one system to be used as the compiler for the systemtap instrumentation modules. The kernel versions would need to match, however, and the systems would need to be the same architecture. To build modules for different kernel versions, just reboot the build system into a different kernel.
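As a sketch of the workflow, the module is compiled on the build system and then executed on the target, which only needs systemtap-runtime; the module name here is illustrative:
# stap -p4 -m disktop_probe /usr/share/systemtap/examples/io/disktop.stp
# staprun disktop_probe.ko
The -p4 option stops stap after the module build pass, -m names the resulting kernel module, and staprun loads and runs that module on the target system. If the target runs a different kernel than the one currently booted on the build host, stap's -r option can point at another installed kernel release.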
System Tuning
Another important aspect of Linux system administration is understanding
how to tune a Linux system for the task it needs to perform.
This process can be difficult if you have no guidance from any of the
vendors or if you are new to managing Linux systems.
Tuned
The process of tuning your Linux system can involve an in-depth
understanding of kernel parameters and system configuration. However,
there is a very nice tool called “tuned” which has the ability to tune a
system using different profiles.
Installation
Tuned can be simply installed with yum on a RHEL system as per the following:
# yum install tuned
Tuned will also need to have its service enabled and started:
# systemctl enable --now tuned
Using Tuned
Tuned ships with a number of profiles that are provided during the installation. To see the currently active profile, you can run the following command:
# tuned-adm active
To list all the available profiles, run:
# tuned-adm list
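A different profile can then be applied with tuned-adm; throughput-performance, used here, is one of the profiles normally shipped with tuned:
# tuned-adm profile throughput-performance
# tuned-adm active
Running tuned-adm active again simply confirms the switch took effect.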
Summary
In this chapter, you were introduced to the following: