Programming Persistent Memory: A Comprehensive Guide for Developers
Steve Scargall
Santa Clara, CA, USA
Table of Contents

About the Author ........................................................ xiii
About the Technical Reviewer ............................................ xv
About the Contributors .................................................. xvii
Acknowledgments ......................................................... xxi
Preface ................................................................. xxiii
Persistent Libraries .................................................... 67
    libpmem ............................................................. 67
    libpmemobj .......................................................... 68
    libpmemobj-cpp ...................................................... 68
    libpmemkv ........................................................... 69
    libpmemlog .......................................................... 69
    libpmemblk .......................................................... 69
Tools and Command Utilities ............................................. 70
    pmempool ............................................................ 70
    pmemcheck ........................................................... 70
    pmreorder ........................................................... 71
Summary ................................................................. 71
Chapter 16: PMDK Internals: Important Algorithms and Data Structures .... 313
    A Pool of Persistent Memory: High-Level Architecture Overview ....... 313
    The Uncertainty of Memory Mapping: Persistent Memory Object Identifier ... 315
    Persistent Thread Local Storage: Using Lanes ........................ 318
    Ensuring Power-Fail Atomicity: Redo and Undo Logging ................ 320
        Transaction Redo Logging ........................................ 320
        Transaction Undo Logging ........................................ 321
        libpmemobj Unified Logging ...................................... 322
    Persistent Allocations: The Interface of a Transactional Persistent Allocator ... 323
    Persistent Memory Heap Management: Allocator Design for Persistent Memory ... 324
    ACID Transactions: Efficient Low-Level Persistent Transactions ...... 328
    Lazy Reinitialization of Variables: Storing the Volatile State on Persistent Memory ... 330
    Summary ............................................................. 331
Appendix B: How to Install the Persistent Memory Development Kit (PMDK) ... 395
    PMDK Prerequisites .................................................. 395
    Installing PMDK Using the Linux Distribution Package Repository ..... 395
        Package Naming Convention ....................................... 396
        Searching for Packages Within a Package Repository .............. 396
        Installing PMDK Libraries from the Package Repository ........... 398
    Installing PMDK on Microsoft Windows ................................ 402
Glossary ................................................................ 425
Index ................................................................... 429
About the Author
Steve Scargall is a persistent memory software and cloud architect at Intel
Corporation. As a technology evangelist, he supports the enabling and development
effort to integrate persistent memory technology into software stacks, applications,
and hardware architectures. This includes working with independent software
vendors (ISVs) on both proprietary and open source development, original equipment
manufacturers (OEMs), and cloud service providers (CSPs).
Steve holds a Bachelor of Science in computer science and cybernetics from the
University of Reading, UK, where he studied neural networks, AI, and robotics. He
has over 19 years’ experience providing performance analysis on the x86 and SPARC
architectures for the Solaris kernel and the ZFS and UFS file systems. He performed DTrace debugging in
enterprise and cloud environments during his tenures at Sun Microsystems and Oracle.
About the Technical Reviewer
Andy Rudoff is a principal engineer at Intel Corporation, focusing on non-volatile
memory programming. He is a contributor to the SNIA NVM Programming Technical
Work Group. His more than 30 years’ industry experience includes design and
development work in operating systems, file systems, networking, and fault management
at companies large and small, including Sun Microsystems and VMware. Andy has
taught various operating systems classes over the years and is a coauthor of the popular
UNIX Network Programming textbook.
About the Contributors
Piotr Balcer is a software engineer at Intel Corporation with many years’ experience
working on storage-related technologies. He holds a Bachelor of Science in engineering
from the Gdańsk University of Technology, Poland, where he studied system software
engineering. Piotr has been working on the software ecosystem for next-generation
persistent memory since 2014.
Acknowledgments
First and foremost, I would like to thank Ken Gibson for masterminding this book idea
and for gifting me the pleasure of writing and managing it. Your support, guidance, and
contributions have been instrumental in delivering a high-quality product.
If the Vulcan mind-meld or The Matrix Headjack were possible, I could have cloned
Andy Rudoff’s mind and allowed him to work on his daily activities. Instead, Andy’s
infinite knowledge of persistent memory had to be tapped through good old verbal
communication and e-mail. I sincerely thank you for devoting so much time to me and
this project. The results speak for themselves.
Debbie Graham was instrumental in helping me manage this colossal project. Her
dedication and support helped drive the project to an on-time completion.
To my friends and colleagues at Intel who contributed content, supported
discussions, helped with decision-making, and reviewed drafts during the book-writing
process: you are the real heroes. Without your heavily invested time and support, this
book would have taken considerably longer to complete, and it is a much better product
as a result of the collaborative effort. A huge thanks to all of you.
I'd like to express my sincerest gratitude and appreciation to the people at Apress,
without whom this book could not have been published. From the initial contact and
outline discussions through the entire publishing process to this final polished product,
the Apress team delivered continuous support and assistance. Many thanks to Susan,
Jessica, and Rita. It was a real pleasure working with you.
Preface
About This Book
Persistent memory is often referred to as non-volatile memory (NVM) or storage
class memory (SCM). In this book, we purposefully use persistent memory as an all-
encompassing term to represent all the current and future memory technologies that
fall under this umbrella. This book introduces the persistent memory technology and
provides answers to key questions. For software developers, those questions include:
What is persistent memory? How do I use it? What APIs and libraries are available?
What benefits can it provide for my application? What new programming methods do I
need to learn? How do I design applications to use persistent memory? Where can I find
information, documentation, and help?
System and cloud architects will be provided with answers to questions such as:
What is persistent memory? How does it work? How is it different from DRAM or SSD/
NVMe storage devices? What are the hardware and operating system requirements?
What applications need or could benefit from persistent memory? Can my existing
applications use persistent memory without being modified?
Persistent memory is not a plug-and-play technology for software applications.
Although it may look and feel like traditional DRAM memory, applications need to be
modified to fully utilize the persistence feature of persistent memory. That is not to say
that applications cannot run unmodified on systems with persistent memory installed;
they can, but they will not see the full potential of what persistent memory offers without
code modification.
Thankfully, server and operating system vendors collaborated very early in the
design phase and already have products available on the market. Linux and Microsoft
Windows already provide native support for persistent memory technologies. Many
popular virtualization technologies also support persistent memory.
For ISVs and the developer community at large, the journey is just beginning. Some
software has already been modified and is available on the market. However, it will
take time for the enterprise and cloud computing industries to adopt and make the
hardware available to the general marketplace. ISVs and software developers need time
to understand what changes to existing applications are required and implement them.
To make the required development work easier, Intel developed and open sourced
the Persistent Memory Development Kit (PMDK), available from https://pmem.io/pmdk/.
We introduce the PMDK in more detail in Chapter 5 and walk through most of
the available libraries in subsequent chapters. Each chapter provides an in-depth guide
so developers can understand what library or libraries to use. PMDK is a set of open
source libraries and tools based on the Storage Networking Industry Association (SNIA)
NVM programming model designed and implemented by over 50 industry partners. The
latest NVM programming model document can be found at
https://www.snia.org/tech_activities/standards/curr_standards/npm. The model describes how software
can utilize persistent memory features and enables designers to develop APIs that take
advantage of NVM features and performance.
Available for both Linux and Windows, PMDK facilitates persistent memory
programming adoption with higher-level language support. C and C++ support is fully
validated. Support for other languages such as Java and Python was a work in progress
at the time this book was written. Other languages are expected to also adopt the
programming model and provide native persistent memory APIs for developers. The
PMDK development team welcomes and encourages new contributions to core code,
new language bindings, or new storage engines for the persistent memory key-value
store called pmemkv.
This book assumes no prior knowledge of persistent memory hardware devices
or software development. The book layout allows you to freely navigate the content in
the order you want. It is not required to read all chapters in order, though we do build
upon concepts and knowledge described in previous chapters. In such cases, we make
backward and forward references to relevant chapters and sections so you can learn or
refresh your memory.
Book Structure
This book has 19 chapters, each one focusing on a different topic. The book has three
main sections. Chapters 1-4 provide an introduction to persistent memory architecture,
hardware, and operating system support. Chapters 5-16 allow developers to understand
the PMDK libraries and how to use them in applications. Finally, Chapters 17-19 provide
information on advanced topics such as RAS and replication of data using RDMA.
The appendixes provide separate procedures for installing the PMDK and the utilities
required for managing persistent memory. We also include updates on Java support and
on the future of the RDMA protocols. All of this content is considered temporal, so we did not
want to include it in the main body of the book.
Intended Audience
This book has been written with experienced application developers in mind. We
intend the content to be useful to a wider readership such as system administrators
and architects, students, lecturers, and academic research fellows to name but a few.
System designers, kernel developers, and anyone with a vested or passing interest in this
emerging technology will find something useful within this book.
Every reader will learn what persistent memory is, how it works, and how operating
systems and applications can utilize it. Provisioning and managing persistent memory
are vendor specific, so we include some resources in the Appendix sections to avoid
overcomplicating the main chapter content.
Application developers will learn, by example, how to integrate persistent memory
into existing or new applications. We use examples extensively throughout this book,
drawing on a variety of libraries available within the Persistent Memory Development Kit
(PMDK). Example code is provided in a variety of programming languages such as C,
C++, JavaScript, and others. We want developers to feel comfortable using these libraries
in their own projects. The book provides extensive links to resources where you can find
help and information.
System administrators and architects of cloud, high-performance computing,
and enterprise environments can use most of the content of this book to
understand persistent memory features and benefits to support applications and
developers. Imagine being able to deploy more virtual machines per physical server, or
to provide applications with this new memory/storage tier so that they can keep more
data closer to the CPU, or restart in a fraction of the time they previously could while
keeping a warm cache of data.
Students, lecturers, and academic research fellows will also benefit from many
chapters within this book. Computer science classes can learn about the hardware,
operating system features, and programming techniques. Lecturers are free to use the
content in student classes or to form the basis of research projects such as new persistent
memory file systems, algorithms, or caching implementations.
We introduce tools that profile the server and applications to better understand CPU,
memory, and disk IO access patterns. Using this knowledge, we show how applications
can be modified to take full advantage of persistence using the Persistent Memory
Development Kit (PMDK).
A Future Reference
The book content has been written to provide value for many years. Industry
specifications such as ACPI, UEFI, and the SNIA NVM programming model will,
unless otherwise stated by the specification, remain backward compatible as new
versions are released. If new form factors are introduced, the approach to programming
remains the same. We do not limit ourselves to one specific persistent memory vendor
or implementation. In places where it is necessary to describe vendor-specific features
or implementations, we specifically call this out as it may change between vendors or
between product generations. We encourage you to read the vendor documentation for
the persistent memory product to learn more.
Developers using the Persistent Memory Development Kit (PMDK) will retain a stable
API interface. PMDK will deliver new features and performance improvements with each
major release. It will evolve with new persistent memory products, CPU instructions,
platform designs, industry specifications, and operating system feature support.
The code examples provided with this book have been tested and validated using
Intel Optane DC persistent memory. Since the PMDK is vendor neutral, they will also
work on NVDIMM-N devices. PMDK will support any future persistent memory product
that enters the market.
The code examples used throughout this book are current at the time of
publication. All code examples have been validated and tested to ensure they compile
and execute without error. For brevity, some of the examples in this book use assert()
statements to indicate unexpected errors. Production code would likely replace
these with appropriate error handling, including friendlier error messages and
error recovery actions. Additionally, some of the code
examples use different mount points to represent persistent memory aware file systems,
for example “/daxfs”, “/pmemfs”, and “/mnt/pmemfs”. This demonstrates persistent
memory file systems can be mounted and named appropriately for the application, just
memory file systems can be mounted and named appropriately for the application, just
like regular block-based file systems. Source code is from the repository that accompanies
this book: https://github.com/Apress/programming-persistent-memory.
Since this is a rapidly evolving technology, the software and API references
throughout this book may change over time. While every effort is made to be backward
compatible, sometimes software must evolve and invalidate previous versions. It is
therefore expected that some of the code samples may not compile on newer
hardware or operating systems and may need to be changed accordingly.
Book Conventions
This book uses several conventions to draw your attention to specific pieces of
information. The convention used depends on the type of information displayed.
Computer Commands
References to commands, programming libraries, and API functions may be presented in line
with the paragraph text using a monospaced font. For example:
To illustrate how persistent memory is used, let’s start with a sample program
demonstrating the key-value store provided by a library called libpmemkv.
Source Code
Source code examples taken from the accompanying GitHub repository are shown with
relevant line numbers in a monospaced font. Below each code listing is a reference to
the line number or line number range with a brief description. Code comments use
each language's native styling. Most languages use the same syntax: single-line comments
use // and block/multiline comments use /* .. */. An example is shown in
Listing 1.
• Line 45: Here we define a small helper routine, kvprint(), which prints
a key-value pair when called.
Notes
We use a standard format for notes, cautions, and tips when we want to direct your
attention to an important point.
CHAPTER 1
Introduction to Persistent Memory Programming
This book describes programming techniques for writing applications that use persistent
memory. It is written for experienced software developers, but we assume no previous
experience using persistent memory. We provide many code examples in a variety of
programming languages. Most programmers will understand these examples, even if
they have not previously used the specific language.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_1
Listing 1-1. A sample program demonstrating the key-value store provided by libpmemkv
37 #include <iostream>
38 #include <cassert>
39 #include <libpmemkv.hpp>
40
41 using namespace pmem::kv;
42 using std::cerr;
43 using std::cout;
44 using std::endl;
45 using std::string;
46
47 /*
48 * for this example, create a 1 Gig file
49 * called "/daxfs/kvfile"
50 */
51 auto PATH = "/daxfs/kvfile";
52 const uint64_t SIZE = 1024 * 1024 * 1024;
53
54 /*
55 * kvprint -- print a single key-value pair
56 */
57 int kvprint(string_view k, string_view v) {
58 cout << "key: " << k.data() <<
59 " value: " << v.data() << endl;
60 return 0;
61 }
62
63 int main() {
64 // start by creating the db object
65 db *kv = new db();
66 assert(kv != nullptr);
67
68 // create the config information for
69 // libpmemkv's open method
70 config cfg;
71
72 if (cfg.put_string("path", PATH) != status::OK) {
73 cerr << pmemkv_errormsg() << endl;
74 exit(1);
75 }
76 if (cfg.put_uint64("force_create", 1) != status::OK) {
77 cerr << pmemkv_errormsg() << endl;
78 exit(1);
79 }
80 if (cfg.put_uint64("size", SIZE) != status::OK) {
81 cerr << pmemkv_errormsg() << endl;
82 exit(1);
83 }
84
85
86 // open the key-value store, using the cmap engine
87 if (kv->open("cmap", std::move(cfg)) != status::OK) {
88 cerr << db::errormsg() << endl;
89 exit(1);
90 }
91
92 // add some keys and values
93 if (kv->put("key1", "value1") != status::OK) {
94 cerr << db::errormsg() << endl;
95 exit(1);
96 }
 97
 98 // iterate through the key-value store, printing the key-value pairs
 99 if (kv->get_all(kvprint) != status::OK) {
100 cerr << db::errormsg() << endl;
101 exit(1);
102 }
103
104 // stop the pmemkv engine
105 delete kv;
106
107 return 0;
108 }
What’s Different?
A wide variety of key-value libraries are available in practically every programming
language. The persistent memory example in Listing 1-1 is different because the key-
value store itself resides in persistent memory. For comparison, Figure 1-1 shows how a
key-value store using traditional storage is laid out.
When the application in Figure 1-1 wants to fetch a value from the key-value store,
a buffer must be allocated in memory to hold the result. This is because the values are
kept on block storage, which cannot be addressed directly by the application. The only
way to access a value is to bring it into memory, and the only way to do that is to read
full blocks from the storage device, which can only be accessed via block I/O. Now
consider Figure 1-2, where the key-value store resides in persistent memory like our
sample code.
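The difference between these two layouts can be sketched with a small toy model. Everything here is illustrative: the BlockDevice type, the 4 KiB block size, and the function names are inventions for this sketch, not part of PMDK or any real storage API.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t kBlockSize = 4096;

// Toy block device: the media can only be read in whole blocks,
// mimicking the block I/O constraint described above.
struct BlockDevice {
    std::vector<char> media;
    explicit BlockDevice(std::size_t nblocks) : media(nblocks * kBlockSize) {}
    void read_block(std::size_t lba, char* buf) {
        std::memcpy(buf, &media[lba * kBlockSize], kBlockSize);
    }
};

// Block-storage path: allocate a DRAM buffer, read the entire enclosing
// block, then copy the (much smaller) value out of it. For simplicity,
// this sketch assumes the value does not span a block boundary.
std::vector<char> fetch_value(BlockDevice& dev, std::size_t off, std::size_t len) {
    std::vector<char> block(kBlockSize);             // buffer in memory
    dev.read_block(off / kBlockSize, block.data());  // full-block read
    std::size_t in_block = off % kBlockSize;
    return std::vector<char>(block.begin() + in_block,
                             block.begin() + in_block + len);
}

// Persistent-memory path: the value is referenced where it lives.
const char* fetch_value_pmem(const char* pmem_base, std::size_t off) {
    return pmem_base + off;  // no buffer allocation, no block I/O
}
```

The block-storage path always moves a full block and copies the value; the persistent-memory path returns a pointer into the mapped region itself, which is exactly what allows kvprint() in Listing 1-1 to receive references to keys and values in place.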
With the persistent memory key-value store, values are accessed by the application
directly, without the need to first allocate buffers in memory. The kvprint() routine in
Listing 1-1 will be called with references to the actual keys and values, directly where
they live in persistence – something that is not possible with traditional storage. In
fact, even the data structures used by the key-value store library to organize its data are
accessed directly. When a storage-based key-value store library needs to make a small
update, for example, 64 bytes, it must read the block of storage containing those 64 bytes
into a memory buffer, update the 64 bytes, and then write out the entire block to make it
persistent. That is because storage accesses can only happen using block I/O, typically
4K bytes at a time, so the task to update 64 bytes requires reading 4K and then writing
4K. But with persistent memory, the same example of changing 64 bytes would only
write the 64 bytes directly to persistence.
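That read-modify-write cycle can be expressed as a sketch. The function names and the return-value convention (bytes transferred) are ours, purely for illustration; on real persistent memory, the direct store would also need its cache lines flushed, for example with pmem_persist() from libpmem, which the libpmem chapters discuss.

```cpp
#include <cstddef>
#include <cstring>

constexpr std::size_t kBlock = 4096;

// Storage path: read the enclosing 4 KiB block into DRAM, patch the
// 64 bytes there, then write the whole block back out.
// Returns the total bytes moved between application and device.
std::size_t update64_on_block_storage(char* device_block,
                                      const char* patch, std::size_t off) {
    char buf[kBlock];
    std::memcpy(buf, device_block, kBlock);  // 4 KiB block read
    std::memcpy(buf + off, patch, 64);       // modify the copy in DRAM
    std::memcpy(device_block, buf, kBlock);  // 4 KiB block write
    return 2 * kBlock;                       // 8 KiB moved for a 64-byte change
}

// Persistent-memory path: store the 64 bytes in place.
// Returns the bytes written toward persistence.
std::size_t update64_on_pmem(char* pmem, const char* patch, std::size_t off) {
    std::memcpy(pmem + off, patch, 64);      // direct store, no block I/O
    return 64;
}
```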
Program Complexity
Perhaps the most important point of our example is that the programmer still uses
the familiar get/put interfaces normally associated with key-value stores. The fact that
the data structures are in persistent memory is abstracted away by the high-level API
provided by libpmemkv. This principle of using the highest level of abstraction possible,
as long as it meets the application’s needs, will be a recurring theme throughout this
book. We start by introducing very high-level APIs; later chapters delve into the lower-
level details for programmers who need them. At the lowest level, programming directly
to raw persistent memory requires detailed knowledge of things like hardware atomicity,
cache flushing, and transactions. High-level libraries like libpmemkv abstract away all
that complexity and provide much simpler, less error-prone interfaces.
Starting from the bottom of Figure 1-4 and working upward are these components:
• And finally, the application that uses the API provided by libpmemkv.
Although there is quite a stack of components in use here, it does not mean there
is necessarily a large amount of code that runs for each operation. Some components
are only used during the initial setup. For example, the pmem-aware file system is
used to find the persistent memory file and perform permission checks; it is out of the
application’s data path after that. The PMDK libraries are designed to leverage the direct
access allowed by persistent memory as much as possible.
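The setup step itself is small. The sketch below shows its shape under simplifying assumptions: the path and size are arbitrary, error handling is minimal, and on a DAX-capable Linux kernel, a pmem-aware mapping would typically also pass the MAP_SYNC flag. Here we map an ordinary file the same way, which is enough to illustrate the load/store access model.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Open (creating if needed) and memory-map a file. On a pmem-aware (DAX)
// file system, the returned pointer gives the application direct
// load/store access to persistent memory; the file system is only
// involved here, during setup, and is out of the data path afterward.
char* map_file(const char* path, std::size_t size) {
    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0)
        return nullptr;
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) {
        close(fd);
        return nullptr;
    }
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor closes
    return p == MAP_FAILED ? nullptr : static_cast<char*>(p);
}
```

After map_file() returns, the application reads and writes the region with ordinary pointer dereferences; no further system calls sit between it and the memory.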
What’s Next?
Chapters 1 through 3 provide the essential background that programmers need to know to
start persistent memory programming. The stage is now set with a simple example; the next
two chapters provide details about persistent memory at the hardware and operating system
levels. The later and more advanced chapters provide much more detail for those interested.
Because the immediate goal is to get you programming quickly, we recommend
reading Chapters 2 and 3 to gain the essential background and then dive into Chapter 4
where we start to show more detailed persistent memory programming examples.
Summary
This chapter shows how high-level APIs like libpmemkv can be used for persistent
memory programming, hiding complex details of persistent memory from the
application developer. Using persistent memory can allow finer-grained access and
higher performance than block-based storage. We recommend using the highest-level,
simplest APIs possible and only introducing the complexity of lower-level persistent
memory programming as necessary.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 2
Persistent Memory Architecture
This chapter provides an overview of the persistent memory architecture while focusing
on the hardware to emphasize requirements and decisions that developers need to know.
Applications that are designed to recognize the presence of persistent memory in
a system can run much faster than when using other storage devices, because data does not
have to be transferred back and forth between the CPU and slower storage devices. Because
persistent memory may still be slower than dynamic random-access memory (DRAM),
applications should decide what data resides in DRAM, what resides in persistent
memory, and what resides in storage.
The capacity of persistent memory is expected to be many times larger than DRAM;
thus, the volume of data that applications can potentially store and process in place is
also much larger. This significantly reduces the number of disk I/Os, which improves
performance and reduces wear on the storage media.
On systems without persistent memory, large datasets that cannot fit into DRAM
must be processed in segments or streamed. This introduces processing delays as the
application stalls waiting for data to be paged from disk or streamed from the network.
If the working dataset size fits within the capacity of persistent memory and DRAM,
applications can perform in-memory processing without needing to checkpoint or page
data to or from storage. This significantly improves performance.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_2
Cache Hierarchy
We use load and store operations to read and write to persistent memory rather than
using block-based I/O to read and write to traditional storage. We suggest reading the
CPU architecture documentation for an in-depth description because each successive
CPU generation may introduce new features, methods, and optimizations.
Using the Intel architecture as an example, a CPU cache typically has three
distinct levels: L1, L2, and L3. The hierarchy refers to each cache's distance
from the CPU core, as well as its speed and size. The L1 cache is closest to
the CPU. It is extremely fast but very small. L2 and L3 caches are increasingly
larger in capacity, but they are relatively slower. Figure 2-1 shows a typical CPU
microarchitecture with three levels of CPU cache and a memory controller with
three memory channels. Each memory channel has a single DRAM and persistent
memory attached. On platforms where the CPU caches are not contained within
the power-fail protected domain, any modified data within the CPU caches that has
not been flushed to persistent memory will be lost when the system loses power or
crashes. Platforms that do include the CPU caches in the power-fail protected domain
will ensure that modified data within the CPU caches is flushed to persistent
memory should the system crash or lose power. We describe these requirements
and features in the upcoming “Power-Fail Protected Domains” section.
The L1 (Level 1) cache is the fastest memory in a computer system. In terms of access
priority, the L1 cache has the data the CPU is most likely to need while completing a
specific task. The L1 cache is also usually split two ways, into the instruction cache (L1 I)
and the data cache (L1 D). The instruction cache deals with the information about the
operation that the CPU has to perform, while the data cache holds the data on which the
operation is to be performed.
The L2 (Level 2) cache has a larger capacity than the L1 cache, but it is slower. L2
cache holds data that is likely to be accessed by the CPU next. In most modern CPUs,
the L1 and L2 caches are present on the CPU cores themselves, with each core getting
dedicated caches.
The L3 (Level 3) cache is the largest cache memory, but it is also the slowest of the
three. It is also a commonly shared resource among all the cores on the CPU and may be
internally partitioned to allow each core to have dedicated L3 resources.
Data read from DRAM or persistent memory is transferred through the memory
controller into the L3 cache, then propagated into the L2 cache, and finally the L1 cache
where the CPU core consumes it. When the processor is looking for data to carry out an
operation, it first tries to find it in the L1 cache. If the CPU finds it, the condition is
called a cache hit. If the CPU cannot find the data within the L1 cache, it then proceeds to
search for it first within L2, then L3. If it cannot find the data in any of the three, it tries to
access it from memory. Each failure to find data in a cache is called a cache miss. Failure
to locate the data in memory requires the operating system to page the data into memory
from a storage device.
When the CPU writes data, it is initially written to the L1 cache. Due to ongoing
activity within the CPU, at some point in time, the data will be evicted from the L1 cache
into the L2 cache. The data may be further evicted from L2 and placed into L3 and
eventually evicted from L3 into the memory controller’s write buffers where it is then
written to the memory device.
In a system that does not possess persistent memory, software persists data by writing it
to a non-volatile storage device such as an SSD, HDD, SAN, NAS, or a volume in the cloud.
This protects data from application or system crashes. Critical data can be manually flushed
using calls such as msync(), fsync(), or fdatasync(), which flush uncommitted dirty
pages from volatile memory to the non-volatile storage device. File systems provide fsck
or chkdsk utilities to check and attempt repairs on damaged file systems if required. File
systems do not protect user data from torn blocks. Applications have a responsibility to
detect and recover from this situation. That's why databases, for example, use a variety of
techniques such as transactional updates, redo/undo logging, and checksums.
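As a concrete illustration of the checksum technique, the sketch below seals each block with a hash of its payload so that a torn write can be detected on read. The 512-byte block layout and the FNV-1a hash are illustrative choices, not any particular database's on-media format:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative on-media block: payload plus a trailing checksum. */
struct block {
    uint8_t  payload[508];
    uint32_t csum;
};

/* FNV-1a: a simple, well-known 32-bit hash, used here as the checksum. */
static uint32_t fnv1a(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    while (n--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/* Seal a block before writing it out. */
void block_seal(struct block *b)
{
    b->csum = fnv1a(b->payload, sizeof(b->payload));
}

/* On read: a mismatch indicates a torn or corrupted block. */
int block_is_valid(const struct block *b)
{
    return b->csum == fnv1a(b->payload, sizeof(b->payload));
}
```

If a crash tears the block so that only part of it reaches the media, the stored checksum no longer matches the payload, and the application can fall back to its redo/undo log to reconstruct the block.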
Applications memory map the persistent memory address range directly into their
own memory address space. Therefore, the application must assume responsibility
for checking and guaranteeing data integrity. The rest of this chapter describes
your responsibilities in a persistent memory environment and how to achieve data
consistency and integrity.
from the power-fail protected domain using stored energy guaranteed by the platform for
this purpose. Data that has not yet made it into the protected domain will be lost.
Multiple persistence domains may exist within the same system, for example, on
systems with more than one physical CPU. Systems may also provide a mechanism for
partitioning the platform resources for isolation. This must be done in such a way that
SNIA NVM programming model behavior is assured from each compliant volume or file
system. (Chapter 3 describes the programming model as it applies to operating systems
and file systems. The “Detecting Platform Capabilities” section in that chapter describes
the logic that applications should perform to detect platform capabilities including
power failure protected domains. Later chapters provide in-depth discussions into why,
how, and when applications should flush data, if required, to guarantee the data is safe
within the protected domain and persistent memory.)
Volatile memory loses its contents when the computer system’s power is interrupted.
Just like non-volatile storage devices, persistent memory keeps its contents even in the
absence of system power. Data that has been physically saved to the persistent memory
media is called data at rest. Data in-flight has the following meanings:
• Writes sent to the persistent memory device that have not yet been
physically committed to the media
• Data that has been temporarily buffered or cached in either the CPU
caches or memory controller
When a system is gracefully rebooted or shut down, the system maintains power
and can ensure all contents of the CPU caches and memory controllers are flushed such
that any in-flight or uncommitted data is successfully written to persistent memory
or non-volatile storage. When an unexpected power failure occurs, and assuming no
uninterruptible power supply (UPS) is available, the system must have enough stored
energy within the power supplies and capacitors dotted around it to flush data before the
power is completely exhausted. Any data that is not flushed is lost and not recoverable.
Asynchronous DRAM Refresh (ADR) is a feature supported on Intel products which
flushes the write-protected data buffers and places the DRAM in self-refresh. This
process is critical during a power loss event or system crash to ensure the data is in a safe
and consistent state on persistent memory. By default, ADR does not flush the processor
caches. A platform that supports ADR only includes persistent memory and the memory
controller’s write pending queues within the persistence domain. This is the reason
data in the CPU caches must be flushed by the application using the CLWB, CLFLUSHOPT,
CLFLUSH, non-temporal stores, or WBINVD machine instructions.
Enhanced Asynchronous DRAM Refresh (eADR) requires that a non-maskable
interrupt (NMI) routine be called to flush the CPU caches before the ADR event can begin.
Applications running on an eADR platform do not need to perform flush operations
because the hardware should flush the data automatically, but they are still required
to perform an SFENCE operation to maintain write order correctness. Stores should be
considered persistent only when they are globally visible, which the SFENCE guarantees.
Figure 2-2 shows both the ADR and eADR persistence domains.
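A minimal sketch of the two flushing disciplines follows, using the baseline x86 intrinsics `_mm_clflush` and `_mm_sfence` because they are universally available on x86-64; real code would prefer CLWB or CLFLUSHOPT where supported, as discussed above. The function names are illustrative:

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_sfence (SSE2 baseline) */
#include <stddef.h>
#include <stdint.h>

/* ADR platform: the application flushes each dirty cache line, then
 * fences. CLFLUSH also evicts the line; CLWB is preferred where
 * available because it leaves the line in the cache. */
static void persist_adr(const void *addr, size_t len)
{
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)63;  /* 64B lines */
    uintptr_t end = (uintptr_t)addr + len;
    uintptr_t p;

    for (p = start; p < end; p += 64)
        _mm_clflush((const void *)p);
    _mm_sfence();  /* order the flushes with respect to later stores */
}

/* eADR platform: the CPU caches are inside the persistence domain, so
 * no flushing is needed; SFENCE is still required for write ordering. */
static void persist_eadr(void)
{
    _mm_sfence();
}
```

On an eADR platform, the flush loop disappears entirely, which is why skipping it is such a significant performance win.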
which introduces additional maintenance routines that reduce server uptime. There
is also an environmental impact when using batteries as they must be disposed of
or recycled correctly. It is entirely possible for server or appliance OEMs to include a
battery in their product.
Because some appliance and server vendors plan to use batteries, and because
platforms will someday include the CPU caches in the persistence domain, a property is
available within ACPI such that the BIOS can notify software when the CPU flushes can
be skipped. On platforms with eADR, there is no need for manual cache line flushing.
2. Update the node pointer (next pointer) to point to the last node in
the list (Node 2 → Node 1).
3. Update the head pointer to point at the new node (Head → Node 2).
Figure 2-3 (Step 3) shows that the head pointer was updated in the CPU cached version,
but the Node 2 to Node 1 pointer has not yet been updated in persistent memory. This
is because the hardware can choose which cache lines to commit and the order may not
match the source code flow. If the system or application were to crash at this point, the
persistent memory state would be inconsistent, and the data structure would no longer
be usable.
1 SNIA NVM programming model spec: https://www.snia.org/tech_activities/standards/curr_standards/npm
Figure 2-3. Adding a new node to an existing linked list without a store barrier
To solve this problem, we introduce a memory store barrier to ensure the order of the
write operations is maintained. Starting from the same initial state, the pseudocode now
looks like this:
1. Create the new node.
2. Update the node pointer (next pointer) to point to the last node in
the list, and perform a store barrier/fence operation.
3. Update the head pointer to point at the new node.
Figure 2-4 shows that the addition of the store barrier allows the code to work as
expected and maintains a consistent data structure in the volatile CPU caches and on
persistent memory. We can see in Step 3 that the store barrier/fence operation waited
for the pointer from Node 2 to Node 1 to update before updating the head pointer. The
updates in the CPU cache match the persistent memory version, so they are now globally
visible. This is a simplistic approach to solving the problem because store barriers do not
provide atomicity or data integrity. A complete solution should also use transactions to
ensure the data is atomically updated.
Figure 2-4. Adding a new node to an existing linked list using a store barrier
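Sketched in C, the barrier version of the insert might look like the following. Here `persist()` stands in for the flush-plus-fence sequence, implemented with baseline x86 intrinsics, and the sketch assumes a node fits within a single 64-byte cache line; real code must flush every line it touches:

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_sfence */
#include <stddef.h>

struct node {
    struct node *next;
    int value;
};

/* Flush one cache line and fence: stand-in for the store barrier. */
static void persist(const void *addr)
{
    _mm_clflush(addr);
    _mm_sfence();
}

void list_insert_head(struct node **head, struct node *n, int value)
{
    /* Step 1: create (fill in) the new node. */
    n->value = value;

    /* Step 2: point the new node at the old head, then barrier so the
     * node contents reach persistent memory first. */
    n->next = *head;
    persist(n);

    /* Step 3: publish the new node through the head pointer. */
    *head = n;
    persist(head);
}
```

Because the node is made durable before the head pointer is updated, a crash between the two steps leaves the old, still-consistent list in persistent memory.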
The PMDK detects the platform, CPU, and persistent memory features when the
memory pool is opened and then uses the optimal instructions and fencing to preserve
write ordering. (Memory pools are files that are memory mapped into the process
address space; later chapters describe them in more detail.)
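One way to picture this detect-once, dispatch-many design is a flush routine selected through a function pointer when the pool is opened. This is only an illustration of the idea, not libpmem's actual implementation; all names here are invented:

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

typedef void (*flush_fn)(const void *addr, size_t len);

/* ADR path: flush every dirty cache line, then fence. */
static void flush_clflush(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)63;
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += 64)
        _mm_clflush((const void *)p);
    _mm_sfence();
}

/* eADR path: caches are persistent, so only ordering is needed. */
static void flush_fence_only(const void *addr, size_t len)
{
    (void)addr;
    (void)len;
    _mm_sfence();
}

static flush_fn do_flush;  /* chosen once when the pool is opened */

void pool_open(int caches_are_persistent)
{
    do_flush = caches_are_persistent ? flush_fence_only : flush_clflush;
}
```

The capability check runs once at open time; every subsequent flush call goes straight to the optimal routine for the platform.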
To insulate application developers from the complexities of the hardware and to keep
them from having to research and implement code specific to each platform or device,
the libpmem library provides a function that tells the application when an optimized
flush is safe to use or when it must fall back to the standard way of flushing stores to
memory-mapped files.
To simplify programming, we encourage developers to use libraries, such as libpmem
and others within the PMDK. The libpmem library is also designed to detect a platform
with a battery and, in that case, automatically convert flush calls into simple SFENCE
instructions. Chapter 5 introduces and describes the core libraries within the PMDK in
more detail, and later chapters take an in-depth look into each of the libraries to help
you understand their APIs and features.
Data Visibility
Understanding when data is visible to other processes or threads, and when it is safe in
the persistence domain, is critical when using persistent memory in applications. In the
Figure 2-2 and 2-3 examples, updates made to data in the CPU caches could become
visible to other processes or threads. Visibility and persistence are often not the same
thing, and changes made to persistent memory are often visible to other running threads
in the system before they are persistent. Visibility works the same way as it does for
normal DRAM, described by the memory model ordering and visibility rules for a given
platform (for example, see the Intel Software Development Manual for the visibility rules
for Intel platforms). Persistence of changes is achieved in one of three ways: either by
calling the standard storage API for persistence (msync on Linux or FlushFileBuffers
on Windows), by using optimized flush when supported, or by achieving visibility on
a platform where the CPU caches are considered persistent. This is one reason we use
flushing and fencing operations.
A pseudo C code example may look like this:
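The listing that belongs here was lost in formatting. As a stand-in, the following sketch uses the standard POSIX mmap/msync path (the file path and function name are illustrative) to show the distinction: the store becomes visible to other threads as soon as it reaches the cache, but it is persistent only after the flush:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 on success; the path is illustrative. */
int visibility_demo(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, 4096) < 0)
        return -1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    strcpy(p, "hello");       /* store: immediately VISIBLE to any other
                                 thread or mapping of this file */
    msync(p, 4096, MS_SYNC);  /* flush: only now is the data safely in
                                 the persistence domain */

    munmap(p, 4096);
    close(fd);
    return 0;
}
```

A crash between the strcpy() and the msync() loses the new data even though other threads could already have seen it.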
Figure 2-5 shows the flow implemented by libpmem, which first verifies that the
memory-mapped file (called a memory pool) resides on a file system that has the DAX
feature enabled and is backed by physical persistent memory. Chapter 3 describes DAX
in more detail.
On Linux, direct access is achieved by mounting an XFS or ext4 file system with
the "-o dax" option. On Microsoft Windows, NTFS enables DAX when the volume
is created and formatted using the DAX option. If the file system is not DAX-enabled,
applications should fall back to the legacy approach of using msync(), fsync(), or
FlushFileBuffers(). If the file system is DAX-enabled, the next check is to determine
whether the platform supports ADR or eADR by verifying whether or not the CPU caches
are considered persistent. On an eADR platform where CPU caches are considered
persistent, no further action is required. Any data written will be considered persistent,
and thus there is no requirement to perform any flushes, which is a significant
performance optimization. On an ADR platform, the next sequence of events identifies
the optimal flush operation based on the Intel machine instructions previously
described.
Figure 2-5. Flowchart showing how applications can detect platform features
If an application were not to check these things at startup, due to the persistent
nature of the media, it could get stuck in an infinite loop, for example:
1. Application starts.
2. Reads a memory address.
3. Encounters poison.
4. Crashes or exits.
5. Application restarts.
6. Reads the same memory address.
7. Encounters poison and crashes again.
8. …
The ACPI specification defines an Address Range Scrub (ARS) operation that the
operating system implements. This allows the operating system to perform a runtime
background scan operation across the memory address range of the persistent memory.
System administrators may manually initiate an ARS. The intent is to identify bad
or potentially bad memory regions before the application does. If ARS identifies an
issue, the hardware can provide a status notification to the operating system and the
application that can be consumed and handled gracefully. If the bad address range
contains data, some method to reconstruct or restore the data needs to be implemented.
Chapter 17 describes ARS in more detail.
Developers are free to implement these features directly within the application code.
However, the libraries in the PMDK handle these complex conditions, and they will be
maintained for each product generation while maintaining stable APIs. This gives you
a future-proof option without needing to understand the intricacies of each CPU or
persistent memory product.
What's Next?
Chapter 3 continues to provide foundational information from the perspective of the
kernel and user spaces. We describe how operating systems such as Linux and Windows
have adopted and implemented the SNIA non-volatile programming model that defines
recommended behavior between various user space and operating system kernel
components supporting persistent memory. Later chapters build on the foundations
provided in Chapters 1 through 3.
Summary
This chapter defines persistent memory and its characteristics, recaps how CPU caches
work, and describes why it is crucial for applications directly accessing persistent
memory to assume responsibility for flushing CPU caches. We focus primarily on
hardware implementations. User libraries, such as those delivered with the PMDK,
assume the responsibilities for architecture and hardware-specific operations and allow
developers to use simple APIs to implement them. Later chapters describe the PMDK
libraries in more detail and show how to use them in your application.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 3
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_3
Chapter 3 Operating System Support for Persistent Memory
Direct application access to volatile memory, combined with operating system I/O
access to storage devices, supports the most common application
programming model taught in introductory programming classes. In this model,
developers allocate data structures and operate on them at byte granularity in memory.
When the application wants to save data, it uses standard file API system calls to write
the data to an open file. Within the operating system, the file system executes this write
by performing one or more I/O operations to the storage device. Because these I/O
operations are usually much slower than CPU speeds, the operating system typically
suspends the application until the I/O completes.
Since persistent memory can be accessed directly by applications and can persist
data in place, it allows operating systems to support a new programming model that
combines the performance of memory while persisting data like a non-volatile storage
device. Fortunately for developers, while the first generation of persistent memory
was under development, Microsoft Windows and Linux designers, architects and
Figure 3-2 also shows the Block Translation Table (BTT) driver, which can be
optionally configured into the I/O subsystem. Storage devices such as HDDs and SSDs
present a native block size, with 512 and 4K bytes as two common native block sizes.
Some storage devices, especially NVM Express SSDs, provide a guarantee that when a
power failure or server failure occurs while a block write is in-flight, either all or none
of the block will be written. The BTT driver provides the same guarantee when using
persistent memory as a block storage device. Most applications and file systems depend
on this atomic write guarantee and should be configured to use the BTT driver, although
operating systems also provide the option to bypass the BTT driver for applications that
implement their own protection against partial block updates.
Memory-Mapped Files
Before describing the next operating system option for using persistent memory,
this section reviews memory-mapped files in Linux and Windows. When memory
mapping a file, the operating system adds a range to the application’s virtual
address space which corresponds to a range of the file, paging file data into physical
memory as required. This allows an application to access and modify file data as
byte-addressable in-memory data structures. This has the potential to improve
performance and simplify application development, especially for applications that
make frequent, small updates to file data.
Applications memory map a file by first opening the file, then passing the resulting
file handle as a parameter to the mmap() system call in Linux or to MapViewOfFile() in
Windows. Both return a pointer to the in-memory copy of a portion of the file. Listing 3-1
shows an example of Linux C code that memory maps a file, writes data into the file
by accessing it like memory, and then uses the msync system call to perform the I/O
operation to write the modified data to the file on the storage device. Listing 3-2 shows
the equivalent operations on Windows. We walk through and highlight the key steps in
both code samples.
50 #include <err.h>
51 #include <fcntl.h>
52 #include <stdio.h>
53 #include <stdlib.h>
54 #include <string.h>
55 #include <sys/mman.h>
56 #include <sys/stat.h>
57 #include <sys/types.h>
58 #include <unistd.h>
59
60 int
61 main(int argc, char *argv[])
62 {
63     int fd;
64     struct stat stbuf;
65     char *pmaddr;
66
67     if (argc != 2) {
68         fprintf(stderr, "Usage: %s filename\n",
69             argv[0]);
70         exit(1);
71     }
72
73     if ((fd = open(argv[1], O_RDWR)) < 0)
74         err(1, "open %s", argv[1]);
75
76     if (fstat(fd, &stbuf) < 0)
77         err(1, "stat %s", argv[1]);
78
79     /*
80      * Map the file into our address space for read
81      * & write. Use MAP_SHARED so stores are visible
82      * to other programs.
83      */
84     if ((pmaddr = mmap(NULL, stbuf.st_size,
85             PROT_READ|PROT_WRITE,
86             MAP_SHARED, fd, 0)) == MAP_FAILED)
87         err(1, "mmap %s", argv[1]);
88
89     /* Don't need the fd anymore because the mapping
90      * stays around */
91     close(fd);
92
93     /* store a string to the Persistent Memory */
94     strcpy(pmaddr, "This is new data written to"
95         " the file");
96
97     /*
98      * Simplest way to flush is to call msync().
99      * The length needs to be rounded up to a 4k page.
100      */
101     if (msync((void *)pmaddr, 4096, MS_SYNC) < 0)
102         err(1, "msync");
103
104     printf("Done.\n");
105     exit(0);
106 }
• Lines 67-74: We verify the caller passed a file name that can be
opened. The open call will create the file if it does not already exist.
• Line 76: We retrieve the file statistics to use the length when we
memory map the file.
• Line 84: We map the file into the application’s address space to allow
our program to access the contents as if in memory. In the second
parameter, we pass the length of the file, requesting Linux to initialize
memory with the full file. We also map the file with both READ and
WRITE access, and as SHARED, allowing other processes to map
the same file.
• Line 91: We retire the file descriptor which is no longer needed once
a file is mapped.
• Line 94: We write data into the file by accessing it like memory
through the pointer returned by mmap.
• Line 101: We explicitly flush the newly written string to the backing
storage device.
Listing 3-2 shows an example of C code that memory maps a file, writes data into
the file, and then uses the FlushViewOfFile() and FlushFileBuffers() system calls to
flush the modified data to the file on the storage device.
45 #include <fcntl.h>
46 #include <stdio.h>
47 #include <stdlib.h>
48 #include <string.h>
49 #include <sys/stat.h>
50 #include <sys/types.h>
51 #include <Windows.h>
52
53 int
54 main(int argc, char *argv[])
55 {
56     if (argc != 2) {
57         fprintf(stderr, "Usage: %s filename\n",
58             argv[0]);
59         exit(1);
60     }
61
• Lines 45-75: As in the previous Linux example, we take the file name
passed through argv and open the file.
• Line 81: We retrieve the file size to use later when memory mapping.
• Line 89: We take the first step to memory mapping a file by creating
the file mapping. This step does not yet map the file into our
application’s memory space.
• Line 106: This step maps the file into our memory space.
• Line 132: We flush the modified memory page to the backing storage.
• Line 139: We flush the full file to backing storage, including any
additional file metadata maintained by Windows.
• Line 146-157: We unmap the file, close the file, then exit the program.
Figure 3-4 shows what happens inside the operating system when an application
calls mmap() on Linux or CreateFileMapping() on Windows. The operating system
allocates memory from its memory page cache, maps that memory into the application’s
address space, and creates the association with the file through a storage device driver.
As the application reads pages of the file in memory, and if those pages are not
present in memory, a page fault exception is raised to the operating system which will
then read that page into main memory through storage I/O operations. The operating
system also tracks writes to those memory pages and schedules asynchronous I/O
operations to write the modifications back to the primary copy of the file on the storage
device. Alternatively, if the application wants to ensure updates are written back to
storage before continuing as we did in our code example, the msync system call on
Linux or FlushViewOfFile on Windows executes the flush to disk. This may cause the
operating system to suspend the program until the write finishes, similar to the file-write
operation described earlier.
This description of memory-mapped files using storage highlights some of the
disadvantages. First, a portion of the limited kernel memory page cache in main
memory is used to store a copy of the file. Second, for files that cannot fit in memory, the
application may experience unpredictable and variable pauses as the operating system
moves pages between memory and storage through I/O operations. Third, updates to
the in-memory copy are not persistent until written back to storage, so they can be lost in the
event of a failure.
Listing 3-3. Displaying persistent memory physical devices and regions on Linux
"phys_id":44,
"security":"disabled"
},
{
"dev":"nmem3",
"id":"8089-a2-1837-00000b5e",
"handle":257,
"phys_id":54,
"security":"disabled"
},
[...snip...]
{
"dev":"nmem8",
"id":"8089-a2-1837-00001114",
"handle":4129,
"phys_id":76,
"security":"disabled"
}
],
"regions":[
{
"dev":"region1",
"size":1623497637888,
"available_size":1623497637888,
"max_available_extent":1623497637888,
"type":"pmem",
"iset_id":-2506113243053544244,
"mappings":[
{
"dimm":"nmem11",
"offset":268435456,
"length":270582939648,
"position":5
},
{
"dimm":"nmem10",
"offset":268435456,
"length":270582939648,
"position":1
},
{
"dimm":"nmem9",
"offset":268435456,
"length":270582939648,
"position":3
},
{
"dimm":"nmem8",
"offset":268435456,
"length":270582939648,
"position":2
},
{
"dimm":"nmem7",
"offset":268435456,
"length":270582939648,
"position":4
},
{
"dimm":"nmem6",
"offset":268435456,
"length":270582939648,
"position":0
}
],
"persistence_domain":"memory_controller"
},
{
"dev":"region0",
"size":1623497637888,
"available_size":0,
"max_available_extent":0,
"type":"pmem",
"iset_id":3259620181632232652,
"mappings":[
{
"dimm":"nmem5",
"offset":268435456,
"length":270582939648,
"position":5
},
{
"dimm":"nmem4",
"offset":268435456,
"length":270582939648,
"position":1
},
{
"dimm":"nmem3",
"offset":268435456,
"length":270582939648,
"position":3
},
{
"dimm":"nmem2",
"offset":268435456,
"length":270582939648,
"position":2
},
{
"dimm":"nmem1",
"offset":268435456,
"length":270582939648,
"position":4
},
{
"dimm":"nmem0",
"offset":268435456,
"length":270582939648,
"position":0
}
],
"persistence_domain":"memory_controller",
"namespaces":[
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":1598128390144,
"uuid":"06b8536d-4713-487d-891d-795956d94cc9",
"sector_size":512,
"align":2097152,
"blockdev":"pmem0"
}
]
}
]
}
When a file system is created and mounted using a /dev/pmem* device, it can be
identified using the df command as shown in Listing 3-5.
PS C:\Users\Administrator> Get-PmemDisk
PartitionNumber DriveLetter Offset Size Type
--------------- ----------- ------ ---- ----
1 24576 15.98 MB Reserved
2 D 16777216 248.98 GB Basic
• You can leverage the rich features of leading file systems for
organizing, managing, naming, and limiting access for user’s
persistent memory files and directories.
• You can apply the familiar file system permissions and access rights
management for protecting data stored in persistent memory and for
sharing persistent memory between multiple users.
• System administrators can use existing backup tools that rely on file
system revision-history tracking.
Once a file backed by persistent memory is created and opened, an application still
calls mmap() or MapViewOfFile() to get a pointer to the persistent media. The difference,
shown in Figure 3-5, is that the persistent memory-aware file system recognizes that
the file is on persistent memory and programs the memory management unit (MMU)
in the CPU to map the persistent memory directly into the application’s address space.
Neither a copy in kernel memory nor synchronizing to storage through I/O operations
is required. The application can use the pointer returned by mmap() or MapViewOfFile()
to operate on its data in place directly in the persistent memory. Since no kernel I/O
operations are required, and because the full file is mapped into the application’s
memory, it can manipulate large collections of data objects with higher and more
consistent performance as compared to files on I/O-accessed storage.
Figure 3-5. Direct access (DAX) I/O and standard file API I/O paths through the
kernel
Listing 3-7 shows a C source code example that uses DAX to write a string directly
into persistent memory. This example uses one of the persistent memory API libraries
included in Linux and Windows called libpmem. Although we discuss these libraries in
depth in later chapters, we describe the use of two of the functions available in libpmem
in the following steps. The APIs in libpmem are common across Linux and Windows and
abstract the differences between underlying operating system APIs, so this sample code
is portable across both operating system platforms.
32 #include <sys/types.h>
33 #include <sys/stat.h>
34 #include <fcntl.h>
35 #include <stdio.h>
36 #include <errno.h>
37 #include <stdlib.h>
38 #ifndef _WIN32
39 #include <unistd.h>
40 #else
41 #include <io.h>
42 #endif
43 #include <string.h>
44 #include <libpmem.h>
45
46 /* Using 4K of pmem for this example */
47 #define PMEM_LEN 4096
48
49 int
50 main(int argc, char *argv[])
51 {
52     char *pmemaddr;
53     size_t mapped_len;
54     int is_pmem;
55
56     if (argc != 2) {
57         fprintf(stderr, "Usage: %s filename\n",
58             argv[0]);
59         exit(1);
60     }
61
62     /* Create a pmem file and memory map it. */
63     if ((pmemaddr = pmem_map_file(argv[1], PMEM_LEN,
64             PMEM_FILE_CREATE, 0666, &mapped_len,
65             &is_pmem)) == NULL) {
66         perror("pmem_map_file");
67         exit(1);
68     }
69
70     /* Store a string to the persistent memory. */
71     char s[] = "This is new data written to the file";
72     strcpy(pmemaddr, s);
73
74     /* Flush our string to persistence. */
75     if (is_pmem)
76         pmem_persist(pmemaddr, sizeof(s));
77     else
78         pmem_msync(pmemaddr, sizeof(s));
79
80     /* Delete the mappings. */
81     pmem_unmap(pmemaddr, mapped_len);
82
83     printf("Done.\n");
84     exit(0);
85 }
• Line 44: We include the header file for the libpmem API used in this
example.
Summary
Figure 3-6 shows the complete view of the operating system support that this chapter
describes. As we discussed, an application can use persistent memory as a fast SSD,
more directly through a persistent memory-aware file system, or mapped directly into
the application’s memory space with the DAX option. DAX leverages operating system
services for memory-mapped files but takes advantage of the server hardware’s ability
to map persistent memory directly into the application’s address space. This avoids the
need to move data between main memory and storage. The next few chapters describe
considerations for working with data directly in persistent memory and then discuss the
APIs for simplifying development.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 4
Fundamental Concepts
of Persistent Memory
Programming
In Chapter 3, you saw how operating systems expose persistent memory to applications
as memory-mapped files. This chapter builds on this fundamental model and examines
the programming challenges that arise. Understanding these challenges is an essential
part of persistent memory programming, especially when designing a strategy for
recovery after application interruption due to issues like crashes and power failures.
However, do not let these challenges deter you from persistent memory programming!
Chapter 5 describes how to leverage existing solutions to save you programming time
and reduce complexity.
What’s Different?
Application developers typically think in terms of memory-resident data structures and
storage-resident data structures. For data center applications, developers are careful to
maintain consistent data structures on storage, even in the face of a system crash. This
problem is commonly solved using logging techniques such as write-ahead logging,
where changes are first written to a log and then flushed to persistent storage. If the data
modification process is interrupted, the application has enough information in the log
to finish the operation on restart. Techniques like this have been around for many years;
however, correct implementations are challenging to develop and time-consuming to
maintain. Developers often rely on a combination of databases, libraries, and modern
file systems to provide consistency. Even so, it is ultimately the application developer’s
responsibility to keep the application’s data structures consistent.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_4
Atomic Updates
Each platform supporting persistent memory will have a set of native memory
operations that are atomic. On Intel hardware, the atomic persistent store is 8 bytes.
Thus, if the program or system crashes while an aligned 8-byte store to persistent
memory is in-flight, on recovery those 8 bytes will either contain the old contents or
the new contents. The Intel processor has instructions that store more than 8 bytes,
but those are not failure atomic, so they can be torn by events like a power failure.
Transactions
Combining multiple operations into a single atomic operation is usually referred to as
a transaction. In the database world, the acronym ACID describes the properties of a
transaction: atomicity, consistency, isolation, and durability.
Atomicity
As described earlier, atomicity is when multiple operations are composed into a single
atomic action that either happens entirely or does not happen at all, even in the face of
system failure. For persistent memory, the most common techniques used are
• Redo logging, where the full change is first written to a log, so during
recovery, it can be rolled forward if interrupted.
• Undo logging, where information required to roll back the change is
first written to a log, so a partially done transaction can be rolled
back during recovery.
The preceding list is not exhaustive, and it ignores the details that can get relatively
complex. One common consideration is that transactions often include memory
allocation/deallocation. For example, a transaction that adds a node to a tree data
structure usually includes the allocation of the new node. If the transaction is rolled back,
the memory must be freed to prevent a memory leak. Now imagine a transaction that
performs multiple persistent memory allocations and free operations, all of which must
be part of the same atomic operation. The implementation of this transaction is clearly
more complex than just writing the new value to a log or updating a single pointer.
Consistency
Consistency means that a transaction can only move a data structure from one valid
state to another. For persistent memory, programmers usually find that the locking they
use to make updates thread-safe often indicates consistency points as well. If it is not
valid for a thread to see an intermediate state, locking prevents it from happening, and
when it is safe to drop the lock, that is because it is safe for another thread to observe the
current state of the data structure.
Isolation
Multithreaded (concurrent) execution is commonplace in modern applications. When
making transactional updates, the isolation is what allows the concurrent updates
to have the same effect as if they were executed sequentially. At runtime, isolation
for persistent memory updates is typically achieved by locking. Since the memory is
persistent, the isolation must be considered for transactions that were in-flight when
the application was interrupted. Persistent memory programmers typically detect
this situation on restart and roll partially done transactions forward or backward
appropriately before allowing general-purpose threads access to the data structures.
Durability
A transaction is considered durable if it is on persistent media when it is complete. Even if the
system loses power or crashes at that point, the transaction remains completed. As described
in Chapter 2, this usually means the changes must be flushed from the CPU caches. This can
be done using standard APIs, such as the Linux msync() call, or platform-specific instructions
such as Intel’s CLWB. When implementing transactions on persistent memory, pay careful
attention to ensure that log entries are flushed to persistence before changes are started and
flush changes to persistence before a transaction is considered complete.
Another aspect of the durable property is the ability to find the persistent
information again when an application starts up. This is so fundamental to how storage
works that we take it for granted. Metadata such as file names and directory names are
used to find the durable state of an application on storage. For persistent memory, the
same is true due to the programming model described in Chapter 3, where persistent
memory is accessed by first opening a file on a direct access (DAX) file system and then
memory mapping that file. However, a memory-mapped file is just a range of raw data;
how does the application find the data structures resident in that range? For persistent
memory, there must be at least one well-known location of a data structure to use as a
starting point. This is often referred to as a root object (described in Chapter 7). The root
object is used by many of the higher-level libraries within PMDK to access the data.
Start-Time Responsibilities
In Chapter 2 (Figures 2-5 and 2-6), we showed flowcharts outlining the application’s
responsibilities when using persistent memory. These responsibilities included
detecting platform details, available instructions, media failures, and so on. For storage,
these types of things happen in the storage stack in the operating system. Persistent
memory, however, allows direct access, which removes the kernel from the data path
once the file is memory mapped.
As a programmer, you may be tempted to map persistent memory and start using it,
as shown in the Chapter 3 examples. For production-quality programming, you want to
ensure these start-time responsibilities are met. For example, if you skip the checks in
Figure 2-5, you will end up with an application that flushes CPU caches even when it is
not required, and that will perform poorly on hardware that does not need the flushing.
If you skip the checks in Figure 2-6, you will have an application that ignores media
errors and may use corrupted data resulting in unpredictable and undefined behavior.
Summary
This chapter provides an overview of the fundamental concepts of persistent memory
programming. When developing an application that uses persistent memory, you must
carefully consider several areas:
• Atomic updates.
• Start-time responsibilities.
CHAPTER 5
Introducing the Persistent
Memory Development Kit
Previous chapters introduced the unique properties of persistent memory that make it
special, and you are correct in thinking that writing software for such a novel technology
is complicated. Anyone who has researched or developed code for persistent memory
can testify to this. To make your job easier, Intel created the Persistent Memory
Development Kit (PMDK). The team of PMDK developers envisioned it to be the
standard library for all things persistent memory that would provide solutions to the
common challenges of persistent memory programming.
Background
The PMDK has evolved to become a large collection of open source libraries and
tools for application developers and system administrators to simplify managing and
accessing persistent memory devices. It was developed alongside evolving support for
persistent memory in operating systems, which ensures the libraries take advantage of
all the features exposed through the operating system interfaces.
The PMDK libraries build on the SNIA NVM programming model (described in
Chapter 3). They extend it to varying degrees, some by simply wrapping around the
primitives exposed by the operating system with easy-to-use functions and others by
providing complex data structures and algorithms for use with persistent memory.
This means you are responsible for making an informed decision about which level of
abstraction is the best for your use case.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_5
Although the PMDK was created by Intel to support its hardware products, Intel is
committed to ensuring the libraries and tools are both vendor and platform neutral. This
means that the PMDK is not tied to Intel processors or Intel persistent memory devices.
It can be made to work on any other platform that exposes the necessary interfaces
through the operating system, including Linux and Microsoft Windows. We welcome
and encourage contributions to PMDK from individuals, hardware vendors, and ISVs.
The PMDK has a BSD 3-Clause License, allowing developers to embed it in any software,
whether it’s open source or proprietary. This allows you to pick and choose individual
components of PMDK by integrating only the bits of code required.
The PMDK is available at no cost on GitHub (https://github.com/pmem/pmdk) and
has a dedicated web site at https://pmem.io. Man pages are delivered with PMDK and
are available online under each library’s own page. Appendix B of this book describes
how to install it on your system.
An active persistent memory community is available through Google Forums at
https://groups.google.com/forum/#!forum/pmem. This forum allows developers,
system administrators, and others with an interest in persistent memory to ask questions
and get assistance. This is a great resource.
Choosing the Right Semantics
Most of the PMDK libraries fall into one of two categories:
1. Volatile libraries are for use cases that only wish to exploit the
capacity of persistent memory.
2. Persistent libraries are for use in software that wants to implement
fail-safe persistent data structures.
While you are deciding how to best solve a problem, carefully consider which
category it fits into. The challenges that fail-safe persistent programs present are
significantly different from volatile ones. Choosing the right approach upfront will
minimize the risk of having to rewrite any code.
You may decide to use libraries from both categories for different parts of the
application, depending on feature and functional requirements.
Volatile Libraries
Volatile libraries are typically simpler to use because they can fall back to dynamic
random-access memory (DRAM) when persistent memory is not available. This
provides a more straightforward implementation. Depending on the workload, they may
also have lower overall overhead compared to similar persistent libraries because they
do not need to ensure consistency of data in the presence of failures.
This section explores the available libraries for volatile use cases in applications,
including what each library is and when to use it. Some of the libraries have
overlapping use cases.
libmemkind
What is it?
The memkind library, called libmemkind, is a user-extensible heap manager built
on top of jemalloc. It enables control of memory characteristics and partitioning of the
heap between different kinds of memory. The kinds of memory are defined by operating
system memory policies that have been applied to virtual address ranges. Memory
characteristics supported by memkind without user extension include control of
nonuniform memory access (NUMA) and page size features. The jemalloc nonstandard
interface has been extended to enable specialized kinds to make requests for virtual
memory from the operating system through the memkind partition interface. Through
the other memkind interfaces, you can control and extend memory partition features
and allocate memory while selecting enabled features. The memkind interface allows
you to create and control file-backed memory from persistent memory with the PMEM kind.
Chapter 10 describes this library in more detail. You can download memkind and
read the architecture specification and API documentation at http://memkind.github.io/memkind/.
memkind is an open source project on GitHub at https://github.com/memkind/memkind.
libvmemcache
What is it?
libvmemcache is an embeddable and lightweight in-memory caching solution that
takes full advantage of large-capacity memory, such as persistent memory with direct
memory access (DAX), through memory mapping in an efficient and scalable way.
libvmemcache has some unique characteristics.
The cache is tuned to work optimally with relatively large value sizes. The smallest
possible size is 256 bytes, but libvmemcache works best if the expected value sizes are
above 1 kilobyte.
Chapter 10 describes this library in more detail. libvmemcache is an open source
project on GitHub at https://github.com/pmem/vmemcache.
libvmem
libvmem is a deprecated predecessor to libmemkind. It is a jemalloc-derived
memory allocator, with both metadata and object allocations placed in a file-based
mapping. The libvmem library is an open source project available from
https://pmem.io/pmdk/libvmem/.
Persistent Libraries
Persistent libraries help applications maintain data structure consistency in the presence
of failures. In contrast to the previously described volatile libraries, these provide new
semantics and take full advantage of the unique possibilities enabled by persistent
memory.
libpmem
What is it?
libpmem is a low-level C library that provides basic abstraction over the primitives
exposed by the operating system. It automatically detects features available in the
platform and chooses the right durability semantics and memory transfer (memcpy())
methods optimized for persistent memory. Most applications will need at least parts of
this library.
Chapter 4 describes the requirements for applications using persistent memory, and
Chapter 6 describes libpmem in more depth.
libpmemobj
What is it?
libpmemobj is a C library that provides a transactional object store, with a manual
dynamic memory allocator, transactions, and general facilities for persistent memory
programming. This library solves many of the commonly encountered algorithmic and
data structure problems when programming for persistent memory. Chapter 7 describes
this library in detail.
libpmemobj-cpp
What is it?
libpmemobj-cpp, also known as libpmemobj++, is a C++ header-only library that uses
the metaprogramming features of C++ to provide a simpler, less error-prone interface to
libpmemobj. It enables rapid development of persistent memory applications by reusing
many concepts C++ programmers are already familiar with, such as smart pointers and
closure-based transactions.
This library also ships with custom-made, STL-compatible data structures and
containers, so that application developers do not have to reinvent the basic algorithms
for persistent memory.
libpmemkv
What is it?
libpmemkv is a generic embedded local key-value store optimized for persistent
memory. It is easy to use and ships with many different language integrations, including
C, C++, and JavaScript.
This library has a pluggable back end for different storage engines. Thus, it can
be used as a volatile library, although it was originally designed primarily to support
persistent use cases.
Chapter 9 describes this library in detail.
libpmemlog
What is it?
libpmemlog is a C library that implements a persistent memory append-only log file
with power fail-safe operations.
libpmemblk
What is it?
libpmemblk is a C library for managing fixed-size arrays of blocks. It provides fail-safe
interfaces to update the blocks through buffer-based functions.
Tools and Command Utilities
pmempool
The pmempool utility is a tool for managing and offline analysis of persistent
memory pools. It provides a variety of functionalities that are useful throughout the
entire life cycle of an application.
pmemcheck
The pmemcheck utility is a Valgrind-based tool for dynamic runtime analysis
of common persistent memory errors, such as a missing flush or incorrect use of
transactions. Chapter 12 describes this utility in detail.
pmreorder
The pmreorder utility helps detect data structure consistency problems of persistent
applications in the presence of failures. It does this by first recording and then replaying
the persistent state of the application while verifying consistency of the application’s
data structures at any possible intermediate state. Chapter 12 describes this utility in
detail.
Summary
This chapter provides a brief listing of the libraries and tools available in PMDK
and when to use them. You now have enough information to know what is possible.
Throughout the rest of this book, you will learn how to create software using these
libraries and tools.
The next chapter introduces libpmem and describes how to use it to create simple
persistent applications.
CHAPTER 6
libpmem: Low-Level
Persistent Memory
Support
This chapter introduces libpmem, one of the smallest libraries in PMDK. This C library
is very low level, dealing with things like CPU instructions related to persistent memory,
optimal ways to copy data to persistence, and file mapping. Programmers who only want
completely raw access to persistent memory, without libraries to provide allocators or
transactions, will likely want to use libpmem as a basis for their development.
The code in libpmem that detects the available CPU instructions, for example, is a
mundane boilerplate code that you do not want to invent repeatedly in applications.
Leveraging this small amount of code from libpmem will save time, and you get the
benefit of fully tested and tuned code in the library.
For most programmers, libpmem is too low level, and you can safely skim this
chapter quickly (or skip it altogether) and move on to the higher-level, friendlier
libraries available in PMDK. All the PMDK libraries that deal with persistence, such as
libpmemobj, are built on top of libpmem to meet their low-level needs.
Like all PMDK libraries, online man pages are available. For libpmem, they are at
http://pmem.io/pmdk/libpmem/. This site includes links to the man pages for both the
Linux and Windows version. Although the goal of the PMDK project was to make the
interfaces similar across operating systems, some small differences appear as necessary.
The C code examples used in this chapter build and run on both Linux and Windows.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_6
Using the Library
To use libpmem, start by including the appropriate header, as shown in Listing 6-1.
32
33 /*
34 * simple_copy.c
35 *
36 * usage: simple_copy src-file dst-file
37 *
38 * Reads 4KiB from src-file and writes it to dst-file.
39 */
40
41 #include <sys/types.h>
42 #include <sys/stat.h>
43 #include <fcntl.h>
44 #include <stdio.h>
45 #include <errno.h>
46 #include <stdlib.h>
47 #ifndef _WIN32
48 #include <unistd.h>
49 #else
50 #include <io.h>
51 #endif
52 #include <string.h>
53 #include <libpmem.h>
Notice the include on line 53. To use libpmem, use this include line, and link the C
program with libpmem using the -lpmem option when building under Linux.
Mapping a File
The libpmem library contains some convenience functions for memory mapping files.
Of course, your application can call mmap() on Linux or MapViewOfFile() on Windows
directly, but using libpmem has some advantages.
Listing 6-2 shows how to memory map a file on a persistent memory-aware file
system into the application.
As part of the persistent memory detection mentioned earlier, the flag is_pmem is
returned by pmem_map_file. It is the caller’s responsibility to use this flag to determine
how to flush changes to persistence. When making a range of memory persistent, the
caller can use the optimal flush provided by libpmem, pmem_persist, only if the is_pmem
flag is set. This is illustrated in the man page example excerpt in Listing 6-3.
Listing 6-3 shows the convenience function pmem_msync(), which is just a small
wrapper around msync() or the Windows equivalent. You do not need to build in
different logic for Linux and Windows because libpmem handles this.
Notice how the is_pmem flag on line 96 is used just like it would be for calls to
pmem_persist(), since the pmem_memcpy_persist() function includes the flush to persistence.
The pmem_memcpy_persist() interface includes the flush to persistence because it
may determine that the copy is more optimally performed by using non-temporal stores,
which bypass the CPU cache and do not require subsequent cache flush instructions for
persistence. By providing this API, which both copies and flushes, libpmem is free to use
the most optimal way to perform both steps.
These steps are performed together when pmem_persist() is called, or they can be
called individually by calling pmem_flush() for the first step and pmem_drain() for the
second. Note that either of these steps may be unnecessary on a given platform, and
the library knows how to check for that and do what is correct. For example, on Intel
platforms, pmem_drain is an empty function.
When does it make sense to break flushing into steps? The example in Listing 6-5
illustrates one reason you might want to do this. Since the example copies data using
multiple calls to memcpy(), it uses the version of libpmem copy (pmem_memcpy_nodrain())
that only performs the flush, postponing the final drain step to the end. This works
because, unlike the flush step, the drain step does not take an address range; it is a
system-wide drain operation, so it can happen at the end of the loop that copies individual
blocks of data.
58 /*
59 * do_copy_to_pmem
60 */
61 static void
62 do_copy_to_pmem(char *pmemaddr, int srcfd, off_t len)
63 {
64 char buf[BUF_LEN];
65 int cc;
66
67 /*
68 * Copy the file,
69 * saving the last flush & drain step to the end
70 */
71 while ((cc = read(srcfd, buf, BUF_LEN)) > 0) {
72 pmem_memcpy_nodrain(pmemaddr, buf, cc);
73 pmemaddr += cc;
74 }
75
76 if (cc < 0) {
77 perror("read");
78 exit(1);
79 }
80
81         /* Perform final drain step */
82 pmem_drain();
83 }
Summary
This chapter demonstrated some of the fairly small set of APIs provided by libpmem.
This library does not track what changed for you, does not provide power fail-safe
transactions, and does not provide an allocator. Libraries like libpmemobj (described in
the next chapter) provide all those features and use libpmem internally for simple flushing
and copying.
CHAPTER 7
libpmemobj: A Native
Transactional Object Store
In the previous chapter, we described libpmem, the low-level persistent memory library
that provides you with an easy way to directly access persistent memory. libpmem is a
small, lightweight, and feature-limited library that is designed for software that tracks
every store to pmem and needs to flush those changes to persistence. It excels at what
it does. However, most developers will find higher-level libraries within the Persistent
Memory Development Kit (PMDK), like libpmemobj, to be much more convenient.
This chapter describes libpmemobj, which builds upon libpmem and turns persistent
memory-mapped files into a flexible object store. It supports transactions, memory
management, locking, lists, and several other features.
What is libpmemobj?
The libpmemobj library provides a transactional object store in persistent memory for
applications that require transactions and persistent memory management using direct
access (DAX) to the memory. Briefly recapping our DAX discussion in Chapter 3, DAX
allows applications to memory map files on a persistent memory-aware file system to
provide direct load/store operations without paging blocks from a block storage device.
It bypasses the kernel, avoids context switches and interrupts, and allows applications to
read and write directly to the byte-addressable persistent storage.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_7
Grouping Operations
With the exception of modifying a single scalar value that fits within the processor’s
word, a series of data modifications must be grouped together and accompanied by a
means of detecting an interruption before completion.
Memory Pools
Memory-mapped files are at the core of the persistent memory programming model.
The libpmemobj library provides a convenient API to easily manage pool creation and
access, avoiding the complexity of directly mapping and synchronizing data. PMDK
also provides a pmempool utility to administer memory pools from the command line.
Memory pools reside on DAX-mounted file systems.
Example 1. Create a libpmemobj (obj) type pool of minimum allowed size and
layout called “my_layout” in the mounted file system /mnt/pmemfs0/
$ pmempool create --layout my_layout obj /mnt/pmemfs0/pool.obj
Example 2. Create a libpmemobj (obj) pool of 20GiB and layout called “my_layout”
in the mounted file system /mnt/pmemfs0/
$ pmempool create --layout my_layout --size 20G obj \
/mnt/pmemfs0/pool.obj
Example 3. Create a libpmemobj (obj) pool using all available capacity within
the /mnt/pmemfs0/ file system using the layout name of “my_layout”
$ pmempool create --layout my_layout --max-size obj \
/mnt/pmemfs0/pool.obj
Applications can programmatically create pools that do not exist at application start
time using pmemobj_create(). pmemobj_create() has the following arguments:
• poolsize specifies the required size for the pool. The memory pool
file is fully allocated to the size poolsize using posix_fallocate(3).
The minimum size for a pool is defined as PMEMOBJ_MIN_POOL in
<libpmemobj.h>. If the pool already exists, pmemobj_create() will
return an EEXISTS error. Specifying poolsize as zero will take the
pool size from the file size and will verify that the file appears to be
empty by searching for any nonzero data in the pool header at the
beginning of the file.
• mode specifies the ACL permissions to use when creating the file, as
described by creat(2).
Listing 7-1 shows how to create a pool using the pmemobj_create() function.
38 #include <stdio.h>
39 #include <string.h>
40 #include <libpmemobj.h>
41
42 #define LAYOUT_NAME "rweg"
43 #define MAX_BUF_LEN 31
44
45 struct my_root {
46 size_t len;
47 char buf[MAX_BUF_LEN];
48 };
49
50 int
51 main(int argc, char *argv[])
52 {
53 if (argc != 2) {
54 printf("usage: %s file-name\n", argv[0]);
55 return 1;
56 }
57
58 PMEMobjpool *pop = pmemobj_create(argv[1],
59 LAYOUT_NAME, PMEMOBJ_MIN_POOL, 0666);
60
61 if (pop == NULL) {
62 perror("pmemobj_create");
63 return 1;
64 }
65
66 PMEMoid root = pmemobj_root(pop,
67 sizeof(struct my_root));
68
69 struct my_root *rootp = pmemobj_direct(root);
70
71 char buf[MAX_BUF_LEN] = "Hello PMEM World";
72
73 rootp->len = strlen(buf);
74 pmemobj_persist(pop, &rootp->len,
75 sizeof(rootp->len));
76
77 pmemobj_memcpy_persist(pop, rootp->buf, buf,
78 rootp->len);
79
80 pmemobj_close(pop);
81
82 return 0;
83 }
• Line 42: We define the name for our pool layout to be "rweg" (read-write example). This is just a name and can be any string that uniquely identifies the pool to the application. A NULL value is valid. In the case where multiple pools are opened by the application, this name uniquely identifies it.
• Lines 45-47: This defines the root object data structure, which has members len and buf. buf contains the string we want to write, and len is the length of the buffer.
• Lines 53-56: The pwriter command accepts one argument: the path and pool name to write to. For example, /mnt/pmemfs0/helloworld_obj.pool. The file name extension is arbitrary and optional.
• Line 66: Using the pop acquired from line 58, we use the pmemobj_root() function to locate the root object.
Figure 7-1. A high-level overview of a persistent memory pool with a pool object
pointer (POP) pointing to the root object
Using a valid pop pointer, you can use the pmemobj_root() function to get a pointer to the root object. Internally, this function creates a valid pointer by adding the current memory address of the mapped pool to the internal offset of the root.
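The idea can be sketched in ordinary C++. This is an illustration of the offset scheme only, not libpmemobj's internals; fake_pool and direct_ptr are invented names:

```cpp
#include <cassert>
#include <cstdint>

// Illustration only -- not libpmemobj internals. Objects inside a pool are
// identified by their offset from the start of the mapped file, so a usable
// pointer must be recomputed as "current base address + offset" on every
// run; the absolute address differs between executions because of ASLR.
struct fake_pool {
    unsigned char bytes[4096]; // stand-in for the memory-mapped pool file
};

static inline void *direct_ptr(fake_pool *base, std::uint64_t offset) {
    return base->bytes + offset; // this run's mapping base + stored offset
}
```

A real PMEMoid additionally carries the pool's uuid, so the library can tell which open pool mapping an offset belongs to.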
Listing 7-2. preader.c – An example showing how to open a pool and access the
root object and data
33 /*
34 * preader.c - Read a string from a
35 * persistent memory pool
36 */
37
38 #include <stdio.h>
39 #include <string.h>
40 #include <libpmemobj.h>
41
Memory Poolsets
The capacity of multiple pools can be combined into a poolset. Besides providing a
way to increase the available space, a poolset can be used to span multiple persistent
memory devices and provide both local and remote replication.
You open a poolset the same way as a single pool using pmemobj_open(). (At the
time of publication, pmemobj_create() and the pmempool utility cannot create poolsets.
Enhancement requests exist for these features.) Although creating poolsets requires
manual administration, poolset management can be automated via libpmempool or the
pmempool utility; full details appear in the poolset(5) man page.
Concatenated Poolsets
Individual pools can be concatenated using pools on a single or multiple file systems.
Concatenation only works with the same pool type: block, object, or log pools. Listing 7-3
shows an example “myconcatpool.set” poolset file that concatenates three smaller pools
into a larger pool. For illustrative purposes, each pool is a different size and located on
different file systems. An application using this poolset would see a single 700GiB memory
pool.
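Listing 7-3 itself is not reproduced here. Following the poolset(5) file syntax, a poolset matching the description could look like this sketch, where the part sizes and mount points are illustrative (three parts on different file systems that add up to 700GiB):

```
PMEMPOOLSET
100G /mnt/pmemfs0/myobjpool.part0
200G /mnt/pmemfs1/myobjpool.part1
400G /mnt/pmemfs2/myobjpool.part2
```

An application that opens this set with pmemobj_open() sees one contiguous 700GiB pool.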
Replica Poolsets
Besides combining multiple pools to provide more space, a poolset can also maintain multiple copies of the same data to increase resiliency. Data can be replicated to another poolset in a different file on the local host and to a poolset on a remote host.
Listing 7-4 shows a poolset file called “myreplicatedpool.set” that will replicate
local writes into the /mnt/pmem0/pool1 pool to another local pool, /mnt/pmem1/pool1,
on a different file system, and to a remote-objpool.set poolset on a remote host called
example.com.
PMEMPOOLSET
256G /mnt/pmem0/pool1

REPLICA
256G /mnt/pmem1/pool1

REPLICA example.com remote-objpool.set
The librpmem library, a remote persistent memory support library, underpins this
feature. Chapter 18 discusses librpmem and replica pools in more detail.
• pmempool rm removes a pool file or all pool files listed in a poolset configuration file.
POBJ_LAYOUT_BEGIN(cathouse);
POBJ_LAYOUT_TOID(cathouse, struct canaries);
POBJ_LAYOUT_TOID(cathouse, int);
POBJ_LAYOUT_END(cathouse);
The field val declared on the first line can be accessed using any of the subsequent
three operations:
TOID(int) val;
TOID_ASSIGN(val, oid_of_val); // Assigns 'oid_of_val' to typed OID 'val'
*D_RW(val) = 42;    // D_RW returns a typed writable pointer; write 42 through it
return *D_RO(val);  // D_RO returns a typed read-only (const) pointer; read through it
Allocating Memory
Using malloc() to allocate memory is familiar to C developers and to those who use
languages that do not fully automate memory allocation and deallocation. For
persistent memory, you can use pmemobj_alloc(), pmemobj_reserve(), or pmemobj_
xreserve() to reserve memory for a transient object and use it the same way you would
use malloc(). We recommend that you free allocated memory using pmemobj_free() or
POBJ_FREE() when the application no longer requires it to avoid a runtime memory leak.
Because these are volatile memory allocations, they will not cause a persistent leak after
a crash or graceful application exit.
Persisting Data
The typical intent of using persistent memory is to save data persistently. For this, you
need to use one of three APIs that libpmemobj provides:
• Atomic operations
• Reserve/publish
• Transactional
Atomic Operations
pmemobj_alloc() and its variants are easy to use, but they are limited in features, so additional coding is required by the developer.
These functions reserve the object in a temporary state, call the constructor you
provided, and then in one atomic action, mark the allocation as persistent. They will
insert the pointer to the newly initialized object into a variable of your choice.
If the new object needs to be merely zeroed, pmemobj_zalloc() does so without
requiring a constructor.
Because copying NULL-terminated strings is a common operation, libpmemobj
provides pmemobj_strdup() and its wide-char variant pmemobj_wcsdup() to handle
this. pmemobj_strdup() provides the same semantics as strdup(3) but operates on the
persistent memory heap associated with the memory pool.
Once you are done with the object, pmemobj_free() will deallocate the object while
zeroing the variable that stored the pointer to it. The pmemobj_free() function frees the
memory space represented by oidp, which must have been allocated by a previous call
to pmemobj_alloc(), pmemobj_xalloc(), pmemobj_zalloc(), pmemobj_realloc(),
or pmemobj_zrealloc(). The pmemobj_free() function provides the same semantics as
free(3), but instead of operating on the process heap supplied by the system, it operates
on the persistent memory heap.
Listing 7-5 shows a small example of allocating and freeing memory using the
libpmemobj API.
• Lines 79-80: If the pointer in line 72 is a valid object, we read the color
value, print the string, and free the object.
Reserve/Publish API
The atomic allocation API will not help if you need to modify several objects as a single atomic operation. For example, if your program needs to subtract money from account A and add it to account B, both operations must be done together. This can be done via the reserve/publish API.
To use it, you specify any number of operations to be done. The operations may be setting a scalar 64-bit value using pmemobj_set_value(), freeing an object with pmemobj_defer_free(), or allocating it using pmemobj_reserve(). Of these, only the allocation happens immediately, letting you do any initialization of the newly reserved object.
Modifications will not become persistent until pmemobj_publish() is called.
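The staging behavior can be conveyed with a toy model. staged_write, set_value, and publish are invented names; the real library additionally guarantees persistence and failure atomicity, which this sketch does not:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of reserve/publish: writes are recorded as staged actions and
// have no effect on memory until publish() applies them all together.
struct staged_write {
    std::uint64_t *dst;
    std::uint64_t value;
};

inline void set_value(std::vector<staged_write> &actions,
                      std::uint64_t *dst, std::uint64_t value) {
    actions.push_back(staged_write{dst, value}); // record; memory untouched
}

inline void publish(std::vector<staged_write> &actions) {
    for (const staged_write &a : actions) // apply every staged write
        *a.dst = a.value;
    actions.clear();
}
```

With this model, staging a debit on one balance and a credit on another leaves both unchanged until the single publish() call applies them together.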
Functions provided by libpmemobj related to the reserve/publish feature are
Listing 7-6 is a simple banking example that demonstrates how to change multiple
scalars (account balances) before publishing the updates into the pool.
Listing 7-6. Using the reserve/publish API to modify bank account balances
32
33 /*
34 * reserve_publish.c – An example using the
35 * reserve/publish libpmemobj API
36 */
37
..
44 #define POOL "/mnt/pmem/balance"
45
46 static PMEMobjpool *pool;
47
48 struct account {
49 PMEMoid name;
50 uint64_t balance;
51 };
52 TOID_DECLARE(struct account, 0);
53
..
60 static PMEMoid new_account(const char *name,
61 int deposit)
62 {
63 int len = strlen(name) + 1;
64
65 struct pobj_action act[2];
105 pmemobj_set_value(pool, &act[0],
106 &D_RW(account_a)->balance,
107             D_RW(account_a)->balance - price);
108 pmemobj_set_value(pool, &act[1],
109 &D_RW(account_b)->balance,
110 D_RW(account_b)->balance + price);
111 pmemobj_publish(pool, act, 2);
112
113 pmemobj_close(pool);
114 return 0;
115 }
• Lines 98-101: Create a new account for each owner with initial
balances.
Transactional API
The reserve/publish API is fast, but it does not allow reading data you have just written.
In such cases, you can use the transactional API.
The first time a variable is written, it must be explicitly added to the transaction. This can be done via pmemobj_tx_add_range() or its variants (xadd, _direct). Convenient macros such as TX_ADD() or TX_SET() can perform the same operation. The transaction-based functions and macros provided by libpmemobj include
TX_ADD(TOID o)
TX_ADD_FIELD(TOID o, FIELD)
TX_ADD_DIRECT(TYPE *p)
TX_ADD_FIELD_DIRECT(TYPE *p, FIELD)
The transaction may also allocate entirely new objects, reserve their memory, and
then persistently allocate them only on transaction commit. These functions include
We can rewrite the banking example from Listing 7-6 using the transaction API. Most
of the code remains the same except when we want to add or subtract amounts from the
balance; we encapsulate those updates in a transaction, as shown in Listing 7-7.
Listing 7-7. Using the transaction API to modify bank account balances
33 /*
34 * tx.c - An example using the transaction API
35 */
36
..
94 int main()
95 {
96 if (!(pool = pmemobj_create(POOL, " ",
97 PMEMOBJ_MIN_POOL, 0600)))
98         die("Can't create pool \"%s\": %m\n", POOL);
99
100 TOID(struct account) account_a, account_b;
101 TOID_ASSIGN(account_a,
102 new_account("Julius Caesar", 100));
103 TOID_ASSIGN(account_b,
104 new_account("Mark Anthony", 50));
105
106 int price = 42;
107 TX_BEGIN(pool) {
108 TX_ADD_DIRECT(&D_RW(account_a)->balance);
109 TX_ADD_DIRECT(&D_RW(account_b)->balance);
110 D_RW(account_a)->balance -= price;
111 D_RW(account_b)->balance += price;
112 } TX_END
113
114 pmemobj_close(pool);
115 return 0;
116 }
• Line 112: Finish the transaction. All updates will either complete
entirely or they will be rolled back if the application or system crashes
before the transaction completes.
Each transaction has multiple stages in which an application can interact. These
transaction stages include
• TX_STAGE_NONE: No open transaction in this thread.
• TX_STAGE_WORK: The transaction is in progress.
• TX_STAGE_ONCOMMIT: The transaction was successfully committed.
• TX_STAGE_ONABORT: Starting the transaction failed, or the transaction was aborted.
• TX_STAGE_FINALLY: The transaction is ready for cleanup.
The example in Listing 7-7 uses the two mandatory stages: TX_BEGIN and TX_END. However, we could easily have added the optional stages to perform actions at each step, for example:
TX_BEGIN(Pop) {
/* the actual transaction code goes here... */
} TX_ONCOMMIT {
/*
* optional - executed only if the above block
* successfully completes
*/
} TX_ONABORT {
/*
* optional - executed only if starting the transaction
* fails, or if transaction is aborted by an error or a
* call to pmemobj_tx_abort()
*/
} TX_FINALLY {
/*
* optional - if exists, it is executed after
* TX_ONCOMMIT or TX_ONABORT block
*/
} TX_END /* mandatory */
Optionally, you can provide a list of parameters for the transaction. Each parameter
consists of a type followed by a type-specific number of values:
Optional Flags
Many of the functions discussed for the atomic, reserve/publish, and transactional APIs
have a variant with a "flags" argument that accepts these values:
• Atomic allocations are the simplest and fastest, but their use is
limited to allocating and initializing wholly new blocks.
$ export PMEMOBJ_NLANES=512
$ ./my_app
$ unset PMEMOBJ_NLANES
$ export LD_LIBRARY_PATH=/usr/lib64/pmdk_debug
$ ./my_app
Or
$ LD_PRELOAD=/usr/lib64/pmdk_debug ./my_app
The output provided by the debug library is controlled using the PMEMOBJ_LOG_LEVEL
and PMEMOBJ_LOG_FILE environment variables. These variables have no effect on the
non-debug version of the library.
PMEMOBJ_LOG_LEVEL
The value of PMEMOBJ_LOG_LEVEL enables tracepoints in the debug version of the
library, as follows:
0. This is the default level when PMEMOBJ_LOG_LEVEL is not set. No
log messages are emitted at this level.
$ export PMEMOBJ_LOG_LEVEL=2
$ ./my_app
PMEMOBJ_LOG_FILE
The value of PMEMOBJ_LOG_FILE includes the full path and file name of a file where all
logging information should be written. If PMEMOBJ_LOG_FILE is not set, logging output is
written to STDERR.
The following example defines the location of the log file as /var/tmp/libpmemobj_debug.log, ensures we are using the debug version of libpmemobj when executing my_app in the background, sets the debug log level to 2, and monitors the log in real time using tail -f:
$ export PMEMOBJ_LOG_FILE=/var/tmp/libpmemobj_debug.log
$ export PMEMOBJ_LOG_LEVEL=2
$ LD_PRELOAD=/usr/lib64/pmdk_debug ./my_app &
$ tail -f /var/tmp/libpmemobj_debug.log
If the last character in the debug log file name is "-", the process identifier (PID) of
the current process will be appended to the file name when the log file is created. This is
useful if you are debugging multiple processes.
Summary
This chapter describes the libpmemobj library, which is designed to simplify persistent
memory programming. By providing APIs that deliver atomic operations, transactions,
and reserve/publish features, it makes creating applications less error prone while
delivering guarantees for data integrity.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 8
libpmemobj-cpp: The Adaptable Language - C++ and Persistent Memory
Introduction
The Persistent Memory Development Kit (PMDK) includes several separate libraries;
each is designed with a specific use in mind. The most flexible and powerful one is
libpmemobj. It complies with the persistent memory programming model without
modifying the compiler. Intended for developers of low-level system software and
language creators, the libpmemobj library provides allocators, transactions, and a way
to atomically manipulate objects. Because it does not modify the compiler, its API is
verbose and macro heavy.
To make persistent memory programming easier and less error prone, higher-level language bindings for libpmemobj were created and included in PMDK. The C++ language was chosen to create a new and friendly API to libpmemobj, called libpmemobj-cpp, which is also referred to as libpmemobj++. C++ is versatile and feature rich, has a large developer base, and is constantly being improved with updates to the C++ programming standard.
The main goal of the libpmemobj-cpp bindings design was to focus the modifications needed to adapt a volatile program on its data structures rather than on its code. In other words, the bindings give developers who want to adapt volatile applications a convenient API for modifying structures and classes, with only slight modifications to functions.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_8
This chapter describes how to leverage the C++ language features that support
metaprogramming to make persistent memory programming easier. It also describes
how to make it more C++ idiomatic by providing persistent containers. Finally, we
discuss C++ standard limitations for persistent memory programming, including an
object’s lifetime and the internal layout of objects stored in persistent memory.
Metaprogramming to the Rescue
Metaprogramming is a technique in which computer programs have the ability to treat
other programs as their data. It means that a program can be designed to read, generate,
analyze or transform other programs, and even modify itself while running. In some
cases, this allows programmers to minimize the number of lines of code to express a
solution, in turn reducing development time. It also allows programs greater flexibility to
efficiently handle new situations without recompilation.
For the libpmemobj-cpp library, considerable effort was put into encapsulating
the PMEMoids (persistent memory object IDs) with a type-safe container. Instead of a
sophisticated set of macros for providing type safety, templates and metaprogramming
are used. This significantly simplifies the native C libpmemobj API.
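To give a flavor of the approach, the following miniature (hypothetical code, not the actual libpmemobj-cpp implementation) shows how a template can carry the object's type information that the C API's TOID() macros must emulate by hand:

```cpp
#include <cassert>
#include <cstdint>

// A raw object ID is just a (pool uuid, offset) pair with no type
// information attached...
struct raw_oid {
    std::uint64_t pool_uuid;
    std::uint64_t off;
};

// ...while a template wrapper binds the element type at compile time, so
// passing a typed_oid<int> where a typed_oid<float> is expected fails to
// compile instead of silently corrupting data at run time.
template <typename T>
struct typed_oid {
    raw_oid oid;

    T *direct(unsigned char *pool_base) const {
        return reinterpret_cast<T *>(pool_base + oid.off);
    }
};
```

The template parameter plays the role of the TOID() type token, but the compiler, rather than a macro layer, enforces it.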
Persistent Pointers
The persistent memory programming model created by the Storage Networking Industry
Association (SNIA) is based on memory-mapped files. PMDK uses this model for its
architecture and design implementation. We discussed the SNIA programming model in
Chapter 3.
Most operating systems implement address space layout randomization (ASLR).
ASLR is a computer security technique involved in preventing exploitation of memory
corruption vulnerabilities. To prevent an attacker from reliably jumping to, for example,
a particular exploited function in memory, ASLR randomly arranges the address space
positions of key data areas of a process, including the base of the executable and the
positions of the stack, heap, and libraries. Because of ASLR, files can be mapped at
different addresses of the process address space each time the application executes.
As a result, traditional pointers that store absolute addresses cannot be used. Upon each execution, a traditional pointer might point to uninitialized memory or to data it was never meant to reference, so dereferencing it results in undefined behavior.
Transactions
Being able to modify more than 8 bytes of storage at a time atomically is imperative for
most nontrivial algorithms one might want to use in persistent memory. Commonly, a
single logical operation requires multiple stores. For example, an insert into a simple list-
based queue requires two separate stores: a tail pointer and the next pointer of the last
element. To enable developers to modify larger amounts of data atomically, with respect
to power-fail interruptions, the PMDK library provides transaction support in some of
its libraries. The C++ language bindings wrap these transactions into two concepts: one based on the resource acquisition is initialization (RAII) idiom, and the other based on a callable std::function object. Additionally, because of some C++ standard issues,
the scoped transactions come in two flavors: manual and automatic. In this chapter we
only describe the approach with std::function object. For information about RAII-
based transactions, refer to libpmemobj-cpp documentation (https://pmem.io/pmdk/
cpp_obj/).
The method which uses std::function is declared as

template <typename... Locks>
static void run(pool_base &pop, std::function<void()> tx, Locks &... locks);
Of course, this API is not limited to just lambda functions. Any callable target can
be passed as tx, such as functions, bind expressions, function objects, and pointers
to member functions. Since run is a normal static member function, it has the benefit
of being able to throw exceptions. If an exception is thrown during the execution of
a transaction, it is automatically aborted, and the active exception is rethrown so
information about the interruption is not lost. If the underlying C library fails for any
reason, the transaction is also aborted, and a C++ library exception is thrown. The
developer is no longer burdened with the task of checking the status of the previous
transaction.
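This behavior can be modeled with a toy stand-in. toy_pool and toy_transaction are invented names; the real pmem::obj::transaction::run also maintains a persistent undo log and supports locks, which this sketch omits:

```cpp
#include <cassert>
#include <functional>

// Toy model of the callable-based transaction API: the body runs inside
// run(); if it throws, the change is rolled back and the exception is
// rethrown, so information about the interruption is not lost.
struct toy_pool {
    int data = 0;
};

struct toy_transaction {
    static void run(toy_pool &pop, std::function<void()> tx) {
        int undo = pop.data; // snapshot the old value before the body runs
        try {
            tx();            // execute the transaction body
        } catch (...) {
            pop.data = undo; // abort: restore the snapshot...
            throw;           // ...and rethrow the active exception
        }
    }
};
```

Any callable target works as the body here, just as with the real API: a lambda, a bind expression, or a function object.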
libpmemobj-cpp transactions provide an entry point for persistent memory resident
synchronization primitives such as pmem::obj::mutex, pmem::obj::shared_mutex and
pmem::obj::timed_mutex. libpmemobj ensures that all locks are properly reinitialized
when one attempts to acquire a lock for the first time. The use of pmem locks is
completely optional, and transactions can be executed without them. The number of
supplied locks is arbitrary, and the types can be freely mixed. The locks are held until
the end of the given transaction, or the outermost transaction in the case of nesting. This
means when transactions are enclosed by a try-catch statement, the locks are released
before reaching the catch clause. This is extremely important in case some kind of
transaction abort cleanup needs to modify the shared state. In such a case, the necessary
locks need to be reacquired in the correct order.
Snapshotting
The C library requires manual snapshots before modifying data in a transaction. The
C++ bindings do all of the snapshotting automatically, to reduce the probability of
programmer error. The pmem::obj::p template wrapper class is the basic building block
for this mechanism. It is designed to work with basic types and not compound types
such as classes or PODs (Plain Old Data, structures with fields only and without any
object-oriented features). This is because it does not define operator->() and there is
no possibility to implement operator.(). The implementation of pmem::obj::p is based
on the operator=(). Each time the assignment operator is called, the value wrapped
by p will be changed, and the library needs to snapshot the old value. In addition to
snapshotting, the p<> template ensures the variable is persisted correctly, flushing data if
necessary. Listing 8-2 provides an example of using the p<> template.
39 struct bad_example {
40 int some_int;
41 float some_float;
42 };
43
44 struct good_example {
45 pmem::obj::p<int> pint;
46 pmem::obj::p<float> pfloat;
47 };
48
49 struct root {
50 bad_example bad;
51 good_example good;
52 };
53
54 int main(int argc, char *argv[]) {
55 auto pop = pmem::obj::pool<root>::open("/daxfs/file", "p");
56
57 auto r = pop.root();
58
59 pmem::obj::transaction::run(pop, [&]() {
60 r->bad.some_int = 10;
61 r->good.pint = 10;
62
63 r->good.pint += 1;
64 });
65
66 return 0;
67 }
Allocating
As with std::shared_ptr, the pmem::obj::persistent_ptr comes with a set of allocating
and deallocating functions. This helps allocate memory and create objects, as well as
destroy and deallocate the memory. This is especially important in the case of persistent memory, where object construction and memory allocation must be managed together.
39 struct my_data {
40 my_data(int a, int b): a(a), b(b) {
41
42 }
43
44 int a;
45 int b;
46 };
47
48 struct root {
49 pmem::obj::persistent_ptr<my_data> mdata;
50 };
51
52 int main(int argc, char *argv[]) {
53 auto pop = pmem::obj::pool<root>::open("/daxfs/file", "tx");
54
55 auto r = pop.root();
56
57 pmem::obj::transaction::run(pop, [&]() {
58 r->mdata = pmem::obj::make_persistent<my_data>(1, 2);
59 });
60
61 pmem::obj::transaction::run(pop, [&]() {
62 pmem::obj::delete_persistent<my_data>(r->mdata);
63 });
64 pmem::obj::make_persistent_atomic<my_data>(pop, r->mdata,
2, 3);
65
66 return 0;
67 }
What does the preceding mean from a C++ and libpmemobj’s perspective? There are
four major problems:
1. Object lifetime
2. Snapshotting, which requires trivially copyable types
3. A fixed object layout
4. Pointers
The standard states that properties ascribed to objects apply for a given object only
during its lifetime. In this context, the persistent memory programming problem is
similar to transmitting data over a network, where the C++ application is given an array of bytes and might be able to recognize the type of object sent. However, the object was not constructed in this application, so using it would result in undefined behavior.
This problem is well known and is being addressed by the WG21 C++ Standards
Committee Working Group (https://isocpp.org/std/the-committee and http://
www.open-std.org/jtc1/sc22/wg21/).
Currently, there is no way to overcome the object-lifetime obstacle and stop relying on undefined behavior from the C++ standard's point of view. libpmemobj-cpp is tested and validated with various C++11 compliant compilers and use case scenarios.
The only recommendation for libpmemobj-cpp users is that they must keep this
limitation in mind when developing persistent memory applications.
Trivial Types
Transactions are the heart of libpmemobj, so the C++ versions were designed with utmost care to be as easy to use as possible. Developers do not have to know the implementation details and do not have to worry about snapshotting modified data to make undo log–based transactions work. A special semi-transparent template property class has been implemented to automatically add variable modifications to the transaction undo log; it is described in the "Snapshotting" section.
But what does snapshotting data mean? The answer is very simple, but the
consequences for C++ are not. libpmemobj implements snapshotting by copying data of
given length from a specified address to another address using memcpy(). If a transaction
aborts or a system power loss occurs, the data will be written from the undo log when the
memory pool is reopened. Consider a definition of the following C++ object, presented
in Listing 8-4, and think about the consequences that a memcpy() has on it.
35 class nonTriviallyCopyable {
36 private:
37 int* i;
38 public:
39 nonTriviallyCopyable (const nonTriviallyCopyable & from)
40 {
41 /* perform non-trivial copying routine */
42 i = new int(*from.i);
43 }
44 };
Deep and shallow copying is the simplest example. The gist of the problem is that by copying the data manually, we may break the inherent behavior of the object, which may rely on its copy constructor. Any shared or unique pointer is another good example – by simply copying it with memcpy(), we break the "deal" we made with that class when we used it, which may lead to leaks or crashes.
The application must handle many more sophisticated details when it manually copies the contents of an object. The C++11 standard provides the <type_traits> header and the std::is_trivially_copyable trait, which checks whether a given type satisfies the requirements of TriviallyCopyable. According to the C++ standard, an object satisfies the TriviallyCopyable requirements when
A trivially copyable class is a class that:
— has no non-trivial copy constructors (12.8),
— has no non-trivial move constructors (12.8),
— has no non-trivial copy assignment operators (13.5.3, 12.8),
— has no non-trivial move assignment operators (13.5.3, 12.8), and
— has a trivial destructor (12.4).
A trivial class is a class that has a trivial default constructor (12.1) and is
trivially copyable.
[Note: In particular, a trivially copyable or trivial class does not have virtual functions or virtual base classes.]
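These requirements can be verified at compile time. The following sketch (type names invented for illustration) contrasts a safe type with the nonTriviallyCopyable example above:

```cpp
#include <cassert>
#include <type_traits>

// Compile-time guard sketch: verify a type may be snapshotted and restored
// byte-for-byte with memcpy() before placing it in persistent memory.
struct pm_safe {   // plain data members only: trivially copyable
    int a;
    float b;
};

struct pm_unsafe { // user-provided copy constructor: NOT trivially copyable
    int *i;
    pm_unsafe(const pm_unsafe &from) : i(new int(*from.i)) {}
};

static_assert(std::is_trivially_copyable<pm_safe>::value,
              "safe to snapshot with memcpy");
static_assert(!std::is_trivially_copyable<pm_unsafe>::value,
              "memcpy would bypass the copy constructor");
```

Making such checks static_asserts next to the type definition turns the "memcpy breaks the deal" hazard into a compile error rather than a run-time leak or crash.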
This means that a copy or move constructor is trivial if it is not user provided, the class has nothing virtual in it, and this property holds recursively for all the members of the class and for the base class. The C++ standard and the libpmemobj transaction implementation thus limit the types that can be stored on persistent memory to those satisfying the requirements of trivial types. The layout of our objects must also be taken into account.
Object Layout
Object representation, also referred to as the layout, might differ between compilers,
compiler flags, and application binary interface (ABI). The compiler may perform some layout-related optimizations and is free to shuffle the order of members that share the same access specifier – for example, public, then protected, then public again. Another problem related
to unknown object layout is connected to polymorphic types. Currently there is no
reliable and portable way to implement vtable rebuilding after reopening the memory
pool, so polymorphic objects cannot be supported with persistent memory.
If we want to store objects on persistent memory using memory-mapped files and
to follow the SNIA NVM programming model, we must ensure that the following casting
will be always valid:
someType A = *reinterpret_cast<someType*>(mmap(...));
The bit representation of a stored object type must be always the same, and our
application should be able to retrieve the stored object from the memory-mapped file
without serialization.
It is possible to ensure that specific types satisfy the aforementioned requirements.
C++11 provides another type trait called std::is_standard_layout. The standard notes that it is useful for communicating with other languages – for example, when creating language bindings to native C++ libraries – which is why a standard-layout class has the same memory layout as the equivalent C struct or union. A general rule is that standard-layout classes must have all non-static data members with the same access control. As mentioned at the beginning of this section, a C++ compliant compiler is free to reorder members with different access control within the same class definition.
When using inheritance, only one class in the whole inheritance tree can have non-static data members, and the first non-static data member cannot be of a base class type because this could break aliasing rules. Otherwise, it is not a standard-layout class.
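These layout rules can be verified at compile time. The following self-contained sketch (the type names are ours, not from the book's listings) uses static_assert with the C++11 type traits to show which class shapes qualify:

```cpp
#include <type_traits>

// A persistent-memory-friendly node: one access control for all
// members, no virtual functions, no user-provided special members.
struct good_node {
    int value;
    good_node* next;
};

// Mixing access specifiers lets the compiler reorder the member
// groups, so the class is no longer standard-layout.
class mixed_access {
public:
    int a;
private:
    int b;
public:
    int sum() const { return a + b; }
};

// A virtual function introduces a vtable pointer whose placement is
// ABI-specific; the type is no longer trivially copyable.
struct polymorphic_node {
    virtual ~polymorphic_node() = default;
    int value = 0;
};

static_assert(std::is_standard_layout<good_node>::value,
              "same access control, no virtuals: standard layout");
static_assert(std::is_trivially_copyable<good_node>::value,
              "no user-provided copy/move/destructor: trivially copyable");
static_assert(!std::is_standard_layout<mixed_access>::value,
              "mixed access specifiers: not standard layout");
static_assert(!std::is_trivially_copyable<polymorphic_node>::value,
              "virtual destructor: not trivially copyable");
```

Because these are static assertions, an unsuitable type is rejected at compile time rather than corrupting data at run time.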
Having discussed object layouts, we look at another interesting problem with pointer
types and how to store them on persistent memory.
Pointers
In previous sections, we quoted parts of the C++ standard and described which types are safe to snapshot and copy, and which we can binary-cast without worrying about a fixed layout. But what about pointers? How do we deal with them in our objects as we come to grips with the persistent memory programming model? Consider the code snippet presented in Listing 8-5, which provides an example of a class that uses a volatile pointer as a class member.
39 struct root {
40 int* vptr1;
41 int* vptr2;
42 };
43
44 int main(int argc, char *argv[]) {
45 auto pop = pmem::obj::pool<root>::open("/daxfs/file", "tx");
46
47 auto r = pop.root();
48
49 int a1 = 1;
50
51 pmem::obj::transaction::run(pop, [&](){
52 auto ptr = pmem::obj::make_persistent<int>(0);
53 r->vptr1 = ptr.get();
54 r->vptr2 = &a1;
55 });
56
57 return 0;
58 }
Limitations Summary
C++11 provides several very useful type traits for persistent memory programming.
These are
struct std::is_pod;
struct std::is_trivially_copyable;
struct std::is_standard_layout;
These traits are correlated with each other. The most restrictive is the definition of a POD type, shown in Figure 8-1.
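As a quick illustration of how the three traits relate (the type names here are our own), a POD type satisfies both of the other traits; adding, for example, a default member initializer loses POD-ness while keeping trivial copyability and standard layout:

```cpp
#include <type_traits>

// A plain aggregate: POD implies trivially copyable and standard
// layout, so it satisfies every requirement at once.
struct pod_type {
    int   id;
    float weight;
};

// A default member initializer makes the default constructor
// non-trivial, so the type is no longer POD, yet it remains both
// trivially copyable and standard-layout.
struct initialized_type {
    int id = 42;
};

static_assert(std::is_pod<pod_type>::value, "");
static_assert(std::is_trivially_copyable<pod_type>::value, "");
static_assert(std::is_standard_layout<pod_type>::value, "");

static_assert(!std::is_pod<initialized_type>::value, "");
static_assert(std::is_trivially_copyable<initialized_type>::value, "");
static_assert(std::is_standard_layout<initialized_type>::value, "");
```

This is why std::is_trivially_copyable and std::is_standard_layout, rather than the stricter std::is_pod, are the two traits a persistent memory resident class must satisfy.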
We mentioned previously that a persistent memory resident class must satisfy the
following requirements:
• std::is_trivially_copyable
• std::is_standard_layout
Persistent memory developers are free to use more restrictive type traits if required.
If we want to use persistent pointers, however, we cannot rely on type traits, and we
must be aware of all problems related to copying objects with memcpy() and the layout
representation of objects. For persistent memory programming, a format description or
standardization of the aforementioned concepts and features needs to take place within
the C++ standards body group such that it can be officially designed and implemented.
Until then, developers must be aware of the restrictions and limitations to manage
undefined object-lifetime behavior.
Persistence Simplified
Consider a simple queue implementation, presented in Listing 8-6, which stores
elements in volatile DRAM.
33 #include <cstdio>
34 #include <cstdlib>
35 #include <iostream>
36 #include <string>
37
38 struct queue_node {
39 int value;
40 struct queue_node *next;
41 };
42
43 struct queue {
44 void
45 push(int value)
46 {
47 auto node = new queue_node;
48 node->value = value;
49 node->next = nullptr;
50
51 if (head == nullptr) {
52 head = tail = node;
53 } else {
54 tail->next = node;
55 tail = node;
56 }
57 }
58
59 int
60 pop()
61 {
62 if (head == nullptr)
63 throw std::out_of_range("no elements");
64
65 auto head_ptr = head;
66 auto value = head->value;
67
68 head = head->next;
69 delete head_ptr;
70
71 if (head == nullptr)
72 tail = nullptr;
73
74 return value;
75 }
76
77 void
78 show()
79 {
80 auto node = head;
81 while (node != nullptr) {
82 std::cout << "show: " << node->value << std::endl;
83 node = node->next;
84 }
85
86 std::cout << std::endl;
87 }
88
89 private:
90 queue_node *head = nullptr;
91 queue_node *tail = nullptr;
92 };
• Lines 77-87: The show() method walks the list and prints the contents
of each node to standard out.
The preceding queue implementation stores values of type int in a linked list and
provides three basic methods: push(), pop(), and show().
In this section, we will demonstrate how to modify your volatile structure to store
elements in persistent memory with libpmemobj-cpp bindings. All the modifier methods
should provide atomicity and consistency properties which will be guaranteed by the
use of transactions.
38 #include <libpmemobj++/make_persistent.hpp>
39 #include <libpmemobj++/p.hpp>
40 #include <libpmemobj++/persistent_ptr.hpp>
41 #include <libpmemobj++/pool.hpp>
42 #include <libpmemobj++/transaction.hpp>
43
44 struct queue_node {
45 pmem::obj::p<int> value;
46 pmem::obj::persistent_ptr<queue_node> next;
47 };
48
49 struct queue {
...
100 private:
101 pmem::obj::persistent_ptr<queue_node> head = nullptr;
102 pmem::obj::persistent_ptr<queue_node> tail = nullptr;
103 };
As you can see, all the modifications are limited to replacing the volatile pointers with pmem::obj::persistent_ptr and to using the p<> property.
Next, we modify the push() method, shown in Listing 8-8.
50 void
51 push(pmem::obj::pool_base &pop, int value)
52 {
53 pmem::obj::transaction::run(pop, [&]{
54 auto node = pmem::obj::make_persistent<queue_node>();
55 node->value = value;
56 node->next = nullptr;
57
58 if (head == nullptr) {
59 head = tail = node;
60 } else {
61 tail->next = node;
62 tail = node;
63 }
64 });
65 }
All the modifier methods must be aware of which persistent memory pool they operate on. For a single memory pool this is trivial, but if the application memory-maps files from different file systems, we need to keep track of which pool holds what data. We introduce an additional argument of type pmem::obj::pool_base to solve this problem. Inside the method definition, we wrap the code in a transaction by using a C++ lambda expression, [&], to guarantee atomicity and consistency of the modifications. Instead of allocating a new node on the stack, we call pmem::obj::make_persistent<>() to transactionally allocate it on persistent memory.
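To build intuition for what the transaction guarantees, here is a deliberately simplified, volatile-memory-only model of an undo-log transaction (this is our own toy sketch, not the libpmemobj implementation; the real library also persists the log so it can roll back after a crash):

```cpp
#include <functional>
#include <utility>
#include <vector>

// Toy undo-log transaction. Callers snapshot an int's old value before
// overwriting it; if the work lambda throws, every saved value is
// restored in reverse order, so an observer sees all changes or none.
class toy_transaction {
public:
    static void run(const std::function<void(toy_transaction&)>& work) {
        toy_transaction tx;
        try {
            work(tx);
        } catch (...) {
            // Roll back, newest change first, then rethrow.
            for (auto it = tx.undo_.rbegin(); it != tx.undo_.rend(); ++it)
                *it->first = it->second;
            throw;
        }
    }

    // Save the current value of a location before the caller modifies it.
    void snapshot(int* location) {
        undo_.emplace_back(location, *location);
    }

private:
    std::vector<std::pair<int*, int>> undo_;
};
```

On success the undo log is simply discarded; the real library must additionally flush the modified cache lines to persistent media before the transaction counts as committed.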
Listing 8-9 shows the modification of the pop() method.
67 int
68 pop(pmem::obj::pool_base &pop)
69 {
70 int value;
71 pmem::obj::transaction::run(pop, [&]{
72 if (head == nullptr)
73 throw std::out_of_range("no elements");
74
75 auto head_ptr = head;
76 value = head->value;
77
78 head = head->next;
79 pmem::obj::delete_persistent<queue_node>(head_ptr);
80
39 #include "persistent_queue.hpp"
40
41 enum queue_op {
42 PUSH,
43 POP,
44 SHOW,
45 EXIT,
46 MAX_OPS,
47 };
48
49 const char *ops_str[MAX_OPS] = {"push", "pop", "show", "exit"};
50
51 queue_op
52 parse_queue_ops(const std::string &ops)
53 {
54 for (int i = 0; i < MAX_OPS; i++) {
55 if (ops == ops_str[i]) {
56 return (queue_op)i;
57 }
58 }
59 return MAX_OPS;
60 }
61
62 int
63 main(int argc, char *argv[])
64 {
65 if (argc < 2) {
66 std::cerr << "Usage: " << argv[0] << " path_to_pool"
<< std::endl;
67 return 1;
68 }
69
70 auto path = argv[1];
71 pmem::obj::pool<queue> pool;
72
73 try {
74 pool = pmem::obj::pool<queue>::open(path, "queue");
75 } catch(pmem::pool_error &e) {
76 std::cerr << e.what() << std::endl;
77 std::cerr << "To create pool run: pmempool create obj
--layout=queue -s 100M path_to_pool" << std::endl;
78 }
79
80 auto q = pool.root();
81
82 while (1) {
83 std::cout << "[push value|pop|show|exit]" << std::endl;
84
85 std::string command;
86 std::cin >> command;
87
88 // parse string
89 auto ops = parse_queue_ops(std::string(command));
90
91 switch (ops) {
92 case PUSH: {
93 int value;
94 std::cin >> value;
95
96 q->push(pool, value);
97
98 break;
99 }
100 case POP: {
101 std::cout << q->pop(pool) << std::endl;
102 break;
103 }
104 case SHOW: {
105 q->show();
106 break;
107 }
108 case EXIT: {
109 exit(0);
110 }
111 default: {
112 std::cerr << "unknown ops" << std::endl;
113 exit(0);
114 }
115 }
116 }
117 }
The Ecosystem
The overall goal for the libpmemobj C++ bindings was to create a friendly and less error-prone API for persistent memory programming. Even with persistent memory pool allocators, a convenient interface for creating and managing transactions, auto-snapshotting class templates, and smart persistent pointers, designing an application that uses persistent memory may still prove challenging without many of the niceties that C++ programmers are used to. The natural step forward to make persistent memory programming easier was to provide programmers with efficient and useful containers.
Persistent Containers
The C++ standard library containers collection is something that persistent memory programmers may want to use. Containers manage the lifetime of held objects through allocation/creation and deallocation/destruction with the use of allocators. Implementing a custom persistent allocator for C++ STL (Standard Template Library) containers has two main downsides:
• Implementation details:
• Memory layout:
• The STL does not guarantee that the container layout will remain
unchanged in new library versions.
33 #include <libpmemobj++/make_persistent.hpp>
34 #include <libpmemobj++/transaction.hpp>
35 #include <libpmemobj++/persistent_ptr.hpp>
36 #include <libpmemobj++/pool.hpp>
37 #include "libpmemobj++/vector.hpp"
38
39 using vector_type = pmem::obj::experimental::vector<int>;
40
41 struct root {
42 pmem::obj::persistent_ptr<vector_type> vec_p;
43 };
44
...
63
64 /* creating pmem::obj::vector in transaction */
65 pmem::obj::transaction::run(pool, [&] {
66 root->vec_p = pmem::obj::make_persistent<vector_type>
(/* optional constructor arguments */);
67 });
68
69 vector_type &pvector = *(root->vec_p);
71 pvector.reserve(10);
72 assert(pvector.size() == 0);
73 assert(pvector.capacity() == 10);
74
75 pvector = {0, 1, 2, 3, 4};
76 assert(pvector.size() == 5);
77 assert(pvector.capacity() == 10);
78
79 pvector.shrink_to_fit();
80 assert(pvector.size() == 5);
81 assert(pvector.capacity() == 5);
82
83 for (unsigned i = 0; i < pvector.size(); ++i)
84 assert(pvector.const_at(i) == static_cast<int>(i));
85
86 pvector.push_back(5);
87 assert(pvector.const_at(5) == 5);
88 assert(pvector.size() == 6);
89
90 pvector.emplace(pvector.cbegin(), pvector.back());
91 assert(pvector.const_at(0) == 5);
92 for (unsigned i = 1; i < pvector.size(); ++i)
93 assert(pvector.const_at(i) == static_cast<int>(i - 1));
Every method that modifies persistent memory containers does so inside an implicit
transaction to guarantee full exception safety. If any of these methods are called inside
the scope of another transaction, the operation is performed in the context of that
transaction; otherwise, it is atomic in its own scope.
Iterating over pmem::obj::vector works exactly the same as for std::vector. We can use range-based for loops, iterators, or the indexing operator. The pmem::obj::vector can also be processed using std algorithms, as shown in Listing 8-13.
Listing 8-13. Iterating over persistent container and compatibility with STD
algorithms
95 std::vector<int> stdvector = {5, 4, 3, 2, 1};
96 pvector = stdvector;
97
98 try {
99 pmem::obj::transaction::run(pool, [&] {
100 for (auto &e : pvector)
101 e++;
102 /* 6, 5, 4, 3, 2 */
103
104 for (auto it = pvector.begin();
it != pvector.end(); it++)
105 *it += 2;
106 /* 8, 7, 6, 5, 4 */
107
108 for (unsigned i = 0; i < pvector.size(); i++)
109 pvector[i]--;
110 /* 7, 6, 5, 4, 3 */
111
112 std::sort(pvector.begin(), pvector.end());
113 for (unsigned i = 0; i < pvector.size(); ++i)
114 assert(pvector.const_at(i) == static_cast<int>
(i + 3));
115
116 pmem::obj::transaction::abort(0);
117 });
118 } catch (pmem::manual_tx_abort &) {
119 /* expected transaction abort */
120 } catch (std::exception &e) {
121 std::cerr << e.what() << std::endl;
122 }
123
If an active transaction exists, elements accessed using any of the preceding methods are snapshotted. For iterators returned by begin() and end(), snapshotting happens when the iterator is dereferenced. Note that snapshotting is done only for mutable elements: in the case of constant iterators or the constant version of the indexing operator, nothing is added to the transaction. That is why it is essential to use const-qualified function overloads such as cbegin() or cend() whenever possible. If an object was already snapshotted in the current transaction, a second snapshot of the same memory address will not be performed and thus incurs no performance overhead. This reduces the number of snapshots and can significantly reduce the performance impact of transactions. Note also that pmem::obj::vector defines convenient constructors and comparison operators that take std::vector as an argument.
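The snapshot-once behavior can be modeled with a small toy class (our own illustration, not the libpmemobj-cpp implementation): mutable accesses record the element's address at most once per transaction, while const accesses record nothing:

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Toy model of transactional snapshotting cost. Every *mutable*
// element access snapshots that element's address, but an address
// already snapshotted in the current transaction is skipped. Const
// access never snapshots anything.
class tracked_vector {
public:
    explicit tracked_vector(std::size_t n) : data_(n) {}

    // Mutable access: may add the element to the "transaction".
    int& at(std::size_t i) {
        if (snapshotted_.insert(&data_[i]).second)
            ++snapshot_count_;
        return data_[i];
    }

    // Const access: nothing is added to the transaction.
    const int& const_at(std::size_t i) const { return data_[i]; }

    std::size_t snapshot_count() const { return snapshot_count_; }

private:
    std::vector<int> data_;
    std::set<const int*> snapshotted_;
    std::size_t snapshot_count_ = 0;
};
```

Writing the same element twice costs only one snapshot, and a read through const_at() costs none; this is exactly why preferring cbegin()/cend() and const_at() pays off inside transactions.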
Summary
This chapter describes the libpmemobj-cpp library. It makes creating applications less error prone, and its similarity to the standard C++ API makes it easier to modify existing volatile programs to use persistent memory. We also list the limitations of this library and the problems you must consider during development.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter's Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 9
Figure 9-1. Where is data stored? Source: IDC White Paper – #US44413318
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_9
Chapter 9 pmemkv: A Persistent In-Memory Key-Value Store
The cloud ecosystem, its modularity, and variety of service modes define
programming and application deployment as we know it. We call it cloud-native
computing, and its popularity results in a growing number of high-level languages,
frameworks, and abstraction layers. Figure 9-2 shows the 15 most popular languages on
GitHub based on pull requests.
Figure 9-2. The 15 most popular languages on GitHub by opened pull request
(2017). Source: https://octoverse.github.com/2017/
pmemkv Architecture
There are many key-value data stores available on the market. They have different features and licenses, and their APIs target different use cases; however, their core APIs remain the same. All of them provide methods like put, get, remove, exists, open, and close. At the time we published this book, the most popular key-value data store is Redis. It is available in open source (https://redis.io/) and enterprise (https://redislabs.com) versions. DB-Engines (https://db-engines.com) shows that Redis has a significantly higher rank than any of its competitors in this sector.
Figure 9-3. DB-Engines ranking of key-value stores (July 2019). Scoring method:
https://db-engines.com/en/ranking_definition. Source: https://db-
engines.com/en/ranking/key-value+store
Pmemkv was created as a separate project not only to complement PMDK's set of libraries with cloud-native support but also to provide a key-value API built for persistent memory. One of the main goals for the pmemkv developers was to create a friendly environment for the open source community to develop new engines with the help of PMDK and to integrate them with other programming languages. Pmemkv uses the same BSD 3-Clause permissive license as PMDK. The native API of pmemkv is C and C++. Other programming language bindings are available, such as JavaScript, Java, and Ruby. Additional languages can easily be added.
The pmemkv API is similar to most key-value databases. Several storage engines
are available for flexibility and functionality. Each engine has different performance
characteristics and aims to solve different problems. Because of that, the functionality
provided by each engine differs. They can be described by the following characteristics:
What makes pmemkv different from other key-value databases is that it provides
direct access to the data. This means reading data from persistent memory does not
require a copy into DRAM. This was already mentioned in Chapter 1 and is presented
again in Figure 9-5.
Having direct access to the data significantly speeds up the application. This benefit is most noticeable when the program is interested in only part of the data stored in the database. Conventional approaches require copying all the data into a buffer and returning it to the application. With pmemkv, the application receives a direct pointer and reads only as much as it needs.
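The difference between the conventional copy-out approach and direct access can be sketched in plain C++ (the record type and function names here are ours, not the pmemkv API): a non-owning view into the stored bytes avoids the intermediate buffer entirely:

```cpp
#include <string>
#include <string_view>

// A stored value, standing in for data already resident in the store.
struct record {
    std::string value;
};

// Conventional approach: the whole value is copied into a buffer
// owned by the caller before it can be read.
std::string get_copy(const record& r) {
    return r.value;
}

// Direct access: the caller receives a view into the data in place
// and reads only the bytes it actually needs, with no copy made.
std::string_view get_view(const record& r) {
    return r.value;
}
```

With persistent memory the view points directly at the memory-mapped media, which is what makes pmemkv's get_all() callbacks zero-copy.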
To make the API fully functional with various engine types, a flexible pmemkv_config structure was introduced. It stores engine configuration options and allows you to tune engine behavior. Every engine documents all of its supported config parameters. The pmemkv library was designed so that engines are pluggable and extendable to support developers' own requirements. Developers are free to modify existing engines or contribute new ones (https://github.com/pmem/pmemkv/blob/master/CONTRIBUTING.md#engines).
Listing 9-1 shows a basic setup of the pmemkv_config structure using the native C API. All the setup code is wrapped in a custom function, config_setup(), which will be used in the phonebook example in the next section. You can also see how error handling is solved in pmemkv – all methods, except for pmemkv_close() and pmemkv_errormsg(), return a status. We can obtain the error message using the pmemkv_errormsg() function. A complete list of return values can be found in the pmemkv man page.
• Lines 9-22: All parameters are put into the config (the cfg instance) one after another, using the function dedicated to each type, and each call is checked for success (PMEMKV_STATUS_OK is returned when no errors occurred).
A Phonebook Example
Listing 9-2 shows a simple phonebook example implemented using the pmemkv C++
API v0.9. One of the main intentions of pmemkv is to provide a familiar API similar to
the other key-value stores. This makes it very intuitive and easy to use. We will reuse the
config_setup() function from Listing 9-1.
Listing 9-2. A simple phonebook example using the pmemkv C++ API
37 #include <iostream>
38 #include <cassert>
39 #include <libpmemkv.hpp>
40 #include <string>
41 #include "pmemkv_config.h"
42
43 using namespace pmem::kv;
44
45 auto PATH = "/daxfs/kvfile";
46 const uint64_t FORCE_CREATE = 1;
47 const uint64_t SIZE = 1024 * 1024 * 1024; // 1 Gig
48
49 int main() {
50 // Prepare config for pmemkv database
51 pmemkv_config *cfg = config_setup(PATH, FORCE_CREATE, SIZE);
52 assert(cfg != nullptr);
53
54 // Create a key-value store using the "cmap" engine.
55 db kv;
56
57 if (kv.open("cmap", config(cfg)) != status::OK) {
58 std::cerr << db::errormsg() << std::endl;
59 return 1;
60 }
61
62 // Add 2 entries with name and phone number
63 if (kv.put("John", "123-456-789") != status::OK) {
64 std::cerr << db::errormsg() << std::endl;
65 return 1;
66 }
67 if (kv.put("Kate", "987-654-321") != status::OK) {
68 std::cerr << db::errormsg() << std::endl;
69 return 1;
70 }
71
72 // Count elements
73 size_t cnt;
74 if (kv.count_all(cnt) != status::OK) {
75 std::cerr << db::errormsg() << std::endl;
76 return 1;
77 }
78 assert(cnt == 2);
79
80 // Read key back
81 std::string number;
82 if (kv.get("John", &number) != status::OK) {
83 std::cerr << db::errormsg() << std::endl;
84 return 1;
85 }
86 assert(number == "123-456-789");
87
88 // Iterate through the phonebook
89 if (kv.get_all([](string_view name, string_view number) {
90 std::cout << "name: " << name.data() <<
91 ", number: " << number.data() << std::endl;
92 return 0;
93 }) != status::OK) {
• Line 57: Here, we open the key-value database backed by the cmap
engine using the config parameters. The cmap engine is a persistent
concurrent hash map engine, implemented in libpmemobj-cpp.
You can read more about the cmap engine's internal algorithms and data structures in Chapter 13.
• Lines 63 and 67: The put() method inserts a key-value pair into the database. This function is guaranteed to be implemented by all engines. In this example, we insert two key-value pairs into the database and compare the returned statuses with status::OK. This is the recommended way to check whether a function succeeded.
• Line 74: The count_all() method has a single argument of type size_t. It returns the number of elements (phonebook entries) stored in the database through its argument variable (cnt).
• Line 82: Here, we use the get() method to return the value of the
“John” key. The value is copied into the user-provided number
variable. The get() function returns status::OK on success or an
error on failure. This function is guaranteed to be implemented by all
engines.
• Line 86: For this example, the expected value of variable number for
“John” is “123-456-789”. If we do not get this value, an assertion error
is thrown.
• Line 89: The get_all() method used in this example gives the application direct, read-only access to the data. Both the key and value variables are references to data stored in persistent memory. In this example, we simply print the name and number of every visited pair.
• Line 99: Here, we are removing “John” and his phone number from
the database by calling the remove() method. It is guaranteed to be
implemented by all engines.
Listing 9-3. A simple phonebook example written using the JavaScript bindings
for pmemkv v0.8
1 const Database = require('./lib/all');
2
3 function assert(condition) {
4 if (!condition) throw new Error('Assert failed');
5 }
6
7 console.log('Create a key-value store using the "cmap" engine');
8 const db = new Database('cmap', '{"path":"/daxfs/kvfile","size":1073741824, "force_create":1}');
9
10 console.log('Add 2 entries with name and phone number');
11 db.put('John', '123-456-789');
12 db.put('Kate', '987-654-321');
13
14 console.log('Count elements');
15 assert(db.count_all == 2);
16
17 console.log('Read key back');
18 assert(db.get('John') === '123-456-789');
19
20 console.log('Iterate through the phonebook');
21 db.get_all((k, v) => console.log(` name: ${k}, number: ${v}`));
22
23 console.log('Remove one record');
24 db.remove('John');
25
26 console.log('Lookup of removed record');
27 assert(!db.exists('John'));
28
29 console.log('Stopping engine');
30 db.stop();
Summary
In this chapter, we have shown how a familiar key-value data store is an easy way for the
broader cloud software developer audience to use persistent memory and directly access
the data in place. The modular design, flexible engine API, and integration with many
of the most popular cloud programming languages make pmemkv an intuitive choice
for cloud-native software developers. As an open source and lightweight library, it can
easily be integrated into existing applications to immediately start taking advantage of
persistent memory.
Some of the pmemkv engines are implemented using libpmemobj-cpp that we
described in Chapter 8. The implementation of such engines provides real-world
examples for developers to understand how to use PMDK (and related libraries) in
applications.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 10
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_10
Chapter 10 Volatile Use of Persistent Memory
Background
Applications manage different kinds of data structures such as user data, key-value
stores, metadata, and working buffers. Architecting a solution that uses tiered memory
and storage may enhance application performance, for example, placing objects that
are accessed frequently and require low-latency access in DRAM while storing objects
that require larger allocations that are not as latency-sensitive on persistent memory.
Traditional storage devices are used to provide persistence.
Memory Allocation
As described in Chapters 1 through 3, persistent memory is exposed to the application
using memory-mapped files on a persistent memory-aware file system that provides
direct access to the application. Since malloc() and free() do not operate on different
types of memory or memory-mapped files, an interface is needed that provides malloc()
and free() semantics for multiple memory types. This interface is implemented as the
memkind library (http://memkind.github.io/memkind/).
How It Works
The memkind library is a user-extensible heap manager built on top of jemalloc, which
enables partitioning of the heap between multiple kinds of memory. Memkind was
created to support different kinds of memory when high bandwidth memory (HBM) was
introduced. A PMEM kind was introduced to support persistent memory.
Different “kinds” of memory are defined by the operating system memory policies
that are applied to virtual address ranges. Memory characteristics supported by
memkind without user extension include the control of non-uniform memory access
(NUMA) and page sizes. Figure 10-1 shows an overview of libmemkind components and
hardware support.
The memkind library serves as a wrapper that redirects memory allocation requests from an application to an allocator that manages the heap. At the time of publication, only the jemalloc allocator is supported; future versions may introduce and support multiple allocators. Memkind provides jemalloc with different kinds of memory: a static kind is created automatically, whereas a dynamic kind is created by an application using memkind_create_kind().
Supported "Kinds" of Memory
The dynamic PMEM kind is best used with memory-addressable persistent storage
through a DAX-enabled file system that supports load/store operations that are
not paged via the system page cache. For the PMEM kind, the memkind library supports
the traditional malloc/free-like interfaces on a memory-mapped file. When an
application calls memkind_create_kind() with PMEM, a temporary file (tmpfile(3))
is created on a mounted DAX file system and is memory-mapped into the application’s
virtual address space. This temporary file is deleted automatically when the program
terminates, giving the perception of volatility.
Figure 10-2 shows memory mappings from two memory sources: DRAM
(MEMKIND_DEFAULT) and persistent memory (PMEM_KIND).
For allocations from DRAM, rather than using the common malloc(), the
application can call memkind_malloc() with the kind argument set to MEMKIND_DEFAULT.
MEMKIND_DEFAULT is a static kind that uses the operating system’s default page size for
allocations. Refer to the memkind documentation for large and huge page support.
When using libmemkind with DRAM and persistent memory, the key points to
understand are:
Kind Creation
Use the memkind_create_pmem() function to create a PMEM kind of memory from a
file-backed source. This file is created as a tmpfile(3) in a specified directory (PMEM_DIR)
and is unlinked, so the file name is not listed under the directory. The temporary file is
automatically removed when the program terminates.
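The create-then-unlink trick can be sketched with plain POSIX calls (a simplified illustration of the idea only; the real memkind library additionally memory-maps the file and manages it as a jemalloc-backed heap):

```cpp
#include <stdlib.h>
#include <string>
#include <unistd.h>

// mkstemp() creates a uniquely named file and opens it; unlink() then
// removes the directory entry immediately. The open descriptor keeps
// the storage alive, but once the process exits nothing remains on
// disk, giving the perception of volatility on a file-backed medium.
int create_unlinked_tmpfile(const std::string& dir) {
    std::string templ = dir + "/pmem.XXXXXX";
    int fd = mkstemp(&templ[0]);  // replaces XXXXXX and opens the file
    if (fd < 0)
        return -1;
    unlink(templ.c_str());        // name disappears; fd stays valid
    return fd;
}
```

The file name "pmem.XXXXXX" is our own placeholder; with memkind, the directory would be the DAX-mounted file system passed as PMEM_DIR.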
Use memkind_create_pmem() to create a fixed or dynamic heap size depending on
the application requirement. Additionally, configurations can be created and supplied
rather than passing in configuration options to the *_create_* function.
memkind_error_message(err, error_message,
MEMKIND_ERROR_MESSAGE_SIZE);
fprintf(stderr, "%s\n", error_message);
exit(1);
}
You can also create a heap with a specific configuration using the function memkind_
create_pmem_with_config(). This function uses a memkind_config structure with
optional parameters such as size, file path, and memory usage policy. Listing 10-3
shows how to build a test_cfg using memkind_config_new(), then passing that
configuration to memkind_create_pmem_with_config() to create a PMEM kind. We use
the same path and size parameters from the Listing 10-2 example for comparison.
memkind_create_pmem(PMEM_DIR, 0, &pmem_kind)
• PMEM_MAX_SIZE is 0.
Table 10-1. Automatic kind detection functions and their equivalent specified kind functions and operations

Operation | Memkind API with Kind | Memkind API Using Automatic Detection
The memkind library internally tracks the kind of a given object using the allocator metadata. However, to retrieve this information, some operations may need to acquire a lock to prevent accesses from other threads, which may negatively affect performance in a multithreaded environment.
Using the same pmem_detect_kind.c code from Listing 10-5, Listing 10-6 shows how
the kind is destroyed before the program exits.
89 err = memkind_destroy_kind(pmem_kind);
90 if (err) {
91 memkind_fatal(err);
92 }
Allocating Memory
The memkind library provides memkind_malloc(), memkind_calloc(), and
memkind_realloc() functions for allocating memory, defined as follows:
void *memkind_malloc(memkind_t kind, size_t size);
void *memkind_calloc(memkind_t kind, size_t num, size_t size);
void *memkind_realloc(memkind_t kind, void *ptr, size_t size);
Listing 10-7. An example of allocating memory from both DRAM and persistent
memory
/*
* Allocates 100 bytes using appropriate "kind"
* of volatile memory
*/
1. MEMKIND_MEM_USAGE_POLICY_DEFAULT
2. MEMKIND_MEM_USAGE_POLICY_CONSERVATIVE
The minimum and maximum values for dirty_decay_ms using
MEMKIND_MEM_USAGE_POLICY_DEFAULT are 0ms to 10,000ms for arenas assigned to a
PMEM kind. Setting MEMKIND_MEM_USAGE_POLICY_CONSERVATIVE sets shorter decay times
to purge unused memory faster, reducing memory usage. To define the memory usage
policy, use memkind_config_set_memory_usage_policy(), shown below:
struct memkind_config *test_cfg = memkind_config_new();
memkind_config_set_path(test_cfg, PMEM_DIR);
memkind_config_set_size(test_cfg, PMEM_MAX_SIZE);
memkind_config_set_memory_usage_policy(test_cfg,
    MEMKIND_MEM_USAGE_POLICY_CONSERVATIVE);
pmem::allocator methods:
pmem::allocator(const char *dir, size_t max_size);
pmem::allocator(const std::string &dir, size_t max_size);
template <typename U> pmem::allocator<T>::allocator(const pmem::allocator<U> &);
template <typename U> pmem::allocator<T>::allocator(allocator<U> &&other);
pmem::allocator<T>::~allocator();
T *pmem::allocator<T>::allocate(std::size_t n) const;
void pmem::allocator<T>::deallocate(T *p, std::size_t n) const;
template <class U, class... Args>
void pmem::allocator<T>::construct(U *p, Args... args) const;
void pmem::allocator<T>::destroy(T *p) const;
For more information about the pmem::allocator class template, refer to the pmem
allocator(3) man page.
Nested Containers
Multilevel containers such as a vector of lists, tuples, maps, strings, and so on pose
challenges in handling the nested objects.
Imagine you need to create a vector of strings and store it in persistent memory. The
challenge is that every level of the container must allocate from persistent memory: the
vector's internal buffer, each string object, and each string's character buffer.
std::scoped_allocator_adaptor solves this by automatically propagating the allocator
to the nested objects.
C++ Examples
This section presents several full-code examples demonstrating the use of libmemkind
using C and C++.
Using the pmem::allocator
As mentioned earlier, you can use pmem::allocator with any STL-like data structure.
The code sample in Listing 10-10 includes a pmem_allocator.h header file to use
pmem::allocator.
37 #include <pmem_allocator.h>
38 #include <vector>
39 #include <cassert>
40
41 int main(int argc, char *argv[]) {
42 const size_t pmem_max_size = 64 * 1024 * 1024; //64 MB
43 const std::string pmem_dir("/daxfs");
44
45 // Create allocator object
46 libmemkind::pmem::allocator<int>
47 alc(pmem_dir, pmem_max_size);
48
49     // Create std::vector with our allocator
50     std::vector<int, libmemkind::pmem::allocator<int>> v(alc);
51
52     v.push_back(42);
53     assert(v.back() == 42);
54
55     return 0;
56 }
37 #include <pmem_allocator.h>
38 #include <vector>
39 #include <string>
40 #include <scoped_allocator>
41 #include <cassert>
42 #include <iostream>
43
44 typedef libmemkind::pmem::allocator<char> str_alloc_type;
45
46 typedef std::basic_string<char, std::char_traits<char>,
str_alloc_type> pmem_string;
47
48 typedef libmemkind::pmem::allocator<pmem_string> vec_alloc_type;
49
50 typedef std::vector<pmem_string, std::scoped_allocator_adaptor
<vec_alloc_type> > vector_type;
51
52 int main(int argc, char *argv[]) {
53 const size_t pmem_max_size = 64 * 1024 * 1024; //64 MB
54 const std::string pmem_dir("/daxfs");
55
56 // Create allocator object
57 vec_alloc_type alc(pmem_dir, pmem_max_size);
58 // Create std::vector with our allocator.
59 vector_type v(alc);
60
61 v.emplace_back("Foo");
62 v.emplace_back("Bar");
63
64 for (auto str : v) {
65 std::cout << str << std::endl;
66     }
67
68     return 0;
69 }
To create a new dax device using all available capacity on the first available region
(NUMA node), use:
$ ndctl create-namespace --mode=devdax
To create a new dax device specifying the region and capacity (here, for example,
32GiB from region0), use:
$ ndctl create-namespace --mode=devdax --region=region0 --size=32G
To list the namespaces, use:
$ ndctl list
If you have already created a namespace in another mode, such as the default fsdax,
you can reconfigure the device using the following, where namespace0.0 is the existing
namespace you want to reconfigure:
$ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax
DAX devices must be converted to use the system-ram mode. Converting a dax
device to a NUMA node suitable for use with system memory can be performed using the
following command:
$ daxctl reconfigure-device dax2.0 --mode=system-ram
This migrates the device from the device_dax driver to the kernel's kmem driver. The
following shows an example output with dax1.0 configured as the default devdax type
and dax2.0 as system-ram:
$ daxctl list
[
{
"chardev":"dax1.0",
"size":263182090240,
"target_node":3,
"mode":"devdax"
},
{
"chardev":"dax2.0",
"size":263182090240,
"target_node":4,
"mode":"system-ram"
}
]
You can now use numactl -H to show the hardware NUMA configuration.
The following example output is collected from a 2-socket system and shows node 4
is a new system-ram backed NUMA node created from persistent memory:
$ numactl -H
available: 3 nodes (0-1,4)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 56 57 58 59 60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79 80 81 82 83
node 0 size: 192112 MB
node 0 free: 185575 MB
To online the NUMA node and have the kernel manage the new memory, use:
$ daxctl online-memory dax2.0
At this point, the kernel will use the new capacity for normal operation. The new
memory appears in tools such as lsmem, shown below, where we see an additional
10GiB of system-ram in the 0x0000003380000000-0x00000035ffffffff address range:
$ lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff 2G online no 0
0x0000000100000000-0x000000277fffffff 154G online yes 2-78
0x0000002780000000-0x000000297fffffff 8G online no 79-82
0x0000002980000000-0x0000002effffffff 22G online yes 83-93
0x0000002f00000000-0x0000002fffffffff 4G online no 94-95
0x0000003380000000-0x00000035ffffffff 10G online yes 103-107
0x000001aa80000000-0x000001d0ffffffff 154G online yes 853-929
0x000001d100000000-0x000001d37fffffff 10G online no 930-934
0x000001d380000000-0x000001d8ffffffff 22G online yes 935-945
0x000001d900000000-0x000001d9ffffffff 4G online no 946-947
Figure 10-3 shows an application that created two static kind objects: MEMKIND_
DEFAULT and MEMKIND_DAX_KMEM.
Figure 10-3. An application that created two kind objects from different types of
memory
Child processes created using the fork(2) system call inherit the MAP_PRIVATE
mappings from the parent process. When memory pages are modified by the parent
process, a copy-on-write mechanism is triggered by the kernel to create an unmodified
copy for the child process. These pages are allocated on the same NUMA node as the
original page.
Each of the above allocator mechanisms has pros and cons. Garbage collection and
defragmentation algorithms require processing to occur on the heap to free unused
allocations or move data to create contiguous space. Slab allocators usually define a fixed
set of different sized buckets at initialization without knowing how many of each bucket
the application will need. If the slab allocator depletes a certain bucket size, it allocates
from larger sized buckets, which reduces the amount of free space. These mechanisms
can potentially block the application’s processing and reduce its performance.
libvmemcache Overview
libvmemcache is an embeddable and lightweight in-memory caching solution with a
key-value store at its core. It is designed to take full advantage of large-capacity memory,
such as persistent memory, efficiently using memory mapping in a scalable way. It
is optimized for use with memory-addressable persistent storage through a DAX-enabled
file system that supports load/store operations. libvmemcache has several unique
characteristics, described in the sections that follow.
The cache for libvmemcache is tuned to work optimally with relatively large value
sizes. While the smallest possible size is 256 bytes, libvmemcache performs best if the
expected value sizes are above 1 kilobyte.
libvmemcache has more control over the allocation because it implements a custom
memory-allocation scheme using an extents-based approach (like that of file system
extents). libvmemcache can, therefore, concatenate noncontiguous extents and achieve
substantial space efficiency. Additionally, because it is a cache, it can evict data to allocate new entries in
a worst-case scenario. libvmemcache will always allocate exactly as much memory as it
freed, minus metadata overhead. This is not true for caches based on common memory
allocators such as memkind. libvmemcache is designed to work with terabyte-sized
in-memory workloads, with very high space utilization.
libvmemcache Design
libvmemcache has two main design aspects, described in the following sections.
Extent-Based Allocator
libvmemcache can solve fragmentation issues when working with terabyte-sized in-
memory workloads and provide high space utilization. Figure 10-5 shows a workload
example that creates many small objects, and over time, the allocator stops due to
fragmentation.
Figure 10-5. An example of a workload that creates many small objects, and the
allocator stops due to fragmentation
Figure 10-6. Using noncontiguous free blocks to fulfill a larger allocation request
Using libvmemcache
Table 10-3 lists the basic functions that libvmemcache provides. For a complete list,
see the libvmemcache man pages (https://pmem.io/vmemcache/manpages/master/
vmemcache.3.html).
vmemcache_set_extent_size Sets the block size of the cache (256 bytes minimum).
vmemcache_set_eviction_policy Sets the eviction policy:
1. VMEMCACHE_REPLACEMENT_NONE
2. VMEMCACHE_REPLACEMENT_LRU
vmemcache_add Associates the cache with a given path on a DAX-enabled file
system or non-DAX-enabled file system.
vmemcache_delete Frees any structures associated with the cache.
vmemcache_get Searches for an entry with the given key, and if found, the entry’s
value is copied to vbuf.
vmemcache_put Inserts the given key-value pair into the cache.
vmemcache_evict Removes the given key from the cache.
vmemcache_callback_on_evict Called when an entry is being removed from the cache.
vmemcache_callback_on_miss Called when a get query fails, providing an opportunity
to insert the missing key.
37 #include <libvmemcache.h>
38 #include <stdio.h>
39 #include <stdlib.h>
40 #include <string.h>
41
42 #define STR_AND_LEN(x) (x), strlen(x)
43
44 VMEMcache *cache;
45
46 void on_miss(VMEMcache *cache, const void *key,
47 size_t key_size, void *arg)
48 {
49 vmemcache_put(cache, STR_AND_LEN("meow"),
50 STR_AND_LEN("Cthulhu fthagn"));
51 }
52
53 void get(const char *key)
54 {
55 char buf[128];
56 ssize_t len = vmemcache_get(cache,
57 STR_AND_LEN(key), buf, sizeof(buf), 0, NULL);
58 if (len >= 0)
59 printf("%.*s\n", (int)len, buf);
60 else
61 printf("(key not found: %s)\n", key);
62 }
63
64 int main()
65 {
66 cache = vmemcache_new();
67 if (vmemcache_add(cache, "/daxfs")) {
68 fprintf(stderr, "error: vmemcache_add: %s\n",
69 vmemcache_errormsg());
70 exit(1);
71 }
72
73 // Query a non-existent key
74 get("meow");
75
76 // Insert then query
77 vmemcache_put(cache, STR_AND_LEN("bark"),
78 STR_AND_LEN("Lorem ipsum"));
79 get("bark");
80
81 // Install an on-miss handler
82 vmemcache_callback_on_miss(cache, on_miss, 0);
83 get("meow");
84
85 vmemcache_delete(cache);
86
87     return 0;
88 }
• Line 66: Creates a new instance of vmemcache with default values for
eviction_policy and extent_size.
• Line 74: Calls the get() function to query a key that does not yet exist in
the cache. get() wraps vmemcache_get() and checks whether the call
succeeded or failed.
• Line 82: Adds an on-miss callback handler to insert the key “meow”
into the cache.
• Line 83: Retrieves the key “meow” using the get() function.
Summary
This chapter showed how persistent memory’s large capacity can be used to hold volatile
application data. Applications can choose to allocate and access data from DRAM,
persistent memory, or both.
memkind is a very flexible and easy-to-use library with semantics that are similar to
the libc malloc/free APIs that developers frequently use.
libvmemcache is an embeddable and lightweight in-memory caching solution that
allows applications to efficiently use persistent memory’s large capacity in a scalable
way. libvmemcache is an open source project available on GitHub at https://github.
com/pmem/vmemcache.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 11
Designing Data Structures for Persistent Memory
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_11
Atomicity and Consistency
Guaranteeing consistency requires the proper ordering of stores and making sure data
is stored persistently. To make an atomic store bigger than 8 bytes, you must use some
additional mechanisms. This section describes several mechanisms and discusses their
memory and time overheads. For the time overhead, the focus is on analyzing the number
of flushes and memory barriers used because they have the biggest impact on performance.
Transactions
One way to guarantee atomicity and consistency is to use transactions (described in
detail in Chapter 7). Here we focus on how to design a data structure to use transactions
efficiently. An example data structure that uses transactions is described in the “Sorted
Array with Versioning” section later in this chapter.
Transactions are the simplest solution for guaranteeing consistency. While using
transactions can easily make most operations atomic, two items must be kept in mind.
First, transactions that use logging always introduce memory and time overheads.
Second, in the case of undo logging, the memory overhead is proportional to the size of
data you modify, while the time overhead depends on the number of snapshots. Each
snapshot must be persisted prior to the modification of snapshotted data.
1 Using the libpmemobj allocator, it is also possible to easily lower internal fragmentation by using allocation classes (see Chapter 7).
Consider two ways of laying out a and b fields for 1,000 records: an array of
structures (AoS) or a structure of arrays (SoA):
struct aos {
    int a;
    int b;
};
struct aos records[1000];

struct soa {
    int a[1000];
    int b[1000];
};
Depending on the access pattern to the data, you may prefer one solution over the
other. If the program frequently updates both fields of an element, then the AoS solution
is better. However, if the program only updates the first variable of all elements, then the
SoA solution works best.
For applications that use volatile memory, the main concerns are usually cache
misses and optimizations for single instruction, multiple data (SIMD) processing. SIMD
is a class of parallel computers in Flynn’s taxonomy,2 which describes computers with
multiple processing elements that simultaneously perform the same operation on
multiple data points. Such machines exploit data-level parallelism, but not concurrency:
There are simultaneous (parallel) computations but only a single process (instruction) at
a given moment.
While those are still valid concerns for persistent memory, developers must consider
snapshotting performance when transactions are used. Snapshotting one contiguous
memory region is always better than snapshotting several smaller regions, mainly due to
the smaller overhead incurred by using less metadata. Efficient data structure layout that
takes these considerations into account is imperative for avoiding future problems when
migrating data from DRAM-based implementations to persistent memory.
2 For a full definition of SIMD, see https://en.wikipedia.org/wiki/SIMD.
Listing 11-3 presents both approaches; in this example, we want to increase the first
integer by one.
37 struct soa {
38 int a[1000];
39 int b[1000];
40 };
41
42 struct root {
43 soa soa_records;
44 std::pair<int, int> aos_records[1000];
45 };
46
47 int main()
48 {
49 try {
50 auto pop = pmem::obj::pool<root>::create("/daxfs/pmpool",
51 "data_oriented", PMEMOBJ_MIN_POOL, 0666);
52
53 auto root = pop.root();
54
55 pmem::obj::transaction::run(pop, [&]{
56 pmem::obj::transaction::snapshot(&root->soa_records);
57 for (int i = 0; i < 1000; i++) {
58 root->soa_records.a[i]++;
59 }
60
61 for (int i = 0; i < 1000; i++) {
62 pmem::obj::transaction::snapshot(
63 &root->aos_records[i].first);
64 root->aos_records[i].first++;
65 }
66 });
67
68 pop.close();
69 } catch (std::exception &e) {
70 std::cerr << e.what() << std::endl;
71 }
72 }
• Lines 55-59: When using SoA, we snapshot the entire soa_records structure
once, before the loop, because the elements being modified occupy one
contiguous memory region.
• Lines 61-65: When using AoS, we are forced to snapshot data in every
iteration – elements we want to modify are not contiguous in memory.
Examples of data structures that use transactions are shown in the “Hash Table with
Transactions” and “Hash Table with Transactions and Selective Persistence” sections,
later in this chapter.
Copy-on-Write and Versioning
Another way to maintain consistency is the copy-on-write (CoW) technique. In this
approach, every modification creates a new version at a new location whenever you
want to modify some part of a persistent data structure. For example, a node in a linked
list can use the CoW approach as described in the following:
1. Create a copy of the element you want to modify.
2. Modify the copy and persist it.
3. Atomically replace the original element with the copy and persist
the change, then free the original node if needed. After this
step successfully completes, the element is updated and is in a
consistent state. If a crash occurs before this step, the original
element is untouched.
Selective Persistence
Persistent memory is faster than disk storage but potentially slower than DRAM. Hybrid
data structures, where some parts are stored in DRAM and some parts are in persistent
memory, can be implemented to accelerate performance. Caching previously computed
values or frequently accessed parts of a data structure in DRAM can improve access
latency and improve overall performance.
Not all data must be stored in persistent memory. Data that can be rebuilt when an
application restarts can instead be kept in DRAM, which improves runtime performance
because the data is accessed from DRAM and does not require transactions. An example
of this approach appears in "Hash Table with Transactions and Selective Persistence."
38 #include <functional>
39 #include <libpmemobj++/p.hpp>
40 #include <libpmemobj++/persistent_ptr.hpp>
41 #include <libpmemobj++/pext.hpp>
42 #include <libpmemobj++/pool.hpp>
43 #include <libpmemobj++/transaction.hpp>
44 #include <libpmemobj++/utils.hpp>
45 #include <stdexcept>
46 #include <string>
47
48 #include "libpmemobj++/array.hpp"
49 #include "libpmemobj++/string.hpp"
50 #include "libpmemobj++/vector.hpp"
51
52 /**
53 * Value - type of the value stored in hashmap
54 * N - number of buckets in hashmap
55 */
56 template <typename Value, std::size_t N>
57 class simple_kv {
58 private:
59 using key_type = pmem::obj::string;
60 using bucket_type = pmem::obj::vector<
61 std::pair<key_type, std::size_t>>;
62 using bucket_array_type = pmem::obj::array<bucket_type, N>;
63 using value_vector = pmem::obj::vector<Value>;
64
65 bucket_array_type buckets;
66 value_vector values;
67
68 public:
69 simple_kv() = default;
70
71 const Value &
72 get(const std::string &key) const
73 {
74 auto index = std::hash<std::string>{}(key) % N;
75
76 for (const auto &e : buckets[index]) {
77 if (e.first == key)
78 return values[e.second];
79 }
80
81 throw std::out_of_range("no entry in simplekv");
82 }
83
84 void
85 put(const std::string &key, const Value &val)
86 {
87 auto index = std::hash<std::string>{}(key) % N;
88
89 /* get pool on which this simple_kv resides */
90 auto pop = pmem::obj::pool_by_vptr(this);
91
92 /* search for element with specified key - if found
93 * update its value in a transaction*/
94 for (const auto &e : buckets[index]) {
95 if (e.first == key) {
96 pmem::obj::transaction::run(
97 pop, [&] { values[e.second] = val; });
98
99 return;
100 }
101 }
102
103 /* if there is no element with specified key, insert
104 * new value to the end of values vector and put
105 * reference in proper bucket */
106 pmem::obj::transaction::run(pop, [&] {
107 values.emplace_back(val);
108 buckets[index].emplace_back(key, values.size() - 1);
109 });
110 }
111 };
• Line 90: Get the instance of the pmemobj pool object, which is used to
manage the persistent memory pool where our data structure resides.
• Lines 96-98: If an element with the specified key is found, update its
value using a transaction.
72 bucket_array_type buckets;
73 simple_kv_persistent<Value, N> *data;
74
75 public:
76 simple_kv_runtime(simple_kv_persistent<Value, N> *data)
77 {
78 this->data = data;
79
80 for (std::size_t i = 0; i < data->values.size(); i++) {
81 auto volatile_key = std::string(data->keys[i].c_str(),
82 data->keys[i].size());
83
84 auto index = std::hash<std::string>{}(volatile_key)%N;
85 buckets[index].emplace_back(
86 bucket_entry_type{volatile_key, i});
87 }
88 }
89
90 const Value &
91 get(const std::string &key) const
92 {
93 auto index = std::hash<std::string>{}(key) % N;
94
95 for (const auto &e : buckets[index]) {
96 if (e.first == key)
97 return data->values[e.second];
98 }
99
100 throw std::out_of_range("no entry in simplekv");
101 }
102
103 void
104 put(const std::string &key, const Value &val)
105 {
106 auto index = std::hash<std::string>{}(key) % N;
107
• Line 67: We define the data types residing in volatile memory. These
are very similar to the types used in the persistent version in “Hash
Table with Transactions.” The only difference is that here we use std
containers instead of pmem::obj.
• Lines 126-129: When there is no element with the specified key in the
hash table, we insert both a value and a key to their respective vectors
in persistent memory in a transaction.
• Lines 149-150: We define the layout of the persistent data. Key and
values are stored in separate pmem::obj::vector.
To understand why we need two versions of the entries_t structure and a current
field, Figure 11-3 shows how the insert operation works, and the corresponding
pseudocode appears in Listing 11-7.
80
81 working_copy.size = consistent_copy.size + 1;
82 }
83
84 template <typename V, uint64_t s>
85 void array<V,s>::insert(pmem::obj::pool_base &pop,
86 const Value &entry){
87 insert_element(pop, entry);
88 pop.persist(&(v[1 - current]), sizeof(entries_t<Value, slots>));
89
90 current = 1 - current;
91 pop.persist(&current, sizeof(current));
92 }
• Line 63: We find the position in the current array where an entry
should be inserted.
• Line 71: We copy part of the current array to the working array (range
from beginning of the current array to the place where a new element
should be inserted).
• Line 77: We copy remaining elements from the current array to the
working array after the element we just inserted.
• Line 81: We update the size of the working array to the size of the
current array plus one, for the element inserted.
Let’s analyze whether this approach guarantees data consistency. In the first step,
we copy elements from the original array to a currently unused one, insert the new
element, and persist it to make sure data goes to the persistence domain. The persist
call also ensures that the next operation (updating the current value) is not reordered
before any of the previous stores. Because of this, any interruption before or after issuing
the instruction to update the current field would not corrupt data because the current
variable always points to a valid version.
The memory overhead of using versioning for the insert operation is equal to the size
of the entries array plus the current field. In terms of time overhead, we issued only two
persist operations.
Summary
This chapter shows how to design data structures for persistent memory, considering its
characteristics and capabilities. We discuss fragmentation and why it is problematic in
the case of persistent memory. We also present a few different methods of guaranteeing
data consistency; using transactions is the simplest and least error-prone method.
Other approaches, such as copy-on-write or versioning, can perform better, but they are
significantly more difficult to implement correctly.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 12
Debugging Persistent Memory Applications
Persistent memory programming introduces new opportunities that allow developers to
directly persist data structures without serialization and to access them in place without
involving classic block I/O. As a result, you can merge your data models and avoid the
classic split between data in memory, which is volatile, fast, and byte addressable, and
data on traditional storage devices, which are non-volatile but slower.
Persistent memory programming also brings challenges. Recall our discussion
about power-fail protected persistence domains in Chapter 2: When a process or system
crashes on an Asynchronous DRAM Refresh (ADR)-enabled platform, data residing in
the CPU caches that has not yet been flushed is lost. This is not a problem with volatile
memory because all the memory hierarchy is volatile. With persistent memory, however,
a crash can cause permanent data corruption. How often must you flush data? Flushing
too frequently yields suboptimal performance, and not flushing often enough leaves the
potential for data loss or corruption.
Chapter 11 described several approaches to designing data structures and using
methods such as copy-on-write, versioning, and transactions to maintain data integrity.
Many libraries within the Persistent Memory Development Kit (PMDK) provide
transactional updates of data structures and variables. These libraries provide optimal
CPU cache flushing, when required by the platform, at precisely the right time, so you
can program without concern about the hardware intricacies.
This programming paradigm introduces new dimensions related to errors and
performance issues that programmers need to be aware of. The PMDK libraries reduce
errors in persistent memory programming, but they cannot eliminate them. This chapter
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_12
describes common persistent memory programming issues and pitfalls and how to
correct them using the tools available. The first half of this chapter introduces the tools.
The second half presents several erroneous programming scenarios and describes how
to use the tools to correct the mistakes before releasing your code into production.
pmemcheck for Valgrind
pmemcheck is a Valgrind (http://www.valgrind.org/) tool developed by Intel. It is very
similar to memcheck, which is the default tool in Valgrind to discover memory-related
bugs but adapted for persistent memory. Valgrind is an instrumentation framework for
building dynamic analysis tools. Some Valgrind tools can automatically detect many
memory management and threading bugs and profile your programs in detail. You can
also use Valgrind to build new tools.
To run pmemcheck, you need a modified version of Valgrind supporting the new
CLFLUSHOPT and CLWB flushing instructions. The persistent memory version of Valgrind
includes the pmemcheck tool and is available from https://github.com/pmem/valgrind.
Refer to the README.md within the GitHub project for installation instructions.
All the libraries in PMDK are already instrumented with pmemcheck. If you use PMDK
for persistent memory programming, you will be able to easily check your code with
pmemcheck without any code modification.
Before we discuss the pmemcheck details, the following two sections demonstrate how
it identifies errors in an out-of-bounds and a memory leak example.
32 #include <stdlib.h>
33
34 int main() {
35 int *stack = malloc(100 * sizeof(int));
36 stack[100] = 1234;
37 free(stack);
38 return 0;
39 }
In line 36, we are incorrectly assigning the value 1234 to the position 100, which is
outside the array range of 0-99. If we compile and run this code, it may not fail. This is
because, even if we only allocated 400 bytes (100 integers) for our array, the operating
system provides a whole memory page, typically 4KiB. Executing the binary under
Valgrind reports an issue, shown in Listing 12-2.
$ valgrind ./stackoverflow
==4188== Memcheck, a memory error detector
...
==4188== Invalid write of size 4
==4188== at 0x400556: main (stackoverflow.c:36)
==4188== Address 0x51f91d0 is 0 bytes after a block of size 400 alloc'd
==4188== at 0x4C2EB37: malloc (vg_replace_malloc.c:299)
==4188== by 0x400547: main (stackoverflow.c:35)
...
==4188== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Because Valgrind can produce long reports, we show only the relevant “Invalid write”
error part of the report. When compiling code with symbol information (gcc -g), it is
easy to see the exact place in the code where the error is detected. In this case, Valgrind
highlights line 36 of the stackoverflow.c file. With the issue identified in the code, we
know where to fix it.
The memory allocation is moved to the function func(). A memory leak occurs
because the pointer to the newly allocated memory is a local variable on line 35, which is
lost when the function returns. Executing this program under Valgrind shows the results
in Listing 12-4.
Valgrind shows a loss of 400 bytes of memory allocated at leak.c:35. To learn more,
please visit the official Valgrind documentation (http://www.valgrind.org/docs/
manual/index.html).
Intel Inspector creates a new directory with the data and analysis results, and prints
a summary of findings to the terminal. For the stackoverflow app, it detected one invalid
memory access.
After launching the GUI using inspxe-gui, we open the results collection through
the File ➤ Open ➤ Result menu and navigate to the directory created by inspxe-cli. The
directory will be named r000mi2 if it is the first run. Within the directory is a file named
r000mi2.inspxe. Once opened and processed, the GUI presents the data shown in
Figure 12-1.
Figure 12-1. GUI of Intel Inspector showing results for Listing 12-1
The GUI defaults to the Summary tab to provide an overview of the analysis. Since
we compiled the program with symbols, the Code Locations panel at the bottom shows
the exact place in the code where the problem was detected. Intel Inspector identified
the same error on line 36 that Valgrind found.
If Intel Inspector detects multiple problems within the program, those issues are
listed in the Problems section in the upper left area of the window. You can select each
problem and see the information relating to it in the other sections of the window.
The Intel Inspector output is shown in Figure 12-2 and explains that a memory leak
problem was detected. When we open the r001mi2/r001mi2.inspxe result file in the
GUI, we get something similar to what is shown in the lower left section of Figure 12-2.
Figure 12-2. GUI of Intel Inspector showing results for Listing 12-2
The information related to the leaked object is shown above the code listing.
The right side of the Code panel shows the call stack that led to the bug (call stacks
are read from bottom to top). We see the call to func() in the main() function on line 39
(leak.c:39), then the memory allocation occurs within func() on line 35 (leak.c:35).
The Intel Inspector offers much more than what we presented here. To learn
more, please visit the documentation (https://software.intel.com/en-us/intel-
inspector-support/documentation).
Nonpersistent Stores
Nonpersistent stores refer to data written to persistent memory but not flushed explicitly.
It is understood that if the program writes to persistent memory, it wishes for those
writes to be persistent. If the program ends without explicitly flushing writes, there is an
open possibility for data corruption. When a program exits gracefully, all the pending
writes in the CPU caches are flushed automatically. However, if the program were to
crash unexpectedly, writes still residing in the CPU caches could be lost.
Consider the code in Listing 12-7 that writes data to a persistent memory device
mounted to /mnt/pmem without flushing the data.
32 #include <stdio.h>
33 #include <sys/mman.h>
34 #include <fcntl.h>
35
36 int main(int argc, char *argv[]) {
37 int fd, *data;
38 fd = open("/mnt/pmem/file", O_CREAT|O_RDWR, 0666);
39 posix_fallocate(fd, 0, sizeof(int));
40 data = (int *) mmap(NULL, sizeof(int), PROT_READ |
41 PROT_WRITE, MAP_SHARED_VALIDATE |
42 MAP_SYNC, fd, 0);
43 *data = 1234;
44 munmap(data, sizeof(int));
45 return 0;
46 }
• Line 39: We make sure there is enough space in the file to allocate an
integer by calling posix_fallocate().
• Line 40: We memory map /mnt/pmem/file.
If we run pmemcheck with Listing 12-7, we will not get any useful information
because pmemcheck has no way to know which memory addresses are persistent and
which ones are volatile. This may change in future versions. To run pmemcheck, we pass
the --tool=pmemcheck argument to valgrind, as shown in Listing 12-8. The result shows that no
issues were detected.
We can inform pmemcheck which memory regions are persistent using a VALGRIND_
PMC_REGISTER_PMEM_MAPPING macro shown on line 52 in Listing 12-9. We must include
the valgrind/pmemcheck.h header for pmemcheck, line 36, which defines the VALGRIND_
PMC_REGISTER_PMEM_MAPPING macro and others.
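The registration amounts to one extra line after the mmap() call. In the sketch below, the include guard and the write_value() helper are our additions, not part of the book's listing, and plain MAP_SHARED replaces MAP_SHARED_VALIDATE | MAP_SYNC so the sketch also compiles and runs on a regular file system:

```c
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* The pmemcheck header is only present with the pmem fork of Valgrind
 * installed; outside Valgrind the client request can safely be a no-op. */
#ifdef USE_PMEMCHECK
#include <valgrind/pmemcheck.h>
#else
#define VALGRIND_PMC_REGISTER_PMEM_MAPPING(addr, size) ((void)0)
#endif

/* Map one int from 'path', register the mapping with pmemcheck, and
 * store to it. */
int write_value(const char *path) {
    int fd = open(path, O_CREAT | O_RDWR, 0666);
    if (fd < 0)
        return -1;
    if (posix_fallocate(fd, 0, sizeof(int)) != 0) {
        close(fd);
        return -1;
    }
    int *data = (int *) mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) {
        close(fd);
        return -1;
    }
    /* Tell pmemcheck that this mapping is persistent memory. */
    VALGRIND_PMC_REGISTER_PMEM_MAPPING(data, sizeof(int));
    *data = 1234;
    munmap(data, sizeof(int));
    close(fd);
    return 0;
}
```

When the pmem fork of Valgrind is installed, compile with -DUSE_PMEMCHECK and run the binary under valgrind --tool=pmemcheck to get the persistent-memory diagnostics described in the text.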
pmemcheck detects that data is not being flushed after the write in
listing_12-9.c, line 56. To fix this, we create a new flush() function, accepting an
address and a size, that flushes all the CPU cache lines storing any part of the data
using the CLFLUSH machine instruction (_mm_clflush()). Listing 12-11 shows the
modified code.
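For reference, the flush() helper used by these listings (it appears in full later in this chapter, in Listing 12-33) is:

```c
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Flush every 64-byte cache line that overlaps [addr, addr + len) with
 * the CLFLUSH instruction; each CLFLUSH also acts as an implicit fence. */
void flush(const void *addr, size_t len) {
    uintptr_t flush_align = 64, uptr;
    for (uptr = (uintptr_t)addr & ~(flush_align - 1);
         uptr < (uintptr_t)addr + len;
         uptr += flush_align)
        _mm_clflush((char *)uptr);
}
```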
Because Intel Inspector – Persistence Inspector does not consider an unflushed write a
problem unless there is a write dependency with other variables, we need to show a more
complex example than writing a single variable in Listing 12-7. You need to understand
how programs writing to persistent memory are designed to know which parts of the data
written to the persistent media are valid and which parts are not. Remember that recent
writes may still be sitting in the CPU caches if they are not explicitly flushed.
Transactions solve the problem of half-written data by using logs to either roll back
or apply uncommitted changes; thus, programs reading the data back can be assured
that everything written is valid. In the absence of transactions, it is impossible to know
whether or not the data written on persistent memory is valid, especially if the program
crashes.
A writer can inform a reader that data is properly written in one of two ways, either
by setting a “valid” flag or by using a watermark variable with the address (or the index,
in the case of an array) of the last valid written memory position.
Listing 12-13 shows pseudocode for how the “valid” flag approach could be
implemented.
1 writer() {
2 var1 = "This is a persistent Hello World
3 written to persistent memory!";
4 flush (var1);
5 var1_valid = True;
6 flush (var1_valid);
7 }
8
9 reader() {
10 if (var1_valid == True) {
11 print (var1);
12 }
13 }
The reader() will read the data in var1 if the var1_valid flag is set to True (line 10),
and var1_valid can only be True if var1 has been flushed (lines 4 and 5).
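The watermark alternative mentioned earlier can be sketched the same way. This is our illustration rather than a listing from the chapter; persist() is a no-op stand-in that only marks where the real cache-line flushes belong:

```c
#include <stddef.h>

/* No-op stand-in for a real cache-line flush; it only marks ordering. */
static void persist(const void *addr, size_t len) {
    (void)addr;
    (void)len;
}

/* Publish n entries: flush each entry first, then advance the watermark.
 * A reader trusts exactly the entries array[0 .. watermark - 1]. */
void publish(int *array, size_t n, size_t *watermark) {
    for (size_t i = 0; i < n; i++) {
        array[i] = (int)(i * 10);          /* write the payload */
        persist(&array[i], sizeof(int));   /* data becomes durable first */
        *watermark = i + 1;                /* ...then the watermark moves */
        persist(watermark, sizeof(size_t));
    }
}
```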
We can now modify the code from Listing 12-7 to introduce this “valid” flag. In
Listing 12-14, we separate the code into writer and reader programs and map two
integers instead of one (to accommodate the flag). Listing 12-15 shows the
corresponding reader.
54 munmap(ptr, 2 * sizeof(int));
55 return 0;
56 }
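Listing 12-15 itself is not reproduced in full here. A reader consistent with the description, with the flag at ptr[0] and the data at ptr[1] to match the writer shown later in Listing 12-19, can be sketched as follows; the helper name read_record() and the plain MAP_SHARED mapping are our simplifications:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Open 'path' and map two ints laid out as in the writer: the "valid"
 * flag at index 0 and the data at index 1. Return the data value if the
 * flag is set, or -1 otherwise. */
int read_record(const char *path) {
    int result = -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int *ptr = (int *) mmap(NULL, 2 * sizeof(int), PROT_READ,
                            MAP_SHARED, fd, 0);
    if (ptr != MAP_FAILED) {
        if (ptr[0] == 1)      /* flag */
            result = ptr[1];  /* data */
        munmap(ptr, 2 * sizeof(int));
    }
    close(fd);
    return result;
}
```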
++ Analysis completes
++ Data is stored in folder "/data/.pmeminspdata/data/listing_12-14"
Listing 12-17. Running Intel Inspector – Persistence Inspector with code Listing
12-15 for after-unfortunate-event phase analysis
$ pmeminsp ca -pmem-file /mnt/pmem/file -- ./listing_12-15
++ Analysis starts
data = 1234
++ Analysis completes
++ Data is stored in folder "/data/.pmeminspdata/data/listing_12-15"
while
depends on
#=============================================================
# Diagnostic # 2: Missing cache flush
#-------------------
Memory store
of size 4 at address 0x7F9C68893000 (offset 0x0 in /mnt/pmem/file)
in /data/listing_12-16!main at listing_12-16.c:52 - 0x687
memory is unmapped
in /data/listing_12-16!main at listing_12-16.c:54 - 0x699
in /lib64/libc.so.6!__libc_start_main at <unknown_file>:<unknown_
line> - 0x223D3
in /data/listing_12-16!_start at <unknown_file>:<unknown_line> - 0x534
The output is very verbose, but it is easy to follow. We get two missing cache flushes
(diagnostics 1 and 2) corresponding to lines 51 and 52 of listing_12-16.c. These are
the writes to the locations in mapped persistent memory pointed to by the variables
flag and data. The first diagnostic says that the first memory store is not flushed
before the second store while, at the same time, there is a load dependency between
the two stores. This is exactly what we intended.
The second diagnostic says that the second store (to the flag) itself is never actually
flushed before ending. Even if we flush the first store correctly before we write the flag,
we must still flush the flag to make sure the dependency works.
To open the results in the Intel Inspector GUI, use the -insp option when generating
the report.
Figure 12-3. GUI of Intel Inspector showing results for Listing 12-18 (diagnostic 1)
The GUI shows the same information as the command-line analysis but in a more
readable way by highlighting the errors directly on our source code. As Figure 12-3
shows, the modification of the flag is called “primary store.”
In Figure 12-4, the second diagnostic is selected in the Problems pane, showing the
missing flush for the flag itself.
Figure 12-4. GUI of Intel Inspector showing results for Listing 12-20 (diagnostic #2)
To conclude this section, we fix the code and rerun the analysis with Persistence
Inspector. The code in Listing 12-19 adds the necessary flushes to Listing 12-14.
45 _mm_clflush((char *)uptr);
46 }
47
48 int main(int argc, char *argv[]) {
49 int fd, *ptr, *data, *flag;
50
51 fd = open("/mnt/pmem/file", O_CREAT|O_RDWR, 0666);
52 posix_fallocate(fd, 0, sizeof(int) * 2);
53
54 ptr = (int *) mmap(NULL, sizeof(int) * 2,
55 PROT_READ | PROT_WRITE,
56 MAP_SHARED_VALIDATE | MAP_SYNC,
57 fd, 0);
58
59 data = &(ptr[1]);
60 flag = &(ptr[0]);
61 *data = 1234;
62 flush((void *) data, sizeof(int));
63 *flag = 1;
64 flush((void *) flag, sizeof(int));
65
66 munmap(ptr, 2 * sizeof(int));
67 return 0;
68 }
Listing 12-20 runs Persistence Inspector against the modified writer from
Listing 12-19 and the reader from Listing 12-15, and then generates the report,
which shows that no problems were detected.
Listing 12-20. Running full analysis with Intel Inspector – Persistence Inspector
with code Listings 12-19 and 12-15
$ pmeminsp cb -pmem-file /mnt/pmem/file -- ./listing_12-19
++ Analysis starts
++ Analysis completes
++ Data is stored in folder "/data/.pmeminspdata/data/listing_12-19"
data = 1234
++ Analysis completes
++ Data is stored in folder "/data/.pmeminspdata/data/listing_12-15"
Note For a refresh on the definitions of a layout, root object, or macros used in
Listing 12-21, see Chapter 7 where we introduce libpmemobj.
In lines 35-38, we create a my_root data structure, which has two integer members:
value and is_odd. These integers are modified inside a transaction (lines 52-61),
setting value=4 and is_odd=0. On line 57, we are only adding the value variable to the
transaction, leaving is_odd out. Given that persistent memory is not natively supported
in C, there is no way for the compiler to warn you about this. The compiler cannot
distinguish between pointers to volatile memory vs. those to persistent memory.
Listing 12-22 shows the response from running the code through pmemcheck.
Although they are both related to the same root cause, pmemcheck identified two
issues. One is the error we expected; that is, we have a store inside a transaction that
was not added to it. The other error says that we are not flushing the store. Since
transactional stores are flushed automatically when the program exits the transaction,
finding two errors per store to a location not included within a transaction should be
common in pmemcheck.
Persistence Inspector has a more user-friendly output, as shown in Listing 12-23.
++ Analysis completes
++ Data is stored in folder "/data/.pmeminspdata/data/listing_12-21"
$
$ pmeminsp rp -- ./listing_12-21
#=============================================================
# Diagnostic # 1: Store without undo log
#-------------------
Memory store
of size 4 at address 0x7FAA84DC0554 (offset 0x3C0554 in /mnt/pmem/pool)
in /data/listing_12-21!main at listing_12-21.c:60 - 0xC2D
in /lib64/libc.so.6!__libc_start_main at <unknown_file>:<unknown_
line> - 0x223D3
in /data/listing_12-21!_start at <unknown_file>:<unknown_line> - 0x954
transaction
in /data/listing_12-21!main at listing_12-21.c:52 - 0xB67
in /lib64/libc.so.6!__libc_start_main at <unknown_file>:<unknown_
line> - 0x223D3
in /data/listing_12-21!_start at <unknown_file>:<unknown_line> - 0x954
32 #include <libpmemobj.h>
33
34 struct my_root {
35 int value;
36 int is_odd;
37 };
38
39 POBJ_LAYOUT_BEGIN(example);
40 POBJ_LAYOUT_ROOT(example, struct my_root);
41 POBJ_LAYOUT_END(example);
42
If we run the code through pmemcheck, as shown in Listing 12-25, no issues are
reported.
Listing 12-26. Generating report with Intel Inspector – Persistence Inspector for
code Listing 12-24
$ pmeminsp cb -pmem-file /mnt/pmem/pool -- ./listing_12-24
++ Analysis starts
++ Analysis completes
++ Data is stored in folder "/data/.pmeminspdata/data/listing_12-24"
$
$ pmeminsp rp -- ./listing_12-24
Analysis complete. No problems detected.
After properly adding all the memory that will be modified to the transaction, both
tools report that no problems were found.
Figure 12-5. The rollback mechanism for the unfinished transaction in Thread 1
also overwrites the changes made by Thread 2, even though Thread 2's transaction
finishes correctly
In Figure 12-5, time is shown on the y-axis, progressing downward. These
operations occur in the following order:
• Thread 2 starts, begins its own transaction, acquires the lock, reads
the value of X (which is now 5), adds X=5 to the undo log, and
increments it by 5. The transaction completes successfully, and
Thread 2 flushes the CPU caches. Now, X=10.
This scenario leaves the application with an invalid, but consistent, value of X=10.
Since transactions are atomic, all changes done within them are not valid until they
successfully complete.
When the application starts, it knows it must perform a recovery operation due
to the previous crash and will replay the undo logs to rewind the partial update made
by Thread 1. The undo log restores the value of X=0, which was correct when Thread 1
added its entry. The expected value of X should be X=5 in this situation, but the undo log
puts X=0. You can probably see the huge potential for data corruption that this situation
can produce.
We describe concurrency for multithreaded applications in Chapter 14. Using
libpmemobj-cpp, the C++ language binding library to libpmemobj, concurrency issues
are very easy to resolve because the API allows us to pass a list of locks using lambda
functions when transactions are created. Chapter 8 discusses libpmemobj-cpp and
lambda functions in more detail.
Listing 12-27 shows how you can use a single mutex to lock a whole transaction. This
mutex can either be a standard mutex (std::mutex) if the mutex object resides in volatile
memory or a pmem mutex (pmem::obj::mutex) if the mutex object resides in persistent
memory.
Consider the code in Listing 12-28 that simultaneously adds the same memory
region to two different transactions.
66 pthread_t thread;
67 pthread_mutex_init(&lock, NULL);
68
69 TX_BEGIN(pop) {
70 pthread_mutex_lock(&lock);
71 TOID(struct my_root) root
72 = POBJ_ROOT(pop, struct my_root);
73 TX_ADD(root);
74 pthread_create(&thread, NULL,
75 func, (void *) pop);
76 D_RW(root)->value = D_RO(root)->value + 4;
77 D_RW(root)->is_odd = D_RO(root)->value % 2;
78 pthread_mutex_unlock(&lock);
79 // wait to make sure other thread finishes 1st
80 pthread_join(thread, NULL);
81 } TX_END
82
83 pthread_mutex_destroy(&lock);
84 return 0;
85 }
• Line 69: The main thread starts a transaction and adds the root data
structure to it (line 73).
• Both threads will simultaneously modify all or part of the same data
before finishing their transactions. We force the second thread to
finish first by making the main thread wait on pthread_join().
Listing 12-29 shows code execution with pmemcheck, and the result warns us that we
have overlapping regions registered in different transactions.
Listing 12-30 shows the same code run with Persistence Inspector, which also reports
“Overlapping regions registered in different transactions” in diagnostic 25. The first 24
diagnostics are related to stores not added to our transactions, corresponding to the
locking and unlocking of our volatile mutex; these can be ignored.
protects
memory region
in /data/listing_12-28!main at listing_12-28.c:73 - 0xF1F
in /lib64/libc.so.6!__libc_start_main at <unknown_file>:<unknown_line>
- 0x223D3
in /data/listing_12-28!_start at <unknown_file>:<unknown_line> - 0xB44
overlaps with
memory region
in /data/listing_12-28!func at listing_12-28.c:55 - 0xDA8
in /lib64/libpthread.so.0!start_thread at <unknown_file>:<unknown_line>
- 0x7DCD
in /lib64/libc.so.6!__clone at <unknown_file>:<unknown_line> - 0xFDEAB
Memory Overwrites
When multiple modifications to the same persistent memory location occur before
the location is made persistent (that is, flushed), a memory overwrite occurs. This is
a potential data corruption source if a program crashes because the final value of the
persistent variable can be any of the values written between the last flush and the crash.
This may not be an issue if the overwrite is there by design. We recommend using
volatile variables for short-lived data and writing to persistent variables only when
you want the data to persist.
Consider the code in Listing 12-31, which writes twice to the data variable inside the
main() function (lines 62 and 63) before we call flush() on line 64.
Listing 12-32 shows the report from pmemcheck with the code from Listing 12-31.
To make pmemcheck look for overwrites, we must use the --mult-stores=yes option.
pmemcheck reports that we have overwritten stores. We can fix this problem by either
inserting a flushing instruction between both writes, if we forgot to flush, or by moving
one of the stores to volatile data if that store corresponds to short-lived data.
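The second fix can be sketched as follows. store_final() is a hypothetical helper name, and flush() is the CLFLUSH loop used throughout this chapter; the intermediate value stays in an ordinary local, so only one store (followed by its flush) reaches the persistent location:

```c
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* CLFLUSH-based flush, as used throughout this chapter. */
static void flush(const void *addr, size_t len) {
    uintptr_t flush_align = 64, uptr;
    for (uptr = (uintptr_t)addr & ~(flush_align - 1);
         uptr < (uintptr_t)addr + len; uptr += flush_align)
        _mm_clflush((char *)uptr);
}

/* The short-lived value lives in a volatile (ordinary) local; only the
 * final value is stored to the persistent location, so a tool sees a
 * single store followed by its flush instead of an overwritten store. */
void store_final(int *pmem_dst, int intermediate, int final_value) {
    int tmp = intermediate;  /* short-lived data stays out of pmem */
    tmp = final_value;       /* overwrite happens in DRAM/registers */
    *pmem_dst = tmp;         /* one store to persistent memory */
    flush(pmem_dst, sizeof(int));
}
```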
At the time of publication, Persistence Inspector does not support checking for
overwritten stores. As you have seen, Persistence Inspector does not consider a missing
flush an issue unless there is a write dependency. In addition, it does not consider this a
performance problem because writing to the same variable in a short time span is likely
to hit the CPU caches anyway, rendering the latency differences between DRAM and
persistent memory irrelevant.
Unnecessary Flushes
Flushing should be done carefully. Detecting unnecessary flushes, such as redundant
ones, can help improve code performance. The code in Listing 12-33 shows a redundant
call to the flush() function on line 64.
33 #include <emmintrin.h>
34 #include <stdint.h>
35 #include <stdio.h>
36 #include <sys/mman.h>
37 #include <fcntl.h>
38 #include <valgrind/pmemcheck.h>
39
40 void flush(const void *addr, size_t len) {
41 uintptr_t flush_align = 64, uptr;
42 for (uptr = (uintptr_t)addr & ~(flush_align - 1);
43 uptr < (uintptr_t)addr + len;
44 uptr += flush_align)
45 _mm_clflush((char *)uptr);
46 }
47
48 int main(int argc, char *argv[]) {
49 int fd, *data;
50
To showcase Persistence Inspector, Listing 12-35 has code with a write dependency,
similar to what we did for Listing 12-11 in Listing 12-19. The extra flush occurs on line 65.
58 data = &(ptr[1]);
59 flag = &(ptr[0]);
60
61 *data = 1234;
62 flush((void *) data, sizeof(int));
63 *flag = 1;
64 flush((void *) flag, sizeof(int));
65 flush((void *) flag, sizeof(int)); // extra flush
66
67 munmap(ptr, 2 * sizeof(int));
68 return 0;
69 }
Listing 12-36 uses the same reader program from Listing 12-15 to show the analysis
from Persistence Inspector. As before, we first collect data from the writer program,
then the reader program, and finally run the report to identify any issues.
Listing 12-36. Running Intel Inspector – Persistence Inspector with Listing 12-35
(writer) and Listing 12-15 (reader)
$ pmeminsp cb -pmem-file /mnt/pmem/file -- ./listing_12-35
++ Analysis starts
++ Analysis completes
++ Data is stored in folder "/data/.pmeminspdata/data/listing_12-35"
data = 1234
++ Analysis completes
++ Data is stored in folder "/data/.pmeminspdata/data/listing_12-15"
cache flush
of size 64 at address 0x7F3220C55000 (offset 0x0 in /mnt/pmem/file)
in /data/listing_12-35!flush at listing_12-35.c:45 - 0x674
in /data/listing_12-35!main at listing_12-35.c:65 - 0x750
in /lib64/libc.so.6!__libc_start_main at <unknown_file>:<unknown_line>
- 0x223D3
in /data/listing_12-35!_start at <unknown_file>:<unknown_line> - 0x574
of
memory store
of size 4 at address 0x7F3220C55000 (offset 0x0 in /mnt/pmem/file)
in /data/listing_12-35!main at listing_12-35.c:63 - 0x72D
in /lib64/libc.so.6!__libc_start_main at <unknown_file>:<unknown_line>
- 0x223D3
in /data/listing_12-35!_start at <unknown_file>:<unknown_line> - 0x574
The Persistence Inspector report warns about the redundant cache flush within
the main() function on line 65 of the listing_12-35.c program file – “main at
listing_12-35.c:65". Solving these issues is as simple as deleting the unnecessary
flushes, which will improve the application's performance.
Out-of-Order Writes
When developing software for persistent memory, remember that even if a cache line is
not explicitly flushed, that does not mean the data is still in the CPU caches. For example,
the CPU could have evicted it due to cache pressure or other reasons. Furthermore,
just as writes that are not flushed properly may produce bugs in the event of an
unexpected application crash, so can automatically evicted dirty cache lines if they
violate an expected order of writes that the application relies on.
To better understand this problem, explore how flushing works in the x86-64
architecture. From user space, we can issue any of the following instructions to
ensure our writes reach the persistent media:
• CLFLUSH
• CLFLUSHOPT (followed by an SFENCE)
• CLWB (followed by an SFENCE)
• Non-temporal stores (followed by an SFENCE)
The only instruction that ensures each flush is issued in order is CLFLUSH because
each CLFLUSH instruction always does an implicit fence instruction (SFENCE). The other
instructions are asynchronous and can be issued in parallel and in any order. The CPU
can only guarantee that all flushes issued since the previous SFENCE have completed
when a new SFENCE instruction is explicitly executed. Think of SFENCE instructions as
synchronization points (see Figure 12-6). For more information about these instructions,
refer to the Intel software developer manuals and the AMD software developer manuals.
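The pattern can be sketched with compiler intrinsics. In this illustration, _mm_clflush stands in for the flush (building with CLFLUSHOPT or CLWB requires matching compiler support), and the explicit _mm_sfence() is the synchronization point described above:

```c
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Flush a range, then fence. With CLFLUSH the fence is implicit; with
 * CLFLUSHOPT or CLWB the flushes may complete in any order, and only
 * the SFENCE guarantees that all of them have finished. */
void flush_and_fence(const void *addr, size_t len) {
    uintptr_t uptr;
    for (uptr = (uintptr_t)addr & ~((uintptr_t)63);
         uptr < (uintptr_t)addr + len; uptr += 64)
        _mm_clflush((const void *)uptr);
    _mm_sfence();  /* synchronization point: prior flushes are complete */
}
```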
As Figure 12-6 shows, we cannot guarantee the order with respect to how A and B
would be finally written to persistent memory. This happens because stores and flushes
to A and B are done between synchronization points. The case of C is different. Using
the SFENCE instruction, we can be assured that C will always go after A and B have been
flushed.
Knowing this, you can now imagine how out-of-order writes could be a problem in
a program crash. If assumptions are made with respect to the order of writes between
synchronization points, or if you forget to add synchronization points between writes
and flushes where strict order is essential (think of a “valid flag” for a variable write,
where the variable needs to be written before the flag is set to valid), you may encounter
data consistency issues. Consider the pseudocode in Listing 12-37.
1 writer () {
2 pcounter = 0;
3 flush (pcounter);
4 for (i=0; i<max; i++) {
5 pcounter++;
6 if (rand () % 2 == 0) {
7 pcells[i].data = data ();
8 flush (pcells[i].data);
9 pcells[i].valid = True;
10 } else {
11 pcells[i].valid = False;
12 }
13 flush (pcells[i].valid);
14 }
15 flush (pcounter);
16 }
17
18 reader () {
19 for (i=0; i<pcounter; i++) {
20 if (pcells[i].valid == True) {
21 print (pcells[i].data);
22 }
23 }
24 }
For simplicity, assume that all flushes in Listing 12-37 are also synchronization
points; that is, flush() uses CLFLUSH. The logic of the program is very simple. There are
two persistent memory variables: pcells and pcounter. The first is an array of tuples
{data, valid} where data holds the data and valid is a flag indicating if data is valid
or not. The second variable is a counter indicating how many elements in the array have
been written correctly to persistent memory. Here, the valid flag does not indicate
whether the array position was written correctly to persistent memory; it only
indicates whether the function data() was called, that is, whether data holds
meaningful content.
At first glance, the program appears correct. With every new iteration of the loop,
the counter is incremented, and then the array position is written and flushed. However,
pcounter is incremented before we write to the array, thus creating a discrepancy
between pcounter and the actual number of committed entries in the array. Although it
is true that pcounter is not flushed until after the loop, the program is only correct after
a crash if we assume that the changes to pcounter stay in the CPU caches (in that case,
a crash in the middle of the loop would simply leave the counter at zero).
As mentioned at the beginning of this section, we cannot make that assumption. A
cache line can be evicted at any time. In the pseudocode example in Listing 12-37, we
could run into a bug where pcounter indicates that the array is longer than it really is,
making the reader() read uninitialized memory.
The code in Listings 12-38 and 12-39 provide a C++ implementation of the
pseudocode from Listing 12-37. Both use libpmemobj-cpp from the PMDK. Listing 12-38
is the writer program, and Listing 12-39 is the reader.
51 struct record_t {
52 char name[63];
53 char valid;
54 };
55 struct root {
56 pobj::persistent_ptr<header_t> header;
57 pobj::persistent_ptr<record_t[]> records;
58 };
59
60 pobj::pool<root> pop;
61
62 int main(int argc, char *argv[]) {
63
64 // everything between BEGIN and END can be
65 // assigned a particular engine in pmreorder
66 VALGRIND_PMC_EMIT_LOG("PMREORDER_TAG.BEGIN");
67
68 pop = pobj::pool<root>::open("/mnt/pmem/file",
69 "RECORDS");
70 auto proot = pop.root();
71
72 // allocation of memory and initialization to zero
73 pobj::transaction::run(pop, [&] {
74 proot->header
75 = pobj::make_persistent<header_t>();
76 proot->header->counter = 0;
77 proot->records
78 = pobj::make_persistent<record_t[]>(10);
79 proot->records[0].valid = 0;
80 });
81
82 pobj::persistent_ptr<header_t> header
83 = proot->header;
84 pobj::persistent_ptr<record_t[]> records
85 = proot->records;
86
87 VALGRIND_PMC_EMIT_LOG("PMREORDER_TAG.END");
88
89 header->counter = 0;
90 for (uint8_t i = 0; i < 10; i++) {
91 header->counter++;
92 if (rand() % 2 == 0) {
93 snprintf(records[i].name, 63,
94 "record #%u", i + 1);
95 pop.persist(records[i].name, 63); // flush
96 records[i].valid = 2;
97 } else
98 records[i].valid = 1;
99 pop.persist(&(records[i].valid), 1); // flush
100 }
101 pop.persist(&(header->counter), 4); // flush
102
103 pop.close();
104 return 0;
105 }
Listing 12-39. Reading the data structure written by Listing 12-38 to persistent
memory
33 #include <stdio.h>
34 #include <stdint.h>
35 #include <libpmemobj++/persistent_ptr.hpp>
36
37 using namespace std;
38 namespace pobj = pmem::obj;
39
40 struct header_t {
41 uint32_t counter;
42 uint8_t reserved[60];
43 };
44 struct record_t {
45 char name[63];
46 char valid;
47 };
48 struct root {
49 pobj::persistent_ptr<header_t> header;
50 pobj::persistent_ptr<record_t[]> records;
51 };
52
53 pobj::pool<root> pop;
54
55 int main(int argc, char *argv[]) {
56
57 pop = pobj::pool<root>::open("/mnt/pmem/file",
58 "RECORDS");
59 auto proot = pop.root();
60 pobj::persistent_ptr<header_t> header
61 = proot->header;
62 pobj::persistent_ptr<record_t[]> records
63 = proot->records;
64
65 for (uint8_t i = 0; i < header->counter; i++) {
66 if (records[i].valid == 2) {
67 printf("found valid record\n");
68 printf(" name = %s\n",
69 records[i].name);
70 }
71 }
72
73 pop.close();
74 return 0;
75 }
Listing 12-40. Running Intel Inspector – Persistence Inspector with Listing 12-38
(writer) and Listing 12-39 (reader)
$ pmempool create obj --size=100M --layout=RECORDS /mnt/pmem/file
++ Analysis completes
++ Data is stored in folder "/data/.pmeminspdata/data/listing_12-38"
++ Analysis completes
++ Data is stored in folder "/data/.pmeminspdata/data/listing_12-39"
memory store
of size 1 at address 0x7FD7BEBC068F (offset 0x3C068F in /mnt/pmem/file)
in /data/listing_12-38!main at listing_12-38.cpp:98 - 0x1DAF
in /lib64/libc.so.6!__libc_start_main at <unknown_file>:<unknown_line>
- 0x223D3
in /data/listing_12-38!_start at <unknown_file>:<unknown_line> - 0x1624
The Persistence Inspector report identifies an out-of-order store issue. The tool
says that incrementing the counter in line 91 (main at listing_12-38.cpp:91) is
out of order with respect to writing the valid flag inside a record in line 98 (main at
listing_12-38.cpp:98).
To perform out-of-order analysis with pmemcheck, we must introduce a new tool
called pmreorder. The pmreorder tool is included in PMDK from version 1.5 onward.
This stand-alone Python tool performs a consistency check of persistent programs
using a store reordering mechanism. The pmemcheck tool cannot do this type of analysis,
although it is still used to generate a detailed log of all the stores and flushes issued by an
application that pmreorder can parse. For example, consider Listing 12-41.
Listing 12-41. Running pmemcheck to generate a detailed log of all the stores
and flushes issued by Listing 12-38
$ valgrind --tool=pmemcheck -q --log-stores=yes --log-stores-stacktraces=yes
      --log-stores-stacktraces-depth=2 --print-summary=yes
      --log-file=store_log.log ./listing_12-38
The pmreorder tool works with the concept of “engines.” For example, the ReorderFull
engine checks consistency for all the possible combinations of reorders of stores and
flushes. This engine can be extremely slow for some programs, so you can use other
engines such as ReorderPartial or NoReorderDoCheck. For more information, refer to the
pmreorder page, which has links to the man pages (https://pmem.io/pmdk/pmreorder/).
Before we run pmreorder, we need a program that can walk the list of records
contained within the memory pool and return 0 when the data structure is consistent, or
1 otherwise. This checker program, shown in Listing 12-42, is similar to the reader.
53 pobj::pool<root> pop;
54
55 int main(int argc, char *argv[]) {
56
57 pop = pobj::pool<root>::open("/mnt/pmem/file",
58 "RECORDS");
59 auto proot = pop.root();
60 pobj::persistent_ptr<header_t> header
61 = proot->header;
62 pobj::persistent_ptr<record_t[]> records
63 = proot->records;
64
65 for (uint8_t i = 0; i < header->counter; i++) {
66 if (records[i].valid < 1 or
67 records[i].valid > 2)
68 return 1; // data struc. corrupted
69 }
70
71 pop.close();
72 return 0; // everything ok
73 }
The program in Listing 12-42 iterates over all the records that should have been
written correctly to persistent memory (lines 65-69). It checks the valid flag for
each record, which must be either 1 or 2 for the record to be correct (line 66). If an
issue is detected, the checker returns 1, indicating data corruption.
Listing 12-43 shows a three-step process for analyzing the program:
1. Create the persistent memory pool for the writer program.
2. Use the pmemcheck Valgrind tool to record data and call stacks
while the program is running.
3. Run pmreorder with the generated store log and the consistency
checker, using the ReorderFull engine.
Listing 12-43. First, a pool is created for Listing 12-38. Then, pmemcheck is run
to get a detailed log of all the stores and flushes issued by Listing 12-38. Finally,
pmreorder is run with engine ReorderFull
$ pmempool create obj --size=100M --layout=RECORDS /mnt/pmem/file
Opening the generated file output_file.log, you should see entries similar to those
in Listing 12-44 that highlight detected inconsistencies and problems within the code.
The report states that the problem resides at line 91 of the listing_12-38.cpp writer
program. To fix listing_12-38.cpp, move the counter increment so that it occurs after
all the data in the record has been flushed all the way to persistent media. Listing 12-45
shows the corrected part of the code.
Listing 12-45. Fix Listing 12-38 by moving the counter increment to the end of
the loop (line 95)
86 for (uint8_t i = 0; i < 10; i++) {
87 if (rand() % 2 == 0) {
88 snprintf(records[i].name, 63,
89 "record #%u", i + 1);
90 pop.persist(records[i].name, 63);
91 records[i].valid = 2;
92 } else
93 records[i].valid = 1;
94 pop.persist(&(records[i].valid), 1);
95 header->counter++;
96 }
Summary
This chapter introduced three valuable tools that persistent memory programmers
will want to integrate into their development and testing cycles: Persistence Inspector,
pmemcheck, and pmreorder. Catching issues early in the development cycle can save
countless hours of debugging complex code later on. We described how to use each tool
and demonstrated how useful they are at detecting many different types of common
programming errors.
The Persistent Memory Development Kit (PMDK) uses the tools described here to
ensure each release is fully validated before it is shipped. The tools are tightly integrated
into the PMDK continuous integration (CI) development cycle, so you can quickly catch
and fix issues.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 13
Enabling Persistence
Using a Real-World
Application
This chapter turns the theory from Chapter 4 (and other chapters) into practice.
We show how an application can take advantage of persistent memory by building
a persistent memory-aware database storage engine. We use MariaDB (https://
mariadb.org/), a popular open source database, as it provides a pluggable storage
engine model. The completed storage engine is not intended for production use and
does not implement all the features a production quality storage engine should. We
implement only the basic functionality to demonstrate how to begin persistent memory
programming using a well-known database. The intent is to provide you with a more
hands-on approach for persistent memory programming so you may enable persistent
memory features and functionality within your own application. Our storage engine is
left as an optional exercise for you to complete. Doing so would create a new persistent
memory storage engine for MariaDB, MySQL, Percona Server, and other derivatives. You
may also choose to modify an existing MySQL database storage engine to add persistent
memory features, or perhaps choose a different database entirely.
We assume that you are familiar with the preceding chapters that covered the
fundamentals of the persistent memory programming model and Persistent Memory
Development Kit (PMDK). In this chapter, we implement our storage engine using C++
and libpmemobj-cpp from Chapter 8. If you are not a C++ developer, you will still find this
information helpful because the fundamentals apply to other languages and applications.
The complete source code for the persistent memory-aware database storage engine
can be found on GitHub at https://github.com/pmem/pmdk-examples/tree/master/
pmem-mariadb.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_13
Persistent memory can be used in a variety of ways to deliver lower latency for many
applications:
• In-memory databases: In-memory databases can leverage
persistent memory’s larger capacities and significantly reduce restart
times. Once the database memory maps the index, tables, and
other files, the data is immediately accessible. This avoids lengthy
startup times where the data is traditionally read from disk and paged
into memory before it can be accessed or processed.
• Fraud detection: Financial institutions and insurance companies
can perform real-time data analytics on millions of records to detect
fraudulent transactions.
• Cyber threat analysis: Companies can quickly detect and defend
against increasing cyber threats.
Figure 13-1. MariaDB storage engine architecture diagram for persistent memory
Figure 13-1 shows how the storage engine communicates with libpmemobj to
manage the data stored in persistent memory. The library is used to turn a persistent
memory pool into a flexible object store.
When a handler instance is created, the MariaDB server sends commands to the
handler to perform data storage and retrieval tasks such as opening a table, manipulating
rows, and managing indexes and transactions. When a handler is instantiated, the first
required operation is opening a table. Because the storage engine is a single-user,
single-threaded implementation, only one handler instance is created.
Various handler methods are also implemented; they apply to the storage engine as
a whole, as opposed to methods like create() and open() that work on a per-table basis.
Some examples of such methods include transaction methods to handle commits and
rollbacks, shown in Listing 13-2.
216 pmdk_hton->flags= HTON_CAN_RECREATE;
217 pmdk_hton->tablefile_extensions= ha_pmdk_exts;
218
219 pmdk_hton->commit= pmdk_commit;
220 pmdk_hton->rollback= pmdk_rollback;
...
223 }
The abstract methods defined in the handler class are implemented to work with
persistent memory. An internal representation of the objects in persistent memory is
created using a singly linked list (SLL). This internal representation makes it easy to
iterate through the records, improving performance.
To perform a variety of operations and gain faster and easier access to data, we used
the simple row structure shown in Listing 13-3 to hold the pointer to persistent memory
and the associated field value in the buffer.
1251 char path[MAX_PATH_LEN];
1252 DBUG_ENTER("ha_pmdk::create");
1253 DBUG_PRINT("info", ("create"));
1254
1255 snprintf(path, MAX_PATH_LEN, "%s%s", name, PMEMOBJ_EXT);
1256   PMEMobjpool *pop = pmemobj_create(path, name, PMEMOBJ_MIN_POOL, S_IRWXU);
1257   if (pop == NULL) {
1258     DBUG_PRINT("info", ("failed : %s error number : %d", path, errCodeMap[errno]));
1259 DBUG_RETURN(errCodeMap[errno]);
1260 }
1261 DBUG_PRINT("info", ("Success"));
1262 pmemobj_close(pop);
1263
1264 DBUG_RETURN(0);
1265 }
Once the storage engine is up and running, we can begin to insert data into it. But we
first must implement the INSERT, UPDATE, DELETE, and SELECT operations.
376 int ha_pmdk::close(void)
377 {
378 DBUG_ENTER("ha_pmdk::close");
379 DBUG_PRINT("info", ("close"));
380
381 pmemobj_close(objtab);
382 objtab = NULL;
383
384 DBUG_RETURN(0);
385 }
INSERT Operation
The INSERT operation is implemented in the write_row() method, shown in Listing 13-7.
During an INSERT, the row objects are maintained in a singly linked list. If the table
is indexed, the index table container in volatile memory is updated with the new entry.
In every INSERT operation, the field values are checked for a preexisting duplicate.
The primary key field in the table is checked using the isPrimaryKey() function (line
423). If the key is a duplicate, the error HA_ERR_FOUND_DUPP_KEY is returned. The
isPrimaryKey() function is implemented in Listing 13-8.
462 bool ha_pmdk::isPrimaryKey(void)
463 {
464 bool ret = false;
465 database *db = database::getInstance();
466 table_ *tab;
467 key *k;
468 for (unsigned int i= 0; i < table->s->keys; i++) {
469 KEY* key_info = &table->key_info[i];
470 if (memcmp("PRIMARY",key_info->name.str,sizeof("PRIMARY"))==0) {
471 Field *field = key_info->key_part->field;
472       std::string convertedKey = IdentifyTypeAndConvertToString(field->ptr, field->type(), field->key_length(), 1);
473 if (db->getTable(table->s->table_name.str, &tab)) {
474 if (tab->getKeys(field->field_name.str, &k)) {
475 if (k->verifyKey(convertedKey)) {
476 ret = true;
477 break;
478 }
479 }
480 }
481 }
482 }
483 return ret;
484 }
UPDATE Operation
The server executes UPDATE statements by performing a rnd_init() or index_init()
table scan until it locates a row matching the key value in the WHERE clause of the UPDATE
statement before calling the update_row() method. If the table is an indexed table, the
index container is also updated after this operation is successful. In the update_row()
method defined in Listing 13-9, the old_data field will have the previous row record in it,
while new_data will have the new data.
The index table is also updated using the updateRow() method shown in Listing 13-10.
DELETE Operation
The DELETE operation is implemented using the delete_row() method. Three different
scenarios must be considered, and for each scenario, a different function is called.
When the operation is successful, the entry is removed from both the index (if the table
is an indexed table) and persistent memory. Listing 13-11 shows the logic to implement
the three scenarios.
620 if (searchNode(prevNode->second)) {
621 if (prevNode->second) {
622 deleteRowFromAllIndexedColumns(prevNode->second);
623 deleteNodeFromSLL();
624 }
625 }
626 }
627 }
628 }
629 stats.records--;
630
631 DBUG_RETURN(0);
632 }
The deleteNodeFromSLL() method deletes the object from the linked list residing on
persistent memory using libpmemobj transactions, as shown in Listing 13-13.
Listing 13-13. ha_pmdk.cc – Deletes an entry from the linked list using
transactions
651 int ha_pmdk::deleteNodeFromSLL()
652 {
653 if (!prev) {
654 if (!current->next) { // When sll contains single node
655 TX_BEGIN(objtab) {
656 delete_persistent<row>(current);
657 proot->rows = nullptr;
658 } TX_END
659 } else { // When deleting the first node of sll
660 TX_BEGIN(objtab) {
661 delete_persistent<row>(current);
662 proot->rows = current->next;
663 current = nullptr;
664 } TX_END
665 }
666 } else {
667 if (!current->next) { // When deleting the last node of sll
668 prev->next = nullptr;
669 } else { // When deleting other nodes of sll
670 prev->next = current->next;
671 }
672 TX_BEGIN(objtab) {
673 delete_persistent<row>(current);
674 current = nullptr;
675 } TX_END
676 }
677 return 0;
678 }
SELECT Operation
SELECT is an important operation that is required by several methods. Many methods
that are implemented for the SELECT operation are also called from other methods. The
rnd_init() method is used to prepare for a table scan for non-indexed tables, resetting
counters and pointers to the start of the table. If the table is an indexed table, the
MariaDB server calls the index_init() method. As shown in Listing 13-14, the pointers
are initialized.
Listing 13-14. ha_pmdk.cc – rnd_init() is called when the system wants the
storage engine to do a table scan
869 int ha_pmdk::rnd_init(bool scan)
870 {
...
874 current=prev=NULL;
875 iter = proot->rows;
876 DBUG_RETURN(0);
877 }
When the table is initialized, the MariaDB server calls the rnd_next(), index_first(),
or index_read_map() method, depending on whether the table is indexed. These
methods populate the buffer with data from the current object and update the iterator to
point to the next value. The methods are called once for every row to be scanned.
Listing 13-15 shows how the buffer passed to the function is populated with the
contents of the table row in the internal MariaDB format. If there are no more objects to
read, the return value must be HA_ERR_END_OF_FILE.
Listing 13-15. ha_pmdk.cc – rnd_next() populates the buffer with the next row
of the table scan
902 int ha_pmdk::rnd_next(uchar *buf)
903 {
...
910 memcpy(buf, iter->buf, table->s->reclength);
911 if (current != NULL) {
912 prev = current;
913 }
914 current = iter;
915 iter = iter->next;
916
917 DBUG_RETURN(0);
918 }
This concludes the basic functionality our persistent memory-enabled storage
engine set out to achieve. We encourage you to continue the development of this storage
engine to introduce more features and functionality.
Summary
This chapter provided a walk-through using libpmemobj from the PMDK to create
a persistent memory-aware storage engine for the popular open source MariaDB
database. Using persistent memory in an application can provide continuity in the
event of an unplanned system shutdown, along with improved performance gained by
storing your data close to the CPU, where you can access it at the speed of the memory
bus. While database engines commonly use in-memory caches for performance, which
take time to warm up, persistent memory offers an immediately warm cache upon
application startup.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 14
Concurrency and
Persistent Memory
This chapter discusses what you need to know when building multithreaded
applications for persistent memory. We assume you already have experience with
multithreaded programming and are familiar with basic concepts such as mutexes,
critical sections, deadlocks, atomic operations, and so on.
The first section of this chapter highlights common practical solutions for building
multithreaded applications for persistent memory. We describe the limitations of
the Persistent Memory Development Kit (PMDK) transactional libraries, such as
libpmemobj and libpmemobj-cpp, for concurrent execution. We demonstrate simple
examples that are correct for volatile memory but cause data inconsistency issues on
persistent memory in situations where the transaction aborts or the process crashes.
We also discuss why regular mutexes cannot be placed as is on persistent memory and
introduce the term persistent deadlock. Finally, we describe the challenges of building
lock-free algorithms for persistent memory and continue our discussion of visibility vs.
persistency from previous chapters.
The second section demonstrates our approach to designing concurrent data
structures for persistent memory. At the time of publication, we have two concurrent
associative C++ data structures developed for persistent memory: a concurrent
hash map and a concurrent map. More will be added over time. We discuss both
implementations within this chapter.
All code samples are implemented in C++ using the libpmemobj-cpp library
described in Chapter 8. In this chapter, we usually refer to libpmemobj because it
implements the features and libpmemobj-cpp is only a C++ extension wrapper for it.
The concepts are general and can apply to any programming language.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_14
Transactions and Multithreading
In computer science, ACID (atomicity, consistency, isolation, and durability) is a set of
properties of transactions intended to guarantee data validity and consistency in case
of errors, power failures, and abnormal termination of a process. Chapter 7 introduced
PMDK transactions and their ACID properties. This chapter focuses on the relevancy
of multithreaded programs for persistent memory. Looking forward, Chapter 16 will
provide some insights into the internals of libpmemobj transactions.
The small program in Listing 14-1 shows that the counter stored within the root
object is incremented concurrently by multiple threads. The program opens the
persistent memory pool and prints the value of counter. It then runs ten threads, each
of which calls the increment() function. Once all the threads complete successfully, the
program prints the final value of counter.
You might expect the program in Listing 14-1 to print a final counter value
of 10. However, PMDK transactions do not automatically provide isolation from the
ACID property set. The result of the increment operation on line 53 is visible to
other concurrent transactions before the current transaction has implicitly committed
its update on line 54. That is, a simple data race is occurring in this example. A race
condition occurs when two or more threads can access shared data and they try to
change it at the same time. Because the operating system’s thread scheduling algorithm
can swap between threads at any time, there is no way for the application to know the
order in which the threads will attempt to access the shared data. Therefore, the result
of the change of the data is dependent on the thread scheduling algorithm, that is, both
threads are “racing” to access/change the data.
If we run this example multiple times, the results will vary from run to run. We can
try to fix the race condition by acquiring a mutex lock before the counter increment as
shown in Listing 14-2.
46 struct root {
47 pobj::mutex mtx;
48 pobj::p<int> counter;
49 };
50
51 using pop_type = pobj::pool<root>;
52
53 void increment(pop_type &pop) {
54 auto proot = pop.root();
55 pobj::transaction::run(pop, [&] {
56 std::unique_lock<pobj::mutex> lock(proot->mtx);
57 proot->counter.get_rw() += 1;
58 });
59 }
• Line 56: We acquired the mutex lock within the transaction before
incrementing the value of counter to avoid a race condition. Each
thread increments the counter inside the critical section protected by
the mutex.
Now if we run this example multiple times, it will always increment the value of
the counter stored in persistent memory by 1. But we are not done yet. Unfortunately,
the example in Listing 14-2 is also wrong and can cause data inconsistency issues
on persistent memory. The example works well if there are no transaction aborts.
However, if the transaction aborts after the lock is released but before the transaction
has completed and successfully committed its update to persistent memory, other
threads can read a cached value of the counter that can cause data inconsistency issues.
To understand the problem, you need to know how libpmemobj transactions work
internally. For now, we discuss only the necessary details required to understand this
issue and leave the in-depth discussion of transactions and their implementation for
Chapter 16.
A libpmemobj transaction guarantees atomicity by tracking changes in the undo log.
In the case of a failure or transaction abort, the old values for uncommitted changes are
restored from the undo log. It is important to know that the undo log is a thread-specific
entity. This means that each thread has its own undo log that is not synchronized with
undo logs of other threads.
Figure 14-1 illustrates the internals of what happens within the transaction when
we call the increment() function in Listing 14-2. For illustrative purposes, we only
describe two threads. Each thread executes concurrent transactions to increment the
value of counter allocated in persistent memory. We assume the initial value of counter
is 0 and the first thread acquires the lock, while the second thread waits on the lock.
Inside the critical section, the first thread adds the initial value of counter to the undo
log and increments it. The mutex is released when execution flow leaves the lambda
scope, but the transaction has not committed the update to persistent memory. The
changes become immediately visible to the second thread. After a user-provided lambda
is executed, the transaction needs to flush all changes to persistent memory to mark
the change(s) as committed. Concurrently, the second thread adds the current value of
counter, which is now 1, to its undo log and performs the increment operation. At that
moment, there are two uncommitted transactions. The undo log of Thread 1 contains
counter = 0, and the undo log of Thread 2 contains counter = 1. If Thread 2 commits
its transaction while Thread 1 aborts its transaction for some reason (crash or abort), the
incorrect value of counter will be restored from the undo log of Thread 1.
The solution is to hold the mutex until the transaction is fully committed and the data
has been successfully flushed to persistent memory. Otherwise, changes made by one
transaction become visible to concurrent transactions before they are persisted and committed.
Listing 14-3 demonstrates how to implement the increment() function correctly.
The libpmemobj API allows us to specify locks that should be acquired and held for
the entire duration of the transaction. In the Listing 14-3 example, we pass the
proot->mtx mutex object to the run() method as a third parameter.
PMEMmutex: A persistent memory resident mutex similar to pthread_mutex_t.
PMEMrwlock: A persistent memory resident read-write lock similar to pthread_rwlock_t.
PMEMcond: A persistent memory resident condition variable similar to pthread_cond_t.
• Line 41: We are only storing the mtx object inside the root object on
persistent memory.
• Lines 47-48: We open the persistent memory pool with the layout
name of “MUTEX”.
• Line 50: We obtain a pointer to the root data structure within the
pool.
• Line 52: We acquire the mutex.
• Lines 54-56: Close the pool and exit the program.
As you can see, we do not explicitly unlock the mutex within the main() function.
If we run this example several times, the main() function can always lock the mutex on
line 52. This works because the pmem::obj::v<T> class template implicitly calls the
default constructor of the wrapped object type, in this case std::mutex. The constructor
is called every time we open the persistent memory pool, so we never run into a
situation where the lock is already acquired.
If we change the mtx object type on line 41 from pobj::experimental::v<std::mutex>
to std::mutex and try to run the program again, the example will hang during the
second run on line 52 because the mtx object was locked during the first run and we never
released it.
[1] Intel Threading Building Blocks library (https://github.com/intel/tbb).
[2] Michael Voss, Rafael Asenjo, James Reinders. C++ Parallel Programming with Threading Building Blocks; Apress, 2019; ISBN-13 (electronic): 978-1-4842-4398-5; https://www.apress.com/gp/book/9781484243978.
Find Operation
Because the find operation is non-modifying, it does not have to deal with data
consistency issues. The lookup operation for the target element always begins from the
topmost layer. The algorithm proceeds horizontally until the next element is greater
than or equal to the target. Then it drops down vertically to the next lower list if it cannot
proceed on the current level. Figure 14-3 illustrates how the find operation works for the
element with key=9. The search starts from the highest level and immediately goes from
dummy head node to the node with key=4, skipping nodes with keys 1, 2, 3. On the node
with key=4, the search is dropped two layers down and goes to the node with key=8.
Then it drops one more layer down and proceeds to the desired node with key=9.
[3] M. Herlihy, Y. Lev, V. Luchangco, N. Shavit. A provably correct scalable concurrent skip list. In OPODIS '06: Proceedings of the 10th International Conference On Principles Of Distributed Systems, 2006; https://www.cs.tau.ac.il/~shanir/nir-pubs-web/Papers/OPODIS2006-BA.pdf.
The find operation is wait-free. That is, every find operation is bound only by
the number of steps the algorithm takes, and a thread is guaranteed to complete
the operation regardless of the activity of other threads. The implementation of
pmem::obj::concurrent_map uses atomic load-with-acquire memory semantics when
reading pointers to the next node.
Insert Operation
The insert operation, shown in Figure 14-4, employs a fine-grained locking schema for
thread-safety and consists of the following basic steps to insert a new node with key=7
into the list:
1. Allocate a new node with a randomly generated number of layers.
2. Find the predecessor and successor nodes for the new node on
each layer.
3. Acquire locks for each predecessor node and check that the
successor nodes have not been changed. If successor nodes have
changed, the algorithm returns to step 2.
4. Insert the new node into all layers starting from the bottom one.
Since the find operation is lock-free, we must update pointers on
each level atomically using store-with-release memory semantics.
Figure 14-4. Inserting a new node with key=7 into the concurrent skip list
Erase Operation
The implementation of the erase operation for pmem::obj::concurrent_map is not
thread-safe. This method cannot be called concurrently with other methods of the
concurrent ordered map because this is a memory reclamation problem that is hard to
solve in C++ without a garbage collector. There is a way to logically extract a node from
a skip list in a thread-safe manner, but it is not trivial to detect when it is safe to delete
the removed node because other threads may still have access to the node. There are
possible solutions, such as hazard pointers, but these can impact the performance of the
find and insert operations.
[4] Anton Malakhov. Per-bucket concurrent rehashing algorithms, 2015, arXiv:1509.02235v1; https://arxiv.org/ftp/arxiv/papers/1509/1509.02235.pdf.
Find Operation
The find operation is a read-only operation that does not change the hash map state,
so data consistency is maintained while performing a find request. The find
operation works by first calculating the hash value for the target key and acquiring a read
lock on the corresponding bucket. The read lock guarantees that there are no concurrent
modifications to the bucket while we are reading it. Inside the bucket, the find operation
performs a linear search through the list of nodes.
Insert Operation
The insert method of the concurrent hash map uses the same technique to support
data consistency as the concurrent skip list data structure. The operation consists of the
following steps:
1. Allocate the new node, and assign a pointer to the new node to
persistent thread-local storage.
2. Calculate the hash value of the new node, and find the
corresponding bucket.
4. Insert the new node into the bucket by linking it to the bucket's list
of nodes. Because only one pointer has to be updated, a transaction
is not needed.
Erase Operation
Although the erase operation is similar to an insert (it performs the opposite
action), its implementation is even simpler. The erase implementation
acquires the write lock for the required bucket and, using a transaction, removes the
corresponding node from the list of nodes within that bucket.
Summary
Although building an application for persistent memory is a challenging task, it is more
difficult when you need to create a multithreaded application for persistent memory.
You need to handle data consistency in a multithreaded environment when multiple
threads can update the same data in persistent memory.
If you develop concurrent applications, we encourage you to use existing libraries
that provide concurrent data structures designed to store data in persistent memory.
You should develop custom algorithms only if the generic ones do not fit your needs.
See the implementations of concurrent cmap and csmap engines in pmemkv, described
in Chapter 9, which are implemented using pmem::obj::concurrent_hash_map and
pmem::obj::concurrent_map, respectively.
If you need to develop a custom multithreaded algorithm, be aware of the limitations
PMDK transactions have for concurrent execution. This chapter showed that transactions
do not automatically provide isolation out of the box: changes made inside one
transaction become visible to other concurrent transactions before they are committed,
so you will need to implement additional synchronization if an algorithm requires it.
We also explained that atomic operations cannot be used inside a transaction and that
building lock-free algorithms without transactions is a very complicated task if your
platform does not support eADR.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.org/licenses/
by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 15
Profiling and Performance
Introduction
This chapter first discusses the general concepts for analyzing memory and storage
performance and how to identify opportunities for using persistent memory for both
high-performance persistent storage and high-capacity volatile memory. We then
describe the tools and techniques that can help you optimize your code to achieve the
best performance.
Performance analysis requires tools to collect specific data and metrics about
application, system, and hardware performance. In this chapter, we describe how to
collect this data using Intel VTune Profiler. Many other data collection options are
available; the techniques we describe are relevant regardless of how the data is collected.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_15
also affect multiprocessor performance, which means that certain memory access
patterns place a ceiling on parallelism. Many well-defined memory access patterns exist,
including but not limited to sequential, strided, linear, and random.
It is much easier to measure, control, and optimize memory accesses on systems that
run only one application. In cloud and virtualized environments, the guests can run
any type of application and workload, including web servers, databases, and application
servers. This makes it much harder to ensure memory accesses are fully optimized for
the hardware because the access patterns are essentially random.
that requires extensive reading and writing from disk. It is likely that the disk accesses
are the bottleneck for this application and adding a faster storage solution, like persistent
memory, could improve performance.
These are trivial examples, and applications will have widely different behaviors
along this spectrum. Understanding what behaviors to look for and how to measure
them is an important step to using persistent memory. This section presents the
important characteristics to identify and determine if an application is a good fit for
persistent memory. We look at applications that require in-memory persistence,
applications that can use persistent memory in a volatile manner, and applications that
can use both.
The Memory Utilization graph in the Platform Profiler analysis shown in Figure 15-2
measures the memory footprint using operating system statistics and produces a
timeline graph as a percentage of the total available memory.
The results in Figure 15-2 were taken from a different application than Figure 15-1.
This graph shows very high memory consumption, which implies this workload would
be a good candidate for adding more memory to the system. If your persistent memory
hardware has variable modes, like the Memory and App Direct modes on Intel Optane
DC persistent memory, you will need some more information to determine which mode
to use first. The next important piece of information is the hot working set size.
Figure 15-3 shows the results of a Memory Access analysis of an application. It shows
each object's memory size in parentheses and the number of loads and stores that
accessed it. The report does not indicate what was concurrently allocated.
Figure 15-3. Objects accessed by the application during a Memory Access analysis
data collection
The report identifies the objects with the most accesses (loads and stores). The sum
of the sizes of these objects (the values in parentheses) is the working set size. You
decide where to draw the line for what is and is not part of the hot working set.
Depending on the workload, there may be no easy way to determine the hot working
set size other than developer knowledge of the application. Having a rough estimate is
important for deciding whether to start with Memory Mode or App Direct mode.
Figure 15-4. Disk throughput and IOPS graphs from VTune Profiler’s Platform
Profiler
Figure 15-4 shows throughput and IOPS numbers of an NVMe drive collected
using Platform Profiler. This example application makes extensive use of non-volatile
storage, as indicated by the throughput and IOPS graphs. Applications like this may benefit from
faster storage like persistent memory. Another important metric to identify storage
bottlenecks is I/O Wait time. The Platform Profiler analysis can also provide this metric
and display how it is affecting CPU Utilization over time, as seen in Figure 15-5.
Figure 15-5. I/O Wait time from VTune Profiler’s Platform Profiler
Characterizing the Workload
The performance of a workload on a persistent memory system depends on a
combination of the workload characteristics and the underlying hardware. The key
metrics to understand the workload characteristics are:
The Intel Memory Latency Checker (Intel MLC) is a free tool for Linux and Windows
available from https://software.intel.com/en-us/articles/intelr-memory-
latency-checker. Intel MLC can be used to measure the bandwidth and latency of DRAM
and persistent memory using a variety of tests:
VTune Profiler has a built-in kernel to measure peak bandwidth on a system. Once
you know the peak bandwidth of the platform, you can then measure the persistent
memory bandwidth of your workload. This will reveal whether persistent memory
bandwidth is a bottleneck. Figure 15-6 shows an example of persistent memory read and
write bandwidth of an application.
Figure 15-7. Read traffic ratio from VTune Profiler’s Platform Profiler analysis
The Platform Profiler feature in VTune Profiler can collect metrics specific to
persistent memory.
Tuning the Hardware
The memory configuration of a system is a significant factor in determining the system’s
performance. The workload performance depends on a combination of workload
characteristics and the memory configuration. There is no single configuration that
provides the best value for all workloads. These factors make it important to tune the
hardware with respect to workload characteristics and get the maximum value out of the
system.
Bandwidth Requirements
The maximum available persistent memory bandwidth depends on the number of
channels populated with a persistent memory module. A fully populated system works
well for a workload with a high bandwidth requirement. Partially populated systems
can be used for workloads that are not as memory latency sensitive. Refer to the server
documentation for population guidelines.
BIOS Options
With the introduction of persistent memory into server platforms, many features and
options have been added to the BIOS that provide additional tuning capabilities. The
options and features available within the BIOS vary for each server vendor and persistent
memory product. Refer to the server BIOS documentation for all the options available;
most share common options, including:
Depending on your familiarity with the code and how it works with production
workloads, knowing which data structures and objects to store in the different memory/
storage tiers may be simple. Should those data structures and objects be volatile or
persisted? To help with searching for potential candidates, tools such as VTune Profiler
can identify objects with the most last-level cache (LLC) misses. The intent is to identify
what data structures and objects the application uses most frequently and ensure they
are placed in the fastest media appropriate to their access patterns. For example, an
object that is written once but read many times is best placed in DRAM. An object that
is updated frequently and needs to be persisted should probably be moved to persistent
memory rather than traditional storage devices.
You must also be mindful of memory-capacity constraints. Tools such as VTune
Profiler can help determine approximately how many hot objects will fit into the
available DRAM. For the remaining objects that have fewer LLC misses or that are too
large to allocate from DRAM, you can put them in persistent memory. These steps will
ensure that your most accessed objects have the fastest path to the CPU (allocated in
DRAM), while the infrequently accessed objects will take advantage of the additional
persistent memory (as opposed to sitting on a much slower storage device).
Another consideration for optimizations is the load/store ratio for object accesses. If
your persistent memory hardware characteristics are such that load/read operations are
much faster than stores/writes, this should be taken into account. Objects with a high
load/store ratio should benefit from living in persistent memory.
There is no hard rule for what constitutes a frequently vs. infrequently accessed object.
Although behaviors are application dependent, these guidelines give a starting point for
choosing how to allocate objects in persistent memory. After completing this process,
start profiling and tuning the application to further improve the performance with
persistent memory.
NUMA Optimizations
NUMA-related performance issues were described in the “Characterizing the Workload”
section; we discuss NUMA in more detail in Chapter 19. If you identify performance
issues related to NUMA memory accesses, two things should be considered: data
allocation vs. first access, and thread migration.
references an address, the system translates the virtual address to a physical address.
The physical address points to memory physically connected to a CPU. Chapter 19
describes exactly how this operation works and shows why high-capacity memory
systems can benefit from using large or huge pages provided by the operating system.
A common practice in software is to have most of the data allocations done when the
application starts. Operating systems try to allocate memory associated with the CPU on
which the thread executes. The operating system scheduler then tries to always schedule
the thread on the CPU where it last ran, in the hope that the data still remains in one of
the CPU caches. On a multi-socket system, this may result in all the objects being allocated
in the memory of a single socket, which can create NUMA performance issues. Accessing
data attached to a remote CPU incurs a latency penalty.
Some applications delay reserving memory until the data is accessed for the first
time. This can alleviate some NUMA issues. It is important to understand how your
workload allocates data to understand the NUMA performance.
Thread Migration
Thread migration, which is the movement of software threads across sockets by the
operating system scheduler, is the most common cause of NUMA issues. Once objects
are allocated in memory, accessing them from a physical CPU other than the one where
they were originally allocated incurs a latency penalty. Even though you may allocate your
data on a socket where the accessing thread is currently running, unless you have
specific affinity bindings or other safeguards, the thread may move to any other core or
socket in the future. You can track thread migration by identifying which cores threads
are running on and which sockets those cores belong to. Figure 15-10 shows an example
of this analysis from VTune Profiler.
Figure 15-10. VTune Profiler identifying thread migration across cores and
sockets (packages)
Summary
Profiling and performance optimization techniques for persistent memory systems
are similar to those techniques used on systems without persistent memory. This
chapter outlined some important concepts for understanding performance. It also
provides guidance for characterizing an existing application without persistent memory
and understanding whether it is suitable for persistent memory. Finally, it presents
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.org/licenses/
by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 16
PMDK Internals:
Important Algorithms
and Data Structures
Chapters 5 through 10 describe most of the libraries contained within the Persistent
Memory Development Kit (PMDK) and how to use them.
This chapter introduces the fundamental algorithms and data structures on which
libpmemobj is built. After we first describe the overall architecture of the library, we
discuss the individual components and the interaction between them that makes
libpmemobj a cohesive system.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_16
Everything is built on top of libpmem and its persistence primitives that the library
uses to transfer data to persistent memory and persist it. Those primitives are also
exposed through libpmemobj-specific APIs to applications that wish to perform low-level
operations on persistent memory, such as manual cache flushing. These APIs are exposed
so the high-level library can instrument, intercept, and augment all stores to persistent
memory. This is useful for the instrumentation of runtime analysis tools such as Valgrind
pmemcheck, described in Chapter 12. More importantly, these functions are interception
points for data replication, both local and remote.
Replication is implemented in a way that ensures all data written prior to calling
drain will be safely stored in the replica as configured. A drain operation is a barrier that
waits for hardware buffers to complete their flush operation to ensure all writes have
reached the media. This works by initiating a write to the replica when a memory copy
or a flush is performed and then waits for those writes to finish in the drain call. This
mechanism guarantees the same behavior and ordering semantics for replicated and
non-replicated pools.
On top of persistence primitives provided by libpmem is an abstraction for fail-safe
modification of transactional data called unified logging. The unified log is a single
data structure and API for two different logging types used throughout libpmemobj to
ensure fail-safety: transactions and atomic operations. This is one of the most crucial,
performance-sensitive modules in the library because it is the hot code path of almost
every API. The unified log is a hybrid DRAM and persistent memory data structure
accessed through a runtime context that organizes all memory operations that need
to be performed within a single fail-safe atomic transaction and allows for critical
performance optimizations.
The persistent memory allocator operates in the unified log context of either a
transaction or a single atomic operation. This is the largest and most complex module in
libpmemobj and is used to manage the potentially large amounts of persistent memory
associated with the memory pool.
Each object stored in a persistent memory pool is represented by an object handle
of type PMEMoid (persistent memory object identifier). In practice, such a handle is
a unique object identifier (OID) of global scope, which means that two objects from
different pools will never have the same OID. An OID cannot be used as a direct pointer
to an object. Each time the program attempts to read or write object data, it must obtain
the current memory address of the object by converting its OID into a pointer. In contrast
to the memory address, the OID value for a given object does not change during the
life of an object, except for a realloc(), and remains valid after closing and reopening
the pool. For this reason, if an object contains a reference to another persistent object,
for example, to build a linked data structure, the reference must be an OID and not a
memory address.
The atomic and transactional APIs are built using a combination of the persistent
memory allocator and unified logs. The simplest public interface is the atomic API which
runs a single allocator operation in a unified log context. That log context is not exposed
externally and is created, initialized, and destroyed within a single function call.
The most general-purpose interface is the transactional API, which is based on a
combination of undo logging for snapshots and redo logging for memory allocation and
deallocation. This API has ACID (atomicity, consistency, isolation, durability) properties,
and it is a relatively thin layer that combines the utility of unified logs and the persistent
memory allocator.
For specific transactional use cases that need low-level access to the persistent
memory allocator, there is an “action” API. The action API is essentially a pass-through
to the raw memory allocator interface, alongside helpers for usability. This API can
be leveraged to create low-overhead algorithms that issue fewer memory fences, as
compared to general-purpose transactions, at the cost of ease of use.
All public interfaces produce and operate on PMEMoids as a replacement for
pointers. This comes with space overhead because PMEMoids are 16 bytes. There is
also a performance overhead for the translation to a normal pointer. The upside is that
objects can be safely referenced between different instances of the application and even
different persistent memory pools.
The pool management API opens, maps, and manages persistent memory resident
files or devices. This is where the replication is configured, metadata and the heap are
initialized, and all the runtime data is created. This is also where the crucial recovery of
interrupted transactions happens. Once recovery is complete, all prior transactions are
either committed or aborted, the persistent state is consistent, and the logs are clean and
ready to be used again.
are relative to the beginning of the application’s virtual address space, but that comes
with many caveats. Using such pointers would be predicated on the pool of persistent
memory always being located at the same place in the virtual address space of an
application that maps it. This is difficult, if not impossible, to accomplish in a portable
way on modern operating systems due to address space layout randomization (ASLR).
Therefore, a general-purpose library for persistent memory programming must provide
a specialized persistent pointer. Figure 16-2 shows a pointer from Object A to Object B. If
the base address changes, the pointer no longer points to Object B.
In addition to the previous requirements, you should also consider some potential
performance problems:
• Additional space overhead over a traditional pointer. This is important
because large fat pointers would take up more space in memory and
because fewer of these fat pointers would fit in a single CPU cache line.
This potentially increases the cost of operations in pointer-chasing
heavy data structures, such as those found in B-tree algorithms.
• 16-byte fat offset pointer with pool identifier. This is the most obvious
solution, which is similar to the one earlier but has 8-byte offset
pointers and 8-byte pool identifiers. Fat pointers provide the best
utility, at the cost of space overhead and some runtime performance.
libpmemobj uses the most generic approach of the 16-byte offset pointer. This allows
you to make your own choice since all other pointer types can be directly derived from
it. libpmemobj bindings for more expressive languages than C99, such as C++, can also
provide different types of pointers with different trade-offs.
Figure 16-3 shows the translation method used to convert a libpmemobj persistent
pointer, PMEMoid, into a valid C pointer. In principle, this approach is very simple.
We look up the base address of the pool through the pool identifier and then add the
object offset to it. The method itself is static inline and defined in the public header file
for libpmemobj to avoid a function call on every dereference. The problem is the lookup
method, which, for an application linked with a dynamic library, means a call to a
different compilation unit, and that might be costly for a very commonly performed
operation. To resolve this problem, the translation method has a per-thread cache
of the last base address, which removes the necessity of calling the lookup with each
dereferencing for the common case where persistent pointers from the same pool are
accessed close together.
The pool lookup method itself is implemented using a radix tree that stores
identifier-address pairs. This tree has a lock-free read operation, which is necessary
because each non-cached pointer translation would otherwise have to acquire a lock to
be thread-safe, and that would have a severe negative performance impact and could
potentially serialize access to persistent memory.
needs to store data that is unique to one thread of execution. In the persistent case,
we often need to associate data with a transaction rather than a thread.
In libpmemobj, we need a way to create an association between an in-flight
transaction and its persistent logs. It also requires a way to reconnect to those logs after
an unplanned interruption. The solution is to use a data structure called a “lane,” which
is simply a persistent byte buffer that is also transaction local.
Lanes are limited in quantity, have a fixed size, and are located at the beginning
of the pool. Each time a transaction starts, it chooses one of the lanes to operate from.
Because there is a limited number of lanes, there is also a limited number of transactions
that can run in parallel. For this reason, the size of a lane is relatively small, but the
number of lanes is large enough to exceed the number of application threads that could
feasibly run in parallel on current platforms and those coming in the foreseeable future.
The challenge of the lane mechanism is the selection algorithm, that is, which lane
to choose for a specific transaction. It is a scheduler that assigns resources (lanes) to
perform work (transactions).
The naive algorithm, which was implemented in the earliest versions of libpmemobj,
simply picked the first available lane from the pool. This approach has a few problems.
First, the implementation of what effectively amounts to a single LIFO (last in, first
out) data structure of lanes requires a lot of synchronization at the front of the stack,
regardless of whether it is implemented as a linked list or an array, which reduces
performance. The second problem is false sharing of lane data. False sharing occurs
when two or more threads operate on unrelated data that shares a cache line that is
being modified, causing CPU cache thrashing. And that is exactly what happens if
multiple threads continually fight over the same small set of lanes to start new
transactions. The third problem is spreading
the traffic across interleaved DIMMs. Interleaving is a technique that allows sequential
traffic to take advantage of throughput of all of the DIMMs in the interleave set by
spreading the physical memory across all available DIMMs. This is similar to striping
(RAID0) across multiple disk drives. Depending on the size of the interleaved block, and
the platform configuration, using naive lane allocation might continuously use the same
physical DIMMs, lowering the overall performance.
To alleviate these problems, the lane scheduling algorithm in libpmemobj is more
complex. Instead of using a LIFO data structure, it uses an array of 8-byte spinlocks, one
for each lane. Each thread is initially assigned a primary lane number, which is assigned
in such a way as to minimize false sharing of both lane data and the spinlock array.
The algorithm also tries to spread the lanes evenly across interleaved DIMMs. As long as
there are fewer active threads than lanes, no thread will ever share a lane. When a thread
attempts to start a transaction, it will try to acquire its primary lane spinlock, and if it is
unsuccessful, it will try to acquire the next lane in the array.
The final lane scheduling algorithm decision took a considerable amount of research
into various lane scheduling approaches. Compared to the naive implementation,
the current implementation has vastly improved performance, especially in heavily
multithreaded workloads.
The benefit of this logging approach, in the context of persistent memory, is that all the
log entries can be written and flushed to storage at once. An optimal implementation of
redo logging uses only two synchronization barriers: once to mark the log as complete and
once to discard it. The downside to this approach is that the memory modifications are
not immediately visible, which makes for a more complicated programming model. Redo
logging can sometimes be used alongside load/store instrumentation techniques which
can redirect a memory operation to the logged location. However, this approach can be
difficult to implement efficiently and is not well suited for a general-purpose library.
This type of log can have lower performance characteristics compared with the redo
log approach because it requires a barrier for every snapshot that needs to be made,
and the snapshotting itself must be fail-safe atomic, which presents its own challenges.
An undo log benefit is that the changes are visible immediately, allowing for a natural
programming model.
The important observation here is that redo and undo logging are complementary.
Use redo logging for performance-critical code and where deferred modifications are
not a problem; use undo logging where ease of use is important. This observation led
to the current design of libpmemobj where a single transaction takes advantage of both
algorithms.
• A checksum for both the header and data, used only for redo logs
The last field is used to create a singly linked list of all logs that participate in a single
transaction. This is because it is impossible to predict the total required size of the log
at the beginning of the transaction, so the library cannot allocate a log structure that is
the exact required length ahead of time. Instead, the logs are allocated on demand and
atomically linked into a list.
The unified log supports two ways of fail-safe inserting of entries:
Listing 16-1. The core persistent memory allocator interface that splits heap
operations into two distinct steps
int palloc_reserve(struct palloc_heap *heap, size_t size,...,
struct pobj_action *act);
void palloc_publish(struct palloc_heap *heap,
struct pobj_action *actv, size_t actvcnt,
struct operation_context *ctx);
All memory operations, called “actions” in the API, are broken up into two individual
steps.
The first step reserves the state that is needed to perform the operation. For
allocations, this means retrieving a free memory block, marking it as reserved, and
initializing the object’s content. This reservation is stored in a user-provided runtime
variable. The library guarantees that if an application crashes while holding reservations,
the persistent state is not affected. That is why these action variables must not be
persistent.
The second step is the act of exercising the reservations, which is called
“publication.” Reservations can be published individually, but the true power of this API
lies in its ability to group and publish many different actions together.
The internal allocator API also has a function to create an action that will set a
memory location to a given value when published. This is used to modify the destination
pointer value and is instrumental in making the atomic API of libpmemobj fail-safe.
All internal allocator APIs that need to perform fail-safe atomic actions take
operation context as an argument, which is the runtime instance of a single log. It
contains various state information, such as the total capacity of the log and the current
number of entries. It exposes the functions to create either bulk or singular log entries.
The allocator's functions will log and then process all metadata modifications inside
the persistent log that belongs to the provided instance of the operation context.
This works by using a free list for many different sizes, shown in Figure 16-6,
until some predefined threshold, after which it is sensible to allocate directly from
the operating system. Those free lists are typically called bins or buckets and can be
implemented in various ways, such as a simple linked list or contiguous buffer with
boundary tags. Each incoming memory allocation request is rounded up to match
one of the free lists, so there must be enough of them to minimize the amount of
overprovisioned space for each allocation. This algorithm approximates a best-fit
allocation policy that selects the memory block with the least amount of excess space for
the request from the ones available.
Using this technique allows memory allocators to have average-case O(1) complexity
while retaining the memory efficiency of best fit. Another benefit is that rounding up of
memory blocks and subsequent segregation forces some regularity to allocation patterns
that otherwise might not exhibit any.
Some allocators also sort the available memory blocks by address and, if possible,
allocate the one that is spatially collocated with previously selected blocks. This
improves space efficiency by increasing the likelihood of reusing the same physical
memory page. It also preserves temporal locality of allocated memory objects, which can
minimize cache and translation lookaside buffer (TLB) misses.
One important advancement in memory allocators is scalability in multithreaded
applications. Most modern memory allocators implement some form of thread caching,
where the vast majority of allocation requests are satisfied directly from memory that
is exclusively assigned to a given thread. Only when memory assigned to a thread is
entirely exhausted, or if the request is very large, the allocation will contend with other
threads for operating system resources.
This allows for allocator implementations that have no locks of any kind, not even
atomics, on the fast path. This can have a potentially significant impact on performance,
even in the single-threaded case. This technique also prevents allocator-induced false
sharing between threads, since a thread will always allocate from its own region of
memory. Additionally, the deallocation path often returns the memory block to the
thread cache from which it originated, again preserving locality.
We mentioned earlier that volatile allocators manage operating system–provided
pages but did not explain how they acquire those pages. This will become very important
later as we discuss how things change for persistent memory. Memory is usually
requested on demand from the operating system either through sbrk(), which moves
the break segment of the application, or anonymous mmap(), which creates new virtual
memory mapping backed by the page cache. The actual physical memory is usually not
assigned until the page is written to for the first time. When the allocator decides that it
no longer needs a page, it can either completely remove the mapping using munmap() or
it can tell the operating system to release the backing physical pages but keep the virtual
mapping. This enables the allocator to reuse the same addresses later without having to
memory map them again.
How does all of this translate into persistent memory allocators and libpmemobj
specifically?
The persistent heap must be resumable after application restart. This means that
all state information must be either located on persistent memory or reconstructed on
startup. If there are any active bookkeeping processes, those need to be restarted from
the point at which they were interrupted. There cannot be any volatile state held in
persistent memory, such as thread cache pointers. In fact, the allocator must not operate
on any pointers at all because the virtual address of the heap can change between
restarts.
In libpmemobj, the heap is rebuilt lazily and in stages. The entire available memory
is divided into equally sized zones (except for the last one, which can be smaller than
the others) with metadata at the beginning of each one. Each zone is subsequentially
divided into variably sized memory blocks called chunks. Whenever there is an
allocation request, and the runtime state indicates that there is no memory to satisfy it,
the zone’s metadata is processed, and the corresponding runtime state is initialized.
This minimizes the startup time of the pool and amortizes the cost of rebuilding the
heap state across many individual allocation requests.
There are three main reasons for having any runtime state at all. First, access
latency of persistent memory can be higher than that of DRAM, potentially impacting
performance of data structures placed on it. Second, separating the runtime state from
the persistent state enables a workflow where the memory is first reserved in runtime
state and initialized, and only then the allocation is reflected on the persistent state.
This mechanism was described in the previous section. Finally, maintaining fail-safety of
complex persistent data structures is expensive, and keeping them in DRAM allows the
allocator to sidestep that cost.
The runtime allocation scheme employed by libpmemobj is segregated fit with chunk
reuse and thread caching as described earlier. Free lists in libpmemobj, called buckets,
are placed in DRAM and are implemented as vectors of pointers to persistent memory
blocks. The persistent representation of this data structure is a bitmap, located at the
beginning of a larger buffer from which the smaller blocks are carved out. These buffers
in libpmemobj, called runs, are variably sized and are allocated from the previously
mentioned chunks. Very large allocations are directly allocated as chunks. Figure 16-7
shows the libpmemobj implementation.
One of the most impactful aspects of persistent memory allocation is how the
memory is provisioned from the operating system. We previously explained that for
normal volatile allocators, the memory is usually acquired through anonymous memory
mappings that are backed by the page cache. In contrast, persistent heaps must use file-
based memory mappings, backed directly by persistent memory. The difference might
be subtle, but it has a significant impact on the way the allocator must be designed. The
allocator must manage the entire virtual address space, retain information about any
potential noncontiguous regions of the heap, and avoid excessive overprovisioning of
virtual address space. Volatile allocators can rely on the operating system to coalesce
noncontiguous physical pages into contiguous virtual ones, whereas persistent allocators
cannot do the same without explicit and complicated techniques. Additionally, for some
file system implementations, the allocator cannot assume that the physical memory is
allocated at the time of the first page fault, so it must be conservative with internal block
allocations.
Another problem for allocation from file-based mappings is that of perception.
Normal allocators, due to memory overcommitment, seemingly never run out of memory
because they are allocating the virtual address space, which is effectively infinite. There
are negative performance consequences of address space bloat, and memory allocators
actively try to avoid it, but they are not easily measurable in a typical application. In
contrast, memory heaps allocate from a finite resource, the persistent memory device, or
a file. This exacerbates the common phenomenon that is heap fragmentation by making
it trivially measurable, creating the perception that persistent memory allocators are
less efficient than volatile ones. They can be, but the operating system does a lot of work
behind the scene to hide fragmentation of traditional memory allocators.
part of the transaction but is required to allocate the log extensions if they are needed.
Without the internal redo log, it would be impossible to reserve and then publish a new
log object in a transaction that already had user-made allocator actions in the external
redo log.
All three logs have individual operation-context instances that are stored in runtime
state of the lanes. This state is initialized when the pool is opened, and that is also when
all the logs of the prior instance of the application are either processed or discarded.
There is no special persistent variable that indicates whether past transactions in the log
were successful or not. That information is directly derived from checksums stored in
the log.
When a transaction begins, and it is not a nested transaction, it acquires a lane,
which must not contain any valid uncommitted logs. The runtime state of the transaction
is stored in a thread-local variable, and that is where the lane variable is stored once
acquired.
Transactional allocator operations use the external redo log and its associated
operation context to call the appropriate reservation method which in turn creates an
allocator action to be published at the time of transaction commit. The allocator actions
are stored in a volatile array. If the transaction is aborted, all the actions are canceled,
and the associated state is discarded. The complete redo log for memory allocations is
created only at the time of transaction commit. If the library is interrupted while creating
the redo log, the next time the pool is opened, the checksum will not match, and the
transaction will be aborted by rolling back using the undo log.
Transactional snapshots use the undo log and its context. The first time a snapshot
is created, a new memory modification action is created in the external redo log. When
published, that action increments the generation number of the associated undo log,
invalidating its contents. This guarantees that if the external log is fully written and
processed, it automatically discards the undo log, committing the entire transaction. If
the external log is discarded, the undo log is processed, and the transaction is aborted.
To ensure that there are never two snapshots of the same memory location (this
would be an inefficient use of space), there is a runtime range tree that is queried every
time the application wants to create an undo log entry. If the new range overlaps with an
existing snapshot, adjustments to the input arguments are made to avoid duplication.
The same mechanism is also used to prevent snapshots of newly allocated data.
Whenever new memory in a transaction is allocated, the reserved memory range is
inserted into the ranges tree. Snapshotting new objects is redundant because they will be
discarded automatically in the case of an abort.
To ensure that all memory modifications performed inside the transaction are
durable on persistent memory once committed, the ranges tree is also used to iterate
over all snapshots and call the appropriate flushing function on the modified memory
locations.
algorithm would loop back to check if the generation number matches. If it is successful,
the running thread initializes the variable and once again increments the generation
number – this time to an even number that should match the number stored in the pool
header.
Summary
This chapter described the architecture and inner workings of libpmemobj. We
also discussed the reasons for the choices that were made during the design and
implementation of libpmemobj. With this knowledge, you can accurately reason about
the semantics and performance characteristics of code written using this library.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 17
Reliability, Availability, and Serviceability (RAS)
This chapter describes the high-level architecture of reliability, availability, and
serviceability (RAS) features designed for persistent memory. Persistent memory RAS
features were designed to support the unique error-handling strategy required for an
application when persistent memory is used. Error handling is an important part of the
program’s overall reliability, which directly affects the availability of applications. The
error-handling strategy for applications impacts what percentage of the expected time
the application is available to do its job.
Persistent memory vendors and platform vendors both decide which RAS features to
support and how they are implemented at the lowest hardware levels. Some
common RAS features were designed and documented in the ACPI specification, which
is maintained and owned by the UEFI Forum (https://uefi.org/). In this chapter,
we provide a general overview of these ACPI-defined RAS features and call out
vendor-specific details if warranted.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_17
1. Application starts
8. …
The operating system and applications may need to address uncorrectable errors in
three main ways:
examine the uncorrectable memory error, determine if the software can recover, and
perform recovery actions via an MCE handler. Typically, uncorrectable errors fall into
three categories:
• Uncorrectable errors that may have corrupted the state of the CPU
and require a system reset.
Patrol Scrub
Patrol scrub (also known as memory scrubbing) is a long-standing RAS feature for
volatile memory that can also be extended to persistent memory. It is an excellent
example of how a platform can discover uncorrectable errors in the background during
normal operation.
Patrol scrubbing is done using a hardware engine, on either the platform or on the
memory device, which generates requests to memory addresses on the memory device.
The engine generates memory requests at a predefined frequency. Given enough time,
it will eventually access every memory address. The frequency at which patrol scrub
generates requests is low enough to produce no noticeable impact on the memory
device's quality of service.
By generating read requests to memory addresses, the patrol scrubber allows the
hardware an opportunity to run ECC on a memory address and correct any correctable
errors before they can become uncorrectable errors. Optionally, if an uncorrectable
error is discovered, the patrol scrubber can trigger a hardware interrupt and notify the
software layer of its memory address.
Traditionally, the operating system executes ARS in one of two ways to obtain the
addresses of uncorrectable errors after a boot. Either a full scan is executed on all the
available persistent memory during system boot or after an unconsumed uncorrectable
memory error root-device notification is received. In both instances, the intent is to
discover these addresses before they are consumed by applications.
Operating systems will compare the list of uncorrectable errors returned by ARS to
their persistent list of uncorrectable errors. If new errors are detected, the list is updated.
This list is intended to be consumed by higher-level software, such as the PMDK libraries.
Device Health
System administrators may wish to act and mitigate any device health issues before they
begin to affect the availability of applications using persistent memory. To that end, operating
systems or management applications will want to discover an accurate picture of persistent
memory device health to correctly determine the reliability of the persistent memory.
The ACPI specification defines a few vendor-agnostic health discovery methods, but many
vendors choose to implement additional persistent memory device methods for attributes
that are not covered by the vendor-agnostic methods. Many of these vendor-specific
health discovery methods are implemented as an ACPI device-specific method (_DSM).
Applications should be aware of degradation to the quality of service if they call ACPI
methods directly, since some platform implementations may impact memory traffic when
ACPI methods are invoked. Avoid excessive polling of device health methods when possible.
On Linux, the ndctl utility can be used to query the device health of persistent
memory modules. Listing 17-1 shows an example output of an Intel Optane DC
persistent memory module.
Listing 17-1. Using ndctl to query the health of persistent memory modules
You can obtain similar information using the persistent memory device-specific
utility. For example, you can use the ipmctl utility on Linux and Windows to obtain
hardware-level data similar to that shown by ndctl. Listing 17-2 shows health
information for DIMMID 0x0001 (nmem1 equivalent in ndctl terms).
• The operating system can assume all data will be lost on subsequent
power events
Get NVDIMM Boot Status (_NBS) allows operating systems a vendor-agnostic method
to discover persistent memory health status that does not change during runtime. The
most significant attribute reported by _NBS is Data Loss Count (DLC). Data Loss Count is
expected to be used by applications and operating systems to help identify the rare case
where a persistent memory dirty shutdown has occurred. See “Unsafe/Dirty Shutdown”
later in this chapter for more information on how to properly use this attribute.
Unsafe/Dirty Shutdown
An unsafe or dirty shutdown on persistent memory means that the persistent memory
device power-down sequence or platform power-down sequence may have failed
to write all in-flight data from the system’s persistence domain to persistent media.
(Chapter 2 describes persistence domains.) A dirty shutdown is expected to be a very
rare event, but one can happen for a variety of reasons, such as physical hardware
issues, power spikes, thermal events, and so on.
A persistent memory device does not know if any application data was lost as a result
of the incomplete power-down sequence. It can only detect if a series of events occurred
in which data may have been lost. In the best-case scenario, there might not have been
any applications that were in the process of writing data when the dirty shutdown
occurred.
The RAS mechanism described here requires the platform BIOS and persistent
memory vendor to maintain a persistent rolling counter that is incremented anytime a
dirty shutdown is detected. The ACPI specification refers to such a mechanism as the Data
Loss Count (DLC) that can be returned as part of the Get NVDIMM Boot Status(_NBS)
persistent memory device method.
Referring to the output from ndctl in Listing 17-1, the "shutdown_count" is reported
in the health information. Similarly, the output from ipmctl in Listing 17-2 reports
"LatchedDirtyShutdownCount" as the dirty shutdown counter. For both outputs, a value
of 1 means no issues were detected.
e. Application creates and sets a “clean” flag in its metadata file and
ensures the update of the clean flag to the persistence domain.
This is used by the application to determine if the application was
actively writing data to persistence during dirty shutdown.
2. Every time the application runs and retrieves its metadata from
persistent memory:
d. If the current LDLC does not match the saved LDLC, then one
or more persistent memory modules have detected a dirty shutdown and
possible data loss. If they do match, no further action is required
by the application.
h. Application sets the clean flag in its metadata file and ensures that
the update of the clean flag has been flushed to the persistence
domain.
a. Before the application writes data, it clears the “clean” flag in its
metadata file and ensures that the flag has been flushed to the
persistence domain.
c. After the application completes writing data, it sets the “clean” flag
in its metadata file and ensures the flag has been flushed to the
persistence domain.
PMDK libraries make these steps significantly easier and account for interleaving set
configurations.
Summary
This chapter describes some of the RAS features that are available to persistent memory
devices and that are relevant to persistent memory applications. It should have given you
a deeper understanding of uncorrectable errors and how applications can respond to
them, how operating systems can detect health status changes to improve the availability
of applications, and how applications can best detect dirty shutdowns and use the data
loss counter.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 18
Remote Persistent Memory
This chapter provides an overview of how persistent memory – and the programming
concepts that were introduced in this book – can be used to access persistent memory
located in remote servers connected via a network. A combination of TCP/IP or RDMA
network hardware and software running on the servers containing persistent memory
provides direct remote access to persistent memory.
Having remote direct memory access via a high-performance network connection is
a critical use case for most cloud deployments of persistent memory. Typically, in high-
availability or highly redundant use cases, data written locally to persistent memory is
not considered reliable until it has been replicated to two or more remote persistent
memory devices on separate remote servers. We describe this push model design later in
this chapter.
While it is certainly possible to use existing TCP/IP networking infrastructures to
remotely access the persistent memory, this chapter focuses on the use of remote direct
memory access (RDMA). Direct memory access (DMA) allows data movement on a
platform to be off-loaded to a hardware DMA engine that moves that data on behalf of
the CPU, freeing it to do other important tasks during the data move. RDMA applies the
same concept and enables data movement between remote servers to occur without the
CPU on either server having to be directly involved.
This chapter’s content and the PMDK librpmem remote persistent memory library
that is discussed assume the use of RDMA, but the concepts discussed here can apply to
other networking interconnects and protocols.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_18
Figure 18-1 outlines a simple remote persistent memory configuration with one
initiator system that is replicating writes to persistent memory on a single remote target
system. While this shows the use of persistent memory on both the initiator and target,
it is possible to read data from initiator DRAM and write to persistent memory on the
remote target system, or read from the initiator’s persistent memory and write to the
remote target’s DRAM.
All three protocols support high-performance data movement to and from persistent
memory using RDMA.
The RDMA protocols are governed by the RDMA Wire Protocol Standards, which are
driven by the IBTA (InfiniBand Trade Association) and the IETF (Internet Engineering
Task Force) specifications. The IBTA (https://www.infinibandta.org/) governs the
InfiniBand and RoCE protocols, while the IETF (https://www.ietf.org/) governs
iWARP.
Low-latency RDMA networking protocols allow the network interface controller
(NIC) to control the movement of data between an initiator node source buffer and the
sink buffer on the target node without needing either node’s CPU to be involved in the
data movement. In fact, RDMA Read and RDMA Write operations are often referred
to as one-sided operations because all of the information required to move the data is
supplied by the initiator and the CPU on the target node is not typically interrupted or
even aware of the data transfer.
To perform remote data transfers, information about the target node's buffers must
be passed to the initiator before the remote operation(s) can begin. This requires
configuring the local initiator's RDMA resources and buffers. Similarly, the remote
target node's RDMA resources, whose setup does require its CPU, need to be initialized
and reported to the initiator. However, once the resources for the RDMA transfers are set up
and applications initiate the RDMA request using the CPU, the NIC does the actual data
movement on behalf of the RDMA-aware application.
RDMA-aware applications are responsible for:
Three basic RDMA commands are used by most RDMA-capable applications and
libraries:
RDMA Write: A one-sided operation where only the initiator supplies all of the
information required for the transfer to occur. This transfer is used to write data to the
remote target node. The write request contains all source and sink buffer information.
The remote target system is not typically interrupted and thus completely unaware of
the write operations occurring through the NIC. When the initiator’s NIC sends a write
to the target, it will generate a “software write completion interrupt.” A software write
completion interrupt means only that the write message has been sent to the target NIC;
it is not an indicator that the write has completed on the target. Optionally, RDMA Writes can use an
immediate option that will interrupt the target node CPU and allow software running
there to be immediately notified of the write completion.
RDMA Read: A one-sided operation where only the initiator supplies all of the
information required for the transfer to occur. This transfer is used to read data from
the remote target node. The read request contains all source buffer and target sink
buffer information, and the remote target system is not typically interrupted and thus
completely unaware of the read operations occurring through the NIC. The initiator
software read completion interrupt is an acknowledgment that the read has traversed
all the way through the initiator’s NIC, over the network, into the target system’s NIC,
through the target internal hardware mesh and memory controllers, to the DRAM or
persistent memory to retrieve the data. Then it returns all the way back to the initiator
software that registered for the completion notification.
RDMA Send (and Receive): The two-sided RDMA Send means that both the
initiator and target must supply information for the transfer to complete. This is because
the target NIC will be interrupted when the RDMA Send is received by the target NIC
and requires a hardware receive queue to be set up and pre-populated with completion
entries before the NIC will receive an RDMA Send transfer operation. Data from the
initiator application is bundled in a small, limited sized buffer and sent to the target
NIC. The target CPU will be interrupted to handle the send operation and any data it
contains. If the initiator needs to be notified of receipt of the RDMA Send, or to handle a
message back to the initiator, another RDMA Send operation must be sent in the reverse
direction after the initiator has set up its own receive queue and queued completion
entries to it. The use of the RDMA Send command and the contents of the payload
are application-specific implementation details. An RDMA Send is typically used for
bookkeeping and updates of read and write activity between the initiator and the target,
since the target application has no other context of what data movement has taken place.
For example, because there is no good way to know when writes have completed on
the target, an RDMA Send is often used to notify the target node what is happening. For
small amounts of data, the RDMA Send is very efficient, but it always requires target-side
interaction to complete. An RDMA Write with immediate data operation also allows the
target node to be interrupted when the write completes, providing a different mechanism
for bookkeeping.
from the initiator node to the persistent memory on the target node, that write or send
data needs to be flushed to the persistence domain on the remote system. Alternatively,
the remote write or send data needs to bypass CPU caches on the remote node to avoid
having to be flushed.
Different vendor-specific platform features add an extra challenge to RDMA and to
remote persistent memory. Intel platforms typically use a feature called allocating writes
or Direct Data IO (DDIO) which allows incoming writes to be placed directly into the
CPU’s L3 cache. The data is immediately visible to any application wanting to read the
data. However, having allocating writes enabled means that RDMA Writes to persistent
memory now have to be flushed to the persistence domain on the target node.
On Intel platforms, allocating writes can be disabled by turning on non-allocating
write I/O flows which forces the write data to bypass cache and be placed directly into
the persistent memory, governed by the location of the RDMA Write sink buffer. This
slows down applications that immediately touch the newly written data, because they
incur the penalty of pulling the data into the CPU cache. However, it makes remote
writes to persistent memory simpler and faster, since cache flushing on the remote
target node can be avoided. An additional complication of using non-allocating write
mode on an Intel platform is that it must be enabled for an entire PCI root complex: any
inbound write arriving through that root complex, from any device connected
downstream of it, will bypass the CPU caches, possibly adding latency for those devices
as a side effect.
Intel specifies two methods for forcing writes to remote persistent memory into the
persistence domain:
system issues optimized flush instructions to flush each cache line in the list to the
persistence domain. This is followed by an SFENCE to guarantee these writes complete
before new writes are handled. At this point, the previous writes that were flushed in the
RDMA Send list are now persistent.
also true of the PCIe interconnect to which the target NIC is connected. PCIe Reads will
perform a pipeline flush and force previous PCIe writes to complete first.
Figure 18-3 outlines the basic appliance remote replication method, often referred to
as the appliance persistency method, described earlier.
On Linux, the OpenFabrics Alliance (OFA) libibverbs library provides ring-3 (user-space) interfaces
to configure and use the RDMA connection for NICs that support IB, RoCE, and iWARP
RDMA network protocols. The OFA libfabric ring-3 application library can be layered
on top of libibverbs to provide a generic high-level common API that can be used with
typical RDMA NICs. This common API requires a provider plug-in to implement the
common API for the specific network protocol. The OFA web site contains many example
applications and performance tests that can be used on Linux with a variety of RDMA-
capable NICs. Those examples provide the backbone of the PMDK librpmem library.
Windows implements remotely mounted NTFS volumes via the ring-3 SMB Direct
Application library, which provides a number of storage protocols including block
storage over RDMA.
Figure 18-4 provides the basic high-level architecture for a typical RDMA application
on Linux, using all of the publicly available libraries and interfaces. Notice that a separate
side-band connection is typically needed to set up the RDMA connections themselves.
libpmemobj uses a synchronous write model, meaning that the local initiator write
and all of the remotely replicated writes must complete before the local write will be
completed back to the application. The libpmemobj library also implements a simple
active-passive replication architecture, where all persistent memory transactions are
driven through the active initiator node and the remote targets passively stand by,
replicating the write data. While the passive target systems have the latest write data
replicated, the implementation makes no attempt to fail over, fail back, or load balance
using the remote systems. The following sections describe the significant performance
drawbacks to this implementation.
libpmemobj uses the local memory pool configuration information provided in a
configuration file to describe the remote network–connected memory-mapped files.
A remote rpmemd program installed on each remote target system is started and
connected to the librpmem library on the initiator using a secure encrypted socket
connection. Through this connection, librpmem, on behalf of libpmemobj, will set up the
RDMA point-to-point connection with each target system, determine the persistence
method the target supports (general purpose or appliance method), allocate remote
memory-mapped persistent memory files, register the persistent memory on the remote
NIC, and retrieve the resulting memory keys for the registered memory.
Once all the RDMA connections to all the targets are established, all required
queues are instantiated, and memory buffers have all been allocated and registered,
the libpmemobj library is ready to begin remotely replicating all application write
data to its local memory-mapped file. When the application calls pmemobj_persist()
in libpmemobj, the library will generate a corresponding rpmem_persist() call
into librpmem which, in turn, calls the libfabric fi_write() to do the RDMA Write.
librpmem then initiates the RDMA Read or Send persistence method (as governed
by an understanding of the currently enabled target node’s current configuration) by
calling libfabric fi_read() or fi_send(). RDMA Read is used in the appliance remote
replication method, and RDMA Send is used in the general-purpose remote replication
method.
Figure 18-5 outlines the high-level components and interfaces described earlier and
used by both the initiator and remote target system using librpmem and libpmemobj.
Listing 18-2 shows a poolset file that describes the memory-mapped files shared for
the remote access. In many ways, a remote poolset file is the same as the regular poolset
file, but it must fulfill additional requirements:
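As an illustration of the general format, a local poolset that adds a remote replica follows the documented PMDK poolset(5) syntax. The sketch below is illustrative only: the size, part path, user, hostname, and remote poolset name are placeholders, not values from the listing:

```
PMEMPOOLSET
16G /mnt/pmem0/pool.part0

# Replicate all writes to a poolset defined on the remote host; the
# rpmemd daemon on that host resolves remote-pool.set locally.
REPLICA user@remote-host remote-pool.set
```

The REPLICA line is what directs librpmem and rpmemd to mirror the local pool; the remote poolset file describes the persistent memory files on the target system.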
Performance Considerations
Once persistent memory is accessible via a remote network connection, significantly
lower latency can be achieved compared with writing to a remote SSD or legacy block
storage device. This is because the RDMA hardware is writing the remote write data
directly into the final persistent memory location, whereas remote replication to an SSD
requires an RDMA Write into the DRAM on the remote server, followed by a second local
DMA operation to move the remote write data from volatile DRAM into the final storage
location on the SSD or other legacy block storage device.
The performance challenge with replicating data to remote persistent memory is that
while block sizes of 512KiB or larger can achieve good performance, as the size of
the writes being replicated gets smaller, the network overhead becomes a larger portion
of the total latency, and performance can suffer.
If the persistent memory is used as an SSD replacement, the typical native block storage
size is 4KiB, which avoids some of the inefficiencies seen with smaller transfers.
Compared with writing remotely to a traditional SSD, the latency improvement with
persistent memory can be 10x or more.
The synchronous replication model implemented in librpmem means that small
data structures and pointer updates in local persistent memory result in small, very
inefficient, RDMA Writes followed by a small RDMA Read or Send to make that small
amount of write data persistent. This results in significant performance degradation
compared to writing only to local persistent memory. It makes the replication
performance very dependent on the local persistent memory write sequences, which
is heavily dependent on the application workload. In general, the larger the average
request size and the fewer rpmem_persist() calls a workload requires, the lower the
overall latency of guaranteeing that data is persistent.
It is possible to follow multiple RDMA Writes with a single RDMA Read or Send
to make all of the preceding writes persistent. This reduces the impact of the size of
individual RDMA Writes on the overall performance of the solution. When using this
mitigation, however, remember that none of the RDMA Writes is guaranteed to be
persistent until the RDMA Read completion returns or an RDMA Send confirmation is
received. This approach is exposed through the rpmem_flush() and rpmem_drain() API
call pair: rpmem_flush() performs the RDMA Write and returns immediately, while
rpmem_drain() posts the RDMA Read and waits for its completion (at the time of
publication, it is not implemented in the write/send model).
There are many performance considerations, including the high-level networking
model being used. Traditional best-in-class networking architecture typically relies
on a pull model between the initiator and target. In a pull model, the initiator requests
resources from the target, but the target server only pulls the data across via RDMA
Read when it has the resources and connection bandwidth. This server-centric view
allows the target node to handle hundreds or thousands of connections since it is in
complete control of all resources for all of the connections and initiates the networking
transactions when it chooses. With the speed and low latency of persistent memory, a
push model can be used where the initiator and target have pre-allocated and registered
memory resources and directly RDMA Write the data without waiting for server-side
resource coordination. Microsoft’s SNIA DevCon RDMA presentation describes the
push/pull model in more detail: https://www.snia.org/sites/default/files/SDC/2018/presentations/PM/Talpey_Tom_Remote_Persistent_Memory.pdf.
Listing 18-3. The main routine of the Hello World program with replication
37 #include <assert.h>
38 #include <errno.h>
39 #include <unistd.h>
40 #include <stdio.h>
41 #include <stdlib.h>
42 #include <string.h>
43
44 #include <librpmem.h>
45
46 /*
47 * English and Spanish translation of the message
48 */
49 enum lang_t {en, es};
50 static const char *hello_str[] = {
51 [en] = "Hello world!",
52 [es] = "¡Hola Mundo!"
53 };
54
55 /*
56 * structure to store the current message
57 */
58 #define STR_SIZE 100
59 struct hello_t {
60 enum lang_t lang;
61 char str[STR_SIZE];
62 };
63
64 /*
65 * write_hello_str -- write a message to the local memory
66 */
104 int
105 main(int argc, char *argv[])
106 {
107 /* for this example, assume 32MiB pool */
108 size_t pool_size = 32 * 1024 * 1024;
109 void *pool = NULL;
110 int created;
111
112 /* allocate a page size aligned local memory pool */
113 long pagesize = sysconf(_SC_PAGESIZE);
114 assert(pagesize >= 0);
115 int ret = posix_memalign(&pool, pagesize, pool_size);
116 assert(ret == 0 && pool != NULL);
117
118 /* skip to the beginning of the message */
119 size_t hello_off = 4096; /* rpmem header size */
120 struct hello_t *hello = (struct hello_t *)(pool + hello_off);
121
122     RPMEMpool *rpp = remote_open("target", "pool.set", pool, pool_size,
123                     &created);
124 if (created) {
125 /* reset local memory pool */
126 memset(pool, 0, pool_size);
127 write_hello_str(hello, en);
128 } else {
129 /* read message from the remote pool */
130 ret = rpmem_read(rpp, hello, hello_off, sizeof(*hello), 0);
131 assert(ret == 0);
132
133 /* translate the message */
134         const int lang_num = (sizeof(hello_str) / sizeof(hello_str[0]));
135         enum lang_t lang = (enum lang_t)((hello->lang + 1) % lang_num);
136 write_hello_str(hello, lang);
137 }
138
139 /* write message to the remote pool */
140 ret = rpmem_persist(rpp, hello_off, sizeof(*hello), 0, 0);
141 printf("%s\n", hello->str);
142 assert(ret == 0);
143
144 /* close the remote pool */
145 ret = rpmem_close(rpp);
146 assert(ret == 0);
147
148 /* release local memory pool */
149 free(pool);
150 return 0;
151 }
• Line 68: A simple helper routine for writing a message to the local memory.
• Line 130: A message from the remote memory pool is read into local memory here.
• Lines 134-136: If a message from the remote memory pool was read
correctly, it is translated locally.
The last missing piece of the whole process is how the remote replication is set up. It
is all done in the remote_open() routine presented in Listing 18-4.
Listing 18-4. A remote_open routine from the Hello World program with
replication
74 /*
75 * remote_open -- setup the librpmem replication
76 */
77 static inline RPMEMpool*
78 remote_open(const char *target, const char *poolset, void *pool,
79 size_t pool_size, int *created)
80 {
81 /* fill pool_attributes */
82 struct rpmem_pool_attr pool_attr;
83 memset(&pool_attr, 0, sizeof(pool_attr));
84 strncpy(pool_attr.signature, "HELLO", RPMEM_POOL_HDR_SIG_LEN);
85
86 /* create a remote pool */
87 unsigned nlanes = 1;
88     RPMEMpool *rpp = rpmem_create(target, poolset, pool, pool_size, &nlanes,
89                     &pool_attr);
90 if (rpp) {
91 *created = 1;
92 return rpp;
93 }
94
Execution Example
The Hello World application produces the output shown in Listing 18-5.
Listing 18-5. An output from the Hello World application for librpmem
[user@initiator]$ ./hello
Hello world!
[user@initiator]$ ./hello
¡Hola Mundo!
Listing 18-6 shows the contents of the target persistent memory pool where we see
the “Hola Mundo” string.
00001020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00002000
Summary
It is important to know that neither the general-purpose remote replication method
nor the appliance remote replication method is ideal, because vendor-specific platform
features are required to use non-allocating writes, with the added complication of
affecting performance on an entire PCI root complex. Conversely, flushing remote
writes while allocating writes are enabled requires a painful interrupt of the target
system to intercept the RDMA Send request and flush the list of regions contained
within the send buffer. Waking the remote node is especially painful in a cloud
environment, where there are hundreds or thousands of inbound RDMA requests from
many different connections; avoid this if possible.
There are cloud service providers using these two methods today and getting
phenomenal performance results. If the persistent memory is used as a replacement for
a remotely accessed SSD, huge reductions in latency can be achieved.
As the first iteration of remote persistence support, we focused on application/
library changes to implement these high-level persistence methods, without hardware,
firmware, driver, or protocol changes. At the time of publication, IBTA and IETF drafts
for a new wire protocol extension for persistent memory are nearing completion. This will
provide native hardware support for RDMA to persistent memory and allow hardware
entities to route each I/O to its destination memory device without the need to change
allocating write mode and without the potential to adversely affect performance on
collateral devices connected to the same root port. See Appendix E for more details on
the new extensions to RDMA, specifically for remote persistence.
RDMA protocol extensions are only one step in further remote persistent memory
development. Several other areas of improvement have already been identified and will
need to be addressed for the remote persistent memory user community, including
atomicity of remote operations, advanced error handling (including RAS), dynamic
configuration and custom setup of remote persistent memory, and true 0% CPU
utilization on the remote/target replication side.
As this book has demonstrated, unlocking the true potential of persistent memory
may require new approaches to existing software and application architecture.
Hopefully, this chapter gave you an overview of this complex topic, the challenges of
working with remote persistent memory, and the many aspects of software architecture
to consider when unlocking the true performance potential.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 19
Advanced Topics
This chapter covers several topics that were described only briefly earlier in the book,
where expanding on them would have distracted from the main discussion. The
in-depth details on those topics are provided here for your reference.
© The Author(s) 2020
S. Scargall, Programming Persistent Memory, https://doi.org/10.1007/978-1-4842-4932-1_19
Figure 19-1. A two-socket CPU NUMA architecture showing local and remote
memory access
On a NUMA system, the greater the distance between a processor and a memory
bank, the slower the processor’s access to that memory bank. Performance-sensitive
applications should therefore be configured so they allocate memory from the closest
possible memory bank.
Performance-sensitive applications should also be configured to execute on a set
number of cores, particularly in the case of multithreaded applications. Because first-
level caches are usually small, if multiple threads execute on one core, each thread
will potentially evict cached data accessed by a previous thread. When the operating
system attempts to multitask between these threads, and the threads continue to evict
each other’s cached data, a large percentage of their execution time is spent on cache
line replacement. This issue is referred to as cache thrashing. We therefore recommend
that you bind a multithreaded application to a NUMA node rather than a single core,
since this allows the threads to share cache lines on multiple levels (first-, second-, and
last-level cache) and minimizes the need for cache fill operations. However, binding
an application to a single core may be performant if all threads are accessing the
same cached data.
# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 192129 MB
node 0 free: 187094 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93
94 95
node 1 size: 192013 MB
node 1 free: 191478 MB
node distances:
node 0 1
0: 10 21
1: 21 10
The node distance is a relative distance and not an actual time-based latency in
nanoseconds or milliseconds.
numactl lets you bind an application to a particular core or NUMA node and allocate
the memory associated with a core or set of cores to that application. Some useful
options provided by numactl are described in Table 19-1.
Table 19-1. numactl command options for binding processes to NUMA nodes or
CPUs
Option Description
--membind, -m Only allocate memory from specific NUMA nodes. The allocation will fail
if there is not enough memory available on these nodes.
--cpunodebind, -N Only execute the process on CPUs from the specified NUMA nodes.
--physcpubind, -C Only execute process on the given CPUs.
--localalloc, -l Always allocate on the current NUMA node.
--preferred Preferably allocate memory on the specified NUMA node. If memory
cannot be allocated, fall back to other nodes.
"iset_id":-2506113243053544244,
"persistence_domain":"memory_controller",
"namespaces":[
{
"dev":"namespace1.0",
"mode":"fsdax",
"map":"dev",
"size":1598128390144,
"uuid":"b3e203a0-2b3f-4e27-9837-a88803f71860",
"raw_uuid":"bd8abb69-dd9b-44b7-959f-79e8cf964941",
"sector_size":512,
"align":2097152,
"blockdev":"pmem1",
"numa_node":1
}
]
},
{
"dev":"region0",
"size":1623497637888,
"available_size":0,
"max_available_extent":0,
"type":"pmem",
"numa_node":0,
"iset_id":3259620181632232652,
"persistence_domain":"memory_controller",
"namespaces":[
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":1598128390144,
"uuid":"06b8536d-4713-487d-891d-795956d94cc9",
"raw_uuid":"39f4abba-5ca7-445b-ad99-fd777f7923c1",
"sector_size":512,
"align":2097152,
"blockdev":"pmem0",
"numa_node":0
}
]
}
]
}
Executing mlc or mlc_avx512 with no arguments runs all the modes in sequence
using the default parameters and values for each test and writes the results to the
terminal. The following example shows running just the latency matrix on a two-socket
Intel system.
# ./mlc_avx512 --latency_matrix -e -r
Intel(R) Memory Latency Checker - v3.6
Command line parameters: --latency_matrix -e -r
MLC can be used to test persistent memory latency and bandwidth in either DAX or
FSDAX modes. Commonly used arguments include
• -L requests that large pages (2MB) be used (assuming they have been
enabled).
Examples:
Sequential read latency:
NUMASTAT Utility
The numastat utility on Linux shows per NUMA node memory statistics for processors
and the operating system. With no command options or arguments, it displays NUMA
hit and miss system statistics from the kernel memory allocator. The default numastat
statistics show per-node numbers, in units of pages of memory, for example:
$ sudo numastat
node0 node1
numa_hit 8718076 7881244
numa_miss 0 0
numa_foreign 0 0
interleave_hit 40135 40160
local_node 8642532 2806430
other_node 75544 5074814
IPMCTL Utility
Persistent memory vendor- and server-specific utilities can also be used to show DDR
and persistent memory device topology to help identify what devices are associated
with which CPU sockets. For example, the ipmctl show -topology command displays
the DDR and persistent memory (non-volatile) devices with their physical memory slot
location (see Figure 19-2), if that data is available.
[Figure 19-2 content: the terminal output of ipmctl show -topology, a table with
columns DimmID, MemoryType, Capacity, PhysicalID, and DeviceLocator, listing each
Logical Non-Volatile Device and each DDR DIMM together with its capacity in GiB and
its CPU/DIMM slot location.]
Figure 19-2. Topology report from the ipmctl show -topology command
Automatic NUMA balancing uses several algorithms and data structures, which are
only active and allocated if automatic NUMA balancing is active on the system, using a
few simple steps:
• A task scanner periodically scans the address space and marks the
memory to force a page fault when the data is next accessed.
• The next access to the data will result in a NUMA Hinting Fault. Based
on this fault, the data can be migrated to a memory node associated
with the thread or process accessing the memory.
Manual NUMA tuning of applications using numactl will override any system-wide
automatic NUMA balancing settings. Automatic NUMA balancing simplifies tuning
workloads for high performance on NUMA machines. Where possible, we recommend
statically tuning the workload to partition it within each node. Certain latency-sensitive
applications, such as databases, usually work best with manual configuration. However,
in most other use cases, automatic NUMA balancing should help performance.
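Whether automatic NUMA balancing is active can be checked through the kernel.numa_balancing sysctl. A small sketch, assuming a Linux system (the file is absent on kernels built without NUMA balancing support):

```shell
# 1 = automatic NUMA balancing enabled, 0 = disabled
f=/proc/sys/kernel/numa_balancing
if [ -r "$f" ]; then
    state=$(cat "$f")
else
    state=unavailable
fi
echo "kernel.numa_balancing: $state"
# To change it at runtime (as root): sysctl -w kernel.numa_balancing=0
```

Disabling it system-wide is typically only warranted for latency-sensitive workloads, such as databases, that are already manually tuned with numactl.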
the system rather than one file system per NUMA node, a software volume manager
can be used to create concatenations or stripes (RAID0) using all the system’s capacity.
For example, if you have 1.5TiB of persistent memory per CPU socket on a two-socket
system, you could build a concatenation or stripe (RAID0) to create a 3TiB file system. If
local system redundancy is more important than large file systems, mirroring (RAID1)
persistent memory across NUMA nodes is possible. In general, replicating the data
across physical servers for redundancy is better. Chapter 18 discusses remote persistent
memory in detail, including using remote direct memory access (RDMA) for data
transfer and replication across systems.
There are too many volume manager products to provide step-by-step recipes for all of
them within this book. On Linux, you can use Device Mapper (dmsetup), Multiple Device
Driver (mdadm), and Linux Volume Manager (LVM) to create volumes that use the capacity
from multiple NUMA nodes. Because most modern Linux distributions default to using
LVM for their boot disks, we assume that you have some experience using LVM. There is
extensive information and tutorials within the Linux documentation and on the Web.
Figure 19-3 shows two regions on which we can create either an fsdax or sector
type namespace that creates the corresponding /dev/pmem0 and /dev/pmem1 devices.
Using /dev/pmem[01], we can create an LVM physical volume which we then combine
to create a volume group. Within the volume group, we are free to create as many logical
volumes of the requested size as needed. Each logical volume can support one or more
file systems.
Figure 19-3. Linux Volume Manager architecture using persistent memory regions
and namespaces
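The flow in Figure 19-3 maps to a handful of LVM commands. Because they are destructive and require real /dev/pmem devices, the sketch below only prints the commands it would run; the device names, volume names, and stripe parameters are placeholders:

```shell
# Striped (RAID0) logical volume across two fsdax namespaces, one per
# NUMA node; printed rather than executed, since running these would
# destroy data on the named devices.
cmds='pvcreate /dev/pmem0
pvcreate /dev/pmem1
vgcreate pmem_vg /dev/pmem0 /dev/pmem1
lvcreate --type striped --stripes 2 --extents 100%FREE --name pmem_lv pmem_vg
mkfs.xfs /dev/pmem_vg/pmem_lv
mount -o dax /dev/pmem_vg/pmem_lv /mnt/pmem'
printf '%s\n' "$cmds"
```

Substituting a single-device vgcreate/lvcreate per node, or lvcreate --type raid1 for mirroring, follows the same pattern.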
The mmap(2) man page in the Linux Programmer’s manual describes the MAP_SYNC
flag as follows:
MAP_SYNC (since Linux 4.15)
This flag is available only with the MAP_SHARED_VALIDATE mapping
type; mappings of type MAP_SHARED will silently ignore this flag. This flag
is supported only for files supporting DAX (direct mapping of persistent
memory). For other files, creating a mapping with this flag results in an
EOPNOTSUPP error.
Shared file mappings with this flag provide the guarantee that while some
memory is writably mapped in the address space of the process, it will be
visible in the same file at the same offset even after the system crashes or is
rebooted. In conjunction with the use of appropriate CPU instructions, this
provides users of such mappings with a more efficient way of making data
modifications persistent.
Summary
In this chapter, we presented some of the more advanced topics for persistent memory
including page size considerations on large memory systems, NUMA awareness and
how it affects application performance, how to use volume managers to create DAX file
systems that span multiple NUMA nodes, and the MAP_SYNC flag for mmap(). Additional
topics such as BIOS tuning were intentionally left out of this book because they are
vendor- and product-specific. Performance and benchmarking of persistent memory
products are left to external resources, as there are too many tools (vdbench, sysbench,
fio, etc.), each with too many options, to cover in this book.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
APPENDIX A
How to Install NDCTL and DAXCTL on Linux
Prerequisites
Installing ndctl and daxctl using packages automatically installs any missing
dependency packages on the system. A full list of dependencies is usually listed when
installing the package. You can query the package repository to list dependencies or use
an online package tool such as https://pkgs.org to find the package for your operating
Appendix A How to Install NDCTL and DAXCTL on Linux
system and list the package details. For example, Figure A-1 shows the packages
required for ndctl v64.1 on Fedora 30 (https://fedora.pkgs.org/30/fedora-x86_64/
ndctl-64.1-1.fc30.x86_64.rpm.html).
Additionally, you can use an online package search tool such as https://pkgs.org
that allow you to search for packages across multiple distros. Figure A-2 shows the results
for many distros when searching for “libpmem.”
Note The version of the ndctl and daxctl available with your operating system
may not match the most current project release. If you require a newer release
than your operating system delivers, consider compiling the projects from the
source code. We do not describe compiling and installing from the source code
in this book. Instructions can be found at https://docs.pmem.io/getting-started-guide/installing-ndctl#installing-ndctl-from-source-on-linux and https://github.com/pmem/ndctl.
For example, on Fedora, to install the ndctl and daxctl runtime utilities and the ndctl development library, use
Runtime:
$ sudo dnf install ndctl daxctl
Development library:
$ sudo dnf install ndctl-devel
On RHEL and CentOS:

Runtime:
$ yum install ndctl daxctl

Development:
$ yum install ndctl-devel
On SLES and OpenSUSE:

Runtime:
$ zypper install ndctl daxctl

Development:
$ zypper install libndctl-devel
On Ubuntu:

Runtime:
$ sudo apt-get install ndctl daxctl

Development:
$ sudo apt-get install libndctl-dev
APPENDIX B

How to Install the Persistent Memory Development Kit (PMDK)

PMDK Prerequisites
In this appendix, we describe installing the PMDK libraries using the packages available
in your operating system package repository. To enable all PMDK features, such as
advanced reliability, accessibility, and serviceability (RAS), PMDK requires libndctl and
libdaxctl. Package dependencies automatically install these requirements. If you are
building and installing using the source code, you should install NDCTL first using the
instructions provided in Appendix A.
Appendix B How to Install the Persistent Memory Development Kit (PMDK)
The default package manager utility for your operating system will allow you to
query the package repository using regular expressions to identify packages to install.
Table B-3 shows how to search the package repository using the command-line utility
for several distributions. If you prefer to use a GUI, feel free to use your favorite desktop
utility to perform the same search and install operations described here.
Table B-3. Searching for *pmem* packages on different Linux operating systems

Operating System        Command
Fedora                  $ dnf search pmem
RHEL and CentOS         $ yum search pmem
SLES and OpenSUSE       $ zypper search pmem
Ubuntu                  $ apt-cache search pmem
Additionally, you can use an online package search tool such as https://pkgs.org that allows you to search for packages across multiple distros. Figure B-1 shows the results for many distros when searching for “libpmem.”
Note The version of the PMDK libraries available with your operating system may not match the most current PMDK release. If you require a newer release than your operating system delivers, consider compiling PMDK from the source code. We do not describe compiling and installing PMDK from the source code in this book. Instructions can be found at https://docs.pmem.io/getting-started-guide/installing-pmdk/compiling-pmdk-from-source and https://github.com/pmem/pmdk.
On Fedora:

All Runtime:
$ sudo dnf install libpmem librpmem libpmemblk libpmemlog \
libpmemobj libpmempool pmempool
All Development:
$ sudo dnf install libpmem-devel librpmem-devel \
libpmemblk-devel libpmemlog-devel libpmemobj-devel \
libpmemobj++-devel libpmempool-devel
All Debug:
$ sudo dnf install libpmem-debug librpmem-debug \
libpmemblk-debug libpmemlog-debug libpmemobj-debug \
libpmempool-debug
On RHEL and CentOS:

All Runtime:
$ sudo yum install libpmem librpmem libpmemblk libpmemlog \
libpmemobj libpmempool pmempool
All Development:
$ sudo yum install libpmem-devel librpmem-devel \
libpmemblk-devel libpmemlog-devel libpmemobj-devel \
libpmemobj++-devel libpmempool-devel
All Debug:
$ sudo yum install libpmem-debug librpmem-debug \
libpmemblk-debug libpmemlog-debug libpmemobj-debug \
libpmempool-debug
On SLES and OpenSUSE:

All Runtime:
$ sudo zypper install libpmem librpmem libpmemblk libpmemlog \
libpmemobj libpmempool pmempool
All Development:
$ sudo zypper install libpmem-devel librpmem-devel \
libpmemblk-devel libpmemlog-devel libpmemobj-devel \
libpmemobj++-devel libpmempool-devel
All Debug:
$ sudo zypper install libpmem-debug librpmem-debug \
libpmemblk-debug libpmemlog-debug libpmemobj-debug \
libpmempool-debug
On Ubuntu:

All Runtime:
$ sudo apt-get install libpmem1 librpmem1 libpmemblk1 \
libpmemlog1 libpmemobj1 libpmempool1
All Development:
$ sudo apt-get install libpmem-dev librpmem-dev \
libpmemblk-dev libpmemlog-dev libpmemobj-dev \
libpmempool-dev
All Debug:
$ sudo apt-get install libpmem1-debug \
librpmem1-debug libpmemblk1-debug \
libpmemlog1-debug libpmemobj1-debug libpmempool1-debug
Note The last command can take a while as PMDK builds and installs.
After successful completion of all of the preceding steps, the libraries are ready to be used in Visual Studio with no additional configuration required. Just open Visual Studio with your existing project or create a new one (remember to use platform x64) and then include the headers in your project as you always do.
APPENDIX C

How to Install IPMCTL on Linux and Windows
libsafec
libsafec is available as a package in the Fedora package repository. For other Linux
distributions, it is available as a separate downloadable package for local installation:
Alternatively, when compiling ipmctl from source code, use the -DSAFECLIB_SRC_DOWNLOAD_AND_STATIC_LINK=ON option to download the sources and statically link to safeclib.
Running the executable installs ipmctl and makes it available via the command-line
and PowerShell interfaces.
Using ipmctl
The ipmctl utility provides system administrators with the ability to configure Intel Optane DC persistent memory modules, which can then be used by Windows PowerShell Cmdlets or ndctl on Linux to create namespaces on which file systems can be created. Applications can then create persistent memory pools and memory map them to get direct access to the persistent memory. Detailed information about the modules can also be extracted to help with errors or debugging.
ipmctl has a rich set of commands and options that can be displayed by running
ipmctl without any command verb, as shown in Listing C-1.
Listing C-1. Listing the command verbs and simple usage information
# ipmctl version
# ipmctl
Commands:
Display the CLI help.
help
Store the region configuration goal from one or more DIMMs to a file
dump -destination (file destination) -system -config
Please see ipmctl <verb> -help <command> i.e 'ipmctl show -help -dimm' for
more information on specific command
Each command has its own man page. A full list of man pages can be found from the
IPMCTL(1) man page by running “man ipmctl”.
An online ipmctl User Guide can be found at https://docs.pmem.io. This guide provides detailed step-by-step instructions and in-depth information about ipmctl and how to use it to provision and debug issues. An ipmctl Quick Start Guide can be found at https://software.intel.com/en-us/articles/quick-start-guide-configure-intel-optane-dc-persistent-memory-on-linux.
For a short video walk-through of using ipmctl and ndctl, you can watch the “Provision Intel Optane DC Persistent Memory in Linux” webinar recording (https://software.intel.com/en-us/videos/provisioning-intel-optane-dc-persistent-memory-modules-in-linux).
If you have questions relating to ipmctl, Intel Optane DC persistent memory, or persistent memory in general, you can ask them in the Persistent Memory Google Forum (https://groups.google.com/forum/#!forum/pmem). Questions or issues specific to ipmctl should be posted as an issue or question on the ipmctl GitHub issues site (https://github.com/intel/ipmctl/issues).
APPENDIX D

Java for Persistent Memory
Java is one of the most popular programming languages available because it is fast, secure, and reliable. Many applications and web sites are implemented in Java. It is cross-platform and supports multi-CPU architectures, from laptops to datacenters, game consoles to scientific supercomputers, cell phones to the Internet, and CD/DVD players to automotive systems. Java is everywhere!
At the time of writing this book, Java did not natively support storing data persistently
on persistent memory, and there were no Java bindings for the Persistent Memory
Development Kit (PMDK), so we decided Java was not worthy of a dedicated chapter.
We didn’t want to leave Java out of this book given its popularity among developers,
so we decided to include information about Java in this appendix.
In this appendix, we describe the features that have already been integrated into Oracle’s Java Development Kit (JDK) [https://www.oracle.com/java/] and OpenJDK [https://openjdk.java.net/]. We also provide information about proposed persistent memory functionality in Java as well as two external Java libraries in development.
Figure D-1. Java heap memory allocated from DRAM and persistent memory using the “-XX:AllocateHeapAt=<path>” option

The Java heap is allocated only from persistent memory. The mapping to DRAM is shown to emphasize that non-heap components, like the code cache, GC bookkeeping, and so on, are allocated from DRAM.
The existing heap-related flags such as -Xmx and -Xms, and garbage collection–related flags, will continue to work as before. For example:

$ java -Xms16g -Xmx32g -XX:AllocateHeapAt=/pmemfs/jvmheap <application class>

This allocates an initial 16GiB heap size (-Xms16g) with a maximum heap size of up to 32GiB (-Xmx32g). The JVM heap can use the capacity of a temporary file created within the path specified by -XX:AllocateHeapAt=/pmemfs/jvmheap. The JVM automatically creates a temporary file of the form jvmheap.XXXXXX, where XXXXXX is a randomly generated number. The directory path should be a persistent memory-backed file system mounted with the DAX option. See Chapter 3 for more information about mounting file systems with the DAX feature.
To ensure application security, the implementation must protect the files created in the file system: the temporary file is created with read-write permissions for the user running the JVM, and the JVM deletes the file before terminating.
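The mechanism resembles an ordinary memory-mapped temporary file. The following Python sketch is an illustrative analogy only, not the JVM implementation; an ordinary temp directory stands in for the DAX-mounted file system:

```python
import mmap
import os
import tempfile

# Back a region of "heap" memory with a temporary file, the way
# -XX:AllocateHeapAt backs the Java heap with a jvmheap.XXXXXX file.
heap_dir = tempfile.mkdtemp()
fd, heap_path = tempfile.mkstemp(prefix="jvmheap.", dir=heap_dir)

heap_size = 4096
os.ftruncate(fd, heap_size)        # size the backing file
heap = mmap.mmap(fd, heap_size)    # map it: loads/stores go to the file

heap[0:5] = b"hello"               # use it like byte-addressable memory
assert heap[0:5] == b"hello"

# Like the JVM, clean up the temporary file before terminating.
heap.close()
os.close(fd)
os.remove(heap_path)
os.rmdir(heap_dir)
```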
This feature targets alternative memory devices that have the same semantics as DRAM,
including the semantics of atomic operations, and can therefore be used instead of DRAM for
the object heap without any change to existing application code. All other memory structures
such as the code heap, metaspace, thread stacks, etc., will continue to reside in DRAM.
Some use cases of this feature include
• Users can use -XX:MaxRAM to let the VM know how much DRAM is
available for use. If specified, maximum young gen size is set to 80%
of the value in MaxRAM.
• -XX:MaxRAMPercentage.
• Enabling logging with the logging option gc+ergo=info will print the
maximum young generation size at startup.
references to them. Unlike regular Java objects, their lifetime can extend beyond a single instance of the Java virtual machine and beyond machine restarts.
Because the contents of persistent objects are retained, it’s important to maintain data consistency of objects even in the face of crashes and power failures. Persistent collections and other objects in the library offer persistent data consistency at the Java method level. Methods, including field setters, behave as if the method’s changes to persistent memory all happen or none happen. This same method-level consistency can be achieved with developer-defined classes using a transaction API offered by PCJ.
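The all-or-nothing behavior of such methods can be pictured with a toy undo-log transaction. This is an illustrative Python sketch of the concept, not the PCJ API:

```python
# Toy undo-log transaction: apply a set of field updates so that either
# all of them take effect or none do. This conceptually mirrors the
# method-level consistency PCJ provides for persistent objects.
def transactional_update(obj, updates):
    undo = {k: obj[k] for k in updates if k in obj}  # snapshot old values
    added = [k for k in updates if k not in obj]
    try:
        for key, value in updates.items():
            if value is None:
                raise ValueError("simulated crash mid-update")
            obj[key] = value
    except Exception:
        for key in added:          # roll back: drop newly added keys...
            obj.pop(key, None)
        obj.update(undo)           # ...and restore the snapshotted values
        raise

account = {"balance": 100}
try:
    # This update "crashes" partway through; nothing may remain applied.
    transactional_update(account, {"balance": 50, "note": None})
except ValueError:
    pass
assert account == {"balance": 100}     # rolled back: none happened

transactional_update(account, {"balance": 50})
assert account == {"balance": 50}      # committed: all happened
```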
PCJ uses the libpmemobj library from the Persistent Memory Development Kit (PMDK), which we discussed in Chapter 7. For additional information on PMDK, please visit https://pmem.io/ and https://github.com/pmem/pmdk.
Mixed data consistency schemes are also implementable. For example, transactional
writes for critical data and either persistent or volatile writes for less critical data (e.g.,
statistics or caches).
LLPL uses the libpmemobj library from the Persistent Memory Development Kit (PMDK), which we discussed in Chapter 7. For additional information on PMDK, please visit https://pmem.io/ and https://github.com/pmem/pmdk.
After that, you should see the generated *.class file. To run the main() method
inside your class, you need to again pass the LLPL class path. You also need to set the
java.library.path environment variable to the location of the compiled native library
used as a bridge between LLPL and PMDK:
PCJ source code examples can be found in the resources listed in the following:
Summary
At the time of writing this book, native support for persistent memory in Java is an
ongoing effort. Current features are mostly volatile, meaning the data is not persisted
once the app exits. We have described several features that have been integrated and
shown two libraries – LLPL and PCJ – that provide additional functionality for Java
applications.
The Low-Level Persistent Library (LLPL) is an open source Java library being
developed by Intel for persistent memory programming. By providing Java access to
persistent memory at a memory block level, LLPL gives developers a foundation for
building custom abstractions or retrofitting existing code.
The higher-level Persistent Collections for Java (PCJ) offers developers a range of
thread-safe persistent collection classes including arrays, lists, and maps. It also offers
persistent support for things like strings and primitive integer and floating-point types.
Developers can define their own persistent classes as well.
APPENDIX E

The Future of Remote Persistent Memory Replication
Figure E-1. The proposed RDMA protocol changes to efficiently support persistent
memory by avoiding Send or Read being called after a Write
InfiniBand Trade Association. Both protocols track each other architecturally and have essentially added RDMA Flush and RDMA Atomic Write commands to the existing volatile memory support.
RDMA Flush is a protocol command that flushes a portion of a memory region. The completion of the flush command indicates that all of the RDMA Writes within the domain of the flush have reached their final placement. Flush placement hints allow the initiator software to request flushing to globally visible memory (which could be volatile or persistent memory regions) and, separately, whether the memory is volatile or persistent memory. The scope of the RDMA Write data included in the RDMA Flush domain is driven by the offset and length of the memory region being flushed. All RDMA Writes covering memory regions contained in the RDMA Flush command shall be included in the RDMA Flush. That means the RDMA Flush command will not complete on the initiator system until all previous remote writes for those regions have reached the final requested placement location.
RDMA Atomic Write is a protocol command that instructs the NIC to write a pointer update directly into persistent memory in a pipeline-efficient manner. This allows the preceding RDMA Write, RDMA Flush, RDMA Atomic Write, and RDMA Flush sequence to occur with only a single complete round-trip latency incurred by software; it simply needs to wait for the final RDMA Flush completion.
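The pipelining benefit (posting the whole Write, Flush, Atomic Write, Flush sequence and waiting only on the final completion) can be modeled with a toy sketch. This is illustrative Python with invented class and method names, not the RDMA verbs API:

```python
# Toy model of the pipelined remote-persistence sequence: operations are
# posted without waiting, and software blocks once, on the completion of
# the final flush. Class and method names are invented for illustration.
class ToyConnection:
    def __init__(self):
        self.posted = []   # operations handed to the "NIC", not yet complete
        self.memory = {}   # the remote "persistent memory"

    def post_write(self, addr, data):
        self.posted.append((addr, data))       # no round trip here

    def post_atomic_write(self, addr, pointer):
        self.posted.append((addr, pointer))    # pointer-sized update, no wait

    def post_flush(self):
        self.posted.append(("flush", None))    # ordering point, still no wait

    def wait_final_completion(self):
        # The single round trip: the final flush completes only after every
        # previously posted write has reached its placement location.
        for addr, value in self.posted:
            if addr != "flush":
                self.memory[addr] = value
        self.posted.clear()

conn = ToyConnection()
conn.post_write(0x1000, b"record")       # write the data
conn.post_flush()                        # flush the data
conn.post_atomic_write(0x2000, 0x1000)   # publish a pointer to it
conn.post_flush()                        # flush the pointer
conn.wait_final_completion()             # one wait for the whole sequence
assert conn.memory == {0x1000: b"record", 0x2000: 0x1000}
```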
Platform hardware changes are required to efficiently make use of the new network protocol additions for persistent memory support. The placement hints provided in the RDMA Flush command allow four possible routing combinations:
• Cache Attribute
• No-cache Attribute
• Volatile Destination
• Persistent Memory Destination
The chipset, CPU, and PCIe root complexes need to understand these placement
attributes and steer or route the request to the proper hardware blocks as requested.
On upcoming Intel platforms, the CPU will look at the PCIe TLP Processor Hint
fields to allow the NIC to add the steering information to each PCIe packet generated
for the inbound RDMA Writes and RDMA Flush. The optional use of this PCIe steering
mechanism is defined by the PCIe Firmware Interface in the ACPI specification and
allows NIC kernel drivers and PCI bus drivers to enable the IO steering and essentially
select cache, no-cache as memory attributes, and persistent memory or DRAM as the
destination.
From a software enabling point of view, there will be changes to the verbs definition
as defined by the IBTA. This will define the specifics of how the NIC will manage and
implement the feature. Middleware, including OFA libibverbs and libfabric, will be
updated based on these core additions to the networking protocol for native persistent
memory support.
Readers seeking more specific information on the development of these persistent
memory extensions to RDMA are encouraged to follow the references in this book
and the information shared here to begin a more detailed search on native persistent
memory support for high-performance remote access. There are many new exciting
developments occurring on this aspect of persistent memory usage.
Glossary
Term Definition
3D XPoint 3D XPoint is a non-volatile memory (NVM) technology developed jointly by Intel and Micron Technology.
ACPI The Advanced Configuration and Power Interface is used by BIOS to expose platform
capabilities.
ADR Asynchronous DRAM Refresh is a feature supported on Intel platforms that triggers a flush of write pending queues in the memory controller on power failure. Note that ADR does not flush the processor cache.
AMD Advanced Micro Devices https://www.amd.com
BIOS Basic Input/Output System refers to the firmware used to initialize a server.
CPU Central processing unit
DCPM Intel Optane DC persistent memory
DCPMM Intel Optane DC persistent memory module(s)
DDR Double Data Rate is an advanced version of SDRAM, a type of computer memory.
DDIO Direct Data IO. Intel DDIO makes the processor cache the primary destination and
source of I/O data rather than main memory. By avoiding system memory, Intel DDIO
reduces latency, increases system I/O bandwidth, and reduces power consumption
due to memory reads and writes.
DRAM Dynamic random-access memory
eADR Enhanced Asynchronous DRAM Refresh, a superset of ADR that also flushes the CPU
caches on power failure.
ECC Memory error correction used to provide protection from both transient errors and
device failures.
HDD A hard disk drive is a traditional spinning hard drive.
InfiniBand InfiniBand (IB) is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used as either a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems.
Intel Intel Corporation https://intel.com
iWARP Internet Wide Area RDMA Protocol is a computer networking protocol that
implements remote direct memory access (RDMA) for efficient data transfer over
Internet Protocol networks.
NUMA Nonuniform memory access, a platform where the time to access memory depends
on its location relative to the processor.
NVDIMM A non-volatile dual inline memory module is a type of random-access memory for
computers. Non-volatile memory is memory that retains its contents even when
electrical power is removed, for example, from an unexpected power loss, system
crash, or normal shutdown.
NVMe Non-volatile memory express is a specification for directly connecting SSDs on PCIe
that provides lower latency and higher performance than SAS and SATA.
ODM Original Design Manufacturing refers to a producer/reseller relationship in which the
full specifications of a project are determined by the reseller rather than based on
the specs established by the manufacturer.
OEM An original equipment manufacturer is a company that produces parts and
equipment that may be marketed by another manufacturer.
OS Operating system
PCIe Peripheral Component Interconnect Express is a high-speed serial communication bus.
Persistent Memory Persistent memory (PM or PMEM) provides persistent storage of data, is byte addressable, and has near-memory speeds.
Index
A C
ACPI specification, 28 C++ Standard limitations
Address range scrub (ARS), 338 object layout, 122, 123
Address space layout randomization object lifetime, 119, 120
(ASLR), 87, 112, 316 vs. persistent memory, 125, 126
Appliance remote replication pointers, 123–125
method, 355, 357 trivial types, 120–122
Application binary interface (ABI), 122 type traits, 125
Application startup and recovery Cache flush operation (CLWB), 24, 59, 286
ACPI specification, 28 Cache hierarchy
ARS, 29 CPU
dirty shutdown, 27 cache hit, 15
flow, 27, 28 cache miss, 16
infinite loop, 28 levels, 14, 15
libpmem library, 27 and memory controller, 14, 15
libpmemobj query, 27 non-volatile storage devices, 16
PMDK, 29 Cache thrashing, 374
RAS, 27 Chunks/buckets, 188
Asynchronous DRAM CLFLUSHOPT, 18, 19, 24, 208, 247, 353
Refresh (ADR), 17, 207 close() method, 151
Atomicity, consistency, isolation, and closeTable() method, 268
durability (ACID), 278 CLWB flushing instructions, 208
Atomic operations, 285, 286 cmap engine, 4
Concurrent data structures
definition, 287
B erase operation, 293
Block Translation Table (BTT) find operation, 292
driver, 34 hash map, 291, 292
Buffer-based LRU design, 182 insert operation, 292, 293
F
D Fence
Data at rest, 17 code, 21, 22
Data in-flight, 17 libpmem library, 23
Data Loss Count (DLC), 342–346 PMDK, 22
Data structure pseudocode, 21
hash table and transactions, 194 SFENCE instructions, 23
persistence, 197, 200–202 flush() function, 217, 242
sorted array, versioning, 202–206 Flushing
Data visibility, 23 msync(), 20
DAX-enabled file system, 179, 184 non-temporal stores, 19
DB-Engines, 143 optimized flush, 19
deleteNodeFromSLL(), 273 temporal locality, 19
deleteRowFromAllIndexedColumns() Fragmentation, 187
function, 273 func() function, 237
delete_row() method, 272
Direct access (DAX), 19, 66
Direct Data IO (DDIO), 352 G
Direct memory access (DMA), 12, 347 General-purpose remote replication
Dirty reads, 233 method (GPRRM)
Dynamic random-access memory performance implications, 355
(DRAM), 11, 155 persistent data, 354, 355
M capacity, 11
characteristics, 12, 13
Machine check exception (MCE), 334
kinds of, 158
main() function, 213, 234
leaked object, 213
MAP_SYNC flag, 385
leaks, 209
MariaDB∗ storage engine
Memory management unit (MMU), 49
architecture, 264
Metaprogramming
creation
allocating, 116–118
database table, 266, 267
definition, 112
database table, closing, 268
persistent pointers, 112, 113
database table,
opening, 267, 268 snapshots, 115, 116
data structure, 266 transactions, 113, 114
DELETE operation, 272–274 mtx object, 282
handle commits and Multiple Device Driver (mdadm), 384
rollbacks, 265, 266 Mutexes
INSERT operation, 268–270 libpmemobj library, 283
new handler instance, 265 main() function, 285
SELECT operation, 275, 276 std::mutex, 282
UPDATE operation, 270, 271 synchronization primitives, 282–284
storage layer, 264
memkind API functions, 159
fixed-size heap creation, 160, 161 N
kind creation, 160 ndctl and daxctl, installation
kind detection, 162, 163 Linux distribution package repository
memory kind detection API, 163 PMDK on Fedora 22, 393
variable size heap creation, 162 PMDK on RHEL and
memkind_config structure, 161 CentOS, 393–394
memkind_create_pmem() PMDK on SLES 12 and
function, 160, 169 OpenSUSE, 394
memkind_create_pmem_with_config() PMDK on Ubuntu 18.04, 394
function, 161, 164 searching, packages, 391–392
memkind_destroy_kind() prerequisites, 389–390
function, 164 ndctl utility, 376
memkind_detect_kind() function, 163 Network interface controller (NIC), 349
memkind_free() function, 166 Non-maskable interrupt (NMI), 18
memkind library, 156, 157 Nonuniform memory access
memkind_realloc() functions, 165 (NUMA), 65, 156
Memory automatic balancing, 382, 383
Q Reliability, availability,
serviceability (RAS), 27
Queue implementation, 126, 128
Remote direct memory access
QuickPath Interconnect (QPI)/Ultra Path
(RDMA), 12, 347, 348
Interconnect (UPI), 305
software architecture, 357, 358
remote_open routine, 368, 369
R remove() method, 150
RDMA networking protocols Resource acquisition is initialization
commands, 350 (RAII), 113
NIC, 349 rpmem_drain(), 363
RDMA Read, 350 rpmem_flush(), 363
RDMA Send (and Receive), 350, 351
RDMA Write, 350
RDMA over Converged Ethernet S
(RoCE), 348 show() method, 131
Redis, 143 Single instruction, multiple data (SIMD)
Redo logging, 320 processing, 190
Reliability, Availability, and Single linked list (SLL), 266
Serviceability (RAS) Snapshotting optimization, 196
device health SNIA NVM programming
ACPI NFIT, 343 model, 351
ACPI specification, 342 Solid-state disk (SSD), 1, 156
unsafe/dirty shutdown, 343 Stack/buffer overflow bug, 208
using ndctl to query, 340, 341 Stackoverflow app, 211
vendors, 342 Standard Template Library
ECC (STL), 168, 282, 287
infinite loop, 334
MCE, 334 Storage and Networking Industry
using Linux, 335, 336 Association (SNIA), 33, 112
unconsumed uncorrectable error symmetric multiprocessing (SMP)
handling system, 373
ARS, 338
clearing errors, 339
memory root-device notification, 338 T
patrol scrub, 337
runtime, 337 Transactions and multithreading
unsafe/dirty shutdown, counter, 281
DLC counter, 344, 345 illustrative execution, 281