summaryrefslogtreecommitdiff
path: root/src/backend/storage/file
AgeCommit message (Collapse)Author
2023-04-08Add io_direct setting (developer-only).Thomas Munro
Provide a way to ask the kernel to use O_DIRECT (or local equivalent) where available for data and WAL files, to avoid or minimize kernel caching. This hurts performance currently and is not intended for end users yet. Later proposed work would introduce our own I/O clustering, read-ahead, etc to replace the facilities the kernel disables with this option. The only user-visible change, if the developer-only GUC is not used, is that this commit also removes the obscure logic that would activate O_DIRECT for the WAL when wal_sync_method=open_[data]sync and wal_level=minimal (which also requires max_wal_senders=0). Those are non-default and unlikely settings, and this behavior wasn't (correctly) documented. The same effect can be achieved with io_direct=wal. Author: Thomas Munro <[email protected]> Author: Andres Freund <[email protected]> Author: Bharath Rupireddy <[email protected]> Reviewed-by: Justin Pryzby <[email protected]> Reviewed-by: Bharath Rupireddy <[email protected]> Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
2023-04-08Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.Thomas Munro
In order to have the option to use O_DIRECT/FILE_FLAG_NO_BUFFERING in a later commit, we need the addresses of user space buffers to be well aligned. The exact requirements vary by OS and file system (typically sectors and/or memory pages). The address alignment size is set to 4096, which is enough for currently known systems: it matches modern sectors and common memory page size. There is no standard governing O_DIRECT's requirements so we might eventually have to reconsider this with more information from the field or future systems. Aligning I/O buffers on memory pages is also known to improve regular buffered I/O performance. Three classes of I/O buffers for regular data pages are adjusted: (1) Heap buffers are now allocated with the new palloc_aligned() or MemoryContextAllocAligned() functions introduced by commit 439f6175. (2) Stack buffers now use a new struct PGIOAlignedBlock to respect PG_IO_ALIGN_SIZE, if possible with this compiler. (3) The buffer pool is also aligned in shared memory. WAL buffers were already aligned on XLOG_BLCKSZ. It's possible for XLOG_BLCKSZ to be configured smaller than PG_IO_ALIGNED_SIZE and thus for O_DIRECT WAL writes to fail to be well aligned, but that's a pre-existing condition and will be addressed by a later commit. BufFiles are not yet addressed (there's no current plan to use O_DIRECT for those, but they could potentially get some incidental speedup even in plain buffered I/O operations through better alignment). If we can't align stack objects suitably using the compiler extensions we know about, we disable the use of O_DIRECT by setting PG_O_DIRECT to 0. This avoids the need to consider systems that have O_DIRECT but can't align stack objects the way we want; such systems could in theory be supported with more work but we don't currently know of any such machines, so it's easier to pretend there is no O_DIRECT support instead. That's an existing and tested class of system. Add assertions that all buffers passed into smgrread(), smgrwrite() and smgrextend() are correctly aligned, unless PG_O_DIRECT is 0 (= stack alignment tricks may be unavailable) or the block size has been set too small to allow arrays of buffers to be all aligned. Author: Thomas Munro <[email protected]> Author: Andres Freund <[email protected]> Reviewed-by: Justin Pryzby <[email protected]> Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
2023-04-05Add smgrzeroextend(), FileZero(), FileFallocate()Andres Freund
smgrzeroextend() uses FileFallocate() to efficiently extend files by multiple blocks. When extending by a small number of blocks, use FileZero() instead, as using posix_fallocate() for small numbers of blocks is inefficient for some file systems / operating systems. FileZero() is also used as the fallback for FileFallocate() on platforms / filesystems that don't support fallocate. A big advantage of using posix_fallocate() is that it typically won't cause dirty buffers in the kernel pagecache. So far the most common pattern in our code is that we smgrextend() a page full of zeroes and put the corresponding page into shared buffers, from where we later write out the actual contents of the page. If the kernel, e.g. due to memory pressure or elapsed time, already wrote back the all-zeroes page, this can lead to doubling the amount of writes reaching storage. There are no users of smgrzeroextend() as of this commit. That will follow in future commits. Reviewed-by: Melanie Plageman <[email protected]> Reviewed-by: Heikki Linnakangas <[email protected]> Reviewed-by: Kyotaro Horiguchi <[email protected]> Reviewed-by: David Rowley <[email protected]> Reviewed-by: John Naylor <[email protected]> Discussion: https://postgr.es/m/[email protected]
2023-03-30pg_stat_wal: Accumulate time as instr_time instead of microsecondsAndres Freund
In instr_time.h it is stated that: * When summing multiple measurements, it's recommended to leave the * running sum in instr_time form (ie, use INSTR_TIME_ADD or * INSTR_TIME_ACCUM_DIFF) and convert to a result format only at the end. The reason for that is that converting to microseconds is not cheap, and can loose precision. Therefore this commit changes 'PendingWalStats' to use 'instr_time' instead of 'PgStat_Counter' while accumulating 'wal_write_time' and 'wal_sync_time'. Author: Nazir Bilal Yavuz <[email protected]> Reviewed-by: Andres Freund <[email protected]> Reviewed-by: Kyotaro Horiguchi <[email protected]> Reviewed-by: Melanie Plageman <[email protected]> Discussion: https://postgr.es/m/[email protected]
2023-03-30Fix format code in fd.c debugging infrastructureAndres Freund
These were not sufficiently adjusted in 2d4f1ba6cfc.
2023-03-29Simplify useless 0L constantsPeter Eisentraut
In ancient times, these belonged to arguments or fields that were actually of type long, but now they are not anymore, so this "L" decoration is just confusing. (Some other 0L and other "L" constants remain, where they are actually associated with a long type.)
2023-03-02Don't leak descriptors into subprograms.Thomas Munro
Open long-lived data and WAL file descriptors with O_CLOEXEC. This flag was introduced by SUSv4 (POSIX.1-2008), and by now all of our target Unix systems have it. Our open() implementation for Windows already had that behavior, so provide a dummy O_CLOEXEC flag on that platform. For now, callers of open() and the "thin" wrappers in fd.c that deal in raw descriptors need to pass in O_CLOEXEC explicitly if desired. This commit does that for WAL files, and automatically for everything accessed via VFDs including SMgrRelation and BufFile. (With more discussion we might decide to turn it on automatically for the thin open()-wrappers too to avoid risk of missing places that need it, but these are typically used for short-lived descriptors where we don't expect to fork/exec, and it's remotely possible that extensions could be using these APIs and passing descriptors to subprograms deliberately, so that hasn't been done here.) Do the same for sockets and the postmaster pipe with FD_CLOEXEC. (Later commits might use modern interfaces to remove these extra fcntl() calls and more where possible, but we'll need them as a fallback for a couple of systems, so do it that way in this initial commit.) With this change, subprograms executed for archiving, copying etc will no longer have access to the server's descriptors, other than the ones that we decide to pass down. Reviewed-by: Andres Freund <[email protected]> (earlier version) Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
2023-02-21Remove obsolete coding for early macOS.Thomas Munro
Commits 04cad8f7 and 0c088568 supported old macOS systems that didn't define O_CLOEXEC or O_DSYNC yet, but those arrived in macOS releases 10.7 and 10.6 (respectively), which themselves reached EOL around a decade ago. We've already made use of other POSIX features that early macOS vintages can't compile (for example commits 623cc673, d2e15083). A later commit will use O_CLOEXEC on POSIX systems so it would be strange to pretend here that it's optional, and we might as well give O_DSYNC the same treatment since the reference is also guarded by a test for a macOS-specific macro, and we know that current Macs have it. Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
2023-01-21Zero initialize uses of instr_time about to trigger compiler warningsAndres Freund
These are all not necessary from a correctness POV. However, in the near future instr_time will be simplified to an int64, at which point gcc would otherwise start to warn about the changed places. Reviewed-by: Tom Lane <[email protected]> Discussion: https://postgr.es/m/[email protected]
2023-01-17Constify the arguments of copydir.h functionsMichael Paquier
This makes sure that the internal logic of these functions does not attempt to change the value of the arguments constified, and it removes one unconstify() in basic_archive.c. Author: Nathan Bossart Reviewed-by: Andrew Dunstan, Peter Eisentraut Discussion: https://postgr.es/m/20230114231126.GA2580330@nathanxps13
2023-01-16Add BufFileRead variants with short read and EOF detectionPeter Eisentraut
Most callers of BufFileRead() want to check whether they read the full specified length. Checking this at every call site is very tedious. This patch provides additional variants BufFileReadExact() and BufFileReadMaybeEOF() that include the length checks. I considered changing BufFileRead() itself, but this function is also used in extensions, and so changing the behavior like this would create a lot of problems there. The new names are analogous to the existing LogicalTapeReadExact(). Reviewed-by: Amit Kapila <[email protected]> Discussion: https://www.postgresql.org/message-id/flat/[email protected]
2023-01-06Fix pg_truncate() on Windows.Thomas Munro
Commit 57faaf376 added pg_truncate(const char *path, off_t length), but "length" was ignored under WIN32 and the file was unconditionally truncated to 0. There was no live bug, since the only caller passes 0. Fix, and back-patch to 14 where the function arrived. Author: Justin Pryzby <[email protected]> Discussion: https://postgr.es/m/20230106031652.GR3109%40telsasoft.com
2023-01-02Update copyright for 2023Bruce Momjian
Backpatch-through: 11
2022-12-30Add const to BufFileWritePeter Eisentraut
Make data buffer argument to BufFileWrite a const pointer and bubble this up to various callers and related APIs. This makes the APIs clearer and more consistent. Discussion: https://www.postgresql.org/message-id/flat/11dda853-bb5b-59ba-a746-e168b1ce4bdb%40enterprisedb.com
2022-12-30Remove unnecessary castsPeter Eisentraut
Some code carefully cast all data buffer arguments for data write and read function calls to void *, even though the respective arguments are already void *. Remove this unnecessary clutter. Discussion: https://www.postgresql.org/message-id/flat/11dda853-bb5b-59ba-a746-e168b1ce4bdb%40enterprisedb.com
2022-12-20Add copyright notices to meson filesAndrew Dunstan
Discussion: https://postgr.es/m/[email protected]
2022-12-08Update types in File APIPeter Eisentraut
Make the argument types of the File API match stdio better: - Change the data buffer to void *, from char *. - Change FileWrite() data buffer to const on top of that. - Change amounts to size_t, from int. In passing, change the FilePrefetch() amount argument from int to off_t, to match the underlying posix_fadvise(). Discussion: https://www.postgresql.org/message-id/flat/11dda853-bb5b-59ba-a746-e168b1ce4bdb%40enterprisedb.com
2022-11-05Remove unneeded includes of <sys/stat.h>Michael Paquier
Since bfb9dfd, none of the files updated in this commit have any stat() calls, so these inclusions are not necessary, for the same reasons as 233cf6e. Per discussion with John Naylor. Discussion: https://postgr.es/m/CAFBsxsGGGX7KD6RxbNoSJzuSc8Gz3hOxcfhTOMLB_hJcm68dKQ@mail.gmail.com
2022-10-27Move pg_pwritev_with_retry() to src/common/file_utils.cMichael Paquier
This commit moves pg_pwritev_with_retry(), a convenience wrapper of pg_writev() able to handle partial writes, to common/file_utils.c so that the frontend code is able to use it. A first use-case targetted for this routine is pg_basebackup and pg_receivewal, for the zero-padding of a newly-initialized WAL segment. This is used currently in the backend when the GUC wal_init_zero is enabled (default). Author: Bharath Rupireddy Reviewed-by: Nathan Bossart, Thomas Munro Discussion: https://postgr.es/m/CALj2ACUq7nAb7=bJNbK3yYmp-SZhJcXFR_pLk8un6XgDzDF3OA@mail.gmail.com
2022-09-29Restore pg_pread and friends.Thomas Munro
Commits cf112c12 and a0dc8271 were a little too hasty in getting rid of the pg_ prefixes where we use pread(), pwrite() and vectored variants. We dropped support for ancient Unixes where we needed to use lseek() to implement replacements for those, but it turns out that Windows also changes the current position even when you pass in an offset to ReadFile() and WriteFile() if the file handle is synchronous, despite its documentation saying otherwise. Switching to asynchronous file handles would fix that, but have other complications. For now let's just put back the pg_ prefix and add some comments to highlight the non-standard side-effect, which we can now describe as Windows-only. Reported-by: Bharath Rupireddy <[email protected]> Reviewed-by: Bharath Rupireddy <[email protected]> Discussion: https://postgr.es/m/20220923202439.GA1156054%40nathanxps13
2022-09-28Revert 56-bit relfilenode change and follow-up commits.Robert Haas
There are still some alignment-related failures in the buildfarm, which might or might not be able to be fixed quickly, but I've also just realized that it increased the size of many WAL records by 4 bytes because a block reference contains a RelFileLocator. The effect of that hasn't been studied or discussed, so revert for now.
2022-09-27Increase width of RelFileNumbers from 32 bits to 56 bits.Robert Haas
RelFileNumbers are now assigned using a separate counter, instead of being assigned from the OID counter. This counter never wraps around: if all 2^56 possible RelFileNumbers are used, an internal error occurs. As the cluster is limited to 2^64 total bytes of WAL, this limitation should not cause a problem in practice. If the counter were 64 bits wide rather than 56 bits wide, we would need to increase the width of the BufferTag, which might adversely impact buffer lookup performance. Also, this lets us use bigint for pg_class.relfilenode and other places where these values are exposed at the SQL level without worrying about overflow. This should remove the need to keep "tombstone" files around until the next checkpoint when relations are removed. We do that to keep RelFileNumbers from being recycled, but now that won't happen anyway. However, this patch doesn't actually change anything in this area; it just makes it possible for a future patch to do so. Dilip Kumar, based on an idea from Andres Freund, who also reviewed some earlier versions of the patch. Further review and some wordsmithing by me. Also reviewed at various points by Ashutosh Sharma, Vignesh C, Amul Sul, Álvaro Herrera, and Tom Lane. Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
2022-09-22meson: Add initial version of meson based build systemAndres Freund
Autoconf is showing its age, fewer and fewer contributors know how to wrangle it. Recursive make has a lot of hard to resolve dependency issues and slow incremental rebuilds. Our home-grown MSVC build system is hard to maintain for developers not using Windows and runs tests serially. While these and other issues could individually be addressed with incremental improvements, together they seem best addressed by moving to a more modern build system. After evaluating different build system choices, we chose to use meson, to a good degree based on the adoption by other open source projects. We decided that it's more realistic to commit a relatively early version of the new build system and mature it in tree. This commit adds an initial version of a meson based build system. It supports building postgres on at least AIX, FreeBSD, Linux, macOS, NetBSD, OpenBSD, Solaris and Windows (however only gcc is supported on aix, solaris). For Windows/MSVC postgres can now be built with ninja (faster, particularly for incremental builds) and msbuild (supporting the visual studio GUI, but building slower). Several aspects (e.g. Windows rc file generation, PGXS compatibility, LLVM bitcode generation, documentation adjustments) are done in subsequent commits requiring further review. Other aspects (e.g. not installing test-only extensions) are not yet addressed. When building on Windows with msbuild, builds are slower when using a visual studio version older than 2019, because those versions do not support MultiToolTask, required by meson for intra-target parallelism. The plan is to remove the MSVC specific build system in src/tools/msvc soon after reaching feature parity. However, we're not planning to remove the autoconf/make build system in the near future. Likely we're going to keep at least the parts required for PGXS to keep working around until all supported versions build with meson. Some initial help for postgres developers is at https://wiki.postgresql.org/wiki/Meson With contributions from Thomas Munro, John Naylor, Stone Tickle and others. Author: Andres Freund <[email protected]> Author: Nazir Bilal Yavuz <[email protected]> Author: Peter Eisentraut <[email protected]> Reviewed-By: Peter Eisentraut <[email protected]> Discussion: https://postgr.es/m/[email protected]
2022-09-21Improve some error messages.Amit Kapila
It is not our usual style to use "we" in the error messages. Author: Kyotaro Horiguchi Reviewed-By: Amit Kapila Discussion: https://postgr.es/m/[email protected]
2022-09-20Harmonize parameter names in storage and AM code.Peter Geoghegan
Make sure that function declarations use names that exactly match the corresponding names from function definitions in storage, catalog, access method, executor, and logical replication code, as well as in miscellaneous utility/library code. Like other recent commits that cleaned up function parameter names, this commit was written with help from clang-tidy. Later commits will do the same for other parts of the codebase. Author: Peter Geoghegan <[email protected]> Reviewed-By: David Rowley <[email protected]> Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
2022-09-02Expand the use of get_dirent_type(), shaving a few calls to stat()/lstat()Michael Paquier
Several backend-side loops scanning one or more directories with ReadDir() (WAL segment recycle/removal in xlog.c, backend-side directory copy, temporary file removal, configuration file parsing, some logical decoding logic and some pgtz stuff) already know the type of the entry being scanned thanks to the dirent structure associated to the entry, on platforms where we know about DT_REG, DT_DIR and DT_LNK to make the difference between a regular file, a directory and a symbolic link. Relying on the direct structure of an entry saves a few system calls to stat() and lstat() in the loops updated here, shaving some code while on it. The logic of the code remains the same, calling stat() or lstat() depending on if it is necessary to look through symlinks. Authors: Nathan Bossart, Bharath Rupireddy Reviewed-by: Andres Freund, Thomas Munro, Michael Paquier Discussion: https://postgr.es/m/CALj2ACV8n-J-f=yiLUOx2=HrQGPSOZM3nWzyQQvLPcccPXxEdg@mail.gmail.com
2022-08-29Clean up inconsistent use of fflush().Tom Lane
More than twenty years ago (79fcde48b), we hacked the postmaster to avoid a core-dump on systems that didn't support fflush(NULL). We've mostly, though not completely, hewed to that rule ever since. But such systems are surely gone in the wild, so in the spirit of cleaning out no-longer-needed portability hacks let's get rid of multiple per-file fflush() calls in favor of using fflush(NULL). Also, we were fairly inconsistent about whether to fflush() before popen() and system() calls. While we've received no bug reports about that, it seems likely that at least some of these call sites are at risk of odd behavior, such as error messages appearing in an unexpected order. Rather than expend a lot of brain cells figuring out which places are at hazard, let's just establish a uniform coding rule that we should fflush(NULL) before these calls. A no-op fflush() is surely of trivial cost compared to launching a sub-process via a shell; while if it's not a no-op then we likely need it. Discussion: https://postgr.es/m/[email protected]
2022-08-13Remove configure probe for sys/resource.h and refactor.Thomas Munro
<sys/resource.h> is in SUSv2 and is on all targeted Unix systems. We have a replacement for getrusage() on Windows, so let's just move its declarations into src/include/port/win32/sys/resource.h so that we can use a standard-looking #include. Also remove an obsolete reference to CLK_TCK. Also rename src/port/getrusage.c to win32getrusage.c, following the convention for Windows-only fallback code. Reviewed-by: Tom Lane <[email protected]> Discussion: https://postgr.es/m/CA%2BhUKG%2BL_3brvh%3D8e0BW_VfX9h7MtwgN%3DnFHP5o7X2oZucY9dg%40mail.gmail.com
2022-08-06Replace pgwin32_is_junction() with lstat().Thomas Munro
Now that lstat() reports junction points with S_IFLNK/S_ISLINK(), and unlink() can unlink them, there is no need for conditional code for Windows in a few places. That was expressed by testing for WIN32 or S_ISLNK, which we can now constant-fold. The coding around pgwin32_is_junction() was a bit suspect anyway, as we never checked for errors, and we also know that errors can be spuriously reported because of transient sharing violations on this OS. The lstat()-based code has handling for that. This also reverts 4fc6b6ee on master only. That was done because lstat() didn't previously work for symlinks (junction points), but now it does. Tested-by: Andrew Dunstan <[email protected]> Discussion: https://postgr.es/m/CA%2BhUKGLfOOeyZpm5ByVcAt7x5Pn-%3DxGRNCvgiUPVVzjFLtnY0w%40mail.gmail.com
2022-08-05Remove configure probe for fdatasync.Thomas Munro
fdatasync() is in SUSv2, and all targeted Unix systems have it. We have a replacement function for Windows. We retain the probe for the function declaration, which allows us to supply the mysteriously missing declaration for macOS, and also for Windows. No need to keep a HAVE_FDATASYNC macro around. Also rename src/port/fdatasync.c to win32fdatasync.c since it's only for Windows. Reviewed-by: Tom Lane <[email protected]> Reviewed-by: Andres Freund <[email protected]> Discussion: https://postgr.es/m/CA+hUKGJ3LHeP9w5Fgzdr4G8AnEtJ=z=p6hGDEm4qYGEUX5B6fQ@mail.gmail.com Discussion: https://postgr.es/m/CA%2BhUKGJZJVO%3DiX%2Beb-PXi2_XS9ZRqnn_4URh0NUQOwt6-_51xQ%40mail.gmail.com
2022-08-05Simplify replacement code for preadv and pwritev.Thomas Munro
preadv() and pwritev() are not standardized by POSIX, but appeared in NetBSD in 1999 and were adopted by at least OpenBSD, FreeBSD, DragonFlyBSD, Linux, AIX, illumos and macOS. We don't use them much yet, but an active proposal uses them heavily. In 15, we had two replacement implementations for other OSes: one based on lseek() + -v function if available for true vector I/O, and the other based on a loop over p- function. The former would be an obstacle to hypothetical future multi-threaded code sharing file descriptors, while the latter would not, since commit cf112c12. Furthermore, the number of targeted systems that could benefit from the former's potential upside has dwindled to just one niche OS, since macOS added the functions and we de-supported HP-UX. That doesn't seem like a good trade-off. Therefore, drop the lseek()-based variant, and also the pg_ prefix now that the file position portability hazard is gone. At the time of writing, the only systems in our build farm that lack native preadv/pwritev and thus use fallback code are: * Solaris (but not illumos) * macOS before release 11.0 * Windows With this commit, the above systems will now use the *same* fallback code, the version that loops over pread()/pwrite(). Windows already used that (though a later proposal may include true vector I/O for Windows), so this decision really only affects Solaris, until it gets around to adding these system calls. Also remove some useless includes while here. Reviewed-by: Andres Freund <[email protected]> Discussion: https://postgr.es/m/CA+hUKGJ3LHeP9w5Fgzdr4G8AnEtJ=z=p6hGDEm4qYGEUX5B6fQ@mail.gmail.com
2022-08-04Remove dead pread and pwrite replacement code.Thomas Munro
pread() and pwrite() are in SUSv2, and all targeted Unix systems have them. Previously, we defined pg_pread and pg_pwrite to emulate these function with lseek() on old Unixen. The names with a pg_ prefix were a reminder of a portability hazard: they might change the current file position. That hazard is gone, so we can drop the prefixes. Since the remaining replacement code is Windows-only, move it into src/port/win32p{read,write}.c, and move the declarations into src/include/port/win32_port.h. No need for vestigial HAVE_PREAD, HAVE_PWRITE macros as they were only used for declarations in port.h which have now moved into win32_port.h. Reviewed-by: Tom Lane <[email protected]> Reviewed-by: Greg Stark <[email protected]> Reviewed-by: Robert Haas <[email protected]> Reviewed-by: Andres Freund <[email protected]> Discussion: https://postgr.es/m/CA+hUKGJ3LHeP9w5Fgzdr4G8AnEtJ=z=p6hGDEm4qYGEUX5B6fQ@mail.gmail.com
2022-08-04Remove configure probe and related tests for getrlimit.Thomas Munro
getrlimit() is in SUSv2 and all targeted systems have it. Windows doesn't have it. We could just use #ifndef WIN32, but for a little more explanation about why we're making things conditional, let's retain the HAVE_GETRLIMIT macro. It's defined in port.h for Unix systems. On systems that have it, it's not necessary to test for RLIMIT_CORE, RLIMIT_STACK or RLIMIT_NOFILE macros, since SUSv2 requires those and all targeted systems have them. Also remove references to a pre-historic alternative spelling of RLIMIT_NOFILE, and coding that seemed to believe that Cygwin didn't have it. Reviewed-by: Tom Lane <[email protected]> Reviewed-by: Andres Freund <[email protected]> Discussion: https://postgr.es/m/CA+hUKGJ3LHeP9w5Fgzdr4G8AnEtJ=z=p6hGDEm4qYGEUX5B6fQ@mail.gmail.com
2022-07-28Clean up some residual confusion between OIDs and RelFileNumbers.Robert Haas
Commit b0a55e43299c4ea2a9a8c757f9c26352407d0ccc missed a few places where we are referring to the number used as a part of the relation filename as an "OID". We now want to call that a "RelFileNumber". Some of these places actually made it sound like the OID in question is pg_class.oid rather than pg_class.relfilenode, which is especially good to clean up. Dilip Kumar with some editing by me.
2022-07-05Remove durable_rename_excl()Michael Paquier
A previous commit replaced all the calls to this function with durable_rename() as of dac1ff3, making it used nowhere in the tree. Using it in extension code is also risky based on the issues described in this previous commit, so let's remove it. This makes possible the removal of HAVE_WORKING_LINK. Author: Nathan Bossart Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier Discussion: https://postgr.es/m/20220407182954.GA1231544@nathanxps13
2022-04-28Revert recent changes with durable_rename_excl()Michael Paquier
This reverts commits 2c902bb and ccfbd92. Per buildfarm members kestrel, rorqual and calliphoridae, the assertions checking that a TLI history file should not exist when created by a WAL receiver have been failing, and switching to durable_rename() over durable_rename_excl() would cause the newest TLI history file to overwrite the existing one. We need to think harder about such cases, so revert the new logic for now. Note that all the failures have been reported in the test 025_stuck_on_old_timeline. Discussion: https://postgr.es/m/[email protected]
2022-04-28Remove durable_rename_excl()Michael Paquier
ccfbd92 has replaced all existing in-core callers of this function in favor of durable_rename(). durable_rename_excl() is by nature unsafe on crashes happening at the wrong time, so just remove it. Author: Nathan Bossart Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier Discussion: https://postgr.es/m/20220407182954.GA1231544@nathanxps13
2022-04-13Add missing spaces after single-line commentsDavid Rowley
Only 1 of 3 of these changes appear to be handled by pgindent. That change is new to v15. The remaining two appear to be left alone by pgindent. The exact reason for that is not 100% clear to me. It seems related to the fact that it's a line that contains *only* a single line comment and no actual code. It does not seem worth investigating this in too much detail. In any case, these do not conform to our usual practices, so fix them. Author: Justin Pryzby Discussion: https://postgr.es/m/[email protected]
2022-04-08Track I/O timing for temporary file blocks in EXPLAIN (BUFFERS)Michael Paquier
Previously, the output of EXPLAIN (BUFFERS) option showed only the I/O timing spent reading and writing shared and local buffers. This commit adds on top of that the I/O timing for temporary buffers in the output of EXPLAIN (for spilled external sorts, hashes, materialization. etc). This can be helpful for users in cases where the I/O related to temporary buffers is the bottleneck. Like its cousin, this information is available only when track_io_timing is enabled. Playing the patch, this is showing an extra overhead of up to 1% even when using gettimeofday() as implementation for interval timings, which is slightly within the usual range noise still that's measurable. Author: Masahiko Sawada Reviewed-by: Georgios Kokolatos, Melanie Plageman, Julien Rouhaud, Ranier Vilela Discussion: https://postgr.es/m/CAD21AoAJgotTeP83p6HiAGDhs_9Fw9pZ2J=_tYTsiO5Ob-V5GQ@mail.gmail.com
2022-01-08Update copyright for 2022Bruce Momjian
Backpatch-through: 10
2021-11-29Replace random(), pg_erand48(), etc with a better PRNG API and algorithm.Tom Lane
Standardize on xoroshiro128** as our basic PRNG algorithm, eliminating a bunch of platform dependencies as well as fundamentally-obsolete PRNG code. In addition, this API replacement will ease replacing the algorithm again in future, should that become necessary. xoroshiro128** is a few percent slower than the drand48 family, but it can produce full-width 64-bit random values not only 48-bit, and it should be much more trustworthy. It's likely to be noticeably faster than the platform's random(), depending on which platform you are thinking about; and we can have non-global state vectors easily, unlike with random(). It is not cryptographically strong, but neither are the functions it replaces. Fabien Coelho, reviewed by Dean Rasheed, Aleksander Alekseev, and myself Discussion: https://postgr.es/m/alpine.DEB.2.22.394.2105241211230.165418@pseudo
2021-10-25Report progress of startup operations that take a long time.Robert Haas
Users sometimes get concerned whe they start the server and it emits a few messages and then doesn't emit any more messages for a long time. Generally, what's happening is either that the system is taking a long time to apply WAL, or it's taking a long time to reset unlogged relations, or it's taking a long time to fsync the data directory, but it's not easy to tell which is the case. To fix that, add a new 'log_startup_progress_interval' setting, by default 10s. When an operation that is known to be potentially long-running takes more than this amount of time, we'll log a status update each time this interval elapses. To avoid undesirable log chatter, don't log anything about WAL replay when in standby mode. Nitin Jadhav and Robert Haas, reviewed by Amul Sul, Bharath Rupireddy, Justin Pryzby, Michael Paquier, and Álvaro Herrera. Discussion: https://postgr.es/m/CA+TgmoaHQrgDFOBwgY16XCoMtXxsrVGFB2jNCvb7-ubuEe1MGg@mail.gmail.com Discussion: https://postgr.es/m/CAMm1aWaHF7VE69572_OLQ+MgpT5RUiUDgF1x5RrtkJBLdpRj3Q@mail.gmail.com
2021-09-08Clean up some code using "(expr) ? true : false"Michael Paquier
All the code paths simplified here were already using a boolean or used an expression that led to zero or one, making the extra bits unnecessary. Author: Justin Pryzby Reviewed-by: Tom Lane, Michael Paquier, Peter Smith Discussion: https://postgr.es/m/[email protected]
2021-09-02In count_usable_fds(), duplicate stderr not stdin.Tom Lane
We had a complaint that the postmaster fails to start if the invoking program closes stdin. That happens because count_usable_fds expects to be able to dup(0), and if it can't, we conclude there are no free FDs and go belly-up. So far as I can find, though, there is no other place in the server that touches stdin, and it's not unreasonable to expect that a daemon wouldn't use that file. As a simple improvement, let's dup FD 2 (stderr) instead. Unlike stdin, it *is* reasonable for us to expect that stderr be open; even if we are configured not to touch it, common libraries such as libc might try to write error messages there. Per gripe from Mario Emmenlauer. Given the lack of previous complaints, I'm not excited about pushing this into stable branches, but it seems OK to squeeze it into v14. Discussion: https://postgr.es/m/[email protected]
2021-09-02Optimize fileset usage in apply worker.Amit Kapila
Use one fileset for the entire worker lifetime instead of using separate filesets for each streaming transaction. Now, the changes/subxacts files for every streaming transaction will be created under the same fileset and the files will be deleted after the transaction is completed. This patch extends the BufFileOpenFileSet and BufFileDeleteFileSet APIs to allow users to specify whether to give an error on missing files. Author: Dilip Kumar, based on suggestion by Thomas Munro Reviewed-by: Hou Zhijie, Masahiko Sawada, Amit Kapila Discussion: https://postgr.es/m/[email protected]
2021-08-30Refactor sharedfileset.c to separate out fileset implementation.Amit Kapila
Move fileset related implementation out of sharedfileset.c to allow its usage by backends that don't want to share filesets among different processes. After this split, fileset infrastructure is used by both sharedfileset.c and worker.c for the named temporary files that survive across transactions. Author: Dilip Kumar, based on suggestion by Andres Freund Reviewed-by: Hou Zhijie, Masahiko Sawada, Amit Kapila Discussion: https://postgr.es/m/[email protected]
2021-08-08Move temporary file cleanup to before_shmem_exit().Andres Freund
As reported by a few OSX buildfarm animals there exist at least one path where temporary files exist during AtProcExit_Files() processing. As temporary file cleanup causes pgstat reporting, the assertions added in ee3f8d3d3ae caused failures. This is not an OSX specific issue, we were just lucky that timing on OSX reliably triggered the problem. The known way to cause this is a FATAL error during perform_base_backup() with a MANIFEST used - adding an elog(FATAL) after InitializeBackupManifest() reliably reproduces the problem in isolation. The problem is that the temporary file created in InitializeBackupManifest() is not cleaned up via resource owner cleanup as WalSndResourceCleanup() currently is only used for non-FATAL errors. That then allows to reach AtProcExit_Files() with existing temporary files, causing the assertion failure. To fix this problem, move temporary file cleanup to a before_shmem_exit() hook and add assertions ensuring that no temporary files are created before / after temporary file management has been initialized / shut down. The cleanest way to do so seems to be to split fd.c initialization into two, one for plain file access and one for temporary file access. Right now there's no need to perform further fd.c cleanup during process exit, so I just renamed AtProcExit_Files() to BeforeShmemExit_Files(). Alternatively we could perform another pass through the files to check that no temporary files exist, but the added assertions seem to provide enough protection against that. It might turn out that the assertions added in ee3f8d3d3ae will cause too much noise - in that case we'll have to downgrade them to a WARNING, at least temporarily. This commit is not necessarily the best approach to address this issue, but it should resolve the buildfarm failures. We can revise later. Author: Andres Freund <[email protected]> Discussion: https://postgr.es/m/[email protected]
2021-07-19Don't use #if inside function-like macro arguments.Thomas Munro
No concrete problem reported, but in the past it's been known to cause problems on some compilers so let's avoid doing that. Reported-by: Tom Lane <[email protected]> Discussion: https://postgr.es/m/234364.1626704007%40sss.pgh.pa.us
2021-07-19Adjust commit 2dbe8905 for ancient macOS.Thomas Munro
A couple of open flags used in an assertion didn't exist in macOS 10.4. Per build farm animal prairiedog. Also add O_EXCL while here (there are a few more standard flags but they're not relevant and likely to be missing).
2021-07-18Support direct I/O on macOS.Thomas Munro
Macs don't understand O_DIRECT, but they can disable caching with a separate fcntl() call. Extend the file opening functions in fd.c to handle this for us if the caller passes in PG_O_DIRECT. For now, this affects only WAL data and even then only if you set: max_wal_senders=0 wal_level=minimal This is not expected to be very useful on its own, but later proposed patches will make greater use of direct I/O, and it'll be useful for testing if developers on Macs can see the effects. Reviewed-by: Andres Freund <[email protected]> Discussion: https://postgr.es/m/CA%2BhUKG%2BADiyyHe0cun2wfT%2BSVnFVqNYPxoO6J9zcZkVO7%2BNGig%40mail.gmail.com