
Berkeley Software Architecture Manual 4.4BSD Edition

This document summarizes the facilities provided by the 4.4BSD version of the UNIX operating system. It describes the basic kernel functions available to processes, including process naming and protection, memory management, signals, timing functions, descriptors, and resource controls. It also describes standard system abstractions like files and file systems, communication, terminals, and process control implemented by the operating system or server processes.


Berkeley Software Architecture Manual
4.4BSD Edition

William Joy, Robert Fabry, Samuel Leffler, M. Kirk McKusick, Michael Karels
Computer Systems Research Group
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94720

ABSTRACT

This document summarizes the facilities provided by the 4.4BSD version of the UNIX* operating system. It does not attempt to act as a tutorial for use of the system nor does it attempt to explain or justify the design of the system facilities. It gives neither motivation nor implementation details, in favor of brevity.

The first section describes the basic kernel functions provided to a UNIX process: process naming and protection, memory management, software interrupts, object references (descriptors), time and statistics functions, and resource controls. These facilities, as well as facilities for bootstrap, shutdown and process accounting, are provided solely by the kernel.

The second section describes the standard system abstractions for files and file systems, communication, terminal handling, and process control and debugging. These facilities are implemented by the operating system or by network server processes.

* UNIX is a trademark of Bell Laboratories.

PSD:5-2

4.4BSD Architecture Manual

TABLE OF CONTENTS

Introduction.
0. Notation and types
1. Kernel primitives
  1.1. Processes and protection
    1.1.1. Host and process identifiers
    1.1.2. Process creation and termination
    1.1.3. User and group ids
    1.1.4. Process groups
  1.2. Memory management
    1.2.1. Text, data and stack
    1.2.2. Mapping pages
    1.2.3. Page protection control
    1.2.4. Giving and getting advice
    1.2.5. Protection primitives
  1.3. Signals
    1.3.1. Overview
    1.3.2. Signal types
    1.3.3. Signal handlers
    1.3.4. Sending signals
    1.3.5. Protecting critical sections
    1.3.6. Signal stacks
  1.4. Timing and statistics
    1.4.1. Real time
    1.4.2. Interval time
  1.5. Descriptors
    1.5.1. The reference table
    1.5.2. Descriptor properties
    1.5.3. Managing descriptor references
    1.5.4. Multiplexing requests
    1.5.5. Descriptor wrapping
  1.6. Resource controls
    1.6.1. Process priorities
    1.6.2. Resource utilization
    1.6.3. Resource limits
  1.7. System operation support
    1.7.1. Bootstrap operations
    1.7.2. Shutdown operations
    1.7.3. Accounting
2. System facilities
  2.1. Generic operations
    2.1.1. Read and write
    2.1.2. Input/output control
    2.1.3. Non-blocking and asynchronous operations
  2.2. File system
    2.2.1. Overview
    2.2.2. Naming
    2.2.3. Creation and removal
      2.2.3.1. Directory creation and removal
      2.2.3.2. File creation
      2.2.3.3. Creating references to devices
      2.2.3.4. Portal creation
      2.2.3.6. File, device, and portal removal
    2.2.4. Reading and modifying file attributes
    2.2.5. Links and renaming
    2.2.6. Extension and truncation
    2.2.7. Checking accessibility
    2.2.8. Locking
    2.2.9. Disc quotas
  2.3. Interprocess communication
    2.3.1. Interprocess communication primitives
      2.3.1.1. Communication domains
      2.3.1.2. Socket types and protocols
      2.3.1.3. Socket creation, naming and service establishment
      2.3.1.4. Accepting connections
      2.3.1.5. Making connections
      2.3.1.6. Sending and receiving data
      2.3.1.7. Scatter/gather and exchanging access rights
      2.3.1.8. Using read and write with sockets
      2.3.1.9. Shutting down halves of full-duplex connections
      2.3.1.10. Socket and protocol options
    2.3.2. UNIX domain
      2.3.2.1. Types of sockets
      2.3.2.2. Naming
      2.3.2.3. Access rights transmission
    2.3.3. INTERNET domain
      2.3.3.1. Socket types and protocols
      2.3.3.2. Socket naming
      2.3.3.3. Access rights transmission
      2.3.3.4. Raw access
  2.4. Terminals and devices
    2.4.1. Terminals
      2.4.1.1. Terminal input
        2.4.1.1.1. Input modes
        2.4.1.1.2. Interrupt characters
        2.4.1.1.3. Line editing
      2.4.1.2. Terminal output
      2.4.1.3. Terminal control operations
      2.4.1.4. Terminal hardware support
    2.4.2. Structured devices
    2.4.3. Unstructured devices
  2.5. Process control and debugging
I. Summary of facilities


0. Notation and types

The notation used to describe system calls is a variant of a C language call, consisting of a prototype call followed by declaration of parameters and results. An additional keyword result, not part of the normal C language, is used to indicate which of the declared entities receive results. As an example, consider the read call, as described in section 2.1:

    cc = read(fd, buf, nbytes);
    result int cc; int fd; result char *buf; int nbytes;

The first line shows how the read routine is called, with three parameters. As shown on the second line cc is an integer and read also returns information in the parameter buf.

Description of all error conditions arising from each system call is not provided here; they appear in the programmer's manual. In particular, when accessed from the C language, many calls return a characteristic -1 value when an error occurs, returning the error code in the global variable errno. Other languages may present errors in different ways.

A number of system standard types are defined in the include file <sys/types.h> and used in the specifications here and in many C programs. These include caddr_t giving a memory address (typically as a character pointer), off_t giving a file offset (typically as a long integer), and a set of unsigned types u_char, u_short, u_int and u_long, shorthand names for unsigned char, unsigned short, etc.
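The error convention described above can be illustrated with a short C sketch. The wrapper name try_read and its err parameter are hypothetical, introduced only for this example; read, errno and the -1 return value are as described in the text. The modern types ssize_t and size_t are assumed in place of the int types shown in the notation.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * try_read is a hypothetical wrapper, not part of the system interface.
 * It issues cc = read(fd, buf, nbytes) and reports failure the way the
 * text describes: a return of -1, with the error code taken from errno.
 */
ssize_t try_read(int fd, char *buf, size_t nbytes, int *err)
{
    ssize_t cc = read(fd, buf, nbytes);

    /* errno is meaningful only when the call has reported an error. */
    *err = (cc == -1) ? errno : 0;
    return cc;
}
```

Called with an invalid descriptor such as -1, try_read returns -1 and stores EBADF in *err.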


1. Kernel primitives

The facilities available to a UNIX user process are logically divided into two parts: kernel facilities directly implemented by UNIX code running in the operating system, and system facilities implemented either by the system, or in cooperation with a server process. These kernel facilities are described in this section (section 1).

The facilities implemented in the kernel are those which define the UNIX virtual machine in which each process runs. Like many real machines, this virtual machine has memory management hardware, an interrupt facility, timers and counters. The UNIX virtual machine also allows access to files and other objects through a set of descriptors. Each descriptor resembles a device controller, and supports a set of operations. Like devices on real machines, some of which are internal to the machine and some of which are external, parts of the descriptor machinery are built-in to the operating system, while other parts are often implemented in server processes on other machines. The facilities provided through the descriptor machinery are described in section 2.


1.1. Processes and protection

1.1.1. Host and process identifiers

Each UNIX host has associated with it a 32-bit host id, and a host name of up to 64 characters (as defined by MAXHOSTNAMELEN in <sys/param.h>). These are set (by a privileged user) and returned by the calls:

    sethostid(hostid)
    long hostid;

    hostid = gethostid();
    result long hostid;

    sethostname(name, len)
    char *name; int len;

    len = gethostname(buf, buflen)
    result int len; result char *buf; int buflen;

On each host runs a set of processes. Each process is largely independent of other processes, having its own protection domain, address space, timers, and an independent set of references to system or user implemented objects.

Each process in a host is named by an integer called the process id. This number is in the range 1-30000 and is returned by the getpid routine:

    pid = getpid();
    result int pid;

On each UNIX host this identifier is guaranteed to be unique; in a multi-host environment, the (hostid, process id) pairs are guaranteed unique.

1.1.2. Process creation and termination

A new process is created by making a logical duplicate of an existing process:

    pid = fork();
    result int pid;

The fork call returns twice, once in the parent process, where pid is the process identifier of the child, and once in the child process where pid is 0. The parent-child relationship induces a hierarchical structure on the set of processes in the system.

A process may terminate by executing an exit call:

    exit(status)
    int status;

returning 8 bits of exit status to its parent.

When a child process exits or terminates abnormally, the parent process receives information about any event which caused termination of the child process. A second call provides a non-blocking interface and may also be used to retrieve information about resources consumed by the process during its lifetime.


    #include <sys/wait.h>

    pid = wait(astatus);
    result int pid; result union wait *astatus;

    pid = wait3(astatus, options, arusage);
    result int pid; result union waitstatus *astatus;
    int options; result struct rusage *arusage;

A process can overlay itself with the memory image of another process, passing the newly created process a set of parameters, using the call:

    execve(name, argv, envp)
    char *name, **argv, **envp;

The specified name must be a file which is in a format recognized by the system, either a binary executable file or a file which causes the execution of a specified interpreter program to process its contents.

1.1.3. User and group ids

Each process in the system has associated with it two user-ids: a real user id and an effective user id, both 16 bit unsigned integers (type uid_t). Each process has a real accounting group id and an effective accounting group id and a set of access group ids. The group ids are 16 bit unsigned integers (type gid_t). Each process may be in several different access groups, with the maximum concurrent number of access groups a system compilation parameter, the constant NGROUPS in the file <sys/param.h>, guaranteed to be at least 8.

The real and effective user ids associated with a process are returned by:

    ruid = getuid();
    result uid_t ruid;

    euid = geteuid();
    result uid_t euid;

the real and effective accounting group ids by:

    rgid = getgid();
    result gid_t rgid;

    egid = getegid();
    result gid_t egid;

The access group id set is returned by a getgroups call*:

    ngroups = getgroups(gidsetsize, gidset);
    result int ngroups; int gidsetsize; result int gidset[gidsetsize];

The user and group ids are assigned at login time using the setreuid, setregid, and setgroups calls:

* The type of the gidset array in getgroups and setgroups remains integer for compatibility with 4.2BSD. It may change to gid_t in future releases.


    setreuid(ruid, euid);
    int ruid, euid;

    setregid(rgid, egid);
    int rgid, egid;

    setgroups(gidsetsize, gidset)
    int gidsetsize; int gidset[gidsetsize];

The setreuid call sets both the real and effective user-ids, while the setregid call sets both the real and effective accounting group ids. Unless the caller is the super-user, ruid must be equal to either the current real or effective user-id, and rgid equal to either the current real or effective accounting group id. The setgroups call is restricted to the super-user.

1.1.4. Process groups

Each process in the system is also normally associated with a process group. The group of processes in a process group is sometimes referred to as a job and manipulated by high-level system software (such as the shell). The current process group of a process is returned by the getpgrp call:

    pgrp = getpgrp(pid);
    result int pgrp; int pid;

When a process is in a specific process group it may receive software interrupts affecting the group, causing the group to suspend or resume execution or to be interrupted or terminated. In particular, a system terminal has a process group and only processes which are in the process group of the terminal may read from the terminal, allowing arbitration of terminals among several different jobs.

The process group associated with a process may be changed by the setpgrp call:

    setpgrp(pid, pgrp);
    int pid, pgrp;

Newly created processes are assigned process ids distinct from all processes and process groups, and the same process group as their parent. A normal (unprivileged) process may set its process group equal to its process id. A privileged process may set the process group of any process to any value.
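As an illustration of the process creation and termination calls described earlier in this section, the following C sketch forks a child, has it exit with a given status, and collects that status in the parent. The helper name run_child is hypothetical. The sketch assumes a POSIX system and uses waitpid with the WIFEXITED and WEXITSTATUS macros rather than the union wait form shown above.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * run_child is a hypothetical helper. It forks a child that exits with
 * `status`, waits for it, and returns the exit status seen by the parent,
 * or -1 if the fork or wait fails or the child did not exit normally.
 */
int run_child(int status)
{
    int wstatus;
    pid_t pid = fork();

    if (pid == -1)
        return -1;              /* no child was created */
    if (pid == 0)
        _exit(status);          /* child: fork returned 0 */

    /* Parent: fork returned the process id of the child. */
    if (waitpid(pid, &wstatus, 0) != pid)
        return -1;
    if (!WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);    /* only the low 8 bits are returned */
}
```

Because only 8 bits of exit status reach the parent, run_child(0x107) returns 7, the same value as run_child(7).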


1.2. Memory management

1.2.1. Text, data and stack

Each process begins execution with three logical areas of memory called text, data and stack. The text area is read-only and shared, while the data and stack areas are private to the process. Both the data and stack areas may be extended and contracted on program request. The call

    addr = sbrk(incr);
    result caddr_t addr; int incr;

changes the size of the data area by incr bytes and returns the new end of the data area, while

    addr = sstk(incr);
    result caddr_t addr; int incr;

changes the size of the stack area. The stack area is also automatically extended as needed. On the VAX the text and data areas are adjacent in the P0 region, while the stack section is in the P1 region, and grows downward.

1.2.2. Mapping pages

The system supports sharing of data between processes by allowing pages to be mapped into memory. These mapped pages may be shared with other processes or private to the process. Protection and sharing options are defined in <sys/mman.h> as:

    /* protections are chosen from these bits, or-ed together */
    #define PROT_READ         0x04    /* pages can be read */
    #define PROT_WRITE        0x02    /* pages can be written */
    #define PROT_EXEC         0x01    /* pages can be executed */

    /* flags contain mapping type, sharing type and options */
    /* mapping type; choose one */
    #define MAP_FILE          0x0001  /* mapped from a file or device */
    #define MAP_ANON          0x0002  /* allocated from memory, swap space */
    #define MAP_TYPE          0x000f  /* mask for type field */

    /* sharing types; choose one */
    #define MAP_SHARED        0x0010  /* share changes */
    #define MAP_PRIVATE       0x0000  /* changes are private */

    /* other flags */
    #define MAP_FIXED         0x0020  /* map addr must be exactly as requested */
    #define MAP_INHERIT       0x0040  /* region is retained after exec */
    #define MAP_HASSEMAPHORE  0x0080  /* region may contain semaphores */
    #define MAP_NOPREALLOC    0x0100  /* do not preallocate space */

The cpu-dependent size of a page is returned by the getpagesize system call:

    pagesize = getpagesize();
    result int pagesize;

The call:

    maddr = mmap(addr, len, prot, flags, fd, pos);
    result caddr_t maddr; caddr_t addr; int *len, prot, flags, fd; off_t pos;

causes the pages starting at addr and continuing for at most len bytes to be mapped from the object represented by descriptor fd, starting at byte offset pos. The starting address of the region is returned; for the convenience of the system, it may differ from that supplied unless the MAP_FIXED flag is given, in which case the exact address will be used or the call will fail. The actual amount mapped is returned in len. The addr, len, and pos parameters must all be multiples of the pagesize. A successful mmap will delete any previous mapping in the allocated address range.

The parameter prot specifies the accessibility of the mapped pages. The parameter flags specifies the type of object to be mapped, mapping options, and whether modifications made to this mapped copy of the page are to be kept private, or are to be shared with other references. Possible types include MAP_FILE, mapping a regular file or character-special device memory, and MAP_ANON, which maps memory not associated with any specific file. The file descriptor used for creating MAP_ANON regions is used only for naming, and may be given as -1 if no name is associated with the region.

The MAP_INHERIT flag allows a region to be inherited after an exec. The MAP_HASSEMAPHORE flag allows special handling for regions that may contain semaphores. The MAP_NOPREALLOC flag allows processes to allocate regions whose virtual address space, if fully allocated, would exceed the available memory plus swap resources. Processes using such regions may get a SIGSEGV signal if they page fault and resources are not available to service their request; typically they would free up some resources via munmap so that when they return from the signal the page fault could be successfully completed.

This section represents the interface planned for later releases of the system. Of the calls described in this section, only sbrk and getpagesize are included in 4.3BSD.

A facility is provided to synchronize a mapped region with the file it maps; the call

    msync(addr, len);
    caddr_t addr; int len;

writes any modified pages back to the filesystem and updates the file modification time.
If len is 0, all modified pages within the region containing addr will be flushed; if len is non-zero, only the pages containing addr and len succeeding locations will be examined. Any required synchronization of memory caches will also take place at this time. Filesystem operations on a file that is mapped for shared modifications are unpredictable except after an msync.

A mapping can be removed by the call

    munmap(addr, len);
    caddr_t addr; int len;

This call deletes the mappings for the specified address range, and causes further references to addresses within the range to generate invalid memory references.

1.2.3. Page protection control

A process can control the protection of pages using the call

    mprotect(addr, len, prot);
    caddr_t addr; int len, prot;

This call changes the specified pages to have protection prot. Not all implementations will guarantee protection on a page basis; the granularity of protection changes may be as large as an entire region.

1.2.4. Giving and getting advice

A process that has knowledge of its memory behavior may use the madvise call:

    madvise(addr, len, behav);
    caddr_t addr; int len, behav;

Behav describes expected behavior, as given in <sys/mman.h>:

    #define MADV_NORMAL      0    /* no further special treatment */
    #define MADV_RANDOM      1    /* expect random page references */
    #define MADV_SEQUENTIAL  2    /* expect sequential references */
    #define MADV_WILLNEED    3    /* will need these pages */
    #define MADV_DONTNEED    4    /* don't need these pages */
    #define MADV_SPACEAVAIL  5    /* insure that resources are reserved */

The current design does not allow a process to specify the location of swap space. In the future we may define an additional mapping type, MAP_SWAP, in which the file descriptor argument specifies a file or device to which swapping should be done.

Finally, a process may obtain information about whether pages are core resident by using the call

    mincore(addr, len, vec)
    caddr_t addr; int len; result char *vec;

Here the current core residency of the pages is returned in the character array vec, with a value of 1 meaning that the page is in-core.

1.2.5. Synchronization primitives

Primitives are provided for synchronization using semaphores in shared memory. Semaphores must lie within a MAP_SHARED region with at least modes PROT_READ and PROT_WRITE. The MAP_HASSEMAPHORE flag must have been specified when the region was created. To acquire a lock a process calls:

    value = mset(sem, wait)
    result int value; semaphore *sem; int wait;

Mset indivisibly tests and sets the semaphore sem. If the previous value is zero, the process has acquired the lock and mset returns true immediately. Otherwise, if the wait flag is zero, failure is returned. If wait is true and the previous value is non-zero, mset relinquishes the processor until notified that it should retry.

To release a lock a process calls:

    mclear(sem)
    semaphore *sem;

Mclear indivisibly tests and clears the semaphore sem. If the WANT flag is zero in the previous value, mclear returns immediately. If the WANT flag is non-zero in the previous value, mclear arranges for waiting processes to retry before returning.

Two routines provide services analogous to the kernel sleep and wakeup functions interpreted in the domain of shared memory. A process may relinquish the processor by calling msleep with a set semaphore:

    msleep(sem)
    semaphore *sem;

If the semaphore is still set when it is checked by the kernel, the process will be put in a sleeping state until some other process issues an mwakeup for the same semaphore within the region using the call:

    mwakeup(sem)
    semaphore *sem;

An mwakeup may awaken all sleepers on the semaphore, or may awaken only the next sleeper on a queue.
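The mapping calls of this section can be tried out with a short C sketch. It assumes the POSIX forms of the calls, which differ from the planned interface described here: mmap takes the length by value, msync takes a third flags argument (MS_SYNC), and sysconf(_SC_PAGESIZE) stands in for getpagesize. The function name map_file_roundtrip and the scratch file path are hypothetical.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * map_file_roundtrip is a hypothetical helper. It maps one page of a
 * scratch file with shared modifications, stores a string through the
 * mapping, synchronizes it, and reads the bytes back with pread.
 * Returns 0 if the file contents match, -1 otherwise.
 */
int map_file_roundtrip(void)
{
    char path[] = "/tmp/mmapdemoXXXXXX";    /* assumed scratch location */
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int result = -1;
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;
    unlink(path);                   /* file disappears when fd is closed */

    if (ftruncate(fd, (off_t)len) == 0) {
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            char back[6];

            memcpy(p, "mapped", 6); /* modify the shared mapping */

            /* After msync the file reflects the modified pages. */
            if (msync(p, len, MS_SYNC) == 0 &&
                pread(fd, back, sizeof back, 0) == (ssize_t)sizeof back &&
                memcmp(back, "mapped", sizeof back) == 0)
                result = 0;

            munmap(p, len);         /* remove the mapping */
        }
    }
    close(fd);
    return result;
}
```

The read through the descriptor is placed after the msync because, as noted above, filesystem operations on a file mapped for shared modifications are unpredictable until the region has been synchronized.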


1.3. Signals

1.3.1. Overview

The system defines a set of signals that may be delivered to a process. Signal delivery resembles the occurrence of a hardware interrupt: the signal is blocked from further occurrence, the current process context is saved, and a new one is built. A process may specify the handler to which a signal is delivered, or specify that the signal is to be blocked or ignored. A process may also specify that a default action is to be taken when signals occur.

Some signals will cause a process to exit when they are not caught. This may be accompanied by creation of a core image file, containing the current memory image of the process for use in post-mortem debugging. A process may choose to have signals delivered on a special stack, so that sophisticated software stack manipulations are possible.

All signals have the same priority. If multiple signals are pending simultaneously, the order in which they are delivered to a process is implementation specific. Signal routines execute with the signal that caused their invocation blocked, but other signals may yet occur. Mechanisms are provided whereby critical sections of code may protect themselves against the occurrence of specified signals.

1.3.2. Signal types

The signals defined by the system fall into one of five classes: hardware conditions, software conditions, input/output notification, process control, or resource control. The set of signals is defined in the file <signal.h>.

Hardware signals are derived from exceptional conditions which may occur during execution. Such signals include SIGFPE representing floating point and other arithmetic exceptions, SIGILL for illegal instruction execution, SIGSEGV for addresses outside the currently assigned area of memory, and SIGBUS for accesses that violate memory protection constraints. Other, more cpu-specific hardware signals exist, such as those for the various customer-reserved instructions on the VAX (SIGIOT, SIGEMT, and SIGTRAP).

Software signals reflect interrupts generated by user request: SIGINT for the normal interrupt signal; SIGQUIT for the more powerful quit signal, that normally causes a core image to be generated; SIGHUP and SIGTERM that cause graceful process termination, either because a user has hung up, or by user or program request; and SIGKILL, a more powerful termination signal which a process cannot catch or ignore. Programs may define their own asynchronous events using SIGUSR1 and SIGUSR2. Other software signals (SIGALRM, SIGVTALRM, SIGPROF) indicate the expiration of interval timers.

A process can request notification via a SIGIO signal when input or output is possible on a descriptor, or when a non-blocking operation completes. A process may request to receive a SIGURG signal when an urgent condition arises.

A process may be stopped by a signal sent to it or the members of its process group. The SIGSTOP signal is a powerful stop signal, because it cannot be caught. Other stop signals SIGTSTP, SIGTTIN, and SIGTTOU are used when a user request, input request, or output request respectively is the reason for stopping the process. A SIGCONT signal is sent to a process when it is continued from a stopped state. Processes may receive notification with a SIGCHLD signal when a child process changes state, either by stopping or by terminating.

Exceeding resource limits may cause signals to be generated. SIGXCPU occurs when a process nears its CPU time limit and SIGXFSZ warns that the limit on file size creation has been reached.

1.3.3. Signal handlers

A process has a handler associated with each signal. The handler controls the way the signal is delivered. The call


    #include <signal.h>

    struct sigvec {
        int    (*sv_handler)();
        int    sv_mask;
        int    sv_flags;
    };

    sigvec(signo, sv, osv)
    int signo; struct sigvec *sv; result struct sigvec *osv;

assigns interrupt handler address sv_handler to signal signo. Each handler address specifies either an interrupt routine for the signal, that the signal is to be ignored, or that a default action (usually process termination) is to occur if the signal occurs. The constants SIG_IGN and SIG_DFL used as values for sv_handler cause ignoring or defaulting of a condition. The sv_mask value specifies the signal mask to be used when the handler is invoked; it implicitly includes the signal which invoked the handler. Signal masks include one bit for each signal; the mask for a signal signo is provided by the macro sigmask(signo), from <signal.h>. Sv_flags specifies whether system calls should be restarted if the signal handler returns and whether the handler should operate on the normal run-time stack or a special signal stack (see below). If osv is non-zero, the previous signal vector is returned.

When a signal condition arises for a process, the signal is added to a set of signals pending for the process. If the signal is not currently blocked by the process then it will be delivered. The process of signal delivery adds the signal to be delivered and those signals specified in the associated signal handler's sv_mask to a set of those masked for the process, saves the current process context, and places the process in the context of the signal handling routine. The call is arranged so that if the signal handling routine exits normally the signal mask will be restored and the process will resume execution in the original context. If the process wishes to resume in a different context, then it must arrange to restore the signal mask itself.

The mask of blocked signals is independent of handlers for signals. It delays signals from being delivered much as a raised hardware interrupt priority level delays hardware interrupts. Preventing an interrupt from occurring by changing the handler is analogous to disabling a device from further interrupts.

The signal handling routine sv_handler is called by a C call of the form

    (*sv_handler)(signo, code, scp);
    int signo; long code; struct sigcontext *scp;

The signo gives the number of the signal that occurred, and the code, a word of information supplied by the hardware. The scp parameter is a pointer to a machine-dependent structure containing the information for restoring the context before the signal.

1.3.4. Sending signals

A process can send a signal to another process or group of processes with the calls:

    kill(pid, signo)
    int pid, signo;

    killpgrp(pgrp, signo)
    int pgrp, signo;

Unless the process sending the signal is privileged, it must have the same effective user id as the process receiving the signal.

Signals are also sent implicitly from a terminal device to the process group associated with the terminal when certain input characters are typed.


1.3.5. Protecting critical sections

To block a section of code against one or more signals, a sigblock call may be used to add a set of signals to the existing mask, returning the old mask:

    oldmask = sigblock(mask);
    result long oldmask; long mask;

The old mask can then be restored later with sigsetmask,

    oldmask = sigsetmask(mask);
    result long oldmask; long mask;

The sigblock call can be used to read the current mask by specifying an empty mask.

It is possible to check conditions with some signals blocked, and then to pause waiting for a signal and restoring the mask, by using:

    sigpause(mask);
    long mask;

1.3.6. Signal stacks

Applications that maintain complex or fixed size stacks can use the call

    struct sigstack {
        caddr_t    ss_sp;
        int        ss_onstack;
    };

    sigstack(ss, oss)
    struct sigstack *ss; result struct sigstack *oss;

to provide the system with a stack based at ss_sp for delivery of signals. The value ss_onstack indicates whether the process is currently on the signal stack, a notion maintained in software by the system.

When a signal is to be delivered, the system checks whether the process is on a signal stack. If not, then the process is switched to the signal stack for delivery, with the return from the signal arranged to restore the previous stack.

If the process wishes to take a non-local exit from the signal routine, or run code from the signal stack that uses a different stack, a sigstack call should be used to reset the signal stack.


1.4. Timers

1.4.1. Real time

The system's notion of the current Greenwich time and the current time zone is set and returned by the calls:

    #include <sys/time.h>

    settimeofday(tp, tzp);
    struct timeval *tp; struct timezone *tzp;

    gettimeofday(tp, tzp);
    result struct timeval *tp; result struct timezone *tzp;

where the structures are defined in <sys/time.h> as:

    struct timeval {
        long    tv_sec;            /* seconds since Jan 1, 1970 */
        long    tv_usec;           /* and microseconds */
    };

    struct timezone {
        int     tz_minuteswest;    /* of Greenwich */
        int     tz_dsttime;        /* type of dst correction to apply */
    };

The precision of the system clock is hardware dependent. Earlier versions of UNIX contained only a 1-second resolution version of this call, which remains as a library routine:

    time(tvsec)
    result long *tvsec;

returning only the tv_sec field from the gettimeofday call.

1.4.2. Interval time

The system provides each process with three interval timers, defined in <sys/time.h>:

    #define ITIMER_REAL     0    /* real time intervals */
    #define ITIMER_VIRTUAL  1    /* virtual time intervals */
    #define ITIMER_PROF     2    /* user and system virtual time */

The ITIMER_REAL timer decrements in real time. It could be used by a library routine to maintain a wakeup service queue. A SIGALRM signal is delivered when this timer expires.

The ITIMER_VIRTUAL timer decrements in process virtual time. It runs only when the process is executing. A SIGVTALRM signal is delivered when it expires.

The ITIMER_PROF timer decrements both in process virtual time and when the system is running on behalf of the process. It is designed to be used by processes to statistically profile their execution. A SIGPROF signal is delivered when it expires.

A timer value is defined by the itimerval structure:

    struct itimerval {
        struct timeval it_interval;    /* timer interval */
        struct timeval it_value;       /* current value */
    };


and a timer is set or read by the calls:

    getitimer(which, value);
    int which; result struct itimerval *value;

    setitimer(which, value, ovalue);
    int which; struct itimerval *value; result struct itimerval *ovalue;

The third argument to setitimer specifies an optional structure to receive the previous contents of the interval timer. A timer can be disabled by specifying a timer value of 0.

The system rounds argument timer intervals to be not less than the resolution of its clock. This clock resolution can be determined by loading a very small value into a timer and reading the timer back to see what value resulted.

The alarm system call of earlier versions of UNIX is provided as a library routine using the ITIMER_REAL timer. The process profiling facilities of earlier versions of UNIX remain because it is not always possible to guarantee the automatic restart of system calls after receipt of a signal. The profil call arranges for the kernel to begin gathering execution statistics for a process:

    profil(buf, bufsize, offset, scale);
    result char *buf; int bufsize, offset, scale;

This begins sampling of the program counter, with statistics maintained in the user-provided buffer.


2.5. Descriptors

2.5.1. The reference table

Each process has access to resources through descriptors. Each descriptor is a handle allowing the process to reference objects such as files, devices and communications links.

Rather than allowing processes direct access to descriptors, the system introduces a level of indirection, so that descriptors may be shared between processes. Each process has a descriptor reference table, containing pointers to the actual descriptors. The descriptors themselves thus have multiple references, and are reference counted by the system.

Each process has a fixed size descriptor reference table, where the size is returned by the getdtablesize call:

    nds = getdtablesize();
    result int nds;

and guaranteed to be at least 20. The entries in the descriptor reference table are referred to by small integers; for example if there are 20 slots they are numbered 0 to 19.

2.5.2. Descriptor properties

Each descriptor has a logical set of properties maintained by the system and defined by its type. Each type supports a set of operations; some operations, such as reading and writing, are common to several abstractions, while others are unique. The generic operations applying to many of these types are described in section 2.1. Naming contexts, files and directories are described in section 2.2. Section 2.3 describes communications domains and sockets. Terminals and (structured and unstructured) devices are described in section 2.4.

2.5.3. Managing descriptor references

A duplicate of a descriptor reference may be made by doing

    new = dup(old);
    result int new; int old;

returning a copy of descriptor reference old indistinguishable from the original. The value of new chosen by the system will be the smallest unused descriptor reference slot.
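The sharing of one descriptor through two references can be observed with a small sketch: after dup, moving the file offset through one reference is visible through the other. The function names are hypothetical, the scratch file under /tmp is an assumption about the environment, and SEEK_SET and SEEK_CUR are the POSIX names for the offset types.

```c
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns 1 if a dup'ed reference shares its file offset with the
 * original, 0 if not, -1 on error.
 */
int dup_shares_offset(void)
{
    char path[] = "/tmp/dupdemoXXXXXX";
    int fd, copy, shared;

    fd = mkstemp(path);             /* scratch file for the demonstration */
    if (fd < 0)
        return -1;
    unlink(path);                   /* file lives until its last close */

    if (write(fd, "hello", 5) != 5) {
        close(fd);
        return -1;
    }

    copy = dup(fd);                 /* second reference, same descriptor */
    if (copy < 0) {
        close(fd);
        return -1;
    }

    /* Move the offset through the copy, then read it via the original. */
    if (lseek(copy, 2, SEEK_SET) != 2) {
        close(copy);
        close(fd);
        return -1;
    }
    shared = (lseek(fd, 0, SEEK_CUR) == 2);

    close(copy);
    close(fd);
    return shared;
}

/* Returns 1 if the reference table has at least the guaranteed 20 slots. */
int table_has_minimum_slots(void)
{
    return getdtablesize() >= 20;
}
```

Closing one reference leaves the other usable; the underlying descriptor is released only when its last reference is closed.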
A copy of a descriptor reference may be made in a specific slot by doing

    dup2(old, new);
    int old, new;

The dup2 call causes the system to deallocate the descriptor reference currently occupying slot new, if any, replacing it with a reference to the same descriptor as old. This deallocation is also performed by:

    close(old);
    int old;

2.5.4. Multiplexing requests

The system provides a standard way to do synchronous and asynchronous multiplexing of operations. Synchronous multiplexing is performed by using the select call to examine the state of multiple descriptors simultaneously, and to wait for state changes on those descriptors. Sets of descriptors of interest are specified as bit masks, as follows:


    #include <sys/types.h>

    nds = select(nd, in, out, except, tvp);
    result int nds; int nd; result fd_set *in, *out, *except;
    struct timeval *tvp;

    FD_ZERO(&fdset);
    FD_SET(fd, &fdset);
    FD_CLR(fd, &fdset);
    FD_ISSET(fd, &fdset);
    int fd; fd_set fdset;

The select call examines the descriptors specified by the sets in, out and except, replacing the specified bit masks by the subsets that select true for input, output, and exceptional conditions respectively (nd indicates the number of file descriptors specified by the bit masks). If any descriptors meet the following criteria, then the number of such descriptors is returned in nds and the bit masks are updated.

•   A descriptor selects for input if an input oriented operation such as read or receive is possible, or if a connection request may be accepted (see section 2.3.1.4).

•   A descriptor selects for output if an output oriented operation such as write or send is possible, or if an operation that was in progress, such as connection establishment, has completed (see section 2.1.3).

•   A descriptor selects for an exceptional condition if a condition that would cause a SIGURG signal to be generated exists (see section 1.3.2), or other device-specific events have occurred.

If none of the specified conditions is true, the operation waits for one of the conditions to arise, blocking at most the amount of time specified by tvp. If tvp is given as 0, the select waits indefinitely.

Options affecting I/O on a descriptor may be read and set by the call:

    dopt = fcntl(d, cmd, arg)
    result int dopt; int d, cmd, arg;

    /* interesting values for cmd */
    #define F_SETFL     3    /* set descriptor options */
    #define F_GETFL     4    /* get descriptor options */
    #define F_SETOWN    5    /* set descriptor owner (pid/pgrp) */
    #define F_GETOWN    6    /* get descriptor owner (pid/pgrp) */

The F_SETFL cmd may be used to set a descriptor in non-blocking I/O mode and/or enable signaling when I/O is possible. F_SETOWN may be used to specify a process or process group to be signaled when using the latter mode of operation or when urgent indications arise.

Operations on non-blocking descriptors will either complete immediately, note an error EWOULDBLOCK, partially complete an input or output operation returning a partial count, or return an error EINPROGRESS noting that the requested operation is in progress. A descriptor which has signaling enabled will cause the specified process and/or process group to be signaled, with a SIGIO for input, output, or in-progress operation complete, or a SIGURG for exceptional conditions.

For example, when writing to a terminal using non-blocking output, the system will accept only as much data as there is buffer space for and return; when making a connection on a socket, the operation may return indicating that the connection establishment is in progress. The select facility can be used to determine when further output is possible on the terminal, or when the connection establishment attempt is complete.
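As a minimal sketch of the two mechanisms just described, the following code places one end of a pipe in non-blocking mode with fcntl and uses select with a timeout to learn when a read is possible. The function names are hypothetical. The sketch assumes the POSIX flag name O_NONBLOCK and the <sys/select.h> header of current systems, and it accepts either EWOULDBLOCK or EAGAIN as the "would block" indication.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Put descriptor d in non-blocking mode, preserving its other options. */
int set_nonblocking(int d)
{
    int flags = fcntl(d, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(d, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Wait at most `msec` milliseconds for d to select for input.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
int wait_readable(int d, long msec)
{
    fd_set in;
    struct timeval tv;
    int n;

    FD_ZERO(&in);
    FD_SET(d, &in);
    tv.tv_sec = msec / 1000;
    tv.tv_usec = (msec % 1000) * 1000;

    n = select(d + 1, &in, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    return n > 0 && FD_ISSET(d, &in);
}

/*
 * Demonstration on a pipe: an empty non-blocking pipe reports that a
 * read would block and does not select for input; after one byte is
 * written it does.  Returns 1 if all three observations hold.
 */
int select_demo(void)
{
    int p[2], would_block, before, after;
    char c;

    if (pipe(p) < 0)
        return -1;
    if (set_nonblocking(p[0]) < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    errno = 0;
    would_block = read(p[0], &c, 1) < 0 &&
        (errno == EWOULDBLOCK || errno == EAGAIN);
    before = wait_readable(p[0], 10);

    if (write(p[1], "x", 1) != 1)
        before = -1;                /* force a failing result */
    after = wait_readable(p[0], 10);

    close(p[0]);
    close(p[1]);
    return would_block && before == 0 && after == 1;
}
```

Passing a null timeout pointer to select instead of &tv would make wait_readable block indefinitely, matching the behavior described for a tvp of 0.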


2.5.5. Descriptor wrapping

A user process may build descriptors of a specified type by wrapping a communications channel with a system supplied protocol translator:

    new = wrap(old, proto)
    result int new; int old; struct dprop *proto;

Operations on the descriptor new are then translated by the system provided protocol translator into requests on the underlying object old in a way defined by the protocol. The protocols supported by the kernel may vary from system to system and are described in the programmer's manual.

Protocols may be based on communications multiplexing or a rights-passing style of handling multiple requests made on the same object. For instance, a protocol for implementing a file abstraction may or may not include locally generated read-ahead requests. A protocol that provides for read-ahead may provide higher performance but have a more difficult implementation.

Another example is the terminal driving facilities. Normally a terminal is associated with a communications line, and the terminal type and standard terminal access protocol are wrapped around a synchronous communications line and given to the user. If a virtual terminal is required, the terminal driver can be wrapped around a communications link, the other end of which is held by a virtual terminal protocol interpreter.

The facilities described in this section are not included in 4.3BSD.


2.6. Resource controls

2.6.1. Process priorities

The system gives CPU scheduling priority to processes that have not used CPU time recently. This tends to favor interactive processes and processes that execute only for short periods. It is possible to determine the priority currently assigned to a process, process group, or the processes of a specified user, or to alter this priority using the calls:

    #define PRIO_PROCESS    0    /* process */
    #define PRIO_PGRP       1    /* process group */
    #define PRIO_USER       2    /* user id */

    prio = getpriority(which, who);
    result int prio; int which, who;

    setpriority(which, who, prio);
    int which, who, prio;

The value prio is in the range -20 to 20. The default priority is 0; lower priorities cause more favorable execution. The getpriority call returns the highest priority (lowest numerical value) enjoyed by any of the specified processes. The setpriority call sets the priorities of all of the specified processes to the specified value. Only the super-user may lower priorities.

2.6.2. Resource utilization

The resources used by a process are returned by a getrusage call, returning information in a structure defined in <sys/resource.h>:

    #define RUSAGE_SELF        0    /* usage by this process */
    #define RUSAGE_CHILDREN    -1   /* usage by all children */

    getrusage(who, rusage)
    int who; result struct rusage *rusage;

    struct rusage {
            struct timeval ru_utime;    /* user time used */
            struct timeval ru_stime;    /* system time used */
            int     ru_maxrss;      /* maximum core resident set size: kbytes */
            int     ru_ixrss;       /* integral shared memory size (kbytes*sec) */
            int     ru_idrss;       /* unshared data memory size */
            int     ru_isrss;       /* unshared stack memory size */
            int     ru_minflt;      /* page-reclaims */
            int     ru_majflt;      /* page faults */
            int     ru_nswap;       /* swaps */
            int     ru_inblock;     /* block input operations */
            int     ru_oublock;     /* block output operations */
            int     ru_msgsnd;      /* messages sent */
            int     ru_msgrcv;      /* messages received */
            int     ru_nsignals;    /* signals received */
            int     ru_nvcsw;       /* voluntary context switches */
            int     ru_nivcsw;      /* involuntary context switches */
    };

The who parameter specifies whose resource usage is to be returned. The resources used by the current process, or by all the terminated children of the current process may be requested.
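The priority and resource utilization calls above can be sketched as follows. The helper names are hypothetical. The sketch relies on two conventions of current systems that this manual does not spell out: a who value of 0 with PRIO_PROCESS names the calling process, and, because -1 is a legitimate priority, errors from getpriority are detected by clearing errno beforehand.

```c
#include <errno.h>
#include <sys/resource.h>
#include <sys/time.h>

/*
 * Total CPU time (user plus system) consumed so far by this process,
 * in seconds.  Returns a negative value on error.
 */
double cpu_seconds_used(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) < 0)
        return -1.0;
    return (double)ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6 +
           (double)ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

/*
 * Store the scheduling priority of the calling process in *prio.
 * Returns 0 on success, -1 on error.
 */
int own_priority(int *prio)
{
    int p;

    errno = 0;                      /* -1 may be a valid priority */
    p = getpriority(PRIO_PROCESS, 0);
    if (p == -1 && errno != 0)
        return -1;
    *prio = p;
    return 0;
}
```

Requesting RUSAGE_CHILDREN instead of RUSAGE_SELF would report the usage of terminated children, as described above.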


2.6.3. Resource limits

The resources of a process for which limits are controlled by the kernel are defined in <sys/resource.h>, and controlled by the getrlimit and setrlimit calls:

    #define RLIMIT_CPU      0    /* cpu time in milliseconds */
    #define RLIMIT_FSIZE    1    /* maximum file size */
    #define RLIMIT_DATA     2    /* maximum data segment size */
    #define RLIMIT_STACK    3    /* maximum stack segment size */
    #define RLIMIT_CORE     4    /* maximum core file size */
    #define RLIMIT_RSS      5    /* maximum resident set size */

    #define RLIM_NLIMITS    6

    #define RLIM_INFINITY   0x7fffffff

    struct rlimit {
            int     rlim_cur;    /* current (soft) limit */
            int     rlim_max;    /* hard limit */
    };

    getrlimit(resource, rlp)
    int resource; result struct rlimit *rlp;

    setrlimit(resource, rlp)
    int resource; struct rlimit *rlp;

Only the super-user can raise the maximum limits. Other users may only alter rlim_cur within the range from 0 to rlim_max or (irreversibly) lower rlim_max.
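An unprivileged adjustment of a soft limit, as permitted by the rule above, might look like the following sketch. It lowers the soft limit on core file size to 0 while leaving the hard limit untouched, so the process can raise the soft limit again later. The function names are hypothetical.

```c
#include <sys/resource.h>

/*
 * Set the soft RLIMIT_CORE limit to 0, keeping the hard limit as is.
 * Returns 0 on success, -1 on error.
 */
int disable_core_files(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) < 0)
        return -1;
    rl.rlim_cur = 0;                /* always within 0..rlim_max */
    return setrlimit(RLIMIT_CORE, &rl);
}

/* Returns 1 if the soft RLIMIT_CORE limit is 0, 0 if not, -1 on error. */
int core_files_disabled(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) < 0)
        return -1;
    return rl.rlim_cur == 0;
}
```

Reading the current limits first and changing only rlim_cur avoids lowering rlim_max by accident, which could not be undone without privilege.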


2.7. System operation support


Unless noted otherwise, the calls in this section are permitted only to a privileged user.

2.7.1. Bootstrap operations

The call

    mount(blkdev, dir, ronly);
    char *blkdev, *dir; int ronly;

extends the UNIX name space. The mount call specifies a block device blkdev containing a UNIX file system to be made available starting at dir. If ronly is set then the file system is read-only; writes to the file system will not be permitted and access times will not be updated when files are referenced. Dir is normally a name in the root directory.

The call

    swapon(blkdev, size);
    char *blkdev; int size;

specifies a device to be made available for paging and swapping.

2.7.2. Shutdown operations

The call

    unmount(dir);
    char *dir;

unmounts the file system mounted on dir. This call will succeed only if the file system is not currently being used.

The call

    sync();

schedules input/output to clean all system buffer caches. (This call does not require privileged status.)

The call

    reboot(how)
    int how;

causes a machine halt or reboot. The call may request a reboot by specifying how as RB_AUTOBOOT, or that the machine be halted with RB_HALT. These constants are defined in <sys/reboot.h>.

2.7.3. Accounting

The system optionally keeps an accounting record in a file for each process that exits on the system. The format of this record is beyond the scope of this document. The accounting may be enabled to a file name by doing

    acct(path);
    char *path;

If path is null, then accounting is disabled. Otherwise, the named file becomes the accounting file.


3. System facilities

This section discusses the system facilities that are not considered part of the kernel. The system abstractions described are:

Directory contexts
    A directory context is a position in the UNIX file system name space. Operations on files and other named objects in a file system are always specified relative to such a context.

Files
    Files are used to store uninterpreted sequences of bytes on which random access reads and writes may occur. Pages from files may also be mapped into process address space. A directory may be read as a file.

Communications domains
    A communications domain represents an interprocess communications environment, such as the communications facilities of the UNIX system, communications in the INTERNET, or the resource sharing protocols and access rights of a resource sharing system on a local network.

Sockets
    A socket is an endpoint of communication and the focal point for IPC in a communications domain. Sockets may be created in pairs, or given names and used to rendezvous with other sockets in a communications domain, accepting connections from these sockets or exchanging messages with them. These operations model a labeled or unlabeled communications graph, and can be used in a wide variety of communications domains. Sockets can have different types to provide different semantics of communication, increasing the flexibility of the model.

Terminals and other devices
    Devices include terminals, providing input editing and interrupt generation and output flow control and editing, magnetic tapes, disks and other peripherals. They often support the generic read and write operations as well as a number of ioctls.

Processes
    Process descriptors provide facilities for control and debugging of other processes.

Support for mapping files is not included in the 4.3 release.


3.1. Generic operations

Many system abstractions support the operations read, write and ioctl. We describe the basics of these common primitives here. Similarly, the mechanisms whereby normally synchronous operations may occur in a non-blocking or asynchronous fashion are common to all system-defined abstractions and are described here.

3.1.1. Read and write

The read and write system calls can be applied to communications channels, files, terminals and devices. They have the form:

    cc = read(fd, buf, nbytes);
    result int cc; int fd; result caddr_t buf; int nbytes;

    cc = write(fd, buf, nbytes);
    result int cc; int fd; caddr_t buf; int nbytes;

The read call transfers as much data as possible from the object defined by fd to the buffer at address buf of size nbytes. The number of bytes transferred is returned in cc, which is -1 if a return occurred before any data was transferred because of an error or use of non-blocking operations.

The write call transfers data from the buffer to the object defined by fd. Depending on the type of fd, it is possible that the write call will accept only some portion of the provided bytes; the user should resubmit the other bytes in a later request in this case. Error returns because of interrupted or otherwise incomplete operations are possible.

Scattering of data on input or gathering of data for output is also possible using an array of input/output vector descriptors. The type for the descriptors is defined in <sys/uio.h> as:

    struct iovec {
            caddr_t iov_msg;    /* base of a component */
            int     iov_len;    /* length of a component */
    };

The calls using an array of descriptors are:

    cc = readv(fd, iov, iovlen);
    result int cc; int fd; struct iovec *iov; int iovlen;

    cc = writev(fd, iov, iovlen);
    result int cc; int fd; struct iovec *iov; int iovlen;

Here iovlen is the count of elements in the iov array.

3.1.2. Input/output control

Control operations on an object are performed by the ioctl operation:

    ioctl(fd, request, buffer);
    int fd, request; caddr_t buffer;

This operation causes the specified request to be performed on the object fd. The request parameter specifies whether the argument buffer is to be read, written, read and written, or is not needed, and also the size of the buffer, as well as the request. Different descriptor types and subtypes within descriptor types may use distinct ioctl requests. For example, operations on terminals control flushing of input and output queues and setting of terminal parameters; operations on disks cause formatting operations to occur; operations on tapes control tape positioning. The names for basic control operations are defined in <sys/ioctl.h>.
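The gather and scatter calls of section 3.1.1 can be illustrated with a short sketch that writes two separate buffers with one writev and reads the resulting byte stream back into two differently sized buffers with one readv. The function name is hypothetical, and the pipe is only a convenient in-process channel. Note that the <sys/uio.h> of current systems names the base field iov_base, which is what the sketch assumes.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Gather "hello, " and "world" into a single write on a pipe, then
 * scatter the 12 bytes into a 5-byte and a 7-byte buffer.
 * Returns 1 if the data round-trips intact, 0 if not, -1 on error.
 */
int gather_scatter_demo(void)
{
    int p[2], ok;
    char a[] = "hello, ", b[] = "world";
    char first[5], second[7];
    struct iovec out[2], in[2];
    ssize_t n;

    if (pipe(p) < 0)
        return -1;

    out[0].iov_base = a;      out[0].iov_len = 7;
    out[1].iov_base = b;      out[1].iov_len = 5;
    n = writev(p[1], out, 2);       /* one call, two source buffers */
    if (n != 12) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    in[0].iov_base = first;   in[0].iov_len = sizeof first;
    in[1].iov_base = second;  in[1].iov_len = sizeof second;
    n = readv(p[0], in, 2);         /* buffers are filled in order */

    ok = n == 12 &&
         memcmp(first, "hello", 5) == 0 &&
         memcmp(second, ", world", 7) == 0;

    close(p[0]);
    close(p[1]);
    return ok;
}
```

The split points on input need not match those used on output; the buffers in the array are simply consumed or filled in order.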


3.1.3. Non-blocking and asynchronous operations

A process that wishes to do non-blocking operations on one of its descriptors sets the descriptor in non-blocking mode as described in section 1.5.4. Thereafter the read call will return a specific EWOULDBLOCK error indication if there is no data to be read. The process may select the associated descriptor to determine when a read is possible.

Output attempted when a descriptor can accept less than is requested will either accept some of the provided data, returning a shorter than normal length, or return an error indicating that the operation would block. More output can be performed as soon as a select call indicates the object is writeable.

Operations other than data input or output may be performed on a descriptor in a non-blocking fashion. These operations will return with a characteristic error indicating that they are in progress if they cannot complete immediately. The descriptor may then be selected for write to find out when the operation has been completed. When select indicates the descriptor is writeable, the operation has completed. Depending on the nature of the descriptor and the operation, additional activity may be started or the new state may be tested.
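The output behavior described above suggests a loop of the following shape: resubmit whatever a short write did not accept, and when the write would block, wait with select until the descriptor is writeable. This is a minimal sketch with a hypothetical function name. It treats EAGAIN as equivalent to EWOULDBLOCK and retries on EINTR, both assumptions about current systems.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write all n bytes of buf to fd, which may be in non-blocking mode.
 * Returns n on success, -1 on error.
 */
ssize_t write_all(int fd, const char *buf, size_t n)
{
    size_t done = 0;
    ssize_t cc;
    fd_set out;

    while (done < n) {
        cc = write(fd, buf + done, n - done);
        if (cc >= 0) {
            done += (size_t)cc;     /* possibly a partial count */
            continue;
        }
        if (errno == EINTR)
            continue;               /* interrupted: simply retry */
        if (errno != EWOULDBLOCK && errno != EAGAIN)
            return -1;              /* a real error */

        /* The write would block: wait until fd selects for output. */
        FD_ZERO(&out);
        FD_SET(fd, &out);
        if (select(fd + 1, NULL, &out, NULL, NULL) < 0 && errno != EINTR)
            return -1;
    }
    return (ssize_t)done;
}
```

The same loop works unchanged on a blocking descriptor, where the select branch is simply never taken.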


3.2. File system

3.2.1. Overview

The file system abstraction provides access to a hierarchical file system structure. The file system contains directories (each of which may contain other sub-directories) as well as files and references to other objects such as devices and inter-process communications sockets.

Each file is organized as a linear array of bytes. No record boundaries or system related information is present in a file. Files may be read and written in a random-access fashion. The user may read the data in a directory as though it were an ordinary file to determine the names of the contained files, but only the system may write into the directories. The file system stores only a small amount of ownership, protection and usage information with a file.

3.2.2. Naming

The file system calls take path name arguments. These consist of zero or more component file names separated by / characters, where each file name is up to 255 ASCII characters excluding null and /.

Each process always has two naming contexts: one for the root directory of the file system and one for the current working directory. These are used by the system in the filename translation process. If a path name begins with a /, it is called a full path name and interpreted relative to the root directory context. If the path name does not begin with a / it is called a relative path name and interpreted relative to the current directory context.

The system limits the total length of a path name to 1024 characters.

The file name .. in each directory refers to the parent directory of that directory. The parent directory of the root of the file system is always that directory.

The calls

    chdir(path);
    char *path;

    chroot(path)
    char *path;

change the current working directory and root directory context of a process. Only the super-user can change the root directory context of a process.

3.2.3. Creation and removal

The file system allows directories, files, special devices, and portals to be created and removed from the file system.

3.2.3.1. Directory creation and removal

A directory is created with the mkdir system call:

    mkdir(path, mode);
    char *path; int mode;

where the mode is defined as for files (see below). Directories are removed with the rmdir system call:

    rmdir(path);
    char *path;

A directory must be empty if it is to be deleted.
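The rule that only empty directories may be removed can be checked with a small sketch. It creates a directory, places one file in it, confirms that rmdir is refused, then empties the directory and removes it. The function name and the scratch location under /tmp are hypothetical. The sketch accepts either ENOTEMPTY or EEXIST as the refusal, since current systems differ on which is reported.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if rmdir refuses a non-empty directory and succeeds once
 * the directory has been emptied, 0 if not, -1 on setup failure.
 */
int rmdir_requires_empty(void)
{
    char top[] = "/tmp/mkdirdemoXXXXXX";
    char dir[64], file[64];
    int fd, refused, removed;

    if (mkdtemp(top) == NULL)       /* private scratch area */
        return -1;
    snprintf(dir, sizeof dir, "%s/sub", top);
    snprintf(file, sizeof file, "%s/entry", dir);

    if (mkdir(dir, 0755) < 0) {
        rmdir(top);
        return -1;
    }
    fd = open(file, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        rmdir(dir);
        rmdir(top);
        return -1;
    }
    close(fd);

    /* The directory holds one entry, so removal must fail. */
    errno = 0;
    refused = rmdir(dir) < 0 && (errno == ENOTEMPTY || errno == EEXIST);

    /* Empty the directory; removal is now permitted. */
    unlink(file);
    removed = rmdir(dir) == 0;

    rmdir(top);
    return refused && removed;
}
```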


3.2.3.2. File creation

Files are created with the open system call,

    fd = open(path, oflag, mode);
    result int fd; char *path; int oflag, mode;

The path parameter specifies the name of the file to be created. The oflag parameter must include O_CREAT from below to cause the file to be created. Bits for oflag are defined in <sys/file.h>:

    #define O_RDONLY    000      /* open for reading */
    #define O_WRONLY    001      /* open for writing */
    #define O_RDWR      002      /* open for read & write */
    #define O_NDELAY    004      /* non-blocking open */
    #define O_APPEND    010      /* append on each write */
    #define O_CREAT     01000    /* open with file create */
    #define O_TRUNC     02000    /* open with truncation */
    #define O_EXCL      04000    /* error on create if file exists */

One of O_RDONLY, O_WRONLY and O_RDWR should be specified, indicating what types of operations are desired to be performed on the open file. The operations will be checked against the user's access rights to the file before allowing the open to succeed. Specifying O_APPEND causes writes to automatically append to the file. The flag O_CREAT causes the file to be created if it does not exist, owned by the current user and the group of the containing directory. The protection for the new file is specified in mode.

The file mode is used as a three digit octal number. Each digit encodes read access as 4, write access as 2 and execute access as 1, or'ed together. The 0700 bits describe owner access, the 070 bits describe the access rights for processes in the same group as the file, and the 07 bits describe the access rights for other processes.

If the open specifies to create the file with O_EXCL and the file already exists, then the open will fail without affecting the file in any way. This provides a simple exclusive access facility. If the file exists but is a symbolic link, the open will fail regardless of the existence of the file specified by the link.

3.2.3.3. Creating references to devices

The file system allows entries which reference peripheral devices. Peripherals are distinguished as block or character devices according to their ability to support block-oriented operations. Devices are identified by their major and minor device numbers. The major device number determines the kind of peripheral it is, while the minor device number indicates one of possibly many peripherals of that kind. Structured devices have all operations performed internally in block quantities while unstructured devices often have a number of special ioctl operations, and may have input and output performed in varying units. The mknod call creates special entries:

    mknod(path, mode, dev);
    char *path; int mode, dev;

where mode is formed from the object type and access permissions. The parameter dev is a configuration dependent parameter used to identify specific character or block I/O devices.

3.2.3.4. Portal creation

The call

    fd = portal(name, server, param, dtype, protocol, domain, socktype)
    result int fd; char *name, *server, *param; int dtype, protocol;
    int domain, socktype;

places a name in the file system name space that causes connection to a server process when the name is used. The portal call returns an active portal in fd as though an access had occurred to activate an inactive portal, as now described.

When an inactive portal is accessed, the system sets up a socket of the specified socktype in the specified communications domain (see section 2.3), and creates the server process, giving it the specified param as argument to help it identify the portal, and also giving it the newly created socket as descriptor number 0. The accessor of the portal will create a socket in the same domain and connect to the server. The user will then wrap the socket in the specified protocol to create an object of the required descriptor type dtype and proceed with the operation which was in progress before the portal was encountered.

While the server process holds the socket (which it received as fd from the portal call on descriptor 0 at activation) further references will result in connections being made to the same socket.

The portal call is not implemented in 4.3BSD.

3.2.3.5. File, device, and portal removal

A reference to a file, special device or portal may be removed with the unlink call,

    unlink(path);
    char *path;

The caller must have write access to the directory in which the file is located for this call to be successful.

3.2.4. Reading and modifying file attributes

Detailed information about the attributes of a file may be obtained with the calls:

    #include <sys/stat.h>

    stat(path, stb);
    char *path; result struct stat *stb;

    fstat(fd, stb);
    int fd; result struct stat *stb;

The stat structure includes the file type, protection, ownership, access times, size, and a count of hard links. If the file is a symbolic link, then the status of the link itself (rather than the file the link references) may be found using the lstat call:

    lstat(path, stb);
    char *path; result struct stat *stb;

Newly created files are assigned the user id of the process that created them and the group id of the directory in which they were created.
The ownership of a file may be changed by either of the calls

    chown(path, owner, group);
    char *path; int owner, group;

    fchown(fd, owner, group);
    int fd, owner, group;

In addition to ownership, each file has three levels of access protection associated with it. These levels are owner relative, group relative, and global (all users and groups). Each level of access has separate indicators for read permission, write permission, and execute permission. The protection bits associated with a file may be set by either of the calls:

    chmod(path, mode);
    char *path; int mode;

    fchmod(fd, mode);
    int fd, mode;

where mode is a value indicating the new protection of the file, as listed in section 2.2.3.2.
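The exclusive-create facility of the open call and the attribute calls above can be combined in one sketch. It creates a file with O_CREAT and O_EXCL, confirms that a second exclusive create of the same name fails, changes the protection with fchmod, and reads it back with fstat. The function name and the scratch location under /tmp are hypothetical, and EEXIST is assumed to be the error reported for an existing file, as on current systems.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if a second exclusive create fails with EEXIST and a
 * protection set with fchmod is reported back by fstat; 0 if not;
 * -1 on setup failure.
 */
int exclusive_create_demo(void)
{
    char dir[] = "/tmp/opendemoXXXXXX";
    char path[64];
    struct stat stb;
    int fd, second, ok;

    if (mkdtemp(dir) == NULL)       /* private scratch area */
        return -1;
    snprintf(path, sizeof path, "%s/lock", dir);

    /* First create succeeds: owner may read and write (0600). */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0) {
        rmdir(dir);
        return -1;
    }

    /* The name now exists, so an exclusive create must fail. */
    errno = 0;
    second = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    ok = second < 0 && errno == EEXIST;
    if (second >= 0)
        close(second);

    /* Restrict the file to owner read-only and read the result back. */
    if (fchmod(fd, 0400) < 0 || fstat(fd, &stb) < 0)
        ok = 0;
    else
        ok = ok && (stb.st_mode & 0777) == 0400 &&
             S_ISREG(stb.st_mode) && stb.st_nlink == 1;

    close(fd);
    unlink(path);
    rmdir(dir);
    return ok;
}
```

Because the failing open leaves the existing file untouched, cooperating programs can use a file created this way as a simple lock.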


Finally, the access and modify times on a file may be set by the call:

    utimes(path, tvp)
    char *path; struct timeval *tvp[2];

This is particularly useful when moving files between media, to preserve relationships between the times the file was modified.

3.2.5. Links and renaming

Links allow multiple names for a file to exist. Links exist independently of the file linked to.

Two types of links exist, hard links and symbolic links. A hard link is a reference counting mechanism that allows a file to have multiple names within the same file system. Symbolic links cause string substitution during the pathname interpretation process.

Hard links and symbolic links have different properties. A hard link ensures the target file will always be accessible, even after its original directory entry is removed; no such guarantee exists for a symbolic link. Symbolic links can span file system boundaries.

The following calls create a new link, named path2, to path1:

    link(path1, path2);
    char *path1, *path2;

    symlink(path1, path2);
    char *path1, *path2;

The unlink primitive may be used to remove either type of link.

If a file is a symbolic link, the value of the link may be read with the readlink call,

    len = readlink(path, buf, bufsize);
    result int len; char *path; result char *buf; int bufsize;

This call returns, in buf, the null-terminated string substituted into pathnames passing through path.

Atomic renaming of file system resident objects is possible with the rename call:

    rename(oldname, newname);
    char *oldname, *newname;

where both oldname and newname must be in the same file system. If newname exists and is a directory, then it must be empty.

3.2.6. Extension and truncation

Files are created with zero length and may be extended simply by writing or appending to them. While a file is open the system maintains a pointer into the file indicating the current location in the file associated with the descriptor. This pointer may be moved about in the file in a random access fashion. To set the current offset into a file, the lseek call may be used,

    oldoffset = lseek(fd, offset, type);
    result off_t oldoffset; int fd; off_t offset; int type;

where type is given in <sys/file.h> as one of:

    #define L_SET     0    /* set absolute file offset */
    #define L_INCR    1    /* set file offset relative to current position */
    #define L_XTND    2    /* set offset relative to end-of-file */

The call lseek(fd, 0, L_INCR) returns the current offset into the file.

Files may have holes in them. Holes are void areas in the linear extent of the file where data has never been written. These may be created by seeking to a location in a file past the current end-of-file and writing. Holes are treated by the system as zero valued bytes.
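The creation of a hole can be sketched as follows: one byte is written at offset 0, the offset is moved well past end-of-file, and a second byte is written there. The bytes in between were never written and read back as zeros. The function name and scratch file are hypothetical, and SEEK_SET and SEEK_END are the POSIX names corresponding to L_SET and L_XTND.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a region skipped over with lseek reads back as zero
 * valued bytes and is counted in the file size, 0 if not, -1 on error.
 */
int hole_reads_as_zero(void)
{
    char path[] = "/tmp/holedemoXXXXXX";
    char buf[16];
    int fd, i, ok;

    fd = mkstemp(path);             /* scratch file */
    if (fd < 0)
        return -1;
    unlink(path);

    ok = write(fd, "A", 1) == 1 &&
         lseek(fd, 4096, SEEK_SET) == 4096 &&    /* past end-of-file */
         write(fd, "B", 1) == 1 &&
         lseek(fd, 0, SEEK_END) == 4097 &&       /* size includes hole */
         lseek(fd, 1, SEEK_SET) == 1 &&
         read(fd, buf, sizeof buf) == (ssize_t)sizeof buf;

    /* Everything read from inside the hole must be zero. */
    for (i = 0; ok && i < (int)sizeof buf; i++)
        if (buf[i] != 0)
            ok = 0;

    close(fd);
    return ok;
}
```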


A file may be truncated with either of the calls:

    truncate(path, length);
    char *path; int length;

    ftruncate(fd, length);
    int fd, length;

reducing the size of the specified file to length bytes.

3.2.7. Checking accessibility

A process running with different real and effective user ids may interrogate the accessibility of a file to the real user by using the access call:

    accessible = access(path, how);
    result int accessible; char *path; int how;

Here how is constructed by or'ing the following bits, defined in <sys/file.h>:

    #define F_OK    0    /* file exists */
    #define X_OK    1    /* file is executable */
    #define W_OK    2    /* file is writable */
    #define R_OK    4    /* file is readable */

The presence or absence of advisory locks does not affect the result of access.

3.2.8. Locking

The file system provides basic facilities that allow cooperating processes to synchronize their access to shared files. A process may place an advisory read or write lock on a file, so that other cooperating processes may avoid interfering with the process's access. This simple mechanism provides locking with file granularity. More granular locking can be built using the IPC facilities to provide a lock manager. The system does not force processes to obey the locks; they are of an advisory nature only.

Locking is performed after an open call by applying the flock primitive,

    flock(fd, how);
    int fd, how;

where the how parameter is formed from bits defined in <sys/file.h>:

    #define LOCK_SH    1    /* shared lock */
    #define LOCK_EX    2    /* exclusive lock */
    #define LOCK_NB    4    /* don't block when locking */
    #define LOCK_UN    8    /* unlock */
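A minimal sketch of advisory locking with flock follows. It opens the same file twice, takes an exclusive lock through the first descriptor, and shows that a non-blocking request through the second is refused until the first lock is released. The function name and scratch file are hypothetical. The sketch assumes the behavior of current BSD and Linux systems, where locks belong to the open file rather than to the process, so that two separate opens within one process can conflict; on a system without that behavior the demonstration would need two processes.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Returns 1 if a LOCK_NB request is refused while another descriptor
 * holds an exclusive lock and is granted after LOCK_UN; 0 if not;
 * -1 on setup failure.
 */
int flock_demo(void)
{
    char path[] = "/tmp/flockdemoXXXXXX";
    int a, b, refused, granted;

    a = mkstemp(path);              /* first open of the file */
    if (a < 0)
        return -1;
    b = open(path, O_RDWR);         /* second, independent open */
    unlink(path);
    if (b < 0) {
        close(a);
        return -1;
    }

    if (flock(a, LOCK_EX) < 0) {
        close(a);
        close(b);
        return -1;
    }

    /* LOCK_NB turns what would be a wait into an immediate error. */
    errno = 0;
    refused = flock(b, LOCK_EX | LOCK_NB) < 0 && errno == EWOULDBLOCK;

    flock(a, LOCK_UN);
    granted = flock(b, LOCK_EX | LOCK_NB) == 0;

    close(a);                       /* closing also drops any locks */
    close(b);
    return refused && granted;
}
```

Nothing prevents a process that ignores the lock from reading or writing the file; the locks coordinate only those processes that ask.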

Successive lock calls may be used to increase or decrease the level of locking. If an object is currently locked by another process when a flock call is made, the caller will be blocked until the current lock owner releases the lock; this may be avoided by including LOCK_NB in the how parameter. Specifying LOCK_UN removes all locks associated with the descriptor. Advisory locks held by a process are automatically deleted when the process terminates.

3.2.9. Disk quotas

As an optional facility, each file system may be requested to impose limits on a user's disk usage. Two quantities are limited: the total amount of disk space which a user may allocate in a file system and the total number of files a user may create in a file system. Quotas are expressed as hard limits and soft limits. A hard limit is always imposed; if a user would exceed a hard limit, the operation which caused the resource request will fail. A soft limit results in the user receiving a warning message, but with allocation succeeding. Facilities are provided to turn soft limits into hard limits if a user has exceeded a soft limit for an unreasonable period of time.


To enable disk quotas on a file system the setquota call is used:

    setquota(special, file)
    char *special, *file;

where special refers to a structured device file where a mounted file system exists, and file refers to a disk quota file (residing on the file system associated with special) from which user quotas should be obtained. The format of the disk quota file is implementation dependent.

To manipulate disk quotas the quota call is provided:

    #include <sys/quota.h>

    quota(cmd, uid, arg, addr)
    int cmd, uid, arg; caddr_t addr;

The indicated cmd is applied to the user ID uid. The parameters arg and addr are command specific. The file <sys/quota.h> contains definitions pertinent to the use of this call.


3.3. Interprocess communications

3.3.1. Interprocess communication primitives 3.3.1.1. Communication domains The system provides access to an extensible set of communication domains. A communication domain is identied by a manifest constant dened in the le <sys/socket.h>. Important standard domains supported by the system are the unix domain, AF_UNIX, for communication within the system, the Internet domain for communication in the DARPA Internet, AF_INET, and the NS domain, AF_NS, for communication using the Xerox Network Systems protocols. Other domains can be added to the system. 3.3.1.2. Socket types and protocols Within a domain, communication takes place between communication endpoints known as sockets. Each socket has the potential to exchange information with other sockets of an appropriate type within the domain. Each socket has an associated abstract type, which describes the semantics of communication using that socket. Properties such as reliability, ordering, and prevention of duplication of messages are determined by the type. The basic set of socket types is dened in <sys/socket.h>: /* Standard socket types */ #dene SOCK_DGRAM #dene SOCK_STREAM #dene SOCK_RAW #dene SOCK_RDM #dene SOCK_SEQPACKET 1 2 3 4 5 /* datagram */ /* virtual circuit */ /* raw socket */ /* reliably-delivered message */ /* sequenced packets */

The SOCK_DGRAM type models the semantics of datagrams in network communication: messages may be lost or duplicated and may arrive out-of-order. A datagram socket may send messages to and receive messages from multiple peers. The SOCK_RDM type models the semantics of reliable datagrams: messages arrive unduplicated and in-order, and the sender is notified if messages are lost. The send and receive operations (described below) generate reliable/unreliable datagrams. The SOCK_STREAM type models connection-based virtual circuits: two-way byte streams with no record boundaries. Connection setup is required before data communication may begin. The SOCK_SEQPACKET type models a connection-based, full-duplex, reliable, sequenced packet exchange; the sender is notified if messages are lost, and messages are never duplicated or presented out-of-order. Users of the last two abstractions may use the facilities for out-of-band transmission to send out-of-band data.

SOCK_RAW is used for unprocessed access to internal network layers and interfaces; it has no specific semantics.

Other socket types can be defined.

Each socket may have a specific protocol associated with it. This protocol is used within the domain to provide the semantics required by the socket type. Not all socket types are supported by each domain; support depends on the existence and the implementation of a suitable protocol within the domain. For example, within the Internet domain, the SOCK_DGRAM type may be implemented by the UDP user datagram protocol, and the SOCK_STREAM type may be implemented by the TCP transmission control protocol, while no standard protocols to provide SOCK_RDM or SOCK_SEQPACKET sockets exist.

3.3.1.3. Socket creation, naming and service establishment

Sockets may be connected or unconnected. An unconnected socket descriptor is obtained by the socket call:


    s = socket(domain, type, protocol);
    result int s; int domain, type, protocol;

The socket domain and type are as described above, and are specified using the definitions from <sys/socket.h>. The protocol may be given as 0, meaning any suitable protocol. One of several possible protocols may be selected using identifiers obtained from a library routine, getprotobyname.

An unconnected socket descriptor of a connection-oriented type may yield a connected socket descriptor in one of two ways: either by actively connecting to another socket, or by becoming associated with a name in the communications domain and accepting a connection from another socket. Datagram sockets need not establish connections before use.

To accept connections or to receive datagrams, a socket must first have a binding to a name (or address) within the communications domain. Such a binding may be established by a bind call:

    bind(s, name, namelen);
    int s; struct sockaddr *name; int namelen;

Datagram sockets may have default bindings established when first sending data if not explicitly bound earlier. In either case, a socket's bound name may be retrieved with a getsockname call:

    getsockname(s, name, namelen);
    int s; result struct sockaddr *name; result int *namelen;

while the peer's name can be retrieved with getpeername:

    getpeername(s, name, namelen);
    int s; result struct sockaddr *name; result int *namelen;

Domains may support sockets with several names.

3.3.1.4. Accepting connections

Once a binding is made to a connection-oriented socket, it is possible to listen for connections:

    listen(s, backlog);
    int s, backlog;

The backlog specifies the maximum count of connections that can be simultaneously queued awaiting acceptance.

An accept call:

    t = accept(s, name, anamelen);
    result int t; int s; result struct sockaddr *name; result int *anamelen;

returns a descriptor for a new, connected, socket from the queue of pending connections on s. If no new connections are queued for acceptance, the call will wait for a connection unless non-blocking I/O has been enabled.

3.3.1.5. Making connections

An active connection to a named socket is made by the connect call:

    connect(s, name, namelen);
    int s; struct sockaddr *name; int namelen;

Although datagram sockets do not establish connections, the connect call may be used with such sockets to create an association with the foreign address. The address is recorded for use in future send calls, which then need not supply destination addresses. Datagrams will be received only from that peer, and asynchronous error reports may be received.

It is also possible to create connected pairs of sockets without using the domain's name space to rendezvous; this is done with the socketpair call, which 4.3BSD supports only in the UNIX communication domain:


    socketpair(domain, type, protocol, sv);
    int domain, type, protocol; result int sv[2];

Here the returned sv descriptors correspond to those obtained with accept and connect.

The call

    pipe(pv)
    result int pv[2];

creates a pair of SOCK_STREAM sockets in the UNIX domain, with pv[0] only readable and pv[1] only writable.

3.3.1.6. Sending and receiving data

Messages may be sent from a socket by:

    cc = sendto(s, buf, len, flags, to, tolen);
    result int cc; int s; caddr_t buf; int len, flags; caddr_t to; int tolen;

if the socket is not connected or:

    cc = send(s, buf, len, flags);
    result int cc; int s; caddr_t buf; int len, flags;

if the socket is connected. The corresponding receive primitives are:

    msglen = recvfrom(s, buf, len, flags, from, fromlenaddr);
    result int msglen; int s; result caddr_t buf; int len, flags;
    result caddr_t from; result int *fromlenaddr;

and

    msglen = recv(s, buf, len, flags);
    result int msglen; int s; result caddr_t buf; int len, flags;

In the unconnected case, the parameters to and tolen specify the destination of the message, while the from parameter stores the source of the message, and *fromlenaddr initially gives the size of the from buffer and is updated to reflect the true length of the from address.

All calls cause the message to be received in or sent from the message buffer of length len bytes, starting at address buf. The flags specify peeking at a message without reading it or sending or receiving high-priority out-of-band messages, as follows:

    #define MSG_PEEK  0x1  /* peek at incoming message */
    #define MSG_OOB   0x2  /* process out-of-band data */
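The calls of sections 3.3.1.3 through 3.3.1.6 can be combined into a single rendezvous. The following is a minimal sketch, not a definitive implementation: it assumes a present-day POSIX system (so it uses the modern headers and struct sockaddr_un rather than the caddr_t declarations shown above), it assumes that UNIX-domain names are bound in the file system, and the function name rendezvous_exchange and the path under /tmp are hypothetical. One process plays both roles, which works because connect completes once the request is queued on the listening socket.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind and listen on a UNIX-domain stream socket,
 * connect to it, accept the connection, and move msg across it.
 * Returns the number of bytes received into buf, or -1 on failure.
 */
int rendezvous_exchange(const char *msg, char *buf, size_t buflen)
{
    struct sockaddr_un name;
    int s = -1, c = -1, t = -1, result = -1;
    size_t len = strlen(msg), got = 0;
    ssize_t n;

    if (len >= buflen)
        return -1;

    /* Build a name in the UNIX domain (path is an assumption). */
    memset(&name, 0, sizeof name);
    name.sun_family = AF_UNIX;
    snprintf(name.sun_path, sizeof name.sun_path,
             "/tmp/sockdemo.%ld", (long)getpid());
    unlink(name.sun_path);              /* remove a stale name, if any */

    s = socket(AF_UNIX, SOCK_STREAM, 0);    /* passive side */
    c = socket(AF_UNIX, SOCK_STREAM, 0);    /* active side */
    if (s < 0 || c < 0)
        goto out;

    if (bind(s, (struct sockaddr *)&name, sizeof name) < 0)
        goto out;
    if (listen(s, 1) < 0)                   /* backlog of one */
        goto out;
    if (connect(c, (struct sockaddr *)&name, sizeof name) < 0)
        goto out;
    t = accept(s, NULL, NULL);              /* new connected descriptor */
    if (t < 0)
        goto out;

    if (send(c, msg, len, 0) != (ssize_t)len)
        goto out;
    while (got < len) {                     /* streams have no record boundaries */
        n = recv(t, buf + got, len - got, 0);
        if (n <= 0)
            goto out;
        got += (size_t)n;
    }
    buf[got] = '\0';
    result = (int)got;

out:
    if (t >= 0) close(t);
    if (c >= 0) close(c);
    if (s >= 0) close(s);
    unlink(name.sun_path);
    return result;
}
```

A caller would pass a buffer larger than the message, for example rendezvous_exchange("ping", buf, sizeof buf) with a 64-byte buf. The receive loop is there because a virtual circuit may deliver the bytes in more than one piece.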

3.3.1.7. Scatter/gather and exchanging access rights

It is possible to scatter and gather data and to exchange access rights with messages. When either of these operations is involved, the number of parameters to the call becomes large. Thus the system defines a message header structure, in <sys/socket.h>, which can be used to conveniently contain the parameters to the calls:

    struct msghdr {
        caddr_t  msg_name;          /* optional address */
        int      msg_namelen;       /* size of address */
        struct   iov *msg_iov;      /* scatter/gather array */
        int      msg_iovlen;        /* # elements in msg_iov */
        caddr_t  msg_accrights;     /* access rights sent/received */
        int      msg_accrightslen;  /* size of msg_accrights */
    };


Here msg_name and msg_namelen specify the source or destination address if the socket is unconnected; msg_name may be given as a null pointer if no names are desired or required. The msg_iov and msg_iovlen describe the scatter/gather locations, as described in section 2.1.3. Access rights to be sent along with the message are specified in msg_accrights, which has length msg_accrightslen. In the UNIX domain these are an array of integer descriptors, taken from the sending process and duplicated in the receiver.

This structure is used in the operations sendmsg and recvmsg:

    sendmsg(s, msg, flags);
    int s; struct msghdr *msg; int flags;

    msglen = recvmsg(s, msg, flags);
    result int msglen; int s; result struct msghdr *msg; int flags;

3.3.1.8. Using read and write with sockets

The normal UNIX read and write calls may be applied to connected sockets; they are translated into send and receive calls from or to a single area of memory, and any rights received are discarded. A process may operate on a virtual circuit socket, a terminal or a file with blocking or non-blocking input/output operations without distinguishing the descriptor type.

3.3.1.9. Shutting down halves of full-duplex connections

A process that has a full-duplex socket such as a virtual circuit and no longer wishes to read from or write to this socket can give the call:

    shutdown(s, direction);
    int s, direction;

where direction is 0 to not read further, 1 to not write further, or 2 to completely shut the connection down. If the underlying protocol supports unidirectional or bidirectional shutdown, this indication will be passed to the peer. For example, a shutdown for writing might produce an end-of-file condition at the remote end.

3.3.1.10. Socket and protocol options

Sockets, and their underlying communication protocols, may support options. These options may be used to manipulate implementation- or protocol-specific facilities. The getsockopt and setsockopt calls are used to control options:

    getsockopt(s, level, optname, optval, optlen)
    int s, level, optname; result caddr_t optval; result int *optlen;

    setsockopt(s, level, optname, optval, optlen)
    int s, level, optname; caddr_t optval; int optlen;

The option optname is interpreted at the indicated protocol level for socket s. If a value is specified with optval and optlen, it is interpreted by the software operating at the specified level. The level SOL_SOCKET is reserved to indicate options maintained by the socket facilities. Other level values indicate a particular protocol which is to act on the option request; these values are normally interpreted as a protocol number.

3.3.2. UNIX domain

This section describes briefly the properties of the UNIX communications domain.

3.3.2.1. Types of sockets

In the UNIX domain, the SOCK_STREAM abstraction provides pipe-like facilities, while SOCK_DGRAM provides (usually) reliable message-style communications.
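The gather, shutdown and option calls of sections 3.3.1.7, 3.3.1.9 and 3.3.1.10 can be sketched together on a UNIX-domain socket pair. This is an illustration under stated assumptions rather than the manual's own code: it targets a present-day POSIX system, where the scatter/gather element is struct iovec and the access-rights fields of struct msghdr carry different names than the msg_accrights fields described above, so the sketch zeroes the header and passes no rights. The function name gather_and_shutdown is hypothetical, and SO_TYPE is used only as a convenient SOL_SOCKET option to read back.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: send two separate buffers as one message with
 * sendmsg, shut down the writing half, and read until end-of-file.
 * Stores the socket type in *type_out and returns the byte count
 * received into buf, or -1 on failure.
 */
int gather_and_shutdown(char *buf, size_t buflen, int *type_out)
{
    char part1[] = "hello, ";
    char part2[] = "world";
    struct iovec iov[2];
    struct msghdr msg;
    socklen_t optlen = sizeof *type_out;
    size_t total = 0;
    ssize_t n = 0;
    int sv[2];
    int result = -1;

    if (buflen == 0 || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* Option maintained by the socket facilities themselves. */
    if (getsockopt(sv[0], SOL_SOCKET, SO_TYPE, type_out, &optlen) < 0)
        goto out;

    /* Gather: two areas of memory, one send operation. */
    iov[0].iov_base = part1;
    iov[0].iov_len = sizeof part1 - 1;
    iov[1].iov_base = part2;
    iov[1].iov_len = sizeof part2 - 1;
    memset(&msg, 0, sizeof msg);        /* no name, no access rights */
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;
    if (sendmsg(sv[0], &msg, 0) != (ssize_t)(iov[0].iov_len + iov[1].iov_len))
        goto out;

    /* Direction 1 (SHUT_WR): no further writes from this end. */
    if (shutdown(sv[0], SHUT_WR) < 0)
        goto out;

    /* The peer reads the data, then sees end-of-file (recv returns 0). */
    while (total < buflen - 1 &&
           (n = recv(sv[1], buf + total, buflen - 1 - total, 0)) > 0)
        total += (size_t)n;
    if (n < 0)
        goto out;
    buf[total] = '\0';
    result = (int)total;

out:
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

With a 32-byte buffer the receiver ends up holding the two parts joined, since a stream socket keeps no boundary between them, and the loop stops when the shutdown is seen as end-of-file.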


3.3.2.2. Naming

Socket names are strings and may appear in the UNIX file system name space through portals. The 4.3BSD implementation of the UNIX domain embeds bound sockets in the UNIX file system name space; this may change in future releases.

3.3.2.3. Access rights transmission

The ability to pass UNIX descriptors with messages in this domain allows migration of service within the system and allows user processes to be used in building system facilities.

3.3.3. INTERNET domain

This section describes briefly how the Internet domain is mapped to the model described in this section. More information will be found in the document describing the network implementation in 4.3BSD.

3.3.3.1. Socket types and protocols

SOCK_STREAM is supported by the Internet TCP protocol; SOCK_DGRAM by the UDP protocol. Each is layered atop the transport-level Internet Protocol (IP). The Internet Control Message Protocol is implemented atop/beside IP and is accessible via a raw socket. The SOCK_SEQPACKET has no direct Internet family analogue; a protocol based on one from the XEROX NS family and layered on top of IP could be implemented to fill this gap.

3.3.3.2. Socket naming

Sockets in the Internet domain have names composed of the 32-bit Internet address and a 16-bit port number. Options may be used to provide IP source routing or security options. The 32-bit address is composed of network and host parts; the network part is variable in size and is frequency encoded. The host part may optionally be interpreted as a subnet field plus the host on subnet; this is enabled by setting a network address mask at boot time.

3.3.3.3. Access rights transmission

No access rights transmission facilities are provided in the Internet domain.

3.3.3.4. Raw access

The Internet domain allows the super-user access to the raw facilities of IP. These interfaces are modeled as SOCK_RAW sockets. Each raw socket is associated with one IP protocol number, and receives all traffic received for that protocol. This allows administrative and debugging functions to occur, and enables user-level implementations of special-purpose protocols such as inter-gateway routing protocols.
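The structure of an Internet-domain socket name (section 3.3.3.2) can be made concrete with a small sketch. It assumes the historical classful reading of "frequency encoded", in which the leading bits of the address select an 8-, 16- or 24-bit network part; that interpretation, the function names classful_netmask and make_inet_name, and the sample address are assumptions for illustration, not text from this manual. The code only builds and inspects a name; it opens no sockets.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: return the network mask implied by the leading
 * bits of a 32-bit address given in host byte order.
 */
uint32_t classful_netmask(uint32_t addr)
{
    if ((addr & 0x80000000u) == 0)
        return 0xff000000u;             /* class A: 8-bit network part */
    if ((addr & 0xc0000000u) == 0x80000000u)
        return 0xffff0000u;             /* class B: 16-bit network part */
    if ((addr & 0xe0000000u) == 0xc0000000u)
        return 0xffffff00u;             /* class C: 24-bit network part */
    return 0xffffffffu;                 /* other classes: no host part here */
}

/*
 * Hypothetical helper: fill in an Internet-domain socket name from a
 * dotted-quad string and a port number. Returns 0 on success, -1 if
 * the string is not a valid address.
 */
int make_inet_name(struct sockaddr_in *name, const char *dotted, uint16_t port)
{
    memset(name, 0, sizeof *name);
    name->sin_family = AF_INET;
    name->sin_port = htons(port);       /* 16-bit port, network byte order */
    if (inet_pton(AF_INET, dotted, &name->sin_addr) != 1)
        return -1;                      /* 32-bit address could not be parsed */
    return 0;
}
```

For a name built from "128.32.0.5" and port 53, ntohl(name.sin_addr.s_addr) & classful_netmask(...) gives the network part and the complement of the mask gives the host part. A subnet mask set by the administrator would further divide that host part, as the text describes.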


3.4. Terminals and Devices

3.4.1. Terminals

Terminals support read and write I/O operations, as well as a collection of terminal specific ioctl operations, to control input character interpretation and editing, and output format and delays.

3.4.1.1. Terminal input

Terminals are handled according to the underlying communication characteristics such as baud rate and required delays, and a set of software parameters.

3.4.1.1.1. Input modes

A terminal is in one of three possible modes: raw, cbreak, or cooked. In raw mode all input is passed through to the reading process immediately and without interpretation. In cbreak mode, the handler interprets input only by looking for characters that cause interrupts or output flow control; all other characters are made available as in raw mode. In cooked mode, input is processed to provide standard line-oriented local editing functions, and input is presented on a line-by-line basis.

3.4.1.1.2. Interrupt characters

Interrupt characters are interpreted by the terminal handler only in cbreak and cooked modes, and cause a software interrupt to be sent to all processes in the process group associated with the terminal. Interrupt characters exist to send SIGINT and SIGQUIT signals, and to stop a process group with the SIGTSTP signal either immediately, or when all input up to the stop character has been read.

3.4.1.1.3. Line editing

When the terminal is in cooked mode, editing of an input line is performed. Editing facilities allow deletion of the previous character or word, or deletion of the current input line. In addition, a special character may be used to reprint the current input line after some number of editing operations have been applied.

Certain other characters are interpreted specially when a process is in cooked mode. The end of line character determines the end of an input record. The end of file character simulates an end of file occurrence on terminal input. Flow control is provided by stop output and start output control characters. Output may be flushed with the flush output character; and a literal character may be used to force literal input of the immediately following character in the input line.

Input characters may be echoed to the terminal as they are received. Non-graphic ASCII input characters may be echoed as a two-character printable representation, ^character.

3.4.1.2. Terminal output

On output, the terminal handler provides some simple formatting services. These include converting the carriage return character to the two character return-linefeed sequence, inserting delays after certain standard control characters, expanding tabs, and providing translations for upper-case only terminals.

3.4.1.3. Terminal control operations

When a terminal is first opened it is initialized to a standard state and configured with a set of standard control, editing, and interrupt characters. (The control interface described here is an internal interface only in 4.3BSD; future releases will probably use a modified interface based on currently-proposed standards.) A process may alter this configuration with certain control operations, specifying parameters in a standard structure:


    struct ttymode {
        short  tt_ispeed;   /* input speed */
        int    tt_iflags;   /* input flags */
        short  tt_ospeed;   /* output speed */
        int    tt_oflags;   /* output flags */
    };

and special characters are specified with the ttychars structure,

    struct ttychars {
        char  tc_erasec;    /* erase char */
        char  tc_killc;     /* erase line */
        char  tc_intrc;     /* interrupt */
        char  tc_quitc;     /* quit */
        char  tc_startc;    /* start output */
        char  tc_stopc;     /* stop output */
        char  tc_eofc;      /* end-of-file */
        char  tc_brkc;      /* input delimiter (like nl) */
        char  tc_suspc;     /* stop process signal */
        char  tc_dsuspc;    /* delayed stop process signal */
        char  tc_rprntc;    /* reprint line */
        char  tc_flushc;    /* flush output (toggles) */
        char  tc_werasc;    /* word erase */
        char  tc_lnextc;    /* literal next character */
    };
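The cbreak input mode of section 3.4.1.1.1 can be approximated on a present-day system, with one large assumption: the sketch uses the POSIX termios interface instead of the ttymode and ttychars structures, which the manual describes as internal to 4.3BSD. The function names enter_cbreak and leave_cbreak are hypothetical. Line editing and echo are turned off while signal-generating characters stay active, which is roughly the behavior the text gives for cbreak.

```c
#include <termios.h>
#include <unistd.h>

/*
 * Hypothetical helper: put the terminal on fd into a cbreak-like mode.
 * The previous settings are stored in *saved so they can be restored.
 * Returns 0 on success, -1 if fd is not a terminal or the change fails.
 */
int enter_cbreak(int fd, struct termios *saved)
{
    struct termios t;

    if (tcgetattr(fd, saved) < 0)
        return -1;                      /* not a terminal, or bad descriptor */

    t = *saved;
    t.c_lflag &= ~(ICANON | ECHO);      /* no line editing, no echo */
    t.c_lflag |= ISIG;                  /* interrupt characters still signal */
    t.c_cc[VMIN] = 1;                   /* return as soon as one byte arrives */
    t.c_cc[VTIME] = 0;                  /* no inter-byte timer */
    return tcsetattr(fd, TCSANOW, &t);
}

/*
 * Hypothetical helper: restore the settings saved by enter_cbreak,
 * returning the terminal to its earlier (normally cooked) mode.
 */
int leave_cbreak(int fd, const struct termios *saved)
{
    return tcsetattr(fd, TCSANOW, saved);
}
```

A program would call enter_cbreak(STDIN_FILENO, &saved) before reading single characters and leave_cbreak(STDIN_FILENO, &saved) before exiting. In termios the special characters listed in ttychars correspond loosely to entries of the c_cc array such as VINTR, VERASE and VKILL.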

3.4.1.4. Terminal hardware support

The terminal handler allows a user to access basic hardware related functions; e.g. line speed, modem control, parity, and stop bits. A special signal, SIGHUP, is automatically sent to processes in a terminal's process group when a carrier transition is detected. This is normally associated with a user hanging up on a modem controlled terminal line.

3.4.2. Structured devices

Structured devices are typified by disks and magnetic tapes, but may represent any random-access device. The system performs read-modify-write type buffering actions on block devices to allow them to be read and written in a totally random access fashion like ordinary files. File systems are normally created in block devices.

3.4.3. Unstructured devices

Unstructured devices are those devices which do not support block structure. Familiar unstructured devices are raw communications lines (with no terminal handler), raster plotters, magnetic tape and disks unfettered by buffering and permitting large block input/output and positioning and formatting commands.


3.5. Process and kernel descriptors

The status of the facilities in this section is still under discussion. The ptrace facility of earlier UNIX systems remains in 4.3BSD. Planned enhancements would allow a descriptor-based process control facility.


I. Summary of facilities

1. Kernel primitives

1.1. Process naming and protection

    sethostid           set UNIX host id
    gethostid           get UNIX host id
    sethostname         set UNIX host name
    gethostname         get UNIX host name
    getpid              get process id
    fork                create new process
    exit                terminate a process
    execve              execute a different process
    getuid              get user id
    geteuid             get effective user id
    setreuid            set real and effective user ids
    getgid              get accounting group id
    getegid             get effective accounting group id
    getgroups           get access group set
    setregid            set real and effective group ids
    setgroups           set access group set
    getpgrp             get process group
    setpgrp             set process group

1.2. Memory management

    <sys/mman.h>        memory management definitions
    sbrk                change data section size
    sstk                change stack section size
    getpagesize         get memory page size
    mmap                map pages of memory
    msync               flush modified mapped pages to filesystem
    munmap              unmap memory
    mprotect            change protection of pages
    madvise             give memory management advice
    mincore             determine core residency of pages
    msleep              sleep on a lock
    mwakeup             wakeup process sleeping on a lock

1.3. Signals

    <signal.h>          signal definitions
    sigvec              set handler for signal
    kill                send signal to process
    killpgrp            send signal to process group
    sigblock            block set of signals
    sigsetmask          restore set of blocked signals
    sigpause            wait for signals
    sigstack            set software stack for signals

1.4. Timing and statistics

    <sys/time.h>        time-related definitions
    gettimeofday        get current time and timezone
    settimeofday        set current time and timezone
    getitimer           read an interval timer
    setitimer           get and set an interval timer
    profil              profile process

† Not supported in 4.3BSD.

1.5. Descriptors

    getdtablesize       descriptor reference table size
    dup                 duplicate descriptor
    dup2                duplicate to specified index
    close               close descriptor
    select              multiplex input/output
    fcntl               control descriptor options
    wrap                wrap descriptor with protocol

1.6. Resource controls

    <sys/resource.h>    resource-related definitions
    getpriority         get process priority
    setpriority         set process priority
    getrusage           get resource usage
    getrlimit           get resource limitations
    setrlimit           set resource limitations

1.7. System operation support

    mount               mount a device file system
    swapon              add a swap device
    umount              umount a file system
    sync                flush system caches
    reboot              reboot a machine
    acct                specify accounting file

2. System facilities

2.1. Generic operations

    read                read data
    write               write data
    <sys/uio.h>         scatter-gather related definitions
    readv               scattered data input
    writev              gathered data output
    <sys/ioctl.h>       standard control operations
    ioctl               device control operation

2.2. File system

Operations marked with a * exist in two forms: as shown, operating on a file name, and operating on a file descriptor, when the name is preceded with an "f".

    <sys/file.h>        file system definitions
    chdir               change directory
    chroot              change root directory
    mkdir               make a directory
    rmdir               remove a directory
    open                open a new or existing file
    mknod               make a special file
    portal              make a portal entry
    unlink              remove a link
    stat*               return status for a file
    lstat               returned status of link
    chown*              change owner
    chmod*              change mode
    utimes              change access/modify times
    link                make a hard link
    symlink             make a symbolic link
    readlink            read contents of symbolic link
    rename              change name of file
    lseek               reposition within file
    truncate*           truncate file
    access              determine accessibility
    flock               lock a file

† Not supported in 4.3BSD.

2.3. Communications

    <sys/socket.h>      standard definitions
    socket              create socket
    bind                bind socket to name
    getsockname         get socket name
    listen              allow queuing of connections
    accept              accept a connection
    connect             connect to peer socket
    socketpair          create pair of connected sockets
    sendto              send data to named socket
    send                send data to connected socket
    recvfrom            receive data on unconnected socket
    recv                receive data on connected socket
    sendmsg             send gathered data and/or rights
    recvmsg             receive scattered data and/or rights
    shutdown            partially close full-duplex connection
    getsockopt          get socket option
    setsockopt          set socket option

2.4. Terminals, block and character devices

2.5. Processes and kernel hooks
