0% found this document useful (0 votes)
720 views

Unix Programming Environment

Unix programming

Uploaded by

Sai
Copyright
© © All Rights Reserved
Available Formats
Download as PDF or read online on Scribd
0% found this document useful (0 votes)
720 views

Unix Programming Environment

Unix programming

Uploaded by

Sai
Copyright
© © All Rights Reserved
Available Formats
Download as PDF or read online on Scribd
You are on page 1/ 369
) (( > = & The UNIX Programming Environment Brian W. Kernighan Rob Pike Bell Laboratories Murray Hill, New Jersey PRENTICE-HALL, INC. Englewood Cliffs, New Jersey 07632 UNIX is « Trademark of Bell Laboratories Libeary of Congress Catalog Card Number 83-6285 1 Prentice-Hall Software Series Brian W. Kernighan, Advisor Edierabpduction supervision: os Heron Cover design: Photo Plas Art, Celine Brandes Manufacturing byee: Gordon Osbourne Copyright ©1984 by Bell Telephone Laboratories, Incorporated. ‘All tights reserved, No part ofthis publication may be reproduced, stored in retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopy- ing, recording, or otherwise, without the prior writen permission of the publisher Printed in the United States of America, Published simultaneously in Canada. ‘This book was typeset in Times Roman and Courier by the authors, using & Mer ‘emthaler Linotron 202 phototypestter driven by @ VAX-I1/250 running the Sth Ealtion fof the UNIX operating system. UNIX is a trademark of Bell Laboratories. DEC, PDP and VAX are trademarks of Digital Equipment Corporation nw wT we ISBN Q-13-937699-2 ISBN Q-13-93?81-X {PBK} PRENTICE-HALL INTERNATIONAL, INC., London PRENTICE-HALL OF AUSTRALIA PTY, LIMITED, Sydney EDITORA PRENTICE-HALL DO BRASIL, LTDA, Rio de eneiro PRENTICE-HALL CANADA INC, Toronto PRENTICE-HALL OF INDIA PRIVATE LIMITED, New Dethi PRENTICE-HALL OF JAPAN, INC., Tokyo PRENTICE-HALL OF SOUTHEAST ASIA PTE. TD., Singapore WHITEHALL BOOKS LIMITED, Wellington, New Zealand 2 4. Preface UNIX for Beginners 1.1 Getting started 1.2. Day-to-day use: files and common commands 1.3 More about files: directories 1.4 The shell 1.5 The rest of the UNIX system ‘The File System 2.1 The basics of files 2.2 What's in a file? 2.3 Directories and filenames 2.4 Permissions 2.5 Inodes 2.6 The directory hierarchy 2.7 Devices Using the Shell 3.1 Command line structure 3.2 Metacharacters 3.3. Creating new commands 3.4 Command arguments and parameters 3.5. Program output as arguments 3.6 Shell variables 3.7 More on I/O redirection 3.8 Looping in shell programs 3.9 bundle: putting it all together 3.10 Why a programmable shell? a 4.1 The grep family 4.2 Other filters CONTENTS ul a 38 41 41 48 2 37 8 65 iv contents 4.3: The stream editor sea 4.4 The awk pattern scanning and processing language 4.5 Good files and good filters 5. Shell Programming 5.1 Customizing the cal command 5.2. Which command is which? 5.3 while and until loops: watching for things 5.4 Traps: catching interrupts 5.5 Replacing a file: overwrite 5.6 zap: killing processes by name 5.7 The pick command: blanks vs. arguments 5.8 The news command: community service messages 5.9 get and put: tracking file changes 5.10 A look back 6. Programming with Standard VO 6.1 Standard input and output: vis 6.2 Program arguments: vis version 2 6.3 File access: vis version 3 6.4 A sereen-at-a-time printer: p 6.5 An example: pick 6.6 On bugs and debugging 6.7 Anexample: zap 6.8 An interactive file comparison program: iaift 6.9 Accessing the environment 7. Unix System Calls 7.1 Low-level VO 7.2 File system: directories 7.3 File system: inodes 7.4 Processes 7.5. Signals and interrupts Program Development 8.1 Stage 1: A four-function calculator 8.2 Stage 2: Variables and error recovery 8.3 Stage 3: Arbitrary variable names; built-in functions 8.4 Stage 4: Compilation into a machine 8.5. Stage : Control flow and relational operators 8.6 Stage 6: Functions and procedures; input/output 8.7 Performance evaluation 8.8 A look back 108 114 130 133 133 1B8 144 150 152 156 159 162 165 169 im 1m 174 176 180 186 187 190 192 199 201 201 208 214 220 25 233 234 242 245 258 266 273 284 286 10. Document Preparation 9.1 The ms macro package 9.2 The trofé level 9.3 The thi and eqn preprocessors 9.4 The manual page 9.5 Other document preparation tools Epilog Appendix 1: Editor Summary Appendix 2: hoc Manual Appendix 3: noc Listing Index 289 290 291 301 308, 313 31s 319 329 335 349 iv. contents 4.3: The stream editor sed 108 4.4 The awk pattern scanning and processing language 1s 4.5 Good files and good filters 130 Shell Programming 133 5.1 Customizing the cal command 133 5.2 Which command is which? 138 5.3 while and until loops: watching for things 144 5.4 Traps: catching interrupts 150 5.5 Replacing a file: overwrite 152 5.6 zap: killing processes by name 156 5.7 The pick command: blanks vs, arguments 159 5.8 The news command: community service messages 162 5.9 get and put: tracking file changes 165 5.10 A look back 169 Programming with Standard /O im 6.1 Standard input and output: vis In 6.2 Program arguments: vis version 2 174 6.3 File access: vis version 3 176 6.4 A sereen-at-a-time printer: p 180 6.5 Anexample: pick 186 6.6 On bugs and debugging 187 6.7 Anexample: zap 190 6.8 An interactive file comparison program: idife 192 6.9 Accessing the environment 199 ‘NIK System Calls 204 7.1 Low-level VO 201 7.2 File system: directories 208 7.3 File system: inodes 214 7.4 Processes 20 7.5 Signals and interrupts 25 8. Program Development 233, 8.1 Stage 1: A four-function calculator 234 8.2 Stage 2: Variables and error recovery 242 8.3 Stage 3: Arbitrary variable names; built-in functions 245 8.4 Stage 4: Compilation into a machine 258 8.5 Stage 5: Control flow and relational operators 266 8.6 Stage 6: Functions and procedures; input/output 23 8.7 Performance evaluation 284 8.8 A look back 286 10, Document Preparation 9.1 The ms macro package 9.2 The trofe level 9.3 The tbi and eqn preprocessors 9.4 The manual page 9.5 Other document preparation tools Epilog Appendix 1: Editor Summary ‘Appendix 2: hoc Manual ‘Appendix 3: hoc Listing Index 289 290 297 301 308, 313 315 319 329 335 349 PREFACE “The number of UNIX installations has grown to 10, with more expected.” (The unix Programmer's Manual, 2nd Edition, June, 1972.) ‘The UNIXt operating system started on a cast-off DEC PDP-7 at Bell Labora- tories in 1969. Ken Thompson, with ideas and support from Rudd Canaday, Doug Mcliroy, Joe Ossanna, and Dennis Ritchie, wrote a small general- purpose time-sharing system comfortable enough to attract enthusiastic users and eventually enough credibility for the purchase of a larger machine — a PDP-11/20. One of the early users was Ritchie, who helped move the system to the PDP-II in 1970. Ritchie also designed and wrote a compiler for the C programming language. In 1973, Ritchie and Thompson rewrote the UNIX ker~ nel in C, breaking from the tradition that system software is written in assem- bly language. With that rewrite, the system became essentially what it is today, Around 1974 it was licensed to universities “for educational purposes" and a few years later became available for commercial use. During this time, UNIX systems prospered at Bell Labs, finding their way into laboratories, software development projects, word processing centers, and operations support systems in telephone companies. Since then, it has spread world-wide, with tens of thousands of systems installed, from microcomputers to the largest main- frames. What makes the UNEX system so successful? We can discern several rea- sons. First, because it is written in C, it is portable — UNIX systems run on a range of computers from microprocessors to the largest mainframes; this is a strong commercial advantage, Second, the source code is available and written in a high-level language, which makes the system easy to adapt to particular requirements. Finally, and most important, it is a good operating system, Funny i a trademark of Bell Laboratories, “UNI i not an acronym, but weak pun on MUL “TICS, the operating system that Thompson and Ritchie worked op before UN vil PREFACE, especially for programmers. ‘The UNIX programming environment is unusually rich and productive. Even though the UNIX system introduces a number of innovative programs and techniques, no single program or idea makes it work well. Instead, what makes it effective is an approach to programming, a philosophy of using the computer. Although that philosophy can’t be written down in a single sen- tence, at its heart is the idea that the power of a system comes more from the relationships among programs than from the programs themselves. Many UNIX programs do quite trivial tasks in isolation, but, combined with other pro- grams, become general and useful tools. Our goal in this book is to communicate the UNIX programming philosophy. Because the philosophy is based on the relationships between programs, we must devote most of the space to discussions about the individual tools, but throughout run the themes of combining programs and of using programs to build programs, To use the UNIX system and its components well, you must understand not only how to use the programs, but also how they fit into the environment. ‘As the UNIX system has spread, the fraction of its users who are skilled in its application has decreased. Time and again, we have seen experienced users, ourselves included, find only clumsy solutions to a problem, or write programs to do jobs that existing tools handle easily. Of course, the elegant solutions are not easy to see without some experience and understanding. We hope that by reading this book you will develop the understanding to make your use of the system — whether you are a new or seasoned user — effective and enjoyable, We want you fo use the UNIX system well We are aiming at individual programmers, in the hope that, by making their work more productive, we can in turn make the work of groups more productive. Although our main target is programmers, the first four or five chapters do not require programming experience to be understood, so they should be helpful to other users as well. Wherever possible we have tried to make our points with real examples rather than artificial ones. Although some programs began as examples for the book, they have since become part of our own set of everyday programs. All examples have been tested directly from the text, which is in machine-readable form, The book is organized as follows. Chapter 1 is an introduction to the most basic use of the system, It covers logging in, mail, the file system, commonly- used commands, and the rudiments of the command interpreter. Experienced users can skip this chapter. Chapter 2 is a discussion of the UNIX file system. The file system is central to the operation and use of the system, so you must understand it to use the system well. This chapter describes files and directories, permissions and file modes, and inodes. It concludes with a tour of the file system hierarchy and ‘an explanation of device files PREFACE ix ‘The command interpreter, or shell, is a fundamental tool, not only for run- ning programs, but also for writing them. Chapter 3 describes how to use the shell for your own purposes: creating new commands, command arguments, shell variables, elementary control flow, and input-output redirection ‘Chapter 4 is about filters: programs that perform some simple transforma- tion on data as it flows through them. The first section deals with the grep pattern-searching command and its relatives; the next discusses a few of the more common filters such as sort; and the rest of the chapter is devoted to ‘two general-purpose data transforming programs called sed and awk. sed is a stream editor, a program for making editing changes on a stream of data as, it flows by. awk is a programming language for simple information retrieval and report generation tasks. It's often possible to avoid conventional program- ming entirely by using these programs, sometimes in cooperation with the shel Chapter 5 discusses how to use the shell for writing programs that will stand up to use by other people. Topics include more advanced control flow and variables, traps. and interrupt handling. The examples in this chapter make considerable use of sed and awk as well asthe shell Eventually one reaches the limits of what can be done with the shell and other programs that already exist. Chapter 6 talks about writing new programs using the standard UO library. The programs are written in C, which the reader is assumed t0 know, or at least be learning concurrently. We try to show sensible strategies for designing and organizing new programs, how to build them in manageable stages, and how to make use of tools that already exist ‘Chapter 7 deals with the system calls, the foundation under all the other layers of software. The topics include input-output, file creation, error pro- cessing, directories, inodes, processes, and signals. Chapter 8 talks about program development tools: yacc, a parser generator; make, which controls the process of compiling a big program; and lex, which generates lexical analyzers. The exposition is based on the development of a large program, a C-like programmable calculator Chapter 9 discusses the document preparation tools, illustrating them with a user-level description and a manual page for the calculator of Chapter 8. It can be read independently of the other chapters. Appendix 1 summarizes the standard editor e4. Although many readers will prefer some other editor for daily use, ed is universally available, efficient and effective. Its regular expressions are the heart of other programs like grep and sed, and for that reason alone itis worth learning, ‘Appendix 2 contains the reference manual for the calculator language of Chapter 8. Appendix 3 is a listing of the final version of the calculator program, presenting the code all in one place for convenient reading. x PREFACE Some practical matters. First, the UNIX system has become very popular, and there are a number of versions in wide use. For example, the 7th Edition ‘comes from the original source of the UNIX system, the Computing Science Research Center at Bell Labs. System II and System V are the official Bell Labs-supported versions. The University of California at Berkeley distributes systems derived from the 7th Edition, usually known as UCB 4.xBSD. In addition, there are numerous variants, particularly on small computers, that are derived from the 7th Edition. We have tried to cope with this diversity by sticking closely to those aspects that are likely to be the same everywhere. Although the lessons that we want to teach are independent of any particular version, for specific details we have chosen to present things as they were in the 7th Edition, since it forms the basis of most of the UNIX systems in widespread use. We have also run the ‘examples on Bell Labs' System V and on Berkeley 4.1BSD; only trivial changes were required, and only in a few examples. Regardless of the version your ‘machine runs, the differences you find should be minor. Second, although there is a lot of material in this book, it is not a reference manual. We feel it is more important to teach an approach and a style of use than just details, The unix Programmer's Manual is the standard source of information. You will need it to resolve points that we did not cover, or (0 determine how your system differs from ours. Third, we believe that the best way to learn something is by doing it. This ook should be read at a terminal, so that you can experiment, verify or con- tradict what we say, explore the limits and the variations. Read a bit, try it ut, then come back and read some more. We believe that the UNIX system, though certainly not perfect, is a mar- velous computing environment. We hope that reading this book will help you to reach that conclusion too. We are grateful to many people for constructive comments and criticisms, and for their help in improving our code. In particular, Jon Bentley, John Linderman, Doug Mellroy, and Peter Weinberger read multiple drafts with great care. We are indebted to Al Aho, Ed Bradford, Bob Flandrena, Dave Hanson, Ron Hardin, Marion Harris, Gerard Holzmann, Steve Johnson, Nico Lomuto, Bob Martin, Larry Rosler, Chris Van Wyk, and Jim Weythman for their comments on the first draft. We also thank Mike Bianchi, Elizabeth Bimmler, Joc Carfagno, Don Carter, Tom De Marco, Tom Duff, David Gay, Steve Mahaney, Ron Pinter, Dennis Ritchie, Ed Sitar, Ken Thompson, Mike Tilson, Paul Tukey, and Larry Wehr for valuable suggestions Brian Kernighan Rob Pike cHarrer 1: UNIX FOR BEGINNERS What is “UNIX”? In the narrowest sense, itis a time-sharing operating sys- tem kernel: a program that controls the resources of a computer and allocates them among its users. It lets users run their programs; it controls the peri- pheral devices (discs, terminals, printers, and the like) connected to the machine; and it provides a file system that manages the long-term storage of information such as programs, data, and documents. In a broader sense, “UNIX” is often taken to include not only the kernel, but also essential programs like compilers, editors, command languages, pro- grams for copying and printing files, and so on, Still more broadly, “UNIX” may even include programs developed by you or other users to be run on your system, such as tools for document preparation, routines for statistical analysis, and graphics packages. Which of these uses of the name “UNIX” is correct depends on which level of the system you are considering. When we use “UNIX” in the rest of this book, context should indicate which meaning is implied. ‘The UNIX system sometimes looks more difficult than it is — it’s hard for a newcomer to know how to make the best use of the facilities available. But fortunately it's not hard to get started — knowledge of only a few programs should get you off the ground. This chapter is meant to help you to start using the system as quickly as possible. It’s an overview, not a manual; we'll cover most of the material again in more detail in later chapters. We'll talk about these major areas ‘© basics — logging in and out, simple commands, correcting typing mistakes, mail, inter-terminal communication. * day-to-day use — files and the file system, printing files, directories, commonly-used commands. ‘© the command interpreter or shell — filename shorthands, redirecting input and output, pipes, setting erase and kill characters, and defining your own search path for commands. If you've used a UNIX system before, most of this chapter should be familiar; you might want to skip straight to Chapter 2. 2. THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 1 You will need a copy of the UNIX Programmer's Manual, even as you read this chapter; it’s often easier for us to tell you to read about something in the manual than to repeat its contents here. This book is not supposed (o replace it, but to show you how to make best use of the commands described in it Furthermore, there may be differences between what we say here and what is true on your system, The manual has a permuted index at the beginning that’s indispensable for finding tne right programs to apply to a problem; learn to use it Finally, a word of advice: don’t be afraid to experiment. If you are a beginner, there are very few accidental things you can do to hurt yourself or other users. So learn how things work by trying them. This is a long chapter, and the best way to read it is a few pages at a time, trying things out as you g0. L.A Getting started Some prerequisites about terminals and typing To avoid explaining everything about using computers, we must assume you have some familiarity with computer terminals and how to use them. If any of the following statements are mystifying, you should ask a local expert for help. The UNIX system is full duplex: the characters you type on the keyboard are sent to the system, which sends them back to the terminal to be printed on the screen. Normally, this echo process copies the characters directly to the screen, so you can sce what you are typing, but sometimes, such as when you are typing a secret password, the echo is turned off so the characters do not appear on the screen. ‘Most of the keyboard characters are ordinary printing characters with no special significance, but a few tell the computer how to interpret your typing. By far the most important of these is the RETURN key. The RETURN key sig- nifies the end of a line of input; the system echoes it by moving the terminal's cursor to the beginning of the next line on the screen. RETURN must be pressed before the system will interpret the characters you have typed. RETURN is an example of a control character — an invisible character that controls some aspect of input and output on the terminal. On any reasonable terminal, RETURN has a key of its own, but most control characters do not Instead, they must be typed by holding down the CONTROL key, sometimes called CTL or CNTL or CTRL, then pressing another key, usually a letter. For example, RETURN may be typed by pressing the RETURN key or, equivalently, holding down the CONTROL key and typing an ‘nm’, RETURN might therefore be called a control-m, which we will write as ctl-m. Other con- trol characters include ctl-a, which tells a program that there is no more input; ctl-g, which rings the bell on the terminal; ctl-n, often called backspace, which can be used to correct typing mistakes; and c1l-i, often called tab, which CHAPTER 1 UNIX FOR BEGINNERS 3 advances the cursor to the next tab stop, much as on a regular typewriter. Tab stops on UNIX systems are eight spaces apart. Both the backspace and tab char- acters have their own keys on most terminals. Two other keys have special meaning: DELETE, sometimes called RUBOUT or some abbreviation, and BREAK, sometimes called INTERRUPT. On most UNIX systems, the DELETE key stops a program immediately, without waiting for it to finish. On some systems, ctl-c provides this service. And on some systems, depending on how the terminals are connected, BREAK is a synonym for DELETE or ctl-c A Session with unix Let’s begin with an annotated dialog between you and your UNIX system. Throughout the examples in this book, what you type is printed in slanted letters, computer responses are in typewriter-style characters, and explanations are in italics Establish a connection: dial a phone or turn on a switch as necessary. Your system should say Jogin: you Type your name, then press RETURN Password: Your password won't be echoed as you type it You have mail There's mail 10 be read after you log in s The system is now ready for your commands s Press RETURN a couple of times $ date What's the date and time? Sun Sep 25 23:02:57 EDT 1983 8 who Who's using the machine? 3b ktyO Sep 25 13:59 you tty2 Sep 25 23:01 mary ttyd_— Sep 25 19:03 doug ttyS Sep 25 19:22 egb tty? Sep 25 17:17 bob tty8_ Sep 25 20:48 $ mail Read your mail From doug Sun Sep 25 20:53 EDT 1983 give me a call sometime monday 2 RETURN moves on to the next message From mary Sun Sep 25 19:07 EDT 1983 Next message Lunch at noon tomorrow? za Delete this message s No more mail $ mail mary Send mail 10 mary lunch at 12 is fine cil End of mail 8 Hang up phone or turnoff terminal ‘and that’s the end Sometimes that’s all there is to a session, though occasionally people do 4 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER ¢ some work too. The rest of this section will discuss the session above, plus other programs that make it possible to do useful things. Logging in ‘You must have a login name and password, which you can get from your system administrator. “The UNIX system is capable of dealing. with a wide varlety of terminals but i is strongly oriented towards devices with lowercase: case distinctions matter! If your terminal produces only upper case (like some video and portable terminals), life will be s0 difficult that you should look for another termina Be sure the switches are set appropriately on your device: upper and lower case, full duplex, and any other settings that local experts advise, such as the speed, or baud rate, Establish a connection using whatever magic is needed for your terminal; this may involve dialing a telephone or merely flipping a switch. In either case, the system should type aogin: Iti types garbage, you may be at the wrong speed; check the speed setting and other switches. If that fails, press the BREAK or INTERRUPT Key a few times, slowly. If nothing produces a login message, you will have to get help ‘When you get the Login: message, type your login name in lower case Follow it by pressing RETURN. If a password is required, you will be asked for it, and printing will be turned off while you type it ‘The culmination of your login efforts is a prompt, usually a single charac- ter, indicating thatthe system is ready to accept commands from you. The prompt is most likely to be a dollar sign $ or a percent sign %, but you can Change it to anything you like; we'll show you how a litle later. ‘The prompt is actually printed by a program called the command interpreter ot shell, which is your main interface to the system: “There may be a message of the day just before the prompt, or a notification that you have mail.” You may also be asked what kind of terminal you are using; your answer helps the system to use any special properties the terminal might have. Typing commands ‘Once you receive the prompt, you can type commands, which are requests that the system do something, We will use program as a synonym for com- mand. When you sce the prompt (let's assume it’s $), type date and press RETURN. The system should reply with the date and time, then print another prompt, so the whole transaction will look like this on your terminal: $ date Mon Sep 26 12:20:57 EDT 1983 s Don't forget RETURN, and don’t type the $. If you think you're being CHAPTER 1 UNIX FOR BEGINNERS 5 ignored, press RETURN; something should happen. RETURN won't be men- tioned again, but you need it at the end of every line. ‘The next command to try is who, which tells you everyone who is currently logged in: who rin teyo sep 26 11:17 Piw ttyd Sep 26 11:30 gerard tty? sep 26 10:27 mark tty9_—Sep 26 07:59 you ttya Sep 26 12:20 8 The first column is the user name. The second is the system's name for the connection being used (“tty stands for “teletype,” an archaic synonym for terminal”). The rest tells when the user logged on. You might also try $ who am i you etya Sep 26 12:20 s If you make 4 mistake typing the name of a command, and refer to a non- existent command, you will be told that no command of that name can be found: $ whom Misspelled command name whom: not found 0 system didn't know how 10 run it s Of course, if you inadvertently type the name of an actual command, it will run, perhaps with mysterious results Strange terminal behavior ‘Sometimes your terminal will act strangely, for example, each letter may be typed twice, or RETURN may not put the cursor at the first column of the next line. You can usually fix this by turning the terminal off and on, or by logging ‘out and logging back in. Or you can read the description of the command stty (“set terminal options”) in Section 1 of the manual. To get intelligent treatment of tab characters if your terminal doesn’t have tabs, type the com- mand $ stty -taps and the system will convert tabs into the right number of spaces. If your ter- al does have computer-settable tab stops, the command tabs will set them correctly for you. (You may actually have to say $ tabs terminal-npe to make it work — see the tabs command description in the manual.) 6 WHE UNIX PROGRAMMING ENVIRONMENT HATER 1 Mistakes in typing If you make a typing mistake, and sec it before you have pressed RETURN, there are two ways fo recover: erase characters one at a time or kill the whole line and re-type it. If you type the line kill character, by default an at-sign @, it causes the whole line to be discarded, just as if you'd never typed it, and starts you over on a new line: $ dava0e Completely botched; start over date ‘on a new line Mon Sep 26 12:23:39 EDT 1983 s The sharp character # erases the last character typed; each # erases one more character, back to the beginning of the line (but not beyond). So if you type badly, you can correct as you go: 3 dagattesse Fix it as you go Mon Sep 26 12:24:02 EDT 1983 8 ‘The particular erase and line kill characters are very system dependent. On ‘many systems (including the one we use), the erase character has been changed. to backspace, which works nicely on video terminals, You can quickly check which is the case on your system: $ datee Try + datee+: not found Ws not = $ dateet Try Mon Sep 26 12:26:08 EDT 1983 Iris # s (We printed the backspace as + so you can see it.) Another common choice is ctl-x for line kill We will use the sharp as the erase character for the rest of this section because it’s visible, but make the mental adjustment if your system is different Later on, in “tailoring the environment,” we will tell you how to set the erase and line kill characters to whatever you like, once and for all ‘What if you must enter an erase or line kill character as part of the text? If you precede either # or @ by a backslash \, it loses its special meaning. So to enter a # oF @, type \# or \@. The system may advance the terminal's cursor to the next line after your @, even if it was preceded by a backslash. Don’t ‘worry — the at-sign has been recorded. ‘The backslash, sometimes called the escape character, is used extensively to indicate that the following character is in some way special. To erase a backslash, you have to type two erase characters: \##. Do you see why’ ‘The characters you type are examined and interpreted by a sequence of pro- ‘grams before they reach their destination, and exactly how they are interpreted CHAPTER 1 UNIX FOR BEGINNERS 7 depends not only on where they end up but how they got there. Every character you type is immediately echoed to the terminal, unless echoing is turned off, which is rare. Until you press RETURN, the characters are held temporarily by the kernel, so typing mistakes can be corrected with the erase and line kill characters. When an erase or line kill character is pre- ceded by a backslash, the kernel discards the backslash and holds the following, character without interpretation When you press RETURN, the characters being held are sent to the pro- gram that is reading from the terminal. ‘That program may in turn interpret the characters in special ways; for example, the shell turns off any special interpretation of a character if it is preceded by a backslash. We'll come back to this in Chapter 3. For now, you should remember that the kernel processes erase and line kill, and backslash only if it precedes erase or line kill; whatever characters are left after that may be interpreted by other programs as well Bxercise 1-1. Explain what happens with § aatere Exercise 1-2. Most shells (though not the 7th Edition shell) interpret # as introducing a comment, and ignore all ext from the # to the end of the line. Given this, explain the following transcript, assuming your erase character is also #: Mon Sep 26 12:39:56 spr 1983 8 save Mon Sep 26 12:40:21 spr 1983 5 \eaate Type-ahead ‘The kernel reads what you type as you type it, even if it’s busy with some- thing else, so you can type as fast as you want, whenever you want, even when some command is printing at you. If you type while the system is printing, your input characters will appear intermixed with the output characters, but they will be stored away and interpreted in the correct order. You can type commands one after another without waiting for them to finish or even to begin. Stopping a program You can stop most commands by typing the character DELETE. The BREAK key found on most terminals may also work, although this is system dependent. In a few programs, like text editors, DELETE stops whatever the program is doing but leaves you in that program. Turning off the terminal or 8 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 1 hanging up the phone will stop most programs. IF you just want output to pause, for example to keep something critical from disappearing off the screen, type cil-s. The output will stop almost immediately; your program is suspended until you start it again. When you want to resume, type ctl-4 Logging out ‘The proper way to log out is to type ctl-d instead of a command; this tells the shell that there is no more input. (How this actually works will be explained in the next chapter.) You can usually just turn off the terminal or hhang up the phone, but whether this really logs you out depends on your sys- tem Mail ‘The system provides a postal system for communicating with other users, so some day when you log in, you will see the message You have mail before the first prompt. To read your mail, type $ mail Your mail will be printed, one message at a time, most recent first. After each item, mail waits for you to say what to do with it. ‘The two basic responses are , which deletes the message, and RETURN, which does not (so it will still be there the next time you read your mail). Other responses include p to reprint a message, s filename to save it in the file you named, and q to quit, from mail. (If you don’t know what a file is, think of it as a place where you can store information under a name of your choice, and retrieve it later. Files are the topic of Section 1.2 and indeed of much of this book.) mail is one of those programs that is likely to differ from what we describe here; there are many variants. Look in your manual for details. Sending mail to someone is straightforward. Suppose it is to go to the per son with the login name nico. The easiest way is this: $ mail nico [Now type inthe text of the leter ‘on as many lines as you like After the last line of the letter ‘ype a controtd. aa $ The cil-d signals the end of the letter by telling the mai1 command that there is no more input. If you change your mind half-way through composing the letter, press DELETE instead of cti-d. The half-formed letter will be stored in a file called dead. Letter instead of being sent. CHAPTER | UNIX FOR BEGINNERS 9 For practice, send mail to yourself, then type mai1 to read it. (This isn’t as aberrant as it might sound — it’s a handy reminder mechanism.) ‘There are other ways to send mail — you can send a previously prepared letter, you can mail to a number of people all at once, and you may be able to send mail to people on other machines. For more details see the description of the mail command in Section | of the UNIX Programmer's Manual. Hen- ceforth we'll use the notation mai1(I) to mean the page describing mail in Section | of the manual. All of the commands discussed in this chapter are found in Section 1. ‘There may also be a calendar service (see calendax(1)); we'll show you in Chapter 4 how to set one up if it hasn’t been done already. Writing to other users If your UNIX system has multiple users, someday, out of the blue, your ter- ‘minal will print something like Message from mary tty7... accompanied by a startling beep. Mary wants to write to you, but unless you take explicit action you won't be able to write back. To respond, type $ write mary ‘This establishes a two-way communication path. Now the lines that Mary types on her terminal will appear on yours and vice versa, although the path is slow, rather like talking to the moon. If you are in the middle of something, you have to get to a state where you can type @ command, Normally, whatever program you are running has to stop or be stopped, but some programs, such as the editor and write itself, have a *!” command to escape temporarily to the shell — see Table 2 in Appendix 1 The weite command imposes no rules, so 4 protocol is needed to keep what you type from getting garbled up with what Mary types. One convention is to take turns, ending each turn with (o), which stands for “over,” and to signal your intent to quit with (00), for “over and out.”” 10. THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 1 Mary's terminal: Your terminal: $ write you $ Meseage from mary tty7... write mary Message from you ttya.. did you forget lunch? (0) ia you forget lunch? (0) Fived ten minutes (0) ten minutes (0) ok (00) ok (00) aka Bor tla $ BOF $ You can also exit from write by pressing DELETE. Notice that your typing errors do not appear on Mary's terminal. If you try to write to someone who isn’t logged in, or who doesn’t want to be disturbed, you'll be told. If the target is logged in but doesn't answer after a decent interval, the person may be busy or away from the terminal; simply type cil-d or DELETE. If you don’t want to be disturbed, use mesg( 1). News Many UNIX systems provide a news service, to keep users abreast of interesting and not so interesting events. Try typing $ news ‘There is also a large network of UNIX systems that keep in touch through tele- phone calls; ask a local expert about netnews and USENET. The manual ‘The UNIX Programmer's Manual describes most of what you need to know about the system. Section 1 deals with commands, including those we discuss in this chapter. Section 2 describes the system calls, the subject of Chapter 7, and Section 6 has information about games. The remaining sections talk about functions for use by C programmers, file formats, and system maintenance. (The numbering of these sections varies from system to system.) Don’t forget the permuted index at the beginning; you can skim it quickly for commands that might be relevant to what you want to do. There is also an introduction to the system that gives an overview of how things work. Often the manual is kept on-line so that you can read it on your terminal. If you get stuck on something, and can’t find an expert to help, you can print any manual page on your terminal with the command man command-name CHAPTER t UNIX FOR BEGINNERS II Thus to read about the who command, type $ man who and, of course, $ man man tells about the man command. Computer-aided instruction Your system may have a command called learn, which provides computer-aided instruction on the file system and basic commands, the editor, document preparation, and even C programming. Try $ learn If Learn exists on your system, it will tell you what to do from there. If that fails, you might also try teach. Games I's not always admitted officially, but one of the best ways to get comfort- able with a computer and a terminal is to play games. ‘The UNIX system comes with a modest supply of games, often supplemented locally. Ask around, or see Section 6 of the manual. 1.2 Day-to-day use: files and common commands Information in a UNIX system is stored in files, which are much like ordi nary office files. Each file has a name, contents, a place to keep it, and some administrative information such as who owns it and how big it is. A file might contain a letter, or a list of names and addresses, or the source statements of a program, or data to be used by a program, or even programs in their execut- able form and other non-textual material ‘The UNIX file system is organized 50 you can maintain your own personal files without interfering with files belonging to other people, and keep people from interfering with you too. ‘There are myriad programs that manipulate files, but for now, we will look at only the more frequently used ones. Chapter 2 contains a systematic discussion of the file system, and introduces many of the other file-related commands. Creating files — the editor If you want to type a paper or a letter or a program, how do you get the information stored in the machine? Most of these tasks are done with a text editor, which is a program for storing and manipulating information in the computer. Almost every UNIX system has a screen editor, an editor that takes advantage of modern terminals to display the effects of your editing changes in context as you make them. ‘Two of the most popular are vi and emacs. We 12 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER | won't describe any specific screen editor here, however, partly because of typo- graphic limitations, and partly because there is no standard one. ‘There is, however, an older editor called ed that is certain to be available fon your system. It takes no advantage of special terminal features, so it will work on any terminal. It also forms the basis of other essential programs (including some screen editors), so it’s worth learning eventually. Appendix 1 contains a concise description. ‘No matter what editor you prefer, you'll have to learn it well enough to be able to create files. We'll use ed here to make the discussion concrete, and to ensure that you can make our examples run on your system, but by all means uuse whatever editor you like best. To use ed to create a file called junk with some text in ing: , do the follow- 8 ea Invokes the text editor a fed command to add text how type in Whatever text you want : Type a *." by itself to stop adding text junk Write your text into a file called “junk 39 (04 prints mumber of characters written @ Quit ea s ‘The command a (“append”) tells ed to start collecting text. The ".” that sig nals the end of the text must be typed at the beginning of a fine by itself. Don't forget it, for until it is typed, no other ed commands will be recognized — everything you type will be treated as text to be added The editor command w (“write”) stores the information that you typed; “w junk” stores it in a file called junk. The filename can be any word you like; we picked junk to suggest that this file isn’t very important. ‘ed responds with the number of characters it put in the file. Until the w command, nothing is stored permanently, so if you hang up and go home the information is not stored in the file. (If you hang up while editing, the data you were working on is saved in a file called ed.trup, which you can continue ‘with at your next session.) If the system crashes (i.e., stops unexpectedly because of software or hardware failure) while you are editing, your file will contain only what the last write command placed there. But after w the infor- mation is recorded permanently; you can access it again later by typing $ ed junk Of course, you can edit the text you typed in, to correct spelling mistakes, change wording, rearrange paragraphs and the like. When you're done, the q command (“quit”) leaves the editor. ‘CHAPTER 1 UNIX FOR BEGINNERS 13 What files are out there? Let's create two files, junk and temp, so we know what we have: Sea To be or not to be we junk 9 q sea That is the question. w temp 22 q s ‘The character counts from ed include the character at the end of each line, called newline, which is how the system represents RETURN. ‘The 1s command lists the names (not contents) of files: $ is junk temp s which are indeed the two files just created. (There might be others as well that you didn’t create yourself.) The names are sorted into alphabetical order automatically. 1s, like most commands, has options that may be used to alter its default behavior. Options follow the command name on the command line, and are usually made up of an initial minus sign ‘~" and a single letter meant to suggest the meaning. For example, 1s -t causes the files to be listed in “time” order: the order in which they were last changed, most recent first. $18 -t temp junk 8 ‘The ~1 option gives a “long” listing that provides more information about each fie: sis -1 total 2 frwer--r-- 1 you 19 sep 26 16:25 junk wewers-r-- 1 you 22 Sep 26 16:26 temp 14 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 1 “total 2” tells how many blocks of disc space the files occupy; a block is usually either 512 or 1024 characters. The string -rw-r--r-~ tells who has permission to read and write the file; in this case, the owner (you) can read and write, but others can only read it. The “1” that follows is the number of links to the file; ignore it until Chapter 2. “you” is the owner of the file, that is, the person who created it. 19 and 22 are the number of characters in the corresponding files, which agree with the numbers you got from ed. The date and time tell when the file was last changed. Options can be grouped: 1s ~1t gives the same data as 1s -1, but sorted ‘with most recent files first. The -u option gives information on when files were used: 1s ut gives a long (-1) listing in the order of most recent use. ‘The option ~x reverses the order of the output, so 1s -rt lists in order of least recent use. You can also name the files you're interested in, and 1s will list the information about them only: $12 -1 junk fewer 1 you 19 Sep 26 16:25 junk ‘The strings that follow the program name on the command line, such as -2 and junk in the example above, are called the program’s arguments. Argu- ments are usually options or names of files to be used by the command. Specifying options by a minus sign and a single letter, such as ~t or the combined -1t, is a common convention. In general, if a command accepts such optional arguments, they precede any filename arguments, but may other- wise appear in any order. But UNIX programs are capricious in their treatment of multiple options. For example, standard 7th Edition 16 won't accept Sie -2-t Doesn't work in 7h Edition as a synonym for 1s -2t, while other programs require multiple options to be separated. As you learn more, you will find that there is little regularity or system to optional arguments. Each command has its own idiosyncrasies, and its own choices of what letter means what (often different from the same function in other commands). This unpredictable behavior is disconcerting and is often cited as a major flaw of the system. Although the situation is improving — ‘new versions often have more uniformity — all we can suggest is that you try to do better when you write your own programs, and in the meantime keep a copy of the manual handy, Printing files — cat. and pr ‘Now that you have some files, how do you look at their contents? There ‘are many programs to do that, probably more than are needed. One possit is to use the editor: CHAPTER | UNIX FOR BEGINNERS 15 8 ed junk 9 24 reports 19 characters in junk 146 Print lines 1 through last 7 be or not to be File has only one line q All done 5 ed begins by reporting the number of characters in Junk; the command 1, $p tells it to print all the lines in the file. After you Jearn how to use the editor, you can be selective about the parts you print There are times when it’s not feasible to use an editor for printing. For example, there is a limit — several thousand lines — on how big a file ed can handle. Furthermore, it will only print one file at a time, and sometimes you want to print several, one after another without pausing, So here are a couple of alternatives. First is cat, the simplest of all the printing commands. cat prints the con- tents of all the files named by its argument $ cat junk To be or not to be $ cat temp ‘That is the question, $ cat junk tmp To be or not to be ‘That is the question. s The named file or files are catenatedt (hence the name “cat”) onto the termi- nal one after another with nothing between ‘There's no problem with short files, but for long ones, if you have a high- speed connection to your computer, you have to be quick with cil-s to stop ‘output from cat before it flows off your screen. There is no “standard” com- ‘mand to print a file on a video terminal one screenful at a time, though almost, every UNIX system has one, Your system might have one called pg or more. Ours is called p; we'll show you its implementation in Chapter 6. Like cat, the command pr prints the contents of all the files named in a list, but in a form suitable for line printers: every page is 66 lines (11 inches) Jong, with the date and time that the file was changed, the page number, and the filename at the top of each page, and extra lines to skip over the fold in the paper. Thus, to print junk neatly, then skip to the top of a new page and print temp neatly: F*Catenate” i a lightly obscure syoaym for * 16 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 1 $ pr junk temp Sep 26 16:25 1983 junk Page 1 To be or not to be (60 more blank lines) Sep 26 16:26 1983 tenp Page 1 ‘hat is the question. (60 more blank lines) s PE can also produce multi-column output: 8 pr -3 filenames prints each file in 3-column format. You can use any reasonable number in place of "3" and pr will do its best. (The word filenames is a place-holder for 8 list of names of files.) px -m will print a set of files in parallel columns See pr(l). It should be noted that px is nor a formatting program in the sense of re- arranging lines and justifying margins. The true formatters ate nroff and troff, which are discussed in Chapter 9. There are also commands that print files on a high-speed printer. Look in your manual under names like 1p and Ipr, or look up “printer” in the per ‘muted index. Which to use depends on what equipment is attached to your machine. pr and Lpr are often used together; after pr formats the informa- tion properly, 1p handles the mechanics of getting it to the line printer. We will return to this a little later. Moving, copying, removing files — mv, cp, xm Let's look at some other commands. ‘The first thing is to change the name of a file, Renaming a file is done by “moving” it from one name {0 another, like this: $ mv junk precious ‘This means that the file that used to be called junk is now called precious; the contents are unchanged. If you run 1s now, you will see a different list unk is not there but precious is. CHAPTER | UNIK POR BEGINNERS 17 sas precious temp 8 cat junk cat: can’t open junk s Beware that if you move a file to another one that already exists, the target file is replaced, To make a copy of a file (that is, to have two versions of something), use the cp command: $ ep precious precious. save makes a duplicate copy of precious in precious. save. Finally, when you get tired of creating and moving files, the rm command removes all the files you name: $ rm temp junk yn: junk nonexistent s ‘You will get a warning if one of the files to be removed wasn’t there, but oth- erwise rm, like most UNIX commands, does its work silently. There is no prompting or chatter, and error messages are curt and sometimes unhelpful. Brevity can be disconcerting to newcomers, but experienced users find talkative commands annoying. What's in a filename? So far we have used filenames without ever saying what a legal name is, so it’s time for a couple of rules. First, filenames are limited to 14 characters. Second, although you can use almost any character in a filename, common sense says you should stick to ones that are visible, and that you should avoid characters that might be used with other meanings. We have already seen, for example, that in the 1s command, 18 ~t means to list in time order. So if you had a file whose name was ~t, you would have a tough time listing it by name. (How would you do it?) Besides the minus sign as a first character, there are other characters with special meaning. To avoid pitfalls, you would do well to use only letters, numbers, the period and the underscore until you're familiar with the situation. (The period and the underscore are conventionally used to divide filenames into chunks, as in precious.save above.) Finally, don’t forget that case distinctions matter — junk, Junk, and JUNK are three different names, A handful of useful commands Now that you have the rudiments of creating files, listing their names, and printing their contents, we can look at a half-dozen file-processing commands. 18 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 1 To make the discussion concrete, we'll use a file called poem that contains a familiar verse by Augustus De Morgan. Let’s create it with ed: Sea Great fleas have 1itele fleas upon their backs to bite ‘em, And Little fleas have lesser fleas, ‘and so ad infinitum. And the great fleas thenselves, in turn, have greater fleas to go on: While these again have greater still, and greater still, and so on. w poen 263 @ 5 ‘The first command counts the lines, words and characters in one or more files; it is named we after its word-counting function: $ we poem 8 46 263 poem s That is, poem has 8 lines, 46 words, and 263 characters. The definition of a “word” is very simple: any string of characters that doesn’t contain a blank, tab or newline. we will count more than one file for you (and print the totals), and it will also suppress any of the counts if requested. See we(1) The second command is called grep; it searches files for lines that match a pattern. (The name comes from the ed command g/regular-expression/p, which is explained in Appendix 1.) Suppose you want to look for the word “fleas” in poem: $ grep fleas poem Great fleas have little fleas And Little fleas have lesser fleas, ‘And the great fleas themselves, in turn, have greater fleas to go on; s gzep will also look for lines that don’t match the pattern, when the option -v is used. (It’s named ‘v’ after the editor command; you can think of it as inverting the sense of the match.) (CHAPTER 1 UNIX FOR BEGINNERS 19 $ grep -v fleas poem upon their backs to bite “em, and so ad infinitun. while these again have greater still, fand greater still, and so on. s grep can be used (0 search several files; in that case it will prefix the filename to each line that matches, so you can tell where the match took place. ‘There are also options for counting, numbering, and so on. grep will also handle much more complicated patterns than just words like “fleas,” but we will defer consideration of that until Chapter 4. The third command is sort, which sorts its input into alphabetical order line by line, This isn’t very interesting for the poem, but let’s do it anyway, just to see what it looks like: $ sort poem and greater still, and so on. and so ad infinitun. have greater fleas to go on; upon their backs to bite ‘en, And little fleas have lesser fleas, ‘4nd the great fleas themselves, in turn, Great fleas have little fleas While these again have greater still, 8 ‘The sorting is line by line, but the default sorting order puts blanks first, then upper case letters, then lower case, so it’s not strictly alphabetical sort has zillions of options to control the order of sorting — reverse order, numerical order, dictionary order, ignoring leading blanks, sorting on fields within the line, etc. — but usually one has to look up those options to be sure of them. Here are a handful of the most common: sort -© ‘Reverse normal order sort -n Sort in numeric order sort -nr Sort in reverse numeric order sort -£ Fold upper and lower case together sort +n Sort starting at n*T-st field Chapter 4 has more information about sort. Another file-examining command is tail, which prints the last 10 lines of 4 file, That's overkill for our cight-line poem, but it’s good for larger files Furthermore, tail has an option to specify the number of lines, so to print the last line of poem: 20 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 1 8 tail -1 poem and greater still, and s0 on. s ‘tail can also be used to print a file starting at a specified line: $ tail +3 filename starts printing with the 3rd line. (Notice the natural inversion of the minus sign convention for arguments.) The final pair of commands is for comparing files. Suppose that we have a variant of poem in the file new_poem: $ cat poem Great fleas have littie fleas upon their backs to bite “em, ‘And little fleas have lesser fleas, ‘and co ad infinitum. and the great fleas thenselves, in turn, have greater fleas to go on; Wile these again have greater still, fand greater still, and a0 on. $ cat new_poen Great fleas have little fleas upon their backs to bite them, And little fleas have lesser fleas, ‘and so on ad infinitum, ‘And the great fleas themselves, in turn, have greater fleas to go oni While these again have greater still, and greater still, and so on. s There's not much difference between the two files; in fact you'll have to look hard to find it. This is where file comparison commands come in handy. cmp finds the first place where (wo files differ: $ cep poem new_poem poen new.poer differ: char 58, line 2 s This says that the files are different in the second line, which is true enough, but it doesn’t say what the difference is, nor does it identify any differences beyond the first ‘The other file comparison command is Ai€£, which reports on all lines that are changed, added or deleted: CHAPTER 1 UNIX FOR BEGINNERS 21 8 aise poem new poem 202 < upon their backs to bite ‘em, > upon their backs to bite then, 4e4 < and so ad infinitum. > and so on ad infinitum s This says that fine 2 in the first file (poem) has to be changed into line 2 of the second file (new_poem), and similarly for line 4. Generally speaking, emp is used when you want to be sure that two files really have the same contents, It's fast and it works on any kind of file, not Jjust text. Gi f£ is used when the files are expected to be somewhat different, and you want to know exactly which lines differ. dif£ works only on files of text. A summary of file system commands Table 1.1 is a brief summary of the commands we've seen so far that deal with files. 1.3 More about files: directories ‘The system distinguishes your file called junk from anyone else's of the same name. ‘The distinction is made by grouping files into directories, rather in the way that books are placed on shelves in a library, so files in different directories can have the same name without any conflict Generally each user has a personal or home directory, sometimes called login directory, that contains only the files that belong to him or her. When you log in, you are “in” your home directory. You may change the directory Yyou are working in — often called your working or current directory — but Yyour home directory is always the same. Unless you take special action, when Yyou create a new file itis made in your current directory. Since this is initially Yyour home directory, the file is unvelated to a file of the seme name that might exist in someone else's directory. AA directory can contain other directories as well as ordinary files (“Great directories have lesser directories .). The natural way to picture this organi- zation is as a tree of directories and files. It is possible to move around within this tree, and to find any file in the system by starting at the root of the tree and moving slong the proper branches. Conversely, you can start where you are and move toward the root Let's try the latter first. Our basic tool is the command pwd (“print work- ing directory"), which prints the name of the directory you are currently in: 22. THE UNIX PROGRAMMING ENVIRONMENT ‘CHAPTER 1 Table 1.1: is 1s filenames as -t as -1 as -a as -r ed filename ep filel file2 my file! file2 a filenames cat filenames pr filenames pr -n filenames pr -m filenames we filenames we =2 filenames grep pattern filenames grep -v pattern files sort filenames tail filename tail -n filename tail +n filename comp file! file2 aise file! file2 ‘Common File System Commands list names of all files in current directory ist only the named files list in time order, most recent first ‘ist long: more information; also 1s ~1t st by time last used; also 1s ~Lu, 1s -Lut ist in reverse order; also ~rt, ~rit, ete. edit named file copy file! to file2, overwrite old file? if it exists move file! to file2, overwrite old file2 if it exists remove named files, irrevocably print contents of named files print contents with header, 66 lines per page print in n columns print named files side by side (multiple columns) count lines, words and characters for each file count lines for each file print lines matching pattern print lines not matching pattern sort files alphabetically by line print last 10 lines of file print last m lines of file start printing file at line m print location of first difference print all differences between files 8 pwa Zast/you s ‘This says that you are currently in the directory you, in the directory usr, which in turn is in the root directory, which is conventionally called just *7”. ‘The / characters separate the components of the name; the limit of 14 charac- ters mentioned above applies to each component of such a name. On many systems, /asr is a directory that contains the directories of all the normal users of the system. (Even if your home directory is not /usr/you, pwd will print something analogous, below.) If you now type 80 you should be able to follow what happens CHAPTER | UNIX FOR FEGINNERS 23 $ 1s /usr/you you should get exactly the same list of file names as you get from a plain 1s. When no arguments ate provided, 16 lists the contents of the current direc tory; given the name of a directory, it lists the contents of that directory, Next, try 8 is /usr This should print a long series of names, among which is your own login direc tory you, "The next step is to try listing the root it similar to this: self, You should get a response sie/ bin boot dev ete Lib tmp ’ (Don’t be confused by the two meanings of /: it’s both the name of the root and a separator in filenames.) Most of these are directories, but unix is actu- ally a file containing the executable form of the UNIX kernel. More on this in Chapter 2, Now ty $ cat /usr/you/ junk (if junk is still in your directory). The name /ase/you/ junk js called the pathname of the file. “Pathname"’ has an intuitive meaning: it represents the full name of the path from the root through the tree of direc tories to a particular file. It is a universal rule in the UNIX system that wher- ever you can use an ordinary filename, you can use a pathname. The file system is structured like 2 genealogical tree; here is a picture that may make it clearer. 24 THE UNIX PROGRAMMING ENVIRONMENT (CHAPTER 1 . Yap Unix Boot you" mixé paul nary Lf L\.\ junk junk temp junk data Your file named Sunk is unrelated to Paul's or to Mary's Pathnames aren't too exciting if all the files of interest are in your own directory, but if you work with someone else or on several projects con- currently, they become handy indeed. For example, your friends can print your junk by saying $ cat /usr/yourjunk ‘Similarly, you can find out what files Mary has by saying $ le /usr/mary data junk 5 ‘or make your own copy of one of her files by $ op /usr/mary/data data or edit her file: $e /usr/mary/data If Mary doesn’t want you poking around in her files, or vice versa, privacy can be arranged. Each file and directory has read-write-execute permissions for the owner, a group, and everyone else, which can be used to control access. (Recall 1s -1.) In our local systems, most users most of the time find open- ness of more benefit than privacy, but policy may be different on your system, so we'll get back to this in Chapter 2. As a final set of experiments with pathnames, try $ 1s ‘bin /usr/bin Do some of the names look familiar? When you run a command by typing its name after the prompt, the system looks for a file of that name. It normally looks first in your current directory (where it probably doesn’t find i), them in bin, and finally in /asr/bin. There is nothing special about commands CHAPTER 1 UNIX FOR BEGINNERS 25 like cat or 1s, except that they have been collected into a couple of direc- tories (0 be easy (0 find and administer. To verify this, try to execute some of these programs by using their full pathnames: 8 in/éate Mon Sep 26 23:29:32 EDT 1983 $ /bin/who sen tty? Sep 26 22:20 ow ttyd Sep 26 22:40 yeu teyS Sep 26 23:04 5 Exercise 13. Try and do whatever comes naturally. Things might be more fun outside of normal working hours. Changing directory — ca If you work regularly with Mary on information in her directory, you can say “I want to work on Mary's files instead of my own.” This is done by changing your current directory with the cd command: $ cd /usr/mary Now when you use a filename (without /’s) as an argument to cat or pr, it refers to the file in Mary's directory. Changing directories doesn’t affect any permissions associated with a file — if you couldn't access a file from your own directory, changing to another directory won't alter that fact. It is usually convenient to arrange your own files so that all the files related to one thing are in a directory separate from other projects. For example, if you want to write a book, you might want to keep all the text in a directory called book. The command mkdix makes a new directory. $ mkdir book Make a directory 8 cd book Goro it $s ped ‘Make sure you're in the right place 7asr/you/book Write the book (several minutes pass) sea.. ‘Move up one level in file system 8 pwd Zasr/you s *..7 refers to the parent of whatever directory you are currently in, the direc tory one level closer to the root. '.” is a synonym for the current directory sod Return to home directory 26 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 1 all by itself will take you back to your home directory, the directory where you log in. ‘Once your book is published, you can clean up the files. To remove the directory book, remove all the files in it (we'll show a fast way shortly), then cd to the parent directory of book and type $ endix book rmdir will only remove an empty directory. ‘The shell When the system prints the prompt $ and you type commands that get exe- cuted, it’s not the kernel that is talking to you, but a go-between called the command interpreter or shell. ‘The shell is just an ordinary program like date cr who, although it can do some remarkable things, The fact that the shell sits between you and the facilities of the kernel has real benefits, some of which we'll talk about here. There are three main ones: * Filename shorthands: you can pick up a whole set of filenames as argu- ‘ments to a program by specifying a pattern for the names — the shell will find the filenames that match your pattern, + Input-output redirection: you can arrange for the output of any program to g0 into a file instead of onto the terminal, and for the input to come from a file instead of the terminal, Input and output can even be connected to other programs. ‘© Personalizing the environment: you can define your own commands and shorthands, Filename shorthand Let's begin with filename patterns, Suppose you're typing a large document like a book. Logically this divides into many small pieces, like chapters and perhaps sections, Physically it should be divided too, because it is cumbersome to edit large files. Thus you should type the document as a number of files. You might have separate files for each chapter, called ch1, ch2, ete. Or, if each chapter were broken into sections, you might create files called ent ent.2 ont.3 on2.1 en2.2 which is the organization we used for this book. With a systematic naming convention, you can (ell at a glance where a particular file fits into the whole. ‘What if you want to print the whole book? You could say CHAPTER 1 UNIX FOR REGINNERS 27 $ pr cht.t cht.2 cht.3 ut you would soon get bored typing filenames and start to make mistakes. This is where filename shorthand comes in. If you say 8 pr che the shell takes the # to mean “any string of characters," so che is a pattern that matches all filenames in the current directory that begin with eb. The shell creates the list, in alphabetical? order, and passes the list to pr. The pr command never sees the #; the pattern match that the shell does in the current directory generates a list of strings that are passed to pr. ‘The crucial point is that filename shorthand is not a property of the pr command, but a service of the shell. Thus you can use it to generate a sequence of filenames for any command. For example, to count the words in the first chapter: $8 we ent. 113-562 3200 cht.0 935 4081 22435 chi.1 974 4191 22756 ch1.2 37815618481 cht.3 4293 5298 28847 ch1.4 331941190 cht. 75 323 2030 ch1.6 3801 16210 68933 total 8 There is a program called echo that is especially valuable for experiment ing with the meaning of the shorthand characters. As you might guess, echo does nothing more than echo its arguments $ echo hello world hello world 8 But the arguments can be generated by pattern-matching: $ echo cht. lists the names of all the files in Chapter 1, # echo + lists all the filenames in the current directory in alphabetical order, Sore prints all your files (in alphabetical order), and F Again, the order is not srctly alphabetical, in that upper case letters come before lower case leters. See asesii(7) for the ordering ofthe characters used in the sort. 28 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER | removes ail files in your current directory. (You had better be very sure that’s what you wanted to say!) The + is not limited to the last position in a filename — #'s can be any- where and can occur several times. Thus § rm +.eave removes all files that end with . save. Notice that the filenames are sorted alphabetically, which is not the same as numerically. If your book has ten chapters, the order might not be what you intended, since ch10 comes before ch2: $ echo » eht.4 eht-2 ... ch10.1 oht0.2 ... eh2.1 on2.2 ’ The * is not the only pattern-matching feature provided by the shell, although it's by far the most frequently used. The pattern [...] matches any of the characters inside the brackets. A range of consecutive letters or digits can be abbreviated: $ pr ch 12346789] Print chapters 1,2,3,4,6.7.8,9 but not 5 & pr ch{1-46-9}+ Same thing $ rm tenpla-z] Remove any of tempa, .., tempz that exist ‘The ? pattern matches any single character: sis? List files with single-character names $s -1 chet List oh1.3 ch2.1 ch3.1, ete, but not eh10.1 8 rm temp? Remove files tempt, ..., tempa, etc. Note that the patterns match only existing filenames. In particular, you cannot make up new filenames by using patterns. For example, if you want to expand ch to chapter in each filename, you cannot do it this way: $ my ch.+ chapter.+ Doesn't work! because chapter. + matches no existing filenames Pattern characters like * can be used in pathnames as well as simple filenames; the match is done for each component of the path that contains a special character. Thus /usr/mary/s performs the match in /usr/mary, and /usr/+/calendar generates a list of pathnames of all user calendar files. If you should ever have to turn off the special meaning of +, ?, etc., enclose the entire argument in single quotes, as in sis *7 You can also precede a special character with a backslash: CHAPTER § UNIX FOR BEGINNERS 29 sis 7 (Remember that because ? is not the erase or line kill character, this backslash is interpreted by the shell, not by the kernel.) Quoting is treated at length in Chapter 3, Exercise 1-4, What are the differences among these commands? $12 junk § echo junk sie7 § cone 7 sie § eono saee § echo » Sie § cone” Input-output redirection Most of the commands we have seen so far produce output on the terminal; some, like the editor, also take their input from the terminal. It is nearly universal that the terminal can be replaced by a file for either or both of input and output. As one example, sas makes a list of filenames on your terminal. But if you say $ Is >¢ilelise that same list of filenames will be placed in the file £11e1ist instead. The symbol > means “put the output in the following file, rather than on the termi nal.” ‘The file will be created if it doesn’t already exist, or the previous con- tents overwritten if it does. Nothing is produced on your terminal. As another example, you can combine several files into one by capturing the out- put of cat in a file: $ cat £1 £2 £3 >temp ‘The symbol >> operates much as > does, except that it means “add to the end of.” That is 8 cat £1 £2 £3 >>temp copies the contents of £1, £2 and £3 onto the end of whatever is already in ‘emp, instead of overwriting the existing contents. As with >, if temp doesn’t exist, it will be created initially empty for you. In a similar way, the symbol < means to take the input for a program from the following file, instead of from the terminal. Thus, you can prepare a letter in file Let, then send it to several people with $ mail mary joe tom bob or <, but ou formatting is traditional, 30 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 1 Given the capability of redirecting output with >, it becomes possible to combine commands to achieve effects not possible otherwise. For example, to print an alphabetical list of users, 8 who »temp $ sore temp $ we -1 temp $ we -1 temp $ pr -3 temp 8 grep mary and < is being done by the shell, not by the individual programs. Centralizing the facility in the shell ‘means that input and output redirection can be used with any program; the program itself isn’t aware that something unusual has happened This brings up an important convention. The command $ sort temp What happens? © Pipes All of the examples at the end of the previous section rely on the same trick: putting the output of one program into the input of another via a tem- porary file But the temporary file has no other purpose; indeed, it’s clumsy to have (0 use such @ file. This observation leads to one of the fundamental con- tributions of the UNIX system, the idea of a pipe. A pipe is a way to connect the output of one program to the input of another program without any tem- porary file; a pipeline is a connection of two or more programs through pipes. Let us revise some of the earlier examples to use pipes instead of tem- poraries, The vertical bar character { tells the shell to set up a pipeline: $ who | sort Print sorted list of users $ who | we Count users $18 two -1 Count fies Sis f pr -3 3-colunn list of filenames 3 who } grep mary Look for particular user Any program that reads from the terminal can read from a pipe instead; any program that writes on the terminal can write to a pipe. This is where the convention of reading the standard input when no files are named pays off: any program that adheres to the convention can be used in pipelines. grep, pr, sort and we are all used that way in the pipelines above. ‘You can have as many programs in a pipeline as you wish’ 32 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 1 Sis! pr -3f Ipr creates a 3-column list of filenames on the line printer, and $ who ! grep mary | we -1 counts how many times Mary is logged The programs in a pipeline actually run at the same time, not one after another. This means that the programs in a pipeline can be interactive; the kernel looks after whatever scheduling and synchronization is needed to make it all work. As you probably suspect by now, the shell arranges things when you ask for 4 pipe; the individual programs are oblivious to the redirection. Of course, programs have to operate sensibly if they are to be combined this way. Most commands follow « common design, so they will fit properly into pipelines at any position. Normally a command invocation looks like ‘command optional-arguments.optiona.filenames IF no filenames are given, the command reads its standard input, which is by default the terminal (handy for experimenting) but which can be redirected to come from a file or a pipe. At the same time, on the output side, most com- ‘mands write their output on the standard output, which is by default sent to the terminal. But it too can be redirected to a file or a pipe, Error messages from commands have to be handled differently, however, or they might disappear into a file or down a pipe. So each command has a standard error output as well, which is normally directed to your terminal. Or, as a picture: standard input or files command, options standard output standard Almost all of the commands we have talked about so far fit this model; the only exceptions are commands like date and who that read no input, and a few like cmp and @if¢ that have a fixed number of file inputs. (But look at the ‘~' option on these.) Exercise 1-7. Explain the difference between and CHAPTER 1 UNIX FOR BEGINNERS 33 Processes The shell does quite a few things besides setting up pipes. Let us turn briefly to the basics of running more than one program at a time, since we have already seen a bit of that with pipes. For example, you can run two pro- ‘grams with one command line by separating the commands with a semicolon; the shell recognizes the semicolon and breaks the line into two commands $ date; who Tue Sep 27 01:03:17 epr 1983 ken eey0 Sep 27 00:43 nr ety! Sep 26 23:45 rob tty2 Sep 26 23:59 bei tty3 Sep 27 00:06 33 ttyd Sep 26 23:31 you ttyS Sep 26 23:04 ber ety] Sep 26 23:34 8 Both commands are executed (in sequence) before the shell returns with a prompt character. You can also have more than one program running simultancously if you wish. For example, suppose you want to do something time-consuming like counting the words in your book, but you don’t want to wait for we to finish before you start something else. Then you can say $ we cht >we.out & 044 Process-id printed by the shell s The ampersand & at the end of a command Tine says to the shell command running, then take further commands from the terminal immedi ately,” that is, don’t wait for it to complete. Thus the command will begin, but you can do something else while it’s running, Directing the output into the file we. out keeps it from interfering with whatever you're doing at the same time. ‘An instance of a running program is called a process. The number printed by the shell for a command initiated with & is called the process-id, you can use it in other commands to refer to a specific running program, It’s important to distinguish between programs and processes. we is @ pro- gram; each time you run the program we, that creates a new process. If several instances of the same program are running at the same time, each is a separate process with a different process-id. If a pipeline is initiated with 8, as in 34 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 1 $ pr che | Ip & 6951 Processid of 1px s the processes in it are all started at once — the & applies to the whole pipeline. Only one process-id is printed, however, for the last process in the sequence ‘The command $ wait waits until all processes initisted with & have finished. If it doesn’t return immediately, you have commands still running, You can interrupt wait with DELE ‘You can use the process-id printed by the shell to stop a process initiated with &: $ Kill 6944 If you forget the process-id, you can use the command ps to tell you about everything you have running. If you are desperate, ki11 0 will kill all your processes except your login shell, And if you're curious about what other users are doing, ps -ag will tell you about all processes that are currently running. Here is some sample output 3 ps -ag PID TTY TIME CMD 36 co Zeve/eron 6423 5 sh 6704 3 sn 67221 vit paper 4430 2 a 66127 sh 6628 7 rogue 6243 2 write énr 6949 4 login bimnier 6952 5 Pr ch1.1 cht.2 ch1.3 ch1.4 6951 5 ape 6959 5 ps -ag 6044 1 write rob s PID is the process-id; TTY is the terminal associated with the process (as in who); TIME is the processor time used in minutes and seconds; and the rest is the command being run. ps is one of those commands that is different on dif- ferent versions of the system, so your output may not be formatted like this. Even the arguments may be different — see the manual page ps(1) Processes have the same sort of hierarchical structure that files do: each process has a parent, and may well have children. Your shell was created by a process associated with whatever terminal line connects you to the system. As CHAPTER 1 UNIX FOR BEGINNERS 35 you run commands, those processes are the direct children of your shell. If you run a program from within one of those, for example with the 1 command to escape from ed, that creates its own child process which is thus a grandchild of the shell ‘Sometimes a process takes so long that you would like to start it running, then turn off the terminal and go home without waiting for it to finish. But if you turn off your terminal or break your connection, the process will normally be killed even if you used & The command nobup (“no hangup”) was created to deal with this situation: if you say $ nobup command 6 the command will continue to run if you log out. Any output from the com- mand is saved in a file called nobup.out. There is no way to nobup a com- mand retroactively. If your process will take a lot of processor resources, it is kind to those who share your system to run your job with lower than normal priority; this is done by another program called nice: $ nice expensive-command & nohup automatically calls nice, because if you're going to log out you can afford to have the command take a little longer. Finally, you can simply tell the system to start your process at some wee hhour of the morning when normal people are asleep, not computing. ‘The com- mand is called at(|): $ at time whatever commands ela s ‘This is the typical usage, but of course the commands could come from a file: Sat 3am temp § ed cn2.7 1534 temp 168. 04 produces text on its standard output, which can then be used anywhere text can be used. This uniformity is unusual; most systems have several file for- ‘mats, even for text, and require negotiation by a program or a user to create file of a particular type. In UNIX systems there is just one kind of file, and all that is required to access a file is its name.+ The lack of file formats is an advantage overall — programmers needn't worry about file types, and all the standard programs will work on any file — but there are a handful of drawbacks. Programs that sort and search and edit really expect text as input: grep can’t examine binary files correctly, nor can sort sort them, nor can any standard editor manipulate them. ‘There are implementation limitations with most programs that expect text as input. We tested 2 number of programs on a 30,000 byte text file containing no newlines, and surprisingly few behaved properly, because most programs ‘make unadvertised assumptions about the maximum length of a line of text (for an exception, see the BUGS section of sort(1)) 1 There's a good test of filesystem uniformity, due orginally to Doug Meliroy, that the UNIX fle sytem passes handily. Can the output of FORTRAN program be vsed as in tothe FORTRAN ‘compiler? A remarkable number of systems have trouble with this test 48 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 2 Non-text files definitely have their place. For example, very large data bases usually need extra address information for rapid access; this has to be binary for efficiency. But every file format that is not text must have its own family of support programs to do things that the standard tools could perform if the format were text, Text files may be a little less efficient in machine cycles, but this must be balanced against the cost of extra software to maintain ‘more specialized formats, If you design a file format, you should think care- fully before choosing a non-textual representation. (You should also think about making your programs robust in the face of long input lines.) 2.3 Directories and filenames All the files you own have unambiguous names, starting with /usz/you, but if the only file you have is junk, and you type 18, it doesn't print /usx/you/ junk; the filename is printed without any prefix: sis junk 5 That is because each running program, that is, each process, has a current directory, and. all filenames are implicitly assumed to start with the name of that directory, unless they begin directly with a slash. Your login shell, and 1s, therefore have a current directory. The command pwd (print working directory) identifies the current directory. 3 pea 7asr/you s The current directory is an attribute of a process, not a person or a program — people have login directories, processes have current directories. If a pro- cess creates a child process, the child inherits the current directory of its parent. But if the child then changes to a new directory, the parent is unaf- fected — its current directory remains the same no matter what the child does. ‘The notion of a current directory is certainly a notational convenience, because it can save a lot of typing, but its real purpose is organizational Related files belong together in the same directory. /usx is often the top directory of the user file system. (user is abbreviated to use in the same spirit as cmp, 1s, etc.) /usr/you is your login directory, your current direc tory when you first log in. /usr/sxe contains source for system programs, /asr/arc/ema contains source for UNIX commands, /usr/sr¢/cmé/sh contains the source files for the shell, and so on. Whenever you embark on new project, or whenever you have a set of related files, say a set of recipes, you could create a new directory with mkdir and put the files there. cuarrer 2 THE FILE SYSTEM 49 $s pwa 7ase/you S mkdir recipes $ cd recipes s pwd /isr/you/recipes $ mkdir pie cookie $ ed pie/apple $ ed cookie/ehoe. chip s Notice that it is simple to refer (o subdirectories. pie/apple has an obvious meaning: the apple pie recipe, in directory /usr/you/recipes/pie. You could instead have put the recipe in, say, recipes/apple.pie, rather than in a subdirectory of recipes, but it seems better organized (o put all the pies together, too. For example, the crust recipe could be kept in recipes/pie/crust rather than duplicating it in each pie recipe. Although the file system is a powerful organizational tool, you can forget where you put a file, or even what files you've got. The obvious solution is a command or two to rummage around in directories. The 1s command is cer~ ‘ainly helpful for finding files, but it doesn’t look in sub-directories, sca $18 junk recipes $ file + junk: ascii text recipes: directory $ 1s recipes cookie pie $ 1s recipes/pie apple s This piece of the file system can be shown pictorially as: 50 THE UNIX PROGRAMMING ENVIRONMENT: (CHAPTER 2 7 L Ys ‘The command du (disc usage) was written t0 tell how much disc space is consumed by the files in a directory, including all its subdirectories. sau 6 -/recipes/pie 4 ‘/recipes/cookie " ‘recipes a s The filenames are obvious; the numbers are the number of dise blocks —~ typi- cally 512 or 1024 bytes each — of storage for each file. The value for a direc- tory indicates how many blocks are consumed by all the files in that directory and its subdirectories, including the directory itself. ‘du has an option ~a, for “all,” that causes it (0 print out all the files in a directory. If one of those is a directory, au processes that as well: $ au -a -/recipes/pie/appie /eecipes/pie/erust s/recipes/pie ‘/recipes/cookie/ehoe.chip /recipes/cookie /recipes */junk ‘The output of du -a can be piped through gxep to look for specific files: $ du -a ! grep choc 3 /recipes/cookie/choe.chip : Recall from Chapter 1 that the name *." is a directory entry that refers to the directory itself; it permits access to a directory without having to know the full CHAPTER 2 ‘THE FILE SYSTEM 51 name, 4u looks in a directory for files; if you don’t tell it which directory, it assumes ‘.’, the directory you are in now. ‘Therefore, junk and ./Junk are names for the same file. Despite their fundamental properties inside the kernel, directories sit in the file system as ordinary files. They can be read as ordinary files. But they can't be created or written as ordinary files — to preserve its sanity and the users’ files, the kernel reserves to itself all control over the contents of direc- tories. The time has come to look at the bytes in a directory od eb 000000 4 =; . \o \o \o \0 \0 \o \o 0 \0 \o \o Wo \0 064 073 056 000 290 000 000 000 000 G90 400 900 00a G90 000 000 0000020 273 (=. - \0 \0 \D \0 \o \o \0 \0 \0 \o 0 0 273 050 056 056 000 00 000 000 000 090 00 000 000 G90 00 900 000040 252; ~r ¢ c i p @ s \0 \0 \0 \0 \o 10 0 252 073 162 145 143 181 160 145 163 000 00 900 a00 Ga0 00 000 o00av6o 230, = =3 uw mn & \0 XO Xo \0 \0 \0 \0 \o 10 0 230 075 152 165 156 153 000 000 000 490 00 000 009 a90 000 000 000100 See the filenames buried in there? ‘The directory format is a combination of binary and textual data. A directory consists of 16-byte chunks, the last 14 bytes of which hold the filename, padded with ASCII NUL's (which have value 0) and the first two of which tell the system where the administrative informa- tion for the file resides — we'll come back to that. Every directory begins with the two entries *." C*dot”) and *. .* (“dot-dot”) $ cd Home $ cd recipes & pwd /asr/you/recipes Sod .+3 ped Up one levet Zosr/¥ou cd --5 pwd Up another level dese S cd. pwd Up another level ped Up another level Can't go any higher Z Seca 2 The directory / is called the root of the file system. Every file in the sys- tem is in the root directory or one of its subdirectories, and the root is its own parent directory. Exercise 2-2. Given the information in this section, you should be able to understand roughly how the 1s command operates. Hint: cat» >f00; 18 -£ foo. 0 52. THE UNIX PROGRAMMING ENVIRONMENT cuarreR > Exercise 2-3. (Harder) How does the pwd command operate? 0 Exercise 2-4. du was written to monitor disc usage, Using it o find files in a directory Ihierarchy is at best a strange idiom, and perhaps inappropriate. As an alternative, look at the manual page for £ind(1), and compare the two commands. In particular, com. pare the command du -a ! grep ... with the corresponding invocation of £ind Which runs faster? Is it better to build # new tool or use a side effect of an old one? 2.4 Permissions Every file has a set of permissions associated with it, which determine who can do what with the file, If you're so organized that you keep your love letters on the system, perhaps hierarchically arranged in a directory, you prob- ably don’t want other people to be able to read them. You could therefore change the permissions on each letter to frustrate gossip (or only on some of the letters, to encourage it), or you might just change the permissions on the directory containing the letters, and thwart snoopers that way. But we must warn you: there is a special user on every UNIX system, called the super-user, who can read or modify any file on the system. ‘The special login name root carries super-user privileges; it is used by system administra tors when they do system maintenance. There is also a command called su that grants super-user status if you know the root password. Thus anyone who knows the super-user password can read your love letters, so don’t keep sensitive material in the file system, If you need more privacy, you can change the data in a file so that even the super-user cannot read (or at least understand) it, using the crypt. command, (crypt(1)). Of course, even crypt isn’t perfectly secure. A super-user can change the crypt command itself, and there are cryptographic attacks on the crypt algorithm. The former requires malfeasance and the latter takes hard work, however, so exypt is in practice fairly secure. In real life, most security breaches are due to passwords that are given away or easily guessed. Occasionally, system administrative lapses make it possible for a malicious user to gain super-user permission, Security issues are discussed further in some of the papers cited in the bibliography at the end of this chapter. ‘When you log in, you type a name and then verify that you are that person by typing a password. The name is your login identification, or login-id. But the system actually recognizes you by a number, called your user-id, or wid. In fact different login-id’s may have the same uid, making them indistinguishable to the system, although that is relatively rare and perhaps undesirable for secu- rity reasons. Besides a uid, you are assigned a group identification, or group- id, which places you in a class of users. On many systems, all ordinary users (as opposed to those with login-id’s like root) are placed in a single group called other, but your system may be different. The file system, and there- fore the UNIX system in general, determines what you can do by the cuarteR 2 TE FILE SYSTEM 53, permissions granted to your uid and group-id. The file /etc/passwd is the password file; it contains all the login infor- mation about each user, You can discover your wid and group-id, as does the system, by looking up your name in /etc/passwa: $ grep you /etc/pasewd you: gksbCTrJO4COM: 604: 1:¥.0.A.People: /usx/you: s ‘The fields in the password file are separated by colons and are laid out like this, (as seen in passwa(5)): login-id:encrypted password: id: groupsld: miscellany :login-directory shell The file is ordinary text, but the field definitions and separator are a conven- tion agreed upon by the programs that use the information in the file. The shell field is often empty, implying that you use the default shell, /ein/sh, The miscellany field may contain anything; often, it has your name and address or phone number Note that your password appears here in the second field, but only in an encrypted form. Anybody can read the password file (you just did), so if your password itself were there, anyone would be able to use it to masquerade as you. When you give your password to Login, it encrypts it and compares the ‘esult against the encrypted password in /etc/passwd. If they agree, it lets you log in. The mechanism works because the encryption algorithm has the property that it’s easy to go from the clear form to the encrypted form, but very hard to go backwards. For example, if your password is ka-boom, it ‘might be encrypted as gkmbCTrJO4com, but given the latter, there's no easy ‘way to get back to the original The kernel decided that you should be allowed to read /ete/passwa by looking at the permissions associated with the file. There are three kinds of permissions for each file: read (i.e., examine its contents), write (i.e., change its contents), and execute (i.e., run it as a program). Furthermore, different permissions can apply to different people. As file owner, you have one set of Fead, write and execute permissions. Your “group” has a separate set. Every- fone else has a third set. ‘The -1 option of 1s prints the permissions information, among other things: $e -1 /ete/pasewa srw-r--r-- 1 root 5115 Aug 30 10:40 /etc/passua $ 1s -19 “etc/passwd sew-r--r-— 1 adm 5115 Aug 30 10:40 /ete/passud ‘These two lines may be collectively interpreted as: /ete/passwd is owned by login-id root, group adm, is 5115 bytes long, was last modified on August 30 at 10:40 AM, and has one link (one name in the file system; we'll discuss links S4 THE UNIX PROGRAMMING EXVIRONMENT CHAPTER 2 in the next section). Some versions of 1s give both owner and group in one invocation ‘The string ~rw-r--r-~ is how 1s represents the permissions on the file. ‘The first ~ indicates that it is an ordinary file. If it were a directory, there would be a d there, The next three characters encode the file owner's (based fon uid) read, write and execute permissions. xw- means that root (the owner) may read or write, but not execute the file, An executable file would have an x instead of a dash, ‘The next three characters (x--) encode group permissions, in this case that people in group adm, presumably the system administrators, can read the file but not write or execute it. ‘The next three (also r--) define the permissions for everyone else — the rest of the users on the system. On this machine, then, only root: can change the login information for a user, but anybody may read the file to discover the information. A plausible alternative would be for group adm to also have write permission on /ete/passwa. The file /etc/group encodes group names and group-id’s, and defines which users are in which groups. /ete/passwd identities only your login group; the newgrp command changes your group permissions to another group. Anybody can say $ ed /ete/pasewd and edit the password file, but only root can write back the changes. You ‘might therefore wonder how you can change your password, since that involves editing the password file. The program to change passwords is called passwd; you will probably find it in bin: $18 -1 /bin/passwd “rwsx-xr-x 1 root. 2454 Jan 4 1983 /bin/pacewd 5 (Note that /etc/passwa is the text file containing the login information, while /bin/passwa, in a different directory, is a file containing an executable program that lets you change the password information.) The permissions here state that anyone may execute the command, but only root can change the passwd command. But the s instead of an x in the execute field for the file owner states that, when the command is run, it is to be given the permissions corresponding to the file owner, in this case root. Because /bin/passwd is “setuid” to root, any user can run the passwd command to edit the pass- word file ‘The set-uid bit is a simple but elegant ideat that solves a number of security problems. For example, the author of a game program can make the program setuid to the owner, so that it can update a score file that is otherwise ‘uid bit is patented by Dennis Ritchie Haren 2 ‘THE FILE SYSTEM 5S protected from other users’ access. But the set-uid concept is potentially Gangerous. /bin/passwd has to be correct; if it were not, it could destroy system information under root’s auspices. If it had the permissions “rwarwxxwx, it could be overwritten by any user, who could therefore replace the file with a program that does anything. This is particularly serious for a setuid program, because root has access permissions to every file on the sys- tem, (Some UNIX systems turn the set-uid bit off whenever a file is modified, to reduce the danger of a security hole.) ‘The set-uid bit is powerful, but used primarily for a few system programs such as passwd. Let’s look at a more ordinary file $ 1s -1 /bin/wno mewxrwxr-x 1 root 6348 Mar 29 1983 /bin/who s who is executable by everybody, and writable by root and the owner's group. ‘What “executable” means is this: when you type who to the shell, it looks in a set of directories, one of which is /bin, for a file named “who.” If it finds such a file, and if the file has execute permission, the shell calls the kernel to run it. The kernel checks the permissions, and, if they are valid, runs the program. Note that a program is just a file with exe- cute permission. In the next chapter we will show you programs that are just text files, but that can be executed as commands because they have execute permission set Directory permissions operate a little differently, but the basic idea is the $ 1s -14 drwaewsr-x 3 you 80 Sep 27 06:11 « s ‘The -€ option of 19 asks it to tell you about the directory itself, rather than its contents, and the leading ¢ in the output signifies that *. is indeed a directory. An r field means that you can read the directory, so you can find out what files are in it with 1s (or 04, for that matter). A w means that you can create and delete files in this directory, because that requires modifying and therefore writing the directory file Actually, you cannot simply write in a directory — even root is forbidden 0 do 0, $ who >. Try 10 overwrite 1 cannot create You can't 5 Instead there are system calls that create and remove files, and only through them is it possible to change the contents of a directory. The permissions idea, 56 THE UNIX PROGRAMMING ENVIRONMENT (CHAPTER 2 however, still applies: the w fields tell who can use the system routines 10 modify the directory, Permission to remove a file is independent of the file itself. If you have write permission in a directory, you may remove files there, even files that are protected against writing. The xm command asks for confirmation before removing a protected file, however, to check that you really want to do so — fone of the rare occasions that # UNIX program double-check your intentions, (The ~£ flag to rm forces it to remove files without question.) The x field in the permissions on a directory does not mean execution; it means “search.” Execute permission on a directory determines whether the directory may be searched for a file. It is therefore possible to create a direc- tory with mode ~~x for other users, implying that users may access any file that they know about in that directory, but may not run 1s on it or read it to see what files are there. Similarly, with directory permissions x-~, users can see (1s) but not use the contents of a directory. Some installations use this device to turn off /usr/games during busy hours. ‘The chmod (change mode) command changes permissions on files. $ chmod permissions filenames ‘The syntax of the permissions is clumsy, however. They can be specified in two ways, either as octal numbers or by symbolic description. ‘The octal numbers are easier 0 use, although the symbolic descriptions are sometimes ‘convenient because they can specify relative changes in the permissions. It ‘would be nice if you could say $ chmod rw-rw-rw- junk Doesn't work his way! rather than $ chmod 666 junk but you cannot. The octal modes are specified by adding together a 4 for read, 2 for write and 1 for execute permission. The three digits specify, as in Ls, permissions for the owner, group and everyone else. The symbolic codes are difficult to explain; you must look in chmod(1) for a proper description, For our purposes, it is sufficient to note that + turns a permission on and that = turns it off. For example $ chmod +x command allows everyone (o execute command, and $ chnod -w file turns off write permission for everyone, including the file's owner. Except for the usual disclaimer about super-users, only the owner of a file may change the permissions on a file, regardless of the permissions themselves. Even if some- body else allows you to write a file, the system will not allow you to change its ‘CHAPTER > WE FILE SysTEM 57 permission bits. $ 1s -14 /usr/mary Grwxewarwe 5 mary 704 Sep 25 10:18 /usr/mary $ chnod 444 /usr/mary chnod: can’t change /usr/mary s If a directory is writable, however, people can remove files in it regardless of the permissions on the files themselves. If you want to make sure that you or your friends never delete files from a directory, remove write permission from it: sca $ date >tenp 8 chmod -w . ‘Make directory unwriable Sls -14 @r-xe-xr-x 3 you 80 Sep 27 11:48, $ rn temp rm: temp not renoved Can't remove file $ chmod 775 . Restore permission sis -1d dewaewxr-2 3 you 80 Sep 27 11:48 S rm temp s Now you can ‘temp is now gone. Notice that changing the permissions on the directory didn’t change its modification date. ‘The modification date reflects changes to the file’s contents, not its modes. The permissions and dates are not stored in the file itself, but in a system structure called an index node, or i-node, the subject of the next section. Exercise 2-5. Experiment with chmod. Try different simple modes, like 0 and 1. Be careful not to damage your login directory! © 2.8 Inodes A file has several components: a name, contents, and administrative infor- mation such as permissions and modification times. The administrative infor~ mation is stored in the inode (over the years, the hyphen fell out of “i-node”), along with essential system data such as how long it is, where on the disc the contents of the file are stored, and so on There are three times in the inode: the time that the contents of the file were last modified (written); the time that the file was last used (read or exe~ cuted); and the time that the inode itself was last changed, for example to set the permissions, 58 THE UNIX PROGRAMMING EXVIRONMENT. CHAPTER 2 $ ate ‘Tue Sep 27 12:07:24 EDT 1983 $ date >junk 8 is -1 junk srw-rw-rw= 1 you 29 Sep 27 12:07 junk $ 1s -1u junk -ew-rw-rw= 1 you 29 Sep 27 06:11 junk $ 1s -1c junk werw-re- 1 you 29 Sep 27 12:07 junk Changing the contents of a file does not affect its usage time, as reported by 1s -1u, and changing the permissions affects only the inode change time, as reported by 18 -1e. $ chmod 444 junk $ le -1u junk we-e--r-= 1 you 29 Sep 27 06:11 junk 8 4s -1e junk sr--r-= 1 you 29 Sep 27 12:11 junk § chmod 666 junk s The ~t option to 18, which sorts the files according to time, by default that of last modification, can be combined with -e or ~u to report the order in which inodes were changed or files were read: 8 Is recipes cookie pie $ is -Iut total 2 @rwxrwxrwx 4 you 64 Sep 27 12:11 recipes mew-rw-rw- 1 you 29 Sep 27 06:11 junk s recipes is most recently used, because we just looked at its contents. IC is important to understand inodes, not only to appreciate the options on 1s, but because in a strong sense the inodes are the files. All the directory hierarchy does is provide convenient names for files. The system’s internal ‘name for a file is its -mumber: the number of the inode holding the file’s infor~ mation, 18 ~i reports the i-number in decimal: $ date »x $ 1s -3 45768 junk 15274 recipes 39852 x s It is the number that is stored in the first two bytes of a directory, before the CHAPTER 2 THE FILE SYSTEM. 59 name. od ~4 will dump the data in decimal by byte pairs rather than octal by bytes and thus make the i-number visible. sod -e po00000 4 =; . \0 \o \9 \0 \0 Xo \o \o \0 Xo \0 Xo \0 19000020 273, + X8\O XO Xo \o 9 XO \o \o X09 \0 Xo 000040 252 roe © $ pe # 0 Wo \0 \0 to 10 10 000060 230 $e nm XD \0 \0 XO \0 0 \0 SD Xo v0 000100 254 = x \O \0 \0 \D \0 \0 \0 0 \0 Xo \0 0 00 9000120 3 od a. (0000000 #5156 00046 00000 90000 00000 0090 oad00 ova00 (9000020 10427 11822 00000 90000 00000 90000 00000 00000 00004 15274 25970 26373 25968 00775 00000 00000 00000 (000060 15768 20058 27502 00000 60000 90000 00000 09000 ‘000100 15852 00120 00000 20000 09000 90000 o9900 90000 000720 The first two bytes in each directory entry are the only connection between the name of a file and its contents. A filename in a directory is therefore called a link, because it links a name in the directory hierarchy to the inode, and hence to the data, The same i-number can appear in more than one directory. The xm command does not actually remove inodes; it removes directory entries or links. Only when the last link to a file disappears docs the system remove the inode, and hence the file itself. If the i:number in a directory entry is zero, it means that the link has been removed, but not necessarily the contents of the file — there may still be a link somewhere else. You can verify that the i-number goes to zero by removing $ mx $ od -d 9000000 15156 00046 00000 00000 00000 c0000 50000 00000 0000020 10427 11822 00000 00000 00000 00000 00000 00000 0000040 15274 25970 26979 25968 00115 00000 00000 00000 0000060 15768 30058 27502 00000 09000 00000 00000 00000 9000100 09000 00420 00000 00000 09000 00000 09000 00000 0000120 s The next file created in this directory will go into the unused slot, although it will probably have a different i-number. ‘The 1n command makes a link to an & $ In old,file new-file sting file, with the syntax The purpose of a link is to give two names to the same file, often so it can appear in two different directories. On many systems there is a link to /oin/ed called /bin/e, so that people can call the editor e. Two links to a 60. THE UNIX PROGRAMMING ENVIRONMENT (CHAPTER? file point to the same inode, and hence have the same i-number: $ In junk Linktojunk $ is “13 total 3 18768 -rw-xw-rw- 2 you 29 Sep 27 12 15768 -rw-rw-rw- 2 you 29 Sep 27 12 15274 drwxcwarwx 4 you 64 Sep 27 09:34 recipes s The integer printed between the permissions and the owner is the number of links to the file, Because each link just points to the inode, each link is equally important — there is no difference between the first link and subsequent ones. (Notice that the total dise space computed by 1s is wrong because of double counting.) When you change a file, access to the file by any of its names will reveal the changes, since all the links point to the same file. $ echo x >junk $ Is -1 total 3 Sew-rw-tw- 2 you cew-rw-rw- 2 you Grwxrwarwx 4 you $ rm Linktojunk junk Linkto junk recipes $is-1 total 2 Spwerw-rw- 1 you 2 sep 27 12:37 junk Qruxrwxews 4 you 64 Sep 27 09:34 recipes s Alter Linktojunk is removed the link count goes back to one. As we said before, rm‘ing a file just breaks a link; the file remains until the last link is removed. In practice, of course, most files only have one link, but again we see a simple idea providing great flexibility. A word to the hasty: once the last link to a file is gone, the data is irretriev- able. Deleted files go into the incinerator, rather than the waste basket, and there is no way to call them back from the ashes. (There is a faint hope of resurrection. Most large UNIX systems have a formal backup procedure that periodically copies changed files to some safe place like magnetic tape, from which they can be retrieved. For your own protection and peace of mind, you should know just how much backup is provided on your system. If there is none, Watch out — some mishap to the discs could be a catastrophe.) Links to files are handy when two people wish to share a file, but some- times you really want a separate copy — a different file with the same infor- mation. You might copy a document before making extensive changes to it, for example, so you can restore the original if you decide you don't like the changes. Making a link wouldn't help, because when the data changed, both CHAPTER 2 ‘THE FILE SYSTEM 61 links would reflect the change. cp makes copies of files: $ op Junk copyofjunk $ Is “24 total 3 49850 -rw-rw-rw- 1 you 2 Sep 27 13:13 copyofjunk 48768 -rw-zw-rw- 1 you 2 Sep 27 12:37 junk 48274 arwxrwxrwx 4 you 64 Sep 27 09:34 recipes ‘The i-numbers of junk and copyof junk are different, because they are dif- ferent files, even though they currently have the same contents. It’s often a ‘good idea to change the permissions on a backup copy so it’s harder to remove it accidentally $ chmod -w copyof junk Turn off write permission $ Is -11 total 3 43850 —r--r--r-- 1 you 2 Sep 27 13:13 copyof junk 18768 -rw-rw-rw- 1 you 2 Sep 27 12:37 junk 48274 dewarwxewx 4 you 64 Sep 27 09:34 recipes 8 rm copyof junk rm: copyofjunk 444 mode n No! It's precious $ date >junk $ Je -1i total 2 $5850 -r--r--r-- 1 you 2 Sep 27 13:13 copyofjunk 1768 -xw-rw-rw- 1 you 29 sep 27 13:16 junk 18274 drwxrwxrwx 4 you 64 Sep 27 09:34 recipes $ xm copyot junk rm: copyof junk 444 mode y Well, maybe not so precious s Is -11 total 2 18768 -ew-rw-rw- 1 you 29 Sep 27 13:16 Junk 18274 drwxrwxrwx 4 you 64 Sep 27 09:34 recipes s Changing the copy of a file doesn’t change the original, and removing the copy has no effect on the original. Notice that because copyof junk had write per mission turned off, rm asked for confirmation before removing the file. There is one more common command for manipulating files: mv moves or renames files, simply by rearranging the links. Its syntax is the same as ep and In: 62. THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 2 $ mv junk sameoldjunk $ is -13 total 2 15274 drwxrwxrwx 4 you 64 Sep 27 09:34 recipes 15768 -ew-rw-rw- 1 you 29 Sep 27 13:16 sameoldjunk s sameoldjunk is the same file as our old junk, right down to the i:aumber; only its name — the directory entry associated with inode 15768 — has been changed. We have been doing all this file shuffling in one directory, but it also works across directories. In is often used to put links with the same name in several directories, such as when several people are working on one program or docu- ment. mv can move a file or directory from one directory to another. In fact, these are common enough idioms that mv and ep have special syntax for them: 8 mv (or ep)filel filed... directory moves (or copies) one or more files to the directory which is the last argument. ‘The links or copies are made with the same filenames, For example, if you ‘wanted to try your hand at beefing up the editor, you might begin by saying $ ep /usr/sre/end/ed.c to get your own copy of the source to play with. If you were going to work on the shell, which is in a number of different source files, you would say $ mkdir oh $ cp /usr/src/end/eh/s ah and cp would duplicate all of the shell’s source files in your subdirectory sh (assuming no subdirectory structure in /usr/sxc/emé/sh — cp is not very clever). On some systems, 1n also accepts multiple file arguments, again with a directory as the last argument. And on some systems, mv, cp and 1m are themselves Tinks to a single file that examines its name to see what service to perform. Exercise 2-6. Why does Ls -1 report 4 links to recipes? Hint: try 8 12-10 /asr/you Why is this useful information? © Exercise 2-7. What is the difference between av junk junky and ep junk junket rm Junk Hint: make a link to unk, then ty it, © (CHAPTER 2 THE FILE SYSTEM 63, Exercise 28. cp doesn’t copy subdirectories, it just copies files at the first evel of a hierarchy. What does it do if one of the argument files is a directory? Is this kind or even sensible? Discuss the relative merits of three possibilities: an option to op to det- cend directories, a separate command rep (recursive copy) to do the job, or just having cop copy a directory recursively when it finds one, See Chapter 7 for help on providing this facility. What other programs would profit from the ability to traverse the directory tree? o 2.6 The directory hierarchy in Chapter 1, we looked at the file system hierarchy rather informally, starting from /usx/you, We're now going to investigate it in a more orderly way, starting from the top of the tree, the root. ‘The top directory is /. sis/ bin Boot aev Lb tmp 8 /anix is the program for the UNIX kernel itself: when the system starts, “unix is read from disc into memory and started. Actually, the process ‘occurs in two steps: first the file /boot is read; it then reads in /unix. More information about this “bootstrap” process may be found in boot(8). The rest of the files in /, at least here, are directories, each a somewhat self-contained section of the total file system. In the following brief tour of the hierarchy, play along with the text: explore a bit in the directories mentioned. ‘The more familiar you are with the layout of the file system, the more effectively you will be able to use it. Table 2.1 suggests good places to look, although some of the names are system dependent. Poin (binaries) we have seen before: it is the directory where the basic programs such as who and ed reside. Zdev (devices) we will discuss in the next section Zete (et cetera) we have also seen before. It contains various administra- tive files such as the password file and some system programs such as Zete/getty, which initializes a terminal connection for /bin/Login. Zete/re is a file of shell commands that is executed after the system is bootstrapped. /ete/group lists the members of each group. /14b (library) contains primarily parts of the C compiler, such as /1b/epp, the C preprocessor, and /1ib/1ibe.a, the C subroutine library. emp (temporaries) is a repository for short-lived files created during the 64 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 2 [es Table 2.1: Interesting Directories (see also hier(7)) / root of the file system Poin essential programs in executable form (“binaries”) /eev device files ete system miscellany Jete/nota Togin message of the day sete/passwd password file ib essential libraries, ete. 7emp temporary files; cleaned when system is restarted sonix executable form of the operating system yosr user file system Zusr/adn system administration: accounting info., ete. Zusx/bin user binaries: troff, etc. dusr/dict dictionary (words) and support for spe21(1) /usr/games game programs | susr/inelude header files for C programs, e.g. math. /usr/include/sys system header files for C programs, e.g, inode.h sase/1ib aries for C, FORTRAN, ete dase /man on-line manual /osr/man/nant manual pages for section 1 of manual /asr/ndec hardware diagnostics, bootstrap programs, ete. /asr/news community service messages /asr/pub public oddments: see aeii(7) and eqnchar(7) dasr/sre source code for utilities and libraries /usr/sre/omd source for commands in /bin and /usr/bin /usr/sre/1ib source code for subroutine libraries /usx/spool working directories for communications programs /usx/spool/1pd ine printer temporary directory /usx/spool/mail mail in-boxes | fasx/spool/aucp working directory for the uucp programs | fusr/sys ‘ource for the operating system kernel /usr/tmp alternate temporary directory (little used) Zusr/you your login directory Zusr/you/b: your personal programs execution of a program. When you start up the editor ed, for example, it creates a file with a name like /tmp/e00512 to hold its copy of the file you are editing, rather than working with the original file. It could, of course, create the file in your current directory, but there are advantages to placing it in /tmp: although it is unlikely, you might already have a file called e00512 in your directory; /tmp is cleaned up automatically when the system starts, so your directory doesn’t get an unwanted file if the system crashes; and often emp is arranged on the dise for fast access, CHAPTER 2 ‘THE FILE SYSTEM 65 There is a problem, of course, when several programs create files in /emp at once: they might interfere with each other's files, That is why ed's tem- porary file has a peculiar name: itis constructed in such a way as to guarantee that no other program will choose the sime name for its temporary file, In Chapters 5 and 6 we will see ways to do ths. Zasr is called the “user file system,” although it may have little to do with the actual users of the system. On our machine, our login directories are Zast/owk and /usz/eob, but on your machine the /asx part might be dif- ferent, as explained in Chapter 1. Whether or not your personal files are in a subdirectory of /usr, there are a number of things you are likely to find there (although local customs vary in this regard, too). Just as in /, there are direc tories called /usr/bin, /usx/1ib and /usz/emp. These directories have functions similar to their namesakes in /, but contain programs less critical to the system. For example, nrofé is usually in /usx/bin rather than /bin, and the FORTRAN compiler libraries live in /use/14, Of course, just what is deemed “critical” varies from system to system. Some systems, such as the istrbuted 7th Edition, have all the programs in /bin and do away with Zusr/bin altogether; others spit /use/bin into two directories according to frequency of use Other directories in /use are /usz/ada, containing accounting informa tion and /usr/aict, which holds a modest dictionary (see spe11(1)).. The on-line manual is kept in /usz/man — see /asr/man/man1/spell. 1, for example. If your system has source code on-line, you will probably find it in dasr/aze. It is worth spending a litte time exploring the file system, especially /usr, to develop a feeling for how the file system is organized and where you might expect to find things. 2.7 Devices We skipped over /dev in our tour, because the files there provide a nice review of files in general. As you might guess from the name, /dev contains device files. One of the prettiest ideas in the UNIX system is the way it deals with peri- pherals — discs, tape drives, line printers, terminals, etc. Rather than having special system routines to, for example, read magnetic tape, there is a file called /dev/mt0 (again, local customs vary). Inside the kernel, references to that file are converted into hardware commands to access the tape, $0 if a pro- gram reads /dev/mt0, the contents of a tape mounted on the drive are returned. For example, $ cp /dev/nto junk copies the contents of the tape to a file called junk. cp has no idea there is anything special about /dev/mt0; itis just a file —a sequence of bytes, 66 THE UNIX PROGRAMMING ENVIRONMENT (CHAPTER? The device files are something of a 200, each creature a little different, but the basic ideas of the file system apply to each. Here is a significantly shor- tened list of our /dev: $ Is -1 /aev 1 root 0, 0 Sep 27 console 1 root 3, 1 Sep 27 keen erwer--r-- 1 root 3, O my 6 nen bry-rw-rw- 1 root 1, 64 aug 24 to cew-rw-rw- 1 root 3, 2 sep 28 null cew-rw-rw- 1 root 4, 64 Sep 9 rmt0 1 root 2, 0 sep 8 p00 1 root 2, 1 sep 27 p01 root 13,0 Apr 12 rxpoo 1 root 13, 1 Jul 28 rrp01 1 root 2, ogul § tty, erw-w--w- 1 you 1, 0 Sep 28 ttyo 1 root 1) 1 Sep 27 tty! 1 root 1, 2 sep 27 ety2 1 root 1, 3 sep 27 tty3 The first things to notice are that instead of @ byte count there is a pair of small integers, and that the first character of the mode is always a ‘b’ or a ‘c" This is how 1s prints the information from an inode that specifies a device rather than a regular file. The inode of a regular file contains a list of dise blocks that store the file’s contents. For a device file, the inode instead con- tains the internal name for the device, which consists of its type — character (©) oF block (>) — and a pair of numbers, called the major and minor device numbers. Discs and tapes are block devices; everything else — terminals, printers, phone lines, ete. — is a character device, The major number encodes the type of device, while the minor number distinguishes different instances of the device. For example, /dev/ttyo and /dev/tty? are two ports on the same terminal controller, $0 they have the same major device number but dif- ferent minor numbers Disc files are usually named after the particular hardware variant they represent. /dev/xp00 and /dev/xp01 are named after the DEC RPO6 dise drive attached to the system, There is just one drive, divided logically into two file systems. If there were a second drive, its associated files would be named /aev/zp10 and /dev/ep11. The first digit specifies the physical drive, and the second which portion of the drive. You might wonder why there are several disc device files, instead of just fone. For historical reasons and for ease of maintenance, the file system is divided into smaller subsystems. The files in a subsystem are accessible through a directory in the main system. The program /ete/mount. reports the correspondence between device files and directories: CHAPTER 2 WE FILE sysTEM 67 $ /etc/mount zp01 on /usr s In our case, the root system occupies /dev/rp00 (although this isn’t reported by /etc/mount) while the user file system — the files in usr and its sub- directories — reside on /dew/xp01 ‘The root file system has to be present for the system to execute, /bin, /éev and /ete are always kept on the root system, because when the system starts only files in the root system are accessible, and some files such as bin/sh are needed to run at all. During the bootstrap operation, all the file systems are checked for self-consistency (see icheck(8) or fsck(8)), and attached to the root system. This attachment operation is called mounting, the software equivalent of mounting a new disc pack in a drive; it can normally be done only by the super-user. After /dev/rp01 has been mounted as /usr, the files in the user file system are accessible exactly as if they were part of the root system For the average user, the details of which file subsystem is mounted where are of little interest, but there are a couple of relevant points. First, because the subsystems may be mounted and dismounted, it is illegal to make a link to 4 file in another subsystem. For example, it is impossible to link programs in bin to convenient names in private bin directories, because /usr is in a dif- ferent file subsystem from /bin: $ In /bin/mail /usr/you/bin/m An: Cross-device link ‘ There would also be a problem because inode numbers are not unique in dif- ferent file systems. Second, each subsystem has fixed upper limits on size (number of blocks available for files) and inodes. If a subsystem fills up, it will be impossible to cenlarge files in that subsystem until some space is reclaimed. The af (disc free space) command reports the available space on the mounted file subsys- tems sar /aew/ep00 1989 Yaev/ep01 21257 ‘ /asr has 21257 free blocks. Whether this is ample space or a crisis depends on how the system is used; some installations need more file space headroom than others. By the way, of all the commands, af probably has the widest variation in output format. Your af output may look quite different Let’s turn now to some more generally useful things. When you log in, you get a terminal line and therefore a file in /dev through which the characters 68 THE UNIX PROGRAMMING ENVIRONMENT (CHAPTER 2 you type and receive are sent. The tty command tells you which terminal you are using: $ who am i you ttyo Sep 28 01:02 8 tty /aev/ttyo $ Is -1 /aev/etyo crwe-w--w- 1 you 1, 12 Sep 28 02:40 /dev/etyo $ date >/dev/etyo Wed Sep 28 02:40:51 EDT 1983 s Notice that you own the device, and that only you are permitted to read it. In other words, no one else can directly read the characters you are typing. Any- fone may write on your terminal, however. To prevent this, you could chmod the device, thereby preventing people from using write to contact you, or you could just use mesg. $ mesg a Turn off messages $ le -1 /aev/etyo erw + you 4, 12 Sep 28 02:41 /aev/ttyo $ meeg y Reswore s It is often useful to be able to refer by name to the terminal you are using, but it’s inconvenient to determine which one it is. The device /Aev/tty is a synonym for your login terminal, whatever terminal you are actually using. $ date »/aev/tty Wed Sep 28 02:42:23 EDT 1983 s /dev/tty is particularly useful when program needs to interact with a user even though its standard input and output are connected to files rather than the terminal. crypt is one program that uses /dev/tty. The “clear text” comes from the standard input, and the encrypted data goes to the standard ‘output, so exypt reads the encryption key from /dev/tty: $ crypt cryptedtext Enter key: ‘Type encryption key s The use of /dew/tty isn’t explicit in this example, but it is there. If crypt read the key from the standard input, it would read the first line of the clear text. So instead crypt opens /dev/tty, turns off automatic character echo- ing so your encryption key doesn't appear on the screen, and reads the key. In Chapters 5 and 6 we will come across several other uses of /dew/tty. Occasionally you want to run a program but don’t care what output is pro- duced. For example, you may have already seen today’s news, and don’t want ‘CHAPTER 2 ‘THE FILE SYSTEM 69 to read it again. Redirecting news to the file /dev/nu11 causes its output to be thrown away: $ news >/dev/null s Data written to /dev/nult is discarded without comment, while programs that read from /dev/null get end-of-file immediately, because reads from dev/null always return zero bytes. ‘One common use of /dev/nult is to throw away regular output so that diagnostic messages are visible. For example, the time command (time(1)) reports the CPU usage of a program. ‘The information is printed on the stan- dard error, so you can time commands that generate copious output by sending the standard output to /4ev/nul1: $ 1s -1 /usr/dict/words Se-r-- 1 Bin 196513 Jan 20 1979 /usr/dict/words § time grep © /usr/dict/words >/dev/null real 1 sys $ tine egrep @ /usr/dict/words >/dev/null real 2.0 user 3.9 sys ae s ‘The numbers in the output of time are elapsed clock time, CPU time spent in the program and CPU time spent in the kernel while the program was running. ‘egrep is a high-powered variant of grep that we will discuss in Chapter 4; it’s about twice as fast as grep when searching through large files. If output from grep and egrep had not been sent to /dev/null or a real file, we would have had to wait for hundreds of thousands of characters to appear on the ter- minal before finding out the timing information we were after, Exercise 2:9. Find out about the other files in /dev by reading Section 4 of the ‘manual. What is the difference between /dev/me0 and /dev/xmtO? Comment on the potential advantages of having subdirectories in dev for dises, tapes, ete. © Exercise 2-10, Tapes written on non-UNIx systems often have different block sizes, such as 800 bytes — ten B0-character card images — but the tape device /dev/meO expects S12-byte blocks, Look up the 4 command (44(1)) to see how to read such a tape. Exercise 2-11. Why isn't /€ev/ety just a link to your login terminal? What would hhappen if it were mode w- like your login terminal? © Exercise 2-12. How does write(!) work? Hint: see utmp(5). © Exercise 2-13, How can you tel if a user has been active atthe terminal recently? © 70) THE UNIK PROGRAMMING ENVIRONMENT. ccuarreR 2 istory and bibliographic notes ‘The file system forms one part of the discussion in “UNIX implementation,” by Ken Thompson (BSTJ, July, 1978). A paper by Dennis Ritchie, entitled “The evolution of the UNIX time-sharing system” (Symposium on Language Design and Programming Methodology, Sydney, Australia, Sept. 1979) is an fascinating description of how the file system was designed and implemented on the original PDP-7 UNIX system, and how it grew into its present form. ‘The UNIX file system adapts some ideas from the MULTICS file system. The MULTICS System: An Examination of its Structure, by E. 1. Organick (MIT Press, 1972) provides a comprehensive treatment of MULTICS. “Password security: a case history,” by Bob Morris and Ken Thompson, is an entertaining comparison of password mechanisms on a variety of systems; it can be found in Volume 2B of the unix Programmer's Manual. In the same volume, the paper “On the security of UNIX,” by Dennis Ritchie, explains how the security of a system depends more on the care taken with its administration than with the details of programs like crypt. cnaprer 3: USING THE SHELL ‘The shell — the program that interprets your requests to run programs — is the most important program for most UNIX users; with the possible exception of your favorite text editor, you will spend more time working with the shell than ‘any other program. In this chapter and in Chapter 5, we will spend a fair amount of time on the shell’s capabilities. The main point we want to make is that you can accomplish a lot without much hard work, and certainly without resorting to programming in a conventional language like C, if you know how to use the shell ‘We have divided our coverage of the shell into two chapters. This chapter goes one step beyond the necessities covered in Chapter I to some fancier but commonly used shell features, such as metacharacters, quoting, creating new commands, passing arguments to them, the use of shell variables, and some elementary control flow. These are topics you should know for yout own use of the shell. The material in Chapter 5 is heavier going — it is intended for writing serious shell programs, ones that are bullet-proofed for use by others. The division between the two chapters is somewhat arbitrary, of course, so both should be read eventually. 3.1. Command line structure To proceed, we need a slightly better understanding of just what a com- ‘mand is, and how it is interpreted by the shell. This section is a more formal coverage, with some new information, of the shell basics introduced in the first chapter, The simplest command is a single word, usually naming a file for execution (later we will see some other types of commands) 8 who Execute the file /oin/ who you ety2 Sep 28 07:55 jp ttyd Sep 28 08:32 ‘ ‘A command usually ends with a newline, but a semicolon ; is also a command terminator: n 72 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 3 8 date: Wed Sep 28 09:07:15 EDT 1983 $ dates who Wed Sep 28 09:07:23 Bor 1983 you tty? Sep 28 07:51 ipl ttyd Sep 28 08:32 5 Although semicolons can be used to terminate commands, as usual nothing happens until you type RETURN. Notice that the shell only prints one prompt, after multiple commands, but except for the prompt, $ dates who is identical to typing the two commands on doesn’t run until date has finished. ‘Try sending the output of “date; who” through a pipe: ifferent lines. In particular, who $ date; who! we Wed Sep 28 09:08:48 EDT 1983 a= 10: 0: ’ This might not be what you expected, because only the output of who goes (0 we. Connecting who and we with a pipe forms a single command, called a pipeline, that runs after date. The precedence of | is higher than that of as the shell parses your command line. Parentheses can be used to group commands: $ (date; who) Wed Sep 28 09:11:09 Epr 1983 you tty? Sep 28 07:51 3pl etyd Sep 28 08:32 $ (date; who) | wo 95 anes 69. The outputs of date and who are concatenated into a single stream that can be sent down a pipe. Data flowing through a pipe can be tapped and placed in a file (but not another pipe) with the tee command, which is not part of the shell, but is nonetheless handy for manipulating pipes. One use is to save intermediate out- put in a file: CHAPTER 3 USING THE SHELL 73 $ (date; who) | tee save | we 3 16 88 Output from we $ cat save Wed Sep 28 09:13:22 EDT 1983 you tty? Sep 28 07:51 3p ttyd Sep 28 08:32 Swe , 1, 5 and &, are nor arguments to the programs the shell runs. They instead control how the shell runs them. For example, $ echo Hello >junk tells the shell to run echo with the single argument Ke11o, and place the out- put in the file junk. The string >junk is not an argument to echo; it is interpreted by the shell and never seen by echo. In fact, it need not be the last string in the command: $ >junk echo Hello is identical, but less obvious, Exercise 3-1. What are the differences among the following three commands? scat file | pe 8 pr stile 8 pr file (Over the years the redirection operator < has lost some ground to pipes: people seem to find “cat £i1e 1" more natural than “e£ie".) © 3.2. Metacharacters ‘The shell recognizes a number of other characters as special; the most com- monly used is the asterisk » which tells the shell t0 search the directory for filenames in which any string of characters occurs in the position of the +. For example, $ echo + is a poor facsimile of 1s. Something we didn’t mention in Chapter 1 is that the filename-matching characters do not look at filenames beginning with a ‘CHAPTER 3 vsINa THE SHELL 75 dot, to avoid problems with the names *.” and *..” that are in every directory. ‘The rule is: the filename-matching characters only match filenames beginning with a period if the period is explicitly supplied in the pattern. As usual, a judicious echo or two will clarify what happens: $16 profile junk temp 8 echo + junk temp $ echo .« toss profile $ Characters like + that have special properties are known as metacharacters ‘There are a lot of them: Table 3.1 is the complete list, although a few of them ‘won't be discussed until Chapter 5. Given the number of shell metacharacters, there has to be some way to say to the shell, “Leave it alone.” The easiest and best way to protect special characters from being interpreted is to enclose them in single quote characters: $ echo ‘+r’ s I's also possible to use the double quotes "...", but the shell actually peeks inside these quotes to look for $, °...’, and \, so don’t use *..." unless you intend some processing of the quoted string ‘A third possibility is to put a backslash \ in front of each character that you want to protect from the shell, as in $ echo \e\\e Although \#\+\¢ isn’t much like English, the shell terminology for itis still a word, which is any single string the shell accepts as a unit, including blanks if they are quoted. Quotes of one kind protect quotes of the other kind: $ echo "Don’t do that!” Don’t do that! s and they don’t have to surround the whole argument: $ echo x’s’y xy 3 echo “#/A/2" on? s 76 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 3 ‘Table 3.1: Shell Metacharacters > prog >file dicect standard output to file > prog >>/lle append standard output to file < prog file take standard input from file ' ips connect standard output of p; to standard input of py e In this last example, because the quotes are discarded after they've done their job, echo sees a single argument containing no quotes. Quoted strings can contain newlines: $ echo ’hello > world’ nello world s The string *> * is a secondary prompt printed by the shell when it expects you to type more input to complete a command, In this example the quote on the first Tine has to be balanced with another. The secondary prompt string is stored in the shell variable PS2, and can be modified to taste. In all of these examples, the quoting of a metacharacter prevents the shell from trying to interpret it, The command CHAPTER 3 SING THE SHELL 77 $ echo xty, echoes all the filenames beginning x and ending y. As always, echo knows nothing about files or shell metacharacters; the interpretation of +, if any, is supplied by the shell ‘What happens if no files match the pattern? The shell, rather than com- plaining (as it did in early versions), passes the string on as though it had been quoted. It’s usually a bad idea to depend on this behavior, but it can be exploited to learn of the existence of files matching a pattern: $ 1s xey xy not found Message from 18: no such fles exist 8 oxyzzy Create xy22y $ 1s xey xyzzy File xyazy matches x+y 8 1s “xey’ xey not found 1s doesn't interpret the + ‘ ‘A backslash at the end of a line causes the Tine to be continued; this is the way to present a very long line to the shell. $ echo abe\ > des\ > ghi abodefghi 5 Notice that the newline is discarded when preceded by backslash, but is retained when it appears in quotes. ‘The metacharacter # is almost universally used for shell comments; if a shell word begins with #, the rest of the line is ignored $ echo hello # there hello $ echo hellorthere hello#there s The # was not part of the original 7th Edition, but it has been adopted very widely, and we will use it in the rest of the book Exercise 3-2. Explain the output produced by A digression on echo Even though it isn’t explicitly asked for, a final newline is provided by echo. A sensible and perhaps cleaner design for echo would be to print only 78 THE UNIX PROGRAMMING ENVIRONMENT cuarrer 3 what is requested. This would make it easy to issue prompts from the shell: $ pure-echo Bnter a command Enter @ command: $ No trailing newline but has the disadvantage that the most common case — providing a newline — is not the default and takes extra typing: cho ‘Hello! Since a command should by default execute its most commonly used function, the real echo appends the final newline automatically. But what if it isn’t desired? ‘The 7th Edition echo has a single option, to suppress the last newline: $ echo -n Enter a command: Enter @ command: $ Prompt on same line $ echo ~ - Only ~m is special s ‘The only tricky case is echoing ~n followed by a newline: $ echo -n * s 1's ugly, but it works, and this is a rare situation anyway. A different approach, taken in System V, is for echo to interpret C-like backslash sequences, such as \b for backspace and \c (which isn’t actually in the C language) to suppress the last newline: $ echo “Enter a conmand:\c’ System V version Enter a command:$ Although this mechanism avoids confusion about echoing a minus sign, it has other problems. echo is often used as a diagnostic aid, and backslashes are interpreted by so many programs that having echo look at them too just adds to the confusion Still, both designs of echo have good and bad points. We shall use the 7th Edition version (=n), so if your local echo obeys a different convention, a couple of our programs will need minor revision. Another question of philosophy is what echo should do if given no argu- ments — specifically, should it print blank line or nothing at all? All the current echo implementations we know print a blank line, but past versions didn’t, and there wete once great debates on the subject. Doug Mellroy imparted the right feelings of mysticism in his discussion of the topic: ‘CHAPTER 3 USING THE SHELL 79 ‘The UNIX and the Echo “There duelt in the land of New Jersey the UNIX a fair maid whom savants taveled far o asimite. Dazaed by her purity, all sought to espouse her, one for hee virginal grace, another Tor her poised civility, yet another for her aiiy in performing exacting tasks seldom accomplished even in much richer lands. So lage of heart and accommodating of nature was she that the UNEX ‘Mlopted all but the mos insufferably rch of Ber suitors. Soon many offspring grew and prospered and spread to the ends of the earth Nature herself smiled and answered to the uNDx mote eagerly than to other mortal beings Humblr folk, who knew litle of more courtly manners, delighted in her echo, so precise and erys- tal clear they scarce believed she could be answered by the same rocks and woods that so garbled thelr own shouts into the wilderness, And the compliant UNDX obliged with perfect echoes of what- ever she was asked ‘When one impatient swain asked the uNtx, “Echo nothing,” the UND obligingly opened her mouth, echoed nothing, and closed it again, ‘Whatever do you mean,” the youth demanded, “opening your mouth like that? Henceforth never open your mouth when you are supposed co echo nothing!” And the Unt obliged. "But I want a perfect performance, even when you echo nothing,” pleaded a sensitive youth ‘and no perfect echoes can come from a closed mouth.” Not wishing to offend either one, the UNIX syreed to say diferent nothings for the impatient youth and forthe sensitive youth. She ealed the sensitive nothing 2 ‘Yet now when she said a," she was really not saying nothing so she had to open her mouth twee, once to say “and once to say nothing. and so she did not please the sensitive youth, who said forthwith, “The \n sounds like a perfect nothing to me, but the second one ruins it. T want you to take back one of them.” So the UND, who could not abide offending, agreed to undo some ‘choes, and called that “e." Now the sensitive youth could ear a perfect echo of nothing by asking for ‘wand "e’ together. But they sty that he died of a surfeit of notation before he ever heard Exercise 3-3, Predict what each of the following grep commands will do, then verify ‘your understanding grep \$ \ grep \\$ wy grep \\\$ Ae grep ’\8 ee grep ‘\'8" - AA file containing these commands themselves makes a good test case if you want to experiment, © Exercise 3-4. How do you tell grep to search for a pattern beginning with doesn’t quoting the argument help? Hint: investigate the -e option. 0 Exercise 35. Consider Why § echo +/+ Does this produce all names in al directories? In what order do the names appear? Exercise 3-6. (Trick question) How do you get a / into a filename (ie., a / that doesn’t separate components of the path)? ‘Exercise 3-7. What happens with 80 TIE UNIX PROGRAMMING ENVIRONMENT HAPTER 3 scat xy oy and with 8 cat x ox Think before rushing off to try them. © Exercise 3-8. If you type eum why can’t rm warn you that you're about to delete all your files? © 3.3 Creating new commands It’s now time to move on to something that we promised in Chapter 1 — how to create new commands out of old ones. Given a sequence of commands that is to be repeated more than a few times, it would be convenient to make it into a “new” command with its own name, so you can use it like a regular command. To be specific, suppose you intend to count users frequently with the pipeline $ who | we -1 that was mentioned in Chapter 1, and you want to make a new program nu to do that The first step is to create an ordinary file that contains ‘who ! we -2" You can use a favorite editor, or you can get creative: $ echo ‘who | we (Without the quotes, what would appear in nu?) As we said in Chapter 1, the shell is a program just like an editor or who or we; its name is sh. And since it’s a program, you can run it and redirect its input. So run the shell with its input coming from the file nu instead of the terminal: $ who you tty2 sep 28, zbh ttya sep 28 nob ttys sep 28 ava ttyé Sep 28 $ cat nu who | we -2 Ssh cx Create ox originally § sh cx ex Make ox itself executable $ echo echo Hi, there! >hello Make a test program § hello Toit hello: cannot execute $ cx hello Make it executable $ hello Try again Hi, there! It works § my cx /usr/you/bin Install ox. $ rm hello, Clean up s Notice that we said $ sh cx cx exactly as the shell would have automatically done if ex were already execut- able and we typed $x cx What if you want to handle more than one argument, for example to make ‘a program like ex handle several files at once? A crude first cut is to put nine arguments into the shell program, as in chnod +x $1 $2 $3 $4 §5 $6 57 $2 $9 (it only works up to $9, because the string $10 is parsed as “first argument, $1, followed by 2 0") If the user of this shell file provides fewer than nine arguments, the missing ones are null strings; the effect is that only the argu- ments that were actually provided are passed to chmod by the sub-shell. So this implementation works, but it’s obviously unclean, and it fails if more than nine arguments are provided. Anticipating this problem, the shell provides a shorthand $* that means “all the arguments.” The proper way to define cx, then, is chnod «x $+ which works regardless of how many arguments are provided. With $+ added to your repertoire, you can make some convenient shell files, such as Le or m: $ cd /usr/you/bin $ cat le # lc: count number of lines in files we -1 s+ Scat m # m: a concise way to type mail mail $+ s Both can sensibly be used without arguments. If there are no arguments, $* 84 THE UNIK PROGRAMMING ENVIRONMENT: CHAPTER 3 will be ull, and no arguments at all will be passed to we or mail. With or without arguments, the command is invoked properly $ le /usr/you/bin/+ 1 /usr/you/bin/ex 2 /usr/you/bin/le 2 /usr/you/bin/a 4 /ust/you/bin/ma 2 /asr/you/bin/what 4 /usr/you/bin/where 9 total 8 Is /usr/you/bin | Ie 6 s These commands and the others in this chapter are examples of personal programs, the sort of things you write for yourself and put in your bin, but are unlikely to make publicly available because they are too dependent on per- sonal taste. In Chapter 5 we will address the issues of writing shell programs suitable for public use. ‘The arguments to a shell file need not be filenames. For example, consider searching a personal telephone directory. If you have a file named /usr/you/1ib/phone-book that contains lines like Gial-a-joke 212-976-3838 @ial-a-prayer 212-246-4200 dial santa 212-976-3636 dow jones report 212-976-4141 then the grep command can be used to search it, (Your own 14b directory is 1a good place to store such personal data bases.) Since grep doesn’t care about the format of information, you can search for names, addresses, zip codes or anything else that you like. Let's make a directory assistance program, which ‘we'll call 411 in honor of the telephone directory assistance number where we live: 8 echo “grep $+ /usr/you/lib/phone-book’ >411 Sex att $ 411 joKe Gial-a-joke 212-976-3038 $477 dial Gial-a-joke 212-976-3838 @ial-a-prayer 212-246-4200 @ial santa 212-976-3636 $411 “dow jones” grep: can’t open jones Something is wrong s The final example is included to show a potential problem: even though dow Jones is presented to 411 as a single argument, it contains a space and is no CHAPTER 3 USING THE SHELL 85 longer in quotes, so the sub-shell interpreting the 411 command converts it into two arguments to grep: it’s as if you had typed $ grep dow jones /usr/you/1ib/phone-book and that’s obviously wrong. ‘One remedy relies on the way the shell treats double quotes. Although anything quoted with ‘...’ is inviolate, the shell looks inside "..." for 8's, V's, and *...°'S. So if you revise 411 to Took like grep "$+" /usr/you/Lib/phone-book the $# will be replaced by the arguments, but it will be passed to grep as a single argument even if it contains spaces. 8 411 dow jones dow jones report 212-976-4141 s By the way, you can make grep (and thus 411) case-i ~y option: 8 grep -¥ pattern dependent with the with -y, lower case letters in pattern will also match upper case letters in the input. (This option is in 7th Edition grep, but is absent from some other sys- tems.) There are fine points about command arguments that we are skipping over until Chapter 5, but one is worth noting here. The argument $0 is the name of the program being executed — in ox, $0 is “ex.” A novel use of $0 is in the implementation of the programs 2, 3, 4, .... which print their output in that many columns: $ who 1 2 azn ttyO sep 28 21 ow ttyS sep 28 21:09 anr ttys Sep 28 22:10 sc} tty? Sep 28 22:11 you tty9 Sep 28 23:00 pub ttyb Sep 28 19:58 5 The implementations of 2, 3, ... are identical; in fact they are links to the same file: $ In 23; In 24; In 25; In26 8 Is “11 (1-97 16722 ~ewxrwxrwx 5 you 51 Sep 28 23: 16722 ~xwxrwxewx 5 you 51 Sep 28 23 16722 -ewarwxrwx 5 you 51 Sep 28 23: 16722 -ewarwxrwx 5 you 51 sep 16722 -ewxewxrwx 5 you 51 Sep 28 23:21 86 TIE UNIX PROGRAMMING ENVIRONMENT CHAPTER 3 $ 1s /usr/you/bin | 5 2 3 4 an 5 6 ox qe 2 ma what where Scat 5 eae: print inn columns pr -s0'-t “11 $+ s The ~t option turns off the heading at the top of the page and the -1n option sets the page length to m lines. The name of the program becomes the number-of-columns argument to pr, so the output is printed a row at a time in the number of columns specified by $0. 3.5 Program output as arguments Let us turn now from command arguments within a shell file to the genera- tion of arguments, Certainly filename expansion from metacharacters like » is, the most common way to generate arguments (other than by providing them explicitly), but another good way is by running a program. The output of any program can be placed in a command line by enclosing the invocation in back- quotes *...* $ echo At the tone the time will be ‘date’ At the tone the time will be Thu Sep 29 00:02:15 EDT 1983. ’ ‘A small change illustrates that *...° is interpreted inside double quotes *.." $ echo "At the tone > the time will be ‘date’.” ‘At the tone the time will be Thu Sep 29 00:03:07 ED? 1983, s As another example, suppose you want to send mail to a list of people whose login names are in the file mailinglist. A clumsy way to handle this, is to edit mailinglist into a suitable mail command and present it to the shell, but it’s far easier to say 8 mail ‘cat mailinglist’ echo $x’ >setx 1 set and print x S cat setx Good Bye" echo $x $ echo $x Hello x is He20 in original shell, $ sh setx Good Bye x is Good Bye in sub-shell. 8 echo $x Hello ‘but sill We120 in this shell ‘There are times when using a shell file to change shell variables would be useful, however. An obvious example is a file to add a new directory to your PATH. The shell therefore provides a command *.” (dot) that executes. the commands in a file in the current shell, rather than in a sub-shell. This was originally invented so people could conveniently re-execute their «profile files without having to log in again, but it has other uses: 90 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 3 $ cat /usr/you/bin/ganes PATH=$PATH: /usr/games Append /usz/games to PATH $ echo $PATH /asr/you/bin: /bin: /use/bin $. ganes $ echo spare jaar /you/bin: /bin: usr/bin: /usr/games The file for the *.” command is searched for with the PATH mechanism, so it can be placed in your bin directory. ‘When a file is executing with *.”, itis only superficially like running a shell file, The file is not “executed” in the usual sense of the word. Instead, the commands in it are interpreted exactly as if you had typed them interactively — the standard input of the shell is temporarily redirected to come from the file. Since the file is read but not executed, it need not have execute permis sions. Another difference is that the file does not receive command line argu- ments; instead, $1, $2 and the rest are empty. It would be nice if arguments were passed, but they are not ‘The other way to set the value of a variable in a sub-shell is to assign to it explicitly on the command line before the command itself: $ echo ‘echo $x’ >echox § ox echox 8 echo sx Hello As before $ echox 2 not set in sub-shell $ x=Hi echox Bi Value of x passed to sub-shell s (Originally, assignments anywhere in the command line were passed to the command, but this interfered with 4€(1),) ‘The ‘.” mechanism should be used to change the value of a variable per- ‘manently, while in-line assignments should be used for temporary changes. As an example, consider again searching /usx/games for commands, with the directory not in your PATH: $ 1s /asr/games | grep fort fortune Fortune cookie command $ fortune fortune: not found $ echo spare :/usr/you/bin: /bin: /usr/bin /ase/ganes nol in PATH $ PaTH=/usr/games fortune Ring the bell; close the book: quench the candle. CHAPTER 3 $ echo spar r/usr/you/bin: /bin: /usr/bin PATH unchanged $ cat /usr/you/bin/games PATHeSPATH: /usr/games games command still shere $. games $ fortune Premature optimization is the root of all evil - Knuth $ echo $PATH :/usr/you/bin: /bin:/usr/bin: /asr/games PATH changed this time s I's possible to exploit both these mechanisms in a single shell file. A slightly different games command can be used to run a single game without changing PATH, or can set PATH permanently to include /usr/games: 8 cat /usr/you/bin/ganes PATHSSPATH: /asr/eanes $+ Note the $« $ ox /usr/you/bin/ganes $ echo $PATH :/usr/you/bin: /bin: /asr/bin Doesn't have /asx/games $ games fortune I'd give my right arm to be ambidextrous. $ echo $PATH :/usr/you/bin: /bin: /usr/bin Slt doesn't $. games $ echo spar /asr/you/bin:/bin:/usr/bin:/usr/games Now it does $ fortune He who heeitates ie sometimes saved. s ‘The first call to games ran the shell file in a sub-shell, where PATH was tem- porarily modified to include /usx/games. The second example instead inter- preted the file in the current shell, with $+ the empty string, so there was no command on the line, and PATH was modified. Using games in these two ‘ways is tricky, but results in a facility that is convenient and natural to use. ‘When you want to make the value of a variable accessible in sub-shell, the shell’s export command should be used. (You might think about why there is no way to export the value of a variable from a sub-shell to its parent.) Here is one of our earlier examples, this time with the variable exported: New shell x known in sub-shell 92. THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 3 $ x= "Good Bye’ Change its value 8 echo $x Good Bye $ cid Leave this shell s Back in original shell 8 echo $x Hello x sill feito s export has subtle semantics, but for day-to-day purposes at least, a rule of thumb suffices: don’t export temporary variables set for short-term conveni- ence, but always export variables you want set in all your shells and sub-shells (including, for example, shells started with the ed's | command). Therefore, variables special to the shell, such as PATH and HOME, should be exported, Exercise 3-13. Why do we always include the current directory in PATH? Where should it be placed? 0 3.7 More on W/O redirection ‘The standard error was invented so that error messages would always appear on the terminal: $ dife file1 f1el2 >aise.out aiff: fiel2: No such file or directory s It’s certainly desirable that error messages work this way — it would be most unfortunate if they disappeared into aiff out, leaving you with the impres- sion that the erroneous 4if£ command had worked properly Every program has three default files established when it starts, numbered by small integers called file descriprors (which we will return to in Chapter 7) ‘The standard input, 0, and the standard output, 1, which we are already fami- liar with, are often redirected from and into files and pipes. ‘The last, num- bered 2, is the standard error output, and normally finds its way to your termi- nal, Sometimes programs produce output on the standard error even when they work properly. One common example is the program time, which runs a command and then reports on the standard error how much time it took. $ time we oh3.1 931 4288 22691 ch3.1 ‘CHAPTER 3 USING THE SHELL 93 $ time we ch3.1 >we.out real 2.0 user ola sys 0.3 $ time wo ch3.1 swo.out 2>time.out $ cat tine.out real im user ° ays ° s ‘The construction 2>filename (no spaces are allowed between the 2 and the >) directs the standard error output into the file; it’s syntactically graceless but it does the job. (The times produced by time are not very accurate for such a short test as this one, but for a sequence of longer tests the numbers are useful and reasonably trustworthy, and you might well want to save them for further analysis; see, for example, Table 8.1.) It is also possible to merge the (wo output streams: $ time we ch3.1 >we-out 2961 § cat we.out 931 4288 22691 ch3.1 real 4.0 user oa ays 0.3 s The notation 2>&41 tells the shell to put the standard error on the same stream as the standard output. There is not much mnemonic value to the ampersand; it’s simply an idiom to be learned. You can also use 1>82 to add the standard output to the standard error: echo... 1962 prints on the standard error, In shell files, it prevents the messages from van ishing accidentally down a pipe or into a file The shell provides a mechanism so you can put the standard input for a ‘command along with the command, rather than in a separate file, so the shell file can be completely self-contained. Our directory information program 411 could be written 94 THE UNIX PROGRAMMING ENVIRONMENT cuarreR 3 8 cat art $0" file direct standard output to file efile append standard output to file file take standard input from file Pitp2 connect standard output of program p, to input of p . obsolete synonym for | nofile direct output from file descriptor 1 to file n>>file append output from file descriptor n to file n>Gm merge output from file descriptor n with file descriptor m n<&m merge input from file descriptor n with file descriptor m < do > echo $i > done ‘The “i” can be any shell variable, although 4 is traditional. Note that the variable’s value is accessed by $4, but that the for loop refers to the variable fas 4, We used « (o pick up all the files in the current directory, but any other list of arguments can be used. Normally you want to do something more interesting than merely printing filenames. One thing we do frequently is to compare a set of files with previous versions. For example, to compare the old version of Chapter 2 (kept in directory 014) with the current one: $s ch2.e 15 oh2.1 con2.2 en2.3 en2.4 cn2.s on2.6 oh2.7 $ for i in ch2. > do > echo $1: > aiff -b ola/si $i > econo ‘Add a blank line for readability > done | pr -h "diff ‘pwd'/old ‘pwd** | Ipr & 372 Process-id ’ We piped the output into pr and Apr just to illustrate that it’s possible: the standard output of the programs within a €or goes to the standard output of the for itself. We put a fancy heading on the output with the ~h option of pr, using two embedded calls of pwd. And we set the whole sequence running asynchronously (&) so we wouldn't have to wait for it; the & applies to the entire loop and pipeline. We prefer to format a for statement as shown, but you can compress it somewhat. The main limitations are that do and done are only recognized as keywords when they appear right after a newline or semicolon. Depending on the size of the for, it’s sometimes better to write it all on one line: for 4 in list; do commands; done You should use the fox loop for multiple commands, or where the built-in 96 THE UNIX PROGRAMMING ENVIRONMENT. CHAPTER 3 argument processing in individual commands is not suitable, But don’t use it when the individual command will already loop over filenames: # Poor idea: for i in $+ a0 chmod «x $1 done is inferior to chmod +x $« because the £or loop executes a separate chmod for each file, which is more ‘expensive in computer resources. (Be sure that you understand the difference between for i ins which loops over all filenames in the current directory, and for i in $+ which loops over all arguments to the shell file.) The argument list for a €or most often comes from pattern matching on filenames, but it can come from anything. It could be $ for i in ‘cat for arguments could just be typed. For example, earlier in this chapter we created a group of programs for multi-column printing, called 2, 3, and so on ‘These are just links to a single file that can be made, once the file 2 has been written, by $ for i in 3.456; do in 2 $1; done 5 As a somewhat more interesting use of the for, we could use pick to select which files to compare with those in the backup directory: ‘CHAPTER 3 USING THE SHELL 97 8 for 4 in ‘pick ch2.+* > do > echo $i: > dife o1a/si $4 > done ! pr! Ipr on2.1? y oha.2? oh2.3? cha.4? oh2.5? cha.6? oh2.7? s ss It’s obvious that this loop should be placed in a shell file to save typing next time: if you've done something twice, you're likely to do it again. Bxercise 3-15, If the 44 ££ loop were placed in a shell file, would you put the pick in the shell file? Why or why not? © Bxercise 3-16. What happens if the last line of the loop above is > done | pr { Ipr & that is, ends with an ampersand? See if you can figure it out, then try it. © 3.9 bundle: putting it all together To give something of the flavor of how shell files develop, let's work through a larger example, Pretend you have received mail from a friend on another machine, say somewhere !bob,+ who would like copies of the shell files in your bin. The simplest way to send them is by return mail, so you might start by typing $ cd /usr/you/bin § for i in ‘pick #* > ao > ‘echo this is file $1 > cat $i > done | mail somewhere!bob s But look at it from somewhere !bob’s viewpoint: he’s going to get a mail mes- sage with all the files clearly demarcated, but he'll need to use an editor to break them into their component files. ‘The flash of insight is that a properly- constructed mail message could automatically unpack itself so the recipient needn't do any work. ‘That implies it should be a shell file containing both the ‘There are several notations for remote machine addresses. ‘The form machine! person is most common. See 2a) 98 THE UNIX PROGRAMMING ENVIRONMENT ‘CHAPTER 3 files and the instructions to unpack it ‘A second insight is that the shell’s here documents are a convenient way to combine a command invocation and the data for the command. The rest of the job is just getting the quotes right. Here’s a working program, called bundle, that groups the files together into a self-explanatory shell file on its standard output: $ cat bundle # bundle: group files into distribution package echo ‘# To unbundle, sh this file’ for 4 ao. echo "echo $1 1782" echo "cat >$1 <<’End of $i°" cat $i echo "End of $i" done Quoting “End of $4” ensures that any shell metacharacters in the files will be ignored Naturally, you should try it out before inflicting it on somewhere !bob: $ bundle ox le >junk ‘Make a trial bundle $ cat junk # To wnbundle, sh this file echo cx 162 cat vex <<"End of cx’ chnod «x $+ End of ox echo le 182 cat >lc <, and filename expansion with #, so that no program need worry about them, and ‘more importantly, so that the application of these facilities is uniform across all programs. Other features, such as shell files and pipes, are really provided by the kernel, but the shell gives a natural syntax for creating them. They go 100 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER» beyond convenience, to actually increasing the capabilities of the system, ‘Much of the power and convenience of the shell derives from the UNIX ker- nel underneath it; for example, although the shell sets up pipes, the kernel actually moves the data through them. ‘The way the system treats executable files makes it possible to write shell files so that they are run exactly like com- piled programs. ‘The user needn't be aware that they are command files — they aren't invoked with a special command like RUN. Also, the shell is a pro- gram itself, not part of the kernel, so it can be tuned, extended and used like any other program. This idea is not unique to the UNIX system, but it has been exploited better there than anywhere els. In Chapter 5, we'll return to the subject of shell programming, but you should keep in mind that whatever you're doing with the shell, you're pro- ‘gramming it — that’s largely why it works so wel History and bibliographic notes, ‘The shell has been programmable from earliest times. Originally there were separate commands for if, goto, and labels, and the goto command operated by scanning the input file from the beginning looking for the right label. (Because it is not possible to re-read a pipe, it was not possible to pipe into a shell file that had any control flow), The 7th Edition shell was written originally by Steve Bourne with some help and ideas from John Mashey. It contains everything needed for program- ming, as we shall see in Chapter 5. In addition, input and output are rational- ized: it is possible to redirect /O into and out of shell programs without limit. The parsing of filename metacharacters is also internal to this shell; it had been fa separate program in earlier versions, which had to live on very small machines. One other major shell that you may run into (you may already be using it by preference) is csh, the so-called “C shell” developed at Berkeley by Bill Joy by building on the 6th Edition shell, The C shell has gone further than the Bourne shell in the direction of helping interaction — most notably, it provides a history mechanism that permits shorthand repetition (perhaps with slight editing) of previously issued commands. The syntax is also somewhat dif- ferent. But because it is based on an earlier shell, it has less of the program- ming convenience; it is more an interactive command interpreter than a pro- gramming language. In particular, it is not possible to pipe into or out of con- trol flow constructs. Pick was invented by Tom Duff, and bundie was invented independently by Alan Hewett and James Gosling. charTeR ¢: FILTERS ‘There is a large family of UNIX programs that read some input, perform a simple transformation on it, and write some output. Examples include grep and tail to select part of the input, sort to sort it, we to count it, and so on. Such programs are called filters. ‘This chapter discusses the most frequently used filters. We begin with grep, concentrating on patterns more complicated than those illustrated in Chapter 1. We will also present two other members of the grep family, egrep and farep, The next section briefly describes a few other useful filters, including tr for character transliteration, 44 for dealing with data from other systems, and unig for detecting repeated text lines. sort is also presented in more detail than in Chapter 1 ‘The remainder of the chapter is devoted to two general purpose “data transformers” or “programmable filters.” They are called programmable because the particular transformation is expressed as a program in a simple programming language. Different programs can produce very different transformations. ‘The programs are sed, which stands for stream editor, and awk, named after its authors, Both are derived from a generalization of grep: $ program pattern-action filenames scans the files in sequence, looking for lines that match a pattern; when one is found a corresponding action is performed. For grep, the pattern is a regular expression as in ed, and the default action is to print each line that matches, the pattern, ‘sed and awk generalize both the patterns and the actions. sed is a deriva- tive of ed that takes a “program” of editor commands and streams data from the files past them, doing the commands of the program on every Tine. awk is not as convenient for text substitution as sed is, but it includes arithmetic, variables, built-in functions, and a programming language that looks quite a bit like C. This chapter doesn’t have the complete story on either program; Volume 2B of the UNIX Programmer's Manual has tutorials on both 101 102 THE UNIX PROGRAMMING ERVIRONMENT (CHAPTER ¢ 4.1 The grep family We mentioned grep briefly in Chapter 1, and have used it in examples since then. & grep pattern filenames searches the named files or the standard input and prints each line that con- tains an instance of the pattern. grep is invaluable for finding occurrences of variables in programs or words in documents, or for selecting parts of the out- put of a program: $ grep -n variable +.[ch] Locate variable in C source $ grep From $MAIL Print message headers in mailbox $ grep From SMAIL ! grep -v mary Headers that didn't come from max 8 grep -y mary $HONE/1ib/phone-book Find mary's phone number $ who ! grep nary See if mary is logged in $ 1s f grep -v temp Filenames that don't contain temp ‘The option -n prints line numbers, -v inverts the sense of the test, and -y makes lower case letters in the pattern match letters of either case in the file (upper case still matches only upper case) In all the examples we've seen so far, grep has looked for ordinary strings of letters and numbers, But grep can actually search for much more compli- cated patterns: gxep interprets expressions in a simple language for describing strings. Technically, the patterns are a slightly restricted form of the string specif- iers called regular expressions. grep interprets the same regular expressions as ed; in fact, grep was originally created (in an evening) by straightforward surgery on ed. Regular expressions are specified by giving special meaning to certain char- acters, just like the +, etc., used by the shell. There are a few more metachar- acters, and, regrettably, differences in meanings. Table 4.1 shows all the regu- lar expression metacharacters, but we will review them briefly here. ‘The metacharacters * and $ “anchor” the pattern to the beginning (*) or end ($) of the fine. For example, $ grep From $MATL locates lines containing From in your mailbox, but $ grep ‘*Prom’ $MArL prints lines that begin with From, which are more likely to be message header lines. Regular expression metacharacters overlap with shell metacharacters, so it’s always a good idea to enclose grep patterns in single quotes. grep supports character classes much like those in the shell, so [az] matches any lower case letter. But there are differences; if a grep character class begins with a circumflex *, the pattern matches any character except CHAPTER + Fierers 103 those in the class, Therefore, {“0-9] matches any non-digit. Also, in the shell a backslash protects J and - in a character class. but grep and ed require that these characters appear where their meaning is unambiguous. For example, [1] (sic) matches either an opening or closing square bracket or a minus sign. A period *." is equivalent to the shell’s ?: it matches any character. (The period is probably the character with the most different meanings to different UNIX programs.) Here are a couple of examples: Sis -1! grep *-a" List subdirectory names Sis -1! grep ‘*.......2W" List files others can read and write The ‘and seven perio match any teven characters atthe beginning of the line, which when applied the output of 18 -1 means any permission string The closure operator « applies t0 the previous character or metacharacter (including a character class) in the expression, and collectively they match any number of successive matches of the character or metacharacter. For example, x4 matches a sequence of x's as long as possible, [a~zA-Z]+ matches an alphabetic string, .* matches anything up to a newline, and . #2 matches any- thing up to and including the fast x on the line There are a couple of important things to note about closures. First, clo- sure applies to only one character, so xy+ matches an x followed by y's, not a sequence like xyxyxy. Second, “any number” includes zero, so if you want at least one character to be matched, you must duplicate it. For example, to match a string of letters the correct expression is [a~zA-Z)[a~za-Z]# (a leer followed by zero or more letters). The shell’s « filename matching char- acter is similar to the regular expression .« No grep regular expression matches a newline; the expressions are applied to each line individually. With regular expressions, grep is a simple programming language. For example, recall that the second field of the password file is the encrypted pass- word, This command searches for users without passwords: 8 grep aes * setc/passua ‘The pattern is: beginning of line, any number of non-colons, double colon. grep is actually the oldest of a family of programs, the other members of which are called fgrep and egrep. Their basic behavior is the same, but fgrep searches for many literal strings simultaneously, while egrep interprets true regular expressions — the same as grep, but with an “or” operator and parentheses to group expressions, explained below. Both fgrep and egrep accept a ~£ option to specify a file from which to read the pattern, In the file, newlines separate patterns to be searched for in parallel. If there are words you habitually misspell, for example, you could check your documents for their occurrence by keeping them in a file, one per line, and using Egrep: 104 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 8 f9rep -£ conmon-errors document ‘The regular expressions interpreted by egrep (also listed in Table 4.1) are the same as in grep, with a couple of additions. Parentheses can be used t0 group, so (xy)* matches any of the empty string, xy, xyxy, xyxyxy and so fon, The vertical bar | is an “or” operator; today!tomorrow matches either today or tomorrow, as does to(day!morrow). Finally, there are two other closure operators in egrep, + and ?. The pattern x+ matches one or more x's, and x? matches zero or one x, but no more. egxep is excellent at word games that involve searching the dictionary for words with special properties. Our dictionary is Webster's Second Interna- tional, and is stored on-line as the list of words, one per line, without defini- tions, Your system may have /usr/dict/words, a smaller dictionary intended for checking spelling; look at it to check the format. Here’s a pattern to find words that contain all five vowels in alphabetical order: $ cat alphvovels *(aeiou]+a[“aeiou}+e[“aeiou]+i[“aeiou}+o[ “aeiou] mul ~aeiou] +$ $ egrep -f alphvowels /usr/dict/web2 | 3 abstemious abetemiouely abstentious acheilous acheirous acieistous affections annelidous arsenious arterious pacterious caesious facetious Eacetiously fracedinous majestious s ‘The pattern is not enclosed in quotes in the file alphvowe2s.,,\Whgp. quotes are used to enclose egrep patterns, the shell protects the UR om interpretation but strips off the quotes; egrep never sees them, Since the file is not examined by the shell, however, quotes are nor used around its contents. We could have used grep for this example, but because of the way egrep works, it is much faster when searching for patterns that include closures, especially when scanning large files As another example, to find all words of six or more letters that have the letters in alphabetical order: $ cat monotonic “ab?o?die?£tg?h?i? j7k71?m7nPorprare?arerurvewex?y?278 $ egrep -£ monotonic /usr/dict/wen2 | grep ‘..----' 1 5 abdest acknow adipsy agnosy almost pefist behint beknow Bijoux biopsy chintz aehors dehort deinos aimpsy egilope ghosty $ (Bgilops is a disease that attacks wheat.) Notice the use of grep to filter the output of egrep. CHAPTER 4 FILTERS 105 Why are there three grep programs? £gxep interprets no metacharacters, but can look efficiently for thousands of words in parallel (once initialized, its running time is independent of the number of words), and thus is used pri- marily for tasks like bibliographic searches. ‘The size of typical fgrep patterns is beyond the capacity of the algorithms used in grep and egrep. The dis- tinction between grep and egrep is harder to justify. grep came much ear- lier, uses the regular expressions familiar from ed, and has tagged regular expressions and a wider set of options. egrep interprets more general expres- sions (except for tagging), and runs significantly faster (with speed indepen- dent of the pattern), but the standard version takes longer to start when the expression is complicated. A newer version exists that starts immediately, so egrep and grep could now be combined into a single pattern matching pro- gram. Table 4.1: grep and egrep Regular Expressions | (decteasing order of precedence) c any non-special character ¢ matches itself \c turn off any special meaning of character ¢ . beginning of line 8 end of line { any single character any one of characters in ..; ranges like a~2 are legal any single character notin ..; ranges are legal what the 1th \C...\) matched (grep only) rx zero oF more occurrences of r r+ one or more occurrences of r (egrep only) 7? —_2er0 oF one occurrences of r (egzep only) ri 11 followed by r2 ri rl or 72 (egrep only) \r\) tagged regular expression r (grep only); can be nested (r) regular expression r (egrep only); can be nested No regular expression matches a newline. Exercise 4-1. Look up tagged regular expressions (\( and \)) in Appendix 1 or ea( 1), and use qzep to search for palindromes — words spelled the same backwards as for- wards. Hint: write a different pattern for each length of word. © Exercise 4-2. ‘The structure of grep is to read a single line, check for a match, then loop. How would grep be affected if regular expressions could match newlines? © 106 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER + 4.2 Other filters The purpose of this section is to alert you to the existence and possibilities of the rich set of small filters provided by the system, and to give a few exam- ples of their use, This list is by no means all-inclusive — there are many more that were part of the 7th Edition, and each installation creates some of its own. [All of the standard ones are described in Section 1 of the manual We begin with sort, which is probably the most useful of all. The basics of sort were covered in Chapter I: it sorts its input by line in ASCH order Although this is the obvious thing to do by default, there are lots of other ways that one might want data sorted, and sort tries to cater to them by providing lots of different options. For example, the ~£ option causes upper and lower case to be “folded,” so case distinctions are eliminated. The ~d option (dic- tionary order) ignores all characters except letters, digits and blanks in com- parisons. Although alphabetic comparisons are most common, sometimes a numeric comparison is needed. The -n option sorts by numeric value, and the -r ‘option reverses the sense of any comparison. So, $16 f sort -£ Sort filenames in alphabetic order $ 1s -s ! sort -n Sort with smallest files first $ 1s -9 ! sort -or Sort with largest files first sort normally sorts on the entire line, but it can be told to direct its atten- tion only to specific fields. The notation +m means that the comparison skips the first m fields; +0 is the beginning of the line. So, for example, $ 1s -1 / sort +3nr Sort by byte count, largest first $ who | sort an Sort by time of login, oldest first Other useful sort options include -0, which specifies a filename for the ‘output (it can be one of the input files), and -u, which suppresses all but one of each group of lines that are identical in the sort fields. Multiple sort keys can be used, as illustrated by this cryptic example from the manual page sort(1) $ sort +06 +0 -u filenames +0£ sorts the line, folding upper and lower case together, but lines that are identical may not be adjacent. So +0 is a secondary key that sorts the equal lines from the first sort into normal ASCH order. Finally, ~w discards any adjacent duplicates. Therefore, given a list of words, one per line, the com- mand prints the unique words. The index for this book was prepared with a similar sort command, using even more of sort’s capabilities. See sort(). The command unig is the inspiration for the -u flag of sort: it discards all but one of each group of adjacent duplicate lines. Having a separate pro- ‘gram for this function allows it to do tasks unrelated to sorting. For example, ‘unig will remove multiple blank lines whether its input is sorted or not. CHAPTER & ruyrers 107 Options invoke special ways to process the duplications: unig ~a prints only those lines that are duplicated; uniq -u prints only those that are unique (i.e., not duplicated); and uniq —c counts the number of occurrences of each line. We'll see an example shortly ‘The comm command is a file comparison program. Given two sorted input files £1 and £2, comm prints three columns of output: lines that occur only in £1, lines that occur only in £2, and lines that occur in both files. Any of these columns can be suppressed by an option: $ comm -12 £1 £2 prints only those lines that are in both files, and $ comm -23 £1 £2 prints the lines that are in the first file but not in the second. This is useful for comparing directories and for comparing a word list with a dictionary. ‘The tx command transliterates the characters in its input. By far the most common use of tr is case conversion: Stragag Map lower case t0 upper Str agae Map upper case 10 lower The dd command is rather different from all of the other commands we have looked at. It is intended primarily for processing tape data from other systems — its very name is a reminder of 0/360 job control language. 44 will do case conversion (with a syntax very different from tx); it will convert from ASCII to EBCDIC and vice versa; and it will read or write data in the fixed size records with blank padding that characterize non-UNIX systems. In practice, a is often used to deal with raw, unformatted data, whatever the source; it encapsulates a set of facilities for dealing with binary data To illustrate what can be accomplished by combining filters, consider the following pipeline, which prints the 10 most frequent words in its input: cat $I er -8¢ A-Za-z ‘\012" 1 Compress runs of non-ltters into newline sort t unig -c | tail t 5 cat collects the files, since tx only reads its standard input. The tr com- mand is from the manual: it compresses adjacent non-letters into newlines, thus converting the input into one word per line. The words are then sorted and unig -c compresses each group of identical words into one line prefixed by a count, which becomes the sort field for sort -n. (This combination of two sorts around 2 unig occurs often enough to be called an idiom.) ‘The result is the unique words in the document, sorted in increasing frequency. tail 108 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 4 selects the 10 most common words (the end of the sorted list) and 5 prints them in five columns, By the way, notice that ending a line with ! is a valid way to continue it Exercise 43, Use the tools in this section to write @ simple spelling checker, using Zasr/dict/words. What are its shortcomings, and how would you address them? 0 Exercise 44. Write a word-counting program in your favorite programming language and compare its size, speed and maintainability with the word-counting pipeline. How easily can you convert it into spelling checker? 0 4.3 The stream editor sed Let us now turn to sed. Since it is derived directly from ed, it should be easy to learn, and it will consolidate your knowledge of ed. ‘The basic idea of sed is simple: 8 sea “ist of ed commands” filenames reads lines one at a time from the input files; it applies the commands from the list, in order, to each line and writes its edited form on the standard output So, for instance, you can change UNIX to UNIX(TH) everywhere it occurs in a set of files with $ sed ’s/UNIK/UNIX( TM) /q” filenames... >output Do not misinterpret what happens here. sed does not alter the contents of its input files, 1 writes on the standard output, so the original files are not changed. By now you have enough shell experience to realize that $ sed ‘.../ file >file is a bad idea: to replace the contents of files, you must use a temporary file, of another program. (We will talk later about a program to encapsulate the idea of overwriting an existing file; look at overwrite in Chapter 5.) sed outputs each line automatically, so no p was needed after the substitu: tion command above; indeed, if there had been one, each modified line would hhave been printed twice. Quotes are almost always necessary, however, since so many sed metacharacters mean something to the shell as well. For exam. ple, consider using du ~a to generate a list of filenames. Normally, du prints the size and the filename: $ au -a cna 18 ena. 2 end.2 4 ena.3 ” cna4 2 ena.9 5 You can use sed to discard the size part, but the editing command needs CHAPTER & FILTERS 109 quotes to protect a + and a tab from being interpreted by the shell: $ du -a ché.+ | sed 'a/.40//* cha. cna. cna. end end s The substitution deletes all characters (.) up to and including the rightmost tab (shown in the pattern as +) Ina similar way, you could select the user names and login times from the ‘output of who: $ who le tty1 Sep 29 07:14 ron ety3 Sep 29 10:31 you ttyd Sep 29 ta ttyS Sep 29 S$ who! sed 's/ #77" Le 07:14 ron 10:31 you 08:36 ta 08:47 s The s command replaces a blank and everything that follows it (as much as possible, including mote blanks) up to another blank by a single blank. Again, quotes are needed. Almost the same sed command can be used to make a program getname that will return your user name: $ cat getname who am i} sed ’8/ .4//* $ getname you s Another sed sequence is used so frequently that we have made it into @ shell file called ind. ‘The ind command indents its input one tab stop; it is handy for moving something over to fit better onto line-printer paper. ‘The implementation of ind is easy — stick a tab at the front of each line: sea '8/*/4/" $+ Version 1 of ina This version even puts a tab on each empty line, which seems unnecessary. A better version uses sed’s ability to select the lines to be modified. If you pre~ fix a pattern to the command, only the lines that match the pattern will be affected: 110 THE UNIX PROGRAMMING ENVIRONMENT cuarren 4 sea "/./8/*/4/" $e Version 2 of ina The pattern /./ matches any line that has at least one character on it other than a newline; the s command is done for those lines but not for empty lines. Remember that sed outputs all lines regardless of whether they were changed, so the empty lines are still produced as they should be. ‘There’s yet another way that ind could be written, It is possible to do the commands only on fines that don’r match the selection pattern, by preceding the command with an exclamation mark ‘!?, In sed '/78/18/*/4/" $e Version 3 of ina the patiern /7$/ matches empty lines (the end of the line immediately follows the beginning), so /*8/1 says, “don’t do the command on empty lines.” As we said above, sed prints each line automatically, regardless of what was done to it (unless it was deleted). Furthermore, most ed commands can be used. So it's easy to write a sed program that will print the first three (say) lines of its input, then quit: sed 3q Although 3q is not a legal ed command, it makes sense in sed: copy lines, then quit after the third one. You might want to do other processing to the data, such as indent it. One way is to run the output from sed through ind, but since sed accepts multi ple commands, it can be done with a single (somewhat unlikely) invocation of sea: sed *8/7/4/ aq" Notice where the quotes and the newline are: the commands have to be on separate lines, but sed ignores leading blanks and tabs, With these ideas, it might seem sensible to write a program, called head, to print the first few lines of each filename argument. But sed 3q (or 10q) is so easy to type that we've never felt the need. We do, however, have an ind, since its equivalent sea command is harder to type. (In the process of writing this book we replaced the existing 30-line C program by version 2 of the one~ line implementations shown earlier). ‘There is no clear criterion for when it’s worth making a separate command from a complicated command line; the best rule we've found is to put it in your bin and see if you actually use it, I's also possible to put sed commands in a file and execute them from there, with $ sed -£ emdfle . ‘You can use line selectors other than numbers like 3: CHAPTER 4 FILTERS 111 $ sed “/pattern/a’ prints its input up to and including the first line matching pattern, and $ 80d “/patten/a’ deletes every line that contains pattern; the deletion happens before the line is automatically printed, so deleted lines are discarded. Although automatic printing is usually convenient, sometimes it gets in the way. It can be turned off by the ~n option; in that case, only lines explicitly printed with a p command appear in the output. For example, § sed -n “/patien/p* does what grep does. Since the matching condition can be inverted by follow- ing it with 1, $ sed patie? 1p" is qrep -v. (So is sed ‘ /pattern/a’.) ‘Why do we have both sed and grep? After all, grep is just a simple spe- cial case of sed. Part of the reason is history — grep came well before sed. But grep survives, and indeed thrives, because for the particular job that they both do, it is significantly easier to use than sed is: it does the common case about as succinctly as possible. (It also does a few things that sed won't; look at the ~b option, for instance.) Programs do die, however. There was once a program called gres that did simple substitution, but it expired almost immediately when sed was born. Newlines can be inserted with sed, using the same syntax as in ed: $ sed “8/8/\ 7 audds a newline to the end of each line, thus double-spacing its input, and § sed ’e/{ +1E +)” >" replaces each string of blanks oF tabs with a newline and thus splits its input into one word per line. (The regular expression ‘[ +]° matches a blank or (ab; ‘C +J+” matches zero or more of these, so the whole pattern matches one or more blanks and/or tabs.) You can also use pairs of regular expressions or line numbers to select @ range of lines over which any one of the commands will operate 112. THE UNIX PROGRAMMING ENVIRONMENT CHAPTER « 8 sed -n '20,30p" Print only lines 20 through 30 $ sea *1,104" Delete lines I through 10 (= tai2 +11) 8 sed “1,/*8/4" Delete up to and including first Blank line 8 sed -n' “/*$/,/"end/p’ Print each group of lines from ‘an empty line 10 line starting with en $ sea ‘sa’ Delete last line Line numbers go from the beginning of the input; they do not reset at the beginning of a new file “There is a fundamental limitation of sed that is not shared by ed, however: relative line numbers are not supported. In particular, + and ~ are not under- stood in line number expressions, so it is impossible to reach backwards in the input § sed *s-10" Mlegal: can’t refer backward Unrecognized command: $-18 s Once a line is read, the previous line is gone forever: there is no way to iden- tify the next-to-last line, which is what this command requires. (In fairness, there is a way to handle this with sed, but itis pretty advanced. Look up the “hold” command in the manual.) There is also no way to do relative address- ing forward: $ sed “/thing/+14’ Megal: can't refer forward sed provides the ability to write on multiple output files. For example, $ sed -n */pat/w filet > Ypai/tw £4102" filenames s writes lines matching pat on £i1e1 and lines not matching par on £i1e2. Or, to revisit our first example, $ sed “s/UNIX/UNIX(TH)/gw u.out” filenames ... >output writes the entire output to file output as before, but also writes just the changed lines to file w.out, Sometimes it’s necessary to cooperate with the shell to get shell file argu- ments into the middle of a sed command. One example is the program newer, which lists all files in a directory that are newer than a specified one, $ cat newer # newer f: list files newer than ls -t f sed ‘/°781'6/q" s ‘The quotes protect the various special characters aimed at sed, while leaving the $1 exposed so the shell will replace it by the filename. An alternate way to write the argument is ‘CHAPTER + rurers 113 Table 4.2: Summary of sea Commands a\ append lines to output until one not ending in \ label branch to command : label c\ change lines to following text as in a a delete line; read next input line |i, insert following text before next output | Ja list line, making all non-printing characters visible | \e print line \a quit jx file read file, copy contents to ovtput s/old/new/f substitute new for old. If f=g, replace all occurrences; J=p, print; f=w file, write to file © label test: branch to label if substitution made to current line w file write line to file y/strl/str2/ replace each character from str! with corresponding character from str2_ (no ranges allowed) . print current input line number temd do sea emd only if line is not selected + label set label for b and commands ( treat commands up to matching } as a group "easva" since the $1 will be replaced by the argument while the \$ becomes just $. In the same way, we can write older, which lists all the files older than the named one: 8 cat older # older list files older than £ Is -tr | sea “/7"$1"s/q" ‘The only difference is the ~x option on 18, to reverse the order Although sed will do much more than we have illustrated, including testing conditions, looping and branching, remembering previous lines, and of course many of the ed commands described in Appendix 1, most of the use of sed is similar to what we have shown here — one or two simple editing commands — rather than long ot complicated sequences. Table 4.2 summarizes some of ‘sed’s capabilities, although it omits the multi-line functions. sed is convenient because it will handle arbitrarily long inputs, because it is fast, and because it is so similar to ed with its regular expressions and line-at- ‘atime processing, On the other side of the coin, however, sed provides a relatively limited form of memory (it’s hard to remember text from one line to 114 THE UNDX PROGRAMMING ENVIRONMENT CHAPTER 4 another), it only makes one pass over the data, it’s not possible to go back- wards, there’s no way to do forward references like /.../+1, and it provides no facilities for manipulating numbers — it is purely a text editor. Bxercise 4.5. Modify older and newer so they don't include the argument file in their output. Change them so the files are listed in the opposite order. © Exercise 4.6. Use sed to make bundle robust. Hint: in here documents, the end marking word is recognized only when it matches the line exactly. © 4.4. The awk pattern scanning and processing language Some of the limitations of sed are remedied by awk. The idea in awk is much the same as in sed, but the details are based more on the C program- ming language than on a text editor. Usage is just like sed: awk “program” filenames but the program is different pattern { action ) pattern {action ) awk reads the input in the filenames one line at a time, Each line is compared with each pattern in order; for each paitern that matches the line, the corresponding action is performed. Like sed, awk does not alter its input files. The patterns can be regular expressions exactly as in egrep, or they can be more complicated conditions reminiscent of C. As a simple example, though, '§ awe “/regular expression’ {print }” filenames does what egrep does: it prints every line that matches the regular expression. Either the pattern or the action is optional. If the action is omitted, the default action is to print matched lines, so $ awk ’/regular expression/” filenames does the same job as the previous example, Conversely, if the pattern is omit- ted, then the action part is done for every input line, So $ awk ‘C print J” filenames does what cat does, albeit more slowly. ‘One final note before we get on to interesting examples. As with sed, it is possible to present the program to awk from a file: $ awk ~£ emdfile filenames CHAPTER & rivers 11S Fields ‘awk splits each input line automatically into fields, that is, strings of non- blank characters separated by blanks or tabs. By this definition, the output of who has five fields: who you tty2 sep 29 11 jim ttyd sep 29 11 : awk calls the fields $1, $2, ..., $NF, where NF is a variable whose value is set, to the number of fields. In this case, NF is 5 for both lines. (Note the differ- ence between NF, the number of fields, and $F, the last field on the fine. In awk, unlike the shell, only fields begin with a $; variables are unadorned.) For example, to discard the file sizes produced by du -a, Sau -a ! awk ‘{ print $2 3” and (o print the names of the people logged in and the time of login, one per line: $ who ! awk ‘{ print $1, $517 you 11:53 jim 11:27 8 To print the name and time of login sorted by time: $ who | awk ‘( print $5, $1} f sort 41:27 3am 44:53 you A These are alternatives to the sea versions shown earlier in this chapter. Although awk is easier to use than sed for operations like these, it is usually slower, both getting started and in execution when there’s a lot of input. awk normally assumes that white space (any number of blanks and tabs) separates fields, but the separator can be changed to any single character. One way is with the -F (upper case) command-line option. For example, the fields in the password file /etc/passwd are separated by colons: $ sed 3q /ete/passwd User:/: Thompson: /usr /ken: To print the user names, which come from the first field, 116 THE UNIX PROGRAMMING ENVIRONMENT (CHAPTER + $ sed 3q /ete/passwd | awk -P: ’{ print $1)” ken one s ‘The handling of blanks and tabs is intentionally special. By default, both blanks and tabs are separators, and leading separators are discarded. If the separator is set (0 anything other than blank, however, then leading separators are counted in determining the fields. In particular, if the separator is a tab, then blanks are not separator characters, leading blanks are part of the field, and each tab defines a field Printing awk keeps track of other interesting quantities besides the number of input fields. The built-in variable NR is the number of the current input “record” or line. So to add line numbers to an input stream, use this: 3 awk “{ print NR, $0 }/ The field $0 is the entire input line, unchanged. In a print statement items separated by commas are printed separated by the output field separator, which is by default a blank. ‘The formatting that print does is often acceptable, but if it isn’t, you can use a statement called printé for complete control of your output. For exam- ple, to print tine numbers in a field four digits wide, you might use the follow- ing: $ awk ‘{ printf "x4d Xe\n", NR, $0 3° ‘%4d specifies a decimal integer (wR) in a field four digits wide, %s a string of characters ($0), and \n a newline character, since printf doesn’t print any spaces or newlines automatically. The printé statement in awk is like the C function; see print £(3) We could have written the first version of ind (from early in this chapter) awk "( printé "\tXe\n", $0.) $# which prints a tab (\t) and the input record. Patterns ‘Suppose you want to look in /etc/passwa for people who have no pass- words. ‘The encrypted password is the second field, so the program is just a pattern: $ awk -P: '82 «0 "** /ete/pasawd The pattern asks if the second field is an empty string (‘== is the equality test CHAPTER 4 ruurers 117 operator). You can write this pattern in a variety of ways: s2en 2nd field is empry 82-7787 2nd field matches empty string 821 7 2nd field doesn’t match any character lengtn($2) == 0 Length of 2nd field is zero ‘The symbol ~ indicates a regular expression match, and 1- means “does not match.” ‘The regular expression itself is enclosed in slashes. Length is an awk built-in function that produces the length of a string of characters. A pattern can be preceded by ! to negate it, a ns2 my The ‘1 operator is like that in C, but opposite to sed, where the | follows the pattern. One common use of patterns in awk is for simple data validation tasks. Many of these amount to little more than looking for lines that fail to meet some criterion; if there is no output, the data is acceptable (""no news is good news"). For example, the following pattern makes sure that every input record has an even number of fields, using the operator % to compute the remainder: NF X 2120 # print if odd number of fields Another prints excessively long lines, using the built-in function length: Length($0) > 72 # print if too long awk uses the same comment convention as the shell does: a # marks the begin- ning of a comment. ‘You can make the output somewhat more informative by printing a warning and part of the too-long line, using another built-in function, subst: Length($0) > 72 { print "Line", NR, "too Jong:", subste(#0,1,60) substr(s,m,m) produces the substring of s that begins at position m and is » characters long. (The string begins at position 1.) If m is omitted, the sub- string from m to the end is used. substr can also be used for extracting fixed-position fields, for instance, selecting the hour and minute from the out put of date: 5 date Thu Sep 29 12:17:01 EDT 1983 $date ! awk “{ print substr(s4, 1, 5) }” 32:17 s Exercise 4-7, How many awk programs can you write that copy input to output as cat does? Which is the shortest? © 118 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER + The BEGIN and END patterns awk provides (wo special patterns, BEGIN and END. BEGIN actions are performed before the first input Tine has been read; you can use the BEGIN pattern to initialize variables, to print headings or to set the field separator by assigning to the variable PS: $ awk ‘BEGIN ( FS = 47") > §2== "" /ete/passwd 8 [No ouput: we all use passwords END actions are done after the last line of input has been processed: $ awk “END { print NR)’ prints the number of lines of input Arithmetic and variables ‘The examples so far have involved only simple text manipulation. awk’s real strength Ties in its ability to do calculations on the input data as well; it is ‘easy to count things, compute sums and averages, and the like, A common use of awk is to sum columns of numbers. For example, to add up all the numbers in the first column: tessesty END ( print s } ince the number of values is available in the vs ie 10 able NR, changing the last END (print 8, s/xR ) prints both sum and average. ‘This example also illustrates the use of variables in awk. is not a built-in variable, but one defined by being used. Variables are initialized to zero by default $0 you usually don’t have to worry about initialization. vk also provides the same shorthand arithmetic operators that C does, so the example would normally be waitten (sonst) mx (print 8} 5 += $1 isthe same ass = 8 + $1, but notationally more compact. You can generalize the example that counts input ines like this: (no += leagen(so) +1 # nunber of chars, 1 for \n nw 4s NP # number of words , mx {print wR, nw, ae } ‘This counts the lines, words and characters in its input, so it does the same job as wo (although it doesn’t break the totals down by file) ‘As another example of arithmetic, this program computes the number of CHAPTER & FLTERS 119 66-line pages that will be produced by running a set of files through pr. This ccan be wrapped up in a command called prpages: 8 cat prpages compute number of pages that pr will print awk ‘1/ totals/ (-n te int((s1+55) / 56) } BND { print n pr puts 56 lines of text on each page (a fact determined empirically). The number of pages is rounded up, then truncated to an integer with the built-in function int, for each line of we output that does not match total at the end of a line, 3 wo ch. ‘7533090 18129 cha. 612 2421 13242 eha.2 637 2462 13455 ch4.3 202 2986 16904 chad 502131117 chao 2854 11172 62847 total $ prpages cha.+ 53 8 ‘To verify this result, run pr into awk directly: 3 pr che. 53 s 1 awk ‘END ( print NR/66 }’ Variables in awk also store strings of characters. Whether a variable is to be treated as a number or as a string of characters depends on the context Roughly speaking, in an arithmetic expression like s+=$1, the numeric value is used; in a string context like x="abc", the string value is used; and in an ambiguous case like xy, the string value is used unless the operands are clearly numeric. (The rules are stated precisely in the awk manual.) String variables are initialized to the empty string. Coming sections will put strings to ood use. awk itself maintains a number of built-in variables of both types, such as NR and FS, Table 4.3 gives the complete list. Table 4.4 lists the operators. Exercise 48. Our test of prpages suggests alternate implementations. Experiment to sve Which is fastest, © Control flow It is remarkably casy (speaking from experience) to create adjacent dupli cate words accidentally when editing a big document, and it is obvious that that almost never happens intentionally, To prevent such problems, one of the 120 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER Table 4.3: awk Built-in Variables, PILENAME name of current input file FS field separator character (default blank & tab) NF number of fields in input record NR number of input record ormr: ‘output format for numbers (default %g; see print €(3)) ors output field separator string (default blank) ors output record separator string (default newline) RS input record separator character (default newline) ‘Table 4.4: awk Operators (increasing order of precedence) on on fa he assignment; v ops expr is v = v op (expr) OR: exprl 4 expr? true if either is; ‘expr? not evaluated if expri is true 58 AND: exprl && expr2 true if both ares expr2 not evaluated if expr! is false ! negate value of expression = 1- relational operators; = and 1 are match and non-match nothing string concatenation = plus, minus “7% multiply, divide, remainder te increment, decrement (prefix or postfix) the components of the Writer’s Workbench family of programs, called double, looks for pairs of identical adjacent words. Here is an implementa- tion of double in awk: CHAPTER « FILTERS 121 $ cat double awk’ FILENAME I= prevfile ( # new file MR = 1 # reset line number prevfile = PILENAME d NE > ot Af ($1 == lastword) printf "double %s, file Xs, Line Xd\n",$1,FILENAME,NR for (1 22; 4 <2 NF; ies) if (Si == 84-10) Printf "double %s, file Xs, Line %d\n",$i,FILENAME,NR if (NP > 0) lastword = $F vse s The operator ++ increments its operand, and the operator ~~ decrements. The built-in variable FILENAME contains the name of the current input file Since NR counts lines from the beginning of the input, we reset it every time the filename changes so an offending line is properly identified. The if statement is just like that in C: LE (condition) statement! else statement If condition is true, then statement! is executed; if itis false, and if there is an else part, then statement2 is executed. ‘The else part is optional. ‘The for statement is a loop like the one in C, but different from the shell’s: for (expression! ; condition; expression? ) The for is identical to the following while statement, which is also valid in awk: expression! while (condition) ¢ expression? ) For example, for (4 = 25 4 NP; ise) runs the loop with 4 set in turn to 2, 3, ..., up to the number of ficlds, NF. ‘The break statement causes an immediate exit from the enclosing while 122 THE UNIX PROGRAMMING ENVIRONMENT ‘CHAPTER « or for; the continue statement causes the next iteration to begin (at condi- tion in the while and expression? in the for). The next statement causes the next input line to be read and pattern matching to resume at the beginning of the awk program. The exit statement causes an immediate transfer to the END pattern. Arrays awk provides arrays, as do most programming languages. As a trivial example, this awk program collects each line of input in a separate array ele- ‘ment, indexed by line number, then prints them out in reverse order: $ cat backwards # backwards: print input in backward line order awk’ ( line(NR) = 80 ) BND { for (i = NRj i > 0; d--) print linefi) ) ’ $+ s Notice that, like variables, arrays don't have to be declared; the size of an array is limited only by the memory available on your machine. Of course if a very large file is being read into an array, it may eventually run out of memory. To print the end of a large file in reverse order requires cooperation with tail! $ tail -5 /usr/dict/web2 | backwards zymurgy zymot ically zymotic zyposthenic zynosis s tail takes advantage of a file system operation called seeking, to advance to the end of a file without reading the intervening data, Look at the discussion of 1seek in Chapter 7. (Our local version of tail has an option ~r that prints the fines in reverse order, which supersedes backwards.) Normal input processing splits each input line into fields. It is possible to perform the same field-splitting operation on any string with the built-in func tion split: n= splits, arr, sep) splits the string s into fields that are stored in elements 1 through n of the array arr. If a separator character sep is provided, it is used; otherwise the current value of FS is used. For example, split($0,a,":") splits the input line on colons, which is suitable for processing /etc/passwa, and split("9/29/83" ,date,"/") splits a date on slashes. ‘CHAPTER 4 FiLrers 123 $ sed tq /etc/passwd | awk “(split($0,a,":"); print af1])’ $ echo 9/29/83 | awk ‘{split($0,date,"/"); print date[3])’ 3 s ‘Table 4.5 lists the awk built-in functions. Table 4.5: awk Built-in Functions cos (expr) cosine of expr exp(expr) exponential of expr: e*?" getline() reads next input line; returns 0 if end of file, 1 if not index(s/ ,32) position of string s2 in s/; returns 0 if not present int (expr) integer part of expr; truncates toward 0 Length(s) length of string 5 Log (expr) natural logarithm of expr sin(expr) sine of expr split(s,a,c) split s into af 1]...a{n) on character ¢; return sprint (fim, ...) format ... according to specification fint substr(s,m,n) __n-character substring of s beginning at position m Associative arrays ‘A standard problem in data processing is to accumulate values for a set of ame-value pairs. That is, from input like susie 400 John 100 Mary 200 Mary 300 John 100 Susie 100 Mary — 100 ‘ve want to compute the total for each name: John 200 Mary 600 Susie 500 awk provides a neat way to do this, the associative array. Although one nor- ‘mally thinks of array subscripts as integers, in awk any value can be used as a subscript. So { sumts1) += $2) BND {for (name in sum) print name, sum{name] } is the complete program for adding up and printing the sums for the name- 124 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER « value pairs like those above, whether or not they are sorted. Each name ($1) is used as a subscript in sum; at the end, a special form of the for statement is used to cycle through all the elements of sum, printing them out. Syntacti cally, this variant of the £or statement is for (var in array) Although it might look superficially like the for loop in the shell, it’s unre- lated. It loops over the subscripts of array, not the elements, setting var to each subscript in turn. ‘The subscripts are produced in an unpredictable order, however, so it may be necessary to sort them. In the example above, the out- put can be piped into sort to list the people with the largest values at the top. $ awk |} sort +ine ‘The implementation of associative memory uses a hashing scheme to ensure that access to any element takes about the same time as to any other, and that (at least for moderate array sizes) the time doesn't depend on how many ele- ‘ments are in the array. ‘The associative memory is effective for tasks like counting all the words in the input: $ cat wordtreq awk’ { for (i Aco MP; its) num[siles } END { for (word in num) print word, num[wora] } se $ wordfreq chd.+ | sort +1 -nr | sed 20q 1 4 the 372 cw 345, of 220 is 105 0 175 a 167 in 109 fané_100 -P1 94 -P2 94 “PP 90 $87 avk 87 sed 83 chat 76 for 75 The 63 are 61 Line 55 print 52 s ‘The first for loop looks at each word in the input line, incrementing the ele- ‘ment of array num subscripted by the word. (Don’t confuse awk’s $4, the i'th field of the input line, with any shell variables.) After the fileshawgbeen read, the second for loop prints, in arbitrary order, the words and their counts. Exercise 49. The output from wordfzeq includes text formatting commands like .C¥, which is used to print words in this font. How would you get rid of such non- words? How would you use tr to make wordfreq work properly regardless of the case of its input? Compare the implementation and performance of wordt reg to the pipeline from Section 4.2 and to this one eed “a/t “IL =} Za! te | ort | unig -c | sort -ne (CHAPTER & FTeRs 125 Strings Although both sed and awk are used for tiny jobs like selecting a single field, omly awk is used to any extent for tasks that really require programming. One example is a program that folds long lines to 80 columns. Any line that exceeds 80 characters is broken after the 80th; a \ is appended as a warning, and the residue is processed. The final section of a folded line is right- justified, not left-justified, since this produces more convenient output for pro- gram listings, which is what we most often use £014 for. As an example, using 20-character lines instead of 80, $ cat test A short line. ‘A somewhat longer line. This line is quite a bit longer than the last one $ fold test A short line. ‘A somewhat longer 14\ ‘This line is quite a\ bit longer than the\ last one. ’ Strangely enough, the 7th Edition provides no program for adding or removing tabs, although pr in System V will do both. Our implementation of fold uses sed to convert tabs into spaces so that awk’s character count is right. This works properly for leading tabs (again typical of program source) but does not preserve columns for tabs in the middle of a line. # fold: fold long lines sed ‘8/+/ 7g’ $ | # convert tabe to 8 spaces awk ¢ BEGIN { N= 80 # folds at column 80 for (i= 1; i <= Nj ies) # make a string of blanks blanks = blanks " * {Af (im = lengtn(s0)) <=) print else { for (i= 1) n> Nj n= N) { printf "%s\\\n", substr(s0,i,) Lest y printf "XeXs\n", substr(blanks,1,N-n), substr($0,4) ) ye In awk there is no explicit string concatenation operator; strings are 126 THE UNIX PROGRAMMING ENVIRONMENT ‘CHAPTER + concatenated when they are adjacent. Initially, blanks is a null string. The loop in the BEGIN part creates a long string of blanks by concatenation: each trip around the loop adds one more blank to the end of blanks. The second loop processes the input line in chunks until the remaining part is short enough. As in C, an assignment statement can be used as an expression, so the construction Af ((m = Length(s0)) <= N) assigns the length of the input line to n before testing the value. Notice the parentheses. Exercise 4-10. Modify £014 so that it will fold lines at blanks or tabs rather than split- ting a word. Make it robust for long words. © Interaction with the shell ‘Suppose you want to write a program £ie1d n that will print the n-th field from each line of input, so that you could say, for example, $ who | field 1 to print only the login names. awk clearly provides the field selection capabil- ity; the main problem is passing the field number m to an awk program. Here is one implementation: awk “(print $817 97 The $1 is exposed (it’s not inside any quotes) and thus becomes the field number seen by awk. Another approach uses double quotes: awk "{ print \ss1 3" In this case, the argument is interpreted by the shell, so the \8 becomes a $ and the $1 is replaced by the value of n. We prefer the single-quote style because So many extra \’s are needed with the double-quote style in a typical awk program. ‘A second example is addup m, which adds up the numbers in the n-th field: ak ‘(2 40 8°81" } END { print =)’ A third example forms separate sums of each of n columns, plus a grand total: CHAPTER 4 rrers 127 awk * BEGIN (n= 81") c for (1 = 1; 4 <= my dee) sun[i] += $4 > END (for (i= 45 4 <= ng dee) ( printf "6g", sun(il total += sum[i] > printé "; total = x6g\n", total he We use a BEGIN to insert the value of n into a variable, rather than cluttering up the rest of the program with quotes. ‘The main problem with all these examples is not keeping track of whether fone is inside or outside of the quotes (though that is a bother), but that as currently written, such programs can read only their standard input; there is no way to pass them both the parameter n and an arbitrarily long list of filenames. This requires some shell programming that we'll address in the next chapter, A calendar service based on awk Our final example uses associative arrays; itis also an illustration of how to interact with the shell, and demonstrates a bit about program evolution. ‘The task is to have the system send you mail every morning that contains a reminder of upcoming events. (There may already be such a calendar service; see calendar(!). This section shows an alternate approach.) The basic ser- vice should tell you of events happening today; the second step is to give a day of warning — events of tomorrow as well as today. The proper handling of weekends and holidays is left as an exercise ‘The first requirement isa place to keep the calendar. For that, a file called calendar in /usr/you seems easiest § cat calendar Sep 30 mother’s birehday ect 1 Tench with joe, noon oct 1 meeting 4pm ‘ Second, you need a way to scan the calendar for a date. There are many choices here; we will use awk because it is best at doing the arithmetic neces- sary to get from “today” to “tomorrow,” but other programs like sed or egrep can also serve. The lines selected from the calendar are shipped off by mai, of course. ‘Third, you need a way to have calendar scanned reliably and automati- cally every day, probably early in the morning. This can be done with at, which we mentioned briefly in Chapter 1. 128 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER « If we restrict the format of calendar so each line begins with a month name and day as produced by date, the first draft of the calendar program is easy: $ date ‘Thu Sep 29 15:23:12 EDP 1983 $ cat bin/calendar # calendar: version 1 -- today only awk 1 && $1 == mon && $2 == day # print calendar lines Vomail sNaME The next step is to arrange for calendar to look for tomorrow as well as today. Most of the time all that is needed is to take today's date and add 1 0 the day. But at the end of the month, we have to get the next month and set the day back to 1. And of course each month has a different number of days. ‘This is where the associative array comes in handy. Two arrays, days and nextmon, whose subscripts are month names, hold the number of days in the ‘month and the name of the next month, Then days{"Jan"} is 31, and nextmon{"Jan"] is Feb. Rather than create a whole sequence of statements, like days "a days( "Fe 31; nextmon{"Jan"] = "Feb" = 28; nextmon{ *Feb"] we will use split to convert a convenient data structure into the one really needed: ‘CHAPTER FILTERS 129 $ cat calendar # calendar: version 3 -~ today and toorrow awk = days(mont}) ( day2 = 1 mon2 = nextmon{ mon’) $1 s= mont && $2 #2 day1 1! $1 *# mon? && $2 == day2 vf mail swame ‘ Notice that Jan appears twice in the data; a “sentinel” data value like this sim- plifies processing for December. ‘The final stage is to arrange for the calendar program to be run every day. What you want is for someone to wake up every morning at around 5 AM and run calendar. You can do this yourself by remembering to say (every day!) $ at Sam calendar aa s bbut that's not exactly automatic or reliable. The trick is to tell at not only to run the calendar, but also to schedule the next run as well. $ cat earty.morning calendar echo early.morning | at Sam s ‘The second line schedules another at command for the next day, so once started, this sequence is self-perpetuating. The at command sets your PATH, current directory and other parameters for the commands it processes, so you needn't do anything special. Exercise 4-11. Modify calendar so it knows about weekends: on Friday, “tomorrow’ includes Saturday, Sunday and Monday. Modify calendar to handle leap years. Should calendar know about holidays? How would you arrange it?) 0 130 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 4 Exercise 4-12, Should calendar know about dates inside a line, not just at the begin- ning? How about dates expressed in other formats, like 10/1/83? 0 Exercise 4-13. Why doesn't calendar use getname instead of $NAME? © Exercise 4-14. Write a personal version of zm that moves files to @ temporary directory rather than deleting them, with an at command to clean out the directory while you are sleeping. © Loose ends awk is an ungainly language, and it’s impossible to show all its capabilities in a chapter of reasonable size. Here are some other things to look at in the manual: ‘+ Redirecting the output of print into files and pipes: any print or print statement can be followed by > and a filename (as a quoted string or in a variable); the output will be sent to that file. As with the shell, >> appends instead of overwriting. Printing into a pipe uses ! instead of >. ‘© Multi-line records: if the record separator RS is set to newline, then input records will be separated by an empty line. In this way, several input lines can be treated as a single record. © “Pattern, pattern” as a selector: as in ed and sed, a range of lines can be specified by a pair of patterns. ‘This matches lines from an occurrence of the first pattern until the next occurrence of the second. A simple example wR 20 10, HR which matches lines 10 through 20 inclusive. 4.8 Good files and good filters Although the last few awk examples are self-contained commands, many uses of awk are simple one- or two-line programs (o do some filtering as part of a larger pipeline. This is true of most filters — sometimes the problem at hand can be solved by the application of a single filter, but more commonly it breaks down into subproblems solvable by filters joined together into a pipe- line. This use of tools is often cited as the heart of the UNIX programming environment. That view is overly restrictive; nevertheless, the use of filters pervades the system, and it is worth observing why it works The output produced by UNIX programs is in a format understood as input by other programs. Filterable files contain lines of text, free of décorative headers, trailers or blank lines. Each line is an object of interest — a filename, a word, a description of a running process — so programs like we and grep can count interesting items or search for them by name. When more information is present for each object, the file is still line-by-line, but columnated into fields separated by blanks or tabs, as in the output of 1s -1. Given data divided into such fields, programs like awk can easily select, pro- cess or rearrange the information, ‘CHAPTER + FILTERS 131 Filters share a common design. Each writes on its standard output the result of processing the argument files, or the standard input if no arguments are given. The arguments specify inpur, never output,t so the output of a command can always be fed to @ pipeline. Optional arguments (or non- filename arguments such as the grep pattern) precede any filenames. Finally, error messages are written on the standard error, so they will not vanish down, a pipe These conventions have little effect on the individual commands, but when uniformly applied to all programs result in a simplicity of interconnection, illustrated by many examples throughout this book, but perhaps most spectacu- larly by the word-counting example at the end of Section 4.2. If any of the programs demanded a named input or output file, required interaction to specify parameters, or generated headers and trailers, the pipeline wouldn't work. And of course, if the UNIX system didn’t provide pipes, someone would have to write a conventional program to do the job. But there are pipes, and the pipeline works, and is even easy to write if you are familiar with the tools, Exercise 4-15. ps prints an explanatory header, and ‘of blocks in the files. Comment, announces the total number History and bibliographic notes ‘A good review of pattern matching algorithms can be found in the paper “Pattern matching in strings” (Proceedings of the Symposium on Formal Language Theory, Santa Barbara, 1979) by Al Aho, author of egrep. sed was designed and implemented by Lee McMahon, using ed as a base awk was designed and implemented by Al Aho, Peter Weinberger and Brian Kernighan, by a much less elegant process. Naming a language alter its authors also shows a certain poverty of imagination. A paper by the imple- mentors, “AWK — a pattern scanning and processing language,” Software— Practice and Experience, July 1978, discusses the design. awk has its origins in several areas, but has certainly stolen good ideas from SNOBOLA, from sed, from a validation language designed by Mare Rochkind, from the language tools yace and Lex, and of course from C. Indeed, the similarity between awk and C is a source of problems — the language looks like C but it’s not. Some constructions are missing; others differ in subtle ways. ‘An article by Doug Comer entitled “The flat file system FFG: a database system consisting of primitives” (Software—Practice and Experience, November, 1982) discusses the use of the shell and awk to create « database system. + An early UNIX fle sytem was destroyed by @ maintenance program that violated this rule, be ‘cause a harmless-looking command scribbled allover the dise cHarTer 5: SHELL PROGRAMMING Although most users think of the shell as an interactive command inter- preter, it is really a programming language in which cach statement runs a command, Because it must satisfy both the interactive and programming, aspects of command execution, it is a strange language, shaped as much by his tory as by design. The range of its application leads to an unsettling quantity of detail in the language, but you don’t need to understand every nuance to use it effectively. This chapter explains the basics of shell programming by show- ing the evolution of some useful shell programs. It is nor a manual for the shell, That is in the manual page sh(1) of the Unix Programmer's Manual, which you should have handy while you are reading, With the shell, as with most commands, the details of behavior can often be most quickly discovered by experimentation. The manual can be cryptic, and there is nothing better than a good example to clear things up. For that rea- son, this chapter is organized around examples rather than shell features; it is a guide to using the shell for programming, rather than an encyclopedia of its capabilities. We will talk not only about what the shell can do, but also about developing and writing shell programs, with an emphasis on testing ideas interactively. ‘When you've written a program, in the shell or any other language, it may bbe helpful enough that other people on your system would like to use it. But the standards other people expect of a program are usually more rigorous than those you apply for yourself, A major theme in shell programming is therefore making programs robust so they can handle improper input and give helpful information when things go wrong $.1 Customizing the ca1 command One common use of a shell program is to enhance or to modify the user interface to a program. As an example of a program that could stand enhance- ‘ment, consider the ca1(1) command: 133 134 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER $ cat usage: cal [month] year Good so far $ cal october 1983 Bad argunent Not so good $ cal 10 1963 October 1983 s mT W™ FP Ss 1 2345676 9101712 13:14:15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 2031 s It’s @ nuisance that the month has to be provided numerically. And, as it turns out, cal 10 prints out the calendar for the entire year 10, rather than for the current October, so you must always specify the year to get a calendar for a single month ‘The important point here is that no matter what interface the cal command provides, you can change it without changing cal itself. You can place a com- mand in your private bin directory that converts a more convenient argument syntax into whatever the real cal requires. You can even call your version cal, which means one less thing for you to remember. The first issue is design: what should cal do? Basically, we want cal to be reasonable. It should recognize a month by name, With two arguments, it should behave just as the old cal does, except for converting month names into numbers. Given one argument, it should print the month or year’s calen- dar as appropriate, and given zero arguments, it should print the current month's calendar, since that is certainly the most common use of a eal com- mand. So the problem is (o decide how many arguments there are, then map them to what the standard cal wants. ‘The shell provides a case statement that is well suited for making such decisions: case word in pattern) commands $3 pattern) commands $3 The case statement compares word to the patterns from top to bottom, and performs the commands associated with the first, and only the first, pattern that matches. The patterns are written using the shell’s pattern matching rules, Slightly generalized from what is available for filename matching. Each action is terminated by the double semicolon ;;. (The 4 may be left off the last cease but we often leave it in for easy editing.) CHAPTER 5 SIELL PROGRAMMING — 135 Our version of cai decides how many arguments are present, processes alphabetic month names, then calls the real cal. The shell variable $# holds the number of arguments that a shell file was called with; other special shell variables are listed in Table 5.1 $ cat’ cal # cal: nicer interface to /usr/bin/eal case $# in °) set ‘date’; m=s2; y=86 ” ma$1; set ‘date’: y=s6 o nes; y=82 # no args: use today #1 arg: use this year #2 arge: month and year case $m sn, jane ivane) Eepe!Febe) marelMare) aprelapre) mays iMay+) june guns) Sulelguis) auge lange) sep+!Sep+) ects loct+) # numeric month is @ plain year /asr/bin/eal $m sy # run the real one s ‘The first case checks the number of arguments, $#, and chooses the appropri- ate action. The final » pattern in the first case is a catch-all: if the number of arguments is neither 0 nor 1, the last case will be executed. (Since patterns are scanned in order, the catch-all must be last.) This sets m and y to the month and year — given two arguments, our cal is going to act the same as the ori- ginal ‘The first case statement has a couple of tricky lines containing set “dates Although not obvious from appearance, it is easy to see what this statement does by trying it: 136 THE UNIX PROGRAMMING ENVIRONMENT Harter s Table $.1: Shell Built-in Variables se the number of arguments se all arguments to shell s@ similar to 8; see Section 5.7 s ‘options supplied to the shell 8? return value of the last command executed 3s. process-id of the shell st process-id of the last command started with & SHOME default argument for ca command STFS __ list of characters that separate words in arguments SMAIL file that, when changed, triggers "you have mail” message SPATH list of directories to search for commands SPS1 prompt string, default "s * $PS2__ prompt string for continued command line, default ‘> * § dace Sat Oct 1 06:05:18 EDT 1989 f set ‘date™ 8 echo #1 # echo $4 06:05:20 + et is a shell builtin command that does too many things. With no argu- ments, it shows the values of variables in the environment, as we saw in Chapter 3, Ordinary arguments reset the values of $1, $2, and so on. So set ‘date’ sets $1 10 the day of the week, $2 t0 the name of the month, and so on. ‘The first case in cal, therefore, sets the month and year from the current date if there are no arguments; if there's one argument, it's used as the month and the year is taken from the current date et also recognizes several options, of which the most often used are ~¥ and —; they turn-on echoing of commands as they are being processed by the shell. These are indispensable for debugging complicated shell programs. The remaining problem is to convert the month, if it i in textual form, into a number. This is done by the second case statement, which should be largely self-explanatory, The only twist is that the 1 character in cage Statc- ment patterns, as in egrep, indicates an alternative: big!smal1 matches cither big of small. Of ‘course, these cases could also be written as [3 lane and so on, ‘The program accepts month names either in all lower case, because most commands accept lower case input, oF with first letter capi- talized, because that isthe format printed by date. ‘The rules for shell patern matching are given in Table 5.2 CHAPTER 5 SHELL PROGRAMMING 137 Table 5. * match any string, including the null string 2 match any single character Cece] match any of the characters in ece [a-d0-3) is equivalent to [abod0123) "..." match ... exactly; quotes protect special characte ‘Shell Pattern Matching Rules 8. Also *...° \c-mateh ¢ literally alb in case expressions only, matches either a or b Z in filenames, matched only by an explicit / in the expression; in case, matched like any other character as the first character of a filename, is matched only by an explicit . in the expression The last two cases in the second case statement deal with a single argu- ment that could be a year; recall that the first case statement assumed it was ‘a month. If it is a number that could be a month, it is left alone. Otherwise, it is assumed to be a year. Finally, the last line calls /usr/bin/cal (the real cal) with the con- verted arguments. Our version of cal works as a newcomer might expect: 5 date Sat Oct 1 06:09:55 EDT 1983 $ cal October 1983 s MT WT FS 1 23245 6 7 8 9101112 13 14:15 46 17 18 19 20 2122 23 24 25 26 27 28 29 30 31 $ cal dec December 191 s Mm Wm FS qa 45678940 19:12:13 14:95 16.17 4a 19 20 21 22 23 24 25 26 27 28 29 30.31 And cal 1984 prints out the calendar for all of 1984. Our enhanced cal program does the same job as the original, but in a simpler, easier-to-remember way. We therefore chose to call it cal, rather 138 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 5 than calendar (which is already a command) or something less mnemonic like ncai. Leaving the name alone also has the advantage that users don’t have to develop a new set of reflexes for printing a calendar. Before we leave the case statement, it’s worth a brief comment on why the shell’s pattern matching rules are different from those in ed and its deriva- tives. After all, two kinds of patterns means two sets of rules to learn and two pieces of code to process them. Some of the differences are simply bad choices that were never fixed — for example, there is no reason except compatibility with a past now lost that ed uses *.* and the shell uses °?° for “match any character."” But sometimes the patterns do different jobs. Regular expressions in the editor search for a string that can occur anywhere in a line; the special characters * and $ are needed to anchor the search to the beginning and end of the line. For filenames, however, we want the search anchored by default, since that is the most common case; having to write something like $ 1s “?+.08 Doesn't work this way instead of Sis te would be a great nuisance. Exercise 5-1. If users prefer your version of a2, how do you make it globally accessi ble? What has to be done to put it in /use/bin? 0 Exercise 5-2. Is it worth fixing cal so cai 83 prints the calendar for 1983? If so, how would you print the calendar for year 837 © Exercise 5-3. Modify cat to accept more than one month, as in cat oct nov of perhaps a range of months $ cal oot ~ deo IF it's now December, and you ask for cal Jan, should you get this year's January or next year's? When should you have stopped adding features to cal? 5.2. Which command is which? ‘There are problems with making private versions of commands such as cal. The most obvious is that if you are working with Mary and type cal while logged in as mary, you will get the standard cal instead of the new one, unless of course Mary has linked the new cal into her bin directory. This ‘can be confusing — recall that the error messages from the original cal are ‘not very helpful — but is just an example of a general problem. Since the shell searches for commands in a set of directories specified by PATH, it is always possible to get a version of a command other than the one you expect. For instance, if you type a command, say echo, the pathname of the file that is actually run could be ./echo or /bin/echo or /usr/bin/echo or ‘CHAPTER 5 SHELL PROGRAMMING 139) something else, depending on the components of your PATH and where the files are, It can be very confusing if there happens to be an executable file ‘with the right name but the wrong behavior earlier in your search path than you expect. Perhaps the most common is the test command, which we will discuss later: its name is such an obvious one for a temporary version of a pro- ‘gram that the wrong test program gets called annoyingly often.+ A command that reports which version of a program will be executed would provide a use ful service. One implementation is to loop over the directories named in PATH, search- ing each for an executable file of the given name. In Chapter 3, we used the for to loop over filenames and arguments. Here, we want a loop that says for i in each component of PATH ao If given name is in directory i rine its full pathname done Because we can run any command inside backquotes *...*, the obvious solu- tion is to run sed over $PATE, converting colons into spaces. We can test it cout with our old friend echo: $ echo sari t/uer/you/bin: /bin: /use/bin 4 components $ echo $patH | sed ’8/:/ 79” yasr/you/bin /bin /usr/bin Only 3 printed $ echo ‘echo $PATH | sed ‘8/:/ /g”* /asr/you/bin /bin /usr/bin Sul only 3 There is clearly a problem. A ull string in PATH is a synonym for *.". Con- verting the colons in PATH to blanks is therefore not good enough — the infor- ‘mation about null components will be lost. To generate the correct list of directories, we must convert a null component of PATH into a dot. The null ‘component could be in the middle or at either end of the string, so it takes a Tite work to catch all the cases: $ echo $parH ! sed (8/72/47 > Bsialg > 2/8/27 > Bi:/ 1” + /usr/you/bin /bin /use/bin $ We could have written this as four separate sea commands, but since sed does the substitutions in order, one invocation can do it al. + Later we will ee how to avoid this problem in shel files, where testis usually used 140 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER Once we have the directory components of PATH, the test(1) command we've mentioned can tell us whether a file exists in each directory. The test. command is actually one of the clumsier UNIX programs. For example, test ax file tests if €iLe exists and can be read, and test -w file tests if fie exists and can be written, but the 7th Edition provides no test -x (although the System V and other versions do) which would otherwise be the one for us. We'll settle for test ~£, which tests that the file exists and is not a directory, in other words, is a regular file, You should look over the manual page for test on your system, however, since there are several versions in cir- culation. Every command returns an exit status — a value returned to the shell to indicate what happened. The exit status is a small integer; by convention, 0 means “true” (the command ran successfully) and non-zero means “false” (the command ran unsuccessfully). Note that this is opposite to the values of true and false in C. Since many different values can all represent “false,” the reason for failure is often encoded in the “false” exit status. For example, grep returns 0 if there was a match, 1 if there was no match, and 2 if there was an error in the pattern of filenames. Every program returns a status, although we usually aren't interested in its value. test is unusual because its sole purpose is to return an exit status. It produces no output and changes no files The shell stores the exit status of the last program in the variable $7: emp /usr/you/.profile /usr/you/ profile ‘No ouput; they're the same echo $7, Zero implies ran O.K.: files identical cmp /usr/you/.profile /usr/mary/.profile usr/you/ -profiie /usr/mary/.profile differ: char 6, line 3 ‘echo $7, Non-zero means files were diferent ‘A few commands, such as emp and grep, have an option ~s that causes them to exit with an appropriate status but suppress all output The shell’s 4£ statement runs commands based on the exit status of a com- mand, as in 4£ command then commands if condition true else commands if condition false i The location of the newlines is important: £i, then and else are recognit only after a newline or a semicolon. The else part is optional ‘The 4€ statement always runs a command — the condition — whereas the “d CHAPTER 5 SHELL PROGRAMMING — 141 case statement does pattern matching directly in the shell. In some UNIX ver- sions, including System V, test is a shell built-in function so an if and a test will run as fast asa case. If test isn't built in, case statements are more efficient than if statements, and should be used for any pattern match- ing: case "$1" in hello) command will be faster than Af test "s1" = hello Slower unless test: is « shell builtin then ‘command 4 ‘That is one reason why we sometimes use case statements in the shell for testing things that would be done with an if statement in most programming languages. A case statement, on the other hand, can’t easily determine whether a file has read permissions; that is better done with a test and an ig. ‘So now the pieces are in place for the first version of the command which, to report which file corresponds to a command: $ cat which # which one which cmd in PATH is executed, version 1 case $# in 0 echo ‘Usag for i in ‘echo $PATH | sed which command’ 1>42; exit 2 a0 ie test -€ sist # use test -x if you can then echo $i/81 exit 0 # found it cn done exit 1 # not found ‘ Let’s test it: 142 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 5 $ cx which Make it executable $ which which s/ehich $ which ea Pein/ed $ mv which /usr/you/bin $ which which /ase/you/bin/which $ The initial case statement is just error-checking. Notice the redirection 1>&2 ‘on the echo so the error message doesn't vanish down a pipe. The shell built-in command exit can be used to return an exit status. We wrote exit 2 to return an error status if the command didn’t work, exit 1 if it couldn't find the file, and exit 0 if it found one. If there is no explicit exit state- ment, the exit status from a shell file is the status of the last command exe- cuted. ‘What happens if you have a program called test in the current directory? (We're assuming that test is not a shell built-in.) $ echo “echo hello’ >test Make a fake test. Sox test Make it executable $ which which Try which now hello Pails! c/uhich s More error-checking is called for. You could run which (if there weren't a test in the current directory!) to find out the full pathname for test, and specify it explicitly. But that is unsatisfactory: test may be in different direc tories on different systems, and which also depends on sed and echo, so we should specify their pathnames too. There is a simpler solution: fix PATH in the shell file, so it only looks in /bin and /usr/bin for commands. Of course, for the which command only, you have to save the old PATH for determining the sequence of directories to be searched. ‘CHAPTER 5 SHELL PROGRAMMING 143 $ cat which # which ond: which end in PATH is executed, final version opath=sParH PATH=/bin: /ase/bin case $# in 0) echo ‘Usage: which conmand’ 1>62; exit 2 for i in ‘echo Sopath | sed * 8/28/27 B/:/ 7g eo if test -£ sist # this is /bin/test then # of /usr/bin/teat only echo $i/81 exit 0 4 found it 4 done exit 1 # not found which now works even if there is a spurious test (or sed or echo) along the search path. $ Is -1 test cewarwxrwx 1 you 11 ct 1 06:55 test Still here § which which Zusr/you/bin/which $ which test /eest S rm test $ which test ein/test ’ ‘The shell provides two other operators for combining commands, 11 and &6, that are often more compact and convenient than the if statement, For example, 1! can replace some if statements: test -£ filename 1! echo file filename does not exist is equivalent to AE teat 1 -£ filename The | negates the condition ‘then ‘echo file filename does not exist 4 The operator 1, despite appearances, has nothing to do with pipes — it is a 144 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 5 conditional operator meaning OR. The command to the left of {| is executed. If its exit status is zero (success), the command to the right of 11 is ignored. If the left side returns non-zero (failure), the right side is executed and the value of the entire expression is the exit status of the right side. In other words, 1! is a conditional OR operator that does not execute its right-hand command if the left one succeeds. The corresponding $8 conditional is AND; it executes its right-hand command only if the left one succeeds. Exercise 5-4, Why doesn't which reset PATH to opath before exiting? © Exercise 5-5. Since the shell uses esac to terminate a case, and £4 to terminate an 4, why does it use done to terminate a do? 0 Exercise 5-6. Add an option ~a to which so it prints all files in PATH, rather than quitting after the first. Hint: match="exit 0°, Exercise 5-7. Modify which so it knows about shell built-ins like exit. 0 Exercise 5-8. Modify which to check for execute permissions on the files. Change it to print an error message when a file cannot be found, © 5.3 while and until loops: watching for things In Chapter 3, the £or loop was used for a number of simple iterative pro- grams. Usually, a £ox loops over a set of filenames, as in ‘for i in *.c’, or all the arguments to a shell program, as in ‘for 4 in $4’, But shell loops are more general than these idioms would suggest; consider the for loop in which, There are three loops: for, white and until. The for is by far the ‘most commonly used. It executes a set of commands — the loop body — once for each element of a set of words, Most often these are just filenames. The while and until use the exit status from a command to control the execution of the commands in the body of the loop. The loop body is executed until the condition command returns a non-zero status (for the while) or zero (for the until). while and until are identical except for the interpretation of the exit status of the command. Here are the basic forms of each loop: for 4 in list of words ao oop body, $4 set 10 successive elements of list done for i (List is implicitly all arguements 10 shell file, i.e., $+) do loop body, $4 set 10 successive arguments done CHAPTER 5 SHELL PROGRAMMING 145 while command ao oop body executed as long as command returns true done until command 0 loop body executed as long as command returns false done ‘The second form of the for, in which an empty list implies $#, is a convenient shorthand for the most common usage. ‘The conditional command that controls a while or until can be any com- mand. As a trivial example, here is a white loop to watch for someone (say Mary) to log in: while sleep 60 do who | grep mary done ‘The sleep, which pauses for 60 seconds, will always execute normally (unless interrupted) and therefore return “success,” so the loop will check once a minute to see if Mary has logged in. This version has the disadvantage that if Mary is already logged in, you must wait 60 seconds to find out. Also, if Mary stays logged in, you will be told about her once a minute. The loop can be turned inside out and written with an until, to provide the information once, without delay, if Mary is on until who 1 grep mary eo sleep 60 gone This is a more interesting condition. If Mary is logged in, ‘who ! grep mary’ prints out her entry in the who listing and returns “true,” because grep elurns a status to indicate whether it found something, and the exit status of a pipeline is the exit status of the last element Finally, we can wrap up this command, give it a name and install i 146 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER $ cat watchtor WY watchfor: wateh for someone to log in PATH=/bin: /uer/bin case $4 in ° echo ‘Usag watchfor person’ 1262; exit 7 until who { egrep "$1" ao. sleep 60 done $ ox watchfor $ watchfor you you ttyO oct 1:08:01 works $ mv watchfor /usr/you/bin Install it s We changed grep to egrep so you can type 8 watchfor ‘joe!mary’ to watch for more than one person. ‘As a more complicated example, we could watch all people logging in and out, and report as people come and go — a sort of incremental who. The basic structure is simple: once a minute, run who, compare its output to that from a minute ago, and report any differences. The who output will be kept in a file, so we will store it in the directory /tmp. To distinguish our files from those belonging to other processes, the shell variable $8 (the process id of the shell command), is incorporated into the filenames; this is a common convention. Encoding the command name in the temporary files is done mostly for the sys- tem administrator. Commands (including this version of watchwho) often leave files lying around in /tmp, and it’s nice to know which command is doing it CHAPTER SHELL PROGRAMMING 147 $ cat watchwho # watchwho: watch who logs in and out PATHs/bin: /usz/bin eap/wehot $$ temp/wwho2 $$ # create an empty file # loop forever who >$new aire fold snew mv gnew sold sleep 60 done | awk "/>/ { $1 = “in: print } Ves ($1 = Fou print }/ is a shell built-in command that does nothing but evaluate its arguments and return “true.” Instead, we could have used the command true, which merely returns a true exit status, (There is also false command.) But is more efficient than true because it does not execute a command from the file system, G4f¢ output uses < and > to distinguish data from the two files; the awk program processes this to report the changes in an easier-to-understand format. Notice that the entire while loop is piped into awk, rather than running a fresh awk once a minute. sed is unsuitable for this processing, because its output is always behind its input by one line: there is always a line of input that has been processed but not printed, and this would introduce an unwanted delay. Because o1 is created empty, the first output from watchwho is a list of all users currently logged in. Changing the command that initially creates ola to who >$01d will cause watchwho to print only the changes; it’s a matter of taste. Another looping program is one that watches your mailbox periodically; whenever the mailbox changes, the program prints “You have mail.” This is 4 useful alternative to the shell’s built-in mechanism using the variable MATE. We have implemented it with shell variables instead of files, to illustrate a dif- ferent way of doing things. 148 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER $ cat checknail # checkmail: watch mailbox for growth PATH=/pin: /usr/bin MAaTLe/usr/spocl/maii/*getname’ # system dependent (1-60) sie <1 swat? while + ao. ye"is -1 SMATL'* echo sx $y, xe"sy" sleep st Gone ! awk '$4 < $12 ( print “You have mail" }" 5 ‘We have used awk again, this time to ensure that the message is printed only ‘when the mailbox grows, not merely when it changes. Otherwise, you'll get a message right after you delete mail. (The shell’s built-in version suffers from this drawback.) The time interval is normally set to 60 seconds, but if there is a parameter fon the command line, as in $ checknail 30 that is used instead. ‘The shell variable t is set to the time if one is supplied, and to 60 if no value was given, by the line 41-60) This introduces another feature of the shell S{var} is equivalent to $var, and can be used to avoid problems with variables inside strings containing letters or numbers: $ varshello $ varxegooabye $ echo svar hello $ echo svarx gooabye 8 echo s{var}x hellox s Certain characters inside the braces specify special processing of the variable. If the variable is undefined, and the name is followed by a question mark, then the string after the ? is printed and the shell exits (unless it’s interactive). If the message is not provided, a standard one is printed: ccuarrer 5 SHELL PROGRAMMING 149) $ echo ${var?) hello O.K.: var is set $ echo ${junk?) junk: parameter not set Default message § echo ${junk?error!) junk: error! Message provided ’ Note that the message generated by the shell always contains the name of the undefined variable. Another form is ${var-thing} which evaluates to $var if it is defined, and thing if it is not. ${var=thing} is similar, but also sets $var to ‘thing: $ echo ${junk-‘Hi there’} Hi there $ echo ${junk?) Junk: parameter not set Sunk unalfected § echo ${junk=’Hi there’) Hi there 8 echo ${junk?) Hi there Junk ser 0 HL there s ‘The rules for evaluating variables are given in Table 5.3. Returning to our original example, 811-60) sets t to $1, or if no argument is provided, to 60. ‘Table 5.3: Evaluation of Shell Variables svar value of var; nothing if var undefined svar} same; useful if alphanumerics follow variable name S{var-thing} value of var if defined; otherwise thing, ‘Svar unchanged, value of var if defined; otherwise thing, If undefined, Svar set to thing S{var?message} if defined, $var. Otherwise, print message and exit shell. If message empty, print: var: parameter not set thing if $var defined, otherwise nothing S{var+thing) Exercise 5-9. Look at the implementation of true and false in /bin or /usr/bin. (How would you find out where they are?) © Exercise 5-10, Change watchfor so that multiple arguments are treated as different 130 THE UNIX PROGRAMMING ENVIRONMENT ‘CHAPTER 5 people, rather than requiring the user to type "Joetmary’. © Exercise 5-11. Write a version of wat chwho that uses comm instead of awk t0 compare the old and new data. Which version do you prefer? Exercise 5-12, Write a version of watehho that stores the who output in shell vari- ables instead of files. Which version do you prefer? Which version runs faster? Should watehwho and checkmail do & automatically? 0 Exercise 5-13. What is the difference between the shell : do-nothing command and the *# comment character? Are both needed? © 5.4 Traps: catching interrupts If you hit DEL or hang up the phone while vatchwho is running, one or two temporary files are left in /tmp. watchwho should remove the temporary files before it exits. We need a way to detect when such events happen, and a way to recover. When you type DEL, an interrupt signal is sent to all the processes that you are running on that terminal. Similarly, when you hang up, a hangup signal is sent, There are other signals as well. Unless a program has taken explicit action to deal with signals, the signal will terminate it. The shell protects pro- ‘grams run with & from interrupts but not from hangups. Chapter 7 discusses signals in detail, but you needn't know much to be able to handle them in the shell. The shell built-in command trap sets up a sequence of commands to be executed when a signal occurs: trap sequence-of-commands list of signal numbers The sequence-of-commands is a single argument, so it must almost always be quoted. The signal numbers are small integers that identify the signal. For example, 2 is the signal generated by pressing the DEL key, and 1 is generated by hanging up the phone. The signal numbers most often useful to shell pro- grammers are listed in Table 5.4. Table 5.4: Shell s | 0 shell exit (for any reason, including end of file) 1 hangup 2 interrupt (DEL key) | 3. quit (cil-\; causes program to produce core dump) 9 kill (cannot be caught or ignored) 5 terminat Signal Numbers 1 jefault signal generated by ¥i11(1) So to clean up the temporary files in watchwho, a trap call should go just before the loop, to catch hangup, interrupt and terminate: CHAPTER S SHELL PROGRAMMING 1ST trap ‘rm -£ tnew fold; exit 1° 1.2 15 waite + ‘The command sequence that forms the first argument to trap is like @ subrou- tine call that occurs immediately when the signal happens. When it finishes, the program that was running will resume where it was unless the signal killed it, Therefore, the trap command sequence must explicitly invoke exit, or the shell program will continue to execute after the interrupt. Also, the com- mand sequence will be read twice: once when the trap is set and once when it is invoked. Therefore, the command sequence is best protected with single quotes, so variables are evaluated only when the trap routines are executed. It makes no difference in this case, but we will see one later in which it matters. By the way, the -£ option tells rm not to ask questions trap is sometimes useful interactively, most often to prevent a program from being killed by the hangup signal generated by a broken phone connec- tion: $ (trap 2134 8 Jong-running-command) & ‘The null command sequence means “ignore interrupts” in this process and its children, ‘The parentheses cause the trap and command to be run together in 1a background sub-shell; without them, the trap would apply to the login shell, as well as to long-running-conmand. ‘The nohup(1) command is a short shell program to provide this service, Here is the 7th Edition version, in its entirety: $ cat ‘which nohup* trap "" 1 15 AE test -t 267 ‘then echo "Sending output to ‘nohup.out’* exec nice -5 $+ >>nohup.owt 2541 else exec nice -5 $4 261 ra test -t tests whether the standard output is a terminal, to see if the output should be saved. The background program is run with nice to give it a lower priority than interactive programs. (Notice that nohup doesn’t set PATH. Should it?) The exec is just for efficiency; the command would run just as well without it. exec is a shell built-in that replaces the process running this shell 152 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER s by the named program, thereby saving one process — the shell that would nor- mally wait for the program to complete. We could have used exec in several other places, such as at the end of the enhanced cai program when it invokes /usr/bin/cal. By way, the signal 9 is one that can't be caught or ignored: it always kills. From the shell, it is sent as $ Ki22 -9 process id }:i11 -9 is not the default because a process killed that way is given no chance to put its affairs in order before dying. Exercise 5-14. ‘The version of nohup above combines the standard error of the com mand with the standard output. Is this @ good design? If not, how would you separate them cleanly? Exercise 5-15. Look up the times shell builtin, and add a line to your .profite so that when you log off the shell prints out how much CPU time you have used, © Exercise 5-16. Write a program that will find the next available user-id in Zeto/passwa. If you are enthusiastic (and have permission), make it into a command that will add a new user fo the system. What permissions does it need?” How should it hhandle interrupts? © 5.5 Replacing a file: overwrite The sort command has an option ~0 to overwrite a file: § sort filet -o file2 is equivalent to $ sort file) >£i102 If £41e1 and £i1e2 are the same file, redirection with > will truncate the input file before it is sorted. The -0 option, however, works correctly, because the input is sorted and saved in a temporary file before the output file is created. ‘Many other commands could also use a -o option. For example, sed could edit a file in place: sed “e/UNIX/UNIX(TN)/9’ ch2 -0 ch2 Doesn't work this way! It would be impractical to modify all such commands to add the option. Furth- ermore, it would be bad design: it is better to centralize functions, as the shell docs with the > operator. We will provide a program overwrite to do the job. The first design is like this: $ sed "s/UNIX/UNIX(TM)/a’ ch2 | overwrite ch2 The basic implementation is straightforward —~ just save away the input until end of file, then copy the data to the argument file: cuarrer s SIELL PROGRAMMING 153 4 overwrite: copy standard input to output after BOF # version 1. BUG here PATHs/bin: /usr/bin case $# in ” Hs » echo ‘Usage: overwrite file’ 1962; exit 2 newe/tmp/overur.§$ trap ‘rm -£ Sew; exit 1° 12 15 cat >snew # collect the input op fnew $1 # overwrite the input file rm -£ Snew cp is used instead of my so the permissions and owner of the output file aren't changed if it already exists Appealingly simple as this version is, it has a fatal flaw: if the user types DEL during the op, the original input file will be ruined. We must prevent an interrupt from stopping the overwriting of the input fie: # overwrite: copy standard input to output after BOF # version 2. BUG here too in: /usr/bin case $# in 1) i » echo ‘Usage: overwrite file’ 182; exit 2 new=/tmp/overwr 1.5% old=/tmp/overwr2. $$ trap ‘rm -f Sew Sold; exit 17 12 15 cat >snew # collect the input ep $1 $olé # save original file trap 01.215 # we are committed; ignore signals ep snew $1 # overwrite the input file xm -£ $new Sold If a DEL happens before the original file is touched, then the temporary files are removed and the file is left alone. After the backup is made, signals are ignored so the last cp won't be interrupted — once the cp starts, overwrite is committed to changing the original file 154 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER There is still a subtle problem. Consider: $ ced “s/UNIX/UNIX(TM)g” precious | overwrite precious command garbled: #/UNIX/UNIX(7)g $ Is -1 precious cewerw-rw= 1 you 0 oct 1 09:02 precious #85" ‘ If the program providing input to overwrite gets an error, its output will be empty and overwrite will dutifully and reliably destroy the argument file. ‘A number of solutions are possible. overwrite could ask for confirma- tion before replacing the file, but making overwrite interactive would negate much of its merit. overwrite could check that its input is non-empty (by test -z), but that is ugly and not right, either: some output might be gen- erated before an error is detected. The best solution is to run the data-generating program under overwrite’s control so its exit status can be checked. This is against tradi- tion and intuition — in a pipeline, overwrite would normally go at the end. But to work properly it must go first. overwrite produces nothing on its standard output, however, so no generality is lost, And its syntax isn’t unheard of: time, nice and nohup are all commands that take another com- ‘mand as arguments. Here is the safe version: # overwrite: copy standard input to output after EOF # final version opath=$ParH case $# in 011) echo ‘Usage: overwrite file cmd (args]’ 1-62; exit 2 filess1; shift news/tmp/overwr1.$$; olds/tmp/overwr2.$% trap ‘rm -f $new gold; exit 171215 # clean up files 4€ PATH=Sopath "$8" >snew # collect input then cp S€ile Sold # save original file trap '' 1215 # we are committed: ignore signals cp $new eile else echo “overwrite: $1 failed, $file unchanged" 1>52 exit 1 ron mm -£ $new Sold ‘CHAPTER 5 SHELL PROGRAMMING — 155 ‘The shell built-in command shift moves the entire argument list one posi- tion to the left: $2 becomes $1, $3 becomes $2, etc. "$@" provides all the arguments (after the shift), like $+, but uninterpreted; we'll come back to it in Section 5.7. Notice that PATH is restored to run the user's command; if it weren't, com- mands that were not in /bin or /usr/bin would be inaccessible to overwrite. overwrite now works (if somewhat clumsily): $ cat notice UNIX is a Trademark of Bell Laboratories $ overwrite notice sed ’s/UNIXKUNIX(TM)/g’ notice connand garbled: 8/UNTXUNIX(TM)/g overwrite: sed failed, notice unchanged 8 cat notice UNIX is a Trademark of Bell Laboratories Unchanged $ overwrite notice sed ’s/UNIX/UNIX(TM)/q’ notice $ cat notice UNIX(TH) is a Trademark of Bell Laboratories s Using sed to replace all occurrences of one word with another is a common thing to do. With overwrite in hand, a shell file to automate the task is easy: $ cat replace # replace: replace stri in files with str2, in place path in: /uer/bin case §¢ in 01112) echo “Usage: replace str? str? files’ 18: exit 1 Lefte"s1; righte"s2"; shirt; shite for i a0 overwrite $i sed "@Sleft@sright®s” $i done $ cat footnote UNIX is not an acronym $ replace UNIX Unix footnote $ cat footnote unix is not an acronym s (Recall that if the list on a for statement is empty, it defaults to $4.) We used @ instead of / to delimit the substitute command, since @ is somewhat less likely to conflict with an input string. 156 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 5 replace sels PATH to /bin:/usr/bin, excluding $HOME/bin. This means that overwrite must be in /usr/bin for replace to work. We made this assumption for simplicity; if you can’t install overwrite in “usr/bin, you will have to put $HOME/bin in PATH inside replace, or give overwrite’s pathname explicitly. From now on, we will assume that the commands we are writing reside in usr/bin; they are meant (0 Exercise 5-17. Why doesn't overwrite use signal code 0 in the trap so the files are removed when it exits? Hint: Try typing DEL while running the following program: trap “echo exiting sleep 10 exit 02 Exercise 5-18. Add an option -v to replace to print all cha Strong hint: 8/SLeft/$right/gSvflag. © lines on /dew/tty Exercise strings. © 19, Fix replace so it works regardless of the characters in the substitution Exercise 5-20, Can replace be used to change the variable { to index everywhere in 44 program? How could you change things to make this work? Exercise 5-21. Is replace convenient and powerful enough to belong in /usr/bin? Is it preferable to simply typing the correct sed commands when needed? Why or why not? © Exercise 5-22. (Hard) doesn’t work, Explain why not, and fix it. Hint: see eval in sh(1). How does your solution affect the interpretation of metacharacters in the command? 5.6 zap: killing processes by name ‘The Ki22 command only terminates processes specified by process-id. When a specific background process needs to be killed, you must usually run ‘ps to find the process-id and then laboriously re-type it as an argument to Ki11. But it’s silly to have one program print a number that you immediately transcribe manually to another. Why not write a program, say zap, to auto. mate the job? One reason is that killing processes is dangerous, and care must be taken (0 kill the right processes. A safeguard is always to run zap interactively, and use pick to select the victims. : A quick reminder about pick: it prints each of its arguments in turn and asks the user for a response; if the response is y, the argument is printed. (pick is the subject of the next section.) zap uses pick to verify that the processes chosen by name are the ones the user wants to kill: (CHAPTER S SHELL PROGRAMMING — 157 $ cat zap # zap pattern: kill all processes matching pattern # BUG in thie version case $# in °) echo ‘Usage: zap pattern’ 1>62; exit 1 will ‘pick \*ps ag ! grep "Se"\* { awk ‘(print $1)/* $ Note the nested backquotes, protected by backslashes. The awk program selects the process-id from the ps output selected by the pick: $ steep 1000 & 22126 3 ps -ag PID PRY TIME CMD 22126 0 0:00 sleep 1000 $ zap sleep 221267 or @ What's going on? 5 The problem is that the output of ps is being broken into words, which are seen by pick as individual arguments rather than being processed 2 line at a time. The shell’s normal behavior is to break strings into arguments at blank/non-blank boundaries, as in for iin 12345 In this program we must control the shell’s division of strings into arguments, so that only newlines separate adjacent “words.”” The shell variable TFS (internal field separator) is a string of characters that separate words in argument lists such as backquotes and for statements Normally, IFS contains a blank, a tab and a newline, but we can change it to anything useful, such as just a newline: 158 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER $ echo ‘echo $#” >narge $ cx narge 8 who you ttyO oct 1:05:59 pow tty2 oct 114 $ nargs ‘who* 10 Ten blank and newline-separated fields $ 1FS=" . Justa nevsline $ nargs ‘who* 2 Two lines, two fields 5 With ZFS set to newline, zap works fine: 5 cat zap # zap pat: kill all proces # final vereion es matching pat # just a newline ) echo “Usage: zap [-2] pattern’ 1262; exit 11 -#) siGes1; shice echo’ PID TTY TIME CMD Kill $816 ‘pick \*ps ag | egrep "$e"\* I awk ‘(print $1)/° $ pe -ag PID TTY TIME CMD 22126 0 0:00 sleep 1000 8 zap sleep PID TTY TIME CMD 22126 0 0:00 sleep 10007 » 23104 0 0:02 egrep sleep? a s We added a couple of wrinkles: an optional argument to specify the signal (note that SIG will be undefined, and therefore treated as a null string if the argument is not supplied) and the use of egrep instead of grep to permit more complicated patterns such as “sleepidate’. An initial echo prints ‘out the column headers for the ps output You might wonder why this command is called zap instead of just ki11. ‘The main reason is that, unlike our cal example, we aren’t really providing a new ki11 command: zap is necessarily interactive, for one thing — and we want {0 retain 4411 for the real one. zap is also annoyingly slow — the CHAPTER s SHELL PROGRAMMING — 159 ‘overhead of all the extra programs is appreciable, although ps (which must be run anyway) is the most expensive. In the next chapter we will provide a more efficient implementation Exercise 5.23. Modify zap to print out the ps header from the pipeline so that itis insensitive to changes in the format of ps output. How much does this complicate the program? 0 5.7 The pick command: blanks vs. arguments We've encountered most of what we need to write a pick command in the shell. The only new thing needed is a mechanism to read the user’s input ‘The shell built-in read reads one line of text from the standard input and assigns the text (without the newline) as the value of the named variable: $ read greeting hello, world Type new value for greeting $ echo sgreeting hello, world s The most common use of read is in .profite to set up the environment when logging in, primarily to set shell variables like TERM. read can only read from the standard input; it can't even be redirected. None of the shell built-in commands (as opposed to the control flow primitives like for) can be redirected with > or < $ read greeting /dev/ety read respons: case $response in ye) echo $4 5 qt) break done 61° 40 we: case $i in not founa’) 55 +/.news_time) break 5 " set x*1s -1 $i ‘echo Si: (83) $5 $6 97 cat $ gone Touch SHOME/.news time The extra newlines in the header separate the news items as they are printed. The first value of IPS is just a newline, so the not found message (if any) from the first 1s is treated as a single argument. The second assignment (0 IFS resets it to a blank, so the output of the second 18 is split into multiple arguments. Exercise 5-27. Add an option -n (notify) to news to report but not print the news items, and not touch .news_time. This might be placed in your .profile. 0 CHAPTER S SHELL PROGRAMMING 165 Exercise 5-28, Compare our design and implementation of mews to the similar com mand on your system. 5 5.9 get and put: tracking file changes In this section, the last of a long chapter, we will show a larger, more com- plicated example that illustrates cooperation of the shell with awk and sed. ‘A program evolves as bugs are fixed and features are added. It is some- times convenient to keep track of these versions, especially if people take the program to other machines — they will come back and ask “What has changed since we got our version?” or “How did you fix the such-and-such bug?” Also, always maintaining backup copies makes it safer to try out ideas: if something doesn’t work out, it’s painless to revert to the original program. One solution is to keep copies of all the versions around, but that is diff- cult to organize and expensive in disc space. Instead, we will capitalize on the likelihood that successive versions have large portions in common, which need to be stored only once. The diff -e command $ dire -e old new generates a list of ed commands that will convert 14 into new. It is there- fore possible to keep all the versions of a file in a single (different) file by maintaining one complete version and the set of editing commands to convert it into any other version There are two obvious organizations: keep the newest version intact and have editing commands go backwards in time, or keep the oldest version and have editing commands go forwards. Although the latter is slightly easier to program, the former is faster if there are many versions, because we are almost always interested in recent versions. We chose the former organization. In a single file, which we'll call the his- tory file, there is the current version followed by sets of editing commands that convert each version into the previous (i.e., next older) one. Each set of edit- ing commands begins with a line that looks like 208 person date summary ‘The summary is a single line, provided by person, that describes the change. ‘There are two commands to maintain versions: get extracts a version from the history file, and put enters a new version into the history file after asking for a one-line summary of the changes. Before showing the implementation, here is an example to show how get and put work and how the history file is maintained: 166 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 5 $ echo a line of text >junk 8 put junk Summary: make a new file Type the description. get: no file junk.# History doesn’t exist put: creating junk.H so put creates it S cat junk.# a line of text @eo you Sat Oct 1 13:31:03 EDT 1983 make a new file $ echo another line >>junk $ put junk Summary: one line added Scat junk. a Line of text another line ee you Sat Oct 1 13:32:26 EDT 1983 one line added 2a 968 you Sat Oct 1 13:31:03 EDT 1983 make a new file ‘ The “editing commands” consist of the single line 24, which deletes line 2 of the file, turning the new version into the original, $m junk $ get junk ‘Most recent version $ cat junk a Line of text another Line $ get -1 junk $ cat junk Newest-but-one version fa Line of text $ get junk Most recent again $ replace another ’a different’ junk — Change it $ put junk Summary: second line changed $ cat junk.# a Line of text a digferent Line @00 you Sat Oct 1 13:34:07 EDT 1983 second line changed 20 another line 8 you Sat Oct 1 13:32:28 EDT 1983 one line added 2a eee you Sat Oct 1 13:31:03 EDT 1983 make a new file s ‘The editing commands run top to bottom throughout the history file to extract, the desired version: the first set converts the newest to the second newest, the next converts that to the third newest, ete. Therefore, we are actually convert- ing the new file into the old one a version at a time when running ed. CHAPTER 5 SHELL PROGRAMMING 167 There will clearly be trouble if the file we are modifying contains lines beginning with a triple at-sign, and the BUGS section of ai ££(1) warns about lines that contain only a period. We chose @8@ to mark the editing commands because it’s an unlikely sequence for normal text. ‘Although it might be instructive to show how the get and put commands evolved, they are relatively long and showing their various forms would require too much discussion. We will therefore show you only their finished forms, put is simpler: # pa install file into history PATH=/bin: /usx/bin case $# in 1) nrst=$1-4 35 » echo ‘Usage: put file’ 162; exit 155 Ae test | -r $1 ‘then echo “put: can’t open $1" 1962 exit 1 £8 trap ‘xm -f /tmp/put.(ab]$$; exit 1° 12 15 echo -n “Summary: * read Summary if get -o /tmp/put.ass $1 # previous version then # merge pieces ep $1 /tmp/put.bss # current version echo "eee ‘getnane’ ‘date’ ssummary” >>/tmp/put bss GLEE -e $1 /tmp/put ass >>/tmp/put.Dss # latest diffs sed -n '/°@80/, Sp’ >/tmp/put.d$$ # old diffs overwrite $HIS? cat /tmp/put.b$$ # put it back else make a new one echo "put: creating SHIST” cp 81 SHIsT echo "@8e ‘getname’ ‘date’ ssunmary" >>SHIST fa zm -£ /tmp/put. [ab] $8 After reading the one-line summary, put calls get to extract the previous ver sion of the file from the history file. The -o option to get specifies an alter- nate output file, If get couldn't find the history file, it returns an error status, and put creates a new history file. If the history file does exist, the then clause creates the new history in a temporary file from, in order, the newest version, the @@@ line, the editor commands to convert from the newest version to the previous, and the old editor commands and @@@ lines. Finally, the tem- Porary file is copied onto the history file using overwrite, 168 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER get is more complicated than put, mostly because it has options. # get: extract file from history Par bin: /usr/bin VERSO! while test "$1" ao. case "$1" in i) INPUT=s2; shite =o) OvrPUT=$2; shift i; =[0-9}) VERSTON=#1 35 +) echo "get: Unknown argument $i" 1962; exit 1 +) cage "SourPUT" in ) oureur=s1 ; es) INPUT=S1. 55 shige done ourpur-s{ourPur?*usage: get (-o outfile] [-i file.) file") INPU«$ (INPUT-SOUTPUT-#) test -r SINPUT |! ( echo "get: no file SINPUT™ 196: trap ‘rm -f /tmp/get.[ab}$$; exit 17 12 15 # split into current version and editing commands ed 0 S& count <= - ‘VERSION’ END ( print "Sd"; print "w", "/soUrPUT’" } + 1 ed ~ /tmp/get.ass en -£ /tmp/get.[ab]$8 exit 15) The options are fairly ordinary. -i and -o specify alternate input and output ={0-9} selects a particular version: 0 is the newest version (the default), ~1 the newest-but-one, etc, The loop over arguments is a while with a test and a shift, rather than a for, because some of the options (~i, ~0) con- sume another argument and must therefore shift it out, and €or loops and shifts do not cooperate properly if the shift is inside the for. The ed option *~" turns off the character count that normally accompanies reading or writing a file The line test -r SINPUT 1! { echo "get: no file SINPUT" 1>62; exit 1; ) is equivalent to ‘CHAPTER 5 SHELL PROGRAMMING — 169) Sf test | -r SINPUT then echo "get: no file SINPUT" 1962 exit 1 ray (which is the form we used in put) but is shorter to write and clearer to pro- grammers who are familiar with the 11 operator. Commands between { and } are executed in the current shell, not a sub-shell; this is necessary here so the exit will exit from get and not just a sub-shell. The characters { and } are like do and done — they have special meaning only if they follow a semi- colon, newline or other command terminator Finally, we come to the code in get that does the work. First, sed breaks the history file into two pieces: the most recent version and the set of edits ‘The awke program then processes the editing commands. @8@ lines are counted (but not printed), and as long as the count is not greater than the desired ver- sion, the editing commands are passed through (recall that the default awk action is to print the input fine). Two ed commands are added after those from the history file: $4 deletes the single @@@ line that sed left on the current version, and a w command writes the file to its final location overwrite is unnecessary here because get changes only the version of the file, not the precious history file Exercise 5.29, Write a command version that does two things 8 version -5 Fite reports the summary, modification date and person making the modification of the selected version in the history fle # version sep 20 tile reports which version number was current on September 20, This would typically be used int § get ‘version sep 20 tite (version can echo the history filename for convenience.) © Exercise 5-30, Modify get and put so they manipulate the history file in a separate directory, rather than clutering up the working directory with # files. 1 Exercise 5-31. Not all versions of a file are worth remembering once things settle down. How can you arrange to dolete versions from the middle of the history fle? $.10. A look back When you're faced with writing a new program, there's a natural tendency to start thinking immediately about how to write it in your favorite program- ‘ming language. In our case, that language is most often the shell. Although it has some unusual syntax, the shell is an excellent programming 170. THE UNIX PROGRAMM ENVIRONMENT CHAPTER S language. It is certainly high-level; its operators are whole programs. Since it is interactive, programs can be developed interactively, and refined in small steps until they “work.” After that, if they are intended for more than per~ sonal use, they can be polished and hardened for a wider user population. In those infrequent cases where a shell program turns out to be too inefficient, some or all of it can be rewritten in C, but with the design already proven and 2 working implementation in hand. (We'll follow this path « couple of times in the next chapter.) This general approach is characteristic of the UNIX programming environ- ‘ment — build on what others have done instead of starting over from nothing: start with something small and let it evolve; use the tools to experiment with new ideas. In this chapter, we've presented many examples that are easy to do with existing programs and the shell. Sometimes it’s enough merely to rearrange arguments; that was the case with cal, Sometimes the shell provides a loop over a set of filenames or through a sequence of command executions, as in watchfor and checkmail. More complicated examples are still less work than they would be in C; for instance, our 20-line shell version of news replaces 350-line [sic] version written in C. But it’s not enough to have a programmable command language. Nor is it ‘enough to have a lot of programs, What matters is that all of the components work together. They share conventions about how information is represented and communicated, Each is designed to focus on one job and do it well. The shell then serves to bind them together, easily and efficiently, whenever you have a new idea. This cooperation is why the UNIX programming environment is so productive, History and bibliographic notes The idea for get and put comes from the Source Code Control System (SCS) originated by Mare Rochkind (“The source code control system," IEEE Trans. on Software Engineering, 1975). SCS is far more powerful and flexible than our simple programs; it is meant for maintenance of large programs in a production environment. ‘The basis of SCS is the same Giff program, how- ever. Harter 6; PROGRAMMING WITH STANDARD 1/0 So far we have used existing tools to build new ones, but we are at the limit of what can be reasonably done with the shell, sed and awk. In this chapter we are going to write some simple programs in the C programming language. ‘The basic philosophy of making things that work together will continue to dominate the discussion and the design of the programs — we want to create tools that others can use and build on. In each case, we will also try t0 show a sensible implementation strategy: start with the bare minimum that does some- thing useful, then add features and options (only) if the need arises. There are good reasons for writing new programs from scratch. It may be that the problem at hand just can't be solved with existing programs. This is often true when the program must deal with non-text files, for example — the majority of the programs we have shown so far really work well only on tex tual information. Or it may be too difficult to achieve adequate robustness ot efficiency with just the shell and other general-purpose tools. In such cases, a shell version may be good for honing the definition and user interface of a pro- gram. (And if it works well enough, there’s no point re-doing it.) The zap program from the last chapter is a good example: it took only a few minutes to write the first version in the shell, and the final version has an adequate user interface, but it’s 100 slow. We will be writing in C because it is the standard language of UNIX systems — the kernel and all user programs are written in C — and, realistically, no other language is nearly as well supported. We will assume that you know C, at least well enough (o read along. If not, read The C Programming Language, by B. W. Kernighan and D. M. Ritchie (Prentice-Hall, 1978). We will also be using the “standard UO library,” a collection of routines that provide efficient and portable /O and system services for C programs. The standard 1/0 library is available on many non-UNIX systems that support C, so programs that confine their system interactions to its facilities can easily be transported. The examples we have chosen for this chapter have a common property: they are small tools that we use regularly, but that were not part of the 7th Edition. If your system has similar programs, you may find it enlightening to m 172. THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 6 compare designs. And if they are new to you, you may find them as useful as we have. In any case, they should help to make the point that no system is perfect, and that often it is quite easy to improve things and to cover up defects with modest effort. 6.1 Standard input and output: vis Many programs read only one input and write one output; for such pro- grams, VO that uses only standard input and standard output may be entirely adequate, and it is almost always enough to get started Let us illustrate with a program called vis that copies its standard input to its standard output, except that it makes all non-printing characters visible by printing them as \nnn, where mun is the octal value of the character. vis is invaluable for detecting strange or unwanted characters that may have crept into files. For instance, vis will print each backspace as \010, which is the ‘octal value of the backspace character: $ cat x abe $ vis «x abe\010\010\0 10. s To scan multiple files with this rudimentary version of vis, you can use cat to collect the files: $ cat filet file2 ... I vis $ cat filet filez ... | vis | grep “\\" and thus avoid learning how to access files from a program. By the way, it might seem that you could do this job with sed, since the "1° command displays non-printable charecters in an understandable form: $ sed -n ix abows s ‘The seq output is probably clearer than that from vis. But sed was never meant for non-text files: $ sed -n 1 /usr/you/bin s Nothing a al (This was on a POP-I; on one VAX system, sed aborted, probably because the input looks like a very long line of text.) So sed is inadequate, and we are forced to write a new program. The simplest input and output routines are called getchar and putchar. Each call to getchar gets the next character from the standard input, which CHAPTER 6 PROGRAMMING WITH STANDARD LO 173 may be a file or a pipe or the terminal (the default) — the program doesn’t know which. Similarly, putchar(c) puts the character ¢ on the standard output, which is also by default the terminal. ‘The function print£(3) does output format conversion. Calls to print and putchar may be interleaved in any order; the output will appear in the order of the calls. There is a corresponding function scan£(3) for input for- mat conversion; it will read the standard input and break it up into strings, numbers, ete., as desired. Calls to seanf and getchar may also be inter~ mixed. Here is the first version of vis: /+ vis: make funny characters visible (version 1) +/ Winelude Hinclude main() c int es while ((c = getchar()) I= EOF) Af (isascii(e) 68 (isprint(c) putchar(e) else print£("\\%030", ©); exit(o)s ei) , getchar returns the next byte from the input, or the value EOP when it encounters the end of file (or an error). By the way, EOF is not a byte from the file; recall the discussion of end of file in Chapter 2. ‘The value of EOF is guaranteed to be different from any value that occurs in a single byte so it can bbe distinguished from real data; e is declared int, not char, so that it is big enough to hold the EOF value. The line #include should appear at the beginning of each source file. It causes the C compiler to read a header file (/asr/include/stdio.n) of standard routines and sym- bols that includes the definition of BOF, We will use as a short- hhand for the full filename in the text. The file is another header file in /asx/ine1ude that defines machine-independent tests for determining the properties of characters. We used isascii and isprint here, to determine whether the input character is ASCII (i.e., value less than 0200) and printable; other tests are listed in Table 6.1, Notice that newline, tab and blank are not “printable” by the definitions in . 174 THE UNIX PROGRAMMING EXVIRONMENT CHAPTER 6 ‘The call to exit at the end of vis is not necessary to make the program work properly, but it ensures that any caller of the program will see a normal exit status (conventionally zero) from the program when it completes. An alternate way to return status is to leave main with return 0; the return value from main is the program's exit status. If there is no explicit return or exit, the exit status is unpredictable To compile a C program, put the source in a file whose name ends in .c, such as vise, compile it with ce, then run the result, which the compiler leaves in a file called a-out ('a’ is for assembler): 8 co vis.c $ aout hello worldetl-g helio worl4\o07 ema s Normally you would rename a.out once to do it directly: 's working, or use the cc option -o $ co -0 vis vis.c Ouput in vis, not a.out Exercise 6-1, We decided that tabs should be left alone, rather than made visible as \017 of » oF \t, since our main use of vie is looking for truly anomalous characters, ‘An alternate design isto identify every character of output unambiguously — tabs, non: ieraphics, blanks at line ends, ete. Modify vis so that characters like tab, backslash, backspace, formfeed, etc., are printed in their conventional C representations \t, \\. Xp, ME, ete., and so that blanks at the ends of lines are marked. Can you do this ‘unambiguously? Compare your design with Exercise 62. Modify vis so that it folds long lines at some reasonable length. How does this interact with the unambiguous output required in the previous exercise? 6.2 Program arguments: vis version 2 When a C program is executed, the command-line arguments are made available to the function main as a count arge and an array argv of pointers to character strings that contain the arguments. By convention, argv[0) is the command name itself, so arge is always greater than 0; the “useful” argu- ments are argv[ 1) ... argv{arge-1]. Recall that redirection with < and > is done by the shell, not by individual programs, so redirection has no effect on the number of arguments seen by the program. To illustrate argument handling, let’s modify vis by adding an optional argument: vis -s strips out any non-printing characters rather than displaying them prominently. This option is handy for cleaning up files from other sys- tems, for example those that use CRLF (carriage return and line feed) instead CHAPTER 6 PROGRAMMING WITH STANDARD 10175 Table 6.1: Character Test Macros ] isalpha(c) alphabetic: a~z A~Z Asupper(c) upper case: A-Z islower(c) lower case: a-z isdigit(c) digit: 0-9 isxdigit(c) hexadecimal digit: 0-9 a-f A-F isainum(c) alphabetic or digit isspace(c) blank, tab, newline, vertical tab, formfeed, return ispunct(c) not alphanumeric or control or space isprint(c) printable: any graphic isentri(c) control character: 0 <= ¢ < 040 t! ¢ Asascii(c) _ASCll character: 0 <= ¢ <= 0177 of newline to terminate lines. /* vis: make funny characters visible (version 2) +/ #inelude #include main(arge, argv) int argo: char targvi}i int c, strip if (arge > 1 64 stromp(argvi1], "-s") ** 0) strip = 1 white ((c = getohar()) 1 Af (isascii(c) 85 Gsprint(c) 1! cxe/\n’ tf ose’\t! Hf ese? 7) putchar(c); else if (Isteip) EIntE(*\\KO30", 6); exit(0); BOF) ) argv is a pointer to an array whose individual elements are pointers to arrays of characters; each array is terminated by the ASCII character NUL (’\07), 80 it can be treated as a string. This version of vis starts by checking to see if there is an argument and if it is -s, (Invalid arguments are ignored.) The function stremp(3) compares two strings, returning zero if they are the same Table 6.2 lists a set of string handling and general utility functions, of which stremp is one. It's usually best to use these functions instead of writ- ing your own, since they are standard, they are debugged, and they are often 176 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 6 faster than what you can write yourself because they have been optimized for particular machines (sometimes by being written in assembly language) Exercise 63. Change the ~@ argument so that vis sn will print only strings of m oF ‘more consecutive printable characters, discarding. non-printing characters and. short sequences of printable ones. This is valuable for isolating the text parts of non-text files such as executable programs. Some versions of the system provide a strings program that does this. Is it better to have a separate program of an argument to via? © Exercise 6-4. The availablity of the C source code is one of the strengths of the UNIX system — the code illustrates elegant solutions to many programming problems. Com ment on the tradeoff between readability of the C source and the occasional optimiza- tions obtained from rewriting in assembly language. Table 6.2: Standard Siring Functions streat(s,t) append string t to string s; return s strncat(s,t,n) append at most n characters of t to s stropy(s,t) copy t to s; return s strnepy(s,t,n) copy exactly n characters; null pad if necessary stronp(e,t) compare # and t, return <0, 0, 0 for <, =e, > Strnenp(s,tsn) compare at most n characters strients) return length of © Strchr(s,c) return pointer to firs ¢ in 8, NULL if none Sterchr(s,c) _reluin pointer to last c in , NULL if none ‘These are index and index on older systems atotts) retuen integer valu of @ |acoete) retuen floating point vale of; | needs declaration double atot () | ma2ioc(n) retuen pointer ton bytes of memory. NULL if can't Jeatioctam) ——felurm pointer to nXn bytes, set Lo 0, NULL i cat malloc and calloc return char + free memory allocated by malloc or calloc free(p) 6.3 File access: vis version 3 The first two versions of vis read the standard input and write the stan- dard output, which are both inherited from the shell. The next step is to modify vis to access files by their names, so that $ vis filet file2 ... will scan the named files instead of the standard input. If there are no filename arguments, though, we still want vis (o read its standard input. ‘The question is how to arrange for the files to be read — that is, how to connect the filenames to the 1/O statements that actually read the data. ‘The rules are simple. Before it can be read or written a file must be opened CHAPTER 6 PROGRAMMING WITH STANDARD 10.177 by the standard library function fopen. fopen takes a filename (like temp or /etc/passwa), does some housekeeping and negotiation with the kernel, and returns an internal name (o be used in subsequent operations on the file This internal name is actually a pointer, called a file pointer, to a structure that contains information about the file, such as the location of a buffer, the current character position in the buffer, whether the file is being read or writ- ten, and the like. One of the definitions obtained by including is for a structure called FILE. The declaration for a file pointer is FILE +£p3 ‘This says that £p is a pointer toa FILE. fopen returns a pointer to a FILE: there is a type declaration for €open in . ‘The actual call to open in a program is char *name, nod fp = fopen(name, mode); ‘The first argument of fopen is the name of the file, as a character string. The second argument, also a character string, indicates how you intend (0 use the file; the legal modes are read ("x"), write ("w"), or append ( Ifa file that you open for writing or appending does not exist, if possible. Opening an existing file for writing causes the old contents to be discarded. Trying to read a file that does not exist is an error, as is trying to read or write a file when you don’t have permission. If there is any ctror, fopen will return the invalid pointer value NULL (which is defined, usually as (char *)0, in ). The next thing needed is a way to read or write the file once it is open. There are several possibilities, of which gete and pute are the simplest gete gets the next character from a file. © = gete(tp) places in © the next character from the file referred to by fp; it returns EOF when it reaches end of file. pute is analogous to getc: pate(c, fp) puts the character ¢ on the file £p and returns ¢. getc and pute return BOF if an error occurs. When a program is started, three files are open already, and file pointers are provided for them. These files are the standard input, the standard output, and the standard error output; the corresponding. file pointers are called stdin, stdout, and stderr. These file pointers are declared in ; they may be used anywhere an object of type FILE + can be, ‘They are constants, however, not variables, so you can’t assign to them. getchar() is the same as geto( stdin) and putchar(c) is the same as 178 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 6 pute(c,stdout). In fact, all four of these “functions” are defined as mac- ros in , since they run faster by avoiding the overhead of a fune- tion call for each character. See Table 6.3 for some other definitions in With some of the preliminaries out of the way, we can now write the third version of vis. If there are command-line arguments, they are processed in order. If there are no arguments, the standard input is processed. Js vis: make funny characters visible (version 3) +/ #inelude include int strip = 05 /* 1 > discard special characters +/ main(argc, argv) int arges char sargvt): int 4} FILE +6; while (arge > 186 argv(1][0} == ‘-7) ( switch (argv(11(11) { case's’: /e -s: atrip funny chare +/ strip = 1; breaks default: fprintf(stderr, "Xs: unknown arg Xs\n", argv[0], argvith); exit(1)s , arge- argyert i if (arge == 1) vis(stdin); else for (i = 4) i < argo; ier) if ((fpefopen(argvlil, "r7)) == NULL) ( fprintf(stderr, "Xe: can’t open %s\n", ‘argv(0}, argyli}); exit(1); } else { ‘vislep); Eclose( tp); , exit (0) ? ‘This code relies on the convention that optional arguments come first. After CHAPTER 6 PROGRAMMING WITH STANDARD LO 179 Table 6.3: Some Definitions stdin standard input stdout standard output stderr standard error EOF end of file; normally ~1 NULL invalid pointer; normally 0 FILE used for declaring file pointers BUFSIZ normal VO buffer size (often 512 or 1024) getc(fp) return one character from stream Ep getchar() —_gete(stdin) pute(c,fp) put character ¢ on stream fp |patchar(c) pute(c,staout) feof (fp) ro when end of file on stream £p | fexzor(tp) ro when any error on stream £p fileno(fp) _ file descriptor for stream £p;_see Chapter 7 cach optional argument is processed, argc and argv are adjusted so the rest of the program is independent of the presence of that argument. Even though ‘vis only recognizes a single option, we wrote the code as a loop to show one way to organize argument processing. In Chapter 1 we remarked on the disorderly way that UNIX programs handle optional arguments. One reason, aside from a taste for anarchy, is that it's obviously easy to write code to han- dle argument parsing for any variation. The function getopt(3) found on some systems is am attempt (0 rationalize the situation; you might investigate it before writing your own. The routine vis pr ts a single file: vis(£p) /+ make chare visible in PILE «fp +/ FILE +f; ‘ int ©; while ((¢ = gete(fp)) I= BOR) if (isascii(c) 66 (sprint(e) {1 ex="\n" Ht os putehar(e) else if (strip) print£("\\030", ©); Neo tt ene 0) i ‘The function fprinté is identical to printé, except for a file pointer argument that specifies the file to be written. The function £elose breaks the connection between the file pointer and the external name that was established by fopen, freeing the file pointer for 180 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 6 another file. Since there is a limit (about 20) on the number of files that a program may have open simultaneously, it’s best to free files when they are no longer needed. Normally, output produced with any of the standard library functions like printf, pute, etc., is buffered so it can be written in large chunks for efficiency. (The exception is output to a terminal, which is usually written as it is produced, or at least when a newline is printed.) Calling fclose on an output file also forces out any buffered output, felose is also called automatically for each open file when a program calls exit or returns from main, stderr is assigned to a program in the same way that stdin and stdout are. Output written on stderr appears on the user's terminal even if the standard output is redirected. vis writes its diagnostics on stderr instead of stdout so that if one of the files can’t be accessed for some reason, the mes- sage finds its way to the user’s terminal instead of disappearing down a pipe- line or into an output file. (The standard error was invented somewhat after pipes, alter error messages did start disappearing into pipelines.) Somewhat arbitrarily, we decided that vis will quit if it can't open an input file; this is reasonable for a program most often used interactively, and with a single input file. You can argue for the other design as well, however. Exercise 6-5. Write a program printable that prints the name of exch argument file that contains only printable characters; if the file contains any non-printable character, the name is not printed. printable is useful in situations like this: 8 pr ‘printable +* f apr Add the option -v to invert the sense of the test, as in grep. What should printable do if there are no filename arguments? What status should printable return? 6.4 A screens ime printer: p So far we have used cat to examine files. But if a file is long, and if you are connected to your system by a high-speed connection, cat produces the ‘output too fast to be read, even if you are quick with cil-s and cil-q. There clearly should be a program to print a file in small, controllable chunks, but there isn’t a standard one, probably because the original UNIX sys- tem was written in the days of hard-copy (paper) terminals and slow communi- cations tines. So our next example is a program called p that will print a file a screenful at a time, waiting for a response from the user after each screen before continuing to the next. (‘“p” is a nice short name for a program that we use a lot.) As with other programs, p reads either from files named as argu- ‘ments or from its standard input: CHAPTER 6 PROGRAMMING WITH STANDARD UO 181 8 p vis. $ grep ‘#efine’ +.[eh] ! p $ This program is best written in C because it’s easy in C, and hard other- wise; the standard tools are not good at mixing the input from a file or pipe with terminal input The basic, no-frills design is to print the input in small chunks. A suitable chunk size is 22 lines: that’s slightly less than the 24-line screen of most video terminals, and one third of a standard 66-line page. A simple way for p to prompt the user is to not print the last newline of each 22-line chunk. The cursor will thus pause at the right end of the line rather than at the left mar gin, When the user presses RETURN, that will supply the missing newline and thus cause the next line to appear in the proper place. If the user types ctl-d or at the end of a screen, p will exit We will take no special action for long lines. We will also not worry about ‘multiple files: we'll merely skip from one to the next without comment. That way the behavior of $ p filenames. will be the same as $ cat filenames... t p If filenames are needed, they can be added with a for loop like $ for 4 in filenames > do > echo $i: = cat $i > done ! Indeed, there are too many features that we can add to this program. It's better to make a stripped-down version, then let it evolve as experience dic- tates, That way, the features are the ones that people really want, not the ones we thought they would want, ‘The basic structure of p is the same as vis: the main routine cycles through the files, calling a routine print that does the work on each. 182 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 6 /* pi print input in chunks (version 1) */ Hinclude f#aefine PAGESIZE 22 char *progname; /+ program name for error message +/ main(argc, argv) int arge: char sargvt li FILE #£p, *efopen(); progname = argv{0); if (arge ** 1) print (stdin, PAGESIZE); elee for (i = 1; 4 < argo; ive) { £p = efopen(argvlil, "x" print(fp, PAGESIZE); felose(£p); ) exit (0); , ‘The routine efopen encapsulates a very common operation: try to open a file; if it’s not possible, print an error message and exit. To encourage error messages that identify the offending (or offended) program, efopen refers to an external string progname containing the name of the program, which is set BILE sefopen(file, mode) /+ fopen file, die if can’t +/ char #file, «mode; « FILE «£p, *fopen(); extern char «progname} if (fp = fopen(#ile, mode)) return fp; fprintf(stderr, "Xs: can’t open file Ks mode %s\n", Progname, file, mode); exit(1); NULL) , We tried a couple of other designs for efopen before settling on this. One was to have it return after printing the message, with a null pointer indicating failure. This gives the caller the option of continuing or exiting. Another design provided efopen with a third argument specifying whether it should return after failing to open the file. In almost all of our examples, however, CHAPTER 6 PROGRAMMING WITH STANDARD 10 183 there's no point to continuing if a file can't be accessed, so the current version of efopen is best for our use. The real work of the p command is done in print: print(fp, pagesize) /+ print fp in pagesize chunks +/ FILE +£p3 int pagesize; static int lines = 0; /+ number of lines so far +/ char buf{BUFSIZ]; while (fgete(buf, sizeof buf, £p) if (selines © pagesize) fputs(buf, stdout); else ( buf{strien(buf)-1] = ’073 fpute(buf, stdout); #flusn( stdout); teyin(); lines = 0; NULL) > We used BUFSIZ, which is defined in , as the size of the input buffer. fgets(buf,size,fp) fetches the next line of input from £p, up to and including a newline, into bu€, and adds a terminating \0; at most size-1 characters are copied. It returns NULL at end of file. (fgets could be better designed: it returns buf instead of a character count; furthermore it provides no warning if the input line was too long. No characters are lost, but you have to look at buf to see what really happened.) The function strlen returns the length of a string; we use that to knock the trailing newline off the last input line. €puts(buf, fp) writes the string buf on file fp. The call to ££1ush at the end of the page forces out any buf- fered output. ‘The task of reading the response from the user after each page has been printed is delegated (o a routine called ttyin. ttyin can’t read the standard input, since p must work even when its input comes from a file or pipe. To handle this, the program opens the file /4ev/tty, which is the user's terminal regardless of any redirection of standard input. We wrote ttyin to return the first character of the response, but don’t use that feature here. 184 THE UNIX PROGRAMMING ENVIRONMENT ‘CHAPTER 6 ttyin() /* process response fron /dev/tty (version 1) +/ ‘ char buf(BUFSIZ]; FILE vefopen(); static FILE #tty = NULL: ie (tty == NULL) tty = efopen(*/dev/ety" if (fgete(buf, BUPSIZ, tty) exit(0); else /+ ordinary line +/ return buf[01; en) NULL 1! bufto} e= ’q’) > The file pointer devtty is declared static so that it retains its value from one call of ttyin to the next; the file /dev/tty is opened on the first call only There are obviously extra features that could be added to p without much work, but it is worth noting that our first version of this program did just what, is described here: print 22 lines and wait. It was a long time before other things were added, and to this day only a few people use the extra features. ‘One easy extra is to make the number of lines per page a variable pagesize that can be set from the command line: Spm prints in n-line chunks. This requires only adding some familiar code at the beginning of main: /+ pt print input in chunks (version 2) +/ int i, pagesize = PAGESIZE; progname = argv[0l; Af (arge > 168 argv(11L01 pagesize = atoi(argv(1}(1)) arge- argyes ‘ ‘The function atoi converts a character string to an integer. (See atoi(3).) ‘Another addition to p is the ability to escape temporarily at the end of each page to do some other command. By analogy to ed and many other programs, if the user types a line that begins with an exclamation mark, the rest of that line is taken to be a command, and is passed to a shell for execution. This feature is also trivial, since there is a function called system(3) to do the ‘work, but read the caveat below. The modified version of ttyin follows: CHAPTER 6 PROGRAMMING WITH STANDARD 1018S ttyin() /+ process response from /dev/tty (version 2) */ ‘ char buf [BUFSIZ); PILE +efopen(); static FILE sty = NULL; Af (tty == NULL) ety = efopen("/dev/ety", ""); for G3) { Af (fgete(buf BUPSIZ,tty) ©* NULL I! bufto] exit(o); else if (buf[0] == *1") { system(buf+1); /* BUG here */ printe("I\a" d else /« ordinary line +/ return buf[0]; > Unfortunately, this version of ttyin has a subtle, pernicious bug. The com- mand run by system inherits the standard input from p, so if p is reading from a pipe or a file, the command may interfere with its input: $ cat /ete/passwd | p -1 ‘root: 3D.fHRSKOB.38:0:1:5.User:/: led Invoke ed from within p ? 4 reads Zete/passwa 1 is confused, and quits ‘The solution requires knowledge about how UNIX processes are controlled, and wwe will present it in Section 7.4. For now, be aware that the standard system in the library can cause trouble, but that ttyin works correctly if compiled with the version of system in Chapter 7. We have now written two programs, vis and p, that might be considered variants of cat, with some embellishments, So should they all be part of cat, accessible by optional arguments like -v and ~p? The question of whether to write a new program or to add features to an old one arises repeatedly as peo- ple have new ideas. We don’t have a definitive answer, but there are some Drinciples that help to decide. The main principle is that a program should only do one basic job — if it does too many things, it gets bigger, slower, harder to maintain, and harder to use. Indeed, the features often lie unused because people can't remember the options anyway, This suggests that cat and vis should nor be combined. cat just copies its input, unchanged, while vis transforms it. Merging them makes a pro- gram that does two different things. It’s almost as clear with cat and p. cat is meant for fast, efficient copying: p is meant for browsing. And p docs transform its output: every 22nd newline is dropped. ‘Three separate programs 186 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 6 seems to be the proper design Exercise 6-6. Does p act sanely if pagesize is not positive? © Exercise 6-7. What else could be done to p? Evaluate and implement (if appropriate) the ability 10 re-print parts of earlier input. (This is one extra feature that we enjoy.) ‘Add a facility to permit printing less than a screenful of input after each pause. Add a facility to scan forward or backward for a line specified by number or content. © Exercise 6-8. Use the file manipulation capabilites of the exee shell built-in (see h(1)) 10 fix teyin's call © system. © Exercise 69. If you forget to specify an input for p, it sits quietly waiting for input from the terminal. Is it worth detecting this probable error? If so, how? Hint: ssatty(3), o 6.5 An example: pick ‘The version of pick in Chapter 5 was clearly stretching the capabilities of the shell. The C version that follows is somewhat different from the one in Chapter 5. If it has arguments, they are processed as before. But if the single argument ‘~' is specified, pick processes its standard input. ‘Why not just read the standard input if there are no arguments? Consider the second version of the zap command in Section 5.6: kill $516 ‘pick \‘ps -ag | egrep *S4"\* | awk ‘(print $1)’ What happens if the egrep pattern doesn’t match anything? In that case, pick has no arguments and starts (o read its standard input; the zap com- ‘mand fails in a mystifying way. Requiring an explicit argument is an easy way {o disambiguate such situations, and the '~’ convention from cat and other programs indicates how to specify it CHAPTER 6 PROGRAMMING WITH STANDARD 10187 /* pick: offer choice on each argument +/ #include char *progname; /+ program name for error message */ maintargc, argv) int arge; char targvl}i char buf(BUFSIZ]; progname = argv(0]; Af (arge == 2 88 stromp(argyi1},"=") © 0) /+ pick = «/ while (fgets(buf, sizeof buf, stdin) I= NULL) ( buf{strien(buf)-1] = ‘\0"; /+ drop newline +/ pick(buf) } elee for (i= 15 4 < arges i++) piek(argv[i1)s exit (0): > pick(s) /+ offer choice of 8 +/ char +s} ‘ fprintf(stderr, "Ms? *, 8); Af (etyin() == "y") peints("%s\n", #8); ) pick centralizes in one program a facility for interactively selecting argu- ments, This not only provides a useful service, but also reduces the need for interactive” options on other commands. Exercise 6-10. Given pick, is there a need for xm -47 © 6.6 On bugs and debugging If you've ever written @ program before, the notion of a bug will be fami liar. There's no good solution to writing bug-free code except to take care to produce a clean, simple design, to implement it carefully, and to keep it clean as you modify it There are a handful of UNIX tools that will help you to find bugs, though none is really first-rate. To illustrate them, however, we need a bug, and all of the programs in this book are perfect. Therefore we'll create a typical bug. Consider the function pick shown above. Here it is again, this time contain- ing an error. (No fair looking back at the original.) 188 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 6 pickis) /+ offer choice of s +/ char +3; Eprinte("%s? ", 5 Af (ttyin() ==" "y") printe(*%s\n", 8); , If we compile and run it, what happens? 8 ce pick.c -o pick § pick +c Try it Menory fault - core dumped Disaster? ‘ “Memory fault” means that your program tried to reference an area of memory that it was not allowed to. It usually means that a pointer points somewhere wild, “Bus error” is another diagnostic with a similar meaning, often caused by scanning a non-terminated string. “Core dumped" means that the kernel saved the state of your executing program in a file called core in the current directory. You can also force a program to dump core by typing cil-\ if it is running in the foreground, or by the command ki11 ~3 if itis in the background There are two programs for poking around in the corpse, adb and sab. Like most debuggers, they are arcane, complicated, and indispensable. ab is in the 7th Edition; sdb is available on more recent versions of the system. One or the other is sure to be there We have space here only for the absolute minimum use of each: printing « stack trace, that is, the function that was executing when the program died, the function that called it, and so on. The first function named in the stack trace is where the program was when it aborted To get a stack trace with adb, the command is $c: CHAPTER 6 PROGRAMMING WITH STANDARD 10189 $ adb pick core Invoke ad $c Stack trace request ~_strout (0175722,011,0,011200) adjust o filleh 060542 _dopent (0177345 ,0176176,011200) TEprint#(011200,0177345) sop: 017200 Ent. 0177345 arge: ° =pick(0177345) 8: 017345 -main(035,0177234) arge: 035 argv 0177234 01 buf: ° cil Quit s This says that main called pick, which called fprintf, which called _doprnt, which called _strout. Since doprnt isn’t mentioned anywhere in pick.c, our troubles must be somewhere in fprinté or above. (The lines after each subroutine in the traceback show the values of local variables. $c suppresses this information, as does $C itself on some versions of adb.) Before revealing all, lets try the same thing with sab: $ sdb pick core + ‘a.out’ not compiled with - address Oxa64 Routine where program died Stack trace request iseek() Eprints (6154,2147479154) pick(2147479154) main( 30,2147478988,2147479112) sa Quit s The information is formatted differently, but there's a common theme: Eprinté, (The traceback is different because this was run on a different machine — a VAX-11/750 — which has a different implementation of the stan- dard UO library). And sure enough, if we look at the £print£ invocation in the defective version of pick, it is wrong Eprinte("Ks? ", 9); There's no stderr, so the format string "xs? " is being used as a FILE pointer, and of course chaos ensues, We picked this error because 's common, a result of oversight rather than 190 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 6 bad design. It’s also possible to find errors like this, in which a function is called with the wrong arguments, by using the C verifier Lint(1), lint examines C programs for potential errors, portability problems, and dubious constructions. If we run Lint on the whole pick.c file, the error is identi fied $ Lunt pick.c fprintf, arg. 1 used inconsistently "11ib-1c"(69) :: "pick-c"(28) In translation, this says that Eprinté’s first argument is different in the stan- dard library definition from its use in line 28 of our program. That is a strong hint about what's wrong, Lint is a mixed success. It says exactly what's wrong with this program, but also produces a lot of irrelevant messages that we've elided above, and it takes some experience to know what to heed and what to ignore. It’s worth the effort, though, because Lint finds some errors that are almost impossible for people to see. It’s always worth running Lint after a long stretch of edit- ing, making sure that you understand each warning that it gives. 6.7 An example: zap zap, which selectively kills processes, is another program that we presented as a shell file in Chapter 5. The main problem with that version is speed: it creates so many processes that it runs slowly, which is especially undesirable for a program that kills errant processes. Rewriting zap in C will make it fas ter. We are not going to do the whole job, however: we will stil use ps to find the process information. This is mach easier than digging the information cout of the kernel, and it is also portable. zap opens a pipe with ps on the input end, and reads from that instead of from a file. ‘The function popen(3) is analogous to fopen, except that the first argument is a command instead of a filename, There is also a pelose that we don’t need here CHAPTER 6 PROGRAMMING WITH STANDARD YO 191 interactive process killer +/ Finclude #include char *progname; /+ program name for error message +/ char ¢ps = "ps -ag"; /+ system dependent «/ main(arge, argv) int arge: char sargvl}i FILE #fin, epopen(); char buf (BUFSIZ]; int pid prognane = argv(0]; Af ((£in = popen(pa, "x*)) 2= NULL) ( Eprints (stderr, "Xs: can’t run %s\n", progname, ps); exit); : fgets(buf, sizeof buf, fin); / get header line +/ fprinté (stderr, "%s", buf); while (fgets(buf, sizeof buf, fin) != NULL) if (arge == 1/1! strindex(buf, argv[1]) >= 0) ( buf[strien(buf)-1] = '\0’; /+ suppress \n +/ fprintf(stderr, "ie? ", buf 4€ (ttyinQ) == "y") ( sscanf (buf, "Xd", &pid); will (pid, STGRILE); } exit (ods ? We wrote the program to use ps -ag (the option is system dependent), but unless you're the super-user you can kill only your own processes. The first call to gets picks up the header line from ps; it’s an interesting exercise to deduce what happens if you try o kill the “process” corresponding, to that header line. ‘The function sscanf is a member of the scanf(3) family for doing input format conversion. It converts from a string instead of a file. The system call Kill sends the specified signal to the process; signal SIGKILL, defined in , can’t be caught or ignored. You may remember from Chapter 5 that its numeric value is 9, but it’s better practice to use the symbolic con- stants from header files than to sprinkle your programs with magic numbers. If there are no arguments, zap presents each line of the ps output for pos- sible selection. If there is an argument, then zap offers only ps output lines that match it. The function strindex(s1,s2) tests whether the argument 192. THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 6 matches any part of a line of ps output, using strnemp (see Table 6.2), atrindex returns the position in #1 where #2 occurs, or ~1 if it does not strindex(e, t) /+ return index of t in 8, -1 if none +/ char +5, +ty ‘ int i, ny n= strlen(t}; for (i = 0; sli] I= ‘\073 444) if (strnemp(s+i, ty n) return i; return -1; °) , Table 6.4 summarizes the commonly-used functions from the standard VO library. Exercise 6-11. Modify zap so that any number of arguments can be supplied. As writ, ten, zap will normally echo the line corresponding to itself as one of the choices. Should it? If not, modify the program accordingly. Hint: getpia(2). 0 Exercise 6-12, Build an fgrep(1) around strindex, Compare running times for complicated searches, say ten words in a document. Why does fgrep run faster? © 6.8 An interactive file comparison program: sdift ‘A common problem is to have two versions of a file, somewhat different, each containing part of a desired file; this often results when changes are made independently by two different people. aif¢ will tell you how the files differ, but it’s of no direct help if you want to select some parts of the first file and some of the second, In this section, we will write a program idiff (“interactive aif£") that presents each chunk of diff output and offers the user the option of choosing the “from” part, choosing the “to” part, or editing the parts. iaiff produces the selected pieces in the proper order, in a file called 14iff.out. That is, given these two files: filet file ‘This is ‘This is a test not a test of of your our skill ability. and comprehension. i£E produces CHAPTER 6 PROGRAMMING WITH STANDARD 10 193. ‘Table 6.4: Useful Standard 130 Functions mem ma Mal fe open(s,mode) open file s; mode "r", for read, write, append (returns NULL for error) cxgete( fp) get character; getchar() is getc(atdin) pute(c,fp) Put character; patchar(c) is pute(c, stdout) ungete(c, fp) put character back on input file £p; at most | char ccan be pushed back at one time scanf(fmt,a1,...) read characters from stdin into a1,... according to fmt. Each ag must be a pointer. Returns EOF or number of fields converted fecanf(£p,...) read from file £p secant (s,...) read from string § printf (fmt,a1,...) format at,... according to fmt, print on stdout. Eprint£(£p,...) print ... on file £p sprintf(s, print... into string s fgets(s,n, fp) read at most n characters into s from £p. Returns NULL at end of file Eputs(s,fp) print string s on file £p f£lush(fp) flush any buffered output on file £p felose( fp) close file £p Ep=popen(s,mode) open pipe to command s. See fopen. pelose( fp) close pipe £p system(s) fun command 8 and wait for completion $ dice filet Fi102 202 ea test > not a test 4,604,5 < your © skill < and comprehension. > ability. 8 A dialog with 1A4 ££ might look like this: 194 THE UN PROGRAMMING ENVIRONMENT CHAPTER 6 $ idife filet fi1e2 202 The frst difference > not a test a> User chooses second (>) version 4,604,5 The second difference < your < geil and comprehension. > abitity p< User chooses first (<) version Adiff output in file iaife.out $ cat idiff.out ‘Output putin this file thie is not a test, of your kill and comprehension. ’ If the response e is given instead of < or >, £44 ££ invokes ed with the two groups of lines already read in. If the second response had been e, the editor buffer would look like this: your kill and comprehension. ability. Whatever is written back into the file by ed is what goes into the final output Finally, any command can be executed from within diff by escaping with temd. Technically, the hardest part of the job is dif, and that has already been done for us. So the real job of idi fF is parsing 4i¢#’s output, and opening, closing, reading and writing the proper files at the right time. The main rou- tine of Adi ££ sets up the files and runs the di££ process: CHAPTER 6 PROGRAMMING WETH STANDARD VO 195 J Adiff: interactive aiff +/ #include finclude char sprognane; fdefine HUGE 10000 /+ large nunber of lines +/ main(arge, argv) int arge; char sargvi FILE *fin, #fout, +£1, *£2, +efopent): char Duf(BUFSIZ], emktemp(}: char sdiffout = "Adi fe -xXXXXX"; progname = argv(01; if (arge t= 3) { fprintf(stderr, "Usage: idiff file? £11¢2\n exit(1); ) £1 © efopen(argvit}, "")3 £2 = efopen(argv(2), "z"): fout = efopen("idiff out", "w"); mktemp(diffout) ; sprintf (buf,"dif£ Xs Xe >Xs" argv{ 11 ,argv(2],diffout); system(buf); fin = efopen(aiffout, *2"); Aaiee(£1, £2, fin, fout); unlink(aiffout) ; printf("%s output in file idiff.out\n exit(o)s progname); , The function mktemp(3) creates a file whose name is guaranteed to be dif- ferent from any existing file. mktemp overwrites its argument: the six x's are replaced by the process-id of the id ££ process and a letter. The system call unlink(2) removes the named file from the file system, The job of looping through the changes reported by 4i€é is handled by a function called iif. The basic idea is simple enough: print a chunk of Gi£E output, skip over the unwanted data in one file, then copy the desired version from the other. There is a lot of tedious detail, so the code is bigger than we'd like, but it’s easy enough to understand in pieces. 196 {THE UNIX PROGRAMMING ENVIRONMENT ‘CHAPTER 6 Adiff(£1, £2, fin, fout) —/+ process aitts +/ ‘ FILE «£1, +£2, efin, #fout; char stempfile = "idief 20000"; char buf (BUFSIZ], buf2(BUPSIZ}, +mktemp(); PILE *£t, efopent); int oma, n, fromt, tot, from2, to2, nf1, nf2s mktemp(tempfile) nE1 = nf2 = 0; while (fgets(buf, sizeof buf, fin) I= NULL) { parse(buf, &from1, tol, Send, Sfrom2, &to2);, n= tot-from! + to2-from2 + 1; /+ #lines from dife +/ if (ond == ‘0") nite 2 else if (ond Eromtee else if (ond feomaees peinte("%e", buf); while (n-- > 0) { Egets(buf, sizeof buf, fin); printe("%s", buf); a0 printe("? "); Felush( stdout); fgets(buf, sizeof buf, stain); switch (baf{0]) { nekip(£1, tot-nft neopy(£2, to2-nf2, fout); break: nskip(£2, to2-né2 ncopy(£1, tot-nf1, fout); break neopy(£1, from1-1-n£1, fout); nskip(£2, from2-1-nf2 ft = efopen(tempfile, neopy(£1, tot+1-fromt, Eprints (ft, "---\n"); neopy(£2, toz+1-from2, £t); fclose(#t); sprintf (buf2, "ed %s' system(buf2); Et = efopen(tempfile, “2"); neopy(ft, HUGE, fout}; + tempfile)s CHAPTER 6 PROGRAMMING WITH STANDARD 10197 felose( ft); break: case 11: systen(buf+1) printe("!\n") break; default printe("< or > or €or !\n"); break; ) } while (buf[O}le/<’ && buf{O}!=’>’ && buflO} 1270"); nE1 = tot; nf2 = toz; ) neopy(f1, HUGE, fout); /* can fail on very long Files +/ unlink(tempeite) ; , The function parse does the mundane but tricky job of parsing the lines produced by diff, extracting the four line numbers and the command (one of a, cor 4). parse is complicated a bit because diff can produce either one line number or two on either side of the command letter. parse(s, pfrom1, ptot, pend, pfrom2, pto2) char #8: int spend, «pfrom1, sptot, epfrom2, «pto: ‘ f#aefine a2i(p) while (Asdigit(+s)) p= 108(p) + *se+ - ‘0 spfrom1 = #pto1 = +pfrom2 = «pto2 = 0 a2i(«peromt); Af (sa ee 47) a2i(+ptot); ) else sptot = «pfromt; spond = eee; a2i(+ptrom2); Af (ee ae 47) a2i(epte2): } ese ‘epto2 = epérona; > The macro a2i handles our specialized conversion from ASCII to integer in the four places it occurs. nskip and ncopy skip over or copy the specified number of lines from @ file: 198 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 6 nskip(fin, n) /* skip n lines of file fin «/ PILE "fini ‘ char buf{BUFSIZ]; while (n-- > 0) Egets(buf, sizeof buf, fin); , noopy(fin, n, fout) /* copy n lines from fin to fout +/ FILE tfin, #fout; ‘ char buf(BUPSIZ1; while (n-- > 0) ( if (fgets(buf, sizeof buf, fin) return; fputs(buf, fout); NULL) ’ As it stands, idift docsn’t quit gracefully if it is interrupted, since it leaves several files lying around in /tmp. In the next chapter, we will show how to catch interrupts to remove temporary files like those used here The crucial observation with both zap and idiff is that most of the hard work has been done by someone else. These programs merely put a con- venient interface on another program that computes the right information. It’s worth watching for opportunities to build on someone else’s labor instead of doing it yourself — it’s a cheap way to be more productive. Exercise 6-13, Add the command q to Sai €£: the response q< will take all the rest of the *<” choices automatically; q> will ake the all the rest of the '>” choices. Exercise 6-14. Modify aiff so that any dif arguments are passed on to diff; -b and ~b are likely candidates. Modify 44i#£ so that a different editor can be specified, 4 Ldite -© anotherciltor £6101 £5162 How do these two modifications interact? © Exercise 6-15. Change 184 ££ to use popen and polose instead of a temporary file for the output of 4if£. What difference does it make in program speed and complex: ity? o Exercise 6-16. dif has the property that if one of its arguments is @ directory, it searches that directory for a file with the same name as the other argument. But if you ley the same thing with 448, it fails in a strange way. Explain what happens, then fixit. 0 CHAPTER 6 PROGRAMMING WITH STANDARD 10 199 6.9 Accessing the environment It is easy to access shell environment variables from a C program, and this can sometimes be used to make programs adapt to their environment without requiring much of their users. For example, suppose that you are using a ter minal in which the screen size is bigger than the normal 24 lines. If you want to use p and take full advantage of your terminal’s capabilities, what choices are open to you? It’s a bother to have to specify the screen size each time you use p: sp -36 You could always put a shell file in your bin: $ cat /usr/you/bin/p exec /asr/bin/p -36 $+ s A third solution is to modify p to use an environment variable that defines the properties of your terminal. Suppose that you define the variable PAGESIZE in your profile: PAGESIZE=36 export PAGESIZE ‘The routine getenv("var") searches the environment for the shell vari- able var and returns its value as a string of characters, or NULL. if the variable is not defined. Given geteny, it’s easy to modify p. All that is needed is to ‘add a couple of declarations and a call to getenv to the beginning of the main routine. /+ pi print input in chunks (version 3) */ char *p, *getenv(); prognane = argvi0l; Af ((pegetenv("PAGESIZE")) I= NULL) pagesize = atoi(p); Af (arge > 186 argv[1](0] == %-") ( pagesize = atoi(Sargv[1}[11)5 arge-- argvess Optional arguments are processed after the environment variable, so any expli- cit page size will still override an implicit one. Exercise 6-17. Modify 1di¢£ to search the environment for the name of the editor to be used. Modify 2, 3, ee., to use PAGESTZE. © 200. THE UN PROGRAMMING ENVIRONMENT CHAPTER 6 History and bibliographic notes ‘The standard VO library was designed by Dennis Ritchie, after Mike Lesk’s portable UO library. ‘The intent of both packages was to provide enough stan- dard facilities that programs could be moved from UNIX to non-UNIX systems without change. ‘Our design of p is based on a program by Henry Spencer. adb was written by Steve Bourne, sdb by Howard Katseff, and Lint by Steve Johnson. Adi EE is loosely based on a program originally written by Joe Maranzano @i€f itself is by Doug Mellroy, and is based on an algorithm invented independently by Harold Stone and by Wayne Hunt and Tom Szymanski, (See “A fast algorithm for computing longest common subsequences,” by J. W. Hunt and T. G. Szymanski, CACM, May, 1977.) The diff algorithm is described in M. D. Mcliroy and J. W. Hunt, “An algorithm for differential file comparison,” Bell Labs Computing Science Technical Report 41, 1976. To quote Mcliroy, “I had tried at least three completely different algorithms before the final one. aiff is a quintessential case of not settling for mere competency in a program bat revising it until it was right.”” cuaprer 7: UNIX SYSTEM CALLS This chapter concentrates on the lowest level of interaction with the UNIX operating system — the system calls. ‘These are the entries to the kernel, They are the facilities that the operating system provides; everything. else is built on top of them. We will cover several major areas. First is the /O system, the foundation beneath library routines like fopen and pute, We'll talk more about the file system as well, particularly directories and inodes. Next comes a discussion of processes — how (0 run programs from within a program. After that we will talk about signals and interrupts: what happens when you push the DELETE key, and how to handle that sensibly in a program, As in Chapter 6, many of our examples are useful programs that were not part of the 7th Edition. Even if they are not directly helpful to you, you should learn something from reading them, and they might suggest similar tools that you could build for your system Full details on the system calls are in Section 2 of the unr Programmer's Manual; this chapter describes the most important parts, but makes no pretense ‘of completeness, 7.1 Low-level UO The lowest level of /O is a direct entry into the operating system. Your program reads or writes files in chunks of any convenient size, The kernel buffers your data into chunks that match the peripheral devices, and schedules operations on the devices to optimize their performance over all users, Fite descriptors All input and output is done by reading or writing files, because all peri- pheral devices, even your terminal, are files in the file system. This means that a single interface handles all communication between a program and peri- pheral devices In the most general case, before reading or writing a file, it is necessary to inform the system of your intent to do so, a process called opening the file. If 201 202 THE UNIX PROGRAMMING ENVIRONMENT ‘CHAPTER 7 you are going to write on a file, it may also be necessary to create it. The sys- tem checks your right to do so (Does the file exist? Do you have permission to access it?), and if all is well, returns a non-negative integer called a file descriptor. Whenever UO is to be done on the file, the file descriptor is used instead of the name to identify the file. All information about an open file is maintained by the system; your program refers to the file only by the file descriptor. A PILE pointer as discussed in Chapter 6 points to a structure that contains, among other things, the file descriptor; the macro £ileno(£p) defined in returns the file descriptor There are special arrangements to make terminal input and output con- venient. When it is started by the shell, a program inherits three open files, with file descriptors 0, 1, and 2, called the standard input, the standard output, and the standard error. All of these are by default connected to the terminal, so if a program only reads file descriptor 0 and writes file descriptors 1 and 2, it can do VO without having to open files. If the program opens any other files, they will have file descriptors 3, 4, etc. If VO is redirected to or from files or pipes, the shell changes the default assignments for file descriptors 0 and 1 from the terminal to the named files. Normally file descriptor 2 remains attached to the terminal, so error messages can go there. Shell incantations such as 2>filename and 2>81 will cause rearrangements of the defaults, but the file assignments are changed by the shell, not by the program. (The program itself can rearrange these further if it wishes, but this is rare.) File 10 — vead and write All input and output is done by two system calls, read and write, which are accessed from C by functions of the same name. For both, the first argu- ‘ment is a file descriptor. The second argument is an array of bytes that serves as the data source or destination, The third argument is the aumber of bytes, to be transferred int £4, a, nread, nwritten; char buf[SIZE); read = read(fd, buf, n) nwritten = write(fd, buf, n); Each call returns a count of the number of bytes transferred. On reading, the number of bytes returned may be less than the number requested, because fewer than n bytes remained to be read. (When the file is a terminal, read normally reads only up to the next newline, which is usually less than what ‘was requested.) A return value of zero implies end of file, and ~1 indicates an error of some sort. For writing, the value returned is the number of bytes actually written; an error has occurred if this isn’t equal to the number sup- posed to be written, While the number of bytes to be read or written is not restricted, the two cuarrer 7 UUNIK SYSTEM CALLS 203 ‘most common values are 1, which means one character at a time (“unbuf- fered”), and the size of a block on a disc, most often 512 or 1024 bytes. (The parameter BUFSIZ in has this value.) To illustrate, here is a program to copy its input to its output. Since the input and output can be redirected to any file or device, it will actually copy anything to anything: it’s a bare-bones implementation of cat. 7+ cat: minimal version */ faefine SIZE $12 /+ arbitrary +/ main() c char buf [SIZB}; int ms while ((n = read(0, buf, sizeof buf)) > 0) weite(t, buf, n); exit(o); , If the file size is not a multiple of STZE, some xead will return @ smaller number of bytes to be written by write; the next call to read after that will return zero, Reading and writing in chunks that match the disc will be most efficient, but even character-at-a-time I/O ig feasible for modest amounts of data, because the kernel buffers your data; the main cost is the system calls, ed, for example, uses one-byte reads to retrieve its standard input. We timed this ver- sion of cat on a file of 54000 bytes, for six values of SIZE: Time (user+ system, sec.) SIZE PDP-11/70— VAX-11/750 L210: 188.8 10 = 299: 19.3 100 3.8 26 s12 13 10 1024 12 06 9120 10 06 ‘The disc block size is 512 bytes on the PDP-I1 system and 1024 on the VAX, It is quite legal for several processes to be accessing the same file at the same time; indeed, one process can be writing while another is reading. If this isn't what you wanted, it can be disconcerting, but it’s sometimes useful. Even though one call to read returns 0 and thus signals end of file, if more data is written on that file, 4 subsequent read will find more bytes available. This observation is the basis of a program called readslow, which continues to read its input, regardless of whether it got an end of file or not. readsiow is 208 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 7 handy for watching the progress of a program: 3 slowprog >temp & 5213 Processid $ readsiow 0) weite(1, buf m); sleep( 10); ) ‘The function sleep causes the program to be suspended for the specified number of seconds; it is described in sleep(3). We don't want readslow to bang away at the file continuously looking for more data; that would be too costly in CPU time. ‘Thus this version of readslow copies its input up to the end of file, sleeps a while, then tries again. If more data arrives while it is asleep, it will be read by the next read, Exercise 7-1. Adda -n argument to readsiow so the default sleep time can be changed to m seconds. Some systems provide an option -€ (“forever”) for tail that combines the functions of tail with those of readslow. Comment on this design. Exercise 7-2. What happens to readslow if the file being read is truncated? How ‘would you fix it? Hint: read about stat in Section 7.3, 0 File creation — open, creat, close, unlink Other than the default standard input, output and error files, you must explicitly open files in order to read or write them. There are two system calls for this, open and creat.t + Ken Thompion was once asked what he woul! do differently if he were redesigning the UNI 53 tem. His reply: Ti spell creat with an e.” cuarrer 7 LNDK SYSTEM CALLS 205 open is rather like fopen in the previous chapter, except that instead of returning a file pointer, it returns a file descriptor, which is an int. char «name; int fa, rwmodes £4 = open(name, rmode); As with fopen, the name argument is a character string containing the filename. The access mode argument is different, however: rwmode is 0 for read, 1 for write, and 2 to open a file for both reading and writing. open returns ~1 if any error occurs; otherwise it returns a valid file descriptor. It is an error to try to open a file that does not exist. ‘The system call creat is provided to create new files, or to rewrite old ones. int perms fa = creat(name, perms); creat returns a file descriptor if it was able to create the file called name, and -1 if not. If the file does not exist, exeat creates it with the permissions specified by the perms argument. If the file already exists, creat will trun- ceate it to zero length; it is not an error to creat a file that already exists (The permissions will not be changed.) Regardless of perms, a created file is open for writing. ‘As described in Chapter 2, there are nine bits of protection information associated with a file, controlling read, write and execute permission, so three-digit octal number is convenient for specifying them. For example, 0755 specifies read, write and execute permission for the owner, and read and exe- cute permission for the group and everyone else. Don't forget the leading 0 which is how octal numbers are specified in C. To illustrate, here is a simplified version of cp. The main simplification is that our version copies only one file, and does not permit the second argument to be a directory. Another blemish is that our version does not preserve the permissions of the source file; we will show how to remedy this later 206 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER T /* cp: minimal version +/ finclude ‘#aefine PERMS 0644 /+ RW for omer, R for group, others +/ char *progname; main(arge, argv) /+ ep: copy £1 to £2 +/ Ant argo: char targv()s int £1, £2, 05 char Duf(BUFSIZ]; prognane = argv[0}; Af (argc I= 3) error("Usage: %s from to", progname); 4€ ((£1 = open(argv[t}, 0)) == -1) error("can’t open Xs", argy(1]) A€ ((£2 = creat (argv(2], PERMS)) == ~1) error("can’t create %s", argvi21); while ((n 2 read(£1, buf, BUPSTZ)) > 0) if (weite(£2, buf, n) f= n) error("write error", (char *) 0); exit(0); , We will discuss error in the next sub-section, ‘There is a limit (typically about 20; look for NOFILE in ) on the number of files that a program may have open simultaneously. Accord- ingly, any program that intends to process many files must be prepared to re- use file descriptors. The system call close breaks the connection between a filename and a file descriptor, freeing the file descriptor for use with some other file, Termination of a program via exit or return from the main pro- gram closes all open files. ‘The system call unlink removes @ file from the file system, Error processing — exrno ‘The system calls discussed in this section, and in fact all system calls, can incur errors. Usually they indicate an error by returning a value of -1. Some- times it is nice to know what specific error occurred; for this purpose all system calls, when appropriate, leave an error number in an external integer called exrno. (The meanings of the various error numbers are listed in the introduc- tion to Section 2 of the UNIX Programmer's Manual.) By using errno, your program can, for example, determine whether an attempt to open a file failed because it did not exist or because you lacked permission to read it There is also an array of character strings sys_errlist indexed by errno that translates the numbers into a meaningful string. Our version of error uses CHAPTER 7 UUNIK SYSTEM CALLS 207 these data structures: error(s1, 52) /+ print error message and die +/ char #81, #82; ‘ extern int errno, eys_nerr} extern char ssys_errlist[], *prognane; i£ (progname) fprintf(stderr, "% fprintf(stderr, 51, 52); if (errno > 0 && errno < sys_nerr) fprintf(stderr, " (%s)", sys_errlist[errno}); fprintf(stderr, "\n"); exit(1)i progname); > exzno is initially zero, and should always be less than sys_nere. It is not reset to zero when things go well, however, so you must reset it after each error if your program intends to continue. Here is how error messages appear with this version of cp: $ op foo bar ep: can’t open foo (No such file or directory) $ date >foo; chmod 0 foo Make an unreadable file 8 ep foo bar ep: can’t open foo (Permission denied) s Random access — 1seek File 1/0 is normally sequential: each read or write takes place in the file right after the previous one. When necessary, however, a file can be read or written in an arbitrary order. The system call seek provides a way to move around in a file without actually reading or writing: int £4, origin: long offset, pos, Lseek(); pos = 1seek(#4, offset, origin); forces the current position in the file whose descriptor is £4 to move to posi- tion offset, which is taken relative to the location specified by origin. Subsequent reading or writing will begin at that position. origin can be 0, 1, or 2 to specify that offset is to be measured from the beginning, from the current position, or from the end of the file, The value returned is the new absolute position, or ~1 for an error. For example, to append to a file, seek to the end before writing: Aseek(£4, OL, 2); 208 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 7 To get back to the beginning (“rewind”), Aseek(#a, OL, 0): To determine the current position, pos = Iseeki#d, OL, 113 Notice the OL argument: the offset is a Long integer. (The ‘I’ in 1seek stands for “long,” to distinguish it from the 6th Edition seek system call that used short integers.) With Iseek, it is possible to treat files more or less like large arrays, at the price of slower access. For example, the following function reads any number of bytes from any place in a file. get(£a, pos, buf, n) /+ read n bytes from position pos +/ int fa, ny long posi char *buf; if (iseek(f4, pos, 0) return -1; else return read(fa, buf, a); =1)/+ get to pos +/ > Exercise 7-3. Modify readsiow to handle a filename argument if one is present. Add the option -e: causes readsiow to seek to the end of the input before beginning to read. What does seek do on a pipe? 9 Exercise 7-4. Rewrite efopen from Chapter 6 to call error. © 7.2 File system: directories ‘The next topic is how to walk through the directory hierarchy. This doesn’t actually use any new system calls, just some old ones in a new context. We will illustrate by writing a function called spname that tries to cope with misspelled filenames. The function = spname(name, newname) ; searches for a file with a name “close enough” to name. If one is found, it is copied into newname. The value n returned by spname is -1 if nothing close enough was found, 0 if there was an exact match, and 1 if a correction was made. spname is a convenient addition to the p command: if you try to print @ file but misspell the name, p can ask if you really meant something else: ‘CHAPTER 7 UNIX SYSTEM CALLS 209 $ p /urs/srx/cond/p/spnan.c Horvibly botched name “/asr/src/cmd/p/spname.c"? y Suggested correction accepted /* spname: return correctly spelled filename +/ ‘As we will write it, spname will try to correct, in cach component of the filename, mismatches in which a single letter has been dropped or added, or a single letter is wrong, or a pair of letters exchanged; all of these are illustrated above. This is a boon for sloppy typist. Before writing the code, a short review of file system structure is in order. A directory is a file containing a list of file names and an indication of where they are located. The “location” is actually an index into another table called the inode table. ‘The inode for a file is where all information about the file except its name is kept. A directory entry thus consists of only two items, an inode number and the file name. ‘The precise specification can be found in the file : $ cat /usr/inelude/sys/air.h f#define DIRSIZ 14 /+ max length of file name +/ struct direct /s structure of directory entry +/ ‘ ino_t € ino; + inode number +/ char duname(DIRSIZ]; /+ file name «/ % s The “type” ino_t is a typedef describing the index into the inode table. It happens to be unsigned short on PDP-II and VAX versions of the sys- tem, but this is definitely not the sort of information to embed in a program: it might be different on a different machine. Hence the typedef. A complete set of “system” types is found in , which must be included before The operation of spnane is straightforward enough, although there are a lot of boundary conditions to get right. Suppose the file name is /d1/d2/f. ‘The basic idea is to peel off the first component (/), then search that directory for a name close to the next component (d7), then search that directory for something near d2, and so on, until a match has been found for each com- ponent. If at any stage there isn’t a plausible candidate in the directory, the search is abandoned. We have divided the job into three functions, spname itself isolates the components of the path and builds them into a “best match so far" filename. It calls mindist, which searches a given directory for the file that is closest to the current guess, using a third function, spdist, to compute the distance between two names. 210, THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 7 Zs spname: return correctly spelled filename +/ a + spname(oldnane, newnase) char soldname, snewname; + returns -1 if no reasonable match to oldnane, + 0 if exact match, . 1 4f corrected. + stores corrected name in newname. finclude ‘include spname(oldnane, newnane) char soldname, *newnane; ‘ char *p, guess(DIRSIZ+1], best (DIRSTZ+1]; char «new = newname, sold = oldname: for (i3) ¢ while (401d == ‘/") /+ skip slashes «/ anewts = soldey enew = NOG Af (sold == "\0") /s exact or corrected +/ return stremp(oldnane,nevname) I= 0; p= guess; /+ copy next component into guess +/ for (; sold I= ‘/* 6 sold f= "\0"; olde+) Af (p < guesasDIRstZ) spre = sold; ap = ‘N04 Af (minist(newname, guess, best) >= 3) return -1; /+ hopeless */ for (p= best; snew = *pr+; ) /+ add to end +/ ewes 7+ of newnane +/ ‘CHAPTER T UNIX SYSTEM CALLS 211 mindist(air, guess, best) /» search dir for guess +/ char #dir, «guess, +best; cl /+ set best, return distance 0..3 «/ Ant 4, né, fa; struct { ino_t ino: char name[DIRSIZ+1]; /+ 1 more than in dir.h +/ ) mbufy mbuf-name(DIRSTZ] = ’\0"; / +1 for terminal ‘\0" +/ Af (4ir(0] == ’\0") 7+ current directory +/ air =". 4 = 3; /+ minimum distance */ Af ((Edsopen(dir, 0)) == -1) return 5 wile (read(f4,(char «) Snbuf,sizeof (struct irect)) > 0) Lf (mbut.ino) [ nd = spdist(nbuf.name, guess); Af (nd <= @ 66 nd T= 3) strepy(best, nbuf.name); a= ng Af (d 220) 7+ exact match +/ break; > 2 close (ta return dj > If the directory name given to mindist is empty, *." is searched. mindist reads one directory entry at a time. Notice that the buffer for read is a struc- ture, not an array of characters. We use sizeof to compute the number of bytes, and coerce the address to a character pointer: Ifa slot in a directory is not currently in use (because a file has been removed), then the inode entry is zero, and this position is skipped. The dis tance test is if (nd xe ds.) instead of if (nd Once we have spname, integrating spelling correction into p is easy: CHAPTER 7 UNIX SYSTEM CALLS 213 print input in chunks (version 4) +/ Hinelude #aefine PAGESIZE 22 char *prognane; /+ program name for error message +/ main(arge, argv) int arges char sargvi); FILE «fp, *efopen(); int i, pagesize = PAGESIZE; char tp, #getenv(), buf(BUPSIZ prognane = argv(0; if ((pegetenv("PAGESIZE")) pagesize = atoi(p! 4€ (arge > 1 && argvi1}(0) Pagesize = atoi(&argy[1}( 11); arge~ argvis} ) Af (arge print (stdin, pagesize); else for (i = 1; 4 < argos i++) switch (spnane(argv[i}, buf)) { case - 7+ no match possible +/ fp = efopen(argy[i}, "2"): break case 1: /+ corrected +/ Eprinté (stderr, "\"s\"? ", buf); if (ttyin() == 'n’) break; argv[i} = buf; 7s fall through... +/ case 0: /+ exact match +/ fp = efopen(argv(i), "r"); print(fp, pagesize): fclose(£p); > exit(0); ) Spelling correction is not something to be blindly applied to every program that uses filenames. It works well with p because p is interactive, but it’s not suitable for programs that might not be interactive. Exercise 7-5. How much can you improve on the heuristic for selecting the best match in spname? For example, itis foolish to treat a regular file as if it were a directory: 214 THE UNIX PROGRAMMING ENVIRONMENT ‘CHAPTER 7 this can happen with the current version. © Exercise 7-6, The name tx matches whichever of te happens to come last in the direc- tory, for any single character c. Can you invent a better distance measure?” Implement it and see how well it works wit real users. Exercise 7-7. mindist reads the directory one entry at a time, Does p run perceptibly faster if directory reading is done in bigger chunks? © Exercise 7-8, Modify apname to retuen a name that is a prefix of the desired name if no closer match can be found. How shovld ties be broken if there are several names that all match the prefix? © Exercise 7-9, What other programs could profit from spname? Design a standalone program that would apply correction to its arguments before passing them along to another program, asin 8 fix prog filenames, Can you write a version of e@ that uses spnane? How would you install it? © 7.3 File system: inodes In this section we will discuss system calls that deal with the file system and in particular with the information about files, such as size, dates, permissions, and so on. ‘These system calls allow you to get at all the information we talked about in Chapter 2 Let’s dig into the inode itself. Part of the inode is described by a structure called stat, defined in struct stat /+ structure returned by atat «/ i dev.t st.dev; + device of inode +/ ino_t — st_ino: 7s Anode nunber +/ short st mod 7s mode bits */ short st-nlink; /« number of links to file */ short et_uid; | /+ owner’s userid «/ short st_gid; /+ owner’s group id +/ dev.t strdev; /+ for special files +/ off_t st_size; /+ file size in characters +/ time t statine; /+ time file last read +/ timet et mtine; /+ time file last written or created */ time t st_ctime; /+ time file or inode last changed +/ M Most of the fields are explained by the comments. ‘Types like dev_t and ino_t are defined in , as discussed above. The st_mode entry contains a set of flags describing the file; for convenience, the flag defini- tions are also part of the file CHAPTER 7 UNIX SYSTEM CALLS 215 #aefine SIFT 0170000 /+ type of file +/ #aefine $_IFDIR 0040000 /+ directory +/ faefine S$ _IFCHR 0020000 /+ character special +/ #aefine SIPBLK 0060000 /+ block special +/ #aefine _§_IPREG 0100000 /+ regular +/ #aefine S_ISUID 0004000 /+ set user id on execution +/ #aefine S_ISGID 0002000 /+ set group id on execution +/ #aefine S_ISVTX 0001000 + save swapped text even after use +/ fdefine STREAD 0000400 /+ read permission, ovner +/ #aefine S_IWRITE 0000200 + write permission, owner +/ #aefine S_IEKEC 0000100 + execute/search permission, owner «/ ‘The inode for a file is accessed by a pair of system calls named stat and fetat. stat takes a filename and returns inode information for that file (or =1 if there is an error). fstat does the same from a file descriptor for an ‘open file (not from a FILE pointer). That is, char sname; int £4; struct stat stbuf; atat(name, Getbuf); fetat(fd, Estbuf); fills the structure stbuf with the inode information for the file name or file descriptor £4. With all these facts in hand, we can start to write some useful code. Let us begin with a C version of checkmai1, a program that watches your mailbox. If the file grows larger, checkmai1 prints “You have mail” and rings the bell. (if the file gets shorter, that is presumably because you have just read and deleted some mail, and no message is wanted.) This is entirely adequate as a first step; you can get fancier once this works. 216 THE UNDE PROGRAMMING ENVIRONMENT CHAPTER 7 7s checknail: watch user’s mailbox */ #include #include Wnclude char +progname; char *maildir = "/usr/spool/mail"; /+ system dependent +/ main(arge, argv) int argc; char saravils struct stat buf; char sname, sgetlogin(): int laateize = 0} progname = argv(0}; Af ((name = getiogin()) == RULL) error("can’t get login name", (char +) 0); Af (chdir(mailéir) »» 1) error("can’t od to Xe", maildir); for (3) ( Af (statiname, &buf) buf.st_size = 05 if (buf stusize > lastsize) Eprinté(etderr, "\n¥ou have mail\007\n"); lastsize = buf.st_siz sleep(60); -1) /+ no mailbox «/ , The function getlogin(3) returns your login name, or NULL if it can’. checkmail changes to the mail directory with the system call chi, so that the subsequent stat calls will not have to search each directory from the root to the mail directory. You might have to change maildir to be correct on your system. We wrote checkmail to keep trying even if there is no mail- box, since most versions of mai remove the mailbox if it’s empty. We wrote this program in Chapter 5 in part (o illustrate shell loops. ‘That version created several processes every time it looked at the mailbox, so it might be more of a system Joad than you want, The C version is a single pro- cess that does a stat on the file every minute, How much does it cost to have ‘checkmail running in the background all the time? We measured it at well tunder one second per hour, which is low enough that it hardly matters. sv: An illustration of error handling We are next going to write a program called sv, similar to ep, that will copy a set of files to a directory, but change cach target file only if it does not exist or is older than the source. “sv” stands for “save”; the idea is that 5v CHAPTER? UNIX SYSTEM CALLS 217 will not overwrite something that appears to be more up to date. sv uses more of the information in the inode than checkmai does. The design we will use for ev is this: $ ov filet file2 ... air copies £ile1 to dix/filet, file2 to dir/£ile2, etc., except that when a target file is newer than its source file, no copy is made and a warning is printed. To avoid making multiple copies of linked files, sv does not allow /’s in any of the source filenames. 7+ sv: save new files */ #include #include Winelude #inelude char +prognam main(arge, argv) int arge; char targvi); int is struct stat stbuf; char #dir = argvlarge-1]; prognane = argv[0]; if (arge <= 2) error("Usage: Xs file if (stat(dir, Sstbuf) error("can’t access directory Xe", ir): Af ((stbuf.st_mode & S_IFHT) I= S_TFDIR) error("¥s is not a directory", dir); for (1 = 15 4 < arge=t; ise) svlargvtil, diz); exit(o): air", progname); =4) ? ‘The times in the inode are in seconds-since-long-ago (0:00 GMT, January 1, 1970), so older files have smaller values in their st_mt:ime field. 218 THE UNIX PROGRAMMING ENVIRONMENT cuarrer 7 avifile, dir) /+ save file in dir «/ char efile, #dix; 4 struct stat sti, sto; int fin, fout, a; char target(BUFSIZ), buf(BUFSIZ}, eindex(); sprintf (target, “Ks/Ks", air, file); Af (index(file, ‘/7) I= NULL) / strohr() in some systems +/ error("won’t handle /’s in %o", file); Af (stat (file, Ssti) == -1) error("can’t stat Xs", file); Af (stat(target, Sto) =='-1) /* target not present +/ ato.stmtine = 0; /+ so make it look old +/ Af (sti-st_mime < sto.st_mtime) /+ target is newer +/ fprintf(stderr, "Xs: Xs not copied\n", progname, file); else if (fin = Open(file, 0)) == -1) error("can’t open file xe", file); else if ((fout = creat(target, sti-st_mode)) error("can’t create %e", target); else while ((n = read(fin, buf, sizeof buf) > 0) if (weite(fout, buf, a) I= n) error("error writing %s", target); close(fin) elose(fout) , We used creat instead of the standard 1/0 functions so that sv can preserve the mode of the input file. (Note that index and strchr are different names, for the same routine; check your manual under string(3) to sce which name your system uses.) ‘Although the ev program is rather specialized, it does indicate some impor- tant ideas. Many programs are not “system programs” but may still use infor~ mation maintained by the operating system and accessed through system calls, For such programs, it is crucial that the representation of the information appear only in standard header files like and , and that programs include those files instead of embedding the actual declarations in themselves. Such code is much more likely to be portable from one system to another. It is also worth noting that at least two thirds of the code in sv is error checking, In the early stages of writing a program, it’s tempting to skimp on error handling, since it is a diversion from the main task. And once the pro- gram “works,” it’s hard to be enthusiastic about going back to put in the checks that convert a private program into one that works regardless of what happens. ‘CHAPTER 7 UNIX SYSTEM CALLS 219 sv isn’t proof against all possible disasters — it doesn’t deal with interrupts at awkward times, for instance — but it’s more careful than most programs To focus om just one point for a moment, consider the final write statement It is rare that a write fails, so many programs ignore the possibility. But discs run out of space; users exceed quotas; communications lines break. All Of these can cause write errors, and you are a lot better off if you hear about them than if the program silently pretends that all is wel. ‘The moral is that error checking is tedious but important. We have been cavalier in most of the programs in this book because of space limitations and to focus on more interesting topics. But for real, production programs, you can’t afford to ignore errors. Exercise 7-10. Modify checkmail to identify the sender of the mail as part of the "you have mail” message. Hint: sscanf, 1seek Exercise 7-11. Modify checkmait so that it does not change to the mail directory before it enters its loop. Does this have a measurable effect on its performance? (Harder) Can you write a version of checkmai that only needs one process to notify all users? © Bxercise 7-12. Write @ program watebfile that monitors a file and prints the file from the beginning each time it changes. When would you use it? © Exercise 7-13. sv is quite rigid in its error handling. Modify it to continue even if it can’t process some file, 0 Exercise 7-14, Make sy recursive: if one of the source files is a directory, that direc- tory and its files are processed in the same manner. Make ep recursive. Discuss whether ep and sv ought to be the same program, so that ep —v doesn't do the copy if the target is newer, © Exercise 7-15. Write the program xandon: random flename produces one Hine chosen at random from the file, Given a file people of names, random can be used in a program called scapegoat, which is valuable for allocating blame: # cat scapegoat echo "It’s all ‘random people's faulti™ scapegoat Te’e all Ken's fault! ‘ Make sure that random is fair regardless of the distribution of line lengths. Exercise 7-16. There’s other information in the inode as well, in particular, disc addresses where the file blocks are located. Examine the file , then write 4 program Scat. that will read files specified by inode number and disc device. (It will work only if the disc in question is readable, of course.) Under what circumstances is deat useful? 220 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 7 7.4 Processes ‘This section describes how to execute one program from within another. ‘The easiest way is with the standard library routine system, mentioned but censured in Chapter 6, system takes one argument, a command line exactly as typed at the terminal (except for the newline at the end) and executes it in a sub-shell. If the command line has to be built from pieces, the in-memory for- ‘matting capabilities of sprintf may be useful. At the end of this section we will show a safer version of system for use by interactive programs, but first We must examine the pieces from which it is built Low-level process creation — exec1p and execvp The most basic operation is to execute another program without returning. by using the system call execlp. For example, to print the date as the action of a running program, use execlp("date", "date", (char +) 0); ‘The first argument to exectp is the filename of the command; execip extracts the search path (i.e., $PATH) from your environment and does the same search as the shell dos. The second and subsequent arguments are the command name and the arguments for the command; these become the argv array for the new program. ‘The end of the list is marked by a 0 argument (Read exec(2) for insight on the design of execip.) ‘The execip call overlays the existing program with the new one, runs that, then exits. ‘The original program gets control back only when there is an error, for example if the file can’t be found or is not executable: execip( "date", "date", (char +) 0): Eprinté(stderr, "Couldn't execute “date’\n"); exits A variant of execip called execvp is useful when you don't know in advance how many arguments there are going to be. The call is execvp( filename, argp); where argp is an array of pointers to the arguments (such as argv); the last pointer in the array must be NULL so execvp can tell where the list ends. As with execlp, filename is the file in which the program is found, and agp is the argv array for the new program; argp(0} is the program name. Neither of these routines provides expansion of metacharacters like <, >, *, quotes, etc., in the argument list. I you want these, use exec1p to invoke the shell /bin/sh, which then docs all the work. Construct a. string commandline that contains the complete command as it would have been typed at the terminal, then say execip("/bin/sh", "sh", commandline, (char «) 0); CHAPTER 7 UNIX SYSTEM CALLS 221 ‘The argument ~c says to treat the next argument as the whole command line, not a single argument. As an illustration of exec, consider the program waitfile. The com: mand $ waitfize filename { command | periodically checks the file named. If it is unchanged since last time, the com- ‘mand is executed. If no command is specified, the file is copied to the standard output. We use waitfile to monitor the progress of trofé, as in $ waitfile troff.out echo troff done & The implementation of waitfile uses Estat to extract the time when the file was last changed. J waitfile: wait until file stops changing +/ #include #inelude Winelude char *progname; main(arge, argv) int arge; char #argvt); int fa; struct stat stbuf: time_t old time = 0; prognane = argv[0}; if large « 2) error("Usage: %s filenane (end]", prognane); Af (£4 = open(argvf1}, 0)) == -1) error("ean’t open Xs", argv 1]); Estat(£4, &stbuf); while (stbuf.st_mtime I= old_time) { old_time « stbuf-st_mtine; sieep(60 fetat(fd, &stbut); } if (arge == 2) { /* copy file +/ execip("cat", "cat", argvi tl, (char +) 0); error("can’t execute cat e", argvl 11) } else ( /+ run process +/ execyp(argv[2], Sargv[2)); error("can’t execute Xs", argv[2]); > exit(0); 222 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 7 This illustrates both exec1p and exeevp. ‘We picked this design because it's useful, but other variations are plausible, For example, waitfile could simply return after the file has stopped chang- ing. Exercise 7-17. Modify vatchEile (Exercise 7-12) so it has the same property as waiteile: if there is no command, it copies the file; otherwise it does the command, Could watontite and waittite share source code? Hint: argvl0]. Control of processes — fork and wait ‘The next step is to regain control after running a program with execip or execyp. Since these routines simply overlay the new program on the old one, to save the old one requires that it first be split into two copies; one of these can be overlaid, while the other waits for the new, overlaying program to fin- ish. ‘The splitting is done by a system call named fork: proc_id = fork(}; splits the program into two copies, both of which continue to run. The only difference between the two is the value returned by fork, the process-id. In one of these processes (the child), proc_id is zero. In the other (the parent), proc_id is non-zero; it is the process-id of the child. Thus the basic way (0 call, and return from, another program is Af (£ork() execip( + se", commandline, (char +) 0 And in fact, except for handling errors, this is sufficient. The fork makes ‘two copies of the program. In the child, the value returned by fork is zero, so it calls execlp, which does the commandline and then dies. In the parent, fork returns non-zero so it skips the execlp. (If there is any error, fork returns ~1.) More often, the parent waits for the child to terminate before continuing itself. This is done with the system call wait: int status; Af (fork() == 0) execip(... 7a chile +/ wait(Gstatue); 7+ parent +/ ‘This still docsn’t handle any abnormal conditions, such as a failure of the execlp or fork, of the possibility that there might be more than one child running simultaneously. (wait returns the process-id of the terminated child, if you want to check it against the value returned by fork.) Finally, this frag- ment doesn’t deal with any funny behavior on the part of the child. Still, these three lines are the heart of the standard aystem function The status returned by wait encodes in its low-order eight bits the system's idea of the child’s exit status; it is 0 for normal termination and non ‘CHAPTER 7 UNIX SYSTEM CALLS 223 zero to indicate various kinds of problems. The next higher eight bits are taken from the argument of the call to exit or return from main that caused termination of the child process. ‘When a program is called by the shell, the three file descriptors 0, 1, and 2 are set up pointing at the right files, and all other file descriptors are available for use. When this program calls another one, correct etiquette suggests mak- ing sure the same conditions hold. Neither fork nor exec calls affect open files in any way; both parent and child have the same open files. If the parent is buffering output that must come out before output from the child, the parent must flush its buffers before the exee1p. Conversely, if the parent buffers an input stream, the child will lose any information that has been read by the parent. Output can be flushed, but input cannot be put back. Both of these considerations arise if the input or output is being done with the standard 1/0 libeary discussed in Chapter 6, since it normally buffers both input and output It is the inheritance of file descriptors across an execip that breaks system: if the calling program does not have its standard input and output connected to the terminal, neither will the command called by system. This may be what is wanted; in an ed script, for example, the input for a command started with an exclamation mark 1 should probably come from the script. Even then ed must read its input one character at a time to avoid input buffer- ing problems. For interactive programs like p, however, system should reconnect stan- dard input and output to the terminal. One way is to connect them to saev/tey The system call dup(4) duplicates the file descriptor #4 on the lowest- numbered unallocated file descriptor, returning a new descriptor that refers to the same open file. This code connects the standard input of a program to a file: int £45 £4 = open( file", 0); close(o); dup( ea; close(fa); The close(0) deallocates file descriptor 0, the standard input, but as usual doesn’t affect the pareat Here is our version of system for interactive programs; it uses progname for error messages. You should ignore the parts of the function that deal with signals; we will return to them in the next section 224 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 7 o + Safer version of system for interactive programs #include #include system(s) /+ run command line = +/ char + ‘ int status, pid, w, tty: int (#istat)(), (eqseat) 00; extern char «prognane; f£lush( stdout); tty = open(*/dev/ety", 2); Af (tty == 0) ( fprintf(stderr, "Xs: can’t open /dev/tty\n", prognane); return ~1; > AE (pid = fork()) elose(0); dup(tty) elose(1); dup(tty: elose(2); dup(tty) close(tty) execip("sh", "sh", "-c", s, (char *) 0); exit(127); oe , close(tty); istat = signal (SrGrwr, s1G_roN); gstat = signal(SicourT, sr¢_rGN); while ((w = wait(kstatus)) I= pid && w I= 1) se status = -15 signal (SIGINT, istat); signal(siGavrt, qetat); return status; - ) Note that /dev/tty is opened with mode 2 — read and write — and then dup'ed to form the standard input and output. This is actually how the system assembles the standard input, output and error when you log in, Therefore, your standard input is writable: $ echo hello 1260 hello ‘ This means we could have dup'ed file descriptor 2 to reconnect the standard input and output, but opening /dew/tty is cleaner and safer. Even this CHAPTER 7 UNIK SYSTEM CALLS 225 system has potential problems: open files in the caller, such as tty in the routine ttyin in p, will be passed to the child process. ‘The lesson here is not that you should use our version of system for all your programs — it would break a non-interactive ed, for example — but that you should understand how processes are managed and use the primitives correctly; the meaning of “correctly” varies with the application, and may not agree with the standard implementation of system. 7.5 Signals and interrupts This section is concerned with how to deal gracefully with signals (like interrupts) from the outside world, and with program faults. Program faults arise mainly from illegal memory references, execution of peculiar instructions, fr floating point errors. ‘The most common outside-world signals are interrupt, which is sent when the DEL character is typed; quit, generated by the FS char- acter (ctl-\); hangup, caused by hanging up the phone; and terminate, gen= erated by the ki11 command. When one of these events occurs, the signal is sent to all processes that were started from the same terminal; unless other arrangements have been made, the signal terminates the process. For most sig- nals, a core image file is written for potential debugging. (See adb(1) and s€b(1).) ‘The system call signal alters the default action. It has two arguments. ‘The first is a number that specifies the signal. The second is either the address of a function, or a code which requests that the signal be ignored or be given the default action. The file contains definitions for the various arguments. Thus Wnelude signal(SIGINP, SI¢_IGN); ‘causes interrupts to be ignored, while signal(SIGINT, SIG_DPL); restores the default action of process termination. In all cases, signal returns the previous value of the signal. If the second argument to signal is the name of a function (which must have been declared already in the same source file), the function will be called when the signal occurs. Most commonly this, facility is used to allow the program to clean up unfinished business before ter- 18, for example to delete a temporary file: 226 THE UNIX PROGRAMMING ENVIRONMENT cuarren 7 #include char *tempfile = "temp-Xxx00t"s mainQ) { extern onintr(); Af (signal (SIGINT, $1G_1GN) I= st¢_rGw) signal (SIGINT, onintr); nktemp(tempfile) ; 7s Process... 4/ exit(0); > onintr() /+ clean up if interrupted */ { unlink (tempfile); exit(4); , Why the test and the double call to signa in main? Recall that signals are sent to all processes started from a particular terminal. Accordingly, when 4 program is to be run non-interactively (started by &), the shell arranges that the program will ignore interrupts, so it won't be stopped by interrupts intended for foreground processes. If this program began by announcing that all interrupts were to be sent to the onintr routine regardless, that would undo the shell’s effort to protect it when-run in the background. The solution, shown above, is to test the state of interrupt handling, and to continue to ignore interrupts if they are already being ignored. The code as written depends on the fact that signal returns the previous state of a partic- ular signal. If signals were already being ignored, the process should continue to ignore them; otherwise, they should be caught. ‘A more sophisticated program may wish to intercept an interrupt and inter- pret it as a request to stop what it is doing and return to its own command- processing loop. Think of a text editor: interrupting a long printout should not cause it t0 exit and lose the work already done. The code for this case can be written like this: ‘CHAPTER 7 UNIK SYSTEM CALLS 227 #include #inclade Smp_buf sibuf; main() int oninte(); Af (signal(SIGINT, SIG_IGN) I= SIG_16N) signal (SIGINT, onintr); setjmp(ajbut); /4 save current stack position +/ for (33) ( 74 pain processing loop +/ } , onintr() /# x t jet if interrupted +/ eignal(SIGINT, onintr); /+ reset for next interrupt #/ printé(*\ntnterrupt\n" ); Longjmp(s buf, 0); /+ return to saved state +/ , The file declares the type jmp buf as an object in which the stack position can be saved; s buf is declared to be such an object. ‘The func- tion set:jmp(3) saves a record of where the program was executing. The values of variables are nor saved. When an interrupt occurs, a call is forced 10 the onintr routine, which can print a message, set flags, or whatever Long jmp takes as argument an object stored into by set imp, and restores control to the location after the call 10 setjmp. So control (and the stack level) will pop back to the place in the main routine where the main loop is entered, Notice that the signal is set again in onintr after an interrupt occurs. This is necessary: signals are automatically reset to their default action when they occur. ‘Some programs that want to detect signals simply can’t be stopped at an arbitrary point, for example in the middle of updating a complicated data struc- ture. The solution is to have the interrupt routine set a flag and return instead of calling exit or Long jmp. Execution will continue at the exact point it was interrupted, and the interrupt flag can be tested later. There is one difficulty associated with this approach. Suppose the program is reading the terminal when the interrupt is sent. The specified routine is duly called; it sets its flag and returns. If it were really true, as we said above, that execution resumes “at the exact point it was interrupted,” the program would continue reading the terminal until the user typed another line. This behavior 228 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 7 might well be confusing, since the user might not know that the program is, reading, and presumably would prefer to have the signal take effect instantly To resolve this difficulty, the system terminates the read, but with an error status that indicates what happened: eeno is set to EINTR, defined in , to indicate an interrupted system call. ‘Thus programs that catch and resume execution after signals should be prepared for “errors” caused by interrupted system calls. (The system calls to ‘watch out for are reads from a terminal, wait, and pause.) Such a program could use code like the following when it reads the standard input #include extern int errno; if (read(0, Sc, 1) <= 0) /+ EOF or interrupted +/ Af (errno “= EINTR) {| /+ EOF caused by interrupt +/ errno = 0; /+ reset for next time */ 7s true end of file #/ ‘There is a final subtlety to keep in mind when signal-catching is combined with execution of other programs. Suppose a program catches interrupts, and also includes a method (like "*!” in ed) whereby other programs can be exe cuted. Then the code would look something like this: Af (fork() == 0) execip(...); signal(SIGINT, SIG_IGN); /+ parent ignores interrupts */ wait(Gstatue); 7+ until child is done */ signal (SIGINT, onintr /+ restore interrupts */ Why is this? Signals are sent to all your processes. Suppose the program you call catches its own interrupts, as an editor does. If you interrupt the subpro- gram, it will get the signal and return to its main loop, and probably read your terminal. But the calling program will also pop out of its wait for the subpro- gram and read your terminal. Having two processes reading your terminal is very confusing, since in effect the system flips a coin to decide who should get each line of input. ‘The solution is to have the parent program ignore inter- rupts until the child is done. This reasoning is reflected in the signal handling in system: ‘CHAPTER 7 UNIX SYSTEM CALLS 229 #inelude system(s) /* xun command line = +/ char «3 0 int status, pid, w, tty: Ant (4istat)(), (aqstat)()s if (pia = forkc) execip(“sh’ exit (127); d istat = signal(siGINr, stG_rGN) gstat = signal(stcqurT, s1¢_rGn) while ((w = wait(Kstatus)) I= pid && w ie (wae 1) signal (SIGINT, istat); signal (Srcaurr, qstat); return statue; ) As an aside on declarations, the function signal obviously has a rather strange second argument. It is in fact a pointer to a function delivering an integer, and this is also the type of the signal routine itself, The two values SIG_IGN and SIG_DFL have the right type, but are chosen so they coincide ‘with no possible actual functions. For the enthusiast, here is how they are defined for the PDP-11 and VAX; the definitions should be sufficiently ugly 10 ‘encourage use of . fdefine SIG_DFL (int (+)())0 f#define SIGLIGN (int (#)())1 Alarms ‘The system call alarm(n) causes a signal STGALRM to be sent to your pro- cess m seconds later. The alarm signal can be used for making sure that some- thing happens within the proper amount of time; if the something happens, the alarm signal can be turned off, but if it does not, the process can regain control by catching the alarm signal To illustrate, here is a program called timeout that runs another com- mand; if that command has not finished by the specified time, it will be aborted when the alarm goes off. For example, recall the watchfor com- mand from Chapter 5. Rather than having it run indefinitely, you might set a 230 THE UNIX PROGRAMMING ENVIRONMENT ‘CHAPTER? limit of an hour: $ timeout -3600 watenfor dag & The code in timeout illustrates almost everything we have talked about in the past two sections. The child is created; the parent sets an alarm and then waits for the child to finish. If the alarm arrives first, the child is killed. An attempt is made to return the child’s exit status. /4 timeout: set time Limit on a process +/ #include #include int pigs 7+ child process id +/ char +prognames main(arge, argv) int arge; char targvt]s « Ant see = 10, status, onalarm(); prognane = argvi0); 3€ (arge > 1 66 argy(1](0} ‘ sec = atoi(argvi1][11)5 arge argy+ ) Af (arge < 2) error("Usage: %s [~10] command", progname); Af ((pidefork()) == 0) ( execvp(argvi1], Sargvi1]) error("couldn’t start Xs", argv 1}; ) signal (SIGALRM, onalarn); alarm(sec) Af (wait(Sstatus) e= -1 11 (status & 0177) f= 0) error("%s killed", argv{1)}i exit((status >> 8) 6 0377); ) onalarm() /+ kill child when alarm arrives +/ ‘ kill (pid, STGKILL); ) Exercise 7-18. Can you infer how steep is implemented? Hint: pause(2). Under ‘what circumstances, if any, could sleep and alarm interfere with each other? 0 CHAPTER 7 UNIX SYSTEM CALLS 231 History and bibliographic notes There is no detailed description of the UNIX system implementation, in part because the code is proprietary. Ken Thompson’s paper “UNIX implementa tion” (BSTJ, July, 1978) describes the basic ideas. Other papers that discuss related topics are “The UNIX system—a retrospective” in the same issue of BSTJ, and “The evolution of the UNIX time-sharing system” (Symposium on Language Design and Programming Methodology, Springer-Verlag Lecture Notes in Computer Science #79, 1979.) Both are by Dennis Ritchie. The program readsiow was invented by Peter Weinberger, as a low- ‘overhead way for spectators to watch the progress of Belle, Ken Thompson and Joe Condon’s chess machine, during chess tournaments. Belle recorded the status of its game in a file; onlookers polled the file with readslow so as not to steal too many precious cycles from Belle, (The newest version of the Belle hardware does little computing on its host machine, so the problem has gone away.) Our inspiration for spname comes from Tom Duff. A paper by Ivor Dur- ham, David Lamb and James Saxe entitled “Spelling correction in user inter faces,” CACM, October, 1983, presents a somewhat different design for spel- ling correction, in the context of a mail program, cuapter s: PROGRAM DEVELOPMENT ‘The UNIX system was originally meant as a program development environ- ment. In this chapter we'll talk about some of the tools that are particularly suited for developing programs. Our vehicle is a substantial program, an inter- preter for a programming language comparable in power to BASIC, We chose to implement a language because it's representative of problems encountered in large programs. Furthermore, many programs can profitably be viewed as languages that convert a systematic input into a sequence of actions and out- puts, so we want to illustrate the language development tools. In this chapter, we will cover specific lessons about » yace, a parser generator, a program that generates a parser from a gram. ‘matical description of a language; © make, a program for specifying and controlling the processes by which a complicated program is compiled; © Lex, a program analogous to yace, for making lexical analyzers. We also want to convey some notions of how to go about such a project — the importance of starting with something small and letting it grow; language evo- lution; and the use of tools. ‘We will describe the implementation of the language in six stages, each of which would be useful even if the development went no further. These stages closely parallel the way that we actually wrote the program. (1)A four-funetion calculator, providing +, -, *, / and parentheses, that ‘operates on floating point numbers. One expression is typed on each line; its value is printed immediately. (2) Variables with names a through 2. This version also has unary minus and some defenses against errors. (3) Arbitrarily-long variable names, builtin functions for sin, exp, etc., use- ful constants like 1 (spelled PI because of typographic limitations), and an exponentiation operator. (4) A change in internals: code is generated for each statement and subse- quently interpreted, rather than being evaluated on the fly. No new features are added, but it leads to (5) (5) Control flow: if-e1se and while, statement grouping with { and }, and 233 234 THE UNIX PROGRAMMING ENVIRONMENT ‘CHAPTER relational operators like >, <2, ete. (© Recursive functions and procedures, with arguments. We also added state- ‘ments for input and for output of strings as well as numbers. The resulting language is described in Chapter 9, where it serves as the main example in our presentation of the UNIX document preparation software. Appendix 2 is the reference manual. ‘This is a very long chapter, because there's a lot of detail involved in get- ting a non-trivial program written correctly, let alone presented. We are assuming that you understand C, and that you have a copy of the UNIX Programmer's Manual, Volume 2, close at hand, since we simply don’t have space to explain every nuance. Hang in, and be prepared to read the chapter a couple of times. We have also included all of the code for the final version in Appendix 3, so you can see more easily how the pieces fit together. By the way, we wasted a lot of time debating names for this language but never came up with anything satisfactory. We settled on hoc, which stands for “high-order calculator.” ‘The versions are thus hoc, hoc2, etc. 8.1 Stage 1: A four-function calculator This section describes the implementation of hoc, a program that provides about the same capabilities as a minimal pocket calculator, and is substantially less portable. It has only four functions: +, ~, *, and /, but it does have parentheses that can be nested arbitrarily deeply, which few pocket calculators provide. If you type an expression followed by RETURN, the answer will be printed on the next line: § hoot 40302 24 (142) © (Gea) 24 wa a5 355/113, 3.1415929 -3-4 hoct: syntax error near line 4 It doesn't have unary minus yet s Grammars Ever since Backus-Naur Form was developed for Algol, languages have been described by formal grammars. The grammar for hoc is small and sim- ple in its abstract representation: corAPTER § PROGRAM DEVELOPMENT 235 List: expr \n List expe \n expr: NUMBER expr + expe expr - expr expe + expr expe / expr (expr) In other words, a List is a sequence of expressions, each followed by a new- line. An expression is a number, or a pair of expressions joined by an opera- tor, or a parenthesized expression. This is not complete. Among other things, it does nat specify the normal precedence and associativity of the operators, nor does it attach a meaning to any construct, And although List is defined in terms of expr, and expr is, defined in terms of NUMBER, NUMBER itself is nowhere defined. ‘These details have to be filled in to go from a sketch of the language to a working program. Overview of yace yace is a parser generator.t that is, a program for converting a grammati- cal specification of a language like the one above into a parser that will parse statements in the language. yacc provides a way fo associate meanings with the components of the grammar in such a way that as the parsing takes place, the meaning can be “evaluated” as well. The stages in using yace are the fol. lowing First, a grammar is written, like the one above, but more precise. This specifies the syntax of the language. yacc can be used at this stage to warn of errors and ambiguities in the grammar. Second, each rule or production of the grammar can be augmented with an ‘action — a statement of what to do when an instance of that grammatical form is found in a program being parsed. The “what to do” part is written in C, with conventions for connecting the grammar to the C code. This defines the semantics of the language Third, a lexical analyzer is needed, which will read the input being parsed and break it up into meaningful chunks for the parser. A NUMBER is an exam- ple of a lexical chunk that is several characters long; single-character operators like + and + are also chunks. A lexical chunk is traditionally called a roken Finally, a controlling routine is needed, to call the parser that yace built. yace processes the grammar and the semantic actions into a parsing func- tion, named yyparse, and writes it out as a file of C code. If yace finds no errors, the parser, the lexical analyzer, and the control routine can be ¥ yace stands for “yet another compiler-compiler." a comment by its ereator, Steve Johnson, on the number of such programs extant atthe Lime it was boing developed (around 1972). yee is ‘one of hand that have flourished. 236 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & compiled, perhaps linked with other C routines, and executed. ‘The operation of this program is to call repeatedly upon the lexical analyzer for tokens, recognize the grammatical (syntactic) structure in the input, and perform the semantic actions as each grammatical rule is recognized. ‘The entry to the lexi- cal analyzer must be named yylex, since that is the function that yyparse calls each time it wants another token. (All names used by yace start with y.) To be somewhat more precise, the input to yace takes this form: x4 C statements like #inciude, declarations, ete. This section is optional * ‘ace declarations: lexical tokens, grammar variables, precedence and associativity information we ‘grammar rules and actions cy ‘more C statements (optional main() ( ...3 yyparse()s ... } yylex) se. } This is processed by yace and the result written into a file called y.tab.c, whose layout is like this: C statements from berween %( and %), ifany € statements from after second %, if any main() ( ...5 yyparse(); -.. yrtext) Cie D yyparse() { parser, which calls yylex() } It is typical of the UNIX approach that yace produces C instead of a com- piled object (0) file. ‘This is the most flexible arrangement — the generated code is portable and amenable to other processing whenever someone has a good idea yace itself is a powerful tool. It takes some effort to learn, but the effort is repaid many times over. yace-generated parsers are small, efficient, and correct (though the semantic actions are your own responsibility); many nasty parsing problems are taken care of automatically. Language-recognizing pro grams are easy to build, and (probably more important) can be modified repeatedly as the language definition evolves. ‘Stage I program ‘The source code for hoc’ consists of @ grammar with actions, a lexical rou- tine yylex, and a main, all in one file hoc.y. (yace filenames traditionally end in .y, but this convention is not enforced by yace itself, unlike cc and +¢.) The grammar partis the first half of hoc.y: CHAPTER § PROGRAM DEVELOPMENT 237 $ cat hoc. me faefine YYSTYPE double /x data type of yace stack +/ x) Xtoken NUMBER Kleft ‘s' ‘-' /s left associative, same precedence +/ Kleft "*' ’/' 7s left assoc., higher precedence +/ x List: nothing +/ blast “\n" Plist expr ‘\n’ —( printe(*\tx.ag\n", $2); } expr: | NUMBER (sess expr ‘6 expr ( 3114 $3; } expr ‘~ expr { 81-83: } expe 'e? expr { $16 $35) expe ‘/* expr { 817.83; ) 1° expe 1)" 325) 1% Ys ond of grammar +/ There's a lot of new information packed into these few lines, We are not going to explain all of it, and certainly not how the parser works — for that, you will have to read the yace manual Alternate rules are separated by ‘!", Any grammar rule can have an associ ated action, which will be performed when an instance of that rule is recog- nized in the input. An action is a sequence of C statements enclosed in braces {and }. Within an action, $n (that is, $1, $2, etc.) refers to the value returned by the n-th component of the rule, and $$ is the value to be returned a the value of the whole rule. So, for example, in the rule expr: NUMBER ( $$ = $1; ) $1 is the value returned by recognizing NUMBER; that value is to be returned as the value of the expr. ‘The particular assignment $$=8 1 can be omitted — $3 is always set to $1 unless you explicitly set it to something els. ‘At the next level, when the rule is expe: expr ‘+’ expr ( $8 = $1 + $35) the value of the result expr is the sum of the values from the two component expr's. Notice that ’+” is $2; every component is numbered. ‘AL the level above this, an expression followed by a newline (’\n‘) is recognized as a list and its value printed. If the end of the input follows such a construction, the parsing process terminates cleanly. A List can be an empty string; this is how blank input lines are handled. yace input is free form; our format is the recommended standard. 238 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & In this implementation, the act of recognizing or parsing the input also causes immediate evaluation of the expression. In more complicated situations (including hoc4 and its successors), the parsing process generates code for later execution. ‘You may find it helpful to visualize parsing as drawing a parse tree like the one in Figure 8.1, and to imagine values being computed and propagated up the tree from the leaves towards the root fist Aa tsi” woman nuusée | noMBER \ toby 2 4 \n Figure 8.1: Parse Tree for 2 + 3 + 4 ‘The values of incompletely-recognized rules are actually kept on a stack; this is how the values are passed from one rule to the next. The data type of this stack is normally an int, but since we are processing floating point numbers, ‘we have to override the default. The definition faefine YvsTYPE double sets the stack type to double. Syntactic classes that will be recognized by the lexical analyzer have to be declared unless they are single character literals like ‘+’ and *~". The declaration %token declares one or more such objects. Left oF right associa- tivity can be specified if appropriate by using %left or right instead of Xtoken. (Left associativity means that a~b-e will be parsed as (s-b)-¢ instead of a-(b-c).) Precedence is determined by order of appearance: tokens in the same declaration are at the same level of precedence; tokens declared Ister are of higher precedence. In this way the grammar proper is ambiguous (that is, there are multiple ways to parse some inputs), but the extra information in the declarations resolves the ambiguity The rest of the code is the routines in the second half ofthe file hoc.y: ‘CHAPTER & PROGRAM DEVELOPMENT 239 Continuing noc-y #include #inelude char sprognane; /+ for error messages */ int Lineno = 1; main(arge, argv) Zs moet «/ char sargvi li « progname = argv( 0}; yyparse(): > main calls yyparse to parse the input. Looping from one expression to the next is done entirely within the grammar, by the sequence of productions for List. It would have been equally acceptable to put a loop around the call to yyparse in main and have the action for List print the value and return immediately. yyparse in turn calls yylex repeatedly for input tokens. Our yylex is easy: it skips blanks and tabs, converts strings of digits into a numeric value, counts input lines for error reporting, and returns any other character as itself. Since the grammar expects to see only +, -, #, /, (, ), and \n, any other character will cause yyparse to report an error. Returning a 0 signals “end of file” to yyparse. Continuing noe. y yylext) y+ moet +/ { if (c == Bor) return 0; ‘et Tt dedigit(c)) { /+ number +/ ungetc(c, stdin); seanf("%i£", Syylval return NUMBER; if tc ) if (owe Mn!) Linenosss , The variable yyival is used for communication between the parser and the lexical analyzer; it is defined by yyparse, and has the same type as the yace stack. yylex returns the ype of a token as its function value, and sets yylval to the value of the token (if there is one). For instance, a floating 240 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & point number has the type NUMBER and a value like 12.34. For some tokens, especially single characters like ’+” and ’\n’, the grammar does not use the value, only the type. In that case, yyivai need not be set The yacc declaration %token NUMBER is converted into a #define state- ment in the yace output file y.tab.c, so NUMBER can be used as a constant anywhere in the C program. yacc chooses values that won't collide with ASCII characters. If there is a syntax error, yyparse calls yyerror with a string containing the cryptic message “syntax error.” The yace user is expected to provide a yyerror; ours just passes the string on to another function, warning, Which prints somewhat more information, Later versions of hoc will make direct use of warning. yyerzor(s) /+ called for yace syntax error #/ char +8 ‘ warning(s, (char #) 0); > varning(s, t) /* print warning message */ char «8, «t} f fprintf(stderr, "Ws: Xs", progname, 8); se (te) fprintf(stderr, * 4s", €)5 fprintf(stderr, "near line %a\n", lineno); ) ‘This marks the end of the routines in hoc. y. Compilation of a yace program is a two-step process: $ yace hoc. Leaves ouput in y.tab.c $cc y.tab.c -0 hoct Leaves executable program in hoot $ hoot 23 0..66666667 3-4 hoct: syntax error near line 1 s Exercise 8-1. Examine the structure of the y.tab.¢ file. (It's about 300 lines long for heel.) o ‘Making changes — unary minus We claimed earlier that using yacc makes it easy to change a language. As an illustration, let’s add unary minus to hoc, so that expressions like o34 CHAPTER 8 PROGRAM DEVELOPMENT 241 are evaluated, not rejected as syntax errors. Exactly two lines have to be added to hoc.y. A new token UNARYMINUS is added to the end of the precedence section, to make unary minus have highest precedence: wiete far = mete "e177 voleft — UNARYMTNUS + new */ ‘The grammar is augmented with one more production for expr: expe: NUMBER ($82 $15) 1 =" expr Xprec UNARYMINUS ( $$ = -$2; ) /» new #/ ‘The %prec says that a unary minus sign (that is, a minus sign before an expression) has the precedence of UNARYMINUS (high); the action is 10 change the sign. A minus sign between two expressions takes the default precedence. Bxercise 8-2. Add the operators % (modulus or remainder) and unary + to hoc. Suggestion: look at frexp(3). © A digression on make It’s a nuisance to have to type two commands to compile a new version of hoct. Although it’s certainly easy to make a shell file that does the job, there’s a better way, one that will generalize nicely later on when there is more than one source file in the program. The program make reads a specification ‘of how the components of a program depend on each other, and how to pro- cess them to create an up-to-date version of the program. It checks the times at which the various components were last modified, figures out the minimum amount of recompilation that has to be done to make a consistent new version, then runs the processes. make also understands the intricacies of multi-step processes like yacc, so these tasks can be put into a make specification ‘without spelling out the individual steps. make is most useful when the program being created is large enough to be spread over several source files, but it’s handy even for something as small as hoct. Here is the make specification for hoc, which make expects in a file called makefile. $ cat makefile hoc: hoe.0 ec hoe.0 -0 hoct This says that hoc1 depends on hoc.o, and that hoc.o is converted into hoc! by running the C compiler ce and putting the output in hoct. make already knows how to convert the yace source file in hoc.y to an object file hoc.o: 282 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & $ make Make the frst thing in makefite, hoot yace hoc.y co -c y.tab.c zn y.tab.c nv y.tab.o hoc.o co hoc.o -o hoot 8 make Do it again “noet” is up to date. make realizes i's unnecessary s 8.2 Stage 2: Variables and error recovery ‘The next step (a small one) is to add “memory” to hoct, to make hoc?. ‘The memory is 26 variables, named a through z. This isn't very elegant, but it’s an easy and useful intermediate step. We'll also add some error handling. If you try hoc, you'll recognize that its approach to syntax errors is to print & message and die, and its treatment of arithmetic errors like division by zero is reprehensible: $ hoct vo Floating exception ~ core dumped s The changes needed for these new features are modest, about 35 lines of code. The lexical analyzer yylex has to recognize letters as variables; the grammar has to include productions of the form expe VAR 1 VAR “= expe ‘An expression can contain an assignment, which permits multiple assignments like ‘The easiest way to store the values of the variables is in a 26-element array; the single-letter variable name can be used to index the array. But if the gram- ‘mar is to process both variable names and values in the same stack, yace has to be told that its stack contains a union of a double and an int, not just a double. This is done with a %union declaration near the top. A #define or a typedef is fine for setting the stack to a basic type like double, but the %union mechanism is required for union types because yace checks for con- sistency in expressions like $822. Here is the grammar part of hoc.y for hoc2 (CHAPTER PROGRAM DEVELOPMENT 243 $ cat hoc.y %E double mem(261; “4 memory for variables ‘a..'z" #/ » Manion ( /s stack type */ double 7 actual value +/ int index; /# index into men[] +/ i token NUMBER Ktoken VAR ktype expr Sright mest 147 7-7 wef te 77 left UNARYMINUS wm List J+ nothing +/ | ldee “An | list expr ‘\n’ { printe("\ex.ag\n", $2)5 } | lise error “\n’ { yyerrok; NUMBER 1 var (88 = ments); ) Hvar ‘2° expr ( nea($1) expr ‘+? expr { sie s | expr ‘-" expr (88 = $1 - §: Pexpr ‘#/ expr ( 88 = $1 + $3; expr "7" expr { if (83 == 0.0) execerror( $3517 $3; ) "expr ‘)? {88 = 825) 1 '-* expr Xprec UNARYMINUS ( $$ = 82; ) 7s end of grammar +/ ‘The Kunion declaration says that stack elements hold either a double (a number, the usual case), or an int, which is an index into the array mem. The %token declarations have been augmented with a type indicator. The xtype declaration specifies that expr is the member of the union, i.e., a double. The type information makes it possible for yace to generate refer- fences to the correct members of the union. Notice also that = is right- associative, while the other operators are left-associative. Error handling comes in several pieces. The obvious one is a test for a zero divisor; if one occurs, an error routine execerror is called, A second test is to catch the “floating point exception” signal that occurs 244 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER § when a floating point number overflows. The signal is set in main. The final part of error recovery is the addition of a production for exror. “error” is @ reserved word in a yacc grammar; it provides a way to antici- pate and recover from a syntax error. If an error occurs, yace will eventually try to use this production, recognize the error as grammatically “correct,” and, thus recover, The action yyerrok sets a flag in the parser that permits it to get back into a sensible parsing state. Error recovery is difficult in any parser; you should be aware that we have taken only the most elementary steps here, ‘and have skipped rapidly over yace’s capabilities as well. The actions in the hoc2 grammar are not much changed. Here is main, to which we have added set jmp to save a clean state suitable for resuming after fan error. execerror does the matching longjmp. (See Section 7.5 for a description of set-jmp and Long imp.) Hnclude #include jmp_buf begin; main(arge, argv) 7s nocd #/ char sargvlli t int fpecaton()5 prognane = argv[0]; set jmp( begin): signal(SIGFPE, fpecatch); yyparse(); ) execerror(s, t) /+ recover from run-time error +/ char +8, +t; ‘ warning(s, €)5 Longjmp(begin, 0); ) Epecateh() + catch Floating point exceptions +/ { execerror("floating point exception", (char +) 0)5 > For debugging, we found it convenient to have execerror call abort (see abort(3)), which causes a core dump that can be perused with adb or sdb. Once the program is fairly robust, abort is replaced by long jmp. The lexical analyzer is a litle different in hoc2. There is an extra test for & lower-case letter, and since yylval is new a union, the proper member has to be set before yylex returns. Here are the parts that have changed: ‘CHAPTER PROGRAM DEVELOPMENT 245 yylex() 7+ noe +/ Hi dgaigitic)) (| /* number «/ ungete(c, stdin); seanf("Xie", Byylval.val); return NUMBER; ; if (isiower(e)) ¢ yylval.index = ¢ = ’a’; /+ ASCII only +/ ‘Again, notice how the token type (e.g., NUMBER) is di (eg. 3.1416). Le us illustrate variables and error recovery, the new things in hoc? inct from its value $ hoc2 x = 355 ass. yet 493 psx 2 is undefined and thus zero @ivision by zero near line 4 Error recovery 3.418929 4230 # 1630 Overflow ocd: floating point exception near line 5 Actually, the PDP-11 requires special arrangements to detect floating point overflow, but on most other machines hoc2 behaves as shown. Exercise 8.3. Add a facility for remembering the most recent value computed, so that it does not have to be retyped in a sequence of related computations. One solution is to make it one of the variables, for instance “p' for “previous.” 0 Exercise 8-4. Modify hoc so that a semicolon can be used as an expression terminator ‘equivalent to a newline, © 8.3 Stage 3: Arbitrary variable names; buil functions This version, hoc3, adds several major new capabilities, and a correspond- ing amount of extra code. The main new feature is access to built-in functions: sin cos atan exp, log. og 10 sqrt int = abs We have also added an exponentiation operator cedence, and is right-associative, Since the lexical analyzer has to cope with built: it has the highest pre- names longer than a 286 THE UNIX PROGRAMMING ENVIRONMENT HAPTER & single character, it isn’t much extra effort to permit variable names to be arbi trarily long as well. We will need a more sophisticated symbol table to keep track of these variables, but once we have it, we can pre-load it with names and values for some useful constants: Pr 3.14159265358979323846 a E 2.71828182845904523536 Base of natural logarithms GAMMA _0.57721566490153286060 Euler-Mascheroni constant DEG —_57.29577951308232087680 Degrees per radian Pur 1,61803398874989484820 Golden ratio ‘The result is @ useful calculator: $ hoo3 4.5723 2,5410306 exp(2.34109(1.5)) 2,5470206 sin(PI/2) 4 atan(1)*DBG 45 We have also cleaned up the behavior a little. In hoc2, the assignment expr not only causes the assignment but also prints the value, because all expressions are printed: $ hoc2 x= 26 3.14159 6.20318 Value printed for assignment to variable In hoc3, 2 distinction is made between assignments and expressions; values are printed only for expressions: $ hoes xr 2+ 3.14159 Assignment: no value is printed x Expression: 6.28318 value is printed ‘The program that results from all these changes is big enough (about 250 lines) that it is best split into separate files for easier editing and faster compi- lation. There are now five files instead of one: CHAPTER & PROGRAM DEVELOPMENT. 247 hoe-y Grammar, main, yylex (as before) hoo.h Global data structures for inclusion symbole ‘Symbol table routines: Lookup, snetaLl initic Built-ins and constants; init math-c Interfaces to math routines: Sart, Log, ete This requires that we learn more about how to organize a multi-file C pro- gram, and more about make so it can do some of the work for us. We'll get back to make shortly. First, let us look at the symbol table code. A symbol has name, a type (it’s either a VAR or a BLTIN), and a value. If the symbol is a VAR, the value is a double; if the symbol is a built-in, the value is @ pointer to a function that returns a double. This information is needed in hoc-y, symbol.c, and init.c. We could just make three copies, but it’s too easy to make a mistake or forget to update one copy when a change ig made. Instead we put the common information into a header file hoc. that will be included by any file that needs it. (The suffix -h is conventional but not enforced by any program.) We will also add to the makefite the fact that these files depend on hoc-h, so that when it changes, the necessary recompilations are done too. $ cat hoc.h typedef etruct symbol { /+ symbol table entry */ char mame; short types /+ VAR, BLTIN, UNDEF +/ union ( double va 7s $6 VAR +/ gouble (*ptr)(); 7s SE BURIN «/ dus struct Symbol next; /+ to link to another +/ } symbols Symbol tinstall(), #lookup(); s ‘The type UNDEF is a VAR that has not yet been assigned a value. The symbols are linked together in a list using the next field in Symbol. ‘The list itself is local to symbo1.c; the only access to it is through the func tions Lookup and install. This makes it easy to change to symbol table organization if it becomes necessary. (We did that once.) Lookup searches the list for a particular name and returns a pointer to the Symbol with that name if found, and zero otherwise. The symbol table uses linear search, which is entirely adequate for our interactive calculator, since variables are looked up only during parsing, not execution. instai1 puts a variable with its associ ated type and value at the head of the list. emalloc calls malloc, the stan dard storage allocator (ma11oc(3)), and checks the result. These three rou- tines are the contents of symbol.c. The file y.tab-h is generated by run- ning yace ~d; it contains #4efine statements that yace has generated for tokens like NUMBER, VAR, BLTIN, etc. 248 THE UNIX PROGRAMMING ENVIRONMENT ‘CHAPTER $ cat symbol.c #inelude "hoc .h" #include “y.tab.n* static Symbol *symlist = 0; /* symbol table: linked list +/ Symbol *lockup(2) /* find 8 in symbol table +/ char symbol +sp; for (sp = symlist; sp I= (Symbol +) 0; sp = sp-snext) Af (stremp(sp->nane, 5) == 0) return sp; return 0; 7+ 0 ==> not found */ ? symbol tinstali(s, t, 4) /+ install 6 in symbol table +/ char #83 int ti double 4; ‘ symbol +s: char semalloc(); sp = (Symbol +) emalloc( sizeof (symbol) ); sp->name = enalloc(strien(s)+1); /* +1 for ‘\0" «/ stropy(sp->name, 5: sp->type = th sp->u.val = a; sp->next = symlist; /+ put at front of list «/ symlist = spi return 8p ) char temalloc(n) /+ check return from malloc +/ unsigned nj char +p, *malloc()s p= malloc(n); if (ps 0) execerror("out of memory", (char *) 0); return pi ) s The file init.c contains definitions for the constants (PI, etc.) and func tion pointers for built-ins; they are installed in the symbol table by the function init, which is called by main. CHAPTER § 8 cat init.c #incluge "hoc.h” #include "y.tab-n" #include PROGRAM DEVELOPMENT 249, extern double Log(), Logt0(), Expl), sart(), integer(); static struct { 7s constants «/ char ename; double eval; ) constst} = ( spr", 3, 14159265358979323846, spt,” 2177828182843904523536, "Gamma", 0.57721566490 153286060, spec", '57.2957795 1308232087680, "PHI", — 1.61803398874989404820, °, o h static struct ( 7s Bailt-ins «/ char «name; double (+fune)(); } puittins(] = ¢ checks. checks checks checks argument argument argunent, argument, "tog",” bog, /+ “Logi0", Logt0, /+ sexe", “Exp. | /* “sqrt", Sart, /+ “ant",’ integer, "abs", fabs, °, ° a init() /* install constants and built-ins t int 4; syabol + for (i = 0; conatali).name; i++) install (consts[i].name, VAR, consts[i] for (i = 0; builtine[i).names i++) { /+ Baler +/ 7s deg/radian +/ 7s golden ratio */ ” ” ” in table «/ install (builtins[i).name, BLTIN, 0. s-su.ptr = builtins[i].func} seval) af ‘The data is kept in tables rather than being wired into the code because tables ate easier to read and to change. The tables are declared static so that they are visible only within this file rather than throughout the program. We'll ‘come back to the math routines like Log and Sqxt shortly. 250 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER With the foundation in place, we can move on to the changes in the gram- mar that make use of it. cat hoc.y x #include "hoc.h” extern double Pow(); * union { double val: / actual value +/ Symbol ‘sym; /+ symbol table pointer +/ i token NUMBER Meoken VAR BLTIN UNDEF wtype expr asgn wrignt ‘= west 14 =" weft fet 177 ‘left UWARYMINUS Wright ‘*/ — /* exponentiation «/ 1% List: /+ nothing «/ blise ‘\n" List asgn ’\n’ List expr ‘\n’ ( peinte(*\t%.eg\n", $2); } List error ’\n’ 4 yyerrok; > asgn: | VAR ’e expr ( $$2$%->u.vals$3; $1->type = VAR: } expr: NUMBER 1 VAR ( Sf (81->type == UNDEF) execerror( "undefined variable", $1->name); $8 = St-su.val; } expr ‘*’ expr { $8 = $1 + 83) expr ‘/' expr ( Ae ($3 == 0.0) execerror("division by zero", e267 $3; ) Eexpr ‘*/ expr { $8 = Pow(s1, $3); } EC expe 17 1 $80 925} =! expr prec UNARYMINUS { $$ = -82; } > 1 asgn Poautee “(7 expr 7)? (88 = (e($1->u.ptr))(83)5 } F expr ‘¢’ expr ($8 = $1 + $35) E expr ‘-" expr ( $8 = $1 - 83; ) % /+ end of granmar +/ CHAPTER & PROGRAM DEVELOPMENT 251 ‘The grammar now has asgn, for assignment, as well as expr; an input line that contains just VAR = expr is an assignment, and so no value is printed. Notice, by the way, how easy it was to add exponentiation to the grammar, including its right associativity The yace stack has a different Kunion: instead of referring to a variable by its index in @ 26-element table, there is a pointer to an object of type Symbol. The header file hoc. h contains the definition of this type. The lexical analyzer recognizes variable names, looks them up in the sym- bol table, and decides whether they are variables (VAR) or built-ins (BLTIN) The type returned by yylex is one of these; both user-defined variables and pre-defined variables like PI are VAR's. One of the properties of a variable is whether or not it has been assigned a value, so the use of an undefined variable can be reported as an error by yyparse. The test for whether a variable is defined has to be in the gram- mar, not in the lexical analyzer. When a VAR is recognized lexically, its con- text isn’t yet known; we don’t want a complaint that x is undefined when the context is perfectly legal one such as the left side of an assignment like 221 Here is the revised part of yylex: yytext) 7s nocd «/ Af (isalphate)) ( Symbol +8; char sbuf[100], #p = sbaf; ao { spre = cy } while ((c=getchar()) ungetc(c, stdin); ap = ‘0 AE sAnum(c)) jookup( sbuf)) 8 = install(sbuf, UNDEF, 0.0); yylval.sym return e->type UNDEF ? VAR : s->type; main has one extra line, which calls the initialization routine init to install built-ins and pre-defined names like PI in the symbol table. 282 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER § main(arge, argv) 7+ nocd +/ char eargv( i « int fpecaten(); progname = argv(0); init( set mp(begin) ; signal (SIGFPE, fpecaten); yyparse(); > The only remaining file is math.c. Some of the standard mathematical functions need an error-checking interface for messages and recovery — for example the standard function sqrt silently returns zero if its argument is negative. The code in math.c uses the error tests found in Section 2 of the unix Programmer's Manual; see Chapter 7. This is more reliable and portable than writing our own tests, since presumably the specific limitations of the rou- tines are best reflected in the “official” code. The header file con- tains type declarations for the standard mathematical functions. contains names for the errors that can be incurred. $ cat math.c ‘include #inelude extern int errno} double errcheck(); double Log(x) double x; return errcheck(iog(x), “log"): double Log 10(x) double x; return errcheck(1og10(x), "log10"); double Exp(x) double x; return errcheck(exp(x), “exp"); double sqrt (x) double x; return errcheck(sqrt(x), “sqrt"); CHAPTER PROGRAM DEVELOPMENT 253 double Pow(x, y) double x, yi ‘ return errcheck(pow(x,y), "exponentiation" ); ) double integer (x) double x; ‘ return (double) (long) x; } double errcheck(4, 8) /* check result of library call «/ double 4; char #8; 4€ (errno == EDOM) ¢ execerror(s, "argument out of domain"); } eise if (errno == ERANGE) { errno = 0; execerror(s, "result out of range ) return d s An interesting (and ungrammatical) diagnostic appears when we run yace ‘on the new grammar: $ yace hoo.y conflicts: 1 shift/reduce s The “shifvreduce” message means that the hoc3 grammar is ambiguous: the single line of input xa ccan be parsed in two ways: 254 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & lise I expr lise | | lish asgn list” asgn | | | | (emp) x= Xn (omy) x= Na ‘The parser can decide that the asgn should be reduced to an expr and then to a list, as in the parse tree on the left, or it can decide to use the following \n immediately ("shift") and convert the whole thing to a list without the inter- mediate rule, as in the tree on the right. Given the ambiguity, yacc chooses to shift, since this is almost always the right thing to do with real grammars, You should try to understand such messages, to be sure that yace has made the right decision.+ Running yace with the option -v produces a voluminous file called y.output that hints at the origin of conflicts. Bxercise 8-5. As hoc3 stands, i's legal to say prea good idea? How would you change hoc3 to prohibit assignment to “con- Exercise 8-6. Add the builtin function atan2(y,x), which returns the angle whose tangent is y/x. Add the builtin xana(}, which returns a floating point random var able uniformly distributed on the interval (0,1). How do you have to change the gram. ‘mar (0 allow for built-ins with different numbers of arguments? Exercise 8-7. How would you add a facility to execute commands from within hoe, similar to the 1 feature of other UNIX programs? Bxercise 8-8. Revise the code in math.c to use a table instead of the set of essentially identical functions that we presented, Another digression on make Since the program for hoc3 now lives on five files, not one, the makefile is more complicated: 7 The yace message “reducelreduce confi” indicates x serious problem, more aften the symptom of an outright error inthe grammar than an intentional umbigity. CHAPTER & PROGRAM DEVELOPMENT 255 5 cat makefile YFLAGS = -4 # force creation of y.tab-h OBJS = hoc.o init.o math.o symbol.o # abbreviation hoe3: $(0BIS) ce $(0B9S) -Im -0 hoc3 hoc.o: hoe.h init.o eymbol.o: hoc.h y-tab.h pr: @pr hoc.y hoc.h init.c math.c eymbol.c makefile clean: rm -£ $(0BJS) y.tab. {ch} ’ ‘The YELAGS = -d line adds the option -d to the yace command line gen- erated by make; this tells yace to produce the y.tab-h file of #define statements. The OBJS=... line defines a shorthand for a construct to be used several times subsequently. ‘The syntax is not the same as for shell variables, — the parentheses are mandatory. ‘The flag ~1m causes the math library to be searched for the mathematical functions. hhoc3 now depends on four .0 files; some of the .0 files depend on -b files. Given these dependencies, make can deduce what recompilation is needed after changes are made to any of the files involved. If you want to see what make will do without actually running the processes, try $ make -n On the other hand, if you want to force the file times into a consistent state, the -€ (“touch”) option will update them without doing any compilation steps. Notice that we have added not only a set of dependencies for the source files but miscellaneous utility routines as wel, all neatly encapsulated in one place. By default, make makes the first thing listed in the makefile, but if you name an item that labels a dependency rule, like symbol.o or pr, thet will be made instead, An empty dependency is taken to mean that the item is never “up to date,” so that action will always be done when requested. Thus $ make pr Ipr produces the listing you asked for on a line printer. (The leading @ in “@pr suppresses the echo of the command being executed by make.) And $ make clean removes the yacc output files and the .0 files. This mechanism of empty dependencies in the makefile is often 256 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER § preferable to a shell file as a way to keep all the related computations in a sin- gle file. And make is not restricted to program development — it is valuable for packaging any set of operations that have time dependencie: A digression on Lex ‘The program Lex creates lexical analyzers in a manner analogous to the way that yace creates parsers: you write a specification of the lexical rules of your language, using regular expressions and fragments of C to be executed when a matching string is found. Lex translates that into a recognizer. Lex and yace cooperate by the same mechanism as the lexical analyzers we have already written. We are not going into any great detail on Lex here; the fol- lowing discussion is mainly to interest you in learning more, See the reference manual for Lex in Volume 2B of the UNIX Programmer's Manual. First, here is the Lex program, from the file Lex.2; it replaces the func tion yylex that we have used so far. $ cat lex.2 x #include *hoc.n" include "y.tab-h* extern int lineno; ” we C\t) (3) /# skip blanks and tabs «/ [0-9] +\.7!(0-9]4\.(0-9}+ ‘sscanf(yytext, "%1f", Syylval-val); return NUMBER; [a-za-2}la-2A-20-9} { ‘symbol +8; Af ((s=lookup(yytext)) == 0) ‘5 = install(yytext, UNDEF, 0.0); yylval.eym = return s->type == UNDEF ? VAR : s->types ) Ne { Lineno++; return “\n’; ) /+ everything else +/ { return yytext(o}; } s Each “rule” is a regular expression like those in egrep or awk, except tha Lex recognizes C-style escapes like \t and \n. The action is enclosed it braces. The rules are attempted in order, and constructs like * and + match a long a string as possible. If the rule matches the next part of the input, thi action is performed. The input string that matched is accessible in a Le: string called yytext. ‘The makefile has to be changed to use Lex: CHAPTER PROGRAM DEVELOPMENT 257 $ cat makefile YELAGS = -4 ORS = hov.o Je: .9 init.o math.o aymbol.o hoe3: $085) ce $(0BJS) -Im ~11 -0 hoc3 hoc.o: hoc.h lex.0 init. symbol.o: hoe.h y-tab.h s Again, make knows how to get from a .1 file to the proper .o; all it needs from us is the dependency information. (We also have to add the Lex library =11 to the list searched by ce since the Lex-generated recognizer is not self- contained.) ‘The output is spectacular and completely automatic: $ make yace -d hoc.y conflicts: 1 shift/reduce ec -c y.tab.e zm y-tab.c mv y.tab.o hoe.o lex lex.2 ce -c lex.yy.c rm lex.yy.c ay lex.yy.0 lex.o ec -c init.e ce -o math.c ce -¢ symbol.c ce hoc-0 lex.0 init.o math.o symbol.o -Im -11 -0 hoe3 ‘ If a single file is changed, the single command make is enough to make an up-to-date version: $ touch Jex.1 Change modified-time of Lex. $ make lex lex.1 ec -c lex.yy.c zm lex.yy.e mv lex.yy.0 lex.o ce hoc-0 ex.o init.o math.o symbol.o -11 -Im -o hoc} s We debated for quite a while whether to treat Lex as a digression, to be illustrated briefly and then dropped, or as the primary tool for lexical analysis fonce the language got complicated. There are arguments on both sides. The 258 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER § main problem with lex (aside from requiring that the user learn yet another Janguage) is that it tends to be slow to run and to produce bigger and slower recognizers than the equivalent C versions. It is also somewhat harder to adapt its input mechanism if one is doing anything unusual, such as error recovery or even input from files. None of these issues is serious in the con- text of hoc. The main limitation is space: it takes more pages to describe the Lex version, so (regretfully) we will revert to C for subsequent lexical analysis. It is a good exercise to do the Lex versions, however. Exercise 8-9. Compare the sizes of the two versions of hoc3, Hint: see s4ze(1). 0 8.4 Stage 4: Compilation into a machine We are heading towards hocS, an interpreter for a language with control flow. hoed is an intermediate step, providing the same functions as hoc3, but implemented within the interpreter framework of hoc5. We actually wrote hhocé this way, since it gives us two programs that should behave identically, which is valuable for debugging. As the input is parsed, hocd generates code for a simple computer instead of immediately computing answers. Once the end of a statement is reached, the generated code is executed (“interpreted”) to compute the desired result The simple computer is a stack machine: when an operand is encountered, it is pushed onto a stack (more precisely, code is generated to push it onto a stack); most operators operate on items on the top of the stack, For example, to handle the assignment xeaey the following code is generated: constpush Push a constant onto stack 2 the constant 2 varpush Push symbol table pointer onto stack y Jor the variable y eval Evaluate: replace pointer by value mul Multiply top two items; product replaces them varpush Push symbol table pointer onto stack * {Jor the variable x assign Store value in variable, pop pointer pop Clear top value from stack STOP End of instruction sequence ‘When this code is executed, the expression is evaluated and the result is stored in x, as indicated by the comments. The final pop clears the value off the stack because it is not needed any longer. Stack machines usually result in simple interpreters, and ours is no excep- tion — it’s just an array containing operators and operands, The operators are the machine instructions; each is a function call with its arguments, if any, fol: lowing the instruction, Other operands may already be on the stack, as they ‘CHAPTER § PROGRAM DEVELOPMENT 259) were in the example above. The symbol table code for hoc4 is identical to that for hoc3; the initializa- tion in init.c and the mathematical functions in math. are the same as well. The grammar is the same as for hoe3, but the actions are quite dif ferent. Basically, each action generates machine instructions and any argu- ‘ments that go with them. For example, three items are generated for a VAR in fan expression: a varpush instruction, the symbol table pointer for the vari- able, and an eval instruction that will replace the symbol table pointer by its value when executed. The code for *s’ is just mul, since the operands for that will already be on the stack. $ cat hoo. er #inelude “hoc.h" #4efine code2(ct,e2) _code(ct); code(e2) #4efine code3(ct,02,c3) code(ct); code(e2); code(e3) ” Kunion ( symbol ‘sym; /s symbol table pointer +/ Inst tinst; /+ machine instruction */ Xtoken NUMBER VAR BLTIN UNDEF weight “= wlefe “47 7-7 Kleft "4" 77" Kieft UNARYMINUS %rignt ‘7’ /* exponentiation «/ *% List: /+ nothing +/ List ‘\n’ List asgn ‘\n’ { code2(pop, STOP); return 1; } List expr ‘\n’ { code2(print, STP); return 1; ) List error ’\n’ ( yyerrok; } asgn: VAR “=” expr { code3(varpush, (Inst)$1,assign); 260 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & expr: NUMBER { code2(constpush, (Inst)$1); var { code3(varpush, (Inst)s1, eval); } asgn BLTIN ‘(expr ‘)’ { codez(bitin, (Inst)$1->u.ptr); } °C expe 1)" expr ‘+’ expr ( code(adai ‘ ) expr ’-/ expr ( code(sub); } expr ‘+’ expr ( code(mul); } expr ‘/' expr ( code(div); } ‘ expr ‘*/ expr { code(power); } ‘2 expr %prec UNARYMINUS ( code(negate); ) wh 7s end of grammar +/ Inst is the data type of a machine instruction (a pointer to a function return- ing an int), which we will return to shortly. Notice that the arguments to code are function names, that is, pointers to functions, or other values that are coerced to function pointers. We have changed main somewhat. The parser now returns after each statement or expression; the code that it generated is executed. yyparse returns zero at end of file main(arge, argv) 7+ hock +/ char sargvl li ‘ int fpecateh(); progname = argvi0}; init); set jmp(begin) ; signal(SIGFPE, fpecatch); for (initcode(); yyparse(); initcode()) execute (prog); return 0; > The lexical analyzer is only a little different. The main change is that ‘numbers have to be preserved, not used immediately. The easiest way t0 do this is to install them in the symbol table along with the variables. Here is the changed part of yylex: ‘CHAPTER PROGRAM DEVELOPMENT 261 yylext) /+ noc +/ Af (cme '.7 tf isaigiticy) (| /* number «/ ‘double aj ungete(c, stdin); scant("%1E", 8d); yylval.sym = install(", NUMBER, 4); return NUMBER; Each element on the interpreter stack is either a floating point value or a pointer to a symbol table entry; the stack data type is a union of these. The machine itself is an array of pointers that point either to routines like mul that perform an operation, or to data in the symbol table. ‘The header file hoc. has to be augmented to include these data structures and function declarations for the interpreter, so they will be known where necessary throughout the pro- gram. (By the way, we chose to put all this information in one file instead of two. In a larger program, it might be better to divide the header information into several files so that each is included only where really needed.) $ cat hoc.h typedef struct Symbol { /+ symbol table entry */ char «name; short type; / VAR, BLTIN, UNDEF +/ union ( double val 7s 36 VAR +/ double («ptr)(); 7s is BLTIN +/ dus struct Symbol ‘next; /+ to link to another +/ ) symbol: Symbol tinstall(), +lookup(); typedef union Datun { /+ interpreter stack type */ double vali Symbol +sym; ) Datum extern Datum pop); typedef int (+Inst)();/* machine instruction */ faefine stoP (Inst) 0 extern Inst progi}; extern eval{), add(), sub(), mul(), div(), negate), power() extern assign(), bltin(), varpush(), constpusn(), print(); ‘ ‘The routines that execute the machine instructions and manipulate the stack are kept in a new file called code.c. Since it is about 150 lines long, we will 262 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & show it in pieces, $ cat code.c #include "hoch" #include "y.tab.n" faefine NSTACK 256 static Datum stack{NSTACK]; /+ the stack +/ static Datum stackp; 7+ next free spot on stack +/ #aefine NPROG 2000 Inst prog(NPROG]; /* the machine */ Inst sprogpi 7+ next free spot for code generation +/ Inst +pes, /+ program counter during execution +/ initcode() /* initialize for code generation +/ t stackp = stack; progp = prog: > The stack is manipulated by calls to push and pop: push(4) /+ push d onto stack */ Datum 4; ‘ Af (stackp >= katack(NSTACK]) execerror("stack overflow", (char +) 0); sstackpte = dy ) Datum pop() /* pop and return top elem from stack +/ t Af (stackp <= stack) execerror("stack underflow", (char «) 0); return + stackp; , The machine is generated during parsing by calls to the function code, which simply puts an instruction into the next free spot in the array prog. It returns the location of the instruction (which is not used in hoc4). CHAPTER PROGRAM DEVELOPMENT 263 Inst +code(f) _/+ install one instruction or operand +/ Inst £5 a Inst soprogp = progp; if (progp >= EprogiNPROG]} execerror("program too big", (char +) 0); sprogptt = ft return oprogp; ) Execution of the machine is simple; in fact, it’s rather neat how small the routine is that “runs” the machine once it’s set up: execute(p) /+ von the machine «/ Inst *pi ‘ for (pe = pi spe I= STOP; } CeGpere)) O05 , Each cycle executes the function pointed to by the instruction pointed to by the program counter pe, and increments pe so it’s ready for the next instruction. An instruction with opcode STOP terminates the loop. Some instructions, such as constpush and varpush, also increment pe to step over any arguments that follow the instruction, constpush() /# push constant onto stack +/ ‘ Datum 4; d.val = ((Symbol *)epo++)->u.valt push(a); > varpush() /+ push variable onto stack */ { Datum 4; sym = (Symbol +) («pe++); push(a); y ‘The rest of the machine is easy. For instance, the arithmetic operations are all basically the same, and were created by editing a single prototype. Here is ada: 264 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER § aaa) > 7 add top two elems on stack */ Datum 41, 425 42 = popt) at = pop() t.val += d2.vals push(at ‘The remaining routines are equally simple. evan) ‘ ) assign() ( , print() ‘ } bitin’) ¢ » ‘The hardest part /+ evaluate variable on stack */ patum 4; @ = popt): Lf (d.sym->type == UNDEF) execerror("undefined variable", é.syn->name); é.val = 4.sym->u.val; push(a); 7+ assign top value to next value +/ Datum 41, 425 a1 = ppt) 42 = pop); 42 (A1.sym->type I= VAR 68 d1.sym->type = UNDEF) execerror("assignnent to non-variable", 1. sym->name) ; t.sym-ou.val = 62.val} at.sym->type = VAR; push(d2); 7+ pop top value from stack, print it +/ patum 4; = popl prints("\t%.8g\n", diva): /+ evaluate built-in on top of stack +/ Datum 4; a = popt d.val = (4(double (+)()) (spots) )(.val) s push(a); is the cast in bltin, which says that #pc should be cast to “pointer to function returning a double,” and that function executed with val as argument. ‘The diagnostics in eval and assign should never occur if everything is ‘CHAPTER § PROGRAM DEVELOPMENT 265 working properly; we left them in in case some program error causes the stack to be curdied, The overhead in time and space is small compared to the bene- fit of detecting the error if we make a careless change in the program. (We did, several times.) C's ability to manipulate pointers to functions leads to compact and efficient code. An alternative, to make the operators constants and combine the seman- tic functions into a big switch statement in execute, is straightforward and is left as an exercise. A third digression on make As the source code for hoc grows, it becomes more and more valuable to keep track mechanically of what has changed and what depends on that. The beauty of make is that it automates jobs that we would otherwise do by hand (and get wrong sometimes) or by creating a specialized shell file We have made two improvements to the makefile. The first is based on the observation that although several files depend on the yace-defined con- stants in y.tab.b, there's no need to recompile them unless the constants, change — changes to the C code in hoc.y don’t affect anything else. In the new makefile the .o files depend on a new file x.tab-h that is updated only when the contents of y.tab-h change. The second improvement is to make the rule for pr (printing the source files) depend on the source files, so that only changed files are printed. ‘The first of these changes is a great time-saver for larger programs when the grammar is static but the semantics are not (the usual situation). The second change is a great paper-saver Here is the new makefile for hoca: YFLAGS = -€ OBIS = hoc.o code.o init.o math.o aynbol.o nocd: #(0BIS) cc $(OBJS) -Im -0 hood hoc.0 code.o init.o symbol.o: hec.h code.o init.o symbol.o: x.tab-h x.tabsh: y.tab-n -emp ~S x.tab.h y.tab.h I! ep y.tab.h x.tab-h PE: hov.y hoe-h code.c init.c math.c eymbol.c pr 8? @touch pr clean: em -£ $(0BJS) [xy]. tab. [oh] 266 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & ‘The ‘~' before cmp tells make to carry on even if the cmp fails; this permits the process to work even if x.tab.h doesn’t exist. (The ~s option causes mp to produce no output but set the exit status.) The symbol $? expands into the list of items from the rule that are not up to date. Regrettably, make’s notational conventions are at best loosely related to those of the shell To illustrate how these operate, suppose that everything is up to date. Then’ $ touch hoe.y Change date of noc-y ‘$ make yace ~4 hoc.y conflicts: 1 shift/reduce ce -c y.tab.c xm y.tab.e nv y.tab-o hoc.o comp -s x-tab.h y.tab.h 1! ep y.tab.n x.tab.h cc hoc.0 code.o init.o math.o symbol.o -1m -o hocd 8 make -n pr Print changed files PE hoc.y touch pr s Notice that nothing was recompiled except hoc.y, because the y.tab.h file was the same as the previous one. Exercise 8-10. Make the sizes of stack and prog dynamic, so that hoc never runs ‘out of space if memory can be obtained by calling malloc. Exercise 8-11. Modify hocd to use a switch on the type of operation in execute instead of calling functions. How do the versions compare in lines of source code and ‘execution speed? How are they likely to compare in ease of maintenance and growth? 8.5 Stage 5: Control flow and relational operators This version, hocS, derives the benefit of the effort we put into making an interpreter. It provides if-else and while statements like those in C, state- ment grouping with { and }, and a print statement. A full set of relational operators is included (>, >=, etc.), as are the AND and OR operators && and 1. (These last two do not guarantee the left-to-right evaluation that is such an asset in C; they evaluate both conditions even if itis not necessary.) ‘The grammar has been augmented with tokens, non-terminals, and produc- tions for i£, while, braces, and the relational operators. This makes it quite a bit longer, but (except possibly for the if and white) not much more com- plicated CHAPTER & $ cat hoc.y x #include faefine define » union { ) xtoken xtype wright mere mere mete mest meet meee Height *% List asgn: stmt: cond: waite: PROGRAM DEVELOPMENT 267 snoc-n" codez(ct,e2) code(c1); code(e2) code3(c1,02,c3) code(c1); code(e2); code(e3) symbol ‘sym; /+ symbol table pointer +/ Inet tint; /+ machine instruction */ NUMBER PRINT VAR BLTIN UNDEF WHILE IP ELSE stmt asgn expr stmtlist cond while if end oR Gr GE LP LE ONE fae UNARYMINUS NOT 7+ nothing +/ blise “\n" {list asgn ‘\n’ { codea(pop, STOP); return 1; ) 1 list stmt ‘\n’ { code(stop); return 1; } [list expr ’\n’ { code2(print, STOP); retuen 1; } f lise error ‘\n’ ( yyerrok; VAR ‘=’ expr ( $8583; code3(varpush, (znst)$1,assign); exer { code(pop); } } PRINT expr { code(prexpr); $$ = $2; f while cond stmt end { ($1)01] = (Inst)$3; 7 body of loop */ ($1)[2] = (Inst)$4; } 7+ end, Lf cond fails +/ {Af cond stmt end ( /+ else-lese if +/ (SDC) = (Inst)$3; 7+ thenpart +/ ($1)(3) = (Inst)$4; } 7+ end, if cond faits +/ (S01) = (Inst)$3; 7+ thenpart +/ (s1)(2) © Gases 7+ elsepart +/ ($1131 = Gnst)$7; ) 7+ end, if cond faite +/ FU eemelise 7)” ($s = 825) “U expr “)’ ( code(stop); $$ = $2; } WHILE { $$ = code3(whilecode, STOP, STOP); } 1 if cond stmt ené ELSE stmt end ( /+ if with else +/ > 268 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & if: IF ( S8=code(itcode); code3(stoP, STOP, STOP); } end: /+ nothing +/ { code( stor); $$ = progpi } atmtlist! /+ nothing «/ {88 = progp; } semtlist ‘\n’ semtlist seme expr: | NUMBER { $$ = codea(constpush, (rnst)$1); } 1 vaR { $8 = code3(varpush, (Inst)$1, eval); } 1 asgn BUTIN “(/ expr ‘)’ ($8 = $3; code2(bitin, (inst) $1->u. ptr! °C expe)’ (88 = $25) expr ‘+’ expr ( code(add); ) expr ’-’ expr { code(sub); } expr ‘+’ expr { code(mul); } expr ‘/* expr { code(aiv); } { code (power); } =" expr prec UNARYMINUS { $$ = $2; code(negate); } expr GT expr ( code(gt); expr GE expr { code(ge); { t ‘ ‘ ‘ expr ‘*/ expr expr LE expr ( code(1e); expr EQ expr expr NE expr expr AND expr expr OR expr NOT expr code(eq) code(ne) code(an); } code(or); } $$ = $2; code(not); } d ) expr LT expr { code(1t); } ) ) ) xx The grammar has five shiftireduce conflicts, all like the one mentioned in hoe3. Notice that STOP instructions are now generated in several places to ter- minate a sequence; as before, progp is the location of the next instruction that will be generated. When executed these STOP instructions will terminate the loop in execute. The production for end is in effect a subroutine, called from several places, that generates a STOP and returns the location of the instruction that follows it ‘The code generated for while and if needs particular study. When the keyword while is encountered, the operation whilecode is generated, and its position in the machine is returned as the value of the production while: WHILE At the same time, however, the two following positions in the machine are also reserved, to be filled in later. ‘The next code generated is the expression that makes up the condition part of the while. The value returned by cond is the CHAPTER & PROGRAM DEVELOPMENT 269 beginning of the code for the condition. After the whole while statement has been recognized, the two extra positions reserved after the whilecode instruction are filled with the locations of the loop body and the statement that follows the loop. (Code for that statement will be generated next.) | while cond stmt end ( (80011 = Ginst)s: /+ body of loop +/ ($1)(2] = (Inst)s4; ) 7+ end, if cond fails +/ $1 is the location in the machine at which whilecode is stored; therefore, ($1)(4} and ($1)(2) are the next two positions. A picture might make this clearer: “srr —] Body ‘SOP The situation for an if is similar, except that three spots are reserved, for the then and else parts and the statement that follows the ig. We will return shortly to how this operates Lexical analysis is somewhat longer this time, mainly to pick up the addi- tional operators: 20 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & yylex() 7 noes +/ awiten Ce) [ case ">": return follow(’*’, GE, G7); case return follow(’*’, LE, LT case return follow(’2’, BQ, ‘=’) case return follow(’=", NE, NOT); case return follow('!’, oR, 117); case return follow(’6") AND, ‘67; case Linenose; return ‘\n’5 default: return ¢; > foliow looks ahead one character, and puts it back on the input with ungete if it was not what was expected. follow(expect, ifyes, ifno) /+ look ahead for >=, ete. +/ ( int ¢ = getchar(); Lf (c == expect) return ifyea: ungetc(c, stdin! return ifno; , There are more function declarations in hoc. — all of the relationals, for instance — but it’s otherwise the same idea as in hoc4, Here are the last few $ cat hoc.b typedef int (#Inst)(); /* machine instruction +/ fdefine sToP (Inst) 0 extern Inst progl], sprogp, #code(); extern eval(), add(), sub(), mul(), div(), negate(), power); extern assign() bitin(), varpush(}, constpush(), print()s extern prexpr(); extern gt), 1t(), eg(), ge(), 1eC}, me(), and(), OFC), nob()s extern ifcode(), whilecode(); Most of codec is the same too, although there are a lot of obvious new rou- tines to handle the relational operators. The function 1e (“less than or equal to") is a typical example: CHAPTER & PROGRAM DEVELOPMENT 271 re) Datum 41, 42: 82 = popt) at = pop); dt.val » (double) (é1.val <= 42.val); push(at); > The two routines that are not obvious are whilecode and ifcode. The critical point for understanding them is to realize that execute marches along ‘a sequence of instructions until it finds a STOP, whereupon it returns. Code generation during parsing has carefully arranged that a STOP terminates each Sequence of instructions that should be handled by a single call of execute ‘The body of a while, and the condition, then and else parts of an if are all handled by recursive calls to execute that return to the parent level when they have finished their task. The control of these recursive tasks is done by code in whilecode and ifcode that corresponds directly to while and if statements. whilecode() ( patun aj Inst *savepe = poi 7+ loop body «/ execute(saveper2); 7+ condition +/ = port while (d-val) ( execute(s((Inst ##)(savepe))); /* body +/ execute (savepe+2); @ = popt); , pe = #((Inst e*)(savepc+t)); /+ next statement #/ y Recall from our discussion earlier that the wailecode operation is followed by a pointer to the body of the loop, a pointer to the next statement, and then the beginning of the condition part. When whilecode is called, pe has already been incremented, so it points to the loop body pointer. Thus pe+1 points to the following statement, and pe+2 points to the condition. code is very similar; in this case, upon entry pe points to the then part, pert to the else, pe+2 to the next statement, and pe+3 is the condition. 272 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & Afeode() ‘ Datum a; Inst *savepe = pe; 7+ then part +/ execute(saveper3) /* condition +/ @ = pops s€ (d.val) fexecute(+((Inst #+)(savepe) 1) else if (#((Inat #4)(savepcr1))) /* else part? +/ execute(+((Inst ++) (savepe+1)})5 pe = #((Inst ++)(savepe+2));/# next stmt «/ » ‘The initialization code in init.c is augmented a little as well, with a table of keywords that are stored in the symbol table along with everything else: $ cat init.c static struct { /* Keywords +/ char sname; int val; mr, ELSE, WHILE, "print", PRINT, °, °, We also need one more loop in init, to install keywords. for (4 = 0} keywords[i].name; iss) install (keywords[i].name, keyworas[i}-kval, 0.0); No changes are needed in any of the symbol table management; code.c contains the routine prexpr, which is called when an statement of the form print expr is executed. prexpr() /* print numeric value */ c Datum 4; @ = popl); peintf("%.8g\a", d.val)s > This is not the print function that is called automatically to print the final result of an evaluation; that one pops the stack and adds a tab to the output. hoe5 is by now quite a serviceable calculator, although for serious pro- gramming, more facilities are needed. The following exercises suggest some CHAPTER # PROGRAM DEVELOPMENT 273 possibilities Exercise 8-12. Modify hocS to print the machine it generates in a readable form for debugging. © Exercise 8-13. Add the assignment operators of C, such as ‘ment and decrement operators ++ and --. Modify && and 1! so they guarantee left~ to-right evaluation and early termination, as in C. © Exercise 8-14. Add a for statement like that of C to hoc5. Add break and Exercise 8-15. How would you modify the grammar or the lexical analyzer (or both) of hnoeS 10 make it more forgiving about the placement of newlines? How would you add semicolon as a synonym for newline? How would you add a comment convention? What syntax would you use? 0 Exercise 8-16. Add interrupt handling to hoc, so that a runaway computation can be stopped without losing the state of variables already computed. #2, ete., and the inere- Exercise 8-17. It is a nuisance to have to ereate a program in a file, run it, then edt the file 1 make a trivial change. How would you modify hoc to provide an edit com- ‘mand that would cause you to be placed in an editor with a copy of your hoc program already read in? Hint: consider a text opcode. 8.6 Stage 6: Functions and procedures; input/output The final stage in the evolution of hoc, at least for this book, is a major increase in functionality: the addition of functions and procedures. We have also added the ability to print character strings as well as numbers, and to read values from the standard input. noc6 also accepts filename arguments, includ- ing the name “~” for the standard input. Together, these changes add 235 lines of code, bringing the total to about 810, but in effect convert hoc from a calculator into a programming language. We won't show every line here; Appendix 3 is a listing of the entire program so you can sce how the pieces fit together. In the grammar, function calls are expressions; procedure calls are state- ments, Both are explained in detail in Appendix 2, which also has some more examples. For instance, the definition and use of a procedure for printing all the Fibonacci numbers less than its argument looks like this: 274 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & s cat fib proc #10) ( aso bat while (b < $1) { print b cob b= arb ) print "\nt ) § hocé Fi - £40( 1000) 112358 19 21.34 55 89 144 233 377 610 987 This also illustrates the use of files: the filename Here is a factorial function: * is the standard input. $ cat fac fune fact) ¢ 4€ ($1 <= 0) return 1 else return $1 * fac($1-1) - § hoc fac - fac(0) 1 fact?) 5040 fac(10) 3628800 Arguments are referenced within a function or procedure as $1, etc., as in the shell, but it is legal to assign to them as well. Functions and procedures are recursive, but only the arguments are local variables; all other variables are global, that is, accessible throughout the program. hoe distinguishes functions from procedures because doing so gives a level of checking that is valuable in a stack implementation. It is too easy to forget a return or add an extra expression and foul up the stack. ‘There are a fair number of changes to the grammar to convert hocS into hhoc6, but they are localized. New tokens and non-terminals are needed, and the Xunion declaration has a new member to hold argument counts: (CHAPTER & PROGRAM DEVELOPMENT 275 $ cat hoc.y wanion ( symbol 7+ symbol table pointer +/ Inet 7+ wachine instruction */ ant /* number of arguments */ ) token NUMBER STRING PRINT VAR BLTIN UNDEF WHILE IF ELSE Xtoken FUNCTION PROCEDURE RETURN FUNC PROC READ Yeoken ARG Mtype expr stmt asgn prlist stmtlist wtype cond while if begin end xtype —procnane xtype arglist, lise! 7+ nothing +/ List ‘\n' List defn ’\n’ List asgn ‘\n’ ( code2(pop, STOP); return 1; } List stmt ‘\n’ { code(stoP}; return 1; } List expr ‘\n’ ( code2(print, STOP); return 1; ) List error ‘\n’ ( yyerrok: } asgn: VAR ‘=/ expr ( code3(varpush,(Inst)$1,assign); $8283; } fARG ‘=! expr { defnoniy(*s" code2(argassign, (Inst)$1); $$9§3;) stmt: expr code(pop)i } RETURN ( defnonly("zeturn"); code(procret: RETURN expr { defnoniy(*return"); $$=$2; code(funcret); ) 1 PROCEDURE begin “(’ arglist ‘)’ { 88 = $2; code3(cail, (Inst), (inst)s4); I PRINT priist {$$ = $2; ) ? expr! NUMBER { $$ = code2(constpush, (Inst}$1); } VAR { $$ = code3(varpush, (Inst)$1, eval); } ARG { defnonly("$"); $$ = code2(arg, (Inst) $1); asgn FUNCTION begin “(’ arglist ’)’ ($$ = $2; code3(call,(Inst)$1,(xnst)$4); ) 1 READ “(VAR ’)’ { $$ = code2(varread, (Inst)$3); } begin: /+ nothing +/ ($8 progps } 276 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & prlist: expr { code(prexpr); } STRING ($8 = codeaiprstr, (Iast)$1)s } prlist ’,/ expr ( code(prexpr); } prlist ‘,’ STRING — { code2(prstr, (Inst)$3); } defn: FUNC procname ( $2->type=PUNCTION; indef=1; } “(CO 1)" stmt { code(procret); define ($2); indef0; } 1 PROC procname { $2->typesPROCEDURE; indefe1; ) “(UO ')" stmt { code(procret); define($2); inde a proename: VAR | eunerzoN | PROCEDURE arglict:| /+ nothing «/ | expr | arglist ",’ expr ‘ 1% The productions for arglist count the arguments. At first sight it might seem necessary to collect arguments in some way, but it’s not, because each expr in an argument list leaves its value on the stack exactly where it’s wanted. Knowing how many are on the stack is all that’s needed ‘The rules for defn introduce a new yace feature, an embedded action. It is possible to put an action in the middle of a rule so that it will be executed during the recognition of the rule, We use that feature here to record the fact that we are in a function or procedure definition. (The alternative is to create ‘4 new symbol analogous to begin, to be recognized at the proper time.) The function defnonly prints a warning message if a construct occurs outside of the definition of a function or procedure when it shouldn't, There is often a choice of whether to detect errors syntactically or semantically; we faced one earlier in handling undefined variables. The defnonly function is a good example of a place where the semantic check is easier than the syntactic one. defnonly(s) _/+ warn if illegal definition «/ char +8; c Af (Linde) execerror(s, "used outside definition"); ) The variable indeg is declared in hoc.y, and set by the actions for defn. The lexical analyzer is augmented by tests for arguments — a $ followed by number — and for quoted strings. Backslash sequences like \n are inter preted in strings by a function backs1ash. CHAPTER 6 PROGRAM DEVELOPMENT — 277 yylex() 7+ hoes */ if (ec +87) ( 7» azgument? +/ int n= 0; while (isdigit(cegete(fin))) n= Went e- 0% ungetc(e, fin); Af (n == 0) execerror("strange §...", (char +)0)5 yylval-narg = 15 4€ (eo “) ( /» quoted string +/ char sbuf[ 100], *p, semalloc(}; for (p = sbuf; (csgete(fin)) | ‘ 4B (eee "\n? Ht EOF) execerror ("missing quote”, ""); 4€ (p >= sbuf + sizeot(sbaf) - 1) ( sp = ‘\0" execerror("string too long", sbuf) D sp = backslash(e); ) sp = 0 yylval.sym = (Symbol +)emalloc(strlen(sbuf)+1); stropy(yylval.eym, sbuf); return STRING; backslash(e) /+ get next char with \'s interpreted +/ inte: ‘ char sindex(); /» ‘strebr()/ in some systems +/ static char transtabl] = "b\bf\fn\nr\rt\t"; de (o Im) c = gete(fin); Af (Aslower(e) && index(tzanstab, ¢)) return index(transtab, €)(1]; return ¢; , A Texical analyzer is an example of a finite state machine, whether written in C or with a program generator like Lex. Our ad hoc C version has grown fairly complicated; for anything beyond this, 1ex is probably better, both in size of source code and ease of change. Most of the other changes are in codec, with some additions of function names {0 hoc.h. The machine is the same as before, except that it has been 278 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & augmented with a second stack to keep track of nested function and procedure calls. (A second stack is easier than piling more things into the existing one.) Here is the beginning of code. c $ cat code.c #define NPROG 2000 Inst prog[NPROG]; /+ the machine */ Inst *progpt 71 next free spot for code generation «/ Inst «pe; 7» program counter during execution / Inst *progbase = prog; /+ start of current subprogram */ int returning; 7+ 1 Af return stmt seen +/ typedef struct Frame ( /s proc/fune call stack frame +/ symbol sp; /s symbol table entry */ Inet sretpe; /+ where to reaune after return «/ Datum sargn; | /+ n-th argument on stack #/ int nargs; /+ number of arguments +/ ) Frame; #aefine NFRAME 100 Frame frane{NPRAME); Frame +£p; /* frame pointer +/ initeode() { progp = progbase; stackp = stack; fp = frame; returning = Since the symbol table now holds pointers to procedures and functions, and to strings for printing, an addition is made to the union type in hoc -h: $ cat hoo-h typedef struct symbol { /+ symbol table entry «/ char sname; short types union { @oubie vals 7s VAR 47 double («ptr)() 7s BUTIN «/ int (edeen) 5 ‘7+ FUNCTION, PROCEDURE */ char sete 7+ STRING +7 yu struct Symbol snext; /+ to link to another +/ > symbol ’ During compilation, a function is entered into the symbol table by define, which stores its origin in the table and updates the next free location after the CHAPTER & PROGRAM DEVELOPMENT 279 ‘generated code if the compilation is successful. define(sp) /* pat func/proc in symbol table +/ Symbol *sp; c sp->u.defn = (Inst)progbase; /+ start of code +/ progbase = progp: 74 next code starts here */ > When a function or procedure is called during execution, any arguments have already been computed and pushed onto the stack (the first argument is the deepest). The opcode for call is followed by the symbol table pointer and the number of arguments. A Frame is stacked that contains all the interesting information about the routine — its entry in the symbol table, where to return after the call, where the arguments are on the expression stack, and the number of arguments that it was called with. The frame is created by call, which then executes the code of the routine, cant) /s call a function «/ « (symbol +)pc{0]; /+ symbol table entry +/ 7+ for function */ &frame[NFRAME-1]) execerror(sp->name, “call nested too deeply") fp->ep = api fp-onarge = (int)pel 1]; fp->retpe = pe + 2; fp->argn = stackp ~ execute (sp->u. defn) returning = 0; /* last argument +/ , This structure is illustrated in Figure 8.2. Eventually the called routine will return by executing either a procret. or a £uncret: funeret() 7+ return from a function +/ ‘ Datum a; Af (£p->sp->type == PROCEDURE) execerror(fp->sp->name, "(proc) returns value"); a= popt); /* preserve function return value +/ ret) push(a)s 280 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & Machine Frame Stack L- stackp eal po-of_+sym args ara? args Fetpe argi =D ‘Symbol Table Entry Figure 8.2: Data structures for procedure call procret() /+ return from a procedure +/ t Af (£p->ep->type == FUNCTION) execerror( fp->sp->name, “(fune) returns no value"); ret): > The function ret pops the arguments off the stack, restores the frame pointer Ep, and sets the program counter. ret) 7+ common return from funo or proc #/ { int iy for (4 = 0; 4 < Ep-snargs: i++) opt}; /* pop arguments +/ pe = (Inst #)fp->retpe; £55 returning = 1; ) Several of the interpreter routines need minor fiddling to handle the situa- tion when a return occurs in a nested statement. This is done inclegantly but adequately by a flag called returning, which is true when a return state- ‘ment has been seen, ifcode, whilecode and execute terminate early if returning is set; call resets it to zero. ‘CHAPTER & PROGRAM DEVELOPMENT 281 Lfcode() t Datum 4; Inst *savepe © pes 7s then part +/ execute(saveper3); 7» condition +/ @ = popt)s af (@.va3) execute(«((Inst ++) (savepe))); else if (#( (Inst #+)(savepc+1))) /* else part? +/ execute(+((Inst ++) (savepe+1))); 4€ (1returning) pe = *((Inst +4) (saveper2)); /+ next stmt +/ > whitecode() c patun 4; Inst +savepe = pes execute (savepe+2); /+ condition +/ @ = popl); while (d.val) { execute(*((Inst +#)(savepe))); /+ body +/ A€ (xeturning) break; execute(eavepe+2); 7+ condition +/ a= popl); , if (Ireturning) pe = #((Inst +) (saveper)); /# next stmt +/ ) execute(p) inst «pi ‘ for (pe = pj *pe I= STOP §5 Ireturning: (lepers) OF ) Arguments are fetched for use or assignment by getarg, which does the correct arithmetic on the stack double +getarg() /* return pointer to argument */ ‘ Ant nares = (int) #pesss Af (narge > £p->nargs) execerror(fp->sp->name, "not enough arguments"); return Gfp->argn{nargs - £p->nargs].val; 282 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & arg() /* push argument onto stack +/ { patum 4; @.val = *getarg( pash(a); 7» store top of stack in argument +/ patum 4; @ = popt; push(a); 7+ leave value on stack «/ sgetarg() = d.val; ) Printing of strings and numbers is done by pestr and prexpr. pretr() /» print string value +/ c printé("%s", (char #) *per+); > prexpr() /* print numeric value +/ patum aj @ = port); print ("%.6g + diva); > ‘Variables are read by a function called varread. It returns 0 if end of file occurs; otherwise it returns 1 and sets the specified variable. ‘CHAPTER § PROGRAM DEVELOPMENT 283 varread() /+ wead into variable +*/ ‘ Datum a; extern PILE «fin; Symbol tvar = (Symbol *) «pees; again: switch (fecanf(fin, "x1f", Svar->u.val)) { Lf (moreinput ()) goto Again; d.val = var->u.val = 0.0; break; case 0: execerror("non-nunber read into", var->name breaks default: aval = 1.05 breaks , var-stype = VAR: push(4); > If end of file occurs on the current input file, varread calls moreinput, which opens the next argument file if there is one. moreinput reveals more about input processing than is appropriate here; full details are given in Appen- dix 3. This brings us to the end of our development of hoc. For comparison pur- poses, here is the number of non-blank lines in each version: noe! 59 hoc2 94 noc} 248 (Lex version 229) nocd 396 hoes 574 hocé 809 Of course the counts were computed by programs $ sed “/"$/a" ‘pick + [ehyl]* f we -1 ‘The language is by no means finished, at least in the sense that it’s still easy to think of useful extensions, but we will go no further here. The following exer- cises suggest some of the enhancements that are likely to be of value Exercise 8-18. Modify hoc6 to permit named formal parameters in subroutines as an alternative to #1, etc. 0 Bxercise 8-19. As it stands, all variables are global except for parameters. Most of the mechanism for adding local variables maintained on the stack is already present. One approach is to have an auto declaration that makes space on the stack for variables 284 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER & listed; variables not so named are assumed to be global. The symbol table will also have to be extended, so that a search is made first for locals, then for globals. How does this interact with named arguments? Exercise 8-20. How would you add arrays to hoe? How should they be passed to fune. tions and procedures? How are they returned? 0 Exercise 8-21. Generalize string handling, so that variables can hold strings instead of numbers. What operators are needed? The hard part of this is storage management: making sure that strings are stored in such a way that they are freed when they ure not needed, s0 that storage does not leak away. As an interim step, add better facilities for output formatting, for example, access to some form of the C print statement. © 8.7 Performance evaluation We compared hoc to some of the other UNIX calculator programs, to get a rough idea of how well it works. The table below should be taken with a grain of salt, but it does indicate that our implementation is reasonable. All times are in seconds of user time on a PDP-11/70. There were two tasks. The first is computing Ackermann’s function ack(3,3). This is a good test of the function-call mechanism; it requires 2432 calls, some nested quite deeply. fune ack() { Af (41 == 0) return $241 if ($2 22 0) return ack($1-1, 1) return ack($1-1, ack(#1, $2-1)) y ack(3,3) ‘The second test is computing the Fibonacci numbers with values less than 1000 a total of one hundred times; this involves mostly arithmetic with an occasional function call proc £4b() { aso bed while (b < $1) { > asp > ) ied wale (4 < 100) ( #£46( 1000) deied ) ‘The four languages were hoc, be(1), bas (an ancient BASIC dialect that only runs on the PDP-I1), and C (using double's for all variables) ‘The numbers in Table 8.1 are the sum of the user and system CPU time as CHAPTER & PROGRAM DEVELOPMENT 285 Table 8.1: Seconds of user time (PDP-11/70) program ack(3,3) 100% £ib( 1000) whoc. 35 5.0 bas 13 07 be 39.7 149 c <0. <0.1 measured by time. It is also possible to instrument a C program to determine how much of that time each function uses. The program must be recompiled with profiling turned on, by adding the option ~p to each C compilation and load. If we modify the makefile to read nooé: $(0BIS) ce S(CFLAGS) $(0BJS) ~im -o hocé so that the ce command uses the variable CFLAGS, and then say $ make clean; make CFLAGS=-p the resulting program will contain the profiling code. When the program runs, it will leave a file called mon.out of data that is interpreted by the program prof, To illustrate these notions briefly, we made a test on hoc6 with the Fibonacci program above. $ hocé 0) prints "refer |" if (pie > 0) printf "pic f A€ (ideal > 0) prints "ideal I * Ae (tpl > 0) prints “epi f * if (eqn > 0) printf “eqn ! * printf "troft * if (ma > 0) printé "-me" printé "\n" c7 ’ (The -h option to egrep causes it to suppress the filename headers on each line; unfortunately this option is not in all versions of the system.) The input is scanned, collecting information about what kinds of components are used. After all the input has been examined, it’s processed in the right order to print the output, The details are specific to formatting trof£ documents with the standard preprocessors, but the idea is general: let the machine take care of the details, doctype is an example, like bundle, of a program that creates a pro- gram. As it is written, however, it requires the user to retype the line to the shell; one of the exercises is to fix that When it comes to running the actual trof# command, you should bear in mind that the behavior of trofé is system-dependent: at some installations it drives the typesetter directly, while on other systems it produces information on its standard output that must be sent (0 the typesetter by a separate pro- gram By the way, the first version of this program didn't use egxep or sort; awk itself scanned all the input. It turned out to be too slow for large docu- ments, so we added egrep to do a fast search, and then soxt -u to toss out duplicates. For typical documents, the overhead of creating two extra processes to winnow the data is less than that of running awk on a lot of input. To illustrate, here is a comparison between doctype and a version that just runs awk, applied to the contents of this chapter (about 52000 characters): 308 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 9 $ time awk ".. doctype without egrep ...’ ch9.# eat ch9.1 ch9.2 ch9.3 ch9.4 {pic | tbl ! eqn | troff -ms real a user 8 sys 2. S time doctype ch9.+ eat ch9.1 ch9.2 ch9.3 ch9.4 { pic ! thi | eqn | troft real 7.0 user 10 sys 2.3 s ‘The comparison is evidently in favor of the version using three processes. (This was done on a machine with only one user; the ratio of real times would favor the egrep version even more on a heavily loaded system.) Notice that we did get a simple working version first, before we started to optimize. Exercise 9-2. How did we format this chapter? Exercise 9-3. If your eqn delimiter is a dollar sign, how do you get a dollar sign in the ‘output? Hint: investigate quotes and the pre-defined words of eqn. © Exercise 9-4, Why doesn’t $ ‘doctype filenames work? Modify doctype to run the resulting command, instead of printing it. Exercise 9-5. Is the overhead of the extra cat in doctype important? Rewrite doctype to avoid the extra process. Which version is simpler? © Exercise 9.6. Is it better to use doctype or to write a shel file containing the com- mands to format a specific document? © Exercise 9-7. Experiment with various combinations of grep, egrep, farep, sed, awk and sort to create the fastest possible version of doctype. © 9.4 The manual page The main documentation for a command is usually the manual page — a one-page description in the Unix Programmer's Manual. (See Figure 9.2.) The manual page is stored in a standard directory, usually /usr/man, in a sub- directory numbered according (0 the section of the manual. Our hoc manual page, for example, because it describes a user command, is kept in Zusr/man/man1/hoe. 1 Manual pages are printed with the man(1) command, a shell file that runs nroff -man, so man hoc prints the hoc manual. If the same name appears in more than one section, as does man itself (Section 1 describes the command, While Section 7 describes the macros), the section can be specified to man: CHAPTER 9 DOCUMENT PREPARATION 309 $ man 7 man prints only the description of the macros. The default action is to print all ages with the specified name, using nroff, but man -t generates typeset pages using trofé ‘The author of a manual page creates a file in the proper subdirectory of /osr/man. The man command calls nrof# or trof# with a macro package to print the page, as we can see by searching the man command for formatter, invocations. Our result would be $ grep roff ‘which man‘ nroff $opt -man Sall 3 neqn fall ! nroff $opt -man trofé $opt man all 3, troff -t Sopt -man Sail ! te eqn Sall | trofs Sopt -man eqn Sall | trofe -t Sopt -man | te ii ‘The variety is to deal with options: neoff vs. trof£, whether or not to run eqn, etc. The manual macros, invoked by troff -man, define troff com- mands that format in the style of the manual. They are basically the same as the ms macros, but there are differences, particularly in setting up the title and in the font change commands. The macros are documented — briefly — in man(7), but the basics are easy to remember. The layout of a manual page is: ‘TH COMMAND section-number SH NAME Command \~ brief description of function SH SYNOPSIS 13 command options SH DESCRIPTION Detaited explanation of programs and options. Paragraphs are introduced by PP. “PP This is @ new paragraph. SH FILES Files used by the command, ¢.g., passwei{) mentions /etc/passwd SH "SEE ALSO" References o related documents, including other manual pages SH DIAGNOSTICS Description of any unusual output (e.g, see emp)) as(1), Be(1) and de(1), Bucs Error recovery is imperfect within function and procedure definitions, The treatment of newlines is not exactly user-friendly, 8th Edition Figure 9.2: noc() CHAPTER 9 DOCUMENT PREPARATION 313 9.5 Other document preparation tools ‘There are several other programs to help with document preparation. ‘The refex(1) command looks up references by keywords and installs in your docu- ‘ment the in-line citations and a reference section at the end. By defining suit- able macros, you can arrange that refer print references in the particular style you want. There are existing definitions for a variety of computer science journals. refer is part of the 7th Edition, but has not been picked up in some other versions. pic(1) and ideai(1) do for pictures what eqn does for equations. Pic- tures are significantly more intricate than equations (at least to typeset), and there is no oral tradition of how to talk about pictures, so both languages take some work to learn and to use. To give the flavor of pic, here is a simple picture and its expression in pic. Ps. :pe -1 box invis "document"; arrow box dashed "pic"; arrow box dashed “tbi"; arrow box dashed "eqn"; arrow box "troff"; arrow box invis "typesetter* [ box invis "macro" "package spline right then up -> ] with .ne at 2nd last box.s spe st troft |—etypesetter package — The pictures in this book were all done with pic. pic and ideal are not part of the 7th Edition but are now available. refer, pic and ideal are all troff preprocessors, There are also pro- grams to examine and comment on the prose in your documents. The best Known of these is spe11(1), which reports on possible spelling errors in files; wwe used it extensively. styie(1) and diction(1) analyze punctuation, gram- ‘mar and language usage. These in turn developed into the Writer's Work- bench, a set of programs to help improve writing style. The Writer's Work- bench’ programs are good at identifying cliches, unnecessary words and sexist phrases. spell is standard. The others may be on your system; you can easily find cout by using man: 314 THE UNIX PROGRAMMING ENVIRONMENT CHAPTER 9 $ man style diction wb or by listing /bin and /usr/bin, History and bibliographic notes troff, written by the late Joe Ossanna for the Graphics Systems CAT-4 typesetier, has a long lineage, going back to RUNOFF, which was written by J E. Saltzer for CTSS at MIT in the early 1960's. These programs share the basic command syntax and ideas, although trot is certainly the most compli cated and powerful, and the presence of eqn and the other preprocessors adds significantly to its utility. There are several newer typesetting programs with more civilized input format; TEX, by Don Knuth (TEX and Metafont: New Directions in Typesetting, Digital Press, 1979), and Seribe, by Brian Reid ('Scribe: a high-level approach to computer document formatting,” 7th Sympo- sium on the Principles of Programming Languages, 1980), are probably the best known. ‘The paper “Document Formatting Systems: Survey, Concepts and Issues” by Richard Furuta, Jeffrey Scofield, and Alan Shaw (Computing Sur- veys, September, 1982) is a good survey of the field ‘The original paper on eqn is “A system for typesetting mathematics,” (CACM, March 1975), by Brian Kernighan and Lorinda Cherry. The ms ‘macro package, tbl and refer are all by Mike Lesk; they are documented only in the UNA Programmer's Manual, Volume 2A. pic is described in “PIC — a language for typesetting graphics,” by Brian Kernighan, Software—Practice and Experience, Januaty, 1982. ideal is described in “A high-level language for describing pictures,” by Chris Van Wyk, ACM Transactions on Graphics, April, 1982. spell is @ command that turned from/a shell file, written by Steve John- son, into a C program, by Doug Meliroy. The 7th Edition spe uses a hash ing mechanism for quick lookup, and rules for automatically stripping suffixes and prefixes to keep the dictionary small. See “Development of a spelling list,” M, D, Meliroy, IEEE Transactions on Communications, January, 1982. The style and Aiction programs are described in “Computer aids for writers,” by Lorinds Cherry, SIGPLAN Symposium on Text Manipulation, Portland, Oregon (June 1981). cHapter 0: EPILOG The UNIX operating system is well over ten years old, but the number of computers running it is growing faster than ever. For a system designed with ‘no marketing goals or even intentions, it has been singularly successful ‘The main reason for its commercial success is probably its portability — the feature that everything but small parts of the compilers and kernel runs unchanged on any computer. Manufacturers that run UNIX software on their machines therefore have comparatively little work to do to get the system run- ring on new hardware, and can benefit from the expanding commercial market, for UNIX programs. But the UNIX system was popular long before it was of commercial signifi- cance, and even before it ran on anything but the PDP-I1, The 1974 CACM paper by Ritchie and Thompson generated interest in the academic community, and by 1975, 6th Edition systems were becoming common in universities ‘Through the mid-1970's UNIX knowledge spread by word of mouth: although the system came unsupported and without guarantee, the people who used it were enthusiastic enough to convince others to try it too. Once people tried it, they tended to stick with it; another reason for its current success is that the generation of programmers who used academic UNIX systems now expect to find the UNIX environment where they work. ‘Why did it become popular in the first place? ‘The central factor is that it ‘was designed and built by a small number (two) of exceptionally talented peo- ple, whose sole purpose was to create an environment that would be convenient for program development, and who had the freedom to pursue that ideal. Free of market pressure, the early systems were small enough to be understood by a single person. John Lions taught the 6th Edition kernel in an undergraduate operating systems course at the University of New South Wales in Australia In notes prepared for the class, he wrote, “... the whole documentation is not unreasonably transportable in a student's briefcase.” (This has been fixed in recent versions.) In that carly system were packed a number of inventive applications of ‘computer science, including stream processing (pipes), regular expressions, language theory (yacc, lex, etc.) and more specific instances like the ais 316 THE UNIX PROGRAMMING ENVIRONMENT ‘CHAPTER 10 algorithm in ai¢¢. Binding itall together was a kernel with “features seldom found even in larger operating systems.” As an example, consider the UO structure: a hierarchical filesystem, rare at the time; devices installed as names inthe file system, so they require no special utilities; and perhaps a dozen criti cal system calls, such as an open primitive with exactly two arguments. The software was all written in a high-level language and distributed with the sys- tem so it could be studied and modified “The UNIX system has since become one of the computer market's standard ‘operating systems, and with market dominance has come responsibility and the need for “features” provided by competing systems. As a result, the kernel has grown in size by a factor of 10 in the past decade, although it has certainly not improved by the same amount. This growth has been accompanied by a surfeit of ill-conceived programs that don’t build on the existing environment Creeping featurism encrusts commands with options that obscure the original intention of the programs. Because source code is often not distributed with the system, models of good style are harder come by. Fortunately, however, even the large versions are still suffused with the ideas that made the early versions so popular. ‘The principles on which UNIX is ‘based — simplicity of structure, the lack of disproportionate means, building ‘on existing programs rather than recreating, programmability of the command interpreter, a tree-structured file system, and so on — are therefore spreading and displacing the ideas in the monolithic systems that preceded it. The UNIX system can’t last forever, but systems that hope to supersede it will have to incorporate many ofits fundamental ideas. We said in the preface that there is a UNIX approach or philosophy, a style of how to approach a programming task. Looking back over the book, you should be able to see the elements of that style illustrated in our examples. First, let the machine do the work. Use programs like grep and we and awk to mechanize tasks that you might do by hand on other systems. Second, let other people do the work. Use programs that already exist as building blocks in your programs, with the shell and the programmable filter to glue thom together. Write a small program to interface to an existing one that does the real work, as we did with idi££. The UNIX environment is rich in tools that can be combined in myriad ways; your job is often just to think of the right combination ‘Third, do the job in stages. Build the simplest thing that will be useful, and let your experience with that determine what (if anything) is worth doing next Don't add features and options until usage patterns tell you which ones are needed Fourth, build tools. Write programs that mesh with the existing environ- ment, enhancing it rather than merely adding to it. Built well, such programs themselves become a part of everyone's toolkit. We also said in the preface that the system was not perfect. After nine chapters describing programs with strange conventions, pointless differences, CHAPTER 10 emoo 317 and arbitrary limitations, you will surely agree. In spite of such blemishes, however, the positive benefits far outweigh the occasional irritating rough edges. ‘The UNIX system is really good at what it was designed to do: providing a comfortable programming environment So although UNIX has begun to show some signs of middle age, it's still viable and still gaining in popularity. And that popularity can be traced to the clear thinking of a few people in 1969, who sketched on the blackboard a design for a programming environment they would find comfortable Although they didn’t expect their system to spread to tens of thousands of computers, a generation of programmers is glad that it did APPENDIX 1: EDITOR SUMMARY ‘The “standard” UNIX text editor is a program called ed, originally written by Ken Thompson. ed was designed in the early 1970's, for a computing environment on tiny machines (the first UNIX system limited user programs to 8K bytes) with hard-copy terminals running at very low speeds (10-15 charac- ters per second). It was derived from an earlier editor called qed that was popular at the time. ‘As technology has advanced, ed has remained much the same. You are almost certain to find on your system other editors with appealing features; of these, “visual” or “screen” editing, in which the screen of your terminal reflects your editing changes as you make them, is probably the most common. So why are we spending time on such a old-fashioned program? The answer is that ed, in spite of its age, does some things really well. It is avail- able on all UNIX systems; you can be sure that it will be around as you move from one system to another. It works well over slow-speed telephone lines and with any kind of terminal, e@ is also easy to run from a script; most screen editors assume that they are driving a terminal, and can’t conveniently take their input from a file. ed provides regular expressions for pattern matching. Regular expressions based on those in ed permeate the system: grep and sed use almost identical ones; egrep, awk and lex extend them; the shell uses a different syntax but the same ideas for filename matching. Some screen editors have a “line mode" that reverts to a version of ed so that you can use regular expressions. Finally, ed runs fast. It’s quite possible to invoke ed, make a one-line change to a file, write out the new version, and quit, all before a bigger and fancies screen editor has even started Basics ed edits one file at a time. It works on a copy of the file; to record your changes in the original file, you have to give an explicit command. ed pro- vides commands to manipulate consecutive lines or lines that match a pattern, and to make changes within lines. Each ed command is a single character, usually a letter. Most commands 319 320 THE UNIX PROGRAMMING ENVIRONMENT APPENDIX 1 can be preceded by one or two line numbers, which indicate what line or lines are to be affected by the command; a default line number is used otherwise Line numbers can be specified by absolute position in the file (1, 2, «J, bY shorthand like $ for the last line and ‘." for the current line, by pattern searches using regular expressions, and by additive combinations of these. Let us review how to create files with ed, using De Morgan’s poem from Chapter 1 8 ed poem ?poen Warning: the file poem doesn't exist a Start adding lines Great fleas have little fleas upon their backs to bite ‘em, And Little fleas have lesser fleas, ‘and so ad infinitum. : Type a *." 10 stop adding w poe Write lines to file poem 424 (e8 reports [21 characters written q uit ‘The command a adds or appends lines; the appending mode is terminated by a line with a *.” by itself. There is no indication of which mode you are in, 80 two common mistakes to watch for are typing text without an a command, and typing commands before typing the *.”. 4 will never write your text into a file automatically; you have (o tell ito do so with the w command. If you try to quit without writing your changes, however, ed prints a ? as a warning. At that point, another q command will let you exit without writing. @ always quits regardless of changes. $ ed poem 12 Fle exists, and has 121 characters a ‘Add some more lines at the end And the great fleas thenselves, in turn, have greater fleas to go oni Wile these again have greater still, and greater still, and 30 on Type a *." to stop adding ¢ Try 10 quit ? Warning: you didn’t write first * No filename given; poem is assumed 263 @ Now it's OK to quit 3 we poem Check for sure 8 46 263 poem APPENDIX 1 EDITOR SUMMARY 321 Escape to the shell with ! If you are running ed, you can escape temporarily to run another shell ‘command; there’s no need to quit, The ed command to do this is “!": 3 ed poem 263 we poet Run we without leaving 2 @ © 46.263 poem ' You have returned from the command q (Quit without w is OK: no change was made ’ Printing The lines of the file are numbered 1, 2, ...; you can print the m-th line by giving the command mp or just the number n, and lines m through » with mynp. The “line number” § is the last line, so you don’t have to count lines. 1 Print Ist line; same as 1p s Print last line: same as 8p 1p Print lines 1 through last ‘You can print a file one line at a time just by pressing RETURN; you can back up one line at a time with ‘-". Line numbers can be combined with + and $-2,8p Print last 3 lines 1,203 Print lines 1 through 5 But you can’t print past the end or in reverse order; commands like $,$+1p and $, tp are illegal ‘The list command 1 prints in a format that makes all characters visible; it's ‘200d for finding control characters in files, for distinguishing blanks from tabs, and so on. (See vis in Chapter 6.) Patterns Once a file becomes longer than a few lines, it’s a bother to have to print it all (o find a particular line, so e4 provides a way to search for lines that match 1 particular pattern: /pattern/ finds the next occurrence of pattern $ ed poem 263 fleas Search for next line containing £Lea Great fleas have little fleas /tlea/ Search for next ome And little fleas have lesser fleas, “v ‘Search for next using same pattern ‘And the great fleas themselves, in turn, 2 Search backwards for same pattern And Little fleas have lesser fleas, ed remembers the pattern you used last, so you can repeat a search with just 322 THE UNIX PROGRAMMING ENVIRONMENT APPENDIX 1 1/7. To search backwards, use ?pattern? and 22. ‘Searches with /.../ and ?...? “wrap around” at either end of the text sp Print last tine. (°p' is optional) ‘and greater still, and so on fleas Next Eea is near beginning Great fleas have little fleas ?P Wrap around beginning going backwards have greater fleas to go on: A pattern search like /£1ea/ is a line number just as 1 oF $ is, and can be used in the same contexts: 4,/#1ea/p Print from I to next £103 ?£lea?+1,sp Print from previous €1¢a +1 1 end Where are we anyway? ed keeps track of the last line where you did something: printing or adding text or reading a file. The name of this line is *.”; it is pronounced “dot” and. is called the current line. Each command has a defined effect on dot, usually setting it to the last line affected by the command, You can use dot in the same way that you use $ or a number like 1 8 ed poem 263 : Print current line: same as $ after reading and greater still, and so on. “1p Print previous line and this one While these again have greater still, and greater still, and so on. Line number expressions can be abbreviated: Shorthand: Same as Shorthand: Same as : 1 * 4 = or -2 1-2 tor 4242 : so 3 8 Append, change, delete, insert The append command a adds lines after the specified line; the delete com mand 4 deletes lines; the insert command 4 inserts lines before the specified line; the change command ¢ changes lines, a combination of delete and insert. na Add text after line m ne Insert text before line n mand Delete lines m through 1 mine (Change lines m through If no Line numbers are given, dot is used. ‘The new text for a, ¢ and i APPENDIX 1 EDITOR SUMMARY 323, commands is terminated by a *." on a line by itself; dot is left at the last line added. Dot is set to the next line after the last deleted line, except that it doesn't go past line $ oa ‘Adal text at beginning (same as 14) ap. Delete current line, print next (or last, If ar 8) sap Delete from here to end, print new last isa Delete everyting Ppat? 18 Delete from previous ‘pat to just before dot sap Delete last line, print new last line se Change last line. ($a adds after last line) 180 Change all lines Substitution; undo It's a pain to have to re-type a whole line to change a few letters in it. The substitute command s is the way to replace one string of letters by another: s/old/new/ Change first 014 into mew on current line e/old/new/p Change first 014 into new and print line e/old/new/g Change each 014 into new on current line s/old/new/gp Change each 024 into new and print line Only the leftmost occurrence of the pattern in the Tine is replaced, unless a ‘g” follows. The command doesn’t print the changed line unless there is a ‘p’ at the end. In fact, most ed commands do their job silently, but almost any com- mand can be followed by p to print the result If a substitution didn’t do what you wanted, the undo command u will undo the most recent substitution. Dot must be set to the substituted line. a Undo most recent substitution up Undo most recent substitution and print Just as the p and d commands can be preceded by one or two line numbers to indicate which lines are affected, so can the s command: Jold/s/old/new/ Find next 014; change to new 7eld/3//new/ Find next 036; change to new (pattern is remembered) 1,$8/ola/new/p Change first 014 to new on each line: ‘print last line changed 4,$8/ol¢/new/ep Change each 014 10 now on each line: rin last line changed Note that 1,$5 applies the s command to each line, but it still means only the Teftmost match on each line; the trailing °g” is needed to replace all occurrences in each line. Furthermore, the p prints only the last affected line; to print all changed lines requires a global command, which we'll get to shortly, ‘The character & is shorthand; if it appears anywhere on the right side of an s command, it is replaced by whatever was matched on the left side: 324 THE UNIX PROGRAMMING ENVIRONMENT APPENDIX 1 s/big/very &/ Replace big by very big a/big/5 &/ Replace big by big big 8/4/18) Parenuhesize entre line (see .* below) a/and/\5/ Replace ana by & (\ turns off special meaning) Metacharacters and regular expressions In the same way that characters like * and > and | have special meaning to the shell, certain characters have special meaning to e& when they appear in a search pattern or in the left-hand part of an s command. Such characters are called metacharacters, and the patterns that use them are called regular expres- sions. Table 1 lists the characters and their meanings; the examples below should be read in conjunction with the table. ‘The special meaning of any char- acter can be turned off by preceding it with a backslash. Table 1: Editor Regular Expressions © any non-special character c matches itself | \c turn off any special meaning of character ¢ | - ‘matches beginning of line when ~ begins pattern | s ‘matches end of line when $ ends pattern | : ‘matches any single character | [J matches any one of characters in ...; ranges like a~2 are legal [7.1 matches any single character not in ..; ranges are legal m matches zero or more occurrences of r, where ris a character, . oF {1 8 oon right side of only, produces what was matched X(N) tagged regular expression; the matched string is available as \1, etc., om both left and right side No regular expression matches a newline. Pattern Matches: rss empty line, i.e, newline only ad non-empty, i-e., at least one character ey all lines “ehing/ ‘ching anywhere on line 7oening/ thing ar beginning of line /ehings/ thing at end of line 7oenings/ line that contains only thing “ching. 8/ thing plus any character at end of line Zehing\.8/ ‘thing. at end of line asening\// Zening/ anywhere on line 7(eryning/ thing or Thing anywhere on line /ehingl0-91/ thing followed by one digit APPENDIX 1 EDITOR SUMMARY 325 /ehingl*0-91/ ‘thing followed by a non-digit ‘Yebingl0-9){70-9]/ thing followed by digit, non-digit Yehing!.*thing2/ thing! then any string then thing? “thing! ‘thing2$/ thing at beginning and thing2 at end Regular expressions involving + choose the leftmost match and make it as long as possible, Note that x* can match zero characters; xx+ matches one or Global commands The global commands g and v apply one or more other commands to a set of lines selected by a regular expression, The g command is most often used for printing, substituting or deleting a set of lines: ming/re/omd For all lines beoween m and n that match re, do cmd mnv/re/omd For all lines beoveen m and n that don’t match re, do emd The g or v commands can be preceded by line numbers to limit the range; the default range is 1,8 WolB Prin all tines matching regular expression esa Delete all lines matching g/./a//repl/p Replace Ist. on each line by ‘repl’, print changed lines 9/../8//repl/ap Replace each. by‘repl’, print changed lines 9/../8/pat/repl/ On lines matching ... replace Ist ‘pat’ by ‘rept’ 97.1 38/pat/repl/p On lines matching... replace Ist ‘pat’ by ‘rep! and print g/../8/pat/repl/gp On lines matching ... replace all ‘pat’ by ‘rept and print v/../e/pat/repl/sp On lines not matching .., replace all ‘pat’ by ‘repl', print wirs/p Prin all non-blank lines @/..femdI\ To do multiple commands with a single g, ema append 10 each cmd comd3 Dut the fast The commands controlled by a g or v command can also use line numbers. Dot is set in turn (0 each line selected. g/thing/...+1p Print each line with thing and next 9/°\.80/.1,/°\.EN/-8/alpha/beta/gp Change alpha to beta only between £0 and «EX, and print changed lines Moving and copying lines The command m moves a contiguous group of lines; the t command makes a copy of a group of lines somewhere else mam d Move lines m through m to after line d mnt d Copy lines m through n to after line d If no source lines are specified, dot is used. The destination line d cannot be in the range m, n—1. Here are some common idioms using mand t: 326 THE UNIX PROGRAMMING ENVIRONMENT APPENDIX 1 mt Move current line to after next one (interchange) m2 ‘Move current line to before previous one m Same: ~~ is the same as 2 Does nothing ms Move current line 19 end (20 moves to beginning) of Duplicate current line (€ duplicates at end) niet Duplicate previous and current lines rises Duplicate entre set of tines 977700 Reverse order of lines Marks and line numbers ‘The command = prints the line number of line $ (a poor default), .= prints the number of the current line, and so on. Dot is unchanged ‘The command ke marks the addressed line with the lower case letter ¢; the line can subsequently be addressed as “ec. The ke command does not change dot. Marks are convenient for moving large chunks of text, since they remain permanently attached to lines, as in this sequence: aan Find line ... and mark with a a Find line. and mark with sabe Print entre range to be sure ca Find target line “ay /bm. ‘Move selected lines after it Joining, splitting and rearranging lines Lines can be joined with the 3 command (no blanks are added): mang Join tines m through w into one line ‘The default range is .,.+1, s0 op Join current line to next and print ap Join previous line to current and print Lines can be split with the substitute command by quoting a newline: e/part tpart?/paxt 1\ Split line into ovo parts part2/ aN Split at each blank: 4s ‘makes one word per line Dot is left at the last line created, To talk about parts of the matched regular expression, not just the whole thing, use tagged regular expressions: if the construction \(...\) appears in a regular expression, the part of the whole that it matches is available on both the right hand side and the left as \1. There can be up to nine tagged expres- sions, referred to as \1, \2, etc. APPENDIX 1 EDITOR SUMMARY 327 B/E NING ANI/\2\17 Move frst 3 characters to end: AGRI Find lines that contain a repeated adjacent string File handling commands ‘The read and write commands x and w can be preceded by line numbers: ax file Read file: add i after tine n; set dor to last tine read mnie file Write lines mn to file; doris unchanged mn file Append lines mn io file; dot is unchanged ‘The default range for w and W is the whole file, The default n for x is $, an unfortunate choice. Beware, ‘ed remembers the first file name used, either from the command line or from an x or w command. The file command £ prints or changes the name of the remembered file: € Print name of remembered file £ fle Ser remembered name to file’ ‘The edit command e reinitializes ed with the remembered file or with a new e Begin editing remembered fle e file Begin editing file ‘The @ command is protected the same way as q is: if you haven't written your changes, the first e will draw an error message. E reinitializes regardless of changes. On some systems, ed is linked to e so that the same command (@ filename) can be used inside and outside the editor. Encryption Files may be encrypted upon writing and decrypted upon reading by giving the x command; a password will be asked for. The encryption is the same as in crypt(1). The x command has been changed to X (upper case) on some systems, to make it harder to encrypt unintentionally. Summary of commands Table 2 is a summary of e@ commands, and Table 3 lists the valid line numbers. Each command is preceded by zero, one or two line numbers that indicate how many line numbers can be provided, and the default values if they are not. Most commands can be followed by a p to print the last line affected, or 1 for list format. Dot is normally set to the last line affected; it is unchanged by £, i, w, x, =, and ! Exercise. When you think you know ed, try the editor quiz; see quiz(6). 0 328 THE UNIX PROGRAMMING ENVIRONMENT APPENDIX 1 cd e file £ file 1, $9/re/emds eld q sr file +5 +8/re/new/ cystine 1, $w/re/emds 1,80 file s= temdline Tres Pre? Nitn NIWN2 NIGN2 (#1 )newline Table 2: Summary of ea Commands ‘add text until a line containing just . is typed change lines; new text terminated as with a delete lines, reinitialize with file. resets even if changes not written set remembered file to file do ed emds on each line matching regular expression re; multiple emds separated by \newline insert text before line, terminated as with a join lines into one mark line with letter ¢ list lines, making invisible characters visible quit. @ quits even if changes not written read file substitute new for whatever matched re copy lines after line undo last substitution on line (only one) do e6 emds on each line not matching re write lines to file; W appends instead of overwriting enter encryption mode (or ed -x filename) print fine number execute UNIX command cmdline print line Table i; Summary of ed Line Numbers absolute line number m,n = 0, 1, 2, ‘current line last line of text next line matching re; wraps around from $ to 1 previous line matching re; wraps around from 110 $ Tine with mark ¢ line NIn (additive combination) lines N/ through N2 set dot to NZ, then evaluate N2 NJ and N2 may be specified with any of the above ApPENDIX2: HOC MANUAL Hoc - An Interactive Language For Floating Point Arithmetic Brian Kernighan Rob Pike ABSTRACT Hoc is a simple programmable interpreter for floating point expressions It has C-style control flow, function definition and the usual numerical builtin functions such as cosine and logarithm, 1. Expressions Hoc is an expression language, much like C: although there ae several contrlflow statements, most statements such as assignments are expressions whose value is disre- garded. For example, the assignment operator = assigns the value ofits right operand to is eft operand, and yields the value, so multiple assignments work. The expression grammar is: er: runber | variable | expr) | expr binap expr | amop espr | fiction ( arguments ) Numbers are floating point. The input format is that recognized by seanf (3): digits, decimal point, digits, © or E, signed exponent, At least one digit or a decimal point rust be present; the other components are optional Variable names are formed from a letter followed by a string of leters and umbers. binop refers to binary operators such a addition of Togical comparison; wnop refers 10 the two negation operators, “(logical negation, "not) and *=" (arithmetic negetion, sign change). Table 1 lists the operators 329 320 {THE UNIX PROGRAMMING ENVIRONMENT Table 1: Operators, in decreasing order of precedence exponentiation (FORTRAN **) |. Fight associative = (unary) logical and arithmetic negation ! + 7 multiplication, division + = addition, subtraction > >= relational operators: greater, greater or equal, < less, less or equal, equal, not equal (all same precedence) 55 gical AND (both operands always evaluated) H logical OR (both operands always evaluated) : assignment, right associative APPENDIX 2 Functions, as described later, may be defined by the user. Function arguments are expressions separated by commas Which take a single argument, described in Table 2. Table 2: Built-in Functions abs(x) [x], absolute value of x atan(x) arc tangent of cosix) —cos(r), cosine of x exp(x) —_¢*, exponential of x Ant(x) integer part of x, truncated towards zer0 Log(x) __log(x), logarithm base ¢ of x Log10(x) logio(x), logarithm base 10 of x sin(x) sin(x), sine of x There are also a number of built-in functions, all of sart(x) Vr 2 Logical expressions have value 1.0 (true) and 0.0 (false). As in C, any non-zer0 value is taken to be true. As is always the case with floating point numbers, equality comparisons are inherently suspect. Hoc also has a few built-in consta DEG —_$7.29577951308232087680 180, degrees per radian z 2.71828182845901523536 _e, base of natural logarithms GAMMA 0.57721566490153286060 +, Euler-Mascheroni constant PHT 1,61803398874989484820 (\/54+1)2, the golden ratio Pr 3.14159265358979923846 my, circular transcendental number 2, Statements and Control Flow Hoe statements have the following grammar: APPENDIX 2 Hoc MANUAL 331 sit: er variable = expr procedure ( arglist) While ( expr ) stmt (expr } sme if (expr ) stmt else stmt { sonalist } rin expr-tist return optional-expr stl: (nothing) ! stmlist stmt ‘An assignment is parsed by default as a statement rather than an expression, so assign- ‘ments typed interactively do not print their value [Note that semicolons are not special to hoc: statements are terminated by newlines. ‘This causes some peculiar behavior. The following are legal if statements: Af (x © 0) print(y) else print(2) Af (x <0) print(y) } else ¢ print(2) ? Im the second example, the braces are mandatory: the newline after the if would tor- ‘minate the statement and produce a syntax error were the brace omitted. ‘The syntax and semantics of hoc control flow facilities are basically the same as in C. The while and if statements are just as in C, except there are no break or continue 3. Input and Output: read and print ‘The input function read, like the other built-ins, takes a single argument. Unlike the built-ins, though, the argument is not an expression: itis the name of a variable. ‘The next number (as defined above) is read from the standard input and assigned to the named variable. The return value of read is | (true) if a value was read, and 0 (false) if read encountered end of file or an error. ‘Output is generated with the print statement. The arguments to print are a comma- separated list of expressions and strings in double quotes, as in C. Newlines must be supplied; they are never provided automatically by print Note that read is a special builtin function, and therefore takes a single parenthesized argument, while print is a statement that takes a comma-separated, ‘unparenthesizd list: while (readix)) ( print "value is ", x, "\n" ) 332 THE UNIX PROGRAMMING ENVIRONMENT APPENDIX 2 44. Functions and Procedures Functions and procedures are distinct in hoc, although they are defined by the same mechanism. This distinction is simply for run-time error checking: it is an error for a procedure to return a value, and for a function nor to return one, ‘The definition syntax is: Sanction: ane name() stmt procedure roc name() stmt name may be the name of any variable — builtin functions are excluded. The defini. tion, up to the opening brace or statement, must be on one line, as with the if state ments above. Unlike C, the body of a function or procedure may be any statement, not necessarily 1 compound (brace-enclosed) statement. Since semicolons have no meaning in hoc, a null procedure body is formed by an empty pair of braces Functions and procedures may take arguments, separated by commas, when invoked. Arguments are referred to as in the shell: $3 refers to the third (I-indexed) argument. ‘They are passed by value and within functions are semantically equivalent to variables. It is an error to refer to an argument numbered greater than the number of arguments passed to the routine, The error checking is done dynamically, however, so & routine may have variable numbers of arguments if intial arguments affect the number ‘of arguments to be referenced (us in C's pring). Functions and procedures may recurse, but the stack has limited depth (about a hun- ‘dred calls). ‘The following shows a hoc definition of Ackermann’s function: $ hoe fune ack() { Lf ($1 22 0) return $2+1 Hf ($2 28 0) return ack($1-1, 1) return ack(#1-1, ack($1, $2-1)) d ack(3, 2) 29 ack(3, 3) 61 ack(3, 4) hoc: stack too deep near line 8 5. Examples ‘Stirling's formula: $ hoe func stirl() ( , stirl(10) 3620684.7 stiri(20) 2.43288 186418, Factorial function, n! fune fact) if ($1 HOC MANUAL 333 1 nt ~ Vinwiner at return sqrt (208 1*PI) © ($1/B)"$I4(1 + 1/(12681)) 0) return 1 elge return $1 + fac($1-1) Ratio of factorial to Stirling approximation: while ((i = 441) <= 20) ( 10 n 2 3 4 6 16 v7 18 » 20 print i, 0000318 0000265 0000224 0000192, ‘0000166 :0000146 ‘0000128, Soo00114 ‘0000102 ‘0000092 ‘9000083 *, fac(i)/stir1(1), "Nn" ApPENDIX 3: HOC LISTING ‘The following is a listing of noc6 in its entirety. 35 336 THE UNIX PROGRAMMING ENVIRONMENT APPENDIX 3 ee eee Ces proses APPENDIC 3 Hoc usta 337 , 338 THE UNIX PROGRAMMING ENVIRONMENT fovtewngac, pee, saa) /+ tack ana for ve es APPENDIX. Hocustne 339 3M0__THE UNIX PROGRAMMING ENVIRONMENT APPENDIX + embote fmt saateiies £4) 7+ ana pot abe o/ APPENDIX 3 HocustiNG 341 Fhe ‘enon ene eects esianveged 5/4 ey 07 342 THE UNIX PROGRAMMING ENVIRONMENT APPENDIX 3 fmtot ap = bybst Ipe8 /s stat cane ant o/ ea netted 0 deeriy"it APPENDIX 3 Hoc usTING 343 344 THE UNIX PROGRAMMING ENVIRONMENT 5 APPENDIX 3 Wocusrine 345 peletelateg sivas 346 THE UNIX FROGRAMMING ENVIRONMENT APPENDIX 3 Veonmeatpe tt eT aqwarda ate, Meyrorda st ny ane, WB conete2)¥80)¢ APPENDIX 3 Hoc using 347 oe ‘and seas 130 ouput 29 ME command: eroee 29, 208 ‘Stet dor 21 25,37, Se eigen 8 Tepe ii 9.8 . is 92, 18, retgecton 9 eit iat gate Oneonta a5 1 Mabe aed 1} parentheses: Shell 168 ine teh [i pic bell 28 Sparen 0,4 { repalt expteaton 13,324 TieBleStpesion 108 fom egrenion 19,24 cet eigreson. 108 1 Frog expression 102,324 Eg Sec stharey ne ePingt neuen 6,53. (8 Se thel eu sius” 14 FBO ge see preset. Seeks as BU Roses ble of set ts Wasaga eee 2efename reieton 99 Higa ira Be ope a ‘Scions embedded 276 sora Beemer? Se Srfoments command 13,74, 75 349 INDEX Sigumens pea i 114, 220 egw ti ie ‘me eee Sigamont, comand ine 50 as, Sas e Somngn.perptat 129 each 108 Sh ipemw pace ‘Bik break eatement 12, ‘iecommend ih Si stilee hy = cE eatin im Se ici oe MS ig Se es TS RE iar wae 0) ‘Eek functions tele 123 SEE ten, Bi tine to is, SEIS tr ase ‘Sve nome sttemen, 122 SEs 380. THE UNIX PROGRAMMING ENVIRONMENT see mail 6 Soe pers Te hs ve praneé statement 116 SU Epa tinction "12 SSE fi ote 125 sk eubees function 117 SO varus iain 18 i cove vibes, abe of SOE Varah el varies i SBladipnce convention 43 2Beommand,na 8 tigen pases 2 9,7, bucks, + 86,198 SA: ead 87 Dacustaan incon 277 Sica mo Seoee, gut oc ging eh 6.28 SEASpeEcomenion, Xb 43 retapce, ct 2 [eck Naur Form, 234 ‘eckvarde command 122 Breit putern, cue U8 Bo a ater nt Benoa sb bin Stet ot 3, 2 sven: pena Unpedble 2 ee, ee ea eee Bourne, Seve 100, 200, Sich BAS, frst is 2O : bus, eyaten 185,205, ae Ete Tas hte anctions, ed 245 Sandie command Borde, chabiy of 99 Seats am "eenie convention 45 Ssimaeno i Ce pgs 14, Eel io0."* Pe Sebcommind 133,15 SEEN Gnnend 9.27. 129 ait onion aan Rots ita toe eds Seeste ‘See tenet, Sal ae Seem ae ti Sa She iin of 3 Sie eran 150. 198 Soot Se op 2S, 25 Scion eon” 299 hinge, plese 30 seees Papen “ thing dcr 25 hing permite $6 Sarge Soc RUL 18 beter cee 0D Character Gas. neat 102 Starster dei ce act ASCH 4, i 107 harass BREDIC 2 at ‘Secenass command 186 2 Shecenai eaten of 38 hese being ‘Ss ompate, Be ‘mod remand 3 ‘Simo Sod ox $6, 6 ‘oe aap 170,171, 18, sh Shae Tasniat! Sap a THO. 266 Se fant, nocd 269, Scie senert, nee 358 Se Feat: 22h Sede te Mee Shae e He ase ‘oath oni, 3032, Comer Dove 13 Saree Saad seu, 5 ‘commands eating ell 30 Serceens, Doce SEE Re A ‘commands, table of fie, 24 SSS arch Ses Sees Seeley Sea Senate a cere Soe ‘Smvetion -efename 46 San cae ‘vention, St eating Seven iy aaa Saree Sass Speommand ig 61 reat gutem cll 208 Eeea ee» eer aia Saas Steet Sone ah ae Siete Sarkis its noe St Fsuring output SI sopping cor #15 Sls tne kal 6 Seppe test ble 174 ‘Seypect header te 1 ‘se onmand caren etry. «21. 25,37, ie Seoumand, ea 101 Shum urocuier noe 282 este, es, 2 a aoaraetat asd 10) fe command. ero 301 Be Maren Poa © De Morgan, Avge 1, 20 ‘ier ab i ‘tbupger. sab ‘dougeng 8, 24,258,265, 8 eft sip handing 225,227 Sefice fame a ta etronty fonsion 276 Betere at st, as BHLETE. ci 9 Aspendn’y tle empty 255 ekopors ie 95,301,223 Sect onlendss 1 ‘fg of stopen 1 dfs of gee fondly 10s ese of Later TO we ese ol Re S ‘denge of Bee 38 ‘de of eprane’ 209 es of eo Sein of ce 172,178,179, 185 Sere sas decor 6, 66 see eet Pa wa srmand_ 67 some ee ‘0 Seen Hg Bnd BEL wo i ne 25. 51, 75 Ss ae SS eA oe de, 3,80 ake feet ee ois Ses peso ‘ban 81, 9, leery. removing 2 2 Sessa te 81 reser. “emp 63 16 Sees ane Aes gues. fag, 28,50 Sees SEARS V8 Sen. fane/ana a res heade ie 30 Reet an ‘arg apa 6 Soceyes ome” 307, cry soon 88 ‘doabte command 120 Sato ts 98 3 ammo 0 Bare fo, 2 Since ab, Dun oor 230 Ce eand eeee_ a8 PRS Ca or Sle 2 fine mbes 120 Si mer, le of 2, Se Fag Sipetion camper 2 etext of i Fee te 2D De be: HF: 28,28, eo 8. 8 31.8 Sfopen fincion 182 cc INDEX 351 Embedded con 276 eee a atic Sire oes Ender ie 2h 308208 Sere Boe, Seine 38 Slope bei Seer caer le 224, 280 Sierra 32,69, oe 1 i830 tor feanery, hee 76, Sto anu 8921202 Son eat aie" Sa a So Hs. jake! ‘“eeergennyd poor Me 8, eae Srshnvon of hechnai 215 Seeger s Seeccewneny? ‘tel oan mand #1 ina a i, se cian) 12 ale rte ft ats of Sl ie 12 Seca Pane tpes Se fac fonction 34, Bese feuses aS ie as 352 THE UNIX PROGRAMMING ENVIRONMENT ab Ronan 390 tee Exexa commen 8 fala ae awe Tis ‘ht at hn 19, 10, ae te lee i HESS eee ay oo eden ai fie movagg fie, opening» 201 fins 3, 208 fle renaming «16 fea fecal fe Stemi, A ten ay 3.08 {ie tine ae . Mente conenton 26 ie Sone, 7 tieame even, “2 Sy Herame toni, 7 38 tide ia So AeERS Seite os et rai ox 121 fsrepune 9-107 femur 2 fiers e101 inte Sate machine 277 Fein Being pat aceon 202,20 fo1a command TS fsltoslamage 0 fot Ganges 298 $2500 mote a Seka us Br hapa hel os feet yaam cal ‘ee fe decry $19,309 [nang 3 fmt Re cot 9s formset foemaer, neott 298 FORIRAN 77 207 Gane ae trexp aii _ a ee (sin ste aoe 2 Tnctions able of awe 128 fats: atk oe, So fal eS Bas Lo ne functions tbe of wring 176 Prete aed hime, fortune 36 Breguet wy Davida set comimod 16 Seeseg ctn eh Setenar maceo 177 gecene pe Taree i Samar pun 235 Eepetomie 1 Se moving a 38 pla extensions 102 Si ee Hates, Maron.» ashing 13 4 ead e307 261 286 ober, ees. 228 tte te 8 EEE D, BEET Bl an EERE, ‘rere document 94, 98 coe tee oe EISmen ig one ‘hoc. evolation of 233 EELS oe pec a E Laem he foe sles, ble of 283 ee foe! main function 239 Eee fost waensng funtion 240 ilies 38 Emus” hoed execersor function 244 EN meram Boca bata uncon 244 en EREes Sere Siaksew’ ocd 3ex vat 8 toed mates file 282 foed 223 tinction, 268 EERE Eitan s Eas ‘ocd eat function 264 oc mara unctog 240 Food Beint fection. 264 EES food yy function 364 Toes {Feode Inetion 272 Eee Hg & ‘oes Lode. e te 27 hoes execute function 281 past ‘hoes insteode function 278 hoot hoes 27 Shes foc.h, hoes 278 Eesha mag tas 2 ann TELE pean tunann 95 LP ot 12Seae funcion, hoee 3 ste sme, 2, is ae van ie 16 164 "Pong eal, 153,226 LoaScommpend 109, 116 inden "te So comands, permed 2 Inbernce, open He 223 ee, ek saab, 91 foie-c'ne hock 29 imelsfie neck 3B snk nwtite T, ‘ne table 20 (Bhut frmae 331 Inu erage 2900 ie Seni, agg 14 Integer fension 189 Se Fe Se ean pont 226, 28 pate ss Rei? Le Ieson Steve x 200,238,257, sal sification, me 291 sas 3 ae Ae ae ke command. ne 255 aha 2 command 34, 25 aun bon 0, 5 imma ea 236 iSipmind wea" 78 thane crema 357 iam Oued Ione swe af 7,171 gues Jeopment 23, 286 Inne specs, ren Coes HE ge on nr cee ee ie te ez aan Shoes 36 Elana? ee eee nde, ag line «64 322 Beet mt EE Bibhte » Ber ag Eero Sie eta, Eee aoa INDEX 353 ns, ohn 8 LEPiesiary 297 sia ah ny Bs eq tet, 352 tess leave L020, Sor Smads ne 36 iPhoe ta 6,103 Ens Sheet am Beare tr Imsero, efning Wrote 301 He“ ‘Shi peat of 8,36, 7 215 a asi to rate 36, 07 ‘naan function, noe 239 katt ‘main funcion, hocd, 260. ES Aw, pick 7 force yo Bane Wee: sap 19) ‘Bake #7 cominand 2 Sik Sooea ‘ake command 24 ego ee 384 THE UNIX PROGRAMMING ENVIRONMENT fan command 1,30) 2 Irabual,oe329 Mash, Jobe 38,199 Pra eye Je, 238 athve fe) 5053. 252 Riiboy Boog, we, 78 esi ee 121 ‘emery fault Seog command 0,68 ‘Raa revachircte 102,320 mmeacharaer, sel 38 Petecharacer ale of shell 7S ‘Rips funcion it Sader command 25,38 ‘Beda 27 261302 onsout” 28 Roce command 15 a ince pace 290 eprom Bt ‘Mute record, ave 180 IRiN Sms, a9 Sr owe ell 118 Ai eine convention 43 seit aE fama ahi of ror 298 tne rest Seger 7 Btopy farce ep chara css 102 spite EEE. aaa BEERS Ere ou ty ‘eee mand 10,162. 163, e Wap variable. aie 118 cane y mae Us syiieteae us Seis Rims sees fall sine in ATE 139 SESE aw See aaa a28 22 ey pe ee 208 ome ‘i inberienee 228 Bees Se Sane: = ae aoe. SS See ie eae ‘options, parsing '78 i ser rhe Se a acne fh me Te =e, oes ae See Sec SN Pn. a Seca in, ae age lout 289 Pisin bet Bacar: fener te 206 Few dean 3,75 enters, hell) 72,73 frenbese tell} [es fase aiess 238 eee aie, 284 Eents ae futsal wrguoeat ToL oe faseoed, changing & 34 [rsenord enya 33 Ristgg We Yotcrpeene 8, sort Seutty 70 Sn nal seen 139 ab 1, rar te go rene REPrstans = ane pt Sine ach 7 pattern, shell 228 acm i ees el 8 sc ac eigen dee A, 68, 201 Ete, Bema. Fenian great 205 SS aw mthcee ome, Eas BEES wo 3 Psi 7, tine examples 31 Tee a a point ste change 20 cain pos PW foie fone 3 Rep 290,201 edepe inate 278 22 Bee anc 1 eine nei, hot 24 Belt Stenent noe 33) Pane Bein satement, ave 16 eer a as sacar roca dts, ho? Frost acrid 3,73, Bs rottcaey eta," « ann en seers ee Si gras noes, “pale oats pti ne 220 Progam dear 10,135,147, ees rosa, soning 3 7.34 seg icles pie opramabl mater 289 Flopammable st 9, soso ae Prompt Secondary, 76 Foc weie 9 ‘Benes ancton 282 Pee Ppa command, tof 301 BF shel vable Wo, 2 ed tel varie 18 bash function oct 262 Fann may Pecener 172 reyomsng ear aaa ee sale eee Ba 6,6 stig nan Seg cs x eg ore oven e eahe See ecto, 5 102 a segulat expression, * 102,324 tre samp sol api, nae 10, ssp reso grep, 02 ‘falar exces, fee ‘ular expenon: ach tia 8 ‘fuly ete: a of 2d a TERME or Roe 3h,"74 {ly Ss eg 13 ‘Fepiace command 135 Inpex 355 tecement ras, 28 reores fie ‘Sig orate 8 RETURNS, 7, 13.98, ‘Sed he Bieteh, at'g 2,73 Fee. Denali x. 39, 9, i 3 Soft formatter 289 feat 28) Foot ther 82.35 Resist Lary x ike are 17 ‘vcard, sed” 108 sea Sieames 230 Semne TI Sin, Det. a4 Seen Th 39 meant Saray ‘3, BT, 88, 199, 142, Senda romp 76 SS, peswors sa Pica 0 58 command 108 SS Seto 112 dieting newtine 111 Serene HS ht Sea qeommeed 110 Sinn Separator. command 3,78 separators, awe field 116 ee. ee EE ae gp Eeraee g SS periion 34 356 THE UNIX PROGRAMMING ENVIRONMENT 29.68, Seventh Eon 2,1 R7 Tesi igh rk, agnosie cat $09,359 renbeses 72,73 reeset 48 2 Patera 3 Contry 28 Sat $c Pranab 1 Rana | Vaan, eof peg AN, 8 1 Size a, EERE al Fee a See ia ee Shel for sop Ts ‘ell £0 fedioction 199 Smee Hepes See a Shell mets stan 138 el renee, 91 ‘hel sarabe vey 237, 8 ae set Paget elec ie Shae command 135.18 sersnce enti 253, 268 Sonia aoe 3 Dia tut 238 $a Mandingo 228,227 Bs ee Se a a eS EXEao a as tg) odie ses ‘able of hoe 283 SNOBOLT at Sek ou Recipe uae 26 pet? command 315 ck ace 1 SoS ace ie Sand err 32. 92,177,202 anand ip 90 82.17, Sab netions ab of WO tary 170 ated ute 3.177202 eae heute 218 iter, pre 3 Ste am, wane 222 redeem ce seglonn header Ge 173 Sin’ formuts 333 Sie in ene 2 Srctue! fie st 4 EE Shand 5,8 Bare ey se command 217 ee an aoe BE Pee are BS ‘econ fect 24, 229 Berm gla Ebiaope § file oF ek operons 8 seas Ube of dices 63 een eater {iiss table te sommands 21 SSeS nt 36,30 tube of Roe operate 330 te Se tah of C0 tang 302 tthe of mi commands 257 1 Se Soman 36 {abe se matacharacters 75 {ibe of el pater 16, eof Sel eigen 94 SDE of sel sarisblen 3S abe of nl names 30 ‘be of sanderd UO fencons In detnitne 174 ‘eesti exrenion 105 serrerecy ies EL irene, van terminal echo 2, 43° ears Ceoe tecgaion Sours TEX formatter 314 rh sha mos ‘tameout command. 230 get tens be noo 28 cate eer carers nee SEES ona sare soa! kee at Be Skeet AG command 58, 306 EESHE bp commun 30 SESE command 392,306 Eset cae ommend) EOEE AF command 291,299 Eset eon {EEOEE, geting backs vo ‘set pxoéf ngs 20 orf hance ane oe Sica See {EROEE ‘%8 Command 30} Ses Fagen fect 4, 5 Manion i223, 251,278 eta 103 Sree cnn 106 imag 107 Snieecommand 3 fants Bh GRIX sie Echo 78 tacts sia, sbell ener ave cory 22 98, 68 ‘aah Bey 889,65, ons /Sit/onde dssry Jaime ny 20 ‘Jaserinclade decry 173 jets ieee ‘hep command 39 Se arate een 281,276 ers abe of a 118 arabe ae of Se 38 Steak nai hoot 20 ‘he command 173, 175,178 SEE a HE, va fanton 179 ESSE Meson Yeomiandy e112 second Siecle canmand 2 INDEX 357 wetchnto command 7 ear w Se rari Sr on ce Serre eee Shes ae ae 7 Tere Enh, Eoahes Foon Eee ioee > ee Hae Bw FEES com mos a0 Re FES Shem wes ic ts pies ‘Flex function, hoes Bist Bes ab 2 ee a on eee Bee.

You might also like