C++ Footprint and Performance Optimization
By R. Alexander, G. Bensley
Publisher: Sams Publishing
Pub Date: September 20, 2000
ISBN: 0-672-31904-7
Pages: 400
The market for miniature computer programming is exploding. C++ Footprint and Performance Optimization supplies programmers with the knowledge they need to write code for the increasing number of hand-held devices, wearable computers, and intelligent appliances. This book gives readers valuable knowledge and programming techniques that are not currently part of traditional programming training. In the world of C++ programming, all other things being equal, programs that are smaller and faster are better. C++ Footprint and Performance Optimization contains case studies and sample code to give readers concrete examples and proven solutions to problems that don't have cut-and-paste solutions.
Copyright
About the Authors
Acknowledgments
Tell Us What You Think!
Introduction: Why Optimize?
    Aim of This Book
    Who This Book Is For
    The Structure of This Book
Part I: Everything But the Code
    Chapter 1. Optimizing: What Is It All About?
        Performance
        Footprint
        Summary
    Chapter 2. Creating a New System
        System Requirements
        System Design Issues
        The Development Process
        Data Processing Methods
        Summary
    Chapter 3. Modifying an Existing System
        Identifying What to Modify
        Beginning Your Optimization
        Analyzing Target Areas
        Performing the Optimizations
        Summary
Part II: Getting Our Hands Dirty
    Chapter 4. Tools and Languages
        Tools You Cannot Do Without
        Optimizing with Help from the Compiler
        The Language for the Job
        Summary
    Chapter 5. Measuring Time and Complexity
        The Marriage of Theory and Practice
        System Influences
        Summary
    Chapter 6. The Standard C/C++ Variables
        Variable Base Types
        Grouping Base Types
        Summary
    Chapter 7. Basic Programming Statements
        Selectors
        Loops
        Summary
    Chapter 8. Functions
        Invoking Functions
        Passing Data to Functions
        Early Returns
        Functions as Class Methods
        Summary
    Chapter 9. Efficient Memory Management
        Memory Fragmentation
        Memory Management
        Resizable Data Structures
        Summary
    Chapter 10. Blocks of Data
        Comparing Blocks of Data
        The Theory of Sorting Data
        Sorting Techniques
        Summary
    Chapter 11. Storage Structures
        Arrays
        Linked Lists
        Hash Tables
        Binary Trees
        Red/Black Trees
        Summary
    Chapter 12. Optimizing IO
        Efficient Screen Output
        Efficient Binary File IO
        Efficient Text File IO
        Summary
    Chapter 13. Optimizing Your Code Further
        Arithmetic Operations
        Operating System-Based Optimizations
        Summary
Part III: Tips and Pitfalls
    Chapter 14. Tips
        Tricks
        Preparing for the Future
    Chapter 15. Pitfalls
        Algorithmic Pitfalls
        Typos that Compile
        Other Pitfalls
Index
Copyright
Copyright 2000 by Sams Publishing All rights reserved. No part of this book shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, photocopying, recording, or otherwise, without written permission from the publisher. No patent liability is assumed with respect to the use of the information contained herein. Although every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions. Nor is any liability assumed for damages resulting from the use of the information contained herein. Library of Congress Catalog Card Number: 99-068917
Trademarks
All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Sams cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.
Interior Designers: Gary Adair, Anne Jones
Cover Designer: Anne Jones
Production: Darin Crone
Dedication
To Olivera, Tjitske, and Leanne, who somehow found the restraint to not kill us.
Acknowledgments
We would like to thank the people at Sams for recognizing the value of this project and helping us develop it into the book you are now holding. Special thanks go to those on the home front who had to endure our absence and more than their fair share of duties around the house.
Nowadays, software is virtually everywhere. Though you might initially think only of PCs and industrial computer systems when talking about software, applications are much more widespread. Consider washing machines, electric razors, thermostats, microwave ovens, cars, TVs, monitors, and so on. Obviously, these examples span many different kinds of architectures and use a variety of microprocessors, and different optimization techniques for performance and footprint size are needed for each.

An examination of those writing today's software reveals just as much diversity. There is the generation of software implementers who were schooled specifically in writing software, that is, in doing requirements analysis, design, and implementation. There are also those who taught themselves to write software, starting perhaps as hobbyists. And more and more we see people from other disciplines switching to software writing. It is therefore no longer a fair assumption that all programmers have a technical background.

C/C++ programming courses and books give an excellent introduction to the world of C/C++ programming. Accessible to all ages and disciplines, they make it possible for anyone to write a working C/C++ program. However, standard techniques carry many pitfalls and inefficiencies that must be avoided when actively developing software, be it commercially or as a hobby. Without completely understanding the consequences of programming decisions on a technical level, implementers unwittingly compromise the speed and size of the software they write.

The basis for an efficient program is laid long before any actual code is written, when the requirements and design are drawn up and the hardware target is chosen. And even when the software is eventually written, a simple matter of syntax might be all that separates those who produce optimal executable code from those who do not. If you know what to write, you can easily optimize your code to run many times more efficiently. Efficiency can be increased even further with specific programming techniques, which differ in the level of skill they require from the programmer.
Performance
The first part of this chapter discusses optimization from the performance viewpoint. It considers not only software and hardware characteristics, but also how performance is perceived by the users of a system.
What Is Performance?
What does performance actually mean in relation to software? The simple answer is that performance is an expression of the amount of work that is done during a certain period of time. The more work a program does per unit of time, the better its performance. Put differently, the performance of a program is measured by the number of input (data) units it manages to transform into output (data) units in a given time. This translates directly into the number of algorithmic steps that need to be taken to complete the transformation. For example, an algorithm that executes 10 program statements to store a name in a database performs poorly compared to one that stores the same name in 5 statements. Similarly, a database setup that requires 20 steps to be taken before it knows where new data is to be inserted has a higher impact on program performance than a database setup that does the same in 10 steps.

But there are more things to consider than purely technical implications, which is what this section will highlight. A very large part of the software written today is set up to be used by one or more users interactively. Think of word processors, project management tools, and paint programs. The users of these kinds of programs generally sit behind their computers and work with a single program until they have completed a certain task, for example, planned the activities of a subordinate, drawn a diagram, or written a ransom note. So let's examine how such a user defines performance; after all, in most cases he will be the one we do the optimizations for. Basically, there are only three situations in which a user actually thinks in terms of performance at all:

- When a task takes less time than anticipated by the user.

- When a task takes more time than anticipated by the user.

- When the size or complexity of the task is apparent to the user.

Examining these situations can provide further guidelines for defining performance. Here follow three examples that illustrate the bullet points:

A task can take less time than anticipated when, for example, a user has been working with the same program on the same computer for years and her boss finally decides to upgrade to next-generation machines. The user is still running the same program, but because the hardware can execute it faster, performance seems better. The user had also become accustomed to a certain kind of behavior; in the new situation her expectations are exceeded, and she no longer has to twiddle her thumbs when saving a large file or performing a complex calculation.

A task can take more time than anticipated when, for example, a user works with a program that handles a large base of sorted names. On startup, the program takes about 15 seconds to load and sort its data, without giving status updates. Even if its
algorithms are highly optimized, the user views this unanticipated "delay" as poor software performance.

The size and complexity of a task can be apparent to the user when, for example, she works with a program that searches through megabytes of text files to find the occurrences of a certain string. This action takes only seconds and, because of her technical background, the user knows what is involved in this action. Her perception of the performance of the search program is therefore favorable.

These examples demonstrate that performance is more than a simple measure of time and processing. Performance from the viewpoint of a user is more a feeling she has about the program than the actual workload per second it manages to process. This feeling is influenced by a number of factors that lead to the following statements:

- Unexpected and unexplained waiting times have a negative effect on perceived performance.

- Performance is a combination of hardware and software.

- Performance depends on what the user is accustomed to.

- Performance depends on the user's knowledge of what the program is doing.

- Repetition of a technically efficient action will still affect perceived performance, no matter how knowledgeable the user is.
Why Optimize?
Although optimization is a logical choice for those who write time-critical or real-time programs, it has more widespread uses; all types of software can in fact benefit from optimization. This section gives four reasons why:

- As programs leave the development environment and are put to use in the field, the amounts of data they need to handle grow steadily. This eventually slows the program down, perhaps even to the point of it being unusable.

- Carefully designed and implemented programs are easier to extend in the future. Consider the benefits of adding functionality to an existing program without having to worry about degrading its performance because of problems in the existing code.

- Working with a fast program is more comfortable for users. In fact, speed is typically not an issue until it slows users down.

- Time is money.

A tempting question that you are bound to ask sooner or later is, why not just buy faster hardware? If your software does not seem able to cut it anymore, why not simply upgrade to a faster processor or use more or faster memory? Processing speed tends to double every 18 months, so algorithms that might have posed a problem six months or a year ago might now be doable. But there are a number of reasons why optimizing software will always be needed:

- A faulty design or implementation decision in an algorithm can easily slow it down 10 to 100 times, when sorting or storing data, for example. Waiting for hardware that is only twice as fast is not a solution.

- Programs, and the data they handle, tend to grow larger during those same 18 months, and users tend to run more and more applications simultaneously. This means the speed requirements for a program can increase as fast as, and sometimes even faster than, the hardware speed increases. Programmers who do not acquire the skills to optimize programs will find themselves needing to upgrade to new hardware over and over again.

- With software that is part of a mass-market system (for example, embedded software in TVs, VCRs, set-top boxes, and so on), every cent of cost weighs heavily. Investments in software occur only once, whereas investments in hardware are incurred with every unit produced. While processors continue to become faster and cheaper, their designs also change, which means that more investments need to be made to upgrade other parts of the system.

- The lower the system requirements for a certain program are, the larger the market it can reach.

- Buying new hardware to solve software problems is just a temporary workaround that hides rather than solves problems.

One thing to keep in mind when talking about performance problems is that they are generally not so much trademarks of entire programs as they are problems with specific parts of a program. The following sections of this chapter focus on those programming areas that are particularly prone to causing performance problems.
which kind of device to use for what purpose, and when and where in the program to access the devices. Chapter 12, "Optimizing IO," explains this in greater detail. Examples of (relatively) slow physical devices include hard disks, smartcard readers, printers, scanners, disk stations, CD-ROM players, DVD players, and modems. Here are some considerations when using physical devices:

1. It stands to reason that the more frequently a set of data is used, the closer you will want to place it to the program. Data that is referred to constantly should therefore, if possible, be kept in internal memory. When the data set is not too large and remains constant, it could even be part of the executable file itself. When the data set is subject to change, however, or should be shared between programs, it would be wiser to store it on a local hard disk and load it into memory at runtime. It would be unwise to store it on a network drive; unless you specifically intend for the data to be accessed by different workstations or to be backed up remotely, the use of a network drive would just add unwanted overhead. The choice of device should clearly be closely related to the intended use of the data that it will store.

2. During the design phase of a program, look closely at how and when the data is accessed. By making a temporary copy of (a block of) data, it is possible to increase the speed of accessing it. For example, consider a scenario in which a program needs to access data that is being used by a physical device. This creates no problem when both the program and the device are merely reading the data and not changing it. However, when the data is being changed, some kind of locking mechanism must be used. Of course, this takes time, as the program and the device have to wait for each other. This type of problem can be identified at design time, and possibly avoided. For example, with a temporary copy of the data solely for use by the device, the program can continue to make changes to the data while the physical device uses its copy merely for output purposes (taking, if you will, a snapshot of the data). This way the program and the device will not trip over each other. When the device has finished, the memory containing the copy of the data can be reused. If the amount of data involved is too large either to allow efficient copying or to be allocated in memory twice, the same technique can still be applied to smaller subsets of the data. (A sketch of this snapshot technique follows this list.)

3. It is usually a good idea to compress data before sending it when communicating with relatively fast devices over a slower connection (two computers connected via serial cable or modem, for example). When choosing the compression algorithm, make sure that the time won by sending less information over the connection is more than the time needed by the slowest end of the connection to perform the compression or decompression.
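The following is a minimal sketch of the snapshot technique described in point 2. All names (DocumentBuffer, Edit, TakeSnapshot) are hypothetical, and a real implementation would still need to synchronize the copy moment itself; this merely illustrates the idea of giving the device a private copy so that the program and the device do not have to wait for each other.

#include <cstring>
#include <cstddef>

class DocumentBuffer
{
public:
    DocumentBuffer(std::size_t size) : size_(size)
    {
        working_  = new char[size];
        snapshot_ = new char[size];
    }
    ~DocumentBuffer()
    {
        delete [] working_;
        delete [] snapshot_;
    }

    // Called by the program; the device never touches working_,
    // so no locking is needed here.
    void Edit(std::size_t offset, char value)
    {
        working_[offset] = value;
    }

    // Called just before output starts: one quick copy instead of a
    // lock held for the duration of the (slow) device operation.
    const char *TakeSnapshot()
    {
        std::memcpy(snapshot_, working_, size_);
        return snapshot_;       // handed to the device for output
    }

private:
    char       *working_;
    char       *snapshot_;
    std::size_t size_;
};

If the data set is too large to copy in one piece, the same TakeSnapshot idea can be applied to smaller subsets of the data, as noted above.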
Performance of Subsystems
An old proverb says a chain is only as strong as its weakest link. This holds true for software also, particularly when it comes to performance issues. Performance problems are likely to occur when you use a badly designed third-party library, or indeed one that was optimized for a different kind of use. So before using subsystems, it is advisable to run some performance tests, if only to find out what to expect in practice. It might be possible to design around identified problems, but be prepared to rewrite a subsystem or look for a replacement; generally this is the preferred option, because otherwise future enhancements to the program will continue to suffer from an initial bad choice. Avoid creating workarounds if there is even the remotest possibility of having to replace a subsystem at some point down the line anyway. Time constraints could force a development team to use two similar subsystems simply because the old one is too slow and it would take too much time to incorporate the new one in every part of the program. Clearly this is an unfortunate waste of time and resources.

The way in which a subsystem is incorporated into a program affects the performance of the link between the two. Simply calling the interface of the subsystem directly from the program causes the smallest amount of overhead and is therefore the fastest. It does mean that at least one side of the link will need its interface adapted to fit the other. When for some reason both sides cannot be altered (for example, because the third-party sources are unavailable), it is necessary to insert some kind of glue or wrapper layer between the two. Communication calls will then be redirected, which means extra overhead.

However, this same kind of go-between glue layer can also be used to test the functionality and performance of a single part of a system. In this case the glue layer, now called a stub, does nothing or simply returns fixed values. It does not call another object to pass anything on; it simulates the objects being interacted with. The performance of the object being tested is then no longer influenced by other parts of the system. Refer to Chapter 2, "Creating a New System," for more details on prototyping.
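As an illustration of the stub idea, here is a minimal sketch (all names hypothetical): the stub implements the same interface as the real subsystem but only returns fixed values, so the calling code can be tested and timed in isolation.

// The interface both the real subsystem and the stub implement.
class Database
{
public:
    virtual ~Database() {}
    virtual bool Store(const char *name) = 0;
    virtual int  Count() const = 0;
};

// The stub: does no real work, just returns fixed values, so that
// measurements of the caller exclude the subsystem's performance.
class DatabaseStub : public Database
{
public:
    bool Store(const char *) { return true; }  // simulate success
    int  Count() const       { return 42; }    // simulate a record count
};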
Performance of Communications
Performance problems are inevitable where communications take place. Think, for example, of communications between separate computers or between different processes on the same computer. The following problems are likely to occur:

- The sender and receiver operate at different speeds (for example, different hardware configurations or scheduling priorities).

- The link between the sender and the receiver is slow (for example, a serial cable or modem between two fast computers).

- The sender or receiver is slowed down because it is handling a high number of connections (for example, an ISP).

- The sender or receiver has to wait for its peer to arrive in the correct program state (for example, a connection has to be set up or data has to be gathered before being sent).

- The link between sender and receiver is error-prone (for example, a lot of data needs to be retransmitted).

Where possible, programs should avoid halting activity by waiting on communications (busy-wait constructions) or using polling strategies that periodically check connections to see whether they need to be serviced. Instead, communication routines should be called on an interrupt basis or via callback functions. This way, a program can go about its business until it is signaled to activate its communication routines. The elegance of using callback functions lies in the fact that callback functions are part of the program that wants to be notified of a certain event taking place. These functions thus have complete access to the data and the rest of the functionality of the program. The callback function body contains those instructions that need to be carried out when the event comes about, but the function is in fact called by the object generating the event. By passing a reference to a callback function to an object, you give the event-generating object a means to contact the program (a sketch follows at the end of this section). Switching from polling or busy-wait strategies to interrupt and callback strategies thus offers the following advantages:

- Programs will be smaller, as fewer states need to be incorporated.

- Programs will be faster, as execution does not need to be halted at strategic places to check for interesting events taking place.

- Responsiveness to events will be faster, as there is no longer a need to wait for the program to arrive in a state in which it is able to recognize events.

The use of callback functions is discussed further in Chapter 13, "Optimizing Your Code Further."
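Here is a minimal callback sketch (all names hypothetical). The connection object stores a function pointer plus a context pointer; when data arrives, the communication layer calls back into the program instead of the program polling the connection.

// Signature of the function the program wants called when data arrives.
typedef void (*DataCallback)(void *context, const char *data, int length);

class Connection
{
public:
    Connection() : callback_(0), context_(0) {}

    // The program hands the connection a means to contact it.
    void RegisterCallback(DataCallback callback, void *context)
    {
        callback_ = callback;
        context_  = context;
    }

    // Invoked by the communication layer when data arrives; the
    // program does not poll, it is simply signaled.
    void OnDataArrived(const char *data, int length)
    {
        if (callback_ != 0)
            callback_(context_, data, length);
    }

private:
    DataCallback callback_;
    void        *context_;
};

A program would register one of its own functions with RegisterCallback and then simply go about its business until that function is invoked.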
The user sees only the user interface; to her, this is the program. Most of the time the user is unaware of exactly what the program is doing internally and has at best only an abstract concept of its overall work. The programmer focuses much more on the source code, making that as good as it can be. The programmer sometimes views the GUI as a necessary evil; an interface that quickly makes all the functionality of the program accessible is then simply stuck on top of the program. Perhaps the most important reason of all is that the user and the programmer have different backgrounds, experiences, and goals with respect to the program. Ideas about what is logical will therefore differ. The following sections provide more detail on how to identify, prevent, and overcome GUI problems and annoyances.

Unexplained Waiting Times

When programmers forget to add some kind of progress indicator at places in the program where large batches of work are being done, the program will in effect seem to halt at random moments. The user selects a command from the program's menu and suddenly his computer seems to be stuck for a few seconds. This is very frustrating because the user is not aware of what is happening. The programmer, in turn, probably did not even notice this "look and feel" problem, because he knows what the program is doing and therefore expects the slowdown. Simply adding some text in a status bar explaining what is happening, or spawning a little window with a moving slider indicating elapsed time, will greatly enhance the appreciation the end user has for the program.

Illogical Setup of the User Interface

Another great way to irritate users is to place user interface controls where they are not expected. This might seem unlikely, but there are in fact countless examples to be found in even today's most popular software packages. Finding the menu path File, Tools, Change Password is a practical example of this. But it does not even have to be that obvious. While designing user interfaces, take into account the experience of the user. For example, when writing a program for a specific OS, it is a good idea to stay as close as possible to its standard interface. So harmonize with the OS, even if it appears less logical than you'd like, such as Print being a submenu of Edit rather than File. Whether or not the intended users are familiar with the standard interface of the OS, it is wise to take advantage of the available experience, even if its setup could be improved. Another type of experience that can be used is found in situations where some kind of automation is done. Whenever users are forced to switch from a manual system (on paper, for example) to a computerized system, they will already need to adapt heavily. Designing a user interface that looks like, and follows the same logical steps as, their old system will benefit them greatly. This also holds true when upgrading or switching computer systems.

Problematic Interface Access

The perception a user has of the performance of a program is mostly determined by the speed at which (new) information appears on her screen. Though some kind of delay may be accepted when calling up stored data, it is unlikely that any kind of delay will be accepted when accessing menu screens. Menus and submenus should therefore appear instantaneously. When a menu contains stored data, at the very least the menu itself should be drawn immediately (be it a box, a pop-up, or so on), after which the stored data can be added as it becomes available.

Not Sufficiently Aiding the Learning Curve

Here is where a lot of "look and feel" problems can be solved. A good example of a user-friendly program is one that can follow the learning curve of the user. A first-time user will, for example, benefit enormously from having access to an integrated help service. This could be a help menu with the ability to search for key words and phrases, or perhaps even the ability to automatically generate pop-up windows with information on what the user is doing and how he is likely to want to proceed. This first-time user is also likely to use the mouse to access the user interface. After using the program a while, though, this extra help is no longer needed, and pop-ups and nested menus get in the way of fast access. The user is now more prone to use hotkeys to quickly access functionality, and he will want to turn off any automatically generated help and unnecessary notifications.
Consider a simplified example of a program that uses a database of names. Although it might work fine for its initial use of approximately 1,000 names, does that provide any certainty about its behavior if another customer decides to use it for a base of 100,000 names? It all depends on how efficiently the sorting, storage, and retrieval algorithms were initially implemented. The following sections highlight different performance problems that can arise during the lifetime of a program.

Extending Program Functionality

Performance problems often arise when the functionality of a program needs to be extended. The market demands continuous updates of commercially successful software, with newer and improved versions. In fact, many users consider programs without regular updates to be dead and, therefore, a bad investment. The most common upgrades or extensions include the following:

- New and better user interfaces, including "professional" editions

- Added capabilities or options

- Support for more simultaneous users

- Support for new platforms

- Upgrades to reflect changes in the nature of the data

- Added filters to increase interaction with other programs

- Support for devices

- Network and Internet capabilities

However, keep in mind that it is neither wise nor beneficial to keep enhancing software with things it was not originally designed for; what you end up with is a very unstable and unmaintainable product. The initial design (the framework of the initial functionality) should be geared toward supporting future enhancements. This also means that the technical documentation should be clearly written and up to date, perhaps even split into function groupings that follow the design. This is a must if future enhancements are made by people other than the original developers. To add functionality properly, the programmer making the enhancements should be able to easily identify where his enhancement should go and how it should connect to the existing framework.

Code Reuse

Problems generated by reuse of existing code are closely related to those mentioned in the previous paragraph. Reuse of tested code can still cause grief, even with successful identification of how to integrate new functionality into an existing, well-designed framework. Think, for example, of a program that contains a sorting routine that cleverly sorts a number of names before they are printed in a window. A programmer adding new functionality might decide to use this routine to sort records of address information, to save some precious time. Although the sorting routine might have been more than adequate for its initial use, it can have severe shortcomings with respect to performing its new task. It is therefore prudent to investigate the consequences of using existing code, not least by defining new test cases. And again, good documentation plays an important role here. Refer to Chapter 3, "Modifying an Existing System," for more details.

Test Cases and Target Systems

On the whole, programmers are most comfortable when they are designing and writing software, so they generally resist both documenting and testing. It is therefore not unusual for testing to be reduced to merely checking whether new code will run. The question then is whether the test cases and test data used really represent any and all situations that can be found in the field. Does the programmer even know what kind of data sets will be used in the field, and whether it is possible to sufficiently simulate field situations in the development environment? The first step in solving such problems is having a good set of requirements that the programmers can use. If that proves insufficient, it might be necessary to use example data sets, or test cases, from the client for whom the program is being written. Or you might need to move the test setup to the client itself to be able to integrate properly in a "real" field environment. Another common mistake is to develop on machines that are faster or more advanced than those used by the client, meaning that the test and development teams do not get a correct impression of the delays the users will suffer.

Side Effects of Long-Term Use

It is possible that programs slow down when they are used over a longer period of time. Some common problems that can stay hidden quite well when programs run only for short periods of time include:

- Disk fragmentation due to usage of files

- Spawned processes that never terminate

- Allocated data that does not get freed (memory leaks)
- Memory fragmentation

- Files that are opened but never closed

- Interrupts that are never cleared

- Log files that grow too large

- Semaphores that are claimed but never freed (locking problems)

- Queues and arrays that exceed their maximum size

- Buffers that wrap around when full

- Counters that wrap to negative numbers

- Tasks that are not handled often enough because their priority is set too low

These problems will, of course, take effect in the field, and users will complain about performance degradation. Such problems are usually difficult to trace back to their source in the development environment. It is possible to do manual checks by:

- Stepping through the code with the debugger to see what really happens (more about this in Chapter 4)

- Printing tracing numbers and pointers and checking their values (see the sketch below)

- Checking programs with profilers (refer to Chapter 4), taking care to note the relations between parts of the program (should function x really take twice as long as function y? and so on)
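As a modest example of the second manual check, here is a sketch of a tracing macro (hypothetical, not from any particular library) that prints a pointer together with its source location, so never-freed or dangling allocations can be spotted in the log of a long-running program.

#include <cstdio>

// Prints the source location, the expression, and its pointer value.
#define TRACE_PTR(p) \
    std::fprintf(stderr, "%s(%d): %s = %p\n", \
                 __FILE__, __LINE__, #p, (const void *)(p))

void Example()
{
    char *buffer = new char[256];
    TRACE_PTR(buffer);      // logged when allocated
    delete [] buffer;
    buffer = 0;
    TRACE_PTR(buffer);      // logged as null after release
}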
Footprint
This second part of the chapter looks at optimization from the footprint viewpoint. It discusses several techniques for reducing footprint size, together with footprint problems that can arise when preventive actions are omitted.
What Is Footprint?
Strictly speaking, footprint is a spatial measurement term. The footprint of a desktop computer, for example, is about 40 x 40 centimeters (about 16 x 16 inches); that is the desk space it occupies. When talking about software, however, footprint can best be described as the amount of memory an object (a program, data structure, or task, for example) needs to function properly. This memory can be strictly internal ROM or RAM, but it is also possible that footprint requirements necessitate the use of external memory, for example, for temporary data storage. Refer to Chapter 2 for more information. Where executable programs are concerned, different kinds of footprint sizes can be identified, as explained in the following sections.

Storage Requirements

This is the amount of memory needed when the program is inactive: the memory required to store the executable file and the data files it needs or has acquired. From the perspective of the user, the storage requirement is simply the amount of space needed to store the program. During development, however, the story can be rather more complicated. When development is not done on the machines that are the eventual targets for running the program, storage calculations become a lot more involved. Refer to "How to Measure Footprint Size" for more information.

Runtime Memory Requirements

This is the amount of memory needed while the program is being executed. This footprint can differ from the storage footprint for several reasons. For example, the program might not need all the executable code at once, and the program will probably use working memory for temporary data storage. Moreover, the memory used during startup and execution will rarely equal that of the stored files, especially for larger programs, which are those made up of more than just a single executable file. Although most often more
memory is needed during execution, as one might expect, it is equally possible that a program in fact uses less memory. Practical examples of how and why runtime requirements can differ from storage requirements are given in the sections that follow.

Compression

Often, parts of a program and its data can be stored in compressed form. When a program uses large data files, it is unlikely that this data will be stored exactly as it is used in internal memory. It can be compressed using known or proprietary compression algorithms, or even stored using a structure that is more optimal for storage purposes. The executable code itself might also be compressed; when the program consists of several modules, the main program might load and decompress other modules as they are needed. JPEG graphical images and MP3 sound files are examples of well-known data compression techniques. Here, clearly, footprint size reduction is chosen over performance; data compression ratios of up to 90% can be achieved. Note that these techniques do allow data loss. The loss can be incurred in such a way that it is barely noticeable, or not noticeable at all, to human eyes or ears. However, this is of course not something we would want to do to executable code. The nature of the data being compressed plays an important role in the choice of compression technique. Whichever form of compression is used, however, the fact remains that the runtime memory requirements will most likely differ significantly from the storage requirements. Extra memory may even be needed to perform the compression and decompression. It should not be overlooked that compression and decompression take time, which means that starting a program might take longer. This highlights again the relationship between performance and footprint.

Data Structures

The structure that is used to hold and access the data might differ between storage time and runtime. Whereas data mostly needs to be small during storage, it might be far more important during runtime that it can be accessed quickly. Data that is kept small for storage is usually compressed and therefore slow to access, whereas data that must be accessed quickly at runtime is most likely uncompressed and therefore takes up more space; each goal calls for an entirely different structure. The structure chosen for fast access might include redundant information, such as extra index files generated before the data is accessed. Think, for example, of hashing tables, doubly linked lists, or sets of intermediate search results (refer to Chapters 10, "Blocks of Data," and 11, "Storage Structures"). Again it should be noted that generating this overhead costs time. Important decisions include how much of the data to describe in these index files, when to generate them, and what level of redundancy they should provide (at some point, generating more redundancy no longer makes data access faster).

Overlay Techniques

Not all the executable code might be needed at the same time. Dividing a program into functional modules allows the use of overlay techniques. For example, consider a graphics editor. It contains a module of scanning routines that is used when the user wants to scan a picture. After the user closes the window containing the scanner interface, the scanner module is replaced in memory by a module containing special effects routines. The modules do not necessarily have to be exactly the same size; just having only one of them in memory at any time will decrease the runtime memory requirements.
The more distinctly you can identify functional modules during the design phase, the better the overlay principle will work. The footprint size actually gained depends on the choices the developers make. It is important to identify program states that can never occur at the same time and to switch them intelligently. If it is impossible to use special effects during scanning, these two functions are a good choice for being interchanged. You can, of course, try to overlay groups that might in some cases be needed simultaneously, or closely after each other, but then you get a much more statistical picture of footprint size. When there is a fixed maximum to the footprint size that can be used, the worst-case combination of overlaid modules should be able to fit into it. Switching overlaid modules costs time and so has an impact on overall program performance.

Working Memory

A program being executed will need working memory regardless of whatever architectural choices are made about storage, compression, and overlaying. The use of working memory is very diverse. The following list shows the most common uses:

- Storing variables (pointers, counters, arrays, strings, and so on)

- Stack space

- Storing intermediate data

- Storing placeholders for data (structures, linked lists, binary trees)

- Storing handles and pointers to operating system resources

- Storing user input (history buffers and undo and redo functionality)

- Cache
In certain situations, it might enhance performance to set aside some memory to act as a buffer or cache. Think, for example, of the interaction with hardware devices as described in the section "Performance of Physical Devices." Refer also to Chapter 5, "Measuring Time and Complexity," for more detail on cache use and cache misses.

Memory Fragmentation

Another subject for consideration is the fragmentation of memory. It might not exactly fit our definition of runtime memory requirements, but it certainly does affect the amount of working memory needed. While a program runs, it continuously allocates and frees pieces of memory to house objects (class instances, variables, buffers, and so on). Because the lifetimes of these objects are not all the same, and often are unpredictable, fragmentation of large blocks of memory into smaller pieces will occur. It stands to reason that the longer a program runs, the more serious the fragmentation becomes. Programs designed to run longer than a few hours at a time (and that use memory fairly dynamically) could even benefit from some kind of defragmentation routine, though these are usually quite expensive performance-wise. Chapter 9 discusses memory fragmentation in more detail. The following sections give advice on how to measure and control memory requirements.
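The following sketch illustrates how fragmentation arises. The numbers are arbitrary, and actual behavior depends entirely on the allocator in use: after many small blocks are allocated and every other one is freed, plenty of total memory is free, yet a single larger request may still force the heap to grow.

void FragmentHeap()
{
    const int kBlocks = 1000;
    char *blocks[kBlocks];

    // Allocate many small blocks...
    for (int i = 0; i < kBlocks; ++i)
        blocks[i] = new char[64];

    // ...then free every other one, leaving 64-byte holes.
    for (int i = 0; i < kBlocks; i += 2)
    {
        delete [] blocks[i];
        blocks[i] = 0;
    }

    // Roughly 32,000 bytes are free again, but only in 64-byte pieces;
    // this single contiguous request may still force the heap to grow.
    char *big = new char[4096];
    delete [] big;

    // Clean up the remaining blocks.
    for (int i = 1; i < kBlocks; i += 2)
        delete [] blocks[i];
}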
fairly estimate the requirements for the target machine, some preliminary tests might be necessary to determine any differences in machine code or executable size between the two machines. This way, it will be easier to predict the size of the final program. For example, the processor of the target machine might use twice as many bytes to describe its instructions as the machine used for development. The target machine in turn could have a much better file management system that uses only half the overhead for block allocation. To make a better initial estimate of the target footprint, it is wise to create a document or spreadsheet describing these kinds of differences at the beginning of the project. This is also useful during development for quickly recalculating the target footprint size as the design of the program changes or its user requirements are readjusted.

Measuring Runtime Requirements

In addition to the problems associated with using different development and target systems, the dynamic aspects of programs further complicate the measurement of runtime requirements. This section discusses these aspects and offers advice on measuring them. It is necessary to calculate the actual uncompressed runtime code size. This is the largest size of all possible module combinations; when overlaying is used, the design should clearly state which combinations of modules can be found in memory together. Moreover, the dynamic memory usage of a program can be monitored. Basically this means running the program through realistic worst-case test scenarios and looking at what the program does with internal memory. Although such tests can be performed with tools, it might also be necessary to write some kind of indirection layer that catches the calls the program makes to the operating system. By having the program call a special glue layer when asking for (or returning) memory, this layer can record the sizes requested and released. It could even store information about memory locations, making it possible to look at fragmentation patterns. Spy software can also be used to look at the behavior of the program stack, heap size, or usage of other scarce operating resources (IRQs, DMA channels, serial and parallel ports, and buses).
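One way to build such an indirection layer in C++ is to replace the global operator new and operator delete. The sketch below is deliberately simplified: it only accumulates requested sizes and does not subtract them on release, nor does it cover operator new[]. A real measuring layer would track individual block sizes and locations.

#include <cstdlib>
#include <cstdio>
#include <new>

static std::size_t g_totalRequested = 0;

// Every new expression in the program now passes through this layer.
void *operator new(std::size_t size)
{
    g_totalRequested += size;           // record the requested size
    void *memory = std::malloc(size);
    if (memory == 0)
        throw std::bad_alloc();
    return memory;
}

void operator delete(void *memory)
{
    std::free(memory);                  // sizes not subtracted in this sketch
}

void ReportMemoryUsage()
{
    std::fprintf(stderr, "total memory requested so far: %lu bytes\n",
                 (unsigned long)g_totalRequested);
}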
The compiler options used at compile time greatly influence the generated footprint. In fact, most compilers have specific options for optimizing the generated code for either footprint size or performance. There are hidden dangers in the use of these options, however. Refer to Chapter 4 for more detailed information.
Summary
The definition of performance is subject to the perception of the person giving the definition. Users will have a different view of a program than developers, and project managers in turn might have yet another view. Even the views of individual users vary, due to differences in experiences and expectations. One reason that performance issues (and even bugs, unintuitive interfaces, and so on) come into existence at all is because of these different views.

The ideal program, as seen by the user, has the following characteristics:

- It needs little user interaction.

- It has an intuitive user interface.

- It has a short learning curve.

- It is highly flexible.

- It contains, and is accompanied by, extensive but easily readable user documentation.

- It has no waiting times during user interaction; any slow actions can be performed offline, and so on.

- It has readily available information for the user at any given time.

The ideal program, as seen by the developer, has the following attributes:

- It is geared toward future developments: added functionality, handling larger bases of data, and so on.

- It is easily maintainable.

- It has a good and intuitive design.

- It is accompanied by well-written technical documentation.

- It can be passed on to any developer.

Unsurprisingly, we have yet to encounter such a program, or indeed one that comes even close to satisfying most of the requirements stated earlier in this chapter. Some of the requirements even seem to openly defy each other. That is why concessions need to be made, based on the actual use of the program.

One can wonder why anyone should optimize algorithms and functions when hardware seems to become faster all the time. This chapter has reviewed several considerations about the benefits of optimization as opposed to hardware upgrades. Although hardware predictably improves every 18 months, upgrading it is a mere workaround that doubles speed at best; this does not effectively counter the increasing requirements and demands placed on software. Actually solving these issues requires serious investigation. The following chapter explains the further part system design plays in creating an efficient program.
System Requirements
Performance and footprint issues obviously play a large role in both the design phase and the implementation phase of a software project. However, this is not where the whole process starts; the requirements phase comes first. The following sections discuss why and how the requirements impact performance and footprint.
and be ready for a following burst. A closely related issue is the responsiveness of a system. What happens when input suddenly starts pouring in after a long period of drought? Does the system respond in due time, or does it experience difficulties? It seems more than logical that this should normally not be a problem, but even a household PC in everyday use can experience it. Background processes (such as screensavers, registry updates, and virus checkers) ensure that the PC does not respond instantaneously after a period of low interaction. The hardware also plays an important role in responsiveness: green PC options, power savers, suspend mode, and so on. When a system has to guarantee a certain response time, these options most likely need to be kept in check or maybe even turned off. It might also be possible to set specific parts of the system in some kind of standby mode while keeping the input channel monitoring active. This highlights the next area of consideration: the communication channels. If a system communicates with its environment via a modem connection, for example, it might be necessary to keep a continuous connection rather than allow the modem to log off, after which a costly reconnect would be needed before more data could be downloaded or uploaded. The analysis and design phases of a project should determine how best to proceed, that is, which operating mode to allow for each part of the system.

Impact of Stability and Availability on Footprint

The data burst example given earlier clearly demonstrates the tradeoff between stability and availability on one hand and the footprint on the other. The longer the system should be able to handle higher data rates, the larger its buffers need to be (see the buffer sketch at the end of this section). Another area to consider is the space needed for stored data. To guarantee a high level of reliability, it may be prudent to duplicate key sets of data. Think, for example, of the DOS/Windows OS, where the FAT (containing information on where and how files are stored) is kept in duplicate. The storage medium itself can also dictate a cautious approach. You might have to take into account the number of times a medium can be written to, and the average number of corrupted memory units that occur during usage. For instance, corrupted sectors on hard disks must be considered beforehand. Extra space may be needed to allow extensive error correction (Hamming codes) and so on. For more details on optimizing input and output of data, see Chapter 12, "Optimizing IO."
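A minimal sketch of this buffer tradeoff follows; the class name and the capacity are hypothetical. The capacity chosen at design time determines how long an input burst the system can absorb before data is lost, and it is paid for directly in footprint.

class BurstBuffer
{
public:
    BurstBuffer() : head_(0), tail_(0), count_(0) {}

    // Returns false when a burst exceeds the buffer capacity and
    // data would be lost.
    bool Put(char byte)
    {
        if (count_ == kCapacity)
            return false;
        data_[head_] = byte;
        head_ = (head_ + 1) % kCapacity;
        ++count_;
        return true;
    }

    // Retrieves the oldest byte; returns false when the buffer is empty.
    bool Get(char &byte)
    {
        if (count_ == 0)
            return false;
        byte  = data_[tail_];
        tail_ = (tail_ + 1) % kCapacity;
        --count_;
        return true;
    }

private:
    enum { kCapacity = 4096 };  // a larger capacity absorbs longer bursts,
    char data_[kCapacity];      // but costs a larger footprint
    int  head_, tail_, count_;
};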
Concurrency
When you set up the requirements for a software project or a system, it's important to take into account the environment in which it needs to function. Part of the requirements phase therefore includes assessing the usage of resources not only by the system itself, but also by tools and tasks likely to be used simultaneously. It's wise to determine answers to these questions:

- Will more applications be running on the system simultaneously? If so, identify them and find out about their resource usage. For example, a calculator program that uses large parts of the system resources is not going to be very functional, as it is most likely to be used as a second or third application to aid in subsets of a larger task. Also, certain resources have to be claimed and shared: serial and parallel ports, interrupts, DMA channels, input devices (such as a keyboard or mouse), and so on.

- Will a number of users be working with the system simultaneously? A scenario like this is of course possible with a multiuser OS, like UNIX, but it has an impact on the resources available to each user. It might even influence implementation decisions, such as the strategies used (client/server, COM, CORBA), locking mechanisms, and private copies of data for processing.

Investigating the resulting availability of storage and computing power will result in a requirement describing the constrained context in which the program should be able to run.
and footprint size.)

- How much overhead does the OS create? (How much memory or performance is lost when spawning a task, opening a file, creating a file on disk, and so on?)

- What is the minimal guaranteed response time? (And to what extent is this influenced by background tasks like updating registry information, doing page swapping, and so on?)

- Is it a real-time OS? (And is one really needed? See the requirements.)

The hardware and software choices affect performance and footprint. This will be reflected again in the design, which the following sections address.
Do you expect to use parts of your program in future contexts? Any parts to be reused should be identified and put together in functional groups. This is true also for the relevant documentation.

Intelligence Placement

Do you want intelligent submodules, or do you want to keep all the decision-making power in the main module? The difference is that either the submodules become quite large or the main module does, and portability and reusability are influenced accordingly. The same intelligence choices need to be made for client/server applications: who does most of the work, the client or the server?
class PaintCanvas;

void PaintCanvasManager::DoWop()
{
    int a;                                   // intermediate
    PaintCanvas firstCanvas;                 // intermediate
    int intMed = nrOfProcessedCanvasses;     // intermediate

    // ... processing elided ...

    if (/* condition elided */)
    {
        nrOfProcessedCanvasses = intMed;
    }

    // ...
}
Persistent Objects

Because persistent objects stay alive after the program terminates, they are still available when the program restarts. These objects are used to retain information about a previous run. Examples are files in which program data is stored, and keys set in the registry of the OS (Windows uses this). Such objects will even outlive a computer reboot. Other persistent objects may be resident in memory, like the clipboard to which you can copy pieces of text or pictures from, for instance, a word processor. Dedicated hardware may even have special memory reserved for storing persistent objects: disks, EEPROM, flash memory, and so on. The advantage of keeping objects such as these in memory is the speed with which they can be recalled when they are needed again. This can even shorten the restart time of the program, or of the dedicated hardware on which the system runs. Memory management decisions (as part of the program or system) can include deleting certain objects or moving them to disk to (temporarily) free some runtime memory. (A sketch of a simple persistent object follows at the end of this section.)

Temporary Objects

Temporary objects fall somewhere between intermediate and persistent objects. Their lifetime is generally longer than that of intermediate objects, meaning they outlive functions and program modules; however, they do not often outlive the program itself. An example of a temporary object is the undo buffer. Found in most word processors, the Undo function can recall a previous version of a text. The buffer content changes constantly during program execution, but the buffer itself is accessible as long as the program is active. Temporary autosave files also fall into this category; however, their properties are closer to persistent storage than those of the previous example.
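Below is a minimal sketch of a persistent object; the Settings structure and its file format are hypothetical. Its state is written to a file when the program shuts down and read back when it restarts, so the object outlives a single program run.

#include <cstdio>

// A plain structure, so its state can be written in a single call.
struct Settings
{
    int windowWidth;
    int windowHeight;

    bool Save(const char *path) const
    {
        std::FILE *file = std::fopen(path, "wb");
        if (file == 0)
            return false;
        bool ok = std::fwrite(this, sizeof(*this), 1, file) == 1;
        std::fclose(file);
        return ok;
    }

    bool Load(const char *path)
    {
        std::FILE *file = std::fopen(path, "rb");
        if (file == 0)
            return false;
        bool ok = std::fread(this, sizeof(*this), 1, file) == 1;
        std::fclose(file);
        return ok;
    }
};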
Spawned Objects

Spawned objects are donated to, and used by, a scope beyond their creation. Their lifetime is managed and controlled by the object or program they are donated to. Think, for example, of an object returned by a function or program module:
char *Donator::Creator()
{
    // ...
    char *spawn = new char[strlen(initstring) + 1];
    strcpy(spawn, initstring);      // fill the newly allocated string
    return spawn;                   // ownership passes to the caller
}

// ...

void Receiver::User()
{
    // ...
    char *spawned = donateClass->Creator();
    // ... use the string ...
    delete [] spawned;              // the receiver ends the object's lifetime
}
The spawned object in this example is a string allocated by the Donator class and donated to the Receiver class. After the donation, the donator proceeds as if the object never existed; the Receiver class is now the sole owner of the object and is in charge of its lifetime. Any memory leakage caused by the spawned object is in effect a Receiver problem. After the receiver indicates that it has correctly received the donated object, the donator must destroy all its administration of that object. The communication interface is very important here, because if for some reason the object is copied in the transferal, this will have an impact on performance as well as footprint. Refer to Chapter 8, "Functions."

Having examined the different storage types, you can conclude that the design should specify what kinds of objects are being dealt with, which in turn indicates how they should be stored: short-lived, frequently used objects should be stored in memory, perhaps with more persistent copies of the data for crash recovery purposes, and so on.
Prototyping
Build prototypes to give the client a more concrete idea of what the final product will look like. Keep in mind, however, that a prototype does not have to be a fully functional test version of a product. In fact, the purpose of prototyping is to offer the clients and developers a chance to get their heads pointed in the same direction before much development effort is invested. The prototype is, therefore, more of a simulation. It hints at the look and feel of the interface and the functionality it offers. Often, a prototype is nothing more than a user interface with stubbed functionality; the stubs deliver predefined values instead of doing actual data processing. The prototype might even be nothing more than a set of slides or drawings. Whatever its setup, it tests the completeness of the requirements and design.

As previously discussed, requirements are determined interactively with the client. After this, the design is made from the written requirements specification. These steps should not be seen as direct derivations, however; rather, they are the results of translations. The person writing the requirements specification translates the interaction with the client into technical and nontechnical requirements. Quite likely, some elements will be lost in this translation because of the different backgrounds and focus of the writer and the client. For example, the writer might lack exact insight into the everyday work situation of the client, and the client, in turn, might lack insight into what is technically possible. This scenario can repeat itself in the design phase, where the original requirements, even if they were right on the nose, are translated once more, this time by people who are probably even more technically oriented. It is no luxury to pause at this point in the process and ask the client whether he still agrees with the vision finally put forward in the design. The prototype is something that clients can directly comment on. While building prototypes, it is a good idea to keep the following issues in mind.
- Make sure the client is aware that the prototype is a simulation and not the real thing. The downside of a good prototype (simulation) is that the client cannot easily assess how much of the work has already been done. The client may assume that the demo constitutes the final product. During the demo, make clear to the client that all results are artificial, and that the user interface is just an initial setup that does not actually do anything.

- Although it is possible for a demo to be quite fast, it is wise to avoid perceived performance issues by keeping the prototype's speed consistent with that of the final product. This will deter any false expectations on the client's behalf. It is, therefore, a good idea to use initial response predictions to build artificial slowdown into the prototype, as sketched after this list. Just make the user interface wait a specified number of seconds before it produces the predefined results. Even more prudent would be to add a percentage of overhead to the worst-case predicted responses. This way, it should be possible to make each following version of the program (prototype, alpha version, beta version, and so on) increasingly responsive.

- Footprint and other target impressions need the same treatment, as was mentioned in the previous bullet.

- Although most prototypes are unlikely to be resource intensive, it is a good idea to pay attention to the hardware used for the demo. This is because the client may not fully understand the technical implications of prototyping and may not even think to ask if the final product needs more memory, disks, or processing speed than the system on which the prototype runs.
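A minimal sketch of such an artificial slowdown (the function name, result, and delay are illustrative; sleep() assumes a POSIX system):

/* Stubbed prototype handler: simulates the predicted response time. */
#include <unistd.h>   /* sleep() */

int stub_lookup(void)
{
    sleep(2);    /* worst-case predicted response plus some margin */
    return 42;   /* predefined result instead of real data processing */
}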
Processing Immediately
The most intuitive form of processing occurs when data is processed as it becomes available. You do not expect your calculator tool to accept your calculation and send you an email with the answer the next morning. Unless the processing of data is expected to take a long time, most folks basically expect processing to occur immediately. Some situations where it's a good idea to use immediate processing include:

- The data rate is low enough for the program to handle it immediately. To put it another way, there is enough idle time for the program to do immediate processing (as with, for instance, software embedded in a train-ticket machine).

- The data rate is high, but the process the data has to undergo is simple enough to be executed immediately without causing unacceptable slowdown of the whole program, as with decompression software (MP3 players, for example). (The process does not necessarily have to be simple in logical terms; the hardware just has to be able to execute it in a reasonable amount of time.)

- The data rate is high, and the process is complex, but there is simply no other choice than to process the data immediately. Consecutive steps need the processed data to continue. This means placing data in waiting queues and simply processing until the job is done (such as is the case when copying a large number of files with a file manager).
Processing on Request
Processing on request is basically the opposite of processing immediately. Instead of processing the data as it comes in (input), processing occurs when the data is needed (output). This means no processing occurs other than that which is absolutely necessary. It also means that processing demands can become high when the demand for data becomes high. Still, it can sometimes be useful to postpone processing. The following advantages can be identified:

- No time is lost on processing data during input; maximum performance is thus available for the input function. This is important for programs which need to respond fast to capture available data, such as sampler software for audio (audio sampling), video, or medical data (CAT scan).

- Data that is never retrieved will not be processed. No time and memory is lost on results that are never used. Consider the downloading of compressed files. It is possible to decompress during downloading, but some files may, in fact, not be needed at all.

- When the processed data occupies more space (because of derived data), storing data before processing will use up less memory or storage. Think of tools that perform some kind of decompression (such as text file versus desktop publishing formatting, MP3 versus WAV, and so on).

Depending on the specific situation, the following disadvantages can, however, come into play:

- Timing problems may occur if the demand for processed data (output) suddenly becomes high. Data may not be instantaneously available because of the high workload. An administration program that needs to churn out paychecks at the end of the month will be very busy if all the salaries still need to be calculated.

- When the input size of the data is larger than that of the processed data, space (memory or storage) is, in effect, lost (for example, keeping audio in WAV files instead of in MP3, and so on).

- A mechanism may be needed to locate and remove irrelevant input data or allow new input to overwrite certain existing data (for example, purging a database of Christmas card recipients of the names of people who declined to send you a card for the last three years).

To make a viable decision, look closely at the characteristics of the data and its usage. In essence, processing time and speed are set against available resources.
Periodic Processing
Periodic processing is likely to be used when the results of processing are not needed immediately and the program or system has an uneven workload. Perhaps there are certain times when the data rate is low (lunch, evening, early morning), the processing is cheap (power and phone costs), or the processing is less intrusive (network communications and so on). Benefits of periodic processing are:

- No extra processing during input.

- Processed data is generally available (though not immediately after being input).

- The processed input data and irrelevant data can easily be discarded.

But there can also be a downside:

- When the system needs to be active longer (for example, around the clock to catch up on processing data during the night), a high demand is placed on the availability of the system.

- Maintenance (including preventive maintenance) can be difficult to schedule when the system is active longer.

- The input data cannot be processed on demand, which limits the availability of processed data.
Batch Processing
As with periodic processing, batch processing takes advantage of the uneven workload of a system. This time, however, the processing is not necessarily triggered at a prespecified time; more likely, an event triggers the processing. This characteristic makes batch processing similar to processing on request, with the difference that whole batches or sets of data are processed. Batch processing shares the pros of periodic processing and has the added advantage that it can be triggered outside the specified period(s).
Summary
Performance and footprint issues can be identified, and eliminated or worked around, as early as the requirements and design phases. Special care must be taken to find out the limitations of the chosen target hardware. Also, it is important to establish that the development team and the client are in full agreement regarding the specific requirements of the system and whether the software is developing along the lines of the set performance and footprint targets. Schedule performance and footprint tests as part of the
development planning. Evaluate the lifetime of the objects identified during the design and the characteristics expected from the processing of the input data.
- Finding candidate areas for optimization

- Analyzing the candidate areas and selecting targets for optimization

- Performing the design and implementation of the optimizations

The next sections discuss the theory behind these optimization steps, as well as the human resource considerations. Optimization of software has an impact on project cost and resource availability, particularly when existing software is still in development. This is why human resource considerations form an integral part of any optimization plan.
- Potential new bugs introduced from new code

- Regression tests to ensure sound functionality

- Documentation updates (design, manuals)

The architect directly oversees the optimization process when it is limited to bug fixes. However, the architect needs to devote more attention to optimizations related to design errors; a redesign of the faulty software (modules/functions) might even be needed in this instance. Note that only two or three individuals have worked on the optimization up to this point. In smaller projects, several roles might even be handled by a single person, thereby using even fewer resources. It is therefore fair to say that the impact on the normal development process has probably been minimal. Analyzing candidate areas for optimization is discussed in more detail later in this chapter in the section "Analyzing Target Areas."
Some typical examples of algorithm problems found in the field are:

- Inefficient calculations (calculating sine/cosine and so on instead of using precalculation tables)

- Inefficient functions for searching, replacing, and sorting data

- Resource managers that perform poorly (memory managers, file interaction classes, and so on)

- Caching mechanisms that do not work properly or are not optimally tuned

- Interrupts that are never generated or that are serviced at moments when they are not useful (due to being blocked before)

- Blocks of data that are marked as deleted but which are never actually deleted, making searches and insertions in data structures slower

- Inefficient conversions of data formats

- Inefficient storage methods (writing a text character by character rather than in blocks at a time)

For most bugs, just turn directly to the code and begin repairing. This is true also for some algorithm problems; however, most will likely need some kind of redesign. In such cases, those parts of the system were not originally well designed, so more damage than good can come from quickly hacking a patch or workaround. Sometimes, more time is needed for parts of the system that might have seemed trivial at the start.

Predictable Processing Time

More drastic measures might be necessary to optimize performance and footprint when processing times are approximately what is expected but are still too slow. Although it is possible to make existing algorithms somewhat more efficient, the source of the problem usually is in the design. It's necessary to redesign the entire process, not merely the algorithms. To do so, go back and reacquaint yourself with the original problem domain that your system is supposed to solve. These performance problems are present at a very high abstraction level (the design), and thus are introduced not by the implementer(s) but by the architect(s). This means that optimizations will be very time consuming and most likely will result in fairly extensive code changes or additions.

These design flaws can generally be traced back to poor, or constantly changing, requirement specifications. In fact, without complete requirements, developers can create software that completely misses its mark. Consider the following scenario. A pen manufacturing company gives developers incomplete requirements for its database. So they create a beautiful and elegant database that is lightning fast at finding specific product entries. However, this manufacturer produces only twenty distinctly different products, and it accesses the database through client entries, of which there are thousands. This example shows that without clear requirements that reflect what is expected of the system, the design and implementation might miss the correct focus. The architect designs the system based on impressions from reading the requirements (and from any interaction with the actual clients or end users). The implementers further interpret the requirements as they write the code. Any considerations overlooked at the beginning of this process will likely be lost.

Unpredictably Reduced Processing Time

Keep in mind that reduced processing time can be a red herring. Remember that you are optimizing existing code because it suffers from footprint or performance problems. Investigate all unexpected behavior, including parts of the system that are unexpectedly fast or use less memory than anticipated. These characteristics are likely indications of problems.
For example, data might be processed incompletely or incorrectly. This might even be the cause of performance problems in other parts of the system. One thing is certain: any existing code problems will crop up sooner or later to make your life miserable (if you want to know exactly when, just ask Murphy). Preventive maintenance is essential.
- The specific program modules that access the data

- The block size of the data

- The number of blocks expected to be stored simultaneously at any given time

- The time of creation or receipt of the data

- The specific program modules that create or receive the data

- The typical lifetime of the data in the system

When the different data types are identified, consider how to store and access each type. This is where data models come in.

Storage Considerations

It is important to determine, per data type, whether you are dealing with many small blocks, a few large blocks, or perhaps combinations of both. Sometimes it is advisable to group data (blocks) together into larger blocks. This way, it can be stored and retrieved more efficiently, especially when the grouping is done in such a way that data likely to be needed simultaneously is grouped together. However, sometimes it is advisable to split up larger blocks of data for exactly the same reason. Handling data in blocks which are too large can prove equally inefficient. For example, performance can drop suddenly when the OS needs to start swapping memory to and from storage to find a large enough continuous address space. However, consider keeping the data completely in cache when it is accessed often. Refer to Chapter 5, "Measuring Time and Complexity."

Processing Considerations

You might want to split up or shift the moments at which you process the data. It might be possible to divide the processing step into several substeps. This way, you can store the intermediate processing results to save storage and optimize idle CPU time (effectively evening out the CPU workload). You might even consider shifting the whole processing of data to an earlier or later part of the program.

Transaction Considerations

After obtaining profiler information, consider these questions so that you have a more complete picture of the situation (refer to Chapter 4): How and when does the data come into the system? How and when do you transmit or display the processed data? When the data arrives in bursts, a different data model is needed from when it trickles in at a low rate. The extent to which data arrival can be predicted also plays a role. Sometimes it is possible to use idle time when data comes in. When input arrives via keyboard, for example, it is unlikely to be taxing for any kind of system. The answers to these questions provide different details for the data models than those the profiler can give. This is because the profiler cannot account for expectations and predictability.

After you map out these data interactions, concentrate on leveling out the CPU load. Even a small amount of leveling will be beneficial. This also holds true for memory usage. Spreading out memory usage over time affects not only runtime footprint but also performance. An added advantage here is that the memory freed up during this action can be used, for instance, to increase cache and buffers to boost performance.
party company. For example, with a slow memory manager, consider replacing it with an existing memory manager that is better equipped to handle the specific dynamic characteristics of your system. Similarly, when you have problems with the data your system needs to handle, consider buying a database management system (DBMS) from a company that specializes in building DBMSs. Of course, there are pros and cons to consider when choosing to use existing software.

Pros of using existing software include:

- It does not need to be designed or implemented.

- It has already been tested and might even have proven itself in the field.

- The complete development cost of the software does not have to be assigned to the project budget, only the acquisition costs.

- Problems with the existing software do not have to be solved within the project. (This can also be a liability, as shown in the next list.)

The preceding list shows that using existing software means less development time and more certainty regarding the quality of the software. However, we have yet to look at the cons.

Cons of using existing software include:

- Unless your contract with the provider of the software specifically states so, you have no control over updates to the existing software. This means that when a bug is found in the purchased module(s), you can only report it and hope it will be fixed before you have to deliver your system to your client.

- The existing software will have to be integrated into your system. This can mean that changes to interfaces, and even glue layers, are needed. Every extra indirection incurred when using third-party functionality uses up some of the performance you are trying to win.

- The existing software was almost certainly written based on different requirements than those specified for the part you are replacing. This can mean a certain mismatch or difference of focus where parts of the functionality are concerned.

- When the provider of the software is a third party, there is always the danger that development on that software will be discontinued some time in the future. This can be a disaster when you expect to need support on the software or updated versions.

When the sources of the software become available to the project, this can soften many of the cons. Also, for some static parts of the system, future support might be unnecessary. Simply taking the offered functionality "as is" might do. Often, however, you might not even have the luxury to make a decision. When your requirements are too specific to find third-party solutions, or when existing solutions are too costly to acquire or incorporate, it's necessary to just do the work in the project itself.

Designing and Implementing from Scratch

Consider the following pros and cons when building the replacement part from scratch.

Pros of writing new software for replacement include:

- Using the results from the previous solution, you have more information to compare and measure against, so that in early stages it's apparent whether the new solution is on the right track.

- Because you write everything yourself, you can tweak and tune the new design and implementation to fit your purposes perfectly.

- Wherever the new design calls for changes to interfaces, no glue will be needed. You can adjust both sides of all interfaces.

- Future changes to your system (requirements changes, functional updates, and so on) can be handled internally in the project. When new versions need to be made, it is unnecessary to request third parties to make changes to certain parts of your system.

- No copyright issues need to be settled. This means you do not have to charge your client for using third-party components.

The preceding list shows that the main advantage of building in-house replacement parts is the control you have over the different facets of development. Although the cons of rewriting system parts are fairly straightforward, they are included here for completeness.

Cons of writing replacement software from scratch include:

- Designing and implementing new software will take time.

- New software will mean more testing and perhaps even new bugs. Again, this will take up development time.

- You do not know with a great amount of certainty how much performance or footprint size you will actually win with this new code. You can make estimations, but you are dealing with a part of the program that has previously produced problems with
predictability.

This list shows that rewriting software parts negatively affects development time without giving any real assurances about results. Comparing the four lists on replacing software parts, it's logical to replace with existing software when:

- A replacement of sufficient, and known, quality can be found. When specific performance or footprint data is missing, some time needs to be spent on evaluating the software.

- Your targets for maintainability of the replacement software can be met.

- There is not enough time or resources to rewrite the software internally in the project.

In all other cases, it makes more sense to do optimizations within the project.
Summary
The process of optimization consists of three equally important parts:

- Finding candidate areas for optimization

- Analyzing the candidate areas and coming to conclusions

- Performing the design and implementation of the (selected) optimizations

After candidate areas have been discovered and selected for maintenance, a mix is to be chosen of replacing and fine-tuning existing code.
With code replacement, an additional choice must be made between redesigning and finding an existing replacement part. Existing replacement parts should be used when:

- A replacement of sufficient and known quality can be found.

- Your targets for maintainability of the replacement software can be met.

- There is not enough time or resources to rewrite the software internally in the project.

This chapter concludes Part I. The following part, "Getting Our Hands Dirty," discusses implementation-specific problem areas by investigating programming examples of problems found in the field. Insight and ready-to-use solutions are given for subjects such as function calls, memory management, IO, handling data structures, and the correct use of programming tools.
The Compiler
The compiler is a tool that transforms a program written in a high-level programming language into an executable binary file. Examples of high-level programming languages are Pascal, C, C++, Java, Modula 2, Perl, Smalltalk, and Fortran. To make an executable, the
compiler makes use of several other tools: the preprocessor, the assembler, and the linker. Although these tools are often hidden behind the user interface of the compiler itself, it can be quite useful to know more about them, perhaps even to use them separately. Listing 4.1 shows an example of compiler input. This is a C program that prints the text Hello, Mom! onscreen. Listing 4.1 Listing Hello.c
#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("Hello, Mom!\n");
    return 0;
}
This chapter uses the GNU C Compiler (gcc) as the example compiler. It is available for free under the GNU General Public License as published by the Free Software Foundation. There is also a C++ version of this compiler (often embedded in the same executable) which can be called with the g++ command. In all this chapter's examples, the C compiler can be substituted by the C++ compiler. The command gcc hello.c -o hello will translate the source file hello.c into an executable file called hello. This means the whole process (from preprocessor to compiler to assembler to linker) has been carried out, resulting in an executable program. It is possible, however, to suppress some phases of this process or to use them individually. By using different compiler options, you can exert a high degree of control over the compilation process. Some useful examples are:
gcc -c hello.c
This command will preprocess, compile, and assemble the input file, but not link it. The result is an object file (or several object files when more than one input file is specified) which can serve as the input for the linker. The linker is described later in this chapter in the section "The Linker." Object files generally have the extension .o. The object file generated by this example will be called hello.o.
gcc -c hello.s
When the compiler receives an Assembly file as input, it will translate this just as easily into an object file. There is no functional difference between object files generated from different source languages, so a hello.o generated from a hello.s will be handled by the linker the same way as a hello.o generated from a hello.c.
gcc -S hello.c
This option halts the compilation process even earlier, resulting not in an object file but in an Assembly file: hello.s (this file could be used as input in the previous example). Listing 4.2 provides the Assembly output generated by the compiler for an Intel 80x86 processor, using the hello.c example as input. Depending on the compiler used, output may differ somewhat. Listing 4.2 The Assembly Version of hello.c
.file "hello.c" .version "01.01" gcc2_compiled.: .section .rodata .LC0: .string "Hello, Mom!\ n" .text .align 16 .globl main .type main: pushl %ebp movl %esp,%ebp pushl $.LC0 call printf addl $4,%esp .L1: movl %ebp,%esp popl %ebp ret .Lfe1: main,@function
file://C:\Users\Dan\AppData\Local\Temp\~hhB969.htm
20/08/2013
New Page 27
Pgina 33 de 206
.size .ident
(egcs-1.1.2 release)"
This Assembly code can be turned into an object file by invoking the assembler separately. This way, it is possible to fine-tune the Assembly code generated by the compiler. The assembler is described in the section "The Assembler," later in this chapter.
gcc -E hello.c
This command will only preprocess the input file. Because the output generated by the preprocessor is rather lengthy and involved, it is discussed separately in the section "The Preprocessor," later in this chapter. Figure 4.1 shows the process by which an executable is obtained.

Figure 4.1. The compilation process.
Note that no output file has been specified for the precompiler. When the precompiler is used separately from the compiler, it sends its output to screen. This output can, of course, be redirected to a file (gcc -E hello.c >> outputfilename.txt). A final remark concerning compilers: Most compilers will simply take any intermediate file you throw at them and translate it into an executable. For instance, gcc hello.o -o hello will just link the provided object file and generate an executable, and gcc hello.s -o hello will assemble and link the provided Assembly file into an executable.
The Preprocessor
As you discovered in the previous section, the -E option of the compiler command allows us to use the precompiler separately. Doing its name proud, the preprocessor takes care of several processes before the compiler starts its big task. The following sections describe the main tasks performed by the preprocessor.

Include File Substitution

The #include statements are substituted by the content of the file they specify. The include line from the example
#include <stdio.h>
will be replaced with the contents of the file stdio.h. Within this file, further includes will be found which in turn will be substituted. This is the reason care should be taken not to define circular includes. If stdio.h were to include myinclude.h and myinclude.h were to again include stdio.h, the preprocessor would be told to substitute indefinitely. Most preprocessors can deal with these kinds of problems and will simply generate an error. It is wise, however, to always use conditional statements when handling includes (see the section "Conditional Substitution," later in this chapter). The <> characters tell the preprocessor to look for the specified file in a predefined directory. Using double quote characters ("") tells the preprocessor to look in the current directory:
#include "myinclude."
In practice, you will find that you use double quotes when including the header files you have written yourself and <> for system header files.

Macro Substitution

Macro substitution works in a way which is very similar to include file substitution. When you use a macro, you define a symbolic name (called the macro) for what is in effect a string. The preprocessor hunts down all the macros and replaces them with the strings they represent. This can save the programmer the trouble of typing the string hundreds of times. An often-used macro is the following:
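A classic instance (the name and value here are illustrative assumptions) defines a symbolic constant; the preprocessor replaces every occurrence of the name before compilation starts:

#define MAX_BUFFER_SIZE 1024

char buffer[MAX_BUFFER_SIZE];   /* becomes: char buffer[1024]; */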
The Assembler
An assembler is a tool that translates human-understandable mnemonics into microprocessor instructions (executable code). Mnemonics are in fact nothing more than symbolic text representations of those microprocessor instructions. For instance, the mnemonic ADD A,20 (which adds 20 to the value of register A of the microprocessor) will be translated by the assembler into the following numbers: 198 and 20. It should not surprise you to see that microprocessor instructions (or machine instructions) are numbers; the computer memory holds nothing but numbers. This example shows that mnemonics are highly machine dependent, as every mnemonic instruction represents a microprocessor instruction for a very specific microprocessor (the example is taken from Z80 mnemonics). An Assembly program written for a 68000 microprocessor cannot simply be assembled to run on an 80386. The program is not portable and has to be completely rewritten. Another consequence of this close proximity to the actual hardware is that Assembly is very useful for optimizing programming routines for speed/footprint. Listing 4.3 shows a small piece of a mnemonic listing for a 68000 microprocessor: Listing 4.3 A 68000 Assembly File
loop:
file://C:\Users\Dan\AppData\Local\Temp\~hhB969.htm
20/08/2013
New Page 27
Pgina 35 de 206
This example will be fairly easy to read for someone familiar with 68000 mnemonics. However, Assembly listings (programs) tend to become lengthy and involved. Often, the goal of a piece of code does not become apparent until a large part has been studied, because a single higher programming language instruction will be represented by many Assembly instructions. Development time is therefore high and looking for bugs in Assembly listings is quite a challenge. Also, working with Assembly requires the programmer to have a high level of system-specific technical knowledge. In the early days of computer programming (even until the late 1980s), the use of Assembly languages was quite common. High-level languages, such as Pascal or C, were not that common, and their predecessors were slow (in both compiling and executing). The generated code was far from optimal. Nowadays, compilers can often do a far better job of optimizing Assembly than most developers could, so Assembly is only used where software and hardware interact closely. Think of parts of operating systems, such as device drivers, process schedulers, exception handlers and interrupt routines. In practice, when you find Assembly, it is most often embedded within sources of a higher programming language. Listing 4.4 shows the use of Assembly in a C++ listing: Listing 4.4 A Listing Mixing 68000 and C/C++
void f()
{
    printf("value of a = %d", a);
}

#asm
label:
    jsr     pushregs
    add.l   #1, d0
    rts
#endasm
It might surprise you to see that languages can be mixed and matched like this. If so, it will surprise you even more that C and C++ can even be mixed to take advantage of their independent merits; there's more about this in the section "Mixing Languages." Note that mixing Assembly with a higher-level language makes the source less portable because the Assembly will still only work on one specific microprocessor. In the section "The Compiler," you saw how to make an executable from an Assembly file. It is, however, also possible to generate an object file from an Assembly file. This is done by calling the assembler with the -o option:
as hello.s -o hello.o
The generated object file can serve as input for the linker, which is the subject of the next section.
The Linker
Using the linker is the final phase in the process of generating an executable program. The linker takes the libraries and object files that are part of the project and links them together. Often, the linker will also add system libraries and object files to make a valid executable. This might not always seem obvious, but an example will make it clear: because stdio.h is included in the hello.c program, it can use the printf command. This command, however, is not part of some ROM library, so the code implementing the function has to come from somewhere else. This is why the linker has to add a system library to the executable. To use the linker separately, use the command
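The exact invocation is toolchain specific; a sketch, letting gcc act as the linker driver so that the required system libraries and startup objects are added automatically:

gcc hello.o -o hello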
The Make Utility

As the previous sections show, you can use a single command to create the executable of the example program hello.c. But this program is simple; it does not use multiple modules or have complex project interdependencies. As programs grow, they are often split into functional modules, use third-party libraries, and so on. Compiling such a program (project) can be made complicated by the interdependencies between modules (and even between source files!). The Make utility aids in checking up on dependencies and recompiling all objects that are affected by a change in one or more source files, making it possible to rebuild the entire software tree with one single command. This section discusses only the most important Make options because there are enough to fill entire chapters. The Make utility uses an input file (called a makefile) that describes the project (program) it has to make. This makefile defines the project interdependencies to be checked by the utility. Listing 4.6 shows the makefile that can be used for the example program. Listing 4.6 A Makefile
#This is the makefile for Hello.c
OBJS = hello.o
LIBS =

hello: $(OBJS)
	gcc -o hello $(OBJS) $(LIBS)
You can now build the example program by simply typing make on the command line when the current directory contains the makefile. It is, of course, possible to put the project definition in a file with a different name, but the Make utility then has to be told explicitly which file to look for:
make -f mymakefile
This is useful when you want to create more than one project definition in the same directory (perhaps to build different versions of the software from the same directory of source files). Let's look at Listing 4.6 line by line and see what is happening:

Line 1: This simply is a comment; it could have been any text that might be useful.

Line 2: This defines the objects needed for the target; it actually defines a variable OBJS which will be used in the makefile from now on.

Line 3: This defines the variable LIBS, which again can be used in the remainder of the makefile. Because the hello program is so simple, it is not actually needed.

Line 5: This is the definition of a rule that states that the target hello depends on the content of the variable OBJS.

Line 6: This line actually lists the command to build the target. There can be multiple command lines, as long as they always start with a tab.

For more makefile commands, you are referred to the appropriate manuals and to the section "The Profiler," later in this chapter, where a slightly more complex makefile is explained.
The Debugger
Bug searching is an activity every programmer will have to do sooner or later. And when saturating the sources with print statements has finally lost its charm, the next logical step is using the debugger. The power of the debugger lies in the fact that it allows the user to follow the path of execution as the program runs. And at any time during the execution, the user can decide to look at (and even change) the value of variables, pointers, stacks, objects, and so on. The only precondition is that the executable contains special debugging information so that the debugger can make connections between the executable and the sources; it is more useful to look at the source code while debugging a program than to be presented with the machine instructions or the Assembly listings. Adding debug information will make the executable larger, so when a program is ready to be released, the whole project should be rebuilt once more, but without debug information. Adding debug information to the executable can be done by adding the -g option to the compile command:
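For example (a sketch following the gcc conventions used throughout this chapter):

gcc -g hello.c -o hello

The debugger (here gdb, the GNU debugger) can then be started on the resulting executable: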
gdb hello
Countless debugging tools are available, but the terminology used is generally the same. The following sections provide an overview of
important terms.

Run

Run starts the program execution. It is important to note that run will continue running the program until the end or until a breakpoint (see "Breakpoint," later in this chapter) is encountered. In general, you set breakpoints and then run the program, or single-step (see "Step") into the program to start debugging activities.

Step

A step executes a single source file statement. As opposed to run, a step executes only one statement and then waits for further instructions. Executing part of a program this way is called single-stepping.

Step Into/Step Out Of

When single-stepping through a program, you can decide between seeing a call to a function as a single instruction (stepping, in effect, over the function call) or directing the debugger to step into the function and debug it. Once inside a function, it is possible to skip the rest of the statements of the function and return to the place where it was called. In that case, you decide to step out of the function.

Breakpoint

A breakpoint is a predefined point in a source file where you want the debugger to stop running and wait for your instructions. You generally set a breakpoint on a statement close to where you think the program will do something wrong. You then run the program and wait for it to reach the breakpoint. At that point, the debugger stops and you can check variables and single-step (see "Step") through the rest of the suspicious code.

Conditional Breakpoint

A conditional breakpoint is a breakpoint that is seen by the debugger only when a certain condition has been satisfied (i > 10 or Stop != TRUE). Instead of simply stopping at an instruction, you can tell the debugger to stop at that instruction only when a certain variable reaches a certain value. For example, say a routine goes wrong after 100 iterations of a loop. You would need 100 run commands to get to that program state using a normal breakpoint. With a conditional breakpoint, you can tell the debugger to skip the first 100 iterations.

Continue

Sometimes this is also called Run. A continue is a run from whatever place in the code you might have reached with the debugger. Say you stop on a breakpoint and take a few single steps; you can then decide to continue to the next breakpoint (or to the end of the program if the debugger encounters no more breakpoints).

Function Stack/Calling Stack

The calling stack is the list of functions that was called to arrive in the program state that you are examining. When the debugger arrives at a breakpoint, it might be useful to know what the path of execution was. You might be able to arrive in function A() via two different calling stacks: Main() -> CheckPassword() -> A() and Main() -> ChangePassword() -> A(). Sometimes it is even possible to switch to the program state of a previous call, making it possible to examine variables before the routine under investigation is called. By going back to CheckPassword() or even Main(), you can see, for instance, what went wrong with the parameters received by A().

Watch

A watch is basically a view on a piece of memory. During debugging, it is possible to check the value of variables, pointers, arrays, and even pieces of memory by putting a watch on a specific memory address. Some debuggers will even allow you to watch the internal state of the microprocessor (registers and so on).
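In gdb, the debugger started above, these concepts map onto commands roughly as follows (the file name, line number, and variable names are illustrative):

(gdb) break main                    # plain breakpoint
(gdb) break hello.c:12 if i > 100   # conditional breakpoint
(gdb) run                           # start execution
(gdb) step                          # single-step, stepping into function calls
(gdb) next                          # single-step, stepping over function calls
(gdb) finish                        # step out of the current function
(gdb) continue                      # run until the next breakpoint
(gdb) backtrace                     # show the calling stack
(gdb) watch j                       # watch a variable for changes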
The Profiler
The profiler is the most important tool related to performance optimizations. This tool can determine how often each function of a program is called, how much time is spent in each function, and how this time relates (in percentages) to the other functions. In determining which parts of a program are the most important candidates for optimization, such a tool is invaluable. There is no point in gaining optimizations in functions which are not on the critical path, and it is a waste of time to optimize functions which are seldom called or are fairly optimal to begin with. With the profiler, you look for functions that are called often and that are relatively slow. Any time won in these functions will make the overall program noticeably faster. An important side note is that one should be careful in trying to optimize user input functions. These will mostly just seem slow because they are waiting for some kind of user interaction! As with the debugger, the use of the profiler warrants some preparations made to the executable file. Follow these steps to use the profiler:

1. Compile the executable with the -pg option so the profiler functions are added. The command gcc -pg <source-name>.c -o <exename> will do this.

2. Run the program so the profiler output is generated. Only functions that are called during this run will be profiled. This also implies that for the best results you need to use test cases that are as close to field use as possible.

3. Run the profile tool to interpret the output of the run.

Listing 4.7 will be run through the specified steps. Note that the new example program used is more interesting for the profiler. Listing 4.7 Program to Profile
#include <stdio.h>
#include <stdlib.h>   /* for exit() */

long mul_int(int a, int b)
{
    long i;
    long j = 0;
    for (i = 0; i < 10000000; i++)
        j += (a * b);
    return(j);
}

long mul_double(double a, double b)
{
    long i;
    double j = 0.0;
    for (i = 0; i < 10000000; i++)
        j += (a * b);
    return((long)j);
}

int main(int argc, char *argv[])
{
    printf("Testing Int    : %ld\n", mul_int(1, 2));
    printf("Testing Double : %ld\n", mul_double(1, 2));
    exit(0);
}
This program contains two almost-identical functions. Both functions perform a multiplication; however, they use different variable types to calculate with: mul_int uses integers and mul_double uses doubles. The profiler can now show us which type is fastest to calculate with. Compile this program with the following command line:
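Following the template from step 1 (the source file name test.c is an assumption, consistent with the makefile later in this section):

gcc -pg test.c -o test

Then run the program so that the profile data is generated: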
test
The following output should appear onscreen:
Testing Int    : 20000000
Testing Double : 20000000
The output file that was generated during the run is called gmon.out. It contains the results of the profile action and can be viewed with the following command line:
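With the GNU toolchain this is the gprof tool (the executable name test is assumed, as above):

gprof test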
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
62.50       0.10      0.10        1   100.00   100.00  mul_double
37.50       0.16      0.06        1    60.00    60.00  mul_int
 0.00       0.16      0.00        1     0.00   160.00  main

% time: the percentage of the total running time of the program used by this function.

cumulative seconds: a running sum of the number of seconds accounted for by this function and those listed above it.

self seconds: the number of seconds accounted for by this function alone. This is the major sort for this listing.

calls: the number of times this function was invoked, if this function is profiled, else blank.

self ms/call: the average number of milliseconds spent in this function per call, if this function is profiled, else blank.

total ms/call: the average number of milliseconds spent in this function and its descendents per call, if this function is profiled, else blank.

name: the name of the function. This is the minor sort for this listing. The index shows the location of the function in the gprof listing. If the index is in parenthesis it shows where it would appear in the gprof listing if it were to be printed.
The first column tells us that 62.5% of runtime was spent in function mul_double() and only 37.5% in the function mul_int(); the value for main seems to be negligible, and this is not so surprising as the main function only calls two other functions. The numbers are reflected again in column 3 but this time in absolute seconds. Note that the output is sorted by the third column, so it is fairly easy to find the most time-consuming function. It is up to the developer to decide whether this usage of time is acceptable or not. Listing 4.9 shows what the steps for profiling look like when transformed into a makefile. Listing 4.9 Makefile with Different Targets
# The rule structure follows the original listing;
# the variable values shown here are illustrative assumptions.
PROG    = test
OBJS    = $(PROG).o
COPT    = -O2
PROFOPT = -pg
PROFCMD = gprof
LDLIBS  =

test:
	gcc $(COPT) $(PROG).c -o $(PROG) $(LDLIBS)

profile: $(OBJS)
	gcc $(COPT) $(PROFOPT) $(PROG).c -o $(PROG) $(LDLIBS)
	$(PROG)
	$(PROFCMD) $(PROG)

clean:
	rm $(PROG).o gmon.out $(PROG)

execute:
	$(PROG)
The makefile presented here has different targets, depending on how the Make is invoked:

make and make test: Generates a normal executable as shown in the previous sections

make clean: Removes all generated object files and profiler output

make execute: Runs the generated executable

make profile: Creates the profile output (hint: to store this in a file, redirect it again)
As with the debugger, the profiler will influence the program it profiles. This time, however, the implications are not as far-reaching because all functions are influenced in almost exactly the same way. The relations between the functions are therefore usable.
In essence, it looks at the source code (all the source code!) and remarks on constructions that it finds suspicious. This is very much like the task the compiler performs, and a lot of analyzer functionality can in fact be invoked by raising the warning levels of most compilers, but the analyzer goes beyond the syntactical correctness of the code. The following example will pass the compilation process with flying colors because it is syntactically correct; in fact, you might even have written this listing on purpose:
if (a = 1)   /* assignment where a comparison (==) was probably intended */
{
}
It is also possible to compile your program with the -Wall option (warnings all) to get analyzer-type functionality:
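For example (a sketch reusing the hello.c source from earlier in this chapter):

gcc -Wall hello.c -o hello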
Optimization Options for GCC and G++:

-fcaller-saves, -fcse-follow-jumps, -fcse-skip-blocks, -fdelayed-branch, -felide-constructors, -fexpensive-optimizations, -ffast-math, -ffloat-store, -fforce-addr, -fforce-mem, -finline-functions, -fkeep-inline-functions, -fmemoize-lookups, -fno-default-inline, -fno-defer-pop, -fno-function-cse, -fno-inline, -fno-peephole, -fomit-frame-pointer, -frerun-cse-after-loop, -fschedule-insns, -fschedule-insns2, -fstrength-reduce, -fthread-jumps, -funroll-all-loops, -funroll-loops, -O, -O2, -O3
This list can be obtained on UNIX via the command line
man gcc
This section presents in-depth examples of what happens when you use several different optimization options. For an explanation of all the options, you are referred to the UNIX man pages on gcc. To illustrate the optimization steps of the compiler, this section presents several Assembly files generated by the g++ compiler. It is not actually necessary to understand the Assembly code line-by-line to appreciate what happens as the general concepts will be explained in the accompanying texts. Listing 4.10 presents the example program which is used in this section. Listing 4.10 Example Program for Compiler Optimization: test.cc
#include <stdlib.h>   /* for exit() */

int main(int argc, char *argv[])
{
    int i, j = 0;
    for (i = 1; i <= 10; i++)
        j += i;
    exit(j);
}
This program simply adds the numbers 1 through 10 in a loop and returns the result (55) on exit. Listing 4.11 shows the Assembly code generated without any optimizations turned on (g++ -S test.cc). Listing 4.11 Generated Assembly for Listing 4.10 Without Compiler Optimization: test.s (Not Optimized)
.file "test.cc" gcc2_compiled.: ___gnu_compiled_cplusplus: .def ___main; .scl .text .align 4 .globl _main .def _main; .scl _main: pushl %ebp movl %esp,%ebp subl $16,%esp call ___main movl $0,-8(%ebp) movl $1,-4(%ebp) .p2align 4,,7 L2: cmpl $10,-4(%ebp) jle L5 jmp L3 .p2align 4,,7 L5: movl -4(%ebp),%eax addl %eax,-8(%ebp) L4: incl -4(%ebp) jmp L2 .p2align 4,,7 L3: movl -8(%ebp),%eax pushl %eax call _exit addl $4,%esp xorl %eax,%eax jmp L1 .p2align 4,,7 L1: movl %ebp,%esp popl %ebp ret .def _exit; .scl
2;
.type
32;
.endef
2;
.type
32;
.endef
3;
.type
32;
.endef
Listing 4.12 shows the Assembly code generated with optimization level -O2 turned on (g++ -O2 -S test.cc).
Listing 4.12 Generated Assembly for Listing 4.10 with Level 2 Compiler Optimization: test.s (Optimized with -O2) by gcc
.file "test.cc" gcc2_compiled.: ___gnu_compiled_cplusplus: .def ___main; .scl .text .align 4 .globl _main .def _main; .scl _main: pushl %ebp movl %esp,%ebp call ___main xorl %edx,%edx movl $1,%eax .p2align 4,,7 L5: addl %eax,%edx incl %eax cmpl $10,%eax jle L5 pushl %edx call _exit .def _exit;
2;
.type
32;
.endef
2;
.type
32;
.endef
.scl
3;
.type
32;
.endef
Even though this is a small, trivial program, you can already see substantial differences in efficiency and footprint. The optimized version of the example program is clearly much shorter and, when you look at the loop (the code between L2 and L3 in the first Assembly listing, and L5 in the second Assembly listing), you see a noticeable decrease in the number of compare instructions (cmp) and jumps (jmp, jle). In this example, the optimized code is, in fact, easier to read than the original code, but this is certainly not always the case. Also, note that -O2 takes care of several kinds of optimization in a single compilation. More can be done with this example. There is a loop in the program and, having seen the available optimization options at the beginning of this section, you have undoubtedly spotted the option -funroll-loops. Unrolling loops is a technique to make software faster by doing more steps inside a loop and doing fewer actual loop iterations. The following example demonstrates this principle:

Normal Loop:

execute loop ten times
    step A
end loop

Unrolled Loop:

execute loop 2 times
    step A
    step A
    step A
    step A
    step A
end loop
Both loops perform step A ten times. The unrolled code is definitely larger, but it is also faster because the overhead incurred by looping around (checking the end condition and jumping back to the beginning) is performed less often, thus taking up a smaller percentage of loop-execution time (Chapter 8 describes loops in more detail). Listing 4.13 shows what unrolling loops can do for the example program (g++ -O2 -funroll-loops -S test.cc). Listing 4.13 Generated Assembly for Listing 4.10 with Further Compiler Optimization by gcc: test.s (Optimized with -O2 -funroll-loops)
.file "test.cc" gcc2_compiled.: ___gnu_compiled_cplusplus: .def ___main; .scl .text .align 4 .globl _main .def _main; .scl
2;
.type
32;
.endef
2;
.type
32;
.endef
file://C:\Users\Dan\AppData\Local\Temp\~hhB969.htm
20/08/2013
New Page 27
Pgina 45 de 206
_main: pushl %ebp movl %esp,%ebp call ___main pushl $55 call _exit .def
_exit;
.scl
3;
.type
32;
.endef
Apparently, this was a very good choice. The generated file is now very small; notice that the loop has, in fact, completely disappeared! The only evidence remaining is the placement of the number 55 (the result of the eliminated loop) on the stack. This means the compiler performed the loop during compilation and placed only the result in the generated code. This is possible only because the result of the loop is known at compile time; there was, in fact, a hidden constant in the program. Using compiler optimizations can thus pay off before you even begin thinking long and hard about what to do yourself. Playing around with the other optimization options found at the beginning of this section will give more insight into what the compiler is willing and able to do.
there is not yet a lot of support, one that has not proven itself in the field, or one for which the syntax is not even completely standardized. Another advantage of using a more popular language is the increased chance of finding programmers in the future to replace or strengthen the current development team. An entirely different, but no less important, future development issue is syntax readability. When you expect to be working on a system for an extended period of time, you want to choose a language that is easily readable and thus extendible. These risks lessen as the need for future development decreases. For one-time projects, all that matters is whether the current compiler and the available documentation and human resources suffice.

Desired level of portability. The question here is whether you are writing software for one specific target platform (be it embedded software or not) or whether you expect different hardware configurations to have to execute the software. With most programming languages, you have to recompile the sources to get an executable that runs on a system with a different microprocessor. This does not have to be a problem. Only the system-specific parts used might need to be reinvented or bought for new platforms. Other languages are not simply ported. Trying to port the source might even be similar to rewriting the whole program. Clearly you have to decide in advance whether the language will have the characteristics desired at porting time.

Table 4.1 compares the characteristics of the programming languages C, C++, Pascal, (Visual) Basic, Java, and Assembly.

Table 4.1. Programming Language Characteristics. Each language is rated (from + to +++) on: portability, speed of development, ease of update/enhancement, availability of knowledge, standard solutions, code readability, possibility of complex design, execution speed, execution footprint, ease of optimization, and time to learn the language.
From this table you can draw the following conclusions on when to use which language:

C: The C programming language seems to be doing quite well in most categories. Only in the categories "complexity of design" and "ease of update/enhancement" does it look like there are better choices around these days. This is because C was developed before the object-oriented (OO) approach became so popular. C is still an accessible, universally applicable language, as long as the projects do not become too large or too complex. When large numbers of developers need to update big chunks of each other's code, C can easily become a mess. This is why C is a good choice for most kinds of applications, as long as the programs do not become too large or complex.

C++: The C++ programming language has characteristics very similar to those of C, unsurprisingly, with the exception of being more readable and maintainable when the programs become larger or the development process is more dynamic. The price paid for this is that C++ generally has a larger footprint and takes longer for programmers to learn, especially when they are expected to use it well. This is why C++ is a good choice for all kinds of applications, as long as the footprint is not too tight.

Pascal: The Pascal programming language is easy to learn, reasonably fast, and quite strict in runtime type checking. This programming language does have a few strange quirks that might have to be programmed around, though (strings are limited to 255 characters by default, for example). Also, there are quite a few flavors of Pascal around, which makes portability less than obvious. Pascal imitates C and C++ in characteristics, but seems to be less popular at the moment. This is why Pascal is seen as a good all-arounder, but watch out for future support.

(Visual) Basic: The Visual Basic programming language comes with a plethora of preimplemented functionality; it is easy to learn, very stable in use, and has low development times. It is not too fast in execution, however, and the footprint can become somewhat large because of the extra software needed to run an executable (libraries, interpreter). This is why (Visual) Basic is a good choice for quickly building programs, user interfaces, and prototypes that aren't time critical (execution slowness is even a plus when writing those first prototypes).
Java The Java programming language is easy to learn, has a lot of standard solutions, is robust in usage, has a relatively fast development time and, most importantly, is portable, even without recompilation. The price paid for this is that it is neither that fast in execution nor optimal in footprint. Conclusion: Nontime-critical, portable applications; Internet software, downloadable executables, prototypes.

Assembly The Assembly programming language is eminently fast, small, and easily optimizable, if used expertly. Development times, like learning times, are quite high, and the language does not lend itself to writing large programs with a complex design. Conclusion: Small, time-critical programs or program parts; drivers, interrupt routines, close hardware interaction programs.

Two final remarks on choosing a programming language: Remember that developers work better with languages they like. A developer using his favorite language will enjoy his work more and is prone to putting extra effort into producing quality software. Also, in some cases it is possible, even advisable, to mix different programming languages. The next section discusses this in detail.
Mixing Languages
This section describes ways of mixing programming languages, allowing the choice of beneficial language characteristics to be made at a lower or more modular level: per library or program module, for instance. Before unnecessarily complicating matters, however, determine whether you actually need the different language segments to communicate within a single executable file. If not, you can opt for easier, external communication between components created from different languages. Think of using files, pipes, COM/CORBA, or network communication protocols (TCP/IP can also be used between two programs or processes that run on the same computer). For example, a program generated from Turbo Pascal could write data to a file that, in turn, is read by a program running on a Java engine. Similarly, as in Listing 4.14, a C program could route its output to STDOUT, which is captured by a C++ program reading from STDIN. Listing 4.14 Routing Data from a C-Compiled Source to a C++-Compiled Source using an External File
/* C: */
#include <stdio.h>

void Send(char *buffer)
{
    printf("%s", buffer);
}

//
// C++:
//
#include <iostream.h>

void Communicator::Read(char &buffer)
{
    cin >> buffer;
}
Using this kind of external communication does away with the need for more complex, in-executable communication. There are some cons to external communication to keep in mind, though:

It is considerably slower than in-executable communication.

It is dependent on the performance and availability of the external medium (disk, hard drive, network drive, and so on).

Locking problems for read/write actions can be quite invisible and difficult to solve.

For some problem domains, then, it is necessary to look for in-executable solutions. This section looks at two mixing techniques closely related to optimizing C and C++: mixing C and C++ and mixing C (or C++) with Assembly.

Mixing C and C++

This example of mixing C with C++ takes a C object containing a single function and links it with a C++ program. The C++ program contains a main function that calls the linked C function. The C object looks like the code in Listing 4.15.
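The body of Listing 4.15 boils down to a plain C source file defining that single function. A minimal sketch, assuming only that it defines the my_c_fn() function which the C++ code calls later (the message text is illustrative):

/* my_c.c */
#include <stdio.h>

void my_c_fn(void)
{
    printf("We're in C now!\n");
}

Compile this file to an object file with the command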
gcc -c my_c.c
The result of this command is the file my_c.o. Listing 4.16 shows the C++ program that will use the C object. Listing 4.16 Invoking C Functions from C++ (my_cpp.cpp)
#include <iostream.h>

extern "C" void my_c_fn();   // the function from my_c.o has C linkage

int main(int argc, char *argv[])
{
    cout << "We're in C++ now!" << endl;
    my_c_fn();
    return 0;
}
Create an executable containing both the C++ and C functionality by compiling the file with the command
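g++ my_cpp.cpp my_c.o -o mixed

(The exact command line is an assumption; with the GNU toolchain used for the C object above, g++ compiles the C++ source and links in the C object file. The output name mixed is only an example.)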
Note that the C++ source had to declare my_c_fn() with extern "C" linkage. When a C header must be usable from both C and C++ sources, the declarations can be wrapped in a guard that applies C linkage only when the header is compiled as C++:

#ifdef __cplusplus
extern "C" {
#endif

void my_c_fn();

#ifdef __cplusplus
}
#endif
Now let's look at how to tackle Assembly.

Mixing C/C++ and Assembly

Some C compilers allow the use of Assembly statements in the source code, effectively mixing C and Assembly within a single source
file (Aztec C, Turbo C, GNU C, and so on). This use of Assembly statements within other programming languages is called inline Assembly. To notify the compiler that it should switch to parsing Assembly, a keyword is used. Depending on the compiler, the keyword will be something like asm, #asm, or asm(code). The compiler may also need an extra option before it will recognize the Assembly statements. Turbo C example of mixing C and Assembly:
int mul(int a, int b)
{
    asm mov ax, word ptr 4[bp]    /* load a into AX */
    asm mul word ptr 6[bp]        /* multiply AX by b; low word of result in AX */
}
gcc example of mixing C and Assembly:
#include <stdio.h>

void change_xyz(int x, int y)
{
    int z = 3;
    printf("x.y.z %d %d %d\n", x, y, z);
    asm("movl $4,8(%ebp)\n\t"      /* overwrite x */
        "movl $5,12(%ebp)\n\t"     /* overwrite y */
        "movl $6,-4(%ebp)");       /* overwrite z */
    printf("x.y.z %d %d %d\n", x, y, z);
}

int main(int argc, char *argv[])
{
    change_xyz(1,2);
    return 0;
}
This program can be compiled with gcc as usual; running the resulting executable produces the following output:

x.y.z 1 2 3
x.y.z 4 5 6
But what does all that %ebp and %esp stuff actually mean? It all has to do with the way variables are accessed in Assembly. As the variables x and y are passed to the function by value, they reside on the stack in the order in which they are declared: from left to right, first x and then y. You get to the stack via the base pointer, which points to the memory address eight bytes below the first variable. Where you say x in C, you have to say 8(%ebp), and y in this example becomes 12(%ebp). Note that the variables are placed four bytes apart, as the integer size is four bytes. The variable z is a local variable and is placed under the base pointer, at -4(%ebp). If there had been a second local variable (int z, q), this second variable would have been placed directly after z, so q would be -8(%ebp). Refer to Chapter 8, "Functions," for more detail.
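As a compact way to visualize the layout just described (a sketch assuming 4-byte ints and the classic x86 stack frame):

void change_xyz(int x, int y)   /* x lives at  8(%ebp), y at 12(%ebp)  */
{
    int z = 3;                  /* z lives at -4(%ebp)                 */
    int q = 4;                  /* a second local would be at -8(%ebp) */
}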
Now let's look at what happens to variables that are passed by reference. Reference variables are, in fact, placed on the stack exactly the same way as variables that are passed by value; the stack slot simply holds the address of the variable rather than its value. Compilers such as Microsoft's Developer Studio mark inline Assembly with the __asm keyword. Each statement can be prefixed separately:
__asm mov eax,2   ; move the number 2
__asm add eax,2   ; add the number 2
or even
__asm {
    __asm mov eax,2   ; move the number 2
    __asm add eax,2   ; add the number 2
}
Referring to function parameters is child's play as the symbolic names can simply be used inside the Assembly code as in Listing 4.19. Listing 4.19 Referring to Function Parameters from In-lined Assembly Within a C++ Source for DeveloperStudio
int DoWop(int dow, int op)
{
    __asm {
        mov eax, dow    ; retrieve dow
        add eax, op     ; add op
                        ; leave result in eax to be returned.
    }
}
The C/C++ call int a = DoWop(1,2); will now result in the freshly created variable a receiving the value 3.
Summary
This chapter discussed preliminaries of working on actual code optimizations. It introduced development and optimization tools and
discussed the importance of choosing the correct programming language for a project. The tools discussed were

The compiler, which transforms a program written in a high-level programming language into an executable binary file

The preprocessor, which takes care of several processes before the compiler starts its big task

The assembler, which translates human-understandable mnemonics into microprocessor instructions (executable code)

The linker, which takes all the libraries and object files you have created and links them together

The Make utility, which aids in checking up on dependencies and recompiling all objects that are affected by a change in one or more source files

The debugger, which allows the user to follow the path of execution as the program runs

The profiler, which determines how often each function of a program is called, how much time is spent in each function, and how this time relates (in percentages) to the other functions

The runtime checker, which is used to detect irregularities during program execution

The static source code analyzer, which looks at the source code (all the source code!) and remarks on constructions it finds suspect

The test program, which is probably the most important development tool you can use

The programming languages discussed in this chapter are

C. Usable for most kinds of applications, as long as the programs do not become too large or complex.

C++. Usable for all kinds of applications, as long as the footprint is not too tight.

Pascal. A good all-arounder, but watch out for future support.

Visual Basic. Usable for quickly building programs, user interfaces, and prototypes that are not time critical (execution slowness is even a plus when writing those first prototypes).

Java. Usable for portable applications that aren't time critical, Internet software, downloadable executables, and prototypes.

Assembly. Usable for small, time-critical programs or program parts: drivers, interrupt routines, close hardware interaction programs.
This section introduces two techniques for providing efficiency information on algorithms. The first technique is a theoretical description of the complexity of an algorithm, the O(n) notation. It can be used to compare the merits of the underlying concepts of algorithms, without doing any implementation. The second technique is a timing function that can be used to time implemented code. It is used throughout this book to time suggested coding solutions.
Number of Elements   O(log2 n)   O(n)      O(n log2 n)   O(n^2)
10                   3           10        30            100
100                  7           100       700           10,000
1,000                10          1,000     10,000        1,000,000
10,000               13          10,000    130,000       100,000,000
The first column of the previous table indicates the number of elements to sort. The remaining columns represent worst-case sorting times for algorithms with different complexities, expressed in multiples of a base time. Let's say the base time is one minute. This means that sorting 10,000 elements with an algorithm of O(log2 n) complexity takes 13 minutes. Sorting the same number of elements with an algorithm of O(n^2) complexity takes 100,000,000 minutes, which is over 190 years! That is something you probably do not want to wait for.

This also means that choosing the right algorithm will improve performance far more than tweaking the wrong algorithm ever could. What you do when tweaking an algorithm is change its base time, not the number of algorithmic steps it needs to take. In the example given by the previous table, this could mean optimizing the O(n^2) algorithm to perform sorting in 100,000,000 * 30 seconds instead of 100,000,000 * 60 seconds. The new implementation is twice as fast but, sadly, it still takes over 80 years to complete its task in the worst case, and it still does not compare to the 13 minutes of the O(log2 n) algorithm. What you are looking for is thus an algorithm with as low a complexity as possible.

You have already seen an example of an algorithm with O(n) complexity. But what about those other complexities? Algorithms with O(n^2) complexity are those which, depending on the number of elements, need to do some processing for all other elements. Think of adding an element to the back of a linked list (see the sketch below): when you have no pointer to the last element, you need to traverse the list from beginning to end each time you add an element. The first element is added to an empty list, the fifth element traverses the first four, the tenth element traverses the first nine, and so on.

Other complexities you are very likely to come across in software engineering are O(log2 n) and O(n log2 n). The first n in n log2 n is easy to explain; it simply means that something needs to be done for all elements. The log2 n part is created by algorithmic choices that continuously halve the number of data elements the algorithm is interested in. This kind of algorithm is used, for instance, in a game in which you have to guess a number that someone is thinking of by asking as few questions as possible. A first question could be, Is the number even or odd? When the numbers have a limited range, from 1 to 10 for instance, a second question could be, Is the number higher than 5? Each question halves the number of possible answers.

Algorithms with O(1) complexity are those for which processing time is completely independent of the number of elements. A good example of this is accessing an array with an index: int a = array[4]; no matter how many elements are placed into this array, this operation will always take the same amount of time.
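To make the linked-list example concrete, here is a minimal sketch (the node type and function names are illustrative, not from the book). Appending without a tail pointer forces a traversal of all existing elements per insertion, which is O(n^2) over n insertions; keeping a tail pointer makes each append O(1):

struct Node { int value; Node *next; };

// O(n) per append: walk to the end of the list each time.
void AppendSlow(Node *&head, int value)
{
    Node *node = new Node;
    node->value = value;
    node->next = 0;
    if (head == 0) { head = node; return; }
    Node *p = head;
    while (p->next != 0)    // traverses all existing elements
        p = p->next;
    p->next = node;
}

// O(1) per append: remember the last element in a tail pointer.
void AppendFast(Node *&head, Node *&tail, int value)
{
    Node *node = new Node;
    node->value = value;
    node->next = 0;
    if (head == 0) head = node; else tail->next = node;
    tail = node;
}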
Final remarks on the O notation:

Constants prove to be insignificant when comparing different O(n) expressions and can be ignored; O(2n^2) and O(9n^2) are both considered to be O(n^2) when compared to O(n) and so on.

Linking algorithms of different complexities creates an algorithm with the highest of the linked complexities; when you incorporate a step of O(n^2) complexity into an algorithm of O(n) complexity, the resulting complexity is O(n^2).

Nesting algorithms (in effect multiplying their impact) creates an algorithm with multiplied complexity; O(n) nested with O(log2 n) produces O(n log2 n).
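To make the nesting rule concrete, here is a small illustrative sketch (the names are invented): an O(n) loop over all elements that performs an O(log2 n) halving step per element executes on the order of n log2 n steps in total.

void Nested(int n)
{
    for (int i = 0; i < n; i++)                      // O(n): once per element
    {
        for (int range = n; range > 1; range /= 2)   // O(log2 n): halves each step
        {
            // some work per halving step
        }
    }
}

The following Mul/Shift example program moves from this kind of theoretical comparison to actual timing.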
#include <iostream.h>
#include "booktools.h"

long MulOperator()
{
    long i, j = 1031;
    for (i = 0; i < 20000000; i++)
    {
        j *= 2;
    }
    return 0;
}

long MulShift()
{
    long i, j = 1031;
    for (i = 0; i < 20000000; i++)
    {
        j <<= 1;
    }
    return 0;
}
void main(void)
{
    cout << "Time in MulOperator: " << time_fn(MulOperator, 5) << endl;
    cout << "Time in shiftOperator: " << time_fn(MulShift, 5) << endl;
}
This Mul/Shift example program uses two different techniques for multiplying a variable (j) by two. The function MulOperator() uses the standard arithmetic operator *, whereas the function MulShift() uses a bitwise shift to the left (<<). The performance of both multiplication functions is timed with a function called time_fn(), which is explained later in this chapter.

You will agree that the choice between writing j *= 2 or j <<= 1 has no real impact on code reliability, maintainability, or complexity. However, when you write a program that at some point needs to do this kind of multiplication on a large base of data, it is important to know beforehand which technique is fastest (or smallest). This is especially true when multiplication is used widely throughout the sources you write for a certain target system.

So how does this test help you? The result expected of the Mul/Shift example by anyone with a technical background would be that the logical shift is much faster than the multiplication, simply because this action is easier for (most) microprocessors to perform. However, when you run this test using Microsoft's Developer Studio on an x86-compatible processor, you will see that both techniques are equally fast. How is this possible? The answer becomes clear when you look at the assembly generated by the compiler (how to obtain assembly listings is explained in Chapter 4).
The following assembly snippets show the translations of the two multiplication techniques for Microsoft's Developer Studio on an x86-compatible processor.
// Code for the multiplication:
; 16 : j *= 2;
    shl DWORD PTR _j$[ebp], 1

// Code for the shift:
; 36 : j <<= 1;
    shl DWORD PTR _j$[ebp], 1
Without even knowing any assembly, it is easy to see why both multiplication functions are equally fast: apparently the compiler was smart enough to translate both multiplication commands into a bitwise shift to the left (shl). So, in this situation, taking time out to optimize your data multiplication routines to use shifts instead of multipliers would have been a complete waste of time. (Just for fun, look at what kind of assembly listing you get when you multiply by 3: j *= 3;.) Similarly, if you had run the Mul/Shift example on a MIPS processor, you would have noticed that for that particular processor the multiplication is in fact faster than the shift operator. This is why it is important to find out specific target and compiler behavior before trying to optimize in these kinds of areas.

The Mul/Shift example introduced a second method for timing your functions (the first method, using profiling as a means of generating timing information, was introduced in Chapter 4). The Mul/Shift example uses the function time_fn(). You can find the definition of this timing function in the file BookTools.h, and its implementation in Listing 5.2 and in the file BookTools.cpp. The best way to use the timing function is to simply add both booktools files to the makefile of your software project. Listing 5.2 The Timing Function
unsigned long time_fn(long (*fn)(void), int nrSamples = 1)
{
    unsigned long average = 0;
    clock_t tBegin, tEnd;

    for (int i = 0; i < nrSamples; i++)
    {
        tBegin = clock();
        fn();
        tEnd = clock();
        average += tEnd - tBegin;
    }
    return ((unsigned long)average / nrSamples);
}
The time_fn() function receives two parameters: the first is a pointer to the function it has to time; the second is the number of timing samples it will take. In the Mul/Shift example, the time_fn() function is called first with a pointer to the multiplier function, and then again with a pointer to the shift function. In both cases, five samples are requested; the default number of samples is just one.

The actual timing is done via the clock() function, which returns the number of processor timer ticks that have elapsed since the start of the process. By noting the clock value twice (once before executing the function which is to be timed, and once after it has finished) you can approximate quite nicely how much time was spent. The following section explains how to minimize external influences on this timing. Note that the overhead created by the for loops in the MulOperator() and MulShift() functions of the Mul/Shift example is also timed. This is of no consequence to the timing results, as you are interested only in the relation between the results of the two functions, and both functions contain the exact same overhead.

The clock() function is not part of ANSI C++, so its usage can differ slightly per system. This is why several #ifdef compiler directives can be found in the booktools files. The example systems (Linux/UNIX and Windows 9x) used in this book are separated by the fact that Developer Studio automatically creates a definition _MSC_VER. When using time_fn() on systems other than the example systems used in this book, you should consult the relevant manuals to check whether there are differences in the use of the clock() function.
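Note that time_fn() reports raw timer ticks. If you prefer results in milliseconds, the conversion can be made with the CLOCKS_PER_SEC constant from <time.h>; a minimal sketch (the function name is illustrative):

#include <time.h>

/* Convert a tick count, as returned by time_fn(), into milliseconds. */
unsigned long ticks_to_ms(unsigned long ticks)
{
    return (unsigned long)(ticks * 1000.0 / CLOCKS_PER_SEC);
}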
System Influences
When doing timing tests as proposed in the previous section, it is important to realize that you are in fact timing system behavior while you are running your code. This means that, because your piece of code is not the only thing the system is dealing with, other factors
will influence your test results. The most intrusive factors are other programs running on the system. This influence can of course be minimized by running as few other programs on your system as possible while performing timing tests, and increasing the priority of the process you are timing as much as possible. Still some influences will remain because they are inherent to the operating system. This section discusses the remaining system influences and how to deal with them.
Cache Misses
Cache is memory that is used as a buffer between two or more memory-using objects. The cache that is referred to in this section is the buffer between the CPU and the internal memory of a computer system. Most CPU architectures split this cache into two levels. Level 1 cache is located physically in the CPU itself, and level 2 cache is located outside, but close to, the CPU. Level 1 cache is usually referred to as l1 in technical literature, level 2 cache is usually referred to as l2. Figure 5.1 depicts the two-level cache concept. Figure 5.1. CPU level 1 and level 2 cache.
The reasons for splitting the cache into two levels have to do with the different characteristics of the cache locations. In general, the available space for cache inside a CPU is smaller than outside, with the added advantage that accessing cache inside the CPU is often faster. On most architectures, l2 has to be accessed at a clock frequency which is a fraction of the CPU's clock frequency. So, as a rule of thumb: l1 is smaller and faster than l2.

Using caching means that a CPU does not need to retrieve each unit of data from memory separately; rather, a block of data is copied from memory into cache in one action. The advantage of this is that part of the overhead of transferring data from memory to CPU (for example, finding an address and doing bus interaction) is incurred only once for a large number of memory addresses. The CPU uses this cached data until it needs data from a memory address that is not cached. At that point a cache miss is incurred and a new block of data needs to be transferred from memory to cache. Of course, data is not only read; it can be changed or overwritten as well. When a cache miss occurs, it is therefore often necessary to transfer the cache back to internal memory first, to save the changes.

Each time a cache miss is incurred, processing halts while the copy actions take place. It would, therefore, be unfortunate if a cache miss caused a block of cached data to be switched for a new block, after which another cache miss switches the new block back for the original block again. To minimize these kinds of scenarios, further refinements of the caching concept are often introduced in system architectures, as the following two sections explain.

Using Cache Pages

With cache paging, the available cache resource is divided into blocks of a certain size. Each block is called a page. When a cache miss occurs, only one page (or a few pages) is actually overwritten with new data; the rest of the pages stay untouched. In this scenario, at least the page in which the cache miss occurred has to stay untouched. This does mean that for each page a separate administration is needed to keep track of the memory addresses that the page represents and whether changes have been made to the data in the page.

Using Separate Data and Instruction Caches

With separate data and instruction caches, the available cache resource is split into two functional parts: one part for caching CPU instructions, which is a copy of part of the executable image of the active process; and another part for caching data, which is a copy of a subset of the data used by the active process.

These strategies are, of course, very generic, as they cannot take into account any specific characteristics of the software that will run on a certain system. Software designers and implementers can, however, design their software in such a way that cache misses are likely to be minimized. When software is designed to run on a specific system, even more extreme optimizations can be made by taking actual system characteristics into account, such as l1 and l2 cache sizes. Different techniques for minimizing cache misses and page faults are presented in a later section titled "Techniques for Minimizing System Influences."
Page Faults
The concept of paging, as discussed in the previous section, can also be applied to internal memory. In this case, internal memory acts as a cache or buffer between the CPU and a secondary storage device, for example, a hard disk. When this concept is applied to internal memory, it is called virtual memory management. Figure 5.2 depicts the virtual memory concept. Figure 5.2. Virtual memory.
When virtual memory is used, the internal memory is divided into a number of blocks of a certain size. Each block is called a page. The memory management system can move these pages to and from the secondary storage. On the secondary storage, these pages are stored in something called a page file or a swap file (for maximum performance, even a swap partition can be created). In general this page file can contain more pages than can fit in memory. The size of available internal memory is thus virtually increased by the usage of the larger page file. However, whenever an active process refers to data or an instruction which is located in a page that is not at that time in internal memory, a page fault occurs. The page which is referred to has to be transferred from secondary storage to internal memory. In the worst case another page has to be transferred from internal memory to the page file first, to make room for the new page. This is the case when some of the data of the page that will be replaced was changed.
Other OS Overhead
Apart from overhead directly related to your process, the OS itself generates overhead related to other activities.

Task Switching

Unless your target is an embedded system with a very straightforward architecture, chances are that your target is multitasking. In practice this means there are always other processes (daemons, agents, background processes, and so on) running alongside your testing process. Some will be small, whereas others might be fairly large compared to your tests. The OS gives all these processes some CPU time to do their job, based on their individual priorities. This is called task switching, and it can bring along overhead varying from simply saving CPU registers on the stack to swapping entire pages in and out of internal memory.

IO

When a process on your system is performing IO, this will generally have a large influence on system time spent. This is because IO to secondary storage or external devices is generally many times slower than CPU-to-memory interaction. Similarly, when your test program uses a lot of IO, it might not perform exactly the same during two separate test runs. A later run could use data that was buffered during an earlier run.

General OS Administration

An OS will, from time to time, do some administration. This can be anything from updating some kind of registry to doing active memory management. On some systems this is done only after a certain amount of idle time; however, the OS can still decide to do this just as you are about to run your test.

User Interaction

When a user interacts with the system, chances are that some kind of background process will suddenly become very active to deal with all the new input. It is generally a good idea to interact as little as possible with the system while running your tests. This includes not moving the mouse, playing with the keyboard, or inserting floppy discs or CDs, as the OS might do more than you expect (like displaying ToolTips, looking for environment variables, checking status/validity, switching focus, and so on).
Techniques for Minimizing System Influences

One way to deal with the influences described above is to perform several test runs and discard results with inordinately high values (caused by cache misses, page faults, and so on) or inordinately low values (caused by buffered IO and so on). A clean average should prevail. However, for certain influences it is best to minimize their occurrence by writing better code in the first place, for example, for cache misses and page faults. It is possible to design software in such a way that cache misses and page faults are likely to be minimized, even without necessarily taking specific system architecture information into account. The remainder of this section presents these kinds of techniques.

Minimizing Data Cache Misses and Page Faults

Group Data According to Frequency of Use
Place data that is used most frequently close together in memory. This increases the chance of having at least some of it at hand almost continuously.

Group Data According to Time/Space Relations
Place data that is often needed simultaneously or consecutively close together in memory. Creating this kind of grouping of related data increases the chance that data is available when it is needed.

Access and Store Data Sequentially
When data is accessed sequentially, chances increase that a miss or a fault will retrieve several of the data elements that are needed in the near future. If the data structure consists of linked elements (linked list, tree, and so on), this only works when sequential elements are stored close together and in sequence in memory, because otherwise a linked element can basically be anywhere in memory. (See the sketch at the end of this section.)

Perform Intelligent Prefetching
When it is possible to determine with reasonable accuracy what data is needed in the future, a deliberate miss or fault can be generated during idle time to facilitate a kind of prefetching mechanism. This kind of strategy can be inefficient when it is impossible to determine which pages will be used for the prefetch. You will most likely want to combine this strategy with the following.

Lock Frequently Used Data
When data is needed on a frequent or continual basis during some stage of a program, it could be worthwhile to find out whether the page it is in can somehow be locked. For more information, consult system and compiler documentation.

Use the Smallest Working Set Possible
On most systems, it is possible to influence the working set of a process. This can aid in battling page faults, as the working set determines how much memory will be available to a process. The closer this working set matches the maximum memory needs of a program, the closer its data and instructions are packed together. For more information, consult system and compiler documentation.

Minimizing Instruction Cache Misses and Page Faults

Group Instructions According to Frequency of Use
Place functions or sets of instructions that are used most frequently close together in memory. This increases the chance of having at least some of them at hand almost continuously.

Group Related Instructions Together
Place functions or sets of instructions that are often needed simultaneously or consecutively close together in memory. Grouping related instructions in this way increases the chance that instructions are available when they are needed. In effect this means that you try to determine chains of functions that are likely to call each other and place them together and in order.
Lock Frequently Used Instructions
When a function or group of instructions is needed on a frequent or continual basis during some stage of a program, it could be worthwhile to find out whether the page they are in can somehow be locked. For more information, consult system and compiler documentation.

Note that not all strategies given here are always possible or even desirable. Implementing a certain strategy can sometimes have more cons than pros; for each separate situation, the strategies should be weighed on their merits. Note also that some strategies are not particularly suited to being combined. Most standard platforms have tools that help you determine the number of cache misses and page faults that occur while a piece of software runs, and the time lost during a certain kind of cache miss (l1/l2) or page fault. For more information, consult system and compiler documentation.
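As an illustration of the sequential-access strategy mentioned above, consider the following sketch (array dimensions and names are illustrative). In C and C++, two-dimensional arrays are laid out row by row, so traversing rows in the inner loop follows the memory layout and makes good use of each cached block, whereas traversing columns in the inner loop keeps striding across memory:

#define ROWS 1024
#define COLS 1024

static int table[ROWS][COLS];

long SumRowMajor()      // cache friendly: follows the memory layout
{
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += table[r][c];
    return sum;
}

long SumColumnMajor()   // cache hostile: strides through memory
{
    long sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += table[r][c];
    return sum;
}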
Summary
By using the O(n) notation to describe the behavior of an algorithm in relation to the number of elements of data it needs to handle, a lot can be said about algorithm efficiency even before it has been implemented. Algorithms that have already been implemented can easily be pitted against each other using a timing function such as the one described in this chapter. When performing this second kind of speed test, however, you should realize that you are effectively timing system behavior while your test is running.

There are several things you can do to minimize the influence of system overhead during testing. To minimize the side effects caused by other processes and OS administrative tasks, you can

Run as few other processes on the system as possible

Increase the priority of the testing process

Run on a clean (rebooted) system

Perform several test runs and average only the least contaminated results

To minimize cache misses and page faults while your software is running, you can use several optimization strategies:

Group data and functions according to frequency of use

Group data and functions which are related to each other

Access and store data sequentially

Perform intelligent prefetching

Lock frequently used data and functions

Adjust the working set of a process

Take specific system information into account

How fast data and instruction access can be is dependent on where the data must come from. Figure 5.3 provides an overview. Figure 5.3. Overview of locality.
This diagram shows the different places from which the CPU can get data and instructions, ordered by proximity
to the CPU, where the registers are closest to the CPU and the external storage is furthest away. In general, it can be said that the closer a storage location is to the CPU, the smaller and faster it is.
Base Type           W/D   U/G
char                1     1
short               2     2
short int           2     2
long                4     4
int                 4     4
float               4     4
double              8     8
long double         8     8
struct              1     1
bit field           4     4
union               1     1
long long/__int64   8     8

(Sizes in bytes; W/D = Windows/Developer Studio, U/G = UNIX/Linux with the GNU compiler.)
The first column of sizes is valid for Windows with Developer Studio (W/D). Some documentation for this compiler claims the long double to be ten bytes in size instead of eight, in which case the larger range would be valid. A 64-bit integer is called __int64 by Developer Studio. The second column of sizes is valid for UNIX and Linux with the GNU compiler (U/G). When we use the GNU compiler, the 64-bit integer is called a long long. Careful readers might notice the struct, bit field, and union types appearing in this list. These are included because the smallest possible size of these types is not always apparent. It is possible to determine the sizes of these types by checking an instance containing the smallest possible element, for example,
struct Small    { char c; };
struct Bitfield { unsigned bit : 1; };
union  Un       { char a; };
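A quick way to verify these minimum sizes on your own compiler is simply to print them (a sketch using the types just defined):

#include <iostream.h>

void PrintSmallestSizes()
{
    cout << "struct:    " << sizeof(Small)    << " byte(s)" << endl;
    cout << "bit field: " << sizeof(Bitfield) << " byte(s)" << endl;
    cout << "union:     " << sizeof(Un)       << " byte(s)" << endl;
}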
These types are discussed in more detail at the end of this chapter. When you are using a different target environment from the two dealt with above, it is a good idea to run some size tests. A function which does just that is included in the book tools on the Web site: BaseTypeSizes() (see Chapter 5, "Measuring Time and
Complexity"). To prevent cryptic descriptions of types, the remainder of this book refers to the variable types as they are specified in the column U/G. So, when a reference is made to a 64-bit integer, the type name long long will be used. After the sizes and ranges of variable types have been determined, implementers select variable types with the best fit for their variables. In the instance of more visible variables, the design should aid in this selection. Examples include elements that occur as database fields (the amount of characters possible in a name; the maximum value of a ZIP code or postal box number), global definitions (number of possible application states, range of error codes), and return types of interface functions. And always consider whether variables are allowed to contain only positive values or both negative and positive values. As shown earlier in Table 6.1, the positive range of a type is halved when negative values are also allowed. For strictly positive variables use the unsigned keyword:
Performance of Variables
When choosing between base type variables, it also makes sense to think about the performance implications of using the different types. When you run the three functions below (Float(), Double(), and Int()) in the timing test of Chapter 5, the differences become quite clear, as shown in Listing 6.1. Listing 6.1 Performance of Base Types
long Float()
{
    long i;
    float j = 10.31, k = 3.0;
    for (i = 0; i < 10000000; i++)
    {
        j *= k;
    }
    return j;
}

long Double()
{
    long i;
    double j = 10.31, k = 3.0;
    for (i = 0; i < 10000000; i++)
    {
        j *= k;
    }
    return j;
}

long Int()
{
    long i;
    int j = 1031, k = 3;
    for (i = 0; i < 10000000; i++)
    {
        j *= k;
    }
    return j;
}
Although the difference in speed between multiplying floats and doubles is negligible, using integers is many times faster. So it makes sense to use integers where possible and simply adjust the meaning of the value to accommodate the use of fractions. In the Int() example, by effectively multiplying the variable j by 100, the decimal point was no longer needed. Of course, it is possible that the decimal point will be needed again (when generating application output, for instance) and in that case it will be necessary to convert the value of j. However, the idea is that this conversion has to be done only a few times (possibly only once), whereas the rest of the application can use much faster calculations. It is still necessary to examine other calculations, though. Consider these examples, where the arithmetic in the three functions is substituted with
j *= k;   // multiplication
j -= k;   // subtraction
j += k;   // addition
j /= k;   // division
Notice that the integer is a speed demon in all cases but one: the division. This is because the integer is converted several times during a division. So, to judge whether to replace floats and doubles with integers, an implementer must determine how often the different arithmetic operations are likely to be used. An application that uses a certain floating-point variable mostly in multiplication, addition, and subtraction is thus an ideal candidate for integer replacement.
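A minimal sketch of this scaled-integer idea (the names and the scale factor of 100 are illustrative):

long priceCents = 1031;                // represents 10.31, scaled by 100
priceCents *= 3;                       // fast integer multiplication
priceCents += 25;                      // still scaled: adds 0.25
double display = priceCents / 100.0;   // convert only when output is needed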
Listing 6.2 Variable Scope Example 1

void Scope1()
{
    for (int a1 = 0; a1 < 10; a1++)
    {
        char arr[] = "pretend this is a long string";
        int b2 = (int)arr[a1];
    }
}
Apart from not actually doing anything, it seems as if this example also contains a few serious scoping flaws. Integer a1 is defined in a for statement header, whereas array arr and integer b2 are defined inside the scope of the for loop. It would appear that at least arr and b2 generate overhead by being created and destroyed for each iteration of the loop. However, the setup of this loop is pretty straightforward, and most C/C++ compilers will assign a stack space to each of the three variables. (See Chapter 4, "Tools and Languages," and Chapter 8, "Functions," for more information about stack space for variables.) So the generated code should be no different from that shown in Listing 6.3. Listing 6.3 Variable Scope Example 2
void Scope2()
{
    int a1, b2;
    char arr[] = "pretend this is a long string";
    for (a1 = 0; a1 < 10; a1++)
    {
        b2 = (int)arr[a1];
    }
}
However, when you make things a little more complicated, you will see that it makes sense to try to declare variables and objects only once. What if the compiler is unable to predefine a variable outside of its scope? This situation occurs when you allocate variables dynamically or need to call a constructor to create a type, as shown in Listing 6.4. Listing 6.4 Variable Scope Example 3
void Scope3()
{
    for (int a1 = 0; a1 < 10; a1++)
    {
        char *arr = new char[20];
        Combi *b2 = new Combi;
        b2->a = (char) a1;
        ~
        delete [] arr;    // release the per-iteration allocations
        delete b2;
    }
}
In this case, variable b2 seems to be a structure or a class, and array arr is now also dynamically created. This generates a lot of overhead because reserving and releasing heap memory occurs in every loop iteration. This sort of overhead can be avoided in
situations where it is possible to reuse object instances. In those cases, objects such as b2 and arr are instantiated once, outside the loop, and are only used by statements inside the loop body. Obviously, when objects are dynamically created inside a loop to be stored away for later use (filling an array, list, or database), you have no choice but to create different instances. Listing 6.4 is shown because, although in small loops it can be pretty apparent when dynamic creations are wasted, in larger and more complex algorithms it is not. So it is good standard programming practice to consider how you use dynamic objects in limited scopes such as loops and even complete functions. Often implementers use dynamically allocated variables to contain temporary values, perhaps retrieved from a list or a database and so on.

Initialization

A problem similar to that described in the previous paragraphs can occur when variables or objects are initialized with function results. Consider the example shown in Listing 6.5. Listing 6.5 Initialization Example 1
void IncreaseAllCounters(DB *baseOne)
{
    for (int i = 0; i < baseOne->GetSize(); i++)
    {
        DBElem *pElem = baseOne->NextElem();
        ~
This piece of code iterates over all the elements of object baseOne. To make sure no element is missed, the function GetSize() is called to determine the number of elements in baseOne. Because of the implementation choice to call GetSize() in the header of the for statement, a lot of needless overhead is incurred: GetSize() is called as many times as there are elements in baseOne. A single call would have sufficed, as shown in Listing 6.6. Listing 6.6 Initialization Example 2
void IncreaseAllCounters(DB *baseOne)
{
    int bSize = baseOne->GetSize();
    for (int i = 0; i < bSize; i++)
    {
        DBElem *pElem = baseOne->NextElem();
        ~
Again, this is a simple example, but not that many programmers take the time to determine which information is static for an algorithm and calculate or retrieve it beforehand. It seems that the more complex algorithms become, the less time is spent on these kinds of matters. This is ironic, because these are exactly the kind of algorithms that implementers end up trying to optimize when the application proves to be too slow.

Use

A similar slowdown can occur with the use of variables. Listing 6.7 demonstrates how not to use member variables. Listing 6.7 Using Member Variables Example 1
class DB
{
    ~
    int dbSize;
    int i;
    ~
};

void DB::IncreaseAllCounters(int addedValue)
{
    for (i = 0; i < dbSize; i++)
    {
        ~
The variables i and dbSize are member variables of the DB class. They can, of course, be used in the way described earlier, but
accessing member variables can easily take twice as long as accessing local variables. This is because the this pointer has to be used to retrieve the base address of a member variable. A better way to use this information is to create local variables and initialize them (once!) with the member variable values, as shown in Listing 6.8. Listing 6.8 Using Member Variables Example 2
void DB::IncreaseAllCounters(int addedValue)
{
    int iSize = dbSize;
    // use iSize in the rest of the function.
    ~
    for (i = 0; i < iSize; i++)
    {
        ~
Structs
The use of structures as data containers seems pretty straightforward, and it is, as long as you take one or two things into account before using structures for large sets of data. Listing 6.9 demonstrates several ways of creating a grouped type. Listing 6.9 Structure Sizes
// Structure Size example
#include <iostream.h>

struct A
{
    char a;
    long b;
    char c;
    long d;
};

struct B
{
    char a;
    char c;
    long b;
    long d;
};

#pragma pack(push,1)
struct C
{
    char a;
    long b;
    char c;
    long d;
};
#pragma pack(pop)

void main(void)
{
    cout << "Size of A: " << sizeof(A) << " bytes." << endl;
    cout << "Size of B: " << sizeof(B) << " bytes." << endl;
    cout << "Size of C: " << sizeof(C) << " bytes." << endl;
}
Listing 6.9 defines three structures that contain identical information: two long words, called b and d, and two characters, called a and c. When you run this program on a Windows system using Developer Studio, you get the following output:

Size of A: 16 bytes.
Size of B: 12 bytes.
Size of C: 10 bytes.

The reason these structures are not the same size has to do with alignment. Normally, a compiler will force long words to be placed at the next long word boundary; this means a new long word can be specified only every four memory addresses. Thus, the memory representation of the beginning of structure A is:

Address   Content
00        char a
01        stuffing
02        stuffing
03        stuffing
04        long b, byte 0
05        long b, byte 1
06        long b, byte 2
07        long b, byte 3*

* Actual byte order within a long is system dependent.

Structure B is more compact because, although long b still starts at the same address relative to the beginning of the structure, some of the stuffing is replaced by char c:

Address   Content
00        char a
01        char c
02        stuffing
03        stuffing
04        long b, byte 0
05        long b, byte 1
06        long b, byte 2
07        long b, byte 3*

* Actual byte order within a long is system dependent.

Looking at these examples, it should not surprise you to find out that the two structures shown in Listing 6.10 are exactly the same size. Listing 6.10 Structure Stuffing
So it makes sense to think about variable order when designing structures. When you use only a few instances of a structure at a time, the footprint impact of the extra stuffing bytes is of course marginal, but it is a different story entirely when you use structures to store the elements of an expansive database. It is generally good practice to use well-thought-out structure design, not only because (as explained in previous chapters) software modules are reused in ways not always anticipated, but also because it is a good idea to make these kinds of techniques second nature.

Now let's look at our wunderkind. How did structure C manage to squeeze itself into a 10-byte hole? You have seen the #pragma pack compiler commands, so you have a pretty good idea how, but what exactly happens? The #pragma pack compiler command for Developer Studio tells the compiler to temporarily adjust alignment. In the example program the alignment is set to one byte, creating the following memory mapping:

Address   Content
00        char a
01        long b, byte 0
02        long b, byte 1
03        long b, byte 2
04        long b, byte 3
05        char c
06        long d, byte 0
07        long d, byte 1
08        long d, byte 2
09        long d, byte 3*

* Actual byte order within a long is system dependent.

Using push and pop ensures that the alignment for the rest of the program is not affected. Basically you push the current alignment onto some compiler stack, force the new alignment in place, and later pop the old alignment back into action. For more information on alignment, consult the manual of the compiler used (look up the /Zp compiler option for Microsoft Developer Studio, or consult the manpages for the GNU compiler: man gcc or man g++).

Of course, alignment is an issue for all types. Table 6.2 is an alignment table for the Windows and UNIX/Linux OSes.
Table 6.2. Base Type Alignment Table
(The table lists the alignment, in bytes, of the char, short, long, int, pointer, float, double, long double, int64/long long, struct, union, and bit field types, for Windows with Developer Studio and for UNIX/Linux with the GNU compiler.)
For specific alignment details of other systems or compilers, consult their respective documentation, or perform an alignment test. This can be done as easily as creating a structure of the required types, interspersed with character variables (to make sure the compiler will actually force alignment for every type). Alternatively, use the booktool function alignment_info() to perform this job for you.

Note
Alignment exists for a reason: Some processors simply have difficulty using odd memory addresses. When you port code that uses adjusted alignment, the new footprint of the program can thus differ, and variable access can even become somewhat slower. The implementer should therefore be aware that alignment adjusting is a system-specific optimization that needs to be tweaked to the target system (OS, hardware, and compiler combination).
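A minimal sketch of the alignment test described above, using the offsetof macro from <stddef.h> (the struct names are illustrative): because each tested member follows a single char, its offset within the structure reveals the alignment the compiler enforces for its type.

#include <iostream.h>
#include <stddef.h>

struct ShortAlign  { char c; short  t; };
struct LongAlign   { char c; long   t; };
struct DoubleAlign { char c; double t; };

void PrintAlignment()
{
    cout << "short  alignment: " << offsetof(ShortAlign,  t) << endl;
    cout << "long   alignment: " << offsetof(LongAlign,   t) << endl;
    cout << "double alignment: " << offsetof(DoubleAlign, t) << endl;
}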
Bit Fields
This section discusses the use of structures as bit fields, in the following contexts: footprint implications, performance implications, pitfalls, and final remarks.

Footprint Implications

Bit fields allow even more extreme alignment of variables. Essentially, the bit field of C++ makes it possible to declare several variables within a base type, making full use of the range of the base type. Let's say you need to create database elements that contain six variables, four of which have a range of 0 to 2031 and two of which have a range of 0 to 1011. Without using bit fields, you would need a structure as defined in Listing 6.11. Listing 6.11 Structure Without Bit Fields
struct NoBitField
{
    short rangeAOne, rangeATwo, rangeAThree, rangeAFour;
    short rangeBOne, rangeBTwo;
};
For the range 0 to 2031 you need 11 bits, and for the range 0 to 1011 you need 10 bits; the smallest base type that can contain this information is the word (2 bytes), which is called a short by most C++ compilers. The size of the NoBitField structure is thus 12 bytes. When using bit fields, however, the structure could look like Listing 6.12. Listing 6.12 Structure with Bit Fields
struct BitField
{
    unsigned rangeAOne   : 11;
    unsigned rangeATwo   : 11;
    unsigned rangeBOne   : 10;   // long 1
    unsigned rangeAThree : 11;
    unsigned rangeAFour  : 11;
    unsigned rangeBTwo   : 10;   // long 2
};
The compiler will pack this whole structure into two long words; the size of the BitField structure is thus eight bytes. The compiler can do this because it was told to use only 11 bits for the rangeAxx variables and only 10 bits for the rangeBxx variables. And by ordering the variables intelligently within the structure (11+11+10=32), you ensured optimal storage. The reason this order is important again has to do with alignment. Bits of the base type are assigned to the fields in the order in which they are declared; however, when a field is larger than the number of bits left over in the base type, it is "pushed" to the next base type alignment. The base type for bit fields is a long word for most compilers, so when you look at the placement of rangeAOne, rangeATwo, and rangeBOne in the first long word, you will see this:
|  rangeBOne  |  rangeATwo  |  rangeAOne  |
|   10 bits   |   11 bits   |   11 bits   |
(high bits on the left, low bits on the right)
With a less-optimal ordering, as in the following structure, you see
struct WastingBitField
{
    unsigned rangeAOne   : 11;
    unsigned rangeATwo   : 11;   // long 1
    unsigned rangeAThree : 11;
    unsigned rangeAFour  : 11;   // long 2
    unsigned rangeBTwo   : 10;
    unsigned rangeBOne   : 10;   // long 3
};
rangeAOne and rangeATwo take up 22 bits of the first long word. This means there are 32-22=10 bits left for the following field; however, because you declared another 11-bit field, that field is moved to the next long word alignment, leaving 10 bits of the first long word unused. As a consequence, the size of the WastingBitField structure is 12 bytes.

Performance Implications

In fact, forcing boundaries on variables makes perfect sense. By putting a little thought into designing your bit fields, you can easily take advantage of the space gained, and because variables are set at their closest alignment borders, the compiler can generate relatively fast code. To set a value in a bit field, at most two bit operations are needed; see Listing 6.13. Listing 6.13 Operations Necessary to Set Values in a Bit Field
// C code:
BitField inst;
inst.rangeBOne = 2;

// pseudo code to implement the C code:
register = long0 AND 0x003fffff
register = register OR 0x00800000
The AND masks off everything but the bits of the rangeBOne variable, simultaneously resetting those bits of rangeBOne that need to be reset for the value 2. The OR then sets the bits that need to be set for the value 2. But you do not always use constant values, or values from which the compiler can derive constants at compile time. When a variable is used for setting, or doing arithmetic with, a bit field, considerably more overhead is incurred. Listing 6.14 demonstrates this. Listing 6.14 Speed of Bit Fields
void main(void)
{
    Bitfield a;
    Structure b;
    for (int i = 0; i < 100; i++)
    {
        a.num = i;
        b.num = i;
    }
}
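The definitions of the Bitfield and Structure types used here do not appear in the listing. A plausible pair, deduced from the generated assembly below (the 2047 mask implies an 11-bit field; the WORD PTR store implies a short), would be:

struct Bitfield  { unsigned num : 11; };   // 11-bit field, matching the 2047 mask
struct Structure { short num; };           // plain short, matching the WORD PTR store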
// Relevant Assembly:

; 11 : a.num = i;
    mov   eax, DWORD PTR _a$[ebp]
    and   eax, -2048                 ; fffff800H
    mov   ecx, DWORD PTR _i$[ebp]
    and   ecx, 2047                  ; 000007ffH
    or    eax, ecx
    mov   DWORD PTR _a$[ebp], eax

; 12 : b.num = i;
    mov   eax, DWORD PTR _i$[ebp]
    mov   WORD PTR _b$[ebp], ax
You do not have to know Assembly to see that about three times as many instructions are needed to process the bit field (label 11) compared to the short (label 12). The exact overhead depends on the compiler and target system used. If speed and footprint tradeoffs become important when using structures, it is a good idea to test the implications for the intended target system using test cases such as the one in Listing 6.14.

Pitfalls

When you look at Listing 6.11, you see that in the NoBitField structure more than one variable was declared on a single source line. This is also possible for bit field structures, but take care not to make the mistake shown in Listing 6.15. Listing 6.15 Pitfall in Using Bit Fields
struct PitfallBitField
{
    unsigned rangeAOne, rangeATwo : 11;    // mistake: rangeAOne is a full unsigned, not an 11-bit field
};

struct CorrectBitField
{
    unsigned rangeAOne : 11, rangeATwo : 11;    // correct.
};
It is also possible to use bit fields and normal base types together in the same structure, as shown in Listing 6.17; however, this is not recommended. Listing 6.17 Bad Use of Bit Fields
struct CombinedFields
{
    char     a;          // a normal base type...
    unsigned b : 11;     // ...followed by a bit field. Bad idea.
};
Sadly, whenever a base type follows a bit field or vice versa, the memory mapping continues at the next long word boundary. The CombinedFields structure thus uses four bytes for the character and another four bytes for the bit field. It would be better to write the character out as a bit field of eight bits; that way the entire structure fits into a single long word (remember: bit field structures use the long word as a base unit).

Final Remarks

It is possible to use signed bit fields, as shown in Listing 6.18. Use either int or unsigned as the bit field type. Listing 6.18 Signed Bit Fields
struct SignedBitFields
{
    int      a : 10;
    unsigned b : 10;
};
Remember that using signed values (int) effectively decreases the positive range of the variable. Finally, there are two things we simply cannot do with bit fields: it is impossible to take the address of a bit field, or to initialize a reference with a bit field.
Unions
Unions are used to combine the memory space of several variables when only one of the variables will be used at a time. Consequently, the size of a union is that of the largest element it contains. Listing 6.19 shows that when the contained elements are roughly the same size, unions are a wonderful tool for saving space while maintaining strong type checking. Listing 6.19 Union Size Match
union PerfectMatch
{
    char text[4];
    int nr;
};

void main(void)
{
    PerfectMatch a;

    a.text[0] = '0';
    a.text[3] = '0';
    ~
    // Somewhere else in the code:
    a.nr = 9;
}
Listing 6.19 creates two different ways to approach the same memory, allowing the use of different types. When the elements of a union have vastly different sizes, you need to reevaluate how you use unions. Consider Listing 6.20, which represents a database object. Listing 6.20 Union for Database Fields
struct Address
{
    char  *streetName;
    char  *zipCode;
    short  houseNumber;
};

union Location
{
    Address adr;
    int     poBox;
};

struct Client    // sizeof(Client) = 20 bytes
{
    char     *name;
    short     cityCode;
    char      locationSelector;
    Location  loc;
};

These structures are used to store client information. They register the name of a client, a code for the city she lives in, and her address. However, some clients are known by their post-office box number (PoBox) and others by their full address. The union
Location combines this information so the space needed for the PoBox is reused for the address. The locationSelector in the Client structure tells us which kind of address is being used by an instance of the Client structure.
This example works fine, but the Client structure is 20 bytes large, and when clients are known by their PoBox number, only 12 of the 20 bytes are actually necessary. This might not be a problem for a small database (it might not even be a problem when most clients are known by their address), but when you have a large number of "PoBox" clients, you are simply wasting space. Listing 6.21 gives an example of splitting the Client structure into two (observing structure alignment as discussed in the previous sections). Listing 6.21 Structures Replacing a Union
struct ClientAdr               // sizeof(ClientAdr) = 16 bytes
{
    char  *name;
    char  *streetName;
    char  *zipCode;
    short houseNumber;
    short cityCode;
} ;

#pragma pack(push,2)
struct ClientPoBox             // sizeof(ClientPoBox) = 10 bytes
{
    char  *name;
    short cityCode;
    int   poBox;
} ;
#pragma pack(pop)
For every instance of ClientPoBox that is placed in the database, 10 bytes are effectively won; without the pragmas the structure is 12 bytes, winning eight bytes. For every instance of ClientAdr, four bytes are won. Of course, you create a minor problem when using two structs instead of one, namely that of typing the database elements. Workarounds for this are as diverse as:

- Creating separate containers (lists, arrays, and so on) for address-client objects and PoBox-client objects
- Adding a selector to the structure or object referring to a client object and then casting client pointers to either ClientAdr or ClientPoBox (see the sketch after this list)
- Using polymorphism, that is, classes instead of structs. More about this can be found in Chapter 8.
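(A minimal sketch of the second workaround; the ClientAdr2, ClientPoBox2, and GetPoBox names, and the leading selector member, are illustrative assumptions rather than listings from this book.)

enum { BY_ADDRESS = 0, BY_POBOX = 1 };

struct ClientAdr2        // variant of ClientAdr with a leading selector.
{
    char  selector;      // always BY_ADDRESS.
    char  *name;
    char  *streetName;
    char  *zipCode;
    short houseNumber;
    short cityCode;
} ;

struct ClientPoBox2      // variant of ClientPoBox with a leading selector.
{
    char  selector;      // always BY_POBOX.
    char  *name;
    short cityCode;
    int   poBox;
} ;

// A stored client pointer is inspected via its first member before casting.
int GetPoBox(void *client)
{
    if (*(char *)client == BY_POBOX)
        return ((ClientPoBox2 *)client)->poBox;
    return -1;           // not a PoBox client.
}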
Summary
Although most compilers now use the same memory mapping of base types, this is not always the case, and not all programmers are fully aware of the implications of choosing a certain base type. That is why it is important for implementers to familiarize themselves with the sizes, ranges, and accessing speed of the standard C/C++ variables of their target environment. This chapter discussed ways to evaluate variable characteristics and gave practical examples of optimal use.

Speed of variables: Not all base types are equally fast in access and arithmetic. It is possible for the integer to be much faster than the float and double in all arithmetic operations except division, depending on the characteristics of the target platform. Also, bit fields can be quite fast until you try to do arithmetic with them. The scope and lifetime of a variable also have an impact on the speed of use.

Design of structures and unions: Because of alignment issues, the size of structures depends on the order in which their elements are declared. This alignment can be tweaked in several ways. Unions are always at least as large as their largest member; however, there are techniques for determining optimal use of unions.
Chapter 7. Basic Programming Statements
IN THIS CHAPTER Selectors Loops

This chapter looks closely at how choices between basic programming statements that do essentially the same thing can affect your programs. Often, it is not very clear what the performance and footprint repercussions are of choosing a certain solution. In this chapter, different solutions are presented together with useful data on their performance. This will provide the necessary insight into what actually happens when a program executes and how to determine an optimal implementation solution. Ready-to-use solutions are, of course, offered for the more common implementation problems.
Selectors
This section focuses on selector statements. As there is a variety of different C/C++ statements that enable a program to choose between a set of execution paths, you might wonder what the actual differences between the selector statements are. Certainly, almost any selection implementation can be rewritten to use a different statement (replacing a switch with an if..else construction, for instance). However, each statement does have its own specific strengths and weaknesses. When these statement characteristics are understood, it becomes possible to intuitively choose an efficient implementation solution. This section presents the characteristics of the different selector statements, after which a performance comparison between the different statements is made. First, however, consider how the order of expressions within a single statement affects evaluation time:
char a[]="this is a long string of which we want to know the length"; int b = 0; // inefficient and. if ( (strlen(a) > 100) && (b > 100) ) { d++; } // efficient and. if ( (b > 100) && (strlen(a) > 100) ) { d++; }
Calculating the length of a string takes time because a function needs to be called and an index and a pointer need to be instantiated, after which a loop will determine the end of the string. Comparing the value of an integer variable with a constant takes considerably less time. This is why the second if in the preceding example is easily 200 times faster than the first when the expression (b > 100) is not true. The following example demonstrates the same theory for a simple if with 'or'-ed expressions:
char a[] = "this is a long string of which we want to know the length";
int  b = 101;

// inefficient or.
if ( (strlen(a) > 100) || (b > 100) )
{
    d++;
}

// efficient or.
if ( (b > 100) || (strlen(a) > 100) )
{
    d++;
}
Again, when the fastest expression to evaluate is placed first, the if can be many times faster. Because this example uses an 'or' this time, the second if of the preceding example is easily 200 times faster than the first when the expression (b > 100) is true. Note that when all expressions of an 'and' are true, evaluation time is no longer influenced by expression order, as all expressions are evaluated anyway. Similarly, when none of the expressions of an 'or' are true, evaluation time is no longer influenced by expression order. Note also that the rules of expression evaluation hide a sneaky pitfall. Often you can be inclined to test and set a variable in an expression:
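(The example that belongs here is presumably along the following lines; variable values are illustrative.)

int a = 101, b = 0, d = 0;

// Pitfall: b++ is evaluated only when (a > 100) is false. Here the body
// executes, yet b stays 0, because the 'or' short-circuits.
if ( (a > 100) || (b++ > 0) )
{
    d++;
}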
if as a selector
The if (else) statement is probably the most straightforward selector of C/C++. An if evaluates an expression, after which either the statements belonging to the if itself or those belonging to the else are executed. The following example demonstrates this:
if (a < 10) { DoLittle(); // execute when a < 10. } else if (a < 100) { DoMore(); // execute when 10 <= a < 100. } else if (a < 1000) { DoDa(); // execute when 100 <= a < 1000. }
First, variable a is compared to 10. When a is lower than 10, the function DoLittle() is called. Only when a is equal to or larger than 10 is the second comparison made (a < 100). Similarly, only when a is equal to or larger than 100 is the third comparison made. This means that the more comparisons are specified, the more statements need to be executed in order to reach the final comparison. In the preceding example, the call to the function DoDa() is reached only after three comparisons have been made. It stands to reason that implementing a large number of choices with the if..else technique can thus result in a rather sluggish piece of code. The first comparisons in the if..else construction are still quite fast, but the deeper you get, the slower the response becomes. This means that the statements that need to be executed most often should be placed as high up in the if..else construction as possible. It also means that the default instructions (the ones executed when none of the other comparisons match) will take the longest to reach.
if (a == 1)
{
    Do1();
}
else if (a == 2)
{
    Do2();          // still pretty fast.
}
~
~
else if (a == 1000)
{
    Do1000();       // very slow to reach.
}
~
~
else
{
    DoDefault();    // slowest to reach.
}
The power of the if..else really lies in the fact that complicated expressions can be used as the basis of a choice.
Jump Tables
When the range of conditions for the choices can be translated to a continuous range of numbers (0, 1, 2, 3, 4, and so on), a jump table can be used. The idea behind a jump table is that the pointers of the functions to be called are inserted in a table, and the selector is used as an index in that table. The following code example implements a jump table that does much the same thing as the previous example where if..else is used:
// Table definition:
typedef long (*functs)(char c);

functs JumpTable[] = { DoOne, DoTwo, DoThree /* etc*/ } ;

// some code that uses the table:
long result = JumpTable[selector](i++);
The first two lines of code in this example define the jump table. The typedef determines the layout of the functions that will be contained by the jump table; in this case, a function that receives a single character as input and returns a long as a result. The array definition of JumpTable actually places the pointers of the functions in the table. Note that the functions DoOne(), DoTwo(), and DoThree() must be defined elsewhere and should take a character as input parameter and return a long. The last line of code demonstrates the use of the jump table. The variable i is input for the function that is selected via the variable selector. The variable result will receive the long returned by the selected function. In short, when i contains the character a and selector has the value 2, the following function call is actually made:

long result = DoThree('a');
A jump table can also hold more than one function per selection; with two functions per selection, the selector is simply scaled before indexing:

// 2 functions per selection:
int  tabEntry = selector * 2;
long result   = JumpTable[tabEntry](i++);
result       += JumpTable[tabEntry+1](i++);

switch Statement
The switch statement presents a number of cases which contain programming statements. A case can be chosen and executed
when its constant expression matches the value of the selector expression that is being evaluated by the switch statement. Listing 7.1 demonstrates this:

Listing 7.1 A Simple Switch
switch(b)
{
    case 1:
        DoOne();
        break;
    case 4:
    case 2:
        DoEven();
        break;
    default:
        DoError();
}
Depending on the value of the selector expression (here, variable b), a certain function is called. When b equals 1, DoOne() is called; when b equals 2 or 4, DoEven() is called; and for all other values of b, the function DoError() is called. By adding or omitting the break command, the implementer decides whether a single case is executed exclusively (break) or whether one or more of the following cases are also evaluated, and possibly executed (omitting the break). The implementation of the switch statement is not fixed, though. Depending on how it is used in a piece of source code, the switch statement is translated by the compiler using one of the following techniques:
switch as a jump table

When the case constants form a continuous range of numbers, as in

{ case 19, case 20, case 21, case 22, case 23, default }

the value of the selector expression is simply decreased by the starting number (19) and then used as a jump table index. Of course, the implementation of a jump table introduces a certain overhead; this is why switch statements with a small number of cases will still be implemented by a compiler as an if..else, simply because this is faster.
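(A sketch of such a continuous range; the function and variable names are illustrative.)

// A continuous case range (19..23): the compiler can subtract 19 from the
// selector and index a jump table directly.
void Handle(int selector)
{
    switch (selector)
    {
        case 19: /* ... */ break;
        case 20: /* ... */ break;
        case 21: /* ... */ break;
        case 22: /* ... */ break;
        case 23: /* ... */ break;
        default: /* ... */ break;
    }
}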
switch as an if..else
You will appreciate that a jump table cannot as easily be used for a switch with cases that have seemingly random constant expressions. Consider the following cases:
{ case 55, case 300, case 6, case 12, case 79, case 43, case 3, default }
For this set of cases it is necessary for the compiler to use a number of if..else statements in order to determine exactly which case should be executed. However, not all is lost yet. Although this technique does introduce some overhead, a good compiler will group the if..else values in such a way as to minimize the number of comparisons that need to be made. When the first comparison determines that the selector expression is larger than the first case constant (55), there will, of course, be no need to compare the expression with cases 6, 12, and 3. The if..else construct will contain expressions grouped in such a way that unnecessary comparisons are not done. When the switch is represented by an if..else construct, the default case will not be the fastest decision made by the switch. Conclusion: It pays to design a case statement in such a way as to allow the cases to be presented as a continuous range. Then,
when possible, the default case should be used for the selection that is executed most often.
Comparing without function call overhead

// switch without function delegation:
switch(j)
{
    case 1:
        k += 1;      // Statements to be executed.
        break;
    case 2:
        k += 3;      // Statements to be executed.
        break;
~~

// if..else without function delegation:
if (j == 1)
{
    k += 1;          // Statements to be executed.
}
else if (j == 2)
{
    k += 3;          // Statements to be executed.
~~
As shown in the previous paragraphs, delegating to functions creates extra (function call) overhead; it is therefore treated in a separate test case. This is also why jump tables are not included in this test case: Jump tables always use function calls. The program 07Source01.cpp, which can be found on the Web site, uses the timing function time_fn() from the book tools (as discussed in Chapter 5, "Measuring Time and Complexity") in order to time the differences between the if..else and switch implementations. Although the exact timing results will differ per system, the relations between the if..else measurements and the switch measurements should not. Table 7.1 shows the relations discovered with 07Source01.cpp:

Table 7.1. Test Results of 07Source01.cpp (No Function Call Overhead)

Selected Case    Switch Results    If..Else Results
1                1400              700     // 'if' is 2 times faster than 'switch'
2                1400              1050
3                1420              1400
4                1410              1760
5                1390              2110
6                1420              2460
7                1400              2810    // 'switch' is 2 times faster than 'if'
8                1410              3170
9                1400              3520
Default          890               -
Table 7.1 shows timing values of a switch implementation (column 2) and a similar if..else implementation (column 3). The timing values denote the relative time it takes for a certain case to be reached. For instance, in order to reach case 7, the if..else construct needs twice as much time as the switch does. What else can you see in these results? Apparently, the if..else is faster for the first three cases, after which it starts to lose ground. The switch has a relatively constant value for each of the nondefault cases. Remember that the switch is implemented as a jump table in this example, as the number of cases is relatively high. Another prominent number in this table is that for the default case; the switch is clearly much faster than the if..else construct there. Conclusion: Where possible, the default case of a switch should be the most used case. The cases used most often in an if..else construct should be placed as the first ifs of that construct. Sometimes it might even be a good idea to precede a switch by one or two if..else statements. When neither the design nor the requirements make clear which cases are most important (that is, which cases are executed most often when the program is being used), some static debug counters could be inserted in the various cases in order to determine how often a case is executed during field tests. This information can then be used to optimize case order and implementation. Once again, though, the profiler can give us this information as well (see Chapter 4, "Tools and Languages").

Comparing with function call overhead

Now that you will be looking at selectors that delegate to other functions, it is fair to add the jump table technique to our set of selectors to examine. The following excerpt shows how the different techniques will be tested:
// switch with function delegation:
switch(j)
{
    case 1:
        k += aa();
        break;
    case 2:
        k += bb();
        break;
~~

// if..else with function delegation:
if (j == 1)
{
    k += aa();       // Function call.
}
else if (j == 2)
{
    k += bb();       // Function call.
}
~~

// jump table with functions:
k += table[j-1]();   // Function call.
~~
The program 07Source02.cpp, which can be found on the Web site, uses the timing function time_fn() from the book tools (as discussed in Chapter 5) in order to time the differences between the if..else, switch, and jump table implementations. Although the exact timing results will differ per system, the relations between the results for the different techniques should not. Table 7.2 shows the relations discovered with 07Source02.cpp.

Table 7.2. Test Results of 07Source02.cpp (Test with Function Call Overhead)

Selected Case    Switch Results    If..Else Results    Jump Table Results
1                2630              1940                1920
2                2640              2290                1980
3                2640              2620                1930
4                2640              2980                1930
5                2620              3350                1930
6                2620              3700                1920
7                2650              4010                1930
8                -                 -                   -
9                -                 -                   -
Default          -                 -                   -
Table 7.2 shows timing values of a switch implementation (column 2), a similar if..else implementation (column 3), and a jump table implementation (column 4). The timing values denote the relative time it takes for a certain case to be reached. As expected, both the switch and the jump table have constant results for the nondefault cases. The if..else is still faster than the switch for the first two cases, but it does not outdo the jump table. Do not forget, however, that the jump table technique creates some extra overhead that is not represented by this timing data. This overhead is incurred when the jump table is initialized with the function pointers. However, this is done only once and most likely at some point during the start of the program. Conclusion: When the number of statements to be executed in each case becomes large enough to warrant delegation to functions, the jump table technique is clearly the fastest choice, with the switch coming a close second. Conclusion 2: When constant lookup time is important, the jump table is definitely the way to go.

Array lookup

This final technique, array lookup, is examined separately because it is not really a fair comparison to set array lookup times next to selector times; however, array lookup is still a very interesting technique simply because it is incredibly fast. The principle of array lookup is basically creating an array containing results. Then, instead of calling a function in order to obtain a result, the array is simply indexed. Of course, this technique can only be used when all possible function results can be calculated beforehand and an indexing system can be used. A popular and very useful implementation of a lookup array is one containing sine, cosine, or tangent results. Instead of calling an expensive tan() function, an array is created containing the tan results with the needed resolution. The following example illustrates this:
// Look-up Array definition:
short tan[] = { 0, 175, 349, 524, 698, 872, 1047, 1221, 1396, 1571, 1746 } ;
~~
// Use of array as if function:
short res = tan[a];
Note that this example also puts into practice some techniques concerning efficient use of variables, as explained in Chapter 6, "The Standard C/C++ Variables." Table 7.3 shows some timing results.

Table 7.3. Test Results of Array Lookup

Selected Case    Array Lookup Results
1                340
2                340
3                340
4                330
5                340
6                340
7                330
8                340
9                340
Default          330
The fastest times yet, obviously. And, of course, each case (array index) is equally fast. The consideration to be made concerning array lookup is whether its use measures up to the amount of space the array takes up. In this example the decision is clear: Any tan function would be considerably larger than this small array of shorts. And even when the array does become impressively large, it might still justify its existence when the data it contains is needed very often, is needed very quickly, or both.
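(A common variation, not taken from the book's example, is to compute the table once at startup; the following sketch assumes a resolution of one degree and the names are illustrative.)

#include <math.h>

// Fill the lookup table once at startup, storing tan(x) * 1000 for
// x = 0..45 degrees.
static short tanTable[46];

void InitTanTable(void)
{
    for (int deg = 0; deg <= 45; deg++)
    {
        tanTable[deg] = (short)(tan(deg * 3.14159265 / 180.0) * 1000.0);
    }
}

// After initialization, tanTable[a] replaces a tan() call.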
Loops
This section focuses on loop statements. Loops have a way of making a source code listing look relatively compact as repetitions of statements are coded only once. This is where loops present a danger to performance as well; any small inefficiency in the body of a loop will be incurred as many times as the loop body is iterated over. A loop that has one superfluous statement and that is iterated over 10,000 times will, of course, cause 10,000 superfluous statements to be executed. Consequently, loops are the places where most often a lot of performance can be won. This section presents several techniques that can be used to optimize loops.
Aborting Loops
Often, loops are implemented with termination conditions, as shown in the following examples:
// typical loop termination conditions:
for (j = 0; j < arraySize; j++) { ~~ }

while (q < nrOfElems) { ~~ }

do { ~~ } while (countDown > 0);
These are typical uses of loops for iterating over the elements of arrays, databases, and lists. They are syntactically correct, of course, but can be wildly inefficient. Consider an array in which each element is, on average, accessed an equal number of times during program execution; a loop that searches it as specified here will thus, on average, execute 50% of its iterations in vain. Only when the element that the loop is "looking for" happens to be the last element in the array does the loop actually have to iterate over all the array elements. This is why often a second stop clause is added to the loop, enabling it to stop when it has found what it is looking for. Listing 7.2 demonstrates this.

Listing 7.2 Aborting a Loop with a Flag
void DB::FindId(int searchId)
{
    int Found = 0;
    for (int i = 0; i < Index && !Found; i++)    // abort criteria.
    {
        if (base.GetId(i) == searchId)
        {
            // Found.
            Found = 1;
        }
        ~~ other loop statements ~~
    }
    base.ExecuteElement(i);
    ~~
The flag Found is added to the stop condition of the for loop in order to abort further iteration when the job has been done. As can be seen in this example, two comparisons are needed to determine an early stop. The first comparison is made in the body of the loop in order to determine whether you have found what you are looking forin which case the Found flag is set. The second comparison is made in the header of the loop in order to check up on the status of the Found flag. This already gains us some performance as the average number of iterations is now brought down by 50% (for loops in which each element is used the same number of times on average). However, when the loop does not find what it is looking for, all iterations are, of course, still performed and two statements of overhead were added to every iteration! A better way to abort a loop is presented in Listing 7.3. Listing 7.3 Aborting a Loop with a Break
void DB::FindId(int searchId)
{
    for (int i = 0; i < Index; i++)
    {
        if (base.GetId(i) == searchId)           // abort criteria.
        {
            // Found.
            break;
        }
        ~~ other loop statements ~~
    }
    base.ExecuteElement(i);
    ~~
By using break (Listing 7.3), you can eliminate one of the two comparisons used in Listing 7.2. When the abort criterion is satisfied (GetId(i) == searchId), the break takes program execution to the first statement following the loop, which is base.ExecuteElement(i) in Listing 7.3. Not only does the code in Listing 7.3 use fewer comparisons per iteration, it is also faster when the abort criterion is reached. This is because the break causes the rest of the
statements in the loop to be skipped (~~ other loop statements ~~), and the loop header is not evaluated a last time. Note that the stop criterion in the header of the loop itself could also be replaced by an if..break combination; however, this is unlikely to affect the loop performance as compilers will generate the same, or very similar, code for both solutions. In practice, complicated loops will often take up most of a function, with the function result being a pointer (or reference) to the found element. In these cases, it is, of course, best to abort the loop by immediately returning the found object. The following example illustrates this:
EL *DB::FindId(int searchId)
{
    for (int i = 0; i < Index; i++)
    {
        if (base.GetId(i) == searchId)           // abort criteria.
        {
            // Found.
            return base.GetObjectRef(i);
        }
        ~~ other loop statements ~~
    }
    return NULL;    // nothing found.
}
A related technique skips only the remainder of the current iteration instead of the whole loop, using continue, as Listing 7.4 demonstrates.

Listing 7.4 Skipping Loop Iterations with a Continue

void DB::ServiceProblems(int threshold)
{
    int size = baseSize;    // local copy of a member variable.
    for (int i = 0; i < size; i++)
    {
        EL *element = GetObject(i);
        if (element->GetPressure() < threshold)
        {
            // not an interesting element.
            continue;
        }
        ~~ other loop statements ~~
    }
}
The example in Listing 7.4 iterates over all database elements and performs some actions for each element that has a pressure value greater than that of the specified threshold. An if..continue combination was added to discard any element with a safe pressure value. Consequently, for safe elements the ~~ other loop statements ~~ will be skipped, as the continue forces the next iteration to be started immediately. Notice that, in this example, the if..continue could have easily been replaced by an if..else construct. This is basically always the case; however, more complex loops can become much easier to read when continue is used instead. Notice also that some techniques discussed in Chapter 6 are used here.
Summary
There are different techniques for implementing selectors and loops. Depending on the aim of a piece of code, a certain technique will be more useful than another. This chapter explained the characteristics of different techniques and showed where to use each.

Selectors

if..else constructions are very fast for the first few cases and increasingly slower for the rest, with the default case being the slowest. Use if..else constructions for evaluating complex expressions that cannot be reduced to a numbered range for a selector, and for either/or situations that are unlikely to be augmented with additional cases in the future.

The switch statement is equally fast for all selectable cases, except for the default case, which is faster than the other cases. switch is slower than if..else only when there are a minimal number of cases. Use the switch statement wherever the selector can be reduced to a numbered range (preferably a continuous one) and the number of (expected) cases is higher than 3 or 4.

Jump tables are equally fast for each selectable function they contain. Defaults need to be coded by hand. When cases delegate all the work to a function, a jump table is even faster than a switch, as long as the selector range is continuous. Use jump tables when the selector can be reduced to a continuous numbered range and cases immediately delegate to functions.

Array lookup is the fastest technique to obtain a result based on the value of a selector. The memory usage of the array is best when the selector range is continuous. Array lookup can be used only when the selectable results can be calculated beforehand. Use array lookup when the selector can be reduced to a continuous numbered range, the footprint of the array is not a problem, and the selectable results can be calculated beforehand.

Loops

It is very important to keep an eye on the performance of the body statements of loops. Loops often perform hundreds of iterations, which means that each statement that can be optimized is, in effect, optimized hundreds of times over. This chapter has shown some techniques for optimizing loops and highlighted places where loops can often be optimized. It is a good idea for implementers to always keep the efficiency of their loops in mind when writing software, even when writing seemingly trivial loops. In many cases, the number of loop iterations depends on the number of elements in an array, list, or database. This number will grow over time. In other cases, loops with an initially small number of iterations are enlarged when a program is updated or reused. This means you cannot assume that a small loop will always stay a small loop. Where possible, the size and number of iterations of loops should be honed with the use of break and continue statements.
Chapter 8. Functions
IN THIS CHAPTER Invoking Functions Passing Data to Functions Early Returns Functions as Class Methods

This chapter focuses on different ways of invoking functions and passing function parameters. It introduces several calling techniques and compares their characteristics. An in-depth view is given of what exactly is involved with calling functions via the different techniques and when best to use which technique.
Invoking Functions
Let's first look at what actually happens when a function call is made. Previous chapters have already hinted at function-call overhead and now seems to be the ideal time to discover exactly what this so-called overhead is. Consider therefore the simple C/C++ function presented in Listing 8.1. Listing 8.1 A Simple Add Function
int MyAddFunct(int a, int b, int c)
{
    return a + b + c;
}

void main(void)
{
    int a = 1, b = 2, c = 3;
    int res = MyAddFunct(a,b,c);
}
The function MyAddFunct() in Listing 8.1 takes three integers as input and returns the sum of the three as a result. By compiling this source with the assembly output option turned on (see Chapter 4, "Tools and Languages"), you can look at what the processor actually has to do to execute this function. The assembly listings presented in this section were generated with Developer Studio for Windows, but other compilers will give similar results. A column of line numbers has been added to the output to facilitate referencing from the explanatory text. The first line(s) in each listing depicts the original C/C++ statement. Listing 8.2 shows the assembly generated for the call to MyAddFunct().

Listing 8.2 Assembly for Calling MyAddFunct()
// MyAddFunct(a,b,c);
00  mov  eax, DWORD PTR _c$[ebp]    ; place value of 'c' in register eax.
01  push eax                        ; push register eax onto the stack.
02  mov  ecx, DWORD PTR _b$[ebp]    ; place value of 'b' in register ecx.
03  push ecx                        ; push register ecx onto the stack.
04  mov  edx, DWORD PTR _a$[ebp]    ; place value of 'a' in register edx.
05  push edx                        ; push register edx onto the stack.
06  call ?MyAddFunct@@YAHHHH@Z      ; call MyAddFunct()
As can be seen in Listing 8.2, before a function is even called, a lot of work is already done. In lines 00 to 05, the three variables that are input for the function (a, b, and c) are placed on top of the stack, so they become part of the function's local parameter set. Then, in line 06, the function is actually called. This causes the value of the program counter to be placed onto the stack also. At any time, the program counter points to the memory address from which the processor is retrieving instructions. In calling the function, the program counter is set to point to the address of the first instruction of the function. Because the previous value of the program counter resides safely on the stack, execution can continue with the instruction after the function call when the function has ended. Note that this part of the function call overhead (placing variables on the stack) will grow larger when more (or larger) parameters are passed to a function, and smaller when fewer (or smaller) parameters are passed. Now let's look at what happens inside the function. Listing 8.3 shows what happens on entry of the function.

Listing 8.3 Assembly for Entering MyAddFunct()
// int MyAddFunct(int a, int b, int c)
// {
07  push ebp          ; place the base pointer on the stack.
08  mov  ebp, esp     ; new base pointer value.
The statements in Listing 8.3 are executed before the actual function body statements. So what exactly is the overhead that is incurred when entering a function? Basically, it is more stack work. The base pointer (ebp), which tells the processor where to find local variables, is pushed onto the stack. Its value will be needed when the function ends and access is needed to the local variables of the calling context. After the base pointer is placed onto the stack, the base pointer receives the value of the stack pointer. Now pointing at the top of the stack, the base pointer can be used to retrieve (albeit via a negative index) the function parameters. This is logical, as the base pointer now points at the stack frame containing the function's local variables. This part of the function call overhead is pretty standard. It basically saves the current context onto the stack so it can be used again when the function ends. The more registers that are used inside the function, the more registers will have to be pushed onto the stack for safekeeping. This means two things. First, the function MyAddFunct is obviously not going to use a lot of different registers. Second, the more complicated a function is, the more overhead will be incurred on function entry. Figure 8.1 shows what the stack looks like at this point inside the function call.

Figure 8.1. Stack frame when inside a function call.
Arriving at the function body, Listing 8.4 shows the assembly statements that make up the body of the MyAddFunct function.

Listing 8.4 Assembly for the Body of MyAddFunct()
// return a + b + c;
09  mov  eax, DWORD PTR _a$[ebp]    ; value of 'a' in register eax.
10  add  eax, DWORD PTR _b$[ebp]    ; add value of 'b' to eax.
11  add  eax, DWORD PTR _c$[ebp]    ; add value of 'c' to eax.

Listing 8.5 shows what happens when MyAddFunct terminates.
Listing 8.5 Assembly for Exiting MyAddFunct()
// }
12  pop ebp     ; return ebp to its original value.
13  ret 0       ; return from the function.
Basically, this is the opposite of Listing 8.3; saved register values are taken from the stack in reverse order (line 12). At this point, all that remains on the stack are the three parameters and the program counter. Line 13 returns from the function; it effectively pops the program counter from the stack and causes execution to continue at the instructions right after the function call. Listing 8.6 shows how the returned result of function MyAddFunct is captured by the calling function.

Listing 8.6 Assembly for Capturing the Result of MyAddFunct()
// int res = ...
14  add esp, 12                      ; pop 3 variables * 4 bytes.
15  mov DWORD PTR _res$[ebp], eax    ; eax contains the function result.
The three variables were still on the stack, so the stack pointer is adjusted by 12 bytes (as the stack grows downward in memory, an add is used instead of a sub). So, all in all, quite a lot of instructions had to be executed to make this calculation, which itself was only three instructions long. When a function is used often, and it does not contain that many statements, it would be great if it were possible to somehow skip the
overhead shown in this section, while still keeping the level of abstraction created by the function call. The rest of this section will discuss techniques to do just that, as well as some other useful techniques.
Macros
One way to avoid function-call overhead is by using macros. You can define a macro in both C and C++ with the #define keyword. Just as with any other use of the #define keyword, a macro basically tells the precompiler to substitute a specified sequence for an identifier:
#define TRUE 1
With the definition above, the precompiler is told to replace all occurrences of the word TRUE in the source with the character 1, before handing the source over to the compiler (for more information on the precompiler, see Chapter 4). With function macros we take things a little further, as we are able to use function-like arguments. Listing 8.7 shows a few much-used macro definitions.

Listing 8.7 Macro Examples
#define MAX(a,b)     ((a) < (b) ? (b) : (a))
#define MIN(a,b)     ((a) < (b) ? (a) : (b))
#define ABS(a)       ((a) < 0 ? -(a) : (a))

#define TSTBIT(a,b)  ((a & (1 << b)) ? (true) : (false))
#define SETBIT(a,b)  ((a | (1 << b)))
#define CLRBIT(a,b)  ((a & (~(1 << b))))
#define FLIPBIT(a,b) (TSTBIT(a,b) ? (CLRBIT(a,b)) : (SETBIT(a,b)))
When you look at the MAX macro in Listing 8.7, you see that the identifier part, MAX(a,b), contains the parameters a and b. These parameters are repeated in the replacement string. The precompiler will thus be able to make a parameterized substitution by taking whichever parameters it finds between the brackets and placing them in the appropriate spots within the substitute string. Listing 8.8 uses this macro and demonstrates immediately why macros are not always a good idea.

Listing 8.8 Unexpected Macro Results
#define MAX(a,b) ((a) <(b) ? (b):(a)) void main(void) { int a= 10; char b = 12; short c = MAX(a,b); c = MAX(a++,b++); (not ok) }
The first use of the macro in Listing 8.8, MAX(a,b), seems to work fine; it returns the larger of the two given parameters and places it in variable c. The second call, however, is expanded a little differently than the implementer might have intended:
MAX(a++,b++);   =>   ((a++) < (b++) ? (b++) : (a++))
The variables a and b are incremented on evaluation; variable b is incremented once more when it is returned. This demonstrates how cryptic the use of macros can be, particularly when you are using functions written by a third party. You might not even be aware that the function you are calling is in fact a macro and therefore has some peculiarities. Moreover, both macro calls in the example use different types of variables; MAX can return either a character or an integer, but the result is stored in a short, and the compiler does not complain about this. It seems it is time to list the advantages and disadvantages of using macros.

The following is the advantage of using macros:

- They can make function execution faster by eliminating function call overhead.

The following are disadvantages of using macros:
- Often macro descriptions are very cryptic and hard to read.
- Use of macros increases footprint, as macro bodies are expanded wherever they are used.
- Finding compiler errors in macros can be a difficult task.
- Debugging a source with macros can be problematic, because debuggers (as other development tools) do not always know exactly how to handle macros.
- Macros do not facilitate type-checking.

Conclusion: Although the use of macros can certainly speed up code that uses small functions repetitiously, you should be wary of using them. Most often there are other ways to solve the problems. Use macros only if and when you really feel you have to.
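One such alternative, shown here as a sketch rather than as part of the original text, is the C++ inline function template: it offers the speed of a macro while evaluating each argument exactly once and preserving type checking.

// A type-safe alternative to the MAX macro. Arguments are evaluated exactly
// once, so Max(a++, b++) behaves as expected.
template <class T>
inline T Max(T a, T b)
{
    return (a < b) ? b : a;
}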
Inline Functions
Another way of avoiding function call overhead is the use of inline functions. Inline functions are a C++ feature and therefore cannot be used in C. They operate under a principle similar to that of macros, namely that the function call is actually substituted by the function body during compile time. This can be a great timesaver for short functions which are called frequently. Think of class member functions for accessing private data; they appear very often in C++ sources but usually do nothing more than return a pointer or variable value. Every time such an access function is called, though, function-call overhead is incurred that is many times larger than the actual function body. An ideal situation for optimization, it seems. There are two ways of declaring inline functions: first, by adding the keyword inline to the function definition, and second, by adding the function body in the definition of a member function of a class. Listing 8.9 shows the two methods of declaring inline functions.

Listing 8.9 Two Ways to Inline Functions
inline int MyAddFunct(int a, int b, int c)
{
    return a + b + c;
}

class Client
{
private:
    // Private data.
    char          name[50];
    unsigned char age;
    char          gender;
public:
    char *GetName()  { return name; }
    int   GetAge()   { return age; }
    char  GetGender();
} ;

inline char Client::GetGender() { return gender; }
Let's look at the advantages and disadvantages of inlined functions.

Inlining offers the following advantages:

- Faster function calls
- Easy optimization, as only a keyword has to be added or a definition moved to a header file
- Retains the benefit of type-checking for parameters

Inlining also offers the following disadvantage:

- Increased footprint, as the function body appears everywhere the inlined function is called
Conclusion: Use inlining only for short functions that are called frequently or that are part of a time-critical piece of code. Sadly, however, for practically all compilers the inlining of functions is merely a suggestion made by the programmer. This means that making a function inline is no hard guarantee that the compiler will in fact substitute function calls for function bodies. It is possible that some or all inlined functions are treated like normal function definitions. For more information on inlining implementation, consult the documentation provided with the compiler you use.
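(Some compilers do offer a stronger, compiler-specific hint. The following sketch uses the Microsoft-specific __forceinline keyword; even this remains a strong hint rather than an absolute guarantee, and it is not portable.)

// __forceinline is specific to Microsoft compilers; other compilers have
// their own equivalents, or none at all.
__forceinline int MyAddFunct(int a, int b, int c)
{
    return a + b + c;
}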
These relative numbers tell you, even without more precise measurement or looking at the generated assembly, that the recursive function is a lot slower than the iterative function. This should come as no surprise, as you have seen in the beginning of this chapter that function calls (of which a lot are used in recursion) cause considerable slowdown for relatively small functions. Moreover, the recursive function actually uses more runtime memory, as it repeatedly places function call overhead on the stack. The factorial of 10 will in fact result in 11 calls to the FactRec() function. This touches on a major drawback of recursive functions: Their use is limited by the amount of stack that is available. This means predictions have to be made at implementation/design time; when using recursion you simply have to guarantee that the function will never be called with a parameter (or parameters) that will cause a stack overflow or an unallowable system slowdown. Summarizing, the following rules of thumb can be defined for when to use recursion:

- When a function can be defined in terms of itself and cannot easily be made iterative.
- When repetitive actions have to be done on a varying set of data (like walking through subdirectories or traversing a tree; see the example in Chapter 11, "Storage Structures").
- When you can guarantee that the recursive call will never cause system interference.
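(The two functions compared above are presumably along the following lines; FactRec is the name used in the text, while FactIter is an assumed name for the iterative counterpart.)

// Recursive version: each call places parameters, the program counter, and
// a stack frame on the stack; FactRec(10) results in 11 nested calls.
long FactRec(int n)
{
    if (n <= 1)
        return 1;
    return n * FactRec(n - 1);
}

// Iterative version: a single stack frame and no per-step call overhead.
long FactIter(int n)
{
    long result = 1;
    for (int i = 2; i <= n; i++)
        result *= i;
    return result;
}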
Passing Data to Functions

Parameters can be passed to a function by value, by reference, or by pointer, as Listing 8.12 demonstrates.

Listing 8.12 Passing Parameters by Value, by Reference, and by Pointer

// Passing by value:
void FunctValue(int input1, int input2)
{
    input1 += input2;
}

// Passing by reference:
void FunctRef(int &input1, int &input2)    // using references.
{
    input1 += input2;
}
void FunctPoint(int *input1, int *input2)  // using pointers.
{
    *input1 += *input2;
}

void main(void)
{
    int a = 6, b = 7;

    FunctValue(a, 5);    // values 6 and 5 placed on stack.
    FunctRef(a, b);      // addresses of 'a' and 'b' placed on stack.
    FunctPoint(&a, &b);  // 'a' is changed via *input1.
}
There are two reasons why you want to be able to pass a parameter by reference:

When an object is relatively large, it is a much better idea to place its address on the stack than the entire object itself. Placing a large object on the stack not only costs runtime footprint, it also costs time.

When a function receives a reference to an object, it can permanently change the value of that object. In Listing 8.12, the functions FunctRef() and FunctPoint() change the value of variable a in the scope of the calling function, in this case main(). After FunctRef(a, b), the variable a equals 13. After FunctPoint(&a, &b), the variable a equals 20 (13 + 7). The first call, FunctValue(), however, does not change the value of variable a. In fact, FunctValue() does nothing; that is, it changes the value of input1, which is a variable on the stack and which disappears after FunctValue() ends.

There is another implication to consider, namely the access to parameters. When a parameter is passed by value, it is part of the local function stack. This means its value can be accessed directly. A parameter which is passed by reference has to be dereferenced before its value can be accessed. This means the pointer of the variable is taken and used to find the value of the variable. Accessing value parameters is thus faster than accessing pointer/reference variables. Listing 8.13 shows this through the assembly that is generated for Listing 8.12.

Listing 8.13 Assembly Listing Showing Parameter Access for Value Arguments Versus Reference Arguments
// FunctValue Body: input1 += input2;
mov eax, DWORD PTR _input1$[ebp]    ; get value of input1.
add eax, DWORD PTR _input2$[ebp]    ; add value of input2.
mov DWORD PTR _input1$[ebp], eax    ; place addition back in input1.
; value of input1 is subsequently lost.

// FunctRef & FunctPoint Bodies: input1 += input2; or *input1 += *input2;
mov eax, DWORD PTR _input1$[ebp]    ; get pointer to 'a'.
mov ecx, DWORD PTR [eax]            ; get value of 'a'.
mov edx, DWORD PTR _input2$[ebp]    ; get pointer to 'b'.
add ecx, DWORD PTR [edx]            ; add value of 'b'.
mov eax, DWORD PTR _input1$[ebp]    ; get pointer to 'a'.
mov DWORD PTR [eax], ecx            ; place result back in 'a'.
; value of input1, as well as original variable 'a', has been changed.
Choosing between passing parameters by reference or by value to time-critical functions is perhaps not as trivial as you might think. Certainly a reference or a pointer must be used when the passed parameter should be permanently changed by the function. When a parameter is passed for read-only use, a choice needs to be made which is based on the following two characteristics:
Size of the parameter to be passed. When a parameter is a compound object (structure, array, class, and so on), it will most often be large enough to warrant passing it by reference; a quick sizeof(object_to_pass) will tell you exactly how many bytes will be pushed onto the stack when you pass object_to_pass by value. However, it is not a fair conclusion that passing a parameter by pointer or reference is always better where stack footprint is concerned. Passing a character by value causes only a single byte to be placed on the stack, for instance, whereas passing a character by reference causes at least four bytes to be placed on the stack!

Number of times the parameter is accessed. When a read-only parameter, or any of its compounded members, is accessed frequently within the function, accessing overhead as demonstrated by Listing 8.13 will begin to influence function performance. In some cases it might be better to make a local copy of the values which are accessed most often:
int in = *input1;
But when the whole parameter, or many of its compounded members, is accessed frequently, it becomes a performance/footprint tradeoff whether or not to pass the parameter by value.
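(A sketch with illustrative names: caching a dereferenced read-only parameter that a loop uses on every iteration.)

// The loop would otherwise dereference 'factor' every iteration; a local
// copy trades one assignment for repeated pointer indirection.
long SumScaled(const int *factor, const long *values, int n)
{
    int  f   = *factor;          // local copy, accessed directly.
    long sum = 0;
    for (int i = 0; i < n; i++)
    {
        sum += values[i] * f;    // no dereference of 'factor' here.
    }
    return sum;
}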
Global Data
The previous section demonstrated how the kind and number of parameters passed to a function can influence function performance. The use of global data makes it possible to decrease the amount of data placed on the stack for a function call. Global data is data that can be accessed by all functions of a program; thus, global data does not have to be passed in a function call. This makes the call itself faster, as well as the access to the data, compared to data that is passed by reference. When this globally defined data is used by several functions (but does not become the subject of a multiple access dispute between threads), defining the data globally is even more useful. Listing 8.14 takes a structure with authentication data (consisting of a username, a password, and a registration code) and checks whether this data is valid. For this, four different functions are created:
int CheckAuthentication_Global()
Takes a globally defined instance of the authentication structure and evaluates its validity. No stack data is used. The three other functions, shown in Listing 8.14, take the structure by value, by reference, and by pointer, respectively.

Listing 8.14 Authentication Check with Different Data-Passing Techniques
#define OFFSET 10
// Structure to pass as data to the functions.
struct Authentication
{
    unsigned char regCode[21];
    unsigned char user[7];
    unsigned char password[7];
} ;

// Globally defined instance of data, used in the global version of the
// CheckAuthentication function.
Authentication glob = { { 0,3,1,0,0,4,0,4,2,1,0,5,0,2,1,0,5,0,2, },
                        "Bond01", "shaken" } ;
// Global function.
int CheckAuthentication_Global()
{
    int result = 1;
    glob.user[result] = result;
    for (int i = 0; i < 7; i++)
    {
        if (glob.user[i] + glob.password[i] != (glob.regCode[i*3]-OFFSET))
        {
            result = 0;
            break;
        }
    }
    return result;
}

// Pass-By-Value function.
int CheckAuthentication_ByValue(Authentication value)
{
    int result = 1;
    value.user[result] = result;
    for (int i = 0; i < 7; i++)
    {
        if (value.user[i] + value.password[i] != (value.regCode[i*3]-OFFSET))
        {
            result = 0;
            break;
        }
    }
    return result;
}

// Pass-By-Reference function.
int CheckAuthentication_ByReference(Authentication &ref)
{
    int result = 1;
    ref.user[result] = result;
    for (int i = 0; i < 7; i++)
    {
        if (ref.user[i] + ref.password[i] != (ref.regCode[i*3]-OFFSET))
        {
            result = 0;
            break;
        }
    }
    return result;
}

// Pass-By-Reference -using a pointer- function.
int CheckAuthentication_ByPointer(Authentication *ref)
{
    int result = 1;
    ref->user[result] = result;
    for (int i = 0; i < 7; i++)
    {
        if (ref->user[i] + ref->password[i] != (ref->regCode[i*3]-OFFSET))
        {
            result = 0;
            break;
        }
    }
    return result;
}
A program using Listing 8.14 can be found in the file 08Source04.cpp on the Web site. The result of the program is a list of relative numbers indicating the execution speed of the various data passing techniques used. Table 8.1 shows these timing results.

Table 8.1. Calling Technique Timing Results

Technique                Relative Time Spent
Calling by reference     220
Calling via a pointer    220
Calling by value         380
Using global data        170
The results presented in Table 8.1 take into account different calling methods as well as variable access for the different calling methods. From this table, you see that calling by pointer and calling by reference are indeed equally fast. As expected, calling by value is quite a lot slower. The clear winner is the function which uses global data.
Early Returns
As with loops, it is possible to optimize function execution by implementing a function in such a way as to allow the optional skipping of body statements. Using early returns means checking, as early as possible, all the reasons for not continuing to execute the function. What you basically want to do is make sure that reserving memory, doing complex calculations, and calling other functions is only done when it is clear that the efforts will be useful. Listing 8.15 shows an example function that uses no early returns.

Listing 8.15 No Use of Early Returns
struct Object
{
    int  color;
    int  material;
    int  city;
    char *name;
} ;
Object *CreateObject(char *name)
{
    Object *pObj = new Object;
    pObj->name = new char[strlen(name)+1];
    strcpy(pObj->name, name);
pObj->color = CurrentStatus->FashionColor(); pObj->material = CurrentStatus->FashionMaterial(); pObj->city = CurrentStatus->CurrentCity(); if ( pObj->city == UNDEFINED || pObj->color == UNDEFINED || pObj->material == UNDEFINED) delete [] pObj->name; delete pObj; return NULL; } return pObj; }
The function CreateObject() endeavors to create a new object with a given name. This new object will take its other values (color, material, and city) from a global database that should contain current status information. This way the new object will have the color and material which are most fashionable at the moment of creation, and will be produced in the city with the best plants. It is possible, however, that some status information is not defined at object creation time, in which case it is not necessary to create a new object yet. Listing 8.15 shows the way functions are often used in the field. When you look closely at the fail conditions, found at the end of the function, you notice that it is possible to speed up function execution considerably in the case when no object is created. This becomes increasingly useful as the likelihood of failure increases. By evaluating fail conditions earlier, it is possible to skip time-consuming object creation and deletion. Listing 8.16 demonstrates this for the CreateObject() function of Listing 8.15.

Listing 8.16 Using Early Returns
Object *CreateObject(char *name)
{
    int col, mat, cit;

    if (UNDEFINED == (col = CurrentStatus->FashionColor()))
        return NULL;
    if (UNDEFINED == (mat = CurrentStatus->FashionMaterial()))
        return NULL;
    if (UNDEFINED == (cit = CurrentStatus->CurrentCity()))
        return NULL;

    Object *pObj = new Object;
    pObj->name = new char[strlen(name)+1];    // allocate before copying.
    strcpy(pObj->name, name);
    pObj->city     = cit;
    pObj->color    = col;
    pObj->material = mat;
    return pObj;
}
Note the complete absence of a delete instruction from Listing 8.16. Also, a call to strcpy is made only when necessary.
Functions as Class Methods

This section examines the implications of different techniques of using class methods and provides guidelines on when and where to use which technique.
Inheritance
Inheritance can be a very good way of safely reusing existing functionality. Typically, inheritance is used where related objects share some common functionality. Consider two objects that retrieve information from the Internet. The first object retrieves pictures to display on screen, and the second retrieves music to play over audio speakers. The data manipulation done by these objects will be quite different but they have some common functionality: Internet access. By moving this common functionality into a base class, it can be used in both objects but the actual code for Internet access needs to be written, tested, and, most importantly, maintained, only in one place. This is a great advantage where development time is concerned. The design becomes more transparent and the problems tackled by the base class need to be solved only once. Moreover, when a bug is found and fixed, there is no danger of overlooking a similar bug in similar functionality. Listing 8.17 shows a class with which we will demonstrate different ways of using inherited functionality. Listing 8.17 Base Class Definition of an Example Class
class ComputerUser
{
public:
    // constructor
    ComputerUser() { favouriteOS = unknownOS; }
    ComputerUser(char *n, int OS)
    {
        strcpy(name, n);
        favouriteOS = OS;
    }
    // destructor
    ~ComputerUser() { }
    // interface
    void SetOS(int OS)          { favouriteOS = OS; }
    void SetName(const char *s) { strcpy(name, s); }
private:
    // data
    int  favouriteOS;
    char name[50];
} ;
The ComputerUser class will be our base class for storing information about a computer user. Each computer user can have a name and a favorite OS. The class implementation is pretty straightforward, but note that the constructor is overloaded. When a computer user's name and favorite OS are known at construction time, this information will be initialized immediately. Now let's assume that for a specific type of computer user, you want to store more information. For instance, of computer users who have access to the Internet you want to know what their favorite browser is. Not all computer users have Internet access but all people with access to the Internet use a computer; therefore, it is possible to derive an Internetter from a ComputerUser. Listing 8.18 shows how this can be done. Listing 8.18 Deriving from the Example Base Class
class Internetter: public ComputerUser
{
public:
    // constructor
    Internetter() { favouriteBrowser = unknownBROWSER; }
    Internetter(char *n, int OS, int browser)
    {
        SetName(n);
        SetOS(OS);
        favouriteBrowser = browser;
    }
    // destructor
    ~Internetter() { }
    // interface
    void SetBrowser(int browser) { favouriteBrowser = browser; }
private:
    // data
    int favouriteBrowser;
} ;
The Internetter class also has an overloaded constructor for passing initialization data. Because Internetter is a derived class, it needs to initialize not only its own data but also that of its base class. This can be seen in its second constructor. Now that you have seen an example of a base class and a derived class, let's look at some different techniques of using their member functions.

Inefficient Object Initialization

One way of creating an instance of the Internetter class could be
to use its second constructor, which initializes the base class data through the interface functions SetName() and SetOS(). Each of these calls is made only after the base class and derived class have already been constructed. A more efficient way is to invoke the base class constructor directly from the initialization list of the derived class constructor, as the Internetter2 class in Listing 8.19 demonstrates.

Listing 8.19 Efficient Initialization of the Base Class
class Internetter2: public ComputerUser
{
public:
    // constructor
    Internetter2() { favouriteBrowser = unknownBROWSER; }
    Internetter2(char *n, int OS, int browser)
        : ComputerUser(n, OS)           // base class construction.
    {
        favouriteBrowser = browser;
    }
    // destructor
    ~Internetter2() { }
    // interface
    void SetBrowser(int browser) { favouriteBrowser = browser; }
private:
    // data
    int favouriteBrowser;
} ;
Without Inheritance Overhead
Before looking at actual timing data of the different techniques, you should examine what happens when you do not use inheritance. Certainly in this example, where so few members are found in the base class, it is worth considering coding the base class functionality directly into the derived class. For objects of which the use is time critical (which are used/accessed often during search actions and so on), it might prove to be a good idea to do just this. Listing 8.20 shows how Internetter functionality can be combined with ComputerUser functionality. Listing 8.20 Not Using Inheritance
class ExtendedComputerUser
{
public:
    // constructor
    ExtendedComputerUser() { favouriteOS = unknownOS; }
    ExtendedComputerUser(char *n, int OS, int browser)
    {
        strcpy(name, n);
        favouriteOS      = OS;
        favouriteBrowser = browser;
    }
    // destructor
    ~ExtendedComputerUser() { }
    // interface
    void SetOS(int OS)           { favouriteOS = OS; }
    void SetBrowser(int browser) { favouriteBrowser = browser; }
    void SetName(const char *s)  { strcpy(name, s); }
private:
    // data
    int  favouriteOS;
    int  favouriteBrowser;
    char name[40];
};
Timing Data

The timing data for Table 8.2 was gathered using a program that can be found in the file 08Source01.cpp on the Web site. The result of the program is a list of relative numbers indicating the execution speed of the various object construction methods described in this section.

Table 8.2. Timing Data on Different Uses of Inheritance
1. Inefficient initialization of Internetter
2. Efficient initialization of Internetter
3. Efficient initialization of base class by Internetter2
4. Without inheritance overhead by ExtendedComputerUser
Note that the first two results are quite close; this is because the overhead incurred in initializing the derived class in test 1 is still incurred in initializing the base class in test 2. Test 3 initializes both the derived class and the base class more efficiently and is noticeably faster. Not using inheritance at all is, unsurprisingly, the fastest method of the four. Note also that the slowdown increases as more levels of inheritance are added: each time a class is derived from another, a layer of constructor and destructor calls is added. However, when constructors and other member functions are actually inlined by the compiler, the overhead of derived constructors is, of course, negated. Summarizing, for time-critical classes the following questions should be asked concerning the use of inheritance:

Is using inheritance actually necessary?

What is the actual price to be paid when avoiding inheritance (not only where application footprint is concerned, but also in terms of application maintainability)?

When inheritance is used, is it used as efficiently as possible?
Virtual Functions
Before looking at the technical implications of using virtual functions, a small set of examples will demonstrate what exactly virtual functions are used for. To the classes of Listing 8.19 another class is added, describing Internetters who also happen to be multiplayer gamers. As not all Internetters are multiplayer gamers, and certainly not all computer users are multiplayer gamers, the MultiPlayer class will be a derivative of the Internetter2 class. For a multiplayer gamer, we are interested in his or her favorite game and the nickname used when playing. Listing 8.21 shows what the MultiPlayer class looks like.

Listing 8.21 MultiPlayer Class
class MultiPlayer : public Internetter2
{
public:
    // constructor
    MultiPlayer() { favouriteGame = unknownGAME; }
    MultiPlayer(char *n, int OS, int browser, int game, char *nName)
        : Internetter2(n, OS, browser)
    {
        favouriteGame = game;
        strcpy(nickName, nName);
    }
    // destructor
    ~MultiPlayer() { }
    // interface
    void SetGame(int game) { favouriteGame = game; }
private:
    // data
    int  favouriteGame;
    char nickName[50];   // array size assumed; the original declaration was lost
};
Now let's assume you have a group of computer users whose names you want to know. To this end you can add a PrintName function to the base class ComputerUser:
void ComputerUser::PrintName()
{
    cout << name << endl;
}
and create an array of ComputerUsers:
ComputerUser *p[4];

p[0] = new ComputerUser("John", WINDOWS);
p[1] = new Internetter("Jane", LINUX, NETSCAPE);
p[2] = new Internetter2("Gary", LINUX, EXPLORER);
p[3] = new MultiPlayer("Joe", LINUX, NETSCAPE, QUAKE, "MightyJoe");
The following fragment of code will print the names of the ComputerUsers in the array:

for (int i = 0; i < 4; i++)
    p[i]->PrintName();

For the MultiPlayer Joe, however, you would rather print the nickname used while gaming, so you give MultiPlayer its own version of the print function:

void MultiPlayer::PrintName()
{
    cout << nickName << endl;
}
Sadly, the print loop does not recognize this function; it takes pointers to ComputerUsers and calls only their print method. The compiler needs to be told that we intend to override certain base class functions with a different implementation. This is done with the keyword virtual. Listing 8.22 shows the new and improved base class, which allows us to override the print method.

Listing 8.22 Virtual PrintName Function
class ComputerUser
{
public:
    // constructor
    ComputerUser() { favouriteOS = unknownOS; }
    ComputerUser(char *n, int OS)
    {
        strcpy(name, n);
        favouriteOS = OS;
    }
    // destructor
    virtual ~ComputerUser() { }   // virtual: instances are deleted via base pointers
    // interface
    void SetOS(int OS) { favouriteOS = OS; }
    void SetName(const char *s)
    {
        strncpy(name, s, sizeof(name) - 1);
        name[sizeof(name) - 1] = '\0';   // strncpy does not always terminate
    }
    virtual void PrintName() { cout << name << endl; }
private:
    // data
    int  favouriteOS;
    char name[50];
};
Now when you run the printing loop once more, the multiplayer gamer will be identified by his nickname. By adding the keyword virtual you tell the compiler that you want to use something called late binding for a specific function. This means that it is no longer known at compile time which function will actually be called at runtime; that depends on the kind of class being dealt with. Note that:

By specifying a function as virtual in the base class, it automatically becomes virtual for all derived classes.

When calling a virtual function via a pointer to a class instantiation, the implementation that is used is that of the latest derived class that implements the function for that class instantiation.

The second bullet describes something called polymorphism. The example print loop uses pointers to ComputerUsers. When the time comes to print the name of the Joe instantiation of MultiPlayer, the loop gets the PrintName() implemented by MultiPlayer. This means that at runtime the program can see it is handling a MultiPlayer, and not just a normal ComputerUser, and calls the appropriate function implementation. So how does this actually work? Two things are done at compile time to allow the program to make a runtime decision about which implementation of a virtual function to call. First, virtual tables (VTs) and virtual table pointers (VPTRs) are added to the generated code. Second, code to use the VTs and VPTRs at runtime is added.

Virtual Table (VT)

For every class that either contains a virtual function or inherits one, a VT is created. A VT contains pointers to function implementations; for each virtual function there is exactly one corresponding pointer, which points to the implementation that is valid for that class. In our previous example classes, the VTs of ComputerUser, Internetter, Internetter2, and MultiPlayer all contain a single function pointer that points to an implementation of PrintName. As neither Internetter nor Internetter2 defines its own PrintName(), their VT entries point to the base class implementation (ComputerUser::PrintName()). MultiPlayer does have its own PrintName() function, and thus its VT entry points to a different function address (MultiPlayer::PrintName()). The footprint implications are clear: the more virtual functions are defined within a class, the larger the VTs of all derived classes become. Note also that derived classes can define additional virtual functions; by doing so they again increase the VT size for their own derivatives. However, when different classes have the same VT (as Internetter and Internetter2 do in the examples), the compiler can decide to create only a single VT to be shared by those classes.

Virtual Table Pointer (VPTR)

When an instantiation of a class is made, it needs to be able to access the VT defined for its class. To this end, an extra member that points to the VT is added to the class. This member is called the virtual table pointer. For example, the instantiation Joe contains a VPTR that points to the VT of MultiPlayer; through it, Joe can access the PrintName() function of MultiPlayer. Similarly, Jane, an Internetter, contains a VPTR that points to the VT of Internetter.
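The footprint cost of the VPTR is easy to make visible. The following sketch (my example, not from the book; exact sizes vary per compiler, padding, and pointer width) prints the object sizes of a class with and without a virtual function:

    #include <iostream>

    struct PlainUser   { int favouriteOS; char name[50]; };
    struct VirtualUser { int favouriteOS; char name[50]; virtual void PrintName() { } };

    int main()
    {
        // VirtualUser carries a hidden VPTR, so it is one pointer
        // (plus possible padding) larger than PlainUser.
        std::cout << sizeof(PlainUser) << " versus " << sizeof(VirtualUser) << std::endl;
        return 0;
    }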
To call a virtual function at runtime, the VPTR of the instantiation needs to be dereferenced to find the class VT. Within this VT an index is used to find the pointer to the actual function implementation. This is where performance implications become clear. The following piece of pseudo-code shows how the MultiPlayer MightyJoe has his PrintName() function called:
// get vptr from class instance.
vtable *vptr = p[3]->vptr;

// get function pointer from vtable.
void (*functionPtr)() = vptr[PrintNameIndex];

// call the function.
functionPtr();
This may look like quite a bit of overhead, but depending on the system used it can translate into as few as two or three machine instructions, which makes the virtual function overhead as small as the overhead incurred when you add an extra parameter of a base type (an int, for example) to a function call. A program for testing calling overhead is included in the file 08Source02.cpp that can be found on the Web site. The result of the program is a list of relative timing values for:

Calling a non-virtual member function

Calling a virtual member function

Calling a non-virtual member function with an extra parameter

Calling a member function from a base class via a derived class

Calling a non-virtual member function via a pointer

Calling a global function
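08Source02.cpp is not reproduced here; the following is a minimal sketch of such a measurement, assuming a simple clock-based harness (be aware that modern compilers may inline or devirtualize these calls, flattening the differences):

    #include <ctime>
    #include <iostream>

    long counter = 0;   // global side effect keeps the calls from being optimized away

    struct Base
    {
        void normal()       { counter++; }
        virtual void virt() { counter++; }
    };

    int main()
    {
        Base b;
        Base *p = &b;                 // calls go through a pointer, as in the text
        const long N = 100000000L;

        std::clock_t t0 = std::clock();
        for (long i = 0; i < N; i++) p->normal();
        std::clock_t t1 = std::clock();
        for (long i = 0; i < N; i++) p->virt();
        std::clock_t t2 = std::clock();

        std::cout << "non-virtual: " << (t1 - t0)
                  << "  virtual: "   << (t2 - t1)
                  << "  (" << counter << " calls)" << std::endl;
        return 0;
    }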
Templates
Templates are used to describe a piece of functionality in a generic way. There are class templates and function templates: a function template describes a function, and a class template a class, without fixing the types they operate on. These generic descriptions become specific when the templates are actually used; the types with which a template is used determine the implementation. Listing 8.23 contains a function template.

Listing 8.23 Function Template "Count"
template <class Type, class Size>
Size Count(Type *array, Type object, Size size)
{
    Size counter = 0;

    for (Size i = 0; i < size; i++)
    {
        if (array[i] == object)
        {
            counter++;
        }
    }
    return counter;
}
This template can be used to count the number of occurrences of a specific object in an array of objects of the same type. To do this it receives a pointer to the array, an object to look for, and the size of the array. Note that the type of the object (and that of the array) has not been specified; the symbol Type has been used as a generic identifier. Likewise, the type for holding the size of the array has not been specified; the symbol Size has been used as a generic identifier. With this template, you can count the number of occurrences of a character in a character string but, just as easily, the number of occurrences of a number in a number sequence:
// Type = char, Size = short.
char a[] = "This is a test string";
short cnt = Count(a, 's', (short) strlen(a));
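The number-sequence counterpart looks much the same; a small sketch reusing the Count template (the values are arbitrary):

    // Type = int, Size = int.
    int seq[] = { 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5 };
    int cnt2  = Count(seq, 5, (int)(sizeof(seq) / sizeof(seq[0])));   // cnt2 == 3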
Listing 8.24 shows two ways of implementing a generic swap: one as a template and one as a normal function that works via void pointers.

Listing 8.24 Generic Swap With and Without a Template

// Generic swap using a template.
template <class T>
inline void Swap(T &x, T &y)
{
    T w = x;
    x = y;
    y = w;
}

// Generic swap without template.
void UniversalSwap(void **x, void **y)
{
    void *w = *x;
    *x = *y;
    *y = w;
}
Listing 8.24 contains two functions that swap around the values of their arguments. The template version (Swap()) will cause an implementation to be created for every type you want to be able to swap, whereas the non-template version (UniversalSwap()) generates a single implementation. The non-template version is harder to read and maintain, though, and you should consider whether a shallow copy via object pointers, as used here, will always suffice. Another possibility is capturing a generic piece of functionality within a template wrapper. This way you have all the benefits of a generic implementation, but its use becomes easier:
template <class UST>
inline void TSwap(UST &x, UST &y)
{
    UniversalSwap((void **)&x, (void **)&y);
}
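A hypothetical use of the wrapper; only the thin casting layer is instantiated per type, while the single UniversalSwap implementation does the actual work:

    int a = 1, b = 2;
    int *pa = &a, *pb = &b;
    TSwap(pa, pb);   // pa now points to b, pb to a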
Now the casting activities are captured within a template. For larger functions and classes, this kind of template wrapper makes perfect sense; the only piece of functionality that is instantiated for every used type is the small part that does the casting. When you think of what the template mechanism actually does (generating different instances of functionality at compile time), you might wonder whether you can somehow misuse this mechanism to save some valuable typing time. And indeed, evil things are possible if you are inventive enough. How about some compile-time calculation or loop unrolling? You have already seen the factorial function (n!) in this chapter. Listing 8.25 shows code which will generate a factorial answer at compile time.

Listing 8.25 Misusing Templates
template <int N>
class FactTemp
{
public:
    enum { retval = N * FactTemp<N-1>::retval };
};

template <>   // explicit specialization ends the recursion
class FactTemp<1>
{
public:
    enum { retval = 1 };
};

int main()
{
    long x = FactTemp<12>::retval;   // computed entirely at compile time
    return 0;
}
Of course, the speed of this FactTemp<> calculation is unprecedented, simply because it is not actually a function at all. This sort of template misuse is only possible when the compiler can determine all template instances at compile time. Listing 8.25 will probably no longer work when you ask it to produce the factorial of 69; the compiler will run into trouble (template instantiation depth is limited) and, even if it did not, the answer would not fit in a long. The interesting thing about using compile-time calculation is that you do not have to calculate a list of new values when, for instance, the resolution of the values you need changes. When you are using a list of cosine values and a newer version of your software suddenly needs one more decimal of precision, you can simply adjust the template to generate a different list.
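The other misuse mentioned above, loop unrolling, can be sketched in the same recursive style (my construction, not a listing from the book; it uses the same explicit-specialization trick as FactTemp):

    template <int N>
    struct Unroll
    {
        static inline void Fill(int *a)
        {
            a[N - 1] = N - 1;        // one assignment per instantiation...
            Unroll<N - 1>::Fill(a);  // ...expanded recursively at compile time
        }
    };

    template <>
    struct Unroll<0>
    {
        static inline void Fill(int *) { }   // recursion terminator
    };

A call such as Unroll<8>::Fill(buffer) expands into eight plain assignments, with no loop counter or branch at runtime (assuming the compiler inlines the calls).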
In conclusion:

Templates are not necessarily slower than "hand-carved" functionality.

Templates can save a lot of development time, as similar functionality for different types does not need to be written and tested separately.

Templates can obscure their actual footprint size.

Templates can be avoided by coding all different instances of functionality yourself, or by writing generic functionality using, for instance, pointers and references.

Look at the file 08Source03.cpp on the Web site for testing the speed of the different template techniques discussed in this section.
Summary
This chapter dealt with different techniques of calling and using functions. One way to avoid function overhead for a certain type of function is to use macros. Although the use of macros can certainly speed up code that uses small functions repetitiously, you should be wary of using them because of the possibility of unexpected behavior. Most often there are other ways to solve the problems; use macros only when you feel you really have to.

Another way of eliminating function call overhead is to inline functions. Use inlining only for short functions which are called frequently or are part of a time-critical piece of code. Sadly, however, for practically all compilers the inlining of functions is merely a suggestion made by the programmer.

One of the choices to make when implementing certain algorithms is whether to use recursion or iteration. Recursion is likely to incur a lot more function-calling overhead.

Part of the function-calling overhead is the placement of parameters on the stack. When choosing whether to pass parameters by reference or by value, you should consider the size of the parameters to be passed and the number of times the parameter is accessed.

For time-critical classes you should consider whether an implementation should use inheritance and virtual functions. Finally, when considering templated functions and classes, you should be aware of the fact that templates, apart from saving development time, also obscure their footprint size.
Memory Fragmentation
This section deals with memory fragmentation. When the free memory available to programs becomes fragmented, programs will experience slowdown and might even stop working altogether. This is obviously a serious problem, especially for programs that should run for long periods of time.
When a released block of memory is reused only partly to satisfy a new, smaller allocation, the remainder lives on as a smaller free block. Released blocks of memory can become increasingly small in this manner, making them less and less useful in the process. Memory fragmentation speeds up when programs allocate and release memory at a high rate, especially when different block sizes are used; the chance of a released block being used up completely by a new memory claim is smaller when blocks of varying sizes are involved. Think of linked lists and trees used for storing data; most of the time, a large number of small objects is used for containing links and temporary information.
With what has been discussed about memory fragmentation so far, the following list of characteristics can be compiled:

Memory fragmentation is often hidden. Although the total amount of free memory might be sufficiently large to accommodate a certain program, fragmentation can cause the blocks that the free memory consists of to be too small to be usable.

Memory fragmentation gets worse over time. As released blocks are often split into smaller pieces to satisfy further memory requests, fragmentation can easily become problematic.

Memory fragmentation can slow down or even halt programs. Depending on the OS used, different scenarios are executed when the OS is unable to allocate blocks of memory of sufficient size. OS responses can range from invoking some kind of garbage collection, to swapping memory to and from the hard drive, to simply placing programs in holding patterns until memory becomes available.

Memory fragmentation can happen when you least expect it. Even when memory blocks are kept as small as possible and are released as soon as they are no longer needed, fragmentation can still occur. This is simply because not only the total amount of memory used is important, but also the dynamic behavior of the programs using it.

To guarantee a certain typical performance, it is important to consider memory fragmentation during the different phases of software engineering (requirements, design, and implementation), especially when a program will use relatively large parts of the available memory resources, or when a program is meant to run over extended periods of time. Early in development it should become clear whether special efforts will be needed to combat fragmentation. When fragmentation poses an immediate danger and the target OS does not implement a memory management scheme that suits the program or system requirements, a dedicated memory manager most likely needs to be incorporated into the software. The following section discusses theories and techniques that can be used by memory management software.
Memory Management
Memory managers (MMs) can be added to programs or systems to improve on the memory management scheme implemented by the target OS. Typical improvements are faster memory access and allocation, less memory fragmentation, or both. Implementations of dedicated MMs range from a simple indirection scheme, such as a suballocator that resides between program and OS, to full-blown applications that completely take over the OS's role in memory management.
Suballocators
Suballocators are MMs that are placed between a piece of software and the operating system. Suballocators take their memory from, and (eventually) return it to, the OS; in between, they perform MM tasks for the software.

Quick and Dirty MM

The simplest form of a suballocator is an MM that only provides memory. The following example demonstrates this:
// DIRTY MM
unsigned char membuf[512000];
unsigned char *memptr = (unsigned char *)membuf;

inline void *DirtyMalloc(long size)
{
    // hand out the next slice of the preallocated block; nothing is ever
    // freed, and no alignment of the returned pointer is done.
    void *p = (void *)memptr;
    memptr += size;
    return(p);
}
When an application requests memory from this providing function, a slice of a preallocated block is returned. Although not a memory manager to take seriously, this DirtyMalloc does have some interesting characteristics: it is incredibly fast (MMs do not come any faster than this) and it will not give you any fragmentation. The downside is that memory is not actually managed at all; the whole initially claimed block can only be freed at once (on program termination). Still, there is some use for this kind of DirtyMalloc. Imagine you have to build a gigantic list of tiny blocks, sort it, print it, and then exit. In that case, you do not actually need a full memory manager, as all the memory is used only once and can be thrown away immediately after use. This code will be several times faster than a nicely implemented program that uses dynamic allocators such as malloc or new.

The Freelist MM

Dynamic memory is memory that your program can allocate and release at any time while it is running. It is allocated through the OS from a portion of the system memory resources called the heap. To obtain dynamic memory, your program has to invoke OS calls, and when that memory is no longer needed, it should be released, again through OS calls. When your program frequently uses dynamic memory allocation, slowdown will be incurred through the use of those OS calls. This is why it is interesting to consider not releasing all memory back to the OS immediately, but rather keeping some or all of it on hand within your program for later use. A list, which we will refer to as the freelist, keeps track of the free memory blocks. This technique is particularly effective when
memory blocks of the same size are allocated and released in quick succession. Think of structures used as elements of a list or array: memory that housed a deleted list element can be used again when a new list element needs to be created. This time, however, no OS calls are needed. Listing 9.1 presents the functionality you need to create a freelist. It is presented as a base class so it can be explained in a small separate listing and used in further listings in this section. The freelist is not exactly a normal base class, though; later in this section you will see why.

Listing 9.1 Freelist Basic Functionality
#include <stdio.h>
#include <stdlib.h>

// Base class definition.
class FreeListBase
{
    static FreeListBase* freelist;
    FreeListBase* next;
public:
    FreeListBase() { }
    ~FreeListBase() { }
    inline void* operator new(size_t);     // overloaded new()
    inline void  operator delete(void*);   // overloaded delete()
};

inline void* FreeListBase::operator new(size_t sz)
{
    if (freelist)
    {
        // get memory from the freelist if it is not empty.
        FreeListBase* p = freelist;
        freelist = freelist->next;
        return p;
    }
    return malloc(sz);   // call malloc() otherwise
}

inline void FreeListBase::operator delete(void* vp)
{
    FreeListBase* p = (FreeListBase*)vp;

    // link released memory into the freelist.
    p->next  = freelist;
    freelist = p;
}

// Set the freelist pointer to NULL.
FreeListBase* FreeListBase::freelist = NULL;
The class FreeListBase overloads two important C++ operators, namely new and delete. Because FreeListBase has its own implementation of these operators, the standard new and delete operators are not invoked when you create or destroy FreeListBase objects. So, for example, when you do the following:

FreeListBase *f = new FreeListBase;   // uses FreeListBase::operator new
delete f;                             // uses FreeListBase::operator delete
When a new FreeListBase object is created, the freelist is consulted. If the freelist is not empty, a pointer to the piece of memory at the front of the freelist is returned, and the freelist pointer is set to point at the next piece of memory in the list. If, however, the freelist is empty, memory is allocated through a call to malloc. The reason you want to use malloc here is that you need a piece of memory of a certain size (as provided by the parameter given in the call to new); this way, any class derived from FreeListBase (which will have a different size) still receives the correct amount of memory. In fact, the standard new operator itself also uses malloc functionality. For this to work, however, the freelist pointer has to be set to NULL at some point before it is first used, to indicate that the list is still empty. As the same freelist pointer should be accessible to all future instances of FreeListBase, it is a static member of the FreeListBase class. It therefore needs to be set globally, that is, outside the member functions of the class. This is what is done in the last line of Listing 9.1.
Consider the following two example classes, one deriving from FreeListBase and one not:

class TemperatureUsingFreelist : public FreeListBase
{
public:
    int ID;
    int maxTemp;
    int minTemp;
    int currentTemp;
    int average() { return (maxTemp + minTemp) / 2; }
};

class TemperatureNotUsingFreelist
{
public:
    int ID;
    int maxTemp;
    int minTemp;
    int currentTemp;
    int average() { return (maxTemp + minTemp) / 2; }
};
When an instance of TemperatureUsingFreelist is deleted, its freelist remembers the memory address of the instance and saves it for future use. When an instance of TemperatureNotUsingFreelist is deleted, however, its memory is returned to the OS; a newly created TemperatureNotUsingFreelist instance will therefore again claim memory through the standard new operator. You can time both techniques (freelist memory management versus OS memory management) with two loops: one repeatedly creates and destroys TemperatureUsingFreelist instances, the other repeatedly creates and destroys TemperatureNotUsingFreelist instances. You can use these loops in the timing setup introduced in Chapter 5, "Measuring Time and Complexity." Here is an example loop:
TemperatureUsingFreelist *t1;            // or TemperatureNotUsingFreelist

for (long i = 0; i < 1000000; i++)       // iteration count is arbitrary
{
    t1 = new TemperatureUsingFreelist;   // or TemperatureNotUsingFreelist
    delete t1;
    t1 = new TemperatureUsingFreelist;   // or TemperatureNotUsingFreelist
    delete t1;
}
Although the exact timing results will differ per system, the relations between timing the above loop with TemperatureUsingFreeList instances and timing the above loop with TemperatureNotUsingFreeList instances should look something like this:
TemperatureUsingFreelist       160
TemperatureNotUsingFreelist    990
This shows that using a freelist is significantly faster when memory is actively reused. Timing results for the two classes are equal when no instances are deleted within the timing loop, as the freelist is not used in that case. As you know, the freelist functionality was placed in a base class so it could be explained clearly in a separate listing. When you use the freelist technique in practice, however, it is best to incorporate its functionality directly into the class that uses it. This is important because the FreeListBase class contains a static pointer to the available memory blocks, and this same pointer is used by all classes derived from FreeListBase. When you inherit from FreeListBase, the freelist can therefore at some point contain blocks of different sizes, placed there by different classes. Clearly this is undesirable, as the new operator simply returns the first available block, regardless of its size. When you do decide you want a general freelist that is shared between different classes, simply augment the FreeListBase with a size parameter (which must be set in the delete operator) and a size check (which must be done in the new operator). Note, however, that the extra set and check add some overhead to your FreeListBase. This means you have a tradeoff between speed (using a freelist per class is fastest) and footprint (using a shared freelist makes the code smaller). Similarly, you could decide to make the delete operator more intelligent; it could, for example, start releasing memory back to the OS when the list reaches a certain size. An example of a freelist can be found in the companion file 09Source01.cpp.
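A minimal sketch of that shared-freelist augmentation, written as a standalone class (the names and layout are my illustrative assumptions, not code from the book or its companion files):

    #include <stdlib.h>

    class SharedFreeListBase
    {
        struct Link { Link *next; size_t size; };
        static Link *freelist;
    public:
        inline void *operator new(size_t sz)
        {
            // size check: only reuse a block that was released with this size.
            for (Link **walk = &freelist; *walk != NULL; walk = &(*walk)->next)
            {
                if ((*walk)->size == sz)
                {
                    Link *hit = *walk;
                    *walk = hit->next;
                    return hit;
                }
            }
            // nothing suitable on the list; blocks must be able to hold a Link.
            return malloc(sz < sizeof(Link) ? sizeof(Link) : sz);
        }
        inline void operator delete(void *vp, size_t sz)
        {
            // size set: remember how large this block is before listing it.
            Link *p = (Link *)vp;
            p->size = sz;
            p->next = freelist;
            freelist = p;
        }
    };
    SharedFreeListBase::Link *SharedFreeListBase::freelist = NULL;

Classes of different sizes can now derive from SharedFreeListBase and share one list; the size check prevents a block released by one class from being handed to another class of a different size.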
Simple Stack Memory Management

This section shows how a simple-but-effective MM scheme can be integrated into a class that provides stack functionality. Listing 9.3 shows a simple implementation of a stack class that relies on the OS for memory management.

Listing 9.3 Stack Without Dedicated Memory Management

#define MAXSIZE 100000

class Stack
{
    struct elem
    {
        elem *previous;
        char *name;
        int   size;
        int   id;
    };

public:
    Stack() { last = NULL; totalSize = 0; }   // initialization reconstructed

    // store {name, id}
    void push(const char *s, const int nr);

    // retrieve {name, id}
    void pop(char *s, int &nr);

private:
    elem *last;
    int   totalSize;
};

inline void Stack::push(const char *s, const int nr)
{
    // add new item to the top of the stack
    int newSize = strlen(s) + 1;

    if (totalSize + newSize > MAXSIZE)
    {
        cout << "Error, Stack Overflow!!" << endl;
    }
    else
    {
        elem *newElem     = new elem;
        newElem->name     = new char[newSize];
        strcpy(newElem->name, s);
        newElem->id       = nr;
        newElem->previous = last;
        newElem->size     = newSize;
        last       = newElem;
        totalSize += newSize;
    }
}

inline void Stack::pop(char *s, int &nr)
{
    // return item from the top of the stack and free it
    if (last != NULL)
    {
        elem *popped = last;
        strcpy(s, popped->name);
        nr   = popped->id;
        last = popped->previous;
        totalSize -= popped->size;
        delete [] popped->name;
        delete popped;
    }
    else
    {
        cout << "Error, Stack Underflow!!" << endl;
    }
}
This stack class can be used for storing strings and corresponding IDs. Storage is done in a last-in first-out (LIFO) fashion. Usage can be as follows:
Stack q;
char  name[NAME_SIZE];   // NAME_SIZE: maximum string length, defined elsewhere
int   id;

q.push("SomeName", 12);  // Push name + ID on stack
q.pop(name, id);         // Pop name + ID from the stack.
When this class is used intensively, memory will slowly start to fragment, as small blocks of memory are allocated and released repeatedly. Even though it might seem that the memory blocks used by this stack are found in one contiguous block of memory because the class behaves as a stack, this is not the case. Typically, different parts of a program will call this kind of storage class at different times. In between, other classes also use memory resources, which makes it extremely unlikely that allocated memory will be found in a contiguous block.
BigChunkStack MM
As the fragmentation caused by the stack class of the previous section is a direct result of the dynamic way it handles memory requests, the simplest remedy in this case is to allocate one large chunk of memory up front and divide it according to incoming requests. Listing 9.4 shows a class that does just that.

Listing 9.4 Stack with Dedicated Memory Management
class BigChunkStack
{
    struct elem
    {
        int  id;
        int  previousElemSize;
        int  nameSize;
        char name;   // first byte of the stored string; the rest follows in the pool
    };

public:
    BigChunkStack()
    {
        totalSize     = 0;
        emptyElemSize = sizeof(elem);
        lastElemSize  = 0;
    }

    // store {name, id}
    void push(const char *s, const int nr);

    // retrieve {name, id}
    void pop(char *s, int &nr);

private:
    char pool[MAXSIZE];   // the preallocated chunk all elements live in
    int  totalSize;
    int  emptyElemSize;
    int  lastElemSize;
};

inline void BigChunkStack::push(const char *s, const int nr)
{
    // add new item to the top of the stack
    int newStringSize = strlen(s) + 1;
    int totalElemSize = newStringSize + emptyElemSize;

    if (totalSize + totalElemSize > MAXSIZE)
    {
        cout << "Error, Stack Overflow!!" << endl;
    }
    else
    {
        elem *newElem = (elem*) (pool + totalSize);

        newElem->id               = nr;
        newElem->nameSize         = newStringSize;
        newElem->previousElemSize = lastElemSize;
        strcpy(&newElem->name, s);
        lastElemSize = totalElemSize;
        totalSize   += totalElemSize;
    }
}
inline void BigChunkStack::pop(char *s, int &nr)
{
    // return item from the top of the stack and free it
    if (totalSize != 0)
    {
        totalSize -= lastElemSize;
        elem *popped = (elem*) (pool + totalSize);
        lastElemSize = popped->previousElemSize;
        strcpy(s, &popped->name);
        nr = popped->id;
    }
    else
    {
        cout << "Error, Stack Underflow!!" << endl;
    }
}
The BigChunkStack class can be used in exactly the same manner as the Stack class in Listing 9.3. This means that for the user of the class, the dedicated memory management scheme of BigChunkStack is completely hidden, which is desirable. The dynamic behavior of the BigChunkStack is very different from that of the Stack class, however. At creation time, an instance of BigChunkStack contains a block of memory, called the pool, of size MAXSIZE. This memory pool is used to contain the stack elements that are pushed onto the BigChunkStack. The pool will not fragment and, when an instance of BigChunkStack is destroyed, the pool is returned to the OS in one piece. At first glance this technique might seem somewhat wasteful because of the large piece of memory that is in constant use. Keep in mind, however, that the fragmented memory caused by the use of the Stack class will not be very useful after a while in any case. When a program has a runtime requirement that it should always be able to place a certain number of elements onto a stack, the use of BigChunkStack helps guarantee this. This is especially interesting when a stack has a life span equal, or close, to that of the program using it. This kind of MM system even makes it possible to calculate beforehand the minimal amount of memory needed for a certain program to run correctly for an indefinite period of time. A downside to the BigChunkStack is that the pool size must be known at compile time or, with some minor changes to the code, at the time the BigChunkStack is created. For certain system requirements this need not be a problem, and it might even be advantageous. Later in this chapter, in the section "Resizable Data Structures," you will see how to make these kinds of implementations more dynamic. Chapter 12, "Optimizing IO," contains examples of combining the allocation of large memory blocks with maximizing IO throughput; this chapter focuses on MM theory.
Your MM will have to respond to memory allocation requests. When memory is requested from your MM, it has to choose which block to use from those it manages. Here are some block selection methods you might find interesting.

Best Fit

The Best Fit method entails searching for the smallest free block that is large enough to accommodate a certain memory request. The following example demonstrates the Best Fit method for a given MM with four memory blocks. The MM in this example is called on to accommodate two consecutive memory requests:
Initial Free Blocks:    A(160KB)  B(...)  C(...)  D(120KB)
Request 1) block size:  100KB
Best Fit:               D
New Free Blocks:        A(160KB)  B(...)  C(...)  D'(20KB)
Request 2) block size:  130KB
Best Fit:               A
New Free Blocks:        A'(30KB)  B(...)  C(...)  D'(20KB)
At the start of the preceding example, the MM has the free blocks of memory A through D; the size of each block is given after its name, between parentheses. The first request made to the example MM is for a block of 100KB. Only two of the MM's blocks are actually large enough to accommodate this request: A(160KB) and D(120KB). The Best Fit method dictates that block D is used, as its size is closest to that of the requested memory block. After Request 1 has been dealt with, the MM contains a new block called D'. Block D' represents what is left of the original block D after Request 1 has been serviced; its size equals 120KB - 100KB = 20KB. The second request made to the example MM is for a block of 130KB. This time block A is chosen, leaving a block A' of 160KB - 130KB = 30KB. The idea behind the Best Fit method is that when a free block of exactly the right size can be found, it is selected and no further fragmentation is incurred. In all other cases, the amount of memory left over from a selected block is as small as possible. This means that the Best Fit method reserves large blocks of free memory as much as possible for large memory requests. This principle is demonstrated in the example: had block A been chosen to accommodate the first request, the MM would not have had a block large enough to accommodate the second request. The danger of the Best Fit method is, of course, that the blocks left over after accommodating a memory request (like blocks D' and A') are too small to ever be used again.

Worst Fit

Unsurprisingly, the Worst Fit method does the opposite of the Best Fit method: it always uses the largest free block available to accommodate a memory request. The following example demonstrates the Worst Fit method for an MM. The example MM has four memory blocks, and it is called on to accommodate two consecutive memory requests:
Initial Free Blocks:    A(160KB)  B(...)  C(...)  D(120KB)
Request 1) block size:  70KB
Worst Fit:              A
New Free Blocks:        A'(90KB)  B(...)  C(...)  D(120KB)
Request 2) block size:  40KB
Worst Fit:              D
New Free Blocks:        A'(90KB)  B(...)  C(...)  D'(80KB)
At the start of the preceding example, the MM has free memory blocks A through D; again, the size of each block is given after its name. The first request made to the example MM is for a block of 70KB. The largest block available to the MM is A(160KB), and the Worst Fit method dictates that this block be chosen. After Request 1 has been dealt with, the MM has a block A' that is 160KB - 70KB = 90KB in size. The second request made to the MM is for a block of 40KB. This time block D is the largest block available, leaving a block D' of
120KB - 40KB = 80KB in size. The idea behind the Worst Fit method is that, when a request has been accommodated, the amount of memory left over from the selected block is as large as possible. This way, leftover blocks are likely to remain usable in the future. The danger of the Worst Fit method is that, after a while, no block in the freelist is large enough to accommodate a request for a large piece of memory.

First Fit

The First Fit method entails finding the first free block that is large enough to accommodate a certain memory request. This method allows memory requests to be dealt with expediently: as opposed to Best Fit and Worst Fit, not all blocks managed by the MM need to be examined during block selection. What exactly happens as far as fragmentation is concerned depends on how the elements are ordered in the freelist.

How to Sort/Merge Released Memory

When a block of memory is released and thus returned to your MM, a reference to the released memory needs to be inserted somewhere in your MM's freelist. The method used to determine where to insert a released block has a large impact on how your MM actually behaves. Consider an MM that orders the blocks of its freelist by size, with the largest block at the beginning of the list and the smallest at the end. When the First Fit method is used on a freelist ordered this way, the MM actually implements a Worst Fit algorithm, because it will always select the largest available block. Similarly, when an MM orders the blocks of its freelist by size but with the smallest block first, the First Fit method actually translates into a Best Fit method. Clearly it pays to spend some time on the sorting mechanism to make sure your MM performs as optimally as possible. Other sorting methods follow:

By memory address. This method sorts the freelist entries on the address of the first byte of each free block. It is particularly interesting when you want to design an MM that is able to recombine fragmented memory into larger blocks: whenever a reference to a freshly released block is inserted into the freelist, a quick check determines whether the preceding and/or following block in the freelist connects to the new block. When this is the case, several freelist entries can be replaced by a single new entry (with the address of the first byte of the first block and the size of the blocks combined), as sketched after this list.

By release time. This method keeps the released blocks in chronological order. The algorithm that implements this method is very straightforward: references to released blocks are always inserted at the beginning of the freelist, or perhaps always at the end. By placing new blocks at the beginning of the freelist, the MM can be very fast when blocks of memory of the same size are allocated and deleted in quick succession.
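Here is a minimal sketch of that recombination idea; the Header layout and function names are illustrative assumptions, not code from the book:

    #include <cstddef>

    // Each free block starts with a header describing the block.
    struct Header
    {
        Header      *next;   // next free block, in address order
        std::size_t  size;   // total size of this block, header included
    };

    Header *freelist = 0;    // kept sorted by ascending address

    void InsertFree(Header *blk)
    {
        // find the insertion point that keeps the list address-ordered.
        Header **walk = &freelist;
        while (*walk != 0 && *walk < blk)
            walk = &(*walk)->next;

        blk->next = *walk;
        *walk     = blk;

        // merge with the successor when the two blocks touch in memory.
        if (blk->next != 0 &&
            (char *)blk + blk->size == (char *)blk->next)
        {
            blk->size += blk->next->size;
            blk->next  = blk->next->next;
        }
        // merging with the predecessor works symmetrically (omitted here).
    }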
In this section, you have seen several well-known techniques for MM implementation. No rule requires you to choose between these techniques; depending on the requirements of your program, you may mix several techniques or create an entirely new technique that better suits your requirements. Another important thing to realize is that you can use as many, or as few, MMs in your programs as you see fit. You do not necessarily need one main MM to handle all memory requests. In fact, as you have seen in the Freelist and BigChunkStack examples, you could create a dedicated MM for every single class if that would help. You might even decide to use OS memory management for most memory requests and write your own MMs only for specific (critical) classes. The summary of this chapter contains a table with an overview of these methods.
Resizable Data Structures

The following sections show how you can use arrays and memory blocks with dynamic sizes. This means you can keep the advantages of normal arrays and memory blocks and still influence their size, enlarging or shrinking them when necessary.
Enlarging Arrays
This section shows how to increase the size of memory blocks. The BigChunkStack class from an earlier section of this chapter is taken as an example. You will remember that the BigChunkStack allocated a large block of memory during initialization. This block of memory was called a pool and objects that should be placed on the stack were copied into this pool. When the pool could not fit any more objects, an error was returned. This section shows you how you can expand the BigChunkStack with the ability to have its pool grow when it is in danger of filling up. The dynamic version of the BigChunkStack can be found in the companion file 09Source02.cpp on the Web site. In the rest of this section you will see what changes need to be made to the original BigChunkStack. The first thing you have to do to make the BigChunkStack resizable is to change the implementation of the pool. In the original BigChunkStack the pool was declared as an array with a fixed size: char pool[MAXSIZE]. For a dynamic pool, you should declare the pool variable as a pointer and the MAXSIZE constant as a member variable. You can set initial values for both in the constructor:
class BigChunkStack
{
public:
    BigChunkStack()
    {
        totalSize     = 0;
        MAXSIZE       = 0;
        emptyElemSize = sizeof(elem);
        lastElemSize  = 0;
    }

private:
    char *pool;    // pointer instead of constant array.
    int  MAXSIZE;  // variable that will keep track of pool size.

    // remaining members as before.
};
Listing 9.5 shows the implementation of a new BigChunkStack::push() function that detects when the pool should grow.

Listing 9.5 Resizing push Function for the BigChunkStack
inline void BigChunkStack::push(const char *s, const int nr)
{
    // add new item to the top of the stack
    int newStringSize = strlen(s) + 1;
    int totalElemSize = newStringSize + emptyElemSize;

    while (totalSize + totalElemSize > MAXSIZE)
    {
        if (!grow())
        {
            cout << "Error, Stack Overflow!!" << endl;
            return;
        }
    }

    elem *newElem = (elem*) (pool + totalSize);

    newElem->id               = nr;
    newElem->nameSize         = newStringSize;
    newElem->previousElemSize = lastElemSize;
    strcpy(&newElem->name, s);
    lastElemSize = totalElemSize;
    totalSize   += totalElemSize;
}
The check that originally determined whether another object could be pushed onto the stack now makes sure the pool is always sufficiently large. It does this by repeatedly calling a grow() function to enlarge the pool while the pool is too small to contain the new data. The grow() function is where all the interesting work is done. Listing 9.6 shows an implementation of BigChunkStack::grow() that creates or enlarges the pool.

Listing 9.6 grow Function for the BigChunkStack
// Create or enlarge the pool.
inline int BigChunkStack::grow()
{
    if (MAXSIZE == 0)
    {
        // create.
        pool = (char*) malloc(INITSIZE);
        if (pool == NULL)
            return 0;
        MAXSIZE = INITSIZE;
        return 1;
    }
    else
    {
        // enlarge; update MAXSIZE only after realloc has succeeded.
        char* tempPool = (char*) realloc(pool, MAXSIZE * 2);
        if (tempPool == NULL)
            return 0;
        MAXSIZE *= 2;
        pool = tempPool;
        return 1;
    }
}
Basically, there are two situations in which the pool can grow. First, the pool can grow when there is no pool at all; this is the initialization of the pool in the if body of the grow function. The constant INITSIZE is used for the initial size. Any value could be chosen here; your choice will depend on the requirements of the stack. Second, the pool can grow when the existing pool is too small to fit an object that is being pushed onto the stack; this is the reallocation in the else body of the grow function. Each time the pool needs to grow, it doubles in size: because reallocating memory blocks takes time, you do not want to do this too often. The function used to resize the pool is realloc, which takes as arguments a pointer to the first byte of the block to be resized and an integer denoting the new desired size. When realloc is successful, it returns a pointer to the reallocated block. When there is not enough memory to perform the reallocation, a NULL pointer is returned and the block of data is left unchanged. This is why you want to capture and test the returned pointer before assigning it to the pool variable; when realloc is unsuccessful, you do not want to overwrite the existing pool pointer. Simply return an error, and your program will still be able to pop elements from the stack later. For more information on the realloc call, consult your C++ documentation. The changes made in this section to the BigChunkStack result in a stack that grows when necessary. The following section shows additional changes that allow the stack to decrease in size when it becomes empty.
Shrinking Arrays
This section shows how to decrease the size of memory blocks. The BigChunkStack class is again taken as an example. The dynamic version of the BigChunkStack can be found in the companion file 09Source02.cpp on the Web site. In the rest of this section you will see what changes were made to the original BigChunkStack to allow it to shrink. The changes in this section and in the section "Enlarging Arrays" constitute all the changes needed to make the BigChunkStack completely dynamic. Listing 9.7 shows the implementation of a new BigChunkStack::pop() function that detects when the pool should shrink.

Listing 9.7 Resizing pop Function for the BigChunkStack
inline void BigChunkStack::pop(char *s, int &nr)
{
    // return item from the top of the stack and free it
    if (totalSize * 4 <= MAXSIZE)
        shrink();

    if (totalSize != 0)
    {
        totalSize -= lastElemSize;
        elem *popped = (elem*) (pool + totalSize);
        lastElemSize = popped->previousElemSize;
        strcpy(s, &popped->name);
        nr = popped->id;
    }
    else
    {
        cout << "Error, Stack Underflow!!" << endl;
    }
}
// Shrink the pool.
inline void BigChunkStack::shrink()
{
    if ((MAXSIZE / 2) >= INITSIZE)
    {
        // shrink.
        char* tempPool = (char*) realloc(pool, MAXSIZE / 2);
        if (tempPool == NULL)
            return;
        MAXSIZE /= 2;
        pool = tempPool;
    }
}
The shrink function decreases the pool by 50% as long as the resulting size is greater than or equal to the chosen INITSIZE. Note that the realloc function is used again; realloc can be used to decrease as well as increase memory block sizes.
Summary
Memory is called fragmented when the blocks of memory that are not in use are interspersed with blocks of memory that are in use. Memory fragmentation occurs when programs use memory dynamically; when blocks of memory are allocated and released dynamically, it is unlikely that released blocks will again form contiguous, larger memory blocks. Fragmentation has the following characteristics:

Memory fragmentation is often hidden.

Memory fragmentation gets worse over time.

Memory fragmentation can slow down or even halt programs.

Memory fragmentation can happen even when you are trying to avoid it.

Memory managers (MMs) can be added to programs or systems to improve on the memory management scheme implemented by the target OS. Typical improvements are faster memory access and allocation, less memory fragmentation, or both. MMs can use different strategies for selecting the block of memory that should accommodate a memory request, as shown here:

Best Fit    Large blocks are preserved as much as possible to accommodate large requests.
Worst Fit   The remainder of a used block is as large as possible.
First Fit   A block is selected as expediently as possible.
Comparing Blocks of Data
The Theory of Sorting Data
Sorting Techniques

Often, the programs you write will at some point need to perform actions on blocks of data. Think, for instance, of using numerical information for setting up graphics, or handling a textual database of client addresses. Your program might need to go through a large base of such addresses to produce a list of clients ordered by name, ZIP code, or age grouping. Perhaps you even need to generate specific subsets from a certain base of information, creating a list of clients with outstanding orders or bills, for instance. Of course, the larger the blocks of data your program needs to analyze, the more important the algorithms it uses become. This chapter highlights the performance and footprint implications of various algorithms that are often used for just these kinds of actions, pointing out when and why slowdown can occur. It also introduces new techniques that may help you speed up your programs. Throughout this chapter, pieces of example code are presented. The speed of these pieces of code is tested with a large data file named dbase.txt that can be found on the Web site. This data file contains 103,680 articles of a fictitious company; for each article, six attributes are specified. The total size of the data file is 4.8MB. Note that this chapter focuses solely on using blocks of data. File IO and storage of data in memory are not part of this chapter; they are dealt with in Chapters 11, "Storage Structures," and 12, "Optimizing IO."
Comparing Blocks of Data

In C++, two string objects can be compared simply with the == operator:

#include <string>
#include <iostream>
using namespace std;

int main()
{
    string s1("This is string one.");
    string s2("This is string two.");

    if (s1 == s2)
    {
        // do something.
    }
    return 0;
}
The reason it is possible to compare two objects of the string class is that the == operator has been overloaded to perform just this action. In C, the piece of code would look like this:
#include <string.h>

int main()
{
    char s1[] = "This is string one.";
    char s2[] = "This is string two.";

    if (strcmp(s1, s2) == 0)
    {
        // do something.
    }
    return 0;
}
In order to test the merits of these two implementations, a program is used that loads and searches through the dbase.txt file, counting the number of Small Table articles in the colors yellow, black, silver, and magenta. The program can be found in the file 10Source01.cpp on the Web site. The article name and color are the first attributes of every article in the dbase.txt file, so a quick compare with the beginning of each line in the file will determine whether the article is to be counted or skipped. The timing mechanism of the program does not time the loading of the database, only the time that is spent in the search algorithms. You can examine the exact implementation of the program at your leisure; for now, it is important only to mention that this straightforward comparison between using the C++ string class and using char pointers proves the char pointer method to be much faster than the string method (two to five times faster, depending on the system used). It seems that although the standard C++ string is very easy to use and offers a great many standard functions, it is not what you will want to use when you have to search through a large block of data.
A first way to speed up strcmp-based comparison is a macro that compares the first characters of both strings before calling strcmp:

#define macro_strcmp(s1, s2) \
    ((s1)[0] != (s2)[0] ? (s1)[0] - (s2)[0] : strcmp((s1), (s2)))

if (macro_strcmp(s1, s2) == 0)
{
    // do something
}
This macro is used to determine whether the first characters of both strings match. When the first characters are the same, the strcmp function is called as you would normally do; when the first characters differ, the strings are of course different and strcmp is not called at all. This saves a lot of time for all strings whose first characters differ. Note that this even works for empty strings. Only when different strings share the same first character does this new method introduce extra overhead, because of the extra check in the macro. Whether the macro is an improvement therefore depends on the type of strings in the database. For our example of finding Small Table articles in the dbase.txt file, this macro implementation is again faster (up to 34%). This is reflected in the results of the 10Source01.cpp program. Another way to speed up string comparison is to treat the character arrays as integer arrays. On most systems an integer is four times larger than a character, so four characters can be compared at the same time. When the strings have a length that is a multiple of the length of an integer, integer comparisons can greatly speed up string comparison. Listing 10.1 shows what an integer string compare function can look like.

Listing 10.1 Integer String Comparison
inline int int_strcmp(char *s1, char *s2, int len)
{
    // fall back to the macro when len is not a multiple of four.
    // note: this function returns a boolean equality flag, not a
    // strcmp-style ordering, so the fallback is normalized with == 0.
    if ((len & 0x03) != 0)
        return(macro_strcmp(s1, s2) == 0);

    for (int i = 0; *(int*)&s1[i] == *(int*)&s2[i];)
    {
        i += 4;                // compare four characters per iteration
        if (i >= len)
            return(true);      // match
    }
    return(false);             // mismatch
}
The int_strcmp function quickly checks whether the given string length is a multiple of four. If this is not the case, the previously discussed macro_strcmp function is called. For strings with a suitable length, characters are compared in groups of four by casting the character pointers to integer pointers and thus reading integers from the string. The longer the compared strings are, and the more alike they look, the more benefit this int_strcmp function reaps compared to the previous implementations given in this chapter. For our example of finding Small Table articles in the dbase.txt file, this integer implementation is again faster (up to 50%). This is reflected in the results of the 10Source01.cpp program. The reason the string lengths have to be a multiple of the integer size for this function to work is that any other size causes the function to compare beyond the end of a string. A string with a length of six, for instance, would be compared in two loop iterations. The first iteration compares the first four bytes, which is no problem. The second iteration compares the next four bytes, of which only two are part of the actual string; the third byte is the null that ends the string, but the fourth byte is not part of the string at all. Two identical six-character strings could be found to be different just because of the value of these fourth bytes. The int_strcmp function can of course be altered to check for different string sizes, but it is not very likely that it would still be faster in most string-compare cases.
This may make you wonder whether the int_strcmp function is all that useful. It is, in fact, because databases often contain records with fixed field sizes anyway. By simply choosing a field size that is a multiple of the integer size, the speedy int_strcmp becomes usable for those database records. This can be seen in the results of the 10Source01.cpp program. Table 10.1 shows the complete results of the 10Source01.cpp program. Note that different systems may yield different absolute results, but the relations between the speeds of the various techniques should be fairly constant.
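A sketch of that fixed-field idea; the field widths here are invented for illustration:

    // Record with field sizes padded to a multiple of sizeof(int) (assumed 4).
    struct ArticleRecord
    {
        char name[16];    // "Small Table" fits, and 16 is a multiple of 4
        char color[12];   // 12 is also a multiple of 4
    };

    // The fast integer compare can now be applied to whole fields:
    //   if (int_strcmp(a.name, b.name, 16)) { /* names are equal */ }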
Compare C++ string
Compare strcmp
Memory compare
Compare macro
String 'n' compare
Compare ints
(The timing values for each technique are produced by running 10Source01.cpp on the target system.)
Note that the standard memory compare function memcmp basically yields the same results as the standard string compare strcmp. When you look at the source of 10Source01.cpp, you will also notice that memcmp needs to be told how many characters to compare. For straightforward string comparison, strcmp is therefore the better choice, as it determines the string length without needing extra information (such as a call to the strlen function). When comparing blocks of data that are not null terminated, you can, of course, use the strncmp variant, which compares at most a given number of characters. Be advised, though, that string functions still stop when they encounter a null byte!
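To make the warning concrete, here is a small sketch (values chosen for illustration) with two four-byte blocks that differ only after an embedded null byte:

unsigned char a[4] = { 1, 0, 2, 3 };
unsigned char b[4] = { 1, 0, 9, 9 };

int r1 = memcmp(a, b, 4);                    // nonzero: all four bytes are compared.
int r2 = strncmp((char *)a, (char *)b, 4);   // 0: comparison stops at the null byte!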
When, rather than comparing two strings for equality, you want to find one string inside a larger text, the standard C library offers the strstr function:

char *s1;   // pointer to the text.
char s2[] = "substring to look for.";
char *result;

if ((result = strstr(s1, s2)) != NULL)
{
    // do something
}
When string s2 can be found in text s1, the function strstr returns a pointer to the first byte of the first occurrence. NULL is returned when string s2 cannot be found in text s1.
When all the characters of both strings match, the algorithm is, of course, finished and it returns a pointer to the first character of the substring in the text file. When, however, a mismatch of characters is found, things become interesting:
if (j != -1)
{
    // Mismatch; align and check for text end.
}
else
{
    // Match; stop, return pointer.
}
A lookup character is taken from the text file. This is the first character following the substring which was just used in the comparison. When this lookup character can also be found in the search string, the search string is aligned with the text file in such a way that these two characters match up. Then comparison starts over. If, however, the lookup character is not part of the search string, the search string is aligned as follows: The first character of the search string is aligned with the character following the lookup character. Here is a textual example of how the search string methin is found in a text file containing the text no thing as vague as something.
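Worked out step by step (positions are zero-based; the skip values used here are the ones listed later in this section), the search proceeds as follows:

Text: no thing as vague as something.

Pos  0: "no thi"  -- 'i' != 'n'; lookup char text[6] = 'n', skip 1
Pos  1: "o thin"  -- "thin" matches, then ' ' != 'e'; lookup char text[7] = 'g', skip 7
Pos  8: " as va"  -- 'a' != 'n'; lookup char text[14] = 'g', skip 7
Pos 15: "ue as "  -- ' ' != 'n'; lookup char text[21] = 's', skip 7
Pos 22: "omethi"  -- 'i' != 'n'; lookup char text[28] = 'n', skip 1
Pos 23: "methin"  -- all six characters match: found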
Here is what the search function itself looks like:

char *search(const char *textf, int len)
{
    // Check if searchstr is longer than textf or zero length.
    if (search_len > len || search_len == 0)
        return NULL;

    unsigned char *end = (unsigned char *) textf + len;
    int len2 = search_len - 1;
    int j;

    for (;;)
    {
        // Main loop for comparing strings.
        j = len2;
        while (textf[j] == searchstr[j] && --j >= 0)
            ;
        if (j != -1)
        {
            // Mismatch; align and check for end of textf.
            unsigned char *skipindex = (unsigned char *) textf + search_len;
            if (skipindex >= end)
                return NULL;              // string not in textf.
            textf += skips[*skipindex];
        }
        else
        {
            // Match found.
            return (char *) textf;
        }
    }
}
The main parts of this function have already been explained. Note, however, that there is an if at the beginning of the function to determine whether any search is needed at all. After this follow two calculations which prepare variables so they do not need to be recalculated during the main loop. The only other new bit consists of the four lines of code to be executed at the mismatch. These lines of code find the lookup character and use it to determine the alignment with a new substring. For this alignment the array skips is used. It contains the exact number of characters to skip for each possible lookup character:
- For every character that is not part of the search string, the skip value is equal to the number of characters in the search string plus one.
- For every character that is part of the search string, the skip value is its position in the search string counted from the end of the string.
For our example search string methin, the skips array will therefore have the following values:
m = 6
e = 5
t = 4
h = 3
i = 2
n = 1
(all other characters: 7, the search string length plus one)
The following init function fills the skips array accordingly:

#include <string.h>

unsigned char skips[256];
unsigned char search_len;
char *searchstr;

void init(const char *str)
{
    searchstr  = (char *) str;   // remember the pattern for search().
    search_len = strlen(str);

    int len2 = search_len + 1;
    // For chars not in searchstr.
    for (int i = 0; i < 256; i++)
        skips[i] = len2;                                // length + 1.
    // For chars in searchstr; with double chars only
    // the rightmost survives.
    for (int i = 0; i < search_len; i++)
        skips[(unsigned char) str[i]] = search_len - i; // position counted back to front.
}
char textfile[] = "no thing as vague as something.";

int main(void)
{
    init("methin");
    char *place = search(textfile, strlen(textfile));
    return 0;
}
Note that a search routine such as this one is not particularly suitable for simply determining whether two strings are equal, as was the subject of previous sections. The reason for this is that a searching algorithm introduces some overhead before getting to the actual string compare work. This Fast String search does not, however, stop on encountering a null byte. This means it can be used without problem on any kind of (non-string) data set. The nature of the text file and the search string determine how effective search string algorithms are. Dependencies are, for instance, the length of the search string, the length of the text file, the number of unique characters in the search string, and how often these characters appear in the text file. Generally it can be said, though, that the longer the search string is, and the less often its characters appear in the text file, the faster the fast search algorithm will be. The complete source of the Fast String search can be found in the companion file 10Source02.cpp, which compares searching a string with strstr to the Fast String search and a case-insensitive Fast String search (which is explained in the next section). As was noted at the beginning of this section, you can improve on this basic Fast String search by analyzing the data set you want to use it on. For instance, for text files of which the length is known (because it is fixed, or perhaps because memory has just been allocated for it) no length calculation is needed in the call to the search function.
Listing 10.4 shows an example of an init function which can be used for a case-insensitive Fast String search. Listing 10.4 Case-Insensitive Pattern Search Initialization Function
char          searchstr[256];    // Max pattern length is 256.
unsigned char alfa[256];         // Remember where alphabet chars are located.
unsigned char skipsInsens[256];  // Skips array.
unsigned char patlenInsens;      // Length of search pattern.
// Init function.
void initInsens(const char *str)
{
    int len1 = 0;

    // Get length and make a lowercase ('smalls') copy.
    unsigned char c = str[len1];
    while (c != '\0')
    {
        alfa[len1] = 0;
        if (c >= 'A' && c <= 'Z')
        {
            searchstr[len1] = c + 0x20;   // store the lowercase version.
            alfa[len1] = 1;
        }
        else
        {
            searchstr[len1] = c;
            if (c >= 'a' && c <= 'z')
                alfa[len1] = 1;
        }
        c = str[++len1];
    }

    int len2 = len1 + 1;
    // For chars not in the pattern.
    for (int i = 0; i < 256; i++)
        skipsInsens[i] = len2;            // length + 1.
    // For chars in the pattern; with double chars only
    // the rightmost survives.
    for (int i = 0; i < len1; i++)
    {
        skipsInsens[(unsigned char) searchstr[i]] = len1 - i;
        if (alfa[i])
            skipsInsens[(unsigned char) searchstr[i] - 0x20] = len1 - i;
    }

    patlenInsens = len1;
}
The changes to the search function are minor, apart from having to use the new names for the search string, the search string length, and the skips array (if you did actually decide to use new names, as was done in this section). The check on the character match, however, needs to be changed from the single comparison while (textf[j] == searchstr[j] && --j >= 0); to
while ( ((textf[j] == searchstr[j]) ||
         ((alfa[j]) && (textf[j] == (char) (searchstr[j] - 0x20))))
        && --j >= 0)
    ;
What this new check actually does is first check whether a character of the search string matches a character in the text file. If it does, the function continues as it normally would. If, however, a character does not match and it is an alphabetic character, another check is done with an uppercase version of the search string character. So for a mismatch on a nonalphabetic character (for example, '%'), the extra comparison is skipped immediately because the corresponding alfa entry is 0.
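Putting the pieces together, the complete case-insensitive routine could look as follows (a sketch: the name searchInsens is chosen here for illustration; the body is simply the earlier search function with the new global names and the new match check substituted):

char *searchInsens(const char *textf, int len)
{
    // Check if the pattern is longer than textf or zero length.
    if (patlenInsens > len || patlenInsens == 0)
        return NULL;

    unsigned char *end = (unsigned char *) textf + len;
    int len2 = patlenInsens - 1;
    int j;

    for (;;)
    {
        j = len2;
        while ( ((textf[j] == searchstr[j]) ||
                 ((alfa[j]) && (textf[j] == (char) (searchstr[j] - 0x20))))
                && --j >= 0)
            ;
        if (j != -1)
        {
            // Mismatch; align and check for end of textf.
            unsigned char *skipindex = (unsigned char *) textf + patlenInsens;
            if (skipindex >= end)
                return NULL;                  // pattern not in textf.
            textf += skipsInsens[*skipindex];
        }
        else
            return (char *) textf;            // match found.
    }
}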
Algorithm Issues
The Good, the Bad, and the Average

In comparing sorting algorithms, it is important to look not only at the worst-case performance but also at the best case and the average case. The average case is, of course, the one you will run into most of the time; the best case is what you will want to demo to a potential client. Worst case and average case can be very different; they can even differ by orders of magnitude, as you will see later with the quick sort algorithm. Still, this does not necessarily mean that you will discard an algorithm based solely on worst-case characteristics. The worst case might happen very seldom and may even be predictable or acceptable. Let's say on average a certain sorting algorithm takes two seconds to sort the office's financial data; it may be acceptable for this to take up to five minutes once or twice a month. A smart employee will let this inconvenience coincide with a caffeine break. You may even try to detect possible worst-case scenarios inside the sorting algorithm, as the section on sorting techniques shows, but to get this detection completely watertight will of course make your algorithm impossibly slow. (It would mean checking all the data.)

Algorithm Footprints

It is important to realize that the runtime footprint of a sorting algorithm is often more than the space that the actual code of the sorting algorithm takes up. One reason for this is that some sorting algorithms are what is called in-place, and others are not. A sorting algorithm is in-place when the data set stays the same size (or a little over that size) while it is being sorted. A sorting algorithm which is not in-place will need more space for the data set while it is sorting. For small databases this is unlikely to be a problem, but for large databases this can seriously constrain the maximum database size, because a percentage of the database size in memory needs to be reserved for sorting purposes. Another reason why the runtime sorting algorithm footprint can be larger than that of the code is the needed stack frame. When a sorting algorithm uses recursion (see Chapter 8, "Functions") the runtime footprint can suddenly boom during sorting.

Algorithm Stability

Another characteristic of sorting algorithms is something called stability. A sorting algorithm is called stable when it leaves elements with identical keys in the same order during sorting. This means that when an unsorted database contains two employees with the same name, these employees will emerge in the sorted database in the same order when employee name is the sorting key. Although this stability might seem obvious, not many sorting algorithms guarantee it. Of course, it depends on the requirements of your program whether stability is necessary or not.

Algorithm Simplicity

The final sorting algorithm characteristic of this section is the complexity of the implementation itself. So far you have read only about mathematical complexity, but this does not have to correspond directly to the complexity of the software that implements the sorting algorithm. Algorithms which are relatively easy to implement have the advantage that development and maintenance time is low and chances of bugs are fewer. These advantages can translate directly into saving money on development. When it is acceptable for a certain sorting part of a program to be slow, the choice can be made for an algorithm that is easy to implement and maintain instead of a brilliantly fast algorithm that only the implementer understands on the day that he writes the code.
In general, of course, the fastest sorting algorithms do tend to be the more complex ones to implement, as you will see in the section "Sorting Techniques."
Sorting Techniques
This section explains the theory behind some of the most popular sorting techniques used today. Each subsection highlights a specific sorting technique, telling you about its specific characteristics and giving an example of how you could implement the technique. You can use or modify these examples to suit your needs. At the end of the chapter an overview is given of all sorting techniques discussed, so you can quickly see which technique is useful for which situation. In order to keep the theoretical narrative of this section as close to the point as possible, arrays of integers are used as examples of the bases to be sorted. This way, the overhead of other implementation issues is minimized. Throughout the text, references are made to the array as the base of data to be sorted; the value of each element to be sorted is also its sorting key, as all elements are integers. Sorting techniques can, of course, be used to sort more complex elements, using one of their fields as a key. Examples of this are given in the sources of this chapter.
Insertion Sort
Insertion sort is a very basic sorting algorithm which is easy to understand and implement. Insertion sort looks at each element of the array and checks whether its value is smaller than that of the previous element. When this is the case, the element has to be moved to some place earlier in the array. In order to find the right place for the element, insertion sort moves back through the array until it encounters an element which is smaller. The element to be moved is placed after this smaller element. All elements between the old and the new place of the moved element are moved one place up. The following example demonstrates this:
30 20 40 10    20 < 30, so search backward for a place to insert 20
20 30 40 10    30 is moved forward, the front of the array is reached,
               and 20 is placed at the front of the array
10 20 30 40    10 < 40, so search backward for a place to insert 10;
               40, 30, and 20 are moved forward and 10 is placed at
               the front of the array
Note that the first array element can initially be considered sorted. Only when the second element is compared to the first can a decision about the place of either be made. Listing 10.5 shows an example of an implementation of insertion sort. Listing 10.5 Insertion Sort
void InsertionSort(long *data, long n)
{
    long i, j, item;

    // i divides the array into a sorted region, x < i,
    // and an unsorted region, x >= i.
    for (i = 1; i < n; i++)
    {
        // Select the item at the beginning of the as yet unsorted section.
        item = data[i];

        // If this element is greater than item, move it up one.
        for (j = i; (j > 0) && (data[j-1] > item); j--)
            data[j] = data[j-1];

        // Stopped when data[j-1] <= item, so put item at position j.
        data[j] = item;
    }
}
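A minimal call, matching the walkthrough above:

long values[] = { 30, 20, 40, 10 };
InsertionSort(values, 4);   // values becomes { 10, 20, 30, 40 }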
You can see in the sorting example that insertion sort is an O(n^2) algorithm; in a worst-case scenario each element will cause all the elements before it to be moved once; n*(n-1) is basically n*n, which is n^2. For insertion sort, the average case is also O(n^2). The pros of insertion sort are:
- Simple implementation
- High efficiency for small amounts of data (virtually no overhead)
- Small runtime footprint (in-place)
- Stable algorithm (original order of identical elements is preserved)
The con of insertion sort is:
- Low efficiency for normal to large amounts of data
Because of its small runtime footprint and high efficiency for small data sets, insertion sort is ideal for augmenting other sorting algorithms. After a size check, an overhead-heavy sorting algorithm can decide to invoke insertion sort for a data set which is too small to justify its overhead, as the sketch below shows. By doing so the last drop of performance can be squeezed out of a sorting algorithm.
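As a sketch of that idea (the threshold of 16 and the name OverheadHeavySort are assumptions made here; tune the cut-off by measuring on your own data):

const long kSmallArray = 16;   // hypothetical cut-off.

void OverheadHeavySort(long *data, long n);   // placeholder for, e.g., a quick sort.

void HybridSort(long *data, long n)
{
    if (n <= kSmallArray)
        InsertionSort(data, n);   // virtually no overhead for small n.
    else
        OverheadHeavySort(data, n);
}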
Bubble Sort
Bubble sort is another fairly straightforward sorting algorithm. Bubble sort compares the first two elements of the array, switches their places when they are in the wrong order, and then moves up one position in the array to compare the following two elements. It does this repeatedly until it has reached the end of the array, and then it goes back to the beginning of the array and starts over. This whole process is repeated until bubble sort can pass over the entire array without switching any elements. With bubble sort the elements bubble step by step to the place they are supposed to be. The following example demonstrates bubble sort:
First iteration:
30 10 40 20    10 < 30, so switch these elements
10 30 40 20    40 > 30, so leave these elements
10 30 20 40    20 < 40, so switch these elements

Second iteration:
10 30 20 40    30 > 10, so leave these elements
10 20 30 40    20 < 30, so switch these elements
10 20 30 40    20 < 30 and 30 < 40, so leave these elements
Listing 10.6 shows an implementation of bubble sort. Listing 10.6 Bubble Sort
void BubbleSort(unsigned long data[], unsigned long n)
{
    // Sort array data[] of n items.
    unsigned long i, j;
    bool changes = true;

    // Make max n passes through the array, or stop when done.
    for (i = 0; (i < n) && (changes == true); i++)
    {
        changes = false;
        for (j = 1; j < (n-i); j++)
        {
            // If items are out of order --> exchange them.
            if (data[j-1] > data[j])
            {
                unsigned long dummy = data[j-1];   // swap two items.
                data[j-1] = data[j];
                data[j]   = dummy;
                changes = true;
            }
        }
    }
}
This implementation consists of two loops. The inner loop runs through the data array and switches elements when needed. The outer loop controls the number of times the inner loop traverses the array. Note the double stop condition; in a worst-case scenario the inner loop is executed as many times as there are elements in the array. The outer loop can be preempted, however, when an iteration over the data array proves that the array is sortedthat is, no more elements were switched.
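The effect of the double stop condition is easy to see on input that is already sorted (a small sketch):

unsigned long sorted[] = { 1, 2, 3, 4 };
BubbleSort(sorted, 4);   // the inner loop makes one pass without switching,
                         // so the outer loop stops after that single pass.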
The pros of bubble sort are:
- Simple implementation
- High efficiency of memory usage (in-place)
- Stable algorithm
The con of bubble sort is:
- Low efficiency: O(n^2)
Without further optimizations, bubble sort is slower than insertion sort. It is presented in this chapter because of its easy implementation and its great popularity.
Shell Sort
Shell sort, invented by Donald L. Shell, is still fairly easy to implement. It is, however, much faster than the previous sorting techniques when larger bases of data are sorted. Shell sort is similar to insertion sort with the difference that it moves data over larger, and variable, spacings. This is good because, in general, elements of an unsorted array are found more than one place removed from where they should be. As you saw in the section "Insertion Sort," insertion sort uses a spacing of 1. This means that two successive elements are compared and, if needed, switched in place. With shell sort you could choose a spacing of 2, for instance. In that case, elements 1 and 3 are compared, instead of elements 1 and 2. After this, elements 4 and 2 are compared, instead of elements 2 and 3. Shell sort is usually used with a different spacing for each time the algorithm iterates the array. The spacing is initially chosen as large and becomes smaller with each iteration until the final iteration through the array is done with a spacing of 1 (this is a normal insertion sort). The idea behind this is that fewer iterations are needed to completely sort the array. The following example demonstrates a shell sort iteration with a spacing of 2, followed by an iteration with a spacing of 1:
Unsorted array: 40 70 20 30 60

First iteration with spacing = 2:
20 70 40 30 60    40 > 20, so switch
20 30 40 70 60    70 > 30, so switch
20 30 40 70 60    40 < 60, no switch

Second iteration with spacing = 1:
20 30 40 60 70    20, 30, 40 are OK; 70 > 60, so these are switched
Choosing the spacings for your shell sort algorithm is crucial to its performance. Luckily, a lot of research has already been done on different spacing strategies. The following spacing guidelines were proposed by Donald E. Knuth in his book The Art of Computer Programming: Starting with a spacing of 1, a following spacing is calculated as follows: three times the previous spacing plus one. This implies a fixed spacing range which is as follows: 1, (3*1+1) = 4, (3*4+1) = 13, (3*13+1) = 40, (3*40 +1) = 121, and so on. Knuth further proposes you choose as a starting spacing from this list the spacing which is two orders below the first spacing that is higher than the number of elements you want to sort. That may sound complicated but it's not that bad. The following example demonstrates:
Spacings: 1, 4, 13, 40, 121 (number of elements to sort: n = 100). Here, 121 is the first spacing in the spacings list which is higher than the number of elements to sort (n = 100). So, as a starting spacing you choose the spacing which is found two places below 121 in the spacings list. This is a spacing of 13. A shell sort of 100 elements will use the following order of spacings: 13, 4, 1. Similarly, a shell sort of 122 elements will use the following order of spacings: 40, 13, 4, 1. Listing 10.7 shows an implementation of shell sort, using Knuth's spacing advice. Listing 10.7 Shell Sort
// Signature assumed (lb = lower bound index, ub = upper bound index).
void ShellSort(long *data, long lb, long ub)
{
    long i, j, n, h, t;

    // Calculate largest stepsize.
    n = ub - lb + 1;
    h = 1;
    if (n < 14)
        h = 1;
    else
    {
        while (h < n)
            h = 3*h + 1;
        h /= 3;
        h /= 3;
    }

    while (h > 0)
    {
        // Sort-by-insertion with stepsize of h.
        for (i = lb + h; i <= ub; i++)
        {
            t = data[i];
            for (j = i-h; j >= lb && (data[j] > t); j -= h)
                data[j+h] = data[j];
            data[j+h] = t;
        }
        // Calculate next stepsize.
        h /= 3;
    }
}
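With the signature assumed above, a minimal call looks like this:

long data[] = { 40, 70, 20, 30, 60 };
ShellSort(data, 0, 4);   // sorts indexes 0..4, giving { 20, 30, 40, 60, 70 }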
The implementation starts with an if..else construct which determines the initial spacing, based on the concepts you have just seen. After this, a while loop handles the number of array iterations. Note the calculation of the new spacing at the end of the while: h /= 3. The for loop inside the while takes care of a single array iteration, sorting the array in a way similar to insertion sort. The average case sorting time for shell sort with Knuth spacing is O(n^1.25). The worst-case sorting time is O(n^1.5). This may seem like a marginal difference, but it is not: for 100 elements the two already relate as 316 to 1000. The pros of shell sort are:
- Simple to implement and still quite fast
- In-place (footprint efficiency)
The con of shell sort is:
- There are still faster methods
An important characteristic of shell sort is that its worst-case performance is better than that of most more advanced sorting methods (like quick sort, as you will see in a later section). For this reason, shell sort is popular for life-critical systems, where unexpected delays of O(n^2) can be dangerous. There are sorting methods that have a better worst-case performance than shell sort; refer to Table 10.2 for an overview of sorting method characteristics.
Heap Sort
As the name suggests, heap sort makes use of a structure called a heap. Before we define exactly what a heap is, here is a quick recap of tree terminology: A tree is a data structure in which elements are related to each other in parent and child relationships. Parents are called nodes. Each parent can have several children, but each child has at most one parent. There is one node that has no parent of its own; it is called the root. A child that is not a parent itself is called a leaf. Figure 10.1 shows a simple tree structure. Figure 10.1. A tree structure.
In Figure 10.1, the leaf F has a parent called C, which in turn is a child of the root node A. Each node in a tree is, in fact, the root of a subtree. This is why nodes are sometimes referred to as subtrees. The tree in the previous example has two subtrees; the subtree with root B and the subtree with root C. A tree is called binary and balanced when it adheres to the following constraints: The children are distributed evenly over the tree, with each node having exactly two children. The only possible exception to this is the last node of the tree; this node can have a single child when there are not enough children to go around. Only the lowest nodes in the tree can have leaves as children. The tree in Figure 10.1 is thus a balanced binary tree. A balanced binary tree is said to have the heap property when each node has a value that is at least as large as that of its children (it can, of course, be equal to one of its children). Figure 10.2 shows some examples of different kinds of trees. Figure 10.2. Different types of trees.
In the beginning of this section on sorting we promised you that all examples would sort arrays. In order to store a heap in an array, a mapping between the two data types has to be made. This mapping will of course also dictate how heap elements can be found in the array. The following two sections address these issues. Note that throughout this section the small letter n is used to indicate the number of elements in a given data structure.
Mapping a Heap on an Array
You can map a heap onto an array by simply reading it like a piece of text: from left to right, and top to bottom. Figure 10.3 shows this. Figure 10.3. Heap-to-array mapping.
The top part of Figure 10.3 shows a heap structure which contains the letters A through O. The lower part of Figure 10.3 shows the array that will represent the heap. In order to get from the heap to the array, you read the heap and place its elements in the array in the order of reading: The first line in the heap picture consists of a single letter, the letter A. It is placed at the beginning of the array. The second line in the heap picture contains the letters B and C. They are placed in the following two slots of the array, and so on. For extra clarity, the eventual array indexes accompany the elements in both pictures, so A will be stored in array[0], and O will be stored in array[14].
Note: Technically, the tree in the previous example is not a heap because the integer value of A is lower than that of B and C instead of higher. This setup is chosen, however, because mapping from the tree to the array is easier to understand this way.
Array Indexing
Navigating the heap is made easy by the mapping chosen earlier in the chapter.
Here are some important access points:
- The total number of nodes in the heap is n / 2; in the example, 15 / 2 = 7. Note that the number is rounded down, 7.5 becoming 7.
- The last node in the array is at (n / 2) - 1; in the example, (15 / 2) - 1 = 6, and array[6] = G.
- For any node array[x], the children of that node are found at array[2x+1] and array[2x+2]. In the example, array[6] = G, and its leaves are array[13] = N and array[14] = O.
To use a heap in a sorting algorithm, two different kinds of functions need to be performed. First, the initial tree needs to be transformed into a heap. This means moving around elements of the array until each node of the tree has a value higher than that of its (two) children. Second, the heap needs to be transformed into a sorted array. The following two sections address these issues.
Making an Initial Heap
Listing 10.8 shows the first half of what will become a HeapSort function. It processes an array of integers, called data. It addresses this array as if it represents a balanced binary tree which may or may not yet have the heap property. The code snippet reorders the array elements of data so that each node has a value that is larger than that of its children, effectively creating a heap. Listing 10.8 Heap Sort, Part I
int i, j, j2, k;
int tmp;

// Outer loop: considers all nodes, starting at the last one and
// working back to the root.
for (k = (n>>1) - 1; k >= 0; k--)
{
    // k points to the node under evaluation.
    // j2+1 and j2+2 point at the children of k.
    tmp = data[k];

    // Inner loop: for each changed node, recursively check the
    // child node which was responsible for the change.
    for (j = k; (j<<1) <= n-2; j = i)
    {
        // Find the largest child of node j.
        j2 = j << 1;
        if (j2+2 > n-1)             // only one child
            i = j2+1;
        else
        {
            if (data[j2+1] < data[j2+2])
                i = j2+2;
            else
                i = j2+1;
        }
        // i now points to the child with the highest value.
        if (tmp < data[i])
            data[j] = data[i];      // promote child
        else
            break;
    }
    data[j] = tmp;
}
How does Listing 10.8 create a heap? Basically, it looks at each node of the tree and makes it a heap; this means switching the largest child of a node with the node itself whenever the node is not the largest value of the three. The algorithm starts at the last node of the tree and works its way back to the root. This means, for the example heap, the following nodes are initially evaluated: G, F, E, D. These are the nodes in the last line of the heapthat is, nodes with leaves as children. The next node to be evaluated is node C. The children of C are, of course, nodes themselves. When a change is made to node C, one of the subtrees of C needs to be reevaluated. For instance, initially C has the children F and G. G is the highest of the three so elements G and C are switched. Subtree F is unaffected by this but the subtree that was G (and now is C) needs to be reevaluated. This needs to be done recursively, of course. This is exactly what the inner loop of the previous code snippet does; for each node, it checks for changes and recursively checks each subnode
responsible for the change. The outer loop simply backtracks through the array from the last node up to the root. When the algorithm finally arrives at the root node, the array represents a heap. Now all that remains is sorting that heap.
Sorting a Heap
Listing 10.9 shows the second half of what will become a HeapSort function. It sorts an array of integers, called data. It addresses this array as if it represents a heap. Listing 10.9 Heap Sort, Part II
// Remove roots continuously.
for (k = n-1; k > 0; k--)
{
    // k points to the (sorted) back of the array.
    // Switch the root with the element at the back.
    tmp = data[k];
    data[k] = data[0];

    // Make the array into a heap again.
    for (j = 0; (j<<1) <= k-2; j = i)
    {
        j2 = j << 1;
        if (j2+2 > k-1)             // only one child
            i = j2+1;
        else
        {
            if (data[j2+1] < data[j2+2])
                i = j2+2;
            else
                i = j2+1;
        }
        // i now points to the child with the highest value.
        if (tmp < data[i])
            data[j] = data[i];      // promote child
        else
            break;
    }
    data[j] = tmp;
}

How does Listing 10.9 sort a heap? It takes the root node of the heap, which is the largest number in the array, and switches it with the last element in the array; this is the final leaf of the final node. The ex-root element is now exactly where it needs to be. However, the new heap needs to be reevaluated. This is done in the same way as was done in the inner loop of the code snippet that made the initial heap. When the reevaluation of the heap has been done, the root element is again taken and switched with the final leaf of the final node. The sorted part of the array thus grows from the back of the array to the front.
void HeapSort(int *data, int n)
{
    // add Part I:  Listing 10.8.
    // add Part II: Listing 10.9.
}
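With the two listings pasted in, a minimal call looks like this (values chosen for illustration):

int values[] = { 9, 4, 8, 1, 7, 3 };
HeapSort(values, 6);   // values becomes { 1, 3, 4, 7, 8, 9 }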
Quick Sort
Quick sort, invented by C. A. R. Hoare, is an O(nlog2 n) algorithm on average. It is less complex to implement than heap sort, which is why it can be slightly faster; however, its worst-case performance is O(n^2). Precautions can be taken to make worst-case performance as unlikely as possible. Because of its high speed, quick sort has been added to the standard C library. Its name in the library is qsort(). When calling qsort you need to provide your own compare function, which qsort will use to determine the larger of two array elements as it sorts through your data. This makes it possible to use the standard implementation for a wide variety of types. The compare function must, of course, conform to the rules specified for arguments and the return value, or qsort will not work. Listing 10.10 shows an example of the use of qsort with a self-made compare function that allows qsort to sort arrays of longs. Listing 10.10 Using the Standard QSort
#include <stdlib.h>   // for qsort.

int compare(const void *arg1, const void *arg2)
{
    /* Return value:
       < 0   elem1 less than elem2
         0   elem1 == elem2
       > 0   elem1 greater than elem2 */
    return (*(long *) arg1 - *(long *) arg2);
}

int main()
{
    long longArray[] = { 100, 1001, 24, 317 };   // sample data (assumed).
    size_t n = 4;

    qsort((void *) longArray, n, sizeof(long), compare);
    return 0;
}
In this example, the first qsort argument is longArray, which is the array of longs that is to be sorted. The second argument is n, which is the number of elements in the array. The third argument is the size of each array element, which is calculated with the expression sizeof(long). The final argument is a pointer to the compare function. Note that the compare function receives two pointers to array elements. Because you, as a user of the qsort function, know which type of array elements you are sorting, you can do a cast to that type in the compare routine. The valid range of return values is included as a comment in the example. When you want to use a quick sort for only a specific type of data structure, it can be wise to implement your own version of the algorithm. At the very least, you will be able to eliminate the call to the compare function by directly inserting the compare code into the quick sort routine itself. You may be able to do more, depending on your data set. Here is the theory behind quick sort. The quick sort algorithm sorts a data array in two phases: the partition phase and the sort phase. In the partition phase, the data array is divided into two around what is called a pivot element. The other array elements are pivoted around this pivot element. This means that all the elements which are smaller than the pivot are put together in one part of the array, and all the elements which are larger than the pivot are placed on the other side of the pivot. Here is an example:
1 7 3 9 6 4 8 2 5    pivot: array[4] = 6
1 5 3 9 6 4 8 2 7    7 > 6 and 5 < 6, switch
1 5 3 2 6 4 8 9 7    9 > 6 and 2 < 6, switch
1 5 3 2 4 6 8 9 7    4 < 6, switch
Starting from the two outermost boundaries of the array (the first and the last elements), elements are compared with the pivot value. Values which are larger than the pivot and are on its left side are switched with elements which are smaller than the pivot on the right side. Note that the number of mismatches on the left side does not have to coincide with the number of mismatches on the right side. Array evaluation simply continues from both sides until they meet in the middle. How far this first partition step actually helps you along the way depends on the value of the pivot element which is chosen. The maximum gains are had when the pivot element just happens to be the median of all values in the array. In the preceding example, the element in the middle of the array was chosen but any other element would also have worked as it is the value of the pivot that counts
and not its position in the array. There is a danger in choosing the first or the last element in an array as the pivot element, however. When you use quick sort on an array which is already sorted, taking the first or last element as a pivot creates the largest overhead. After the partition phase has been completed, the sort phase commences. The sort phase consists of two recursive calls to the quick sort function itself. The first call will execute another quick sort on the left half of the partitioned array, and the second call will execute another quick sort on the right half of the array. This way, each half of the array is pivoted once more around a new pivot, after which two more recursive calls can be made. Recursion stops when a quick sort invocation receives an array with size 1. It becomes clear now why choosing the first (or last) element of a sorted array as a pivot creates an unfortunate situation; each recursive call will split the original array into a single element and the rest of the array. This means there will be as many recursive calls as there are elements in the array, resulting in a worst case scenario of O(n^2). Here is a functional description of the quick sort algorithm:
int Partition(int data[], int n)
{
    // Select a pivot element.
    // Re-order data[0] till data[n-1] until the following is true:
    //   data[0] till data[pivot-1]   <= data[pivot]  and
    //   data[pivot+1] till data[n-1] >= data[pivot]

    return pivot;   // position of pivot element AFTER re-ordering.
}

void QuickSort(int data[], int n)
{
    if (n > 1)
    {
        int pivot = Partition(data, n);
        if (n < 3)
            return;
        QuickSort(data, pivot);
        QuickSort(data + pivot + 1, n - pivot - 1);
    }
}
Note that actual sorting can be sped up by selecting a good pivot value. However, remember that the more complicated you make the method of determining the pivot, the slower the partitioning phase will be, unless, of course, you have specific information on the array you want to sort. Note also the two if statements. The first if statement stops execution for arrays with one element or fewer; there is just no sorting to do with fewer than two elements. The second if statement stops on an array of two elements or fewer. This is possible because after partitioning an array of two elements, no more sorting needs to be done; the two elements will be in the correct order. There is another optimization in this example. Many quick sort implementations used today take three parameters: the data array and an upper and lower bound variable to identify which part of the array is to be sorted. By passing an updated array beginning, and the total number of elements that are left in the part of the array to sort, one parameter has been eliminated. This is interesting because calls to quick sort are made recursively, and the less information each recursion places on the stack, the smaller the runtime footprint will be. This brings us to another optimization, which is slightly more complicated. The two recursive calls that quick sort makes to itself can be translated into one recursive call and one iteration. Functionally, the body of the quick sort function would then look like this:
void QuickSort(int data[], int n)
{
    while (n > 1)
    {
        int pivot = Partition(data, n);
        if (n < 3)
            return;

        // Call quick sort on the second part.
        QuickSort(data + pivot + 1, n - pivot - 1);

        // All that remains to be sorted is the first part;
        // adapt n and restart.
        n = pivot;   // n is upperbound + 1!
    }
}
Of course the Partition functionality does not have to be placed in a separate routine. By placing it in the quick sort itself, another function call is eliminated. Also, you can tweak which part of the partitioned array you select for the recursive call and which part for the iteration. Listing 10.11 shows a quick sort function which takes all the mentioned optimizations into account. Listing 10.11 Fast Quick Sort
template <class T>
inline void FQuickSort(T *data, long n)
{
    T pivot, tempElem;
    long left, right, tempIndex;

    while (n > 1)
    {
        right = n-1;
        left  = 1;

        // Switch the pivot with the first array element.
        tempIndex = right / 2;
        pivot = data[tempIndex];
        data[tempIndex] = data[0];

        // Partition the array.
        while (1)
        {
            while (pivot > data[left] && left < right)
                left++;
            while (pivot <= data[right] && right >= left)
                right--;
            if (left >= right)
                break;

            // Switch two array elements.
            tempElem = data[left];
            data[left] = data[right];
            data[right] = tempElem;
            right--;
            left++;
        }

        // Switch pivot back to where it belongs.
        data[0] = data[right];
        data[right] = pivot;

        // Sort phase.
        if (n < 3)
            return;

        tempIndex = n - right - 1;
        if (right > tempIndex)
        {
            // First part is largest.
            FQuickSort(data + right + 1, n - right - 1);
            n = right;   // n is upperbound + 1!
        }
        else
        {
            // Second part is largest.
            FQuickSort(data, right);
            data += right + 1;
            n = tempIndex;
        }
    }
}
The quick sort implementation of Listing 10.11 is presented as a template so it can be called for arrays of different types. Calling examples are given after a brief walkthrough. There are only two new items introduced in the FQuickSort function. First, the pivot is chosen as the middle element in the array. To make the partition routine less complex, the pivot is switched with the first array element before partitioning. Second, the largest of the partitioned array parts is sent into iteration, the smallest into recursion.
int   integerArray[] = { 100, 1001, 24, 317, 314, 2000, 2009, 7009 };
short shortArray[]   = { 10, 20, 11, 317, 314, 510, 12, 709 };

FQuickSort(integerArray, 8);
FQuickSort(shortArray, 8);
The advantage of quick sort is:
- High efficiency for the average case
The cons of quick sort are:
- Very low efficiency for the worst case
- Large runtime footprint
- Unstable algorithm
Quick sort averages O(nlog2 n) with a worst case of O(n^2). Note that although quick sort does the sorting of its data in place, it is not, in fact, an in-place algorithm, as its stack usage can run up pretty high on large data sets. As promised, the accompanying sources on the Web site contain sorting examples which sort more than just arrays of integers; comparisons are made between sorting with a quick sort, a radix sort, and the standard qsort:
10source04.cpp   Sorts arrays of pointers to strings
10source05.cpp   Sorts records by using operator overloading
At the end of the following section, more will be said about these sources.
Radix Sort
Radix sort takes a whole different approach to sorting. Radix sort no longer compares sortable items with each other; the essence of radix sort is that it sorts according to the value of subelements. For instance, when the elements to sort are of type integer, radix sort will iterate the array four times, once for each byte of an integer. During the first iteration, it sorts the elements of the array according to the value of the least significant byte, and it ends up sorting elements according to the value of the most significant byte during the fourth iteration. Here is a numeric example:
Unsorted    Iteration I    Iteration II    Iteration III    Iteration IV
1005        2204           2204            1005             1005
2204        1005           1005            2037             1508
2037        2037           1508            2204             2037
1508        1508           2037            1508             2204
The first column of the preceding example shows a random array. The second column shows what the array looks like after the first iteration. During this first iteration, the numbers are ordered according to the value of the least significant digit. Radix sort thus focuses on the following values during the first iteration: 5, 4, 7, 8. It places them in the order 4, 5, 7, 8. The second iteration looks only at the following digit, recognizing the values 0, 0, 3, 0. The third iteration recognizes the values 0, 0, 2, 5 and the fourth iteration recognizes the values 1, 2, 2, 1. As this example demonstrates, the number of iterations that radix sort needs to make in order to sort an array of elements depends not on the number of elements to be sorted, but on the size of a single element. This means that radix sort is an O(m*n) algorithm, where m is the number of subelements. Sorting a data type of two bytes can be done in two iterations. A radix sort implementation needs enough space for a second array. This is because radix loop iterations copy all the elements from one array to the other, placing the elements at a different position in the second array. However, a following iteration will copy all the elements back to the first array, so only two arrays are ever needed. But how does radix sort know where in the second array to place the copied elements? In order to do this, it needs to divide the new array into sections. There should be a section for every possible value of the subelement which it focuses on. In Figure 10.4, the array elements consist of two subelements: the first and second digit. Two iterations are needed to sort this kind of element. The first iteration will focus only on the least-significant digit. Before this iteration starts, the occurrences of each value of the least-significant digit are counted. The results are two occurrences of the value 0, two occurrences of the value 1, and a single occurrence of the value 6. With this information the destination array can be divided into subsections. An index is created for each subsection, pointing to the first available space in that subsection. Now the source array is iterated over and each source value is copied to the appropriate subsection in the destination array. Note that when a value is copied to a subsection, the index of that subsection must be increased by 1. This way an index will always point to the next available space.
The second iteration will focus only on the most significant digit. It finds the following occurrences of that digit: a single occurrence of the value 2, two occurrences of value 3, and another two occurrences of value 4. Three subsection indexes are set accordingly and the elements are copied back to the original array (which has become the destination for this iteration). Note that radix sort must always traverse the source array from the least-significant digit or element to the most-significant digit or element, which ensures that the ordering created by a previous iteration is maintained. For example, in the first iteration of the preceding example, the values { 46, 41} are placed in the order { 41, 46} in relation to each other. In the second iteration, these values come together in the same subsection. If this second iteration had started at the back of the destination array, these values would have been switched again, because the value 46 would have been found and copied before the value 41 would be found. This would cause an unsorted output. This is also where endian-ness starts to play a role. On Intel machines, which use little-endian byte order, the number 0x01020304 is placed in memory as follows: 0x04030201. On big-endian machines, such as 68xx architectures and MIPS, this number is placed in memory as: 0x01020304. Simply put, this means that finding the least and most significant bytes may have to be done differently on different machines. With the cpu_info() function found in the booktools, you can determine the endian-ness of a target machine. Note that endian-ness only plays a role for base types. For strings, for instance, the endian-ness of a target machine is of no consequence because strings are placed in memory byte for byte. Listing 10.12 shows an implementation of radix sort. Listing 10.12 Radix Sort, Part I
template <class T>
inline void radix(int byte, unsigned long n, T *source, T *dest)
{
    unsigned long count[256], i, c, s = 0;
    unsigned char *q;
    int size = sizeof(T);

    // Create array of subsection indexes.
    memset(&count[0], 0x0, 256*sizeof(unsigned long));

    // Count occurrence of every byte value in T.
    q = (unsigned char *) source;
    for (i = 0; i < n; i++)
    {
        count[q[byte]]++;
        q += size;
    }

    // Create indexes from counters.
    for (i = 0; i < 256; i++)
    {
        c = count[i];
        count[i] = s;
        s += c;
    }

    // Copy elements from the source array to the destination array,
    // placing each element in the next free slot of its subsection
    // (this closing loop follows the subsection-index description
    // given in the text).
    q = (unsigned char *) source;
    for (i = 0; i < n; i++)
    {
        dest[count[q[byte]]++] = source[i];
        q += size;
    }
}
#include <stdlib.h>   // for malloc and free.

#define L_ENDIAN      // or don't define L_ENDIAN.

template <class T>
void RadixSort(T *data, unsigned long n)
{
    int i, elemsize = sizeof(T);
    T *temp = (T *) malloc(n * elemsize);

    if (temp != NULL)
    {
#ifdef L_ENDIAN
        for (i = 0; i < elemsize-1; i += 2)
        {
            radix(i,   n, data, temp);
            radix(i+1, n, temp, data);
        }
        if (elemsize & 1)   // odd elemsize -> one more to go?
        {
            radix(elemsize-1, n, data, temp);
            memcpy(data, temp, n * elemsize);
        }
#else
        for (i = elemsize-1; i > 0; i -= 2)
        {
            radix(i,   n, data, temp);
            radix(i-1, n, temp, data);
        }
        if (elemsize & 1)   // odd elemsize -> one more to go?
        {
            radix(0, n, data, temp);
            memcpy(data, temp, n * elemsize);
        }
#endif
        free(temp);
    }
}
As promised, here is an example of the usage of this radix sort implementation:
int   integerArray[] = { 100, 1001, 24, 317, 314, 2000, 2009, 7009 };
short shortArray[]   = { 10, 20, 11, 317, 314, 510, 12, 709 };

RadixSort(integerArray, 8);
RadixSort(shortArray, 8);
You should note that for a sufficiently large number of records and a sufficiently small key length, radix sort is faster than quick sort.
Merge Sort
Merge sort is not really a sorting algorithm; rather, it is an algorithm that merges different sorted databases together. Its obvious use is that of merging two previously separate databases of the same type. It can, however, also be used to sort a database that does not fit into memory in its entirety. This means that merge sort can help overcome serious footprint issues. In order to sort a database that does not fit into memory, you first have to divide it up into smaller chunks that do fit into memory. Let's say you keep your (un)sorted data on some kind of external disk. Each chunk needs to be loaded from the disk, sorted (using whatever algorithm you deem best), and stored back on the disk. Now you have a collection of sorted chunks. This is where merge sort comes into the picture. Merge sort takes two or more of these sorted chunks and merges them into a single chunk, which is also sorted. Merge sort keeps merging chunks of sorted data until a single, sorted output is created. A single merge has to reserve memory on the disk for its output file. This output file will be as large as the sum of the sizes of the selected chunks that will be merged. The space that is available on the disk thus determines which, and how many, chunks to select for a single merge step. The merging itself consists of taking the first elements of each of the selected chunks and repeatedly placing the smallest of those in the output file, until all elements of the selected chunks have been placed in the output file. At this point, the selected chunks can be deleted, creating space on the disk. This space can be used for the following merge step. Figure 10.5 shows a possible two-step merging of selected chunks. Figure 10.5. Two-step merge sort.
In Figure 10.5, there are four sorted chunks that need to be merged into a single database (or output file): A, B, C, and D. The first
merge creates intermediate output files containing all elements of A+B and C+D. The second merge creates a single output file containing all elements of A+B+C+D.
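A minimal sketch (not from the book's sources) of the merge step just described, merging two sorted in-memory chunks into one sorted output; on-disk merging works the same way, reading and writing elements through files instead of arrays:

#include <stddef.h>

void Merge(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;

    // Repeatedly take the smaller of the two chunk heads;
    // <= keeps the merge stable.
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];

    // Copy whatever remains of the chunk that is not yet exhausted.
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}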
Summary
Table 10.2 summarizes the characteristics of the sorting methods discussed in this chapter. Table 10.2. Sorting Method Characteristics

Sorting Method   Worst Case    Stable   In Place   Remark
Insertion        O(n^2)        Yes      Yes        Good for few items and teaching purposes
Bubble           O(n^2)        Yes      Yes        Good for teaching purposes and easily maintainable code
Shell            O(n^1.5)      No       Yes        Good for general purpose and time-critical systems
Heap sort        O(nlog2 n)    No       Yes        Best for time-critical systems and large lists
Quick sort       O(n^2)        No       No         Best for many strings
Radix sort       O(m*n)        No       No         Perfect for large lists with numbers
Note that often interesting results can be gained by combining sorting algorithms. Think, for instance, of a quick sort implementation that calls insertion sort when it receives an array with a small number of elements, or a sorting algorithm that uses radix sort for small keys and quick sort for longer keys. The best mix-and-match for you to use depends, of course, on the data set you are using and the way in which you want to use it. The best advice here is to experiment.
Arrays
One of the most straightforward data structures is the array. Using one-dimensional arrays for storing static data sets or data sets with a predictable size causes few problems. And it is footprint-efficient too; no extra memory is needed for storing pointers to other elements of the data set and fragmentation does not occur during the addition and deletion of elements. Using arrays for dynamic data sets is a different matter, however, as you have already seen in Chapter 9, "Efficient Memory Management." This section, therefore, looks at how efficient arrays actually are when it comes to inserting, deleting, searching, traversing, and sorting data sets. The file 11Source01.cpp that can be found on the Web site compares the performance of arrays against that of other data structures. There is a section at the end of this chapter that presents the timing results of these tests.
Inserting
Adding elements to arrays can be tricky at times. Basically, there are two situations in which to add elements to an array: when there is still enough space left in the array and when there is not. The following sections take a closer look at the performance and footprint implications of these two situations.
Inserting Elements into an Array That Is Not Full
When an array is not full yet, all elements found after the place where the new element is to be inserted must be moved up one position. In the worst case, the new element must be placed at the front of the array (array[0]) and all other elements need to be moved. This results in an O(n) operation. In the best case, the new element must be placed at the back of the array, resulting in an O(1) operation. This means that adding presorted elements to an array is highly efficient. Furthermore, when elements are read from a sorted storage (a database file, for instance) the whole file can be read as a single block and placed into memory. Chapter 12, "Optimizing
IO," which deals with efficient IO usage, will come back to this.
Inserting Elements into an Array That Is Full
When you need to add an element to an array that is full, you are often out of luck. It is usually not possible to reserve more memory at the back of an existing block or array. This means that to increase the number of elements that will fit in the array, a new array probably needs to be allocated. All the elements from the old array need to be copied to this new array, and the new element itself needs to be added to the new array as well, making this at least an O(n) operation. This is a lot of work and uses quite a lot of memory, because for a short time both the old and the new array exist simultaneously. When this is not possible because of footprint constraints, parts of the original array need to be transferred to some external storage and back again in order to successfully create the new array. It seems unlikely that you will want to do this for each element that needs to be added to a full array so, in practice, a full array will be extended with a number of empty spaces. An example of this was given in Chapter 10, "Blocks of Data." It is worth thinking about whether or not you can predict the number of elements that will be, or are likely to be, added.
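A sketch of the element-shifting insert described at the start of this section (the function name and the use of int elements are choices made here, not the book's sources):

#include <string.h>   // for memmove.

// Insert value at position pos in an array that currently holds count
// elements and has spare capacity for at least one more element.
void InsertAt(int *array, long count, long pos, int value)
{
    // Move array[pos] .. array[count-1] up one slot; O(count - pos).
    memmove(&array[pos + 1], &array[pos], (count - pos) * sizeof(int));
    array[pos] = value;
}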
Deleting
Deleting elements from an array is very similar to inserting elements, as far as performance and footprint issues are concerned. Keeping a sorted array without any holes in it involves moving all the elements after the deleted element down a space in the array. In the worst case, this is again an O(n) operation; in the best case, it is an O(1) operation. Resizing the array to take up less memory as it shrinks may again involve a costly copy action. An idea here may be to mark items as deleted and postpone actual removal of elements until a good number of elements can be removed in one go.
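The mirror image of the insert sketch above (again an illustration, not from the book's sources):

// Close the hole left by deleting array[pos]; O(count - pos).
void DeleteAt(int *array, long count, long pos)
{
    memmove(&array[pos], &array[pos + 1], (count - pos - 1) * sizeof(int));
}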
Searching
A great advantage of arrays is that they are random access. No extra pointers or indexes, or even traversing, is necessary when you know the number of the element you want (a = array[5];). This is an O(1) operation. When you need to find out if a particular element is part of an array, you have a worst-case searching time of O(n) when the array is not sorted. This is because you simply must look at the elements one at a time; when you are out of luck, the element you are looking for is either not in the array or it just happens to be the last element of the array that you check. For sorted arrays, searching is a much happier business. You could opt to use a binary search to quickly locate an element. With a binary search, you recursively check the value of the search element against a value that lies between two boundaries inside the array. In a single check, the search value is checked against the value of the element in the middle of the array. When the search value is greater than the value in the middle, a binary search is started on the second half of the array. When the search value is smaller than the value in the middle, a binary search is started on the first half of the array. Here is an example:
Search element: 453
Array: 001 003 005 007 011 099 207 305 407 453 599 670 700 803 999
n: 15
Step 1: Compare 453 against the value of the middle element (which is array[n/2] = 305). 453 > 305, so do another binary search on the second half of the array. For this new binary search, the array to focus on is a subset of the original: the lower boundary (lb) = 8, the upper boundary (ub) = 14, and n' = 14 - 8 + 1 = 7.
Step 2: Compare 453 against the value of the middle element (array[((ub-lb+1)/2)+lb] = 670). 453 < 670, so do another binary search on the first half of this new array. For this new binary search, the array to focus on is a subset of the previous: the lower boundary = 8, the upper boundary = 10, and n'' = 10 - 8 + 1 = 3.
Step 3: Compare 453 against the value of the middle element (array[((ub-lb+1)/2)+lb] = 453). 453 == 453: game, set, and match. Listing 11.1 shows an implementation of a binary search algorithm. Listing 11.1 Binary Search Algorithm
long binsearch(int item, int data[], long lb, long ub)
{
    while (1)
    {
        long n = ub - lb;    // n = number of elements - 1

        // When there are 1 or 2 elements left to check.
        if (n <= 1)
        {
            if (item == data[lb])
                return lb;
            else if (item == data[ub])
                return ub;
            else
                return -1;
        }

        // When there are more than 2 elements left to check.
        long middle = ((n+1)/2) + lb;
        int mItem = data[middle];
        if (item == mItem)
            return middle;
        if (item < mItem)
            ub = middle - 1;
        else
            lb = middle + 1;
    }
}
This binsearch function is utilized in the file 11Source01.cpp that can be found on the Web site.
binsearch() takes as input parameters the item to search for (item), a pointer to the first element in the array (data), the index of
the first element (lb), and the index of the last element (ub). Usage is as follows:
int array[] = { 1, 2, 5 };
long n = 3;
int item = 2;
long a = binsearch(item, array, 0, n-1);
Finding an array element using a binary search is an O(log2 n) operation. It is very similar to searching a balanced binary tree, as you will see later on.
Traversing
Traversing an array is pretty straightforward. An incrementing counter as array index will do the job nicely (array[i++]). Furthermore, arrays can be traversed in both directions (forward and backward) without special functionality. It does not matter on which array element you start, and traversing can easily wrap from the end of an array back to the front (i++; i %= n;).
Sorting
Arrays can be sorted very efficiently, as you learned in Chapter 10.
Linked Lists
Another fairly straightforward storage structure is the linked list. A linked list is in fact a set of data elements that are chained together by pointers. Each element contains a pointer to the next element in the list, or a NULL pointer in case there are no further elements in the list. Figure 11.1 shows a linked list.

Figure 11.1. A linked list.
How you link list elements together is of course entirely up to you. One type of linked list that is often used is the double linked list, in which elements contain pointers to the next list element as well as to the previous element. The advantage of this is that you can navigate through the list in both directions, given a pointer to any of the elements. Figure 11.2 shows a double linked list.

Figure 11.2. A double linked list.
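As a concrete reference, here is a minimal sketch of the two node layouts. The type and field names are illustrative assumptions; they are not taken from the book's source files.

struct ListNode               // single linked list
{
    int       value;
    ListNode *next;           // NULL when this is the last element
};

struct DListNode              // double linked list
{
    int        value;
    DListNode *next;          // NULL at the back of the list
    DListNode *prev;          // NULL at the front of the list
};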
A skip list is another list of linked elements which contains additional information that allows you to skip some elements while traversing.
Think, for instance, of a list of alphabetically ordered names. Instead of using only a single pointer to the next element, you could create a skip list by adding to each element a pointer to the first element of the next letter in the alphabet. Exactly when and what you skip will differ per skip list implementation. Figure 11.3 shows a skip list containing alphabetically ordered names. Each list element points to its successor. In addition to this, each first element of a particular letter points to the first element of the following letter. This makes it possible, for instance, to quickly find all names beginning with a certain letter by first skipping to the correct letter and then traversing elements from there on. Skip lists do not have to be limited to one additional pointer per element, of course. You could decide to add another pointer to some elements to point to letters in a different alphabet, for example. The advantage of lists compared to arrays is that you can more easily add elements to and remove them from a list than an array (given that the array may at some time become full).

Figure 11.3. A skip list.
The file 11Source01.cpp that can be found on the Web site compares the performance of linked lists against that of other data structures. There is a section at the end of this chapter that presents the timing results of these tests.
Inserting
Inserting an element in a linked list consists of updating the next/previous pointers of the element to insert, the element to insert before, and the element to insert after. Special care has to be taken when inserting into an empty list and when inserting before the first element or after the last element. Nevertheless, the actual inserting action is not dependent on the length of the list and is therefore an O(1) operation. When, however, the place to insert an element still has to be found, some searching needs to be done beforehand. For this, see the section "Searching."
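A minimal sketch of the O(1) insert-after case for a double linked list, using the illustrative DListNode type from earlier. It covers inserting after any node and into an empty list; inserting before the head would need a similar special case.

void InsertAfter(DListNode *&head, DListNode *where, DListNode *newNode)
{
    if (head == NULL)                        // inserting into an empty list
    {
        newNode->prev = newNode->next = NULL;
        head = newNode;
        return;
    }
    newNode->prev = where;
    newNode->next = where->next;
    if (where->next != NULL)
        where->next->prev = newNode;         // fix up the old successor
    where->next = newNode;                   // no list traversal needed: O(1)
}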
Deleting
Deleting an element from a linked list consists of updating the next/previous pointers of the element before the deleted element, the element after the deleted element, and perhaps (when the element is not destroyed but simply removed from the list) the deleted element itself. Special care has to be taken when deleting the only remaining element of a list, deleting the element at the back of the list, and deleting the first element of a list. Deleting is also an O(1) operation. When, however, the element to delete still has to be found, some searching needs to be done beforehand. For this, see the following section, "Searching."
Searching
Searching through a linked list is not that exciting. In the case of a single linked list, you have no choice but to take a pointer to an element and, from that point, check each element one by one, moving to the next element on a mismatch. This is an O(n) operation because in a worst-case scenario you start at the beginning of the linked list and end up checking each element till you reach the end. When a linked list is sorted and you have more than one element pointer to start from, you can jump into the list at a more optimal point. You determine which pointer points to the element furthest in the list, but not past the key you are searching on, and start with that element. Basically what you have done here is create a sort of skip list using element pointers. Worst-case operation is now dependent on the number of search points and how well they are placed. With a double linked list and a pointer to the middle list element, searching is reduced to an O(n/2) operation on a sorted list when you can determine with certainty in which direction to start searching.
Traversing
Traversing a linked list is a matter of following the next pointers of the elements. In a double linked list, two directions of traversing are possible. In a skip list, fast traversal is possible because parts of the list that are not currently interesting can be skipped.
Sorting
Although the dynamic use of linked lists is less complex than that of arrays, the opposite is true for sorting. Sorting is especially problematic with single linked lists. This is because access to the elements is no longer random but has to be done through other elements. Sorting directly on the linked elements is therefore far from efficient. The sorting functions of the previous chapter can be used for linked lists when either (a) the linked list is a class that overloads the operators [], =, <=, ==, and so on, or (b) the sorting functions are updated to use member functions of the list class in order to access and manipulate elements. Listing 11.2 shows the different kinds of element access for sorting via operator overloading or class methods.

Listing 11.2 Element Access for Sorting
~
// Element access example for arrays
// or linked lists with operator overloading.
if (data[i] < data[i+1])
{
    temp = data[i+1];
    data[i+1] = data[i];
    data[i] = temp;
}
~
~
// Element access example for linked lists using list methods for access.
if (list1.GetElem(i)->key < list1.GetElem(i+1)->key)
{
    list1.SwapElems(i, i+1);
}
~
Though the example that uses list methods for access may look shorter and thus more efficient, you should not forget that each call to a list function is in itself an O(n) operation! And it makes no difference whether the function called is an operator or some other method (GetElem() versus []). To avoid having to search the list each time you need to access an element, you can create an array containing a pointer to each list element; this is an effective way to speed up access. During sorting you access the list elements through the array of pointers for element evaluation, and then you sort the pointers based on that element evaluation. Only when the pointers are sorted do you traverse the linked list one last time to set the elements in the right order, as shown in the sketch below. However, this means you change the footprint of the data storage during sorting. Having said all this, if sorting directly on list elements is necessary, an insertion sort or merge sort is probably advisable. See Chapter 10.
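A sketch of this pointer-array approach, reusing the illustrative ListNode type from earlier and qsort() from the standard C library; the function names are assumptions, not the book's own code.

#include <stdlib.h>

int CompareNodes(const void *a, const void *b)
{
    const ListNode *na = *(ListNode * const *)a;
    const ListNode *nb = *(ListNode * const *)b;
    return (na->value > nb->value) - (na->value < nb->value);
}

// Build an array of node pointers, sort the pointers, then relink the list.
void SortList(ListNode *&head, long n)
{
    ListNode **ptrs = new ListNode*[n];          // extra footprint during sorting
    long i = 0;
    for (ListNode *p = head; p != NULL; p = p->next)
        ptrs[i++] = p;
    qsort(ptrs, n, sizeof(ListNode *), CompareNodes);
    for (i = 0; i < n - 1; i++)                  // one final pass to relink
        ptrs[i]->next = ptrs[i + 1];
    ptrs[n - 1]->next = NULL;
    head = ptrs[0];
    delete [] ptrs;
}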
Hash Tables
The previous paragraphs showed that arrays and linked lists each have their own strengths and weaknesses. The array provides fast access to its elements but is difficult to extend; the linked list is easy to extend but does not provide very fast access to its elements. For bases of data with a large number of elements, you of course want the best of both worlds. The hash table provides a means to bring this about. By combining the implementations of arrays and lists, a hash table that is set up well is both fast to access and easy to extend. Hash tables come in as many shapes and sizes as implementers and designers can imagine, but the basic premise is graphically explained in Figure 11.4.

Figure 11.4. Hash table example.
The hash table depicted in Figure 11.4 is basically an array of linked lists. As with the skip list from the previous section, the hash table helps you by ensuring that you no longer have to traverse a complete list to find an element. This is because the elements are stored in several smaller lists. A hash table is therefore a sort of skip array. Each list maintained by the array of the hash table is called a bucket. A bucket is thus identified by a single (hash) array element. In Figure 11.4 the hash table contains five buckets, numbered 0-4. In order to determine which bucket could contain the element you are looking for, you need to have specific information on how the array orders its lists. As with sorting, the place where elements are stored in a hash table is determined by one of the fields of that element. This field is called the key. Elements containing employee data could, for instance, be stored according to Social Security number or name. The function that maps key values of elements to buckets (or indexes of the hash table array) is called the hash function. Here is an example of a simple hash function.
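A minimal stand-in sketch, assuming a non-negative integer key (such as an employee number) and tableSize buckets; the names are illustrative, as the book's own example is not shown here.

// Map a key to a bucket index in the range 0..tableSize-1.
int Hash(int key, int tableSize)
{
    return key % tableSize;    // assumes key >= 0
}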
It is not hard to imagine that the quality of the hash function directly determines the quality of the hash table. When a hash function maps too few or too many key values to a single bucket, the hash table will become very inefficient. The following two figures present extreme examples of what could happen. Figure 11.5 shows what happens when too few elements are mapped to a single bucket.

Figure 11.5. All buckets are used but contain only a single element.
When a hash function maps each key value to a separate bucket (a perfect hash function), all buckets will eventually be in use but will contain only a single database element. What you have now is in effect an inefficient array. Each array slot contains a pointer to a single element. This means that memory (footprint) is used unwisely and element access is done through an extra indirection. But the good thing is that it's resizable! Figure 11.6 shows what happens when too many elements are mapped to a single bucket.

Figure 11.6. One bucket is used, containing all elements.
When a hash function maps each key value to the same bucket, only one bucket is used and it contains a list of all the database elements. What you have now is in effect a linked list with the added overhead created by an (almost empty) array. It is clear that an ideal hash table would look far more balanced. In order to create a balanced hash table, the hash function needs to be able to divide the data elements evenly over the available buckets. Remember, however, that the more complicated this function is, the slower it will be. Hash functions are discussed in more detail in a following section.

Of course, hash table characteristics themselves also play an important part in what a hash table will look like when it is in use. Think, for instance, of the number of buckets used. With too few buckets, the lists that contain the bucket elements will become large enough to slow down element access once more. Analysis of the data you will store should help you determine the desired number of buckets. Another thing to consider is how you want to store elements within a bucket. The simplest way is to just append new elements to the end of the list of a bucket. With sufficient buckets (and thus short lists per bucket) this may prove to be very practical. However, as list sizes increase, it may be a good idea to think about storing the elements inside a bucket in an ordered fashion.

At the beginning of this section it was noted that hash tables can come in many shapes and sizes. Basically, the data set you want to store will determine the optimal hash table implementation. An array of lists is a much-used mechanism, but there is no reason why a bucket cannot contain a simpler or even more complex data structure. Think of a bucket containing a data tree or yet another hash table. This second-level hash table might even use a different hash function. The file 11Source01.cpp that can be found on the Web site compares the performance of hash tables against that of other data structures. There is a section at the end of this chapter that presents the timing results of these tests.
Inserting
Inserting elements into a hash table that is set up according to the example in this section can be a very fast operation (O(1)). This operation consists of determining the correct bucket (via the hash function) and adding the element to the linked list of that bucket. When the linked list of the bucket is ordered, the operation can take longer because the correct place for insertion needs to be determined. In the worst case, this is an O(L) operation, in which L is the number of elements already in the bucket.
Deleting
Deleting an element from a hash table that is set up according to the example in this section is also an O(L) operation (worst case), in which L is the number of elements in the bucket from which an element will be deleted. This is because, whether or not the list is ordered, the element to be deleted has to be found.
Searching
Searching for an element in a hash table that is set up according to the example in this section can be very fast. Searching ranges from an O(1) operation (best case) to an O(L) operation (worst case), in which L is the maximum number of elements in a bucket.
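A sketch of such a lookup in a chained hash table, reusing the illustrative ListNode type and the Hash() sketch from earlier in this chapter (names are assumptions).

// Find the node with the given key, or return NULL when it is not present.
ListNode *Find(ListNode *buckets[], int tableSize, int key)
{
    // One O(1) hash step, then an O(L) scan of a single bucket list.
    for (ListNode *n = buckets[Hash(key, tableSize)]; n != NULL; n = n->next)
        if (n->value == key)
            return n;
    return NULL;
}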
Traversing
Traversing a hash table that is set up according to the example in this section can be quite easy when all the possible key values are known. In that case, the hash function can be called for ascending key values, thus finding the buckets. Within the buckets, the elements can be traversed in the same manner as with any linked list. However, when all possible key values are not known, there is little left to do but traverse the available buckets in some order (ascending, for instance).
Sorting
Sorting hash tables is not generally done. The hash function is used in order to make a mapping from key value to bucket. Doing any external sorting on this mapping means that elements can no longer be found via the hash function. A hash function, however, rarely dictates the order of elements within a bucket. As stated before, it can prove to be a good idea to keep the elements in the buckets in some kind of predetermined order. The sorting mechanism that can be used depends on the structure that is used for the buckets. For a hash table that is set up according to the example in this section, the sorting implications for buckets are the same as those specified in the section on sorting linked lists.
Hash Functions
In the previous section, you saw that finding elements in a hash table is fastest when you use a perfect hash function. This function guarantees that each possible key value will be mapped to a unique bucket number. This implies that each bucket contains at most a single element, and your hash table is in effect an array in which you can use the key value as an index. However, perfect hash functions are not always easy to find. When more than one key value is mapped to the same bucket number by a certain hash function, this is called a collision. When the hash function cannot satisfactorily be rewritten into a perfect hash function, you have two options for dealing with collisions:

1. Allow more than one element to be placed in a bucket. This option was demonstrated in the previous section, where each bucket basically consisted of a linked list. The number of elements that can be placed in such a bucket is limited only by memory. Using linked lists to represent buckets is called chaining, because the elements in a bucket are chained together. Other structures are of course also possible, such as trees, arrays, or even further hash tables.

2. Do a fix on the collision. It is also possible to try to fix collisions. In doing this you repeatedly look for a new bucket until an empty one is found. The important thing to realize here is that the steps to find an empty bucket must be repeatable. You must be able to find an element in the hash table using the same steps that were taken to store it initially. Collisions can of course be fixed in numerous ways. Table 11.2 contains popular collision fixes.

Table 11.2. Popular Hash Table Collision Fixes

Fix             What You Do
Linear probing  Do a linear search starting at the collision bucket until you find an empty bucket.
Rehashing       Perform another hashing, with a different hash function, perhaps even on a different key of the element.
Overflow area   Reserve a separate area for colliding keys. This could be a second hash table, perhaps with its own hash function.
Requirements for hash functions are clearly pretty strict. If a hash function is not a perfect hash function, it at least has to be perfect in combination with other strategies (option 2), or it should allow a good average distribution of elements over the available buckets (option 1). In both cases, though, the hash function you use for your hash table is dependent on the kind and amount of data you expect, and the nature of the key that is used. This section shows some sample hash functions for different data types. Listing 11.3 shows a simple hash function for string keys.

Listing 11.3 Simple Hash Function for String Keys
// Hashing strings.
int Hash(char *str, int tableSize)
{
    int h = 0;
    while (*str != '\0')
        h += *str++;
    return h % tableSize;
}
The hash function in Listing 11.3 simply adds the character values of a string and uses the modulo operator to ensure that the range of the returned bucket index matches the range of the available buckets. This hash function does not take into account the position of characters, so anagrams will be sent to the same bucket. If this is undesirable, a distinction can easily be made by, for instance, multiplying each character by its position, or by adding a position-dependent value to each character. Of course, you can also decide to write a hash function that uses two or more fields of your own data types. Listing 11.4 shows a hash function for a user-defined date key.

Listing 11.4 Hash Function for User-Defined Date Keys
// Hashing dates.
struct date
{
    int  year;
    char day;
    char month;
};

int Hash(date *d, int tableSize)
{
    if (tableSize <= 12)
        return (int) (d->month % tableSize);
    if (tableSize <= 31)
        return (int) (d->day % tableSize);
    // The remainder of the original listing is not shown; a plausible
    // completion (an assumption) combines the fields for larger tables:
    return (int) ((d->year * 372 + d->month * 31 + d->day) % tableSize);
}
The constant used by a multiplication hash function is chosen as range_of_the_key * R, in which range_of_the_key is 256 for an 8-bit key (and so on), and R is ((5^0.5) - 1)/2, which is approximately 0.618. Here are some example calculations for multiplication hash functions using Knuth's suggested constants:
Example 1:
Key size       : 8 bits
Hash table size: 256 buckets
hash = (key * 158) >> 7;

Example 2:
Key size       : 16 bits
Hash table size: 128 buckets
hash = (key * 40503) >> (32 - 7);
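As a sketch, Example 2 can be written as a small C++ function. The function and variable names are illustrative, and the code follows the formula above literally; 40503 is 65536 * R rounded to the nearest integer.

// Multiplication hash for a 16-bit key and 128 buckets (Example 2).
unsigned long MultHash16(unsigned short key)
{
    unsigned long product = (unsigned long)key * 40503UL;   // 32-bit product
    return product >> (32 - 7);                             // keep the top 7 bits
}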
Binary Trees
Binary trees, sometimes called binary search trees (BSTs), are popular storage structures. You can search through trees much faster than through linked lists. A well-balanced tree can be searched as efficiently as a sorted array: O(log2 n). This is because tree searching has the same properties as a binary search algorithm for an array (see the binsearch function in the section on arrays). However, the overhead per stored element is higher with trees than with a single linked list. A binary tree has references to (at most) two child nodes so, just like with double linked lists, two pointer fields need to be added to each stored element. As with linked lists, the elements of a tree are stored in memory separately, causing the same danger of memory fragmentation. Still, the advantages of binary trees have earned them a place in the standard toolset of every programmer. What binary trees look like and how they can be used was introduced in Chapter 10 in the section on heaps. This section looks at the positive and negative characteristics of binary trees. The file 11Source01.cpp that can be found on the Web site compares the performance of balanced and unbalanced trees against that of other data structures. There is a section at the end of this chapter that presents the timing results of these tests.
Inserting
The function used for inserting elements into a binary tree in effect also takes care of sorting the tree. This is because the placement of each element in the tree has to adhere to the rules of the internal tree structure. The structure of a tree is only valid when each child on the left of its parent has a key value that is equal to or lower than the key value of the parent, and each child on the right of its parent has a key value that is larger than that of its parent. Sadly, it is still possible to create a very inefficient tree even when your insert function observes the structuring rules perfectly. The following example demonstrates this. Assume we have an empty tree in which we want to insert the following set of values: 50, 25, 80, 32, 81, 12, 63. The first element, 50, is added and becomes the root. The second element, 25, is lower than 50 and placed left of the root. The third element, 80, is higher than the root and placed on its right. This goes on until all elements have been added and a tree is created as shown in Figure 11.7.

Figure 11.7. A tree fed with random ordered data.
When we add the same elements to an empty tree but in sorted order, we get an entirely different picture. The first element, 12, is added as the root. The second element, 25, is higher than the root and is placed to its right. The third element, 32, is higher than the root and higher than the child of the root, and is placed on the right of the child of the root. The fourth element, 50, is higher again than the last element inserted and is placed to its right. This goes on until all elements have been added and a tree is created as shown in Figure 11.8.

Figure 11.8. A tree fed with ordered data.
The storage structures of Figure 11.7 and Figure 11.8 both have a valid tree structure. However, the structure in Figure 11.8 is totally unbalanced and is actually nothing more than a single linked list with the added overhead of a NULL pointer for each element. The efficiency of the unbalanced tree is reduced to that of a linked list also. Instead of O(log2 n), tree access is now an O(n) operation. This means that if searching for or adding an element in a balanced binary tree with 50,000 elements takes 50 ms, for instance, that same operation can take up to two minutes for an unbalanced tree. So the better a tree is balanced, the more efficient it will be, and the order in which elements are fed to a binary tree actually influences its efficiency. It is of course possible to write a function that can be used to periodically attempt to balance the tree, but executing such a function will be very time-consuming and will most likely result in O(n*n) behavior. (See Chapter 5, "Measuring Time and Complexity.") A later section, "Red/Black Trees," shows an example of a tree modified for balanced inserting and deleting. A minimal insert sketch follows.
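A minimal sketch of an insert function that follows the structuring rule described above (equal or lower keys go left, larger keys go right), assuming an integer-keyed node type; the names are illustrative, not the book's own code.

struct IntTreeNode { int key; IntTreeNode *left, *right; };

void Insert(IntTreeNode *&root, IntTreeNode *node)
{
    node->left = node->right = NULL;
    IntTreeNode **link = &root;          // pointer to the link we may replace
    while (*link != NULL)
    {
        if (node->key <= (*link)->key)
            link = &(*link)->left;       // equal or lower keys go left
        else
            link = &(*link)->right;      // larger keys go right
    }
    *link = node;                        // attach in place of a leaf link
}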
Deleting
Deleting elements from a binary tree can be more involved than inserting elements. When the element to delete is a leaf, it can be removed without much difficulty. When the element to delete has only one child, that child can simply replace the deleted element. However, the element to delete can also have two children, each of which is a possible root of a subtree. These subtrees must be reintegrated into the remaining tree after the parent element is deleted. The most straightforward way of doing this is by simply feeding the elements of the subtrees back into the tree using the insert function of the tree. This can be time-consuming, however, and needs to be done with care, because feeding elements to a tree in a sorted order can result in a very slow tree structure. Another way to reintegrate subtrees that are left over after a delete action is to find the in-order successor of the deleted element and have it take the place of the deleted element. An in-order successor is explained in the section on traversing trees. Figure 11.9 demonstrates this deletion strategy.

Figure 11.9. Deleting an element from a binary tree.
In Figure 11.9, the element with key value 50 is deleted. The delete action creates two subtrees: one headed by the key value 25 and another headed by the key value 80. The in-order successor of key value 50 is key value 63. It can be found by traversing the subtree of the right child of the deleted element. As you will recall, in-order traversal first moves down a (sub)tree by passing along the left child of each node it encounters. When no further nodes can be found to the left, the in-order successor is reached. This in-order successor is moved to replace the deleted element, creating a valid tree structure once more. Yet again, as with inserting elements into a tree, the balance of the tree can be upset, causing slower subsequent search results. In order to keep trees nicely balanced, a modification to the standard structure is needed. A later section, "Red/Black Trees," shows an example of a tree modified for balanced inserting and deleting.
Searching
The speed with which your algorithm can search a tree is very much dependent on the shape of the tree. This is because the height of the tree (the number of nodes between root and leaf) determines the number of search steps your algorithm may have to take before finding an item or deciding that an item is not present in the tree. A balanced binary tree has left and right branches for all nodes that are not leaves. This means that each step of your searching algorithm can choose a branch based on a comparison of the item key with the node key. Listing 11.5 gives an example of a simple search algorithm for finding a node in a binary tree based on the value of its key.

Listing 11.5 Searching for a Node in a Binary Tree
TreeNode *Tree::GetNode(TreeNode *current, char *key)
{
    if (current == NULL)
        return NULL;

    int result = strcmp(current->key, key);
    if (result == 0)
        return current;
    if (result > 0)
        return GetNode(current->left, key);     // current key is larger; look left.
    else
        return GetNode(current->right, key);    // current key is smaller; look right.
}
As with a binary search, in the best case each test will split the data set to search in half. Searching a balanced binary tree is then an O(log2 n) operation. However, not all trees are balanced. A worst-case setup of an unbalanced tree is a tree in which each node has only one child, making it impossible to choose between branches during a search. This type of tree is nothing more than a linked list, and searching it is therefore an O(n) operation. Generally speaking, building a tree with randomly received data will produce a more or less balanced tree; however, when elements are added to a tree in sorted order, a linked list will be the result. This implies that in order to obtain good searching times (O(log2 n)), more work needs to be done during the inserting (and also sorting and deleting) operations on trees in order to keep the structure balanced. Sadly, the standard tree structure does not lend itself well to easy balancing during these operations. Because the tree structure itself can do so much for your searching times, it is a valid effort to think of updates to ensure better performance for these kinds of operations. Later sections present different kinds of trees that take this into account.
Sorting
Doing external sorting on a binary tree structure should not be necessary, because tree properties exist only due to the relations between the elements within the tree. The tree setup is therefore already sorted, and the sorted order is maintained during addition (and deletion) of elements. An unsorted tree would be a collection of elements with seemingly random connections. Finding an element in that kind of tree would mean traversing all elements, resulting in an O(n) operation. This in fact implies that you are dealing with linked list properties with the additional overhead of useless links.
Traversing Fundamentals
There are three different ways of traversing binary trees: pre-order, in-order, and post-order.

Pre-order

A pre-order traverse means: visit the root, traverse the left subtree of the root, traverse the right subtree of the root. (Correspondingly, an in-order traverse visits the left subtree, then the root, then the right subtree; a post-order traverse visits both subtrees before the root.) Listing 11.6 shows an implementation of a pre-order function that traverses a tree to print information from each node.

Listing 11.6 Pre-order Traverse Function for a Binary Tree
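A minimal sketch of such a pre-order function (an illustration, not necessarily the book's own Listing 11.6), assuming the TreeNode type with an info member used by the threaded-tree listing later in this section:

void PreorderTraverse(TreeNode *current)
{
    if (current == NULL)
        return;
    current->info.PrintInfo();          // visit the root
    PreorderTraverse(current->left);    // traverse the left subtree
    PreorderTraverse(current->right);   // traverse the right subtree
}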
Traversal Continued
Looking at the order in which the example keys pop from the different traverse methods, you can see that in-order traversal comes across all tree elements in sorted order. However, recursion is needed to visit all elements, which means that the amount of stack used (see Chapter 8) increases with the depth of the tree. Not only can this cause memory problems when traversing large bases of data, it is also performance-inefficient as function call after function call is made. Luckily, it is possible to rewrite the traversal routine into an iterative function; however, changes to the standard tree structure are needed. The changes consist of replacing the null pointers of the leaf nodes with something useful. For instance, the left child pointer of a leaf node is made to point at the node's in-order predecessor, and the right child pointer is made to point at the node's in-order successor. Figure 11.10 shows what this would make the balanced tree from the beginning of this section look like. The added references are depicted in Figure 11.10 by curved arrows. Note that two NULL pointers still remain. These denote the first and last elements of the tree. This kind of binary tree is called a threaded tree. By following the links of a threaded tree, a traversal function can access the tree as if it were a linked list, while other operations (such as a search) still retain the positive characteristics of the tree structure. For each link to a child, however, some extra administration is needed. A traversal routine has to know whether a child link is a normal link or a link to a threaded successor/predecessor. Also, creating the threaded links slows down the insert and delete operations, so there is some overhead to consider when using threaded trees. Listing 11.9 shows what an in-order traversal function for a threaded binary tree can look like.

Figure 11.10. A threaded binary tree.
void InorderTraverse(TreeNode *current)
{
    while (current != NULL)
    {
        if (current->ltype == 1 && current->left != NULL)
        {
            // Normal link to left child.
            current = current->left;
        }
        else
        {
            // Nothing on the left, so process the current node.
            current->info.PrintInfo();
            if (current->rtype == 1)
            {
                // Normal link to right child, or last element (NULL).
                current = current->right;
            }
            else
            {
                // Right child threads to the in-order successor.
                // Keep following threaded links till a right link is normal.
                do
                {
                    current = current->right;
                    current->info.PrintInfo();
                } while (current->rtype == 0);   // loop condition reconstructed; an assumption
                current = current->right;        // finally take the normal link right.
            }
        }
    }
}
Red/Black Trees
Red/Black trees, introduced by R. Bayer, are binary search trees that adhere to a few extra structuring rules, and that have a few extra properties in each node. These new rules and properties make it easier to keep the tree structure balanced when inserting and deleting elements. The extra properties found in the nodes of a Red/Black tree are

1. A pointer to the parent of the node.
2. A code indicating the color of the node.

In a Red/Black tree, the color code of each node is either red or black. The extra structuring rules that Red/Black trees adhere to are

1. If a node is red, its children have to be black.
2. Each leaf is a NIL node and is black.
3. The path in the tree from the root to any leaf contains the same number of black nodes.

A NIL node is a special sentinel node used to indicate an empty leaf. When insert and delete operations observe these structural rules, the tree remains well balanced. The number of black nodes between a leaf of the tree and its root is called the black-height of the tree. Looking at the structuring rules closely, you can determine that any path from a leaf to the root can never be more than twice the length of any other leaf-to-root path. This is because the minimum path contains only black nodes (the number of nodes equals the black-height), and the maximum path contains a red node between each pair of black nodes (the number of nodes equals the black-height multiplied by 2). The following sections explain how the new properties and rules are put to use in order to maintain a well-balanced tree.
Inserting Theory
In order to insert an element into a Red/Black tree, the tree is traversed to find an appropriate insertion point. An element is always inserted in the place of a leaf node. The newly inserted node is given the color red, and it will have two black leaf nodes (which are, according to structuring rule 2, in fact NIL nodes). Because of this, structuring rules 2 and 3 still hold after inserting a new element. And if the parent of the inserted element happens to be black, the operation is completed. However, structuring rule 1 will be violated when the parent happens to be a red node; according to rule 1, a red parent is only allowed to have black children. This means that after inserting an element, the tree structure may need to be rebalanced. Two tools are used in rebalancing the tree structure. The first is simply changing the color of a node; the second is rotating nodes around in the structure. The remainder of this section will show that there are actually two situations in which a rebalancing of the tree structure is needed: one when the parent of the new element is red and its uncle is black, and the other when both the parent and the uncle of the element are red.

Rebalancing a Red/Black Tree for a Red Parent and a Black Uncle

Figure 11.11 shows one of the two situations in which the tree structure needs to be rebalanced after insertion of an element.

Figure 11.11. Inserting with a red parent and a black uncle.
In the left half of Figure 11.11, part of a Red/Black tree is shown, containing the following elements: 50 (Black), 25 (Red), 80 (Black). The element 12 is inserted (Red). The parent of element 12 will be element 25. For reference's sake, element 80 will be called the uncle of element 12. The Red/Red parent-child link of elements 12 and 25 is a violation of rule 1 and needs to be rebalanced. The rebalancing is shown in the right half of Figure 11.11. The rebalancing consists of elements 50 and 25 changing place and color. In effect, element 50 becomes the right child of the element that used to be its left child. After this rotation of elements, the Red/Black tree structure is valid again; the top of the changed tree (element 25) is black, so the color of its parent does not matter. Also, the black-height of the tree is unchanged. The tree structure is valid.

Rebalancing a Red/Black Tree for a Red Parent and a Red Uncle

However, it is also possible that the uncle of a newly inserted element is red instead of black. Figure 11.12 shows what happens in that case.

Figure 11.12. Inserting with a red parent and a red uncle.
In the left half of Figure 11.12, part of a Red/Black tree is shown, containing the following elements: 50 (Black), 25 (Red), 80 (Red). The element 12 is inserted (Red). The parent of element 12 will be element 25; the uncle is element 80. In rebalancing the Red/Red parent-child link of elements 12 and 25, retyping can be used. The parent (25) will be colored black, removing the illegal Red/Red parent-child link. But now a new violation is created; the number of black nodes between the leaf and the root is increased. The grandparent of the new element (50) therefore has to be retyped also. In doing this, element 50 has become a red node with a red child (80). This child is retyped also. This brings us to the right half of Figure 11.12. Note that no further changes need to be made to the subtree represented by element 80. Element 80 is black and can thus contain both red and black children, and the black-height is maintained because element 50 is now red.
Inserting Practice
These two techniques, rotation and retyping, may both be needed in this second situation because the rebalancing is not finished yet. Further changes may be needed above element 50 in Figure 11.12, because it may have a red parent. This can cause another Red/Red parent-child violation. So, in effect, retyping can propagate all the way up to the root of the tree. Sometime during this rebalancing, a rotation may also be needed; for example, when a black node is turned to red while having a black uncle. Luckily, as you have seen in this section, after a rotation the tree structure is valid and the rebalancing can stop. Listing 11.10 rebalances a Red/Black tree after an insert.

Listing 11.10 Rebalancing a Red/Black Tree After Inserting a New Left Element
template <class T>
void RedBlackTree<T>::InsertRebalance(Node<T> *newNode)
{
    Node<T> *tempNode;

    // Continue checking up to the root, as long as the parent is RED.
    while (newNode != root && newNode->parent->color == RED)
    {
        if (newNode->parent == newNode->parent->parent->left)
        {
            tempNode = newNode->parent->parent->right;   // the uncle
            if (tempNode->color == BLACK)
            {
                if (newNode == newNode->parent->right)
                {
                    newNode = newNode->parent;
                    RotateLeft(newNode);
                }
                newNode->parent->parent->color = RED;
                newNode->parent->color = BLACK;
                RotateRight(newNode->parent->parent);
            }
            else
            {
                tempNode->color = BLACK;
                newNode->parent->color = BLACK;
                newNode->parent->parent->color = RED;
                newNode = newNode->parent->parent;
            }
        }
This template function can be found in the file 11Source02.cpp on the Web site. Note also that this is actually only half of the solution. In Figure 11.12 the new elements were inserted to the left of their parents. However, it is also possible that an element needs to be inserted to the right of a parent. In that case, the implementation code has to be mirrored: left becomes right and right becomes left. This can be seen in the second half of the InsertRebalance function. Listing 11.11 shows this mirroring code.

Listing 11.11 Rebalancing a Red/Black Tree After Inserting a New Right Element
        else
        {
            tempNode = newNode->parent->parent->left;    // the uncle
            if (tempNode->color == BLACK)
            {
                if (newNode == newNode->parent->left)
                {
                    newNode = newNode->parent;
                    RotateRight(newNode);
                }
                newNode->parent->parent->color = RED;
                newNode->parent->color = BLACK;
                RotateLeft(newNode->parent->parent);
            }
            else
            {
                newNode->parent->parent->color = RED;
                newNode->parent->color = BLACK;
                tempNode->color = BLACK;
                newNode = newNode->parent->parent;
            }
        }
    }
    root->color = BLACK;
}
As you can see in the diagrams in this section, insertion is an O(log2 n) + O(h) operation. This is because the search for the place to insert is a binary search (O(log2 n)), and the rebalancing is either a rotation that is not dependent on the number of elements in the tree (O(1)) or a retyping that can propagate all the way to the root of the tree (O(h), where h = tree height).
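The RotateLeft() and RotateRight() functions are assumed by Listing 11.10 but not shown there. The following is a sketch of RotateLeft consistent with the parent pointers and NIL sentinel described earlier; it is an illustration, not necessarily the implementation in 11Source02.cpp, and the NIL and root members are assumptions.

template <class T>
void RedBlackTree<T>::RotateLeft(Node<T> *node)
{
    Node<T> *pivot = node->right;     // pivot moves up; node becomes its left child
    node->right = pivot->left;
    if (pivot->left != NIL)           // NIL: the sentinel leaf (assumed member)
        pivot->left->parent = node;
    pivot->parent = node->parent;
    if (node->parent == NULL)         // node was the root
        root = pivot;
    else if (node == node->parent->left)
        node->parent->left = pivot;
    else
        node->parent->right = pivot;
    pivot->left = node;
    node->parent = pivot;
}

RotateRight is the mirror image: swap every left for right and vice versa.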
Deleting
Deleting an element from a Red/Black tree starts the same way as with a normal binary tree. After this, a balancing action may be needed, which is very similar to the one used after inserting an element. As you might expect, deleting a red node from a Red/Black tree does not cause any violations. Looking at how elements are deleted from a normal binary tree, the following cases of violations can be identified when deleting an element from a Red/Black tree:
1. A black parent with two leaf nodes is deleted. The element is removed and replaced by a black leaf. The path between this leaf and the root now contains a number of black nodes that is one lower than the black-height of other leaf-to-root paths. This is a violation of structuring rule 3. The resulting leaf is in a state that is referred to as Double Black, meaning that it is off balance because a black node is missing.

2. A black parent with a single child (and one leaf node) is deleted. The only child replaces its deleted parent, keeping its original color. A black node has now disappeared from the path between the child and the root of the tree. This is a violation of rule 3. When the child itself is red, it can be colored black and the violation disappears; if it was already black, it becomes Double Black and the violation remains.

3. A black parent with two children is deleted. The parent is replaced by its in-order successor. If this successor is red, rule 3 is violated due to the disappearance of a black node from the path to the root. If this successor is black, rule 3 is violated due to the fact that the original children of the in-order successor miss a black node in the path to the root.

The violations need to be removed once again using the two techniques explained in the section on inserting: rotation and retyping. An implementation of the rebalancing delete operation can be found in the file 11Source02.cpp on the Web site.
Searching
Searching a Red/Black tree is no different from searching any other balanced binary search tree. The changes made to transform a normal binary tree into a Red/Black tree only affect the ability to balance the structure after insert and delete operations. Searching a balanced binary tree is an O(log2 n) operation.
Sorting
Sorting Red/Black trees, just like sorting any binary tree, is not very useful. The tree structure is sorted by the way elements are inserted into the tree.
Traversing
Traversing a Red/Black tree is no different from traversing any other binary search tree.
Summary
This section summarizes the chapter by discussing the test results of the files 11Source01.cpp and 11Source02.cpp. Note that timing values may differ somewhat between systems, but the relationships between values in a table should be fairly constant. The first test done by 11Source01.cpp is placing initial data into the various data structures. Table 11.3 shows the results of 11Source01.cpp for placing initial data into various structures.

Table 11.3. Timing Results Storing Initial Data in Storage Structures (Filling Data Structures)

Array   List   Hash   UnbalTree   BalTree
60      100    60     102710      60

Arrays are the fastest structures here. Note that in the 11Source01.cpp file the array is treated as a non-dynamic structure; that is, no reallocation of the structure is done to accommodate new elements. Memory is allocated once, and only field values are set during the filling of the array. This stands in contrast to the other structures, where a new element must be created each time data is added to the structure. The next test done by the 11Source01.cpp file is searching for an element with a certain key value. Table 11.4 shows the results of 11Source01.cpp for searching for elements in various structures.

Table 11.4. Timing Results Searching Elements in Storage Structures (Looking Up the First and Last 250 Entries)

Array1   Array2   Array3   List   Hash   UnbalTree   BalTree
0        0        540      3630   0      3570        0
Note that three different timing values are included for arrays. Array1 reflects the time needed to find a specific element when its key can be used directly to access the element. Array2 reflects the time needed to find a specific element in a sorted array using a binary search algorithm. Array3 reflects the time needed to find a specific element in an unsorted array. This means simply traversing the array from the beginning until the correct key value is found or the end of the array is reached. You may find it interesting to play around a bit with the number of buckets used by the hash table. You can do this by changing the value of the HASHTABLESIZE definition. You should find that when the number of buckets decreases, and thus the list size per bucket increases, performance of the hash table becomes less and less efficient.

The first test done by 11Source02.cpp is timing the filling of a normal tree against a Red/Black tree:

Tree    RBTree
220     270

We see that the filling of the Red/Black tree is slightly slower. This has everything to do with the additional administration we have to do, and with the fact that filling the normal tree is done in such a way that it always results in a perfect tree. The next test done by 11Source02.cpp is looking up and checking all entries:

Tree    RBTree
110     110

As expected, the lookup of an element is as fast in a Red/Black tree as in a normal (balanced) tree. But because the Red/Black tree is guaranteed to always be balanced, good lookup times are also guaranteed. This cannot be said of a normal tree, where lookup times increase as the shape of the tree becomes more and more unbalanced! The last test done by 11Source02.cpp is removing all entries one by one:

Tree    RBTree
49870   110
Deleting nodes shows a tremendous time difference between a normal tree and a Red/Black tree. The reason for this is that the normal tree doesn't have a rebalancing mechanism, so entire subtrees of nodes that are removed need to be reinserted, which is very time-consuming. Because the Red/Black tree does have a delete-and-rebalance mechanism, no such problems occur. In conclusion, we can say the following about the data storage structures discussed in this chapter:
Arrays
An array is the best structure for static data because it has no overhead and there is no danger of fragmentation within the bounds of the array. Dynamic data is more complicated, as data may need to be copied. This can also mean that a larger footprint is needed.
Lists
A list is a good structure for storing small data sets with a highly dynamic nature. It has more overhead per element than the array, and fragmentation is a danger. However, dynamically adding and deleting elements is fast and does not require extra footprint beyond the memory needed for storing elements. Searching for an element can be quite time-consuming, depending on the type of list (double- or single-linked).
Hash Tables
A hash table is a good structure for larger sets of data for which a unique or distributing key function (hash function) can be designed. The hash table implementation can make use of as many or as few other data structures as necessary. Depending on this, the overhead per element can be as small as that of the array or as large as other structures combined.
Binary Trees
A binary tree is a good structure for larger amounts of highly dynamic data, as long as the tree structure can remain relatively balanced. This means data must be added and deleted in an unsorted order. The binary tree has more overhead per element than the list and becomes as inefficient as the list when it is unbalanced.
Red/Black Trees
A Red/Black tree is a good structure for larger amounts of highly dynamic data. The Red/Black tree has more overhead per element than a normal binary tree but stays balanced independent of the order in which elements are added and deleted. The danger of memory fragmentation is as high as with the other structures (excluding the array), and inserting and deleting elements will be slower than with a normal binary tree. For larger amounts of data, this time is won back in faster traversal (because good balance is guaranteed) and searching.
Chapter 12. Optimizing IO

As a C++ implementer you have access to the standard C functions for performing IO as well as the streaming classes introduced by C++. It is not always obvious, however, what the differences between these techniques are (apart from their notation) and whether there are performance penalties involved in using a certain technique. This chapter looks at the pros and cons of each technique and notes where to use each.
Using printf()
The printf() function can be used to write a string to the standard output stream (stdout). The string that is passed to printf() can contain complex formatting information, which makes printf() a handy tool for doing conversions from binary variable values to strings. printf(), in fact, allows you to combine strings with automatically converted variables into a single output string. printf() was already part of C, but can still be used in C++. This is how you write an array input of N strings to screen using printf():
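A minimal version of such a loop, mirroring the "Test" prefix used by the cout example later in this chapter (the exact original listing is not reproduced here):

for (int i = 0; i < N; i++)
{
    printf("Test %s\n", input[i]);
}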
Using puts() and putc()

An alternative is to write the strings character by character with putchar(). The following fragment writes the same "Test" prefix and the array input of N strings; the declarations shown in comments are assumptions, not part of the original fragment.

// Assumed context: char c; int j = 0; char Test[] = "Test ";
// for (int i = 0; i < N; i++)
{
    while ((c = Test[j++]) != '\0')
        putchar(c);
    j = 0;
    while ((c = input[i][j++]) != '\0')
        putchar(c);
    putchar(13);    // write carriage return
    putchar(10);    // write '\n'
    j = 0;
}
Both putc() and putchar() are implemented as functions as well as macros. The macro definitions take precedence. To always use putchar() as a function, you have to undefine the macro. To undefine the putchar() macro, place the following line of code in your source file before putchar() is used:
#undef putchar
When you want to use putchar() as a function only for a certain call, force the function invocation by using putchar() as follows:
(putchar)(c)
Using puts() and putc() is certainly more laborious than, for instance, using printf(). puts(), however, is a very fast output technique. The Test Results section evaluates its speed compared to the other screen-writing techniques.
Using cout
Apart from the standard C functions for screen output, C++ programmers also have access to the streaming classes provided by that language. Table 12.1 shows the different streams that are available in C++.

Table 12.1. C++ Streams

cin    Standard input stream (istream class), connected to the keyboard by default
cout   Standard output stream (ostream class), connected to the screen by default
cerr   Standard output stream for errors (ostream class), connected to the screen by default
clog   Buffered version of cerr (ostream class), connected to the screen by default
The difference between cerr and clog is that cerr messages are processed immediately and clog messages are buffered. Messages that should appear onscreen as soon as possible should, therefore, be sent to cerr. If a message is not supposed to slow down program execution too much, clog should be used. This section demonstrates the use of cout so it can be compared to the other screen output functions. The cout stream can only be used in C++. It is as powerful as the printf() function in that it can also be used to convert binary values into strings and combine these values with normal strings into a single output string. The syntax of cout is very different from that of printf(), however. This is how you write an array input of N strings to screen using cout:
for (int i = 0; i < N; i++)
{
    cout << "Test " << input[i] << '\n';
}
Using cout like this does not write a string to screen immediately, however; the text appears only when the output buffer is flushed. Flushing the output buffer can be done by adding << endl or << flush to the output stream. This is how you write an array input of N strings to screen using cout and flushing the buffer after every write action:
for (int i = 0; i < N; i++)
{
    cout << "Test " << input[i] << endl;
}
Test Results
This section compares the speed of the different techniques for writing text strings to the screen. Table 12.2 presents the timing results generated by the program in the accompanying file 12Source01.cpp.
Table 12.2. Timing Results Screen Output Techniques

Technique          Time (ms)
cout + endl        500
cout + \n          380
printf()           440
puts + copy        400
puts twice         540
putchar macro      610
putchar function   670
Note that file output can also be achieved with the functions described for screen output. This can be done by redirecting program output with the > sign. In order to write the test results of this section to file, type the following DOS command:
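For example, assuming the test program was built under the name 12source01 (a hypothetical name):

12source01 > results.txt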
printf ("A string printed using printf\ n"); cout << "testing cout" << endl;
might output:
Listing 12.2 shows how FILE functions can be used to read data from one binary file and write it to another.

Listing 12.2 Using FILE Functions to Read and Write Binary Files
#define BLOCKSIZE 4096
#define ITEMSIZE  1
FILE *inp, *outp;
long numRead, numWritten;
int  errorCode = 0;
unsigned char buf[BLOCKSIZE];    // transfer buffer (declaration assumed; not in the original fragment)

if ((inp = fopen(inputFilename, "rb")) != NULL)
{
    if ((outp = fopen(outputFilename, "wb")) != NULL)
    {
        while (!feof(inp))
        {
            numRead    = fread(buf, ITEMSIZE, BLOCKSIZE, inp);
            numWritten = fwrite(buf, ITEMSIZE, numRead, outp);
        }
        fclose(outp);
    }
    else
        printf("Error Opening File %s\n", outputFilename);
    fclose(inp);
}
else
    printf("Error Opening File %s\n", inputFilename);
As was noted in Table 12.3, the fread() and fwrite() functions can be used to transfer blocks of bytes. These functions use two arguments to determine the block size. The first is the item size, denoting the size in bytes of the items in the file. The second is the number of items to be read or written. Item size multiplied by number of items equals the number of bytes transferred. In Listing 12.2, an item size of 1 was chosen because bytes are to be written and read. The number of items, therefore, effectively determines the block size. Choosing a small block size of course means extra overhead, because more fread() and fwrite() calls are needed to transfer a file. Choosing a large block size means less overhead but a larger footprint in memory. The Test Results subsection lists an execution speed comparison between different binary file techniques by varying block sizes.

Random Access

Of course, you will not always access files sequentially as was done in the previous listings. When you know how a file is built up (because it contains database records, for instance) you will want to access specific records directly, without having to load the file into memory from its very beginning. What is needed is random access. You want to tell the file system exactly which block of bytes to load. As you have seen in the previous listings, a file is identified and manipulated through a pointer to a FILE object. This FILE object contains information on the file. It keeps, among other attributes, a pointer to the current position in the file. It is because of this attribute that it is possible to call, for instance, fread() several times, with each call giving you the next block from the file. This is because the current position in the file is increased after each call with the number of bytes read from the file. Luckily, you can influence this current position directly through the fseek() function. fseek() places the new current position of a file at an offset from the start, the end, or the current position of a file:
int fseek(FILE *filepointer, long offset, int base);

fseek() returns 0 when it executes successfully. Table 12.4 shows the different definitions that can be used for the third argument, base.
Table 12.4. Positioning Keys for fseek

SEEK_CUR   Current position in the file
SEEK_END   End of the file
SEEK_SET   Beginning of the file
Listing 12.3 demonstrates random file access, using fseek() to load and update every tenth record of a file containing 100,000 records.

Listing 12.3 Random Access with Standard IO
FILE *db;      // file pointer, defined elsewhere.
Record Rec;    // record structure, defined elsewhere.

if ((db = fopen(dbFilename, "r+b")) != NULL)
{
    for (unsigned long i = 0; i < 500000; i += 10)
    {
        fseek(db, i*sizeof(Record), SEEK_SET);   // seek record # i
        fread(&Rec, sizeof(Record), 1, db);
        strcpy(Rec.name, "CHANGED 1");
        Rec.number = 99999;
        fseek(db, i*sizeof(Record), SEEK_SET);   // seek record # i again for writing
        fwrite(&Rec, sizeof(Record), 1, db);
    }
    fclose(db);    // reconstructed; the original fragment ends inside the loop
}
Using Streams
The classes ifstream and ofstream can be used to read from and write to files in C++. And of course C++ would not be C++ if it did not try to make life easier by deriving further classes from these two. Table 12.5 shows classes that can be used to perform file input as well as output.

Table 12.5. C++ File IO Streaming Classes

    fstream        File stream class for input and output
    stdiostream    File stream class for standard IO files
Table 12.6 shows the functions available on the C++ IO streams.

Table 12.6. Functions Available on the C++ IO Streams

    open       Open a stream
    close      Close a stream
    setmode    Set the mode of a stream (see the open function)
    setbuf     Set the size of the buffer for a stream
    get        Read from an input stream
    put        Write to an output stream
    getline    Read a line from an input stream (useful in text mode)
    read       Read a block of bytes from an input stream (useful in binary mode)
    write      Write a block of bytes to an output stream (useful in binary mode)
    gcount     Return the exact number of bytes read during the previous (unformatted) input
    pcount     Return the exact number of bytes written during the previous (unformatted) output
    seekg      Set the current file position in an input stream
    seekp      Set the current file position in an output stream
    tellg      Return the current position in an input stream
    tellp      Return the current position in an output stream
Table 12.7 shows the functions associated with IO streams that can be used to assess the status of a stream.

Table 12.7. Status Functions Available on the C++ IO Streams

    rdstate    Returns the current IO status (possible returns: goodbit | failbit | badbit | hardfail)
    good       Returns != 0 when there is no error
    eof        Returns != 0 when the end of the file is reached
    fail       Returns != 0 when there is an error; use rdstate to determine the error
    bad        Returns != 0 when there is an error; use rdstate to determine the error
    clear      Resets the error bits
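To make the status functions concrete, here is a small sketch (not from the original text; the filename is illustrative) of how a read loop can inspect the stream state afterward:

ifstream in("test.bin", ios::in | ios::binary);
char ch;
while (in.get(ch))
{
    // process ch here
}
if (in.eof())
    cout << "Reached end of file" << endl;
else if (in.fail())
    cout << "Read error, state: " << in.rdstate() << endl;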
Now look at how a C++ programmer can use a stream class to perform the same read/write behavior as in the section "Using FILE Functions." Listing 12.4 shows how stream functions can be used to read data from one binary file and write it to another.

Listing 12.4 Using stream Functions to Read and Write Binary Files
unsigned char ch;
ifstream inputStream(inputFilename, ios::in | ios::binary);
if (inputStream)
{
    ofstream outputStream(outputFilename, ios::out | ios::binary);
    if (outputStream)
    {
        while (inputStream.get(ch))
            outputStream.put(ch);
        outputStream.close();
    }
    else
        cout << "Error Opening File " << outputFilename << endl;
    inputStream.close();
}
else
    cout << "Error Opening File " << inputFilename << endl;
Note the flags used in Listing 12.4 when opening a stream. These flags are defined in C++ and can be used to force a certain behavior; ios::binary, for instance, opens a stream in binary mode. For a full list of flags, consult your language or compiler documentation. Listing 12.4 reads and writes single characters, which is equivalent to using a block size of 1 with the functions discussed in the section "Using FILE Functions." Reading and writing can be done faster by choosing larger blocks of data to transfer, as Listing 12.5 shows. The speed difference between the two techniques can be found in the later section "Test Results."

Listing 12.5 Using Stream Functions with a Larger Block Size
#define BLOCKSIZE 4096

unsigned char buf[BLOCKSIZE];    // transfer buffer (declaration reconstructed)

ifstream inputStream(inputFilename, ios::in | ios::binary);
if (inputStream)
{
    ofstream outputStream(outputFilename, ios::out | ios::binary);
    if (outputStream)
    {
        while (!inputStream.eof())
        {
            inputStream.read((unsigned char *)&buf, BLOCKSIZE);
            outputStream.write((unsigned char *)&buf, inputStream.gcount());
        }
        outputStream.close();
    }
    else
        cout << "Error Opening File " << outputFilename << endl;
    inputStream.close();
}
else
    cout << "Error Opening File " << inputFilename << endl;
Apart from using a certain block size, it is also possible to define a buffer for a stream. Writing through a buffer can be faster because output is flushed only when the buffer becomes full. In effect, a buffer combines several smaller read or write actions into a single large action. Listing 12.6 shows how buffers can be added to streams.

Listing 12.6 Reading and Writing a File Using ifstream and ofstream with Buffers
#define BLOCKSIZE     4096
#define STREAMBUFSIZE 8192

unsigned char streambuf1[STREAMBUFSIZE];
unsigned char streambuf2[STREAMBUFSIZE];
unsigned char buf[BLOCKSIZE];    // transfer buffer (declaration reconstructed)

ifstream inputStream;
inputStream.setbuf((char *)&streambuf1, STREAMBUFSIZE);
inputStream.open(inputFilename, ios::in | ios::binary);
if (inputStream)
{
    ofstream outputStream;
    outputStream.setbuf((char *)&streambuf2, STREAMBUFSIZE);
    outputStream.open(outputFilename, ios::out | ios::binary);
    if (outputStream)
    {
        while (!inputStream.eof())
        {
            inputStream.read((unsigned char *)&buf, BLOCKSIZE);
            outputStream.write((unsigned char *)&buf, inputStream.gcount());
        }
        outputStream.close();
    }
    else
        cout << "Error Opening File " << outputFilename << endl;
    inputStream.close();
}
else
    cout << "Error Opening File " << inputFilename << endl;
You will find a speed comparison between the different stream accesses and the traditional FILE functions in the "Test Results" section later in this chapter.

Random Access

As was the case with the FILE functions, streams also allow you to access files randomly. In the introduction to this section, you saw the two stream methods that allow you to do this: seekg() and seekp(). seekg() is used to manipulate the current pointer for input from the file (g = get), and seekp() is used to manipulate the current pointer for output to the file (p = put). Both seekg() and seekp() can be called with either one or two arguments. When one argument is given, it is taken as the offset from the beginning of the file; when two arguments are given, the second argument denotes the base of the offset. Values for this base can be found in Table 12.8.

Table 12.8. Positioning Keys for seekp/seekg

    ios::cur    Current position in the file
    ios::end    End of the file
    ios::beg    Beginning of the file
Listing 12.7 demonstrates random file access, using seekp() and seekg() to load and update every tenth record of a file containing 100,000 records.

Listing 12.7 Random Access with Streams
fstream dbStream;
Record Rec;

dbStream.open(dbFilename, ios::in | ios::out | ios::binary);
if (dbStream.good())
{
    for (unsigned long i = 0; i < 500000; i += 10)
    {
        dbStream.seekg(i*sizeof(Record));
        dbStream.read((unsigned char *)&Rec, sizeof(Rec));
        strcpy(Rec.name, "CHANGED 2");
        Rec.number = 99999;
        dbStream.seekp(i*sizeof(Record));
        dbStream.write((unsigned char *)&Rec, sizeof(Rec));
    }
    dbStream.close();
}
else
    cout << "Error Opening File " << dbFilename << endl;
In the Test Results section, stream random access is compared to FILE random access.
Test Results
This section compares the speed of the different techniques for transferring data to and from binary files. The program in the file 12Source02.cpp, which can be found on the Web site, can be used to perform these timing tests.

Read/Write Results

The program reads its input from a file test.bin, which you should place in the directory c:\tmp\. The test works best when you use a file of at least 64KB, because a block size of 65,536 bytes is used in one of the tests. The program writes its output into the file c:\tmp\test.out, which it creates itself. Of course, it is also possible to use different paths and filenames; just be sure to adjust the inputFilename string in main() accordingly. Table 12.9 presents the rows measured by the program for reading and writing binary files.

Table 12.9. Timing Results for Reading and Writing Binary Files

    Technique                 Block Size
    FILE functions            1
    FILE functions            4096
    FILE functions            65536
    Stream class              1
    Stream class              4096
    Stream class              BUFFERED

The first three rows of Table 12.9 show the results for the FILE functions using different block sizes (1; 4,096; and 65,536 bytes). Rows 4 and 5 show the results for the stream class with different block sizes (1 and 4,096), and row 6 shows the results for the stream class using buffered IO.

Random Access Results

For the random-access test, the program in the file 12Source02.cpp creates a file called test.db in the directory c:\tmp\. Of course, it is also possible to use a different path and/or filename; just be sure to adjust the dbFilename string in main() accordingly. Table 12.10 presents the timing results generated by this program.

Table 12.10. Timing Results for Random Access of Binary Files

    Stdio    Streams
    970      770
    6771     5310
    22080    10990
Table 12.10 shows not only that streams seem to be faster in random access than standard IO, but also that their advantage increases as more records are read. The reason is that the fstream class (which is used in this test) gets a buffer attached to it by default. This means that some IO requests will read from this buffer instead of from the file, whenever the required data happens to be buffered. The standard IO functions cannot easily use a larger block size to retrieve more than a single block from the file, because the blocks are no longer found sequentially. Once again this proves that it is crucial to think carefully about what kind of buffer and block size to use.
Efficient Text File IO

The difference between files that are opened as text files and those that are opened as binary files lies in whether or not the accessing functions interpret the Carriage Return (CR) and Line Feed (LF) codes found in the file data. A binary file is seen as one long string of byte values, whereas a text file is seen as a collection of strings separated by CR and/or LF codes. Generally, programmers open and create text files only when they are to contain actual character data (such as that defined by ASCII or Unicode). Giving CR and LF codes a special meaning allows you to build further functionality, which can be handy when working with files that contain text. It is very easy, for instance, to write functionality that searches for a certain line number in a text file or counts the number of lines in a file. This section looks into different uses of text files in programming and supplies ready-to-use text file functionality that can speed up text file access.
// Impl 1: Using fixed-length records:
stool          red          wood
chair          blue         iron
rocking chair  aquamarine   cast iron

// Impl 2: Using variable-length fields with field separators:
stool#red#wood#125600
chair#blue#iron#128000
rocking chair#aquamarine#cast iron#130000

// Impl 3: Using field and record separators:
stool#red#wood#125600@chair#blue#iron#128000@rocking chair#aquamarine#cast iron#130000
When you use fixed-length records, you pay a footprint penalty because each record in the database, regardless of its content, is the same size. This size is determined by the largest possible value of each field. The upside of this implementation is that searching through the database is fast (O(1)) because you can use direct access: the first byte of record 15 is found at position 15 times the record size in bytes. The opposite is true for variable-length records. Each record is made to fit its content exactly, and so footprint is less of an issue. However, access to the records needs to be sequential (O(n)), basically by counting the field separators. Which implementation is best for a given situation depends on the footprint requirements and the type of access that will be used most often.
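As a minimal sketch of the direct-access claim (not from the original text; the record size and filename are assumptions), reading record 15 of a fixed-length-record file could look like this:

#define RECORDSIZE 40                        // assumed fixed record size in bytes

char record[RECORDSIZE + 1];
FILE *f = fopen("furniture.db", "rb");       // filename is illustrative
if (f != NULL)
{
    fseek(f, 15L * RECORDSIZE, SEEK_SET);    // record n starts at n * RECORDSIZE
    fread(record, 1, RECORDSIZE, f);
    record[RECORDSIZE] = '\0';
    fclose(f);
}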
When using text files as data storage, simple text functions can be used to manipulate the database records. Listing 12.8 shows a function that can be used to retrieve the value of a certain field from a text database that has been loaded into memory.

Listing 12.8 Getting a Field from a String
// Signature reconstructed from the surrounding text; the function
// name GetField is an assumption.
char *GetField(char *s, char c, int n, char *fld)
{
    int i = 0, j = 0, lens = strlen(s);
    if (n < 1)
        return(NULL);
    if (n > 1)                          // Not the first field
    {
        i = GetOccurrence(s, c, n-1);
        if (i == -1)
            return(NULL);
        else
            i++;
    }
    j = GetOccurrence(s, c, n);
    if (j == -1)
        j = lens;
    if ((j <= i) || (i > lens) || (j > lens))
        return(NULL);
    // The part between i and j _is_ the field.
    strncpy(fld, &s[i], j-i);
    fld[j-i] = '\0';
    return(fld);
}
The function in Listing 12.8 takes four arguments: a pointer to the loaded database (s), the character that is used as a field separator (c), the number of the field to be found (n), and a pointer to a buffer into which the field is to be copied (fld). The function assumes that there is no field separator before the first field or after the last, and it can be called as follows:
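For instance (a sketch, not from the original text; the database string and buffer size are illustrative):

char field[64];
if (GetField("stool#red#wood#125600", '#', 2, field) != NULL)
    printf("Field 2 is: %s\n", field);    // prints "red"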
Listing 12.8 uses the helper function GetOccurrence() to locate separator characters; it is shown in Listing 12.9.

Listing 12.9 Finding the Nth Occurrence of a Character in a String

int GetOccurrence(char *s, char c, int n)
{
    // Return the position of the Nth occurrence of character c in s.
    // Return -1 if not found.
    int i, j = 0, lens = strlen(s);   // compute the length once, outside the loop

    for (i = 0; i < lens; i++)
    {
        if (s[i] == c)
        {
            j++;
            if (j >= n)
                return(i);
        }
    }
    return(-1);
}
Of course, Listings 12.8 and 12.9 show just one way of finding fields. You could, for instance, write a GetRecord() function that takes the number of fields per record as an input argument, so it can locate a specific record in a database.

The TextFile Class

As you saw in the section "Efficient Binary File IO," it is important to use intelligent buffering in order to achieve fast file access. You also read in the introduction to this section that binary files and text files are basically the same thing, apart from how CR and LF are interpreted by the access functions. This means that the theory introduced in "Efficient Binary File IO" can also be used for text files, as
long as CR and LF are treated correctly. This is good news, because when you decide to use text files as data storage or configuration files, you still want fast access, even when the files become large. This section introduces a new class that does just that. It is called TextFile and can be found in the files 12Source03.cpp and 12Source03.h on the Web site. The TextFile class essentially performs buffered binary file IO. This means that each file access will read a block of binary data from a file into a buffer, or write a block of binary data from a buffer to a file. Meanwhile, the user of the class can access the buffered data without incurring disc access overhead for each access. Basically this is what you already saw in the section "Efficient Binary File IO"; however, because the TextFile class knows that it is actually treating text files, it gives you some very handy functionality: it allows you to treat the buffer as a collection of lines of text. Remember, a line in a text file is terminated by a combination of CR and LF (under UNIX, just LF). This means that you keep reading lines from the TextFile class as you need them, while it takes care of disc access and buffering. The same is true for writing to a file. Because the theory behind the TextFile class can be found in the section "Efficient Binary File IO," this section gives only a description of the TextFile interface and examples of how to use it. Table 12.11 contains the interface description of the TextFile class.

Table 12.11. TextFile Interface

Function:     TextFile()
Type:         Constructor
Remarks:      The constructor initializes the member variables.

Function:     ~TextFile()
Type:         Destructor
Remarks:      The destructor calls the close() function.

Function:     open()
Type:         Public member function
Argument 1:   char*, pointer to a user-managed filename string. This string can contain path elements such as drive and directory names.
Argument 2:   char (optional), either 'r' for an input file to read from, or 'w' for an output file to write to. The default value for argument 2 is 'r'.
Return value: int, 0 when the file could not be opened.
Remarks:      Opens the specified file for the specified action.

Function:     close()
Type:         Public member function
Remarks:      Closes the current file if it is still open. When the file is opened for writing, the buffer is flushed to file if it is not yet empty.

Function:     getline()
Type:         Public member function
Argument 1:   char*, pointer to a user-managed buffer that receives the next line from the file.
Return value: int, 0 when the end of the file is reached; -1 when no file is opened or the file is opened for writing; n when successful, where n denotes the number of characters in the buffer, including the terminating zero.
Remarks:      Reads the next line from the file into the user-managed buffer, unless the file is not opened for reading or the end of the file is reached.

Function:     putline()
Type:         Public member function
Argument 1:   char*, pointer to a user-managed buffer that contains the next line to be written to file.
Argument 2:   int (optional), format to be used for line termination; options are CR, LF, CRLF, and LFCR; by default this argument is set to CRLF.
Return value: int, 0 when there is an error while writing to the file; -1 when no file is opened or the file is opened for reading; 1 when successful.
Remarks:      Writes the line in the user-managed buffer to the file.
You can use the TextFile class via normal instantiation, as the following example demonstrates:
{
    // Reconstructed example; the original was lost except for this brace.
    // The filename is illustrative.
    TextFile file;
    char line[255];

    if (file.open("settings.txt"))      // default mode is 'r' (read)
    {
        while (file.getline(line) > 0)
        {
            // Process the line here.
        }
        file.close();
    }
}
And of course it is also possible to use it via inheritance, as the following example demonstrates:
class MyFileClass : public TextFile
{
public:
    MyFileClass() : TextFile() {}

    int MyOpen(char *s)
    {
        // Do something extra here first, then:
        return open(s);     // call to TextFile::open()
    }
};
char tmp[255], *c;
int n;
TextFile file;                          // declaration reconstructed
file.open(textFilename);                // open call reconstructed; filename defined elsewhere

init(SearchString);                     // init function for the search.
while ((n = file.getline(tmp)) > 0)     // read all the lines.
{
    if ((c = search(tmp, n)) != NULL)   // search a line.
        break;
}
A case-insensitive search for SearchString is accomplished by calling the case-insensitive versions of the fast pattern match functions (see Chapter 10). A more complicated and generic way of searching through text files is the topic of the following section.

Using Wildcards

A well-known example of wildcard use is the DOS command

dir *.txt
The '*' indicates that any number of characters (even zero) of any kind can appear in that position. In this case it means that any string ending in '.txt' will match. Another special character that can be used in wildcards is '?', indicating that any single character can be placed in that position, sort of like a joker in a card game. The command
dir picture?.bmp
matches all files with the name 'picture' followed by exactly one more character and ending in '.bmp'. Possible matches are picture1.bmp, picture2.bmp, and pictureA.bmp, but not picture10.bmp or picture.bmp. It is likely that you will want to perform such a wildcard match on a text file or a text string in some of your programs at some point. Think, for instance, of finding function names in a source file using the search pattern "*(*)". Listings 12.10 and 12.11 show a function that can be used for wildcard matching. Table 12.12 explains its interface.

Table 12.12. The CompareStr() Arguments

    Argument 1      char*, a pointer to the string to compare with.
    Argument 2      char*, a pointer to a string specifying the search pattern, possibly containing wildcard characters.
    Argument 3      bool, indicating whether a case-sensitive search (false) or a case-insensitive search (true) is wanted.
    Argument 4      bool, indicating whether argument 1 may contain additional characters after the pattern.
    Return value    bool, indicating whether argument 1 and argument 2 match according to wildcard rules (false = no match, true = match).

Listing 12.10 Pattern Matching Using Wildcards, Part 1
// The wildcard characters; these constants are implied by the text
// (anychars = '*', anysinglechar = '?').
const char anychars      = '*';
const char anysinglechar = '?';

int CompareStr(char *s, char *pattern, bool ignoreCase, bool allowTrailing)
{
    long l, m, n = 0L;
    long lens = strlen(s), lenpattern = strlen(pattern);

    // Walk the pattern and compare it to the string.
    for (l = 0, m = 0; (l < lenpattern) && (m < lens); l++, m++)
    {
        if (pattern[l] != anychars)           // normal character or '?' ?
        {
            if (pattern[l] != anysinglechar)
            {
                if (ignoreCase == true)
                {
                    if (toupper(pattern[l]) != toupper(s[m]))
                        return(false);
                }
                else if (pattern[l] != s[m])
                    return(false);            // Character MISMATCH
            }
            // else: no proof that we don't match
        }
        else  // We've got an expandable WILDCARD (*); let's have a closer look!
        {
            l++;
            for (n = m; n < lens; n++)        // try to find a working combination
                if (CompareStr(&s[n], &pattern[l], ignoreCase, allowTrailing))
                    return(true);             // found == full match -> communicate to callers
            return(false);                    // couldn't find a working combination
        }
    }
The function CompareStr() walks through the string and the pattern, trying to match up the characters. As long as no wildcard characters are detected in the pattern, it does a normal character compare. However, as soon as a wildcard is detected, something special has to happen. The '?' character (anysinglechar) is easy enough to deal with; it matches any character in the string, so no compare is actually needed. The '*' character (anychars) is more complicated. When a '*' is detected, a match must be found between the rest of the pattern after the '*' character and some part of the string. In order to do this, CompareStr() calls itself recursively, each time starting with a different part of the string. But this is not the end of the function, for at this point there are three possible situations:

The end of both the string and the pattern has been reached. There is nothing left to compare and no mismatch occurred, which means the pattern matches the string.

The end of the string has been reached, but there are still characters left in the pattern. In this case, the string and the pattern match only if the remaining characters in the pattern are all '*'.

The end of the pattern has been reached, but there are still characters left in the string. In this case, the string and the pattern match only if the last character of the pattern is a '*'.

Listing 12.11 shows the code that takes care of these three cases.

Listing 12.11 Pattern Matching Using Wildcards, Part 2
    if (l < lenpattern)                  // Trailing *'s are OK too!
    {
        for (n = l; n < lenpattern; n++)
            if (pattern[n] != anychars)
                return(false);
    }
    else
    {
        if (pattern[l-1] != anychars)
        {
            // The remainder of this listing was lost; the following lines
            // are a reconstruction based on the three cases in the text.
            if (m < lens && !allowTrailing)
                return(false);           // string characters left over
        }
    }
    return(true);
}
Applied repeatedly over a text buffer, CompareStr() can collect every match of a pattern:

#define MAXRESULTS 100    // assumed value; the definition was not in the fragment

char *resultArray[MAXRESULTS];
int index = 0, n = 0;

// file points to the loaded text, lengthOfFile is its size, and pattern
// holds the wildcard pattern; all three are defined elsewhere.
while (n < lengthOfFile && index < MAXRESULTS)
{
    if (CompareStr(&file[n], pattern, true, true))
        resultArray[index++] = &file[n];
    n++;
}
Summary
Different techniques can be used for writing text strings to the screen. Table 12.2 showed the results of a speed comparison between these techniques; the results were generated with a program that can be found in the file 12Source01.cpp. There are also different techniques for reading and writing binary files. Table 12.9 showed the results of a speed comparison between C- and C++-style techniques using different buffer sizes; those results were generated with a program that can be found in the file 12Source02.cpp. This chapter also showed different ways of speeding up text file access by using intelligent parsing functions and by loading text as binary blocks. Examples of this are given in the functions and classes in the files 12Source02.cpp, 12Source03.cpp, 12Source03.h, and 12Source04.cpp.
Chapter 13. Optimizing Your Code Further

Arithmetic Operations
Most programs perform arithmetic operations at some point. These can be as simple as a small addition or multiplication, or as complex as three-dimensional volume calculations. Whatever the case may be, arithmetic operations can often be broken down into several smaller, simpler steps involving a series of basic calculations and conversions. These steps are ideal candidates for optimization because they are generally used repeatedly on large amounts of data, and most contain looped or recursive functionality. Many of the optimization techniques discussed in this book are well suited to arithmetic optimization; think, for instance, of rewriting recursive routines into iterative routines, using loop unrolling, creating tables of predetermined values, and optimizing mathematical equations. Because arithmetic operations are an important area for optimization, one which in practice often gets overlooked or executed in a far from optimal form, the following sections give tips and examples on how to use optimization techniques specifically for arithmetic operations.
Listing 13.1 Counting Multiples Using the % Operator

int i, k;
long count = 0;

for (k = 0; k < DATALEN; k++)
    for (i = 1; i <= 256; i <<= 1)
    {
        if ((data[k] % i) == 0)
            count++;
    }
The example in Listing 13.1 checks whether each byte in a data block is a multiple of 1, 2, 4, 8, ..., 128. Because this example checks for very specific (but certainly often occurring) numbers, it can be optimized with a fast and simple bitwise check, as shown in Listing 13.2.

Listing 13.2 Bitwise Counting Multiples
int i, k;
long count = 0;

for (k = 0; k < DATALEN; k++)
    for (i = 1; i <= 256; i <<= 1)
    {
        if ((data[k] & (i-1)) == 0)
            count++;
    }
The example in Listing 13.2 avoids the time-consuming % operator by using the knowledge that when a number is a multiple of a power of two x, all the bits representing x-1 are unset. This means, for instance, that a number is a multiple of eight when its lowest three bits are not set (the binary representation of eight is 1000; the binary representation of seven is 0111). The result of using a bitwise & instead of the % operator is that the example routine becomes up to five times faster. A program that demonstrates this can be found in the file 13Source01.cpp on the Web site. More examples of bit operator use can be found in the listings of the following sections.
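The same trick generalizes to computing remainders: for any power of two N and any unsigned value x, the expression x % N equals x & (N-1). A small sketch (not from the original text):

unsigned int x = 1234;
unsigned int mod16a = x % 16;           // remainder via division
unsigned int mod16b = x & (16 - 1);     // same remainder via bit masking
// mod16a == mod16b for any unsigned x, because 16 is a power of two.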
A related use of bit operations is counting the number of set bits in a data block, for instance when gathering parity information. Listing 13.3 shows a straightforward implementation.

Listing 13.3 Counting Set Bits with a Loop

int i, k;
long count = 0;

for (k = 0; k < DATALEN; k++)
    for (i = 1; i <= 256; i <<= 1)
    {
        if (data[k] & i)    // loop body reconstructed: count each set bit
            count++;
    }
Listing 13.4 Counting Set Bits with an Unrolled Loop

int k;
long count = 0;

for (k = 0; k < DATALEN; k++)
{
    count += ((data[k] & 128) >> 7) +
             ((data[k] &  64) >> 6) +
             ((data[k] &  32) >> 5) +
             ((data[k] &  16) >> 4) +
             ((data[k] &   8) >> 3) +
             ((data[k] &   4) >> 2) +
             ((data[k] &   2) >> 1) +
              (data[k] &   1);
}
As with Listing 13.3, the variable count in Listing 13.4 contains the number of set bits in the data block upon loop termination. Notice that the inner loop of Listing 13.3 has been transformed into eight fixed bitwise ANDs. The results of these AND operations are shifted with the >> operator, so each bit that is set increments count by exactly one. Counting set bits with the code in Listing 13.4 is up to five times faster than counting set bits with the code in Listing 13.3. This loop unrolling is a valuable calculation optimization here; it allows you to process five times as much data per second while collecting parity information. But this code can be made faster still. When you analyze the problem domain carefully, you will see that static information is being gathered dynamically. The static data is the number of bits set in a byte with a certain value. For instance, a byte with the value 1 will always have only a single bit set (the binary representation of 1 is 0001). Similarly, each byte with the value 5 has two bits set (the binary representation of 5 is 0101). It is possible, therefore, to create a predetermined list containing the number of set bits for each value a byte can have. This list contains 256 values and can be used as shown in Listing 13.5.

Listing 13.5 Using a Lookup Table to Count Set Bits in a Data Block
//                              value: 0 1 2 3 4 5 6 7 8 ...
unsigned char lookupTable[] = { 0,1,1,2,1,2,2,3,1 /* ... 256 entries in total ... */ };

int i;
long count = 0;

for (i = 0; i < DATALEN; i++)
{
    count += lookupTable[data[i]];
}
In Listing 13.5, each byte of the data block is used as an index into lookupTable in order to determine the number of set bits for its value. For instance, a data byte with the value 8 will cause the number 1 to be retrieved from the lookup table (the binary representation of 8 is 1000). The code in Listing 13.5 is up to four times faster than that of Listing 13.4. The file 13Source01.cpp contains a program that compares the speeds of the bit count solutions presented in Listings 13.3, 13.4, and 13.5. Table 13.1 shows the results of this program.

Table 13.1. Results of Counting the Number of Bits Set in a Block of 8192 Bytes

    Technique                       Time
    Loop (Listing 13.3)             6210
    Unrolled loop (Listing 13.4)    1160
    Lookup table (Listing 13.5)      320
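The 256-entry table does not have to be typed in by hand; it can be generated once at program startup. A minimal sketch (not from the original text):

unsigned char lookupTable[256];

void FillLookupTable(void)
{
    for (int v = 0; v < 256; v++)
    {
        unsigned char bits = 0;
        for (int mask = 1; mask <= 128; mask <<= 1)   // test all 8 bit positions
            if (v & mask)
                bits++;
        lookupTable[v] = bits;
    }
}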
The final solution, put forward in Listing 13.5, is up to twenty times faster than the first solution, put forward in Listing 13.3. Clearly it pays to spend some time determining the static parts of calculations and how to handle them. A remaining question is whether the static allocation of a 256-byte array can be justified by the performance won. This is true for most systems (even most memory-tight embedded systems). Notice, moreover, that transforming the solution of Listing 13.4 into a macro and using it a few times in a source file will already take up more than 256 bytes.

Converting Byte Values to ASCII

Often programs need to convert binary data into text strings for output. As you may expect, such conversions contain a high level of static information. A conversion often used by programmers is that of binary data into hexadecimal text strings. Listing 13.6 shows the kind of conversion functionality that is often used for this purpose.

Listing 13.6 Standard Conversion of Byte Values to Hex String
inline void Nibble2Ascii(int nibble, char &c)
{
    if (nibble < 10)
        c = (char)(0x30 + nibble);   // '0'..'9'
    else
        c = (char)(0x37 + nibble);   // 'A'..'F'
}

void HexStr1(char *s, int val)
{
    Nibble2Ascii(val >> 4, s[0]);
    Nibble2Ascii(val % 16, s[1]);
    s[2] = '\0';
}
When you look for static information in binary-to-hex string conversions, you will notice that the text representation of each nibble is predetermined and does not have to be calculated at all. Once again a simple array with predetermined answers will do nicely. Listing 13.7 demonstrates a conversion routine that is more than twice as fast as that of Listing 13.6.

Listing 13.7 Optimized Conversion of Byte Values to Hex String
char HexString[] = "0123456789ABCDEF";

void HexStr2(char *s, int val)
{
    s[0] = HexString[val >> 4];
    s[1] = HexString[val & 0xf];
    s[2] = '\0';
}
The same can be done for conversions of byte values to decimal strings. It is of course possible to convert each byte individually using a standard C/C++ function such as sprintf(). Much faster, however, is a lookup table that holds the decimal string for each of the 256 possible byte values:

char decLookup[256][4];   // fill this at some point.
~
// Convert byte j to decimal.
s = decLookup[j];
This way each byte value is converted only once, and every subsequent conversion is suddenly over forty times faster. The footprint consideration here is whether to make a static array of values in a source file (in which case the runtime footprint as well as the storage footprint increases with the size of the table; refer to Chapter 1, "Optimizing: What Is It All About?" for more details on footprint terminology) or to create the array dynamically during runtime. When you choose the second option, do not be fooled into thinking that the footprint of the table can be optimized by using 256 pointers to variable-length strings in order to save some bytes on the first 99 entries. As it is, the table is 4*256 bytes large, which is also the size of an array of 256 pointers. The file 13Source01.cpp on the Web site contains a program that compares the conversion speeds of several conversion methods. It also shows how standard C/C++ functions like toupper() and tolower() can be optimized in a similar way.
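Filling the table is a one-time cost; a minimal sketch (not from the original text; the function name is invented):

#include <stdio.h>

char decLookup[256][4];

void FillDecLookup(void)
{
    for (int v = 0; v < 256; v++)
        sprintf(decLookup[v], "%d", v);   // "0".."255": at most 3 digits + '\0'
}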
For example, the sum of a range of n values that starts at base value b and increases with step size s is

    sum = n*b + (n * s * (n - 1)) / 2

which, with the division by two written as a shift, becomes a single statement:

    sum = n * b + ((n * s * (n - 1)) >> 1);
The variable step in Listing 13.9 is only needed when you want to use a non-fixed step size that is determined dynamically by looking at the range itself. Note that the advantage here is that determining the sum went from an O(n) algorithm (Listing 13.8) to an O(1) algorithm (Listing 13.9); refer to Chapter 5, "Measuring Time and Complexity," for more details. This means that for Listing 13.9 the amount of time needed to determine the sum is not related to the size of the data block. It will not surprise you that this solution is therefore thousands of times faster. The file 13Source01.cpp on the Web site contains a program that measures the performance difference between the solutions of Listings 13.8 and 13.9. Note that a good math book can go a long way in helping you find and optimize mathematical functions.
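As an illustration of what such an O(1) computation can look like (a sketch under the assumptions above, not necessarily the book's Listing 13.9; the function name is invented):

long RangeSum(long b, long s, long n)
{
    // Sum of the n terms b, b+s, b+2s, ..., b+(n-1)*s.
    // n*(n-1) is always even, so the shift loses no precision.
    return n * b + ((n * s * (n - 1)) >> 1);
}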
Multitasking
When used correctly, multitasking can greatly enhance program performance. However, it should not be used indiscriminately, because it brings with it some overhead and extra things to look out for during design and implementation. This section explains what multitasking is exactly, how to handle the problems it can bring along, and what kind of hidden overhead to watch out for.

What Is Multitasking?

Basically, single-processor systems can perform only one task at a time. A processor takes an instruction from memory and executes it. After completion, the processor takes the next instruction from memory and executes it. A collection of instructions that is executed in this sequential way is called a task. You could compare a processor performing a task to a person reading instructions from a manual. Generally, a person will read instructions from a manual one at a time and follow them carefully. Only when an instruction has been followed successfully is the next instruction dealt with. However, if you had five people working for you, you could give them five manuals and have them execute five different tasks for you simultaneously. This is true also for multiprocessor systems; each processor can perform a task independently of the other processors.
No doubt you know already that it is possible for single-processor systems to have multitasking operating systems. In fact, any operating system that allows you to run more than one program at a time is in effect a multitasking OS. Think of starting up a calculator program while you are already using a word processor and perhaps a paint program. So how is this possible? Let's return to the example of the person (let's call him Bob) reading a manual. What if you gave Bob five manuals and a stopwatch, and told him to switch manuals every ten minutes? This way he would seem to perform five different tasks. When you look at a single task, however, it advances for only ten minutes, after which it halts for another forty minutes. In fact, it probably halts for more than forty minutes, because Bob needs time to put down one manual and pick up another (maybe he even needs some time to find his place in the manual he just picked up, or to determine the order in which he deals with the manuals). This is also what happens when a multitasking OS with a single processor runs more than one task; the processor still performs only one task at a time, but it switches between tasks at certain moments. This behavior is called task switching. Task switching brings some overhead with it, which means that the total amount of time spent on pure task execution decreases as the number of task switches increases. In the sections that follow you will see what kind of overhead is incurred. Tasks can take different shapes. The next three sections discuss the shapes called process, thread, and fiber.

What Is a Process?

Although not all literature and all systems define the word process in exactly the same way, there are some general things that can be said about processes. Most often the word process is used to indicate a program, or at least program characteristics. When a processor switches from one process to the next, it needs a lot of information to be able to continue with the next process at exactly the same place/context where it left off last time. This is the only way process execution can continue as if it were never interrupted by a task switch. A process is therefore defined by the information that is needed during a task switch. Think of:

The values of the CPU registers. These contain the context created by executed processor instructions.

A stack. This contains the context created by function calls, variable definitions, and so on (refer to Chapter 8, "Functions," for more details on registers in function calls).

Address space. This memory is sometimes called the apartment or working set. It is that part of system memory (the range of memory addresses) that is available to the process.

Meta data. Think of security privileges, current working directory, process name, process priority, and so on.

This is a lot of information, and consequently process switching is the most expensive kind of task switch there is, as far as overhead is concerned. The reason for this is the robustness of the OS. By using virtual memory management (refer to Chapter 9, "Efficient Memory Management," for more details on memory management) to give each process its own private piece of memory, it is possible to create the illusion that each process has the system all to itself. This means that when a process goes into some kind of faulty state, it can mess up its own memory and resources but not those of another process. This holds true as long as the virtual memory management system does not get messed up and no resources are locked by the misbehaving process. The two sections that follow show some lighter ways to switch between different tasks. Listing 13.10 shows how new processes can be created under Windows. In the section "Task Switching" you will see what kinds of strategies an OS can use to determine when and how to switch between tasks.

Listing 13.10 Creating a New Process Under Windows
#include <windows.h>

void main(void)
{
    STARTUPINFO st;
    PROCESS_INFORMATION pr;

    memset(&st, 0, sizeof(st));
    st.cb = sizeof(st);
    CreateProcess(NULL, "c:\\windows\\calc.exe", NULL, NULL,
                  1, 0, NULL, NULL, &st, &pr);
}
Note that starting a process under Windows is actually nothing more than telling the OS to start up a specific executable. Certain parameters can be set for this executable, such as working directory, command line arguments, and so on. Consult your compiler
documentation for more details on process execution. Listing 13.11 shows how a new process can be created under UNIX.

Listing 13.11 Creating a New Process Under UNIX
#include <stdio.h>
#include <unistd.h>

void a(void)
{
    for (;;) printf("a");
}

void b(void)
{
    for (;;) printf("b");
}

void main()
{
    if (fork() == 0)
        a();    // child process
    b();        // parent process
}
Note that under UNIX it is possible to start a new process with a function from the original address space; a new address space is created in which this function is executed. The next section discusses a less overhead-intense way of creating a new task.

What Is a Thread?

Most often the word thread is used to indicate a path of execution within a process. As you have already seen, different processes can be run simultaneously by switching between them. The same trick can be performed within a process; that is, different parts of a process can be run simultaneously by switching between tasks within the process. These kinds of tasks are called threads. As a thread lives within a certain process, it needs less defining information:

The values of the CPU registers

A stack

Thread priority (often specified in relation to the priority of the process in which the thread lives)

All threads of a process inherit the remaining defining information from the process:

Address space

Meta data

There are three conclusions to be drawn from the preceding two lists:

1. Task switching between two threads in the same process does not introduce much overhead.

2. All threads in the same process share the same address space and can therefore use the same global data (variables, file handles, and so on). They can, of course, also mess up each other's memory. For more information see the section titled "Problems with Multitasking."

3. Task switching between threads of different processes introduces the same overhead as any other task switch between processes.

Note that each active process has at least one thread of execution. This thread is called the main thread. Listing 13.12 shows how threads can be used under Windows.

Listing 13.12 Creating Multiple Threads in Windows
#include <windows.h>
#include <process.h>
#include <fstream.h>

struct StartInput
{
    char name[8];
    int  number;
    int  length;
};

void StartThread(void* startInput)
{
    StartInput *in = (StartInput*) startInput;
    int k = in->number;
    int j = in->length + k;
    for (; k < j; k++)
    {
        cout << k << in->name << endl;
    }
    _endthread();
}
void main(void)
{
    StartInput startInputA, startInputB;

    strcpy(startInputA.name, "ThreadA");
    strcpy(startInputB.name, "ThreadB");
    startInputA.number = 0;
    startInputB.number = 5;
    startInputA.length = 10;
    startInputB.length = 15;

    _beginthread(StartThread, 0, (void*) &startInputA);
    _beginthread(StartThread, 0, (void*) &startInputB);
    Sleep(6000);
}
Listing 13.12 shows one way of creating two new threads from within a Windows process. By passing a pointer to a function in the call to _beginthread, a new thread is created and scheduled for task switching by the OS. Its execution starts with the first instruction of the function that was passed, in this case the function StartThread. As the format used for passing arguments to a new thread is fixed (void*), a casting trick must be performed in order to pass more than 4 bytes of information: in Listing 13.12 a pointer to a structure is passed. Because the receiving function StartThread knows exactly what kind of structure to expect, it can retrieve all the structure fields without a problem. There are three more interesting points to note about Listing 13.12. First, the two created threads receive pointers to two different structures, startInputA and startInputB. In a sequential program you would probably have passed startInputA first to one function, changed some of its fields, and then passed it to another function. However, because the two threads run simultaneously, and because the threads share the same address space, changing any field in startInputA would change the values used in both threads! Second, the same function StartThread is used as the starting address for both threads. This is possible because threads have their own private stack, which means each thread has a private copy of k and j. Third, the main thread is put to sleep after starting the two new threads. The reason for this is that all threads of a process terminate when the process itself (the main thread) terminates. The Sleep function is used as a trick to keep the main thread alive long enough for the two new threads to do their jobs. Strictly speaking, the _endthread call in the function StartThread is not really necessary, as threads end automatically when they run out of instructions. Sleep is an ideal way to suspend a thread, because a sleeping thread is not scheduled until its nap time is over. Using Sleep is therefore an infinitely better construction than using a long-running loop. Not only does Sleep guarantee more precisely when the thread will continue executing, but a loop is also very processor intense, slowing down all other tasks on the system. Do not forget to set the compiler code generation settings to multithreaded when writing multithreaded programs under Windows. In the file 13Source02.cpp on the Web site, you can find what the same program would look like under UNIX. Compile it with the multithreading options your UNIX compiler requires.
Problems with Multitasking

When multiple threads access a resource to which at least one of them writes, something special must be done. The reason for this is that the programmer of a thread can never anticipate when a task switch will occur, and therefore cannot guarantee that it will never occur during a write action. A thread may want to change the IP address in a shared structure from 127.1.1.0 to 212.33.33.00. If the OS switches from this writing task to a task that reads from this structure in the middle of the write action, the reading thread will read a corrupt IP address: 212.33.1.0. This is why programmers need a way to lock a certain resource for a specific thread. There are different kinds of locks available on most OSes, but they all work according to the same basic principles:

A lock can be claimed and released.

The claim and release functions of locks are atomic; a task switch cannot occur during the execution of these functions.

When thread A tries to claim a lock that is already claimed by thread B, thread A is put on hold until thread B releases the lock. When thread B releases the lock, thread A becomes the owner of the lock and continues its execution. From that moment on, other threads are blocked when trying to claim the lock; these threads in turn stay blocked until thread A releases the lock.

Most locks are associated in some way with a list of threads, so if several threads try to claim the same lock, the first will own it and the remaining threads wait in line for their turn.

The programmer makes the association between a lock and a resource.

The last item in the preceding list is a very important one. It means that it is perfectly possible to program a thread to use a resource without claiming the associated lock first. This often causes problems when different programmers work on the same multithreaded code; a programmer making changes to an existing thread may not be aware of the fact that certain resources are associated with locks, and may add code that uses the resources without locking them first. Listing 13.13 shows a Windows program with two worker threads. One thread writes to a structure at given times and the other thread reads from it. The structure is protected by a lock that is imaginatively called Lock. The functions EnterCriticalSection and LeaveCriticalSection are used, respectively, to claim and release this lock. The definition of the lock and of the claim and release calls is of course OS specific. The set of instructions placed between the claim and release of a lock is called a critical section, which is where these OS calls get their names.

Listing 13.13 Protecting a Shared Resource in a Windows Program
CRITICAL_SECTION Lock;

// Shared structure; its exact definition was not part of the fragment.
struct { int number; } SharedData;

void ReadThread(void* dummy)
{
    int prevnr = -1;
    int nr = 0;

    // Get the initial number and print it.
    EnterCriticalSection(&Lock);
    nr = SharedData.number;
    LeaveCriticalSection(&Lock);
    prevnr = nr;
    cout << nr << endl;

    for (;;)
    {
        EnterCriticalSection(&Lock);
        nr = SharedData.number;
        LeaveCriticalSection(&Lock);
        // Print only changed numbers.
        if (nr != prevnr)
        {
            cout << nr << endl;
            prevnr = nr;    // remember the last printed number (bug fix)
        }
        Sleep(100);
    }
}
void WriteThread(void* dummy)
{
    int nr = 100;
    for (;;)
    {
        EnterCriticalSection(&Lock);
        SharedData.number = nr++;
        LeaveCriticalSection(&Lock);
        Sleep(5);
    }
}

// Note: InitializeCriticalSection(&Lock) must be called (for instance
// in main) before the two threads are started.
A locking strategy can be refined further by distinguishing readers from writers, so that readers do not needlessly block each other. The following fragment demonstrates such a setup with two locks, readlock and writelock:

void ReadThread()
{
    EnterCriticalSection(&readlock);
    // No other ReadThread can access the data during this critical
    // section. One other ReadWriteThread can access the data, but
    // only for reading.

    // Do as much reading here as you want.

    LeaveCriticalSection(&readlock);
    // Now another ReadThread or ReadWriteThread can lock the data;
    // do not perform any reading or writing here.
}
void ReadWriteThread()
{
    EnterCriticalSection(&writelock);
    // No other ReadWriteThread will access the data for reading or
    // writing during this critical section.

    // Do as much reading here as you want, because only this thread
    // and one ReadThread can access the data.

    EnterCriticalSection(&readlock);
    // No other thread (Read nor ReadWrite) will access the data
    // during this critical section.

    // Here you can change the data (write to the data).

    LeaveCriticalSection(&readlock);
    // Back to the previous critical section; readlock has been
    // released, so only perform read actions here.

    LeaveCriticalSection(&writelock);
    // Now another ReadThread and ReadWriteThread can lock the data;
    // do not perform any reading or writing here.
}
Several threads can use the ReadThread and ReadWriteThread functions. A thread that wants to read as well as write the data structure has to claim the writelock just before accessing the data for reading. This way it is sure that no other read/write thread is changing the data, while another reading thread can continue unbothered. When a read/write thread wants to change the data, it has to claim both the writelock and the readlock (in that order!) to ensure that no other thread has access to the data. These kinds of optimizations can improve program performance by eliminating as much of the idle waiting time incurred by locks as possible. However, the use of the locks can become quite complex and needs to be documented well, especially in light of future changes that may be made to the program. Functions that contain sufficient locks, so they can be called by different threads without causing problems with shared resources, are called re-entrant functions. As was stated before, there are many different kinds of locks. Some use counters with which the programmer can specify how many threads can claim a lock. The characteristics of the different kinds of locks available to you on a certain OS-plus-compiler combination should be taken into account when devising an optimized locking strategy. Consult your OS and compiler documentation for more information.

Deadlocks

Using locks enables programmers to create threads that do not corrupt shared resources. However, these locks have the effect that they halt (or block) certain threads at certain times, and halted threads cannot release locks that they have claimed. When a locking strategy is not carefully designed from the start, this can cause a situation in which two or more threads wait for each other to release a lock. This is called a deadlock. The pseudocode in Listing 13.15 demonstrates a potential deadlock.

Listing 13.15 Potential Deadlock Situation
void ThreadA(void* dummy)    // signature reconstructed to mirror ThreadB
{
    EnterCriticalSection(&resourceA);
    // Do something with resourceA
    EnterCriticalSection(&resourceB);
    // Do something with resourceB (and A)
    LeaveCriticalSection(&resourceB);
    LeaveCriticalSection(&resourceA);
}
void ThreadB(void* dummy)
{
    EnterCriticalSection(&resourceB);
    // Do something with resourceB
    EnterCriticalSection(&resourceA);
    // Do something with resourceA (and B)
    LeaveCriticalSection(&resourceA);
    LeaveCriticalSection(&resourceB);
}
A deadlock occurs in Listing 13.15 when the following happens:

1. ThreadA claims resourceA.

2. A task switch occurs, which makes ThreadB active.

3. ThreadB claims resourceB.

No matter what happens after this, ThreadA and ThreadB will ultimately become deadlocked. ThreadB will try to claim resourceA before it releases resourceB, and ThreadA will not even consider releasing resourceA before it is able to claim resourceB. ThreadA will block on the call EnterCriticalSection(&resourceB) and ThreadB will block on the call EnterCriticalSection(&resourceA). Listing 13.15 also demonstrates why deadlocks are often so hard to find. Potential deadlocks may never actually occur, or they may occur only sporadically. The fewer instructions placed between the claiming of the resources (steps 1 and 3 in the preceding list), the less likely it becomes that a task switch will occur exactly at that point. This is also why debugging deadlocks is so difficult; as you saw in Chapter 4, "Tools and Languages," the timing of a debug executable is very different from that of a release executable. Another good way to introduce a deadlock is to claim a lock and never release it. This can happen when functions become more complicated and/or claims and releases are harder to match.
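The standard remedy implied by the example is to make every thread claim the locks in the same order. A minimal sketch of a corrected ThreadB (an assumption, not code from the book):

void ThreadB_fixed(void* dummy)
{
    // Claim resourceA first, exactly like ThreadA does; with a single
    // global claiming order, the circular wait can no longer occur.
    EnterCriticalSection(&resourceA);
    EnterCriticalSection(&resourceB);
    // Do something with resourceB (and A)
    LeaveCriticalSection(&resourceB);
    LeaveCriticalSection(&resourceA);
}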
Locks can also protect access to shared devices, as the following fragment illustrates:

// Wait until the device can be claimed.
Claim(info->dev);
// Lock the data for output.
EnterCriticalSection(info->dataLock);
// Write the data to the device.
Callback Functions
Callback functions combine well with multithreaded programming as a way of increasing overall program performance. Callback functions were introduced in Chapter 1 as a way for tasks to signal that they have completed a certain job. Listing 13.16 can easily be adapted to use a callback function that signals that the backup was successfully written. There are more applications for callback functions. Listing 13.17 shows how a timer function can be started in a separate thread. This timer receives information on how many times per second it should call a certain callback function, and how many times this callback function should be called in total.

Listing 13.17 Using a Timer Callback
#include <windows.h>
#include <process.h>
#include <fstream.h>

void PrintStatus(int in)
{
    cout << in << endl;
}

struct TimerData
{
    void (*CallBack)(int in);
    int delay, nrAlarms, data;
};

void Timer(void *input)
{
    TimerData *info = (TimerData*) input;
    for (int i = 0; i < info->nrAlarms; i++)
    {
        Sleep(info->delay);
        info->CallBack(info->data);
    }
}

void main(void)
{
    TimerData timedat;

    timedat.CallBack = PrintStatus;
    timedat.data     = 5;
    timedat.delay    = 1000;
    timedat.nrAlarms = 20;

    // The remainder of main was lost; presumably the timer thread is
    // started here and the main thread is kept alive long enough:
    _beginthread(Timer, 0, (void*) &timedat);
    Sleep((timedat.nrAlarms + 1) * timedat.delay);
}
After starting the timer thread, the main thread continues with the work at hand. Note that the timer in Listing 13.17 can be activated with different callback functions in order to do different kinds of things. Timers are sometimes used in watchdog threads. A watchdog periodically checks the status of a certain resource (a task, a device, a piece of memory, and so on). When this resource is found to be in a faulty state (printer out of paper, tasks caught in a deadlock), it can perform a predefined action, such as warning the user or rebooting the system.
Summary
Often arithmetic functionality can be broken down into several smaller, simpler steps that are ideal candidates for optimization. These smaller steps are generally used repeatedly on large amounts of data, and most contain looped or recursive functionality, which is why high levels of gain can be had. Several optimization techniques can be introduced in arithmetic functionality:

Rewriting recursive routines into iterative routines

Loop unrolling

Using data-specific knowledge (bit arithmetic operations and so on)

Using lists of predetermined answers or intermediate results

Using optimized mathematical equations

Multitasking is either performing different tasks on different processors simultaneously (multiprocessor system) or switching between tasks on a single processor (single-processor system). Multitasking can be used to enhance the performance of a program running on a single-processor system by, for instance, moving different program states or slow device interaction into a separate task. This way, waiting times do not slow down the main program, and complex polling strategies can be simplified. Tasks come in different shapes, each shape with its own characteristics:

Process

Thread

Fiber

While multitasking software can be the answer to many problems, there are things to look out for, such as task-switching overhead, the possibility of deadlocks, shared memory corruption, and so on.
Chapter 14. Tips

Tricks
Little-known facts that can make the job of a programmer easier are the focus of these sections. Some of them help determine system and development environment characteristics, and they are reused in the second half of the chapter, in which compatibility and portability of sources is discussed. Listing 14.1 shows how you can determine at runtime whether a source was compiled with a C or a C++ compiler.

Listing 14.1 Determining C or C++ Compilation at Runtime

inline int cplusplus()
{
    // In C, sizeof('A') equals sizeof(int); in C++ it equals sizeof(char).
    return (sizeof(char) == sizeof('A'));
}

if (cplusplus())    // note: the function must be called
    printf("Compiled with a C++ compiler");
else
    printf("Compiled with a C compiler");
Determining the kind of compiler can, of course, also be done at compile time, which is preferable in most cases because it costs only compile time (no function needs to be executed at runtime to make the assessment). Listing 14.2 shows how you can determine at compile time whether a C or C++ compiler is used.

Listing 14.2 Determining C or C++ Compilation at Compile Time
#ifdef __cplusplus
    printf("Compiled with a C++ compiler");
#else
    printf("Compiled with a C compiler");
#endif
The reason this works is that C++ compilers define the __cplusplus macro automatically. For more differences between C and C++, refer to the section "Compatibility" later in this chapter.
Endianness
As explained in Chapter 10, "Blocks of Data," in the section on Radix sort, endianness is the byte order used by a system (storing the hexadecimal long word 0xAABBCCDD in memory in the byte order 0xAABBCCDD or 0xDDCCBBAA). Knowing the byte order of a system becomes important, for instance, when sources will be used on different platforms (the development platform is perhaps different from the target platform; refer to Chapter 1, "Optimizing: What Is It All About?"), or when they contain code that needs to communicate over networks and so on. Listing 14.3 shows how you can test the endianness of a platform by simply storing the value 1 (0x00000001) in a long word and checking in which byte the set bit ends up. Listing 14.3 Determining Endianness of a System
bool little_endian()
{
    unsigned int ii = 0x01;

    if (*(unsigned char *)&ii == 0x01)
        return(true);
    else
        return(false);
}

void cpu_info()
{
    if (little_endian())
        printf("LITTLE Endian, ");
    else
        printf("BIG Endian, ");

    printf("%d bit architecture\n", sizeof(int)*8);
}
Refer to your compiler and language documentation for more details on the statements that define variable argument lists. Listing 14.4 shows an implementation of a function that uses a variable number of arguments to receive a list of filenames that it should open. Note that this function also receives two normal arguments. Listing 14.4 Implementing Functions with a Variable Number of Arguments
#include <stdio.h>
#include <stdarg.h>

int OpenFileArray(FILE ***array, char *MODE, char *filename, ...)
{
    char *pName      = NULL;
    int   nrFiles    = 0;
    int   arrayIndex = 0;

    // Determine the number of files in the list.
    if (filename == NULL)
        return 0;

    va_list listIndex;
    va_start(listIndex, filename);
    do
    {
        nrFiles++;
        pName = va_arg(listIndex, char*);
    } while (pName != NULL);
    va_end(listIndex);

    // Reserve memory for the file handles plus an array terminator.
    *array = new FILE*[nrFiles+1];

    // Open the files.
    pName = filename;
    va_start(listIndex, filename);
    do
    {
        if (!((*array)[arrayIndex++] = fopen(pName, MODE)))
        {
            // Not all files could be opened.
            (*array)[arrayIndex-1] = NULL;
            va_end(listIndex);
            return 0;
        }
        pName = va_arg(listIndex, char*);
    } while (pName != NULL);
    va_end(listIndex);
    (*array)[arrayIndex] = NULL;

    // All files opened.
    return 1;
}
The first argument of the function OpenFileArray in Listing 14.4 is a pointer to an array of FILE pointers. Through it, the function returns an array filled with pointers to the files it has opened. The second argument is a string that defines the mode in which the files are opened; this is the same argument that fopen expects as its second argument. Listing 14.5 shows how you could use the OpenFileArray function. Listing 14.5 Calling Functions with a Variable Number of Arguments
void main(void)
{
    FILE **array;

    // Usage sketch: the filenames and mode are assumptions. The
    // argument list must end with NULL, as OpenFileArray expects.
    if (OpenFileArray(&array, "r", "one.txt", "two.txt", NULL))
    {
        // Use the opened files, then close them again.
        for (int i = 0; array[i] != NULL; i++)
            fclose(array[i]);
    }
    delete [] array;
}
void CalcWithoutFP(void *data)
{
    // calculation.
}

void FPCalc(void *data)
{
    // alternative calculation.
}

void Calc(void *data)
{
    if (FLOATINGPOINTPROCESSOR)
        FPCalc(data);
    else
        CalcWithoutFP(data);
}
When many floating point calculations are made (perhaps you need millions per second), it is a shame that this check is performed as extra overhead for each calculation. Who knows how time-consuming the check FLOATINGPOINTPROCESSOR actually is? That is why it helps performance when this check needs to be executed only once. Listing 14.7 shows how this can be achieved. Listing 14.7 Avoiding Unnecessary Checks
// Define a pointer to Calc.
void (*Calc)(void*);

// Set the value of the pointer.
void init(void)
{
    if (FLOATINGPOINTPROCESSOR)
        Calc = &FPCalc;
    else
        Calc = &CalcWithoutFP;
}
An init() function needs to be called once, before the first calculation. After that, the pointer points to the most suitable calculation function. Anywhere in the program, a floating point calculation can then be performed by simply calling:
Calc(data);
This kind of optimization should only be used for decisions with a certain dynamic nature, where the information on which the decision is based becomes available during program start-up. If the presence of a floating point processor could somehow be determined at compile time, a better solution would be to use precompiler directives as shown in Listing 14.2.
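For example (a minimal sketch, assuming a hypothetical project-defined build flag HAS_FPU for targets with a floating point processor), the decision then costs nothing at runtime:

// HAS_FPU is an assumed, project-defined macro; when it is set,
// every call to Calc compiles directly into a call to FPCalc.
#ifdef HAS_FPU
    #define Calc FPCalc
#else
    #define Calc CalcWithoutFP
#endif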
A common typo that compiles is writing if (a = 1) where if (a == 1) was meant. Instead of checking whether the value of variable a equals 1, the first statement will assign the value 1 to a, and the if will always be true. These kinds of bugs can be very hard to find but are easily avoided by simply adopting a different coding style. When you train yourself to turn around the elements of which such expressions are made up, the compiler will 'warn' you when you make this kind of typo.
if (1 == a)
if (NULL == fp)
. . .
By placing the constant first, the compiler will complain about you trying to assign a value to a constant as soon as you forget the second =.
typedef unsigned char byte;

struct large
{
    byte v1;
    int  v2;
    byte v3;
    int  v4;
    byte v5;
};               // 20 byte structure.
The structure large contains three bytes and two integers. Its size, however, is 20 bytes! This is because the integers are allocated at four byte boundaries. By combining the byte variables in groups of four, you can reclaim this wasted alignment space.
struct small
{
    byte v1;
    byte v3;
    byte v5;
    int  v4;
    int  v2;
};               // 12 byte structure.
Structure small can hold exactly the same information as large, but it is only 12 bytes in size. This is because fields v1, v3, and v5 share the same longword. This kind of optimization can save a lot of runtime memory when a structure is used for a large number of data elements.
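You can verify the effect on your own compiler with a quick sizeof check (a minimal sketch; the exact sizes printed depend on the platform's alignment rules):

#include <stdio.h>

typedef unsigned char byte;

struct large { byte v1; int v2; byte v3; int v4; byte v5; };
struct small { byte v1; byte v3; byte v5; int v4; int v2; };

void main(void)
{
    // With integers aligned on 4-byte boundaries this prints 20 and 12.
    printf("large: %d bytes\n", (int)sizeof(struct large));
    printf("small: %d bytes\n", (int)sizeof(struct small));
}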
Note that the ANSI predefined macros use a double underscore! All macros are expanded to character arrays except __LINE__ and __STDC__, which are integer values. Listing 14.8 shows how the macros __DATE__ and __TIME__ can be used to do a kind of versioning of a C++ class. Listing 14.8 Using ANSI Macros __DATE__ and __TIME__
// In the class header file.
#include <ostream.h>

char anyCompilationDate[] = __DATE__;
char anyCompilationTime[] = __TIME__;

class Any
{
public:
    void SourceInfo(void)
    {
        cout << "Compiled at: ";
        cout << anyCompilationDate << " ";
        cout << anyCompilationTime << endl;
    }
};
// Definition of the Error log function.
#include <ostream.h>

void LogError(char *fname, int lnr)
{
    cout << "Error in file: " << fname;
    cout << " on line number: " << lnr << endl;
}

// Each file will pass its own name and line number.
void somefunct(void)
{
    if (error)
        LogError(__FILE__, __LINE__);
}
13, "Optimizing Your Code Further." Listing 14.10 Static Value Generation by the Compiler
0x80) >> 0x40) >> 0x20) >> 0x10) >> 0x08) >> 0x04) >> 0x02) >> 0x01))
7) 6) 5) 4) 3) 2) 1)
+ + + + + + +
} ; } ;
BitCount<255>::bits; return i; }
#include <stdio.h>

struct WData
{
    double a;
    float  b;
};

class StackSpace
{
    WData vectors[1000];
};

class HeapSpace
{
public:
    HeapSpace()  { vectors = new WData[1000]; }
    ~HeapSpace() { delete [] vectors; }
private:
    WData *vectors;
};

void f(void)
{
    int a;            // variables on stack:  a  = -4
    StackSpace sd;    //                      sd = -16004
    int b;            //                      b  = -16008
    HeapSpace hd;     //                      hd = -16012
}
Upon creation in function f(), both StackSpace and HeapSpace immediately allocate memory for a thousand instances of the WData structure. StackSpace does this by claiming stack memory, as the WData array is part of the class and the class is created locally. HeapSpace does this by claiming memory from the heap by calling new. When you examine the placement of variables on the stack by function f(), for instance by looking at the generated assembly (refer to Chapter 4), you will see that 4 bytes of stack space are reserved for variable a, 16000 bytes for variable sd, 4 bytes for variable b, and 4 bytes for variable hd. This has a number of consequences; using stack space will be faster than using heap space, but when significant amounts of memory are used (through recursive calls, for instance), using stack space can become problematic. This means that the design should specify what kind of usage is expected and what response times should be. From this, the most favorable implementation can be determined.
Compatibility
Compatibility and portability are closely related issues, but there are distinct differences between the two. The compatibility of sources expresses how well they can be used in different development environments on the same system. Think of switching compilers, using C sources within a C++ project, compiling sources from a network drive instead of locally, and so on. The portability of sources expresses how well the sources can be used on different (target) systems. Compatibility is, in a way, a subset of portability because using sources on a different system will also entail using a different compiler. This section discusses compatibility; the next section discusses portability.

Using Vendor Specifics

In order to write sources with a high level of compatibility, it is always wise to use statements and functions as defined in the standards (ANSI) for the language you are using. However, it is not always apparent when you are using compiler (or vendor) specifics. The following example demonstrates the use of vendor-specific libraries and types.
// Compatible code.
char a[] = "Print me if you dare.";
short b = 5;

// Vendor specific code.
CString a = "Print me if you dare.";
__int16 b = 5;
Character arrays and shorts are defined in the language. However, the CString class is part of Microsoft's MFC classes and __int16 is a Microsoft-specific type so these can only be used when compiling with a Microsoft compiler. Another kind of vendor specific is the use of compiler settings and options.
As compiler directives are defined for a specific compiler, chances are that using a different compiler will cause some kind of syntax error on these statements. Another, trickier part of vendor specifics is the way vendors interpret standards. Sadly, standards sometimes specify what statements should look like and how they should behave, but not how they are implemented. Other times, standards are used before they are fully worked out. This means that different compilers will generate different executables even when using the same source as input and generating code for the same system. The following example demonstrates code that relies on an expected compiler implementation.
struct Access { int a; char b; };

Access data;
char *p = (char*)(&data) + 4;
*p = 'B';
This code works correctly only when the fields of the structure are kept in the same order in the generated executable as they were defined in the source file. Only in that case can the memory for field b be found at an offset of 4 bytes from the start address of the structure. A compiler that moves fields around, perhaps for optimization purposes, does not guarantee that this kind of access is possible. Another such example is how comment lines are interpreted. A more robust way to locate field b is sketched below.
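Where it is available, the standard offsetof macro (a minimal sketch, not from the original text) asks the compiler itself for the field's position, so the code keeps working even when the compiler chooses a different layout:

#include <stddef.h>

struct Access { int a; char b; };

Access data;

// offsetof() yields the offset the compiler actually chose for b.
char *p = (char*)(&data) + offsetof(Access, b);
*p = 'B';

Vendor-specific APIs can likewise be isolated behind conditional compilation, as the following fragment demonstrates with two threading interfaces: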
#include <stdio.h>

#ifdef _MSC_VER
    #include <process.h>
    _beginthread(StartThreadA, 0, (void*) &startInputA);
#else
    #include <pthread.h>
    pthread_t thread1;
    pthread_create(&thread1, NULL, StartThread, (void*) &startInputA);
#endif
Compatibility Between C and C++

As C is largely a subset of C++, you would perhaps not expect any difficulties when compiling "old" C sources with a C++ compiler. Sadly, there are some subtle differences that can cause compatibility problems. C++, for instance, is stricter concerning empty parameter lists. The declaration extern int fn() in C declares an external function which returns an integer and takes zero or more arguments. In C++, extern int fn() equals extern int fn(void). C, in turn, does not understand the use of // as a comment marker. When parts of your code have to be compiled by a C compiler, because they will be part of an embedded system, for instance, it is better not to use these markers, because not all C compilers are lenient enough to allow this. As you saw in Chapter 4 in the section on mixing C and C++, including C header files in C++ is done by placing extern "C" { } around the includes in order to have the compiler switch to C mode. When a file should include a C header and should still compile with a C++ compiler as well as a C compiler, place the following construction in the C header file:
Cheader.h:

#ifdef __cplusplus
extern "C" {
#endif

// definitions
void CFunction(char*);

#ifdef __cplusplus
}
#endif
Then include this file normally, in either a C source or a C++ source:

#include "Cheader.h"
Portability
When you port sources to a different system, you can run into all the problems identified in the previous section, "Compatibility." This is because a different system will have a different development environment. However, with porting come extra problems that are system related. This section points out the areas in which systems can prove to be incompatible.

File System Characteristics

Problems with portability can be caused by something as simple as file system characteristics. For instance, some file systems are case sensitive (Process.h is a different file from process.h) while others are not (Process.h and process.h are two names for the same file). When a software project is moved from a non-case-sensitive file system to a case-sensitive file system, chances are that it will not even compile or preprocess without added development effort. Entries in the make file, resource files, and even include statements are now for the first time checked against case errors. The same is true for the use of special characters in filenames (spaces, *, _, /, and so on) and the allowed filename lengths.

A similar kind of problem can occur when moving a project from a file system that uses symbolic links (UNIX) to one that does not. Symbolic links can be used to refer to a file from several different directories. This means the file is only physically present in one directory; however, via links placed in other directories, it seems to be located there also. When such a directory structure is copied to a file system that does not support these kinds of links, a linked file is simply copied to each place where a link resides. This means problems arise when this file is updated. Unless the developer making the updates is very aware of what he is doing and changes all the instances of the file, a synchronization problem will occur: not all instances of the file contain the same information.

These kinds of problems can occur when porting a project (or a set of source files) to a different system, or when you decide to move a project from a local hard disk to a network drive. This is because the network drive you move the project to may use a different file system than your local drive. The easiest way to avoid running into file system problems is to use filenames in your projects which are all in uppercase (or all in lowercase), without any characters outside the alphanumeric range, and which are not too long (eight-character names and three-character extensions).

Operating System Characteristics

Apart from OS-specific libraries and commands, it is also necessary to take into account operating system behavior, such as the way tasks are switched or how locks perform queuing (refer to Chapter 13 for more details). Because of these kinds of differences, the dynamic behavior of multitasking programs can change. A well set up multitasking program should not have too many problems with this; however, it pays to think about how much your program depends on how the OS behaves in certain situations. For instance, you may expect your task to be preempted automatically when waiting for interaction with a slow device or even a screen. This may not be true for the system you are porting to.

Another thing that often differs between operating systems is the way in which they implement text files. First, there is the matter of line termination. Some operating systems expect only a CR (byte value 0x0d) to indicate where one line ends and another begins. Other operating systems expect a CR and LF (byte values 0x0d and 0x0a), or only an LF. Secondly, some operating systems use identification characters at the beginning or end of a file (think of Unicode files starting with a special value). This means that no matter how portable you keep your source files, it is possible that a compiler will not even be able to read them. Also, you have to think very carefully when using text files as data storage or configuration files. Moving a project, along with the configuration files used by the original executable, to a new file system or operating system can mean that a newly compiled executable is still able to read the original configuration files, but only as long as they are not opened and edited by hand. When the original text files are opened and edited on the new operating system, new line terminators will be added.

Include Paths

Another thing that can differ per system is whether or not the \ character is accepted in include path names. Some OSes accept the following include style:
#include "/libraries/include/hi.h"
In order to keep your sources portable, you should always use the second notation, because it is accepted by all compilers.

Hardware Characteristics

When C/C++ sources or projects are ported to systems that use other hardware, issues can arise that are related to hardware characteristics. Think of endianness; an example of this can be found in Chapter 10, in the section on Radix sort. The implementation of Radix sort will compile on any system, but it only executes correctly when the algorithm takes the system's endianness into account. Another thing that can differ between types of hardware is the memory size associated with certain variable types. Think of the number of bytes allocated for an int or long. Examples of this can be found in Chapter 6.
Algorithmic Pitfalls
The first group of pitfalls you find here are those hidden in algorithmic choices. These are the kinds of pitfalls you have to run into, or read about, just to find out they exist.
#include <ostream.h>

// Assumed definition of the base class; the code below requires a
// field named socialSecurityNumber.
class AverageJoe
{
public:
    long socialSecurityNumber;
};

class JetSet : public AverageJoe
{
public:
    long VIP;
};

void SetSocialSecurity(AverageJoe* person, long id, long number)
{
    person[id].socialSecurityNumber = number;
}

void main()
{
    JetSet vips[50];

    for (int i = 0; i < 50; i++)
        vips[i].VIP = 1;

    SetSocialSecurity(vips, 1, 0);

    if (1 == vips[0].VIP)
        cout << "VIP!!" << endl;
    else
        cout << "Ordinary Joe!!" << endl;
}
The SetSocialSecurity() function in Listing 15.1 is set up to change social security numbers in arrays of JetSets as well as arrays of AverageJoes; this is why it takes a pointer to the base class, AverageJoe. Listing 15.1 compiles just fine, but when SetSocialSecurity() is called upon to change the social security number of the second VIP in the vips array, something unexpected happens. Instead of the social security number of vips[1] changing, the VIP status of vips[0] changes.
// The statement:
SetSocialSecurity(vips, 1, 0);

// Expected result:
vips[1].socialSecurityNumber == 0

// Actual result:
vips[0].VIP == 0
The reason is the pointer arithmetic performed inside SetSocialSecurity(): the expression person[id] advances the pointer in steps of sizeof(AverageJoe), while the elements of the vips array are actually sizeof(JetSet) bytes apart. person[1] therefore does not point at vips[1] at all, but at the VIP field inside vips[0]. One way to sidestep this is sketched below.
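For instance (a minimal sketch, not from the original text), a function template lets the compiler index with the real element type, so the arithmetic is always correct:

// T is deduced as JetSet or AverageJoe at the call site, so
// person[id] advances in steps of sizeof(T).
template <class T>
void SetSocialSecurity(T *person, long id, long number)
{
    person[id].socialSecurityNumber = number;
}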
#include <ostream.h>

#define dataLen 254
#define TERM    13

// Initial Values.
char dataString[dataLen] = { 0,2,3,4,13,8,8,8,8 };

void main(void)
{
    // find first terminator.
    for (int i = 0; i < dataLen; i++)
    {
        if (TERM == dataString[i])
            break;
    }

    cout << "Terminator at: " << i << endl;
}
On some C++ compilers, Listing 15.2 will compile and run just fine. These are the compilers that see the scope of variable i as the body of the function in which it is declared, in this case main(). Consequently, variable i also has a value outside the for loop and can therefore be used to print the position at which the loop was terminated. The construct shown in Listing 15.3 is obviously illegal with these compilers. Listing 15.3 Illegal in Some C++ Compilers
for (int i = 0; i < dataLen/2; i++)
{ /* process first half. */ }

for (int i = dataLen/2; i < dataLen; i++)
{ /* process second half. */ }
This is because variable i is defined twice. In the latest version of the C++ standard, the scope of a variable defined in a for heading is limited to the body of that for loop. This means that the following statement from Listing 15.2 will not compile when using a compiler that follows the new standard:

cout << "Terminator at: " << i << endl;
#include <ostream.h>

// Assumed definition of class A (in the style of Listing 15.5):
// isvalid() reports that it was called and returns 0 (false).
class A
{
public:
    int isvalid() { cout << "Check" << endl; return(0); }
};

void main()
{
    A Object1, Object2, Object3, Object4;

    if (Object1.isvalid() && Object2.isvalid() &&
        Object3.isvalid() && Object4.isvalid())
    { cout << "Conditions are TRUE" << endl; }
    else
    { cout << "Conditions are FALSE" << endl; }
}
In the expression Object1.isvalid() && Object2.isvalid() && Object3.isvalid() && Object4.isvalid(), a call is made to each object's isvalid() method, from left to right, only as long as the preceding call returned a value greater than zero. With the isvalid() method assumed above, which returns 0, evaluation stops after the first call, and the result of Listing 15.4 is thus:

Check
Conditions are FALSE

Listing 15.5, shown next, overloads the && operator instead, with very different results.
#include <ostream.h>

class A
{
public:
    friend int operator&& (const A& left, const A& right)
    { cout << "Check" << endl; return(0); }

    friend int operator&& (const int left, const A& right)
    { cout << "Check" << endl; return(0); }
};

void main()
{
    A Object1, Object2, Object3, Object4;

    if (Object1 && Object2 && Object3 && Object4)
    { cout << "Conditions are TRUE" << endl; }
    else
    { cout << "Conditions are FALSE" << endl; }
}
The result of Listing 15.5 is:

Check
Check
Check
Conditions are FALSE
This means that each part of the expression if (Object1 && Object2 && Object3 && Object4) is evaluated. This is important to keep in mind because sometimes you do not want part of an expression evaluated if the preceding part was not true.
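A typical example (a minimal sketch, not from the original text) is guarding a pointer dereference with a preceding NULL check; with the built-in &&, the second operand is skipped when the first is false, but with an overloaded && it would still be evaluated:

#include <stdio.h>

void ReadFirst(void)
{
    FILE *fp = fopen("data.txt", "r");   // hypothetical filename

    // Safe with the built-in &&: fgetc(fp) is never evaluated
    // when fp is NULL.
    if (fp != NULL && fgetc(fp) != EOF)
    {
        // process the rest of the file...
    }
    if (fp != NULL)
        fclose(fp);
}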
#include <string.h>
#include <memory.h>

class Object
{
public:
    ~Object() { delete [] data; }

    int  id;
    char name[200];
    char *data;
};

Object* CloneObject(Object *in)
{
    Object *p = new Object;
    memcpy(p, in, sizeof(Object));
    return p;
}

void main(void)
{
    char *DataBlock = new char[230];

    Object *object1 = new Object;
    object1->id   = 1;
    object1->data = DataBlock;
    strcpy(object1->name, "NumberOne");

    Object *object2 = CloneObject(object1);

    delete object1;

    // Big Problem:
    delete object2;
}
In Listing 15.6, object2 is a clone of object1; because the cloning was done through a shallow memory copy, both object1 and object2 have a data pointer that points to DataBlock. Objects have their own destructor, which releases the memory pointed to by the data pointer. This means that when object1 is deleted, the memory of DataBlock is released. When object2 is deleted, its destructor will try to release the memory of DataBlock a second time. This, of course, causes strange behavior, if not a crash. The use of a shallow copy in this context is obviously flawed, but when using complex and/or third-party structures it may not be this obvious. Think of structures that contain a pointer to themselves, and so on. A deep copy, as sketched below, avoids the double delete.
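A deep-copying clone function (a minimal sketch, not from the original listing; the data block length is passed in explicitly, as Object does not record it) gives the clone its own data block:

Object* DeepCloneObject(Object *in, int dataLen)
{
    Object *p = new Object;

    p->id = in->id;
    strcpy(p->name, in->name);

    // Give the clone its own copy of the data block.
    p->data = new char[dataLen];
    memcpy(p->data, in->data, dataLen);

    return p;
}

Deleting object1 and a clone made with DeepCloneObject(object1, 230) then releases two separate data blocks, and the crash disappears.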
int x = 32;

if (7 < x < 16)
{
    printf("x has a two nibble value.");
}
Instead of printing x has a two nibble value when the value of x resides between 7 and 16, the piece of code in Listing 15.7 first evaluates 7 < x, to which the answer is either true (1) or false (0). Then this 0 or 1 answer is compared to 16, inevitably resulting in true (1). The correct test is, of course, (7 < x && x < 16).

Power of X

In some programming languages (Pascal, for instance) the ^ character is used to denote the power of a variable. In C/C++ this character denotes a bitwise exclusive or (xor), which is why some programmers will fall for the pitfall sketched below.
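The ^ pitfall looks something like this (a minimal sketch, not from the original text):

int x = 5;

// Intended: x raised to the power of 2 (25).
// Actual:   x XOR 2, which is 7 for x == 5.
int y = x ^ 2;

A related typo that compiles is using a comma as a thousands separator in a numeric constant, as in the following loop: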
double x;

for (x = 0; x < MAXRANGE; x += 1,500)
    printf("X= %f\n", x);
Instead of x increasing by fifteen hundred on each iteration, it is increased merely by one. This means that the number of effective iterations is MAXRANGE instead of MAXRANGE/1,500. The compiler treats the number after the comma as a separate expression whose value is simply discarded. This may seem strange, but the following three (meaningless) statements compile as well.
5;
4;
500;
#include <ostream.h>

class base
{
public:
    void Info() { cout << "BaseInfo." << endl; }
};

class derived : public base
{
public:
    void Info() { cout << "DerivedInfo." << endl; }

    void GiveInfo(int a)
    {
        if (a)
            base:Info();
        else
            Info();
    }
};

void main(void)
{
    derived der;

    der.GiveInfo(0);
    der.GiveInfo(2);
}
It looks like the author of Listing 15.9 is trying to invoke two different versions of the Info() method by calling the method GiveInfo() with two different values (first 0 and then 2). The method GiveInfo() decides to call the Info() method of the base class or the Info() method of the derived class, based on the value of its input parameter a. Sadly, the output of this program is:
DerivedInfo.
DerivedInfo.
Why? Because the author made a typo: instead of calling the base class method Info() by typing base::Info(), he created a label called base and called the local Info() method by typing base:Info() (note the missing second colon!). Because the object der is of class derived, the Info() method of derived is called in both cases. A good way to avoid these kinds of typos is to set the compiler warning level so that it warns against the definition of unused labels.
#include <string.h>

char data[200] = "How many tabs can fit in this line?";

void GetTabInfo(int *tabsize)
{
    int tabsline = 0;

    tabsline = strlen(data)/*tabsize;
    int linediff = strlen(data) /* subtract diff*/ - (*tabsize*tabsline);
}
In Listing 15.10, the author is trying to determine how many tabs would fit in a given line by dividing the number of characters in that line by the size of a tab (tabsline = strlen(data)/*tabsize;). However, by omitting a space between the division operator / and the dereference operator *, the author has inadvertently created the start of a multiline comment /*. This comment ends with the next occurrence of */, and so the statement actually seen by the compiler is not:
tabsline = strlen(data)/*tabsize;
but:
tabsline = strlen(data)-(*tabsize*tabsline);
A good way to avoid these kinds of typos is to use source editors that change the colors of comment lines.
// Loop without statements.
for (i = 0; i < MAX; i++);
    cout << i++;
The preceding loop has no statements. Consequently the cout << i++ statement is executed only once, changing i from MAX to MAX+1.
// Loop with a single statement.
for (i = 0; i < MAX; )
    cout << i;
    i++;
The preceding loop also has only one statement, this time because the for is not implemented with a compounded body, which is why the statement i++ is not part of the for loop. Consequently, the preceding loop either does not execute or executes without end. These kinds of bugs often arise when a loop with only one statement is extended. This is why it is good programming practice to give all loops a compounded body, even those with only a single body statement, as sketched below. This way a developer can never make the mistake of adding body statements where there is no body to begin with.
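With a compounded body (a minimal sketch of the same loop), later additions automatically land inside the loop:

// Loop with a compounded body.
for (i = 0; i < MAX; i++)
{
    cout << i;
    // New statements added here are always part of the loop.
}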
#include <ostream.h>

// The cout bodies of print() are assumptions, chosen to match the
// output shown below.
class A
{
public:
    void print() { cout << "A" << endl; }
};

class B : public A
{
public:
    void print() { cout << "B" << endl; }
};

void main()
{
    // Create a single object.
    B anObject;

    // Create two references to the object.
    A *ptr1 = &anObject;
    B *ptr2 = &anObject;

    ptr1->print();
    ptr2->print();
}
The output of Listing 15.11 looks like this:
A
B
So why do the two pointers, which point to the same instance of B, call two different print() methods? The type of the pointer itself determines which method is called: A* versus B*. The reason for this is that a pointer to an A object knows only of one kind of print() method, namely that of class A. To correctly redefine the print() method for class B, it should be made virtual. Change the definition of class A to:
class A
{
public:
    virtual void print() { cout << "A" << endl; }
};
and the output of Listing 15.11 becomes:

B
B
Omitting the virtual keyword (when done unintentionally) can be a bug that is hard to find because, as you have seen, it manifests at runtime depending on how a class is accessed. The methods in the original Listing 15.11 are all statically bound, compared to the virtual solution, in which the methods are dynamically bound.
Other Pitfalls
As you have seen so far, pitfalls can occur in the most unexpected places (in algorithms and in translations from mathematical equations, for example) and through common typos. What you have not yet seen are pitfalls generated by external influences on source files. This section looks at pitfalls that can occur because of the way in which character arrays are implemented by compilers, and through the use of hardware addresses from source files.
Another keyword that you might like to use in combination with volatile is const, to tell the compiler that the value of an object should not change. You use const for exactly the opposite purpose of volatile; this way you can tell the compiler that pointer p should always point to address 0xC000050 (it is not allowed to change and point to any other address), but that the value at address 0xC000050 can change at any given time.
char hw1 = 'A';
char hw2 = 'A';

void main(void)
{
    char *p1                 = &hw1;
    volatile char *p2        = &hw1;
    volatile char * const p3 = &hw1;

    // Legal actions:
    *p1 = 'b';
    p1  = &hw2;
    *p2 = 'c';
    p2  = &hw2;
    *p3 = 'd';

    // Illegal action:
    p3 = &hw2;
}
In Listing 15.12, pointer p1 may be subject to compiler optimization, so accessing address hw1 through p1 may not always work correctly. Accessing hw1 through pointer p2 is never optimized and is therefore safe, as long as you do not (accidentally) change the value of p2 itself. The safest way to access hw1 is through pointer p3, as it is never optimized and its address cannot change; the last statement, p3 = &hw2;, therefore does not even compile. Note that the value of p3 can, of course, still be corrupted through non-const pointers to the address of p3 itself and through other memory corruption, such as writing past array bounds and so on.

Efficient Access to Hardware Addresses

Now that you have a correct way of accessing specific hardware addresses, the next step is to make sure that access is done efficiently. Often you will want to read a stream of bytes from a device. When this device has memory-mapped IO, information is retrieved by continuously reading from the mapped addresses (that is, when no DMA is available). Each read action signals the underlying device to place the next value on the address. Reading a number of bytes from such a device can be done as follows:
unsigned char buffer[1024];

for (int i = 0; i < buflen; i++)
    buffer[i] = *HARDWAREMAPPEDREGISTER;

HARDWAREMAPPEDREGISTER is a pointer to a specific hardware address on which a device places bytes of data. The buffer is
used to store sequentially read bytes. Although this implementation seems pretty straightforward and looks lean because of the small number of statements it contains, it is far from efficient. This is because data is read from the hardware address and placed in normal memory. As you have seen in Chapter 5, "Measuring Time and Complexity," using memory can be slow because of caching and paging schemes. (Cache needs to be consistent with memory, resulting in cache updates if a write-through scheme is used. You can minimize these updates by using processor registers so that such an update is necessary only once per 4 bytes.) Listing 15.13 shows how this kind of memory access can be speeded up by using a register as a four-byte IO buffer. Listing 15.13 Efficient Hardware Address Access (Big-Endian)
unsigned char buffer[1024];

for (int i = 0; i < buflen; i += 4)
{
    register long tmp;

    // Shift each byte upwards in the long word to make room for
    // the next byte to be read.
    tmp = (*HARDWAREMAPPEDREGISTER);
    tmp = (tmp << 8) | (*HARDWAREMAPPEDREGISTER);
    tmp = (tmp << 8) | (*HARDWAREMAPPEDREGISTER);
    tmp = (tmp << 8) | (*HARDWAREMAPPEDREGISTER);

    // Store the four collected bytes in the buffer in one action.
    *(long*)&buffer[i] = tmp;
}
Because of the small number of variables used within the loop in Listing 15.13, it is very likely that the compiler will assign tmp to a register. However, the register keyword is added to urge the compiler in the right direction. Now four consecutive read actions can be performed very fast. Each byte that is read is shifted upwards in the long word to make room for the next byte to be read. When the long word is full, it is finally placed into the buffer. With larger registers, IO can be speeded up further; however, make sure that you choose a type for variable tmp that can easily be mapped onto a CPU register. If, for instance, you define a 64-bit tmp when the processor does not have any (or a sufficient number of) 64-bit registers, the compiler will choose a normal memory variable for tmp instead of a register. This means you just lose IO speed because of the extra statements added in Listing 15.13. Note that the way in which data is placed in tmp and shifted upwards in the direction of the most significant byte makes it ideal for big-endian implementations. If, however, you would like to receive the bytes of the device in little-endian order, some changes need to be made, as shown in Listing 15.14. Listing 15.14 Efficient Hardware Address Access (Little-Endian)
unsigned char buffer[1024];

for (int i = 0; i < buflen; i += 4)
{
    register long tmp;

    tmp  = (*HARDWAREMAPPEDREGISTER);
    tmp |= (*HARDWAREMAPPEDREGISTER) << 8;
    tmp |= (*HARDWAREMAPPEDREGISTER) << 16;
    tmp |= (*HARDWAREMAPPEDREGISTER) << 24;

    // Store the four collected bytes in the buffer in one action.
    *(long*)&buffer[i] = tmp;
}