Why Programs Fail: A Guide to Systematic Debugging

Won a Jolt Productivity Award in the Technical Books category

For the past 15 years, the Software Development Jolt Product Excellence and Productivity Awards have been presented annually to showcase products that have "jolted" the industry with their significance and made the task of creating software faster, easier, and more efficient. Jolt Cola, the fabled soft drink quaffed by software developers for sustenance during project development marathons, sponsors the awards presentation. "Our twenty Jolt judges -- leaders in the software development field -- have selected 89 impressive finalists from hundreds of nominations," commented Software Development Technical Editor Rosalyn Lum. "The number of nominations we received this year was unprecedented and the process of narrowing down the finalists was a massive, albeit exciting, task, with many more products contending in each category. These products represent a cross section of the most ground-breaking tools for every phase of the software development lifecycle -- a testament that the spirit of innovation is alive and well in the software development sector. [The winners] are products that are: ahead of the curve; universally useful; simple, yet rich in functionality; redefine their product space; or solve a nagging problem that has consistently eluded other products and books."

1101055499

Why Programs Fail: A Guide to Systematic Debugging

Won a Jolt Productivity Award in the Technical Books category

63.95 In Stock

Why Programs Fail: A Guide to Systematic Debugging

Add to Wishlist

Why Programs Fail: A Guide to Systematic Debugging

eBook

$63.95

eBook
$63.95

Available on Compatible NOOK devices, the free NOOK App and in My Digital Library.

WANT A NOOK? Explore Now

Buy As Gift

Related collections and offers

Overview

Won a Jolt Productivity Award in the Technical Books category

Product Details

ISBN-13:	9780080923000
Publisher:	Elsevier Science
Publication date:	07/22/2009
Sold by:	Barnes & Noble
Format:	eBook
Pages:	544
File size:	12 MB
Note:	This product may take a few minutes to download.

About the Author

Andreas Zeller is a full professor for Software Engineering at Saarland University in Saarbruecken, Germany. His research concerns the analysis of large software systems and their development process; his students are funded by companies like Google, Microsoft, or SAP. In 2010, Zeller was inducted as Fellow of the ACM for his contributions to automated debugging and mining software archives. In 2011, he received an ERC Advanced Grant, Europe's highest and most prestigious individual research grant, for work on specification mining and test case generation. His book "Why programs fail", the "standard reference on debugging", obtained the 2006 Software Development Jolt Productivity Award.

Read an Excerpt

Why Programs Fail

A Guide to Systematic Debugging

By Andreas Zeller

Morgan Kaufmann

Copyright © 2009 Elsevier Inc.
All right reserved.
ISBN: 978-0-08-092300-0

Chapter One

How Failures Come to Be

Your program fails. How can this be? The answer is that the programmer creates a defect in the code. When the code is executed, the defect causes an infection in the program state, which later becomes visible as a failure. To find the defect, one must reason backward, starting with the failure. This chapter defines the essential concepts when talking about debugging, and hints at the techniques discussed subsequently—hopefully whetting your appetite for the remainder of this book.

1.1 MY PROGRAM DOES NOT WORK!

Oops! Your program fails. Now what? This is a common situation that interrupts our routine and requires immediate attention. Because the program mostly worked until now, we assume that something external has crept into our machine—something that is natural and unavoidable;something we are not responsible for—namely,a bug.

If you are a user, you have probably already learned to live with bugs. You may even think that bugs are unavoidable when it comes to software. As a programmer, though, you know that bugs do not creep out of mother nature into our programs. (See Bug Story 1 for an exception.) Rather, bugs are inherent parts of the programs we produce. At the beginning of any bug story stands a human who produces the program in question.

The following is a small program I once produced. The sample program is a very simple sorting tool. Given a list of numbers as command-line arguments, sample prints them as a sorted list on the standard output ($ is the command-line prompt).

$ ./sample 9 7 8 Output: 7 8 9 $ _

Unfortunately, sample does not always work properly, as demonstrated by the following failure:

$ ./sample 11 14 Output: 0 11 $ _

Although the sample output is sorted and contains the right number of items, some original arguments are missing and replaced by bogus numbers. Here, 14 is missing and replaced by 0. (Actual bogus numbers and behavior on your system may vary.) From the sample failure, we can deduce that sample has a bug (or, more precisely, a defect). This brings us to the key question of this chapter:

1.2 FROM DEFECTS TO FAILURES

In general, a failure such as that in the sample program comes about in the four stages discussed in the following.

1. The programmer creates a defect. A defect is a piece of the code that can cause an infection. Because the defect is part of the code, and because every code is initially written by a programmer, the defect is technically created by the programmer. If the programmer creates a defect, does that mean the programmer was at fault? Not necessarily. Consider the following:

* The original requirements did not foresee future changes. Think about the Y2K problem, for instance.

* A program behavior may become classified as a "failure" only when the user sees it for the first time.

* In a modular program, a failure may happen because of incompatible interfaces of two modules.

* In a distributed program, a failure may be the result of some unpredictable interaction of several components.

In such settings, deciding on who is to blame is a political, not a technical, question. Nobody made a mistake, and yet a failure occurred. (See Bug Story 2 for more on such failures.)

2. The defect causes an infection. The program is executed, and with it the defect. The defect now creates an infection—that is, after execution of the defect, the program state differs from what the programmer intended.

A defect in the code does not necessarily cause an infection. The defective code must be executed, and it must be executed under such conditions that the infection actually occurs.

3. The infection propagates. Most functions result in errors when fed with erroneous input. As the remaining program execution accesses the state, it generates further infections that can spread into later program states. An infection need not, however, propagate continuously. It may be overwritten, masked, or corrected by some later program action.

4. The infection causes a failure. A failure is an externally observable error in the program behavior. It is caused by an infection in the program state.

The program execution process is sketched in Figure 1.1. Each program state consists of the values of the program variables, as well as the current execution position (formally, the program counter). Each state determines subsequent states, up to the final state (at the bottom in the figure),in which we can observe the failure (indicated by the x).

Not every defect results in an infection, and not every infection results in a failure. Thus, having no failures does not imply having no defects. This is the curse of testing, as pointed out by Dijkstra. Testing can only show the presence of defects, but never their absence.

In the case of sample, though, we have actually experienced a failure. In hindsight, every failure is thus caused by some infection, and every infection is caused by some earlier infection, originating at the defect. This cause–effect chain from defect to failure is called an infection chain.

The issue of debugging is thus to identify the infection chain, to find its root cause (the defect),and to remove the defect such that the failure no longer occurs. This is what we shall do with the sample program.

1.3 LOST IN TIME AND SPACE

In general, debugging of a program such as sample can be decomposed into seven steps (List 1.1), of which the initial letters form the word TRAFFIC.

1. Track the problem in the database.

2. Reproduce the failure.

3. Automate and simplify the test case.

4. Find possible infection origins.

5. Focus on the most likely origins.

6. Isolate the infection chain.

7. Correct the defect.

Of these steps, tracking the problem in a problem database is mere bookkeeping (see also Chapter 2) and reproducing the problem is not that difficult for deterministic programs such as sample. It can be difficult for nondeterministic programs and long-running programs, though, which is why Chapter 4 discusses the issues involved in reproducing failures.

Automating the test case is also rather straightforward, and results in automatic simplification (see also Chapter 5). The last step, correcting the defect, is usually simple once you have understood how the defect causes the failure (see Chapter 15).

The final three steps—from finding the infection origins to isolating the infection chain—are the steps concerned with understanding how the failure came to be. This task requires by far the most time, as well as other resources. Understanding how the failure came to be is what the rest of this section, and the other chapters of this book, are about.

Why is understanding the failure so difficult? Considering Figure 1.1,all one need do to find the defect is isolate the transition from a sane state (i.e., noninfected, as intended) to an infected state. This is a search in space (as we have to find out which part of the state is infected) as well as in time (as we have to find out when the infection takes place).

However, examination of space and time are enormous tasks for even the simplest programs. Each state consists of dozens, thousands, or even millions of variables. For example, Figure 1.2 shows a visualization of the program state of the GNU compiler (GCC) while compiling a program. The program state consists of about 44,000 individual variables, each with a distinct value, and about 42,000 references between variables. (Chapter 14 discusses how to obtain and use such program states in debugging.)

Not only is a single state quite large, a program execution consists of thousands, millions, or even billions of such states. Space and time thus form a wide area in which only two points are well known (Figure 1.3): initially, the entire state is sane (

), and eventually some part of the state is infected (x). Within the area spanned by space and time, the aim of debugging is to locate the defect—a single transition from sane (

) to infected (

) that eventually causes the failure (Figure 1.4).

Thinking about the dimensions of space and time, this may seem like searching for a needle in an endless row of haystacks—and indeed, the fact is that debugging is largely a search problem. This search is driven by the following two major principles:

1. Separate sane from infected. If a state is infected, it may be part of the infection propagating from defect to failure. If a state is sane, there is no infection to propagate.

2. Separate relevant from irrelevant. A variable value is the result of a limited number of earlier variable values. Thus, only some part of the earlier state may be relevant to the failure.

Figure 1.5 illustrates this latter technique. The failure, to reiterate, can only have been caused by a small number of other variables in earlier states (denoted using the exclamation point, !), the values of which in turn can only have come from other earlier variables. One says that subsequent, variable values depend on earlier values. This results in a series of dependences from the failure back to earlier variable values. To locate the defect, it suffices to examine these values only—as other values could not have possibly caused the failure—and separate these values into sane and infected. If we find an infected value, we must find and fix the defect that causes it. Typically, this is the same defect that causes the original failure.

Why is it that a variable value can be caused only by a small number of earlier variables? Good programming style dictates division of the state into units such that the information flow between these units is minimized. Typically, your programming language provides a means of structuring the state, just as it helps you to structure the program code. However, whether you divide the state into functions, modules, objects, packages, or components, the principle is the same: a divided state is much easier to conquer.

1.4 FROM FAILURES TO FIXES

Let's put our knowledge about states and dependences into practice, following the TRAFFIC steps (List 1.1).

1.4.1 Track the Problem

The first step in debugging is to track the problem—that is, to file a problem report such that the defect will not go by unnoticed. In our case, we have already observed the failure symptom: the output of sample, when invoked with arguments 11 and 14,contains a zero.

$ ./sample 11 14 Output: 0 11 $ _

An actual problem report would include this invocation as an instruction on how to reproduce the problem (see Chapter 2).

1.4.2 Reproduce the Failure

In case of the sample program, reproducing the failure is easy. All you need do is reinvoke sample, as shown previously. In other cases, though, reproducing may require control over all possible input sources (techniques are described in Chapter 4).

1.4.3 Automate and Simplify the Test Case

If sample were a more complex program, we would have to think about how to automate the failure (in that we want to reproduce the failure automatically) and how to simplify its input such that we obtain a minimal test case. In the case of sample, though, this is not necessary (for more complex programs, Chapter 5 covers the details).

1.4.4 Find Possible Infection Origins

Where does the zero in the output come from? This is the fourth step in the TRAFFIC steps: We must find possible infection origins. To find possible origins, we need the actual C source code of sample, shown in Example 1.1. We quickly see that the program consists of two functions:shell_sort() (which implements the shell sort algorithm) and main, which realizes a simple test driver around shell_sort(). The main function:

* Allocates an array a (line 32).

* Copies the command-line arguments into a (lines 33–34).

* Sorts a by invoking shell_sort() (line 36).

* Prints the content of a (lines 38–41).

By matching the output to the appropriate code, we find that the 0 printed by sample is the value of the variable a[0], the first element of the array a. This is the infection we observe: at line 39 in sample.c, variable a[0] is obviously zero.

Where does the zero in a[0] come from? Working our way backward from line 40, we find in line 36 the call shell_sort(a, argc), where the array a is passed by reference. This function might well be the point at which a[0] was assigned the infected value.

(Continues...)

Excerpted from Why Programs Fail by Andreas Zeller Copyright © 2009 by Elsevier Inc. . Excerpted by permission of Morgan Kaufmann. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Foreword....................xv
Preface....................xvii
CHAPTER 1 How Failures Come to Be....................1
CHAPTER 2 Tracking Problems....................25
CHAPTER 3 Making Programs Fail....................49
CHAPTER 4 Reproducing Problems....................75
CHAPTER 5 Simplifying Problems....................105
CHAPTER 6 Scientific Debugging....................129
CHAPTER 7 Deducing Errors....................147
CHAPTER 8 Observing Facts....................175
CHAPTER 9 Tracking Origins....................211
CHAPTER 10 Asserting Expectations....................223
CHAPTER 11 Detecting Anomalies....................253
CHAPTER 12 Causes and Effects....................271
CHAPTER 13 Isolating Failure Causes....................283
CHAPTER 14 Isolating Cause-Effect Chains....................305
CHAPTER 15 Fixing the Defect....................329
CHAPTER 16 Learning from Mistakes....................343
APPENDIX Formal Definitions....................363
Glossary....................377
Bibliography....................381
Index....................391

From the B&N Reads Blog

Page 1 of

Editorial Reviews

Praise from the experts for the first edition:
"In this book, Andreas Zeller does an excellent job introducing useful debugging techniques and tools invented in both academia and industry. The book is easy to read and actually very fun as well. It will not only help you discover a new perspective on debugging, but it will also teach you some fundamental static and dynamic program analysis techniques in plain language."
-Miryung Kim, Software Developer, Motorola Korea
"Today every computer program written is also debugged, but debugging is not a widely studied or taught skill. Few books beyond this one present a systematic approach to finding and fixing programming errors."
-James Larus, Microsoft Research
"From the author of ODD, the famous data display debugger, now comes the definitive book on debugging. Zeller's book is chock-full with advice, insight, and tools to track down defects in programs, for all levels of experience and any programming language. The book is lucidly written, explaining the principles of every technique without boring the reader with minutiae. And best of all, at the end of each chapter it tells you where to download all those fancy tools. A great book for the software professional as well as the student interested in the frontiers of automated debugging."
-Walter F. Tichy, Professor, University Karlsruhe, Germany
"Andreas Zeller's Why Programs Fail lays an excellent foundation far practitioners, educators, and researchers alike. Using a disciplined approach based on the scientific method, Zeller provides deep insights, detailed approaches, and illustrative examples."
-David Notkin, Professor Computer Science & Engineering, University of Washington

From the Publisher

Why Programs Fail: A Guide to Systematic Debugging

Why Programs Fail: A Guide to Systematic Debugging

eBook

eBook

Related collections and offers

Overview

Product Details

About the Author

Read an Excerpt

Why Programs Fail

Morgan Kaufmann

Chapter One

Table of Contents

Contents

Customer Reviews

Related collections and offers

Overview

Product Details

About the Author

Read an Excerpt

Why Programs Fail

Morgan Kaufmann

Chapter One

Table of Contents

Contents

Related Subjects

Customer Reviews