Debugging (50+ New Debugging Hacks 2019)

Debugging: The 50 Best and Latest Methods

When you set out to fix a problem, it’s important to select the most appropriate strategy for debugging: the one that will allow you to succeed with the least amount of effort. If your choice doesn’t pan out, move on to the next most promising method.

 

1: Handle All Problems through an Issue-Tracking System

George calls you on the phone complaining loudly that the application you’re developing “isn’t working.” You scribble that on a sticky note and add it to the collection of similar notes adorning your monitor.

 

You make a mental note to check whether you sent him the latest library required for the new version of the application. This is not how you should be working. Here is what you should be doing instead.

 

First, ensure you have an issue-tracking system in place. Many source code repositories, such as GitHub and GitLab, provide a basic version of such a system integrated with the rest of the functionality they provide.

 

A number of organizations use JIRA, a much more sophisticated proprietary system that can be licensed to run on premises or as a service. Others opt for an open-source alternative, such as Bugzilla, Launchpad, OTRS, Redmine, or Trac. Which system you choose is less important than actually using it to file all issues.

 

Refuse to handle any problem that’s not recorded in the issue-tracking system. The unswerving use of such a system

  • Provides visibility to the debugging effort
  • Enables the tracking and planning of releases
  • Facilitates the prioritization of work
  • Helps document common issues and solutions
  • Ensures that no problems fall through the cracks
  • Allows the automatic generation of release notes
  • Serves as a repository for measuring defects, reflecting on them, and learning from them

 

For persons too senior in the organization’s hierarchy to be told to file an issue, simply offer to file it for them. The same goes for issues that you discover yourself. Some organizations don’t allow any changes to the source code unless they reference an associated issue.

 

Also, ensure that each issue contains a precise description of how to reproduce it. Your ideal here is a short, self-contained, correct (compilable and runnable) example (SSCCE), something that you can readily cut and paste into your application to see the problem.

 

To improve your chances of getting well-written bug reports, create instructions on how a good report should look, and brainwash all bug reporters into following them religiously. (In one organization, I saw these instructions posted on the lavatory doors.)

 

Other things you should be looking for in a bug report are a precise title, the bug’s priority and severity, and the affected stakeholders, as well as details about the environment where it occurs.

 

Here are the key points concerning these fields. A precise short title allows you to identify the bug in a summary report. “Program crashes” is a horrible title; “Crash when clicking on Refresh while saving” is a fine one.

 

The severity field helps you prioritize the bugs. Problems where data loss occurs are obviously critical, but cosmetic issues or those with a documented workaround are less so. A bug’s severity allows a team to triage a list of issues, deciding which to address now, which to tackle later, and which to ignore.

 

The result of triaging and prioritization can be recorded as the issue's priority, which gives you the order in which to work. In many projects, the bug’s priority is set by the developer or project lead, because end users tend to assign top priority to every bug they submit.

 

Setting and recording realistic priorities fends off those managers, customer representatives, developers in other teams, and salespeople who claim that everything (or, at least, their pet issue) is a top priority.

 

Identifying an issue’s stakeholders helps the team get additional input regarding an issue and the product owner prioritize the issues. Some organizations even tag stakeholders with the yearly revenue they bring in. (For example, “Submitted by Acme, a $250,000 customer.”)

 

A description of the environment can provide you with a clue on how to reproduce an elusive bug. In the description, avoid the kitchen sink approach in which you demand everything, from the PC’s serial number and BIOS date to the version of each library installed on the system. This will overburden your users, and they may just skip those fields.

 

Instead, ask only for the most relevant details: for a web-based app, the browser is certainly important; for a mobile app, you probably want the device maker and model. Even better, automate the submission of these details through your software.

 

When you work with an issue-tracking system, an important good practice is to use it to document your progress. Most tracking systems allow you to append to each entry successive free-form comments.

 

Use these to document the steps you take for investigating and fixing the bug, including dead ends. This brings transparency to your organization’s workings.

 

Write down the precise command incantations you use to log or trace the program’s behavior. These can be invaluable when you want to repeat them the next day, or when you (or some colleagues) hunt a similar bug a year later.

 

The notes can also help refresh your memory when, bleary-eyed and worn out from a week-long bug-hunting exercise, you try to explain to your team or manager what you’ve been doing all those days.

 

Things to Remember

  • Handle all problems through an issue-tracking system.
  • Ensure each issue contains a precise description of how to reproduce the problem with a short, self-contained, correct example.
  • Triage issues and schedule your work based on the priority and severity of each issue.
  • Document your progress through the issue-tracking system.

 

2: Use Focused Queries to Search the Web for Insights into Your Problem

It’s quite rare these days to work in a location lacking Internet access, but when I find myself in one of these my productivity as a developer takes a dive. When your code fails, the Internet can help you find a solution by searching the web and by collaborating with fellow developers.

 

A remarkably effective search technique involves pasting the error message reported by a failing third-party component into the browser’s search box enclosed in double quotes. The quotes instruct the search engine to look for pages containing the exact phrase, and this increases the quality of the results you’ll get back.

 

Other useful things to put into the search box include the name of the library or middleware that gives you trouble, the corresponding name of the class or method, and the returned error code.

 

The more obscure the name of the function you look for, the better. For example, searching for PlgBlt will give you far better results than searching for BitBlt. Also try synonyms, for instance, “freezes” in addition to “hangs,” or “grayed” in addition to “disabled.”

 

You can often solve tricky problems associated with the invocation of APIs by looking at how other people use them. Look for open-source software that uses a particular function, and examine how the parameters passed to it are initialized and how its return value is interpreted.

 

For this, using a specialized code search engine, such as the Black Duck Open Hub Code Search, can provide you with better results than a generic Google search.

 

For example, searching for mktime on this search engine, and filtering the results for a specific project to avoid browsing through library declarations and definitions, produces the following code snippet.

nowtime = mktime(time->tm_year + 1900, time->tm_mon + 1,
    time->tm_mday, time->tm_hour, time->tm_min, time->tm_sec);

 

This shows that the mktime function, in contrast to localtime, expects the year to be passed in full, rather than as an offset from 1900, and that the numbering of months starts from 1. These are things you can easily get wrong, especially if you haven’t carefully read the function’s documentation.

 

When looking through the search results, pay attention to the site hosting them. Through considerable investment in techniques that motivate participants, sites of the Stack Exchange network, such as Stack Overflow, typically host the most pertinent discussions and answers.

 

When looking at an answer on Stack Overflow, scan beyond the accepted one, looking for answers with more votes. In addition, read the answer’s comments because it is there that people post updates, such as newer techniques to avoid an error.

 

If your carefully constructed web searches don’t come up with any useful results, it may be that you’re barking up the wrong tree. For popular libraries and software, it’s quite unlikely that you’ll be the first to experience a problem.

 

Therefore, if you can’t find a description of your problem online, it may be the case that you’ve misdiagnosed what the problem is. Maybe, for example, the API function that you think is crashing due to a bug in its implementation is simply crashing because there’s an error in the data you’ve supplied to it.

 

If you can’t find the answer online, you can also post your own question regarding the problem you’re facing on Stack Overflow. This, however, requires considerable investment in creating an SSCCE. This is the gold standard for asking questions in a forum: a short piece of code that other members can copy, paste, and compile on its own to witness your problem.

 

For some languages, you can even present your example in live form through an online IDE, such as SourceLair or JSFiddle. You can find more details on how to write good examples for specific languages and technologies at Short, Self Contained, Correct Example. Also worth reading is Eric Raymond's guide on this topic titled How To Ask Questions The Smart Way.

 

I’ve often found that simply the effort of putting together a well-written question and an accompanying example led me to my problem’s solution. But even if this doesn’t happen, a good example is likely to attract knowledgeable people who will experiment with it and, hopefully, provide you with a solution.

 

If your problem is partly associated with an open-source library or program, and if you have strong reasons to believe that there’s a bug in that code, you can also get in contact with its developers. Opening an issue on the code’s bug-tracking system is typically the way to go. 

 

Again here, make sure that there isn’t a similar bug report already filed, and that you include in your report precise details for reproducing the problem.

 

If the software doesn’t have a bug-tracking system, you can even try sending an email to its author. Be even more careful, considerate, and polite here; most open-source software developers aren’t paid to support you.

 

Things to Remember

  • Perform a web search regarding error messages by enclosing them in double quotes.
  • Value the answers from Stack Exchange sites.
  • If all else fails, post your own question or open an issue.

 

3: Confirm That Preconditions and Postconditions Are Satisfied

When repairing electronic equipment, the first thing to check is the power supplied to it: what comes out of the power supply module and what is fed into the circuit. In far too many cases, this points to the failure’s culprit.

 

Similarly, in computing, you can pinpoint many problems by examining what must hold at a routine’s entry point (preconditions: program state and inputs) and at its exit (postconditions: program state and returned values).

 

If the preconditions are wrong, then the fault lies in the part that set them up; if the postconditions are wrong, then there’s a problem with the routine. If both are correct, then you should look somewhere else to locate your bug.

 

Put a breakpoint at the beginning of the routine, or the location where it’s called, or the point where a crucial algorithm starts executing.

 

To verify that the preconditions have been satisfied, examine carefully the algorithm’s arguments, including parameters, the object on which a method is invoked, and the global state used by the suspect code. In particular, pay attention to the following.

 

Look for values that are null when they shouldn’t be.

Verify that arithmetic values are within the domain of the called math function; for example, check that the value passed to log is greater than zero. Look inside the objects, structures, and arrays passed to the routine to verify that their contents match what is required; this also helps you pinpoint invalid pointers.

 

Check that values are within a reasonable range. Often uninitialized variables have a suspect value, such as 6.89851e-308 or 61007410. Spot-check the integrity of any data structure passed to the routine; for example, that a map contains the expected keys and values, or that you can correctly traverse a doubly linked list.

 

Then, put a breakpoint at the end of the routine, or after the location where it’s called, or at the point where a crucial algorithm ends its execution. Now examine the effects of the routine’s execution.

 

Do the computed results look reasonable? Are they within the range of expected results?

If yes, are the results actually correct? You can verify this by executing the corresponding code by hand, by comparing them with known good values, or by calculating them with another tool or method. 

 

Are the routine’s side effects the expected ones? Has any other data touched by the suspect code been corrupted or set to an incorrect value? This is especially important for algorithms that maintain their own housekeeping information within the data structures they traverse.

 

Have the resources obtained by the algorithm, such as file handles or locks, been correctly released?
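
As a minimal sketch of this workflow, assuming a hypothetical C routine process_record() and the GDB debugger, the session below inspects the arguments at entry (the preconditions) and the returned value at exit (a postcondition):

# Stop at the entry of the suspect routine (hypothetical name)
(gdb) break process_record
(gdb) run
Breakpoint 1, process_record (rec=0x0, count=-1) at proc.c:42
# Examine the preconditions: here a null pointer and a negative count
(gdb) print rec
$1 = (struct record *) 0x0
(gdb) print count
$2 = -1
# Let the routine return and examine its return value (a postcondition)
(gdb) finish
Value returned is $3 = -22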

You can use the same method for higher-level operations and setups. Verify the operation of an SQL statement that constructs a table by looking at the tables and views it scans and the table it builds.

 

Work on a file-based processing task by examining its input and output files. Debug an operation that is built on web services by looking at the input and output of each individual web service.

 

Troubleshoot an entire data center by examining the facilities required and provided by each element: networking, DNS, shared storage, databases, middleware, and so on. In all cases, verify, don’t assume.

 

Things to Remember

  • Carefully examine a routine’s preconditions and postconditions.

 

4: Drill Up from the Problem to the Bug or Down from the Program’s Start to the Bug

There are generally two ways to locate the source of a problem. You can either start from the problem’s manifestation and work up toward its source, or you can start from the top level of your application or system and work down until you find the source of the problem.

 

Depending on the type of the problem, one approach is usually more effective than the other. However, if you reach a dead end going one way, it may be helpful to switch directions.

 

Starting from the place where the problem occurs is the way to go when there’s a clear sign of the problem. Here are three common scenarios.

 

First, troubleshooting a program crash is typically easy if you run your program in a debugger, if you attach a debugger to it when it has crashed, or if you obtain a memory dump. What you need to do is examine the values of the variables at the point where the crash occurred, looking for null, corrupted, or uninitialized values that could have triggered the crash.

 

On some systems, you can easily recognize uninitialized values through their recurring byte values, such as 0xBAADF00D—bad food. You can find a complete list of such values in the Magic Number Wikipedia article.

 

Having located a variable with an incorrect value, try to determine the underlying reason, either within the routine where the crash occurred or by moving up the stack of routine calls looking for incorrect arguments or other causes associated with the crash.

 

If this search doesn’t help you find the problem, begin a series of program runs, again under a debugger. Each time, set a breakpoint near the point where the incorrect value could have been computed. Continue by placing breakpoints earlier and further up the call sequence until you locate the problem.

 

Second, if your problem is a program freeze rather than a crash, then the process of bottom-up troubleshooting starts a bit differently.

 

Run your program within a debugger, or attach one to it, and then break its execution with the corresponding debugger command or force it to generate a memory dump. Sometimes you will realize that the executing code is not your own but that of a library routine.

 

No matter where the break occurred, move up the call stack until you locate the loop that caused your program to freeze. Examine the loop’s termination condition, and, starting from it, try to understand why it never gets satisfied.

 

Third, if the problem is manifesting itself through an error message, start by locating the message’s text within the program’s source code. Your friend here is fgrep -r because it can quickly locate the wording within arbitrarily deep and complex hierarchies.

 

In modern localized software, this search will often not locate the code associated with the message, but rather the corresponding string resource.

 

For example, assume you live in a Spanish-speaking country and you’re debugging a problem that displays the error “Ha ocurrido un error al procesar el archivo XCF” in the Inkscape drawing program. Searching for this string in the Inkscape source code will lead you to the Spanish localization file es.po.

#: ../share/extensions/gimp_xcf.py:43
msgid "An error occurred while processing the XCF file."
msgstr "Ha ocurrido un error al procesar el archivo XCF."

 

From the localization file, you can obtain the location of the code (share/extensions/gimp_xcf.py, line 43) associated with the message. You then set a breakpoint or log statement just before the point where the error message appears in order to examine the problem at the point where it occurs.

 

Again, be prepared to move back a few code statements and up in the call stack to locate the problem’s root cause. If you’re searching for non-ASCII text, ensure that your command-line locale settings match the text encoding (e.g., UTF-8) of the source code you’ll be searching.

 

Working from the top level of your failing system down toward the location of the fault is the way to go when you can’t clearly identify the code associated with the failure. By definition, this is often the case with so-called emergent properties: properties (failures) of your system that you can’t readily associate with a specific part of it.

 

Examples include performance (your software takes up too much memory or takes too long to respond), security (you find your web application’s web page defaced), and reliability (your software fails to provide the web service it was supposed to).

 

The way to work top down is to decompose the whole picture into parts, examining the likely contribution of each one to the failure you’re debugging. With performance problems, the usual approach is profiling: tools and libraries that help you find out which routines clog the CPU or devour memory.

 

In the case of a security problem, you would examine all the code for typical vulnerabilities, such as those that lead to buffer overflows, code injections, and cross-site scripting attacks.

 

Again, there are tools that can help analyze the code for such problems. Finally, in the case of the failing web service, you would dig into each of its internal and external dependencies, verifying that these work as they’re supposed to.

 

Things to Remember

  • Work bottom up in the case of failures that have a clearly identifiable cause, such as crashes, freezes, and error messages.
  • Work top down in the case of failures that are difficult to pin down, such as those involving performance, security, and reliability.

 

5: Find the Difference between a Known Good System and a Failing One

Often, you may have access both to a failing system and to a similar one that’s working fine. This might happen after you implement some new functionality, when you upgrade your tools or infrastructure, or when you deploy your system on a new platform.

 

If you can still access the older working system, you can often pinpoint the problem by looking at differences between the two systems (as you will see here) or by trying to minimize them.

 

This differential mode of debugging works because, despite what our everyday experience suggests, deep down computers are designed to work deterministically; the same inputs produce identical results. Probe sufficiently deep within a failing system, and sooner or later you’ll discover the bug that causes it to behave differently from the working one.

 

It’s surprising how many times a system’s failure reason stares you right in the eye, if only you would take the time to open its log file. The log could contain a line indicating, for example, an error in the clients.conf configuration file. In other cases, the reason is hidden deeper, so you must increase the system’s log verbosity in order to expose it.

 

If the system doesn’t offer a sufficiently detailed logging mechanism, you have to tease out its runtime behavior with a tracing tool.

 

Besides general-purpose tools such as DTrace and SystemTap, some specialized tools I’ve found useful are those that trace calls to the operating system (strace, truss, Procmon), those that trace calls to the dynamically linked libraries (ltrace, Procmon), those that trace network packets (tcpdump, Wireshark);

 

and those that allow the tracing of SQL database calls. Many Unix applications, such as the R Project, start their operation through complex shell scripts, which can misbehave in wonderfully obscure ways.

 

You can trace their operation by passing the -x option to the corresponding shell. In most cases, the trace you obtain will be huge. Thankfully, modern systems have both the storage capacity to save the two logs (the one of the correctly functioning run as well as the one of the failing one) and the CPU oomph to process and compare them.
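
As a minimal sketch (the script name and its inputs are hypothetical), you could capture the shell trace of a working and a failing run into two files and then compare them:

# Trace a working and a failing run of a startup script
bash -x ./start-acme.sh good-config 2> good.trace
bash -x ./start-acme.sh bad-config 2> bad.trace
# Compare the two traces to spot where their behavior diverges
diff good.trace bad.trace | less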

 

When it comes to the environments in which your systems operate, your goal is to make the two environments as similar as possible. This will make your logs and traces easy to compare, or, if you’re lucky, it will lead you directly to the cause behind the bug.

 

Start with the obvious things, such as the program’s inputs and command-line arguments. Again, verify, don’t assume. Actually, compare the input files of the two systems against each other or, if they’re big and far away, compare their MD5 sums.

 

Then focus on the code. Start by comparing the source code, but be ready to delve deeper, for this is where the bugs often lurk. Examine the dynamic libraries associated with each executable by using a command such as ldd (on Unix) or dumpbin with the /dependents option (when using Visual Studio).

 

See the defined and used symbols using nm (on Unix), dumpbin with /exports and /imports (Visual Studio), or javap (when developing Java code). If you’re sure the problem lies in the code but can’t see any difference, be prepared to dig even deeper, comparing the assembly code that the compiler generates.
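
For instance, on Unix you could diff the two executables' library dependencies and dynamic symbols directly; the file names here are hypothetical:

# Compare the dynamic libraries the two builds load
diff <(ldd ./app-good) <(ldd ./app-bad)
# Compare the dynamic (exported and imported) symbols of the two builds
diff <(nm -D ./app-good | sort) <(nm -D ./app-bad | sort)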

 

But before you go to such an extreme, consider other elements that influence the setup of a program’s execution. An underappreciated factor is environment variables, which even an unprivileged user can set in ways that can wreak havoc on a program’s execution. Another is the operating system.

 

Your application might be failing on an operating system that’s a dozen years newer or older than the one on which it’s functioning properly.

 

Also consider the compiler, the development framework, third-party linked libraries, the browser (oh, the joy), the application server, the database system, and other middleware. How to locate the culprit in this maze is the topic we’ll tackle next.

 

Given that in most cases you’ll be searching for a needle in a haystack, it makes sense to trim down the haystack’s size.

 

Therefore, invest the time to find the simplest possible test case in which the bug appears. (Making the needle—the buggy output—larger is rarely productive.) A svelte test case will make your job easier through shorter logs, traces, and processing times.

 

Trim down the test case methodically by gradually removing either elements from the case itself or configuration options from the system, until you arrive at the leanest possible setting that still exhibits the bug.

 

If the difference between the working and failing system lies in their source code, a useful method is to conduct a binary search through all the changes performed between the two versions so as to pinpoint the culprit.

 

Thus, if the working system is at version 100 and the failing one is at version 132, you’ll first test version 116, and then, depending on the outcome, versions 108 or 124, and so on.

 

The ability to perform such searches is one reason why you should always commit each change separately into your version control system. Thankfully, some version control systems offer a command that performs this search automatically; on Git, it’s the git bisect command.
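
Here is a minimal sketch of such an automated search with Git; the revision identifiers are hypothetical:

# Start the search, marking the failing and the last known good revisions
git bisect start
git bisect bad                 # the currently checked-out (failing) version
git bisect good v1.0           # hypothetical tag of the known good version
# Git checks out a revision halfway between the two; build and test it,
# then report the outcome with "git bisect good" or "git bisect bad".
# Repeat until Git announces the first bad commit, then clean up.
git bisect reset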

 

Another highly productive option is to compare the two log files with Unix tools to find the difference related to the bug. The workhorse in this scenario is the diff command, which will display the differences between the two files.

 

However, more often than not, the log files differ in non-essential ways, thus hiding the changes that matter. There are many ways to filter these out. If the leading fields of each line contain varying elements, such as timestamps and process IDs, eliminate them with the cut or awk commands.

 

As an example, the following command will eliminate the time-stamp from the Unix messages log file by displaying the lines starting from the fourth field.

cut -d ' ' -f 4- /var/log/messages

 

Select only the events that interest you—for instance, files that were opened—using a command such as grep 'open('. Or eliminate noise lines (such as those thousands of annoying calls to get the system’s time in Java programs) with a command such as grep -v gettimeofday.

 

You can also eliminate parts of a line that don’t interest you by specifying the appropriate regular expression in a sed command.
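
For example, a substitution along the following lines (the pattern is only illustrative) replaces varying hexadecimal addresses with a fixed token so that they no longer show up as differences:

# Normalize hexadecimal addresses, such as 0x7f3a12bc, before comparing logs
sed 's/0x[0-9a-fA-F]*/0xADDR/g' trace.log > trace.normalized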

 

Finally, a more advanced technique that I’ve found particularly useful if the two files aren’t ordered in a way in which diff can yield useful results is to extract the fields that interest you, sort them, and then find the elements that aren’t common in the two sets with the comm tool.

 

Consider the task of finding which files were opened in only one of two trace files, t1 and t2. In the Unix Bash shell, the incantation for finding differences in the second field (the file name) among lines containing the string open( would be as follows:

comm -23 <(awk '/open\(/{print $2}' t1 | sort) \
         <(awk '/open\(/{print $2}' t2 | sort)

 

The two <(...) elements generate an ordered list of the file names that were passed to open. The comm (find common elements) command takes the two lists as input and outputs the lines appearing only in the first one.

 

Things to Remember

  • Compare the behavior of a known good system with that of a failing one to find the failure’s cause.
  • Consider all of the elements that can influence a system’s behavior: code, input, invocation arguments, environment variables, services, and dynamically linked libraries.

 

6: Use the Software’s Debugging Facilities

Programs are complex beasts, and for this reason they often contain built-in debugging support. This can, among other things, achieve the following:

  • Make the program easier to debug by disabling features such as background or multi-threaded execution
  • Allow the precise targeting of a failing test case through its selective execution
  • Provide reports and other intelligence regarding performance
  • Introduce additional logging

Therefore, invest some effort to find out what debugging facilities are available in the software you’re debugging. Searching the program’s documentation and source code for the word debug is an excellent starting point.

 

This can point you to command-line options, configuration settings, build options, signals (on Unix systems), registry settings (on Windows), or command-line interface commands that will enable the program’s debug mode.

 

Typically, setting debugging options will make the program’s operation more transparent through verbose output and, sometimes, simpler operation.

 

(Unfortunately, these settings can also cause some bugs to disappear.) Use the expanded log output to explore the reasons behind a failure you’re witnessing. Here are a few examples.

 

A simple case of debugging functionality involves having a command detail its actions. For example, the Unix shells offer the -x option to display the commands they execute. This is useful for debugging tricky text substitution problems.

 

Often, a number of options can be combined to set up an execution that’s suitable for debugging a problem. Consider troubleshooting a failed ssh connection.

 

Instead of modifying the global sshd configuration file or keys, which risks locking everybody out, you can invoke sshd with options that specify a custom configuration file to use (-f) and a port distinct from the default one (-p).

 

(Note that the same port must also be specified by the client.) Adding the -d (debug) will run the process in the foreground, displaying debug messages on the terminal. These are the commands that will be run on the two hosts where the connection problem occurs.

 

# Command run on the server side
sudo /usr/sbin/sshd -f ./sshd_config -d -p 1234

# Command run on the client side
ssh -p 1234 server.example.com

 

Another type of debugging functionality allows you to target a precise case. Consider trying to understand why a specific email message out of the thousands being delivered on a busy host faces a delivery problem.

 

This can be examined by invoking the mail server’s sendmail command with the verbose (-v) and message delivery (-M) options, followed by the identifier of the failing message.

sudo sendmail -v -M 1ZkIDm-0006BH-0X

 

Things to Remember

  • Identify what debugging facilities are available in the software you’re troubleshooting, and use them to investigate the problem you’re examining.

 

7: Diversify Your Build and Execution Environment

Sometimes you can pin down subtle, elusive bugs by changing the playing field. You can do this by building the software with another compiler, or by switching the runtime interpreter, virtual machine, middleware, operating system, or CPU architecture.

 

This works because the other environment may perform stricter checks on the inputs you supply to some routines or because its structure amplifies your mistake.

 

Therefore, if you experience application instability, crashes that you can’t replicate, or portability problems, try testing your software on another setup.

 

Such a change can also allow you to use more advanced debugging tools, such as a nifty graphical debugger or dtrace. Compiling or running your software on another operating system can unearth incorrect assumptions regarding API usage.

 

As an example, some C and C++ header files declare more entities than are strictly needed, which may lull you into forgetting to include a required header, resulting, again, in portability problems for your customers. Also, some API implementations can vary significantly between operating systems:

 

Solaris, FreeBSD, and GNU/Linux ship with different implementations of the C library, while the desktop and mobile versions of the Windows API rely on different code bases. Note that these differences can also affect interpreted languages that use the underlying C libraries and APIs, such as JavaScript, Lua, Perl, Python, or Ruby.

 

On languages that run close to the hardware, such as C and C++, the underlying processor architecture can influence a program’s behavior.

 

Over the past decades, the dominance of Intel’s x86 processor architecture on the desktop and the ARM architecture on mobile devices has reduced the popularity of architectures with significant differences in byte ordering (SPARC, PowerPC) or even the handling of null pointer indirection (VAX).

 

Nevertheless, differences in the handling of misaligned memory accesses and in memory layout still exist between the x86 and ARM architectures.

 

For example, accessing a two-byte value on an odd memory address can generate a fault on some ARM CPUs and may result in non-atomic behavior. On other architectures, misaligned memory accesses may severely affect an application’s performance.

 

Furthermore, the size of structures and the offsets of their members can differ between the two architectures, especially when using earlier compiler versions.

 

More importantly, the sizes of primitive elements, such as long and pointer values, change as you move code from 32-bit to 64-bit architectures or from one operating system to another. Consider the following program, which displays the sizes of five primitive elements.

#include <stdio.h>

int
main()
{
    printf("S=%zu I=%zu L=%zu LL=%zu P=%zu\n",
        sizeof(short), sizeof(int), sizeof(long),
        sizeof(long long), sizeof(char *));
}
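
For instance, on a typical 64-bit GNU/Linux system (an LP64 platform) the program prints the first line below, whereas on 64-bit Windows (an LLP64 platform) the long type stays at four bytes, as in the second line:

S=2 I=4 L=8 LL=8 P=8
S=2 I=4 L=4 LL=8 P=8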

Therefore, running your software on another architecture or operating system can help you debug and detect portability problems.

 

On mobile platforms, there’s a huge variation not only on the version of the operating system they run (most phone and tablet manufacturers ship their own modified version of the Android operating system), but also significant hardware differences: screen resolution, interfaces, memory, and processor.

 

This makes it even more important to be able to debug your software on a variety of such platforms. To address this problem, mobile app development groups often maintain a stock of many different devices.

 

There are three main ways to debug your code in another execution environment.

You can use virtual machine software on your workstation to install and run diverse operating systems. This approach has the added advantage of providing you with an easy way to maintain a pristine image of the execution environment: just copy the configured virtual machine image to the “master” file, which you can restore when needed.

 

You can use small inexpensive computers. If the architecture you’re mainly targeting is x86, an easy way to get your hands on an ARM CPU is to keep at hand a Raspberry Pi.

 

This miniature ARM-based device runs many popular operating systems. It’s easy to plug into an Ethernet switch or connect it via Wi-Fi so that you can access it over the network.

 

This will also allow you to cut your teeth on the GNU/Linux development environment, which can be beneficial if you’re mainly debugging your code on Windows or OS X. Also, if Windows is your regular cup of tea, a Mac mini tucked under your desk can offer you easy access to an OS X development environment.

 

You can rent cloud-based hosts running the operating systems you want to use.

 

It’s not always necessary to use another operating system or device to debug your software on diverse compiler and runtime environments. You can easily introduce ecosystem variety on your own development workstation.

 

By doing so, you can regularly benefit from the additional errors and warnings and stricter conformance in some areas that another environment can offer you. As is the case with static analysis tools, different compilers can typically detect more problems than a single one can.

 

This includes both portability problems, which may inadvertently creep in due to lax checking by a particular compiler, and logical flaws that one compiler may not warn about.

 

Compilers are very good at compiling any legal code into a matching executable but are sometimes not as good at flagging misuse of the language, for example, identifying code that works only if a particular included header file also declares some undocumented elements.

 

The second pair of compiler eyes can help you in this regard. All you need to do is to install—and use as part of your debugging lifecycle—an alternative to your mainstream environment. Here are some suggestions.

 

  • For .NET Framework development, use Mono in parallel with Microsoft's tools and environment.
  • For the development of Ada, C, C++, Objective C, and other supported languages, use both LLVM and GCC.
  • For Java development, use both the OpenJDK (or Oracle’s offering from the same code base) and GNU Classpath. Also, try using more than one Java runtime.
  • For Ruby programs, apart from the reference CRuby implementation, try other VMs: JRuby, Rubinius, and mruby.

 

A more radical alternative involves reimplementing part of your code in another language. This can be helpful when you’re debugging a tricky algorithm. The typical case involves an initial (failing) implementation written in a relatively low-level language, such as C. Consider implementing an alternative in a more high-level language: Python, R, Ruby, Haskell, or the Unix shell.

 

The alternative implementation achieved by using the language’s high-level features, such as operations on sets, pipes, and filters, and higher-order functions, may help you arrive at a correctly functioning algorithm.

 

Through this method, you can quickly identify problems in the algorithm’s design and also fix implementation faults. Then, if performance is really critical, you can implement the algorithm in the original language or a language that’s closer to the CPU and use differential debugging techniques to make it work.

 

Things to Remember

  • Diverse compilation and execution platforms can offer you valuable debugging insights.
  • Fix a tricky algorithm by implementing it in a higher-level language.

 

8: Focus Your Work on the Most Important Problems

Most big software systems ship and operate with countless (known and unknown) bugs. Deciding in an intelligent manner on which bugs to concentrate and which bugs to ignore will increase your debugging effectiveness.

 

Hopefully, you’re not being paid to minimize the number of open issues, but to help deliver reliable, usable, maintainable, and efficient software.

 

Therefore, set priorities through an issue-tracking system and use them to concentrate your work on top-priority issues and to ignore low-priority ones. Here are a few points to help your prioritization.

 

Give a high priority to the following types of problems.

Data loss: This can occur as a result of either data corruption or usability issues. Users entrust their data to your software. If you lose their data you violate that trust, and trust that is lost is difficult to regain.

 

Security: This may affect the confidentiality or integrity of the software's data, the integrity of the system where your software is running, or the availability of the service your software is providing.

 

Such problems are often exploited by malicious individuals and can therefore result in large monetary and reputational damage. Security problems can also garner unwelcome attention from regulatory authorities or extortionists. Consequently, sweeping security issues under the carpet is not an option.

 

Reduced service availability: If your software is providing a service, the cost of downtime may be measured in dollars (sometimes millions of them). Lost goodwill, late-night phone calls from irate managers, and clogged support desks are additional consequences you want to avoid.

 

Safety: These are issues that may result in death or serious injury to people, loss or severe damage to property, or environmental harm. All consequences of the preceding problems apply here. If your software can fail in such a way, you should have more rigorous processes than this list to guide your actions.

 

Crash or freeze: This may result in data loss or downtime, and it may also signify an underlying security problem. Thankfully, you can often easily debug a crashed or non-responding application through postmortem debugging. Consequently, it makes little sense to give such issues a low priority.

 

Code hygiene: Compiler warnings, failed assertions, unhandled exceptions, memory leaks, and, in general, inferior code quality provide a fertile ground for serious bugs to develop and hide. Therefore, don’t let such issues persist and accumulate.

 

The following are types of problems you may decide to relegate to a lower priority. These issues are not by themselves unworthy of your attention. However, they are issues you may be able to set aside in order to deal with more urgent ones.

 

Legacy support: Support for outdated hardware, API, and file formats is commendable, but, from a business perspective, it won’t get you very far because, by definition, you’re serving a shrinking market.

 

Backward compatibility: Here the case is less clear-cut because if your software evolves in a way that leaves behind past users, you’re losing customer goodwill.

 

Some companies, such as Nikon, have established a stellar reputation by maintaining backward compatibility through many generations of their product: you can still use a 1970s Nikkor lens on high-end modern Nikon cameras.

 

On the other hand, some successful software firms are known for their “take no prisoners” approach, where they ditch support for older software and services without any qualm. Sometimes it may be worth eliminating support for an old feature in order to focus on the future.

 

Cosmetic issues: These may be devilishly hard to get right and easy to ignore. You are unlikely to lose business over a truncated bubble-help message, but dynamically adjusting the size of a panel based on the screen’s dpi setting can be a nightmare.

 

Documented workarounds: You may be able to avoid debugging some tricky issues by documenting a workaround. After switching on my TV, the first time I try to use the TV’s remote to operate the media player I get a “Please try again” prompt. I suspect that properly fixing this minor problem may be a major project.

 

Rarely used features: For problems associated with an exotic, rarely used feature of your software, it may be more productive to yank the corresponding feature (and deal with the small, if any, fallout) than to actually solve the problem. Collecting usage data regarding your software can make it easier for you to reach such decisions.

 

Note that you should be explicit when you decide to ignore a low-priority issue. File it in the issue-tracking system, and then close it with an action such as “won’t solve.” This documents the decision you’ve made and helps avoid the management overhead of future duplicate issues.

 

Things to Remember

  • Not all problems are worth solving.
  • Fixing a low-priority issue may deprive you of the time required to address a high-priority one.

 

9: Analyze Debug Data with Unix Command-Line Tools

When you’re debugging, you’ll encounter problems no one has ever seen before. Consequently, the shiny IDE that you’re using for writing software may lack the tools to let you explore the problem in detail and with sufficient power.

 

This is where the Unix command-line tools come in. Being general-purpose tools that can be combined into sophisticated pipelines, they allow you to analyze text data effortlessly.

 

Line-oriented textual data streams are the lowest useful common denominator for a lot of data that passes through your hands.

 

Such streams can be used to represent many types of data you encounter when you’re debugging, such as program source code, program logs, version control history, file lists, symbol tables, archive contents, error messages, test results, and profiling figures.

 

For many routine, everyday tasks, you might be tempted to process the data using a powerful Swiss Army knife scripting language, such as Perl, Python, Ruby, or the

 

Windows PowerShell. This is an appropriate method if the scripting language offers a practical interface for obtaining the debug data you want to process and if you’re comfortable developing the scripting command in an interactive fashion.

 

Otherwise, you may need to write a small, self-contained program and save it into a file. By that point, you may find the task too tedious, and end up doing the work manually, if at all. This may deprive you of important debugging insights.

 

Often, a more effective approach is to combine programs of the Unix tool chest into a short and sweet pipeline that you can run from your shell’s command prompt. With the modern shell command-line editing facilities, you can build your command bit by bit, until it molds into exactly the form that suits you.

 

Here you’ll find an overview of how to process debug data using Unix commands. If you’re unfamiliar with the command-line basics and regular expressions, consult an online tutorial. Also, you can find the specifics of each command’s invocation options by giving its name as an argument to the man command.

 

Depending on the operating system you’re using, getting to the Unix command line is trivial or easy. On Unix systems and OS X, you simply open a terminal window.

 

On Windows, the best course of action is to install Cygwin: a large collection of Unix tools and a powerful package manager ported to run seamlessly under Windows. Under OS X, the Homebrew package manager can simplify the installation of a few tools described here that are not available by default.

 

Many debugging one-liners that you’ll build around the Unix tools follow a pattern that goes roughly like this: fetching, selecting, processing, and summarizing.

 

You’ll also need to apply some plumbing to join these parts into a whole. The most useful plumbing operator is the pipeline (|), which sends the output of one processing step as input to the next one.

 

Most of the time your data will be text that you can directly feed to the standard input of a tool. If this is not the case, you need to adapt your data. If you are dealing with object files, you’ll have to use a command such as nm (Unix), dumpbin (Windows), or javap (Java) to dig into them.

 

For example, if your C or C++ program exits unexpectedly, you can run nm on its object files to see which ones call (import) the exit function.

# List symbols in all object files, prefixed by the file name
nm -A *.o |
# List lines ending in "U exit", i.e., undefined references to exit
grep 'U exit$'

 

If you’re working with files grouped into an archive, then a command such as tar, jar, or ar will list you the archive’s contents. If your data comes from a (potentially large) collection of files, the find command can locate those that interest you.

 

On the other hand, to get your data over the web, use curl or wget. You can also use dd (and the special file /dev/zero), yes, or jot to generate artificial data, perhaps for running a quick benchmark.
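
As a small sketch of the latter (the file names are arbitrary), you could generate throwaway test data like this:

# Create a 100 MB file of zero bytes for a quick I/O benchmark
dd if=/dev/zero of=zeros.dat bs=1M count=100
# Create a million identical text records
yes 'test record' | head -n 1000000 > records.txt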

 

Finally, if you want to process a compiler’s list of error messages, you’ll want to redirect its standard error to its standard output or to a file; the incantations 2>&1 and 2>filename will do this trick.

 

As an example, consider the case in which you’ve changed a function’s interface and want to edit all the files that are affected by the change. One way to obtain a list of those files is the following pipeline.

 

# Attempt to build all affected files, redirecting standard error
# to standard output
make -k 2>&1 |
# Print the name of the file where the error occurred
awk -F: '/no matching function for call to Myclass::myFunc/ { print $1 }' |
# List each file only once
sort -u

 

Given the generality of log files and other debugging data sources, in most cases, you’ll have on your hands more data than what you require. You might want to process only some parts of each row or only a subset of the rows.

 

To select a specific column from a line consisting of fixed-width fields or elements separated by space or another field delimiter, use the cut command. If your lines are not neatly separated into fields, you can often write a regular expression for a sed substitute command to isolate the element you want.

 

The workhorse for obtaining a subset of the rows is grep. Specify a regular expression to get only the rows that match it, and add the -v flag to filter out rows you don’t want to process. For example, the following pipeline lists lines containing a division operator while excluding divisions by sizeof.

# Find lines containing a division operator
grep -r ' / ' . |
# Exclude divisions by sizeof
grep -v '/ sizeof'

 

Use fgrep (grep for fixed strings) with the -f flag if the elements you’re looking for are plain character sequences rather than regular expressions and if they are stored into a file (perhaps generated in a previous processing step). If your selection criteria are more complex, you can often express them in an awk pattern expression.

 

Many times you’ll find yourself combining a number of these approaches to obtain the result that you want. For example, you might use grep to get the lines that interest you, grep -v to filter out some noise from your sample, and finally awk to select a specific field from each line.

 

For example, the following sequence processes system trace output lines to display the names of all successfully opened files.

# Output lines that call open
grep '^open(' trace.out |
# Remove failed open calls (those that return -1)
grep -v '= -1' |
# Print the second field, using the double quote as the separator
awk -F\" '{print $2}'

(The sequence could have been written as a single awk command, but it was easier to develop it step-by-step in the form you see.)

 

You’ll find that data processing frequently involves sorting your lines on a specific field. The sort command supports tens of options for specifying the sort keys, their type, and the output order.

 

Once your results are sorted, you can then efficiently count how many instances of each element you have. The uniq command with the -c option will do the job here; often you’ll post-process the result with another sort, this time with the -n flag specifying a numerical order, to find out which elements appear most frequently.
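
Here is a brief sketch of this counting idiom, assuming a colon-separated log format in which the fourth field holds the message text:

# Show the ten most frequent log messages
awk -F: '{print $4}' app.log |
sort |
uniq -c |
sort -rn |
head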

 

In other cases, you might want to compare results between different runs. You can use diff if the two runs generate results that should be the same (perhaps the output of a regression test) or comm if you want to compare two sorted lists. You’ll handle more complex tasks, again, using Awk.

 

As an example, consider the task of investigating a resource leak. A first step might be to find all files that directly call obtainResource but do not include any direct call to releaseResource. You can find this through the following sequence.

 

# List records occurring only in the first set
comm -23 <(
# List names of files containing obtainResource
grep -rl obtainResource . | sort) <(
# List names of files containing releaseResource
grep -rl releaseResource . | sort)

(The <(...) sequence is an extension of the bash shell that provides a file-like argument supplying, as input, the output of the process within the parentheses.)

 

In many cases, the processed data is too voluminous to be of use. For example, you might not care which log lines indicate a failure, but you might want to know how many there are. Surprisingly, many problems involve simply counting the output of the processing step using the humble wc (word count) command and its -l flag.

 

If you want to know the top or bottom 10 elements of your result list, then you can pass your list through head or tail. Thus, to find the people most familiar with a specific file (perhaps in your search for a reviewer), you can run the following sequence.

 

# List each line's last modification
git blame --line-porcelain Foo.java |
# Obtain the author
grep '^author ' |
# Sort to bring the same names together
sort |
# Count the number of each name's occurrences
uniq -c |
# Sort by number of occurrences
sort -rn |
# List the top one
head -1

The tail command is particularly useful for examining log files. Also, to examine your voluminous results in detail, you can pipe them through more or less; both commands allow you to scroll up and down and search for particular strings.

 

As usual, use awk when these approaches don’t suit you; a typical task involves summing up a specific field with a command such as sum += $3.

 

For example, the following sequence will process a web server log and display the number of requests and the average number of bytes transferred per request.

awk '
# When the HTTP result code is a success (200),
# sum field 10 (the number of bytes transferred)
$9 == 200 {sum += $10; count++}
# When input finishes, print the count and the average
END {print count, sum / count}' /var/log/access.log

 

All the wonderful building blocks of Unix are useless without some way to glue them together. For this, you’ll use the Bourne shell’s facilities.

 

In some cases, you might want to execute the same command with many different arguments. For this, you’ll pass the arguments as input to xargs. A typical pattern involves obtaining a list of files using find and processing them using xargs.

 

So common is this pattern, that in order to handle files with embedded spaces in them (such as the Windows “Program Files” folder), both commands support an argument (-print0 and -0) to have their data terminated with a null character, instead of a space.

 

As an example, consider the task of finding the log file created after you modified foo.cpp that contains the largest number of occurrences of the string “access failure.” This is the pipeline you would write.

 

# Find all files in the /var/log/acme folder that were modified
# after changing foo.cpp
find /var/log/acme -type f -cnewer ~/src/acme/foo.cpp -print0 |
# Apply fgrep to count the number of 'access failure' occurrences
xargs -0 fgrep -c 'access failure' |
# Sort the :-separated results in reverse numerical order
# according to the value of the second field
sort -t: -rn -k2 |
# Print the top result
head -1

 

If your processing is more complex, you can always pipe the arguments into a while read loop (amazingly, the Bourne shell allows you to pipe data into and from all its control structures).

 

For instance, if you suspect that a problem is related to an update of a system’s dynamically linked library (DLL), through the following sequence you can obtain a listing with the version of all DLL files in the windows/system32 directory.

# Find all DLL files
find /cygdrive/c/Windows/system32 -type f -name \*.dll |
# For each file
while read f ; do
  # Obtain its Windows path with escaped \
  wname=$(cygpath -w $f | sed 's/\\/\\\\/g')
  # Run a WMIC query to get its name and version
  wmic datafile where "Name=\"$wname\"" get name, version
done |
# Remove headers and blank lines
grep windows

When everything else fails, don’t shy away from using a couple of intermediate files to juggle your data.

 

Things to Remember

  • Analyze debug data through Unix commands that can obtain, select, process, and summarize textual records.
  • By combining Unix commands with a pipeline, you can quickly accomplish sophisticated analysis tasks.

 

10: Utilize Command-Line Tool Options and Idioms

The program you’re debugging produces a cryptic “Missing foo” error message. Where is the culprit code? Running fgrep -lr 'Missing foo' in the application’s source code directory will recursively (-r) search through all the files and list (-l) those containing the error message.

 

The beauty of performing a textual search with grep is that this will work irrespective of the programming language of the code that produces the error message.

 

This is particularly useful in applications written in multiple languages, or when you lack the time to set up a project within an IDE. Note that the (-r) fgrep option is a GNU extension, which purists find distasteful. If you’re working on a system lacking this facility, the following pipeline will perform exactly the same task.

find . -type f | xargs fgrep -l 'Missing foo'

 

Often the data you’re examining contains a lot of noise: records you don’t want to see. Although you could tailor a grep regular expression to select the records you want, in many cases it’s easier to simply discard the records that bother you using the -v argument of the grep command. Particularly powerful is the combination of multiple such commands.

 

For example, to obtain all the log records that include the string “Missing foo” but do not contain “connection failure” or “test,” you can use a pipeline such as the following:

fgrep 'Missing foo' *.log |
fgrep -v 'connection failure' |
fgrep -v test

 

The output of the grep command consists of the lines that match the specified regular expression. However, if those lines are long, it may be difficult to easily see the part of the line where the culprit occurs.

 

For example, you might believe that a display problem associated with a (badly formatted) HTML file has to do with a table tag. How can you quickly inspect all such tags? Passing the --color option to grep, as in grep --color table file.html will show all the table tags in red, simplifying their inspection.

 

By convention, programs that run on the command line do not send errors to their standard output. Doing that might confuse other programs that process their output and also hide the error message from the program's human operator if the output is redirected into a file.

 

Instead, error messages are sent to a different channel, called the standard error. This will typically appear on the terminal through which the command was invoked, even if its output was redirected.

 

However, when you’re debugging a program you might want to process that output, rather than see it fly away on the screen. Two redirection operators can help you here. First, you can send the standard error (by convention, file descriptor 2) into a file for later processing by specifying 2>filename when running the program.
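For example, with a placeholder program name, you could capture the standard error into a file and search it afterwards.

program 2>program.err
fgrep 'Missing foo' program.err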

 

You can also redirect the standard error to the same file descriptor as the standard output (file descriptor 1), so that you can process both with the same pipeline. For example, the following command passes both outputs through more, allowing you to scroll through the output at your own pace.

 

program 2>&1 | more

When debugging non-interactive programs, such as web servers, all the interesting action is typically recorded into a log file. Rather than repeatedly viewing the file for changes, the best thing to do is to use the tail command with the -f option to examine the file as it grows.

 

The tail command will keep the log file open and register an event handler to get notifications when the file grows. This allows it to display the log file’s changes in an efficient manner.

 

If the process writing the file is likely at some point to delete or rename the log file and create a new one with the same name (e.g., to rotate its output), then passing the --follow=name option to tail will instruct tail to follow the file with that name, rather than the file descriptor associated with the original file.
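As a sketch (the log file path is only an example), following a log file by name looks like this.

tail --follow=name /var/log/acme/access.log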

 

Once you have tail running on a log file, it pays to keep that on a separate (perhaps small) window that you can easily monitor as you interact with the application you’re debugging. If the log file contains many irrelevant lines, you can pipe the tail output into grep to isolate the messages that interest you.

sudo tail -f /var/log/maillog | fgrep 'max connection rate'

 

If the failures you’re looking for are rare, you should set up a monitoring infrastructure to notify you when something goes wrong. For one-off cases, you can arrange to run a program in the background even after you log off by suffixing its invocation with an ampersand and running it with the nohup utility. You will then find the program’s output and errors in a file named nohup.out.
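For example, with a placeholder command name, you might run the following.

nohup long-running-regression-test &
# Later, even after logging off and back in
less nohup.out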

 

Or you can pipe a program’s output to the mail command so that you will get it when it finishes. For runs that will terminate within your workday, you can set a sound alert to go off after the command finishes.

long-running-regression-test ; printf '\a'
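A minimal sketch of the mail approach (the subject line is a placeholder) might be the following.

long-running-regression-test 2>&1 |
mail -s 'Regression run finished' jdh@example.com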

 

You can even combine the two techniques to get an audible alert or a mail message when a particular log line appears.

sudo tail -f /var/log/secure |
fgrep -q 'Invalid user' ; printf '\a'

sudo tail -f /var/log/secure |
fgrep -m 1 'Invalid user' |
mail -s Intrusion jdh@example.com

 

Modifying the preceding commands with the addition of a while read loop can make the alert process run forever. However, such a scheme enters into the realm of an infrastructure monitoring system for which there are specialized tools.
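For reference, a minimal sketch of such a loop, reusing the log file and message of the preceding examples, is the following.

sudo tail -f /var/log/secure |
while read line ; do
case "$line" in
*'Invalid user'*) printf '\a' ;;
esac
done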

Things to Remember

  • Diverse grep options can help you narrow down your search.
  • Redirect a program’s standard error in order to analyze it.
  • Use tail -f to monitor log files as they grow.

 

Explore Debug Data with Your Editor

Debuggers may get all the credit, but your code editor (or IDE) can often be an equally nifty tool for locating the source of a bug. Use a real editor, such as Emacs or vim, or a powerful IDE.

 

Whatever you do, trade up from your system’s basic built-in editor, such as Notepad (Windows), TextEdit (OS X), or Nano and Pico (various Unix distributions). These editors offer only rudimentary facilities.

 

Your editor’s search command can help you navigate to the code that may be associated with the problem you’re facing. In contrast to your IDE’s function to find all uses of a given identifier, your editor’s search function casts a wider net because it can be more flexible, and also because it includes in the search space text appearing in comments.

 

One way to make your search more flexible is to search with the stem of the word. Say you’re looking for code associated with an ordering problem. Don’t search for “ordering.” Rather, search for “order,” which will get you all occurrences of order, orders, and ordering.

 

You can also specify a regular expression to encompass all possible strings that interest you. If there’s a problem involving a coordinate field specified as x1, x2, y1, or y2, then you can locate references to any one of these fields by searching for [xy][12].

 

In other cases, your editor can help you pinpoint code that fails to behave in an expected way. Consider the following JavaScript code, which will not display the failure message it should.

var failureMessage = "Failure!", failureOcurrences = 5; // More code here

if (failureOccurrences > 0)

alert(failureMessage);

 

After a long, stressful day, you may fail to spot the small but obvious error. However, searching for “failureOccurrences” in the code will locate only one of the two variables (the other is spelled “failureOcurrences”).

 

Searching for an identifier is particularly effective for locating typos when the name of the identifier you’re looking for comes from another source: copied and pasted from its definitive definition or from a displayed error message, or carefully typed in.

 

A neat trick is to use the editor’s command to search for occurrences of the same word. With the vim editor, you can search forward for identifiers that are the same as the one under the cursor by pressing * (or # for searching backward). In the Emacs editor, the corresponding incantation is Ctrl-s, Ctrl-w.

 

Your editor is useful when you perform differential debugging. If you have two (in theory) identical complex statements that behave differently, you can quickly spot any differences by copy-and-pasting the one below the other. You can then compare them letter by letter, rather than having your eyes and mind wander from one part of the screen to another.

 

For larger blocks that you may want to compare, consider splitting your editor window into two vertical halves and putting one block beside the other: this makes it easy to spot any important differences.

 

Ideally, you’d want a tool such as diff to identify differences, but this can be tricky if the two files you want to compare differ in nonessential elements, such as IP addresses, timestamps, or arguments passed to routines.

 

Again, your editor can help you here by allowing you to replace the different nonessential text with identical placeholders. As an example, the following vim regular expression substitution command will replace all instances of a Chrome version identifier (e.g., Chrome/45.0.2454.101) appearing in a log file with a string identifying only the major version (e.g., Chrome/45).

:%s/\(Chrome\/[^.]*\)[^ ]*/\1

 

Finally, the editor can be of great help when you’re trying to pinpoint an error using a long log file chock-full of data. First, your editor makes the removal of nonessential lines child’s play.

 

For example, if you want to delete all lines containing the string poll from a log file, in vi you’d enter :g/poll/d, whereas in Emacs you’d invoke (M-x) delete-matching-lines.

 

You can issue such commands multiple times (issuing undo when you overdo it), until the only things left in your log file are the records that really interest you. If the log file’s contents are still too complex to keep in your head, consider commenting the file in the places you understand. For example, you might add “start of the transaction,” “transaction failed,” “retry.”

 

If you’re examining a large file with a logical block structure you can also use your editor’s outlining facilities to quickly fold and unfold diverse parts and navigate between them. At this point, you can also split your editor’s window into multiple parts so that you can concurrently view related parts.

 

Things to Remember

  • Locate misspelled identifiers using your editor’s search commands.
  • Edit text files to make differences stand out.
  • Edit log files to increase their readability.

 

Optimize Your Work Environment

Debugging is a demanding activity. If your development environment is not well tuned to your needs, you can easily die the death of a thousand cuts.

 

Many parts of this book present techniques for the effective use of tools: the debugger, the editor, and command-line tools. Here you’ll find additional things you should consider in order to keep your productivity high.

 

First come the hardware and software at your disposal. Ensure that you have adequate CPU power, main memory, and secondary storage space at your disposal (locally or on a cloud infrastructure).

 

Some static analysis tools require a powerful CPU and a lot of memory; for other tasks, you may need to store on disk multiple copies of the project or gigabytes of logs or telemetry data. In other cases, you may benefit from being able to easily launch additional host instances on the cloud.

 

You shouldn’t have to fight for these resources: your time is (or should be) a lot more valuable than their cost. The same goes for software. Here the restrictions can be associated both with false economies and with excessive restrictions regarding what software you’re allowed to download, install, and use.

 

Again, if some software will help you debug a problem, it’s inexcusable to have this withheld from you. Debugging is hard enough as it is without additional restrictions on facilities and tools.

 

Having assembled the resources, spend some effort to make the best out of them. A good personal setup that includes key bindings, aliases, helper scripts, shortcuts, and tool configurations can significantly enhance your debugging productivity. Here are some things you can set up and examples of corresponding Bash commands.

 

Ensure that your PATH environment variable is composed of all directories that contain the programs you run. When debugging, you may often use system administration commands, so include those in your path.

export PATH="/sbin:/usr/sbin:$PATH"

 

Configure your shell and editor to automatically complete elements they can deduce. The following example for Git can save you many keystrokes as you juggle between various branches.

# Obtain a copy of the Git completion script
if ! [ -f ~/.bash_completion.d/git-completion.bash ] ; then
mkdir -p ~/.bash_completion.d
curl https://raw.githubusercontent.com/git/git/master/contrib/completion/git-completion.bash \
>~/.bash_completion.d/git-completion.bash
fi
# Enable completion of Git commands
source ~/.bash_completion.d/git-completion.bash

 

Set your shell prompt and terminal bar to show your identity, the current directory, and host. When debugging, you often use diverse hosts and identities, so a clear identification of your status can help you keep your sanity. 
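For example, a minimal Bash prompt showing user, host, and current directory (adjust to taste) can be set as follows.

export PS1='\u@\h:\w\$ '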

 

Configure command-line editing key bindings to match those of your favorite editor. This will boost your productivity when building data analysis pipelines in an incremental fashion.

set -o emacs
# Or
set -o vi
Create aliases or shortcuts for frequently used commands and common typos.

alias h='history 15'
alias j=jobs
alias mroe=more

 

Set environment variables so that various utilities, such as the version control system, will use the paging program and editor of your choice.

export PAGER=less
export VISUAL=vim
export EDITOR=ex

Log all your commands into a history file so that you can search for valuable debugging incantations months later. Note that you can avoid logging a command invocation (e.g., one that contains a password) by prefixing it with a space.

 

# Increase the history file size
export HISTFILESIZE=1000000000
export HISTSIZE=1000000
export HISTTIMEFORMAT="%F %T "
# Ignore duplicate lines and lines that start with space
export HISTCONTROL=ignoreboth
# Save multi-line commands as a single line with semicolons
shopt -s cmdhist
# Append to the history file
shopt -s histappend

 

Allow the shell’s pathname expansion (globbing—e.g., *) to include files located in subdirectories.

shopt -s globstar

This simplifies applying commands on deep directory hierarchies through the use of the ** wildcard, which expands to all specified files in a directory tree. For example, the following command will count the number of files whose author is James Gosling, by looking at the JavaDoc @author tag of Java source code files.

grep '@author.*James Gosling' **/*.java | wc -l
33

 

Then comes the configuration of individual programs. Invest time to learn and configure the debugger, the editor, the IDE, the version control system, and the humble pager you’re using to match your preferences and working style. IDEs and sophisticated editors support many helpful plugins.

 

Select the ones you find useful, and set up a simple way to install them on each host on which you set up shop. You will recoup the investment in configuring your tools multiple times over the years.

 

When debugging, your work will often straddle multiple hosts. There are three important time savers in this context. First, ensure that you can log in to each remote host you use (or execute a command there) without entering your password.

 

On Unix systems, you can easily do this by setting up a public-private key pair (you typically run ssh-keygen for this) and storing the public key on the remote host in the file named .ssh/authorized_keys.
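As a sketch, the typical sequence on your local host is the following; the key type is an example, and ssh-copy-id appends your public key to the remote host’s authorized_keys file.

ssh-keygen -t ed25519
ssh-copy-id testuser@garfield.dev.asia.example.com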

 

Second, set up host aliases so that you can access a host by using a short descriptive name, rather than its full name, possibly prefixed by a different username. You store these aliases in a file named .ssh/config.

 

Here is an example that shortens the ssh host login specification from testuser@garfield.dev.asia.example.com into Garfield.

 

Host Garfield

HostName garfield.dev.asia.example.com
User testuser

Third, find out how you can invoke a GUI application on a remote host and have it display on your desktop. Although this operation can be tricky to set up, it is nowadays possible with most operating systems. Being able to run a GUI debugger or an IDE on a remote host can give you a big productivity boost.
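On Unix-like systems running X11, a minimal sketch is to enable X forwarding when you log in; the GUI debugger named here (ddd) is only an example.

ssh -X Garfield
# On the remote host
ddd ./myapp &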

 

Debugging tasks often span the command line and the GUI world. Therefore, knowing how to connect the two in your environment can be an important time saver. One common thing you’ll find useful is the ability to launch a GUI program from the command line (e.g., the debugged application with a test file you’ve developed).

 

The command to use is start under Windows, open under OS X, gnome-open under Gnome, and kde-open under KDE.

 

You will also benefit from being able to copy text (e.g., a long path of a memory dump file) between the command line and the GUI clipboard. Under Windows, you can use the winclip command of the Outwit suite, or, if you have Cygwin installed, you can read from or write to the /dev/clipboard file.

 

Under Gnome and KDE, you can use the xsel command. If you work on multiple GUI environments, you may want to create a command alias that works in the same way across all environments.

 

Also, configure your GUI so that you can launch your favorite editor through a file’s context menu and open a shell window with a given current directory through a directory’s context menu. And, if you don’t know that you can drag and drop file names from the GUI’s file browser into a shell window, try it out; it works beautifully.

 

Having made the investment to create all your nifty configuration files, spend some time to ensure they’re consistently available on all hosts where you’re debugging software.

 

A nice way to do this is to put the files under version control. This allows you to push improvements or compatibility fixes from any host into a central repository and later pull them back to other hosts.
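A rough sketch of this flow, with a hypothetical commit message, looks like the following.

# On the host where you improved a configuration file
git add .bashrc
git commit -m 'Show the Git branch in the shell prompt'
git push
# On any other host where you debug
git pull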

 

Setting up shop on a new host simply involves checking out the repository’s files in your new home directory. If you’re using Git to manage your configuration files, specify the files from your home directory that you want to manage in a .gitignore file, such as the following.

 

# Ignore everything
*
# But not these files...
!.bashrc
!.editrc
!.gdbinit
!.gitconfig
!.gitignore
!.inputrc

 

Note that the advice here is mostly based on things I’ve found useful over the years. Your needs and development environment may vary considerably from mine. Regularly monitor your development environment to pinpoint and alleviate sources of friction.

 

If you find yourself repeatedly typing a long sequence of commands or performing many mouse clicks for an operation that could be automated, invest the time to package what you’re doing into a script.

 

If you find tools getting in your way rather than helping you, determine how to configure them to match your requirements or look for better tools. Finally, look around and ask for other people’s tricks and tools. Someone else may have already found an elegant solution to a problem that is frustrating you.

 

Things to Remember

  • Boost your productivity through the appropriate configuration of the tools you’re using.
  • Share your environment’s configuration among hosts with a version control system.

 

Hunt the Causes and History of Bugs with the Revision Control System

Many bugs you’ll encounter are associated with software changes. New features and fixes, inevitably, introduce new bugs. A revision control system, such as Git, Mercurial, Subversion, or CVS, allows you to dig into the history in order to retrieve valuable intelligence regarding the problem you’re facing.

 

To benefit from this you must be diligently managing your software’s revisions with a version control system. By “diligently” I mean that you should be recording each change in a separate self-contained commit, documented with a meaningful commit message, and (where applicable) linked to the corresponding issue.

 

Here are the most useful ways in which a version control system can help your debugging work. The examples use Git’s command-line operations because these work in all environments.

 

If you prefer to use a GUI tool to perform these tasks, by all means, do so. If you’re using another revision control system, consult its documentation on how you can perform these operations, or consider switching to Git to benefit from all its power. Note that not all version control systems are created equal.

 

In particular, many have painful and inefficient support for local branching and merging—features that are essential when you debug by experimenting with alternative implementations.

 

When a new bug appears in your software, begin by reviewing what changes were made to it.

git log

 

If you know that the problem is associated with a specific file, specify it so that you will only see changes associated with that file.

git log path/to/myfile.js

 

If you suspect that the problem is associated with particular code lines, you can obtain a listing of the code with each line annotated with details regarding its last change.

git blame path/to/myfile.js

(Specify the -C and -M options to track lines moved within a file and between files.)

 

If the code associated with the problem is no longer there, you can search for it in the past by looking for a deleted string.

git rev-list --all | xargs git grep extinctMethodName

If you know that the problem appeared after a specific version (say V1.2.3), you can review the changes that occurred after that version.

git log V1.2.3..

 

If you don’t know the version number but you know the date on which the problem appeared, you can obtain the SHA hash of the last commit before that date.

git rev-list -n 1 --before=2015-08-01 master

You can then use the SHA hash in place of the version string.
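For example, the two commands can be combined like this.

git log $(git rev-list -n 1 --before=2015-08-01 master)..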

 

If you know that the problem appeared when a specific issue (say, issue 1234) was fixed, you can search for commits associated with that issue.

git log --all --grep='Issue #1234'

(This assumes that a commit addressing issue 1234 will include the string ”Issue #1234” in its message.)

 

In all the preceding cases, once you have the SHA hash of the commit you want to review (say, 1cb6e3f6), you can inspect the changes associated with it.

git show 1cb6e3f6

 

You may also want to see the code changes between the two releases.

git diff V1.2.3..V1.3.2

 

Often, a simple review of the changes can lead you to the problem’s cause. Alternately, having obtained from the commit descriptions the names of the developers associated with a suspect change, you can have a talk with them to see what they were thinking when they wrote that code.

 

You can also use the version control system as a time-travel machine. For example, you may want to check out an old correct version (say V1.1.0) to run that code under the debugger and compare it with the current one.

 

git checkout V1.1.0

Even more impressive, if you know that a bug was introduced between, say, V1.1.0 and V1.2.3 and you have a script, say, test.sh, that will exit with a non-zero code if a test fails, you can ask Git to perform a binary search among all changes until it locates the one that introduced the bug.

git bisect start V1.2.3 V1.1.0
git bisect run ./test.sh

git reset

 

Git also allows you to experiment with fixes by creating a local branch that you can then integrate or remove.

git checkout -b issue-work-1234

 

# If the experiment was successful, integrate the branch
git checkout master
git merge issue-work-1234

# If the experiment failed, delete the branch
git checkout master
git branch -D issue-work-1234

 

Finally, given that you may be asked to urgently debug an issue while you’re working on something else, you may want to temporarily hide your changes while you work on the customer’s version.

git stash save interrupted-to-work-on-V1234

# Work on the debugging issue
git stash pop

 

Things to Remember

  • Examining a file’s history with a version control system can show you when and how bugs were introduced.
  • Use a version control system to look at the differences between correct and failing software versions.

 

Use Monitoring Tools on Systems Composed of Independent Processes

Modern software-based systems rarely consist of a single stand-alone program, which you need to debug when it fails. Instead, they comprise diverse services, components, and libraries.

 

The quick and efficient identification of the failed element should be your first win when debugging such a system. You can easily accomplish this on the server side by using an existing infrastructure monitoring system or by setting up and running one.

 

In the following paragraphs, I’ll use as an example the popular Nagios tool. This is available both as free software and through supported products and services. If your organization already uses another system, work on that one; the principles are the same. Whatever you do, avoid the temptation to concoct a system on your own.

 

Compared to a quick home-brewed solution or a passive recording system, such as collectd or RRDtool, Nagios offers many advantages: tested passive and active service checks and notifiers, a dashboard, a round-robin event database, unobtrusive monitoring schedules, scalability, and a large user community that contributes plugins.

 

If your setup is running on the cloud or if it is based on a commonly used application stack, you may also be able to use a cloud-based monitoring system offered as a service. For example, Amazon Web Services (AWS) offers monitoring for the services it provides.

 

To be able to zero in efficiently on problems, you must monitor the whole stack of your application. Start from the lowest-level resources by monitoring the health of individual hosts:

CPU load, memory use, network reachability, number of executing processes and logged-in users, available software updates, free disk space, open file descriptors, consumed network and disk bandwidth, system logs, security, and remote access.

 

Moving up one level, verify the correct and reliable functioning of the services your software requires to run: databases, email servers, application servers, caches, network connections, backups, queues, messaging, software licenses, web servers, and directories. Finally, monitor in detail the health of your application. The details here will vary. It’s best to monitor

 

  • The end-to-end availability of your application (e.g., whether completing a web form ends with a fulfilled transaction)
  • Individual parts of the application, such as web services, database tables, static web pages, interactive web forms, and reporting
  • Key metrics, such as response latency, queued and fulfilled orders, number of active users, failed transactions, raised errors, reported crashes, and so on

 

When something fails, Nagios will update the corresponding service status on its web interface. In addition, you want to be notified of the failure immediately, for example, with an SMS or an email. 

 

For services that fail sporadically, the immediate notification may allow you to debug the service while it is in a failed state, making it easier to pinpoint the cause. You can also arrange for Nagios to open a ticket so that the issue can be assigned, followed, and documented.

 

Nagios also allows you to see a histogram of the events associated with a service over time. Poring over the times when the failures occur can help you identify other factors that lead to the failure, such as excessive CPU load or memory pressure.

 

If you monitor a service’s complete stack, some low-level failures will cause a cascade of other problems. In such cases, you typically want to start your investigation at the lowest-level failed element.

 

If the available notification options do not suit your needs, you can easily write a custom notification handler.

 

Setting up Nagios is easy. The software is available as a package for most popular operating systems and includes built-in support for monitoring key host resources and popular network services. In addition, more than a thousand plugins allow the monitoring of all possible services, from the cloud, clustering, and CMS to security and web forms.

 

Again, if no plugin matches your requirements, you can easily script your own checker. Simply have the script print the service’s status and exit with 0 if the service you’re checking is OK and with 2 if there’s a critical error.

 

As an example, the following shell script verifies that a given storage volume has been backed up as a timestamped AWS snapshot.
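A minimal sketch of such a check, assuming the AWS command-line interface, GNU date, and a 24-hour backup window, could look like the following.

#!/bin/bash
# Sketch of a Nagios-style check: is there a recent snapshot of the volume given as $1?
VOLUME="$1"
# Snapshots older than this cutoff (24 hours ago) are considered stale
CUTOFF=$(date -u -d '24 hours ago' '+%Y-%m-%dT%H:%M:%S')
# Obtain the start time of the volume's most recent snapshot
LATEST=$(aws ec2 describe-snapshots \
--filters "Name=volume-id,Values=$VOLUME" \
--query 'max(Snapshots[].StartTime)' --output text)
if [[ "$LATEST" == "None" || "$LATEST" < "$CUTOFF" ]] ; then
echo "SNAPSHOT CRITICAL - no snapshot of $VOLUME since $CUTOFF"
exit 2
fi
echo "SNAPSHOT OK - latest snapshot of $VOLUME taken at $LATEST"
exit 0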

 

Things to Remember

  • Set up a monitoring infrastructure to check all parts composing the service you’re offering.
  • Quick notification of failures may allow you to debug your system in its failed state.
  • Use the failure history to identify patterns that may help you pinpoint a problem’s cause.

 

Review and Manually Execute Suspect Code

On your first pass through the code, carefully examine each line, and look for common mistakes. Nowadays, you can avoid many such mistakes through appropriate conventions (e.g., by placing additional brackets to avoid operator precedence mistakes), or a static analysis tool can point them out to you. Nevertheless, mistakes can slip through, especially in code that does not (ahem) adhere to the necessary coding conventions.

 

Look for errors in operator precedence (the bit operators are particularly risky), missing braces and break statements, extra semicolons (immediately after a control statement), the use of an assignment instead of a comparison, uninitialized or wrongly initialized variables, statements that are missing from a loop, off-by-one errors, erroneous type conversions, missing methods, spelling errors, and language-specific gotchas.

 

To execute code by hand, have an empty sheet of paper at your side, write down the names of the key variables, and start executing the statements in the order the computer would.

 

Every time a variable changes its value, cross out the old value and write the new one. Writing the values with a pencil makes it easier to fix any errors you make.

 

A (real) calculator may help you derive the values of complex expressions more quickly. A programmer’s calculator can be helpful if you’re dealing with bit operations.

 

Avoid using your computer: manipulating the variable values in a spreadsheet, browsing the code with an editor, or quickly checking whether any new email has arrived will make it difficult to deeply concentrate, which is what this method is all about.

 

If the code manipulates complex data structures, draw them with lines, boxes, circles, and arrows. Devise a notation to draw the algorithm’s most important parts.

 

For example, if you’re drawing intervals (it’s always difficult to get the corresponding algorithms exactly right), you can draw the closed end (hopefully the interval’s start) with a line ending in a square bracket and the open end (the interval’s end, if you’re following the correct convention) with a round bracket.

 

You may also often find it useful to draw the parts of a program’s call graph (how routines call each other) that interest you. If you’re fluent in UML, use that for your diagrams, but don’t sweat too much about getting the notation right; seek a balance between the ease of drawing and the comprehension of what you’ve drawn.

 

A larger sheet of paper may help provide you with the space you need for your diagram. A whiteboard provides an even larger surface and also makes it easier to erase parts and collaborate.

 

Add colors to the picture to easily distinguish the elements you draw. If the diagram you’ve drawn is important, take a picture of it when you’re finished and attach it to the corresponding issue.

 

One notch fancier is the manipulation of physical objects, such as white-board magnets, paper clips, toothpicks, sticky notes, checker pieces, or Lego blocks. This increases the level of your engagement with the problem by bringing into play more senses: 3D vision, touch, and proprioception (the sense of where your body parts are).

 

You can use this method to simulate queues, groupings, protocols, ratings, priorities, and a lot more. Just don’t get carried away playing with the objects that are supposed to help you with your work.

 

Things to Remember

  • Look through the code for common mistakes.
  • Execute the code by hand to verify its correctness.
  • Untangle complex data structures by drawing them.
  • Address complexity with large sheets of paper, a whiteboard, and color.
  • Deepen your engagement with a problem by manipulating physical objects.

 

Go Over Your Code and Reasoning with a Colleague

The rubber duck technique is probably the most effective one you’ll find in this book, measured by the number of times you can apply it. It involves explaining how your code works to someone else. Typically, half-way through your explanation, you’ll exclaim, “Oh wait, how silly of me, that’s the problem!” and be done.

 

When this happens, rest assured that this was not a silly mistake that you carelessly overlooked. By explaining the code to your colleague, you engaged different parts of your brain, and these pinpointed the problem. In most cases, your colleague plays a minimal role.

 

This is how the technique gets its name: explaining the problem to a rubber duck could have been equally effective. (In the entry on rubber duck debugging, Wikipedia actually has a picture of a rubber duck sitting at a keyboard.)

 

You can also engage your colleagues in a more meaningful way by asking them to review your code. This is a more formal undertaking in which your colleague carefully goes through the code, pinpointing all problems in it: from code style and commenting, to API use, to design and logical errors.

 

So highly regarded is this technique that some organizations have a code review as a prerequisite for integrating code into a production branch. Tools for sharing comments, such as Gerrit and GitHub’s code commenting functionality, can be really helpful because they allow you to respond to comments and leave a record of how each one is addressed.

 

Etiquette plays an important role in this activity. Don’t take the comments (even the harsh ones) personally, but see them as an opportunity to improve your code. Try to address all review comments; even if a comment is wrong, it is a sign that your code is not clear enough.

 

Also, if you ask others to review your code, you should also offer to review theirs and do that promptly, professionally, and politely. Code stuck waiting for a code review, trivial comments that overlook bigger problems, and nastiness diminish the benefits gained by the practice of code reviews.

 

Finally, you can address tough problems in multi-party algorithms through role-playing. For example, if you’re debugging a communications protocol, you can take the role of one party, a colleague can take the role of the other, and you can then take turns attempting to break the protocol (or trying to make it work).

 

Other areas where this can be effective are security (you get to play Bob and Alice), human-computer interaction, and workflows. Passing around physical objects such as an “edit” token can help you here. Wearing elaborate costumes would be overdoing it though.

 

Things to Remember

  • Explain your code to a rubber duck.
  • Engage in the practice of code review.
  • Debug multi-party problems through role-playing.

 

Add Debugging Functionality

By telling your program that it is being debugged, you can turn the tables and have the program actively help you to debug it.

 

What’s needed for this to work is a mechanism to turn on this debugging mode and the code implementing this mode. You can program your software to enter a debugging mode through one of the following:

  • A compilation option, such as defining a DEBUG constant in C/C++ code
  • A command-line option such as the -d switch used in the Unix sshd SSH daemon and many other programs
  • A signal sent to a process, as was the case in old versions of the BIND domain-name server
  • A (sometimes undocumented) command, such as an unusual key combination (on certain Android versions, you enable the USB debugging mode by tapping seven times on the software’s build number)

 

To avoid accidentally shipping or configuring your software with debugging enabled in a production environment, it’s a good practice to include a prominent notice to the effect that the debugging mode is available or enabled. Once you have your program enter a debugging mode there are several things you can program it to do.

 

First, you can make it log its actions so that you can get notified when something happens, and you can later examine the sequence of various events.

 

For interactive and graphics programs, it may also be helpful to have the debugging mode display more information on the screen or enhance the information already there.

 

For example, Minecraft has a debug mode in which it overlays the game screen with performance figures (frames per second, memory used, CPU load), player data (coordinates, direction, light), and operating environment specifications (JVM, display technology, CPU type).

 

It also features a debug mode world, which displays—laid out flat—the thousands of materials in all their conditions that exist in the game.

 

In a rendering application, you could display the edges that make up each object’s facets or the points controlling a Bezier curve. In web applications, it can be helpful to have additional data, such as a product’s database ID, appear when you hover the mouse over the corresponding screen element.

 

The debugging mode can also enable additional commands. These may be accessible through a command-line interface, an additional menu, or a URL.

 

You can implement commands to display and modify complex data structures (debuggers have difficulty processing these), dump data into a file for further processing, change the state into one that will help your troubleshooting, or perform other tasks described in this book.

 

A very helpful debugging mode feature is the ability to enter a specific state. As an example, consider the task of debugging the seventh step of a wizard-like interface. Your job will be a lot easier if the debugging mode provides you with a shortcut to skip the preceding six steps, perhaps using some sensible default values for them.

 

Similar features are also useful for debugging games, where they can advance you to a high level (depriving you the pleasure and the excuse of playing to get there) or give you hard-to-earn additional powers.

 

A debugging mode can also increase a program’s transparency or simplify a program’s runtime behavior to make it easier to pin down failures for debugging.

 

For instance, when invoked with debug mode enabled, programs that operate silently in the background (daemons in Unix parlance, services in the Windows world) may operate in the foreground displaying output on the screen. (The Unix sshd daemon is a typical example of such a program.)

 

If a program fires up many threads, you can make it run with just a single thread, simplifying the debugging of problems that are not associated with concurrency.

 

Other changes you can implement can substitute simple or naive algorithms for more sophisticated ones, eliminate peripheral operations to boost performance, use synchronous instead of asynchronous APIs, or use an embedded lightweight application server or database instead of an external one.

 

For software lacking a user interface, such as that running on some embedded devices or servers, a debugging mode can also expose additional interfaces.

 

Adding a command-line interface can allow you to enter debugging commands, and see their results. In embedded devices, you can have that interface run over a serial connection that is set up only in the debug mode.

 

Some digital TVs can use their USB interface in this way. In applications working in a networked environment, you can include a small embedded HTTP server, such as the libmicrohttpd one, which can display key details of the application and also can offer the execution of debugging commands.

 

The debugging mode can also help you simulate external failures. These are typically rare events that may require tricky instrumentation to simulate in order to troubleshoot. A debugging mode can offer commands that, by changing the program’s state, can simulate its behavior under such conditions.

 

Thus, debugging mode commands can simulate the random dropping of network packets, the failure to write data to disk, radio signal degradation, a malfunctioning real-time clock, a misconfigured smart card reader, and so on.

 

Finally, the debugging mode provides you with a mechanism to exercise rare code paths. This works by changing the program’s configuration to favor their execution, instead of a more optimal path.

 

For example, if you have a user input memory buffer that starts with a 1 kB allocation and doubles in size every time it fills, you can have your program’s debug mode initialize the buffer with space for a single byte.

 

This guarantees you that the reallocation will be frequently exercised and that you will be able to observe and fix bugs in its logic. Other cases involve configuring tiny hash table sizes (to stress-test the overflow logic) and very small cache buffers (to stress-test the selection and replacement strategy).

 

Things to Remember

  • Add to your program an option to enter a debug mode.
  • Add commands to manipulate the program’s state, log its operation, reduce its runtime complexity, shortcut through user interface navigation, and display complex data structures.
  • Add command-line, web, and serial interfaces to debug embedded devices and servers.
  • Use debug mode commands to simulate external failures.

 

Add Logging Statements

Logging statements allow you to follow and comprehend the program’s execution. They typically send a message to an output device (for example, the program’s standard error, a console, or a printer) or store it in a place that you can later browse and analyze (a file or a database). You can then examine the log to find the root cause of the problem you’re investigating.

 

Some believe that logging statements are only employed by those who don’t know how to use a debugger. There may be cases where this is true, but it turns out that logging statements offer a number of advantages over a debugger session, and therefore the two approaches are complementary.

 

First of all, you can easily place a logging statement in a strategic location and tailor it to output exactly the data you require. In contrast, a debugger, as a general-purpose tool, requires you to follow the program’s control flow and manually unravel complex data structures.

 

Furthermore, the work you invest in a debugging session only has ephemeral benefits. Even if you save your setup for printing a complex data structure in a debugger script file, it would still not be visible or easily accessible to other people maintaining the code. I have yet to encounter a project that distributes debugger scripts together with its source code.

 

On the other hand, because logging statements are permanent, you can invest more effort than you could justify in a fleeting debugging session to format their output in a way that will increase your understanding of the program’s operation and, therefore, your debugging productivity.

 

Finally, the output of proper logging statements (those using a logging framework rather than random println statements) is inherently filter-able and queryable.

 

There are several logging libraries for most languages and frameworks. Find and use one that matches your requirements, rather than reinventing the wheel. Things you can log include the entry and exit to key routines, contents of key data structures, state changes, and responses to user interactions.

 

To avoid the performance hit of extensive logging, you don’t want to have it enabled in a normal production setting. Most of the logging interfaces allow you to tailor the importance of messages recorded either at the source (the program you’re debugging) or at the destination (the facility that logs the messages).

 

Obviously, controlling the recorded messages at the source can minimize the performance impact your program will incur; in some cases down to zero.

 

Implementing in your application a debug mode allows you to increase the logging verbosity only when needed. You can also configure several levels or areas of logging to fine-tune what you want to see. Many logging frameworks provide their own configuration facility, freeing you from the effort to create one for your application.

 

Logging facilities you may want to use include the Unix syslog library, Apple’s more advanced system log facility ASL, the Windows ReportEvent API, Java’s java.util.logging package, and Python’s logging module. The interfaces to some of these facilities are not trivial, so refer to the listings as a cheat sheet for using each one in your code.

 

There are also third-party logging facilities that can be useful if you’re looking for more features or if you’re working on a platform that lacks a standard one. These include Apache’s Log4j for Java and Boost.Log v2 for C++.

 

Listing Logging with the Unix syslog interface

#include <syslog.h>
int
main()
{
openlog("myapp", 0, LOG_USER);
syslog(LOG_DEBUG, "Called main() in %s", __FILE__);
closelog();
}

 

Listing Logging with Apple’s system log facility

#include <asl.h>
int
main()
{
asl_object_t client_handle = asl_open("com.example.myapp", NULL, ASL_OPT_STDERR);
asl_log(client_handle, NULL, ASL_LEVEL_DEBUG,
"Called main() in %s", __FILE__);
asl_close(client_handle);
}
Listing Logging with the Windows ReportEvent function
#include <windows.h>
int
main()
{
LPTSTR lpszStrings[] = {
"Called main() in file ",
__FILE__
};
HANDLE hEventSource = RegisterEventSource(NULL, "myservice");
if (hEventSource == NULL)
return (1);
ReportEvent(hEventSource, // handle of event source
EVENTLOG_INFORMATION_TYPE, // event type
0, // event category
0, // event ID
NULL, // current user's SID
2, // strings in lpszStrings
0, // no bytes of raw data
lpszStrings, // array of error strings
NULL); // no raw data
DeregisterEventSource(hEventSource);
return (0);
}

Listing Logging with Java’s java.util.logging package

import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class EventLog {
public static void main(String[] args) {
Logger logger = Logger.getGlobal();
// Include detailed messages
logger.setLevel(Level.FINEST);
FileHandler fileHandler = null;
try {
fileHandler = new FileHandler("app.log");
} catch (IOException e) {
System.exit(1);
}
// Send output to file
logger.addHandler(fileHandler);
logger.fine("Called main");
}
}
Listing Logging with Python’s logging module

import logging
logger = logging.getLogger('myapp')
# Send log messages to myapp.log
fh = logging.FileHandler('myapp.log')
logger.addHandler(fh)
logger.setLevel(logging.DEBUG)
logger.debug('In main module')

 

In addition, many other programming frameworks offer their own logging mechanisms. For example, if you associate logging with lumberjacks, you’ll be happy to know that under node.js you can choose between the Bunyan and Winston packages.

 

If you’re using Unix shell commands, then you can log a message by invoking the logger command. In Unix kernels (including device drivers), it is customary to log messages with the printk function call.
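For example, to record a debug-level message from a shell script (the tag and message are placeholders):

logger -t myscript -p user.debug 'Entering cleanup phase'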

 

If your code runs on a networked, embedded device lacking a writable file system with sufficient space where you can store the logs (e.g., a high-end TV or a low-end broadband router), consider using remote logging.

 

This technology allows you to configure the logging system of the embedded device to send the log entries to a server where these are stored. Thus, the following Unix syslogd configuration entry will send all logging associated with local1 to the log master host:

local1.*    @@logmaster.example.com:514

 

Finally, if the environment you’re programming in doesn’t offer a logging facility, you’ll have to roll your own. In its simplest form, this can be a print statement.

printf("Entering function foo\n");

 


When you (think you) are done with a print-type logging statement, resist the temptation to delete it or put it in a comment. If you delete it, you lose the work you put into creating it. If you comment it out, it will no longer be maintained, and as the code changes it will decay and become useless. Instead, place the print command in a conditional statement.

 

if (loggingEnabled)

printf("Entering function foo\n");

Apart from a print statement, here are some other ways you can have applications log their actions.

 

  • In a GUI application, fire up a popup message.
  • In JavaScript code, write to the console and view the results in your browser’s console window.
  • In a web application, stuff logging output in the resulting page’s HTML—as HTML comments or as visible text.

If you can’t modify an application’s source code, you can try making it open a file whose name is the message you want to log and trace the application’s system calls with strace to see the file’s name.
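As a rough sketch, with a hypothetical application and message, the sequence might be the following.

strace -f -e trace=open,openat -o trace.out ./myapp
fgrep 'Reached point A' trace.out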

 

Things to Remember

  • Add logging statements to set up a permanent, maintained debugging infrastructure.
  • Use a logging framework instead of reinventing the wheel.
  • Configure the topic and details of what you log through the logging framework.

 

Use Unit Tests

If a flaw in the software you’re debugging doesn’t show up in its unit testing, then appropriate tests are lacking or completely absent. To isolate or pinpoint such a flaw, consider adding unit tests that can expose it.

 

Start with the basics. If the software isn’t using a unit testing framework or isn’t written in a language that directly supports unit testing, download a unit testing package matching your requirements (Wikipedia contains a list of unit testing frameworks), and configure your software to use it.

 

With no existing tests in place, this should involve the adjustment of the build configuration to include the testing library and the addition of a few lines in the application’s startup code to run any tests. While you’re at it, configure your infrastructure to run the tests automatically when the code is compiled and committed.

 

This will ensure that your project will benefit from the improved documentation, collective ownership, ease of refactoring, and simplified integration facilitated by the unit testing infrastructure you’re adding.

 

Then, identify the routines that may be related to the failure you’re seeing, and write unit tests that will verify their functioning. You can find the routines to test through top-down or bottom-up reasoning.

 

Try writing the tests without looking at the routines’ implementation, focusing instead on the documentation of their interface, or, if that’s lacking (let’s be realistic here) on the code that calls them.

 

This will lessen the probability of you replicating a faulty assumption in the unit test. Ensure that the tests you added become a permanent part of the code by committing them to the software’s revision control repository.

 

As an example, consider the class in the following listing, which tracks the column position of processed text, taking into account the standard behavior of the tab character. This is notoriously difficult to get right: in the 1980s, screen output libraries contained workarounds for display terminals with buggy behavior in this area.

 

Listing A C++ class that tracks the text’s column position

class ColumnTracker {

private:
int column;
static const int tab_length = 8;
public:
ColumnTracker() : column(0) {}
int position() const { return column; }
void process(int c) {
switch (c) {
case '\n':
column = 0;
break;
case '\t':
column = (column / tab_length + 1) * tab_length;
break;
default:
column++;
break;
}
}
};

 

Listing Code running the CppUnit test suite text interface

#include <cppunit/ui/text/TestRunner.h>
#include "ColumnTrackerTest.h"
int
main(int argc, char *argv[])
{
CppUnit::TextUi::TestRunner runner;
runner.addTest(ColumnTrackerTest::suite());
runner.run();
return 0;
}

Listing Unit test code

#include <cppunit/extensions/HelperMacros.h>
#include "ColumnTracker.h"

class ColumnTrackerTest : public CppUnit::TestFixture {
CPPUNIT_TEST_SUITE(ColumnTrackerTest);
CPPUNIT_TEST(testCtor);
CPPUNIT_TEST(testTab);
CPPUNIT_TEST(testAfterNewline);
CPPUNIT_TEST_SUITE_END();
public:
void testCtor() {
ColumnTracker ct;
CPPUNIT_ASSERT(ct.position() == 0);
}
void testTab() {
ColumnTracker ct;
// Test plain characters
ct.process('x');
CPPUNIT_ASSERT(ct.position() == 1);
ct.process('x');
CPPUNIT_ASSERT(ct.position() == 2);
// Test tab
ct.process('\t');
CPPUNIT_ASSERT(ct.position() == 8);
// Test character after tab
ct.process('x');
CPPUNIT_ASSERT(ct.position() == 9);
// Edge case
while (ct.position() != 15)
ct.process('x');
ct.process('\t');
CPPUNIT_ASSERT(ct.position() == 16);
// Edge case
ct.process('\t');
CPPUNIT_ASSERT(ct.position() == 24);
}
void testAfterNewline() {
ColumnTracker ct;
ct.process('x');
ct.process('\n');
CPPUNIT_ASSERT(ct.position() == 0);
}
};

 

Running the unit tests should expose the flawed routine. If the tests succeed, you’ll need to expand their coverage or (less frequently) verify their correctness.

 

If more than one test fails, focus on the failing routines that lie at the bottom of the dependency tree—those that call the fewest other routines. Once you’ve fixed the flawed routine, run the tests again to ensure they all now pass.

 

Bolting unit tests onto existing code isn’t trivial, because tests and code are typically developed in tandem so that the code can be written in a testable form. Often the tests are even written before the corresponding code.

 

To unit test the suspect routines you may need to refactor the code: splitting large elements into smaller parts and minimizing dependencies between routines to simplify their invocation from the tests.

 

The techniques for doing this are beyond the scope of this book. An excellent treatment of the topic is Michael Feathers’ book Working Effectively with Legacy Code.

 

Things to Remember

  • Pinpoint flaws by probing suspect routines with unit tests.
  • Increase your effectiveness by adopting a unit testing framework, refactoring the code to accommodate the tests, and automating the tests’ execution.

 

Use Assertions

Although unit tests are an important tool for locating faulty routines, they can’t tell the full story. First, unit tests will point you to the routine that fails a test, but they can’t help you find the exact position of the flaw.

 

This is fine when dealing with small routines, but there are complex algorithms that are difficult to break down into small self-contained routines. Second, errors can crop up in the integration of various parts. Higher-level testing should uncover these errors, but again, it rarely pinpoints their cause.

 

Here’s where assertions come in. These are statements containing a Boolean expression that’s guaranteed to be true if the code is correct. Should the expression evaluate to false, the assertion will fail, typically terminating the program with a runtime error displaying data regarding the failure. Debuggers can often direct you to the location of a failed assertion.

 

By placing assertions in key locations of your code, you can narrow down your search for a fault in two ways. Most obviously, you can focus on the location where an assertion failed. In addition, if the assertions you added didn’t fail, you may be able to rule out as suspect the code where you placed them.

 

Most languages support assertions, either through a built-in statement, as is the case for Java and Python, or through a library, as is the case for C. Many programming environments, in order to minimize the possible performance impact of assertion checking, allow you to specify whether this will be performed or not.

 

This can happen at compile time (e.g., in C and C++ by defining the NDEBUG macro) or at runtime (e.g., in Java with the -enableassertions and -disableassertions options).

 

It is common during development to run code with assertion checking enabled. Running production code with assertion checking enabled has both benefits and costs, which you must weigh on a case-by-case basis.

 

When you’re debugging algorithmic code, it’s often useful to think in terms of preconditions (properties that must hold for the algorithm to function), invariants (properties that the algorithm maintains while it’s processing data), and postconditions (properties that must hold if the algorithm performs according to its specification).

 

Typically, an invariant is true only for the part of the data set that has been processed by the algorithm; at the end of the algorithm’s operation, the postcondition will cover the same elements as the invariant.

 

You can see an example of this style of programming in the following listing, which finds the maximum value in an integer array by starting with the minimum integer and gradually replacing it with higher values found in the array.

 

The preconditions tested at the beginning are that the array is non-empty and that the minimum value chosen is indeed less than or equal to all values of the array.

 

This could fail if the input data type was changed without correspondingly adjusting the constant. The loop forming the algorithm’s body maintains the invariant that the selected maximum value is greater than or equal to all values traversed. At the end of the loop, after all values have been checked, the same invariant forms the postcondition, which is the algorithm’s specification.

 

Listing: Using assertions to check preconditions, postconditions, and invariants

class Ranking {
    /** Return the maximum number in non-empty array v */
    public static int findMax(int[] v) {
        int max = Integer.MIN_VALUE;

        // Precondition: v[] is not empty
        assert v.length > 0 : "v[] is empty";

        // Precondition: max <= v[i] for every i
        for (int n : v)
            assert max <= n : "Found value < MIN_VALUE";

        // Obtain the actual maximum value
        for (int i = 0; i < v.length; i++) {
            if (v[i] > max)
                max = v[i];

            // Invariant: max >= v[j] for every j <= i
            for (int j = 0; j <= i; j++)
                assert max >= v[j] : "Found value > max";
        }

        // Postcondition: max >= v[i] for every i
        for (int n : v)
            assert max >= n : "Found value > max";

        return max;
    }
}

Apart from this use of assertions to troubleshoot (and document) the operation of algorithms, you can also use assertions in a less formal fashion to pinpoint all sorts of problems. You can thus place assertions

 

  • At the beginning of the program to verify the architectural properties of the CPU, such as the sizes of the integer types used
  • At the entry point of a routine to verify that the passed parameters are of the expected type (when the language won’t check it), and that they hold valid (e.g., nonnull) and reasonable values
  • At the exit point of a routine to verify its result
  • At the beginning and the end of commonly called or complex methods to verify that the class’s state remains consistent
  • After calls to API routines that shouldn’t fail to verify that this is indeed the case
  • After loading resources required by your software to verify that it has been correctly deployed
  • After evaluating a complex expression to verify that the result has the expected properties or holds a reasonable value
  • In the default case of a switch statement (with a false expression as the assertion’s value) to catch unhandled cases
  • After initializing a data structure to verify that it holds the expected values
  • In general, while debugging, add assertions to document your understanding of the code and to test your suspicions, as in the sketch below.
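
The following minimal sketch illustrates a few of these placements: validating a routine’s parameter on entry, checking a call that shouldn’t fail, and catching unhandled switch cases. The class and its names are hypothetical, not taken from the text.

public class AssertionPlacements {
    /** Map a log level name to a numeric priority. */
    static int priority(String level) {
        // Entry point: the passed parameter must hold a valid, nonnull value.
        assert level != null && !level.isEmpty() : "level is null or empty";
        switch (level) {
        case "ERROR": return 3;
        case "WARN":  return 4;
        case "INFO":  return 6;
        default:
            // Unhandled case: fail loudly instead of returning a guess.
            assert false : "unhandled level: " + level;
            return -1;
        }
    }

    public static void main(String[] args) {
        // After an API call that shouldn't fail: verify that this is indeed the case.
        long free = Runtime.getRuntime().freeMemory();
        assert free > 0 : "freeMemory returned " + free;
        System.out.println(priority("WARN"));
    }
}

Run the sketch with java -ea AssertionPlacements to have the checks enforced; without the option the assertions are not evaluated.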

 

You can leave in the code most of the assertions you added in order to document its operation and guard against future problems.

 

However, if any of the assertions you added to debug the code has unearthed a problem that can actually crop up in production, then, as part of your debugging work, you must replace it with more robust error handling.

 

Such cases include the verification of input coming from the user or other sources you can’t control and the correct execution of APIs that can fail as a matter of course.

 

Also, when a routine can be tested both with an assertion and with unit testing code, adding a unit test is preferable, because it can be executed automatically and contributes toward your code’s test coverage.

 

Things to Remember

  • Complement unit testing with assertions to pinpoint more precisely a fault’s location.
  • Debug complex algorithms with assertions that verify their preconditions, invariants, and postconditions.
  • Add assertions to document your understanding of the code you debug and to test your suspicions.

 

Verify Your Reasoning by Perturbing the Debugged Program

Arbitrarily changing a program to see what will happen is disparagingly described as hacking. However, experimental changes that you make in a thoughtful manner can allow you to test hypotheses and learn more about the system you’re trying to debug as well as its underlying platform.

 

Such changes are especially valuable when the material you’re working with isn’t top notch: they can fill holes in the code’s documentation or in that of the API it uses.

 

Here are some examples of questions that might arise when you’re debugging a system, which you can easily answer by modifying some code.

  • Can I indeed pass null as an argument to this routine?
  • Will this code work correctly if the variable contains more than 999 milliseconds?
  • Will a warning get logged if a lock is held when entering this routine?

 

  • Is the order of calling these methods related to my problem?
  • Could an alternative API work better than the currently used one?

You typically verify the effects of your changes by observing the program's behavior, by logging, or by running the code under a debugger.

 

One experimental approach involves modifying expressions and values that are embedded in the code, often replacing a runtime expression with a concrete value. For instance, you can pass a known-correct constant value to a routine (or have a routine return such a value) to verify that the failure you’re trying to fix goes away.

 

Or, you can pass or return an incorrect value to see whether a problem you’re trying to isolate can be attributed to such a value. Alternately, you can set a parameter to an extreme value in order to make a tiny or rare problem, such as performance degradation, easier to observe.
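
Here is a minimal, hypothetical sketch of such a perturbation: a runtime expression is temporarily overridden (via the commented-out line) with a known-good constant to check whether the computed value is behind the failure. The routine names are illustrative.

public class PerturbationExperiment {
    // Hypothetical routine whose result we suspect.
    static double computeDiscount(int items) {
        return items > 10 ? 0.15 : 0.0;
    }

    static double total(int items, double unitPrice) {
        double discount = computeDiscount(items);
        // Experiment: uncomment the next line to force a known-good value and
        // see whether the failure you're chasing goes away.
        // discount = 0.0;
        return items * unitPrice * (1.0 - discount);
    }

    public static void main(String[] args) {
        System.out.println(total(12, 9.99));
    }
}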

 

Another experimental avenue involves code changes that allow you to test the correctness of alternative implementations. Here you replace code that might be incorrect with conceivably better code and see whether this fixes your problem.

 

For example, the Microsoft Windows API provides more than five ways to obtain a string’s width on the screen, with little guidance regarding which function is preferable. If your problem is misaligned text, you could exchange one API call (GetTextExtent Point32) with another (GetTextExtentExPoint) and observe the result.

 

Or, if you have doubts regarding the correct order for calling some routines, you can try an alternative one. In other cases, you can try extreme code simplifications.

 

Things to Remember

  • Set values in the code by hand to identify correct and incorrect ones.
  • If you can’t find guidance to correct the code, experiment by trying alternate implementations.

 

Minimize the Differences between a Working Example and the Failing Code

There are cases where you will have at hand the faulty code you’re debugging and an example of related functionality that works just fine.

 

This can often occur when you’re debugging a complex API invocation or an algorithm. You may get the working example from the API’s documentation, a Q&A site, open-source software, or a textbook.

 

The differences between the working example and your code can guide you to the fault. The approach I describe here is based on manipulating the source code; however, you can also look at differences in the runtime behavior of the two.

 

Before using the example code to fix the problem you’re facing, you first must compile and test it to verify that it actually works. If it doesn’t, then probably the problem doesn’t lie with your code.

 

It could be that your setup (compiler, runtime environment, operating system) is responsible for the failure, or that your understanding of what the API or algorithm is supposed to be doing is incorrect, or less likely, that you have discovered a bug in the third-party code.

 

With a handy verified working example, there are two approaches for fixing your code. Both involve gradually minimizing the differences between the example and the faulty code. By definition, when there are no differences between the working example and your code, your code will be working.

 

The first approach involves building on the example to arrive at your code. This works best when your code is simple and self-contained.

 

In small steps, add to the example elements from your code. At each step, verify the example’s functioning. The addition that causes the code you’re building to stop working is the failure’s culprit.

 

The second approach amounts to trimming your code until it matches the example. This works best when your code has many dependencies that hinder its isolated operation.

 

Here you remove or adjust material in your code to make it match the example. Do this in small steps and, after each change, check that your code keeps failing. The change you perform that makes your code work will point you to the fix you need to make.

 

Things to Remember

To find the element that causes a failure, gradually trim down your failing code to match a working example or make a working example match your failing code.

 

Simplify the Suspect Code

Complex code is difficult to debug. Many possible execution paths and intricate data flows can confuse your thinking and add to the work you must do to pinpoint the flaw. Therefore, it’s often helpful to simplify the failing code. You can do this temporarily, in order to make the flaw stand out, or permanently in order to fix it.

 

Before embarking on drastic simplifications, ensure you have a safe way to revert them. All the files you’ll modify should be under version control, and you should have a way to return the code to its initial state, preferably by working on a private branch.

 

Temporary modifications typically entail drastically pruning the code. Your goal here is to remove as much code as possible while keeping the failure. This will minimize the amount of suspect code and make it easier to identify the fault. In a typical cycle, you remove a large code block or a call to a complex function, you compile, and then you test the result.

 

If the result still fails, you continue your pruning; if not, you reduce the pruning you performed. Note that if a failure disappears during a pruning step, you have strong reasons to believe that the code you pruned is somehow associated with the failure. This leads to an alternate approach, in which you try to make the failure go away by removing as little code as possible.

 

Although you can use your version control system to checkpoint the pruning steps, it’s often quicker to just use your editor. Keep the code in an open editor window, and save the changes after each modification.

 

If after pruning the failure persists, continue the process. If the failure disappears, undo the previous step, reduce the code you pruned, and repeat. You can systematically perform this task through a binary search process.

 

Resist the temptation to comment out code blocks: the nesting of embedded comments in those blocks will be a source of problems. Instead, in languages that support a preprocessor, you can use preprocessor conditionals.

#ifdef ndef

code you don't want to be executed

#endif

 

In other languages, you can temporarily put the statements you want to disable in the block of an if (false) conditional statement. Sometimes instead of removing code, it’s easier to adjust it to simplify its execution.

 

For example, add a false value at the beginning of an if or loop conditional to ensure that the corresponding code won’t get executed. Thus, you will rewrite (in steps) the following code.

while (a() && b())
    someComplexCode();
if (b() && !c() && d() && !e())
    someOtherComplexCode();

into this simplified form:

while (false && a() && b())
    someComplexCode();
if (false && b() && !c() && d() && !e())
    someOtherComplexCode();

 

In other cases, you can benefit by permanently simplifying complex statements in order to ease their debugging. Consider the following statement.

p = s.client(q, r).booking(x).period(y, checkout(z)).duration();

 

Such a statement is justifiably called a train wreck because it resembles train carriages after a crash. It is difficult to debug because you cannot easily see the return value of each method.

 

You can fix this by adding delegate methods or by breaking the expression into separate parts and assigning each result to a temporary variable.

Client c = s.client(q, r);
Booking b = c.booking(x);
CheckoutTime ct = checkout(z);
Period p = b.period(y, ct);
TimeDuration d = p.duration();

 

This will make it easier for you to observe the result of each call with the debugger or even to add a corresponding logging statement. Given descriptive type or variable names, the rewrite will also make the code more readable.

 

Note that the change is unlikely to affect the performance of the code: modern compilers are very good at eliminating unneeded temporary variables.

 

Another worthwhile type of simplification involves breaking one large function into many smaller parts. Here the benefit of debugging comes mainly from the ability you gain to pinpoint the fault by testing each part individually.

 

The process may also improve your understanding of the code and untangle unwanted interactions between the parts. These two positive side effects can also lead to a solution of the problem.
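
As a minimal sketch of such a decomposition (the report-generation code and names are hypothetical), one long routine is split so that each step can be tested and observed in isolation.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ReportGenerator {
    // Before the split, one long method read, filtered, and formatted the data;
    // now each step is a method that a test or a debugger can exercise on its own.
    String generate(List<String> lines) {
        return format(removeBlankLines(lines));
    }

    List<String> removeBlankLines(List<String> lines) {
        return lines.stream()
                .filter(l -> !l.trim().isEmpty())
                .collect(Collectors.toList());
    }

    String format(List<String> lines) {
        return String.join(System.lineSeparator(), lines);
    }

    public static void main(String[] args) {
        System.out.println(new ReportGenerator().generate(
                Arrays.asList("first", "", "second")));
    }
}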

 

Finally, another more drastic type of permanent simplification involves ditching complex algorithms, data structures, or program logic. The rationale for such a move is that the complexity that gives rise to the fault you’re trying to locate may, in fact, not be required. Here are some characteristic cases.

 

Advances in processing speed may make a particular optimization irrelevant. On a 3MHz VAX computer, a user would sense the difference between an algorithm that responded to a keystroke in 500ms and one that responded in 5ms.

 

On a 3.2GHz Intel Core i5 CPU, the corresponding response times shrink to 500μs and 5μs, which are both imperceptible to the user. In such cases, using a simpler algorithm makes perfect sense when dealing with small fixed-size data sets.

 

These days, larger memory and disk capacities as well as network throughput rates rarely justify the complex masking operations required for packing data into bits. You can instead use the language's native types, such as integers and Boolean variables. The same may apply for other complex data compression schemes.

 

Changes in hardware technologies may make a particular optimization algorithm irrelevant. For example, operating system kernels used to contain a sophisticated elevator algorithm to optimize the movement of disk heads.

 

On modern magnetic disks, it is not possible to know the location of a particular data block on the disk platter, so such an algorithm will not do any useful work. Moreover, solid-state disks have in effect zero seek times, allowing you to scrap any complex algorithm or data structure that aims to minimize them.

 

The functionality of the buggy algorithm may be available in the library of the programming framework you’re using or it may be a mature third-party component. For example, the code for finding a container’s elements’ median value in O(n) time can contain many subtle bugs.

 

Replacing the C++ code with a call to std::nth_element is an easy way to fix such a flaw. As a larger-scale example, consider replacing a bug-infested proprietary data storage and query engine with a relational database.

 

A complex algorithm, implemented to improve performance, may have been overkill from day one. Performance optimizations are only justified when profiling and other measurements have demonstrated that optimization work on a particular code hotspot is actually required.

 

Programmers sometimes ignore this principle, gratuitously creating byzantine, overengineered code. This gives you the opportunity to do away with the code and the bug at the same time.

 

Modern user experience design favors much simpler interaction patterns than what was the case in the past. This may allow you to replace the buggy spaghetti code associated with a baroque dialog box full of tunable parameters with simpler code that supports a few carefully chosen options and many sensible defaults.

 

Things to Remember

  • Selectively prune large code thickets in order to make the fault stand out.
  • Break complex statements or functions into smaller parts so that you can monitor or test their function.
  • Consider replacing a complex buggy algorithm with a simpler one.

 

Consider Rewriting the Suspect Code in Another Language

When the code you’re trying to fix refuses to comply, drastic measures may be in order. One such measure is to rewrite the offending code in another language. By choosing a better programming environment, you hope to either side-step the bug completely or find and fix it by using the environment’s superior facilities.

 

The programming language you’ll employ should be more expressive than the one currently used.

 

 For example, assessing the performance of sophisticated trading strategies can be more easily expressed in a language with functional programming support, such as R, F#, Haskell, Scala, or ML. You can also gain more expressiveness through a language's libraries.

 

In some cases, such as in the use of R for statistical computing, the gains can be so big as to make it criminal to use a less featureful alternative.

 

As another example, if your code is doing tricky string processing over dynamically allocated collections of elements in C, you may want to try rewriting the code in C++ or Python. Rewriting the faulty code in a more expressive language will result in a more compact implementation, which offers fewer chances for errors.

 

Another trait you might find useful is the ability to easily observe the code’s behavior, perhaps constructing the code incrementally. Here, scripting languages offer a particular advantage through the read-eval-print loop (REPL) they support.

 

If you implement an algorithm using Unix tool pipelines, then you can build the processing pipeline step by step, verifying the output of each stage before adding a next one. 

 

Furthermore, if the development system originally used doesn’t offer a decent debugging, logging, or unit testing framework, then adopting an improved implementation environment can provide you with the opportunity to pinpoint the problem by using its shiny support facilities. This can be very useful when you’re debugging code in small embedded systems with lackluster development tools.

 

Once you get the newly written code working, you have two options for fixing the original problem. The first involves adopting the new code and trashing the old one. You can easily do this when there are good bindings between the original language and the new one.

 

For instance, it’s typically trivial to call C++ from C with a few plain-data parameters and a simple return value. You can also keep your new implementation by invoking it as a separate process or microservice. However, this makes sense only when you don’t particularly care about the invocation’s cost.

 

The other option for fixing your bug involves using the new code as an oracle to correct the old one.

 

You can do this by observing the differences in behavior between the working code and the failing one, or by gradually converging the two code bases until the bug surfaces. In the first case, you compare the variable and routine return values between the two implementations; in the second one, you proceed by trial and error until you arrive at a correct implementation.
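
Below is a minimal, hypothetical sketch of the oracle approach: the rewritten implementation and the suspect original are run on the same generated inputs, and the first input on which they diverge is reported. The two findMax-style routines and the deliberately planted flaw are illustrative.

import java.util.Arrays;
import java.util.Random;

public class OracleComparison {
    // Suspect original implementation (with an illustrative flaw: max starts at 0).
    static int oldMax(int[] v) {
        int max = 0;
        for (int n : v)
            if (n > max)
                max = n;
        return max;
    }

    // Rewritten implementation, used as the oracle.
    static int newMax(int[] v) {
        return Arrays.stream(v).max().getAsInt();
    }

    public static void main(String[] args) {
        Random r = new Random(42);          // fixed seed for a reproducible run
        for (int i = 0; i < 1_000; i++) {
            int[] v = r.ints(10, -100, 100).toArray();
            if (oldMax(v) != newMax(v)) {
                System.out.println("Divergence on " + Arrays.toString(v));
                return;
            }
        }
        System.out.println("No divergence found");
    }
}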

 

Things to Remember

  • Rewrite code you can’t fix in a more expressive language to minimize the number of potentially faulty statements.
  • Port buggy code to a better programming environment to enhance your debugging arsenal.
  • Once you have an alternative working implementation, adopt it, or use it as an oracle to fix the original one.

 

 Improve the Suspect Code’s Readability and Structure

Disorderly, badly written code can be a fertile breeding ground for bugs. Cleaning up the code can uncover the bugs, allowing you to fix them. However, before embarking on a code-cleaning trip, ensure that you have the time and authority to complete it.

 

Nobody will take kindly to a bug fix that modifies 4,000 lines. At the very least, separate cosmetic changes, refactorings, and the actual bug fix into distinct commits. In some environments, coordinating with others for the first two kinds of changes may be the way to go.

 

Start with spacing. At the lowest level, ensure that the code consistently follows the language’s and your team’s local style rules regarding spaces around operators and reserved words.

 

This can help your eye catch subtle errors in statements and expressions. At a slightly higher level, look at the indentation. Again, this should always use the same number of spaces (typically 2, 4, or 8) applied in a consistent way.

 

With orderly indentation, it’s easier to follow the code’s control flow. Be especially careful with single statements spanning multiple lines: their appropriate indentation will help you verify the correctness of complex expressions and function calls. At the highest level, use your judgment to add spaces where these can aid the user in understanding the code.

 

Aligning similar expressions with some extra spacing can make discrepancies stand out. Separating logical code blocks with an empty line can make it easier for you to understand the code’s structure.

 

In general, ensure that the visual appearance of the code mirrors its functionality so that your eye can catch suspect patterns. If the code’s formatting is really beyond manual salvation, consider using your IDE or a tool, such as clang-format or indent, to fix it automatically.

 

Fixing the code’s formatting can improve its visual appearance, but it can only go so far. Therefore, after style fixes, consider whether there’s a need to refactor the code: maintain its functionality while improving its structure.

 

Your objective here is either to fix the problem through the use of a more orderly structure (akin to rewriting) or to make the fault stand out in the more orderly code.

 

Here are some common problems (code smells) that can hide faults and the refactorings you can implement to solve them. Most are derived from Martin Fowler’s classic book, Refactoring: Improving the Design of Existing Code (Addison-Wesley, 2000), which you can consult for more details.

 

Duplicated code can introduce bugs when code improvements and fixes fail to update all related instances of the code.

By putting the duplicated code into a commonly used routine, class, or template you can ensure that the correct code is used throughout the program. If partial code updates were the source of the failure, then you will discover them as you compare the code instances you remove.

 

Duplicated code also hides in switch statements, which often change the code’s flow based on a value representing the data’s type. Missing case elements in some switch statements can easily go unnoticed when new cases are added. As a simple measure, you can add a default clause that will log an internal error when it is executed.

 

Even better, restructure the code to eliminate the switch statement. This is typically done by moving the behavior associated with each case into a method of corresponding subclasses, and replacing the switch statement with a polymorphic method call.
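
A minimal sketch of this refactoring follows; the shape classes and the area() method are illustrative, not taken from the text.

abstract class Shape {
    abstract double area();
}

class Circle extends Shape {
    private final double radius;
    Circle(double radius) { this.radius = radius; }
    @Override double area() { return Math.PI * radius * radius; }
}

class Square extends Shape {
    private final double side;
    Square(double side) { this.side = side; }
    @Override double area() { return side * side; }
}

public class ShapeDemo {
    public static void main(String[] args) {
        // Callers now write shape.area() instead of switching on a type code,
        // so a newly added subclass cannot be silently missed by some switch statement.
        Shape[] shapes = { new Circle(1.0), new Square(2.0) };
        for (Shape s : shapes)
            System.out.println(s.area());
    }
}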

 

Alternately, you can express the behavior in subclasses of a state object, which is used in the place of the switch statement.

 

A related problem is the shotgun surgery code smell, where a single change affects many methods and fields. The bug you’re chasing may be a change that someone forgot to implement.

 

By moving all the fields and methods that need change into the same class, you can ensure that these are consistent with each other. The class can be an existing one, a new one, or an internal (nested) one that localizes the required changes.

 

Also exposed to the risk of inconsistent changes are data clumps: data objects that commonly appear together. Group these into a class and use objects of that class both as parameters and as return values. This change will eliminate the multiple data objects and the risk of forgetting one.

 

When a language’s primitive values, such as integers or strings, are used to express more sophisticated values, such as currencies, dates, or zip codes, errors in the manipulation of these values can go unnoticed. For example, if currency values are represented as integers, the code can easily add two different currency values.

 

Introduce classes to represent such objects, and replace the primitive values with these objects. Similarly, the use of containers (linked lists or resizable vectors) instead of primitive arrays can help you fix errors associated with array size management.
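
As a hedged sketch (the Money class and its representation are illustrative, not prescribed by the text), amounts carry their currency, so adding two different currencies fails immediately instead of going unnoticed.

import java.util.Currency;

public final class Money {
    private final long cents;
    private final Currency currency;

    public Money(long cents, Currency currency) {
        this.cents = cents;
        this.currency = currency;
    }

    public Money add(Money other) {
        // The type system plus this check make cross-currency arithmetic an error.
        if (!currency.equals(other.currency))
            throw new IllegalArgumentException(
                    "Cannot add " + other.currency + " to " + currency);
        return new Money(cents + other.cents, currency);
    }

    @Override public String toString() {
        return String.format("%d.%02d %s", cents / 100, cents % 100, currency);
    }

    public static void main(String[] args) {
        Money usd = new Money(1999, Currency.getInstance("USD"));
        Money eur = new Money(500, Currency.getInstance("EUR"));
        System.out.println(usd.add(new Money(1, Currency.getInstance("USD"))));
        try {
            usd.add(eur);                   // mixing currencies is now an error
        } catch (IllegalArgumentException e) {
            System.out.println("Caught: " + e.getMessage());
        }
    }
}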

 

A further step away from primitive types involves the use of bespoke classes, rather than naked floating point types, to represent physical units (time, mass, force, energy, acceleration).

 

With proper methods to combine these (e.g., F = m a) you can catch errors arising from their improper use—assigning apples to oranges, as it were.

 

Varying interfaces, such as the set of methods supported by a class, method names, and order and type of their parameters, can cloud the code’s structure and thereby hide bugs. Homogenize method names through renaming, and parameters through their reordering.

 

Add, remove, and move methods to homogenize their classes. The classes with their new similar interfaces may reveal more refactoring opportunities, such as the extraction of a superclass.

 

Long routines can be difficult to follow and debug. Break them into smaller pieces, and decompose complex conditionals into routine calls. If a method is difficult to break up due to many temporary variables, consider changing it into a method object where these variables become the object’s fields.
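
Here is a minimal sketch of the method-object idea; the invoice calculation and its names are hypothetical. The long routine’s temporaries become fields, so each former code block can be split into a small, inspectable method.

public class InvoiceTotal {
    private final double[] prices;
    private double subtotal;
    private double tax;

    InvoiceTotal(double[] prices) { this.prices = prices; }

    double compute() {
        sumPrices();
        applyTax();
        return subtotal + tax;
    }

    // Each former block of the long routine is now a method you can test or step through.
    private void sumPrices() { for (double p : prices) subtotal += p; }
    private void applyTax()  { tax = subtotal * 0.24; }   // illustrative tax rate

    public static void main(String[] args) {
        System.out.println(new InvoiceTotal(new double[] { 10.0, 5.5 }).compute());
    }
}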

 

Code parts that are inappropriately intimate with each other can hide incorrect interactions that destroy invariants and disrupt the program’s state. Break these by moving methods and fields, and ensure associations between classes are unidirectional rather than bidirectional.

 

Long chains of delegation can provide clients with more access than what is strictly needed, which can, in turn, be a source of errors. Break these by introducing delegate methods. Thus the expression

account.getOwner().getName()

with the introduction of the getOwnerName delegate method becomes

account.getOwnerName()
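
A minimal sketch of how such a delegate method might be defined; the Account and Owner classes are assumed here for illustration.

class Owner {
    private final String name;
    Owner(String name) { this.name = name; }
    String getName() { return name; }
}

public class Account {
    private final Owner owner;
    Account(Owner owner) { this.owner = owner; }
    Owner getOwner() { return owner; }

    // Delegate method: clients no longer reach through the Owner object.
    String getOwnerName() { return owner.getName(); }

    public static void main(String[] args) {
        Account account = new Account(new Owner("Ada"));
        System.out.println(account.getOwnerName());
    }
}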

 

Surprisingly, comments can also point to trouble spots when they’re used to veil incomprehensible or suboptimal code. Often it is enough to replace a commented block of code with a method whose name reflects the original code’s comment.

 

The resulting short sequence of method calls will make it easier for you to spot errors in the code’s logic. In other cases, an assertion will more effectively express the preconditions written in a comment because the assertion will readily fail when the precondition isn’t satisfied.
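
A minimal sketch of replacing a commented block with a method named after the comment; the parsing code and names are illustrative.

public class HeaderParser {
    static String parse(String line) {
        // Before: a few lines here carried the comment "strip the comment part".
        // After: the block is a method whose name says the same thing.
        return stripCommentPart(line).trim();
    }

    static String stripCommentPart(String line) {
        int hash = line.indexOf('#');
        return hash < 0 ? line : line.substring(0, hash);
    }

    public static void main(String[] args) {
        System.out.println(parse("timeout = 30   # seconds"));
    }
}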

 

As a final step, remove dead code and speculative generality. Remove unused code and parameters, collapse unused class hierarchies, inline classes with a single client, and rename methods with funky abstract names to reflect what they actually do. Your aim here is to eliminate hiding places for bugs.

 

Things to Remember

  • Format code in a consistent manner to allow your eye to catch error patterns.
  • Refactor code to expose bugs hiding in badly written or needlessly complex code structures.

 

Fix the Bug’s Cause, Rather Than Its Symptom

A surprisingly tempting way to fix a problem is to hide it under the carpet with a local fix. Here are some examples of a conditional statement “fix” being used:

 

To avoid a null pointer dereference:

if (p != null)
    p.aMethod();

To sidestep a division by zero:

if (nVehicleWheels == 0)
    return weight;
else
    return weight / nVehicleWheels;

To shoehorn an incorrect number into a logical range:

a = surfaceArea();
if (a < 0)
    a = 0;

To correct a truncated surname:

if (surname.equals("Wolfeschlegelsteinha"))
    surname = "Wolfeschlegelsteinhausenbergerdorff";

 

Some of the preceding statements could have a reasonable explanation. If, however, the conditional was put into place merely to patch a crash, an exception, or an incorrect result, without understanding the underlying cause, then the particular fix is inexcusable. (The most egregious example I’ve seen was C code that cleaned up a string from mysteriously introduced control characters.)

 

Coding around bugs is bad for many reasons.

The “fix,” by short-circuiting some functionality, may introduce a new more subtle bug.

By not fixing the underlying cause, other less obvious symptoms of the bug may remain, or the bug may appear again in the future under a different guise.

 

The program’s code becomes needlessly complex and thus difficult to understand and modify. The underlying cause becomes harder to find because the “fix” hides its manifestation—for example, the crash that could direct you to the underlying cause.
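
As a hedged sketch contrasting the symptom patch shown earlier with addressing the cause: instead of special-casing a zero wheel count at the point of division, the invalid value is rejected where it enters the system. The Vehicle class is illustrative.

public class Vehicle {
    private final int nWheels;
    private final double weight;

    Vehicle(int nWheels, double weight) {
        // The invalid value is rejected where it enters the system...
        if (nWheels <= 0)
            throw new IllegalArgumentException("wheel count must be positive: " + nWheels);
        this.nWheels = nWheels;
        this.weight = weight;
    }

    double weightPerWheel() {
        // ...so no defensive special case is needed at the point of use.
        return weight / nWheels;
    }

    public static void main(String[] args) {
        System.out.println(new Vehicle(4, 1200).weightPerWheel());
    }
}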

 

Things to Remember

  • Never code around a bug’s symptom: find and fix the underlying fault.
  • When possible, generalize complex cases rather than trying to fix special cases.

 

Compile-Time Techniques

 

Examine Generated Code

Code often gets compiled through a series of transformations from one form to another until it finally reaches the form of the processor’s instructions. For example, a C or C++ file may first get preprocessed, then compiled into assembly language, which is then assembled into an object file;

 

Java programs are compiled into JVM instructions; lexical and parser generation tools, such as lex, flex, yacc, and bison, compile their input into C or C++. Various commands and options allow tapping into these transformations to inspect the intermediate code. This can provide you with valuable debugging intelligence.

 

Consider preprocessed C or C++ source code. You can easily obtain it by running the C/C++ compiler with an option directing it to output that code: -E on Unix systems, /E for Microsoft’s compilers. (On Unix systems, you can also invoke the C preprocessor directly as cpp.)

 

If the resulting code is anything more than a few lines long, you’ll want to redirect the compiler’s output into a file, which you can then easily inspect in your editor. Here is a simple example demonstrating how you can pinpoint an error by looking at the preprocessed output. Consider the following C code.

 

#define PI 3.1415926535897932384626433832795;

double toDegrees = 360 / 2 / PI;
double toRadians = 2 * PI / 360;

Compiling it with the Visual Studio 2015 compiler produces the following (perhaps) cryptic error.

t.c(3) : error C2059: syntax error : '/'


If you generate and look at the preprocessed code appearing below, you will see the semicolon before the slash, which will hopefully point you to the superfluous semicolon at the end of the macro definition.

#line 1 "t.c"

double toDegrees = 360 / 2 / 3.1415926535897932384626433832795;;
double toRadians = 2 * 3.1415926535897932384626433832795; / 360;

 

This technique can be remarkably effective for debugging errors associated with the expansion of complex macros and definitions appearing in third-party header files. When, however, the expanded code is large (and it typically is), locating the line with the culprit code can be difficult.

 

One trick for finding it is to search for a non-macro identifier or a string appearing near the original line that fails to compile (e.g., toRadians in the preceding case). You can even add a dummy declaration as a signpost near the point that interests you.

 

Another way to locate the error involves compiling the preprocessed or otherwise generated code after removing the #line directives. The #line directives appearing in the preprocessed file allow the main part of the compiler to map the code it is reading to the lines in the original file.

 

The compiler can thus accurately report the line of the original (rather than the preprocessed) file where the error occurred. If, however, you’re trying to locate the error’s position in the preprocessed file in order to inspect it, having an error message point you to the original file isn’t what you want.

 

To avoid this problem, preprocess the code with an option directing the compiler not to output #line directives: -P on Unix systems, /EP for Microsoft’s compilers.

 

In other cases, it’s useful to look at the generated machine instructions. This can (again) help you get unstuck when you’re stumped by a silly mistake: you look at the machine instructions and you realize you’ve used the wrong operator or the wrong arithmetic type, or you forgot to add a brace or a break statement.

 

Through the machine code, you can also debug low-level performance problems. To list the generated assembly code, invoke Unix compilers with the -S option and Microsoft’s compilers with the /Fa option. If you use GCC and prefer to see Intel’s assembly syntax rather than the Unix one, you can also specify GCC’s -masm=intel option.

 

In languages that compile into JVM bytecode, run the javap command on the corresponding class, passing it the -c option. Although assembly code appears cryptic, if you try to map its instructions into the corresponding source code, you can easily guess most of what’s going on, and this is often enough.

 

Use Static Program Analysis

Having tools do the debugging for you sounds too good to be true, but it’s actually a realistic possibility.

 

A variety of so-called static analysis tools can scan your code without running it (that’s where the word “static” comes from) and identify apparent bugs. Some of these tools may already be part of your infrastructure because modern compilers and interpreters often perform basic types of static analysis.

 

Stand-alone tools include GrammaTech CodeSonar, Coverity Code Advisor, FindBugs, Polyspace Bug Finder, and various programs whose name ends in “lint.” The analysis tools base their operation on formal methods (algorithms based on lots of math) and on heuristics (an important-sounding word for informed guesses).

 

Ideally, static analysis tools should be applied continuously during the software’s development to ensure the code’s hygiene, but they can also be useful when debugging, to locate easy-to-miss bugs such as concurrency gotchas and sources of memory corruption.

 

Some of the static analysis tools can detect hundreds of different bugs. The following bullets list important bugs that these tools can find and a few toy examples that demonstrate them. In practice, the code associated with an error is typically convoluted through other statements or distributed among many routines.

 

No tool will ever be perfect due to both practical limitations (memory required to track the exponential explosion of the state space) and theoretical constraints (some underlying problems are undecidable—there’s probably no algorithm that can always solve them correctly).

 

Therefore, although static analysis is useful, you’ll sometimes need to use your judgment regarding the correctness of the results it provides you, and be on the lookout for cases it misses.

 

Your first port of call to obtain the benefits of static analysis should be the compiler or interpreter you’re using. Some provide options that will make them check your code more strictly and issue a warning when they encounter questionable code. For example,

 

A starting point of options for GCC, ghc (the Glasgow Haskell Compiler), and clang (the C language family front end of the LLVM compiler) is -Wall, -Wextra, and -Wshadow (many more are available).

 

The options for Microsoft’s C/C++ compiler are /Wall and /W4.

In JavaScript code you write "use strict"; (yes, in double quotes).

 

In Perl you write use strict; and use warnings;.

(The Perl and JavaScript options enable both static and dynamic checks.) In compilers, also specify a high optimization level: this performs the type of analysis needed for generating some of the warnings.

 

If the level of warnings can be adjusted, choose the highest level that will not drown you in warnings about things you’re unlikely to fix. Then methodically remove all other warnings. This may fix the fault you’re looking for, and may also make it easier to see other faults in the future.

 

Having achieved zero warnings, take the opportunity to adjust the compilation options so that warnings are treated as errors (/WX for Microsoft’s compilers, -Werror for GCC). This will prevent warnings from being missed in a lengthy compilation’s output, and will also compel all developers to write warning-free code.

 

Having securely anchored the benefits of configuring your compiler, it’s time to set up additional analysis tools. These can detect more bugs at the expense of longer processing times and, often, more false positives.

 

Wikipedia’s static analysis tools list contains more than 100 entries. The list includes both popular commercial offerings, such as Coverity Code Advisor, and widely used open-source software, such as FindBugs.

 

Some focus on particular types of bugs, such as security vulnerabilities or concurrency problems. Choose those that better target your particular needs.

 

Feel free to adopt more than one tool because different tools often complement each other in the bugs they detect. Invest effort to configure each tool in a way that minimizes the spurious warnings it issues by turning off the warnings that don’t apply to your coding style.

 

Finally, make it possible to run the static analysis step as part of the system’s build, and make it a part of the continuous integration setup.

 

The build configuration will make it easy for developers to check their code with the static analysis tools in a uniform way. The check during continuous integration will immediately report any problems that slip past developers.

 

This setup will ensure that the code is always clean of errors reported by static analysis tools. All too often, a team will embark on a heroic effort to clean up static analysis errors (perhaps while chasing an insidious bug), and then lose interest and let new errors creep in.

 

Things to Remember

  • Specialized static program analysis tools can identify more potential bugs in code than compiler warnings.
  • Configure your compiler to analyze your program for bugs.
  • Include in your build cycle and continuous integration cycle at least one static program analysis tool.

 

Configure Deterministic Builds and Executions

The following program prints the memory addresses associated with the program’s stack, heap, code, and data.

#include <stdio.h>
#include <stdlib.h>

int z;
int i = 1;
const int c = 1;

int
main(int argc, char *argv[])
{
    printf("stack:\t%p\n", (void *)&argc);
    printf("heap:\t%p\n", malloc(1));
    printf("code:\t%p\n", (void *)main);
    printf("data:\t%p (initialized)\n", (void *)&i);
    printf("data:\t%p (constants)\n", (void *)&c);
    printf("data:\t%p (zero)\n", (void *)&z);
    return 0;
}
On many environments, each run will produce different results. (I’ve seen this happening with GCC under GNU/Linux, clang under OS X, and Visual C under Windows.)
dbuild
stack: 003AFDF4
heap: 004C2200
code: 00CB1000
data: 00CBB000 (initialized)
data: 00CB8140 (constants)
data: 00CBCAC0 (zero)
dbuild
stack: 0028FC68
heap: 00302200
code: 01331000
data: 0133B000 (initialized)
data: 01338140 (constants)
data: 0133CAC0 (zero)

This happens because the operating system kernel randomizes the way the program is loaded into memory in order to hinder malicious attacks against the code. Many so-called code injection attacks work by overflowing a program’s buffers with malicious code and then tricking the program being attacked into executing that code.

 

This trick is quite easy to pull off if a vulnerable program’s elements are always located at the same memory position. As a countermeasure, some kernels randomize a program’s memory layout, thereby foiling malicious code that attempts to use hard-coded memory addresses.

 

Unfortunately, this address space layout randomization (ASLR) can also interfere with your debugging. Failures that happen on one run may not occur in another one; the values of pointers you painstakingly record change when you restart the program; address-based hash tables get filled in a different way; some memory managers may change their behavior from one run to another.

 

Therefore, ensure that your program stays stable between executions, especially when debugging a memory-related problem. On GNU/Linux, you can disable ASLR by running your program as follows.

setarch $(uname -m) -R myprogram

 

On Visual Studio you disable ASLR by linking your code with the /DYNAMICBASE:NO option or by setting appropriately the project’s Configuration Properties – Linker – Advanced – Randomized Base Address option.

 

Some versions of Windows have a registry setting that globally disables ASLR (HKLM/SYSTEM/CurrentControlSet/Control/SessionManager/ MemoryManagement/MoveImages).

 

Finally, on OS X you need to pass the -no_pie option to the linker, through the compiler’s -Wl flag. This is the incantation you’ll need to use when compiling.

cc -Wl,-no_pie -o myprogram myprogram.c

 

There are other, thankfully less severe, ways through which two builds of the same program may differ. Here are some representative ones. Unique symbol names that GCC chooses and places into each compiled file: use the -frandom-seed flag to pin these down.

 

Varying order of compiler inputs. If the files to be compiled or linked are derived from a Makefile wildcard expansion, their order can differ as directory entries get reshuffled. Specify the inputs explicitly, or sort the wildcard’s expansion. 

 

Timestamps embedded in the code to convey the software’s version, through the __DATE__ and __TIME__ macros, for example. Use the revision control system version identifier (e.g., Git’s SHA sum) instead. This will allow you to derive the timestamp should you ever need it.

 

Lists generated from hashes and maps. Some programming language implementations vary how objects are hashed in order to thwart algorithmic complexity attacks.

 

This varies the results of traversing a container. Address this by sorting the listed results. For Perl and Python, you might alternatively set the PERL_HASH_SEED or PYTHONHASHSEED environment variables.

 

Encryption salt. Encryption programs typically perturb the provided key through a randomly derived value—the so-called salt— in order to thwart prebuilt dictionary attacks.

 

Disable the salting when testing and debugging; for example, the OpenSSL program offers the -nosalt option. However, do not use this option for production purposes as it will make your system vulnerable to dictionary attacks.

 

The gold standard for build image consistency is being able to create bit-identical package distributions by compiling the same source code on different hosts. This requires a lot more work because it also involves sanitizing things such as file paths, locales, archive metadata, environment variables, and time zones.

 

If you need to go that far, consult the reproducible builds website, which offers sound advice on how to tackle these problems.

 

Things to Remember

Configure your build process and the software’s execution to achieve reproducible runs.

Configure the Use of Debugging Libraries and Checks

 

A number of compilation and linking options allow your code and libraries to perform more stringent runtime checks regarding their operation. These options work in parallel with those that configure your own software’s debug mode, which you should also enable at compile time.

 

The options you’ll see here mainly apply to C, C++, and Objective-C, which typically avoid the performance penalty of buffer bounds checking. Consequently, when these checks are enabled, programs may run noticeably slower.

 

Therefore, you must apply these methods with care in real-time systems and performance-critical environments. In the following paragraphs, you’ll see some common ways in which you can configure the compilation or linking of your code to pinpoint bugs associated with the use of memory.

 

You can enable a number of checks on software using the C++ standard template library. With the GNU implementation, you need to define the macro _GLIBCXX_DEBUG when compiling your code, whereas under Visual Studio the checks are enabled if you build your project under debug mode or if you pass the option /MDd to the compiler.

 

Builds with STL checks enabled will catch things such as incrementing an iterator past the end of a range, dereferencing an iterator of a container that has been destructed, or violating an algorithm’s preconditions.

 

If you also define the macro _GLIBCXX_DEBUG_PEDANTIC, you will additionally get messages regarding the use of features that are not portable to other STL implementations.

 

The GNU C library allows you to check for memory leaks (allocated memory that isn’t freed over the program’s lifetime). To do that, you need to call the mtrace function at the beginning of your program, and then run it with the environment variable MALLOC_TRACE set to the name of the file where the tracing output will go.

 

Consider a program that, at the time it exits, still has an allocated memory block.

 

Depending on the compiler you’re using, you may need to provide some extra information in order to get an error report in terms of your source code rather than machine code addresses.

 

You can do that by setting the environment variable ASAN_SYMBOLIZER_PATH to point to the program that does this mapping (e.g., /usr/bin/llvm-symbolizer-3.4) and setting the environment variable ASAN_OPTIONS to symbolize=1.

 

AddressSanitizer is supported on a number of systems, including GNU/Linux, OS X, and FreeBSD running on i386 and x86_64 CPUs, as well as Android on ARM and the iOS Simulator. AddressSanitizer imposes a significant overhead on your code: it roughly doubles the amount of memory and processing required to run your program.

 

On the other hand, it is not expected to produce false positives, so using it while testing your software is a trouble-free way to locate and remove many memory-related problems.

 

The facilities used for detecting memory allocation and access errors in Visual Studio are not as advanced as AddressSanitizer, but they can work in many situations. 

 

Keep in mind that the provided facilities can only identify writes that happen just outside the allocated heap blocks. In contrast to AddressSanitizer, they cannot identify invalid read operations, nor invalid accesses to global and stack memory.

 

An alternative approach to use under OS X and when developing iOS applications involves linking with the Guard Malloc (libgmalloc) library. This puts each allocated memory block into a separate (non-consecutive) virtual memory page, allowing the detection of memory accesses outside the allocated pages.

 

The approach places significant stress on the virtual memory system when allocating the memory but requires no additional CPU resources to check the allocated memory accesses. It works with C, C++, and Objective-C. To use the library, set the environment variable

 

DYLD_INSERT_LIBRARIES to /usr/lib/libgmalloc.dylib. Several additional environment variables can be used to fine-tune its operation; consult the libgmalloc manual page for details. As an example, the following program, which reads outside an allocated memory block, terminates with a segmentation fault when linked with the library.

int
main()
{
int *a = new int [5];
int t = a[10];
return 0;
}

 

You can easily catch the fault with a debugger in order to pinpoint the exact location associated with the error.

Finally, if none of these facilities are available in your environment, consider replacing the library your software is using with one that supports debug checks. One notable such library is dmalloc, a drop-in replacement for the C memory allocation functions with debug support.

 

Things to Remember

  • Identify and enable the runtime debugging support offered by your environment’s compiler and libraries.
  • If no support is available, consider configuring your software to use third-party libraries that offer it.

 

 Runtime Techniques

The ultimate source of truth regarding a program is its execution. While a program is running, everything comes to light: its correctness, its CPU and memory utilization, even its interactions with buggy libraries, operating systems, and hardware. Yet, typically, this source of truth is also fleeting, rushing into oblivion to the tune of billions of instructions per second.

 

Worse, capturing that truth can be a tricky, tortuous, or downright treacherous affair. Tests, application logs, and monitoring tools allow you to peek into the program’s runtime behavior to locate the bug that’s bothering you.

 

 Find the Fault by Constructing a Test Case

You can often pinpoint and even correct a bug simply by working on appropriate tests. Some call this approach DDT for “Defect-Driven Testing”—it is no coincidence that the abbreviation matches that of the well-known insecticide. Here are the three steps you need to follow, together with a running example.

 

The example is based on an actual bug that appeared in qmcalc, a program that calculates diverse quality metrics for C files and displays the values corresponding to each file as a tab-separated list. The problem was that, in some rare cases, the program would output fewer than the 110 expected fields.

 

First, create a test case that reliably reproduces the problem you need to solve. This means specifying the process to follow and the required materials (typically data).

 

For example, a test case can be that loading a particular file (material) and then pressing x, y, and z causes the application to crash (process). Another could be that putting Acme’s load balancer (material) in front of your application causes the initial user authentication to fail (process).

 

In the example case, the following commands apply the qmcalc program on all Linux C files and generate a summary of the number of fields generated.

# Find all C files
find linux-4.4 -name \*.c |
# Apply qmcalc on each file
xargs qmcalc |
# Display the number of fields
awk '{print NF}' |
# Order by number of fields
sort |
# Display number of occurrences
uniq -c

 

The second step is the simplification of the test case to the bare minimum. Both methods for doing that, building up the test case from scratch or trimming down the existing large test case, involve an Aha! moment, where the bug first appears (when building up) or disappears (when trimming down).

 

The test case data will often point you either to the problem or even to the solution. In many cases you can combine both trimming methods: you first remove as much fat as possible, and, once you think you know what the problem is, you construct a new minimal test case from scratch.

 

The third step involves consolidating your victory. Having isolated the problem, grab the opportunity to add a corresponding unit test or regression test in the code.

 

If the failure is associated with a fault in isolated code parts, then you should be able to add a corresponding unit test. If the failure occurs through the combination of multiple factors, then a regression test is more appropriate.

 

The regression test should package your test case in a form that can be executed automatically and routinely when the software is tested. While the fault is still in the code, run the software’s tests to verify that the test fails and that it therefore correctly captures the problem.
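
As a hedged sketch of such a regression test (using JUnit 5 and an illustrative failing input), the test below packages the minimal case so that it runs automatically with the rest of the software’s tests; Ranking.findMax is the routine from the earlier assertion listing.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class RankingTest {
    @Test
    void findMaxHandlesAllNegativeValues() {
        // Illustrative input capturing the reported problem; once the fix is in,
        // this test keeps the fault from resurfacing unnoticed.
        assertEquals(-3, Ranking.findMax(new int[] { -7, -3, -12 }));
    }
}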

 

When the test passes, you have a pretty good indication that you’ve fixed the code. In addition, the test’s existence will now ensure that the fault will not resurface again in the future.

 

To put this in the words of Andrew Hunt and David Thomas: “Coding ain’t done ‘til all the tests run.”

 

Adding to your code a test for a problem that’s already solved is not as pedantic as it sounds. First, you may have missed fixing a particular case; the test will help you catch that problem when that code is exercised.

 

Then, an incorrectly handled revision merge conflict may introduce the same error again. In addition, someone else can commit a similar error in the future. Finally, the test may also catch related errors. There’s rarely a good reason to skimp on tests.

 

When using tests to uncover bugs, it’s worthwhile to know which parts of the code are actually tested and which parts are skipped over because bugs may lurk in the less well-tested parts. You can find this through the use of a tool that performs test coverage analysis.

 

Examples include gcov (C and C++; see 57: “Profile the Operation of Systems and Processes”), JCov, JaCoCo, and Clover (Java), NCover and OpenCover (.NET), as well as the packages coverage (Python) and blanket.js (JavaScript).

 

Things to Remember

The process of creating a reliable minimal test case can lead you to the fault and its solution.

Embed your test case in the software as a unit or a regression test.

 

Fail Fast

The fast and efficient reproduction of a problem will improve your debugging productivity. Therefore, configure the software to fail at the first sign of trouble. Such a failure will make it easier for you to pinpoint the corresponding fault because the failing code will be executed relatively soon after the code that caused the failure, and may even be located close to it.

 

In contrast, allowing the software to continue running after a minor failure can lead the code’s operation into uncharted territory where a cascade of other problems will make the location of a bug much more difficult.

 

Failing quickly entails the risk of focusing on the wrong problem. However, if you fix that problem and restart your debugging, you have eliminated forever a source of doubt. Through a process of gradual elimination, you’re making progress. Again, allowing minor problems to linger can bring about death from a thousand cuts.

 

Here are some ways to speed up your program’s failures.

  • Add and enable assertions to verify the validity of routines’ input arguments and the success of API calls. In Java, you enable assertions at runtime with the -ea option. In C and C++ you typically enable assertions at compile time by not defining the NDEBUG macro identifier. (This identifier is typically defined in production builds.)

  • Configure libraries for strict checking of their use.
  • Check the program’s operation with dynamic program analysis methods.
  • Set the Unix shell’s -e option to make shell scripts terminate when a command exits with an error (a non-zero exit status).

 

Note that while failing fast is an effective way to debug a self-contained program, it may not be a suitable way to run a large production system that has graduated from development to maintenance.

 

There, the priority is likely to be resilience: in many cases, allowing the system to operate after a minor failure (for example, a problem loading an icon image or a crash of one among many server processes) may be preferable to bringing the whole system down. This permissive mode of operation can be counterbalanced by other measures, such as extensive monitoring and logging.

 

Things to Remember

When debugging, set up trip wires so that your program will fail at the first sign of trouble.

 

Examine Application Log Files

Many programs that perform complex processing, execute in the background, or lack access to a console, log their operations to a file or a specialized log collection facility. A program’s log output allows you to follow its execution in real time or analyze a sequence of events at your own convenience.

 

In the case of a failure, you may find in the log either an error or a warning message indicating the reason behind the failure (e.g., “Unable to connect to example.com: Connection refused”) or data that point to a software error or misconfiguration. Therefore, make it a habit to start the investigation of a failure by examining the software’s log files.

 

The location and storage method of log files differ among operating systems and software platforms. On Unix systems, logs are typically stored in text files located in the /var/log directory.

 

Applications may create in that directory their own log files, or they may use an existing file associated with the class of events they’re logging. Some examples include

  • Authentication: auth.log
  • Background processes: daemon.log
  • The kernel: kern.log
  • Debug information: debug

 

  • Other messages: messages

On a not-so-busy system, you may be able to find the file corresponding to the application you’re debugging by running

ls -tl /var/log | head

right after the application creates a log entry; the name of the log should appear at the top among the most recently modified files. If the file is located in a subdirectory of /var/log, you may be able to find it as follows:

# List all files under /var/log
find /var/log -type f |
# List each file's last modification time and name
xargs stat -c '%y %n' |
# Order by time
sort -r |
# List the ten most recently modified files
head

 

If these methods do not work, you can look for the log file’s name in the application’s documentation, in a trace of its execution, or in its source code.

 

On Windows systems, the application logs are stored in an opaque format. You can run Eventvwr.msc to launch the Event Viewer GUI application, which allows you to browse and filter the logs, use the Windows PowerShell Get-EventLog cmdlet, or use the corresponding .NET API. Again, logs are separated into various categories; you can explore them through the tree appearing on the left of the Event Viewer.
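
For example, the following Windows PowerShell command retrieves recent entries from one of these categories; the choice of the Application log and the number of entries are merely illustrative.

Get-EventLog -LogName Application -Newest 20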

 

On OS X, the GUI log viewer application is named Console. Both the Windows and the OS X applications allow you to filter the logs, create custom views, or search for specific entries. You can also use the Unix command-line tools to perform similar processing.

 

Many applications can adjust the amount of information they log (the so-called log verbosity) through a command-line option, a configuration option, or even at runtime by sending them a suitable signal. Moreover, logging frameworks provide additional mechanisms for expanding or throttling log messages.

 

When you have debugged your problem, don’t forget to reset logging to its original level; extensive logging can hamper performance and consume excessive storage space or bandwidth.

 

On Unix systems, applications tag every log message with the associated facility (e.g., authorization, kernel, mail, user) and a level, ranging from emergency and alert down to informational and debug. You can then configure how syslogd (or rsyslogd), the background program that listens for log messages and logs them into a file, handles specific messages.

 

The corresponding file is /etc/syslog.conf (or /etc/rsyslog.conf and the /etc/rsyslog.d directory). There you can specify that log messages up to a maximum level (e.g., all messages up to informational, but not the debug ones) and associated with a given facility will be logged to a file, sent to the console, or ignored.

 

For example, the following lines specify the files associated with all messages of the security facility, with authorization messages up to the informational level, and with all messages of exactly the debug level. They also specify that messages at the emergency level are sent to all logged-in users.

security.* /var/log/security
auth.info /var/log/auth.log
*.=debug /var/log/debug.log
*.emerg *

 

For JVM code, the popular Apache log4j logging framework allows the even more detailed specification of what gets logged and where.

 

Its structure is based on loggers (output channels), appenders (mechanisms that can send log messages to a sink, such as a file or a network socket), and layouts (these specify the format of each log message).

 

Log4j is configured through a file, which can be given in XML, JSON, YAML, or Java properties format; the Rundeck workflow and configuration management system, for example, ships with such a log4j configuration file.
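
The Rundeck file itself is not reproduced here; the following is a minimal illustrative log4j 2 configuration in XML format showing a logger, a file appender, and a layout (the logger name and file path are placeholders).

<Configuration>
  <Appenders>
    <File name="service" fileName="logs/service.log">
      <PatternLayout pattern="%d %-5p [%t] %c - %m%n"/>
    </File>
  </Appenders>
  <Loggers>
    <Logger name="com.example.service" level="debug"/>
    <Root level="info">
      <AppenderRef ref="service"/>
    </Root>
  </Loggers>
</Configuration>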

 

By adjusting the level of messages that get logged, you can often get the data needed to debug a problem. Here is an example associated with debugging failing ssh connections. The sshd configuration file is located in /etc/ssh/sshd_config. There, a commented-out line specifies the default log level (INFO).

 

Bumping up the log level to DEBUG

LogLevel DEBUG

results in many more informative messages, one of which clearly indicates the problem’s cause.

Jul 30 12:57:07 prod sshd[5713]: debug1: Could not open authorized keys '/home/jhd/.ssh/authorized_keys': No such file or directory

 

There are several ways to analyze a log record in order to locate a failure’s cause.

  • You can use the system’s GUI event viewer and its searching and filtering facilities.
  • You can open and process a log file in your editor.
  • You can filter, summarize, and select fields using Unix tools.
  • You can monitor the log interactively.
  • You can use a log management application or service, such as the ELK stack (Elasticsearch, Logstash, Kibana) or Splunk.
  • Under Windows, you can use the Windows Events Command Line Utility (wevtutil) to run queries and export logs.

 

You typically want to start by examining the entries near the time when the failure occurred. Alternately, you can search the event log for a string associated with the failure, for example, the name of a failed command. In both cases, you then scan the log back in time looking for errors, warnings, and unexpected entries.

 

For failures lacking a clear manifestation, it’s often useful to repeatedly remove innocuous entries from a log file until the entries containing important information stand out. You can do this within your editor: under Emacs use delete-matching-lines with a regular expression; under vim use :g/regular-expression/d; under Eclipse and Visual Studio find and replace a regular expression that matches the whole line and ends with \n. On the Unix command line, you can pipe the log file through successive grep -v commands.
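
On the command line such a pipeline might look like the following sketch; the patterns are hypothetical examples of entries you have already judged innocuous.

grep -v 'session opened for user' /var/log/auth.log |
grep -v 'session closed for user' |
grep -v 'CRON' |
less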

 

Things to Remember

  • Begin the investigation of a failing application by examining its log files.
  • Increase an application’s logging verbosity to record the reason for its failure.

 

  • Configure and filter log files to narrow down the problem.

Profile the Operation of Systems and Processes

When debugging performance problems, your first (and often the only) port of call is a profile of the system’s operation. This will analyze the system’s resource utilization and thereby point to a part that is misbehaving or needs to be optimized.

 

Start by obtaining a high-level overview. Two process-viewing tools that will also give you a system’s CPU and memory utilization are the top command on Unix systems and the Task Manager on Windows.

 

On a misbehaving system, a high level of CPU utilization (say, 90% on a single core CPU) tells you that you must concentrate your analysis on processing, whereas a low utilization (say, 10%, again on a single core CPU) points to delays that may be occurring due to input/output (I/O) operations.

 

Note that multi-core computers typically report the load over all CPU cores, so if you’re dealing with one single-threaded process, divide the thresholds I gave by the total number of available CPU cores. For example, on an eight-core system a single process occupying 100% of one CPU core while the system is otherwise idle will make the overall CPU utilization appear as just 12.5% (100% / 8).

 

Also look at the system’s physical memory utilization. A high (near 100%) utilization may cause errors due to failed memory allocations or a drop in the system’s performance due to virtual memory paging.

 

When looking at the amount of free memory, keep in mind that Linux systems aggressively use almost all available memory as a buffer cache. Therefore, on Linux systems add the memory listed as buffers to the amount of memory you consider to be free.
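
On Linux, the free command makes this explicit; in recent versions its available column already accounts for memory that the buffer cache can give back.

free -h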

 

For systems designed to operate normally near their maximum capacity, you need to look beyond utilization (which will be close to 100%), and examine saturation.

 

This is a measure of the demands placed on a resource above the level it can service. For this, you will use the same tools but focus on measures indicating the saturation of each resource.

  • For CPUs, look at a load higher than the number of cores on Unix systems, and at the Performance Monitor’s System – Processor Queue Length counter on Windows systems.
  • For memory, look at the rate at which virtual memory pages are written out to disk.
  • For network I/O, look for dropped packets and retransmissions.
  • For storage I/O, look at the request queue length and operation latency.
  • For all of the above measures, levels of saturation consistently appearing above 100% (continuously or in bursts) are typically a problem; the vmstat sketch after this list shows one way to observe some of these on Unix.
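
As a sketch of the Unix side, vmstat reports the run-queue length (the r column) and the rate at which pages are swapped out to disk (the so column); the one-second interval and five samples below are arbitrary choices.

vmstat 1 5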

 

Having obtained an overview of what’s gumming up your system’s performance, drill down toward the process burning many CPU cycles, causing excessive I/O, experiencing high I/O latency, or using a lot of memory. 

 

If the problem is a high CPU load, look at the running processes. Order the processes by their CPU usage to find the culprit taking up most of the CPU time. In Figure 7.1, this is the process named CPU-hog.

 

If the problem is high memory utilization, look at the running processes ordered by the working set (or resident) memory size. This is the amount of physical (rather than virtual) memory used.
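
With GNU ps, as found on Linux systems, you can obtain both orderings directly; limiting the output to ten lines with head is an arbitrary choice.

# Processes ordered by CPU usage
ps -eo pid,comm,%cpu --sort=-%cpu | head

# Processes ordered by resident set size (physical memory)
ps -eo pid,comm,rss --sort=-rss | head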

 

Investigate the possibility of a high I/O load or high I/O latency by using tools such as iostat, netstat, nfsstat, or vmstat on Unix or the Performance Monitor on Windows (run perfmon). Look both at the volume of the disk and network data and at the corresponding number of I/O operations because either of these can be a bottleneck.

 

Once you’ve isolated the type of load that’s causing you a problem, use pidstat on Unix or the Windows Task Manager to pinpoint the culprit process. Then trace the individual process's system calls to further understand its behavior.
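
For instance, assuming the Linux sysstat tools are installed, the following commands sample per-device and per-process activity once a second.

iostat -x 1     # per-device utilization, queue length, and latency
pidstat -d 1    # per-process disk reads and writes
pidstat -u 1    # per-process CPU usage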

 

For cases of high CPU or memory utilization, you should continue by profiling the behavior of the culprit process you’ve identified. There’s no shortage of techniques for monitoring a program’s behavior.

 

If you care about CPU utilization, you can run your program under a statistical profiler that will interrupt its operation many times every second and note where the program spends most of its time.

 

Alternately, you can arrange for the compiler or runtime system to plant code setting a time counter at the beginning and end of each (non-inlined) function, and thus create a graph-based profile of the program’s execution.

 

This allows you to attribute the activity of each function to its parents, and thereby untangle performance problems associated with complex call paths. The corresponding GCC option is -pg, and the tool you use to view the resulting data is gprof.
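
A typical gprof session, assuming a single-file program prog.c, looks like this.

gcc -pg -o prog prog.c    # instrument the program for profiling
./prog                    # running it writes gmon.out
gprof prog gmon.out       # display the call-graph profile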

 

In extreme cases, you can even have the compiler instrument each basic block of your code with a counter, so that you can see how many times each line got executed. The corresponding GCC options are -fprofile-arcs and -ftest-coverage, and the tool to annotate the code is gcov.
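
The corresponding gcov session, again assuming a single-file program prog.c, might look as follows.

gcc -fprofile-arcs -ftest-coverage -o prog prog.c   # compiling also writes prog.gcno
./prog                                              # running writes prog.gcda
gcov prog.c                                         # writes prog.c.gcov with per-line execution counts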

 

Other profiling options you can use include the Eclipse and NetBeans profiler plugins and the stand-alone VisualVM, JProfiler, and Java Mission Control systems for Java programs, as well as the CLR profiler for .NET code.

 

Memory utilization monitors typically modify the runtime system’s memory allocator to keep track of your allocations. Valgrind under Unix and, again, VisualVM and Java Mission Control are useful tools in this category. Aspect Oriented Programming tools and frameworks, such as AspectJ and Spring AOP, allow you to orchestrate your own custom monitoring.

 

At an even lower level, you can monitor the CPU’s performance counters with tools such as perf, OProfile, or perfmon2 to look for cache misses, missed branch predictions, or instruction fetch stalls.
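
On Linux, for example, perf can report several such counters directly; the event names below are standard ones, and ./prog stands for the program you are measuring.

perf stat -e cycles,instructions,cache-misses,branch-misses ./prog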

 

Things to Remember

  • Analyze performance issues by looking at the levels of CPU, I/O, and memory utilization and saturation.
  • Narrow down on the code associated with a performance problem by profiling a process’s CPU and memory usage.

 

Trace the Code’s Execution

Monitoring and tracing tools and facilities allow you to derive log-like data from the execution of arbitrary programs. This approach offers you a number of advantages over application-level logging.

  • You can obtain data even if the application you’re debugging lacks logging facilities.
  • You don’t need to prepare a debug version of the software, which may obfuscate or hide the original problem.
  • Compared to the use of a GUI debugger, it’s lightweight, which allows you to use it in a bare-bones production environment.

 

When you try to locate a bug, an approach you often use involves either inserting logging statements in key locations of your program or running the code under a debugger, which allows you to dynamically insert breakpoint instructions.

 

Nowadays, however, performance problems and many bugs involve the use of third-party libraries or interactions with the operating system.

 

One way to resolve these issues is to look at the calls from your code to that other component. By examining the timestamp of each call or looking for an abnormally large number of calls, you can pinpoint performance problems.

 

The arguments to a function can also often reveal a bug. Call-tracing tools include ltrace (traces library calls); strace, ktrace, and truss (these trace operating system calls) under Unix; JProfiler for Java programs; and Process Monitor under Windows (this traces DLL calls, which involve both operating system and third-party library interfaces).
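
Typical invocations of the Unix tracing tools look like this; ./prog stands for the program you are debugging.

strace -f -tt -o prog.strace ./prog   # system calls of the program and its children, with timestamps
ltrace -c ./prog                      # summary of library calls, with counts and times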

 

These tools typically work by using special APIs or code-patching techniques to hook themselves between your program and its external interfaces.

 

In one case, tracing revealed an excessive number of calls to tellg. With a compact and reliable way to reproduce the problem, it was easy to write a shim class that would independently calculate and cache the file’s offset, eliminating the calls to tellg.

 

Processing the output of strace with Unix tools immensely increases your debugging power. Consider the case where a program fails by complaining about an erroneous configuration entry.

 

However, you can’t find the offending string in any of its tens of configuration files. The following Bash command will show you which of the files opened by the program prog contains the offending string, say, xyzzy.

 

It works by sending the output of strace into a pipeline that isolates the names of files passed to the open system call (sed), removes the file-names associated with devices (egrep -v), keeps a unique copy of each filename (sort -u), and looks for the string xyzzy within those files (fgrep).
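
The command itself is not reproduced above; based on that description, a reconstruction might look like the following sketch. The sed pattern and the device-filtering expression are assumptions, and on modern Linux systems the calls may show up as openat rather than open.

strace -f -e trace=open prog 2>&1 >/dev/null |
# Isolate the names of files passed to the open system call
sed -n 's/^.*open("\([^"]*\)".*$/\1/p' |
# Remove the filenames associated with devices
egrep -v '^/dev/' |
# Keep a unique copy of each filename
sort -u |
# List the files containing the string xyzzy
xargs fgrep -l xyzzy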

 

Looking at the system calls of Java and X Window System programs can be irritating because these programs issue a large number of calls associated with the runtime framework. These calls can obscure what the program actually does. Thankfully, you can filter out such system calls with the strace -e option.

 

Note that you can also trace an already running program by attaching the tracing tool to it. The command-line tools offer the -p option, whereas the GUI tools allow you to click on the process you want to trace.

 

System and library call tracing is not the only game in town. Most interpreted languages offer an option to trace a program’s execution. Here are the incantations for tracking code written in some popular scripting languages.

Perl: perl -d:Trace
Python: python -m trace --trace
Ruby: ruby -r tracer
Unix shell: sh -x, bash -x, csh -x, etc.

 

Other ways to monitor a program’s operation include the JavaScript tracing backend spy-js, network packet monitoring, and the logging of an application’s SQL statements via the database server. For example, the following SQL statements turn on this logging for MySQL.

set global log_output = 'FILE';
set global general_log_file = '/tmp/mysql.log';
set global general_log = 1;

 

Most of the tools referred to so far have been around for ages and can be valuable for solving a problem, once you’ve located its approximate cause.

 

They also have a number of drawbacks: they often require you to take special actions to monitor your code, they can decrease the performance of your system, their interfaces are idiosyncratic and incompatible with each other, each one shows only a small part of the overall picture, and sometimes important details are simply missing.

 

A tool that addresses this shortcoming is DTrace, a dynamic tracing framework developed originally by Sun that provides a uniform mechanism for monitoring comprehensively and unobtrusively the operating system, application servers, runtime environments, libraries, and application programs. It is currently available on Solaris, OS X, FreeBSD, and NetBSD. On Linux, SystemTap and LTTng offer similar facilities.

 

Unsurprisingly, DTrace, a gold winner in The Wall Street Journal’s Technology Innovation Awards contest, is not a summer holiday hack. The three Sun engineers behind it worked for a number of years to develop mechanisms for safely instrumenting all operating system kernel functions, any dynamically linked library, any application program function or specific CPU instruction, and the Java virtual machine.

 

They also developed a safe interpreted language that you can use to write sophisticated tracing scripts without damaging the operating system’s functioning, and aggregating functions that can summarize traced data in a scalable way without excessive memory overhead.

 

DTrace integrates technologies and wizardry from most existing tracing tools and some notable interpreted languages to provide an all-encompassing platform for program tracing.

 

You typically use the DTrace framework through the dtrace command-line tool. You feed the DTrace tool with scripts you write in a domain-specific language named D (not related to the general-purpose language with the same name).

 

When you run dtrace with your script, it installs the traces you’ve specified, executes your program, and prints its results. D programs can be very simple: they consist of pattern/action pairs like those found in the awk and sed tools and many declarative languages.

 

A pattern (called a predicate in the DTrace terminology) specifies a probe—an event you want to monitor. DTrace comes with thousands of pre-defined probes (49,979 on an early version of Solaris and 177,398 on the OS X El Capitan system I tried it on).

 

In addition, system programs (such as application servers and runtime environments) can define their own probes, and you can also set a probe anywhere you want in a program or in a dynamically linked library. For example, the command

 

dtrace -n 'syscall:::entry'

will install a probe at the entry point of all operating system calls, and the (default) action will be to print the name of each system call executed and the process-id of the calling process. You can combine predicates and other variables together using Boolean operators to specify more complex tracing conditions.

 

The name syscall in the previous invocation specifies a provider—a module providing some probes. Predictably, the syscall provider provides probes for tracing operating system calls; 500 system calls on my system. The name syscall::open:entry designates one of these probes—the entry point to the open system call.

 

DTrace contains tens of providers, giving access to statistical profiling, all kernel functions, locks, system calls, device drivers, input and output events, process creation and termination, the network stack’s management information bases (MIBs), the scheduler, virtual memory operations, user program functions, and arbitrary code locations, synchronization primitives, kernel statistics, and Java virtual machine operations. Here are the commands you can use to find the available providers and probes.

 

# List all available probes
dtrace -l

# List system call probes
dtrace -l -P syscall

# List the arguments to the read system call probe
dtrace -lv -f syscall::read

 

Together with each predicate, you can define an action. This action specifies what DTrace will do when a predicate’s condition is satisfied. For example, the following command

dtrace -n 'syscall::open:entry {trace(copyinstr(arg0));}'

will list the name of each opened file.

 

Actions can be arbitrarily complex: they can set global or thread-local variables, store data in associative arrays, and aggregate data with functions such as count, min, max, avg, and quantize.

 

For instance, the following program will summarize the number of times each process gets executed over the lifetime of the DTrace invocation.

proc:::exec-success { @proc[execname] = count(); }

 

By tallying functions that acquire resources and those that release them, you can easily debug leaks of arbitrary resources. In typical use, DTrace scripts span the space from one-liners, such as the preceding ones, to tens of lines containing multiple predicate action pairs.
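
As a sketch of the idea, the following one-liner counts the calls to malloc and free made by a command launched under DTrace; the use of the pid provider and the placeholder command ./prog are assumptions about your setup.

dtrace -n 'pid$target::malloc:entry { @["malloc"] = count(); } pid$target::free:entry { @["free"] = count(); }' -c ./prog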

 

If your code runs on a JVM, another tool you might find useful for tracking its behavior is Byteman. 

 

This can inject Java code into the methods of your application or of the runtime system, without requiring you to recompile the code. You specify when and how the original Java code is transformed through a clear and simple scripting language.
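
A Byteman rule is a short text script; the following sketch, with a hypothetical class and method, prints a trace line every time a connection is requested.

RULE trace connection acquisition
CLASS com.example.ConnectionPool
METHOD getConnection
AT ENTRY
IF true
DO traceln("getConnection called on " + $0)
ENDRULE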

 

The advantages of using Byteman over adding logging code by hand are threefold. First, you don’t need access to the source code, which allows you to trace third-party code as well as yours.

 

Also, you can inject faults and other similar conditions in order to verify how your code responds to them. Finally, you can write Byteman scripts that will fail a test case if the application’s internal state diverges from the expected norm.

 

On the Windows ecosystem, similar functionality is provided by the Windows Performance Toolkit, which is distributed as part of the Windows Assessment and Deployment Kit.

 

The system has a recording component, the Windows Performance Recorder, which you run on the system facing performance problems to trace events you consider important, and the Windows Performance Analyzer, which, in true Windows fashion, offers you a nifty GUI to graph results and operate on tables.

 

Things to Remember

  • System and library call tracing allows you to monitor the behavior of programs without access to their source code.
  • Learn how to use the Windows Performance Toolkit (Windows), SystemTap (Linux), or DTrace (OS X, Solaris, FreeBSD).

 

Use Dynamic Program Analysis Tools

A number of specialized tools can instrument your compiled program with check routines, monitor its execution, and report detected cases of probable errors. This type of checking is termed dynamic analysis because it is carried out at runtime.

 

The corresponding checks complement the techniques discussed in 51: “Use Static Program Analysis,” such as writing "use strict"; in JavaScript and use strict; use warnings; in Perl code, which enable both static and dynamic checks.

 

Compared to static analysis tools, dynamic tools have an easier job in detecting errors that actually occur because, rather than having to deduce what code would be executed (as is the case with static analysis tools), they can trace the code as it is being executed. This means that when a dynamic analysis tool indicates an error, it is highly unlikely that this is a false positive.

 

On the other hand, a dynamic analysis tool will only look at the code that’s actually being executed. Therefore, it can miss faults that are located in code paths that aren’t exercised, resulting in a potentially large number of false negatives.

 

Because dynamic program analysis tools often dramatically slow down a program’s execution and can report a slew of low-priority errors, when debugging it’s best to employ such a tool with a very specific test script that demonstrates the exact problem you’re debugging.

 

Alternately, as a code hygiene maintenance method, you can run the code being analyzed with a realistic and complete test scenario. Through this process, you can whitelist all reported errors so that you can easily catch any new ones that appear when you introduce changes.

 

Many dynamic analysis tools offer facilities to detect the use of uninitialized values, memory leaks, and accesses beyond the boundaries of available memory space.

 

Other tools can catch security vulnerabilities, suboptimal code, incomplete code coverage (this indicates gaps in your testing), implicit type conversions, dynamic typing inconsistencies, and numeric overflows.

 

You can also read how you can use dynamic analysis tools to catch concurrency errors in 62: “Uncover Deadlocks and Race Conditions with Specialized Tools.” Wikipedia’s page on dynamic program analysis lists tens of tools; choose those that match your environment, problem, and budget.

 

A widely used open-source dynamic analysis system is the Valgrind tool suite, which contains a powerful memory-checking component. Consider a program that crams into three lines of code a memory leak, an illegal memory access, and the return of an uninitialized value.
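
The original listing is not reproduced here; a minimal C++ sketch containing those three defects might be the following (the file name defects.cpp is a placeholder).

int main()
{
    int *a = new int[5];    // allocated but never deleted: memory leak
    a[5] = 1;               // write one element past the end: illegal memory access
    return a[0];            // a[0] was never assigned: return of an uninitialized value
}

Compiling it with debug information and running it under Valgrind’s memcheck tool should flag each of the three issues.

g++ -g -o defects defects.cpp
valgrind --leak-check=full ./defects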

 

Another interesting tool is the Jalangi dynamic analysis framework for client- and server-side JavaScript. This transforms your JavaScript code into a form that exposes the code’s execution through an API.

 

You can then write verification scripts that get triggered when specific things happen, such as the evaluation of a binary arithmetic operation. You can use such scripts to pinpoint various problems in JavaScript code. 
