There is an extraordinary variety of tools available for this purpose, and more become available daily. Very capable people are selflessly devoting enormous amounts of time and effort to developing these tools. We all owe a tremendous debt to these individuals. But with the variety of tools available, it is easy to be overwhelmed. Fortunately, while the number of tools is large, data collection need not be overwhelming. A small number of tools can be used to solve most problems. This book centers on a core set of freely available tools, with pointers to additional tools that might be needed in some circumstances.
This first chapter has two goals. Although general troubleshooting is not the focus of the book, it seems worthwhile to quickly review troubleshooting techniques. This review is followed by an examination of troubleshooting from a broader administrative context -- using troubleshooting tools in an effective, productive, and responsible manner. This part of the chapter includes a discussion of documentation practices, personnel management and professionalism, legal and ethical concerns, and economic considerations. General troubleshooting is revisited in Chapter 12, "Troubleshooting Strategies", once we have discussed available tools. If you are already familiar with these topics, you may want to skim or even skip this chapter.
Clearly, the best way to approach troubleshooting is to avoid it. If you never have problems, you will have nothing to correct. Sound engineering practices, redundancy, documentation, and training can help. But regardless of how well engineered your system is, things break. You can avoid troubleshooting, but you can't escape it.
It may seem unnecessary to say, but go for the quick fixes first. As long as you don't fixate on them, they won't take long. Often the first thing to try is resetting the system. Many problems can be resolved in this way. Bit rot, cosmic rays, or the alignment of the planets may result in the system entering some strange state from which it can't exit. If the problem really is a fluke, resetting the system may resolve the problem, and you may never see it again. This may not seem very satisfying, but you can take your satisfaction in going home on time instead.
Keep in mind that there are several different levels of resetting a system. For software, you can simply restart the program, or you may be able to send a signal to the program so that it reloads its initialization file. From your users' perspective, this is the least disruptive approach. Alternatively, you might restart the operating system but without cycling the power, i.e., do a warm reboot. Finally, you might try a cold reboot by cycling the power.
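The least disruptive level, signaling a program so that it rereads its initialization file, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: many Unix daemons reload on SIGHUP, but not all do, so check the documentation for the program in question. The `settings` dictionary, the JSON file format, and the function names are hypothetical and exist only for this example.

```python
import json
import os
import signal
import tempfile

# Hypothetical in-memory settings for an illustrative long-running program.
settings = {}

def load_settings(path):
    """Read the (hypothetical) JSON initialization file into `settings`."""
    with open(path) as f:
        new_values = json.load(f)
    settings.clear()
    settings.update(new_values)

def install_reload_handler(path):
    """Program side: reread the file whenever SIGHUP arrives (Unix only)."""
    def handler(signum, frame):
        load_settings(path)
    signal.signal(signal.SIGHUP, handler)

def request_reload(pid):
    """Administrator side: signal the running program instead of restarting it."""
    os.kill(pid, signal.SIGHUP)
```

The program keeps running and keeps its open connections while it picks up the new settings, which is why users rarely notice this kind of reset. From a shell, the administrator's half is usually just `kill -HUP` followed by the process ID.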
You should be aware, however, that there can be some dangers in resetting a system. For example, it is possible to inadvertently make changes to a system so that it can't reboot. If you realize this before shutting down, you can correct the problem. Once you have shut down the system, it may be too late. If you don't have a backup boot disk, you will have to rebuild the system. These are, fortunately, rare circumstances and usually happen only when you have been making major changes to a system.
When making changes to a system, remember that scheduled maintenance may involve restarting a system. You may want to test changes you have made, including their impact on a system reset, prior to such maintenance to ensure that there are no problems. Otherwise, the system may fail when restarted during the scheduled maintenance. If this happens, you will be faced with the difficult task of deciding which of several different changes is causing the problem.
Resetting the system is certainly worth trying once. Doing it more than once is a different matter. With some systems, this becomes a way of life. An operating system that doesn't provide adequate memory protection will frequently become wedged so that rebooting is the only option.[1] Sometimes you may want to limp along resetting the system occasionally rather than dealing with the problem. In a university setting, this might get you through exam week to a time when you can be more relaxed in your efforts to correct the underlying problem. Or, if the system is to be replaced in the near future, the effort may not be justified. Usually, however, when rebooting becomes a way of life, it is time for more decisive action.
[1] Do you know what operating system I'm tactfully not naming?
Swapping components and reinstalling software is often the next thing to try. If you have the spare components, this can often resolve problems immediately. Even if you don't have spares, switching components to see if the problem follows the equipment can be a simple first test. Reinstalling software can be much more problematic. This can often result in configuration errors that will worsen problems. The old, installed version of the software can make getting a new, clean installation impossible. But if the install is simple or you have a clear understanding of exactly how to configure the software, this can be a relatively quick fix.
While these approaches often work, they aren't what we usually think of as troubleshooting. You certainly don't need the tools described in this book to do them. Once you have exhausted the quick solutions, it is time to get serious. First, you must understand the problem, if possible. Problems that are not understood are usually not fixed, just postponed.
One standard admonition is to ask the question "Has anything changed recently?" The overwhelming majority of problems relate to changes to a working system. If you can temporarily change things back and the problem goes away, you have confirmed your diagnosis.
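Answering that question is much easier if you kept a copy of a configuration from when the system worked. As a minimal sketch, assuming you have such a known-good snapshot as text, Python's standard `difflib` module can list exactly what differs. The function name `describe_changes` and the sample contents are made up for illustration.

```python
import difflib

def describe_changes(known_good, current, name="config"):
    """Return unified-diff lines between a known-good snapshot and the current text."""
    return list(difflib.unified_diff(
        known_good.splitlines(),
        current.splitlines(),
        fromfile=name + " (known good)",
        tofile=name + " (current)",
        lineterm="",
    ))

if __name__ == "__main__":
    # Hypothetical before-and-after contents of a small configuration file.
    before = "gateway 192.0.2.1\nnetmask 255.255.255.0\n"
    after = "gateway 192.0.2.254\nnetmask 255.255.255.0\n"
    for line in describe_changes(before, after):
        print(line)
```

An empty result means the two versions are identical, which tells you to look for the change somewhere else.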
Admittedly, this may not help with an installation where everything is new. But even a new installation can and should be grown. Pieces can be installed and tested. New pieces of equipment can then be added incrementally. When this approach is taken, the question of what has changed once again makes sense.
Another admonition is to change only one thing at a time and then to test thoroughly after each change. This is certainly good advice when dealing with routine failures. But this approach will not apply if you are dealing with a system failure. (See the upcoming sidebar on system failures.) Also, if you do find something that you know is wrong but fixing it doesn't fix your problem, do you really want to change it back? In this case, it is often better to make a note of the additional changes you have made and then proceed with your troubleshooting.
A key element of successful debugging is to control the focus of your investigation so that you are really dealing with the problem. You can usually focus better if you can break the problem into pieces. Swapping components, as mentioned previously, is an example of this approach. This technique is known by several names -- problem decomposition, divide and conquer, binary search, and so on. This approach is applicable to all kinds of troubleshooting. For example, when your car won't start, first decide whether you have an electrical or fuel supply problem. Then proceed accordingly. Chapter 12, "Troubleshooting Strategies", outlines a series of specific steps you might want to consider.
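When the suspects are an ordered list of recent changes, the binary search form of this idea can be written down directly. The sketch below is illustrative only and rests on three assumptions: the system works with none of the changes applied, fails with all of them, and stays broken once the offending change is in place. The function `first_bad_change` and the callback `works_with` are hypothetical names; in practice the callback would be you applying a subset of changes and testing the system.

```python
from typing import Callable, Sequence

def first_bad_change(changes: Sequence[str],
                     works_with: Callable[[Sequence[str]], bool]) -> str:
    """Return the first change that breaks the system, halving the search each step."""
    if not changes:
        raise ValueError("no changes to search")
    lo, hi = 0, len(changes)  # works with changes[:lo], fails with changes[:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works_with(changes[:mid]):
            lo = mid          # the culprit comes later
        else:
            hi = mid          # the culprit is among the first `mid` changes
    return changes[hi - 1]
```

With five changes, this needs at most three tests rather than five, and the savings grow quickly as the list gets longer. If several changes interact, the assumptions above no longer hold and the result should be treated as a lead, not a verdict.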
Copyright © 2002 O'Reilly & Associates. All rights reserved.