Debugging Live Systems
Greg Utas - 16/May/2020
Greg Utas - 16/May/2020
[SHOWTOGROUPS=4,20]
No breakpoints or drooling all over the console!
This article presents logging and trace tools that can be used in live systems.
These tools gather information to help identify and reproduce bugs. One of the tools captures function calls and displays them in an indented "Who invoked who?" format. A new tool is implemented by deriving from TraceRecord, which allows a common framework to manage all of a system's trace tools.
During development, setting breakpoints and writing information to the console are probably the two most commonly used debugging techniques. But once software is released and running on a customer's site, these techniques are no longer acceptable. This article describes alternatives that can be used to debug live systems and that are also useful during development. They actually need to be sufficient during development, because they will be the only things available in a live system.
When released software contains a bug, the goal is to reproduce it and develop a fix offline. Sometimes the customer can tell you how to reproduce the problem. But when the customer can't, you need to gather enough data to figure out how to recreate it. There may be times when the customer doesn't even know there is a bug, but your software does. It should then collect the necessary data on its own.
Tracepoint debugging is sometimes available in a live system. Many debuggers support tracepoints as well as breakpoints. When the debugger hits a tracepoint, it collects information and continues without halting. Most often, you tell it to collect the contents of certain data structures. Because this technique is closely related to breakpoint debugging, this article does not discuss it in further detail.
Background
The code in this article is taken from the Robust Services Core (RSC), an open-source framework for building robust C++ applications. RSC is targeted at Windows but has an abstraction layer that should allow it to be ported to other platforms with modest effort. The Windows targets (in *.win.cpp files) currently comprise less than 3K lines of code.
If you develop your application on top of RSC, everything in this article comes for free. If not, you can copy and modify its source code to meet your needs, subject to the terms of its GPL-3.0 license.
Using the Code
Most of the classes in this article are implemented in RSC's Для просмотра ссылки Войдиили Зарегистрируйся directory are defined and implemented in a .h and .cpp of the same name. It should therefore be easy for you to find their full versions.
Walkthroughs
Software Error Log
RSC produces a software error log when Thread::Start catches an exception. This function is inherited by all threads that derive from RSC's Thread class. Exceptions have significant overhead, so applications should throw them very sparingly and catch them even less often. So in the vast majority of cases, an exception represents a bug that needs to be fixed.
When Thread::Start catches an exception, the log that it produces includes a stack trace, as described in Capturing a Thread's Stack When An Exception Occurs. All exceptions in RSC derive from Exception, whose constructor captures the stack when the exception is thrown. This allows Thread::LogTrap to include a stack trace when it generates its log. This log, for example, occurs when using a bad pointer:
The stack must be captured when the exception is thrown, not when it is caught. By the time it's caught, the stack has unwound (in this case all the way to Thread::Start), so there is no longer anything useful to capture. This is why Exception's constructor captures the stack. Contrast this with exceptions thrown in the STL, where all you get is a string like "invalid string position", which says nothing about who caused the problem. The stack trace is vital because it not only identifies where the bug occurred (e.g., in a function that didn't check for a nullptr argument), but also how that code was reached (i.e., from a function that actually provided the bad argument).
RSC handles these logs in the same way as all others. It puts them in a log buffer that is soon written to a file and, optionally, to the console. In released software, users can collect these logs and send them to you. Better still, your software can include code to automatically send them to you over the internet. Each of these logs highlights a bug that needs to be fixed.
Capturing a stack is platform specific; the Windows targets are in SysThreadStack.win.cpp. The above log only captured the chain of function calls that led to the exception. A useful enhancement is to also capture each stack frame and the CPU's registers. Within the log, these simply appear as hex values, but an offline tool can then tie into the compiler's object files and interpret the raw hex so that the values of each function's arguments and local variables can be studied at the time of the exception. But this is more platform-specific code.
Software Warning Log
Unlike the software error log, which is produced after a problem has occurred, a software warning log is used when a problem is detected proactively. Application software generates these logs, most often when it receives invalid input. It invokes Debug::SwLog to generate them, and this function has two versions:
The default behavior is to include a stack trace (SwWarning), but the application can instead choose to generate a simple log without a stack trace (SwInfo) or to actually throw RSC's SoftwareException (SwError). SwInfo is used infrequently because the stack trace is usually wanted for debugging, and SwError is rarely used because the function should prefer to return a value which indicates that it failed.
Let's look at an example of a software warning log. The framework for RSC's command line interpreter (CLI) defines the base class CliParm, from which all CLI parameters ultimately derive. When a parameter registers with the CLI, it must provide a help string so that the CLI's help command can explain each of the parameters that a CLI command expects. We want to limit the length of these strings so that, after they are prefixed with some other information, they will still fit on a console that is 80 characters wide. CliParm's constructor therefore looks like this:
Debug::Assert throws an AssertionException if help is not provided. This would actually cause system initialization to fail, so it won't occur in released software: a developer must provide the value before the code can even be checked in. After that, the code generates a warning log if help is empty or too long to fit on one line of the console. So if we edit NbIncrement.cpp by changing
Hide Copy Code
fixed_string LogsListTextExpl =
"shows info for all logs or the logs in a specific group";
to this:
The following log then appears during system initialization:
The log complains that the help string is too long, and the culprit is easily identified as a parameter used by LogsCommand. Looking at the code, 0x3 is the number of characters that need to be trimmed from the string.
The 225K lines of code in RSC's repository include about 700 calls to Debug::SwLog. It should be used often! But as with the software error log, each occurrence of a software warning log highlights a bug that needs to be fixed.
[/SHOWTOGROUPS]
No breakpoints or drooling all over the console!
This article presents logging and trace tools that can be used in live systems.
These tools gather information to help identify and reproduce bugs. One of the tools captures function calls and displays them in an indented "Who invoked who?" format. A new tool is implemented by deriving from TraceRecord, which allows a common framework to manage all of a system's trace tools.
- .../KB/debug/5255828/master.zip
During development, setting breakpoints and writing information to the console are probably the two most commonly used debugging techniques. But once software is released and running on a customer's site, these techniques are no longer acceptable. This article describes alternatives that can be used to debug live systems and that are also useful during development. They actually need to be sufficient during development, because they will be the only things available in a live system.
When released software contains a bug, the goal is to reproduce it and develop a fix offline. Sometimes the customer can tell you how to reproduce the problem. But when the customer can't, you need to gather enough data to figure out how to recreate it. There may be times when the customer doesn't even know there is a bug, but your software does. It should then collect the necessary data on its own.
Tracepoint debugging is sometimes available in a live system. Many debuggers support tracepoints as well as breakpoints. When the debugger hits a tracepoint, it collects information and continues without halting. Most often, you tell it to collect the contents of certain data structures. Because this technique is closely related to breakpoint debugging, this article does not discuss it in further detail.
Background
The code in this article is taken from the Robust Services Core (RSC), an open-source framework for building robust C++ applications. RSC is targeted at Windows but has an abstraction layer that should allow it to be ported to other platforms with modest effort. The Windows targets (in *.win.cpp files) currently comprise less than 3K lines of code.
If you develop your application on top of RSC, everything in this article comes for free. If not, you can copy and modify its source code to meet your needs, subject to the terms of its GPL-3.0 license.
Using the Code
Most of the classes in this article are implemented in RSC's Для просмотра ссылки Войди
Walkthroughs
Software Error Log
RSC produces a software error log when Thread::Start catches an exception. This function is inherited by all threads that derive from RSC's Thread class. Exceptions have significant overhead, so applications should throw them very sparingly and catch them even less often. So in the vast majority of cases, an exception represents a bug that needs to be fixed.
When Thread::Start catches an exception, the log that it produces includes a stack trace, as described in Capturing a Thread's Stack When An Exception Occurs. All exceptions in RSC derive from Exception, whose constructor captures the stack when the exception is thrown. This allows Thread::LogTrap to include a stack trace when it generates its log. This log, for example, occurs when using a bad pointer:
Код:
THR902 2-Aug-2019 09:45:43.183 on Reigi {5}
in NodeTools.RecoveryTestThread (tid=15, nid=0x0000bc60): trap number 2
type=Signal
signal : 11 (SIGSEGV: Invalid Memory Reference)
errval : 0xc0000005
Function Traceback:
NodeBase.SignalException.SignalException @ signalexception.cpp + 40[12]
NodeBase.Thread.HandleSignal @ thread.cpp + 1801[16]
NodeBase.SE_Handler @ systhread.win.cpp + 126[13]
_unDNameEx @ <unknown file> (err=487)
is_exception_typeof @ <unknown file> (err=487)
is_exception_typeof @ <unknown file> (err=487)
is_exception_typeof @ <unknown file> (err=487)
_CxxFrameHandler3 @ <unknown file> (err=487)
RtlInterlockedCompareExchange64 @ <unknown file> (err=487)
RtlInterlockedCompareExchange64 @ <unknown file> (err=487)
KiUserExceptionDispatcher @ <unknown file> (err=487)
NodeTools.RecoveryTestThread.Enter @ ntincrement.cpp + 3584[0]
NodeBase.Thread.Start @ thread.cpp + 2799[15]
NodeBase.Thread.EnterThread @ thread.cpp + 1564[0]
BaseThreadInitThunk @ <unknown file> (err=487)
RtlGetAppContainerNamedObjectPath @ <unknown file> (err=487)
RtlGetAppContainerNamedObjectPath @ <unknown file> (err=487)
The stack must be captured when the exception is thrown, not when it is caught. By the time it's caught, the stack has unwound (in this case all the way to Thread::Start), so there is no longer anything useful to capture. This is why Exception's constructor captures the stack. Contrast this with exceptions thrown in the STL, where all you get is a string like "invalid string position", which says nothing about who caused the problem. The stack trace is vital because it not only identifies where the bug occurred (e.g., in a function that didn't check for a nullptr argument), but also how that code was reached (i.e., from a function that actually provided the bad argument).
RSC handles these logs in the same way as all others. It puts them in a log buffer that is soon written to a file and, optionally, to the console. In released software, users can collect these logs and send them to you. Better still, your software can include code to automatically send them to you over the internet. Each of these logs highlights a bug that needs to be fixed.
Capturing a stack is platform specific; the Windows targets are in SysThreadStack.win.cpp. The above log only captured the chain of function calls that led to the exception. A useful enhancement is to also capture each stack frame and the CPU's registers. Within the log, these simply appear as hex values, but an offline tool can then tie into the compiler's object files and interpret the raw hex so that the values of each function's arguments and local variables can be studied at the time of the exception. But this is more platform-specific code.
Software Warning Log
Unlike the software error log, which is produced after a problem has occurred, a software warning log is used when a problem is detected proactively. Application software generates these logs, most often when it receives invalid input. It invokes Debug::SwLog to generate them, and this function has two versions:
Код:
static void SwLog(fn_name_arg func, debug64_t errval,
debug64_t offset, SwLogLevel level = SwWarning);
static void SwLog(fn_name_arg func, const std::string& errstr,
debug64_t offset, SwLogLevel level = SwWarning);
Here, debug64_t is an alias for uint64_t that helps to track values used in these logs. The second overload is more user-friendly because errstr usually identifies the reason for the log, whereas errval and offset are intended for internal values that will help with debugging. The level argument specifies the type of log to generate:
Hide Copy Code
enum SwLogLevel
{
SwInfo, // a basic debug log
SwWarning, // a log that includes a stack trace
SwError // throws an exception (which includes a stack trace)
};
The default behavior is to include a stack trace (SwWarning), but the application can instead choose to generate a simple log without a stack trace (SwInfo) or to actually throw RSC's SoftwareException (SwError). SwInfo is used infrequently because the stack trace is usually wanted for debugging, and SwError is rarely used because the function should prefer to return a value which indicates that it failed.
Let's look at an example of a software warning log. The framework for RSC's command line interpreter (CLI) defines the base class CliParm, from which all CLI parameters ultimately derive. When a parameter registers with the CLI, it must provide a help string so that the CLI's help command can explain each of the parameters that a CLI command expects. We want to limit the length of these strings so that, after they are prefixed with some other information, they will still fit on a console that is 80 characters wide. CliParm's constructor therefore looks like this:
Код:
CliParm::CliParm(c_string help, bool opt, c_string tag) :
help_(help),
opt_(opt),
tag_(tag)
{
Debug::Assert(help_ != nullptr);
auto size = strlen(help_);
auto total = ParmWidth + strlen(ParmExplPrefix) + size;
if(size == 0)
Debug::SwLog(CliParm_ctor, "help string empty", size);
else if(total >= COUT_LENGTH_MAX)
Debug::SwLog
(CliParm_ctor, "help string too long", total - COUT_LENGTH_MAX + 1);
}
Debug::Assert throws an AssertionException if help is not provided. This would actually cause system initialization to fail, so it won't occur in released software: a developer must provide the value before the code can even be checked in. After that, the code generates a warning log if help is empty or too long to fit on one line of the console. So if we edit NbIncrement.cpp by changing
Hide Copy Code
fixed_string LogsListTextExpl =
"shows info for all logs or the logs in a specific group";
to this:
Код:
fixed_string LogsListTextExpl = "shows information for all logs or the logs in a specific group";
The following log then appears during system initialization:
Код:
SW900 19-Feb-2020 18:08:29.915 on Reigi {1}
in CliParm.ctor
errval=help string too long offset=0x3
Function Traceback:
NodeBase.Debug.GenerateSwLog @ debug.cpp + 102[22]
NodeBase.Debug.SwLog @ debug.cpp + 214[25]
NodeBase.CliParm.CliParm @ cliparm.cpp + 65[45]
NodeBase.CliText.CliText @ clitext.cpp + 42[19]
NodeBase.LogsListText.LogsListText @ nbincrement.cpp + 1435[25]
NodeBase.LogsAction.LogsAction @ nbincrement.cpp + 1560[31]
NodeBase.LogsCommand.LogsCommand @ nbincrement.cpp + 1575[39]
NodeBase.NbIncrement.NbIncrement @ nbincrement.cpp + 3700[33]
NodeBase.Singleton<NodeBase.NbIncrement>.Instance @ singleton.h + 81[34]
NodeBase.NbModule.Startup @ nbmodule.cpp + 149[5]
NodeBase.ModuleRegistry.Startup @ moduleregistry.cpp + 434[19]
NodeBase.ModuleRegistry.Restart @ moduleregistry.cpp + 263[22]
NodeBase.InitThread.InitializeSystem @ initthread.cpp + 354[0]
NodeBase.InitThread.Enter @ initthread.cpp + 225[0]
NodeBase.Thread.Start @ thread.cpp + 3232[15]
NodeBase.Thread.EnterThread @ thread.cpp + 1755[0]
BaseThreadInitThunk @ <unknown file> (err=487)
RtlGetAppContainerNamedObjectPath @ <unknown file> (err=487)
RtlGetAppContainerNamedObjectPath @ <unknown file> (err=487)
The log complains that the help string is too long, and the culprit is easily identified as a parameter used by LogsCommand. Looking at the code, 0x3 is the number of characters that need to be trimmed from the string.
The 225K lines of code in RSC's repository include about 700 calls to Debug::SwLog. It should be used often! But as with the software error log, each occurrence of a software warning log highlights a bug that needs to be fixed.
Последнее редактирование: