Robust C++: P and V Considered Harmful
Greg Utas - 15/May/2020
Cleaving the Gordian knot of thread safety
Preemptive and priority scheduling create numerous critical regions that applications must protect with mutexes, which have nothing to do with implementing a product specification and all too often lead to errors that are difficult to reproduce and debug. This article presents a Thread class that supports cooperative and proportional scheduling, which can dramatically reduce the number of mutexes required in applications.
Race conditions are an unbecoming source of complexity in many systems. Developers must constantly anticipate them and protect against them—and when they sneak into released software, try to reproduce them and hunt them down. The emergent cesspool of locks, semaphores, and their variants is sheer artificial complexity that has nothing to do with the specifications that the software seeks to implement. And in many cases, most of this complexity is actually avoidable.
The techniques described in this article apply when designing a new system, but even a legacy system may be able to adopt them with relative ease. Nevertheless, you may find what follows to be rather iconoclastic, so be prepared to think outside the box. But the approaches outlined in this article have been successfully used in large servers, including the core network servers that handle the calls in AT&T's wireless network. Those servers run on a proprietary operating system, but this article describes how to implement the same concepts on commercially available operating systems such as Windows and Linux. If the concepts are a fit for your system, you will improve its quality and your productivity by averting bugs that occur too easily and that can be difficult to pinpoint.
Many of the concepts in this article were covered in my book Robust Communications Software, which includes this compelling statement from Dennis DeBruler:
As a system integrator of code being developed by several programmers, it seems that my job consists of finding critical region bugs. Most critical region bugs can be found only by subjecting the system to load testing. Because of this need to do load testing, we have the additional challenge of finding and fixing the most difficult bugs late in the development cycle. Also, testing becomes a matter of peeling the onion as you find one unguarded critical region after another. Each one smaller and harder to find. Each one closer to the deadline—or further behind the deadline. Thus we have the joy that the worse the bug, the more layers of management are asking, "Have you found the bug yet?" The really small critical region bugs make it out to the field and become the reason for the "once every month or so in one of the hundred customer sites" bug. These are very hard to even recreate in the lab, let alone find.
Unless dealing with these kinds of challenges fills you with delight, read on.
Background
It is assumed that the reader is familiar with the issue of thread safety and how to achieve it with critical region protection mechanisms such as locks and semaphores.
This article's title is a little cheeky. Two essential contributions by the late Edsger Dijkstra were Go To Statement Considered Harmful, whose thesis is now accepted virtually without question, and Cooperating Sequential Processes, which introduced the semaphore operations P (wait) and V (signal) to protect critical regions.
Dijkstra argued that goto should be eliminated from high-level programming languages but allowed for it in machine code. So although goto is indispensable, only compilers and people programming in assembler should use it. This article makes a similar argument about semaphores: although indispensable, they should mostly be confined to the internal workings of a base Thread class rather than used throughout application code. Just as the need for goto in application software points to a flawed programming language, the need for semaphores throughout application software points to a flawed threading/scheduling framework.
How We Got In…
It is deplorable that preemptive and priority scheduling have become the de facto standards in virtually all of today's operating systems. Preemptive scheduling arose in timesharing systems so that one CPU could serve multiple users. Some users were deemed more important than others, so they received higher priority. Priorities later proved useful in hard real-time systems, where not doing something on time can have severe consequences.
But in many systems, preemptive and priority scheduling are inappropriate. Your system is a unified whole, not the disparate software of users competing for time, so why should your threads be scheduled in and out at random? And presumably your system doesn't include software that enjoys being starved by higher priority work. Such software couldn't be very important, so why would it be there?
…And How to Get Out
Both scheduling disciplines—preemptive and priority scheduling—force applications to protect critical regions at a granular level. Within the same priority, context switches occur haphazardly, and they also occur as soon as a thread of higher priority is ready to run.
To reduce the number of critical regions, you need to control context switching yourself. In many cases, a thread has a limited amount of work to do when it is ready to run, so let it run that work to completion. That is, let it run unpreemptably. When it has finished its work, it sleeps, and the next thread runs. Or if it has many things to do, it does some of them and then yields, returning to the end of the ready queue to give other threads a chance to run. This is known as cooperative scheduling, and it significantly reduces the number of critical sections by completing one logical unit of work before starting the next one. If each application thread returns to its work loop before sleeping, yielding, or handling the next work item, there can be no race conditions between them.
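The work-loop pattern can be sketched as follows. This is illustrative only; CooperativeWorker and its members are hypothetical names, not code from RSC:

```cpp
#include <deque>
#include <functional>

// Hypothetical sketch of a cooperative work loop. Each call to RunOneUnit
// completes one logical unit of work before returning, after which the
// thread would yield or sleep. Shared data is therefore only ever observed
// between complete units of work, eliminating most critical regions.
class CooperativeWorker
{
public:
   void Enqueue(std::function<void()> work)
   {
      queue_.push_back(std::move(work));
   }

   // Runs one unit of work to completion; returns false when idle.
   bool RunOneUnit()
   {
      if(queue_.empty()) return false;
      auto work = std::move(queue_.front());
      queue_.pop_front();
      work();       // runs unpreemptably in this model
      return true;  // the caller now yields before the next unit
   }
private:
   std::deque<std::function<void()>> queue_;
};
```

A thread built on this pattern would call RunOneUnit in a loop, yielding after each unit (or after a batch) so that other threads get their turn.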
Naturally, if it were that simple, everyone would be doing it. But it requires a suitable base Thread class. And it does give rise to some issues. So it's time to dive into a bit of code.
Using the Code
The code that we will look at comes from the Robust Services Core (RSC), an open-source framework for developing robust C++ applications. Specifically, the code comes from its Thread class, a different facet of which is discussed in Robust C++: Safety Net. If you don't want to use RSC, you can copy and modify its code to meet your needs, subject to the terms of its GPL-3.0 license.
Walkthroughs
The code has been edited to remove aspects that would distract from the central point of this article. If you look at its full version, you will encounter these things, which are summarized in Deleted Code.
Creating a Thread
The creation of a thread is described here. Each Thread has a SysThread member, which encapsulates a native thread as part of RSC's operating system abstraction layer. A Thread subclass adds its own data, so there is no need for thread_local in RSC. Even the thread created to run main wraps itself as RootThread, which also derives from Thread.
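A minimal sketch of this wrapping, using std::thread in place of a native thread (ThreadLike and CounterThread are illustrative names, not RSC's):

```cpp
#include <memory>
#include <thread>

// Hypothetical sketch: a base class that owns its native thread, with
// subclasses adding their own data as ordinary members. Because each
// thread object carries its own state, no thread_local storage is needed.
class ThreadLike
{
public:
   virtual ~ThreadLike() = default;
   void Start()
   {
      systhrd_ = std::make_unique<std::thread>([this]{ Entry(); });
   }
   void Join()
   {
      if(systhrd_ && systhrd_->joinable()) systhrd_->join();
   }
protected:
   virtual void Entry() = 0;  // the subclass's entry function
private:
   std::unique_ptr<std::thread> systhrd_;  // stands in for SysThread
};

class CounterThread : public ThreadLike
{
public:
   int Count() const { return count_; }
protected:
   void Entry() override
   {
      for(int i = 0; i < 5; ++i) ++count_;
   }
private:
   int count_ = 0;  // per-thread data as a member, not thread_local
};
```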
Entering a Thread
Two design principles eliminate race conditions, and hence the need for semaphores, in application software:
- All application threads run unpreemptably (also referred to as running locked). Before a thread starts to run, it must become the active thread. This allows it to run until it finishes one or more logical units of work. At that point, the next thread gets its turn, and so on. The active thread is tracked by the variable ActiveThread_, of type std::atomic<Thread*>, in Thread.cpp.
- All application threads run at the same priority. Strictly speaking, this isn't necessary when each of them must become the active thread in order to run. But because it's a consequence of the design, we'll be honest about what's going on and give each unpreemptable application thread the same priority.
Code:
main_t Thread::EnterThread(void* arg)
{
// Our argument (self) is a pointer to a Thread.
//
auto self = static_cast< Thread* >(arg);
// Indicate that we're ready to run. This blocks until we're signaled
// to proceed. At that point, record that we're resuming execution,
// register to catch signals, and invoke our entry function.
//
self->Ready();
self->Resume();
RegisterForSignals();
return self->Start();
}
void Thread::Ready()
{
// If no thread is currently active, wake InitThread to schedule this
// thread in. Regardless, the thread waits to be signaled before it runs.
//
priv_->readyTime_ = TimePoint::Now();
priv_->waiting_ = true;
if(ActiveThread() == nullptr)
{
Singleton< InitThread >::Instance()->Interrupt(InitThread::ScheduleMask);
}
systhrd_->Wait();
priv_->waiting_ = false;
priv_->currStart_ = TimePoint::Now();
}
void Thread::Resume()
{
// Set the time before which the thread must schedule itself out.
//
auto time = InitialTime() << ThreadAdmin::WarpFactor();
priv_->currEnd_ = priv_->currStart_ + time;
}
Note that Resume records a time by which the thread must yield or otherwise cause itself to be scheduled out. A warp factor is applied to this time if, for example, the thread's function calls are being recorded, which will cause it to run much slower. We will return to this point later.
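RSC's Duration type supports the left shift directly; with std::chrono the same arithmetic would look like the sketch below. The 20 ms budget and warp factor of 2 are made-up values for illustration:

```cpp
#include <chrono>

// Illustrative arithmetic for the warp factor: shifting the run-time
// budget left by the warp factor multiplies it by a power of two, giving
// a traced (and therefore slower) thread proportionally more time.
std::chrono::milliseconds WarpedBudget
   (std::chrono::milliseconds budget, int warpFactor)
{
   return budget * (1 << warpFactor);  // e.g., 20ms with warp 2 -> 80ms
}
```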
A Thread Yields
A thread invokes Pause to schedule itself out after it has completed one or more logical units of work:
Code:
DelayRc Thread::Pause(Duration time)
{
auto drc = DelayError;
auto thr = RunningThread();
EnterBlockingOperation(BlockedOnClock);
{
drc = thr->systhrd_->Delay(time);
}
ExitBlockingOperation();
return drc;
}
If time is TIMEOUT_NEVER, the thread sleeps until another thread invokes Thread::Interrupt to wake it up. If time is finite, the thread (unless interrupted) sleeps until its specified timeout occurs, after which it must again become the active thread before it can run.
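On a standard C++ platform, the Delay/Interrupt pairing described above can be approximated with a condition variable. Sleeper is a hypothetical name; this is a sketch, not RSC's actual SysThread implementation:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of an interruptible delay: Delay sleeps until its
// timeout expires or another thread calls Interrupt, mirroring the
// behavior of SysThread::Delay and Thread::Interrupt described above.
class Sleeper
{
public:
   // Returns true if interrupted, false if the timeout expired.
   bool Delay(std::chrono::milliseconds time)
   {
      std::unique_lock<std::mutex> lock(mutex_);
      bool interrupted = cond_.wait_for(lock, time, [this]{ return signaled_; });
      signaled_ = false;  // consume the signal
      return interrupted;
   }

   // Wakes a thread blocked in Delay.
   void Interrupt()
   {
      std::lock_guard<std::mutex> lock(mutex_);
      signaled_ = true;
      cond_.notify_one();
   }
private:
   std::mutex mutex_;
   std::condition_variable cond_;
   bool signaled_ = false;
};
```

A TIMEOUT_NEVER sleep would use wait instead of wait_for, so the thread wakes only when Interrupt is invoked.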
Pause invokes EnterBlockingOperation and ExitBlockingOperation because a thread must allow another thread to become active whenever it stops running, and become the active thread again before it resumes execution. BlockedOnClock is defined in the following enum:
Code:
enum BlockingReason
{
NotBlocked, // running or ready to run
BlockedOnClock, // SysThread::Delay (non-zero time)
BlockedOnNetwork, // SysUdpSocket::Recvfrom or SysTcpSocket::Poll
BlockedOnConsole, // CinThread::GetLine (console)
BlockedOnDatabase, // in-memory database
BlockingReason_N // number of reasons
};
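Because every blocking call must be bracketed by EnterBlockingOperation and ExitBlockingOperation, one defensive option (a sketch under assumed names; FakeThread stands in for Thread, and RSC may handle this differently) is an RAII guard, so the exit path cannot be forgotten even if the blocking call throws:

```cpp
// Hypothetical sketch: pairing EnterBlockingOperation and
// ExitBlockingOperation with an RAII guard. The guard's destructor
// guarantees the exit call on every path out of the scope.
enum BlockingReason
{
   NotBlocked,     // running or ready to run
   BlockedOnClock  // sleeping for a finite or infinite time
};

struct FakeThread
{
   BlockingReason reason = NotBlocked;
   void EnterBlockingOperation(BlockingReason why) { reason = why; }
   void ExitBlockingOperation() { reason = NotBlocked; }
};

class BlockingGuard
{
public:
   BlockingGuard(FakeThread& thr, BlockingReason why) : thr_(thr)
   {
      thr_.EnterBlockingOperation(why);
   }
   ~BlockingGuard() { thr_.ExitBlockingOperation(); }
private:
   FakeThread& thr_;
};
```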