Introducing Multithreading
Two significant trends of the past decade have had an enormous effect on the field of software development. First, the continued decrease in the cost of performing computations is no longer driven by increases in clock speed and transistor density, as illustrated by Figure 19.1. Rather, the cost of computation is now falling because it has become economical to make hardware containing multiple CPUs.
Second, computations now routinely involve enormous latency. Latency is, simply put, the amount of time required to obtain a desired result. There are two principal causes of latency. Processor-bound latency occurs when the computational task is complex; if a computation requires performing 12 billion arithmetic operations and the total processing power available is only 6 billion operations per second, at least 2 seconds of processor-bound latency will be incurred between asking for the result and obtaining it. I/O-bound latency, by contrast, is latency incurred by the need to obtain data from an external source such as a disk drive, web server, and so on. Any computation that requires fetching data from a web server physically located far from the client machine will incur latency equivalent to millions of processor cycles.
These two trends together create an enormous challenge for modern software developers. Given that machines have more computing power than ever, how are we to make effective use of that power to deliver results to the user quickly and without compromising on the user experience? How do we avoid creating frustrating user interfaces that freeze up when a high-latency operation is triggered? Moreover, how do we go about splitting CPU-bound work among multiple processors to decrease the time required for the computation?
The standard technique for engineering software that keeps the user interface responsive and CPU utilization high is to write multithreaded programs that perform multiple computations in parallel. Unfortunately, multithreading logic is notoriously difficult to get right; we spend the next four chapters exploring what makes multithreading difficult and learning how to use higher-level abstractions and new language features to ease that burden.
The first higher-level abstraction was the Parallel Extensions library, which was released with .NET 4.0. It includes the Task Parallel Library (TPL), which is discussed in this chapter, and the Parallel LINQ (PLINQ), which is discussed in Chapter 21. The second higher-level abstraction is the Task-based Asynchronous Pattern (TAP) and its accompanying language support.1
Although we strongly encourage you to use these higher-level abstractions, we also cover some of the lower-level threading APIs from previous versions of the .NET runtime at the end of this chapter.2 Thus, if you want to fully understand the resources from multithreaded programming without the later features, you still have access to that material.
We begin this chapter with a few beginner topics in case you are new to multithreading.
There is a lot of confusing jargon associated with multithreading, so let’s define a few terms.
A CPU (central processing unit) or core3 is the unit of hardware that actually executes a given program. Every machine has at least one CPU, though today multiple CPU machines are common. Many modern CPUs support simultaneous multithreading (which Intel trademarks as Hyper-Threading), a mode whereby a single CPU can appear as multiple “virtual” CPUs.
A process is a currently executing instance of a given program; the fundamental purpose of the operating system is to manage processes. Each process contains one or more threads. A process may be accessed programmatically by an instance of the Process class in the System.Diagnostics namespace.
C# programming at the level of statements and expressions is fundamentally about describing flow of control. Thus far in this book, we’ve made the implicit assumption that a given program has only a single point of control. You can imagine the point of control as being a cursor that enters the text of your program at the Main method when you start it up, and then moves around the program as the various conditions, loops, method calls, and so on, are executed. A thread is this point of control. The System.Threading namespace contains the API for manipulating a thread—specifically, the System.Threading.Thread class.
A single-threaded program is one in which there is only one thread in the process. A multithreaded program has two or more threads in the process.
A piece of code is said to be thread safe if it behaves correctly when used in a multithreaded program. The threading model of a piece of code is the set of requirements that the code places upon its caller in exchange for guaranteeing thread safety. For example, the threading model of many classes is “static methods may be called from any thread, but instance methods may be called only from the thread that allocated the instance.”
A task is a unit of potentially high-latency work that produces a resultant value or desired side effect. The distinction between tasks and threads is as follows: A task represents a job that needs to be performed, whereas a thread represents the worker that does the job. A task is useful only for its side effects and is represented by an instance of the Task class. A task used to produce a value of a given type is represented by the Task<T> class, which derives from the nongeneric Task type. These can be found in the System.Threading.Tasks namespace.
A thread pool is a collection of threads, along with logic for determining how to assign work to those threads. When your program has a task to perform, it can delegate a worker thread from the pool, assign the thread to perform the task, and then de-allocate it when the work completes, thereby making it available the next time additional work is requested.
There are two principal scenarios for multithreading: enabling multitasking and dealing with latency.
Users think nothing of running dozens of processes at the same time. They might have presentations and spreadsheets open for editing while at the same time they are browsing documents on the Internet, listening to music, receiving instant messages and email arrival notifications, and watching the little clock in the corner. Each of these processes must continue to do its job even though it is not the only task the machine has to attend to. This kind of multitasking is usually implemented at the process level, but in some situations, you want to do this sort of multitasking within a single process.
For the purposes of this book, we will mostly consider multithreading as a technique for dealing with latency. For example, to import a large file while simultaneously allowing a user to click Cancel, a developer creates an additional thread to perform the import. By performing the import on a different thread, the user can request cancellation instead of freezing the user interface until the import completes.
If enough cores are available so that each thread can be assigned a core, each thread essentially gets its own little machine. However, often there are more threads than cores. Even the relatively common multicore machines of today still have only a handful of cores, while each process could quite possibly run dozens of threads.
To overcome the discrepancy between the numerous threads and the handful of cores, an operating system simulates multiple threads running concurrently by time slicing. The operating system switches execution from one thread to the next so quickly that it appears the threads are executing simultaneously. The time that the processor executes a thread before switching to another is the time slice or quantum. The act of changing which thread is executing on a core is called a context switch.
The effect is like that of a fiber-optic telephone line in which the fiber-optic line represents the processor and each conversation represents a thread. A (single-mode) fiber-optic telephone line can send only one signal at a time, but many people can hold simultaneous conversations over the line. The fiber-optic channel is fast enough to switch between conversations so quickly that each conversation appears uninterrupted. Similarly, each thread of a multithreaded process appears to run continuously with other threads.
If two operations are running in parallel, via either true multicore parallelism or simulated parallelism using time slicing, they are said to be concurrent. To implement such concurrency, you invoke it asynchronously, such that both the execution and the completion of the invoked operation are separate from the control flow that invoked it. Concurrency, therefore, occurs when work dispatched asynchronously executes in parallel with the current control flow. Parallel programming is the act of taking a single problem and splitting it into pieces, whereby you asynchronously initiate the process of each piece such that the pieces can all be processed concurrently.
A thread that is servicing an I/O-bound operation can essentially be ignored by the operating system until the result is available from the I/O subsystem; switching away from an I/O-bound thread to a processor-bound thread results in more efficient processor utilization because the CPU is not idle while waiting for the I/O operation to complete.
However, context switching is not free; the current internal state of the CPU must be saved to memory, and the state associated with the new thread must be loaded. Similarly, if thread A is doing lots of work with one piece of memory, and thread B is doing lots of work with another piece of memory, context switching between them will likely mean that all of the data that was loaded into the cache from thread A will get replaced with the data from thread B (or vice versa). If there are too many threads, the switching overhead can begin to noticeably affect performance. Adding more threads will likely decrease performance further, to the point where the processor spends more time switching from one thread to another than it does accomplishing the work of each thread.
Even if we ignore the cost of context switching, time slicing itself can have a huge impact on performance. Suppose, for example, that you have two processor-bound high-latency tasks, each working out the average of two lists of 1 billion numbers each. Suppose the processor can perform 1 billion operations per second. If the two tasks are each associated with a thread, and the two threads each have their own core, obviously we can get both results in 1 second.
If, however, the two threads share a single processor, time slicing performs a few hundred thousand operations on one thread, then switches to the other thread, then switches back, and so on. Each task consumes a total of 1 second of processor time, and the results of both are available after 2 seconds, leading to an average completion time of 2 seconds. (Again, we are ignoring the cost of context switching.)
If we assigned those two tasks to a single thread that performed the first task and did not even start the second until after the first was completed, the result of the first task would be obtained in 1 second and the result of the subsequent task would be obtained 1 second after that, leading to an average time of 1.5 seconds (a task completes in either 1 or 2 seconds and therefore, on average, completes in 1.5 seconds).
We’ve said several times that writing multithreaded programs is complex and difficult, but we have not said why. In a nutshell, the problem is that many of our reasonable assumptions that are true for single-threaded programs are violated in multithreaded programs. The issues include a lack of atomicity, race conditions, complex memory models, and deadlocks.
An atomic operation is one that always is observed to be either not started or already completed. Its state is never externally visible as “in progress.” Consider, for example, this code fragment:
if (bankAccounts.Checking.Balance >= 1000.00m)
{
bankAccounts.Checking.Balance -= 1000.00m;
bankAccounts.Savings.Balance += 1000.00m;
}
This operation—checking for available funds and then conditionally debiting one account and crediting another—needs to be atomic. In other words, for it to execute correctly, we must ensure that there is never a moment when the operation can be observed to be partially completed. Imagine, for example, that two threads are running in this code concurrently. It is possible that both threads verify that there are sufficient funds in the account, and then both threads do a transfer of funds, even if there are only sufficient funds in the account to do the transfer once. And, in fact, the situation is considerably worse: There are no operations in this code fragment that are atomic! Even operations like compound addition/subtraction or reading and writing a property of decimal type are non-atomic operations in C#. As such, they can all be observed to be “partially complete” in multithreaded scenarios—only partially incremented or decremented. The observation of inconsistent state due to partially completed non-atomic operations is a special case of a more general problem, called a race condition.
As we discussed earlier, concurrency is often simulated by time slicing. In the absence of special thread synchronization (which we discuss in detail in Chapter 22), the operating system can switch contexts between any two threads at any time of its choosing. As a consequence, when two threads are accessing the same object, which thread wins the race and gets to run first is unpredictable. If there are two threads running in the code fragment given previously, for example, it is possible that one thread might win the race and get all the way to the end before the second thread gets a chance to run. It is also possible that the context switch might happen after the first thread does the balance check, and the second thread might then win the race to get all the way to the end first.
The behavior of code that contains race conditions depends on the timing of context switches. This dependency introduces uncertainty concerning program execution. The order in which one instruction will execute relative to an instruction in a different thread is unknown. The worst of it is that code containing race conditions often behaves correctly 99.9 percent of the time, and then one time in a thousand a different thread wins the race due to an accident of timing. This unpredictability is what makes multithreaded programming so difficult.
Because such race conditions are difficult to replicate in the laboratory, much of the quality assurance of multithreaded code depends on long-running stress tests, specially designed code analysis tools, and a significant investment in code analysis and code review by experts. Perhaps more important than any of these is the discipline of keeping things as simple as possible. Often, in the name of hypothetical performance, a developer will try to avoid the simple approach of using a lock and go for lower-level primitives such as interlocked operations and volatiles, which makes it much more likely that their code is wrong. “Keep it simple” is possibly one of the most important guidelines of good multithreaded programming.
Chapter 22 is about techniques for dealing with race conditions.
The existence of race conditions, where two points of control can “race” through a piece of code at unpredictable and inconsistent speeds, is bad enough, but it gets worse. Consider two threads that are running on two different processors but are accessing the same fields of some object. Modern processors do not actually access main memory every time you use a variable. Rather, they make a local copy in special cache memory on the processor; these caches are then periodically synchronized with main memory. This means that two threads that read from and write to the same location on two different processors can, in fact, be failing to observe each other’s updates to that memory or observing inconsistent results. Essentially what we have here is a race condition that depends on when processors choose to synchronize their caches.
Clearly there must exist mechanisms to make non-atomic operations into atomic operations, to instruct the operating system to schedule threads so as to avoid races, and to ensure that processor caches are synchronized when necessary. The primary mechanism used to solve all these problems in C# programs is the lock statement. This statement allows the developer to identify a section of code as “critical” code that only one thread may be in at one time; if multiple threads try to enter the critical section, the operating system will suspend4 all but one. The operating system also ensures that processor caches are synchronized properly upon encountering a lock.
However, locks introduce problems of their own (along with performance overhead). Most notably, if the order of lock acquisition between threads varies, a deadlock could occur such that threads freeze, each waiting for the other to release its lock.
For example, consider Figure 19.2.
At this point, each thread is waiting on the other thread before proceeding, so each thread is blocked, leading to an overall deadlock in the execution of that code.
We discuss various locking techniques in detail in Chapter 22.
________________________________________