Iterating in Parallel
In Chapter 19, we mentioned how the cost of computation is falling, resulting in computers with faster CPUs, more CPUs, and more cores in each CPU. Collectively, these trends make it increasingly economical to parallelize execution to take advantage of the additional computing power. In this chapter, we look at executing loops in parallel, which is one of the easiest ways to take advantage of this increased computing capability. Much of the chapter consists of Beginner and Advanced Topic blocks.
Consider the for loop statement and associated code shown in Listing 21.1 and the corresponding results in Output 21.1. The listing calls a method for calculating a section of the decimal expansion of pi, where the parameters are the number of digits and the digit to start with. The actual calculation is not germane to the discussion. What is interesting about this calculation is that it is embarrassingly parallelizable; that is, it is remarkably easy to split up a large task—say, computing 1 million decimal digits of pi—into any desired number of smaller tasks that can all be run in parallel. These types of computations are the easiest ones to speed up by adding parallelism.
The for loop executes each iteration synchronously and sequentially. However, because the pi calculation algorithm splits the work into independent pieces, it is not necessary to compute the pieces sequentially, provided the results are appended in the right order. Imagine what would happen if you could have all the iterations of this loop run concurrently: Each processor could take a single iteration and execute it in parallel with other processors executing other iterations. Given the simultaneous execution of iterations, we could decrease the execution time roughly in proportion to the number of processors.
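Listing 21.1 is not reproduced here, but the sequential version might look something like the following minimal sketch, in which PiCalculator.Calculate() is a hypothetical stand-in for the real digit-extraction algorithm (it simply fabricates placeholder digits so that the overall structure compiles and runs):

```csharp
using System;
using System.Text;

public static class PiCalculator
{
    // Hypothetical stand-in for the real digit-extraction algorithm;
    // it fabricates placeholder digits so the overall structure runs.
    public static string Calculate(int digits, int startingAt)
    {
        StringBuilder section = new StringBuilder();
        for (int i = 0; i < digits; i++)
        {
            section.Append((startingAt + i) % 10);
        }
        return section.ToString();
    }
}

public static class Program
{
    const int TotalDigits = 100;
    const int BatchSize = 10;

    public static void Main()
    {
        string pi = "3.";
        // Each iteration computes one section of pi and appends it,
        // in order, to the running result.
        for (int i = 0; i < TotalDigits / BatchSize; i++)
        {
            pi += PiCalculator.Calculate(BatchSize, i * BatchSize);
        }
        Console.WriteLine(pi);
    }
}
```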
The Task Parallel Library (TPL) provides a convenient method, Parallel.For(), which does precisely that. Listing 21.2 shows how to modify the sequential, single-threaded program in Listing 21.1 to use the helper method.
The output for Listing 21.2 is identical to Output 21.1; however, the execution time is significantly faster if you have multiple CPUs (and possibly slower if you do not). The Parallel.For() API is designed to look similar to a standard for loop. The first parameter is the fromInclusive value, the second is the toExclusive value, and the last is the Action<int> to perform as the loop body. When using an expression lambda for the action, the code looks similar to a for loop statement except that now each iteration may execute in parallel. As with the for loop, the call to Parallel.For() will not complete until all iterations are complete. In other words, by the time execution reaches the string.Join() statement, all sections of pi will have been calculated.
Note that the code for combining the various sections of pi no longer occurs inside the iteration (action) in Listing 21.2. Because the sections of the pi calculation will very likely not complete in order, appending a section whenever an iteration completes would likely append them out of sequence. Even if sequence were not a problem, there would still be a potential race condition, because the += operator in Listing 21.1 is not atomic. To address both problems, each section of pi is stored into an array, and no two iterations access the same element within the array. Only once all sections of pi are calculated does string.Join() combine them. In other words, we postpone concatenating the sections until after the Parallel.For() loop has completed, which avoids both the race condition and any out-of-order concatenation.
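Under these constraints, a parallel version along the lines of Listing 21.2 might look like the following sketch (again using the hypothetical PiCalculator.Calculate() stand-in): each iteration writes only to its own slot in the array, and string.Join() concatenates the sections only after the loop has returned.

```csharp
using System;
using System.Threading.Tasks;

public static class Program
{
    const int TotalDigits = 100;
    const int BatchSize = 10;

    public static void Main()
    {
        const int sectionCount = TotalDigits / BatchSize;
        string[] pieces = new string[sectionCount];

        // Each iteration stores its section in a distinct array element,
        // so no two iterations write to the same location.
        Parallel.For(0, sectionCount, i =>
        {
            pieces[i] = PiCalculator.Calculate(BatchSize, i * BatchSize);
        });

        // Parallel.For() does not return until every iteration has completed,
        // so all sections exist before they are joined in order.
        string pi = "3." + string.Join("", pieces);
        Console.WriteLine(pi);
    }
}
```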
The TPL uses the same sorts of thread pooling techniques that it uses for task scheduling to ensure good performance of the parallel loop: It will try to ensure that CPUs are not overscheduled and so on.
The TPL also provides a similar parallel version of the foreach statement, as shown in Listing 21.3.
In this example, we call a method that encrypts each file within the files collection. It does so in parallel, executing as many threads as the TPL determines is efficient.
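A sketch along the lines of Listing 21.3 might look like the following, where Encrypt() is a hypothetical placeholder for the actual encryption routine:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class Program
{
    public static void Main()
    {
        IEnumerable<string> files =
            Directory.EnumerateFiles(Directory.GetCurrentDirectory(), "*.txt");

        // The TPL decides how many iterations (and therefore threads)
        // run concurrently; every file is processed exactly once.
        Parallel.ForEach(files, fileName =>
        {
            Encrypt(fileName);
        });
    }

    // Hypothetical placeholder; a real implementation would encrypt the file.
    private static void Encrypt(string fileName) =>
        Console.WriteLine($"Encrypting {fileName}...");
}
```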
The default scheduler within the TPL targets the thread pool, resulting in a variety of heuristics that try to ensure the right number of threads are executing at any one time. Two of the heuristics it uses are hill climbing and work stealing.
The hill-climbing algorithm involves creating threads to run tasks, and then monitoring the performance of those tasks to experimentally determine the point at which adding more threads begins making performance worse. Once that point is reached, the number of threads can then be decreased back to the number that produced the best performance.
The TPL does not associate top-level tasks that are waiting to be executed with any particular thread. If, however, a task running on a thread itself creates another task, the newly created task is associated with that thread automatically. When the new child task is eventually scheduled to run, it usually runs on the same thread as the task that created it. The work-stealing algorithm identifies threads that have an unusually large or unusually small amount of pending work; a thread that has too few tasks associated with it will sometimes “steal” not-yet-executed tasks from threads that have too many tasks waiting to run.
The key feature of these algorithms is that they enable the TPL to dynamically tune its own performance to mitigate processor overscheduling and underscheduling and to balance the work among the available processors.
The TPL generally does a good job of tuning its own performance, but you can help it do a better job by providing hints about the best course of action. Specifying the TPL TaskCreationOptions.LongRunning option described in the section “Long-Running Tasks” in Chapter 19 is an example of such a hint. You can also explicitly tell the task scheduler how many threads you think would be best to service a parallel loop; see “Advanced Topic: Parallel Loop Options” later in this chapter for more details.
The TPL catches and saves exceptions associated with tasks in an AggregateException, because a given task might have several exceptions obtained from its subtasks. This is also the case with parallel execution of loops: Each iteration could have produced an exception, so the exceptions need to be gathered up into one aggregating exception. Consider the example in Listing 21.4 and its results in Output 21.2.
Output 21.2 shows that three exceptions occurred while executing the Parallel.ForEach<T>(...) loop. However, the code shows only one catch, of type System.AggregateException. The UnauthorizedAccessExceptions were retrieved from the InnerExceptions property on the AggregateException. With a Parallel.ForEach<T>() loop, each iteration could potentially throw an exception, so the System.AggregateException thrown by the method call will contain each of those exceptions within its InnerExceptions property.
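The structure, though not the exact code of Listing 21.4, might look like the following sketch; the file-processing body is only illustrative, but it shows how the individual exceptions are retrieved from InnerExceptions:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class Program
{
    public static void Main()
    {
        string[] files = Directory.GetFiles(
            Directory.GetCurrentDirectory(), "*", SearchOption.AllDirectories);

        try
        {
            Parallel.ForEach(files, fileName =>
            {
                // Any iteration may throw (for example, an
                // UnauthorizedAccessException for an inaccessible file).
                using FileStream stream = File.OpenRead(fileName);
                // ... process the file ...
            });
        }
        catch (AggregateException exception)
        {
            // Each iteration's exception appears in InnerExceptions.
            foreach (Exception item in exception.InnerExceptions)
            {
                Console.WriteLine($"  {item.GetType().Name}: {item.Message}");
            }
        }
    }
}
```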
Unlike a task, which requires an explicit call if it is to block until it completes, a parallel loop executes iterations in parallel but does not itself return until the entire loop completes. Canceling a parallel loop, therefore, generally involves invoking the cancellation request from a thread other than the one executing the loop. In Listing 21.5, we invoke Parallel.ForEach<T>() using Task.Run(). In this manner, not only does the loop execute in parallel, but it also executes asynchronously, allowing the code to prompt the user to "Press ENTER to exit."
The parallel loops use the same cancellation token pattern that tasks use. The token obtained from a CancellationTokenSource is associated with the parallel loop by calling an overload of the ForEach() method that has a parameter of type ParallelOptions. This object contains the cancellation token.
Note that if you cancel a parallel loop operation, any iterations that have not yet started are prevented from starting via a check of the token's IsCancellationRequested property. Iterations that are already executing run to their respective termination points. Furthermore, calling Cancel() even after all iterations have completed still causes the registered cancellation callback (via cts.Token.Register()) to execute.
The only means by which the ForEach() method is able to acknowledge that the loop has been canceled is via the OperationCanceledException. Given that cancellation in this example is expected, the exception is caught and ignored, allowing the application to display “Canceling…”, followed by a line of stars before exiting.
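A sketch in the spirit of Listing 21.5 follows; Encrypt() is again a hypothetical placeholder, while the prompt, the "Canceling..." message, and the line of stars mirror the behavior described above:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class Program
{
    public static void Main()
    {
        string[] files =
            Directory.GetFiles(Directory.GetCurrentDirectory(), "*.txt");

        using CancellationTokenSource cts = new CancellationTokenSource();
        cts.Token.Register(() => Console.WriteLine("Canceling..."));
        Console.WriteLine("Press ENTER to exit.");

        // Run the parallel loop asynchronously so that the main thread
        // remains free to wait for user input.
        Task task = Task.Run(() =>
        {
            try
            {
                Parallel.ForEach(
                    files,
                    new ParallelOptions { CancellationToken = cts.Token },
                    fileName => Encrypt(fileName));
            }
            catch (OperationCanceledException)
            {
                // Cancellation is expected here, so the exception is ignored.
            }
        });

        Console.ReadLine();
        cts.Cancel();          // Iterations not yet started will not start.
        task.Wait();
        Console.WriteLine(new string('*', 40));
    }

    // Hypothetical placeholder for the real encryption routine.
    private static void Encrypt(string fileName) =>
        Console.WriteLine($"Encrypting {fileName}...");
}
```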
Although not generally necessary, it is possible to control the maximum degree of parallelism (i.e., the number of threads that are scheduled to run at the same time) via the ParallelOptions parameter on overloads of both the Parallel.For() and Parallel.ForEach<T>() loops. In some specific cases, the developer may know more about the particular algorithm or circumstance such that changing the maximum degree of parallelism makes sense. These circumstances include the following:
And so on. To control the maximum degree of parallelism, use the MaxDegreeOfParallelism property on the ParallelOptions object.
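For example, a minimal sketch might cap the loop at three concurrent iterations (the specific numbers here are arbitrary assumptions, not values from the book's listings):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Program
{
    public static void Main()
    {
        ParallelOptions options = new ParallelOptions
        {
            // Allow at most three iterations to execute concurrently.
            MaxDegreeOfParallelism = 3
        };

        Parallel.For(0, 10, options, i =>
        {
            Console.WriteLine(
                $"Iteration {i} on thread {Thread.CurrentThread.ManagedThreadId}");
            Thread.Sleep(100);  // Simulate work.
        });
    }
}
```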
You can also use the ParallelOptions object’s TaskScheduler property to specify a custom task scheduler to use when scheduling the tasks associated with each iteration. For example, you might have an asynchronous event handler that responds to the user’s click of a Next button. If the user clicks the button several times, you might want to use a custom task scheduler that prioritizes the most recently created task rather than prioritizing the task that has waited the longest. The task scheduler provides a means of specifying how the tasks will execute in relation to one another.
The ParallelOptions object also has a CancellationToken property that provides a mechanism to communicate to the loop that no further iterations should start. Additionally, the body of an iteration can watch the cancellation token to determine if an early exit from the iteration is in order.
Like a standard for loop, the Parallel.For() loop supports the concept of “breaking” to exit the loop and canceling any further iterations. In the context of parallel for execution, however, a break signifies that no new iterations following the breaking iteration should start. All currently executing iterations, however, will run to completion.
To break a parallel loop, you can provide a cancellation token and cancel it on another thread, as described in the preceding “Advanced Topic: Parallel Loop Options.” You can also use an overload of the Parallel.For() method whose body delegate takes two parameters: the index and a ParallelLoopState object. An iteration that wishes to break the loop can call the Break() or Stop() method on the loop state object passed to the delegate. The Break() method indicates that no more iterations with index values higher than the current value need to execute; the Stop() method indicates that no more iterations need to run at all.
For example, suppose you have a Parallel.For() loop that is performing ten iterations in parallel. Some of those iterations might run faster than others, and the task scheduler does not guarantee that they will run in any particular order. Suppose the first iteration has completed; iterations 3, 5, 7, and 9 are "in flight," scheduled to four different threads; and iterations 5 and 7 both call Break(). In this scenario, iterations 6, 8, and 10 never start, but iterations 2 and 4 are still scheduled to run. Iterations 3 and 9 run to completion because they had already started when the break happened.
The Parallel.For() and Parallel.ForEach<T>() methods return a ParallelLoopResult value that contains useful information about what happened during the loop. This result exposes two properties: IsCompleted, which indicates whether every iteration ran to completion, and LowestBreakIteration, which identifies the lowest iteration on which Break() was called (or null if Break() was never called).
Returning to the ten-iteration example, the IsCompleted property returns false and the LowestBreakIteration returns a value of 5.
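The following minimal sketch (not the book's listing) demonstrates Break() and the resulting ParallelLoopResult values; depending on scheduling, some iterations above 5 may already have started before the break takes effect:

```csharp
using System;
using System.Threading.Tasks;

public static class Program
{
    public static void Main()
    {
        ParallelLoopResult result = Parallel.For(1, 11, (i, loopState) =>
        {
            Console.WriteLine($"Starting iteration {i}");
            if (i == 5)
            {
                // No iteration with an index greater than 5 that has not yet
                // started will be scheduled; lower indexes still run.
                loopState.Break();
            }
        });

        Console.WriteLine($"IsCompleted: {result.IsCompleted}");
        Console.WriteLine($"LowestBreakIteration: {result.LowestBreakIteration}");
    }
}
```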