When returning an array or collection, you must indicate that there are zero items by returning either null or a collection instance with no items. The better choice, in general, is to return a collection instance with no items. In so doing, you avoid forcing the caller to check for null before iterating over the items in the collection. For example, given a zero-size IEnumerable<T> collection, the caller can immediately and safely use a foreach loop over the collection without concern that the generated call to GetEnumerator() will throw a NullReferenceException. Consider using the Enumerable.Empty<T>() method to easily generate an empty collection of a given type.
One of the few times to deviate from this guideline is when null is intentionally indicating something different from zero items. For example, a collection of user names for a website might be null to indicate that an up-to-date collection could not be obtained for some reason; that is semantically different from an empty collection.
Chapter 15 went into detail on the internals of the foreach loop. This section discusses how to use iterators to create your own implementation of the IEnumerator<T>, IEnumerable<T>, and corresponding nongeneric interfaces for custom collections. Iterators provide clean syntax for specifying how to iterate over data in collection classes, especially using the foreach loop. The iterator allows end users of a collection to navigate its internal structure without knowledge of that structure.
In 1972, Barbara Liskov and a team of scientists at MIT began researching programming methodologies, focusing on user-defined data abstractions. To prove much of their work, they created a language called CLU that had a concept called “clusters” (CLU being the first three letters of this term). Clusters were predecessors to the primary data abstraction that programmers use today: objects.
During their research, the team realized that although they were able to use the CLU language to abstract some data representation away from end users of their types, they consistently found themselves having to reveal the inner structure of their data to allow others to intelligently consume it. The result of their consternation was the creation of a language construct called an iterator. (The CLU language offered many insights into what would eventually be popularized as object-oriented programming.)
If classes want to support iteration using the foreach loop construct, they must implement the enumerator pattern. As Chapter 15 describes, in C# the foreach loop construct is expanded by the compiler into the while loop construct based on the IEnumerator<T> interface that is retrieved from the IEnumerable<T> interface.
The problem with the enumeration pattern is that it can be cumbersome to implement manually, because it must maintain all the state necessary to describe the current position in the collection. This internal state may be simple for a list collection type class; the index of the current position usually suffices. In contrast, for data structures that require recursive traversal, such as binary trees, the state can be quite complicated.4
Iterators are a means to implement methods of a class, and they are syntactic shortcuts for the more complex enumerator pattern. When the C# compiler encounters an iterator, it expands its contents into CIL code that implements the enumerator pattern. As such, there are no runtime dependencies for implementing iterators. Because the C# compiler handles implementation through CIL code generation, there is no real runtime performance benefit from using iterators. However, a substantial programmer productivity gain may be realized by choosing iterators over the manual implementation of the enumerator pattern. To understand how this improvement arises, we must first consider how an iterator is defined in code.
An iterator provides a shorthand implementation of iterator interfaces, the combination of the IEnumerable<T> and IEnumerator<T> interfaces. Listing 17.13 declares an iterator for the generic BinaryTree<T> type by creating a GetEnumerator() method. Next, you add support for the iterator interfaces.
As Listing 17.13 shows, we need to provide an implementation for the GetEnumerator() method.
Iterators are like functions, but instead of returning a single value, they yield a sequence of values, one at a time. In the case of BinaryTree<T>, the iterator yields a sequence of values of the type argument provided for T. If the nongeneric version of IEnumerator is used, the yielded values are instead of type object.
To correctly implement the iterator pattern, you need to maintain some internal state to keep track of where you are while enumerating the collection. In the BinaryTree<T> case, you track which elements within the tree have already been enumerated and which are still to come. Iterators are transformed by the compiler into a “state machine” that keeps track of the current position and knows how to “move itself” to the next position.
The yield return statement yields a value each time an iterator encounters it; control immediately returns to the caller that requested the item. (An interesting point here is that control really does immediately return, unlike, ironically, the return statement. A return statement runs finally blocks along the way; yield return does not.) When the caller requests the next item, the code begins to execute immediately following the previously executed yield return statement. In Listing 17.14, you return the C# built-in data type keywords sequentially. Results are shown in Output 17.5.
The output from this listing is a listing of the C# built-in types.
When GetEnumerator() is first called in a foreach statement, such as foreach (string keyword in keywords) in Listing 17.14, an iterator object is created and its state is initialized to a special “start” state that represents the fact that no code has executed in the iterator and, therefore, no values have been yielded yet. The iterator maintains its state as long as the foreach statement at the call site continues to execute. Every time the loop requests the next value, control enters the iterator and continues where it left off the previous time around the loop; the state information stored in the iterator object is used to determine where control must resume. When the foreach statement at the call site terminates, the iterator’s state is no longer saved.
It is always safe to call GetEnumerator() again; “fresh” enumerator objects will be created when necessary.
Figure 17.8 shows a high-level sequence diagram of what takes place. Remember that the MoveNext() method appears on the IEnumerator<T> interface.
In Listing 17.13, the foreach statement at the call site initiates a call to GetEnumerator() on the CSharpBuiltInTypes instance called keywords. Given the iterator instance (referenced by iterator), foreach begins each iteration with a call to MoveNext(). Within the iterator, you yield a value back to the foreach statement at the call site. After the yield return statement, the GetEnumerator() method seemingly pauses until the next MoveNext() request. Back at the loop body, the foreach statement displays the yielded value on the screen. It then loops back around and calls MoveNext() on the iterator again. Notice that the second time, control picks up at the second yield return statement. Once again, the foreach displays on the screen what CSharpBuiltInTypes yielded and starts the loop again. This process continues until there are no more yield return statements within the iterator. At that point, the foreach loop at the call site terminates because MoveNext() returns false.
Before you modify BinaryTree<T>, you must modify Pair<T> to support the IEnumerable<T> interface using an iterator. Listing 17.15 is an example that yields each element in Pair<T>.
In Listing 17.15, the iteration over the Pair<T> data type loops twice: first through yield return First and then through yield return Second. Each time the yield return statement within GetEnumerator() is encountered, the state is saved and execution appears to “jump” out of the GetEnumerator() method context and into the loop body. When the second iteration starts, GetEnumerator() begins to execute again with the yield return Second statement.
System.Collections.Generic.IEnumerable<T> inherits from System.Collections.IEnumerable. Therefore, when implementing IEnumerable<T>, it is also necessary to implement IEnumerable. In Listing 17.15, you do so explicitly, and the implementation simply involves a call to IEnumerable<T>’s GetEnumerator() implementation. This call from IEnumerable.GetEnumerator() to IEnumerable<T>.GetEnumerator() will always work because of the type compatibility (via inheritance) between IEnumerable<T> and IEnumerable. Since the signatures for both versions of GetEnumerator() are identical (the return type does not distinguish a signature), one or both implementations must be explicit. Given the additional type safety offered by IEnumerable<T>’s version, you implement IEnumerable’s implementation explicitly.
Listing 17.16 uses the Pair<T>.GetEnumerator() method and displays "Inigo" and "Montoya" on two consecutive lines.
Notice that the call to GetEnumerator() is implicit within the foreach loop.
It is not necessary to hardcode each yield return statement, as you did in both CSharpBuiltInTypes and Pair<T>. Using the yield return statement, you can return values from inside a loop construct. Listing 17.17 uses a foreach loop. Each time the foreach within GetEnumerator() executes, it returns the next value.
In Listing 17.17, the first iteration returns the root element within the binary tree. During the second iteration, you traverse the pair of subelements. If the subelement pair contains a non-null value, you traverse into that child node and yield its elements. Note that foreach (T item in tree) is a recursive call to a child node.
As observed with CSharpBuiltInTypes and Pair<T>, you can now iterate over BinaryTree<T> using a foreach loop. Listing 17.18 demonstrates this process, and Output 17.6 shows the results.
The code in Listing 17.17 creates new “nested” iterators as it traverses the binary tree. As a consequence, when the value is yielded by a node, the value is yielded by the node’s iterator, and then yielded by its parent’s iterator, and then yielded by its parent’s iterator, and so on, until it is finally yielded to the original loop by the root’s iterator. A value that is n levels deep must actually pass its value up a chain of n iterators. If the binary tree is relatively shallow, this is not typically a problem; however, an imbalanced binary tree can be extremely deep and therefore expensive to iterate recursively.
An interesting side effect of defining Pair<T> as a struct rather than as a class is that SubItems.First and SubItems.Second cannot be assigned directly, even if the setter were public. If you modify the setter to be public, the following code produces a compile error indicating that SubItems cannot be modified “because it is not a variable”:
jfkFamilyTree.SubItems.First =
new BinaryTree<string>(
"Joseph Patrick Kennedy");
The issue is that SubItems is a property of type Pair<T>, a struct. Therefore, when the property returns the value, a copy of SubItems is made, and assigning First on a copy that is promptly lost at the end of the statement would be misleading. Fortunately, the C# compiler prevents this error.
To overcome the issue, don’t assign First (see the approach in Listing 17.18), use class rather than struct for Pair<T>, don’t create a SubItems property and instead use a field, or provide properties in BinaryTree<T> that give direct access to SubItems members.
Sometimes you might want to cancel further iteration. You can do so by including an if statement so that no further statements within the code are executed. However, you can also use yield break to cause MoveNext() to return false and control to return immediately to the caller and end the loop. Listing 17.19 shows an example of such a method.
This method cancels the iteration if either of the elements in the Pair<T> class is null.
A yield break statement is similar to placing a return statement at the top of a function when it is determined that there is no work to do. It is a way to exit from further iterations without surrounding all remaining code with an if block. As such, it allows multiple exits. Use it with caution, because a casual reading of the code may overlook the early exit.
When the C# compiler encounters an iterator, it expands the code into the appropriate CIL for the corresponding enumerator design pattern. In the generated code, the C# compiler first creates a nested private class to implement the IEnumerator<T> interface, along with its Current property and a MoveNext() method. The Current property returns a type corresponding to the return type of the iterator. In Listing 17.15 of Pair<T>, an iterator returns a T type. The C# compiler examines the code contained within the iterator and creates the necessary code within the MoveNext method and the Current property to mimic its behavior. For the Pair<T> iterator, the C# compiler generates roughly equivalent code (see Listing 17.20).
Because the compiler takes the yield return statement and generates classes that correspond to what you probably would have written manually, iterators in C# exhibit the same performance characteristics as classes that implement the enumerator design pattern manually. Although there is no performance improvement, the gains in programmer productivity are significant.
Many C# keywords are “reserved” and cannot be used as identifiers unless preceded with an @ sign. The yield keyword is a contextual keyword, not a reserved keyword; thus, it is legal (though confusing) to declare a local variable called yield. In fact, all the keywords added to C# after version 1.0 have been contextual keywords; this helps prevent accidental breakages when upgrading existing programs to use new versions of the language.
Had the C# designers chosen to use yield value; instead of yield return value; as the syntax for an iterator to yield, a possible ambiguity would have been introduced: yield(1+2); now might be yielding a value, or it might be passing the value as an argument to a method called yield.
Since it was previously never legal to have the identifier yield appear immediately before return or break, the C# compiler knows that such a usage of yield must be as a keyword, not an identifier.
Previous iterator examples implemented IEnumerable<T>.GetEnumerator(), the method that foreach seeks implicitly. Sometimes you might want different iteration sequences, such as iterating in reverse, filtering the results, or iterating over an object projection other than the default. You can declare additional iterators in the class by encapsulating them within properties or methods that return IEnumerable<T> or IEnumerable. If you want to iterate over the elements of Pair<T> in reverse, for example, you could provide a GetReverseEnumerator() method, as shown in Listing 17.21.
Note that you return IEnumerable<T>, not IEnumerator<T>. This is different from IEnumerable<T>.GetEnumerator(), which returns IEnumerator<T>. The code in Main() demonstrates how to call GetReverseEnumerator() using a foreach loop.
You can use the yield return statement only in members that return an IEnumerator<T> or IEnumerable<T> type, or their nongeneric equivalents. Members whose bodies include a yield return statement may not have a simple return. If the member uses the yield return statement, the C# compiler generates the necessary code to maintain the state of the iterator. In contrast, if the member uses the return statement instead of yield return, the programmer is responsible for maintaining his own state machine and returning an instance of one of the iterator interfaces. Further, just as all code paths in a method with a return type must contain a return statement accompanied by a value (assuming they don’t throw an exception), so all code paths in an iterator must contain a yield return statement if they are to return any data.
The following additional restrictions on the yield statement result in compiler errors if they are violated:
________________________________________