LINQ with Query Expressions
The end of Chapter 15 showed a query using standard query operators for GroupJoin(), SelectMany(), and Distinct(). The result was a statement that spanned multiple lines and was rather more complex and difficult to comprehend than statements typically written using only features of earlier versions of C#. Modern programs, which manipulate rich data sets, often require such complex queries; it would therefore be advantageous if the language made them easier to read. Domain-specific query languages such as SQL make it much easier to read and understand a query but lack the full power of the C# language. That is why the C# language designers added query expression syntax. With query expressions, many standard query operator expressions are transformed into more readable code, much like SQL.
In this chapter, we introduce query expressions and use them to express many of the queries from Chapter 15.
Two of the operations that developers most frequently perform are filtering the collection to eliminate unwanted items and projecting the collection so that the items take a different form. For example, given a collection of files, we could filter it to create a new collection of only the files with a .cs extension or only the files larger than 1 million bytes. We could also project the file collection to create a new collection of paths to the directories where the files are located and the corresponding directory size. Query expressions provide straightforward syntaxes for both of these common operations. Listing 16.1 shows a query expression that filters a collection of strings; Output 16.1 shows the results.
In this query expression, selection is assigned the collection of C# reserved keywords. The where clause in this example keeps only the noncontextual (reserved) keywords and excludes the contextual ones.
Query expressions always begin with a from clause and end with a select clause or a group clause, identified by the from, select, or group contextual keyword, respectively. The identifier word in the from clause is called a range variable; it represents each item in the collection, much as the loop variable in a foreach loop represents each item in a collection.
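A minimal sketch of such a query follows. It is not the chapter's listing: the keywords array is hypothetical sample data (a trailing * marks a contextual keyword), and the code is written as top-level statements, which assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: a trailing '*' marks a contextual keyword.
string[] keywords = { "abstract", "add*", "alias*", "as", "ascending*", "base", "bool" };

// from introduces the range variable word, where filters the items,
// and select describes the result.
IEnumerable<string> selection =
    from word in keywords
    where !word.Contains('*')
    select word;

Console.WriteLine(string.Join(" ", selection));
// prints: abstract as base bool
```

The range variable word plays the same role here that the loop variable would play in a foreach loop over keywords.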
Developers familiar with SQL will notice that query expressions have a syntax that is similar to that of SQL. This design was deliberate—it was intended that programmers who already know SQL should find it easy to learn LINQ. However, there are some obvious differences. The first difference that most SQL-experienced developers will notice is that the C# query expression shown here has the clauses in the following order: from, then where, then select. The equivalent SQL query puts the SELECT clause first, followed by the FROM clause, and finally the WHERE clause.
One reason for this change in sequence is to enable use of IntelliSense, the feature of the IDE whereby the editor produces helpful user interface elements such as drop-down lists that describe the members of a given object. Because from appears first and identifies the string array Keywords as the data source, the code editor can deduce that the range variable word is of type string. When you are entering the code into the editor and reach the dot following word, the editor will display only the members of string.
If the from clause appeared after the select, as it does in SQL, the editor would not know the data type of word while you were typing the query, so it could not display a list of word’s members. In Listing 16.1, for example, it wouldn’t be possible to predict that Contains() is a possible member of word.
The C# query expression order also more closely matches the order in which operations are logically performed. When evaluating the query, you begin by identifying the collection (described by the from clause), then filter out the unwanted items (with the where clause), and finally describe the desired result (with the select clause).
Finally, the C# query expression order ensures that the rules governing where range variables are in scope are mostly consistent with the scoping rules for local variables. For example, a range variable must be declared by a clause (typically a from clause) before it can be used, much as a local variable must always be declared before it can be used.
The result of a query expression is a collection of type IEnumerable<T> or IQueryable<T>. The actual type T is inferred from the select or group by clause. In Listing 16.1, for example, the compiler knows that Keywords is of type string[], which is convertible to IEnumerable<string>, and it deduces that word is therefore of type string. The query ends with select word, which means the result of the query expression must be a collection of strings, so the type of the query expression is IEnumerable<string>.
In this case, the “input” and the “output” of the query are both a collection of strings. However, the output type can be quite different from the input type if the expression in the select clause is of an entirely different type. Consider the query expression in Listing 16.2 and its corresponding output in Output 16.2.
This query expression results in an IEnumerable<FileInfo> rather than the IEnumerable<string> data type returned by Directory.GetFiles(). The select clause of the query expression can potentially project out a data type that differs from that collected by the from clause expression.
In this example, the type FileInfo was chosen because it has the two relevant properties needed for the desired output: the filename and the last write time. There might not be such a convenient type if you needed other information not captured in the FileInfo object. Tuples provide a convenient and concise way to project the exact data you need without having to find or create an explicit type. Listing 16.3 provides output similar to that in Listing 16.2, but uses tuple syntax rather than FileInfo.
In this example, the query projects out only the filename and its last file write time. A projection such as the one in Listing 16.3 makes little difference when working with something small, such as FileInfo. However, “horizontal” projection that filters down the amount of data associated with each item in the collection is extremely powerful when the amount of data is significant and retrieving it (perhaps from a different computer over the Internet) is expensive. Rather than retrieving all the data when a query executes, projecting into a tuple makes it possible to retrieve and store only the required data in the collection.
Imagine, for example, a large database that has tables with 30 or more columns. If there were no tuples, developers would be required either to use objects containing unnecessary information or to define small, specialized classes useful only for storing the specific data required. Instead, tuples enable support for types to be defined by the compiler—types that contain only the data needed for their immediate scenario. Other scenarios can have a different projection of only the properties needed for that scenario.
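The following sketch illustrates a tuple projection of this kind. It is an illustration rather than the chapter's listing: the choice of the current directory and the tuple element names Name and LastWriteTime are assumptions, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Assumption for illustration: query the files in the current directory.
string directory = Directory.GetCurrentDirectory();

// The source is a sequence of strings (file paths), but the select clause
// projects each path into a named tuple, so the element type of the result
// differs from the element type of the source.
IEnumerable<(string Name, DateTime LastWriteTime)> files =
    from path in Directory.GetFiles(directory)
    select (
        Name: Path.GetFileName(path),
        LastWriteTime: File.GetLastWriteTime(path)
    );

foreach ((string Name, DateTime LastWriteTime) file in files)
{
    Console.WriteLine($"{file.Name} ({file.LastWriteTime})");
}
```

Only the two values named in the tuple are carried forward; anything else associated with each file is never placed in the resulting collection.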
Queries written using query expression notation exhibit deferred execution, just as the queries written in Chapter 15 did. Consider again the assignment of a query object to variable selection in Listing 16.1. The creation of the query and the assignment to the variable do not execute the query; rather, they simply build an object that represents the query. The method word.Contains("*") is not called when the query object is created. Instead, the query expression saves the selection criteria to be used when iterating over the collection identified by the selection variable.
To demonstrate this point, consider Listing 16.4 and the corresponding output (Output 16.3).
In Listing 16.4, no space is output within the foreach loop. The side effect of printing a space when the predicate IsKeyword() is executed happens when the query is iterated over—not when the query is created. Thus, although selection is a collection (it is of type IEnumerable<T>, after all), at the time of assignment everything following the from clause serves as the selection criteria. Not until we begin to iterate over selection are the criteria applied.
Now consider a second example (see Listing 16.5 and Output 16.4).
Rather than defining a separate method, Listing 16.5 uses a statement lambda that counts the number of times the method is called.
Two things in the output are remarkable. First, after selection is assigned, DelegateInvocations remains at zero. At the time of assignment to selection, no iteration over Keywords is performed. If Keywords were a property, the property call would run—in other words, the from clause executes at the time of assignment. However, neither the projection, nor the filtering, nor anything after the from clause will execute until the code iterates over the values within selection. It is as though at the time of assignment, selection would more appropriately be called “query.”
Once we call ToList(), however, a term such as selection or Items or something that indicates a container or collection is appropriate because we begin to count the items within the collection. In other words, the variable selection serves the dual purpose of saving the query information and acting like a container from which the data is retrieved.
A second important characteristic of Output 16.4 is that calling Count() a second time causes func to again be invoked once on each item selected. Given that selection behaves both as a query and as a collection, requesting the count requires that the query be executed again by iterating over the IEnumerable<string> collection that selection refers to and counting the items. The C# compiler does not know whether anyone has modified the strings in the array such that the count would now be different, so the counting has to happen anew every time to ensure that the answer is correct and up-to-date. Similarly, a foreach loop over selection would trigger func to be called again for each item. The same is true of all the other extension methods provided via System.Linq.Enumerable.
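The behavior described above can be reproduced with a small sketch. The sample keywords, the invocations counter, and the isContextual delegate are hypothetical names chosen for illustration, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: a trailing '*' marks a contextual keyword.
string[] keywords = { "abstract", "add*", "as", "base", "by*" };

int invocations = 0; // how many times the where predicate has run
Func<string, bool> isContextual = word =>
{
    invocations++;
    return word.Contains('*');
};

// Building the query does not invoke the predicate.
IEnumerable<string> selection =
    from word in keywords
    where isContextual(word)
    select word;
int afterAssignment = invocations;   // 0

selection.Count();
int afterFirstCount = invocations;   // 5: one call per source item

selection.Count();
int afterSecondCount = invocations;  // 10: the query ran again

// ToList() runs the query one more time and caches the results;
// reading cache.Count afterward does not invoke the predicate.
List<string> cache = selection.ToList();
int afterToList = invocations;       // 15

Console.WriteLine(
    $"{afterAssignment} {afterFirstCount} {afterSecondCount} {afterToList} {cache.Count}");
// prints: 0 5 10 15 2
```

Caching with ToList() is the usual way to avoid re-executing a query when the same results are needed more than once and the source is not expected to change.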
Deferred execution is implemented by using delegates and expression trees. A delegate provides the ability to create and manipulate a reference to a method that contains an expression that can be invoked later. An expression tree similarly provides the ability to create and manipulate information about an expression that can be examined and manipulated later.
In Listing 16.5, the predicate expressions of the where clauses and the projection expressions of the select clauses are transformed by the compiler into expression lambdas, and then the lambdas are transformed into delegate creations. The result of the query expression is an object that holds references to these delegates. Only when the query results are iterated over does the query object actually execute the delegates.
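As a rough illustration of that transformation, the sketch below writes the same query twice: once as a query expression and once as the method calls with lambdas that the compiler would produce for it. The sample data is hypothetical, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: a trailing '*' marks a contextual keyword.
string[] keywords = { "abstract", "add*", "as", "base" };

IEnumerable<string> viaQueryExpression =
    from word in keywords
    where !word.Contains('*')
    select word.ToUpperInvariant();

// Roughly what the compiler generates: each clause becomes a call to a
// standard query operator, and each clause expression becomes a lambda
// that is converted to a delegate.
IEnumerable<string> viaMethodCalls =
    keywords
        .Where(word => !word.Contains('*'))
        .Select(word => word.ToUpperInvariant());

Console.WriteLine(string.Join(" ", viaQueryExpression));
// prints: ABSTRACT AS BASE
Console.WriteLine(viaQueryExpression.SequenceEqual(viaMethodCalls));
// prints: True
```

Neither form runs the lambdas until the results are enumerated, which here happens inside string.Join() and SequenceEqual().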
In Listing 16.1, a where clause keeps the reserved keywords and excludes the contextual keywords. This clause filters the collection “vertically”: If you think of the collection as a vertical list of items, the where clause makes that vertical list shorter so that the collection holds fewer items. The filter criteria are expressed with a predicate, which is a lambda expression that returns a bool, such as word.Contains() (as in Listing 16.1) or File.GetLastWriteTime(fileName) < DateTime.Now.AddMonths(-1). The latter is shown in Listing 16.6, whose output appears in Output 16.5.
To order the items using a query expression, you can use the orderby clause, as shown in Listing 16.7.
Listing 16.7 uses the orderby clause to sort the files returned by Directory.GetFiles() first by file size in descending order, and then by filename in ascending order. Multiple sort criteria are separated by commas, such that first the items are ordered by size, and then, if the size is the same, they are ordered by filename. ascending and descending are contextual keywords indicating the sort order direction. Specifying the order as ascending or descending is optional; if the direction is omitted (as it is here on filename), the default is ascending.
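To keep the result predictable, the following sketch applies the same two sort criteria to a small in-memory array instead of real files. The (Name, Length) tuples are hypothetical stand-ins for file information, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for files: a name and a size in bytes.
(string Name, long Length)[] files =
{
    ("b.txt", 200), ("a.txt", 200), ("c.txt", 500), ("d.txt", 100)
};

// Sort by size, largest first; break ties by name in the default
// (ascending) direction.
IEnumerable<string> sorted =
    from file in files
    orderby file.Length descending, file.Name
    select $"{file.Name} ({file.Length})";

Console.WriteLine(string.Join(", ", sorted));
// prints: c.txt (500), a.txt (200), b.txt (200), d.txt (100)
```

The two 200-byte entries show the second criterion at work: they tie on size, so they are ordered by name.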
Listing 16.8 includes a query that is very similar to the query in Listing 16.7, except that the type argument of IEnumerable<T> is FileInfo. This query has a problem, however: We have to redundantly create a FileInfo twice, in both the orderby clause and the select clause.
Unfortunately, although the end result is correct, Listing 16.8 ends up instantiating a FileInfo object twice for each item in the source collection, which is wasteful and unnecessary. To avoid this kind of unnecessary and potentially expensive overhead, you can use a let clause, as demonstrated in Listing 16.9.
The let clause introduces a new range variable that can hold the value of an expression that is used throughout the remainder of the query expression. You can add as many let clauses as you like; simply insert each as an additional clause to the query after the first from clause but before the final select/group by clause.
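The saving can be made visible with a counter. In the sketch below, a hypothetical Parse helper stands in for an expensive operation such as constructing a FileInfo; the helper, the counter, and the sample inputs are all assumptions made for illustration, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int parseCalls = 0; // how many times the "expensive" operation has run

// Hypothetical stand-in for an expensive operation.
int Parse(string text)
{
    parseCalls++;
    return int.Parse(text);
}

string[] inputs = { "10", "3", "25" };

// Without let, the expression appears (and runs) in two clauses.
List<int> withoutLet = (
    from text in inputs
    orderby Parse(text) descending
    select Parse(text)
).ToList();
int callsWithoutLet = parseCalls;   // 6: twice per item

parseCalls = 0;

// With let, the expression runs once per item and its value is reused.
List<int> withLet = (
    from text in inputs
    let value = Parse(text)
    orderby value descending
    select value
).ToList();
int callsWithLet = parseCalls;      // 3: once per item

string values = string.Join(", ", withLet);
Console.WriteLine($"{values}; calls: {callsWithoutLet} vs {callsWithLet}");
// prints: 25, 10, 3; calls: 6 vs 3
```

Both queries produce the same sequence; only the number of calls to the helper differs.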
A common data manipulation scenario is the grouping of related items. In SQL, this generally involves aggregating the items to produce a summary or total or some other aggregate value. LINQ, however, is notably more expressive. LINQ expressions allow for individual items to be grouped into a series of subcollections, and those groups can then be associated with items in the collection being queried. For example, Listing 16.10 and Output 16.6 demonstrate how to group together the contextual keywords and the regular keywords.
There are several things to note in Listing 16.10. First, the query result is a sequence of elements of type IGrouping<bool, string>. The first type argument indicates that the “group key” expression following by was of type bool, and the second type argument indicates that the “group element” expression following group was of type string. That is, the query produces a sequence of groups where the Boolean key is the same for each string in the group.
Because a query with a group by clause produces a sequence of collections, the common pattern for iterating over the results is to create nested foreach loops. In Listing 16.10, the outer loop iterates over the groupings and prints out the type of keyword as a header. The nested foreach loop prints each keyword in the group as an item below the header.
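A sketch of this grouping pattern follows. The keywords array and the header text are hypothetical choices rather than the chapter's listing, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: a trailing '*' marks a contextual keyword.
string[] keywords = { "abstract", "add*", "alias*", "as", "base" };

// The key expression after by is a bool; the element expression after
// group is a string, so each group is an IGrouping<bool, string>.
IEnumerable<IGrouping<bool, string>> groups =
    from word in keywords
    group word by word.Contains('*');

// Outer loop: one iteration per group. Inner loop: the group's items.
foreach (IGrouping<bool, string> wordGroup in groups)
{
    Console.WriteLine(wordGroup.Key ? "Contextual Keywords:" : "Keywords:");
    foreach (string keyword in wordGroup)
    {
        Console.WriteLine($"  {keyword}");
    }
}
```

With this data the query yields two groups: the one keyed false (abstract, as, base) comes first because its key is encountered first in the source, followed by the one keyed true (add*, alias*).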
The result of this query expression is itself a sequence, which you can then query like any other sequence. Listing 16.11 and Output 16.7 show how to create an additional query that adds a projection onto a query that produces a sequence of groups. (The next section, on query continuations, shows a more pleasant syntax for adding more query clauses to a complete query.)
The group clause results in a query that produces a collection of IGrouping<TKey, TElement> objects, just as the GroupBy() standard query operator did (see Chapter 15). The select clause in the subsequent query uses a tuple to effectively rename IGrouping<TKey, TElement>.Key to IsContextualKeyword and to name the subcollection property Items. With this change, the nested foreach loop uses wordGroup.Items rather than wordGroup directly, as shown in Listing 16.10. Another potential item to add to the tuple would be a count of the items within the subcollection. This functionality is already available through LINQ’s wordGroup.Items.Count() method, however, so the benefit of adding it to the tuple directly is questionable.
As we saw in Listing 16.11, you can use an existing query as the input to a second query. However, it is not necessary to write an entirely new query expression when you want to use the results of one query as the input to another. You can extend any query with a query continuation clause using the contextual keyword into. A query continuation is nothing more than syntactic sugar for creating two queries and using the first as the input to the second. The range variable introduced by the into clause (groups in Listing 16.12) becomes the range variable for the remainder of the query; any previous range variables are logically a part of the earlier query and cannot be used in the query continuation. Listing 16.12 rewrites the code of Listing 16.11 to use a query continuation instead of two queries.
The ability to run additional queries on the results of an existing query using into is not specific to queries ending with group clauses, but rather can be applied to all query expressions. Query continuation is simply a shorthand for writing query expressions that consume the results of other query expressions. You can think of into as a “pipeline operator,” because it “pipes” the results of the first query into the second query. You can arbitrarily chain together many queries in this way.
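The sketch below shows one possible query continuation over a grouping. The sample data and the names wordGroup, IsContextualKeyword, and Items are illustrative assumptions, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: a trailing '*' marks a contextual keyword.
string[] keywords = { "abstract", "add*", "alias*", "as", "base" };

// into ends the first query and makes its result the source of the
// clauses that follow. After into, only wordGroup is in scope; the
// earlier range variable word is no longer available.
var groups =
    from word in keywords
    group word by word.Contains('*')
        into wordGroup
    select (
        IsContextualKeyword: wordGroup.Key,
        Items: wordGroup
    );

foreach (var wordGroup in groups)
{
    Console.WriteLine(
        wordGroup.IsContextualKeyword ? "Contextual Keywords:" : "Keywords:");
    foreach (string keyword in wordGroup.Items)
    {
        Console.WriteLine($"  {keyword}");
    }
}
```

Writing the same logic as two separate queries, with the first assigned to a variable that the second uses in its from clause, would produce the same result.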
It is often desirable to “flatten” a sequence of sequences into a single sequence. For example, each member of a sequence of customers might have an associated sequence of orders, or each member of a sequence of directories might have an associated sequence of files. The SelectMany() standard query operator (discussed in Chapter 15) concatenates all the subsequences; to do the same thing with query expression syntax, you can use multiple from clauses, as shown in Listing 16.13.
The preceding query will produce the sequence of characters a, b, s, t, r, a, c, t, a, d, d, *, a, l, i, a, ….
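A short sketch of such a flattening query, using only three hypothetical keywords so that the whole result can be shown (top-level statements, assuming C# 9 or later):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: a trailing '*' marks a contextual keyword.
string[] keywords = { "abstract", "add*", "alias*" };

// The second from clause iterates over each word's characters, so the
// result is one flat sequence of characters rather than a sequence of
// strings.
IEnumerable<char> characters =
    from word in keywords
    from character in word
    select character;

Console.WriteLine(string.Join(", ", characters));
// prints: a, b, s, t, r, a, c, t, a, d, d, *, a, l, i, a, s, *
```

The equivalent standard query operator call would be keywords.SelectMany(word => word).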
Multiple from clauses can also be used to produce a Cartesian product—the set of all possible combinations of several sequences—as shown in Listing 16.14.
This query would produce a sequence of pairs (abstract, 1), (abstract, 2), (abstract, 3), (as, 1), (as, 2), ….
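A sketch of a Cartesian product over two small sequences follows. Both arrays and the tuple element names are hypothetical, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data.
string[] keywords = { "abstract", "as" };
int[] numbers = { 1, 2, 3 };

// The second from clause does not depend on the first range variable,
// so every word is paired with every number.
IEnumerable<(string Word, int Number)> pairs =
    from word in keywords
    from number in numbers
    select (Word: word, Number: number);

Console.WriteLine(string.Join(", ", pairs));
// prints: (abstract, 1), (abstract, 2), (abstract, 3), (as, 1), (as, 2), (as, 3)
```

The result contains keywords.Length * numbers.Length items, which is worth keeping in mind before combining large sequences this way.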
Often, it is desirable to return only distinct (i.e., unique) items from within a collection, discarding any duplicates. Query expressions do not have an explicit syntax for distinct members, but such functionality is available via the query operator Distinct(), which was introduced in Chapter 15. To apply a query operator to a query expression, the expression must be enclosed in parentheses so that the compiler does not think that the call to Distinct() is a part of the select clause. Listing 16.15 gives an example; Output 16.8 shows the results.
In this example, typeof(Enumerable).GetMembers() returns a list of all the members (methods, properties, and so on) for System.Linq.Enumerable. However, many of these members are overloaded, sometimes more than once. Rather than displaying the same member multiple times, Distinct() is called from the query expression. This eliminates the duplicate names from the list. (We cover the details of typeof() and reflection [where methods like GetMembers() are available] in Chapter 18.)
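The sketch below shows the parenthesized form. It is an illustration under the assumption of top-level statements (C# 9 or later); the exact set of member names it returns depends on the version of .NET in use, so no output is shown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The parentheses end the query expression, so Distinct() applies to the
// query's result rather than being parsed as part of the select clause.
IEnumerable<string> memberNames = (
    from member in typeof(Enumerable).GetMembers()
    select member.Name
).Distinct();

foreach (string name in memberNames)
{
    Console.Write($"{name} ");
}
Console.WriteLine();
```

Without the call to Distinct(), a heavily overloaded method such as Where() or Select() would appear in the output once per overload.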