C# Language Specifications
Iterators, iterator blocks and data pipelines
Introduction
LINQ to Objects is based on iterators. These have always been part of .NET, and
indeed are the basis of the foreach
statement - but with LINQ to Objects
they're more important than ever before. In this article I'll explore what iterators
are about, how to implement them in C# 2 (or 3), and then the fundamental aspects
of LINQ to Objects. This isn't a full tutorial on LINQ to Objects, but it provides
background for the core building blocks (and of course a little bit of LINQ
itself).
What's an iterator?
The iterator pattern is a common one, and in .NET it's encapsulated in the IEnumerable
and IEnumerator
interfaces, and their generic counterparts. (Most of the time
I'll use the nongeneric form for descriptions and the generic form in sample code. Obviously
the generic form is strongly typed where the nongeneric form isn't, but the only other difference
is that the generic form extends IDisposable
.)
The basic idea is that as a data consumer, you can ask an IEnumerable
for an IEnumerator
with the GetEnumerator()
call, and
then iterate over the contents of the IEnumerator
(using the MoveNext()
method and Current
property) until either you no longer need any more
data, or the iterator runs out information to return. For example, a simple
foreach
loop over IEnumerable<string>
looks something
like this:
IEnumerator<string> iterator = dataSource.GetEnumerator();
while (iterator.MoveNext())
{
string data = iterator.Current;
// Use data within the body of the loop
}
The above code doesn't call Dispose
on the iterator, for simplicity:
the code generated for a foreach
loop uses a
try
/finally
construct to make sure that the iterator is
disposed when we've finished with it. This is important when the iterator is implemented
by magic of yield return
statements which we'll see later, but it's
not crucial to the basic pattern itself.
Why are there two interfaces involved?
You may well be asking yourself why IEnumerable
doesn't have MoveNext()
and Current
members itself - why do we need an extra interface? Well,
imagine you had two different loops iterating over the same list. You don't want
the loops to interfere with each other, so you need two independent representations
of the idea of "how much of the list I've looked at so far". This concept is precisely
what IEnumerator
is for. Typically it keeps a separate piece of information
(such as the current index in the list) as well as referring to the collection itself. Although
iterators are typically used for in-memory collections, they certainly don't have to be, as we'll
see later. The key is the "ask for data, process it, ask for more data" cycle.
Iterator blocks (C# 2)
In C# 1, implementing iterators was a pain. You had to make your collection type implement
IEnumerable
, then create a separate type with a reference to an instance of its "parent",
usually adding some more state for how far you'd already gone. It was very, very easy to end
up with subtle off-by-one errors, and generally developers didn't do it unless they really had to.
In C# 2, the compiler does all the hard work for you when you use iterator blocks to implement
either IEnumerable
or IEnumerator
(or the generic forms). It builds a state machine
for you, and the iterator obtained effectively executes the code within the iterator block, yielding values
as it goes. It's easier to demonstrate than to explain - here's a complete program:
using System;
using System.Collections.Generic;
class IteratorBlockDemo
{
static IEnumerable<string> GetDemoEnumerable()
{
yield return "start";
for (int i=0; i < 5; i++)
{
yield return i.ToString();
}
yield return "end";
}
static void Main()
{
foreach (string x in GetDemoEnumerable())
{
Console.WriteLine(x);
}
}
}
The output of the program is simple:
start 0 1 2 3 4 end
I won't go into all the details of how iterator blocks work (see chapter 6 of C# in Depth for more information, or the language specification for a lot of detail) but it's important to understand the flow:
Main
callsGetDemoEnumerable()
GetDemoEnumerable()
creates a new instance of the extra type generated by the compiler. None of the source code we've written executes yet.Main
callsMoveNext()
-
The iterator executes code until it reaches a
yield
statement. In our case, this happens immediately. The iterator remembers that the current item should be "start" and returnstrue
to indicate that there is data available. Main
uses theCurrent
property to retrieve the data, then prints it out.Main
callsMoveNext()
again-
The iterator continues execution from the point it had previously reached - in other words, it goes to the line after the first
yield return
. As before, it executes code (initialising thei
variable) until it reaches the nextyield
statement. -
... The pattern repeats, until there's a call to
MoveNext()
which reaches the end of the method - at which point the call returnsfalse
to indicate that there's no more data available.
There are three very important (related) points here: first, that none of our original
source code is executed until the first call to MoveNext()
. Second,
that subsequent calls to MoveNext()
effectively jump back into the source code at the
point they left off before. Yielding a value is sort of like "pausing" the method. Third,
the compiler makes sure that the state of the method (in terms of local variables) is preserved -
even though the i
variable is local inside the iterator block, its value is kept inside the
iterator so that next time MoveNext()
is called, it can still be used.
These features of iterator blocks make them completely different from other methods - it can be quite tricky to get your head round how they work, until it suddenly clicks. It can be quite useful to debug into an iterator block to really see how the flow works for yourself. Now we know about iterators and how they can be easily implemented in C# 2, let's think about how we can chain them together.
Data pipelines and composability
LINQ to Objects is built on the concept of a data pipeline. We start with a data source which implements
IEnumerable
, and chain together lots of operations, each one acting on the result of the
previous one and returning another IEnumerable
- although it could be one with a different type of
result. This act of chaining small operations together to form one large operation is called composition
.
The emphasis in composition is of making each of the small operations genuinely simple. Flexibility is provided in
LINQ to Objects by the use of delegates which can act on the input elements - for example, to act as a predicate
in a filter, indicating whether or not a particular value should pass the filter; or as a projection to map one
value to another.
As a silly demonstration of this, suppose we have a sequence of numbers, in some random order. We want a result which
is a sequence of strings - we want to filter out any negative numbers, sort the rest, and then call ToString
to get the numbers in hexadecimal format. In C# 3 and .NET 3.5, we can do this with the following query expression:
from number in numbers
where number >= 0
orderby number
select number.ToString("x")
The parser translates this into a non-query expression before the compiler deals with it any further:
numbers.Where(number => number >= 0)
.OrderBy(number => number)
.Select(number => number.ToString("x"))
The =>
syntax is for lambda expressions - a simple way of creating delegates (and also expression trees).
The Where
, OrderBy
and Select
methods are extension methods provided by LINQ to
Objects to do filtering, ordering and projections respectively. As you can tell, there are lots of different features interacting
here, and I don't intend to go into the details of most of them - we're just concentrating on the behaviour of iterators in this article.
Deferred execution vs immediate execution
Just as when we created our own iterator using iterator blocks, the expression above doesn't execute any significant
code when it's executed - it sets up the pipeline, but won't actually ask for any data. It only starts doing any
actual filtering/ordering/projecting when the return value of Select
is first asked for some data.
This is called deferred execution. Many LINQ to Objects query operators use deferred execution, especially when they're
conceptually building more of a pipeline. Other query operators such as Count()
, Max()
and
ToList()
use immediate execution - which means they perform their work immediately, and give you a
result with all the appropriate data.
It's important to be aware of what's going on here so you know when operations
will actually take place. For instance, if you specify a projection which might throw the exception, you need to know that
the exception will only occur when the data is being requested: just because the call to Select
doesn't throw an
exception doesn't mean everything is okay!
Streaming vs buffering
There's another concept which is similar to deferred/immediate execution, but subtly different - and that's whether an operator
streams the data or buffers it. An operator which streams data basically only asks for as much data (from the previous
part of the pipeline) as it it needs to give the next result. Select()
is the simplest example of this - it retrieves
the next item, applies the specified delegate to project it to the result, and then yields that result. Where()
is slightly more complicated - it will keep asking for more data until either there is no more data, or until it finds
an element which passes its filter. When it finds a passing element, however, it will yield that value without asking for any
more data.
Buffering operators act differently. When they're asked for data, they buffer up all the data from the earlier piece
of the pipeline, and then work out all the results. Reverse()
is the simplest example of this - in order to
return all of the data in the reverse order, you've got to get to the last element before you can yield anything.
Sorting and grouping also buffer data in LINQ to Objects.
Generators
LINQ to Objects contains a few generator operators - ones which produce a sequence of data without starting
with one. DefaultIfEmpty
is listed as a generator, but in some ways it isn't - it acts on an existing sequence,
albeit one which may be empty. We won't look at that here.
There are three other standard generator operators: Empty
, Repeat
and Range
.
We'll look at possible implementations of each of them, and then work out how we could implement all of them slightly
differently using composition. Of course, in reality all of these are implemented for us already, but it gives nice
practice for understanding iterator blocks and composition.
Empty
Empty
does exactly what you'd expect it to - it returns an empty sequence, i.e. one where the first call
to MoveNext()
returns false
. Strangely enough, implementing it isn't quite as straightforward
as we might expect. We don't need to return any values, so we can just leave the method body empty, right? Well, no... we actually
need code like this:
public static IEnumerable<TResult> Empty<TResult> ()
{
yield break;
}
The yield break;
statement is an "early out" in iterator blocks - it acts like the end of the method, making
MoveNext()
return false
. Why do we need it at all? Well, with an empty method body there's nothing
to tell the compiler that we want an iterator block in the first place! The compiler only treats a method as being implemented
with an iterator block if it encounters yield return
or yield break
within the method body.
This is the only case I can think of where a yield break
directly before the end of a method actually has
an effect.
Repeat
Repeat
takes two parameters - an element to return, and the number of times to return it. It will
yield the same element however many times you've specified, and then stop. It's trivial to implement (including
checking that the count isn't negative):
public static IEnumerable<TResult> Repeat<TResult>(TResult element, int count)
{
if (count < 0)
{
throw new ArgumentOutOfRangeException("count");
}
for (int i=0; i < count; i++)
{
yield return element;
}
}
A note on error checking: there's a problem here. Ideally, we would like our Repeat
method to throw
an exception immediately we call it, if count
is less than 0. Currently, we'll only throw the exception
when some code tries to iterate through the result. That's generally a bad thing, and it would be nice
if the language
gave us more support here . For the moment though, we'll leave our slightly broken implementation as it is.
Range
Range
is also straightforward. Unlike the previous two operators, it's not generic: it always returns an
IEnumerable<int>
. You specify two parameters: the starting point, and how many values to return. The returned
sequence starts at the given value, and counts up - so Range(10, 3)
will yield 10, 11, 12. Here's an implementation,
omitting error checking:
public static IEnumerable<int> Range(int start, int count)
{
for (int i=start; i < start+count; i++)
{
yield return i;
}
}
One generator to rule them all
The thing about the above generators is that they're all a bit specialized. In particular, they take no delegates, unlike
almost everything else in LINQ to Objects! Let's introduce a new generator. It's so general, I reckon it deserves
the name Generate
. It will produce an infinite sequence by taking a starting point, yielding that, and then
applying a delegate to the current value, yielding the current value, and repeating ad infinitum. Here's the code (with
no error checking - we'd normally check that step
wasn't null):
public static IEnumerable<TSource> Generate<TSource>(TSource start,
Func<TSource,TSource> step)
{
TSource current = start;
while (true)
{
yield return current;
current = step(current);
}
}
The code is as simple as we'd hope - but of course, we'd never want to actually iterate over the result of Generate
directly - it would keep going forever. Indeed, Generate
is almost useless without composition. However, by the
cunning use of just a single other operator, we can easily produce the three generators we've already looked at. Rather than
show any infinite sequences here, I'll wait for a moment before actually using Generate
.
Take
Take
is what I think of as a "pipeline" operator - it takes a sequence and returns a sequence, just like
Select
and Where
do. As well as the "source" sequence, you need to specify how many elements of
the source you want to be returned. Take
simply yields elements from the source until either the source
runs out of data, or the requested number of elements have been returned. So, the implementation (minus error checking) could
look like this:
public static IEnumerable<TSource> Take<TSource>(this IEnumerable<TSource> source,
int count)
{
foreach (TSource element in source)
{
if (count==0)
{
yield break;
}
count--;
yield return element;
}
}
Note the use of yield break
here to indicate that we've finished when the counter reaches 0.
(We could have used a simple break
in this particular case, as there's no more code after the loop,
but I thought it would be worth demonstrating a slightly more normal use of yield break
than the one
in our Empty
implementation.) The use of this
in the declaration of the source
parameter indicates that Take
is an extension method.
With Take
and Generate
in our toolkit, we can now reimplement Empty
, Repeat
and Range
using composition:
public static IEnumerable<TResult> Repeat<TResult>(TResult element, int count)
{
return Generate(element, x => x).Take(count);
}
public static IEnumerable<TResult> Empty<TResult>()
{
return Repeat(default(TResult), 0);
}
public static IEnumerable<int> Range(int start, int count)
{
return Generate(start, x => x+1).Take(count);
}
Repeat
is just implemented by first creating an infinite sequence which begins with the specified element
and then applies an identity function to it at each step. We then cut that infinite sequence to the
required length by composing it with Take
. An Empty
sequence is simply the default value
of the required type, repeated zero times. Finally Range
creates another infinite sequence - starting
with the specified value, it just adds one repeatedly - and again we Take
the required number of elements.
It's not particularly useful to actually implement the three generator functions in this manner - but it
does give an interesting insight into composition. The fact that Generate
always creates an infinite
sequence forces you to think about the fact that the values are only generated as they're requested - if Generate
were to try to create all the values first, and then return them in one go, you'd clearly run out of memory quite quickly.
I have a somewhat more complicated example, taken from a blog post
which contains the full source. It's a query expression which is used to generate a visualisation of the Mandelbrot set:
var query = from row in Enumerable.Range(0, ImageHeight)
from col in Enumerable.Range(0, ImageWidth)
// Work out the initial complex value from the row and column
let c = new Complex((col * SampleWidth) / ImageWidth + OffsetX,
(row * SampleHeight) / ImageHeight + OffsetY)
// Work out the number of iterations
select Generate(c, x => x * x + c).TakeWhile(x => x.SquareLength < 4)
.Take(MaxIterations)
.Count() into count
// Map that to an appropriate byte value
select (byte)(count == MaxIterations ? 0 : (count % 255) + 1);
The use of Generate
here is within the nested query used to find the colour to use for a particular
point. We generate an infinite path of values which jump around the field of complex numbers, stopping
when either the point has a modulus of 2 or more, or we've performed as many iterations as we're interested in.
The full code is downloadable from the blog post, but I hope this gives some indication of how powerful the
composition technique is.
Conclusion
LINQ to Objects is full of sequences, and a bunch of simple operators which combine to form myriad possibilities. In C# in Depth, I recommend that when you're faced with a complicated query expression, you look at the sequences from which it's composed. Understanding how iterators work, streaming/buffering and deferred/immediate execution gives important foundation material to being able to work with LINQ effectively. Composing queries is fun, and the iterator blocks which first appeared in C# 2 make iterators reasonably easy to work with. In this article I've given just a few examples, but I'm sure you can come up with your own. Go ahead, create your own operators - and above all, experiment.
For many of us, LINQ isn't just a new technology, but a new way of approaching problems when writing in C#. It always takes a while to really get to grips with fresh ideas, so I'd heartily recommend playing with LINQ before you need to use it professionally. While the other LINQ providers are perhaps "sexier" than LINQ to Objects, there's a purity, simplicity and immediacy involved when you can just define a collection in memory and manipulate it to your heart's content.
Likewise iterator blocks are a mystery to many developers. Unless you've already been thinking in sequences, you may well have never written an iterator block. Hopefully this article has shown you how simple they can be, along with a couple of warnings where you may have expected slightly different behaviour. Iterator blocks were already useful in C# 2, but when combined with LINQ they can be truly spectacular, all within tiny pieces of code. Again - experiment. It'll all be useful in the long run, and if you're anything like me, I promise you'll have a lot of fun along the way.