Tpl Dataflow walkthrough – Part 5

2012/01/28


This post is a complete walkthrough of a web crawler sample that was built purely with Tpl Dataflow.

It was built on .NET 4.5 / C# 5 (on a virtual machine running VS 11).

I will analyze each part of the sample, discussing both the Dataflow blocks and the patterns in use.

The sample code is available here (it is a VS 11 project).


During the walkthrough you will see the following Tpl Dataflow blocks:

  • TransformBlock
  • TransformManyBlock
  • ActionBlock
  • BroadcastBlock

You will see how the async / await signature of the Dataflow blocks is better suited for executing IO-bound operations (without blocking a ThreadPool worker thread).

I should also mention that this post is part of the Tpl Dataflow series, which you should read before this one.

Disclaimer: the web crawler sample is for educational purposes only (running a web crawler application may be forbidden by the law of your country).

The sample topology:

A Tpl Dataflow application is usually a collection of agents that are linked together to compose a complete solution. Each agent has its own responsibilities and concerns. The following diagram presents the agent topology for this sample:

[Diagram: the agent topology of the crawler sample]

Agents: block types and responsibilities

Downloader: the responsibility of the downloader is to download the html of a web page. It uses a TransformBlock<Tin, Tout>, which belongs to the executor block family. The transform block takes a url as its input message and produces the page's html as its output.

The transform block is constructed from:

  • an input buffer (for the urls)
  • a task (which performs the transformation)
  • an output buffer (for the downloaded html)

The task takes one message at a time from the input buffer, transforms it with a Func<Tin, Tout> delegate that it receives as a constructor parameter, and puts the result in the output buffer, where it is available for other blocks to consume.

Later we will see that our crawler transformation actually takes a Func<Tin, Task<Tout>>, which is a better signature for IO-bound operations (I will discuss it later).
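As a minimal sketch (not part of the crawler code, and the names here are purely illustrative), these are the two constructor shapes side by side:

Code Snippet
  1. // synchronous transform – the delegate keeps the worker thread busy until it returns
  2. var htmlLength = new TransformBlock<string, int>(html => html.Length);
  3. // asynchronous transform – the delegate returns a Task, so the worker thread
  4. // is released while the awaited IO is in flight
  5. var pageDownloader = new TransformBlock<string, string>(
  6.     async url => await new WebClient().DownloadStringTaskAsync(url));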


The transform block is a propagator block, which means it is exposed both as a target and as a source block. It implements IPropagatorBlock<Tin, Tout>.

The following snippet shows that IPropagatorBlock is simply an encapsulation of ITargetBlock and ISourceBlock:

Code Snippet
  1. public interface IPropagatorBlock<in TInput, out TOutput>
  2.     : ITargetBlock<TInput>,
  3.       ISourceBlock<TOutput>,
  4.       IDataflowBlock
  5. {
  6. }

Start crawling
Code Snippet
  1. var downloader = new TransformBlock<string, string>(
  2.     async (url) =>
  3.     {
  4.         // thanks to IOCP, the ThreadPool worker thread returns to the pool while awaiting
  5.         WebClient wc = new WebClient();
  6.         string result = await wc.DownloadStringTaskAsync(url);
  7.         return result;
  8.     }, downloaderOptions);

As I mentioned earlier, the downloader constructor takes a Func<Tin, Task<Tout>>, therefore we can apply an async lambda expression (line 2). The code awaits the download (at line 6).

If you are not familiar with the async / await concept, you can read this post or more posts here.

Anyway, while awaiting the download (DownloadStringTaskAsync), the block's task actually returns its worker thread to the ThreadPool and takes advantage of IOCP (IO Completion Ports). This is an IO-bound operation, which means no CPU resources are needed while the network card fetches the data from the network.
It is important to understand that while the network card is handling the request, the agent's task does not fetch another message from the buffer; the task resumes only when the data becomes available.
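The downloader snippet above also refers to a downloaderOptions variable that is not shown in the post. A plausible definition (an assumption on my part, the actual sample may configure it differently), together with a basic usage of the block as both target and source, would look like this:

Code Snippet
  1. // assumption: illustrative values only, not necessarily the sample's configuration
  2. var downloaderOptions = new ExecutionDataflowBlockOptions
  3. {
  4.     BoundedCapacity = 100,         // limit how many urls queue up in the input buffer
  5.     MaxDegreeOfParallelism = 4     // how many downloads may run concurrently
  6. };
  7.  
  8. // the transform block is a propagator: it accepts urls as a target
  9. // and exposes the downloaded html as a source
  10. downloader.Post("http://example.com");
  11. string html = downloader.Receive();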

Analyzing the html

The crawler uses two agents to analyze the downloaded html:

  • link parser (which will look for link elements <a href="…"/>)
  • image parser (which will look for image elements <img src="…"/>)

Both agents should be linked to the downloader agent.
The problem is that linking both of them directly to the downloader agent will result in starvation of one of them.
Unlike Rx, most blocks forward a message to the first linked target that accepts it and ignore the other linked targets, which means each message will be handled by a single agent only.
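Here is a small illustration (not from the crawler) of that behavior; with two greedy targets linked to the same source, the first one accepts every message and the second one starves:

Code Snippet
  1. var source = new BufferBlock<int>();
  2. var first = new ActionBlock<int>(i => Console.WriteLine("first got " + i));
  3. var second = new ActionBlock<int>(i => Console.WriteLine("second got " + i));
  4. source.LinkTo(first);
  5. source.LinkTo(second);   // only receives messages that 'first' declines
  6. source.Post(1);          // handled by 'first' only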

Broadcast behavior can be achieved by using a BroadcastBlock<T>, which is part of the pure buffer family.
The broadcast block is constructed from:

  • an input buffer
  • a task
  • an output buffer that holds a single item (the most recent message)


The task fetches a message from the input buffer and places it in the output buffer; from the output buffer the message is offered to the linked blocks.

The broadcast block takes a Func<T,T> delegate as a constructor parameter; the idea behind it is cloning (which enables separation of the messages).
If you pass a reference-type message to multiple agents without cloning, changes made by one agent will be visible to all the other agents.

The broadcast block will invoke the cloning delegate before sending the message to the linked agents.
The cloning pattern ensures that only a single block processes a given message instance at a time; this maintains message ownership and avoids the need for data synchronization for thread safety.
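For example, if the message were a mutable reference type instead of a string, the cloning delegate might look like this (the Page type here is hypothetical, just for illustration):

Code Snippet
  1. // hypothetical mutable message type – not part of the crawler sample
  2. class Page
  3. {
  4.     public string Url { get; set; }
  5.     public string Html { get; set; }
  6. }
  7.  
  8. // the cloning delegate hands each linked target its own copy of the message
  9. var pageBroadcaster = new BroadcastBlock<Page>(
  10.     p => new Page { Url = p.Url, Html = p.Html });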

The crawler uses the following block definition for broadcasting:

Code Snippet
  1. var contentBroadcaster = new BroadcastBlock<string>(s => s);

In our case the html content is a string, which is immutable, therefore no real cloning is needed.

The crawler will link the agents (blocks) to each other after all the relevant blocks have been constructed; right now we are focusing on the agents themselves.

Link parser

The link parser uses the following regular expression in order to fetch all the links (<a href=…"/>) from the html and extract each link's url.

Code Snippet
  1. private const string LINK_REGEX_HREF =
  2.     "\\shref=('|\\\")?(?<LINK>http\\://.*?(?=\\1)).*>";
  3. private static readonly Regex _linkRegexHRef =
  4.     new Regex(LINK_REGEX_HREF);

Unlike the downloader agent, which gets a single input (url) and produces a single output (html), the link parser produces multiple outputs (links) per input (html).
You could use the transform block and set the output type to an array of links, but Tpl Dataflow has a better block for this scenario.
Because the processing of each link is independent of the other links, it is better if the output buffer contains the flattened links rather than a collection of link arrays.

The crawler uses the TransformManyBlock<Tin,Tout>. This block is similar to the transform block with only one difference: the delegate passed to the constructor is one of the following:

  • Func<Tin, IEnumerable<Tout>>
  • Func<Tin, Task<IEnumerable<Tout>>>

The block's task will flatten the returned results and put each of them, separately, in the output buffer.


This is the code for the link parser agent:

Code Snippet
  1. var linkParser = new TransformManyBlock<string, string>(
  2.        (html) =>
  3.        {
  4.            var output = new List<string>();
  5.            var links = _linkRegexHRef.Matches(html);
  6.            foreach (Match item in links)
  7.            {
  8.                var value = item.Groups["LINK"].Value;
  9.                output.Add(value);
  10.            }
  11.            return output;
  12.        });

It is very straightforward: parse each html with the regex and return a list of results, which the block will flatten into the output buffer.

Image parser

The image parser is quite similar to the link parser.
The only difference is that it uses a different regular expression, which extracts the image's url.

The regex part is:

Code Snippet
  1. private const string IMG_REGEX =
  2.     "<\\s*img [^\\>]*src=('|\")?(?<IMG>http\\://.*?(?=\\1)).*>\\s*([^<]+|.*?)?\\s*</a>";
  3. private static readonly Regex _imgRegex =
  4.     new Regex(IMG_REGEX);

and the parser agent code is:

Code Snippet
  1. var imgParser = new TransformManyBlock<string, string>(
  2.         (html) =>
  3.         {
  4.             var output = new List<string>();
  5.             var images = _imgRegex.Matches(html);
  6.             foreach (Match item in images)
  7.             {
  8.                 var value = item.Groups["IMG"].Value;
  9.                 output.Add(value);
  10.             }
  11.             return output;
  12.         });

Writer agent

The last operational agent is the writer agent, which downloads an image from a url and saves it to the local disk.

The writer uses a simple action block, which is a simple executor block that has an input buffer and a task.


The task fetches messages from the buffer and executes a delegate that is given as a constructor parameter.
The delegate signature can be either Action<T> or Func<T, Task>. The latter is great for IO-bound operations (for the same reasons discussed earlier when we looked at the transform block signature).

Because the writer performs two IO-bound operations:

  • download the image from the web
  • write the image to the file system

the crawler uses the Func<T, Task> signature.
The writer code is:

Code Snippet
  1. var writer = new ActionBlock<string>(async url =>
  2. {
  3.     WebClient wc = new WebClient();
  4.     // thanks to IOCP, the ThreadPool worker thread returns to the pool while awaiting
  5.     byte[] buffer = await wc.DownloadDataTaskAsync(url);
  6.     string fileName = Path.GetFileName(url);
  7.  
  8.     string name = @"Images\" + fileName;
  9.  
  10.     using (Stream srm = File.OpenWrite(name))
  11.     {
  12.         await srm.WriteAsync(buffer, 0, buffer.Length);
  13.     }
  14. });

The first await, at line 5, suspends the task until the network card has fetched the data,
and the second await, at line 12, suspends it until the file system controller completes the write.

You may have noticed that the second await is inside a using block; you can read more about this topic in this post.

Link it together

Right now we have most of our building blocks, and it is time to define the data-flow by linking the blocks to each other.

The downloader should be linked to the content broadcaster, which in turn should be linked both to the image parser and to the link parser; the image parser should be linked to the writer, and the link parser should be linked back to the downloader (so it can crawl further).

But there is one last issue.
It happens that some web pages have links that target an image. This leads to more complex linking, where the link parser should be linked both to the downloader and, conditionally, to the writer for those urls that have an image suffix.
As we discussed earlier, a direct link from the link parser to both the downloader and the writer would result in starvation of one of those agents.
We need a final broadcast block to handle this distribution.

Code Snippet
  1. var linkBroadcaster = new BroadcastBlock<string>(s => s);

The link parser will be linked to the broadcaster, and the broadcaster will be linked to both the downloader and the writer.

We have already covered the conditional link from the link parser to the writer, but it will also be more effective if the link from the link parser to the downloader passes only those pages that are most likely to contain useful data, such as php, aspx, htm, etc.

Filtering linked messages

The following predicates will be used in order to filter the linked messages:

Code Snippet
  1. StringComparison comparison = StringComparison.InvariantCultureIgnoreCase;
  2. Predicate<string> linkFilter = link =>
  3.     link.IndexOf(".aspx", comparison) != -1 ||
  4.     link.IndexOf(".php", comparison) != -1 ||
  5.     link.IndexOf(".htm", comparison) != -1 ||
  6.     link.IndexOf(".html", comparison) != -1;
  7. Predicate<string> imgFilter = url =>
  8.     url.EndsWith(".jpg", comparison) ||
  9.     url.EndsWith(".png", comparison) ||
  10.     url.EndsWith(".gif", comparison);

The first predicate (line 2) filters the messages targeting the downloader agent, and the second (line 7) filters the link parser results targeting the writer agent.

Compose the data-flow

Finally we get to the agent composition.


Code Snippet
  1. IDisposable disposeAll = new CompositeDisposable(
  2.     // from [downloader] to [contentBroadcaster]
  3.     downloader.LinkTo(contentBroadcaster),
  4.     // from [contentBroadcaster] to [imgParser]
  5.     contentBroadcaster.LinkTo(imgParser),
  6.     // from [contentBroadcaster] to [linkParserHRef]
  7.     contentBroadcaster.LinkTo(linkParser),
  8.     // from [linkParser] to [linkBroadcaster]
  9.     linkParser.LinkTo(linkBroadcaster),
  10.     // conditional link from [linkBroadcaster] to [downloader]
  11.     linkBroadcaster.LinkTo(downloader, linkFilter, true),
  12.     // from [linkBroadcaster] to [writer]
  13.     linkBroadcaster.LinkTo(writer, imgFilter, true),
  14.     // from [imgParser] to [writer]
  15.     imgParser.LinkTo(writer));

Each LinkTo operation returns a disposable instance that can be used to dispose the link when it is no longer needed. The crawler composes all those disposables into a single disposable called disposeAll by using CompositeDisposable, which is part of the Rx library.

You can see the conditional LinkTo at lines 11 and 13.
It is very important to set the last parameter of the LinkTo to true if you don't want the link to be disposed when a message doesn't match the filter criteria.

Summary

This post was a walkthrough of a web crawler sample.
The complete sample, which is available here (VS 11), also includes exception handling, agent termination after x seconds, prevention of processing the same url twice, and more. For simplicity, the code within this post is a simplified version.
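The post does not show how the duplicate-url prevention is implemented; one possible approach (an assumption on my part, not the sample's actual code) is to remember the urls already seen and fold that check into the link filter:

Code Snippet
  1. // assumption: one way to avoid crawling the same url twice – not the sample's actual code
  2. var seenUrls = new ConcurrentDictionary<string, bool>();
  3. Predicate<string> notSeenBefore = url => seenUrls.TryAdd(url, true);
  4.  
  5. // combined with the existing linkFilter when linking back to the downloader
  6. linkBroadcaster.LinkTo(downloader,
  7.     url => linkFilter(url) && notSeenBefore(url), true);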
