Tpl Dataflow walkthrough – Part 5

This post is a complete walkthrough of a web crawler sample that was built purely with Tpl Dataflow.

It was built on .NET 4.5 / C# 5 (on a virtual machine using VS 11).

I will analyze each part of the sample, discussing both the Dataflow blocks and the patterns in use.

The sample code is available here (it is a VS 11 project).


During the walkthrough you will see the following Tpl Dataflow blocks:

  • TransformBlock
  • TransformManyBlock
  • ActionBlock
  • BroadcastBlock

You will see how the async / await signature of the Dataflow blocks is better suited for executing IO bound operations (without freezing a ThreadPool worker thread).

I should also mention that this post is part of the Tpl Dataflow series, which you should read before this one.

Disclaimer: the web crawler sample is for educational purposes only (running a web crawler application may be forbidden by the law of your country).

The sample topology:

A Tpl Dataflow application is usually a collection of agents linked together to compose a complete solution. Each agent has its own responsibilities and concerns. The following diagram presents the agent topology for this sample:

[Diagram: the crawler's agent topology]

Agent block types and responsibilities

Downloader: the responsibility of the downloader is to download the html of a web page. It uses a TransformBlock<Tin, Tout>, which belongs to the executor block family. The transform block takes a url as its input message and produces the page’s html as its output.

The transform block is constructed from:

  • input buffer (for the urls)
  • task (performs the transformation)
  • output buffer (for the downloaded html)

The task takes one message at a time from the input buffer, transforms it with the Func<Tin, Tout> delegate it receives as a constructor parameter, and puts the result in the output buffer, where it is available for other blocks to consume.

Later we will see that our crawler’s transformation actually takes a Func<Tin, Task<Tout>>, which is a better signature for IO bound operations (I will discuss it later).
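
For example, here is a minimal, self-contained transform (not part of the crawler; the block and values are made up for illustration):

Code Snippet
// minimal synchronous TransformBlock sketch: the block buffers its input,
// runs the Func<Tin, Tout> delegate on its task, and buffers the output
// until another block (or a Receive call) consumes it.
// Post and Receive are extension methods in System.Threading.Tasks.Dataflow.
var lengthBlock = new TransformBlock<string, int>(html => html.Length);

lengthBlock.Post("<html>...</html>"); // into the input buffer
int length = lengthBlock.Receive();   // out of the output buffer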

[Diagram: TransformBlock structure]

The transform block is a propagator block, which means it is exposed both as a target and as a source block; it implements IPropagatorBlock<Tin, Tout>.

The following snippet shows that IPropagatorBlock is simply an encapsulation of ITargetBlock and ISourceBlock:

Code Snippet
public interface IPropagatorBlock<in TInput, out TOutput>
    : ITargetBlock<TInput>,
      ISourceBlock<TOutput>,
      IDataflowBlock
{
}
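
A quick illustration (using the downloader block that will be defined in the next snippet): because a TransformBlock implements IPropagatorBlock, the same instance can be handed out under any of the three interfaces:

Code Snippet
// the downloader can be viewed as a target (urls in), a source (html out), or both
IPropagatorBlock<string, string> propagator = downloader;
ITargetBlock<string> asTarget = propagator;  // other blocks post urls into it
ISourceBlock<string> asSource = propagator;  // its html output is linked onward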

Start crawling
Code Snippet
  1. var downloader = new TransformBlock<string, string>(
  2.     async (url) =>
  3.     {
  4.         // using IOCP, the ThreadPool worker thread returns to the pool while awaiting
  5.         WebClient wc = new WebClient();
  6.         string result = await wc.DownloadStringTaskAsync(url);
  7.         return result;
  8.     }, downloaderOptions);

As mentioned earlier, the downloader constructor takes a Func<Tin, Task<Tout>>, therefore we can pass an async lambda expression (line 2). The code awaits the download (at line 6).

If you are not familiar with the async / await concept, you can read this post or more posts here.

Anyway, while awaiting the download (DownloadStringTaskAsync) the block’s task actually returns its worker thread to the ThreadPool and takes advantage of IOCP (IO Completion Ports). This is an IO bound operation, which means that no CPU resources are needed while the network card fetches the data from the network.
It is important to understand that while the network card is handling the request, the agent’s task does not fetch another message from the buffer; the task will be resumed when the data becomes available.
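
Note that the sample passes a downloaderOptions instance to the TransformBlock constructor; the post does not show its definition, but a plausible sketch looks like this (the property values below are assumptions for illustration, not the sample’s actual settings):

Code Snippet
// assumed definition of downloaderOptions (not shown in this post);
// the values are illustrative only
var downloaderOptions = new ExecutionDataflowBlockOptions
{
    // cap the number of urls waiting in the input buffer
    BoundedCapacity = 100,
    // allow a few downloads to be in flight concurrently
    MaxDegreeOfParallelism = 4
};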

Analyzing the html

The crawler uses two agents to analyze the downloaded html:

  • link parser (which looks for link elements <a href="…"/>)
  • image parser (which looks for image elements <img src="…"/>)

Both agents should be linked to the downloader agent.
The problem is that linking both agents directly to the downloader agent will starve one of them.
Unlike Rx, most blocks forward each message to the first linked target that accepts it and ignore the other linked targets, which means that a message will be handled by a single agent only.
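
A small sketch of this greedy behavior (the blocks here are made up for illustration):

Code Snippet
// with two plain ActionBlocks linked to the same source,
// the first linked target accepts every message and the second is starved
var source = new BufferBlock<int>();
var first = new ActionBlock<int>(i => Console.WriteLine("first got " + i));
var second = new ActionBlock<int>(i => Console.WriteLine("second got " + i));

source.LinkTo(first);
source.LinkTo(second); // never receives anything

source.Post(1);
source.Post(2); // both messages are handled by "first"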

Broadcast behavior can be achieved by using a BroadcastBlock<T>, which is part of the pure buffer family.
The broadcast block is constructed from:

  • input buffer
  • task
  • output buffer holding a single item (the most recent message)

[Diagram: BroadcastBlock structure]

The task fetches a message from the input buffer and places it in the output buffer; from the output buffer the message is offered to every linked block.

The broadcast block takes a Func<T, T> delegate as a constructor parameter; the idea behind it is cloning (which enables separation of the messages).
If you pass a reference type message to multiple agents without cloning, changes made by one agent will be visible to all the other agents.

The broadcast block invokes the cloning delegate before sending the message to the linked agents.
The cloning pattern ensures that only a single block processes a given message instance at a time; this maintains message ownership and avoids the need for data synchronization for thread safety.
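
For instance, if the message were a mutable reference type (the Page class below is hypothetical, just for illustration), the cloning delegate would hand each linked agent its own copy:

Code Snippet
// hypothetical mutable message type, for illustration only
public class Page
{
    public string Url { get; set; }
    public string Html { get; set; }
}

// the cloning delegate runs before the message is offered to the linked targets,
// so each consumer gets a private instance and no synchronization is needed
var pageBroadcaster = new BroadcastBlock<Page>(
    p => new Page { Url = p.Url, Html = p.Html });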

The crawler uses the following block definition for broadcasting:

Code Snippet
var contentBroadcaster = new BroadcastBlock<string>(s => s);

In our case the html content is a string, which is immutable, therefore no real cloning is needed and the identity delegate is enough.

The crawler links the agents (blocks) to each other after all the relevant blocks have been constructed; right now we are focusing on the agents themselves.

Link parser

The link parser uses the following regular expression to find all the link elements (<a href="…"/>) in the html and extract each link’s url.

Code Snippet
private const string LINK_REGEX_HREF =
    "\\shref=('|\\\")?(?<LINK>http\\://.*?(?=\\1)).*>";
private static readonly Regex _linkRegexHRef =
    new Regex(LINK_REGEX_HREF);

Unlike the downloader agent, which gets a single input (url) and produces a single output (html), the link parser produces multiple outputs (links) per input (html).
You could use the transform block and set the output type to an array of links, but Tpl Dataflow has a better block for this scenario.
Because the processing of each link is independent of the other links, it is better if the transform output buffer contains flattened link objects rather than collections of link arrays.

The crawler uses the TransformManyBlock<Tin, Tout>. This block is similar to the transform block with one difference: the delegate passed to the constructor is one of the following:

  • Func<Tin, IEnumerable<Tout>>
  • Func<Tin, Task<IEnumerable<Tout>>>

The block’s task flattens the returned sequence and puts each extracted result, separately, in the output buffer.

[Diagram: TransformManyBlock structure]

This is the code for the link parser agent:

Code Snippet
var linkParser = new TransformManyBlock<string, string>(
       (html) =>
       {
           var output = new List<string>();
           var links = _linkRegexHRef.Matches(html);
           foreach (Match item in links)
           {
               var value = item.Groups["LINK"].Value;
               output.Add(value);
           }
           return output;
       });

It is very straightforward: parse each html with the regex and return a list of results, which the block will flatten into the output buffer.
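
As a stylistic alternative (not the sample’s code), the same parser could be written more compactly with LINQ, since the TransformManyBlock only needs an IEnumerable<Tout> back:

Code Snippet
// a LINQ-flavored alternative to the loop above (requires System.Linq)
var linkParser = new TransformManyBlock<string, string>(
    html => _linkRegexHRef.Matches(html)
                          .Cast<Match>()
                          .Select(m => m.Groups["LINK"].Value));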

Image parser

The image parser is quite similar to the link parser.
The only difference is that it uses a different regular expression, which extracts the image’s url.

The regex part is:

Code Snippet
private const string IMG_REGEX =
    "<\\s*img [^\\>]*src=('|\")?(?<IMG>http\\://.*?(?=\\1)).*>\\s*([^<]+|.*?)?\\s*</a>";
private static readonly Regex _imgRegex =
    new Regex(IMG_REGEX);

And the parser agent code is:

Code Snippet
var imgParser = new TransformManyBlock<string, string>(
        (html) =>
        {
            var output = new List<string>();
            var images = _imgRegex.Matches(html);
            foreach (Match item in images)
            {
                var value = item.Groups["IMG"].Value;
                output.Add(value);
            }
            return output;
        });

Writer agent

The last operational agent is the writer agent, which downloads an image from a url and saves it to the local disk.

The writer uses a simple action block, which is a basic executor block that has an input buffer and a task.

[Diagram: ActionBlock structure]

The task fetches messages from the buffer and executes a delegate that is given as a constructor parameter.
The delegate signature can be either Action<T> or Func<T, Task>. The latter is great for IO bound operations (for the same reasons discussed earlier when we were looking at the transform block signature).

Because the writer performs two IO bound operations:

  • download the image from the web
  • write the image to the file system

the crawler uses the Func<T, Task> signature.
the writer code is:

Code Snippet
  1. var writer = new ActionBlock<string>(async url =>
  2. {
  3.     WebClient wc = new WebClient();
  4.     // using IOCP, the ThreadPool worker thread returns to the pool while awaiting
  5.     byte[] buffer = await wc.DownloadDataTaskAsync(url);
  6.     string fileName = Path.GetFileName(url);
  7.     string name = @"Images\" + fileName;
  8.     using (Stream srm = File.OpenWrite(name))
  9.     {
  10.         await srm.WriteAsync(buffer, 0, buffer.Length);
  11.     }
  12. });

The first await, at line 5, suspends until the network card completes the download, and the second, at line 10, suspends until the file system controller completes the write.

You may have noticed that the second await is within a using block; you can read more about this topic in this post.

Link it together

Right now we have most of our building blocks, and it is time to define the data-flow by linking the blocks to each other.

The downloader should be linked to the content broadcaster, which in turn should be linked to both the image and link parsers; the image parser should be linked to the writer, and the link parser should be linked back to the downloader (so it can crawl further).

But there is one last issue.
It happens that some web pages have links that target an image. This leads us to more complex linking, where the link parser should be linked both to the downloader and, via a conditional link, to the writer for those urls that have an image suffix.
As we discussed earlier, a direct link from the link parser to both the downloader and the writer will starve one of those agents.
We need a final broadcast block to handle this distribution task.

Code Snippet
var linkBroadcaster = new BroadcastBlock<string>(s => s);

The link parser will be linked to the broadcaster, and the broadcaster will be linked to both the downloader and the writer.

We have spoken of the conditional link between the link parser and the writer, but it will be more effective if the link from the link parser to the downloader passes only those pages that are most likely to contain useful data, like php, aspx, htm, etc.

Filtering linked messages

The following predicates will be used to filter linked messages:

Code Snippet
  1. StringComparison comparison = StringComparison.InvariantCultureIgnoreCase;
  2. Predicate<string> linkFilter = link =>
  3.     link.IndexOf(".aspx", comparison) != -1 ||
  4.     link.IndexOf(".php", comparison) != -1 ||
  5.     link.IndexOf(".htm", comparison) != -1 ||
  6.     link.IndexOf(".html", comparison) != -1;
  7. Predicate<string> imgFilter = url =>
  8.     url.EndsWith(".jpg", comparison) ||
  9.     url.EndsWith(".png", comparison) ||
  10.     url.EndsWith(".gif", comparison);

The first predicate (line 2) filters what reaches the downloader agent, and the second (line 7) filters the link parser results that target the writer agent.

Compose the data-flow

Finally we get to the agent composition.

[Diagram: the composed data-flow]

Code Snippet
  1. IDisposable disposeAll = new CompositeDisposable(
  2.     // from [downloader] to [contentBroadcaster]
  3.     downloader.LinkTo(contentBroadcaster),
  4.     // from [contentBroadcaster] to [imgParser]
  5.     contentBroadcaster.LinkTo(imgParser),
  6.     // from [contentBroadcaster] to [linkParserHRef]
  7.     contentBroadcaster.LinkTo(linkParser),
  8.     // from [linkParser] to [linkBroadcaster]
  9.     linkParser.LinkTo(linkBroadcaster),
  10.     // conditional link from [linkBroadcaster] to [downloader]
  11.     linkBroadcaster.LinkTo(downloader, linkFilter, true),
  12.     // from [linkBroadcaster] to [writer]
  13.     linkBroadcaster.LinkTo(writer, imgFilter, true),
  14.     // from [imgParser] to [writer]
  15.     imgParser.LinkTo(writer));

Each LinkTo operation returns a disposable instance which can be used to dispose the link when it is no longer needed. The crawler composes all those disposables into a single disposable called disposeAll by using CompositeDisposable, which is part of the Rx library.

You can see the conditional LinkTo at lines 11 and 13.
It is very important to set the last parameter of LinkTo to true if you don’t want the link to be disposed when the filter doesn’t match the criteria.
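
With the links in place, the crawl is kicked off by posting a seed url into the downloader, and the composite disposable can tear all the links down when crawling should stop (the url below is a placeholder):

Code Snippet
// start the data-flow with a seed url (placeholder address)
downloader.Post("http://example.com/");

// ... later, when the crawl should stop, unlink everything at once
disposeAll.Dispose();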


This post was a walkthrough of a web crawler sample.
The complete sample, which is available here (VS 11), also includes exception handling, agent termination after a given number of seconds, prevention of processing the same url twice, and more. For simplicity, the code within this post was a simplified version.
