March 2009 - Posts - Natalie Reznik

Natalie Reznik

"It's not a bug, it's a feature !" Unknown Programmer

Parsing HTML with RegEx

Web developers often have to work with data that comes from a not very conventional source - HTML. For example, a client orders a mobile version of a site but is unwilling to give access to the database, provide a web service, or even an RSS feed. "Take it all from our site," he says. You're not always in a position to insist on a proper data source and explain the enormous disadvantages and performance problems that will inevitably arise in this scenario. Your project manager comes to you and says that he understands everything, but you should get the client off his back: "But you can do it, right? So let's just do it fast and forget about it; we have other urgent issues to attend to, so let this one be quick and dirty..."
And so you have to admit that there's no other option but to find your regex cheat sheet. Regex is something you struggle to remember when you need it, and forget almost totally after a couple of days of not using it. For many developers regex is like going to the dentist - you hate it, you're afraid of it, you do it only when there's no other choice, and you're thankful when it's over. So if you can't find your cheat sheet, here's a helpful link:

http://www.mikesdotnetting.com/Article.aspx?ArticleID=46

What next? "I should make it flexible enough not to fail even if there are slight changes in the source," you say.
Well, sure, that's important, but don't be too optimistic. Remember Murphy's law - if anything can go wrong, it will. HTML will change, and your regex will certainly fail from time to time. So it is more important to make your regex easy to fix, and it is best to store it in a file or a DB rather than hardcode it in your app. Keep your regex as short and simple as possible. Long, complex regexes with numerous conditions are barely human-readable, and their maintenance is a black hole where tons of development hours disappear, which will certainly make your project manager very unhappy.
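
Keeping the regex outside the app could look roughly like the sketch below. It is written in Python only for illustration (the patterns in this post use .NET syntax, where a named group is (?<name>...) rather than Python's (?P<name>...)). The file name patterns.json, the key some_title and the helper load_patterns are all made up for the example.

```python
import json
import re
import tempfile
from pathlib import Path


def load_patterns(path):
    """Read {name: regex} pairs from a JSON file and compile each one."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return {name: re.compile(rx) for name, rx in raw.items()}


# Hypothetical patterns file, written to a temp dir so the sketch runs on its own.
# In a real project it would sit next to the app's other configuration.
patterns_file = Path(tempfile.mkdtemp()) / "patterns.json"
patterns_file.write_text(
    json.dumps({"some_title": r"<td><b>(?P<SomeTitle>.*?)</b></td>"}),
    encoding="utf-8",
)

patterns = load_patterns(patterns_file)
match = patterns["some_title"].search("<td><b>Hello</b></td>")
title = match.group("SomeTitle") if match else None
```

When the client's HTML changes, only the JSON file needs editing; the code that calls load_patterns stays as it is.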

On the one hand, you don't want to run your string through a regex too many times, as it will affect performance; on the other hand, it may be unavoidable to break your regex into a few short parts and run them in a loop. The good news is that this makes your regex maintainable, and the next time your client changes his site's HTML you'll be able to adjust it within minutes. So it's up to you to find the golden mean between the two extremes. Now that the inevitable is taken care of, you can focus on making it happen as rarely as possible.
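
One possible reading of "short parts in a loop" is one small pattern per value, as in this Python sketch. The field names, the labels and the sample page are invented for the demo, and Python writes named groups as (?P<name>...).

```python
import re

# Hypothetical field patterns: one short regex per value instead of one giant expression.
FIELD_PATTERNS = {
    "title": r"<td>some title:</td>\s*<td><b>(?P<value>.*?)</b></td>",
    "price": r"<td>some price:</td>\s*<td><b>(?P<value>.*?)</b></td>",
}


def extract_fields(html):
    """Run each short pattern over the page; a pattern that fails yields None
    for its own field and does not affect the others."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, html)
        result[name] = match.group("value") if match else None
    return result


# Made-up page in which the client has renamed the "some price:" label.
page = (
    "<tr><td>some title:</td>\n<td><b>Blue Widget</b></td></tr>\n"
    "<tr><td>some cost:</td>\n<td><b>9.99</b></td></tr>"
)
fields = extract_fields(page)
```

Here fields["title"] is still extracted while fields["price"] comes back as None, which points straight at the one short pattern that needs fixing. The price is one pass over the page per field.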

So what are the most frequent changes on HTML pages? First of all, all sorts of whitespace like \r \n \t between the tags. For example, if you have a regex like this:

       <td\ style="padding-right:\ 10px;"\ width="60">some title:</td>\r\n\t\t\t\t\t\t\t\t<td><b>(?<SomeTitle>.*?)</b></td>

it will work until the whitespace between the <td>s changes, so it is best to take care of that right away:

       <td\ style="padding-right:\ 10px;"\ width="60">some title:</td>\s*<td><b>(?<SomeTitle>.*?)</b></td>
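
A quick way to see the difference is to run both versions against the original markup and a re-indented copy. This is a Python sketch, so the named group is spelled (?P<SomeTitle>...); the sample HTML strings are made up.

```python
import re

LABEL_CELL = r'<td\ style="padding-right:\ 10px;"\ width="60">some title:</td>'
VALUE_CELL = r'<td><b>(?P<SomeTitle>.*?)</b></td>'

# Exact whitespace (CR LF plus eight tabs) versus any run of whitespace.
BRITTLE = LABEL_CELL + r'\r\n\t\t\t\t\t\t\t\t' + VALUE_CELL
TOLERANT = LABEL_CELL + r'\s*' + VALUE_CELL

label = '<td style="padding-right: 10px;" width="60">some title:</td>'
value = '<td><b>Hello</b></td>'
original = label + '\r\n' + '\t' * 8 + value
reindented = label + '\n  ' + value  # same markup, different indentation

brittle_hits = [bool(re.search(BRITTLE, html)) for html in (original, reindented)]
tolerant_hits = [bool(re.search(TOLERANT, html)) for html in (original, reindented)]
```

The brittle pattern only finds the value in the original string, while the \s* version finds it in both.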

Then the problematic part is the tag's attributes, like padding and so on. While the structure of the entire page doesn't change that often, minor changes in the page's design do happen pretty often, and you don't want your regex to fail every time the width of a cell changes by a couple of pixels:

       <td\ style="padding-right:\ \d+px;"\ width="\d+">some title:</td>\s*<td><b>(?<SomeTitle>.*?)</b></td>
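
The same kind of check works for the \d+ version. In this Python sketch (named group spelled (?P<SomeTitle>...)), the helper cell and its sample values are invented for the demo.

```python
import re

PATTERN = re.compile(
    r'<td\ style="padding-right:\ \d+px;"\ width="\d+">some title:</td>'
    r'\s*<td><b>(?P<SomeTitle>.*?)</b></td>'
)


def cell(padding, width):
    """Build a sample row with the given padding and width (made-up markup)."""
    return (
        f'<td style="padding-right: {padding}px;" width="{width}">some title:</td>\n'
        f'<td><b>Hello</b></td>'
    )


# Pixel values may change freely...
titles = [
    PATTERN.search(cell(p, w)).group("SomeTitle")
    for p, w in [(10, 60), (12, 64), (0, 120)]
]

# ...but a change of unit (px to em) still breaks the match.
unit_change = PATTERN.search(cell(10, 60).replace("10px", "1em"))
```

All three size variants still yield the title, while unit_change is None. \d+ only covers the numbers, not any other edit to the attributes.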

Or you can make an even more drastic change and resolve all possible issues with attributes:

       <td.{0,70}>some title:</td>\s*<td><b>(?<SomeTitle>.*?)</b></td>

Why limit the number of characters? Because if there are many such elements on the page and you want to extract them all, an unlimited .* is greedy and can stretch from the first <td all the way to the last "some title:" cell, so you get one match instead of many. That is not necessarily a problem in your case. Certainly these are mere examples, and in real life you might write better regexes altogether, but you don't always have time to bring them to perfection; somehow such things are considered very easy and not time-consuming by both clients and not very technical project managers. So the best solution is to write them as fast as you can and focus on making them readable and on solving typical potential problems, rather than sit and think about how to make them perfect.
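
To see what the character limit buys you, here is a small Python sketch comparing .{0,70} with an unbounded .* on a page that has three such rows. The sample page is made up and deliberately puts all rows on one line, because by default the dot does not match a newline; Python spells the named group (?P<SomeTitle>...).

```python
import re

VALUE_CELL = r'\s*<td><b>(?P<SomeTitle>.*?)</b></td>'
BOUNDED = r'<td.{0,70}>some title:</td>' + VALUE_CELL
UNBOUNDED = r'<td.*>some title:</td>' + VALUE_CELL  # same pattern without the limit

# Made-up page: three rows with the same label, all on a single line.
row = (
    '<tr><td style="padding-right: 10px;" width="60">some title:</td>'
    '<td><b>{}</b></td></tr>'
)
page = "".join(row.format(name) for name in ("First", "Second", "Third"))

# With a single group in the pattern, findall returns the captured values.
bounded_titles = re.findall(BOUNDED, page)
unbounded_titles = re.findall(UNBOUNDED, page)
```

The bounded pattern returns all three titles. The unbounded one returns only the last, because its first match starts at the first <td and runs through the other two rows. The limit only helps while it stays shorter than the distance between two neighbouring rows.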