Thursday, June 4, 2009

SOA and GWT/HTML Scraping

As promised, I continue my previous posts on the Message Enrichment scenario. I met with the customer, and it seems that the only way to truly enrich their ESB message is to build an image processing engine right inside the ESB. Although it sounds quite cool - and I'd love to do it myself, since it's been years since I wrote any image processing code (the last time was in Turbo Pascal for Windows 1.5 ;-) ) - it seems like a waste to put it inside the ESB.
The solution? Use HTML scraping as an ESB connector, to rip the data from the web browser and use it as a service. The reason: the application keeps some of its code inside the engine, and some of it (like zoom-in/zoom-out) inside the web browser (using GWT). And so a command will be received, it will be translated to an HTTP GET request, and the resulting HTML will be scraped to get the actual required image.
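To make the flow concrete, here is a rough sketch of such a connector in Python, using only the standard library. Everything specific in it is an assumption made for illustration: the command is a plain dictionary, the viewer URL and parameter names are made up, and "the required image" is taken to be the first <img> tag on the page. It also assumes the image reference is present in the HTML the server returns, which may not hold for a page that GWT builds in the browser.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


class ImageSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def build_request_url(base_url, command):
    """Translate a command (a plain dict, by assumption) into an HTTP GET URL."""
    # Sorting keeps the query string stable regardless of dict ordering.
    return base_url + "?" + urlencode(sorted(command.items()))


def extract_image_url(page_url, html):
    """Return the absolute URL of the first <img> in the HTML, or None."""
    parser = ImageSrcParser()
    parser.feed(html)
    if not parser.sources:
        return None
    # Resolve relative src values against the page that served them.
    return urljoin(page_url, parser.sources[0])


def fetch_image_url(base_url, command):
    """Issue the GET request and scrape the image URL from the response."""
    url = build_request_url(base_url, command)
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_image_url(url, html)
```

A call such as fetch_image_url("http://viewer.example/render", {"id": "scan-7", "zoom": 2}) would then stand in for the service operation; the host and parameters here are hypothetical.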

I like this idea, and not (only) because it's mine. I think it makes good reuse of existing code, which is what service exposure is all about, at least for me. The shocking part was that no one I talked to even considers HTML scraping a legitimate SOA concept. Ain't that odd? I can't count the times I had to pull out my HTML scraping toolkit (XQuery and JTidy are fine by me, thanks) and rip information from existing HTML pages.

However, there is a challenge here. Is it even possible to scrape GWT-based pages?

1 comment:

  1. I'm in the process of trying to scrape GWT-based pages right now. I'll keep you posted...
