HTML Parser is a Java library for parsing HTML in either a linear or a nested fashion. I have tried various open source parsers, such as WebHarvest, and found this one to be the most robust at handling bad and nasty HTML. My primary purpose in using a parser is to extract content from websites; other people have other needs, and this library might not meet all of them. It has some pretty cool features, like filters, visitors, custom tags and easy-to-use JavaBeans. It is a fast, robust and well-tested package. One drawback is the documentation: there is very little of it around, and most of what I know I discovered by playing around.
So my need here was to extract the content of a given tag, identifying that tag by its id attribute. For instance, I want to extract the text "some text two" from the page below:
<html><body><div id='one'> some text one </div> <div id='two'> some text two </div></body></html>
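To make the goal concrete before bringing in the library, here is a rough, dependency-free sketch of the same id-based lookup. Note that it does not use HTML Parser at all: it swaps in the JDK's built-in Swing HTML parser (ParserDelegator) purely as an illustration, and that parser is far less forgiving of messy markup. The class name IdTextExtractor and the method extractTextById are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: an illustration with the JDK's Swing parser,
// not the HTML Parser library this post is about.
class IdTextExtractor {

    // Returns the trimmed text inside the first element whose id attribute
    // equals the given id, or an empty string if no such element is found.
    static String extractTextById(String html, final String id) {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // 0 while outside the target element, otherwise the nesting depth inside it.
            private int depth = 0;
            // Set once the target element has been closed, so later matches are ignored.
            private boolean done = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (depth > 0) {
                    depth++;                       // a nested tag inside the target
                } else if (!done && id.equals(attrs.getAttribute(HTML.Attribute.ID))) {
                    depth = 1;                     // entering the target element
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (depth > 0 && --depth == 0) {
                    done = true;                   // leaving the target element
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    text.append(data);             // keep only text inside the target
                }
            }
        };

        try {
            // true = ignore any charset declared in the document; the input is already a String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);     // not expected when reading from a String
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        String page = "<html><body><div id='one'> some text one </div> "
                + "<div id='two'> some text two </div></body></html>";
        System.out.println(extractTextById(page, "two"));
    }
}
```

The idea is the same whichever parser you use: find the element whose id matches, then collect the text beneath it. With HTML Parser, a filter does the matching for you, so none of the depth bookkeeping above is needed.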
Here's the code sample that accomplishes this with HTML Parser: