Friday, 26 October 2012

Re: libhubbub parse error on google homepage

Btw, thanks for the help on this.  The project is for a nodejs native extension that brings the love of libhubbub to the nodejs world:


There are other HTML parsers in the nodejs world, but none are as good as libhubbub.  I considered using the parser from webkit or firefox, but libhubbub was definitely the easiest to use since it is completely standalone and depends on very few external libraries.


On Wed, Oct 24, 2012 at 4:14 AM, Dean Mao <deanmao@gmail.com> wrote:
I see, thanks for the tip.  I'm only using it for the tokeniser as I don't have use for a dom tree.  All I did was perform this when I saw a script tag:

  hubbub_tokeniser_optparams params;
  params.content_model.model = HUBBUB_CONTENT_MODEL_CDATA;
  hubbub_tokeniser_setopt(tok_, HUBBUB_TOKENISER_CONTENT_MODEL, &params);

Then I revert it when I see the script end tag.  That seemed to be what in_head.c was doing with parse_generic_rcdata().


On Wed, Oct 24, 2012 at 3:27 AM, John-Mark Bell <jmb@netsurf-browser.org> wrote:
On Wed, Oct 24, 2012 at 02:54:49AM -0700, Dean Mao wrote:
> Here's a more compact test:
>
> <script>for(var i=0;i<n;i++);</script>
>
> Outputs:
>
> START TAG: 'script'
> CHARACTERS: 'for(var i=0;i'
> START TAG: 'n;i++);<' attributes:
> 'script' = ''
>
> Essentially everything inside a <script> tag should be treated as
> characters until a </script> tag is seen.

Yes. The behaviour you're seeing is expected. The HTML5 tokeniser has a
number of modes, which are selected by the token handler callback
provided by the client. The trivial token handler in test/tokeniser.c
does not manipulate the tokeniser mode, thus it does not handle the
contents of script (and other, similar) elements in the expected fashion.

The treebuilder implementation in Hubbub does manipulate the tokeniser
mode in the correct way. In most cases, you'll want to use the built-in
treebuilder, as it handles all the complexity of coping with junk input
for you. See examples/libxml.c for a demonstration of how to use the
built-in treebuilder.

If you do only wish to use the tokeniser, then you need to ensure that
your token handler changes the tokeniser mode in the same way that an
HTML5 treebuilder would.
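
As a rough sketch of what that involves for a tokeniser-only client, the
helper below picks a content model from a start tag name. It is an
illustration under stated assumptions, not Hubbub's own code: the
content_model enum is a local stand-in named after Hubbub's content
models, model_for_start_tag() and name_is() are hypothetical helpers,
and the element list is deliberately partial (a real treebuilder handles
more elements and more context). The name is taken as a pointer plus
length on the assumption that token names are not NUL-terminated.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the library's content model enum, so that
 * this sketch compiles without Hubbub installed. */
typedef enum {
	MODEL_PCDATA,    /* normal markup parsing */
	MODEL_RCDATA,    /* text with character references, no tags */
	MODEL_CDATA,     /* raw text until the matching end tag */
	MODEL_PLAINTEXT  /* raw text until end of input */
} content_model;

/* Case-insensitively compare a length-delimited name against a
 * NUL-terminated lowercase literal. Returns 1 on match, 0 otherwise. */
static int name_is(const char *name, size_t len, const char *lit)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (lit[i] == '\0' ||
		    tolower((unsigned char) name[i]) != lit[i])
			return 0;
	}

	/* Match only if the literal ends exactly where the name does. */
	return lit[len] == '\0';
}

/* Choose the content model to switch to after seeing a start tag.
 * Partial list: anything not named here stays in PCDATA. */
content_model model_for_start_tag(const char *name, size_t len)
{
	assert(name != NULL);

	if (name_is(name, len, "script") || name_is(name, len, "style"))
		return MODEL_CDATA;
	if (name_is(name, len, "title") || name_is(name, len, "textarea"))
		return MODEL_RCDATA;
	if (name_is(name, len, "plaintext"))
		return MODEL_PLAINTEXT;

	return MODEL_PCDATA;
}
```

In a token handler, the result would be translated to the matching
library constant and passed to the tokeniser on each start tag, with the
model set back to PCDATA once the corresponding end tag arrives.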


J.

