Wednesday, 24 October 2012

Re: libhubbub parse error on google homepage

I see, thanks for the tip.  I'm only using the tokeniser, as I have no use for a DOM tree.  All I did was run the following when I saw a script start tag:

  /* Switch to CDATA so the script body is emitted as characters. */
  hubbub_tokeniser_optparams params;
  params.content_model.model = HUBBUB_CONTENT_MODEL_CDATA;
  hubbub_tokeniser_setopt(tok_, HUBBUB_TOKENISER_CONTENT_MODEL, &params);

Then I revert it when I see the end of the script tag.  That seemed to be what in_head.c does with parse_generic_rcdata().
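
To make the effect of the mode switch concrete, here is a minimal, self-contained sketch. It is a toy model, not libhubbub code: toy_tokeniser, toy_tokenise() and the switch_on_script flag are all hypothetical names invented for illustration, and the scanning rules are far cruder than the real HTML5 tokeniser. It only shows why a handler that switches to CDATA on a script start tag gets one character run, while one that does not sees the "<n" as the start of a new tag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for tokeniser content models (NOT the libhubbub types). */
typedef enum { MODEL_PCDATA, MODEL_CDATA } content_model;

typedef struct {
    content_model model;   /* current mode, changed by the handlers below */
    int switch_on_script;  /* 1: behave like a treebuilder would on <script> */
    int start_tags;
    int end_tags;
    int char_runs;
    char last_text[128];   /* most recent character run, NUL-terminated */
} toy_tokeniser;

/* Hypothetical handler: optionally switches to CDATA on a script start tag. */
static void on_start_tag(toy_tokeniser *t, const char *name, size_t len)
{
    t->start_tags++;
    if (t->switch_on_script && len == 6 && strncmp(name, "script", 6) == 0)
        t->model = MODEL_CDATA;
}

/* Any end tag puts the toy tokeniser back into the default model. */
static void on_end_tag(toy_tokeniser *t)
{
    t->end_tags++;
    t->model = MODEL_PCDATA;
}

static void on_characters(toy_tokeniser *t, const char *s, size_t len)
{
    t->char_runs++;
    if (len >= sizeof t->last_text)
        len = sizeof t->last_text - 1;
    memcpy(t->last_text, s, len);
    t->last_text[len] = '\0';
}

static void toy_tokenise(toy_tokeniser *t, const char *in)
{
    const char *p = in;

    assert(t != NULL && in != NULL);
    while (*p != '\0') {
        if (t->model == MODEL_CDATA) {
            /* In CDATA mode only the closing script tag ends the text. */
            const char *end = strstr(p, "</script>");
            if (end == NULL)
                end = p + strlen(p);
            if (end > p)
                on_characters(t, p, (size_t)(end - p));
            if (*end == '\0') {
                p = end;
            } else {
                on_end_tag(t);
                p = end + strlen("</script>");
            }
        } else if (*p == '<') {
            /* In the default mode every '<' opens a tag up to the next '>'. */
            const char *gt = strchr(p, '>');
            if (gt == NULL) {
                on_characters(t, p, strlen(p));
                break;
            }
            if (p[1] == '/')
                on_end_tag(t);
            else
                on_start_tag(t, p + 1, (size_t)(gt - p - 1));
            p = gt + 1;
        } else {
            /* Plain text runs until the next '<' or the end of input. */
            const char *lt = strchr(p, '<');
            if (lt == NULL)
                lt = p + strlen(p);
            on_characters(t, p, (size_t)(lt - p));
            p = lt;
        }
    }
}
```

Feeding the compact test case through this with switch_on_script set yields one start tag, one character run holding the whole loop, and one end tag. With the flag left at 0 it reports two start tags and no end tag, which is roughly the shape of the output from test/tokeniser.c. With the real library, the switch itself would be the hubbub_tokeniser_setopt() call shown above.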


On Wed, Oct 24, 2012 at 3:27 AM, John-Mark Bell <jmb@netsurf-browser.org> wrote:
On Wed, Oct 24, 2012 at 02:54:49AM -0700, Dean Mao wrote:
> Here's a more compact test:
>
> <script>for(var i=0;i<n;i++);</script>
>
> Outputs:
>
> START TAG: 'script'
> CHARACTERS: 'for(var i=0;i'
> START TAG: 'n;i++);<' attributes:
> 'script' = ''
>
> Essentially everything inside a <script> tag should be treated as
> characters until a </script> tag is seen.

Yes. This behaviour you're seeing is expected. The HTML5 tokeniser has a
number of modes, which are selected by the token handler callback
provided by the client. The trivial token handler in test/tokeniser.c
does not manipulate the tokeniser mode, thus it does not handle the
contents of script (and other, similar) elements in the expected fashion.

The treebuilder implementation in Hubbub does manipulate the tokeniser
mode in the correct way. In most cases, you'll want to use the built-in
treebuilder, as it handles all the complexity of coping with junk input
for you. See examples/libxml.c for a demonstration of how to use the
built-in treebuilder.

If you do only wish to use the tokeniser, then you need to ensure that
your token handler changes the tokeniser mode in the same way that an
HTML5 treebuilder would.


J.
