There are other HTML parsers in the Node.js world, but none are as good as libhubbub. I considered using the parser from WebKit or Firefox, but libhubbub was definitely the easiest to use, since it is completely standalone and depends on very few external libraries.
  On Wed, Oct 24, 2012 at 4:14 AM, Dean Mao <deanmao@gmail.com> wrote:
I see, thanks for the tip. I'm only using it for the tokeniser as I don't have use for a DOM tree. All I did was perform this when I saw a script tag:

    hubbub_tokeniser_optparams params;
    params.content_model.model = HUBBUB_CONTENT_MODEL_CDATA;
    hubbub_tokeniser_setopt(tok_, HUBBUB_TOKENISER_CONTENT_MODEL, &params);

Then revert it back when I see the end of the script tag. It seemed like that was what in_head.c was doing with parse_generic_rcdata().

On Wed, Oct 24, 2012 at 3:27 AM, John-Mark Bell <jmb@netsurf-browser.org> wrote:
On Wed, Oct 24, 2012 at 02:54:49AM -0700, Dean Mao wrote:
> Here's a more compact test:
>
> <script>for(var i=0;i<n;i++);</script>
>
> Outputs:
>
> START TAG: 'script'
> CHARACTERS: 'for(var i=0;i'
> START TAG: 'n;i++);<' attributes:
> 'script' = ''
>
> Essentially everything inside a <script> tag should be treated as
> characters until a </script> tag is seen.

Yes. This behaviour you're seeing is expected. The HTML5 tokeniser has a
number of modes, which are selected by the token handler callback
provided by the client. The trivial token handler in test/tokeniser.c
does not manipulate the tokeniser mode, thus it does not handle the
contents of script (and other, similar) elements in the expected fashion.
The treebuilder implementation in Hubbub does manipulate the tokeniser
mode in the correct way. In most cases, you'll want to use the built-in
treebuilder, as it handles all the complexity of coping with junk input
for you. See examples/libxml.c for a demonstration of how to use the
built-in treebuilder.
If you do only wish to use the tokeniser, then you need to ensure that
your token handler changes the tokeniser mode in the same way that an
HTML5 treebuilder would.
J.
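
To make the point of the thread concrete, here is a small self-contained sketch of a token handler that switches the tokeniser mode around script elements. It is a toy and not Hubbub: the names toy_tokeniser, content_model, token_handler and toy_tokenise are all hypothetical, the tokeniser does no attribute parsing, and it assumes lowercase tags without attributes. It only illustrates why the mode switch matters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy content models (hypothetical, not the Hubbub enum). */
typedef enum { MODEL_PCDATA, MODEL_CDATA } content_model;

typedef struct {
    content_model model;  /* current mode, changed by the token handler */
    int switch_modes;     /* 0 = trivial handler, 1 = mode-aware handler */
    char log[512];        /* emitted tokens, one per line */
} toy_tokeniser;

/* Append one token to the log in the "KIND: 'text'" form used in the thread. */
static void emit(toy_tokeniser *t, const char *kind, const char *s, size_t n)
{
    size_t used = strlen(t->log);
    snprintf(t->log + used, sizeof t->log - used, "%s: '%.*s'\n",
             kind, (int)n, s);
}

/* Token handler: records the token and, if enabled, switches the mode
 * on <script> and back on </script>, as a treebuilder would. */
static void token_handler(toy_tokeniser *t, const char *kind,
                          const char *s, size_t n)
{
    emit(t, kind, s, n);
    if (!t->switch_modes || n != 6 || strncmp(s, "script", 6) != 0)
        return;
    if (strcmp(kind, "START TAG") == 0)
        t->model = MODEL_CDATA;
    else if (strcmp(kind, "END TAG") == 0)
        t->model = MODEL_PCDATA;
}

void toy_tokenise(toy_tokeniser *t, const char *in)
{
    const char *p = in;

    assert(t != NULL && in != NULL);
    while (*p != '\0') {
        if (t->model == MODEL_CDATA) {
            /* In CDATA mode only the closing script tag ends the text. */
            const char *end = strstr(p, "</script>");
            size_t n = end ? (size_t)(end - p) : strlen(p);
            if (n > 0)
                token_handler(t, "CHARACTERS", p, n);
            p += n;
            if (end != NULL) {
                token_handler(t, "END TAG", "script", 6);
                p += strlen("</script>");
            }
        } else if (*p == '<') {
            /* In PCDATA mode every '<' opens a tag up to the next '>'. */
            const char *close = strchr(p, '>');
            int is_end = (p[1] == '/');
            const char *name = p + 1 + is_end;
            if (close == NULL) {
                token_handler(t, "CHARACTERS", p, strlen(p));
                break;
            }
            token_handler(t, is_end ? "END TAG" : "START TAG",
                          name, (size_t)(close - name));
            p = close + 1;
        } else {
            size_t n = strcspn(p, "<");
            token_handler(t, "CHARACTERS", p, n);
            p += n;
        }
    }
}
```

Initialise a toy_tokeniser as { MODEL_PCDATA, 1, "" }, call toy_tokenise() on the test input from the thread, and the log holds the whole script body as a single CHARACTERS token. With switch_modes set to 0 the text is cut at the '<' in i<n, which is the same kind of breakage as in the output quoted above, although the toy reports the bogus tag differently from Hubbub because it does not parse attributes.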