Wednesday, 12 March 2014

[PATCH 00/11] Updating tokeniser to current HTML5 spec

This patch series updates the tokeniser, though not completely, to the current HTML5 specification.
Old (assumed) : http://www.w3.org/TR/2008/WD-html5-20080610/tokenisation.html
New : http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html

#Repo : libhubbub
#sURL : https://github.com/Achal-Aggarwal/hibbub.git
#sRef : tokeniser-update
#SHA1 : c8524048c52f31b9a0e8f3469e11a0db76653adb
#Land : master
#link : https://github.com/Achal-Aggarwal/hibbub/tree/tokeniser-update


#Description
Every commit has a brief description of what it does. In addition, please see the description of each patch/commit below.

patch : [PATCH 01/11] Rewriting whole tokeniser. Removed content model flags.
CP1252 table updated. End bang and doctype identifier state added.
link : https://github.com/Achal-Aggarwal/hibbub/commit/f395209aff9d66e70ba236cfc1812e594a0580a5
description : This patch essentially rewrites the whole tokeniser according to the new spec. The content model
flags, which the spec now treats as states, are not implemented here. Apart from those states and the named
entities, everything follows the new specification.
note : The code is not polished yet; for now it just provides the state machine as described in the HTML spec.


patch : [PATCH 02/11] Adding states for RCDATA, RAWTEXT, PLAINTEXT and one
for SCRIPTDATA.
link : https://github.com/Achal-Aggarwal/hibbub/commit/f2606f47504253446201e00440a0610c0870a0a5
description : This patch adds states for RCDATA, RAWTEXT and PLAINTEXT. Although script data now has many
variant states, I have added only one, i.e. SCRIPT_DATA.
note : The code is not polished yet; for now it just provides the state machine as described in the HTML spec.
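
As a rough illustration of how these text states differ, the sketch below records which characters are special in each one, following my reading of the spec: '&' starts a character reference only in the data and RCDATA states, and '<' leaves the text state everywhere except PLAINTEXT. The enum and function names are hypothetical and are not taken from the patch or from libhubbub.

```c
#include <stdbool.h>

/* Hypothetical state names, for illustration only. */
typedef enum {
	STATE_DATA,
	STATE_RCDATA,
	STATE_RAWTEXT,
	STATE_SCRIPT_DATA,
	STATE_PLAINTEXT
} text_state;

/* '&' begins a character reference only in the data and RCDATA states. */
static bool starts_character_reference(text_state s, char c)
{
	return c == '&' && (s == STATE_DATA || s == STATE_RCDATA);
}

/* '<' moves the tokeniser out of every text state except PLAINTEXT,
 * where it is emitted as an ordinary character. */
static bool leaves_text_state_on_lt(text_state s, char c)
{
	return c == '<' && s != STATE_PLAINTEXT;
}
```

NUL and end-of-file handling, which the spec also defines for each of these states, is left out of the sketch.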


patch : [PATCH 03/11] Fixing rcdata and rawtext close tag open state for byte
by byte input.
link : https://github.com/Achal-Aggarwal/hibbub/commit/e50255eaa44ce049a78d7251ac8415bb5656ce39
description : This patch fixes two states, rcdata close tag open and rawtext close tag open, for the case
where input is buffered byte by byte.
note : The code is not polished yet; for now it just provides the state machine as described in the HTML spec.
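
To show why byte-by-byte input is awkward for close tag handling, here is a minimal sketch of a matcher that keeps its progress between calls, so a close tag name split across chunks is still recognised. It is not the code from the patch: the type and function names are hypothetical, and it assumes the expected tag name is stored in lowercase.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { MATCH_NEED_MORE, MATCH_YES, MATCH_NO } match_result;

/* Hypothetical matcher; progress survives between calls. */
typedef struct {
	const char *expected; /* lowercase name of the last start tag */
	size_t matched;       /* characters of the name matched so far */
} close_tag_matcher;

static void matcher_init(close_tag_matcher *m, const char *expected)
{
	m->expected = expected;
	m->matched = 0;
}

/* Feed the bytes that follow "</". May be called with chunks of any
 * size, including a single byte at a time. */
static match_result matcher_feed(close_tag_matcher *m,
		const char *data, size_t len)
{
	size_t name_len = strlen(m->expected);

	for (size_t i = 0; i < len; i++) {
		unsigned char c = (unsigned char) data[i];

		if (m->matched < name_len) {
			if (tolower(c) != (unsigned char) m->expected[m->matched])
				return MATCH_NO;
			m->matched++;
		} else {
			/* Whole name seen: only whitespace, '/' or '>'
			 * may follow an appropriate end tag name. */
			return (c == '>' || c == '/' || c == ' ' ||
				c == '\t' || c == '\n' || c == '\f')
				? MATCH_YES : MATCH_NO;
		}
	}

	/* Ran out of input before a decision could be made. */
	return MATCH_NEED_MORE;
}
```

The point of the design is that MATCH_NEED_MORE is a distinct result: the caller waits for the next chunk instead of treating a short buffer as a mismatch.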


patch : [PATCH 04/11] Fix tokeniser test executer for content model flag
change and segfault on no doctype name.
link : https://github.com/Achal-Aggarwal/hibbub/commit/ef28df23b75704c830a85422e3a5b59641bea961
description : As we no longer have content model flags, we now set the initial state of the tokeniser instead,
so I have modified the tokeniser2 and tokeniser3 test executers to run the tokeniser tests that way.
The tests were also exiting with a segfault when a doctype has no name. This was because the test computed the length
of a null string using strlen; I replaced that with another function which returns 0 as the length of a null string.
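
The strlen fix can be sketched as below. The helper name is an assumption made for illustration; the actual function added to the tests may be named and written differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (name assumed, not taken from the patch):
 * length of a C string, treating a NULL pointer as the empty string.
 * Calling strlen(NULL) directly is undefined behaviour and typically
 * crashes. */
static size_t safe_strlen(const char *s)
{
	return s == NULL ? 0 : strlen(s);
}
```

With a helper like this, a doctype token without a name can be compared against the expected output without dereferencing a null pointer.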

patch : [PATCH 05/11] Removing failing testcase of test1.dat for tokeniser2.
link : https://github.com/Achal-Aggarwal/hibbub/commit/0cda9a25adc211a90bf89ceb3414fea2e14e49d6

patch : [PATCH 06/11] Removing failing testcase of test2.dat for tokeniser2.
link : https://github.com/Achal-Aggarwal/hibbub/commit/86c698260c3ffed49663d5a3c54f5f9c0ac9e81b

patch : [PATCH 07/11] Removing failing testcase of test3.dat for tokeniser2.
link : https://github.com/Achal-Aggarwal/hibbub/commit/b64655e5d541422397ebcb2136ad4ada05f84a59

patch : [PATCH 08/11] Removing failing testcase of test4.dat for tokeniser2.
link : https://github.com/Achal-Aggarwal/hibbub/commit/abb37165227143994aa1e5bd82dcf59f80d69d51

patch : [PATCH 09/11] Removing failing testcase of regression.dat for
tokeniser2.
link : https://github.com/Achal-Aggarwal/hibbub/commit/2d1fe2eed8bb91f56360feb323ea2c8a1b4c0263

description : In patches 5, 6, 7, 8 and 9 I removed tests which were failing on the updated tokeniser.


patch : [PATCH 10/11] Updating tokeniser tests.
link : https://github.com/Achal-Aggarwal/hibbub/commit/544a30788d7353aac70f32540a4f6b5bf0a3fcc5
description : This patch updates the tokeniser test data to match https://github.com/html5lib/html5lib-tests/tree/master/tokenizer
note : No new test files are added in this patch.


patch : [PATCH 11/11] Fixing entities.test by correcting cp1252 table entries.
link : https://github.com/Achal-Aggarwal/hibbub/commit/544a30788d7353aac70f32540a4f6b5bf0a3fcc5
description : Some tests in entities.test were failing because the cp1252 table was not updated correctly in patch 01/11.
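
For context, the cp1252 table is what remaps numeric character references in the range 0x80 to 0x9F (for example &#x80;) to the code points that Windows-1252 assigns to those bytes. The sketch below reflects my understanding of the table in the current spec, in which the five slots Windows-1252 leaves unassigned map to themselves. The identifier names are hypothetical and this is not the table from the patch.

```c
#include <stdint.h>

/* Replacement code points for numeric references 0x80..0x9F.
 * Assumed from the spec's table; names are illustrative. */
static const uint32_t cp1252_table[32] = {
	0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
	0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
	0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
	0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/* Remap a numeric character reference value; values outside the
 * 0x80..0x9F range are returned unchanged. */
static uint32_t remap_numeric_reference(uint32_t cp)
{
	if (cp >= 0x80 && cp <= 0x9F)
		return cp1252_table[cp - 0x80];
	return cp;
}
```

The spec's other replacements (zero, surrogates and out-of-range values becoming U+FFFD) are deliberately omitted here.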

Achal-Aggarwal (11):
Rewriting whole tokeniser. Removed content model flags. CP1252 table
updated. End bang and doctype identifier state added.
Adding states for RCDATA, RAWTEXT, PLAINTEXT and one for SCRIPTDATA.
Fixing rcdata and rawtext close tag open state for byte by byte input.
Fix tokeniser test executer for content model flag change and segfault
on no doctype name.
Removing failing testcase of test1.dat for tokeniser2.
Removing failing testcase of test2.dat for tokeniser2.
Removing failing testcase of test3.dat for tokeniser2.
Removing failing testcase of test4.dat for tokeniser2.
Removing failing testcase of regression.dat for tokeniser2.
Updating tokeniser tests.
Fixing entities.test by correcting cp1252 table entries.

include/hubbub/parser.h | 6 +-
include/hubbub/types.h | 13 +-
src/parser.c | 4 +-
src/tokeniser/tokeniser.c | 1516 ++++++++++---
src/tokeniser/tokeniser.h | 6 +-
src/treebuilder/in_body.c | 4 +-
src/treebuilder/treebuilder.c | 6 +-
test/data/tokeniser2/INDEX | 2 +-
test/data/tokeniser2/contentModelFlags.test | 50 +-
test/data/tokeniser2/entities.test | 2122 +------------------
test/data/tokeniser2/escapeFlag.test | 24 +-
test/data/tokeniser2/numericEntities.test | 190 +-
test/data/tokeniser2/regression.test | 8 -
test/data/tokeniser2/test1.test | 28 +-
test/data/tokeniser2/test2.test | 24 +-
test/data/tokeniser2/test3.test | 3060 ++++++++++++++-------------
test/data/tokeniser2/test4.test | 79 +-
test/data/tokeniser2/unicodeChars.test | 8 -
test/testutils.h | 19 +
test/tokeniser2.c | 51 +-
test/tokeniser3.c | 48 +-
21 files changed, 3134 insertions(+), 4134 deletions(-)

--
1.8.3.2
