I observed that the HTML parser (hubbub-0.1.2) breaks when it finds a semicolon in a text field. An example of the input is given below.
When the parser reaches the ';', it stops working. When I remove the ';' from the string, it works fine. Could you please check whether this is an issue with the parser, or whether I am missing something?

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>MIT - Massachusetts Institute of Technology</title>
<meta name="keywords" content="Massachusetts Institute of Technology, MIT" />
<meta name="description" content="MIT is devoted to the advancement of knowledge and education of students in areas that contribute to or prosper in an environment of science and technology." />
<meta name="robots" content="index,follow,noodp,noydir" />
<meta name="allow-search" content="yes" />
<meta name="language" content="en" />
<meta name="distribution" content="global" />
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />

Below is the output of the parser example (i.e. ./libxml). mit-edu.htm is the HTML web page I am giving as input.

anilj@ubuntu:~/apache/sandbox/hubbub-0.1.2/examples$ ./libxml mit-edu.htm
WARNING: Failed creating namespace xml
HTML DOCUMENT
standalone=true
DTD(html), PUBLIC -//W3C//DTD XHTML 1.0 Transitional//EN, SYSTEM http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
ELEMENT html
default namespace href=http://www.w3.org/1999/xhtml
namespace math href=http://www.w3.org/1998/Math/MathML
namespace svg href=http://www.w3.org/2000/svg
namespace xlink href=http://www.w3.org/1999/xlink
namespace xmlns href=http://www.w3.org/2000/xmlns/
ATTRIBUTE xmlns
TEXT
content=http://www.w3.org/1999/xhtml
ELEMENT head
TEXT
content=
ELEMENT title
TEXT
content=MIT - Massachusetts Institute of Technol...
TEXT
content=
ELEMENT meta
ATTRIBUTE name
TEXT
content=keywords
ATTRIBUTE content
TEXT
content=Massachusetts Institute of Technology, M...
TEXT
content=
ELEMENT meta
ATTRIBUTE name
TEXT
content=description
ATTRIBUTE content
TEXT
content=MIT is devoted to the advancement of kno...
TEXT
content=
ELEMENT meta
ATTRIBUTE name
TEXT
content=robots
ATTRIBUTE content
TEXT
content=index,follow,noodp,noydir
TEXT
content=
ELEMENT meta
ATTRIBUTE name
TEXT
content=allow-search
ATTRIBUTE content
TEXT
content=yes
TEXT
content=
ELEMENT meta
ATTRIBUTE name
TEXT
content=language
ATTRIBUTE content
TEXT
content=en
TEXT
content=
ELEMENT meta
ATTRIBUTE name
TEXT
content=distribution
ATTRIBUTE content
TEXT
content=global
TEXT
content=
ELEMENT meta
ATTRIBUTE http-equiv
TEXT
content=content-type
ATTRIBUTE content
TEXT
content=text/html; charset=UTF-8
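
To make the comparison repeatable, here is a minimal sketch that writes two copies of the input: one unchanged and one with the ';' removed from the content-type meta tag only. The output file names (with-semicolon.htm, without-semicolon.htm) and the helper names are my own assumptions, not part of hubbub. Each copy can then be passed to ./libxml in the same way as mit-edu.htm.

```python
import sys
from pathlib import Path

# The meta tag from the sample whose content attribute contains the ';'.
META_WITH_SEMICOLON = (
    '<meta http-equiv="content-type" content="text/html; charset=UTF-8" />'
)


def strip_semicolon(html: str) -> str:
    """Return html with the ';' removed from the content-type meta tag only."""
    return html.replace(META_WITH_SEMICOLON, META_WITH_SEMICOLON.replace(";", ""))


def write_variants(source: Path, out_dir: Path) -> tuple[Path, Path]:
    """Write the original and the semicolon-free variant into out_dir."""
    html = source.read_text(encoding="utf-8")
    with_path = out_dir / "with-semicolon.htm"        # hypothetical file name
    without_path = out_dir / "without-semicolon.htm"  # hypothetical file name
    with_path.write_text(html, encoding="utf-8")
    without_path.write_text(strip_semicolon(html), encoding="utf-8")
    return with_path, without_path


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python make_variants.py mit-edu.htm
    for path in write_variants(Path(sys.argv[1]), Path(".")):
        print(path)
```

Only the one meta tag is touched, so any difference between the two runs can be attributed to that single character rather than to other semicolons elsewhere in the page.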