Monday, 25 February 2013

Parser breaking with ";" in the text field.

Team,

I observed that the HTML parser (hubbub-0.1.2) breaks when it encounters a semicolon in a text field. An example of the input is given below.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head> 
        <title>MIT - Massachusetts Institute of Technology</title> 
        <meta name="keywords" content="Massachusetts Institute of Technology, MIT" /> 
        <meta name="description" content="MIT is devoted to the advancement of knowledge and education of students in areas that contribute to or prosper in an environment of science and technology." /> 
        <meta name="robots" content="index,follow,noodp,noydir" /> 
        <meta name="allow-search" content="yes" /> 
        <meta name="language" content="en" /> 
        <meta name="distribution" content="global" /> 
        <meta http-equiv="content-type" content="text/html; charset=UTF-8" /> 


When the parser finds the ';', it stops working. When I remove the ';' from the string, it works fine. Could you please check whether this is an issue with the parser or whether I am missing something?
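
One way to narrow down whether the semicolon itself is the trigger is to feed the parser several copies of the page that differ only in that one attribute value. The following is a minimal sketch in Python, not part of hubbub; the function names and the replacement strings ("text/html charset=UTF-8" and "text/html; x=y") are my own choices for illustration.

```python
from pathlib import Path

# The attribute value from the last meta tag in the example above.
CONTENT_VALUE = "text/html; charset=UTF-8"


def make_variants(html):
    """Return named copies of the page that differ only in the meta content value."""
    return {
        # Unchanged input, as a baseline.
        "original": html,
        # Semicolon removed, charset declaration kept.
        "no_semicolon": html.replace(CONTENT_VALUE, "text/html charset=UTF-8"),
        # Semicolon kept, charset declaration replaced by a dummy parameter.
        "no_charset": html.replace(CONTENT_VALUE, "text/html; x=y"),
    }


def write_variants(src, out_dir):
    """Write each variant next to a common stem and return the new paths."""
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    html = src.read_text(encoding="utf-8", errors="replace")
    paths = []
    for name, text in make_variants(html).items():
        target = out_dir / f"{src.stem}-{name}{src.suffix}"
        target.write_text(text, encoding="utf-8")
        paths.append(target)
    return paths
```

Calling write_variants("mit-edu.htm", "variants") would produce three files that can each be passed to ./libxml. If the "no_charset" copy parses to the end while the "no_semicolon" copy still stops, the semicolon alone is probably not the cause.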

I am pasting the output of the parser (i.e. ./libxml) below; mit-edu.htm is the HTML web page I am giving as input.

anilj@ubuntu:~/apache/sandbox/hubbub-0.1.2/examples$ ./libxml mit-edu.htm
WARNING: Failed creating namespace xml
HTML DOCUMENT
standalone=true
  DTD(html), PUBLIC -//W3C//DTD XHTML 1.0 Transitional//EN, SYSTEM http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
  ELEMENT html
    default namespace href=http://www.w3.org/1999/xhtml
    namespace math href=http://www.w3.org/1998/Math/MathML
    namespace svg href=http://www.w3.org/2000/svg
    namespace xlink href=http://www.w3.org/1999/xlink
    namespace xmlns href=http://www.w3.org/2000/xmlns/
    ATTRIBUTE xmlns
      TEXT
        content=http://www.w3.org/1999/xhtml
    ELEMENT head
      TEXT
        content=   
      ELEMENT title
        TEXT
          content=MIT - Massachusetts Institute of Technol...
      TEXT
        content=   
      ELEMENT meta
        ATTRIBUTE name
          TEXT
            content=keywords
        ATTRIBUTE content
          TEXT
            content=Massachusetts Institute of Technology, M...
      TEXT
        content=   
      ELEMENT meta
        ATTRIBUTE name
          TEXT
            content=description
        ATTRIBUTE content
          TEXT
            content=MIT is devoted to the advancement of kno...
      TEXT
        content=    
      ELEMENT meta
        ATTRIBUTE name
          TEXT
            content=robots
        ATTRIBUTE content
          TEXT
            content=index,follow,noodp,noydir
      TEXT
        content=   
      ELEMENT meta
        ATTRIBUTE name
          TEXT
            content=allow-search
        ATTRIBUTE content
          TEXT
            content=yes
      TEXT
        content=   
      ELEMENT meta
        ATTRIBUTE name
          TEXT
            content=language
        ATTRIBUTE content
          TEXT
            content=en
      TEXT
        content=   
      ELEMENT meta
        ATTRIBUTE name
          TEXT
            content=distribution
        ATTRIBUTE content
          TEXT
            content=global
      TEXT
        content=   
      ELEMENT meta
        ATTRIBUTE http-equiv
          TEXT
            content=content-type
        ATTRIBUTE content
          TEXT
            content=text/html; charset=UTF-8
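
As a cross-check that the markup itself is acceptable, the same meta tag can be run through a different parser. The sketch below uses html.parser from the Python standard library; it says nothing about hubbub's behaviour, and the class and function names are hypothetical helpers of mine.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the attributes of every meta element seen in the input."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here, because the
        # default handle_startendtag forwards to handle_starttag.
        if tag == "meta":
            self.metas.append(dict(attrs))


def collect_metas(html):
    """Parse the given HTML text and return a list of meta attribute dicts."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.metas
```

If collect_metas returns the content value with the semicolon intact, as the hubbub dump above also shows for the last meta element, the attribute value was read in full and the question becomes what happens after that element.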
