edu.harvard.hul.ois.jhove.module
Class HtmlModule
java.lang.Object
   edu.harvard.hul.ois.jhove.ModuleBase
       edu.harvard.hul.ois.jhove.module.HtmlModule
- All Implemented Interfaces: 
- Module
- public class HtmlModule 
- extends ModuleBase
Module for identification and validation of HTML files.
 
  HTML differs from most other document formats in that
  sloppy construction is practically assumed in the specification.
  This module attempts to report as many errors as possible and
  to recover reasonably from them. To do this, it relies on more
  heuristic behavior than the more straightforward modules do.
 
  XHTML is recognized by this module, but is handed off to the
  XML module for processing.  If the XML module is missing (which
  it shouldn't be if you've installed the JHOVE application without
  modifications), this module won't be able to process XHTML files.
 
  HTML should be placed ahead of XML in the module order.  If the
  XML module sees an XHTML file first, it will recognize it as XHTML,
  but won't be able to report the complete properties.
 
  The HTML module uses code generated by the JavaCC parser generator
  and lexical analyzer generator.  An apparent bug in JavaCC causes
  blank lines not to be counted in certain cases, so lexical errors
  may be reported with incorrect line numbers.
- Author:
- Gary McGath
 
| Fields inherited from class edu.harvard.hul.ois.jhove.ModuleBase | 
| _app, _bigEndian, _checksumFinished, _countStream, _coverage, _crc32, _date, _defaultParams, _features, _format, _init, _isRandomAccess, _je, _logger, _md5, _mimeType, _name, _nByte, _note, _param, _release, _repInfoNote, _rights, _sha1, _signature, _specification, _validityNote, _vendor, _verbosity, _wellFormedNote | 
 
 
| Constructor Summary | 
| HtmlModule() | Instantiate an HtmlModule object. | 
 
| Method Summary | 
| protected int | checkDoctype(java.util.List elements) | | 
| void | checkSignatures(java.io.File file, java.io.InputStream stream, RepInfo info) | Check if the digital object conforms to this Module's internal signature information. | 
| protected static boolean | isXmlAvailable() | | 
| int | parse(java.io.InputStream stream, RepInfo info, int parseIndex) | Parse the content of a purported HTML stream digital object and store the results in RepInfo. | 
| protected int | seemsToBeXHTML(java.util.List elements) | | 
| protected java.lang.String | stripQuotes(java.lang.String str) | | 
 
| Methods inherited from class edu.harvard.hul.ois.jhove.ModuleBase | 
| addIntegerProperty, addIntegerProperty, applyDefaultParams, calcRAChecksum, checkSignatures, getApp, getBase, getBufferedDataStream, getCoverage, getCRC32, getDate, getDefaultParams, getFeatures, getFormat, getMimeType, getName, getNByte, getNote, getRelease, getRepInfoNote, getRights, getSignature, getSpecification, getValidityNote, getVendor, getWellFormedNote, hasFeature, init, initFeatures, initParse, isBigEndian, isRandomAccess, param, parse, readByteBuf, readDouble, readDouble, readDouble, readFloat, readFloat, readSignedByte, readSignedByte, readSignedByte, readSignedInt, readSignedInt, readSignedInt, readSignedLong, readSignedRational, readSignedRational, readSignedShort, readSignedShort, readSignedShort, readUnsignedByte, readUnsignedByte, readUnsignedByte, readUnsignedInt, readUnsignedInt, readUnsignedInt, readUnsignedRational, readUnsignedRational, readUnsignedRational, readUnsignedShort, readUnsignedShort, readUnsignedShort, resetParams, setApp, setBase, setChecksums, setCRC32, setDefaultParams, setMD5, setNByte, setSHA1, setValidityNote, setVerbosity, show, skipBytes, skipBytes, vectorToPropArray | 
 
| Methods inherited from class java.lang.Object | 
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait | 
 
_cstream
protected ChecksumInputStream _cstream
- PRIVATE INSTANCE FIELDS.
 
 
_dstream
protected java.io.DataInputStream _dstream
_doctype
protected java.lang.String _doctype
HTML_3_2
public static final int HTML_3_2
- See Also:
- Constant Field Values
HTML_4_0_STRICT
public static final int HTML_4_0_STRICT
- See Also:
- Constant Field Values
HTML_4_0_FRAMESET
public static final int HTML_4_0_FRAMESET
- See Also:
- Constant Field Values
HTML_4_0_TRANSITIONAL
public static final int HTML_4_0_TRANSITIONAL
- See Also:
- Constant Field Values
HTML_4_01_STRICT
public static final int HTML_4_01_STRICT
- See Also:
- Constant Field Values
HTML_4_01_FRAMESET
public static final int HTML_4_01_FRAMESET
- See Also:
- Constant Field Values
HTML_4_01_TRANSITIONAL
public static final int HTML_4_01_TRANSITIONAL
- See Also:
- Constant Field Values
XHTML_1_0_STRICT
public static final int XHTML_1_0_STRICT
- See Also:
- Constant Field Values
XHTML_1_0_TRANSITIONAL
public static final int XHTML_1_0_TRANSITIONAL
- See Also:
- Constant Field Values
XHTML_1_0_FRAMESET
public static final int XHTML_1_0_FRAMESET
- See Also:
- Constant Field Values
XHTML_1_1
public static final int XHTML_1_1
- See Also:
- Constant Field Values
_withTextMD
protected boolean _withTextMD
_textMD
protected TextMDMetadata _textMD
HtmlModule
public HtmlModule()
- Instantiate an HtmlModule object.
 
parse
public int parse(java.io.InputStream stream,
                 RepInfo info,
                 int parseIndex)
          throws java.io.IOException
- Parse the content of a purported HTML stream digital object and store the
   results in RepInfo.
 
- Specified by:
- parse in interface Module
- Overrides:
- parse in class ModuleBase
 
- Parameters:
- stream - An InputStream, positioned at its beginning,
                    which is generated from the object to be parsed.
                    If multiple calls to parse are made
                    on the basis of a nonzero value being returned,
                    a new InputStream must be provided each time.
- info - A fresh (on the first call) RepInfo object
                    which will be modified
                    to reflect the results of the parsing.
                    If multiple calls to parse are made
                    on the basis of a nonzero value being returned,
                    the same RepInfo object should be passed with each
                    call.
- parseIndex - Must be 0 in the first call to parse.  If parse
                    returns a nonzero value, it must be
                    called again with parseIndex equal to that return value.
- Throws:
- java.io.IOException
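
The multi-pass contract described by the parameters above can be sketched as a small driver loop. This is only an illustration under stated assumptions, not JHOVE code: `MultiPassParser` and `StreamFactory` are hypothetical stand-ins for the `parse` method and for whatever reopens the underlying object, and a `StringBuilder` takes the place of `RepInfo` so that the example compiles on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ParseLoopSketch {

    /** Hypothetical stand-in for the parse contract; not the JHOVE Module interface. */
    interface MultiPassParser {
        int parse(InputStream stream, StringBuilder info, int parseIndex) throws IOException;
    }

    /** Hypothetical factory that opens a fresh stream on the same object for each pass. */
    interface StreamFactory {
        InputStream open() throws IOException;
    }

    /**
     * Calls parse repeatedly until it returns 0. Each pass gets a new stream,
     * the same info object, and the previous return value as parseIndex.
     * Returns the number of passes that were made.
     */
    static int parseFully(MultiPassParser parser, StreamFactory source, StringBuilder info)
            throws IOException {
        int parseIndex = 0; // must be 0 on the first call
        int passes = 0;
        do {
            try (InputStream stream = source.open()) {
                parseIndex = parser.parse(stream, info, parseIndex);
            }
            passes++;
        } while (parseIndex != 0);
        return passes;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<html><title>t</title></html>".getBytes(StandardCharsets.US_ASCII);
        // Toy parser that requests one extra pass and then finishes.
        MultiPassParser toy = (stream, info, index) -> {
            info.append("pass ").append(index).append(';');
            return index == 0 ? 1 : 0;
        };
        StringBuilder info = new StringBuilder();
        int passes = parseFully(toy, () -> new ByteArrayInputStream(data), info);
        System.out.println(passes + " passes: " + info);
    }
}
```

With a real module, the caller would presumably pass the module's own `parse` method and a `RepInfo` instance in place of the stand-ins; the loop structure is the part that follows from the documented contract.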
 
checkSignatures
public void checkSignatures(java.io.File file,
                            java.io.InputStream stream,
                            RepInfo info)
                     throws java.io.IOException
- Check if the digital object conforms to this Module's
  internal signature information.
  
  HTML is one of the most ill-defined open formats, so
  checking a "signature" really means using some heuristics. The only
  required tag is TITLE, but that could occur well into the file. So we
  look for any of three strings -- taking into account case-independence
  and white space -- within the first sigBytes bytes, and call that
  a signature check.
 
- Specified by:
- checkSignatures in interface Module
- Overrides:
- checkSignatures in class ModuleBase
 
- Parameters:
- file - A File object for the object being parsed
- stream - An InputStream, positioned at its beginning,
                    which is generated from the object to be parsed
- info - A fresh RepInfo object which will be modified
                    to reflect the results of the test
- Throws:
- java.io.IOException
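
The heuristic described above, a case-insensitive and whitespace-tolerant search within the first sigBytes bytes, might look roughly like the following sketch. The three marker strings are assumptions chosen for illustration, since this page does not say which strings the module actually looks for. The class and method names are likewise hypothetical, and the method takes a byte array rather than the File, InputStream and RepInfo arguments of the real method.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class HtmlSignatureSketch {

    // Hypothetical marker strings; the documentation does not name the real ones.
    private static final String[] MARKERS = { "<!DOCTYPE HTML", "<HTML", "<TITLE" };

    /**
     * Returns true if any marker occurs within the first sigBytes bytes of data,
     * ignoring letter case and treating any run of white space as a single space.
     */
    static boolean looksLikeHtml(byte[] data, int sigBytes) {
        int n = Math.min(Math.max(sigBytes, 0), data.length);
        // ISO-8859-1 maps every byte to one char, so no byte is lost or rejected.
        String head = new String(data, 0, n, StandardCharsets.ISO_8859_1);
        String normalized = head.replaceAll("\\s+", " ").toUpperCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (normalized.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] doc = "<!doctype \n  html public \"-//W3C//DTD HTML 4.01//EN\">"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeHtml(doc, 1024));
    }
}
```

A check of this kind can only suggest that a file is HTML; as the description notes, it is a heuristic rather than a true signature match.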
 
checkDoctype
protected int checkDoctype(java.util.List elements)
 
seemsToBeXHTML
protected int seemsToBeXHTML(java.util.List elements)
 
stripQuotes
protected java.lang.String stripQuotes(java.lang.String str)
 
isXmlAvailable
protected static boolean isXmlAvailable()