toxi.data.feeds.util
Class EntityStripper

java.lang.Object
  extended by toxi.data.feeds.util.EntityStripper

public class EntityStripper
extends java.lang.Object

Strips HTML entities such as &quot; from a string, replacing them with their Unicode equivalents.

Since:
2002-07-14

Field Summary
static int LONGEST_ENTITY
          The longest an entity can be is 10 characters, at least in our tables, including the leading & and trailing ;.
static int SHORTEST_ENTITY
          The shortest an entity can be is 4 characters, at least in our tables, including the leading & and trailing ;.
static char UNICODE_NBSP_160_0x0a
          Unicode non-breaking space (nbsp) character, decimal 160, i.e. 0xA0 (the 0x0a in the field name is a misnomer).
 
Constructor Summary
EntityStripper()
           
 
Method Summary
static char bareHTMLEntityToChar(java.lang.String bareEntity, char howToTranslateNbsp)
          Converts an entity to a single char.
static java.lang.String flattenHTML(java.lang.String text, char translateNbspTo)
          Strips tags and entities from HTML.
static java.lang.String flattenXML(java.lang.String text)
          Strips tags and entities from XML.
static char possEntityToChar(java.lang.String possBareEntityWithSemicolon)
          Runs a gauntlet of checks to ensure this is a valid entity.
static java.lang.String stripHTMLEntities(java.lang.String text, char translateNbspTo)
          Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged.
static java.lang.String stripHTMLTags(java.lang.String html)
          Removes tags from HTML, leaving just the raw text.
static java.lang.String stripXMLEntities(java.lang.String text)
          Converts XML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged.
static java.lang.String stripXMLTags(java.lang.String xml)
          Removes tags from XML, leaving just the raw text.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

UNICODE_NBSP_160_0x0a

public static final char UNICODE_NBSP_160_0x0a
Unicode non-breaking space (nbsp) character, decimal 160, i.e. 0xA0 (the 0x0a in the field name is a misnomer).

See Also:
Constant Field Values

LONGEST_ENTITY

public static final int LONGEST_ENTITY
The longest an entity can be is 10 characters, at least in our tables, including the leading & and trailing ;.

See Also:
Constant Field Values

SHORTEST_ENTITY

public static final int SHORTEST_ENTITY
The shortest an entity can be is 4 characters, at least in our tables, including the leading & and trailing ;.

See Also:
Constant Field Values
Constructor Detail

EntityStripper

public EntityStripper()
Method Detail

bareHTMLEntityToChar

public static char bareHTMLEntityToChar(java.lang.String bareEntity,
                                        char howToTranslateNbsp)
Converts an entity to a single char.

Parameters:
bareEntity - String entity to convert. Must have the leading & and trailing ; stripped. May take the form #x12ff, #123, lt, or nbsp. Works faster if the entity is in lower case.
howToTranslateNbsp - char you would like &nbsp; translated to, usually ' ' or (char) 160
Returns:
equivalent character, or 0 if not recognised.
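
The three accepted forms (named, decimal, hex) can be illustrated with a minimal, self-contained sketch. This is not the toxi source: the class BareEntitySketch, its helper numericToChar, and the five-entry name table are assumptions made for illustration, and the sketch only matches names exactly as written.

```java
/** Hypothetical illustration only; not the toxi.data.feeds.util source. */
final class BareEntitySketch {

    /** Returns the decoded char, or 0 if the bare entity is not recognised. */
    static char bareEntityToChar(String bareEntity, char howToTranslateNbsp) {
        if (bareEntity == null || bareEntity.isEmpty()) {
            return 0;
        }
        if (bareEntity.charAt(0) == '#') {
            return numericToChar(bareEntity);
        }
        // Tiny illustrative table; a real table would hold many more names.
        switch (bareEntity) {
            case "nbsp": return howToTranslateNbsp;
            case "lt":   return '<';
            case "gt":   return '>';
            case "amp":  return '&';
            case "quot": return '"';
            default:     return 0;
        }
    }

    /** Decodes #123 (decimal) or #x12ff (hex); returns 0 on any malformed input. */
    private static char numericToChar(String bareEntity) {
        boolean hex = bareEntity.length() > 1
                && (bareEntity.charAt(1) == 'x' || bareEntity.charAt(1) == 'X');
        try {
            int code = hex
                    ? Integer.parseInt(bareEntity.substring(2), 16)
                    : Integer.parseInt(bareEntity.substring(1), 10);
            // Only code points that fit in a single Java char are accepted here.
            return (code > 0 && code <= 0xFFFF) ? (char) code : 0;
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(bareEntityToChar("lt", ' '));
        System.out.println((int) bareEntityToChar("nbsp", (char) 160));
    }
}
```

Passing ' ' as the second argument turns non-breaking spaces into ordinary spaces, which is usually what plain-text output wants; passing (char) 160 preserves them.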

flattenHTML

public static java.lang.String flattenHTML(java.lang.String text,
                                           char translateNbspTo)
Strips tags and entities from HTML. Leaves \n and \r unchanged.

Parameters:
text - to flatten
translateNbspTo - char you would like &nbsp; translated to, usually ' ' or (char) 160.
Returns:
flattened text

flattenXML

public static java.lang.String flattenXML(java.lang.String text)
Strips tags and entities from XML.

Parameters:
text - to flatten
Returns:
flattened text

possEntityToChar

public static char possEntityToChar(java.lang.String possBareEntityWithSemicolon)
Runs a gauntlet of checks to ensure this is a valid entity, then converts the entity to the corresponding char.

Parameters:
possBareEntityWithSemicolon - string that may hold an entity. The leading & must be stripped; the string may optionally contain text past the ;
Returns:
corresponding Unicode character, or 0 if the entity is invalid. nbsp -> (char) 160.
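
One way to picture the gauntlet is as a set of length checks on the position of the semicolon, using the LONGEST_ENTITY and SHORTEST_ENTITY bounds documented above. The sketch below is an assumption about how such checks could look, not the actual implementation; PossEntitySketch and its decode helper are hypothetical, and the name table is deliberately tiny.

```java
/** Hypothetical illustration only; not the toxi.data.feeds.util source. */
final class PossEntitySketch {
    // Values from the field descriptions; both counts include the lead & and trailing ;.
    static final int LONGEST_ENTITY = 10;
    static final int SHORTEST_ENTITY = 4;

    /** Returns the decoded char, or 0 if the candidate fails any check. */
    static char possEntityToChar(String possBareEntityWithSemicolon) {
        if (possBareEntityWithSemicolon == null) {
            return 0;
        }
        int semi = possBareEntityWithSemicolon.indexOf(';');
        // The lead & is already stripped, so the full entity length is semi + 2.
        // A missing semicolon gives semi == -1 and fails the lower bound.
        if (semi < SHORTEST_ENTITY - 2 || semi > LONGEST_ENTITY - 2) {
            return 0;
        }
        // Anything past the semicolon is ignored.
        return decode(possBareEntityWithSemicolon.substring(0, semi));
    }

    /** Minimal decoder: a few names plus decimal and hex numeric forms. */
    private static char decode(String bare) {
        switch (bare) {
            case "lt":   return '<';
            case "gt":   return '>';
            case "amp":  return '&';
            case "quot": return '"';
            case "nbsp": return (char) 160;
            default:     break;
        }
        if (bare.charAt(0) != '#') {
            return 0;
        }
        boolean hex = bare.charAt(1) == 'x' || bare.charAt(1) == 'X';
        try {
            int code = hex
                    ? Integer.parseInt(bare.substring(2), 16)
                    : Integer.parseInt(bare.substring(1), 10);
            return (code > 0 && code <= 0xFFFF) ? (char) code : 0;
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(possEntityToChar("lt; and trailing text"));
    }
}
```

Checking the semicolon position first means a stray & followed by a long run of ordinary text is rejected cheaply, before any table lookup.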

stripHTMLEntities

public static java.lang.String stripHTMLEntities(java.lang.String text,
                                                 char translateNbspTo)
Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged. Also strips decimal and hex entities and stray HTML entities.

Parameters:
text - raw text to be processed. Must not be null.
translateNbspTo - char you would like &nbsp; translated to, usually ' ' or (char) 160.
Returns:
translated text. It also handles HTML 4.0 entities such as &hearts;, &#123; and &#xffff;. &nbsp; -> 160. null input returns null.
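
The overall scan-and-replace behaviour can be sketched with java.util.regex. This is a stand-in technique chosen for brevity, not a description of how EntityStripper is implemented. StripEntitiesSketch and its decode helper are hypothetical, the name table is tiny, and leaving unrecognised entities untouched is an assumption of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the toxi.data.feeds.util source. */
final class StripEntitiesSketch {
    // Matches &#123; or &#x7b; or &name; and captures the part between & and ;.
    private static final Pattern ENTITY = Pattern.compile(
            "&(#[0-9]{1,5}|#[xX][0-9a-fA-F]{1,4}|[A-Za-z][A-Za-z0-9]{1,7});");

    static String stripEntities(String text, char translateNbspTo) {
        if (text == null) {
            return null;
        }
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer(text.length());
        while (m.find()) {
            char c = decode(m.group(1), translateNbspTo);
            // Assumption: unknown entities are copied through unchanged.
            String replacement = (c == 0) ? m.group() : String.valueOf(c);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static char decode(String bare, char translateNbspTo) {
        switch (bare) {
            case "nbsp": return translateNbspTo;
            case "lt":   return '<';
            case "gt":   return '>';
            case "amp":  return '&';
            case "quot": return '"';
            default:     break;
        }
        if (bare.charAt(0) != '#') {
            return 0;
        }
        boolean hex = bare.charAt(1) == 'x' || bare.charAt(1) == 'X';
        // The regex guarantees well-formed digits, so parseInt cannot throw here.
        int code = hex
                ? Integer.parseInt(bare.substring(2), 16)
                : Integer.parseInt(bare.substring(1), 10);
        return (code > 0 && code <= 0xFFFF) ? (char) code : 0;
    }

    public static void main(String[] args) {
        System.out.println(stripEntities("1 &lt; 2 &amp;&nbsp;3 &gt; 2", ' '));
    }
}
```

Matcher.quoteReplacement matters here: without it, a decoded $ or \ would be interpreted as a group reference or escape by appendReplacement.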

stripHTMLTags

public static java.lang.String stripHTMLTags(java.lang.String html)
Removes tags from HTML, leaving just the raw text. Leaves entities such as &amp; and &nbsp; as is; for example, it does not convert &amp; back to &. Similar to code in Quoter. Also removes <!-- --> comments and the text between applet, style and script tag pairs. Presumes perfectly formed HTML: no > in comments, and all <...> balanced.

Parameters:
html - input HTML
Returns:
raw text, with runs of whitespace collapsed to a single space, and trimmed.
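
The documented behaviour (drop comments, drop applet/style/script bodies, drop remaining tags, collapse whitespace, trim) can be approximated with a few regular-expression passes. This is a rough sketch under the same "perfectly formed HTML" assumption, not the library's code. StripTagsSketch is a hypothetical name, and replacing each tag with a space rather than with nothing is a choice made for this sketch.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the toxi.data.feeds.util source. */
final class StripTagsSketch {
    // (?s) lets . match newlines so multi-line comments and blocks are covered.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Whole element, including its body, for applet, style and script.
    private static final Pattern BLOCK = Pattern.compile(
            "(?is)<(applet|style|script)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        if (html == null) {
            return null;
        }
        String s = COMMENT.matcher(html).replaceAll(" ");
        s = BLOCK.matcher(s).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        // Entities such as &amp; are deliberately left untouched.
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello   <b>world</b></p>"));
    }
}
```

The order of the passes matters: comments and script bodies may contain text that looks like tags, so they are removed before the generic tag pattern runs.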

stripXMLEntities

public static java.lang.String stripXMLEntities(java.lang.String text)
Converts XML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged. Also strips decimal and hex entities and stray HTML entities.

Parameters:
text - raw XML text to be processed. Must not be null.
Returns:
translated text. null input returns null.

stripXMLTags

public static java.lang.String stripXMLTags(java.lang.String xml)
Removes tags from XML, leaving just the raw text. Leaves entities as is; for example, it does not convert &amp; back to &. Similar to code in Quoter. Also removes <!-- --> comments. Presumes perfectly formed XML: no > in comments, and all <...> balanced.

Parameters:
xml - input XML
Returns:
raw text, with runs of whitespace collapsed to a single space, and trimmed.