Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.7k views
in Technique[技术] by (71.8m points)

parsing - Help the Java SAX parser to understand bad xml

I am parsing XML returned from a website but sadly it is slightly malformed. I am getting XML like:

<tag attrib="Buy two for &pound;1" />

Which, I am informed, is invalid because &pound; is an HTML character, not an XML character and definitely cannot appear in an attribute.

What can I do to fix this, assuming I cannot tell the website to obey the rules? I am considering using a FilterInputStream to filter the arriving data before it gets to the SAX parser but this seems over the top.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

In the end I failed to do this with the parser. My solution was to write a FilterInputStream that converted all &xxxx; references into their &#nnnn; form.

/* Cleans up often very bad xml. 
 * 
 * 1. Strips leading white space.
 * 2. Recodes &pound; etc to &#...;.
 * 3. Recodes lone & as &amp.
 * 
 */
public class XMLInputStream extends FilterInputStream {

  private static final int MIN_LENGTH = 2;
  // Everything we've read.
  StringBuilder red = new StringBuilder();
  // Data I have pushed back.
  StringBuilder pushBack = new StringBuilder();
  // How much we've given them.
  int given = 0;
  // How much we've read.
  int pulled = 0;

  public XMLInputStream(InputStream in) {
    super(in);
  }

  public int length() {
    // NB: This is a Troll length (i.e. it goes 1, 2, many) so 2 actually means "at least 2"

    try {
      StringBuilder s = read(MIN_LENGTH);
      pushBack.append(s);
      return s.length();
    } catch (IOException ex) {
      log.warning("Oops ", ex);
    }
    return 0;
  }

  private StringBuilder read(int n) throws IOException {
    // Input stream finished?
    boolean eof = false;
    // Read that many.
    StringBuilder s = new StringBuilder(n);
    while (s.length() < n && !eof) {
      // Always get from the pushBack buffer.
      if (pushBack.length() == 0) {
        // Read something from the stream into pushBack.
        eof = readIntoPushBack();
      }

      // Pushback only contains deliverable codes.
      if (pushBack.length() > 0) {
        // Grab one character
        s.append(pushBack.charAt(0));
        // Remove it from pushBack
        pushBack.deleteCharAt(0);
      }

    }
    return s;
  }

  // Returns true at eof.
  // Might not actually push back anything but usually will.
  private boolean readIntoPushBack() throws IOException {
    // File finished?
    boolean eof = false;
    // Next char.
    int ch = in.read();
    if (ch >= 0) {
      // Discard whitespace at start?
      if (!(pulled == 0 && isWhiteSpace(ch))) {
        // Good code.
        pulled += 1;
        // Parse out the &stuff;
        if (ch == '&') {
          // Process the &
          readAmpersand();
        } else {
          // Not an '&', just append.
          pushBack.append((char) ch);
        }
      }
    } else {
      // Hit end of file.
      eof = true;
    }
    return eof;
  }

  // Deal with an ampersand in the stream.
  private void readAmpersand() throws IOException {
    // Read the whole word, up to and including the ;
    StringBuilder reference = new StringBuilder();
    int ch;
    // Should end in a ';'
    for (ch = in.read(); isAlphaNumeric(ch); ch = in.read()) {
      reference.append((char) ch);
    }
    // Did we tidily finish?
    if (ch == ';') {
      // Yes! Translate it into a &#nnn; code.
      String code = XML.hash(reference);
      if (code != null) {
        // Keep it.
        pushBack.append(code);
      } else {
        throw new IOException("Invalid/Unknown reference '&" + reference + ";'");
      }
    } else {
      // Did not terminate properly! 
      // Perhaps an & on its own or a malformed reference.
      // Either way, escape the &
      pushBack.append("&amp;").append(reference).append((char) ch);
    }
  }

  private void given(CharSequence s, int wanted, int got) {
    // Keep track of what we've given them.
    red.append(s);
    given += got;
    log.finer("Given: [" + wanted + "," + got + "]-" + s);
  }

  @Override
  public int read() throws IOException {
    StringBuilder s = read(1);
    given(s, 1, 1);
    return s.length() > 0 ? s.charAt(0) : -1;
  }

  @Override
  public int read(byte[] data, int offset, int length) throws IOException {
    int n = 0;
    StringBuilder s = read(length);
    for (int i = 0; i < Math.min(length, s.length()); i++) {
      data[offset + i] = (byte) s.charAt(i);
      n += 1;
    }
    given(s, length, n);
    return n > 0 ? n : -1;
  }

  @Override
  public String toString() {
    String s = red.toString();
    String h = "";
    // Hex dump the small ones.
    if (s.length() < 8) {
      // Separator just inserts the string between each.
      Separator sep = new Separator(" ");
      for (int i = 0; i < s.length(); i++) {
        h += sep.sep() + Integer.toHexString(s.charAt(i));
      }
    }
    return "[" + given + "]-"" + s + """ + (h.length() > 0 ? " (" + h + ")" : "");
  }

  private boolean isWhiteSpace(int ch) {
    switch (ch) {
      case ' ':
      case '
':
      case '
':
      case '':
        return true;
    }
    return false;
  }

  private boolean isAlphaNumeric(int ch) {
    return ('a' <= ch && ch <= 'z') 
        || ('A' <= ch && ch <= 'Z') 
        || ('0' <= ch && ch <= '9');
  }
}

XML.java - For completeness. Please confirm the completeness of the list.

public static String hash(CharSequence s) {
    final Integer code = SPECIALS.get(s.toString());
    if (code != null) {
        return "&#" + code + ";";
    }
    return null;
}

private static final Map<String, Integer> SPECIALS;

static {
    // Derived from Wikipedia http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
    final Map<String, Integer> map = new HashMap<>();
    map.put("quot", 34);
    map.put("amp", 38);
    map.put("apos", 39);
    map.put("lt", 60);
    map.put("gt", 62);
    map.put("nbsp", 160);
    map.put("iexcl", 161);
    map.put("cent", 162);
    map.put("pound", 163);
    map.put("curren", 164);
    map.put("yen", 165);
    map.put("brvbar", 166);
    map.put("sect", 167);
    map.put("uml", 168);
    map.put("copy", 169);
    map.put("ordf", 170);
    map.put("laquo", 171);
    map.put("not", 172);
    map.put("shy", 173);
    map.put("reg", 174);
    map.put("macr", 175);
    map.put("deg", 176);
    map.put("plusmn", 177);
    map.put("sup2", 178);
    map.put("sup3", 179);
    map.put("acute", 180);
    map.put("micro", 181);
    map.put("para", 182);
    map.put("middot", 183);
    map.put("cedil", 184);
    map.put("sup1", 185);
    map.put("ordm", 186);
    map.put("raquo", 187);
    map.put("frac14", 188);
    map.put("frac12", 189);
    map.put("frac34", 190);
    map.put("iquest", 191);
    map.put("Agrave", 192);
    map.put("Aacute", 193);
    map.put("Acirc", 194);
    map.put("Atilde", 195);
    map.put("Auml", 196);
    map.put("Aring", 197);
    map.put("AElig", 198);
    map.put("Ccedil", 199);
    map.put("Egrave", 200);
    map.put("Eacute", 201);
    map.put("Ecirc", 202);
    map.put("Euml", 203);
    map.put("Igrave", 204);
    map.put("Iacute", 205);
    map.put("Icirc", 206);
    map.put("Iuml", 207);
    map.put("ETH", 208);
    map.put("Ntilde", 209);
    map.put("Ograve", 210);
    map.put("Oacute", 211);
    map.put("Ocirc", 212);
    map.put("Otilde", 213);
    map.put("Ouml", 214);
    map.put("times", 215);
    map.put("Oslash", 216);
    map.put("Ugrave", 217);
    map.put("Uacute", 218);
    map.put("Ucirc", 219);
    map.put("Uuml", 220);
    map.put("Yacute", 221);
    map.put("THORN", 222);
    map.put("szlig", 223);
    map.put("agrave", 224);
    map.put("aacute", 225);
    map.put("acirc", 226);
    map.put("atilde", 227);
    map.put("auml", 228);
    map.put("aring", 229);
    map.put("aelig", 230);
    map.put("ccedil", 231);
    map.put("egrave", 232);
    map.put("eacute", 233);
    map.put("ecirc", 234);
    map.put("euml", 235);
    map.put("igrave", 236);
    map.put("iacute", 237);
    map.put("icirc", 238);
    map.put("iuml", 239);
    map.put("eth", 240);
    map.put("ntilde", 241);
    map.put("ograve", 242);
    map.put("oacute", 243);
    map.put("ocirc", 244);
    map.put("otilde", 245);
    map.put("ouml", 246);
    map.put("divide", 247);
    map.put("oslash", 248);
    map.put("ugrave", 249);
    map.put("uacute", 250);
    map.put("ucirc", 251);
    map.put("uuml", 252);
    map.put("yacute", 253);
    map.put("thorn", 254);
    map.put("yuml", 255);
    map.put("OElig", 338);
    map.put("oelig", 339);
    map.put("Scaron", 352);
    map.put("scaron", 353);
    map.put("Yuml", 376);
    map.put("fnof", 402);
    map.put("circ", 710);
    map.put("tilde", 732);
    map.put("Alpha", 913);
    map.put("Beta", 914);
    map.put("Gamma", 915);
    map.put("Delta", 916);
    map.put("Epsilon", 917);
    map.put("Zeta", 918);
    map.put("Eta", 919);
    map.put("Theta", 920);
    map.put("Iota", 921);
    map.put("Kappa", 922);
    map.put("Lambda", 923);
    map.put("Mu", 924);
    map.put("Nu", 925);
    map.put("Xi", 926);
    map.put("Omicron", 927);
    map.put("Pi", 928);
    map.put("Rho", 929);
    map.put("Sigma", 931);
    map.put("Tau", 932);
    map.put("Upsilon", 933);
    map.put("Phi", 934);
    map.put("Chi", 935);
    map.put("Psi", 936);
    map.put("Omega", 937);
    map.put("alpha", 945);
    map.put("beta", 946);
    map.put("gamma", 947);
    map.put("delta", 948);
    map.put("epsilon", 949);
    map.put("zeta", 950);
    map.put("eta", 951);
    map.put("theta", 952);
    map.put("iota", 953);
    map.put("kappa", 954);
    map.put("lambda", 955);
    map.put("mu", 956);
    map.put("nu", 957);
    map.put("xi", 958);
    map.put("omicron", 959);
    map.put("pi", 960);
    map.put("rho", 961);
    map.put("sigmaf", 962);
    map.put("sigma", 963);
    map.put("tau", 964);
    map.put("upsilon", 965);
    map.put("phi", 966);
    map.put("chi", 967);
    map.put("psi", 968);
    map.put("omega", 969);
    map.put("thetasym", 977);
    map.put("upsih", 978);
    map.put("piv", 982);
    map.put("ensp", 8194);
    map.put("emsp", 8195);
    map.put("thinsp", 8201);
    map.put("zwnj", 8204);
    map.put("zwj", 8205);
    map.put("lrm", 8206);
    map.put("rlm", 8207);
    map.put("ndash", 8211);
    map.put("mdash", 8212);
    map.put("lsquo", 8216);
    map.put("rsquo", 8217);
    map.put("sbquo", 8218);
    map.put("ldquo", 8220);
    map.put("rdquo", 8221);
    map.put("bdquo", 8222);
    map.put("dagger", 8224);
    map.put("Dagger", 8225);
    map.put("bull", 8226);
    map.put("hellip", 8230);
    map.put("permil", 8240);
    map.put("prime", 8242);
    map.put("Prime", 8243);
    map.put("lsaquo", 8249);
    map.put("rsaquo", 8250);
    map.put("oline", 8254);
    map.put("frasl", 8260);
    map.put("euro", 8364);
    map.put("image", 8465);
    map.put("weierp", 8472);
    map.put("real", 8476);
    map.put("trade", 8482);
    map.put("alefsym", 8501);
    map.put("larr", 8592);
    map.put("uarr", 8593);
    map.put("rarr", 8594);
    map.put("darr", 8595);
    map.put("harr", 8596);
    map.put("crarr", 8629);
    map.put("lArr", 8656);
    map.put("uArr", 8657);
    map.put("rArr", 8658);
    map.put("dArr", 8659);
    map.put("hArr", 8660);
    map.put("forall", 8704);
    map.put("part", 8706);
    map.put("exist", 8707);
    map.put("empty", 8709);
    map.put("nabla", 8711);
    map.put("isin", 8712);
    map.put("notin", 8713);
    map.put("ni", 8715);
    map.put("prod", 8719);
    map.put("sum", 8721);
    map.put("minus", 8722);
    map.put("lowast", 8727);
    map.put("radic", 8730);
    map.put("prop", 8733);
    map.put("infin", 8734);
    map.put("ang", 8736);
    map.put("and", 8743);
    map.put("or", 8744);
    map.put("cap", 8745);
    map.put("cup", 8746);
    map.put("int", 8747);
    map.put("there4", 8756);
    map.put("sim", 8764);
    map.put("cong", 8773);
    map.put("asy

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...