Nux 1.6

nux.xom.binary
Class BinaryXMLCodec

java.lang.Object
  extended by nux.xom.binary.BinaryXMLCodec

public class BinaryXMLCodec
extends Object

Serializes (encodes) and deserializes (decodes) XOM XML documents to and from an efficient and compact custom binary XML data format (termed bnux format), without loss or change of any information. Serialization and deserialization is much faster than with the standard textual XML format, and the resulting binary data is more compressed than textual XML.

Applicability

The overall goal of the bnux algorithm is to maximize serialization and deserialization (parsing) performance without requiring any schema description. Serialization and deserialization speed are roughly balanced against each other; neither side is particularly favoured over the other. Another beneficial effect of the algorithm is that a considerable degree of XML data redundancy is eliminated, but compression is more a welcome side-effect than a primary goal in itself. The algorithm is primarily intended for tightly coupled high-performance systems exchanging large volumes of XML data over networks, as well as for compact main memory caches and for short-term storage as BLOBs in backend databases or files (e.g. "session" data with limited duration). In the case of BLOB storage, selecting matching BLOBs can be sped up by maintaining a simple metaindex side table for the most frequent access patterns. See the performance results below.

While the Java API is considered stable, the bnux data format should be considered a black box: Its internals are under-documented and may change without notice from release to release in backwards-incompatible ways. It is unlikely that support for reading data written with older Nux versions will ever be available. bnux is an exchange format but not an interoperability format. Having said that, the data format is independent of machine architecture and platform. For example, a bnux file can be moved back and forth between a 32 bit Intel little-endian machine and a 64 bit PowerPC big-endian machine; it remains parseable no matter where.

This approach is expressly not intended as a replacement for standard textual XML in loosely coupled systems where maximum long-term interoperability is the overarching concern. It is also expressly not intended for long-term data storage. If you store data in bnux format there's every chance you won't be able to read it back a year or two from now, or even earlier. Finally, it is probably unwise to use this class if your application's performance requirements are not particularly stringent, or profiling indicates that the bottleneck is not related to XML serialization/deserialization anyway.

The bnux serialization algorithm is a fully streaming block-oriented algorithm, ideal for large numbers of very small to arbitrarily large XML documents.

The bnux deserialization algorithm is a fully streaming algorithm and can optionally be pushed through a NodeFactory. This enables efficient filtering and can avoid the need to build a main memory tree, which is particularly useful for arbitrarily large documents. For example, streaming XQueries over binary XML can be expressed via the NodeFactory generated by a StreamingPathFilter. In streaming mode, the binary codec exactly mimics the NodeFactory based behaviour of the XOM Builder.

Faithfully Preserving XML

Any and all arbitrary XOM XML documents are supported, and no schema is required. A XOM document that is serialized and subsequently deserialized by this class is exactly the same as the original document, preserving "as is" all names and data for elements, namespaces, additional namespace declarations, attributes, texts, document type, comments, processing instructions, whitespace, Unicode characters including surrogates, etc. As a result, the W3C XML Infoset and the W3C Canonical XML representation are guaranteed to be preserved. In particular, the following always holds:
 java.util.Arrays.equals(XOMUtil.toCanonicalXML(doc), XOMUtil
                .toCanonicalXML(deserialize(serialize(doc))));
 

Optional ZLIB Compression

The bnux algorithm considerably compresses XML data with little CPU consumption, by its very design. However, bnux also has an option to further compress/decompress its output/input with the ZLIB compression algorithm. ZLIB uses the DEFLATE method (LZ77 combined with Huffman coding), which is also used by the popular gzip (see e.g. Deflater). ZLIB compression is rather CPU intensive, but it typically yields strong compression factors, in particular for documents containing mostly narrative text (e.g. the Bible). For example, strong compression may be desirable over low-bandwidth networks or when bnux data is known to be accessed rather infrequently. On the other hand, ZLIB compression probably kills performance in the presence of high-bandwidth networks such as ESnet, Internet2/Abilene or 10 Gigabit Ethernet/InfiniBand LANs, even with high-end CPUs. CPU drain is also a scalability problem in the presence of large numbers of concurrent connections. An option ranging from 0 (no ZLIB compression; best performance) to 1 (little ZLIB compression; reduced performance) to 9 (strongest ZLIB compression; worst performance) allows one to configure the CPU/memory consumption trade-off.
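
The effect of the compression level can be illustrated with the JDK's own java.util.zip.Deflater. The following is a minimal sketch under stated assumptions: it compresses plain textual XML rather than bnux data, and the class name ZlibLevelDemo and its sample document are made up for illustration. It does not show how bnux applies ZLIB internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustration only: shows how ZLIB levels 0, 1 and 9 affect output size.
// This is not part of Nux and does not operate on bnux data.
public class ZlibLevelDemo {

    /** Builds a small, repetitive textual XML document (hypothetical sample data). */
    static byte[] sampleXml() {
        StringBuilder sb = new StringBuilder("<items>\n");
        for (int i = 0; i < 2000; i++) {
            sb.append("<item id=\"").append(i).append("\">some narrative text</item>\n");
        }
        sb.append("</items>\n");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Returns the number of bytes produced by ZLIB at the given level (0..9). */
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data);
            deflater.finish();
            byte[] buffer = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    public static void main(String[] args) {
        byte[] xml = sampleXml();
        for (int level : new int[] {0, 1, 9}) {
            System.out.println("level " + level + ": "
                + compressedSize(xml, level) + " of " + xml.length + " bytes");
        }
    }
}
```

Level 0 merely wraps the input in stored blocks, so its output is slightly larger than the input, while higher levels spend more CPU time to shrink the data.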

Reliability

This class has been successfully tested against some 50000 extremely weird and unique test documents, including the W3C XML conformance test suite, and no bugs are known.

Serialization employs no error checking at all, since malformed XOM input documents are impossible to produce given XOM's design: XOM strictly enforces wellformedness anyway. Deserialization employs some limited error checking, throwing exceptions for any improper API usage, non-bnux input data, data format version mismatch, or general binary data corruption. Beyond this, deserialization relies on XOM's hard-wired wellformedness checks, just like serialization does. Barring one of the above catastrophic situations, the bnux algorithm will always correctly and faithfully reconstruct the exact same well-formed XOM document.

Example Usage:

 // parse standard textual XML, convert to binary format, round-trip it and compare results
 Document doc = new Builder().build(new File("samples/data/periodic.xml"));
 BinaryXMLCodec codec = new BinaryXMLCodec();
 byte[] bnuxDoc = codec.serialize(doc, 0);
 Document doc2 = codec.deserialize(bnuxDoc);
 boolean isEqual = java.util.Arrays.equals(
     XOMUtil.toCanonicalXML(doc), XOMUtil.toCanonicalXML(doc2));
 System.out.println("isEqual = " + isEqual);
 System.out.println(doc2.toXML());
 
 // write binary XML document to file
 OutputStream out = new FileOutputStream("/tmp/periodic.xml.bnux");
 out.write(bnuxDoc);
 out.close();
 
 // read binary XML document from file
 bnuxDoc = FileUtil.toByteArray(new FileInputStream("/tmp/periodic.xml.bnux"));
 Document doc3 = codec.deserialize(bnuxDoc);
 System.out.println(doc3.toXML());
 

Performance

This class has been carefully profiled and optimized. Preliminary performance results over a wide range of real-world documents are given below. A more detailed presentation can be found at the Global Grid Forum Web Services Performance Workshop.

Contrasting bnux BinaryXMLCodec with the XOM Builder and Serializer:

For meaningful comparison, MB/s and compression factors are always given normalized in relation to the original standard textual XML file size. Note that in contrast to other algorithms, these measurements include XOM tree building and walking, and hence measure the cost of delivering data to and from actual XML applications, rather than merely to and from a low-level SAX event stream (which is considerably cheaper and deemed less useful).

The deserialization speedup is further multiplied when DTD or schema validation is used while parsing standard textual XML.

This class relies on advanced Java compiler optimizations, which take considerable time to warm up. Hence, for comparative benchmarks, use a server-class VM and make sure to repeat runs for at least 30 seconds.
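
A warm-up phase of this kind can be sketched as follows. This is a hedged illustration using only the JDK: the class WarmupBench and its method nanosPerOp are hypothetical names, and the Runnable stands in for whatever serialize or deserialize call is being measured. It is not the harness used by BinaryXMLTest.

```java
// Illustration only: run the task repeatedly for a warm-up period so the JIT
// compiler can optimize it, then time a fixed number of measured runs.
public class WarmupBench {

    /** Returns the average nanoseconds per run, measured after warming up. */
    static double nanosPerOp(Runnable task, long warmupMillis, int measuredRuns) {
        long deadline = System.nanoTime() + warmupMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            task.run(); // results discarded; this phase only triggers JIT compilation
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / (double) measuredRuns;
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with the codec call under test.
        Runnable task = () -> new StringBuilder("<a>").append("text").append("</a>").toString();
        // 30 seconds of warm-up, following the recommendation above.
        double nanos = nanosPerOp(task, 30_000, 100_000);
        System.out.println(nanos + " ns/op");
    }
}
```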

Further, you will probably want to eliminate drastic XOM hotspots by compiling XOM with "ant -Dfat=true jar" to maintain an internal String instead of a UTF-8 encoded byte array in Text, which eliminates the expensive character conversions implied for each access to a Text object. This increases performance at the expense of memory footprint. The measurements above report numbers using these patches, both for xom and bnux. If you're curious about the whereabouts of bottlenecks, run java with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing).

Use class BinaryXMLTest to reproduce results, verify correctness or to evaluate performance for your own datasets.

Author:
whoschek.AT.lbl.DOT.gov, $Author: hoschek $

Constructor Summary
BinaryXMLCodec()
          Constructs an instance. An instance can be reused serially, but is not thread-safe, just like a Builder.
 
Method Summary
 StreamingSerializer createStreamingSerializer(OutputStream out, int zlibCompressionLevel)
          Constructs a new streaming serializer that serializes bnux binary XML to the given underlying output stream, using the given ZLIB compression level.
 Document deserialize(byte[] bnuxDocument)
          Equivalent to deserialize(new ByteArrayInputStream(bnuxDocument), new NodeFactory()).
 Document deserialize(InputStream input, NodeFactory factory)
          Returns the XOM document obtained by deserializing the next binary XML document from the given input stream.
 boolean isBnuxDocument(InputStream input)
          Returns whether or not the given input stream contains a bnux document.
 byte[] serialize(Document document, int zlibCompressionLevel)
          Returns the bnux binary XML document obtained by serializing the given XOM document.
 void serialize(Document document, int zlibCompressionLevel, OutputStream out)
          Serializes the given XOM document as a bnux binary XML document onto the given output stream.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BinaryXMLCodec

public BinaryXMLCodec()
Constructs an instance. An instance can be reused serially, but is not thread-safe, just like a Builder.
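
One common way to reuse such an object safely from several threads is to keep one instance per thread. The sketch below assumes nothing about Nux itself: the nested class Codec is a stand-in for any reusable but non-thread-safe object such as a BinaryXMLCodec, and PerThreadCodec is a hypothetical helper name.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustration only: one instance per thread, reused serially within that thread.
public class PerThreadCodec {

    /** Stand-in for a reusable, non-thread-safe object (hypothetical). */
    static final class Codec { }

    private static final ThreadLocal<Codec> CODEC = ThreadLocal.withInitial(Codec::new);

    /** Returns the calling thread's own instance, creating it on first use. */
    static Codec get() {
        return CODEC.get();
    }

    /** Fetches the instance that a freshly started thread sees. */
    static Codec getFromOtherThread() {
        AtomicReference<Codec> seen = new AtomicReference<>();
        Thread worker = new Thread(() -> seen.set(get()));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println("same thread, same instance: " + (get() == get()));
        System.out.println("other thread, other instance: " + (get() != getFromOtherThread()));
    }
}
```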

Method Detail

createStreamingSerializer

public StreamingSerializer createStreamingSerializer(OutputStream out,
                                                     int zlibCompressionLevel)
Constructs a new streaming serializer that serializes bnux binary XML to the given underlying output stream, using the given ZLIB compression level.

An optional zlib compression level ranging from 0 (no ZLIB compression; best performance) to 1 (little ZLIB compression; reduced performance) to 9 (strongest ZLIB compression; worst performance) allows one to configure the CPU/memory consumption trade-off.

Unless there is a good reason to the contrary, you should always use level 0: the bnux algorithm typically already precompresses considerably.

Parameters:
out - the underlying output stream to write to
zlibCompressionLevel - a number in the range 0..9
Returns:
a streaming serializer

isBnuxDocument

public boolean isBnuxDocument(InputStream input)
                       throws IOException
Returns whether or not the given input stream contains a bnux document.

A peek into the first 4 bytes is sufficient for unambiguous detection, as standard textual XML cannot start with any arbitrary four-byte combination.

Finally, the read bytes are put back onto the stream, so they can be reread as part of subsequent parsing attempts. Therefore, the input stream must support input.mark() and input.reset(). For example, a BufferedInputStream is a good choice.
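
The peek-and-put-back technique described above can be sketched with plain JDK streams. This is an illustration under stated assumptions: the four bytes in MAGIC are a made-up signature, not the real bnux header (which is undocumented), and MagicPeek is a hypothetical class name.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Illustration only: peek at the first 4 bytes, then restore the stream position.
public class MagicPeek {

    // Hypothetical 4-byte signature; NOT the actual bnux header.
    static final byte[] MAGIC = {(byte) 0xCA, (byte) 0xFE, 0x01, 0x02};

    /** Returns true if the stream starts with MAGIC; the stream position is left unchanged. */
    static boolean startsWithMagic(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark() and reset()");
        }
        in.mark(MAGIC.length);
        try {
            byte[] head = new byte[MAGIC.length];
            int n = 0;
            while (n < head.length) { // a single read() may return fewer bytes than requested
                int r = in.read(head, n, head.length - n);
                if (r < 0) break;     // end of stream before 4 bytes were available
                n += r;
            }
            return n == head.length && Arrays.equals(head, MAGIC);
        } finally {
            in.reset();               // put the peeked bytes back
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xCA, (byte) 0xFE, 0x01, 0x02, 42};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println("magic found: " + startsWithMagic(in));
        System.out.println("first byte still readable: " + (in.read() == 0xCA));
    }
}
```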

Parameters:
input - the stream to read from
Returns:
true if the stream contains a bnux document
Throws:
IllegalArgumentException - if the underlying stream does not support input.mark() and input.reset().
IOException - if the underlying input stream encounters an I/O error
See Also:
InputStream.mark(int)

deserialize

public Document deserialize(byte[] bnuxDocument)
                     throws BinaryParsingException
Equivalent to deserialize(new ByteArrayInputStream(bnuxDocument), new NodeFactory()).

Parameters:
bnuxDocument - the bnux document to deserialize.
Returns:
the new XOM document obtained from deserialization.
Throws:
BinaryParsingException - if the bnux document is unreadable or corrupt for some reason

deserialize

public Document deserialize(InputStream input,
                            NodeFactory factory)
                     throws BinaryParsingException,
                            IOException
Returns the XOM document obtained by deserializing the next binary XML document from the given input stream.

If the document is in ZLIB compressed bnux format, it will be auto-detected and auto-decompressed as part of deserialization.

This method exactly mimics the NodeFactory based behaviour of the XOM Builder. A NodeFactory enables efficient filtering and can avoid the need to build a main memory tree, which is particularly useful for large documents. For example, streaming XQueries over binary XML can be expressed via the NodeFactory generated by a StreamingPathFilter. Binary XML files can be converted to and from standard textual XML files via a XOMUtil.getRedirectingNodeFactory(StreamingSerializer). For other example factories, see XOMUtil.

Bnux is a self-framing data format: It knows where the end of a document occurs. An input stream can contain any number of independent documents, one after another. Thus, this method reads from the stream as many bytes as required for the current document, but no more than that. Unlike SAX XML parsers and unlike a Builder, it does not read until end-of-stream (EOS), and it does not auto-close the input stream. If this method returns successfully, the input stream has been positioned one byte past the current bnux document, ready to deserialize the following document, if any. It is the responsibility of the caller to ensure the input stream gets properly closed when deemed appropriate.
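
The idea of a self-framing format can be illustrated with a generic length-prefixed encoding. The sketch below is an analogy only: the actual bnux framing is undocumented and may differ entirely, and SelfFramingDemo with its methods are hypothetical names. It shows several records written back to back into one stream, and a reader that consumes exactly one record per call without reading to end-of-stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: length-prefixed framing, NOT the real bnux wire format.
public class SelfFramingDemo {

    /** Writes each record as a 4-byte length followed by its UTF-8 payload. */
    static byte[] writeAll(List<String> records) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String record : records) {
                byte[] payload = record.getBytes(StandardCharsets.UTF_8);
                out.writeInt(payload.length);
                out.write(payload);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads exactly one record and leaves the stream positioned at the next one. */
    static String readOne(DataInputStream in) throws IOException {
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }

    /** Reads all records from an in-memory buffer, one self-framed record at a time. */
    static List<String> readAll(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            List<String> records = new ArrayList<>();
            // available() is exact for in-memory streams; do not rely on it for sockets.
            while (in.available() > 0) {
                records.add(readOne(in));
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("<a/>", "<b>text</b>", "<c/>");
        System.out.println(readAll(writeAll(docs)));
    }
}
```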

Parameters:
input - the stream to read and deserialize from
factory - the node factory to stream into. May be null in which case the default XOM NodeFactory is used, building the complete XML document tree.
Returns:
the new XOM document obtained from deserialization.
Throws:
BinaryParsingException - if the bnux document is unreadable or corrupt for some reason
IOException - if the underlying input stream encounters an I/O error

serialize

public byte[] serialize(Document document,
                        int zlibCompressionLevel)
                 throws IllegalArgumentException
Returns the bnux binary XML document obtained by serializing the given XOM document.

An optional zlib compression level ranging from 0 (no ZLIB compression; best performance) to 1 (little ZLIB compression; reduced performance) to 9 (strongest ZLIB compression; worst performance) allows one to configure the CPU/memory consumption trade-off.

Unless there is a good reason to the contrary, you should always use level 0: the bnux algorithm typically already precompresses considerably.

Parameters:
document - the XOM document to serialize
zlibCompressionLevel - a number in the range 0..9
Returns:
the bnux document obtained from serialization.
Throws:
IllegalArgumentException - if the compression level is out of range.

serialize

public void serialize(Document document,
                      int zlibCompressionLevel,
                      OutputStream out)
               throws IllegalArgumentException,
                      IOException
Serializes the given XOM document as a bnux binary XML document onto the given output stream.

An optional zlib compression level ranging from 0 (no ZLIB compression; best performance) to 1 (little ZLIB compression; reduced performance) to 9 (strongest ZLIB compression; worst performance) allows one to configure the CPU/memory consumption trade-off.

Unless there is a good reason to the contrary, you should always use level 0: the bnux algorithm typically already precompresses considerably.

Parameters:
document - the XOM document to serialize
zlibCompressionLevel - a number in the range 0..9
out - the output stream to write to
Throws:
IllegalArgumentException - if the compression level is out of range.
IOException - if the underlying output stream encounters an I/O error
