Nux 1.6
java.lang.Object nux.xom.binary.BinaryXMLCodec
public class BinaryXMLCodec
Serializes (encodes) and deserializes (decodes) XOM XML documents to and from an efficient and compact custom binary XML data format (termed bnux format), without loss or change of any information. Serialization and deserialization are much faster than with the standard textual XML format, and the resulting binary data is more compact than textual XML.
While the Java API is considered stable, the bnux data format should be considered a black box: Its internals are under-documented and may change without notice from release to release in backwards-incompatible ways. It is unlikely that support for reading data written with older Nux versions will ever be available. bnux is an exchange format but not an interoperability format. Having said that, the data format is machine architecture/platform independent. For example, a bnux file can be moved back and forth between a 32-bit Intel little-endian machine and a 64-bit PowerPC big-endian machine; it remains parseable no matter where.
This approach is expressly not intended as a replacement for standard textual XML in loosely coupled systems where maximum long-term interoperability is the overarching concern. It is also expressly not intended for long-term data storage. If you store data in bnux format there's every chance you won't be able to read it back a year or two from now, or even earlier. Finally, it is probably unwise to use this class if your application's performance requirements are not particularly stringent, or profiling indicates that the bottleneck is not related to XML serialization/deserialization anyway.
The bnux serialization algorithm is a fully streaming block-oriented algorithm, ideal for large numbers of very small to arbitrarily large XML documents.
The bnux deserialization algorithm is a fully streaming algorithm and can optionally be pushed through a NodeFactory. This enables efficient filtering and can avoid the need to build a main memory tree, which is particularly useful for arbitrarily large documents. For example, streaming XQueries over binary XML can be expressed via the NodeFactory generated by a StreamingPathFilter. In streaming mode, the binary codec exactly mimics the NodeFactory based behaviour of the XOM Builder.
java.util.Arrays.equals(XOMUtil.toCanonicalXML(doc), XOMUtil.toCanonicalXML(deserialize(serialize(doc))));
gzip (e.g. Deflater). ZLIB compression is rather CPU intensive, but it typically yields strong compression factors, in particular for documents containing mostly narrative text (e.g. the bible). For example, strong compression may be desirable over low-bandwidth networks or when bnux data is known to be accessed rather infrequently. On the other hand, ZLIB compression probably kills performance in the presence of high-bandwidth networks such as ESnet, Internet2/Abilene or 10 Gigabit Ethernet/InfiniBand LANs, even with high-end CPUs. CPU drain is also a scalability problem in the presence of large numbers of concurrent connections. An option ranging from 0 (no ZLIB compression; best performance) to 1 (little ZLIB compression; reduced performance) to 9 (strongest ZLIB compression; worst performance) allows one to configure the CPU/memory consumption trade-off.
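The level trade-off described above can be observed with the JDK's own `java.util.zip.Deflater`. The following is a minimal sketch, not the bnux implementation: the class name `ZlibLevelDemo` and the repetitive sample text are illustrative assumptions, and the compressor is applied to plain text rather than to bnux data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class ZlibLevelDemo {

    /** Compresses data with java.util.zip.Deflater at the given level (0..9). */
    static byte[] compress(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Reverses compress(); wraps the checked exception for brevity. */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported input");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Hypothetical sample input: highly repetitive narrative text. */
    static byte[] sampleText() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("In the beginning was the word, and the word was narrative text. ");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] text = sampleText();
        for (int level : new int[] {0, 1, 9}) {
            long start = System.nanoTime();
            byte[] packed = compress(text, level);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("level " + level + ": " + text.length + " -> "
                    + packed.length + " bytes in " + micros + " us");
        }
    }
}
```

Level 0 stores the data uncompressed (with a few bytes of framing overhead), while the higher levels spend CPU time to shrink it. Timings from a single run like this are only indicative.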
Serialization employs no error checking at all, since malformed XOM input documents are impossible to produce given XOM's design: XOM strictly enforces wellformedness anyway. Deserialization employs some limited error checking, throwing exceptions for any improper API usage, non-bnux input data, data format version mismatch, or general binary data corruption. Beyond this, deserialization relies on XOM's hard-wired wellformedness checks, just like serialization does. Barring one of the above catastrophic situations, the bnux algorithm will always correctly and faithfully reconstruct the exact same well-formed XOM document.
    // parse standard textual XML, convert to binary format, round-trip it and compare results
    Document doc = new Builder().build(new File("samples/data/periodic.xml"));
    BinaryXMLCodec codec = new BinaryXMLCodec();
    byte[] bnuxDoc = codec.serialize(doc, 0);
    Document doc2 = codec.deserialize(bnuxDoc);
    boolean isEqual = java.util.Arrays.equals(
        XOMUtil.toCanonicalXML(doc), XOMUtil.toCanonicalXML(doc2));
    System.out.println("isEqual = " + isEqual);
    System.out.println(doc2.toXML());

    // write binary XML document to file
    OutputStream out = new FileOutputStream("/tmp/periodic.xml.bnux");
    out.write(bnuxDoc);
    out.close();

    // read binary XML document from file
    bnuxDoc = FileUtil.toByteArray(new FileInputStream("/tmp/periodic.xml.bnux"));
    Document doc3 = codec.deserialize(bnuxDoc);
    System.out.println(doc3.toXML());
Contrasting bnux BinaryXMLCodec with the XOM Builder and Serializer:
The deserialization speedup is further multiplied when DTDs or schema validation is used while parsing standard textual XML.
This class relies on advanced Java compiler optimizations, which take considerable time to warm up. Hence, for comparative benchmarks, use a server-class VM and make sure to repeat runs for at least 30 seconds.
Further, you will probably want to eliminate drastic XOM hotspots by compiling XOM with "ant -Dfat=true jar" to maintain an internal String instead of a UTF-8 encoded byte array in Text, which eliminates the expensive character conversions implied for each access to a Text object. This increases performance at the expense of memory footprint. The measurements above report numbers using these patches, both for xom and bnux. If you're curious about the whereabouts of bottlenecks, run java with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing).
Use class BinaryXMLTest to reproduce results, verify correctness or to evaluate performance for your own datasets.
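The warm-up advice above (repeat runs long enough for the VM's optimizations to kick in) can be sketched as a small timing loop. This is a hypothetical helper, not part of Nux or BinaryXMLTest: the class `WarmupBenchmark` and its method names are assumptions, and the task being timed is a stand-in for a serialize/deserialize call.

```java
class WarmupBenchmark {

    /**
     * Runs the task repeatedly for warmupMillis (timings discarded), then for
     * measureMillis, and returns the average nanoseconds per run in the
     * measured phase.
     */
    static double averageNanos(Runnable task, long warmupMillis, long measureMillis) {
        runFor(task, warmupMillis);              // warm-up phase: result ignored
        long[] result = runFor(task, measureMillis);
        return (double) result[1] / result[0];   // elapsed nanos / number of runs
    }

    /** Repeats the task until the time budget is used up; returns {runs, elapsedNanos}. */
    private static long[] runFor(Runnable task, long millis) {
        long budget = millis * 1_000_000L;
        long start = System.nanoTime();
        long runs = 0;
        long elapsed;
        do {
            task.run();
            runs++;
            elapsed = System.nanoTime() - start;
        } while (elapsed < budget);
        return new long[] {runs, elapsed};
    }

    public static void main(String[] args) {
        // Stand-in workload; a real comparison would call the codec here.
        StringBuilder sink = new StringBuilder();
        Runnable task = () -> {
            sink.setLength(0);
            for (int i = 0; i < 1000; i++) sink.append(i);
        };
        // Short budgets keep the demo quick; for comparative benchmarks the
        // text above recommends repeating runs for at least 30 seconds.
        double nanos = averageNanos(task, 200, 200);
        System.out.println("average ns per run: " + nanos);
    }
}
```

Discarding the first phase keeps interpreter and compilation overhead out of the reported average.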
Constructor Summary | |
---|---|
BinaryXMLCodec() | Constructs an instance. An instance can be reused serially, but is not thread-safe, just like a Builder. |
Method Summary | |
---|---|
StreamingSerializer | createStreamingSerializer(OutputStream out, int zlibCompressionLevel): Constructs a new streaming serializer that serializes bnux binary XML to the given underlying output stream, using the given ZLIB compression level. |
Document | deserialize(byte[] bnuxDocument): Equivalent to deserialize(new ByteArrayInputStream(bnuxDocument), new NodeFactory()). |
Document | deserialize(InputStream input, NodeFactory factory): Returns the XOM document obtained by deserializing the next binary XML document from the given input stream. |
boolean | isBnuxDocument(InputStream input): Returns whether or not the given input stream contains a bnux document. |
byte[] | serialize(Document document, int zlibCompressionLevel): Returns the bnux binary XML document obtained by serializing the given XOM document. |
void | serialize(Document document, int zlibCompressionLevel, OutputStream out): Serializes the given XOM document as a bnux binary XML document onto the given output stream. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public BinaryXMLCodec()
Constructs an instance. An instance can be reused serially, but is not thread-safe, just like a Builder.
Method Detail |
---|
public StreamingSerializer createStreamingSerializer(OutputStream out, int zlibCompressionLevel)
An optional zlib compression level ranging from 0 (no ZLIB compression; best performance) to 1 (little ZLIB compression; reduced performance) to 9 (strongest ZLIB compression; worst performance) allows one to configure the CPU/memory consumption trade-off.
Unless there is a good reason to the contrary, you should always use level 0: the bnux algorithm typically already precompresses considerably.
out - the underlying output stream to write to
zlibCompressionLevel - a number in the range 0..9
public boolean isBnuxDocument(InputStream input) throws IOException
A peek into the first 4 bytes is sufficient for unambiguous detection, as standard textual XML cannot start with any arbitrary four-byte combination.
Finally, the read bytes are put back onto the stream, so they can be reread as part of subsequent parsing attempts. Therefore, the input stream must support input.mark() and input.reset(). For example, a BufferedInputStream is a good choice.
input - the stream to read from
IllegalArgumentException - if the underlying stream does not support input.mark() and input.reset().
IOException - if the underlying input stream encounters an I/O error
See also: InputStream.mark(int)
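The peek-and-put-back technique described for isBnuxDocument can be illustrated with plain `java.io` streams. This is a sketch under stated assumptions: the class `MagicPeekDemo` is hypothetical, the four signature bytes are invented for the example (the real bnux header is not documented here), and I/O errors are wrapped in an unchecked exception for brevity, whereas the real method declares IOException.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class MagicPeekDemo {

    // Hypothetical 4-byte signature, NOT the real bnux magic number.
    static final byte[] MAGIC = {(byte) 0xCA, (byte) 0xFE, 0x00, 0x01};

    /** Peeks at the first four bytes, then rewinds so they can be read again. */
    static boolean startsWithMagic(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark() and reset()");
        }
        try {
            in.mark(MAGIC.length);                 // remember the current position
            byte[] head = new byte[MAGIC.length];
            int got = 0;
            while (got < head.length) {            // read() may return fewer bytes than asked
                int n = in.read(head, got, head.length - got);
                if (n < 0) break;                  // stream shorter than the signature
                got += n;
            }
            in.reset();                            // put the bytes back
            return got == MAGIC.length && Arrays.equals(head, MAGIC);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] textual = "<?xml version=\"1.0\"?><a/>".getBytes();
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(textual));
        System.out.println("looks binary: " + startsWithMagic(in));
    }
}
```

Because the stream is rewound, the caller can hand the same stream to either a binary or a textual parser after the check.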
public Document deserialize(byte[] bnuxDocument) throws BinaryParsingException
Equivalent to deserialize(new ByteArrayInputStream(bnuxDocument), new NodeFactory()).
bnuxDocument - the bnux document to deserialize.
BinaryParsingException - if the bnux document is unreadable or corrupt for some reason
public Document deserialize(InputStream input, NodeFactory factory) throws BinaryParsingException, IOException
If the document is in ZLIB compressed bnux format, it will be auto-detected and auto-decompressed as part of deserialization.
This method exactly mimics the NodeFactory based behaviour of the XOM Builder. A NodeFactory enables efficient filtering and can avoid the need to build a main memory tree, which is particularly useful for large documents. For example, streaming XQueries over binary XML can be expressed via the NodeFactory generated by a StreamingPathFilter. Binary XML files can be converted to and from standard textual XML files via XOMUtil.getRedirectingNodeFactory(StreamingSerializer). For other example factories, see XOMUtil.
Bnux is a self-framing data format: It knows where the end of a document occurs. An input stream can contain any number of independent documents, one after another. Thus, this method reads from the stream as many bytes as required for the current document, but no more than that. Unlike SAX XML parsers and unlike a Builder, it does not read until end-of-stream (EOS), and it does not auto-close the input stream. If this method returns successfully, the input stream has been positioned one byte past the current bnux document, ready to deserialize the following document, if any. It is the responsibility of the caller to ensure the input stream gets properly closed when deemed appropriate.
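The idea of a self-framing format can be illustrated with a simple length-prefixed layout. This is only an analogy: the actual bnux framing is undocumented, and the class `SelfFramingDemo`, its methods, and the 4-byte length prefix are assumptions made for this sketch. The reader consumes exactly one frame per call and leaves the stream positioned at the start of the next one, without reading to end-of-stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SelfFramingDemo {

    /** Writes each payload as [4-byte big-endian length][UTF-8 bytes]. */
    static byte[] frameAll(String... payloads) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String payload : payloads) {
                byte[] body = payload.getBytes(StandardCharsets.UTF_8);
                out.writeInt(body.length);
                out.write(body);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads exactly one frame; the stream is left at the start of the next frame. */
    static String readOne(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] body = new byte[length];
        in.readFully(body);
        return new String(body, StandardCharsets.UTF_8);
    }

    /** Reads frames one after another until the in-memory stream is exhausted. */
    static List<String> readAll(byte[] stream) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream))) {
            List<String> docs = new ArrayList<>();
            while (in.available() > 0) {
                docs.add(readOne(in));
            }
            return docs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] stream = frameAll("<a/>", "<b>text</b>", "<c x=\"1\"/>");
        System.out.println(readAll(stream));
    }
}
```

With the real codec, the analogous loop would call deserialize(input, factory) repeatedly on the same stream, and the caller would close the stream afterwards.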
input - the stream to read and deserialize from
factory - the node factory to stream into. May be null, in which case the default XOM NodeFactory is used, building the complete XML document tree.
BinaryParsingException - if the bnux document is unreadable or corrupt for some reason
IOException - if the underlying input stream encounters an I/O error
public byte[] serialize(Document document, int zlibCompressionLevel) throws IllegalArgumentException
An optional zlib compression level ranging from 0 (no ZLIB compression; best performance) to 1 (little ZLIB compression; reduced performance) to 9 (strongest ZLIB compression; worst performance) allows one to configure the CPU/memory consumption trade-off.
Unless there is a good reason to the contrary, you should always use level 0: the bnux algorithm typically already precompresses considerably.
document - the XOM document to serialize
zlibCompressionLevel - a number in the range 0..9
IllegalArgumentException - if the compression level is out of range.
public void serialize(Document document, int zlibCompressionLevel, OutputStream out) throws IllegalArgumentException, IOException
An optional zlib compression level ranging from 0 (no ZLIB compression; best performance) to 1 (little ZLIB compression; reduced performance) to 9 (strongest ZLIB compression; worst performance) allows one to configure the CPU/memory consumption trade-off.
Unless there is a good reason to the contrary, you should always use level 0: the bnux algorithm typically already precompresses considerably.
document - the XOM document to serialize
zlibCompressionLevel - a number in the range 0..9
out - the output stream to write to
IllegalArgumentException - if the compression level is out of range.
IOException - if the underlying output stream encounters an I/O error