Nux - Changelog

Version 1.6 - Last published Jun 18, 2006

Version 1.6 (June 18, 2006)

Bnux binary XML performance improvements. Serialization and deserialization are now up to 200% and 40% faster, respectively.

Added StAX serialization to BinaryXMLBench and BinaryXMLTest, also for FastInfoset.

Added logic to infer character encoding the client has used on XMLStreamWriter instantiation.

Upgraded to saxonb-8.7.3 (Saxon 8.3 - 8.7.1 continue to work fine).

Upgraded driver for official W3C XQuery Test Suite (XQTS) to xqts-0.9.4.

Upgraded to lucene-2.0.0.

Version 1.5 (March 26, 2006)

Production Quality StAX Integration

StaxBuilder, StaxParser, StaxUtil, StaxReader have improved and graduated to the nux.xom.io package. This includes...

A facility to write to a StAX XMLStreamWriter (StreamingSerializerFactory).

A subclass of nu.xom.Builder that has the same behaviour as the super class, except that it runs over StAX instead of SAX (StaxUtil.createBuilder(XMLInputFactory, NodeFactory)). Can be used for polymorphic pluggability of SAX vs. StAX.

A class that builds XOM documents, fragments, or individual nodes via StAX (new StaxParser(XMLStreamReader)).

A class that makes a XOM (sub)tree appear as a StAX XMLStreamReader (StaxUtil.createXMLStreamReader(Node)).
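For readers new to StAX, the following is a minimal sketch of the two JDK interfaces that the classes above plug into: XMLStreamWriter (push-style writing) and XMLStreamReader (pull-style reading). It uses only javax.xml.stream from the JDK and no Nux classes; the class name StaxRoundTrip and the sample element names are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

/** Illustration only: plain JDK StAX, no Nux classes involved. */
class StaxRoundTrip {

    /** Pushes a tiny document through an XMLStreamWriter and returns the XML text. */
    static String write() {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartDocument();
            writer.writeStartElement("articles");
            writer.writeStartElement("article");
            writer.writeAttribute("id", "1");
            writer.writeCharacters("hello");
            writer.writeEndElement(); // </article>
            writer.writeEndElement(); // </articles>
            writer.writeEndDocument();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Pulls events from an XMLStreamReader and collects element names in document order. */
    static List<String> elementNames(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            List<String> names = new ArrayList<>();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
            return names;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames(write()));
    }
}
```

Code written against these two interfaces does not need to know whether the other end is a text file, a binary codec or an in-memory tree, which is the kind of pluggability the list above describes.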

Upgrades

Upgraded to saxonb-8.7 (Saxon 8.3 - 8.6.1 continue to work fine).

Upgraded driver for official W3C XQuery Test Suite (XQTS) to xqts-0.8.6.

Upgraded to lucene-1.9.1.

Bug Fixes

bnux: If, on serialization, systemID and publicID of a DocType are null, deserialization maps these IDs back to empty strings, rather than null.

XQuery: NodeWrapper.getTypeAnnotation() returns wrong type for attributes, affecting deepEquals().

Rare bug on reuse of non-validating Xerces Builder object for DTDs, when produced via BuilderFactory.

FileUtil.listFiles() doesn't match with wildcards on Windows, making XQTS driver fail on Windows.

Enhancements

Enabled explicit intermediate flush() for streaming bnux serializer. Can be used for discrete message chunking over network transports.

Added BinaryXMLCodec.isBnux(InputStream) enabling clients to differentiate bnux and XML documents.
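A check such as isBnux(InputStream) is typically done by peeking at the first few bytes of the stream and then rewinding. The sketch below shows that general technique with the JDK's mark/reset; it is not the Nux implementation, and the MAGIC bytes are a made-up placeholder because the real bnux header is not given in this changelog.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Illustration only: format sniffing with mark/reset. The magic bytes are HYPOTHETICAL. */
class FormatSniffer {

    // Made-up signature; the real bnux header bytes are not stated in this changelog.
    private static final byte[] MAGIC = {(byte) 0xB1, 'N', 'U', 'X'};

    /**
     * Returns true if the stream starts with MAGIC. The stream must support
     * mark/reset (wrap it in a BufferedInputStream if it does not). The stream
     * is rewound afterwards, so the caller can hand it to the right parser.
     */
    static boolean looksBinary(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(MAGIC.length);
            try {
                for (byte expected : MAGIC) {
                    if (in.read() != (expected & 0xFF)) {
                        return false; // mismatch or premature end of stream
                    }
                }
                return true;
            } finally {
                in.reset(); // rewind so that no bytes are consumed
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would branch on the result and pass the untouched stream either to a binary decoder or to a regular XML parser.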

Various performance enhancements.

Added NodeBuilder and various performance improvements to redirecting factory and streaming bnux serializer.

Added the following ant targets for massive integration tests:

Target: Description
download-testdata: Downloads some 50000 XML files from various public test data repositories.
test-xqts: Runs the Official W3C XQuery Test Suite (XQTS).
test-bnux: Parses all 50000 files with XOM/SAX/Xerces, roundtrips them with bnux binary XML and compares results against original documents parsed with XOM/SAX/Xerces, testing for equality wrt. Canonical XML as well as XOM's more strict assertEquals() test routine.
test-staxbuilder: Same as test-bnux except that it uses XOM/StAX/Woodstox instead of bnux.
test-staxserializer: Same as test-bnux except that it parses with XOM/SAX/Xerces, serializes with XOM/StAX/Woodstox, reparses with XOM/SAX/Xerces, then compares results against original documents parsed with XOM/SAX/Xerces.

Before running any of these targets, tell them to use enough memory and the latest stable Xerces version, like this:
export JAVA_OPTS='-Xmx200m -Djava.endorsed.dirs=/Users/hoschek/unix/java/share/apache/xerces-2.8.0'
For tests involving StAX, download Woodstox (e.g. wstx-asl-3.0.jar) and copy it into nux/lib.

Version 1.4.1 (Dec 8, 2005)

This release fixes two rare but deadly bugs introduced in version 1.4. Everything else is identical to 1.4.

Fixed bnux serialization bug that may occur when changing compression level on BinaryXMLCodec reuse.

Fixed bnux read ahead bug that may occur when deserializing multiple documents from the same input stream, with compressionLevel > 0.

A consequence is that the bnux header data format has slightly changed again (version number has also changed).

Another consequence is that the internal buffer and stream management is now more efficient and clean, too.

Version 1.4 (Dec 1, 2005)

Improvements and additions in this release focus on scalability, reliability and ease of use.

XQuery and XOM

Upgraded to xom-1.1-final (with compatible performance patches). xom-1.0.x and xom-1.1.x continue to work fine, albeit less efficiently.

Upgraded to saxonb-8.6.1, implementing XQuery W3C Candidate Recommendation, 3 November 2005 (Saxon 8.6, 8.5, 8.4, 8.3 still continue to work fine).

saxon8-xom.jar is no longer needed as its contents are directly compiled into nux.jar, improving simplicity and reliability.

Constructing a new compiled XQuery object is now about 20 times faster.

Added driver for official W3C XQuery Test Suite (XQTS). Contains some 8500 test cases.

XML Streaming and Bnux Binary XML Streaming

Added Streaming Serialization of Very Large Documents in the nux.xom.io package. Using memory consumption close to zero, the new StreamingSerializer enables writing arbitrarily large XML documents onto a destination, such as an OutputStream, both for standard textual XML as well as bnux binary XML (and StAX).

Added streaming bnux deserialization for handling arbitrarily large input documents; uses an InputStream and an application provided NodeFactory just like a XOM Builder does.

Added bnux serialization to an OutputStream.

To enable true streaming, a serialized bnux document now consists internally of one or more independent pages, each at most 64 KB in size. Each page is a tokenized byte array containing a portion of the XML document, in document order. Once a page has been read or written, related (heavy) state can be discarded, freeing memory. No more than one page needs to be held in memory at any given time. For very large documents this reduces memory consumption, increases throughput and reduces latency. For small to medium sized documents it makes next to no difference.
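The paging idea described above can be illustrated with a generic length-prefixed framing. This is a sketch under stated assumptions, not the bnux wire format: the 4-byte length prefix, the zero-length end marker and the class name PagedStream are all invented for the example. Only the 64 KB page limit comes from the text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Illustration only: generic length-prefixed pages, NOT the actual bnux format. */
class PagedStream {

    static final int MAX_PAGE = 64 * 1024; // upper bound per page, as stated in the text

    /** Splits data into pages of at most MAX_PAGE bytes, each preceded by a 4-byte length. */
    static byte[] writePages(byte[] data) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (int off = 0; off < data.length; off += MAX_PAGE) {
                int len = Math.min(MAX_PAGE, data.length - off);
                out.writeInt(len);
                out.write(data, off, len);
            }
            out.writeInt(0); // ASSUMPTION: a zero-length page marks the end of the document
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads page by page, reusing a single buffer; returns the number of pages seen. */
    static int countPages(byte[] framed) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed));
            byte[] page = new byte[MAX_PAGE]; // the only page-sized buffer ever allocated
            int pages = 0;
            for (int len = in.readInt(); len > 0; len = in.readInt()) {
                if (len > MAX_PAGE) {
                    throw new IOException("page too large: " + len);
                }
                in.readFully(page, 0, len);
                // A real decoder would process page[0..len) here and then forget it.
                pages++;
            }
            return pages;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the reader never holds more than one page, its memory use stays bounded by MAX_PAGE regardless of document size, which is the property the paragraph above relies on.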

Slightly more compact bnux data format (version number has changed).

Improved performance on reuse of BinaryXMLCodec instances (recommended).

bnux serialization and deserialization are now roughly 3 times faster for documents containing namespaces, closely matching performance for documents without namespaces.

Added Streaming conversion of standard textual XML to and from binary format, enabling conversion of arbitrarily large documents. The corresponding fire-bnux command line conversion tool now works in fully streaming mode, too.

Other

Added AnalyzerUtil.getMostFrequentTerms(). Returns (frequency:text) pairs for the top N distinct terms (aka words), sorted descending by frequency (and ascending by term, if tied).
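The ordering rule stated above (descending by frequency, ascending by term on ties) can be sketched in plain Java as follows. This is an illustration, not the AnalyzerUtil code: the tokenizer here simply lower-cases and splits on non-letters, whereas Nux presumably delegates tokenizing to a Lucene analyzer, and the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustration only: the ordering rule from the text, with a naive tokenizer. */
class TermFrequencies {

    /** Returns "frequency:term" strings for the top n distinct terms. */
    static List<String> mostFrequentTerms(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        // ASSUMPTION: lower-case and split on non-letters instead of using a real analyzer.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byFrequency = Integer.compare(b.getValue(), a.getValue()); // descending
            return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey()); // ascending
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            result.add(e.getValue() + ":" + e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mostFrequentTerms("fish salmon fish manual salmon fish bait", 3));
    }
}
```

The secondary sort on the term makes the output deterministic when several terms share a frequency.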

Removed deprecated methods XOMUtil.toByteArray() and XOMUtil.toString(). Their functionality remains available in class FileUtil.

Added more test document collections in samples directory.

Added package nux.xom.sandbox, a playground for kicking around various ideas and prototypes without any API compatibility guarantees. Code quality varies from sketchy to reliable, but is generally not nearly as well designed and tested as the remainder of Nux. In the future some of these classes may (or may not) graduate into stable packages.

Version 1.3 (Aug 5, 2005)

Upgraded to saxonb-8.5 (saxon-8.4 and 8.3 should continue to work as well).

Upgraded to xom-1.1-rc1 (with compatible performance patches). Plain xom-1.0 should continue to work as well, albeit less efficiently.

Numerous bnux Binary XML performance enhancements for serialization and deserialization (UTF-8 character encoding, buffer management, symbol table, pack sorting, cache locality, etc). Overall, bnux is now about twice as fast, and, perhaps more importantly, has a much more uniform performance profile, no matter what kind of document flavour is thrown at it. It routinely delivers 50-100 MB/sec deserialization performance, and 30-70 MB/sec serialization performance (commodity PC 2004). It is roughly 5-10 times faster than xom-1.1 with xerces-2.7.1 (which, in turn, is faster than saxonb-8.5, dom4j-1.6.1 and xerces-2.7.1 DOM). Further, preliminary measurements indicate bnux deserialization and serialization to be consistently 2-3 times faster than Sun's FastInfoSet implementation, using XOM. Saxon's PTree could not be tested as it is only available in the commercial version. The only remaining area with substantial potential for performance improvement seems to be complex namespace handling. This might be addressed by slightly restructuring private XOM internals in a future version.

BinaryXMLTest now also has command line support for testing and benchmarking Saxon, DOM and FastInfoSet (besides bnux and XOM).

Rewrote XQueryCommand. The new nux/bin/fire-xquery is a more powerful, flexible and reliable command line test tool that runs a given XQuery against a set of files and prints the result sequence. In addition, it supports schema validation, XInclude (via XOM), an XQuery update facility, malformed HTML parsing (via TagSoup) and much more. It's available for Unix and Windows, and works like any other decent Unix command line tool.

Removed ValidationCommand (made obsolete by the fire-xquery functionality).

Added experimental XQuery in-place update functionality. Comments on the usefulness of the current behaviour are especially welcome, as are suggestions for potential improvements.

Added nux.xom.xquery.ResultSequenceSerializer, which serializes an XQuery/XPath2 result sequence onto a given output stream, using various configurable serialization options such as encoding and indentation. Implements the W3C XQuery/XSLT2 Serialization Draft Spec. Also implements an alternative wrapping algorithm that ensures that any arbitrary result sequence can always be output as a well-formed XML document.

Added XQueryFactory.createXQuery(File file, URI baseURI) and XQueryPool.getXQuery(File file, URI baseURI) to allow for separation of the location of the query file and input XML files.

The default XQuery DocumentURIResolver now recognizes the ".bnux" file extension as binary XML, and parses it accordingly. For example, a query can be 'doc("samples/data/articles.xml.bnux")/articles/*'

Added FileUtil.listFiles(). Returns the URIs of all files whose path matches at least one of the given inclusion wildcard or regular expressions but none of the given exclusion wildcard or regular expressions; starting from the given directory, optionally with recursive directory traversal, insensitive to underlying operating system conventions.
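The include/exclude rule can be sketched on plain path strings as shown below. This is not the FileUtil implementation: the wildcard syntax ('*' for any run of characters, '?' for a single character), the separator normalization and all names in the class are assumptions made for the example.

```java
import java.util.regex.Pattern;

/** Illustration only: include/exclude matching on path strings. */
class PathSelector {

    /** Translates a wildcard such as "*.xml" into a regex ('*' = any run, '?' = one char). */
    static Pattern wildcardToRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // treat everything else literally
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** True if path matches at least one include pattern and none of the exclude patterns. */
    static boolean accept(String path, String[] includes, String[] excludes) {
        // ASSUMPTION: normalizing separators is one way to be insensitive to OS conventions.
        String normalized = path.replace('\\', '/');
        for (String exclude : excludes) {
            if (wildcardToRegex(exclude).matcher(normalized).matches()) {
                return false;
            }
        }
        for (String include : includes) {
            if (wildcardToRegex(include).matcher(normalized).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

For example, accept("samples\\data\\articles.xml", new String[] {"*.xml"}, new String[] {"*/CVS/*"}) accepts the file on both Windows-style and Unix-style paths, while anything under a CVS directory is rejected.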

XOMUtil.Normalizer now uses XML whitespace definition rather than Java whitespace definition.

Added XOMUtil.Normalizer.STRIP, which removes Text nodes that consist of whitespace only (boundary whitespace), retaining other strings unchanged.
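The difference between the XML and Java whitespace definitions matters for characters such as form feed, which Java treats as whitespace but XML 1.0 does not. The following sketch shows the XML definition and a whitespace-only test of the kind STRIP needs; it is an illustration with invented names, not the XOMUtil.Normalizer code.

```java
/** Illustration only: XML 1.0 whitespace versus java.lang.Character.isWhitespace. */
class XmlWhitespace {

    /** XML 1.0 whitespace: space, tab, line feed and carriage return, nothing else. */
    static boolean isXmlWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    /** True if every character is XML whitespace (an empty string also counts). */
    static boolean isWhitespaceOnly(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (!isXmlWhitespace(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Form feed is whitespace for Java but not for XML.
        System.out.println(Character.isWhitespace('\f') + " vs " + isXmlWhitespace('\f'));
    }
}
```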

Added AnalyzerUtil.getPorterStemmerAnalyzer() for English language stemming on full text search.

Added XOMUtil.toDocument(String xml) convenience method to parse a string.

Moved XOMUtil.toByteArray() and XOMUtil.toString() into class FileUtil. The old methods remain available but have been deprecated.

Added "jar-bnux" ant target to optionally build a minimal jar file (20 KB) for binary XML only.

Added more test documents to samples/data directory.

Updated license blurbs to 2005.

Version 1.2 (May 26, 2005)

XQuery/XPath: Added optional fulltext search via Apache Lucene engine. Similar to Google search, it is easy to use, powerful, efficient and goes far beyond what can be done with standard XPath regular expressions and string manipulation functions. It is similar in intent but not directly related to preliminary W3C fulltext search drafts. Rather than targeting fulltext search of infrequent queries over huge persistent data archives (historic search), Nux targets fulltext search of huge numbers of queries over comparatively small transient realtime data (prospective search). See FullTextUtil and MemoryIndex.

Example fulltext XQuery that finds all books authored by James that have something to do with 'salmon fishing manuals', sorted by relevance:
declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~"; (: any arbitrary Lucene query can go here :)
(: declare variable $query as xs:string external; :)
for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book

Example fulltext XQuery that matches on extracted sentences:
declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
for $book in /books/book
    for $s in lucene:sentences($book/abstract, 0)
    return
        if (lucene:match($s, "+salmon~ +fish* manual~") > 0.0)
        then normalize-space($s)
        else ()

It is designed to enable maximum efficiency for on-the-fly matchmaking combining structured and fuzzy fulltext search in realtime streaming applications such as XQuery based XML message queues, publish-subscribe systems for Blogs/newsfeeds, text chat, data acquisition and distribution systems, application level routers, firewalls, classifiers, etc. For some interesting background information on search technology, see Bob Wyman's Prospective Search, Jim Gray's A Call to Arms - Custom subscriptions, and Tim Bray's On Search, the Series.
Arbitrary Lucene fulltext queries can be run from Java or from XQuery/XPath/XSLT via a simple extension function. The former approach is more flexible whereas the latter is more convenient. Lucene analyzers can split on whitespace, normalize to lower case for case insensitivity, ignore common terms with little discriminatory value such as "he", "in", "and" (stop words), reduce the terms to their natural linguistic root form such as "fishing" being reduced to "fish" (stemming), resolve synonyms/inflexions/thesauri (upon indexing and/or querying), etc. Also see Lucene Query Syntax as well as Query Parser Rules.
Background: The first prototype was put together over the weekend. The functionality worked just fine, except that it took ages to index and search text in a high-frequency environment. Subsequently I wrote a complete reimplementation of the Lucene interfaces and contributed that back to Lucene (the bits in org.apache.lucene.index.memory.*). Next, I placed a smart cache in front of it (the bits in nux.xom.pool.FullTextUtil / FullTextPool). The net effect is that fulltext queries over realtime data now run some three orders of magnitude faster while preserving the same general functionality (e.g. 100000-500000 queries/sec ballpark). In fact, you'll probably notice little or no overhead when adding fulltext search to your streaming apps. See MemoryIndexBenchmark and XQueryBenchmark.
Explore and enjoy, perhaps using the queries and sample data from the samples/fulltext directory as a starting point.

Removed deprecated XQueryUtil.normalizeTexts(). The same functionality remains available through XOMUtil.Normalizer.PRESERVE.normalize().

Version 1.1 (Apr 6, 2005)

nux.xom.pool: Added a configurable XML caching framework. Classes DocumentFactory, DocumentPool, DocumentMap and PoolConfig enable efficient compact thread-safe pooling/caching of XOM document objects. Cached documents typically consume 20-100 times less memory than the equivalent XOM main memory tree. Usage is safe: It survives stress tests looking for memory leaks, race conditions, etc. Plugins for dependency chain invalidation could be added in the future, but for the moment this isn't explicitly supported. Comments on this or any other Nux aspect are always welcome. See API.

nux.xom.pool.*: All pools and ThreadLocals now internally use SoftReferences to allow for automatic garbage collection of cached objects in low-memory situations.
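The effect of SoftReferences in a cache can be shown with a very small sketch. It is not the Nux pool implementation: the class name SoftCache, its single get method and the loader callback are invented here, and eviction policies, size limits and the PoolConfig options are left out entirely.

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Illustration only: a tiny soft-reference cache, not the Nux pool classes. */
class SoftCache<K, V> {

    private final Map<K, SoftReference<V>> entries = new ConcurrentHashMap<>();

    /**
     * Returns the cached value, or loads and caches it if it is absent or has
     * been reclaimed by the garbage collector. Two threads may occasionally
     * load the same key concurrently; the later put simply wins.
     */
    V get(K key, Function<K, V> loader) {
        SoftReference<V> ref = entries.get(key);
        V value = (ref == null) ? null : ref.get(); // null once the GC has cleared the reference
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, new SoftReference<>(value));
        }
        return value;
    }
}
```

The JVM clears softly reachable objects before it would throw an OutOfMemoryError, so a cache built this way shrinks on its own in low-memory situations at the cost of occasionally reloading a value.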

nux.xom.pool.*: All pools now have a constructor that takes a PoolConfig object.

XQuery/XPath: Now requires saxon-8.4 (bundled with the download). See the Saxon documentation for the changelog. In particular note that the namespace axis is no longer supported in XQuery.

XQuery/XPath: Various performance improvements (preceding axis, variables with large node lists, etc.).

XQuery/XPath: On output, now auto-converts any Saxon NodeInfo implementation (e.g. TinyTree, StandardTree, JDOM, DOM) while fully preserving node identities, even in the presence of documentless nodes. This required some minor refactoring internal to NodeWrapper.

XQuery/XPath: Better documentation on how to use extension functions and modules.

Added XOMUtil.Normalizer with standard XML algorithms for text and whitespace normalization of trees.

Deprecated XQueryUtil.normalizeTexts(). The same functionality is now available through XOMUtil.Normalizer.PRESERVE.normalize().

Version 1.0 - Final release (Feb 13, 2005)

Now requires saxon-8.3 (bundled with the download).

The obnoxious dependencies on jars for DOM Level 3, JAXP-1.3 and JaxMe have now disappeared, even under JDK < 1.5! This functionality is no longer needed at all, meaning less baggage and fewer installation and classpath problems for all of us :-)

XQuery/XPath: XPath axis navigation performance improvements in net.sf.saxon.xom.NodeWrapper.

XQuery/XPath: Now uses Saxon's cross-object model namespace handling.

Added sample data directory with XML files and queries, as well as corresponding bnux binary XML files.

Added simple XQuery and XPath benchmark. See XQueryBenchmark.

Added command-line demo that converts a set of standard textual XML files to and from binary xml (bnux) files. See BinaryXMLConverter.

Added bin/ directory with Unix and Windows shell scripts to start the demos.

Version 1.0rc3 - Release Candidate (Jan 23, 2005)

XOM no longer requires the Jaxen jar file unless really needed (soft dependency).

XQuery/XPath: Added explain() method returning a description of the compiled and optimized expression tree; useful for advanced performance diagnostics only.

XQuery/XPath: By default the doc() function now uses a DocumentURIResolver that uses a non-validating XOM Builder to parse documents. This can be overridden by passing in your custom DocumentURIResolver.

XQuery/XPath: By default a top-level atomic value in the result sequence is converted to an Element named "atomic-value" with a child Text node holding the atomic value's standard XPath 2.0 string representation. An "atomic-value" element is decorated with a namespace and a W3C XML Schema type attribute. The XPath 2.0 string representation continues to be accessible via Node.getValue(). Because XOM has no concept of a namespace node, the same conversion occurs for XPath namespace nodes in the result sequence. (The standard XPath 2.0 string representation of a namespace node is its URI). "Normal" nodes and anything not at top-level continue to be returned "as is", without conversion.
Examples for xs:integer, xs:string and xs:date:
<atomic-value xsi:type="xs:integer" xmlns="http://dsd.lbl.gov/nux" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">1</atomic-value>
<atomic-value xsi:type="xs:string" xmlns="http://dsd.lbl.gov/nux" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">hello world</atomic-value>
<atomic-value xsi:type="xs:date" xmlns="http://dsd.lbl.gov/nux" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">2004-12-31</atomic-value>
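A client that receives such a wrapped value in serialized form can recover the type and the string representation as sketched below. Inside XOM the value is available through Node.getValue(), as noted above; this sketch deliberately uses only the JDK's DOM parser so that it runs without Nux, and the class and method names are invented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Illustration only: reads a serialized atomic-value element with the JDK DOM parser. */
class AtomicValueReader {

    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    /** Returns {type, value}, for example {"xs:integer", "1"}. */
    static String[] typeAndValue(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so that xsi:type can be looked up by namespace
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return new String[] {root.getAttributeNS(XSI, "type"), root.getTextContent()};
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed atomic-value element", e);
        }
    }
}
```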

Version 1.0rc2 - Release Candidate (Jan 17, 2005)

DOM Level 3 API (dom3-api.jar) is now also bundled (required by Saxon >= 8.2 on JDK < 1.5)

XQuery/XPath: now accepts any XOM Node as context node and as variables, not just a ParentNode. A variable can now also be bound to a node list (i.e. a XOM Nodes object). This makes it possible to pass the output of one query as input into another query.

net.sf.saxon.xom.DocumentWrapper: id() function now finds first rather than last element in invalid documents that have multiple elements with the same ID. Performance is also improved via a hash index.
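The "first element wins" rule combined with a hash index can be sketched as follows. This is not the DocumentWrapper code: elements are represented only by their position in document order, and the class and method names are invented for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: an ID index in which the first occurrence of a duplicate ID wins. */
class IdIndex {

    /**
     * Builds a map from ID value to the position of the first element carrying
     * that ID. The input lists the ID values in document order.
     */
    static Map<String, Integer> build(List<String> idsInDocumentOrder) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < idsInDocumentOrder.size(); i++) {
            // putIfAbsent keeps the earlier entry, so later duplicates are ignored.
            index.putIfAbsent(idsInDocumentOrder.get(i), i);
        }
        return index;
    }
}
```

Once the index is built in a single pass, each id() lookup is a constant-time map access instead of a scan over the document.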

XQueryUtil: added normalizeTexts() for fully standards-compliant XPath and XQuery semantics in certain rare use cases.

XQueryUtil: added efficient convenience method "public static Nodes xquery(Node contextNode, String query)".

XOMUtil.toByteArray(): more efficient implementation

StreamingPathFilter: made checked exception into an unchecked exception.

Existing client code continues to run unchanged.

Version 1.0rc1 - Release Candidate (Dec 24, 2004)

This release is synchronized and works in tandem with the recent saxon-8.2 and xom-1.0 final releases.

Nux now works with the recent saxon-8.2 release, hence implementing W3C XQuery Working Draft 29 October 2004. The download includes the saxon8.jar file from saxonb-8.2. As far as we can tell from testing, it is not necessary to include any JAXP-1.3 jars (i.e. dom.jar, jaxp-api.jar) in the classpath, even when running saxon-8.2 with JDK < 1.5. This is good news because it avoids licensing problems as well as classpath, version, redistribution and deployment problems. Let us know if you find Nux use cases where JAXP 1.3 is required. In any case, this Nux release should continue to work fine with the old saxon-8.1.1.

NodeWrapper for XQuery: Performance improvements for iteration over XPath axes. Now performs queries at up to 2000 (100000) executions/sec over 100 (0.5) KB input documents = 200 (50) MB/s.

Low-level glue for native Saxon integration of XOM is now separated out into saxon8-xom.jar.

Added simple command line demo that runs a given XQuery against a set of files and prints the result sequence. See XQueryCommand.

BinaryXMLCodec: General performance and memory improvements. Now parses up to 101 MB/s.

Javadoc improvements

Building no longer requires JUnit - simply type "ant jar"

Version 1.0beta4 (Nov 23, 2004)

BinaryXMLCodec: General performance improvements.

Version 1.0beta3 (Nov 20, 2004)

BinaryXMLCodec: Substantial performance improvements for small documents and pure ASCII documents.

Version 1.0beta2 (Nov 14, 2004)

Added class BinaryXMLCodec, which serializes (encodes) and deserializes (decodes) XOM XML documents to and from an efficient and compact custom binary XML data format (termed bnux format), without loss or change of any information. Serialization and deserialization are much faster than with the standard textual XML format, and the resulting binary data is more compact than textual XML.

Requires a small backwards compatible external patch to the XOM DocType, making method setInternalDTDSubset public. Copy the file into the XOM source codebase, and rebuild XOM from source with cd xom; ant jar

Version 1.0beta1 (Nov 2, 2004)

Versions are now labelled as "Beta", meaning: No known bugs exist, and no incompatible changes are planned. Please stress this release to shake out any remaining bugs potentially lurking in remote corners.

XQuery: properly convert orphan nodes handed from the wrapper.

XOMUtil.toPrettyXML: remove trailing line break, if present.

Javadoc improvements

Changed BuilderFactory.createMSVBuilder(File) to BuilderFactory.createMSVBuilder(InputStream, URI).
To update to the new API, change client code like this:
File schema = ... 
BuilderFactory factory = ...
Builder builder = factory.createMSVBuilder(schema);
to the new style, as follows:
File schema = ... 
BuilderFactory factory = ...
Builder builder = factory.createMSVBuilder(
    new FileInputStream(schema), schema.toURI());

Version 1.0a5 (Oct 28, 2004)

Added schema validation support for RELAX NG and Schematron. See BuilderFactory and BuilderPool.

Added JAXB convenience conversion support. See XOMUtil.

Renamed XQuery.evaluate to XQuery.execute

Javadoc improvements

Version 1.0a4 (Oct 22, 2004)

For simple and complex continuous queries and/or transformations over very large or infinitely long XML input documents, we have added a convenient streaming path filter API, combining full XQuery support with straightforward filtering.
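The general idea of filtering by path while streaming can be sketched with plain JDK StAX, as below. This is not the Nux streaming path filter API and it runs no XQuery: it only shows how a parser can walk an arbitrarily long input, keep track of the current element path, and pick out the elements at one location while discarding everything else. All names are invented, and the matched elements are assumed to contain text only.

```java
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Illustration only: path-based filtering over a StAX stream, no Nux classes involved. */
class StreamingFilterSketch {

    /**
     * Streams through the input and returns the text of every element whose
     * path equals targetPath, for example "/feed/entry/title".
     * ASSUMPTION: matched elements contain text only (no child elements).
     */
    static List<String> select(Reader xml, String targetPath) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            Deque<String> path = new ArrayDeque<>(); // names from the root down to the current element
            List<String> hits = new ArrayList<>();
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    path.addLast(reader.getLocalName());
                    if (("/" + String.join("/", path)).equals(targetPath)) {
                        hits.add(reader.getElementText()); // reads up to the matching end tag
                        path.removeLast(); // that end tag is consumed, so pop here
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    path.removeLast();
                }
            }
            reader.close();
            return hits;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In a real continuous-query setting the hits would be handed to a callback one at a time rather than collected in a list, so that memory use stays independent of the input length.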

XOMUtil: Added a node factory that ignores Text nodes that consist of whitespace only.

Fixed small bug in XQueryPool.getXQuery(Resolver)

Faster XOMUtil.toPrettyXML

Javadoc improvements

Version 1.0a3 (Oct 12, 2004)

Separated the patched class nu.xom.xslt.XSLTransform (LGPL licensed, copyright Elliotte Rusty Harold) from the core library.

If you want it, you can get the external patch. The patch adds an additional constructor (needed for thread-safety and flexibility) in a backwards compatible way. Copy the file into the XOM source codebase, and rebuild XOM from source with cd xom; ant jar

The license statement now makes it clear that package net.sf.saxon.xom is under the Mozilla license (co-developed with Michael Kay, the Saxon author).

Version 1.0a2 (Oct 11, 2004)

Added XQuery DocumentURIResolver for custom URI resolution

XOMUtil.toPrettyXML now works for any type of node

Removed old, long obsolete code

Version 1.0a1 (Oct 9, 2004)

Dramatic XQuery performance improvements through native Saxon integration

Fixed a minor bug in native Saxon integration

Renamed packages to nux.xom.*

XQuery now throws XQueryException instead of ParsingException

BuilderFactory: added clean fallback to other SAX Parsers or JAXP, if necessary

Version (date)

Initial strawman prerelease
