| Version 1.6 (June 18, 2006) |
- Bnux binary XML performance improvements. Serialization and deserialization are now up to 200% and 40% faster, respectively.
- Added StAX serialization to BinaryXMLBench and BinaryXMLTest, also for FastInfoset.
- Added logic to infer character encoding the client has used on XMLStreamWriter instantiation.
- Upgraded to saxonb-8.7.3 (Saxon 8.3 - 8.7.1 continue to work fine).
- Upgraded driver for official W3C XQuery Test Suite (XQTS) to xqts-0.9.4.
- Upgraded to lucene-2.0.0.
|
Version 1.5 (March 26, 2006) |
Production Quality StAX Integration
- StaxBuilder, StaxParser, StaxUtil, StaxReader have been improved and have graduated to the nux.xom.io package. This includes...
- A facility to write to a StAX XMLStreamWriter (StreamingSerializerFactory).
- A subclass of nu.xom.Builder that has the same behaviour as the super class, except that it runs over StAX instead of SAX (StaxUtil.createBuilder(XMLInputFactory, NodeFactory)).
Can be used for polymorphic pluggability of SAX vs. StAX.
- A class that builds XOM documents, fragments, or individual nodes via StAX (new StaxParser(XMLStreamReader)).
- A class that makes a XOM (sub)tree appear as a StAX XMLStreamReader (StaxUtil.createXMLStreamReader(Node)).
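The StAX integration above builds on the standard javax.xml.stream API. As a minimal sketch of that underlying push/pull model, the following uses only the JDK and no Nux classes; the names StaxRoundTripSketch, write and countElements are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Illustrative only: plain JDK StAX, not the Nux StaxUtil/StaxParser classes.
final class StaxRoundTripSketch {

    // Push side: emit a tiny document through an XMLStreamWriter.
    static String write() {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartDocument();
            writer.writeStartElement("articles");
            for (int i = 1; i <= 3; i++) {
                writer.writeStartElement("article");
                writer.writeAttribute("id", String.valueOf(i));
                writer.writeCharacters("text " + i);
                writer.writeEndElement();
            }
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Pull side: walk the event stream and count start tags with the given local name.
    static int countElements(String xml, String localName) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(localName)) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = write();
        System.out.println(xml);
        System.out.println(countElements(xml, "article"));
    }
}
```

An XMLStreamReader obtained this way is the kind of object that would be handed to new StaxParser(XMLStreamReader), as described in the list above.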
Upgrades
- Upgraded to saxonb-8.7 (Saxon 8.3 - 8.6.1 continue to work fine).
- Upgraded driver for official W3C XQuery Test Suite (XQTS) to xqts-0.8.6.
- Upgraded to lucene-1.9.1.
Bug Fixes
- bnux: If, on serialization, systemID and publicID of a DocType are null,
deserialization maps these IDs back to empty strings, rather than null.
- XQuery: NodeWrapper.getTypeAnnotation() returns wrong type for attributes, affecting deepEquals().
- Rare bug on reuse of non-validating Xerces Builder object for DTDs, when produced via BuilderFactory.
- FileUtil.listFiles() doesn't match with wildcards on Windows, making XQTS driver fail on Windows.
Enhancements
- Enabled explicit intermediate flush() for streaming bnux serializer.
Can be used for discrete message chunking over network transports.
- Added BinaryXMLCodec.isBnux(InputStream) enabling clients to differentiate bnux and XML documents.
- Various performance enhancements.
- Added NodeBuilder and various performance improvements to redirecting factory and streaming bnux serializer.
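Format detection of the kind offered by BinaryXMLCodec.isBnux(InputStream) is commonly done by peeking at the first few bytes of a stream and then rewinding. The following is a sketch of that general technique only. The actual bnux header bytes are not stated here, so the magic value is a caller-supplied parameter and the demo bytes are purely hypothetical. Whether isBnux requires mark/reset support is likewise an assumption of this sketch.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: generic header sniffing, not the Nux implementation.
final class HeaderSniffSketch {

    // Returns true if the stream starts with the given magic bytes.
    // The stream position is restored afterwards via mark/reset.
    static boolean startsWith(InputStream in, byte[] magic) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(magic.length);
            byte[] head = new byte[magic.length];
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) break; // stream is shorter than the magic value
                n += r;
            }
            in.reset();
            return n == magic.length && Arrays.equals(head, magic);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // HYPOTHETICAL magic bytes, chosen for the demo; not the real bnux header.
        byte[] magic = "DEMO".getBytes(StandardCharsets.US_ASCII);
        InputStream binary = new BufferedInputStream(
            new ByteArrayInputStream("DEMO-payload".getBytes(StandardCharsets.US_ASCII)));
        InputStream textual = new BufferedInputStream(
            new ByteArrayInputStream("<?xml version=\"1.0\"?><a/>".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(startsWith(binary, magic));
        System.out.println(startsWith(textual, magic));
    }
}
```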
Added the following ant targets for massive integration tests:
download-testdata | Downloads some 50000 XML files from various public test data repositories. |
test-xqts | Runs the official W3C XQuery Test Suite (XQTS). |
test-bnux | Parses all 50000 files with XOM/SAX/Xerces, round-trips them through bnux binary XML, and compares the results against the original documents parsed with XOM/SAX/Xerces, testing for equality with respect to Canonical XML as well as XOM's stricter assertEquals() test routine. |
test-staxbuilder | Same as test-bnux, except that it uses XOM/StAX/Woodstox instead of bnux. |
test-staxserializer | Same as test-bnux, except that it parses with XOM/SAX/Xerces, serializes with XOM/StAX/Woodstox, reparses with XOM/SAX/Xerces, then compares the results against the original documents parsed with XOM/SAX/Xerces. |
Before running any of these targets, tell them to use enough memory and the latest stable Xerces version, like this:
export JAVA_OPTS='-Xmx200m -Djava.endorsed.dirs=/Users/hoschek/unix/java/share/apache/xerces-2.8.0'
For tests involving StAX download Woodstox (e.g. wstx-asl-3.0.jar) and copy it into nux/lib.
|
Version 1.4.1 (Dec 8, 2005) |
This release fixes two rare but deadly bugs introduced in version 1.4.
Everything else is identical to 1.4.
- Fixed bnux serialization bug that may occur when changing compression level on BinaryXMLCodec reuse.
- Fixed bnux read ahead bug that may occur when deserializing multiple documents from the same input stream, with compressionLevel > 0.
- A consequence is that the bnux header data format has slightly changed again (version number has also changed).
- Another consequence is that the internal buffer and stream management is now more efficient and clean, too.
|
Version 1.4 (Dec 1, 2005) |
Improvements and additions in this release focus on
scalability, reliability and ease of use.
XQuery and XOM |
- Upgraded to xom-1.1-final (with compatible
performance patches). xom-1.0.x and xom-1.1.x continue to work fine, albeit less efficiently.
- Upgraded to saxonb-8.6.1, implementing XQuery W3C Candidate Recommendation, 3 November 2005 (Saxon 8.6, 8.5, 8.4, 8.3 continue to work fine).
- saxon8-xom.jar is no longer needed as its contents are directly compiled into nux.jar, improving simplicity and reliability.
- Constructing a new compiled XQuery object is now about 20 times faster.
- Added driver for official W3C XQuery Test Suite (XQTS). Contains some 8500 test cases.
|
XML Streaming and Bnux Binary XML Streaming |
- Added Streaming Serialization of Very Large Documents in the
nux.xom.io package.
Using memory consumption close
to zero, the new StreamingSerializer
enables writing arbitrarily large XML documents onto a destination, such as an OutputStream ,
both for standard textual XML as well as bnux binary XML (and StAX).
- Added streaming bnux deserialization
for handling arbitrarily large input documents;
uses an InputStream and an application provided NodeFactory just like a XOM Builder does.
- Added bnux serialization to an OutputStream.
- To enable true streaming, a serialized bnux document now consists internally of one or more
independent pages, each at most 64 KB large. Each page is a tokenized byte array containing a portion of the XML document, in document order.
Once a page has been read or written, the related (heavy) state can be discarded, freeing memory.
No more than one page needs to be held in memory at any given time.
For very large documents this reduces memory consumption, increases throughput and reduces latency.
For small to medium sized documents it makes next to no difference.
- Slightly more compact bnux data format (version number has changed).
- Improved performance on reuse of BinaryXMLCodec instances (recommended).
- bnux serialization and deserialization are now roughly 3 times faster when using
documents containing namespaces, closely matching performance for
documents without namespaces.
- Added
Streaming conversion of standard textual XML to and from binary format,
enabling conversion of arbitrarily large documents.
The corresponding
fire-bnux
command line conversion tool now works in fully streaming mode, too.
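To illustrate why independent pages of bounded size keep memory use flat, here is a minimal sketch of a paged stream in which each page is prefixed by its length and a single page buffer is reused while reading. This is not the bnux data format: the length-prefix framing, the zero-length terminator and all names (PagedStreamSketch, writePages, readPages) are assumptions made for the example. Only the 64 KB page limit is taken from the notes above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Illustrative only: a hypothetical length-prefixed page framing, not the bnux format.
final class PagedStreamSketch {

    static final int MAX_PAGE_SIZE = 64 * 1024;

    // Splits data into pages of at most MAX_PAGE_SIZE bytes, each preceded by its length.
    // A page length of zero marks the end of the document (an assumption of this sketch).
    static void writePages(byte[] data, OutputStream out) {
        try {
            DataOutputStream dout = new DataOutputStream(out);
            for (int off = 0; off < data.length; off += MAX_PAGE_SIZE) {
                int len = Math.min(MAX_PAGE_SIZE, data.length - off);
                dout.writeInt(len);
                dout.write(data, off, len);
            }
            dout.writeInt(0);
            dout.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads one page at a time into a single reusable buffer and forwards it to the sink.
    // Returns the number of pages read.
    static int readPages(InputStream in, OutputStream sink) {
        try {
            DataInputStream din = new DataInputStream(in);
            byte[] page = new byte[MAX_PAGE_SIZE]; // the only page held in memory
            int pages = 0;
            for (int len = din.readInt(); len != 0; len = din.readInt()) {
                if (len < 0 || len > MAX_PAGE_SIZE) {
                    throw new IOException("corrupt page length: " + len);
                }
                din.readFully(page, 0, len);
                sink.write(page, 0, len);
                pages++;
            }
            return pages;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[150_000];
        ByteArrayOutputStream framed = new ByteArrayOutputStream();
        writePages(data, framed);
        int pages = readPages(new ByteArrayInputStream(framed.toByteArray()),
                              new ByteArrayOutputStream());
        System.out.println(pages + " pages");
    }
}
```

In this sketch the reader never allocates more than one page buffer, regardless of the total document size.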
|
Other |
- Added AnalyzerUtil.getMostFrequentTerms().
Returns (frequency:text) pairs for the top N distinct terms (aka words),
sorted descending by frequency (and ascending by term, if tied).
- Removed deprecated methods XOMUtil.toByteArray() and XOMUtil.toString().
The methods remain available but have been moved into class FileUtil.
- Added more test document collections in
samples directory.
- Added package nux.xom.sandbox,
a playground for kicking around various ideas and prototypes without any
API compatibility guarantees. Code quality varies from sketchy to reliable,
but is generally not nearly as well designed and tested as the remainder of Nux.
In the future some of these classes may (or may not) graduate into stable
packages.
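The ordering described for AnalyzerUtil.getMostFrequentTerms() (descending by frequency, ascending by term on ties) can be sketched in plain Java as follows. This is not the Nux implementation and does not use Lucene. The tokenization rule, the method signature and the "frequency:text" string rendering are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: demonstrates the sort order, not the real AnalyzerUtil API.
final class TopTermsSketch {

    // Returns up to 'limit' entries of the form "frequency:term".
    static List<String> mostFrequentTerms(String text, int limit) {
        // Assumption: terms are lower-cased runs of word characters.
        Map<String, Integer> counts = new HashMap<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byFrequency = Integer.compare(b.getValue(), a.getValue()); // descending
            return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey()); // ascending
        });

        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (result.size() >= limit) break;
            result.add(e.getValue() + ":" + e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mostFrequentTerms("b a b c a d", 3));
    }
}
```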
|
|
Version 1.3 (Aug 5, 2005) |
- Upgraded to saxonb-8.5 (saxon-8.4 and 8.3 should continue to work as well).
- Upgraded to xom-1.1-rc1 (with compatible
performance patches).
Plain xom-1.0 should continue to work as well, albeit less efficiently.
- Numerous bnux Binary XML
performance enhancements for serialization and deserialization
(UTF-8 character encoding, buffer management, symbol table, pack sorting, cache locality, etc).
Overall, bnux is now about twice as fast, and, perhaps more importantly, has a much
more uniform performance profile, no matter what kind of document flavour is thrown at it.
It routinely delivers 50-100 MB/sec deserialization performance, and 30-70 MB/sec
serialization performance (commodity PC 2004).
It is roughly 5-10 times faster than xom-1.1 with xerces-2.7.1
(which, in turn, is faster than saxonb-8.5, dom4j-1.6.1 and xerces-2.7.1 DOM).
Further, preliminary measurements indicate bnux deserialization and serialization
to be consistently 2-3 times faster than Sun's FastInfoSet implementation, using XOM.
Saxon's PTree could not be tested as it is only available in the commercial version.
The only remaining area with substantial potential for performance improvement seems to be complex
namespace handling.
This might be addressed by slightly restructuring private XOM internals
in a future version.
- BinaryXMLTest now also has command line support for testing and benchmarking
Saxon, DOM and FastInfoSet (besides bnux and XOM).
- Rewrote XQueryCommand. The new
nux/bin/fire-xquery
is a more powerful, flexible and reliable command line test tool that runs a given
XQuery against a set of files and prints the result sequence.
In addition, it supports schema validation, XInclude (via XOM),
an XQuery update facility, malformed HTML parsing (via TagSoup) and much more.
It's available for Unix and Windows, and works like any other decent Unix command
line tool.
- Removed ValidationCommand (made obsolete by the fire-xquery functionality).
- Added experimental
XQuery in-place update functionality.
Comments on the usefulness of the current behaviour are especially welcome,
as are suggestions for potential improvements.
- Added nux.xom.xquery.ResultSequenceSerializer,
which serializes an XQuery/XPath2 result sequence onto a given output stream, using
various configurable serialization options such as encoding and indentation.
Implements the
W3C XQuery/XSLT2 Serialization Draft Spec.
Also implements an alternative wrapping algorithm that ensures that any arbitrary
result sequence can always be output as a well-formed XML document.
- Added XQueryFactory.createXQuery(File file, URI baseURI) and
XQueryPool.getXQuery(File file, URI baseURI) to allow for separation
of the location of the query file and input XML files.
- The default XQuery DocumentURIResolver now recognizes the ".bnux" file
extension as binary XML, and parses it accordingly.
For example, a query can be 'doc("samples/data/articles.xml.bnux")/articles/*'
- Added FileUtil.listFiles().
Returns the URIs of all files whose path matches at least one of the given
inclusion wildcard or regular expressions but none of the given exclusion
wildcard or regular expressions; starting from the given directory, optionally
with recursive directory traversal, insensitive to underlying operating system
conventions.
- XOMUtil.Normalizer now uses
XML whitespace definition rather than Java whitespace definition.
- Added XOMUtil.Normalizer.STRIP, which removes Texts that consist of
whitespace-only (boundary whitespace), retaining other strings unchanged.
- Added AnalyzerUtil.getPorterStemmerAnalyzer()
for English language stemming on full text search.
- Added XOMUtil.toDocument(String xml) convenience method to parse a string.
- Moved XOMUtil.toByteArray() and XOMUtil.toString() into class FileUtil.
The old methods remain available but have been deprecated.
- Added "jar-bnux" ant target to optionally build a minimal jar file (20 KB) for binary XML only.
- Added more test documents to samples/data directory.
- Updated license blurbs to 2005.
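The difference between the XML and Java whitespace definitions mentioned for XOMUtil.Normalizer is easy to overlook. XML 1.0 treats only space, tab, carriage return and line feed as whitespace, whereas Character.isWhitespace() also accepts characters such as form feed or U+2003 (em space). The following sketch shows a STRIP-like filter over a list of text values under the XML definition. It operates on plain strings rather than XOM Text nodes, and all names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: plain strings stand in for XOM Text nodes.
final class XmlWhitespaceSketch {

    // XML 1.0 whitespace (production S): space, tab, carriage return, line feed.
    static boolean isXmlWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    // True if every character is XML whitespace.
    // An empty string is treated as whitespace-only in this sketch.
    static boolean isWhitespaceOnly(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (!isXmlWhitespace(text.charAt(i))) return false;
        }
        return true;
    }

    // Drops whitespace-only texts and keeps all other texts unchanged.
    static List<String> strip(List<String> texts) {
        List<String> kept = new ArrayList<>();
        for (String text : texts) {
            if (!isWhitespaceOnly(text)) kept.add(text);
        }
        return kept;
    }

    public static void main(String[] args) {
        // U+2003 is whitespace for Java but not for XML, so it is retained.
        List<String> kept = strip(Arrays.asList("  \n", " a ", "\u2003"));
        System.out.println(kept.size());
        System.out.println(Character.isWhitespace('\u2003') + " vs " + isXmlWhitespace('\u2003'));
    }
}
```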
|
Version 1.2 (May 26, 2005) |
- XQuery/XPath: Added optional fulltext search via Apache Lucene
engine. Similar to Google search, it is easy to use, powerful, efficient
and goes far beyond what can be done with standard XPath regular
expressions and string manipulation functions. It is similar in intent
but not directly related to preliminary
W3C fulltext search drafts.
Rather than targeting fulltext search of infrequent queries over huge persistent
data archives (historic search), Nux targets fulltext search of huge
numbers of queries over comparatively small transient realtime data (prospective search).
See FullTextUtil and
MemoryIndex.
declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~";
    (: any arbitrary Lucene query can go here :)
(: declare variable $query as xs:string external; :)

for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book

declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
for $book in /books/book
    for $s in lucene:sentences($book/abstract, 0)
    return
        if (lucene:match($s, "+salmon~ +fish* manual~") > 0.0)
        then normalize-space($s)
        else ()
It is designed to enable maximum efficiency for on-the-fly
matchmaking combining structured and fuzzy fulltext search
in realtime streaming applications such as XQuery based XML message
queues, publish-subscribe systems for Blogs/newsfeeds, text chat, data acquisition
and distribution systems, application level routers, firewalls,
classifiers, etc.
For some interesting background information on search technology, see Bob Wyman's
Prospective Search,
Jim Gray's
A Call to Arms - Custom subscriptions, and Tim Bray's
On Search, the Series.
Arbitrary Lucene fulltext queries can be run from Java or
from XQuery/XPath/XSLT via a simple extension function.
The former approach is more flexible whereas the latter is more convenient.
Lucene analyzers can split on whitespace, normalize to lower case
for case insensitivity, ignore common terms with little
discriminatory value such as "he", "in", "and" (stop words),
reduce the terms to their natural linguistic root form such as
"fishing" being reduced to "fish" (stemming), resolve
synonyms/inflexions/thesauri (upon indexing and/or querying), etc.
Also see Lucene Query Syntax
as well as Query Parser Rules.
Background: The first prototype was put together over the weekend.
The functionality worked just fine, except that it took ages
to index and search text in a high-frequency environment. Subsequently I wrote a complete
reimplementation of the Lucene interfaces and contributed that
back to Lucene (the bits in org.apache.lucene.index.memory.*).
Next, I placed a smart cache in front of it (the bits in
nux.xom.pool.FullTextUtil / FullTextPool).
The net effect is that fulltext queries over realtime data
now run some three orders of magnitude faster while preserving the same
general functionality (e.g. 100000-500000 queries/sec ballpark).
In fact, you'll probably notice little or
no overhead when adding fulltext search to your streaming apps.
See MemoryIndexBenchmark and
XQueryBenchmark.
Explore and enjoy, perhaps using the queries and sample data from the
samples/fulltext directory as a starting point.
- Removed deprecated
XQueryUtil.normalizeTexts() . The same functionality remains available through XOMUtil.Normalizer.PRESERVE.normalize() .
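As a rough illustration of what an analyzer does before any matching takes place, the following sketch performs tokenizing, lower-casing and stop word removal in plain Java. It deliberately does not use the Lucene API, leaves out stemming and synonym handling, and uses a stop word list that is only an example. The names SimpleAnalyzerSketch and analyze are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a hand-rolled analysis pipeline, not a Lucene Analyzer.
final class SimpleAnalyzerSketch {

    // Example stop words; a real analyzer would use a larger, language-specific list.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("he", "in", "and", "the", "is", "a", "of"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split on anything that is neither a letter nor a decimal digit.
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            String term = token.toLowerCase(Locale.ROOT); // case insensitivity
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("He is fishing in the river AND the lake"));
    }
}
```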
|
Version 1.1 (Apr 6, 2005) |
- nux.xom.pool: Added a configurable XML caching framework.
Classes DocumentFactory, DocumentPool, DocumentMap and PoolConfig
enable efficient compact thread-safe pooling/caching of XOM document objects.
Cached documents typically consume 20-100 times less memory than the equivalent XOM main memory tree.
Usage is safe: It survives stress tests looking for memory leaks, race conditions, etc.
Plugins for dependency chain invalidation could be added in the future,
but for the moment this isn't explicitly supported.
Comments on this or any other Nux aspect are always welcome.
See API.
- nux.xom.pool.*: All pools and ThreadLocals now internally use SoftReferences to allow for automatic garbage collection of cached objects in low-memory situations.
- nux.xom.pool.*: All pools now have a constructor that takes a PoolConfig object.
- XQuery/XPath: Now requires saxon-8.4 (bundled with the download).
See the Saxon documentation for the changelog.
In particular, note that the namespace axis is no longer supported in XQuery.
- XQuery/XPath: Various performance improvements (preceding axis, variables with large node lists, etc.).
- XQuery/XPath: On output, now auto-converts any Saxon NodeInfo implementation
(e.g. TinyTree, StandardTree, JDOM, DOM)
while fully preserving node identities, even in the presence of documentless nodes.
This required some minor refactoring internal to NodeWrapper.
- XQuery/XPath: Better documentation on how to use extension functions and modules.
- Added XOMUtil.Normalizer with standard XML algorithms for text and whitespace normalization of trees.
- Deprecated
XQueryUtil.normalizeTexts() . The same functionality is now available through XOMUtil.Normalizer.PRESERVE.normalize() .
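The SoftReference approach used by the pools can be sketched as follows: values are held through soft references, so the garbage collector may reclaim them under memory pressure, and a reclaimed entry is simply reloaded on the next access. This is a simplified illustration rather than the Nux pool code. It has no size limits or expiry, and two threads may occasionally load the same key concurrently. The class name and the loader parameter are assumptions.

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative only: a minimal soft-reference cache, not DocumentPool.
final class SoftCacheSketch<K, V> {

    private final ConcurrentHashMap<K, SoftReference<V>> map = new ConcurrentHashMap<>();

    // Returns the cached value, or loads and caches it if it is absent
    // or has been reclaimed by the garbage collector.
    V get(K key, Function<? super K, ? extends V> loader) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = loader.apply(key);
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }

    // Number of entries, including entries whose referent may already be cleared.
    int size() {
        return map.size();
    }

    public static void main(String[] args) {
        SoftCacheSketch<String, String> cache = new SoftCacheSketch<>();
        String first = cache.get("doc.xml", k -> "parsed:" + k);
        String second = cache.get("doc.xml", k -> "parsed again:" + k);
        System.out.println(first == second);
    }
}
```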
|
Version 1.0 - Final release (Feb 13, 2005) |
- Now requires saxon-8.3 (bundled with the download).
- The obnoxious dependencies on jars for DOM Level 3, JAXP-1.3 and JaxMe
have now disappeared, even under JDK < 1.5! This functionality is no longer needed at all,
meaning less baggage, installation and classpath problems for all of us :-)
- XQuery/XPath: XPath axis navigation performance improvements in
net.sf.saxon.xom.NodeWrapper .
- XQuery/XPath: Now uses Saxon's cross-object model namespace handling.
- Added sample data directory with XML files and queries, as well as corresponding bnux binary XML files.
- Added simple XQuery and XPath benchmark.
See XQueryBenchmark.
- Added command-line demo that converts a set of standard textual XML files to and from binary xml (bnux) files.
See BinaryXMLConverter.
- Added
bin/ directory with Unix and Windows shell scripts to start the demos.
|
Version 1.0rc3 - Release Candidate (Jan 23, 2005) |
- XOM no longer requires the Jaxen jar file unless really needed (soft dependency).
- XQuery/XPath: Added
explain() method returning a description of the compiled and optimized expression tree;
useful for advanced performance diagnostics only.
- XQuery/XPath: By default the
doc() function now uses a DocumentURIResolver
that uses a non-validating XOM Builder to parse documents.
This can be overridden by passing in your custom DocumentURIResolver.
- XQuery/XPath: By default a top-level atomic value in the result sequence is
converted to an Element named "atomic-value" with a child Text
node holding the atomic value's standard XPath 2.0 string representation.
An "atomic-value" element is decorated with a namespace and a W3C XML Schema type attribute.
The XPath 2.0 string representation continues to be accessible via
Node.getValue() .
Because XOM has no concept of a namespace node, the same conversion occurs for XPath namespace nodes in the result sequence.
(The standard XPath 2.0 string representation of a namespace node is its URI).
"Normal" nodes and anything not at top-level continue to be returned "as is", without conversion.
<atomic-value xsi:type="xs:integer" xmlns="http://dsd.lbl.gov/nux"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">1</atomic-value>
<atomic-value xsi:type="xs:string" xmlns="http://dsd.lbl.gov/nux"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">hello world</atomic-value>
<atomic-value xsi:type="xs:date" xmlns="http://dsd.lbl.gov/nux"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">2004-12-31</atomic-value>
|
Version 1.0rc2 - Release Candidate (Jan 17, 2005) |
- DOM Level 3 API (dom3-api.jar) is now also bundled
(required by Saxon >= 8.2 on JDK < 1.5)
- XQuery/XPath: now accepts any XOM Node as context node and as variables,
not just a ParentNode. A variable can now also be bound to a node list
(i.e. a XOM Nodes object).
This makes it possible to pass the output of one query as input into another query.
- net.sf.saxon.xom.DocumentWrapper: id() function now finds first rather than last element
in invalid documents that have multiple elements with the same ID.
Performance is also improved via a hash index.
- XQueryUtil: added normalizeTexts() for fully standards-compliant
XPath and XQuery semantics in certain rare use cases.
- XQueryUtil: added efficient convenience method
"public static Nodes xquery(Node contextNode, String query)".
- XOMUtil.toByteArray(): more efficient implementation
- StreamingPathFilter: made checked exception into an unchecked exception.
- Existing client code continues to run unchanged.
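The "first rather than last" behaviour of the id() lookup follows naturally from how a hash index is populated. If elements are visited in document order and an ID is only recorded when it is not yet present, the first element wins. The sketch below shows that idea on a list of ID strings, with list positions standing in for elements. It is not the DocumentWrapper code, and the names are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: list positions stand in for elements in document order.
final class FirstWinsIdIndexSketch {

    // Maps each ID to the position of the FIRST element carrying it.
    static Map<String, Integer> buildIndex(List<String> idsInDocumentOrder) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < idsInDocumentOrder.size(); i++) {
            // putIfAbsent keeps the earlier entry when an ID is duplicated;
            // a plain put() would keep the last one instead.
            index.putIfAbsent(idsInDocumentOrder.get(i), i);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = buildIndex(Arrays.asList("a", "b", "a"));
        System.out.println(index.get("a"));
    }
}
```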
|
Version 1.0rc1 - Release Candidate (Dec 24, 2004) |
- This release is synchronized and works in tandem with the
recent saxon-8.2 and xom-1.0 final releases.
- Nux now works with the recent saxon-8.2 release,
hence implementing W3C XQuery Working Draft 29 October 2004.
The download includes the saxon8.jar file from saxonb-8.2.
As far as we can tell from testing, it is not necessary to include any
JAXP-1.3 jars (i.e. dom.jar, jaxp-api.jar) in the classpath,
even when running saxon-8.2 with JDK < 1.5.
This is good news because it avoids licensing problems as well as classpath,
version, redistribution and deployment problems.
Let us know if you find Nux use cases where JAXP 1.3 is required.
In any case, this Nux release should continue to work fine with the old saxon-8.1.1.
- NodeWrapper for XQuery: Performance improvements for iteration over XPath axes.
Now performs queries at up to
2000 (100000) executions/sec over 100 (0.5) KB input documents = 200 (50) MB/s.
- Low-level glue for native Saxon integration of XOM is now separated
out into saxon8-xom.jar.
- Added simple command line demo that runs a given XQuery against a set of files
and prints the result sequence.
See XQueryCommand.
- BinaryXMLCodec: General performance and memory improvements.
Now parses up to 101 MB/s.
- Javadoc improvements
- Building no longer requires JUnit - simply type "ant jar"
|
Version 1.0beta2 (Nov 14, 2004) |
- Added class BinaryXMLCodec,
which serializes (encodes) and deserializes (decodes) XOM XML documents to and from
an efficient and compact custom binary XML data format (termed bnux
format), without loss or change of any information. Serialization and
deserialization are much faster than with the standard textual XML format, and
the resulting binary data is more compact than textual XML.
- Requires a small backwards compatible
external patch
to the XOM DocType, making method
setInternalDTDSubset public.
Copy the file into the XOM source codebase, and rebuild XOM from source
with cd xom; ant jar
|
Version 1.0beta1 (Nov 2, 2004) |
- Versions are now labelled as "Beta", meaning: No known bugs exist, and no incompatible changes are planned.
Please stress this release to shake out any remaining bugs potentially lurking in remote corners.
- XQuery: properly convert orphan nodes handed from the wrapper.
- XOMUtil.toPrettyXML: remove trailing line break, if present.
- Javadoc improvements
- Changed BuilderFactory.createMSVBuilder(File) to BuilderFactory.createMSVBuilder(InputStream, URI)
In order to update to the new API, change client code like
File schema = ...
BuilderFactory factory = ...
Builder builder = factory.createMSVBuilder(schema);
to the new style, as follows:
File schema = ...
BuilderFactory factory = ...
Builder builder = factory.createMSVBuilder(
new FileInputStream(schema), schema.toURI());
|
Version 1.0a4 (Oct 22, 2004) |
- For simple and complex continuous queries and/or transformations over very large or infinitely long XML input documents,
we have added a convenient streaming path filter
API, combining full XQuery support with straightforward filtering.
- XOMUtil: Added a node factory that ignores Text nodes that consist of whitespace only.
- Fixed small bug in XQueryPool.getXQuery(Resolver)
- Faster XOMUtil.toPrettyXML
- Javadoc improvements
|
Version 1.0a3 (Oct 12, 2004) |
- Separated the patched class nu.xom.xslt.XSLTransform (LGPL licensed, copyright Elliotte Rusty Harold)
from the core library.
- If you want it, you can get the
external patch.
The patch adds an additional constructor (needed for thread-safety and flexibility) in a backwards compatible way.
Copy the file into the XOM source codebase, and rebuild XOM from source
with
cd xom; ant jar
- The license statement now makes it clear that package
net.sf.saxon.xom
is under the Mozilla license (co-developed with Michael Kay, the Saxon author).
|
Version 1.0a2 (Oct 11, 2004) |
- Added XQuery DocumentURIResolver for custom URI resolution
- XOMUtil.toPrettyXML now works for any type of node
- Removed old, long obsolete code
|
Version 1.0a1 (Oct 9, 2004) |
- Dramatic XQuery performance improvements through native Saxon integration
- Fixed a minor bug in native Saxon integration
- Renamed packages to nux.xom.*
- XQuery now throws XQueryException instead of ParsingException
- BuilderFactory: added clean fallback to other SAX Parsers or JAXP, if necessary
|
|