Efficient and Powerful XML Processing Made Easy
Nux is an open-source Java toolkit making efficient and powerful XML processing easy.
It is geared towards embedded use in
high-throughput XML messaging middleware such as large-scale Peer-to-Peer
infrastructures, message queues, publish-subscribe and matchmaking systems
for Blogs/newsfeeds, text chat, data acquisition and distribution systems,
application level routers, firewalls, classifiers, etc.
Have you ever tried to take advantage of a robust and natural commodity Java tool set for
XML, XQuery, XPath, schema validation, binary XML, fuzzy fulltext similarity search and related technologies,
yet were not ready to accept a significant performance penalty? Chances are that most tool sets
turned out not to be particularly robust or natural, that they incurred dramatic penalties
when used in straightforward ways, and that their complex idiosyncrasies tended to
distract from the real job and use cases you wanted to get done in a timely manner.
Nux helps to avoid XML nightmares, enabling you to mix and match powerful main-memory
XML tools in natural, straightforward, seamless,
effective and standards compliant manners.
Nux reliably processes whatever data fits into main memory (even, say, 250 MB messages),
but it is not an XML database system, and does not attempt to be one.
Nux integrates best-of-breed components, containing extensions of the
XOM,
Saxon and
Lucene open-source libraries.
XQuery/XPath:
Seamless, complete, standards compliant and efficient W3C XQuery and XPath support for XOM.
Also see the nux/bin/fire-xquery command line test tool and XQueryBenchmark. (API)

In-place update:
Simple yet powerful and efficient in-place morphing for use as an
XQuery/XPath insert, update and delete facility; particularly useful for
structurally small tree transformations without requiring (potentially
huge) XML tree copies. (API)

Fulltext search:
Fulltext search (fuzzy similarity queries) for on-the-fly matchmaking
in realtime streaming applications combining structured and unstructured queries.
Arbitrary Lucene fulltext queries can be run from Java or
from XQuery/XPath/XSLT via a simple extension function. (API)

Pooling:
Efficient and flexible pools and factories for documents, XQueries, XSL Transforms, as well as
document Builders that validate against various schema languages, including W3C XML Schemas,
DTDs, RELAX NG, Schematron, etc. (API)

Binary XML:
Optional serialization and deserialization of XOM XML documents to and from
an efficient and compact custom binary XML data format (bnux format),
without loss or change of any information.
Serialization and deserialization is much faster than with the standard textual XML format,
and the resulting binary data is more compressed than textual XML. (API)

Streaming XQuery:
For simple and complex continuous queries and/or transformations over very
large or infinitely long XML input, a convenient streaming path filter API
combines full XQuery and XPath support with straightforward filtering. (API)

Streaming Serialization:
Using memory consumption close to zero, streaming serialization enables writing
arbitrarily large XML documents onto a destination, such as an OutputStream,
both for standard textual XML as well as binary XML. (API)

Pluggable SAX and StAX:
A XOM Builder implementation that uses a StAX parser (e.g. Woodstox)
instead of a SAX parser (e.g. Xerces) can be used interchangeably. (API)

JAXB and HTML:
Conversion from XOM to JAXB and vice versa, serving as an intermediary for
XML <--> Object mapping tools. XQueries over ill-formed HTML. (API)

Open Source:
BSD style license
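As a minimal sketch of the in-place update facility: this assumes an XQueryUtil.update(node, select, morpher) method where the morpher XQuery is evaluated for each node selected by the select expression and its result sequence replaces that node (an empty result deletes it); check the API docs for the exact contract:

```java
// delete all size attributes in place (no tree copy is made):
Document doc = XOMUtil.toDocument(
    "<foo><bar size='123'>hello world</bar></foo>");
XQueryUtil.update(doc, "//@size", "()"); // empty morpher result deletes the node
System.out.println(doc.toXML());
```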
Motivation
Have you ever tried to do queries and/or transformations over XML data sources?
Chances are that manual SAX/DOM processing was cumbersome at best, that XPath was not powerful or flexible enough,
or XSLT perhaps too complicated, and that most related APIs have a steep learning curve, and
contain quite a few bugs.
This is where the power and simplicity of XQuery comes in.
Nux provides seamless XQuery support for XOM, leveraging the
standards compliance, efficiency and maturity of the Saxon engine,
in combination with a robust, lean and mean adapter for XOM that Nux contributed to Saxon.
Since XQuery is a superset of XPath 2.0 it
can also be used with plain XPath expressions as queries.
It implements the W3C XQuery Candidate Recommendation of 3 November 2005
and passes several exhaustive test suites.
Like Saxon-B, Nux XQuery is not schema aware, unlike the commercial Saxon SA version.
Have you ever tried to build an XML system that is straightforward, works correctly and
processes tens of thousands of small XML messages per second in non-trivial ways? Chances are you've encountered lots of
non-obvious obstacles down that path. For that scenario, Nux couples the simplicity and correctness qualities of XOM with
efficient and flexible pools
and factories for documents, XQueries, XSL Transforms, as well as
document Builders that validate against various schema languages, including
W3C XML Schemas (leveraging Xerces),
RELAX NG, Schematron, etc. (leveraging MSV).
For particularly stringent performance requirements
an option for lightning-fast binary XML serialization and deserialization
is offered.
XML <--> Object mapping glue for integration with
JAXB
and for queries over ill-formed HTML is also provided.
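The pooling facilities can be sketched as follows. The GLOBAL_POOL constant mirrors the XQueryPool example further below, while the BuilderPool method name and the string-based getXQuery overload are assumptions to verify against the nux.xom.pool javadoc:

```java
// obtain a cached, thread-safe non-validating Builder from the global pool:
Builder builder = BuilderPool.GLOBAL_POOL.getBuilder(false);
Document doc = builder.build(new File("samples/data/periodic.xml"));

// obtain a cached, precompiled XQuery and run it against the document:
XQuery xquery = XQueryPool.GLOBAL_POOL.getXQuery("//ATOM/NAME", null);
Nodes results = xquery.execute(doc).toNodes();
System.out.println("atoms found: " + results.size());
```

Repeated lookups with the same key return the cached instance, avoiding the cost of re-parsing documents and recompiling queries on every message.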
Command Line Usage
To get started, you can use
nux/bin/fire-xquery,
a flexible command line test tool that runs a given XQuery against a set of files
and prints the result sequence. For example:
fire-xquery --query='{declare namespace atom = "http://www.w3.org/2005/Atom";
doc("http://www.tbray.org/ongoing/ongoing.atom")/atom:feed/atom:entry/atom:title}'
<title xmlns="http://www.w3.org/2005/Atom">nbextras.org</title>
<title xmlns="http://www.w3.org/2005/Atom">Election Day</title>
<title xmlns="http://www.w3.org/2005/Atom">Washington Post Screws Up</title>
<title xmlns="http://www.w3.org/2005/Atom">Marketing Truth</title>
<title xmlns="http://www.w3.org/2005/Atom">Upcoming Gig: JavaOne</title>
<title xmlns="http://www.w3.org/2005/Atom">GPL3 Draft</title>
The command line tool also supports schema validation, XInclude (via XOM),
an XQuery update facility, malformed HTML parsing (via TagSoup) and much more.
It's available for Unix and Windows, and works like any other decent Unix command line tool.
The source code can be found in class nux.xom.tests.XQueryCommand.
Basic API Usage
More interestingly, here are examples demonstrating basic API usage:
// parse a file document with XOM:
Document doc = new Builder().build(new File("samples/data/periodic.xml"));
System.out.println(doc.toXML());
System.out.println(XOMUtil.toPrettyXML(doc));
// parse a string document:
String xml =
"<foo>" +
"<bar size='123'>" +
"hello world" +
"</bar>" +
"</foo>";
Document doc = XOMUtil.toDocument(xml);
System.out.println(doc.toXML());
System.out.println(XOMUtil.toPrettyXML(doc));
// construct a XOM document (main memory tree):
Element bar = new Element("bar");
bar.addAttribute(new Attribute("size", "123"));
bar.appendChild(new Text("hello world"));
Element foo = new Element("foo");
foo.appendChild(bar);
Document doc = new Document(foo);
System.out.println(doc.toXML());
System.out.println(XOMUtil.toPrettyXML(doc));
// serialize a document with XOM onto an OutputStream:
OutputStream out = new FileOutputStream("samples/data/periodic2.xml");
Serializer ser = new Serializer(out);
// ser.setIndent(4); // optional pretty printing
ser.write(doc);
out.close();
// parse a document with XOM, either using SAX or StAX:
InputStream in = new FileInputStream("samples/data/articles.xml");
boolean useSAX = true;
// boolean useSAX = false;
Builder builder = useSAX ? new Builder() : StaxUtil.createBuilder(null, null);
Document doc = builder.build(in);
System.out.println(doc.toXML());
XPath / XQuery examples
// parse XML document with XOM:
Document doc = new Builder().build(new File("samples/data/periodic.xml"));
// find the atom named 'Zinc' in the periodic table:
Node result = XQueryUtil.xquery(doc, "/PERIODIC_TABLE/ATOM[NAME = 'Zinc']").get(0);
System.out.println("result=" + result.toXML());
// equivalent via the more powerful underlying API:
XQuery xquery = new XQuery("/PERIODIC_TABLE/ATOM[NAME = 'Zinc']", null);
Node result = xquery.execute(doc).next();
// count the number of elements in a document tree
int count = XQueryUtil.xquery(doc, "//*").size();
System.out.println("count=" + count);
// find the src attributes of all images in a document:
Document doc = new Builder().build(new File("/tmp/test.xml"));
Nodes results = XQueryUtil.xquery(doc, "//*:img/@src");
//Nodes results = XQueryUtil.xquery(doc, "//*:img/@src[matches(., '.jpg')]");
// see exactly what items the query does (or does not) return:
for (int i=0; i < results.size(); i++) {
System.out.println("node "+i+": " + results.get(i).toXML());
//System.out.println("node "+i+": " + XOMUtil.toPrettyXML(results.get(i)));
}
// or serialize results according to W3C spec onto an output stream:
ResultSequenceSerializer ser = new ResultSequenceSerializer();
ser.setEncoding("UTF-8");
ser.setIndent(4);
ser.setAlgorithm(ResultSequenceSerializer.W3C_ALGORITHM);
// ser.setAlgorithm(ResultSequenceSerializer.WRAP_ALGORITHM);
OutputStream out = System.out;
ser.write(results, out);
java:
File query = new File("/tmp/saxonb-8.6.1/use-cases/xmp/q1.xq");
XQuery xquery = XQueryPool.GLOBAL_POOL.getXQuery(query);
Nodes results = xquery.execute(null).toNodes();
ResultSequenceSerializer ser = new ResultSequenceSerializer();
ser.setIndent(4);
ser.write(results, System.out);
xmp/q1.xq:
<bib>
{
for $b in doc("bib.xml")/bib/book
where $b/publisher = "Addison-Wesley" and $b/@year > 1991
return
<book year="{ $b/@year }">
{ $b/title }
</book>
}
</bib>
output:
<?xml version="1.0" encoding="UTF-8"?>
<bib>
<book year="1994">
<title>TCP/IP Illustrated</title>
</book>
<book year="1992">
<title>Advanced Programming in the Unix environment</title>
</book>
</bib>
Another example query, combining a join, sorting and aggregation:
for $i in doc("items.xml")//item_tuple
let $b := doc("bids.xml")//bid_tuple[itemno = $i/itemno]
where contains($i/description, "Bicycle")
order by $i/itemno
return
<item_tuple>
{ $i/itemno }
{ $i/description }
<high_bid>{ max($b/bid) }</high_bid>
</item_tuple>
Google-like realtime fulltext search via Apache Lucene engine
Similar to Google search, the optional fulltext search via the Apache Lucene engine is
easy to use, powerful, efficient and goes far beyond what can be done with
standard XPath regular expressions
and string manipulation functions. It is similar in intent but not directly related to preliminary
W3C fulltext search drafts.
Rather than targeting fulltext search of infrequent queries over huge persistent
data archives (historic search), Nux targets fulltext search of huge
numbers of queries over comparatively small transient realtime data (prospective search),
e.g. in the ballpark of 100000-500000 queries/sec.
See FullTextUtil and
MemoryIndex.
declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~";
(: any arbitrary Lucene query can go here :)
(: declare variable $query as xs:string external; :)
for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book
declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
for $book in /books/book
for $s in lucene:sentences($book/abstract, 0)
return
if (lucene:match($s, "+salmon~ +fish* manual~") > 0.0)
then normalize-space($s)
else ()
Arbitrary Lucene fulltext queries can be run from Java or
from XQuery/XPath/XSLT via a simple extension function.
The former approach is more flexible whereas the latter is more convenient.
Lucene analyzers can split on whitespace, normalize to lower case
for case insensitivity, ignore common terms with little
discriminatory value such as "he", "in", "and" (stop words),
reduce terms to their natural linguistic root form, such as
"fishing" being reduced to "fish" (stemming), resolve
synonyms/inflections/thesauri (upon indexing and/or querying), etc.
Also see Lucene Query Syntax
as well as Query Parser Rules.
Explore and enjoy, perhaps using the queries and sample data from the
samples/fulltext directory as a starting point.
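The same kind of match can also be run from plain Java against Lucene's in-memory index. This sketch uses the Lucene 1.9/2.x era API (StandardAnalyzer, QueryParser, MemoryIndex); treat the class names and signatures as assumptions to adjust for your Lucene version:

```java
// score a single transient document against a fulltext query, on the fly:
Analyzer analyzer = new StandardAnalyzer();
MemoryIndex index = new MemoryIndex();
index.addField("content",
    "Readings about Salmons and other select Alaska fishing Manuals", analyzer);
Query query = new QueryParser("content", analyzer).parse("+salmon~ +fish* manual~");
float score = index.search(query); // 0.0 means no match
if (score > 0.0f) System.out.println("matched, score=" + score);
```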
Querying Nasty HTML
If you'd like to query non-XML documents such as the typical HTML that lives out there,
you can combine Nux with TagSoup,
which is a "SAX-compliant parser that, instead of parsing well-formed or valid XML,
parses HTML as it is found in the wild: nasty and brutish, though quite often far from short".
TagSoup plugs into XOM and makes ill-formed HTML appear as well-formed XML.
Just add tagsoup.jar to the classpath and try this:
// find the links of all images in an ill-formed HTML document
XMLReader parser = new org.ccil.cowan.tagsoup.Parser(); // tagsoup parser
Document doc = new Builder(parser).build("http://www.yahoo.com");
Nodes results = XQueryUtil.xquery(doc, "//*:img/@src");
for (int i=0; i < results.size(); i++) {
System.out.println("node "+i+": " + results.get(i).toXML());
//System.out.println("node "+i+": " + XOMUtil.toPrettyXML(results.get(i)));
}
Streaming Serialization of Very Large Documents
Using memory consumption close to zero, a StreamingSerializer
enables writing arbitrarily large XML documents onto a destination, such as an OutputStream,
both for standard textual XML as well as binary XML.
Nodes should be written in document order, starting with
writeXMLDeclaration(), followed by writes for the individual nodes,
finally finishing with writeEndDocument(). Elements are opened and
closed via writeStartTag(Element) and writeEndTag(), respectively.
StreamingSerializerFactory factory = new StreamingSerializerFactory();
StreamingSerializer ser = factory.createXMLSerializer(System.out, "UTF-8");
// StreamingSerializer ser = factory.createBinaryXMLSerializer(System.out, 0);
// StreamingSerializer ser = factory.createStaxSerializer(XMLStreamWriter writer);
ser.writeXMLDeclaration();
ser.writeStartTag(new Element("articles"));
for (int i = 0; i < 10000000; i++) {
Element article = new Element("article");
article.addAttribute(new Attribute("id", String.valueOf(i)));
ser.writeStartTag(article);
ser.writeStartTag(new Element("prize"));
ser.write(new Text(String.valueOf(i * 1000)));
ser.writeEndTag(); // close prize
ser.writeStartTag(new Element("quantity"));
ser.write(new Text("hello world"));
ser.writeEndTag(); // close quantity
ser.writeEndTag(); // close article
}
ser.writeEndTag(); // close articles
ser.writeEndDocument();
The following example demonstrates mixing streaming writes with convenient writing of entire
prefabricated subtrees. For large documents, this approach combines the
scalability advantages of streaming with the ease of use of (comparatively small)
main-memory subtree construction:
StreamingSerializerFactory factory = new StreamingSerializerFactory();
StreamingSerializer ser = factory.createXMLSerializer(System.out, "UTF-8");
// StreamingSerializer ser = factory.createBinaryXMLSerializer(System.out, 0);
ser.writeXMLDeclaration();
ser.writeStartTag(new Element("articles"));
for (int i = 0; i < 10000000; i++) {
Element article = new Element("article");
article.addAttribute(new Attribute("id", String.valueOf(i)));
Element prize = new Element("prize");
prize.appendChild(String.valueOf(i * 1000));
article.appendChild(prize);
Element quantity = new Element("quantity");
quantity.appendChild("hello world");
article.appendChild(quantity);
ser.write(article); // writes entire subtree
}
ser.writeEndTag(); // close articles
ser.writeEndDocument();
Streaming XQuery over Very Large Documents
Using the StreamingPathFilter,
the following example is complete and efficient code for parsing and iterating through millions of "person"
records in a database-like XML document, printing all residents of "San Francisco",
while never allocating more memory than needed to hold one person element:
StreamingTransform myTransform = new StreamingTransform() {
public Nodes transform(Element person) {
Nodes results = XQueryUtil.xquery(person,
"name[../address/city = 'San Francisco']");
if (results.size() > 0) {
System.out.println("name = " + results.get(0).getValue());
}
return new Nodes(); // mark element as subject to garbage collection
}
};
// parse document with a filtering Builder
NodeFactory factory = new StreamingPathFilter("/persons/person", null).
createNodeFactory(null, myTransform);
new Builder(factory).build(new File("/tmp/persons.xml"));
Working with Efficient Binary XML
The BinaryXMLCodec
serializes (encodes) and deserializes (decodes) XOM XML documents to and from
an efficient and compact custom binary XML data format (termed bnux
format), without loss or change of any information. Serialization and
deserialization is much faster than with the standard textual XML format, and
the resulting binary data is more compressed than textual XML.
// parse standard textual XML with XOM
Document doc = new Builder().build(new File("samples/data/periodic.xml"));
// convert to binary XML
BinaryXMLCodec codec = new BinaryXMLCodec();
byte[] bnuxDoc = codec.serialize(doc, 0);
// check correctness
Document doc2 = codec.deserialize(bnuxDoc);
boolean isEqual = java.util.Arrays.equals(
XOMUtil.toCanonicalXML(doc), XOMUtil.toCanonicalXML(doc2));
System.out.println("isEqual = " + isEqual);
System.out.println(doc2.toXML());
// write binary XML document to file
OutputStream out = new FileOutputStream("/tmp/periodic.xml.bnux");
out.write(bnuxDoc);
out.close();
// read binary XML document from file; convert to XOM document
bnuxDoc = FileUtil.toByteArray(new FileInputStream("/tmp/periodic.xml.bnux"));
Document doc3 = codec.deserialize(bnuxDoc);
System.out.println(doc3.toXML());
// streaming conversion of standard textual XML to bnux binary XML:
InputStream in = new FileInputStream("samples/data/weblog.xml");
OutputStream out = new FileOutputStream("/tmp/weblog.xml.bnux");
StreamingSerializerFactory factory = new StreamingSerializerFactory();
StreamingSerializer serializer = factory.createBinaryXMLSerializer(out, 0);
NodeFactory redirector = XOMUtil.getRedirectingNodeFactory(serializer);
new Builder(redirector).build(in); // performs streaming conversion
in.close();
out.close();
// streaming conversion of bnux binary XML to standard textual XML:
InputStream in = new FileInputStream("/tmp/weblog.xml.bnux");
OutputStream out = new FileOutputStream("/tmp/weblog.xml");
StreamingSerializerFactory factory = new StreamingSerializerFactory();
StreamingSerializer serializer = factory.createXMLSerializer(out, "UTF-8");
NodeFactory redirector = XOMUtil.getRedirectingNodeFactory(serializer);
new BinaryXMLCodec().deserialize(in, redirector); // performs streaming conversion
in.close();
out.close();