Efficient and Powerful XML Processing Made Easy
Nux is an open-source Java toolkit making efficient and powerful XML processing easy.
It is geared towards embedded use in
high-throughput XML messaging middleware such as large-scale Peer-to-Peer
infrastructures, message queues, publish-subscribe and matchmaking systems
for Blogs/newsfeeds, text chat, data acquisition and distribution systems,
application level routers, firewalls, classifiers, etc.
Have you ever tried to take advantage of a robust and natural commodity Java tool set for
XML, XQuery, XPath, schema validation, binary XML, fuzzy fulltext similarity search and related technologies,
yet were not ready to accept a significant performance penalty? Chances are that most tool sets
turned out not to be particularly robust or natural, that they incurred dramatic penalties
when used in straightforward ways, and that their complex idiosyncrasies tended to
distract from the real job and use cases you wanted to get done in a timely manner.
Nux helps to avoid XML nightmares, enabling you to mix and match powerful main-memory
XML tools in natural, straightforward, seamless,
effective and standards compliant manners.
Nux reliably processes whatever data fits into main memory (even, say, 250 MB messages),
but it is not an XML database system, and does not attempt to be one.
Nux integrates best-of-breed components, containing extensions of the
XOM,
Saxon and
Lucene open-source libraries.
XQuery/XPath:
Seamless, complete, standards compliant and efficient W3C XQuery and XPath support for XOM.
Also see the nux/bin/fire-xquery command line test tool and XQueryBenchmark. (API)

In-place update:
Simple yet powerful and efficient in-place morphing for use as an
XQuery/XPath insert, update and delete facility; particularly useful for
structurally small tree transformations without requiring (potentially
huge) XML tree copies. (API)

Fulltext search:
Fulltext search (fuzzy similarity queries) for on-the-fly matchmaking
in realtime streaming applications combining structured and unstructured queries.
Arbitrary Lucene fulltext queries can be run from Java or
from XQuery/XPath/XSLT via a simple extension function. (API)

Pooling:
Efficient and flexible pools and factories for documents, XQueries, XSL Transforms, as well as
document Builders that validate against various schema languages, including W3C XML Schemas,
DTDs, RELAX NG, Schematron, etc. (API)

Binary XML:
Optional serialization and deserialization of XOM XML documents to and from
an efficient and compact custom binary XML data format (bnux format),
without loss or change of any information.
Serialization and deserialization is much faster than with the standard textual XML format,
and the resulting binary data is more compressed than textual XML. (API)

Streaming XQuery:
For simple and complex continuous queries and/or transformations over very
large or infinitely long XML input, a convenient streaming path filter API
combines full XQuery and XPath support with straightforward filtering. (API)

Streaming Serialization:
Using memory consumption close to zero, streaming serialization enables writing
arbitrarily large XML documents onto a destination, such as an OutputStream,
both for standard textual XML as well as binary XML. (API)

Pluggable SAX and StAX:
A XOM Builder implementation that uses a StAX parser (e.g. Woodstox)
instead of a SAX parser (e.g. Xerces) can be used interchangeably. (API)

JAXB and HTML:
Conversion from XOM to JAXB and vice versa, serving as an intermediary for
XML <--> Object mapping tools. XQueries over ill-formed HTML. (API)

Open Source:
BSD style license
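As a minimal sketch of the in-place update facility: this assumes an XQueryUtil.update(node, select, morpher) method where the morpher XQuery is evaluated for each node selected by the select expression and its result sequence replaces that node (an empty result deletes it); check the API docs for the exact contract:

```java
// delete all size attributes in place (no tree copy is made):
Document doc = XOMUtil.toDocument(
    "<foo><bar size='123'>hello world</bar></foo>");
XQueryUtil.update(doc, "//@size", "()"); // empty morpher result deletes the node
System.out.println(doc.toXML());
```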
Motivation
Have you ever tried to do queries and/or transformations over XML data sources?
Chances are that manual SAX/DOM processing was cumbersome at best, that XPath was not powerful or flexible enough,
or XSLT perhaps too complicated, and that most related APIs have a steep learning curve, and
contain quite a few bugs.
This is where the power and simplicity of XQuery comes in.
Nux provides seamless XQuery support for XOM, leveraging the
standards compliance, efficiency and maturity of the Saxon engine,
in combination with a robust, lean and mean adapter for XOM that Nux contributed to Saxon.
Since XQuery is a superset of XPath 2.0 it
can also be used with plain XPath expressions as queries.
It implements the W3C XQuery Candidate Recommendation of 3 November 2005
and passes several exhaustive test suites.
Like Saxon-B, Nux XQuery is not schema aware, unlike the commercial Saxon SA version.
Have you ever tried to build an XML system that is straightforward, works correctly and
processes tens of thousands of small XML messages per second in non-trivial ways? Chances are you've encountered lots of
non-obvious obstacles down that path. For that scenario, Nux couples the simplicity and correctness qualities of XOM with
efficient and flexible pools
and factories for documents, XQueries, XSL Transforms, as well as
document Builders that validate against various schema languages, including
W3C XML Schemas (leveraging Xerces),
RELAX NG, Schematron, etc. (leveraging MSV).
For particularly stringent performance requirements
an option for lightning-fast binary XML serialization and deserialization
is offered.
XML <--> Object mapping glue for integration with
JAXB
and for queries over ill-formed HTML is also provided.
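The pooling facilities can be sketched as follows. The GLOBAL_POOL constant mirrors the XQueryPool example further below, while the BuilderPool method name and the string-based getXQuery overload are assumptions to verify against the nux.xom.pool javadoc:

```java
// obtain a cached, thread-safe non-validating Builder from the global pool:
Builder builder = BuilderPool.GLOBAL_POOL.getBuilder(false);
Document doc = builder.build(new File("samples/data/periodic.xml"));

// obtain a cached, precompiled XQuery and run it against the document:
XQuery xquery = XQueryPool.GLOBAL_POOL.getXQuery("//ATOM/NAME", null);
Nodes results = xquery.execute(doc).toNodes();
System.out.println("atoms found: " + results.size());
```

Repeated lookups with the same key return the cached instance, avoiding the cost of re-parsing documents and recompiling queries on every message.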
Command Line Usage
To get started, you can use
nux/bin/fire-xquery,
a flexible command line test tool that runs a given XQuery against a set of files
and prints the result sequence. For example:
fire-xquery --query='{declare namespace atom = "http://www.w3.org/2005/Atom";
doc("http://www.tbray.org/ongoing/ongoing.atom")/atom:feed/atom:entry/atom:title}'
<title xmlns="http://www.w3.org/2005/Atom">nbextras.org</title>
<title xmlns="http://www.w3.org/2005/Atom">Election Day</title>
<title xmlns="http://www.w3.org/2005/Atom">Washington Post Screws Up</title>
<title xmlns="http://www.w3.org/2005/Atom">Marketing Truth</title>
<title xmlns="http://www.w3.org/2005/Atom">Upcoming Gig: JavaOne</title>
<title xmlns="http://www.w3.org/2005/Atom">GPL3 Draft</title>
The command line tool also supports schema validation, XInclude (via XOM),
an XQuery update facility, malformed HTML parsing (via TagSoup) and much more.
It's available for Unix and Windows, and works like any other decent Unix command line tool.
The source code can be found in class nux.xom.tests.XQueryCommand.
Basic API Usage
More interestingly, here are examples demonstrating basic API usage:
// parse a file document with XOM:
Document doc = new Builder().build(new File("samples/data/periodic.xml"));
System.out.println(doc.toXML());
System.out.println(XOMUtil.toPrettyXML(doc));
// parse a string document:
String xml =
"<foo>" +
"<bar size='123'>" +
"hello world" +
"</bar>" +
"</foo>";
Document doc = XOMUtil.toDocument(xml);
System.out.println(doc.toXML());
System.out.println(XOMUtil.toPrettyXML(doc));
// construct a XOM document (main memory tree):
Element bar = new Element("bar");
bar.addAttribute(new Attribute("size", "123"));
bar.appendChild(new Text("hello world"));
Element foo = new Element("foo");
foo.appendChild(bar);
Document doc = new Document(foo);
System.out.println(doc.toXML());
System.out.println(XOMUtil.toPrettyXML(doc));
// serialize a document with XOM onto an OutputStream:
OutputStream out = new FileOutputStream("samples/data/periodic2.xml");
Serializer ser = new Serializer(out);
// ser.setIndent(4); // optional pretty printing
ser.write(doc);
out.close();
// parse a document with XOM, either using SAX or StAX:
InputStream in = new FileInputStream("samples/data/articles.xml");
boolean useSAX = true;
// boolean useSAX = false;
Builder builder = useSAX ? new Builder() : StaxUtil.createBuilder(null, null);
Document doc = builder.build(in);
System.out.println(doc.toXML());
XPath / XQuery examples
// parse XML document with XOM:
Document doc = new Builder().build(new File("samples/data/periodic.xml"));
// find the atom named 'Zinc' in the periodic table:
Node result = XQueryUtil.xquery(doc, "/PERIODIC_TABLE/ATOM[NAME = 'Zinc']").get(0);
System.out.println("result=" + result.toXML());
// equivalent via the more powerful underlying API:
XQuery xquery = new XQuery("/PERIODIC_TABLE/ATOM[NAME = 'Zinc']", null);
Node result = xquery.execute(doc).next();
// count the number of elements in a document tree
int count = XQueryUtil.xquery(doc, "//*").size();
System.out.println("count=" + count);
// find the src attributes of all images in a document:
Document doc = new Builder().build(new File("/tmp/test.xml"));
Nodes results = XQueryUtil.xquery(doc, "//*:img/@src");
//Nodes results = XQueryUtil.xquery(doc, "//*:img/@src[matches(., '.jpg')]");
// see exactly what items the query does (or does not) return:
for (int i=0; i < results.size(); i++) {
System.out.println("node "+i+": " + results.get(i).toXML());
//System.out.println("node "+i+": " + XOMUtil.toPrettyXML(results.get(i)));
}
// or serialize results according to W3C spec onto an output stream:
ResultSequenceSerializer ser = new ResultSequenceSerializer();
ser.setEncoding("UTF-8");
ser.setIndent(4);
ser.setAlgorithm(ResultSequenceSerializer.W3C_ALGORITHM);
// ser.setAlgorithm(ResultSequenceSerializer.WRAP_ALGORITHM);
OutputStream out = System.out;
ser.write(results, out);
java:
File query = new File("/tmp/saxonb-8.6.1/use-cases/xmp/q1.xq");
XQuery xquery = XQueryPool.GLOBAL_POOL.getXQuery(query);
Nodes results = xquery.execute(null).toNodes();
ResultSequenceSerializer ser = new ResultSequenceSerializer();
ser.setIndent(4);
ser.write(results, System.out);
xmp/q1.xq:
<bib>
{
for $b in doc("bib.xml")/bib/book
where $b/publisher = "Addison-Wesley" and $b/@year > 1991
return
<book year="{ $b/@year }">
{ $b/title }
</book>
}
</bib>
output:
<?xml version="1.0" encoding="UTF-8"?>
<bib>
<book year="1994">
<title>TCP/IP Illustrated</title>
</book>
<book year="1992">
<title>Advanced Programming in the Unix environment</title>
</book>
</bib>
Another example query, combining a join, sorting and aggregation:
for $i in doc("items.xml")//item_tuple
let $b := doc("bids.xml")//bid_tuple[itemno = $i/itemno]
where contains($i/description, "Bicycle")
order by $i/itemno
return
<item_tuple>
{ $i/itemno }
{ $i/description }
<high_bid>{ max($b/bid) }</high_bid>
</item_tuple>
Google-like realtime fulltext search via Apache Lucene engine
Similar to Google search, the optional fulltext search via the Apache Lucene engine is
easy to use, powerful, efficient and goes far beyond what can be done with
standard XPath regular expressions
and string manipulation functions. It is similar in intent but not directly related to preliminary
W3C fulltext search drafts.
Rather than targeting fulltext search of infrequent queries over huge persistent
data archives (historic search), Nux targets fulltext search of huge
numbers of queries over comparatively small transient realtime data (prospective search),
e.g. in the ballpark of 100000-500000 queries/sec.
See FullTextUtil and
MemoryIndex.
declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~";
(: any arbitrary Lucene query can go here :)
(: declare variable $query as xs:string external; :)
for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book
declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
for $book in /books/book
for $s in lucene:sentences($book/abstract, 0)
return
if (lucene:match($s, "+salmon~ +fish* manual~") > 0.0)
then normalize-space($s)
else ()
Arbitrary Lucene fulltext queries can be run from Java or
from XQuery/XPath/XSLT via a simple extension function.
The former approach is more flexible whereas the latter is more convenient.
Lucene analyzers can split on whitespace, normalize to lower case
for case insensitivity, ignore common terms with little
discriminatory value such as "he", "in", "and" (stop words),
reduce terms to their natural linguistic root form, such as
"fishing" being reduced to "fish" (stemming), resolve
synonyms/inflections/thesauri (upon indexing and/or querying), etc.
Also see Lucene Query Syntax
as well as Query Parser Rules.
Explore and enjoy, perhaps using the queries and sample data from the
samples/fulltext directory as a starting point.
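The same kind of match can also be run from plain Java against Lucene's in-memory index. This sketch uses the Lucene 1.9/2.x era API (StandardAnalyzer, QueryParser, MemoryIndex); treat the class names and signatures as assumptions to adjust for your Lucene version:

```java
// score a single transient document against a fulltext query, on the fly:
Analyzer analyzer = new StandardAnalyzer();
MemoryIndex index = new MemoryIndex();
index.addField("content",
    "Readings about Salmons and other select Alaska fishing Manuals", analyzer);
Query query = new QueryParser("content", analyzer).parse("+salmon~ +fish* manual~");
float score = index.search(query); // 0.0 means no match
if (score > 0.0f) System.out.println("matched, score=" + score);
```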
Querying Nasty HTML
If you'd like to query non-XML documents such as the typical HTML that lives out there,
you can combine Nux with TagSoup,
which is a "SAX-compliant parser that, instead of parsing well-formed or valid XML,
parses HTML as it is found in the wild: nasty and brutish, though quite often far from short".
TagSoup plugs into XOM and makes ill-formed HTML appear as well-formed XML.
Just add tagsoup.jar to the classpath and try this:
// find the links of all images in an ill-formed HTML document
XMLReader parser = new org.ccil.cowan.tagsoup.Parser(); // tagsoup parser
Document doc = new Builder(parser).build("http://www.yahoo.com");
Nodes results = XQueryUtil.xquery(doc, "//*:img/@src");
for (int i=0; i < results.size(); i++) {
System.out.println("node "+i+": " + results.get(i).toXML());
//System.out.println("node "+i+": " + XOMUtil.toPrettyXML(results.get(i)));
}
Streaming Serialization of Very Large Documents
Using memory consumption close to zero, a StreamingSerializer
enables writing arbitrarily large XML documents onto a destination, such as an OutputStream,
both for standard textual XML as well as binary XML.
Nodes should be written in document order, starting with
writeXMLDeclaration(), followed by writes for the individual nodes,
finally finishing with writeEndDocument(). Elements are opened and
closed via writeStartTag(Element) and writeEndTag(), respectively.
StreamingSerializerFactory factory = new StreamingSerializerFactory();
StreamingSerializer ser = factory.createXMLSerializer(System.out, "UTF-8");
// StreamingSerializer ser = factory.createBinaryXMLSerializer(System.out, 0);
// StreamingSerializer ser = factory.createStaxSerializer(XMLStreamWriter writer);
ser.writeXMLDeclaration();
ser.writeStartTag(new Element("articles"));
for (int i = 0; i < 10000000; i++) {
Element article = new Element("article");
article.addAttribute(new Attribute("id", String.valueOf(i)));
ser.writeStartTag(article);
ser.writeStartTag(new Element("prize"));
ser.write(new Text(String.valueOf(i * 1000)));
ser.writeEndTag(); // close prize
ser.writeStartTag(new Element("quantity"));
ser.write(new Text("hello world"));
ser.writeEndTag(); // close quantity
ser.writeEndTag(); // close article
}
ser.writeEndTag(); // close articles
ser.writeEndDocument();
The following example demonstrates mixing streaming writes with convenient writing of entire
prefabricated subtrees. For large documents, this approach combines the
scalability advantages of streaming with the ease of use of (comparatively small)
main-memory subtree construction:
StreamingSerializerFactory factory = new StreamingSerializerFactory();
StreamingSerializer ser = factory.createXMLSerializer(System.out, "UTF-8");
// StreamingSerializer ser = factory.createBinaryXMLSerializer(System.out, 0);
ser.writeXMLDeclaration();
ser.writeStartTag(new Element("articles"));
for (int i = 0; i < 10000000; i++) {
Element article = new Element("article");
article.addAttribute(new Attribute("id", String.valueOf(i)));
Element prize = new Element("prize");
prize.appendChild(String.valueOf(i * 1000));
article.appendChild(prize);
Element quantity = new Element("quantity");
quantity.appendChild("hello world");
article.appendChild(quantity);
ser.write(article); // writes entire subtree
}
ser.writeEndTag(); // close articles
ser.writeEndDocument();
Streaming XQuery over Very Large Documents
Using the StreamingPathFilter,
the following example is complete and efficient code for parsing and iterating through millions of "person"
records in a database-like XML document, printing all residents of "San Francisco",
while never allocating more memory than needed to hold one person element:
StreamingTransform myTransform = new StreamingTransform() {
public Nodes transform(Element person) {
Nodes results = XQueryUtil.xquery(person,
"name[../address/city = 'San Francisco']");
if (results.size() > 0) {
System.out.println("name = " + results.get(0).getValue());
}
return new Nodes(); // mark element as subject to garbage collection
}
};
// parse document with a filtering Builder
NodeFactory factory = new StreamingPathFilter("/persons/person", null).
createNodeFactory(null, myTransform);
new Builder(factory).build(new File("/tmp/persons.xml"));
Working with Efficient Binary XML
The BinaryXMLCodec
serializes (encodes) and deserializes (decodes) XOM XML documents to and from
an efficient and compact custom binary XML data format (termed bnux
format), without loss or change of any information. Serialization and
deserialization is much faster than with the standard textual XML format, and
the resulting binary data is more compressed than textual XML.
// parse standard textual XML with XOM
Document doc = new Builder().build(new File("samples/data/periodic.xml"));
// convert to binary XML
BinaryXMLCodec codec = new BinaryXMLCodec();
byte[] bnuxDoc = codec.serialize(doc, 0);
// check correctness
Document doc2 = codec.deserialize(bnuxDoc);
boolean isEqual = java.util.Arrays.equals(
XOMUtil.toCanonicalXML(doc), XOMUtil.toCanonicalXML(doc2));
System.out.println("isEqual = " + isEqual);
System.out.println(doc2.toXML());
// write binary XML document to file
OutputStream out = new FileOutputStream("/tmp/periodic.xml.bnux");
out.write(bnuxDoc);
out.close();
// read binary XML document from file; convert to XOM document
bnuxDoc = FileUtil.toByteArray(new FileInputStream("/tmp/periodic.xml.bnux"));
Document doc3 = codec.deserialize(bnuxDoc);
System.out.println(doc3.toXML());
// streaming conversion of standard textual XML to bnux binary XML:
InputStream in = new FileInputStream("samples/data/weblog.xml");
OutputStream out = new FileOutputStream("/tmp/weblog.xml.bnux");
StreamingSerializerFactory factory = new StreamingSerializerFactory();
StreamingSerializer serializer = factory.createBinaryXMLSerializer(out, 0);
NodeFactory redirector = XOMUtil.getRedirectingNodeFactory(serializer);
new Builder(redirector).build(in); // performs streaming conversion
in.close();
out.close();
// streaming conversion of bnux binary XML to standard textual XML:
InputStream in = new FileInputStream("/tmp/weblog.xml.bnux");
OutputStream out = new FileOutputStream("/tmp/weblog.xml");
StreamingSerializerFactory factory = new StreamingSerializerFactory();
StreamingSerializer serializer = factory.createXMLSerializer(out, "UTF-8");
NodeFactory redirector = XOMUtil.getRedirectingNodeFactory(serializer);
new BinaryXMLCodec().deserialize(in, redirector); // performs streaming conversion
in.close();
out.close();