Nux 1.6

nux.xom.pool
Class FullTextUtil

java.lang.Object
  extended by nux.xom.pool.FullTextUtil

public class FullTextUtil
extends Object

Thread-safe XQuery/XPath fulltext search utilities, implemented with the Lucene engine and a custom high-performance adapter for on-the-fly main-memory indexing, with smart caching of indexes, queries and results.
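
The idea of caching results can be pictured with a small sketch. This is not the Nux implementation; it only illustrates memoizing relevance scores per (text, query) pair in a bounded, thread-safe LRU map. The class name ScoreCacheSketch and the Scorer interface are hypothetical stand-ins; in real use the scorer would presumably delegate to a call such as FullTextUtil.match(text, query).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a bounded LRU cache for relevance scores (not Nux's actual cache). */
final class ScoreCacheSketch {

    /** Hypothetical stand-in for an expensive scoring call. */
    interface Scorer {
        float score(String text, String query);
    }

    private final Scorer scorer;
    private final Map<List<String>, Float> lru;

    ScoreCacheSketch(final int maxEntries, Scorer scorer) {
        this.scorer = scorer;
        // accessOrder = true: the eldest entry is the least recently used one
        this.lru = new LinkedHashMap<List<String>, Float>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<List<String>, Float> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached score, computing and storing it on a cache miss. */
    synchronized float score(String text, String query) {
        List<String> key = Arrays.asList(text, query);
        Float cached = lru.get(key);
        if (cached == null) {
            cached = scorer.score(text, query);
            lru.put(key, cached);
        }
        return cached;
    }

    public static void main(String[] args) {
        // Toy scorer for demonstration: 1.0 if the text contains the query, else 0.0
        ScoreCacheSketch cache = new ScoreCacheSketch(100,
                (text, query) -> text.contains(query) ? 1.0f : 0.0f);
        System.out.println(cache.score("Alaska salmon fishing", "salmon"));
        // The second call with identical arguments is served from the cache
        System.out.println(cache.score("Alaska salmon fishing", "salmon"));
    }
}
```

The map is keyed on both the text and the query, so changing either one produces a separate entry; the size bound keeps memory use predictable when many distinct texts are scored.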

Complementing the standard XPath string and regular expression matching functionality, Lucene has a powerful query syntax with support for word stemming, fuzzy searches, similarity searches, approximate searches, boolean operators, wildcards, grouping, range searches, term boosting, etc. For details see the Lucene Query Syntax and Examples. Also see MemoryIndex and PatternAnalyzer for detailed documentation.

Example Java usage:

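 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.index.memory.PatternAnalyzer;
 import nux.xom.pool.FullTextUtil;

 // match(...) throws ParseException if the query expression has a syntax error
 Analyzer analyzer = PatternAnalyzer.DEFAULT_ANALYZER;
 float score = FullTextUtil.match(
     "Readings about Salmons and other select Alaska fishing Manuals",
     "+salmon~ +fish* manual~",
     analyzer, analyzer);
 if (score > 0.0f) {
     // query matches text
 } else {
     // query does not match text
 }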
 
Example XQuery/XPath usage:
 declare namespace lucene = "java:nux.xom.pool.FullTextUtil"; 
 lucene:match(
    "Readings about Salmons and other select Alaska fishing Manuals", 
    "+salmon~ +fish* manual~")
 
Example XQuery/XPath usage to find all books that have a title about salmon fishing:
 declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
 /books/book[lucene:match(title, "+salmon~ +fish* manual~") > 0.0]
 
An XQuery that finds all books authored by "James" that have something to do with "salmon fishing manuals", sorted by relevance:
 declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
 declare variable $query := "+salmon~ +fish* manual~"; (: any arbitrary Lucene query can go here :)
 (: declare variable $query as xs:string external; :)
 
 for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
 let $score := lucene:match($book/abstract, $query)
 order by $score descending
 return $book
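
For readers calling the API from Java rather than XQuery, the filter-and-sort logic of the query above can be sketched as follows. The class name RelevanceRankingSketch and the Scorer interface are hypothetical; Scorer stands in for a call such as FullTextUtil.match(text, query), so that the sketch runs without Nux or Lucene on the classpath. Unlike the XQuery, which calls lucene:match twice per book, this version scores each text once.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: keep matching texts and order them by descending relevance. */
final class RelevanceRankingSketch {

    /** Hypothetical stand-in for a call such as FullTextUtil.match(text, query). */
    interface Scorer {
        float score(String text, String query);
    }

    /** Pairs a text with the score it received. */
    private static final class Scored {
        final String text;
        final float score;

        Scored(String text, float score) {
            this.text = text;
            this.score = score;
        }
    }

    /** Returns the texts whose score is above 0.0, best match first. */
    static List<String> rank(List<String> texts, String query, Scorer scorer) {
        List<Scored> hits = new ArrayList<>();
        for (String text : texts) {
            float score = scorer.score(text, query); // scored exactly once per text
            if (score > 0.0f) {                      // 0.0 indicates no match
                hits.add(new Scored(text, score));
            }
        }
        hits.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());

        List<String> ranked = new ArrayList<>();
        for (Scored hit : hits) {
            ranked.add(hit.text);
        }
        return ranked;
    }

    public static void main(String[] args) {
        // Toy scorer for demonstration: share of the text covered by the query string
        Scorer toy = (text, query) ->
                text.contains(query) ? (float) query.length() / text.length() : 0.0f;

        List<String> abstracts = new ArrayList<>();
        abstracts.add("salmon fishing manual");
        abstracts.add("salmon");
        abstracts.add("trout");
        System.out.println(rank(abstracts, "salmon", toy));
    }
}
```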
 
Extracting sentences:
 for $book in /books/book
     for $s in lucene:sentences($book/abstract, 0)
         return
             if (lucene:match($s, "+salmon~ +fish* manual~") > 0.0) 
             then normalize-space($s)
             else ()
 
Using a custom text tokenizer/analyzer, limiting to the first 100 words, with debug logging:
 declare namespace lucene          = "java:nux.xom.pool.FullTextUtil"; 
 declare namespace analyzerUtil    = "java:org.apache.lucene.index.memory.AnalyzerUtil";
 declare namespace patternAnalyzer = "java:org.apache.lucene.index.memory.PatternAnalyzer";
 declare namespace system          = "java:java.lang.System";
 
 lucene:match(
    "Readings about Salmons and other select Alaska fishing Manuals", 
    "+salmon~ +fish* manual~",
    analyzerUtil:getLoggingAnalyzer(
       analyzerUtil:getMaxTokenAnalyzer(
          patternAnalyzer:DEFAULT_ANALYZER(), 
          100),
       system:err(), 
       "log"), 
    patternAnalyzer:DEFAULT_ANALYZER()
 )
 

Author:
whoschek@lbl.gov, $Author: hoschek3 $

Method Summary
static float match(String text, String query)
          Lucene fulltext search convenience method; equivalent to match(text, query, null, null).
static float match(String text, String query, Analyzer textAnalyzer, Analyzer queryAnalyzer)
          Lucene fulltext search convenience method; returns the relevance score by matching the given text string against the given Lucene query expression.
static String[] paragraphs(String text, int limit)
          Returns at most the first N paragraphs of the given text.
static String[] sentences(String text, int limit)
          Returns at most the first N sentences of the given text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

match

public static float match(String text,
                          String query)
                   throws ParseException
Lucene fulltext search convenience method; equivalent to match(text, query, null, null).

Parameters:
text - the string to match the query against
query - the Lucene fulltext query expression
Returns:
the relevance score; a number in the range [0.0 .. 1.0]
Throws:
ParseException - if the query expression has a syntax error

match

public static float match(String text,
                          String query,
                          Analyzer textAnalyzer,
                          Analyzer queryAnalyzer)
                   throws ParseException
Lucene fulltext search convenience method; returns the relevance score by matching the given text string against the given Lucene query expression. The score is in the range [0.0 .. 1.0], with 0.0 indicating no match. The higher the number, the better the match.

Typically, both analyzers are identical, but this need not be the case.

Parameters:
text - the string to match the query against
query - the Lucene fulltext query expression
textAnalyzer - Stream tokenizer that extracts index terms from text according to some policy. May be null, in which case a default is used.
queryAnalyzer - Stream tokenizer that extracts query terms from query according to some policy. May be null, in which case a default is used.
Returns:
the relevance score; a number in the range [0.0 .. 1.0]
Throws:
ParseException - if the query expression has a syntax error

paragraphs

public static String[] paragraphs(String text,
                                  int limit)
Returns at most the first N paragraphs of the given text. Delimiting characters are excluded from the results. Each returned paragraph is whitespace-trimmed via String.trim() and may therefore be an empty string.

Parameters:
text - the text to tokenize into paragraphs
limit - the maximum number of paragraphs to return; zero indicates "as many as possible".
Returns:
the first N paragraphs
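
The limit semantics described above (zero meaning "as many as possible", trimmed results, delimiters excluded) can be illustrated with a small sketch. This is not the Nux implementation: the class name ParagraphLimitSketch and the assumption that paragraphs are separated by blank lines are made up for illustration, since the actual delimiters are not specified here.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Illustrative only: "first N paragraphs" with limit 0 meaning "no limit". */
final class ParagraphLimitSketch {

    // Assumption: a paragraph boundary is a blank line
    // (a line break, optional whitespace, then another line break).
    private static final Pattern BOUNDARY = Pattern.compile("\\r?\\n\\s*\\r?\\n");

    static String[] firstParagraphs(String text, int limit) {
        // Ask for limit + 1 pieces so the unsplit remainder of the text lands in
        // an extra trailing slot that is dropped below; -1 means "split everywhere".
        String[] parts = BOUNDARY.split(text, limit > 0 ? limit + 1 : -1);
        int n = limit > 0 ? Math.min(limit, parts.length) : parts.length;

        String[] result = new String[n];
        for (int i = 0; i < n; i++) {
            result[i] = parts[i].trim(); // may yield an empty string
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "One.\n\n Two. \n\nThree.";
        System.out.println(Arrays.toString(firstParagraphs(text, 2)));
        System.out.println(Arrays.toString(firstParagraphs(text, 0)));
    }
}
```

Stopping the split after the requested number of boundaries avoids scanning the rest of a long text when only the first few paragraphs are needed.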

sentences

public static String[] sentences(String text,
                                 int limit)
Returns at most the first N sentences of the given text. Delimiting characters are excluded from the results. Each returned sentence is whitespace-trimmed via String.trim() and may therefore be an empty string.

Parameters:
text - the text to tokenize into sentences
limit - the maximum number of sentences to return; zero indicates "as many as possible".
Returns:
the first N sentences
