Monday, May 9, 2016

Replacing dtSearch with Solr: Using Solr's XMLQueryParser to process Search Queries

The Problem

Sometimes you need a query that supports nesting, and when you do, Solr can be quite frustrating.  The company I work for must support deeply nested proximity queries, some of which have five, six, or more levels of nesting.  Try doing that with the standard Lucene syntax of "word1 word2"~3!  Yeah, you can't.  Not only that, but the standard syntax can't express "word1 must occur within three words before word2," let alone nesting.  Don't get me wrong: Solr supports this kind of query under the hood, but good luck finding a query parser that exposes it.

While dtSearch is an implementation detail, it is an important one for us because dtSearch has a richer search syntax than Solr.  You may not need to support dtSearch, but you may very well want to support something that the default query parser doesn't support.

I've tried various other out-of-the-box query parsers that come with Solr, but each of them turned out to be lacking something.

The problem I ran into is that Solr's default search syntax does not support complex, deeply-nested proximity queries.  I've been looking and looking for ways of overcoming this handicap and I think I finally found a solution.

XmlQueryParser

Since Solr 5.5, XmlQueryParser has been available for use.  Good luck finding any documentation beyond the single example tucked away in an obscure corner of the Solr docs.

I've done some code spelunking and have found some more complex examples not included in the documentation.

Let's start with a simple, nested example:
dtSearch query: new pre/3 (daybook or (employee w/3 onboarding))

dtSearch syntax:
  • X pre/N Y - X must occur within N words before Y
  • X w/N Y - X and Y must occur within N words of each other (i.e. X within N words, before or after, Y)
So, in the above example, the word new must occur within three words before either daybook or a span in which employee and onboarding occur within three words of each other, in either order.

Here's what the query looks like for the XmlQueryParser:

{!xmlparser}
<BooleanQuery>
    <Clause occurs="must">
        <SpanNear fieldName="headline" slop="3" inOrder="true">
            <SpanTerm>new</SpanTerm>
            <SpanOr>
                <SpanTerm>daybook</SpanTerm>
                <SpanNear slop="3" inOrder="false">
                    <SpanTerm>employee</SpanTerm>
                    <SpanTerm>onboarding</SpanTerm>
                </SpanNear>
            </SpanOr>
        </SpanNear>
    </Clause>
</BooleanQuery>
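Writing this XML by hand gets tedious once the nesting goes deeper, so here is a minimal Python sketch of how the same query could be assembled programmatically.  The helper names (span_term, span_or, span_near, pre, within, must, build_query) are my own inventions for illustration and are not part of Solr or dtSearch; the sketch assumes, as in the example above, that pre/N maps to inOrder="true" and w/N maps to inOrder="false", with N passed straight through as the slop.

```python
import xml.etree.ElementTree as ET


def span_term(word):
    """Build a <SpanTerm> element matching a single word."""
    el = ET.Element("SpanTerm")
    el.text = word
    return el


def span_or(*children):
    """Build a <SpanOr> element that matches any one of its children."""
    el = ET.Element("SpanOr")
    el.extend(children)
    return el


def span_near(slop, in_order, *children, field_name=None):
    """Build a <SpanNear> element; fieldName is only set when given."""
    attrs = {}
    if field_name is not None:
        attrs["fieldName"] = field_name
    attrs["slop"] = str(slop)
    attrs["inOrder"] = "true" if in_order else "false"
    el = ET.Element("SpanNear", attrs)
    el.extend(children)
    return el


def pre(n, x, y, field_name=None):
    """Hypothetical helper for dtSearch 'X pre/N Y' (ordered proximity)."""
    return span_near(n, True, x, y, field_name=field_name)


def within(n, x, y, field_name=None):
    """Hypothetical helper for dtSearch 'X w/N Y' (unordered proximity)."""
    return span_near(n, False, x, y, field_name=field_name)


def must(query):
    """Wrap a query in <BooleanQuery><Clause occurs="must">, as in the post."""
    root = ET.Element("BooleanQuery")
    clause = ET.SubElement(root, "Clause", {"occurs": "must"})
    clause.append(query)
    return root


def build_query():
    """new pre/3 (daybook or (employee w/3 onboarding)) on the headline field."""
    spans = pre(
        3,
        span_term("new"),
        span_or(
            span_term("daybook"),
            within(3, span_term("employee"), span_term("onboarding")),
        ),
        field_name="headline",
    )
    return "{!xmlparser}" + ET.tostring(must(spans), encoding="unicode")


if __name__ == "__main__":
    print(build_query())
```

The output is the same query as above without the whitespace.  Each extra level of dtSearch nesting becomes one more nested function call.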

Here's the response:
{
  "responseHeader":{
    "status":0,
    "QTime":48,
    "params":{
      "q":"{!xmlparser}\n<BooleanQuery>\n <Clause occurs=\"must\">\n <SpanNear fieldName=\"headline\" slop=\"3\" inOrder=\"true\">\n\t<SpanTerm>new</SpanTerm>\n\t<SpanOr>\n\t <SpanTerm>daybook</SpanTerm>\n\t <SpanNear slop=\"3\" inOrder=\"false\">\n\t\t<SpanTerm>employee</SpanTerm>\n\t\t<SpanTerm>onboarding</SpanTerm>\n\t </SpanNear>\n\t</SpanOr>\n </SpanNear>\n </Clause>\n</BooleanQuery>",
      "indent":"on",
      "fl":"headline",
      "rows":"1000",
      "wt":"json",
      "_":"1462805933013"}},
  "response":{"numFound":14,"start":0,"maxScore":12.657374,"docs":[
    {"headline":"Upstate New York Daybook"},
    {"headline":"New Hampshire Daybook"},
    {"headline":"New Jersey Daybook"},
    {"headline":"New Mexico Daybook"},
    {"headline":"New Mexico Daybook"},
    {"headline":"Upstate New York Daybook"},
    {"headline":"New Hampshire Daybook"},
    {"headline":"New Jersey Daybook"},
    {"headline":"Upstate New York Daybook"},
    {"headline":"New Mexico Daybook"},
    {"headline":"New Jersey Daybook"},
    {"headline":"New Hampshire Daybook"},
    {"headline":"New Employee Onboarding Center combines services for new employees"},
    {"headline":"New Employee Onboarding Center combines services for new employees"}]
  }}

The {!xmlparser} at the beginning of the q= parameter is very important--it tells Solr to use the XmlQueryParser for this search request.
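To show where the prefix ends up in an actual request, here is a rough Python sketch that builds a select URL with the same fl, rows, and wt parameters seen in the response above.  The host, port, and core name (localhost:8983, mycore) and the function name build_select_url are assumptions for illustration only; substitute your own.  The sketch only builds the URL and does not contact Solr.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust for your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_select_url(xml_query, fields="headline", rows=1000):
    """Return a select URL whose q parameter starts with {!xmlparser}."""
    params = {
        # The local-params prefix selects the XML query parser for this request.
        "q": "{!xmlparser}" + xml_query,
        "fl": fields,
        "rows": rows,
        "wt": "json",
    }
    # urlencode percent-encodes the braces, angle brackets, and quotes.
    return SOLR_SELECT_URL + "?" + urlencode(params)


# A smaller query in the same shape as the example in this post.
xml_query = (
    '<BooleanQuery><Clause occurs="must">'
    '<SpanNear fieldName="headline" slop="3" inOrder="false">'
    "<SpanTerm>employee</SpanTerm><SpanTerm>onboarding</SpanTerm>"
    "</SpanNear></Clause></BooleanQuery>"
)
url = build_select_url(xml_query)

if __name__ == "__main__":
    print(url)
```

Because the XML is full of characters that are not URL-safe, letting a library do the encoding is less error-prone than pasting the query into a URL by hand.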

I'm doing more research and will likely follow up with more articles about the various types of subqueries.
