Monday, May 23, 2016

Apache Solr - XML Query Parser

Introduction
The XML Query Parser (XmlQueryParser) supports a wider range of Apache Solr search queries
than any other query parser that ships with it.
This article examines the breadth of that support as released with Solr 6.0.0.
I will be adding separate articles (and linking to them) for the different types of queries so that
each can get more detail without overwhelming this main post.

De-Facto Example
<BooleanQuery fieldName="description">
    <Clause occurs="must">
        <TermQuery>shirt</TermQuery>
    </Clause>
    <Clause occurs="mustNot">
        <TermQuery>plain</TermQuery>
    </Clause>
    <Clause occurs="should">
        <TermQuery>cotton</TermQuery>
    </Clause>
    <Clause occurs="must">
        <BooleanQuery fieldName="size">
            <Clause occurs="should">
                <TermsQuery>S M L</TermsQuery>
            </Clause>
        </BooleanQuery>
    </Clause>
</BooleanQuery>


Difficulties
  • How do I get highlighting to work?

Top-Level
  • BooleanQuery
    • disableCoord (optional, false)
    • minimumNumberShouldMatch (optional, 0)
    • boost (optional, 1.0)
    • Value
      • Clause
        • occurs: should | must | mustNot | filter
        • Value (Note: Many of the following can also have children, explained later)
          • TermQuery
          • TermsQuery
          • MatchAllDocsQuery
          • BooleanQuery
          • LegacyNumericRangeQuery (deprecated)
          • PointRangeQuery
          • DisjunctionMaxQuery
          • UserQuery
          • ConstantScoreQuery
          • SpanNear
          • BoostingTermQuery
          • SpanTerm
          • SpanOr
          • SpanOrTerms
          • SpanFirst
          • SpanNot
        • NOTE: Only the first child element of each Clause is used--any additional children are silently ignored!
      • Any other element type at this level is ignored--only Clause is recognized, and no exception is thrown if something else is found

  • MatchAllDocsQuery - Matches all documents in an index
  • TermQuery
  • TermsQuery
  • [Legacy]NumericRangeQuery (deprecated as of roughly Lucene 6.0.0)
    • Not supported as of Solr 6 (Solr doesn't support point types yet)
  • PointRangeQuery (new as of roughly Lucene 6.0)
    • Not supported as of Solr 6 (Solr doesn't support point types yet)
  • RangeQuery
  • DisjunctionMaxQuery
    • tieBreaker (optional, 0.0)
    • boost (optional, 1.0)
    • Value
      • May contain multiple queries of any type defined in this list (e.g. DisjunctionMaxQuery, RangeQuery, …)
  • UserQuery
    • fieldName (optional, defaults to defaultField)
    • Value
      • Text is passed into QueryParser.parse
      • This appears to support the classic query syntax
    • NOTE: Wraps the query into a BoostQuery
  • ConstantScoreQuery
    • boost (optional, 1.0)
    • Value
      • Only gets the first child
      • Child may be any query in this list
  • SpanNear
    • boost (optional, 1.0)
    • slop (required, integer)
    • inOrder (optional, false)
    • Value
      • A collection of various types of SpanQuery
  • BoostingTermQuery
    • fieldName (required either here or in a parent)
    • boost (optional, 1.0)
    • Value: the term to match in fieldName
  • SpanTerm
    • fieldName (required either here or in a parent)
    • boost (optional, 1.0)
    • Value: the term to match in fieldName
  • SpanOr
    • boost (optional, 1.0)
    • Value: a collection of various types of SpanQuery
  • SpanOrTerms
    • fieldName (required either here or in a parent)
    • boost (optional, 1.0)
    • Value: one or more terms, typically separated by spaces
    • Wraps the terms in a SpanOr query
  • SpanFirst
    • Limits span matches to the first N positions, where N is the end parameter below
      • More specifically, it matches spans in the subquery whose end position is less than or equal to end.
    • boost (optional, 1.0)
    • end (optional, 1, integer)
    • Value:
      • Gets the first child, which must be a SpanQuery
      • All other children are ignored
  • SpanNot
    • boost (optional, 1.0)
    • Include - First child element called Include must contain a SpanQuery
    • Exclude - First child element called Exclude must contain a SpanQuery



BooleanQuery
TermQuery
{!xmlparser}
<BooleanQuery fieldName="headline">
  <Clause occurs="must">
    <TermQuery>york</TermQuery>
  </Clause> 
</BooleanQuery>

{!xmlparser}
<BooleanQuery>
  <Clause occurs="must">
    <TermQuery fieldName="headline">york</TermQuery>
  </Clause>
</BooleanQuery>
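
The minimumNumberShouldMatch attribute from the Top-Level list is not used in the two queries above. As a sketch, the query below has three optional clauses and asks for at least two of them to match. The terms are placeholders, and the headline field is the same one used in the other examples.

```xml
{!xmlparser}
<BooleanQuery fieldName="headline" minimumNumberShouldMatch="2">
  <!-- Three optional terms; at least two must be present -->
  <Clause occurs="should">
    <TermQuery>new</TermQuery>
  </Clause>
  <Clause occurs="should">
    <TermQuery>york</TermQuery>
  </Clause>
  <Clause occurs="should">
    <TermQuery>times</TermQuery>
  </Clause>
</BooleanQuery>
```

With the default of 0, a headline containing only one of the three terms would match as well.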

SpanNear
// Headline: new pre/3 york
{!xmlparser}
<BooleanQuery>
  <Clause occurs="must">
    <SpanNear fieldName="headline" slop="3" inOrder="true">
      <SpanTerm>new</SpanTerm>
      <SpanTerm>york</SpanTerm>
    </SpanNear>
  </Clause>
</BooleanQuery>

// Headline: new pre/3 (york or car)
{!xmlparser}
<BooleanQuery>
  <Clause occurs="must">
    <SpanNear fieldName="headline" slop="3" inOrder="true">
      <SpanTerm>new</SpanTerm>
      <SpanOr>
        <SpanTerm>york</SpanTerm>
        <SpanTerm>car</SpanTerm>
      </SpanOr>
    </SpanNear>
  </Clause>
</BooleanQuery>

// Headline: new pre/3 (york or (car w/3 arrives))
// Match: "headline":"New York. Hongkong. Wunsiedel"
{!xmlparser}
<BooleanQuery>
  <Clause occurs="must">
    <SpanNear fieldName="headline" slop="3" inOrder="true">
      <SpanTerm>new</SpanTerm>
      <SpanOr>
        <SpanTerm>york</SpanTerm>
        <SpanNear slop="3" inOrder="false">
          <SpanTerm>car</SpanTerm>
          <SpanTerm>arrives</SpanTerm>
        </SpanNear>
      </SpanOr>
    </SpanNear>
  </Clause>
</BooleanQuery>

// Headline: new pre/3 (daybook or (employee w/3 onboarding))
{!xmlparser}
<BooleanQuery>
  <Clause occurs="must">
    <SpanNear fieldName="headline" slop="3" inOrder="true">
      <SpanTerm>new</SpanTerm>
      <SpanOr>
        <SpanTerm>daybook</SpanTerm>
        <SpanNear slop="3" inOrder="false">
          <SpanTerm>employee</SpanTerm>
          <SpanTerm>onboarding</SpanTerm>
        </SpanNear>
      </SpanOr>
    </SpanNear>
  </Clause>
</BooleanQuery>

DisjunctionMaxQuery
{!xmlparser}
<DisjunctionMaxQuery
 tieBreaker="1"
 boost="2">
    <UserQuery fieldName="headline">uber</UserQuery>
    <TermsQuery fieldName="headline">new york times</TermsQuery>
</DisjunctionMaxQuery>

UserQuery
{!xmlparser}
<UserQuery fieldName="headline">
"new computer*"~15
</UserQuery>

ConstantScoreQuery
{!xmlparser}
<ConstantScoreQuery boost="1.0">
    <UserQuery fieldName="headline">tesla</UserQuery>
</ConstantScoreQuery>

SpanNear
{!xmlparser}
<SpanNear fieldName="headline" slop="3" inOrder="true">
  <SpanTerm>new</SpanTerm>
  <SpanTerm>computer</SpanTerm>
</SpanNear>

BoostingTermQuery
{!xmlparser}
<BoostingTermQuery
  fieldName="headline"
  boost="1.2">
tesla
</BoostingTermQuery>

SpanTerm
{!xmlparser}
<SpanTerm
  fieldName="headline"
  boost="1.2">
tesla
</SpanTerm>

SpanOr
{!xmlparser}
<SpanOr fieldName="headline"
  boost="1.2">
  <SpanTerm>pizza</SpanTerm>
  <SpanTerm>milk</SpanTerm>
</SpanOr>

SpanOrTerms
{!xmlparser}
<SpanOrTerms
  fieldName="headline"
  boost="1.2">
pizza milk
</SpanOrTerms>

SpanFirst
{!xmlparser}
<SpanFirst
  fieldName="headline"
  end="1"
  boost="1.2">
  <SpanTerm>tesla</SpanTerm>
</SpanFirst>

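SpanNot
// Note: this query still returns some headlines with york in them. SpanNot only removes
// Include spans that overlap an Exclude span, and the single-term spans for new and york
// normally sit at different positions.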
{!xmlparser}
<SpanNot fieldName="headline">
  <Include>
    <SpanTerm>new</SpanTerm>
  </Include>
  <Exclude>
    <SpanTerm>york</SpanTerm>
  </Exclude>
</SpanNot>
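
To drop matches where new is directly followed by york, the Exclude side has to produce a span that overlaps the new span, because SpanNot only discards Include spans that overlap an Exclude span. The variant below is a sketch under that assumption and has not been verified against an index. It excludes the phrase itself by using a SpanNear with slop 0 and inOrder true, which should match only new immediately followed by york.

```xml
{!xmlparser}
<SpanNot fieldName="headline">
  <Include>
    <SpanTerm>new</SpanTerm>
  </Include>
  <Exclude>
    <!-- Phrase span: "new" immediately followed by "york" -->
    <SpanNear slop="0" inOrder="true">
      <SpanTerm>new</SpanTerm>
      <SpanTerm>york</SpanTerm>
    </SpanNear>
  </Exclude>
</SpanNot>
```

A headline that contains york can still match this query, as long as it also contains a new that is not directly followed by york.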
