
Lucene tokenization / filters not working as expected | Solr analysis confusion

I am trying to figure out the correct analyzer configuration for my Solr/Lucidworks setup.

The results that I am seeing in Solr analysis seem to indicate that I should be getting matches, but when I do the Solr query (native or in the Lucidworks UI), no results are returned.

The relevant fragments from the schema are:

<field name="content" indexed="true" multiValued="false" required="false" stored="true" type="dlowe_text_en"/>


<dynamicField indexed="true" name="*_txt_en_dlowe_split_tight" stored="true" type="dlowe_text_en"/>
<fieldType autoGeneratePhraseQueries="true" class="solr.TextField" name="dlowe_text_en" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

I have indexed some content that contains the string:

Administrator's Guide

Now, when I use the Solr Analysis screen, this is the result that I get:

[screenshot: Solr Analysis output with the matching tokens highlighted]

My understanding is that if any of the results are highlighted, this represents a match, but when I search in Solr for "Administrator", no results are found:

[screenshot: Solr query for "Administrator" returning no results]

If I search on:

Administrator's

I do get the expected result.

Am I totally misunderstanding how the analysis tool should work?

What I am trying to achieve is a search index that supports a lot of technical items and will only match on exact values. For example:

  • V-123-1231-1231
  • WILL_NOT_CHANGE
  • /mnt/abc/Drivers/
  • 4040:5050

So the WhitespaceTokenizer seems to make the most sense, but I also need stemming on the non-technical strings; the technical values are the ones containing periods (.), dashes (-), underscores (_), slashes (\ or /), etc.
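Something along these lines is roughly what I have in mind, although I have not tested it (the KeywordMarker/KStem filters and the protwords.txt protected-words file are just placeholders, not part of my current schema):

<analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- keep exact technical tokens (listed in a protected-words file) out of the stemmer -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- stem only the remaining ordinary English tokens -->
    <filter class="solr.KStemFilterFactory"/>
</analyzer>

I realize a protected-words list may not scale to arbitrary part numbers, so this is only a starting point.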

Any insight / suggestions would be greatly appreciated.

question from: https://stackoverflow.com/questions/66054578/lucene-tokenization-filters-not-working-as-expected-solr-analysis-confusion


1 Reply


This is based upon further investigation and bumping up to the latest version of Solr (8.7) versus the very old corporate version that we are using (6.4.2).

Plus, with the reinforcement from Abhijit above, I found out that the "full record" search in Solr doesn't work the way that I expected.

Instead, I needed to:

  • copy all the fields that I want indexed into a single multiValued field (e.g. content_all)
  • then add the query parameter df=content_all to the request (see the sketch below)
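Roughly, the schema side looked like this (a simplified sketch from memory; the field name content_all and the wildcard copyField are illustrative, not copied verbatim from our schema):

<field name="content_all" type="dlowe_text_en" indexed="true" stored="false" multiValued="true"/>
<copyField source="*" dest="content_all"/>

and the query then becomes something like:

/select?q=Administrator&df=content_all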

Once I did that, I started getting the results that I expected.

Probably obvious to those who use Solr/Lucene on a regular basis, but it wasn't clear to me. Switching to 8.7, which doesn't have a 'default field', led me down the path to this solution.
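If you want to avoid passing df on every request, I believe it can also be set once as a default on the request handler in solrconfig.xml; this is an untested sketch and the handler and field names are just examples:

<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="df">content_all</str>
    </lst>
</requestHandler>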

Hopefully this will be of help to others in the future.

