
Lucene tokenization / filters not working as expected | Solr analysis confusion

I am trying to figure out the correct analyzer configuration for my Solr/Lucidworks setup.

The results that I am seeing in Solr analysis seem to indicate that I should be getting matches, but when I do the Solr query (native or in the Lucidworks UI), no results are returned.

The relevant fragments from the schema are:

<field name="content" indexed="true" multiValued="false" required="false" stored="true" type="dlowe_text_en"/>


<dynamicField indexed="true" name="*_txt_en_dlowe_split_tight" stored="true" type="dlowe_text_en"/>
<fieldType autoGeneratePhraseQueries="true" class="solr.TextField" name="dlowe_text_en" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

I have indexed some content that contains the string:

Administrator's Guide

Now, when I use the Solr Analysis screen, this is the result that I get:

[screenshot: Solr Analysis output with the matching tokens highlighted]

My understanding is that if any of the results are highlighted, this represents a match, but when I search in Solr for "Administrator", no results are found:

[screenshot: Solr query for "Administrator" returning no results]

If I search on:

Administrator's

I do get the expected result.

Am I totally misunderstanding how the analysis tool should work?

What I am trying to achieve is a search index that supports a lot of technical items and will only match on exact values. For example:

  • V-123-1231-1231
  • WILL_NOT_CHANGE
  • /mnt/abc/Drivers/
  • 4040:5050

So the WhitespaceTokenizer seems to make the most sense, but I also need stemming on the non-technical strings; the technical values are the ones containing periods (.), dashes (-), underscores (_), slashes (\ or /), etc.
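Something along these lines is roughly what I have in mind, although I have not tested it (the KeywordMarker/KStem filters and the protwords.txt protected-words file are just placeholders, not part of my current schema):

<analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- keep exact technical tokens (listed in a protected-words file) out of the stemmer -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- stem only the remaining ordinary English tokens -->
    <filter class="solr.KStemFilterFactory"/>
</analyzer>

I realize a protected-words list may not scale to arbitrary part numbers, so this is only a starting point.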

Any insight / suggestions would be greatly appreciated.

question from: https://stackoverflow.com/questions/66054578/lucene-tokenization-filters-not-working-as-expected-solr-analysis-confusion


1 Reply


This is based upon further investigation and bumping up to the latest version of Solr (8.7) versus the very old corporate version that we are using (6.4.2).

Plus, with the reinforcement from Abhijit above, I found out that the "full record" search in Solr doesn't work the way that I expected.

Instead, I needed to:

  • copy all the fields that I want indexed into a single multiValued field (e.g. content_all)
  • then add the query parameter df=content_all to the request (see the sketch below)
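Roughly, the schema side looked like this (a simplified sketch from memory; the field name content_all and the wildcard copyField are illustrative, not copied verbatim from our schema):

<field name="content_all" type="dlowe_text_en" indexed="true" stored="false" multiValued="true"/>
<copyField source="*" dest="content_all"/>

and the query then becomes something like:

/select?q=Administrator&df=content_all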

Once I did that, I started getting the results that I expected.

Probably obvious to those who use Solr/Lucene on a regular basis, but it wasn't clear to me. Switching to 8.7, which doesn't have a 'default field', led me down the path to this solution.
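If you want to avoid passing df on every request, I believe it can also be set once as a default on the request handler in solrconfig.xml; this is an untested sketch and the handler and field names are just examples:

<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="df">content_all</str>
    </lst>
</requestHandler>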

Hopefully this will be of help to others in the future.

