Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


indexing - Multilingual search using Lucene

I am building a multilingual search, and I will use Lucene to do it.

I already have the translated contents; each document will exist in 3 or 4 languages.

For indexing and searching, I see four possible strategies for each document's contents:

  1. Each language is indexed in a separate index/directory.
  2. Each language is indexed in a separate document, but in the same index.
  3. Each language is indexed in a separate Field, but in the same document.
  4. All languages are indexed in the same Field of one document.

I have not tested any of these approaches yet. Could anyone with experience tell me which one is the better way to do multilingual search?

Thanks!


1 Reply


Although the question was asked a couple of years ago, it's still a great question.

There are a couple of aspects to consider when evaluating the different solution approaches:

  1. are language specific analyzers used at indexing time?
  2. is the query language always known (e.g. user selectable)?
  3. does the query language always match one of the "content" languages?
  4. should only content matching the query language be returned?
  5. is relevancy important?

If (1.) and (5.) apply to your project, you should not consider any strategy that (re-)uses the same field for multiple languages in the same inverted index, because the term frequencies of the various languages get mixed up (regardless of whether you index your multilingual content as one document or as multiple documents). Note that adding "n" language-specific fields does not result in an "n"-times larger index, although it does come with some overhead.
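
To make the term frequency point concrete, here is a minimal sketch in plain Python (it is not Lucene code). The toy corpus, the whitespace tokenizer and the smoothed IDF formula are all assumptions picked for illustration; Lucene's real scoring differs in detail. It compares how often the token "gift", an ordinary word in both English and German, occurs when all languages share one field versus when each language gets its own field.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus: (language, text) pairs.
DOCS = [
    ("en", "a gift for the child"),
    ("en", "the child sleeps"),
    ("de", "das gift wirkt schnell"),
    ("de", "gift ist giftig"),
    ("de", "rattengift und anderes gift"),
]


def collect_stats(docs, field_for):
    """Return (doc_freq, doc_count), both keyed by field name."""
    doc_freq = defaultdict(lambda: defaultdict(int))
    doc_count = defaultdict(int)
    for lang, text in docs:
        field = field_for(lang)
        doc_count[field] += 1
        for term in set(text.split()):  # count each term once per document
            doc_freq[field][term] += 1
    return doc_freq, doc_count


def idf(doc_freq, doc_count, field, term):
    """Smoothed textbook IDF; an assumption, not Lucene's exact formula."""
    return 1.0 + math.log((doc_count[field] + 1) / (doc_freq[field][term] + 1))


# Strategies 2 and 4: every language goes into the same field.
shared_df, shared_n = collect_stats(DOCS, lambda lang: "content")
# Strategy 3: one field per language.
split_df, split_n = collect_stats(DOCS, lambda lang: "content_" + lang)

shared_idf = idf(shared_df, shared_n, "content", "gift")
english_idf = idf(split_df, split_n, "content_en", "gift")
german_idf = idf(split_df, split_n, "content_de", "gift")
```

In the shared field the German documents make "gift" look like a common term for English queries too, so it gets a lower weight than it would in an English-only field, and a higher weight than it deserves for German queries.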


Single Field (Strategies 2 & 4)


+ only one field to query
+ scales well for additional languages
+ can distinguish/filter languages (if multiple documents and an extra language field are used)
- cannot distinguish/filter languages (if single document)
- cannot just display the queried language (if single document)
- "wrong" term frequencies (as all languages mixed up)

Multiple Fields (Strategy 3)


+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- more fields to index
- more fields to query
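
The multiple fields layout can be sketched roughly as follows, again as a toy model in plain Python rather than Lucene code. The field naming scheme (`content_<lang>`), the tiny stopword lists standing in for language-specific analyzers, and the sample documents are hypothetical. The sketch shows one document per item with one field per language, and a search that is analyzed and executed per language field.

```python
from collections import defaultdict

# Hypothetical stand-ins for language-specific analyzers: stopword lists only.
STOPWORDS = {
    "en": {"the", "a", "of"},
    "de": {"der", "die", "das", "des", "ein"},
}

# One logical document per id, with one translation per language.
DOCS = {
    1: {"en": "the history of beer", "de": "die geschichte des bieres"},
    2: {"en": "die casting basics", "de": "grundlagen des druckgusses"},
}


def analyze(lang, text):
    """Lowercase, split on whitespace, drop the stopwords of that language."""
    return [t for t in text.lower().split() if t not in STOPWORDS[lang]]


def build_index(docs):
    """Inverted index keyed by (field, term); each language has its own field."""
    index = defaultdict(set)
    for doc_id, translations in docs.items():
        for lang, text in translations.items():
            for term in analyze(lang, text):
                index[("content_" + lang, term)].add(doc_id)
    return index


def search(index, query, langs):
    """AND the query terms within each language field, OR across languages."""
    hits = set()
    for lang in langs:
        terms = analyze(lang, query)
        if not terms:  # the whole query was stopwords in this language
            continue
        postings = [index.get(("content_" + lang, t), set()) for t in terms]
        hits |= set.intersection(*postings)
    return hits


INDEX = build_index(DOCS)
```

For example, `search(INDEX, "die", ["en"])` only looks at the English field, where "die" is a regular term, while `search(INDEX, "die", ["de"])` treats it as a German stopword. The price is the one listed above: every additional language is one more field to query.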

Multiple Indices (Strategy 1)


+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- each additional language requires its own index
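
As a rough illustration of the multiple indices layout, the following plain Python sketch (not Lucene code) keeps one tiny inverted index per language and routes each query to exactly one of them. The class name `LanguageIndexes` and its methods are hypothetical, and tokenization is plain whitespace splitting.

```python
from collections import defaultdict


class LanguageIndexes:
    """Hypothetical container holding one inverted index per language."""

    def __init__(self):
        self._indexes = {}

    def add(self, lang, doc_id, text):
        # The first document of a new language creates a new, separate index.
        index = self._indexes.setdefault(lang, defaultdict(set))
        for term in text.lower().split():
            index[term].add(doc_id)

    def languages(self):
        return sorted(self._indexes)

    def search(self, lang, term):
        # A query is routed to the index of its language only.
        index = self._indexes.get(lang)
        if index is None:
            return set()
        return set(index.get(term.lower(), set()))


indexes = LanguageIndexes()
indexes.add("en", 1, "beer history")
indexes.add("de", 1, "bier geschichte")
indexes.add("fr", 1, "histoire de la biere")
```

Term statistics stay separate per language by construction, but the query language has to be known (or guessed) in order to pick an index.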

Regardless of whether you use a single field or multiple fields, your solution might need to handle result collapsing for matches in the "wrong" language if you index your content as multiple documents. One approach could be to add a language field and filter on it.
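
A minimal sketch of that idea in plain Python (not Lucene code), assuming each translation is stored as its own record with a hypothetical `group` key that identifies the original document and a `lang` key for its language:

```python
# Hypothetical records: one per translation, linked by a shared "group" id.
RECORDS = [
    {"group": "doc-1", "lang": "en", "text": "hotel information berlin"},
    {"group": "doc-1", "lang": "de", "text": "hotel informationen berlin"},
    {"group": "doc-2", "lang": "de", "text": "hotel am see"},
]


def search(records, term, lang=None):
    """Return records containing the term, optionally filtered by language."""
    return [
        r for r in records
        if term in r["text"].split() and (lang is None or r["lang"] == lang)
    ]


def collapse(hits, preferred_lang):
    """Keep one hit per group, preferring the hit in the preferred language."""
    best = {}
    for hit in hits:
        current = best.get(hit["group"])
        if current is None or (
            hit["lang"] == preferred_lang and current["lang"] != preferred_lang
        ):
            best[hit["group"]] = hit
    return list(best.values())


all_hits = search(RECORDS, "hotel")               # every translation matches
english_only = search(RECORDS, "hotel", lang="en")  # strict language filter
collapsed = collapse(all_hits, preferred_lang="en")  # one hit per document
```

The strict filter drops documents that have no translation in the query language, whereas collapsing keeps them and falls back to whatever language is available. Which behaviour is right depends on aspect (4.) above.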

Recommendation: The approach/strategy you choose depends on your project's requirements. Whenever possible, I would opt for a multiple fields or multiple indices approach.

