Lucene: giving more weight to terms closer to the beginning of the title


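Question

I understand how to boost a field at index time or at query time. But how can I increase the score of a document when the matching term occurs closer to the beginning of the title?

Example: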

Query = "lucene"

Doc1 title = "Lucene: Homepage"
Doc2 title = "I have a question about lucene?"

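I would like the first document to score higher, because "lucene" is closer to the beginning (ignoring term frequency for now).

I have seen how to use SpanQuery to specify the proximity between terms, but I am not sure how to use information about a term's position within the field.

I am using Lucene 4.1 in Java.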


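Answer:

I would use a SpanFirstQuery, which matches terms near the beginning of a field. Like all span queries, it relies on positions, which Lucene indexes by default.

Let's test it in isolation first: you only need to pass it your SpanTermQuery and the end position that bounds where the term may be found (1 in my example, which means the term has to be the first token of the field).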

SpanTermQuery spanTermQuery = new SpanTermQuery(new Term("title", "lucene"));
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(spanTermQuery, 1);

Given your two documents, this query will find only the first one, titled "Lucene: Homepage", provided you analyzed it with the StandardAnalyzer.
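To see why only the first title matches, it can help to look at token positions. The following is a plain-Java sketch, not Lucene code: the `tokenize` helper is a hypothetical stand-in that only approximates an analyzer (it lowercases and splits on non-alphanumeric characters, and ignores stop words), and `matchesSpanFirst` mimics the rule that a single-term span at position p ends at p + 1 and must end at or before `end`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SpanFirstSketch {

    // Hypothetical stand-in for an analyzer: lowercase, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Zero-based position of the first occurrence of the term, or -1 if absent.
    static int firstPosition(String text, String term) {
        return tokenize(text).indexOf(term);
    }

    // A one-token span at position p covers [p, p + 1); it matches if it ends at or before 'end'.
    static boolean matchesSpanFirst(String text, String term, int end) {
        int position = firstPosition(text, term);
        return position >= 0 && position + 1 <= end;
    }

    public static void main(String[] args) {
        String[] titles = {"Lucene: Homepage", "I have a question about lucene?"};
        for (String title : titles) {
            System.out.println(title
                    + " -> position " + firstPosition(title, "lucene")
                    + ", matches with end=1: " + matchesSpanFirst(title, "lucene", 1));
        }
    }
}
```

In this sketch "lucene" sits at position 0 in the first title and at position 5 in the second, so an end value of 1 accepts only the first. A larger end value, such as 6, would accept both.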

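Now we can combine the SpanFirstQuery above with a normal text query, so that the span query only influences the score. You can easily do that with a BooleanQuery, adding the span query as a SHOULD clause, like this: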

Term term = new Term("title", "lucene");
TermQuery termQuery = new TermQuery(term);
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));
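The division of labour between the two clauses can be illustrated without Lucene. The plain-Java sketch below is only a model of the idea: the required clause decides which documents match, and the optional clause adds to the score of those that also satisfy it. The `Hit` class and the two weights are hypothetical; Lucene's real scores come from its similarity formula and will not equal these numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class MustShouldSketch {

    // Hypothetical result holder, loosely playing the role of a scored hit.
    static final class Hit {
        final String title;
        final double score;

        Hit(String title, double score) {
            this.title = title;
            this.score = score;
        }
    }

    // Made-up weights for illustration only; not Lucene's scoring.
    static final double TERM_SCORE = 1.0;
    static final double FIRST_POSITION_BONUS = 0.5;

    // Zero-based token position of the term, or -1 if the title does not contain it.
    static int firstPosition(String title, String term) {
        int position = 0;
        for (String token : title.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // skip the empty token produced by a leading delimiter
            }
            if (token.equals(term)) {
                return position;
            }
            position++;
        }
        return -1;
    }

    static List<Hit> search(List<String> titles, String term) {
        List<Hit> hits = new ArrayList<>();
        for (String title : titles) {
            int position = firstPosition(title, term);
            if (position < 0) {
                continue; // required (MUST-like) condition failed: not a hit at all
            }
            double score = TERM_SCORE;
            if (position == 0) {
                score += FIRST_POSITION_BONUS; // optional (SHOULD-like) condition: bonus only
            }
            hits.add(new Hit(title, score));
        }
        hits.sort((a, b) -> Double.compare(b.score, a.score)); // best score first
        return hits;
    }

    public static void main(String[] args) {
        List<String> titles = new ArrayList<>();
        titles.add("I have a question about lucene");
        titles.add("Lucene: I have a really hard question about it");
        for (Hit hit : search(titles, "lucene")) {
            System.out.println("title: " + hit.title + " - score: " + hit.score);
        }
    }
}
```

Both titles remain hits because each contains the term, but the one that starts with it is ranked first. Making the span clause required instead would drop the other title from the results entirely.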

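There are probably other ways to achieve the same result, perhaps with a CustomScoreQuery or with custom scoring code, but this seems to me the simplest one.

The code I used for testing prints the following output (scores included). It runs the TermQuery on its own first, then the SpanFirstQuery on its own, and finally the combined BooleanQuery. Note that the test titles differ slightly from the ones in the question: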

------ TermQuery --------
Total hits: 2
title: I have a question about lucene - score: 0.26010898
title: Lucene: I have a really hard question about it - score: 0.22295055
------ SpanFirstQuery --------
Total hits: 1
title: Lucene: I have a really hard question about it - score: 0.15764984
------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------
Total hits: 2
title: Lucene: I have a really hard question about it - score: 0.26912516
title: I have a question about lucene - score: 0.09196242

Here is the complete code:

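import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// The class name is arbitrary; the original snippet showed only the methods.
public class SpanFirstQueryDemo {

    public static void main(String[] args) throws Exception {

        Directory directory = FSDirectory.open(new File("data"));

        index(directory);

        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        Term term = new Term("title", "lucene");

        // 1. Plain term query: matches both documents.
        System.out.println("------ TermQuery --------");
        TermQuery termQuery = new TermQuery(term);
        search(indexSearcher, termQuery);

        // 2. Span query alone: matches only titles whose first token is the term.
        System.out.println("------ SpanFirstQuery --------");
        SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
        search(indexSearcher, spanFirstQuery);

        // 3. Combination: the term is required, the span query only adds to the score.
        System.out.println("------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------");
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
        booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));
        search(indexSearcher, booleanQuery);

        indexReader.close();
    }

    private static void index(Directory directory) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41));
        // Recreate the index on every run so that repeated runs do not add duplicate documents.
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

        IndexWriter writer = new IndexWriter(directory, config);

        // Positions must be indexed, otherwise span queries cannot work.
        FieldType titleFieldType = new FieldType();
        titleFieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        titleFieldType.setIndexed(true);
        titleFieldType.setStored(true);

        Document document = new Document();
        document.add(new Field("title", "I have a question about lucene", titleFieldType));
        writer.addDocument(document);

        document = new Document();
        document.add(new Field("title", "Lucene: I have a really hard question about it", titleFieldType));
        writer.addDocument(document);

        writer.close();
    }

    private static void search(IndexSearcher indexSearcher, Query query) throws Exception {
        TopDocs topDocs = indexSearcher.search(query, 10);

        System.out.println("Total hits: " + topDocs.totalHits);

        // Print every stored field of each hit together with the hit's score.
        for (ScoreDoc hit : topDocs.scoreDocs) {
            Document result = indexSearcher.doc(hit.doc);
            for (IndexableField field : result) {
                System.out.println(field.name() + ": " + field.stringValue() + " - score: " + hit.score);
            }
        }
    }
}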