Lucene search result ranking with Payload

程序员文章站 2022-07-09 10:22:41
Boosting the score of specific terms

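With Lucene's Payload feature, specific terms in a document (bold or italic words, for example) can be given a higher score, which improves the ranking of search results.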

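The documents D0 and D1 below are again used to show how to set and retrieve a Payload. "GPRS" is a technical term, yet a search for "GPRS描述" ("GPRS description") gives D1 a higher score than D0. That is not the result we want: D0 should probably score higher. To get there, assign a custom weight to each term in incrementToken (a higher weight for technical terms, for example), then override Similarity so that this weight feeds into the score.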

D0 = "GPRS的问题"   (in English: "problems with GPRS")
D1 = "问题描述"   (in English: "problem description")
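
The article does not show how a term's weight is looked up; the Step 1 code only calls a helper named gmccKeyWordDeal whose source is not listed. As a rough sketch, assuming the weights come from a simple in-memory table, such a lookup might look like the following. The class name KeywordWeights, the put method, and the weight values are hypothetical; only the doDeal name and its "greater than 0 means keyword" convention are taken from the Step 1 code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the gmccKeyWordDeal helper used in Step 1. */
final class KeywordWeights {

    // term text -> boost weight; terms that are absent are not keywords
    private final Map<String, Float> weights = new HashMap<String, Float>();

    /** Registers a keyword or technical term with its weight (must be > 0). */
    void put(String term, float weight) {
        if (weight <= 0) {
            throw new IllegalArgumentException("weight must be positive");
        }
        weights.put(term, Float.valueOf(weight));
    }

    /** Returns the configured weight, or 0 when the term is not a keyword. */
    float doDeal(String term) {
        Float w = weights.get(term);
        return w == null ? 0f : w.floatValue();
    }
}
```

With a table like this, registering "GPRS" with a weight such as 3.0 would make every "GPRS" token carry that weight in its payload, while ordinary words return 0 and get no weight.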
Step 1: During analysis, attach a score Payload to special terms
ICTCLASTokenizer.java 
/**
 * @see org.apache.lucene.analysis.TokenStream#incrementToken()
 */
@Override
public boolean incrementToken() throws IOException {
    clearAttributes();

    Word lexeme = segmentation.next();
    if (lexeme == null)
        return false;

    termAttr.setTermBuffer(lexeme.getText());
    offsetAttr.setOffset(lexeme.getStartPosition(), lexeme.getEndPosition());

    /*
     * If the token has a part-of-speech tag, store it in the payload
     */
    String payloadText = "";
    if (needPOSTagged && !StringUtils.isEmpty(lexeme.getPartOfSpeech()))
        payloadText = lexeme.getPartOfSpeech();

    /*
     * If the token is a configured keyword or technical term,
     * append its weight to the payload as "_<weight>"
     */
    float keyweight = gmccKeyWordDeal.doDeal(lexeme.getText());
    if (keyweight > 0)
        payloadText = payloadText + "_" + keyweight;

    // Encode explicitly as UTF-8 so it matches the decoding in BwSimilarity
    if (!payloadText.equals(""))
        payloadAttr.setPayload(new Payload(payloadText.getBytes("UTF-8")));

    finalOffset = lexeme.getEndPosition();

    return true;
}
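
The payload written above is plain text of the form "<part of speech>_<weight>", where either part may be missing. The following self-contained sketch, which does not depend on Lucene, shows that format being built, encoded, and decoded again. The class and method names (PayloadText, build, encode, decodeWeight) are illustrative and not part of the original code; it also assumes the part-of-speech tag itself never contains an underscore.

```java
import java.nio.charset.Charset;

/** Illustrative helper for the "<pos>_<weight>" payload text format. */
final class PayloadText {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    /** Builds "pos_weight", "pos", "_weight" or "" following the Step 1 rules. */
    static String build(String partOfSpeech, float keyWeight) {
        String text = (partOfSpeech == null) ? "" : partOfSpeech;
        if (keyWeight > 0) {
            text = text + "_" + keyWeight;
        }
        return text;
    }

    /** Encodes the payload text with a fixed charset. */
    static byte[] encode(String text) {
        return text.getBytes(UTF8);
    }

    /**
     * Reads the weight after the first '_' in buf[offset, offset + length).
     * Returns 1 when the payload carries no weight.
     */
    static float decodeWeight(byte[] buf, int offset, int length) {
        String text = new String(buf, offset, length, UTF8);
        int idx = text.indexOf('_');
        return idx == -1 ? 1f : Float.parseFloat(text.substring(idx + 1));
    }
}
```

For example, build("n", 3.0f) yields "n_3.0", and decoding its encoded bytes gives back 3.0; a token with only a part-of-speech tag decodes to the neutral weight 1.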

Step 2: Override Similarity (which is responsible for ranking and scoring)

BwSimilarity.java

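import java.io.UnsupportedEncodingException;

import org.apache.lucene.search.DefaultSimilarity;

public class BwSimilarity extends DefaultSimilarity {

    private static final long serialVersionUID = -8049061435299914513L;

    public BwSimilarity() {
        super();
    }

    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {

        String payloadStr = "";
        try {
            // Decode only the [offset, offset + length) slice: the array
            // may be larger than the payload of the current position
            payloadStr = new String(payload, offset, length, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
            return 1;
        }

        // Read the keyweight stored after "_"; default to 1 when absent
        String kwStr = "1";
        int kwIndex = payloadStr.indexOf("_");
        if (kwIndex != -1)
            kwStr = payloadStr.substring(kwIndex + 1);

        try {
            return Float.parseFloat(kwStr);
        } catch (NumberFormatException e) {
            // Malformed weight: fall back to the neutral score
            return 1;
        }
    }

    @Override
    public float coord(int overlap, int maxOverlap) {
        // 2^overlap / 2^maxOverlap: each unmatched query term halves the factor
        float overlap2 = (float) Math.pow(2, overlap);
        float maxOverlap2 = (float) Math.pow(2, maxOverlap);
        return (overlap2 / maxOverlap2);
    }

}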
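
To see what the overridden coord changes, it can be compared with the formula in Lucene's DefaultSimilarity, which returns overlap / maxOverlap. The sketch below reimplements both formulas as plain static methods so that it runs without Lucene; the class and method names (CoordDemo, defaultCoord, bwCoord) are illustrative only.

```java
/** Compares the default coord factor with the 2^overlap / 2^maxOverlap variant. */
final class CoordDemo {

    /** Formula used by DefaultSimilarity: fraction of query terms matched. */
    static float defaultCoord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Formula used by BwSimilarity: halves for every unmatched query term. */
    static float bwCoord(int overlap, int maxOverlap) {
        float overlap2 = (float) Math.pow(2, overlap);
        float maxOverlap2 = (float) Math.pow(2, maxOverlap);
        return overlap2 / maxOverlap2;
    }

    public static void main(String[] args) {
        int maxOverlap = 3; // a query with three terms
        for (int overlap = 1; overlap <= maxOverlap; overlap++) {
            System.out.println(overlap + " of " + maxOverlap
                    + " matched: default=" + defaultCoord(overlap, maxOverlap)
                    + ", bw=" + bwCoord(overlap, maxOverlap));
        }
    }
}
```

With three query terms, a document matching one term gets 0.25 instead of roughly 0.33, and a document matching two gets 0.5 instead of roughly 0.67, so partial matches are penalized more strongly relative to full matches. Note that coord only comes into play when several clauses are combined, for example in a BooleanQuery; a single PayloadTermQuery on its own is not affected by it.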


Step 3: Search with the overridden Similarity (boostingSimilarity)

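// boostingSimilarity is an instance of the Similarity subclass from Step 2
Similarity boostingSimilarity = new BwSimilarity();

PayloadTermQuery ptq = new PayloadTermQuery(new Term(field, term), new AveragePayloadFunction());

Searcher searcher = new IndexSearcher(…);
searcher.setSimilarity(boostingSimilarity);

ScoreDoc[] hits = searcher.search(ptq, hitsPerPage).scoreDocs;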
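
How the payload weight ends up in the final score depends on the payload function passed to PayloadTermQuery. As far as the Lucene 3.x behavior is understood here, AveragePayloadFunction averages the scorePayload values over all occurrences of the term in a document (using 1 when there are none), and the query then multiplies the ordinary term score by that average. The following is a simplified, Lucene-free model of that arithmetic, not the library's actual code; the names AveragePayloadSketch, averagePayload and finalScore are made up for illustration.

```java
/** Simplified model of how an averaged payload factor scales a term score. */
final class AveragePayloadSketch {

    /** Mean of the per-occurrence payload scores; 1 when there are none. */
    static float averagePayload(float[] payloadScores) {
        if (payloadScores.length == 0) {
            return 1f;
        }
        float sum = 0f;
        for (float score : payloadScores) {
            sum += score;
        }
        return sum / payloadScores.length;
    }

    /** Ordinary term score multiplied by the averaged payload factor. */
    static float finalScore(float termScore, float[] payloadScores) {
        return termScore * averagePayload(payloadScores);
    }
}
```

Under this model, if "GPRS" carries a weight of 3 in D0, a base term score of 0.5 becomes 1.5, while a term whose payload decodes to the default weight 1 keeps its base score unchanged.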




Related links:

Research and application of Lucene Payload: http://www.ibm.com/developerworks/cn/opensource/os-cn-lucene-pl/index.html
