[groonga-dev,02829] Re: Search scores in IN NATURAL LANGUAGE MODE

Kazuhiko kazuh****@fdiar*****
Mon Sep 29 21:23:59 JST 2014


Hello.

On 27/09/2014 11:10, Kouhei Sutou wrote:
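> How about tackling these one at a time?
> 
> Probably,
> 
>   1. Document the current behavior as it is
>      (even if parts of it are unsatisfying)
>   2. Improve the parts that are unsatisfying
> 
> is the right order, I think.
> 
> I think it is best to agree on the overall flow first and then sort
> out the details step by step.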

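Yes, I agree.

The score specification for natural language mode (or rather, similar
document search) is already written in the documentation, so what
remains is to write up the score specification for boolean mode.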

On 27/09/2014 12:31, Naoya Murakami wrote:
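> # This is off-topic, but I would also like full-text search scoring
> to take IDF (and document length) into account.
> (Even outside similar search, when several words are entered and
> their importance (IDF) is not considered, results skew too far toward
> frequently occurring words.)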

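For reference, the MyISAM boolean mode page,
http://dev.mysql.com/doc/refman/5.5/en/fulltext-boolean.html
shows no score examples at all, which suggests the score carries no
meaning or expectation beyond "matched or not". In the MySQL 5.6 InnoDB
example, however,
http://dev.mysql.com/doc/refman/5.6/en/fulltext-boolean.html
scores appear to be computed with TF-IDF even in boolean mode.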
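
As a rough illustration of what TF-IDF scoring would mean for the sample
rows used later in this message, here is a minimal Python sketch. It
assumes the formula as I read it in the 5.6 manual page above
(IDF = log10(total rows / matching rows), rank = TF * IDF * IDF). The
names DIARIES, tokenize and tf_idf_rank are hypothetical, and the
tokenizer is a simple stand-in: real InnoDB also applies stopwords and a
minimum token size, so actual scores may differ.

```python
import math
import re

# Sample rows copied from the diaries table used in this message.
DIARIES = {
    1: "It'll be fine tomorrow as well.",
    2: "It'll rain tomorrow.",
    3: "It's fine today. It'll be fine tomorrow as well.",
    4: "It's fine today. But it'll rain tomorrow.",
    5: "Ring the bell.",
    6: "I love dumbbells.",
}


def tokenize(text):
    # Assumption: lowercase alphabetic runs stand in for the real tokenizer.
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_rank(term, docs):
    """Return {id: rank} with rank = TF * IDF * IDF, IDF = log10(N / n)."""
    counts = {doc_id: tokenize(text).count(term) for doc_id, text in docs.items()}
    matching = sum(1 for tf in counts.values() if tf > 0)
    if matching == 0:
        # No row contains the term, so every rank is zero.
        return {doc_id: 0.0 for doc_id in docs}
    idf = math.log10(len(docs) / matching)
    return {doc_id: tf * idf * idf for doc_id, tf in counts.items()}


if __name__ == "__main__":
    for doc_id, rank in tf_idf_rank("fine", DIARIES).items():
        print(doc_id, round(rank, 4))
```

Under this sketch a row that contains "fine" twice ranks twice as high
as a row that contains it once, and a term found in every row would get
IDF 0 and therefore rank 0 everywhere.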

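> As I understand it, Groonga's full-text search scores a document 0
> when the word or phrase (matched with its surrounding order, not just
> as tokens) does not hit the condition, and when it does hit, the score
> is the number of times the word or phrase occurs in the document.

On top of that, I would expect the '-word' part to be reflected in the
score, but I don't understand the reason for the score differences below.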

DROP TABLE IF EXISTS `diaries`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `diaries` (
  `id` int(10) unsigned NOT NULL,
  `content` text COLLATE utf8_unicode_ci,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `content` (`content`) COMMENT 'parser "TokenBigram"'
) ENGINE=Mroonga DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
/*!40101 SET character_set_client = @saved_cs_client */;
INSERT INTO `diaries` VALUES (1,'It\'ll be fine tomorrow as well.');
INSERT INTO `diaries` VALUES (2,'It\'ll rain tomorrow.');
INSERT INTO `diaries` VALUES (3,'It\'s fine today. It\'ll be fine tomorrow as well.');
INSERT INTO `diaries` VALUES (4,'It\'s fine today. But it\'ll rain tomorrow.');
INSERT INTO `diaries` VALUES (5,'Ring the bell.');
INSERT INTO `diaries` VALUES (6,'I love dumbbells.');

SELECT *, MATCH (content) AGAINST('it AND -today' IN BOOLEAN MODE) AS
score FROM diaries;
+----+--------------------------------------------------+-------+
| id | content                                          | score |
+----+--------------------------------------------------+-------+
|  1 | It'll be fine tomorrow as well.                  |     1 |
|  2 | It'll rain tomorrow.                             |     1 |
|  3 | It's fine today. It'll be fine tomorrow as well. |     0 |
|  4 | It's fine today. But it'll rain tomorrow.        |     0 |
|  5 | Ring the bell.                                   |     0 |
|  6 | I love dumbbells.                                |     0 |
+----+--------------------------------------------------+-------+

SELECT *, MATCH (content) AGAINST('*D+ it -today' IN BOOLEAN MODE) AS
score FROM diaries;
+----+--------------------------------------------------+-------+
| id | content                                          | score |
+----+--------------------------------------------------+-------+
|  1 | It'll be fine tomorrow as well.                  |     1 |
|  2 | It'll rain tomorrow.                             |     1 |
|  3 | It's fine today. It'll be fine tomorrow as well. |     0 |
|  4 | It's fine today. But it'll rain tomorrow.        |     0 |
|  5 | Ring the bell.                                   |     0 |
|  6 | I love dumbbells.                                |     0 |
+----+--------------------------------------------------+-------+

SELECT *, MATCH (content) AGAINST('*D+ -today it' IN BOOLEAN MODE) AS
score FROM diaries;
+----+--------------------------------------------------+-------+
| id | content                                          | score |
+----+--------------------------------------------------+-------+
|  1 | It'll be fine tomorrow as well.                  |     2 |
|  2 | It'll rain tomorrow.                             |     2 |
|  3 | It's fine today. It'll be fine tomorrow as well. |     0 |
|  4 | It's fine today. But it'll rain tomorrow.        |     0 |
|  5 | Ring the bell.                                   |     0 |
|  6 | I love dumbbells.                                |     0 |
+----+--------------------------------------------------+-------+

SELECT *, MATCH (content) AGAINST('-today it' IN BOOLEAN MODE) AS score
FROM diaries;
+----+--------------------------------------------------+-------+
| id | content                                          | score |
+----+--------------------------------------------------+-------+
|  1 | It'll be fine tomorrow as well.                  |     2 |
|  2 | It'll rain tomorrow.                             |     2 |
|  3 | It's fine today. It'll be fine tomorrow as well. |     2 |
|  4 | It's fine today. But it'll rain tomorrow.        |     2 |
|  5 | Ring the bell.                                   |     1 |
|  6 | I love dumbbells.                                |     1 |
+----+--------------------------------------------------+-------+

SELECT *, MATCH (content) AGAINST('it -today' IN BOOLEAN MODE) AS score
FROM diaries;
+----+--------------------------------------------------+-------+
| id | content                                          | score |
+----+--------------------------------------------------+-------+
|  1 | It'll be fine tomorrow as well.                  |     1 |
|  2 | It'll rain tomorrow.                             |     1 |
|  3 | It's fine today. It'll be fine tomorrow as well. |     0 |
|  4 | It's fine today. But it'll rain tomorrow.        |     0 |
|  5 | Ring the bell.                                   |     0 |
|  6 | I love dumbbells.                                |     0 |
+----+--------------------------------------------------+-------+
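
For comparison, the counting model quoted earlier (0 on a miss, otherwise
the number of occurrences) can be sketched in Python as below. This is
only an illustration under assumptions: words are split on non-letters
instead of using TokenBigram, '-word' is treated purely as an exclusion,
and the names DIARIES, tokenize and count_score are hypothetical. It is
not Mroonga's actual implementation.

```python
import re

# Sample rows copied from the diaries table above.
DIARIES = {
    1: "It'll be fine tomorrow as well.",
    2: "It'll rain tomorrow.",
    3: "It's fine today. It'll be fine tomorrow as well.",
    4: "It's fine today. But it'll rain tomorrow.",
    5: "Ring the bell.",
    6: "I love dumbbells.",
}


def tokenize(text):
    # Assumption: lowercase alphabetic runs stand in for TokenBigram.
    return re.findall(r"[a-z]+", text.lower())


def count_score(required, excluded, text):
    """Score under the quoted model: 0 on a miss, else the occurrence count."""
    tokens = tokenize(text)
    # Assumption: an excluded word only removes the row (score 0).
    if excluded is not None and excluded in tokens:
        return 0
    return tokens.count(required)


if __name__ == "__main__":
    for doc_id, text in DIARIES.items():
        print(doc_id, count_score("it", "today", text))
```

With required word "it" and excluded word "today", this sketch gives
1, 1, 0, 0, 0, 0 for ids 1 to 6, the same as the 'it -today' result
above. It gives no reason for the score of 2 in the '*D+ -today it' and
'-today it' results, which is the difference I am asking about.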

Kazuhiko



