jieba (结巴) segmentation in PHP: loading a 6 MB dictionary (400,000 keywords) causes noticeable lag
I've been working on automatic keyword extraction lately and spent some time tinkering with jieba (结巴) segmentation. Out of the box the segmentation results weren't very satisfying, so I found a custom dictionary of 400,000 keywords. Loading it slowed everything down badly; even though the results were decent, the speed cost wasn't worth it, so I gave up on the custom dictionary. (I still miss Discuz's old word-segmentation API.) A sketch of where the dictionary loading would go follows the code below.
Key jieba usage code (PHP version):
<?php
// A 6 MB dictionary needs a generous memory limit; jieba-php holds its trie in memory.
ini_set('memory_limit', '1024M');

require_once "jiebafc/vendor/multi-array/MultiArray.php";
require_once "jiebafc/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "jiebafc/class/Jieba.php";
require_once "jiebafc/class/Finalseg.php";
require_once "jiebafc/class/JiebaAnalyse.php";

use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\JiebaAnalyse;

// Initialization loads the dictionaries; this is where the startup time goes.
JiebaAnalyse::init();
Jieba::init();
Finalseg::init();

// tokenize() returns each word along with its start/end offsets in the input.
$seg_list = Jieba::tokenize("欧莱雅晶莹水复颜积雪草修护微精华露女补水保湿收缩毛孔爽肤水");
print_r($seg_list);
echo "<hr>";
?>
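For completeness, here is roughly where the custom dictionary and the keyword extraction would slot in. This is a minimal sketch, not the exact contents of 3.php: it assumes keys_dict.txt sits next to the script and follows jieba's user-dictionary format (one entry per line: word, frequency, optional part-of-speech tag), and the top-5 count is an arbitrary choice. Jieba::loadUserDict() parses the file line by line and inserts every word into the in-memory trie, which is exactly the step that drags with 400,000 entries.

<?php
ini_set('memory_limit', '1024M');

require_once "jiebafc/vendor/multi-array/MultiArray.php";
require_once "jiebafc/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "jiebafc/class/Jieba.php";
require_once "jiebafc/class/Finalseg.php";
require_once "jiebafc/class/JiebaAnalyse.php";

use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\JiebaAnalyse;

JiebaAnalyse::init();
Jieba::init();
Finalseg::init();

// The slow step: all 400,000 lines are parsed and inserted into the
// in-memory trie before any segmentation can run. (The path is an
// assumption; the file must use jieba's "word freq [tag]" format.)
Jieba::loadUserDict("keys_dict.txt");

// Keyword extraction, the original goal: extractTags() ranks words by
// TF-IDF and returns the top N as an array of word => weight.
$tags = JiebaAnalyse::extractTags("欧莱雅晶莹水复颜积雪草修护微精华露女补水保湿收缩毛孔爽肤水", 5);
print_r($tags);
?>

Note that a plain per-request PHP script pays the full init() plus loadUserDict() cost on every single request, which is why the lag feels so bad; a dictionary this size really wants a long-running worker (or a cached, pre-built index) rather than a fresh load each time.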
I've packaged everything and uploaded it to a network drive: 3.php is the test script and keys_dict.txt is the custom dictionary (400,000 e-commerce keywords).
Download: Baidu Netdisk (extraction code: sc7p)