The NLP Power Tool Gensim: Adding a model to dict Method to the word2vec Model to Speed Up Search
This article is part of a series; the earlier installments can be found at the links below:
The NLP Power Tool Gensim: Basic Features and Installation
The NLP Power Tool Gensim: A Worked Example of the Word2Vec Model
The NLP Power Tool Gensim: Training Your Own word2vec Word Vector Model
The NLP Power Tool Gensim: Parameter Settings for Training the word2vec Model
The NLP Power Tool Gensim: Memory Requirements of the word2vec Model, and How to Evaluate It
The NLP Power Tool Gensim: Resuming word2vec Training by Loading a Saved Model and Continuing
The NLP Power Tool Gensim: Computing word2vec Training Loss, and Choosing a Benchmark for Comparison
1. Adding the model to dict method
We want to improve the model's performance in applications. A good way to do this is to cache similar-word results in a dictionary. The next time we need to search for a word's most similar words, we first look it up in this dict: if it is there, we return the result directly; if not, we query the model and store the result in the dict, so the next lookup for that word will not miss.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# re-enable logging
logging.root.level = logging.INFO

# Build a dict mapping every word in the vocabulary to its most similar
# words (most_similar returns the top 10 by default)
most_similars_precalc = {word: model.wv.most_similar(word) for word in model.wv.index2word}

# Peek at the first 3 entries
for i, (key, value) in enumerate(most_similars_precalc.items()):
    if i == 3:
        break
    print(key, value)
Output:
the [('which', 0.9999258518218994), ('at', 0.9999212026596069), ('an', 0.9999092817306519), ('up', 0.9999089241027832), ('into', 0.9999079704284668), ('from', 0.9999078512191772), ('in', 0.9999074935913086), ('with', 0.9999056458473206), ('of', 0.9999053478240967), ('on', 0.9999029636383057)]
to [('from', 0.9999440908432007), ('will', 0.999943196773529), ('but', 0.9999423623085022), ('and', 0.9999418258666992), ('would', 0.999941349029541), ('their', 0.9999399781227112), ('for', 0.9999394416809082), ('is', 0.9999381303787231), ('before', 0.9999350905418396), ('by', 0.999934196472168)]
of [('which', 0.9999495148658752), ('after', 0.999946117401123), ('in', 0.9999450445175171), ('at', 0.999943733215332), ('three', 0.9999434351921082), ('for', 0.9999430775642395), ('by', 0.9999409914016724), ('with', 0.9999403357505798), ('an', 0.999937891960144), ('from', 0.9999375343322754)]
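One caveat: precomputing most_similar for every word in the vocabulary is itself slow and memory-hungry on a large corpus. Since gensim keeps the vocabulary sorted by descending frequency (with the default sorted_vocab=1), a pragmatic variant is to precompute only the most frequent words and fall back to on-demand queries for the rest. The following is a minimal sketch; it assumes the gensim 3.x attribute name index2word (renamed to index_to_key in gensim 4.x), and TOP_N is an arbitrary cutoff of our choosing:

# Precompute only the N most frequent words; index2word is ordered by
# descending frequency, so a slice covers the likeliest queries
TOP_N = 1000  # arbitrary cutoff, tune to your vocabulary size
hot_words = model.wv.index2word[:TOP_N]
most_similars_precalc = {word: model.wv.most_similar(word) for word in hot_words}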
2. Comparing search with and without the cached similarity dict
Randomly pick 4 words:
import time
words = ['voted', 'few', 'their', 'around']
First, the case without the cached similarity dict:
start = time.time()
for word in words:
    result = model.wv.most_similar(word)
    print(result)
end = time.time()
print(end - start)
Output:
[('powers', 0.9986631870269775), ('unions', 0.9986005425453186), ('start', 0.9985911846160889), ('into', 0.9985776543617249), ('north', 0.9985730648040771), ('then', 0.9985653162002563), ('kandahar', 0.9985642433166504), ('chief', 0.9985554814338684), ('also', 0.9985419511795044), ('child', 0.9985411167144775)]
[('any', 0.9997690916061401), ('are', 0.9997625350952148), ('them', 0.9997580051422119), ('some', 0.999752938747406), ('up', 0.999750018119812), ('five', 0.9997451901435852), ('one', 0.9997447729110718), ('be', 0.9997397661209106), ('his', 0.999735951423645), ('out', 0.9997352361679077)]
[('from', 0.9999539256095886), ('with', 0.9999514818191528), ('on', 0.9999467134475708), ('which', 0.9999465942382812), ('in', 0.9999460577964783), ('at', 0.999943733215332), ('and', 0.999943733215332), ('some', 0.9999434947967529), ('australian', 0.9999405741691589), ('about', 0.9999402761459351)]
[('which', 0.9999223947525024), ('into', 0.9999215006828308), ('from', 0.9999191761016846), ('on', 0.9999156594276428), ('australian', 0.9999149441719055), ('by', 0.999911904335022), ('with', 0.9999110698699951), ('areas', 0.999910831451416), ('in', 0.9999107122421265), ('their', 0.9999104142189026)]
0.0067827701568603516
Then the case with the cached dict:
start = time.time()
for word in words:
    if word in most_similars_precalc:  # cache hit: read straight from the dict
        result = most_similars_precalc[word]
        print(result)
    else:  # cache miss: query the model, then cache the result for next time
        result = model.wv.most_similar(word)
        most_similars_precalc[word] = result
        print(result)
end = time.time()
print(end - start)
Output:
[('powers', 0.9986631870269775), ('unions', 0.9986005425453186), ('start', 0.9985911846160889), ('into', 0.9985776543617249), ('north', 0.9985730648040771), ('then', 0.9985653162002563), ('kandahar', 0.9985642433166504), ('chief', 0.9985554814338684), ('also', 0.9985419511795044), ('child', 0.9985411167144775)]
[('any', 0.9997690916061401), ('are', 0.9997625350952148), ('them', 0.9997580051422119), ('some', 0.999752938747406), ('up', 0.999750018119812), ('five', 0.9997451901435852), ('one', 0.9997447729110718), ('be', 0.9997397661209106), ('his', 0.999735951423645), ('out', 0.9997352361679077)]
[('from', 0.9999539256095886), ('with', 0.9999514818191528), ('on', 0.9999467134475708), ('which', 0.9999465942382812), ('in', 0.9999460577964783), ('at', 0.999943733215332), ('and', 0.999943733215332), ('some', 0.9999434947967529), ('australian', 0.9999405741691589), ('about', 0.9999402761459351)]
[('which', 0.9999223947525024), ('into', 0.9999215006828308), ('from', 0.9999191761016846), ('on', 0.9999156594276428), ('australian', 0.9999149441719055), ('by', 0.999911904335022), ('with', 0.9999110698699951), ('areas', 0.999910831451416), ('in', 0.9999107122421265), ('their', 0.9999104142189026)]
0.0007929801940917969
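The hit-or-miss branch above generalizes naturally into a small reusable helper. Below is a minimal sketch; the name cached_most_similar is our own convenience wrapper, not part of the gensim API:

def cached_most_similar(model, cache, word, topn=10):
    """Look up `word` in `cache` first; on a miss, query the model
    with most_similar and store the result for future calls."""
    if word not in cache:
        cache[word] = model.wv.most_similar(word, topn=topn)
    return cache[word]

# Usage: repeated queries for the same word now hit the dict, not the model
result = cached_most_similar(model, most_similars_precalc, 'voted')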
Summary:
Comparing the two runs, the version without the cached dict took roughly 8.5 times as long as the cached version (0.00678 s vs. 0.00079 s).
This gap becomes even more pronounced as the number of words queried grows.
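As written, the cache only helps within a single session. To reuse it across runs, the dict can be serialized to disk; here is a minimal sketch using Python's standard pickle module (the file name most_similars_precalc.pkl is arbitrary):

import pickle

# Save the precomputed dict so the next run can load it instead of recomputing
with open('most_similars_precalc.pkl', 'wb') as f:
    pickle.dump(most_similars_precalc, f)

# In a later session, restore it:
with open('most_similars_precalc.pkl', 'rb') as f:
    most_similars_precalc = pickle.load(f)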