The NLP Power Tool Gensim: Adding a model to dict Method to the word2vec Model to Speed Up Search
This article is part of a series; the earlier installments can be found at the links below:
The NLP Power Tool Gensim: Basic Features and Installation
The NLP Power Tool Gensim: A Worked Example of the Word2Vec Model
The NLP Power Tool Gensim: Training Your Own word2vec Word Vector Model
The NLP Power Tool Gensim: Parameter Settings for Training the word2vec Model
The NLP Power Tool Gensim: Memory Requirements of the word2vec Model, and How to Evaluate It
The NLP Power Tool Gensim: Resuming word2vec Training by Loading a Saved Model and Continuing
The NLP Power Tool Gensim: Computing word2vec Training Loss, and Choosing a Benchmark for Comparison
1. Adding the model to dict method
We want to improve the model's performance in applications. A good way to do this is to cache similar-word results in a dictionary. The next time we need to search for a word's most similar words, we first look it up in this dict: if it is there, we return the result directly; if not, we query the model and store the result in the dict, so the next lookup for that word will not miss.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# re-enable logging
logging.root.level = logging.INFO

# Build a dict mapping every word in the vocabulary to its most similar
# words (most_similar returns the top 10 by default)
most_similars_precalc = {word: model.wv.most_similar(word) for word in model.wv.index2word}

# Peek at the first 3 entries
for i, (key, value) in enumerate(most_similars_precalc.items()):
    if i == 3:
        break
    print(key, value)
Output:
the [('which', 0.9999258518218994), ('at', 0.9999212026596069), ('an', 0.9999092817306519), ('up', 0.9999089241027832), ('into', 0.9999079704284668), ('from', 0.9999078512191772), ('in', 0.9999074935913086), ('with', 0.9999056458473206), ('of', 0.9999053478240967), ('on', 0.9999029636383057)]
to [('from', 0.9999440908432007), ('will', 0.999943196773529), ('but', 0.9999423623085022), ('and', 0.9999418258666992), ('would', 0.999941349029541), ('their', 0.9999399781227112), ('for', 0.9999394416809082), ('is', 0.9999381303787231), ('before', 0.9999350905418396), ('by', 0.999934196472168)]
of [('which', 0.9999495148658752), ('after', 0.999946117401123), ('in', 0.9999450445175171), ('at', 0.999943733215332), ('three', 0.9999434351921082), ('for', 0.9999430775642395), ('by', 0.9999409914016724), ('with', 0.9999403357505798), ('an', 0.999937891960144), ('from', 0.9999375343322754)]
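One caveat: precomputing most_similar for every word in the vocabulary is itself slow and memory-hungry on a large corpus. Since gensim keeps the vocabulary sorted by descending frequency (with the default sorted_vocab=1), a pragmatic variant is to precompute only the most frequent words and fall back to on-demand queries for the rest. The following is a minimal sketch; it assumes the gensim 3.x attribute name index2word (renamed to index_to_key in gensim 4.x), and TOP_N is an arbitrary cutoff of our choosing:

# Precompute only the N most frequent words; index2word is ordered by
# descending frequency, so a slice covers the likeliest queries
TOP_N = 1000  # arbitrary cutoff, tune to your vocabulary size
hot_words = model.wv.index2word[:TOP_N]
most_similars_precalc = {word: model.wv.most_similar(word) for word in hot_words}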
2. Comparing search with and without the cached similarity dict
Randomly pick 4 words:
import time
words = ['voted', 'few', 'their', 'around']
First, the case without the cached similarity dict:
start = time.time()
for word in words:
    result = model.wv.most_similar(word)
    print(result)
end = time.time()
print(end - start)
Output:
[('powers', 0.9986631870269775), ('unions', 0.9986005425453186), ('start', 0.9985911846160889), ('into', 0.9985776543617249), ('north', 0.9985730648040771), ('then', 0.9985653162002563), ('kandahar', 0.9985642433166504), ('chief', 0.9985554814338684), ('also', 0.9985419511795044), ('child', 0.9985411167144775)]
[('any', 0.9997690916061401), ('are', 0.9997625350952148), ('them', 0.9997580051422119), ('some', 0.999752938747406), ('up', 0.999750018119812), ('five', 0.9997451901435852), ('one', 0.9997447729110718), ('be', 0.9997397661209106), ('his', 0.999735951423645), ('out', 0.9997352361679077)]
[('from', 0.9999539256095886), ('with', 0.9999514818191528), ('on', 0.9999467134475708), ('which', 0.9999465942382812), ('in', 0.9999460577964783), ('at', 0.999943733215332), ('and', 0.999943733215332), ('some', 0.9999434947967529), ('australian', 0.9999405741691589), ('about', 0.9999402761459351)]
[('which', 0.9999223947525024), ('into', 0.9999215006828308), ('from', 0.9999191761016846), ('on', 0.9999156594276428), ('australian', 0.9999149441719055), ('by', 0.999911904335022), ('with', 0.9999110698699951), ('areas', 0.999910831451416), ('in', 0.9999107122421265), ('their', 0.9999104142189026)]
0.0067827701568603516
Then the case with the cached dict:
start = time.time()
for word in words:
    if word in most_similars_precalc:  # cache hit: read straight from the dict
        result = most_similars_precalc[word]
        print(result)
    else:  # cache miss: query the model, then cache the result for next time
        result = model.wv.most_similar(word)
        most_similars_precalc[word] = result
        print(result)
end = time.time()
print(end - start)
Output:
[('powers', 0.9986631870269775), ('unions', 0.9986005425453186), ('start', 0.9985911846160889), ('into', 0.9985776543617249), ('north', 0.9985730648040771), ('then', 0.9985653162002563), ('kandahar', 0.9985642433166504), ('chief', 0.9985554814338684), ('also', 0.9985419511795044), ('child', 0.9985411167144775)]
[('any', 0.9997690916061401), ('are', 0.9997625350952148), ('them', 0.9997580051422119), ('some', 0.999752938747406), ('up', 0.999750018119812), ('five', 0.9997451901435852), ('one', 0.9997447729110718), ('be', 0.9997397661209106), ('his', 0.999735951423645), ('out', 0.9997352361679077)]
[('from', 0.9999539256095886), ('with', 0.9999514818191528), ('on', 0.9999467134475708), ('which', 0.9999465942382812), ('in', 0.9999460577964783), ('at', 0.999943733215332), ('and', 0.999943733215332), ('some', 0.9999434947967529), ('australian', 0.9999405741691589), ('about', 0.9999402761459351)]
[('which', 0.9999223947525024), ('into', 0.9999215006828308), ('from', 0.9999191761016846), ('on', 0.9999156594276428), ('australian', 0.9999149441719055), ('by', 0.999911904335022), ('with', 0.9999110698699951), ('areas', 0.999910831451416), ('in', 0.9999107122421265), ('their', 0.9999104142189026)]
0.0007929801940917969
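The hit-or-miss branch above generalizes naturally into a small reusable helper. Below is a minimal sketch; the name cached_most_similar is our own convenience wrapper, not part of the gensim API:

def cached_most_similar(model, cache, word, topn=10):
    """Look up `word` in `cache` first; on a miss, query the model
    with most_similar and store the result for future calls."""
    if word not in cache:
        cache[word] = model.wv.most_similar(word, topn=topn)
    return cache[word]

# Usage: repeated queries for the same word now hit the dict, not the model
result = cached_most_similar(model, most_similars_precalc, 'voted')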
Summary:
Comparing the two runs, the version without the cached dict took roughly 8.5 times as long as the cached version (0.00678 s vs. 0.00079 s).
This gap becomes even more pronounced as the number of words queried grows.
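As written, the cache only helps within a single session. To reuse it across runs, the dict can be serialized to disk; here is a minimal sketch using Python's standard pickle module (the file name most_similars_precalc.pkl is arbitrary):

import pickle

# Save the precomputed dict so the next run can load it instead of recomputing
with open('most_similars_precalc.pkl', 'wb') as f:
    pickle.dump(most_similars_precalc, f)

# In a later session, restore it:
with open('most_similars_precalc.pkl', 'rb') as f:
    most_similars_precalc = pickle.load(f)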