Memory requirements and evaluation methods for word2vec models in Gensim
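This article is part of a series. The earlier installments are:
Gensim, a handy NLP library: basic features and installation
Gensim, a handy NLP library: a Word2Vec model walkthrough
Gensim, a handy NLP library: training your own word2vec word-vector model
Gensim, a handy NLP library: parameter settings for training word2vec word-vector models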
I. Memory requirements
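The parameters of a word2vec model are stored as NumPy arrays.
Each array has shape (vocabulary size, vector dimensionality):
- The vocabulary size is controlled by min_count.
- The vector dimensionality is controlled by size (renamed vector_size in Gensim 4.0).
So each array holds len(vocab) * size parameters.
Each parameter is a single-precision float, i.e. 32 bits, which takes 4 bytes in memory.
Three such matrices are held in RAM at the same time.
So for a vocabulary of 100,000 words and 200-dimensional vectors, the memory required is:
100,000 * 200 * 4 * 3 = 240,000,000 bytes, or about 229 MB
Some extra memory is needed to store the vocabulary itself, but it is usually negligible by comparison.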
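The arithmetic above can be wrapped in a small helper for trying other vocabulary sizes and dimensions. This is only a rough sketch: the function name and its default arguments are made up for illustration, and it counts the weight matrices only, not the vocabulary storage.

```python
def estimate_word2vec_ram(vocab_size, vector_size, bytes_per_param=4, num_matrices=3):
    """Rough RAM estimate in bytes, counting only the weight matrices.

    Hypothetical helper: defaults follow the text above
    (float32 parameters, three matrices held in RAM).
    """
    return vocab_size * vector_size * bytes_per_param * num_matrices


total_bytes = estimate_word2vec_ram(100_000, 200)
# Prints: 240,000,000 bytes = 228.9 MiB
print(f"{total_bytes:,} bytes = {total_bytes / 1024 ** 2:.1f} MiB")
```

Halving the vector dimensionality, or raising min_count so that fewer words survive, shrinks the estimate proportionally.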
II. Model evaluation
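Training a Word2Vec model is an unsupervised task, so there is no objective standard for measuring its accuracy.
Evaluation depends on the end application.
Google has released a test set of about 20,000 syntactic and semantic examples, built around the task "A is to B as C is to D".
For example, a syntactic analogy of the comparative type:
bad : worse ; good : ?
The dataset contains 9 types of syntactic analogy, such as noun plurals and words of opposite meaning.
The semantic questions cover 5 types of analogy, for example:
Capital cities (Paris : France ; Tokyo : ?)
Family members (brother : sister ; dad : ?)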
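Analogy questions like these are commonly answered with vector arithmetic: take vec(B) - vec(A) + vec(C) and pick the nearest remaining word by cosine similarity. The sketch below illustrates the idea with hand-picked 2-d vectors; the vectors and the solve_analogy helper are hypothetical, and Gensim's own implementation differs in details (it normalizes the vectors first, for instance).

```python
import math

# Hypothetical 2-d vectors, hand-picked so that the analogy works.
# A real model learns vectors with hundreds of dimensions.
vectors = {
    "paris": [1.0, 1.0],
    "france": [1.0, 3.0],
    "tokyo": [3.0, 1.0],
    "japan": [3.0, 3.0],
    "sister": [-2.0, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors):
    """Answer 'a : b ; c : ?' with the word closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


answer = solve_analogy("paris", "france", "tokyo", vectors)
print(answer)  # japan
```

A question counts as correct only when the top-ranked word equals the expected fourth word, which is why this benchmark is strict on small models.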
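Gensim supports the same evaluation set, in the same format. (The accuracy() method used here is the Gensim 3.x API, where it is deprecated in favor of evaluate_word_analogies().)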
model.wv.accuracy('./datasets/questions-words.txt')
Result:
[{'section': 'capital-common-countries',
'correct': [],
'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'),
('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),
('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'),
('PARIS', 'FRANCE', 'CANBERRA', 'AUSTRALIA'),
('PARIS', 'FRANCE', 'KABUL', 'AFGHANISTAN')]},
{'section': 'capital-world',
'correct': [],
'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE')]},
{'section': 'currency', 'correct': [], 'incorrect': []},
{'section': 'city-in-state', 'correct': [], 'incorrect': []},
{'section': 'family',
'correct': [],
'incorrect': [('HE', 'SHE', 'HIS', 'HER'),
('HE', 'SHE', 'MAN', 'WOMAN'),
('HIS', 'HER', 'MAN', 'WOMAN'),
('HIS', 'HER', 'HE', 'SHE'),
('MAN', 'WOMAN', 'HE', 'SHE'),
('MAN', 'WOMAN', 'HIS', 'HER')]},
{'section': 'gram1-adjective-to-adverb', 'correct': [], 'incorrect': []},
{'section': 'gram2-opposite', 'correct': [], 'incorrect': []},
{'section': 'gram3-comparative',
'correct': [],
'incorrect': [('GOOD', 'BETTER', 'GREAT', 'GREATER'),
('GOOD', 'BETTER', 'LONG', 'LONGER'),
('GOOD', 'BETTER', 'LOW', 'LOWER'),
('GOOD', 'BETTER', 'SMALL', 'SMALLER'),
('GREAT', 'GREATER', 'LONG', 'LONGER'),
('GREAT', 'GREATER', 'LOW', 'LOWER'),
('GREAT', 'GREATER', 'SMALL', 'SMALLER'),
('GREAT', 'GREATER', 'GOOD', 'BETTER'),
('LONG', 'LONGER', 'LOW', 'LOWER'),
('LONG', 'LONGER', 'SMALL', 'SMALLER'),
('LONG', 'LONGER', 'GOOD', 'BETTER'),
('LONG', 'LONGER', 'GREAT', 'GREATER'),
('LOW', 'LOWER', 'SMALL', 'SMALLER'),
('LOW', 'LOWER', 'GOOD', 'BETTER'),
('LOW', 'LOWER', 'GREAT', 'GREATER'),
('LOW', 'LOWER', 'LONG', 'LONGER'),
('SMALL', 'SMALLER', 'GOOD', 'BETTER'),
('SMALL', 'SMALLER', 'GREAT', 'GREATER'),
('SMALL', 'SMALLER', 'LONG', 'LONGER'),
('SMALL', 'SMALLER', 'LOW', 'LOWER')]},
{'section': 'gram4-superlative',
'correct': [],
'incorrect': [('BIG', 'BIGGEST', 'GOOD', 'BEST'),
('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),
('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),
('GOOD', 'BEST', 'GREAT', 'GREATEST'),
('GOOD', 'BEST', 'LARGE', 'LARGEST'),
('GOOD', 'BEST', 'BIG', 'BIGGEST'),
('GREAT', 'GREATEST', 'LARGE', 'LARGEST'),
('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),
('GREAT', 'GREATEST', 'GOOD', 'BEST'),
('LARGE', 'LARGEST', 'BIG', 'BIGGEST'),
('LARGE', 'LARGEST', 'GOOD', 'BEST'),
('LARGE', 'LARGEST', 'GREAT', 'GREATEST')]},
{'section': 'gram5-present-participle',
'correct': [],
'incorrect': [('GO', 'GOING', 'LOOK', 'LOOKING'),
('GO', 'GOING', 'PLAY', 'PLAYING'),
('GO', 'GOING', 'RUN', 'RUNNING'),
('GO', 'GOING', 'SAY', 'SAYING'),
('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),
('LOOK', 'LOOKING', 'RUN', 'RUNNING'),
('LOOK', 'LOOKING', 'SAY', 'SAYING'),
('LOOK', 'LOOKING', 'GO', 'GOING'),
('PLAY', 'PLAYING', 'RUN', 'RUNNING'),
('PLAY', 'PLAYING', 'SAY', 'SAYING'),
('PLAY', 'PLAYING', 'GO', 'GOING'),
('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),
('RUN', 'RUNNING', 'SAY', 'SAYING'),
('RUN', 'RUNNING', 'GO', 'GOING'),
('RUN', 'RUNNING', 'LOOK', 'LOOKING'),
('RUN', 'RUNNING', 'PLAY', 'PLAYING'),
('SAY', 'SAYING', 'GO', 'GOING'),
('SAY', 'SAYING', 'LOOK', 'LOOKING'),
('SAY', 'SAYING', 'PLAY', 'PLAYING'),
('SAY', 'SAYING', 'RUN', 'RUNNING')]},
{'section': 'gram6-nationality-adjective',
'correct': [('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN')],
'incorrect': [('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),
('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),
('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),
('AUSTRALIA', 'AUSTRALIAN', 'JAPAN', 'JAPANESE'),
('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),
('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),
('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),
('FRANCE', 'FRENCH', 'JAPAN', 'JAPANESE'),
('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),
('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),
('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),
('INDIA', 'INDIAN', 'JAPAN', 'JAPANESE'),
('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),
('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),
('ISRAEL', 'ISRAELI', 'JAPAN', 'JAPANESE'),
('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),
('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),
('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'),
('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'),
('JAPAN', 'JAPANESE', 'SWITZERLAND', 'SWISS'),
('JAPAN', 'JAPANESE', 'AUSTRALIA', 'AUSTRALIAN'),
('JAPAN', 'JAPANESE', 'FRANCE', 'FRENCH'),
('JAPAN', 'JAPANESE', 'INDIA', 'INDIAN'),
('JAPAN', 'JAPANESE', 'ISRAEL', 'ISRAELI'),
('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),
('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),
('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'),
('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'),
('SWITZERLAND', 'SWISS', 'JAPAN', 'JAPANESE')]},
{'section': 'gram7-past-tense',
'correct': [],
'incorrect': [('GOING', 'WENT', 'PAYING', 'PAID'),
('GOING', 'WENT', 'PLAYING', 'PLAYED'),
('GOING', 'WENT', 'SAYING', 'SAID'),
('GOING', 'WENT', 'TAKING', 'TOOK'),
('PAYING', 'PAID', 'PLAYING', 'PLAYED'),
('PAYING', 'PAID', 'SAYING', 'SAID'),
('PAYING', 'PAID', 'TAKING', 'TOOK'),
('PAYING', 'PAID', 'GOING', 'WENT'),
('PLAYING', 'PLAYED', 'SAYING', 'SAID'),
('PLAYING', 'PLAYED', 'TAKING', 'TOOK'),
('PLAYING', 'PLAYED', 'GOING', 'WENT'),
('PLAYING', 'PLAYED', 'PAYING', 'PAID'),
('SAYING', 'SAID', 'TAKING', 'TOOK'),
('SAYING', 'SAID', 'GOING', 'WENT'),
('SAYING', 'SAID', 'PAYING', 'PAID'),
('SAYING', 'SAID', 'PLAYING', 'PLAYED'),
('TAKING', 'TOOK', 'GOING', 'WENT'),
('TAKING', 'TOOK', 'PAYING', 'PAID'),
('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),
('TAKING', 'TOOK', 'SAYING', 'SAID')]},
{'section': 'gram8-plural',
'correct': [],
'incorrect': [('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),
('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),
('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),
('BUILDING', 'BUILDINGS', 'ROAD', 'ROADS'),
('BUILDING', 'BUILDINGS', 'WOMAN', 'WOMEN'),
('CAR', 'CARS', 'CHILD', 'CHILDREN'),
('CAR', 'CARS', 'MAN', 'MEN'),
('CAR', 'CARS', 'ROAD', 'ROADS'),
('CAR', 'CARS', 'WOMAN', 'WOMEN'),
('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),
('CHILD', 'CHILDREN', 'MAN', 'MEN'),
('CHILD', 'CHILDREN', 'ROAD', 'ROADS'),
('CHILD', 'CHILDREN', 'WOMAN', 'WOMEN'),
('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),
('CHILD', 'CHILDREN', 'CAR', 'CARS'),
('MAN', 'MEN', 'ROAD', 'ROADS'),
('MAN', 'MEN', 'WOMAN', 'WOMEN'),
('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),
('MAN', 'MEN', 'CAR', 'CARS'),
('MAN', 'MEN', 'CHILD', 'CHILDREN'),
('ROAD', 'ROADS', 'WOMAN', 'WOMEN'),
('ROAD', 'ROADS', 'BUILDING', 'BUILDINGS'),
('ROAD', 'ROADS', 'CAR', 'CARS'),
('ROAD', 'ROADS', 'CHILD', 'CHILDREN'),
('ROAD', 'ROADS', 'MAN', 'MEN'),
('WOMAN', 'WOMEN', 'BUILDING', 'BUILDINGS'),
('WOMAN', 'WOMEN', 'CAR', 'CARS'),
('WOMAN', 'WOMEN', 'CHILD', 'CHILDREN'),
('WOMAN', 'WOMEN', 'MAN', 'MEN'),
('WOMAN', 'WOMEN', 'ROAD', 'ROADS')]},
{'section': 'gram9-plural-verbs', 'correct': [], 'incorrect': []},
{'section': 'total',
'correct': [('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN')],
'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'),
('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),
('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'),
('PARIS', 'FRANCE', 'CANBERRA', 'AUSTRALIA'),
('PARIS', 'FRANCE', 'KABUL', 'AFGHANISTAN'),
('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),
('HE', 'SHE', 'HIS', 'HER'),
('HE', 'SHE', 'MAN', 'WOMAN'),
('HIS', 'HER', 'MAN', 'WOMAN'),
('HIS', 'HER', 'HE', 'SHE'),
('MAN', 'WOMAN', 'HE', 'SHE'),
('MAN', 'WOMAN', 'HIS', 'HER'),
('GOOD', 'BETTER', 'GREAT', 'GREATER'),
('GOOD', 'BETTER', 'LONG', 'LONGER'),
('GOOD', 'BETTER', 'LOW', 'LOWER'),
('GOOD', 'BETTER', 'SMALL', 'SMALLER'),
('GREAT', 'GREATER', 'LONG', 'LONGER'),
('GREAT', 'GREATER', 'LOW', 'LOWER'),
('GREAT', 'GREATER', 'SMALL', 'SMALLER'),
('GREAT', 'GREATER', 'GOOD', 'BETTER'),
('LONG', 'LONGER', 'LOW', 'LOWER'),
('LONG', 'LONGER', 'SMALL', 'SMALLER'),
('LONG', 'LONGER', 'GOOD', 'BETTER'),
('LONG', 'LONGER', 'GREAT', 'GREATER'),
('LOW', 'LOWER', 'SMALL', 'SMALLER'),
('LOW', 'LOWER', 'GOOD', 'BETTER'),
('LOW', 'LOWER', 'GREAT', 'GREATER'),
('LOW', 'LOWER', 'LONG', 'LONGER'),
('SMALL', 'SMALLER', 'GOOD', 'BETTER'),
('SMALL', 'SMALLER', 'GREAT', 'GREATER'),
('SMALL', 'SMALLER', 'LONG', 'LONGER'),
('SMALL', 'SMALLER', 'LOW', 'LOWER'),
('BIG', 'BIGGEST', 'GOOD', 'BEST'),
('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),
('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),
('GOOD', 'BEST', 'GREAT', 'GREATEST'),
('GOOD', 'BEST', 'LARGE', 'LARGEST'),
('GOOD', 'BEST', 'BIG', 'BIGGEST'),
('GREAT', 'GREATEST', 'LARGE', 'LARGEST'),
('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),
('GREAT', 'GREATEST', 'GOOD', 'BEST'),
('LARGE', 'LARGEST', 'BIG', 'BIGGEST'),
('LARGE', 'LARGEST', 'GOOD', 'BEST'),
('LARGE', 'LARGEST', 'GREAT', 'GREATEST'),
('GO', 'GOING', 'LOOK', 'LOOKING'),
('GO', 'GOING', 'PLAY', 'PLAYING'),
('GO', 'GOING', 'RUN', 'RUNNING'),
('GO', 'GOING', 'SAY', 'SAYING'),
('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),
('LOOK', 'LOOKING', 'RUN', 'RUNNING'),
('LOOK', 'LOOKING', 'SAY', 'SAYING'),
('LOOK', 'LOOKING', 'GO', 'GOING'),
('PLAY', 'PLAYING', 'RUN', 'RUNNING'),
('PLAY', 'PLAYING', 'SAY', 'SAYING'),
('PLAY', 'PLAYING', 'GO', 'GOING'),
('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),
('RUN', 'RUNNING', 'SAY', 'SAYING'),
('RUN', 'RUNNING', 'GO', 'GOING'),
('RUN', 'RUNNING', 'LOOK', 'LOOKING'),
('RUN', 'RUNNING', 'PLAY', 'PLAYING'),
('SAY', 'SAYING', 'GO', 'GOING'),
('SAY', 'SAYING', 'LOOK', 'LOOKING'),
('SAY', 'SAYING', 'PLAY', 'PLAYING'),
('SAY', 'SAYING', 'RUN', 'RUNNING'),
('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),
('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),
('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),
('AUSTRALIA', 'AUSTRALIAN', 'JAPAN', 'JAPANESE'),
('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),
('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),
('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),
('FRANCE', 'FRENCH', 'JAPAN', 'JAPANESE'),
('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),
('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),
('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),
('INDIA', 'INDIAN', 'JAPAN', 'JAPANESE'),
('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),
('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),
('ISRAEL', 'ISRAELI', 'JAPAN', 'JAPANESE'),
('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),
('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),
('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'),
('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'),
('JAPAN', 'JAPANESE', 'SWITZERLAND', 'SWISS'),
('JAPAN', 'JAPANESE', 'AUSTRALIA', 'AUSTRALIAN'),
('JAPAN', 'JAPANESE', 'FRANCE', 'FRENCH'),
('JAPAN', 'JAPANESE', 'INDIA', 'INDIAN'),
('JAPAN', 'JAPANESE', 'ISRAEL', 'ISRAELI'),
('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),
('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),
('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'),
('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'),
('SWITZERLAND', 'SWISS', 'JAPAN', 'JAPANESE'),
('GOING', 'WENT', 'PAYING', 'PAID'),
('GOING', 'WENT', 'PLAYING', 'PLAYED'),
('GOING', 'WENT', 'SAYING', 'SAID'),
('GOING', 'WENT', 'TAKING', 'TOOK'),
('PAYING', 'PAID', 'PLAYING', 'PLAYED'),
('PAYING', 'PAID', 'SAYING', 'SAID'),
('PAYING', 'PAID', 'TAKING', 'TOOK'),
('PAYING', 'PAID', 'GOING', 'WENT'),
('PLAYING', 'PLAYED', 'SAYING', 'SAID'),
('PLAYING', 'PLAYED', 'TAKING', 'TOOK'),
('PLAYING', 'PLAYED', 'GOING', 'WENT'),
('PLAYING', 'PLAYED', 'PAYING', 'PAID'),
('SAYING', 'SAID', 'TAKING', 'TOOK'),
('SAYING', 'SAID', 'GOING', 'WENT'),
('SAYING', 'SAID', 'PAYING', 'PAID'),
('SAYING', 'SAID', 'PLAYING', 'PLAYED'),
('TAKING', 'TOOK', 'GOING', 'WENT'),
('TAKING', 'TOOK', 'PAYING', 'PAID'),
('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),
('TAKING', 'TOOK', 'SAYING', 'SAID'),
('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),
('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),
('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),
('BUILDING', 'BUILDINGS', 'ROAD', 'ROADS'),
('BUILDING', 'BUILDINGS', 'WOMAN', 'WOMEN'),
('CAR', 'CARS', 'CHILD', 'CHILDREN'),
('CAR', 'CARS', 'MAN', 'MEN'),
('CAR', 'CARS', 'ROAD', 'ROADS'),
('CAR', 'CARS', 'WOMAN', 'WOMEN'),
('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),
('CHILD', 'CHILDREN', 'MAN', 'MEN'),
('CHILD', 'CHILDREN', 'ROAD', 'ROADS'),
('CHILD', 'CHILDREN', 'WOMAN', 'WOMEN'),
('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),
('CHILD', 'CHILDREN', 'CAR', 'CARS'),
('MAN', 'MEN', 'ROAD', 'ROADS'),
('MAN', 'MEN', 'WOMAN', 'WOMEN'),
('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),
('MAN', 'MEN', 'CAR', 'CARS'),
('MAN', 'MEN', 'CHILD', 'CHILDREN'),
('ROAD', 'ROADS', 'WOMAN', 'WOMEN'),
('ROAD', 'ROADS', 'BUILDING', 'BUILDINGS'),
('ROAD', 'ROADS', 'CAR', 'CARS'),
('ROAD', 'ROADS', 'CHILD', 'CHILDREN'),
('ROAD', 'ROADS', 'MAN', 'MEN'),
('WOMAN', 'WOMEN', 'BUILDING', 'BUILDINGS'),
('WOMAN', 'WOMEN', 'CAR', 'CARS'),
('WOMAN', 'WOMEN', 'CHILD', 'CHILDREN'),
('WOMAN', 'WOMEN', 'MAN', 'MEN'),
('WOMAN', 'WOMEN', 'ROAD', 'ROADS')]}]
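As you can see, the results are poor: only 1 of the 146 analogies that were evaluated was answered correctly, most likely because the corpus we trained on earlier is small.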
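Long lists like the one above are easier to read once each section is reduced to a count and a ratio. The following is a minimal sketch, assuming the result is a list of dicts with 'section', 'correct' and 'incorrect' keys as shown; summarize_sections is a made-up helper name, and the sample is a shortened stand-in rather than the full output.

```python
def summarize_sections(sections):
    """Map section name -> (correct, total, accuracy).

    Accuracy is None when no question in the section was evaluated.
    """
    summary = {}
    for sec in sections:
        correct = len(sec["correct"])
        total = correct + len(sec["incorrect"])
        summary[sec["section"]] = (correct, total, correct / total if total else None)
    return summary


# Shortened stand-in with the same structure as the output above.
sample = [
    {"section": "currency", "correct": [], "incorrect": []},
    {"section": "gram6-nationality-adjective",
     "correct": [("INDIA", "INDIAN", "AUSTRALIA", "AUSTRALIAN")],
     "incorrect": [("FRANCE", "FRENCH", "INDIA", "INDIAN"),
                   ("JAPAN", "JAPANESE", "INDIA", "INDIAN"),
                   ("ISRAEL", "ISRAELI", "INDIA", "INDIAN")]},
]

summary = summarize_sections(sample)
for name, (correct, total, acc) in summary.items():
    shown = "n/a" if acc is None else f"{acc:.1%}"
    print(f"{name}: {correct}/{total} ({shown})")
```

Sections with a total of 0, such as currency in the output above, contributed no evaluated questions at all, so they say nothing about model quality.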
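This accuracy measure takes an optional parameter, restrict_vocab, which limits which test examples are considered: any question containing a word that is not among the first restrict_vocab words of the vocabulary (which Gensim sorts by descending frequency by default) is skipped.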
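The effect of restrict_vocab can be pictured as a filter applied to the questions before scoring. The sketch below is an illustration only, not Gensim's code: restrict_questions and the tiny frequency-ordered word list are assumptions made for the example.

```python
def restrict_questions(questions, words_by_frequency, restrict_vocab):
    """Keep only the 4-tuples whose words all fall within the first
    `restrict_vocab` entries of a frequency-sorted vocabulary."""
    allowed = set(words_by_frequency[:restrict_vocab])
    return [q for q in questions if all(word in allowed for word in q)]


# Hypothetical vocabulary, most frequent word first.
words_by_frequency = ["good", "better", "long", "longer", "low", "lower"]
questions = [
    ("good", "better", "long", "longer"),
    ("good", "better", "low", "lower"),
]

kept = restrict_questions(questions, words_by_frequency, restrict_vocab=4)
print(kept)  # [('good', 'better', 'long', 'longer')]
```

A smaller restrict_vocab therefore evaluates fewer questions, but the ones that remain involve only frequent words, whose vectors tend to be better trained.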
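In a 2016 release, Gensim added a better way to evaluate semantic similarity.
By default it uses an academic dataset, WS-353, but you can also build your own dataset in the same format, focused on a particular domain.
The dataset contains word pairs with human-assigned similarity judgments, which measure how related the two words are or how likely they are to occur together.
For example, coast and shore are very similar: the two words often appear in the same passage.
By contrast, clothes and closet score lower: the two words are related, but they are not interchangeable.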
from gensim.test.utils import datapath  # resolves files bundled with Gensim's test data
model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
Result:
((0.1952515342533469, 0.13490728041580877),
SpearmanrResult(correlation=0.19127414318530173, pvalue=0.14319638687965558),
83.0028328611898)
Return values:
- pearson (tuple of (float, float)): Pearson correlation coefficient with its two-tailed p-value.
- spearman (tuple of (float, float)): Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself, with its two-tailed p-value.
- oov_ratio (float): the ratio of pairs with unknown words (expressed as a percentage in the output above).
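To make the two correlation coefficients concrete, here is a small pure-Python sketch that computes both for a handful of word pairs. The human and model scores are invented numbers, and the helper functions are illustrative; a statistics library is the usual choice in practice.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented scores for four word pairs: human judgments vs. model cosine similarity.
human = [9.1, 8.0, 5.0, 1.3]
model_sim = [0.80, 0.85, 0.30, 0.05]

print(round(spearman(human, model_sim), 3))  # 0.8
```

Spearman looks only at the ordering of the pairs, so it is the more natural measure when the human scale (0 to 10) and the cosine scale are not expected to be linearly related.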
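So the scores above are not good either: both correlations are below 0.2 with p-values above 0.1, and about 83% of the pairs (293 of 353) contained a word missing from the vocabulary. Once again, the small training corpus is the likely cause.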
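!!! Note:
- Good scores on the Google test set or on WS-353 do not guarantee good performance in your application.
- The reverse also holds.
- It is best to test directly on the target task. If we are building a classifier, for example, we can simply look at the classification results.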