fastText: supervised and unsupervised modes

July 8, 2020
AI

Unsupervised mode:

//radimrehurek.com/gensim/models/fasttext.html

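The gensim documentation for fastText is worth a read. gensim's implementation covers fastText's unsupervised mode. Leaving engineering optimizations aside, the algorithmic change is essentially the introduction of character-level (subword) inputs, that is, what is described in: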

Chen Yunwen, "fastText原理及实践" (fastText: Principles and Practice), zhuanlan.zhihu.com

The original paper also describes this in some detail:

Enriching Word Vectors with Subword Information

Look closely at the note in the official gensim documentation:

If we remove the n-gram part, the model reduces to word2vec.

So the real question is: how on earth is this unsupervised model actually trained?

Let's look at the original paper:

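Here is how it works. Take "where" with an n-gram length of 3: after adding the boundary markers < and >, the word is split into <wh, whe, her, ere, re>. Originally the output side was trained against a single input vector for "where". Now the input is built from the five n-grams <wh, whe, her, ere, re>: in the paper the word is represented by the sum of its n-gram vectors (together with a vector for the whole word, <where>), so each training step updates all of them, and their sum is the final word vector for "where". Now suppose an OOV word such as "there" shows up. It splits into <th, the, her, ere, re>, and we can build its vector by summing the already-trained vectors of those n-grams (her, ere and re> are shared with "where"), which mitigates the OOV problem quite well. To me this feels more like a data-level augmentation: the n-grams effectively increase the amount of training signal drawn from the corpus. Note that fastText also uses a hashing trick: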

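It uses the hashing trick. Because the raw n-gram space is too large, a hash function maps the n-grams into a fixed number of buckets, which keeps memory usage under control.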
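The n-gram splitting and bucket hashing described above can be sketched as follows. This is a minimal illustration, not fastText's actual code: the function names (`char_ngrams`, `fnv1a_32`, `bucket_index`, `word_vector`) are hypothetical, the hash is a standard 32-bit FNV-1a used as a stand-in for whatever hash the library applies, and the embedding table is passed in by the caller. It uses a single n-gram length of 3 to match the example, whereas the paper uses a range of lengths.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word` after adding < and > markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def fnv1a_32(text):
    """Standard 32-bit FNV-1a hash over the UTF-8 bytes of `text`."""
    h = 2166136261  # FNV offset basis
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, kept to 32 bits
    return h


def bucket_index(ngram, num_buckets):
    """Map an n-gram to a row of a fixed-size embedding table."""
    return fnv1a_32(ngram) % num_buckets


def word_vector(word, table, n=3):
    """Sum the bucket vectors of all n-grams of `word`.

    Because only n-grams are looked up, this also works for OOV words.
    `table` is a list of equal-length vectors, one per bucket (an assumption
    of this sketch; real tables are trained, not hand-built).
    """
    total = [0.0] * len(table[0])
    for gram in char_ngrams(word, n):
        row = table[bucket_index(gram, len(table))]
        total = [t + r for t, r in zip(total, row)]
    return total
```

For instance, `char_ngrams("where")` returns `['<wh', 'whe', 'her', 'ere', 're>']`, and `char_ngrams("there")` shares `her`, `ere` and `re>` with it. Distinct n-grams can collide in the same bucket; that is the price paid for a fixed memory budget.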

//github.com/salestock/fastText.py

Next, fastText's supervised mode, which is the classification mode that most articles out there talk about.

//github.com/ShawnyXiao/TextClassification-Keras/blob/master/model/FastText/fast_text.py

This implements the CBOW-style version of fastText; as you can see, it simply averages the embeddings with average pooling.

//github.com/ShawnyXiao/TextClassification-Keras/blob/master/model/FastText/main.py
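The averaging classifier described above can be sketched in plain Python. This is a toy forward pass under stated assumptions, not the linked Keras code: the names `word_ngram_features`, `average`, `softmax` and `predict` are hypothetical, embeddings live in a plain dict rather than a hashed table, word bigrams are added as extra features as one plausible way for n-grams to enter the model, and the weights are supplied by hand instead of being trained.

```python
import math


def word_ngram_features(tokens, max_n=2):
    """Return the tokens plus their word n-grams up to length `max_n`."""
    feats = list(tokens)
    for n in range(2, max_n + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats


def average(vectors):
    """Element-wise mean of equal-length vectors (the average-pooling step)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(tokens, embeddings, weights, max_n=2):
    """Average the feature embeddings, apply a linear layer, return class probabilities.

    `embeddings` maps feature string -> vector; unknown features are skipped.
    `weights` holds one row of linear-layer weights per class.
    """
    feats = [f for f in word_ngram_features(tokens, max_n) if f in embeddings]
    hidden = average([embeddings[f] for f in feats])
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(scores)


# Toy usage: two classes, two-dimensional embeddings (all values made up).
toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "good movie": [1.0, 1.0]}
toy_weights = [[2.0, 0.0], [0.0, 1.0]]
probabilities = predict(["good", "movie"], toy_embeddings, toy_weights)
```

The whole model is one embedding lookup, one mean, and one linear layer, which is why training and inference are fast.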

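Judging from public material and code, the idea behind fastText is not hard to follow: every word in the sentence, along with its n-grams, is embedded, the embeddings are averaged, and the output is trained against the text's actual label to perform classification. The official documentation, however, shows a more involved setup: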

Supervised model

Train & load the classifier

classifier = fasttext.supervised(params)

List of available params and their default value:

input_file     			training file path (required)
output         			output file path (required)
label_prefix   			label prefix ['__label__']
lr             			learning rate [0.1]
lr_update_rate 			change the rate of updates for the learning rate [100]
dim            			size of word vectors [100]
ws             			size of the context window [5]
epoch          			number of epochs [5]
min_count      			minimal number of word occurences [1]
neg            			number of negatives sampled [5]
word_ngrams    			max length of word ngram [1]
loss           			loss function {ns, hs, softmax} [softmax]
bucket         			number of buckets [0]
minn           			min length of char ngram [0]
maxn           			max length of char ngram [0]
thread         			number of threads [12]
t              			sampling threshold [0.0001]
silent         			disable the log output from the C++ extension [1]
encoding       			specify input_file encoding [utf-8]
pretrained_vectors		pretrained word vectors (.vec file) for supervised learning []
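To make `label_prefix` concrete, here is a small sketch of how a training line might be split into labels and text. It assumes the usual fastText input convention, in which each line carries one or more tokens starting with the label prefix alongside the text; the helper `parse_labeled_line` is hypothetical and not part of any fastText API.

```python
def parse_labeled_line(line, label_prefix="__label__"):
    """Split one training line into (labels, tokens).

    Tokens that start with `label_prefix` are treated as labels (with the
    prefix stripped); everything else is treated as text.
    """
    labels, tokens = [], []
    for item in line.split():
        if item.startswith(label_prefix):
            labels.append(item[len(label_prefix):])
        else:
            tokens.append(item)
    return labels, tokens


# Example line in the assumed format (the content is made up).
labels, tokens = parse_labeled_line("__label__sports the team won the match")
```

A line may carry several labels, which is how multi-label data is usually written in this format.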

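The parameter list also includes ws, the context window size, which suggests that classification need not use the whole sentence and that a window could cut it into shorter pieces used as samples. That reading is uncertain, though: the parameter list is shared with the unsupervised modes, and the supervised model is usually described as operating on the whole line. Either way, the model structure is not CBOW or skip-gram as such, but the CBOW-like structure given in the original paper: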