tf_text

November 24, 2019
Notes

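Text preprocessing

When modeling text, you usually need to split the raw text into characters, words, or phrases, and then index those units so that a machine learning algorithm can consume them. This preprocessing step is called tokenization. All of it can be implemented in plain Python, but Keras ships ready-made utilities.

In Keras, these utilities all live in the keras.preprocessing.text module.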

Tokenization

Tokenizer

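tokenizer = keras.preprocessing.text.Tokenizer(num_words=None,
                                               filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                                               lower=True, split=' ',
                                               char_level=False,
                                               oov_token=None)
tokenizer.fit_on_texts(texts)

By default, all punctuation is removed, which turns each text into a space-separated sequence of words (words may still contain the ' character). These sequences are then split into lists of tokens, which are in turn indexed or vectorized.

After instantiating the Tokenizer class, call fit_on_texts(texts) to update the Tokenizer object's vocabulary and word-frequency information.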
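To make the vocabulary-building step concrete, here is a minimal pure-Python sketch of what fitting a tokenizer roughly amounts to: count the words, then assign integer indices starting at 1 in order of decreasing frequency. The helpers `fit_word_index` and `texts_to_sequences` are hypothetical names written for this note, not the Keras implementation, and punctuation filtering is left out for brevity.

```python
from collections import Counter


def fit_word_index(texts):
    """Count words over all texts and index them by descending frequency.

    Index 0 is left unused, following the common convention of
    reserving it for padding.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by count; ties keep first-seen order
    ranked = [word for word, _ in counts.most_common()]
    word_index = {word: i for i, word in enumerate(ranked, start=1)}
    return word_index, counts


def texts_to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


texts = ["the cat sat", "the dog sat down", "the end"]
word_index, counts = fit_word_index(texts)
print(word_index)
print(texts_to_sequences(["the dog sat"], word_index))
```

The most frequent word gets the smallest index, which is what makes a cutoff such as num_words meaningful: keeping only the lowest indices keeps the most common words.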

Converting text to sequences

text_to_word_sequence

keras.preprocessing.text.text_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')

Converts a text into a sequence of words (or tokens).
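The roles of filters, lower, and split are easier to see in code. The following is a pure-Python sketch of the same idea, assuming the default filter string shown in the signature; `to_word_sequence` is a hypothetical name used here for illustration, not the library function itself.

```python
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def to_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase, turn every filtered character into the separator, then split."""
    if lower:
        text = text.lower()
    # Map each filtered character to the separator so it acts as a word boundary
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Drop the empty strings produced by consecutive separators
    return [token for token in text.split(split) if token]


print(to_word_sequence("Hello, world! It's tf_text."))
# ['hello', 'world', "it's", 'tf', 'text']
```

Note that the apostrophe survives because it is not in the filter string, while the underscore in tf_text is filtered and therefore splits the word in two.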

onehot

When solving a classification problem with machine learning or deep learning, we need to one-hot encode the labels.

get_dummies is the pandas way to do one-hot encoding.

>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
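For readers without pandas at hand, the same encoding can be sketched in a few lines of plain Python. `one_hot_labels` is a hypothetical helper for this note. It sorts the distinct labels to get the columns and emits one 0/1 row per input label.

```python
def one_hot_labels(labels):
    """Return (sorted class names, one row of 0/1 indicators per label)."""
    classes = sorted(set(labels))
    column = {name: j for j, name in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[column[label]] = 1  # exactly one position is "hot"
        rows.append(row)
    return classes, rows


classes, rows = one_hot_labels(list("abca"))
print(classes)
for row in rows:
    print(row)
```

Each row contains exactly one 1, and the first and last rows are identical because both encode the label a.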

One-hot encoding in sklearn

from sklearn.preprocessing import OneHotEncoder

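Keras has a one_hot function as well, although it operates on text: it maps each word to an integer index in a vocabulary of size n by hashing, rather than producing 0/1 vectors.

keras.preprocessing.text.one_hot(text, n, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')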
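Because the Keras one_hot function is documented as a wrapper around a hashing trick, a rough pure-Python sketch of hashing words into a fixed index range may help. This is an illustration under assumptions, not the library code: `hashed_word_indices` is a hypothetical name, and MD5 is used here only as a hash that stays stable across runs.

```python
import hashlib


def hashed_word_indices(text, n):
    """Map each whitespace-separated word to an index in [1, n - 1] via MD5."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is skipped so that it stays free for padding
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices


ids = hashed_word_indices("the cat sat on the mat", n=50)
print(ids)
```

The same word always maps to the same index, but two different words can collide on one index, so n should be comfortably larger than the expected vocabulary.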

tf_text

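That said, the Keras utilities above are not the real subject of this note. What I mainly want to record is tensorflow-text, which is new with TF 2.0; I never came across it in 1.x.

Note:

Requires TensorFlow 2.0.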

Installation

pip install tensorflow-text

Test

import tensorflow as tf
import tensorflow_text as text

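Unicode

Unicode is an industry standard in computing that covers character sets, encoding schemes, and more. It was created to overcome the limitations of traditional character encodings: it assigns every character in every language a single, unique binary code, so that text can be converted and processed across languages and platforms.

Strings are expected to be in UTF-8. If your data uses another encoding, you can use the core TensorFlow transcoding op to convert the strings to UTF-8.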

import tensorflow as tf
import tensorflow_text as text

# An emoji is included here; this is something new for TF 2
# Transcode UTF-16-BE -> UTF-8
MaoliString = tf.constant([u'毛利学习tf_text.'.encode('UTF-16-BE'), u'好开心😍'.encode('UTF-16-BE')])
utf8_docs = tf.strings.unicode_transcode(MaoliString, input_encoding='UTF-16-BE', output_encoding='UTF-8')
print(utf8_docs)

This is equivalent to encoding the strings as UTF-8 directly: MaoliString = tf.constant([u'毛利学习tf_text.'.encode('UTF-8'), u'好开心😍'.encode('UTF-8')])

Output:

tf.Tensor(
[b'\xe6\xaf\x9b\xe5\x88\xa9\xe5\xad\xa6\xe4\xb9\xa0tf_text.'
 b'\xe5\xa5\xbd\xe5\xbc\x80\xe5\xbf\x83\xf0\x9f\x98\x8d'], shape=(2,), dtype=string)
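What transcoding does can be checked without TensorFlow. In plain Python it is a decode followed by an encode. The sketch below uses a hypothetical helper named `transcode` and the two example strings, with the second one ending in the 😍 emoji.

```python
def transcode(data: bytes, input_encoding: str, output_encoding: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(input_encoding).encode(output_encoding)


utf16_docs = ["毛利学习tf_text.".encode("utf-16-be"),
              "好开心😍".encode("utf-16-be")]
utf8_docs = [transcode(doc, "utf-16-be", "utf-8") for doc in utf16_docs]
print(utf8_docs)
```

Each CJK character takes three bytes in UTF-8, and the emoji takes four (f0 9f 98 8d), which is why the byte strings are longer than the visible text.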

tokenizer

WhitespaceTokenizer splits on whitespace.

# WhitespaceTokenizer splits on whitespace
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['毛利 学习 tf_text.', u'好 开 心 😍'.encode('UTF-8')])
print(tokens.to_list())
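As a rough point of comparison, whitespace tokenization can be sketched in plain Python with str.split. `whitespace_tokenize` is a hypothetical helper, and Python's notion of whitespace is assumed to be close enough to the library's for these inputs. Tokens are returned as UTF-8 bytes to mirror the byte strings that TensorFlow string tensors hold.

```python
def whitespace_tokenize(docs):
    """Split each document on runs of whitespace and return UTF-8 byte tokens."""
    return [[token.encode("utf-8") for token in doc.split()] for doc in docs]


tokens = whitespace_tokenize(["毛利 学习 tf_text.", "好 开 心 😍"])
print(tokens)
```

The result is a ragged list of lists: the first document yields three tokens and the second four. Punctuation stays attached to its word, so tf_text. keeps its trailing period.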

UnicodeScriptTokenizer()

Splits UTF-8 strings on Unicode script boundaries.

# Split UTF-8 strings on Unicode script boundaries
tokenizer = text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(['毛利 学习 tf_text.', u'好 开 心 😍'.encode('UTF-8')])
print(tokens.to_list())

tf.strings.unicode_split

This comes from TF 1.x and still works in 2.0; what 2.0 brings is TensorFlow Text.

tokens = tf.strings.unicode_split([u"毛利".encode('UTF-8')], 'UTF-8')
print(tokens.to_list())

[[b'\xe6\xaf\x9b', b'\xe5\x88\xa9']]
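Splitting by character is easy to reproduce in plain Python, since iterating over a str yields one code point at a time. The sketch below uses a hypothetical helper, here also called `unicode_split`, that decodes, splits, and re-encodes each character.

```python
def unicode_split(data: bytes, encoding: str = "utf-8"):
    """Split encoded bytes into one byte string per Unicode code point."""
    return [ch.encode(encoding) for ch in data.decode(encoding)]


print(unicode_split("毛利".encode("utf-8")))
# [b'\xe6\xaf\x9b', b'\xe5\x88\xa9']
```

Splitting the raw bytes directly would cut each three-byte character into invalid fragments, which is why the split has to happen at the code-point level.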
