transformers Library Study Notes (1): Installation and Testing

June 21, 2020
AI
Deep Neural Networks

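I had the impression that transformers was a behemoth, but once I actually worked with it, it turned out to be extremely friendly. Thanks to the experts at huggingface. The original post is at tmylla.github.io.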

Installation

My versions: Python 3.6.9; PyTorch 1.0; CUDA 10.0.

pip install transformers

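Before running pip, make sure PyTorch 1.1.0+ is installed (the minimum PyTorch version given in the transformers installation guide).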
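
A quick way to check a version string against that minimum is sketched below. It assumes the 1.1.0+ requirement refers to PyTorch; the helper names parse_version and meets_minimum are made up for this illustration and are not part of any library.

```python
import re


def parse_version(version):
    """Turn a version string such as '1.5.0+cu101' into a tuple of ints, e.g. (1, 5, 0)."""
    # Drop any local build suffix ('+cu101') before splitting on dots.
    public = version.split("+")[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad to three components so that '1.1' compares equal to '1.1.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="1.1.0"):
    """Return True if the installed version is at least the minimum (1.1.0 is assumed here)."""
    return parse_version(installed) >= parse_version(minimum)


# Typical use, once PyTorch is importable:
#   import torch
#   print(meets_minimum(torch.__version__))
```

Comparing integer tuples instead of raw strings avoids the trap where "1.10.0" sorts before "1.2.0".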

Testing

Verification code and result

python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I hate you'))"

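After you enter the command above on the command line, transformers automatically downloads the model it depends on. If it prints the following result, the installation succeeded.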

[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]
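
The result is a list with one dict per input sentence. A minimal sketch of reading it, using the literal output copied from above (the summarize helper is made up for illustration):

```python
# The result literal copied from the output above: one dict per input sentence.
results = [{'label': 'NEGATIVE', 'score': 0.9991129040718079}]


def summarize(results):
    """Format each prediction as 'LABEL (xx.xx%)'."""
    return ["{} ({:.2%})".format(r["label"], r["score"]) for r in results]


print(summarize(results))  # ['NEGATIVE (99.91%)']
```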

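Notes on the model files downloaded by the transformers pipeline

transformers saves automatically downloaded models to C:\Users\username\.cache\torch\. Once a model has been downloaded, its files can be moved to another location. Each file is described below: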

![](//gitee.com/misite_J/blog-img/raw/master/img/情感分析pipeline model.png)

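Each json file contains the "url" and "etag" fields of the file it corresponds to.
"a41…" is a configuration file: distilbert-base-uncased-config.
"26b…" is the vocabulary file: bert-base-uncased-vocab.
"437…" is the configuration file of the model fine-tuned on SST-2: distilbert-base-uncased-finetuned-sst-2-english-config. Note that it differs from the "a41…" file.
"57d…" is the model card file: distilbert-base-uncased-finetuned-sst-2-english-modelcard.
"dd7…" is the model weights file: distilbert-base-uncased-finetuned-sst-2-english-pytorch_model.bin.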
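
To find out which hashed file is which without opening each one by hand, the json files can be read in a loop. This is a rough sketch: it assumes each json file is named after the cached file plus a ".json" suffix, and read_cache_metadata is a hypothetical helper, not a transformers function. The cache directory is passed in as a parameter.

```python
import json
from pathlib import Path


def read_cache_metadata(cache_dir):
    """Map each cached file name to the 'url' and 'etag' stored in its json file."""
    metadata = {}
    for meta_path in sorted(Path(cache_dir).glob("*.json")):
        with open(str(meta_path), encoding="utf-8") as handle:
            info = json.load(handle)
        # Assumption: the json file is named '<cached file name>.json'.
        cached_name = meta_path.name[:-len(".json")]
        metadata[cached_name] = {"url": info.get("url"), "etag": info.get("etag")}
    return metadata


# Typical use: print the source URL next to each hashed file name.
#   for name, info in read_cache_metadata(cache_dir).items():
#       print(name, "->", info["url"])
```

The URL usually ends in a readable name such as a config, vocab, or model weights file, which is enough to tell the files apart.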

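A brief introduction to pipeline()

As shown above, running pipeline('sentiment-analysis')('I hate you') makes transformers automatically download distilbert-base-uncased-finetuned-sst-2-english, a model fine-tuned on the SST-2 dataset from GLUE, and run sentiment analysis on 'I hate you'.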

Pipeline is a concise interface for NLP tasks. It runs the sequence Input -> Tokenization -> Model Inference -> Post-Processing (Task dependent) -> Output. It currently supports tasks such as Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction, and Question Answering.
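
To make the Post-Processing stage concrete for sentiment analysis, here is a toy sketch in plain Python. It is not the library's implementation. The logits are invented, and the label order (0 = NEGATIVE, 1 = POSITIVE) is an assumption. It only shows how raw model scores could become the {'label', 'score'} dict seen earlier.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Pick the most probable class and report it in the pipeline's output shape."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for one sentence; the label order is an assumption.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}
print(postprocess([3.5, -3.5], ID2LABEL))
```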

Taking Question Answering as an example:

from transformers import pipeline

nlp = pipeline("question-answering")

context = "Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a model on a SQuAD task, you may leverage the `run_squad.py`."

print(nlp(question="What is extractive question answering?", context=context))
print(nlp(question="What is a good example of a question answering dataset?", context=context))

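For the QA task, transformers uses the distilbert-base-cased-distilled-squad model, which was fine-tuned on the SQuAD dataset. Its model files follow the same layout as described above.

Moving the model to a custom folder

Taking QA as an example: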

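First, create a folder named distilbert-base-cased-distilled-squad, then put three files into it: the vocabulary file, the model configuration file, and the model weights file. Rename them to vocab.txt, config.json, and pytorch_model.bin respectively.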
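
The copy-and-rename step can also be scripted. The sketch below is one possible way to do it, not an official tool. collect_model_files is a hypothetical helper, and the hashed cache file names have to be filled in by hand because they differ from file to file.

```python
import shutil
from pathlib import Path


def collect_model_files(cache_dir, target_dir, rename_map):
    """Copy cached files into target_dir under the names given in rename_map."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for cached_name, new_name in rename_map.items():
        source = Path(cache_dir) / cached_name
        if not source.is_file():
            raise FileNotFoundError("cached file not found: {}".format(source))
        # Copy rather than move, so the original cache stays usable.
        shutil.copyfile(str(source), str(target / new_name))
        copied.append(new_name)
    return copied


# Typical use (the keys are placeholders for the real hashed file names):
#   collect_model_files(cache_dir, "./distilbert-base-cased-distilled-squad", {
#       "<hashed config file>": "config.json",
#       "<hashed vocab file>": "vocab.txt",
#       "<hashed weights file>": "pytorch_model.bin",
#   })
```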

Define the model directory in the code, DISTILLED = './distilbert-base-cased-distilled-squad'. The complete code follows.

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

DISTILLED = './distilbert-base-cased-distilled-squad'
tokenizer = AutoTokenizer.from_pretrained(DISTILLED)
model = AutoModelForQuestionAnswering.from_pretrained(DISTILLED)

text = """
Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    "Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
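    # Index the output so this works whether the model returns a tuple or a ModelOutput object
    outputs = model(**inputs)
    answer_start_scores, answer_end_scores = outputs[0], outputs[1]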

    answer_start = torch.argmax(answer_start_scores)  # Get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
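
One caveat about the loop above: taking the argmax of the start and end scores independently can occasionally place the end before the start, which yields an empty answer. The sketch below shows a different selection rule, not the one used above. It scores start and end jointly and only considers spans where start <= end. The best_span helper and the max_answer_len limit of 15 tokens are assumptions for illustration.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end), end inclusive, maximizing start + end score with start <= end."""
    best = None
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        # Only look at ends within max_answer_len tokens of the start.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy scores where independent argmax would give start=1, end=0 (an empty span).
start_scores = [0.1, 2.0, 0.3, 0.2]
end_scores = [3.0, 0.1, 1.5, 0.2]
print(best_span(start_scores, end_scores))  # (1, 2)
```

With real model outputs, the score tensors would first be converted to lists, for example with answer_start_scores[0].tolist(), and the answer tokens would be input_ids[start:end + 1].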

References

//huggingface.co/transformers/installation.html

Tags: Deep Neural Networks