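Transformers library study notes (1): installation and testing
I had always thought of transformers as a behemoth, but after actually trying it I found it extremely friendly; thanks to the Hugging Face folks. The original post is at tmylla.github.io.
Installation
My versions: Python 3.6.9; PyTorch 1.0; CUDA 10.0.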
pip install transformers
Before running pip, make sure version 1.1.0+ (presumably of PyTorch) is already installed.
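A quick version check before installing can save a failed import later. The following is a minimal sketch and not part of the original notes: `version_tuple` and `meets_minimum` are hypothetical helpers, the check assumes the 1.1.0+ requirement refers to PyTorch, and it only handles plain dotted version strings (a local suffix such as `+cu100` is ignored).

```python
import re


def version_tuple(version):
    """Turn a string such as '1.1.0+cu100' into (1, 1, 0); text after '+' is ignored."""
    core = str(version).split("+")[0]
    return tuple(int(part) for part in re.findall(r"\d+", core))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


# Plain string comparison would rank "1.10.0" below "1.2.0"; tuples of ints do not.
print(meets_minimum("1.10.0", "1.2.0"))
print(meets_minimum("0.4.1", "1.1.0"))
```

With PyTorch installed, the same helper could be called as `meets_minimum(torch.__version__, "1.1.0")`.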
Testing
Verification code and result
python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I hate you'))"
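After you enter the command above on the command line, transformers automatically downloads the model it depends on. If it prints the following result, the installation succeeded.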
[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]
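Files downloaded by the transformers pipeline
transformers saves automatically downloaded models to C:\Users\username\.cache\torch\; once downloaded, the files can be moved to another location. Each file is described below: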

- Each json file holds the "url" and "etag" of the file it describes.
- "a41…" is a configuration file: distilbert-base-uncased-config.
- "26b…" is the vocabulary file: bert-base-uncased-vocab.
- "437…" is the configuration file of the fine-tuned SST-2 model: distilbert-base-uncased-finetuned-sst-2-english-config. Note that it differs from the "a41…" file.
- "57d…" is the model card file: distilbert-base-uncased-finetuned-sst-2-english-modelcard.
- "dd7…" is the model weights file: distilbert-base-uncased-finetuned-sst-2-english-pytorch_model.bin.
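Because the cached files have hashed names, the json files are the easiest way to tell them apart. The sketch below is an illustration rather than anything transformers ships: `list_cached_files` is a hypothetical helper, and it assumes each json file is named after the cached file it describes plus a `.json` suffix and contains a `url` key as described above.

```python
import json
from pathlib import Path


def list_cached_files(cache_dir):
    """Map each cached file name to the URL recorded in its json metadata file."""
    mapping = {}
    # rglob also looks into subfolders of the cache directory.
    for meta_path in sorted(Path(cache_dir).rglob("*.json")):
        with open(meta_path, encoding="utf-8") as handle:
            meta = json.load(handle)
        # Assumption: the metadata file is named "<cached file name>.json".
        cached_name = meta_path.name[: -len(".json")]
        mapping[cached_name] = meta.get("url")
    return mapping


# Example call (the path is the one mentioned above; adjust it to your machine):
# for name, url in list_cached_files(Path.home() / ".cache" / "torch").items():
#     print(name, "->", url)
```

The URL usually ends in a readable name such as a config or vocabulary file, which tells you what each hashed file is.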
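Introduction to pipeline()
As shown above, running pipeline('sentiment-analysis')('I hate you') makes transformers automatically download distilbert-base-uncased-finetuned-sst-2-english, a model fine-tuned on the SST-2 dataset from GLUE, and use it to run sentiment analysis on 'I hate you'.
Pipeline is a concise interface for NLP tasks. It runs the sequence Input -> Tokenization -> Model Inference -> Post-Processing (task dependent) -> Output. It currently supports tasks such as Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering.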
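To make the Post-Processing stage concrete, here is a minimal pure-Python sketch of what a sentiment pipeline might do with the raw model scores: apply softmax and report the best label. It is not the library's implementation. The `postprocess` helper is hypothetical, the logits are made up, and the label order (0 = NEGATIVE, 1 = POSITIVE) is an assumption about the SST-2 model.

```python
import math

# Assumed label order for the SST-2 sentiment model.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def postprocess(logits):
    """Convert raw class scores into a pipeline-style {'label', 'score'} dict."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]  # shift for numerical stability
    total = sum(exps)
    probs = [value / total for value in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Hypothetical logits, not taken from a real model run.
result = postprocess([3.5, -3.5])
print(result["label"], round(result["score"], 4))
```

The output has the same shape as one element of the list printed by the verification command above.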
Take Question Answering as an example:
from transformers import pipeline
nlp = pipeline("question-answering")
context = "Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a model on a SQuAD task, you may leverage the `run_squad.py`."
print(nlp(question="What is extractive question answering?", context=context))
print(nlp(question="What is a good example of a question answering dataset?", context=context))
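For the QA task, transformers uses distilbert-base-cased-distilled-squad, a model trained on the SQuAD dataset. Its downloaded files follow the same layout described above.
Moving the model to a custom folder
Take QA as an example: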
- First create a folder named distilbert-base-cased-distilled-squad, put the three files (vocabulary, model configuration and model weights) into it, and rename them to vocab.txt, config.json and pytorch_model.bin respectively.
- In the code, define the model directory as DISTILLED = './distilbert-base-cased-distilled-squad'. The complete code is as follows.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Local folder that holds vocab.txt, config.json and pytorch_model.bin
DISTILLED = './distilbert-base-cased-distilled-squad'
tokenizer = AutoTokenizer.from_pretrained(DISTILLED)
model = AutoModelForQuestionAnswering.from_pretrained(DISTILLED)

text = """
Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    "Transformers provides interoperability between which frameworks?",
]

for question in questions:
    # Encode the question and the text together as one input pair
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Index the output instead of unpacking it, so this works whether the
    # model returns a plain tuple or an output object (depends on the version)
    outputs = model(**inputs)
    answer_start_scores, answer_end_scores = outputs[0], outputs[1]

    answer_start = torch.argmax(answer_start_scores)  # Most likely beginning of the answer
    answer_end = torch.argmax(answer_end_scores) + 1  # Most likely end of the answer (exclusive)

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
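The copy-and-rename step from the first bullet can also be scripted. This is a minimal sketch under stated assumptions: `collect_model_files` is a hypothetical helper, and the keys in `RENAMES` are placeholders, since the real hashed file names differ per model and must be looked up in your own cache folder.

```python
import shutil
from pathlib import Path


def collect_model_files(cache_dir, renames, target_dir):
    """Copy cached files into target_dir under the names from_pretrained looks for."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for cached_name, new_name in renames.items():
        # copyfile leaves the original cache entry untouched
        shutil.copyfile(Path(cache_dir) / cached_name, target / new_name)
    return sorted(path.name for path in target.iterdir())


# Placeholder keys: substitute the real hashed file names found in the cache.
RENAMES = {
    "<hashed name of the config file>": "config.json",
    "<hashed name of the vocabulary file>": "vocab.txt",
    "<hashed name of the weights file>": "pytorch_model.bin",
}

# Example call (paths are assumptions; adjust them to your machine):
# collect_model_files(Path.home() / ".cache" / "torch", RENAMES,
#                     "./distilbert-base-cased-distilled-squad")
```

Copying rather than moving keeps the cache intact, so the pipeline does not need to download the files again.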
References
https://huggingface.co/transformers/installation.html