HuggingFace Transformers Usage Guide (2): The Convenient Trainer

  • April 12, 2021
  • AI

To use an analogy, in terms of how much is encapsulated: torch < pytorch lightning < trainer. The Trainer wraps up almost everything, so doing custom things with it takes a little more effort.

https://huggingface.co/transformers/main_classes/trainer.html

Let's start by looking at the basic parameters:

class transformers.Trainer(
    model: torch.nn.modules.module.Module = None,
    args: transformers.training_args.TrainingArguments = None,
    data_collator: Optional[NewType.<locals>.new_type] = None,
    train_dataset: Optional[torch.utils.data.dataset.Dataset] = None,
    eval_dataset: Optional[torch.utils.data.dataset.Dataset] = None,
    tokenizer: Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
    model_init: Callable[transformers.modeling_utils.PreTrainedModel] = None,
    compute_metrics: Optional[Callable[transformers.trainer_utils.EvalPrediction, Dict]] = None,
    callbacks: Optional[List[transformers.trainer_callback.TrainerCallback]] = None,
    optimizers: Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None),
)

Let's go through them one by one.

Parameters:

model: the model can be any torch.nn.Module, or a model that inherits from transformers.PreTrainedModel. The docs note that the Trainer is optimized for transformers.PreTrainedModel, so that is the recommended choice. You can also subclass transformers.PreTrainedModel to build your own HuggingFace models; the process is very similar to plain torch and will be covered in the post on HuggingFace customization.

model_init, on the other hand, is simply a function that returns such a model:

def model_init():
    model = AutoModelForSequenceClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
    )
    return model
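For reference, this is roughly how it gets wired in: you hand the Trainer the factory function instead of an instantiated model, and each call to train() (and features such as hyperparameter search) rebuilds the model from it. A minimal sketch, assuming training_args, train_ds and eval_ds already exist:

from transformers import Trainer

trainer = Trainer(
    model_init=model_init,   # re-instantiated on every call to train()
    args=training_args,      # assumed to be an existing TrainingArguments
    train_dataset=train_ds,  # assumed datasets
    eval_dataset=eval_ds,
)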


args: the hyperparameter definitions. This is one of the Trainer's key features; most training-related settings live here, which is very convenient:


class transformers.TrainingArguments(
    output_dir: str,
    overwrite_output_dir: bool = False,
    do_train: bool = False,
    do_eval: bool = None,
    do_predict: bool = False,
    evaluation_strategy: transformers.trainer_utils.IntervalStrategy = 'no',
    prediction_loss_only: bool = False,
    per_device_train_batch_size: int = 8,
    per_device_eval_batch_size: int = 8,
    per_gpu_train_batch_size: Optional[int] = None,
    per_gpu_eval_batch_size: Optional[int] = None,
    gradient_accumulation_steps: int = 1,
    eval_accumulation_steps: Optional[int] = None,
    learning_rate: float = 5e-05,
    weight_decay: float = 0.0,
    adam_beta1: float = 0.9,
    adam_beta2: float = 0.999,
    adam_epsilon: float = 1e-08,
    max_grad_norm: float = 1.0,
    num_train_epochs: float = 3.0,
    max_steps: int = -1,
    lr_scheduler_type: transformers.trainer_utils.SchedulerType = 'linear',
    warmup_ratio: float = 0.0,
    warmup_steps: int = 0,
    logging_dir: Optional[str] = <factory>,
    logging_strategy: transformers.trainer_utils.IntervalStrategy = 'steps',
    logging_first_step: bool = False,
    logging_steps: int = 500,
    save_strategy: transformers.trainer_utils.IntervalStrategy = 'steps',
    save_steps: int = 500,
    save_total_limit: Optional[int] = None,
    no_cuda: bool = False,
    seed: int = 42,
    fp16: bool = False,
    fp16_opt_level: str = 'O1',
    fp16_backend: str = 'auto',
    fp16_full_eval: bool = False,
    local_rank: int = -1,
    tpu_num_cores: Optional[int] = None,
    tpu_metrics_debug: bool = False,
    debug: bool = False,
    dataloader_drop_last: bool = False,
    eval_steps: int = None,
    dataloader_num_workers: int = 0,
    past_index: int = -1,
    run_name: Optional[str] = None,
    disable_tqdm: Optional[bool] = None,
    remove_unused_columns: Optional[bool] = True,
    label_names: Optional[List[str]] = None,
    load_best_model_at_end: Optional[bool] = False,
    metric_for_best_model: Optional[str] = None,
    greater_is_better: Optional[bool] = None,
    ignore_data_skip: bool = False,
    sharded_ddp: str = '',
    deepspeed: Optional[str] = None,
    label_smoothing_factor: float = 0.0,
    adafactor: bool = False,
    group_by_length: bool = False,
    length_column_name: Optional[str] = 'length',
    report_to: Optional[List[str]] = None,
    ddp_find_unused_parameters: Optional[bool] = None,
    dataloader_pin_memory: bool = True,
    skip_memory_metrics: bool = False,
    mp_parameters: str = '',
)

  • output_dir (str) – where files produced during training are written: model weights, checkpoints, log files, and so on;
  • overwrite_output_dir (bool, optional, defaults to False) – if True, files under output_dir are overwritten. If output_dir points to a model checkpoint (a snapshot of the model and its configuration saved at some epoch or step), training automatically resumes from that checkpoint;
  • do_train
  • do_eval
  • do_predict

These three flags have nothing to do with the Trainer's behaviour and can be ignored; they are just hyperparameter switches that come in handy when you later write your own python xxx.py training script.

https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py

See that example. If you run everything directly in a Jupyter notebook rather than via python xxx.py, these flags are of little use.
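For the curious, this is roughly how the official example scripts use these flags; a simplified sketch, not the exact run_mlm.py code (test_dataset here is a made-up name):

# Gate each stage of the script on the corresponding flag.
if training_args.do_train:
    trainer.train()
if training_args.do_eval:
    print(trainer.evaluate())
if training_args.do_predict:
    predictions = trainer.predict(test_dataset)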

  • evaluation_strategy (str or IntervalStrategy, optional, defaults to "no") –
    The evaluation strategy to adopt during training. Possible values are:
    • "no": No evaluation is done during training.
    • "steps": Evaluation is done (and logged) every eval_steps.
    • "epoch": Evaluation is done at the end of each epoch.

evaluation_strategy sets how evaluation is triggered during training. steps is usually the most convenient choice, because the frequency can then be controlled via eval_steps below; evaluating only at the end of each epoch tends to take too long.
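For example, a minimal TrainingArguments along these lines (the numbers are purely illustrative):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",
    evaluation_strategy="steps",   # evaluate every eval_steps update steps
    eval_steps=200,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)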

  • prediction_loss_only (bool, optional, defaults to False) – When performing evaluation and generating predictions, only returns the loss.

If set to True, only the loss is returned. This flag matters: if you want to report custom evaluation metrics through the Trainer's compute_metrics (AUC and the like), it must be False, otherwise the custom metric is ignored and only the loss is reported.

  • per_device_train_batch_size (int, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for training.

The Trainer enables torch's multi-GPU mode automatically by default; this sets the number of samples per GPU. For multi-GPU training you generally want the GPUs to have similar performance, otherwise overall throughput is bounded by the slowest one: if the fast GPU takes 5 seconds per batch (50 seconds for 10 batches) and the slow one takes 500 seconds per batch, the fast GPU has to wait for the slow one to finish its batch before the weights can be updated together, so things end up slower.

  • per_device_eval_batch_size (int, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for evaluation.

Same as above, but for the evaluation batches.

  • gradient_accumulation_steps (int, optional, defaults to 1) –
    Number of updates steps to accumulate the gradients for, before performing a backward/update pass.

Gradient accumulation, a trick that trades time for GPU memory. Very handy and practical. Defaults to 1; if set to n, we run the forward pass n times, accumulate the n losses, and only then update the parameters.

Gradient accumulation is the classic time-for-memory trade. Say we want to train with a large batch of 256 rather than a small batch of 32 (small batches can be noisy and hurt the final model), but the GPU cannot fit a batch of 256. We can then set this parameter to 256/32 = 8: in plain torch terms, run forward and compute the loss 8 times, then call optimizer.step().
Note that once gradient accumulation is on, step-based settings such as eval_steps count optimizer updates and scale accordingly: if before we used a batch of 256 and evaluated every 10 steps, then after switching to batch size 32 with gradient_accumulation_steps=8 the Trainer will by default evaluate after every 8*10 = 80 forward batches (still 10 updates).
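In plain torch the accumulation loop looks roughly like this; a sketch only (it assumes a HuggingFace-style model whose output exposes .loss, plus an existing dataloader and optimizer), the Trainer does all of this for you:

accumulation_steps = 8
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(**batch).loss / accumulation_steps  # scale so the accumulated gradient matches a big batch
    loss.backward()                                  # gradients keep adding up across iterations
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                             # one update per 8 forward passes
        optimizer.zero_grad()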

  • eval_accumulation_steps (int, optional) – Number of predictions steps to accumulate the output tensors for, before moving the results to the CPU. If left unset, the whole predictions are accumulated on GPU/TPU before being moved to the CPU (faster but requires more memory).

Similar in spirit to the above; no need to repeat it.

  • learning_rate (float, optional, defaults to 5e-5) – The initial learning rate for AdamW optimizer.

The initial learning rate. The default optimizer is AdamW; you can of course swap in a different optimizer through the customization options.

  • weight_decay (float, optional, defaults to 0) – The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in AdamW optimizer.

By default the Trainer applies no weight decay to LayerNorm parameters or to any bias terms. The knowledge a pretrained model has learned from large corpora lives mainly in its weights, and this is a common BERT fine-tuning trick: per-group weight decay (essentially L2 regularization). The biases and LayerNorm parameters matter less, while the knowledge-carrying weights should not drift too far. Weight decay only constrains the magnitude of the weights, but since the weights of a well-pretrained model are already fairly stable, it also indirectly discourages them from changing too much too quickly.
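A sketch of what this grouping amounts to in plain torch (model is assumed to be an already-built nn.Module; the Trainer sets up something equivalent internally):

from torch.optim import AdamW  # transformers' own AdamW can be used the same way

no_decay = ["bias", "LayerNorm.weight"]
grouped_params = [
    {  # knowledge-carrying weights get weight decay
        "params": [p for n, p in model.named_parameters()
                   if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.01,
    },
    {  # biases and LayerNorm parameters are left alone
        "params": [p for n, p in model.named_parameters()
                   if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(grouped_params, lr=5e-5)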

  • adam_beta1 (float, optional, defaults to 0.9) – The beta1 hyperparameter for the AdamW optimizer.
  • adam_beta2 (float, optional, defaults to 0.999) – The beta2 hyperparameter for the AdamW optimizer.
  • adam_epsilon (float, optional, defaults to 1e-8) – The epsilon hyperparameter for the AdamW optimizer.

Hyperparameters of the AdamW optimizer; see this write-up on how AdamW works:

Synced (机器之心): "The fastest way to train neural networks today: the AdamW optimizer and super-convergence" (zhuanlan.zhihu.com)

  • max_grad_norm (float, optional, defaults to 1.0) – Maximum gradient norm (for gradient clipping).

Gradient clipping: caps the gradient norm so that an overly large gradient does not cause a huge weight update and destabilize the model.
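In plain torch this boils down to the standard clipping call, placed between loss.backward() and optimizer.step() (sketch; model is assumed to exist):

import torch

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)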

  • num_train_epochs (float, optional, defaults to 3.0) – Total number of training epochs to perform (if not an integer, will perform the decimal part percents of the last epoch before stopping training).

The number of epochs; self-explanatory.

  • max_steps (int, optional, defaults to -1) – If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs.

One step here is one parameter update (one batch, or one accumulation cycle if gradient accumulation is on). It plays the same role as num_train_epochs; the two conflict, so set only one of them, and max_steps takes precedence.

  • lr_scheduler_type (str or SchedulerType, optional, defaults to "linear") – The scheduler type to use. See the documentation of SchedulerType for all possible values.

See the Optimization page of the huggingface.co docs.

These are the lr scheduler types defined by HuggingFace. The easiest way to understand the different schedulers is simply to look at their learning-rate curves:

[Figure: learning-rate curve of the linear schedule.] Interpret it together with the two parameters below.

  • warmup_ratio (float, optional, defaults to 0.0) – Ratio of total training steps used for a linear warmup from 0 to learning_rate.

With the linear schedule, the learning rate first ramps up linearly from 0 to the initial value we set. If our initial learning rate is 1, the model reaches it after

warmup_ratio * total number of training steps

update steps.

  • warmup_steps (int, optional, defaults to 0) – Number of steps used for a linear warmup from 0 to learning_rate. Overrides any effect of warmup_ratio.

Directly specifies how many steps it takes to reach the initial learning rate; it overrides any effect of warmup_ratio.

Both default to 0, so training starts right at the initial learning rate, which then decays linearly with the number of steps: with an initial learning rate of 100 and 100 total steps, it drops by 1 per step. It never literally reaches 0 though; if I remember correctly a very small value is used as the final learning rate.
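The equivalent scheduler call is roughly the following (get_linear_schedule_with_warmup is what the linear type maps to; the step counts are made up, and optimizer is assumed to exist):

from transformers import get_linear_schedule_with_warmup

total_steps = 1000  # e.g. batches per epoch * epochs / gradient_accumulation_steps
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # what warmup_ratio=0.1 works out to
    num_training_steps=total_steps,
)
# call scheduler.step() after each optimizer.step()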

  • logging_dir (str, optional) – TensorBoard log directory. Will default to runs/**CURRENT_DATETIME_HOSTNAME**.
  • logging_strategy (str or IntervalStrategy, optional, defaults to "steps") –
    The logging strategy to adopt during training. Possible values are:
    • "no": No logging is done during training.
    • "epoch": Logging is done at the end of each epoch.
    • "steps": Logging is done every logging_steps.
  • logging_first_step (bool, optional, defaults to False) – Whether to log and evaluate the first global_step or not.
  • logging_steps (int, optional, defaults to 500) – Number of update steps between two logs if logging_strategy="steps".

Logging-related settings; the docstrings above are self-explanatory. They save the loss, gradients and other information during training so you can analyze them later with tools like TensorBoard.

  • save_strategy (str or IntervalStrategy, optional, defaults to "steps") –
    The checkpoint save strategy to adopt during training. Possible values are:
    • "no": No save is done during training.
    • "epoch": Save is done at the end of each epoch.
    • "steps": Save is done every save_steps.
  • save_steps (int, optional, defaults to 500) – Number of updates steps before two checkpoint saves if save_strategy="steps".
  • save_total_limit (int, optional) – If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir.

Checkpoint-related settings, also straightforward. Do set save_total_limit to a fixed constant: each checkpoint stores the complete model and can easily be gigabytes in size, so keeping too many wastes disk space.
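For instance (values illustrative):

training_args = TrainingArguments(
    output_dir="./outputs",
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,   # keep only the 2 most recent checkpoints on disk
)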

  • no_cuda (bool, optional, defaults to False) – Whether to not use CUDA even when it is available or not.

Whether to disable CUDA even when it is available.

  • seed (int, optional, defaults to 42) – Random seed that will be set at the beginning of training. To ensure reproducibility across runs, use the model_init() function to instantiate the model if it has some randomly initialized parameters.

Fixes the random seed so runs are reproducible.

  • fp16 (bool, optional, defaults to False) – Whether to use 16-bit (mixed) precision training instead of 32-bit training.
  • fp16_opt_level (str, optional, defaults to 'O1') – For fp16 training, Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. See details on the Apex documentation.
  • fp16_backend (str, optional, defaults to "auto") – The backend to use for mixed precision training. Must be one of "auto", "amp" or "apex". "auto" will use AMP or APEX depending on the PyTorch version detected, while the other choices will force the requested backend.
  • fp16_full_eval (bool, optional, defaults to False) – Whether to use full 16-bit precision evaluation instead of 32-bit. This will be faster and save memory but can harm metric values.

Mixed-precision training settings; both the amp and apex backends are supported. Mixed precision is a big enough topic that, after going through its caveats and how apex works, I will write a separate post summarizing mixed-precision training and apex usage.

  • local_rank (int, optional, defaults to -1) – Rank of the process during distributed training.

The Trainer uses the torch.distributed API for multi-GPU training by default, so multi-node multi-GPU, single-node multi-GPU and single-GPU setups all work out of the box. To force it onto specific GPUs, just restrict the visible devices via the CUDA_VISIBLE_DEVICES environment variable.

Note that torch orders GPU ids by speed: the faster the GPU, the smaller its id, so gpu id 0 is the fastest card. This differs from the ordering shown by nvidia-smi.

"Resolving the mismatch between GPU ids in PyTorch code and in nvidia-smi" (blog.csdn.net)

A simple check is to call torch.cuda.get_device_name to get the name of the current GPU and compare; the approach in the link above also works well.
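A tiny sketch of both tricks (the environment variable has to be set before CUDA is initialized, ideally before importing torch):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # restrict to the cards you actually want

import torch
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))  # check which physical card each id maps to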

  • tpu_num_cores (int, optional) – When training on TPU, the number of TPU cores (automatically passed by launcher script).
  • debug (bool, optional, defaults to False) – When training on TPU, whether to print debug metrics or not.

TPU-related settings.

  • dataloader_num_workers (int, optional, defaults to 0) – Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the main process.

  • dataloader_drop_last (bool, optional, defaults to False) – Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size) or not.

Whether to drop the final batch that does not fill batch_size; better left as False, no need to waste data. Note that the Trainer wraps a DataLoader internally but the data arguments it takes are Datasets (I have not tried passing a DataLoader directly, but given how convenient the Trainer's dynamic padding is, I stick with Datasets). num_workers needs no introduction.

  • eval_steps (int, optional) – Number of update steps between two evaluations if evaluation_strategy="steps". Will default to the same value as logging_steps if not set.

Self-explanatory; no need to elaborate.

  • past_index (int, optional, defaults to -1) – Some models like TransformerXL or XLNet can make use of the past hidden states for their predictions. If this argument is set to a positive int, the Trainer will use the corresponding output (usually index 2) as the past state and feed it to the model at the next training step under the keyword argument mems.

The docstring above says it all.

  • run_name (str, optional) – A descriptor for the run. Typically used for wandb logging.

A name for the run, used by the wandb logging integration.

  • disable_tqdm (bool, optional) – Whether or not to disable the tqdm progress bars and table of metrics produced by NotebookTrainingTracker in Jupyter Notebooks. Will default to True if the logging level is set to warn or lower (default), False otherwise.

The Trainer shows a progress bar during training; this flag turns it off. Not recommended, it is nice to look at.

  • remove_unused_columns (bool, optional, defaults to True) –
    If using datasets.Dataset datasets, whether or not to automatically remove the columns unused by the model forward method.
    (Note that this behavior is not implemented for TFTrainer yet.)
    This automatically drops columns the model's forward method does not need. The Trainer's dataset is expected to return dicts whose keys include input_ids and the other model inputs, hence this parameter; make sure your dataset returns a dict whose keys match the argument names of the model's forward.

  • label_names (List[str], optional) –
    The list of keys in your dictionary of inputs that correspond to the labels.
    Will eventually default to ["labels"] except if the model used is one of the XxxForQuestionAnswering in which case it will default to ["start_positions", "end_positions"]. This sets the key(s) used for the labels; the default is fine, just make sure your dataset returns the labels under the key labels.
  • load_best_model_at_end (bool, optional, defaults to False) –
    Whether or not to load the best model found during training at the end of training.
    Note
    When set to True, the parameters save_strategy and save_steps will be ignored and the model will be saved after each evaluation. The note above is clear enough; either way, the printed eval results are not affected.
  • metric_for_best_model (str, optional) –
    Use in conjunction with load_best_model_at_end to specify the metric to use to compare two different models. Must be the name of a metric returned by the evaluation with or without the prefix "eval_". Will default to "loss" if unspecified and load_best_model_at_end=True (to use the evaluation loss).
    If you set this value, greater_is_better will default to True. Don't forget to set it to False if your metric is better when lower.
  • greater_is_better (bool, optional) –
    Use in conjunction with load_best_model_at_end and metric_for_best_model to specify if better models should have a greater metric or not. Will default to:
    • True if metric_for_best_model is set to a value that isn't "loss" or "eval_loss".
    • False if metric_for_best_model is not set, or set to "loss" or "eval_loss".

Eval-related; the docstrings above are fairly clear. Since we may want a custom eval metric such as AUC, the custom metric function must return a dict, e.g. {'AUC': roc_auc_score(y_true, y_pred)}, and metric_for_best_model is then simply set to AUC.
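A sketch of such an AUC metric and how it is wired up, assuming a binary classifier whose p.predictions is a plain (N, 2) logits array:

import numpy as np
from sklearn.metrics import roc_auc_score
from transformers import EvalPrediction, TrainingArguments

def compute_auc(p: EvalPrediction):
    logits = p.predictions
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))  # numerically stable softmax
    probs = exp[:, 1] / exp.sum(axis=-1)                       # positive-class probability
    return {"AUC": roc_auc_score(p.label_ids, probs)}

training_args = TrainingArguments(
    output_dir="./outputs",
    evaluation_strategy="steps",
    eval_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="AUC",   # must match the key returned above
    greater_is_better=True,
)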

  • ignore_data_skip (bool, optional, defaults to False) – When resuming training, whether or not to skip the epochs and batches to get the data loading at the same stage as in the previous training. If set to True, the training will begin faster (as that skipping step can take a long time) but will not yield the same results as the interrupted training would have.

For example, with a batch size of 256 and 10 batches per epoch, suppose training stopped at the 5th batch (0.5 epochs). When resuming from the checkpoint, the default (False) replays and skips the first 5 batches so the data loader ends up at the same position as before; setting this to True skips that replay and just starts from the beginning of the data, which is faster but not exactly equivalent to the interrupted run.

  • sharded_ddp (bool, str or list of ShardedDDPOption, optional, defaults to False) –
    Use Sharded DDP training from FairScale (in distributed training only). This is an experimental feature.
    A list of options along the following:
    • "simple": to use first instance of sharded DDP released by fairscale (ShardedDDP) similar to ZeRO-2.
    • "zero_dp_2": to use the second instance of sharded DPP released by fairscale (FullyShardedDDP) in Zero-2 mode (with reshard_after_forward=False).
    • "zero_dp_3": to use the second instance of sharded DPP released by fairscale (FullyShardedDDP) in Zero-3 mode (with reshard_after_forward=True).
    • "offload": to add ZeRO-offload (only compatible with "zero_dp_2" and "zero_dp_3").

If a string is passed, it will be split on space. If a bool is passed, it will be converted to an empty list for False and ["simple"] for True.

"Sharded: double your PyTorch model size on the same GPU memory" (deephub, blog.csdn.net)

Sharded DDP: roughly speaking, a set of strategies to speed up (and save memory in) multi-GPU training. There is a lot to it, so I will cover it later in a post on torch performance optimization (the Trainer's design is very close to pytorch lightning's, and the two share essentially the same set of these optimizations).

  • deepspeed (str or dict, optional) – Use Deepspeed. This is an experimental feature and its API may evolve in the future. The value is either the location of DeepSpeed json config file (e.g., ds_config.json) or an already loaded json file as a dict

"How do you evaluate Microsoft's open-source distributed training framework DeepSpeed?" (www.zhihu.com)

Reading through the Trainer's design is honestly a good way to review most of torch's optimization methods and training tricks in one pass.

  • label_smoothing_factor (float, optional, defaults to 0.0) – The label smoothing factor to use. Zero means no label smoothing, otherwise the underlying onehot-encoded labels are changed from 0s and 1s to label_smoothing_factor/num_labels and 1 - label_smoothing_factor + label_smoothing_factor/num_labels respectively.

Label smoothing, applied directly to the labels, so you no longer need to bother with building a separate label-smoothing loss. But precisely because the classification labels themselves are modified, metrics like AUC computed in a custom compute_metrics need to map the smoothed labels back to hard labels, otherwise they will error out.

  • adafactor (bool, optional, defaults to False) – Whether or not to use the Adafactor optimizer instead of AdamW.

Whether to use the Adafactor optimizer instead of AdamW; look up the algorithm if you are curious.

  • group_by_length (bool, optional, defaults to False) – Whether or not to group together samples of roughly the same length in the training dataset (to minimize padding applied and be more efficient). Only useful if applying dynamic padding.

A very handy Trainer feature related to dynamic padding: samples of roughly the same length are grouped (bucketed) together so that later padding is minimized. Note that the padding itself is not done by this flag but by the data collator, discussed below. You can of course also do dynamic padding by hand, via the collate_fn argument of a DataLoader.

  • length_column_name (str, optional, defaults to "length") – Column name for precomputed lengths. If the column exists, grouping by length will use these values rather than computing them on train startup. Ignored unless group_by_length is True and the dataset is an instance of Dataset.
  • report_to (str or List[str], optional, defaults to "all") – The list of integrations to report the results and logs to. Supported platforms are "azure_ml", "comet_ml", "mlflow", "tensorboard" and "wandb". Use "all" to report to all integrations installed, "none" for no integrations.

The docstrings are straightforward; nothing to add.

  • ddp_find_unused_parameters (bool, optional) – When using distributed training, the value of the flag find_unused_parameters passed to DistributedDataParallel. Will default to False if gradient checkpointing is used, True otherwise.

This is the find_unused_parameters flag of DistributedDataParallel.

  • dataloader_pin_memory (bool, optional, defaults to True) – Whether you want to pin memory in data loaders or not. Will default to True.

A quick note on pin_memory: ordinarily data in host memory lives either in pinned (page-locked) pages or in pageable memory that can be swapped to disk. With this set to True, batches are placed directly in pinned memory and transferred straight to CUDA; otherwise they first have to be copied from pageable memory into pinned memory and only then to CUDA, which is slower. The trade-off is higher host-memory usage.

  • skip_memory_metrics (bool, optional, defaults to False) – Whether to skip adding of memory profiler reports to metrics. Defaults to False.

Whether to skip saving memory-usage reports into the metrics; I have basically never used it.


OK, back to the Trainer's remaining arguments.

  • data_collator (DataCollator, optional) – The function to use to form a batch from a list of elements of train_dataset or eval_dataset. Will default to default_data_collator() if no tokenizer is provided, an instance of DataCollatorWithPadding() otherwise.

data_collator is HuggingFace's hook for custom batch-level data processing.

https://github.com/huggingface/transformers/blob/66446909b236c17498276857fa88e23d2c91d004/src/transformers/data/data_collator.py

This module defines many convenient preprocessing collators, for example the masking logic for MLM, which automatically masks the inputs and returns both the masked batch and the matching labels, ready for training.

What is especially nice is that most common strategies are implemented here, including plain token masking and whole-word masking, and under the hood these collators are plugged into the DataLoader's collate_fn. That also means you can build your own DataLoader and pass one of the transformers.data.data_collator classes as collate_fn to prepare data quickly and conveniently (see the sketch below).
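For instance, a rough sketch of plugging the MLM collator into a plain DataLoader (tokenizer and encoded_dataset, a dataset returning dicts with input_ids, are assumed to exist):

from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling

mlm_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)
loader = DataLoader(encoded_dataset, batch_size=16, collate_fn=mlm_collator)
# each batch now carries freshly masked input_ids plus the matching labels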

"PyTorch collate_fn usage" (www.cnblogs.com)

collate_fn is a very useful hook: by default a DataLoader cannot batch samples of different lengths, but collate_fn lets you get around that easily, and the dynamic padding mentioned earlier is implemented exactly here, saving a lot of memory.

One small pity is that the Trainer does not accept just any simple custom function as the data collator, but that is not a problem:

https://github.com/huggingface/transformers/blob/66446909b236c17498276857fa88e23d2c91d004/src/transformers/data/data_collator.py

We can write our own collator modelled on the official implementations; the syntax is very simple, just follow the pattern above. A nice side benefit of rolling your own is that you are no longer tied to a pretrained tokenizer the way the official collators are: you can use a custom tokenizer or none at all. Looking at the official collators makes this obvious — each one is just a small routine that turns a list of samples into a batch.
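A rough sketch of such a hand-rolled collator, here just a callable that does manual dynamic padding (pad_id and the field names input_ids / label are assumptions about your own dataset):

import torch

def my_collator(batch, pad_id=0):
    # batch is a list of dicts with variable-length "input_ids" and a scalar "label"
    max_len = max(len(x["input_ids"]) for x in batch)
    input_ids = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((len(batch), max_len), dtype=torch.long)
    for i, x in enumerate(batch):
        ids = torch.as_tensor(x["input_ids"], dtype=torch.long)
        input_ids[i, : len(ids)] = ids
        attention_mask[i, : len(ids)] = 1
    labels = torch.as_tensor([x["label"] for x in batch], dtype=torch.long)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}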

  • train_dataset (torch.utils.data.dataset.Dataset, optional) – The dataset to use for training. If it is an datasets.Dataset, columns not accepted by the model.forward() method are automatically removed.
  • eval_dataset (torch.utils.data.dataset.Dataset, optional) – The dataset to use for evaluation. If it is an datasets.Dataset, columns not accepted by the model.forward() method are automatically removed.
  • tokenizer (PreTrainedTokenizerBase, optional) – The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs the maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an interrupted training or reuse the fine-tuned model.
  • model_init (Callable[[], PreTrainedModel], optional) –
    A function that instantiates the model to be used. If provided, each call to train() will start from a new instance of the model as given by this function.
    The function may have zero argument, or a single one containing the optuna/Ray Tune trial object, to be able to choose different architectures according to hyper parameters (such as layer count, sizes of inner layers, dropout probabilities etc).

These are straightforward; just note that what you pass in are Datasets, not DataLoaders.
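Putting the pieces together, a minimal assembly sketch (model, tokenizer, train_ds and eval_ds are assumed to be ready):

from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,                 # or model_init=model_init
    args=TrainingArguments(output_dir="./outputs", evaluation_strategy="epoch"),
    train_dataset=train_ds,      # plain Datasets, not DataLoaders
    eval_dataset=eval_ds,
    tokenizer=tokenizer,         # enables automatic padding when batching
)
trainer.train()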

  • compute_metrics (Callable[[EvalPrediction], Dict], optional) – The function that will be used to compute metrics at evaluation. Must take a EvalPrediction and return a dictionary string to metric values.

This is where custom metrics go; the syntax is simple:

from typing import Dict

import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction) -> Dict:
    preds, labels = p
    preds = np.argmax(preds, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels.flatten(), preds.flatten(), average='weighted', zero_division=0)
    return {
        'accuracy': (preds == p.label_ids).mean(),
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
  • It is basically just a plain Python function.

  • callbacks (List of TrainerCallback, optional) –
    A list of callbacks to customize the training loop. Will add those to the list of default callbacks detailed in here.
    If you want to remove one of the default callbacks used, use the Trainer.remove_callback() method.
  • Callbacks work much like Keras callbacks, and custom ones are written in a similar way. The most commonly needed one, early stopping, is already provided: just do from transformers import EarlyStoppingCallback and put it in this argument; the metric used for early stopping follows metric_for_best_model (see the sketch after the optimizers parameter below).
  • optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR, optional) – A tuple containing the optimizer and the scheduler to use. Will default to an instance of AdamW on your model and a scheduler given by get_linear_schedule_with_warmup() controlled by args.

This is the quick way to plug in a different optimizer and lr scheduler; any torch (or torch-style) optimizer is supported.
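A sketch covering the last two arguments together, with early stopping on the callbacks side and a hand-picked optimizer/scheduler pair (the values are made up; model, train_ds, eval_ds and the compute_metrics defined above are assumed to exist):

import torch
from transformers import (Trainer, TrainingArguments, EarlyStoppingCallback,
                          get_linear_schedule_with_warmup)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100,
                                            num_training_steps=1000)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./outputs",
        evaluation_strategy="steps",
        eval_steps=200,
        load_best_model_at_end=True,       # required for early stopping
        metric_for_best_model="eval_loss",
        greater_is_better=False,
    ),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    optimizers=(optimizer, scheduler),     # overrides the default AdamW + linear schedule
)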


That is it for the Trainer. The next post covers customizing transformers: while working on multi-task learning I found that transformers does not offer an official multi-task interface, but there are some official demo snippets, so I want to go through that customization carefully.


All in all, for pretrained models the Trainer can fully replace a hand-written training loop; things like adversarial training, multi-objective optimization and weighted training can also be implemented quite easily by subclassing the Trainer.

For other scenarios, such as tabular data, this interface is less friendly; there pytorch lightning gets you going with close to zero extra learning cost.