Customizing datasets in MMDetection 2.6

November 19, 2020
AI
Object detection

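The official documentation recommends several ways to support a custom dataset:

Reorganize your dataset into a standard dataset format (usually COCO)
Reorganize your dataset into the middle format
Use dataset wrappers to define a new dataset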

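1. Reorganizing the dataset into COCO format

The COCO annotation format:

The whole file is a single dictionary whose main keys are images, annotations, and categories.

The value of images is a list; each element carries the information shown below.
The value of annotations is a list; each element carries the information shown below.
The value of categories is a list; each element carries the information shown below, with id starting from 0.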

'images': [
    {
        'file_name': 'COCO_val2014_000000001268.jpg',
        'height': 427,
        'width': 640,
        'id': 1268
    },
    ...
],

'annotations': [
    {
        'segmentation': [[192.81,
            247.09,
            ...
            219.03,
            249.06]],  # if you have mask labels
        'area': 1035.749,
        'iscrowd': 0,
        'image_id': 1268,
        'bbox': [192.81, 224.8, 74.73, 33.43],
        'category_id': 16,
        'id': 42986
    },
    ...
],

'categories': [
    {'id': 0, 'name': 'car'},
]

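The simplest approach is to reorganize your dataset into the COCO annotation format. Training then only requires changing the dataset paths and the classes in the config file.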
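As a rough illustration, a COCO-style annotation dictionary can be assembled from plain Python data and serialized as JSON. The helper name build_coco_dict and the layout of the input records are assumptions made for this sketch, not part of MMDetection. Boxes are given as [x, y, width, height], which is the COCO bbox convention.

```python
import json


def build_coco_dict(records, class_names):
    """Build a COCO-style dict.

    records: list of (file_name, width, height, boxes), where boxes is a
    list of (x, y, w, h, class_index). This input layout is hypothetical.
    """
    categories = [{'id': i, 'name': name} for i, name in enumerate(class_names)]
    images, annotations = [], []
    ann_id = 0
    for image_id, (file_name, width, height, boxes) in enumerate(records):
        images.append({
            'file_name': file_name,
            'height': height,
            'width': width,
            'id': image_id,
        })
        for x, y, w, h, class_index in boxes:
            annotations.append({
                'area': w * h,  # box area, since there is no mask here
                'iscrowd': 0,
                'image_id': image_id,
                'bbox': [x, y, w, h],
                'category_id': class_index,
                'id': ann_id,
            })
            ann_id += 1
    return {'images': images, 'annotations': annotations, 'categories': categories}


records = [('a.jpg', 1280, 720, [(10, 20, 30, 40, 0)])]
coco = build_coco_dict(records, ['car'])
coco_json = json.dumps(coco)  # write this string to the annotation file
```

The resulting JSON file is what the ann_file entry of a COCO-style dataset config would point to.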

2. Reorganizing the dataset into the middle format

MMDetection defines a simpler dataset format: the annotation file is a list of dictionaries, one per image, as shown below:


[
    {
        'filename': 'a.jpg',
        'width': 1280,
        'height': 720,
        'ann': {
            'bboxes': <np.ndarray, float32> (n, 4),
            'labels': <np.ndarray, int64> (n, ),
            'bboxes_ignore': <np.ndarray, float32> (k, 4),
            'labels_ignore': <np.ndarray, int64> (k, ) (optional field)
        }
    },
    ...
]

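Once the annotations are in this format, there are two ways to use the dataset, online and offline:

Online conversion

Write a new class that inherits from CustomDataset and override two methods, load_annotations(self, ann_file) and get_ann_info(self, idx).

Offline conversion

Convert the annotations to the middle format above ahead of time and save them to a pickle or json file, then use CustomDataset directly.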
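One way to produce a middle-format annotation file offline is sketched below. The helper to_middle_format is a hypothetical name, and the sketch assumes numpy is installed. It round-trips the data through pickle in memory; in practice you would write it to a .pkl file with pickle.dump and point ann_file at that file.

```python
import pickle

import numpy as np


def to_middle_format(filename, width, height, boxes, labels):
    """Build one middle-format record (hypothetical helper for this sketch)."""
    return dict(
        filename=filename,
        width=width,
        height=height,
        ann=dict(
            # reshape(-1, 4) keeps the (n, 4) shape even when n == 0
            bboxes=np.array(boxes, dtype=np.float32).reshape(-1, 4),
            labels=np.array(labels, dtype=np.int64)))


annotations = [
    to_middle_format('a.jpg', 1280, 720, [[10, 20, 40, 60]], [0]),
    to_middle_format('b.jpg', 1280, 720, [], []),  # image without boxes
]

payload = pickle.dumps(annotations)  # use pickle.dump(annotations, f) for a file
restored = pickle.loads(payload)
```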

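3. A simple custom dataset example

Suppose the existing annotations are stored in a txt file. Each record starts with a line containing only #, followed by the image name, the image width and height, the number of bboxes, and then one line per bbox with its coordinates and class id:

#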
000001.jpg
1280 720
2
10 20 40 60 1
20 40 50 60 2

#
000002.jpg
1280 720
3
50 20 40 60 2
20 40 30 45 2
30 40 50 60 3

Then create a new file mmdet/datasets/my_dataset.py to load the dataset.

import mmcv
import numpy as np

from .builder import DATASETS
from .custom import CustomDataset


@DATASETS.register_module()
class MyDataset(CustomDataset):

    CLASSES = ('person', 'bicycle', 'car', 'motorcycle')

    def load_annotations(self, ann_file):
        ann_list = mmcv.list_from_file(ann_file)

        data_infos = []
        for i, ann_line in enumerate(ann_list):
            if ann_line != '#':
                continue

            img_shape = ann_list[i + 2].split(' ')
            width = int(img_shape[0])
            height = int(img_shape[1])
            bbox_number = int(ann_list[i + 3])

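            bboxes = []
            labels = []
            # Each bbox line holds four coordinates and a class id;
            # split it into fields before converting.
            for bbox_line in ann_list[i + 4:i + 4 + bbox_number]:
                anns = bbox_line.split(' ')
                bboxes.append([float(ann) for ann in anns[:4]])
                labels.append(int(anns[4]))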

            data_infos.append(
                dict(
                    filename=ann_list[i + 1],
                    width=width,
                    height=height,
                    ann=dict(
                        bboxes=np.array(bboxes).astype(np.float32),
                        labels=np.array(labels).astype(np.int64))
                ))

        return data_infos

    def get_ann_info(self, idx):
        return self.data_infos[idx]['ann']

Then use MyDataset in the config file.

dataset_A_train = dict(
    type='MyDataset',
    ann_file = 'image_list.txt',
    pipeline=train_pipeline
)
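To check the parsing logic without installing MMDetection, the record format can be parsed by a standalone function. This is a sketch only: parse_annotations is a hypothetical name, and it returns plain lists where the dataset class would build numpy arrays.

```python
def parse_annotations(lines):
    """Parse '#'-delimited records into a list of dicts (plain lists, no numpy)."""
    lines = [line.strip() for line in lines]
    data_infos = []
    for i, line in enumerate(lines):
        if line != '#':
            continue  # only a '#' line marks the start of a record
        width, height = (int(v) for v in lines[i + 2].split())
        bbox_number = int(lines[i + 3])
        bboxes, labels = [], []
        for bbox_line in lines[i + 4:i + 4 + bbox_number]:
            fields = bbox_line.split()
            bboxes.append([float(v) for v in fields[:4]])
            labels.append(int(fields[4]))
        data_infos.append(dict(
            filename=lines[i + 1],
            width=width,
            height=height,
            ann=dict(bboxes=bboxes, labels=labels)))
    return data_infos


sample = """#
000001.jpg
1280 720
2
10 20 40 60 1
20 40 50 60 2

#
000002.jpg
1280 720
3
50 20 40 60 2
20 40 30 45 2
30 40 50 60 3
""".splitlines()

infos = parse_annotations(sample)
```

Blank lines between records are harmless here because every line that is not exactly # is skipped when looking for the start of a record.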

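5. Customizing datasets with dataset wrappers

MMDetection supports several dataset wrappers to mix datasets or to modify the dataset distribution during training. Three wrappers are currently supported:

RepeatDataset: simply repeats the whole dataset.
ClassBalancedDataset: repeats the dataset in a class-balanced manner.
ConcatDataset: concatenates datasets.

5.1 RepeatDataset

Use RepeatDataset as a wrapper to repeat a dataset. For example, to repeat the original dataset Dataset_A, the config looks like this: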

dataset_A_train = dict(
        type='RepeatDataset',
        times=N,
        dataset=dict(  # This is the original config of Dataset_A
            type='Dataset_A',
            ...
            pipeline=train_pipeline
        )
    )
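Conceptually, repeating a dataset means exposing times * len(dataset) indices and wrapping each index back into the original range. The class below is a hypothetical stand-in written for illustration, not MMDetection's own implementation.

```python
class RepeatedView:
    """Illustrative repeat wrapper: index i maps to dataset[i % len(dataset)]."""

    def __init__(self, dataset, times):
        self.dataset = dataset
        self.times = times
        self._ori_len = len(dataset)

    def __len__(self):
        return self.times * self._ori_len

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        return self.dataset[idx % self._ori_len]


repeated = RepeatedView(['img0', 'img1', 'img2'], times=2)
```

Repeating a small dataset this way makes one epoch longer, which reduces how often the per-epoch overhead is paid.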

5.2 ClassBalancedDataset

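Use ClassBalancedDataset as a wrapper to repeat a dataset according to category frequency. The dataset to be repeated must implement the function self.get_cat_ids(idx) to support ClassBalancedDataset. For example, to repeat Dataset_A with oversample_thr=1e-3, the config looks like this: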

dataset_A_train = dict(
        type='ClassBalancedDataset',
        oversample_thr=1e-3,
        dataset=dict(  # This is the original config of Dataset_A
            type='Dataset_A',
            ...
            pipeline=train_pipeline
        )
    )
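The sketch below shows one way a class-balanced repeat could be computed. The rule is an assumption based on repeat-factor sampling as described for LVIS: a category seen in a fraction f of the images gets the factor max(1, sqrt(oversample_thr / f)), and an image is repeated by the rounded-up largest factor among its categories. The actual wrapper may differ in detail, and repeat_factors is a hypothetical name.

```python
import math
from collections import Counter


def repeat_factors(cat_ids_per_image, oversample_thr):
    """Return how many times each image is repeated (assumed rule, see lead-in)."""
    num_images = len(cat_ids_per_image)
    image_count = Counter()
    for cat_ids in cat_ids_per_image:
        for cat_id in set(cat_ids):
            image_count[cat_id] += 1  # number of images containing the category

    category_repeat = {
        cat_id: max(1.0, math.sqrt(oversample_thr / (count / num_images)))
        for cat_id, count in image_count.items()
    }

    factors = []
    for cat_ids in cat_ids_per_image:
        image_repeat = max((category_repeat[c] for c in set(cat_ids)), default=1.0)
        factors.append(math.ceil(image_repeat))
    return factors


# cat_ids_per_image[i] plays the role of get_cat_ids(i)
cat_ids_per_image = [[0], [0], [0], [1]]
factors = repeat_factors(cat_ids_per_image, oversample_thr=0.5)
repeated_indices = [i for i, n in enumerate(factors) for _ in range(n)]
```

In this toy example category 1 appears in a quarter of the images, below the threshold of 0.5, so its image is duplicated while the images of the frequent category are left alone.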

5.3 ConcatDataset

There are three ways to concatenate datasets.

5.3.1 The two datasets are of the same type

Use the following config:

dataset_A_train = dict(
    type='Dataset_A',
    ann_file = ['anno_file_1', 'anno_file_2'],
    pipeline=train_pipeline
)

With this config, the two datasets are evaluated separately during testing and validation. To evaluate them as a whole, set separate_eval=False:

dataset_A_train = dict(
    type='Dataset_A',
    ann_file = ['anno_file_1', 'anno_file_2'],
    separate_eval=False,
    pipeline=train_pipeline
)

5.3.2 The two datasets are of different types

dataset_A_train = dict()
dataset_B_train = dict()

data = dict(
   imgs_per_gpu=2,
   workers_per_gpu=2,
   train = [
       dataset_A_train,
       dataset_B_train
   ],
   val = dataset_A_val,
   test = dataset_A_test
   )

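When used for testing or evaluation, this approach also supports evaluating each dataset separately.

5.3.3 Defining the concatenation explicitly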

dataset_A_val = dict()
dataset_B_val = dict()

data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=2,
    train=dataset_A_train,
    val=dict(
        type='ConcatDataset',
        datasets=[dataset_A_val, dataset_B_val],
        separate_eval=False))

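With separate_eval=False, all the datasets are treated as one single dataset during testing and validation.

Notes:

The option separate_eval=False assumes that the datasets use self.data_infos during evaluation. COCO datasets therefore do not support this behavior, because they do not fully rely on self.data_infos for evaluation. Combining different types of datasets and evaluating them as a whole is therefore not recommended.
Evaluating ClassBalancedDataset and RepeatDataset is not supported, so evaluating concatenated datasets of these types is not supported either.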
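Concatenation itself can be pictured as mapping a global index to a (dataset, local index) pair using cumulative sizes. The class below is a hypothetical illustration of that idea, not MMDetection's ConcatDataset.

```python
import bisect
from itertools import accumulate


class ConcatenatedView:
    """Illustrative concatenation of several indexable datasets."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        # cumulative_sizes[k] is the total length of datasets[0..k]
        self.cumulative_sizes = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.cumulative_sizes[-1] if self.cumulative_sizes else 0

    def locate(self, idx):
        """Map a global index to (dataset index, index inside that dataset)."""
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        dataset_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        offset = self.cumulative_sizes[dataset_idx - 1] if dataset_idx > 0 else 0
        return dataset_idx, idx - offset

    def __getitem__(self, idx):
        dataset_idx, sample_idx = self.locate(idx)
        return self.datasets[dataset_idx][sample_idx]


merged = ConcatenatedView([['a0', 'a1'], ['b0', 'b1', 'b2']])
```

Keeping the dataset index around is what makes separate evaluation possible: results can be split back at the cumulative boundaries and scored per dataset.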

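A more complex case: to repeat Dataset_A and Dataset_B N and M times respectively and then concatenate them, use the following config: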

dataset_A_train = dict(
    type='RepeatDataset',
    times=N,
    dataset=dict(
        type='Dataset_A',
        ...
        pipeline=train_pipeline
    )
)
dataset_A_val = dict(
    ...
    pipeline=test_pipeline
)
dataset_A_test = dict(
    ...
    pipeline=test_pipeline
)
dataset_B_train = dict(
    type='RepeatDataset',
    times=M,
    dataset=dict(
        type='Dataset_B',
        ...
        pipeline=train_pipeline
    )
)
data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=2,
    train = [
        dataset_A_train,
        dataset_B_train
    ],
    val = dataset_A_val,
    test = dataset_A_test
)

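6. Modifying the dataset classes

You can train on a subset of a dataset by modifying its classes. For example, if the existing dataset has 20 classes, you can set the classes used for training so that only three of them are trained; MMDetection automatically filters out the ground-truth boxes of the other classes.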

classes = ('person', 'bicycle', 'car')
data = dict(
    train=dict(classes=classes),
    val=dict(classes=classes),
    test=dict(classes=classes))
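The effect of restricting the classes can be sketched as dropping every box whose class is not selected and renumbering the remaining labels by their position in classes. The function select_classes and its inputs are hypothetical; the library does this internally for the supported dataset types.

```python
def select_classes(bboxes, label_names, classes):
    """Keep only boxes whose class name is in classes; relabel by position."""
    name_to_label = {name: i for i, name in enumerate(classes)}
    kept_boxes, kept_labels = [], []
    for box, name in zip(bboxes, label_names):
        if name in name_to_label:
            kept_boxes.append(box)
            kept_labels.append(name_to_label[name])
    return kept_boxes, kept_labels


classes = ('person', 'bicycle', 'car')
bboxes = [[10, 20, 40, 60], [5, 5, 30, 30], [50, 20, 90, 80]]
label_names = ['car', 'dog', 'person']  # 'dog' is not selected and is dropped

kept_boxes, kept_labels = select_classes(bboxes, label_names, classes)
```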

MMDetection 2.0 also supports reading the class names from a file, for example a txt file with one class name per line:

person
bicycle
car

Use it as follows:

classes = 'path/to/classes.txt'
data = dict(
    train=dict(classes=classes),
    val=dict(classes=classes),
    test=dict(classes=classes))
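A classes setting that accepts either a sequence of names or a path to a text file could be resolved roughly as follows. This is a sketch: resolve_classes is a hypothetical name, and skipping blank lines is a choice made here rather than documented library behavior.

```python
import os
import tempfile


def resolve_classes(classes, default=()):
    """Return a tuple of class names from None, a file path, or a sequence."""
    if classes is None:
        return tuple(default)
    if isinstance(classes, str):
        # Treat a string as a path to a file with one class name per line.
        with open(classes, encoding='utf-8') as f:
            return tuple(line.strip() for line in f if line.strip())
    if isinstance(classes, (tuple, list)):
        return tuple(classes)
    raise ValueError(f'Unsupported type for classes: {type(classes)}')


# Write a small classes file to a temporary directory for the demonstration.
with tempfile.TemporaryDirectory() as tmp_dir:
    classes_path = os.path.join(tmp_dir, 'classes.txt')
    with open(classes_path, 'w', encoding='utf-8') as f:
        f.write('person\nbicycle\ncar\n')
    from_file = resolve_classes(classes_path)

from_tuple = resolve_classes(('person', 'bicycle', 'car'))
```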

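Notes:

Before MMDetection v2.5.0, the dataset automatically filtered out images with empty GT whenever classes were set, and this could not be disabled through the config. That behavior was undesirable and confusing, because when classes were not set the dataset filtered out empty-GT images only when filter_empty_gt=True and test_mode=False. Since MMDetection v2.5.0, image filtering is decoupled from class modification: regardless of whether classes are set, the dataset filters out empty-GT images only when filter_empty_gt=True and test_mode=False. Setting the classes therefore only affects which class annotations are used for training, and users decide for themselves whether to filter out images without GT.
Because the middle format only has box labels and does not contain class names, users of CustomDataset cannot filter out images without GT through the config; this can only be done offline.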
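Offline filtering of middle-format annotations can be as simple as dropping the records that have no boxes before saving the file. The function below is a hypothetical sketch; it uses plain lists for the boxes, and the len() check works the same way for numpy arrays of shape (n, 4).

```python
def drop_images_without_gt(records):
    """Keep only middle-format records that have at least one bbox."""
    return [record for record in records if len(record['ann']['bboxes']) > 0]


records = [
    dict(filename='a.jpg', width=1280, height=720,
         ann=dict(bboxes=[[10, 20, 40, 60]], labels=[0])),
    dict(filename='b.jpg', width=1280, height=720,
         ann=dict(bboxes=[], labels=[])),  # no GT, will be dropped
]

filtered = drop_images_without_gt(records)
```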

Tags: object detection