How to Convert Annotations from PASCAL VOC XML to COCO JSON

程序员文章站 2022-03-30 21:55:45


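Computer vision problems require annotated datasets. As object detection has developed, different file formats for describing object annotations have emerged. This creates a frustrating situation in which teams spend time converting from one annotation format to another rather than focusing on higher-value tasks, such as improving deep learning model architectures. A data scientist spending time converting between annotation formats is like an author spending time converting Word documents to PDF. The most common annotation formats come from challenges and from datasets built up over many years; as machine learning researchers used these datasets to build better models, their annotation formats became unofficial standards. In this post, we provide the code needed to convert from VOC XML to COCO JSON, two of the most common formats.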

PASCAL VOC XML

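PASCAL (Pattern Analysis, Statistical modelling and ComputAtional Learning) is a Network of Excellence funded by the European Union. From 2005 to 2012, PASCAL ran the Visual Object Classes (VOC) challenge. Each year, PASCAL released an object detection dataset and reported benchmarks. (The PASCAL VOC datasets are available online.)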

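PASCAL VOC annotations are released in XML format, where each image has an accompanying XML file describing the bounding boxes contained in the frame. For example, in the BCCD dataset for blood cell detection, a single XML annotation looks like this: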

<annotation>
	<folder>JPEGImages</folder>
	<filename>BloodImage_00000.jpg</filename>
	<path>/home/pi/detection_dataset/JPEGImages/BloodImage_00000.jpg</path>
	<source>
		<database>Unknown</database>
	</source>
	<size>
		<width>640</width>
		<height>480</height>
		<depth>3</depth>
	</size>
	<segmented>0</segmented>
	<object>
		<name>WBC</name>
		<pose>Unspecified</pose>
		<truncated>0</truncated>
		<difficult>0</difficult>
		<bndbox>
			<xmin>260</xmin>
			<ymin>177</ymin>
			<xmax>491</xmax>
			<ymax>376</ymax>
		</bndbox>
	</object>
    ...
	<object>
		...
	</object>
</annotation>

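Note a few key details: (1) the annotated image is referenced by its file name in <filename>, while the <path> element, when present, may hold a machine-specific absolute path, as it does here; (2) image metadata is recorded as width, height, and depth; (3) each bounding box is given in pixels by its top-left corner (xmin, ymin) and its bottom-right corner (xmax, ymax).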
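To make these three details concrete, here is a minimal sketch that reads them with Python's standard xml.etree.ElementTree module. The XML is a trimmed copy of the BCCD example above, and the function name read_voc_annotation is a hypothetical helper chosen for illustration, not part of any library.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the BCCD example above, inlined so the sketch runs on its own.
VOC_XML = """<annotation>
  <filename>BloodImage_00000.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>WBC</name>
    <bndbox><xmin>260</xmin><ymin>177</ymin><xmax>491</xmax><ymax>376</ymax></bndbox>
  </object>
</annotation>"""


def read_voc_annotation(xml_text):
    """Return the file name, image size and labelled boxes from one VOC XML string."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    boxes = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        # Corners are stored as four separate child elements.
        corners = tuple(
            int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), corners))
    return {
        "filename": root.findtext("filename"),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
        "boxes": boxes,
    }


record = read_voc_annotation(VOC_XML)
print(record["filename"], record["width"], record["height"])
print(record["boxes"])
```

Each <object> element yields one (label, corners) pair, so an image with several cells produces several entries in the boxes list.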

COCO JSON

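The COCO (Common Objects in Context) dataset originated in a 2014 paper published by Microsoft. The dataset "contains photos of 91 objects types that would be easily recognizable by a 4 year old." In total there are 2.5 million labeled instances in 328,000 images. Given the sheer quantity and quality of the open-source data, COCO has become a standard dataset for testing and demonstrating state-of-the-art performance of new models. (The dataset is available online.)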

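COCO annotations are released in JSON format. Unlike PASCAL VOC, where each image has its own annotation file, COCO JSON uses a single JSON file to describe an entire collection of images. In addition, the COCO dataset supports several types of computer vision problems: keypoint detection, object detection, segmentation, and captioning. As a result, the annotation format differs by task. This post focuses on object detection. A sample COCO JSON annotation for object detection looks like this: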

{
    "info": {
        "year": "2020",
        "version": "1",
        "description": "Exported from roboflow.ai",
        "contributor": "",
        "url": "https://app.roboflow.ai/datasets/bccd-single-image-example/1",
        "date_created": "2020-01-30T23:05:21+00:00"
    },
    "licenses": [
        {
            "id": 1,
            "url": "",
            "name": "Unknown"
        }
    ],
    "categories": [
        {
            "id": 0,
            "name": "cells",
            "supercategory": "none"
        },
        {
            "id": 1,
            "name": "RBC",
            "supercategory": "cells"
        },
        {
            "id": 2,
            "name": "WBC",
            "supercategory": "cells"
        }
    ],
    "images": [
        {
            "id": 0,
            "license": 1,
            "file_name": "0bc08a33ac64b0bd958dd5e4fa8dbc43.jpg",
            "height": 480,
            "width": 640,
            "date_captured": "2020-02-02T23:05:21+00:00"
        }
    ],
    "annotations": [
        {
            "id": 0,
            "image_id": 0,
            "category_id": 2,
            "bbox": [
                260,
                177,
                231,
                199
            ],
            "area": 45969,
            "segmentation": [],
            "iscrowd": 0
        },
        {
            "id": 1,
            "image_id": 0,
            "category_id": 1,
            "bbox": [
                78,
                336,
                106,
                99
            ],
            "area": 10494,
            "segmentation": [],
            "iscrowd": 0
        },
        {
            "id": 2,
            "image_id": 0,
            "category_id": 1,
            "bbox": [
                63,
                237,
                106,
                99
            ],
            "area": 10494,
            "segmentation": [],
            "iscrowd": 0
        },
...
    ]
}

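Note a few key things here: (1) there is information about the dataset itself and its licenses; (2) all labels used are defined as categories, and each annotation refers to its category and its image by id; (3) bounding boxes are defined by the x, y coordinates of the top-left corner plus the width and height of the box.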
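The difference in bounding box conventions is the core of the conversion. As a minimal sketch (the function name voc_box_to_coco is hypothetical), the VOC corners can be turned into a COCO bbox and area by plain subtraction. Applied to the WBC box from the VOC example, this reproduces the first annotation in the COCO example above.

```python
def voc_box_to_coco(xmin, ymin, xmax, ymax):
    """Convert VOC corner coordinates to a COCO-style bbox and area."""
    if xmax <= xmin or ymax <= ymin:
        raise ValueError(f"degenerate box: {(xmin, ymin, xmax, ymax)}")
    width = xmax - xmin
    height = ymax - ymin
    # COCO stores the top-left corner followed by the box size.
    return {"bbox": [xmin, ymin, width, height], "area": width * height}


coco_box = voc_box_to_coco(260, 177, 491, 376)
print(coco_box)  # {'bbox': [260, 177, 231, 199], 'area': 45969}
```

The full script later in this post also shifts xmin and ymin by one pixel before taking the difference, so its numbers differ slightly from this sketch.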

Converting VOC XML to COCO JSON

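Popular annotation tools such as LabelImg, VoTT, and CVAT can export annotations in Pascal VOC XML. Some datasets and the pipelines built around them, ImageNet for example, call for Pascal VOC. Others, such as Mask-RCNN, call for images annotated in COCO JSON.
To convert from one format to the other, you can write (or borrow) a custom script, or use a tool such as Roboflow (free for up to 1,000 images).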

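Using a Python Script

GitHub user (and Kaggle Master) yukkyo created a conversion script and hosts it on GitHub. The Roboflow team forked the repository and modified it slightly for use here.

The full code is as follows: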

import os
import argparse
import json
import xml.etree.ElementTree as ET
from typing import Dict, List
from tqdm import tqdm
import re


def get_label2id(labels_path: str) -> Dict[str, int]:
    """Read whitespace-separated labels and map each one to an id, starting from 1."""
    with open(labels_path, 'r') as f:
        labels_str = f.read().split()
    labels_ids = list(range(1, len(labels_str)+1))
    return dict(zip(labels_str, labels_ids))


def get_annpaths(ann_dir_path: str = None,
                 ann_ids_path: str = None,
                 ext: str = '',
                 annpaths_list_path: str = None) -> List[str]:
    # If a file listing the annotation paths is given, use it directly
    if annpaths_list_path is not None:
        with open(annpaths_list_path, 'r') as f:
            ann_paths = f.read().split()
        return ann_paths

    # Otherwise build each path from the annotation directory, an id, and the extension
    ext_with_dot = '.' + ext if ext != '' else ''
    with open(ann_ids_path, 'r') as f:
        ann_ids = f.read().split()
    ann_paths = [os.path.join(ann_dir_path, aid+ext_with_dot) for aid in ann_ids]
    return ann_paths


def get_image_info(annotation_root, extract_num_from_imgid=True):
    path = annotation_root.findtext('path')
    if path is None:
        filename = annotation_root.findtext('filename')
    else:
        filename = os.path.basename(path)
    img_name = os.path.basename(filename)
    img_id = os.path.splitext(img_name)[0]
    if extract_num_from_imgid and isinstance(img_id, str):
        img_id = int(re.findall(r'\d+', img_id)[0])

    size = annotation_root.find('size')
    width = int(size.findtext('width'))
    height = int(size.findtext('height'))

    image_info = {
        'file_name': filename,
        'height': height,
        'width': width,
        'id': img_id
    }
    return image_info


def get_coco_annotation_from_obj(obj, label2id):
    label = obj.findtext('name')
    assert label in label2id, f"Error: {label} is not in label2id !"
    category_id = label2id[label]
    bndbox = obj.find('bndbox')
    # VOC coordinates are treated as 1-based here, so shift the top-left corner to 0-based
    xmin = int(bndbox.findtext('xmin')) - 1
    ymin = int(bndbox.findtext('ymin')) - 1
    xmax = int(bndbox.findtext('xmax'))
    ymax = int(bndbox.findtext('ymax'))
    assert xmax > xmin and ymax > ymin, f"Box size error !: (xmin, ymin, xmax, ymax): {xmin, ymin, xmax, ymax}"
    o_width = xmax - xmin
    o_height = ymax - ymin
    ann = {
        'area': o_width * o_height,
        'iscrowd': 0,
        'bbox': [xmin, ymin, o_width, o_height],
        'category_id': category_id,
        'ignore': 0,
        'segmentation': []  # This script is not for segmentation
    }
    return ann


def convert_xmls_to_cocojson(annotation_paths: List[str],
                             label2id: Dict[str, int],
                             output_jsonpath: str,
                             extract_num_from_imgid: bool = True):
    output_json_dict = {
        "images": [],
        "type": "instances",
        "annotations": [],
        "categories": []
    }
    bnd_id = 1  # START_BOUNDING_BOX_ID, TODO input as args ?
    print('Start converting !')
    for a_path in tqdm(annotation_paths):
        # Read annotation xml
        ann_tree = ET.parse(a_path)
        ann_root = ann_tree.getroot()

        img_info = get_image_info(annotation_root=ann_root,
                                  extract_num_from_imgid=extract_num_from_imgid)
        img_id = img_info['id']
        output_json_dict['images'].append(img_info)

        for obj in ann_root.findall('object'):
            ann = get_coco_annotation_from_obj(obj=obj, label2id=label2id)
            ann.update({'image_id': img_id, 'id': bnd_id})
            output_json_dict['annotations'].append(ann)
            bnd_id = bnd_id + 1

    for label, label_id in label2id.items():
        category_info = {'supercategory': 'none', 'id': label_id, 'name': label}
        output_json_dict['categories'].append(category_info)

    with open(output_jsonpath, 'w') as f:
        output_json = json.dumps(output_json_dict)
        f.write(output_json)


def main():
    parser = argparse.ArgumentParser(
        description='Convert VOC-format XML annotations to a COCO-format JSON file.')
    parser.add_argument('--ann_dir', type=str, default=None,
                        help='path to the annotation files directory. Not needed when using --ann_paths_list')
    parser.add_argument('--ann_ids', type=str, default=None,
                        help='path to the annotation file ids list. Not needed when using --ann_paths_list')
    parser.add_argument('--ann_paths_list', type=str, default=None,
                        help='path to the annotation paths list. Not needed when using --ann_dir and --ann_ids')
    # --labels is required: get_label2id() cannot open a missing path
    parser.add_argument('--labels', type=str, required=True,
                        help='path to label list.')
    parser.add_argument('--output', type=str, default='output.json', help='path to output json file')
    parser.add_argument('--ext', type=str, default='', help='additional extension of annotation file')
    args = parser.parse_args()
    label2id = get_label2id(labels_path=args.labels)
    ann_paths = get_annpaths(
        ann_dir_path=args.ann_dir,
        ann_ids_path=args.ann_ids,
        ext=args.ext,
        annpaths_list_path=args.ann_paths_list
    )
    convert_xmls_to_cocojson(
        annotation_paths=ann_paths,
        label2id=label2id,
        output_jsonpath=args.output,
        extract_num_from_imgid=True
    )


if __name__ == '__main__':
    main()

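How to Use the Script

1. Create labels.txt

labels.txt supplies the dictionary that maps each label to an id. The script reads whitespace-separated labels from it and assigns ids starting from 1, in file order.

Example labels.txt: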

Label1
Label2
...
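If the label set is not already known, one way to produce labels.txt is to scan the annotation directory for the distinct <name> values. The following is a minimal sketch under that assumption; collect_labels and write_labels are hypothetical helper names, not part of the script. Because the script splits labels.txt on whitespace, this approach only suits labels that contain no spaces.

```python
import os
import xml.etree.ElementTree as ET


def collect_labels(ann_dir):
    """Return the sorted distinct <name> values found in the .xml files of ann_dir."""
    labels = set()
    for file_name in sorted(os.listdir(ann_dir)):
        if not file_name.endswith(".xml"):
            continue  # skip images and any other non-annotation files
        root = ET.parse(os.path.join(ann_dir, file_name)).getroot()
        for obj in root.findall("object"):
            name = obj.findtext("name")
            if name:
                labels.add(name.strip())
    return sorted(labels)


def write_labels(labels, labels_path):
    """Write one label per line, matching the labels.txt layout shown above."""
    with open(labels_path, "w") as f:
        f.write("\n".join(labels) + "\n")
```

A call such as write_labels(collect_labels("/path/to/annotation/dir"), "labels.txt") would then create the file; both paths are placeholders. Sorting keeps the label-to-id mapping stable between runs.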

2. Run the script

Usage 1 (with an ID list; --ext is optional and is appended to each ID to form the file name):

$ python voc2coco.py \
    --ann_dir /path/to/annotation/dir \
    --ann_ids /path/to/annotations/ids/list.txt \
    --labels /path/to/labels.txt \
    --output /path/to/output.json \
    --ext xml

Usage 2 (with a list of annotation file paths; --ext is ignored in this mode because each path is used as written):

Example paths.txt, the annotation paths list:

/path/to/annotation/file.xml
/path/to/annotation/file2.xml
...

$ python voc2coco.py \
    --ann_paths_list /path/to/annotation/paths.txt \
    --labels /path/to/labels.txt \
    --output /path/to/output.json
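The paths list for Usage 2 does not have to be written by hand. As a minimal sketch (write_annotation_paths is a hypothetical helper, not part of the script), it can be generated from a directory of XML files. The script splits the list on whitespace, so this assumes the paths contain no spaces.

```python
import os


def write_annotation_paths(ann_dir, paths_file):
    """Write the absolute path of every .xml file in ann_dir, one per line."""
    paths = sorted(
        os.path.abspath(os.path.join(ann_dir, name))
        for name in os.listdir(ann_dir)
        if name.endswith(".xml")
    )
    with open(paths_file, "w") as f:
        f.write("\n".join(paths) + "\n")
    return paths
```

For example, write_annotation_paths("/path/to/annotation/dir", "paths.txt") would produce a file that can be passed to --ann_paths_list; both paths are placeholders.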

Example:

In this example we convert a VOC-format dataset, Shenggan/BCCD_Dataset, a small-scale dataset for blood cell detection.

The dataset has the following file structure:

├── BCCD
│   ├── Annotations
│   │       └── BloodImage_00XYZ.xml (364 items)
│   ├── ImageSets       # Contain four Main/*.txt which split the dataset
│   └── JPEGImages
│       └── BloodImage_00XYZ.jpg (364 items)
├── dataset
│   └── mxnet           # Some preprocess scripts for mxnet
├── scripts
│   ├── split.py        # A script to generate four .txt in ImageSets
│   └── visualize.py    # A script to generate labeled img like example.jpg
├── example.jpg         # A example labeled img generated by visualize.py
├── LICENSE
└── README.md

Convert it to COCO JSON with this command:

$ python voc2coco.py \
    --ann_dir sample/Annotations \
    --ann_ids sample/dataset_ids/test.txt \
    --labels sample/labels.txt \
    --output sample/bccd_test_cocoformat.json \
    --ext xml

# Check output
$ ls sample/ | grep bccd_test_cocoformat.json
bccd_test_cocoformat.json

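# Inspect the first four comma-separated fields
$ cut -f -4 -d , sample/bccd_test_cocoformat.json
{"images": [{"file_name": "BloodImage_00007.jpg", "height": 480, "width": 640, "id": "BloodImage_00007"}

Note: the id in this output is the full file stem. The listing above passes extract_num_from_imgid=True, which instead stores the first number found in the file name (7 for this image).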
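Beyond looking at the first few fields, it can help to confirm that the output is internally consistent before training on it. The sketch below is one possible check, not part of the script; check_coco and load_and_check are hypothetical names. It verifies that every annotation points at an image and a category that exist, and that every box has a positive size.

```python
import json


def check_coco(coco):
    """Return a list of problems found in a COCO detection dict (empty if none)."""
    problems = []
    image_ids = {img["id"] for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}
    for ann in coco.get("annotations", []):
        if ann["image_id"] not in image_ids:
            problems.append(f"annotation {ann['id']}: unknown image_id {ann['image_id']}")
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']}: unknown category_id {ann['category_id']}")
        # bbox is [x, y, width, height]
        _, _, width, height = ann["bbox"]
        if width <= 0 or height <= 0:
            problems.append(f"annotation {ann['id']}: non-positive box size")
    return problems


def load_and_check(json_path):
    """Load a COCO JSON file from disk and run check_coco on it."""
    with open(json_path) as f:
        return check_coco(json.load(f))
```

Calling load_and_check("sample/bccd_test_cocoformat.json") on the file produced above should return an empty list if the conversion went as intended.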