Document Understanding

Document Understanding#

Task Overview#

Document Understanding#

The task of analyzing the content of documents.
It covers document classification, layout analysis, information extraction, and document question answering (DocQA) (document understanding | Papers With Code).

Visual Document Understanding (VDU)#

The task of extracting information from digital documents such as PDFs and images.

Token Classification#

Assigns a class to each token. Named entity recognition (NER) is the typical example.
Token classification
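
As a rough illustration of what a token classifier's output looks like for NER, the sketch below groups per-token tags into entity spans. The BIO tagging scheme, the `decode_bio` helper, and the example sentence are assumptions made for illustration; they are not taken from the linked page.

```python
def decode_bio(tokens, tags):
    """Group tokens into (entity_type, text) spans based on BIO tags."""
    spans = []
    current_type, current_words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; flush the previous one.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the entity only if the type matches.
            current_words.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open entity.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


# Hypothetical model output: one tag per token.
tokens = ["Satya", "Nadella", "works", "at", "Microsoft"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG"]
entities = decode_bio(tokens, tags)
print(entities)
```

In this sketch an `I-` tag that does not continue an open entity of the same type is simply dropped, which is one of several possible conventions.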

Semantic Entity Recognition#

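Extracts semantic entities from a document and classifies their types.
Given a visually-rich document \(\mathcal{D}\), suppose we obtain a discrete token set \(t=\{t_0,t_1,\dots,t_n\}\).
- Each token \(t_i\) consists of a word \(w\) and a bounding box \((x_0, y_0, x_1, y_1)\).
- Each token is classified into one label from the set of semantic entity labels \(\mathcal{C}=\{c_0, c_1, \dots, c_m\}\).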
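
The formulation above can be written down as a small data structure. This is only a sketch: the `Token` class, the `LABELS` list standing in for the label set C, and the `attach_labels` helper are hypothetical names, and the label ids stand in for what a model would predict.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    """One token t_i: a word w plus its bounding box (x0, y0, x1, y1)."""
    word: str
    bbox: Tuple[int, int, int, int]


# Hypothetical label set C = {c_0, ..., c_m}.
LABELS = ["O", "HEADER", "QUESTION", "ANSWER"]


def attach_labels(tokens: List[Token], label_ids: List[int],
                  labels: List[str] = LABELS) -> List[Tuple[str, str]]:
    """Pair each token's word with the label chosen for it."""
    if len(tokens) != len(label_ids):
        raise ValueError("expected exactly one label id per token")
    return [(tok.word, labels[i]) for tok, i in zip(tokens, label_ids)]


# A toy document D, already tokenized with positions.
doc = [
    Token("Name:", (10, 10, 60, 25)),
    Token("Taro", (70, 10, 110, 25)),
    Token("Yamada", (115, 10, 180, 25)),
]
labeled = attach_labels(doc, [2, 3, 3])  # ids a model might output
print(labeled)
```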

Relation Extraction#

Predicts the relation between two semantic entities.
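
One common way to frame this is to enumerate candidate entity pairs and let a model score each pair. The sketch below covers only the enumeration step. The entity dictionaries, the `QUESTION`/`ANSWER` labels, and the `candidate_pairs` helper are assumptions for illustration, and the scoring model itself is left out.

```python
from itertools import permutations

# Hypothetical entities, as an upstream entity recognition step might emit them.
entities = [
    {"id": 0, "label": "QUESTION", "text": "Name:"},
    {"id": 1, "label": "ANSWER", "text": "Taro Yamada"},
    {"id": 2, "label": "QUESTION", "text": "Date:"},
    {"id": 3, "label": "ANSWER", "text": "2021-06-01"},
]


def candidate_pairs(entities, head_label="QUESTION", tail_label="ANSWER"):
    """List (head_id, tail_id) pairs whose labels allow a relation."""
    return [
        (head["id"], tail["id"])
        for head, tail in permutations(entities, 2)
        if head["label"] == head_label and tail["label"] == tail_label
    ]


pairs = candidate_pairs(entities)
print(pairs)  # each pair would then be classified as related or unrelated
```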

Summary of Related Microsoft Research#

Document AI (Intelligent Document Processing) - Microsoft Research

A page summarizing related research from Microsoft Research.

LayoutLM#

A family of models developed by Microsoft, used for tasks such as semantic entity recognition.

unilm/layoutlmft at master · microsoft/unilm

LayoutLM#

LayoutLMv2#

paper: [2012.14740v4] LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
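Hugging Face: LayoutLMV2
Uses PyTesseract for OCR.
Text is OCR'd in advance, and PDFs are also parsed to extract position information.
Both images and text are embedded together with their positions and fed into a Transformer.
- The visual embedding reportedly comes from a CNN (LayoutLMの特徴と事前学習タスクについて - LayerX エンジニアブログ)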
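
Because position information is fed to the model alongside the text, the OCR bounding boxes need a page-independent scale. The Hugging Face LayoutLM documentation describes normalizing boxes to a 0-1000 range. A minimal sketch of that step follows; the `normalize_bbox` helper name and the example page size are assumptions.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical OCR box on a 500 x 1000 pixel page.
normalized = normalize_bbox((50, 100, 250, 130), page_width=500, page_height=1000)
print(normalized)
```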

LayoutLMv3#

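A simplified version of v2. It uses ViT-style patch embeddings instead of a CNN backbone, as the Hugging Face documentation puts it: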

LayoutLMv3 simplifies LayoutLMv2 by using patch embeddings (as in ViT) instead of leveraging a CNN backbone, and pre-trains the model on 3 objectives: masked language modeling (MLM), masked image modeling (MIM) and word-patch alignment (WPA).

LayoutLMv3 - Hugging Face
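
To make "patch embeddings" concrete, the sketch below cuts an image into non-overlapping square patches and flattens each one, which is the step that replaces the CNN backbone. It is a pure-Python toy on a 4 x 4 single-channel image. The `split_into_patches` helper is hypothetical, and the learned linear projection that a ViT-style model applies to each flattened patch is omitted.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    patches = []
    # Walk the patch grid row by row, left to right.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4 x 4 "image" whose pixel values are just their raster index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)
print(len(patches), patches[0])
```

Each flattened patch plays the role of one "visual token" in the Transformer input sequence.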

LayoutXLM#

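An extension of LayoutLMv2 trained on 53 languages.
The authors also created a benchmark dataset called XFUND and confirmed that LayoutXLM achieves SOTA on it.
Using PDF data from Common Crawl spared them the effort of collecting images and text separately.
In preprocessing, PyMuPDF is used to extract text, layout, and images.
Data is sampled per language following XLM (Lample and Conneau, 2019).
This yields 22 million documents; a further 8 million from the IIT-CDIP dataset bring the pre-training total to 30 million documents.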
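
The XLM-style per-language sampling mentioned above draws from a multinomial whose probabilities are the language proportions raised to a power alpha and renormalized, which upweights low-resource languages when alpha is below 1. The sketch below shows that computation. The document counts and the alpha value are illustrative assumptions, not figures from the paper, and `language_sampling_probs` is a hypothetical helper name.

```python
def language_sampling_probs(doc_counts, alpha):
    """Return q_l proportional to (n_l / sum_k n_k) ** alpha, normalized."""
    total = sum(doc_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in doc_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical per-language document counts.
counts = {"en": 900, "ja": 90, "sw": 10}
probs = language_sampling_probs(counts, alpha=0.7)  # alpha is illustrative
for lang, q in probs.items():
    print(f"{lang}: raw share {counts[lang] / 1000:.3f}, sampling prob {q:.3f}")
```

With alpha equal to 1 the sampling probabilities reduce to the raw proportions, and smaller values flatten the distribution further.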

microsoft/layoutxlm-base · Hugging Face

[2106.11539] DocFormer: End-to-End Transformer for Document Understanding

[2111.15664] OCR-free Document Understanding Transformer
