This blog post is about how to use Transformers to solve the document understanding problem.

Paper

https://arxiv.org/abs/2212.02623

Challenges

  1. Documents combine three modalities: layout, text, and image.
  2. Cross-modal interactions between the text and visual modalities are much stronger than in pure CV tasks.
  3. How to capture this strong correlation between the image, text, and layout modalities?
  4. How to set up the model to learn across different domains?

Previous work

  1. Encode images with a vision network and feed the encoding to the multimodal encoder along with the text.
  2. Add a 2D positional embedding to the text embeddings (see the sketch after this list).
  3. Task-specific heads.
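
A minimal sketch of the second idea (2D positional embeddings for text, in the style of earlier layout-aware models), assuming learned embedding tables over bucketed bounding-box coordinates; the class name, table sizes, and bucketing are illustrative, not taken from any specific implementation:

```python
import torch
import torch.nn as nn

class TextWith2DPosition(nn.Module):
    """Add a learned 2D positional embedding, derived from each token's
    bounding box, to the ordinary text token embedding."""

    def __init__(self, vocab_size=32000, hidden=768, num_buckets=1000):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        # one table for x-coordinates and one for y-coordinates
        self.x_emb = nn.Embedding(num_buckets, hidden)
        self.y_emb = nn.Embedding(num_buckets, hidden)

    def forward(self, token_ids, bboxes):
        # bboxes: (batch, seq, 4) with coordinates bucketed into [0, num_buckets)
        x0, y0, x1, y1 = bboxes.unbind(-1)
        pos = self.x_emb(x0) + self.y_emb(y0) + self.x_emb(x1) + self.y_emb(y1)
        return self.tok_emb(token_ids) + pos
```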

Universal Document Processing

  1. Unify vision, text, and layout through a Vision-Text-Layout (VTL) Transformer.
  2. Pretraining tasks include vision tasks, text tasks, and mixed tasks.
  3. At the input stage, the embeddings of text tokens are added to the features of the image patch where the tokens are located.

Steps

  1. Build a vocabulary for text and document layout that converts layout (bounding-box coordinates) into discrete layout tokens.
  2. VTL Transformer: one modality-agnostic encoder, a text-layout decoder, and a vision decoder.
  3. A sequence-to-sequence generation framework.
  4. Pretraining objectives include layout modeling, text and layout reconstruction, and vision recognition.
  5. Trained on 11M public unlabeled documents plus 11 supervised datasets with 1.8M examples.

Interesting parts

  1. Using OCR to get the layout information. The input is the document image v, all the text in v, and the location (bounding box) of each text token.
  2. How to encode vision, text, and layout:
  3. Split the image into P*P patches, convert each patch into a D-dim vector, then group all the patches into a sequence of vectors. Text tokens are also converted to D-dim embeddings by vocabulary lookup (only the text needs a vocabulary). Points 3-5 are sketched in code after this list.
  4. Use an indicator function between image patches and token embeddings: if a text token lies inside an image patch, the indicator is 1; otherwise it is 0. When the indicator is 1, add the text embedding s to the patch feature v; otherwise the joint embedding is just v.
  5. For the layout, normalize the coordinates to [0, 1] and multiply by the vocabulary size, so the layout tokens are on the same scale as the text tokens.
  6. Decoders: there are two, a text-layout decoder and a vision decoder.
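
A minimal sketch of points 3-5, assuming a simple linear patch projection and center-of-box assignment of tokens to patches; the class name, patch size, hidden size, and layout vocabulary size are illustrative choices, not the paper's code:

```python
import torch
import torch.nn as nn

class JointVisionTextEmbedder(nn.Module):
    """Embed image patches and text tokens, and add each text embedding to the
    feature of the image patch it falls into (the indicator described above)."""

    def __init__(self, vocab_size=32000, hidden=1024, patch=16):
        super().__init__()
        self.patch = patch
        self.patch_proj = nn.Linear(3 * patch * patch, hidden)   # each patch -> D-dim vector
        self.tok_emb = nn.Embedding(vocab_size, hidden)          # text vocabulary lookup

    def patchify(self, image):
        # image: (3, H, W) -> sequence of D-dim patch vectors, shape (num_patches, D)
        c, _, _ = image.shape
        p = self.patch
        patches = image.unfold(1, p, p).unfold(2, p, p)          # (3, H/p, W/p, p, p)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * p * p)
        return self.patch_proj(patches)

    def forward(self, image, token_ids, token_centers):
        """token_centers: (seq, 2) normalized (x, y) centre of each token's box."""
        v = self.patchify(image)                                  # image patch features
        s = self.tok_emb(token_ids)                               # text embeddings
        n_rows = image.shape[1] // self.patch
        n_cols = image.shape[2] // self.patch
        # indicator: which patch does each token's centre fall into?
        col = (token_centers[:, 0] * n_cols).long().clamp(max=n_cols - 1)
        row = (token_centers[:, 1] * n_rows).long().clamp(max=n_rows - 1)
        joint_text = s + v[row * n_cols + col]                    # s + v where the indicator is 1
        return v, joint_text                                      # patches without text stay as v


def layout_token(coord, layout_vocab=500):
    """Point 5: a coordinate normalized to [0, 1] times the vocabulary size gives
    a discrete layout token id (a layout vocabulary of 500 is an assumption)."""
    return min(int(coord * layout_vocab), layout_vocab - 1)
```

The patch sequence v and the joint text embeddings joint_text are what the modality-agnostic encoder would consume.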

Self-Supervised Pretraining tasks

  1. The unlabeled documents contain OCR text inputs with token-level bounding boxes.
  2. Joint text-layout reconstruction: given a sentinel prompt with layout and text tokens, mask 15% of the tokens and predict both the text and the layout (see the sketch after this list).
  3. Layout modeling asks the model to predict the position of the text tokens, masking 75% of them.
  4. Visual text recognition identifies the text at a given location in the image.
  5. Masked image reconstruction with text and layout: reconstruct the image given the text and layout. As in MAE, a percentage of the image patches are masked and the non-masked patches are fed into a vision encoder; the vision decoder uses cross-attention with character embeddings.
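
A hedged sketch of how a joint text-layout reconstruction example could be built, using T5-style sentinel tokens and the 15% masking ratio mentioned above; the sentinel and layout token strings, and the exact target format, are illustrative placeholders:

```python
import random

def joint_text_layout_example(tokens, boxes, mask_ratio=0.15, layout_vocab=500):
    """tokens: OCR words; boxes: normalized (x0, y0, x1, y1) per token.
    Mask ~15% of the tokens; the target asks the model to generate, for each
    sentinel, the masked text together with its discretized layout tokens."""
    source, target = [], []
    sentinel_id = 0
    for tok, box in zip(tokens, boxes):
        if random.random() < mask_ratio:
            sentinel = f"<sentinel_{sentinel_id}>"
            sentinel_id += 1
            source.append(sentinel)
            layout = "".join(f"<loc_{min(int(c * layout_vocab), layout_vocab - 1)}>" for c in box)
            target.append(f"{sentinel} {tok} {layout}")
        else:
            source.append(tok)
    # a task prompt tells the model which objective to follow (wording illustrative)
    return "Joint Text-Layout Reconstruction. " + " ".join(source), " ".join(target)
```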

Supervised pretraining tasks

  1. Classification with image and label.
  2. Layout analysis: given an entity, predict its bounding boxes.
  3. Information extraction: predict the entity type and location of a text query (see the sketch after this list).
  4. Question answering.
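
Since all of these supervised tasks are cast into the same sequence-to-sequence format, here is a hedged sketch of how question answering and information extraction examples might be serialized; the prompt wording and layout token format are illustrative, not quoted from the paper:

```python
def qa_example(question, ocr_tokens, answer):
    """Document QA: task prompt + question + OCR text as source, answer as target."""
    source = "Question Answering. " + question + " " + " ".join(ocr_tokens)
    return source, answer

def information_extraction_example(text_query, entity_type, box, layout_vocab=500):
    """Information extraction: given a text query, generate its entity type and location."""
    layout = "".join(f"<loc_{min(int(c * layout_vocab), layout_vocab - 1)}>" for c in box)
    return "Information Extraction. " + text_query, entity_type + " " + layout
```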

Pre-training

  1. The overall structure is the T5-large (encoder-decoder) architecture.
  2. It has 794M trainable parameters.
  3. The tokenizer is the T5 tokenizer from Hugging Face, with the vocabulary extended with special tokens (see the sketch after this list).
  4. IIT-CDIP 1.0 is used; it has 11 million scanned documents with text and token-level bounding boxes from OCR.
  5. Curriculum learning: from small resolution to large resolution.
  6. Adam optimizer with lr = 5e-5, 1000 warmup steps, batch size 512, weight decay 0.01.
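
A hedged sketch of points 3 and 6, using the Hugging Face T5 tokenizer and a plain Adam setup; the special-token strings, the linear warmup schedule, and the total step count are my assumptions:

```python
import torch
from transformers import (T5Tokenizer, T5ForConditionalGeneration,
                          get_linear_schedule_with_warmup)

# start from the T5 tokenizer and extend the vocabulary with special tokens,
# e.g. discretized layout tokens (the token strings here are illustrative)
tokenizer = T5Tokenizer.from_pretrained("t5-large")
tokenizer.add_tokens([f"<loc_{i}>" for i in range(500)])

# plain T5-large stands in for the full vision-text-layout model in this sketch
model = T5ForConditionalGeneration.from_pretrained("t5-large")
model.resize_token_embeddings(len(tokenizer))

# optimizer and warmup matching the hyperparameters listed above
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=100_000)  # total steps: illustrative
```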