UDOP
This blog post is about how to use transformers to solve the document understanding problem.
Paper
https://arxiv.org/abs/2212.02623
Challenges
- Documents combine three modalities: layout, text, and image.
- Cross-modal interactions between the text and visual modalities are much stronger than in pure-CV tasks.
- How to capture this strong correlation between the image, text, and layout modalities?
- How to get the model to learn across different document domains?
Previous work
- encode images with a vision network and feed the encoding to the multimodal encoder along with text
- adding a 2D positional embedding to text embeddings
- task-specific heads
Universal Document Processing (UDOP)
- Unifies vision, text, and layout through a Vision-Text-Layout (VTL) Transformer.
- Pretraining tasks include vision tasks, text tasks, and mixed vision-text-layout tasks.
- At the input stage, add the embeddings of text tokens to the features of the image patch where the tokens are located.
Steps
- Build a vocabulary for text and document layout that converts layout (bounding box coordinates) into discrete layout tokens.
- VTL Transformer: one modality-agnostic encoder, plus a text-layout decoder and a vision decoder.
- Sequence-to-sequence generation framework that covers all tasks.
- Pretraining objectives include layout modeling, text and layout reconstruction, and vision recognition.
- Trained on 11M public unlabeled documents plus 11 supervised datasets totaling 1.8M examples.
Interesting parts
- OCR is used to get the layout information. The input is the image v, all the text in v, and the location (bounding box) of each text token.
- How to encode vision, text, and layout:
- Split the image into P × P patches, convert each patch into a D-dim vector, and group all patches into a sequence of vectors. Text tokens are also converted to D-dim embeddings by vocabulary look-up (only a text vocabulary is needed).
- Use an indicator between each image patch and each text token: if the token lies inside the image patch, the indicator is 1, otherwise 0. When the indicator is 1, the joint embedding is the token embedding s added to the patch feature v; otherwise the joint embedding is just v (sketched in the code after this list, together with the layout discretization below).
- For layout, normalize the bounding box coordinates to [0, 1] and multiply by the layout vocabulary size, so each coordinate becomes a discrete layout token on the same footing as text tokens.
- Decoder: there are two decoders, a text-layout decoder and a vision decoder.
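Here is a minimal PyTorch-style sketch of the two encoding ideas above: the indicator-based joint embedding of text tokens and image patches, and the discretization of normalized layout coordinates into layout tokens. The grid size P, hidden size D, and `layout_vocab_size=500` are placeholders, not values confirmed by the paper.

```python
import torch

def joint_vision_text_embedding(patch_feats, token_embeds, token_boxes, P):
    """patch_feats: (P*P, D) patch features from splitting the image into a P x P grid.
    token_embeds: (T, D) text token embeddings from vocabulary look-up.
    token_boxes:  (T, 4) normalized boxes (x0, y0, x1, y1) in [0, 1]."""
    # Grid cell containing each token's box center: this is where the indicator is 1.
    cx = (token_boxes[:, 0] + token_boxes[:, 2]) / 2
    cy = (token_boxes[:, 1] + token_boxes[:, 3]) / 2
    col = (cx * P).long().clamp(max=P - 1)
    row = (cy * P).long().clamp(max=P - 1)
    patch_idx = row * P + col                       # (T,)
    # Indicator == 1: joint embedding is token embedding s plus that patch's feature v.
    joint_tokens = token_embeds + patch_feats[patch_idx]
    # Patches with no text keep their feature v unchanged as the joint embedding.
    return joint_tokens, patch_feats

def layout_to_tokens(boxes, layout_vocab_size=500):
    """Discretize normalized box coordinates in [0, 1] into integer layout token ids,
    so layout lives in the same extended vocabulary space as text."""
    return (boxes.clamp(0, 1) * (layout_vocab_size - 1)).round().long()
```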
Self-Supervised Pretraining tasks
- The unlabeled documents contain OCR text inputs with token-level bounding boxes.
- Joint Text-Layout Reconstruction: given the sentinel prompt with layout and text tokens, mask 15% of the text tokens and predict both the masked text and its layout (see the sketch after this list).
- Layout Modeling asks the model to predict the positions (layout tokens) of text tokens, masking 75% of them.
- Visual Text Recognition: identify the text at a given location in the image.
- Masked Image Reconstruction with Text and Layout: reconstruct the image conditioned on text and layout. As in MAE, a percentage of image patches is masked and only the non-masked patches are fed into the vision encoder; the vision decoder cross-attends to character embeddings.
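A rough sketch of how a joint text-layout reconstruction example could be assembled. The sentinel and layout token names (`<mask_k>`, `<loc_n>`) and the task-prompt wording are assumptions for illustration; the real naming comes from the paper's implementation.

```python
import random

def build_text_layout_reconstruction(tokens, layout_ids, mask_ratio=0.15, seed=0):
    """tokens: list of OCR text tokens; layout_ids: per-token lists of discretized
    layout token ids (x0, y0, x1, y1). Mask `mask_ratio` of the tokens; the target
    asks the model to generate both the masked text and its layout after each sentinel."""
    rng = random.Random(seed)
    masked = set(rng.sample(range(len(tokens)), max(1, int(mask_ratio * len(tokens)))))
    src, tgt, k = ["Joint Text-Layout Reconstruction."], [], 0  # prompt wording assumed
    for i, tok in enumerate(tokens):
        if i in masked:
            src.append(f"<mask_{k}>")                              # sentinel in the input
            tgt.append(f"<mask_{k}> {tok} " +                      # masked text ...
                       " ".join(f"<loc_{c}>" for c in layout_ids[i]))  # ... then its layout
            k += 1
        else:
            src.append(tok)
    return " ".join(src), " ".join(tgt)
```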
Supervised pretraining tasks
- Classification: given the document image, predict its class label.
- Layout analysis: given an entity type, predict its bounding boxes.
- Information Extraction: predict the entity type and location of a text query.
- Question Answering.
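All four supervised tasks fit the same sequence-to-sequence interface: a task prompt plus the document inputs go in, a sequence of text and layout tokens comes out. A hedged sketch with made-up prompt templates (the exact wording is defined by the paper, not here):

```python
def build_supervised_example(task, **kw):
    """Illustrative only: collapse heterogeneous supervised tasks into one
    text-in / text-out format. Prompt strings are assumptions."""
    if task == "classification":
        return f"Document Classification. document: {kw['doc_text']}", kw["label"]
    if task == "layout_analysis":
        return f"Layout Analysis. entity: {kw['entity']}", kw["boxes_as_layout_tokens"]
    if task == "information_extraction":
        return (f"Information Extraction. query: {kw['query']}",
                f"{kw['entity_type']} {kw['boxes_as_layout_tokens']}")
    if task == "question_answering":
        return (f"Question Answering. question: {kw['question']} context: {kw['doc_text']}",
                kw["answer"])
    raise ValueError(task)
```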
Pre-training
- The overall structure is the T5-large (encoder-decoder) architecture.
- It has 794M trainable parameters.
- The tokenizer is the T5 tokenizer from Hugging Face, with the vocabulary extended with special tokens (see the sketch below).
- IIT-CDIP 1.0 is used; it has 11 million scanned documents containing text and token-level bounding boxes from OCR.
- Curriculum learning: train from small image resolution up to large resolution.
- Adam optimizer with lr = 5e-5, 1000 warmup steps, batch size 512, weight decay 0.01.
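A minimal sketch of this setup using the Hugging Face T5 tokenizer and the hyperparameters listed above. T5-large is used as a stand-in for the UDOP-style encoder-decoder (the real model adds the vision decoder), the added `<loc_i>` tokens are illustrative placeholders, and the total number of training steps is not from the notes.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, get_linear_schedule_with_warmup

# Extend the T5 vocabulary with layout/special tokens (token names assumed).
tokenizer = T5Tokenizer.from_pretrained("t5-large")
tokenizer.add_tokens([f"<loc_{i}>" for i in range(500)])

# Stand-in encoder-decoder at roughly the right scale; resize embeddings for the new tokens.
model = T5ForConditionalGeneration.from_pretrained("t5-large")
model.resize_token_embeddings(len(tokenizer))

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=100_000,  # placeholder; the notes do not give the total step count
)
```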