This blog post is about how to use Transformers to solve the document understanding problem.

Paper

https://arxiv.org/abs/2212.02623

Challenges

  1. Documents combine three modalities: layout, text, and image.
  2. Cross-modal interactions between the text and visual modalities are much stronger than in pure CV tasks.
  3. How to capture this strong correlation between the image, text, and layout modalities?
  4. How to set up the model to learn across different domains?

Previous work

  1. Encode images with a vision network and feed the encoding to the multimodal encoder along with the text.
  2. Add a 2D positional embedding to the text embeddings (see the sketch after this list).
  3. Task-specific heads.
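
A minimal sketch of the second idea (2D positional embeddings for text, in the style of earlier layout-aware models), assuming learned embedding tables over bucketed bounding-box coordinates; the class name, table sizes, and bucketing are illustrative, not taken from any specific implementation:

```python
import torch
import torch.nn as nn

class TextWith2DPosition(nn.Module):
    """Add a learned 2D positional embedding, derived from each token's
    bounding box, to the ordinary text token embedding."""

    def __init__(self, vocab_size=32000, hidden=768, num_buckets=1000):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        # one table for x-coordinates and one for y-coordinates
        self.x_emb = nn.Embedding(num_buckets, hidden)
        self.y_emb = nn.Embedding(num_buckets, hidden)

    def forward(self, token_ids, bboxes):
        # bboxes: (batch, seq, 4) with coordinates bucketed into [0, num_buckets)
        x0, y0, x1, y1 = bboxes.unbind(-1)
        pos = self.x_emb(x0) + self.y_emb(y0) + self.x_emb(x1) + self.y_emb(y1)
        return self.tok_emb(token_ids) + pos
```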

Universal Document Processing

  1. Unify vision, text, and layout through a Vision-Text-Layout (VTL) Transformer.
  2. Pretraining tasks include vision tasks, text tasks, and mixed tasks.
  3. At the input stage, the embeddings of text tokens are added to the features of the image patch where the tokens are located.

Steps

  1. Build a vocabulary for text and document layout that converts layout (bounding-box coordinates) into discrete layout tokens.
  2. VTL Transformer: one modality-agnostic encoder, a text-layout decoder, and a vision decoder.
  3. A sequence-to-sequence generation framework.
  4. Pretraining objectives include layout modeling, text and layout reconstruction, and vision recognition.
  5. Trained on 11M public unlabeled documents plus 11 supervised datasets with 1.8M examples.

Interesting parts

  1. Using OCR to get the layout information. The input is the document image v, all the text in v, and the location (bounding box) of each text token.
  2. How to encode vision, text, and layout:
  3. Split the image into P*P patches, convert each patch into a D-dim vector, then group all the patches into a sequence of vectors. Text tokens are also converted to D-dim embeddings by vocabulary lookup (only the text needs a vocabulary). Points 3-5 are sketched in code after this list.
  4. Use an indicator function between image patches and token embeddings: if a text token lies inside an image patch, the indicator is 1; otherwise it is 0. When the indicator is 1, add the text embedding s to the patch feature v; otherwise the joint embedding is just v.
  5. For the layout, normalize the coordinates to [0, 1] and multiply by the vocabulary size, so the layout tokens are on the same scale as the text tokens.
  6. Decoders: there are two, a text-layout decoder and a vision decoder.
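
A minimal sketch of points 3-5, assuming a simple linear patch projection and center-of-box assignment of tokens to patches; the class name, patch size, hidden size, and layout vocabulary size are illustrative choices, not the paper's code:

```python
import torch
import torch.nn as nn

class JointVisionTextEmbedder(nn.Module):
    """Embed image patches and text tokens, and add each text embedding to the
    feature of the image patch it falls into (the indicator described above)."""

    def __init__(self, vocab_size=32000, hidden=1024, patch=16):
        super().__init__()
        self.patch = patch
        self.patch_proj = nn.Linear(3 * patch * patch, hidden)   # each patch -> D-dim vector
        self.tok_emb = nn.Embedding(vocab_size, hidden)          # text vocabulary lookup

    def patchify(self, image):
        # image: (3, H, W) -> sequence of D-dim patch vectors, shape (num_patches, D)
        c, _, _ = image.shape
        p = self.patch
        patches = image.unfold(1, p, p).unfold(2, p, p)          # (3, H/p, W/p, p, p)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * p * p)
        return self.patch_proj(patches)

    def forward(self, image, token_ids, token_centers):
        """token_centers: (seq, 2) normalized (x, y) centre of each token's box."""
        v = self.patchify(image)                                  # image patch features
        s = self.tok_emb(token_ids)                               # text embeddings
        n_rows = image.shape[1] // self.patch
        n_cols = image.shape[2] // self.patch
        # indicator: which patch does each token's centre fall into?
        col = (token_centers[:, 0] * n_cols).long().clamp(max=n_cols - 1)
        row = (token_centers[:, 1] * n_rows).long().clamp(max=n_rows - 1)
        joint_text = s + v[row * n_cols + col]                    # s + v where the indicator is 1
        return v, joint_text                                      # patches without text stay as v


def layout_token(coord, layout_vocab=500):
    """Point 5: a coordinate normalized to [0, 1] times the vocabulary size gives
    a discrete layout token id (a layout vocabulary of 500 is an assumption)."""
    return min(int(coord * layout_vocab), layout_vocab - 1)
```

The patch sequence v and the joint text embeddings joint_text are what the modality-agnostic encoder would consume.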

Self-Supervised Pretraining tasks

  1. The unlabeled documents contain OCR text inputs with token-level bounding boxes.
  2. Joint text-layout reconstruction: given a sentinel prompt with layout and text tokens, mask 15% of the tokens and predict both the text and the layout (see the sketch after this list).
  3. Layout modeling asks the model to predict the position of the text tokens, masking 75% of them.
  4. Visual text recognition identifies the text at a given location in the image.
  5. Masked image reconstruction with text and layout: reconstruct the image given the text and layout. As in MAE, a percentage of the image patches are masked and the non-masked patches are fed into a vision encoder; the vision decoder uses cross-attention with character embeddings.
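
A hedged sketch of how a joint text-layout reconstruction example could be built, using T5-style sentinel tokens and the 15% masking ratio mentioned above; the sentinel and layout token strings, and the exact target format, are illustrative placeholders:

```python
import random

def joint_text_layout_example(tokens, boxes, mask_ratio=0.15, layout_vocab=500):
    """tokens: OCR words; boxes: normalized (x0, y0, x1, y1) per token.
    Mask ~15% of the tokens; the target asks the model to generate, for each
    sentinel, the masked text together with its discretized layout tokens."""
    source, target = [], []
    sentinel_id = 0
    for tok, box in zip(tokens, boxes):
        if random.random() < mask_ratio:
            sentinel = f"<sentinel_{sentinel_id}>"
            sentinel_id += 1
            source.append(sentinel)
            layout = "".join(f"<loc_{min(int(c * layout_vocab), layout_vocab - 1)}>" for c in box)
            target.append(f"{sentinel} {tok} {layout}")
        else:
            source.append(tok)
    # a task prompt tells the model which objective to follow (wording illustrative)
    return "Joint Text-Layout Reconstruction. " + " ".join(source), " ".join(target)
```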

Supervised pretraining tasks

  1. Classification with image and label.
  2. Layout analysis: given an entity, predict its bounding boxes.
  3. Information extraction: predict the entity type and location of a text query (see the sketch after this list).
  4. Question answering.
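
Since all of these supervised tasks are cast into the same sequence-to-sequence format, here is a hedged sketch of how question answering and information extraction examples might be serialized; the prompt wording and layout token format are illustrative, not quoted from the paper:

```python
def qa_example(question, ocr_tokens, answer):
    """Document QA: task prompt + question + OCR text as source, answer as target."""
    source = "Question Answering. " + question + " " + " ".join(ocr_tokens)
    return source, answer

def information_extraction_example(text_query, entity_type, box, layout_vocab=500):
    """Information extraction: given a text query, generate its entity type and location."""
    layout = "".join(f"<loc_{min(int(c * layout_vocab), layout_vocab - 1)}>" for c in box)
    return "Information Extraction. " + text_query, entity_type + " " + layout
```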

Pre-training

  1. The overall structure is the T5-large (encoder-decoder) architecture.
  2. It has 794M trainable parameters.
  3. The tokenizer is the T5 tokenizer from Hugging Face, with the vocabulary extended with special tokens (see the sketch after this list).
  4. IIT-CDIP 1.0 is used; it has 11 million scanned documents with text and token-level bounding boxes from OCR.
  5. Curriculum learning: from small resolution to large resolution.
  6. Adam optimizer with lr = 5e-5, 1000 warmup steps, batch size 512, weight decay 0.01.
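
A hedged sketch of points 3 and 6, using the Hugging Face T5 tokenizer and a plain Adam setup; the special-token strings, the linear warmup schedule, and the total step count are my assumptions:

```python
import torch
from transformers import (T5Tokenizer, T5ForConditionalGeneration,
                          get_linear_schedule_with_warmup)

# start from the T5 tokenizer and extend the vocabulary with special tokens,
# e.g. discretized layout tokens (the token strings here are illustrative)
tokenizer = T5Tokenizer.from_pretrained("t5-large")
tokenizer.add_tokens([f"<loc_{i}>" for i in range(500)])

# plain T5-large stands in for the full vision-text-layout model in this sketch
model = T5ForConditionalGeneration.from_pretrained("t5-large")
model.resize_token_embeddings(len(tokenizer))

# optimizer and warmup matching the hyperparameters listed above
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=100_000)  # total steps: illustrative
```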