This blog post is about how to use transformers to solve the Visual Question Answering (VQA) problem

Paper

https://arxiv.org/pdf/2210.03347

Pix2Struct

Pix2Struct is a pretrained image-to-text model for visually-situated language understanding

The pretraining data consists of masked screenshots of web pages paired with simplified HTML parses of those pages (the web provides a ton of this data)
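
To make the pretraining setup concrete, here is a minimal sketch of what a single training pair could look like. The file name, mask coordinates, and simplified-HTML tag set are my own illustrative assumptions, not the paper's exact pipeline:

```python
from PIL import Image, ImageDraw

# Toy illustration of one screenshot-parsing pretraining pair, assuming we
# already have a rendered web-page screenshot ("page_screenshot.png" is a
# hypothetical file) and its simplified HTML source.
screenshot = Image.open("page_screenshot.png").convert("RGB")

# Mask one region of the screenshot so the model has to recover its content
# from the surrounding visual and textual context.
draw = ImageDraw.Draw(screenshot)
draw.rectangle([(40, 120), (360, 160)], fill="gray")  # hide one text block

# Target: a simplified HTML parse of the page (the tag set here is only a
# stand-in for the paper's simplified HTML format). The masked text still
# appears in the target, so predicting it requires joint reasoning over
# layout, imagery, and language.
target_html = "<header>Pix2Struct</header><p>An image-to-text model for visually-situated language</p>"

training_pair = {"image": screenshot, "text": target_html}
```

In the real setup these pairs come from large-scale crawls of rendered web pages, which is exactly why the training data is so plentiful.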

Why previous models fall short

Previous work typically focuses on complex, task-specific combinations of available inputs and tools:

  • Document-understanding models rely on external OCR systems
  • UI-understanding models rely on platform-specific metadata
  • Diagram-understanding models rely on diagram parses (diagrams are common in medical records, for example)

External OCR is not a good choice: it adds an extra pipeline stage at inference time, and its errors propagate into the downstream model.

Some work tries to remove this task-specific engineering at inference time by instead learning to decode OCR outputs during pretraining.

Why Pix2Struct rocks

  1. Pixel-level inputs are mapped directly to an HTML-based parse
  2. Masked inputs force the model to jointly reason about the relationship between text and images
  3. Variable-resolution inputs for the ViT (Vision Transformer) encoder preserve the screenshot's native aspect ratio (see the inference sketch after this list)
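
To see this in practice, here is a minimal inference sketch using the Hugging Face transformers implementation of Pix2Struct, assuming a DocVQA-finetuned checkpoint such as google/pix2struct-docvqa-base and a hypothetical document image. The key point is that the processor patches the image at its native resolution rather than squashing it to a fixed square, and for VQA-style checkpoints it renders the question as a text header on top of the image:

```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Checkpoint name is an assumption; any DocVQA-finetuned Pix2Struct
# checkpoint should work the same way.
model_name = "google/pix2struct-docvqa-base"
processor = Pix2StructProcessor.from_pretrained(model_name)
model = Pix2StructForConditionalGeneration.from_pretrained(model_name)

# Hypothetical document image; replace with your own scan or screenshot.
image = Image.open("invoice.png").convert("RGB")
question = "What is the total amount due?"

# The processor splits the image into variable-resolution patches while
# keeping its aspect ratio; for VQA checkpoints the question is rendered
# as a header above the image before patching.
inputs = processor(images=image, text=question, return_tensors="pt")

# Autoregressively decode the answer text.
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(answer)
```

Because the question is drawn onto the image itself, the same image-to-text interface covers document QA, UI captioning, and diagram QA without any task-specific input engineering.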

DocVQA

On the DocVQA leaderboard, the highest score seems to come from UDOP. Let's look at UDOP next.