Pix2Struct
This blog post is about how to use transformers to solve the Visual Question Answering (VQA) problem.
Paper
https://arxiv.org/pdf/2210.03347
Pix2Struct
This is an image-to-text model for visually-situated language
The pretraining data consists of masked screenshots of web pages paired with their simplified HTML, and the web provides a practically unlimited supply of such pairs.
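As a quick preview of what "image-to-text" means in practice, here is a minimal sketch of running a DocVQA-finetuned Pix2Struct checkpoint through the Hugging Face transformers library. The checkpoint name and the local image path are assumptions for illustration, not something specified in the paper.

```python
# Minimal sketch, assuming the `google/pix2struct-docvqa-base` checkpoint and
# a local document image exist (`pip install transformers torch pillow`).
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

checkpoint = "google/pix2struct-docvqa-base"
processor = Pix2StructProcessor.from_pretrained(checkpoint)
model = Pix2StructForConditionalGeneration.from_pretrained(checkpoint)

image = Image.open("invoice.png")              # hypothetical document image
question = "What is the total amount due?"

# For VQA checkpoints the processor renders the question as a text header on
# top of the image, so the whole pipeline stays pixel-only (no external OCR).
inputs = processor(images=image, text=question, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```

Note that nothing in this pipeline calls an OCR engine, which is exactly the contrast with previous models discussed next.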
Why previous models suck
The focus is typically on complex, task-specific combinations of available inputs and tools:
- Document-understanding models rely on external OCR systems
- UI-understanding models rely on platform-specific metadata
- Diagram-understanding models rely on diagram parses (diagrams show up a lot in medical records, for example)
Relying on an external OCR system in particular is not a good choice: it adds a separate component to run at inference time, and OCR errors propagate into the downstream model.
Some work tries to remove this task-specific engineering at inference time by learning to decode OCR outputs during pretraining.
Why Pix2Struct rocks
- Pixel-level inputs -> HTML-based parse outputs
- Masking the inputs lets the model capture the relationship between text and image jointly
- Variable-resolution inputs for the ViT (Vision Transformer) encoder, which preserve the aspect ratio of the original image (see the sketch after this list)
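To make the last point concrete, here is a simplified sketch of the variable-resolution idea: instead of squashing every image to a fixed square, the image is rescaled while preserving its aspect ratio so that as many fixed-size patches as possible fit under a patch budget, and each patch keeps its 2-D (row, column) coordinate. The function name and defaults below are illustrative, not the actual Pix2Struct preprocessing code.

```python
import math
import torch
import torch.nn.functional as F

def variable_resolution_patches(image: torch.Tensor, patch_size: int = 16,
                                max_patches: int = 2048):
    """Rescale `image` (C, H, W) so that as many patch_size x patch_size
    patches as possible fit under `max_patches`, keeping the aspect ratio,
    then return flattened patches with their (row, col) coordinates."""
    c, h, w = image.shape
    # Largest uniform scale s such that (s*h/p) * (s*w/p) <= max_patches.
    scale = math.sqrt(max_patches * (patch_size / h) * (patch_size / w))
    n_rows = max(min(math.floor(scale * h / patch_size), max_patches), 1)
    n_cols = max(min(math.floor(scale * w / patch_size), max_patches), 1)

    resized = F.interpolate(image.unsqueeze(0),
                            size=(n_rows * patch_size, n_cols * patch_size),
                            mode="bilinear", align_corners=False).squeeze(0)

    # Cut into non-overlapping patches and flatten each one.
    patches = resized.unfold(1, patch_size, patch_size)    # (C, rows, W, p)
    patches = patches.unfold(2, patch_size, patch_size)    # (C, rows, cols, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(n_rows * n_cols, -1)

    rows = torch.arange(n_rows).repeat_interleave(n_cols)  # patch row ids
    cols = torch.arange(n_cols).repeat(n_rows)             # patch col ids
    return patches, rows, cols

# Example: a tall screenshot and a wide screenshot get different grids,
# but both stay within the same patch budget.
tall = torch.rand(3, 1200, 600)
wide = torch.rand(3, 600, 1200)
for img in (tall, wide):
    p, r, c = variable_resolution_patches(img, max_patches=512)
    print(p.shape, int(r.max()) + 1, "rows x", int(c.max()) + 1, "cols")
```

The payoff is that a tall receipt and a wide web page both keep their native proportions while consuming the same sequence-length budget.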
DocVQA
It seems the highest score on DocVQA comes from UDOP, so let's take a look at UDOP next.