Pix2Struct
This blog post is about how to use transformers to solve the Visual Question Answering (VQA) problem.
Paper
https://arxiv.org/pdf/2210.03347
Pix2Struct
This is an image-to-text model for visually-situated language
The pretraining data consists of masked screenshots of web pages paired with their simplified HTML, and the web provides a practically unlimited supply of such pairs.
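As a quick preview of what "image-to-text" means in practice, here is a minimal sketch of running a DocVQA-finetuned Pix2Struct checkpoint through the Hugging Face transformers library. The checkpoint name and the local image path are assumptions for illustration, not something specified in the paper.

```python
# Minimal sketch, assuming the `google/pix2struct-docvqa-base` checkpoint and
# a local document image exist (`pip install transformers torch pillow`).
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

checkpoint = "google/pix2struct-docvqa-base"
processor = Pix2StructProcessor.from_pretrained(checkpoint)
model = Pix2StructForConditionalGeneration.from_pretrained(checkpoint)

image = Image.open("invoice.png")              # hypothetical document image
question = "What is the total amount due?"

# For VQA checkpoints the processor renders the question as a text header on
# top of the image, so the whole pipeline stays pixel-only (no external OCR).
inputs = processor(images=image, text=question, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```

Note that nothing in this pipeline calls an OCR engine, which is exactly the contrast with previous models discussed next.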
Why previous models suck
The focus is typically on complex, task-specific combinations of available inputs and tools:
- Document-understanding models rely on external OCR systems
- UI-understanding models rely on platform-specific metadata
- Diagram-understanding models rely on diagram parses (diagrams show up a lot in medical records, for example)
Relying on an external OCR system in particular is not a good choice: it adds a separate component to run at inference time, and OCR errors propagate into the downstream model.
Some work tries to remove this task-specific engineering at inference time by learning to decode OCR outputs during pretraining.
Why Pix2Struct rocks
- Pixel-level inputs -> HTML-based parse outputs
- Masking the inputs lets the model capture the relationship between text and image jointly
- Variable-resolution inputs for the ViT (Vision Transformer) encoder, which preserve the aspect ratio of the original image (see the sketch after this list)
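To make the last point concrete, here is a simplified sketch of the variable-resolution idea: instead of squashing every image to a fixed square, the image is rescaled while preserving its aspect ratio so that as many fixed-size patches as possible fit under a patch budget, and each patch keeps its 2-D (row, column) coordinate. The function name and defaults below are illustrative, not the actual Pix2Struct preprocessing code.

```python
import math
import torch
import torch.nn.functional as F

def variable_resolution_patches(image: torch.Tensor, patch_size: int = 16,
                                max_patches: int = 2048):
    """Rescale `image` (C, H, W) so that as many patch_size x patch_size
    patches as possible fit under `max_patches`, keeping the aspect ratio,
    then return flattened patches with their (row, col) coordinates."""
    c, h, w = image.shape
    # Largest uniform scale s such that (s*h/p) * (s*w/p) <= max_patches.
    scale = math.sqrt(max_patches * (patch_size / h) * (patch_size / w))
    n_rows = max(min(math.floor(scale * h / patch_size), max_patches), 1)
    n_cols = max(min(math.floor(scale * w / patch_size), max_patches), 1)

    resized = F.interpolate(image.unsqueeze(0),
                            size=(n_rows * patch_size, n_cols * patch_size),
                            mode="bilinear", align_corners=False).squeeze(0)

    # Cut into non-overlapping patches and flatten each one.
    patches = resized.unfold(1, patch_size, patch_size)    # (C, rows, W, p)
    patches = patches.unfold(2, patch_size, patch_size)    # (C, rows, cols, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(n_rows * n_cols, -1)

    rows = torch.arange(n_rows).repeat_interleave(n_cols)  # patch row ids
    cols = torch.arange(n_cols).repeat(n_rows)             # patch col ids
    return patches, rows, cols

# Example: a tall screenshot and a wide screenshot get different grids,
# but both stay within the same patch budget.
tall = torch.rand(3, 1200, 600)
wide = torch.rand(3, 600, 1200)
for img in (tall, wide):
    p, r, c = variable_resolution_patches(img, max_patches=512)
    print(p.shape, int(r.max()) + 1, "rows x", int(c.max()) + 1, "cols")
```

The payoff is that a tall receipt and a wide web page both keep their native proportions while consuming the same sequence-length budget.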
DocVQA
It seems the highest score on DocVQA comes from UDOP, so let's take a look at UDOP next.