
How Does GPT-4o Encode Images?

334 points · by olooney · 11 months ago

27 comments

ComputerGuru · 11 months ago
We desperately need a modern open-source replacement for Tesseract built on current SoTA ML tech. It is insane that we are resorting to LLMs for this purpose (aside from being the wrong tool and far too overpowered for the job, they are prone to hallucinations, have insanely expensive training and inference costs, etc.) because the "best" non-LLM solution is so bad it can't even correctly OCR monospaced hi-res scans of ASCII text with sufficient accuracy.
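For context, a minimal Tesseract run through the pytesseract wrapper looks like the sketch below; the image path is a placeholder:

```python
# Minimal OCR sketch using the pytesseract wrapper around Tesseract.
# "scan.png" is a placeholder path; requires `pip install pytesseract pillow`
# plus a system Tesseract binary.
from PIL import Image
import pytesseract

image = Image.open("scan.png")           # e.g. a hi-res scan of printed text
text = pytesseract.image_to_string(image)
print(text)
```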
valine · 11 months ago
LLaVA 1.6, InternVL, and CogVLM2 can all do OCR with nothing but tiled image embeddings and an LLM. Feeding in OCR results from Tesseract improves the reliability of the transcript, especially for long strings of random characters, but it's not strictly necessary for the model to read the text out of the image.

CLIP embeddings can absolutely "read" text if the text is large enough. Tiling enables the model to read small text.
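As a rough illustration of the tiling idea, here is a sketch that splits an image into fixed-size tiles before embedding; the 512x512 tile size is an assumption for illustration, and each model picks its own scheme:

```python
# Sketch: split an image into fixed-size tiles, as tiled-embedding VLMs do.
# Tile size is illustrative; real models also handle resizing and aspect ratio.
from PIL import Image

def tile_image(path: str, tile: int = 512) -> list[Image.Image]:
    img = Image.open(path)
    w, h = img.size
    tiles = []
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            # crop() pads with black if the box extends past the image edge
            tiles.append(img.crop((left, top, left + tile, top + tile)))
    return tiles

# Each tile would then be embedded separately and fed to the LLM.
```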
riemannzeta · 11 months ago
Love this curious and open-minded exploration of how this stuff works.

The pyramid strategy loosely tracks with renormalization group theory, which has been formally studied for years as a method of interpreting machine learning models: https://arxiv.org/abs/1410.3831

I love the convergence we're seeing in the use of models from different fields to understand machine learning, fundamental physics, and human consciousness. What a time to be alive.
enjoylife · 11 months ago
> Interestingly enough, it's actually more efficient to send text as images: A 512x512 image with a small but readable font can easily fit 400-500 tokens worth of text, yet you're only charged for 170 input tokens plus the 85 for the 'master thumbnail' for a grand total of 255 tokens—far less than the number of words on the image.

Sounds like an arbitrage opportunity for all those GPT wrappers. Price your cost per token the same, send the prompt over as an image, pocket the difference?
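A quick sketch of that arithmetic, using only the per-tile numbers quoted above (85 base tokens plus 170 per 512x512 tile); the real billing formula also involves resizing rules, so treat this as an approximation:

```python
# Sketch of the cost arithmetic quoted above: 85 tokens for the master
# thumbnail plus 170 per 512x512 tile (numbers as reported in the article).
import math

def image_tokens(width: int, height: int, base: int = 85, per_tile: int = 170) -> int:
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return base + tiles * per_tile

print(image_tokens(512, 512))    # 255 tokens for a single tile
print(image_tokens(1024, 1024))  # 765 tokens for four tiles
```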
simonw · 11 months ago
Something I don't get is why OpenAI don't provide clear, comprehensive documentation as to how this actually works.

I get that there's competition from other providers now, so they have an instinct to keep implementation details secret, but as someone building on their APIs this lack of documentation really holds me back. To make good judgements about how to use this stuff I need to know how it works!

I had a hilarious bug a few weeks ago where I loaded in a single image representing multiple pages of a PDF, and GPT-4 Vision effectively hallucinated the contents of the document when asked to OCR it, presumably because the image was too big and was first resized to a point where the text was illegible: https://simonwillison.net/2024/Apr/17/ai-for-data-journalism/#llm-mistakes

If OpenAI had clear documentation about how their image handling works I could avoid those kinds of problems much more effectively.
rafaelero · 11 months ago
They are very likely using a VQ-VAE to create a dictionary of tokens and then just converting images into them with an encoder.
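For illustration, a minimal sketch of the vector-quantization step a VQ-VAE performs; the codebook size and dimensions here are made up, not GPT-4o's:

```python
# Sketch of VQ-VAE-style quantization: map each encoder output vector to the
# index of its nearest codebook entry. All sizes are illustrative.
import torch

codebook = torch.randn(8192, 64)        # 8192 learned code vectors, 64-dim
encoder_out = torch.randn(170, 64)      # e.g. 170 vectors from an image encoder

# Nearest-neighbour lookup: each vector becomes a discrete "image token" id.
distances = torch.cdist(encoder_out, codebook)   # (170, 8192)
token_ids = distances.argmin(dim=1)              # (170,) integer token ids

# The LLM would then consume these ids much like text token ids.
print(token_ids[:10])
```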
comboy · 11 months ago
I love how well this is written. Definitely "look how interesting this is" rather than "look how much I know". And it dives as deep as it needs to, while staying accessible to almost everyone. One really needs to master a topic to be able to describe it simply. Great job.
GaggiX · 11 months ago
An important aspect not considered in the article is that GPT-4o can generate images by itself (even though the feature is not enabled for the public), meaning it's very likely trained on sequential image tokens, with the images quantized by a VQGAN. My guess is that the VQGAN takes 512x512 images and outputs 13x13 tokens (169 image tokens + a special token). The VQGAN could be a convolutional network as shown in the article; for a transformer-based VQGAN I cannot think of a configuration with overlapping patches that would output 13x13 tokens on a 512x512 image (unless they just added a padding of 4 to the entire image and the patches are not overlapping).
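That 13x13 guess can be checked with the standard output-size formula for convolutions/patches; the short search below prints every (patch, stride, padding) combination that yields a 13-wide grid on a 512-pixel side, including the padding-of-4 case mentioned above:

```python
# Which (patch, stride, padding) combinations give a 13x13 token grid on a
# 512x512 image? Standard output-size formula:
#   out = (size + 2*pad - patch) // stride + 1
for patch in range(8, 65):
    for stride in range(1, patch + 1):
        for pad in range(0, 8):
            if (512 + 2 * pad - patch) % stride == 0 and \
               (512 + 2 * pad - patch) // stride + 1 == 13:
                print(f"patch={patch} stride={stride} pad={pad}")

# e.g. patch=40, stride=40, pad=4 is the non-overlapping case above:
# (512 + 8 - 40) / 40 + 1 = 13
```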
cs702 · 11 months ago
One possibility is that mapping an image to a token embedding consumes ~170x more compute+space than mapping a token id.

Another possibility is that OpenAI is mapping each image to ~170 vectors in an embedding space that is shared with token IDs. If that's the case, the architecture of the image-to-fixed-number-of-tokens model has not been disclosed. It could be a standard CNN, a ViT-like model, an autoencoder, a model that routes a variable number of vectors with RGB data to a fixed number of vectors, or something else that has not yet been published. The whole thing is likely trained end-to-end.
HarHarVeryFunny · 11 months ago
I don't think a 13x13 tiling (of N channels/features) can be ruled out just because the model can't recognize a grid of 13x13 objects. There is presumably a lot of overlap between the receptive fields of the tiles (due to kernel step sizes).

A pyramid of overlapping tiling resolutions is of course possible too.
simonw · 11 months ago
The way this tests GPT-4o's performance by feeding in a 7x7 grid of colored shapes and requesting them back as JSON (about halfway down the page) is really clever.
geor9e · 11 months ago
Nit: the implied premise that this isn't a beautiful and skilled painting: https://www.oranlooney.com/post/gpt-cnn_files/malicious_dogs.png
iknownothow · 11 months ago
I'm probably wrong, but the author may have misunderstood input embeddings. Input embeddings are just dictionary lookup tables: the tokenizer generates tokens, and for each token you find its embedding in the lookup table.

The author is speculating about an embedding model, but in reality they're speculating about the image tokenizer.

If I'm not wrong, the text tokenizer Tiktoken has a dictionary size of 50k. The image tokenizer could have a very large dictionary size or a very small one. The 170 tokens this image tokenizer generates might actually contain repeating tokens!

EDIT: PS. What I meant to say was that input embeddings do not come from another trained model; tokens come from other trained models. The input embedding matrix undergoes backpropagation (learning). This is very important: it allows the model to move the embeddings of tokens together or apart as it sees fit. If you use embeddings from another model as input embeddings, you're basically adding noise.
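A minimal sketch of the lookup-table view described above; vocabulary and dimension sizes are illustrative:

```python
# Sketch: input embeddings as a learned lookup table.
import torch
import torch.nn as nn

vocab_size, dim = 50_000, 768
embedding = nn.Embedding(vocab_size, dim)   # one learnable row per token id

token_ids = torch.tensor([15496, 995, 0])   # ids from a tokenizer (made up here)
vectors = embedding(token_ids)              # (3, 768): a pure table lookup

# The table's rows receive gradients during training, so the model can move
# token embeddings together or apart, as the comment notes.
loss = vectors.sum()
loss.backward()
print(embedding.weight.grad.shape)          # torch.Size([50000, 768])
```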
blixt · 11 months ago
I went through a similar journey back when GPT-4V came out. Here's an additional puzzle for you: GPT-4V knows the *exact* pixel dimensions of the image (post-resize, since there is a max size for images in the pipeline, besides 512x512), but I'm 99% sure they're not provided as text tokens. How am I so sure? It's easy to get GPT to divulge everything from system prompt to tool details, etc., but I've tried every trick in the book and then some, multiple times over, and there is no way to get it to quote the dimensions as text. The only way to get the dimensions is to tell it to output a structure that contains width and height and just pick something reasonable, and they will "randomly" be the correct values: https://x.com/blixt/status/1722298733470024076
joelburget · 11 months ago
Vision transformers should be our default guess as to how GPT-4o works, yet this article never mentions them.
sva_ · 11 months ago
Great article. Perhaps some part of this magic number simply factors in the amount of compute necessary to run the image through the CNN (proportional to the compute used per token in the LM).
surfingdino · 11 months ago
OCR is hard: https://www.vice.com/en/article/gvy4gb/one-mans-david-and-goliath-battle-to-get-xerox-to-fix-a-major-bug
yorwba · 11 months ago
It would be interesting to see what happens when you slightly shift the grid of objects until they're split across multiple tiles, and how that affects accuracy.
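A sketch of how one might generate the test images for that experiment; the grid and shape parameters are arbitrary choices for illustration:

```python
# Sketch: draw a 7x7 grid of colored shapes at increasing pixel offsets so the
# grid straddles hypothetical tile boundaries. Parameters are illustrative.
from PIL import Image, ImageDraw

def grid_image(offset: int, cells: int = 7, cell: int = 64) -> Image.Image:
    img = Image.new("RGB", (512, 512), "white")
    draw = ImageDraw.Draw(img)
    colors = ["red", "green", "blue", "orange", "purple", "brown", "black"]
    for row in range(cells):
        for col in range(cells):
            x = offset + col * cell
            y = offset + row * cell
            draw.ellipse((x, y, x + 32, y + 32), fill=colors[(row + col) % 7])
    return img

for offset in (0, 8, 16, 24, 32):
    grid_image(offset).save(f"grid_offset_{offset}.png")
# Each image would then be sent to the model with the same JSON-extraction prompt.
```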
SubiculumCode · 11 months ago
I'm not sure how GPT-4o routes information. If a picture containing text is submitted, does the text get resubmitted to GPT-4o as a textual query, or do the model weights themselves essentially transform the text in the image into textual tokens? I do wonder if a response to text in images is similar to a response to text queries, i.e. processed by the same weights.
imranhou · 11 months ago
Not to be nit-picky, but double-checking myself: isn't a token just ~0.75 words, so 170 tokens would be about 127 words, not 227?
tantalor · 11 months ago
> CLIP embeds the entire image as a single vector, not 170 of them.

Single token?

> GPT-4o must be using a different, more advanced strategy internally

Why?
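For reference, a sketch using the open_clip package showing that a stock CLIP image encoder returns a single vector per image; the model choice and image path are placeholders:

```python
# Sketch: a stock CLIP image encoder yields one vector per image, as the
# quoted line says. Uses the open_clip package; model config chosen arbitrarily.
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
image = preprocess(Image.open("example.png")).unsqueeze(0)  # placeholder path

with torch.no_grad():
    features = model.encode_image(image)

print(features.shape)  # torch.Size([1, 512]): one 512-dim vector per image
```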
jmount · 11 months ago
Scanning images is quite the problem in the presence of compression (and now interpolation): https://www.bbc.com/news/technology-23588202
jamesy0ung · 11 months ago
I've always wondered how text-to-image models like Stable Diffusion work. Do they just encode RGB values into a matrix and then have a helper tool convert that data into a JPG?
rvnx · 11 months ago
The author claims the most likely explanation is that Tesseract is running behind ChatGPT-4v/o.

There is no way this is Tesseract. Tesseract's accuracy is very low; it can barely do OCR on printed documents.
alach11 · 11 months ago
I really hope we see improvements in the resolutions large multimodal models can handle. Right now this patchwork approach leads to lots of unwieldy workarounds in applications.
eminence32 · 11 months ago
I'm assuming that the tokens used to encode an image are entirely distinct from the tokens used to encode text. Does anyone know if this is actually the case?
sashank_1509 · 11 months ago
I would be disappointed if OpenAI had a separate model for OCR, though I guess that is believable. Much cooler if the LLM just understands language from text.