Multi-modal LLMs like PaLM, GPT-4, and MiniGPT-v2 rely on data encoders (image and speech models) to map data into the token embedding space.<p>Has there been any attempt to train directly on raw file bytes?
That is, make the LLM's entire vocabulary base-2, base-8, or hexadecimal tokens, then do next-token prediction on that.<p>I know there have been attempts like MEGABYTE and Charformer, but as far as I can tell they don't learn directly from raw file bytes, header info and all.
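<p>To make the idea concrete, here's a minimal sketch (my own illustration, not how MEGABYTE or Charformer actually work): read a file's raw bytes, split each byte into two hex nibbles so the whole vocabulary is just 16 tokens, and form (context, next-token) pairs for standard next-token prediction. No modality-specific encoder, headers included.

```python
def bytes_to_nibble_ids(data: bytes) -> list[int]:
    """Map each byte to two base-16 token ids (high nibble, low nibble)."""
    ids = []
    for b in data:
        ids.append(b >> 4)    # high nibble, 0..15
        ids.append(b & 0x0F)  # low nibble, 0..15
    return ids

def next_token_pairs(ids: list[int], context_len: int = 8):
    """Yield (context, target) training examples for next-token prediction."""
    for i in range(len(ids) - context_len):
        yield ids[i:i + context_len], ids[i + context_len]

# Example: the PNG magic bytes, i.e. real file header info, tokenized directly.
data = b"\x89PNG\r\n\x1a\n"
ids = bytes_to_nibble_ids(data)   # 8 bytes -> 16 tokens, each in 0..15
examples = list(next_token_pairs(ids))
```

A base-256 vocab (one token per byte) is the other obvious choice; the trade-off is vocabulary size versus sequence length, which doubles with nibbles and gets 8x longer with base-2.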