Multi-modal LLMs like PaLM, GPT-4, and MiniGPT-v2 rely on data encoders (image and speech models) to map data into the token embedding space.<p>Has there been any attempt to train directly on raw file bytes?
That is, make the LLM's entire vocabulary base-2, base-8, or hexadecimal tokens, then do next-token prediction on that.<p>I know there have been attempts like MEGABYTE and Charformer, but as far as I can tell they don't learn directly from raw file bytes, header info and all.
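<p>To make the idea concrete, here's a minimal sketch (my own illustration, not how MEGABYTE or Charformer actually work): read a file's raw bytes, split each byte into two hex nibbles so the whole vocabulary is just 16 tokens, and form (context, next-token) pairs for standard next-token prediction. No modality-specific encoder, headers included.

```python
def bytes_to_nibble_ids(data: bytes) -> list[int]:
    """Map each byte to two base-16 token ids (high nibble, low nibble)."""
    ids = []
    for b in data:
        ids.append(b >> 4)    # high nibble, 0..15
        ids.append(b & 0x0F)  # low nibble, 0..15
    return ids

def next_token_pairs(ids: list[int], context_len: int = 8):
    """Yield (context, target) training examples for next-token prediction."""
    for i in range(len(ids) - context_len):
        yield ids[i:i + context_len], ids[i + context_len]

# Example: the PNG magic bytes, i.e. real file header info, tokenized directly.
data = b"\x89PNG\r\n\x1a\n"
ids = bytes_to_nibble_ids(data)   # 8 bytes -> 16 tokens, each in 0..15
examples = list(next_token_pairs(ids))
```

A base-256 vocab (one token per byte) is the other obvious choice; the trade-off is vocabulary size versus sequence length, which doubles with nibbles and gets 8x longer with base-2.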