TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Ask HN: Traning LLM directly on file bytes

2 pointsby stealthcatover 1 year ago
Multi-modal LLM like PaLM, GPT4, MiniGPTv2 relies on data encoder (image, speech models) to map data to token embedding space.<p>Is there any attempt to directly train on file bytes? Make the only vocab of LLM as base-2, base-8 or hexadecimal, then do next token prediction on this.<p>I know some attempts have been done like MEGABYTE and Charformer but some may have is not directly learning from bytes with all the header info

no comments

no comments