Ask HN: How efficient are MLs at storing data?

6 points by LoveMortuus 10 months ago
With MLs for video and audio getting better and better, I had a thought: what if you were to train a model to perfectly replicate 1000 movies?

Would the model be smaller in size than all 1000 movies together?

I'm sorry if this is a stupid question; I don't fully understand how these work under the hood. I have a bit of theory but very little practice.

My curiosity was along the lines of CDs like "Greatest Hits of X": similarly, if the model were more efficient on storage, you could have "Greatest Movies and TV Shows of X".

3 comments

talldayo 10 months ago
> Would the model be smaller in size than the size of all 1000 together?

This is a good question, because you're butting up against the fundamental limits of information theory that make this field interesting.

For starters, I think we have to set some rules. A "lossless" representation of these 1000 movies would mean that simply prompting the system with the name of a film could generate the movie perfectly. If the model fails to reproduce that film on its first try, or cannot *exactly* recreate the training file, it is lossy compression.

With that said, I think we can start painting a picture of how efficient ML can be. You are processing tens of thousands of frames of visual data while attempting to lose as little of the source material as possible. Getting a *single movie* to render properly on the first attempt inherently relies on luck; retrying over multiple attempts is infeasible given how long you have to wait and how much energy it costs. You'd be brute-forcing a video generator against a checksum that it may or may not hit.

I would argue that the efficiency of ML for storing this data depends on your tolerance for error. If you require an output as pristine as your input, ML is not a suitable compression medium for your data.
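A minimal sketch of the lossless test described above, in Python. The `generate_movie` call is purely hypothetical, standing in for prompting the model; the point is that the output only counts as lossless compression if it is bit-identical to the training file, which a checksum comparison can verify:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def is_lossless(original: Path, regenerated: Path) -> bool:
    """The model is a lossless compressor for this film only if the
    regenerated file is bit-identical to the original."""
    return sha256_of(original) == sha256_of(regenerated)

# Hypothetical usage: generate_movie() stands in for prompting the model.
# regenerated = generate_movie("some_film_title")
# print(is_lossless(Path("some_film.mkv"), regenerated))
```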
Comment #41026783 not loaded.
TheAlchemist 10 months ago
This is actually an excellent question! And quite an active field of research.

Here is one paper you could read. It's about text rather than films, but the idea is the same:

"Language Modeling Is Compression" - https://arxiv.org/pdf/2309.10668
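The core idea behind that paper is that a predictive model plus an entropy coder is a compressor: each symbol the model predicts with probability p costs about -log2(p) bits, so the better the model predicts, the smaller the compressed output. A toy sketch of that accounting (the probabilities here are made-up illustrations, not real model outputs):

```python
import math

def ideal_code_length_bits(probs: list[float]) -> float:
    """Given the probability the model assigned to each symbol that actually
    occurred, an arithmetic coder spends about -log2(p) bits per symbol.
    The total compressed size is roughly the sum, plus small overhead."""
    return sum(-math.log2(p) for p in probs)

# Toy example: a model that predicts each of 1000 tokens with probability 0.9
# needs about 152 bits in total; a uniform model over 256 symbols needs 8000.
print(ideal_code_length_bits([0.9] * 1000))      # ~152 bits
print(ideal_code_length_bits([1 / 256] * 1000))  # 8000 bits
```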
Someone 10 months ago
I wouldn't aim for "perfectly", not because it is impossible, but because it introduces an unnecessary constraint.

Your typical movie is already compressed with a lossy algorithm, so it does not perfectly replicate the original either. You should instead aim for a compressor whose output has a smaller error, using some metric that accounts for what viewers can actually see.

You should also define "size" as "size of the compressed movies plus the decompressor", because otherwise the winning decompressor will simply contain copies of the movies.

For a similar problem for text, see https://en.wikipedia.org/wiki/Hutter_Prize
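A small sketch of that Hutter-Prize-style accounting, with hypothetical file paths: the score counts the decompressor's own size, so embedding the movies inside the "decompressor" gains nothing.

```python
from pathlib import Path

def total_submission_size(compressed: list[Path], decompressor: Path) -> int:
    """Size of the compressed data PLUS the decompressor binary itself,
    so smuggling the originals into the decompressor does not help."""
    return sum(p.stat().st_size for p in compressed) + decompressor.stat().st_size

def compression_ratio(original_bytes: int, submission_bytes: int) -> float:
    """How many times smaller the submission is than the raw originals."""
    return original_bytes / submission_bytes

# Hypothetical usage with placeholder paths and a made-up corpus size:
# size = total_submission_size([Path("movies.bin")], Path("decompress"))
# print(compression_ratio(1000 * 4_000_000_000, size))
```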