科技回声 (Tech Echo)

A tech news platform built with Next.js, providing global tech news and discussion.


Mr. Beast Saying Increasingly Large Amounts of Money

1 point by felipemesquita 5 months ago

1 comment

felipemesquita, 5 months ago
From the Methodology section:

> First, for the main source of data, I chose all Mr. Beast videos with uploaded (i.e. non-auto-generated) transcripts—a total of 229 of the 837 published videos on his flagship channel. This gave me a source of processable ground truth about where money was mentioned, and it also limited the videos to those published in the last 6 years, which make up the majority of his meteoric rise. Then I downloaded the videos in 360p and scraped their transcripts for every occurrence of a dollar amount, logging each mention with its sum, video, and context in a database that I would build on top of as I nailed down the exact timing. I used those contextual timestamps to make rough clips that I fed into the open-source AI tool Whisper to (a) get a more precise measurement of where "X dollars" was actually said and (b) standardize and double-check that my first scrape had gotten the amount correct. Finally, as many of the clips were still off by a few annoying and noticeable fractions of a second in either direction, I made a script that allowed me to go through each entry individually, trim or extend the clip on either end, and modify the amount one last time if my first two methods had failed. After all 2,800+ were processed—a task that took weeks—I made a final set of clips out of higher-quality versions of the videos and used Premiere to make the film's final dizzying supercut you see before you.

> 90% of data science is data cleaning, and I have kept this overview pretty high-level in the interest of making it accessible to a wide audience. A much longer and more technical dive into the steps needed to go from a raw YouTube archive to this video—including everything from token suppression, the comparative benefits of transcription libraries, counterintuitive ways to standardize and parse numbers in natural language, and debugging audio desyncs in clip concatenations—may appear in the future on my website.
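The scrape-and-log step in the first quoted paragraph could be sketched roughly like this in Python. The caption format, function name, and record fields here are illustrative assumptions, not the author's actual code:

```python
import re

# Rough sketch of the dollar-amount scrape described above (hypothetical,
# not the author's pipeline). Each caption is a (start_seconds, text) pair.
DOLLAR_RE = re.compile(
    r"\$\s?(\d[\d,]*(?:\.\d+)?)"          # "$10,000", "$ 5"
    r"|(\d[\d,]*(?:\.\d+)?)\s+dollars?",  # "500,000 dollars"
    re.IGNORECASE,
)

def scrape_mentions(captions, video_id):
    """Return one record per dollar mention: video, timestamp, amount, context."""
    mentions = []
    for start, text in captions:
        for m in DOLLAR_RE.finditer(text):
            raw = m.group(1) or m.group(2)  # whichever alternative matched
            mentions.append({
                "video": video_id,
                "timestamp": start,  # rough; refined later against Whisper output
                "amount": float(raw.replace(",", "")),
                "context": text,
            })
    return mentions

mentions = scrape_mentions(
    [(12.4, "I'm giving away $10,000 right now"),
     (30.1, "last to leave wins 500,000 dollars")],
    "abc123",
)
```

Note that a digit-based regex like this misses spelled-out amounts ("one million dollars"), which is one reason the commenter mentions "counterintuitive ways to standardize and parse numbers in natural language" as a follow-up topic.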