The more I listen to NotebookLM “episodes”, the more I am convinced that Google has trained a two-speaker “podcast discussion” model that generates the podcast audio directly, building on an existing multimodal backbone. The two speakers interrupt and talk over each other in an uncannily humanlike way. I wonder whether they essentially fine-tuned against a huge library of real podcasts and their transcripts, perhaps generating synthetic “input material” from the transcripts to feed in as training samples.

In other words: take an episode of The Daily, have one language model write a hypothetical article summarizing what the episode was about, then pass that article into the two-speaker model, transcribe the output, and see how well that transcript aligns with the article fed in as input.

I am sure I’m missing essential details, but the natural sound of these podcasts cannot possibly be coming from a plain text transcript.
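Something like this toy pipeline is what I am imagining (purely speculative on my part; the model name, prompt, and pairing scheme are placeholders, not anything Google has described):

```python
# Speculative sketch of the synthetic-data idea above: for each real podcast
# transcript, ask an LLM to write the "source article" that could plausibly
# have produced it, then keep (article, transcript) as a training pair for a
# two-speaker dialogue model. Everything here is a placeholder assumption.
from openai import OpenAI

client = OpenAI()

def make_training_pair(transcript: str) -> dict:
    article = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for whatever summarization model is used
        messages=[
            {"role": "system",
             "content": "Write the news article that this two-host podcast "
                        "episode could plausibly have been based on."},
            {"role": "user", "content": transcript},
        ],
    ).choices[0].message.content
    # Input: synthetic article. Target: the real two-speaker episode/transcript.
    return {"input": article, "target": transcript}
```

Training the audio model on pairs like this, then checking how well a transcription of its output lines up with the input article, would be one way to measure the alignment I described.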
This is in fact pretty explicitly not open source: https://github.com/meta-llama/llama-recipes/blob/d83d0ae7f5c9953737d9bbdca490c03282f24c37/pyproject.toml#L18

(And given there is no LICENSE file, I’m afraid you can only use this code as reference at best right now.)
Great to see this. Fellow tech geeks, ignore the NotebookLM thing at your peril.

NotebookLM, far and away, has been the "AI killer app" for the VAST MAJORITY of bright-but-not-particularly-techy people I know. My 70-ish parents and my 8-year-old kid are both just blown away by this thing and can't stop playing with it.

Edit: As someone pointed out below, I absolutely mean just the "podcast" thing.
I tried to build something kind of like NotebookLM (personalized news podcasts) over the past few months (https://www.tailoredpod.ai), but the biggest issue is that the good existing TTS APIs are so expensive that a product like NotebookLM is not really feasible for a normal company without internal access to Google's models. OpenAI has the cheapest TTS API with good-enough quality, but even then, generating hours of audio for free is way too expensive.

Open-source TTS models are slowly catching up, but they still need beefy hardware (e.g. https://github.com/SWivid/F5-TTS).
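To put rough numbers on the cost problem, here is what per-turn synthesis with OpenAI's TTS endpoint looks like; the prices and character counts in the comments are my own back-of-envelope estimates, not guarantees:

```python
# Minimal sketch: synthesize a two-host script one dialogue turn at a time,
# alternating voices per speaker. Voice choices and cost figures are illustrative.
from openai import OpenAI

client = OpenAI()
VOICES = {"HOST_A": "alloy", "HOST_B": "onyx"}

def synthesize_turn(speaker: str, text: str, path: str) -> None:
    resp = client.audio.speech.create(
        model="tts-1",  # roughly $15 per 1M input characters at the time of writing
        voice=VOICES[speaker],
        input=text,
    )
    with open(path, "wb") as f:
        f.write(resp.content)

# A ~10 minute two-host episode is on the order of 10-15k characters of script,
# i.e. a few tens of cents per episode; multiply by "hours of audio for free"
# per user and it stops being viable without your own models.
```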
Pretty weird choice of TTS engines. None of them are anywhere near state of the art as far as open TTS systems go. XTTSv2 or the new F5-TTS would have been much better choices.
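For comparison, running XTTSv2 through the Coqui TTS package is only a few lines; a sketch assuming a CUDA GPU and a short reference clip for the voice (the file paths are placeholders):

```python
# Sketch: zero-shot voice cloning with XTTSv2 via the coqui-ai TTS package.
# "speaker_reference.wav" is a placeholder path to a few seconds of the target voice.
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_to_file(
    text="Welcome back to the show. Today we're digging into this week's paper.",
    speaker_wav="speaker_reference.wav",
    language="en",
    file_path="host_a_turn.wav",
)
```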
The sample output is very poor. Cool demo, but really just emphasizes how much of a hit product the NotebookLM team has managed to come up with, ostensibly with more or less the same foundation models already available.
I'm not sure this is an open-source NotebookLM so much as a few experiments in an IPython notebook. What NotebookLM does at the LLM level is not particularly novel; it's the packaging as a product, in a different way than what others are doing, that I think is interesting. Also, the "podcast" bit is really just an intro/overview of a large corpus; far more useful is being able to discuss that corpus with the bot and get cited references.

What this does demonstrate, however, is that prototyping with LLMs is very fast. I'd encourage anyone who hasn't had a play around with the APIs to give it a go.
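To illustrate the point about prototyping speed, the "podcast overview" step is basically one prompt. A rough sketch (the model name and prompt are placeholders, not what NotebookLM or this repo actually use):

```python
# Minimal prototype: turn a document into a two-host dialogue script with a
# single chat completion. The real work is product packaging, not this call.
from openai import OpenAI

client = OpenAI()

def draft_podcast_script(document: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "Write a short, engaging two-host podcast script that "
                        "gives an overview of the provided document. Label each "
                        "turn with HOST_A: or HOST_B:."},
            {"role": "user", "content": document},
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    with open("corpus.txt") as f:
        print(draft_podcast_script(f.read()))
```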
Here is another (Jupyter-based) notebook solution supporting LLaMA models: https://raku.land/zef:antononcube/Jupyter::Chatbook

Here is a demo movie: https://youtu.be/zVX-SqRfFPA
If we could have this running locally on a mobile phone, that would be pretty cool. Imagine receiving a work document (for example, a product requirements document) and having it turned into a podcast that plays while I'm driving. I think my productivity would go through the roof, and I wouldn't need to worry about compliance issues.