Many years ago, I took part in a very odd, unique museum project in which I was asked to just "talk through" my entire life. Someone came to my house, hit record, and I started talking. The objective was to be as detailed as possible, to stay in sequence, and not to skip anything. It took several days of sessions, and I think I ended up with 12 CDs of just me recounting my life. Super odd.<p>I have had no use for it so far. It sits in a museum (along with other people's stories) as a sort of time capsule of what life was like for a regular person at the time.<p>I have now started to wonder whether that would be enough audio to train a TTS model _properly_ (the language is German). I know some will respond with "you only need 15 seconds of audio", but I have NEVER managed to get any of these things to work properly or to produce nice results. It seems like those things were mostly made to hit the news, not for actual use.<p>So, in 2024, without a 4090 or an A100 sitting in my basement, and without wanting to spend a considerable amount of money on it, what would be the best approach to building a voice model out of this?<p>What I have is: Windows, OS X, and Linux, on x64 as well as an Apple M2 Pro. AND I have A LOT OF TIME to let these things run on their own. Time is NOT an issue here; this can take however long it needs.<p>So, how would you build an audio model out of this? Without subscription services, without renting A100s, just here, at home?<p>Thanks!