His data formatting could be improved here. The title and authors would be better off explicitly marked, for example with quotes, and the separate songs should be explicitly delimited with '<|endoftext|>'. Looking at the samples in <a href="https://github.com/EugenHotaj/beatles/blob/master/gpt_2_generated.txt" rel="nofollow">https://github.com/EugenHotaj/beatles/blob/master/gpt_2_gene...</a> , GPT-2 does mostly manage to figure out that the songs are separate, but omitting '<|endoftext|>' makes the task harder for GPT-2, makes it more prone to run-ons (already a problem with GPT-2), and makes prompting less effective (since you can't prompt it with something like '<|endoftext|>"On The Run" by John Lennon\n' to make it generate lyrics for a specific title & author). It also wouldn't hurt if he had included the specific commands + hyperparameters for the nshepperd repo he's apparently using, even if only the defaults, along the lines of the examples in my own writeup ( <a href="https://www.gwern.net/GPT-2" rel="nofollow">https://www.gwern.net/GPT-2</a> ).<p>I'm not surprised that GPT-2-117M has memorized songs by the end of training: it's not a very large corpus of songs, so it's hard to learn and generalize well from it. If one were working more on this, it'd probably make sense to train on a much larger and more varied corpus of songs (with inline metadata properly formatted to allow controllable generation); something like RapGenius, maybe?
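The formatting suggested above could be sketched roughly as follows. This is only an illustration, not the original author's pipeline: the function names and the dict keys ("title", "authors", "lyrics") are hypothetical, and placing '<|endoftext|>' before each song simply mirrors the prompt example in the comment.

```python
# Hypothetical helper names; only the '<|endoftext|>' token and the
# '"Title" by Author' header shape come from the comment above.
END_OF_TEXT = "<|endoftext|>"  # GPT-2's document separator token


def format_song(title, authors, lyrics):
    """Return one training example: delimiter, quoted-title header, lyrics."""
    header = f'"{title}" by {authors}'
    return f"{END_OF_TEXT}{header}\n{lyrics.strip()}\n"


def build_corpus(songs):
    """Concatenate all songs into a single training text."""
    return "".join(
        format_song(s["title"], s["authors"], s["lyrics"]) for s in songs
    )


def make_prompt(title, authors):
    """Build a prompt in the same shape as the training headers."""
    return f'{END_OF_TEXT}"{title}" by {authors}\n'
```

Because the prompt has exactly the same shape as the header of each training example, a fine-tuned model can be steered toward a given title and author by sampling from make_prompt("On The Run", "John Lennon").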
Or any lyrics: <a href="http://billion.dev.losttech.software:2095/" rel="nofollow">http://billion.dev.losttech.software:2095/</a><p>And the blog article: <a href="https://habr.com/post/453232/" rel="nofollow">https://habr.com/post/453232/</a> (also there's no paywall here)
Tricks in beam search to force rhyme schemes, or techniques like constrained Markov chains (cf. <a href="https://redylan.neocities.org/#/how-it-works/" rel="nofollow">https://redylan.neocities.org/#/how-it-works/</a> and <a href="https://github.com/gabrielebarbieri/markovchain" rel="nofollow">https://github.com/gabrielebarbieri/markovchain</a>), can give really strong results in lyric / structured text generation.<p>Might be worth investigating if you are interested in this application.
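To give a rough idea of what constraining a Markov chain means here, below is a minimal sketch, not the implementation in either linked project. It assumes a first-order word-level chain, a fixed line length, and a very crude rhyme test (shared last two letters); a real system would more likely use a pronunciation dictionary. It also only prunes transitions that cannot reach a rhyming ending and does not renormalize probabilities. All function names are made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(lines):
    """First-order word-level Markov chain: word -> list of observed successors."""
    chain = defaultdict(list)
    for line in lines:
        words = line.lower().split()
        for a, b in zip(words, words[1:]):
            chain[a].append(b)
    return chain


def rhymes(a, b, k=2):
    """Crude stand-in for a rhyme test: same last k letters, different words."""
    return a != b and a[-k:] == b[-k:]


def allowed_sets(chain, length, is_final_ok):
    """Backward pass: allowed[i] holds the words at position i from which
    a valid final word is still reachable in the remaining steps."""
    vocab = set(chain) | {w for succ in chain.values() for w in succ}
    allowed = [set() for _ in range(length)]
    allowed[-1] = {w for w in vocab if is_final_ok(w)}
    for i in range(length - 2, -1, -1):
        allowed[i] = {
            w for w in chain if any(s in allowed[i + 1] for s in chain[w])
        }
    return allowed


def generate_line(chain, length, rhyme_with, rng=random):
    """Sample a line of `length` words whose last word rhymes with `rhyme_with`.
    Returns None if no such line exists in the chain."""
    allowed = allowed_sets(chain, length, lambda w: rhymes(w, rhyme_with))
    if not allowed[0]:
        return None
    word = rng.choice(sorted(allowed[0]))
    out = [word]
    for i in range(1, length):
        # Only follow transitions that can still end on a rhyming word.
        candidates = [s for s in chain[word] if s in allowed[i]]
        word = rng.choice(candidates)
        out.append(word)
    return " ".join(out)
```

The backward pass is what separates this from generate-and-reject: dead-end transitions are removed before sampling, so every sampled line satisfies the constraint.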