Wowsers. This is a step change in quality compared to SOTA. I suspect that without evaluating the samples together as a group, distinguishing the generated samples from those recorded by a human will be little better than a coin toss.<p>And even when evaluating these samples as a group, I may be imagining the distinctions I am drawing, since this is a relatively small selection that might be cherry-picked. Nevertheless:<p>The generated samples are more consistent as a group, and more even in quality, with few instances of emphasis that seem (however slightly) out of place.<p>The recorded human samples vary more <i>between</i> samples: a sample as a whole may carry a bit of extra stress, or a slightly raised or lowered tone, compared to the other samples. Within a sample there is also a bit more emphasis on a word or two, or slight variance in the length of pauses. These variations are mostly appropriate in context (as in, similar to what I, a non-professional[0], would have emphasized if I were asked to record these).<p>In general for a non-dramatic voiceover you want to maintain consistency between passages (especially if they may be heard out of order) without completely flattening the in-passage variation, but tastes vary.<p>Conclusion: For many types of voice work, these generated samples are comparable in quality or slightly superior to recordings of an average professional. For semi-dramatic contexts (e.g. audiobooks) the generated samples are firmly in the "more than good enough" zone, more or less comparable to a typical narrator who doesn't "act" as part of their reading.<p>[0] Decades ago in Los Angeles I tried my hand at voiceover and voice acting work, but gave up when it quickly became clear that being even slightly prone to stuffy noses, tonsillitis and sore throats was going to pose a major obstacle to being considered reliable unless I was willing to regularly use decongestants, expectorants, and the like.