Any time I see engineering time spent on splitting/sharding test suites, I can't help but wonder if access to a single beefier runner (e.g. 64+ cpus) would have alleviated all that work. Also always find the duplicated setup time a bit wasteful on resources.
Every time I see this type of story I can only think how people get rewarded for heroically/dramatically rescuing bad systems. While those who choose boring/proven technology at an earlier point, don't get recognition (or even hired in many cases).
They call pytest --collect-only and parse the output before distributing it to the Github Action python matrix. I'm a little surprised pytest doesn't offer the ability to cache collection. Though a slow collection time may be an issue that should be addressed since developers must suffer it locally!