Yeah, I'd run into this 2 years ago and ended up also reporting an issue on the CenterNet repo [1]

The solution I have in that issue adapts from the very helpful discussions in the original PyTorch issue [2]:

`worker_init_fn=lambda id: np.random.seed(torch.initial_seed() // 2**32 + id)`

I will admit that this is *very* easy to mess up, as evidenced by the fact that examples in the official PyTorch tutorials and other well-known codebases suffer from it. In the PyTorch training framework I've helped develop at work, we've implemented a custom `worker_init_fn` as outlined in [1] that is the default for all "trainer" instances, which are responsible for instantiating DataLoaders in 99% of our training runs.

Also, as an aside: Holy clickbaity title, Batman! Maybe I should have blogged about this 2 years ago. Heck, every 6 months or so I think that, and then I realize that I'd rather spend time with my kids and on my hobbies when I'm not working on interesting ML stuff and/or coding. An added side benefit is not having to worry about writing idiotic clickbaity titles like this to farm karma, or providing high-quality unpaid labor for Medium in order for my efforts to actually be seen by people. But it could also just be that I'm lazy :-)

[1] https://github.com/xingyizhou/CenterNet/issues/233

[2] https://github.com/pytorch/pytorch/issues/5059
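For anyone who wants to see it in context, here's a minimal, self-contained sketch of wiring that lambda into a DataLoader. The toy dataset is mine, not from the issue, and this assumes the default fork start method on Linux (a lambda can't be pickled under spawn):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class RandomDataset(Dataset):
    """Toy dataset that draws from NumPy's global RNG in __getitem__."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Without per-worker reseeding, forked workers return duplicates here.
        return np.random.rand()

if __name__ == "__main__":
    loader = DataLoader(
        RandomDataset(),
        num_workers=4,
        # Derive a distinct NumPy seed per worker from torch's per-worker seed.
        worker_init_fn=lambda id: np.random.seed(torch.initial_seed() // 2**32 + id),
    )
    for batch in loader:
        print(batch)
```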
This post is yet another example of why you should never use APIs for random number generation that rely upon and mutate hidden global state, like the functions in numpy.random. Instead, use APIs that deal with RNG state explicitly, e.g., by calling methods on an explicitly created numpy.random.Generator object. JAX takes this one step further: there are no mutable RNG objects at all, and the user has to explicitly manipulate RNG state with pure functions.

It’s a little annoying to have to set and pass RNG state explicitly, but on the plus side you never hit these sorts of issues. Your code will also be completely reproducible, without any chance of spooky “action at a distance.” Once you’ve been burned by this a few times, you’ll never go back.

You might think that explicitly seeding the global RNG would solve reproducibility issues, but it really doesn’t. If you call into any code you didn’t write, it might also be using the same global RNG.
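A minimal sketch of the contrast, using NumPy's own Generator API (the seeds here are arbitrary):

```python
import numpy as np

# Hidden global state: any library code you call can mutate this behind your back.
np.random.seed(42)
a = np.random.rand(3)

# Explicit state: this Generator object is the only thing that advances it.
rng = np.random.default_rng(42)
b = rng.random(3)

# Independent streams (e.g., one per data-loading worker) via SeedSequence:
seeds = np.random.SeedSequence(42).spawn(4)
worker_rngs = [np.random.default_rng(s) for s in seeds]
```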
Great catch!

> I downloaded and analysed over a hundred thousand repositories from GitHub that import PyTorch. I kept projects that use NumPy’s random number generator with multi-process data loading. Out of these, over 95% of the repositories are plagued by this problem. It’s inside PyTorch’s official tutorial, OpenAI’s code, NVIDIA’s projects, etc. [1]

[1] https://github.com/pytorch/pytorch/issues/5059
IIRC the bug Karpathy mentioned in his tweet was actually due to the seed being the same across multi-GPU data-parallel workers! You need to account for this too, so the author hasn't fully solved it.

I know this because I fixed the bug. And probably caused it. Hehe.

Also, you don't just want to set your NumPy seed, but also the native Python one and the torch one.
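A hedged sketch of a `worker_init_fn` that handles the Python and NumPy RNGs together (deriving from `torch.initial_seed()` is one common pattern, not the only one):

```python
import random
import numpy as np
import torch

def seed_all_rngs(worker_id):
    # torch.initial_seed() already differs per DataLoader worker
    # (base_seed + worker_id), so worker_id needn't be folded in again.
    seed = torch.initial_seed() % 2**32  # NumPy seeds must fit in 32 bits
    random.seed(seed)
    np.random.seed(seed)
    # torch itself is already seeded per worker; no torch.manual_seed needed here.
```

Pass it as `worker_init_fn=seed_all_rngs` when constructing the `DataLoader`. For the multi-GPU case you'd also want to fold the process rank into the seed.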
I always randomly log a sample of my inputs to TensorBoard to manually review what my training data *actually* looks like and (hopefully) pick up on bugs like these. Similarly, I find logging high-loss inputs very informative.

Coincidentally, I find this article timely, as I was recently reviewing the PyTorch DataLoader docs regarding random number generator seeding. It’s the kind of thing unit tests don’t pick up, since it only occurs when you use separate worker processes.
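A minimal sketch of that kind of sampled logging, assuming torch's SummaryWriter and an NCHW image batch (the function name and 1% rate are arbitrary):

```python
import random
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()  # writes event files under ./runs by default

def maybe_log_inputs(images: torch.Tensor, step: int, rate: float = 0.01):
    # Sample roughly 1% of batches so the event files stay manageable.
    if random.random() < rate:
        writer.add_images("train/inputs", images, global_step=step)
```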
.NET has a similar pitfall, though not due to forking, but rather because the parameterless Random() constructor seeds from the system clock. So several threads constructing new Random objects in quick succession, in the hope that they are unique, might in fact get the same RNG sequence.
Forgetting to seed your RNG is a really classic bug. IMHO RNGs should auto-seed unless explicitly set not to, but since the opposite behaviour was baked into C so many years ago, it's kind of the default. The worst part is how easy this bug is to miss, unless you happen to be explicitly printing out the first set of random numbers for some strange reason.
Note official TensorFlow tutorials make the exact same mistake. I've reported it but it hasn't been fixed. [1]

[1] https://github.com/tensorflow/tensorflow/issues/47755
I notice that the web page of this article is beautifully justified on both sides instead of left-aligned, and there are hyphens at line breaks. Does anyone know how to achieve this on a web page? `text-align: justify` seems to produce inferior results compared to this page, e.g. rivers in the text.
This seems like another reason to never use fork() without exec(). Fork is really a minefield when used this way (and, by my understanding, a pretty big maintenance burden on the kernel to provide the illusion of sharing read-only state with the parent process).
Is there something specific about numpy here, or would it be any RNG?

I'm looking at some code that uses `random.random()` to randomly apply augmentations; I suspect that will have the same issue, right?
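It would be any RNG whose state lives in the process. A quick sketch demonstrating the pitfall with both the stdlib and NumPy, assuming the fork start method (the Linux default):

```python
import multiprocessing as mp
import random
import numpy as np

def draw(worker_id):
    # Both RNG states were copied from the parent at fork time,
    # so every worker prints the exact same numbers.
    print(worker_id, random.random(), np.random.rand())

if __name__ == "__main__":
    mp.set_start_method("fork")  # "spawn" starts fresh interpreters that self-seed
    workers = [mp.Process(target=draw, args=(i,)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```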
Python has os.register_at_fork nowadays, so why do we still have this kind of behavior? Not reseeding after fork has been a footgun for almost as long as fork has existed.
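A minimal sketch of what such a hook could look like (the helper name is made up, and neither NumPy nor the stdlib installs this for you; requires Python 3.7+ on a platform with fork):

```python
import os
import random
import numpy as np

def _reseed_child():
    # Give each forked child fresh, OS-derived RNG state.
    seed = int.from_bytes(os.urandom(4), "little")
    random.seed(seed)
    np.random.seed(seed)

os.register_at_fork(after_in_child=_reseed_child)
```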
Would normally refrain from upvoting this on account of the title, but the actual topic was important enough that I think it can be worth an exception.
A lot of comments are criticising the frameworks or the developers, but surprisingly almost no one is criticising Python, which remains a language of the early '90s as far as parallelism is concerned.

A bit like Stockholm syndrome: "Python doesn't do threading" is so ingrained in its users' minds (and I'm a user) that it's not even questioned as a potential source of problems.

(No one said it's easy to do. That's why language developers and implementers are a special breed even today.)
This is probably because I never read these kinds of blog posts, but this is one of the most flagrantly clickbait titles I've ever seen. The article doesn't even suggest ditching NumPy in favor of JAX or some other hot take (which would at least warrant such a bombastic title); it literally just presents one instance in which you *might* be making a mistake when using NumPy's RNG (and not even something unique to NumPy). And the PyTorch team is aware of this, hence it exposes `worker_init_fn`. So the title should actually be "Using fork without understanding fork? You might be making a mistake."
"You're making a mistake" sounds like one shouldn't use PyTorch and NumPy together, when the actual message is "there might be a mistake in your code".
Aside from the infuriating clickbait title (which I shall not dignify with an upvote), this is part of why I preprocess augmented images. I don't like too much magic in my custom derived (PyTorch) Dataset objects.