Yeah, I'd run into this 2 years ago and ended up also reporting an issue on the CenterNet repo [1]

The solution I have in that issue adapts from the very helpful discussions in the original PyTorch issue [2]:

`worker_init_fn=lambda id: np.random.seed(torch.initial_seed() // 2**32 + id)`

I will admit that this is *very* easy to mess up, as evidenced by the fact that examples in the official PyTorch tutorials and other well-known codebases suffer from it. In the PyTorch training framework I've helped develop at work, we've implemented a custom `worker_init_fn` as outlined in [1] that is the default for all "trainer" instances, which are responsible for instantiating DataLoaders in 99% of our training runs.

Also, as an aside: Holy clickbaity title, Batman! Maybe I should have blogged about this 2 years ago. Heck, every 6 months or so I think that, and then I realize that I'd rather spend time with my kids and on my hobbies when I'm not working on interesting ML stuff and/or coding. An added side benefit is not having to worry about writing idiotic clickbaity titles like this to farm karma, or providing high-quality unpaid labor for Medium in order for my efforts to actually be seen by people. But it could also just be that I'm lazy :-)

[1] https://github.com/xingyizhou/CenterNet/issues/233

[2] https://github.com/pytorch/pytorch/issues/5059
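For anyone who wants to see it in context, here's a minimal, self-contained sketch of wiring that lambda into a DataLoader. The toy dataset is mine, not from the issue, and this assumes the default fork start method on Linux (a lambda can't be pickled under spawn):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class RandomDataset(Dataset):
    """Toy dataset that draws from NumPy's global RNG in __getitem__."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Without per-worker reseeding, forked workers return duplicates here.
        return np.random.rand()

if __name__ == "__main__":
    loader = DataLoader(
        RandomDataset(),
        num_workers=4,
        # Derive a distinct NumPy seed per worker from torch's per-worker seed.
        worker_init_fn=lambda id: np.random.seed(torch.initial_seed() // 2**32 + id),
    )
    for batch in loader:
        print(batch)
```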
This post is yet another example of why you should never use APIs for random number generation that rely upon and mutate hidden global state, like the functions in numpy.random. Instead, use APIs that deal with RNG state explicitly, e.g., by calling methods on an explicitly created numpy.random.Generator object. JAX takes this one step further: there are no mutable RNG objects at all, and the user has to explicitly manipulate RNG state with pure functions.

It’s a little annoying to have to set and pass RNG state explicitly, but on the plus side you never hit these sorts of issues. Your code will also be completely reproducible, without any chance of spooky “action at a distance.” Once you’ve been burned by this a few times, you’ll never go back.

You might think that explicitly seeding the global RNG would solve reproducibility issues, but it really doesn’t. If you call into any code you didn’t write, it might also be using the same global RNG.
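A minimal sketch of the contrast, using NumPy's own Generator API (the seeds here are arbitrary):

```python
import numpy as np

# Hidden global state: any library code you call can mutate this behind your back.
np.random.seed(42)
a = np.random.rand(3)

# Explicit state: this Generator object is the only thing that advances it.
rng = np.random.default_rng(42)
b = rng.random(3)

# Independent streams (e.g., one per data-loading worker) via SeedSequence:
seeds = np.random.SeedSequence(42).spawn(4)
worker_rngs = [np.random.default_rng(s) for s in seeds]
```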
Great catch!

> I downloaded and analysed over a hundred thousand repositories from GitHub that import PyTorch. I kept projects that use NumPy’s random number generator with multi-process data loading. Out of these, over 95% of the repositories are plagued by this problem. It’s inside PyTorch’s official tutorial, OpenAI’s code, NVIDIA’s projects, etc. [1]

[1] https://github.com/pytorch/pytorch/issues/5059
IIRC the bug Karpathy mentioned in his tweet was actually due to the seed being the same across multi-GPU data-parallel workers! You need to account for this too, so the author hasn't fully solved it.

I know this because I fixed the bug. And probably caused it. Hehe.

Also, you don't just want to set your NumPy seed, but also the native Python one and the torch one.
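A hedged sketch of a `worker_init_fn` that handles the Python and NumPy RNGs together (deriving from `torch.initial_seed()` is one common pattern, not the only one):

```python
import random
import numpy as np
import torch

def seed_all_rngs(worker_id):
    # torch.initial_seed() already differs per DataLoader worker
    # (base_seed + worker_id), so worker_id needn't be folded in again.
    seed = torch.initial_seed() % 2**32  # NumPy seeds must fit in 32 bits
    random.seed(seed)
    np.random.seed(seed)
    # torch itself is already seeded per worker; no torch.manual_seed needed here.
```

Pass it as `worker_init_fn=seed_all_rngs` when constructing the `DataLoader`. For the multi-GPU case you'd also want to fold the process rank into the seed.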
I always randomly log a sample of my inputs to TensorBoard to manually review what my training data *actually* looks like and (hopefully) pick up on bugs like these. Similarly, I find logging high-loss inputs very informative.

Coincidentally, I find this article timely, as I was recently reviewing the PyTorch DataLoader docs regarding random number generator seeding. It’s the kind of thing unit tests don’t pick up, since it only occurs when you use separate worker processes.
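A minimal sketch of that kind of sampled logging, assuming torch's SummaryWriter and an NCHW image batch (the function name and 1% rate are arbitrary):

```python
import random
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()  # writes event files under ./runs by default

def maybe_log_inputs(images: torch.Tensor, step: int, rate: float = 0.01):
    # Sample roughly 1% of batches so the event files stay manageable.
    if random.random() < rate:
        writer.add_images("train/inputs", images, global_step=step)
```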
.NET has a similar pitfall, though not due to forking, but rather because the parameterless Random() constructor seeds from the system clock. So several threads constructing new Random objects in quick succession, in the hope that they are unique, might in fact get the same RNG sequence.
Forgetting to seed your RNG is a really classic bug. IMHO RNGs should auto-seed unless explicitly set not to, but since the opposite behaviour was baked into C so many years ago, it's kind of the default. The worst part is how easy this bug is to miss, unless you happen to be explicitly printing out the first set of random numbers for some strange reason.
Note official TensorFlow tutorials make the exact same mistake. I've reported it but it hasn't been fixed. [1]

[1] https://github.com/tensorflow/tensorflow/issues/47755
I notice that the web page of this article is beautifully justified on both sides instead of left-aligned, and there are hyphens at line breaks. Does anyone know how to achieve this on a web page? `text-align: justify` seems to produce inferior results compared to this page, e.g. rivers in the text.
This seems like another reason to never use fork() without exec(). Fork is really a minefield when used this way (and, by my understanding, a pretty big maintenance burden on the kernel to provide the illusion of sharing read-only state with the parent process).
Is there something specific about numpy here, or would it be any RNG?

I'm looking at some code that uses `random.random()` to randomly apply augmentations; I suspect that will have the same issue, right?
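It would be any RNG whose state lives in the process. A quick sketch demonstrating the pitfall with both the stdlib and NumPy, assuming the fork start method (the Linux default):

```python
import multiprocessing as mp
import random
import numpy as np

def draw(worker_id):
    # Both RNG states were copied from the parent at fork time,
    # so every worker prints the exact same numbers.
    print(worker_id, random.random(), np.random.rand())

if __name__ == "__main__":
    mp.set_start_method("fork")  # "spawn" starts fresh interpreters that self-seed
    workers = [mp.Process(target=draw, args=(i,)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```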
Python has os.register_at_fork nowadays, so why do we still have this kind of behavior? Not reseeding after fork has been a footgun for almost as long as fork has existed.
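A minimal sketch of what such a hook could look like (the helper name is made up, and neither NumPy nor the stdlib installs this for you; requires Python 3.7+ on a platform with fork):

```python
import os
import random
import numpy as np

def _reseed_child():
    # Give each forked child fresh, OS-derived RNG state.
    seed = int.from_bytes(os.urandom(4), "little")
    random.seed(seed)
    np.random.seed(seed)

os.register_at_fork(after_in_child=_reseed_child)
```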
Would normally refrain from upvoting this on account of the title, but the actual topic was important enough that I think it can be worth an exception.
A lot of comments are criticising the frameworks or the developers, but surprisingly almost no one is criticising Python, which remains a language of the early '90s as far as parallelism is concerned.

A bit like Stockholm syndrome: "Python doesn't do threading" is so ingrained in its users' minds (and I'm a user) that it's not even questioned as a potential source of problems.

(No one said it's easy to do. That's why language developers and implementers are a special breed even today.)
This is probably because I never read these kinds of blog posts, but this is one of the most flagrantly clickbait titles I've ever seen. The article doesn't even suggest ditching NumPy in favor of JAX or some other hot take (which would at least warrant such a bombastic title); it literally just presents one instance in which you *might* be making a mistake when using NumPy's RNG (and not even something unique to NumPy). And the PyTorch team is aware of this, hence it exposes `worker_init_fn`. So the title should actually be "Using fork without understanding fork? You might be making a mistake."
"You're making a mistake" sounds like one shouldn't use PyTorch and NumPy together, when the actual message is "there might be a mistake in your code".
Aside from the infuriating clickbait title (which I shall not dignify with an upvote), this is part of why I preprocess augmented images. I don't like too much magic in my custom derived (PyTorch) Dataset objects.