I usually try a technique Andrej didn't mention here which helps me a lot in the debugging and modelling phase: simulated data that encompasses a single key difficulty of the problem.

For example, along this line of thought in question answering there is the bAbI dataset, which creates auxiliary problems so you know where the difficulties in modelling are.

By pushing this to the extreme (for example, in NLP you can have a task that consists of repeating a sequence of characters in reverse order, to demonstrate that the architecture is indeed capable of memorizing like a parrot), you can often create trivial problems which take minutes to run on a single machine and help discover most bugs.

You can also create hierarchies of such problems, so you know in which order to tackle them, and you can build sub-modules and reuse them.

Quite often the code you obtain this way is very explainable, and you know which situations will work and which probably won't. But such a network architecture is usually "verbose" and numerically optimizes a little less well on large-scale problems. The trick is then to simplify your network mathematically into something more linear and more general. You can reorder some operations, like summing along a different dimension first: semantically this is different, but it will converge better, because for a network to optimize well it needs to behave well in both the forward and the backward direction so that the gradient flows.

Once you have a set of simple problems that encompass your general problem, a good solution architecture is usually a more general mixture of the model architectures of the simple problems.
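As a minimal sketch of what I mean by a "reverse the sequence" toy problem (the vocabulary and sizes here are arbitrary choices of mine):

    import random

    # Toy task: given a sequence of characters, output it reversed.
    # A model that cannot even memorize like a parrot fails here,
    # which surfaces architecture and training bugs within minutes.

    VOCAB = list("abcdefghij")

    def make_example(min_len=2, max_len=8):
        n = random.randint(min_len, max_len)
        src = [random.choice(VOCAB) for _ in range(n)]
        return "".join(src), "".join(reversed(src))

    def make_dataset(size=10_000):
        return [make_example() for _ in range(size)]

    if __name__ == "__main__":
        for src, tgt in make_dataset(3):
            print(src, "->", tgt)

Any seq2seq model you plan to use on the real problem should solve this almost immediately; if it doesn't, the bug is in the plumbing, not the data.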
This might be the most "Deep Learning" thing I've ever read:

> One time I accidentally left a model training during the winter break and when I got back in January it was SOTA (“state of the art”).
> The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data. This step is critical.

This can't be overstated. I can't count the number of times I'm the first person to find a problem with the data. It's incredibly frustrating. Just look at your damned data to sanity check it and understand what's going on. Do the same with your model outputs. Don't just look at aggregates. Look at individual instances. Lots of them.
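In practice, "look at individual instances" can be as simple as a sketch like this (the dataset fields and the model call are placeholders for whatever you actually have):

    import random

    def inspect(dataset, model=None, n=20):
        # Print a random handful of raw examples, and the model's output
        # on each one, instead of relying only on aggregate metrics.
        for example in random.sample(list(dataset), n):
            print("input :", example["input"])
            print("label :", example["label"])
            if model is not None:
                print("pred  :", model(example["input"]))
            print("-" * 40)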
Well, this was unexpectedly excellent.

I don't think "stick with supervised learning" is very good advice, though. Unsupervised techniques sometimes work well for NLP and have worked well in other domains, such as medical records [1]. In particular, any time you have access to much more unlabeled data than labeled data, it's something you should at least consider.

[1]: https://www.nature.com/articles/srep26094
As a practitioner, I found myself nodding in agreement again and again and again.

This blog post is full of the kind of real-world knowledge and how-to details that are not taught in books and often take endless hours to learn the hard way.

If you're interested in deep learning, do yourself a favor and go read this.

It is worth its weight in gold.
>>> There is a large number of fancy bayesian hyper-parameter optimization toolboxes around and a few of my friends have also reported success with them, but my personal experience is that the state of the art approach to exploring a nice and wide space of models and hyperparameters is to use an intern :). Just kidding.

LOL. Human-assisted training at scale is a perfectly acceptable route to mission-critical success, especially if you enjoy an unlimited research budget!

You can follow these instructions to the letter and the same problems around generalization will still arise. It's foundational.

For 30fps camera images, handling new data in real time works fine for 99% of scenarios. But getting usable convergence rates on petascale data problems, such as NVidia's recent work on deep learning for fusion reaction container design, requires a breakthrough, not just in software but in computing architectures as well.

Deep Reinforcement Learning and the Deadly Triad

https://arxiv.org/pdf/1812.02648.pdf

Identifying and Understanding Deep Learning Phenomena

http://deep-phenomena.org/
"""If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization."""<p>Can someone translate this to PyTorch for me? Or give a simple example of how one would go about doing this?<p>It means, that if I have a 1:10 ratio in the data, an untrained net should predict positive in 10% of the cases, right?
I am bookmarking this article; it is pure gold.

Also, it seems to me that most of what he says can be distilled into a boilerplate/template structure for any given deep learning framework, from which new projects can be forked. Does this already exist?
For anyone learning to build and train neural nets, this is a fantastic cheat sheet; Andrej is top-notch at explaining these kinds of things. The other posts on this blog are definitely worth a read as well!
"though NLP seems to be doing pretty well with BERT and friends these days, quite likely owing to the more deliberate nature of text, and a higher signal to noise ratio"<p>What is he talking about here? BERT, GPT etc are not unsupervised, they are pretrained on a task that has naturally supervised data (language modelling).
In the blog, he refers to test losses at an early stage, like in "add significant digits to your eval". Does he actually mean the test data, or is he referring to validation data? I was under the impression that we were supposed to touch the test data only once, at the very end of all training and validation. What is the right way to handle the test data?
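For context, the convention I have in mind is something like this sketch (the split fractions are arbitrary), where only train and val are used during development:

    import random

    def split(examples, val_frac=0.1, test_frac=0.1, seed=0):
        # Shuffle once, then carve out validation and test sets.
        # Validation is used for tuning and model selection during development;
        # the test set is evaluated only once, at the very end.
        rng = random.Random(seed)
        examples = list(examples)
        rng.shuffle(examples)
        n_test = int(len(examples) * test_frac)
        n_val = int(len(examples) * val_frac)
        return (examples[n_test + n_val:],        # train
                examples[n_test:n_test + n_val],  # val
                examples[:n_test])                # test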
This recipe results in a large amount of time spent before any results appear (depending on the task you are trying to solve). Classification is an easy task to apply this recipe to, but when you venture into object detection or pose estimation, data collection, labeling, and setting up training and evaluation infrastructure become much more complex.