> people unacquainted with biology have a false perception of how low-throughput biology experimentation is. In many ways, it can be. But the underlying physics of microbiology lends itself very well to experiments that could allow one to collect tens-of-thousands, if not millions, of measurements in a singular experiment. It just needs to be cleverly set up.

I think this passage gets at the fundamental rift between those focused purely on computational advances and those innovating in wet-lab techniques.

Why? Because years of people's careers have been wasted waiting on promises from molecular biologists claiming they will make these "clever" high-throughput experiments work. In my experience, they'll spend months to years concocting a Rube Goldberg machine of chained molecular biology steps, each of which has (at best) a 90% success rate. You don't have to chain many of these together before your "clever" setup has a ~0% probability of successfully gathering data.
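To put rough numbers on that compounding (my own back-of-the-envelope, assuming each step succeeds independently, which is generous):

```python
# Back-of-the-envelope: if each step in a chained protocol succeeds
# independently with probability p, overall success decays geometrically
# with the number of chained steps.
p = 0.90
for n_steps in (1, 5, 10, 20, 30):
    print(f"{n_steps:2d} steps at {p:.0%} each -> {p ** n_steps:6.1%} overall")
# 10 steps is already down to ~35%; 30 chained steps is ~4%.
```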
> Get those linguists out of here, more data will replace whatever insights they have! It’s a fun and increasingly popular stance to take. And, to a degree, I agree with it. More data will replace domain experts, the bitter lesson is as true in biology as it is in every other field.

I think it’s fundamentally shifting how people approach R&D in all physical fields. The power of “the ML way” is almost a self-fulfilling prophecy. Once you see ML upend the standard approach in one area, the question is not if but when it will upend your area, and the natural next step is to ask, “how can I massively increase data collection rates so I can feed ML?” It just completely flips all branches of science on their head, from carefully investigating and building first-principles theory, to saying “screw it, I really just wanted to map this design space so I can accurately predict outcomes, why don’t I just build a machine to do that?”

It then becomes a question of how easy it actually is to build an ML-feeding machine (not easy, very problem-specific), ergo the pendulum now swings to physical lab automation.
In biology, the most important step is finding the right thing to measure. Biological systems are highly contextual, so the second most important step is finding the second thing to measure in relationship to the first thing.

In the case of AlphaFold, measuring crystal structures is the most important thing (molecular phenotype). The second most important thing is measuring many genomes. Multiple sequence alignments allow evolution (variation under selection) to tell you about the important bits of the structure. The distance from aligned DNA sequences to protein structure isn't a bridge too far.

Unfortunately, biology has been misled by the popularity of transcriptomics, which the post touches on briefly (limits of single-cell approaches). Transcriptomics generates lots of data (relatively) cheaply, but it isn't really the right thing to measure most of the time because it is too far removed causally from the organismal phenotype, the thing we generally care about in biomedicine. Although gene expression has provided some insights, we've exhausted most of its value by now and I doubt ML will rescue it (speaking from personal experience).
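To make the "variation under selection" point concrete, here's a toy sketch (a made-up alignment, not real data) of reading per-column conservation out of an MSA; the low-entropy columns are the "important bits" selection is protecting, and covariation between columns is what carries the structural signal:

```python
from collections import Counter
from math import log2

# Hypothetical aligned protein fragments; a real MSA would have thousands of rows.
msa = [
    "MKVLAG",
    "MKVLSG",
    "MRVLAG",
    "MKVIAG",
]

n = len(msa)
for col in range(len(msa[0])):
    counts = Counter(seq[col] for seq in msa)
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    print(f"column {col}: entropy {entropy:.2f} {dict(counts)}")
# Zero-entropy columns (0, 2, 5 here) are fully conserved; in real alignments
# those tend to be the catalytic or structurally critical positions.
```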
The flip side of this is that progress in ML for biology is always going to be _slower_ than progress in ML for natural languages and images [1].

Humans are natural machines capable of sensing and verifying the correctness of a piece of text or an image in milliseconds. So if you have a model that generates text or images, it’s trivial to see if they’re any good. Whereas for biology, the time to validate a model’s output is measured more in weeks. If you generate a new backbone with RFDiffusion, then generate some protein sequences with LigandMPNN, and then want to see if they fold correctly … that takes a week. Every time. Use ML to solve _that_ problem and you’ll be rich.

TFA mentions the difficulty of performing biological assays at scale, and there are numerous other challenges, such as the number of different kinds of assays required to get the multimodal data needed to train the latest models like ESM-3 (multimodal here meaning primary sequence, secondary structure, tertiary structure, plus several other tracks). You can’t just scale a fluorescent product plate reader assay to get the data you need. We need sequencing tech, functional assays, protein-protein interaction assays, X-ray crystallography, and a dozen others, all at scale.

What I’d love to see companies like A-Alpha and Gordian and others do is use ML to improve the wet-lab tech itself. Make the assays better, faster, cheaper with ML, the way Nanopore sequencers use ML to translate the electrical signal of DNA passing through the pore into a sequence. So many companies have these sweet assays that are very good. In my opinion, if we want transformative progress in biology, we should spend less time fitting the same data with different models, and more time improving and scaling wet-lab assays using ML. Can we use ML to make the assays better, make our processes better, to improve the amount and quality of data we generate? The thesis of TFA (and experience) suggests that using the data will be the easy part.

[1] https://alexcarlin.bearblog.dev/why-is-progress-slow-in-generative-ai-for-biology/
I read a great piece from Michael Bronstein about this very topic earlier this year:

https://towardsdatascience.com/the-road-to-biology-2-0-will-pass-through-black-box-data-bbd00fabf959

I think an important point raised here is the distinction between good data and the "relative" data present in a lot of biology. As examples from the article, a protein structure or a genome/protein sequence is good data, but data like RNA-seq or mass spectrometry is relative (and subject to sensitivity, noise, etc.). The way I like to think of it is that sequence data and structural data are looking at the actual thing, whereas the relative data only gets you a sliver of a snapshot of a process. It's therefore easier to build models that capture relationships between representations of real things than models where you can't really distinguish between signal and noise. I spend a fair amount of time these days trying to figure out how to take advantage of good data to gain insights into things where we have relative data.
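A toy example (made-up counts) of why the "relative" part matters: after within-sample normalization, RNA-seq values are a composition, so a gene can appear to change simply because other genes did:

```python
# Two samples where only geneZ actually changed in absolute abundance.
raw_counts = {
    "sample_A": {"geneX": 100, "geneY": 100, "geneZ": 100},
    "sample_B": {"geneX": 100, "geneY": 100, "geneZ": 1000},
}

for sample, counts in raw_counts.items():
    total = sum(counts.values())
    cpm = {gene: c / total * 1e6 for gene, c in counts.items()}  # counts per million
    print(sample, {gene: round(v) for gene, v in cpm.items()})
# geneX and geneY "drop" ~4x in sample_B after normalization even though their
# absolute abundance is unchanged; a structure or a sequence has no such ambiguity.
```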
Not only is the lab incredibly low throughput, but most experiments also look at a single modality.

Take a cell viability or FACS assay: while some additional measurements could be taken or analysed, most of the time the scientist will look at a single parameter. The cells (from another passage/day) then go through a separate assay, resulting in nearly incomparable data.

The solution: multimodal data, and capturing more information about the experimental setup (often a bit of voodoo that isn't written down properly).
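A rough sketch of what that could look like in practice (hypothetical column names and values): key every assay, plus the setup metadata that normally goes unrecorded, on the same experimental context so the readouts stay comparable:

```python
import pandas as pd

viability = pd.DataFrame({
    "cell_line": ["A549", "A549", "HeLa"],
    "passage": [12, 13, 8],
    "compound": ["drugX", "drugX", "drugX"],
    "viability_pct": [82.0, 61.0, 74.0],
})
facs = pd.DataFrame({
    "cell_line": ["A549", "HeLa"],
    "passage": [13, 8],
    "compound": ["drugX", "drugX"],
    "marker_pos_pct": [35.0, 48.0],
})
setup = pd.DataFrame({  # the "voodoo": media lot, confluence, anything that varies by day
    "cell_line": ["A549", "A549", "HeLa"],
    "passage": [12, 13, 8],
    "media_lot": ["L41", "L44", "L44"],
    "confluence_pct": [70, 90, 80],
})

# One table per experimental context instead of incomparable per-assay files.
multimodal = (
    viability
    .merge(facs, on=["cell_line", "passage", "compound"], how="outer")
    .merge(setup, on=["cell_line", "passage"], how="left")
)
print(multimodal)
```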
Clonal DNA synthesis has increased in price over the past 6 years (even when accounting for inflation). On that metric, we’re actually regressing in our ability to modify the natural world. It’s even worse than stagnation.

Or look at lab robotics: in 2015, you could buy a new Opentrons for $2,500. Now it’s about $10,000, and the only way to rival the old pricing is to scrounge around used sales.

Enzyme prices haven’t dropped in basically forever. Addgene raised plasmid prices a little while ago.

I feel like computer hackers can’t even imagine how bad it is over here.