Dear HN community,
I have spent several years working in data operations for companies applying ML/AI to medical problems, e.g. pharmaceuticals / drug discovery and cancer diagnostics.<p>It's a truism that data availability and quality are paramount to getting any useful results. However, what I have seen and heard consistently is that knowledge of biomedical datasets is a) siloed by specialism, b) hard to acquire, with a very high barrier to entry (i.e. you need time, and ideally some practice, to figure out what's state of the art), and c) inconsistent: the choice of databases in some domains can vary considerably between users, be they companies or researchers, often without them being able to explain the reason.<p>To quickly understand the landscape our clients operate in, I started wondering whether there was a way to systematise this information so that I would know which datasets I should know well. I looked at aggregators like OpenPHACTS and OpenTargets and at curated lists (like Expasy or Nature recommended data repositories), but those were relatively small. I was aware of some larger aggregators (like the NAR Database Issue, FAIRsharing.org or Google Dataset Search). Still, it was not easy to make sense of each of them, and they were not necessarily limited to biology/health data. The question slowly evolved into 'what datasets are out there that might be useful for a particular ML/AI use case?'<p>Long story short: with plenty of help, I started building a discovery tool, mostly based on our own needs, and then noticed it was useful for others. It cannot replace domain knowledge, but it provides a heuristic starting point for new entrants (ahem ahem, hoping there might be a couple in this community). The needs evolved beyond a Google Sheet (which can solve something like 80% of problems of this scale), so it's wrapped with a small search and recommendation engine.<p>It contains over 5000 datasets, periodically refreshed. 
They range from cell biology, chemistry, and protein and molecule structure and classification, through pathways, drug information, side and adverse effects, and omics, to anonymised medical records and clinical standards. It merges the results from specialist sources with our private curated collection, reconciles the records, and augments them with information extracted from the scientific literature.<p>As far as we are aware, this is the most extensive open collection of its kind for biomedical data. I hope it may be of some use to others, especially those early in the journey!