Hi everyone,<p>I’m Nathan, working on subsets.io. We’re building a search engine for data, where you search using your internal dataset and can add matches with one click.<p>— Context —<p>The datasets we work with during our analytics/modelling jobs are often lacking. They were generated by internal systems, but in many cases important attributes are missing, or the entities in the internal dataset are significantly influenced by external factors that are not captured.<p>Unfortunately, integrating external data is a major hassle. External data is scattered across many different places (data dumps, API providers, API marketplaces, open data platforms, the web), in all kinds of formats. Integrating it is very time-consuming and requires significant technical skills, which are usually not an analyst’s or scientist’s core competency. The fragmentation of external data also makes exploration very difficult. Furthermore, it’s often unclear whether a given external dataset will add value, so it can be hard to justify the integration investment.<p>We want to make data exploration and integration easier by treating it as a search problem. Because a dataset contains many values for a given column, we can make much stronger assertions about types than APIs that work with single values. We currently check basic types, and use language models to infer context for more complex string types. For example, if you upload a dataset with countries and their respective alcohol consumption for a year, our system will recommend adding smoking rates, tariff rates, and crime rates.<p>It’s obviously a very complex problem, but I think that if we communicate clearly when the system is uncertain, we can make the process a lot easier.<p>Would love to hear your thoughts.<p>PS: Feel free to drop me a message at nathan@subsets.io if you have any questions or would like to chat.<p>Thanks :>