TechEcho

Show HN: Create Computer Vision Datasets in Half the Time with Emerald AI (Beta)

2 points by Datenstrom, almost 4 years ago

1 comment

Datenstrom, almost 4 years ago
Hey HN,

We’re Mike, Mark, Patrick, and Derek of Emerald AI (emerald-ai.info), and we’re building a platform to clean, balance, and annotate data for rapidly deploying computer vision (CV) models. Unlike other data platforms that support vision data types, we do not assume that your data is ready for labeling. We provide tools to analyze, sort, arrange, and visualize your raw data, downselect to the most beneficial subset, and search for more underrepresented attributes and classes. We use a combination of active, Bayesian, and meta-learning to make this possible with very little labeled data.

We’ve worked on large computer vision projects in industry, including autonomous in-space robotic construction at NASA, geospatial object detection, and aerial search and rescue, and we have experienced the difficulty of getting these systems working well in production. Despite the progress in pretrained models and transfer learning, one of the biggest challenges in getting models quickly trained, deployed, and performing well in production is acquiring enough high-quality data. The idea for the platform originated from a past project where we didn’t have enough labeled data, obtaining more directly from the source was impossible, and all other means were cost-prohibitive. To deliver a solution regardless, we used transfer learning to create classifiers and hooked them up to a web scraper that could search for data that improved our model. The process was still cumbersome and not a general solution, but it saved us so much time that we decided to build something that automates it more efficiently for everyone.

Data labeling costs can be prohibitive when building CV datasets in some settings. For example, medical and DoD data often carry privacy concerns, and some data requires specialized training to label, such as medical imagery or datasets with fine-grained sub-categories.
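In case it helps make that bootstrap concrete: a classifier steering acquisition toward the data that most improves the model is, at its core, uncertainty sampling. Below is a minimal, self-contained sketch on synthetic data. The nearest-centroid model, the synthetic pool, and the batch size are illustrative stand-ins, not our actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for features of a large unlabeled pool (e.g. image embeddings).
X = rng.normal(size=(1000, 5))
y_oracle = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels we must "pay" to reveal

# Tiny stratified seed set, mimicking the few-label starting point.
labeled = list(np.where(y_oracle == 0)[0][:5]) + list(np.where(y_oracle == 1)[0][:5])
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

def fit_centroids(idx):
    """Toy 'model': one mean vector per class over the labeled rows."""
    Xs, ys = X[idx], y_oracle[idx]
    return np.stack([Xs[ys == c].mean(axis=0) for c in (0, 1)])

for _ in range(5):  # five acquisition rounds
    centroids = fit_centroids(labeled)
    Xu = X[unlabeled]
    # Distance from every pool point to each class centroid.
    d = np.linalg.norm(Xu[:, None, :] - centroids[None, :, :], axis=2)
    # Most uncertain = smallest margin between the two class distances.
    margin = np.abs(d[:, 0] - d[:, 1])
    pick = np.argsort(margin)[:20]
    newly = {unlabeled[i] for i in pick}
    labeled.extend(newly)          # "label" the acquired examples
    unlabeled = [i for i in unlabeled if i not in newly]

# Evaluate the final model on the whole (now fully revealed) pool.
centroids = fit_centroids(labeled)
d_all = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
pred = (d_all[:, 1] < d_all[:, 0]).astype(int)
accuracy = (pred == y_oracle).mean()
```

In the real setting, the synthetic pool is replaced by a stream from a scraper or data lake, and the toy model by the annotation model; the loop shape stays the same.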
Other teams we have spoken to have petabyte-scale raw image and video datasets that would be extremely valuable when labeled, but the costs of cleaning, organizing, and labeling petabytes of data are often too high with traditional methods. We also hope to empower academics, researchers, and hobbyists to build datasets that would otherwise have been impossible.

We are solving this problem with several features on our platform:

1) We train models as you label, reaching high accuracy with only a few examples rather than thousands.

2) The iteration between creating the annotation model and seeing results is quick, which lets you identify errors quickly.

3) The level of effort in creating annotation models is low enough to analyze and balance the dataset beyond training labels (weather conditions, lighting, etc.).

4) Streaming data sources, including our web scraper, can be added to collect more underrepresented data, which can often improve performance.

Our approach uses Bayesian meta-learning, which has recently seen rapid improvement [1-2], and we are ready to move from benchmarks [3-4] to actual use cases; we will be starting a closed beta by the end of this year. Our method is also very computationally efficient, which allows for rapid feedback. One advantage of accurate uncertainty estimates is that we can surface that information in our data visualizations alongside the produced labels. When applied to dataset attributes for balancing, the attribute labels do not need to achieve extremely high accuracy or certainty before the results are acceptable, unlike the labels used for model training; this works because deep-learning models are generally robust to slightly imbalanced data. Finally, by connecting to external data sources such as a web scraper or an internal data lake, we can search for the underrepresented data that will most improve your dataset and, in turn, model performance.

Thanks for reading!
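To illustrate the balancing idea with a toy example: confident predictions from an attribute-annotation model can be turned into inverse-frequency weights that steer acquisition toward the rare attribute. The "night"/"day" attribute, the confidence threshold, and the simulated score distribution below are all invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated output of an attribute-annotation model: P(lighting == "night")
# for each of 500 images, skewed so "night" is underrepresented.
p_night = rng.beta(1.0, 2.0, size=500)

# Keep only confident predictions. Attribute labels tolerate some noise,
# since training is robust to slightly imbalanced data, so a simple
# threshold is enough here.
confident = (p_night < 0.2) | (p_night > 0.8)
attr = (p_night[confident] > 0.5).astype(int)  # 1 = night, 0 = day

# Inverse-frequency sampling weights: the scraper / data-lake search is
# steered toward whichever attribute value is underrepresented.
counts = np.bincount(attr, minlength=2)
weights = counts.sum() / np.maximum(counts, 1)
weights = weights / weights.sum()
```

The same shape of computation extends to many attributes at once; the point is only that approximate, confidence-filtered labels are already good enough to drive balancing.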
We are posting here before anywhere else, hoping to find a few solid use cases for the beta. I would also love the kind of valuable feedback I have seen in other launch posts over the years.

[1] C. Nguyen, T.-T. Do, and G. Carneiro, “Uncertainty in Model-Agnostic Meta-Learning using Variational Inference,” arXiv:1907.11864 [cs, stat], Oct. 2019. Available: http://arxiv.org/abs/1907.11864

[2] H. B. Lee et al., “Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks,” presented at the International Conference on Learning Representations. Available: https://openreview.net/forum?id=rkeZIJBYvr

[3] X. Zhai et al., “A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark,” arXiv:1910.04867 [cs, stat], Feb. 2020. Available: http://arxiv.org/abs/1910.04867

[4] E. Triantafillou et al., “Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples,” arXiv:1903.03096 [cs, stat], Apr. 2020. Available: http://arxiv.org/abs/1903.03096