Hey HN community - I'm Ulrik from Cord (https://cord.tech), in the current YC W21 batch [1]. We are building software that lets people label their data intelligently using a toolbox of 'labeling algorithms'. A labeling algorithm is any unit of intelligence (e.g. a pre-trained model or an interpolation algorithm) that helps automate the annotation process. This lets data science and machine learning teams iterate rapidly on their ML models without having to farm labeling tasks out to an external workforce.

Today we're launching the first part of our product, our Web App, which serves our initial set of automation features through a GUI. It also lets you classify images, draw vector labels, visualize data, and perform collaborative QA.

Computer vision ML algorithms are widely used for tasks like detecting everyday objects such as cars and pedestrians. However, they have yet to see widespread adoption for tasks like detecting cancerous polyps during an endoscopic procedure or blood clots in MRI scans. The lack of the massive-scale labeled training datasets that fuel contemporary approaches is often the blocker for building ML applications that solve these more specialised tasks. We also believe that a core part of the IP of an ML application stems from the labeled data used to train it.

Creating these datasets is challenging for several reasons. Labeling the data requires expensive domain-expert annotators, and privacy concerns may prevent the data from being sent to an external workforce. As a result, most labeling work ends up being done with open-source tools that were neither built for speed nor designed to handle massive-scale datasets [2]. These tools also tend to provide a poor experience for the end consumers of the training data (e.g. data scientists and ML engineers) because they lack intelligence and require heavy manual input.

The initial seed of the idea came while I was working on a CS master's project visualizing massive-scale medical image datasets. I saw how much time and effort doctors were spending on labeling data. I then met my co-founder Eric, who had worked as a quant researcher in finance, and together we realized we could take an algorithmic approach to the labeling problem. Instead of writing trading algorithms, we turned our focus to writing labeling algorithms.

For example, for a food calorie estimation project we translated image-level classifications of food items into individual bounding box labels using a labeling algorithm we wrote with our SDK, requiring only one manual label per food item. Although it was a dataset of still images rather than video, our algorithm approximated noisy bounding box labels by running a CSRT object tracker across the images. It then trained a shallow Faster R-CNN 'micro-model' on the noisy labels, ran inference on the data, and suppressed the earlier noisy labels. We then quickly reviewed and adjusted the results visually in our Web App [3] (a rough sketch of this loop is included after the references below). We have applied a similar approach in areas such as gastroenterology [4] and pathology.

The days of relying on an army of human annotators before model building can even start are hopefully (soon) over. We are incredibly excited to be driving that change - and are delighted to share Cord with the HN community! We would love to hear your feedback. How are you going about creating and managing training data today? What are your key constraints?
If you have used a creative method to label your data before, please share. Thank you so much in advance!

[1] What I Learned From My First Month at Y Combinator - https://medium.com/swlh/what-i-learned-from-my-first-month-at-y-combinator-5b35fb9ebb7b

[2] Why You Should Ditch Your In-House Training Data Tools (And Avoid Building Your Own) - https://medium.com/p/ef78915ee84f

[3] Label a Dataset with a Few Lines of Code - https://eric-landau.medium.com/label-a-dataset-with-a-few-lines-of-code-45c140ff119d

[4] Pain Relief for Doctors Labelling Data - https://eric-landau.medium.com/pain-relief-for-doctors-labelling-data-72f3e5e31c92
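To make the micro-model loop above a bit more concrete, here is a rough, self-contained sketch of the same idea using plain OpenCV and torchvision. This is not our SDK's API - the function names, the ResNet-50 backbone, and the training-loop details are illustrative assumptions only:

    # Sketch of the labeling-algorithm loop described above (illustrative only, not the Cord SDK).
    import cv2                # needs opencv-contrib-python for the CSRT tracker
    import torch
    import torchvision

    def propagate_seed_box(frames, seed_box):
        """Propagate a single manual box (x, y, w, h) across an ordered image sequence."""
        tracker = cv2.TrackerCSRT_create()
        tracker.init(frames[0], seed_box)
        noisy_boxes = [seed_box]
        for frame in frames[1:]:
            ok, box = tracker.update(frame)
            noisy_boxes.append(box if ok else None)   # None where the tracker loses the object
        return noisy_boxes

    def train_micro_model(loader, num_classes, epochs=3):
        """Fit a small Faster R-CNN 'micro-model' on the noisy tracker boxes."""
        model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=num_classes)
        optimiser = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
        model.train()
        for _ in range(epochs):
            for images, targets in loader:   # targets: [{"boxes": ..., "labels": ...}, ...]
                losses = model(images, targets)
                loss = sum(losses.values())
                optimiser.zero_grad()
                loss.backward()
                optimiser.step()
        return model

    # Running inference with the micro-model then replaces (suppresses) the noisier
    # tracker boxes, and a human reviews/adjusts the result in the Web App.

In practice the tracker gives cheap but noisy proposals, the micro-model smooths them out, and the human only has to touch the residual errors.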
Really cool stuff! Looking forward to trying it out.

My favorite commercial option so far is:
https://www.v7labs.com/

prodi.gy is worth checking out for running small annotation UIs.

The workflow of the University of Toronto annotation suite [1] is sweet for making polygons semi-automatically, but it isn't widely available yet.

[1] https://youtu.be/3kFQJQicHxA
Interesting work! I enjoyed reading some of your blog posts. I have worked on gastroenterology projects that required tons of manual rotoscoping/segmentation and could have benefited from a faster collaborative pipeline. We used hasty.ai - I'd love to hear your thoughts on their work vs. yours. I'm also on some projects doing audio spectrogram segmentation; if you have software that can handle audio 'images', that's another space with a gap in the industry. I haven't found an equivalent outside of VIA/Audacity/Praat for labeling.
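(For context, here is roughly how I turn a clip into a spectrogram 'image' today so it can go through an image-labeling tool - just a sketch, assuming a mono WAV and using scipy/matplotlib; the STFT parameters are placeholders:)

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import signal
    from scipy.io import wavfile

    rate, samples = wavfile.read("clip.wav")   # hypothetical mono input file
    freqs, times, spec = signal.spectrogram(samples, fs=rate, nperseg=1024, noverlap=512)

    plt.figure(figsize=(12, 4))
    plt.pcolormesh(times, freqs, 10 * np.log10(spec + 1e-10), shading="gouraud")
    plt.axis("off")
    plt.savefig("clip_spectrogram.png", bbox_inches="tight", pad_inches=0)

    # The PNG (plus the time/frequency extents) can then be segmented in any
    # image-labeling tool, and polygons map back to time/frequency bins.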
At Lenus eHealth we have been following Cord closely for some time and can only vouch for the quality of their product! I've been trying out the public API and am impressed with the progress so far.
Is this something we will be able to license and run on our own servers? We are quite wary of sharing our data and labels externally -- we've had bad experiences with that...