Launch HN: Datasaur (YC W20) – data labeling interface for NLP

174 points by flyx about 5 years ago

Hey HN community - I’m Ivan from Datasaur (https://datasaur.ai/). We build software that lets humans label data more efficiently for training natural language processing (NLP) models.

NLP algorithms are being trained in a wide variety of industries, from customer service to legal contracts, forum moderation to restaurant reviews. All these algorithms benefit from recent breakthroughs in academia and a generous open-source community. However, in order to be deployed to the real world, they require a custom set of training data to learn and understand the language unique to each industry. Therefore, people around the world are meticulously labeling data samples.

Example sentence: London is the capital and largest city of England and of the United Kingdom.
Labels: “London” -> “capital”, “United Kingdom”
Labels: “London” -> “largest city”, “England”

In the last few years I’ve worked at companies such as Apple and Yahoo and noticed that many organizations tend to reinvent the wheel when creating labeling interfaces for their labelers. Some companies still do this work in Excel. We saw an opportunity to create a "single interface to rule them all" that handles all sorts of text labeling tasks.

We leverage existing NLP capabilities to intelligently validate the quality of labels in a document and complement human judgment. Furthermore, we already understand terms like “Starbucks” and “New York” - why spend time labeling these terms from scratch every time? We created an API so you can plug in existing models to apply a first pass on labeling the document. We also built many other extensions to help labelers optimize their time: a “find and label” extension for labeling repetitive terms, and a dictionary extension for quickly looking up unfamiliar terms.

We spent the past year building out the labeling solution I wish I could have used. We now handle named entity recognition, parts of speech, document labeling, coreference resolution (multiple words referring to the same object/person) and dependency parsing (drawing relationships between words). A case study with one of our clients shows 70% improved labeling efficiency upon adopting the Datasaur platform, and we have much more room to improve.

We have also spoken with 100+ AI teams globally and identified the best practices in labeling. In addition to providing an enhanced interface, we can help track labeler performance and peer disagreement scores, and detect/remove labeler bias. By incorporating and encoding these features into our software, we can not only improve labeling efficiency but also improve the quality of the data and therefore the resulting AI model.

We believe that as AI becomes ever more prevalent and ubiquitous, labeling will become an increasingly important task. AI is a garbage-in, garbage-out technology, and the quantity and quality of data can often make a critical difference in the resulting AI model. We’re really excited to open Datasaur up to the world today and hear your feedback. Have you run into similar labeling issues? What tips and tricks have you employed to keep up with AI’s voracious appetite for data? We’d love to hear how you’ve tackled data labeling at your own companies. Thanks so much in advance!

Ivan
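For readers unfamiliar with peer disagreement scores: one common way to quantify how much two labelers agree beyond chance is Cohen's kappa. The post does not say which metric Datasaur uses, so the following is only a minimal sketch under that assumption. The function name `cohen_kappa` and the sample labels are hypothetical and not part of any Datasaur API.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both labelers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both pick the same label independently,
    # given each labeler's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level entity labels from two labelers (assumed data).
labeler_1 = ["LOC", "O", "O", "O", "LOC", "LOC"]
labeler_2 = ["LOC", "O", "O", "LOC", "LOC", "O"]
kappa = cohen_kappa(labeler_1, labeler_2)
print(round(kappa, 2))  # 0.33: they agree on 4 of 6 items, chance agreement is 0.5
```

A kappa near 1 means the labelers almost always agree, while a value near 0 means their agreement is about what chance alone would produce, which may point to unclear labeling guidelines.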

29 comments

zachguo about 5 years ago

It resembles an open source annotation tool that has existed for years: https://brat.nlplab.org/ It doesn't include an ML assistant, though.

We have built a semi-automated annotation tool for our internal use too. ML models help classify documents and extract named entities by making suggestions. Sometimes I think about spinning it off as a standalone product, but I'm not sure how big the market would be.

aliml85 about 5 years ago

Looks great, Ivan. Congrats! If I understand correctly, you would use spaCy and some pre-trained models to validate human labels. My question is: what is the point of collecting labels for ML training if we already have a valid model for the same task that can complement human labels?

aliakhtar about 5 years ago

Cool project. What would be cooler is if you had an API to retrieve the labels for a given word. Maybe that's in the works?

staticautomatic about 5 years ago

Could you please elaborate on what you mean by "intelligently validate the quality of labels in a document and complement human judgment", and discuss your methodology?

This seems to operate under the assumption that human labels are not actually the ground truth. I understand that they can be dirty, but most unsupervised approaches aren't producing a ground truth, either. So, are you saying it's better to have multiple pretty good sources of truth instead? Because depending on the application, that might make sense or it might be like trying to start a farm with a dead horse and a dead cow.

IanCal about 5 years ago

Looks interesting, signed up to try this out and see if it might deal with some of our labelling. A few notes:

Some help docs would be good, or better links to them. You specify a few types of projects but don't really explain what they are - I tried searching for "constituency" type projects but I still have no idea what they are.

You're sending error messages to the frontend. "Cannot read property 'startsWith' of undefined" is not something that should be reaching an end user, and this is happening intermittently when I upload files.

If I upload a CSV file I cannot seem to do NER from "new project". NER, when specifically chosen, supports TSV but not CSV. My TSV files just say "server error", though they load as plain txt files.

What's a question set? What's the format you need from me (it just says "csv")?

Autolabelling seems to do nothing. Do you have example text where that should work?

I can navigate the text but the hotkeys for labelling don't do anything until I've already clicked once.

Search & label all is interesting but doesn't seem to give me any labelling options. Also the regex search for "someword \w" just returns every pair of adjacent words, which seems wrong to me.

Congrats on the launch!
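To make the regex point concrete, here is a small sketch of what the pattern "someword \w" matches under standard regex semantics, using Python's `re` module. Which regex engine Datasaur's search actually uses is not stated in the thread, so this only illustrates the expected behavior; the sample text is made up.

```python
import re

# Made-up sample text for illustration.
text = "someword alpha and other words beta someword gamma"

# "someword \w" is the literal "someword ", then exactly one word character.
matches = re.findall(r"someword \w", text)
print(matches)  # ['someword a', 'someword g']

# Using \w+ extends the match to the whole following word.
full_words = re.findall(r"someword \w+", text)
print(full_words)  # ['someword alpha', 'someword gamma']

# For comparison, a pattern that matches any two adjacent words,
# which is closer to the behavior described in the comment.
any_pairs = re.findall(r"\w+ \w+", text)
print(any_pairs)  # ['someword alpha', 'and other', 'words beta', 'someword gamma']
```

Under these semantics only occurrences that start with the literal "someword" should be returned, so results containing unrelated word pairs suggest the literal part of the pattern is being ignored or tokenized differently.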

Shenglong about 5 years ago

This is awesome--really excited to see this need being solved.

crimsalis about 5 years ago

Congrats on the launch! I spend more than 50% of my time labeling data and this will make life much easier.

sailfast about 5 years ago

This looks awesome! Waiting for my email confirmation.

I was looking for information about where my data has to be hosted to use this service and could not find it. Will there be some more information about how this data is handled once I get past the login? Thanks!

andrewnc about 5 years ago

This is very cool! I especially love the logo. Congrats on the launch and best of luck.

gault8121 about 5 years ago

In the spreadsheet view, do users need to upload labels as a text file to then assign them to items? I work with Quill.org, a nonprofit edtech tool that helps students improve their writing skills, and we do a lot of labeling work now where we may need to assign say one of 20 labels to 1,000 responses at a time. I uploaded some sample data, but didn't understand how I could quickly assign labels to my content. Please let me know if I'm missing something here.

milani about 5 years ago

Congratulations on the launch!

To understand the scope of your work a little bit: if I have Prodigy with custom labeling needs set up for me, do I still benefit from switching to Datasaur?

narrationbox about 5 years ago

This looks wonderful, will definitely try it out. We ran into the labeling issue when doing NER a couple of years ago on a Reddit books dataset. If only this had existed then.

dunky11 about 5 years ago

Wish you good luck, the website looks clean and the product idea is good :) However, you request an image that is 3000+ pixels wide (https://s.datasaur.ai/static/media/homepage-hero.4917b8af.png). 1200px in width should be enough. I would resize the image, since it slows down the page.

comet_trail about 5 years ago

Interesting product. Could have used this at previous companies. How is this different from FigureEight or Scale?

hbcondo714 about 5 years ago

Any chance you could support HTML files? We've been using https://www.tagtog.net/ for some of our data labeling / annotation needs but their tool for these file types is still "experimental".

inerte about 5 years ago

LinkedIn suggested a post from you a couple weeks ago and I remember thinking “what’s Ivan up to?” and I saw Datasaur. Congrats on YC! I know that our time at Yahoo was a brief overlap but I remember the swirl of ML, Knowledge Graph and labelling our org was at 5 years ago.

Good luck with Datasaur!

- Julio Nobrega

lerax about 5 years ago

The name is the best part of the project (and the project itself is already an awesome tool).

_prometheus about 5 years ago

Datasaur looks awesome! Can't wait to try it out. Congrats on the launch!

Curious about data security and privacy. How do you guarantee privacy? Is there some cryptography or secure enclaves used? Some sets of documents (and email) are super high trust.

Guessing the on-prem version is probably the safest route.

boreas about 5 years ago

I've got a question: a lot of startup websites have a similar look to this one. It's a look I actually really like. What technologies are they all using? How would I build a site like this?

Wappalyzer doesn't give me anything and I don't have a ton of webdev experience.

WFHRenaissance about 5 years ago

Very cool logo. Just signed up.

hbcondo714 about 5 years ago

On the pricing page, the Growth box shows a checkmark for "Unlimited labels" but right below in the "Choose the right plan for you", the Growth plan says the number of labels is 10,000,000.

inthewoods about 5 years ago

Great idea - and this is an odd comment: I think you're pricing it too low relative to the number of people in the market. Just my gut - could well be wrong, wrong, and wrong again.

mroll about 5 years ago

Hey Ivan, this looks great! What are the privacy implications for my data that I want to label with your tool? I’m assuming I upload it to your servers?

braindead_in about 5 years ago

Congrats. Do you guys use AllenNLP, by any chance?

seaturtles about 5 years ago

Awesome! Congrats, excited for this!

foobaw about 5 years ago

Any plans to support image annotation (something similar to what CVAT does)?

ymt about 5 years ago

Looks awesome! I need to convince my team to use Datasaur.

Congratulations on the launch!

chownation about 5 years ago

Roarrsome, congrats!

felixkurniawan about 5 years ago

Congratulations on the launch, Ivan! Best of luck!
