Hey HN!<p>Markup is an open-source annotation tool for transforming unstructured documents into a structured format that can be used for ML, NLP, etc.<p>Markup learns as you annotate in order to speed up the process by suggesting complex annotations to you.<p>There are also a few different in-built tools, including:<p>- A data generator that helps you to produce synthetic data for training the suggestion model<p>- An annotator diff tool that helps you to compare annotations produced by multiple annotators<p>It's still very much a work in progress (and the documentation is severely lacking), but the ultimate goal is to make a tool that's as useful as <a href="https://prodi.gy/" rel="nofollow">https://prodi.gy/</a>, without the $400 price tag.
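For readers wondering what an annotator diff involves, here is a minimal sketch. It is not Markup's actual implementation: it assumes each annotation is a hypothetical `(start, end, label)` tuple of character offsets and counts only exact matches as agreement.

```python
def diff_annotations(a, b):
    """Compare two annotators' span sets (hypothetical (start, end, label) tuples)."""
    a, b = set(a), set(b)
    union = a | b
    return {
        "agreed": sorted(a & b),   # spans both annotators marked identically
        "only_a": sorted(a - b),   # spans only the first annotator marked
        "only_b": sorted(b - a),   # spans only the second annotator marked
        # Jaccard overlap as a rough agreement score; 1.0 when both sets are empty
        "agreement": len(a & b) / len(union) if union else 1.0,
    }


annotator_a = [(0, 7, "DRUG"), (12, 17, "DOSE")]
annotator_b = [(0, 7, "DRUG"), (12, 20, "DOSE")]
result = diff_annotations(annotator_a, annotator_b)
```

Exact matching is the strictest choice. A real tool would probably also want to flag partially overlapping spans, such as the two `DOSE` spans above, as near-misses rather than outright disagreements.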
Beautiful. So many annotation tools focus on "text classification", which assumes you've already got segmented samples. In the real world of documents, that's a whole challenge in itself.<p>Another challenge is that sometimes you're working with PDFs, and that means not only ingesting them but also displaying them. The difficulty is in keeping track of annotations and predictions across the PDF<->text string boundary, in both directions.<p>There are understandably even fewer solutions to that problem because it's a harder UI to build.
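To make the PDF<->text bookkeeping concrete, here is a rough sketch of one way to do it. It assumes a PDF extractor has already produced words with a page number and bounding box; the `Word` type and all function names are hypothetical and not taken from any particular library. The idea is to record each word's character offset when building the text string, so that spans can be mapped to boxes and clicks back to spans.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


def build_index(words):
    """Join words with single spaces and record each word's start offset."""
    starts, pos = [], 0
    for w in words:
        starts.append(pos)
        pos += len(w.text) + 1  # +1 for the joining space
    return " ".join(w.text for w in words), starts


def span_to_boxes(words, starts, start, end):
    """Text -> PDF: (page, bbox) of every word overlapping text[start:end]."""
    i = max(bisect_right(starts, start) - 1, 0)
    hits = []
    while i < len(words) and starts[i] < end:
        if starts[i] + len(words[i].text) > start:
            hits.append((words[i].page, words[i].bbox))
        i += 1
    return hits


def point_to_span(words, starts, page, x, y):
    """PDF -> text: character span of the word under a click, if any."""
    for w, s in zip(words, starts):
        x0, y0, x1, y1 = w.bbox
        if w.page == page and x0 <= x <= x1 and y0 <= y <= y1:
            return s, s + len(w.text)
    return None


words = [
    Word("Patient", 1, (0, 0, 50, 10)),
    Word("reports", 1, (55, 0, 100, 10)),
    Word("anxiety", 2, (0, 0, 50, 10)),
]
text, starts = build_index(words)
boxes = span_to_boxes(words, starts, 8, 23)      # "reports anxiety"
span = point_to_span(words, starts, 2, 10, 5)    # click on page 2
```

A span that crosses a page break comes back as boxes on two pages, which is exactly the case that makes the display side harder than the ingestion side.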
This looks incredible! I’ve been following doccano for a while, but they were still working on active learning. Will you be adding an open-source license like MIT?
Looks like an interesting project. Would you have some kind of a summary of the methodology you're using for the annotation suggestions? What kind of learning, and which kinds of features?
Really nice tool - thanks for making this! What is your plan for this? Is this a side project that you'll potentially turn into a business, or is this just a hobby on the side of your full-time job?<p>Just asking because I think many folks would be happy to pay to support a small ISV to ensure its long-term sustainability. Not via donations, but actual pricing.
> Document to annotate - The document you intend to annotate (must be .txt file)<p>Any thoughts on supporting additional file formats? I'm actually interested in annotating HTML files / web pages. It would be great if I could browse for a local HTML file or enter a URL and have the HTML content rendered so it can be annotated using the entities.
That's fantastic. I was about to start a project in October building something that's almost completely there already, for a specific use case (annotation of therapy sessions).