Anyone know how this compares to GROBID [1]? I'm looking at alternatives to GROBID as I'm not super pleased with its outputs. GROBID has a lot of great features for journal papers (reference extraction / parsing), but I'm only interested in cleanly extracting the body. Also considering nougat [2] but I haven't tried it yet.<p>[1] <a href="https://github.com/kermitt2/grobid">https://github.com/kermitt2/grobid</a><p>[2] <a href="https://github.com/facebookresearch/nougat">https://github.com/facebookresearch/nougat</a>
Nice tool. I've been using html2md [1] and the like. It's written in Python and in beta, so it's probably not the best for processing static sites, but it's still useful.<p>[1]: <a href="https://github.com/suntong/html2md">https://github.com/suntong/html2md</a>