Text extraction

1 point by theslay, over 11 years ago
Hi, I'm working on plagiarism detection and I need some help with text extraction from PDFs. I've tried PDFTextStream, which works really well for extracting text from PDFs. I need to be able to extract the text into a structured format where I could query things like the title, chapters, etc. I would appreciate any pointers on achieving this. Thanks

2 comments

pedalpete, over 11 years ago
Have you tried posting this to http://stackoverflow.com? That's a better forum for these kinds of questions.

If you were to write a blog post about how to structure the extracted text, that's more the HN thing.
mindcrime, over 11 years ago
I won't swear to it, but I suspect you're going to have to largely roll your own, and that it will be at least partly heuristic-driven. I use Apache Tika[1] to extract text from PDFs and then index it with Lucene, but we don't need to discriminate between various chapters or anything. But I can picture how you could use OpenNLP[2] and some custom code to break down the chapters.

[1]: http://tika.apache.org

[2]: http://opennlp.apache.org
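
To make the "roll your own, heuristic-driven" idea concrete, here is a minimal sketch in Python. It assumes the plain text has already been pulled out of the PDF by whatever extractor you use (Tika, PDFTextStream, etc.), and it does not call either library. The names `structure_text`, `Section`, and `Document` are made up for this illustration, and the two rules (first non-blank line is the title, lines starting with "Chapter N" open a new chapter) are assumptions that will need tuning for real documents.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumed heuristic: a chapter heading is a short line that starts with
# "Chapter" followed by an Arabic or Roman numeral. Real PDFs vary a lot,
# and a body line that happens to start this way would be a false positive.
CHAPTER_RE = re.compile(r"^chapter\s+(?:\d+|[ivxlcdm]+)\b", re.IGNORECASE)
MAX_HEADING_LEN = 80  # hypothetical cutoff to skip long body sentences


@dataclass
class Section:
    heading: str
    lines: List[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines).strip()


@dataclass
class Document:
    title: str
    sections: List[Section]


def structure_text(raw: str) -> Document:
    """Split already-extracted plain text into a title and chapter sections."""
    lines = raw.splitlines()
    # Assumed heuristic: the title is the first non-blank line.
    title = next((ln.strip() for ln in lines if ln.strip()), "")

    sections: List[Section] = []
    current = None
    seen_title = False
    for ln in lines:
        stripped = ln.strip()
        if not seen_title:
            if stripped:
                seen_title = True  # skip the title line itself
            continue
        if len(stripped) <= MAX_HEADING_LEN and CHAPTER_RE.match(stripped):
            current = Section(heading=stripped)
            sections.append(current)
        elif current is not None:
            current.lines.append(ln)
        elif stripped:
            # Text before the first chapter heading goes into a catch-all section.
            current = Section(heading="(front matter)", lines=[ln])
            sections.append(current)
    return Document(title=title, sections=sections)


def find_sections(doc: Document, keyword: str) -> List[Section]:
    """Query helper: sections whose heading contains the keyword."""
    return [s for s in doc.sections if keyword.lower() in s.heading.lower()]
```

Once the text is in this shape you can query it by title or chapter heading, e.g. `find_sections(doc, "methods")`, and compare chapters pairwise for the plagiarism check. For documents without explicit "Chapter N" lines you would likely need other signals, such as font size or layout information from the extractor.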