Why not just system(pdf2html)? I don't see the point, since this level of functionality is trivially achieved. If it did something over and above that, like OCR, it might be useful, but even that's not hard to add.
If you're doing this locally from the CLI:<p>`pdftotext`, from <a href="http://www.foolabs.com/xpdf/" rel="nofollow">http://www.foolabs.com/xpdf/</a><p>For OCR, `pdfimages` (also from xpdf), combined with ImageMagick's `convert` and `tesseract` (<a href="http://code.google.com/p/tesseract-ocr/" rel="nofollow">http://code.google.com/p/tesseract-ocr/</a>), works passably well.
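For anyone who wants to script the OCR route above, here is a rough sketch of how the three tools could be chained from Python. The helper names (`extract_images_cmd`, `ocr_cmds`, `ocr_pdf`) and the `img` file prefix are made up for illustration, and it assumes `pdfimages` writes PPM/PBM files and that `tesseract` is happy with the TIFF that `convert` produces. Treat it as a starting point, not a tested pipeline.

```python
import subprocess
from pathlib import Path


def extract_images_cmd(pdf, image_root):
    # pdfimages usage: pdfimages <PDF-file> <image-root>
    return ["pdfimages", str(pdf), str(image_root)]


def ocr_cmds(image):
    # Hypothetical helper: the two commands needed to OCR one extracted image.
    image = Path(image)
    tif = image.with_suffix(".tif")
    return [
        ["convert", str(image), str(tif)],                   # ImageMagick: PPM/PBM -> TIFF
        ["tesseract", str(tif), str(image.with_suffix(""))],  # writes <base>.txt
    ]


def ocr_pdf(pdf, workdir):
    # Runs the whole chain; requires pdfimages, convert and tesseract on PATH.
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_images_cmd(pdf, workdir / "img"), check=True)
    images = sorted(p for p in workdir.glob("img-*") if p.suffix in (".ppm", ".pbm"))
    texts = []
    for image in images:
        for cmd in ocr_cmds(image):
            subprocess.run(cmd, check=True)
        texts.append(image.with_suffix(".txt").read_text())
    return texts
```

Calling `ocr_pdf("scan.pdf", "work")` would return one string per extracted image, in file-name order. Note that this is per image, not per page, so a page with several embedded images yields several strings.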
I have some questions:<p>1. Why return an array of texts? Where do the texts get split up? At page boundaries? Column boundaries? At the end of each line? If a line is interrupted by a corner of an image and continues a couple of inches afterward, does it get treated as a separate text? (I once used a PDF->text extractor program that spat out every word separately, often in an incorrect order. That probably had to do with how the PDF was organized internally.)<p>2. "The PDF file should be smaller than 1 Mbit" -> You mean 1 megabyte, right? Because 1 megabit is only 125-128 kilobytes.
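To spell out the arithmetic behind the 125-128 kilobyte range: it depends on whether "mega" is read as 10^6 or 2^20. A quick check, with variable names chosen only for illustration:

```python
BITS_PER_BYTE = 8

# "Mega" as 10^6 bits (SI) versus 2^20 bits (binary convention).
decimal_megabit_bytes = 1_000_000 // BITS_PER_BYTE
binary_megabit_bytes = 2**20 // BITS_PER_BYTE

print(decimal_megabit_bytes / 1000)  # kilobytes, decimal reading
print(binary_megabit_bytes / 1024)   # kibibytes, binary reading
```

Either way the limit would be roughly an eighth of a megabyte, which is small for a typical PDF.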
Going from PDF to a nicely formatted Word doc would be huge for lawyers and people who do a lot of contract negotiations. It's hard to do well, though.
I've recently been working on extracting text from PDFs myself. I've found that `pdftohtml -xml` from the Poppler utils does a decent job of it, and includes a bounding box for each piece of text. I've submitted a few patches to their Bugzilla to also include the transformation matrix as well as some extra styling information.
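As a rough illustration of working with that output, the sketch below reads bounding boxes out of XML shaped like what `pdftohtml -xml` emits. The `SAMPLE` string is hand-written, not real tool output, and the element and attribute names (`page`, `text`, `top`, `left`, `width`, `height`) are my assumption about the format, so check them against an actual file. `text_boxes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample modeled on the assumed output shape; not real tool output.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="109" left="149" width="120" height="20" font="0">Hello <b>world</b></text>
<text top="140" left="149" width="60" height="20" font="0">again</text>
</page>
</pdf2xml>"""


def text_boxes(xml_string):
    # Collect one dict per <text> element, tagged with its page number.
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        for el in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "left": int(el.get("left")),
                "top": int(el.get("top")),
                "width": int(el.get("width")),
                "height": int(el.get("height")),
                # itertext() also picks up text nested inside <b>, <i>, <a>
                "text": "".join(el.itertext()),
            })
    return boxes


for box in text_boxes(SAMPLE):
    print(box)
```

Sorting the boxes by `(page, top, left)` gives a crude reading order for single-column pages; multi-column layouts need more than that.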
I googled "converting PDF to text" and "converting PDF to html". Tons of services already exist out there, so this is apparently nothing new. How do you plan to compete? Are you planning to focus on data extraction rather than conversion?
Hey, I've documented this on Mashape: <a href="https://www.mashape.com/ismaelc/extract-text-from-pdfs#!documentation" rel="nofollow">https://www.mashape.com/ismaelc/extract-text-from-pdfs#!docu...</a>
This is similar to what we do at <a href="http://searchtower.com" rel="nofollow">http://searchtower.com</a>, where you can store, view, index and search the data.