If you're doing this locally from the command line:<p>`pdftotext`, from xpdf (<a href="http://www.foolabs.com/xpdf/" rel="nofollow">http://www.foolabs.com/xpdf/</a>), pulls out the text when the PDF already has a text layer.<p>For OCR, `pdfimages` (also from xpdf) combined with ImageMagick's `convert` and `tesseract` (<a href="http://code.google.com/p/tesseract-ocr/" rel="nofollow">http://code.google.com/p/tesseract-ocr/</a>) works passably well.
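<p>A rough sketch of how those pieces could be chained from a script, assuming the xpdf tools, ImageMagick and tesseract are all on your PATH. The function names, the `img` file prefix and the choice of TIFF as the intermediate format are my own for illustration, not anything the tools require. It only uses the plain positional forms of each command (input, then output), so treat it as a starting point, not a tuned pipeline.

```python
import glob
import subprocess
from pathlib import Path


def extract_text_cmd(pdf, out_txt):
    # pdftotext <input.pdf> <output.txt>: dump the PDF's existing text layer.
    return ["pdftotext", pdf, out_txt]


def dump_images_cmd(pdf, image_root):
    # pdfimages <input.pdf> <image-root>: write embedded images as
    # <image-root>-NNN.ppm / .pbm files.
    return ["pdfimages", pdf, image_root]


def ocr_cmds(image):
    # Two steps per image: convert it to TIFF, then OCR the TIFF.
    # tesseract takes an output *base* name and appends .txt itself.
    tif = str(Path(image).with_suffix(".tif"))
    base = str(Path(image).with_suffix(""))
    return [["convert", image, tif], ["tesseract", tif, base]]


def ocr_pdf(pdf, workdir):
    # Hypothetical driver: dump every image, then OCR each one in order.
    root = str(Path(workdir) / "img")
    subprocess.run(dump_images_cmd(pdf, root), check=True)
    for image in sorted(glob.glob(root + "-*.p[bgp]m")):
        for cmd in ocr_cmds(image):
            subprocess.run(cmd, check=True)
```

Calling `ocr_pdf("scan.pdf", "/tmp/scan")` (with the directory already created) would leave one `.txt` file per extracted image next to the images. For a PDF that already has real text, running `extract_text_cmd` through `subprocess.run` is all you need.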