Prompted by downloading a .doc file from Qwest, only to find that it contained a monospaced text file, I set up a small, nearly UI-free site for doing document conversions. <a href="http://doc.mar.cx/<url>" rel="nofollow">http://doc.mar.cx/<url></a> gives an HTML or other sensible rendering of a URL (e.g. <a href="http://doc.mar.cx/http://www.itu.int/dms_pub/itu-t/oth/02/02/T02020000010001MSWE.doc" rel="nofollow">http://doc.mar.cx/http://www.itu.int/dms_pub/itu-t/oth/02/02...</a>) and <a href="http://doc.mar.cx/<extension>/<url>" rel="nofollow">http://doc.mar.cx/<extension>/<url></a> attempts to convert the document at that URL into the format with the given extension (e.g. <a href="http://doc.mar.cx/txt/http://www.itu.int/dms_pub/itu-t/oth/02/02/T02020000010001MSWE.doc" rel="nofollow">http://doc.mar.cx/txt/http://www.itu.int/dms_pub/itu-t/oth/0...</a>).<p>I use wvHtml for doc->html and wvPDF for doc->pdf, but antiword for doc->txt. To convert .docx, .xls, .xlsx, and WordPerfect files to HTML, I use OpenOffice, by way of jodconverter. For ODF files, I use OdfConverter. Conversion of Excel files to .csv files uses xls2csv. For PowerPoint files, I use ppthtml to convert to HTML, and catppt to convert to text. For Lotus 1-2-3 files (I added this after downloading some historical telecom data from the FCC!), I use ssconvert.<p>For any conversion that results in an HTML file (e.g. doc or pdf to html), I bundle all the images into a single file using the data: URL scheme. To do this, I wrote a utility called pagecan: <a href="http://afiler.com/pagecan/" rel="nofollow">http://afiler.com/pagecan/</a>
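For readers wondering what bundling images with the data: URL scheme might look like, here is a minimal Python sketch. It is not pagecan itself, and nothing here is taken from its source. The function names `to_data_url` and `inline_images` are hypothetical, and the sketch assumes the converted HTML references local image files through plain `<img src="...">` attributes.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches <img ... src="..."> with single or double quotes (a simplification:
# unquoted attributes and spaces around "=" are not handled).
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=)(["\'])(.*?)\2',
                     re.IGNORECASE | re.DOTALL)

# Anything with a URL scheme (http:, data:, ...) or a protocol-relative prefix.
REMOTE = re.compile(r'^(?:[a-z][a-z0-9+.-]*:|//)', re.IGNORECASE)


def to_data_url(path):
    """Encode a local file as a base64 data: URL."""
    mime, _ = mimetypes.guess_type(str(path))
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{payload}"


def inline_images(html, base_dir):
    """Replace local <img> sources in html with data: URLs.

    Remote URLs, existing data: URLs, and missing files are left untouched.
    """
    base = Path(base_dir)

    def replace(match):
        prefix, quote, src = match.groups()
        if REMOTE.match(src):
            return match.group(0)
        target = base / src
        if not target.is_file():
            return match.group(0)
        return f"{prefix}{quote}{to_data_url(target)}{quote}"

    return IMG_SRC.sub(replace, html)
```

Calling `inline_images(html_text, "/path/to/converted/output")` returns a single self-contained HTML string. A regex is used only to keep the sketch short; a real tool would more likely use an HTML parser.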
You should also consider 'pandoc', written in Haskell, for converting between markup formats: <a href="http://johnmacfarlane.net/pandoc/" rel="nofollow">http://johnmacfarlane.net/pandoc/</a><p>I am curious for more details about why Tika wasn't good enough. Please explain.
How about trying out calibre <a href="http://calibre-ebook.com" rel="nofollow">http://calibre-ebook.com</a>?
It can do all kinds of conversions from a number of formats, it is quite reliable, and it can be run headless.
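As a rough illustration of running it headless: calibre ships a command-line converter, `ebook-convert`, which takes an input file and an output file and infers both formats from the extensions. A minimal Python sketch of calling it from a server process follows. The helper names and file names are hypothetical, and it assumes `ebook-convert` is on the PATH.

```python
import subprocess


def calibre_convert_command(source, target):
    """Build the ebook-convert invocation; formats come from the file extensions."""
    return ["ebook-convert", source, target]


def convert(source, target):
    """Run the conversion, raising CalledProcessError if calibre reports failure."""
    subprocess.run(calibre_convert_command(source, target), check=True)
```

For example, `convert("book.epub", "book.txt")` would write a plain-text rendering next to the input, provided calibre supports that format pair.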
How would you compare AbiWord for doc/docx conversion versus antiword (<a href="http://www.winfield.demon.nl/" rel="nofollow">http://www.winfield.demon.nl/</a>)?<p>Also, what are the limitations of AbiWord for doc/docx files?
Million Dollar Question:<p>How could you additionally parse the information to extract structured data? For example: names of candidates, addresses, previous employers, job titles held.
Please add a candidate delete function. I sent an email for one candidate with multiple attachments, and Recruiterbox created multiple candidates by mistake.