HTML preview for doc, docx, pdf & rtf

44 points by _raghu about 14 years ago

9 comments

afiler about 14 years ago
Prompted by downloading a .doc file from Qwest only to find that it contained a monospaced text file, I set up a small, nearly UI-free site for doing document conversions. http://doc.mar.cx/<url> gives an HTML or other sensible rendering of a URL (e.g. http://doc.mar.cx/http://www.itu.int/dms_pub/itu-t/oth/02/02/T02020000010001MSWE.doc), and http://doc.mar.cx/<extension>/<url> attempts to convert the URL into the format with the given extension (e.g. http://doc.mar.cx/txt/http://www.itu.int/dms_pub/itu-t/oth/02/02/T02020000010001MSWE.doc).

I use wvHtml for doc->html and wvPDF for doc->pdf, but antiword for doc->txt. To convert .docx, .xls, .xlsx, and WordPerfect files to HTML, I use OpenOffice, by way of jodconverter. For ODF files, I use OdfConverter. Conversion of Excel files to .csv files uses xls2csv. For PowerPoint files, I use ppthtml to convert to HTML and catppt to convert to text. For Lotus 1-2-3 files (I added this after downloading some historical telecom data from the FCC!), I use ssconvert.

For any conversion that results in an HTML file (e.g. doc or pdf to HTML), I bundle all the images into a single file using the data: URL scheme. To do this, I wrote a utility called pagecan: http://afiler.com/pagecan/
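For readers unfamiliar with the data: URL trick mentioned above, here is a minimal Python sketch of the general idea. It is not pagecan's actual code: the function name `inline_images`, the `images` mapping, and the regex-based matching are all assumptions made for illustration, and a real tool would more likely use an HTML parser.

```python
import base64
import re

# Hypothetical helper, NOT pagecan's implementation: it only illustrates
# replacing image references with base64-encoded data: URLs.
# Matches <img ... src="..."> and captures the src value separately.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html: str, images: dict[str, tuple[str, bytes]]) -> str:
    """Return html with each known <img src> replaced by a data: URL.

    `images` maps a src value to (MIME type, raw image bytes).
    Images that are not in the mapping are left untouched.
    """
    def repl(match: re.Match) -> str:
        src = match.group(2)
        if src not in images:
            return match.group(0)  # unknown image: keep the original tag text
        mime, data = images[src]
        payload = base64.b64encode(data).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(repl, html)


if __name__ == "__main__":
    page = '<p><img alt="logo" src="logo.png"></p>'
    print(inline_images(page, {"logo.png": ("image/png", b"\x89PNG")}))
```

The result is a single self-contained HTML file, at the cost of roughly a one-third size increase for each image due to base64 encoding.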
sushi about 14 years ago
UX suggestion: Please hyperlink the "Blog" text beside the Recruiterbox logo. It's underlined, so users expect it to be a link.
bravura about 14 years ago
You should also consider 'pandoc', written in Haskell, for converting between markup formats: http://johnmacfarlane.net/pandoc/

I am curious for more details about why Tika wasn't good enough. Please explain.
kalmi10 about 14 years ago
Based on the title, I expected some HTML5 magic for converting binary files into HTML in the browser.
tucosan about 14 years ago
How about trying out calibre (http://calibre-ebook.com)? It can do all kinds of conversions between a number of formats, it is quite reliable, and it can be run headless.
dpapathanasiou about 14 years ago
How would you compare abiword for doc/docx conversion versus antiword (http://www.winfield.demon.nl/)?

Also, what are the limitations of abiword for doc/docx files?
jamesshamenski about 14 years ago
Million-dollar question:

How could you additionally parse the information to extract structured data? For example: names of candidates, addresses, previous employers, job titles held.
Jakob about 14 years ago
Please add a candidate delete function. I sent an email for a candidate with multiple attachments, and Recruiterbox created multiple candidates by mistake.
nopal about 14 years ago
There's really not much here.

Could we see some code or a demo?