Ask HN: How do you digitize documents?

5 points by caseyf7 over 5 years ago
Any recommendations on scanning documents, bills, articles, etc.? Advice on scanners, software and workflows would be greatly appreciated.

5 comments

simonblack over 5 years ago
I store most digitised documents as PDFs.

Very often you can obtain the original PDFs from companies by downloading them from their websites, as well as (or instead of) the paper documentation they send you.

For local scanning, I use an HP MFP. If I need to scan individual pages, I can then merge them, if necessary, with a 'merge.pdf' type of utility.

Store the scanned/downloaded documents in some kind of tree-structured directory layout. This greatly reduces the time taken to find a specific document.

I keep financial documents separate from other documents. Financial documents are also segregated into separate tax-year 'trees'.

Documents are backed up month by month, and also daily. The monthly backups are stored indefinitely, and separately from the daily backups, which are deleted in reverse chronological 'exponential' order.

These are the daily backups remaining at the moment. Day 0000 was back on 23rd June 2012. The last word is the server name. Note how there are more recent backups than earlier backups:

    0000-120623nullius
    1024-150401nullius
    2048-180131centrepoint
    2304-181014centrepoint
    2560-190627centrepoint
    2688-191102centrepoint
    2720-191204centrepoint
    2736-191220centrepoint
    2752-200105centrepoint
    2756-200109centrepoint
    2758-200111centrepoint
    2759-200112centrepoint
    2760-200113centrepoint
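
The comment does not say how the daily backups are chosen for deletion, but the day numbers in the listing are consistent with a simple thinning rule. Below is a minimal Python sketch of one such rule. It is an inference from the thirteen day numbers, not the commenter's actual script: it assumes a day is kept while its age is less than twice the largest power of two that divides its day number, and that day 0 is kept forever. All function names are hypothetical.

```python
def largest_power_of_two_dividing(day):
    """Return the largest power of two that divides a positive day number."""
    return day & -day


def is_kept(day, today):
    """Assumed rule: keep a backup while age < 2 * (largest power of two dividing day)."""
    if day == 0:
        return True  # assumption: the very first backup is never deleted
    age = today - day
    return age < 2 * largest_power_of_two_dividing(day)


def retained_days(today):
    """List every day number whose daily backup would still exist on `today`."""
    return [day for day in range(today + 1) if is_kept(day, today)]


if __name__ == "__main__":
    # With today = 2760 this yields the same day numbers as the listing above.
    print(retained_days(2760))
```

Under this assumed rule the number of retained backups grows only logarithmically with the number of days, which is why recent days are dense and older ones sparse.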
throwaway78678 over 5 years ago
I've got a decent Brother scanner like this one: https://www.ebay.com/p/13030519316. When I scan a document, it ends up in a folder on my NAS.

I've built a small webapp that reads the contents of this folder as untagged documents. Tagging a document moves it to a proper folder, and it then becomes visible in a tree view.

It is relatively robust and low maintenance. I might at some point work on download + OCR scripts to fetch and auto-tag bills and other documents that are already in PDF. To be honest, I'm not sure at this point whether that would really be useful.
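
The inbox-then-tag workflow described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the comment does not show the webapp's code, so the function names, the idea that a tag maps directly to a subfolder, and the refusal to overwrite are all hypothetical choices.

```python
from pathlib import Path
import shutil


def untagged_documents(inbox):
    """Return the files sitting in the scanner's drop folder, sorted by name."""
    return sorted(p for p in Path(inbox).iterdir() if p.is_file())


def tag_document(document, archive_root, tag):
    """Move a scanned file into archive_root/<tag>/, creating folders as needed.

    Assumption: a tag such as "bills/electricity" maps directly to a subfolder.
    """
    target_dir = Path(archive_root) / tag
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(document).name
    if target.exists():
        # Refuse to overwrite an already-filed document with the same name.
        raise FileExistsError(target)
    shutil.move(str(document), str(target))
    return target
```

Keeping "untagged" as nothing more than "still in the drop folder" means there is no database to maintain, which may be part of why such a setup stays low maintenance.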
rfmw19 over 5 years ago
My method was more specific to bills and finance documents. I used a generic photo scanner. It's not as automatic as the purpose-built document scanners that have automatic feeders and support multiple pages, but I wanted something that I could use for photography as well.

I coupled this with some very hacked-together Perl scripts using Tesseract OCR [1] that fed data into ledger-cli [2] for handling bills. I put other generic documents into folders by date.

It worked pretty well, and I was able to generate some pretty graphs from data that was fully reconciled with financial institutions such as my bank, credit card and investment providers, but it still took too much time. So what do I do now? Nothing!

This was years ago. I assume financial institutions now offer better support for extracting data, and that this, coupled with improved OCR/machine learning, might make things more robust and make it worthwhile to try again.

[1] https://en.wikipedia.org/wiki/Tesseract_(software)

[2] https://www.ledger-cli.org/
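
For readers wondering what the OCR-to-ledger step might look like, here is a rough Python sketch (the original scripts were Perl and are not shown, so this is not a reconstruction of them). It assumes the bill text has already been extracted by an OCR tool such as Tesseract, and that the bill contains a line like "Amount Due: $84.12". The regular expression, function names and account names are hypothetical.

```python
import re
from datetime import date

# Assumption: the bill states its total after "Total" or "Amount Due".
AMOUNT_RE = re.compile(
    r"(?:total|amount due)\D*(\d+(?:,\d{3})*\.\d{2})", re.IGNORECASE
)


def extract_amount(ocr_text):
    """Return the first total found in the OCR text as a string, or None."""
    match = AMOUNT_RE.search(ocr_text)
    if match is None:
        return None
    return match.group(1).replace(",", "")  # drop thousands separators


def ledger_entry(when, payee, amount, expense_account, paying_account):
    """Format a two-posting transaction in ledger's plain-text journal syntax.

    The second posting has no amount, so ledger balances it automatically.
    """
    return (
        f"{when:%Y/%m/%d} {payee}\n"
        f"    {expense_account}  ${amount}\n"
        f"    {paying_account}\n"
    )


if __name__ == "__main__":
    text = "ACME Power\nAmount Due: $84.12\n"  # made-up OCR output
    amount = extract_amount(text)
    print(ledger_entry(date(2020, 11, 3), "ACME Power", amount,
                       "Expenses:Utilities", "Assets:Checking"))
```

In practice the fragile part is the extraction pattern, since every biller lays out its totals differently, which fits the comment's point that the approach took too much time to maintain.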
clintonb over 5 years ago
What’s your goal? I haven’t received a paper bill in years. They are already digitized. Same for most news/magazine articles. Aside from older/historical documents, nearly every piece of paper I encounter has a digital counterpart that I can access in some form.
2rsf over 5 years ago
With bills, the quality is secondary and indexing is more important. I scan using Microsoft Office Lens and email the scan to myself, adding a few keywords in the title, e.g. "Electricity bill for November 2020".