Pdf text extractor – in pages and regions you define

1 point by seinecle, almost 2 years ago

2 comments

albert_e, almost 2 years ago

Interesting.

I have a use case that is slightly different. Maybe someone can suggest a good framework / tool.

Our school publishes a PDF daily, which someone makes by filling in a Microsoft Excel template and printing it to PDF / Save As PDF.

The Excel template is fairly simple: a block of key-value pairs as a two-column table for each subject (fixed number of fields), and N such blocks one below the other, depending on the number of subjects covered that day.

The length of the PDF (whether the content fits on one page or spills onto 2 or 3) as well as the scaling of the PDF print (how big or small the text appears) varies a lot because of the inconsistent manual steps they follow.

What would be a good way to automate the extraction of text from such a daily PDF feed?

I want to load the extracted data into a simple flat table (in, say, a SQLite database or DynamoDB) and use it to display the same content as a browsable / filterable webpage (showing content from all PDFs to date).

I was hoping to get help from ChatGPT code interpreter and write a Python script that I can schedule on AWS Lambda. But if there is a known approach for this kind of document processing, please point me to it. Thanks!
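One possible shape for the pipeline described above, as a rough sketch rather than a known-good solution: because the PDF is printed from Excel, its text layer should be intact, so page count and print scaling stop mattering once the table rows are pulled out as plain text. The sketch below assumes the rows have already been extracted as two-column `[key, value]` lists (for example with a PDF table extractor such as pdfplumber's `extract_tables()`, concatenated across pages). The field names, the `entries` table, and both function names are hypothetical, since the actual template is not shown in the post.

```python
import sqlite3

# Hypothetical field names; the real template's keys are not given in the post.
FIELDS = ["Subject", "Topic", "Homework"]


def rows_to_records(rows, fields=FIELDS):
    """Group two-column (key, value) rows into one dict per subject block.

    A new block starts whenever the first field name is seen, so a block
    that is split across a page break is still reassembled correctly.
    """
    records, current = [], None
    for row in rows:
        if not row or len(row) < 2 or row[0] is None:
            continue  # skip blank or malformed rows
        key = row[0].strip()
        value = (row[1] or "").strip()
        if key == fields[0]:
            current = {}
            records.append(current)
        if current is not None and key in fields:
            current[key] = value
    return records


def save_records(conn, pdf_date, records, fields=FIELDS):
    """Append the records for one daily PDF to a flat SQLite table."""
    cols = ", ".join(f'"{f}" TEXT' for f in fields)
    conn.execute(f"CREATE TABLE IF NOT EXISTS entries (pdf_date TEXT, {cols})")
    placeholders = ", ".join("?" for _ in range(len(fields) + 1))
    conn.executemany(
        f"INSERT INTO entries VALUES ({placeholders})",
        [(pdf_date, *(r.get(f, "") for f in fields)) for r in records],
    )
    conn.commit()
```

Starting a new record on the first field name, instead of counting a fixed number of rows, is a design choice that tolerates a missing or empty row in one block. The same records could be written to DynamoDB instead of SQLite; only `save_records` would change.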
seinecle, almost 2 years ago

Part of a set of free, point-and-click, web-based functions; no registration required. Your feedback is welcome.