Digitised full text search is a start.<p>The next level up is full linked cross referencing; bio files on all people that appear in the minutes, the ability to trace their interactions through time, the disclosure of what public ownership records they have for land and companies in the immediate and adjacent regions, what boards they have positions on.<p>If they are there to (say) lobby against local public transportation then who are they affiliated with that is also lobbying against the same in other county areas | states, etc.<p>Disclaimer: we did this some years ago for global mineral exploration and project development, linking public records from companies, stock exchanges, land bureaus, etc. and eventually onsold the data and the process [1].<p>It can be a grind to set up and get swinging, but ultimately worthwhile and can be funded via a subscription model.<p>[1] <a href="https://www.spglobal.com/marketintelligence/en/campaigns/metals-mining" rel="nofollow">https://www.spglobal.com/marketintelligence/en/campaigns/met...</a>
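To make the cross-referencing idea above a bit more concrete, here is a minimal sketch in Python. It assumes the names have already been extracted from the minutes and normalised, which is the hard part in practice. The `Appearance` record, both helper functions, and the sample names are hypothetical illustrations, not part of any existing system.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Appearance:
    """One person speaking or being named at one meeting (hypothetical schema)."""
    person: str
    meeting_date: date
    topic: str


def build_timeline(appearances):
    """Map each person to their appearances, oldest first."""
    timeline = defaultdict(list)
    for a in sorted(appearances, key=lambda a: a.meeting_date):
        timeline[a.person].append(a)
    return dict(timeline)


def co_appearances(appearances):
    """Count how often two people appear at the same meeting on the same topic."""
    groups = defaultdict(set)
    for a in appearances:
        groups[(a.meeting_date, a.topic)].add(a.person)
    pairs = defaultdict(int)
    for people in groups.values():
        for left, right in combinations(sorted(people), 2):
            pairs[(left, right)] += 1
    return dict(pairs)


# Made-up sample data, for illustration only.
records = [
    Appearance("A. Example", date(2021, 3, 9), "transit"),
    Appearance("B. Sample", date(2021, 3, 9), "transit"),
    Appearance("A. Example", date(2020, 11, 17), "zoning"),
    Appearance("B. Sample", date(2022, 1, 25), "transit"),
    Appearance("A. Example", date(2022, 1, 25), "transit"),
]

timeline = build_timeline(records)
pairs = co_appearances(records)
```

Joining the same person keys against land and company ownership records would follow the same pattern, with the record matching itself being where most of the grind goes.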
Hrmm, Berkeley resident here. It's clear that the city possesses the original digital copies of the last several decades of these documents, and the fact that their system of record produces them as images is just a weird quirk. Instead of OCRing them, wouldn't it be better to just get the city to fix their system? I'm not the only one who thinks so. Berkeleyside recently wrote about it:<p><a href="https://www.berkeleyside.org/2022/08/12/new-city-website-limits-access-to-vast-archive-of-berkeley-records" rel="nofollow">https://www.berkeleyside.org/2022/08/12/new-city-website-lim...</a>
There are several USA aggregator platforms that help with these kinds of public meetings, but they are a bit regional.<p>Original city marked as *<p><a href="https://www.documenters.org" rel="nofollow">https://www.documenters.org</a> / <a href="https://www.citybureau.org" rel="nofollow">https://www.citybureau.org</a> covers Chicago*, Atlanta, Cleveland, Detroit, Fresno, Minneapolis, Omaha (backed by MuckRock products) and helps activate volunteers.<p>NYC* has an aggregator around their many Community Boards somewhere but it is not turning up in my searches.<p><a href="https://councildataproject.org" rel="nofollow">https://councildataproject.org</a> covers Seattle*, Portland, Alameda, Denver, etc.<p>I haven't seen one on the ANCs in DC yet.
This uses AWS's Textract service, but if you're doing a LOT of extraction, that gets pretty expensive pretty quickly. We do thousands of pages daily on CourtListener.com and created an open source microservice for this purpose. It can take PDFs, DOCX, DOC, TXT, HTML, or a handful of other files and extract the text, doing OCR if necessary:<p><a href="https://free.law/projects/doctor" rel="nofollow">https://free.law/projects/doctor</a><p>We're always looking for more people to use and improve it.
This is awesome! Cities don’t do a great job of organizing data. It’s unclear if there’s enough funding/interest to have better access to data like this. But I certainly wish there was!<p>I sent a message to the author directly but wanted to ask here as well. Has anyone used a tsvector index on Postgres for this type of full text search?<p>I’ve had a great experience with it (for this type of full text/document data) but there isn’t much information out there about this kind of index and how to best utilize it.
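On the tsvector question, a minimal sketch of one common setup, assuming PostgreSQL 12 or later (for generated columns) and a DB-API driver that uses `%s` placeholders, such as psycopg2. The `minutes` table, its columns, and the `search_minutes` helper are hypothetical names chosen for illustration, not anything taken from the project discussed here.

```python
# Schema: keep the tsvector in a stored generated column so it stays in sync
# with the text, and put a GIN index on it so the @@ operator can use it.
SCHEMA_SQL = """
CREATE TABLE minutes (
    id           bigserial PRIMARY KEY,
    meeting_date date NOT NULL,
    body         text NOT NULL,
    body_tsv     tsvector
        GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
);
CREATE INDEX minutes_body_tsv_idx ON minutes USING GIN (body_tsv);
"""

# Query: websearch_to_tsquery accepts search-engine style input
# (quoted phrases, "or", leading minus) and ts_rank orders the matches.
SEARCH_SQL = """
SELECT id, meeting_date, ts_rank(body_tsv, q) AS rank
FROM minutes, websearch_to_tsquery('english', %s) AS q
WHERE body_tsv @@ q
ORDER BY rank DESC
LIMIT %s;
"""


def search_minutes(conn, terms, limit=10):
    """Run a ranked full text search on an open DB-API connection."""
    cur = conn.cursor()
    try:
        # Parameters are passed separately, never interpolated into the SQL.
        cur.execute(SEARCH_SQL, (terms, limit))
        return cur.fetchall()
    finally:
        cur.close()
```

With a real connection this would be called as something like `search_minutes(conn, '"bike lanes" -parking')`. The text search configuration (`'english'` here) has to match between the indexed column and the query, otherwise stemming will differ and matches get missed.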