Hi HN,<p>In 2021 I'm working to open source a behemoth project I've poured over 1,500 hours into. It deals with US Congress bill discovery and analysis (similar to GovTrack, but not the same).<p>My next major step is to write a data dictionary to bring some organization to the undefined, unstructured chaos. The goal is that anyone can quickly start hacking on their own applications with the data and run their own analyses, without needing a poli-sci degree. I'd be thrilled if a high school student could pick up the data and start hacking.<p>Here is an example schema:
https://i.imgur.com/Qsoa1aj.png<p>Currently I use a relational database, and although JSON querying works fine, it isn't exactly easy to build statistical analyses on top of it on the fly. Here are some questions I can answer, but not quickly:<p>1. What's the entire list of unique bill attributes that have ever existed in the dataset? What about only for 2019?<p>2. How many times was attribute X used in 2019? What was every value it took?<p>3. Across all bills and all actions ever recorded, how many unique <i>types</i> of actions have been recorded? (e.g. tabling a bill, holding a vote, passing a bill to committee, etc.)<p>4. Which bill was most "popular" (most referenced by other bills) in 2020?<p>I have experience with Elasticsearch, MongoDB, etc., and am intrigued by Typesense. But since I don't do statistical analysis often, I humbly ask the community whether there are tools I should be considering to answer the above questions (quickly!).<p>Cheers!
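<p>P.S. To make the questions concrete, here is a rough sketch of what 1 through 4 boil down to when written as plain Python over in-memory records. Every field name here (`introduced`, `attributes`, `actions`, `type`, `references`) and the three sample bills are made up for illustration and do not match the real schema. It also assumes attribute values are hashable, and it reads "most referenced in 2020" as "most referenced by bills introduced in 2020".

```python
from collections import Counter, defaultdict

# Hypothetical sample records; field names are illustrative, not the real schema.
bills = [
    {"id": "hr1", "introduced": "2019-01-03",
     "attributes": {"sponsor": "A", "policy_area": "Health"},
     "actions": [{"type": "referred_to_committee"}, {"type": "vote"}],
     "references": []},
    {"id": "hr2", "introduced": "2019-02-10",
     "attributes": {"sponsor": "B"},
     "actions": [{"type": "tabled"}],
     "references": ["hr1"]},
    {"id": "s3", "introduced": "2020-05-01",
     "attributes": {"sponsor": "A", "cosponsors": 4},
     "actions": [{"type": "vote"}],
     "references": ["hr1"]},
]


def year(bill):
    """Year the bill was introduced, taken from an ISO date string."""
    return int(bill["introduced"][:4])


def attribute_usage(bills, only_year=None):
    """Questions 1 and 2: how often each attribute appears, and its distinct values."""
    counts = Counter()
    values = defaultdict(set)  # assumes attribute values are hashable
    for bill in bills:
        if only_year is not None and year(bill) != only_year:
            continue
        for name, value in bill["attributes"].items():
            counts[name] += 1
            values[name].add(value)
    return counts, values


def action_types(bills):
    """Question 3: the set of distinct action types across all bills."""
    return {action["type"] for bill in bills for action in bill["actions"]}


def most_referenced(bills, only_year):
    """Question 4: the bill most often referenced by bills introduced in only_year."""
    counts = Counter(
        ref for bill in bills if year(bill) == only_year
        for ref in bill["references"]
    )
    return counts.most_common(1)


counts_2019, values_2019 = attribute_usage(bills, only_year=2019)
all_attributes = sorted(attribute_usage(bills)[0])
```

This is fine for a few thousand records in memory; what I'm looking for is a tool that makes this kind of ad hoc counting and grouping fast over the full dataset.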