Show HN: Jq-Like Tool for Markdown

325 points by yshavit 3 months ago

There have been a few times I wanted the ability to select some text out of a Markdown doc. For example, a GitHub CI check to ensure that PRs / issues / etc. are properly formatted.

This can be done to some extent with regex, but those expressions are brittle and hard to read or edit later. mdq uses a familiar pipe syntax to navigate the Markdown in a structured way.

It's in 0.x because I don't want to fully commit to the syntax being stable, in case real-world testing shows that the syntax needs tweaking. But I think the project is in a pretty good spot overall, and would be interested in feedback!

19 comments

verdverm 3 months ago

> GitHub PRs are Markdown documents, and some organizations have specific templates with checklists for all reviewers to complete. Enforcing these often requires ugly regexes that are a pain to write and worse to debug.

This is because GitHub is not building the features we need; instead they are putting their energy towards the AI land grab. Bitbucket, by contrast, has a feature where you can block PRs using a checkbox list outside of the description box. There are better ways to solve this first example from OP's README.

Cool project. I write mainly MDX these days, so it would be cool to see support for that dialect.

lanstin 3 months ago

Ironically, one of the reasons Markdown (and other text-based file formats) became popular is that you could use regular find/grep to analyze them, and version control to manage them.

unglaublich 3 months ago

My flow is to go through the Pandoc JSON AST and then use Jq. This works for other input formats, too.
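
For readers who have not seen that flow, here is a rough Python sketch of the same idea: walk a JSON AST and pull out the headings. The `SAMPLE_AST` document is hand-written to resemble the shape that `pandoc -t json` emits (a top-level `blocks` array of `{"t": ..., "c": ...}` nodes); it is an illustration, not captured Pandoc output, and the helper names `inline_text` and `headers` are made up for this example.

```python
import json

# Hand-written sample resembling Pandoc's JSON AST; real output carries more detail.
SAMPLE_AST = json.loads("""
{
  "pandoc-api-version": [1, 23],
  "meta": {},
  "blocks": [
    {"t": "Header", "c": [1, ["usage", [], []], [{"t": "Str", "c": "Usage"}]]},
    {"t": "Para", "c": [{"t": "Str", "c": "Run"}, {"t": "Space"}, {"t": "Str", "c": "it."}]},
    {"t": "Header", "c": [2, ["more-details", [], []],
      [{"t": "Str", "c": "More"}, {"t": "Space"}, {"t": "Str", "c": "details"}]]}
  ]
}
""")


def inline_text(inlines):
    """Flatten inline nodes to plain text (hypothetical helper; handles Str and Space only)."""
    parts = []
    for node in inlines:
        if node["t"] == "Str":
            parts.append(node["c"])
        elif node["t"] == "Space":
            parts.append(" ")
    return "".join(parts)


def headers(ast):
    """Return (level, text) for every top-level Header block."""
    result = []
    for block in ast["blocks"]:
        if block["t"] == "Header":
            level, _attr, inlines = block["c"]
            result.append((level, inline_text(inlines)))
    return result


print(headers(SAMPLE_AST))
```

With jq, the equivalent filter would be along the lines of `.blocks[] | select(.t == "Header")` applied to the same JSON.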

dleeftink 3 months ago

Kind of aligned with this is MarkdownDB, providing an SQLite backend to your Markdown files [0]. Cool to see this; I feel the structure of .md files is not always equally respected or regarded as a data serialisation target.

[0]: https://markdowndb.com/

broodbucket 3 months ago

I think you'd benefit from having some more real-world-ish examples in the README. I say that as someone who doesn't intuit what I'd want to use this for.

pokstad 3 months ago

Please don’t reimplement JQ. That problem is already solved. Instead, just provide a tool that can convert your target syntax into JSON, then it can be piped to JQ for querying.

kbd 3 months ago

Cool, thanks for sharing! I'll have to check this out. I've wanted something similar.

After trying a bunch of the usual ones, the only "notes system" I've stuck with is just a directory of markdown files that's automatically committed to git on any change using watchexec.

I've wanted to add a little smarts to it so I could use it to track tasks (e.g. sort, prune completed, forward incomplete tasks over to the next day's journal, collect tasks from "projects", etc.), so I started writing some Rust code using markdown-rs. Then, to round-trip markdown with changes, only the JavaScript version of the library currently supports serializing GitHub Flavored Markdown. So I actually dumped the markdown AST to JSON from Rust and picked it up in JS to serialize it for a proof of concept. That's about as far as I've got so far. But while markdown-rs saves position information, it doesn't save source token information (like, * and - are both list items), so you can't reliably round-trip.

FWIW, the other thing I was hoping to do was treat markdown documents as trees (based on headings) and use an XPath kind of language to pull out sections. Anyway, will check out your code, thanks for posting.

threecheese 3 months ago

Interesting; one thing you may have learned researching existing tools and libraries: many of them serialize markdown to HTML before running structured extraction/manipulation - even stuff like converting to PDF.

The core assumption here is that Markdown was/is designed to be serializable to HTML - this is why a markdown document/AST is mostly not a tree structure, even for tree-ish elements such as sub-sections. Instead, it is flat: an array of elements in order of appearance in the document. Apparently this most closely matches the structure of HTML, at both the block and inline levels. Only Lists and Blockquotes (afair) support nesting.

Ex: h1 -> paragraph -> h2 -> paragraph is not nested; it is an array of four ordered elements.

Anyway, you might throw a task at Cursor or Copilot to see how an equivalent implementation using HTML fares against your test suite; you may be able to develop more quickly.
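
To make the flat-versus-tree point concrete, here is a small Python sketch that takes a flat block list like the four-element example above and nests it by heading level. The `(kind, level, text)` tuple shape and the `nest_by_heading` name are invented for this illustration; they are not the data model of mdq, markdown-rs, or any other library mentioned in this thread.

```python
def nest_by_heading(blocks):
    """Turn a flat list of blocks into a tree keyed on heading level.

    Each block is a (kind, level, text) tuple; level is 0 for non-headings.
    """
    root = {"title": None, "level": 0, "content": [], "children": []}
    stack = [root]  # path of currently open sections, outermost first
    for kind, level, text in blocks:
        if kind == "heading":
            # Close every open section at this level or deeper.
            while stack[-1]["level"] >= level:
                stack.pop()
            section = {"title": text, "level": level, "content": [], "children": []}
            stack[-1]["children"].append(section)
            stack.append(section)
        else:
            # Non-heading blocks belong to the innermost open section.
            stack[-1]["content"].append(text)
    return root


# h1 -> paragraph -> h2 -> paragraph, as a flat array of four elements.
flat = [
    ("heading", 1, "Intro"),
    ("paragraph", 0, "First."),
    ("heading", 2, "Details"),
    ("paragraph", 0, "Second."),
]
tree = nest_by_heading(flat)
print(tree["children"][0]["title"])                  # Intro
print(tree["children"][0]["children"][0]["title"])  # Details
```

The stack holds the chain of open sections, so a new heading only has to pop back to its nearest shallower ancestor before attaching itself.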

aqueueaqueue 3 months ago

Why not MD -> json, then use jq! That would be half a static site generator there!

spiffyk 3 months ago

Thanks for sharing! No immediate use-case for me right now, but good to know something like this exists.

I wanted to point out a few little nitpicks for the documented shell invocations:

    cat example.md | mdq '# usage'

This can be changed into a stdin file redirect to avoid invoking an extra `cat` process (see Useless use of cat [1]):

    mdq '# usage' < example.md

In a similar fashion, you can avoid an extra `echo` process here:

    echo "$ISSUE_TEXT" | mdq -q '- [x] I have searched for existing issues'

by changing to this:

    mdq -q '- [x] I have searched for existing issues' <<< "$ISSUE_TEXT"

[1]: https://en.wikipedia.org/wiki/Cat_(Unix)#Useless_use_of_cat

twinkjock 3 months ago

Thanks for sharing this, Yuval! Thanks as well for using permissive licenses so I can use this at work.

frankfrank13 3 months ago

I worked on a project converting Word docs to markdown so they could more easily be ingested into an LLM. One issue was that context windows used to be very short, so we would basically split on `\n#` to get sections. But this turns into a whole thing where you have to make guesses about which header level is appropriate to split at, and then you turn each section into a separate chunk in FAISS. Anyways, we ended up using HTML instead of MD, because there's so much tooling for traversing HTML and not MD. This would have been helpful for that.
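
As a rough illustration of that kind of splitting (not the code from that project), here is a minimal Python sketch that starts a new chunk at every ATX heading at or above a chosen level. The `split_sections` name and the sample document are made up for this example. It is deliberately naive: a line starting with `#` inside a fenced code block is treated as a heading too, which is one of the ways this approach gets brittle.

```python
import re


def split_sections(markdown, level):
    """Split Markdown into chunks, starting a new chunk at each ATX heading
    whose depth is `level` or shallower (e.g. level=2 splits on # and ##)."""
    # One to `level` hashes, not followed by another hash, then whitespace.
    pattern = re.compile(r"^#{1,%d}(?!#)\s" % level)
    chunks, current = [], []
    for line in markdown.splitlines():
        if pattern.match(line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


doc = "# Title\nintro\n## A\ntext a\n### Sub\nmore\n## B\ntext b"
for chunk in split_sections(doc, level=2):
    print(repr(chunk))
```

Splitting the sample at `level=2` keeps the `### Sub` subsection inside the `## A` chunk, while `level=1` returns the whole document as a single chunk; picking between them is the guess about header level described above.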

foo42 3 months ago

This is one of those moments where you come across a tool at _just_ the right moment. I have a task for which this will be perfect.

infogulch 3 months ago

I've always wanted a "literate programming" / jupyter-style notebook based on markdown. Maybe this could help make something like that possible.

zerkten 3 months ago

Thanks! I have to grapple with some markdown across multiple repos and this'll be a helpful tool in the toolchest.

linklater12 3 months ago

Congrats on your tool, will check it out. I have a side question on markdown: Cursor messes up markdown generation quite often for me. I think its responses are always in markdown with sections for code, and asking it to generate markdown breaks that. So the question: any ideas on how to have Cursor generate markdown?

nodesocket 3 months ago

How is it parsing? Just normal string and regex matching, or transforming markdown to an intermediate structured language?

dcreater 3 months ago

What purpose does this serve that grep doesn't?

moonshotideas 3 months ago

Love this! One person's opinion: I'd change it to mq. Fewer chars are always better for a command.