If you have uv installed, you can run this against a file without installing anything first, like this:<p><pre><code> uvx markitdown path-to-file.pdf
</code></pre>
(This will cache the necessary packages the first time you run it, then reuse those cached packages on future invocations.)<p>I've tried it against HTML and PDFs so far and it seems pretty decent.
I worked on an in-house version of this feature for my employer (turning files into LLM-friendly text). After reading the source code, I can say this is a pretty reasonable implementation of this type of thing. But I would avoid using it for images, since the LLM providers let you just pass images directly, and I would also avoid using it for spreadsheets, since LLMs are very bad at interpreting Markdown tables.<p>There are a lot of random startups and open source projects who try to make this space sound fancy, but I really hope the end state is a simple project like this, easy to understand and easy to deploy.<p>I do wish it had a knob to turn for "how much processing do you want me to do." For PDF specifically, you either have to get a crappy version of the plain text using heuristics in a way that is very sensitive to how the PDF is exported, or you have to go full OCR, and it's annoying when a project locks you into one or the other. I'm also not sure I'd want to use the speech-to-text features here since they might have very different performance characteristics than the text-to-text stuff.
For PDFs it's entirely a wrapper around <a href="https://pdfminersix.readthedocs.io/en/latest/tutorial/highlevel.html" rel="nofollow">https://pdfminersix.readthedocs.io/en/latest/tutorial/highle...</a> - <a href="https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py#L478">https://github.com/microsoft/markitdown/blob/main/src/markit...</a><p>So if that's your use case, PDFMiner might be better to integrate with directly!
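A minimal sketch of what integrating with PDFMiner directly might look like, assuming pdfminer.six is installed. `extract_text` is the high-level function documented at the pdfminersix link above; the `pdf_to_text` wrapper and its `extractor` parameter are hypothetical names added for illustration, so that a different backend (OCR, say) could be swapped in.

```python
from typing import Callable, Optional


def pdf_to_text(path: str, extractor: Optional[Callable[[str], str]] = None) -> str:
    """Return the plain text of a PDF with trailing whitespace trimmed.

    `extractor` is a hypothetical hook: any callable that takes a file
    path and returns text. When omitted, pdfminer.six is used.
    """
    if extractor is None:
        # Imported lazily so this module still loads without pdfminer.six.
        from pdfminer.high_level import extract_text

        extractor = extract_text
    text = extractor(path)
    # Strip trailing whitespace on each line; this also drops a final newline.
    return "\n".join(line.rstrip() for line in text.splitlines())
```

Calling `pdf_to_text("path-to-file.pdf")` uses PDFMiner; passing `extractor=` makes it easy to compare against another tool on the same file.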
Pandoc (<a href="https://pandoc.org" rel="nofollow">https://pandoc.org</a>) can be used to convert a .docx file to Markdown and other file formats like djot and typst. I don't think pandoc can convert PowerPoint and Excel files.
I index a lot of tabletop RPG books in PDF format, which often have complex visual layouts and many tables that parsers typically have difficulty with. If this is just a wrapper around PDFMiner, as noted in another comment, I don't see any value added by this tool.<p>This handles them... fine. It either doesn't recognize or never attempts to handle tables, which makes it fundamentally a non-starter for my typical usage, but to its credit it seems to have at least some sense of table cells; it organizes columns in a manner that isn't fully readable but isn't as broken as some other solutions, either.<p>It otherwise handles text that's in variable-width columns or wrapped in complex ways around art work rather well. It inserts extraneous spaces on fully justified text, which is frustrating but not unusual, and sometimes adds extraneous line breaks on mid-sentence column breaks.<p>The biggest miss, though, is how it completely misses headings! This seems fundamental for any use case, including grooming sources for LLM training. It doesn't identify a single heading in any PDF I've thrown at it so far.
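The extra spaces and mid-sentence line breaks described above can be patched up after extraction. This is a rough sketch, not something markitdown does: `tidy_extracted_text` is a hypothetical helper, and its rule for rejoining lines (previous line has no closing punctuation, next line starts lowercase) is a heuristic assumption that will misfire on some layouts. It does nothing about the missing headings.

```python
import re


def tidy_extracted_text(text: str) -> str:
    """Collapse runs of spaces and rejoin lines that were split mid-sentence."""
    # Collapse runs of spaces/tabs, a common side effect of justified text.
    lines = [re.sub(r"[ \t]{2,}", " ", line).strip() for line in text.splitlines()]
    out: list[str] = []
    for line in lines:
        joins_previous = (
            bool(out)
            and out[-1] != ""
            and line != ""
            # Heuristic: the previous line did not finish a sentence...
            and not out[-1].endswith((".", "!", "?", ":"))
            # ...and this line looks like its continuation.
            and line[0].islower()
        )
        if joins_previous:
            out[-1] = out[-1] + " " + line
        else:
            out.append(line)
    return "\n".join(out)
```

Blank lines are kept as paragraph separators, so only lines inside the same paragraph are ever merged.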
Nary a mention of LLMs in the readme. That was an unexpected but pleasant surprise, when the idea of converting something to markdown for LLMs is floated as if it's new and the greatest thing since sliced bread. <a href="https://hn.algolia.com/?dateRange=all&page=0&prefix=true&query=LLM%20markdown&sort=byDate&type=story" rel="nofollow">https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...</a><p>It's interesting to read the code. It's mostly glue code, and most of it is in a single 1101-line file. But it does indeed do what the README says it does. Here is the <i>special handling for Wikipedia</i>: <a href="https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py#L216">https://github.com/microsoft/markitdown/blob/main/src/markit...</a><p>Edit: good to see the one from yesterday flagged. I tried to assume good intent, but also wondered if it was a place to draw a line in the sand. <a href="https://news.ycombinator.com/item?id=42405758">https://news.ycombinator.com/item?id=42405758</a><p>Edit 2: ah, it came down to a simple violation of the Show HN rules. I didn't notice, but yeah, that's definitely the case.
Quite curious how this compares to docling - <a href="https://github.com/DS4SD/docling">https://github.com/DS4SD/docling</a><p>docling uses an LLM IIRC, so that's already a difference in approach
This is amazing and really useful, love the idea; but let me tell you a story, it's a bit of a tangent but relevant enough:<p>In an online language class we were sending the assignments to our teacher via Slack, the teacher would then mark our mistakes and send them back.<p>I, as a true hater of all the heavyweight text formats for everyday communications, autonomously fired up the terminal, wrote my assignment in my_name.md and happily sent it without giving it any thought. This is what I hear the next session:<p>"... and everybody did a great job! Although someone just sent me their assignment in a stupid format. I don't know what it was! I could neither highlight it or make the text bold or anything. Don't do that to me again please".<p>Before that I never dreamed of meeting someone who preferred a word document _after_ opening a .md file, and I also learned that if I had chosen product design as a career, everyone would've suffered immensely (or maybe not, I would've just ended up jobless).
This is... interesting. From my understanding - and people can correct me if I'm wrong - didn't Microsoft spend an extremely large amount of effort essentially trying to screw people who made things like this in the 2000s? Interoperability and the Open Office movement were pretty hard fought. It's kind of crazy to see MSFT do this today. Did I just misunderstand and the underlying formats (docx etc) were actually pretty friendly, or have the formats evolved a lot since then? Or is it more a case of "It doesn't matter if it looks terrible because we're feeding it to the AI beast anyway"?<p>A cynic might say it became suddenly easy when MSFT had a reason to allow you to generate markdown to feed into its AI?
Though it promises to convert everything to Markdown, it seems to be a worse version of what existing tools such as PDFtotext, docx2txt, pptx2md, ... (collected [here]) already do without even pretending to export to Markdown.
Looking at its [source], it indeed seems to be a wrapper around Python variants of those. Making the pool smaller can hardly improve the output.<p>[here] <a href="https://github.com/Konfekt/vim-office">https://github.com/Konfekt/vim-office</a>
[source] https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py
Never thought I'd see the day. Yet... not surprising because plain text is the ideal format for analysis, LLM training, etc.<p>The question businesses will start to ask is why are we putting our data into .docx files in the first place?
Are there any good libraries for the opposite, going from markdown to pdf or docx? Pandoc gets most of the way there but struggles with certain things like tables.
I will try it with some complex-layout PDFs or documents with tables. These documents have real business use cases for automation: insurance, banking, etc.<p>Anyone here who wants to convert PDF documents or scanned images as-is, preserving the layout, should try LLMWhisperer - <a href="https://unstract.com/llmwhisperer/" rel="nofollow">https://unstract.com/llmwhisperer/</a>
If the source document is anything half decent, this will lose information, as markdown is far from flexible and powerful enough to represent all the kinds of formatting and layout present in source documents. If all you need is the text, then lossily compressing documents like this might be just what you want.
Very timely, thanks!<p>Was just yesterday working on chaining together `xlsx` and `tablemark` to accomplish this. I found `uvx markitdown my-excel.XLSX | sed 's/ NaN/ /g' > my-markdown.md` to be just what I needed to get my spreadsheet into a reasonably legible markdown table when rendered by GitLab.
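For anyone who would rather not reach for `sed`, here is a rough Python equivalent of that NaN cleanup. It is a sketch under assumptions: `blank_nan_cells` is a hypothetical helper, and it assumes table rows in the converted output start with `|`. Unlike the `sed` substitution, it only blanks cells whose entire content is `NaN`, so the word is left alone in prose or inside longer cell values.

```python
def blank_nan_cells(markdown: str) -> str:
    """Blank out Markdown table cells that contain only 'NaN'."""
    cleaned = []
    for line in markdown.splitlines():
        # Assumption: table rows are the lines that start with a pipe.
        if line.lstrip().startswith("|"):
            cells = line.split("|")
            cells = [" " if cell.strip() == "NaN" else cell for cell in cells]
            line = "|".join(cells)
        cleaned.append(line)
    return "\n".join(cleaned)
```

It can be fed the text that `uvx markitdown my-excel.XLSX` prints, for example by reading `sys.stdin` and writing the result to the target file.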
I made a version that can run entirely within the browser<p><a href="https://www.html.zone/markitdown/" rel="nofollow">https://www.html.zone/markitdown/</a>
Since it’s Microsoft, maybe it will do a half-decent job on Outlook HTML and .docx. I have evaluated most of the converters out there, paid ones included, and haven’t found one that I thought was good enough to run in production. Definitely will be giving this a try.
I don't think it works if you try installing it using pip. Can anyone confirm? I ended up downloading it manually, making a venv, and then running it.
This is BS, it doesn't support Office documents, it supports only Microsoft's broken office documents which don't obey their own custom specs. Why doesn't this work on ODF files?