MarkItDown: Python tool for converting files and office documents to Markdown

329 points by Handy-Man 5 months ago

29 comments

simonw 5 months ago

If you have uv installed, you can run this against a file without installing anything first:

  uvx markitdown path-to-file.pdf

(This caches the necessary packages the first time you run it, then reuses them on future invocations.)

I've tried it against HTML and PDFs so far and it seems pretty decent.

irskep 5 months ago

I worked on an in-house version of this feature for my employer (turning files into LLM-friendly text). After reading the source code, I can say this is a pretty reasonable implementation of this type of thing. But I would avoid using it for images, since the LLM providers let you just pass images directly, and I would also avoid using it for spreadsheets, since LLMs are very bad at interpreting Markdown tables.

There are a lot of random startups and open source projects that try to make this space sound fancy, but I really hope the end state is a simple project like this, easy to understand and easy to deploy.

I do wish it had a knob to turn for "how much processing do you want me to do." For PDF specifically, you either have to get a crappy version of the plain text using heuristics in a way that is very sensitive to how the PDF is exported, or you have to go full OCR, and it's annoying when a project locks you into one or the other. I'm also not sure I'd want to use the speech-to-text features here since they might have very different performance characteristics than the text-to-text stuff.

btown 5 months ago

For PDFs it's entirely a wrapper around https://pdfminersix.readthedocs.io/en/latest/tutorial/highlevel.html - https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py#L478

So if that's your use case, PDFMiner might be better to integrate with directly!
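
For anyone curious what integrating directly might look like, here is a minimal sketch, not the way MarkItDown itself does it. It assumes the pdfminer.six package is installed and uses its high-level `extract_text` function from the tutorial linked above. The wrapper name `pdf_to_text` and its `extractor` parameter are hypothetical, added only so the function can be exercised without a real PDF.

```python
from typing import Callable, Optional


def pdf_to_text(path: str, extractor: Optional[Callable[[str], str]] = None) -> str:
    """Return the plain text of a PDF with surrounding whitespace trimmed.

    `extractor` is a hypothetical hook (not part of pdfminer.six) that lets
    callers swap in another text extractor, or a stub when testing.
    """
    if extractor is None:
        # Imported lazily so this module still loads when pdfminer.six
        # is not installed and a custom extractor is supplied instead.
        from pdfminer.high_level import extract_text

        extractor = extract_text
    return extractor(path).strip()


if __name__ == "__main__":
    import sys

    # Usage: python pdf_to_text.py path-to-file.pdf
    print(pdf_to_text(sys.argv[1]))
```

Note that this yields plain text only. Any Markdown structure (headings, tables) would still have to be inferred separately.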

figomore 5 months ago

Pandoc (https://pandoc.org) can be used to convert a .docx file to Markdown and other file formats like djot and Typst. I don't think Pandoc can convert PowerPoint and Excel files.

starkparker 5 months ago

I index a lot of tabletop RPG books in PDF format, which often have complex visual layouts and many tables that parsers typically have difficulty with. If this is just a wrapper around PDFMiner, as noted in another comment, I don't see any value added by this tool.

This handles them... fine. It either doesn't recognize or never attempts to handle tables, which makes it fundamentally a non-starter for my typical usage, but to its credit it seems to have at least some sense of table cells; it organizes columns in a manner that isn't fully readable but isn't as broken as some other solutions, either.

It otherwise handles text that's in variable-width columns or wrapped in complex ways around artwork rather well. It inserts extraneous spaces on fully justified text, which is frustrating but not unusual, and sometimes adds extraneous line breaks on mid-sentence column breaks.

The biggest miss, though, is how it completely misses headings! This seems fundamental for any use case, including grooming sources for LLM training. It doesn't identify a single heading in any PDF I've thrown at it so far.
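
The extraneous spaces and mid-sentence line breaks can be partly cleaned up after extraction. The following is a rough heuristic sketch, not something the tool does: the function name `tidy_extracted_text` is made up, and it assumes paragraphs are separated by blank lines, so it would also flatten lists or tables that rely on single line breaks.

```python
import re


def tidy_extracted_text(text: str) -> str:
    """Collapse runs of spaces and rejoin lines broken mid-paragraph.

    Assumption: a blank line marks a paragraph boundary, and any single
    line break inside a paragraph is an extraction artifact.
    """
    tidied = []
    for paragraph in re.split(r"\n\s*\n", text):
        # Rejoin the lines of one paragraph with single spaces.
        joined = " ".join(
            line.strip() for line in paragraph.splitlines() if line.strip()
        )
        # Collapse the extra spaces left behind by justified text.
        joined = re.sub(r"[ \t]{2,}", " ", joined)
        if joined:
            tidied.append(joined)
    return "\n\n".join(tidied)


if __name__ == "__main__":
    sample = "The  quick   brown\nfox jumps.\n\nNext  paragraph."
    print(tidy_extracted_text(sample))
```

This does nothing for the missing headings, which would need font-size or layout information that plain extracted text no longer carries.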

benatkin 5 months ago

Nary a mention of LLMs in the readme. That was an unexpected but pleasant surprise, when the idea of converting something to markdown for LLMs is floated as if it's new and the greatest thing since sliced bread. https://hn.algolia.com/?dateRange=all&page=0&prefix=true&query=LLM%20markdown&sort=byDate&type=story

It's interesting to read the code. It's mostly glue code, and most of it is in a single 1101-line file. But it does indeed do what the README says it does. Here is the special handling for Wikipedia: https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py#L216

Edit: good to see the one from yesterday flagged. I tried to assume good intent, but also wondered if it was a place to draw a line in the sand. https://news.ycombinator.com/item?id=42405758

Edit 2: ah, it came down to a simple violation of the Show HN rules. I didn't notice, but yeah, that's definitely the case.

markhneedham 5 months ago

Quite curious how this compares to docling - https://github.com/DS4SD/docling

docling uses an LLM IIRC, so that's already a difference in approach.

hks0 5 months ago

This is amazing and really useful, love the idea; but let me tell you a story. It's a bit of a tangent but relevant enough:

In an online language class we were sending the assignments to our teacher via Slack; the teacher would then mark our mistakes and send them back.

I, as a true hater of all the heavyweight text formats for everyday communications, automatically fired up the terminal, wrote my assignment in my_name.md and happily sent it without giving it any thought. This is what I heard the next session:

"... and everybody did a great job! Although someone just sent me their assignment in a stupid format. I don't know what it was! I could neither highlight it nor make the text bold or anything. Don't do that to me again please".

Before that I never dreamed of meeting someone who preferred a Word document _after_ opening a .md file, and I also learned that if I had chosen product design as a career, everyone would've suffered immensely (or maybe not, I would've just ended up jobless).

LittleTimothy 5 months ago

This is... interesting. From my understanding - and people can correct me if I'm wrong - didn't Microsoft spend an extremely large amount of effort essentially trying to screw people who made things like this in the 2000s? Interoperability and the Open Office movement were pretty hard fought. It's kind of crazy to see MSFT do this today. Did I just misunderstand and the underlying formats (docx etc) were actually pretty friendly, or have the formats evolved a lot since then? Or is it more a case of "It doesn't matter if it looks terrible because we're feeding it to the AI beast anyway"?

A cynic might say it became suddenly easy when MSFT had a reason to allow you to generate markdown to feed into its AI?

konfekt 5 months ago

Though it promises to convert everything to Markdown, it seems to be a worse version of what already existing tools such as pdftotext, docx2txt, pptx2md, ... collected [here] do without even pretending to export to Markdown. Looking at its [source], it indeed seems to be a wrapper around Python variants of those. Making the pool smaller can hardly improve the output.

[here] https://github.com/Konfekt/vim-office
[source] https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py

theanonymousone 5 months ago

Why is the repository 95% "HTML" code?

kepano 5 months ago

Never thought I'd see the day. Yet... not surprising because plain text is the ideal format for analysis, LLM training, etc.

The question businesses will start to ask is why are we putting our data into .docx files in the first place?

lbrunson 5 months ago

Are there any good libraries for the opposite, going from markdown to pdf or docx? Pandoc gets most of the way there but struggles with certain things like tables.

ezxs 5 months ago

it would be cool if Word just had that implemented inside the product like Google Docs does.

constantinum 5 months ago

I will try it with some complex-layout PDFs or documents with tables. These documents have real business use cases for automation: insurance, banking, etc.

Anyone here who wants to convert PDF documents or scanned images as-is, preserving the layout, do try LLMWhisperer - https://unstract.com/llmwhisperer/

toastal 5 months ago

So we convert from rich formats with metadata & advanced features to a format without the former & severely lacking in the latter.

zelphirkalt 5 months ago

If the source document is anything half decent, this would serve to lose information, as markdown is far from flexible and powerful enough to represent all kinds of formatting and layout present in source documents. If all you need is the text information, then that might be just what you want, lossily compressing documents.

poidos 5 months ago

Very timely, thanks!

Was just yesterday working on chaining together `xlsx` and `tablemark` to accomplish this. I found `uvx markitdown my-excel.XLSX | sed 's/ NaN/ /g' > my-markdown.md` to be just what I needed to get my spreadsheet into a reasonably legible Markdown table when rendered by GitLab.
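
If you'd rather not rely on a blanket `sed` substitution, the NaN cleanup can be done per table cell. This is only a sketch: the function name `blank_nan_cells` is made up, and it assumes the converted table uses pipe-delimited rows with a leading and trailing `|` and no escaped pipes inside cells.

```python
def blank_nan_cells(markdown: str) -> str:
    """Empty out table cells whose entire content is the string NaN.

    Lines that are not pipe-delimited table rows are passed through
    unchanged, so prose that happens to mention NaN is left alone.
    """
    cleaned = []
    for line in markdown.splitlines():
        stripped = line.strip()
        is_row = (
            len(stripped) > 1
            and stripped.startswith("|")
            and stripped.endswith("|")
        )
        if is_row:
            # Drop the outer pipes, then split into individual cells.
            cells = [cell.strip() for cell in stripped[1:-1].split("|")]
            cells = ["" if cell == "NaN" else cell for cell in cells]
            line = "| " + " | ".join(cells) + " |"
        cleaned.append(line)
    return "\n".join(cleaned)


if __name__ == "__main__":
    import sys

    # Usage: uvx markitdown my-excel.XLSX | python blank_nan.py > my-markdown.md
    print(blank_nan_cells(sys.stdin.read()))
```

Unlike the `sed` version, this only touches cells that contain nothing but NaN.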

ccbikai 5 months ago

I made a version that can run entirely within the browser: https://www.html.zone/markitdown/

roamerz 5 months ago

Since it’s Microsoft, maybe it will do a half-decent job on Outlook HTML and .docx. I have evaluated most of the converters out there, paid ones included, and haven’t found one that I thought was good enough to run in production. Definitely will be giving this a try.

sneak 5 months ago

I wish we had a markdown equivalent for spreadsheets. Markdown tables ain’t it.

SuperHeavy256 5 months ago

I don't think it works if you try installing it using pip. Can anyone confirm? I ended up downloading it manually, making a venv, and then running it.

ulrischa 5 months ago

I wonder how a powerpoint can be converted to markdown

be_erik 5 months ago

Oh thank god. I can finally retire my docx to pandoc to markdown tool chain. I can’t believe M$ was the big one to go first. Good on ya.

fritzo 5 months ago

Converters like this are much more useful if they are bi-directional, even if the two directions aren't exactly inverses.

throwaway81523 5 months ago

Why not Pandoc?

einpoklum 5 months ago

This is BS, it doesn't support Office documents, it supports only Microsoft's broken office documents which don't obey their own custom specs. Why doesn't this work on ODF files?

yawnxyz 5 months ago

Has anyone gotten the Bing search DocumentConverter working? It keeps giving me null results.

ekianjo 5 months ago

any idea how it compares to Docling?