This is one of the biggest pain points of my work - reviewing diffs of Jupyter Notebooks.<p>Does anyone have any good tools for this that preserve the visuals of the Notebooks?<p>My approach has always been to render the files as .py without the cell outputs and compare those, which is a big PITA.<p>Anyone have any advice?
You can use jupytext to maintain dual .py/.ipynb representation of notebooks and keep both versions in sync:<p><a href="https://github.com/mwouts/jupytext/blob/main/docs/paired-notebooks.md" rel="nofollow">https://github.com/mwouts/jupytext/blob/main/docs/paired-not...</a><p>It works both ways, it can update the .py file each time you save the notebook, or you can edit the .py file and have the jupytext command line tool update the .ipynb.
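To give a feel for what the .py side of such a pair looks like, here is a rough stdlib-only sketch that turns notebook JSON into a "percent"-style script (`# %%` cell markers). It is not jupytext's implementation: the function names are made up for illustration, and it ignores the header and cell metadata that jupytext keeps.

```python
import json


def cell_source(cell: dict) -> str:
    """Notebook JSON stores source as a string or a list of strings."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def to_percent_script(raw: str) -> str:
    """Approximate a percent-format script from raw .ipynb JSON text."""
    notebook = json.loads(raw)
    chunks = []
    for cell in notebook.get("cells", []):
        text = cell_source(cell)
        if cell.get("cell_type") == "code":
            # Code cells are copied verbatim under a cell marker.
            chunks.append("# %%\n" + text)
        elif cell.get("cell_type") == "markdown":
            # Markdown cells become comments so the file stays valid Python.
            commented = "\n".join(("# " + line).rstrip() for line in text.splitlines())
            chunks.append("# %% [markdown]\n" + commented)
        # Other cell types (e.g. raw) are skipped in this sketch.
    return "\n\n".join(chunks) + "\n"
```

The result is plain text with no outputs in it, which is why a paired .py file diffs cleanly with ordinary tools.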
Visual Studio Code has a diffing view for notebooks that looks very promising. <a href="https://code.visualstudio.com/docs/datascience/jupyter-notebooks#_custom-notebook-diffing" rel="nofollow">https://code.visualstudio.com/docs/datascience/jupyter-noteb...</a>
Can you talk more about why you’re working in Jupyter Notebooks at a level that needs diff reviews? Are you reviewing your own work, or the work of others?<p>One option would be to start a policy to always “restart and clear output” before saving. This cleans the output cells and makes the .ipynb files diffable. Just happens to also make them nice for storing in version control.<p>Another option would be to work in pure python files in the first place, and only use Jupyter after the fact. The close brother to Jupyter is the Spyder IDE, which gives you most of the benefits of quick visual outputs, but also has a nice python debugger built in.
I used something as a precommit hook in the past that removed plots and other rendered content and only kept text and code in git index. I'm almost sure it was <a href="https://github.com/kynan/nbstripout" rel="nofollow">https://github.com/kynan/nbstripout</a> but it's been a while and I could be wrong.<p>Once the hook was in place git diff worked well enough to not need any other diffing tool.
There is <a href="https://nbdime.readthedocs.io/en/latest/" rel="nofollow">https://nbdime.readthedocs.io/en/latest/</a>, although I haven't used it personally to know how good it is.
Here are tools people commonly use for notebook version control with git -<p>[1] nbdime to view local diffs & merge changes<p>[2] jupytext for 2-way sync between notebook & markdown/scripts<p>[3] JupyterLab git extension for git clone / pull / push & see visual diffs<p>[4] JupyterLab gitplus to create GitHub PRs from JupyterLab<p>[5] ReviewNB for reviewing & diff'ing notebook PRs / Commits on GitHub<p>Disclaimer: While I’m the author of the last two (GitPlus & ReviewNB), I’ve represented the overall landscape in an unbiased way. I've been working on this specific problem for 3+ years & regularly talk to teams who use GitHub with notebooks.<p>[1] <a href="https://nbdime.readthedocs.io" rel="nofollow">https://nbdime.readthedocs.io</a><p>[2] <a href="https://jupytext.readthedocs.io" rel="nofollow">https://jupytext.readthedocs.io</a><p>[3] <a href="https://github.com/jupyterlab/jupyterlab-git" rel="nofollow">https://github.com/jupyterlab/jupyterlab-git</a><p>[4] <a href="https://github.com/ReviewNB/jupyterlab-gitplus" rel="nofollow">https://github.com/ReviewNB/jupyterlab-gitplus</a><p>[5] <a href="https://www.reviewnb.com/" rel="nofollow">https://www.reviewnb.com/</a>
The solution is don't use ipynb. Instead, use an IDE that can run code segments in files, and version those files.<p>You end up with files which are syntactically correct code, versionable, and can be run in segments just like ipynb. Win, win, win.
You can use clean and smudge filters in git. Since notebook files are JSON it's pretty straightforward to strip outputs from them using `jq`:<p><a href="http://timstaley.co.uk/posts/making-git-and-jupyter-notebooks-play-nice/" rel="nofollow">http://timstaley.co.uk/posts/making-git-and-jupyter-notebook...</a>
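If you'd rather not depend on `jq`, the same idea can be sketched in plain Python. This is a minimal version, assuming the usual notebook JSON layout (a top-level `cells` list where code cells carry `outputs` and `execution_count`); the function name is just illustrative, and it is not what the linked post or nbstripout actually runs.

```python
import json


def strip_outputs(raw: str) -> str:
    """Return notebook JSON text with code-cell outputs and counters removed."""
    notebook = json.loads(raw)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop plots, tables, printed text
            cell["execution_count"] = None  # drop the In[n] counter, which changes on every run
    # Stable indentation keeps later diffs line-oriented.
    return json.dumps(notebook, indent=1, ensure_ascii=False) + "\n"
```

To use it as a clean filter, a script (say a hypothetical `strip_nb.py`) would end with `sys.stdout.write(strip_outputs(sys.stdin.read()))`, and you would register it with something like `git config filter.nbstrip.clean "python3 strip_nb.py"` plus `*.ipynb filter=nbstrip` in `.gitattributes`. The filter name `nbstrip` is made up here. Your working copy keeps its outputs; only the version git stores is stripped.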
Hex just launched a diff view feature, along with git sync and a clean file format: <a href="https://hex.tech/blog/github-sync" rel="nofollow">https://hex.tech/blog/github-sync</a>
In addition to this, you can keep a dual markdown version that uses a much more human-readable syntax and preserves both code and markdown sections of the Jupyter notebook. This is also via jupytext. In both jupyterlab and jupyter you can pair the two versions (something like what is discussed here: <a href="https://www.wrighters.io/jupytext-notebooks-as-markdown-or-python/" rel="nofollow">https://www.wrighters.io/jupytext-notebooks-as-markdown-or-p...</a>) and they will stay in sync automatically.
For the Elixir equivalent of Jupyter (called Livebook) I've been keeping the markdown files in a `livebooks` directory so diffing them is as easy as `git diff` or any other existing text-based diff tools. It's been pretty successful.
In Google Colab, when you "Download ipynb" you get a file that looks like json.<p>You can prettify it via "python3 -m json.tool" for example. Then you have a structure that you can diff via your favorite diff tool.<p>What is a pita about it?
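For reference, the prettify-then-diff approach can be done in one step with the standard library. A small sketch, with made-up function names and placeholder file labels:

```python
import difflib
import json


def pretty(raw: str) -> list[str]:
    """Re-serialise JSON with fixed indentation and key order, split into lines."""
    return json.dumps(json.loads(raw), indent=2, sort_keys=True).splitlines()


def notebook_diff(old_raw: str, new_raw: str) -> str:
    """Unified diff of two notebooks after pretty-printing both."""
    lines = difflib.unified_diff(
        pretty(old_raw),
        pretty(new_raw),
        fromfile="old.ipynb",  # placeholder labels for the diff header
        tofile="new.ipynb",
        lineterm="",
    )
    return "\n".join(lines)
```

This makes the diff readable line by line, but outputs (base64 images, execution counts) are still in there, so re-running a notebook without changing any code can still produce a large diff. That is probably the PITA part.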