I'm a graduate student doing a lot of data analysis. For a given project, I find myself using a mishmash of Jupyter notebooks, shell and Python scripts, and interactive IPython/SQL sessions. If I switch projects for a couple of days, it becomes really hard to piece together exactly what I did on the old project.<p>Is there a simple-to-use system that would at least give me breadcrumbs to retrace my steps? I'd like to strike a balance between the time the organization takes and its usefulness. It would be helpful if I could use the same system to plan future goals as well as document past work.
A recent episode of Not So Standard Deviations (a great podcast for anyone in the sciences or applied statistics) covered reproducibility:<p><a href="https://soundcloud.com/nssd-podcast/episode-5-irl-roger-is-totally-with-it" rel="nofollow">https://soundcloud.com/nssd-podcast/episode-5-irl-roger-is-t...</a><p>I'd suggest treating your code like a software project: convert repeated logic into functions, collect functions into modules/libraries depending on their use, write <i>lots</i> of documentation, and use version control (GitHub with a nice README.md in each project is a great start).<p>When you switch projects, take 5-10 minutes to update your docs (I keep a "captain's log") with the latest details and a list of to-do items. I like to note my victories ("on 1/19 I produced this plot that sent me in a different direction; on 1/21 I demoed my project and received such-and-such feedback") and next steps in the log, as a way to retrace my progress over weeks or months.<p>That podcast also mentions knitr (<a href="http://yihui.name/knitr/" rel="nofollow">http://yihui.name/knitr/</a>), which looks great for docs.
At the abstract level, get a standardized process for your analysis that you can use from project to project. It doesn't have to be complicated, but it needs to be consistent and applied to each project.<p>Here's what I'd recommend:<p>Take 5-10 minutes after getting a project and think it through. Break it into chunks and note what you need to discover (e.g. go find a data set) and what you already know how to do. Create an ordered task list for yourself (1. get data set, 2. set up project db, 3. create schema for db, etc.). This gives you measurable milestones and helps you keep track of what's next. You can put these on a calendar if you need to as well.<p>Keep a running notebook / text file for each project, with dated entries that briefly explain what you did that day, what troubles you ran into, what needs fixing, what you need to do next, and any other random thoughts you had related to the project. It unloads your RAM and keeps it somewhere you can get at easily. Write in this at the end of each day or when you switch projects. It should take 3-5 minutes to update. It's also the first thing you open when you start working on a project again.<p>Get source control set up and give each project its own folder. Store notes and source code in there and keep it up to date. Remember, process matters more right now than specific implementations. It can be SVN or Git. You can sub-organize further if you need to, e.g. a SQL folder for scripts/stored procedures, BASH for shell scripts, PYTHON for Python scripts, etc.<p>Write a README for your project, stored in the project folder, that explains what pieces of software get run and in what order. That way you won't forget the order of execution.<p>Be systematic in your approach. The best process is the one you stick to and actually use, not the one that's perfect on paper.
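If it helps, the running notes file can be kept low-friction with a tiny helper. This is only a rough sketch: the function name `append_log_entry`, the `Did:`/`Problems:`/`Next:` labels, and the example path are all made up for illustration, so adapt them to whatever layout you actually use.

```python
from datetime import date
from pathlib import Path


def append_log_entry(log_path, did, problems="", next_steps="", today=None):
    """Append one dated entry to a plain-text project log and return it.

    The entry layout (a date heading plus Did/Problems/Next lines) is an
    assumption made for this sketch, not a standard format.
    """
    today = today or date.today()
    lines = [f"## {today.isoformat()}", f"Did: {did}"]
    if problems:
        lines.append(f"Problems: {problems}")
    if next_steps:
        lines.append(f"Next: {next_steps}")
    entry = "\n".join(lines) + "\n\n"

    path = Path(log_path)
    # Create the project's notes folder on first use.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps every earlier entry intact.
    with path.open("a", encoding="utf-8") as f:
        f.write(entry)
    return entry
```

Calling something like `append_log_entry("myproject/notes/LOG.md", "loaded the raw data", next_steps="build the schema")` at the end of the day adds one entry, and since the file lives in the project folder it goes into source control with everything else.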
This is exactly why businesses have some sort of bug tracking/task management software. You don't need Jira, but something lightweight like Trello would work. The most important part is to comment on issues with your latest work and findings. This can be as simple as dropping new scripts into a comment or pointing to a source control commit/pull request/whatever. At the end of the day, you want something centralized where you can track your latest changes and findings.
You can take a simple approach and use a to-do list. Split the list into to-do, doing and done. Each project you are working on should have a header in each section as long as it is not complete. When it is complete, remove it from the to-do and doing sections.
E.g.<p>To-do
Project 1
- task 2<p>Doing
Project 1
- task 1<p>Done
Project 1
Project 2
- task 1
- task 2<p>The work you want to do on a given day should be moved to the doing section. At the end of the day, if the work is completed, move it to the done list. If it was not completed, move it back to the to-do section. You could add comments to tell you where you stopped or any issues you came across.
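For anyone who would rather keep this list in a script than in a text file, here is a minimal sketch of the same three-section idea. All the names (`new_board`, `add_task`, `move_task`, `complete_project`, `render`) are hypothetical, and the nested dictionary is just one possible way to hold it.

```python
SECTIONS = ("to-do", "doing", "done")


def new_board():
    """Return an empty board: section -> project -> list of tasks."""
    return {section: {} for section in SECTIONS}


def add_project(board, project):
    """Give the project a header (an empty task list) in every section."""
    for section in SECTIONS:
        board[section].setdefault(project, [])


def add_task(board, project, task, section="to-do"):
    add_project(board, project)
    board[section][project].append(task)


def move_task(board, project, task, src, dst):
    """Move one task between sections, e.g. from to-do to doing."""
    board[src][project].remove(task)  # ValueError if the task is not in src
    board[dst][project].append(task)


def complete_project(board, project):
    """Once a project is finished, keep its header only under done."""
    for section in ("to-do", "doing"):
        board[section].pop(project, None)


def render(board):
    """Format the board as plain text in the same shape as the list above."""
    lines = []
    for section in SECTIONS:
        lines.append(section.capitalize())
        for project, tasks in board[section].items():
            lines.append(project)
            lines.extend(f"- {task}" for task in tasks)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    board = new_board()
    add_task(board, "Project 1", "task 1")
    add_task(board, "Project 1", "task 2")
    move_task(board, "Project 1", "task 1", "to-do", "doing")
    print(render(board))
```

A plain text file does the same job, of course. The only thing the script adds is that moving a task is a single call, so the three sections cannot drift out of sync.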
I second the "treat it like software development" approach.<p>If you use version control through, say, Git, you can write a message for each commit, which gives you an easy way to track what you've been doing and to roll things back as and when necessary.