This is a great question! Here is what I think.

Both the way "data scientists" typically work and the agile methods used in many software development organizations are unsuitable for commercial use of machine learning and other data-rich methods.

In your question I am hearing two themes: (i) how to organize the actual work ("no data", "no features", "no users") and (ii) how to slot the work into the sprint system.

The typical sprint system often introduces risk and uncertainty into data-rich projects. Here is an example. I was working on a project where the sprints were typically two weeks, but one part of building the knowledge base was running a batch job that took two days. Of course, if you set the batch job up wrong, you might have to run it more than once.

When I was responsible for the batch job I accounted for that risk: I would spend maybe two days getting ready, then start the job at the very beginning of the sprint. That way, even if things went horribly wrong and I had to run it two or three times, I was certain the KB would be ready on time. Practically, I had a PERT chart in my head that I was using to plan my own work.

Even though I told them what I just told you, the first time some other team members did the batch job, they started it on the last day of the sprint, which meant it wasn't ready and the sprint shipped with an old and inappropriate KB.

A good retrospective outcome would have been turning the two-day batch job into a two-hour batch job (it started out as a two-century batch job!). The reliability of the batch job is every bit as important as its speed in a situation like that. More fundamentally, I think some thinking about the ordering of work (PERT charts) should have been built into the process.

There are lots of cases like that, but note the risk-amplifying property of the sprint: if some input to the sprint is a day late, everything that depends on that input slips two weeks.
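To make the scheduling arithmetic concrete, here is a minimal sketch of that in-your-head PERT reasoning. The numbers (a ten-working-day sprint, a two-day job, a budget of two retries) come from the story above; the function names are mine.

```python
import math

SPRINT_DAYS = 10   # two-week sprint, in working days
PREP_DAYS = 2      # time spent setting the batch job up carefully
RUN_DAYS = 2       # one run of the batch job
MAX_RETRIES = 2    # plan for up to two failed runs

def latest_safe_start(sprint_days, prep, run, retries):
    """Latest working day (0 = sprint start) you can begin prep and
    still ship on time even if every budgeted retry is needed."""
    worst_case = prep + run * (1 + retries)
    return sprint_days - worst_case

def effective_slip(input_delay_days, sprint_days):
    """A late input doesn't cost you its own delay; it costs you the
    distance to the next sprint boundary."""
    return math.ceil(input_delay_days / sprint_days) * sprint_days

print(latest_safe_start(SPRINT_DAYS, PREP_DAYS, RUN_DAYS, MAX_RETRIES))  # 2
print(effective_slip(1, SPRINT_DAYS))  # a 1-day slip becomes a 10-day slip
```

Start any later than day 2 and a couple of bad runs push the KB past the sprint boundary, which is exactly what happened.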
For that project we also did two-hour "planning poker" meetings, and that was another problem: two hours wasn't enough time to make certain decisions well. If we'd had two or three people think about those things for a day, we could have made consistently better decisions, which would have meant doing the right work in the next sprint and, again, saving two weeks of calendar time.

It is very easy for little failures of the type described above to cascade and produce a recurring pattern of failure that is awful for productivity, morale, etc.

It is very important to push back on management and address these kinds of problems.

Now, this sounds very negative about agile in data-rich projects, but that's not the only thing you should take away. In the long run, data-rich projects benefit hugely from continuous improvement done on a regular cadence.

You meet "data scientists" or "junior programmers" who have started a number of projects and sent deliverables over to other people who got them ready for production. They think they have a great batting average, but when you look at it from a wider perspective, you see that 4x the man-hours they put into the project got spent getting their deliverables ready for production. Had the team "begun with the end in mind", the total cost of the project could have been cut in half or more and the risk greatly reduced.

Big and very capable companies like IBM and Nuance, as well as many smaller ones you have not heard of, have built data-rich systems that turned out to be like building a nuclear reactor. We are not talking about something that cost $22,000 when it should have cost $21,000, but rather something that cost $20 billion when it should have cost $5. The people involved will tell you they don't know what they're going to do next, but they do know they are never going to do that again.

So your process, your technology, everything has to be designed to control (1) risk and (2) cost, and you've got to communicate that to the people you work with.

What most people don't know/accept/believe is that most teams would control cost best if they tried to control risk first. See:

https://www.amazon.com/Rapid-Development-Taming-Software-Schedules/dp/1556159005

As for your other issues, here is what I'd say.

Short term there are two things that really matter: (1) getting data, and (2) developing the basic interfaces between the ML component and the rest of the system. If you have (2) you can really contribute to the sprints; if you don't, you can't. Without (1), any data pipeline work, feature engineering, etc. is going to be largely a waste of time.

For data, start out with the Enron emails or your own emails and label enough of them that you can start thinking about the other issues. Your early data set will be nowhere near large enough to get useful results, and that's another issue you'll need to bring up with management once you've reached it.
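A minimal sketch of that bootstrap step, with assumptions of my own: the one-file-per-message maildir layout of the unpacked Enron corpus, a made-up binary label scheme, and a plain CSV as output. The point is only to get a few hundred labeled messages in front of you quickly.

```python
import csv
import pathlib

CORPUS = pathlib.Path("maildir")    # unpacked Enron corpus (assumed path)
OUT = pathlib.Path("labels.csv")
LABELS = {"1": "relevant", "0": "irrelevant"}  # hypothetical label scheme

# Resume where you left off if a labels file already exists.
done = set()
if OUT.exists():
    with OUT.open() as f:
        done = {row[0] for row in csv.reader(f) if row}

with OUT.open("a", newline="") as f:
    writer = csv.writer(f)
    for path in sorted(p for p in CORPUS.rglob("*") if p.is_file()):
        if str(path) in done:
            continue
        print("=" * 70)
        print(path.read_text(errors="replace")[:2000])  # preview the message
        choice = input("label [1/0, q to quit]: ").strip()
        if choice == "q":
            break
        if choice in LABELS:
            writer.writerow([str(path), LABELS[choice]])
```

Even an hour of this tells you a lot about how separable the classes are and what the interfaces in (2) need to carry.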