This is a great question! Here is what I think.

Both the way "data scientists" typically work and the agile methods used in many software development organizations are unsuitable for commercial use of machine learning and other data-rich methods.

In your question I am hearing two themes: (i) how to organize the actual work ("no data", "no features", "no users") and (ii) how to slot the work into the sprint system.

The typical sprint system often introduces risk and uncertainty into data-rich projects. Here is an example. I was working on a project where the sprints were typically two weeks, but one part of building the knowledge base was running a batch job that took two days. Of course, if you set the batch job up wrong, you might have to run it more than once.

When I was responsible for the batch job I accounted for that risk: I would spend maybe two days getting ready, then start the job at the very beginning of the sprint. That way, even if things went horribly wrong and I had to run it two or three times, I was certain the KB would be ready on time. Practically, I had a PERT chart in my head that I was using to plan my own work.

Even though I told them what I just told you, the first time some other team members did the batch job, they started it on the last day of the sprint, which meant it wasn't ready and the sprint shipped with an old and inappropriate KB.

A good retrospective outcome would have been turning the two-day batch job into a two-hour batch job (it started out as a two-century batch job!). The reliability of the batch job is every bit as important as its speed in a situation like that. More fundamentally, I think some thinking about the ordering of work (PERT charts) should have been built into the process.

There are lots of cases like that, but note the risk-amplifying property of the sprint: if some input to the sprint is a day late, everything that depends on that input slips two weeks.
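To make the scheduling arithmetic concrete, here is a minimal sketch of that in-your-head PERT reasoning. The numbers (a ten-working-day sprint, a two-day job, a budget of two retries) come from the story above; the function names are mine.

```python
import math

SPRINT_DAYS = 10   # two-week sprint, in working days
PREP_DAYS = 2      # time spent setting the batch job up carefully
RUN_DAYS = 2       # one run of the batch job
MAX_RETRIES = 2    # plan for up to two failed runs

def latest_safe_start(sprint_days, prep, run, retries):
    """Latest working day (0 = sprint start) you can begin prep and
    still ship on time even if every budgeted retry is needed."""
    worst_case = prep + run * (1 + retries)
    return sprint_days - worst_case

def effective_slip(input_delay_days, sprint_days):
    """A late input doesn't cost you its own delay; it costs you the
    distance to the next sprint boundary."""
    return math.ceil(input_delay_days / sprint_days) * sprint_days

print(latest_safe_start(SPRINT_DAYS, PREP_DAYS, RUN_DAYS, MAX_RETRIES))  # 2
print(effective_slip(1, SPRINT_DAYS))  # a 1-day slip becomes a 10-day slip
```

Start any later than day 2 and a couple of bad runs push the KB past the sprint boundary, which is exactly what happened.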
For that project we also did two-hour "planning poker" meetings, and that was another problem: two hours wasn't enough time to make certain decisions well. If we'd had two or three people think about those things for a day, we could have made consistently better decisions, which would have meant doing the right work in the next sprint and, again, saving two weeks of calendar time.

It is very easy for little failures of the type described above to cascade and produce a recurring pattern of failure that is awful for productivity, morale, etc.

It is very important to push back on management and address these kinds of problems.

Now, this sounds very negative about agile in data-rich projects, but that's not the only thing you should take away. In the long run, data-rich projects benefit hugely from continuous improvement done on a regular cadence.

You meet "data scientists" or "junior programmers" who have started a number of projects and sent deliverables over to other people who got them ready for production. They think they have a great batting average, but when you look at it from a wider perspective, you see that 4x the man-hours they put into the project got spent getting their deliverables ready for production. Had the team "begun with the end in mind", the total cost of the project could have been cut in half or more and the risk greatly reduced.

Big and very capable companies like IBM and Nuance, as well as many smaller ones you have not heard of, have built data-rich systems that turned out to be like building a nuclear reactor. We are not talking about something that cost $22,000 when it should have cost $21,000, but rather something that cost $20 billion when it should have cost $5. The people involved will tell you they don't know what they're going to do next, but they do know they are never going to do that again.

So your process, your technology, everything has to be designed to control (1) risk and (2) cost, and you've got to communicate that to the people you work with.

What most people don't know/accept/believe is that most teams would control cost best if they tried to control risk first. See:

https://www.amazon.com/Rapid-Development-Taming-Software-Schedules/dp/1556159005

As for your other issues, here is what I'd say.

Short term there are two things that really matter: (1) getting data, and (2) developing the basic interfaces between the ML component and the rest of the system. If you have (2) you can really contribute to the sprints; if you don't, you can't. Without (1), any data pipeline work, feature engineering, etc. is going to be largely a waste of time.

For data, start out with the Enron emails or your own emails and label enough of them that you can start thinking about the other issues. Your early data set will be nowhere near large enough to get useful results, and that's another issue you'll need to bring up with management once you've reached it.
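A minimal sketch of that bootstrap step, with assumptions of my own: the one-file-per-message maildir layout of the unpacked Enron corpus, a made-up binary label scheme, and a plain CSV as output. The point is only to get a few hundred labeled messages in front of you quickly.

```python
import csv
import pathlib

CORPUS = pathlib.Path("maildir")    # unpacked Enron corpus (assumed path)
OUT = pathlib.Path("labels.csv")
LABELS = {"1": "relevant", "0": "irrelevant"}  # hypothetical label scheme

# Resume where you left off if a labels file already exists.
done = set()
if OUT.exists():
    with OUT.open() as f:
        done = {row[0] for row in csv.reader(f) if row}

with OUT.open("a", newline="") as f:
    writer = csv.writer(f)
    for path in sorted(p for p in CORPUS.rglob("*") if p.is_file()):
        if str(path) in done:
            continue
        print("=" * 70)
        print(path.read_text(errors="replace")[:2000])  # preview the message
        choice = input("label [1/0, q to quit]: ").strip()
        if choice == "q":
            break
        if choice in LABELS:
            writer.writerow([str(path), LABELS[choice]])
```

Even an hour of this tells you a lot about how separable the classes are and what the interfaces in (2) need to carry.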