Things I want from Devs as SRE/DevOps

198 points by oschvr over 2 years ago

33 comments

chrsig over 2 years ago

> If you’re a Software Engineer/Developer, then consider that a service (at least, for me), is a piece of code running in a live production system, that YOU wrote, only YOU know how it works, thus YOU own.

I've grown unfond of this attitude. I most certainly don't own it. I have no IP rights to it at all. We're both being paid to solve different facets of the same problem. Coming at me with "this is your problem" isn't going to foster a collaborative environment with me, and a collaborative environment is much more pleasant than an adversarial one.

Also: I'm not the only one that knows how it works. It's been peer reviewed, in no small part to reduce my bus factor. All the documentation requested is perfectly reasonable, and should be part of the organization's standard operating procedure.

If it's not part of the SOP, then no, you won't have those things. You need to work at a cultural level to change that, and for that you're much better off making allies than anything else. Make it clear how those things help you, and what you'll do to make the developers' lives easier when you don't need to worry about the basics. If altruism fails you, you can usually count on people to act in their own best interests.

hayst4ck over 2 years ago

This questionnaire is kind of foreign to me since I see an SRE's job, more or less, as defining interfaces and then forcing everything to adhere to them (politically or manually).

These are the questions I find useful:

<pre><code>"How is capacity for the service allocated right now?"
"How is software updated right now?"
"How was the last outage handled in as much detail as possible?"
</code></pre>

From there, just about everything answers itself with a couple days of reading code and poking at machines, particularly from the output of `lsof` (log files, config files, what the service talks to).

Half of these questions could be answered with grep, and once you get proficient at grep, you can answer questions faster and, more importantly, more accurately than the people who work on the services themselves.

> that YOU wrote, only YOU know how it works, thus YOU own.

I find this attitude pretty toxic. If you are in an SRE vs Product Dev mindset, then you have bigger battles to fight than service manipulation.

tayo42 over 2 years ago

I don't get why SRE is a job (and it was my title for years). The stuff listed is just good software engineering. If an SWE can't figure out that they need to monitor their application (or really anything on this list), they have no business being anything other than a junior programmer.

These kinds of responsibilities create this weird scenario where the team's SRE is the team's babysitter, which just leads to the ops vs dev bullshit we've seen before. Toxic right off the bat.

mberning over 2 years ago

People are complaining about the idea that the developer is ultimately the owner of any service they wrote.

I don’t see how this is even controversial. Consider the case where an SRE is responsible for 5 or 10 such systems. They could never be expected to know as much about those systems as the people that wrote them.

Now if there is a one-to-one relationship between SREs and systems, then it might make sense to expect that level of understanding from the SRE.

In my experience it would be a great privilege to have an SRE dedicated to your application.

lamontcg over 2 years ago

Can someone explain to me how this mentality is any different from that of the systems engineers that SRE replaced?

I haven't read the SRE book, but my understanding was that at Google the answer to all this would be that the SRE would act as a software developer and submit pull requests to the codebase in order to implement/fix all of this?

> If you’re a Software Engineer/Developer, then consider that a service (at least, for me), is a piece of code running in a live production system, that YOU wrote, only YOU know how it works, thus YOU own.

And my own take on this statement, which is getting so much traction in the comments, is that it seems largely indistinguishable from the wall between Dev and Ops that we had back in the late 90s.

eyelidlessness over 2 years ago

Maybe I’m being overly pedantic, but… these are questions DevOps engineers should be able to answer themselves because they’ve contributed to the answers. I understand that DevOps has basically become a euphemism for ops + automation complexity that requires product-equivalent engineering talent + arcane knowledge of a zillion cloud vendors’ … everything. But can we go back to calling that ops?

I actually liked the DevOps-as-in-devs-also-ops as a forcing function to keep deployment relatively simple because it’s very low on the core competency/value proposition spectrums. It also has the benefit of rewarding companies for making that feasible at the expense of a tiny fraction of the cost of dedicated ops roles.

hnarn over 2 years ago

I work as an SRE and while I agree with the "list of questions" as a general template for collaboration, I strongly disagree with the point that developers "own" the applications.

If you work in the same company, you all own the application. The customers don't care that you're "only" the SRE, or "only" the sales guy. This type of attitude is toxic and should be challenged categorically.

If you, the SRE, do not have the information needed (i.e. the "list of questions") then it's as much your responsibility to ask for it as it is the developers' job to help you answer it.

If you feel that the company culture makes it impossible for you to create these necessary processes so that everyone has the information they need, you need to either work towards changing that culture or get a new job.

mianos over 2 years ago

This list is exactly what we try to deliver to operations in our firm. All very reasonable.

You know why you "rarely get an answer for straight away"? I assume because they are working on the next ticket/delivery. A lot of this stuff is not estimated properly. A way to get it estimated properly is to work with the devs, cooperatively.

This said, for some reason, this blog post seems adversarial and gives me a bad vibe. Instead of "List of questions I’d like to get an answer from devs", it should be "we should work together to get these things done".

dsr_ over 2 years ago

This is exactly the sort of requirements list that a dev group would receive from an ops group back when ops were systems administrators and network engineers.

And I am not objecting to it in the least; these are all good and vital questions.

I am objecting to anyone claiming that DevOps is anything other than "using the kinds of tools that help software development projects to help operations", and I present this as absolute evidence.

mediascreen over 2 years ago

My answer to about half of those questions would probably be: "How would YOU like it to work? You are the expert on our systems and I would like to know what you consider best practices. Give me some guidance on how we run things here and I will do my best to set it up that way. If my application is very special and needs special considerations, I will contact you to figure out a way that works for both of us."

kubectl_h over 2 years ago

I moved from full stack eng to SRE/DevOps a couple of years ago but have the least enjoyable role of straddling the two. And while I think this post surfaces some good points I can tell you that deep in the heart of every SRE/DevOps engineer that didn't come from a software dev background -- all they truly want is to get paid 250K a year to administer a system that literally does nothing and thus never breaks and this desire is the subtext to and informs every interaction they have with the engineering team.

t-writescode over 2 years ago

I am very opinionated about what SRE and DevOps own vs what devs own; and, I didn't really have anything negative to say about my (admitted) skim of the article.

As an SWE, I want to and need to know how to provide metrics on my system to be able to understand its health, and I should have good safeguards in place, or at least have communicated with the SREs what I need to provide to them to help them have good safeguards in place, to make sure the application keeps running. If the application goes down, it's my responsibility to make sure it's not my fault (bug in application code) that caused the system to fail.

What I, an SWE, want out of an SRE, though, is infrastructure management. I want to be able to ask them for some queues, and for a redis instance with high availability. I want them to set up the Kafka cluster, the database. I want us to have a conversation about where the secrets are to be stored. I want to be able to ask them what I need to do in code to get a secret and use it. I want them to be able to give me a good template for k8s deployments - or maybe to pair with them, given the docker containers and sidecars I need for a deployment and the projected scaling I'll need, and come out with a best-practices set of k8s deployments.

I would be grateful if they monitor the database for some horrible queries; and, use their knowledge of which deployments made that bad query, to file a ticket to the right team so they fix their code or add an index or whatever is necessary.

Infrastructure, be it k8s or nomad, configuring redis, making rabbitmq highly available, configuring and organizing (especially organizing) k8s deployments into something sane and logical, and so many other things related to infrastructure are skills as specialized as writing high-performance or unusually architected, large systems. I've seen the systems that come up when SWEs-on-assignment create infrastructure; and, I've seen the literal years of work SREs have in their backlog to fix it with best practices.

It's similar to front-end developers: it's an entirely different skill set; and, while each person in each tier can stumble around in the other tiers, it's way better if we are all there, working together toward a common goal, and especially focusing in the areas we have each specialized our craft.

Addendum: of course there are exceptions; but I think those exceptions are 1 in 100 or 1 in 1000.

deathanatos over 2 years ago

> If you’re a Software Engineer/Developer, then consider that a service (at least, for me), is a piece of code running in a live production system, that YOU wrote, only YOU know how it works, thus YOU own.

Like, this is the single biggest truth in the article, and I'm glad to see it stated so clearly. Shout it from the rooftops, please. It's a direct logical consequence, too, and yet so many people seem to make decisions that violate this truth.

I field so many questions about "why is service X doing Y?" Have you asked the service owners?

Unfortunately, I've found one more or less has to become proficient in rapidly understanding services one doesn't own, because getting other people to act logically is a fool's errand.

> Are you logging to stdout ?

Nooooo, to stderr, that's literally what it is there for. (As C says, "for writing diagnostic output". Logs are that.) Also, stdout is sometimes buffered and you don't (IMO) really want that.

Any output-producing program requires stdout for the output, and you can't co-mingle logs with that and have piping still work. While it is unlikely that your production service is producing output, there's no reason to do anything different with the logs. (I'd say a part of being a good production service is "don't be needlessly special".)

(But our tooling will just capture and mux the two streams together, too, so it doesn't matter, unless buffering means the error logs don't make it out right before your service is killed.)

Also, your infra team provides the metrics service, but you need to capture your own metrics. My metrics provider does not have a crystal ball, it cannot peer into your service's memory and pull out critical stats. You must push them yourself. Talk to your infra team, they can show you the API they use… (We collect common, machine level stats, like "CPU in use" or external things about your service that are easily visible, like per-container memory usage. But not your reqs/sec.)
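
For readers who want to see the stdout/stderr split from this comment in concrete form, here is a minimal Python sketch. It is an illustration only, not something from the thread: the logger name `demo_service`, the `run` function, and the `result=42` payload are all hypothetical. It relies on the standard `logging.StreamHandler`, which writes to `sys.stderr` when no stream is given and flushes after each record.

```python
import logging
import sys


def build_logger(stream=None):
    """Return a logger that writes diagnostics to stderr (or to `stream`)."""
    logger = logging.getLogger("demo_service")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    # StreamHandler(None) falls back to sys.stderr.
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep records away from the root logger
    return logger


def run(out=None, log_stream=None):
    """Emit program output on stdout and logs on stderr, never mixed."""
    out = out if out is not None else sys.stdout
    log = build_logger(log_stream)
    log.info("starting")
    print("result=42", file=out)  # hypothetical program output
    log.info("done")


if __name__ == "__main__":
    run()
```

If this were saved as a script, redirecting stdout to a file or a pipe should leave the log lines on the terminal, so piping keeps working.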

rad_gruchalski over 2 years ago

Cool, turn it into a set of requirements and put it up as part of the definition of done.

Questions in this form always seem condescending. Like “I’m smarter than you, I thought about it, you didn’t”.

mattpallissard over 2 years ago

This comment section was exactly what I expected. A mirror of how most folks in the trenches discuss these murky boundaries.

<pre><code>* SRE/DevOps folks stating the person that wrote the application has the knowledge to debug it.
* Devs saying that it's the SRE/DevOps job to debug it
* Lots of comments on culture and you should do X
</code></pre>

I know most people like the whole grassroots thing, but the only shops I've seen that are actually killing it are the ones who dictate these boundaries and responsibilities from the top down. And I've seen a lot of shops.

jamesrom over 2 years ago

This is completely backwards, speaking as someone that has been an SRE and DevOps engineer.

Almost all of the questions can be simply answered with: "This is an NFR that was created by SRE".

The important thing is to collaborate with each team and be there when architectural and design decisions are being made in the first place!

All of these questions are post-hoc, coming after the thing has been built. You would never need to ask these questions if you help drive the initial design.

Embed yourself with your teams. Ask to be part of design discussions. Remember: 50% eng 50% ops. You have no excuse!

RcouF1uZ4gsC over 2 years ago

It seems that these questions should basically be answered once per company.

All services should have common health endpoints and shutdown operations.

Logging should be standardized across all the services of a company.

Having bespoke answers to these questions for each service will rapidly devolve into chaos when you have multiple services deployed.
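
As a rough illustration of what a shared health endpoint and shutdown convention could look like, here is a minimal sketch using only the Python standard library. Everything specific in it is an assumption rather than something stated in the thread: the `/healthz` path, the JSON body, and SIGTERM as the shutdown signal are hypothetical choices a company would make for itself.

```python
import json
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":  # hypothetical company-wide path
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # silence per-request access logging in this sketch


def make_server(port=0):
    """Bind to localhost; port 0 asks the OS for any free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)


def serve_until_sigterm(server):
    """Run in the main thread; SIGTERM triggers a clean stop."""
    def handle_sigterm(signum, frame):
        # shutdown() blocks until serve_forever() returns, so it must be
        # called from a thread other than the one running serve_forever().
        threading.Thread(target=server.shutdown, daemon=True).start()

    signal.signal(signal.SIGTERM, handle_sigterm)
    try:
        server.serve_forever()
    finally:
        server.server_close()


if __name__ == "__main__":
    serve_until_sigterm(make_server(8080))  # hypothetical port
```

The point of the sketch is that the handler and the shutdown hook are the same for every service, so whoever operates them only has to learn one convention.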

blacklion over 2 years ago

Are you really DevOps if you need to write such rants? Are you really DevOps if your company has Devs?

I thought that DevOps, by definition, is developer and operations in one. You wrote the service, you support the service, there is no boundary, and so there is no such problem as described in this text, by definition.

A DevOps engineer complains about a problem whose proposed solution is to be DevOps...

Joel_Mckay over 2 years ago

"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure." (Melvin E. Conway, https://en.wikipedia.org/wiki/Conway%27s_law)

This is unfortunately the death knell for DevOps organizational teams on large projects. Primarily, the design specification usually ends up being hammered into the inherent dysfunction the project was intended to solve in the first place.

Best of luck =)

mkl95 over 2 years ago

I agree with some of the points, but on the other hand most organizations do not really empower SDEs to reason about architecture. Things like production budgets and production grade monitoring and observability are usually owned fully by SRE/devops, and if some enterprise architect type is involved devs won't even own the spec. At those places, devs can at best make a wild guess of what the expectations are. Responsibility should be proportional to power.

travisgriggs over 2 years ago

Yeck. I read, then skimmed, through this article. Do others have the same “another mediocre engineer turned manager who I detest” reaction? Don’t want to work where this guy works.

The first sin they commit is framing their argument, in part, as one of titles/labels. This is usually an institutional smell. And it’s not a pretty odor.

The second is that the person believes their role is to question others. It’s a move that insecure people play. The idea is that you keep your opponents defending themselves against questions you define, and that means there’s no time to address some of the hard questions that might circle your own “role.”

It sounds like the guy feels he knows the answers. If so, why doesn’t he jump in and do them? If he knows better how to do this SRE thing as defined by him, clearly his company has pulled a Peter principle, promoting him from something he did well to a position where he now harps on others using their nostalgia. Value may have been lost. If he’s really that good, we can use him in the trenches. If not, he’ll learn how to try to explain why some of these PHB questions are actually hard to answer and execute.

donutshop over 2 years ago

I've always hesitated when there are large pushes coming from DevOps. What if all I wanted to do was code and not work on yaml files? I also didn't sign up to be paged in the middle of the night. Some of the points in the article are valid but, as some others have said, the tone comes off as hostile.

Truthfully, oftentimes I don't understand how things behave in a production environment.

scarface74 over 2 years ago

I’m hearing that what they really need is a Developer who understands operations so the “DevOps” guy just has to take care of operations?

That was supposed to be the definition of “DevOps” in the first place. At any company that has a DevOps role, it is really going to be an operations role by another name.

tflinton over 2 years ago

Honestly if you keep to https://12factor.net/ the only time an SRE will page you is when there’s a cryptic custom error with no runbook.

If only I had a dollar for every time some program dereferences a null.

opportune over 2 years ago

I think these are good questions to ask but IME SREs are expected to learn and even contribute to these either as they onboard to their team or as they take ownership of reliability for a particular service.

The way this is phrased, it sounds like the author is managing reliability for things where they don’t already know the answers to these questions nor do they have the context or bandwidth (or even access?) to answer them themselves. Seems like a recipe for disaster, or at the very least, a lot of frantic learn-as-you-go.

That said, as a dev, I do think we could do a lot better adding playbooks. Though on the other side of the fence, they’re often ignored with a “I don’t know what’s going on and you wrote this, can you help?”

poulsbohemian over 2 years ago

I find it interesting that once upon a time, DevOps was a way of doing / organizing things, not a role per se. It slowly morphed back toward systems administration (the thing it was intended to replace) as a role, and SRE was a kind of sub-set role of both. Recently, I'm starting to see this SRE / DevOps abbreviation, described as one and the same role. So I guess all that is old is new again, just renamed?

simonjgreen over 2 years ago

This doesn't read like an SRE perspective, it reads like a classic SysAdmin perspective. Which, while useful, is a very different role.

jdbernard over 2 years ago

It seems like there is a lot of disagreement and discussion about the role of SRE vs Devs. My team is responsible for our own operations (we are Dev and DevOps, no SRE team), and this is a great list of questions to ask ourselves during the planning and estimation phase before we build our new stuff.

fasteo over 2 years ago

wildcow over 2 years ago

> Are you using gRPC or REST

This most likely means "Are you using gRPC or JSON over HTTP".

hkon over 2 years ago

Devops is dead?

wilde over 2 years ago

Lol if you don’t know these things, your company has Ops, not DevOps

doublerabbit over 2 years ago

Next in the trilogy: Things I want from SRE/DevOps as a SysAdmin.

- What specs of a VM do you require?

I'll assume that 16 MB of RAM and 512 MB of drive space running Slackware is suitable, operating from a 1.44 MB floppy.

- What do I do if it doesn't compile?

It works in DevLand, so I assume it'll work anywhere. No, you can't growl at me, you asked for Linux and I gave you Linux. Documentation please.
