I think unfortunately the conclusion here is a bit backwards; de-risking deployments by improving testing and organisational properties is important, but it is not the only approach that works.<p>The author notes that there appears to be a fixed number of changes per deployment and that it is hard to increase - I think the 'Reversie Thinkie' here (as the author puts it) is actually to decrease the number of changes per deployment.<p>The reason those meetings exist is risk! The more changes in a deployment, the higher the risk that one of them will introduce a bug or operational issue. By deploying small changes often, you deliver value much sooner and fail smaller.<p>Combine this with techniques such as canarying and gradual rollout, and you enter a world where deployments are no longer flipping a switch that either breaks or doesn't - you get to turn outages into degradations.<p>This approach is corroborated by the DORA research[0], and covered well in Accelerate[1]. It also features centrally in The Phoenix Project[2] and its spiritual ancestor, The Goal[3].<p>[0] <a href="https://dora.dev/" rel="nofollow">https://dora.dev/</a><p>[1] <a href="https://www.amazon.co.uk/Accelerate-Software-Performing-Technology-Organizations/dp/1942788339" rel="nofollow">https://www.amazon.co.uk/Accelerate-Software-Performing-Tech...</a><p>[2] <a href="https://www.amazon.co.uk/Phoenix-Project-Helping-Business-Anniversary/dp/B00VBEBRK6" rel="nofollow">https://www.amazon.co.uk/Phoenix-Project-Helping-Business-An...</a><p>[3] <a href="https://www.amazon.co.uk/Goal-Process-Ongoing-Improvement/dp/0566086654" rel="nofollow">https://www.amazon.co.uk/Goal-Process-Ongoing-Improvement/dp...</a>
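To make the "turn outages into degradations" point concrete, here is a minimal sketch of a gradual rollout loop. The set_canary_weight / get_error_rate / rollback hooks are hypothetical stand-ins for whatever your load balancer and metrics stack actually expose:

    import time

    # Hypothetical hooks into your own infra - the names are illustrative.
    from deploy_hooks import set_canary_weight, get_error_rate, rollback

    STEPS = [1, 5, 25, 50, 100]   # percent of traffic on the new version
    ERROR_BUDGET = 0.01           # abort if the error rate exceeds 1%

    def gradual_rollout() -> bool:
        for weight in STEPS:
            set_canary_weight(weight)     # shift a slice of traffic over
            time.sleep(300)               # let metrics accumulate
            if get_error_rate() > ERROR_BUDGET:
                rollback()                # a bad change degrades 1-50% of
                return False              # traffic for minutes, not 100%
        return True                       # fully rolled out

The shape is what matters: a bad deployment hits a slice of traffic for a few minutes rather than all of it at once.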
I am trying to expound a concept I call “software literacy” - the idea that a business can be run via code just as much as a company today can be run by English words (policy documents, emails etc).<p>This leads to a few corollaries - things like “if GPUs do the work then coders are the new managers”, or that we need whole-org test rigs to make the impact of changes clear.<p>This seems directly related to this excellent article - to my mind, if all the decision makers are not looking at the code as the first-class object in a change process (as opposed to Jiras or project plans), then not all decision makers are (software) literate. This comes up a lot in the threads here (“how do I discuss this with non-technical management?”) - the answer is you cannot; that management must be changed. This is an enormous generational roadblock that I thought was a problem thirty years ago but naively assumed would disappear as coders grew up. Of course the problem is that to “run” a company one does not need to code - so until not coding is as embarrassing as not writing would be for a newspaper editor, we won’t get past it.<p>The main point is that we need companies that can be run with the new set of self-reinforcing concepts - SOPs, testing, and systems rather than meetings as communication.<p>I will try and rewrite this comment later - it needs work
The organisation will actively prevent you from trying to improve deployments, though; they will say things like “Jenkins shouldn’t be near production” or “we can’t possibly put things live without QA being involved” or “we need this time to make sure the quality of the software is high enough”. All with a straight face, while having millions of production bugs and a product that barely meets any user requirements (if there are any).<p>In the end, fighting the bureaucracy is actually impossible in most organisations, especially if you’re not part of the 200 layers of management that create these meetings. I would sack everyone but the programmers and maybe two designers and let everyone fight it out without any agile coaches, product owners, scrum masters and product experts.<p>Slow deployment is a problem, but it’s not <i>the</i> problem.
A marginally related point, but I do not know if others have faced the following situation: I worked in a place with a CI pipeline running ~25 minutes, with the unit/integration tests (3000+) taking 18 minutes.<p>Whenever something happened in production we ended up adding more tests; and of course, when things went south, at least 50 minutes were necessary to recover.<p>After a lot of consideration we decided to relax and simplify some tests and focus on recovery instead (i.e. get the full pipeline under 5 minutes), combined with canary as the deployment strategy (instead of rolling updates).<p>At least for us it was a refreshing experience, even if it felt wrong in some ways.
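One way to get the gate under 5 minutes without throwing the slow tests away is to tag them and run them out of band; a sketch with pytest markers (the "slow" marker name and the tests are illustrative, not what that team actually did):

    # pytest.ini should register the marker so pytest doesn't warn:
    #   [pytest]
    #   markers = slow: long-running integration tests, run out of band

    import pytest

    @pytest.mark.slow
    def test_full_order_pipeline():
        ...  # hundreds of tests like this push a suite to 18 minutes

    def test_price_rounding():
        ...  # unmarked tests form the fast pre-deploy smoke set

The deploy gate then runs only `pytest -m "not slow"`, while the full suite runs asynchronously alongside or after the canary.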
I have personal experience with this in my professional career. Before the Christmas break I had a big change coming, and there was fear. My org responded by increasing testing (regression testing, which increased overhead). This increased the risk that changes on dev would break changes on my branch (not in a code-merging way, but in a <i>complex adaptive system</i> way).<p>I responded to this risk by making a meeting. I presented our project schedule and told my colleagues what was expected of them, i.e. if they drop code-style comments on the PRs, those will be deferred to a future PR (and then ignored and never done).<p>What we needed <i>is</i> fine-grained testing with better isolation between components. The problem is that our management operates at a high level; they don’t see meetings as a means to an end, they see meetings as a worthy goal in and of themselves. More meetings means more collaboration, means good. I’d love to see advice on how to lead technical changes with non-technical management.
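A sketch of what "better isolation between components" can look like in practice - a component depends on a small interface and its tests use a deterministic fake, so churn on dev doesn't ripple into the branch (all names here are hypothetical):

    from typing import Protocol

    class RateSource(Protocol):
        def current_rate(self, currency: str) -> float: ...

    class PricingService:
        """Depends on the RateSource contract, not the concrete component."""
        def __init__(self, rates: RateSource):
            self.rates = rates

        def price(self, amount: float, currency: str) -> float:
            return amount * self.rates.current_rate(currency)

    class FakeRates:
        def current_rate(self, currency: str) -> float:
            return 2.0    # fixed and deterministic, unlike the real service

    def test_price_uses_rate():
        assert PricingService(FakeRates()).price(10, "EUR") == 20.0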
While this is mostly correct, it’s also just as irrelevant.<p>TL;DR: software performance, thus human performance, is all that matters.<p>Risk management/acceptance can be measured with numbers. In software this is actually far more straightforward than in many other careers, because software engineers can only accept risk within the restrictions of their known operating constraints, and everything else is deferred.<p>If you want to go faster you need to maximize the frequency of human iteration above absolutely everything else. If a person cannot iterate, such as when waiting on permissions, they are blocked. If they are waiting on a build or a screen refresh, they are slowed. This can also be measured with numbers.<p>If person A can iterate 100x faster than person B, correctness becomes irrelevant. Person B must maximize correctness because they are slow. To be faster and more correct, person A has extreme flexibility to learn, fail, and improve beyond what person B can deliver.<p>Part of iterating faster AND reducing risk is fast test automation. If person A can execute 90+% test coverage in the time of 4 human iterations, then that test automation is still 25x faster than one person B iteration, with a 90+% lower risk of regression.
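The 25x figure follows directly from the stated numbers; spelled out:

    # As stated above: one person-B iteration = 100 person-A iterations,
    # and the automated suite runs in the time of 4 person-A iterations.
    b_iteration = 100   # cost of one B iteration, in A-iterations
    test_run = 4        # cost of one full automated test run, in A-iterations

    print(b_iteration / test_run)   # 25.0 - A can run the whole suite
                                    # 25x in the time B iterates once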
I had a boss who actually acknowledged that he was deliberately holding up my development process - this was a man who refused to allow me a four-day working week.
Sounds like a process problem. 2024 development cycles should be able to handle multiple lanes of development and deployment. That is also why things moved to microservices: so you can deploy with minimal impact, as long as you don't tightly couple your dependencies.
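"Don't tightly couple your dependencies" mostly cashes out in the wire contract; a tolerant-reader sketch (field names hypothetical) that lets producer and consumer services deploy in either order:

    from dataclasses import dataclass

    @dataclass
    class Order:
        order_id: str
        total_cents: int
        currency: str = "USD"   # default covers messages from older producers

    def parse_order(payload: dict) -> Order:
        # Tolerant reader: take only the fields this service needs and
        # ignore the rest, so the producer can add fields and deploy first.
        return Order(
            order_id=payload["order_id"],
            total_cents=payload["total_cents"],
            currency=payload.get("currency", "USD"),
        )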