We built network isolation for 1500 services

135 points by p10jkle over 5 years ago

16 comments

d4nt over 5 years ago

Locking down this network of services is a massive security improvement and they've used some very neat ways of achieving it. Overall, I really appreciate them writing this up.

However, 1500 services? That really feels like they're separating things at too granular a level. Does every one of those things really need to sit behind a network call? Couldn't some of that re-use be via code libraries? I wonder what the service-to-developer ratio is?

all_usernames over 5 years ago

Great post. I really appreciate engineering blogs written in this storytime format. I don't have time to dive into the implementation of Calico or <insert one of the 1,261 Kubernetes projects here>, but I learn a lot from reading the process a team goes through in figuring out and iterating on a solution. https://landscape.cncf.io/

sansnomme over 5 years ago

Another potential solution is to use a constraint solver like Microsoft's Z3, or if you want a nicer syntax and more flexibility, Prolog.

E.g. https://medium.com/@ahelwer/checking-firewall-equivalence-with-z3-c2efe5051c8f

This is much more scalable in the long run.

gravypod over 5 years ago

If the authors are reading this, I was wondering two things:

1. Why was static analysis of the code chosen over observing the system during runtime and integration testing?

2. What was the reason the CNI layer was chosen for the implementation of this over the service mesh layer?

Something that really interests me about Bazel/Buck/Pants/Please is that they automate #1 entirely with dep queries.

z3t4 over 5 years ago

Network filtering is a nice extra layer, but it should not be the only layer. Services should require authorization as if they were an open API.

rawoke083600 over 5 years ago

"But we already have over 1,500" wow... I would start there...

purple_ducks over 5 years ago

> attempt to find code that looked like it was making a request to another service.

> We generally fixed those cases by adding a special comment in the code that told rpcmap about the link

Why not enforce that all endpoints/URLs be defined in a config file and sidestep this? Scanning code for URLs or constructed URLs is overkill and brittle.

grandinj over 5 years ago

Strikes me that some services ideally need to expose multiple interfaces, and that isolation should be on a per-service-interface basis.

E.g. the monitoring service should only be able to access the metrics part of each service.

aSplash0fDerp over 5 years ago

Nice write-up! That's the beauty of scale: explain a part in detail, then go with the 30,000 foot view.

IMM, the security orchestration may actually become the "app" as speeds continue to increase, compute costs go even lower and losses incurred from compromised data/networks increase.

A true zero trust platform that keeps all of the doors closed or "instances/vm" offline until (the milliseconds) they're needed is the security symphony we might see on the horizon.

Data silos and walled gardens may never go out of style, they'll just take on new acronyms.

angry_octet over 5 years ago

Impressive achievement. It still sounds like callees have more knowledge of callers than is justified. Is it a security property or a component functionality property? How do those interact?

A centralised graph representation of the security/functionality properties would be a better way to represent this information, so it can catch the addition of interfaces which should be forbidden. It could also be configuration-managed as sets of microservices.

If you have a connectivity graph, it would be good to do taint analysis to see how far bad information can propagate.
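
As a rough illustration of the taint-analysis idea, here is a minimal sketch that treats the allowed calls as a directed graph and asks which services a compromised one could reach. The service names and the `ALLOWED_CALLS` table are made up for the example and are not taken from the article; a real analysis would also need to consider which way data flows over each call.

```python
from collections import deque

# Hypothetical allow-list: caller -> set of callees it is permitted to call.
ALLOWED_CALLS = {
    "service.web-gateway": {"service.account", "service.feed"},
    "service.account": {"service.ledger"},
    "service.feed": set(),
    "service.ledger": set(),
    "service.metrics": {"service.account", "service.ledger"},
}


def reachable_from(graph, source):
    """Return every service reachable from `source` by following allowed calls."""
    seen = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for callee in graph.get(current, ()):
            # Skip services already visited, and the source itself on cycles.
            if callee not in seen and callee != source:
                seen.add(callee)
                queue.append(callee)
    return seen


if __name__ == "__main__":
    # Which services could a compromised gateway influence, directly or not?
    print(sorted(reachable_from(ALLOWED_CALLS, "service.web-gateway")))
```

A breadth-first walk like this only gives an upper bound on propagation, since it assumes every allowed call can carry tainted data.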

matdehaast over 5 years ago

Curious if you looked at using OAuth with client credentials grant for each service?

Also didn't see any mention of prior art like https://cloud.google.com/beyondcorp/.

Thanks for the great writeup!

brentis over 5 years ago

Nice work. If you define your policies based on a tagging taxonomy, you could centrally manage these inbound/outbound service relationships. Every new instance or container would assume the same network policies based on its tag.
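
To make the tagging idea concrete, here is a minimal sketch in which policy is written once per tag pair and each service only declares its tags. The tag names, service names, and the `TAG_RULES` and `SERVICE_TAGS` tables are all hypothetical, chosen only for illustration.

```python
# Hypothetical tag-level rules: (caller tag, callee tag) pairs that are allowed.
TAG_RULES = {
    ("frontend", "backend"),
    ("backend", "datastore"),
    ("monitoring", "backend"),
}

# Hypothetical tag assignments; a new service only needs an entry here.
SERVICE_TAGS = {
    "service.web-gateway": {"frontend"},
    "service.account": {"backend"},
    "service.ledger": {"datastore"},
    "service.metrics": {"monitoring"},
}


def is_allowed(caller, callee, service_tags=SERVICE_TAGS, rules=TAG_RULES):
    """Allow a call if any caller tag may reach any callee tag; deny by default."""
    return any(
        (src, dst) in rules
        for src in service_tags.get(caller, ())
        for dst in service_tags.get(callee, ())
    )


if __name__ == "__main__":
    print(is_allowed("service.web-gateway", "service.account"))
    print(is_allowed("service.web-gateway", "service.ledger"))
```

The trade-off is granularity: every service sharing a tag gets the same access, which is coarser than a per-service allow-list.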

hu3 over 5 years ago

> This would read all the Go code in our platform, and attempt to find code that looked like it was making a request to another service.

Is there a link about how much Go Monzo uses?

mschuster91 over 5 years ago

1,500 services? What the... the run times for calls must be atrocious with all the network communication and latency that is happening.

voltarolin over 5 years ago

Can a service mesh such as Istio provide the capability that Monzo have implemented themselves here?

kasey_junk over 5 years ago

Using YAML for critical infrastructure specification is one of the stupidest things we’ve ever done as an industry.
