It's really not that difficult to network containers. We're using flannel [1] on CoreOS, with flannel's VXLAN backend to encapsulate container traffic. We're Kubernetes users, so every pod [2] gets its own IP, every host gets its own pod subnet, and flannel handles the routing between those subnets across all the CoreOS servers in the cluster.

I was skeptical when we first deployed it, but we've found it to be dependable and fast. We're running it in production on six CoreOS servers and 400-500 containers.

We did evaluate Project Calico initially, but some performance tests we found tipped the scales in favor of flannel. [3] That was about a year ago, so I don't know whether Calico has improved since then.

[1] https://github.com/coreos/flannel

[2] A Kubernetes pod is one or more related containers running on a single server.

[3] http://www.slideshare.net/ArjanSchaaf/docker-network-performance-in-the-public-cloud
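A rough Python sketch of the per-host subnet idea, purely for illustration (the cluster CIDR, subnet size, and host names are made up; this is not flannel's code, which keeps the real allocations in etcd):

    import ipaddress

    # Hypothetical pod network; flannel stores the real one in etcd.
    cluster = ipaddress.ip_network("10.1.0.0/16")

    # Each host is handed its own /24; pods on that host get IPs from it.
    subnets = cluster.subnets(new_prefix=24)
    host_subnet = {host: next(subnets) for host in ("coreos-1", "coreos-2", "coreos-3")}

    # The routing flannel maintains is conceptually just this lookup:
    # destination pod IP -> owning host -> VXLAN-encapsulate and send there.
    def host_for(pod_ip):
        ip = ipaddress.ip_address(pod_ip)
        for host, subnet in host_subnet.items():
            if ip in subnet:
                return host
        raise LookupError(f"{pod_ip} is not in the pod network")

    print(host_subnet)           # {'coreos-1': 10.1.0.0/24, 'coreos-2': 10.1.1.0/24, ...}
    print(host_for("10.1.1.7"))  # coreos-2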
Another solution to this problem is Romana [1] (I am part of this effort). It avoids overlays as well as BGP because it aggregates routes, using its own IP address management (IPAM) to maintain the route hierarchy.

The nice thing about this is that nothing has to happen for a new pod to become reachable: no /32 route distribution, no BGP (or etcd) convergence, no VXLAN ID (VNID) distribution for an overlay. At some scale, route and/or VNID distribution is going to limit how quickly new pods can be launched.

One other thing not mentioned in the blog post or in any of these comments is network policy and isolation. Kubernetes v1.3 includes new network APIs that let you isolate namespaces. This can only be achieved with a backend network solution like Romana or Calico (and some others).

[1] romana.io
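To make the aggregation point concrete, a toy Python sketch (the addressing scheme is invented, not Romana's actual IPAM): when pod addresses are carved hierarchically out of a per-host prefix, the fabric only ever carries one route per host, so launching a pod is a purely local event.

    import ipaddress

    # Invented hierarchy: cluster /16 -> per-host /24 -> per-pod address.
    cluster = ipaddress.ip_network("10.2.0.0/16")
    host_prefixes = list(cluster.subnets(new_prefix=24))

    def fabric_routes(num_hosts):
        """One aggregate route per host, no matter how many pods each runs."""
        return {f"host-{i}": str(host_prefixes[i]) for i in range(num_hosts)}

    def launch_pod(host_index, pod_index):
        """Allocating a pod address changes nothing outside the host:
        the aggregate route already covers it."""
        return list(host_prefixes[host_index].hosts())[pod_index]

    print(fabric_routes(3))   # 3 routes total, not one per pod
    print(launch_pod(1, 0))   # 10.2.1.1 -- already reachable via host-1's /24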
On the topic of "why do we need a distributed KV store for an overlay network?" from the blog: there's a good blog post about why Kubernetes doesn't use Docker's libnetwork.

http://blog.kubernetes.io/2016/01/why-Kubernetes-doesnt-use-libnetwork.html
We're just about to switch to BGP internally using Calico (mentioned in another comment; I believe performance is good now). We currently run around 300-600 containers on our own implementation built on Consul+Serf. We'll drop a talk on it once we've made the switch, if anyone is interested. We're deliberately avoiding flannel because of the tunnelled networking and the added complexity we don't want to introduce.
I've long wondered whether anyone has successfully gone full IPv6-only with a substantial container/VM roll-out. On paper it should have:

1) Enough addresses. Just enough. For everything. For everyone. Google-scale enough.

2) Good out-of-the-box dynamic assignment of addresses.

And finally, optional integration with IPsec, which I admit might end up over-engineered and under-used -- but wouldn't it be nice if you could just trust the network? (You'd still have to bootstrap trust somehow, probably by running your own x509 CA -- but how nice to be able to flip open any networking book from the 80s, replace the IPv4 addressing with IPv6, and just go ahead and use plain rsh and /etc/hosts.allow as your entire infrastructure for actually-secure intra-cluster networking -- even across data centres and whatnot. [ed: and secure NFSv3! Woo-hoo])

But anyway, has anyone actually done this? Does it work (for a meaningfully large value of "work")?
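On point 2, a small Python sketch of the classic SLAAC scheme for reference: an interface can derive its own address from a router-advertised /64 plus a modified EUI-64 of its MAC, with no DHCP server involved (the prefix and MAC below are made-up examples; modern stacks often prefer randomized/privacy addresses instead).

    import ipaddress

    def eui64_address(prefix, mac):
        """SLAAC-style address: flip the MAC's universal/local bit,
        insert ff:fe in the middle, append the result to the /64 prefix."""
        octets = [int(b, 16) for b in mac.split(":")]
        octets[0] ^= 0x02                         # flip the U/L bit
        iid = int.from_bytes(bytes(octets[:3] + [0xFF, 0xFE] + octets[3:]), "big")
        return ipaddress.ip_network(prefix)[iid]

    # Example prefix and MAC, purely for illustration.
    print(eui64_address("2001:db8:1:2::/64", "52:54:00:12:34:56"))
    # -> 2001:db8:1:2:5054:ff:fe12:3456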
BGP looks really complex. Isn't OSPF (BGP's "little brother") a much more attractive choice here? It's still complex, but it should be considerably simpler.

Another attractive alternative to flannel is Weave [1], run in its simpler non-overlay mode. In that mode it won't start an SDN; it will simply act as a bridge/route maintainer, similar to flannel.

[1] https://www.weave.works/products/weave-net/
Have I misunderstood something here? We don't run BGP on local networks. Via ARP, a node says "who has $IP?" and something answers with a MAC address. The packet for $IP is then wrapped in an Ethernet frame for that MAC address. If the IP isn't local to your network, your router answers with its own MAC, and the packet is framed up for the router.

BGP is the process by which ranges of IPs are claimed by routers. Is Calico really used by Docker containers in this way?
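You can watch exactly that exchange from Python with scapy (assuming scapy is installed and you can send raw frames, i.e. root; the IP below is just the example address from the article):

    from scapy.all import ARP, Ether, srp  # pip install scapy; needs root / CAP_NET_RAW

    # Broadcast "who has 10.0.1.104?" and print whoever answers, with its MAC.
    answered, _ = srp(
        Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst="10.0.1.104"),
        timeout=2, verbose=False,
    )
    for _, reply in answered:
        print(f"{reply.psrc} is at {reply.hwsrc}")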
More on this here: https://medium.com/@sargun/a-critique-of-network-design-ff8543140667#.2fwstossu -- BGP isn't just about containers. It's about signaling: a mechanism for machines to influence the flow of traffic in the network.

This isn't container weirdness; it's because networks got stuck in 2008. We still don't have IPv6 SLAAC. Many of us made the jump to layer 3 Clos fabrics, but stopped after that. My belief is that this is because AWS EC2, Google GCE, Azure Compute, and others consider that the gold standard.

IPv6 natively supports autoconfiguring multiple IPs per NIC / machine, automagically. This is usually on by default as part of the privacy extensions, so in conjunction with SLAAC you can cycle through IPs quickly. It also makes multi-endpoint protocols relevant.

Containers having bad networking because of the lack of an IP per container is a well-known problem; it's even touched on in the Borg paper, briefly:
> One IP address per machine complicates things. In Borg, all tasks on a machine use the single IP address of their host, and thus share the host's port space. This causes a number of difficulties: Borg must schedule ports as a resource; tasks must pre-declare how many ports they need, and be willing to be told which ones to use when they start; the Borglet must enforce port isolation; and the naming and RPC systems must handle ports as well as IP addresses.

> Thanks to the advent of Linux namespaces, VMs, IPv6, and software-defined networking, Kubernetes can take a more user-friendly approach that eliminates these complications: every pod and service gets its own IP address, allowing developers to choose ports rather than requiring their software to adapt to the ones chosen by the infrastructure, and removes the infrastructure complexity of managing ports.

But, I ask, what's wrong with the Docker approach of rewriting ports? Reachability is our primary concern, and unfortunately BGP hasn't become the lingua franca for most networks ("The Cloud"). I actually think ILA (https://tools.ietf.org/html/draft-herbert-nvo3-ila-00#section-4.5) / ILNP (RFC 6741) are the most interesting approaches here.
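To make the quoted Borg problem concrete, a toy Python model of "schedule ports as a resource" under one IP per machine (names and ports invented): every task competes for the host's single port space, which is exactly the conflict that disappears once each pod has its own IP.

    # Toy model of the "one IP address per machine" constraint from the quote.
    class Host:
        def __init__(self, name):
            self.name = name
            self.used_ports = set()

        def schedule(self, task, wanted_ports):
            """Ports are a schedulable resource; tasks may be told to use others."""
            if self.used_ports & set(wanted_ports):
                raise RuntimeError(f"{task}: port conflict on {self.name}")
            self.used_ports.update(wanted_ports)
            return {p: p for p in wanted_ports}

    host = Host("machine-1")
    host.schedule("web-a", [8080])
    try:
        host.schedule("web-b", [8080])   # two tasks can't both own 8080 on one IP
    except RuntimeError as err:
        print(err)                       # with an IP per pod, this conflict vanishes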
Or you could NAT on the host and deploy simpler overlay networking: https://github.com/pjperez/docker-wormhole

You can deploy this on any machine (container or not) and have it always reachable from other members of the same network, which could be, e.g., servers on different providers (AWS, Azure, DigitalOcean, etc.).
Especially since there isn't really a policy-routing component to this, isn't BGP pretty _extremely_ complicated for the problem Calico is trying to solve?

Stipulating that you need a routing protocol here at all (you don't, right? You can do proxy ARP, or some more modern equivalent of it), there's a whole family of routing protocols optimized for this scenario, of which OSPF is the best-known.
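A rough sketch of the proxy-ARP-plus-host-routes approach (answer ARP for the container's address on the host, keep a /32 route pointing at its veth), assuming pyroute2 is installed, run as root, and with example interface names and an example IP:

    from pyroute2 import IPRoute   # pip install pyroute2; needs root

    HOST_IF = "eth0"          # example names -- adjust for your setup
    CONTAINER_IF = "veth0"
    CONTAINER_IP = "10.0.1.104"

    # 1. Have the host answer ARP on its uplink for addresses it has routes for.
    with open(f"/proc/sys/net/ipv4/conf/{HOST_IF}/proxy_arp", "w") as f:
        f.write("1")

    # 2. Point a /32 host route at the container's veth end.
    ipr = IPRoute()
    idx = ipr.link_lookup(ifname=CONTAINER_IF)[0]
    ipr.route("add", dst=f"{CONTAINER_IP}/32", oif=idx)
    ipr.close()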
There's a lot of misinformation in this.

> A Linux container is a process, usually with its own filesystem attached to it so that its dependencies are isolated from your normal operating system. In the Docker universe we sometimes talk like it's a virtual machine, but fundamentally, it's just a process. Like any process, it can listen on a port (like 30000) and do networking.

A container isn't a process. It's an amalgamation of cgroups and namespaces. A container can have many processes. Hell, use systemd-nspawn on a volume that contains a Linux distro and your container is basically the entire userspace of a full system.

> But what do I do if I have another computer on the same network? How does that container know that 10.0.1.104 belongs to a container on my computer?

Well, BGP certainly isn't a hard requirement. Depending on how you've set up your network, if these hosts are in the same subnet and can communicate at layer 2, you don't need any sort of routing.

> To me, this seems pretty nice. It means that you can easily interpret the packets coming in and out of your machine (and, because we love tcpdump, we want to be able to understand our network traffic). I think there are other advantages but I'm not sure what they are.

I'm not sure where the idea came from that Calico/BGP are required to look at network traffic for containers on your machine. If there's network traffic on your machine, you can basically always capture it with tcpdump.

> I find reading this networking stuff pretty difficult; more difficult than usual. For example, Docker also has a networking product they released recently. The webpage says they're doing "overlay networking". I don't know what that is, but it seems like you need etcd or consul or zookeeper. So the networking thing involves a distributed key-value store? Why do I need to have a distributed key-value store to do networking? There is probably a talk about this that I can watch but I don't understand it yet.

I think not at all understanding one of the major players in container networking is a good indication it might not yet be time to personally write a blog post about container networking. Also absent is simple bridging.

Julia generally writes fantastic blogs, and I know she doesn't claim to be an expert on this subject and includes a disclaimer about how this is likely to be more wrong than usual, but I feel like there was a lot of room for additional research to produce a more accurate article. I understand the blog is mostly about what she has recently learned, and often leaves lots of questions unanswered... but this one has a lot of things that are answered, incorrectly :(
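One quick way to see the "namespaces and cgroups, not a process" point for yourself, from Python (the container PID is a placeholder you'd get from `docker inspect` or `ps`):

    import os

    def namespaces(pid="self"):
        """Return the namespace IDs a process lives in, from /proc/<pid>/ns."""
        ns_dir = f"/proc/{pid}/ns"
        return {name: os.readlink(os.path.join(ns_dir, name))
                for name in sorted(os.listdir(ns_dir))}

    print(namespaces())   # e.g. {'mnt': 'mnt:[4026531840]', 'net': 'net:[4026531992]', ...}

    # For a process inside a container, the mnt/net/pid entries point at different
    # namespace IDs -- and any number of processes can share that same set.
    # print(namespaces(12345))   # placeholder container PID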
The internal OpenDNS Docker system, Quadra, relies on BGP for a mix of on-prem and off-prem hosting:

http://www.slideshare.net/bacongobbler/docker-with-bgp
The real problem is that cloud providers don't provide out-of-the-box functionality to assign more than one IP to a network interface. If they did, there wouldn't even be an issue.

I've been requesting this feature from the EC2 team at AWS for some time, to no avail. You can bind multiple interfaces (ENIs) to an instance (up to 6, I think, depending on the instance size), each with a separate IP address, but not multiple IPs to a single interface.

BGP, flannel, VXLAN, etc. are IMO a waste of cycles and add needless complexity to what could otherwise be a very simple architecture.
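For what it's worth, at the Linux level stacking extra addresses on one interface is trivial; the friction described above is on the cloud-provider side. A sketch with pyroute2 (interface name and addresses are examples; needs root):

    from pyroute2 import IPRoute   # pip install pyroute2; needs root

    ipr = IPRoute()
    idx = ipr.link_lookup(ifname="eth0")[0]   # example interface

    # One extra /32 per container, all on the same NIC.
    for ip in ("10.0.1.104", "10.0.1.105", "10.0.1.106"):   # example addresses
        ipr.addr("add", index=idx, address=ip, prefixlen=32)

    ipr.close()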