Don't manually parse those things. If your database provides a way to parse those (like PostgreSQL does), make use of it. Use what your programming language provides, and if it really doesn't provide the functionality to parse IPv6 addresses, use a library to do so.<p>For example, in Rust you can write this:<p><pre><code> use std::error::Error;
use std::net::{AddrParseError, SocketAddr};

fn parse_address(addr: &str) -> Result<SocketAddr, AddrParseError> {
    addr.parse()
        .or_else(|_| Ok(SocketAddr::new(addr.parse()?, 10443)))
}

fn main() -> Result<(), Box<dyn Error>> {
    let addresses = [
        "[2001:0db8:f00f::0553:1211:0088]:10444",
        "2001:0db8:f00f::0553:1211:0088",
    ];
    for address in &addresses {
        println!("{:?}", parse_address(address)?);
    }
    Ok(())
}
</code></pre>
This isn't specific to IPv6, by the way. It also applies to other formats like CSV, although CSV is a special case: I have seen so many broken CSV files that a custom implementation is sometimes the best way to parse them.
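To illustrate the CSV point for well-formed input, here is a minimal sketch using Python's standard `csv` module. The sample row and the `parse_rows` helper are made up for illustration. A naive split on commas tears a quoted field apart, while the library parser keeps it intact.

```python
import csv
import io

# Made-up sample data: the second field contains a comma inside quotes.
SAMPLE = 'id,name,city\n1,"Doe, Jane",Berlin\n'

def parse_rows(text):
    """Parse CSV text with the standard library instead of str.split()."""
    return list(csv.reader(io.StringIO(text)))

rows = parse_rows(SAMPLE)
# The hand-rolled approach: split the data line on every comma.
naive = SAMPLE.splitlines()[1].split(",")

print(rows[1])  # three fields, the quoted comma is preserved
print(naive)    # four fragments, the quoted field is split in two
```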
Well, I mean, yes, those are all mistakes people are going to make, no doubt about that. But somehow the solution is missing?<p>The core mistake here is using string manipulation at all. The only correct way to handle string data is to parse it, and then either operate on the parsed data structure, or re-serialize it into a canonical format and use that for comparisons and the like.<p>And in the best case, you don't try to build your own parser, but use a well-tested one that's already there. For the particular use case of parsing IP addresses, it's probably best to use getaddrinfo() with AI_NUMERICHOST, and for re-serialization getnameinfo() with the corresponding NI_NUMERICHOST flag. If those don't understand the address, you most likely won't be able to connect anyway. And they will handle stuff like link-local addresses correctly, at least as long as you are on the host that actually has the respective interface.<p>For databases, use column types intended for storing IP addresses, so the database will do the parsing and canonicalization for you.<p>And when you actually have to build a parser yourself, read the damn spec for the format instead of going by what you think the format is, because most likely it's not that.<p>And mind you that most of those problems are not really IPv6-specific. There are also many ways to write an IPv4 address that your average parser will understand. Most of those are generally frowned upon, so they don't occur often, but if you want to reliably compare IPv4 addresses, you actually need to do the same as for IPv6.
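A minimal sketch of that parse-then-re-serialize approach, written against Python's `socket` wrappers around getaddrinfo() and getnameinfo(). The helper names `canonical_ip` and `same_address` are made up for illustration, and the exact canonical spelling comes from the platform's C library.

```python
import socket

def canonical_ip(text):
    """Parse a numeric IP literal with getaddrinfo() and re-serialize it
    with getnameinfo(), without a DNS lookup in either direction."""
    # AI_NUMERICHOST makes getaddrinfo() reject anything that is not a literal.
    infos = socket.getaddrinfo(text, None, type=socket.SOCK_STREAM,
                               flags=socket.AI_NUMERICHOST)
    sockaddr = infos[0][4]
    # NI_NUMERICHOST is the getnameinfo() counterpart: no reverse lookup.
    host, _port = socket.getnameinfo(
        sockaddr, socket.NI_NUMERICHOST | socket.NI_NUMERICSERV)
    return host

def same_address(a, b):
    """Compare two spellings by canonical form, not as raw strings."""
    return canonical_ip(a) == canonical_ip(b)

if __name__ == "__main__":
    print(canonical_ip("2001:0db8:f00f::0553:1211:0088"))
    print(same_address("2001:0DB8:0:0:0:0:0:1", "2001:db8::1"))
```

A hostname such as `bigdata.example.org` raises `socket.gaierror` here instead of silently triggering a DNS query, which is usually what you want when the input is supposed to be an address.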
It won't solve most of the problems, but if you're using PostgreSQL, its native inet data type (which supports both IPv4 and IPv6) can save a world of pain compared to a text or binary string, and there are alternatives in some other RDBMSs:<p><a href="https://www.postgresql.org/docs/current/datatype-net-types.html#DATATYPE-INET" rel="nofollow">https://www.postgresql.org/docs/current/datatype-net-types.h...</a>
<a href="https://www.postgresql.org/docs/current/functions-net.html" rel="nofollow">https://www.postgresql.org/docs/current/functions-net.html</a><p>First-class network and IP types that properly support containment and exclusion operators make a lot of things less error-prone and potentially much faster.
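For code that has to do similar checks outside the database, a rough application-side analogue of containment can be sketched with Python's `ipaddress` module. This is not PostgreSQL syntax. The allow-list, the helper names, and the addresses (documentation ranges) are all made up for illustration.

```python
import ipaddress

# Hypothetical allow-list mixing IPv4 and IPv6 networks.
ALLOWED = [ipaddress.ip_network(n) for n in ("192.0.2.0/24", "2001:db8::/32")]

def is_allowed(addr_text):
    """True if the address falls inside any allowed network."""
    addr = ipaddress.ip_address(addr_text)
    # Membership tests between different IP versions are simply False.
    return any(addr in net for net in ALLOWED)

def is_subnet(inner_text, outer_text):
    """True if the first network is contained in the second."""
    inner = ipaddress.ip_network(inner_text)
    outer = ipaddress.ip_network(outer_text)
    # subnet_of() raises TypeError across versions, so check the version first.
    return inner.version == outer.version and inner.subnet_of(outer)

if __name__ == "__main__":
    print(is_allowed("2001:0db8:f00f::0553:1211:0088"))
    print(is_subnet("2001:db8:f00f::/48", "2001:db8::/32"))
```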
Don't try to parse things without a standard. Even CSV and e-mail addresses are more complicated than they seem.<p>Also, it's pretty silly that we still use these unintuitive conventions from 40 years ago for modern systems. Is 192.168.2.8:10443 an address? A phone number and extension? Is it TCP, UDP? IPv4, IPv6? An HTTPS service, or just something resembling its decimal notation assigned service number? Are there multiple services proxied behind this one address? Can I route between them? When I request a URI, does the application know what I really want/expect? What about a timeout for my request? What about authentication/authorization? Consistency requirements? Idempotence? Security guarantees?<p>Operating systems don't even take <host>:<port> arguments for network syscalls, that's just a convention we sort of came up with and later stuck to. But as a URL it's pretty crap. I suggest we replace them with modern URLs that can embed tiered information such as session IDs, service types, routes, security requirements, operational parameters, etc. Most people may only need <a href="https://google.com/" rel="nofollow">https://google.com/</a>, but sometimes we may also want to request webv2+uquic+v6:/SC,TLSv1.3/userid[s:84742049]@google.com(r:NA)/ . I know that's ugly as sin, but hopefully people wouldn't need to specify all of that all of the time (service name/version, transport, address, strong consistency, TLS 1.3, userid, session id, host/namespace, North American region).
It’s certainly been said, but IPv6 to my eyes is awash in second-system syndrome, which has largely slowed its adoption.<p>The complexity of handling addresses is plainly a failure of the design. Using colons for separators when they’re already being used for ports served no purpose but to confuse. Having more than a single valid form of an address again only serves to confuse.<p>If there’s anything to learn from the UNIX principles, it’s that there is great power in making things easily manipulated as strings. The design of IPv6 makes this impossible.
Not just ipv6.<p><pre><code> $ ping 127.1
PING 127.1 (127.0.0.1): 56 data bytes
...
$ ping $(((127 << 24) + 1))
PING 2130706433 (127.0.0.1): 56 data bytes
...
</code></pre>
I've primarily seen the second form in the early days of the web when spammers were presumably trying to bypass mail or web filters that were scanning for blacklisted IPs.
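The same two spellings can be reproduced from Python as a small sketch. `lenient_ipv4` and `strict_ipv4` are made-up helper names. The lenient one goes through the C library's inet_aton(), which accepts the legacy short forms, while the strict one uses the `ipaddress` module, which only takes four dotted-decimal parts.

```python
import ipaddress
import socket

def lenient_ipv4(text):
    """Parse with the C library's inet_aton(), which accepts legacy short
    forms, and return the canonical dotted quad."""
    return socket.inet_ntoa(socket.inet_aton(text))

def strict_ipv4(text):
    """Accept only the four-part dotted-decimal form."""
    return str(ipaddress.IPv4Address(text))

if __name__ == "__main__":
    print(lenient_ipv4("127.1"))
    print(lenient_ipv4(str((127 << 24) + 1)))
    # The integer form can also be handed over as a real integer.
    print(ipaddress.IPv4Address((127 << 24) + 1))
```

A filter that compares raw strings will miss both forms. One that normalizes first, or rejects anything `strict_ipv4` refuses, will not.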
I'm really stumped as to why they didn't eliminate ports when designing IPv6. Having the last bits in the address take the role of ports makes sense given the hierarchical nature of IPv6 addresses. The first bits specify the network (my home), next bits the device, and last bits the service/endpoint on that device.<p>It would also make parsing much easier given that they chose : as separator in the addresses.
Another gotcha is that IPv4 and IPv6 have separate firewall tables. So if, for example, you have blocked <i>all</i> traffic except port 80, you <i>also</i> need to do it for IPv6!
All of this flailing around would have been avoided by either thinking carefully about canonical representations, or reading <a href="https://tools.ietf.org/html/rfc5952" rel="nofollow">https://tools.ietf.org/html/rfc5952</a>
FWIW, RFC 5952, specifically section 6, covers the recommended string representations when a port is attached. Also, it is generally much nicer to use 'sockaddr_storage' as the underlying representation of AF_{UNIX,INET,INET6} endpoints rather than building abstractions to paper over the differences between the 'sockaddr_*' types.
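As a minimal sketch of the bracket style that section recommends by default, here is one way to render an address plus port in Python. `format_endpoint` is a made-up helper name, and the canonical text form is whatever the `ipaddress` module produces (lowercase, zero-compressed).

```python
import ipaddress

def format_endpoint(addr_text, port):
    """Render address and port as one string, bracketing IPv6 literals."""
    # Parsing first means the output is canonical regardless of input spelling.
    addr = ipaddress.ip_address(addr_text)
    if addr.version == 6:
        return f"[{addr}]:{port}"
    return f"{addr}:{port}"

if __name__ == "__main__":
    print(format_endpoint("2001:0DB8:0:0:0:0:0:1", 10443))
    print(format_endpoint("192.0.2.8", 10443))
```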
Maybe this is a stupid question at this point but if using colons introduces this degree of ambiguity why not stick with simple periods? Or any other separator for that matter.
Even putting aside the ambiguity of using ':' as an address:port delimiter, there are differences between platforms in parsing <i>just IPv6 addresses</i><p>e.g. on Linux/glibc 2.28<p><pre><code> [cling]$ #include <arpa/inet.h>
[cling]$ ::in6_addr dst;
[cling]$ dst
(::in6_addr &) @0x7ff318aff010
[cling]$ inet_pton (AF_INET6, "1234:1234:1234:1234:1234::1234:8.8.8.8", &dst)
(int) 0
[cling]$ inet_pton (AF_INET6, "1234:1234:1234:1234:1234:1234:8.8.8.8", &dst)
(int) 1
</code></pre>
Here '::' in this 'full' address is (correctly) rejected.<p>However, although I haven't verified this on FreeBSD (perhaps someone can?), there's a comment in the libc source suggesting that this will be accepted there<p><a href="https://github.com/freebsd/freebsd/blob/1d6e4247415d264485ee94b59fdbc12e0c566fd0/lib/libc/inet/inet_pton.c#L127" rel="nofollow">https://github.com/freebsd/freebsd/blob/1d6e4247415d264485ee...</a><p>Parsing sucks.
Obviously the point of the article is to not manually parse IP addresses.<p>Well, actually, Python can do that just as well as all the other languages.<p><pre><code> >>> import ipaddress
>>> ipaddress.ip_address('2001:db8::')
IPv6Address('2001:db8::')</code></pre>
The fundamental problem in the example:<p><pre><code> leader_host = bigdata.example.org:10443
</code></pre>
is that ":10443" is not part of the <i>host</i>name. The field is called "leader_host"; if a port is needed, it should get its own field instead of overloading the host field.<p><pre><code> leader_host = bigdata.example.org
leader_port = 10443
</code></pre>
(and as others have already mentioned, don't write your own parser when one already exists in your stdlib/etc)
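If an existing config is stuck with a combined value, a minimal sketch of splitting it with the standard library instead of a hand-rolled split on ':' could look like this. It assumes IPv6 literals are written in brackets. The default port is borrowed from the example above, and `split_endpoint` is a made-up helper name.

```python
from urllib.parse import urlsplit

DEFAULT_PORT = 10443  # assumed default, borrowed from the example above

def split_endpoint(value, default_port=DEFAULT_PORT):
    """Split 'host', 'host:port' or '[v6]:port' into (host, port)."""
    # Prefixing '//' makes urlsplit() treat the value as a network location.
    parts = urlsplit("//" + value)
    if parts.hostname is None:
        raise ValueError(f"no host in {value!r}")
    port = parts.port if parts.port is not None else default_port
    return parts.hostname, port

if __name__ == "__main__":
    for value in ("bigdata.example.org:10443",
                  "bigdata.example.org",
                  "[2001:db8:f00f::553:1211:88]:10444"):
        print(split_endpoint(value))
```

Separate host and port fields are still the cleaner fix. This only covers the case where the format cannot be changed.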