I just finished updating the code with an assembly version of a hot function (finding the position of the leading "1" bit) and updated the blog to reflect the improvements. In general I saw a 3-40% performance increase using the assembly version, which I borrowed from math/big/arith_*.s. The range is wide because the gain depends on the size of the delta integers being processed.
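For readers unfamiliar with the operation, here is a minimal sketch of what "finding the leading 1 bit position" means. It is not the code from this project: the function names are made up for illustration, and it uses `bits.Len64` from the standard `math/bits` package, which was added to Go after this post was written and which the compiler can typically lower to a single hardware instruction, much like the hand-written assembly.

```go
package main

import (
	"fmt"
	"math/bits"
)

// leadingBitPosLoop (hypothetical name) returns the 1-based position of the
// highest set bit of x, or 0 when x == 0, using a portable shift loop.
func leadingBitPosLoop(x uint64) int {
	n := 0
	for x != 0 {
		x >>= 1
		n++
	}
	return n
}

// leadingBitPos (hypothetical name) computes the same value with bits.Len64,
// which returns the minimum number of bits needed to represent x.
func leadingBitPos(x uint64) int {
	return bits.Len64(x)
}

func main() {
	// Print both results side by side for a few sample values.
	for _, v := range []uint64{0, 1, 5, 255, 256, 1 << 63} {
		fmt.Println(v, leadingBitPosLoop(v), leadingBitPos(v))
	}
}
```

The loop version does work proportional to the bit length of the value, which is one plausible reason the gain from a faster version would vary with the size of the integers being processed.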
2013-11-17 Update #2: Tried another new trick. This time I updated the leading bit position function to use libgcc's __clzdi2 routine when building with the gccgo compiler. This has the same effect as update #1, but for gccgo builds. The performance increase ranged from 0% to 20%, for encoding only. Thanks to dgryski on reddit and minux on the golang-nuts mailing list.
Did you use any parallelism/concurrency? There ought to be a way to speed up the Go version with parallelism. That would be fair, since it is one of Go's strengths.