Friday 17 January 2020

Happy Eyeballs. Unhappy user.

As part of the migration to IPv6, many programs implement a scheme known as "happy eyeballs". On a dual-stack system, IPv6 is preferred, but sometimes IPv6 "is flaky". The basic premise is that if the IPv6 attempt has not succeeded after a short while, you should give up and take the IPv4 option - resulting in some "happy eyeballs".

Well, the thing about this fallback is that it is S L O W... and users (not unreasonably) get grumpy about it. Here's a case where something went wrong, and *everything* was subjected to this delay.



Some background first. The network here has run, in one form or another, IPv6 for over a decade. We don't generally find IPv6 to be a problem. These days, most wired LAN segments are dual stacked, and It Just Works(TM).

Today, I received a call from a perplexed user who said "The internet is slow. Does it have anything to do with the building work?" (there are extensive re-painting and re-roofing efforts underway). Builders tend to kill networks dead rather than slow them down (cue wailing and gnashing of teeth - their scaffolding has been erected smack in front of one of our few point-to-point wireless bridges, and metal beats 60GHz wireless signals, every time).

I was of course aware of the massive international cable outages, but fortunately our ISP has routed around them, because they had enough spare capacity to sort it out. Nobody else was complaining about "slowness", and several of the sites and services I use regularly were troublesome for this user, but not for me in my office.

I checked monitoring and found no outages around that building (I had received no SMS alerts, but I double-check such things). I then logged into the switch, found which port the user sat behind, and it all looked good (settings all correct, duplex and speed as expected) - having systems that map user to MAC address is pretty useful. Ping to their PC was steady, with no packet loss.

With all the obvious sources of problems checked, I paid a visit to the PC in question. This is unusual (PC problems are usually left to general IT support techs here, not the networks team or sysadmins - silos!), but the problem certainly "felt" odd, so I thought I should check it out.

First, I simplified the network, removing the passthrough VOIP handset from the equation.

We then tried to reproduce the error, which was easy. No joy: the problem was still there.

Carefully watching for patterns, I noticed some common elements. Firstly, the resolution of the DNS name seemed to take an unusually long time in the browser. Then, the first page would take a long time to load. After that, the "loaded" site was perfectly usable (on-site links then loaded fast), but general browsing across the internet was painfully slow. Once a speedtest site finally loaded, the results were as expected. It was not specific to a particular browser (the user had already checked that before phoning us. Good job!).

Next, given the seemingly slow DNS resolution, I checked to see nobody had changed DNS to something weird. Nope, normal. nslookup was pretty normal in speed and results were as expected.
It was "just" that websites and internet-based services were being painful.

Perhaps the TCP/IP stack was a bit off. Ran the usual reset commands and rebooted. No improvement.

I then thought "hmm, there's a pause, then it works again, and keeps working for a site. I wonder if it's a happy eyeballs thing and IPv6 has gone weird on this machine?" Turned off IPv6.
No improvement.

Clutching at straws, I did another ipconfig /all and was amazed to see an IPv6 address under a "Microsoft 6to4 Adapter" interface I had never seen before.

When I had disabled IPv6.

It turns out that somehow (I have no idea how), the hidden Microsoft 6to4 adapter had been installed on this machine. Uninstalled that, and things got back to normal. Turned IPv6 back on again, and it carried on working just fine. For whatever reason, it looks like disabling IPv6 on the interface does not disable the 6to4 virtual network adapter.
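For anyone wanting to spot this sort of address without squinting at ipconfig output, here is a minimal sketch using Python's standard ipaddress module. The function name transition_tunnel_kind is my own invention for illustration, and it only recognises 6to4 (2002::/16) and Teredo (2001::/32) addresses, because those are the two the module exposes directly. It says nothing about ISATAP.

```python
import ipaddress


def transition_tunnel_kind(addr):
    """Return "6to4", "Teredo", or None for an address string.

    Hypothetical helper: pass it addresses copied from ipconfig /all.
    """
    # Drop a zone index such as "%12" that Windows appends to some addresses.
    ip = ipaddress.ip_address(addr.split("%")[0])
    if ip.version != 6:
        return None
    if ip.sixtofour is not None:   # set only for addresses in 2002::/16
        return "6to4"
    if ip.teredo is not None:      # set only for addresses in 2001::/32
        return "Teredo"
    return None


# Documentation-range examples, not addresses from the machine in this story.
for candidate in ["2002:c000:204::1", "2001:db8::1", "192.0.2.4"]:
    print(candidate, "->", transition_tunnel_kind(candidate))
```

Anything this flags as "6to4" on a network that only does native IPv6 is worth a second look.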

The problem came down to that rogue 6to4 adapter, and it took me rather longer to figure out than I was happy about.

Sadly, I didn't take a gander at the routing table (this was a user PC problem to be fixed, not an interesting specimen to study!) - I rather wonder why the PC was preferring that to the native IPv6 addressing.

All of the "transitional" IPv4/IPv6 stopgap migration tunnel protocols are blocked at our border firewall - we do native IPv6, so why should we allow tunnels etc? - ISATAP, Teredo and 6to4 are all explicitly blocked (we block very little outgoing stuff - Universities are MUCH more open than most businesses and K-12 schools).

So, of course, when the user's device tried to get to a site over IPv6, the requests travelling along the 6to4 tunnel were dropped at the border firewall. Once the happy eyeballs timer expired, it fell back to IPv4 and things "just worked" - for that site. Because so many sites and services are now available over IPv6, the problem looked like a widespread issue - but it all came down to that mystery rogue logical adapter. Thanks, Windows.
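To make the "pause, then it works" pattern concrete, here is a toy model of that fallback. It is a deliberately simplified sequential version, not the algorithm real happy eyeballs implementations use, and the 300 ms timer and the connect times are made-up numbers for illustration only. The function name simulate_fallback is hypothetical.

```python
def simulate_fallback(attempts, fallback_timer_ms):
    """Model a sequential IPv6-then-IPv4 connection attempt.

    attempts: ordered list of (label, connect_ms), where connect_ms is None
    if packets on that path are silently dropped.
    Returns (label_that_connected_or_None, total_ms_the_user_waited).
    """
    waited_ms = 0
    for label, connect_ms in attempts:
        if connect_ms is not None and connect_ms <= fallback_timer_ms:
            return label, waited_ms + connect_ms
        # No answer before the timer fired: give up on this path, try the next.
        waited_ms += fallback_timer_ms
    return None, waited_ms


# Hypothetical numbers: 6to4 traffic is dropped at the border, IPv4 answers in 20 ms.
broken = [("IPv6 via 6to4", None), ("IPv4", 20)]
healthy = [("native IPv6", 15), ("IPv4", 20)]
print(simulate_fallback(broken, fallback_timer_ms=300))   # ('IPv4', 320)
print(simulate_fallback(healthy, fallback_timer_ms=300))  # ('native IPv6', 15)
```

In this model, every new destination on the broken machine pays the full timer before IPv4 gets a look in, which would fit with each new site being slow to start while the firewall quietly ate the tunnelled packets.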

There are known bugs (though not, AFAIK, in Win10, which this was) that cause 6to4 adapters to be spawned. This machine had just the one, however, on a wired interface.

The lesson for next time, which might have led me to the problem sooner? Check firewalls for blocked packets when things go slowly, not only when they don't work at all.

Of course, now I've *seen* this class of error in the wild, I'll recognise and fix it that much faster next time.

2 comments:

  1. I've since heard that this recurred for this user, and that support are seeing it on other machines.

    Sadly, not working there brings an end to my ability to diagnose and find a longer term fix than "remove the 6to4 adapter(s) and reboot".
