Connection reset...

After switching energy supplier, I discovered I couldn't connect to their website.

This is an explanation with notes of how I've attempted to diagnose the issue so far. [1]

Is it really broken?

The error is similar in all browsers: the connection was reset before a secure connection could be established. It happens on both the bare domain & www., and luckily there's only a single A record and no IPv6, which makes troubleshooting simpler.

The first obvious thing was to check other ways of accessing the site. I first noticed it in Firefox, confirmed it in Chrome, then on Android, all from the same wifi. After disconnecting the Android from wifi and checking over the phone network, I'm not surprised when the site suddenly loads. The problem is confined to my local network & not likely connected to my client devices or software. (I checked software using both mbedTLS & openssl.)

For my router/access point, I run OpenWRT and so cutting out my lan/wifi from the problem is easy:

root@OpenWrt:~# curl https://britishgas.co.uk
curl: (35) ssl_handshake returned - mbedTLS: (-0x0050) NET - Connection was reset by peer

A confounding factor in my mind is that around the same time I first tried to connect to the site, I had just gotten around to replacing my aging access point. Was this the trigger? At this point, I'm tempted to start swapping out hardware to see if the issue will go away. That's a bit of effort though and I've a funny feeling I'd seen connection failures before. Best to be lazy.

A near identical issue a decade ago turned out to be caused by MTU problems. The symptoms then were several (but not all) ssl connections failing with similar connection reset errors, including gmail. This time, I wasn't able to find another site to fail in the same way. Still, it's a hunch and I decided to pursue it...

MTU?

I might as well double check what it is by sending packets to something close within my ISP. If the problem is past there, I won't have much luck getting it fixed easily. I've used the -4 flag to ensure ping doesn't resolve an IPv6 address - the maths work out differently to IPv4.

$ ping -c 1 -s 1464 -M do -4 bottomless.aa.net.uk
PING bottomless.aa.net.uk (81.187.81.187) 1464(1492) bytes of data.
1472 bytes from bottomless.aa.net.uk (81.187.81.187): icmp_seq=1 ttl=63 time=22.8 ms

--- bottomless.aa.net.uk ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 22.834/22.834/22.834/0.000 ms
$ ping -c 1 -s 1465 -M do -4 bottomless.aa.net.uk
PING bottomless.aa.net.uk (81.187.81.187) 1465(1493) bytes of data.
ping: local error: message too long, mtu=1492

--- bottomless.aa.net.uk ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

1492's pretty standard for PPPoE, so that doesn't raise any flags. (The maximum payload is 1464, which together with 20 bytes for the IP header & 8 for the ICMP header adds up to 1492.)
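The two pings above find the boundary by hand. As a rough sketch, the same arithmetic and search can be scripted. The function names here are made up for illustration, the `-W 2` timeout is my own choice, and the probe assumes the iputils version of ping used above.

```shell
# Overheads for an ICMP echo over IPv4 (no IP options).
IP_HEADER=20
ICMP_HEADER=8

# Largest `ping -s` payload that fits in a given MTU.
payload_for_mtu() {
    echo $(( $1 - IP_HEADER - ICMP_HEADER ))
}

# Default probe: one unfragmentable IPv4 ping with the given payload size.
# HOST is an assumption; pick something close, inside your own ISP.
probe() {
    ping -c 1 -W 2 -s "$1" -M do -4 "${HOST:-bottomless.aa.net.uk}" >/dev/null 2>&1
}

# Bisect between $1 and $2 for the largest payload that `probe` accepts.
# Assumes the lower bound $1 itself succeeds.
max_payload() {
    lo=$1 hi=$2
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if probe "$mid"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$lo"
}

# Example (hits the network, so left commented out):
#   max_payload 1000 1500
```

On a link like mine, `payload_for_mtu 1492` gives 1464, and I'd expect `max_payload 1000 1500` to land on the same number. Adding 28 back gives the MTU.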

It's possible someone's blocking ICMP between us, breaking Path MTU Discovery, but that feels unlikely given these nice, big & common numbers.

Still, might as well reduce my MTU and try connections again just in case...

$ sudo ip link set dev wlp2s0 mtu 1400 # 1350, 1300, 1200...
$

... didn't help, which mostly rules out MTU being the problem, maybe. I also haven't reproduced the issue with any other site which leads me away from it being MTU related.
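Stepping through the smaller MTUs by hand gets tedious, so here is a minimal sketch of how the sweep could be scripted. It assumes it is run as root, that the interface is the wlp2s0 from above, and that curl is available. `try_mtus` is a made-up helper name, not anything standard.

```shell
# Hypothetical helper: set each MTU in turn on an interface and try a URL.
# Usage: try_mtus IFACE URL MTU...
# Needs root for `ip link set`; does not restore the original MTU.
try_mtus() {
    iface=$1
    url=$2
    shift 2
    for mtu in "$@"; do
        ip link set dev "$iface" mtu "$mtu" || return 1
        # Discard the body; only the exit status matters here.
        if curl -sS -o /dev/null --max-time 10 "$url" 2>/dev/null; then
            echo "mtu=$mtu ok"
        else
            echo "mtu=$mtu failed"
        fi
    done
}

# Example (changes interface settings, so left commented out):
#   try_mtus wlp2s0 https://britishgas.co.uk 1400 1350 1300 1200
```

Remember to set the MTU back to its original value afterwards.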

By this point, I've repeated everything on the access point too, just in case.

I'm also able to get logs from my ISP to confirm their idea of my MTU; everything matches up and is as expected. They have several options to fix common problems but none seem to help. I restore to defaults.

Time to break out...

Wireshark!

It's really easy to get lost in wireshark, and you have to be careful about the conclusions you draw from it. I find it quite easy to disappear down a rabbit hole by digging into what you expected to see but didn't, or what you see but are unfamiliar with.

First run on the laptop...

... might as well confirm with a tcpdump from the router. This is easiest without credential prompts, so use an agent:

$ ssh-agent bash
$ ssh-add ~/.ssh/id_rsa
$ ssh root@192.168.1.1 tcpdump -i pppoe-wan -U -s0 -w - 'not port 22' | wireshark -k -i -

It's clear from these two captures that the TLS "Server Hello" is missing in response to the "Client Hello". [2] It's tempting at this point to think maybe it is being sent but dropped before reaching my router... (MTU...? [3])

My ISP is awesome and allows you to request traffic dumps if you promise to follow the rules. Getting a dump from them confirms my router and openwrt definitely aren't hiding anything from me, but as I'm not an expert on the protocols involved, I can't rule out my equipment sending bad Client Hello packets.

Anyway, I'm the only user of this internet connection for the next 10 seconds so I'm safe legally speaking...

All of these dumps are showing the same thing.

  1. TCP's 3-way handshake
  2. A TLS "Client Hello" from me to the server.
  3. A TCP ACK of that packet quickly in response.
  4. A TCP RST - ending the connection ~4 seconds after the initial connection.

Does any of this help confirm or rule out MTU issues? Maybe. Seeing a RST packet within 4 seconds doesn't feel right for MTU - I'd expect a longer timeout upstream. That's just a gut feeling though: it's been a while since I dealt with a real MTU issue, and I'm too lazy to set up an experiment.

Support from #A&A bolsters the idea that it isn't MTU, as they point out the MSS is negotiated in the SYN ACK at a reasonable number, 1452. [4]
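For the record, here's the sum behind that number as a tiny sketch. It assumes plain IPv4 and TCP headers with no options, and `mss_for_mtu` is just a name I've made up.

```shell
# MSS a host would typically advertise for a given MTU over IPv4:
# the MTU minus 20 bytes of IP header and 20 bytes of TCP header.
mss_for_mtu() {
    echo $(( $1 - 20 - 20 ))
}

mss_for_mtu 1492   # PPPoE: prints 1452
mss_for_mtu 1500   # plain Ethernet: prints 1460
```

So 1452 is exactly what you'd expect to see for a 1492 byte PPPoE link.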

What next?

At this point, we've got a few options for diagnosing:

  1. Go back to my ISP supplied modem & default configuration to confirm the issue is reproducible.
    • If the issue isn't resolved, that doesn't completely rule out equipment problems with the modem.
    • I'm lazy and don't want to lose my current connectivity & networking setup.
  2. Go back to the old AP to rule out the new AP hardware & software.
    • Only teaches us something if it resolves the problem - similar builds of openwrt on both APs.
    • I'm also lazy and would have to swap some cables.
  3. Plug my laptop into the modem & run PPPoE there to achieve the same result.
    • Settles the question of my equipment vs upstream - the hardware and software stacks are completely different, so they're unlikely to both have the same bug.
    • A right faff.
  4. Use the same hardware & software configuration on an identical, but different internet service.
    • Lazy-friendly, not requiring swapping hardware.
    • I've got a big new hunch at this point...

I went with 4. My ISP is pretty amazing in their efforts to help diagnose faults. Practically instantly I had a second IPv4 address (I thought they'd run out... again?) assigned to my line which I could request to use via PPPoE... it turns out to be assigned by default [5] anyway so no configuration changes were required.

huh...

Huh

britishgas.co.uk is treating my original ip strangely. There are a few possible reasons I can think of...

I don't want to keep the new ip because

  1. I want a static ip. That's why I use an ISP who provide them. Having to change your static ip negates the reasons I find one useful in the first place.
  2. Why should someone else have to deal with my old 'bad' IP?
  3. How many other people are affected by this - surely I'm not the only victim?

I guess I'll have to wait 48 hours...?


Footnotes:

[1] With lots of help & advice from #A&A - thank you!

[2] Ignore the above warning about making decisions on what you don't see. We get a RST packet; we really did expect a Server Hello instead.

[3] Don't ignore the above warning this time. Because I said so? Oh and watch out for that rabbit hole!

[4] MSS <= MTU - 40 for IPv4 (20 bytes of IP header plus 20 of TCP header).

[5] Also not sure how to choose the PPPoE IPv4 address on openwrt, but I'll save that problem for another day.