Some time ago we ran into a problem while trying to connect AWS with the on-premises infrastructure of one of our clients.
We deployed a Windows EC2 instance in a VPC, to act as a VPN endpoint for road warriors. Once a VPN connection was established, we couldn’t get our QA tests to pass. Traffic wouldn’t flow from either side of the VPN to the other.
Machines on each side of the network were able to ping the VPN machines’ two IP addresses (both the local and the remote side), so the tunnel seemed to be correctly established. The VPN machine was able to contact hosts on either side of the network, but it looked like the machine was not routing.
Tracing traffic on the VPN server, we could see packets being forwarded to AWS hosts, but the IP packets returning to the origin host were dropped, so ping would fail and TCP would retransmit. Our first theory was that the Windows firewall might be dropping packets.
Nope! Even with all firewalling disabled, we got the same frustrating result.
We then went through every EC2 networking option that might be involved: static routes, NAT settings, Security Groups, Source/Destination IP checks, VPC Routing Tables and VPC NACLs.
All OK, and yet the VPN was still unusable.
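One of the checks above, the Source/Destination check, is easy to script. The following is only a minimal sketch, assuming a boto3 EC2 client is passed in; the helper name and the instance ID in the usage note are hypothetical. An instance that forwards traffic for other hosts, as a VPN endpoint does, needs this check disabled.

```python
def source_dest_check_enabled(ec2_client, instance_id: str) -> bool:
    """Return True if the Source/Destination check is still enabled.

    `ec2_client` is assumed to be a boto3 EC2 client (or any object that
    exposes the same describe_instance_attribute call).
    """
    response = ec2_client.describe_instance_attribute(
        InstanceId=instance_id,
        Attribute="sourceDestCheck",
    )
    # The attribute comes back as {"SourceDestCheck": {"Value": <bool>}}.
    return response["SourceDestCheck"]["Value"]
```

With boto3 installed and credentials configured, it could be called as source_dest_check_enabled(boto3.client("ec2"), "i-0123456789abcdef0"), where the instance ID is a placeholder.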
After some more tests and some retries, we had to reconsider the approach. Time was pressing for the project and the VPN was a prerequisite for other elements. We decided to deploy an OpenVPN solution that turned out to work, but left open an investigation to see what had gone wrong with our Windows VPN deployments.
Finally, with the project on track again we decided to investigate, reproduce the bug and hunt down possible causes.
We had deployed Windows VPNs before, both on EC2 and on-premises, without any type of failure. So we tried a new deployment in a test account that had nothing to do with the previous one, and the VPN wouldn’t work there either. Back to tracing and debugging network traffic.
Windows has an excellent traffic capturing tool: Microsoft Network Monitor 3.4. In its captures we found TCP checksum errors. After a while, we came across this official AWS documentation, although it does not refer to our specific setup:
Excerpt from this doc:
If you launched the Windows instance from a current Amazon AMI, you might not be able to route traffic from other instances without updating your adapter settings. From your Windows server or Windows instance, do the following: disable the IPv4 Checksum Offload, TCP Checksum Offload (IPv4), and UDP Checksum Offload (IPv4).
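The documentation describes changing the adapter settings by hand. As a sketch of how the same change might be scripted, the snippet below builds a PowerShell call to the Disable-NetAdapterChecksumOffload cmdlet. This is our assumption, not something the AWS document prescribes: it presumes a Windows version that ships the NetAdapter PowerShell module, and the adapter name in the usage note is hypothetical.

```python
import subprocess
from typing import List


def build_disable_offload_command(adapter_name: str) -> List[str]:
    """Build the PowerShell invocation that turns off IPv4 checksum offloads.

    Assumes the NetAdapter module (Disable-NetAdapterChecksumOffload) is
    available; adapter_name is whatever Get-NetAdapter reports.
    """
    ps_command = (
        f'Disable-NetAdapterChecksumOffload -Name "{adapter_name}" '
        "-IpIPv4 -TcpIPv4 -UdpIPv4"
    )
    return ["powershell.exe", "-NoProfile", "-Command", ps_command]


def disable_checksum_offload(adapter_name: str) -> None:
    """Run the command; must be executed on the Windows instance as admin."""
    subprocess.run(build_disable_offload_command(adapter_name), check=True)
```

For example, disable_checksum_offload("Ethernet") would apply the change to an adapter named "Ethernet", if one exists on the instance.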
Bug caught and squashed!
It seems that the latest version of the AWS Windows AMI behaved differently because of some change in settings or network drivers.
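For readers wondering what a checksum error means in practice: IP, TCP and UDP all use the 16-bit ones' complement "Internet checksum" (RFC 1071). With checksum offload enabled, the operating system leaves that field for the network adapter to fill in. The sketch below only illustrates the arithmetic; a real TCP checksum also covers a pseudo-header with the source and destination addresses, which is left out here, and the function names are ours.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """Compute the RFC 1071 Internet checksum over `data`."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries back into the low 16 bits (ones' complement sum).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def checksum_is_valid(data_with_checksum: bytes) -> bool:
    """A block that already contains a correct checksum sums to zero."""
    return internet_checksum(data_with_checksum) == 0
```

A receiver that runs this verification and gets a non-zero result discards the packet, which matches the dropped packets and retransmissions we were seeing.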
Now the VPNs work without a glitch!
This makes me think that cloud technologies are sometimes far from trivial: you have to deal with operating systems, virtualization, networking and cloud provider specifics (bugs, limitations and features). CAPSiDE’s team is constantly testing and benchmarking different clouds and technologies to be able to deliver customers’ projects under a wide variety of circumstances. We care about your project and its success, and we have a wide range of tools and experience to solve the problems that may come up along the way.