RAGS RES

A home router

Update December 2021: I've been using the setup described here, with various adjustments, for over a year. It's been working extremely well for me, and has been very robust on the whole. Some more details can be found at the end of the article.

This article is brought to you with the sounds of Apache drums and song, as I sit on a hillside in Sedona, Arizona, USA. I'm on a much-needed vacation, which is affording me a healthy mix of road trip, family time, adventure, and a little productivity. Sometimes one should disconnect, but in this case, what I need is a sense of forward progress that's been sorely lacking lately. 2020 has been such a year - I hope you're doing OK, dear reader. There's a funny irony that I'm writing this article about connectivity from a place with explicitly so little of it, but as the resources required to perform this task were hard to come by, I'm writing up some details for you, friend.

Vendor drivers are the problem

I know, I know, that poor horse is now flat as a pancake. The story here for me begins with Ubiquiti. I've been a fan of Ubiquiti for years, as we started to use it at ECL for wireless carrier backhauls. They're not the most polished or complete of devices, but they almost all come with an SSH front-door, and through this door I can fix many external problems, no matter what the odd need is. In Bermuda that was all about being able to execute odd reconfigurations (with an appropriate carrier license) during network topology changes, be it due to practicalities of being in a humid & hilly middle of nowhere, or improving our capacity on-demand during hurricane relief operations.

So I bought into Ubiquiti at home. They have a line of middle tier routers that are a little on the expensive side for "just a router", but advertise some really compelling numbers under offload configurations. They run EdgeOS, a Vyatta derivative, with a reasonable HTTP frontend, and once again, have a usable SSH escape hatch through which you can add extra goodies, such as kernel modules for Wireguard or the like. All good, right?

You remember I said compelling performance numbers? You probably see where this is going. In my day job I work on an operating system, and in our team we use near head versions of Clang (among many other things). A large part of our build toolchain is delivered as prebuilt binaries, which generally is a faster way to get them than to compile tons of C++ locally. These prebuilts can be pretty weighty, the Clang prebuilts we move around are over 1GB a piece, and as I say, they're one of many. Silicon Valley, despite being what it is, has fairly poor internet access in most of the region. The Comcast monopoly is in full force throughout most of the region, in fact AT&T don't even know that the address of my house exists. As a side effect of this, there's no chance of fiber and we're all stuck with poor asymmetric networks. I'm relatively lucky in fact, as despite being somewhat into the countryside, I have relatively new lines to Comcast, and I generally get something approximating the advertised service.

One day I'm doing a `jiri update` and things start getting slow. I mean really slow. I'm talking <100kbps slow. Ain't nobody got time for that. This prompted me to start digging in, filing bugs, asking for help, and so on. I found opportunities for optimizations throughout our prebuilt delivery stack, but none of these had any relationship to the problem I was having. After exchanging some packet traces with some particularly helpful colleagues, I came to understand that I was seeing a relatively frequent problem with packet reordering.

Now, if you're the kind of person who hangs out on the Ubiquiti forums telling everyone how it is, you might be screaming right about now "packet ordering is irrelevant to TCP, that's not how the internet works", but dear reader, we all experience otherwise. In my use case, I'm talking to Google frontend servers; really it's just Google Cloud Storage that's in use as part of Chrome Infrastructure Package Distribution, which my team uses for moving these prebuilts around. Now Google has been working for some time, along with other industry players, on improving TCP congestion control. The current deployments use BBR, and for the most part this is a good improvement to user experience for many use cases.

The problem I have is that the Ubiquiti router I'm using is based on a Cavium OCTEON SoC, and the hardware offload driver for this thing uses a very very very crude strategy for core selection in packet routing. Essentially it's a combination of a bucket spilling algorithm, a classical failure of understanding what "volatile" means in C, along with a general apparent lack of either knowledge or time to write the driver the way that most upstream ethernet drivers are written. The ultimate side-effect is that if you have more than a moderate amount of traffic flowing through the router (by traffic I mean pps, not bps, btw), you'll start to see queue bucket spilling. Bucket spilling is really good if you want to, say, max out a "this router can do XXMpps" on an advert, but it's really appalling if you want, say, TCP congestion control algorithms to make sensible choices. It turns out this behavior is so poor under load that it really floors out connection speeds.

Marvell actually open sourced this driver package, and even better, upstream started absorbing some of it and fixing it for reals, but then no one was using it, so in around 5.2 or so they were all deleted from exp/. I could fix up the driver, about which the original author admitted to upstream when asked, "I wouldn't use this in a router", but really after paying Ubiquiti for this gear I shouldn't have to. Support has been DOA, and the forums have had some dude telling me that TCP doesn't work that way and I don't know what I'm talking about, so well, good luck to them - they lost my business and my telco's.

I have two routers that are perfectly fine for your casual office internet connection or your average Netflix stream, but you probably don't want to try and make more than a few RTC calls concurrently through them. Oh wait, it's COVID, and California is on fire, so now I have evacuees in my house and we're all on video calls all day long, while I also need to download oodles of code and objects from the internet. Game over. Ubiquiti/Cavium routers are dead to me. It's time for a replacement, and I've been watching reports from the field that our operating system's super simple, super functional L2 broadcast packets are knocking over various other major vendors' routers in the field. Let's survey the landscape. I can probably build my own in 12h or less, and that's what you'll find in this article.

We've come a long way baby...

I drove more than 1000 miles in the last week to get here - but we're off topic again. Actually we've come a long way as an industry in the last 20 years. 20 years ago, I wouldn't have recommended that you use Linux itself and its core ecosystem of software to power a home router. The setup would be too intricate, and way too filled with software that frankly wasn't, and remarkably to this day still isn't, trustworthy. I ran an OpenBSD border gateway for many years, roughly until I got over the 100Mbps range, and then it couldn't really keep up anymore. Apple was able to tide things over for a few years after that, but the ecosystem lock-in is... a thing, having to dig out old pieces of hardware to reconfigure the damn thing. That brings me to our first and most important goal in our requirements gathering:

Low cost of ownership is these days one of the most important details of all systems that I (re)introduce into my life. I'm capable of assembling and/or writing an inordinate volume of systems and software, and every one comes at the price of capacity for other aspects of life or other systems. Networks are quite stable systems now, we have IPv4 still kicking, and IPv6 (if you're lucky) in a functional state. We understand not everything, but a lot more about systems security, and we have emerging, nay established, re-presentations of security models in zero-trust networking "standards". Of particular import in the case of a home gateway is that we get the feature set that we want without a significant ongoing maintenance cost due to software upgrades - the high frequency of which would be for security upgrades. What I'm going to do is to pick a set of software (numeracy and selection) that aims to keep this cost low. More to come on this later, as this is perpetually imperfect, but it's workable now.

Now that it's established that cost of ownership is a primary factor in our decision making, let's establish some more featureful requirements:

There are also a bunch of things that I don't really care about:

Shell scripts are dead. Long live Linux

This is going to be the contentious part of the article. Actually, I suspect that technology selection in general is going to be contentious here among various industry influencers and leaders, but I can explain my choices rationally.

Base Distro

I evaluated a lot of pre-canned router options, including:

There are things to like about many of these distros, such as having web interfaces, or having unified configuration paradigms. They also in theory offer a much lower learning curve - you don't need to pick so much technology, just grab one and go. The problem is that hitting my set of requirements above is actually a little outside of the feature set they offer. In particular, many of these products use older technology packages, some of which are really really out of upstream maintenance (such as this).

I decided to take a look at the cost function with a relatively vanilla Linux distribution. Looking at the distros, what I'm looking for is:

There's a Clear winner here. Clear Linux has been performing really well for me in a number of situations since I tried it out lately. You might hear about it on the blogosphere as a "fast" distro, they focus on a small set of CPUs (it's an Intel product, surprise?), but that's really not a key selling point for me.

Clear is a relatively pure systemd based distribution. It doesn't have any lagging SysV heritage in the paths of usage I take, there's no old iptools network setup hanging around, few if any shell scripts in the boot path, and it's low on early annoyance. The installer is really good because it only asks relatively important questions, then it does its job. On UEFI systems the boot setup works just great, and it's also systemd based.

The update system and general storage layout are the real appeal, the real winner for me. On Clear Linux if you want to "factory reset" the system, you `rm -rf /etc /var` and reboot. That's it. This doesn't break a multitude of software, cause an apt or an rpm to freak out, etc. You might be wondering why I care about resetting a router back to a just installed distro state - well, I don't, but the aforementioned property actually has other advantages. I own /etc. Let me state that again, and please take a minute to let it sink in clearly: I own /etc. All of it.

Let's talk about systemd for a minute. I have a lot of friends with whom I share an average bias of reductionism. In this group of friends, I'm somewhat of an outlier in my acceptance of systemd, and a heretic as a proponent of it. Let me hit you with something here as a very first opening point: my router goes from UEFI to routing packets in two seconds. There's no shell script booted system I've ever seen which can come anywhere close to that, and yes, I know that it's "simpler", by some notions, but having read into systemd, and understanding the roles it is fulfilling and the mechanisms that it is using to provide this performance, I don't particularly care that it's in C instead of shell - and I don't find it to be that complicated. I've debugged my fair share of complicated shell, and that's just as bad if not worse than a dbus issue. Some other key points here though:

The last point there might raise an eyebrow. Clear Linux has one property that I really deem a mistake, that I find to be beyond a mild annoyance, and that is its insistence on including and starting a pac file discovery and parsing agent. I understand that Intel is an old school enterprise with old school enterprise IT problems, and I really hope they can find their way to zero-trust solutions in the future, as if they fail it will affect all of us. Regardless, I can't abide a stray packet on my network enabling mitm of all kinds of my traffic, even if there are other mitigations in the application protocols for most applications. This pac crap has got to go. The good news is, this is as simple as `systemctl stop {service} && systemctl mask {service}` and it will never again get started. Sure, I shouldn't have to do this, but in a traditional init system, overriding this kind of stuff tends to be much much more of a mess, particularly come upgrade time. Yes, I know it's possible to write better shell scripts, but people don't.

Suck it up, we're gunna break the rules...

I'm not happy about this part: dnsmasq. The thing is, you should never use dnsmasq. If that seems flamboyant, well, take a look at this mailinglist thread. The site runs no TLS, and hosts, for example, an almost abandoned webmail client, in a version that's relatively trivially attackable from email sent to users.

I'm quite familiar with dnsmasq, not just the program, but I've also done passes through the code at various points. dnsmasq isn't fundamentally broken like BIND is fundamentally broken, but it is fundamentally broken in that no one can actually write safe C. This isn't a "partisan debate", it's not something you should approach like an American conspiratorial conservative. The data and the science and the history are all there: this class of software is unsafe and is overdue for replacement.

I plan to replace dnsmasq as soon as I have another 40-50 hours to sit down and do so. Roughly speaking my plan is to implement a functional equivalent in Rust, writing a new DHCP server and IPv6 router, along with the maturing trust-dns package that looks like it'll do well to provide everything I need/want on the DNS side. In the meantime, dnsmasq sadly is the software package that provides the facilities I want. The utility of DHCP lease to DNS name binding is quite significant, and the configuration file is exceptionally rational. It's possible that I might be able to move to a pure systemd-networkd solution before I get around to writing something, as it's rapidly improving in this area - but so far it's lacking some of these key features, including static lease configuration.
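
To make the lease to name binding concrete, here is a minimal sketch of what it amounts to. It assumes dnsmasq's usual lease file layout of `expiry mac ip hostname client-id` per line. The function name `leases_to_hosts` and its domain argument are illustrative only; dnsmasq does this internally and does not ship such a tool.

```shell
# Turn IPv4 entries of a dnsmasq lease file into hosts-style lines.
# usage: leases_to_hosts LEASE_FILE DOMAIN
leases_to_hosts() {
  awk -v domain="$2" '
    # Skip leases without a client-supplied name ("*") and non-IPv4 lines.
    NF >= 4 && $4 != "*" && $3 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
      printf "%s %s.%s %s\n", $3, $4, domain, $4
    }
  ' "$1"
}
```

Something like `leases_to_hosts /var/lib/dnsmasq/leases your-domain.home` would list the names a LAN client could resolve, which is the feature a replacement has to match.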

The good news about dnsmasq is that it's one package that covers a bunch of my needs: it covers DHCP, IPv6 RA, LAN and Recursive DNS. The threat / maintenance model reality here is that I'm going to need to watch carefully for dnsmasq security announcements, along with systemd.

Let's get down to business!

Hardware and Distro Install

As I am choosing Clear Linux, I'm going to need to use x64 hardware. This can entirely be replicated on another distribution on another architecture if you need.

For hardware, I received a recommendation for the Seeed Odyssey. Time will tell how solid these boards are, but mine's working great. The reason I'm using it is that it's low on moving parts - there's a small fan on the bottom that I'm not sure it even really needs for my use case, and otherwise the one I ordered has onboard eMMC and RAM. I also bought the LTE companion card, but as mentioned above, I'll talk about bonding another time.

Distro installation for Clear Linux is easy. I wrote the Server ISO to a USB key and booted from it in UEFI mode. The install of Clear is remarkably simple, fill in the options, there aren't many, and make yourself a user account. Once it's all done, you probably want to flip the BIOS settings around to only boot from disk, and to turn on when power is restored. If you have a good track record remembering passwords for hardware, drop a password in there for good measure.

Configuration Management

The really nice thing about using Clear Linux is that this step is a real cheat. After the system boots, I'm just going to make a git repo straight in /etc, because I'm a savage, and a clean /etc is a real treat.

$ sudo swupd bundle-add git
$ cd /etc
$ sudo git init
$ sudo chown -R $USER .git

There is still some "junk" in /etc, and key material that I don't want to commit, so, I'll add some ignore entries:

$ sudo tee /etc/.gitignore
.pwd.lock
group
gshadow
machine-id
mtab
openldap/
os-release
pam.d/
passwd
passwd-
resolv.conf
shadow
shadow-
ssh/ssh_host_ecdsa_key
ssh/ssh_host_ecdsa_key.pub
ssh/ssh_host_ed25519_key
ssh/ssh_host_ed25519_key.pub
ssl/certs
wg.key
CTRL+D
$ git add .gitignore
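
Before the first real commit it's worth confirming that the key material really is ignored. The following is a small sketch built on `git check-ignore`; the helper name `unignored_secrets` is my own, and the list of paths to pass is whatever you consider secret.

```shell
# Print every listed path that git would NOT ignore in the given repository.
# usage: unignored_secrets REPO_DIR PATH...
unignored_secrets() {
  repo=$1; shift
  for path in "$@"; do
    # check-ignore -q exits 0 only when an ignore rule matches the path.
    git -C "$repo" check-ignore -q -- "$path" || printf '%s\n' "$path"
  done
}
```

Running `unignored_secrets /etc shadow gshadow wg.key` should print nothing; any path it does print would be picked up by a later `git add .`.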

Next, some basic system setup. I want a hostname, a domain name, and some crufty services disabled:

$ for unit in NetworkManager.service pacdiscovery.path pacdiscovery.service \
    pacrunner.service sshd-keygen.service tallow.service; do
  sudo systemctl disable --now "$unit"
  sudo systemctl mask "$unit"
done
$ sudo systemctl enable --now systemd-networkd
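
To check that the masking stuck, for example after an OS update, a sketch like the one below can help. It relies on `systemctl is-enabled` printing `masked` for masked units. The `SYSTEMCTL` variable is a hypothetical override hook I added so the function can be exercised without systemd; it is not a systemd feature.

```shell
# SYSTEMCTL may be pointed at a stand-in command for testing (assumption).
: "${SYSTEMCTL:=systemctl}"

# Print "unit state" for every listed unit that is not masked.
# usage: not_masked UNIT...
not_masked() {
  for unit in "$@"; do
    # is-enabled exits non-zero for masked units, so ignore its status.
    state=$("$SYSTEMCTL" is-enabled "$unit" 2>/dev/null) || true
    case $state in
      masked|masked-runtime) ;;
      *) printf '%s %s\n' "$unit" "${state:-unknown}" ;;
    esac
  done
  return 0
}
```

`not_masked pacdiscovery.path pacdiscovery.service pacrunner.service` should print nothing on a correctly configured router.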

These probably bear a little explanation in a few cases:

Ok, I said SSH was going to get some attention:

$ sudo mkdir -p /etc/ssh
$ sudo tee /etc/ssh/sshd_config
Port 22
Port 2222
HostKey /etc/ssh/ssh_host_ed25519_key
HostKey /etc/ssh/ssh_host_ecdsa_key
KexAlgorithms curve25519-sha256@libssh.org,ecdh-sha2-nistp521,ecdh-sha2-nistp384,ecdh-sha2-nistp256
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com,aes256-ctr,aes192-ctr,aes128-ctr
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-512,hmac-sha2-256,umac-128@openssh.com
AuthenticationMethods publickey
PermitRootLogin prohibit-password
AuthorizedKeysFile      .ssh/authorized_keys
Subsystem       sftp    /usr/libexec/sftp-server
CTRL+D
$ sudo rm -f /etc/ssh/ssh_host_dsa* /etc/ssh/ssh_host_rsa*
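
Since sshd-keygen.service is masked and the DSA and RSA keys are removed, sshd depends entirely on the two `HostKey` files named in the config existing. Here is a rough sketch of a check for that; `missing_host_keys` is a helper name I made up, and it only understands plain `HostKey /path` lines.

```shell
# List HostKey files named in an sshd_config that do not exist on disk.
# usage: missing_host_keys SSHD_CONFIG
missing_host_keys() {
  # sshd keywords are case-insensitive, so compare in lower case.
  awk 'tolower($1) == "hostkey" { print $2 }' "$1" |
  while read -r key; do
    [ -f "$key" ] || printf '%s\n' "$key"
  done
}
```

If `missing_host_keys /etc/ssh/sshd_config` prints a path, generate that key with `ssh-keygen` before restarting sshd. Running `sshd -t -f /etc/ssh/sshd_config` as root is a fuller check of the configuration and keys.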

That's the core system configuration landed. I'll start on router services next, but it's commit time.

$ cd /etc && git add . && git commit

Network Configuration

First up, as I'm building a router, I want reasonable queue management so that latency stays sane when the link is congested. Ideally I'd use cake, but it's not available in Clear's kernel right now, so for now I'm falling back on fq_codel, which performs pretty well.

$ sudo mkdir -p /etc/sysctl.d
$ sudo tee /etc/sysctl.d/router.conf
net.core.default_qdisc = fq_codel
CTRL+D
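
A sysctl.d file only takes effect at boot or when reloaded with `sudo sysctl --system`, so it can be handy to compare the file against the live kernel values. The sketch below does that by reading `/proc/sys` directly. The function name `sysctl_drift` and its optional second argument (an alternate root, useful for testing) are my own inventions, and it assumes simple `key = value` lines with no dots inside interface names.

```shell
# Report keys whose live value differs from what a sysctl.d file asks for.
# usage: sysctl_drift CONF_FILE [PROC_SYS_ROOT]
sysctl_drift() {
  conf=$1 root=${2:-/proc/sys}
  # Drop blank lines and comment lines (those starting with # or ;).
  grep -Ev '^[[:space:]]*([#;]|$)' "$conf" |
  while IFS='=' read -r key want; do
    key=$(printf '%s' "$key" | tr -d '[:space:]')
    want=$(printf '%s' "$want" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
    # net.core.default_qdisc maps to net/core/default_qdisc under the root.
    path=$root/$(printf '%s' "$key" | tr . /)
    have=$(cat "$path" 2>/dev/null) || have='<missing>'
    [ "$have" = "$want" ] || printf '%s: want %s, have %s\n' "$key" "$want" "$have"
  done
}
```

`sysctl_drift /etc/sysctl.d/router.conf` printing nothing means the kernel default matches. As far as I know the default only applies to qdiscs created after the change, so a reboot is the simplest way to get it onto existing interfaces.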

Next up, I want to configure two NICs (actually I have three, as I have LTE as well, but I'm skipping that, remember):

This first step is really cosmetic, and you could skip it, but it can really help if you're ever doing diagnosis with blurry eyes. Feel free to add 0 suffixes if it makes you feel like you have more of a gray beard.

$ sudo mkdir -p /etc/systemd/network
$ sudo tee /etc/systemd/network/10-internet.link
[Match]
Path=pci-0000:02:00.0
[Link]
Name=internet
Description=Internet Interface
CTRL+D
$ sudo tee /etc/systemd/network/10-lan.link
[Match]
Path=pci-0000:03:00.0
[Link]
Name=lan
Description=LAN Interface
CTRL+D
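
The `Path=` values above are specific to my board, so you'll need to find your own. A rough way to list candidates is sketched below. It assumes plain PCI NICs, where udev's path identifier is `pci-` followed by the PCI address; USB or virtual devices are skipped, and `udevadm info` on the interface (look for `ID_PATH`) remains the authoritative source. The function name and its optional directory argument are illustrative.

```shell
# Print "interface pci-ADDRESS" for each NIC that sits directly on PCI.
# usage: list_pci_paths [SYSFS_NET_DIR]
list_pci_paths() {
  net=${1:-/sys/class/net}
  for dev in "$net"/*; do
    [ -e "$dev/device" ] || continue      # virtual interfaces such as lo
    addr=$(basename "$(readlink -f "$dev/device")")
    case $addr in
      # Only accept names shaped like a PCI address, e.g. 0000:02:00.0
      ????:??:??.?) printf '%s pci-%s\n' "$(basename "$dev")" "$addr" ;;
    esac
  done
}
```

Run `list_pci_paths` before renaming anything, note which port is which, and copy the second column into the matching `[Match]` section.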

Great, now I have decent names for my interfaces, this is going to make all kinds of other configs more readable. Time to actually give the interfaces some configuration.

Now here you might make some different choices than I did, for example you might use different time servers, or you might want a different prefix delegation hint if your ISP provides a different length. What's configured below is the internet interface: disabling LAN name discovery (LLMNR), enabling IP forwarding, and getting IPs from DHCP and router adverts. I also don't want privacy extensions for the public interface. Note: I don't want my ISP's services, hostname, etc. You probably don't either.

$ sudo tee /etc/systemd/network/30-internet.network
[Match]
Name=internet

[Network]
LLMNR=false
NTP=time1.google.com
NTP=time2.google.com
NTP=time3.google.com
NTP=time4.google.com
IPForward=true
DHCP=true
IPv6AcceptRA=true
IPv6PrivacyExtensions=false

[DHCPv4]
UseDNS=false
UseNTP=false
UseSIP=false
UseHostname=false
UseDomains=false

[DHCPv6]
PrefixDelegationHint=::/60
UseDNS=false
UseNTP=false
CTRL+D
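
If you're wondering what a different `PrefixDelegationHint` length buys you, the arithmetic is simple: each LAN wants a /64, so a delegated prefix of length N holds 2^(64-N) of them. A tiny sketch, with `pd_subnets` being a name I picked for illustration:

```shell
# Number of /64 networks that fit inside a delegated IPv6 prefix.
# usage: pd_subnets PREFIX_LENGTH
pd_subnets() {
  len=$1
  if [ "$len" -lt 16 ] || [ "$len" -gt 64 ]; then
    echo "prefix length must be between 16 and 64" >&2
    return 1
  fi
  echo $(( 1 << (64 - len) ))
}
```

So `pd_subnets 60` prints 16, plenty for a home network with a handful of segments, while a /64 delegation leaves exactly one.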

And the LAN side now, this is going to do all of the IPv6 router advertisements and prefix delegation completely with systemd-networkd, which is quite nice, one less thing dnsmasq will be doing for us (though that actually sacrifices quad-A records for now). You might want to pick a different subnet and a different domain.

$ sudo tee /etc/systemd/network/30-lan.network
[Match]
Name=lan

[Network]
ConfigureWithoutCarrier=true
Address=192.168.200.1/24
DNS=192.168.200.1
DNSDefaultRoute=true
Domains=your-domain.home
IPForward=true
IPv6PrefixDelegation=dhcpv6
IPv6PrivacyExtensions=false

[IPv6PrefixDelegation]
DNS=_link_local
Domains=your-domain.home
RouterLifetimeSec=60
CTRL+D

NAT time!

IPv4 being what it is, I'm going to need to make that one measly IP address spread around. That means NAT. I'm going to use nftables. Normally folks would use iptables, but what the hell, let's live on the edge.

In this ruleset, there are two maps for "port forwards". The TCP one is just an example: it forwards HTTPS to the host I've called http at 192.168.200.2, and you probably want something else. Note that the {tcp,udp}_forwards and {tcp,udp}_allowed definitions should be kept in sync. There might be a better way to handle this, but I couldn't see it. It's still better than typing it out in iptables though.

$ sudo swupd bundle-add firewalld
$ sudo tee /etc/nftables.conf
flush ruleset;

define rtr = 192.168.200.1
define http = 192.168.200.2

define lan_network = 192.168.200.0/24

define tcp_forwards = {
        https : $http . https,
}

define tcp_allowed = {
        $http . https,
}

define udp_forwards = {
        51820 : $rtr . 51820,
}

define udp_allowed = {
        $rtr . 51820,
}

table inet nat {
    map tcp_destinations {
        type inet_service : ipv4_addr . inet_service
        elements = $tcp_forwards
    }

    map udp_destinations {
        type inet_service : ipv4_addr . inet_service
        elements = $udp_forwards
    }

    set tcp_masq {
        type ipv4_addr . inet_service
        elements = $tcp_allowed
    }

    set udp_masq {
        type ipv4_addr . inet_service
        elements = $udp_allowed
    }

    chain prerouting {
                type nat hook prerouting priority 0;
                policy accept;
                ip daddr != $lan_network fib daddr type local dnat ip addr . port to tcp dport map @tcp_destinations
                ip daddr != $lan_network fib daddr type local dnat ip addr . port to udp dport map @udp_destinations
    }

    chain postrouting {
                type nat hook postrouting priority 100;
                policy accept;
                ip saddr $lan_network ip daddr . tcp dport @tcp_masq masquerade
                ip saddr $lan_network ip daddr . udp dport @udp_masq masquerade
                oif internet masquerade
    }
}

table inet filter {
    set tcp_allowed {
        type ipv4_addr . inet_service
        elements = $tcp_allowed
    }
    set udp_allowed {
        type ipv4_addr . inet_service
        elements = $udp_allowed
    }

    chain forward {
                ip daddr . tcp dport @tcp_allowed accept
                ip daddr . udp dport @udp_allowed accept
                iif lan oif internet accept
                iif internet oif lan ct state related,established accept
    }
}
CTRL+D
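
Because the forwards and allowed definitions have to be kept in sync by hand, a small lint can catch drift. The sketch below parses the `define` blocks as I've laid them out above (one element per line, closing brace on its own line) and reports any target that appears in only one of the pair. `check_forward_sync` is a hypothetical helper, not part of nftables, and it would need adjusting for other layouts.

```shell
# Compare {tcp,udp}_forwards targets with {tcp,udp}_allowed entries.
# usage: check_forward_sync NFTABLES_CONF   (exit status 1 on mismatch)
check_forward_sync() {
  awk '
    /^define[ \t]+(tcp|udp)_(forwards|allowed)[ \t]*=/ { name = $2; next }
    $1 == "}" { name = ""; next }
    name != "" {
      line = $0
      gsub(/[ \t,]/, "", line)            # drop whitespace and commas
      if (line == "") next
      key = substr(name, 1, 3) SUBSEP line
      if (name ~ /_forwards$/) {
        sub(/^[^:]*:/, "", line)          # keep only the forward target
        fwd[substr(name, 1, 3) SUBSEP line] = 1
      } else {
        alw[key] = 1
      }
    }
    END {
      for (k in fwd) if (!(k in alw)) { split(k, p, SUBSEP); printf "%s: %s only in forwards\n", p[1], p[2]; bad = 1 }
      for (k in alw) if (!(k in fwd)) { split(k, p, SUBSEP); printf "%s: %s only in allowed\n", p[1], p[2]; bad = 1 }
      exit bad + 0
    }
  ' "$1"
}
```

I'd run `check_forward_sync /etc/nftables.conf` followed by `sudo nft -c -f /etc/nftables.conf`, which asks nft to check the ruleset without applying it, before committing any firewall change.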

Now as I'm using nftables directly, and not actually the firewalld that I installed, I'm going to need to drop a unit file for this:

$ sudo tee /etc/systemd/system/nftables.service
[Unit]
Description=nftables
Documentation=man:nftables(8)

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/nft -f /etc/nftables.conf

[Install]
WantedBy=multi-user.target
CTRL+D
$ sudo systemctl daemon-reload && sudo systemctl enable nftables

Aaand dnsmasq...

One of the things that dnsmasq has going for it is that it's pretty easy to configure. Let's drop a hosts file, and drop a dnsmasq.conf, after installing it.

$ sudo swupd bundle-add dnsmasq
$ sudo mkdir /var/lib/dnsmasq
$ sudo tee /etc/dnsmasq.conf
bind-interfaces
interface=lan
cache-size=2000
local-service
domain-needed
no-poll # resolv.conf
no-resolv
dhcp-leasefile=/var/lib/dnsmasq/leases
# Dangerous names from dangerous technologies.
dhcp-ignore-names=tag:blockedhosts
dhcp-host=isatap,set:blockedhosts
dhcp-host=unifi,set:blockedhosts
dhcp-host=wpad,set:blockedhosts
server=8.8.8.8
server=8.8.4.4
server=2001:4860:4860::8888
server=2001:4860:4860::8844
dhcp-authoritative
dhcp-range=192.168.200.21,192.168.200.240,255.255.255.0,86400
domain=your-domain.home.,192.168.200.0/24,local
host-record=rtr.your-domain.home,192.168.200.1

# This is an example static entry, for that "http" we configured in the nftables rules:
dhcp-host=aa:bb:cc:dd:ee:11,192.168.200.2,http.your-domain.home
CTRL+D
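
The port forward in the nftables rules only makes sense if the target keeps its address, which is what the static `dhcp-host` entry is for. Here is a minimal sketch of a cross-check, assuming `dhcp-host` lines in the comma separated form used above; `has_static_lease` is an illustrative name of mine.

```shell
# Succeed if some dhcp-host line pins the given IPv4 address.
# usage: has_static_lease DNSMASQ_CONF IPV4_ADDRESS
has_static_lease() {
  awk -F, -v ip="$2" '
    /^dhcp-host=/ { for (i = 1; i <= NF; i++) if ($i == ip) found = 1 }
    END { exit !found }
  ' "$1"
}
```

For example, `has_static_lease /etc/dnsmasq.conf 192.168.200.2 || echo "no static lease for forward target"`. For a syntax check of the whole file, `dnsmasq --test -C /etc/dnsmasq.conf` reads the configuration and exits without starting the daemon.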
$ sudo tee /etc/hosts
127.0.0.1  localhost
127.0.1.1  localhost

::1        localhost ip6-localhost ip6-loopback
fe00::0    ip6-localnet
ff00::0    ip6-mcastprefix
ff02::1    ip6-allnodes
ff02::2    ip6-allrouters
CTRL+D
$ sudo systemctl enable dnsmasq

Note that I need to disable systemd-resolved's normally helpful stub listener, as it'll block dnsmasq, and I don't need it anyway:

$ sudo tee /etc/systemd/resolved.conf
[Resolve]
DNSStubListener=no
CTRL+D

All done, time to commit, and reboot!

$ cd /etc
$ git add .
$ git commit
$ sudo systemctl reboot

What comes next?

December 2021 Update

Over the last year I've made a few key changes:

The only real problem I experienced with the router was a flaw in a recent dnsmasq update in Clear Linux that appears to end up going into a spin validating DNSSEC. I had to fight the Clear Linux debug symbol fusefs, which is a good idea in principle, but it relies on DNS, and so it's a bit of a disaster when the DNS server is the thing you're debugging. I did work around that problem, and was able to poke around at an instance of the problem, but frankly the Unix data structures for the incoming data are such a pain to work with in a debugger that I decided screw it, I'll drop DNSSEC for now. Currently my plan is to rework my configurations to use the DHCP server in systemd-networkd and implement a DNS server based on trust-dns. For now that'll likely have to poll `networkctl` periodically in order to sync the DHCP leases, and I'm debating writing a fuller dnsmasq replacement again if that's problematic. We'll see. It's clear that dnsmasq is on its last legs, and it's also the case that while I really appreciate being able to just commit /etc, I'm not overly happy with Clear Linux either. The team's support model and inability to accept actual patches are quite horrible to work with. They have a good product, but it's not there to be a product, it's just a bigcorp wall-throw to avoid GPL issues. I kinda wanna make a modern distro, but I'd want someone to fund such a venture.
