The IP addresses and networks named here have been altered for obscurity.
Because security through obscurity always works, no questions asked :)
The conordotid server hamster diligently keeping our infra online
Background
Our home network is chopped up into multiple VLANs for better organization, security, and relationship job-security:
- 192.168.10.0/24 (Core Services)
- 192.168.20.0/24 (Public Services) <- You are here!
- 192.168.30.0/24 (Surveillance)
- 192.168.40.0/24 (Trusted WiFi Clients)
- 192.168.50.0/24 (Untrusted WiFi Clients)
- 192.168.60.0/24 (IoT WiFi)
- 192.168.254.0/24 (Management)
Infrastructure Details
- Firewall/DHCP Server: OPNsense running on an HP T750, Broadcom 4 port gigabit copper NIC
- Switch: HP 1820-24G (Super cheap! Slightly horrifying)
  - Uplink between firewall and switch over a dual link gigabit trunk
- VM Host (Public/Core Services): Proxmox VE
  - Ryzen 9 3900x
  - 32GB RAM
  - Couple SSDs
- Storage Host: Unraid
  - Ryzen 7 2700x (2 cores disabled, TDP limited to 65 watts)
  - 32GB RAM
  - 6 spinnys and 1 chippy
  - 10Gb fiber link to VM host and @conor-fredora
- DHCP Setup: All servers were given fixed IPs through static leases in OPNsense's ISC DHCP service because it is easy and auto-documenting. That choice, as you'll see, played a starring role in the outage
Timeline
Configuration Change
- Initiated by @conor (That's me!): An isolated VLAN meant for retro systems (OpenSTEP, Red Hat Linux 9, Win2K, etc.) was reconfigured:
  - Network changed from 10.0.0.0/24 to 192.168.123.0/24.
  - Firewall's IP on the VLAN was updated accordingly.
Don't forget to adjust the DHCP Server range if needed after applying.
Oversight
- The DHCP Server range was not adjusted after applying to reflect the new 192.168.123.0/24 subnet. The service still had the pool defined in the 10.0.0.0/24 network.
- As a result, OPNsense disabled the ISC DHCP service, because ISC DHCP cannot start while misconfigured.
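The mismatch described above is easy to check mechanically. Here is a minimal sketch, assuming you can read the interface subnet and the pool bounds out of your config by some means. The function name `pool_in_subnet` and the pool bounds are hypothetical; the post does not state the actual range that was configured.

```python
from ipaddress import ip_address, ip_network


def pool_in_subnet(subnet: str, pool_start: str, pool_end: str) -> bool:
    """Return True if the whole DHCP pool lies inside the interface subnet."""
    net = ip_network(subnet)
    start, end = ip_address(pool_start), ip_address(pool_end)
    # The pool must be ordered and both ends must belong to the subnet.
    return start <= end and start in net and end in net


# Hypothetical pool bounds, for illustration only.
# Before the change: pool and interface agree.
print(pool_in_subnet("10.0.0.0/24", "10.0.0.100", "10.0.0.200"))       # True
# After the change: interface moved, pool left behind.
print(pool_in_subnet("192.168.123.0/24", "10.0.0.100", "10.0.0.200"))  # False
```

Running a check like this against every interface before applying a change would have flagged the stale pool.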
Impact
- Over time, as DHCP leases across all VLANs began to expire:
  - Machines relinquished their IPs but were unable to obtain new ones due to the disabled DHCP service.
  - Communication on all networks became impossible.
- "Conor the WiFi isn't working"
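The "over time" part is why the outage crept in rather than hitting all at once: each client keeps its address until its own lease runs out. A rough sketch of the timeline for a single client, assuming the RFC 2131 default timers (renew at 50% of the lease, rebind at 87.5%). The lease length and timestamp below are made up for illustration; the post does not give the real values, and clients can be configured with different timers.

```python
from datetime import datetime, timedelta


def lease_milestones(acquired: datetime, lease_seconds: int) -> dict:
    """Approximate DHCP client milestones using RFC 2131 default timers."""
    return {
        # T1: client starts asking its original server to renew.
        "renew (T1)": acquired + timedelta(seconds=lease_seconds * 0.5),
        # T2: client gives up on that server and broadcasts to any server.
        "rebind (T2)": acquired + timedelta(seconds=lease_seconds * 0.875),
        # Expiry: with no answer at all, the client must stop using the address.
        "expiry": acquired + timedelta(seconds=lease_seconds),
    }


# Hypothetical values: a 2 hour lease obtained at noon.
for name, when in lease_milestones(datetime(2024, 1, 1, 12, 0), 7200).items():
    print(f"{name}: {when:%H:%M}")
# renew (T1): 13:00
# rebind (T2): 13:45
# expiry: 14:00
```

With the DHCP service down, the renew and rebind attempts go unanswered, so each machine drops off at its own expiry time.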
Resolution
- The ISC DHCP service was reconfigured:
  - The "Retro Network" pool was correctly assigned to the 192.168.123.0/24 subnet.
  - The DHCP service was restarted successfully, restoring network functionality.
Lessons Learned
- Read the warnings
  - They aren't just there for fun :)
- Rethink IP Management for Servers
  - Current process is okay for things like IP cameras, but not for important Core Services
  - No more DHCP static leases for server IP assignments. Do them manually on each server
- Network Monitoring
  - I already do this to some extent, but the machine tasked with this was collateral damage.
  - Use something "out of band" to check connectivity to various networks. For example, a VPS with a WireGuard tunnel back home.
Action Items
- Ideally ditch DHCP servers in the core services, public, and management networks. These are relatively static and rarely change
- Set up some sort of remote monitor / alerting system
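As a starting point for the remote monitor, here is a minimal sketch of an out-of-band check that could run on a VPS with a tunnel back home. Everything in `TARGETS` is a hypothetical placeholder (the hosts and ports are not from the real network), and actual alerting is left out; the script only reports which networks fail a TCP connect.

```python
import socket

# Hypothetical probe targets: one reachable TCP service per network.
TARGETS = {
    "Core Services": ("192.168.10.1", 53),
    "Public Services": ("192.168.20.1", 443),
    "Management": ("192.168.254.1", 443),
}


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def check_all(targets: dict, probe=tcp_reachable) -> dict:
    """Probe every target and map network name to reachability."""
    return {name: probe(host, port) for name, (host, port) in targets.items()}


def down_networks(results: dict) -> list:
    """Return the sorted names of networks whose probe failed."""
    return sorted(name for name, ok in results.items() if not ok)
```

Calling `down_networks(check_all(TARGETS))` from a cron job or systemd timer and sending a notification whenever the list is non-empty would be one way to wire it up. The important part is that the checker lives outside the networks it watches, so it cannot become collateral damage again.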