Our two servers each have their own A records in DNS with a TTL of 3600 seconds (1 hour). This long timeout is fine since the IP address of the actual server never really changes.
Access to the service is instead provided by a CNAME record which points to one of those two hostnames. The TTL of the CNAME record is 60 seconds, allowing us to quickly fail over between the two sites as needed.
So the time came, and I had to perform a failover. I updated the CNAME and, so that users would not be left unable to connect, waited 60 seconds before shutting down the old server and starting up the new one.
From there things went bad. I tried to access the admin console, and failed. I tried to log into the Jabber server, and failed. Finally I hit the admin console through the A record instead of the CNAME, and found that other users had seamlessly failed over.
After a bit of testing I determined that my Linux box and my Windows box both worked fine. The only problem was the Mac that I was making the change from. For some reason, the Mac was holding on to the old IP address.
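One way to spot this kind of stale answer from the affected machine is to compare what the local resolver returns for the service name with what it returns for the intended target. The following is only a minimal sketch: the function name alias_matches_target is made up for illustration, and server2.domain.fake is an assumed name for the second server, which the post never names.

```python
import socket


def alias_matches_target(alias, target, resolve=socket.gethostbyname_ex):
    """Return True if alias and target share at least one IPv4 address.

    resolve defaults to the system resolver (and therefore its cache).
    It can be swapped for a stub, which is handy for testing offline.
    """
    _, _, alias_ips = resolve(alias)
    _, _, target_ips = resolve(target)
    return bool(set(alias_ips) & set(target_ips))


# Hypothetical usage after a failover (server2.domain.fake is an assumed name):
# if not alias_matches_target("openfire.domain.fake", "server2.domain.fake"):
#     print("local resolver is still handing out the old address")
```

Because this goes through the operating system's resolver, it reports what applications on that machine will actually see, which is not necessarily what the authoritative DNS server currently says.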
After some testing, and confirmation from other individuals on their Macs, I think I know what was going on. Using dscacheutil -cachedump -entries, I inspected the local resolver cache.
Here's what I found:
Category    Best Before         Last Access         Hits      Refs    TTL       Neg    DS Node
----------  ------------------  ------------------  --------  ------  --------  -----  ---------
Host        01/28/09 21:07:02   01/28/09 20:18:35   10        4       3600
    Key: h_aliases:openfire.domain.fake. ipv4:1
    Key: h_aliases:openfire.domain.fake ipv4:1
    Key: h_name:server1.domain.fake ipv4:1
This appears to show that the local resolver cached the server1.domain.fake DNS record and set its expiration to "01/28/09 21:07:02".
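As a quick sanity check on those timestamps, the arithmetic below works out how long the cached entry still had to live at its last access. It assumes the dump prints dates as MM/DD/YY with a 24-hour clock, which matches the values shown but is not documented here.

```python
from datetime import datetime

# Assumed timestamp layout of the cache dump: MM/DD/YY HH:MM:SS
FMT = "%m/%d/%y %H:%M:%S"

best_before = datetime.strptime("01/28/09 21:07:02", FMT)
last_access = datetime.strptime("01/28/09 20:18:35", FMT)

# Seconds the entry still had left when it was last used
remaining = (best_before - last_access).total_seconds()

cname_ttl = 60  # TTL of the CNAME record, from the setup described above
print(remaining, remaining > cname_ttl)
```

That comes to 2907 seconds, a little over 48 minutes, so the entry was going to outlive the 60-second CNAME TTL by a wide margin.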
openfire.domain.fake was then set as an alias for that record without retaining its own TTL. This would certainly explain the behavior that I saw.
So it seems that Mac OS X may be incompatible with a fairly common DNS failover technique. I filed a bug, so it'll be interesting to see how long it takes before Apple gets around to fixing it.
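To make the difference concrete, here is a rough model of the two caching behaviors. It is a simplified sketch, not a description of how any real resolver is implemented: the Record type and both function names are invented for illustration, and 192.0.2.10 is a placeholder address from the documentation range.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    value: str  # target hostname for a CNAME, address for an A record
    ttl: int    # seconds


def expiry_per_record(cname, a, now=0):
    """Expected behavior: an answer reached through the alias is only
    good while every record in the chain is still valid, so the
    shortest TTL wins."""
    return now + min(cname.ttl, a.ttl)


def expiry_collapsed(cname, a, now=0):
    """Behavior the cache dump suggests: the alias is filed under the
    A record's entry and inherits that entry's TTL."""
    return now + a.ttl


cname = Record("openfire.domain.fake", "server1.domain.fake", 60)
a = Record("server1.domain.fake", "192.0.2.10", 3600)  # placeholder address

print(expiry_per_record(cname, a))  # seconds until the alias is looked up again
print(expiry_collapsed(cname, a))   # seconds until the collapsed entry expires
```

With the TTLs from this setup, the first model re-checks the alias after 60 seconds, while the second keeps pointing it at the old server for up to an hour.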