Boards.ie blog

Cleaning up a few years of incremental infrastructure growth

Boards.ie has, more or less, expanded one server at a time. It began life on a single shared server, spent a stint under the desk of one of our founders, then made the big move to two separate servers, with more added as demand increased. By the start of this year we had sprawled to 18 racked machines, some of which were non-operational and some of which we couldn’t use properly because of the organic way all that machinery had come together.

One major step for us has been consolidating that hardware. We’d been spread across two cabinets in Digiweb’s data centre for several months, and the hack in January set off a series of delays that kept us away from the hardware longer than we would have liked. Over the last month we’ve taken several trips out to the data centre and completed some important stages.

Previously, our hardware configuration looked something like this (this is what a decade’s worth of slow growth looks like):

Cabinet 2009, before changes. Internet notably on right.

Problems

In this configuration, we had servers which were using other servers as shelves, literally stacked on top of each other. Obviously this made any physical maintenance quite tedious. The back of the cabinet (not shown here) was a jungle of criss-crossing cables, some of which weren’t even connected at one end, but were so tangled up with others that removing them presented more risk than the reward was worth. We were also sharing the cabinet with several other customers.

The method of load balancing we were using was also fairly basic: one of our oldest surviving machines (a dual core Pentium III) blindly spun requests off to the back ends. If one web server dropped offline for any reason, 20% of page requests would simply fail. If the load balancer dropped offline, all requests would fail. Additionally, recovery from a failure of the load balancer was not trivial. We were also having increasing trouble with poorly written web spiders and malicious spambots, which would occasionally slow the site to a crawl.
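To make the 20% figure concrete, here is a minimal sketch of blind round-robin dispatch. It assumes five web servers (implied by the 20% figure, not stated outright), and the host names are made up for illustration; it is not the configuration we actually ran.

```python
from itertools import cycle


def blind_round_robin(backends, alive, n_requests):
    """Dispatch requests in strict rotation, ignoring back-end health.

    Returns the fraction of requests sent to a back end that is down.
    """
    rotation = cycle(backends)
    failed = 0
    for _ in range(n_requests):
        if next(rotation) not in alive:
            failed += 1
    return failed / n_requests


# Hypothetical pool of five web servers, with web4 offline.
backends = ["web1", "web2", "web3", "web4", "web5"]
alive = {"web1", "web2", "web3", "web5"}

failure_rate = blind_round_robin(backends, alive, 1000)
print(failure_rate)  # 0.2
```

Because the balancer never checks whether a back end is up, the dead machine keeps receiving its full share of the traffic.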

While it had rarely been an issue, our static content (images, mostly) was also being served from a single machine.

Also, physical space was starting to become an issue; we were quickly going to have problems fitting more hardware in the same place as the rest, and our internal switch was down to a couple of free ports. Our growth curve is relatively predictable and we knew we’d need more capacity within about 3-4 months.

Solving

We started with the problem we knew was coming: capacity. Digiweb gave us our own cabinet and hooked it up to the old cabinet so that we could start adding more hardware. We decided to make sure we Did It Right with the new cabinet, and set about making sure everything would be accounted for in the layout of the machines, right down to trying to be consistent about cable colours.

We bought a pair of machines to serve as replacements for our ailing P3 load balancer, a couple of beefy (8 core, 32GB RAM) ones to serve as new database machines and a pair of new routers.

New servers, boxed

Over a couple of trips, we moved all of our machines from the old cabinet to the new one, retiring the older machines where possible, and getting rails for anything that was freestanding. We even managed to keep downtime relatively low, though moving the primary database server meant we had to drop offline – the vBulletin software that Boards.ie runs on doesn’t give us the option for a “read only” mode without a significant overhaul.

Old cabinet, midway through move

Our new load balancers have been set up to mirror each other and perform automatic failover in the event of one failing (using CARP and pfsync). This has also allowed us to have a much more robust firewall and a redundant internal DNS cache (djbdns). We’re using Varnish to do the actual load balancing and to route around dead machines at the back end.
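The idea of routing around dead back ends can be sketched roughly as follows. This is a simplified Python model, not Varnish configuration: the `Backend` and `Director` classes, the host names, and the window and threshold values are all assumptions for illustration. It loosely follows the probe-window approach, where a back end counts as healthy only while enough of its recent health checks have succeeded.

```python
from collections import deque


class Backend:
    """A back end whose health is judged from its most recent probes."""

    def __init__(self, name, window=5, threshold=3):
        self.name = name
        self.threshold = threshold
        # Start with a full window of successes so a new back end is usable.
        self.probes = deque([True] * window, maxlen=window)

    def record_probe(self, ok):
        """Store the result of one health check, dropping the oldest."""
        self.probes.append(ok)

    @property
    def healthy(self):
        return sum(self.probes) >= self.threshold


class Director:
    """Round-robin over the pool, skipping back ends that look unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = 0

    def pick(self):
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend.healthy:
                return backend
        raise RuntimeError("no healthy back ends")


# Hypothetical pool of five web servers; web4 fails three probes in a row.
pool = [Backend(f"web{i}") for i in range(1, 6)]
director = Director(pool)
for _ in range(3):
    pool[3].record_probe(False)

picked = [director.pick().name for _ in range(8)]
print(picked)
```

Once web4 drops below the threshold it stops receiving requests, and it rejoins the rotation on its own after enough probes succeed again.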

We’ve also implemented multiple database support in vBulletin using memcached counters for weighting, which means we can easily add more machines to our database pool where required and balance the traffic.
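As a rough illustration of counter-based weighting, the sketch below picks whichever database server has served the fewest queries relative to its weight. It is one plausible reading of the approach, not our actual vBulletin patch: the `FakeCounterStore` class stands in for memcached, and the key names, server names and weights are all hypothetical.

```python
class FakeCounterStore:
    """In-memory stand-in for memcached's counter operations.

    Unlike this stand-in, real memcached only increments keys that
    already exist, so the counters would need to be created first.
    """

    def __init__(self):
        self._data = {}

    def incr(self, key, delta=1):
        self._data[key] = self._data.get(key, 0) + delta
        return self._data[key]

    def get(self, key):
        return self._data.get(key, 0)


def pick_database(store, weights):
    """Choose the server with the lowest hits-to-weight ratio.

    Ties are broken by server name so the choice is deterministic.
    """
    name = min(
        weights,
        key=lambda n: (store.get(f"db_hits:{n}") / weights[n], n),
    )
    store.incr(f"db_hits:{name}")
    return name


# Hypothetical pool: db1 should take three times the traffic of db2.
weights = {"db1": 3, "db2": 1}
store = FakeCounterStore()

picks = [pick_database(store, weights) for _ in range(8)]
print(picks.count("db1"), picks.count("db2"))  # 6 2
```

Adding a machine to the pool then amounts to adding one entry to the weights table, since the shared counters keep every web server’s view of the load consistent.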

Here’s a rough diagram of our infrastructure as it is today:

Infrastructure diagram 2010 - moving the Internet to the left was the hardest part.

And some photos!

Web servers

DB and storage

Back of DB / Storage

Back of web servers

7 Comments

  1. Dave Masterson says:

    Well done guys – I think every sys admin comes across this issue at some stage, and it takes a brave one to decide to take the plunge and “do it right”. May they serve you well, and if anyone uses the wrong color cable when adding a new server, beat them with a big stick. Works for me.

  2. Conor says:

    Thanks Dave! Don’t worry, there’s a big stick kept with the spares. :)

  3. Ronanob says:

    Very Very sweet set up, the new Dell 32GB systems are fantastic and it looks like you’d be able to add in new servers pretty easily. Lots of redundancy there too.

  4. Conor says:

    Yep, the new database machines are great for soaking up a lot of queries. The ones we have are Dell 2970s with 2 x AMD Opteron 2382s and 32GB RAM.

    We’ve been working hard to remove things which are single points of failure. Some bits were easier than others, but we’ve come a long way in the last 6 months or so. There’s still a good bit to go though. Ideally, I’d like automatic failover on pretty much everything. There’s only two of us, so unplanned downtime has the ability to take up a lot of our capacity and affect further development. I’d much rather have the machines take care of the emergency panic work and free Ross and me up for making the actual repairs.

  5. [...] We also had a post last week with a (high level) overview of our current hardware. [...]

  6. Back in the day, when I was working at Sun (*cry*), I knew some folks who were totally anal about cabling – down to the colour of the cable ties they used! Mind you, they were also the hottest of the sh!t hot sysadmins.

    (Unfortunately they were at the back of the line when social skills were handed out – but that wasn’t such a big deal at Sun.)

  7. [...] in work-related news, Ross and I recently finished up a long, slow process of cleaning up the boards.ie infrastructure. Our next big challenge is to squeeze some more performance out of the software boards.ie runs on [...]
