Building a reliable back end for Balloons!

2009 October 22
by Dave Verwer

So I have spent the last few months building our new iPhone app, Balloons!

Obviously it needs a back end web service to move all of the balloons around the skies and host all of the attached messages. The back end is (predictably) written in Ruby on Rails and we just finished moving the hosting for it to new Brightbox[2] VPSs ready for the launch. When we do go live, server reliability and the ability to scale will be pretty important. After all, we are hoping it will be successful, and for every copy downloaded (including the free app) our servers get hit a little harder.

Now I don’t have a huge amount of experience building high-availability sites, but a few years ago in a different life (working with .NET & Windows[1]) I was working on a project which needed to be extremely reliable and was also expecting a large user load on day one. This meant building a reliable database cluster and load-balanced web servers. With help we did it, and it worked, but it was incredibly expensive and complicated. It took months of planning and weeks of work to build and deploy, and left me with a healthy fear of doing anything like it again.

So it was a very pleasant surprise this week when I started looking at load balancing for the Balloons! servers and found out that not only was it really cheap (£19/mo) but it also only took about 10 minutes to set up. Combine that with the use of the Brightbox MySQL cluster rather than a locally hosted DB server and, with a few simple tweaks to our Capistrano deployment scripts, I was able to get a good level of reliability for our back end for a trivial amount of extra cost and effort. (I halved the size of our VPS but doubled the number of machines, so the cost and capacity were roughly the same.)

So yesterday Brightbox had some pretty major downtime, with a SAN failing in what seems to be a fairly epic way. Luckily it didn’t affect any of my servers, but if it had, Balloons! would not have been unavailable for more than a few seconds, because we chose to load balance and Brightbox (very sensibly) offered to put our virtual servers on different host machines and also different SANs.
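
To make the "few seconds" idea concrete, here is a toy sketch of what a load balancer does when one machine drops out. It is not how the Brightbox load balancer is implemented (I don't know its internals); the `Backend` and `RoundRobinBalancer` names and the `web1`/`web2` server names are made up for illustration, and a real balancer would set the `healthy` flag itself from periodic health checks.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Backend:
    """One web server behind the balancer (hypothetical model)."""
    name: str
    healthy: bool = True  # a real balancer updates this from health checks


class RoundRobinBalancer:
    """Hands requests to backends in turn, skipping any marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._ring = cycle(self.backends)

    def pick(self) -> Backend:
        # Look at each backend at most once per request.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


if __name__ == "__main__":
    web1, web2 = Backend("web1"), Backend("web2")
    balancer = RoundRobinBalancer([web1, web2])
    print([balancer.pick().name for _ in range(4)])  # alternates between the two

    web1.healthy = False  # pretend the SAN behind web1 has failed
    print([balancer.pick().name for _ in range(4)])  # everything goes to web2
```

The outage window in this model is just the time it takes for the failed machine to be marked unhealthy, which is why putting the two servers on different hosts and different SANs matters: one failure should only ever take out one of them.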

A quick look at Twitter showed that people are naturally pretty upset about such a large amount of downtime, and some people were talking about moving hosting providers. That is of course their prerogative, but moving to another hosting provider is no protection against disasters like this, because stuff like this happens to every hosting provider from time to time. As I understand it, it was not due to anything other than bad luck, with several disks failing at the same time.

The point I am trying to make is that if your site is important enough for you to lose money/reputation from it being unavailable for more than a few minutes, then building an amount of reliability in is your responsibility. No single server is ever going to be 100% reliable on its own, no matter how good the hosting provider. The low cost and complexity of doing it these days should make it an easy decision.
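
As a back-of-the-envelope sketch of that argument, the arithmetic below shows how much a second machine helps. The 99.5% figure is a made-up example, not a Brightbox number, and the calculation assumes the two servers fail independently (hence the different hosts and SANs). It also ignores anything both servers share, such as the load balancer itself or the database cluster.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, servers: int) -> float:
    """Chance that at least one of `servers` independent machines is up."""
    return 1 - (1 - single) ** servers


def downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.995  # hypothetical availability of one server
    for n in (1, 2):
        a = combined_availability(single, n)
        print(f"{n} server(s): {a:.6f} available, "
              f"about {downtime_hours(a):.2f} hours down per year")
```

Under those assumptions one server is down for roughly 44 hours a year, while a pair is down together for well under an hour. Real failures are rarely perfectly independent, so treat this as the optimistic end of the range.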

Now let’s just hope we sell enough copies of Balloons! to really put it to the test :)

  1. This is not a rant about Windows, and I am sure it is much easier now to load balance stuff and build a SQL Server cluster than it was then. I would imagine the cost is still a few bands up from how we have done it for Balloons! though.
  2. In the interest of disclosure, most of the team at Brightbox are friends of mine, so take this with the appropriate amount of bias, although it was not intentionally written with any.
4 Responses
  1. 2009 October 22

    Hey,

    I use exactly the same set up for the McSweeney’s app (http://iphone.mcsweeneys.net) and one of my boxes *was* affected by the mega-outage yesterday. The load balancer kept things running though :)

    Russell.

  2. 2009 October 22
    Matt

    I’m a Brightbox customer and am relatively happy, but this post isn’t entirely correct. You’re trying to say that those affected by yesterday’s downtime hadn’t built in any reliability to their setup. The device that failed was a SAN, which is a measure of reliability and redundancy on its own. It’s not a single server or a single hard drive. Brightbox provide this by default, so everybody complaining had taken steps to build in reliability by choosing Brightbox in the first place.

    By load balancing you’ve doubled up the existing reliability level, which is obviously a good thing but I am assuming you didn’t need to share storage between the web servers, which introduces an additional cost that isn’t in any way an easy decision.

    The same concept could be taken to the MySQL Cluster. It’s reliable and has redundancy by default, but if that fails you’ll be down like everyone else, no matter how many load balanced web servers you have, and the cost to remedy that isn’t an easy decision either.

    These things happen to every hosting provider (trust me I know :(), and as a new business you are exposed to them until you can fork up the cash to make sure it doesn’t happen. In the meantime you just have to hope you’re lucky. No hosting provider can provide you with 100% reliability until you can provide them with the ££££.

    So ultimately I’m still not saying it’s Brightbox’s fault (these things happen) but your point is misguided and looks from a pretty poor perspective about the people complaining. In fact you’re still in the same boat as them. If you want to really say you’ve got 100% reliability, you need a completely duplicated setup in another data centre ;)

  3. 2009 October 22

    @Russell Very nice app, love the design.

  4. 2009 October 22

    @Matt

    > You’re trying to say that those affected by yesterday’s
    > downtime hadn’t built in any reliability to their setup.
    > The device that failed was a SAN, which is a measure
    > of reliability and redundancy on its own. It’s not a
    > single server or a single hard drive. Brightbox provide
    > this by default, so everybody complaining had taken
    > steps to build in reliability by choosing Brightbox in
    > the first place.

    I absolutely am not saying that you didn’t build reliability in, what I am saying is that you made a choice as to what level of reliability was appropriate for the service/content/whatever that you are hosting on that box.

    Brightbox use redundant disks inside a SAN and a redundant MySQL cluster. This is a level of redundancy, and you chose it over a cheaper option which did not do this. Good for you… and you already went further than many do. But, consciously or subconsciously, you chose that it was not worth the money to increase that reliability with load balancing, just as I have chosen not to replicate my entire hosting platform somewhere else.

    I have invested what I believe to be an appropriate amount of money to get the best reliability I could for the service I am trying to make available. So did you and we just made a different decision. What I am saying is that was your decision, just like it was mine to not invest in a backup hosting provider if the Brightbox data centre slides into the sea one day.

    > So ultimately I’m still not saying it’s Brightbox’s fault
    > (these things happen) but your point is misguided
    > and looks from a pretty poor perspective about the
    > people complaining.

    I did not mean to judge the people complaining, I even said that they were naturally upset. I would have been upset too if I were in that position. I just said they knew the world we live in is not perfect and that “shit happens” and they made a decision about what was an appropriate level of redundancy for the data/service they were protecting.
