Cherian Thomas

Friend | Co-founder of Cucumbertown| Cook | I make beautiful things that keep me up all day & night
28 November 2012

Post mortem of a failed HackerNews launch

Right after our TC press post we moved on to showcasing Cucumbertown to the HN community and taking feedback. Things looked fine for some time, until Murphy decided to play spoilsport and we hit ocean bottom just as we showed up on the front page.
1 hour on the HN homepage and we were out.
Pretty saddening experience. And it got me all the more pissed off for:

It’s somewhat interesting to understand what happened underneath and how we went on to fix it.
Here’s how the drama unfolded.
At about 8:30 AM PST I posted the news to HN and waited for someone to upvote. One bump, and since it was a “Show HN”, it started moving fast and was soon on its way to the front page.

The connections started increasing and things looked pretty good.

[Graph: CucumberTown | Datadog 1]

The traffic coming in brought with it people who loved food & had so much positive emotion.
Love was in the air.

[Screenshot: Select ct user to change | Cucumbertown.com Backyard]

Juggling an intense stream of mails from new members & questions on HN, I found myself singing Backstreet Boys for an hour…

Until one of our users, Trey, and an HN user shot me an IM.

[Screenshot: Gmail chat with Trey Hunner]

We started probing the graphs and it looked like the world was coming to an end.

User United857 put the final nail in the coffin with this message, and we were officially down by 11:30 PM IST. It was 404 hell.

[Screenshot: Show HN: Fork recipes on the Github for food | Hacker News]
The worst part: ssh did not work anymore. All we could do was wait for the terminal buffers to let us type one character at a time.
Eventually we got the monit alerts.
Continuous 404’s pushed us off the front page and pretty soon down the drain.

[Graph: CucumberTown | Datadog 4]
Even though HN traffic is not our direct market, posting on HN, taking feedback and learning has always been my dream. And it was…
Even though we were already pretty down & drained, and it was nearly 1 AM in Bangalore, we decided to get to the root of this.

What came out of this investigation is a lesson in carelessness. One I hope I’ll remember for some time.
So back to the train of events…
Once the domino effect started, all we could do was wait and watch the httpd & java processes fail via monit mails.
I’ll skip the detailed investigation (it’s too verbose) and go directly to the inference.
Digging through the trails later, we saw that free memory had decreased, IO wait had increased drastically, and immediately after that Solr hung. Apache piled up the requests and stopped responding. This led Nginx to return 404’s. For 20-30 minutes the whole thing was a black box to us.
Eventually Solr crashed, releasing some memory. This let us restart the services. But the damage was done by then.
Going through the logs one by one, the culprit eventually showed up in dmesg.
[Screenshot: terminal on the Cucumbertown production server (tmux)]
So the root cause analysis narrowed down to two questions: why did free memory drop so low, and why did IO wait go up so high?
Luckily for us everything was graphed, and soon enough we saw that people were coming in directly and searching heavily. Solr usage went very high and it started evicting. We did not anticipate this. Pretty bad!
To make matters worse, the high usage and memory consumption sent the OS into swap. Even then, this should have been handled. The OOM messages in dmesg led us to the details, and that’s when we realized Linode offers only a max of 512 MB swap by default. We had gone with it, and since memory + swap filled up pretty quickly, the system was OOMing.
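For anyone wanting to keep an eye on the same two numbers, here is a minimal sketch (not what we ran at the time) that reads a meminfo-style file and prints how much RAM and swap are still free. The function name mem_headroom is made up for illustration, and MemFree ignores page cache and buffers, so treat the total as a pessimistic estimate.

```shell
# Print free RAM, free swap and their sum in MB.
# Usage: mem_headroom [FILE]   (FILE defaults to /proc/meminfo)
mem_headroom() {
  awk '
    $1 == "MemFree:"  { mem  = $2 }   # /proc/meminfo reports values in kB
    $1 == "SwapFree:" { swap = $2 }
    END {
      printf "mem_free_mb=%d swap_free_mb=%d total_mb=%d\n",
             mem / 1024, swap / 1024, (mem + swap) / 1024
    }
  ' "${1:-/proc/meminfo}"
}
```

When total_mb heads towards zero the OOM killer is not far away. After the fact, something like `dmesg | grep -i "out of memory"` is one way to find the kernel's OOM lines, though the exact wording varies between kernel versions.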


But between failures we noticed a pattern that shouldn’t have been there.
Apache requests were moving in sync with Nginx requests.

[Graph: CucumberTown | Datadog 5]
Cucumbertown is a heavily cached site. Anonymous requests are all HTTP-cached at the load balancer level & only the logged-in ones hit the backend web server.

Digging further, we started seeing this (below). For every Nginx hit there was an equivalent Apache hit. But this was only for the homepage.

[Screenshot: terminal on the Cucumbertown production server (tmux)]
Since the homepage is query-heavy, this had a heavy impact on the underlying infrastructure. In fact, you could very well conclude that the homepage hits were the reason we went down. (The explore items that you see on the homepage are all in fact a manifestation of Solr searches.)

[Screenshot: Cucumbertown]
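A rough way to check for this pattern on your own setup is to count homepage requests in the proxy's access log and in the backend's access log and compare the two. The sketch below assumes both servers write the common/combined log format; the function name and the log paths in the comments are illustrative, not taken from our setup.

```shell
# Count requests for exactly "/" in a common/combined-format access log.
# Matches the quoted request field, e.g. "GET / HTTP/1.1" or "GET / HTTP/1.0".
count_homepage_hits() {
  grep -c '"GET / HTTP/' "$1"
}

# Example (paths are assumptions; adjust to your distribution):
#   count_homepage_hits /var/log/nginx/access.log
#   count_homepage_hits /var/log/apache2/access.log
```

With a working cache the backend count should be a small fraction of the proxy count. If the two numbers track each other, the cache is being bypassed.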
This was surprising. We had our pages cached as follows.

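location = / {
 proxy_pass http://localhost:82;   # backend web server
 proxy_cache cache;                # cache zone, defined elsewhere in the config
 proxy_cache_key $cache_key;       # $cache_key is set elsewhere in the config
 proxy_cache_valid 200 302 2m;     # keep 200 and 302 responses for 2 minutes
 proxy_cache_use_stale updating;   # serve the stale copy while it is being refreshed
}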

Things should have been fine. The show recipe pages were.
A curl request exposed the blunder.

Cherian-Mac: cherianthomas$ curl -I
HTTP/1.1 200 OK
Server: ngx_openresty
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Keep-Alive: timeout=60
Vary: Accept-Encoding
Vary: Cookie
Set-Cookie: csrftoken=Dl4mvy4Rky7sfZwqek27hFrCXzWCi9As; expires=Fri, 18-Oct-2013 02:15:32 GMT; Max-Age=31449600; Path=/
X-Cache-Status: EXPIRED

I’ll leave it to the reader why X-Cache-Status always returned expired.
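If you want to run the same check against your own pages, here is a small sketch. By default Nginx’s proxy cache does not store a response that carries a Set-Cookie header, so that is the first thing worth looking for in the output of curl -I. The helper name check_cacheable and the example URL are made up for illustration.

```shell
# Read HTTP response headers on stdin and report whether a Set-Cookie
# header is present.
# Exit status: 0 = no Set-Cookie seen, 1 = Set-Cookie present.
check_cacheable() {
  if grep -qi '^set-cookie:'; then
    echo "not cacheable: response sets a cookie"
    return 1
  fi
  echo "no Set-Cookie header found"
  return 0
}

# Example (placeholder URL):
#   curl -sI http://example.com/ | check_cacheable
```

This only looks at one header. Cache-Control and Expires from the backend also influence what Nginx stores, so a clean result here is necessary rather than sufficient.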

And now, if like us you are running a machine with very little swap, create a secondary swap file with sufficient space and a lower priority. This will drastically reduce your chances of an OOM.
Here’s how to do it.

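cd /
sudo dd if=/dev/zero of=swapfile bs=1024 count=2097152   # 2097152 x 1024 bytes = 2 GiB
sudo chmod 600 swapfile                                  # swap can hold private memory pages
sudo mkswap swapfile                                     # format the file as swap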

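Change count to suit the size you need. It is the number of 1024-byte blocks, so 2097152 gives 2 GiB.
Add this line below the existing swap line in /etc/fstab. Swap areas without an explicit priority get a lower priority the later they are activated, so the new file stays secondary.

/swapfile swap swap defaults 0 0

Then enable the new swap and check that it is listed

sudo swapon -a
swapon -s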
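To confirm that the new file really ended up as the lower-priority area, you can read the kernel’s swap table directly (swapon -s prints the same data as /proc/swaps). This is a sketch only; swap_areas is a hypothetical helper name.

```shell
# Summarize swap areas from a /proc/swaps-style file.
# Columns in that file: Filename Type Size(kB) Used(kB) Priority
# Usage: swap_areas [FILE]   (FILE defaults to /proc/swaps)
swap_areas() {
  awk 'NR > 1 {                       # skip the header row
    printf "%s size_mb=%d priority=%s\n", $1, $3 / 1024, $5
  }' "${1:-/proc/swaps}"
}
```

The kernel uses the area with the higher priority number first, so an original swap partition at -1 is filled before a swap file at -2.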

Additionally add the following two lines to your /etc/sysctl.conf

vm.panic_on_oom=1
kernel.panic=15

The vm.panic_on_oom=1 line makes the kernel panic on OOM.
The kernel.panic=15 line tells the kernel to reboot fifteen seconds after panicking.
In the worst case, the machine reboots instead of staying hung for a long time.
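Editing /etc/sysctl.conf does not change the running kernel until the file is reloaded (sudo sysctl -p) or the machine reboots. Here is a minimal sketch for checking that vm.panic_on_oom and kernel.panic are actually live; sysctl_is is a made-up helper, and its third argument exists only so it can be tried against a scratch directory instead of /proc/sys.

```shell
# Succeed (exit 0) if the running kernel's value for KEY equals EXPECTED.
# Usage: sysctl_is KEY EXPECTED [ROOT]   (ROOT defaults to /proc/sys)
sysctl_is() (
  key=$1 expected=$2 root=${3:-/proc/sys}
  # vm.panic_on_oom lives at /proc/sys/vm/panic_on_oom, so map "." to "/"
  path="$root/$(printf '%s' "$key" | tr . /)"
  [ -r "$path" ] && [ "$(cat "$path")" = "$expected" ]
)

# Example, after running `sudo sysctl -p`:
#   sysctl_is vm.panic_on_oom 1 && sysctl_is kernel.panic 15 && echo "settings are live"
```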

Graphs courtesy DataDog.

  • pseut

    Bro, writing “+ve” is totes unprofessional.

    • somebodythatiusedtoknow

      And writing “totes” isn’t?

      • goyte


      • john dough

        pseut didn’t write the article.

      • pseut

        Context, cuz, context.

        • jmgunn87

          in this case a socket context HAHAHAHA(hilarious douche’)

      • jmgunn87


  • what-where-you-thinking

    Eeeeh, cache with Varnish?

    • Nginx caching is actually quite a robust caching proxy, and the performance is surprisingly good. If your cached data is going to fit in the filesystem memory cache on the os level, then it is about as fast as malloc varnish (and will outperform varnish using disk caching). If otoh you have a huge number of cached files, varnish is going to pull away from nginx, as nginx will end up disk seek bound, even if you have an ssd.

  • zNetOps

    Zynga clearly didn’t teach you 🙂 We do quite a lot of this separation of service.

  • Pat

    “I’ll leave it to the reader why X-Cache-Status always returned expired.”

    How about you just tell us?

    By not telling the reader you have enormously decreased the value of this article.

    • Hi Pat,

      If you look at the headers you’ll see “Set-Cookie: csrftoken=Dl4mvy4Rky7sfZwqek27hFrCXzWCi9As; expires=Fri, 18-Oct-2013 02:15:32 GMT; Max-Age=31449600; Path=/”

      Nginx does not cache any response that has a Set-Cookie, unless you specifically want to hide cookie headers (“proxy_hide_header Set-Cookie”. You shouldn’t do this normally)

      Now the Set-Cookie was happening to everyone (including anonymous users) since the login form required a csrf token.

      This broke caching for all users.

      • Matthew

        Your text flows outside of the container and everything past “ex” in the Set-Cookie header is hidden from view, so I can’t see it for myself.

  • using Linode was part of your issue – though not understanding both JVMs and swap (and even more how JVM relies on swap) in a Linux context has bitten many professional sysadmins I’ve run into in my career

    glad you got it resolved, though 🙂

    • What’s wrong with using Linode? I hope you’re not trying to say they should be using EC2 or whatever other “cloud” service the cool kids are using these days.

      • Even if he is, you shouldn’t take it against him. Both scenarios has their valid cases.

      • AWS is just a more expensive VPS service.
        One should use a dedicated server instead.

        • Dedicated hardware has its place – but VPS offerings have theirs, too.

          The VPS-iness isn’t the problem: it’s not understanding the app

      • Linode, per se, has nothing wrong with it. The problem was starting on something far too undersized to be running dynamic content – especially when utilizing Java.

        Cloud/VPS providers are no different, in this type of instance, from physical hardware. However, starting at something so small and not expecting to need to boost its resources quickly was a major miscalculation.

  • jmgunn87

    Forgive me, why use apache at all? why not nginx for all your web servers? I think you would have been able to handle more of the load without using Apache…

    • jmgunn87

      IMO nginx and php5-fpm would have prevented this.

      • jim

        except that apache solr is what he’s referring to

        • John

          Solr runs on Tomcat not apache… Solr is just an Apache project it doesn’t actually RUN on apache LOL… FYI

    • Apache has an incredible eco-system we didn’t want to miss. E.g. Google’s mod_pagespeed is an apache module and we use it

      • john

        so you keep mod_pagespeed at the expense of your site…?

        • jmgunn87

          yeah that’s a weird one. Still you can only learn from experience and my experience is that apache (with or without pagespeed) cannot handle concurrency.

  • zaza

    use sphinx not solr -problem solved =D

  • exZyngite

    I’m glad you learned a lot from the downtime, but I have to say there is apparently one important lesson from Zynga you didn’t learn! Never use swap for anything except a database. Running out of memory is ok for web servers, it means they fail fast and completely. You can always spin up more or use instances with larger memory. Swap means your memory hogging process grinds to a halt, and often the entire machine is unresponsive, so you can’t do anything to it. You might not even be able to shut it down.

    • laurikari

      Yes! If a server goes to swap-land, it’s pretty much dead in the water already.

      Failing some requests (and perhaps serving a failwhale page) is better than timing out all requests.

    • you don’t know much about swap wrt to JVMs, then