Blog

C500k in Action at Urban Airship

Editor’s Note: This post was compiled and created by Scott Andreas, who can be reached on twitter: @cscotta.

Building AirMail Push for Android’s Infrastructure

We’ve been working to develop AirMail Push for Android, our push platform, along with a supporting server-side infrastructure that can handle millions of concurrently connected devices. Like Push Notifications on Apple’s iOS platform, AirMail for Android maintains a persistent socket connection to our cloud, which lets us push out messages to devices in real-time. This brings with it all the complexity of handling hundreds of thousands of persistent socket connections server side, managing the connection on the client side as users pass in and out of coverage or switch between networks, and detecting and closing half-open sockets as quickly as possible.

This article describes the challenges we faced building out the server-side push infrastructure, how they led us to re-design and implement it using a hybrid threaded/evented/queue-based architecture based on Java’s NIO libraries, and a retrospective on this decision now that it’s live.

Lots and Lots of Sockets

In order to push a notification to a device, you need to have a persistent socket connection open to it. Our platform needed to be capable of serving millions of simulatenously-connected devices, which means that our server’s system stack and our software need to elegantly accept, read from, and push notifications out to a very large number of open connections per node. But how many? 10k? 50k? 100k? 500k? 1MM? More importantly, how many nodes do we need to handle this? If one node can only serve 10,000 clients, the venture is cost-prohibitive from the start – just one million devices would require a fleet of 100 servers.

We began our venture into uncharted territory with an implementation in Python based on Eventlet. Eventlet’s a fantastic library that provides support for concurrent task execution through lightweight coroutines. Its socket implementation allows developers to accept and manage an arbitrary number of connections with less difficulty and expensive context switches associated with threaded implementations. With it, the first version of our edge server “Helium” was born.

Eventlet helped us build a Helium whose code was incredibly straightforward, clean, and easy to understand. We love Python and are very grateful for the community of developers who are working to build libraries like this. Our initial implementation of Helium enabled us to accept and send application-level keepalives to roughly 37,000 connected clients before hitting the 1.7GB memory limit of a base EC2 instance. While pleased with this number, the amount of instances needed to meet our capacity requirements would have been prohibitively large and expensive. While we are constantly investing in our infrastructure to build a stable, secure, performant platform, we’re better programmers than we are sysadmins and prefer to write software that scales both efficiently and cost-effectively.

Things That Look Like Sea Monsters

Thus began our search to build a prototype of a more efficient socket server. We considered a variety of options, including a C/C++-based implementation wrapping [e]poll, a threaded or evented implementation in Java or Scala, a spike using Node.js, along with a few others. As expected, the purely-thread-based Java option fell over after a few thousand connections, and the C-based implementation proved too low-level to meet our evolving project requirements.

A quick experiment using Java’s NIO library (“new/non-blocking” IO) immediately turned heads, so we spent several hours fanning out that spike to include three versions:

  • A raw Java NIO implementation
  • An implementation of the same, atop the Netty library.
  • And one more using Netty, but written in Scala rather than Java.

Each of these three spikes vastly surpassed our previous results both in terms of connections achieved, and memory efficiency:

Table 1: Connections Handled Before Failure


ImplementationConnectionsMemory Used
Java Pure NIO512,000 +2.5 GB
Java w/Netty330,0002.2 GB
Scala w/Netty173,0001.5 GB

 

Table 2: Memory Efficiency per Connection


ImplementationConnectionsMemory UsedDelta
Java Pure NIO80,000 +581 MB1x
Java w/Netty80,000711 MB1.3x
Scala w/Netty80,000937 MB2.26x

Work Smarter, Not Harder

Jumping from 35,000 connections per node on an EC2 Small instance to over half a million on a single EC2 Large marked a huge win in terms of efficiency, infrastructure cost, and administration overhead — the number of connections per node jumped by nearly 1500%. (As a premature epilogue, this number held true after implementing all of our application logic, keepalives, and message passing, which is almost unheard-of in an early test).

As an interesting sidenote, the failure modes we saw in both the Java + Netty and Scala + Netty were unusual; CPU usage spiked to 100% in the Java process, with nothing output to the console. It appeared that an exception was being thrown and silently caught in a tight infinite loop.

Based on these numbers, we pressed on with a Java + Pure NIO implementation. Stay tuned – we’ll have a few posts on our NIO implementation, more info on our methodology and metrics, some things we’ve learned on the way, and all sorts of sharp corners we’ve hit our heads on that you can avoid.

More soon!

25 Responses

  1. ian at 11:31 am on August 24, 2010

    glad to see more examples of NIO in action — will be interesting to see what you all come up with in the future and how you intend to scale out after you start hitting your memory ceilings or is the current plan to just pop on a high memory quadruple and see where it takes you?

  2. Jared Kuolt at 11:43 am on August 24, 2010

    @ian: We’re playing it safe and not pushing our servers to high for a while. Eventually we’ll see about pushing them to higher and higher limits, but for now the numbers we have are so good we’re not worried if we can’t go higher.

  3. Don Park at 12:05 pm on August 24, 2010

    Great post! What happened with the node.js spike?

  4. Vadim at 12:14 pm on August 24, 2010

    For handling keep-alives you should take a look at setsockopt(s, SO_KEEPALIVE ….)

    http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO

    • schmichael at 1:12 pm on August 24, 2010

      Hi Vadim,

      I outlined why we didn’t use TCP keep-alives in a comment on another HN post:

      http://news.ycombinator.com/item?id=1594918

    • Vadim at 2:58 pm on August 24, 2010

      The fact that carriers does no deliver them is news to me,
      Thanks a lot for the info

    • schmichael at 3:49 pm on August 24, 2010

      I can’t profess to know how all carriers treat keepalives. We just ran into a case where a carrier seemed to be silently dropping them in our coverage area. No clue if that’s the norm, but I still would never rely upon them working.

    • Stephan at 5:49 pm on August 24, 2010

      Could you maybe write something about the keep-alive interval you’re using? If you choose a too small one, you drain the user’s battery and generate unnecessary network traffic (both the user’s and yours). If you choose a too large one, you run the risk of the connection being silently dropped (and some important message not being delivered as fast as it should). I’m curious how you balance out these conflicting interests and if you could maybe share any experience on the TCP timeouts you’re observing in practice.

  5. francois at 1:15 pm on August 24, 2010

    Your comment section is polluted with useless “twitter trackback”. I can’t even find the comments. Please remove this.

    • schmichael at 1:30 pm on August 24, 2010

      All the devs at UA agree. :-) We’re working on removing the Twitter integration.

  6. Cem Ezberci at 2:55 pm on August 24, 2010

    have you considered using zeromq?

  7. Carson McDonald at 2:57 pm on August 24, 2010

    This seems to be pushing the bounds of what can be done in a low memory environment pretty well. It takes me back to this C1024K test done a couple years ago that consumed 32G to run: http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3/

  8. Marc at 3:38 pm on August 24, 2010

    Quite impressive numbers but haven’t you made a slight mistake in table 2? The “delta” of scala with netty would me more around 1.6 instead of 2.26. On top of that the performance of scala+netty vs. java+netty while using less memory looks very fishy to me. Care for releasing the test code on github or writing a few comments on that one?

  9. Trustin Lee at 7:45 pm on August 24, 2010

    Would you mind if you let me know what Netty version was used? (Just in case you used 3.1.x, you should use 3.2.x.) I personally haven’t tested that far, so I am curious what’s causing the VM thrashed since Netty never swallows an exception silently.

  10. startupgrrl at 7:53 pm on August 24, 2010

    Is this over TCP? How are you getting around the 16-bit port limit?

    • Andrei at 12:28 pm on August 25, 2010

      My thoughts exactly. What happen with ~64K open sockets / system limit?

    • secmask at 8:52 am on September 1, 2010

      You can use a number of network interfaces to handle this.

  11. Benjamin van der Veen at 9:02 pm on August 24, 2010

    Thanks for writing this article. Very interesting and enlightening. I’m eager to read the follow-ups!

  12. eungju's me2DAY at 9:11 pm on August 24, 2010

    은주의 생각…

    C500K in Action at Urban Airship…

  13. Cuper Hector at 10:41 pm on August 24, 2010

    Is it possible to release your benchmark code? I would like to do the same test as we’re using Netty in our product. Thanks.

  14. Stephan at 3:10 am on August 25, 2010

    What happened to the comments on this page that are not from Twitter or HN?

  15. tsuna at 10:26 am on August 25, 2010

    Which version of Netty did you use in your tests?

    Regarding the 100% CPU use, your conclusion that “an exception was being thrown and silently caught in a tight infinite loop” seems suspicious, how did you come to it? Either you did something wrong, or there’s a bug in Netty. Either way it’d be interesting to know what caused this (and if it’s a bug in Netty, give them a chance to fix it).

    Is there any chance to get the code you tested and the code that performed the benchmark? We can’t really trust any benchmark numbers that aren’t repeatable or independently verifiable.

  16. bill robertson at 1:09 pm on August 25, 2010

    What underlying operating system are you using? Were there configuration changes require to make it work?

  17. Trustin Lee at 3:11 am on August 30, 2010

    You might be interested in this:
    Groovy + Netty = 512k concurrent Web Socket connections
    http://groovy.dzone.com/articles/512000-concurrent-websockets