By Scott Kveton • February 22nd, 2012 • Posted in Best Practices, Company, Events, Operations, Uncategorized
Startups are some of the most amazing organizations on earth. You come together with a group of people to create something from nothing. It’s stressful, insanely difficult and it consumes you completely. Maintaining momentum and focus, as well as keeping it all together during the good and the bad, is the role of culture at your company.
I’ve worked at a lot of different companies over the years and have seen a lot that has worked and quite a bit that hasn’t. I wanted to take a chance to talk a little bit about the culture we’ve created at Urban Airship.
In the Beginning
When Adam, Michael, Steven and I started Urban Airship we knew we wanted to build a company that promoted transparency, one where people work hard because they love what they do and have a fantastic time doing it. Our first office was a shared space with other startups and we occupied two desks. Yeah, that’s four people and two desks. To this day we don’t have any private offices and it’s one of the reasons we moved into the wide-open space of our current office last year.

Urban Airship founders Steven Osborn, Scott Kveton, Adam Lowry and Michael Richardson
Meetings: Stand and Deliver
I hate meetings just for the sake of meeting, but they are necessary. We have a couple of standing meetings but my favorite ones are the weekly Monday morning stand-up and the Friday Happy Hour.
On Monday morning at 10am the entire company gathers together for a five minute stand-up meeting. We talk about the week ahead, introduce new people, mention guests visiting the office, etc. This is a great way to set the tone for the week.

An engineering standup meeting at Urban Airship
On Fridays at 4pm we have Happy Hour (you have a keg at your office, right?!). Friday is a chance to look back at the week. We do this with ad hoc presentations from different parts of the company with video conferencing linking our PDX and San Francisco offices. Engineering might share how they are scaling push on iOS. Sales will talk about a specific deal and what it took to get it closed. Marketing might preview the latest campaign to drive new leads. It’s also a great time to talk about any general issues that might be surfacing. There isn’t a better time to share with each other than over a good Portland beer…or a PBR, as is often the case.

Open house party at our new San Francisco office
Transparency…for Real
I’m going to say it even though I realize it’s beyond cliche. We’re pretty damn transparent. Tell me this, do the people at your company know about the different parts of the business and how they work together? Do they understand how leads fill the pipeline? The relationship between marketing and sales? The challenges with scaling infrastructure? Current revenue? Burn? Cash on hand? Outstanding shares? From day one we worked hard to do just that and I share all of this and more during our weekly Happy Hour meetings.
There is No Fence
I’m sure you’ve worked at companies before that have a fence. Maybe it’s a fence between product and Q/A. Sales and marketing. Finance and the rest of the company. Yeah, we don’t really do that here. A really good company knows that all of those parts of the business have to work together in order to be successful. There’s no “well, that’s not my job” attitude at UA. You roll up your sleeves and get done what needs to get done. If something is broken like a process or we have an outage, we address it in a way so as to learn from the mistake.
How do we accomplish this? I think we make it clear across the board that we’re working as a team here. While I may be the CEO, I still take direction from the board, our advisors and the rest of the team. Same goes for product, marketing, sales, business development, ops, admin and finance. A well-oiled team that collaborates effectively can move mountains, and we do everything we can to continue that trend.
Free Friday
Once a quarter we do something called “Free Friday.” Steven saw Atlassian’s success and we decided to mimic it.
Free Friday starts on Thursday at 4pm and employees are encouraged to work on anything they want. Seriously. Anything. The only requirement is that you have to be prepared to share what you worked on starting at 4pm the next day (and yes, this maps conveniently to Happy Hour). You can work individually or in groups. The entire company is invited to participate and we witness everything from building games to building out new features for the platform. Some of our best new ideas for features came because of a coding spike done to prototype something. My personal favorite was what I worked on during the last Free Friday–the Urban Airship BBQ. That totally counts.

Know the KPIs
What are the real Key Performance Indicators (KPIs) for your business? For a long time we thought it was the number of notifications we were sending. While interesting, and certainly an indication of growth, we view the total number of pushes being sent as just one of the KPIs.
We have a big board in both of our offices (inspired by the one at Panic just down the street) that displays our real-time KPIs across the entire company. It shows more than just uptime and number of notifications: metrics like revenue, pipeline, incoming leads and average time to close tell us whether we are on target with our plan. These are real KPIs and they live on our big board, which everyone passes by several times a day.
As a joke back in 2010 I got a custom engraved bell that hung near the sales team. It says “Urban Airship – ring in case of sale.” We put it up and rang it from time-to-time but then it actually became something that we did for big deals. It was about closure for a sometimes hard-fought deal or one that was really going to move the needle. Now when people hear the bell ring the ENTIRE company knows that we just won a significant deal that will continue to move the ball forward.

Don’t Skimp on your Office Manager
We have Barb. Barb kicks ass. Barb joined the company early on (employee #7, I believe) and she came in to handle all of the little things. That’s pretty early considering it was me, one inside sales guy and a handful of engineers. I know people that wait much longer to make this hire but I’d never recommend that.

Akasha and Barb keep the airship flying high
A good chunk of the executive team travels and having someone that is there every single day with their finger on the pulse of the team is critical. There are countless times when Barb has said “the boys look a little stressed out, we should do something nice,” which then leads to an off-site, a scavenger hunt, impromptu parties and a general continuity that is so important to stress-heavy startups.
When we acquired SimpleGeo, Barb migrated to SF for two months to get the office in order and to help integrate the new team members into our culture.
DO NOT SKIMP ON YOUR OFFICE MANAGER. Sure, you could get somebody cheap, fresh out of college to help with AR/AP, keep the fridge stocked with drinks and greet people at the door, but don’t bother. Go the extra mile and pay a little extra to get someone that knows how to run an office. Your office manager should be the right-hand person to whoever is running Ops or to the CEO.
Perks Don’t Cost You All that Much
We started catering lunches over a year ago. We did two days a week to start and now do three days a week in the office, which still leaves time for our team to explore their love for the PDX food cart scene. Meals are healthy, diverse and always include a vegetarian/vegan option. Sales mingles with marketing, who mingles with engineering. This is a good thing, and while it costs us $10/head/meal it’s well worth it. Let’s do a little math.
UA Lunch Timelapse @15fps, sound from Steven Osborn on Vimeo.
In our Portland office we have 53 people. Lunch is served promptly at noon and people are usually back at their desks by 12:40pm at the latest, and they aren’t rushed at all. Twenty minutes x 53 people is nearly 18 hours of saved time a day. The cost for that meal? $530. For the week we get:
$530/day x 3 days = $1590/week
20 minutes x 53 people x 3 days = 53 hours saved/week (or 1 hour saved per employee per week)
That adds up and while this is a “perk” for the team it happens to turn a fantastic ROI for the company, both in real dollars and especially culturally.
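For anyone who wants to rerun the lunch math with their own headcount, here is a minimal sketch. The figures are the ones quoted above; the function and variable names are made up for illustration.

```python
# Figures quoted in the post; names are illustrative only.
HEADCOUNT = 53
COST_PER_HEAD = 10        # dollars per person per meal
MINUTES_SAVED = 20        # minutes saved per person per catered lunch
LUNCH_DAYS_PER_WEEK = 3


def weekly_lunch_summary(headcount, cost_per_head, minutes_saved, days):
    """Return (weekly cost in dollars, hours saved per week, hours saved per person)."""
    weekly_cost = headcount * cost_per_head * days
    weekly_hours = headcount * minutes_saved * days / 60
    return weekly_cost, weekly_hours, weekly_hours / headcount


cost, hours, per_person = weekly_lunch_summary(
    HEADCOUNT, COST_PER_HEAD, MINUTES_SAVED, LUNCH_DAYS_PER_WEEK
)
print(f"${cost}/week, {hours:.0f} hours saved/week, {per_person:.0f} hour per employee")
```

Swap in your own headcount and per-meal cost to see whether the trade looks as good for your team.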
We don’t have a vacation policy. It’s something we borrowed from Netflix. You take the time that you need when you need to take it and you put it on the shared calendar for the entire company to see. People know when you’re going to be gone and we can plan accordingly around gaps and releases. If people abuse this policy (and no one has) then there are bigger issues to be dealt with.
Play Just as Hard as You Work
Luau party. The People of Walmart party. The all-day scavenger hunt scouring Portland for all things geeky or otherwise. Hiring a choreographer to come in and teach the team the Thriller dance to do during our Halloween party. Getting a REAL Santa to come in for the kids of the company. They are fun, they create culture and they bind the team together.
Urban Airship Zombie Ball Dance from Jason Grigsby on Vimeo.

The blueprint for our day off in PDX

Airshippers on the road
You’re going to spend a full third of your life working. Why not do it at a place that challenges you with hard problems, lets you have an insane amount of fun and has a real business model in a burgeoning space? While these are the tactical elements that make up the Urban Airship culture I think the ethos of “work hard, play hard” is at the root of who we are. If you get a chance to stop by the UA offices in SF or PDX I’d be happy to show you what it looks like in practice and you’ll see for yourself that Urban Airship really is a special place to work.
Oh, and we’re hiring. 
By Mike Herrick • February 3rd, 2012 • Posted in Best Practices, Company, Developer, Industry, Operations
This is the first in a series of posts that explores some of the things we’re doing behind the scenes in Urban Airship Engineering. Over the next few months, members of the engineering team will offer insight into how we operate, lessons we’ve learned, open source projects we have created and some of the challenges we face serving hundreds of millions of instances of apps connecting to our services every day.
Culturally and from a process standpoint, Urban Airship Engineering is focused on learning and adapting through continuous improvement. We wish we were smart enough to have invented these techniques, but luckily we didn’t have to. The way we conduct product development day to day is an adaptation of ideas in various books, including the Poppendiecks’ books on Lean Software and David Anderson’s Kanban.
“Value Stream” is a term that originated with Lean Manufacturing meaning to analyze and design the flow of materials and information required to bring a product or service to a consumer. The whole idea is to drive out waste, deliver fast, build quality in, and engage all of the people involved in the value stream so that great products emerge. Urban Airship has had a lot of success with Kanban and Kaizen. Don’t let those words scare you off if you don’t speak Japanese; they just mean “billboard” and “change for the better,” respectively. This post explores our high-level history in applying these concepts and how we’ve begun to scale our process across our SF and PDX offices as necessitated by bringing Urban Airship and SimpleGeo together as one company last October.
Model What’s Happening Now and Let Improvement Emerge
We started with Kanban and Kaizen when I joined Urban Airship in October 2010. The engineering team at the time was following an adaptation of Agile/Scrum and while it was producing results, it wasn’t working as well as we wanted it to in serving all the needs of a venture-backed startup. We adapted our operating model to a starter version of what is described in Anderson’s Kanban book. A great thing about Kanban is that starting is easy; you just model the value stream as it is and define work-item types that are meaningful to the team (e.g., Bug Pack, Minimum Marketable Feature, Business Enablement, Refactoring/Technical Debt). Each work-item that is in process, planned, or recently completed gets hung up on a physical board with lanes that indicate the current value stream.

- Airshippers discussing the original set of work-items and where they fit in the value stream in October 2010

- The original wee Kanban board at Urban Airship’s previous office above PIE
Every morning we have a Product Development Boardwalk. This is a standup style meeting where we walk through every work-item on the board in no longer than 15 minutes. We rotate the facilitator of this meeting every day to keep it fresh and make sure that everyone knows the process well enough that they can lead the team through it. This establishes shared context for everyone on what is happening in the value stream. It’s amazing what smart people can achieve together when they have the same context! Physical Kanban boards are still unrivaled for enabling this.
We tweaked our value stream a bit here and there as we learned from successes and failures. We introduced new work-item types, changed WIP limits, added exit criteria, added lanes, split lanes etc. We broke out a separate board right next to Product Development for Customer Development & GTM (Go-to-Market). We made these changes based on insights from our monthly Operational Review & Retrospective and from 5 Whys from production incidents and other defects.
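If it helps to see the moving parts, here is a minimal sketch of a board with lanes, work-item types and WIP limits. It is a toy model for illustration, not the tooling we use: the lane names, the limit values and the class names are all assumptions, and only the work-item types come from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Lane:
    name: str
    wip_limit: Optional[int] = None          # None means the lane has no limit
    items: list = field(default_factory=list)

    def has_room(self):
        return self.wip_limit is None or len(self.items) < self.wip_limit


class Board:
    """A value stream modeled as named lanes (hypothetical structure)."""

    def __init__(self, lanes):
        self.lanes = {lane.name: lane for lane in lanes}

    def add(self, item, lane_name):
        lane = self.lanes[lane_name]
        if not lane.has_room():
            raise ValueError(f"WIP limit reached in {lane_name!r}")
        lane.items.append(item)

    def move(self, item, src, dst):
        # Refuse the move before touching either lane so the board stays consistent.
        if not self.lanes[dst].has_room():
            raise ValueError(f"WIP limit reached in {dst!r}")
        self.lanes[src].items.remove(item)
        self.lanes[dst].items.append(item)


# Lane names and limits are invented for this example.
board = Board([Lane("Planned"), Lane("In Progress", wip_limit=1), Lane("Done")])
board.add(("Bug Pack", "fix login crash"), "Planned")
board.add(("Minimum Marketable Feature", "rich push"), "Planned")
board.move(("Bug Pack", "fix login crash"), "Planned", "In Progress")
```

With the limit of one on "In Progress", trying to move the second card into that lane raises an error until the first card moves on, which is the whole point of a WIP limit: finish work before starting more.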
A Year Later: SimpleGeo Acquisition, Investment and Development Partners
In October of 2011, Urban Airship acquired San Francisco-based SimpleGeo, took on a new round of financing and executed some major business agreements that would cause the company to grow even more quickly than it was already. In order to keep work-items flowing, it was time to begin to scale our value stream.
Urban Airship is a very in-person company. In order to make our value stream work across offices, we decided to replicate it in both locations. To achieve the same feel we had when it was just PDX, we now alternate facilitating offices every week. Individuals in the facilitating office also rotate every day. We’re still working out the kinks to be sure, but it’s working and is a lot of fun.
Our Kanban board was showing its age even before these events occurred. We had grown engineering four-fold and with that it was getting harder to see everything and achieve shared context. In order to address this and begin to scale into small sets of focused teams, we introduced horizontal lanes for each major part of our product line. We have kept the meeting wide open to anyone who wants to participate or observe, but now just the engineering leadership team (leaders of functional areas, program managers, product managers, tech leads, team leads, etc.) attends the Product Development Boardwalk every morning. Many of these leaders maintain what we refer to as a “Zoom In” Kanban board, which is focused on a specific functional area or work-item that rolls up to the main Product Development board.

The Product Development Kanban board in PDX (left wall) and Polycom unit with the replica board in SF

The SF Polycom unit displaying the PDX Kanban board

A real live Product Development Boardwalk meeting with Wade Simmons from SF facilitating via Polycom

The SF Product Development board (notice the glare from the warm California sun)

An example “Zoom In” Kanban board (far wall) in our Messaging Feature Room in PDX
Looking Forward
Change is constant at Urban Airship. As we continue to grow our team, scale to a billion devices using our services and build new features and products, we’ll keep making adaptations to how we operate in Urban Airship Engineering.
Tools / Technology
We use the following tools and technology to power our Kanban:
- The biggest baddest magnetic white boards we can find
- The most powerful magnets we can find
- Magnet pictures of our people
- White board markers
- Colored note cards (indicate different classes of service)
- Colored markers (to indicate different work-item types, e.g., Minimum Marketable Feature, Bug Pack)
- Polycom to power the video conferencing between offices
- Google Hangouts and Skype for Zoom In video conferencing
- LeanKitKanban – web-based Kanban board we use to help sync the SF and PDX boards. It also produces some great metrics and graphs that we use to understand what is happening across product development (e.g., Cumulative Flow Diagram; Cycle Time per lane, per work-item type, etc.; Card Distribution Diagrams; Efficiency Diagrams; and even a Process Control Diagram)
- JIRA – we use it to track details of work-items and as the electronic record we reference in source control, etc.
Sound like a value stream you’d thrive in? We’re hiring in SF and PDX!
Urban Airship Engineering hires people with opinions who care deeply about their work, technology, the products they build and making a huge impact on the market. Every person on our team is asked to be part of the solution and to contribute a ton.
Come join us: http://urbanairship.com/company/jobs
By Mike Herrick • July 7th, 2011 • Posted in Android, Operations
Editor’s Note: This post was compiled and created by Scott Andreas, who can be reached on Twitter: @cscotta.
Exploring the Durability of IP Connections from Android Devices
We see a lot of phones each day. Our Helium messaging platform serves hundreds of different models of Android devices from over 5100 global network providers in 208 countries via links running the gamut from GPRS to 3G to satellite. In order to maximize reliability and deliverability across our network, we’re continuously analyzing the behavior of our systems and the data available to us about devices in the field. Recently, we’ve taken steps to automate a more thorough analysis of these logs to understand how network interruptions impact individual devices and our system as a whole. This gives us some insight into what’s happening on these devices and the networks to which they connect.
Connection Durability as a Metric
Carriers and industry analysts have conducted many studies on the speed of mobile networks, dropped call stats, and coverage. In this post, however, we explore a different dimension called connection durability. Connection durability refers to the average duration of a mobile IP connection, or more precisely, the average number of times a device’s data connection reconnects throughout the day. While such irregular blips are unlikely to interrupt web browsing or Twitter-checking unless they occur during it, these blips do affect the reliability of background services such as sync and messaging. As these services are the lifeblood of a mobile device, it’s worth looking into what happens over the course of a tumultuous day on a mobile network.
What factors affect connection durability?
Several factors combine to result in low-quality or short-lived network connections. You’re probably familiar with many of them: walking into an elevator, taking the subway, moving away from the windows inside a large building, or descending to the basement. CDMA devices suspend their data connection each time a phone call is made. All mobile networks have dead spots. Many devices using WiFi or WiMax connections will aggressively manage the link, shutting it down as often as possible. Switching between WiFi and 3G/EDGE connections also triggers a temporary drop. Task Killer apps trigger a similar effect, interrupting a connection until the service restarts. While the quality of a network may appear very good while you’re using it, most mobile phones take a silent beating over the course of the day as connectivity fluctuates.
To better understand this fluctuation, we’ve analyzed the server-side logs generated by the connection activity of a slice of one million devices on our messaging cluster. After plotting a global baseline, we’ve analyzed this data to see what we can learn about connection quality by country, carrier, and device type. Due to a variety of factors, this data does not permit us to offer firm statistical conclusions about the quality of a given network, device, or connection from a country. It’s important to bear in mind that this data speaks primarily to the ability of a device to maintain a persistent connection in the presence of all factors that diminish connection duration, including the OS itself. By analyzing this data in different dimensions, we seek to understand if any interesting correlations are present.
Diving In: Connection Events Across All Devices
Let’s start by looking at the global statistics of connection events per day across this slice of devices:
This chart shows that most devices in this sample lose and regain their data connection 10 or fewer times per day (55%), with the vast majority losing their connection fewer than 100 times per day (96.2%). Two reconnects/day is the most frequently occurring value, followed closely by three and then one. Following this, we find a long tail of a handful of devices with much higher reconnect rates, most likely indicating either a malfunctioning phone or one with a consistently poor connection. As a high rate of disconnect and reconnect events will typically occur when the device is in an area with marginal coverage (passing through a subway, dead spot, elevator, or building), these events are likely concentrated in small portions of the day during which adverse conditions are present.
With this data we can establish that over the course of a day, most mobile devices will experience a relatively low rate of reconnections. 55% of devices in this sample reconnected 10 or fewer times per day, averaging no more than one connection event per 2.4 hours.
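To make the figures above concrete, here is a minimal sketch of how per-device reconnect counts and the "10 or fewer" share could be derived from a day of connection events. The input format (one device ID per event), the function names and the toy data are assumptions for illustration; this is not the job used for the analysis.

```python
from collections import Counter


def reconnects_per_device(events):
    """Count connection events per device; `events` holds one device ID per event."""
    return Counter(events)


def share_at_or_below(counts, threshold):
    """Fraction of devices with `threshold` or fewer connection events."""
    if not counts:
        return 0.0
    within = sum(1 for n in counts.values() if n <= threshold)
    return within / len(counts)


def most_common_rate(counts):
    """The per-day reconnect count shared by the largest number of devices."""
    histogram = Counter(counts.values())
    return histogram.most_common(1)[0][0]


def hours_per_event(events_per_day):
    """Average spacing between events, in hours, for a given daily count."""
    return 24 / events_per_day


# Toy data, invented for this example (not taken from the logs).
sample_events = ["a", "a", "b", "b", "c", "c", "c"] + ["d"] * 40
counts = reconnects_per_device(sample_events)
print(share_at_or_below(counts, 10))   # 0.75: three of the four toy devices
print(most_common_rate(counts))        # 2
print(hours_per_event(10))             # 2.4
```

The last line is where the 2.4-hour figure comes from: ten events spread over a 24-hour day.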
Geography: Breaking it Down by Country
We’ve also examined the breakdown by country. Might connection durability vary unevenly across national boundaries? Via MaxMind’s Geo-IP Country database, we’re able to map mobile device IPs to the country from which they’re connecting. While it’s not possible to reliably pinpoint city or regional data by mobile IP, we can determine the country with a high level of confidence.
The y-axis in this chart represents the number of times a device located in a given country reconnected throughout the day. Here, we see that devices in this sample connecting from China experience the fewest reconnections per day (12), slowly climbing upward toward Canada and the US with 21. However, after these, we see a spike indicating that connections from Indonesia, France, and Japan are significantly more volatile. While many devices in the sample from these three countries demonstrated low reconnect rates, others varied widely with samples in the hundreds of reconnects in each. Note that this plot excludes countries from which fewer than 1000 devices in this slice of data have connected.
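The same slicing applies to every dimension in this post (country, network, device model): group devices, drop groups with too few devices, and compare a summary value per group. Below is a minimal sketch under stated assumptions. It uses the mean as the summary statistic, as the device chart later in this post does, and it takes the group label as already known for each device. The function name and toy numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_reconnects_by(records, min_devices=1000):
    """Mean daily reconnects per group, skipping groups with too few devices.

    `records` is an iterable of (group, reconnects_per_day) pairs, one per device.
    Returns (group, mean) pairs sorted from fewest to most reconnects.
    """
    groups = defaultdict(list)
    for group, reconnects in records:
        groups[group].append(reconnects)
    averaged = [
        (group, mean(values))
        for group, values in groups.items()
        if len(values) >= min_devices
    ]
    return sorted(averaged, key=lambda pair: pair[1])


# Toy records, invented for this example; the threshold is lowered to fit them.
toy = [("A", 10), ("A", 14), ("B", 20), ("B", 22), ("C", 500)]
print(mean_reconnects_by(toy, min_devices=2))   # group "C" is dropped
```

The minimum-size filter matters: without it, a single noisy device (like group "C" here) would dominate the top of the chart.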
Variations Across Mobile Networks
Surprised by the numbers in France and Japan, we broke the results down by network to see if uneven connection rates appeared at the carrier level as well. Via MaxMind’s ISP/Organization database, we can map device IPs to network providers. Parsing this data takes work, as many mobile networks function as independent systems under one brand following mergers with other carriers (e.g., AT&T and Cingular, or Verizon – Bell Atlantic – GTE). Rather than attempting to group these, we’ve provided the raw data of device-to-network mappings below. Note that this chart represents networks from which we see greater than 1000 devices in this sample connecting.
In this chart, AT&T Global Internet Services and “Service Provider Corporation” (formerly Cingular) represent AT&T. Cellco Partnership is the corporate name of Verizon Wireless. Orange Communication SA and Orange PCS Ltd. are networks operated by Orange in multiple countries. We also see landline and fiber providers due to devices connecting via WiFi. Once again, the y-axis in this chart represents the number of times a device on a given network reconnected throughout the day.
Consistent with our breakdown by country, we see that connections from NTT Docomo (Japan) and Bouygues (France) experience the highest level of interruptions. Devices on these networks experience significantly more dropped and re-established data connections to our messaging cluster than on other networks in other countries. This data also shows that connections originating from landline and fiber providers are interrupted more often. With the exception of NTT and Bouygues, the upper bounds of this dataset are weighted toward landline providers such as Cox, BellSouth, Charter, and Comcast.
Device Models and Manufacturers
What variations appear when we slice this data by device type? This chart represents the mean reconnect rates from devices (frequency > 1000) in this sample.
The Motorola Xoom leads here, which may be attributed in part to the fact that tablets are less likely to be carried through volatile network conditions throughout the day. On the opposite end, LG’s Optimus V, T, and M phones showed significantly greater reconnect rates, topped out by Samsung’s Nexus S and the T-Mobile G2. The middle of the pack is dominated by an alternating flurry of Samsung, HTC, and Motorola phones. This chart does not demonstrate a direct correlation between device manufacturers and reconnect rates, suggesting that the variations between individual models (radios, chipsets, software, etc.), the networks on which they are deployed, and user behavior (such as leaving a tablet on a coffee table) may be more significant than the device’s manufacturer.
Wrapping Up
This sample is not pure enough to support statistically sound statements regarding the reliability of a particular device, carrier, or connection within a country. The number of confounding factors prevents us from making such statements with confidence. This would require a cleaner dataset, and a more thorough analysis that cuts across each of these categories to account for the variations introduced.
However, it provides a fascinating picture into the life of a mobile device on data networks deployed throughout the world. We see that these devices must be capable of gracefully and transparently handling network failures throughout the day, retrying connections and backing off as appropriate. Network and geographic factors may correlate with the ability of a device to maintain a reliable IP connection to a remote server. We can also see that devices registered on mobile data networks tend to maintain more stable connections than those connecting over WiFi via traditional network providers. Finally, the data demonstrates that connection durability rates can vary widely across different Android device models as well.
More importantly, this slice of data provides insight into the behavior of devices connected to our messaging cluster. These results enable us to tune our software and systems on both the client and server to maximize connection durability and the reliability of our messaging services, while minimizing the impact on the device. Regardless of the factors contributing to poor connections, this type of analysis provides us with a better understanding of the best, average, and worst cases that devices are likely to experience, and feeds directly back into our development process. This rigorous analysis of our data is important, and constantly helps us to improve the reliability and performance of our systems.
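The "retrying connections and backing off as appropriate" behavior mentioned above is commonly implemented as capped exponential backoff with jitter. The sketch below shows that general technique only; it is not a description of our client, and the parameter values and function names are assumptions chosen for illustration.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=300.0, attempts=8, rng=random.random):
    """Yield capped exponential delays with full jitter, in seconds."""
    for attempt in range(attempts):
        ceiling = min(cap, base * factor ** attempt)
        # Full jitter: pick a random delay between 0 and the current ceiling.
        yield rng() * ceiling


def connect_with_backoff(connect, sleep, **kwargs):
    """Call `connect()` until it returns True or the delays run out.

    `sleep` is injected so callers can pass time.sleep, or a stub in tests.
    One final attempt is made after the last delay.
    """
    for delay in backoff_delays(**kwargs):
        if connect():
            return True
        sleep(delay)
    return connect()
```

In production `sleep` would typically be `time.sleep`. The jitter keeps a fleet of devices that lost coverage at the same moment (everyone on the same subway car, say) from all reconnecting in the same instant, and the cap keeps a device from waiting indefinitely long once coverage returns.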
Post Script
We initially performed this analysis back in January on a much smaller dataset. After re-running the same jobs across a dataset about 8x the size, we found surprisingly little variation. Previously, 63% of devices connected 10 or fewer times per day (compared to the current 55%). Consequently, reconnect rates increased about 10% across a few of the dimensions we analyzed (country, network, and by device type). While a few elements changed in the new analysis, it was refreshing to see that a revisit of this data six months later validated our first analysis, paving the way for more confident, data-driven changes to our messaging systems.
By Mike Herrick • June 1st, 2011 • Posted in News, Operations
What happened last night?
Our engineering team at Urban Airship has officially and successfully transitioned our services to our new hybrid-cloud architecture, hosted by our partners Amazon Web Services and Carpathia Hosting. Late last night, we completed the biggest phase of the move, which has been in the works for a long time. We have more work to do, but today is a very big milestone for us. This infrastructure is a significant investment in Urban Airship and we are all excited about what it will enable us to do for our customers.

Countdown to the new hybrid cloud.
Why did Urban Airship make this move?
It’s no secret that Urban Airship is rapidly growing. As we grow, our customers are increasingly demanding more sophisticated push messaging. Our hybrid-cloud architecture positions Urban Airship to continue to exceed customer expectations for our stable, scalable, low-latency, and high-throughput mobile messaging platform.
Looking past this transition, we have some exciting product news coming soon (stay tuned) about products made possible by our new hybrid-cloud architecture. Beyond our upcoming announcements, we believe that we are now on the hosting architecture that we need to provide our customers with the next iterations of our existing products, with new products under development and with unimagined products down the road.
We scoured the industry for a solution that would meet our needs during this high-growth phase and keep up with our vision of what the next few years will bring. This hybrid-cloud architecture positions us well for the future.
What were the requirements?
As we searched for the right solution for our needs, we established ten critical requirements.
- Connectivity – We service customers and devices around the world and need excellent connectivity between our services, partners, customers, and our customers’ customers.
- Geographic Footprint – The geographic footprint of our hosting partners is important so that we can provide adequate disaster recovery, high availability and low latency to our services as we grow.
- Cloud Elasticity – The elasticity of the cloud is needed to keep up with storage and compute growth, as well as provide burst capacity for certain operations.
- Customized Kit – We push network, compute, and storage equipment hard due to our scale and unique usage patterns. Our hosting providers need to be capable of provisioning and operating the various types of equipment we need.
- Economics – At the rate we are adding infrastructure, the economic model of the architecture needs to scale so that we can continue to offer competitively priced products. The architecture needs to empower us to make price-performance tradeoffs for cloud vs. customized kit.
- Flexibility – Currently, we prefer a managed service provider (MSP) model for any hosting partner. As Urban Airship grows, a colocation model may become advantageous for some of our services. The provider must have the business model flexibility to work with us through an MSP-to-colocation transition.
- Relationship – Rapid growth stretches any relationship, so a good foundation is key. We’ve grown a very good relationship with AWS over the years and demand the same from any additional partners.
- Reputation and Clientele – Reputation and existing clientele are key to a decision this big. Potential partners need to be best in class, have existing happy customers with similar requirements, and work well with other Urban Airship partners.
- SAS 70 – We service a growing list of enterprise customers where SAS 70 assurance is a must-have requirement.
- Speed – Mobile messaging and the demands of our customers are moving at break-neck speed. Any provider must have the ability and the passion to keep up with the needs of our market.
How does this affect our customers?
We believe that our investment will have many positive benefits for our customers. The connectivity, geographic footprint, cloud elasticity, and customized kit all combine to provide for robust mobile messaging worldwide. The economics of our hosting will continue to allow us to provide customers with the best value for our products. The reputations of AWS and Carpathia Hosting, our relationships with both, and their SAS 70 compliance should give customers confidence that Urban Airship is running on the best hosting platform in the world. Finally, the speed and flexibility of this approach will allow us to continue to execute well for years to come.
Next steps
As always, we’ll communicate our current status on http://status.urbanairship.com and the Urban Airship Engineering Twitter account @ua_eng. We’ll have more to say about some of the technical details of our architecture and the factors that drove this transition this summer at Open Source Bridge and OSCON:

Behind the scenes at the Airship late last night: It was all hands on deck as the engineering team verified the deployment.
Open Source Bridge
OSCON
Thank you
“You should be so lucky to have problems with scale some day” is a popular saying among startups. It reinforces the risk of premature investment and scaling before product/market fit is achieved. Urban Airship’s new platform infrastructure strategy has clearly been necessitated by our early success. We’ve “become so lucky” and need to make these investments in order to sustain our growth and exceed customer expectations. First and foremost, this means continuing to provide a stable, scalable, low-latency, high-throughput mobile messaging platform. For our newfound “luck,” we have our customers to thank. Thank you for sharing your mobile messaging problems with us and believing in us and our vision. We look forward to continuing the conversation and can’t wait to show you what we have coming next!
By Adam Lowry • October 18th, 2010 • Posted in Operations
Last Tuesday morning, October 12th, our API service and many of our backend systems were down. This was particularly bad timing as it coincided with a scheduled maintenance window where we upgraded one of our database systems, leading us to believe for some time that this was the cause of the outage. After much investigation we determined the root cause was severe EBS volume performance degradation.
What follows is an account of how we handled the incident and what steps we are taking to more appropriately mitigate the problem should it arise again. Hopefully this information can help anyone else using EC2 and EBS.
Post-mortem Summary
On October 11 at about 11:30pm PDT Amazon began experiencing severe issues with EBS disk volumes in an availability zone in the US-East-1 region, which we use extensively. Each of the three database instances we relied upon contained at least one bad volume in its RAID array. Once the root cause was determined, we submitted a support request to AWS, and prepared our two slaves to take over. After working through the night, our services were restored at around 5:30am the following morning.
Our EBS Volume Setup
To ensure proper disk performance, our EBS volumes are structured as striped RAID arrays on our database servers. Performance across each volume in the array should be roughly symmetric if everything is going well. However, when one drive performs poorly, the performance of the whole array is significantly degraded.
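To see why a single slow volume hurts so much, here is a minimal sketch in Python. It assumes a simplified model in which a striped read touches every volume in parallel and completes only when the slowest one responds. The function name and the latency figures are made up for illustration and are not measurements from the incident.

```python
def striped_read_latency_ms(volume_latencies_ms):
    """Latency of a read that spans every volume in a striped array.

    Simplified model: the stripes are fetched in parallel, so the read
    finishes only when the slowest volume has answered.
    """
    return max(volume_latencies_ms)


# Hypothetical per-volume latencies in milliseconds.
healthy = [5, 6, 5, 6]
one_bad = [5, 6, 5, 400]

print(striped_read_latency_ms(healthy))  # 6
print(striped_read_latency_ms(one_bad))  # 400
```

In this model, three healthy volumes do nothing to hide the fourth: the array is as slow as its worst member.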
Symptoms
Our problem was characterized by high I/O utilization on a subset (usually one) of the EBS volumes in an array. While observing iostat, we could see that some of the EBS volumes consistently had higher %util readings than others. Device saturation occurs when this value is close to 100%. This behavior usually manifested as a large spike in await (the average time for I/O requests issued to the device to be served), followed by %util pegging to 100% at full load and remaining fully utilized until there were no more operations directed toward the drive, despite returning only a few kilobytes of data per second. Unfortunately, we don’t have access to the actual root cause information for this failure as it’s all internal to AWS, but it appears to have something to do with high latency between our instances and these volumes.
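A check along these lines can be sketched in Python. The function below is hypothetical: it takes %util readings that have already been collected (for example, copied by hand from iostat output) and flags devices that are both near saturation and far above the array's median. The thresholds and device names are assumptions chosen for illustration.

```python
from statistics import median


def find_suspect_volumes(util_by_device, saturation=95.0, margin=30.0):
    """Return devices whose %util is near 100 and well above the array median."""
    typical = median(util_by_device.values())
    return sorted(
        device
        for device, util in util_by_device.items()
        if util >= saturation and util - typical >= margin
    )


# Hypothetical readings for a four-volume array.
readings = {"sdf": 22.0, "sdg": 25.5, "sdh": 99.8, "sdi": 24.1}
print(find_suspect_volumes(readings))  # ['sdh']
```

Comparing against the median rather than a fixed threshold alone means an array that is uniformly busy is not flagged; only a volume that stands apart from its peers is.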
Solution
This was a pervasive problem throughout our availability zone. Not only was our new master database server severely impaired, but so were both of our hot spares. We decided to spin up a new instance and hope that we got enough unaffected EBS volumes to start our service again. We frequently ran into “capacity unavailable” errors while trying to provision volumes. When we finally had enough volumes to start another database server and had copied over a data dump, our testing indicated that its EBS volumes were similarly impacted.
AWS said they were ‘close’ to resolving the issue many times, but would not offer detailed information regarding the problem or a timeline for resolution, only that their team was working on it. As a safeguard, we decided to launch another server in a different availability zone and rsync the data over, knowing this would take an hour and a half just to copy the data. About halfway through this process, EBS latency seemed to improve somewhat in our original availability zone, and we promoted one of the slaves to master.
We consistently provided information to AWS engineers hoping it would assist them in diagnosing and fixing the issues. Unfortunately, responses from the AWS team were often delayed and the information they gave was mostly vague. Even after the problem was identified and publicly announced, we would wait between 40 minutes and multiple hours between responses. In the end we noticed improved performance and recovered our sites, but it wasn’t until 16 hours later that we received another update from Amazon. Several times during that period they posted small public updates on their status page, but would not reply to our updates and requests for details.
Throughout this process, we were simultaneously working on bringing up new nodes, rebuilding indexes on others, or restoring from dumps. Unfortunately, because our dataset is fairly large, the process of copying it around took hours so we were pretty limited in what we were able to accomplish until EBS latency improved.

Action Plan
Things we’re doing to make sure this doesn’t happen again:
Use Multiple Availability Zones For Core Services
We found ourselves using one availability zone primarily to keep latency down, and throughput high. As we change to different database solutions, and re-engineer our architecture, we’re ready to start deploying in several availability zones for our core services. Naturally, this is a core requirement for surviving degraded services from hosting providers. Redundancy, redundancy, redundancy.
Don’t Rebuild Indexes, Store Them
Once our hot-spares had also been deemed unfit for service, our only option was to start a new instance from a database dump. This isn’t so bad, usually, but rebuilding the indexes afterward is extremely time intensive (and we rely heavily on primary and secondary indexes). Going forward, we want backups that include the indexes, so that a restored instance can go into service without a rebuild.
Work With AWS to Improve Communication
We’ll have some serious discussions with Amazon to improve communication during incidents like these. While we appreciate that engineers need to do their work to restore functionality, it’s unacceptable to go sixteen hours without updating a support ticket for such a serious issue, even if one believes it to be resolved.
Conclusion
We take service outages very seriously. We’re investing a lot of time in making sure that our infrastructure is protected from edge cases like slow EBS volumes and general outages within an availability zone. Naturally, we’re not just stopping there. We’ll continue to look at new solutions and to strive to provide great service to our customers.
Update:
Before we posted this we had another EBS volume failure (this morning). The volume exhibited the same failure — accepting reads and writes, but with extremely high latency backing up all access to 100% utilization. Luckily this time it was only the single volume on the single server, and we were able to promote the slave.
By Gavin McQuillan • October 5th, 2010 • Posted in Operations • 3 Comments
As we’ve been expanding, we’ve found that tracking performance or errors across dozens of different services isn’t trivial. There are lots of solutions to this, including Scribe and Flume. For now, we’ve decided to go a more traditional route for aggregating logs. This keeps things simple, and doesn’t require us to alter how our services talk with logging infrastructure (even if they just log to a file). So far, we’ve had great success with this method. Here’s how we set it up:
Shipping Logs
While logging and log management are definitely not among the most interesting engineering challenges, they are among the more important. When trying to track down a problem within a cluster of servers, having a central log that can be tailed, mined, analyzed, stored, and backed up is a tremendous asset. To do this, we use rsyslog, primarily because of its straightforward configuration and its ability to filter based on syslog header items. Excellent documentation and lots of examples don’t hurt either.
Rsyslog’s architecture is simple but elegant. The daemon runs on a central server and accepts log messages via TCP (port 10514) and/or UDP (port 514) from rsyslog daemons on other nodes in the cluster. Each remote syslog packet sent begins with a header containing a timestamp for the entry, followed by the message itself. These log messages usually contain a service name, severity level, the message itself, and occasionally some metadata, like this:
107 tacobot.local [Helium] INFO QueueConsumer - Connected to Beanstalk; waiting for jobs.
These messages are logged from dozens of different services and components within our infrastructure such as the machines’ OS itself, network services like Apache, Postgres, and Cassandra, and our messaging applications written in Python and Java. It’s difficult to overstress the value of bringing the status of all of these different services into one place.
While sharing a common protocol is definitely valuable, one must also ensure that each service is able to speak it. Because syslog has deep roots in the UNIX tradition, interfacing with it from Python is dead simple:
>>> import syslog
>>> syslog.syslog('Processing started')
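For services that need to reach a syslog daemon at a specific host and port, Python's standard logging module can do the same job. The sketch below is an illustration, not our production code: the helper name, the logger name, the message format, and the choice of the LOCAL1 facility are all assumptions.

```python
import logging
import logging.handlers
import socket


def make_syslog_logger(name, host="127.0.0.1", port=514):
    """Return a logger that ships INFO-and-above records to syslog over UDP."""
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        facility=logging.handlers.SysLogHandler.LOG_LOCAL1,  # assumed facility
        socktype=socket.SOCK_DGRAM,  # UDP
    )
    # Assumed layout: service name, level, then the message itself.
    handler.setFormatter(
        logging.Formatter("[%(name)s] %(levelname)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


logger = make_syslog_logger("Helium")
logger.info("Processing started")
```

Because UDP is connectionless, the call succeeds even when nothing is listening on the other end, so it is worth confirming that messages actually arrive at the loghost.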
We had a bit more difficulty teaching our Java-based services to play nicely with syslog, but eventually straightened things out. These services use Log4J, a Swiss Army knife for logging messages in Java. While Log4J is tremendously flexible and allows for non-XML based configuration, its documentation (especially for remote syslog’ing) is a bit light. It took us some time to figure out that while Log4J’s “SyslogAppender” supports UDP-based logging to remote servers, it does not support TCP-based logging (our default). With a default failure mode of printing nothing at all (i.e., no errors indicating that there’s a problem), and without logs showing up at their destination, this was a bit difficult to track down.
After some research, we discovered that the solution was rather simple: enabling UDP logging in rsyslog.conf on the server receiving logs, and updating our Log4J configuration as follows:
log4j.rootLogger=INFO, SYSLOG
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.syslogHost=localhost
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n
log4j.appender.SYSLOG.Header=true
log4j.appender.SYSLOG.Facility=LOCAL1
This configuration specifies a default log level of “Info,” with these messages to be logged to Syslog. These messages go to a local syslog daemon, which forwards them on to our central logging server. The conversion pattern is a fairly standard formatting string specifying how we’d like the actual log messages to appear. Setting the “header” instructs Log4J to include a syslog header specifying the hostname and timestamp, which is used by the central rsyslog daemon to split up the incoming logs it receives into different files based on the service and/or host logging to it.
The actual use of this logger is fairly simple. In the application, it’s just a matter of adding:
PropertyConfigurator.configure("log4j.properties");
private static Logger logger = Logger.getLogger("Helium");
logger.info("Launching Queue Consumer...");
Filtering on Loghost
One of the biggest challenges with centralized logging is figuring out how to sort incoming messages. With only a few exceptions, we do this based on the service type (queue, mongodb, postgres, web, etc.). We’re able to glean this from the hostname of the server that sends the message to our loghost.
One of the nice things about rsyslog’s filtering capabilities is that it supports regular expressions. In our case, a full regular expression match wasn’t necessary because each of our hosts’ hostnames begins with its service type.
Here’s an example set of lines from our rsyslog.conf on the loghost (the machine receiving all of our syslog messages):
if $hostname startswith 'queue' then /var/log/ua/queue.log
& ~
Rsyslog automatically breaks up header properties into $<property>. You can get a full list of the properties here: http://www.rsyslog.com/doc/property_replacer.html
In this case, we’re using the ‘startswith’ comparison within rsyslog. It’s a little faster than doing a full regular expression match because it doesn’t have to try every character offset in the target property. So, in our case, we look at each incoming message, and if its hostname starts with ‘queue’, then it came from a queue server and we store it in the queue service log (aggregated data across all of the queue servers).
We could just as easily do additional sorting if we wanted to. Say we only wanted to get Error level log messages or above from queue services, and we wanted this to go to a file called /var/log/ua/queue_errors.log:
if $hostname startswith 'queue' and $syslogseverity <= 3 then /var/log/ua/queue_errors.log
& ~
The “& ~” is really important. The “&” applies another action to the messages matched by the preceding filter, and “~” is the discard action, so together they mean “stop processing this message and skip to the next one.” Because rsyslog by default keeps evaluating a message against every remaining rule, you need to tell it explicitly when you are done with a message, or the message will also land in every later log file whose filter it matches, including the catchall. For example, if you have 10 if statements to filter incoming messages and a message matches number 6, then without the ‘& ~’ it would still be tested against the last four and could be written to those files as well.
Finally, we need a place to put all messages which don’t match our filters:
*.* /var/log/ua/other.log
This is a catchall, so we see which messages our filters missed. In our case, since we use hostname filtering, we see either messages coming from misconfigured hosts or from syslog clients which don’t include the hostname for some reason. This makes it easy to figure out which syslog client needs reconfiguration.
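The filter chain can be modeled in a few lines of Python to make the effect of “& ~” concrete. This is only a simulation of the behavior described above, not how rsyslog is implemented. The Rule tuple, the route function, and the hostname queue-1 are hypothetical, and severities use the standard syslog numbering (0 = emergency, 3 = error, 6 = info, 7 = debug).

```python
from collections import namedtuple

# prefix: hostname prefix to match; max_severity: highest numeric severity
# accepted (lower numbers are more severe); discard: models a trailing "& ~".
Rule = namedtuple("Rule", "prefix max_severity path discard")

CATCHALL = "/var/log/ua/other.log"


def route(hostname, severity, rules):
    """Return the files a message would be written to, in rule order."""
    written = []
    for rule in rules:
        if hostname.startswith(rule.prefix) and severity <= rule.max_severity:
            written.append(rule.path)
            if rule.discard:
                return written  # "& ~": stop processing this message
    written.append(CATCHALL)  # "*.*" sees whatever was not discarded
    return written


def make_rules(discard):
    """Two queue filters, most specific first, with or without "& ~"."""
    return [
        Rule("queue", 3, "/var/log/ua/queue_errors.log", discard),
        Rule("queue", 7, "/var/log/ua/queue.log", discard),
    ]


print(route("queue-1", 6, make_rules(discard=True)))
print(route("queue-1", 3, make_rules(discard=False)))
```

With discarding enabled, the info-level message lands only in queue.log. Without it, the error-level message is written to queue_errors.log, queue.log, and other.log. The sketch also shows why the more specific filter has to come first: once a rule with “& ~” matches, nothing after it sees the message.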
Performance
Rsyslog is quite efficient. However, we thought it wise to split up the incoming message queue from the action queue, to help prevent floods of incoming messages from keeping anything else from getting done. We will, after all, have many dozens of machines all shipping their logs to this one loghost server.
From /etc/rsyslog.conf:
# Decouple incoming queue from action queue.
$MainMsgQueueFileName /var/log/rsyslog.main.q
$ActionQueueFileName /var/log/rsyslog.action.q
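The idea behind separating the two queues can be illustrated with a small producer/consumer sketch in Python. It is an analogy rather than a description of rsyslog's internals: a receiver puts messages on a queue and returns immediately, while a separate worker thread drains the queue and performs the (possibly slow) action. All names here are hypothetical.

```python
import queue
import threading


def run_pipeline(messages, action):
    """Accept messages on one thread and apply `action` to them on another."""
    pending = queue.Queue()
    done = object()  # sentinel telling the worker to stop

    def worker():
        while True:
            item = pending.get()
            if item is done:
                break
            action(item)  # e.g. write to disk; may be slow

    thread = threading.Thread(target=worker)
    thread.start()
    for message in messages:
        pending.put(message)  # returns immediately, even if `action` is slow
    pending.put(done)
    thread.join()


written = []
run_pipeline(["msg %d" % i for i in range(5)], written.append)
print(written)
```

A burst of incoming messages only makes the queue longer; it does not stall the thread that is accepting them.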
Logrotate and Backups
One of the main purposes of aggregating our logs is to make it easy to back them up. A combination of logrotate and custom backup software lets us store logs in S3 so we can do log analysis later using Hive.
The log rotate script, placed somewhere in /etc/logrotate.d/:
/var/log/ua/*.log {
    daily
    missingok
    rotate 10
    compress
    create 640 syslog adm
    sharedscripts
    postrotate
        service rsyslog reload > /dev/null
    endscript
    lastaction
        ualogbackup
    endscript
}
The ualogbackup binary then takes all of the *.1.gz files and uploads them into a specific S3 bucket. The lastaction directive is a really handy logrotate feature: its script runs only after all of the logs have been rotated. We found that a combination of sharedscripts and postrotate commands wasn’t enough to ensure a consistent state.
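The ualogbackup tool itself isn't shown here, but the selection step it performs can be sketched as follows. Everything in this example is hypothetical: the function name, the key layout, and the injected upload callable, which stands in for whatever S3 client is used.

```python
import glob
import os


def backup_rotated_logs(log_dir, upload, host="loghost"):
    """Hand every freshly rotated *.1.gz file in log_dir to `upload`.

    `upload(local_path, key)` is supplied by the caller, for example a thin
    wrapper around an S3 client. Returns the keys that were uploaded.
    """
    keys = []
    for path in sorted(glob.glob(os.path.join(log_dir, "*.1.gz"))):
        key = "%s/%s" % (host, os.path.basename(path))  # assumed key layout
        upload(path, key)
        keys.append(key)
    return keys


# Dry run: print what would be uploaded instead of uploading it.
backup_rotated_logs("/var/log/ua", lambda path, key: print(path, "->", key))
```

Passing the uploader in as a function keeps the file selection easy to test without touching S3.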
Now we’ve got a system where all of our logs are backed up and ready to be analyzed.
By Michael Richardson • January 5th, 2010 • Posted in Operations
Hello devs,
This morning we had an infrastructure failure that resulted in a two-hour outage. We have recovered and all systems are functioning normally, with no signs that we’ll have this problem again.
Yesterday we upgraded our primary database server to a larger instance, in response to the enormous growth we’ve seen since the holidays. While the transition went swimmingly, our standby systems were not correctly transitioned. This morning our EBS volume (network attached storage for Amazon’s EC2 infrastructure) stopped responding to all requests, and the instance failed to respond to reboot requests. We launched a replacement, and worked to recover from the failure. This recovery took unacceptably long, and we apologize for that. It was a tiny comfort that our monitoring system worked (our emergency pagers sound remarkably like fire alarms).
We are reworking our failure handling systems now to ensure reliable processing. In addition, we have set up a status monitoring page at http://status.urbanairship.com . We will post updates here whenever there are issues with our systems, so that there’s one place to watch.
The amount of growth that we’ve seen over the past couple of months has been truly astonishing. We’ve delivered over 55 million messages and there are over six million devices out there that have at least one application (often several – over 9 million device tokens!) powered by Urban Airship. We’ve also seen fantastic growth in in-app purchases. Of course, the best part of our growth has been seeing so many new customers and interacting with all of you.
On our end, we’re continuing to work hard to make sure that everything continues to be fast and easy to use. There have recently been a few new features that we want to share with you.
Device token listing API
We’ve known from the start that we never wanted to lock people into using Urban Airship. The device token listing API is a great way to be able to easily access all of the device tokens that we have registered for your service. That means that if, for any reason, you want to get ahold of them and stop sending messages through Urban Airship, that’s no problem. You’re not locked in at all.
Push from device
After getting a lot of requests from people who wanted to be able to schedule notifications from their applications, we created an option to allow sending push notifications without requiring the use of the master key. We don’t allow sending to tags, aliases, batch or broadcasts with this, but it’s perfect for scheduling a notification or peer-to-peer messaging.
Feedback web hook
We already provide an easy API to get device tokens that have been marked inactive, but we now also have a web hook feature that we’re testing. If this is enabled, when we get an inactive device token we’ll ping your server about it – no need to poll us for this information. If you want to play around with this, please send us an email.
For the future
We have a couple of fantastic new products on the horizon that we’re extremely excited about. We think they will fundamentally change how you communicate with your users and, since it’s coming from us, you know that you can continue to expect ease of use and great support. We’re looking for private beta customers right now – if you’re interested, and are willing to put up with early beta builds, please let us know at http://urbanairship.com/contact/ .
As always, we would love to hear from you. Pop by our IRC channel and say hi, or contact us, or even write us a letter. It’s talking to customers like you that helps make this the best job on earth.
And, once again, thank you for using Urban Airship!
Sincerely,
The Urban Airship Crew
By Adam Lowry • September 10th, 2009 • Posted in Operations
We will be doing scheduled maintenance on the Urban Airship API servers on Sunday night at midnight PDT (UTC-0700). We expect the maintenance to take less than an hour in total.
If this interferes with an anticipated high-traffic period, please contact us.
Update 2009-09-14 1 AM PDT: Maintenance complete.