The Whacking Game

Kevin

August 5, 2020

At TextNow we serve more than 15M active users per month, with millions of calls and messages flooding in and out each day. Users rely on our application to connect and communicate using a technology that is free or as close to free as possible.

Over the last two years, we have grown rapidly, and in turn, our infrastructure needed to accommodate that growth. To do that, we started transitioning our monolithic application — chopping it into small, manageable, independent pieces. And with every transition story, there were critical design decisions that paved our journey; experiments that opened the gates to creating cool shit together, along with valuable lessons that we had to learn the hard way.

In this post, we’ll go through how we migrated one piece of the monolith, the decisions we made along the way, and the lessons we learned.

The Block Service

One of the simplest yet most finicky services is the block service. As the name implies, it allows our users to block other contacts. With every call being made, every SMS being sent, we check if either the caller/receiver is blocking the other. If so, we terminate the request.

The API for this service is a very simple CRUD application. Every use case is just a single call to the database:

  • Block a contact — A user blocks a contact. Thus, the user will no longer send/receive calls or texts from that contact.
  • Unblock the contact.
  • Is blocked? — Checks if a user is blocking a contact.
  • List all the blocked contacts of a user.
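
To make that concrete, here is a minimal sketch of what such an API surface might look like in Go; the interface and method names are illustrative, not the actual service code:

package block

import "context"

// BlockStore is a hypothetical sketch of the block service's API surface.
// Each method maps to a single call against the underlying table.
type BlockStore interface {
	// Block records that userID is blocking contact.
	Block(ctx context.Context, userID, contact string) error
	// Unblock removes the block, if it exists.
	Unblock(ctx context.Context, userID, contact string) error
	// IsBlocked reports whether userID is blocking contact.
	IsBlocked(ctx context.Context, userID, contact string) (bool, error)
	// ListBlocked returns every contact blocked by userID.
	ListBlocked(ctx context.Context, userID string) ([]string, error)
}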

The challenges (Why)

The existing block service, aside from being part of the monolith and on the roadmap to be split out into its own service, comes with many drawbacks:

  • The table had two columns: one is the userId of who is blocking, and the second is the blocked contact, which includes — but is not limited to — phone numbers. That introduced complexity when parsing it and trying to recognize the contact type.
  • The table had invalid values; there was no strict validation before insertion, so the table was full of junk contact values.
  • The code had a lot of if-else statements, which made the code fragile; changing one line was enough to break the whole system.
  • It wasn’t fast enough to support our ever-growing number of users, phone calls, and SMS messages. And as we grew, the response time got proportionally worse.
  • The table, therefore, grows unboundedly.

The solution (What)

After a round of discussions, we ended up deciding on the following:

  • An easy CRUD API to be utilized by clients that support either HTTP or gRPC.
  • Contacts are validated and normalized to a certain format before insertion. For example, a phone number will have the tel scheme format tel:+19999999999. This way, it is easy to identify the contact type (see the sketch after this list).
  • Provide fast access to data with single-digit millisecond performance — write requests are very low ~= 10/sec. We decided to use DynamoDB.
  • Support other services and data science for spam analysis and reporting. In general, services should be decoupled, and the block service shouldn’t worry about who is consuming the data nor how it will be processed.
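
As a rough illustration of that normalization rule, a helper along the following lines turns a raw phone number into the tel scheme. The function name and the North-America-only digit handling are simplifying assumptions on our part:

package contact

import (
	"fmt"
	"regexp"
	"strings"
)

var digitsOnly = regexp.MustCompile(`[^0-9]`)

// NormalizePhone is a simplified sketch: it strips formatting characters and
// returns the number in tel scheme form, e.g. "tel:+19999999999".
// The real service validates far more contact types and edge cases.
func NormalizePhone(raw string) (string, error) {
	digits := digitsOnly.ReplaceAllString(raw, "")
	if len(digits) == 10 { // assume a North American number without a country code
		digits = "1" + digits
	}
	if len(digits) != 11 || !strings.HasPrefix(digits, "1") {
		return "", fmt.Errorf("unsupported phone number: %q", raw)
	}
	return "tel:+" + digits, nil
}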

The plan (How)

The plan was pretty straightforward:

  1. Migrate all the old data.
  2. Cutover to the new service.

The Migration

While there are many approaches for migration depending on the data size, desired consistency level, and flexibility of the schema, we took the simplest approach that was acceptable for our needs:

[caption id="attachment_412" align="aligncenter" width="1024"]

The migration plan[/caption]

First, we took a copy of the data in the current relational table as a SQL dump in CSV format. We had about 40M rows.

We then split the CSV file into N smaller files so that we could feed each one to the migration script in a separate terminal (screen), where each script runs G goroutines — Yes, we use Golang!

And because errors are expected, we wrote them in an error file in the same format as the original CSV file so that we can feed this file back to the migration script. We also made sure to write the debugging information in a separate debug file so as to not pollute the error file.
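
Putting that together, the migration worker looks roughly like the sketch below. The table name, flag names, and CSV layout are illustrative assumptions, and the debug output here goes to the standard logger rather than a dedicated debug file:

package main

import (
	"encoding/csv"
	"flag"
	"log"
	"os"
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
	in := flag.String("in", "blocks_part_01.csv", "one CSV chunk: userId,contact")
	errOut := flag.String("err", "errors.csv", "failed rows, same format as the input")
	workers := flag.Int("g", 32, "number of goroutines (G)")
	flag.Parse()

	db := dynamodb.New(session.Must(session.NewSession()))

	errFile, err := os.Create(*errOut)
	if err != nil {
		log.Fatal(err)
	}
	defer errFile.Close()
	errWriter := csv.NewWriter(errFile)
	defer errWriter.Flush()
	var mu sync.Mutex // guards the error writer across goroutines

	rows := make(chan []string)
	var wg sync.WaitGroup
	for i := 0; i < *workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for row := range rows {
				_, err := db.PutItem(&dynamodb.PutItemInput{
					TableName: aws.String("blocks"), // assumed table name
					Item: map[string]*dynamodb.AttributeValue{
						"userId":  {S: aws.String(row[0])},
						"contact": {S: aws.String(row[1])},
					},
				})
				if err != nil {
					log.Printf("put failed for user %s: %v", row[0], err) // debugging information
					mu.Lock()
					errWriter.Write(row) // same format as the input, so it can be re-fed
					mu.Unlock()
				}
			}
		}()
	}

	f, err := os.Open(*in)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	r := csv.NewReader(f)
	r.FieldsPerRecord = -1 // tolerate ragged rows; we validate the length below
	for {
		row, err := r.Read()
		if err != nil {
			break // io.EOF ends this chunk
		}
		if len(row) < 2 {
			continue
		}
		rows <- row
	}
	close(rows)
	wg.Wait()
}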

Next, we spun up an EC2 instance and ran the migration script.

Then, after all was said and done, we cut over to the new service, redirecting requests to the new block service.

Finally, in the background, we took a copy of all the requests (insert/update/delete) that happened between the snapshot of the old table and the cutover, and ran them through the migration script. But something unexpected started happening: there were spikes in the response time of up to 7s!

[caption id="attachment_413" align="aligncenter" width="1024"]

A screenshot from DynamoDB table metrics shows the read latency[/caption][caption id="attachment_414" align="aligncenter" width="1024"]

The Datadog latency metric shows the max latency spikes[/caption]

Questions were spinning in my head: how does that happen when we are using DynamoDB, a NoSQL database that promises single-digit millisecond latency?

Even though the average latency graph looked good, we were concerned about these spikes, as they could affect the overall user experience.

Investigating DynamoDB latency

If you recall, the block service is invoked on — and adds overhead to — every call or SMS, in and out. We make a database GET request with the userId as the partition key and the contact as the sort key to check whether a block exists.
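
In code, that check boils down to a single GetItem call, roughly like the sketch below (the table and attribute names are assumptions for illustration):

package block

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// isBlocked sketches the hot-path lookup: one GetItem keyed on the partition
// key (userId) and the sort key (contact).
func isBlocked(db *dynamodb.DynamoDB, userID, contact string) (bool, error) {
	out, err := db.GetItem(&dynamodb.GetItemInput{
		TableName: aws.String("blocks"), // assumed table name
		Key: map[string]*dynamodb.AttributeValue{
			"userId":  {S: aws.String(userID)},
			"contact": {S: aws.String(contact)},
		},
	})
	if err != nil {
		return false, err
	}
	return len(out.Item) > 0, nil // an existing item means the contact is blocked
}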

We initially thought this was a hot partition problem. But that didn’t seem to be the case, as our partition key is the userId, which is guaranteed to be unique and uniformly distributed. Was it then due to a certain userId being called so frequently and rapidly? After further investigation, that didn’t seem to be the case, either.

So then we landed on the question: What is the number of initial partitions allocated to us? From what we understood, DynamoDB creates a number of partitions based on the provisioned capacity and the data stored with the following equation:

Number of partitions = max((RCU / 3000 + WCU / 1000), GB / 10)

There is a hard read (3,000 RCU), write (1,000 WCU), and storage (10 GB) limit on each partition regardless of the provisioned capacity. If we have, say, more RCU than a single partition can handle, then DynamoDB will split the data across two partitions instead.
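
To make that concrete with purely illustrative numbers: a table provisioned with 6,000 RCU and 1,000 WCU that stores 25 GB of data would start with max(6000 / 3000 + 1000 / 1000, 25 / 10) = max(3, 2.5) = 3 partitions.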

Well, we didn’t have a low provisioned capacity, nor were we getting more requests than a partition can handle (and so the table was not throttling). And yet, the spikes continued. It wasn’t expected behaviour, and we weren’t doing anything wrong. So, what was the problem?

Whacking the p99 & max

DynamoDB is a massive scale distributed service, with thousands of nodes in the back end fleet. Due to the distributed nature of DynamoDB, in each and every minute, there is something failing in the background. Nodes might go down, requests take longer than expected, data loss is not uncommon, even network failure is a possibility.

In short, expect to see some requests which will cause very high latency. And that’s absolutely fine. We should, however, handle these cases. Having said that, the trick was to fail fast and retry soon:

When we make a request, it might take > 1s, and the SDK (Golang) wouldn’t time out in an expected manner. In fact, there didn’t seem to be a timeout at all with the default values (note: http.DefaultClient has no timeout by default).

The solution was to not trust the default values and set a timeout reasonable for our case. In addition, on failure, we retried the request with exponential backoff + jitter (random sleep time between the subsequent requests).

So, now, requests time out sooner, and then they are retried. Chances are, the next time around the request will successfully be processed before it times out.

The good news is, the SDK has support for retrying with exponential backoff and jitter. But it just wasn’t being utilized, as there was no timeout! So we decided to go with the following configuration:

// Default is to retry 3 times
- Max number of retries = 3
// Timeout is slightly greater than p99 (p99 when there are no spikes)
- Timeout >= p99 = 500ms
// Min wait time is slightly greater than the average response time
- Min delay >= average response time = 20ms
// Max wait time is slightly less than half the timeout
- Max delay ~= Timeout/2 ~= p99/2 = 200ms
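
With the Go SDK (aws-sdk-go v1), a client setup along these lines expresses that configuration; treat it as a sketch under our assumptions rather than the exact production code:

package block

import (
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// newBlockTableClient builds a DynamoDB client that fails fast and retries
// using the SDK's built-in exponential backoff with jitter.
func newBlockTableClient() *dynamodb.DynamoDB {
	cfg := request.WithRetryer(aws.NewConfig(), client.DefaultRetryer{
		NumMaxRetries: 3,                      // retry up to 3 times
		MinRetryDelay: 20 * time.Millisecond,  // roughly the average response time
		MaxRetryDelay: 200 * time.Millisecond, // roughly half the timeout
	}).WithHTTPClient(&http.Client{
		Timeout: 500 * time.Millisecond, // slightly above the steady-state p99
	})

	sess := session.Must(session.NewSession())
	return dynamodb.New(sess, cfg)
}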

… And the result?

[caption id="attachment_416" align="aligncenter" width="1024"]

The sudden drop in spikes after the new config changes[/caption]

We did whack the p99 & max down from > 10s to ~500ms!

On some rare occasions, even with the retry mechanism, the request might fail due to some DynamoDB internal error or just because it times out after several retries. In this case, instead of raising an error, we fail open and return a default value. A clean solution for a problem that was causing a whole lot of mess.
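
Building on the isBlocked sketch above, the fail-open behaviour is a thin wrapper; treating the contact as not blocked is our assumed default for illustration:

// isBlockedFailOpen sketches the fail-open path: if DynamoDB still errors
// after the timeout and retries, return a default instead of surfacing the
// error to the calling/messaging path.
func isBlockedFailOpen(db *dynamodb.DynamoDB, userID, contact string) bool {
	blocked, err := isBlocked(db, userID, contact)
	if err != nil {
		return false // assumed default: let the call or SMS go through
	}
	return blocked
}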

The Future

Along with our blocking service, we are excited to be working on new endeavours to improve the communication experience for our users: give warnings about potential spam calls, proactively terminate spam calls, mute/unmute contacts (temporarily disable notification for certain contacts), and so on and so forth. If these types of challenges interest you, don’t be afraid to check out our open positions! We are always looking for passionate minds to join our team.

Big thanks to the team of developers who consistently work hard on improving the user experience at TextNow. I was happy to write this blog post but the real credit goes to Asad Awadia, David Magnus, and Emmanuel dos Santos.
