Harvest, Yield, and Scalable Tolerant Systems

by Armando Fox and Eric Brewer

Maurício Linhares / @mauriciojr

Why?

We all work with distributed systems, regardless of scale.

They break. A lot. All the time.

The CAP theorem is the main source of vocabulary for defining distributed systems.

But the "harvest and yield" model it presents is the actual main topic of the paper, not the CAP theorem.

1. Motivation, Hypothesis, Relevance

Failures will happen, no matter what you do, so learn how to deal with them while still being available most of the time.

2. Related Work and the CAP Principle

  • Strong Consistency
  • High Availability
  • Partition tolerance

Strong consistency (C)

ACID-level (Atomicity, Consistency, Isolation, Durability) guarantees for inserts and updates to the data.

High availability (A)

Some data can always be accessed.

Partition tolerance (P)

The system continues to function even if parts of it can't communicate with each other.

You can't provide all 3 at the same time

You have to pick 2 and work around the other.

CA but not P

Most distributed databases (MongoDB, ElasticSearch, Redis, and others) fall into this category. During a network partition, the smaller side stops working because it can't elect a master, or you end up with a split-brain. And then you're DOOMED.

CP but not A

The system goes into a read-only (or unavailable) mode to prevent inconsistencies until the partition heals.
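A toy sketch of this mode (the class and its names are illustrative, not from the paper): during a suspected partition the store keeps serving reads but refuses writes until the partition heals, trading availability for consistency.

```python
class ReadOnlyDuringPartition:
    """Toy CP-style store: reads always work, writes are refused
    while a partition is suspected, so no inconsistency can be
    introduced that would need reconciling later."""

    def __init__(self):
        self.data = {}
        self.partitioned = False  # flipped by a failure detector

    def write(self, key, value):
        if self.partitioned:
            raise RuntimeError("partition suspected: store is read-only")
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)
```

A real system would flip `partitioned` from a failure detector and queue or reject writes per its own policy; the point is that reads stay available while writes wait for the partition to heal.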

AP but not C

The way most caches (like Memcached or Redis) and distributed message queues (like Amazon SQS) work. Consistency is not guaranteed, but the system will usually provide an answer (possibly old or incomplete) even during partition events.

Weak or eventual consistency to the rescue

AP systems are the most common choice for real distributed systems, since downtime is not acceptable and failures (and the partitions they cause) happen all the time. Consistency is the usual victim.

3. Harvest, Yield, and the CAP Principle

The meat

Yield

The probability of completing a request. How many 9's of availability can your system provide?
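As a quick sketch of the arithmetic (the helper names are mine, not the paper's), yield is just completed over attempted requests, and the "nines" fall out of a log:

```python
import math

def yield_fraction(completed, attempted):
    """Yield = probability a request completes = completed / attempted."""
    return completed / attempted

def nines(y):
    """Nines of availability for a yield y, e.g. 0.999 -> 3.
    The small epsilon guards against floating-point rounding."""
    return math.floor(-math.log10(1 - y) + 1e-9)
```

So a service that completed 999,000 of 1,000,000 requests over a period ran at a yield of 0.999, or three nines.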

Harvest

How complete is the answer that you are providing to your user?
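A minimal sketch of the idea (the shard interface here is made up for illustration): answer a query with whatever shards responded, and report the harvest alongside the results instead of failing the whole request.

```python
def search(shards, query):
    """Query every shard; skip failed ones instead of failing the request.
    Returns (results, harvest), where harvest is the fraction of the
    data actually reflected in the answer."""
    results, answered = [], 0
    for shard in shards:
        try:
            results.extend(shard(query))
            answered += 1
        except Exception:
            continue  # trade harvest (completeness) for yield (an answer)
    return results, answered / len(shards)
```

With 100 shards and 2 of them timing out, the user still gets an answer, at a harvest of 0.98 — usually far better than an error page.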

Redefine what "correct behavior" means for your system

Define your systems in terms of how you can degrade harvest or yield to fit their usage and your customers' needs.

Check http://tinyurl.com/yjxzqlk for a full Farmville writeup.

4. Strategy 1: Trading Harvest for Yield — Probabilistic Availability

Is it better to show some search results or none?

All distributed systems are probabilistic

There is always a > 0% chance that something is broken, since the network is built on best-effort delivery.

I said best effort

Embrace the probabilities

What are your data hot spots? Can you replicate them to provide better harvest? Can you distribute your replicas across availability zones so that one zone failing does not reduce harvest drastically?
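Back-of-the-envelope math for the replication question (assuming independent node failures, which is exactly the assumption a single availability zone violates):

```python
def p_item_unavailable(p_node_down, replicas):
    """Probability that every replica of an item is down at once,
    assuming failures are independent. Correlated failures (e.g. a
    whole availability zone going dark) break this assumption, which
    is why spreading replicas across zones matters."""
    return p_node_down ** replicas
```

With a 1% chance of any one node being down, three replicas take a hot item's unavailability from 10^-2 to roughly 10^-6 — probabilistic availability made concrete.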

5. Strategy 2: Application Decomposition and Orthogonal Mechanisms

Yes, microservices! Buzzword bingo!

I mean, seriously, microservices!

This is a 1999 paper, they already knew this, and you're still asking yourself whether you should use them or not.

Why application decomposition?

Push complexity to where it actually needs to live. One piece of the app can live as an AP system with an expiration-based cache? Then build it like that!
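A minimal sketch of such an expiration-based cache (a toy, not any particular product's API): serve a possibly stale answer instead of blocking on the source of truth.

```python
import time

class TTLCache:
    """AP-style building block: entries expire after a fixed TTL, so
    stale data is bounded in age rather than forbidden outright."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}

    def put(self, key, value):
        # monotonic clock: immune to wall-clock adjustments
        self.store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:
            del self.store[key]  # expired: force a refresh upstream
            return None
        return value
```

The part of the app that can tolerate answers up to `ttl_seconds` old gets built on this; only the parts that truly need strong consistency pay for it.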

Orthogonal mechanisms

Failure detection and restarts? Make it orthogonal. Service discovery? Make it orthogonal. Load balancing? Make it orthogonal.

5.1. Programming With Orthogonal Mechanisms

Running on AWS with Elastic Load Balancer, Auto-Scaling groups, health checks and auto-termination? You might already be doing this!

5.2. Related Uses of Orthogonal Mechanisms

Even systems that were not originally built to be reliable can be given these characteristics by placing a "reliable" interface on top of them, like Netflix's Hystrix.
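A minimal circuit breaker in that spirit (a sketch of the pattern, not Hystrix's actual API): after enough consecutive failures, stop calling the flaky dependency and serve a fallback instead.

```python
class CircuitBreaker:
    """Wraps calls to an unreliable dependency. After max_failures
    consecutive errors the circuit "opens" and calls short-circuit
    straight to the fallback, protecting both caller and callee."""

    def __init__(self, max_failures, fallback):
        self.max_failures = max_failures
        self.fallback = fallback
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.max_failures:
            return self.fallback(*args)  # circuit open: don't even try
        try:
            result = fn(*args)
            self.failures = 0  # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            return self.fallback(*args)
```

A production breaker would also time out slow calls and periodically let a probe request through to test recovery (the "half-open" state); this sketch keeps only the core idea of failing fast to a degraded answer.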

6. Discussion and Research Agenda

Since its publication in 1999, many of the services envisioned here have been implemented in what we now call the cloud, using these concepts of orthogonality and building blocks to get applications to be distributed.

Are we there yet?

Unfortunately, some of the calls for research here remain unanswered, especially the programming model that provides first-class abstractions for manipulating degraded results. So far, most solutions have been ad hoc and application-specific.

Questions?

Thanks!