We all work with distributed systems, regardless of scale.
Failures will happen, no matter what you do, so learn how to deal with them while still being available most of the time.
Consistency: ACID-level (Atomicity, Consistency, Isolation, Durability) guarantees for inserts and updates to the data.
Availability: some data can always be accessed.
Partition tolerance: the system continues to function even if parts of it can't communicate with each other.
You have to pick 2 and work around the other.
Most distributed databases (MongoDB, ElasticSearch, Redis and others) fall into this category. If you get a network partition, the smaller side of the partition stops working, as it won't be able to elect a master. Or you get a split-brain. And then you're DOOMED.
The system goes into a read-only (or unavailable) mode to prevent inconsistencies until the partition heals.
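A toy illustration of why the smaller side gives up (my own sketch, assuming a simple majority quorum, not anything from a specific database): a node only keeps accepting writes while it can reach a strict majority of the cluster.

```python
# Hypothetical sketch: without a strict majority of the cluster, a node
# refuses writes (it cannot elect or confirm a master) and degrades to
# read-only or unavailable until the partition heals.

def can_accept_writes(reachable_nodes: int, cluster_size: int) -> bool:
    """A side of a partition stays writable only if it holds a majority."""
    return reachable_nodes > cluster_size // 2

# A 5-node cluster split 3/2: the 3-node side keeps serving writes,
# the 2-node side stops.
assert can_accept_writes(3, 5) is True
assert can_accept_writes(2, 5) is False
```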
This is the way most caches (like Memcached or Redis) and distributed message queues (like Amazon SQS) work. Consistency is not guaranteed, but the system will usually provide an answer (maybe old or incomplete) even during partition events.
AP systems are the most common choice for real distributed systems: downtime is not acceptable, and failures will happen (and cause partitions) all the time. Consistency is the most common victim in distributed systems.
The meat
Yield: the probability of completing a request. How many 9's of availability can your system provide?
Harvest: how complete is the answer that you are providing to your user?
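In code terms (a tiny sketch of my own, the names are mine), the two metrics boil down to a pair of ratios:

```python
# Yield is the fraction of requests that get an answer; harvest is the
# fraction of the full data set reflected in each answer.

def yield_ratio(completed_requests: int, offered_requests: int) -> float:
    """Probability that a request completes: completed / offered."""
    return completed_requests / offered_requests

def harvest(data_returned: int, complete_data: int) -> float:
    """Fraction of the complete data represented in the answer."""
    return data_returned / complete_data

# e.g. 999,990 of 1,000,000 requests answered, each from 3 of 4 shards:
# yield_ratio(999_990, 1_000_000) == 0.99999, harvest(3, 4) == 0.75
```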
Define your systems in terms of how you can degrade harvest or yield to fit their usage and your customers' needs.
Check http://tinyurl.com/yjxzqlk for a full Farmville writeup.
Is it better to show some search results or none?
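As a rough sketch of that trade-off (the shard functions here are hypothetical), a search front-end can fan the query out, skip the shards that fail, and report how complete the answer was:

```python
# Trade harvest for yield: return whatever the reachable shards gave back
# instead of failing the whole request.

from typing import Callable, Iterable

def partial_search(query: str,
                   shards: Iterable[Callable[[str], list]]) -> tuple[list, float]:
    """Query every shard, skip the ones that fail, report the harvest."""
    results, answered, total = [], 0, 0
    for shard in shards:
        total += 1
        try:
            results.extend(shard(query))
            answered += 1
        except Exception:
            continue  # degraded harvest instead of a failed request
    harvest = answered / total if total else 0.0
    return results, harvest
```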
There is always a > 0% chance that something is broken, since the network is based on best-effort delivery.
What are your data hot spots? Can you replicate them to provide better harvest? Can you distribute your replicas across availability zones so that one zone failing does not reduce harvest drastically?
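A minimal illustration of that last question (zone names and the naming scheme are made up): spread the replicas of a hot key across zones so a single zone failure only costs you part of the harvest.

```python
# Assign each replica of a key to a different availability zone, round-robin.
from itertools import cycle

def place_replicas(key: str, zones: list[str], replication_factor: int) -> list[str]:
    """Return one placement per replica, cycling through the zones."""
    zone_cycle = cycle(zones)
    return [f"{next(zone_cycle)}/{key}-replica-{i}"
            for i in range(replication_factor)]

# place_replicas("user:42", ["us-east-1a", "us-east-1b", "us-east-1c"], 3)
# -> one copy per zone, so losing a zone only lowers harvest, not yield
```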
Yes, microservices! Buzzword bingo!
This is a 1999 paper, they already knew this back then, and you're still asking yourself whether you should use them or not.
Push complexity to where it actually needs to live. Can one piece of the app live as an AP system with an expiration-based cache? Then build it like that!
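Something like this sketch (the names and the 30-second TTL are mine, not from the post): an expiration-based cache that keeps answering, even with stale data, when the backend behind it is unreachable.

```python
import time

class ExpiringCache:
    """Serve cached values for a TTL; fall back to stale data if the backend fails."""

    def __init__(self, fetch, ttl_seconds: float = 30.0):
        self.fetch = fetch              # function that loads fresh data
        self.ttl = ttl_seconds
        self.store = {}                 # key -> (value, stored_at)

    def get(self, key):
        value, stored_at = self.store.get(key, (None, 0.0))
        if time.monotonic() - stored_at < self.ttl:
            return value                # fresh enough: fast and available
        try:
            value = self.fetch(key)
            self.store[key] = (value, time.monotonic())
            return value
        except Exception:
            return value                # backend down: answer with stale data
```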
Failure detection and restarts? Make it orthogonal. Service discovery? Make it orthogonal. Load balancing? Make it orthogonal.
Running on AWS with an Elastic Load Balancer, Auto Scaling groups, health checks and auto-termination? You might already be doing this!
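The contract between your app and those orthogonal mechanisms can be as small as a health endpoint the load balancer polls; a bare-bones sketch, assuming port 8080 and a /health path:

```python
# Anything that stops answering this gets pulled out of rotation and
# restarted by the platform, not by application code.
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```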
Even systems that were not originally built to be reliable can be made more reliable by placing a "reliable" interface on top of them that provides these characteristics, like Netflix's Hystrix.
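Here is a bare-bones circuit breaker in that spirit (a sketch of the pattern, not Hystrix's actual API): after enough consecutive failures, calls are short-circuited to a fallback until a cool-down period passes.

```python
import time

class CircuitBreaker:
    """Wrap an unreliable call; fail fast to a fallback when it keeps breaking."""

    def __init__(self, call, fallback, max_failures: int = 5, reset_after: float = 10.0):
        self.call, self.fallback = call, fallback
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def __call__(self, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return self.fallback(*args, **kwargs)   # circuit open: fail fast
            self.opened_at, self.failures = None, 0     # half-open: try again
        try:
            result = self.call(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()       # open the circuit
            return self.fallback(*args, **kwargs)
```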
Since its publication in 1999, many of the services touted here have been implemented in what we now call the cloud, using these concepts of orthogonality and building blocks to make applications distributed.
Unfortunately, some of the calls for research here are still unanswered, especially the programming model that provides first-class abstractions for manipulating degraded results. So far, most of the solutions have been ad hoc and application-specific.