Auto-scaling, Queues and CloudFormations to Slash Costs at Neat

Maurício Linhares / @mauriciojr / neat.com

Who?

  • Technical Lead at Neat.com
  • Brazilian from João Pessoa, not from Rio de Janeiro or São Paulo
  • Moved here 8 weeks ago
  • No puedo hablar español, lo siento (I can't speak Spanish, sorry)

Where were we 2.5 years ago?

Why were we burning money?

No auto-scaling means...

Unexpected traffic spike?

Just go there and provision a bunch of new machines. And please remember to take them down once the spike is over!

We were running on an Elastic Compute Cloud

But our systems were not making use of this elasticity...

What did we need?

  • Quickly provisionable machines
  • Auto-scaling groups
  • Queues and queue metrics
  • All organized together with CloudFormations

First step?

Revamp the provisioning process.

Original process

  • Use Knife and Chef to build an instance for a service out of a bare bones AMI
  • All steps, from installing software to setting up config happen at this point
  • Slow, many minutes from nothing to instance running
  • Not reproducible - machines provisioned at different points in time will have different versions of their libraries

Doesn't work for auto-scaling

When you're auto-scaling to meet real-time customer demand, you can't waste any time.

Pre-made AMIs/images enter the scene


k mint spi -b SPI-Bundle-RELEASE-1.5.370-NSDK-4.0.0.242

The Golden AMI

  • A specific version of the software required gets installed
  • No environment-specific configuration exists yet
  • Once booted in an actual environment, the machine uses user-data to figure out where to pull config from and starts its work
  • Fast and reproducible, all instances are the same

User data as JSON


{
  "environment" : "production",
  "role" : "thumbnailer"
}

Chef kicks off and figures out what to do

A separate script reads the user data and calls Chef using the given role and environment. The instance gets configured and its services are ready for action.
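A minimal sketch of what such a bootstrap script could look like, not the actual Neat script. It assumes the user data is the JSON document shown earlier, that the instance metadata endpoint answers plain GET requests, and that Chef is started through `chef-client` with `--environment` and `--runlist`; the function names are made up for illustration.

```python
import json
import urllib.request

# Standard EC2 instance metadata endpoint for user data.
USER_DATA_URL = "http://169.254.169.254/latest/user-data"


def fetch_user_data(url=USER_DATA_URL, timeout=2):
    """Fetch the raw user-data document (only works from inside an instance)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_user_data(raw):
    """Parse the JSON user data and check the two keys the slide shows."""
    data = json.loads(raw)
    for key in ("environment", "role"):
        if key not in data:
            raise ValueError("user data is missing %r" % key)
    return data


def build_chef_command(data):
    """Build the chef-client invocation (flag choice is an assumption)."""
    return [
        "chef-client",
        "--environment", data["environment"],
        "--runlist", "role[%s]" % data["role"],
    ]


if __name__ == "__main__":
    # On a real instance this would be parse_user_data(fetch_user_data()).
    sample = '{"environment": "production", "role": "thumbnailer"}'
    print(build_chef_command(parse_user_data(sample)))
```

The command is returned as a list so it can be handed to `subprocess.run` without going through a shell.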

Auto-scaling groups arrive

  • Collection of machines instantiated out of a specific configuration
  • Register machines with an ELB if you need it
  • Simple (and mostly useless) health check process
  • Really, that's all

Alarms, metrics and scaling policies

This is where it gets interesting.

Pick a metric

It has to be in CloudWatch but you can push anything there. Using a metric that is already provided by AWS is always simpler.
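If no AWS-provided metric fits, pushing your own could look roughly like the sketch below. It only builds the request payload; the namespace, metric name and dimension are hypothetical, and sending it assumes boto3 is installed and credentials are available.

```python
def build_metric_payload(namespace, name, value, unit="Count", dimensions=None):
    """Build keyword arguments shaped for CloudWatch's PutMetricData call."""
    dims = [
        {"Name": key, "Value": val}
        for key, val in sorted((dimensions or {}).items())
    ]
    return {
        "Namespace": namespace,  # custom namespaces must not start with "AWS/"
        "MetricData": [
            {
                "MetricName": name,
                "Dimensions": dims,
                "Value": float(value),
                "Unit": unit,
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical metric: number of jobs a worker still has pending.
    payload = build_metric_payload(
        "Example/Workers", "PendingJobs", 42, dimensions={"Environment": "qa"}
    )
    print(payload)
    # With boto3 available, the payload could be sent like this:
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_data(**payload)
```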

Set up alarms and scaling policies

Alarms trigger actions when their threshold is met.


"ScaleUpWorkerAlarm": {
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    "AlarmDescription": "Scale-Up if queue depth exceeds our limit",
    "Namespace": "AWS/SQS",
    "MetricName": "ApproximateNumberOfMessagesVisible",
    "Dimensions": [
      {
        "Name": "QueueName",
        "Value": "MyQueue"
      }
    ],
    "Statistic": "Average",
    "Period": "60",
    "EvaluationPeriods": "3",
    "Threshold": 100,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": [
      {
        "Ref": "WorkerScaleUpPolicy"
      }
    ]
  }
}

"WorkerScaleUpPolicy": {
  "Type": "AWS::AutoScaling::ScalingPolicy",
  "Properties": {
    "AdjustmentType": "ChangeInCapacity",
    "AutoScalingGroupName": {
      "Ref": "WorkerAutoScalingGroup"
    },
    "Cooldown": 300,
    "ScalingAdjustment": 1
  }
}

Working together

Metrics, alarms and policies work together to make your auto-scaling group grow or shrink as needed. You can have as many alarms, metrics and policies as you want; just make sure they actually represent how you want your app to grow.
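To get a feel for how the alarm and policy above interact, here is a deliberately simplified model, not how CloudWatch is implemented. It assumes one queue-depth sample per 60-second period, the threshold of 100 over 3 evaluation periods from the alarm, and the +1 adjustment with a 300-second cooldown (5 periods) from the policy. The maximum group size of 5 is an assumption.

```python
def alarm_fires(samples, threshold=100, evaluation_periods=3):
    """True when the most recent evaluation periods all exceed the threshold."""
    if len(samples) < evaluation_periods:
        return False
    return all(s > threshold for s in samples[-evaluation_periods:])


def simulate(queue_depths, capacity=1, max_size=5, cooldown_periods=5):
    """Walk through per-minute queue depths and apply a +1 scaling policy.

    Returns the group capacity after each minute.
    """
    history = []
    seen = []
    last_scaled = None
    for minute, depth in enumerate(queue_depths):
        seen.append(depth)
        in_cooldown = (
            last_scaled is not None and minute - last_scaled < cooldown_periods
        )
        if alarm_fires(seen) and not in_cooldown and capacity < max_size:
            capacity += 1
            last_scaled = minute
        history.append(capacity)
    return history


if __name__ == "__main__":
    depths = [50, 150, 150, 150, 150, 150, 150, 150, 150, 150]
    print(simulate(depths))
```

With these numbers the group grows from 1 to 2 instances at minute 3 and has to wait out the cooldown before growing to 3 at minute 8, which is worth keeping in mind when picking thresholds for a queue that fills quickly.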

OK, lots of different parts. How do we tie them together?

What are CloudFormations?

  • Templated (JSON) AWS resources
  • Supports declaring most of the existing services and config options
  • Removes the need to perform manual steps to set up the services your app needs
  • Creates whole, isolated environments
  • Configurable with external parameters and mappings inside the template
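As an illustration of parameters and mappings, the sketch below builds a small template in Python and prints it as JSON. The resource names, instance types and the `Environment`/`ImageId` parameters are hypothetical, not taken from Neat's templates.

```python
import json


def build_template():
    """Return a small CloudFormation template as a plain dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Worker queue and launch configuration (illustrative)",
        # Parameters are supplied from outside when the stack is created.
        "Parameters": {
            "Environment": {
                "Type": "String",
                "AllowedValues": ["qa", "production"],
            },
            "ImageId": {"Type": "AWS::EC2::Image::Id"},
        },
        # Mappings are lookup tables that live inside the template.
        "Mappings": {
            "InstanceTypes": {
                "qa": {"Worker": "t2.micro"},
                "production": {"Worker": "c4.large"},
            }
        },
        "Resources": {
            "WorkerQueue": {"Type": "AWS::SQS::Queue"},
            "WorkerLaunchConfig": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                "Properties": {
                    "ImageId": {"Ref": "ImageId"},
                    "InstanceType": {
                        "Fn::FindInMap": [
                            "InstanceTypes",
                            {"Ref": "Environment"},
                            "Worker",
                        ]
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_template(), indent=2, sort_keys=True))
```

The same template creates a small instance in `qa` and a larger one in `production`, with the only external input being the environment name and the AMI.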

Unified resource creation

Resource creation was all over the place, now only CloudFormations do it.

Answers the "What does this app need?" question

Now you just open the CloudFormation template associated with it and the answer should be there.

No more all-access keys for apps

Templates must include their own security policies and allow access only to resources they themselves create, using IAM (Identity and Access Management) profiles.
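A sketch of what "allow access only to resources they themselves create" can look like: an IAM role and instance profile whose only permission targets a queue declared in the same template. The logical IDs, policy name and the list of SQS actions are assumptions chosen for illustration.

```python
def worker_iam_resources(queue_logical_id="WorkerQueue"):
    """Return IAM resources that grant access to one queue in this template."""
    return {
        "WorkerRole": {
            "Type": "AWS::IAM::Role",
            "Properties": {
                # Lets EC2 instances assume the role through the profile below.
                "AssumeRolePolicyDocument": {
                    "Version": "2012-10-17",
                    "Statement": [
                        {
                            "Effect": "Allow",
                            "Principal": {"Service": ["ec2.amazonaws.com"]},
                            "Action": ["sts:AssumeRole"],
                        }
                    ],
                },
                "Policies": [
                    {
                        "PolicyName": "worker-queue-access",
                        "PolicyDocument": {
                            "Version": "2012-10-17",
                            "Statement": [
                                {
                                    "Effect": "Allow",
                                    "Action": [
                                        "sqs:ReceiveMessage",
                                        "sqs:DeleteMessage",
                                        "sqs:GetQueueAttributes",
                                    ],
                                    # Only the queue this template creates.
                                    "Resource": {
                                        "Fn::GetAtt": [queue_logical_id, "Arn"]
                                    },
                                }
                            ],
                        },
                    }
                ],
            },
        },
        "WorkerInstanceProfile": {
            "Type": "AWS::IAM::InstanceProfile",
            "Properties": {"Path": "/", "Roles": [{"Ref": "WorkerRole"}]},
        },
    }


if __name__ == "__main__":
    import json
    print(json.dumps(worker_iam_resources(), indent=2))
```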

And what happened to our story?

The service went from being manually provisioned and scaled to a full-fledged auto-scaling solution. It now runs at half of the original cost and serves as an example for all new services being created.

Is this the end?

We're still learning how CloudFormations work

and being bitten every once in a while.

What did we learn so far at Neat?

Do not name stuff

If AWS can generate a name for it, do not name it. Use CloudFormation outputs to get their names.
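Reading generated names back from outputs could be sketched like this. The input is shaped like the response of CloudFormation's DescribeStacks call; the stack name, output key and generated queue name in the sample are made up.

```python
def stack_outputs(describe_stacks_response):
    """Turn the Outputs list of the first stack into a plain dict."""
    stack = describe_stacks_response["Stacks"][0]
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stack.get("Outputs", [])
    }


if __name__ == "__main__":
    # Hypothetical response; AWS picked the queue name, not the template.
    response = {
        "Stacks": [
            {
                "StackName": "qa-thumbnailer",
                "Outputs": [
                    {
                        "OutputKey": "WorkerQueueName",
                        "OutputValue": "qa-thumbnailer-WorkerQueue-1ABCDEFGHIJKL",
                    }
                ],
            }
        ]
    }
    print(stack_outputs(response)["WorkerQueueName"])
```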

Avoid nesting or cross-CF dependencies

If you really need to do it, make sure the dependency tree is shallow, or you will have trouble.

Separate stuff that changes frequently from stuff that does not

Don't place your RDS database in the same template as your webapp auto-scaling group.

Do not upload templates directly, build tools to do it

And make sure these tools understand how to name stacks and validate parameters.


k cfn id2 server update -e qa -c neat
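What the `k` tool does internally is not shown here, so the following is only a sketch of the two responsibilities mentioned: deriving stack names and validating parameters before anything is uploaded. The naming scheme and the list of environments are assumptions; the name pattern follows CloudFormation's rule that stack names are alphanumeric plus hyphens, start with a letter and have at most 128 characters.

```python
import re

ALLOWED_ENVIRONMENTS = ("qa", "staging", "production")  # hypothetical list
STACK_NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9-]{0,127}$")


def stack_name(environment, company, service):
    """Derive a predictable stack name so nobody has to type one by hand."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError("unknown environment: %s" % environment)
    name = "-".join((environment, company, service))
    if not STACK_NAME_PATTERN.match(name):
        raise ValueError("invalid stack name: %s" % name)
    return name


def validate_parameters(template, supplied):
    """Compare supplied parameters with what the template declares."""
    declared = template.get("Parameters", {})
    errors = []
    for name in sorted(declared):
        # Parameters without a default must be supplied by the caller.
        if "Default" not in declared[name] and name not in supplied:
            errors.append("missing parameter: %s" % name)
    for name in sorted(set(supplied) - set(declared)):
        errors.append("unknown parameter: %s" % name)
    return errors


if __name__ == "__main__":
    template = {"Parameters": {"Environment": {"Type": "String"}}}
    print(stack_name("qa", "neat", "id2-server"))
    print(validate_parameters(template, {"Env": "qa"}))
```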

Create two auto-scaling groups to simplify zero-downtime deployments

Whenever you want to deploy something, scale up the group that is currently scaled down, then scale down the one that was serving traffic.
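The swap can be sketched as a small planning function. The group names are hypothetical, and the explicit "wait until healthy" step between scaling up and scaling down is an assumption about how one would avoid dropping traffic.

```python
def deployment_steps(groups, desired):
    """Plan a swap between two auto-scaling groups.

    `groups` maps group name to its current desired capacity; exactly one
    group is expected to be scaled up and one scaled down to zero.
    """
    active = [name for name, capacity in groups.items() if capacity > 0]
    idle = [name for name, capacity in groups.items() if capacity == 0]
    if len(active) != 1 or len(idle) != 1:
        raise ValueError("expected exactly one active and one idle group")
    return [
        ("scale_up", idle[0], desired),
        ("wait_until_healthy", idle[0], desired),
        ("scale_down", active[0], 0),
    ]


if __name__ == "__main__":
    for step in deployment_steps({"blue": 4, "green": 0}, desired=4):
        print(step)
```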

Use IAM profiles for everything

Yes, I'm repeating this.

Create and hook up MANY alarms to your monitoring service

We're all human; send notifications for more than one threshold to make sure they won't be snoozed into oblivion.

Make sure your logs are going somewhere

Because all machines die.

What about problems?

There's no diff

Want to figure out what will change between the current template and the one deployed? Run it. If Justin Campbell were here, he would say Terraform has diffs.
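A partial workaround is to compare the two template documents locally before running anything. The sketch below does only that; it does not know which changes CloudFormation would apply in place and which would replace a resource, so it is no substitute for a real diff.

```python
def diff_resources(deployed, candidate):
    """List resource logical IDs that were added, removed or changed."""
    old = deployed.get("Resources", {})
    new = candidate.get("Resources", {})
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(
            name for name in set(old) & set(new) if old[name] != new[name]
        ),
    }


if __name__ == "__main__":
    deployed = {
        "Resources": {
            "Queue": {"Type": "AWS::SQS::Queue"},
            "Topic": {"Type": "AWS::SNS::Topic"},
        }
    }
    candidate = {
        "Resources": {
            "Queue": {
                "Type": "AWS::SQS::Queue",
                "Properties": {"VisibilityTimeout": 60},
            },
            "Bucket": {"Type": "AWS::S3::Bucket"},
        }
    }
    print(diff_resources(deployed, candidate))
```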

If a resource is deleted out of the CF...

You'll be in for a lot of trouble.

JSON is verbose and doesn't take comments or documentation

But there are tools that let you use other languages like Python or Ruby to declare templates.

Not all features are there yet

S3 notifications still don't have all the options available at the console/API.

Vendor lock-in

You're investing and you're stuck.

It's a black box

Problems? Open a ticket and wait.

Custom resources are painful to write and test

Check what we did at https://github.com/TheNeatCompany/cfn-bridge

What's next for us?

Preemptive scaling

We already have the numbers and the usage patterns are quite consistent. Scaling up based on time means customers wait even less when they actually start to use the app.
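One way time-based scaling could be declared is with a scheduled action on the auto-scaling group. This is a sketch of the idea, not something already in place; the group name reuses `WorkerAutoScalingGroup` from the policy shown earlier, and the schedule and sizes are invented.

```python
def scheduled_scale_up(group_logical_id, recurrence, min_size, max_size, desired):
    """Return an AWS::AutoScaling::ScheduledAction resource as a dict."""
    return {
        "Type": "AWS::AutoScaling::ScheduledAction",
        "Properties": {
            "AutoScalingGroupName": {"Ref": group_logical_id},
            "Recurrence": recurrence,  # cron syntax, evaluated in UTC
            "MinSize": min_size,
            "MaxSize": max_size,
            "DesiredCapacity": desired,
        },
    }


if __name__ == "__main__":
    import json
    # Hypothetical schedule: grow the group at 12:00 UTC on weekdays.
    action = scheduled_scale_up(
        "WorkerAutoScalingGroup", "0 12 * * 1-5", min_size=2, max_size=10, desired=4
    )
    print(json.dumps({"WorkdayScaleUp": action}, indent=2))
```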

Move the monolith

While all new apps have moved to CF-based setups, our monolith is still a work in progress, but we will get there.

Better custom health-check for instances

Right now the health checks are rudimentary and not very effective at spotting instances that are misbehaving.

This was a team effort

  • Bruce Willke Jr.
  • Kevin Lee
  • Richard Henning
  • Sarah Gray
  • Shairon Toledo
  • Todd Davenport
  • Travis Truman

Questions?

Thanks!