Production Ready Ruby

Increasing MTBF & Decreasing MTTR

29 January 2017

Adam Hawkins

SRE Team Lead, Saltside

Tell Me About You

Define "production ready"

Code structured such that:

1. Mean time between failures (MTBF) is increased
2. Mean time to resolve (MTTR) is decreased

How can we make this better?

OK... What Practices?

Smoke Test your Processes

Yay, Make, Make is great!

First, start your process using the same command you would use in production.

Example Makefile

# Sentinel artifact recording that the setup commands have been run.
# Every non-phony make target must create a file.
ENVIRONMENT:=tmp/environment

# Boot everything for testing
$(ENVIRONMENT):
# -D daemonizes the server so the recipe does not block on it
    bundle exec rackup -p 9292 -D
    mkdir -p $(@D)
    touch $@

.PHONY: test-smoke
# Run a smoke test; depend on the $(ENVIRONMENT)
test-smoke: $(ENVIRONMENT)
    env SERVER_URL=http://localhost:9292 bats smoke_test.bats

Smoke Test your Processes

Next, run commands against the running process, the same kind of checks that would make sense in production.

I like bats.

Bats is a Bash test framework: simple assertions via test, plus TAP output.

Example smoke_test.bats

#!/usr/bin/env bats

@test "liveness probe" {
    run curl -f "${SERVER_URL}/probe/liveness"
    [ $status -eq 0 ] # $status populated by bats run command
}

@test "readiness probe" {
    run curl -f "${SERVER_URL}/probe/readiness"
    [ $status -eq 0 ]
}
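
For the smoke test to pass, the server has to answer on both probe paths. A minimal sketch of what that could look like as a plain Rack-compatible app; the class name ProbeApp and the ready_check hook are assumptions for illustration, not something from the talk:

```ruby
# Minimal Rack-compatible app exposing the two probe paths used by the smoke test.
# No gems required: a Rack app is any object that responds to #call(env).
class ProbeApp
  # ready_check is a hypothetical hook, e.g. "can I reach the database?"
  def initialize(ready_check: -> { true })
    @ready_check = ready_check
  end

  def call(env)
    case env['PATH_INFO']
    when '/probe/liveness'
      respond(200, 'alive')
    when '/probe/readiness'
      @ready_check.call ? respond(200, 'ready') : respond(503, 'not ready')
    else
      respond(404, 'not found')
    end
  end

  private

  def respond(status, body)
    [status, { 'content-type' => 'text/plain' }, [body]]
  end
end
```

Because curl -f exits non-zero on HTTP errors, a 503 from the readiness probe fails the bats test.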

Smoke Test Your CLIs

Move your utilities out of Rake and into Thor (or anything else, really).

Define a test task in the Makefile

.PHONY: test-util
test-util: $(ENVIRONMENT)
    bundle exec util reset -f

Eliminated Regressions

Add Telemetry

What is Telemetry?

Telemetry is data required to understand the current state.

Save data (e.g. metrics and/or logs) to relate the current state to past states.

Data = Business + Technical

Server Telemetry

"How is my server doing?"

Include metadata to aggregate across paths, request names, etc.

Examples
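
One way to collect server telemetry is a Rack middleware that times each request and tags it. This is a sketch under assumptions: MemoryMetrics is a hypothetical in-memory stand-in for a real metrics client, and the metric name server.request is made up.

```ruby
# In-memory stand-in for a metrics client (hypothetical; swap for a real one).
class MemoryMetrics
  attr_reader :timings

  def initialize
    @timings = []
  end

  def timing(name, milliseconds, tags)
    @timings << { name: name, ms: milliseconds, tags: tags }
  end
end

# Rack middleware that times every request and tags it with method, path, and status.
class RequestTelemetry
  def initialize(app, metrics)
    @app = app
    @metrics = metrics
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    [status, headers, body]
  ensure
    # Runs on success and on exceptions; an unhandled error is tagged as a 500.
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
    tags = { method: env['REQUEST_METHOD'], path: env['PATH_INFO'], status: status || 500 }
    @metrics.timing('server.request', elapsed_ms, tags)
  end
end
```

The tags are the metadata: they let you aggregate latency and error rate per path or per status.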

Dependency Telemetry

These are your upstream APIs, other internal services, or data stores

Include metadata to aggregate across dependencies.

Examples:
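
A sketch of dependency telemetry: wrap each call to a dependency, then record duration and outcome tagged by dependency name. The recorder interface (any callable taking name, value, tags) and the metric name dependency.call are assumptions for illustration.

```ruby
# Wraps calls to a dependency and reports duration and outcome, tagged by
# dependency name so results can be aggregated per dependency.
class DependencyTelemetry
  # recorder: any callable taking (metric_name, value, tags). Hypothetical interface.
  def initialize(recorder)
    @recorder = recorder
  end

  # Yields to the block, returns its value, and re-raises any error after recording it.
  def measure(dependency)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    outcome = 'success'
    yield
  rescue StandardError
    outcome = 'error'
    raise
  ensure
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
    @recorder.call('dependency.call', elapsed_ms, { dependency: dependency, outcome: outcome })
  end
end
```

Usage might look like telemetry.measure('postgres') { run_query }, where run_query is whatever call you already make.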

Queue Telemetry

"Are messages moving through the queue?"

"Queue" refers to a message queue or a job queue like Sidekiq.

Include metadata to aggregate across each queue

Examples
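
Two numbers answer "are messages moving?": queue depth and the age of the oldest waiting message. A sketch under assumptions: the class name QueueTelemetry is hypothetical, delivery is assumed FIFO, and the code is not thread-safe as written.

```ruby
# Tracks enqueue times per queue so depth and oldest-message age can be reported.
class QueueTelemetry
  # clock is injectable for testing; defaults to a monotonic clock in seconds.
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @pending = Hash.new { |hash, queue| hash[queue] = [] }
  end

  def enqueued(queue)
    @pending[queue] << @clock.call
  end

  # Call when a message is taken off the queue (FIFO assumed).
  def processed(queue)
    @pending[queue].shift
  end

  # One entry per queue: depth and age in seconds of the oldest waiting message.
  def snapshot
    now = @clock.call
    @pending.map do |queue, times|
      { queue: queue, depth: times.size, oldest_age: times.empty? ? 0.0 : now - times.first }
    end
  end
end
```

A growing oldest_age with a steady depth is the signal that a consumer is stuck.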

Thread Pool Telemetry

"What's the load on the pool?"

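Load on a pool comes down to how many workers are busy and how many jobs are waiting. A sketch of a tiny fixed-size pool that reports both; the class name InstrumentedPool and the stats keys are assumptions, and a job that raises will kill its worker in this simplified version.

```ruby
# Tiny fixed-size thread pool that can report its own load.
class InstrumentedPool
  def initialize(size)
    @size = size
    @jobs = Queue.new
    @busy = 0
    @lock = Mutex.new
    @workers = Array.new(size) { Thread.new { work } }
  end

  def post(&job)
    @jobs << job
  end

  # Gauges worth emitting periodically: workers in use and jobs waiting.
  def stats
    { size: @size, busy: @lock.synchronize { @busy }, backlog: @jobs.size }
  end

  # Lets queued jobs finish, then stops every worker.
  def shutdown
    @size.times { @jobs << :stop }
    @workers.each(&:join)
  end

  private

  def work
    while (job = @jobs.pop) != :stop
      @lock.synchronize { @busy += 1 }
      begin
        job.call
      ensure
        @lock.synchronize { @busy -= 1 }
      end
    end
  end
end
```

When busy equals size and backlog keeps growing, the pool is saturated.
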
Interacting Telemetry

Logging

Logging with Progname

require 'logger'

logger = Logger.new($stdout).tap do |log|
  log.level = ENV.fetch('LOG_LEVEL', :debug)
end

logger.info('server') { 'handling request' }
logger.debug('order-processor') { 'incoming order' }

# Outputs
#
# I, [2017-01-28T23:02:48.662657 #29158]  INFO -- server: handling request
# D, [2017-01-28T23:02:48.662730 #29158] DEBUG -- order-processor: incoming order
#
# Easy Grepping for subsystems when dealing with logs

Progname via DelegateClass

require 'delegate' # Underused and powerful library
require 'logger'

class NamedLogger < DelegateClass(Logger)
  def initialize(logger, progname)
    super logger
    @progname = progname
  end

  def info(msg)
    super(@progname) { msg }
  end
end

NamedLogger Example

logger = Logger.new $stdout

server = NamedLogger.new logger, :server
queue = NamedLogger.new logger, :queue

server.info 'incoming request'
queue.info 'processed message'

# Output
#
#  I, [2017-01-28T23:10:40.365454 #29392]  INFO -- server: incoming request
#  I, [2017-01-28T23:10:40.365516 #29392]  INFO -- queue: processed message
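
NamedLogger as written only tags info. A sketch of the same idea covering every severity; the class name SubsystemLogger is hypothetical, and it accepts either a message or a block so lazy evaluation still works:

```ruby
require 'delegate'
require 'logger'

# Like NamedLogger, but tags every severity with the progname.
class SubsystemLogger < DelegateClass(Logger)
  def initialize(logger, progname)
    super(logger)
    @progname = progname
  end

  %i[debug info warn error fatal unknown].each do |severity|
    define_method(severity) do |msg = nil, &block|
      # Explicit arguments are required when calling super inside define_method.
      # The block is only evaluated if the logger's level allows this severity.
      super(@progname) { block ? block.call : msg }
    end
  end
end
```

Everything else (level, formatter, and so on) still delegates to the wrapped Logger.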

So Remember

Please Read this Book

"Release It!" by Michael T. Nygard

Thank you