whoami
$ sudo rpm -i st2common-0.11.0-6.noarch.rpm error: Failed dependencies: python(abi) = 2.7 is needed by st2common-0.11.0-6.noarch
$ docker run hello-world Unable to find image 'hello-world:latest' locally Pulling repository docker.io/library/hello-world Network timed out while trying to connect to https://index.docker.io/v1/repositories/library/hello-world/images. You may want to check your internet connection or if you are behind a proxy.
error: unknown filesystem. grub rescue>
These things are:
In physics, a spherical cow in a vacuum is easy to simulate.
In operations, deployment, development, and other high-level workflows are easy if nothing ever breaks
Suffering through broken things constantly… is not fun
It's not only about being less miserable!
Friction, waste, downtime, you ulcer - whatever you call it, how do we mitigate it?
What can we learn from the past decade of operational trends?
Distrusting your installation
Distrusting your hardware
Distrusting the operating system
Distrusting your container runtime
The natural conclusion
Bin Packing
…using a new operating system each time isn't efficient
$ docker images | sort -k7 -h | awk '{ print $NF; }' \ | uniq | tr '\n' ' ' | fold -s -w 40 SIZE 1.84kB 109MB 127MB 160MB 191MB 195MB 202MB 206MB 288MB 298MB 390MB 594MB 644MB 645MB 685MB 697MB 706MB 711MB 750MB 752MB 812MB 842MB 879MB 917MB 918MB 922MB 929MB 935MB 1.06GB 1.07GB 9.56GB
Common factors
Practices that were repeated, scaled up, or focused on
"Can I recreate this if:"
I've forgotten what the steps are?
Kubernetes pods (or, modern-day "have you tried rebooting it?")
How do you record, version, and share operational knowledge?
APIs over hands on hardware
Next-generation: Kubernetes operators
Executables → Entire Machines → Load Balancers → All Operations
Everybody wants the Heroku/Travis CI experience
Wisdom of the ancients to apply:
Abstraction
Version incompatibilities everywhere
Terraform doesn't allow running any operations against a state that was written by a future Terraform version. The state is reporting it is written by Terraform '0.11.1'.
Maybe?
$ cat .tool-versions python 3.6.0 helm 2.8.2 terraform 0.11.0
Only prerequisite is Docker, everyone runs the same bits.
$ docker run --rm --interactive \ hashicorp/terraform:0.11.1 \ plan
plan
your pull requestsapply
on merge to master
Not if your service will die, but when
Congratulations, now your problem is CAP
Today's platforms assume a many-replica architecture
kubernetes, EC2 ASGs, GCP instance groups, nomad, consul, Elasticsearch, etcd, etc. etc.
$x
I can use?Secrets (vault)
I tend to build systems that (wrongly) assume:
Done until next time!
Hypothesis: the command line is currently the absolute best user interface to a computer for technical operators
…because…
…because this is tangible instead of telling the new team member "click on this page in the AWS console"
aws --region us-east-1 ec2 describe-instances \ --filter "Name=tag:Name,Values=*myhost*" \ | jq -r ' .Reservations \ | map(.Instances) \ | flatten \ | map(select(.State.Name | contains("running"))) '
Blurring the line between work and programming (control structures, variables, etc.)
Was there any point to this presentation?
Turn frustrations into technical achievements!
Assume everything is going to break, all the time, and merrily build around your pessimism
@retry(wait=wait_exponential(multiplier=1, min=1, max=60)) def perform_something_over_the_network():