Schibsted YAMS

Or how to build and maintain a service handling thousands of req/s with minimal dedication

Who are you?

Daniel Caballero

Devops/SRE Engineer @ Schibsted

Part time (Devops) lecturer @ La Salle University

I'm sorry. I did NOT write BSD

So... I work

... I (some kinda) teach

... I (try to) program...

... I (would like to) rock...

... and I live

So... I value my time (a lot)

I really don't like to waste it

Resolving incidents
Reactive work
Repetitive work

Schibsgrñvahed..WHAT??

What is Schibsted?

And SPT?

It's about convergence through global solutions

What's behind global components / services?

You build it, you run it

Probably nothing new on the horizon for you

first mention in 2006, by Werner Vogels / Amazon
Nice elaboration behind the rationale here

That means there's no ops/support/systems/devops team

{
    "format": "webp",
    "watermark": {
        "location": "north",
        "margin": "20px",
        "dimension": "20%"
    },
    "actions": [
        {
            "resize": {
                "width": 300,
                "fit": {
                    "type": "clip"
                }
            }
        }
    ],
    "quality": 90
}
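
A minimal sketch of how a rule like the one above could be interpreted, just to make the shape of the API concrete. It only computes the output dimensions; the function name `output_size` is hypothetical, and the meaning of `"clip"` (fit inside the requested bounds while preserving the aspect ratio) is an assumption, not something the rule itself states.

```python
import json

# The transformation rule from the slide, embedded as-is.
RULE = json.loads("""
{
    "format": "webp",
    "watermark": {"location": "north", "margin": "20px", "dimension": "20%"},
    "actions": [{"resize": {"width": 300, "fit": {"type": "clip"}}}],
    "quality": 90
}
""")


def output_size(rule, src_width, src_height):
    """Return (width, height) after applying the rule's resize actions in order.

    Assumption: "clip" scales the image to fit inside the requested bounds
    while keeping its aspect ratio. This sketch also upscales smaller images.
    """
    width, height = src_width, src_height
    for action in rule.get("actions", []):
        resize = action.get("resize")
        if resize is None:
            continue  # other action types are ignored in this sketch
        fit = resize.get("fit", {}).get("type", "clip")
        if fit != "clip":
            raise ValueError(f"unsupported fit type in this sketch: {fit}")
        # One ratio per bound that is present; the smallest one wins.
        ratios = []
        if resize.get("width"):
            ratios.append(resize["width"] / width)
        if resize.get("height"):
            ratios.append(resize["height"] / height)
        if not ratios:
            continue
        scale = min(ratios)
        width = max(1, round(width * scale))
        height = max(1, round(height * scale))
    return width, height
```

For example, `output_size(RULE, 1200, 800)` scales a 1200x800 source by 0.25.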

Why not offline transformations?

Lots of (user) contents. Reprocessing hurts
Sites are dynamic by nature. Some of them do adapt the content to the device

This may sound familiar...

CDNs able to transform contents on the fly:

As a native functionality...
Or through lambdas / edge computing

SaaS solutions:

Opensource solutions:

So...

Why did you invest time on that?

Why are you here?

Availability

Low latency

We are the owners of the backlog

Although sometimes that is not so useful...

Low costs

High usage

Effortless maintenance

(Almost) No incidents
New sites do not require high onboarding efforts

Let's say we dedicate half an engineer

We don't (usually) cut people in half, so let's say one engineer

But be careful: if you stop developing a service, you kill the service

So we try to convince the company it requires, at least, the focus of two engineers

But oncall rotations

Ok. Let's say 3-4. And we accept an extra project.

How did you achieve that?

"For every complex problem there is an answer that is clear, simple, and wrong", H. L. Mencken

Combination of...

Don't you see some similarity?

Team

Agile

Continuous improvement
- Experiment. It's about playing. Prompt feedback. Sometimes you win. Sometimes you learn
Autonomous
- Transversal team. We have our own provider accounts. Directly in touch with sites/clients

Benefiting from other Sch services

Reuse of other colleagues' code/components

Collaboration and transparency

Internal RFCs
Consumers as contributors
Internal opensource model

Proactive mindset

Warning: "we will care when needed" culture
- "Proactive vs Risk" balance is hard
Company pays for oncall availability

Product

There is an actual need

Project was initiated by and for several sites that had a common problem

Limited scope

API as the point of interaction
No business logic. "Dumb" service
Almost no functionality that is used by a single site only, or not used at all

As-a-service experience

Self-service
Multitenant API
Metrics reporting per tenant
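
A minimal sketch of what per-tenant metrics reporting could look like inside a multitenant API. The class `TenantMetrics`, its method names, and the choice to count only 5xx responses as errors are all hypothetical; this is an in-memory illustration, not how YAMS reports metrics.

```python
from collections import defaultdict


class TenantMetrics:
    """In-memory request and error counters, keyed by tenant identifier."""

    def __init__(self):
        self._requests = defaultdict(int)
        self._errors = defaultdict(int)

    def record(self, tenant, status_code):
        """Count one request; server-side failures (5xx) count as errors."""
        self._requests[tenant] += 1
        if status_code >= 500:
            self._errors[tenant] += 1

    def report(self):
        """Return one summary per tenant, so each site sees its own numbers."""
        summary = {}
        for tenant, total in self._requests.items():
            errors = self._errors.get(tenant, 0)
            summary[tenant] = {
                "requests": total,
                "errors": errors,
                "error_rate": errors / total,
            }
        return summary
```

Keying everything by tenant is what makes self-service possible: a site can check its own traffic and error rate without asking the team.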

Tech

Good design/tech choices

Immutable pattern
Microservices
AWS + Netflix stack
libvips
Non-blocking services

But neither perfect nor the best, for sure

Everything as code

No space for "one time" actions.

Alerting configuration by code
Infrastructure
(Most of the) application configuration
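
A minimal sketch of "alerting configuration by code": alert rules live in the repository as data, get validated, and are rendered into a document that a pipeline step could push to the monitoring system. The `AlertRule` fields, the metric names, and the thresholds are hypothetical examples, not the actual YAMS alerts.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float
    comparison: str  # ">" or "<"
    for_minutes: int  # how long the condition must hold before alerting

    def validate(self):
        if self.comparison not in (">", "<"):
            raise ValueError(f"{self.name}: unknown comparison {self.comparison!r}")
        if self.for_minutes <= 0:
            raise ValueError(f"{self.name}: for_minutes must be positive")


# Hypothetical rules; reviewed and versioned like any other code change.
RULES = [
    AlertRule("high-5xx-rate", "http_5xx_ratio", 0.01, ">", 5),
    AlertRule("high-latency-p99", "latency_p99_ms", 500, ">", 10),
]


def render(rules):
    """Validate every rule and return a deterministic JSON document."""
    for rule in rules:
        rule.validate()
    return json.dumps([asdict(rule) for rule in rules], indent=2, sort_keys=True)
```

Because the output is deterministic, a change to an alert shows up as a readable diff in code review, and there is no room for a "one time" manual tweak.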

Continuous Delivery

And capacity to incorporate everything into the pipeline.

Small deltas. Iterative deliveries. Low-risk deployments. And be smart about taking risks

Look forward, rather than investing lots of time in your rollback strategy

0-error target

Yeah, Google also states something different by introducing error budgets...

... but it helped us:

to understand & tune the platform,
to gain trust from Sch sites, avoiding major disruptions when big sites onboarded,
and to minimize unplanned / reactive activities
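
For contrast with a 0-error target, a minimal worked example of what an error budget means in numbers. The 99.9% SLO and the 2000 req/s figure are assumptions picked for illustration, not YAMS values, and `error_budget` is a hypothetical helper.

```python
def error_budget(slo, requests_per_second, window_seconds=24 * 60 * 60):
    """Number of failed requests an SLO tolerates over a time window.

    slo is the target success ratio, e.g. 0.999 for "three nines".
    """
    total_requests = requests_per_second * window_seconds
    return round(total_requests * (1 - slo))


# Assumed figures: 2000 req/s sustained for a day, with a 99.9% SLO.
DAILY_BUDGET = error_budget(0.999, 2000)
```

Under those assumptions the budget is 172,800 failed requests per day. A 0-error target treats every one of those as something to understand, which is how it forces you to learn the platform.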

Observability toolkit

Shit happens
- let's minimize pain
Unlocks experimentation culture
- As understanding what happens becomes easier

If we connect this to immutability...

Incident troubleshooting can become a forensics exercise

Nice solution... but

Why not docker/k8s?

Local tests
YAMS Portal/Frontend already there
Migration exercise

Why not a Service Mesh?

How to prevent unplanned work even more?

Canary analysis
- See Spinnaker implementation
Stress tests in the acceptance tests
- A specific TCP stress tool was released: tcpgoon
Simulate dependencies degradation
- Hoverfly: similar in concept to the Simian Army from Netflix, but specialized in APIs
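
A minimal sketch of the idea behind canary analysis: compare the error rate of the new version (canary) against the current one (baseline) before promoting it. This is not Spinnaker's algorithm; the function `canary_verdict`, the `max_ratio` tolerance, the `min_requests` guard, and the absolute floor are all hypothetical choices for illustration.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests,
                   max_ratio=1.5, min_requests=100, floor=0.001):
    """Return "pass", "fail", or "inconclusive" for a canary deployment.

    The canary passes when its error rate stays within max_ratio times the
    baseline error rate. The floor keeps a zero-error baseline from failing
    a canary over a single stray error.
    """
    if baseline_requests < min_requests or canary_requests < min_requests:
        return "inconclusive"  # not enough traffic to judge either side
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    allowed = max(baseline_rate * max_ratio, floor)
    return "pass" if canary_rate <= allowed else "fail"
```

A pipeline stage could call this with counters collected from both groups over the same time window and stop the rollout on anything other than `"pass"`.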