Schibsted YAMS

Or how to build and maintain a service handling thousands of req/s with minimal dedication

Who are you?



Daniel Caballero

Devops/SRE Engineer @ Schibsted

Part time (Devops) lecturer @ La Salle University

So... I work

... I (kind of) teach

... I (try to) program...

... I (would like to) rock...

... and I live

So... I value my time (a lot)

I really don't like to waste it



  • Resolving incidents
  • Reactive work
  • Repetitive work

Schibsgrñvahed..WHAT??

What is Schibsted?

And SPT?

It's about convergence through global solutions

What's behind global components / services?

You build it, you run it

Probably nothing new on the horizon for you

That means there's no ops/support/systems/devops team

{
    "format": "webp",
    "watermark": {
        "location": "north",
        "margin": "20px",
        "dimension": "20%"
    },
    "actions": [
        {
            "resize": {
                "width": 300,
                "fit": {
                    "type": "clip"
                }
            }
        }
    ],
    "quality": 90
}
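As a rough illustration of what the "resize" action in a request like the one above could mean, here is a minimal Python sketch. It assumes that "clip" scales the image down to the requested width while keeping the aspect ratio and never upscales; that semantics, and the names `clip_resize` and `output_size`, are assumptions for illustration, not the YAMS implementation.

```python
import json

# Hypothetical transformation spec, mirroring the JSON shown above
# (watermark omitted because it does not affect the output size).
SPEC = json.loads("""
{
    "format": "webp",
    "actions": [{"resize": {"width": 300, "fit": {"type": "clip"}}}],
    "quality": 90
}
""")


def clip_resize(src_width, src_height, target_width):
    """Return (width, height) scaled to target_width, keeping aspect ratio.

    Assumption: "clip" never upscales, so smaller images are left untouched.
    """
    if src_width <= target_width:
        return src_width, src_height
    scale = target_width / src_width
    return target_width, max(1, round(src_height * scale))


def output_size(spec, src_width, src_height):
    """Apply every "clip" resize action found in the spec, in order."""
    width, height = src_width, src_height
    for action in spec.get("actions", []):
        resize = action.get("resize")
        if resize and resize.get("fit", {}).get("type") == "clip":
            width, height = clip_resize(width, height, resize["width"])
    return width, height
```

Under these assumptions, a 1200x800 source would be delivered as 300x200, while a 200x100 source would keep its original size.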

Why not offline transformations?

  • Lots of (user) content. Reprocessing hurts
  • Sites are dynamic by nature. Some of them adapt the content to the device

This may sound familiar...

CDNs can transform content on the fly:

  • As a native functionality...
  • Or through lambdas / edge computing

SaaS solutions:

Opensource solutions:

So...

Why did you invest time in that?

Why are you here?

Availability

Low latency

We are the owners of the backlog

Even though sometimes that is not so useful...

Low costs

High usage

Effortless maintenance

  • (Almost) No incidents
  • New sites do not require high onboarding efforts

Let's say we dedicate half an engineer

We don't (usually) cut people in half, so let's say one engineer

But be careful: if you stop developing a service, you kill the service

So we try to convince the company it requires, at least, the focus of two engineers

But there are on-call rotations

Ok. Let's say 3-4. And we accept an extra project.

How did you achieve that?

"For every complex problem there is an answer that is clear, simple, and wrong", H. L. Mencken

Combination of...

Don't you see some similarity?

Team

Agile

Benefiting from other Schibsted services

Reusing colleagues' code/components

Collaboration and transparency



  • Internal RFCs
  • Consumers as contributors
  • Internal opensource model

Proactive mindset

Product

There is an actual need

Project was initiated by and for several sites that had a common problem

Limited scope

  • API as the point of interaction
  • No business logic. "Dumb" service
  • Almost no functionality that is used by just a single site, or not used at all

As-a-service experience

  • Self-service
  • Multitenant API
  • Metrics reporting per tenant
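To make "metrics reporting per tenant" concrete, here is a minimal Python sketch of per-tenant request and error counters. The class name `TenantMetrics`, the tenant identifiers, and the choice to count only 5xx responses as errors are all hypothetical; the slides do not describe how YAMS actually collects its metrics.

```python
from collections import Counter


class TenantMetrics:
    """In-memory request/error counters keyed by tenant (illustrative only)."""

    def __init__(self):
        self.requests = Counter()
        self.errors = Counter()

    def record(self, tenant, status_code):
        """Count one request; 5xx responses are also counted as errors."""
        self.requests[tenant] += 1
        if status_code >= 500:
            self.errors[tenant] += 1

    def report(self):
        """Return a per-tenant summary suitable for a dashboard or export."""
        return {
            tenant: {"requests": count, "errors": self.errors[tenant]}
            for tenant, count in self.requests.items()
        }


# Hypothetical usage with two made-up tenants.
metrics = TenantMetrics()
metrics.record("site-a", 200)
metrics.record("site-a", 503)
metrics.record("site-b", 200)
```

Keeping the tenant as the key of every metric is what lets each site see its own usage without the team building per-site tooling.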

Tech

Good design/tech choices

But neither perfect nor the best, for sure

Everything as code

No room for "one-time" actions.

  • Alerting configuration by code
  • Infrastructure
  • (Most of the) application configuration

Continuous Delivery

And the capacity to incorporate everything into the pipeline.

Small deltas. Iterative deliveries. Low-risk deployments. And be smart about taking risks

Look forward, rather than investing lots of time in your rollback strategy

0-error target

Yeah, Google advocates something different with error budgets...

... but it helped us:

  • to understand & tune the platform,
  • to gain trust from Schibsted sites, avoiding major disruptions when big sites onboarded,
  • and to minimize unplanned / reactive activities
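For context on the difference between a 0-error target and an error budget, here is a small worked example in Python. The traffic figure of 1000 req/s and the 99.9% SLO are illustrative assumptions, not YAMS numbers, and `allowed_errors` is a hypothetical helper.

```python
from decimal import Decimal


def allowed_errors(total_requests, slo_percent):
    """Failed requests tolerated by an availability SLO given in percent.

    slo_percent is passed as a string (e.g. "99.9") so the arithmetic is exact.
    """
    failure_ratio = (Decimal(100) - Decimal(slo_percent)) / Decimal(100)
    return int(total_requests * failure_ratio)


REQUESTS_PER_SECOND = 1000        # assumption: illustrative traffic level
SECONDS_PER_DAY = 24 * 60 * 60
daily_requests = REQUESTS_PER_SECOND * SECONDS_PER_DAY

budget_999 = allowed_errors(daily_requests, "99.9")   # error-budget approach
budget_zero = allowed_errors(daily_requests, "100")   # 0-error target
```

With these assumed figures, a 99.9% SLO tolerates 86,400 failed requests per day, whereas a 0-error target treats every single failure as something to understand.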

Observability toolkit

  • Shit happens
    • let's minimize pain
  • Unlocks experimentation culture
    • As understanding what happens becomes easier

If we connect this to immutability...

Incident troubleshooting can become a forensics exercise

Nice solution... but

Why not docker/k8s?

  • Local tests
  • YAMS Portal/Frontend already there
  • Migration exercise

Why not a Service Mesh?

How to prevent even more unplanned work?

  • Canary analysis
  • Stress tests in the acceptance tests
    • We released a specific TCP stress tool for this: tcpgoon
  • Simulate dependency degradation
    • with Hoverfly: similar in concept to Netflix's Simian Army, but specialized in APIs
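The canary analysis mentioned in the list above can be sketched as a comparison of error rates between the current fleet and the canary. This is a minimal Python sketch under stated assumptions: the function names, the 1.5x tolerance, the 0.1% absolute floor, and the minimum traffic threshold are all hypothetical and do not describe the YAMS pipeline.

```python
def error_rate(errors, requests):
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def canary_is_healthy(baseline, canary, max_ratio=1.5, min_requests=500):
    """Compare a canary against the baseline fleet.

    baseline and canary are (errors, requests) tuples.
    Returns None while the canary has not served enough traffic to judge,
    otherwise True/False.
    """
    _, canary_requests = canary
    if canary_requests < min_requests:
        return None
    base_rate = error_rate(*baseline)
    canary_rate = error_rate(*canary)
    # Assumed absolute floor, so a zero-error baseline does not make
    # a single canary error an automatic failure.
    allowed = max(base_rate * max_ratio, 0.001)
    return canary_rate <= allowed
```

A pipeline step could promote the release on `True`, roll it back on `False`, and keep waiting on `None`.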

Are you going to opensource it?

  • Schibsted contributes to opensource projects
  • And also releases projects of its own
  • Problem: Not following a "contribute-first" approach
  • But we have already contributed to bimg, Zuul, KrakenD...

Are you going to offer this as SaaS to other companies?

Final reminder

Be Rx in the code...

But not in real life

Given...

a CPU is quicker than you at handling interrupts

Eventually, your company will not pay for just hero-style engineers

Many thanks

And especially...

Edge colleagues

Keep rocking, Poland!

Other Qs?

dan . caba at gmail (dot)com
