YAITSASUG

Yet Another Image Transformation Service At Scale Using Golang

dani(dot)caba at gmail(dot)com

Who are you?

dcProfile := map[string]string{
  "name": "Daniel Caballero",
  "title": "Staff DevOps Engineer",
  "mail": "dani(dot)caba at gmail(dot)com",
  "company": "SchibstedPT",
  "previously_at": "NTTEurope, Semantix, Oracle",
  "linkedin": "https://www.linkedin.com/in/danicaba",
  "extra": "Gestión DevOps de Arquitecturas IT@LaSalle",
}

So... I work

... I (some kinda) teach

... I (try to) program...

... I (would like to) rock...

... and I live

So... I value my time (a lot)

And I really don't like to waste it resolving incidents

Schibsgrñvahed..WHAT??

What is Schibsted?

Marketplaces global expansion

Large group of companies

And SPT?

{
    "format": "jpg",
    "watermark": {
        "location": "north",
        "margin": "20px",
        "dimension": "20%"
    },
    "actions": [
        {
            "resize": {
                "width": 300,
                "fit": {
                    "type": "clip"
                }
            }
        }
    ],
    "quality": 90
}
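A rule like the one above can be decoded with encoding/json. The struct below is an illustrative guess at the shape, not the service's actual types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Rule mirrors the JSON transformation rule shown above.
// Field layout here is an assumption for illustration only.
type Rule struct {
	Format    string `json:"format"`
	Watermark struct {
		Location  string `json:"location"`
		Margin    string `json:"margin"`
		Dimension string `json:"dimension"`
	} `json:"watermark"`
	// Each action is a single-key object ("resize", ...), so a map of
	// raw messages keeps the sketch generic.
	Actions []map[string]json.RawMessage `json:"actions"`
	Quality int                          `json:"quality"`
}

func parseRule(data []byte) (Rule, error) {
	var r Rule
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	raw := []byte(`{"format":"jpg","watermark":{"location":"north","margin":"20px","dimension":"20%"},"actions":[{"resize":{"width":300,"fit":{"type":"clip"}}}],"quality":90}`)
	r, err := parseRule(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(r.Format, r.Quality, len(r.Actions)) // jpg 90 1
}
```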

The journey

2+1/2 YEARS AGO

First onboardings

Onboarding pipelines

First nightmares

New Architecture

New Core

Self service capabilities

Updated onboarding pipelines

Current usage

(Your?) thoughts so far...

Why maintain your own service?

But there are already open source HTTP servers for that, right?

Why not offline transformations?

Why microservices?

+

  • Quicker releases
  • APIGW helps to delegate common functionality
    • But only business-agnostic ones
  • Reusability of individual microservices
  • Each microservice can choose different techs
    • We will focus on delivery-images, in Golang
  • Easier to scale with the organization/development team
    • Though we're not taking full advantage of this
  • More granular scalability
  • and... fun

-

  • S2S communication overhead
  • It can imply extra costs
  • More tooling required (logging, tracing...)
  • Reproducing the complete environment becomes tricky
  • Always caring about coupled services...

Why not CDN/edge transformations?

  • Some functionality may be covered...
    • Typically resizing and format conversion
    • But not all our functionality (watermarking?)
  • It may mean duplicated processing
  • Not easy to pack something like libvips as lambdas
  • No unique & global CDN in Schibsted

Not a new story... so why didn't we present it before?

Why transformations in golang?

Transformation library

  • imageflow was not production ready two years ago, with clear gaps in functionality and bindings

Choosing the programming language

Platform (& development) properties

IaC

  • Most of the services in AWS...
  • Generating Cloudformations from python troposphere
  • Managing Cloudformation deployments with Sceptre
  • New projects keep the infrastructure definition in the same repo as the service code
    • Trying to extend CD to Infrastructure
  • We have assessed AWS GoFormation
    • But still lacks some functionality, like GetAtt or Ref

Devel bots

Code reviews

Continuous integration and delivery

Travis

language: go
go:
- 1.9.3

script:
- diff -u <(echo -n) <(gofmt -s -d $(find . -type f -name '*.go' -not -path "./vendor/*"))
- docker login -u="$ARTIFACTORY_USER" -p="$ARTIFACTORY_PASSWORD" containers.schibsted.io
- "./requirements/start-requirements.sh -d"
- "_script/tests-docker"
- "_script/compile-docker"
- "_script/cibuild"

deploy:
  skip_cleanup: true
  on:
    all_branches: true
  provider: script
  script: _script/deploy

FPM

fpm -s dir \
    -t rpm \
    -n ${PACKAGE_NAME}${DEV} \
    -v ${VERSION} \
        --iteration ${ITERATION} \
    --description "Yams delivery images. Commit: ${GIT_COMMIT_ID}" \
    --before-install ${TRAVIS_BUILD_DIR}/_pkg/stopservice \
    --after-install ${TRAVIS_BUILD_DIR}/_pkg/postinst \
    --before-remove ${TRAVIS_BUILD_DIR}/_pkg/stopservice \
    --depends datadog-config \
    --depends sumologic-config \
    ${DIST_PATH}/${PACKAGE_NAME}/=/

Hardened images

Spinnaker

Acceptance & Stress testing

Locust

Vegeta

Configuration management

  • Using Netflix Archaius in all RxJava services
    • Configured with DynamoDB tables...
    • so dynamic reconfiguration is possible...
    • and quite useful when dealing with outages :)
  • ... but Viper in delivery-images

Observability

Logs

  • Using Sumologic to aggregate them into a central location
    • Quite happy now... but it costs (we log ~100 GB per day)
    • Daemon/sidecar with specific config (files to forward)

Using logrus in delivery-images

  • Disabling locking
  • Time/Date formatting equivalent to the logs in Java
  • Zap could be used...
    • but not concerned about overhead, as...
    • fetching and transforming images are massive ops in comparison

Monitoring and alerting

  • We use Datadog
    • System + Custom application metrics
      • Via statsd, sidecar model (again)
    • Importing also Cloudwatch metrics
  • Extensive usage of:
    • Dashboards (troubleshooting and also KPIs)
    • Alerting

  pre:
    not_allowed_notify_to:
    - "@webhook-alert-gateway-sev2"
    - "@pagerduty"
    healthy_host_count_critical: 0.0
  pro:
    healthy_host_count_critical: 1.0

monitors:
  - name: "[ALB] - {{name.name}} in region {{region.name}} - 5xx backend error rate"
    tags:
      - "app:yams"
    type: "metric alert"
    options:
      require_full_window: false
      thresholds:
        warning: 0.05
        critical: 0.1
      notify_no_data: false

Distributed tracing

Integration

Where?

  • httpserver, as a middleware
    • so incoming requests' headers are considered
  • all functions where you want to add instrumentation
  • client factory
    • to propagate spans to outgoing requests

n := negroni.New(recovery)

n.Use(NewRequestIdHandler())
n.Use(tracing.NewTracingHandler())
n.Use(negroni.NewStatic(http.Dir(serverConfig.schemaDir)))

loggingMiddleware := negronilogrus.NewMiddlewareFromLogger(serverConfig.log, "bumblebee")
loggingMiddleware.SetLogStarting(false)
n.Use(loggingMiddleware)

n.Use(statsMiddleware)
n.UseHandler(httpRouter)
return n

if transformedImage.FromCache {
    b.Logger.WithField("id", requestId).Debug("Found Transformation in cache")
    tracing.AddLogToSpanInContext(request.Context(), "Got image from cache")
    b.Monitor.Incr("transformation.cache.hit", tags, 1)
} else {
    b.Monitor.Incr("transformation.cache.miss", tags, 1)
    transformationElapsed := time.Since(transformationStart)
    transformationElapsedInMillis := float64(transformationElapsed.Nanoseconds()) / 1e6
    b.Monitor.TimeInMilliseconds("request.transformation.duration", transformationElapsedInMillis, tags, 1)
    b.Monitor.Gauge("request.transformation.duration", transformationElapsedInMillis, tags, 1)
    b.Logger.WithField("id", requestId).Debug("Transformation took: ", transformationElapsed.Seconds(), " secs")
    tracing.AddLogKvToSpanInContext(request.Context(), "transformation.duration", transformationElapsed.String())
}

S2S resiliency

Actors in delivery-images

Implementation detail

viper.SetDefault("eureka.datacenter", amazonDatacenterInfo)
datacenter := viper.GetString("eureka.datacenter")
if datacenter == k8DatacenterInfo {
    var err error
    appInstance, err = sdeureka.NewK8SFargoInstance(eurekaConfig)
    if err != nil {
        return nil, nil, err
    }
} else if datacenter == amazonDatacenterInfo {
    awsSession, err := session.NewSession()
    if err != nil {
        log.Info("registerEureka", "error creating an AWS session", "message", err)
        return nil, nil, err
    }

    instanceMeta := ec2metadata.New(awsSession)
    appInstance, err = sdeureka.NewAwsFargoInstance(log, eurekaConfig, instanceMeta)
    if err != nil {
        return nil, nil, err
    }
} else if datacenter == localDatacenterInfo {
    var err error
    appInstance, err = sdeureka.NewLocalFargoInstance(eurekaConfig)
    if err != nil {
        return nil, nil, err
    }
}

hystrix.ConfigureCommand("DeliveryImages#GetWatermarkFromLogicManager", hystrix.CommandConfig{
    Timeout:               viper.GetInt("hystrix.command.DeliveryImagesGetWatermarkFromLogicManager.timeout"),
    MaxConcurrentRequests: viper.GetInt("hystrix.command.DeliveryImagesGetWatermarkFromLogicManager.maxConcurrentRequests"),
    ErrorPercentThreshold: viper.GetInt("hystrix.command.DeliveryImagesGetWatermarkFromLogicManager.errorPercentThreshold"),
})

hystrix.ConfigureCommand("DeliveryImages#GetResourceByAliasFromTranslator", hystrix.CommandConfig{
    Timeout:               viper.GetInt("hystrix.command.GetResourceByAliasFromTranslator.timeout"),
    MaxConcurrentRequests: viper.GetInt("hystrix.command.GetResourceByAliasFromTranslator.maxConcurrentRequests"),
    ErrorPercentThreshold: viper.GetInt("hystrix.command.GetResourceByAliasFromTranslator.errorPercentThreshold"),
})

Real time monitoring

Almost for free

func httpServer(serverConfig deliveryImagesServer) *negroni.Negroni {

    httpRouter := httprouter.New()
    httpRouter.GET("/tenants/:tenant_id/domains/:domain_id/buckets/:bucket_id/images/*image_id", serverConfig.bumblebeeController.Handler)
    httpRouter.HEAD("/tenants/:tenant_id/domains/:domain_id/buckets/:bucket_id/images/*image_id", serverConfig.bumblebeeController.Handler)
    httpRouter.POST("/schema/validate", serverConfig.schemaController.Handler)
    httpRouter.GET("/healthcheck", serverConfig.healtcheckHandler)

    hystrixStreamHandler := hystrix.NewStreamHandler()
    hystrixStreamHandler.Start()
    httpRouter.GET("/hystrix.stream", hystrixStream(hystrixStreamHandler))

    // ... rest of the wiring omitted (see the negroni middleware chain above)
}

Graceful shutdowns

We strongly rely on autoscaling:

Deregistering instances makes scale-downs cleaner

And there's still more...

Some delivery-images-specific details

bimg + libvips

imageType := bimg.DetermineImageType(imageToTransform.Buffer())

tracing.AddLogToSpanInContext(transformationRequest.Context, "Imagetype: "+bimg.ImageTypeName(imageType))

autoRotate := getAutoRotateFromRule(rule)
if autoRotate {
    err = imageToTransform.AutoRotate()
    if err != nil {
        t.Log.WithField("id", transformationRequest.RequestId).Errorf("Failed to apply autorotation. Error: %v", err)
        return TransformedImage{}, err
    }
}

imageToTransform, err = t.applyBlurring(imageToTransform, transformationRequest)
if err != nil {
    t.Log.WithField("id", transformationRequest.RequestId).Errorf("Failed to apply blurring. Error: %v", err)
    return TransformedImage{}, err
}

if err = t.applyPixelation(imageToTransform, transformationRequest); err != nil {
    t.Log.WithField("id", transformationRequest.RequestId).Errorf("Failed to apply pixelation. Error: %v", err)
    return TransformedImage{}, err
}

func (t *VipsTransformer) applyPixelation(imageToTransform *bimg.SchImage, transformationRequest TransformationRequest) error {

    if len(transformationRequest.Rule.GetPixelateAreas()) == 0 {
        return nil
    }

    start := time.Now()
    pixelAreas := make([]bimg.PixelateArea, len(transformationRequest.Rule.GetPixelateAreas()))

    for i, pixelArea := range transformationRequest.Rule.GetPixelateAreas() {
        pixelAreas[i] = bimg.PixelateArea{
            Left:          pixelArea.Left,
            Top:           pixelArea.Top,
            Width:         pixelArea.Width,
            Height:        pixelArea.Height,
            MinNrOfPixels: pixelArea.MinNrOfPixels,
            MinPixelSize:  pixelArea.MinPixelSize,
        }
    }
    defer t.metrics("pixelate", transformationRequest.Context, transformationRequest.RequestId, start)
    return imageToTransform.Pixelate(pixelAreas)
} 

Caching

Be careful with...

What we are caching

  • External HTTP responses, thanks to the CDN
  • Local in-memory caches to avoid s2s calls
  • Watermarks cache (also in memory)
  • Transformations (in S3)

What we are NOT caching

  • Some controversy in the team around DynamoDB Accelerator (DAX)
  • We decided to keep investing in the s2s cache rather than caching as a (transparent) storage proxy
    • You probably don't want to share a caching cluster between all your microservices

Implementation

  • Using freecache
    • But we implemented our own for some other entities
  • It can manage retention for us, but it's not very smart, especially compared to com.google.common.cache:
    • No background refreshing
    • No locking; one key expiring may mean thousands of backend requests
  • Randomized TTLs, so we avoid all instances expiring contents at the same time

Is that enough?

GIFs and heavy-load transformations

  • Media (newspapers) use cases are quite different from marketplaces
    • they use completely dynamic transformations
      • we accept them in the JWT payload
    • we've seen attempts at:
      • transforming GIFs with hundreds of frames (actually short video clips)
      • and also a 21,603×14,400 px image (that's over 300 Mpx)
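A simple guard against such inputs is a pixel budget checked before transforming; the 50 Mpx limit below is illustrative, not our production value:

```go
package main

import (
	"errors"
	"fmt"
)

const maxMegapixels = 50 // illustrative budget

// checkDimensions rejects requests whose output would exceed the pixel
// budget, protecting the transformer from 300 Mpx-style inputs.
func checkDimensions(width, height int) error {
	if width*height > maxMegapixels*1000*1000 {
		return errors.New("image too large to transform")
	}
	return nil
}

func main() {
	fmt.Println(checkDimensions(21603, 14400)) // rejected: ~311 Mpx
	fmt.Println(checkDimensions(3000, 2000))   // allowed: 6 Mpx
}
```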

Caching does help

  • We don't cancel ongoing transformations
    • so everything eventually gets transformed
  • We implemented a "best effort" rate limiting for GIFs...
    • We protect individual nodes (max concurrency)
      • You want to transform frames in parallel...
      • But keeping resources for other transformations
    • And, between nodes...
      • Registering ongoing work in tables
        • to prevent duplicated work
  • We introduced some limits on the size of the output

Transformations cache

Datastore access

Mind S3 VPC endpoints!!

Nice solution... but

Why not docker/k8s?

  • Local tests
  • YAMS Portal/Frontend already there
  • Migration exercise

gRPC?

Why not a Service Mesh?

And Prometheus?

We may.

And it may be a good moment to consider opencensus.

Actual (& not so far) future

Multiregion

  • Buckets already replicated in two regions per continent
  • v0 was running in Oregon and Ireland
  • v1 in Ireland, and now also being deployed in Virginia
    • We may also deploy in Paris or Frankfurt
  • Still open discussions in regards to DynamoDB usage

Smoke tests

  • We had some satellites in v0 calling the API constantly
  • Interesting as:
    • Closer to user experience
    • Avoid delays between Cloudwatch and Datadog
  • Javascript/client alerting not an option (yet)

More elasticity to reduce costs

  • Changes in transformation rules mean massive eviction
    • So we are a bit overscaled...
  • Better degradation and more efficient ASG triggers
    • Reusing cache if no capacity
    • Automatic ASG parameters adjustments
    • Minimize parallelization in the transformation pipe
    • Incoming queue

Extra compression

  • Currently libjpeg-turbo
  • Good for performance, pretty decent results, but...
  • MozJPEG, API-compatible with libjpeg
  • guetzli, from Google

Bringing the service closer to the business

  • Image uploader
  • Online image editor
  • Integration with data services
    • Automatic classification
    • Nudity detector
    • Car plate pixelation

More transformation engines

  • PDF conversion makes sense for attachments (CVs)

% go test -bench . -benchtime 30s -timeout 30m
goos: linux
goarch: amd64
pkg: github.schibsted.io/daniel-caballero/documentsConversionTests
BenchmarkConvertAllFiles/localUnoconv/CV-Templates-Curriculum-Vitae.doc-4                     30        1335211165 ns/op
BenchmarkConvertAllFiles/directLibreOffice/CV-Templates-Curriculum-Vitae.doc-4                50         812129343 ns/op
BenchmarkConvertAllFiles/justCopy/CV-Templates-Curriculum-Vitae.doc-4                      20000           2100841 ns/op
BenchmarkConvertAllFiles/localUnoconv/CV-Templates-Curriculum-Vitae_with_Large_Pic.doc-4                       5        7936095889 ns/op
BenchmarkConvertAllFiles/directLibreOffice/CV-Templates-Curriculum-Vitae_with_Large_Pic.doc-4                  5        7033935000 ns/op
BenchmarkConvertAllFiles/justCopy/CV-Templates-Curriculum-Vitae_with_Large_Pic.doc-4                        2000          29097488 ns/op
BenchmarkConvertAllFiles/localUnoconv/CV-Templates-Curriculum-Vitae_with_Small_Pic.doc-4                      30        1273404605 ns/op
BenchmarkConvertAllFiles/directLibreOffice/CV-Templates-Curriculum-Vitae_with_Small_Pic.doc-4                100         673872470 ns/op
BenchmarkConvertAllFiles/justCopy/CV-Templates-Curriculum-Vitae_with_Small_Pic.doc-4                       30000           1455526 ns/op
BenchmarkConvertAllFiles/localUnoconv/CVTemplate.odt-4                                                        20        1698359980 ns/op
BenchmarkConvertAllFiles/directLibreOffice/CVTemplate.odt-4                                                   50        1057170276 ns/op
BenchmarkConvertAllFiles/justCopy/CVTemplate.odt-4                                                         30000           1560396 ns/op
BenchmarkConvertAllFiles/localUnoconv/CVTemplate_with_Large_Pic.odt-4                                          5        7347933754 ns/op
BenchmarkConvertAllFiles/directLibreOffice/CVTemplate_with_Large_Pic.odt-4                                    10        6864524329 ns/op
BenchmarkConvertAllFiles/justCopy/CVTemplate_with_Large_Pic.odt-4                                           2000          30033113 ns/op
BenchmarkConvertAllFiles/localUnoconv/CVTemplate_with_Small_Pic.odt-4                                         20        1669373873 ns/op

  • Video...

Actual transformation pipelines

More adoption?

Some major Marketplaces are not using the service, yet

ApiGW replacement?

Zuul could be replaced by KrakenD

Simulating dependencies failures

Hoverfly: similar in concept to Netflix's Simian Army, but specialized in API degradations

Price calculator

func main() {
        defaultMonthsArg := 1
        monthsFlag := NewIntFlag(&defaultMonthsArg)
        weeksFlag := NewIntFlag(nil)
        flag.Var(&monthsFlag, "months", "How many months to go back. Default is applied if weeks are not specified.")
        flag.Var(&weeksFlag, "weeks", "How many weeks to go back. Incompatible with the months argument.")
        debugPtr := flag.Bool("debug", false, "Print debugging information to standard error.")
        outputPtr := flag.String("output", "text", "Define type of output {text|csv}.")
        apiVersionPtr := flag.String("apiversion", "all", "API version to take into account: {all|v1|v0}.")

        // default matches YAMS-team conventions
        profilePtr := flag.String("profile", "spt-ms-pro", "Profile from your aws config files you want to use.")
        engineersPtr := flag.Int("nengineers", 2, "Number of engineers maintaining the service.")
        ddMonthCostPtr := flag.Int("ddogmonthlycost", 4000, "Datadog monthly cost (US $) you want to consider.")
        sumoMonthCostPtr := flag.Int("sumomonthlycost", 16000, "Sumologic monthly cost (US $) you want to consider.")

        flag.Parse()

        if monthsFlag.wasSet && weeksFlag.wasSet {
                output.PrintError("You cannot use months AND weeks arguments at the same time")
                os.Exit(2)
        }

Classifier end to end tests

Again, go benchmark:

% ./_script/run_benchmark
goos: linux
goarch: amd64
pkg: github.schibsted.io/platform-services/yams-classifier-endtoendtest
BenchmarkUploadAndClassifyAllFiles/2MB.jpg-4                  50        1715680610 ns/op
BenchmarkUploadAndClassifyAllFiles/6MB.jpg-4                  30        2427555276 ns/op
BenchmarkUploadAndClassifyAllFiles/lbc_large_1_8MB.jpg-4                      50        2132103767 ns/op
BenchmarkUploadAndClassifyAllFiles/lbc_large_2_8MB.jpg-4                      30        3191113451 ns/op
BenchmarkUploadAndClassifyAllFiles/testCardSmall.jpg-4                       100         801532429 ns/op
PASS
ok      github.schibsted.io/platform-services/yams-classifier-endtoendtest      506.423s

It almost includes a YAMS SDK

yamsConfig, err := yamscfgfile.NewYamsConfig()
if err != nil { 
        b.Fatal(err)
}
yamsClient := yamssdk.NewYamsClient(yamsConfig)
...
imageUrls, reqId, err := yamsClient.Upload(fileReader, "image/jpeg", daysWeWillPreserveUploadedFiles)
if err != nil {
        b.Fatal("Error uploading image", imageFilename, "against YAMS. Reqid:", reqId,
                ". Err:", err)
}
classificationResult, err := dataimagesdk.Classify(imageUrls.StaticDeliveryUrl)
...

Before closing...

Are you going to opensource it?

  • Schibsted does support contributing to open source projects
  • As well as releasing internal code
  • Problem: we are not following a "contribute-first" approach
  • But we have already contributed to bimg, Zuul, KrakenD...

Are you going to offer this SaaS to other companies?

Corollary

Be Rx in the code...

But not in real life

Great thanks...

Sch*

And especially...

Edge colleagues

Other Qs?