High volume web services, maxconnections and tcpgoon

dani(dot)caba at gmail(dot)com

Dec 17th, 2017

Who are you?

dcProfile := map[string]string{
  "name": "Daniel Caballero",
  "title": "Devops Engineer",
  "mail": "dani(dot)caba at gmail(dot)com",
  "company": &SchibstedPT,
  "previously_at": []company{&NTTEurope, &Semantix, &Oracle},
  "linkedin": http.Get("https://www.linkedin.com/in/danicaba"),
  "extra": "Gestión DevOps de Arquitecturas IT@LaSalle",
}

What are you bringing here?

A little bit of context...

What is Schibsted?

And SPT?

So what?

2016-06-11 - Incident

Actual root issue

(Easy) fix...

$ cat /etc/security/limits.d/tomcat.conf
tomcat hard nofile 10240
$ cat /etc/tomcat7/server.xml
...
<Connector port="8080" protocol="org.apache.coyote.http11.Http11NioProtocol"
connectionTimeout="20000"
redirectPort="8443" />
...

And some testing...

package main

import (
    "bufio"
    "flag"
    "fmt"
    "net"
    "os"
    "strconv"
    "sync"
    "time"
)

// connectionHandler opens a single TCP connection and keeps it open, printing
// anything the server sends, until the remote end closes it.
func connectionHandler(id int, host string, port int, wg *sync.WaitGroup) {
    defer wg.Done()
    fmt.Println("\t runner " + strconv.Itoa(id) + " is initiating a connection")
    conn, err := net.Dial("tcp", net.JoinHostPort(host, strconv.Itoa(port)))
    if err != nil {
        // Any failed dial aborts the whole run.
        fmt.Println(err)
        os.Exit(1)
    }
    defer conn.Close()
    fmt.Println("\t runner " + strconv.Itoa(id) + " established the connection")
    connBuf := bufio.NewReader(conn)
    for {
        str, err := connBuf.ReadString('\n')
        if len(str) > 0 {
            fmt.Println(str)
        }
        if err != nil {
            break // EOF or read error: the connection is gone
        }
    }
    fmt.Println("\t runner " + strconv.Itoa(id) + " got its connection closed")
}

// runThreads starts one goroutine per connection, sleeping `delay` ms between
// launches, and waits for all of them to finish. Goroutines blocked on network
// I/O do not need one OS thread each, so GOMAXPROCS stays at its default.
func runThreads(numberConnections int, delay int, host string, port int) {
    var wg sync.WaitGroup
    wg.Add(numberConnections)

    for runner := 1; runner <= numberConnections; runner++ {
        fmt.Println("Initiating runner # " + strconv.Itoa(runner))
        go connectionHandler(runner, host, port, &wg)
        time.Sleep(time.Duration(delay) * time.Millisecond)
        fmt.Println("Runner " + strconv.Itoa(runner) + " initated. Remaining: " + strconv.Itoa(numberConnections-runner))
    }

    fmt.Println("Waiting runners to finish")
    wg.Wait()
}

func main() {
    hostPtr := flag.String("host", "localhost", "Host you want to open tcp connections against")
    portPtr := flag.Int("port", 8888, "Port you want to open tcp connections against")
    numberConnectionsPtr := flag.Int("connections", 100, "Number of connections you want to open")
    delayPtr := flag.Int("delay", 10, "Number of ms you want to sleep between each connection creation")

    flag.Parse()

    runThreads(*numberConnectionsPtr, *delayPtr, *hostPtr, *portPtr)

    fmt.Println("\nTerminating Program")
}

Executing the tests...

% ./tcpMaxConn -host ec2-54-229-56-140.eu-west-1.compute.amazonaws.com -port 8080 -connections 5 
Initiating runner # 1
         runner 1 is initiating a connection
Runner 1 initated. Remaining: 4
Initiating runner # 2
         runner 2 is initiating a connection
Runner 2 initated. Remaining: 3
Initiating runner # 3
         runner 3 is initiating a connection
Runner 3 initated. Remaining: 2
Initiating runner # 4
         runner 4 is initiating a connection
Runner 4 initated. Remaining: 1
Initiating runner # 5
         runner 5 is initiating a connection
Runner 5 initated. Remaining: 0
Waiting runners to finish
         runner 2 established the connection
         runner 1 established the connection
         runner 4 established the connection
         runner 3 established the connection
         runner 5 established the connection
         runner 2 got its connection closed
         runner 1 got its connection closed
         runner 4 got its connection closed
         runner 5 got its connection closed
         runner 3 got its connection closed

Terminating Program

Is this everything?

No...

Supporting a relatively high number of connections in parallel...

...is an easy job...

... but fragile.

Back to the basics

Points to bear in mind

  • OS (net.core.somaxconn, Ethernet card queues, device backlogs...)
  • max file descriptor limits for the user running the service
  • in a multiprocess model (legacy?), max processes limits for the user running the service
  • Connector/listener in your application / application server
  • Associated thread pools / incoming request queues (if applicable)
  • You probably also want connection pooling/multiplexing towards your backends
  • Don't forget about other processes using resources
  • And is there a load balancer in front of you? More considerations may apply

If you break a single item, you hit the ground

Plus, it may not show up for a while; you only notice when:

  • Lots of ELBs in front of you (normally under high load) pre-opening hundreds of connections
  • Or issues with backend components (slow responses?) so in-flight connections increase drastically
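
A back-of-the-envelope way to see the slow backend case is Little's law: connections in flight are roughly the request rate multiplied by the response time. The sketch below uses made-up numbers (500 requests per second, 50 ms versus 2 s response times) and assumes one request per connection; inFlight is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// inFlight estimates concurrent connections via Little's law:
// arrival rate (requests per second) multiplied by time spent in the system.
func inFlight(requestsPerSecond int64, responseTime time.Duration) int64 {
	return requestsPerSecond * int64(responseTime) / int64(time.Second)
}

func main() {
	fmt.Println(inFlight(500, 50*time.Millisecond)) // healthy backend
	fmt.Println(inFlight(500, 2*time.Second))       // slow backend
}
```

With these numbers the same traffic goes from about 25 to about 1000 connections in flight, which is enough to cross a limit that normal load never gets close to.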

OK, but you are careful, you review PRs, and you do stress tests...

... you are safe. Really?

New incident

Incident 2017-10-31

AND Incident 2017-11-24

WTF

The solution

Continuous testing coverage

Obvious, but...

Should application stress tests already cover this?

Again, no...

And building something more...

Approach

  • Mission: We want something we can easily plug into our test suite that checks that a single instance of our service supports a specific number of parallel TCP connections, without getting into standard L7 (HTTP) stress testing
  • Given that it requires a deployed version of your application (ideally the same one you will use in production), the acceptance test phase is the natural place to plug in this check.
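
The idea behind the mission can be sketched in a few lines of Go. This is a minimal illustration, not tcpgoon's actual code: holdConnections and startLocalServer are hypothetical names, and the local listener only stands in for the deployed service so the example runs on its own.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"sync"
	"time"
)

// holdConnections dials n TCP connections to addr concurrently, keeps the
// successful ones open until every dial has finished, and returns how many
// were established.
func holdConnections(addr string, n int, timeout time.Duration) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		conns []net.Conn
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c, err := net.DialTimeout("tcp", addr, timeout)
			if err != nil {
				return // this connection counts as failed
			}
			mu.Lock()
			conns = append(conns, c)
			mu.Unlock()
		}()
	}
	wg.Wait()
	established := len(conns)
	for _, c := range conns {
		c.Close()
	}
	return established
}

// startLocalServer stands in for the service under test: it accepts
// connections on a random local port and holds them until the client closes.
func startLocalServer() (net.Listener, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func(c net.Conn) {
				defer c.Close()
				io.Copy(io.Discard, c) // blocks until the client hangs up
			}(c)
		}
	}()
	return ln, nil
}

func main() {
	ln, err := startLocalServer()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()

	const want = 50 // assumed target for the example
	got := holdConnections(ln.Addr().String(), want, 5*time.Second)
	fmt.Printf("established %d of %d connections\n", got, want)
}
```

In an acceptance test, the address would point at the deployed instance, and the check would fail the build when fewer connections than the target are established.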

Acceptance tests

What does it look like?

% ./tcpgoon --help
tcpgoon tests concurrent connections towards a server listening on a TCP port

Usage:
  tcpgoon [flags] <host> <port>

Flags:
  -y, --assume-yes         Force execution without asking for confirmation
  -c, --connections int    Number of connections you want to open (default 100)
  -d, --dial-timeout int   Connection dialing timeout, in ms (default 5000)
  -h, --help               help for tcpgoon
  -i, --interval int       Interval, in seconds, between stats updates (default 1)
  -s, --sleep int          Time you want to sleep between connections, in ms (default 10)
  -v, --verbose            Print debugging information to the standard error

% ./tcpgoon myhttpsamplehost.com 80 --connections 10 --sleep 999 -y 
Total: 10, Dialing: 0, Established: 0, Closed: 0, Error: 0, NotInitiated: 10
Total: 10, Dialing: 1, Established: 1, Closed: 0, Error: 0, NotInitiated: 8
Total: 10, Dialing: 1, Established: 2, Closed: 0, Error: 0, NotInitiated: 7
Total: 10, Dialing: 1, Established: 3, Closed: 0, Error: 0, NotInitiated: 6
Total: 10, Dialing: 1, Established: 4, Closed: 0, Error: 0, NotInitiated: 5
Total: 10, Dialing: 1, Established: 5, Closed: 0, Error: 0, NotInitiated: 4
Total: 10, Dialing: 1, Established: 6, Closed: 0, Error: 0, NotInitiated: 3
Total: 10, Dialing: 1, Established: 7, Closed: 0, Error: 0, NotInitiated: 2
Total: 10, Dialing: 1, Established: 8, Closed: 0, Error: 0, NotInitiated: 1
Total: 10, Dialing: 1, Established: 9, Closed: 0, Error: 0, NotInitiated: 0
Total: 10, Dialing: 0, Established: 10, Closed: 0, Error: 0, NotInitiated: 0
--- myhttpsamplehost.com:80 tcp test statistics ---
Total: 10, Dialing: 0, Established: 10, Closed: 0, Error: 0, NotInitiated: 0
Response time stats for 10 established connections min/avg/max/dev = 17.929ms/19.814ms/29.811ms/3.353ms
% echo $?
0

% tcpgoon -c 5000 -s 0 -y ec2-52-213-210-34.eu-west-1.compute.amazonaws.com 443
Total: 5000, Dialing: 0, Established: 0, Closed: 0, Error: 0, NotInitiated: 5000
Total: 5000, Dialing: 0, Established: 1020, Closed: 0, Error: 3980, NotInitiated: 0
--- ec2-52-213-210-34.eu-west-1.compute.amazonaws.com:443 tcp test statistics ---
Total: 5000, Dialing: 0, Established: 1020, Closed: 0, Error: 3980, NotInitiated: 0
Response time stats for 1020 established connections min/avg/max/dev = 116.443ms/313.739ms/549.88ms/111.426ms
Time to error stats for 3980 failed connections min/avg/max/dev = 105.145ms/145.092ms/316.247ms/39.371ms

And internally?

Baking

Nothing especially interesting (a Docker wrapper exists so we can run the Travis logic locally):

./_script/test
./_script/formatting_checks

TRAVIS_PULL_REQUEST=${TRAVIS_PULL_REQUEST:-""}
TRAVIS_BRANCH=${TRAVIS_BRANCH:-""}
if [ "$TRAVIS_PULL_REQUEST" = "false" ] && [ "$TRAVIS_BRANCH" = "master" ]
then
    echo "INFO: Merging to master... time to build and deploy redistributables"
    docker_name="dachad/tcpgoon"
    ./_script/build "$docker_name"
    ./_script/deploy "$docker_name"
fi

But...

No, you cannot just move binaries around

Testing...

Are we testing this test? :)

  • A basic tcpserver is included and in use by the project tests.
  • The Eureka integration uses dockertest to start and shut down a dockerized Eureka instance.

Q&As

I'd have done the same just with scripting or a fancy tool

Maybe. But goroutines do work very well in this scenario.

  • You probably don't want to fork a process per connection you want to test

  • hping does not work as...
    • ...it does not complete the three-way handshake

Where does the project name come from?

Goon: /ɡuːn/ noun informal; noun: goon; plural noun: goons ; 
...
2.
NORTH AMERICAN
a bully or thug, especially a member of an armed or security force.
...

Please, do not read it as "TCP-Go-On". It's awful. Very.

This is a very dangerous tool

Probably. Knives are also dangerous. And you can buy knives. We cannot prevent bad usage.

How many connections can you open from a single client?

It depends on how many connections your client machine supports :). There is no official benchmark/stress test yet, but I was able to open 5k-10k connections without problems from my laptop.

Can I use the tool now?

Yes. And a public Docker image is available to make it easier:

% WHALEBREW_INSTALL_PATH=$HOME/bin whalebrew install dachad/tcpgoon
🐳  Installed dachad/tcpgoon to /home/caba/bin/tcpgoon
% tcpgoon myhttpsamplehost.com 80 -c 2 -y 
Total: 2, Dialing: 0, Established: 0, Closed: 0, Error: 0, NotInitiated: 2
Total: 2, Dialing: 0, Established: 2, Closed: 0, Error: 0, NotInitiated: 0
--- myhttpsamplehost.com:80 tcp test statistics ---
Total: 2, Dialing: 0, Established: 2, Closed: 0, Error: 0, NotInitiated: 0
Response time stats for 2 established connections min/avg/max/dev = 57.606ms/63.499ms/69.391ms/5.892ms
% echo $?
0
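
The min/avg/max/dev line above can be reproduced with a few lines of Go. This is a sketch under assumptions, not tcpgoon's actual code: durationStats, summarize and qaRun are hypothetical names, and dev is taken to be the population standard deviation, which is what the two-connection run above is consistent with (with two samples, min and max are the samples themselves).

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// durationStats mirrors the min/avg/max/dev summary line.
type durationStats struct {
	Min, Avg, Max, Dev time.Duration
}

// qaRun holds the two response times of the run above (its min and max).
var qaRun = []time.Duration{57606 * time.Microsecond, 69391 * time.Microsecond}

// summarize computes min, average, max and the population standard
// deviation. It returns the zero value for an empty sample.
func summarize(samples []time.Duration) durationStats {
	if len(samples) == 0 {
		return durationStats{}
	}
	s := durationStats{Min: samples[0], Max: samples[0]}
	var sum float64
	for _, d := range samples {
		if d < s.Min {
			s.Min = d
		}
		if d > s.Max {
			s.Max = d
		}
		sum += float64(d)
	}
	mean := sum / float64(len(samples))
	var squares float64
	for _, d := range samples {
		diff := float64(d) - mean
		squares += diff * diff
	}
	s.Avg = time.Duration(mean)
	s.Dev = time.Duration(math.Sqrt(squares / float64(len(samples))))
	return s
}

func main() {
	s := summarize(qaRun)
	fmt.Printf("min/avg/max/dev = %v/%v/%v/%v\n", s.Min, s.Avg, s.Max, s.Dev)
}
```

For these two samples the result agrees with the reported 63.499ms average and 5.892ms deviation to within rounding.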

Is this now part of YAMS' acceptance tests?

Not yet. Stress testing ELBs is not the objective, so Service Discovery integration is required (and ongoing).

Closure...

The gifts

  • When assessing post mortems, do not stop until the very last root cause is clear
  • One-time solutions suck
  • Golang works well for building low-level utilities
  • Code requires continuous testing. Deliverables too

Special thanks to...

  • chadell, co-owner of the project
  • my wife, who created our gopher
  • other teams in Schibsted,
    • creating excellent tools to maintain/run our applications
    • providing also great and valuable feedback

Further questions?

Enjoy Christmas!