Chaos Testing With Toxiproxy

Chaos Testing

When testing code that relies on other components to do its job (another backend service, a database, and so on), it can be useful to test, and thus document, how it behaves under particular network conditions. For example: how would our service behave if the database latency suddenly went sky-high? And what if the database became unavailable? Depending on our needs, we might simply return an error, or try to compensate so that our service can keep operating, albeit with some level of degradation.

This is a small-scale, test-oriented variation of Chaos Testing, made popular by Netflix about ten years ago: as their backend grew to hundreds of microservices, they needed to verify how performance degradation or unavailability of some of them would affect their customers’ experience. So they started building tools to intentionally simulate particular network conditions and verify how the rest of the system would behave in that case.

ToxiProxy

ToxiProxy is a TCP proxy designed to simulate network conditions (dubbed toxics, in the library) for testing purposes: Netflix runs Chaos Testing drills on production systems, but we can start smaller 😅

The idea behind ToxiProxy is simple: we create a proxy that, under normal circumstances, simply forwards all traffic to our dependency (e.g. an API, database, …), then we “point” our service to the proxy; when we run a specific test, we may then simulate some network issues (e.g. increased latency) and verify the impact on our service.

I’ll use the ToxiProxy Go client in the following example, but it’s not a language-specific tool: there are clients for several other languages, and they are simple wrappers around ToxiProxy’s HTTP API, available on port 8474. If a client is not available for a given language, one can use the HTTP API directly.
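To give an idea of what using the HTTP API directly might look like, here is a minimal sketch that creates a proxy without the Go client. It assumes ToxiProxy’s `POST /proxies` endpoint with a JSON body made of `name`, `listen`, `upstream` and `enabled`; `proxySpec` and `newCreateProxyRequest` are names made up for this example, and the values are the same placeholders used in the setup further down.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// proxySpec mirrors the JSON body assumed for ToxiProxy's POST /proxies endpoint.
// (hypothetical helper type, not part of any ToxiProxy client)
type proxySpec struct {
	Name     string `json:"name"`
	Listen   string `json:"listen"`
	Upstream string `json:"upstream"`
	Enabled  bool   `json:"enabled"`
}

// newCreateProxyRequest builds, but does not send, the request that creates a proxy.
func newCreateProxyRequest(apiURL string, spec proxySpec) (*http.Request, error) {
	body, err := json.Marshal(spec)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, apiURL+"/proxies", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newCreateProxyRequest("http://localhost:8474", proxySpec{
		Name:     "orders_api",
		Listen:   "[::]:12345",
		Upstream: "orders:80",
		Enabled:  true,
	})
	if err != nil {
		panic(err)
	}

	res, err := http.DefaultClient.Do(req)
	if err != nil {
		// most likely the ToxiProxy server is not running on this machine
		fmt.Println("request failed:", err)
		return
	}
	defer res.Body.Close()
	fmt.Println("status:", res.Status)
}
```

The request is built separately from the call that sends it, so the payload can be inspected without a running ToxiProxy server.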

Setup

Let’s imagine that we have a Service calling another microservice, the Orders API, via its HTTP API (for simplicity’s sake, I’ll use Echo Server as a substitute for the Orders API). In order to start testing, we need to spin up two containers: one running the ToxiProxy server and one running the Echo Server:

# docker-compose.yaml

version: "3"
services:
  orders:
    container_name: orders
    image: ealen/echo-server
    ports:
      - 3000:80

  toxiproxy:
    image: shopify/toxiproxy
    ports:
      - 8474:8474
      - 12345:12345

Echo Server listens on port 80 and we’re remapping it to port 3000 (an arbitrary number, it can be any port that is available on your machine). ToxiProxy serves its HTTP API on port 8474, so I’m exposing it; in addition to that, I’m exposing port 12345 (again, arbitrary number) so that I can later use it to proxy the Orders API.
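One caveat: a container being started does not guarantee that the process inside it is already accepting connections, so the first call to port 8474 may fail if it is made too early. A minimal sketch of a readiness check, assuming that a successful TCP connection is a good enough signal; `waitForTCP` and `startStandIn` are hypothetical helpers, and the local listener only stands in for the ToxiProxy port so that the example runs without Docker:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForTCP keeps dialing addr until a connection succeeds or timeout elapses.
func waitForTCP(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.DialTimeout("tcp", addr, 500*time.Millisecond)
		if err == nil {
			conn.Close()
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("%s not reachable after %v: %w", addr, timeout, err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}

// startStandIn opens a listener on a random free local port.
// In this demo it plays the role of the ToxiProxy API port.
func startStandIn() net.Listener {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	return ln
}

func main() {
	ln := startStandIn()
	defer ln.Close()

	// with the real setup this would be waitForTCP("localhost:8474", ...)
	if err := waitForTCP(ln.Addr().String(), 5*time.Second); err != nil {
		panic(err)
	}
	fmt.Println("port is accepting connections")
}
```

In a test suite, a call like this would sit between starting the containers and the first request to the ToxiProxy API.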

Test structure

What we are now going to do is:

  1. create a proxy that forwards incoming traffic to the orders container
  2. point our Service to the proxy
  3. introduce some “turbulence” in the network
  4. make Service call the Orders API in a test, to verify how it is affected
  5. finally, reset the proxy state (= clean up after ourselves)

For my own convenience, I’m going to use testcontainers-go to spin up the Docker containers and then take them down. This is what our project’s test directory will look like, when we’re done:

/test
  |_ docker-compose.yaml
  |_ service_test.go
  |_ setup_test.go

We’ve already seen the docker-compose.yaml, let’s now jump to setup_test.go:

// setup_test.go

package test

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"testing"
	"time"

	toxiproxy "github.com/Shopify/toxiproxy/v2/client"
	"github.com/testcontainers/testcontainers-go"
)

const (
	proxyName     = "orders_api"
	proxyPort     = 12345
	ordersApiPort = 80
)

var (
	client  *toxiproxy.Client
	proxies map[string]*toxiproxy.Proxy
	service Service
	err     error
)

func beforeAll() {
	client = toxiproxy.NewClient("localhost:8474")

	_, err = client.Populate([]toxiproxy.Proxy{{
		Name:     proxyName,
		Listen:   fmt.Sprintf("[::]:%d", proxyPort),
		Upstream: fmt.Sprintf("orders:%d", ordersApiPort),
		Enabled:  true,
	}})
	if err != nil {
		panic(err)
	}
	proxies, err = client.Proxies()
	if err != nil {
		panic(err)
	}

	service = Service{fmt.Sprintf("http://localhost:%d", proxyPort)}
}

func TestMain(m *testing.M) {
	compose := testcontainers.NewLocalDockerCompose([]string{"docker-compose.yaml"}, "my_test").WithCommand([]string{"up", "-d"})

	// start containers
	if execError := compose.Invoke(); execError.Error != nil {
		log.Panicf("error invoking docker-compose: %v", execError.Error)
	}

	// before all tests: configure the proxy and instantiate the service
	beforeAll()

	// run all tests
	code := m.Run()

	// after all tests: take all containers down
	// (done explicitly: os.Exit does not run deferred functions)
	if execError := compose.Down(); execError.Error != nil {
		log.Printf("error taking containers down: %v", execError.Error)
	}

	os.Exit(code)
}

// ---
// Just pretend this is the real Service
type Service struct {
	Url string
}

func (s *Service) Serve() (*http.Response, error) {
	start := time.Now()
	log.Printf("[service] Performing GET on %s ...\n", s.Url)
	res, err := http.Get(s.Url)
	log.Printf("[service] GET request took: %v\n", time.Since(start))
	if err != nil {
		log.Printf("[service] got error: %v", err)
	}

	return res, err
}
// ---

The TestMain(m *testing.M) function prepares the “stage” for the test execution, spinning up all required containers before the tests run and taking them down once they are over: it allows us to perform the equivalent of the beforeAll / afterAll methods provided by most popular testing libraries.
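A detail worth knowing when writing a TestMain: os.Exit terminates the process immediately, without running deferred functions, so cleanup registered with defer in the same function that calls os.Exit is skipped. One way to keep defer-based cleanup (which also runs if the setup panics) is to move the work into a helper that returns the exit code. The sketch below shows the idea with placeholder callbacks; `runSuite` and its parameters are hypothetical names, not part of the testing package or of testcontainers-go.

```go
package main

import (
	"fmt"
	"os"
)

// runSuite runs setup, then the tests, and returns the tests' exit code.
// teardown is deferred, so it runs before runSuite returns,
// even if setup or runTests panics.
func runSuite(setup func(), runTests func() int, teardown func()) int {
	defer teardown()
	setup()
	return runTests()
}

func main() {
	code := runSuite(
		func() { fmt.Println("before all: start containers, configure proxy") },
		func() int { fmt.Println("running tests"); return 0 },
		func() { fmt.Println("after all: take containers down") },
	)

	// os.Exit is only called after runSuite has returned,
	// that is, after the deferred teardown has already run
	if code != 0 {
		os.Exit(code)
	}
}
```

In a real TestMain, the body would shrink to `os.Exit(runSuite(...))`, with `m.Run` passed as the second argument.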

Now let’s look at service_test.go, which contains the actual tests:

// service_test.go

package test

import (
	"testing"

	toxiproxy "github.com/Shopify/toxiproxy/v2/client"
)

func Test_OrderAPI_OK(t *testing.T) {
	// the proxy is enabled and just lets traffic pass through to the Orders API:
	// just verify that everything works under normal conditions

	res, err := service.Serve()
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	defer res.Body.Close()
}

func Test_OrderAPI_Unreachable(t *testing.T) {
	proxy := proxies[proxyName]

	// make the Orders API unreachable
	if err := proxy.Disable(); err != nil {
		t.Fatalf("couldn't disable proxy: %v", err)
	}
	// make it reachable again at the end
	defer proxy.Enable()

	// how badly does our service fail when it cannot reach the Orders API? what does it return to its users?
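	res, err := service.Serve()
	if err == nil {
		defer res.Body.Close()
	}

	// TODO run assertions on res and err here
	// (for now err is only logged, so that the file compiles)
	t.Logf("Serve returned err = %v", err)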
}

func Test_OrderAPI_HighLatency(t *testing.T) {
	proxy := proxies[proxyName]

	// introduce 5000ms latency with ± 250ms jitter in calls to the Orders API
	_, err := proxy.AddToxic("high_latency", "latency", "downstream", 1.0, toxiproxy.Attributes{
		"latency": 5000,
		"jitter":  250,
	})
	if err != nil {
		t.Fatalf("couldn't add toxic: %v", err)
	}
	// remove latency at the end
	defer proxy.RemoveToxic("high_latency")

	// how does our service behave when the Orders API responds with increased latency?
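	res, err := service.Serve()
	if err == nil {
		defer res.Body.Close()
	}

	// TODO run assertions on res and err here
	// (for now err is only logged, so that the file compiles)
	t.Logf("Serve returned err = %v", err)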
}

Note: The TODO comments mark the points where we should add some actual assertions, depending on our expectations.
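What those assertions look like depends entirely on the behaviour we expect from our Service. As one possible example, suppose the expectation is "calls to the Orders API give up after a client-side timeout": the high-latency test could then assert that a call made with a tight timeout returns an error. The sketch below illustrates that check in isolation. It is an assumption-laden stand-in: an httptest server that sleeps plays the role of the Orders API behind a latency toxic, and `newSlowServer`, `fetchStatus` and the constants are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

const (
	simulatedLatency = 300 * time.Millisecond // plays the role of the latency toxic
	generousTimeout  = 2 * time.Second
	tightTimeout     = 50 * time.Millisecond
)

// newSlowServer returns a local server that replies 200 OK after the given delay.
func newSlowServer(delay time.Duration) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(delay)
		w.WriteHeader(http.StatusOK)
	}))
}

// fetchStatus performs a GET with a client-side timeout and returns the status code.
func fetchStatus(url string, timeout time.Duration) (int, error) {
	client := &http.Client{Timeout: timeout}
	res, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer res.Body.Close()
	return res.StatusCode, nil
}

func main() {
	srv := newSlowServer(simulatedLatency)
	defer srv.Close()

	// the timeout is larger than the latency: the call is slow, but it succeeds
	status, err := fetchStatus(srv.URL, generousTimeout)
	fmt.Println("generous timeout:", status, err)

	// the timeout is smaller than the latency: the call fails with a timeout error
	_, err = fetchStatus(srv.URL, tightTimeout)
	fmt.Println("tight timeout failed:", err != nil)
}
```

In the real test, the slow server would be replaced by the proxy with the latency toxic applied, and the two checks would become `t.Fatalf` / `t.Errorf` calls.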

Finally, run the tests with go test ./... and check the “verdict”! 🙂