OpenStreetMap

This is a bit less OpenStreetMap-related than normal, but it has to do with the Standard Tile Layer and an outage we had this month.

On July 18th, the Standard Tile Layer experienced degraded service, with 4% of traffic resulting in errors for 2.5 hours. A significant factor in the time it took to resolve the incident was a lack of visibility into the health status of the rendering servers. The architecture consists of a content delivery network (CDN) hosted by Fastly, backed by 7 rendering servers. Fastly, like most CDNs, offers automatic failover of backends by fetching a URL on the backend server and checking its response. If the check fails, it shifts traffic to a different backend.

A bug in Apache resulted in the servers being able to handle only a reduced number of connections, causing a server to fail the health check, diverting all load to another server. This repeated with multiple servers, sending the load between them until the first server responded to the health check again because it had zero load. Because the servers were responding to most of the manually issued health checks and we had no visibility into how each Fastly node was directing its traffic, it took longer to find the cause than it should have.

Our normal monitoring is provided by StatusCake, but this wasn’t enough here. Instead of increasing the monitoring, we wanted to make use of the existing Fastly healthchecks, which probe the servers from 90 different CDN points. Besides being a vastly higher volume of checks, this more directly monitors the health checks that matter for the service.

During the incident, Fastly support provided some details on how to monitor health check status. Based on this guide, the OWG has set up an API on the tile CDN to indicate backend health, and monitoring to track this across all POPs.

Fastly uses a modified version of Varnish, which supports VCL for configuration. This is a powerful language, which lets us do sophisticated load-balancing, and in this case, even create an API directly on the CDN.

We start with a custom VCL snippet within the recv subroutine that sends requests for the API endpoint to a custom error status:

if (req.url.path ~ "^/fastly/api/hc-status") {
  error 660;
}

Next, we make another VCL snippet within the error subroutine that manually assembles a JSON response indicating the servers’ statuses, as well as headers with the same information:

if (obj.status == 660) {
  # 0 = unhealthy, 1 = healthy
  synthetic "{" LF
      {"  "timestamp": ""} now {"","} LF
      {"  "pop": ""} server.datacenter {"","} LF
      {"  "healthy" : {"} LF
      {"    "ysera": "} backend.F_ysera.healthy {","} LF
      {"    "odin": "} backend.F_odin.healthy {","} LF
      {"    "culebre": "} backend.F_culebre.healthy {","} LF
      {"    "nidhogg": "} backend.F_nidhogg.healthy {","} LF
      {"    "pyrene": "} backend.F_pyrene.healthy {","} LF
      {"    "bowser": "} backend.F_bowser.healthy {","} LF
      {"    "balerion": "} backend.F_balerion.healthy LF
      {"  }"} LF
      {"}"};
  set obj.status = 200;
  set obj.response = "OK";
  set obj.http.content-type = "application/json";
  set obj.http.x-hcstatus-ysera = backend.F_ysera.healthy;
  set obj.http.x-hcstatus-odin = backend.F_odin.healthy;
  set obj.http.x-hcstatus-culebre = backend.F_culebre.healthy;
  set obj.http.x-hcstatus-nidhogg = backend.F_nidhogg.healthy;
  set obj.http.x-hcstatus-pyrene = backend.F_pyrene.healthy;
  set obj.http.x-hcstatus-bowser = backend.F_bowser.healthy;
  set obj.http.x-hcstatus-balerion = backend.F_balerion.healthy;
  return (deliver);
}
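As a rough illustration of how a client might consume this endpoint, here is a minimal Python sketch that parses a response body and lists the backends a POP reports as unhealthy. The sample body is made up for the example (the timestamp, POP code, and the shortened backend list are assumptions); only the field names come from the snippet above.

```python
import json


def unhealthy_backends(body: str) -> list[str]:
    """Return the backends that the responding POP reports as unhealthy."""
    status = json.loads(body)
    # The VCL above emits 1 for healthy and 0 for unhealthy.
    return sorted(name for name, healthy in status["healthy"].items() if healthy != 1)


# Hypothetical response body, shaped like the synthetic response above.
sample_body = """{
  "timestamp": "Mon, 18 Jul 2022 12:00:00 GMT",
  "pop": "FRA",
  "healthy" : {
    "ysera": 1,
    "odin": 1,
    "pyrene": 0
  }
}"""

print(unhealthy_backends(sample_body))
```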

This API can be viewed manually to show the status, but it only reports the view from the CDN node you’re connecting through. To monitor all of the nodes at once, we use the Fastly edge_check endpoint. When called with an authorized token, the response looks something like:

[
  {
    "pop": "frankfurt-de",
    "server": "cache-fra19139",
    "response": {
      "headers": {
        "x-hcstatus-ysera": "1",
        "x-hcstatus-odin": "1",
        "x-hcstatus-culebre": "1",
        "x-hcstatus-nidhogg": "1",
        "x-hcstatus-pyrene": "1",
        "x-hcstatus-bowser": "1",
        "x-hcstatus-balerion": "1"
      },
      "status": 200
    }
  },

  {
    "pop": "yvr-vancouver-ca",
    "server": "cache-yvr1528",
    "response": {
      "headers": {
        "x-hcstatus-ysera": "1",
        "x-hcstatus-odin": "1",
        "x-hcstatus-culebre": "1",
        "x-hcstatus-nidhogg": "1",
        "x-hcstatus-pyrene": "1",
        "x-hcstatus-bowser": "1",
        "x-hcstatus-balerion": "1"
      },
      "status": 200
    }
  }
]

The real response has a lot more headers and other information in it, as well as another 90 POPs, but what I’ve shown is the important information. That is everything we need, but not in a very useful form. To make it useful, we gather the data with Prometheus, our monitoring tool. This is done with a simple Prometheus exporter that queries the URL, parses the response, and writes out metrics. Once the metrics are in Prometheus, we can alert on them and graph them.
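The parsing step of such an exporter could look roughly like the sketch below. This is not the exporter we actually run, just an illustration: it takes an edge_check response as a string and turns the x-hcstatus-* headers into lines in the Prometheus text exposition format. The metric name and the host and backend labels match the query used for graphing; the pop label and the function name are assumptions made for the example, and fetching the URL with a token is left out.

```python
import json

HEADER_PREFIX = "x-hcstatus-"


def edge_check_to_metrics(edge_check_json: str, host: str) -> list[str]:
    """Convert an edge_check response into Prometheus text exposition lines."""
    lines = ["# TYPE fastly_healthcheck_status gauge"]
    for node in json.loads(edge_check_json):
        pop = node["pop"]
        headers = node.get("response", {}).get("headers", {})
        for name, value in sorted(headers.items()):
            if not name.startswith(HEADER_PREFIX):
                continue  # ignore headers unrelated to the health checks
            backend = name[len(HEADER_PREFIX):]
            lines.append(
                f'fastly_healthcheck_status{{host="{host}",pop="{pop}",backend="{backend}"}} {int(value)}'
            )
    return lines


# Hypothetical, shortened edge_check response with a single POP.
sample = json.dumps([
    {
        "pop": "frankfurt-de",
        "server": "cache-fra19139",
        "response": {
            "headers": {
                "content-type": "application/json",
                "x-hcstatus-ysera": "1",
                "x-hcstatus-odin": "0",
            },
            "status": 200,
        },
    }
])

for line in edge_check_to_metrics(sample, "tile.openstreetmap.org"):
    print(line)
```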

Because the metrics are 1 or 0, taking the average with avg(fastly_healthcheck_status{host="tile.openstreetmap.org"}) by (backend) gives a graph indicating the backend status, as measured by Fastly POP healthchecks. This graph is now on the Tile Rendering Dashboard.
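To make the averaging concrete, here is a small Python sketch of what that query computes, using made-up samples: each sample is one POP's 0 or 1 for a backend, and the average is the fraction of POPs that consider the backend healthy. The function name and the sample values are purely illustrative.

```python
from collections import defaultdict


def average_by_backend(samples):
    """Average 0/1 health values per backend across all reporting POPs."""
    totals = defaultdict(lambda: [0, 0])  # backend -> [sum of values, sample count]
    for backend, value in samples:
        totals[backend][0] += value
        totals[backend][1] += 1
    return {backend: total / count for backend, (total, count) in totals.items()}


# Hypothetical samples: four POPs report on pyrene, two report on odin.
samples = [
    ("pyrene", 1), ("pyrene", 0), ("pyrene", 0), ("pyrene", 1),
    ("odin", 1), ("odin", 1),
]

print(average_by_backend(samples))
```

In this example half of the POPs see pyrene as healthy, so its average is 0.5, while odin stays at 1.0.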

Discussion

Comment from TomH on 25 July 2022 at 08:36

I’m not sure I agree that StatusCake wasn’t enough - it had in fact alerted when pyrene stopped responding which was the root cause of the problems and we just hadn’t noticed that.

That’s why I pulled the StatusCake results into prometheus so that we would get repeating alerts if they aren’t resolved rather than just a one off alert.

The healthcheck results are also on the Tile CDN Dashboard now as well.

Comment from pnorman on 26 July 2022 at 03:48

Yes - the improved monitoring of StatusCake might have caught it, but if Pyrene had been working (e.g. if it had been the most recent to be upgraded) then we’d have still had the problem among the European servers which would have been just as difficult to diagnose, so I think both monitoring changes are important.

I like the presentation on the Tile CDN dashboard.
