The evolution of GraphQL at scale

Most organisations' adventures with GraphQL start with one team looking for a way to solve one of the biggest problems with frontend development in a microservice architecture: how to get all the data your app needs without making a million and one service calls.

The promise of being able to get all of their data, in a shape that matches how they envisage using it, with only a single request is often too hard for a frontend engineer to resist.

A proof of concept is created; it proves the hypothesis that GraphQL simplifies the codebase and increases team velocity and, before you know it, it's in production. The organisation now has its first GraphQL API, but where does it go from there?

The answer to that question very much depends on the organisation: how they have structured their product engineering teams, the desired level of self-sufficiency of teams, their adherence to the ideals of a distributed architecture, etc.

A fork in the road

With a working GraphQL API in place, it won’t be long before other teams want to make use of it and get all the benefits of using GraphQL.

At this point, things tend to go in one of two directions, either other teams start to adopt and add to the existing implementation (monolith) or they duplicate what the initial team has done (BFF).

The glorious monolith

Rather than reinvent the wheel, other teams want to build on the efforts of that first team. Initially, things start great with new teams adding their unique data concerns to the graph and reusing the pre-existing schema for the rest.

Product development times decrease, there's less duplication of effort, frontend engineers have a consistent API tailored to them, etc. Nirvana, right? Not so fast.

As with any other monolithic application, cracks start to appear as it grows. The more teams there are making changes to it, the harder it is to release changes and, before you know it, the dream becomes a nightmare and our monolithic GraphQL API becomes a hated beast.

In addition to this, odds are that the initial implementation only considered the data needs of the implementing team and their product. This means that all of the initial types and schemas are structured around that team's access patterns, which are likely to be very different from those of other teams.

Take an eCommerce application as an example. The team building product listings wants to access product information by requesting products with a specific SKU, but the team building the basket wants product details only for the product SKUs contained within the basket.

This is a very simplistic example and easily fixable but, in a monolithic implementation, it's likely that both teams would write their own resolvers, probably duplicating the fetching logic for the backing service that provides the product data, ending up with something like the sketch below.
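The productBackend client and the resolver shapes here are hypothetical, but the pattern will be familiar: two teams independently wiring the same backing call into their own corners of the schema.

// Hypothetical client for the backing service that owns product data.
const productBackend = {
  async fetchBySku(sku: string) {
    const res = await fetch(`https://products.internal/products/${sku}`);
    return res.json();
  },
};

const resolvers = {
  Query: {
    // Written by the product listings team: look a product up directly by SKU.
    product: (_parent: unknown, args: { sku: string }) =>
      productBackend.fetchBySku(args.sku),
  },
  BasketItem: {
    // Written by the basket team: resolve the product for each item in the
    // basket, re-implementing essentially the same fetch.
    product: (item: { sku: string }) => productBackend.fetchBySku(item.sku),
  },
};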

Suddenly we start getting duplication of effort again and things start to get messier by the day.

Backend for frontend

Rather than building on the initial implementation, another common approach is to follow the backend for frontend (BFF) pattern and introduce a GraphQL server per experience (an app, a channel, a team, a page, etc.) that only handles the data needs of that experience.

Each BFF's schema is tailored to that experience's access patterns, and its structure exactly mirrors the needs of the frontend application. It feels like we are getting closer to that nirvana state, right?
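As a concrete sketch of what that tailoring looks like, a product-listings-page BFF might expose a schema shaped exactly like the page it serves. The type and field names here are made up:

import { gql } from 'apollo-server';

// Hypothetical schema for a product-listings-page BFF: the types mirror what
// the page renders rather than how the backing services model their data.
const typeDefs = gql`
  type Query {
    productListingPage(category: ID!): ProductListingPage
  }

  type ProductListingPage {
    heading: String
    products: [ProductCard!]!
  }

  type ProductCard {
    sku: ID!
    title: String
    imageUrl: String
    displayPrice: String
  }
`;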

In some cases, yes, but consider what we are doing here: we are creating a GraphQL BFF per experience with the potential for a lot of duplication amongst BFFs.

In an organisation with teams that have a very narrow focus (say, where an experience is equivalent to a page on a site), a lot of common functionality will be duplicated, infrastructure costs will go up, the number of attack vectors will increase, etc.

That being said, this can be a highly effective way of introducing and using GraphQL within organisations. Depending on the app you are building and your organisational structure, this may indeed be your end game.

Michelle Garrett gives a great talk on how Condé Nast makes effective use of BFFs that I’d highly recommend you check out as an example of running the BFF pattern at scale.

But it is not for everyone.

One gateway to rule them all

So what are the alternatives to a single monolith or numerous BFFs? For a long time, the only option available to teams was schema stitching.

Schema stitching describes the process of creating a single GraphQL schema from multiple underlying GraphQL schemas. It was a technique created to try and solve some of the challenges faced by those who went down the monolithic GraphQL route.

It essentially allows you to split up and distribute a monolithic GraphQL schema into several underlying services, each of which deals with one or more types from the original schema. In front of these, you place a gateway server that uses tooling to introspect the services and build up a single schema from the individual parts.

When a request comes into the gateway, it acts as a proxy and distributes the query amongst the underlying services, so requests for type A are sent to service A, type B to service B, etc.
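A minimal stitching gateway built with the original graphql-tools APIs looked roughly like the sketch below (the service URLs are placeholders):

import { ApolloServer } from 'apollo-server';
import { HttpLink } from 'apollo-link-http';
import fetch from 'node-fetch';
import { introspectSchema, makeRemoteExecutableSchema, mergeSchemas } from 'graphql-tools';

// Placeholder URLs for the underlying GraphQL services.
const serviceUrls = [
  'http://localhost:4001/graphql', // Basket service
  'http://localhost:4002/graphql', // Product service
];

async function startGateway() {
  // Introspect each service and wrap it in a locally executable remote schema.
  const schemas = await Promise.all(
    serviceUrls.map(async (uri) => {
      const link = new HttpLink({ uri, fetch: fetch as any });
      return makeRemoteExecutableSchema({ schema: await introspectSchema(link), link });
    })
  );

  // Merge the remote schemas into a single gateway schema and serve it.
  const server = new ApolloServer({ schema: mergeSchemas({ schemas }) });
  const { url } = await server.listen(4000);
  console.log(`Stitched gateway running at ${url}`);
}

startGateway();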

But schema stitching isn’t easy.

Take my simple example from above with its Product and Basket types. What you want is the following schema, where the Basket type has an items field and each item links to its Product.

# Product service
type Product {
  id: ID!
  name: String
  description: String
}

# Basket service
type Basket {
  id: ID!
  items: [BasketItem]
}

type BasketItem {
  id: ID!
  quantity: Int
  product: Product
}

# Query to fetch the basket
query {
  getBasket(id: 12345) {
    items {
      id
      quantity
      product {
        name
        description
      }
    }
  }
}

But that won't work. The Basket service (and its schema) has no knowledge of Products and probably only knows the IDs of the products in the basket.

To be able to expose that schema, you'd either have to put some glue code in the gateway that understands the relationship between the two services' schemas and does a behind-the-scenes fetch of all Products matching the IDs held in BasketItems, or make the Basket service call the Product service to get the product information.
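With the same graphql-tools APIs, that glue code might look roughly like the following. Note the assumptions baked in: that the Basket service exposes a productId field on BasketItem, and that the Product service exposes a product(id: ID!) root field.

import { mergeSchemas } from 'graphql-tools';
import { GraphQLSchema } from 'graphql';

// basketSchema and productSchema are the remote executable schemas built in
// the earlier sketch via introspectSchema/makeRemoteExecutableSchema.
function buildStitchedSchema(basketSchema: GraphQLSchema, productSchema: GraphQLSchema) {
  return mergeSchemas({
    schemas: [
      basketSchema,
      productSchema,
      // Teach the gateway about the relationship the Basket service knows nothing about.
      `extend type BasketItem { product: Product }`,
    ],
    resolvers: {
      BasketItem: {
        product: {
          // Always fetch productId from the Basket service so there is something to join on.
          fragment: `... on BasketItem { productId }`,
          resolve(basketItem: any, _args: any, context: any, info: any) {
            // Delegate to the Product service's product(id: ID!) query behind the scenes.
            return info.mergeInfo.delegateToSchema({
              schema: productSchema,
              operation: 'query',
              fieldName: 'product',
              args: { id: basketItem.productId },
              context,
              info,
            });
          },
        },
      },
    },
  });
}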

That would, of course, be invisible to the consumer of the stitched schema but… don't you feel just a little bit dirty having read the past few paragraphs? As a one-off it's fine, but multiply that by however many interconnections you have between your types and… eek!

Schema stitching is full of these little problems. Each is solvable, but it's death by a thousand cuts. Throw into the mix questions about how to update the stitched schema when one of the underlying schemas changes, what happens when one of the underlying services is down at compose time, etc., and you'll quickly be banging your head against your desk.

Federation to the rescue

So what to do? A monolith is bad for all the reasons that monoliths are bad; BFFs are great, but the amount of duplication can get ridiculous at scale; and schema stitching is… a headache, and often very fragile.

In an attempt to solve this dilemma, Apollo introduced Apollo Federation in May 2019, giving teams the ability to build a single, cohesive schema from multiple federated services.

Rather than splitting the schema up by type (which is what you end up doing with schema stitching), federation allows you to separate it by concern.

I won’t go into too much detail here (I’ll save that for a later blog post), but it does this by allowing multiple federated services to extend a single type, with the gateway composing the type fragments together again.

The gateway’s inbuilt query planner then splits incoming queries into smaller queries (based on its understanding of which service resolves which part of the query) and sends them on to the federated services.
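In practice, the gateway is just another Apollo Server sitting in front of the federated services. A minimal setup, with placeholder service names and URLs, looks something like this:

import { ApolloServer } from 'apollo-server';
import { ApolloGateway } from '@apollo/gateway';

// Placeholder list of federated services; in a real deployment this would come
// from a managed service registry rather than being hard-coded.
const gateway = new ApolloGateway({
  serviceList: [
    { name: 'basket', url: 'http://localhost:4001/graphql' },
    { name: 'product', url: 'http://localhost:4002/graphql' },
  ],
});

// The gateway composes the single schema at startup, then plans and executes
// incoming queries across the underlying services.
const server = new ApolloServer({ gateway, subscriptions: false });

server.listen(4000).then(({ url }) => console.log(`Gateway running at ${url}`));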

Taking our example from earlier:

# Basket service
type Basket @key(fields: "id") {
  id: ID!
  items: [BasketItem]
}

type BasketItem @key(fields: "id") {
  id: ID!
  quantity: Int
}

# Product service
type Product @key(fields: "id") {
  id: ID!
  name: String
  description: String
}

extend type BasketItem @key(fields: "id") {
  id: ID! @external
  product: Product
}

# Query to fetch the basket
query {
  getBasket(id: 12345) {
    items {
      id
      quantity
      product {
        name
        description
      }
    }
  }
}

The Basket service defines a Basket type with an items field of BasketItems, each of which records the quantity of a given item. But where are the products?

Well, the Product service now not only defines the Product type, it also extends the BasketItem defined in the Basket service, adding a product field that resolves the related Product for a given basket item.

How does it achieve this? Well, I'll save the detail on that one for a later blog post, but at a high level the @key directives on the various types define the externally referenceable fields that are used to link the types together.

One of the easiest ways I've found to reason about this is to think of the federated data graph as a relational database, with the keys being analogous to foreign keys. Each federated service then needs to provide a standard resolver for looking up its entities by their foreign key(s), as in the sketch below.
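In Apollo Server terms, that standard resolver is the entity's reference resolver. A minimal sketch of the Product service, with a hypothetical fetchProductById data-access helper (the BasketItem extension and its product resolver would live in this same service), might look like this:

import { ApolloServer, gql } from 'apollo-server';
import { buildFederatedSchema } from '@apollo/federation';

const typeDefs = gql`
  type Product @key(fields: "id") {
    id: ID!
    name: String
    description: String
  }
`;

// Hypothetical lookup against the Product service's own data store.
async function fetchProductById(id: string) {
  return { id, name: 'Example product', description: 'An example product' };
}

const resolvers = {
  Product: {
    // The gateway calls this with a representation ({ __typename: "Product", id })
    // whenever another part of the query plan needs a Product it only knows the id of.
    __resolveReference(reference: { id: string }) {
      return fetchProductById(reference.id);
    },
  },
};

const server = new ApolloServer({
  schema: buildFederatedSchema([{ typeDefs, resolvers }]),
});

server.listen(4002).then(({ url }) => console.log(`Product service running at ${url}`));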

In the above simple example, each federated service now only has to know about one high-level data type. The Basket service handles all requests for Basket data and the Product service handles all requests for Product data, even when that Product data is needed to resolve the items in a Basket (in which case it's the gateway, not the Basket service, that makes the follow-up request).

This fits in nicely with the modern trend for microservice architectures with colocated data sources. There are no blurred lines between services or, say, the Basket service storing a subset of product data for items in the Basket.

Is it perfect? No, and running a federated data graph at scale comes with a whole host of challenges, but it is a giant step forward for GraphQL at scale.

In future posts, I’ll be digging into the detail of how to implement a federated data graph and propose solutions for some of the more common challenges you’ll face.

Follow me on Twitter or connect on LinkedIn for notifications of future posts or if you have any questions about GraphQL at scale.