Caching in cursor-based pagination with external API requests

Sometimes you need to retrieve a lot of data from an external API, but the API returns the data in portions, paginated using something like cursor pagination: the service provides a cursor in the API response, and you paginate through the collection of items by supplying that cursor in the next request.
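For illustration, here's roughly the response shape this post assumes, decoded into a PHP array. The field names match the code below; the cursor value itself is made up.

$response = [
    'users' => [
        ['id' => 1, 'name' => 'Ada'],
        ['id' => 2, 'name' => 'Grace'],
    ],
    // Present while there are more pages; null or absent on the last one.
    'next_cursor' => 'eyJpZCI6Mn0',
];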

Ideally, you would offload every API request to a background queue job, but sometimes, for any number of reasons, that isn't possible and you have to fetch every record within a single request lifecycle. In that case, you can write a simple do-while loop that keeps making API requests until there's no more data in the response.

Here's an example of a method that retrieves all users from the GET /users endpoint, paginating with the cursor supplied by the previous API response.

public function users(): Collection
{
    $cursor = null;
    $users = collect([]);

    do {
        // Fetch the next page, passing along the cursor from the previous response.
        $response = $this->get('/users', [
            'cursor' => $cursor,
        ]);

        $users = $users->merge($response['users']);

        // The API omits next_cursor (or returns null) on the last page.
        $cursor = $response['next_cursor'] ?? null;
    } while ($cursor !== null);

    return $users;
}
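The $this->get() helper isn't shown here, so assume it's a thin wrapper around Laravel's HTTP client that decodes the JSON body. A minimal sketch, with the base URL and the error handling being assumptions:

use Illuminate\Support\Facades\Http;

protected function get(string $path, array $query = []): array
{
    // array_filter() drops the null cursor from the very first request.
    return Http::baseUrl('https://api.example.com')
        ->get($path, array_filter($query))
        ->throw()
        ->json();
}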

As you can see, this keeps running until the cursor is empty. As soon as the cursor is null, the loop stops and the method returns the collection of all users. This works fine until one of the requests fails for any reason, such as hitting a rate limit or the API being down. In that case, the method has to run again and potentially repeat the same API requests that previously succeeded.

What I like to do in these situations is cache every single response for a short period of time. That way, if any of the subsequent requests fail and the method has to run again, we won't repeat the requests that were previously successful; we'll just take their responses out of the cache. Here's the same method, improved with caching.

public function users(): Collection
{
    $cursor = null;
    $users = collect([]);

    do {
+       $key = sprintf('users:%s', $cursor ?: 'initial');

-       $response = $this->get('/users', [
+       $response = Cache::remember($key, now()->addMinutes(5), fn () => $this->get('/users', [
            'cursor' => $cursor,
-       ]);
+       ]));

        $users = $users->merge($response['users']);

        $cursor = $response['next_cursor'] ?? null;
    } while ($cursor !== null);

    return $users;
}

This way, we have reduced the number of requests that have to run again if any request in the entire "chain" fails. However, there's one more improvement we can make: clearing the cache once all the requests finally finish. We can either store all the cache keys in an array and forget them one by one (there's a sketch of that variant below), or just use cache tagging to flush everything at once. Keep in mind that cache tags require a store that supports them, such as Redis or Memcached; the file and database drivers don't.

public function users(): Collection
{
    $cursor = null;
    $users = collect([]);

    do {
-       $key = sprintf('users:%s', $cursor ?: 'initial');

-       $response = Cache::remember($key, now()->addMinutes(5), fn () => $this->get('/users', [
+       $response = Cache::tags('users')->remember($cursor ?: 'initial', now()->addMinutes(5), fn () => $this->get('/users', [
            'cursor' => $cursor,
        ]));

        $users = $users->merge($response['users']);

        $cursor = $response['next_cursor'] ?? null;
    } while ($cursor !== null);

+   Cache::tags('users')->flush();

    return $users;
}
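If your cache store doesn't support tags, the keys-in-an-array variant mentioned above works just as well. Here's a sketch of it:

public function users(): Collection
{
    $cursor = null;
    $users = collect([]);
    $keys = [];

    do {
        $keys[] = $key = sprintf('users:%s', $cursor ?: 'initial');

        $response = Cache::remember($key, now()->addMinutes(5), fn () => $this->get('/users', [
            'cursor' => $cursor,
        ]));

        $users = $users->merge($response['users']);

        $cursor = $response['next_cursor'] ?? null;
    } while ($cursor !== null);

    // Every page succeeded, so the cached responses are no longer needed.
    foreach ($keys as $key) {
        Cache::forget($key);
    }

    return $users;
}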

Of course, this is far from ideal. As I've said, this shouldn't run in the same request lifecycle; we shouldn't be making (potentially) dozens of API requests in it. If there are a lot of records, this can use a lot of memory and take a very long time to finish... but sometimes you gotta work with what you have.
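For completeness, here's a rough sketch of the queued approach mentioned at the start: one job per page, each dispatching the next. The ApiClient dependency and the User::upsert() call are placeholders; the latter assumes the API payload maps straight onto your users table.

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class FetchUsersPage implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(public ?string $cursor = null)
    {
    }

    public function handle(ApiClient $client): void
    {
        // Fetch a single page; a failure here only retries this job.
        $response = $client->get('/users', ['cursor' => $this->cursor]);

        // Persist the page however fits your app (placeholder).
        User::upsert($response['users'], ['id']);

        // Chain the next page, if there is one.
        if ($cursor = $response['next_cursor'] ?? null) {
            self::dispatch($cursor);
        }
    }
}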

👋 I'm available for software development contract work and consulting. You can find me at [email protected]