What is a cache?

How can you increase the speed of your website? Which strategy should you adopt? Why is cache invalidation so difficult?

Find more tutorials explaining the basics of a subject in ten minutes with the tag "what is it".

The cache: a source of many issues, omnipresent in conversations as soon as you work on a web product, and developers' number one excuse to avoid a real diagnosis.

What is it?

A cache is a system used to store the result of a process and keep it close for future use, so there is no need to re-execute the whole chain. A cache can contain loads of different things: the result of an expensive computation, images, videos, data that rarely changes and so on.

The goal is to improve performance:

  • Decrease the response time: fewer processes, geographic proximity to the client...
    Less delay = happy users and improved SEO.
  • Decrease the resources used: computing power, load on the machines protected behind the cache...
    Fewer resources = lower costs.

You use cache systems every day!

Each display of a web page triggers dozens of different cache layers distributed across multiple entities: your browser, your internet service provider, the cloud provider of the website, the internal software of the website and so on. Even before reaching the content, your browser handles many hidden cache systems on its own, such as domain name resolution.

Each layer of cache has its own purpose, involving different technologies and a dedicated management strategy:

  • You can control how the browser or an intermediate proxy caches your data by using HTTP headers in the response: Cache-Control, ETag, Expires, Last-Modified, the status code and so on.
  • A content delivery network (CDN) is a network of servers distributed all over the world. It is used to bring your content geographically closer to the clients to reduce network latency. Basically, if the client is in Europe, their request will most likely be routed to a European server. A CDN is mostly used to deliver static files, but some features, like Edge Side Includes (ESI) markup, allow you to mix static and dynamic content on a CDN.
  • A reverse proxy is a front-facing server processing requests before they reach the actual application servers. Here too, you can decide on a cache strategy for static files. Famous technologies: Varnish, Nginx.
  • The application server, the one processing the request and executing the code, also has cache-related directives for static files. Famous technologies: Apache, Nginx, Tomcat or Express.
  • The applicative cache: we will detail this one in the following sections.
  • Every modern browser implements the Cache API. With a few lines of JavaScript, you can store HTTP requests and their responses directly in the browser, making your content available offline.
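To make the first point concrete, here is a minimal sketch of how a server-side script could drive browser and proxy caching with an ETag and Cache-Control. The helper name and the chosen max-age are mine, for illustration only; real applications usually let the framework or web server emit these headers.

```php
<?php
// Sketch: decide whether a client can reuse its cached copy.
// buildCachingHeaders() is a hypothetical helper, not a standard API.
function buildCachingHeaders(string $body, ?string $ifNoneMatch): array
{
    $etag = '"' . md5($body) . '"'; // fingerprint of the content

    if ($ifNoneMatch === $etag) {
        // The client already has this exact content: no body needed.
        return ['status' => 304, 'headers' => ['ETag' => $etag]];
    }

    return [
        'status' => 200,
        'headers' => [
            'ETag' => $etag,
            // Let browsers and shared proxies keep the response for an hour.
            'Cache-Control' => 'public, max-age=3600',
        ],
    ];
}

$body = '<html>hello</html>';
$first = buildCachingHeaders($body, null);                        // fresh request
$second = buildCachingHeaders($body, $first['headers']['ETag']);  // revalidation
```

On the second request, the client sends the ETag back in If-None-Match; since the content has not changed, the server can answer 304 Not Modified without resending the body.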

The applicative cache

By caching the results of SQL queries, the responses of REST requests and so on, you will reduce the load on your data sources and decrease the number of round trips between your internal systems or with an external partner.

Let's take an example

You want to display a page with your subscription offers. When a user logs into your website, you check their context and decide which offers they can subscribe to. If a user is not yet paying anything, you want to propose offers A, B, and C. If a user already subscribed to A, they can upgrade to B or C. For users in France, the additional offer D is available. The catalog of offers is managed by an external partner. To display the details of an offer, you need to call this partner's HTTP API and combine the result with data from your own system: it can take around five seconds to get the final result. When the user clicks on an offer, they are redirected to a checkout page showing, here too, the details of that offer.

Here, there are two main criteria to deduce which offers to display: the currently subscribed offer and the country. In other words, every user sharing these criteria will see the same offers. It would be silly to compute everything and query the external catalog for every user. You need to cache the results!

For the checkout page and your underlying system, you will certainly need to cache the details of a single offer.

Storing and retrieving an entity from the cache

```php
$offerA = $externalPartner->getOffer('A');
$cache->set(key: 'offer_A', value: $offerA);

// Later in the code
function getOffer(string $name): Offer
{
    // If the offer is already in the cache, good. Else we need to query the partner.
    if ($cache->has("offer_{$name}")) {
        return $cache->get("offer_{$name}");
    }

    $offer = $externalPartner->getOffer($name);
    $cache->set(key: "offer_{$name}", value: $offer);

    return $offer;
}

$offerA = getOffer('A'); // {name: A, canBeSubscribed: true, price: $10}
```

Knowing that, you can take different approaches for the page displaying several offers. Here are two examples:

Option 1: you fully store the offers with all their details.

Storing full entities in the cache

```php
$offerA = getOffer('A');
$offerB = getOffer('B');
// ...

$cache->set(key: 'offers_not_subscribed_in_france', value: [$offerA, $offerB, $offerC, $offerD]);
$cache->set(key: 'offers_not_subscribed_in_other_countries', value: [$offerA, $offerB, $offerC]);
$cache->set(key: 'offers_subscribed_to_A_in_france', value: [$offerB, $offerC]);
$cache->set(key: 'offers_subscribed_to_A_in_other_countries', value: [$offerB, $offerC]);

// Later in the code

/** @return Offer[] */
function getOffersToDisplay(string $subscriptionContext, string $country): array
{
    // I'm not repeating the checking and setting logic of getOffer here, but of course you should
    return $cache->get("offers_{$subscriptionContext}_in_{$country}");
}
```

Option 2: you store only the identifiers.

Storing identifiers in the cache

```php
$cache->set(key: 'offers_not_subscribed_in_france', value: ['A', 'B', 'C', 'D']);
$cache->set(key: 'offers_not_subscribed_in_other_countries', value: ['A', 'B', 'C']);
$cache->set(key: 'offers_subscribed_to_A_in_france', value: ['B', 'C']);
$cache->set(key: 'offers_subscribed_to_A_in_other_countries', value: ['B', 'C']);

// Later in the code

/** @return Offer[] */
function getOffersToDisplay(string $subscriptionContext, string $country): array
{
    // I'm not repeating the checking and setting logic of getOffer here, but of course you should
    $offerNames = $cache->get("offers_{$subscriptionContext}_in_{$country}");

    $offersToDisplay = [];
    foreach ($offerNames as $offerName) {
        $offersToDisplay[] = getOffer($offerName);
    }

    return $offersToDisplay;
}
```
Wait! Why would I ever want to do that? You are calling the cache in a loop!

Each option has its pros and its cons. You will choose a strategy based on the cache systems you use and your goals. Are you limited in terms of memory? Does the cache need to be shared between several processes? Between several machines? Do you want to list what is already stored in the cache? Which data are you storing? Is it common to all your clients? How long before this data becomes obsolete?

For instance, if your cache needs to be distributed and available to several machines, it will involve an independent server with network calls. Even if it's faster than a database or an HTTP API call, there is still a cost and a risk of data loss. Examples of technologies: Redis, Memcached. The machines hosting these cache technologies are usually provided with gigabytes of RAM.

On the opposite side of the ring, if you are caching data "in memory", available only to the current execution context, there is almost no cost to calling the cache. I am talking here about storing values in data structures such as a "map" or a "set" in the memory of the machine running the application server. These machines are optimized to execute the code faster, with big CPUs but not so much memory.
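A minimal sketch of such an in-memory cache: a plain PHP class wrapping an associative array. The class name is mine; in practice you would more likely use your framework's cache abstraction.

```php
<?php
// Minimal in-process cache: a thin wrapper around an associative array.
// Only valid for the current process; nothing is shared between machines.
class InMemoryCache
{
    private array $entries = [];

    public function has(string $key): bool
    {
        return array_key_exists($key, $this->entries);
    }

    public function get(string $key): mixed
    {
        return $this->entries[$key] ?? null;
    }

    public function set(string $key, mixed $value): void
    {
        $this->entries[$key] = $value;
    }
}

$cache = new InMemoryCache();
$cache->set('offer_A', ['name' => 'A', 'price' => 10]);
```

A lookup here is a simple array access: no network, no serialization, which is exactly why this layer is so cheap compared to a remote cache.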

Most websites supporting a high load will use several layers combining local and remote caches. For instance, an ORM such as Doctrine can, when recording an entity in the database, directly save the data in several systems: databases, memory caches, remote caches, filesystems... Some layers will be used for retrieving the data in the same process, other layers by other machines and so on.

So, which option is the best?

First, let me remind you that these were just examples and I would be glad to know how you would do it in the comments.

Option 1 saves fetching time but requires loads of memory, while option 2 consumes little memory but needs several calls to the cache in the fetching process. There is no right answer, it depends on your project!

Cache invalidation

If the data changes in your database or in an external system, you have to invalidate the linked cache entries to refresh it. That means you need to know when, and what, to invalidate!

For instance, say you want to change the price of an offer. With option 1, you will not only need to invalidate the cache entry of that offer but also every cache entry using it. You need to keep, somewhere, the information linking all these entries to your offer, and the order in which to delete them. Forget to delete a single entry and you can be sure your software is inconsistent somewhere. Some cache libraries, such as Symfony Cache with its tag-aware adapters, provide tags to help you manage the links between entries.
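One way to keep track of which entries depend on an offer is a reverse index of tags. Here is a sketch under my own naming (TaggedCache and its methods are illustrative, not a real library API):

```php
<?php
// Sketch: tag each cache entry, keep a reverse index tag => keys,
// and invalidate every dependent entry in one call.
class TaggedCache
{
    private array $entries = [];
    private array $keysByTag = [];

    public function set(string $key, mixed $value, array $tags = []): void
    {
        $this->entries[$key] = $value;
        foreach ($tags as $tag) {
            $this->keysByTag[$tag][] = $key;
        }
    }

    public function get(string $key): mixed
    {
        return $this->entries[$key] ?? null;
    }

    public function invalidateTag(string $tag): void
    {
        foreach ($this->keysByTag[$tag] ?? [] as $key) {
            unset($this->entries[$key]);
        }
        unset($this->keysByTag[$tag]);
    }
}

$cache = new TaggedCache();
$cache->set('offer_A', ['price' => 10], tags: ['offer_A']);
$cache->set('offer_B', ['price' => 20], tags: ['offer_B']);
$cache->set('offers_not_subscribed_in_france', ['A', 'B'], tags: ['offer_A', 'offer_B']);

// The price of A changes: drop the offer and every list that used it.
$cache->invalidateTag('offer_A');
```

After the purge, both the offer entry and the list depending on it are gone, while the entry tagged only with offer_B survives.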

"There are only two hard things in Computer Science: cache invalidation and naming things." (Phil Karlton)

Let's continue our story

When the user logged into the website, you fetched loads of data about them: communication opt-ins, subscriptions... Because there is a good chance the user will continue their journey on the website during the next few minutes, and you don't want to fetch everything again, you cached their profile.

Unless the user does something that impacts their data, you will never invalidate that cache entry. It will keep taking up space forever, for nothing.

That's why most cache systems allow you to provide an expiration delay when setting an entry. Too short a delay and your cache is useless; too long a delay and you risk accumulating too many entries and overloading your memory.
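A sketch of how such expiry handling works internally, assuming a set() that accepts a TTL in seconds (the class and the signature are mine; real cache clients differ in naming, but the idea is the same):

```php
<?php
// Sketch: store an expiration timestamp next to each value and
// treat expired entries as absent.
class ExpiringCache
{
    private array $entries = []; // key => [value, expiresAt]

    public function set(string $key, mixed $value, int $ttlSeconds): void
    {
        $this->entries[$key] = [$value, time() + $ttlSeconds];
    }

    public function get(string $key): mixed
    {
        if (!isset($this->entries[$key])) {
            return null;
        }
        [$value, $expiresAt] = $this->entries[$key];
        if (time() >= $expiresAt) {
            unset($this->entries[$key]); // lazily evict the expired entry
            return null;
        }
        return $value;
    }
}

$cache = new ExpiringCache();
$cache->set('profile_42', ['optIn' => true], ttlSeconds: 3600);
$cache->set('already_stale', 'gone', ttlSeconds: 0); // expires immediately
```

Note the lazy eviction on read: many real systems combine it with a background sweep so that entries nobody reads again still get reclaimed.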

Now, you have developed a new feature requiring the users' invoices in their profile. You currently have 10,000 profiles in your cache without the invoices; they will expire one by one over the next twenty-four hours.

Before you release the feature to production, you need to decide how you are going to operate. Some ideas:

  • You can directly invalidate all cache entries related to profiles. As soon as you trigger the purge, the cache will no longer serve as a protection layer for your system. Thousands of users might hit your database at the same time, overloading everything. Adopt this strategy only if you know your system's limits and the number of users.
  • If profiles without invoices will continue to work without bugs, you can do nothing and wait for the entries to expire. You will be in a kind of hybrid mode for twenty-four hours, with some users unable to see the new feature.
  • If profiles without invoices will cause bugs, you can try to invalidate everything as in the first solution, but my advice would be to rework the feature and release it in two steps: a first step to fetch the invoices and store them in the cache, and a second step, at least twenty-four hours later, to release the feature actually using the invoices.
Can't this overload issue occur during the release of a new feature?

Yes, it can! Imagine you are releasing a new offer and invalidating the cache entries while there are many users currently loading your pages. The first user calling the function getOffer will trigger the call to the external catalog to populate the cache. As we previously mentioned, this process can take up to five seconds. Every user calling getOffer during these five seconds will trigger the same heavy process, risking overloading your catalog and every other involved part of your system.

To avoid that you can warm up the cache. In other words, you can preload the entries before releasing the feature to the users. Basically, it means iterating over all possible combinations to fill the cache with all variants. In our example, you would loop over all the countries and the subscription statuses.
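The warm-up described above can be sketched as a loop over all combinations. The lists of countries and subscription statuses are placeholders from our example, and warmUpOffers is a hypothetical helper; the expensive computation is passed in as a callable so the sketch stays self-contained:

```php
<?php
// Sketch: preload every (subscription status, country) combination
// before the release, so no real user pays the five-second cost.
function warmUpOffers(object $cache, callable $computeOffersToDisplay): void
{
    $subscriptionContexts = ['not_subscribed', 'subscribed_to_A']; // placeholder list
    $countries = ['france', 'other_countries'];                    // placeholder list

    foreach ($subscriptionContexts as $context) {
        foreach ($countries as $country) {
            $key = "offers_{$context}_in_{$country}";
            $cache->set($key, $computeOffersToDisplay($context, $country));
        }
    }
}

// Stand-in cache object for the sketch: a bare key/value store.
$cache = new class {
    public array $entries = [];
    public function set(string $key, mixed $value): void
    {
        $this->entries[$key] = $value;
    }
};

// In real life this callable would query the external catalog.
warmUpOffers($cache, fn (string $context, string $country) => ['A', 'B']);
```

Once the loop has run, all four variants are in the cache, and releasing the feature no longer exposes users to the cold-cache path.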

There is so much to say about cache systems! For now, I hope I made things clearer for you.
Have a nice day :)