Table of contents

  1. Introduction
    1. Locality principle
    2. Terminology
  2. Timing attacks
    1. Exercise 1: Mount a timing attack
    2. Basic cache attack: Flush+Reload
      1. Exercise 2: Mount a Flush+Reload attack
  3. Cache placement policies
    1. Direct mapping
    2. Set-associativity
      1. Exercise 3
      2. Exercise 4
      3. Exercise 5
      4. Exercise 6
    3. More advanced cache attacks: Prime+Probe
      1. OPTIONAL: Exercise 7
      2. OPTIONAL: Exercise 8
      3. OPTIONAL: Exercise 9

Introduction

Over the years, a performance gap has formed between the processor and the memory unit. As shown in the figure below, processor performance has been increasing much faster than memory performance. Consequently, the processor has become much faster than the memory, and memory accesses have become a bottleneck for the overall performance of the computer.

Illustration of the performance gap between memory and CPU

To solve this issue, caches were introduced. Cache memories are not as fast as registers, but they can hold more data and are much faster than the main memory. However, they are also more expensive than the main memory and therefore a lot smaller. Most commercial CPUs offer a hierarchy of caches: the Level 1 cache (L1) is the fastest but also the smallest, the Level 2 cache (L2) is slower but larger, and the last-level cache (L3 or LLC) is the slowest but the largest.

Illustration of memory hierarchy in a computer

Caches can be local to a single processor core (this is typically the case for the L1 and L2 caches) or shared across multiple cores. Finally, the L1 cache is usually split into an instruction cache (L1I), which contains program instructions, and a data cache (L1D), which contains program data.

Illustration of cache hierarchy in a computer

Locality principle

Programs usually access a relatively small portion of the address space at a time.

In particular, when a program accesses a given memory location (a variable), it is likely to access that same location again in the near future. This is called temporal locality. Hence, a memory location that is accessed by a program can be cached to speed up future accesses!

Illustration of temporal locality

Additionally, when a program accesses a memory location, it is likely to access nearby memory locations in the near future (think for instance about nearby variables on the stack or nearby array members). This is called spatial locality. Hence, when a memory location is accessed, we transfer an entire block (multiple contiguous words of memory) into the cache at once, pulling nearby data into the cache.

Illustration of spatial locality

By exploiting these two locality principles, caches can cause a huge performance gain, even though they are small.
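Both principles show up in even the simplest code. In the (hypothetical) loop below, the variables sum and i are reused on every iteration (temporal locality), while the array elements are accessed at consecutive addresses (spatial locality), so most iterations hit in the cache:

int sum_array(const int *arr, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)   /* sum and i are reused on every iteration: temporal locality   */
        sum += arr[i];            /* consecutive elements share cache blocks: spatial locality    */
    return sum;
}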

Terminology

Consider a program that accesses a memory address: lw t0, 0x10010000.

We say that we have a cache miss if the address 0x10010000 is not in the cache (for instance if it is the first time it is accessed by the program). In that case, the value is requested from the DRAM and the memory access is slow. The value is then placed in the cache for later use (following the temporal locality principle).

We say that we have a cache hit if the address 0x10010000 is in the cache. In this case, the corresponding value is directly served from the cache and the memory access is fast.

We can express the performance of the caching algorithm with the hit rate, which is the number of cache hits divided by the total number of memory accesses. We can also talk about the miss rate, which is equal to (1 - hit rate), or (cache misses / total accesses).
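For example, if a program performs 100 memory accesses of which 90 hit in the cache, the hit rate is 90/100 = 0.9 and the miss rate is 1 - 0.9 = 0.1.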

Timing attacks

Caches introduce timing variations based on the memory accesses of a program. This means that an attacker can use timing to infer which memory addresses are accessed by a victim program! In particular, on shared architectures (for instance a remote server shared between multiple users), an attacker can monitor the state of the cache (by observing cache hits and misses) to infer which cache lines are accessed by a victim. If the memory addresses accessed by the victim depend on secret data, the attacker can ultimately infer information about this secret data, leading to critical security vulnerabilities.

Attacks that exploit the state of the cache as a way to leak secret data are called cache attacks. They are part of a more general class of attacks, called timing attacks, which exploit timing variations of a system to leak secret data.
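To see how a timing leak can arise in code, consider the hypothetical comparison routine below, which returns as soon as it finds a mismatching byte; its running time reveals how many leading bytes of the attacker-controlled input are correct. This is not the code used in the exercises, just an illustration of the general pattern:

/* Hypothetical early-exit comparison: the more leading bytes of 'user'
 * match 'secret', the longer the loop runs, which an attacker can measure. */
int insecure_compare(const char *user, const char *secret, int len)
{
    for (int i = 0; i < len; i++)
        if (user[i] != secret[i])
            return 0;   /* returns earlier when a mismatch occurs sooner */
    return 1;
}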

:crystal_ball: We will see two different examples of cache attacks later in the session. But first, let’s illustrate basic timing attacks with some exercises.

Exercise 1: Mount a timing attack

First, download the archive code.zip, which contains the program for this exercise and the next one. The file passwd.c contains the code of a password checker. In this exercise, you have to infer the value of the password using a timing attack. (The password is given in secret.h, so for the sake of the exercise, please do not open this file!)

  1. Examine the source code in passwd.c, then compile and run the program, and try to guess the correct password!
  2. Explain why the execution timings being printed are not always exactly the same when repeatedly providing the exact same input. Why is it a good idea to print the median instead of the average?
  3. Can you identify a timing side-channel in the code? Keep in mind that you only control the user and user_len arguments (by providing inputs to the program), while secret and secret_len remain fixed unknown values. Hint: Try to come up with a way to iteratively provide inputs and learn something useful from the associated timings. First infer secret_len, before finally inferring all the secret bytes. You can assume the secret PIN code uses only numeric digits (0-9).
$ gcc passwd.c -o passwd
$ ./passwd
Enter super secret password ('q' to exit): 32349
--> You entered: '32349'
 _______________
< ACCESS DENIED >
 ---------------
        \
         \
    \|/ ____ \|/
    "@'/ .. \`@"
    /_| \__/ |_\
       \__U_/

time (med clock cycles): 68

Basic cache attack: Flush+Reload

Cache attacks, just like the timing attack on the password checker illustrated above, exploit variations in execution time to infer secret data. By measuring the execution time of a memory access, an attacker can determine whether that access results in a cache hit or a cache miss. Using this information, the attacker can determine whether a victim has accessed the same memory location!

Let us look at a basic cache attack, called Flush+Reload. Flush+Reload relies on an instruction, offered by some CPUs, to flush (remove data from) the cache. We illustrate Flush+Reload with a step-by-step example (notice that each item number corresponds to a slide in the slideshow below):

  1. Consider that an attacker and a victim share some memory so that a variable a is accessible to both the attacker and the victim;
  2. Flush: the attacker flushes the address &a from the cache;
  3. Victim executes: The attacker lets the victim execute. If secret == 1, the victim requests the address &a, which produces a cache miss;
  4. The address &a is then requested from DRAM and placed in the cache;
  5. Reload: The attacker tries to access again the address &a and times the memory access. If the access is fast (cache hit) then the attacker can infer that the value has been accessed by the victim, and therefore that secret == 1;
  6. Alternatively, the attacker can try to access the variable b. Because the access is slow (cache miss), the attacker can infer that the value has not been accessed by the victim, and again conclude that secret == 1.

Flush+Reload is a very reliable and easy attack as it does not require knowledge of internal cache organization. However, it requires shared memory between an attacker and its victim (such that both have access to a and b) and hence is only applicable in a limited number of scenarios.
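For a rough idea of what the two primitives can look like in practice, here is a sketch for x86 CPUs using the clflush instruction and the time-stamp counter. This is only one possible implementation; the flush() and reload() helpers provided in the next exercise may be implemented differently.

#include <stdint.h>
#include <x86intrin.h>  /* _mm_clflush, _mm_mfence, __rdtsc (GCC/Clang on x86) */

static void flush(void *adrs)
{
    _mm_clflush(adrs);              /* evict the cache line containing adrs */
    _mm_mfence();                   /* make sure the flush has completed    */
}

static uint64_t time_access(void *adrs)
{
    _mm_mfence();
    uint64_t start = __rdtsc();     /* read the time-stamp counter          */
    *(volatile char *)adrs;         /* load one byte from adrs              */
    _mm_mfence();
    return __rdtsc() - start;       /* a small value indicates a cache hit  */
}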

Exercise 2: Mount a Flush+Reload attack

The file flush-and-reload.c in the code.zip archive contains the following function:

void secret_vote(char candidate)
{
    if (candidate == 'a')
        votes_a++;
    else
        votes_b++;
}

We want to detect which of the candidates the user voted for. To increase the vote count for candidate a or for candidate b, the program first has to load the corresponding current number of votes from memory (i.e. votes_a or votes_b). This is the access we will try to detect. Based on whether votes_a or votes_b has been accessed, we know which candidate got the vote.

:fire: Write down the three steps required to mount a successful Flush+Reload attack on this program.

Solution

Using the Flush+Reload technique, we (as the attacker) need to take the following steps:

  1. Flush votes_a from the cache;
  2. Let the secret_vote function execute with a secret input candidate. This will load the value of either votes_a or votes_b from memory, and place it in the cache;
  3. After the execution is done, reload votes_a. If the access time is low, this signals a cache hit. A cache hit in turn indicates that the victim process has accessed votes_a, which means a vote for candidate A. On the other hand, if the access time is high, the value has not been cached, which means candidate A is not the one the user voted for.

Of course, the above process could be executed with votes_b in place of votes_a. Since there are only two candidates, it suffices to check whether one of the two variables has been accessed.

Edit the main function in flush-and-reload.c to implement the missing flush and reload attacker phases. You can use the provided void flush(void *adrs) and int reload(void *adrs) functions. The latter returns the number of CPU cycles needed for the reload. If necessary, compensate for timing noise from modern processor optimizations by repeating the experiment (steps 1-3 above) a sufficient number of times and taking the median or average.
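A minimal sketch of what the missing attacker phases could look like, assuming the provided flush() and reload() helpers; the cycle threshold, the repetition count, and the secret_candidate input used to trigger the victim are hypothetical values for illustration:

#define THRESHOLD 100   /* hypothetical hit/miss boundary in cycles, tune on your machine */
#define REPEAT    1000  /* repeat the experiment to filter out timing noise               */

/* ... inside main() ... */
int hits = 0;
for (int i = 0; i < REPEAT; i++) {
    flush(&votes_a);                   /* 1. Flush votes_a from the cache          */
    secret_vote(secret_candidate);     /* 2. Let the victim run with a secret input */
    if (reload(&votes_a) < THRESHOLD)  /* 3. Reload votes_a and time the access:   */
        hits++;                        /*    a fast access signals a cache hit     */
}
printf(hits > REPEAT / 2 ? "Vote for candidate A\n" : "Vote for candidate B\n");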

You can compile and run the program with:

$ gcc flush-and-reload.c -o fnr
$ ./fnr

Cache placement policies

The cache placement policy determines where a memory address should be placed in the cache.

Direct mapping

Let us start with the structure of the cache. In our examples, we consider byte-addressable memory, meaning that data is stored and accessed byte by byte (as opposed to word-addressable memory, where data is stored and accessed word by word). The cache is a table that contains multiple rows, called cache sets. A cache set can include one or more cache blocks (a block is sometimes also called a cache line, which can be slightly confusing in our representation: a cache set is drawn as a row [or line] of the cache table, and that row can contain multiple cache lines).

The simplest cache placement policy, called direct mapping, maps every memory address to a unique block in the cache.

Take for instance the cache model given below, where each cache set only contains a single block. Given a memory address, the index of the corresponding cache set is determined using the two least significant bits of the address (index = address % 4). Because multiple addresses map to the same cache block, the cache also needs to keep track of a tag, corresponding to the most significant bits of the address. Therefore, given an address, the index determines where to look for the data in the cache and the tag indicates whether we have a cache hit or a cache miss.

Illustration of a direct mapped cache

A memory address, composed of a tag t and an index i, is in the cache (cache hit) if the tag at index i in the cache matches t. For instance, accessing the address 0001 (i.e. tag=00, index=01) results in a cache hit because the tag in the cache at index 01 is 00. However, accessing the address 0010 (i.e. tag=00, index=10) results in a cache miss because the tag in the cache at index 10 is 01.

The data in one cache block typically contains more than one byte. This is to enable spatial locality: when the data from a certain address is loaded, the contents of the neighboring memory locations are also placed in the cache, in case they are also accessed in the future.

For instance, in the cache model given below, the size of a block is 4 bytes. The lower bits of the address correspond to the offset of the data in a cache block. For instance, the address 001000 corresponds to the value A0, while the address 001001 corresponds to the value A1.

Illustration of a direct mapped cache where a cache set contains 2 cache blocks

:bulb: Summary.
A memory address (of size 32 bits) is composed of:

  1. an offset (the b least significant bits), which determines the offset into one cache block;
  2. an index (the next k bits), which determines the cache set;
  3. a tag (the remaining 32-(k+b) most significant bits), which determines whether we have a cache miss or a cache hit. Additionally, the cache contains a Valid bit, which indicates whether a cache line is valid or not (e.g., in order to synchronize data across different caches).

Summary of a direct mapped cache
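To make the split concrete, the following sketch decomposes an address for a cache like the one in the figures above; the 2 offset bits (4-byte blocks) and 2 index bits (4 sets) are assumptions matching those examples:

#include <stdio.h>

#define OFFSET_BITS 2   /* b: 4-byte blocks */
#define INDEX_BITS  2   /* k: 4 cache sets  */

int main(void)
{
    unsigned addr   = 0x19;  /* the 6-bit address 011001 from the examples */
    unsigned offset = addr & ((1u << OFFSET_BITS) - 1);
    unsigned index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    unsigned tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("offset=%u index=%u tag=%u\n", offset, index, tag);
    return 0;
}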

Set-associativity

A limitation of direct-mapped caches is that there is only one block available in a set. Every time a new memory address is referenced that is mapped to the same set, the entire block is replaced, which causes a cache miss. Imagine for instance a program that frequently accesses addresses 000100 and 010100 in the above illustration. Because both addresses map to the same cache set (at index 01), accessing 010100 evicts 000100 from the cache (and vice versa). Hence, accessing both addresses in an alternating fashion results in a sequence of cache misses, which causes a performance loss.
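For example, with the 4-byte blocks (2 offset bits) and 4 cache sets (2 index bits) used above, 000100 splits into tag 00, index 01, offset 00, while 010100 splits into tag 01, index 01, offset 00: the indices are identical but the tags differ, so the two addresses compete for the same block.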

To mitigate this problem, we can duplicate our cache structure into multiple ways, where a given address can be placed into any of the ways. We illustrate below a 2-way cache. Now, even though 000100 and 010100 map to the same cache set (at index 01), they can be placed in two different ways and can be in the cache at the same time!

Illustration of a 2-way set-associative cache

Finally, a fully associative cache is made up of a single cache set containing multiple ways. Hence, a memory address can occupy any of the ways and is solely identified with its tag (no need for an index because there is only one set!). Fully-associative caches ensure full utilization of the cache: a block is never evicted if the cache is not full. When the cache is full, the evicted line is determined by a replacement policy (e.g., the least recently used block is replaced). However, searching for an address in a fully associative cache is expensive: it takes time (and power) to iterate through all cache blocks and find a matching tag.

:bulb: Notice that a 1-way associative cache corresponds to a direct-mapped cache. Hence, an n-way set-associative cache provides an interesting tradeoff between a direct-mapped cache and a fully associative cache.

Illustration of a n-way set-associative cache

Exercise 3

Suppose a computer’s address size is k bits (using byte addressing), the cache data size (the total of the stored data, excluding the metadata such as tags) is S bytes, the block size is B bytes, and the cache is A-way set-associative. Assume that B is a power of two, so B=2^b. Figure out what the following quantities are in terms of S, B, A, b and k:

  • the number of sets in the cache;
  • the number of index bits in the address;
  • the number of bits needed to implement the cache.

Exercise 4

Consider a processor with a 32-bit addressing mode and a direct mapped cache, with an 8-word block size and a total size of 1024 blocks. Calculate how many bits of the address are used to tag a block, to select the block, to select a word in the block, and to select a byte in the word. The sum of these numbers must be 32! Calculate the total size needed to implement this cache.

Exercise 5

Here is a series of address references given as word addresses: 2, 3, 11, 16, 21, 13, 64, 48, 19, 11, 3, 22, 4, 27, 11. Label each reference in the list as a hit or a miss and show the final content of the cache, assuming:

  • direct-mapped cache with 16 1-word blocks that is initially empty;
  • direct-mapped cache with 4-word block size and total size of 16 words;
  • 2-way set-associative cache, 2-word block size. Total size of 16 words.

The following table shows an example format of a possible solution for a direct-mapped cache with 8 2-word blocks (S is the set index; W0 and W1 are the word addresses currently cached in the block).


+---+----+----+
| S | W0 | W1 |
+---+----+----+
| 0 | 48 | 49 |
| 1 |  2 |  3 |
| 2 |  4 |  5 |
| 3 | 22 | 23 |
| 4 | -- | -- |
| 5 | 26 | 27 |
| 6 | 12 | 13 |
| 7 | -- | -- |
+---+----+----+
2 (miss), 3 (hit), 11, 16, 21, 13, 64, 48, 19 (all misses), 11 (hit), 3, 22, 4, 27, 11 (all misses)

Exercise 6

Associativity usually improves the hit ratio, but not always. Consider a direct mapped cache with 16 1-word blocks and a 2-way set-associative cache with 1-word block size and a total size of 16 words. Find a sequence of memory references for which the associative cache experiences more misses than the direct mapped cache.

More advanced cache attacks: Prime+Probe

Utilizing knowledge about the cache organization (i.e., placement policies and cache collisions), an attacker can perform cache attacks across protection domains (without shared memory)!

Prime+Probe is a cache attack that allows an attacker to infer information about a victim's memory accesses without requiring shared memory (unlike Flush+Reload):

  1. Consider an attacker and a victim executing on the same machine (but without shared memory). The victim has access to a variable a and the attacker has access to a variable c such that a and c map to the same cache set;
  2. The attacker evicts the address &a from the cache by accessing the address &c;
  3. The attacker lets the victim execute. Assuming secret == 1, the victim requests the address &a, which produces a cache miss;
  4. The address &a is then requested from DRAM and placed in the cache, evicting the attacker’s data c;
  5. The attacker tries to access again the address &c and times the memory access. If the access is slow (cache miss) then the attacker can infer that a has been accessed by the victim, and therefore that secret == 1;
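A minimal sketch of these steps, reusing the reload() timing helper from the Flush+Reload exercise; the run_victim() function, the attacker variable c, and the threshold are hypothetical, and c must be chosen so that it maps to the same cache set as the victim's a:

#define THRESHOLD 100            /* hypothetical hit/miss boundary in cycles  */

extern int  reload(void *adrs);  /* timing helper from the previous exercise  */
extern void run_victim(void);    /* assumed way to let the victim execute     */

volatile char c;                 /* attacker data mapping to the same cache set as the victim's a */

int probe_secret(void)
{
    volatile char tmp = c;       /* Prime: load &c, evicting &a if it was cached */
    (void)tmp;

    run_victim();                /* Victim: accesses &a only if secret == 1      */

    /* Probe: a slow reload of &c means the victim evicted it by loading &a,
     * so the attacker concludes secret == 1. */
    return reload((void *)&c) > THRESHOLD;
}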

OPTIONAL: Exercise 7

Is the above Prime+Probe technique applicable for each of the cache mapping schemes you know: (1) direct mapping, (2) N-way set associative, (3) fully associative with LRU replacement policy? Explain why (not). If applicable, describe a high-level strategy to replace cached victim memory locations by constructing an attacker-controlled eviction set during the “prime” phase. What is the granularity at which an attacker can see victim accesses during the “probe” phase?

OPTIONAL: Exercise 8

Consider a processor with a 32-bit addressing mode and a direct mapped cache, with a 1-word block size and a total size of 16 blocks. First calculate index and tag address bits, as in Exercise 4 above. Say that all memory addresses in the range [0, 100[ belong to the victim, whereas the attacker owns addresses in the range [100, 2^32[. The victim performs a series of address references given as word addresses: either (2, 3, 11, 16, 21, 13, 64, 48, 19, 11, 3, 22, 4, 27, 11) if secret=1, or (2, 3, 11, 16, 21, 13, 64, 48, 19, 11, 70, 36, 7, 91, 75) if secret=0. Construct a memory access sequence to be performed by an attacker during the “prime” phase, in order to learn the victim’s secret in the “probe” phase later on.

OPTIONAL: Exercise 9

Repeat the previous exercise for a processor with a 32-bit addressing mode and a 2-way set-associative cache, with 2-word block size and a total size of 16 words. You can assume a least-recently used cache replacement policy.