Ph.D. in Computer Science at Rutgers University

Memory Hierarchy - Cache

Cache plays an important role in the memory hierarchy. Accessing the cache is much faster than accessing main memory. In this article, you'll see what a cache is, why we need one, how caches are categorized in use, and three kinds of cache design. In addition, I'll walk through the whole process of how the CPU gets data from the cache or main memory.

Basic Concepts of Cache

The word "cache" comes from French and means a hiding place. In computer architecture, a cache is a relatively small memory made from SRAM that stores part of the code or data in main memory, as a trade-off between size and speed. In the Von Neumann architecture, instructions and data are stored in memory without distinction, while in the Harvard architecture, instructions and data are stored separately. In most modern CPU designs, such as x86 and ARM, data and instructions are all stored in main memory, but there are both a data cache and an instruction cache. Therefore, nowadays it's hard to tell whether a CPU is Von Neumann or Harvard, just as it's almost meaningless to argue RISC versus CISC: in the only surviving CISC architecture, x86, instructions are decoded into micro-ops of uniform length. It's an interesting pattern in the history of computer architecture that many obsolete designs come back as time passes.

Hierarchy in Cache

There's also a hierarchy between the different levels of cache. In the Intel Core i7, there are three levels of cache: each core has its private L1 data cache, L1 instruction cache, and L2 cache, and all the cores share one last-level cache.


SRAM Structure

SRAM is a type of semiconductor memory that uses bistable latching circuitry (flip-flop) to store each bit. SRAM exhibits data remanence, but it is still volatile in the conventional sense that data is eventually lost when the memory is not powered. The following pictures show the circuit diagram of an SRAM "cell" storing bit 0 and bit 1 respectively.


When a read operation occurs, the word line is activated, which turns on the access transistors M5 and M6 so that the bit line and its complement are driven by the cell, as the following two pictures illustrate.


When a write operation occurs, the new bit value and its complement are first driven onto the bit lines, and then the word line is activated. Finally, M5 and M6 are turned on, and the cell flips to the opposite stable state. The following pictures show the process of writing bit 0 starting from a state holding bit 1.


Do We Really Need Cache?

As we know, accessing a register takes only 1 clock cycle, while accessing main memory typically takes hundreds of clock cycles. That means when fetching data from memory, a lot of CPU resources are wasted. If we add a cache between the registers and main memory, waiting time decreases because accessing the cache takes only 10 - 40 clock cycles. The cache is much smaller than main memory, so we don't want cache misses to happen very often. Actually, it's locality that makes the cache useful. In addition, a proper cache replacement algorithm and prefetching also help improve the cache hit rate. I won't go deeper into locality, replacement algorithms, and prefetching here because those are quite involved topics.
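The payoff from those latency numbers can be sketched with the standard average-memory-access-time (AMAT) formula. The latencies below (30-cycle cache, 200-cycle memory, 5% miss rate) are illustrative assumptions, not measurements of any specific CPU:

```python
# Average memory access time: AMAT = hit time + miss rate * miss penalty.
# All numbers are assumed, illustrative latencies.

def amat(hit_time, miss_rate, miss_penalty):
    """Expected cycles per memory access with a cache in the path."""
    return hit_time + miss_rate * miss_penalty

CACHE_HIT = 30    # cycles to access the cache (assumed)
MEMORY = 200      # cycles to access main memory (assumed)

# Without a cache, every access pays the full memory latency.
no_cache = MEMORY

# With a cache and a 5% miss rate, most accesses finish in ~30 cycles.
with_cache = amat(CACHE_HIT, 0.05, MEMORY)

print(no_cache)    # 200
print(with_cache)  # 40.0
```

Even with a 5% miss rate, the average access is 5x faster than going to memory every time — and locality is what keeps real miss rates that low.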

Structure of Cache

A cache looks like a table; each line of the table holds roughly the following information:

 1) Tag: marks the main-memory address that the current cache line corresponds to
 2) Cache Line (Block): several consecutive bytes of memory corresponding to the tag
 3) Flags: status flags of the current cache line, such as the valid bit
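The fields above can be modeled as a toy structure. The sizes here (16-byte line, a dirty flag alongside the valid bit) are illustrative assumptions:

```python
# A toy model of one cache entry, matching the fields listed above.

from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    tag: int = 0          # which memory block this line currently holds
    valid: bool = False   # flag: does this entry hold real data yet?
    dirty: bool = False   # flag: modified since it was loaded? (for write-back)
    # the cache line itself: 16 consecutive bytes (assumed block size)
    data: bytearray = field(default_factory=lambda: bytearray(16))

entry = CacheEntry()
print(entry.valid)   # False: entries are empty at power-up, hence the valid bit
```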

Remember, the cache line is the minimum unit of a cache: when updating a cache, we update the whole cache line. We cannot update only one byte in a cache line and leave the others unchanged. When the CPU needs to write data back, there are two modes:

 1) Write-through: write data to the cache and main memory at the same time.
 2) Write-back: write data only to the cache, and write it to main memory later.

To improve the robustness of write-back mode, we could write data back to memory periodically or simply use NVRAM.
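The two write modes can be contrasted with a minimal sketch, where a couple of dicts stand in for the cache and main memory (the addresses and values are made up):

```python
# Toy model: dicts stand in for the cache and main memory.
cache = {}
dirty = set()               # addresses modified in cache but not in memory
memory = {0x1000: 0}

def write_through(addr, value):
    # Write to the cache and main memory at the same time:
    # memory is always consistent with the cache.
    cache[addr] = value
    memory[addr] = value

def write_back(addr, value):
    # Write only to the cache and mark the line dirty;
    # memory is updated later, on eviction or a periodic flush.
    cache[addr] = value
    dirty.add(addr)

def evict(addr):
    # On eviction, a dirty line must be written back to memory first.
    if addr in dirty:
        memory[addr] = cache[addr]
        dirty.discard(addr)
    del cache[addr]

write_through(0x1000, 42)
print(memory[0x1000])   # 42: memory updated immediately

write_back(0x1000, 99)
print(memory[0x1000])   # still 42: memory is stale until eviction

evict(0x1000)
print(memory[0x1000])   # 99: the dirty line was written back
```

The window between `write_back` and `evict` is exactly the robustness problem mentioned above: if power is lost in between, memory still holds the old value.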

Three Kinds of Cache Design

1. Fully-Associative Cache

In a fully-associative cache, a given memory block can sit in any cache frame, like a car entering a parking lot that is allowed to park in any space.

For example, suppose the address bus is 32 bits wide, memory capacity is 4 GB, cache capacity is 64 KB, and the cache line (block) size is 16 B.
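With these parameters, the address breakdown for a fully-associative cache works out as follows (no index field exists, since a block can go anywhere):

```python
# Address breakdown for the fully-associative example:
# 32-bit addresses, 64 KB cache, 16 B cache lines.

ADDR_BITS = 32
CACHE_SIZE = 64 * 1024   # 64 KB
LINE_SIZE = 16           # 16 B

offset_bits = LINE_SIZE.bit_length() - 1   # log2(16) = 4 bits of byte offset
num_lines = CACHE_SIZE // LINE_SIZE        # 4096 cache lines
tag_bits = ADDR_BITS - offset_bits         # no index field, so the tag is 28 bits

print(offset_bits, num_lines, tag_bits)    # 4 4096 28
```

Every lookup must compare the 28-bit tag against all 4096 lines in parallel, which is why fully-associative designs are only practical for small structures like TLBs.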


2. Direct-Mapped Cache

In a direct-mapped cache, a given memory block can sit in only one specific cache frame, like a car whose plate ends in 5 entering a parking lot where it may only park in space number 5.

For example, suppose the address bus is 32 bits wide, memory capacity is 4 GB, cache capacity is 64 KB, and the cache line (block) size is 16 B.
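For the same parameters, a direct-mapped cache splits the address into tag, index, and offset. The example address below is made up for illustration:

```python
# Address breakdown for the direct-mapped example:
# 32-bit addresses, 64 KB cache, 16 B cache lines.

ADDR_BITS = 32
CACHE_SIZE = 64 * 1024
LINE_SIZE = 16

offset_bits = 4                                   # log2(16)
num_lines = CACHE_SIZE // LINE_SIZE               # 4096 lines
index_bits = num_lines.bit_length() - 1           # log2(4096) = 12
tag_bits = ADDR_BITS - index_bits - offset_bits   # 32 - 12 - 4 = 16

# A given address maps to exactly one line:
addr = 0x12345678                                 # arbitrary example address
offset = addr & (LINE_SIZE - 1)                   # byte within the line
index = (addr >> offset_bits) & (num_lines - 1)   # which line
tag = addr >> (offset_bits + index_bits)          # stored for comparison

print(index_bits, tag_bits)   # 12 16
```

Only one tag comparison is needed per lookup, which is what makes direct mapping fast and cheap — at the cost of conflict misses when two hot blocks share an index.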


3. Set-Associative Cache

A set-associative cache is a combination of the fully-associative and direct-mapped designs. The parking lot is divided into several sets, each with two slots; a car whose plate ends in 5 may park in either slot of set number 5.

For example, suppose the address bus is 32 bits wide, memory capacity is 4 GB, cache capacity is 64 KB, and the cache line (block) size is 16 B.
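Here is the breakdown again for the set-associative case. The 2-way choice mirrors the two-slot parking example; it's an assumption, since the example doesn't fix the associativity:

```python
# Address breakdown for the set-associative example:
# 32-bit addresses, 64 KB cache, 16 B cache lines, 2 ways (assumed).

ADDR_BITS = 32
CACHE_SIZE = 64 * 1024
LINE_SIZE = 16
WAYS = 2                                          # two slots per set

num_lines = CACHE_SIZE // LINE_SIZE               # 4096 lines
num_sets = num_lines // WAYS                      # 2048 sets
offset_bits = 4                                   # log2(16)
index_bits = num_sets.bit_length() - 1            # log2(2048) = 11
tag_bits = ADDR_BITS - index_bits - offset_bits   # 32 - 11 - 4 = 17

print(num_sets, index_bits, tag_bits)   # 2048 11 17
```

Each lookup now compares two tags in parallel instead of one (direct-mapped) or 4096 (fully-associative) — a middle ground between conflict misses and hardware cost.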


Four Kinds of Cache Misses

- Compulsory Miss (Cold Miss): occurs on the first access to a block, which must then be brought into the cache. Also called a cold-start or first-reference miss.

- Conflict Miss (Collision Miss): in set-associative or direct-mapped placement, conflict misses occur when several blocks map to the same set or block frame. Also called collision or interference misses.

- Capacity Miss: occurs because blocks are discarded when the cache cannot hold all the blocks needed for program execution (the program's working set is larger than the cache capacity).

- Consistency Miss (Coherence Miss): happens in the scenario where the caches of different cores hold copies of the same data in memory, but the copies have somehow diverged.

Physically-Tagged or Virtually-Tagged?

Before reading this section, you need to know what physical and virtual addresses are, what a page table is, and what a TLB is. I've already covered those concepts in another article.

- Physically-Tagged

As we know, it's the physical address that memory actually reads, while the addresses in instructions are virtual. Thus, if we use a physically-tagged cache, we need to translate the virtual address into a physical address by accessing the TLB or the page table before we can access the cache. When there's a TLB miss, there will be a big loss in pipeline performance. Fortunately, the page size in modern architectures is typically at least 4 KB, so the TLB hit rate will be quite reasonable.

- Virtually-Tagged

It seems much simpler to use virtual tagging because we don't need to access the TLB before accessing the cache. However, there are more issues if we use virtual tagging:

1) The cache must be flushed on each process context switch, which causes more cache misses and memory references.

2) If the PCs of two different processes are the same, there will be a conflict miss in the instruction cache, because the two PCs map to the same slot in a direct-mapped cache.

3) If the PCs of two different processes are the same and the cache is virtually tagged, one process could easily read the data of another process, which is not safe.

4) In a modern OS, different virtual addresses are allowed to map to the same physical address. With virtual tagging, such aliases may waste a lot of cache blocks.

Therefore, it's preferable to use a physically-tagged cache. But how can we improve on it? Intel put forward a method called 'Virtually-Indexed, Physically-Tagged (VIPT)'. As we know, translating a virtual address into a physical address only changes the virtual page number into a physical frame number; the page offset stays the same. In a set-associative cache, if we make the set-index bits sit within the page/frame offset, then while accessing the TLB, the CPU can use those bits to access the cache at the same time.


Let me illustrate this process in detail (set-associative cache, without way prediction):

Traditional Physically-Tagged

Cycle 1: Access the TLB and get the physical address from the MMU. (Assuming a TLB hit.)

Cycle 2: Use the physical address obtained in the previous cycle to get the set number and access the cache. (Assuming a cache hit.)

Cycle 3: Compare the tags in the set selected in the previous cycle.

Cycle 4: Get the proper data from the cache line found in the previous cycle, based on the data offset within the cache line.

Using 'Virtually-Indexed, Physically-Tagged'

Cycle 1: Access the TLB and get the physical address from the MMU. At the same time, get the set number and access the cache using the page/frame offset bits of the virtual address. (Assuming both the TLB and the cache hit.)

Cycle 2: Compare the tags in the set selected in the previous cycle.

Cycle 3: Get the proper data from the cache line found in the previous cycle, based on the data offset within the cache line.

The problem with VIPT: since the cache line size is fixed and the page size is fixed, the number of sets is limited, so growing the cache forces the associativity (number of ways) to increase.
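This constraint can be worked out numerically. The index and offset bits must fit inside the page offset; using the running example (64 KB cache, 16 B lines) and assuming 4 KB pages:

```python
# The VIPT constraint: index_bits + offset_bits <= page_offset_bits,
# because only the page-offset bits are unchanged by translation.
# 4 KB pages are an assumption; cache parameters follow the examples above.

PAGE_SIZE = 4 * 1024     # assumed page size -> 12 page-offset bits
CACHE_SIZE = 64 * 1024
LINE_SIZE = 16

page_offset_bits = PAGE_SIZE.bit_length() - 1          # 12
line_offset_bits = LINE_SIZE.bit_length() - 1          # 4
max_index_bits = page_offset_bits - line_offset_bits   # at most 8 index bits
max_sets = 1 << max_index_bits                         # so at most 256 sets

num_lines = CACHE_SIZE // LINE_SIZE                    # 4096 lines
required_ways = num_lines // max_sets                  # 16-way associativity needed

print(max_sets, required_ways)   # 256 16
```

So a 64 KB VIPT L1 with 4 KB pages would need to be 16-way set-associative; doubling the cache size would push it to 32 ways, which is why VIPT L1 caches tend to stay small.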

Exclusive Cache and Inclusive Cache

An exclusive cache means that if data is in one level of the cache, it cannot be in any other level; each time a cache line is evicted, it is pushed down to the next-level cache. The advantage of an exclusive cache is that it stores more distinct data in total; the downside is that there is no redundancy between levels.

An inclusive cache means that whatever is in this level's cache is also in the next level's cache. The pro of an inclusive cache is redundancy; the con is that the cache miss rate can increase. Inclusive caches can further be divided into strictly inclusive and non-strictly inclusive. A strictly inclusive cache is preferred when the size gap between cache levels is big, and a non-strictly inclusive cache when the gap is not very big: if the gap is small and you use the strict approach, you're very likely to encounter cache conflicts.
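The capacity trade-off is simple arithmetic. The L1 and L2 sizes below are illustrative assumptions:

```python
# Distinct data capacity under the two policies, with assumed sizes.

L1 = 32 * 1024    # 32 KB L1 (assumed)
L2 = 256 * 1024   # 256 KB L2 (assumed)

# Exclusive: the levels hold disjoint data, so distinct capacity is the sum.
exclusive_capacity = L1 + L2          # 288 KB of distinct data

# Strictly inclusive: everything in L1 is duplicated in L2, so the
# distinct capacity is just the larger level.
inclusive_capacity = L2               # 256 KB of distinct data

print(exclusive_capacity // 1024, inclusive_capacity // 1024)   # 288 256
```

The smaller the gap between L1 and L2, the larger the fraction of L2 wasted on duplicates under strict inclusion — which is exactly why a small gap favors the exclusive or non-strict designs.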