AdministriviaProject 2 is due March 14, Sunday, by midnight. You can download the details from the class web site. The mid-term is March 12th. It will be held in SGM 124 at the regular class start time of 2:00pm. The test will last until 3:15pm. Readings for this weekSections 3.4 - 3.6 This lecture covers Tom Anderson's lectures 14 and 15. |
Now that we know all about caching and how it can help use, let's put a cache between the CPU and main memory that is able to store a certain amount of data (of course it's not as big as main memory). A cache in this location is typically called a translation lookaside buffer. For our TLB to work well, it needs to hold pointers to the most frequently used main memory pages. This saves having to use the page table to find out which page frame is referenced by the virtual page the CPU is looking for.
The TLB is implemented as a hardware table of frequently used address translations. Every memory reference is first compared to the items in the TLB. If a match is found, the TLB uses its pointers to the page frame in main memory, allowing immediate retrieval of that page from memory. On a miss, the page table(s) are referenced and the appropriate page frame retrieved. In addition, an entry in the TLB is made to this page frame. If the TLB is full, an older entry must be removed. We will discuss later how we can decide which entry to remove.
The access time for a TLB hit is about 5 to 10 nanoseconds. When page table accesses are required, it can take more than 100 nanoseconds to retrieve a page frame and update the TLB.
With this mechanism, which is cheap to implement in hardware, we search each TLB entry in sequential order. The time required to find an item is not constant, but will average out to be the time to search half of the items in the TLB. The bigger the TLB, the longer it will take (on average) to find an item.
With direct mapping, we restrict each virtual page to use a specific entry slot in the TLB. For example, if we have 16 slots in the TLB and 160 page frames in physical memory, we can map the page frames to the TLB either by allocating 10 consecutive page frames to a TLB slot (page frames 0 - 9 go to TLB slot 1; 10 - 19 go to TLB slot 2; etc), or have every 10th page allocated to the same TLB slot (page frames 0, 10, 20, etc, go to TLB slot 1; page frames 1, 11, 21, etc, go to TLB slot 2; and so on).
With either mechanism, we have multiple page frames that are required to compete for a single slot, even when other slots are unused. If it turns out that 2 page frames that must share the same slot have been both allocated to a single TLB slot, you can have thrashing. Thrashing is where you have a lot of lot of TLB misses because of contention. In addition, the page that gets replaced may be the most heavily used page. This is not a good thing!
To minimize the conflicts that can occur with Direct Mapping, we can implement a hashing function. With a hashing function, we calculate which TLB entry a particular page frame will be allocated to, based upon some algorithm. Using a hashing function may reduce the number of conflicts, but it still suffers from the problem of possibly eliminating a TLB entry that is heavily used. It would be much better if we could pick a lightly used page to evict.
The ideal for a TLB would allow us to put any page frame in any TLB slot AND we can look up each TLB slot simultaneously. This is called a fully associative TLB. But, we don't always have to go that far. We can arrange our TLB in a set of N separate banks. We can do a simultaneous lookup in each bank. This is called N-way associative.
If we have a TLB with 32 entries and 4 banks. Each bank will have 8 entries. We can search each of the 4 banks simultaneously. The time required to find an entry will take at most the time to search 8 TLB entries.
The advantage to this scheme is that it minimizes conflicts. A new page frame reference could possibly go into any of the N banks that we have. If we use Direct Mapping or use a Hashing Function, we can pick a bank with an unused slot. In the fully associative TLB, a page frame can go into any TLB slot.
With the Direct Mapping and hashing functions methods, we have no choice as to which TLB entry is replaced. However, in the associated TLB, we can use 3 algorithms:
What happens when the CPU does a context switch when we have a TLB? To maintain protection between processes, we have to invalidate all the TLB entries. The reason for this is that the entries point to page frames in physical memory. What were good pages for one process are likely to be not good for the next process. This means every context switch causes a process to begin by page faulting.
In addition, we must ensure that any page that gets modified has the page table bits updated to reflect this fact. At this point, since the TLB only have pointers to page frames, we don't have to do more than that. When we talk about caching later, there is more involved if we put a hardware cache in the TLB.
Note that the TLB doesn't actually store any data from main memory. It only stores a pointer to the page frame in physical memory. If we want to speed things up even more, we can implement some hardware caches. But first, let's find out how memory is organized. It will help us to understand where our caches should be located.
Memory is organized around 2 principles:
To speed up the throughput of a computer, we OS writers want to put the frequently accessed pages in small, fast, expensive memory, and put the big piles of information in cheap slow memory.
So, the CPU is much faster than any other component in a computer (including main memory). In addition, main memory is much faster than a hard drive. Caching is the process of maintaining a copy of data that can be accessed more quickly than the storage medium upon which the data normally resides. We use caching at each level to avoid having to go to the next level of memory, which is slower.
The reasoning behind having something like a cache is that most programs make a large number of references to a small number of pages (we call this locality), i.e., computers do not behave in a random fashion (wouldn't that be tough on programmers). Thus, only a small fraction of all the pages in a process are actually used at any one point in time. This active set is called the working set.
There are 2 kinds of locality that programs can display:
A hit for a cache means that we found the item we wanted in the cache. A
miss means that the requested item was not in the cache. The hit ratio
is a percentage of the number of hits divided by the total number of requests. The
Effective Access Time can be calculated based on the probability of a hit or miss.
The equation is:
We know the following:
The effective access time is .001 * 1000 + .999 * 1.0 = 1.999. In essence, with these numbers, with a hit ratio of 99.9%, our effective access time is doubled, cutting the performance of our computer in roughly half.
There are 4 reasons we can have a cache miss.
The primary cause of poor caching performance is programs that exhibit bad locality. A good example is matrix manipulation. Depending on the implementation in the computer, matrix data is either stored sequentially by row, or by column. For good programming, a programmer must know which is used and ensure that they reference the data in their matrices in the same order. Otherwise, you can get a page fault with every reference in your matrix.
A prime example is the first computer MIPS built. Someone got the bright idea of allowing the video to be mapped in the TLB. THe TLB on this computer has 64 slots. The memory page size was 4K. Typically, the size of video memory was 1MB. This makes 256 pages of video required for the video. Since a video update typically is for the entire screen, or at least a fair piece of it, the entire TLB could be sucked up for the video, or worse, constant page faulting as the 256 video pages competed for 64 slots.
As we discussed above, we can put a hardware cache between every level of memory. The
figure below provides a schematic of where the most common caches are located.

The cache between the CPU and the MMU is called a virtually-addressed cache. It is based on the virtual addresses for user processes. The cache in the MMU, referenced by the TLB, is called a physically addressed cache, because the TLB converts the virtual addresses to physical addresses before checking the cache.
Caches have real pages stored in them. When the contents of an address is modified, we
have an option as to how far we propagate the change. In an ideal world, we would only
update the nearest cache, and only propagate the changes back to slower caches/memory
when the CPU is idle. However, busy computers are seldom idle, at least not often enough
to allow the ideal case. In addition, if power is lost to the computer, or it locks up
somehow, any changes not actually stored on the hard drive will be lost. So we really
have two options:
Demand Paging is where you only load pages from disk into memory when an address on that page is needed by the CPU. You don't have the OS try to figure out ahead of time what pages are needed are load them in advance (called preloading). As it turns out, some CPUs have hardware that looks ahead to do this. When the hardware does it, it is called pre-fetching. The goal is to have the pages at least loaded into memory, if not even in cache by the time the CPU is ready for the address. All this is done without the use of the CPU, however.
With this scheme, physical memory becomes a cache for the disks. The page table can have pointers to either pages that have been loaded into physical memory, or it can have pointers to pages that are still on the disk. This makes the actual location of the page transparent to the application program (this is a good thing), providing the appearance of infinite virtual address space.
Instead of having the hardware load the TLB, we could have it done by the OS. This is what is done in Nachos/MIPS. The assumption is that if the hit rate is high enough, we won't pay too high a penalty for having the OS do the work. This hides the page table from the application completely.
The process works like this:
When a page fault occurs, as OS builders, we don't want to disrupt the user program (OK, unless you want to be mean). We need some mechanism of restarting the instruction in the user program that caused the page fault. Hardware can help us by keeping track of the faulting instruction and storing the state of the CPU at the time the page fault occurred. However, we still have to be concerned about side effects.
Instructions that execute partially and change the state of the CPU are instructions that cause side effects. The trick is keeping track of what has occurred so that we don't do something twice.
For example, the assembly code instruction:
Increments the value stored in the CPU register sp (the parenthesis mean take the contents of sp and the plus sign means to increment the value) and stores the value at virtual address 10. The side effect in this instruction is whether the increment of sp has occurred when the page fault for address 10 occurs (we are assuming that the page holding this address is not in the TLB, or a hardware cache). If we don't keep track of what we have already done, we may do a double increment after servicing the page fault.
We have two solutions for this problem: unwind the side-effects and finish-off the side-effects. Unwinding the side effects means that we undo any partial work the instruction has performed. Finishing-off the side effects means that we have kept track of what portions of an instruction have already executed so that we can pick up where we left off.
We finally come back to this. The issue is how to decide which page to evict from our physical memory (remember that physical memory is really just a cache for our disk; what we discuss here works for any cache).
We could pick a page at random and kick it out, but just like for our TLB, this is really a bad algorithm. The page that is picked has no relation to the usage patterns.
This algorithm always removes the oldest pages in memory. This is fair, in that each page stays in memory about the same amount of time. However, being fair is not always being efficient. Just like the random algorithm, the pages picked for removal in this way are not related to usage, either.
We will pick a page for replacement that won't be used for the longest amount of time in the future. Though this algorithm is optimal, it is not practical, since we really can't predict exactly when a page is going to be used. But maybe we can approximate this, somehow.
LRU uses history to determine which page to kick out of memory. We remove the page not accessed for the longest time. We are assuming that if a page hasn't been referenced for awhile, that means it is not really needed. The problem with this algorithm is that we can't afford to have enough hardware in our cache to keep track of exactly when a page was brought into memory. We have to approximate age in some way.
LRU does badly when the next page referenced is the LRU page. This is because the page you are referencing was just kicked out of the cache. FIFO has the same problem. So, worst case is that every page reference causes a page fault.
MIN, on the other hand, does not suffer from this worst case performance. With MIN, after the initial loading of page frames into the cache, you get 1/N page faults, where N is the number of page frames. This is because with MIN, you are not throwing away the oldest page, you are throwing away the page that won't be used for the longest time. Worse case is that you have N+1 pages in your process, but only N page frames available. If you reference the one page not in the cache, you have to go N more references before you will get another miss, since MIN will select the page to kick out that is N references away.
Adding more page frames does not help FIFO. That is because the pages in memory have no relation to each other for differing numbers of page frames. However, LRU and MIN have a linear improvement in page faults when the number of page frames available is increased. This is due to the fact that the pages in the cache for N available slots is a subset of the pages in memory when there are N+1 available slots.
The perfect way to implement LRU is to keep an exact time stamp of when the page was loaded. However, depending on how you store time, you need 32 or 64 bits for each page table entry. This makes your page table even bigger, without necessarily improving performance that much. We can do pretty good by approximating the true time.
We approximate LRU by replacing one of the old pages, not necessarily the oldest page. We
do this as follows:
Because we set the USE bit to 0 as the hand advances, the algorithm always finds a page. However, we only have 2 classes of pages (based on the USE bit): young and old. Perhaps a few more classes would improve performance.
With Nth Chance, we don't throw a page away until the clock hand goes by N times. The OS will maintain a counter for each entry and increment it each time the hand goes by. However, if the USE bit is 1, we will clear it (like we did earlier) and set the counter to 0, as well. If the USE bit is 0, we just increment the counter. If the counter is greater than N, we replace the page.
The big question, is how to pick the value for N. If we have a big N, we have a good simulation of LRU. The problem is that the clock hand can go around a lot of times before it finds a page (not a good thing, necessarily). If N is 2, it is just like the Clock algorithm.
The page table maps virtual pages to physical page frames. A Core Map goes the other way. Do we really need one? For the clock algorithm we certainly do. The Clock algorithm uses page frames, not virutal pages. Why? Well, we certainly have a lot more virtual pages possible than page frames (remember the apparently infinite virtual memory). Using page frames keeps the size of the clock down. Thus, we need the core map to map these physical pages to the virtual pages they represent.
There is also the issue of mapping multiple virtual pages onto a single physical page (we call this sharing). If all of these virtual pages are in our clock and we kick one out, what happens to the others? They should all be evicted for consistency, but we have no way of knowing which ones they are without searching all the pages in the clock.