The original meaning of cache memory (Cache) refers to a RAM that has faster access speed than general random access memory (RAM). Generally speaking, it does not use DRAM technology like system main memory, but it is expensive. The faster SRAM technology also has the name of cache memory.
The cache memory is a first-level memory that exists between the main memory and the CPU, and is composed of a static memory chip (SRAM). The capacity is relatively small but the speed is much higher than the main memory, which is close to the speed of the CPU. In the hierarchical structure of the computer storage system, it is a high-speed small-capacity memory between the central processor and the main memory. Together with the main memory, it constitutes a primary memory. The scheduling and transmission of information between the cache memory and the main memory is performed automatically by the hardware.
The most important technical index of cache memory is its hit rate.
The cache memory is a first-level memory that exists between the main memory and the CPU, and is composed of a static memory chip (SRAM). The capacity is relatively small but the speed is much higher than the main memory, which is close to the speed of the CPU.
It is mainly composed of three parts:
Cache storage body: store instructions and data blocks transferred from main storage.
Address conversion component: establish a directory table to realize the conversion from the main memory address to the cache address.
Replacement component: when the cache is full, the data block is replaced according to a certain strategy, and the address conversion component is modified.
Cache memory is usually composed of high-speed memory, associative memory, replacement logic circuit and corresponding control circuit. In a computer system with a cache memory, the address where the central processor accesses the main memory is divided into three fields: row number, column number, and group address. Thus, the main memory is logically divided into several lines; each line is divided into several groups of storage units; each group contains several or dozens of words. The high-speed memory is also correspondingly divided into rows and columns of storage unit groups. Both have the same number of columns and the same group size, but the number of high-speed memory rows is much less than the number of main memory rows.
The associative memory is used for address association and has the same number of rows and columns as the high-speed memory. When a row of storage unit groups in a column of the main memory is transferred to an empty storage unit group in the same column of the high-speed memory, the storage unit corresponding to the associative memory records the row number of the transferred storage unit group in the main memory.
When the central processor accesses the main memory, the hardware first automatically decodes the column number field of the access address in order to compare all the row numbers of the column of the associative memory with the row number field of the access main memory address: if there is The same, indicating that the main memory unit to be accessed is already in the high-speed memory, called a hit, the hardware will map the address of the main memory access to the address of the high-speed memory and perform the access operation; if they are not the same, indicating that the unit Not in the high-speed memory, which is called off-target. The hardware will perform the operation of accessing the main memory and automatically transfer the main memory cell group where the unit is located to the empty storage unit group in the same column of the high-speed memory, and at the same time put the group in the main memory. The line number in the memory is stored in the unit at the corresponding position of the Lenovo memory.
When there is an off-target and there is no empty position in the corresponding column of the high-speed memory, a group in the column is eliminated to make room for the newly transferred group. This is called replacement. The replacement rule is called replacement algorithm. Common replacement algorithms are: least recently used algorithm (LRU), first-in first-out method (FIFO) and random method (RAND). The replacement logic circuit performs this function. In addition, when performing the write main memory operation, in order to maintain the consistency of the contents of the main memory and the high-speed memory, hits and misses must be dealt with separately.
Main-auxiliary storage storage level Because the main storage capacity of the computer is always too small compared to the capacity required by the programmer, the transfer of programs and data from the auxiliary storage to the main storage is arranged by the programmer himself, and the programmer must spend a lot of money Energy and time divide the large program into blocks in advance, determine the location of these program blocks in the auxiliary memory and the address of the main memory, and also pre-arrange how and when each block is called in and out when the program runs, so there exists The allocation of storage space. The formation and development of the operating system make programmers get rid of the address positioning between the main and auxiliary storage as much as possible, and at the same time form the "auxiliary hardware" that supports these functions. Through the combination of software and hardware, the main storage and auxiliary storage are unified into A whole, as shown. At this time, a storage hierarchy is formed by the main storage and the auxiliary storage, that is, the storage system. As a whole, its speed is close to the speed of main storage, its capacity is close to the capacity of auxiliary storage, and the average price per person is also close to the average price of cheap and slow auxiliary storage. The continuous development and improvement of this system has gradually formed a virtual storage system that is now widely used. In the system, the application programmer can use the machine instruction address code to uniformly address the entire program, just as the programmer has all the virtual memory space corresponding to the width of this address code. This space can be much larger than the actual space of the main memory, so that the entire program can be saved. This instruction address code is called a virtual address (virtual memory address, virtual address) or logical address, and its corresponding storage capacity is called a virtual memory capacity or virtual memory space; and the actual main memory address is called a physical address, real ( Memory) address, the corresponding storage capacity is called main storage capacity, real storage capacity or real (primary) storage space.
CACHE-Main memory storage hierarchy
When the virtual address is used to access the main memory, the machine automatically converts it to the real address of the main memory via auxiliary software and hardware. Check whether the content of the unit corresponding to this address has been loaded into the main memory, and access it if it is in the main memory. If it is not in the main memory, then transfer the program and data from the auxiliary memory to the main memory via auxiliary software and hardware. , And then visit. These operations do not have to be arranged by the programmer, that is, it is transparent to the application staff. The main-auxiliary storage level solves the contradiction between the large capacity requirement of memory and low cost. In terms of speed, the main memory of the computer and the CPU have maintained a gap of about an order of magnitude. Obviously this gap limits the potential of CPU speed. In order to bridge this gap, a single memory using only one process is not feasible and must be further studied from the computer system structure and organization. Setting the cache memory (Cache) is an important method to solve the access speed. Cache memory is set between the CPU and main memory to form a cache-main memory hierarchy, which requires that the cache can keep up with the requirements of the CPU in speed. The address mapping and scheduling between the Cache-main storage draws on the technology of the main-auxiliary storage hierarchy that appeared earlier than it. The difference is that because of its high speed requirements, it is not completely implemented by hardware, not by a combination of software and hardware. as the picture shows.
Address mapping and conversion
Address mapping refers to the correspondence between the address of a certain data in memory and the address in the buffer. Here are three ways of address mapping.
1. Fully connected
Address mapping rules: any block of main memory can be mapped to any block in the cache
(1) Main memory and cache are divided into data blocks of the same size.
(2) A certain data block in the main memory can be loaded into any space of the cache. If the number of blocks in the cache is Cb and the number of blocks in the main memory is Mb, then there are Cb×Mb mapping relationships.
The directory table is stored in the associated (associated) memory, which includes three parts: the block address of the data block in the main memory, the block address after being stored in the cache, and the valid bit (also called the loading bit). Since it is a fully associative method, the capacity of the directory table should be the same as the number of cached blocks.
Advantages: The hit rate is relatively high, and the Cache storage space utilization rate is high.
Disadvantages: When accessing the relevant memory, it has to be compared with the entire content every time. The speed is low and the cost is high, so there are few applications.
2. Direct connection
Address mapping rules: A block in the main memory can only be mapped to a specific block in the cache.
(1) Main memory and cache are divided into data blocks of the same size.
(2) The main memory capacity should be an integer multiple of the cache capacity. The main memory space is divided into areas according to the cache capacity. The number of blocks in each area in the main memory is equal to the total number of cache blocks.
(3) When a block in a certain area in the main memory is stored in the cache, it can only be stored in the same position as the block number in the cache.
Data blocks with the same block number in each area in the main memory can be transferred to the same address in the buffer, but only one block in the area can be stored in the cache at the same time. Since the main and cache block numbers are the same, only the area number of the transferred block can be recorded when the directory is registered. The two fields of main, cache block number and address in the block are identical. The directory table is stored in the high-speed small-capacity memory, which includes two parts: the area number and valid bits of the data block in the main memory. The capacity of the directory table is the same as the number of cached blocks.
Advantages: The address mapping method is simple. During data access, you only need to check whether the area codes are equal, so you can get a faster access speed, and the hardware is simple.
Disadvantages: The replacement operation is frequent, and the hit rate is relatively low.
3. Group linked image mode
Group-linked mapping rules:
(1) Main memory and cache are divided into blocks of the same size.
(2) Main memory and cache are divided into groups according to the same size.
(3) The main memory capacity is an integer multiple of the cache capacity. The main memory space is divided into areas according to the size of the buffer. The number of groups in each area in the main memory is the same as the number of cache groups.
(4) When the data of the main memory is transferred to the cache, the group numbers of the main memory and the cache should be equal, that is, a block in each area can only be stored in the space of the same group number in the cache, but between the addresses of the blocks in the group It can be stored arbitrarily, that is, a direct mapping mode is used from the main storage group to the Cache group; a fully associative mapping mode is used within the two corresponding groups.
The conversion of the main memory address and the cache address has two parts. The group address is accessed by the direct mapping method and the address, while the block address is accessed by the content in the fully associative manner. Group-associated address translation components are also implemented using related memories.
Advantages: The collision probability of the block is relatively low, the utilization rate of the block is greatly improved, and the block failure rate is significantly reduced.
Disadvantages: The difficulty and cost of implementation are higher than the direct mapping method.
1. According to the locality of the program, we can know that the program is always using the instructions and data that have been used recently. This provides a theoretical basis for the replacement strategy. Based on various factors such as hit rate, difficulty of implementation, and speed of replacement, replacement strategies may include random methods, first-in first-out methods, and least recently used methods.
(1). Random method (RAND method)
The random method is to randomly determine the replacement memory block. Set a random number generator, and determine the replacement block according to the generated random number. This method is simple and easy to implement, but the hit rate is relatively low.
(2). First-in first-out method (FIFO method)
The first-in first-out method is to select the block that is transferred first to replace. When the block that is called first and hit multiple times is likely to be replaced first, it does not conform to the locality rule. The hit rate of this method is better than the random method, but it does not meet the requirements. The first-in first-out method is easy to implement,
(3). The least recently used method (LRU method)
The LRU method is based on the use of each block, always selecting the least recently used block to be replaced. This method better reflects the locality of the program. There are many ways to implement the LRU strategy.
2 In a multi-body parallel storage system, because the level of I/O device requests to main memory is higher than the CPU memory access, this phenomenon occurs that the CPU waits for the memory access of the I/O device, causing the CPU to wait for a period of time, even possible Waiting for a few main memory cycles reduces the CPU's efficiency. In order to avoid CPU and I/O devices competing for memory access, a level of cache can be added between the CPU and the main memory, so that the main memory can send the information that the CPU wants to the cache in advance, once the main exists and the I/O device During the exchange, the CPU can directly read the required information from the cache without having to wait for it to affect efficiency.
3 The currently proposed algorithms can be divided into the following three categories (the first category is the key to master):
(1) Traditional replacement algorithm and its direct evolution, its representative algorithms are: ① LRU (Least Recently Used) algorithm: replace the least recently used content out of Cache; ② LFU (Lease Frequently Used) algorithm: replace the least visited content Cache; ③ If all contents in Cache are cached on the same day, the largest document will be replaced by Cache, otherwise it will be replaced by LRU algorithm. ④ FIFO (First In First Out): follow the first-in first-out principle, if the current Cache is filled, then replace the earliest entered Cache.
(2) The replacement algorithm based on the key characteristics of cache content, its representative algorithms are: ① Size replacement algorithm: replace the largest content with Cache ② LRU-MIN replacement algorithm: This algorithm tries to minimize the number of documents to be replaced. Let the size of the document to be cached be S, and replace the document cached in the cache with a size of at least S according to the LRU algorithm; if there is no object with a size of at least S, follow the LRU algorithm from the document with a size of at least S/2 Replace; ③ LRU-Threshold replacement algorithm: consistent with the LRU algorithm, except that documents whose size exceeds a certain threshold cannot be cached; ④ Lowest Lacency First replacement algorithm: replace documents with the least access delay out of the Cache.
(3) Cost-based replacement algorithm. This type of algorithm uses a cost function to evaluate the objects in the Cache, and finally determines the replacement object according to the size of the replacement value. The representative algorithms are: ①Hybrid algorithm: the algorithm assigns a utility function to each object in the Cache, replacing the object with the smallest utility into the Cache; ②Lowest Relative Value algorithm: replacing the object with the lowest utility value into the Cache; ③Least Normalized Cost Replacement ( LCNR) algorithm: This algorithm uses an inference function about document access frequency, transmission time and size to determine the replacement document; ④ Bolot et al. proposed a weight inference function based on document transmission time cost, size, and last access time Determine the document replacement; ⑤ Size-Adjust LRU (SLRU) algorithm: sort the cached objects according to the ratio of cost to size, and select the object with the smallest ratio to replace.
In the development of computer technology, the access speed of main memory has been much slower than the operating speed of the central processor, so that the high-speed processing capacity of the central processor cannot be fully exerted, and the working efficiency of the entire computer system is affected. There are many methods that can be used to alleviate the speed mismatch between the central processor and the main memory, such as the use of multiple general-purpose registers, multi-bank interleaving, etc., and the use of cache memory at the storage level is also one of the commonly used methods. Many large and medium-sized computers and some recent minicomputers and microcomputers also use cache memory.
The capacity of the cache memory is generally only a few hundredths of the main memory, but its access speed can be matched with the central processor. According to the principle of program locality, there is a high possibility that those cells adjacent to a cell of the main memory being used will be used. Therefore, when the central processor accesses a unit of the main memory, the computer hardware automatically transfers the contents of the group of units including the unit into the cache memory, and the main memory unit to be accessed by the central processor is likely Just in the group of cells that have just been transferred to the cache. Therefore, the central processor can directly access the cache memory. Throughout the process, if the vast majority of CPU accesses to main memory can be replaced by access to cache memory, the processing speed of the computer system can be significantly increased.
When the CPU finds useful data in the cache, it is called a hit. When the data required by the CPU is not in the cache (this is called a miss), the CPU accesses the memory. In theory, in a CPU with a level 2 cache, the hit rate of reading L1Cache is 80%. In other words, the useful data found by the CPU from L1Cache accounts for 80% of the total data, and the remaining 20% is read from L2Cache. Since the data to be executed cannot be accurately predicted, the hit rate of reading L2 is also around 80% (reading useful data from L2 accounts for 16% of the total data). Then there is data that has to be called from memory, but this is already a fairly small ratio. In some high-end CPUs, we often hear L3Cache, which is designed for data that misses after reading L2Cache-a kind of Cache. In a CPU with L3Cache, only about 5% of the data needs to be called from memory, This further improves the efficiency of the CPU.
In order to ensure a high hit rate when the CPU accesses, the content in the cache should be replaced according to a certain algorithm. A more commonly used algorithm is the "least recently used algorithm" (LRU algorithm), which is to eliminate the least visited rows in the most recent period. Therefore, it is necessary to set a counter for each line. The LRU algorithm is to clear the counter of the hit line and increment the counter of the other lines. When replacement is needed, the data row with the largest row counter count value is eliminated. This is an efficient and scientific algorithm, and its counter clearing process can eliminate some data that is not needed after frequent calls out of the cache, and improve the utilization rate of the cache.
The impact of Cache's replacement algorithm on hit rate. When the new main memory block needs to be transferred into the Cache and its available space is filled again, the data of the Cache needs to be replaced, which creates a replacement strategy (algorithm) problem. According to the locality of the program, we can know that the program is always using the instructions and data that have been used recently. This provides a theoretical basis for the replacement strategy. The goal of the replacement algorithm is to make Cache get the highest hit rate. Cache replacement algorithm is an important factor that affects the performance of proxy cache system. A good Cache replacement algorithm can produce a high hit rate. Commonly used algorithms are as follows:
(1) Random method (RAND method) The random replacement algorithm is to use a random number generator to generate a block number to be replaced and replace the block. This algorithm is simple and easy to implement, and it does not consider the past, present and future of the Cache block However, it does not use the "historical information" used by the upper layer memory or the locality principle of memory access, so it cannot improve the hit rate of the cache, and the hit rate is low.
(2) First-in-first-out method (FIFO method) First-in-first-out (First-In-First-Out, FIFO) algorithm. It is to replace the information block that first enters the cache. The FIFO algorithm selects the earliest transferred blocks of Cache to replace in the order of elimination into the Cache, it does not need to record the usage of each block, it is easier to implement, the system overhead is small, and its disadvantage is that it may put some Program blocks that need to be frequently used (such as loop programs) are also replaced as the earliest blocks to enter the cache, and because they do not follow the locality principle of memory access, they cannot improve the hit rate of the cache. Because the earliest information may be used later, or often used, such as the cycle program. This method is simple and convenient, using the "historical information" of the main memory, but it cannot be said that the first entry is not used often. Its disadvantage is that it does not correctly reflect the principle of program locality, the hit rate is not high, and an abnormality may occur. phenomenon.
(3) Least Recently Used (LRU) The least recently used (LRU) algorithm. This method replaces the information blocks in the least recently used cache. This algorithm is better than the first-in first-out algorithm. But this method does not guarantee that it will not be used in the past or in the future. The LRU method is based on the usage of each block, and always selects the least recently used block to be replaced. Although this method better reflects the locality of the program, this replacement method needs to record the usage of each block in the Cache at any time to determine which block is the least recently used block. The LRU algorithm is relatively reasonable, but it is relatively complicated to implement and has a large system overhead. It is usually necessary to set up a hardware or software module called a counter for each block to record its usage.
FPGA Spartan-3A DSP Family 1.8M Gates 37440 Cells 667MHz 90nm Technology 1.2V 676-Pin FBGA
FPGA Spartan-3A DSP Family 3.4M Gates 53712 Cells 667MHz 90nm Technology 1.2V 484-Pin LCSBGA
FPGA Virtex-5 FXT Family 65nm Technology 1V 1738-Pin FCBGA
CPLD CoolRunner -II Family 12K Gates 512 Macro Cells 128MHz 0.18um Technology 1.8V 208-Pin PQFP