Unit VI: Pipelining
⭐Numerical Formula:
- Pipeline Speedup:
- Throughput:
- Efficiency:
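The formulas for the three items above appear to be missing from these notes. As a hedged reconstruction, the standard textbook forms for an ideal pipeline are sketched below. The symbols are assumptions, not taken from the original: k is the number of stages, n the number of tasks (instructions), and τ the clock period, with no stalls.

```latex
% Assumed symbols: k = number of stages, n = number of tasks, \tau = clock period
\text{Speedup: } S = \frac{T_{\text{non-pipelined}}}{T_{\text{pipelined}}}
                   = \frac{n\,k\,\tau}{(k + n - 1)\,\tau}
                   = \frac{n\,k}{k + n - 1}

\text{Throughput} = \frac{n}{(k + n - 1)\,\tau}

\text{Efficiency} = \frac{S}{k} = \frac{n}{k + n - 1}
```

For example, with k = 4 and n = 100, S = 400/103 ≈ 3.88 and the efficiency is 100/103 ≈ 0.97. As n grows, S approaches k.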
⭐Advanced Caches-I
1. Cache Pipelining
- Definition: A method of breaking down cache operations into smaller, sequential stages, allowing multiple operations to be processed simultaneously.
- Stages:
- Tag Check: Checks if the requested data is in the cache by comparing tags.
- Data Access: Reads or writes data in cache memory if the tag check is successful.
- Write-back (if applicable): Writes data back to lower memory levels if there’s a modified block in the cache.
- Advantages:
- Higher Throughput: Multiple memory operations can be processed at different stages, reducing delays.
- Reduced Effective Access Time: Overlapping stages does not shorten any single access, but it hides latency across back-to-back accesses and lets the cache keep up with a higher clock rate.
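The effect of overlapping the stages listed above can be sketched with a simple cycle count. This is a minimal model, assuming one cycle per stage and no hazards or misses; the function names are hypothetical.

```python
# Hypothetical 3-stage cache pipeline, one cycle per stage (an assumption).
STAGES = ["tag_check", "data_access", "write_back"]

def unpipelined_cycles(num_accesses, num_stages=len(STAGES)):
    """Each access finishes all stages before the next one starts."""
    return num_accesses * num_stages

def pipelined_cycles(num_accesses, num_stages=len(STAGES)):
    """A new access enters every cycle once the pipeline is full."""
    if num_accesses == 0:
        return 0
    return num_stages + num_accesses - 1

def occupancy_table(num_accesses, num_stages=len(STAGES)):
    """Return, per cycle, which access (by index) occupies each stage."""
    table = []
    for cycle in range(pipelined_cycles(num_accesses, num_stages)):
        row = []
        for stage in range(num_stages):
            access = cycle - stage
            row.append(access if 0 <= access < num_accesses else None)
        table.append(row)
    return table
```

In this model, 10 accesses take 30 cycles unpipelined and 12 cycles pipelined, while each individual access still spends 3 cycles in the cache.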
2. Write Buffers
- Definition: A small, fast memory area that holds data waiting to be written to main memory or a lower-level cache.
- Purpose: Allows the CPU to continue processing without stalling while waiting for write operations to complete.
- Operation:
- When the CPU writes to memory, the data is temporarily stored in the write buffer.
- The write buffer manages these pending writes and eventually writes the data to the main memory when it’s not busy.
- Benefits:
- Reduces Write Latency: CPU can perform other operations while write requests are handled in the background.
- Prevents Stalls: Reduces stalls in pipelines by allowing reads and writes to proceed concurrently.
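The operation described above can be sketched as a small FIFO in front of a slower memory. This is a toy model, not a hardware description: the class name, the capacity, and the rule that reads check pending writes first are assumptions added so the example behaves correctly.

```python
from collections import deque

class WriteBuffer:
    """Toy FIFO write buffer in front of a slower memory (a dict here)."""

    def __init__(self, memory, capacity=4):
        self.memory = memory          # backing store: address -> value
        self.capacity = capacity      # hypothetical buffer depth
        self.pending = deque()        # (address, value) pairs, oldest first

    def write(self, address, value):
        """Queue a write; return False if the CPU would have to stall."""
        if len(self.pending) >= self.capacity:
            return False              # buffer full: the write cannot be accepted yet
        self.pending.append((address, value))
        return True

    def read(self, address):
        """Return the newest value, checking pending writes before memory."""
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value
        return self.memory.get(address)

    def drain_one(self):
        """Retire the oldest pending write when memory is idle."""
        if self.pending:
            address, value = self.pending.popleft()
            self.memory[address] = value
```

A write returns immediately as long as the buffer has room, and memory is only updated when `drain_one` runs, which models the background write.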
3. Multilevel Caches
- Definition: A hierarchical structure of caches, typically including multiple levels (L1, L2, and sometimes L3) with varying speeds and sizes.
- Levels:
- L1 Cache: Fastest, smallest, and closest to the CPU; designed for the highest speed and lowest latency.
- L2 Cache: Larger than L1, with somewhat longer latency; it serves requests that miss in L1.
- L3 Cache: Largest and slowest cache level (if present), typically shared across cores in multi-core systems.
- Benefits:
- Reduced Access Time: Most data requests are fulfilled by the faster caches, reducing the need to access slower main memory.
- Improved Hit Rate: By using multiple levels, the system can keep more frequently accessed data close to the CPU, improving performance.
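The access-time benefit of a second cache level is commonly quantified with the average memory access time (AMAT). The sketch below uses the standard two-level formula; all the numbers are hypothetical cycle counts chosen only for illustration.

```python
def amat_single_level(l1_hit_time, l1_miss_rate, memory_time):
    """AMAT with only L1: every L1 miss goes to main memory."""
    return l1_hit_time + l1_miss_rate * memory_time

def amat_two_level(l1_hit_time, l1_miss_rate, l2_hit_time, l2_local_miss_rate, memory_time):
    """AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 local miss rate * memory time)."""
    l1_miss_penalty = l2_hit_time + l2_local_miss_rate * memory_time
    return l1_hit_time + l1_miss_rate * l1_miss_penalty

# Hypothetical values in cycles: L1 hit 1, L1 miss rate 10%,
# L2 hit 10, L2 local miss rate 20%, main memory 100.
without_l2 = amat_single_level(1, 0.10, 100)
with_l2 = amat_two_level(1, 0.10, 10, 0.20, 100)
```

With these assumed numbers, the AMAT drops from about 11 cycles to about 4 cycles once L2 absorbs most L1 misses.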
4. Victim Caches
- Definition: A small, fully associative cache that holds recently evicted cache lines from a higher-level cache (typically L1).
- Purpose: Reduces conflict misses by storing data recently evicted from the main cache.
- How It Works:
- When data is evicted from L1, it is stored in the victim cache.
- If a future request matches data in the victim cache, it’s retrieved from there instead of accessing the slower main memory or L2.
- Benefits:
- Reduces Miss Penalty: Lines that are accessed again soon after eviction are served from the victim cache instead of L2 or main memory, so those misses cost far less.
- Efficient for Direct-Mapped Caches: Particularly beneficial in direct-mapped caches where conflict misses are common due to limited associativity.
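The behaviour described above can be sketched with a toy direct-mapped cache backed by a small victim cache. The class name, the FIFO replacement in the victim cache, and the use of block addresses instead of byte addresses are assumptions made for this illustration.

```python
from collections import OrderedDict

class DirectMappedWithVictim:
    """Direct-mapped cache of block addresses plus a small fully associative victim cache."""

    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.lines = [None] * num_sets       # one block address per set
        self.victim_entries = victim_entries
        self.victims = OrderedDict()         # evicted blocks, oldest first (FIFO, an assumption)
        self.hits = self.victim_hits = self.misses = 0

    def _evict_to_victim(self, block):
        if block is None or self.victim_entries == 0:
            return
        self.victims[block] = True
        if len(self.victims) > self.victim_entries:
            self.victims.popitem(last=False)  # drop the oldest victim

    def access(self, block):
        index = block % self.num_sets
        if self.lines[index] == block:
            self.hits += 1
            return "hit"
        evicted = self.lines[index]
        self.lines[index] = block
        if block in self.victims:
            del self.victims[block]           # swap: block moves back into the main cache
            self._evict_to_victim(evicted)
            self.victim_hits += 1
            return "victim_hit"
        self._evict_to_victim(evicted)
        self.misses += 1                      # must go to L2 or main memory
        return "miss"
```

For example, blocks 0 and 8 map to the same set when `num_sets=8`. Alternating between them always misses without a victim cache, while with two victim entries only the first two accesses go to the next level.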
5. Prefetching
- Definition: A technique to fetch data into cache before it’s actually requested by the CPU, anticipating future accesses.
- Types:
- Hardware Prefetching: Managed by the CPU hardware, which predicts and fetches future data based on access patterns.
- Software Prefetching: Compiler inserts prefetch instructions, typically based on predictable data access patterns in the code.
- Methods:
- Sequential Prefetching: Fetches the next block in sequence if it predicts the CPU will need it soon.
- Stride Prefetching: Recognizes and prefetches data with regular patterns (e.g., array accesses with a fixed interval).
- Advantages:
- Reduces Cache Misses: By pre-loading data, the cache is more likely to contain required data when it’s accessed.
- Improves Performance: Minimizes idle time by ensuring the CPU has necessary data available in advance.
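Stride prefetching as described above can be sketched with a detector that remembers the last address and the last stride. This is a simplified model: it assumes an unbounded cache, assumes every prefetch arrives before it is needed, and uses hypothetical names throughout.

```python
class StridePrefetcher:
    """Toy stride detector: prefetch address + stride once the same stride repeats."""

    def __init__(self):
        self.last_address = None
        self.last_stride = None

    def observe(self, address):
        """Record a demand access; return an address to prefetch, or None."""
        prefetch = None
        if self.last_address is not None:
            stride = address - self.last_address
            if stride != 0 and stride == self.last_stride:
                prefetch = address + stride   # same stride seen twice in a row
            self.last_stride = stride
        self.last_address = address
        return prefetch

def count_misses(addresses, use_prefetcher):
    """Count demand misses in an unbounded cache of addresses (no evictions)."""
    cached, misses = set(), 0
    prefetcher = StridePrefetcher()
    for address in addresses:
        if address not in cached:
            misses += 1
            cached.add(address)
        if use_prefetcher:
            target = prefetcher.observe(address)
            if target is not None:
                cached.add(target)
    return misses
```

For ten accesses at a fixed stride of 64, this model misses on all ten without prefetching and on only the first three with it, since the stride must repeat once before it is trusted. Sequential prefetching is the special case where the stride is one block.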
⭐Advanced Caches-II
1. Software Memory Optimization
- Definition: Techniques used to improve the efficiency of memory usage and data access patterns in programs, minimizing cache misses and enhancing overall performance.
- Purpose: To maximize the use of cache by organizing code and data to better fit cache memory, reducing the need for slow main memory access.
- Techniques:
- Loop Tiling (Blocking):
- Divides large data structures into smaller blocks to increase cache reuse within each block.
- Commonly used for matrix operations to fit submatrices into the cache.
- Data Layout Optimization:
- Structuring Arrays: Arranges data in a way that matches how it will be accessed, ensuring cache lines are fully utilized.
- Array of Structures (AoS) vs. Structure of Arrays (SoA): Choose data layouts that better match access patterns, especially in SIMD operations.
- Loop Unrolling:
- Expands the loop to reduce loop control overhead and enable better pipelining and cache usage.
- Example: If accessing an array in a loop, unrolling handles several array elements per iteration, which cuts branch overhead and gives the compiler and hardware more independent operations to schedule.
- Memory Access Reordering:
- Reorders computations to access data in a sequential pattern, reducing cache misses.
- Example: Accessing arrays row-wise instead of column-wise to exploit spatial locality in row-major memory layouts.
- Prefetching (Software-Controlled):
- Inserting prefetch instructions manually to bring data into the cache before it is needed.
- Helps avoid cache misses when data access patterns are predictable.
- Minimizing Cache Interference:
- Organizing data to avoid multiple pieces of frequently used data mapping to the same cache line, reducing conflict misses.
- Example: By padding arrays or adjusting data placement, you can reduce conflicts in direct-mapped or set-associative caches.
- Benefits:
- Improved Cache Hit Rate: Maximizes data usage in the cache, minimizing costly memory accesses.
- Faster Execution: Reduces the number of cache misses, allowing the CPU to access data faster.
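Loop tiling from the techniques above can be sketched on matrix multiplication. The sketch only shows the loop structure: plain Python lists will not reveal the cache benefit, and the tile size of 2 is a hypothetical value that would normally be chosen so the submatrices fit in cache.

```python
def matmul_naive(a, b, n):
    """Reference triple loop over n x n matrices stored as lists of lists."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_tiled(a, b, n, tile=2):
    """Same arithmetic, iterated tile by tile so each submatrix is reused while it is resident."""
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # Work on one tile; min() handles n not divisible by tile.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        total = c[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            total += a[i][k] * b[k][j]
                        c[i][j] = total
    return c
```

Both functions compute the same product; only the order in which elements are visited changes, which is what improves reuse on real hardware.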
2. Nonblocking Caches
- Definition: A type of cache that allows multiple cache requests to be processed simultaneously without blocking or stalling other requests.
- Purpose: To allow the CPU to continue executing instructions while cache misses are being handled, rather than waiting for each request to complete before moving on.
- Operation:
- When a cache miss occurs, the CPU can proceed with other instructions that do not depend on the missed data.
- Multiple cache misses can be handled concurrently, reducing delays and improving throughput.
- Key Features:
- Miss Status Holding Registers (MSHRs):
- Holds information about outstanding cache misses, allowing the cache to track and manage multiple misses at once.
- Multiple Miss Handling:
- Supports concurrent requests for data from memory, allowing non-blocking behavior during cache misses.
- Helps in handling long-latency memory operations without stalling the pipeline.
- Out-of-Order Execution Compatibility:
- Complements out-of-order processors by allowing cache misses to be handled in parallel with other independent instructions.
- Types:
- Fully Nonblocking Cache: Allows any number of concurrent misses, though it’s complex and requires more resources.
- Partially Nonblocking Cache: Limits the number of concurrent misses it can handle, balancing complexity and performance.
- Benefits:
- Reduces Cache Miss Penalty: Decreases waiting time for the CPU during cache misses, allowing it to perform other tasks.
- Increases Throughput: By keeping the pipeline filled with useful instructions, overall execution speed is improved, especially in memory-intensive applications.
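The role of MSHRs in a partially nonblocking cache can be sketched as a small table of outstanding misses. The class and the return labels ("primary", "secondary", "stall") are hypothetical names for this illustration, not a description of any specific processor.

```python
class MSHRFile:
    """Toy miss status holding registers: one entry per outstanding block miss."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = {}             # block address -> list of waiting request ids

    def handle_miss(self, block, request_id):
        """Return 'primary', 'secondary', or 'stall' for a missing block."""
        if block in self.entries:
            self.entries[block].append(request_id)   # merge with the in-flight miss
            return "secondary"
        if len(self.entries) >= self.num_entries:
            return "stall"                            # no free MSHR: the cache must block
        self.entries[block] = [request_id]            # new request is sent to memory
        return "primary"

    def fill(self, block):
        """Memory returned the block: free the entry and return its waiting requests."""
        return self.entries.pop(block, [])
```

With two entries, two different blocks can miss concurrently and a repeated miss to the same block is merged; a third distinct block has to wait until `fill` frees an entry.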
⭐Vector Processors and GPUs
1. Introduction
Vector Processors:
- Definition: Processors that can operate on entire arrays (vectors) of data with a single instruction.
- Purpose: Designed to handle data in parallel, making them ideal for tasks like scientific computations, graphics, and signal processing.
- How They Work:
- Instead of processing one data element at a time, they apply the same operation to multiple elements simultaneously.
- Example: Adding two arrays element-by-element in a single operation.
GPUs (Graphics Processing Units):
- Definition: Highly parallel processors originally designed to handle graphics and image processing tasks, now used in a wide range of applications.
- Purpose: Optimized for data-parallel operations, GPUs can process thousands of threads at once, making them suitable for AI, gaming, scientific computing, and more.
- Architecture:
- Consists of many small, efficient cores that can perform computations in parallel.
- Each core is simpler than a CPU core but optimized for parallel workloads, executing the same instruction across multiple data points.
2. Hardware Optimization
Vector Processor Hardware Optimization:
- Multiple Functional Units:
- Vector processors contain multiple ALUs (Arithmetic Logic Units) that allow simultaneous execution of operations on multiple data elements.
- Vector Registers:
- Large registers that can hold entire vectors, reducing the need for frequent memory access.
- Memory Bandwidth Optimization:
- High memory bandwidth is essential to keep the data flow to and from the processor fast enough to support vector operations.
GPU Hardware Optimization:
- SIMD (Single Instruction, Multiple Data):
- Executes the same instruction across multiple data points, making GPUs ideal for parallel tasks.
- Streaming Multiprocessors (SMs):
- Each SM can execute multiple threads concurrently, organized into groups called warps.
- High Memory Bandwidth:
- GPUs have high memory bandwidth to handle the large data requirements of parallel processing.
- Texture and Shared Memory:
- Specialized memory types to optimize data access patterns, often used in image processing.
Benefits of Hardware Optimization:
- Increased Parallelism: Executes more operations in parallel, improving speed for large-scale data tasks.
- Reduced Latency: By reducing memory accesses and optimizing data flow, latency is minimized.
- Energy Efficiency: Processing multiple data points simultaneously can reduce energy consumption per task.
3. Vector Software and Compiler Optimization
Vector Software Optimization:
- Loop Vectorization:
- Converts scalar loops (operating on one data element at a time) into vector operations to leverage hardware parallelism.
- Example: A loop that adds two arrays can be rewritten to add entire segments at once.
- Data Structure Optimization:
- Arranges data to ensure efficient access by vector processors or GPUs.
- Using contiguous memory layouts for arrays allows faster, sequential access.
- Memory Alignment:
- Ensures data is aligned in memory for optimal access by vector registers, minimizing alignment-related penalties.
- Prefetching Data:
- Preloading data into cache or registers before it’s needed helps maintain continuous data flow and prevents stalls.
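Loop vectorization as described above can be sketched by processing an array in fixed-size strips. Python has no vector instructions, so `vector_add` merely stands in for one; `VECTOR_LENGTH` and the function names are assumptions for this illustration.

```python
VECTOR_LENGTH = 4  # hypothetical number of elements per vector register

def vector_add(chunk_a, chunk_b):
    """Stands in for one vector add instruction over up to VECTOR_LENGTH elements."""
    return [x + y for x, y in zip(chunk_a, chunk_b)]

def add_scalar(a, b):
    """Scalar loop: one element per iteration."""
    return [a[i] + b[i] for i in range(len(a))]

def add_strip_mined(a, b):
    """Vectorized form: VECTOR_LENGTH elements per step; the last strip may be shorter."""
    result, vector_ops = [], 0
    for start in range(0, len(a), VECTOR_LENGTH):
        end = start + VECTOR_LENGTH
        result.extend(vector_add(a[start:end], b[start:end]))
        vector_ops += 1
    return result, vector_ops
```

For 10 elements, the scalar loop runs 10 iterations while the strip-mined version issues 3 vector operations (4 + 4 + 2 elements).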
Compiler Optimization for Vector Processors and GPUs:
- Automatic Vectorization:
- Some compilers can automatically transform code to use vector instructions where possible, identifying parallelizable parts of the code.
- Example: Compilers like GCC and Intel compilers can detect and apply vectorization to compatible loops.
- SIMD Instructions:
- Compilers optimize by using SIMD-specific instructions (e.g., AVX, SSE) to handle multiple data elements with a single instruction.
- Loop Unrolling and Loop Fusion:
- Loop Unrolling: Expands the loop body so that each pass performs the work of several original iterations, reducing loop overhead.
- Loop Fusion: Merges two adjacent loops that operate on the same data, enhancing data locality and cache efficiency.
- Memory Coalescing (for GPUs):
- Organizes memory accesses to ensure adjacent threads access contiguous memory locations, reducing the number of memory transactions.
- Thread Scheduling (for GPUs):
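- On GPUs, the hardware scheduler decides which group of threads runs next; compilers help by ordering instructions and managing register use so that more threads can stay active, maximizing parallel efficiency and minimizing idle time.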
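Memory coalescing from the list above can be sketched by counting how many aligned memory segments one group of threads touches. The 128-byte segment size, the group of 32 threads, and the 4-byte element size are illustrative assumptions, not figures from these notes.

```python
def memory_transactions(addresses, segment_bytes=128):
    """Count distinct aligned segments touched by one group of threads."""
    return len({address // segment_bytes for address in addresses})

WARP_SIZE = 32      # threads per group (assumed for this sketch)
ELEMENT_BYTES = 4   # e.g. one 32-bit value per thread

# Thread t reads element t: adjacent threads touch adjacent addresses.
coalesced = [t * ELEMENT_BYTES for t in range(WARP_SIZE)]

# Thread t reads element 32 * t: each thread lands in a different segment.
strided = [t * 32 * ELEMENT_BYTES for t in range(WARP_SIZE)]
```

Under these assumptions, the coalesced pattern needs a single transaction while the strided pattern needs 32, one per thread.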
Benefits of Software and Compiler Optimization:
- Improved Performance: Maximizes the use of vector and GPU hardware capabilities, leading to faster execution.
- Efficient Memory Usage: Reduces cache misses and improves data locality.
- Automatic Optimizations: Reduces the need for manual tuning, making it easier for developers to take advantage of vector and GPU processing.
⭐Multithreading
1. SIMD (Single Instruction, Multiple Data)
- Definition: A parallel processing technique where a single instruction is executed on multiple data points simultaneously.
- Purpose: Used to perform the same operation across multiple data elements, which is ideal for tasks with high data parallelism, such as image processing, scientific calculations, and matrix operations.
- How It Works:
- Single Instruction: Only one instruction is issued by the processor.
- Multiple Data Streams: The same instruction operates on multiple data elements (e.g., adding two arrays of numbers element-by-element).
- Example:
- If we have two arrays, SIMD can add each corresponding element across both arrays in a single operation rather than looping through each pair of elements.
- Applications:
- Common in multimedia processing, such as video playback, gaming, and image filtering.
- Used in AI and machine learning for vectorized computations.
- Benefits:
- Increased Throughput: Processes large datasets faster by handling multiple data points per instruction.
- Energy Efficiency: Reduces the number of required instructions, saving power.
- Limitations:
- Best suited for tasks with identical operations on each data point; less effective for tasks requiring different operations.
2. GPUs (Graphics Processing Units)
- Definition: Highly parallel processors initially designed for graphics rendering but now widely used for general-purpose computing (GPGPU).
- Purpose: Optimized for tasks that can be split into many smaller, parallel subtasks, such as rendering images, simulations, and deep learning computations.
- Architecture:
- Many Cores: Contains thousands of small, efficient cores for high data-parallel processing.
- Streaming Multiprocessors (SMs): Groups of cores in a GPU that can execute multiple threads in parallel.
- Memory Types:
- Global Memory: Large but slower; accessible by all threads.
- Shared Memory: Fast memory shared by threads in the same block, useful for reducing data access times.
- Registers: Fast, local storage within each core.
- Working Principle:
- A GPU executes thousands of threads in parallel, with each core performing the same operation across different data points.
- GPUs are particularly efficient at SIMD operations, executing a single instruction across large sets of data.
- Applications:
- Graphics rendering, image processing, scientific simulations, cryptocurrency mining, AI and machine learning.
- Benefits:
- Massive Parallelism: Executes a high number of threads simultaneously, providing much faster computation for parallel tasks.
- Efficient for High Data Parallelism: Handles tasks where the same operation must be applied to large datasets.
- Limitations:
- Less suited for tasks with high sequential dependencies.
- Not as flexible for complex branching logic, which CPUs handle more efficiently.
3. Coarse-Grained Multithreading
- Definition: A multithreading technique where a processor switches between threads only on costly events, such as cache misses or long memory access delays.
- Purpose: To reduce idle time by allowing the processor to work on a different thread when the current thread is stalled.
- How It Works:
- Unlike fine-grained multithreading (where switches occur every cycle), coarse-grained multithreading only switches when a thread encounters a long latency event.
- The processor begins executing a new thread while the first thread is waiting for its stalled operation (e.g., memory fetch) to complete.
- Benefits:
- Latency Hiding: Overlaps a long-latency event in one thread with useful work from another, minimizing processor idle time.
- Increased Throughput: Keeps the processor busy, potentially improving overall system performance.
- Example:
- In a web server, if one thread is waiting for data from the database, the processor can switch to another thread handling a different request, thus improving response times.
- Limitations:
- Context Switching Overhead: Switching between threads involves some overhead and can reduce efficiency if switching occurs too frequently.
- Less Responsive Than Fine-Grained Multithreading: Because thread switching only occurs on long-latency events, coarse-grained multithreading is less responsive to quickly changing workloads.
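The switch-on-miss behaviour described above can be sketched with a small cycle-count simulation. The model is an assumption made for illustration: each thread is a list of ops, "c" is one cycle of compute, "m" is a one-cycle access that then stalls its thread for `miss_latency` cycles, and `switch_penalty` models the switching overhead listed under Limitations.

```python
def blocking_cycles(threads, miss_latency):
    """Run threads one after another, stalling for the full latency on every miss."""
    return sum(len(ops) + miss_latency * ops.count("m") for ops in threads)

def coarse_grained_cycles(threads, miss_latency, switch_penalty):
    """Stay on one thread until it misses, then switch to another ready thread."""
    pcs = [0] * len(threads)          # next op index per thread
    ready_at = [0] * len(threads)     # cycle at which each thread's miss resolves
    cycle, current = 0, 0
    while True:
        unfinished = [i for i in range(len(threads)) if pcs[i] < len(threads[i])]
        if not unfinished:
            break
        runnable = [i for i in unfinished if ready_at[i] <= cycle]
        if not runnable:
            cycle = min(ready_at[i] for i in unfinished)   # every thread is stalled: idle
            continue
        if current not in runnable:
            current = runnable[0]
            cycle += switch_penalty                        # cost of changing threads
        op = threads[current][pcs[current]]
        pcs[current] += 1
        cycle += 1                                         # one cycle to issue the op
        if op == "m":
            ready_at[current] = cycle + miss_latency       # thread waits; the core moves on
    return max([cycle] + ready_at)                         # include any miss still in flight
```

For two threads that each run `["c", "m", "c"]` with a 10-cycle miss and a 1-cycle switch penalty, this model takes 26 cycles when the processor simply stalls and 17 cycles with coarse-grained switching, because part of each miss overlaps with the other thread's work.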
⭐Parallel Programming-I
1. Introduction to Parallel Programming
- Definition: Parallel programming is a programming technique where multiple processes or threads execute simultaneously, working together to solve a problem faster than sequential processing.
- Purpose: Aims to improve performance by dividing tasks among multiple processors or cores, allowing computations to be done in parallel.
- Benefits:
- Increased Performance: Reduces the time required to complete a task by running multiple operations at once.
- Efficient Resource Utilization: Takes full advantage of multi-core processors.
- Scalability: Can handle larger problems as additional cores become available.
- Challenges:
- Data Synchronization: Ensuring that multiple threads have consistent access to shared data.
- Concurrency Issues: Handling scenarios where multiple threads need to access the same resources without conflicts.
- Programming Complexity: Writing and debugging parallel programs is generally more complex than writing sequential programs.
2. Sequential Consistency
- Definition: A consistency model in parallel programming where the result of executing a series of operations is as if the operations were executed in some sequential order, even if they are actually executed in parallel.
- Explanation:
- Sequential consistency ensures that all threads in a program see memory operations in the same order.
- Even if threads run in parallel, they appear to follow a single global order for shared memory operations.
- Importance:
- Predictability: Makes parallel programs easier to reason about, as it guarantees an order of operations visible to all threads.
- Debugging: Simplifies debugging by ensuring a consistent view of operations.
- Example:
- Suppose Thread A writes a value to a variable, and Thread B reads from that variable. In a sequentially consistent system, Thread B will either see the old or new value of the variable, depending on the global order of operations, but all threads agree on this order.
- Limitations:
- Performance Impact: Enforcing sequential consistency can limit performance as it restricts some optimizations that would allow greater flexibility in reordering operations.
- Less Common in Modern Systems: Many modern architectures use weaker consistency models for better performance, requiring programmers to enforce consistency where needed.
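Sequential consistency can be made concrete by enumerating every interleaving of two small threads that preserves each thread's program order. The two-thread program below is a common litmus test chosen for illustration; it is not from these notes. Thread A stores x = 1 and then loads y into r1, while Thread B stores y = 1 and then loads x into r2.

```python
def interleavings(first, second):
    """Yield every merge of two op lists that keeps each list's own order."""
    if not first:
        yield list(second)
        return
    if not second:
        yield list(first)
        return
    for rest in interleavings(first[1:], second):
        yield [first[0]] + rest
    for rest in interleavings(first, second[1:]):
        yield [second[0]] + rest

# Each op is (kind, variable, value_or_register).
THREAD_A = [("store", "x", 1), ("load", "y", "r1")]
THREAD_B = [("store", "y", 1), ("load", "x", "r2")]

def sc_outcomes():
    """Collect (r1, r2) for every sequentially consistent execution."""
    outcomes = set()
    for order in interleavings(THREAD_A, THREAD_B):
        memory = {"x": 0, "y": 0}
        registers = {}
        for kind, variable, operand in order:
            if kind == "store":
                memory[variable] = operand
            else:
                registers[operand] = memory[variable]
        outcomes.add((registers["r1"], registers["r2"]))
    return outcomes
```

Under sequential consistency the outcome r1 = 0 and r2 = 0 never appears, because some store must come first in the single global order. Weaker models that let a load pass an earlier store can produce it, which is why programmers must add ordering where needed on such systems.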
3. Locks
- Definition: A synchronization mechanism used in parallel programming to ensure that only one thread or process can access a critical section (a portion of code that accesses shared resources) at a time.
- Purpose: Prevents data races and ensures that shared data remains consistent by allowing exclusive access to critical sections.
- Types of Locks:
- Mutex (Mutual Exclusion):
- The most common lock type, ensuring only one thread can access a critical section at a time.
- When a thread locks a mutex, other threads trying to lock it are blocked until it’s unlocked.
- Spinlock:
- A lightweight lock where a thread continuously checks if the lock is available.
- Used in situations where waiting times are expected to be very short, as it avoids the overhead of putting threads to sleep.
- Read-Write Lock:
- Allows multiple threads to read data simultaneously, but only one thread to write at a time.
- Useful for scenarios where reads are more frequent than writes, optimizing access times.
- How Locks Work:
- A lock is acquired before entering a critical section and released afterward.
- If another thread tries to acquire the lock while it’s held, it will either wait or, in the case of a spinlock, keep checking until the lock is available.
- Applications:
- Ensuring consistent data when multiple threads access shared resources (e.g., database, shared memory).
- Preventing race conditions, where two or more threads attempt to modify data at the same time.
- Challenges:
- Deadlock: Occurs when two or more threads wait indefinitely for each other to release locks, causing the program to halt.
- Priority Inversion: A situation where a lower-priority thread holds a lock, blocking a higher-priority thread from executing.
- Performance Overhead: Locks can slow down program execution, as threads must wait to access shared resources.
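The acquire-then-release pattern described above can be sketched with Python's `threading.Lock`, which behaves as a mutex. The shared counter and the function names are hypothetical.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def add_many(times):
    """Increment the shared counter, holding the lock for each update."""
    global counter
    for _ in range(times):
        with counter_lock:        # acquire before the critical section, release after
            counter += 1

def run_workers(num_threads=4, times=10_000):
    """Start several threads that all update the same counter, then wait for them."""
    global counter
    counter = 0
    workers = [threading.Thread(target=add_many, args=(times,)) for _ in range(num_threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter
```

Because every read-modify-write of `counter` happens while the lock is held, the final value equals `num_threads * times` regardless of how the threads are scheduled.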
⭐Parallel Programming-II
1. Atomic Operations
- Definition: Operations that are completed in a single step without interruption, meaning they are indivisible and cannot be broken down.
- Purpose: Ensure that a single operation on a shared resource is completed entirely without interference from other threads, preventing race conditions.
- Examples of Atomic Operations:
- Incrementing a Counter: An atomic increment operation ensures that only one thread can update the counter at a time.
- Compare-and-Swap (CAS): Compares the current value of a variable with a specified value, and if they match, swaps it with a new value.
- Importance:
- Thread Safety: Prevents data inconsistencies by ensuring that no other thread can access the data mid-operation.
- Efficiency: Atomic operations are typically faster than locks for simple updates, since they avoid blocking and context switches.
- Limitations:
- Limited Scope: Atomic operations are only applicable for simple tasks (like incrementing a variable) and may not be sufficient for complex operations that require multiple steps.
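Compare-and-swap and an atomic increment built on it can be sketched as follows. Python has no hardware CAS primitive in its standard library, so this `AtomicInt` class is an emulation: the internal lock only stands in for the indivisibility that a real CAS instruction provides.

```python
import threading

class AtomicInt:
    """Emulated atomic integer; the internal lock stands in for hardware atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the current value equals `expected`; report success."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def atomic_increment(cell):
    """Retry loop: re-read and try again if another thread changed the value first."""
    while True:
        current = cell.load()
        if cell.compare_and_swap(current, current + 1):
            return current + 1
```

The retry loop is the usual shape of CAS-based code: a failed swap means another thread got there first, so the operation simply starts over with the fresh value.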
2. Memory Fences (Barriers)
- Definition: Instructions that enforce ordering constraints on memory operations, ensuring that certain operations are completed before others begin.
- Purpose: Helps maintain memory consistency across threads by controlling the order in which operations appear to execute, especially on systems with relaxed memory models.
- Types of Memory Fences:
- Load Fence: Ensures all load (read) operations before the fence complete before any subsequent loads.
- Store Fence: Ensures all store (write) operations before the fence complete before any subsequent stores.
- Full Fence: Ensures that all loads and stores before the fence complete before any loads or stores after it.
- Usage in Multithreading:
- Memory fences are essential in systems with multiple processors or cores, where instructions may be reordered for optimization.
- They prevent threads from seeing inconsistent states due to instruction reordering by enforcing a strict order on specific operations.
- Example:
- In a producer-consumer setup, a memory fence can ensure that data written by the producer is visible to the consumer before it accesses it.
- Limitations:
- Performance Cost: Memory fences can slow down program execution by enforcing stricter ordering, reducing optimization flexibility.
- Complexity: Adding memory fences correctly requires a deep understanding of the system’s memory model, which can be complex for developers.
3. Locks
- Definition: A synchronization mechanism that allows only one thread to access a resource or critical section at a time.
- Purpose: Prevents multiple threads from accessing shared resources simultaneously, ensuring data consistency.
- Common Types of Locks:
- Mutex (Mutual Exclusion): A standard lock that only allows one thread to execute in a critical section at a time.
- Spinlock: A lock where a thread continually checks if the lock is available, ideal for short wait times.
- Read-Write Lock:
- Allows multiple threads to read concurrently but restricts write access to only one thread at a time.
- Useful in scenarios with many reads and few writes.
- How Locks Work:
- A thread must acquire a lock before entering a critical section and release it after exiting.
- If another thread tries to acquire the lock while it’s held, it must wait until the lock is released.
- Issues with Locks:
- Deadlock: Occurs when two or more threads wait on each other indefinitely to release locks, causing a standstill.
- Priority Inversion: When a low-priority thread holds a lock needed by a high-priority thread, leading to delays.
- Performance Overhead: Frequent use of locks can reduce performance, as threads spend time waiting rather than executing.
4. Semaphores
- Definition: A synchronization tool that uses a counter to control access to a resource, allowing a specified number of threads to access the resource simultaneously.
- Types of Semaphores:
- Binary Semaphore: A semaphore with a value of either 0 or 1, similar to a mutex, allowing only one thread access at a time.
- Counting Semaphore: Allows multiple threads to access a resource up to a specified limit, as defined by the semaphore’s count.
- How Semaphores Work:
- Wait (P Operation): If the count is greater than zero, decreases it by 1 and proceeds. If the count is zero, the thread is blocked until another thread increments the count.
- Signal (V Operation): Increases the semaphore count by 1, allowing a waiting thread to access the resource if it was previously blocked.
- Applications:
- Resource Management: Used to control access to a pool of resources, such as limiting the number of threads that can use a database connection.
- Thread Synchronization: Coordinates actions between threads, ensuring certain tasks are completed before others begin.
- Example:
- In a system with a limited number of database connections, a counting semaphore can limit the number of threads that access the database simultaneously.
- Advantages:
- Flexibility: Can handle multiple threads, making it ideal for managing access to limited resources.
- Efficiency: Allows more than one thread access when appropriate, improving resource utilization.
- Challenges:
- Risk of Misuse: Incorrectly using semaphores can lead to issues like deadlocks.
- Complexity: Requires careful management to ensure proper access and avoid conflicts or resource exhaustion.
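The database-connection example above can be sketched with a counting semaphore from Python's `threading` module. The pool size, the sleep that stands in for a query, and the bookkeeping variables are assumptions for this illustration.

```python
import threading
import time

MAX_CONNECTIONS = 2   # hypothetical pool size
pool_slots = threading.Semaphore(MAX_CONNECTIONS)

stats_lock = threading.Lock()
active = 0            # clients currently holding a connection
peak_active = 0       # highest value `active` ever reached

def use_connection():
    global active, peak_active
    with pool_slots:                  # wait (P): blocks while the count is zero
        with stats_lock:
            active += 1
            peak_active = max(peak_active, active)
        time.sleep(0.01)              # stands in for a database query
        with stats_lock:
            active -= 1
    # Leaving the `with pool_slots` block performs signal (V).

def run_clients(num_clients=6):
    """Start more clients than there are connections and report the peak concurrency."""
    clients = [threading.Thread(target=use_connection) for _ in range(num_clients)]
    for client in clients:
        client.start()
    for client in clients:
        client.join()
    return peak_active
```

Even with six clients, the peak number of simultaneous users never exceeds `MAX_CONNECTIONS`, since each client must pass the semaphore before touching the pool.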
⭐Small Multiprocessors
1. Bus Implementation
- Definition: A communication system that transfers data between components in a multiprocessor system, allowing processors to communicate with each other and with memory.
- Purpose: Enables processors to share memory and other resources, facilitating communication and data transfer within the system.
- Key Components:
- Data Bus: Carries data between processors, memory, and other devices.
- Address Bus: Transmits the addresses of data locations in memory.
- Control Bus: Sends control signals to coordinate the timing and direction of data flow.
- Types of Bus Systems:
- Single Bus: All processors and memory modules share a single bus; simple but can become a bottleneck with many processors.
- Multiple Bus: Uses separate buses to improve bandwidth, allowing simultaneous data transfers.
- Challenges:
- Scalability: As more processors are added, contention for the bus increases, leading to slower communication.
- Performance Bottlenecks: If multiple processors try to access the bus simultaneously, it can create delays.
- Solution - Bus Arbitration:
- Purpose: Decides which processor or device can use the bus when multiple devices request it.
- Arbitration Methods:
- Centralized Arbitration: A single controller grants access to the bus based on priority.
- Distributed Arbitration: All devices participate in the arbitration process, deciding among themselves.
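Centralized arbitration can be sketched as a function that picks one winner from the current set of requests. The fixed-priority version follows the description above; the round-robin variant is an additional policy included only for comparison, and all names are hypothetical.

```python
def fixed_priority_grant(requests):
    """Centralized fixed priority: the lowest-numbered requesting device wins."""
    for device, requesting in enumerate(requests):
        if requesting:
            return device
    return None

class RoundRobinArbiter:
    """Centralized round-robin: the search starts just after the last winner."""

    def __init__(self, num_devices):
        self.num_devices = num_devices
        self.last_granted = num_devices - 1   # so device 0 is checked first

    def grant(self, requests):
        for offset in range(1, self.num_devices + 1):
            device = (self.last_granted + offset) % self.num_devices
            if requests[device]:
                self.last_granted = device
                return device
        return None
```

If devices 0 and 2 request the bus continuously, fixed priority always grants device 0, while round-robin alternates between 0 and 2, which avoids starving the lower-priority device.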
2. Cache Coherence Protocols
- Definition: Protocols that maintain consistency of data in caches of different processors in a multiprocessor system.
- Purpose: Ensures that all processors have a consistent view of memory, even when multiple caches hold copies of the same data.
- Why Cache Coherence is Needed:
- In multiprocessor systems, each processor has its own cache. If one processor updates a data value in its cache, other caches may have an outdated copy, leading to inconsistencies.
- Types of Cache Coherence Problems:
- Write-Through Problem: Occurs when one cache updates a shared variable, but other caches do not see the update.
- False Sharing: Happens when processors repeatedly invalidate each other's cache entries, even though they are accessing different parts of the same cache line.
- Main Cache Coherence Protocols:
- Snooping Protocols:
- Definition: Each cache monitors (or "snoops on") the shared bus to detect if other caches modify data that it holds.
- Common Snooping Protocols:
- Write-Invalidate: When a processor writes to a cache line, it invalidates the line in all other caches, ensuring only one valid copy exists at a time.
- Write-Update (Write-Broadcast): When a processor writes to a cache line, it broadcasts the update to other caches, so all copies are updated.
- Advantages:
- Effective for small multiprocessors.
- Simple implementation on a shared bus.
- Disadvantages:
- Becomes inefficient as the number of processors increases due to high traffic on the shared bus.
- Directory-Based Protocols:
- Definition: Uses a centralized directory to track which caches hold copies of each memory block. The directory manages the coherence, eliminating the need for caches to constantly monitor the bus.
- How It Works:
- The directory keeps track of the state of each memory block and which processors have a copy.
- When a processor wants to read or write, it contacts the directory, which then handles the coherence action.
- Advantages:
- Scales better with more processors than snooping protocols.
- Reduces bus traffic, as only necessary updates are communicated.
- Disadvantages:
- More complex to implement due to the need for a centralized directory and additional memory overhead.
- States in Cache Coherence Protocols:
- MESI Protocol (common in snooping-based systems):
- Modified (M): The cache line is modified (dirty) and only exists in this cache. It must be written back to main memory before others can read it.
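- Exclusive (E): The cache line is unmodified (clean), and this is the only cache holding it. It can be written without notifying other caches, at which point it becomes Modified; no write-back is needed while it stays clean.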
- Shared (S): The cache line may be present in multiple caches but is not modified.
- Invalid (I): The cache line is invalid or outdated.
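The four states above can be sketched as a tiny state tracker for one memory block across several caches, using write-invalidate. This is a simplified model: bus messages and write-backs are not simulated, only the resulting states, and the class name is hypothetical.

```python
class MesiLine:
    """Tracks the MESI state of one memory block in each of several caches."""

    def __init__(self, num_caches):
        self.states = ["I"] * num_caches

    def read(self, cache):
        if self.states[cache] != "I":
            return                                  # M, E, or S: read hit, no change
        others = [i for i, s in enumerate(self.states) if i != cache and s != "I"]
        for other in others:
            self.states[other] = "S"                # an M copy is written back, then shared
        self.states[cache] = "S" if others else "E"

    def write(self, cache):
        for other in range(len(self.states)):
            if other != cache:
                self.states[other] = "I"            # invalidate every other copy
        self.states[cache] = "M"
```

For two caches, a read by cache 0 gives E/I, a read by cache 1 then gives S/S, and a write by cache 0 gives M/I.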