diff --git a/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/1.Introduction_to_Consistency_and_Coherence.md b/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/1.Introduction_to_Consistency_and_Coherence.md new file mode 100644 index 0000000..d8c777c --- /dev/null +++ b/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/1.Introduction_to_Consistency_and_Coherence.md @@ -0,0 +1,393 @@ +# CHAPTER 1 Introduction to Consistency and Coherence + +> ``` +> Coherence 一致性 ; 连贯 ; 连贯性 ; 相干性 +> Consistency 一致性 ; 连贯性 ; 黏稠度 ; 密实度 ; 坚实度 ; 平滑度 +> +> Conherence是支持 consistency 模型的一种手段 +> ``` + +Many modern computer systems and most multicore chips (chip multiprocessors) +support shared memory in hardware. In a shared memory system, each of the +processor cores may read and write to a single shared address space. These +designs seek various goodness properties, such as high performance, low power, +and low cost. Of course, it is not valuable to provide these goodness +properties without first providing correctness. Correct shared memory seems +intuitive at a hand-wave level, but, as this lecture will help show, there are +subtle issues in even defining what it means for a shared memory system to be +correct, as well as many subtle corner cases in designing a correct shared +memory implementation. Moreover, these subtleties must be mastered in hardware +implementations where bug fixes are expensive. Even academics should master +these subtleties to make it more likely that their proposed designs will work. + +> ``` +> valuable /ˈvæljuəbl/ : 有价值的 +> intuitive /ɪnˈtjuːɪtɪv/ : 直觉的;易懂的;有直觉力的;使用简便的;凭直觉得到的 +> corner ['kɔːnə(r)]: 角; 角落, 冷僻处;区域,地区;街角;墙角,壁角;嘴角,眼角; +> subtle [ˈsʌtl]: 微妙的 ; 巧妙的 ; 狡猾的 ; 敏锐的 +> master: 掌握 +> ``` +> +> 许多现代计算机系统和大多数多核芯片(芯片多处理器)都支持硬件中的共享内存。 +> 在共享存储器系统中,每个处理器核可以读取和写入单个共享地址空间。这些设计寻求 +> 各种优良特性,例如高性能、低功耗和低成本。当然,在不首先提供正确性的情况下提 +> 供这些优良特性是没有价值的。正确的共享内存在hand-ware level 上似乎很直观, +> 但是,正如本次讲座将帮助展示的那样,甚至在定义什么样的共享内存系统的意味着正 +> 确的方面也存在微妙的问题,以及在设计一个正确的共享内存系统时也存在许多微妙的 +> 极端情况。此外,这些微妙之处必须在硬件实现中掌握,因为错误修复的成本很高。 +> 即使是学者也应该掌握这些微妙之处,以使他们提出的设计更有可能发挥作用。 + +Designing and evaluating a correct shared memory system requires an architect +to under- stand memory consistency and cache coherence, the two topics of this +primer. Memory consistency (consistency, memory consistency model, or memory +model) is a precise, architecturally visible definition of shared memory +correctness. Consistency definitions provide rules about loads and stores (or +memory reads and writes) and how they act upon memory. Ideally, consistency +def- initions would be simple and easy to understand. However, defining what it +means for shared memory to behave correctly is more subtle than defining the +correct behavior of, for example, a single-threaded processor core. The +correctness criterion for a single processor core partitions behavior between +one correct result and many incorrect alternatives. This is because the pro- +cessor’s architecture mandates that the execution of a thread transforms a +given input state into a single well-defined output state, even on an +out-of-order core. Shared memory consistency models, however, concern the loads +and stores of multiple threads and usually allow many correct executions while +disallowing many (more) incorrect ones. The possibility of multiple correct +executions is due to the ISA allowing multiple threads to execute concurrently, +often with many possible legal interleavings of instructions from different +threads. The multitude of correct executions complicates the erstwhile simple +challenge of determining whether an execution is correct. Nevertheless, +consistency must be mastered to implement shared memory and, in some cases, to +write correct programs that use it. + +> ``` +> precise [prɪˈsaɪs] : 精确的 +> criterion [kraɪˈtɪriən] : 标准 +> mandate: 授权;强制执行;委托办理 +> Nevertheless : 然而 +> interleaving : v. 插入插进 n: 交错,交叉 +> multitude : 群众;众多;大量; +> complicate : 使复杂化 +> erstwhile [ˈɜːrstwaɪl] : 过去的 +> ``` +> +> 设计和评估正确的共享内存系统需要架构师了解内存一致性和缓存一致性,这是本入门书 +> 的两个主题。 内存一致性(一致性、内存一致性模型或内存模型)是共享内存正确性 +> 的精确的、架构上可见的定义。一致性定义提供了有关加载和存储(或内存读取和写入) +> 以及它们如何作用于内存的规则。 理想情况下,一致性定义应该简单且易于理解。 +> 然而,定义共享内存的正确行为比定义单线程处理器核心的正确行为更为微妙。 +> 单个处理器核心的正确性标准将行为划分为一个正确结果和许多不正确的替代结果. +> 这是因为处理器的架构要求线程的执行将给定的输入状态转换为单个明确定义的输出 +> 状态,即使在out-of-order的核心上也是如此。然而,共享内存一致性模型涉及多个线程的 +> 加载和存储,并且通常允许许多正确的执行,同时不允许许多(更多)错误的执行。 +> 该正确执行的可能性是由于 ISA 允许多个线程同时执行,通常具有来自不同线程的指令的 +> 许多可能的合法交错。大量的正确执行使以前简单的确定执行是否正确的挑战变得复杂。 +> 然而,必须掌握一致性才能实现共享内存,并在某些情况下编写使用它的正确程序。 + +The microarchitecture—the hardware design of the processor cores and the shared +memory system—must enforce the desired consistency model. As part of this +consistency model support, the hardware provides cache coherence (or +coherence). In a shared-memory system with caches, the cached values can +potentially become out-of-date (or incoherent) when one of the processors +updates its cached value. Coherence seeks to make the caches of a shared-memory +system as functionally invisible as the caches in a single-core system; it does +so by propagating a processor’s write to other processors’ caches. It is worth +stressing that unlike consistency which is an architectural specification that +defines shared memory correctness, coherence is a means to supporting a +consistency model. + +> ``` +> propagating [ˈprɑːpəɡeɪtɪŋ]: 传播 +> It is worth stressing that : 值得强调的是 +> ``` +> +> 微体系结构(处理器内核和共享内存系统的硬件设计)必须强制执行所需的一致性模型。 +> 作为一致性模型支持的一部分,硬件提供缓存一致性(或一致性)。在具有缓存的共享内 +> 存系统中,当其中一个处理器更新其缓存值时,缓存值可能会过时(或不一致). +> Coherence 旨在使共享内存系统的缓存在功能上与单核系统中的缓存一样不可见; 它通过将 +> 处理器的写入传播到其他处理器的缓存来实现这一点。值得强调的是,与consistency(定义 +> 共享内存正确性的架构规范)不同,coherence是支持consistency模型的一种手段。 + +Even though consistency is the first major topic of this primer, we begin in +Chapter 2 with a brief introduction to coherence because coherence protocols +play an important role in providing consistency. The goal of Chapter 2 is to +explain enough about coherence to understand how consistency models interact +with coherent caches, but not to explore specific coherence protocols or +implementations, which are topics we defer until the second portion of this +primer in Chapters 6–9. + +> 尽管 consistency 是本入门书的第一个主要主题,但我们在第 2 章开始时会简要介绍 +> coherence,因为 coherence 协议在提供 consistency 方面发挥着重要作用。第 2 章的目 +> 标是充分解释coherence,以了解 consistency 模型如何与coherent 缓存交互,但不是 +> 探索特定的coherence协议或实现,这些主题我们将推迟到第 6 章至第 9 章中本初级 +> 读物的第二部分。 + +## 1.1 CONSISTENCY (A.K.A., MEMORY CONSISTENCY, MEMORY CONSISTENCY MODEL, OR MEMORY MODEL) + +Consistency models define correct shared memory behavior in terms of loads and +stores (memory reads and writes), without reference to caches or coherence. +To gain some real-world intuition on why we need consistency models, consider +a university that posts its course schedule online. Assume that the Computer +Architecture course is originally scheduled to be in Room 152. The day before +classes begin, the university registrar decides to move the class to Room 252. +The registrar sends an e-mail message asking the website administrator to +update the online schedule, and a few minutes later, the registrar sends a text +message to all registered students to check the newly updated schedule. It is +not hard to imagine a scenario—if, say, the website administrator is too busy +to post the update immediately—in which a diligent student receives the text +message, immediately checks the online schedule, and still observes the (old) +class location Room 152. Even though the online schedule is eventually +updated to Room 252 and the registrar performed the “writes” in the correct +order, this diligent student observed them in a different order and thus went to +the wrong room. A consistency model defines whether this be- havior is correct +(and thus whether a user must take other action to achieve the desired outcome) +or incorrect (in which case the system must preclude these reorderings). + +> ``` +> course : 课程 +> registrar [ˈredʒɪstrɑːr] : 登记员;(大学的)教务长,教务主任,注册主任; +> diligent [ˈdɪlɪdʒənt] : 勤勉的;刻苦的;孜孜不倦的 +> preclude : 排除;妨碍;阻止;使行不通 +> eventually : 最后,终于 +> ``` +> +> consistency 模型根据加载和存储(内存读取和写入)定义正确的共享内存行为,而不参考 +> 缓存或 conherence。为了获得关于为什么我们需要一致性模型的real-world intuition, +> 请考虑一所在线发布课程表的大学。开课前一天,大学教务主任决定将班级转移到252室。 +> 教导主任发送电子邮件,要求网站管理员更新在线时间表,几分钟后,教导主任向所有学生 +> 发送短信 已注册的学生可以查看最新更新的时间表。不难想象这样一个场景——如果网站管理 +> 员太忙而无法立即发布更新——一个勤奋的学生收到短信,立即查看在线时间表,并且仍然关注 +> 到的是(旧的)上课地点 152 室。尽管在线时间表最终更新为252房间,并且教导主任以正确 +> 的顺序执行了“写入”,但这位勤奋的学生以不同的顺序观察了它们,因此去了错误的房间。 +> 一致性模型定义了这种行为是正确的(以及用户是否必须采取其他操作来实现期望的结果) +> 还是不正确的(在这种情况下系统必须阻止这些重新排序)。 + +Although this contrived example used multiple media, similar behavior can +happen in shared memory hardware with out-of-order processor cores, write +buffers, prefetching, and multiple cache banks. Thus, we need to define shared +memory correctness—that is, which shared memory behaviors are allowed—so that +programmers know what to expect and implementors know the limits to what they +can provide. + +> ``` +> contrived /kənˈtraɪvd/ : 人为的;做作的;矫揉造作的;不自然的;预谋的 +> ``` +> +> 尽管这个人为的示例使用了multiple media(???),但在具out-of-order processors +> core、write buffer、prefetching 和 multiple cache banks 的内存硬件中也可能 +> 发生类似的行为。 因此,我们需要定义共享内存的正确性,即允许哪些共享内存行为, +> 以便程序员知道会发生什么,实现者知道他们可以提供的限制。 + +Shared memory correctness is specified by a memory consistency model or, more +simply, a memory model. The memory model specifies the allowed behavior of +multithreaded programs executing with shared memory. For a multithreaded +program executing with specific input data, the memory model specifies what +values dynamic loads may return and, optionally, what possible final states of +the memory are. Unlike single-threaded execution, multiple correct behaviors +are usually allowed, making understanding memory consistency models subtle. + +> 共享内存的正确性由内存一致性模型或更简单地由内存模型指定。 内存模型指定 +> 了使用共享内存执行的多线程程序所允许的行为。对于使用特定输入数据执行的多线程 +> 程序,内存模型指定动态加载可能返回的值,以及内存可能的最终状态(可选)。 +> 与单线程执行不同,通常允许多个正确的行为,这使得理解内存一致性模型变得微妙。 +>> 1. what's the 'dynamic load' ? + +Chapter 3 introduces the concept of memory consistency models and presents +sequential consistency (SC), the strongest and most intuitive consistency +model. The chapter begins by motivating the need to specify shared memory +behavior and precisely defines what a memory consistency model is. It next +delves into the intuitive SC model, which states that a multi-threaded +execution should look like an interleaving of the sequential executions of each +constituent thread, as if the threads were time-multiplexed on a single-core +processor. Beyond this intuition, the chapter formalizes SC and explores +implementing SC with coherence in both simple and aggressive ways, culminating +with a MIPS R10000 case study. + +> ``` +> delves [delvz] : (在手提包. 容器等中) 翻找 +> delves into : 深入研究 +> sequential /sɪˈkwenʃl/ : 顺序的,按次序的;连续的; +> constituent [kənˈstɪtjuənt] 成分; (选区的)选民,选举人;构成要素; +> multiplexed [ˈmʌltiˌplɛkst] : 多路复用的 +> culminating [ˈkʌlmɪneɪtɪŋ] (在某一点)结束;(以某种结果)告终 +> case study : 案例分析 +> ``` +> +> 第3章介绍了内存一致性模型的概念,并提出了最强、最直观的一致性模型 -- 顺序 +> 一致性(SC)。本章首先提出了指定共享内存行为的需要,并精确定义了内存一致性 +> 模型是什么。接下来深入研究直观的 SC 模型,该模型指出多线程执行应该看起来像 +> +> 每个组成线程的顺序执行的交错,就好像线程在单核处理器上进行时间复用一样。 +> 除了这种 intuition (觉得应该翻译成直观的展示)之外,本章还将 SC 形式化, +> 并探索以简单和积极的方式实现 conherence SC,最后以 MIPS R10000 案例研 +> 究结束。 + +In Chapter 4, we move beyond SC and focus on the memory consistency model +imple- mented by x86 and historical SPARC systems. This consistency model, +called total store order (TSO), is motivated by the desire to use +first-in–first-out write buffers to hold the results of committed stores before +writing the results to the caches. This optimization violates SC, yet promises +enough performance benefit to inspire architectures to define TSO, which permits +this optimization. In this chapter, we show how to formalize TSO from our SC +formalization, how TSO affects implementations, and how SC and TSO compare. + + +Finally, Chapter 5 introduces “relaxed” or “weak” memory consistency models. It +moti- vates these models by showing that most memory orderings in strong models +are unnecessary. If a thread updates ten data items and then a synchronization +flag, programmers usually do not care if the data items are updated in order +with respect to each other but only that all data items are updated before the +flag is updated. Relaxed models seek to capture this increased or- dering +flexibility to get higher performance or a simpler implementation. After +providing this motivation, the chapter develops an example relaxed consistency +model, called XC, wherein programmers get order only when they ask for it with +a FENCE instruction (e.g., a FENCE after the last data update but before the +flag write). The chapter then extends the formalism of the previous two chapters +to handle XC and discusses how to implement XC (with considerable reordering +between the cores and the coherence protocol). The chapter then discusses a way +in which many programmers can avoid thinking about relaxed models directly: if +they add enough FENCEs to ensure their program is data-race free (DRF), then +most relaxed models will appear SC. With “SC for DRF,” programmers can get both +the (relatively) simple correctness model of SC with the (relatively) higher +performance of XC. For those who want to reason more deeply, the chapter +concludes by distinguishing acquires from releases, discussing write atomicity +and causality, pointing to commercial examples (including an IBM Power case +study), and touching upon high-level language models ( Java and CCC). + +Returning to the real-world consistency example of the class schedule, we can +observe that the combination of an email system, a human web administrator, and +a text-messaging system represents an extremely weak consistency model. To +prevent the problem of a diligent student going to the wrong room, the +university registrar needed to perform a FENCE operation after her email to +ensure that the online schedule was updated before sending the text message. + +## 1.2 COHERENCE (A.K.A., CACHE COHERENCE) + +Unless care is taken, a coherence problem can arise if multiple actors (e.g., +multiple cores) have access to multiple copies of a datum (e.g., in multiple +caches) and at least one access is a write. Consider an example that is similar +to the memory consistency example. A student checks the online schedule of +courses, observes that the Computer Architecture course is being held in Room +152 (reads the datum), and copies this information into her calendar app in her +mobile phone (caches the datum). Subsequently, the university registrar decides +to move the class to Room 252, updates the online schedule (writes to the +datum) and informs the students via a text message. The student’s copy of the +datum is now stale, and we have an incoherent situation. If she goes to Room +152, she will fail to find her class. Examples of incoherence from the world of +computing, but not including computer architecture, include stale web caches +and programmers using un-updated code repositories. + +Access to stale data (incoherence) is prevented using a coherence protocol, +which is a set of rules implemented by the distributed set of actors within a +system. Coherence protocols come in many variants but follow a few themes, as +developed in Chapters 6–9. Essentially, all of the variants make one +processor’s write visible to the other processors by propagating the write to +all caches, i.e., keeping the calendar in sync with the online schedule. But +protocols differ in when and how the syncing happens. There are two major +classes of coherence protocols. In the first approach, the coherence protocol +ensures that writes are propagated to the caches synchronously. When the online +schedule is updated, the coherence protocol ensures that the student’s calendar +is updated as well. In the second approach, the coherence protocol propagates +writes to the caches asynchronously, while still honoring the consistency +model. The coherence protocol does not guarantee that when the online schedule +is updated, the new value will have propagated to the student’s calendar as +well; however, the protocol does ensure that the new value is propagated before +the text message reaches her mobile phone. This primer focuses on the first +class of coherence protocols (Chapters 6–9) while Chapter 10 discusses the +emerging second class. + +Chapter 6 presents the big picture of cache coherence protocols and sets the stage for the +subsequent chapters on specific coherence protocols. This chapter covers issues shared by most +coherence protocols, including the distributed operations of cache controllers and memory con- +trollers and the common MOESI coherence states: modified (M), owned (O), exclusive (E), +shared (S), and invalid (I). Importantly, this chapter also presents our table-driven methodology +for presenting protocols with both stable (e.g., MOESI) and transient coherence states. Tran- +sient states are required in real implementations because modern systems rarely permit atomic +transitions from one stable state to another (e.g., a read miss in state Invalid will spend some +time waiting for a data response before it can enter state Shared). Much of the real complex- +ity in coherence protocols hides in the transient states, similar to how much of processor core +complexity hides in micro-architectural states. + +Chapter 7 covers snooping cache coherence protocols, which initially dominated the com- +mercial market. At the hand-wave level, snooping protocols are simple. When a cache miss oc- +curs, a core’s cache controller arbitrates for a shared bus and broadcasts its request. The shared +bus ensures that all controllers observe all requests in the same order and thus all controllers can +coordinate their individual, distributed actions to ensure that they maintain a globally consis- +tent state. Snooping gets complicated, however, because systems may use multiple buses and +modern buses do not atomically handle requests. Modern buses have queues for arbitration and +can send responses that are unicast, delayed by pipelining, or out-of-order. All of these fea- +tures lead to more transient coherence states. Chapter 7 concludes with case studies of the Sun +UltraEnterprise E10000 and the IBM Power5. + +Chapter 8 delves into directory cache coherence protocols that offer the promise of scal- +ing to more processor cores and other actors than snooping protocols that rely on broadcast. +There is a joke that all problems in computer science can be solved with a level of indirection. +Directory protocols support this joke: A cache miss requests a memory location from the next +level cache (or memory) controller, which maintains a directory that tracks which caches hold +which locations. Based on the directory entry for the requested memory location, the controller +sends a response message to the requestor or forwards the request message to one or more ac- +tors currently caching the memory location. Each message typically has one destination (i.e., +no broadcast or multicast), but transient coherence states abound as transitions from one stable +coherence state to another stable one can generate a number of messages proportional to the +number of actors in the system. This chapter starts with a basic MSI directory protocol and then +refines it to handle the MOESI states E and O, distributed directories, less stalling of requests, +approximate directory entry representations, and more. The chapter also explores the design of +the directory itself, including directory caching techniques. The chapter concludes with case +studies of the old SGI Origin 2000 and the newer AMD HyperTransport, HyperTransport +Assist, and Intel QuickPath Interconnect (QPI). + +Chapter 9 deals with some, but not all, of the advanced topics in coherence. For ease of +explanation, the prior chapters on coherence intentionally restrict themselves to the simplest +system models needed to explain the fundamental issues. Chapter 9 delves into more compli- +cated system models and optimizations, with a focus on issues that are common to both snoop- +ing and directory protocols. Initial topics include dealing with instruction caches, multilevel +caches, write-through caches, translation lookaside buffers (TLBs), coherent direct memory ac- +cess (DMA), virtual caches, and hierarchical coherence protocols. Finally, the chapter delves +into performance optimizations (e.g., targeting migratory sharing and false sharing) and a new +protocol family called Token Coherence that subsumes directory and snooping coherence. + +## 1.3 CONSISTENCY AND COHERENCE FOR HETEROGENEOUS SYSTEMS + +Modern computer systems are predominantly heterogeneous. A mobile phone +processor today not only contains a multicore CPU, it also has a GPU and other +accelerators (e.g., neural net- work hardware). In the quest for +programmability, such heterogeneous systems are starting to support shared +memory. Chapter 10 deals with consistency and coherence for such heteroge- +neous processors. + +The chapter starts by focusing on GPUs, arguably the most popular accelerators +today. The chapter observes that GPUs originally chose not to support hardware +cache coherence, since GPUs are designed for embarrassingly parallel graphics +workloads that do not synchronize or share data all that much. However, the +absence of hardware cache coherence leads to pro- grammability and/or +performance challenges when GPUs are used for general-purpose work- loads with +fine-grained synchronization and data sharing. The chapter discusses in detail +some of the promising coherence alternatives that overcome these limitations—in +particular, explaining why the candidate protocols enforce the consistency +model directly rather than implementing coherence in a consistency-agnostic +manner. The chapter concludes with a brief discussion on consistency and +coherence across CPUs and the accelerators. + +## 1.4 SPECIFYING AND VALIDATING MEMORY CONSISTENCY MODELS AND CACHE COHERENCE + +Consistency models and coherence protocols are complex and subtle. Yet, this +complexity must be managed to ensure that multicores are programmable and that +their designs can be validated. To achieve these goals, it is critical that +consistency models are specified formally. A formal specification would enable +programmers to clearly and exhaustively (with tool support) under- stand what +behaviors are permitted by the memory model and what behaviors are not. Second, +a precise formal specification is mandatory for validating implementations. + +Chapter 11 starts by discussing two methods for specifying systems—axiomatic +and operational—focusing on how these methods can be applied for consistency +models and co- herence protocols. Then the chapter goes over techniques for +validating implementations— including processor pipeline and coherence protocol +implementations—against their specifi- cation. The chapter discusses both formal +methods and informal testing. + +## 1.5 A CONSISTENCY AND COHERENCE QUIZ +... + +## 1.6 WHAT THIS PRIMER DOES NOT DO + +## 1.7 REFERENCES diff --git a/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/memory_consistency_motivation_and_sequential_consistency.md b/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/memory_consistency_motivation_and_sequential_consistency.md new file mode 100644 index 0000000..a2cbe34 --- /dev/null +++ b/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/memory_consistency_motivation_and_sequential_consistency.md @@ -0,0 +1,67 @@ +## 3.8 OPTIMIZED SC IMPLEMENTATIONS WITH CACHE COHERENCE + +Most real core implementations are more complicated than our basic SC +implementation with cache coherence. Cores employ features like prefetching, +speculative execution, and multithreading in order to improve performance and +tolerate memory access latencies. These features interact with the memory +interface, and we now discuss how these features impact the implementation of +SC. It is worth bearing in mind that any feature or optimization is legal as +long as it does not produce an end result (values returned by loads) that +violates SC. + +> it's worth bearing in mind: 值得牢记的是 +> +> 大多数真正的核心实现比我们具有缓存一致性的基本 SC 实现更复杂。 内核采用预 +> 取、推测执行和多线程等功能来提高性能并容忍内存访问延迟。这些功能与内存接口交互, +> 我们现在讨论这些功能如何影响 SC 的实现。 值得记住的是,任何功能或优化都是合 +> 法的,只要它不产生违反 SC 的最终结果(load返回的值)。 + +**Non-Binding Prefetching** + +A non-binding prefetch for block B is a request to the coherent memory system +to change B’s coherence state in one or more caches. Most commonly, prefetches +are requested by software, core hardware, or the cache hardware to change B’s +state in the level-one cache to permit loads (e.g., B’s state is M or S) or +loads and stores (B’s state is M) by issuing coherence requests such as GetS +and GetM. Importantly, in no case does a non-binding prefetch change the state +of a register or data in block B. The effect of the non-binding prefetch is +limited to within the “cache- coherent memory system” block of Figure 3.5a, +making the effect of non-binding prefetches on the memory consistency model to +be the functional equivalent of a no-op. So long as the loads and stores are +performed in program order, it does not matter in what order coherence +permissions are obtained. + +Implementations may do non-binding prefetches without affecting the memory +consis- tency model. This is useful for both internal cache prefetching (e.g., +stream buffers) and more aggressive cores. + +**Speculative Cores** + +Consider a core that executes instructions in program order, but also does +branch prediction wherein subsequent instructions, including loads and +stores, begin execution, but may be squashed (i.e., have their effects nullified) +on a branch misprediction. These squashed loads and stores can be made to look +like non-binding prefetches, enabling this speculation to be correct because it +has no effect on SC. A load after a branch prediction can be presented to the L1 +cache, wherein it either misses (causing a non-binding GetS prefetch) or hits +and then returns a value to a register. If the load is squashed, the core +discards the register update, erasing any functional effect from the load—as if +it never happened. The cache does not undo non-binding prefetches, as doing so +is not necessary and prefetching the block can help performance if the load +gets re-executed. For stores, the core may issue a non-binding GetM prefetch +early, but it does not present the store to the cache until the store is +guaranteed to commit. + +> ``` +> squashed [skwɑːʃt] :挤进;粉碎;制止;打断;塞入;去除;压软(或挤软、压坏、压扁等);把…压(或挤)变形 +> ``` +> +> 考虑一个按程序顺序执行指令的核心,但也进行分支预测,其中后续指令(包括加载和存 +> 储e开始执行,但可能会因分支错误预测而被squashed(即,使其影响无效)。这些squa +> shed的加载和存储可以看起来像non-binding 预取,从而使这种推测是正确的,因为它对 +> SC 没有影响。分支预测之后的load可以呈现给L1高速缓存,其中它或者未命中 +> (导致非绑定GetS预取)或者命中,然后将值返回到寄存器。如果负载被squashed,内核就 +> 会丢弃寄存器更新,从而消除负载的任何功能影响,就好像它从未发生过一样。 +> 缓存不会撤消非绑定预取,因为这样做是不必要的,并且如果重新执行加载,预取块可以 +> 提高性能。 对于存储,核心可能会提前发出非绑定 GetM 预取,但在保证存储提交之前, +> 它不会将存储呈现给缓存。 diff --git a/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/pic/TABLE-5-2.png b/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/pic/TABLE-5-2.png new file mode 100644 index 0000000..7a463e9 Binary files /dev/null and b/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/pic/TABLE-5-2.png differ diff --git a/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/pic/Table-5-1.png b/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/pic/Table-5-1.png new file mode 100644 index 0000000..6c26475 Binary files /dev/null and b/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/pic/Table-5-1.png differ diff --git a/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/relaxed_memory_consistency.md b/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/relaxed_memory_consistency.md index c909c98..226f3a0 100644 --- a/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/relaxed_memory_consistency.md +++ b/memory_order/A_Primer_on_Memory_Consistency_and_Cache_Coherence/relaxed_memory_consistency.md @@ -191,24 +191,39 @@ which is a drawback, but it also enables many optimizations that can improve performance. We discuss a few common and important optimizations, but a deep treatment of this topic is beyond the scope of this primer. -> +> reason about : 推出;推理;推断 +> +> 现在假设一个宽松的内存一致性模型,允许我们重新排序任何内存操作,除非它们之间存在 +> FENCE。 这种宽松的模型迫使程序员推断哪些操作需要排序,这是一个缺点,但它也实现了 +> 许多可以提高性能的优化。 我们讨论了一些常见且重要的优化,但对该主题的深入处理超出u +> 了本入门的范围。 **Non-FIFO, Coalescing Write Buffer** +> coalesce [ˌkoʊəˈles] : 合并;结合;联合 + Recall that TSO enables the use of a FIFO write buffer, which improves -performance by hid- ing some or all of the latency of committed stores. -Although a FIFO write buffer improves performance, an even more optimized design -would use a non-FIFO write buffer that permits coalescing of writes (i.e., two -stores that are not consecutive in program order can write to the same entry in -the write buffer). However, a non-FIFO coalescing write buffer normally violates -TSO because TSO requires stores to appear in program order. Our example relaxed +performance by hiding some or all of the latency of committed stores. Although +a FIFO write buffer improves performance, an even more optimized design would +use a non-FIFO write buffer that permits coalescing of writes (i.e., two stores +that are not consecutive in program order can write to the same entry in the +write buffer). However, a non-FIFO coalescing write buffer normally violates TSO +because TSO requires stores to appear in program order. Our example relaxed model al- lows stores to coalesce in a non-FIFO write buffer, so long as the stores have not been separated by a FENCE. +> violates [ˈvaɪəleɪts] : 违背 +> +> 回想一下,TSO 支持使用 FIFO 写入缓冲区,这通过隐藏已提交存储的部分或全部延迟来提高性能。 +> 尽管 FIFO 写入缓冲区可提高性能,但更优化的设计将使用允许合并写入的非 FIFO 写入缓冲区 +> (即,程序顺序不连续的两个存储可以写入写入缓冲区中的同一条目)。 然而,非 FIFO 合并写入 +> 缓冲区通常会违反 TSO,因为 TSO 要求存储按程序顺序出现。 我们的示例宽松模型允许存储合并 +> 在非 FIFO 写入缓冲区中,只要存储未被 FENCE 分隔即可。 + **Simpler Support for Core Speculation** In systems with strong consistency models, a core may speculatively execute -loads out of pro- gram order before they are ready to be committed. Recall how +loads out of program order before they are ready to be committed. Recall how the MIPS R10000 core, which supports SC, used such speculation to gain better performance than a naive implementation that did not speculate. The catch, however, is that speculative cores that support SC typically have to include @@ -218,7 +233,7 @@ the addresses of evicted cache blocks against a list of the addresses that the core has speculatively loaded but not yet committed (i.e., the contents of the core’s load queue). This mechanism adds to the cost and complexity of the hardware, it consumes additional power, and it represents another finite -resource that may con- strain instruction level parallelism. In a system with a +resource that may constrain instruction level parallelism. In a system with a relaxed memory consistency model, a core can execute loads out of program order without having to compare the addresses of these loads to the addresses of incoming coherence requests. These loads are not speculative with respect to @@ -226,6 +241,21 @@ the relaxed consistency model (although they may be speculative with respect to, say, a branch prediction or earlier stores by the same thread to the same address). +> ``` +> the catch is that : 问题是 +> ``` +> +> 在具有强一致性模型的系统中,核心可能会在准备好提交之前推测性地执行不符合程序 +> 顺序的load。回想一下支持 SC 的 MIPS R10000 内核如何使用这种推测来获得比不进行 +> 推测的简单实现更好的性能。然而,问题是支持 SC 的推测核心通常必须包含检查推测 +> 是否正确的机制,即使错误推测很少见 [15, 21]。R10000 通过将已逐出的缓存块的地址 +> 与核心已推测加载但尚未提交的地址列表(即核心加载队列的内容)进行比较来检查推测。 +> 这种机制增加了硬件的成本和复杂性,它消耗额外的功率,并且它代表了另一种可能限制 +> 指令级并行性的有限资源。在具有宽松存储器一致性模型的系统中,核心可以不按程序顺 +> 序执行加载,而无需将这些加载的地址与传入一致性请求的地址进行比较。 这些负载对 +> 于宽松的一致性模型而言不是推测性的(尽管它们可能对于分支预测或同一线程到同一 +> 地址的早期存储而言是推测性的)。 + **Coupling Consistency and Coherence** We previously advocated decoupling consistency and coherence to manage