Method, apparatus and system for dynamically controlling an addressing mode for a cache memory | Patent Publication Number 20150120998
US 20150120998 A1
Zhaojuan Bian
Wei Zhou
Kebing Wang
In an embodiment, a first portion of a cache memory is associated with a first core. This first cache memory portion is of a distributed cache memory, and may be dynamically controlled to be one of a private cache memory for the first core and a shared cache memory shared by a plurality of cores (including the first core) according to an addressing mode, which itself is dynamically controllable. Other embodiments are described and claimed.
- 1. An apparatus comprising: a first portion of a cache memory associated with a first core, the first cache memory portion of a distributed cache memory, wherein the first cache memory portion is dynamically controlled to be one of a private cache memory for the first core and a shared cache memory shared by a plurality of cores including the first core, according to an addressing mode.
- 12. A system on a chip (SoC) comprising: a plurality of nodes each having a node identifier and including at least one core and at least one cache portion; and a control logic to receive performance metric information, determine a grouping mode for a cache memory formed of the cache portions of each of the plurality of nodes based at least in part on the performance metric information, and to dynamically change the node identifier of at least some of the plurality of nodes responsive to a change in the grouping mode.
- 21. A method comprising: receiving an incoming request transaction in a system including a plurality of nodes; generating an access address using a request address of the incoming request transaction and a node identifier of a first node including a first core and a first portion of a last level cache (LLC), wherein the node identifier is dynamically changed when a group addressing mode for the LLC changes; and accessing the first portion of the LLC using the access address.
- 28. A system comprising: a processor including a plurality of nodes, each having at least one core and a portion of a distributed cache memory, the processor further including a controller to dynamically control a node identifier for each of the plurality of nodes based at least in part on a grouping mode of the distributed cache memory; and a dynamic random access memory (DRAM) coupled to the processor.
This disclosure pertains to computing systems, and in particular (but not exclusively) to cache memory management.
Referring to
In one embodiment, a processing element refers to hardware or logic to support a software thread. Examples of hardware processing elements include: a thread unit, a thread slot, a thread, a process unit, a context, a context unit, a logical processor, a hardware thread, a core, and/or any other element, which is capable of holding a state for a processor, such as an execution state or architectural state. In other words, a processing element, in one embodiment, refers to any hardware capable of being independently associated with code, such as a software thread, operating system, application, or other code. A physical processor (or processor socket) typically refers to an integrated circuit, which potentially includes any number of other processing elements, such as cores or hardware threads.
A core often refers to logic located on an integrated circuit capable of maintaining an independent architectural state, wherein each independently maintained architectural state is associated with at least some dedicated execution resources. In contrast to cores, a hardware thread typically refers to any logic located on an integrated circuit capable of maintaining an independent architectural state, wherein the independently maintained architectural states share access to execution resources. As can be seen, when certain resources are shared and others are dedicated to an architectural state, the line between the nomenclature of a hardware thread and core overlaps. Yet often, a core and a hardware thread are viewed by an operating system as individual logical processors, where the operating system is able to individually schedule operations on each logical processor.
Physical processor 100, as illustrated in
As depicted, core 101 includes two hardware threads 101a and 101b, which may also be referred to as hardware thread slots 101a and 101b. Therefore, software entities, such as an operating system, in one embodiment potentially view processor 100 as four separate processors, i.e., four logical processors or processing elements capable of executing four software threads concurrently. As alluded to above, a first thread is associated with architecture state registers 101a, a second thread is associated with architecture state registers 101b, a third thread may be associated with architecture state registers 102a, and a fourth thread may be associated with architecture state registers 102b. Here, each of the architecture state registers (101a, 101b, 102a, and 102b) may be referred to as processing elements, thread slots, or thread units, as described above. As illustrated, architecture state registers 101a are replicated in architecture state registers 101b, so individual architecture states/contexts are capable of being stored for logical processor 101a and logical processor 101b. In core 101, other smaller resources, such as instruction pointers and renaming logic in allocator and renamer block 130 may also be replicated for threads 101a and 101b. Some resources, such as re-order buffers in reorder/retirement unit 135, I-TLB 120, load/store buffers, and queues may be shared through partitioning. Other resources, such as general purpose internal registers, page-table base register(s), low-level data-cache and data-TLB 115, execution unit(s) 140, and portions of out-of-order unit 135 are potentially fully shared.
Processor 100 often includes other resources, which may be fully shared, shared through partitioning, or dedicated by/to processing elements. In
Core 101 further includes decode module 125 coupled to fetch unit 120 to decode fetched elements. Fetch logic, in one embodiment, includes individual sequencers associated with thread slots 101a, 101b, respectively. Usually core 101 is associated with a first ISA, which defines/specifies instructions executable on processor 100. Often machine code instructions that are part of the first ISA include a portion of the instruction (referred to as an opcode), which references/specifies an instruction or operation to be performed. Decode logic 125 includes circuitry that recognizes these instructions from their opcodes and passes the decoded instructions on in the pipeline for processing as defined by the first ISA. For example, as discussed in more detail below, decoders 125, in one embodiment, include logic designed or adapted to recognize specific instructions, such as a transactional instruction. As a result of the recognition by decoders 125, the architecture or core 101 takes specific, predefined actions to perform tasks associated with the appropriate instruction. It is important to note that any of the tasks, blocks, operations, and methods described herein may be performed in response to a single or multiple instructions, some of which may be new or old instructions. Note decoders 126, in one embodiment, recognize the same ISA (or a subset thereof). Alternatively, in a heterogeneous core environment, decoders 126 recognize a second ISA (either a subset of the first ISA or a distinct ISA).
In one example, allocator and renamer block 130 includes an allocator to reserve resources, such as register files to store instruction processing results. However, threads 101a and 101b are potentially capable of out-of-order execution, where allocator and renamer block 130 also reserves other resources, such as reorder buffers to track instruction results. Unit 130 may also include a register renamer to rename program/instruction reference registers to other registers internal to processor 100. Reorder/retirement unit 135 includes components, such as the reorder buffers mentioned above, load buffers, and store buffers, to support out-of-order execution and later in-order retirement of instructions executed out-of-order.
Scheduler and execution unit(s) block 140, in one embodiment, includes a scheduler unit to schedule instructions/operations on execution units. For example, a floating point instruction is scheduled on a port of an execution unit that has an available floating point execution unit. Register files associated with the execution units are also included to store instruction processing results. Exemplary execution units include a floating point execution unit, an integer execution unit, a jump execution unit, a load execution unit, a store execution unit, and other known execution units.
Lower level data cache and data translation buffer (D-TLB) 150 are coupled to execution unit(s) 140. The data cache is to store recently used/operated on elements, such as data operands, which are potentially held in memory coherency states. The D-TLB is to store recent virtual/linear to physical address translations. As a specific example, a processor may include a page table structure to break physical memory into a plurality of virtual pages.
Here, cores 101 and 102 share access to higher-level or further-out cache, such as a second level cache associated with on-chip interface 110. Note that higher-level or further-out refers to cache levels increasing or getting further away from the execution unit(s). In one embodiment, higher-level cache is a last-level data cache—last cache in the memory hierarchy on processor 100—such as a second or third level data cache. However, higher level cache is not so limited, as it may be associated with or include an instruction cache. A trace cache—a type of instruction cache—instead may be coupled after decoder 125 to store recently decoded traces. Here, an instruction potentially refers to a macro-instruction (i.e. a general instruction recognized by the decoders), which may decode into a number of micro-instructions (micro-operations).
In the depicted configuration, processor 100 also includes on-chip interface module 110. Historically, a memory controller, which is described in more detail below, has been included in a computing system external to processor 100. In this scenario, on-chip interface 110 is to communicate with devices external to processor 100, such as system memory 175, a chipset (often including a memory controller hub to connect to memory 175 and an I/O controller hub to connect peripheral devices), a memory controller hub, a northbridge, or other integrated circuit. And in this scenario, bus 105 may include any known interconnect, such as a multi-drop bus, a point-to-point interconnect, a serial interconnect, a parallel bus, a coherent (e.g. cache coherent) bus, a layered protocol architecture, a differential bus, and a GTL bus.
Memory 175 may be dedicated to processor 100 or shared with other devices in a system. Common examples of types of memory 175 include DRAM, SRAM, non-volatile memory (NV memory), and other known storage devices. Note that device 180 may include a graphic accelerator, processor or card coupled to a memory controller hub, data storage coupled to an I/O controller hub, a wireless transceiver, a flash device, an audio controller, a network controller, or other known device.
Recently, however, as more logic and devices are being integrated on a single die, such as an SOC, each of these devices may be incorporated on processor 100. For example, in one embodiment, a memory controller hub is on the same package and/or die with processor 100. Here, a portion of the core (an on-core portion) 110 includes one or more controller(s) for interfacing with other devices such as memory 175 or a graphics device 180. The configuration including an interconnect and controllers for interfacing with such devices is often referred to as an on-core (or un-core) configuration. As an example, on-chip interface 110 includes a ring interconnect for on-chip communication and a high-speed serial point-to-point link 105 for off-chip communication. Yet, in the SOC environment, even more devices, such as the network interface, co-processors, memory 175, graphics processor 180, and any other known computer devices/interface may be integrated on a single die or integrated circuit to provide small form factor with high functionality and low power consumption.
In one embodiment, processor 100 is capable of executing a compiler, optimization, and/or translator code 177 to compile, translate, and/or optimize application code 176 to support the apparatus and methods described herein or to interface therewith. A compiler often includes a program or set of programs to translate source text/code into target text/code. Usually, compilation of program/application code with a compiler is done in multiple phases and passes to transform hi-level programming language code into low-level machine or assembly language code. Yet, single pass compilers may still be utilized for simple compilation. A compiler may utilize any known compilation techniques and perform any known compiler operations, such as lexical analysis, preprocessing, parsing, semantic analysis, code generation, code transformation, and code optimization.
Larger compilers often include multiple phases, but most often these phases are included within two general phases: (1) a front-end, i.e. generally where syntactic processing, semantic processing, and some transformation/optimization may take place, and (2) a back-end, i.e. generally where analysis, transformations, optimizations, and code generation takes place. Some compilers refer to a middle, which illustrates the blurring of delineation between a front-end and back end of a compiler. As a result, reference to insertion, association, generation, or other operation of a compiler may take place in any of the aforementioned phases or passes, as well as any other known phases or passes of a compiler. As an illustrative example, a compiler potentially inserts operations, calls, functions, etc. in one or more phases of compilation, such as insertion of calls/operations in a front-end phase of compilation and then transformation of the calls/operations into lower-level code during a transformation phase. Note that during dynamic compilation, compiler code or dynamic optimization code may insert such operations/calls, as well as optimize the code for execution during runtime. As a specific illustrative example, binary code (already compiled code) may be dynamically optimized during runtime. Here, the program code may include the dynamic optimization code, the binary code, or a combination thereof.
Similar to a compiler, a translator, such as a binary translator, translates code either statically or dynamically to optimize and/or translate code. Therefore, reference to execution of code, application code, program code, or other software environment may refer to: (1) execution of a compiler program(s), optimization code optimizer, or translator either dynamically or statically, to compile program code, to maintain software structures, to perform other operations, to optimize code, or to translate code; (2) execution of main program code including operations/calls, such as application code that has been optimized/compiled; (3) execution of other program code, such as libraries, associated with the main program code to maintain software structures, to perform other software related operations, or to optimize code; or (4) a combination thereof.
As greater numbers of cores are integrated into a single processor or system-on-chip (SoC), core interconnection and cache memory organization become determinative of performance. On-chip communication fabrics have various interconnection topologies, such as full ring and mesh topologies. In most communication fabrics, each interconnect stop is shared by a core and a local last level cache (LLC) portion or slice. These LLC slices may be dynamically and controllably organized as a private cache memory for a single core or as a shared cache memory. In private mode, the data used by a core is placed in its local LLC slice and the access latency is low. But in private mode, a core can only access a small portion of the LLC, and coherent traffic overhead is introduced, which may occupy a significant part of uncore interconnection bandwidth. On the other hand, a shared mode makes the entire LLC accessible to all cores. However, since requested data of a core could be in the LLC slice of a remote node, access latency can be very high.
Group sharing is a tradeoff mode between private and shared modes. In the group sharing mode, LLC slices are organized into a plurality of groups. LLC slices of a group are shared between the cores associated with these slices, but LLC slices of different groups are private with respect to cores outside of a given group. Each core can access LLC slices of its neighboring node(s) of the same group, and cache coherence is maintained among different groups. Currently, most group sharing modes are implemented by a physical communication fabric, which is not flexible and scalable. Instead, the given grouping is fixed on manufacture. Moreover, current group sharing modes do not solve the coherent traffic issues.
Embodiments enable dynamic sharing of a shared cache memory according to a group sharing mode in which an identifier of a node (node ID) is included as part of the addressing scheme to address the LLC. As an example, this identifier can be included within an LLC index to implement a group sharing LLC. Note further that this node ID may be independent of other indicators used to identify a given node, as the node IDs used herein are transparent to system software and thus are distinct from core identifiers, advanced programmable interrupt controller (APIC) IDs, or other identifiers used by hardware or software of a system. Further, understand that this node ID is associated with all of a node and its included components, including core and LLC slice.
Using a node identifier as part of a LLC hash index algorithm, LLC slices can be divided into several groups based at least in part on their node IDs. Then, LLC slices of a single group are shared internally as a group, and privately among different groups. Also in various embodiments, the grouping mode can be flexible to dynamically adjust a size of one or more of the groups based on various performance characteristics, such as workload runtime profiling (regarding workload characteristics), interconnect traffic, and so forth. Other examples include LLC access/miss rate, state statistics of cache lines, and so forth. Embodiments also enable savings on cache coherence traffic.
To better understand the context in which the dynamic cache grouping occurs, reference is made to
In the high level shown in
In addition, each node 210 includes at least a portion of a cache hierarchy. For example, in an embodiment each node includes a private cache memory that is only accessed by the local core. Furthermore, each node also includes a portion of a shared cache memory such as a portion of a LLC. Each node 210 further includes a router 205 that is configured to enable communications with other elements of the processor via an interconnect fabric 220. In the embodiment shown in
In the embodiment shown in
As a further illustration of a processor arrangement, reference is now made to
In the embodiment of
To enable communications with other portions of processor 300, tile 310 further includes a router 330 configured to communicate transactions and other information between components within tile 310 and other tiles and other circuitry of processor 300.
Still referring to
Different manners of including a node identifier within a cache memory addressing scheme may be realized in different embodiments. However, for purposes of illustration in one embodiment a global set index of a cache slice may be as follows: Global LLC Set Index=Local LLC Set Index+node ID.
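For purposes of illustration only, the set index relation above can be sketched as follows. The sketch assumes that the "+" denotes concatenation, with the node ID placed in the high-order bits above the local set index; the function name, parameter names, and bit placement are hypothetical and are not taken from the described embodiments.

```python
def global_set_index(node_id: int, local_set_index: int, local_index_bits: int) -> int:
    """Combine a node ID and a local LLC set index into one global set index.

    Assumption: the node ID occupies the high-order bits and the local set
    index occupies the low-order `local_index_bits` bits.
    """
    if node_id < 0:
        raise ValueError("node ID must be non-negative")
    if not 0 <= local_set_index < (1 << local_index_bits):
        raise ValueError("local set index does not fit in local_index_bits")
    return (node_id << local_index_bits) | local_set_index


# Node 3, local set 5, with 4 bits of local set index per slice.
print(global_set_index(3, 5, 4))  # (3 << 4) | 5 = 53
```

Under this assumed layout, two nodes never produce the same global set index, because the node ID bits differ.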
Referring now to
As shown in
In an embodiment, a request address is received with a tag portion, an index portion that has a width of log2(totalSetNum), and a block offset portion that has a width of log2(blockSize). By contrast, using the group addressing mode operation described herein, an access request address includes a tag portion, an index portion that has a width of log2(totalSetNum/groupNum) and that includes at least a portion of a node ID, and an offset portion that has a width of log2(blockSize).
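As a worked illustration of these field widths, the following sketch computes the index and offset widths for a conventional address and for a group-addressed one. The example sizes (16384 sets, 64-byte blocks, 4 groups) and all function names are assumptions chosen for the arithmetic, not values from the described embodiments.

```python
def log2_exact(value: int) -> int:
    """Return log2 of a positive power of two, rejecting anything else."""
    if value <= 0 or value & (value - 1):
        raise ValueError(f"{value} is not a positive power of two")
    return value.bit_length() - 1


def field_widths(total_set_num: int, block_size: int, group_num: int = 1) -> dict:
    """Widths of the index and offset portions of an access address.

    group_num=1 gives the conventional case, log2(totalSetNum); larger
    values give the group addressing case, log2(totalSetNum/groupNum).
    """
    if group_num <= 0 or total_set_num % group_num:
        raise ValueError("group_num must evenly divide total_set_num")
    return {
        "index_bits": log2_exact(total_set_num // group_num),
        "offset_bits": log2_exact(block_size),
    }


print(field_widths(16384, 64))               # conventional: 14 index bits, 6 offset bits
print(field_widths(16384, 64, group_num=4))  # grouped: 12 index bits, 6 offset bits
```

With four groups, the index portion narrows by log2(4) = 2 bits, while the offset width is unchanged.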
Referring now to
Of course while this particular addressing scheme is shown for purposes of illustration, understand that a dynamic addressing scheme may take other forms in different embodiments as different node ID schemes are possible.
Compared to a conventional LLC index, a request address contributes fewer bits to the LLC index, and the node ID contributes the group ID. As seen in the embodiment of
Combining the index algorithm with global LLC slice addressing mode, nodes having the same group ID form a group to share their LLC slices, without any change to an existing communication fabric. Using the index algorithm described here, the number of LLC cache slices to be shared (referred to herein as a group number) can be changed by adjusting the group ID field width. Such changes may be dynamically performed, e.g., by a cache controller or other control logic based on dynamic operating conditions.
Further, when the group ID sub-field width is 0, LLC slices are switched to a shared mode; and when the sub-node ID sub-field width is 0, all LLC slices are switched to private dynamically.
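The relationship between the two sub-field widths and the resulting mode can be sketched as follows. The sketch assumes that the group ID occupies the high-order bits of the node ID and the sub-node ID the low-order bits; that placement and the function names are hypothetical.

```python
def split_node_id(node_id: int, node_id_bits: int, group_id_bits: int) -> tuple:
    """Split a node ID into (group ID, sub-node ID).

    Assumption: the group ID is the high-order sub-field and the sub-node
    ID is the low-order sub-field of the node ID.
    """
    if not 0 <= group_id_bits <= node_id_bits:
        raise ValueError("group_id_bits must be between 0 and node_id_bits")
    if not 0 <= node_id < (1 << node_id_bits):
        raise ValueError("node ID does not fit in node_id_bits")
    sub_node_bits = node_id_bits - group_id_bits
    group_id = node_id >> sub_node_bits
    sub_node_id = node_id & ((1 << sub_node_bits) - 1)
    return group_id, sub_node_id


def sharing_mode(node_id_bits: int, group_id_bits: int) -> str:
    """Name the LLC mode implied by the sub-field widths."""
    if group_id_bits == 0:
        return "shared"   # group ID width 0: one group holds every slice
    if group_id_bits == node_id_bits:
        return "private"  # sub-node ID width 0: every slice is its own group
    return "group"


# 16 nodes (4-bit node IDs); node 0b1011 under three different widths.
for bits in (0, 2, 4):
    print(bits, sharing_mode(4, bits), split_node_id(0b1011, 4, bits))
```

Changing only `group_id_bits` moves the same node ID between shared, group, and private interpretations, which is the property that allows the mode to be adjusted without altering the communication fabric.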
Embodiments enable adjustments to cache groupings based on real-time profiles to match the workload characteristics. For example, when the group number is decreased, each core has a larger LLC capacity and more data can be shared. Conversely, when LLC miss traffic is low and coherent traffic is low, the group number can be increased, which causes data to be stored in LLC slices near the requestor to shorten access latency.
Thus LLC grouping mode can be dynamically adjusted. Further the groups may be arranged such that a group is formed of spatially close LLC slices (nearby the requestor) to shorten access latency. Embodiments enable workload-based improvements in uncore communication fabric efficiency. In addition, embodiments dynamically optimize the LLC organization for different LLC miss rates.
Embodiments apply equally to other group sharing modes. If a given node ID is allocated in one group, the same node ID is mapped to the same sub-node (0) in other groups, because LLC set addressing mode of each group is identical. Thus embodiments may further decrease the coherent traffic among the groups. For example, when a line of an LLC slice is written, probe-invalidate messages are only sent to the nodes with the same sub-node ID of other groups, not all the remaining nodes, which saves uncore interconnection bandwidth.
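The reduction in probe-invalidate traffic can be illustrated with a short sketch that lists which nodes would receive a message after a write. It assumes a node ID layout with the group ID in the high-order bits and the sub-node ID in the low-order bits; the layout and the function name are hypothetical.

```python
def probe_targets(writer_node_id: int, node_id_bits: int, group_id_bits: int) -> list:
    """Node IDs in other groups that share the writer's sub-node ID.

    Assumption: group ID in the high-order bits, sub-node ID in the
    low-order bits of the node ID.
    """
    if not 0 <= group_id_bits <= node_id_bits:
        raise ValueError("group_id_bits must be between 0 and node_id_bits")
    if not 0 <= writer_node_id < (1 << node_id_bits):
        raise ValueError("node ID does not fit in node_id_bits")
    sub_node_bits = node_id_bits - group_id_bits
    writer_group = writer_node_id >> sub_node_bits
    writer_sub = writer_node_id & ((1 << sub_node_bits) - 1)
    return [
        (group << sub_node_bits) | writer_sub
        for group in range(1 << group_id_bits)
        if group != writer_group
    ]


# 16 nodes in 4 groups of 4; node 6 (group 1, sub-node 2) writes a line.
print(probe_targets(0b0110, node_id_bits=4, group_id_bits=2))  # [2, 10, 14]
```

In this example only 3 nodes are probed rather than the 15 remaining nodes.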
Referring back to
As described above, the number of groups may be dynamically adjusted to realize better performance for different workloads. For example, if workloads on 16 cores have a small footprint and no data shared, the 16 cores could be divided into 8 groups or even 16 groups. Instead if workloads on the 16 cores have a large footprint and more data shared, the 16 cores could be divided into 2 groups or even 1 group. As such, an LLC indexing algorithm in accordance with an embodiment enables dynamic grouping adjustment with high flexibility and low overhead.
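To make the 16-core example concrete, the following sketch lists group membership for several group counts. It assumes consecutively numbered nodes assigned to groups in contiguous runs; the numbering scheme and the function name are hypothetical.

```python
def group_members(num_nodes: int, num_groups: int) -> list:
    """Partition node numbers 0..num_nodes-1 into equally sized groups.

    Assumption: nodes are numbered consecutively and each group takes a
    contiguous run of node numbers.
    """
    if num_groups <= 0 or num_nodes % num_groups:
        raise ValueError("num_groups must evenly divide num_nodes")
    size = num_nodes // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


for groups in (16, 8, 2, 1):
    members = group_members(16, groups)
    print(f"{groups:2d} groups -> {len(members[0]):2d} LLC slice(s) visible per core")
```

With 16 groups each core sees one slice (private mode), and with 1 group each core sees all 16 slices (shared mode).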
Referring now to
As seen in
Still referring to
Still referring to
Otherwise, if an access miss occurs in this first cache memory portion, control instead passes to block 560, where a cache coherency message is generated and sent. More specifically, this cache coherency message is sent to other cache memory portions that have a common sub-node identifier with the first cache memory portion. Note that this cache coherency message can be generated by different components such as a local control logic associated with the first cache memory portion. Alternately, a miss in the first cache memory portion can be indicated to a global cache controller which in turn generates the cache coherency traffic.
Still referring to
Referring now to
Method 600 begins by receiving performance metric information regarding workload characteristics and interconnect traffic (block 610). Such workload characteristics include large or small footprints, and/or shared or private data sets, in an embodiment. The performance information includes LLC access rates, LLC miss rates, state statistics of cache lines, in an embodiment. In an embodiment, the performance metric information may be received by one or more cores, e.g., from performance monitoring units of such cores. In turn, the interconnect traffic information may be received from an interconnect control logic.
A grouping mode may be determined based at least in part on this information (block 620). For example, a cache controller may be configured to select a group mode with a relatively small number of nodes in each group (e.g., 2) if the interconnect traffic is low and the workload has a small, private data footprint. Conversely, if the interconnect traffic is high and the workload has a large, shared data footprint, a group mode with a larger number of nodes in each group (e.g., 4 or more) may be established. Alternately, based on profiling information that indicates that most cores only need a small LLC capacity, a private mode may be configured. In private mode, requested data of a core may be cached in the local LLC slice to reduce access latency and save power for data transfer on interconnect structures.
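One possible form of such a determination is sketched below. The boolean profile fields, the order in which the conditions are tested, and the fallback of keeping the current mode are all assumptions made for illustration; an actual controller would derive its decision from measured metrics and thresholds that are not specified here.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical summary of the received performance metric information."""
    interconnect_traffic_high: bool
    large_shared_footprint: bool
    small_capacity_suffices: bool


def nodes_per_group(profile: Profile, current: int) -> int:
    """Pick the number of nodes per group (1 means private mode)."""
    if profile.small_capacity_suffices:
        return 1  # private mode: data stays in the local LLC slice
    if not profile.interconnect_traffic_high and not profile.large_shared_footprint:
        return 2  # small groups for low traffic and small private footprints
    if profile.interconnect_traffic_high and profile.large_shared_footprint:
        return 4  # larger groups for high traffic and large shared footprints
    return current  # mixed signals: keep the current grouping (assumption)


print(nodes_per_group(Profile(False, False, False), current=4))  # 2
print(nodes_per_group(Profile(True, True, False), current=2))    # 4
print(nodes_per_group(Profile(False, False, True), current=4))   # 1
```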
Still referring to
Then at block 650 the node IDs may be dynamically updated based on the grouping mode. In an embodiment, this update includes updating the node ID mapping table in the cache controller and communicating the new node IDs to the various nodes themselves (block 660) where the values may be stored in a volatile storage of the node (e.g., within a cache logic or core agent), so that they can be used in generating cache access addresses. Although shown at this high level in the embodiment of
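The update and communication steps can be sketched as a small software model. The class names, the representation of a node ID as a (group ID, sub-node ID) pair, and the flush counter that stands in for a real cache flush are hypothetical; the sketch also assumes that the cache is flushed before new node IDs take effect.

```python
class Node:
    """Hypothetical node holding its current node ID in volatile storage."""

    def __init__(self) -> None:
        self.node_id = None  # (group ID, sub-node ID) once assigned


class CacheController:
    """Hypothetical controller owning the node ID mapping table."""

    def __init__(self, num_nodes: int) -> None:
        self.nodes = [Node() for _ in range(num_nodes)]
        self.mapping_table = {}
        self.flush_count = 0

    def flush(self) -> None:
        # Placeholder: old contents were indexed under the previous node IDs.
        self.flush_count += 1

    def regroup(self, nodes_per_group: int) -> None:
        """Rebuild the mapping table and push the new node IDs to the nodes."""
        if nodes_per_group <= 0 or len(self.nodes) % nodes_per_group:
            raise ValueError("nodes_per_group must evenly divide the node count")
        self.flush()
        # divmod yields (group ID, sub-node ID) for consecutively numbered nodes.
        self.mapping_table = {
            position: divmod(position, nodes_per_group)
            for position in range(len(self.nodes))
        }
        for position, node in enumerate(self.nodes):
            node.node_id = self.mapping_table[position]


controller = CacheController(16)
controller.regroup(4)
print(controller.nodes[6].node_id)  # (1, 2): group 1, sub-node 2
```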
Turning next to
Here, SOC 2000 includes 2 cores—2006 and 2007. Similar to the discussion above, cores 2006 and 2007 may conform to an Instruction Set Architecture, such as an Intel® Architecture Core™-based processor, an Advanced Micro Devices, Inc. (AMD) processor, a MIPS-based processor, an ARM-based processor design, or a customer thereof, as well as their licensees or adopters. Cores 2006 and 2007 are coupled to cache control 2008 that is associated with bus interface unit 2009 and L2 cache 2010 to communicate with other parts of system 2000. Interconnect 2010 includes an on-chip interconnect, such as an IOSF, AMBA, or other interconnect discussed above, which potentially implements one or more aspects described herein.
Interconnect 2010 provides communication channels to the other components, such as a Subscriber Identity Module (SIM) 2030 to interface with a SIM card, a boot ROM 2035 to hold boot code for execution by cores 2006 and 2007 to initialize and boot SOC 2000, a SDRAM controller 2040 to interface with external memory (e.g. DRAM 2060), a flash controller 2045 to interface with non-volatile memory (e.g. Flash 2065), a peripheral controller 2050 (e.g. Serial Peripheral Interface) to interface with peripherals, video codecs 2020 and Video interface 2025 to display and receive input (e.g. touch enabled input), GPU 2015 to perform graphics related computations, etc. Any of these interfaces may incorporate aspects described herein.
In addition, the system illustrates peripherals for communication, such as a Bluetooth module 2070, 3G modem 2075, GPS 2080, and WiFi 2085. Also included in the system is a power controller 2055. Note as stated above, a UE includes a radio for communication. As a result, these peripheral communication modules are not all required. However, in a UE some form of a radio for external communication is to be included.
The following examples pertain to further embodiments. In one example, an apparatus includes a first portion of a cache memory to be associated with a first core, where the first cache memory portion is to be part of a distributed cache memory and to be dynamically controlled to be one of a private cache memory for the first core and a shared cache memory shared by a plurality of cores including the first core, according to an addressing mode.
In an embodiment, a cache controller is to dynamically control the first cache memory portion. The cache controller may dynamically control the first cache memory portion based at least in part on a workload executed on the first core, and further the cache controller may dynamically modify the addressing mode. The cache controller may receive a request address with a request, and generate an address to access the first cache memory portion using the request address and a node identifier associated with the first core. In an embodiment, the address comprises a tag portion, an index portion including the node identifier and a portion of the request address, and an offset portion. The node identifier includes, in an embodiment, a group identifier to identify a group of cores including the first core and a sub-node identifier to identify the first core, wherein the node identifier is a unique number and the group identifier and the sub-node identifier are non-unique numbers.
When the first cache memory portion is configured in a shared cache memory mode, the cache controller may send a cache coherency message to only cores of the plurality of cores having a common sub-node identifier with the first core. And, the cache controller may dynamically adjust the group identifier to dynamically control the first cache memory portion to be one of the private cache memory and the shared cache memory. In an embodiment, a width of the group identifier and a width of the sub-node identifier are variable based on the dynamic control. Also, the cache controller may dynamically change a node identifier of a first node including the first core and the first cache memory portion when the first cache memory portion is changed from the private cache memory to the shared cache memory.
In another example, a SoC comprises: a plurality of nodes, each node to be associated with a node identifier and to include at least one core and at least one cache slice; and a control logic to receive performance metric information, determine a grouping mode for a cache memory to be formed of the cache portions of each of the plurality of nodes based at least in part on the performance metric information, and to dynamically change the node identifier of at least some of the plurality of nodes responsive to a change in the grouping mode. The control logic may flush the cache memory responsive to a change in the grouping mode, and the node identifier may be transparent to system software. In an embodiment, the at least one cache portion is to be a private cache for the at least one core in a first grouping mode and to be a shared cache in a second grouping mode.
In one example, the control logic is to determine the grouping mode to be the second grouping mode in which N of the plurality of nodes are to operate with the shared cache when the performance metric information indicates a miss rate greater than a first threshold, and to determine the grouping mode to be the second grouping mode in which M of the plurality of nodes are to operate with the shared cache when the performance metric information indicates a miss rate less than the first threshold, where N is less than M.
In the second grouping mode, a first group of nodes are to share the at least one cache portion of each of the first group of nodes, in an embodiment. A distributed cache memory may include the at least one cache portion of each of the plurality of nodes, where the at least one cache portion of each of the plurality of nodes is dynamically controlled to be one of a private cache memory of the at least one core and a shared cache memory shared by some of the plurality of nodes.
In an embodiment, the control logic is to receive a request address with a request, and to generate an address to access a first cache portion using the request address and a node identifier associated with a first node including the first cache portion. The address comprises a tag portion, an index portion including the node identifier and a portion of the request address, and an offset portion, and the node identifier includes a group identifier to identify a group of nodes and a sub-node identifier to identify the first node, where the node identifier is a unique number and the group identifier and the sub-node identifier are non-unique numbers.
In another example, a method comprises: receiving an incoming request transaction in a system including a plurality of nodes; generating an access address using a request address of the incoming request transaction and a node identifier of a first node including a first core and a first portion of a LLC, where the node identifier is dynamically changed when a group addressing mode for the LLC changes; and accessing the first portion of the LLC using the access address.
In one example, the method further comprises determining the group addressing mode for the LLC based at least in part on one or more of performance metric information and system configuration information. Responsive to a group addressing mode change, the method includes associating a node identifier with each portion of the LLC and communicating the node identifier to each corresponding node, where the node identifier includes a group identifier and a sub-node identifier. The method may further include generating the access address including a group identifier to identify a group of nodes including the first node and a sub-node identifier to identify the first node, where the node identifier is a unique number and the group identifier and the sub-node identifier are non-unique numbers. The method may also include generating and sending a cache coherency message to other LLC portions having the same sub-node identifier as the first node responsive to a miss in the first LLC portion. In an embodiment, the cache memory may be flushed responsive to a change to the group addressing mode.
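The mode-change steps above (flush, then associate a new node identifier with each LLC portion) can be sketched as below. The numbering scheme (consecutive nodes placed in the same group), the modeling of each LLC portion as a Python dictionary, and the names `assign_node_ids` and `change_group_mode` are hypothetical; real hardware would also write back modified lines during a flush, which this sketch omits.

```python
from __future__ import annotations


def assign_node_ids(num_nodes: int, group_size: int) -> dict[int, tuple[int, int]]:
    """Map each node number to (group_id, sub_node_id).

    Assumption: consecutive nodes are placed in the same group.
    """
    if group_size <= 0 or num_nodes % group_size != 0:
        raise ValueError("group_size must evenly divide num_nodes")
    # divmod yields (node // group_size, node % group_size).
    return {node: divmod(node, group_size) for node in range(num_nodes)}


def change_group_mode(llc_portions: list[dict], group_size: int) -> dict[int, tuple[int, int]]:
    """Flush every LLC portion, then return the new node ID assignment."""
    for portion in llc_portions:
        portion.clear()  # flush modeled as dropping all cached lines
    return assign_node_ids(len(llc_portions), group_size)


# Four nodes regrouped into shared groups of two nodes each.
portions = [{0x40: "cached line"} for _ in range(4)]
new_ids = change_group_mode(portions, group_size=2)
```

With `group_size=1`, every node receives its own group identifier and a sub-node identifier of zero, which corresponds to each LLC portion acting as a private cache.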
In another example, a computer readable medium including instructions is to perform the method of any of the above examples.
In another example, an apparatus comprises means for performing the method of any one of the above examples.
According to another example, a system includes: a processor including a plurality of nodes, each having at least one core and a portion of a distributed cache memory, and a controller to dynamically update a node identifier for each of the plurality of nodes based at least in part on a grouping mode of the distributed cache memory. The system may further include a DRAM coupled to the processor. In an example, the controller is to dynamically control each portion of the distributed cache memory to be part of a shared cache memory in a first grouping mode and to be a private cache memory in a second grouping mode. The controller may receive a request address with a request to access a first distributed cache memory portion, and generate an address to access the first distributed cache memory portion using the request address and a node identifier associated with a first node including the first distributed cache memory portion, the address including a tag portion, an index portion including the node identifier and a portion of the request address, and an offset portion.
Note that the above processor can be implemented using various means.
In an example, the processor comprises a SoC incorporated in a user equipment touch-enabled device.
In another example, a system comprises a display and a memory, and includes the processor of one or more of the above examples.
In yet another example, an apparatus comprises: a first node including a first cache slice and a first core; a second node including a second cache slice and a second core; and control logic to include the first cache slice and the second cache slice in a first group based on a common node identifier for the first and second nodes. The control logic is further to include the first cache slice in the first group and the second cache slice in a second group, which is private from the first group, based on distinct node identifiers for the first node and the second node. In an embodiment, the control logic is to dynamically change between the common node identifier and the distinct node identifiers based on performance.
Understand that various combinations of the above examples are possible.
Embodiments may be used in many different types of systems. For example, in one embodiment a communication device can be arranged to perform the various methods and techniques described herein. Of course, the scope of the present invention is not limited to a communication device, and instead other embodiments can be directed to other types of apparatus for processing instructions, or one or more machine readable media including instructions that in response to being executed on a computing device, cause the device to carry out one or more of the methods and techniques described herein.
Embodiments may be implemented in code and may be stored on a non-transitory storage medium having stored thereon instructions which can be used to program a system to perform the instructions. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, solid state drives (SSDs), compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of the present invention.