multicore-5-stage-cpu

Design, Verification and Synthesis of a 5-Stage Pipelined Multicore CPU on Altera DE2-115 (>50MHz)

This project implements a dual-core pipelined processor with a snooping bus and split L1 caches, provides functionally verified modules, and thoroughly tests edge cases in the coherence protocol. All source files and Tcl scripts for this project can be found here. This project was completed as part of ECE 437, which requires that the code remain private – please reach out if you’d like to view it and are not taking the course.

Below, I’ve formatted our Final Report (which earned us an A) for the site, but first want to quickly highlight that the core is tested on these custom testbenches:


Overview

This report presents a comparison of four variants of our RISC-V processor design: the single-cycle processor, the 5-stage pipeline without caches, the pipeline with split L1 caches, and the multicore processor. All designs execute instructions from the same subset of the RISC-V Instruction Set Architecture (ISA), but incrementally build on each other to replicate previous-generation CPUs, leading to differences in clock speed and hardware complexity.

All designs execute the same mergesort benchmark (and, where applicable, the dual.mergesort variant) to ensure a fair comparison. The single-cycle processor executes one instruction per clock cycle, leading to a straightforward design; however, because every aspect of an instruction must complete within a single cycle, the clock speed is low. The pipelined processor instead splits instruction execution across stages, each of which completes in one cycle. The pipelined version without caches exposes the full memory-access latency, while the cached pipeline hides much of it at the cost of added hardware. The multicore design allows for parallel speedup, but suffers some coherence overhead. We expect the cached pipeline to outperform the uncached version as memory latency increases, and the multicore design to show the greatest speedup on the dual-threaded mergesort.

To confirm these expectations, we run each design through the FPGA flow to collect synthesis frequency, average CPI, total execution time (at different memory-access latencies), instruction latency, and resource utilization. Section 2 details the architectural diagrams and state machines for the pipeline, caches, and coherence logic; Section 3 presents the full performance and resource-utilization results; Section 4 discusses conclusions drawn from these data; and Section 5 documents individual contributions.

Below, we provide the instruction-type breakdowns for each benchmark used. Note that these breakdowns are generated post-simulation and show variance between Core 0 and Core 1 counts due to real multicore coherence overhead. We use these counts when calculating the CPI of our designs.

| Instruction Type | Count | Percentage |
| --- | --- | --- |
| R-Type | 823 | 15.22% |
| I-Type | 834 | 15.42% |
| LW | 806 | 14.90% |
| SW | 691 | 12.77% |
| JAL/JALR | 695 | 12.85% |
| B-Type | 1560 | 28.83% |
| TOTAL | 5409 | 100% |

Instruction-type breakdown for mergesort

| Instruction Type | Core 0 Count | Core 0 Percentage | Core 1 Count | Core 1 Percentage |
| --- | --- | --- | --- | --- |
| SLL | 2 | 0.06% | 2 | 0.09% |
| SRL | 2 | 0.06% | 1 | 0.04% |
| SLT | 378 | 11.91% | 300 | 13.30% |
| SLTU | 32 | 1.01% | 32 | 1.42% |
| SUB | 3 | 0.09% | 3 | 0.13% |
| ADD | 34 | 1.07% | 32 | 1.42% |
| OR | 8 | 0.25% | 0 | 0.00% |
| ORI | 11 | 0.35% | 9 | 0.40% |
| LW | 476 | 15.00% | 332 | 14.72% |
| SW | 389 | 12.26% | 303 | 13.43% |
| BEQ | 863 | 27.19% | 634 | 28.10% |
| BNE | 64 | 2.02% | 0 | 0.00% |
| ADDI | 518 | 16.32% | 302 | 13.39% |
| JAL | 389 | 12.26% | 303 | 13.43% |
| JALR | 2 | 0.06% | 1 | 0.04% |
| LUI | 2 | 0.06% | 1 | 0.04% |
| TOTAL | 3173 | 100% | 2256 | 100% |

Instruction-type breakdown for dual.mergesort

Architecture

Figure 1 highlights the top-level block diagram of the interfaces we use throughout our design variants. The Caches block is a pass-through for the standalone single-cycle/pipeline designs, containing only basic combinational logic to signal completion of memory requests. The Datapath block contains the main processor implementation through which instructions flow.

Figure 1: Processor Interfaces

Figure 2 visualizes the single-cycle processor design.

Figure 2: SingleCycle Processor Block Diagram

Figure 3 outlines our pipelined processor design. Here, every memory access in the Memory stage must wait out the main-memory latency, stalling the pipeline whenever an instruction issues a load or store. The Forwarding and Hazard modules (within Datapath) resolve data and control hazards while inserting as few bubbles as possible.

Figure 3: Pipelined Processor Block Diagram
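
To make the forwarding decision concrete, here is a minimal SystemVerilog sketch of an EX-stage forwarding unit. All signal names are hypothetical (the actual course RTL is private); it assumes a classic 5-stage organization where MEM- and WB-stage results can be routed back to the ALU inputs.

```systemverilog
// Minimal sketch of an EX-stage forwarding unit. Signal names are
// hypothetical; the real Datapath modules are course-private.
module forward_unit (
  input  logic [4:0] ex_rs1, ex_rs2,      // source registers of the instr in EX
  input  logic [4:0] mem_rd, wb_rd,       // destination registers of older instrs
  input  logic       mem_regwrite, wb_regwrite,
  output logic [1:0] fwd_a, fwd_b         // 00: regfile, 01: from MEM, 10: from WB
);
  always_comb begin
    fwd_a = 2'b00;
    fwd_b = 2'b00;
    // The MEM-stage producer is younger than WB, so it takes priority.
    if      (mem_regwrite && (mem_rd != 5'd0) && (mem_rd == ex_rs1)) fwd_a = 2'b01;
    else if (wb_regwrite  && (wb_rd  != 5'd0) && (wb_rd  == ex_rs1)) fwd_a = 2'b10;
    if      (mem_regwrite && (mem_rd != 5'd0) && (mem_rd == ex_rs2)) fwd_b = 2'b01;
    else if (wb_regwrite  && (wb_rd  != 5'd0) && (wb_rd  == ex_rs2)) fwd_b = 2'b10;
  end
endmodule
```

The x0 comparison against zero reflects the RISC-V convention that register 0 is never a real destination.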

In the cached pipeline (Figure 4), we added a split L1 hierarchy: a direct-mapped instruction cache and a 2-way set-associative data cache. The number of frames and the index split of the address are parameterizable. Each cache uses a write-back, write-allocate policy. On an L1 miss, the data cache controller initiates either the LOAD or the WRITE-BACK FSM sequence over the simple bus controller and stalls only the Memory stage, allowing earlier stages to continue if they do not depend on the miss. The instruction cache is read-only, but contains an FSM to load instructions on a miss.

Figure 4: Caches Block Diagram
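
To illustrate the parameterizable index split mentioned above, below is a minimal sketch of how a cache might derive the tag/index/offset fields from a 32-bit address. Parameter and port names are illustrative, not the project's actual RTL, and it assumes at least 2 sets and 2 words per block so the field widths are nonzero.

```systemverilog
// Minimal sketch of a parameterizable tag/index/offset address split.
// Names are hypothetical; the real cache RTL is course-private.
module cache_addr_split #(
  parameter int SETS      = 8,   // indexable frames per way
  parameter int BLK_WORDS = 2    // 32-bit words per block
)(
  input  logic [31:0]                                  addr,
  output logic [$clog2(SETS)-1:0]                      index,
  output logic [$clog2(BLK_WORDS)-1:0]                 blk_off,
  output logic [31-$clog2(SETS)-$clog2(BLK_WORDS)-2:0] tag
);
  localparam int OFF_W = $clog2(BLK_WORDS) + 2;  // +2 skips the byte offset

  assign blk_off = addr[OFF_W-1 : 2];                 // word within the block
  assign index   = addr[OFF_W +: $clog2(SETS)];       // selects the frame set
  assign tag     = addr[31 : OFF_W + $clog2(SETS)];   // compared against stored tags
endmodule
```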

Our multicore design (Figure 5) instantiates two identical pipelined datapaths, labeled CPU0 and CPU1, each with its own split L1 caches, connected to a shared Bus Control module. We implement the MSI coherence protocol without any additional SRAM arrays to coordinate the state of each frame. The Bus Controller’s state diagram in Figure 6 arbitrates between competing requests and broadcasts snoop responses. Each core’s data cache includes additional logic to support LR/SC atomic operations, modeled after the RISC-V specification, as shown in Figure 7.

Figure 5: Multicore Block Diagram
Figure 6: Bus Control Finite-State-Machine
Figure 7: D-Cache Finite-State-Machine
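
Below is a minimal sketch of the per-frame MSI transitions that Figures 6 and 7 implement. Signal names are hypothetical, and bus arbitration, write-back data movement, and the LR/SC reservation register are omitted for brevity.

```systemverilog
// Minimal sketch of per-frame MSI state tracking. Names are hypothetical;
// arbitration, write-backs, and LR/SC handling are omitted.
typedef enum logic [1:0] {INVALID, SHARED, MODIFIED} msi_t;

module msi_frame (
  input  logic clk, n_rst,
  input  logic cpu_read, cpu_write,    // this core's access allocates/hits the frame
  input  logic snoop_read, snoop_rdx,  // the other core's bus transaction hits the frame
  output msi_t state
);
  msi_t next;

  always_ff @(posedge clk, negedge n_rst)
    if (!n_rst) state <= INVALID;
    else        state <= next;

  always_comb begin
    next = state;
    case (state)
      INVALID:  if (cpu_write)       next = MODIFIED;  // BusRdX: fetch with intent to modify
                else if (cpu_read)   next = SHARED;    // BusRd: fetch a shared copy
      SHARED:   if (cpu_write)       next = MODIFIED;  // upgrade, invalidating other sharers
                else if (snoop_rdx)  next = INVALID;   // other core wants exclusive access
      MODIFIED: if (snoop_rdx)       next = INVALID;   // supply dirty data, then invalidate
                else if (snoop_read) next = SHARED;    // supply dirty data, demote to shared
      default:  next = INVALID;
    endcase
  end
endmodule
```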

Results

In this section, we compare the aforementioned designs based on performance metrics such as frequency, cycles per instruction (CPI), instruction latency, total execution time, and FPGA resource utilization. Below, we outline the formula used to calculate each metric along with a short discussion, followed by tables presenting our overall results for all designs.

Maximum Clock Frequency

The maximum clock frequency is extracted from the system.log report.

\[f_{\text{max}} = \min\left(\frac{\text{CLK}}{2}, \text{CPUCLK} \right)\]

For the pipelined processor without caches, \(f_{\text{max}} = 65.23\ \text{MHz}\).

For the pipelined processor with caches, \(f_{\text{max}} = 55.85\ \text{MHz}\).

For the final multicore processor, \(f_{\text{max}} = 50.04\ \text{MHz}\).

Average Cycles per Instruction (CPI)

CPI is calculated as:

\[\text{CPI} = \frac{\text{Total Clock Cycles}}{\text{Total Instructions Executed}}\]

where:

  • Total Clock Cycles is obtained from the make system.sim command.
  • Total Instructions Executed is reported at the end of the simulation (sim -t or sim -t -m).

NOTE: The total instruction count is 5409 for mergesort and 5429 for dual.mergesort.
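
For example, plugging in the cached pipeline’s LAT = 2 run from the Summary tables below (10844 cycles over 5409 mergesort instructions):

\[\text{CPI} = \frac{10844}{5409} \approx 2.00\]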

Total Execution Time

The total execution time for the program is given by:

\[\text{Total Execution Time} = \frac{\text{Total Clock Cycles}}{f_{\text{max}}}\]
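
Continuing the same example, at \(f_{\text{max}} = 55.85\ \text{MHz}\):

\[\text{Total Execution Time} = \frac{10844}{55.85 \times 10^{6}\ \text{Hz}} \approx 194\ \mu\text{s}\]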

Instruction Latency

Instruction latency represents the average time taken for a single instruction to complete.

\[\text{Latency} = \frac{\text{Execution Time}}{\text{Total Instructions Executed}}\]
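
For instance, the cached pipeline at LAT = 6 completes 5409 instructions in 231 µs:

\[\text{Latency} = \frac{231\ \mu\text{s}}{5409} \approx 43\ \text{ns}\]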

FPGA Resource Utilization

The FPGA resource usage includes:

  • Total Logic Elements – Represents the number of Look-Up Tables (LUTs) used.
  • Total Registers – Represents the number of Flip-Flops (FFs) used.
  • Total Pins – Represents the number of I/O pins used for communication.

Summary

Now, we present tables for an easier comparison of the above metrics.

NOTE: By comparing the Pipelined processor with and without caches, one can find the RAM latency at which caches start helping performance. Even at RAM latency = 2, the caches reduce the execution time of the mergesort algorithm from 269 µs to 194 µs. At higher RAM latencies, the reduction in execution time is even greater.
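
As a rough cross-check, the cycle counts in the tables below grow approximately linearly in LAT: about \(8805 + 4372\,\text{LAT}\) cycles for the uncached pipeline versus \(9821 + 512\,\text{LAT}\) for the cached one. Equating execution times at their respective clock frequencies,

\[\frac{8805 + 4372\,L}{65.23\ \text{MHz}} = \frac{9821 + 512\,L}{55.85\ \text{MHz}} \implies L \approx 0.7,\]

so the caches already pay for themselves at any RAM latency of one cycle or more.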

| LAT | CPI | Total Clock Cycles | Clock Frequency (MHz) | Execution Time (µs) |
| --- | --- | --- | --- | --- |
| 0 | 1.27 | 6906 | 34.22 | 202 |
| 2 | 2.55 | 13814 | 34.64 | 397 |
| 6 | 5.10 | 27628 | 34.64 | 798 |
| 10 | 7.66 | 41442 | 34.64 | 1196 |

Performance Metrics for the Single-Cycle Processor on mergesort

| LAT | CPI | Total Clock Cycles | Clock Frequency (MHz) | Execution Time (µs) |
| --- | --- | --- | --- | --- |
| 0 | 1.62 | 8805 | 56.93 | 155 |
| 2 | 3.24 | 17549 | 65.23 | 269 |
| 6 | 6.47 | 35035 | 65.23 | 537 |
| 10 | 9.71 | 52521 | 65.23 | 805 |

Performance Metrics for the Pipelined Processor on mergesort

| LAT | CPI | Total Clock Cycles | Clock Frequency (MHz) | Execution Time (µs) |
| --- | --- | --- | --- | --- |
| 0 | 1.81 | 9821 | 58.39 | 168 |
| 2 | 2.00 | 10844 | 55.85 | 194 |
| 6 | 2.38 | 12874 | 55.85 | 231 |
| 10 | 2.75 | 14902 | 55.85 | 267 |

Performance Metrics for the Pipelined Processor w/ Caches on mergesort

| LAT | CPI | Total Clock Cycles | Clock Frequency (MHz) | Execution Time (µs) |
| --- | --- | --- | --- | --- |
| 0 | 2.44 | 13212 | 50.04 | 320 |
| 2 | 2.63 | 14229 | 45.12 | 315 |
| 6 | 3.00 | 16264 | 45.12 | 360 |
| 10 | 3.38 | 18300 | 45.12 | 406 |

Performance Metrics for the Multi-Core Processor on mergesort

| LAT | CPI | Total Clock Cycles | Clock Frequency (MHz) | Execution Time (µs) |
| --- | --- | --- | --- | --- |
| 0 | 1.73 | 9438 | 50.04 | 189 |
| 2 | 1.91 | 10374 | 45.12 | 230 |
| 6 | 2.26 | 12299 | 45.12 | 273 |
| 10 | 2.62 | 14248 | 45.12 | 316 |

Performance Metrics for the Multi-Core Processor on dual.mergesort

| Design | Latency at LAT=6 (ns) |
| --- | --- |
| Single-Cycle | 148 |
| Pipelined | 99 |
| Pipelined w/ Caches | 43 |
| Multi-Core w/ Caches | 67 |

Instruction Latency Comparison

| Design | Total Logic Elements | Total Registers | Total Pins |
| --- | --- | --- | --- |
| Single-Cycle | 3505 | 1835 | 102 |
| Pipelined | 3167 | 1294 | 106 |
| Pipelined w/ Caches | 7208 | 4106 | 106 |
| Multi-Core w/ Caches | 14272 | 7985 | 106 |

FPGA Resource Utilization Comparison

Conclusion

Here, we discuss the improvements of each design over the initial Single-Cycle Processor. We evaluate these by calculating the Speedup between the aforementioned values, using this formula:

\[\text{Speedup \%} = \frac{\text{New Value} - \text{Old Value}}{\text{Old Value}} \times 100\%\]
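
For example, the pipelined processor’s clock-frequency gain over the single-cycle design at LAT ≥ 2:

\[\text{Speedup \%} = \frac{65.23 - 34.64}{34.64} \times 100\% \approx 88.3\%\]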

NOTE: When reading the tables, please follow the sign conventions given in the headers.

| LAT | CPUCLK Increase (%) | CPI Increase (%) | Execution Time Decrease (%) |
| --- | --- | --- | --- |
| 0 | 66.36 | 27.56 | 23.27 |
| 2 | 88.30 | 27.06 | 32.24 |
| 6 | 88.30 | 26.86 | 32.71 |
| 10 | 88.30 | 26.76 | 32.69 |

Pipelined (over Single-Cycle) Processor Speedup on mergesort

| LAT | CPUCLK Increase (%) | CPI Increase (%) | Execution Time Decrease (%) |
| --- | --- | --- | --- |
| 0 | 46.23 | 27.56 | −58.41 |
| 2 | 46.23 | 3.13 | 15.62 |
| 6 | 46.23 | −41.17 | 52.01 |
| 10 | 46.23 | −55.87 | 63.96 |

Multi-Core (over Single-Cycle) Processor Speedup on mergesort

| LAT | CPUCLK Increase (%) | CPI Increase (%) | Execution Time Decrease (%) |
| --- | --- | --- | --- |
| 0 | 46.23 | 36.22 | 6.44 |
| 2 | 44.45 | −25.09 | 47.85 |
| 6 | 44.45 | −55.68 | 69.17 |
| 10 | 44.45 | −65.79 | 76.17 |

Multi-Core (over Single-Cycle) Processor Speedup on dual.mergesort

| Design | LE Increase (%) | Reg Increase (%) | Pin Increase (%) |
| --- | --- | --- | --- |
| Pipelined | −9.64 | −29.49 | 3.92 |
| Pipelined w/ Caches | 105.69 | 123.82 | 3.92 |
| Multi-Core w/ Caches | 307.27 | 335.37 | 3.92 |

Percentage Increase in FPGA Resource Usage (over Single-Cycle)

Across all latencies, our 5-stage pipeline raises the maximum clock frequency by up to 88%, at the cost of a roughly 27% CPI increase from pipeline hazards and stalls. This trade-off yields up to a 33% decrease in total execution time. By splitting instruction execution into shorter stages, the pipeline shortens the critical path, hence the higher clock rate.

Instantiating two cores and running the original mergesort as a single-threaded benchmark improves the clock rate by around 46%. However, the CPI changes are uneven: at LAT=0 the CPI increases by 27% and execution is 58% slower, whereas at higher latencies we gain a 50+% reduction in execution time. The initial slowdown at LAT=0 stems from MSI-coherence overhead and the lower maximum frequency. As RAM latency increases, memory stalls grow longer, and the Multi-Core design’s ability to overlap requests begins to help.

Running mergesort in parallel as dual.mergesort reduces CPI by up to 66% at high latencies. Execution time falls by 6% even at LAT=0, and by up to 76% at LAT=10. This speedup reflects the workload’s effective parallelism, with both cores kept busy and the local caches absorbing much of the memory latency.

Finally, our FPGA resource-utilization figures highlight that we mildly save on hardware when moving from the Single-Cycle to the Pipelined design. This can be attributed to the pipeline registers breaking up long combinational chains. However, with the addition of caches, we see over a 105% increase in logic elements relative to the single-cycle design, due to the SRAM arrays and their control logic that must be embedded. In our Multi-Core design, we roughly double the hardware resources of the cached Pipelined design, owing to the larger Bus Controller and the duplication of the complete datapath.

Overall, these results confirm that

  • pipelining is a fundamental throughput optimization; it trades a small CPI penalty for a relatively greater clock-rate speedup.
  • parallel hardware with coherence can show even greater speedup, particularly by amortizing effective memory-access stalls; it trades greater FPGA resource utilization for overall program efficiency.