multicore-5-stage-cpu
Design, Verification and Synthesis of a 5-Stage Pipelined Multicore CPU on Altera DE2-115 (>50 MHz)
This project implements a dual-core pipelined processor with a snoopy bus and split L1 caches, provides functionally verified modules, and thoroughly tests protocol edge cases. All source files and tcl scripts for this project can be found here. This project was completed as part of ECE 437, which requires that the code remain private. Please reach out if you’d like to view it and are not taking the course.
Below, I’ve formatted our Final Report (that got us an A) for the site, but want to quickly highlight that the core is tested on these custom testbenches:
- Fibonacci Sequence
- Linear Search
- Insertion Sort and Merge Sort
- Booth’s Encoding Multiplication
- Number of Days between Dates
- Producer-Consumer-based Recursive CRC ChkSum
Overview
This report presents a comparison of four variants of our RISC-V processor design: the single-cycle processor, the 5-stage pipeline without caches, the pipeline with split L1 caches, and the multicore processor. All designs execute instructions from the same subset of the RISC-V Instruction Set Architecture (ISA), but incrementally build on each other to replicate previous-generation CPUs, leading to differences in clock speed and hardware complexity.
All designs execute the same mergesort benchmark (and, where applicable, the dual.mergesort variant) to ensure a fair comparison. The single-cycle processor executes one instruction per clock cycle, leading to a straightforward design; however, because every aspect of an instruction must complete within a single cycle, its clock speed is low. The pipelined processor instead splits instruction execution into stages, each of which completes in a single cycle. The pipelined version without caches exposes the full memory-access latency, while the cached pipeline hides much of it at the cost of added hardware. Finally, the multicore design allows for parallel speedup, but suffers some coherence overhead. We expect the cached pipeline to outperform the uncached version as memory latency increases, and the multicore design to show greater speedup on the dual-threaded mergesort.
To confirm the above expectations, we run each design through the FPGA flow and collect synthesis frequency, average CPI, total execution time (across different memory-access latencies), instruction latency, and resource utilization. Section 2 details the architectural diagrams and state machines for the pipeline, caches, and coherence logic; Section 3 presents the full performance and resource-utilization results; Section 4 discusses conclusions drawn from these data; and Section 5 documents individual contributions.
Below, we provide the instruction-type breakdown for each benchmark. Note that these breakdowns are generated post-simulation, so the Core 0 and Core 1 counts differ due to real multicore coherence overhead. We use these numbers when calculating the CPI for our designs.
mergesort:
Instruction Type | Count | Percentage |
---|---|---|
R‑Type | 823 | 15.22% |
I‑Type | 834 | 15.42% |
LW | 806 | 14.90% |
SW | 691 | 12.77% |
JAL/JALR | 695 | 12.85% |
B‑Type | 1560 | 28.83% |
TOTAL | 5409 | 100% |
dual.mergesort (per core):
Instruction Type | Core 0 Count | Core 0 Percentage | Core 1 Count | Core 1 Percentage |
---|---|---|---|---|
SLL | 2 | 0.06% | 2 | 0.09% |
SRL | 2 | 0.06% | 1 | 0.04% |
SLT | 378 | 11.91% | 300 | 13.30% |
SLTU | 32 | 1.01% | 32 | 1.42% |
SUB | 3 | 0.09% | 3 | 0.13% |
ADD | 34 | 1.07% | 32 | 1.42% |
OR | 8 | 0.25% | 0 | 0.00% |
ORI | 11 | 0.35% | 9 | 0.40% |
LW | 476 | 15.00% | 332 | 14.72% |
SW | 389 | 12.26% | 303 | 13.43% |
BEQ | 863 | 27.19% | 634 | 28.10% |
BNE | 64 | 2.02% | 0 | 0.00% |
ADDI | 518 | 16.32% | 302 | 13.39% |
JAL | 389 | 12.26% | 303 | 13.43% |
JALR | 2 | 0.06% | 1 | 0.04% |
LUI | 2 | 0.06% | 1 | 0.04% |
TOTAL | 3173 | 100% | 2256 | 100% |
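As a quick aside, these percentages can be reproduced directly from the raw counts. Here is a small Python sketch (not part of the course deliverables; the dictionary simply mirrors the mergesort totals above):

```python
# Reproduce the instruction-mix percentages from the mergesort table above.
counts = {
    "R-Type": 823, "I-Type": 834, "LW": 806,
    "SW": 691, "JAL/JALR": 695, "B-Type": 1560,
}

total = sum(counts.values())  # 5409, matching the reported total
for itype, n in counts.items():
    print(f"{itype:9s} {n:5d} {100 * n / total:6.2f}%")
```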
Architecture
Figure 1 highlights the top-level block diagram of the interfaces we use throughout our design variants. The Caches block is a pass-through for the standalone single-cycle/pipeline designs, containing only basic combinational logic to signal completion of memory requests. The Datapath block contains the main processor implementation through which instructions flow.

Figure 2 visualizes the single-cycle processor design.

Figure 3 outlines our pipelined processor design. Here, every memory access in the Memory stage must wait out the full main-memory latency, stalling the pipeline whenever an instruction issues a load or store. The Forwarding and Hazard modules (within the Datapath) resolve data and control hazards while inserting as few bubbles as possible.

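Our forwarding and hazard units are RTL modules inside the Datapath; as an illustration of the policy they implement, here is a minimal Python sketch of the standard EX/MEM-over-MEM/WB forwarding priority (the function and signal names here are illustrative, not the actual module interface):

```python
# Minimal sketch of EX-stage forwarding selection (names are illustrative).
def forward_select(rs, exmem_rd, exmem_regwrite, memwb_rd, memwb_regwrite):
    """Pick the freshest source for register `rs` read in the EX stage."""
    # EX/MEM wins over MEM/WB: it holds the younger result.
    # x0 is hardwired to zero in RISC-V, so it is never forwarded.
    if exmem_regwrite and exmem_rd != 0 and exmem_rd == rs:
        return "EX/MEM"   # produced last cycle, not yet written back
    if memwb_regwrite and memwb_rd != 0 and memwb_rd == rs:
        return "MEM/WB"   # being written back this cycle
    return "REGFILE"      # no hazard: normal register-file read

# An instruction in EX needs x5, which the previous instruction produces:
assert forward_select(5, exmem_rd=5, exmem_regwrite=True,
                      memwb_rd=5, memwb_regwrite=True) == "EX/MEM"
```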
In the cached pipeline (Figure 4), we added a split L1 hierarchy: a direct-mapped instruction cache and a 2-way set-associative data cache. The number of frames and the index split of the addresses are parameterizable. Each cache uses a write-back, write-allocate policy. On an L1 miss, the data cache controller initiates either the LOAD or the WRITE-BACK FSM over the simple bus controller and stalls only the Memory stage, allowing earlier stages to continue if they do not depend on the miss. The instruction cache is read-only, but contains an FSM to load instructions on a miss.

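The tag/index/offset split that makes the caches parameterizable can be sketched behaviorally. The Python below assumes power-of-two set counts and block sizes and uses illustrative default values, not our exact RTL field widths:

```python
# Sketch of the tag/index/offset split used by a parameterizable cache.
def split_address(addr, num_sets, words_per_block, word_bytes=4):
    """Decompose a byte address into (tag, index, block offset)."""
    byte_off_bits  = word_bytes.bit_length() - 1          # 2 for 4-byte words
    block_off_bits = (words_per_block - 1).bit_length()   # word within block
    index_bits     = (num_sets - 1).bit_length()          # set selector

    word_addr = addr >> byte_off_bits
    offset = word_addr & ((1 << block_off_bits) - 1)
    index  = (word_addr >> block_off_bits) & ((1 << index_bits) - 1)
    tag    = word_addr >> (block_off_bits + index_bits)
    return tag, index, offset

# e.g. a data cache with 8 sets and 2 words per block (illustrative values):
tag, index, offset = split_address(0x1000_0040, num_sets=8, words_per_block=2)
print(hex(tag), index, offset)
```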
Our multicore design (Figure 5) instantiates two identical pipelined datapaths, highlighted as CPU0 and CPU1, each with its own split L1 caches, connected to a shared Bus Control module. We implement the MSI coherence protocol, with no additional SRAM arrays to coordinate the states of each frame. The Bus Controller’s state diagram in Figure 6 arbitrates between competing requests and broadcasts snoop responses. Each core’s data cache includes additional logic to support LR/SC atomic operations, modeled after the RISC-V specification, as shown in Figure 7.



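For reference, the per-block MSI transitions that the cache and bus-controller FSMs implement can be summarized behaviorally. The sketch below uses the textbook MSI event and action names (BusRd, BusRdX, Flush) rather than our exact signal names, and the LR/SC note at the end mirrors one common way a reservation interacts with coherence, per the RISC-V specification:

```python
# Behavioral sketch of per-block MSI transitions (event names illustrative).
# M = Modified (dirty, exclusive), S = Shared (clean), I = Invalid.
MSI = {
    # (state, event) -> (next state, bus action)
    ("I", "proc_read"):    ("S", "BusRd"),   # miss: fetch a shared copy
    ("I", "proc_write"):   ("M", "BusRdX"),  # miss: fetch and invalidate others
    ("S", "proc_read"):    ("S", None),      # hit
    ("S", "proc_write"):   ("M", "BusRdX"),  # upgrade: invalidate other copies
    ("S", "snoop_BusRdX"): ("I", None),      # another core wants to write
    ("M", "proc_read"):    ("M", None),      # hit
    ("M", "proc_write"):   ("M", None),      # hit
    ("M", "snoop_BusRd"):  ("S", "Flush"),   # supply dirty data, demote
    ("M", "snoop_BusRdX"): ("I", "Flush"),   # supply dirty data, invalidate
}

def next_state(state, event):
    # Unlisted (state, event) pairs leave the block unchanged.
    return MSI.get((state, event), (state, None))

# LR/SC rides on top of coherence: a snooped write to the reserved block
# clears the reservation, so the SC fails and software retries the loop.
reservation_valid = True
state, action = next_state("S", "snoop_BusRdX")  # another core stole the line
if state == "I":
    reservation_valid = False                    # SC will now fail
print(state, action, reservation_valid)
```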
Results
In this section, we compare the aforementioned designs on performance metrics such as frequency, cycles per instruction (CPI), instruction latency, total execution time, and FPGA resource utilization. Below, we give the formula used to calculate each metric and a short discussion of notable results, followed by tables presenting the overall results for all designs.
Maximum Clock Frequency
The maximum clock frequency is extracted from the system.log report.
\[f_{\text{max}} = \min\left(\frac{\text{CLK}}{2},\ \text{CPUCLK}\right)\]
For the pipelined processor without caches, \(f_{\text{max}} = 65.23\ \text{MHz}\).
For the pipelined processor with caches, \(f_{\text{max}} = 55.85\ \text{MHz}\).
For the final multicore processor, \(f_{\text{max}} = 50.04\ \text{MHz}\).
Average Cycles per Instruction (CPI)
CPI is calculated as:
\[\text{CPI} = \frac{\text{Total Clock Cycles}}{\text{Total Instructions Executed}}\]
where:
- Total Clock Cycles is obtained from the make system.sim command.
- Total Instructions Executed is reported at the end of the simulation (sim -t or sim -t -m).
NOTE: The total instruction count is 5409 for mergesort and 5429 for dual.mergesort.
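For example, the multicore mergesort run at LAT=0 (reported later in the Summary tables) gives:

```python
# CPI from simulator-reported totals (multicore, mergesort, LAT=0).
total_cycles = 13212
total_instructions = 5409     # mergesort instruction count
print(f"CPI = {total_cycles / total_instructions:.2f}")   # -> 2.44
```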
Total Execution Time
The total execution time for the program is given by:
\[\text{Total Execution Time} = \frac{\text{Total Clock Cycles}}{f_{\text{max}}}\ \text{s}\]
Instruction Latency
Instruction latency represents the time taken for a single instruction to complete.
\[\text{Instruction Latency} = \frac{\text{Execution Time}}{\text{Total Instructions Executed}}\ \text{s}\]
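Putting the two formulas together, here is a quick Python check against the pipelined (no-cache) design's LAT=6 row from the Summary tables below:

```python
# Execution time and per-instruction latency, pipelined (no caches), LAT=6.
total_cycles = 35035
f_max_hz = 65.23e6            # 65.23 MHz
total_instructions = 5409     # mergesort

exec_time = total_cycles / f_max_hz        # ~537 us, matching the table
latency = exec_time / total_instructions   # ~99 ns, matching the latency table
print(f"{exec_time * 1e6:.0f} us, {latency * 1e9:.1f} ns")
```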
FPGA Resource Utilization
The FPGA resource usage includes:
- Total Logic Elements – the number of Look-Up Tables (LUTs) used.
- Total Registers – the number of Flip-Flops (FFs) used.
- Total Pins – the number of I/O pins used for communication.
Summary
Now, we present tables for an easier comparison of the above metrics.
NOTE: By comparing the Pipelined processor with and without caches, one can see the RAM latency at which caches start to help performance. Even at RAM latency = 2, the caches cut mergesort's execution time from 269 us to 194 us (roughly a 28% reduction). At higher RAM latencies, the reduction is even greater.
Single-Cycle Processor (mergesort):
LAT | CPI | Total Clock Cycles | Clock Frequency (MHz) | Execution Time (us) |
---|---|---|---|---|
0 | 1.27 | 6906 | 34.22 | 202 |
2 | 2.55 | 13814 | 34.64 | 397 |
6 | 5.10 | 27628 | 34.64 | 798 |
10 | 7.66 | 41442 | 34.64 | 1196 |
Pipelined Processor (mergesort):
LAT | CPI | Total Clock Cycles | Clock Frequency (MHz) | Execution Time (us) |
---|---|---|---|---|
0 | 1.62 | 8805 | 56.93 | 155 |
2 | 3.24 | 17549 | 65.23 | 269 |
6 | 6.47 | 35035 | 65.23 | 537 |
10 | 9.71 | 52521 | 65.23 | 805 |
Pipelined Processor with Caches (mergesort):
LAT | CPI | Total Clock Cycles | Clock Frequency (MHz) | Execution Time (us) |
---|---|---|---|---|
0 | 1.81 | 9821 | 58.39 | 168 |
2 | 2.00 | 10844 | 55.85 | 194 |
6 | 2.38 | 12874 | 55.85 | 231 |
10 | 2.75 | 14902 | 55.85 | 267 |
Multi-Core Processor (single-threaded mergesort):
LAT | CPI | Total Clock Cycles | Clock Frequency (MHz) | Execution Time (us) |
---|---|---|---|---|
0 | 2.44 | 13212 | 50.04 | 320 |
2 | 2.63 | 14229 | 45.12 | 315 |
6 | 3.00 | 16264 | 45.12 | 360 |
10 | 3.38 | 18300 | 45.12 | 406 |
Multi-Core Processor (dual.mergesort):
LAT | CPI | Total Clock Cycles | Clock Frequency (MHz) | Execution Time (us) |
---|---|---|---|---|
0 | 1.73 | 9438 | 50.04 | 189 |
2 | 1.91 | 10374 | 45.12 | 230 |
6 | 2.26 | 12299 | 45.12 | 273 |
10 | 2.62 | 14248 | 45.12 | 316 |
Design | Latency for LAT=6 (ns) |
---|---|
Single‑Cycle | 148 |
Pipelined | 99 |
Pipelined w/ Caches | 43 |
Multi‑Core w/ Caches | 67 |
Design | Total Logic Elements | Total Registers | Total Pins |
---|---|---|---|
Single‑Cycle | 3505 | 1835 | 102 |
Pipelined | 3167 | 1294 | 106 |
Pipelined w/ Caches | 7208 | 4106 | 106 |
Multi‑Core w/ Caches | 14272 | 7985 | 106 |
Conclusion
Here, we discuss the improvements of each design over the initial Single-Cycle Processor. We evaluate these by calculating the Speedup between the aforementioned values, using the formula:
\[\text{Speedup \%} = \frac{\text{New Value} - \text{Old Value}}{\text{Old Value}} \times 100\%\]
NOTE: When reading the tables, please note the sign conventions highlighted in the headers.
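As a worked example of this convention, the Pipelined vs. Single-Cycle row at LAT=2 in the first table below follows from the Summary values above (a quick Python check):

```python
# Speedup entries for Pipelined vs. Single-Cycle at LAT=2.
def pct_change(new, old):
    return (new - old) / old * 100

print(f"{pct_change(65.23, 34.64):.2f}% CPUCLK increase")        # ~88.3%
print(f"{pct_change(3.24, 2.55):.2f}% CPI increase")             # ~27.1%
print(f"{-pct_change(269, 397):.2f}% execution time decrease")   # ~32.2%
```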
Pipelined vs. Single-Cycle:
LAT | CPUCLK Increase (%) | CPI Increase (%) | Execution Time Decrease (%) |
---|---|---|---|
0 | 66.36 | 27.56 | 23.27 |
2 | 88.30 | 27.06 | 32.24 |
6 | 88.30 | 26.86 | 32.71 |
10 | 88.30 | 26.76 | 32.69 |
Multi-Core (single-threaded mergesort) vs. Single-Cycle:
LAT | CPUCLK Increase (%) | CPI Increase (%) | Execution Time Decrease (%) |
---|---|---|---|
0 | 46.23 | 27.56 | –58.41 |
2 | 46.23 | 3.13 | 15.62 |
6 | 46.23 | −41.17 | 52.01 |
10 | 46.23 | –55.87 | 63.96 |
Multi-Core (dual.mergesort) vs. Single-Cycle:
LAT | CPUCLK Increase (%) | CPI Increase (%) | Execution Time Decrease (%) |
---|---|---|---|
0 | 46.23 | 36.22 | 6.435 |
2 | 44.45 | −25.09 | 47.85 |
6 | 44.45 | −55.68 | 69.17 |
10 | 44.45 | −65.79 | 76.17 |
Resource utilization increase vs. Single-Cycle:
Design | LE Increase (%) | Reg Increase (%) | Pin Increase (%) |
---|---|---|---|
Pipelined | −9.64 | −29.49 | 3.92 |
Pipelined w/ Caches | 105.69 | 123.82 | 3.92 |
Multi‑Core w/ Caches | 307.27 | 335.37 | 3.92 |
Across all latencies, our 5-stage pipeline raises the maximum clock frequency by up to 88%, at the cost of a roughly 27% CPI increase from pipeline hazards and stalls. This trade-off yields up to a 33% decrease in total execution time. By splitting instruction execution into shorter stages, the pipeline shortens the critical path, hence the higher clock rate.
Instantiating two cores and running the original mergesort as a single-threaded benchmark improves the clock rate by around 46%. However, the CPI changes are uneven: at LAT=0 the CPI increases by 27% and execution time is 58% worse, whereas at higher latencies we see execution-time reductions of over 50%. The initial slowdown at LAT=0 stems from MSI coherence overhead and a lower clock than the other pipelined variants. As RAM latency increases, memory stalls grow longer, and the multicore design's ability to overlap requests begins to help.
Running mergesort in parallel (dual.mergesort) reduces CPI by up to 66% at high latencies. Execution time falls by 6% even at LAT=0, and by up to 76% at LAT=10. This speedup reflects the effective parallelism of the workload, with both cores kept busy and local caches absorbing much of the memory latency.
Finally, our FPGA resource-utilization figures show that we actually save some hardware when moving from the Single-Cycle to the Pipelined design. This can be attributed to the pipeline registers breaking long combinational chains into simpler stages. With the addition of caches, however, we see over a 105% increase in logic elements relative to the single-cycle design, due to the SRAM-array control logic that must be embedded. In our Multi-Core design, we nearly double the hardware of the cached Pipelined design, owing to the larger Bus Controller and a full duplication of the datapath.
Overall, these results confirm that
- pipelining is a fundamental throughput optimization; it trades a small CPI penalty for a relatively greater clock-rate speedup.
- parallel hardware with coherence can show greater speedup, particularly by amortizing effective memory-access stalls; it trades greater FPGA resource utilization for overall program efficiency.