accelerated-matrix-processor
HW-SW Co-Designed Stack for a custom Tensor-Core (targeting TSMC 70nm)
I’m responsible for parts of both the software and the hardware of the memory subsystem integrated into AMP0/AMP1. This involves ensuring efficient scratchpad usage and QoS through multi-layered control that starts at the PyTorch Dispatcher, passes through a software-controlled scratchpad, and ends at the systolic-array functional units (FUs).

Top-level view of AMP1
dcache

Top-level view of the 4-way, non-blocking, parameterizable D-cache with PLRU replacement.
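For readers unfamiliar with PLRU, here is a minimal behavioural sketch of tree-based pseudo-LRU for one 4-way set. It is a software model only, not the RTL in this repository; the class name `TreePLRU4` and the bit convention (0 means the victim is on the left, 1 means it is on the right) are assumptions made for illustration.

```python
class TreePLRU4:
    """Behavioural model of tree pseudo-LRU for a single 4-way set.

    Three bits form a binary tree: bits[0] is the root, bits[1] covers
    ways 0-1, bits[2] covers ways 2-3. Each bit points toward the side
    that holds the next victim (0 = left, 1 = right). Hypothetical
    convention, chosen for this sketch.
    """

    def __init__(self):
        self.bits = [0, 0, 0]

    def touch(self, way):
        """Record an access: flip the bits on the path away from `way`."""
        if not 0 <= way < 4:
            raise ValueError("way must be in 0..3")
        if way < 2:
            self.bits[0] = 1          # next victim lives in the right half
            self.bits[1] = 1 - way    # point at the sibling of `way`
        else:
            self.bits[0] = 0          # next victim lives in the left half
            self.bits[2] = 1 - (way - 2)

    def victim(self):
        """Follow the bits from the root to the way to evict."""
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]


plru = TreePLRU4()
victims = []
for way in (0, 2, 1, 3):
    plru.touch(way)
    victims.append(plru.victim())
print(victims)
```

The appeal in hardware is cost: a 4-way set needs only 3 state bits and no comparators, at the price of occasionally evicting a line that is not the true least recently used one.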

Addressing scheme inside each Bank
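As a rough illustration of how an address can be split for a banked, set-associative cache, the sketch below decodes an address into tag, set index, bank, and byte offset. The field order (offset in the low bits, then bank, then set index, then tag) and the default sizes (64-byte lines, 4 banks, 64 sets per bank) are hypothetical and may not match the scheme actually used in AMP1; the function name `split_address` is also an assumption.

```python
from typing import NamedTuple


class BankAddress(NamedTuple):
    tag: int
    set_index: int
    bank: int
    offset: int


def _log2_exact(value, name):
    """Return log2(value), insisting that value is a power of two."""
    if value <= 0 or value & (value - 1):
        raise ValueError(f"{name} must be a power of two")
    return value.bit_length() - 1


def split_address(addr, line_bytes=64, num_banks=4, sets_per_bank=64):
    """Split `addr` into (tag, set_index, bank, offset).

    Low-order interleaving: consecutive cache lines map to consecutive
    banks. All sizes are illustrative defaults, not the project's values.
    """
    offset_bits = _log2_exact(line_bytes, "line_bytes")
    bank_bits = _log2_exact(num_banks, "num_banks")
    set_bits = _log2_exact(sets_per_bank, "sets_per_bank")

    offset = addr & (line_bytes - 1)
    addr >>= offset_bits
    bank = addr & (num_banks - 1)
    addr >>= bank_bits
    set_index = addr & (sets_per_bank - 1)
    tag = addr >> set_bits
    return BankAddress(tag, set_index, bank, offset)


print(split_address(0x12345))
```

Placing the bank bits directly above the line offset spreads sequential accesses across banks, which is one common way to reduce bank conflicts for streaming workloads.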

Inside view of each Bank

Inside view of the MSHR Buffer

Showcasing Parameters
crossbar

Top-Level of the Full-Mesh and Butterfly Crossbars
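To make the trade-off between the two topologies concrete, the following sketch models destination-tag routing through a butterfly network and checks whether two requests collide on an internal link. It is a behavioural model under assumed conventions (stages resolve destination bits from most significant to least significant; the function names `butterfly_route` and `paths_conflict` are hypothetical), not a description of the crossbar RTL in this repository.

```python
def butterfly_route(src, dst, stages):
    """Return the port position after each stage for one request.

    A butterfly with 2**stages ports fixes one destination bit per
    stage, most significant bit first, so the request always arrives
    at `dst` after the last stage.
    """
    ports = 1 << stages
    if not (0 <= src < ports and 0 <= dst < ports):
        raise ValueError("src and dst must be valid port numbers")
    pos = src
    path = [pos]
    for stage in range(stages):
        mask = 1 << (stages - 1 - stage)
        pos = (pos & ~mask) | (dst & mask)  # adopt this bit of dst
        path.append(pos)
    return path


def paths_conflict(path_a, path_b):
    """True if two requests need the same link after any stage."""
    return any(a == b for a, b in zip(path_a[1:], path_b[1:]))


route = butterfly_route(5, 2, stages=3)
print(route)

# Two requests to different destinations can still block each other.
blocked = paths_conflict(butterfly_route(0, 0, 3), butterfly_route(4, 1, 3))
print(blocked)
```

A full-mesh crossbar has a dedicated path for every source and destination pair, so requests only contend when they target the same output. The butterfly uses far fewer switches but can block internally, as the second example shows.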
scratchpad + tca

Top-Level of AMP1's Scratchpad and Compute Accelerator