Leading the on-chip Memory Subsystem for the Atalla Tensor Core, focusing on architecture diagramming & ISA design.
Built a cycle-accurate simulator of the datapath for performance modelling using implicit-convolution and GEMM kernels.
Architected a parameterizable 2 MB Scratchpad with on-the-fly swizzling and a pipelined N × N crossbar, optimized for PPA.
Designed INT8/FP16 datapaths between the Systolic Array & Vector Core; integrating a DDR4 controller for asynchronous DRAM transfers.