Draft: Gemv kernel optimizations

Allows using less than MAXM bytes. Smem is used only as a manual cache (no sharing across threads). BLOCKSIZE is now passed via the template argument BS.
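
A minimal sketch of how these pieces might fit together, not the kernel from this MR. It assumes a layout where each thread owns one output column and keeps its M partial sums in a private slice of dynamically sized shared memory, so the launch allocates space for M rows rather than a fixed maximum. The names `gemv_sketch`, `W`, `X`, `Y`, and the sizes in `main` are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: Y[m][n] = sum_k X[m][k] * W[n][k] for m < M, n < N.
// BS is the block size, known at compile time via the template argument.
template <int BS>
__global__ void gemv_sketch(const float* W, const float* X, float* Y,
                            int M, int N, int K) {
    // Dynamic shared memory: M * BS floats, sized at launch time.
    extern __shared__ float smem[];

    const int n = blockIdx.x * BS + threadIdx.x;
    if (n >= N) return;  // safe to exit early: no __syncthreads() below

    // Each thread touches only its own slots (stride BS), so smem acts as
    // a private scratch cache and no synchronization is needed.
    float* acc = smem + threadIdx.x;
    for (int m = 0; m < M; ++m) acc[m * BS] = 0.0f;

    for (int k = 0; k < K; ++k) {
        const float w = W[(size_t)n * K + k];
        for (int m = 0; m < M; ++m)
            acc[m * BS] += w * X[(size_t)m * K + k];
    }
    for (int m = 0; m < M; ++m) Y[(size_t)m * N + n] = acc[m * BS];
}

int main() {
    constexpr int BS = 128;
    const int M = 4, N = 1000, K = 256;

    float *W, *X, *Y;
    cudaMallocManaged(&W, sizeof(float) * N * K);
    cudaMallocManaged(&X, sizeof(float) * M * K);
    cudaMallocManaged(&Y, sizeof(float) * M * N);
    for (int i = 0; i < N * K; ++i) W[i] = 1.0f;
    for (int i = 0; i < M * K; ++i) X[i] = 0.5f;

    // Shared memory is sized for the actual M, not for a compile-time maximum.
    const size_t smem_bytes = sizeof(float) * M * BS;
    gemv_sketch<BS><<<(N + BS - 1) / BS, BS, smem_bytes>>>(W, X, Y, M, N, K);
    cudaDeviceSynchronize();

    // With these inputs every entry of Y should equal 0.5 * K.
    printf("Y[0][0] = %f\n", Y[0]);

    cudaFree(W); cudaFree(X); cudaFree(Y);
    return 0;
}
```

Because BS is a template argument, the stride and the grid arithmetic are compile-time constants, and the launch must use the same value for the block dimension.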

Cast the flat buffer into a shaped array for easier index arithmetic.
