Draft: Gemv kernel optimizations
Allows using less than MAXM bytes. Smem is just used as a manual cache (no sharing across threads). Passing BLOCKSIZE via template argument BS.
Cast flat buffer into shape for easier arithmetic.
Allows using less than MAXM bytes. Smem is just used as a manual cache (no sharing across threads). Passing BLOCKSIZE via template argument BS.
Cast flat buffer into shape for easier arithmetic.