Show HN: RunMat – runtime with auto CPU/GPU routing for dense math

20 points by nallana | 4 comments
Hi, I’m Nabeel. In August I released RunMat as an open-source runtime for MATLAB code that was already much faster than GNU Octave on the workloads I tried. https://news.ycombinator.com/item?id=44972919

Since then, I’ve taken it further with RunMat Accelerate: the runtime now automatically fuses operations and routes work between CPU and GPU. You write MATLAB-style code, and RunMat runs your computation across CPUs and GPUs for speed. No CUDA, no kernel code.

Under the hood, it builds a graph of your array math, fuses long chains into a few kernels, keeps data on the GPU when that helps, and falls back to CPU JIT / BLAS for small cases.
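To make that concrete, here is a toy Python sketch of the two ideas: recording a chain of elementwise ops and fusing them into one pass, then choosing a device by problem size. It is not RunMat's implementation. `LazyArray`, `fuse`, `choose_device`, and the `GPU_THRESHOLD` value are hypothetical names made up for illustration, and both "devices" run the same plain Python loop.

```python
import math

# Hypothetical cutoff: below this many elements, assume GPU launch and
# transfer overhead outweighs the benefit. A real runtime would measure this.
GPU_THRESHOLD = 1_000_000


class LazyArray:
    """Records elementwise ops instead of running them immediately."""

    def __init__(self, data, ops=()):
        self.data = data
        self.ops = tuple(ops)

    def apply(self, fn):
        # Extend the recorded chain; nothing is computed yet.
        return LazyArray(self.data, self.ops + (fn,))

    def sin(self):
        return self.apply(math.sin)

    def exp(self):
        return self.apply(math.exp)

    def tanh(self):
        return self.apply(math.tanh)


def fuse(ops):
    """Compose a chain of scalar functions into a single function."""
    def fused(x):
        for op in ops:
            x = op(x)
        return x
    return fused


def choose_device(n, threshold=GPU_THRESHOLD):
    """Pick a device label from the element count alone (a simplification)."""
    return "gpu" if n >= threshold else "cpu"


def evaluate(arr):
    kernel = fuse(arr.ops)                  # one fused "kernel" for the chain
    device = choose_device(len(arr.data))   # routing decision only
    # One pass over the data, with no intermediate arrays between ops.
    return device, [kernel(x) for x in arr.data]


if __name__ == "__main__":
    device, out = evaluate(LazyArray([0.0, 0.5, 1.0]).sin().exp().tanh())
    print(device, out)
```

The point of fusing is that `sin`, `exp`, and `tanh` are applied in a single traversal rather than materializing a temporary array after each step, which is where much of the cost goes in long elementwise chains.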

On an Apple M2 Max (32 GB), here are some current benchmarks (median of several runs):

* 5M-path Monte Carlo
  * RunMat ≈ 0.61 s
  * PyTorch ≈ 1.70 s
  * NumPy ≈ 79.9 s
  → ~2.8× faster than PyTorch and ~130× faster than NumPy on this test.

* 64 × 4K image preprocessing pipeline (mean/std, normalize, gain/bias, gamma, MSE)
  * RunMat ≈ 0.68 s
  * PyTorch ≈ 1.20 s
  * NumPy ≈ 7.0 s
  → ~1.8× faster than PyTorch and ~10× faster than NumPy.

* 1B-point elementwise chain (sin / exp / cos / tanh mix)
  * RunMat ≈ 0.14 s
  * PyTorch ≈ 20.8 s
  * NumPy ≈ 11.9 s
  → ~140× faster than PyTorch and ~80× faster than NumPy.
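For anyone checking the ratios or timing their own runs, here is a small Python sketch of a median-of-several-runs timer and the speedup arithmetic. It is an assumption about how one might measure this, not the harness in the repo. `median_seconds` and `speedup` are hypothetical helper names, and the only numbers used are the Monte Carlo timings quoted above.

```python
import statistics
import time


def median_seconds(fn, runs=5):
    """Call fn() `runs` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


if __name__ == "__main__":
    # Timings quoted in the post for the 5M-path Monte Carlo test.
    runmat_s, pytorch_s, numpy_s = 0.61, 1.70, 79.9
    print(f"vs PyTorch: {speedup(pytorch_s, runmat_s):.1f}x")
    print(f"vs NumPy:   {speedup(numpy_s, runmat_s):.0f}x")

    # Example of timing an arbitrary workload (a stand-in, not a real benchmark).
    print(f"median: {median_seconds(lambda: sum(range(100_000))):.6f} s")
```

The median is used rather than the mean so that one slow run (warm-up, thermal throttling, background load) does not skew the reported figure.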

If you want more detail on how the fusion and CPU/GPU routing work, I wrote up a longer post here: https://runmat.org/blog/runmat-accel-intro-blog

You can run the same benchmarks yourself from the GitHub repo in the main HN link. Feedback, bug reports, and “here’s where it breaks or is slow” examples are very welcome.
