I ran some performance tests comparing the ChainerX implementation against NumPy and CuPy.
Some of the new chainerx.linalg routines run more slowly than their numpy.linalg and cupy.linalg counterparts.
Performance results and the code used to obtain them are available in this gist.
The most probable reason for the slowness of qr, solve, and inv is that the kernels allocate temporary arrays and then copy data from them into the routine-level arrays. Repeated transpose-copy operations (needed because ChainerX is row-major while LAPACK/cuSOLVER expect column-major ordering) also contribute. Routines that don’t require such manipulations run at nearly the same speed as NumPy/CuPy.
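To make the suspected overhead concrete, here is a minimal pure-Python sketch of the pattern described above. It is not the actual ChainerX kernel code: every function name here is hypothetical, and `column_major_kernel` is only a stand-in for an in-place LAPACK/cuSOLVER-style routine. The point is that wrapping a column-major routine around row-major data costs one temporary buffer plus two full transpose-copies per call.

```python
def to_column_major(row_major, rows, cols):
    """Copy a row-major flat buffer into a new column-major flat buffer."""
    out = [0.0] * (rows * cols)  # temporary allocation
    for i in range(rows):
        for j in range(cols):
            out[j * rows + i] = row_major[i * cols + j]
    return out


def to_row_major(col_major, rows, cols):
    """Copy a column-major flat buffer into a new row-major flat buffer."""
    out = [0.0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            out[i * cols + j] = col_major[j * rows + i]
    return out


def column_major_kernel(buf, rows, cols):
    """Hypothetical stand-in for an in-place column-major routine.

    It doubles column 0, which occupies the first `rows` entries
    of a column-major buffer.
    """
    for i in range(rows):
        buf[i] *= 2.0


def wrapped_routine(a, rows, cols):
    """Row-major in, row-major out, with a column-major kernel in between."""
    tmp = to_column_major(a, rows, cols)   # transpose-copy into a temporary
    column_major_kernel(tmp, rows, cols)   # the actual work, done in place
    return to_row_major(tmp, rows, cols)   # transpose-copy back out


a = [1.0, 2.0, 3.0,
     4.0, 5.0, 6.0]  # 2x3 matrix stored row-major
result = wrapped_routine(a, 2, 3)
```

In this sketch the kernel touches only `rows` elements, while the two layout conversions touch every element twice, so the copies dominate the cost. A routine that can consume the row-major buffer directly skips both conversions, which is consistent with the observation that such routines keep pace with NumPy/CuPy.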