GSoC ChainerX: Performance of new routines

I ran some performance tests to compare the ChainerX implementations against NumPy and CuPy. Some of the new chainerx.linalg routines run more slowly than their numpy.linalg and cupy.linalg counterparts. The performance results and the code used to obtain them are available in this gist.
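The gist remains the reference for the actual measurements. As a rough illustration only, a comparison of this kind could be timed with a small helper like the one below. The helper name `best_time` and the array size in the commented usage are hypothetical choices for this sketch, not taken from the gist.

```python
import timeit


def best_time(fn, *args, repeat=5, number=10):
    """Return the best observed per-call wall time (seconds) of fn(*args).

    Taking the minimum over several repeats reduces noise from other
    processes. `best_time` is a hypothetical helper for this sketch.
    """
    timings = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
    return min(timings) / number


# Hypothetical usage, assuming numpy and chainerx are installed:
#   import numpy, chainerx
#   a_np = numpy.random.rand(512, 512).astype(numpy.float32)
#   a_chx = chainerx.array(a_np)
#   ratio = (best_time(chainerx.linalg.inv, a_chx)
#            / best_time(numpy.linalg.inv, a_np))
#   # ratio > 1 means the ChainerX routine is slower on this input.
```

For GPU backends the device should be synchronized before the clock is stopped, otherwise only the kernel launch is measured rather than the computation itself.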

The most probable reason for the slowness of qr, solve, and inv is that the kernels allocate temporary arrays and then copy data from them into the routine-level arrays. Repeated transpose-copy operations also add overhead; they are needed because ChainerX stores arrays in row-major order, while LAPACK and cuSOLVER expect column-major order. Routines that don’t require such manipulations run at nearly the same speed as their NumPy/CuPy counterparts.
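To make the layout mismatch concrete, here is a minimal pure-Python sketch of what a transpose-copy does to a flat buffer. It is not the ChainerX kernel code; the function names are hypothetical and the loops only illustrate the index arithmetic involved.

```python
def row_major_to_column_major(flat, rows, cols):
    """Copy a rows x cols matrix from a row-major buffer into a new
    column-major buffer. This is the extra allocation plus copy that a
    row-major library pays before calling a column-major routine."""
    out = [None] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            # Row-major index: i * cols + j; column-major index: j * rows + i.
            out[j * rows + i] = flat[i * cols + j]
    return out


def read_column_major(flat, rows, cols):
    """Interpret a flat buffer as a rows x cols column-major matrix."""
    return [[flat[j * rows + i] for j in range(cols)] for i in range(rows)]


# A 2 x 3 matrix stored row-major, as ChainerX would hold it.
a = [1, 2, 3,
     4, 5, 6]

# Explicit conversion: a second buffer with the elements reordered.
a_colmajor = row_major_to_column_major(a, 2, 3)

# Without a copy, a column-major routine would see the same buffer
# as the 3 x 2 transpose of the original matrix.
a_seen_by_lapack = read_column_major(a, 3, 2)
```

The second case is why some routines can avoid the copy entirely: if the operation can be reformulated in terms of the transposed matrix, the row-major buffer can be passed through unchanged. NumPy's `asfortranarray` performs the same kind of layout conversion for arrays.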