I believe there is one design decision in MGL-MAT that has far reaching consequences: to make a single matrix object capable of storing multiple representations of the same data and let operations decide which representation to use based on what's the most convenient or efficient, without having to even know about all the possible representations.

This allows existing code to keep functioning if support for diagonal matrices (represented as a 1d array) lands and one can pick and choose the operations performance critical enough to implement with diagonals.

Adding support for matrices that, for instance, live on a remote machine is thus possible with a new facet type (MAT lingo for representation) and existing code would continue to work (albeit possibly slowly). Then one could optimize the bottleneck operations by sending commands over the network instead of copying data.

Contrast this with what I understand to be the status quo over on the Python side. The specialized Python array libs (cudamat, gpuarray, cudandarray) try to be drop-in replacements for - or at least similar to - numpy.ndarray with various degrees of success. There is lots of explicit conversion going on between ndarray and these CUDA blobs and adding new representations would make this exponentionally worse.

Torch (Lua) also has CUDA and non-CUDA tensors are separate types, and copying between main and GPU memory is explicit which leads to pretty much the same problems.

All of this is kind of understandable. When one thinks in terms of single-dispatch (i.e. object.method()), this kind of design will often emerge. With muliple-dispatch, data representation and operations are more loosely coupled. The facet/operation duality of MGL-MAT is reminiscent of how CLOS classes and generic functions relate to each other. The anology is best if objects are allowed to shapeshift to fit the method signatures.

Speaking of multiple-dispatch, by making the operations generic functions following some kind of protocol to decide which facets and implementation to use would decouple facets further. Ultimately, this could make the entire CUDA related part of MGL-MAT an add-on.