Optimizing Matrix Operations
This investigation was inspired by my efforts to optimize computation time for a plant-canopy light-and-energy-balance model called SCOPE.  The canopy is broken down into 60 layers, each containing a distribution of leaves at different angles, represented as a matrix of 60 layers by 13  angles of inclination by 36 angles of azimuth. The model is written in MATLAB, and since MATLAB doesn't support 3D matrix operations, the original code used a FOR loop to iterate over each layer. If you're familiar with vector-oriented "scripting" languages, you will not be surprised that this method slowed things down considerably. My solution was to reshape the array into a into a 2-D matrix of 60 side-by-side 13x36 submatrices (so the new matrix is 13 x 2160), which increased the speed 10-fold.

Having done this in MATLAB I set up a simple benchmark to see how other languages dealt with this data size vs. speed issue.The answer turned out to be more complicated than I had expected.  While I had anticipated the first issue in this list, the remaining issues affecting speed were a bit of a surprise:
• interpreted vs. "built-in" operations (i.e. for-loop vs. matrix multiply)
• access to Intel's Math Kernel Library (MKL), or equivalent (part of the BLAS - Basic Linear Algebra Subprograms - library)
• size of the matrix (assuming MKL access)
• number of independent CPU cores (assuming MKL access)
• language-specific issues
• cost of allocating data
The first item is well-known: to the extent that you can take advantage of "consolidated" operations, your interpreted language (MATLAB, R, Python, etc.) will benefit.

The next three are interrelated: even though the benchmark is single-threaded, MKL will create its own threads to distribute matrix operations across all available cores if the matrix is larger than a certain size. In my benchmark multithreading kicked in somewhere between 4,500 and 45,000 elements (I didn't explicitly probe this question). So which languages use MKL?  Once again, the answer isn't simple. MATLAB uses it transparently. Python's numpy will use it if you're doing matrix operations. R, by default does not use MKL but you can either replace the BLAS in standard R or you can get Revolution R (now Microsoft R Open) in which case it will always be used. The other languages, by default, won't use MKL unless you can get particular libraries.

The last two bullet points, above, depend on the language.  As a simple example, MATLAB for-loops are notoriously slow if you append to a matrix on each iteration: in the tables below, the "best" coding method has the result matrix pre-allocated before the for-loop; the "worst" method does not. In Python it turns out that the row- and colMeans functions do not use the BLAS; using a hand-coded replacement function that computes means as a matrix operation  makes it 4-7x faster.  In R, on the other hand, the native "colMeans" function did better than the matrix algebra method even when using MKL-enhanced BLAS. In fact, R benefitted less from MKL than did Python.  Java 8 and C++ do not have built-in matrix libraries. Instead I used a native-Java matrix class that I had written for a different project (and translated to C++ for this one). The "best" speed in Java is probably not the best possible and is therefore even more impressive in its "native" speed.

The benchmark ran a simple set of functions that compute the column (or row) means of a matrix with a fixed number of columns (or rows). The functions were called using different-sized matrices, but as matrix size increased, the number of for-loop iterations decreased to keep the actual number of math operations constant -- see the third table, below.

The results show dramatic differences between the "best" and "worst" coding methods that I used, but also substantial differences between languages and even within languages, depending on matrix size vs. iterations used.

Speed (seconds) for best coding method
 MATLAB Java R Python C++ C++ debug Julia small 4.0 1.5 25.5 21.4 2.8 28.6 5.4 10x cols 1.3 1.3 4.5 2.8 2.4 15.8 3.8 10x rows 0.8 0.9 4.8 2.9 1.8 10.1 2.4 10x row rowMean 0.9 2.2 9.3 3.0 2.1 9.8 2.7 100x cols 1.1 1.7 2.6 0.8 2.5 10.1 3.9 500x cols 1.1 2.0 2.3 0.5 2.7 10.1 3.7

Speed (seconds) for "worst" coding method
 MATLAB Java R Python C++ C++ debug small 40.3 12.5 27.5 88.4 2.8 28.6 10x cols 5.6 6.4 14.0 12.1 2.4 15.8 10x rows 5.5 11.6 12.7 12.2 1.8 10.1 10x row rowMean 5.5 22.9 4.7 11.6 2.1 9.8 100x cols 1.8 6.0 13.5 4.1 2.5 10.1 500x cols 0.8 13.4 3.4 2.7 10.1

Size vs. Iterations in the previous tables - the two are balanced so the number of math ops are constant
 matrix size iterations nelements math ops matrix size KB small 13x36 6,261,780 468 2.93E+09 3.6 10x cols 13x360 626,178 4,680 2.93E+09 36.2 10x rows 130x36 626,178 4,680 2.93E+09 36.2 10x row rowMean 130x36 626,178 4,680 2.93E+09 36.2 100x cols 13x3600 62,618 46,800 2.93E+09 362.1 500x cols 13x18000 12,524 234,000 2.93E+09 1,810.4 