Performance and Optimization

This writeup is based on the work done in our Profiling Lattice Boltzmann IPython Notebook.

Our Journey: From Python to Cython to OpenCL

The performance of Lattice Boltzmann simulations is typically measured in terms of MLUPS, or "Million Lattice Updates per Second." This means if your lattice has 1 million lattice sites and your simulation runs at a speed of 1 MLUPS, your lattice is updated once per second.
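As a concrete example, MLUPS can be computed from a timed run like this (the lattice size and timing below are made-up numbers for illustration, not our measured data):

```python
# Hypothetical timing numbers for illustration:
# a 1024 x 1024 lattice updated 500 times in 2.5 s of wall-clock time.
nx, ny = 1024, 1024
num_updates = 500
elapsed_seconds = 2.5

# MLUPS = (lattice sites * lattice updates) / (elapsed seconds * 10^6)
mlups = nx * ny * num_updates / (elapsed_seconds * 1e6)
print(mlups)
```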

During the course of our project, we moved from Python to Cython and finally to OpenCL to make our simulation faster. Regardless of implementation, our simulations returned the same macroscopic fields after the same number of iterations with equivalent input parameters; this let us check the accuracy of each implementation against the others. We found that a combination of Python and Numpy achieved about 0.5 MLUPS, Cython and Numpy about 5 MLUPS, and OpenCL running on a GTX Titan Black video card about 325 MLUPS (single precision). Relative to pure Python, OpenCL gave us a speedup of roughly a factor of 650. A plot of MLUPS and speedup vs. implementation can be seen below.

Relative to other well-established Lattice Boltzmann projects like Sailfish, our simulation fared relatively well. Depending on the hardware used, Sailfish obtained somewhere between 300 and 1400 MLUPS in single precision (see their paper here). Of the GPUs tested by the Sailfish team, the NVIDIA Tesla K20x, which obtained roughly 1200 MLUPS, is probably the closest match to our GTX Titan Black. However, the Sailfish team profiled their code with a slightly different lattice (D3Q19), while we used D2Q9, so direct comparisons are difficult. We therefore estimate that our code is roughly 5 times slower than more-developed Lattice Boltzmann software like Sailfish. Porting our code to CUDA could give us a substantial performance increase at the cost of flexibility, as CUDA only runs on NVIDIA graphics cards.

A More Detailed View of Performance

We would like to discuss some of the decisions we made when moving from Python to Cython to OpenCL and the bottlenecks in our code at each step.

Python: 0.5 MLUPS

Our simulation began in Python. Unfortunately, the code was so slow that it was difficult to test whether it was working correctly, as instabilities in our Lattice Boltzmann simulator took time to develop. Our Python code can be seen here.

We profiled our code; the output is shown below. It became clear that the move function was the bottleneck. The move function is responsible for making the "particles" hop in a given direction and consists of several nested for loops, which are extremely slow in Python. We consequently decided to cythonize the move function in hopes of better performance; we also cythonized the move_bcs method because it was easy to do so. Note that no parallelization was used at this step.
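The streaming step that dominated the profile can be sketched as nested pure-Python loops over every lattice site and direction. The function below is an illustrative stand-in for our actual move method (the D2Q9 velocity set is standard; the function itself and its periodic wrapping are a simplified sketch):

```python
import numpy as np

# Standard D2Q9 discrete velocities; f has shape (9, nx, ny).
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]

def move(f):
    """Stream each population one lattice site along its velocity.

    The triple-nested loop over (direction, x, y) is the pattern that
    made this step so slow in pure Python.
    """
    num_dirs, nx, ny = f.shape
    f_new = np.empty_like(f)
    for i in range(num_dirs):
        cx, cy = C[i]
        for x in range(nx):
            for y in range(ny):
                # Periodic wrapping at the lattice edges.
                f_new[i, (x + cx) % nx, (y + cy) % ny] = f[i, x, y]
    return f_new
```

In an interpreted loop every lattice site costs several bytecode dispatches per direction, which is why cythonizing exactly this function paid off first.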

    15058 function calls in 28.306 seconds

    Ordered by: internal time

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
         3   25.691    8.564   25.879    8.626
         3    1.441    0.480    1.441    0.480
         3    0.363    0.121    0.384    0.128
         3    0.346    0.115    0.346    0.115
     15022    0.189    0.000    0.189    0.000  {range}
         3    0.155    0.052    0.227    0.076
         3    0.072    0.024    0.072    0.024  {method 'reduce' of 'numpy.ufunc' objects}
         3    0.025    0.008    0.252    0.084
         3    0.019    0.006    0.019    0.006
         1    0.004    0.004   28.306   28.306
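Reports in this format come from Python's built-in cProfile module. A minimal recipe looks like the following, with run_simulation standing in as a placeholder for our actual driver:

```python
import cProfile
import io
import pstats

def run_simulation():
    # Placeholder for the actual Lattice Boltzmann driver.
    total = 0
    for x in range(100):
        for y in range(100):
            total += x * y
    return total

# Profile the call and sort the report by internal time,
# as in the tables shown here.
profiler = cProfile.Profile()
profiler.enable()
run_simulation()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(10)
print(stream.getvalue())
```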

Cython: 6.0 MLUPS

After cythonizing the move and move_bcs functions, our simulation speed increased by about a factor of 10. Our Cython code can be seen here. When we profiled our code, as shown below, the new bottleneck was the update_feq method, responsible for updating the equilibrium values of f. The update_feq method consists of many Numpy operations: multiplying, adding, subtracting, and slicing arrays. It seemed unlikely that we could outperform these operations in Cython (Numpy is heavily optimized), so we decided to move to OpenCL. We could have replaced the Numpy operations with parallel for loops in Cython using prange, but we were skeptical that this would significantly outperform Numpy. Consequently, no parallelization was used at this step either.
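The vectorized style of a function like update_feq can be sketched with the standard D2Q9 equilibrium distribution (this is a sketch of the general technique, not our exact implementation):

```python
import numpy as np

# Standard D2Q9 weights and velocities.
W = np.array([4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36])
C = np.array([(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)])

def update_feq(rho, ux, uy):
    """Equilibrium distributions f_eq with shape (9, nx, ny).

    f_eq_i = w_i * rho * (1 + 3 c_i.u + 4.5 (c_i.u)^2 - 1.5 |u|^2)

    Every line is a whole-array Numpy operation; this is the style
    that is hard to beat with hand-written Cython loops.
    """
    cu = C[:, 0, None, None] * ux + C[:, 1, None, None] * uy  # (9, nx, ny)
    usq = ux**2 + uy**2
    return W[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)
```

Summing f_eq over the nine directions recovers the density, and the first velocity moment recovers the momentum, which is a quick sanity check on any equilibrium update.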

    623 function calls in 15.790 seconds

    Ordered by: internal time

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        20    9.620    0.481    9.620    0.481  cython_dim.pyx:132(update_feq)
        20    2.304    0.115    2.304    0.115  cython_dim.pyx:292(collide_particles)
        20    2.179    0.109    2.179    0.109  cython_dim.pyx:255(move)
        20    1.034    0.052    1.513    0.076  cython_dim.pyx:160(update_hydro)
        20    0.479    0.024    0.479    0.024  {method 'reduce' of 'numpy.ufunc' objects}
        20    0.160    0.008    1.673    0.084  cython_dim.pyx:377(update_hydro)
        20    0.006    0.000    0.013    0.001  cython_dim.pyx:382(move_bcs)
        20    0.006    0.000    0.007    0.000  cython_dim.pyx:191(move_bcs)
         1    0.000    0.000   15.790   15.790  cython_dim.pyx:299(run)

pyOpenCL: 325 MLUPS

Relative to our Python and Cython code, our pyOpenCL code performed quite well: it was approximately 50 times faster than Cython and 650 times faster than Python. The driving Python file can be seen here, while the OpenCL kernels can be seen in the directory above, or here. Parallelization was heavily used in this part of the project.

In general, we believe that our current code is well optimized. Our OpenCL code went through several iterations, the first of which obtained roughly 50 MLUPS; it was significantly slower than our current code! Profiling our code reveals that the update_feq, move, collide_particles, update_hydro, and move_bcs kernels all take approximately the same order of magnitude of time to complete. Our code is no longer limited by one slow function; future improvements must consequently optimize each function.

    119004 function calls in 14.528 seconds

    Ordered by: internal time

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1000    4.873    0.005    4.927    0.005
      1000    3.846    0.004    3.936    0.004
      1000    2.807    0.003    2.854    0.003
      1000    1.207    0.001    1.255    0.001
      1000    1.069    0.001    1.116    0.001
      1000    0.178    0.000    1.339    0.001
      8000    0.175    0.000    0.260    0.000
      1000    0.167    0.000    1.467    0.001
      8000    0.059    0.000    0.071    0.000
      8000    0.047    0.000    0.070    0.000
      8000    0.033    0.000    0.116    0.000
     16000    0.016    0.000    0.016    0.000
     24000    0.014    0.000    0.014    0.000  {method 'pop' of 'dict' objects}
     16000    0.013    0.000    0.013    0.000  {getattr}
      8000    0.008    0.000    0.008    0.000  {isinstance}
      8000    0.006    0.000    0.006    0.000
         1    0.006    0.006   14.528   14.528

Two kernels still need to be optimized, however: the move_bcs and bounceback_in_obstacle kernels. Currently, we dispatch a 2-dimensional workgroup over the whole lattice and use if-statements to determine whether a thread is on a boundary or in an obstacle. Almost no thread passes these tests: the area of the lattice grows as N², while the number of boundary pixels grows only as N, and an obstacle typically occupies a very small fraction of the total area in a simulation. Instead of using 2-d workgroups in which very few workers actually work, we should probably dispatch 1-dimensional workgroups over buffers listing every pixel that is on a boundary and every pixel that is in an obstacle. That way, every worker contributes to the computation of boundary conditions and bounceback on obstacles.
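The host-side half of this index-buffer idea can be sketched with Numpy (a sketch of the proposed approach, not code we have written; the lattice size and obstacle are illustrative, and the actual pyOpenCL kernel dispatch is omitted):

```python
import numpy as np

# Illustrative 64 x 64 lattice with a small circular obstacle.
nx, ny = 64, 64
x, y = np.meshgrid(np.arange(nx), np.arange(ny), indexing="ij")

on_boundary = (x == 0) | (x == nx - 1) | (y == 0) | (y == ny - 1)
in_obstacle = (x - 32)**2 + (y - 32)**2 < 5**2

# Flat index buffers: each 1-D work-item handles exactly one listed
# pixel, instead of an (nx, ny) work-group where most threads fail
# the boundary/obstacle if-test and do nothing.
boundary_ids = np.flatnonzero(on_boundary).astype(np.int32)
obstacle_ids = np.flatnonzero(in_obstacle).astype(np.int32)

# The 2-D kernel launches nx*ny threads; the 1-D version would launch
# only len(boundary_ids) or len(obstacle_ids) threads.
print(len(boundary_ids), len(obstacle_ids), nx * ny)
```

Even on this small lattice the boundary and obstacle buffers are an order of magnitude smaller than the full thread grid, and the gap widens as N grows.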

We did not make these changes due to time constraints. Even if we had, and the two kernels executed instantaneously, our code's speed would only increase by about 10%. We would like to make these changes in the future.