August 09, 2023 — Posted by Alan Kelly, Software Engineer
One of our previous articles, Optimizing TensorFlow Lite Runtime Memory, discusses how TFLite’s memory arena minimizes memory usage by sharing buffers between tensors. This means we can run models on even smaller edge devices. In today’s article, I will describe the performance optimization of the memory arena initialization so that our users get the benefit of low memory usage with little additional overhead.
ML is normally deployed on-device as part of a larger pipeline. TFLite is used because it’s fast and lightweight, but the rest of the pipeline must also be fast. Profiling on the target device with representative data lets us identify the slowest parts of the pipeline so that we can optimize the most important part of the code.
In this article, I will describe the profiling and optimization of TFLite’s memory arena with instructions on how to use Simpleperf and visualize the results. Sample commands are given. It is assumed that the Android NDK is installed and that you have a development device that you can connect to using adb.
Simpleperf comes with some scripts to make it easier to use. run_simpleperf_on_device.py pushes simpleperf to the device and runs your binary with the given arguments.
/usr/lib/android-ndk/simpleperf/run_simpleperf_on_device.py record --call-graph fp /data/local/tmp/my_binary arg0 arg1 …
This will generate the output file perf.data which you must then copy back to your computer.
adb pull /data/local/tmp/perf.data
You then generate the binary cache which contains all the information needed later to generate a useful profile.
/usr/lib/android-ndk/simpleperf/binary_cache_builder.py -lib /your/binarys/folder -i perf.data
And generate the proto buffer used for visualization:
/usr/lib/android-ndk/simpleperf/pprof_proto_generator.py --ndk_path=/path/to/android-ndk -i perf.data -o profile.proto
You can then display the results using pprof:
pprof -http :8888 profile.proto
And open localhost:8888 in your browser to view the profile. I find flame graphs to be the most useful:
ArenaPlanner::ExecuteAllocations accounts for 54.3% of the runtime of this model. I was expecting ML operators such as fully connected layers or convolutions to be the bottleneck of this model, not runtime overhead. This is a particularly bad case: the memory arena overhead isn’t this severe for every model, but improvements here will benefit all models. This model has variable input sizes and many dynamic tensors, whose output sizes aren’t known until operator evaluation, which trigger frequent tensor re-allocations. This really is as bad as it gets. Let’s zoom in on the profile.
InterpreterInfo::num_tensors() accounts for 10.4% of the runtime. This function is so expensive because it is a virtual function which calls another function, and it is called within a loop. I would never have suspected this.
for (int i = 0; i < static_cast<int>(graph_info_->num_tensors()); ++i) {
  …
}
Arena planner does not create or destroy tensors so the number of tensors is constant. Let’s cache it.
const int num_tensors = static_cast<int>(graph_info_->num_tensors());
for (int i = 0; i < num_tensors; ++i) {
  …
}
Our next piece of low-hanging fruit is InterpreterInfo::tensor(unsigned long), another virtual function which bounds-checks the index and then returns a pointer to a tensor. Tensors are stored in an array, so let’s add a function to get a pointer to this array. The commits are here and here.
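To make the idea concrete, here is a minimal sketch under illustrative names (GraphInfo, Tensor and VisitAllTensors are placeholders, not TFLite’s actual classes): the hot loop asks for a pointer to the contiguous tensor array once instead of paying for a bounds-checked virtual call per element.

// Hypothetical sketch; illustrative names, not TFLite's actual interfaces.
#include <cstddef>

struct Tensor {
  void* data;
  size_t bytes;
};

class GraphInfo {
 public:
  virtual ~GraphInfo() = default;
  virtual size_t num_tensors() const = 0;
  // Bounds-checked virtual accessor: one virtual call per element.
  virtual Tensor* tensor(size_t index) = 0;
  // Cheaper alternative: expose the contiguous tensor array once.
  virtual Tensor* tensors() = 0;
};

void VisitAllTensors(GraphInfo* graph_info) {
  const int num_tensors = static_cast<int>(graph_info->num_tensors());
  Tensor* tensors = graph_info->tensors();  // single virtual call
  for (int i = 0; i < num_tensors; ++i) {
    Tensor& t = tensors[i];  // plain array indexing, no per-element virtual call
    t.data = nullptr;        // placeholder work on the tensor
  }
}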
After these simple changes, the runtime of this model has been reduced by 25% and the overhead of the memory allocator has been halved. Simpleperf made identifying these inefficiencies easy! Time to profile again to measure the impact of these changes.
ArenaPlanner::CalculateAllocations is now the most expensive function at 12.7%. This calls two functions: SimpleMemoryArena::Allocate and ArenaPlanner::CreateTensorAllocationVector.
Although ArenaPlanner::CreateTensorAllocationVector is the cheaper of the two, its code is far simpler, so it might be easier to optimize. This function identifies which tensors are allocated between two nodes in the graph and then sorts them by size, as a Greedy by Size allocation algorithm is used where the largest tensors are allocated first. The structure of the graph is constant, so we can store a map of the tensors allocated at each node. Instead of checking each tensor in the model to see if it is allocated between the two nodes, we can identify the tensors to be allocated by iterating through the map. The cost of ArenaPlanner::CreateTensorAllocationVector has gone from 4.8% to 0.8% of the runtime. The code can be seen here. The sort does not appear in the profile, so we ignore it.
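A hedged sketch of this approach, with illustrative names (TensorInfo, BuildAllocationMap and TensorsToAllocate are not the actual TFLite functions): the per-node lists are built once, since the graph never changes, and each planning step only gathers and sorts the tensors that actually start in the node range being allocated.

// Hypothetical sketch of precomputing per-node allocation lists.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TensorInfo {
  int32_t id;
  size_t bytes;
  int32_t first_node;  // node at which this tensor must first be allocated
};

// Built once, since the graph structure does not change:
// result[n] lists the tensors first used at node n.
std::vector<std::vector<TensorInfo>> BuildAllocationMap(
    const std::vector<TensorInfo>& tensors, int num_nodes) {
  std::vector<std::vector<TensorInfo>> tensors_by_first_node(num_nodes);
  for (const TensorInfo& t : tensors) {
    tensors_by_first_node[t.first_node].push_back(t);
  }
  return tensors_by_first_node;
}

// Per planning step: visit only the tensors allocated in [first_node, last_node]
// instead of scanning every tensor in the model, then sort them largest first
// for the greedy-by-size allocator.
std::vector<TensorInfo> TensorsToAllocate(
    const std::vector<std::vector<TensorInfo>>& tensors_by_first_node,
    int first_node, int last_node) {
  std::vector<TensorInfo> to_allocate;
  for (int n = first_node; n <= last_node; ++n) {
    for (const TensorInfo& t : tensors_by_first_node[n]) {
      to_allocate.push_back(t);
    }
  }
  std::sort(to_allocate.begin(), to_allocate.end(),
            [](const TensorInfo& a, const TensorInfo& b) { return a.bytes > b.bytes; });
  return to_allocate;
}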
The next function to look at is ArenaPlanner::ResolveTensorAllocation, which accounts for 10.9% of the runtime after the previous optimizations. This function resets each tensor’s data pointer after allocation. However, these pointers don’t always change. How about keeping track of which ones change and only updating those? After this change, ArenaPlanner::ResolveTensorAllocation no longer appears in the profile.
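A minimal sketch of the idea, assuming a hypothetical Planner class (MarkReallocated and ResolveTensorAllocations are illustrative names, not TFLite’s API): tensors whose offsets changed are recorded as they are allocated, and only those pointers are rewritten.

// Hypothetical sketch: only rewrite pointers for tensors that moved.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tensor {
  char* data = nullptr;
};

class Planner {
 public:
  // Called whenever a tensor is given a new offset in the arena.
  void MarkReallocated(int32_t tensor_id) { dirty_.push_back(tensor_id); }

  // Only tensors whose placement changed get their data pointer rewritten.
  void ResolveTensorAllocations(char* arena_base,
                                const std::vector<size_t>& offsets,
                                std::vector<Tensor>& tensors) {
    for (int32_t id : dirty_) {
      tensors[static_cast<size_t>(id)].data =
          arena_base + offsets[static_cast<size_t>(id)];
    }
    dirty_.clear();
  }

 private:
  std::vector<int32_t> dirty_;  // tensors reallocated since the last resolve
};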
Let’s now take a look at allocation and deallocation. SimpleMemoryArena::Allocate accounts for 7% and SimpleMemoryArena::Deallocate for 6.8% of the runtime. A record of each allocation in the arena is stored in a vector ordered by its offset within the memory arena. Entries in this sorted data structure are inserted, removed and searched. These operations are all O(N) in a vector. Could a std::multimap with the offset as the key be better? A multimap is needed because the records are ordered by their offsets and there may be multiple tensors with the same offset. Removal and insertion are O(log N), but search would still be O(N) as we are searching for the tensor id and not the offset. The best way to find out is to test and profile.
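The sketch below, using illustrative names (AllocRecord, ArenaRecords), shows why the choice of container matters: the vector is kept sorted by offset, so insertion shifts elements in O(N), while lookup is by tensor id rather than offset and is therefore a linear scan no matter how the records are keyed.

// Hypothetical sketch of the arena's bookkeeping, not the actual TFLite code.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct AllocRecord {
  size_t offset;      // position within the arena; the vector is sorted by this
  size_t size;
  int32_t tensor_id;
};

class ArenaRecords {
 public:
  // Insertion keeps the vector sorted by offset; shifting elements makes it O(N).
  void Insert(const AllocRecord& rec) {
    auto it = std::upper_bound(
        records_.begin(), records_.end(), rec,
        [](const AllocRecord& a, const AllocRecord& b) { return a.offset < b.offset; });
    records_.insert(it, rec);
  }

  // Lookup is by tensor id, not offset, so it is a linear scan regardless of
  // how the records are keyed; an offset-keyed std::multimap would not help here.
  const AllocRecord* Find(int32_t tensor_id) const {
    for (const AllocRecord& r : records_) {
      if (r.tensor_id == tensor_id) return &r;
    }
    return nullptr;
  }

 private:
  std::vector<AllocRecord> records_;
};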
Replacing the vector with a multimap actually slows down the arena code: it is almost three times slower than using a vector! While this goes against intuition, it is commonly found when optimizing code. Operations on a set or a map have linear or logarithmic complexities; however, there is also a constant factor in that complexity, and it is higher than the constant factor for a vector. We also iterate through the records, which is much cheaper for a vector than for a list or multimap. A list was also tested, coming in at twice as slow as a vector.
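For readers who want to reproduce the effect, here is a hypothetical micro-benchmark, not taken from the article, that times a full scan over the same records stored in a contiguous vector and in a std::multimap. The absolute numbers will vary by machine, but the contiguous layout is usually the clear winner.

// Hypothetical micro-benchmark: iteration cost of vector vs. multimap records.
#include <chrono>
#include <cstdio>
#include <map>
#include <vector>

template <typename Fn>
static double MicrosecondsFor(Fn&& fn) {
  const auto start = std::chrono::steady_clock::now();
  fn();
  return std::chrono::duration<double, std::micro>(
             std::chrono::steady_clock::now() - start).count();
}

int main() {
  constexpr int kRecords = 512;   // arbitrary record count for illustration
  constexpr int kRepeats = 10000;
  std::vector<std::pair<size_t, int>> vec;  // (offset, tensor_id), sorted by offset
  std::multimap<size_t, int> map;           // offset -> tensor_id

  for (int i = 0; i < kRecords; ++i) {
    vec.emplace_back(static_cast<size_t>(i) * 64, i);
    map.emplace(static_cast<size_t>(i) * 64, i);
  }

  size_t sink = 0;  // accumulate results so the loops are not optimized away
  const double vec_us = MicrosecondsFor([&] {
    for (int r = 0; r < kRepeats; ++r)
      for (const auto& rec : vec) sink += static_cast<size_t>(rec.second);
  });
  const double map_us = MicrosecondsFor([&] {
    for (int r = 0; r < kRepeats; ++r)
      for (const auto& rec : map) sink += static_cast<size_t>(rec.second);
  });

  std::printf("vector scan: %.1f us, multimap scan: %.1f us (sink=%zu)\n",
              vec_us, map_us, sink);
  return 0;
}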
Deallocation can still be improved though. SimpleMemoryArena::Deallocate iterates through the records and, when it finds the record to deallocate, removes it from the vector. This has O(N²) complexity. The memcpy seen in the profile above comes from the frequent calls to std::vector::erase. It is much more efficient to mark records to be erased and then to erase them in one pass using std::remove_if. The second optimization here is to look at how tensors are typically deallocated: ArenaPlanner::ResetAllocationsAfter deallocates all tensors from a node until the end of the graph. To address this, SimpleMemoryArena::DeallocateAfter(int32_t node) was added, which iterates once through all the records, marking those which are allocated after the node. SimpleMemoryArena::ResolveDeallocations erases these in one pass, making deallocation O(N). After these changes, ResetAllocationsAfter no longer appears in the profile! Commits are here and here.
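A hedged sketch of the mark-then-sweep scheme, again with illustrative names and a sentinel tensor id of -1 standing in for whatever marker the real code uses:

// Hypothetical sketch of mark-then-sweep deallocation.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct AllocRecord {
  size_t offset;
  size_t size;
  int32_t tensor_id;
  int32_t first_node;  // node at which the tensor is allocated
};

class ArenaRecords {
 public:
  // Mark, in a single pass, every record allocated at or after `node`.
  void DeallocateAfter(int32_t node) {
    for (AllocRecord& r : records_) {
      if (r.first_node >= node) r.tensor_id = -1;  // -1 marks the record for erasure
    }
  }

  // Sweep all marked records in one pass: O(N) overall, instead of one
  // std::vector::erase (and its memcpy) per deallocated tensor.
  void ResolveDeallocations() {
    records_.erase(
        std::remove_if(records_.begin(), records_.end(),
                       [](const AllocRecord& r) { return r.tensor_id == -1; }),
        records_.end());
  }

 private:
  std::vector<AllocRecord> records_;
};

std::remove_if compacts the surviving records in a single pass, so the quadratic cost of repeated std::vector::erase calls disappears.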
The profile now looks very different, with the overhead of tensor allocation down from 49.9% of the runtime to 11%. This profile already looks much more reasonable. SimpleMemoryArena::Allocate is the last function left to optimize. For each tensor, this function iterates through the vector of records trying to find space for the current allocation. The complexity of this is O(N²). This is a fundamental limitation of the Greedy by Size algorithm: efficient use of memory comes at the cost of increased overhead. Although the complexity can’t be reduced, N can. We process nodes in the order in which they are executed. Allocation information for tensors which have been deallocated at already executed nodes is not needed anymore; it only slows things down. Records are purged periodically so that only active records are considered. On a large model, this significantly reduces N. ArenaPlanner::ExecuteAllocations is no longer the most expensive function! It has gone from 11% to 6%, and a fully connected operator is now the most expensive function in the profile, which is what we expect when profiling neural network inference.
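A minimal sketch of the purging idea, assuming each record carries the last node that uses its tensor (the field and function names here are illustrative):

// Hypothetical sketch: drop records whose tensors are no longer needed once
// `current_node` executes, so the greedy gap search only scans live allocations.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct AllocRecord {
  size_t offset;
  size_t size;
  int32_t last_node;  // last node that uses this tensor
};

void PurgeRetiredRecords(std::vector<AllocRecord>& records, int32_t current_node) {
  records.erase(
      std::remove_if(records.begin(), records.end(),
                     [current_node](const AllocRecord& r) {
                       return r.last_node < current_node;
                     }),
      records.end());
}

With the retired records removed, the gap search only walks the allocations that are still live at the current node, which is what shrinks N even though the algorithm itself is unchanged.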
This is what a neural network profile should look like. Time should be spent running your model’s operators, not in the inference runtime.
The optimized memory arena is now publicly available as part of TensorFlow 2.13.
Today’s post walked you through an example of Simpleperf helping to find easy-to-fix inefficiencies in TFLite’s memory arena that would never have been found just by looking at the code. Pprof can display annotated source code, disassembly and graphs, making it easy to find the bottlenecks in your on-device pipelines.