Why Super Mario 64 Wastes So Much Memory

The Nintendo 64 was an amazing video game console, and alongside consoles like the Sony PlayStation it helped usher in the era of 3D games. That said, it was new hardware, with new development tools, and thus creating those early N64 games was a daunting task. In an in-depth review of Super Mario 64’s code, [Kaze Emanuar] goes over its curious and wasteful memory usage, mostly due to unused memory map sections, unoptimized math look-up tables, and greedy asset loading.
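
To give a hypothetical flavour of the look-up table point: a classic way to shrink a trigonometry table is to store only the first quarter of the sine wave and reconstruct the rest from symmetry. The sketch below shows the general idea in C++; it is an illustration, not SM64’s actual table layout or code.

```cpp
// Hypothetical sketch: a quarter-wave sine look-up table. Storing only
// 0..pi/2 and mirroring it makes the table roughly a quarter the size
// of a full-period table. Names and sizes here are illustrative.
#include <cmath>
#include <cstdint>

constexpr int QUARTER = 1024;              // entries covering 0..pi/2
static float quarterSine[QUARTER + 1];     // +1 so sin(pi/2) is exact

void initTable() {
    for (int i = 0; i <= QUARTER; ++i)
        quarterSine[i] = std::sin((3.14159265f / 2) * i / QUARTER);
}

// angle is in table units, where 4 * QUARTER equals one full turn (2*pi)
float fastSine(std::uint32_t angle) {
    const std::uint32_t a = angle & (4 * QUARTER - 1);  // wrap to one period
    const std::uint32_t idx = a % QUARTER;
    switch (a / QUARTER) {
        case 0:  return  quarterSine[idx];            // 0..pi/2: rising, positive
        case 1:  return  quarterSine[QUARTER - idx];  // pi/2..pi: falling, positive
        case 2:  return -quarterSine[idx];            // pi..3pi/2: negative
        default: return -quarterSine[QUARTER - idx];  // 3pi/2..2pi: back to zero
    }
}
```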

The game as delivered in the Japanese and North American markets also seems to have been a debug build, with unneeded code everywhere. That said, within the context of the three-year development cycle it’s not bad at all: seven programmers spent twenty months on actual development, for a system whose hardware and tooling were still being finalized, with few examples available of how to implement things like level management or a virtual camera. Over the years [Kaze] has probably spent more time combing over SM64’s code than the original developers, as evidenced by his other videos.

As noted in the video, later N64 games like The Legend of Zelda: Ocarina of Time are massively more optimized and streamlined, as lessons were learned and tooling improved. The SM64 developers, however, had a gargantuan 4 MB of fast RDRAM to work with, so optimization and memory management likely got kicked to the bottom of the priority list. Considering the absolute smash hit that SM64 became, it seems that these priorities were indeed correct.

Continue reading “Why Super Mario 64 Wastes So Much Memory”

Speed Up Arduino With Clever Coding

We love Arduino here at Hackaday; they’ve probably done more to make embedded programming accessible than anything else in the history of the field. One thing the Arduino ecosystem is rarely praised for, however, is its speed. That’s where [Playduino] comes in, with his video (embedded below) that promises to make everyone’s favourite microcontroller run 50x faster.

You might be expecting an unstable overclocking setup, with swapped crystals, tweaked voltages, and a hefty heat sink, but no! This is stock hardware. The 50x speedup comes from one simple hack: don’t use digitalWrite();

If you aren’t familiar, the digitalWrite() function is one of the key functions Arduino gives you for operating its boards: specify the pin and the value (high or low) to drive it to. It’s very easy, but it’s also very slow. [Playduino] takes a moment to show just how much is going on under the hood when you call digitalWrite(), and shows what you can do instead if you have a need for speed. (Hint: there’s no Arduino-provided code involved; hardware registers and the __asm keyword show up.)
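
To give a flavour of the difference, here’s a minimal sketch contrasting the two approaches. It assumes an ATmega328P-based board like the Uno, where digital pin 13 maps to bit 5 of PORTB; it’s an illustration of the general technique, not the code from the video.

```cpp
// Minimal sketch contrasting digitalWrite() with direct port access.
// Assumes an ATmega328P board (Uno/Nano), where pin 13 is PORTB bit 5.
void setup() {
  pinMode(13, OUTPUT);     // configuring the pin the easy way is fine
}

void loop() {
  // Slow path: digitalWrite() looks up the pin's port and bit mask in
  // flash, checks whether a PWM timer owns the pin, and briefly disables
  // interrupts on every single call.
  digitalWrite(13, HIGH);
  digitalWrite(13, LOW);

  // Fast path: write the hardware register directly. With constant
  // operands the compiler emits single SBI/CBI instructions.
  PORTB |= _BV(PB5);   // pin 13 high
  PORTB &= ~_BV(PB5);  // pin 13 low
}
```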

If you learned embedded programming in an earlier era, this will probably seem glaringly obvious. If you, like so many of us, got started inside the Arduino ecosystem, these closer-to-the-metal programming techniques could prove to be useful arrows in your quiver. Big thanks to [Stephan Walters] for the tip.

Of course, if you prefer to speed things up with hardware rather than software, you can overclock an Arduino, with liquid nitrogen even.

Continue reading “Speed Up Arduino With Clever Coding”

C++ Design Patterns For Low-Latency Applications

Although performance optimization may seem to have lost its relevance in an era of ever-increasing hardware performance, there are still many good reasons to spend some time optimizing code. A recent preprint article by [Paul Bilokon] and [Burak Gunduz] of Imperial College London focuses specifically on low-latency patterns that are relevant for applications such as high-frequency trading (HFT). In HFT, small margins are compensated for by churning through absolutely massive volumes of trades, all of which relies on extremely low latency to gain every advantage. Although FPGA-based solutions are very common in HFT due to their low latency and high parallelism, C++ is the main language being used beyond FPGAs.

Although many of the optimizations listed in the paper are quite obvious, such as prewarming the CPU caches, using constexpr, loop unrolling, and inlining, other patterns are less so, such as hotpath versus coldpath. This overlaps with the branch reduction pattern, with both patterns involving the separation of commonly and rarely executed code (like error handling and logging), improving use of the CPU’s caches and preventing branch mispredictions, as the benchmarks (using Google Benchmark) clearly demonstrate. All design patterns can also be found in the GitHub repository.
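
As a rough sketch of what hotpath/coldpath separation can look like in practice (our own illustration, not code from the paper), the rarely taken error path below is moved into a separate, never-inlined function so the hot loop stays small and cache- and predictor-friendly:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Cold path: kept out of line so its code never pollutes the hot loop's
// instruction cache. The attributes are GCC/Clang extensions.
[[gnu::cold, gnu::noinline]]
void handleError(std::uint64_t value) {
    throw std::runtime_error("unexpected value: " + std::to_string(value));
}

std::uint64_t sumValid(const std::uint64_t* data, std::size_t n) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] == UINT64_MAX) [[unlikely]]  // hint: almost never taken
            handleError(data[i]);                // cold path lives elsewhere
        sum += data[i];                          // hot path stays tight
    }
    return sum;
}
```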

Other interesting tidbits are the impact of signed versus unsigned comparisons, of mixing floating-point datatypes, and of course lock-free programming using a ring buffer design. The only things that appear to be missing from this list are aligned versus unaligned memory accesses and zero-copy optimizations, but those should be easy additions to implement and test alongside the other optimizations in this paper.
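
For reference, the core of a lock-free single-producer, single-consumer ring buffer typically looks something like the sketch below. This is a generic textbook version, not the paper’s implementation, and it assumes a power-of-two capacity so index wrapping is a cheap bit-mask:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Generic SPSC ring buffer sketch. One thread may call push(), another
// may call pop(); no locks are needed because each index has one writer.
template <typename T, std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "capacity must be a power of two");
    std::array<T, N> buf_{};
    alignas(64) std::atomic<std::size_t> head_{0};  // consumer's read index
    alignas(64) std::atomic<std::size_t> tail_{0};  // producer's write index

public:
    bool push(const T& item) {
        const auto t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N)
            return false;                           // buffer is full
        buf_[t & (N - 1)] = item;
        tail_.store(t + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    std::optional<T> pop() {
        const auto h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return std::nullopt;                    // buffer is empty
        T item = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);  // free the slot
        return item;
    }
};
```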

Using Valgrind To Analyze Code For Bottlenecks Makes Faster, Less Power-Hungry Programs

What is the right time to optimize code? This is a very good question, which usually comes down to two answers. The first answer is to have a good design for the code to begin with, because ‘optimization’ does not mean ‘fixing bad design decisions’. The second answer is that it should happen after the application has been sufficiently debugged and its developers are at risk of getting bored.

There should also be a goal for the optimization, based on what makes sense for the application. Does it need to process data faster? Should it send less data over the network or to disk? Shouldn’t one really have a look at that memory usage? And just what is going on inside those CPU caches that makes performance sometimes drop off a cliff on a single core?

All of this and more can be analyzed using tools from the Valgrind suite, including Cachegrind, Callgrind, DHAT and Massif.
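
As a small example of the kind of issue Cachegrind can make visible (a generic illustration, not code from the article), the two functions below do identical arithmetic, yet the column-major version strides through memory and misses the data cache far more often. Running the program under Cachegrind and annotating the output shows the difference immediately:

```cpp
// Build with symbols and run under Cachegrind, e.g.:
//   g++ -O2 -g strides.cpp -o strides
//   valgrind --tool=cachegrind ./strides
//   cg_annotate cachegrind.out.<pid>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 2048;

long long sumRowMajor(const std::vector<long long>& m) {
    long long sum = 0;
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            sum += m[r * N + c];   // sequential access: cache-friendly
    return sum;
}

long long sumColMajor(const std::vector<long long>& m) {
    long long sum = 0;
    for (std::size_t c = 0; c < N; ++c)
        for (std::size_t r = 0; r < N; ++r)
            sum += m[r * N + c];   // strided access: frequent cache misses
    return sum;
}

int main() {
    std::vector<long long> m(N * N, 1);
    return static_cast<int>((sumRowMajor(m) + sumColMajor(m)) & 0xff);
}
```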

Keeping Those Cores Cool

Modern-day processors are designed with low power usage in mind, regardless of whether they are aimed at servers, desktop systems, or embedded applications. This essentially means that they are in a low-power state when not doing any work (the idle loop), with some CPUs and microcontrollers turning off power to parts of the chip which are not being used. Consequently, the more the processor has to do, the more power it will use and the hotter it will get.
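
On a microcontroller, that idle loop often has to be requested explicitly. As a minimal illustration (assuming an AVR target with avr-libc; other MCUs expose equivalent wait-for-interrupt modes), the pattern looks like this:

```cpp
#include <avr/interrupt.h>
#include <avr/sleep.h>

// Sleep-when-idle pattern: the CPU core halts until the next interrupt,
// cutting power draw while peripherals keep running.
int main() {
    sei();                             // interrupts are what wake us up
    set_sleep_mode(SLEEP_MODE_IDLE);   // stop the core, keep peripherals
    for (;;) {
        sleep_mode();                  // sleep; execution resumes after an ISR
        // handle whatever the interrupt signalled, then sleep again
    }
}
```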

Continue reading “Using Valgrind To Analyze Code For Bottlenecks Makes Faster, Less Power-Hungry Programs”