AN4777: How to optimize power consumption on STM32 MCUs (STMicroelectronics)
Application note
Introduction
This application note applies to the X-CUBE-REF-PM expansion package for STM32Cube, which includes power mode
examples for STM32G0 series, STM32L0 series, STM32L1 series and STM32L4 series microcontrollers.
Low power consumption is the main advantage of low-power STM32 microcontrollers. The firmware example related to this
application note provides helpful hints on achieving the datasheet levels of power consumption and a simple framework to ease
further experimentation with different configurations.
The low-power STM32 microcontrollers have a rich variety of configuration options regarding the flash memory interface.
While the STM32G0 is not labeled as a low-power series, its feature set is similar, and it is a small device with low power consumption.
This application note showcases the different settings under various test conditions, provides guidelines for optimizing
power efficiency, and focuses in particular on the influence of memory subsystem settings on execution efficiency. This
subject is covered at the same level of detail as in the product datasheets.
Reference documents
The reference documents are available on the STMicroelectronics website, www.st.com:
• Ultra-low-power STM32L0x3 advanced Arm®-based 32-bit MCUs reference manual (RM0367)
• STM32L100xx, STM32L151xx, STM32L152xx and STM32L162xx advanced Arm®-based 32-bit MCUs reference manual
(RM0038)
• STM32L4x6 advanced Arm®-based 32-bit MCUs reference manual (RM0411)
• STM32G0x1 advanced Arm®-based 32-bit MCUs reference manual (RM0444)
1 General information
2 Definitions
Term Description
3 System architecture
The memory interface manages the read and write accesses from the core/bus matrix towards the nonvolatile
memory. This holds for both the instruction and data access.
For configuring the nonvolatile memory read access during the program execution, the configuration flags are
accessible in the access control register.
The latency serves the purpose of reducing the rate at which the NVM is read. An extra wait cycle must be
enabled for a system clock higher than 16 MHz for the highest voltage regulator range. For lower core voltages,
this threshold frequency goes lower.
To compensate for this bandwidth deficiency, a prefetch can be configured. The memory controller then attempts to
have the next instruction ready before the core requests it.
The STM32L1 flash memory interface can use a 64-bit read access internally to serve the core with
data and instructions at close to its own pace. The extra 32 bits are used by the prefetch to load the next instruction
and provide it to the core immediately when needed.
The STM32L0 flash memory interface does not have the 64-bit wide bus, but the memory controller is capable of
data preread. This simple buffer is similar to the prefetch, but works also for data.
The STM32L4 flash memory interface has a full 64-bit wide (plus 8-bit ECC) connection to the bus matrix, shared
between data and instruction. The flash memory interface incorporates an ART Accelerator, a prefetch
mechanism and a cache designed to minimize the effect of memory latency. The flash memory interface is then
capable of transferring data and instruction simultaneously, under the condition that they are ready in the cache.
The STM32G0 flash memory interface features prefetch and an instruction cache, though smaller than on the STM32L4. No
cache is available for data reads. It handles one or two banks of flash memory, much like
the STM32L4. The native word width is 64 bits plus 8 bits of ECC.
All the performance improvements resulting from the memory interface settings come at the cost of increased
power consumption. Access with no latency, no preread, no cache, and no prefetch is used in the low-power
mode. The following section sheds light on the kind of tradeoffs that the performance improvements represent.
4 Low-power modes
The bulk of this application note and its main focus are the run modes and efficiency of the code execution. This
is the main added value of this application note over the information covered in the datasheets.
For the sake of completeness, the low-power modes must however be mentioned. These are the states in which
the CPU core cannot execute any code and only a selected subset of peripherals is active.
The following table compares the low-power modes between the MCU series:
Table 2. Low-power mode brief comparison
MCU series | STM32L0 series/STM32L1 series | STM32L4 series | STM32G0 series
Sleep modes | Either main or low-power regulator, flash memory clock off with low-power sleep | Low-power regulator on, main regulator configurable, flash memory clock configurable | Either main or low-power regulator, flash memory state in low-power mode configurable
Stop modes | Single stop mode | Stop0, Stop1, and Stop2 steps | Stop0 and Stop1
Standby | Available | Available and also special shutdown mode implemented | Available and shutdown mode as well
All necessary details about the listed low-power modes are in the reference manuals and datasheets.
5 Operation modes
The following operation modes are used to assess the impact of the memory interface settings on the
performance and power consumption. All the measurements have been done using VCC = 3.3 V and
voltage regulator range 1. With lower regulator ranges, both the speed and the consumption are lower, scaling roughly
linearly relative to the range 1 measurements. For example, with voltage regulator range 3 and the system clock
at 2 MHz (from MSI), both the power consumption and the performance would be roughly 10 times lower
for all the measured configurations. Repeating the measurements for all the configuration combinations
is therefore unnecessary.
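5.1 STM32G0 device options
Table 3. The options in voltage regulator range 1
Frequency [MHz] ≤24 ≤24 ≤48 ≤48 ≤48 ≤48 ≤64 ≤64 ≤64 ≤64
Latency 0 0 1 1 1 1 2 2 2 2
Instruction cache 0 1 0 0 1 1 0 0 1 1
Prefetch 0 0 0 1 0 1 0 1 0 1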
While it is possible to enable prefetch regardless of the latency setting, it makes no sense when the number of wait states
is zero. In range 2, the system clock is capped at 16 MHz, which is achieved with 1 wait state. For more details,
see chapter 3.3.4 in RM0444.
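The range 1 frequency thresholds listed for the STM32G0 (24, 48, and 64 MHz) can be expressed as a small lookup. The following is a minimal sketch in C; the function name `g0_wait_states_range1` is hypothetical and is not part of X-CUBE-REF-PM, and only the thresholds from the table are used.

```c
#include <stdint.h>

/* Hypothetical helper (not part of X-CUBE-REF-PM): returns the number of
 * flash wait states for a given system clock in voltage regulator range 1,
 * using the 24/48/64 MHz thresholds from the table above.
 * Returns -1 for a clock above 64 MHz. */
static int g0_wait_states_range1(uint32_t sysclk_mhz)
{
    if (sysclk_mhz <= 24u) {
        return 0;
    }
    if (sysclk_mhz <= 48u) {
        return 1;
    }
    if (sysclk_mhz <= 64u) {
        return 2;
    }
    return -1;
}
```

A caller would typically program the returned latency before raising the clock and after lowering it; that ordering is a general precaution and should be checked against RM0444.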
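5.2 STM32L1 series device options
Table 4. Configurations available on STM32L1 series devices with regulator range 1
Latency 0 0 1 1 1 1
64-bit 0 1 1 1 1 1
Prefetch 0 0 0 1 0 1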
The table of valid configurations demonstrates the following simple rules:
• Wait states are inevitable when exceeding 16 MHz.
• When the latency is set to 1, the 64-bit access is mandatory.
• The prefetch is impossible without the 64-bit access.
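The three rules above can be checked in code before a configuration is applied. The following is a minimal sketch in C under stated assumptions: the structure `l1_flash_cfg` and the function `l1_flash_cfg_is_valid` are hypothetical names introduced for illustration, and the 16 MHz threshold applies to regulator range 1 only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one STM32L1 flash interface configuration. */
struct l1_flash_cfg {
    uint32_t sysclk_mhz; /* system clock in MHz (regulator range 1) */
    bool latency;        /* one wait state enabled */
    bool acc64;          /* 64-bit read access enabled */
    bool prefetch;       /* prefetch enabled */
};

/* Returns true when the configuration respects the three rules above. */
static bool l1_flash_cfg_is_valid(const struct l1_flash_cfg *cfg)
{
    /* Rule 1: a wait state is required above 16 MHz. */
    if (cfg->sysclk_mhz > 16u && !cfg->latency) {
        return false;
    }
    /* Rule 2: latency set to 1 requires the 64-bit access. */
    if (cfg->latency && !cfg->acc64) {
        return false;
    }
    /* Rule 3: prefetch requires the 64-bit access. */
    if (cfg->prefetch && !cfg->acc64) {
        return false;
    }
    return true;
}
```

Rejecting an invalid combination up front is one possible design choice; the example firmware may instead hide such options in its menu.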
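5.3 STM32L0 series device options
Table 5. Configurations available on STM32L0 series devices with regulator range 1
Latency 0 1 0 1 1 1 0 1 1 1 1 1 1
Preread 0 0 1 1 1 0 X X 0 1 1 0 X
Prefetch 0 0 0 0 1 1 X X 0 0 1 1 X
Buffer disable 0 0 0 0 0 0 1 1 0 0 0 0 1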
The table of valid configurations demonstrates the following simple rules:
• The latency cannot be zero with clock speeds exceeding 16 MHz.
• When the buffer is disabled, it cannot be configured.
• Prefetch and preread configure the usage of the six words in the internal buffer, not their total amount.
5.4 STM32L4 series device options
Table 6. Device option summary
Latency 0 0 0 0 ≥1 ≥1 ≥1 ≥1 ≥1 ≥1 ≥1 ≥1
Data cache 0 0 1 1 0 0 0 1 0 1 1 1
Instruction cache 0 1 0 1 0 0 1 0 1 1 0 1
Prefetch 0 0 0 0 0 1 0 0 1 0 1 1
The prefetch, data cache and instruction cache settings are independent of each other. Each of these three
features can be enabled or disabled independently of the frequency or any other setting. However, some settings
make less sense than others. In particular, prefetch is not recommended with zero wait states.
The settings are only simple when the voltage regulator settings are disregarded, because the read access latency
strongly depends on them. For example, at 16 MHz the latency of a flash read is 1 CPU cycle in range 1,
but it increases to 3 CPU cycles in range 2 at the same core frequency.
For more details, see the Read access latency section in the RM0411 reference manual.
The STM32Cube Expansion Package (X-CUBE-REF-PM) related to this application note is intended for use with
inexpensive and widely available STM32 Nucleo boards. With some effort, the examples can be adapted for
other hardware. This description refers to Nucleo boards.
All the controls are implemented as number key press inputs, with choices listed on the bottom of the screen. The
choices are not available at all times.
The control firmware deliberately tries to hide settings that are not applicable. For example, when a low-power
mode is selected, the executed code selection is hidden as not relevant.
Enter the number corresponding to the available choices (selections 1-5).
In case of another selection, the terminal asks for a new value. Once the choice is made, updated settings are
listed. For example, when the low-power run mode is selected, the oscillator settings are adjusted to produce a
compatible system clock.
To execute a test, first set the power mode: it determines the available system clock settings and the test
availability. For low-power mode the active peripherals may be selected, for run modes the executed code may be
selected.
The firmware tries to limit the access to setting combinations that would obviously lead to failure.
However, especially when using the HSE clock source, it is still possible to leave the operating conditions
envelope defined in the datasheets. Correct operation is then not guaranteed.
To start the test execution, enter ‘6’ in the root menu. In case of failure, the firmware activates the on-board LED.
Table 7. The example build options
Define switch | Active | Not active
EXTI_BUTTON | The blue button on the Nucleo board aborts most of the tests, returning to the root menu and retaining the settings. May cause additional current consumption. | The reset button on the Nucleo board is used to return to the root menu. The settings are, however, reset to default values.
FINITE_LOOP | Relevant computational tests are limited to LOOP_COUNT cycles. The time measured to complete the task is used to compute the execution efficiency. | Tests run until aborted by reset, power off, debugger, or EXTI (list depending on other options).
DEBUG_ON | The debug interface is active during the test. Useful to review the settings and check the functionality. | The debug interface is in high-Z during the test. Only this configuration must be used for measurement.
The default setting, with all three define switches not active, is the configuration that allows the user to obtain
the datasheet values.
The datasheet includes the power consumption measurements for several different executed codes:
Dhrystone, CoreMark, Fibonacci, and a while(1) loop. CoreMark is not included in the published example code
for licensing reasons. Instead, the example includes two additional test codes: the “Reduced code” and
“Memory read stress test”, which focus directly on the memory interface settings and their influence on the
execution efficiency.
The tests focused on flash memory interface efficiency are not present in the datasheet. The results of their
execution are analyzed in the following pages.
To assess the performance of the MCU with different memory controller settings, several benchmark tests have
been used. All the tests have been executed on a NUCLEO-L152RE board using all the available memory
interface settings, listed in Section 5.2 STM32L1 series device options. All the tests have been executed both
standalone and in parallel with a DMA transfer constantly reading from the program NV memory. The DMA
channel was directed to the SPI output configured to the highest available speed (fPCLK/2) and low priority.
Three clock configurations have been used in the measurements. One with the plain 16 MHz HSI clock as the
system clock and no latency set, another with the same clock but the Flash latency configured (Flash memory
running effectively on lower clock) and the third with the PLL set to produce the 32 MHz system clock.
All the measurements are taken on a single sample of NUCLEO-L152RE board at ambient temperature. The
values provided are an arithmetic mean from several measurements.
Table 8. Dhrystone results with no background transfer
Latency 0 0 0 1 1 1 1
64-bit 0 1 1 1 1 1 1
Prefetch 0 0 1 0 1 0 1
Timing for 50000 cycles [s] 2.57 2.57 2.57 3.05 2.86 1.52 1.46
Average current [mA] 5.75 5.78 6.11 5.13 5.62 10.42 11.08
Energy [mJ] 48.77 49.02 51.82 51.63 53.04 52.27 53.38
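Figure 2. Dhrystone results with no background transfer
(Plot of current I [mA] versus time [s]. Labeled traces include 32 MHz with 64-bit access and prefetch, 32 MHz with 64-bit access without prefetch, and 16 MHz with latency.)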
Table 9. Dhrystone results with DMA simultaneously reading data from the Flash memory
Latency 0 0 0 1 1 1 1
64-bit 0 1 1 1 1 1 1
Prefetch 0 0 1 0 1 0 1
Timing for 50000 cycles [s] 2.72 2.68 2.68 3.28 3.09 1.64 1.55
Average current [mA] 6.17 6.25 6.58 5.50 5.99 11.24 11.68
Energy [mJ] 55.38 55.28 58.19 59.53 61.08 60.83 59.74
Figure 3. Dhrystone results with DMA simultaneously reading data from the Flash memory
(Plot of current I [mA] versus time [s]. Labeled trace: 16 MHz with latency on and 64-bit access.)
Configuring a 64-bit access or a prefetch makes very little difference at a low clock speed where the latency
can be avoided. Conversely, setting the latency may lead to a lower power consumption in situations where
the speed is not critical. At higher speeds, the efficiency of the prefetch is situational: it delivers the highest
performance, but the gain in speed may be lower than the increase in consumption.
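The energy rows in the STM32L1 Dhrystone tables are consistent with the product of supply voltage, average current, and execution time at VCC = 3.3 V. For example, 3.3 V × 5.75 mA × 2.57 s ≈ 48.77 mJ for the first configuration without background transfer. The following minimal sketch in C reproduces that calculation and picks the lowest-energy configuration from a set of measurements. The function names are hypothetical and not part of the example firmware.

```c
/* Energy in mJ from supply voltage [V], average current [mA], and time [s].
 * V x mA x s gives mJ directly, so no unit conversion is needed. */
static double energy_mj(double vcc_v, double current_ma, double time_s)
{
    return vcc_v * current_ma * time_s;
}

/* Returns the index of the measurement with the lowest energy.
 * The arrays must hold n entries each, with n greater than zero. */
static int lowest_energy_index(double vcc_v, const double *current_ma,
                               const double *time_s, int n)
{
    int best = 0;
    double best_e = energy_mj(vcc_v, current_ma[0], time_s[0]);

    for (int i = 1; i < n; i++) {
        double e = energy_mj(vcc_v, current_ma[i], time_s[i]);
        if (e < best_e) {
            best_e = e;
            best = i;
        }
    }
    return best;
}
```

Comparing energy rather than current alone captures the tradeoff described above: a faster configuration draws more current but may finish early enough to cost less energy overall.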
Table 10. 32-bit code result with no background transfer
Latency 0 0 0 1 1 1 1
64-bit 0 1 1 1 1 1 1
Prefetch 0 0 1 0 1 0 1
Timing for 500000 cycles [s] 0.9 0.9 0.9 1.06 0.964 0.59 0.497
Average current [mA] 5.25 5.41 5.63 4.82 5.11 9.09 9.78
Energy [mJ] 15.59 16.07 16.72 16.86 16.26 17.70 16.04
Figure 4. 32-bit code result with no background transfer
(Plot of current I [mA] versus time [s]. Labeled traces: 32 MHz with 64-bit access and prefetch on, 32 MHz with prefetch off.)
Table 11. 32-bit code result with DMA simultaneously reading data from the Flash memory
Latency 0 0 0 1 1 1 1
64-bit 0 1 1 1 1 1 1
Prefetch 0 0 1 0 1 0 1
Timing for 500000 cycles [s] 0.956 0.921 0.916 1.22 1.02 0.64 0.54
Average current [mA] 5.85 5.96 6.18 5.20 5.67 9.83 10.66
Energy [mJ] 18.46 18.11 18.68 20.94 19.09 20.76 19.00
Figure 5. 32-bit code result with DMA simultaneously reading data from the Flash memory
(Plot of current versus time [s]. Labeled traces: 16 MHz with no 64-bit access and no prefetch, 16 MHz with latency active along with 64-bit access and prefetch.)
The findings are in line with the expectations: a code with high share of 32-bit instructions benefits a lot from the
prefetch once the memory latency is in place. But with zero latency the extra bandwidth is likely to be useless.
Table 12. Literal pool with no additional data read from the Flash memory
Latency 0 0 0 1 1 1 1
64-bit 0 1 1 1 1 1 1
Prefetch 0 0 1 0 1 0 1
Timing for 500000 cycles [s] 3.66 2.73 2.72 3.38 3.32 1.69 1.66
Average current [mA] 5.44 5.58 6.12 4.85 5.33 9.78 10.73
Energy [mJ] 65.70 50.27 54.93 54.10 58.40 54.54 58.78
Figure 6. Literal pool reading with no additional data read from the Flash memory
(Plot of current versus time [s]. Labeled trace: 16 MHz without prefetch.)
Table 13. Literal pool reading with DMA simultaneously reading the Flash memory
Latency 0 0 0 1 1 1 1
64-bit 0 1 1 1 1 1 1
Prefetch 0 0 1 0 1 0 1
Timing for 500000 cycles [s] 3.98 2.94 2.94 3.92 3.88 1.97 1.96
Average current [mA] 6.04 6.26 6.73 5.40 5.72 10.62 11.59
Energy [mJ] 79.33 60.73 65.29 69.85 73.24 69.04 74.96
Figure 7. Literal pool reading with DMA simultaneously reading data from the Flash memory
(Plot of current I [mA] versus time [s]. Labeled traces: 32 MHz with 64-bit access and prefetch, 16 MHz with 64-bit access and prefetch.)
As expected, when the code mostly reads data, the effect of the prefetch is lower, but a 64-bit memory access
makes a significant difference even with zero memory latency.
The Cortex-M0+ core is much simpler compared to the Cortex-M3 used in the STM32L1 Series. The 32-bit
instruction benchmark is dropped as the Thumb-2 instruction set support in the M0+ core is very limited and an
extensive usage of 32-bit code is not realistic with a code compiled for the STM32L0 Series.
The remaining tests have been executed on a NUCLEO-L073RZ board using all the available memory interface
settings, listed in Section 5.3 STM32L0 series device options. All the tests have been executed both standalone
and in parallel with a DMA transfer constantly reading from the program NV memory. The DMA channel was
directed to the SPI output configured to the highest available speed (fPCLK/2), but low priority.
Two clock configurations have been used in the measurements. One with the plain 16 MHz HSI clock as the
system clock and no latency set, the other with the PLL set to produce the 32 MHz system clock and of course
the Flash memory latency set to 1.
All the measurements are taken on a single sample of NUCLEO-L073RZ board at ambient temperature. The
values provided are an arithmetic mean from several measurements.
Table 14. Dhrystone with no additional data read from the flash memory
Latency 0 0 0 0 0 1 1 1 1 1
Prefetch 1 0 0 1 0 1 0 0 1 0
Preread 1 1 0 0 0 1 1 0 0 0
Disabled buffer 0 0 1 0 0 0 0 1 0 0
Time [ms] 3769 3766 3771 3769 3769 2139 2667 2720 2130 2667
Average current [mA] 4.32 4.42 4.54 4.40 4.39 8.14 7.52 7.52 8.04 7.43
Energy [mJ] 53.73 54.93 56.49 54.72 54.60 57.46 66.20 67.49 56.51 65.40
Figure 8. Dhrystone with no additional data read from the flash memory
(Plot of current I [mA] versus time [ms]. Labeled trace: 16 MHz with buffer disabled.)
Table 15. Dhrystone with DMA simultaneously reading data from the flash memory
Latency 0 0 0 0 0 1 1 1 1 1
Prefetch 1 0 0 1 0 1 0 0 1 0
Preread 1 1 0 0 0 1 1 0 0 0
Disabled buffer 0 0 1 0 0 0 0 1 0 0
Time [ms] 3903 3901 3906 3906 3904 2377 2853 2956 2334 2843
Average current [mA] 4.69 4.77 4.87 4.68 4.59 8.58 8.21 8.15 8.66 7.80
Energy [mJ] 69.40 61.41 62.77 60.32 59.13 67.29 77.31 79.31 66.70 73.17
Figure 9. Dhrystone with DMA simultaneously reading data from the flash memory
(Plot of current versus time [ms]. Labeled traces at 32 MHz: prefetch only, pre-read and prefetch, pre-read only, buffer disabled, no pre-read or prefetch.)
This example clearly shows that the internal six-word buffer improves the energy efficiency even if it is not well
utilized, as in the case of zero latency. The best option is to keep it on, but to disable the prefetch and preread.
In the configuration with latency enabled, the prefetch is probably worth using. The preread is
not used by the DMA channel and does not represent an improvement in this particular scenario.
Table 16. Literal pool with no additional data read from the Flash memory
Latency 0 0 0 0 0 1 1 1 1 1
Prefetch 1 0 0 1 0 1 0 0 1 0
Pre-read 1 1 0 0 0 1 1 0 0 0
Disabled buffer 0 0 1 0 0 0 0 1 0 0
Time [ms] 2402.5 2401.5 2403 2403 2399.5 2009 2058.5 2091 1817 1819
Average current [mA] 3.4 3.42 3.36 3.14 3.19 6.03 6.05 5.94 5.83 5.73
Energy [mJ] 26.95 27.10 26.64 24.89 25.25 39.97 41.09 40.98 34.95 34.39
Figure 10. Literal pool with no additional data read from the Flash memory
(Plot of current I [mA] versus time [ms]. Labeled traces: 32 MHz with both pre-read and prefetch on, 32 MHz with prefetch only, 32 MHz with pre-read only, 16 MHz with pre-read only.)
Table 17. Literal pool with DMA simultaneously reading data from the Flash memory
Latency 0 0 0 0 0 1 1 1 1 1
Prefetch 1 0 0 1 0 1 0 0 1 0
Pre-read 1 1 0 0 0 1 1 0 0 0
Disabled buffer 0 0 1 0 0 0 0 1 0 0
Time [ms] 2533.5 2533.5 4854.5 4587 4591 2292.5 2301 2420 2299 2302.5
Average current [mA] 3.86 3.86 3.38 3.32 3.29 7.42 7.39 7.34 7.25 7.18
Energy [mJ] 32.27 32.27 54.15 50.26 49.84 56.13 56.11 58.62 55.00 54.56
Figure 11. Literal pool with DMA simultaneously reading data from the Flash memory
(Plot of current I [mA] versus time [ms]. Labeled traces at 32 MHz: pre-read only, prefetch and pre-read, no pre-read or prefetch.)
This example finally demonstrates the advantage of the pre-read setting. It can greatly improve the efficiency
when more than one stream of data is read from the Flash memory and there is no latency. The prefetch is not
useful when dealing mostly with data, which is no surprise. Again, it is a good idea to keep the buffer enabled. The
only reason to disable the buffer is if the timing needs to be more deterministic, whatever the efficiency cost may
be.
The STM32L4 Series devices are based on the Cortex-M4 core connected to the 32-bit multilayer AHB bus matrix
that connects up to six master and eight slave devices supporting concurrent operations as long as the bus
masters are accessing different bus slaves.
The tests have been executed on a NUCLEO-L476RG board using all the available memory interface settings,
listed in Table 6. Device option summary. The results of execution with a concurrent DMA transfer are not
included for the STM32L4 Series. The impact of the DMA on timing is minimal and the added current
consumption is approximately the same regardless of the Flash interface configuration, making the results not
interesting.
One set of tests has been executed only with VCORE range 1 to provide a comparison with other series featured in
this overview and to assess the impact of the prefetch and caches.
Another set of measurements has been executed using different latency, frequency, and voltage regulator settings to
assess the energy needed for different operations in a battery-powered application.
All the measurements are taken on a single sample of NUCLEO-L476RG board at ambient temperature. The
values provided are an arithmetic mean from several measurements.
Table 18. Dhrystone test using core voltage range 1 and HSI clock
Latency 0 1 1 1 1 1 1 1 1
D-cache 1 0 0 0 1 1 1 1 0
I-cache 1 0 0 1 1 1 0 0 1
Prefetch 0 0 1 1 1 0 0 1 0
Time [ms] 2561 1552 1473 1313 1281 1283 1498 1430 1310
Average current [mA] 3.12 6.55 6.61 5.87 5.9 5.65 6.56 6.6 5.71
Energy [mJ] 24.80 31.51 30.19 23.89 23.42 22.48 30.45 29.25 23.19
Figure 12. Dhrystone test plot of energy needed for execution
(Plot of current I [mA] versus time [ms]. Data points are annotated with the energy in mJ.)
This example clearly demonstrates that while the prefetch can lead to an improved performance, especially if the
instruction cache is enabled, it does not bring a significant additional advantage in the case of the Dhrystone test
code. The prefetch complements the caches and helps in code sections with few loops, where the
caches cannot help.
With the optimal configuration of the flash interface identified, the next question is how the cache behaves at
different core clock speeds. A higher clock speed leads to a higher latency, forcing the core to wait for a read access
to the flash memory if the instruction and data are not available in the ART cache. The core waiting for the memory
still needs energy, reducing the overall efficiency.
Figure 13. Energy cost of the dhrystone test loop
(Plot of energy cost versus core clock frequency f [MHz]. Series: Range 2 with ART enabled, Range 1 with ART disabled, Range 1 with ART enabled.)
In Figure 13. Energy cost of the dhrystone test loop, the same test loop of 50000 Dhrystone tests is executed with
different clock settings using either the MSI directly or, for 64 MHz and 80 MHz, the PLL with the MSI as
its source clock. The additional power consumption of the PLL causes a slight drop in the efficiency visible on the
chart.
Otherwise, the chart shows that, at least for the Dhrystone test, which includes a lot of loops, the ART
Accelerator cache is capable of improving the MCU execution efficiency as the core clock increases. This is a
remarkable feature.
Table 19. Literal measurements
Latency 0 1 1 1 1 1 1 1 1
D-cache 1 0 0 0 1 1 1 1 0
I-cache 1 0 0 1 1 1 0 0 1
Prefetch 0 0 1 1 1 0 0 1 0
Time [ms] 570 344.5 344 340.2 284.9 284.3 288.1 288.7 340.2
Average current [mA] 3.10 6.75 6.77 6.49 6.19 6.09 6.9 6.88 6.45
Energy [mJ] 5.49 7.21 7.22 6.84 5.47 5.37 6.16 6.16 6.80
Figure 14. Literal pool chart plot of energy efficiency
(Plot of current I [mA] versus time [ms]. Data points are annotated with the energy in mJ.)
For the data literal pool loop, the data cache tends to significantly improve the execution speed, while the
instruction cache mostly adds to the power consumption. What is not visible from the plot is that the
efficiency improvement tends to grow slowly over several hundred iterations before reaching a maximum.
The STM32G0 shares some power saving features with the low-power series. The STM32G0B1RE, the device used
in the measurement, has 512 kB of dual-bank flash memory.
ES0548 and ES0549 describe a bug that compromises the prefetch advantage of this device.
When the boundary between the two banks is crossed, the prefetch may fail to present the intended instruction,
resulting in a possible hard fault.
There is no workaround, so it is recommended to disable prefetch.
Architecturally, the STM32G0 has the same Cortex-M0+ CPU core as the STM32L0 series, but with a nonvolatile
memory arrangement more similar to the STM32L4, with a smaller cache.
The measurements presented in this document are performed on the Nucleo-G0B1RE board without
modifications.
Table 20. Flash memory interface settings
Frequency [MHz] 16 32 32 32 32 64 64 64 64
Latency 0 1 1 1 1 2 2 2 2
Cache 0 1 1 0 0 1 1 0 0
Prefetch 0 0 1 1 0 0 1 1 0
Time [s] 2.06 1.17 1.09 1.19 1.3 0.66 0.595 0.693 0.789
Average current [mA] 2.56 4.21 4.57 4.84 4.39 7.72 8.4 8.56 7.7
Energy [mJ] 17.67 16.89 17.05 19.73 19.57 17.38 17.08 20.23 20.56
The benchmark shows the advantage of both cache and prefetch. As latency increases, they keep the CPU busy
and efficient. But while the cache hits save energy, the prefetch costs energy even if the instruction is not used.
In some cases, such as the Dhrystone running with 1 wait state, prefetching improves performance but decreases
the overall power efficiency.
Other methods of assessing performance have been used, with results that differ in absolute terms or even in the
order of configurations in terms of efficiency. However, the overall trend is broadly the same, suggesting that both
prefetch and cache benefits increase as clock speed (and latency) increases, with cache improving more on the
efficiency side, and prefetch providing the greatest benefit at peak performance.
The general rule to minimize the power consumption is to perform the task for the shortest possible time, at the
lowest possible operating frequency and with the clock enabled to a minimal part of the silicon.
In other words, the goal is to optimize for execution speed and then find an optimal balance between the time and
the clock frequency. The speed optimization is mostly a matter of compiler choice. If possible, build the
reference projects with different development tools and observe the difference in power
consumption and execution speed.
Even the best compiler can benefit from some tricks applicable to most C source code:
1. Where possible, use variables of size that correspond to the CPU register size (32 bits).
2. Use macros instead of simple functions to save on function call overhead.
3. Learn to use keywords like static, restrict, register, inline.
4. Most compilers can be guided using various “#pragma” statements for more optimized results. Check what
pragmas are available in your development toolchain.
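Some of the tricks listed above can be illustrated with a short C99 fragment. This is a minimal sketch, not code from the example package: the function names `scale_q8` and `scale_buffer` are hypothetical. It uses 32-bit variables throughout, a `static inline` helper as one way to avoid function call overhead, and `restrict`-qualified pointers. Whether these choices actually help depends on the compiler and its optimization settings, so the result should be measured.

```c
#include <stdint.h>

/* static inline lets the compiler remove the call overhead of a small
 * helper while keeping the type checking that a macro would lose. */
static inline uint32_t scale_q8(uint32_t x, uint32_t gain_q8)
{
    return (x * gain_q8) >> 8; /* gain in Q8 format: 256 means 1.0 */
}

/* restrict tells the compiler that dst and src never overlap, which can
 * allow better load/store scheduling. The counter and accumulator are
 * 32 bits wide to match the CPU register size. */
static uint32_t scale_buffer(uint32_t *restrict dst,
                             const uint32_t *restrict src,
                             uint32_t n, uint32_t gain_q8)
{
    uint32_t sum = 0u;

    for (uint32_t i = 0u; i < n; i++) {
        dst[i] = scale_q8(src[i], gain_q8);
        sum += dst[i];
    }
    return sum;
}
```

With a gain of 128 (0.5 in Q8), an input of 256, 512, 768 is scaled to 128, 256, 384.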
Memory placement also influences the power consumption. Some microcontrollers embed more than one
type of volatile memory, and some types may need a little more energy than others.
12 Conclusion
Each low-power STM32 microcontroller series requires a slightly different approach to optimize the energy
efficiency.
Putting the product in low-power mode during the idle period is best practice, but the wake up time must always
be considered. The peripherals left active in low-power mode to trigger the wake up have an impact on the power
consumption. This is detailed in the datasheet and can be checked using the firmware examples.
Another set of optimization challenges is presented in relation to the Run mode and the code execution.
The measured results provide the guidance for decision whether or not to enable the different memory interface
settings. The features like the prefetch, improving the benchmark result, also lead to a higher power consumption
and the overall efficiency is dependent on the task processed by the microcontroller.
There is no significant benefit in tweaking the settings when the flash memory latency is not in place. This makes
sense only if the flash memory contains frequently used literal pools (predefined data constants) or if the cache
access leads to lower energy consumption.
With the flash memory latency in place, the flash interface must be set up carefully, as the performance difference
between the optimal and default configuration may be significant. It is definitely possible to activate some flash
interface settings only temporarily for particular operations and disable them afterwards.
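The idea of activating a flash interface setting only for a particular operation can be sketched as follows. This is a host-runnable illustration under stated assumptions, not target code: `fake_flash_acr` is a plain variable standing in for the flash access control register, the bit position of `ACR_PRFTEN` is an assumption that must be verified in the reference manual of the device, and the names `run_with_prefetch` and `sum_to` are hypothetical.

```c
#include <stdint.h>

/* Stand-in for the flash access control register so that the sketch runs
 * on a host. On a target, the device register would be used instead. */
static volatile uint32_t fake_flash_acr;

/* Assumed prefetch enable bit; verify the position in the reference manual. */
#define ACR_PRFTEN (1u << 8)

typedef uint32_t (*work_fn)(uint32_t);

/* Example workload: sum of the integers from 1 to n. */
static uint32_t sum_to(uint32_t n)
{
    uint32_t sum = 0u;

    for (uint32_t i = 1u; i <= n; i++) {
        sum += i;
    }
    return sum;
}

/* Runs fn(arg) with prefetch enabled, then restores the previous state
 * of the prefetch bit so that other code is not affected. */
static uint32_t run_with_prefetch(work_fn fn, uint32_t arg)
{
    uint32_t saved = fake_flash_acr & ACR_PRFTEN;
    uint32_t result;

    fake_flash_acr |= ACR_PRFTEN;
    result = fn(arg);
    fake_flash_acr = (fake_flash_acr & ~ACR_PRFTEN) | saved;
    return result;
}
```

Saving and restoring the previous bit state, rather than always clearing it, keeps the helper safe to call from code that already runs with prefetch enabled.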
It is demonstrated that the erratum affecting the dual-bank STM32G0 devices does impact the top performance,
but has less effect on the efficiency.
Revision history
Contents
1 General information
2 Definitions
3 System architecture
4 Low-power modes
5 Operation modes
5.1 STM32G0 device options
5.2 STM32L1 Series device options
5.3 STM32L0 Series device options
5.4 STM32L4 Series device options
5.5 Execution from a volatile memory
List of tables
Table 1. List of acronyms
Table 2. Low-power mode brief comparison
Table 3. The options in voltage regulator range 1
Table 4. Configurations available on STM32L1 series devices with regulator range 1
Table 5. Configurations available on STM32L0 series devices with regulator range 1
Table 6. Device option summary
Table 7. The example build options
Table 8. Dhrystone results with no background transfer
Table 9. Dhrystone results with DMA simultaneously reading data from the Flash memory
Table 10. 32-bit code result with no background transfer
Table 11. 32-bit code result with DMA simultaneously reading data from the Flash memory
Table 12. Literal pool with no additional data read from the Flash memory
Table 13. Literal pool reading with DMA simultaneously reading the Flash memory
Table 14. Dhrystone with no additional data read from the flash memory
Table 15. Dhrystone with DMA simultaneously reading data from the flash memory
Table 16. Literal pool with no additional data read from the Flash memory
Table 17. Literal pool with DMA simultaneously reading data from the Flash memory
Table 18. Dhrystone test using core voltage range 1 and HSI clock
Table 19. Literal measurements
Table 20. Flash memory interface settings
Table 21. Document revision history
List of figures
Figure 1. Terminal screen
Figure 2. Dhrystone results with no background transfer
Figure 3. Dhrystone results with DMA simultaneously reading data from the Flash memory
Figure 4. 32-bit code result with no background transfer
Figure 5. 32-bit code result with DMA simultaneously reading data from the Flash memory
Figure 6. Literal pool reading with no additional data read from the Flash memory
Figure 7. Literal pool reading with DMA simultaneously reading data from the Flash memory
Figure 8. Dhrystone with no additional data read from the flash memory
Figure 9. Dhrystone with DMA simultaneously reading data from the flash memory
Figure 10. Literal pool with no additional data read from the Flash memory
Figure 11. Literal pool with DMA simultaneously reading data from the Flash memory
Figure 12. Dhrystone test plot of energy needed for execution
Figure 13. Energy cost of the dhrystone test loop
Figure 14. Literal pool chart plot of energy efficiency