stm32: UART overrun race condition locks up system #3375

Closed
hoihu opened this issue Oct 18, 2017 · 18 comments

Comments

@hoihu
Contributor

hoihu commented Oct 18, 2017

There is a race condition present in the stm32 port that relates to the way the UART overrun (ORE bit in USART status register) is handled in the IRQ service routine.

If this condition is met, the pyboard completely locks up and cannot be recovered (you have to hard-reset or power cycle).

The cause is that the UART RX ISR does not clear the ORE flag when the receive data register is empty (RXNE not set). Because ORE also triggers the IRQ, the interrupt fires again as soon as the handler exits, which results in 100% CPU utilisation.

This situation is explained in the STM32 reference manual, section "Overrun error" of the USART chapter.
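The retrigger loop described above can be illustrated with a small software model. This is only a sketch: the flag dictionary, `irq_pending`, `rx_isr_without_ore_handling` and `count_isr_entries` are hypothetical names invented for the illustration, not code from the stm32 port, and the model assumes the receive interrupt is requested whenever RXNE or ORE is set.

```python
# Simplified software model of the lock-up; no real register access.
MAX_ENTRIES = 1000  # cap so the demonstration terminates


def irq_pending(flags):
    # Assumption: with the RX interrupt enabled, RXNE or ORE requests the IRQ.
    return flags["RXNE"] or flags["ORE"]


def rx_isr_without_ore_handling(flags, rx_buffer):
    # Services received data only; ORE is never looked at or cleared.
    if flags["RXNE"]:
        rx_buffer.append(flags["DR"])
        flags["RXNE"] = False


def count_isr_entries(flags):
    # Re-enter the handler for as long as the interrupt stays pending.
    rx_buffer = []
    entries = 0
    while irq_pending(flags) and entries < MAX_ENTRIES:
        rx_isr_without_ore_handling(flags, rx_buffer)
        entries += 1
    return entries


stuck = {"RXNE": False, "ORE": True, "DR": 0}       # overrun, nothing to read
normal = {"RXNE": True, "ORE": False, "DR": 0x41}   # ordinary received byte
print(count_isr_entries(stuck))   # 1000: only the cap stops the loop
print(count_isr_entries(normal))  # 1
```

On real hardware there is no cap, so the handler is re-entered forever and lower-priority code never runs.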

Steps to reproduce:

  1. Wire up a pyboard UART Rx pin (e.g. X10) to a USB->serial adapter's TX pin
  2. on the pyboard enter:
from pyb import UART
u=UART(1,230400)
while True: u.read(10)
  3. on the PC enter:
import time, serial, os
u=serial.serial_for_url('COMxx',230400)
while True: u.write(os.urandom(100)); time.sleep(0.01)

You should now see a lot of random data being transferred to the pyboard. Since the critical section (https://github.com/micropython/micropython/blob/master/ports/stm32/uart.c#L486-L494) takes less than roughly 1 µs to execute, it is very hard to hit the bug and normally you won't see any trouble.

However, the race condition can be forced if another IRQ is triggered on the pyboard, which delays the call of the UART service routine.

One way that worked for me was to load a file, e.g. boot.py, in a text editor and hit "save" several times (10-20 times is normally enough) -> BOOM. The pyboard freezes and continuously fires the UART RX IRQ.

The way to resolve is to put the following code

    if (__HAL_UART_GET_FLAG(&self->uart, UART_FLAG_ORE) != RESET) {
        if (__HAL_UART_GET_FLAG(&self->uart, UART_FLAG_RXNE) == RESET) {
            // Overrun with no data pending: do a dummy read of DR to clear ORE.
            // This prevents a race condition in which a continuous interrupt
            // stream (due to ORE being set) occurs; see the "Overrun error"
            // section in the STM32 reference manual.
            (void)self->uart.Instance->DR;
        }
    }

just before https://github.com/micropython/micropython/blob/master/ports/stm32/uart.c#L486

On the L4, the overrun can be disabled in the UART's control register, so it doesn't need that check.

@hoihu changed the title from "stm32: UART overrun raise condition locks up system" to "stm32: UART overrun race condition locks up system" on Oct 18, 2017

@dpgeorge
Member

Thanks for the report, but I can't reproduce it.

I tried the way described (but using another pyboard to generate the random data on the UART, not the PC) but everything was OK, I could save boot.py over 50 times in a row without issue.

I also tried explicitly disabling interrupts in a function and then delaying for a while:

import machine, time

def pause(us):
    u.read() # drain buffer
    i = machine.disable_irq()
    time.sleep_us(us)
    machine.enable_irq(i)

Running this in a loop didn't give any issues:

while 1:
    print(u.read())
    pause(100)

@hoihu are you able to reproduce the bug using the above pause() function, or something similar?

@peterhinch
Contributor

@dpgeorge Disabling interrupts periodically may not be probing the issue raised. I think you need regularly to raise a higher priority interrupt in the hope that one occurs in the critical section identified. That said, I can't see a race condition.

The scenario as I see it is this. The circular buffer is not full when a character arrives. The ISR reads the UART data and puts it in the buffer. Then a higher priority interrupt occurs. Another character arrives, followed by a second, causing an overrun error. The priority ISR terminates. The UART ISR then terminates normally (not disabling the interrupt) and immediately runs again.

What I don't follow is what happens next. When it re-starts, depending on the buffer state, it should either read the data (clearing the overrun interrupt) or disable the UART IRQ. Either should be safe (although character(s) will be lost).

@dpgeorge
Member

I think you need regularly to raise a higher priority interrupt in the hope that one occurs in the critical section identified.

Right, that's what I thought too, but UART RX has the highest priority! (Except for SysTick, so there could be something there...).

@hoihu
Contributor Author

hoihu commented Oct 19, 2017

The scenario as I see it is this. The circular buffer is not full when a character arrives. The ISR reads the UART data and puts it in the buffer. Then a higher priority interrupt occurs. Another character arrives, followed by a second, causing an overrun error. The priority ISR terminates. The UART ISR then terminates normally (not disabling the interrupt) and immediately runs again.

yes, thanks for the clarification - precisely what I meant. Maybe race condition is not the perfect technical term.

What I don't follow is what happens next. When it re-starts, depending on the buffer state, it should either read the data (clearing the overrun interrupt)

The overrun condition is cleared by reading the status register SR followed by a read to the DR register. As it looks, the overrun must be cleared in order that the UART_FLAG_RXNE can be set again...

In general, the RXNEIE flag (enabled using __HAL_UART_ENABLE_IT(&self->uart, UART_IT_RXNE)) enables both IRQ sources: incoming characters and the ORE condition. Therefore both should be handled within this IRQ handler.
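The two points above (RXNEIE gating both interrupt sources, and ORE being cleared by an SR read followed by a DR read) can be sketched as a small software model. The bit positions are the ones I understand the STM32F4 reference manual to give for USART_SR and USART_CR1, so check them against the manual for your part. `UsartModel` and `rx_irq_requested` are hypothetical names, and the model assumes that a DR read only clears ORE if the preceding SR read actually observed ORE, which is consistent with the behaviour reported in this issue.

```python
# Bit positions assumed from the STM32F4 reference manual (verify for your part).
SR_ORE = 1 << 3      # overrun error flag in USART_SR
SR_RXNE = 1 << 5     # read data register not empty flag in USART_SR
CR1_RXNEIE = 1 << 5  # RXNE interrupt enable in USART_CR1


def rx_irq_requested(sr, cr1):
    # RXNEIE gates both sources: received data (RXNE) and overrun (ORE).
    return bool(cr1 & CR1_RXNEIE) and bool(sr & (SR_RXNE | SR_ORE))


class UsartModel:
    """Simplified receiver model; an assumption, not the real peripheral."""

    def __init__(self):
        self.sr = 0
        self.dr = 0
        self._ore_seen = False  # did the last SR read observe ORE?

    def receive(self, byte):
        if self.sr & SR_RXNE:
            self.sr |= SR_ORE  # previous byte still unread: new byte is lost
        else:
            self.dr = byte
            self.sr |= SR_RXNE

    def read_sr(self):
        self._ore_seen = bool(self.sr & SR_ORE)
        return self.sr

    def read_dr(self):
        self.sr &= ~SR_RXNE
        if self._ore_seen:  # SR read followed by DR read clears ORE
            self.sr &= ~SR_ORE
            self._ore_seen = False
        return self.dr


u = UsartModel()
u.receive(0x41)
u.read_sr()      # handler checks RXNE; ORE is not set yet
u.receive(0x42)  # second byte arrives before DR is read -> overrun
u.read_dr()      # clears RXNE, but ORE was not seen by the SR read
print(rx_irq_requested(u.sr, CR1_RXNEIE))  # True: IRQ still pending
u.read_sr()      # now observes ORE
u.read_dr()      # dummy read completes the clear sequence
print(rx_irq_requested(u.sr, CR1_RXNEIE))  # False
```

The second half of the example is the dummy read proposed as the fix: once the handler reads SR with ORE set and then reads DR, the interrupt request goes away.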

Just to clarify: I have verified the lock-up situation using pin toggling in the UART ISR handler. During the lockup, I used a JTAG debugger to read out the UART's SR register. Some other people seem to have similar problems, e.g. https://github.com/spark/firmware/issues/776

I'm working on a better reproduction. Lowering the CPU speed may help.

@chuckbook

stm32 uart hardware implementation is a bit bumpy. Without DMA utilization (which works pretty well) we had to explicitly read the status register (SR / ISR). For STM32F7 it is also required to clear error flags before leaving IRQ service routine.

@hoihu
Contributor Author

hoihu commented Dec 21, 2017

Sorry to not have followed up earlier on this...

I was able to reproduce the lockup by using 2 pyboards with the test setup as @dpgeorge has described. I had to decrease the pyb frequency to 40-60MHz on the receiver side.

I used the latest 1.9.3 snapshot (as of 21.12.2017).

Step-by-step:

  1. Wire up the sender pyboard's TX pin to the receiver pyboard's RX pin (e.g. by using UART 1, pins X9/X10)
  2. On the REPL of the sender enter:
from pyb import UART
import time,os
u=UART(1,230400)
while True: u.write(os.urandom(100)); time.sleep(0.01)
  3. On the receiver side, add pyb.freq(40000000) to boot.py, reset, then on the REPL enter:
from pyb import UART
u=UART(1,230400)
while True: u.read(10)

you should now see a lot of data being transmitted from sender pyboard to receiver pyboard

  4. Generate some flash write load. Open a new Python console in a terminal on your PC and enter:
import time, random
# modify this with the receiver's PYBFLASH path
PATH_TO_FILE = r'I:\test4.txt'
i=0
while True: print('run {}'.format(i)); fh=open(PATH_TO_FILE,'w'); fh.write(1000*'test'); fh.flush(); time.sleep(random.randrange(5,50)/100.); fh.close(); i+=1

This saves chunks of data on the receiver's mass storage device, with a random pause of roughly 0.05-0.5 sec between writes (randrange(5,50)/100).

On my PC (Win7) it takes around 50-300 writes until the receiver locks up.

@hoihu
Contributor Author

hoihu commented Dec 21, 2017

What I think happens is:

  • The systick interrupts the UART RX interrupt routine at the critical section outlined above
  • It enters the systick handler, then sees that it has some flash sections to flush, so the flash irq is set (not yet entered)
  • after exiting the systick interrupt, the flash irq immediately starts, because it has a higher priority than the UART receiver irq
  • the flash irq probably blocks for some time (I haven't measured, but flash erase/write cycles must be in the range of several tens of msec).
  • during that time the next UART rx char is received, causing an overrun condition in the uart's status register
  • after exiting the flash irq, the RX irq can finally be processed further. But now overrun is set, and hence the description of the problem I initially gave applies.
  • the UART RX irq continuously fires, since the overrun condition is never cleared -> lockup

@hoihu
Contributor Author

hoihu commented Dec 21, 2017

I made some further measurements with this test setup.

It does crash also with the default frequency of 168MHz but it takes longer (one testrun took ca. 3100 cycles to trigger the fault)

@dhylands
Contributor

The UART IRQ (1) priority is lower than systick (0), but higher than the flash IRQ (2). The SysTick interrupt does do some additional DMA processing (which looks like it should be pretty fast).

At 230400 baud, there should be an interrupt approximately every 43 microseconds. Normally UART processing takes about 0.5 microseconds IIRC.
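The 43 microsecond figure follows from the frame length. A quick check, assuming 8N1 framing (10 bits on the wire per character, which the thread does not state explicitly) and taking the 0.5 microsecond handler time quoted above at face value; `frame_time_us` is a helper name made up for this sketch:

```python
BAUD = 230400
BITS_PER_FRAME = 10   # assumption: 1 start + 8 data + 1 stop bit (8N1)
ISR_TIME_US = 0.5     # handler duration quoted in the comment above


def frame_time_us(baud, bits_per_frame=BITS_PER_FRAME):
    # Time on the wire for one character, in microseconds.
    return bits_per_frame * 1_000_000 / baud


t = frame_time_us(BAUD)
print(round(t, 1))                      # 43.4
print(round(100 * ISR_TIME_US / t, 1))  # 1.2 (percent of CPU time in the ISR)
```

So in normal operation the handler uses only about 1% of the time between characters, and anything that delays it by more than roughly one character time risks an overrun.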

Not handling the overrun should definitely be fixed.

If you can instrument this with a logic analyzer, that might help to understand where the time is being spent.

@hoihu
Contributor Author

hoihu commented Jan 1, 2018

The UART IRQ (1) priority is lower than systick (0), but higher than the flash IRQ (2).

I see, thanks. Sorry for the noise I may have created in my assumption above.

If you can instrument this with a logic analyzer, that might help to understand where the time is being spent.

I definitely intend to do that. Hopefully soon :) My employer is sponsoring MicroPython development by allowing time to be spent on these things (we are using MicroPython a lot internally). So I hope I can respond with an analysis in January.

Not handling the overrun should definitely be fixed.

What I don't like about this bug is that it can completely crash the system once this particular time window is hit.

Can somebody with access to 2 pyboards confirm this issue? I'd really like to see the severity of this issue being raised actually... On the other hand I also see the point of 300+ open issues and not many complaining about the UART :)

@bigfatter

Dear Sir/Madam,
I have the same problem. I run three UARTs at the same time on an STM32 F4: uart1, uart2 and uart3.
When uart1 has an overrun, the whole system gets stuck and I have to reboot it.
I use the HAL API. How do I solve this problem?
Thanks,
roseanne

@rlnktt

rlnktt commented Mar 27, 2019

Dear Sir/Madam,
I have the same problem. I run three UARTs at the same time on an STM32 F4: uart1, uart2 and uart3.
When uart1 has an overrun, the whole system gets stuck and I have to reboot it.
I use the HAL API. How do I solve this problem?
Thanks,
roseanne

I too have been facing the same issue consistently!!

@hoihu
Contributor Author

hoihu commented Mar 27, 2019

Sorry, I didn't follow up on this issue. We have fixed this locally on our branch, but that branch is behind the current master by half a year.

The UART handling code has changed a lot since then. I think if somebody else can confirm the bug using 2 pyboards as outlined above, that would help to increase the priority of fixing it.

Meanwhile, what you can do is patch the IRQ code so that it always resets the ORE flag before exiting the IRQ handler (thus avoiding the complete lockup of the system).

The HAL library has some code showing how to do that.

@hoihu
Contributor Author

hoihu commented Mar 27, 2019

something like

__HAL_UART_CLEAR_OREFLAG(&self->uart);

@rlnktt

rlnktt commented Mar 28, 2019

if (__HAL_UART_GET_FLAG(&self->uart, UART_FLAG_ORE) != RESET) {
    if (__HAL_UART_GET_FLAG(&self->uart, UART_FLAG_RXNE) == RESET) {
        // overrun and empty data, just do a dummy read to clear ORE
        // and prevent a raise condition where a continous interrupt stream (due to ORE set) occurs
        // (see chapter "Overrun error" ) in STM32 reference manual
        self->uart.Instance->DR;
    } 
}

Hi !!

When I tried your above suggestion, I got the following error.

[Screenshot of the error, 2019-03-28, 1:01 PM]

@dpgeorge
Member

Following this up now, I tried out the 2x pyboard test described by @hoihu in #3375 (comment), but I could not reproduce the error. And it seems it's because commit 372e7a4 inadvertently "fixed" the issue (in some scenarios, not all).

To really see what was going on I added an artificial delay in the UART IRQ handler, just at the start of the if (UART_RXNE_IS_SET(self->uartx)) conditional block. So after checking that RXNE was set it would delay for a while (at least a few characters' worth of time). With such a delay the sequence of events is as follows when 2 chars are written to the UART:

  • 1st char comes in
  • RXNE is set and UART IRQ runs
  • artificial delay is hit, 2nd char comes in and sets ORE
  • UART IRQ reads 1st char from DR, clearing RXNE but leaving ORE set
  • UART IRQ exits
  • UART IRQ reenters immediately due to ORE, but RXNE is cleared so nothing happens in IRQ handler
  • UART IRQ keeps being called
  • at some point IDLE is set because no more chars have come in (this is the recent commit mentioned above)
  • the UART IRQ handler then processes the IDLE flag and clears all flags in the process, including ORE
  • UART recovers

But there are still certain streams of UART chars that can cause the ORE to remain forever, streams that don't have enough pauses in them for the UART to be considered IDLE by the hardware.

As mentioned by @hoihu above, the proper fix is to handle the ORE interrupt properly and clear it.
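The difference between the accidental recovery via IDLE and an explicit ORE fix can be sketched with a toy model. Everything here is an assumption made for illustration: `spins_until_recovery` is a hypothetical helper, the handler is reduced to two flags, and the model assumes that no new RXNE event arrives while ORE is stuck, which matches what is reported in this thread.

```python
MAX_SPINS = 10_000  # cap so the model terminates


def spins_until_recovery(idle_after=None, clear_ore=False, max_spins=MAX_SPINS):
    """Count handler entries until the IRQ request goes away.

    idle_after: handler entry at which the hardware detects an idle line,
                or None for a stream without long enough pauses.
    clear_ore:  True models a handler that clears ORE itself.
    Returns None if the IRQ is still pending after max_spins entries.
    """
    flags = {"ORE": True, "IDLE": False}  # state right after the overrun
    for spin in range(1, max_spins + 1):
        if idle_after is not None and spin == idle_after:
            flags["IDLE"] = True
        # Handler body: processing IDLE clears all flags, including ORE.
        if flags["IDLE"]:
            flags["ORE"] = False
            flags["IDLE"] = False
        # Handler body with the fix: ORE is handled explicitly.
        if clear_ore and flags["ORE"]:
            flags["ORE"] = False
        if not (flags["ORE"] or flags["IDLE"]):
            return spin
    return None


print(spins_until_recovery(idle_after=250))  # 250: recovers once IDLE is seen
print(spins_until_recovery(idle_after=None)) # None: stream without pauses
print(spins_until_recovery(clear_ore=True))  # 1: explicit ORE handling
```

With the explicit clear, recovery no longer depends on the sender leaving gaps in the stream.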

@dpgeorge
Member

I posted a fix for this in #4653

@dpgeorge
Member

dpgeorge commented Apr 1, 2019

Should be fixed by 7b5bf5f

@dpgeorge dpgeorge closed this as completed Apr 1, 2019
tannewt added a commit to tannewt/circuitpython that referenced this issue Sep 11, 2020
Correction for Issue micropython#3296 - ble hanging on nrf52840