stm32: UART overrun race condition locks up system #3375
Comments
Thanks for the report, but I can't reproduce it. I tried the way described (but using another pyboard to generate the random data on the UART, not the PC) but everything was OK, I could save boot.py over 50 times in a row without issue. I also tried explicitly disabling interrupts in a function and then delaying for a while:
Running this in a loop didn't give any issues:
@hoihu are you able to reproduce the bug using the above
@dpgeorge Disabling interrupts periodically may not be probing the issue raised. I think you need to regularly raise a higher priority interrupt in the hope that one occurs in the critical section identified. That said, I can't see a race condition. The scenario as I see it is this. The circular buffer is not full when a character arrives. The ISR reads the UART data and puts it in the buffer. Then a higher priority interrupt occurs. Another character arrives, followed by a second, causing an overrun error. The priority ISR terminates. The UART ISR then terminates normally (not disabling the interrupt) and immediately runs again. What I don't follow is what happens next. When it restarts, depending on the buffer state, it should either read the data (clearing the overrun interrupt) or disable the UART IRQ. Either should be safe (although one or more characters will be lost).
Right, that's what I thought too, but UART RX has the highest priority! (Except for SysTick, so there could be something there...).
yes, thanks for the clarification - precisely what I meant. Maybe race condition is not the perfect technical term.
The overrun condition is cleared by reading the status register SR and then reading the DR register. As it looks, the overrun must be cleared before UART_FLAG_RXNE can be set again... In general, the RXNEIE flag (enabled using
Just to clarify: I have verified the lock-up situation using pin toggling in the UART ISR handler. During the lock-up, I used a JTAG debugger to read out the UART's SR register.
Some other people seem to have similar problems, e.g. https://github.com/spark/firmware/issues/776
I'm working on a better reproduction. Lowering the CPU speed may help.
The STM32 UART hardware implementation is a bit bumpy. Without DMA (which works pretty well) we had to explicitly read the status register (SR / ISR). On the STM32F7 it is also required to clear the error flags before leaving the IRQ service routine.
Sorry to not have followed up earlier on this... I was able to reproduce the lockup by using 2 pyboards with the test setup as @dpgeorge has described. I had to decrease the pyb frequency to 40-60MHz on the receiver side. I used the latest 1.9.3 snapshot (as of 21.12.2017). Step-by-step:
You should now see a lot of data being transmitted from the sender pyboard to the receiver pyboard.
This saves chunks of data on the receiver's mass storage device, with a random break of 0.5-5 s between writes. On my PC (Win7) it takes around 50-300 writes until the receiver locks up.
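The original script is not preserved in this thread. As a rough sketch only, a host-side loop of the kind described (repeated writes to the receiver's mass storage with a random 0.5-5 s pause) might look like the following. The function name `stress_writes`, the payload size, and the target path are all assumptions made for illustration, not the script that was actually used.

```python
import os
import random
import time


def stress_writes(path, count, payload=b"x" * 4096,
                  min_pause=0.5, max_pause=5.0, sleep=time.sleep):
    """Rewrite `path` `count` times with a random pause after each write.

    Hypothetical helper: `path` is assumed to point at a file on the
    pyboard's USB mass storage drive as seen from the PC.
    """
    for _ in range(count):
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device now
        # Random break between writes, 0.5-5 s by default
        sleep(random.uniform(min_pause, max_pause))
    return count
```

Calling something like `stress_writes("E:/stress.bin", 300)` (the drive letter is a placeholder) while the sender pyboard keeps transmitting would approximate the 50-300 writes mentioned above. The `sleep` parameter exists only so the pause can be replaced when trying the function out without waiting.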
What I think happens is:
I made some further measurements with this test setup. It also crashes at the default frequency of 168 MHz, but it takes longer (one test run took about 3100 cycles to trigger the fault).
The UART IRQ (1) priority is lower than systick (0), but higher than the flash IRQ (2). The SysTick interrupt does do some additional DMA processing (which looks like it should be pretty fast). At 230400 baud, there should be an interrupt approximately every 43 microseconds. Normally UART processing takes about 0.5 microseconds IIRC. Not handling the overrun should definitely be fixed. If you can instrument this with a logic analyzer, that might help to understand where the time is being spent.
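The 43 microsecond figure can be checked with a little arithmetic. The sketch below assumes 8N1 framing (1 start bit, 8 data bits, no parity, 1 stop bit, so 10 bits per character), which is not stated in the thread. The helper name `char_time_us` is made up for this example.

```python
def char_time_us(baud, data_bits=8, parity_bits=0, stop_bits=1):
    """Time on the wire for one UART character, in microseconds."""
    bits_per_frame = 1 + data_bits + parity_bits + stop_bits  # 1 start bit
    return bits_per_frame / baud * 1e6


# Assumed 8N1 framing at the baud rate quoted above
frame_us = char_time_us(230400)

# Share of CPU time spent in the ISR if each call takes about 0.5 us
isr_share = 0.5 / frame_us

print(f"one character every {frame_us:.1f} us, ISR load {isr_share:.2%}")
```

Under these assumptions a character arrives roughly every 43.4 µs and the ISR uses a little over 1% of the CPU at a continuous data stream. Since the receiver holds only one unread character, anything that delays the UART ISR by about one character time is enough to produce an overrun.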
I see, thanks. Sorry for the noise I may have created in my assumption above.
I definitely intend to do that. Hopefully soon :) My employer is sponsoring MicroPython development by allowing time to be spent on these things (we are using MicroPython a lot internally). So I hope I can respond with an analysis in January.
What I don't like about this bug is that it can completely crash the system once this special time window is hit. Can somebody with access to 2 pyboards confirm this issue? I'd really like to see the severity of this issue raised... On the other hand, I also see the point of 300+ open issues and not many people complaining about the UART :)
Dear Sir/Madam:
I too have been facing the same issue consistently!!
I didn't follow up on this issue, sorry. We have fixed this locally on our branch, but that is behind the current master by half a year, and the UART handling code has changed a lot since. I think if somebody else can confirm it using 2 pyboards as outlined above, that would help to increase the priority of this bug fix. Meanwhile, you can patch the IRQ code so that it resets the ORE flag before exiting the IRQ handler (thus avoiding the complete lock-up of the system). The HAL library has some code showing how to do that.
Something like __HAL_UART_CLEAR_OREFLAG(&self->uart);
Hi!! When I tried your suggestion above, I got the following error.
Following this up now, I tried out the 2x pyboard test described by @hoihu in #3375 (comment), but I could not reproduce the error. It seems that's because commit 372e7a4 inadvertently "fixed" the issue (in some scenarios, not all). To really see what was going on I added an artificial delay in the UART IRQ handler, just at the start of the
But there are still certain streams of UART chars that can cause the ORE to remain forever, streams that don't have enough pauses in them for the UART to be considered IDLE by the hardware. As mentioned by @hoihu above, the proper fix is to handle the ORE interrupt properly and clear it.
I posted a fix for this in #4653
Should be fixed by 7b5bf5f |
Correction for Issue micropython#3296 - ble hanging on nrf52840
There is a race condition present in the stm32 port that relates to the way the UART overrun (ORE bit in USART status register) is handled in the IRQ service routine.
If this condition is met, the pyboard completely locks up and cannot be recovered (you have to hard-reset or power cycle).
The reason is that the UART RX ISR does not clear the ORE flag if the receive buffer is empty, so the IRQ fires again as soon as the handler exits (ORE also triggers the IRQ). This results in 100% CPU utilisation.
This situation is explained in the "Overrun error" section of the USART chapter in the STM32 reference manual.
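To make the mechanism easier to follow, here is a small host-side model of it, written in Python purely for illustration. It is not the code in uart.c. It assumes the STM32F4 behaviour where ORE is cleared only by a read of SR (with ORE set) followed by a read of DR, and where the RX interrupt is raised by either RXNE or ORE. The class `UsartModel`, both ISR functions, and `irq_storm` are hypothetical names, and the model simplifies the hardware by treating any character that completes while RXNE is still set as an overrun.

```python
RXNE = 1 << 5  # USART_SR: receive data register not empty
ORE = 1 << 3   # USART_SR: overrun error


class UsartModel:
    """Simplified model of the USART receive flags (an assumption, not the HAL)."""

    def __init__(self):
        self.sr = 0
        self.dr = 0
        self._ore_seen = False  # True once SR has been read while ORE was set

    def receive(self, byte):
        if self.sr & RXNE:
            self.sr |= ORE  # previous character unread: overrun, new one is lost
        else:
            self.dr = byte
            self.sr |= RXNE

    def read_sr(self):
        self._ore_seen = bool(self.sr & ORE)
        return self.sr

    def read_dr(self):
        self.sr &= ~RXNE
        if self._ore_seen:  # SR-then-DR sequence completed: ORE is cleared
            self.sr &= ~ORE
            self._ore_seen = False
        return self.dr

    def irq_pending(self):
        return bool(self.sr & (RXNE | ORE))


def isr_rxne_only(uart, buf, late_byte=None):
    """ISR that only handles RXNE, as described in the report."""
    sr = uart.read_sr()
    if sr & RXNE:
        if late_byte is not None:
            uart.receive(late_byte)  # character completes between SR and DR reads
        buf.append(uart.read_dr())


def isr_with_ore_clear(uart, buf, late_byte=None):
    """Same ISR plus an explicit ORE clear (one possible fix, for illustration)."""
    sr = uart.read_sr()
    if sr & RXNE:
        if late_byte is not None:
            uart.receive(late_byte)
        buf.append(uart.read_dr())
    elif sr & ORE:
        uart.read_dr()  # SR was just read with ORE set, so this read clears ORE


def irq_storm(isr, limit=1000):
    """Count ISR re-entries after the race, giving up at `limit`."""
    uart, buf = UsartModel(), []
    uart.receive(0x41)
    isr(uart, buf, late_byte=0x42)  # overrun lands inside the critical window
    runs = 0
    while uart.irq_pending() and runs < limit:
        isr(uart, buf)
        runs += 1
    return runs


print(irq_storm(isr_rxne_only), irq_storm(isr_with_ore_clear))
```

In this model the RXNE-only handler is re-entered until the iteration cap is reached, because ORE stays set while RXNE is clear and nothing ever reads DR again. The variant with the extra branch returns after a single re-entry. The cap stands in for the real system, where the handler would simply keep firing.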
Steps to reproduce:
You should now see a lot of random data being transferred to the pyboard. Since the time window of the critical section (https://github.com/micropython/micropython/blob/master/ports/stm32/uart.c#L486-L494) is somewhat smaller than 1 µs, it's very hard to hit the bug and normally you won't see any trouble.
However, the race condition can be forced if another IRQ is triggered on the pyboard, delaying the call of the UART service routine.
One way that worked for me was to load a file, e.g. boot.py, in a text editor and hit "save" several times (normally 10-20 times should be enough) -> BOOM. The pyboard freezes and continuously fires the UART RX IRQ.
The way to resolve this is to put the following code
just before https://github.com/micropython/micropython/blob/master/ports/stm32/uart.c#L486
On the L4, the overrun can be disabled in the UART's control register, so it doesn't need that check.