The Ethernet HAL of the STM32H7x3 series might be the worst of all the software drivers that come from CubeMX. On this page I’ll explain which problems arise, propose a solution and do some testing to verify the solution. You can also jump to the solution or go directly to the repository.
We tested with the following:
- Ethernet HAL for the STM32H7x3 series
- HAL Version V1.7.0
- CubeMX Version 5.6.0
- Compiled with MDK-ARM V5
- Tested on NUCLEO-H743ZI Rev B
- LwIP version 2.0.3 with RAW API
The problem we encountered can be described as follows: sometimes received packets are overwritten (or so it seems) before the application has freed them. When receiving files or other sequential data where the order (and contents) of packets are important, some chunks of data showed up twice and other chunks were missing. The final file size was correct. At that moment we were using a TCP connection, so we thought we didn’t have to worry about data ordering and such.
We dived into ST’s Ethernet HAL that comes with CubeMX, and it turned out that there are some small mistakes, but also that something is fundamentally not right.
Finding The Root Of The Problem
To find the problem we first have to understand how the RX channel of Ethernet in the H7 works. Let’s find out.
The Ethernet peripheral of the H7 isn’t built or designed by ST; it’s bought and integrated. Perhaps that’s also (part of) the reason they weren’t capable of delivering a proper driver set with it.
The peripheral consists of a dedicated DMA controller with 2 channels: RX and TX. Our problem was clearly in the RX path, so I decided to start by looking at how it works using the ST HAL drivers. Note that the drivers are not limited to the ETH HAL file but also include the ethernetif.c file. That is where the magic happens.
First things first. A DMA controller, as you might know, needs to know where to put data and where to get it from. Then it needs to know how much data there is and the size of this data. After that all it needs is a start signal and there is not much more to it. The Ethernet DMA is no different. The biggest difference is that this DMA needs to keep itself going, i.e. Ethernet should continue to receive after the first packet has come in. That is why the Ethernet DMA controller uses a descriptor list.
The Descriptor List
The descriptor list is a continuous list of “descriptors”. Descriptors in this case are nothing more than a pile of pointers and some registers. The minimum size of an RX descriptor is 4x32 bits, which typically (for a ‘normal’ descriptor) consists of a 32-bit buffer address, 32 reserved bits, 32 bits of payload, 29 reserved bits and finally an OWN bit and an IOC bit.
These pointers are used by the ethernet DMA controller, so it knows where to put the received data. The OWN bit tells if a descriptor is ready for new data (OWN = 1), or it’s filled up with data and the application can process it (OWN = 0). For more information about descriptors, please see the user manual, p. 2893, chapter 58.10.4 or the image above.
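As a rough illustration of the layout described above, here is a minimal sketch of a four-word RX descriptor with helpers for the OWN bit. The type and function names are made up for this example, and the bit positions (OWN in bit 31 and IOC in bit 30 of the last word) are an assumption that should be checked against the user manual and your HAL headers. A real driver sets more control bits than shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions in the last descriptor word; verify before use. */
#define RX_DESC_OWN ((uint32_t)1u << 31) /* 1 = DMA may fill it, 0 = application owns it */
#define RX_DESC_IOC ((uint32_t)1u << 30) /* interrupt on completion */

/* Hypothetical minimal "normal" RX descriptor: four 32-bit words. */
typedef struct {
    volatile uint32_t des0; /* buffer address */
    volatile uint32_t des1; /* reserved */
    volatile uint32_t des2; /* payload word */
    volatile uint32_t des3; /* control/status, including OWN and IOC */
} rx_desc_t;

/* Hand a descriptor to the DMA: point it at a buffer and set OWN. */
static void rx_desc_give_to_dma(rx_desc_t *d, uint32_t buffer_addr)
{
    assert(d != NULL);
    d->des0 = buffer_addr;
    d->des3 = RX_DESC_OWN | RX_DESC_IOC;
}

/* True once OWN is 0, i.e. the application may process the data. */
static bool rx_desc_ready_for_app(const rx_desc_t *d)
{
    assert(d != NULL);
    return (d->des3 & RX_DESC_OWN) == 0u;
}
```

On the target the DMA clears OWN itself after writing a frame; in a host-side test you can clear the bit by hand to mimic that.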
Now we put a few (default 4 in the HAL) of those descriptors in a row and we’ve got ourselves a descriptor list. Because there is no unlimited memory for descriptors in the H7 (or anywhere, by the way), the list will eventually roll over to the first descriptor. This makes it act more like a circular DMA and a descriptor ring.
This means that used (filled) descriptors must eventually be re-used by the DMA, which makes perfect sense. The DMA controller simply fills a descriptor and moves on to the next. To make sure it does not overwrite an older packet’s data, the controller checks the tail pointer. That is a pointer to the last descriptor in the list where the OWN bit is 0.
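The ring behaviour can be simulated on a PC. The sketch below is a simplified model, not the real peripheral: it uses only a per-slot ownership flag and leaves the tail pointer register out, and all names are hypothetical. It shows that the simulated DMA wraps around and refuses to write into a slot the application still owns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_LEN 4u /* same as the HAL default mentioned above */

typedef struct {
    bool     own[RING_LEN];    /* true = DMA may write, false = application owns */
    uint16_t length[RING_LEN]; /* length of the frame stored in each slot */
    size_t   dma_index;        /* next slot the simulated DMA will fill */
} rx_ring_t;

static void ring_init(rx_ring_t *r)
{
    assert(r != NULL);
    for (size_t i = 0; i < RING_LEN; i++) {
        r->own[i] = true;
        r->length[i] = 0;
    }
    r->dma_index = 0;
}

/* Simulated DMA: store one frame if the current slot is free.
 * Returns false when the slot is still owned by the application. */
static bool ring_dma_receive(rx_ring_t *r, uint16_t frame_len)
{
    assert(r != NULL);
    size_t i = r->dma_index;
    if (!r->own[i]) {
        return false;
    }
    r->length[i] = frame_len;
    r->own[i] = false;                  /* hand the slot to the application */
    r->dma_index = (i + 1u) % RING_LEN; /* roll over at the end of the list */
    return true;
}

/* Application side: give one slot back to the DMA. */
static void ring_release(rx_ring_t *r, size_t i)
{
    assert(r != NULL && i < RING_LEN);
    r->own[i] = true;
}
```

With four slots, a fifth frame is only accepted after the application has released slot 0 again.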
Preparing The List
To tell the DMA controller what this list looks like, we’ll have to tell it how big one descriptor is. That means: how many words are between the first 32 bits of the first descriptor and the first 32 bits of the second descriptor. In the case of the HAL that is 6, because they have included 2 extra 32-bit pointers. And with good reason, but I’ll explain that later on.
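The value of 6 can be checked with a few lines of C. The struct below is a hypothetical stand-in that mirrors the description (four descriptor words plus two extra 32-bit words); the field names are chosen for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: four words used by the DMA plus two extra
 * 32-bit words kept for software use, as described for the HAL. */
typedef struct {
    uint32_t des0;
    uint32_t des1;
    uint32_t des2;
    uint32_t des3;
    uint32_t extra0; /* software-only word */
    uint32_t extra1; /* software-only word */
} hal_like_rx_desc_t;

/* Distance between the first word of two consecutive descriptors,
 * counted in 32-bit words. */
static size_t desc_stride_words(void)
{
    assert(sizeof(hal_like_rx_desc_t) % sizeof(uint32_t) == 0);
    return sizeof(hal_like_rx_desc_t) / sizeof(uint32_t);
}

/* Words per descriptor slot that the DMA itself does not use. */
static size_t desc_extra_words(void)
{
    return desc_stride_words() - 4u;
}
```

For this struct the stride is 6 words (24 bytes), of which 2 are the extra words.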
According to the user manual of the H7 (RM0433), all the receive descriptors are prepared and given to the DMA as normal descriptors, with the OWN bit set(!) so the DMA can (may) process them (see also 58.10.4 of the UM, or the image below). The HAL however thinks it’s wise to ignore this(?) and sets all OWN bits to 0.
Then, also from the user manual, the tail pointer should be set to the first descriptor with the OWN bit set to 1 (the application’s first descriptor). In the HAL the tail pointer is set to the last descriptor of the list, so the DMA controller can fill all the descriptors up while the application is booting.
An overview of DMA reception setup and operation from the user manual is found in the image below. It’s pretty clear.
Using And Freeing Descriptor Data
Now that the DMA controller is pumping our RX descriptors full, we should read and parse the data (LwIP does that part in our application) and let the DMA controller know that the descriptors we have read can be re-used. ST’s HAL implements a function that does exactly this, except with a little side effect. The function is named HAL_ETH_BuildRxDescriptors() and it does the following: it loops through all descriptors that are used by the application, and frees them…? That means that when this function is called, every Ethernet packet is freed and can potentially be overwritten by the DMA controller.
OK, cool, good to know, but it’s all just a matter of calling the function at the right time, right? Yes, that is a possibility. What would be better is to have a function that only frees a single descriptor: specifically the descriptor that is being freed by the application, not another one. So one of the changes in the HAL Ethernet file is a function that goes by the name HAL_ETH_ReleaseSingleRxDescriptor(). I think release is a better word for its function than build.
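The difference between the two approaches can be modelled without any hardware. The sketch below is not the code from the HAL or from the repository; the names and the simple ownership array are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RX_DESC_COUNT 4u

/* own[i] == true means the DMA may overwrite buffer i. */
typedef struct {
    bool own[RX_DESC_COUNT];
} rx_state_t;

/* Models the behaviour described for HAL_ETH_BuildRxDescriptors():
 * every descriptor held by the application is handed back at once.
 * Returns the number of descriptors that were released. */
static size_t release_all_rx_descriptors(rx_state_t *s)
{
    size_t released = 0;
    assert(s != NULL);
    for (size_t i = 0; i < RX_DESC_COUNT; i++) {
        if (!s->own[i]) {
            s->own[i] = true;
            released++;
        }
    }
    return released;
}

/* Models the single-release idea: only descriptor `index` goes back.
 * Returns false if the index is invalid or the DMA already owns it. */
static bool release_single_rx_descriptor(rx_state_t *s, size_t index)
{
    assert(s != NULL);
    if (index >= RX_DESC_COUNT || s->own[index]) {
        return false;
    }
    s->own[index] = true;
    return true;
}
```

If the application still holds the buffers of descriptors 0 and 2, the release-all variant exposes both to the DMA, while the single-release variant only touches the one that was actually freed.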
The ethernetif.c file takes care of initialising the I/O for the PHY in use and perhaps some functions to communicate with the PHY. It also couples the HAL to the LwIP stack, using ethernetif_input and low_level_output to pass data into and out of LwIP. When looking at the low_level_input function, we can see that it checks whether data has been received, and if so it gets the length and buffer from this descriptor and then frees it immediately. Nice! That, in combination with a descriptor ring of only 4 descriptors, could be the problem we (and many more) are facing!
I’ve tried to visualize this using the image below. The last RX descriptor was filled after the first build descriptors function was called.
ST already implemented a custom_pbuf from LwIP, which calls a given free function when the packet is freed by the application. When the application frees the packet, that could be our green light to also free the descriptor. All we need to know is which descriptor this data belongs to. Doesn’t sound too hard, right? The only problem that can arise is fragmentation of the descriptor ring, because we can never be sure the application frees the packets in the same order they were received. That makes it a bit harder to keep the tail pointer pointing to the right descriptor. We tried to minimize this occurrence by increasing the descriptor ring length to 16. This also means we need much more space for the RX buffer, plus an extra list to remember which pbufs belong to which descriptors so they can be freed later.
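One possible way to handle out-of-order frees is sketched below. It is not taken from the repository: the names, the three slot states and the rule "only advance the tail over a contiguous run of freed slots" are assumptions made for this example. A packet is represented by an opaque pointer, which in the real driver would be the pbuf.

```c
#include <assert.h>
#include <stddef.h>

#define RX_DESC_COUNT 16u

typedef enum { SLOT_DMA_OWNED, SLOT_APP_HELD, SLOT_APP_FREED } slot_state_t;

typedef struct {
    slot_state_t state[RX_DESC_COUNT];
    const void  *owner[RX_DESC_COUNT]; /* packet handle (e.g. a pbuf) using this slot */
    size_t       tail;                 /* oldest slot not yet returned to the DMA */
} rx_tracker_t;

static void tracker_init(rx_tracker_t *t)
{
    assert(t != NULL);
    for (size_t i = 0; i < RX_DESC_COUNT; i++) {
        t->state[i] = SLOT_DMA_OWNED;
        t->owner[i] = NULL;
    }
    t->tail = 0;
}

/* Called when the frame in `slot` is passed up to the stack as `packet`. */
static void tracker_on_receive(rx_tracker_t *t, size_t slot, const void *packet)
{
    assert(t != NULL && slot < RX_DESC_COUNT && packet != NULL);
    t->state[slot] = SLOT_APP_HELD;
    t->owner[slot] = packet;
}

/* Called from the packet's free callback. Returns how many descriptors
 * can be handed back to the DMA now (0 if an older one is still held). */
static size_t tracker_on_free(rx_tracker_t *t, const void *packet)
{
    size_t returned = 0;
    assert(t != NULL);
    /* Look up which descriptor this packet belongs to. */
    for (size_t i = 0; i < RX_DESC_COUNT; i++) {
        if (t->state[i] == SLOT_APP_HELD && t->owner[i] == packet) {
            t->state[i] = SLOT_APP_FREED;
            t->owner[i] = NULL;
            break;
        }
    }
    /* Advance the tail over every freed slot that directly follows it. */
    while (returned < RX_DESC_COUNT && t->state[t->tail] == SLOT_APP_FREED) {
        t->state[t->tail] = SLOT_DMA_OWNED;
        t->tail = (t->tail + 1u) % RX_DESC_COUNT;
        returned++;
    }
    return returned;
}
```

In this model, freeing the second packet before the first returns nothing to the DMA yet; once the first packet is freed as well, both descriptors go back in one step.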
To compare it with ST’s solution, I again drew an overview:
To verify our ideas and the improved stability I’ve rebuilt the tcp-echoserver that came with LwIP. Using netcat I’ve sent a file with the numbers 1 to 60000 as ASCII text, separated by newlines (0x0D 0x0A, \r\n). The echo server now tries to extract a number (up to the newline characters) and compares it with a counter (starting at 1) that is converted to an ASCII line using sprintf.
The last number of the packet most likely is not complete, so a simple buffer is utilized. The flow of this buffering is shown in the diagram below.
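The buffering idea can be sketched as follows. This is a simplified stand-alone version, not the code from the echo server in the repository: the struct and function names are hypothetical, and snprintf is used instead of sprintf to bound the write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char     partial[16]; /* unfinished line carried over from the previous packet */
    size_t   partial_len;
    unsigned expected;    /* next number we expect to see, starting at 1 */
    unsigned errors;      /* lines that did not match the counter */
} seq_checker_t;

static void seq_checker_init(seq_checker_t *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
    c->expected = 1;
}

/* Feed one received payload. Every complete "\r\n"-terminated line is
 * compared with the counter; an incomplete trailing line is kept in
 * `partial` until the next call. */
static void seq_checker_feed(seq_checker_t *c, const char *data, size_t len)
{
    assert(c != NULL && data != NULL);
    for (size_t i = 0; i < len; i++) {
        char ch = data[i];
        if (ch == '\n') {
            char want[16];
            /* Expected line contents, including the '\r' before the '\n'. */
            int n = snprintf(want, sizeof want, "%u\r", c->expected);
            if (n < 0 || (size_t)n != c->partial_len ||
                memcmp(want, c->partial, (size_t)n) != 0) {
                c->errors++;
            }
            c->expected++;
            c->partial_len = 0;
        } else if (c->partial_len < sizeof c->partial) {
            c->partial[c->partial_len++] = ch;
        }
        /* Characters beyond the buffer size are dropped; such a line can
         * never match and is counted as an error at its newline. */
    }
}
```

Calling seq_checker_feed() once per received payload leaves `errors` at 0 for an intact stream, regardless of where the packet boundaries fall; a duplicated or missing chunk shows up as a mismatch.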
Please check this repository for the source, especially the ethernetif.c and the stm32h7xx_hal_eth.c file. We performed tests on the NUCLEO-H743ZI which turned out positive. With the same test software but the original HAL and ethernetif.c we got an error, every single time. But – fair is fair, when only increasing to 16 descriptors instead of 4, ST’s HAL works too. I would not say it’s perfect, but our test passes.
Implementing this in your own application
Probably the most wanted chapter from this page. How do you implement all this? And what about the MPU?
1. Increase descriptor list size
2. Move descriptor addresses and Rx_Buff to right after the descriptor rings MPU regions
3. Set up MPU for descriptors, Rx_Buff and TX buffer using the overview below
4. Set up LwIP to limit packet sizes and TX buffer address
5. Edit ethernetif.c so it won’t free all packets when one packet is received
6. Add functions to stm32h7xx_hal_eth.c and remove some mistakes
7. Create (or edit) a scatter file to reserve some space for the TX buffer
8. Compile, program and enjoy the beauty of ethernet
|Increase descriptor list size
|Move descriptor addresses and Rx_Buff to right after the descriptor rings and MPU range
|Set up MPU for descriptors, Rx_Buff and TX buffer |217, 233, 247
|Set up LwIP to limit packet sizes and TX buffer address
|Edit ethernetif.c so it won’t free all packets when one packet is received
|Add functions to stm32h7xx_hal_eth.c and remove some faults
|Create or add scatter file and add LwIP Heap section to reserve some space for the TX buffer
|Compile, program and enjoy the beauty of ethernet
There are still some improvements to be done. The speed is not what it can be (measured using iPerf). Some are easy, like the generation of the custom pbuf, that can happen at the init instead of every time a packet is received. But at least there is a start. Feel free to comment your thoughts and improvements, or clone the repo and start working with me!