
TE0890 HyperRAM data corruption


Joris van Rantwijk:
I designed my own HyperRAM controller for the TE0890 Spartan-7 mini FPGA module.

The code for my HyperRAM controller and test driver is on Github:
https://github.com/jorisvr/te0890-utils/tree/master/hyperram_test

The whole thing is working very nicely, except that I see short bursts of data corruption approximately once every 10 to 20 hours. I have tried everything I could think of to find the source of these issues, but I just can't figure it out.

I'm not using the BlackMesaLabs hyperram Verilog core, because it only supports dword-level access while I want byte-level write enables, and because the Verilog code only works up to 80 MHz while I want to run at 100 MHz.
So I designed my own controller in VHDL. It operates the HyperRAM at 100 MHz, which should be the maximum supported frequency of the device. My test consists of a simple but intensive march test with varying data patterns (this is sometimes called "moving inversions"). The whole thing seems to work almost flawlessly. It correctly handles repeated runs of the test pattern for many hours. But approximately once every 10 to 20 hours, the test detects a burst of between 1 and 4000 errors, then continues to run for hours again without errors.
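
For readers unfamiliar with the test, here is a minimal software sketch of one moving-inversions pass. It is not the VHDL test driver from the repository; the function name, the 8-bit word width and the `StuckBitMemory` fault model are assumptions made purely for illustration.

```python
def moving_inversions(mem, pattern, width_mask=0xFF):
    """Run one moving-inversions pass over `mem` (a mutable sequence).

    Returns a list of (address, expected, actual) tuples, one per mismatch.
    """
    inverse = ~pattern & width_mask
    errors = []

    # Step 1: fill the memory with the pattern.
    for addr in range(len(mem)):
        mem[addr] = pattern

    # Step 2: ascending addresses: check the pattern, write the inverse.
    for addr in range(len(mem)):
        if mem[addr] != pattern:
            errors.append((addr, pattern, mem[addr]))
        mem[addr] = inverse

    # Step 3: descending addresses: check the inverse, write the pattern.
    for addr in reversed(range(len(mem))):
        if mem[addr] != inverse:
            errors.append((addr, inverse, mem[addr]))
        mem[addr] = pattern

    # Step 4: final read-back of the pattern.
    for addr in range(len(mem)):
        if mem[addr] != pattern:
            errors.append((addr, pattern, mem[addr]))
    return errors


class StuckBitMemory(list):
    """Hypothetical faulty memory: bit 0 of one address is stuck at 1."""

    def __init__(self, size, bad_addr):
        super().__init__([0] * size)
        self.bad_addr = bad_addr

    def __setitem__(self, addr, value):
        if addr == self.bad_addr:
            value |= 0x01  # the stuck bit corrupts every write to this cell
        super().__setitem__(addr, value)
```

Running `moving_inversions(StuckBitMemory(64, 10), 0xAA)` reports mismatches only at address 10, while a plain `[0] * 64` list passes cleanly. Repeating the pass with several patterns gives the "varying data patterns" part of the test.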

Based on the error patterns, I suspect the corruption occurs in the write data path and perhaps sometimes in the address, but not in the read data path. However, it is really difficult to determine this with my current test method.
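
One generic way to separate read errors from stored corruption is to re-read a failing address immediately. The sketch below is an assumption about how such a check could look, not part of the existing test design; `read_word` is a hypothetical callback that reads one word from the memory.

```python
def classify_mismatch(read_word, addr, expected, retries=3):
    """Re-read a failing address to guess where the corruption happened.

    Returns "read-path" if any re-read yields the expected value (the
    stored data is fine, so the first read was bad). Returns "stored" if
    every re-read is still wrong, which points to the write data path or
    the address path.
    """
    for _ in range(retries):
        if read_word(addr) == expected:
            return "read-path"
    return "stored"
```

A "stored" result cannot by itself tell write-data corruption from address corruption; for that, the test would also have to check whether the missing data turned up at some other address.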

I tried shifting the clock phases used to drive data to the HyperRAM and to capture data from the HyperRAM. This confirms that I have at least 30 degrees margin in both directions before error rates increase significantly. I therefore find it unlikely that the data corruption is caused by something like setup/hold time violations on the HyperRAM interface.
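
To put the phase margin in time units, a small conversion helps. This is plain arithmetic; the function name is made up for illustration.

```python
def phase_to_ps(phase_deg, clock_mhz):
    """Convert a clock phase shift in degrees to picoseconds."""
    period_ps = 1e6 / clock_mhz  # 100 MHz -> 10000 ps period
    return period_ps * phase_deg / 360.0


margin_ps = phase_to_ps(30, 100)  # about 833 ps
```

At 100 MHz the clock period is 10 ns, so 30 degrees corresponds to roughly 833 ps in each direction. Because HyperBus transfers data on both clock edges, each data bit occupies half a period, which is 5 ns at this frequency.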

I tried relaxing the timing configuration of the HyperRAM. At 100 MHz it should support tACC=4 cycles, tRWR=4 cycles. I tried running at default tACC=6 cycles, fixed 2x access latency, and tRWR=6 cycles. I still get data corruption in that configuration.

I tried running at 80 MHz instead of 100 MHz, but I still see data corruption.
I'm currently testing at 50 MHz and no corruption yet, but the errors are so infrequent that I will have to test for several days to be sure.
However I really want to run the RAM at 100 MHz and I believe it should be possible.

At this point I don't know how to debug this any further.
It is remarkable that the errors are extremely rare, but burst-like in nature (no errors for many hours, then hundreds of errors in a fraction of a second). This suggests to me that some aspect of the system is intermittently unstable.
Could the MMCM lose phase lock? - Why would it do that?
Could the signal amplitude of the HyperRAM interface drop below the noise margin? - What could cause that?
Is this just the best that HyperRAM can do? - But then it is basically unusable without ECC.
I'm powering my TE0890 module from the USB bus of my computer. Not the most low-noise supply, but I think it should be good enough.

Question 1: Does anybody have any clue what might be going on here?

Question 2: Does anybody have experience with the TE0890 HyperRAM? I'm interested in success stories, similar problems, different problems.

Question 3: Is anybody willing to run my HyperRAM test on their own TE0890 for a few days. The HyperRAM controller and test design are in my Github, linked above. Note that the expected error rate is extremely low, so the test may need to run for many hours to draw conclusions.

JH:
Hi,
sorry, I can't help much, but there are other IPs available (some are free, some you must pay for, or use a trial version with a 10-minute time bomb). Maybe this helps you find the issue with your own IP:

Synaptic Lab (example with trial version on TE0725 available):
* https://synaptic-labs.force.com/s/ip-hbmc
* https://synaptic-labs.force.com/s/free-trials-xilinx-fpga

Open Source:
* https://github.com/blackmesalabs/hyperram

CYPRESS (registration necessary, core under NDA, no settings for our modules):
* http://www.cypress.com/documentation/software-and-drivers/hyperbus-master-interface-controller-ip-intellectual-property

ALSE (unknown, but they have TE0725 as reference):
* https://www.alse-fr.com/Hyper-RAM-Controller.html

And you should check your timing constraints and clock domains; maybe you use some asynchronous clocks. This could then go wrong after a while, if the clock domain crossing is not handled properly.

br
John

Joris van Rantwijk:
Hi JH,

Thanks for your suggestions.
I was not even aware that there are commercial HyperBus cores available. However, I really prefer to run open-source code on this board, and I think the HyperRAM interface is simple enough that it can be built from scratch.

I will take another look at the BlackMesaLabs core, but I think they also mentioned problems running at 100 MHz.

My constraints should be ok. My two internal clocks are both generated by the same MMCM with a fixed phase relation. Inter-clock and intra-clock constraints are derived automatically and pass the checks. Timing-critical signals to the HyperRAM go through IDDR or ODDR. I did not set any setup/hold constraints on these signals because I think the IDDR/ODDR will already fix their timing.

I'm currently testing at 50 MHz, and it looks like all errors are gone.
So that points in the direction of setup/hold violations, although it does not completely eliminate other causes.
My next step will be testing with cleaner power supply (not USB).
Then maybe another look at MMCM and constraints although I feel like I have tried everything already.
Then maybe test the BlackMesaLabs core at 80 MHz.

JH:
Hi,

ok, in this case all clocks are derived from the same source, so it should be OK.

100MHz is limit for 3V operation, see:
http://www.issi.com/WW/pdf/66-67WVH8M8ALL-BLL.pdf
you can also check timing parameters for read/write transactions, maybe some are slightly violated?

I've also seen that you have changed the drive strength of the IOs; maybe you should also play a little with the drive strength and slew rate (the highest values are not always the best).

The HyperRAM is close to the FPGA, so I think additional timing parameters for trace length are not needed.

br
John

Joris van Rantwijk:

Thanks for the helpful suggestions.

I have studied the timing of read/write cycles in detail, and also tested with more relaxed settings for tACC and tRWR, but then I still got errors.

The point about IO drive strength is well taken. My 16 mA setting was simply copied from the sample design, but I changed it to 12 mA to see if that helps. Unfortunately I still got errors, but not significantly more errors. That suggests to me this parameter is not critical, otherwise changing it would have either helped or hurt.

I ran a long test at reduced clock frequency 50 MHz. Still errors, although error rate seems lower.

To eliminate conducted noise, I switched to a separate 5 V supply with extra decoupling, common-mode ferrites on the power supply cable and serial cable, and the JTAG cable disconnected during the test. Tested at 100 MHz: still errors.

I again confirmed that I can shift the DQ output phase relative to the HyperRAM clock by at least 60 degrees in both directions without triggering lots of errors. That is a pretty wide margin, so I believe timing on the interface from FPGA to HyperRAM is not at all critical.

Currently testing with reduced drive strength from the side of the HyperRAM. Its default drive strength is 34 Ohm which seems overly strong, so I reduced it to 67 Ohm. I keep my fingers crossed.

By the way, I noticed that the TE0890 board contains fewer decoupling capacitors than recommended by the Xilinx PCB Guide (ug483).
For example: ug483 recommends 100uF on VCCINT, but TE0890 has just 10uF.
ug483 recommends 47uF on VCCO per bank, but TE0890 has just 10uF in total.
I'm not sure that this explains anything, but I want to keep it in mind while I'm chasing system stability issues.

Joris.
