TE0890 HyperRAM data corruption

Joris van Rantwijk · August 18, 2020, 09:57:31 AM

I designed my own HyperRAM controller for the TE0890 Spartan-7 mini FPGA module.

The code for my HyperRAM controller and test driver is on Github:
https://github.com/jorisvr/te0890-utils/tree/master/hyperram_test

The whole thing is working very nicely, except that I see short bursts of data corruption approximately once every 10 to 20 hours. I have tried everything I could think of to find the source of these issues, but I just can't figure it out.

I'm not using the BlackMesaLabs hyperram Verilog core, because it only supports dword-level access while I want byte-level write enables, and because the Verilog code only works up to 80 MHz while I want to run at 100 MHz.
So I designed my own controller in VHDL. It operates the HyperRAM at 100 MHz, which should be the maximum supported frequency of the device. My test consists of a simple but intensive march test with varying data patterns (this is sometimes called "moving inversions"). The whole thing seems to work almost flawlessly. It correctly handles repeated runs of the test pattern for many hours. But approximately once every 10 to 20 hours, the test detects a burst of between 1 and 4000 errors, then continues to run for hours again without errors.
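
For readers unfamiliar with the term, a minimal Python sketch of one moving-inversions pass might look like the following. It is not the VHDL test from the repository: the function name, the 16-bit word width, and the `StuckBitMemory` fault model are assumptions made purely for illustration.

```python
def moving_inversions(mem, pattern, width_mask=0xFFFF):
    """Run one moving-inversions pass; return a list of (addr, expected, actual)."""
    inverse = ~pattern & width_mask
    errors = []
    n = len(mem)
    # Step 1: fill the whole memory with the pattern.
    for addr in range(n):
        mem[addr] = pattern
    # Step 2: ascending sweep, verify the pattern and overwrite with its inverse.
    for addr in range(n):
        actual = mem[addr]
        if actual != pattern:
            errors.append((addr, pattern, actual))
        mem[addr] = inverse
    # Step 3: descending sweep, verify the inverse and restore the pattern.
    for addr in reversed(range(n)):
        actual = mem[addr]
        if actual != inverse:
            errors.append((addr, inverse, actual))
        mem[addr] = pattern
    return errors


class StuckBitMemory(list):
    """Hypothetical faulty memory: bit 0 of one cell always reads back as 1."""

    def __init__(self, size, bad_addr):
        super().__init__([0] * size)
        self.bad_addr = bad_addr

    def __getitem__(self, addr):
        value = super().__getitem__(addr)
        return value | 1 if addr == self.bad_addr else value


if __name__ == "__main__":
    faulty = StuckBitMemory(16, bad_addr=5)
    for addr, expected, actual in moving_inversions(faulty, 0xAA55):
        print(f"addr={addr:06x} expected={expected:04x} actual={actual:04x}")
```

Because each cell is checked with both a pattern and its inverse, a stuck bit is caught regardless of which value it is stuck at.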

Based on the error patterns, I suspect the corruption occurs in the write data path and perhaps sometimes in the address, but not in the read data path. However, it is really difficult to determine this with my current test method.

I tried shifting the clock phases used to drive data to the HyperRAM and to capture data from the HyperRAM. This confirms that I have at least 30 degrees margin in both directions before error rates increase significantly. I therefore find it unlikely that the data corruption is caused by something like setup/hold time violations on the HyperRAM interface.

I tried relaxing the timing configuration of the HyperRAM. At 100 MHz it should support tACC=4 cycles, tRWR=4 cycles. I tried running at default tACC=6 cycles, fixed 2x access latency, and tRWR=6 cycles. I still get data corruption in that configuration.

I tried running at 80 MHz instead of 100 MHz, but I still see data corruption.
I'm currently testing at 50 MHz and no corruption yet, but the errors are so infrequent that I will have to test for several days to be sure.
However I really want to run the RAM at 100 MHz and I believe it should be possible.

At this point I don't know how to debug this any further.
It is remarkable that the errors are extremely rare, but burst-like in nature (no errors for many hours, then hundreds of errors in a fraction of a second). This suggests to me that some aspect of the system is intermittently unstable.
Could the MMCM lose phase lock? - Why would it do that?
Could the signal amplitude of the HyperRAM interface drop below the noise margin? - What could cause that?
Is this just the best that HyperRAM can do? - But then it is basically unusable without ECC.
I'm powering my TE0890 module from the USB bus of my computer. Not the most low-noise supply, but I think it should be good enough.

Question 1: Does anybody have any clue what might be going on here?

Question 2: Does anybody have experience with the TE0890 HyperRAM? I'm interested in success stories, similar problems, different problems.

Question 3: Is anybody willing to run my HyperRAM test on their own TE0890 for a few days? The HyperRAM controller and test design are in my Github, linked above. Note that the expected error rate is extremely low, so the test may need to run for many hours to draw conclusions.

JH · August 20, 2020, 08:16:47 AM

Hi,
sorry I can't help much, but there are other IPs available (some are free, for some you must pay or use a trial version with a 10-minute time bomb). Maybe this helps you to find the issue with your own IP:
Synaptic Lab (example with trial version on TE0725 available)

Open Source:

https://github.com/blackmesalabs/hyperram

CYPRESS (registration necessary, core under NDA, no settings for our modules)

http://www.cypress.com/documentation/software-and-drivers/hyperbus-master-interface-controller-ip-intellectual-property

ALSE (unknown, but they have TE0725 as reference)

https://www.alse-fr.com/Hyper-RAM-Controller.html

And you should check your timing constraints and clock domains; maybe you use some asynchronous clocks. Then this could possibly go wrong after a while, if the clock domain crossing is not properly handled.

br
John

Joris van Rantwijk · August 20, 2020, 06:11:41 PM

Hi JH,

Thanks for your suggestions.
I was not even aware that there are commercial HyperBus cores available. However I really prefer to run open-source code on this board, and I think the HyperRAM interface is simple enough that it can be built from scratch.

I will take another look at the BlackMesaLabs core, but I think they also mentioned problems running at 100 MHz.

My constraints should be ok. My two internal clocks are both generated by the same MMCM with a fixed phase relation. Inter-clock and intra-clock constraints are derived automatically and pass the checks. Timing-critical signals to the HyperRAM go through IDDR or ODDR. I did not set any setup/hold constraints on these signals because I think the IDDR/ODDR will already fix their timing.

I'm currently testing at 50 MHz, and it looks like all errors are gone.
So that points in the direction of setup/hold violations, although it does not completely eliminate other causes.
My next step will be testing with cleaner power supply (not USB).
Then maybe another look at MMCM and constraints although I feel like I have tried everything already.
Then maybe test the BlackMesaLabs core at 80 MHz.

JH · August 21, 2020, 07:10:46 AM

Hi,

ok, in this case all clocks are derived from the same source, so that should be OK.

100 MHz is the limit for 3 V operation, see:
http://www.issi.com/WW/pdf/66-67WVH8M8ALL-BLL.pdf
you can also check timing parameters for read/write transactions, maybe some are slightly violated?

I've also seen that you have changed the drive strength of the IOs. Maybe you should also play a little bit with this drive strength and the slew rate (the highest values are not always the best).

The HyperRAM is close to the FPGA, so I think additional timing parameters for trace length are not needed.

br
John

Joris van Rantwijk · August 25, 2020, 09:01:08 PM

Thanks for the helpful suggestions.

I have studied the timing of read/write cycles in detail, and also tested with more relaxed settings for tACC and tRWR, but then I still got errors.

The point about IO drive strength is well taken. My 16 mA setting was simply copied from the sample design, but I changed it to 12 mA to see if that helps. Unfortunately I still got errors, but not significantly more errors. That suggests to me this parameter is not critical, otherwise changing it would have either helped or hurt.

I ran a long test at reduced clock frequency 50 MHz. Still errors, although error rate seems lower.

To eliminate conducted noise, I switched to a separate 5 V supply with extra decoupling, common-mode ferrites on the power supply cable and serial cable, and the JTAG cable disconnected during the test. Tested at 100 MHz: still errors.

I again confirmed that I can shift the DQ output phase relative to the HyperRAM clock by at least 60 degrees in both directions without triggering lots of errors. That is a pretty wide margin, so I believe timing on the interface from FPGA to HyperRAM is not at all critical.
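
To put the phase margins quoted in this thread into time units, a small Python sketch is shown below. The helper name `phase_to_ns` is made up for illustration; the arithmetic is just the clock period scaled by the phase fraction.

```python
def phase_to_ns(phase_deg, clock_mhz):
    """Convert a clock phase shift in degrees to a time shift in nanoseconds."""
    period_ns = 1000.0 / clock_mhz  # one clock period in ns
    return period_ns * phase_deg / 360.0


if __name__ == "__main__":
    for phase in (30, 60):
        print(f"{phase} deg at 100 MHz = {phase_to_ns(phase, 100):.3f} ns")
```

At 100 MHz the period is 10 ns, so 30 degrees is roughly 0.83 ns and 60 degrees roughly 1.67 ns of shift in either direction.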

Currently testing with reduced drive strength from the side of the HyperRAM. Its default drive strength is 34 Ohm which seems overly strong, so I reduced it to 67 Ohm. I keep my fingers crossed.

By the way, I noticed that the TE0890 board contains fewer decoupling capacitors than recommended by the Xilinx PCB Guide (ug483).
For example: ug483 recommends 100uF on VCCINT, but TE0890 has just 10uF.
ug483 recommends 47uF on VCCO per bank, but TE0890 has just 10uF in total.
I'm not sure that this explains anything, but I want to keep it in mind while I'm chasing system stability issues.

Joris.

JH · September 01, 2020, 12:03:04 PM

Hi,
sorry for the late reply.

I don't think that having fewer capacitors is the problem; in that case I think the errors should be more continuous.

Now you wrote that you also have problems at 50 MHz; in your first post it sounded like you only have problems at 100 MHz?

I don't have any experience with HyperRAM, but what about the refresh rate? I found for example:
https://community.cypress.com/thread/29439

br
John

Joris van Rantwijk · September 02, 2020, 05:44:41 PM

Quote from: JH on September 01, 2020, 12:03:04 PM
I don't think that having fewer capacitors is the problem; in that case I think the errors should be more continuous.

Ok that makes sense. However, the same applies to every root cause I can think of. What sort of design flaw causes an average failure rate of once per 24 hours? In my experience with FPGAs, most designs fail either many times per second or never.

A very marginal timing violation could perhaps cause this behaviour, but my clock is properly constrained and I have full timing closure.

Quote
Now you wrote that you also have problems at 50 MHz; in your first post it sounded like you only have problems at 100 MHz?

I don't have any experience with HyperRAM, but what about the refresh rate? I found for example:
https://community.cypress.com/thread/29439

I first tested at 100 MHz and found errors. I later tested also at 50 MHz and still got errors. I recently tested a lower drive strength for the HyperRAM device, but still got errors.

Thanks for the link to the refresh discussion. That thread confirms my understanding of the datasheet. Basically HyperRAM automatically takes care of refresh. The FPGA just needs to keep bursts shorter than 4 us to give the HyperRAM an opportunity to refresh. My controller does that correctly.
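
As a rough illustration of that 4 us limit, the sketch below estimates how many data bytes fit in one burst. The 2 bytes per clock follows from the 8-bit double-data-rate bus. The `overhead_clocks` value of 15 (3 clocks of command/address plus 2 x 6 clocks of latency) is only an assumed worst case for illustration, and the function name is hypothetical; the real limit depends on the controller and the configured latency.

```python
def max_burst_bytes(clock_mhz, max_cs_low_us=4.0, overhead_clocks=15):
    """Estimate the largest burst payload that keeps one transaction under the limit.

    overhead_clocks is an ASSUMED per-transaction overhead (command/address
    plus initial latency); check it against the actual controller.
    """
    total_clocks = int(max_cs_low_us * clock_mhz)  # clocks available per transaction
    data_clocks = max(total_clocks - overhead_clocks, 0)
    return data_clocks * 2  # DDR on an 8-bit bus: 2 bytes per clock


if __name__ == "__main__":
    for mhz in (50, 80, 100):
        print(f"{mhz} MHz: at most {max_burst_bytes(mhz)} bytes per burst")
```

Lower clock frequencies leave fewer clocks inside the same 4 us window, so the maximum burst length shrinks as the clock slows down.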

I'm currently looking into the role of burst length. My test tries both short and long bursts. So far all errors occurred with short bursts. By itself this does not mean much because the short-burst test takes much more time than the long-burst test, so it may have more opportunity to see errors. But the short-burst test also performs many sequential accesses to the same memory row. Perhaps the device does not like that access pattern.

Actually I think I'm grasping at straws now, but who knows.
It would be so nice to have a reliable open source HyperRAM controller.

Joris.

JH · September 04, 2020, 02:14:03 PM

Hi Joris,

here is a new post where pramber tries to work with another open source IP (--> other git url).
https://forum.trenz-electronic.de/index.php/topic,1333.0.html
I have informed him about your post, maybe you can support each other to find a solution for your different problems with HyperRAM and open Source IPs.

br
John

Antti Lukats · February 28, 2023, 03:09:36 PM

Quote from: Joris van Rantwijk on September 02, 2020, 05:44:41 PM
It would be so nice to have a reliable open source HyperRAM controller
Joris.

Hi, I could not agree with you more. There is also an OpenHBMC IP core on GitHub, and we hoped it would work, but it also fails once a day! OK, on the first try it did work without a failure for a whole week, but now we see data corruption almost every day! We are testing on:

https://shop.trenz-electronic.de/de/CR00107-01-CRUVI-carrier-board-with-AMD-Spartan-7

so it is totally different hardware and totally different IP core, and still we see similar failure rate!

My guess is that you also did not solve the issues? right?

Joris van Rantwijk · February 28, 2023, 10:28:37 PM

Quote from: Antti Lukats on February 28, 2023, 03:09:36 PM
My guess is that you also did not solve the issues? right?

Right, unfortunately. I ran some more tests but did not get any closer to a solution. And then I just gave up.
I had already crossed the fine line from perseverance into foolish obsession. I mean, it can be really great to stare at hex dumps for days on end, but only if you eventually figure it out.

Antti Lukats · March 01, 2023, 11:57:41 AM

Hi,

I kinda guessed that

well we do not want to give up.. I compiled your HyperRAM code for CR00107, and it kinda worked on the FIRST try! But what we see is the following:

R=0004 F=001d6342
P=0000 B=001 B=002 B=003 B=004 B=005 B=010 B=03f B=200 F=001d6342
P=ff00 B=001 B=002 B=003 B=004 B=005 B=010 B=03f B=200 F=001d6342
P=aa55 B=001 B=002 B=003 B=004 B=005 B=010 B=03f B=200 F=001d6342
P=cc33 B=001 B=002 B=003 B=004 B=005 B=010 B=03f B=200 F=001d6342
P=f00f B=001 B=002 B=003 B=004 B=005 B=010 B=03f B=200 F=001d6342
P=rrrr B=001 E=1-00000b-1-36b2-a961 E=1-000019-2-d7cf-2777

there are 5 PATTERN rounds without error, and then there are errors coming 100% all the time with random pattern.

That is we can not use your solution for long time testing

as there are errors in first round already, do you have any idea why?
Maybe we could just disable the random tests and try with pattern test only? How to do this?

UPDATE: I changed the burst lengths 1 and 2 both to 3, and now all test rounds complete without failure. So the 100% errors happened only with the random pattern and with burst lengths 1 and 2!

Any idea where this can come from?

I am running a long time test now, to see if some error happens when run a long time...

Joris van Rantwijk · March 01, 2023, 10:23:53 PM

Quote from: Antti Lukats on March 01, 2023, 11:57:41 AM
there are 5 PATTERN rounds without error, and then there are errors coming 100% all the time with random pattern.

To be precise: it seems the errors are not literally 100%, since the first error appears around address 0x000b and the next error at address 0x0019.
So this suggests to me that the memory is still mostly working even for random data. But it is definitely a large number of errors.

I think I also got most of my errors at short burst lengths, but definitely also sometimes at longer burst lengths. And for me it was not just the random patterns that failed.

Quote
That is we can not use your solution for long time testing as there are errors in first round already, do you have any idea why?
Maybe we could just disable the random tests and try with pattern test only? How to do this?

I have no idea. Also it is a very long time ago (relative to the timescale at which I can still understand my own VHDL code).
The random test is tricky because it relies on two identically initialized RNGs running in lock-step, one to write data and one to verify data. However I don't see how this could run smoothly for long bursts while failing specifically only for short bursts, and even then failing only intermittently.
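
The lock-step idea can be sketched in Python as follows. This is not the VHDL implementation: the function names, the byte-wide memory model, and the use of Python's `random.Random` are assumptions for illustration only.

```python
import random


def write_random(mem, rng, burst_len):
    """Fill memory burst by burst with values drawn from the writer RNG."""
    n = len(mem)
    for base in range(0, n, burst_len):
        for addr in range(base, min(base + burst_len, n)):
            mem[addr] = rng.getrandbits(8)


def verify_random(mem, rng):
    """Regenerate the same sequence from the checker RNG and compare."""
    errors = []
    for addr in range(len(mem)):
        expected = rng.getrandbits(8)
        if mem[addr] != expected:
            errors.append((addr, expected, mem[addr]))
    return errors


if __name__ == "__main__":
    seed = 1234  # both generators must start from the same state
    mem = bytearray(64)
    write_random(mem, random.Random(seed), burst_len=3)
    mem[10] ^= 0x01  # simulate a single corrupted bit
    for addr, expected, actual in verify_random(mem, random.Random(seed)):
        print(f"addr={addr:06x} expected={expected:02x} actual={actual:02x}")
```

If the two generators ever fall out of step (for example, one of them draws an extra value), every later comparison fails, which is what makes this kind of test fragile.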

Disabling random tests should be simple. You can just shorten the arrays test_pattern_data and test_pattern_random. Probably.

It looks like my hacked-up VHDL memory tester is nice for verifying a correctly working RAM, but not so nice for debugging a failing RAM. It is just too difficult to pinpoint what exactly is happening before and during and after an error. I had a vague plan to hook up the HyperRAM controller to a soft-core CPU inside the FPGA, then run a memory test program on the CPU. That enables more powerful debugging in case an error is discovered (re-read the failed address to see if the failure is consistent, etc.). But I was just too tired of the whole situation.
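
A sketch of what such a CPU-side check could do on a mismatch is shown below, in Python for brevity. The `read` callback, the retry count, and the result labels are all hypothetical; the point is only that re-reading separates a one-off read glitch from wrong data actually stored in the RAM.

```python
def classify_error(read, addr, expected, retries=3):
    """Re-read a failing address and classify the failure.

    read is a hypothetical callback returning the value stored at addr.
    """
    values = [read(addr) for _ in range(retries)]
    if all(v == expected for v in values):
        # The data in the RAM is fine; only the original read was wrong.
        return "transient read error"
    if len(set(values)) == 1:
        # Every re-read returns the same wrong value: the stored data is bad.
        return "stored data wrong"
    # Re-reads disagree with each other: the read path itself is unstable.
    return "unstable reads"


if __name__ == "__main__":
    print(classify_error(lambda addr: 0x36B2, 0x00000B, 0x36B2))
    print(classify_error(lambda addr: 0xA961, 0x00000B, 0x36B2))
```

A consistently wrong value points at the write path, the address, or retention, while a value that reads back correctly on retry points at the read path.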

Antti Lukats · March 02, 2023, 10:56:04 AM

Hi
eh reading own VHDL code, can be fun? OK, I changed burst lengths 1 and 2 both to 3, so the test uses burst len 3 3 times and started a long term test yesterday. It has been running 22+ hours, no error.

with burst len 1 or 2 I got in every test about same amount of errors at different addresses, random locations and data.

so far I would almost say that your IP core works 100% and that the problem is somehow in the HyperRAM device itself, unfortunately we have ISSI die revision B silicon that is known to have errata (corruption) so we can not rule out the possibility that the chip itself is bad.

of course it hard to explain why the issue only happens with random sequence testing and short burst? With short burst the test itself takes LONGER could that be somehow related to refresh ? Not sure what to think.

So running the long time test, interesting if any errors come or not (with burst len >=3)

Joris van Rantwijk · March 02, 2023, 11:58:32 AM

Quote from: Antti Lukats on March 02, 2023, 10:56:04 AM
so far I would almost say that your IP core works 100% and that the problem is somehow in the HyperRAM device itself, unfortunately we have ISSI die revision B silicon that is known to have errata (corruption) so we can not rule out the possibility that the chip itself is bad.

Hmm, I didn't know there were errata. If we can't have any confidence that the device itself is correct, then I'm not sure what can be achieved through further testing.

Quote
of course it hard to explain why the issue only happens with random sequence testing and short burst? With short burst the test itself takes LONGER could that be somehow related to refresh ?

I have had similar thoughts: retention errors where the RAM somehow "forgets" data after some amount of time or some number of bursts. However I have definitely also seen errors with longer bursts (3, 4, 5, 16 bytes).

In most runs, I got errors with an average interval of 1000 rounds. Often a big burst of some 100 errors, then OK again for another 1000 rounds or so. Failures occurred more often during short bursts (but the test performs more short bursts than long bursts). Failures occurred about equally often during pattern vs random data.

I guess it is also still a possibility that you and I are seeing errors for completely different reasons.

Antti Lukats · March 02, 2023, 12:33:43 PM

OK, Update:

1) there is a possibility that die REV A and B have issues but we have LATEST rev D die on our board. So this die should not have errata.

2) we are running 24+ hours, no errors, but the round counter R is only at 0x00A0? how did you run 1000 round? it takes forever

hm.. could it be you see device errata and we see hm a bug in IP core? We can not really test short burst with random at all.

Joris van Rantwijk · March 02, 2023, 01:01:02 PM

Quote from: Antti Lukats on March 02, 2023, 12:33:43 PM
2) we are running 24+ hours, no errors, but the round counter R is only at 0x00A0? how did you run 1000 round? it takes forever

Then something is wrong. The test used to run ~ 50 rounds per hour on my board. Errors are indeed rare; about once per 24 hours.

Is your main clock 100 MHz ?
If it is not 100 MHz, the capturing of RWDS and read data may be wrong. These signals (output from RAM) are valid for a certain window after CK, and my controller uses the fact that the next 100 MHz clock edge falls exactly inside the safe capture window. So this only works for 100 MHz. Also if your HyperRAM revision has different timing, it won't work right.

Quote
hm.. could it be you see device errata and we see hm a bug in IP core?

Yes, absolutely. I just haven't been able to find the bug

Just to make sure: your design has a correct period constraint on the main clock and timing closure after routing ?

Antti Lukats · March 02, 2023, 02:01:43 PM

our board has 100MHz clock and is running your code without modifications (except the 1,2,3.. to 3,3,3.. modification to exclude short random tests).

our board has HyperRAM trace lengths for ALL signals between 2.5 and 3.6 mm, it is not possible to be better.

we use die REV D, which is the current and recommended die revision.

the test has run what about 28 hours? no errors, the round counter is at 010E

I really wonder why the short burst random tests are failing, really confused where this kind of problem could happen?

Joris van Rantwijk · March 02, 2023, 09:31:29 PM

Quote from: Antti Lukats on March 02, 2023, 02:01:43 PM
the test has run what about 28 hours? no errors, the round counter is at 010E

Then something is completely wrong. The test is running 5 times slower on your board than on mine. I don't see any reason for that.
The RAM chip has some limited freedom in the timing of read transactions where it can insert wait states. But I have never actually seen that happen. And it could never slow down the test this much.
All other parts of the VHDL code have deterministic timing.

I would almost guess that capturing of the RWDS signal is flaky on your board. This could slow down the test through "false" wait states. But it should also cause massive numbers of data errors, even for pattern tests.

Do you see that short bursts run noticeably slower than long bursts? I remember seeing the strings "B=001 B=002" etc. appearing one by one. B=001 would take by far the longest time, while B=010 zips past quickly. Unfortunately I can not easily hook up my TE0890 board right now.

Quote
our board has 100MHz clock and is running your code without modifications (except the 1,2,3.. to 3,3,3.. modification to exclude short random tests).

Ok, my code, but you probably changed pin placement constraints for your board layout. Does it still meet timing?

I agree trace lengths probably don't matter. Perhaps FPGA speed grade could make a difference, I'm not sure. My board has speed grade 1 (I can never remember whether that is the fastest or slowest grade).

Antti Lukats · March 03, 2023, 08:47:36 AM

long burst are much faster
one full round takes about 47 seconds
we changed the io constraints to match CR00107
now it is running the modified test for 44 hours, no errors

Joris van Rantwijk · March 03, 2023, 10:17:16 AM

Quote from: Antti Lukats on March 03, 2023, 08:47:36 AM
one full round takes about 47 seconds

Ok that sounds right. But in that case 28 hours should cover 2100 rounds = 0x0834.
Is your board maybe getting spurious resets?
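
The arithmetic behind that estimate can be checked with a few lines of Python. The helper name is made up for illustration; the inputs (28 hours, 47 seconds per round, counter at 010E) are the figures reported in this thread.

```python
def expected_rounds(hours, seconds_per_round):
    """Number of complete test rounds that fit in the given run time."""
    return int(hours * 3600 // seconds_per_round)


if __name__ == "__main__":
    expected = expected_rounds(28, 47)
    observed = int("010E", 16)  # round counter reported after about 28 hours
    print(f"expected about {expected} rounds (0x{expected:04x})")
    print(f"observed {observed} rounds (0x{observed:04x})")
```

The exact quotient is a little above the rounded figure of 2100, and either way it is several times larger than the observed counter value.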

If you are planning to "fix" the short burst access later, you could try increasing the parameter "t_rwr_clk" in the VHDL entity. It is basically the minimum delay between successive bursts. The default value should be OK according to the datasheet. But increasing it is a simple way to give the RAM chip an easier time. I tried this and it did not help at all, but my board doesn't have such very high failure rates on short bursts.

Antti Lukats · March 03, 2023, 11:41:55 AM

hm I do not think the delay between burst will change anything, as short burst with constant pattern work well.

Antti Lukats · March 03, 2023, 02:09:45 PM

SOLVED!

This is HyperRAM errata!

I changed from ISSI die REV D to OLDER die REV B and the short burst random errors disappeared.

I started a long time testing now!

So the IP core has no issues, well so far lets see if there will be error running long time.

QUESTION: Burst Len 2, is it 2 bytes or two WORDS? could not understand yet

Joris van Rantwijk · March 04, 2023, 01:51:00 PM

Quote from: Antti Lukats on March 03, 2023, 02:09:45 PM
SOLVED!

Well, yes, for the specific issue with short bursts on CR00107.
A probable workaround is to avoid bursts of 1 or 2 bytes.
An alternative workaround may be enabling fixed latency by changing bit 3 of "config0_data" from 0 to 1 in entity "hyperram_ctrl".

There is still no full explanation for the rare errors on TE0890 that I originally described 2 years ago. Those errors also occurred with longer burst lengths and also with fixed latency.
But I have pretty much given up on that mystery. It may have been undisclosed errata for old die revisions. It may even have been a single bad TE0890 board.

Quote
QUESTION: Burst Len 2, is it 2 bytes or two WORDS? could not understand yet

When the test prints "B=002", it means bursts of 2 bytes = 1 word = 1 clock cycle.
When the test prints "B=003", it means bursts of 3 bytes = 2 words with one of the four bytes masked out, etc.
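
The mapping from a byte count to words and masked bytes can be sketched like this. The function name and the `start_offset` parameter are assumptions for illustration; whether the test aligns bursts to word boundaries is not stated here, so the default of 0 is only a guess.

```python
def burst_layout(n_bytes, start_offset=0):
    """Return (n_words, mask) for a write burst of n_bytes.

    start_offset is the byte offset (0 or 1) within the first 16-bit word.
    mask holds one flag per byte transferred: True = written, False = masked out.
    """
    last = start_offset + n_bytes
    n_words = (last + 1) // 2  # each word (one clock cycle) carries 2 bytes
    mask = [start_offset <= i < last for i in range(2 * n_words)]
    return n_words, mask


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        words, mask = burst_layout(n)
        masked = mask.count(False)
        print(f"B={n:03x}: {words} word(s), {masked} byte(s) masked out")
```

Under these assumptions a 1-byte burst still occupies a full word with one byte masked, so bursts of 1 and 2 bytes both take a single clock cycle of data.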

Antti Lukats · March 06, 2023, 12:37:31 PM

Status:

rev A die (burst 1,2,3...): burst error detected over the weekend, at burst length of 4
rev D die: 50+ hours no errors, burst 1 and 2 excluded in tests
rev D die (burst 1,2,3...): started long time testing, with modified IP core to work around the short burst errata

we need a week+ test on rev D die to be sure your IP core works 100%, so far it looks like the rev A die may have data corruption that causes burst errors in long time testing, of course it is possible that rev D has also problems.

on another PC we run long time testing with rev D and OpenHBMC, and there we also saw long time test errors, so we can not say that the rev D die works without errors.

Antti Lukats · March 10, 2023, 04:38:44 PM

test with die rev D has been running for a week, no errors. We will let it run another week, but it seems the errors are not coming.

lasse@elcon.se · May 15, 2023, 07:22:41 AM

Is the ip core still running with no failure?

/Lasse

Antti Lukats · May 15, 2023, 08:26:40 AM

sorry I stopped the testing. but it did run 3+ weeks without errors!
ok, I restarted the testing...

Antti Lukats · May 22, 2023, 03:22:45 PM

another week of testing without errors.

just to be clear we test with CR00107 and ISSI hyperram die REV D

lasse@elcon.se · May 23, 2023, 12:19:56 PM

I think this is good news.
So there is some problem with hyperram and short burst.

Maybe OpenHBMC has the same problem.

Good work.
/Lasse

lasse@elcon.se · May 23, 2023, 01:24:04 PM

Dear Antti,

is it possible to get this IP core with this fix? I have a custom board with an Artix-7 and two HyperRAMs on board, and I would like to check if it works.

/Lasse
