Recommendations for AMD Ryzen system, sudden reboots, random freezes

ilu
Posts: 2004
Joined: 09 Oct 2013 12:45

Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby ilu » 04 Oct 2017 00:02

I'm currently debugging an AMD Ryzen system that has random lags and freezes (the first time I've seen a Linux system freeze totally, with no console access possible). These could be caused by hardware or software, but since debugging hardware problems involves spending money, I prefer to debug software first. I'm using SolydX64 with the 4.12.0 kernel from backports. I probably need at least 4.12.1, or better 4.13. Can you recommend what to do? The user doesn't like to live on the edge; he prefers stability, which is why I think EE is not for him. But this system needs to stay on the newest available kernel for at least the next 6 months.
The system's last words before freezing were:

Code: Select all

Oct  4 01:17:17  kernel: [  926.752094] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x880000 action 0x6 frozen
Oct  4 01:17:17  kernel: [  926.752099] ata3: SError: { 10B8B LinkSeq }
Oct  4 01:17:17  kernel: [  926.752108] ata3.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 20 pio 16392 in
Oct  4 01:17:17  kernel: [  926.752108] Get event status notification 4a 01 00 00 10 00 00 00 08 00res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Oct  4 01:17:17  kernel: [  926.752110] ata3.00: status: { DRDY }
Oct  4 01:17:17  kernel: [  926.752114] ata3: hard resetting link
Oct  4 01:17:17  kernel: [  927.227971] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Oct  4 01:17:17  kernel: [  927.230762] ata3.00: configured for UDMA/100
Oct  4 01:17:17  kernel: [  927.231159] ata3: EH complete
Oct  4 01:17:22  kernel: [  932.345483] [UFW BLOCK] IN=eth1 OUT= MAC=xxx SRC=192.168.1.1 DST=192.168.1.56 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=39613 DF PROTO=TCP SPT=2061 DPT=14013 WINDOW=5840 RES=0x00 SYN URGP=0 
Oct  4 01:17:31  kernel: Oct  4 01:38:24  kernel: [    0.000000] Linux version 4.12.0-0.bpo.1-amd64 (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.12.6-1~bpo9+1 (2017-08-27)
The last line shows the freeze and reboot. This doesn't tell me anything. ata3 is a DVD drive that has issues; I've removed it. But can that lead to a freeze more than 10 seconds later?

kurotsugi
Posts: 2085
Joined: 09 Jan 2014 00:17

Re: Recommendations for AMD Ryzen system, random freezes

Postby kurotsugi » 04 Oct 2017 01:28

A total system freeze comes from one of two things. First, the graphics card giving up. In this case only the screen freezes; if you're watching a movie, you can still hear the sound. Second, something going wrong in the kernel. For example, since 3.12 there has been a nasty kernel bug related to my CPU where a system without swap freezes when memory is almost full. In this case nothing works: neither terminal, graphics, nor sound. Only the REISUB combo works.

Before we start, please note that debugging might not be worth doing. You need to know how to read the logs and understand how the system works; normal people like us won't be able to do that. Since you're using AMD, I'd suggest trying the liquorix kernel first. I don't know what they did to their kernel, but AMD chips work better with it. You can find it here: https://liquorix.net/

If you're curious, or liquorix doesn't work, you first need to identify what made the system freeze. In the first case, you can look into either the kernel or X logs; there might be some hints there. It's mostly a graphics driver issue, though. If you're lucky, there might be a workaround by playing with the X conf files. In most cases, however, we need to install a brand new or different driver. Try to find some keywords in the error message and dig up more information about them.

That being said, it's a lot easier said than done. In my case roughly 90% of the solutions are "ditch that graphics driver, use this one instead" or "the problem is already fixed in driver version X". That basically means you'll need to live near the edge (i.e. use Debian testing). Despite its reputation, Debian testing is actually quite stable, especially if you're using X. If nothing works, you might want to try Debian testing on a live CD and see whether it works fine or not.

As for the second one... well, it's a lot of work. You'll also need to be knowledgeable enough to do that. For a start, the only source is the dmesg logs, and you might want to make them more detailed by using kernel debug mode. At first you'll find several traces, and you'll need to work out which is the real problem. After you find the culprit, you'll need to find out why the problem happened and try several solutions. In several cases that means modifying the source code and compiling your own kernel. That might sound like overkill, but for the sake of curiosity and learning it's actually quite fun. You'll learn a lot when you do it :3

Otherwise, instead of debugging it, simply use the newest kernel. The kernel in backports is known to be problematic for several reasons. You'll need to use Debian testing to do that properly.

kurotsugi
Posts: 2085
Joined: 09 Jan 2014 00:17

Re: Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby kurotsugi » 04 Oct 2017 01:36

It seems that you've edited your post. Unfortunately, that might not be the culprit. dmesg blabs whatever the kernel wants to say; the useful messages start at warn level and above. For a start, try "dmesg -l warn" or "dmesg -l err". After that, if you want more detailed information, you can use those parts to navigate the full dmesg log. I usually save the log to make it easier to read later. You can use the "dmesg > logs.txt" command to save it. The log is saved as logs.txt in the current directory (your home folder, if that's where the terminal opened).
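
A minimal sketch of combining those two steps, assuming a POSIX shell and a dmesg from util-linux that accepts a comma-separated level list. The drop_ufw_noise helper is a made-up name, and the "[UFW " pattern is an assumption taken from the firewall line in the log excerpt in the first post; adjust it to whatever clutters your own log.

```shell
# Hypothetical helper: read kernel log text on stdin and drop firewall
# noise such as "[UFW BLOCK] ..." lines, passing everything else through.
drop_ufw_noise() {
    grep -v -F '[UFW '
}

# On a live system (reading the kernel buffer may need root):
#   sudo dmesg -l warn,err | drop_ufw_noise > logs.txt
```

The filter only hides lines; the full log is still worth saving separately in case the firewall messages turn out to matter after all.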

ilu
Posts: 2004
Joined: 09 Oct 2013 12:45

Re: Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby ilu » 04 Oct 2017 02:47

"Freeze" means total freeze: screen, sound, all input devices. "Reboot" means going from whatever you were doing to GRUB within 5 seconds. sudo dmesg -l warn gives a lot of UFW messages, which I don't think are relevant. The err level finds these

Code: Select all

sudo dmesg -l err
[    2.576308] amdgpu 0000:09:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[    7.219262] sp5100_tco: I/O address 0x0cd6 already in use
[    7.236778] kvm: disabled by bios
but I can't tell whether they are related to the freeze or to the sudden reboot. How can I figure out at what time the error happened? I looked up the numbers and I think dmesg won't really help, because I need to know what happened immediately before the last freeze/reboot, not after ...

Anyway, the first error is common and related to the AMD graphics card used; nobody seems to know the reason. Is there a fix?
The second error is a kernel bug that stops the watchdog from working. A patch has been submitted but its fate is unclear: https://bugzilla.kernel.org/show_bug.cgi?id=170741 https://bugs.debian.org/cgi-bin/bugrepo ... bug=853122
The third message, regarding kvm, is irrelevant.

Since this is Ryzen, I'm almost certain that the issues are board/CPU/memory related: either hardware incompatibilities, BIOS problems or kernel support. Or a combination of all of them. The internet is full of these reports.

If I understand correctly, I need to be on testing for liquorix kernel?
What would be the best way to move to buster? Change sources.list and upgrade or do a new install from the latest EE ISO?
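
On the question of what was logged immediately before the last freeze: dmesg only holds messages from the current boot, so the place to look is a log that survives reboots. A rough sketch, assuming a syslog-style kernel log such as /var/log/kern.log (the usual rsyslog location on Debian, not confirmed for this system) and assuming each boot starts with a "Linux version" banner like the one in the first post. lines_before_last_boot is a made-up helper name.

```shell
# Hypothetical helper: print the N lines logged just before the most
# recent boot banner in a syslog-style kernel log.
# Usage: lines_before_last_boot FILE [N]   (N defaults to 20)
lines_before_last_boot() {
    file=$1
    n=${2:-20}
    awk -v n="$n" '
        { line[NR] = $0 }                       # keep every line in memory
        /\] Linux version / { boot = NR }       # remember the newest banner
        END {
            if (!boot) exit 1                   # no boot banner found
            start = boot - n
            if (start < 1) start = 1
            for (i = start; i < boot; i++) print line[i]
        }
    ' "$file"
}

# Example (path is an assumption; reading it may need root or the adm group):
#   lines_before_last_boot /var/log/kern.log 30
```

If the systemd journal is stored persistently, journalctl -k -b -1 should show the previous boot's kernel messages without any scripting. Either way, a hard freeze can lose the last lines that never reached the disk, so an empty-looking tail does not prove that nothing was logged.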

kurotsugi
Posts: 2085
Joined: 09 Jan 2014 00:17

Re: Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby kurotsugi » 04 Oct 2017 03:45

Liquorix has three branches: past, main and future. IIRC the kernel works on Debian stable too... well, as long as the dependencies are satisfied. If the main branch doesn't work, the past branch, which contains LTS kernels, might.

Before we start digging deeper, it's better to try the newest stuff. As for the debug stuff...

Code: Select all

[    2.576308] amdgpu 0000:09:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
          ^ this is the time when the error happened.
For the meantime, it seems you can ignore this part:

Code: Select all

#this one is related to the board. not the cpu/gpu
[    7.219262] sp5100_tco: I/O address 0x0cd6 already in use
#this one is just rants. 
[    7.236778] kvm: disabled by bios
Our lead starts here:

Code: Select all

[    2.576308] amdgpu 0000:09:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
This roughly means that we have some kind of driver issue, though this kind of message is usually harmless. That's why it happened about 2.5 seconds into boot and the system ignored it (the system boots normally). The only useful information is that there's something wrong with the GPU.
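
On reading the bracketed numbers: they count seconds since the kernel started, so [    2.576308] is about 2.5 seconds into boot. A small sketch of turning such a value into a wall-clock epoch, assuming the boot time is known as a Unix epoch. dmesg_to_epoch is a made-up helper name, and the value in the usage comment is only illustrative.

```shell
# Hypothetical helper: add a dmesg timestamp (seconds since boot, e.g.
# 926.752094) to the boot time given as a Unix epoch. The fractional
# part is dropped, so the result is accurate to the second only.
dmesg_to_epoch() {
    boot_epoch=$1
    whole_secs=${2%%.*}
    echo $((boot_epoch + whole_secs))
}

# Example with GNU date and procps "uptime -s" (both assumed available):
#   boot=$(date -d "$(uptime -s)" +%s)
#   date -d "@$(dmesg_to_epoch "$boot" 926.752094)"
```

dmesg -T prints human-readable times directly, though its man page warns that they can be inaccurate after suspend/resume.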

At this point our choices are:
1. wait until we get a new driver
2. report the bug to the maintainer
3. go deeper with debugging

If you choose (3) first, you'll need to enable debug mode in the kernel. After that... well, that's an adventure :lol: The earlier log tells us "where" the problem was. Next, we need to find out "what" happened or "why" it happened: we need to find the trigger that started the freeze. dmesg will give you more detailed information, so we can go with that first. After that, we need to watch system load, GPU-related logs, X logs... and many more. There's no general guide here, so our next actions would be based on the scattered leads on the system. Honestly it's overkill, so I don't recommend it. Personally I'd do it for the sake of curiosity, not to fix the issue. Mostly I do it to gather details that I can pass on later when I choose the best option, which is...

Report the issue to the maintainer. They'll guide you on where to look or what to do next :3

EDIT:
Missed this one.
ilu wrote:What would be the best way to move to buster? Change sources.list and upgrade or do a new install from the latest EE ISO?
The upgrade route is laborious: doable, but we prefer to avoid it. A fresh install is the best way.

ilu
Posts: 2004
Joined: 09 Oct 2013 12:45

Re: Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby ilu » 04 Oct 2017 04:01

The amdgpu problem has been known for at least a year without anything freezing anywhere. I'm sure there is a bug report somewhere, but I haven't searched for it. I'm almost sure it's not the reason. The freezes and sudden reboots are reported for all kinds of systems that have one thing in common: not the GPU but the CPU, i.e. Ryzen.
In Linux terms Ryzen is cutting-edge tech, so the newest kernel is best. In fact the system is urgently awaiting kernel 4.15 to get support for sensors. That could take a while though :(

I've fixed the kvm problem because the system will run VMs.

It seems liquorix only has 4.12-14, while I'm getting 4.12-13 in backports or in testing, so that's not much different. What's the advantage of liquorix over testing?

kurotsugi
Posts: 2085
Joined: 09 Jan 2014 00:17

Re: Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby kurotsugi » 04 Oct 2017 04:42

Liquorix is optimized for desktop use and games. I don't know exactly what they did, but AMD chips work better with liquorix. On vanilla Debian I got a lot of lags and occasional freezes, but on liquorix that has never happened. It started as random advice (in the old days liquorix tended to get new kernels faster), but later I saw that a lot of AMD-related issues could be fixed with liquorix. It performs better too. So now, whenever I see an AMD chip issue: try liquorix :lol:

ilu
Posts: 2004
Joined: 09 Oct 2013 12:45

Re: Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby ilu » 04 Oct 2017 08:01

OK, so I'll try liquorix, and probably testing too to get a newer Mesa. Thank you.

Zero Angel
Posts: 115
Joined: 01 Aug 2014 22:50

Re: Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby Zero Angel » 06 Oct 2017 02:07

The tco_watchdog error seems to be common for Ryzen machines.

Some Ryzen chips were affected by a bug which caused segfaults. I'm not sure whether this also led to hard lockups/resets.

I'd firstly try to make sure that your distro is using a modern kernel. 4.10 and newer kernels have improved support for Ryzen's features.

Secondly, I'd make sure your motherboard firmware is up-to-date. The newest firmwares for most motherboards come with AGESA 1.0.0.6b or newer code, which may have fixes for firmware bugs that affect Linux.

Presuming both of the above are done, then test other components of your system to make sure they are not causing problems. Check all temps (CPU, GPU, motherboard, etc) and make sure they are within good range when the system is under load. Graphics card (hardware or drivers), faulty RAM, loose cables, or a failing power supply could all lead to lockups.

ilu
Posts: 2004
Joined: 09 Oct 2013 12:45

Re: Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby ilu » 06 Oct 2017 12:47

Zero Angel wrote:Some Ryzen chips were affected by a bug which caused segfaults. I'm not sure whether this also led to hard lockups/resets.
That's another thing I'll have to test once the system runs stable.
Zero Angel wrote:I'd firstly try to make sure that your distro is using a modern kernel. 4.10 and newer kernels have improved support for Ryzen's features.
I can now get 4.13 in sid. Will a sid kernel work in buster? Liquorix still doesn't have it.
Zero Angel wrote:Check all temps (CPU, GPU, motherboard, etc) and make sure they are within good range when the system is under load.
Any idea how to do that without sensor support? Sensor support might be available with kernel 4.15, but that's another 8 months.

grizzler
Posts: 2034
Joined: 04 Mar 2013 15:45
Location: The Hague, NL

Re: Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby grizzler » 06 Oct 2017 14:46

ilu wrote:I can now get 4.13 in sid. Will a sid kernel work in buster?
4.13 arrived in sid on 2 October, so I would expect it to migrate to testing tomorrow or the day after. Unless some issue shows up, of course.
Frank

SolydX EE 64 - tracking Debian Testing

kurotsugi
Posts: 2085
Joined: 09 Jan 2014 00:17

Re: Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby kurotsugi » 09 Oct 2017 00:15

Zero Angel wrote:Secondly, I'd make sure your motherboard firmware is up-to-date. The newest firmwares for most motherboards come with AGESA 1.0.0.6b or newer code, which may have fixes for firmware bugs that affect Linux.

This one should be done first. Normally you could do it later, but apparently AMD did mess up their firmware. They publicly admitted it, though they didn't give the details.

Zero Angel
Posts: 115
Joined: 01 Aug 2014 22:50

Re: Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby Zero Angel » 18 Oct 2017 05:12

There are some possible workarounds to the freezing/reboot issues on Ryzen.

The first workaround is to go into BIOS and disable C6 states. This one solved my issues.

If that doesn't work, the other workaround I read about is to go into BIOS and disable SMT and/or opcache features.

ilu
Posts: 2004
Joined: 09 Oct 2013 12:45

Re: Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby ilu » 18 Oct 2017 14:36

I've left these workarounds as last resorts. There was another BIOS update from Gigabyte; I switched to EE and installed the Liquorix kernel. Now wait and see ...

ilu
Posts: 2004
Joined: 09 Oct 2013 12:45

Re: Recommendations for AMD Ryzen system, sudden reboots, random freezes

Postby ilu » 22 Oct 2017 02:14

Another crash today: after updating, I tried to change permissions on a directory and ... BIOS splash screen, then nothing. I had to reboot using the power button.

