Last weekend I upgraded my server. It was supposed to be easy. I talked about the network adapter problems here: https://notweasel.com/nerd-stuff/installing-network-drivers-on-hyper-v-with-intel-i211-at-network-adapter/
It was 4 am Sunday morning when that problem got solved. I thought it would be smooth sailing from there.
Now connected to the network, I imported my VMs and started up my webserver. Everything was peachy. My websites were back online and the system was stable.
I started up my file server. I started up my Windows 10 VM, which had been slow as molasses before the upgrade, and was happy to find it working well.
I started up a Windows 7 VM, just to start pushing the new hardware a bit. Then this happened: the little code display on the motherboard showed an 8.
When it happened, the whole system froze and had to be hard-rebooted with the power switch.
That little display is supposed to make troubleshooting easy. You look up the code and it tells you what the problem is. Unless it’s an 8.
If it’s an 8, you start Googling it and find a bunch of different things that MIGHT be causing it but nothing conclusive.
The first one I found said it was insufficient CPU power.
Normally I wouldn’t think there was a power issue. It’s got a 650 watt power supply. Thing is, the power supply doesn’t have a 4-pin ATX CPU power connector. I found an old Molex to ATX adapter and then couldn’t find the pack of modular wires for the PSU.
I found one that fit but there wasn’t any branding on it to say if it was meant for the current PSU. I used it anyway and figured it would be fine.
When the error came up, that became the primary suspect. I wasn’t sure about either the PSU cable or about how power is supplied. Knowing that two wires were being used to feed 4, I thought maybe I’d made a mistake there.
I used a multimeter to make sure that the PSU cable was correct and that I was getting 12 volts instead of 5 or something else. That was fine. I still wasn’t sure if there might be a reason the ATX connector uses two wires instead of just one for 12 volts. Maybe the PSU limits the amps on that channel or something. I have no idea.
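For a rough sense of what splitting 12 volts across two wires does, here is a back-of-the-envelope sketch. The 95 watt CPU draw is an assumed figure for illustration, not a measurement from this build, and the function name is made up. It only shows that sharing the load evenly halves the current each wire has to carry; it says nothing about how any particular PSU limits its rails.

```python
# Rough per-wire current on a 4-pin CPU power connector.
# cpu_watts is an assumed figure for illustration, not a measurement.

def amps_per_wire(cpu_watts: float, volts: float = 12.0, wires: int = 2) -> float:
    """Current each 12 V wire carries if the load is shared evenly."""
    total_amps = cpu_watts / volts  # I = P / V
    return total_amps / wires


cpu_watts = 95.0  # hypothetical CPU power draw
print(f"one wire:  {amps_per_wire(cpu_watts, wires=1):.1f} A")
print(f"two wires: {amps_per_wire(cpu_watts, wires=2):.1f} A each")
```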
I looked into buying a new power supply. I was gutted to see how expensive they are. I expected around $80 for a good one but they’re double that now.
I did surgery instead. I chucked out the cable with the Molex connectors and took apart the adapter and one of the PCIe 6-pin connectors that wasn’t being used. After some cutting and taping and poking at the connectors with bent staples I ended up with a 4-pin ATX connector that was definitely getting enough power.
It didn’t work. Well… it DID work, but it didn’t solve the problem. The system would still boot up and then crash within a minute with the code 8.
More searching made the situation sound more and more dire. It seemed like something was broken.
Bad CPU? Bad motherboard? Bad memory? https://www.neowin.net/forum/topic/1420569-moved-motherboard-to-new-case-now-no-post-error-code-8/
Maybe just some bad BIOS settings? https://www.reddit.com/r/ASUS/comments/rd8b28/crosshair_vii_hero_qcode_8_solved/
I updated the BIOS to the latest version, which also wipes out any custom settings. Rebooted. Same error.
More bad memory? https://www.reddit.com/r/ASUS/comments/dqinrh/qcode_8_help/
I swapped the memory between the server and my desktop. Same error.
It was now around 6 or 7 am. This was supposed to be easy.
I didn’t want to do this anymore. I wanted to sleep. I remembered that it was working when it was only running my web server. I thought of ways to get back to that.
I started it up and manically kept refreshing the Hyper-V manager on my workstation until it responded and then immediately killed all the VMs except the webserver.
It worked. It didn’t crash. I went to bed.
When I woke up later on Sunday, I thought about what might cause the problem and how I could narrow it down.
I started the rest of the VMs. It crashed.
I turned them off using the same method from earlier: frantically refreshing the manager until it responded and then killing them. With just the webserver and fileserver running, it was stable.
What caused it? Too much memory usage? Too much CPU demand?
I changed the settings on the Windows 10 VM to give it all of the available memory on startup. It started. It ran fine. I opened up 5 different YouTube videos and played them all simultaneously. I could see the CPU usage going up.
It ran fine.
I started up the Windows 7 VM again. It crashed almost immediately. It made no sense to me. How does a VM crash the hardware?
I did more experiments and everything pointed to the VM being the issue. Nothing else I tried caused a problem. The system was running well unless I started that one VM. Then it crashed within seconds.
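The process of elimination boils down to something like the sketch below: keep a known-stable baseline running and add one suspect at a time. Everything here is hypothetical, including the VM names and the `host_survives` check, which in real life was me starting a VM and watching whether the host fell over. The stand-in function just simulates that outcome.

```python
from typing import Callable, List, Optional


def find_crashing_vm(
    baseline: List[str],
    suspects: List[str],
    host_survives: Callable[[List[str]], bool],
) -> Optional[str]:
    """Add one suspect VM at a time to a stable baseline and
    return the first one that takes the host down, if any."""
    for vm in suspects:
        if not host_survives(baseline + [vm]):
            return vm
    return None


def simulated_host_survives(running: List[str]) -> bool:
    # Stand-in for the manual test: pretend the host only
    # crashes when the (hypothetical) "win7" VM is running.
    return "win7" not in running


culprit = find_crashing_vm(
    baseline=["webserver", "fileserver"],
    suspects=["win10", "win7"],
    host_survives=simulated_host_survives,
)
print(culprit)
```

Testing one VM at a time against the same baseline is slower than starting everything at once, but it means a crash can only be blamed on the one thing that changed.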
I deleted the VM and created a new one using the existing virtual hard drive. It started up and worked fine. I let it run like that for days. It was flawless.
Yesterday I re-imported the original VM and started it up. It crashed.
With this new knowledge, I did more searching and found I’m not the first one to have this happen. There’s a detailed story here about someone in a similar situation, migrating a VM from an Intel-based server to an AMD one and then having random crashes: https://www.theserverside.technology/2020/03/22/the-little-virtual-machine-that-is-crashing-hyper-v-on-amd/
So there you have it. I’ve got a VM that can crash my server’s hardware and throw a code on the motherboard. I have no idea how that’s possible, but it is.
Hopefully this will help someone else with this very specific problem in the future. The solution for me was to create a new VM using the existing VHD.
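For anyone who prefers a command line to clicking through Hyper-V Manager, here is a minimal sketch that builds the PowerShell `New-VM` command for attaching an existing virtual disk to a brand new VM. The VM name, disk path, memory size, and generation are all placeholders, not my actual settings. The generation should match whatever the original VM used, since a disk set up for one generation generally won't boot under the other.

```python
def new_vm_command(
    name: str,
    vhd_path: str,
    memory: str = "4GB",
    generation: int = 1,
) -> str:
    """Build a Hyper-V New-VM command that reuses an existing virtual disk
    instead of creating a new one. All argument values are placeholders."""
    return (
        f"New-VM -Name '{name}' "
        f"-MemoryStartupBytes {memory} "
        f"-Generation {generation} "
        f"-VHDPath '{vhd_path}'"
    )


# Hypothetical name and path; paste the printed line into an
# elevated PowerShell window on the Hyper-V host.
print(new_vm_command("Win7-rebuilt", r"D:\VMs\win7.vhdx"))
```

The same thing can be done in the New Virtual Machine wizard by choosing to use an existing virtual hard disk.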