Hard Drive Replacements

ABSTRACT:
Changing out a bad hard drive is not as complex a task as it may appear when you first open up a server.  Nonetheless, even if you’ve replaced drives in the past, the error message that you receive for a failed hard drive doesn’t come with a large neon sign saying “This One” when you open up the server.  This tech bulletin will not contain neon, but it will contain instructions and images for identifying the hard drive that needs to be replaced and the steps you should take in replacing it.
TECH SUPPORT SUMMARY:
  • A failed hard drive will most likely be out of warranty, so when you get a degraded array email you’ll need to either order a replacement or, if the drive is still within the one-year warranty, request an RMA.
  • Before you take the cube with the bad hard drive offline, it is critical that you call us.  We need to delete a file before you swap the hard drives out, or else your IP information will not carry over.  At that point it would be monitor and keyboard time.
  • Next, you’ll need to learn how to identify which hard drive is bad.  This has two parts.  The first is identifying whether the lower or the higher numbered hard drive is the one that has failed.  The second is identifying which hard drive on the physical cube is actually the lower numbered one and which is the higher numbered one.
  • Important: 1 does not always equal 1.  A device name such as sda1 does not necessarily mean port 1.  It could be port 0, or it could be port 5.  Just identify higher and lower.  Notice in the example below that md1 is where the failure was detected, but md0 and md1 are arrays that each span both hard drives, not the hard drives themselves.  The failed member is the one marked (F): sda2, a partition on sda, which is the lower of the two hard drives (sda and sdb).
  • After you know whether you’re dealing with the higher or lower numbered hard drive, you’ll need to find that hard drive’s port on the motherboard.  Each type of cube varies, so please look at the images below for numbering.
  • (F) and [_U] are your indications, both in the email and in the diagnostics report, of your failed HD: (F) marks the failed member, and the underscore marks the slot that is down.  [UU] indicates that both hard drives in the array are functioning properly.
  • After the hard drive is replaced, Response Care will need to reconfigure both hard drives.
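To make the [UU] versus [_U] distinction concrete, here is a minimal Python sketch that decodes the status portion of a /proc/mdstat line.  The function name decode_status is hypothetical and not part of mdadm or any Response Care tool; the sketch only assumes the "[2/1] [_U]" layout shown in the degraded array email, where the first number is how many drives the array expects and the second is how many are active.

```python
import re

# Matches the status part of an mdstat line, e.g. "[2/1] [_U]".
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def decode_status(line):
    """Return (expected, active, down_slots) for a status line, or None.

    down_slots lists the positions (counting from 0) shown as "_".
    """
    match = STATUS_RE.search(line)
    if match is None:
        return None
    expected = int(match.group(1))
    active = int(match.group(2))
    flags = match.group(3)
    down_slots = [i for i, flag in enumerate(flags) if flag == "_"]
    return expected, active, down_slots


if __name__ == "__main__":
    print(decode_status("974750881 blocks super 1.2 [2/1] [_U]"))
    print(decode_status("96244 blocks super 1.2 [2/2] [UU]"))
```

An empty list of down slots corresponds to [UU]; any entry in the list means the array is degraded.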
Sample Email:
This is an automatically generated mail message from mdadm

running on <facility name>

A DegradedArray event had been detected on md device /dev/md/1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md1 : active raid1 sda2[0](F) sdb2[1]
      974750881 blocks super 1.2 [2/1] [_U]

md0 : active raid1 sda1[0] sdb1[1]
      96244 blocks super 1.2 [2/2] [UU]
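As an illustration of reading the sample above, the following Python sketch picks out the member marked (F) and reports whether it sits on the lower or the higher hard drive.  It is a sketch under stated assumptions, not a Response Care tool: the function name failed_drives is hypothetical, it assumes a two-drive mirror, and it assumes sdX-style device names as in the sample email.

```python
import re

SAMPLE = """\
Personalities : [raid1]
md1 : active raid1 sda2[0](F) sdb2[1]
      974750881 blocks super 1.2 [2/1] [_U]

md0 : active raid1 sda1[0] sdb1[1]
      96244 blocks super 1.2 [2/2] [UU]
"""

# One array member, e.g. "sda2[0](F)": drive, partition, slot, optional (F).
MEMBER_RE = re.compile(r"\b(sd[a-z]+)(\d+)\[(\d+)\](\(F\))?")


def failed_drives(mdstat_text):
    """Map each drive that has a failed member to 'lower' or 'higher'."""
    drives, failed = set(), set()
    for line in mdstat_text.splitlines():
        # Only lines such as "md1 : active raid1 ..." list array members.
        if not re.match(r"md\d+\s*:", line):
            continue
        for drive, _partition, _slot, flag in MEMBER_RE.findall(line):
            drives.add(drive)
            if flag:
                failed.add(drive)
    ordered = sorted(drives)  # sda sorts before sdb
    return {d: ("lower" if d == ordered[0] else "higher") for d in failed}


if __name__ == "__main__":
    print(failed_drives(SAMPLE))
```

For the sample email, the failed member is sda2, so the sketch reports sda as the lower drive.  If a failed drive has dropped out of the listing entirely, nothing is marked (F) and the sketch returns an empty result.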
Steps to follow regardless of cube type
1) Trace each hard drive’s connection to the motherboard to identify which port it is on.
2) Ports are numbered as in the images below.  Remember, 1 does not always equal 1; lower corresponds to lower, higher corresponds to higher.
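The "lower corresponds to lower, higher corresponds to higher" rule can be written out as a small Python sketch.  The function name port_for_failed_drive and the port numbers in the example are hypothetical; the real port numbers come from tracing the cables on your own cube.

```python
def port_for_failed_drive(position, traced_ports):
    """Pick the motherboard port of the failed drive.

    position     -- "lower" or "higher", as read from the degraded array email
    traced_ports -- the two port numbers the hard drive cables lead to
    """
    if position not in ("lower", "higher"):
        raise ValueError("position must be 'lower' or 'higher'")
    if len(set(traced_ports)) != 2:
        raise ValueError("expected two different port numbers")
    low_port, high_port = sorted(traced_ports)
    # Lower drive pairs with the lower port, higher drive with the higher port.
    return low_port if position == "lower" else high_port


if __name__ == "__main__":
    # Hypothetical cube whose two drives are cabled to ports 5 and 0.
    print(port_for_failed_drive("lower", [5, 0]))  # prints 0
```

The sketch deliberately ignores the digits in the device name, since only the relative order of the two drives and the two ports matters.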
 
HD Cubes
 
Bread Box (Stand Alone) Cubes
 
Rack Mount
 
CONCLUSION:
Hard drive crashes are rare, but as our facilities get older they are inevitable.  With three different types of servers and a complex numbering/naming convention on these motherboards, it can get confusing.  When all else fails and you can’t figure it out, you can always fall back on trial and error.  Swap one drive and start the server: if the server comes up, you’ve selected the correct HD to replace.  If it gets stuck starting up, you’ve selected the wrong one.  However, understanding what you are looking at in these degraded array emails or the diagnostics report goes a long way toward addressing the issue quickly.