So much to do, and so little time.
Here is a quick update on some of the things I’m working on.
CP2 – CP3 migration. I was able to get CP3’s /root from the iSCSI SAN, so now it’s time for me to schedule and begin testing the transfer of some accounts from CP2 to CP3. Once I can verify that transfers work, I’ll update DNS and then schedule the transfer of more important (high-visibility) accounts. Then we’ll observe whether these accounts continue to have the problems they’ve had in the past.
SAN problems. Random issues with an older NSM160. One of our Network Storage Modules went off-line on Friday last week (11/6/09), about mid-day. I was able to call HP and get support immediately. We initially rebooted the unit and it returned to normal for a few hours. It then went off-line again later that day. HP took a closer look at the unit and discovered some really odd artifacts. The unit had stopped writing logs in August. HP engineering also found some really odd permission problems with the software RAID components. We took the unit off-line using their Repair NSM function, rebuilt & re-initialized the RAID array, and then reintroduced the unit into the cluster so that the data could re-stripe itself onto the rebuilt unit. Fortunately, we have not seen any additional problems from that unit since doing so. HP is sending a replacement unit, which is back-ordered but should hopefully arrive in the next 3-5 days.
SAN/iQ upgrade. Met with our HP/LHN Storage Sales Specialist and talked (for a good 2 hours) about updates down the line and our current situation. It was kinda fortunate that we had encountered problems the previous week, since I could bring up our support issues in the conversation. Overall, I feel that their support services are getting much better. I assume that most of their merger pains have subsided, and they are doing a rather decent job. I plan to schedule and perform a SAN/iQ software upgrade in early January. We shall see how that goes.
WAD2. WAD is the acronym we’ve given to our Web, Auth, & Database servers. The redundant set (thus the reasoning behind 1 & 2) in a secondary site randomly went off-line on Monday morning (11/09/09). It was quite inconvenient to have these units go down at that time, since I was in a meeting with our HP/LHN rep. After the meeting, I did some digging around and I believe I’ve narrowed down the cause of the interruption to an Automatic Transfer Switch. The switch may be faulty. We’re looking to replace it, along with two faulty UPSes in the same space.
During the interruption of the WAD2 servers, actually a bit before that… While I was helping our Network Admin with some Radius configuration, I ran into two headaches. 1) Our service agreement for this product needs to be renewed. 2) It appears that all production services are running on the WAD1 setup, except for Radius and the bulk import utilities for administrative data. When WAD2 went down, it impacted those utilities and, more importantly, Radius Authentication for our new wireless network. Furthermore, once those services were restored, Joe found some issues with NFS and iSCSI mounts. Oh, I also discovered that Radius Authentication services (for some perl reason) cannot be enabled on the WAD1 systems, so I’m faced with trying to figure out what the best fix will be in the foreseeable future.
Support Contract renewal for VMware. I’ve completed all of the legwork and paperwork. I just need the P.O. to be signed so that I can submit it. Once that is done, I can begin planning an update and eventually a capacity upgrade.
We’ve hired a Web Programmer / Systems Administrator. I don’t know the specifics but I’m content. The CIO hire is another process altogether that I’m not very keyed into.
I guess that about wraps it up.
–Raf

