March 2015

Hardware upgrades

eustace

Eustace was migrated from a G4p to a G6 on 2nd March which should have save some more power at UCL.

yevaud

New RAID controller and SSD were installed in yevaud on 2nd March but althought the card is recognised fine it does not currently seem to be seeing the disk so this needs further investigation.

gorynych

The sponsors of the Russian tile cache, gorynych, have kindly upgraded it by adding an extra 16Gb of memory and two 240GB SSDs which are now in use (as a RAID1 mirror pair) as the tile store. This should hopefully improve performance for Russian users as the previous configuration was right on the limits of disk I/O bandwidth.

Cooling Issues at IC

On 5th March we were notified that repair work would be taking place on part of the air conditioning system at IC commencing on 9th March. Although initial suggestions were that we should not be affected our servers in fact experienced a significant rise in temperatures.

Around 16:30 on 9th March we experienced a partial outage when a number of machines (spike-01, spike-02, thorn-01, thorn-02 and thorn-03) lost their internal network connection.

The site was placed in offline mode around 16:50 as none of the backends were available, and only one frontend could reach the database, and having decided that the likely cause was a failure of a switch common to those machines Grant set out for IC with a replacement switch while TomH worked to restore service using only the three frontend machines and routing to the database via ironbelly.

Service was restored using only spike-03 at 18:30 and spike-01 and spike-02 were bought back online 10 minutes later. On arrival at the data centre Grant determined that the actual cause of the failure was a loss of power to the switch due to a UPS overheating and cutting out. Power was restored and the site bought back into full operation at 19:45.