System failure handling

1. Camera and pi1 failure ( 20060812 )

    After power supply problem in LCO a problem with camera k2b and pi1 computer arised. Camera k2b was not answering and pi1
    didn't start.
    Camera k2b has been continously reset by watchdog , but it was not possible to send data to it ( especially commands ).
    There was a possiblity that during startup there was not enough power ( UPS has been discharged ) and firmaware for
    altera was not loaded properly. I made hard reset using "listwas" with the following procedure .
    After this camera is still reseting in the /var/log/messages  :
        Aug 21 05:13:49 localhost kernel: /opt/pi/dev/pisys/daq/src/ccd/ccddriver/picamdrv2K2K/drv2.6.5/pikam.c: USB Skeleton #193 now disconnected
        Aug 21 05:13:50 localhost kernel: usb 4-2: new full speed USB device using address 126
        Aug 21 05:13:50 localhost kernel: usb 4-2: device not accepting address 126, error -110
        Aug 21 05:13:50 localhost kernel: usb 4-2: new full speed USB device using address 127
        Aug 21 05:13:51 localhost kernel: usb 4-2: device not accepting address 127, error -110
        Aug 21 05:13:52 localhost kernel: usb 2-4: new high speed USB device using address 41
        Aug 21 05:13:52 localhost kernel: /opt/pi/dev/pisys/daq/src/ccd/ccddriver/picamdrv2K2K/drv2.6.5/pikam.c: Entering probe
        Aug 21 05:13:52 localhost kernel:
        Aug 21 05:13:52 localhost kernel: Maksymalny rozmiar pakietu: 512
        Aug 21 05:13:52 localhost kernel: /opt/pi/dev/pisys/daq/src/ccd/ccddriver/picamdrv2K2K/drv2.6.5/pikam.c: Success:  USB Skeleton device now attached                                                             to USBSkel
    So it didn't help ...
    I retried the procedure once again , but without to much hope for fixing the problem this way ...
   
    Aktualny stan :

2. Pi1 failure

Probowalem wszelkich sposobow podniesienia pi1 , wszystko na nic.
Wyglada na to ze pi1 albo padl calkiem , albo padl w nim zasilacz ( moze tylko bezpiecznik ) , w tej sytuacji trzeba to zdiagnozowac patrzac na
ekran monitora  , jesli nic nie wstaje ( zadne lampki czy innych znakow ) to zapewnie zasilacz, jak cos wstaje to pewnie jednak cos na plycie

3. Pi1 failure 20061219

Padl pi1, dzilalal normanie i zniknal, pozostale komputery zostaly dzialajace. Probowalem wszelkich metod, reset ze skrzyneczki i z listew nic nie pomoglo :
   python listwa.py 1 off_123
   python listwa.py 1 on_123
Wyglada na to ze padlo cos grubszego.
20061220 - W. Krzeminski probowal go podniesc, ale niewiele ze tego wyszlo, probowal reset , power off/on, rowniez wyjac i wlozyc kabel zasilajacy spowrotem
                    tez nic to nie pomoglo. Jakas dioda sie na nim pali , czyli raczej nie zasilacz. Sytuacja wyglada tak ze chyba padl calkiem - plyta lub procesor.
                    W kazdym razie wyglada to na jakas powazna awarie sprzetowa. WK raportuje ze po resetach i power off/on nic na ekranie pi1 sie nie pojawia
                    tylko szary ekran jest i zero napisow.
Myslalem ze moze byc potrzebne instalowanie spowrotem GRUB-a na boot sekorze :
     http://ftp.sh.cvut.cz/MIRRORS/slax/SLAX-5-modules/system/
     http://josephhall.org/grub_install_hda1.html
ale to nie wyszlo , bo wogole nie udalo sie go podniesc

Plyta do pi1 :
                       

4. Pi3 - disk /dev/hdb failure

Wyglada na to ze /dev/hdb jest popsuty , probowalem na niego przerzucic czesc danych ale nic z tego nie wyszlo, w /var/log/messages pojawily sie takie oto kwiatki :
Na tym dysku jest troche danych : 20060630  20060727  20060728  20060729  20060802
Noc 20060630 tylko czesciowo - 1GB probowalem przekopiowac cala ale dysk odmowil posluszenstwa - jest padniety

Dec 29 10:36:38 localhost sshd(pam_unix)[30854]: session closed for user pi
Dec 29 10:36:38 localhost sshd(pam_unix)[30878]: session opened for user pi by (uid=11034)
Dec 29 10:36:38 localhost sshd(pam_unix)[30878]: session closed for user pi
Dec 29 10:37:09 localhost kernel: hdb: dma_timer_expiry: dma status == 0x61
Dec 29 10:37:19 localhost kernel: hdb: DMA timeout error
Dec 29 10:37:19 localhost kernel: hdb: dma timeout error: status=0xd0 { Busy }
Dec 29 10:37:19 localhost kernel:
Dec 29 10:37:19 localhost kernel: hda: DMA disabled
Dec 29 10:37:19 localhost kernel: hdb: DMA disabled
Dec 29 10:37:50 localhost kernel: ide0: reset timed-out, status=0x90
Dec 29 10:37:51 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:38:18 localhost kernel:
Dec 29 10:38:26 localhost kernel: hdb: drive not ready for command
Dec 29 10:38:28 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:39:30 localhost kernel:
Dec 29 10:39:31 localhost sshd(pam_unix)[30989]: session opened for user pi by (uid=11034)
Dec 29 10:40:04 localhost kernel: hdb: drive not ready for command
Dec 29 10:40:27 localhost sshd(pam_unix)[31032]: session opened for user pi by (uid=11034)
Dec 29 10:40:53 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:41:07 localhost kernel:
Dec 29 10:41:20 localhost sshd(pam_unix)[31032]: session closed for user pi
Dec 29 10:41:20 localhost kernel: hdb: drive not ready for command
Dec 29 10:41:26 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:41:26 localhost kernel:
Dec 29 10:42:07 localhost sshd(pam_unix)[31152]: session opened for user pi by (uid=11034)
Dec 29 10:43:16 localhost kernel: hdb: drive not ready for command
Dec 29 10:44:16 localhost su(pam_unix)[31178]: session opened for user root by pi(uid=11034)
Dec 29 10:47:44 localhost sshd(pam_unix)[31152]: session closed for user pi
Dec 29 10:47:44 localhost kernel: ide0: reset: success
Dec 29 10:48:33 localhost kernel: hdb: dma_timer_expiry: dma status == 0x41
Dec 29 10:48:33 localhost kernel: hdb: dma_timer_expiry: dma status == 0x41
Dec 29 10:48:34 localhost kernel: hdb: DMA timeout error
Dec 29 10:48:35 localhost kernel: hdb: dma timeout error: status=0xd0 { Busy }
Dec 29 10:48:36 localhost kernel:
Dec 29 10:48:36 localhost kernel: hdb: DMA disabled
Dec 29 10:49:18 localhost kernel: ide0: reset: success
Dec 29 10:49:19 localhost kernel: hdb: dma_timer_expiry: dma status == 0x41
Dec 29 10:49:23 localhost kernel: hdb: dma_timer_expiry: dma status == 0x40
Dec 29 10:49:24 localhost kernel: hdb: DMA timeout retry
Dec 29 10:49:25 localhost kernel: hdb: timeout waiting for DMA
Dec 29 10:49:26 localhost kernel: hda: lost interrupt
Dec 29 10:50:12 localhost kernel: hdb: dma_timer_expiry: dma status == 0x40
Dec 29 10:50:35 localhost kernel: hdb: DMA timeout retry
Dec 29 10:50:35 localhost sshd(pam_unix)[31552]: session opened for user pi by (uid=11034)
Dec 29 10:50:35 localhost kernel: hdb: timeout waiting for DMA
Dec 29 10:50:36 localhost kernel: hda: irq timeout: status=0xd8 { Busy }
Dec 29 10:50:36 localhost kernel:
Dec 29 10:50:36 localhost sshd(pam_unix)[31552]: session closed for user pi
Dec 29 10:50:37 localhost kernel: ide0: reset: success
Dec 29 10:50:43 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:50:43 localhost sshd(pam_unix)[31577]: session opened for user pi by (uid=11034)
Dec 29 10:50:43 localhost kernel:
Dec 29 10:50:47 localhost kernel: ide0: reset: success
Dec 29 10:50:47 localhost sshd(pam_unix)[31577]: session closed for user pi
Dec 29 10:51:28 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:51:29 localhost kernel:
Dec 29 10:51:46 localhost kernel: ide0: reset: success
Dec 29 10:51:46 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:51:48 localhost kernel:
Dec 29 10:51:48 localhost kernel: ide0: reset: success
Dec 29 10:51:49 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:51:50 localhost kernel:
Dec 29 10:52:23 localhost kernel: ide0: reset: success
Dec 29 10:52:25 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:52:25 localhost kernel:
Dec 29 10:52:26 localhost kernel: ide0: reset: success
Dec 29 10:52:26 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:52:27 localhost kernel:
Dec 29 10:52:27 localhost kernel: ide0: reset timed-out, status=0x90
Dec 29 10:53:07 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:53:08 localhost kernel:
Dec 29 10:53:09 localhost sshd(pam_unix)[31698]: session opened for user pi by (uid=11034)
Dec 29 10:53:09 localhost kernel: hdb: drive not ready for command
Dec 29 10:53:11 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:53:11 localhost sshd(pam_unix)[31698]: session closed for user pi
Dec 29 10:53:15 localhost kernel:
Dec 29 10:53:28 localhost kernel: hdb: drive not ready for command
Dec 29 10:53:32 localhost sshd(pam_unix)[31727]: session opened for user pi by (uid=11034)
Dec 29 10:53:51 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:54:01 localhost sshd(pam_unix)[31727]: session closed for user pi
Dec 29 10:54:01 localhost kernel:
Dec 29 10:54:12 localhost kernel: hdb: drive not ready for command
Dec 29 10:54:16 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:54:23 localhost kernel:
Dec 29 10:54:28 localhost kernel: hdb: drive not ready for command
Dec 29 10:54:36 localhost kernel: ide0: reset: success
Dec 29 10:54:45 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:54:45 localhost kernel:
Dec 29 10:55:27 localhost rpc.mountd: Caught signal 15, un-registering and exiting.
Dec 29 10:55:27 localhost kernel: ide0: reset timed-out, status=0x90
Dec 29 10:55:29 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:55:29 localhost kernel:
Dec 29 10:55:30 localhost kernel: hdb: drive not ready for command
Dec 29 10:55:30 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:55:31 localhost kernel:
Dec 29 10:55:32 localhost nfs: rpc.mountd shutdown succeeded
Dec 29 10:55:32 localhost kernel: hdb: drive not ready for command
Dec 29 10:55:33 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:55:33 localhost kernel:
Dec 29 10:55:33 localhost kernel: hdb: drive not ready for command
Dec 29 10:55:36 localhost nfs: nfsd shutdown succeeded
Dec 29 10:55:41 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:56:22 localhost nfs: rpc.rquotad shutdown succeeded
Dec 29 10:56:23 localhost kernel:

A potem jeszcze :

Dec 29 11:06:03 localhost kernel: end_request: I/O error, dev hdb, sector 48175404
Dec 29 11:06:04 localhost kernel: end_request: I/O error, dev hdb, sector 48175412
Dec 29 11:06:04 localhost kernel: end_request: I/O error, dev hdb, sector 48175420
Dec 29 11:06:05 localhost kernel: end_request: I/O error, dev hdb, sector 48175428
Dec 29 11:06:05 localhost kernel: end_request: I/O error, dev hdb, sector 48175436
Dec 29 11:06:06 localhost kernel: end_request: I/O error, dev hdb, sector 48175444
Dec 29 11:06:06 localhost kernel: end_request: I/O error, dev hdb, sector 48175452
Dec 29 11:06:06 localhost kernel: end_request: I/O error, dev hdb, sector 48175460
Dec 29 11:06:06 localhost kernel: end_request: I/O error, dev hdb, sector 48175468
Dec 29 11:06:07 localhost kernel: end_request: I/O error, dev hdb, sector 48175476
Dec 29 11:06:07 localhost kernel: end_request: I/O error, dev hdb, sector 481754

5. Mount crashes !!!!

From time to time mount_server crashes this is caused by simple error in parsing, however this bug was not found , end of output :

        gap:motor-1: Started...  type=2
        update_serial_port_performance: Started: keyword=0, result=0, counter=1
        update_serial_port_performance: Ended normally
        check_axis_range: axis parameter type=2 val=0: range check OK
        gap:motor-1: Ended normally  type=2 value=0
        gap:motor-1: Started...  type=0
        update_serial_port_performance: Started: keyword=0, result=0, counter=1
        update_serial_port_performance: Ended normally
        check_axis_range: axis parameter type=0 val=-551638: range check OK
        gap:motor-1: Ended normally  type=0 value=-551638
        gap:motor-1: Started...  type=138
        get_mount_status: BEGIN: const_buf=0x8071744 |astro|, p_mountStatus=0x417fa230,
        execute_command: const_buf=|astro|, buf=|(null)|
        execute_command: STR2STR: const_buf=|astro|, buf=|astro|
        execute_command: mount_initialized=1
        execute_command: buf=|astro|
        execute_command: STR2LOWERCASE: const_buf=|astro|, buf=|astro|
        filter_comments: input=|astro| len=5
        filter_comments: len=5 (spaces=0 + noncomment=5)
        filter_comments: output=|astro| len=5
        execute_command: FILETER_COMMENTS: const_buf=|astro|, buf=|astro|


This is in monit.c ( FILETER_COMMENTS ) next printf expected ( when it was OK ) :

     execute_command: astro                                    at: Sun Feb 25 21:07:00 2007
     execute_command: Command found in database: command=|astro|, cmd_number=28 found=1
     execute_command: Command = |astro|
     execute_command: Command arguments: arg=(nil), arg=|(null)|
     gap:motor-0: Started...  type=3
     update_serial_port_performance: Started: keyword=0, result=0, counter=1
     update_serial_port_performance: Ended normally
     check_axis_range: axis parameter type=3 val=1518: range check OK
     gap:motor-0: Ended normally  type=3 value=1518

Zeby miec punkt odniesienia to tak, plik monit.c i tutaj jest tak ze linia 2978 to :

    printf(MSG_FUNC": FILETER_COMMENTS: const_buf=|%s|, buf=|%s|\n",const_buf, buf);fflush(0);

natomiast ta oczekiwana linia w bold wypisuje sie w lini 2993

    if (logging>=MESSAGE_LEVEL)
    {
      msg(log_file,MSG_FUNC": %-040s at: %s",buf,ctime(&t));
    }

czyli pad nastepuje gdzies w tym kawalku :

        printf(MSG_FUNC": FILETER_COMMENTS: const_buf=|%s|, buf=|%s|\n",const_buf, buf);fflush(0);
        //msg_filename(kn_log_name,MSG_FUNC": after filter_comments\n");
                                                                                                  
       if( buf == NULL || strlen(buf) < 1 )
      {
            if(buf) free(buf); buf=NULL;
            printf(MSG_FUNC": buf is an empty string! Will ignore this command.\n");fflush(0);
            return(0);
      }
      //msg_filename(kn_log_name,MSG_FUNC": execute_command checkpoint 2\n");
      //printf(MSG_FUNC": buf=%p is not an empty string\n",buf);fflush(0);
                                                                                                  
    #endif
                                                                                                  
     t=time(NULL);
     if (logging>=MESSAGE_LEVEL)
    {
      msg(log_file,MSG_FUNC": %-040s at: %s",buf,ctime(&t));
    }


dodam kilka dodatkowych printfow i moze cos sie wyjasni !!!