System failure handling
1. Camera and pi1 failure ( 20060812 )
After power supply problem in LCO a problem with
camera k2b and pi1 computer arised. Camera k2b was not answering and
pi1
didn't start.
Camera k2b has been continously reset by watchdog ,
but it was not possible to send data to it ( especially commands ).
There was a possiblity that during startup there was
not enough power ( UPS has been discharged ) and firmaware for
altera was not loaded properly. I made hard reset
using "listwas" with the following
procedure .
After this camera is still reseting in the
/var/log/messages :
Aug 21 05:13:49 localhost kernel:
/opt/pi/dev/pisys/daq/src/ccd/ccddriver/picamdrv2K2K/drv2.6.5/pikam.c:
USB Skeleton #193 now disconnected
Aug 21 05:13:50 localhost kernel:
usb 4-2: new full speed USB device using address 126
Aug 21 05:13:50 localhost kernel:
usb 4-2: device not accepting address 126, error -110
Aug 21 05:13:50 localhost kernel:
usb 4-2: new full speed USB device using address 127
Aug 21 05:13:51 localhost kernel:
usb 4-2: device not accepting address 127, error -110
Aug 21 05:13:52 localhost kernel:
usb 2-4: new high speed USB device using address 41
Aug 21 05:13:52 localhost kernel:
/opt/pi/dev/pisys/daq/src/ccd/ccddriver/picamdrv2K2K/drv2.6.5/pikam.c:
Entering probe
Aug 21 05:13:52 localhost kernel:
Aug 21 05:13:52 localhost kernel:
Maksymalny rozmiar pakietu: 512
Aug 21 05:13:52 localhost kernel:
/opt/pi/dev/pisys/daq/src/ccd/ccddriver/picamdrv2K2K/drv2.6.5/pikam.c:
Success: USB Skeleton device now attached
to USBSkel
So it didn't help ...
I retried the procedure once again , but without to
much hope for fixing the problem this way ...
Aktualny
stan :
- kamera k2b resetuje sie z watchdoga - komunikaty w
/var/log/messages jak wyzej
- daje ja sie otworzyc przez : test2K2K 1
ale getID ( komenda 17 ) daje dziwne rezulataty i dziala to tylko do
nastepnego resetu:
INPUT CHOICE:17
in getCameraID ...
CameraID = -1
#################################
Camera ID ( in HEX ) :
Camera Name: k1
Cypress ID : 2004-05-04 / 01
Altera ID :
20ffffffff-ffffffff-ffffffff / ffffffff
Camera ID : -1
Description: Kasprowicz Stankiewicz
#################################
nawet jak recznie sprawdzam status co sekunde to i tak sie resetuje i
wtedy juz kolejne komendy zwracaja blad :
#define
ENODEV 19 /* No such device */
to oznacza ze poszedl reset i program starcil kontankt z
kamera.
- jedyna szansa to chyba proba zaladowania firmwaru altery spod
test2K2K , ale to sie moze nie udac , wogole na razie sie nie udalo na
testach na
hozej a co dopiero na kamerze w niewiadomym stanie
- sprawdzilem tez na USB1.1 i jest to samo , wiec nie tu jest
problem na pewno - zreszta to byla tylko proba "pro forma" bo nie
wierzylem zeby mogla
cokolwiek zmienic
- na razie wylaczylem listwe komenda :
python listwa.py 2 off_45
zeby sie k2b w kolko nie resetowala , bo nie moge sledzic logow w celu
podniesienia pi1
aby wlaczyc nalezy na pi2 zalogowac sie jako root i zrobic :
cd /opt/pi/dev/pisys/reset/
python listwa.py 2 on_45
- NIE DZIALA :-(
Wlasciwie to nic nie zrobilem, probowalem
zaladowac wsad do altery przeslany przez PS ( jest w svn pod :
/opt/pi/dev/pisys/daq/src/firmware/k2a/vhdl/20080821ps ) , ale to
wygladalo jakby sie nie udalo , bo nie zdarzylem przed resetem i
chyba nie wszystkie bajty zdarzyly sie zaladowac z 131072 ktore ma plik
rpd.
Po tej operacji wygladalo jakby sie komunikacja z kamerami zawiesila bo
: test2K2K k2a zawisl na :
###########################
Opening camera by USB interface
Open:fd:-1
Opening devices ...
i nie nie moglem tez ubic tego programu ani odladowac drivera bo byl
zajety , wiec jedyne co pozostalo to reboot pi2 , zrobilem to i po
reboocie
k2b zaczela dobrze dzialac !!!!!
- jednak nie do konca dziala dobrze, nie
daje sie sciagnac klatki - transfer pada w ~ 1/4 :
Strange : expected 1024 bytes , obtained -1
GetData: ERROR ReadBytes exits with -1
(Device:/dev/usb/pikam1)
Transfered 2753536
bytes, size is 8445952
final_size=5692416, off=2753536
- dziwne bo za kazdym razem rwie sie w
okolicach tego samego rozmiaru :
Transfered 2791424
bytes, size is 8445952
- taki pomysl -
moze wsad od PS jest jakis nowszy , zaladowal sie dobrze tylko po
prostu z nim mi nie bedzie dobrze dzialac ?
Nie to raczej nie to bo wersja altery nadal jest kopnieta, jakby sie
dobrze zaladowal to by sie poprawila !
- probowalem zaladowac ponownie altere ,
ale nie wyszlo , zrobilem to z "Postpone watchdog refresh" , potem jak
zrobilem
"Restart
watchdog refresh" to pi2 sie powiesil , wiec nie nalezy tego uzywac,
lepiej juz chyba odladowac modul i na nowo zaladowac
- pad pi2 - przestawil sobie kolejnosc
dysko w BIOSIE i nie chcial wstac , otrzymalem instrukcje wchodzenia do
BIOS-u od KN
wykonalem ja i zmienilem kolejnosc bootowania, pi2 wstal , sadze ze
koniecznie trzeba mu sprawdzic dyski systemowe
bo chyba wypisal jakies bledy, log z
wstawania systemu 20060825
2. Pi1 failure
Probowalem wszelkich sposobow
podniesienia pi1 , wszystko na nic.
Wyglada na to ze pi1 albo padl calkiem , albo padl w nim zasilacz
( moze tylko bezpiecznik ) , w tej sytuacji trzeba to zdiagnozowac
patrzac na
ekran monitora , jesli nic nie wstaje ( zadne lampki czy innych
znakow ) to zapewnie zasilacz, jak cos wstaje to pewnie jednak cos na
plycie
3. Pi1 failure 20061219
Padl pi1, dzilalal normanie i zniknal, pozostale komputery zostaly
dzialajace. Probowalem wszelkich metod, reset ze skrzyneczki i z listew
nic nie pomoglo :
python listwa.py 1 off_123
python listwa.py 1 on_123
Wyglada na to ze padlo cos grubszego.
20061220 - W. Krzeminski probowal go podniesc, ale niewiele ze tego
wyszlo, probowal reset , power off/on, rowniez wyjac i wlozyc kabel
zasilajacy spowrotem
tez nic to nie pomoglo. Jakas dioda sie na nim pali , czyli raczej nie
zasilacz. Sytuacja wyglada tak ze chyba padl calkiem - plyta lub
procesor.
W kazdym razie wyglada to na jakas powazna awarie sprzetowa. WK
raportuje ze po resetach i power off/on nic na ekranie pi1 sie nie pojawia
tylko szary ekran jest i zero napisow.
Myslalem ze moze byc potrzebne instalowanie spowrotem GRUB-a na boot
sekorze :
http://ftp.sh.cvut.cz/MIRRORS/slax/SLAX-5-modules/system/
http://josephhall.org/grub_install_hda1.html
ale to nie wyszlo , bo wogole nie udalo sie go podniesc
Plyta do pi1 :
4. Pi3 - disk /dev/hdb failure
Wyglada na to ze /dev/hdb jest popsuty , probowalem na niego przerzucic
czesc danych ale nic z tego nie wyszlo, w /var/log/messages pojawily
sie takie oto kwiatki :
Na tym dysku jest troche danych : 20060630
20060727 20060728 20060729 20060802
Noc 20060630 tylko
czesciowo - 1GB probowalem przekopiowac cala ale dysk odmowil
posluszenstwa - jest padniety
Dec 29 10:36:38 localhost sshd(pam_unix)[30854]: session closed for
user pi
Dec 29 10:36:38 localhost sshd(pam_unix)[30878]: session opened for
user pi by (uid=11034)
Dec 29 10:36:38 localhost sshd(pam_unix)[30878]: session closed for
user pi
Dec 29 10:37:09
localhost kernel: hdb: dma_timer_expiry: dma status == 0x61
Dec 29 10:37:19
localhost kernel: hdb: DMA timeout error
Dec 29 10:37:19
localhost kernel: hdb: dma timeout error: status=0xd0 { Busy }
Dec 29 10:37:19 localhost kernel:
Dec 29 10:37:19 localhost kernel: hda: DMA disabled
Dec 29 10:37:19 localhost kernel: hdb: DMA disabled
Dec 29 10:37:50 localhost kernel: ide0: reset timed-out, status=0x90
Dec 29 10:37:51 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:38:18 localhost kernel:
Dec 29 10:38:26 localhost kernel: hdb: drive not ready for command
Dec 29 10:38:28 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:39:30 localhost kernel:
Dec 29 10:39:31 localhost sshd(pam_unix)[30989]: session opened for
user pi by (uid=11034)
Dec 29 10:40:04 localhost kernel: hdb: drive not ready for command
Dec 29 10:40:27 localhost sshd(pam_unix)[31032]: session opened for
user pi by (uid=11034)
Dec 29 10:40:53 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:41:07 localhost kernel:
Dec 29 10:41:20 localhost sshd(pam_unix)[31032]: session closed for
user pi
Dec 29 10:41:20 localhost kernel: hdb: drive not ready for command
Dec 29 10:41:26 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:41:26 localhost kernel:
Dec 29 10:42:07 localhost sshd(pam_unix)[31152]: session opened for
user pi by (uid=11034)
Dec 29 10:43:16 localhost kernel: hdb: drive not ready for command
Dec 29 10:44:16 localhost su(pam_unix)[31178]: session opened for user
root by pi(uid=11034)
Dec 29 10:47:44 localhost sshd(pam_unix)[31152]: session closed for
user pi
Dec 29 10:47:44 localhost kernel: ide0: reset: success
Dec 29 10:48:33 localhost kernel: hdb: dma_timer_expiry: dma status ==
0x41
Dec 29 10:48:33 localhost kernel: hdb: dma_timer_expiry: dma status ==
0x41
Dec 29 10:48:34 localhost kernel: hdb: DMA timeout error
Dec 29 10:48:35 localhost kernel: hdb: dma timeout error: status=0xd0 {
Busy }
Dec 29 10:48:36 localhost kernel:
Dec 29 10:48:36 localhost kernel: hdb: DMA disabled
Dec 29 10:49:18 localhost kernel: ide0: reset: success
Dec 29 10:49:19 localhost kernel: hdb: dma_timer_expiry: dma status ==
0x41
Dec 29 10:49:23 localhost kernel: hdb: dma_timer_expiry: dma status ==
0x40
Dec 29 10:49:24 localhost kernel: hdb: DMA timeout retry
Dec 29 10:49:25 localhost kernel: hdb: timeout waiting for DMA
Dec 29 10:49:26 localhost kernel: hda: lost interrupt
Dec 29 10:50:12 localhost kernel: hdb: dma_timer_expiry: dma status ==
0x40
Dec 29 10:50:35 localhost kernel: hdb: DMA timeout retry
Dec 29 10:50:35 localhost sshd(pam_unix)[31552]: session opened for
user pi by (uid=11034)
Dec 29 10:50:35 localhost kernel: hdb: timeout waiting for DMA
Dec 29 10:50:36 localhost kernel: hda: irq timeout: status=0xd8 { Busy }
Dec 29 10:50:36 localhost kernel:
Dec 29 10:50:36 localhost sshd(pam_unix)[31552]: session closed for
user pi
Dec 29 10:50:37 localhost kernel: ide0: reset: success
Dec 29 10:50:43 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:50:43 localhost sshd(pam_unix)[31577]: session opened for
user pi by (uid=11034)
Dec 29 10:50:43 localhost kernel:
Dec 29 10:50:47 localhost kernel: ide0: reset: success
Dec 29 10:50:47 localhost sshd(pam_unix)[31577]: session closed for
user pi
Dec 29 10:51:28 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:51:29 localhost kernel:
Dec 29 10:51:46 localhost kernel: ide0: reset: success
Dec 29 10:51:46 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:51:48 localhost kernel:
Dec 29 10:51:48 localhost kernel: ide0: reset: success
Dec 29 10:51:49 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:51:50 localhost kernel:
Dec 29 10:52:23 localhost kernel: ide0: reset: success
Dec 29 10:52:25 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:52:25 localhost kernel:
Dec 29 10:52:26 localhost kernel: ide0: reset: success
Dec 29 10:52:26 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:52:27 localhost kernel:
Dec 29 10:52:27 localhost kernel: ide0: reset timed-out, status=0x90
Dec 29 10:53:07 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:53:08 localhost kernel:
Dec 29 10:53:09 localhost sshd(pam_unix)[31698]: session opened for
user pi by (uid=11034)
Dec 29 10:53:09 localhost kernel: hdb: drive not ready for command
Dec 29 10:53:11 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:53:11 localhost sshd(pam_unix)[31698]: session closed for
user pi
Dec 29 10:53:15 localhost kernel:
Dec 29 10:53:28 localhost kernel: hdb: drive not ready for command
Dec 29 10:53:32 localhost sshd(pam_unix)[31727]: session opened for
user pi by (uid=11034)
Dec 29 10:53:51 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:54:01 localhost sshd(pam_unix)[31727]: session closed for
user pi
Dec 29 10:54:01 localhost kernel:
Dec 29 10:54:12 localhost kernel: hdb: drive not ready for command
Dec 29 10:54:16 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:54:23 localhost kernel:
Dec 29 10:54:28 localhost kernel: hdb: drive not ready for command
Dec 29 10:54:36 localhost kernel: ide0: reset: success
Dec 29 10:54:45 localhost kernel: hdb: irq timeout: status=0xd0 { Busy }
Dec 29 10:54:45 localhost kernel:
Dec 29 10:55:27 localhost rpc.mountd: Caught signal 15, un-registering
and exiting.
Dec 29 10:55:27 localhost kernel: ide0: reset timed-out, status=0x90
Dec 29 10:55:29 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:55:29 localhost kernel:
Dec 29 10:55:30 localhost kernel: hdb: drive not ready for command
Dec 29 10:55:30 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:55:31 localhost kernel:
Dec 29 10:55:32 localhost nfs: rpc.mountd shutdown succeeded
Dec 29 10:55:32 localhost kernel: hdb: drive not ready for command
Dec 29 10:55:33 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:55:33 localhost kernel:
Dec 29 10:55:33 localhost kernel: hdb: drive not ready for command
Dec 29 10:55:36 localhost nfs: nfsd shutdown succeeded
Dec 29 10:55:41 localhost kernel: hdb: status error: status=0x00 { }
Dec 29 10:56:22 localhost nfs: rpc.rquotad shutdown succeeded
Dec 29 10:56:23 localhost kernel:
A potem jeszcze :
Dec 29 11:06:03 localhost kernel: end_request: I/O error, dev hdb,
sector 48175404
Dec 29 11:06:04 localhost kernel: end_request: I/O error, dev hdb,
sector 48175412
Dec 29 11:06:04 localhost kernel: end_request: I/O error, dev hdb,
sector 48175420
Dec 29 11:06:05 localhost kernel: end_request: I/O error, dev hdb,
sector 48175428
Dec 29 11:06:05 localhost kernel: end_request: I/O error, dev hdb,
sector 48175436
Dec 29 11:06:06 localhost kernel: end_request: I/O error, dev hdb,
sector 48175444
Dec 29 11:06:06 localhost kernel: end_request: I/O error, dev hdb,
sector 48175452
Dec 29 11:06:06 localhost kernel: end_request: I/O error, dev hdb,
sector 48175460
Dec 29 11:06:06 localhost kernel: end_request: I/O error, dev hdb,
sector 48175468
Dec 29 11:06:07 localhost kernel: end_request: I/O error, dev hdb,
sector 48175476
Dec 29 11:06:07 localhost kernel: end_request: I/O error, dev hdb,
sector 481754
5. Mount crashes !!!!
From time to time mount_server crashes this is caused by simple error
in parsing, however this bug was not found , end of output :
gap:motor-1: Started...
type=2
update_serial_port_performance:
Started: keyword=0, result=0, counter=1
update_serial_port_performance:
Ended normally
check_axis_range: axis parameter
type=2 val=0: range check OK
gap:motor-1: Ended normally
type=2 value=0
gap:motor-1: Started...
type=0
update_serial_port_performance:
Started: keyword=0, result=0, counter=1
update_serial_port_performance:
Ended normally
check_axis_range: axis parameter
type=0 val=-551638: range check OK
gap:motor-1: Ended normally
type=0 value=-551638
gap:motor-1: Started...
type=138
get_mount_status: BEGIN:
const_buf=0x8071744 |astro|, p_mountStatus=0x417fa230,
execute_command:
const_buf=|astro|, buf=|(null)|
execute_command: STR2STR:
const_buf=|astro|, buf=|astro|
execute_command:
mount_initialized=1
execute_command: buf=|astro|
execute_command: STR2LOWERCASE:
const_buf=|astro|, buf=|astro|
filter_comments: input=|astro|
len=5
filter_comments: len=5 (spaces=0
+ noncomment=5)
filter_comments: output=|astro|
len=5
execute_command:
FILETER_COMMENTS: const_buf=|astro|, buf=|astro|
This is in monit.c ( FILETER_COMMENTS ) next printf expected ( when it
was OK ) :
execute_command:
astro
at: Sun Feb 25 21:07:00 2007
execute_command: Command found in database:
command=|astro|, cmd_number=28 found=1
execute_command: Command = |astro|
execute_command: Command arguments: arg=(nil),
arg=|(null)|
gap:motor-0: Started... type=3
update_serial_port_performance: Started:
keyword=0, result=0, counter=1
update_serial_port_performance: Ended normally
check_axis_range: axis parameter type=3
val=1518: range check OK
gap:motor-0: Ended normally type=3
value=1518
Zeby miec punkt odniesienia to tak, plik monit.c i tutaj jest
tak ze linia 2978 to :
printf(MSG_FUNC": FILETER_COMMENTS: const_buf=|%s|,
buf=|%s|\n",const_buf, buf);fflush(0);
natomiast ta oczekiwana linia w bold wypisuje sie w lini 2993
if (logging>=MESSAGE_LEVEL)
{
msg(log_file,MSG_FUNC": %-040s at:
%s",buf,ctime(&t));
}
czyli pad nastepuje gdzies w tym kawalku :
printf(MSG_FUNC":
FILETER_COMMENTS: const_buf=|%s|, buf=|%s|\n",const_buf, buf);fflush(0);
//msg_filename(kn_log_name,MSG_FUNC": after filter_comments\n");
if( buf == NULL || strlen(buf) < 1 )
{
if(buf)
free(buf); buf=NULL;
printf(MSG_FUNC":
buf is an empty string! Will ignore this command.\n");fflush(0);
return(0);
}
//msg_filename(kn_log_name,MSG_FUNC":
execute_command checkpoint 2\n");
//printf(MSG_FUNC": buf=%p is not an empty
string\n",buf);fflush(0);
#endif
t=time(NULL);
if (logging>=MESSAGE_LEVEL)
{
msg(log_file,MSG_FUNC": %-040s at:
%s",buf,ctime(&t));
}
dodam kilka dodatkowych printfow i moze cos sie wyjasni !!!