Wrong checksum disk replica

How to reproduce the problem

Archive a file to tape:

[itctabuild02] ~ > echo -n '1234567890' > ten_byte_file.txt
[itctabuild02] ~ > run_eosuser1_shell
[itctabuild02] ~ (krb5=eosuser1)> xrdcp ten_byte_file.txt root://localhost//eos/dev/userfiles/testdir_1
[10B/10B][100%][==================================================][10B/s]  
[itctabuild02] ~ (krb5=eosuser1)> exit
exit
[itctabuild02] ~ >

Observe that the file is only on tape (d0::t1):

[itctabuild02] ~ > run_eosuser1_shell
[itctabuild02] ~ (krb5=eosuser1)> eos root://localhost ls -y /eos/dev/userfiles/testdir_1/ten_byte_file.txt
d0::t1   -rw-r--r--   1 eosuser1 eosuser1           10 Apr 16 11:55 ten_byte_file.txt
[itctabuild02] ~ (krb5=eosuser1)> exit
exit
[itctabuild02] ~ >

Request that the file be retrieved from tape:

[itctabuild02] ~ > run_eospoweruser1_shell
[itctabuild02] ~ (krb5=eospoweruser1)> xrdfs localhost prepare -s /eos/dev/userfiles/testdir_1/ten_byte_file.txt
eos:044620011458020202280000000001000042:e03bfa81.5e982b25:11
[itctabuild02] ~ (krb5=eospoweruser1)> exit
exit
[itctabuild02] ~ >

Observe that the file is both on disk and on tape (d1::t1):

[itctabuild02] ~ > run_eosuser1_shell
[itctabuild02] ~ (krb5=eosuser1)> eos root://localhost ls -y /eos/dev/userfiles/testdir_1/ten_byte_file.txt
d1::t1   -rw-r--r--   2 eosuser1 eosuser1           10 Apr 16 11:55 ten_byte_file.txt
[itctabuild02] ~ (krb5=eosuser1)> exit
exit
[itctabuild02] ~ >
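
The first column of the two `eos ls -y` listings above (`d0::t1` and `d1::t1`) is what distinguishes "only on tape" from "on disk and on tape". The sketch below turns that column into a readable state. It assumes the column always has the form `d<N>::t<M>`, with N disk replicas and M tape replicas; that reading is inferred from the two listings, and `replica_state` is a hypothetical helper, not an EOS command.

```shell
#!/bin/sh
# Hypothetical helper (not part of EOS): classify the first column of
# `eos ls -y` output, for example "d0::t1" or "d1::t1".
# Assumption: the column has the form d<N>::t<M>.
replica_state() {
    flag=$1
    disk=${flag#d}         # drop the leading "d"
    disk=${disk%%::t*}     # keep what precedes "::t"
    tape=${flag##*::t}     # keep what follows "::t"

    # Reject anything that is not exactly d<digits>::t<digits>.
    if [ "d${disk}::t${tape}" != "$flag" ] || [ -z "$disk" ] || [ -z "$tape" ]; then
        echo "unknown"
        return 1
    fi
    case "$disk$tape" in
        *[!0-9]*) echo "unknown"; return 1 ;;
    esac

    if [ "$disk" -gt 0 ] && [ "$tape" -gt 0 ]; then
        echo "disk+tape"
    elif [ "$tape" -gt 0 ]; then
        echo "tape-only"
    elif [ "$disk" -gt 0 ]; then
        echo "disk-only"
    else
        echo "no-replica"
    fi
}

# Possible use against a live instance (same command as in the transcript):
#   replica_state "$(eos root://localhost ls -y /eos/dev/userfiles/testdir_1/ten_byte_file.txt | awk '{ print $1; exit }')"
```

Note that this state only reflects what the EOS namespace records. As the rest of this page shows, a `disk+tape` result does not guarantee that the disk replica is readable.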

Determine the location of the underlying physical file on the EOS FST and give it different contents whilst preserving its size:

[itctabuild02] ~ > sudo eos root://localhost fileinfo /eos/dev/userfiles/testdir_1/ten_byte_file.txt --fullpath
  File: '/eos/dev/userfiles/testdir_1/ten_byte_file.txt'  Flags: 0644
  Size: 10
Modify: Thu Apr 16 11:55:32 2020 Timestamp: 1587030932.74291000
Change: Thu Apr 16 11:56:14 2020 Timestamp: 1587030974.410272330
Birth : Thu Apr 16 11:55:32 2020 Timestamp: 1587030932.36661398
  CUid: 19227 CGid: 1487  Fxid: 00000010 Fid: 16    Pid: 15   Pxid: 0000000f
XStype: adler    XS: 0b 2c 02 0e    ETAGs: "4294967296:0b2c020e"
Layout: replica Stripes: 1 Blocksize: 4k LayoutId: 00100012
  #Rep: 2
┌───┬──────┬────────────────────────┬────────────────┬────────────────────────────────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┬──────────────────────────────────────────────────────────────┐
│no.│ fs-id│                    host│      schedgroup│                                        path│      boot│  configstatus│       drain│  active│                  geotag│                                             physical location│
└───┴──────┴────────────────────────┴────────────────┴────────────────────────────────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┴──────────────────────────────────────────────────────────────┘
 0    65535                localhost           tape.0                              /does_not_exist                       off      nodrain  offline                                                       /does_not_exist/00000000/00000010 
 1        2     itctabuild02.cern.ch        spinner.0 /run/media/smurray/250GB/fst_spinner_storage     booted             rw      nodrain   online                     flat /run/media/smurray/250GB/fst_spinner_storage/00000000/00000010 

*******
[itctabuild02] ~ > echo -n RUBBISH890 | sudo tee /run/media/smurray/250GB/fst_spinner_storage/00000000/00000010
RUBBISH890[itctabuild02] ~ > 
[itctabuild02] ~ > 

Try to copy out the disk replica as an end user and print the exit code of the failing command:

[itctabuild02] ~ > run_eosuser1_shell
[itctabuild02] ~ (krb5=eosuser1)> xrdcp root://localhost//eos/dev/userfiles/testdir_1/ten_byte_file.txt /tmp/tmp_ten_byte_file.txt
[0B/0B][100%][==================================================][0B/s]  
Run: [ERROR] Server responded with an error: [3007] Unable to read file - wrong file checksum fn= /run/media/smurray/250GB/fst_spinner_storage/00000000/00000010; input/output error (source)

[itctabuild02] ~ (krb5=eosuser1)> echo $?
54
[itctabuild02] ~ (krb5=eosuser1)> exit
exit
[itctabuild02] ~ > 

Observe that EOS still reports the original checksum for the disk replica even though the file has been corrupted:

[itctabuild02] ~ > sudo eos root://localhost fileinfo /eos/dev/userfiles/testdir_1/ten_byte_file.txt --fullpath
  File: '/eos/dev/userfiles/testdir_1/ten_byte_file.txt'  Flags: 0644
  Size: 10
Modify: Thu Apr 16 11:55:32 2020 Timestamp: 1587030932.74291000
Change: Thu Apr 16 11:56:14 2020 Timestamp: 1587030974.410272330
Birth : Thu Apr 16 11:55:32 2020 Timestamp: 1587030932.36661398
  CUid: 19227 CGid: 1487  Fxid: 00000010 Fid: 16    Pid: 15   Pxid: 0000000f
XStype: adler    XS: 0b 2c 02 0e    ETAGs: "4294967296:0b2c020e"
Layout: replica Stripes: 1 Blocksize: 4k LayoutId: 00100012
  #Rep: 2
┌───┬──────┬────────────────────────┬────────────────┬────────────────────────────────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┬──────────────────────────────────────────────────────────────┐
│no.│ fs-id│                    host│      schedgroup│                                        path│      boot│  configstatus│       drain│  active│                  geotag│                                             physical location│
└───┴──────┴────────────────────────┴────────────────┴────────────────────────────────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┴──────────────────────────────────────────────────────────────┘
 0    65535                localhost           tape.0                              /does_not_exist                       off      nodrain  offline                                                       /does_not_exist/00000000/00000010 
 1        2     itctabuild02.cern.ch        spinner.0 /run/media/smurray/250GB/fst_spinner_storage     booted             rw      nodrain   online                     flat /run/media/smurray/250GB/fst_spinner_storage/00000000/00000010 

*******
[itctabuild02] ~ > 
[itctabuild02] ~ > sudo xrdadler32 /run/media/smurray/250GB/fst_spinner_storage/00000000/00000010
0fd802b1 /run/media/smurray/250GB/fst_spinner_storage/00000000/00000010
[itctabuild02] ~ > 
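
The comparison made by eye above (the `XS:` field of `eos fileinfo` against the output of `xrdadler32`) can be sketched as two small text-parsing helpers. Both function names are hypothetical, and the parsing assumes the `XStype: ... XS: ... ETAGs: ...` line layout shown in the transcript; other EOS versions may print this line differently.

```shell
#!/bin/sh
# Hypothetical helpers that only parse text. The commands producing the
# text are the ones shown in the transcript above.

# Read `eos fileinfo` output on stdin and print the namespace checksum
# without spaces, e.g. "XS: 0b 2c 02 0e" becomes "0b2c020e".
# Assumption: the checksum sits between "XS:" and "ETAGs" on the XStype line.
namespace_checksum() {
    awk '/^XStype:/ {
        sub(/.*XS: */, "")     # drop everything up to the checksum bytes
        sub(/ *ETAGs.*/, "")   # drop the trailing ETAGs field
        gsub(/ /, "")          # join the space-separated bytes
        print
        exit
    }'
}

# $1: full text of `eos fileinfo <path> --fullpath`
# $2: output line of `xrdadler32 <physical file>`
# Prints "match <xs>" or "mismatch namespace=<xs> disk=<xs>".
compare_checksums() {
    ns=$(printf '%s\n' "$1" | namespace_checksum)
    disk=$(printf '%s\n' "$2" | awk '{ print $1; exit }')
    if [ -n "$ns" ] && [ "$ns" = "$disk" ]; then
        echo "match $ns"
    else
        echo "mismatch namespace=$ns disk=$disk"
        return 1
    fi
}
```

Fed with the two outputs from the transcript, `compare_checksums` prints `mismatch namespace=0b2c020e disk=0fd802b1`.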

Observe that EOS ignores an end user’s request to retrieve the file from tape because EOS believes the disk replica already exists:

[itctabuild02] ~ > run_eospoweruser1_shell
[itctabuild02] ~ (krb5=eospoweruser1)> xrdfs localhost prepare -s /eos/dev/userfiles/testdir_1/ten_byte_file.txt
eos:044620011458020202280000000001000042:e03bfa81.5e982b25:12
[itctabuild02] ~ (krb5=eospoweruser1)> exit
exit
[itctabuild02] ~ > 
[itctabuild02] ~ > grep 'nothing to prepare' /var/log/eos/mgm/xrdlog.mgm
200416 12:02:09 time=1587031329.929413 func=HandleProtoMethodPrepareEvent level=INFO  logid=static.............................. unit=mgm@itctabuild02.cern.ch:1094 tid=00007fd8b76fd700 source=WFE:1666                       tident= sec=(null) uid=99 gid=99 name=- geo="" File /eos/dev/userfiles/testdir_1/ten_byte_file.txt is already on disk, nothing to prepare.
[itctabuild02] ~ > 

How an end user can recover the data

Ask EOS to evict the disk replica:

[itctabuild02] ~ > run_eospoweruser1_shell 
[itctabuild02] ~ (krb5=eospoweruser1)> xrdfs localhost prepare -e /eos/dev/userfiles/testdir_1/ten_byte_file.txt
[itctabuild02] ~ (krb5=eospoweruser1)> exit
exit
[itctabuild02] ~ > 

Observe that EOS now recognises that the disk replica is gone:

[itctabuild02] ~ > run_eosuser1_shell
[itctabuild02] ~ (krb5=eosuser1)> eos root://localhost ls -y /eos/dev/userfiles/testdir_1/ten_byte_file.txt
d0::t1   -rw-r--r--   1 eosuser1 eosuser1           10 Apr 16 11:55 ten_byte_file.txt
[itctabuild02] ~ (krb5=eosuser1)> exit
exit
[itctabuild02] ~ > 

Request that the file be retrieved from tape:

[itctabuild02] ~ > run_eospoweruser1_shell
[itctabuild02] ~ (krb5=eospoweruser1)> xrdfs localhost prepare -s /eos/dev/userfiles/testdir_1/ten_byte_file.txt
eos:044620011458020202280000000001000042:e03bfa81.5e982b25:13
[itctabuild02] ~ (krb5=eospoweruser1)> exit
exit
[itctabuild02] ~ > 

Observe that the file is both on disk and on tape (d1::t1):

[itctabuild02] ~ > run_eosuser1_shell
[itctabuild02] ~ (krb5=eosuser1)> eos root://localhost ls -y /eos/dev/userfiles/testdir_1/ten_byte_file.txt
d1::t1   -rw-r--r--   2 eosuser1 eosuser1           10 Apr 16 11:55 ten_byte_file.txt
[itctabuild02] ~ (krb5=eosuser1)> exit
exit
[itctabuild02] ~ > 

Copy the recovered file out:

[itctabuild02] ~ > run_eosuser1_shell
[itctabuild02] ~ (krb5=eosuser1)> xrdcp root://localhost//eos/dev/userfiles/testdir_1/ten_byte_file.txt /tmp/tmp_ten_byte_file.txt
[10B/10B][100%][==================================================][10B/s]  
[itctabuild02] ~ (krb5=eosuser1)> exit
exit
[itctabuild02] ~ > 
[itctabuild02] ~ > cat /tmp/tmp_ten_byte_file.txt; echo
1234567890
[itctabuild02] ~ > 
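
The recovery steps above (evict, confirm `d0`, retrieve, wait for `d1`, copy out) can be strung together roughly as follows. This is a minimal sketch, not a supported tool: `recover_from_tape`, `replica_flag`, `POLL_SECONDS` and `MAX_POLLS` are hypothetical names, the script assumes that the calling identity is allowed to run both the `prepare` requests and the copy (the transcript uses two different users), and it assumes that the eviction is visible in `eos ls -y` as soon as `prepare -e` returns.

```shell
#!/bin/sh
# Illustrative knobs, not EOS or CTA settings.
POLL_SECONDS=${POLL_SECONDS:-10}
MAX_POLLS=${MAX_POLLS:-60}

# Print the first column of `eos ls -y` for a path, e.g. "d1::t1".
replica_flag() {
    eos root://localhost ls -y "$1" | awk '{ print $1; exit }'
}

# $1: EOS path, $2: local destination file
recover_from_tape() {
    path=$1
    dest=$2

    # 1. Evict the corrupted disk replica.
    xrdfs localhost prepare -e "$path" || return 1

    # 2. Confirm that EOS no longer counts a disk replica.
    case "$(replica_flag "$path")" in
        d0::*) ;;
        *) echo "disk replica still present after eviction" >&2; return 1 ;;
    esac

    # 3. Request retrieval from tape.
    xrdfs localhost prepare -s "$path" || return 1

    # 4. Poll until the file is reported on disk again, or give up.
    polls=0
    while [ "$polls" -lt "$MAX_POLLS" ]; do
        case "$(replica_flag "$path")" in
            d0::* | '') ;;     # not on disk yet
            d*) break ;;       # at least one disk replica
        esac
        polls=$((polls + 1))
        sleep "$POLL_SECONDS"
    done
    if [ "$polls" -ge "$MAX_POLLS" ]; then
        echo "timed out waiting for the disk replica" >&2
        return 1
    fi

    # 5. Copy the recovered file out.
    xrdcp "root://localhost/$path" "$dest"
}

# Example call (paths taken from the transcript):
#   recover_from_tape /eos/dev/userfiles/testdir_1/ten_byte_file.txt /tmp/tmp_ten_byte_file.txt
```

The check in step 2 mirrors the manual `ls -y` after eviction. Without it, the polling loop in step 4 could mistake the still-registered corrupted replica for a freshly retrieved one.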

What a tape operator can do to recover the data

The procedure is the same as for an end user.