EOS disk file repairs¶
EOS provides at least two major mechanisms for repairing disk replicas:
The eos-ops-durability project¶
The eos-ops-durability project is written in Python. The project has at least two active parts:
- The part responsible for detecting faults in and repairing differing EOS disk replicas.
- The part responsible for detecting drain failures.
The “differing EOS replicas” part is executed within Rundeck and the “drain failure” part is executed as a cron job on the EOS MGM.
Some of the functionality that was only available in the eos-ops-durability project is now implemented by the EOS support for fsck.
EOS client and server support for fsck¶
Each individual EOS FST always runs what is known as a disk scanner thread. This thread runs a scan of the local FST filesystem every four hours. During a scan, only files that have not been scanned in the last 7 days are scanned. Scanning a file means reading its contents and recalculating its checksum. The checksum is compared to what the EOS namespace believes is the correct value, and the result of the comparison is stored in a local LevelDB database. The scanner code is also reused outside of the EOS FSTs in the form of a standalone program named eos-scan-fs.
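The scan logic described above can be illustrated with a minimal Python sketch. It is not the EOS scanner code. The function names, the two dictionaries standing in for the namespace and the LevelDB database, and the use of Adler-32 as the checksum are all assumptions made for the example.

```python
import time
import zlib
from pathlib import Path

# Files scanned more recently than this are skipped (7 days, as described above).
RESCAN_AFTER_SECONDS = 7 * 24 * 3600


def adler32_of(path, chunk_size=1024 * 1024):
    """Read the whole file in chunks and return its Adler-32 checksum as 8 hex digits.

    Adler-32 is an assumption here; it stands in for whatever checksum is configured.
    """
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def scan_pass(root, expected, last_scanned, now=None):
    """Run one scan pass over the directory tree at `root`.

    expected:     hypothetical mapping of relative path -> checksum held by the namespace
    last_scanned: hypothetical mapping of relative path -> time of the previous scan
                  (updated in place; stands in for the local database)
    Returns a mapping of relative path -> True (checksum matches) or False
    (mismatch) for the files that were scanned in this pass.
    """
    now = time.time() if now is None else now
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        previous = last_scanned.get(rel)
        if previous is not None and now - previous < RESCAN_AFTER_SECONDS:
            continue  # scanned within the last 7 days, skip it this time
        results[rel] = adler32_of(path) == expected.get(rel)
        last_scanned[rel] = now
    return results
```

Calling `scan_pass` every four hours with the same `last_scanned` mapping would reproduce the behaviour described: each pass is cheap unless a file's previous scan is at least 7 days old.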
Scanning the local filesystems of an FST can be heavy in terms of local disk I/O. The scanning of a filesystem can be turned off by setting the filesystem attribute scaninterval to 0, for example:
[itctabuild02] ~ > eos fs config 1 scaninterval=0
[itctabuild02] ~ >
The EOS MGM contains the following two threads for fsck support, both of which can be toggled on or off with the eos fsck config command:
- The fsck collection thread.
- The fsck repair thread.
The fsck collection thread collects the results of disk scans made by the FSTs. The fsck repair thread acts on the collected results by making repairs where they can be made.
The EOS FST file scrubber thread¶
The EOS FST file scrubber thread is briefly described here for completeness. The scrubber reads and writes the following two files at the root of the filesystem being tested:
scrub.write-once.FSID
scrub.re-write.FSID
If the file scrubber fails to either read or write these files, then the FST notifies the MGM that the filesystem is not healthy. The file scrubber thread can be made to ignore a filesystem by creating the following hidden file in the root of the local filesystem:
.eosnoscrub
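As an illustration only, the scrubber behaviour described above could be sketched in Python as follows. The function name, the test pattern, the return values, and the read-back comparison are assumptions. So is the idea, inferred from the file names, that one file is written once and the other is rewritten on every pass. The real FST implementation may differ.

```python
from pathlib import Path

NO_SCRUB_MARKER = ".eosnoscrub"


def scrub_filesystem(root, fsid, pattern=b"\xaa\x55" * 2048):
    """Return "skipped", "healthy" or "unhealthy" for the filesystem mounted at `root`.

    `pattern` is a hypothetical test pattern, not the one used by EOS.
    """
    root = Path(root)
    if (root / NO_SCRUB_MARKER).exists():
        return "skipped"  # marker file present, leave this filesystem alone

    write_once = root / f"scrub.write-once.{fsid}"
    re_write = root / f"scrub.re-write.{fsid}"
    try:
        if not write_once.exists():
            write_once.write_bytes(pattern)  # assumed: written once, then only read back
        re_write.write_bytes(pattern)        # assumed: rewritten on every pass
        ok = (write_once.read_bytes() == pattern
              and re_write.read_bytes() == pattern)
    except OSError:
        ok = False  # any read or write failure counts as an unhealthy filesystem
    return "healthy" if ok else "unhealthy"
```

In this sketch an "unhealthy" result corresponds to the point at which the FST would notify the MGM.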