EOS disk file repairs

EOS provides at least two major mechanisms for repairing disk replicas: the eos-ops-durability project and the fsck support built into the EOS client and server.

The eos-ops-durability project

The eos-ops-durability project is written in Python. The project has at least two active parts:

  • The part responsible for detecting faults in and repairing differing EOS disk replicas.
  • The part responsible for detecting drain failures.

The “differing EOS replicas” part is executed within Rundeck and the “drain failure” part is executed as a cron job on the EOS MGM.

Some of the functionality that was previously available only in the eos-ops-durability project is now implemented by the EOS support for fsck.

EOS client and server support for fsck

Each individual EOS FST always runs what is known as a disk scanner thread. This thread runs a scan of the local FST filesystem every four hours. During each scan, only files that have not been scanned in the last 7 days are checked. Scanning a file means reading its contents and recalculating its checksum. The checksum is compared to what the EOS namespace believes is the correct value, and the result of the comparison is stored in a local LevelDB database. The scanner code is also reused outside of the EOS FSTs in the form of a standalone program named eos-scan-fs.
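The per-file decision described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the FST scanner code: the function names are hypothetical, the checksum is assumed to be Adler-32 (the checksum type actually configured for a file may differ), and the LevelDB bookkeeping is left out.

```python
import time
import zlib

# Files scanned within the last 7 days are skipped (value taken from the text).
RESCAN_AFTER_SECONDS = 7 * 24 * 3600


def needs_scan(last_scan_time, now=None, rescan_after=RESCAN_AFTER_SECONDS):
    """Return True if the file was never scanned or its last scan is too old."""
    if last_scan_time is None:
        return True
    if now is None:
        now = time.time()
    return now - last_scan_time >= rescan_after


def file_checksum(path, chunk_size=1024 * 1024):
    """Read the file in chunks and return its Adler-32 checksum as 8 hex digits.

    Assumption: Adler-32 is used here only as an example checksum type.
    """
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def scan_file(path, expected_checksum, last_scan_time, now=None):
    """Return None if the file is skipped, else True/False for a checksum match.

    expected_checksum stands in for the value held by the EOS namespace.
    """
    if not needs_scan(last_scan_time, now):
        return None
    return file_checksum(path) == expected_checksum.lower()
```

A caller would record the returned result together with the scan time, so that the next four-hourly pass can skip the file until seven days have elapsed.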

Scanning the local filesystems of an FST can be heavy in terms of local disk I/O. The scanning of a filesystem can be turned off by setting the filesystem attribute scaninterval to 0, for example:

[itctabuild02] ~ > eos fs config 1 scaninterval=0
[itctabuild02] ~ > 
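When scanning has to be switched off for several filesystems at once, the same command can be generated from a script. The sketch below only builds and prints the command lines; the helper name and the list of filesystem IDs are hypothetical, and it assumes filesystem IDs are positive integers.

```python
import shlex


def scaninterval_command(fsid, interval=0):
    """Build the argument list for `eos fs config <fsid> scaninterval=<interval>`.

    An interval of 0 turns scanning off, as described in the text.
    """
    fsid = int(fsid)
    interval = int(interval)
    if fsid <= 0:
        raise ValueError("filesystem ID must be a positive integer")
    if interval < 0:
        raise ValueError("scaninterval must not be negative")
    return ["eos", "fs", "config", str(fsid), f"scaninterval={interval}"]


# Hypothetical filesystem IDs; replace with the IDs of the filesystems concerned.
for fsid in (1, 2, 3):
    print(shlex.join(scaninterval_command(fsid)))
```

Printing the commands first makes it easy to review them before running anything; the same argument lists could then be passed to subprocess.run on a host where the eos client is configured.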

The EOS MGM contains the following two threads for fsck support, both of which can be toggled on or off with the eos fsck config command:

  • The fsck collection thread.
  • The fsck repair thread.

The fsck collection thread collects the results of disk scans made by the FSTs. The fsck repair thread acts on the collected results by making repairs where they can be made.

The EOS FST file scrubber thread

The EOS FST file scrubber thread is briefly described here for completeness. The scrubber reads and writes the following two files at the root of the filesystem being tested:

  • scrub.write-once.FSID
  • scrub.re-write.FSID

If the file scrubber fails to either read or write these files, then the FST notifies the MGM that the filesystem is not healthy. The file scrubber thread can be made to ignore a filesystem by creating the following hidden file in the root of the local filesystem:

  • .eosnoscrub
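The scrubber behaviour described above can be modelled roughly as follows. This is a simplified sketch rather than the FST implementation: the function name and test payload are hypothetical, FSID is assumed to be the filesystem ID substituted into the file names, and the handling of the two files (create the write-once file only if it is missing, rewrite the re-write file on every pass) is inferred from their names.

```python
import os


def scrub_filesystem(root, fsid, payload=b"\xa5" * 4096):
    """Return None if scrubbing is skipped, True if healthy, False otherwise."""
    # The hidden marker file makes the scrubber ignore this filesystem.
    if os.path.exists(os.path.join(root, ".eosnoscrub")):
        return None

    write_once = os.path.join(root, f"scrub.write-once.{fsid}")
    re_write = os.path.join(root, f"scrub.re-write.{fsid}")
    try:
        # Assumption: the write-once file is created only when it is missing.
        if not os.path.exists(write_once):
            with open(write_once, "wb") as handle:
                handle.write(payload)
        # Assumption: the re-write file is rewritten on every pass.
        with open(re_write, "wb") as handle:
            handle.write(payload)
        # Read both files back and compare them with the expected payload.
        for path in (write_once, re_write):
            with open(path, "rb") as handle:
                if handle.read() != payload:
                    return False
    except OSError:
        # Any read or write failure marks the filesystem as not healthy.
        return False
    return True
```

In this model a False result corresponds to the case where the FST would notify the MGM that the filesystem is not healthy.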