Deprecated

This page is deprecated and may contain information that is no longer up to date.

CTA Tape Verify

The CTA Tape Verify framework monitors tapes in the CTA system and alerts in case of potential data loss/corruption.

The framework consists of a set of scripts (cta-verify-file, cta-verify-tape, cta-verification-feeder), a monitoring flume agent that parses the CTA logs and extracts the entries related to verification, and periodic rundeck jobs.

cta-verify-file

This command submits a verification request to the CTA frontend. A verification request is a retrieve request with the isVerifyOnly flag set to true. CTA handles these requests specially:

  • At queuing time, the mount policy attributed to the request is the one hard-coded in the /etc/cta/cta-frontend-xrootd.conf file (configuration option cta.verification.mount_policy). This is done so that verification jobs run with the lowest possible priority in the CTA scheduler. The verification mount policy should be associated with a dedicated verification virtual organization so that ongoing verification jobs can be tracked in cta-admin sq.

  • When a verification retrieve job is executed, the tape file is streamed to /dev/null.

  • When a recall tape session is finished, verified jobs and bytes are accounted separately from user or repack jobs in the "Tape session finished message".

The options for this command are:

  • request.user: The user submitting the request. This will be "verification"
  • request.group: The group of the user submitting the request. This will be "it"
  • instance: The disk instance of the retrieve request. This can be any disk instance since verification jobs don't write the file anywhere.
  • vid: Optional. The vid of the tape the file belongs to.

cta-verify-tape

The cta-verify-tape command submits a tape for verification by selecting some files from it and issuing a cta-verify-file command for each of them.

When a tape is submitted for verification, its verification_status is updated to {date: <current_date>, status: ongoing, files_submitted: <number of files>, files_verified: 0, files_failed: 0}.

The options for this command are:

  • vid: The vid of the tape to be verified.
  • read_time: The minimum amount of time it should take to read a tape (assuming a read speed of 300 MB/s).
  • data_size: The minimum amount of data that should be read from the tape. This should be just a number of bytes or a number followed by a unit (K, M, G, T, P, E)
  • first: The number of files that should be read from the beginning of the tape (fseq order)
  • last: The number of files that should be read from the end of the tape (fseq order)
  • random: The number of files that should be read from the middle of the tape
  • all: If present, all tape files will be verified
  • options: Options to pass to the cta-verify-file commands.

The script will choose jobs to satisfy all the flags. To satisfy the read_time and data_size options it will take jobs from the middle of the tape. If there are not enough files on the tape, it will just verify the whole tape.
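The selection rules above can be sketched as follows. This is a minimal illustration, not the actual cta-verify-tape code: the function names, the (fseq, size) input format, decimal units, and read_time being expressed in seconds are all assumptions made for the example.

```python
import random

# Assumption: decimal units. The real script may interpret K/M/G differently.
UNITS = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18}
READ_SPEED_BYTES_PER_S = 300 * 10**6  # the 300 MB/s quoted for read_time


def parse_size(text):
    """Turn a data_size value such as '500G' or '1024' into a byte count."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(text)


def select_files(files, first=0, last=0, n_random=0,
                 read_time=0, data_size=0, seed=None):
    """Hypothetical helper. files is a list of (fseq, size_bytes) tuples.

    Returns the sorted fseqs chosen for verification.
    """
    files = sorted(files)
    rng = random.Random(seed)
    chosen = {}
    for fseq, size in files[:first]:
        chosen[fseq] = size
    if last:  # guard: files[-0:] would select the whole tape
        for fseq, size in files[-last:]:
            chosen[fseq] = size
    # Everything not yet taken counts as the "middle" of the tape.
    middle = [f for f in files if f[0] not in chosen]
    rng.shuffle(middle)
    for fseq, size in middle[:n_random]:
        chosen[fseq] = size
    # Top up from the middle until both minimums (bytes and time) are met.
    target = max(data_size, read_time * READ_SPEED_BYTES_PER_S)
    for fseq, size in middle[n_random:]:
        if sum(chosen.values()) >= target:
            break
        chosen[fseq] = size
    return sorted(chosen)
```

If the tape holds less data than the requested minimum, the top-up loop simply runs out of files, which matches the "verify the whole tape" fallback described above.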

The cta-verify-tape command should be run from the CTA frontend. Its logs are written to /var/log/cta/verification/cta-verify-tape.log.

Verification monitoring

Verification job submission on a tape

File submission is logged by cta-frontend with MSG="Queued retrieve request" and verifyOnly="1":

May 11 14:57:36.567479 ctafrontend cta-frontend: LVL="INFO" PID="210" TID="800" MSG="Queued retrieve request" user="ctaadmin1@ctafrontend" fileId="10026" instanceName="ctaeos" diskFilePath="dummy" diskFileOwnerUid="0" diskFileGid="0" dstURL="file://dummy" errorReportURL="" creationHost="ctafrontend" creationTime="1652273856" creationUser="ctaeos" requesterName="verification" requesterGroup="it" criteriaArchiveFileId="10026" criteriaCreationTime="1652273277" criteriaDiskFileId="10016" criteriaDiskFileOwnerUid="0" criteriaDiskInstance="ctaeos" criteriaFileSize="407" reconciliationTime="1652273277" storageClass="ctaStorageClass" checksumType="ADLER32" checksumValue="20ce7e60" tapeTapefile0="(vid=V01007 fSeq=10006 blockId=100051 fileSize=407 copyNb=1 creationTime=1652273277)" selectedVid="V01007" verifyOnly="1" catalogueTime="0.004542" schedulerDbTime="0.008317" policyName="verification" policyMinAge="1" policyPriority="1" retrieveRequestId="RetrieveRequest-Frontend-ctafrontend-210-20220511-14:35:58-0-30326" 

Verification job finished

The end of a verification job is logged by cta-taped with MSG="File successfully read from tape" and isVerifyOnly="1":

Jun  9 15:12:31.923036 tpsrv01 cta-taped: LVL="INFO" PID="6494" TID="6528" MSG="File successfully read from tape" thread="TapeRead" tapeDrive="VDSTK11" tapeVid="V01007" mountId="37" vo="vo" mediaType="T10K500G" tapePool="ctasystest" logicalLibrary="VDSTK11" mountType="Retrieve" labelFormat="0000" vendor="vendor" capacityInBytes="500000000000" fileId="10066" BlockId="100051" fSeq="10006" dstURL="file://dummy" isRepack="0" isVerifyOnly="1" positionTime="0.007167" readWriteTime="0.004118" waitFreeMemoryTime="0.000008" waitReportingTime="0.000098" transferTime="0.004224" totalTime="0.005537" dataVolume="407" headerVolume="480" driveTransferSpeedMBps="0.160195" payloadTransferSpeedMBps="0.073506" LBPMode="LBP_Off" repackFilesCount="0" repackBytesCount="0" userFilesCount="0" userBytesCount="0" verifiedFilesCount="1" verifiedBytesCount="407" checksumType="ADLER32" checksumValue="20ce7e60" status="success" 

Verification job failed

A failed verification job is logged by cta-taped as an ERROR in the TapeRead thread, with MSG="Error reading a file in TapeReadFileTask" and isVerifyOnly="1":

[1626269255.260238000] Jul 14 15:27:35.260238 tpsrv017.cern.ch cta-taped: LVL="ERROR" PID="21203" TID="21550" MSG="Error reading a file in TapeReadFileTask" thread="TapeRead" tapeDrive="I4550832" tapeVid="I62382" mountId="14492" vo="IT-ST-TAB" mediaType="3592JD15T" tapePool="validation" logicalLibrary="IBM455" mountType="Retrieve" vendor="FUJIFILM" capacityInBytes="15000000000000" fileId="1339264257" BlockId="75" fSeq="2" dstURL="file://dummy" isRepack="0" isVerifyOnly="1" fileBlock="0" ErrorMessage="[ReadFile::position] - Reading HDR1: Failed ST read with crc32c in DriveGeneric::readExactBlock Errno=5: Input/output error"
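The three log messages above can be told apart with a small key="value" parser. The sketch below is only an illustration of the matching rules; it is not the flume agent's configuration, and the function name and the returned labels are hypothetical.

```python
import re

# CTA log lines carry their payload as key="value" pairs.
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')


def classify(line):
    """Return (event, fields) where event is 'submitted', 'verified',
    'failed', or None for lines unrelated to verification."""
    fields = dict(FIELD_RE.findall(line))
    msg = fields.get("MSG")
    # The frontend logs the flag as verifyOnly, cta-taped as isVerifyOnly.
    if msg == "Queued retrieve request" and fields.get("verifyOnly") == "1":
        return "submitted", fields
    if fields.get("isVerifyOnly") == "1":
        if msg == "File successfully read from tape":
            return "verified", fields
        if msg == "Error reading a file in TapeReadFileTask":
            return "failed", fields
    return None, fields
```

The tape concerned is available as selectedVid in the frontend message and as tapeVid in the cta-taped messages.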

cta-verification-feeder

This command is responsible for selecting which tapes are to be verified according to user supplied parameters. It is run periodically from rundeck and handles the following:

  • Check the tapes that have an ongoing verification status. If the verification has finished (cta-admin --json sq does not show a retrieve queue for the tape), change the verification status to finished.
  • Compare the number of ongoing verifications with a user supplied maximum. If it is less than the maximum, submit new tapes for verification until the limit is reached.

The feeder can use three parameters to choose which tape to submit for verification:

  • last_read: Choose the tape that has not been read in the longest time.
  • last_write: Choose the tape that has not been written in the longest time.
  • last_verified: Choose the tape that has not been verified in the longest time.

The options for the cta-verification-feeder are as follows:

  • filter: Filter out tapes matching the states passed (multiple states should be separated by ,).
  • tapepool: Only verify tapes in the specified tapepool (multiple tapepools should be separated by ',')
  • min_data_on_tape: Minimum number of bytes on tape for it to be verified
  • min_relative_capacity: Minimum relative filled capacity of tape for it to be verified
  • verify_path: Location of cta-verify-tape executable.
  • verify_policy: One of last_read, last_written, last_verified.
  • full_tapes: Verify only tapes that are full.
  • minage: Verify only tapes which have not been verified for so many days.
  • maxverify: Maximum number of verify processes to run concurrently.
  • verify_options: Options to pass to cta-verify-tape.
  • logfile: Log file path.
  • list_current_verifications: Just list current ongoing verifications and exit.
  • noaction: Dry run, do not submit the tapes for verification.

The feeder tries to satisfy all the options passed. If the options passed cause no tape to be eligible for verification, the script aborts with an exit code of 1.
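The feeder's core decision (order the eligible tapes by the chosen policy, then fill the free verification slots) could look roughly like this. It is a sketch under assumptions: tapes are represented as dictionaries with hypothetical keys holding epoch timestamps, only the minage filter is shown, and the real script's data model is not described in this page.

```python
import time

DAY = 86400  # seconds


def pick_tapes(tapes, policy, ongoing, maxverify, minage_days=0, now=None):
    """Hypothetical helper returning the vids to submit in this run.

    tapes: list of dicts with a 'vid' key plus one timestamp key per
           policy name (epoch seconds, None or 0 meaning "never").
    policy: name of the timestamp key to sort by, e.g. 'last_read'.
    """
    now = time.time() if now is None else now
    slots = maxverify - ongoing
    if slots <= 0:
        return []  # the maximum number of concurrent verifications is reached
    cutoff = now - minage_days * DAY
    eligible = [t for t in tapes if (t.get("last_verified") or 0) <= cutoff]
    if not eligible:
        # Mirrors the documented behaviour: no eligible tape, exit code 1.
        raise SystemExit(1)
    # Oldest timestamp first: the tape not touched for the longest time wins.
    eligible.sort(key=lambda t: t.get(policy) or 0)
    return [t["vid"] for t in eligible[:slots]]
```

Each vid returned would then be handed to cta-verify-tape, unless noaction is set.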

The feeder calls the cta-verify-tape script for each tape it decides to verify.

The log file for the feeder is in /var/log/cta/verification/cta-verification-feeder.log.

Listing current verifications

Current verifications can be listed with cta-verification-feeder --list_current_verifications, or with cta-admin --json sq by checking which retrieve queues have jobs queued with the verification mount policy.

Canceling verification

Like normal retrieve jobs, verification jobs cannot be canceled once they have been submitted. You can, however, manually change the verification_status of the tape with cta-admin tape ch --verificationstatus. Note that you must overwrite the whole JSON value. Changing the status from ongoing to done or cancelled makes the verification framework stop listing the tape as being verified.
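Because the whole JSON value has to be supplied, it can help to generate it rather than type it by hand. The sketch below builds a value with the fields listed for verification_status earlier in this page and prints the command instead of running it. The ISO date format, the helper names, and the use of --vid to select the tape are assumptions; check them against your cta-admin version before use.

```python
import json
import shlex
from datetime import date


def status_json(status, files_submitted, files_verified=0, files_failed=0,
                today=None):
    """Build a complete verification_status value (date format is assumed)."""
    return json.dumps({
        "date": (today or date.today()).isoformat(),
        "status": status,
        "files_submitted": files_submitted,
        "files_verified": files_verified,
        "files_failed": files_failed,
    })


def change_status_command(vid, payload):
    """Return the shell command as a string; --vid is an assumed flag."""
    return "cta-admin tape ch --vid {} --verificationstatus {}".format(
        shlex.quote(vid), shlex.quote(payload))


if __name__ == "__main__":
    print(change_status_command("V01007", status_json("cancelled", 25)))
```

Quoting the payload with shlex.quote keeps the embedded double quotes and spaces intact when the command is pasted into a shell.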

Monitoring

Tape verification monitoring is in this dashboard.

Automation

Automation is done through a rundeck job that runs every four hours.