SpectraLogic Libraries

SpectraLogic said (27 May 2020):

The IBM tape drive enforces a 12-minute timeout on all commands sent to the Media Changer Device that are executed by the library. Because of this restriction in the IBM tape drive firmware, we believe that CERN should either implement the Asynchronous Move commands (below), or adopt the implementation that RAL is using, which would require a RIM as the exporter.

During a test phase of our library firmware, the library encountered a problem that required a long error recovery algorithm to take place. While the library was recovering from the error, the moves that we had queued on multiple drives failed with a Sense Key of 0Bh, and ASC/ASCQ of 08h/01h.

During our investigation as to why these moves failed with this sense data, we discovered in the IBM ADI Implementation Reference that the IBM drive will only wait for 12 minutes for a response to a command that it sends over the automation interface to the library. Here is the relevant section from the IBM ADI Implementation Reference that describes the timeout, and the reason that IBM has chosen to enforce this restriction in their drive firmware implementation.

3.4 SMC Internal Command Timeout

When the bridging manager forwards an SMC command to the remote SMC Device Server, the bridging manager will automatically initiate a 12-minute timer. It is expected that all forwarded commands complete before this time. Therefore, if the remote SMC Device Server does not respond within this time, the drive will internally abort the command and issue an Abort Task TM request to the remote SMC Device Server with the Tag of Task pertaining to the forwarded command. The local SMC Device Manager will terminate the command with Check Condition and report Sense Key 0Bh ASC/ASCQ 0801h Logical Unit Communication Time-Out to the initiator having issued the command.

This behavior is done to avoid certain scenarios in which a host times out and resets the interface. This resetting event has been demonstrated to also impact any activity on the RMC (host data) interface.

If the IBM tape drive is configured to export a Media Changer Device on the fiber bus, it forwards the commands that it receives at the Media Changer Device LUN to the library over the ADI interface. It then waits for a response from the library so that it can return the result of the library's command execution to the host that initiated the command. If 12 minutes pass without a response from the library, the tape drive will attempt to abort the command it sent to the library, and fail the command that it received from the host with a Sense Key of 0Bh, and ASC/ASCQ of 08h/01h.

Our library firmware was architected to try and successfully complete all the move requests that it receives from the host computer system. It has extensive error recovery routines that recalibrate entities in the library in the event that an attempt to access one of the entities fails, rather than just fail the move request back to the host when the access fails. For instance, if a move from a storage element to a data transfer element (slot to drive) is requested, and the robot attempts to place the media in the drive but the tape cannot be pushed far enough into the drive to ensure that the drive can load the tape into its deck, then the software will recalibrate the drive mouth and reattempt to place the media in the drive. If this continues to fail, then the software will attempt to have the other robot in the TFinity library attempt the operation, rather than fail the move request back to the host. Sometimes, these recoveries can take minutes to perform.

While, as far as I am aware, no customer library has run into this error, when we analyzed the cause of the failed moves on our local test system we ran the thought experiment of how this timeout might affect customers in the field. Most of our customers, like RAL with RIMs, do not run multiple drives as Media Changer Exporters. Rather, they run a single drive exporting a Media Changer Device, and maybe one more drive also configured to export a Media Changer Device as a failover mechanism. In this scenario, the library can only accept one move at a time, because the Media Changer Device is a sequential access device and cannot accept a new move until the prior move has been completed. So, if an extended recovery situation does occur on a system that is configured in this way, there are no other moves waiting to gain access to the robot (and consuming large amounts of their 12-minute wait period) while the recovery is happening. For the customers that will have multiple drives exporting the Media Changer Device LUN, we theorized that the probability of encountering this timeout failure increases with the number of outstanding moves that are waiting for the robot.

To spare your software from handling a failed move command, and from potentially taking a functional tape drive offline because of it, we proposed this new move paradigm, which is meant to work around the 12-minute timeout enforced by the IBM tape drive firmware. Rather than waiting for the response to the move command issued to the library, and potentially encountering the 12-minute timeout if the library goes into an extended recovery scenario, the host issues an Asynchronous Move, which returns immediately with a token that can be used to query when the move is done. Each query can block in the tape drive for up to 255 seconds. If the move completes within the time period specified in the query command, then the Media Changer Device responds immediately upon completion of the move with the move's status. If the move has not completed within that period, then sense data indicating that the move is still in progress is returned, telling the host software that it needs to query the Media Changer Device again for the status of the move. In this way, a move can take longer than the 12 minutes enforced by the IBM tape drive without causing a failure. This also allows 100+ drives to be configured in the TFinity if needed without ever hitting the 12-minute timeout.

We believe that, as CERN adds more drives to the library, one of these failures eventually becomes statistically inevitable. We think that CERN should either implement the asynchronous moves as described, or fall back to the code that you had working before with the Spectra Logic library, which does not suffer from the 12-minute internal timeout that is coded into the IBM drive firmware.

Asynchronous Moves

The IBM tape drive firmware starts a 12-minute timer whenever it receives a media changer command from a SCSI initiator that it passes along to the tape library it is connected to over its iADT interface. If the IBM tape drive does not receive a response from the tape library within the 12-minute window, it will attempt to abort the command it sent to the tape library over the iADT interface, and fail the command to the host with a Check Condition whose sense data carries a sense key of 0Bh and an ASC/ASCQ of 08h/01h. The IBM tape drive does this in its firmware in order to avoid taking longer than the 15 minutes that most host systems allow for SCSI commands; exceeding that limit generally results in a SCSI bus reset.
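As an illustration of how host software might recognise this specific failure, the following minimal sketch decodes a raw SCSI sense buffer and checks for sense key 0Bh with ASC/ASCQ 08h/01h. The function names `parse_sense` and `is_bridge_timeout` are hypothetical and not part of any existing tool; the byte offsets follow the standard fixed-format (70h/71h) and descriptor-format (72h/73h) sense data layouts.

```python
from typing import Optional, Tuple

# Sense key 0Bh (ABORTED COMMAND), ASC/ASCQ 08h/01h
# (Logical Unit Communication Time-Out), as quoted in the text.
BRIDGE_TIMEOUT = (0x0B, 0x08, 0x01)


def parse_sense(sense: bytes) -> Optional[Tuple[int, int, int]]:
    """Return (sense_key, asc, ascq), or None if the buffer cannot be decoded.

    Hypothetical helper: handles the two standard SCSI sense data layouts.
    """
    if not sense:
        return None
    response_code = sense[0] & 0x7F
    if response_code in (0x70, 0x71):  # fixed format
        if len(sense) < 14:
            return None
        return sense[2] & 0x0F, sense[12], sense[13]
    if response_code in (0x72, 0x73):  # descriptor format
        if len(sense) < 4:
            return None
        return sense[1] & 0x0F, sense[2], sense[3]
    return None


def is_bridge_timeout(sense: bytes) -> bool:
    """True if the sense data reports the drive's internal 12-minute timeout."""
    return parse_sense(sense) == BRIDGE_TIMEOUT
```

A caller could use such a check to tell this timeout apart from other move failures before deciding how to treat the drive.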

In the rare event that the tape library takes longer than 12 minutes to complete a move, whether because of a robotics issue or because a growing number of tape drives creates a large queue of moves, we’ve designed and implemented a new move command paradigm that makes the issuing of moves immune to this 12-minute timeout window, and therefore creates a more robust system for you to work with.
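The effect of a long queue can be made concrete with a deliberately simplified model. It assumes, purely for illustration, that all moves are issued at the same moment, that a single robot services them one at a time in order, and that every move takes the same time once the robot is free. The function name and the example durations are hypothetical; real libraries (for example a TFinity with two robots) will behave differently.

```python
DRIVE_TIMEOUT_S = 12 * 60  # the 12-minute timer described in the text


def timed_out_moves(queued: int, recovery_s: float, move_s: float) -> int:
    """Count queued moves that would exceed the drive's 12-minute timer.

    Simplified model (an assumption, not vendor behaviour): all moves are
    issued at t=0, the robot is busy with error recovery for recovery_s
    seconds, then completes one move every move_s seconds in queue order.
    """
    count = 0
    for position in range(1, queued + 1):
        completion_s = recovery_s + position * move_s
        if completion_s > DRIVE_TIMEOUT_S:
            count += 1
    return count
```

With a hypothetical 5-minute recovery and 60-second moves, `timed_out_moves(1, 300, 60)` gives 0 while `timed_out_moves(10, 300, 60)` gives 3: the single move finishes well inside the window, but the last three of ten queued moves do not.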

Asynchronous Moves allow the host to issue a move to the library and immediately receive a response that tells it either that the move failed for some reason (for instance, the destination element in the move is already filled with media), or that the move was accepted and is being worked on. The host can then repeatedly query the library as to whether the move is complete, and so avoid the 12-minute timeout that the drive imposes.
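The host-side control flow described above can be sketched as follows. This is only an outline under stated assumptions: `FakeChanger`, `start_move`, `query_move`, and the status strings are hypothetical stand-ins for the real SCSI commands, which are not shown here, and the fake device simply reports completion after a fixed number of queries. The 255-second bound on a single query is the figure given in the text.

```python
import itertools
from dataclasses import dataclass

MAX_QUERY_WAIT_S = 255  # longest time one query may block, per the text


class MoveRejected(Exception):
    """The library refused the move immediately (e.g. destination full)."""


class MoveFailed(Exception):
    """The library accepted the move but later reported a failure."""


@dataclass
class FakeChanger:
    """Hypothetical stand-in for the Media Changer Device, for illustration."""
    queries_until_done: int = 3
    occupied: frozenset = frozenset()

    def __post_init__(self):
        self._tokens = itertools.count(1)
        self._remaining = {}

    def start_move(self, source: int, destination: int) -> int:
        # Immediate rejection, or acceptance with a token for later queries.
        if destination in self.occupied:
            raise MoveRejected(f"element {destination} already holds media")
        token = next(self._tokens)
        self._remaining[token] = self.queries_until_done
        return token

    def query_move(self, token: int, wait_s: int) -> str:
        if not 0 <= wait_s <= MAX_QUERY_WAIT_S:
            raise ValueError("wait_s must be between 0 and 255 seconds")
        self._remaining[token] -= 1
        return "done" if self._remaining[token] <= 0 else "in_progress"


def move_and_wait(changer, source: int, destination: int,
                  max_queries: int = 100) -> int:
    """Issue an asynchronous move and poll until it finishes.

    Returns the number of queries that were needed.
    """
    token = changer.start_move(source, destination)
    for attempt in range(1, max_queries + 1):
        status = changer.query_move(token, MAX_QUERY_WAIT_S)
        if status == "done":
            return attempt
        if status != "in_progress":
            raise MoveFailed(f"move {token} ended with status {status!r}")
    raise TimeoutError(f"move {token} still in progress after {max_queries} queries")
```

Because each individual query returns within 255 seconds, no single command sent through the drive approaches the 12-minute timer, even when the move as a whole takes longer. The `max_queries` cap is a design choice of this sketch so that a caller is never left polling indefinitely.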