Published by USENIX in the Proceedings of Workshop on UNIX and Supercomputers, Pittsburgh, PA, 26-27 September, 1988, pages 183-214.
Please read the PostScript version; this page exists only to provide the text to search engines.
.\" groff -X -te -ms paper .\" groff -te -ms paper | print-postscript .RP .\" @(#)$Header: paper,v 1.10 88/08/26 12:07:36 mike Exp $ (BRL) .TL BUMP .br The BRL/USNA Migration Project .AU Michael John Muuss, BRL .AU Terry Slattery, USNA .AU Donald F. Merritt, BRL .AI The US Army Ballistic Research Laboratory .br The US Naval Academy .AB On .UX systems with many users, or systems that run very large problems, disk space management can be particularly difficult. Space management has generally been accomplished by scripts and programs for determining ``disk hogs''. Users have been expected to explicitly move their working files to some offline storage media, often using manual procedures and record keeping. In addition, .UX programs attempting to write to a full file system get an ENOSPC write error: ``No space on device''. Frequently, this behavior is not acceptable, especially where programs may execute for a long time. .PP This paper reports on the implementation of a solution to both these aspects of the disk space management problem, by providing a transparent file migration facility. The result of this software is .UX filesystems that have the appearance of significantly more capacity than the underlying disk drives, freeing the user community from worrying about managing offline storage media. .PP The system administrator or a \fBcron\fR script may run a utility to cause certain files to be migrated to one of several levels of offline storage. The inode for each migrated file remains present in the filesystem, with special ``file handle'' data used to recover the file on subsequent access. When a migrated file is opened, the kernel will block the user process, wait for a special user-mode migration daemon to recover the file from backing storage, and then allow the user process to continue. This mechanism is entirely transparent to the user, except for the delay.
When a process attempts to write onto a full file system, the system can be configured to block the process, start a ``space management'' migration function to create file system space, then resume the blocked process. .PP The details of the kernel modifications, support daemons, and related software necessary to provide fully transparent file migration will be presented. In addition, the software described is Public Domain, Distribution Unlimited. Several vendors have already expressed plans to incorporate it into their products. .AE .NH 1 BACKGROUND .PP Computer systems running .UX range from personal computers to supercomputers. On systems with many users, or systems that run very large problems, disk space management can be particularly difficult. Historically, .UX space management has been accomplished by scripts and programs for determining ``disk hogs'' and mechanisms for users to explicitly move their files to some offline storage media (magnetic tape, removable disks, etc.), often using manual procedures and manual record keeping. .PP These traditional techniques have suffered from a number of serious pitfalls, some of which are not at all obvious. The most commonly heard complaint is that users are unclear about how to recover files that have been moved offline, which tape their files are on, how long the tape copies of relocated files will be kept, etc. Vigorous system administrators are often tempted to sweep their filesystems for files that have not been accessed recently, and then move these files to tape. Users are typically notified of this preemptive relocation of their files with a brief note, and typically no serious harm is done. However, if users have been working on a large software project for a long time, and they are using \fBmake\fR to generate their binary files, the filesystem sweep may determine that many source files are not being used, because they have not even been accessed for a long time.
If those files are preemptively moved offline, \fBmake\fR will explode with error messages the next time it is run, because \fBmake\fR uses \fBstat\fR(2) to build the dependency tree. This is but one example of files that have not been ``accessed'' in a long time that are still being actively ``used''. .PP .UX programs attempting to write to a full file system get an ENOSPC write error: ``No space on device''. Frequently, this behavior is not acceptable, especially in the supercomputer realm where programs may execute for a long time, after which there may not be enough space to write all the results to disk. Generally, it seems that most users would prefer that the system simply waited until additional space became available, perhaps to the accompaniment of warning messages, rather than aborting work in progress. If the out of space condition is recognized by the users, users can often open another window, or use another terminal, or influence a conveniently located colleague, to free up some additional space. It is most unfortunate that work in the process of being written to disk has always been lost in these circumstances. .PP This paper describes the implementation of a solution to both aspects of the disk space management problem. The software provides transparent file migration and archiving without requiring major changes to the .UX kernel or filesystem. One result is .UX filesystems that can have the appearance of significantly more capacity than the underlying disk drives, freeing the user community from worrying about managing offline storage media. .NH 1 GOALS .PP There were two primary goals of this project: First, to create a file migration system for .UX that provides filesystems that give the appearance of having significantly more online storage than the actual device that contains the filesystem.
Second, all unmodified .UX programs that do not examine the raw filesystem must not be able to detect any difference between regular files, and migrated files, except for possible delay in completing an open() operation on a migrated file. .PP Secondary goals were to achieve these features: .IP 1) Have separation between migration \fIpolicy\fR, and migration \fImechanism\fR, so sites that may need to modify the policy will not have to alter the migration mechanism. .IP 2) Make minimal modifications to the .UX kernel. The basic kernel routines should be installed like a device driver, and all interface ``hooks'' should be short, and conditionally compiled via #ifdef MIGRATION. Most of the essential functionality should be located in user-mode code. .IP 3) No changes to the size or structure of the on-disk inodes. This permits this software to be installed on machines with existing filesystems, without needing to dump and reload all user files. This also minimizes changes to existing kernel code, dump/restore code, and standalone utilities. .IP 4) Provide support for an arbitrary number and variety of secondary storage devices and recording methods. By using a highly structured and modular interface to the software that handles the secondary storage devices, recipients of this software will be able to easily support additional hardware, and adapt to novel devices, without having to change the fundamental mechanism of this system. .IP 5) Provide extremely robust operation in the face of both system crashes and heavy system use. In essence, this software says ``trust me with your files;'' that trust must not be violated. Filesystem reliability and availability should be comparable to that of a .UX system that is not running this software. 
.IP 6) To have the capability of having multiple copies of migrated files located on secondary storage, for reliability. .IP 7) To have the capability for leaving a copy of a migrated file online, so that either rapid in-migration or rapid space reclamation can be accomplished, depending on which resource is required first, access or storage. .IP 8) To provide support for several types of secondary storage, and for staging files from one form of secondary storage to another. For example, small files might initially be migrated to some type of robotic mass storage, while larger files might go directly to operator mounted magnetic tape. Files on the robotic mass storage might be staged out to magnetic tape if they are not accessed within a few weeks. .LP Some consciously chosen limitations to this system are: .IP 1) Only regular files can be migrated. It is not possible to migrate directories or special files. .IP 2) To provide migration service only to the machine hosting the disk system. This software is not attempting to provide a CTSS-style ``common filesystem'' across multiple machines. .IP 3) There is no relief for the problem of creating files that are larger than the online capacity of a single filesystem. .NH 1 OVERVIEW .PP BUMP is a collection of user level tools, supported by a small set of kernel modifications, to provide the user and system administrator with facilities that allow files to be migrated to backing storage and then transparently restored when accessed. These tools allow the user or system administrator to identify files to migrate, force these files into a ``pre-migrated'' state, copy the pre-migrated files to backing storage, and release the disk storage associated with these files. A specially modified version of the standard \fBls\fR(1) program will defeat the transparency feature of the migration system, to allow the user to identify files that have been migrated.
Additional tools allow the user to recover files in the background for future use, and determine the amount of space taken by files in the migration system. The system administrator has tools available to coalesce sparsely populated migration volumes and to move migrated files between different levels in the storage hierarchy. .\" .KF .\" .PSPIC fig1.ps .\" .KE .PP To migrate a set of files, the names of the selected files are collected in a migration-list file (see Figure 1, "Functional Diagram"), with an optional hint about future usage and optional comment regarding the reason for migration (e.g. sysadmin forced, etc). All files on this list are migrated to a special ``migrate'' directory that exists for each filesystem upon which BUMP has been configured to run. Files that have been migrated to the on-disk directory are in a ``pre-migrated'' state in which the online disk storage has not been released, but the original inode has been changed to mode IFMIG and an entry in the file database has been created. The archiver utility is run to make at least one copy (and typically two copies) of each pre-migrated file onto backing storage media. After the copy has been made, the file is marked as ``dual-migrated'', because both the online copy and the secondary storage copies exist. From the dual-migrated state, the file can be instantly returned to full online status if the migrated inode is opened, or the on-disk pre-migrated copy may be unlinked to free disk storage if space runs low. .PP When a migrated file is accessed, either by an attempt to \fBopen\fR(2) it, or by an attempt to \fBexec\fR(2) it, the kernel recognizes the migrated inode type, blocks the process attempting the action, and sends a message to a user level daemon requesting that the affected file be reloaded. 
The daemon runs the archiver utility to copy the file back into the ``migrate'' directory on the filesystem where the reload is to occur, and once restored to the dual-migrated state, ``unmigrates'' it. The kernel then learns of the reload completion from the daemon and allows the waiting process to continue. .PP Free disk space management is automatically provided by a system-wide ``low space'' disk usage threshold. When the threshold is crossed the kernel notifies the daemon to begin reclaiming free space. The kernel also informs the daemon when file space is completely exhausted, blocking processes that attempt to write on that file system until the daemon creates free space and sends a reply back to the kernel. .PP The system's operation is easily tailored for specific sites, both in terms of the selection policy for the files to migrate and the type of hardware used for archiving. A hierarchy of storage levels is supported for sites with more than one type of archival media. Policy decisions about which and how many files to migrate are easily adjustable by the system administrators, making it easy to adapt to varying requirements at different sites. .NH 1 FILE FINDING TOOLS .PP The first step in the migration process is to identify candidate files to migrate, perhaps using a policy in which files with the largest size or size*age product are selected first. Because selecting files to migrate is independent of the actual archiving mechanism, each site may implement its own selection policy. A combination of the \fBmigsweep\fR and .UX \fBfind\fR(1) utilities can aid system administrators in selecting files based on either the size*age product or some other policy. The only restriction on which files may be migrated is that they be regular files (i.e. type IFREG).
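.PP As an illustration of one possible selection policy, the following C sketch ranks candidate files by their size*age product, largest first. It is not the \fBmigsweep\fR implementation: the structure ``candidate'', its field names, the function names, and the choice of days as the age unit are all assumptions made for this example.

```c
#include <stdlib.h>

/* Hypothetical candidate record; not a BUMP data structure. */
struct candidate {
    const char *path;
    long long   size_bytes;   /* file size */
    long long   idle_days;    /* days since last access (assumed unit) */
};

/* Policy score: a larger value means "migrate sooner". */
static long long score(const struct candidate *c)
{
    return c->size_bytes * c->idle_days;
}

/* qsort() comparator: sort by descending score. */
static int by_score_desc(const void *a, const void *b)
{
    long long sa = score(a);
    long long sb = score(b);

    return (sa < sb) - (sa > sb);
}

/* Order the candidate list so the best migration targets come first. */
static void rank_candidates(struct candidate *v, size_t n)
{
    qsort(v, n, sizeof v[0], by_score_desc);
}
```

A site preferring a different policy would replace only score(), which mirrors the separation of policy from mechanism that the paper calls for.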
.PP Additionally, the \fBmigsweep\fR tool will permit each user to designate (in a ``.precious'' file) the names of files that should never be migrated, up to a certain amount of ``permanent'' disk storage allocated for that user. In this way, a user can be certain that the ``.profile'' file and other small files that are very frequently used will not be migrated. Otherwise, getting logged in could become very time consuming! .PP Users will be provided with another tool to voluntarily migrate files that they know will not be needed for some period of time. The migrate list produced by these tools may include optional fields for a comment and hints about possible future reload times to be used by the archiving mechanism in a multi-level storage hierarchy. .NH 1 OUT MIGRATION .PP The migration file list, kept in a disk file to survive system crashes and reboots, is read by the \fBmigout\fR tool, which in turn migrates each file to a special per-filesystem ``migration'' directory. See Figure 2, "Out Migration (migout)". In this process, migout creates a database entry for each migrated file, allocates the pre-migrated inode in the migration directory, and calls the mig_makemigrated() system call. mig_makemigrated() moves the block pointers from the original file's inode to the pre-migrated inode in the migration directory, zeros the original inode block pointers, stores the file's ``handle'' in the now empty block pointer area, and changes the original file's type to IFMIG. Files in this state are termed ``pre-migrated'' because they have been migrated from the normal .UX filesystem into the special on-disk migration directory, but have not been copied to any other storage media. This operation does not free any storage on the affected filesystem, and in fact, uses another inode for each pre-migrated file.
However, it is an atomic operation that results in the file's data blocks being allocated to an inode in a protected directory where advisory file locking may be used to guarantee the integrity of files during archiving. .PP The ``file handle'' is a unique sequence number given to each file in the system that allows easy location of all copies of a migrated file, regardless of the storage methods used. File handles are used as the index into the file database to find all occurrences of a migrated file in the archiving system. This indirection is needed to allow files to be moved from one volume to another or from one storage level to another, without having to hunt down and modify the migrated inode to reflect a change. A file handle is composed of a 32-bit source host identifier and a 32-bit file identifier. Including the source host identifier in the migrated file handle prevents problems from arising when filesystems are accidentally used on the wrong computer system. This could easily happen when removable-media filesystems are present in a site, or it could also happen if a set of dump tapes were reloaded onto a machine other than the originating machine. If the file handle consisted of only a sequence number, and a filesystem from another system was mounted, this would have two undesirable consequences. First, it would grant the requesting user access to protected files owned by other users, and second, having reloaded the migrated files, it would remove the copy of the file from secondary storage, preventing the legitimate owner of the file from reloading it later. .NH 1 ARCHIVING (STAGING) .PP Once the file has been pre-migrated, it must be copied onto at least one backing storage method in order to free the on-line disk blocks. See Figure 3, "Migration Archiving (migarch)."
The archiving (or staging) process is provided by \fBmigarch\fR, a utility that copies files from one storage method to another, including both to and from the on-disk migration directory where pre-migrated files reside. See Figure 4, "Migration Archiving, FROM disk", and Figure 5, "Migration Archiving, TO disk". Migarch will typically be instructed to create at least two copies of each migrated file to facilitate recovery of data written to media that may be subsequently damaged. .PP A storage method is defined as a type of media (e.g. tape) and a recording format (e.g. ANSI labels). The input to migarch is a list containing the filehandle, destination method, and number of copies for each file to be copied. The file database is searched to find all possible sources of this file, the ``closest'' copy is determined, an operator request issued for the source and destination volumes to be mounted, the data copied, and the file database updated. This process is repeated for each copy of each file. Destination tape volumes always start out being empty. Migarch is used to coalesce partial volumes, copying files from several partial volumes onto one empty volume. By always writing onto a previously empty tape volume, the system helps guarantee that an existing volume will never be corrupted by unintentional overwriting (such as by power loss during a write operation). .PP Migarch also allows groups of files to be copied to a single set of volumes, with the selection based on a combination of database parameters such as owner, group, or size. In this case, the input list is built by searching the file database (or the filesystem) for all files matching the desired criteria. One possible use of this feature would be to cause each volume to contain files owned by a single user, for security reasons. .PP Pre-migrated files that have been archived to backing storage may either remain in the file system or may be removed to free the associated disk space. 
If they remain in the filesystem, they are called ``dual-migrated'' files. In this state, one of two operations may occur: 1) a reload request arrives, causing the migin tool to quickly convert the file back to its normal state (see below), or 2) the filesystem runs short of space and the online copy of the dual-migrated file is unlinked to reclaim the disk space it consumed. .NH 1 MODIFICATIONS TO THE .UX FILESYSTEM .PP In order to implement migrated files, it was necessary to have some way to distinguish between regular files and migrated files. The most obvious way to achieve this would have been to use another bit in the inode that indicates that the inode is migrated. While the Berkeley 4.2 BSD and 4.3 BSD filesystems have additional space in the on-disk inodes in which such a bit could be placed, the current System V filesystems do not have any additional space. However, in both kinds of .UX systems there are several unused combinations of the IFMT file type bits in the i_mode inode field. Therefore, a single one of these unused combinations is given the symbolic name IFMIG, and is used to mark inodes that are migrated regular files. It is this lack of an additional bit that prevents migrating directory inodes and ``special file'' inodes in addition to regular files. .PP When an inode is of type IFMIG, representing a migrated file, it is necessary to store the eight byte ``file handle'' in the on-disk inode. In the Berkeley 4.2 BSD and 4.3 BSD filesystems there are 16 bytes marked ``reserved, currently unused'' in the ic_spare[] field, so on the Berkeley implementation the first eight bytes out of the 16 spare bytes are used to store the file handle. The System V on-disk inode has no unused space, so eight bytes of the disk block number array are reused (overloaded) to store the file handle information when the inode is of type IFMIG. This location is referred to as i_fhandle.
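.PP The inode marking and the eight byte handle can be sketched in C as follows. This is an illustration under stated assumptions, not the BUMP source: the numeric value chosen for X_IFMIG, the field names fh_host and fh_id, and all function names are hypothetical. Only the 32-bit host identifier plus 32-bit file identifier layout and the use of an otherwise unused IFMT combination come from the text.

```c
#include <stdint.h>
#include <string.h>

#define X_IFMT   0170000   /* file type mask, as in traditional UNIX */
#define X_IFREG  0100000   /* regular file */
#define X_IFMIG  0110000   /* ASSUMED value for the unused type combination */

/* Eight byte file handle: 32-bit source host plus 32-bit file identifier.
 * The field names are assumptions for this sketch. */
struct fhandle {
    uint32_t fh_host;
    uint32_t fh_id;
};

/* True when the inode mode marks a migrated regular file. */
static int is_migrated(unsigned mode)
{
    return (mode & X_IFMT) == X_IFMIG;
}

/* Store the handle in the first eight bytes of a spare inode area. */
static void fh_store(unsigned char area[8], const struct fhandle *fh)
{
    memcpy(area, &fh->fh_host, 4);
    memcpy(area + 4, &fh->fh_id, 4);
}

/* Recover the handle from the first eight bytes of the spare area. */
static void fh_load(const unsigned char area[8], struct fhandle *fh)
{
    memcpy(&fh->fh_host, area, 4);
    memcpy(&fh->fh_id, area + 4, 4);
}

/* Refuse handles that were created on another host, so that a foreign
 * filesystem cannot be used to reload someone else's files. */
static int fh_is_local(const struct fhandle *fh, uint32_t this_host)
{
    return fh->fh_host == this_host;
}
```

On a Berkeley-style inode the eight bytes would be the start of the 16 spare bytes; on System V they would overlay part of the block number array, which is safe only because a migrated inode has no block pointers of its own.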
.PP There are a number of cases where additional features could have been provided if space existed to store the file handle in all inodes, such as the Berkeley inode format allows. For example, this would have allowed files to have been reloaded for reading only, and then subsequently deleted from the disk, without having to write a new copy to secondary storage. This would also have allowed the concept of a ``dual-migrated'' file to have been handled in a somewhat simpler manner. A future effort will be to modify the System V filesystem to have the larger inode space required, and then to provide these additional features. .PP As a result of these file system modifications, it is necessary to update all system utilities that read the raw filesystem. Most notable among these are \fBfsck\fR, \fBdump\fR, and \fBrestore\fR. .NH 1 THE DEFINITION OF KERNEL OPERATIONS AND THE MESSAGE PROTOCOL .PP This section describes the communications protocol used between the .UX kernel mig.c module, and the user-mode migration daemon. For illustrative purposes, interfaces to existing kernel code are drawn from the work done interfacing BUMP to Cray UNICOS 3.0.10 on an XMP. Only the additions are shown, to prevent disclosing any proprietary software. Similar additions exist for 4.2 BSD and 4.3 BSD kernels. .PP There are two conceptual ``layers'' to the protocol that is used between the kernel and the migration daemon. The lower level is the basic message-passing mechanism by which the kernel and the user mode daemon exchange chunks of data (messages). The upper level is the definition of the structure of the messages, the meaning of the various message types, and the nature of any expected actions or responses. .NH 2 THE MESSAGE PASSING MECHANISM .PP To establish the communication path, the user mode daemon must \fBopen\fR(2) the special kernel interface device, /dev/mig0. /dev/mig0 will ordinarily be owned by root, and mode 0600, to protect the interface from unauthorized use. 
.PP When the daemon opens the /dev/mig0 interface, it receives a normal .UX file descriptor as the return value from the \fBopen\fR() sys-call. This file descriptor is used for all subsequent communications. The /dev/mig0 ``driver'' code has special checks to ensure exclusive use of the interface, i.e., only one user mode process may have the Kernel/Daemon message interface open at any one time. .PP Once the daemon has the interface open, communication between the user mode daemon and the kernel is via normal .UX \fBread\fR() and \fBwrite\fR() system calls. .PP The daemon's .UX \fBwrite\fR() system call is vectored through the cdevsw table to the kernel routine migwrite(). The daemon may send a message to the kernel at any time. The driver code has been arranged so the kernel always has one local message buffer to read a message into. Therefore, the kernel will always be able to accept and process a message from the daemon. The byte count argument to the \fBwrite\fR() system call must be exactly the size of one message buffer, i.e., sizeof(struct mig_msg), or an EIO error will be returned. A direct consequence of this is that the daemon must perform one \fBwrite\fR() system call for each message sent to the kernel. There is no message ``batching'' mechanism. .PP The daemon's .UX \fBread\fR() system call is vectored through the cdevsw table to the kernel routine migread(). The byte count argument to the \fBread\fR() system call must be at least as large as the size of one message buffer, so that one entire message can be sent to the daemon in a single operation. Partial reads are not permitted, to simplify the kernel code, and to prevent the daemon from losing track of the start-of-message-buffer location in the byte stream coming from the driver. If the kernel has one or more messages waiting for the daemon, the \fBread\fR() system call returns exactly one message back to the daemon, without delay.
If the kernel has no messages waiting for the daemon, then the daemon is blocked at interruptible priority until a message arrives. .PP If the daemon wishes to ``sense'' the presence of a message, with an optional wait-for-message delay, the .UX select() call may be used, which vectors through the cdevsw table to the kernel routine migselect(). The migselect() routine indicates that there is read capacity, i.e., a message is ready to be read, when that is the case. Attempts to sense the write capacity of the device always return a ``true'' indication. .PP When a close() sys-call is performed on the file descriptor returned from the open for /dev/mig0, the kernel routine migclose() will be called. At the time the message passing interface is closed, special cleanup action is taken by the kernel to deal with any messages that the daemon had left outstanding. .PP Should the daemon die unexpectedly, or perform an exit() sys-call before closing the file descriptor to the /dev/mig0 interface, the normal .UX kernel code that cleanly closes all open file descriptors before destroying the hulk of the dead process will ensure that a suitable call of migclose() will occur, even though the daemon process did not explicitly perform one. This will ensure that the exclusive use semaphore (kernel variable mig_daemon_is_open) is properly cleared, so that when the daemon is restarted, continued operation will be possible. .PP The contents, organization, and semantics of the message contents are the domain of the higher level. .NH 2 THE DEFINITION OF A MESSAGE .PP The format of the data exchanged between the kernel and the daemon is defined by the C structure ``mig_msg'', defined in kernel header file migration.h. At present, it looks like this:
. \" TA - set default tabs
.de TA
.ta 8n 16n 24n 32n 40n 48n 56n 64n 72n 80n
..
.TA
.sp .5
.nf
.cs R 22
struct mig_msg {
	int	ms_magic;		/* MIG_MSG_MAGIC */
	int	ms_id;			/* ID of msg, for kernel */
	int	ms_op;			/* operation, see below */
	int	ms_result;		/* may contain errno */
	dev_t	ms_dev;			/* relevant device */
	struct fhandle	ms_handle;	/* file handle */
	struct mig_inode_id ms_inode;	/* for mig_iunmigrate() */
} mc_msg;
.cs R
.fi
.sp .5
The field ms_magic must always be set to the value MIG_MSG_MAGIC, or the message is discarded as ``noise'', and an error is logged. The field ms_id is a unique message ID that is issued by the kernel. The kernel will never have more than one message outstanding to the daemon with the same message ID. The daemon is required to echo this ID number back to the kernel in any reply message that might be sent. The field ms_op defines the operation, or purpose, of a message. The remaining fields, ms_result, ms_dev, ms_handle, and ms_inode contain valid values only when so noted in the documentation for a specific value of ms_op. Note that the contents of the ms_inode structure are to be considered ``opaque'' by the daemon, and are intended to be passed intact as one of the parameters to the mig_iunmigrate() sys-call. The daemon should never store or perform any operations on the ms_inode element. .PP There are two forms of messages that the kernel can send: blocking messages and asynchronous messages. Blocking messages require a response from the daemon, and asynchronous messages do not require a response. It is important to note that the term ``blocking'' does not imply that the daemon must answer a message immediately, nor does it imply that the daemon may not answer other messages first. The term ``blocking'' signifies that there is a user mode process that has been blocked, awaiting the response message from the daemon, and that a kernel message structure remains committed to this transaction until the daemon replies.
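.PP From the daemon's side, the message rules above (one whole message per read() or write(), a mandatory magic number, and the kernel's ms_id echoed in every reply) might be handled as in the following sketch. It is not taken from the BUMP daemon. The value of MIG_MSG_MAGIC, the contents of struct fhandle and struct mig_inode_id, and the helper names mig_get_msg() and mig_reply() are assumptions, since the real definitions live in migration.h, which is not reproduced in this paper.

```c
#include <sys/types.h>
#include <unistd.h>

#define MIG_MSG_MAGIC 0x4d494721            /* ASSUMED value */

struct fhandle { unsigned int fh_host, fh_id; };   /* assumed layout */
struct mig_inode_id { long opaque[2]; };           /* opaque; assumed size */

struct mig_msg {
    int    ms_magic;                 /* MIG_MSG_MAGIC */
    int    ms_id;                    /* ID of msg, for kernel */
    int    ms_op;                    /* operation */
    int    ms_result;                /* may contain errno */
    dev_t  ms_dev;                   /* relevant device */
    struct fhandle ms_handle;        /* file handle */
    struct mig_inode_id ms_inode;    /* passed back untouched */
};

/* Read exactly one message. Returns 0 on success, or -1 on a short
 * read, a read error, or a message whose magic number is wrong. */
static int mig_get_msg(int fd, struct mig_msg *m)
{
    ssize_t n = read(fd, m, sizeof *m);

    if (n != (ssize_t)sizeof *m)
        return -1;
    if (m->ms_magic != MIG_MSG_MAGIC)
        return -1;
    return 0;
}

/* Send one reply, echoing the kernel's ms_id. The caller supplies the
 * reply operation code and result. Returns 0 on success, -1 on error. */
static int mig_reply(int fd, const struct mig_msg *req, int op, int result)
{
    struct mig_msg r = *req;         /* keeps ms_id and ms_inode intact */

    r.ms_magic = MIG_MSG_MAGIC;
    r.ms_op = op;
    r.ms_result = result;
    return write(fd, &r, sizeof r) == (ssize_t)sizeof r ? 0 : -1;
}
```

Copying the request into the reply is one simple way to guarantee that ms_id is echoed and that the opaque ms_inode field is never modified by the daemon.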
.PP There are only two types of messages that the daemon may send, and both are issued in response to a ``blocking'' kernel message: MIG_D2K_DONE messages, and MIG_D2K_FAIL messages. This simplicity of response messages was intended to simplify the kernel's job when ``inventing'' proper responses to messages that were outstanding when the daemon closes the /dev/mig0 interface. .PP Conceptually, either the kernel or the user mode daemon may transmit a message to the other at any time. Messages from the daemon to the kernel will always be processed immediately. Because the generation of kernel to daemon messages happens in the context of some process other than the daemon process, the message is stored in a message structure, and queued for processing by the daemon. This queueing mechanism requires that there be a pool of message structures, and at present, this pool is of fixed size (kernel parameter MIG_NMSG). When the entire pool of message structures has been committed to use, any additional requests will block waiting for a structure to become available. Therefore, it is desirable to set MIG_NMSG large enough to handle the anticipated number of transactions during ``high demand'' periods. .NH 2 USER-SETTABLE MODES .PP There are three new ``mode bits'' on every process that can be read and set by the user, using the mig_getflag() and mig_setflag() sys-calls. .PP The first bit, called SPACERETRY, determines kernel behavior when a user executes a \fBwrite\fR() sys-call to a file on a full filesystem. When SPACERETRY=0, the traditional .UX behavior occurs, and the \fBwrite\fR() sys-call returns -1, with variable ``errno'' set to ENOSPC. When SPACERETRY=1, the \fBwrite\fR() sys-call vectors to mig_nospace(), that will block the user's process until additional space becomes available, and then allows the \fBwrite\fR() operation to transparently proceed. 
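.PP The effect of the SPACERETRY bit can be illustrated with a small user-space model. This is a toy simulation, not kernel code: struct fs_model, model_reclaim(), model_write(), and the bit value given to SPACERETRY are all assumptions for this sketch. In the real system the process would sleep in mig_nospace() until the daemon reports that space is available; the model gives up when nothing more can be freed only so that it terminates.

```c
#include <errno.h>

#define SPACERETRY 01    /* ASSUMED bit value; the paper gives only the name */

/* Toy stand-in for a filesystem: blocks free now, plus blocks that a
 * space-management pass could still free. */
struct fs_model {
    long free_blocks;
    long reclaimable;
};

/* Stand-in for the daemon's space-management pass.
 * Returns the number of blocks freed. */
static long model_reclaim(struct fs_model *fs)
{
    long freed = fs->reclaimable;

    fs->free_blocks += freed;
    fs->reclaimable = 0;
    return freed;
}

/* Model of a write needing `need` blocks.
 * Returns 0 on success, or -1 with errno set to ENOSPC. */
static int model_write(struct fs_model *fs, long need, unsigned flags)
{
    while (fs->free_blocks < need) {
        if (!(flags & SPACERETRY)) {
            errno = ENOSPC;          /* traditional behavior */
            return -1;
        }
        /* Real kernel: block here until the daemon replies. */
        if (model_reclaim(fs) == 0) {
            errno = ENOSPC;          /* model only: avoid looping forever */
            return -1;
        }
    }
    fs->free_blocks -= need;
    return 0;
}
```

With the bit clear the write fails at once, and with the bit set the same write succeeds after space has been reclaimed, which is the difference a long-running job would observe.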
.PP The remaining two bits, called MIGNOTRANSP and MIGCANCEL, control the way the kernel handles an attempt to open a migrated file. When MIGNOTRANSP=0, the user wants transparent access to a migrated file. The process will block until the file has been reloaded, after which the \fBopen\fR() will complete normally. If the user process receives a signal, perhaps SIGINT (e.g., ``^C'') when the user gets impatient, the in-migration operation will be permitted to proceed, if MIGCANCEL=0. If MIGCANCEL=1, the daemon will be notified to abort the in-migration operation. .PP When MIGNOTRANSP=1, the user does not want transparent file access, instead preferring to have the \fBopen\fR() return -1 immediately with variable ``errno'' set to EMIGRATED. If MIGCANCEL=0, the daemon is notified to initiate an asynchronous reload, in the background. If MIGCANCEL=1, no communication with the daemon occurs. .NH 2 RELOADING MIGRATED FILES .PP When a user process attempts to open a migrated file, the \fBopen\fR() system call is vectored to the kernel routine \fBopen\fR(), which calls copen(). copen() contains a block of code to open an existing file, which is supplemented with the following block of code in the Cray os/sys2.c module:
.sp .5
.nf
.cs R 22
#ifdef MIGRATION
	if( !u->u_error &&
	    (ip->i_mode & IFMT) == IFMIG &&
	    !(mode&FTRUNC) )  {
		extern struct inode *mig_reload();
.sp .5
		/* Note replacement of "ip" after reload */
		if( (ip = mig_reload( ip )) == (struct inode *)0 )
			return;
	}
#endif
.cs R
.fi
.sp .5
where copen() calls the mig_reload() subroutine. On the Cray, similar code is added to the gethead() routine in os/exec.c. .PP The mig_reload() routine checks the status of the MIGNOTRANSP and MIGCANCEL mode bits on this process to determine the precise handling of the operation.
If MIGNOTRANSP=0 (transparent operation), the daemon is sent a message with ms_op=MIG_K2D_RELOAD_BLOCK, with ms_handle and ms_inode having the file handle and device/inode-number information pertaining to the migrated inode. This message is sent to the daemon using the mig_daemon_wait() routine, which will block waiting for a reply message from the daemon. If the daemon reply message is MIG_D2K_DONE, then the \fBopen\fR() proceeds normally. If the daemon reply message is MIG_D2K_FAIL, then the \fBopen\fR() fails, returning -1, and the value in the ms_result field of the daemon reply message is used as the value of ``errno''. The daemon should do all database operations for this message based strictly on the value of the ms_handle element. .PP If a signal is received while the process is blocked in mig_daemon_wait(), a notification message is sent to the daemon, the signal is posted to the user process, and the \fBopen\fR() will return -1, with the value of ``errno'' being set to EMIGFAIL. The message sent to the daemon will have the same values of ms_handle and ms_inode as the original MIG_K2D_RELOAD_BLOCK message had. If MIGCANCEL=0, then the message sent to the daemon is MIG_K2D_SILENCE_RELOAD, which tells the daemon to proceed with an outstanding RELOAD_BLOCK operation, but not to send any further notification back to the kernel, ie, to treat the operation as if it had originally been of type MIG_K2D_RELOAD_ASYNC. If MIGCANCEL=1, then the message sent to the daemon is MIG_K2D_CANCEL_RELOAD, which informs the daemon that the reload of this file is no longer required, and that it should be aborted if possible. If the daemon has already ``committed'' to the reload operation, no harm is done by allowing the reload to complete. .PP In this protocol, there is a potential for a harmless race condition: the kernel may queue a SILENCE or CANCEL message to the daemon just before the daemon begins sending a DONE or FAIL message to the kernel for that operation. 
Therefore, for best results the daemon should always check to see if there are any additional kernel-to-daemon messages waiting, before the daemon sends reply messages to the kernel. When the race occurs, the daemon will send a reply message to the kernel that is no longer expected. This will be detected in the kernel routine mig_process_response(), and in this case, the kernel will simply note the occurrence of the race by sending the daemon a MIG_K2D_UNEXPECTED message for adding to the daemon log files. In this message, the original message is duplicated for return, with ms_id being the new message ID, ms_result being set to the ms_id field of the unexpected message just received, and ms_op being set to MIG_K2D_UNEXPECTED. If the kernel is unable to obtain a message buffer to log this error condition, then a console printf() is performed, with the message ``WARNING: mig_process_response: unexpected daemon reply, id=%d''. This behavior is necessary in the out-of-buffers condition to prevent a potential buffer deadlock. .PP If MIGNOTRANSP=1 (non-transparent operation), the \fBopen\fR() fails with ``errno'' set to EMIGRATED. If MIGCANCEL=0, then the daemon is sent a MIG_K2D_RELOAD_ASYNC message, with ms_handle and ms_inode having the pertinent information. This permits the daemon to initiate an asynchronous reload operation for the file. The daemon should do all database operations for this message based strictly on the value of the ms_handle element. It is anticipated that the daemon would probably queue these asynchronous requests at a lower priority level than requests where there is a user process actively awaiting a reload operation. If MIGCANCEL=1, then no daemon notification happens at all. .PP If the daemon is not running when mig_reload() is called, the \fBopen\fR() fails, and ``errno'' is set to EMIGOFF. No access to a migrated inode is permitted while the daemon is not running. 
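.PP
The handling rules in this section can be summarized as a small decision function. This is a sketch rather than the actual mig_reload() code: struct reload_plan, plan_reload() and K2D_NONE are hypothetical names, the numeric values of the error codes and message types are placeholders (the text names them but does not give their values), and checking for a missing daemon before looking at the mode bits is an assumption.

```c
#include <assert.h>

/* Placeholder values; only the names come from the text. */
enum { EMIGRATED = 201, EMIGFAIL = 202, EMIGOFF = 203 };

/* Kernel-to-daemon operations; K2D_NONE (hypothetical) means "send nothing". */
enum k2d_op {
    K2D_NONE,
    MIG_K2D_RELOAD_BLOCK,
    MIG_K2D_RELOAD_ASYNC,
    MIG_K2D_SILENCE_RELOAD,
    MIG_K2D_CANCEL_RELOAD
};

/* What happens when a process opens a migrated file. */
struct reload_plan {
    enum k2d_op first_msg;      /* message sent to the daemon at open() time */
    int blocks;                 /* 1 if the process waits for the daemon reply */
    int immediate_errno;        /* errno returned at once, or 0 */
    enum k2d_op on_signal;      /* message sent if a signal interrupts the wait */
};

struct reload_plan plan_reload(int daemon_running, int notransp, int cancel)
{
    struct reload_plan p = { K2D_NONE, 0, 0, K2D_NONE };

    /* Both mode bits are one-bit quantities. */
    assert((notransp == 0 || notransp == 1) && (cancel == 0 || cancel == 1));

    if (!daemon_running) {
        p.immediate_errno = EMIGOFF;            /* no access without the daemon */
    } else if (notransp) {
        p.immediate_errno = EMIGRATED;          /* non-transparent: fail at once */
        p.first_msg = cancel ? K2D_NONE : MIG_K2D_RELOAD_ASYNC;
    } else {
        p.first_msg = MIG_K2D_RELOAD_BLOCK;     /* transparent: block for reply */
        p.blocks = 1;
        /* If interrupted, open() returns -1 with errno set to EMIGFAIL. */
        p.on_signal = cancel ? MIG_K2D_CANCEL_RELOAD : MIG_K2D_SILENCE_RELOAD;
    }
    return p;
}
```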
.NH 2 LOW SPACE AND NO SPACE .PP There is a system-wide disk usage threshold mig_minfreefrags that can be read by anyone using the mig_getflag() sys-call, and can be set by the superuser with the mig_setflag() sys-call. .PP It is planned that in a future version of this software, where tighter integration with existing system utilities would be possible, that the minimum space threshold would be settable on a per-filesystem basis. On Berkeley .UX systems, this would most likely become part of the in-core mount table information, set by an extra field in /etc/fstab. .PP When a filesystem transitions from an amount of free storage above the threshold to an amount below the threshold, the routine mig_lowspace is called. Here is the code fragment from the Cray fs/c1/c1alloc.c routine: .sp .5 .nf .cs R 22 #ifdef MIGRATION { extern int mig_minfreefrags; .sp .5 if( fp->s_tfree >= mig_minfreefrags && (fp->s_tfree - reqblks) < mig_minfreefrags ) mig_lowspace(dev, fp->s_tfree); } #endif .cs R .fi .sp .5 The mig_lowspace() routine sends a message to the daemon, with ms_op set to MIG_K2D_LOWSPACE, ms_dev being the device running low on space, and ms_result set to the amount of storage remaining. No daemon response to the kernel is expected. .PP On receipt of the LOWSPACE message, the daemon has the option (depending on configuration parameters) of initiating a Space Management function to make additional space on the device. This might include removal of some pre-migrated files, and/or the initiation of an immediate filesystem sweep and out-migration operation, depending on the site-specific configuration. .PP Note that the daemon has to be prepared for getting several LOWSPACE messages concerning the same device within a short period of time, as the storage level oscillates around the threshold level. These should be collapsed into a single event within the daemon. 
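.PP
One way a daemon might collapse repeated LOWSPACE messages is to remember which devices already have a space management run pending. The sketch below is illustrative only: crosses_threshold() restates the kernel test shown above as a plain function, while lowspace_tracker, lowspace_note(), lowspace_done() and the MAX_DEVS limit are hypothetical and not part of the BUMP daemon.

```c
#include <assert.h>

#define MAX_DEVS 16     /* hypothetical limit on tracked devices */

/* True only on the downward crossing of the threshold, as in the kernel test. */
int crosses_threshold(long tfree, long reqblks, long minfree)
{
    return tfree >= minfree && (tfree - reqblks) < minfree;
}

/* Devices for which a space management run is already pending. */
struct lowspace_tracker {
    int dev[MAX_DEVS];
    int ndev;
};

/*
 * Record a LOWSPACE message for dev.
 * Returns 1 if this is a new event (start space management),
 * 0 if it was collapsed into a pending event, -1 if the table is full.
 */
int lowspace_note(struct lowspace_tracker *t, int dev)
{
    int i;

    assert(t && t->ndev >= 0 && t->ndev <= MAX_DEVS);
    for (i = 0; i < t->ndev; i++)
        if (t->dev[i] == dev)
            return 0;
    if (t->ndev == MAX_DEVS)
        return -1;
    t->dev[t->ndev++] = dev;
    return 1;
}

/* Forget dev once its space management run has completed. */
void lowspace_done(struct lowspace_tracker *t, int dev)
{
    int i;

    for (i = 0; i < t->ndev; i++) {
        if (t->dev[i] == dev) {
            t->dev[i] = t->dev[--t->ndev];  /* fill the hole with the last entry */
            return;
        }
    }
}
```

Because the kernel reports only the downward crossing, a filesystem hovering near the threshold produces one message per crossing; the tracker turns a burst of these into a single space management run per device.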
.PP When an attempt is made to allocate blocks on a filesystem that is full, the routine mig_nospace() is called. Here is the code fragment from Cray fs/c1/c1alloc: .sp .5 .nf .cs R 22 #ifdef MIGRATION if( mig_nospace( dev ) == 0 ) goto retry; #endif prdev("ERROR: alloc.c: no available free space",dev); .cs R .fi .sp .5 mig_nospace() first checks the setting of the process SPACERETRY mode bit. If SPACERETRY=0, the kernel calls mig_lowspace() with a space level of zero to note the condition, the error ENOSPC is returned from mig_nospace(), and allocation fails. .PP If SPACERETRY=1, then a message is sent to the daemon with ms_op=MIG_K2D_NOSPACE and ms_dev set to the relevant major/minor device code. The kernel then blocks the user process until a reply message is received from the daemon. If the response is MIG_D2K_DONE, then the storage allocation is retried. There is no race condition here, because if another process has consumed the newly made storage before this process retries the allocation, mig_nospace() will be called again, and the operation repeats. If the response from the daemon is MIG_D2K_FAIL, then the allocation operation is abandoned, and an ENOSPC error is returned to the user process. It is not anticipated that the daemon would ever return a MIG_D2K_FAIL code to a NOSPACE message, as that defeats the purpose of this feature. .PP If the user process fields a signal while it is blocked waiting for the daemon reply to the NOSPACE message, then the kernel sends the daemon a message with ms_op=MIG_K2D_CANCEL_NOSPACE and ms_dev set to the major/minor device code. Note that the daemon is still expected to act on the no space condition, even though the user process is no longer blocked waiting on space, ie, treat the message as a NOSPACE message with 0 blocks left. Perhaps a better name would have been SILENCE_NOSPACE. 
There is a potential race condition here, as with the MIG_K2D_RELOAD_BLOCK message described earlier; it is not harmful, and mig_process_response() will take the same remedial action. .PP Note that the daemon should be prepared for multiple processes to encounter a NOSPACE condition on a given device within a fairly short time of each other. The daemon is responsible for holding all of them, and not releasing them until it is known that some storage becomes available, which the daemon can learn about from (a) the Space Management task on that device completing, and (b) periodic sampling of free space levels. .PP If mig_nospace() is called and the user has set SPACERETRY=1, but the migration daemon is not running, rather than returning an error, the kernel adopts a simple retry strategy, to prevent the user process from seeing unwanted ENOSPC errors. If the user has no signals pending, a non-interruptible kernel sleep will be initiated in mig_nospace(), with a one minute timeout. This will cause an allocation retry once a minute, until additional storage becomes available, or the process receives a signal, perhaps SIGINT (eg, ``^C'') when the user gets impatient. This polling behavior is not optimal, especially if several dozen processes need additional space, but it prevents the very desirable SPACERETRY feature from evaporating when the daemon is not running. .PP Thanks to this feature, processes that have SPACERETRY=1 should never see an ENOSPC error return. This is extremely valuable for long-running processes that write all their answers to disk just before exiting. .NH 2 TRUNCATING MIGRATED FILES .PP Whenever a process attempts to truncate a migrated file to zero length, the routine mig_trunc() is called. Truncations to non-zero lengths cause a normal mig_reload() operation, as described above. The code fragment from Cray fs/c1/c1iget.c. 
Also, if it is desired to take ``core dumps'' on top of migrated files, a similar modification will be required in os/sig.c. .sp .5 .nf .cs R 22 #ifdef MIGRATION /* Reload the file prior to truncation, if new size > 0 */ if((ip->i_mode & IFMT) == IFMIG) { if( size == 0 ) u->u_error = mig_trunc( ip ); else u->u_error = mig_reload( ip ); if( u->u_error ) return; } #endif .cs R .fi .sp .5 When a migrated file is truncated to length zero, the kernel sends a message to the daemon with ms_op=MIG_K2D_TRUNCATE and ms_handle set to the appropriate file handle, after which the kernel converts the inode back into a regular file (i_mode=IFREG) with length zero, and the normal truncate operation proceeds. .PP Typically, for migrated files whose data resides on some form of backing store, the daemon would move the database entries for the backing store copies into an ``age, then queue for volume reclamation'' queue, for subsequent Volume Management operations. .PP If the daemon is not running when mig_trunc() is called, the truncate operation fails, and ``errno'' is set to EMIGOFF. The contents of migrated inodes may not be altered while the daemon is not running, to prevent the migrated file database from becoming inconsistent with the state of the filesystem. .NH 2 UNLINKING MIGRATED FILES .PP Whenever a process attempts to unlink a migrated file, the routine mig_unlink() is called. The code fragment from Cray os/sys4.c, routine unlink(), follows: .sp .5 .nf .cs R 22 #ifdef MIGRATION /* * If this is the very last link to a migrated file, * inform the migration system, and allow it the opportunity * to note (and perhaps refuse) the operation, before * removing the directory entry or dereferencing the inode. 
*/ if( ip->i_nlink == 1 && (ip->i_mode & IFMT) == IFMIG ) { if( (u->u_error = mig_unlink( ip )) != 0 ) goto out; } #endif .cs R .fi .sp .5 When the last link to a migrated inode is removed, the kernel sends a message to the daemon with ms_op=MIG_K2D_UNLINK and ms_handle set to the appropriate file handle, after which the kernel converts the inode back into a regular file (i_mode=IFREG) with zero size, and the normal unlink operation proceeds. .PP Typically, for migrated files whose data resides on some form of backing store, the daemon would move the database entries for the backing store copies into an ``age, then queue for volume reclamation'' queue, for subsequent Volume Management operations. .PP If the daemon is not running when mig_unlink() is called, the unlink operation fails, and ``errno'' is set to EMIGOFF. Migrated inodes may not be removed while the daemon is not running, to prevent the migrated file database from becoming inconsistent with the state of the filesystem. .NH 2 THE SYSTEM CALL INTERFACE .PP In addition to the message passing interface, the kernel support for the migration system also provides several additional system calls. .PP In the present implementation, it was decided that these system calls would not be coded using the normal kernel sysent[] table, but would be handled by a private mechanism, so as to minimize the amount of existing kernel code that would have to be altered, and to avoid having to make vendor-specific system call interface modules for inclusion in /lib/libc.a, the C runtime library. When a vendor installs this code in their system, they would presumably assign system call numbers, and add the interfaces to the C runtime library, for slightly increased speed and clarity. .PP To the applications programmer, the new migration system calls are indistinguishable from direct system calls. 
In the remainder of this paragraph, the details of the present implementation will be discussed, and then the difference will be ignored henceforth. All use of these new system calls is expected to be via the library routines provided in libmig.a, which establishes contact with the new system call interface in the kernel by opening device /dev/mig1, that will ordinarily be mode 0666 to permit general use. The interface routines in libmig will bundle up the system call number and arguments into a buffer, and \fBwrite\fR() it to /dev/mig1. Kernel routine migwrite() will copy this buffer into kernel space, and call the routine indicated by the system call number. The interface into the migration system call routines is identical to the interface seen when calling via the sysent[] table, to permit easy conversion to the direct method. .NH 2 FLAG MANIPULATION .LP .sp .5 .nf .cs R 22 mig_getflag( cmd, pid ) int cmd; int pid; .sp mig_setflag( cmd, pid, value ) int cmd; int pid; int value; .cs R .fi .PP There are presently four kernel parameters that may be read or altered using these two system calls. The SPACERETRY, MIGNOTRANSP, and MIGCANCEL parameters are one-bit quantities, and are stored on a per-process basis. The THRESHOLD low space threshold is presently a single integer value that applies to all filesystems (see LOWSPACE remarks, above). The ``pid'' argument must specify a live process, to avoid an ESRCH error. A ``pid'' value of zero is interpreted to indicate the pid of the process performing the system call. If the process specified by ``pid'' belongs to a different user, and the process performing the system call is not running as the superuser, then an EPERM error is returned. .PP When cmd=MIG_FLAG_SPACERETRY, the one bit parameter SPACERETRY is read or written. 
With a value of 0, writing to a full filesystem returns an ENOSPC error, while with a value of 1, writing to a full filesystem will always succeed, although potentially with some delay until free storage is available. .PP When cmd=MIG_FLAG_NOTRANSP, the one bit parameter MIGNOTRANSP is read or written. When cmd=MIG_FLAG_CANCEL, the one bit parameter MIGCANCEL is read or written. See the earlier remarks on reloading migrated files for a detailed description of the effects of these bits. .PP When cmd=MIG_FLAG_THRESHOLD, a mig_getflag() will return the current threshold upon which the daemon will be notified of a low space condition. Only the superuser may alter this parameter with mig_setflag(); non-privileged use will result in an EPERM error. Only non-negative values are permitted. Note that in this case the value of ``pid'' has no significance, but must be valid. .PP The SPACERETRY, MIGNOTRANSP, and MIGCANCEL bits are carried in the process structure p_flag word. They are inherited across forks, so that all child processes run with the same storage and migration behavior as the parent. Typically, a user would set the desired operating mode of his shell, and then all processes will behave in the desired manner. In order to permit these bits to be inherited by child processes, a one-line change needs to be applied to the fork() routine. In the subroutine newproc() in Cray os/fork.c: .sp .5 .nf .cs R 22 #ifdef MIGRATION rpp->p_flag |= (rip->p_flag & (SRTIM|SCPUS| SSPACERETRY|SMIGCANCEL|SMIGNOTRANSP)); #else rpp->p_flag |= (rip->p_flag & (SRTIM|SCPUS)); #endif .cs R .fi .sp .5 .NH 2 CREATING A MIGRATED INODE .LP .sp .5 .nf .cs R 22 mig_makemigrated( source, dest, handle ) char *source; char *dest; struct fhandle *handle; .cs R .fi .sp .5 .PP This system call is intended for use by the out-migration tool. Only processes running as superuser may use this system call; other users will get an EPERM error. 
In ordinary use, the ``source'' file will be some user file, and the ``dest'' file will be in the migration directory for that filesystem. .PP The file named by ``source'' must already exist and be a regular file (i_mode=IFREG), and the file named by ``dest'' must not yet exist. Both must be on the same filesystem. The destination file is created in mode 0400, and is given the same size as the source file. Then, all of the disk blocks are transferred from the source file to the destination file, and removed from the source file. Finally, the source file is changed from file type regular (i_mode=IFREG) to migrated (i_mode=IFMIG), and the file handle given in ``handle'' is stored in an implementation-specific location within the source file's inode. Note that the source file retains its original ownership, access modes, size indicator (i_size), and access and modification times. However, it no longer uses any storage for disk blocks. Any attempt to access this file will result in notification of the migration daemon, as described in the previous section. .NH 2 UNMIGRATING AN INODE .LP .sp .5 .nf .cs R 22 mig_unmigrate( source, dest ) char *source; char *dest; .sp mig_iunmigrate( source, idest ) char *source; struct mig_inode_id *idest; .cs R .fi .sp .5 Two system calls exist for reversing the effect of the mig_makemigrated() system call. The mig_unmigrate() form takes two path names and is intended for human-driven diagnostic and disaster-recovery uses; it is not used by the production migration software. The mig_iunmigrate() form uses an opaque object of type ``mig_inode_id'' such as is found in the ms_inode field of kernel to migration daemon messages like MIG_K2D_RELOAD_BLOCK. .PP In ordinary use, the ``source'' file will be in the migration directory, and the ``dest'' file will be the corresponding migrated user file. The ``source'' file must be a regular file (i_mode=IFREG), and the ``dest'' file must be a migrated file (i_mode=IFMIG). 
Both files must be the same size. The ``dest'' file is converted back into a regular file, and then the disk blocks are moved from the ``source'' to the ``dest'' file. At this point, only the ``change'' time on the ``dest'' file will have been affected by the mig_makemigrated() and mig_iunmigrate() process; all other inode fields will be exactly as they were before the inode was migrated. Finally, the ``source'' file is unlinked. .NH 2 FILE STATUS .PP One of the goals of this project was to implement a file migration capability that was so transparent that, except for additional delays for moving files in from backing storage, there would be no user visible differences. One implication of this is that the stat() system call must not indicate that a file is migrated -- otherwise, every application program that looks at the i_mode field of the inode (eg, find(1), du(1), etc) would have to be modified to know about the new inode type, IFMIG. That would not have been transparent at all! Instead, the stat() system call has been modified so that migrated inodes seem to be of regular file type, IFREG. This required the following change to Cray os/sys3.c routine stat1(): .sp .5 .nf .cs R 22 #ifdef MIGRATION /* * Migrated files look like regular files to all users. * Programs that care about the difference should use * the mig_stat() sys-call instead. */ if( (ip->i_mode & IFMT) == IFMIG ) ds.st_mode = (ip->i_mode & ~IFMT) | IFREG; #endif .cs R .fi .sp .5 .PP This modification raises the question of how a program that needs to know the true status of an inode can obtain it. To answer it, two additional system calls were created: .sp .5 .nf .cs R 22 mig_stat( name, statp, handle ) char *name; struct stat *statp; struct fhandle *handle; .sp mig_lstat( name, statp, handle ) char *name; struct stat *statp; struct fhandle *handle; .cs R .fi .sp .5 .PP mig_stat() functions exactly like stat(). 
For non-migrated files, the file handle structure contains all zeros, while for migrated files, the file handle structure is non-zero, and contains the appropriate migration information copied from the inode. Note that in the latter case, the file type will still be IFREG, not IFMIG, so that code that may be handed the stat structure would not have to be concerned with the extra file type. .PP For kernels that support the Berkeley concept of a symbolic link, the mig_lstat() subroutine is to the lstat() system call, as mig_stat() is to the stat() system call. .NH 2 ERROR CODES .PP The existing set of kernel error codes that system calls can return in ``errno'' has been supplemented by several error codes that are specific to support for the migration system. .PP errno=EMIGRATED is returned when a user attempts to access a file that has been migrated, and the user has requested non-transparent access. This error is the result of the system honoring the request for non-transparency, and does not signify any difficulty. .PP errno=EMIGFAIL is returned when a wait for a transparent migration operation is interrupted by a signal. .PP errno=EMIGOFF is returned when a user attempts to access or delete a file that has been migrated, and the migration daemon is not running. No accesses to migrated files are permitted until the migration daemon has been restarted, so that the migration database remains consistent with the filesystem. The user should contact an operator or system administrator to have the daemon restarted. .PP errno=EMIGNLOC is returned on an attempt to migrate a file between two different filesystems, or when the source file is not local to the executing machine, eg, is an NFS file on a remote server. This error only occurs in the mig_makemigrated() and mig_unmigrate() system calls, and should only be seen by superuser processes. 
.NH 1 THE MIGRATION DAEMON .PP Consistent with the principles of good modular operating system design, and in order to keep the required kernel additions to a minimum, most of the real work to handle file reload operations and out of space conditions is delegated by the kernel to the user-mode migration daemon process. In turn, the migration daemon itself does very little more than prioritize and queue requests from the kernel, and spawn various other processes to execute the needed migration tools. In particular, the daemon can spawn multiple copies of the in-migration tools, and it can also initiate a space management procedure (typically a shell script) for every filesystem that is running low on available disk storage. .PP When the migration software is installed on a machine, the migration daemon becomes an integral part of the operating system software on that machine. The migration daemon plays the same kind of critical role with the .UX kernel filesystem functionality as the Internet ``super-server'' \fB/etc/inetd\fR and the domain name server \fBnamed\fR play for the .UX network functionality. In ordinary operation, the migration daemon is not expected to die. However, if the daemon should die (or be killed), the kernel will make reasonable responses to all filesystem requests that are made. In particular, if a filesystem runs out of space when the daemon is not running, SPACERETRY=1 operation for reliable file writing is still provided, using a simple, less efficient all-kernel technique. If the migration daemon is not running, users will be prohibited from opening or deleting migrated files, although operations that affect only the inode will still be permitted, such as \fBchmod\fR(2), \fBrename\fR(2) or \fBmv\fR(1). This behavior is necessary to keep the migration system databases synchronized with the state of the filesystem. If the migration daemon is not running, the system should be considered to be experiencing a serious problem. 
Fortunately, it should always be possible for the superuser to log in on the console to take remedial action. This implies that crucial system files such as \fB/bin/sh\fR should never be migrated. It would be wise if the main system directories \fB/bin\fR, \fB/lib\fR, \fB/etc\fR, and \fB/usr/bin\fR were always exempted from out-migration. It would be better still if the entire root and \fB/usr\fR filesystems were never subjected to out-migration. Not only will this keep the system more responsive at a very small penalty in online storage used, but it will also ensure that all files needed for effecting system repairs will be available online when such repairs are called for. .NH 1 IN MIGRATION .PP Reload requests are sent from the kernel to the daemon using the protocol described earlier. The migration daemon will fork and start a \fBmigin\fR process to cause the file to be reloaded. See Figure 6, "In-Migration Function (migin)". \fBmigin\fR runs the \fBmigarch\fR utility to copy the file back into the file system if it is not already in place (i.e., dual-migrated). When the file is reloaded into the appropriate file system's migration directory, the daemon performs a mig_iunmigrate() system call, and notifies the kernel of the successful reload. If \fBmigarch\fR experiences unrecoverable errors while trying to read every one of the multiple copies of the migrated file, then errors are logged in the migration log file, and an appropriate error is returned through the daemon to the user process. .PP Files that are online in the filesystem have no existence in the secondary storage, because of the inability to store a file handle in the inode of a regular (i_mode IFREG) file. Therefore, when a file has been reloaded, all of the copies on secondary storage should be considered obsolete. However, to provide disaster recovery, the secondary storage copies of the migrated file cannot immediately be turned over to Volume Management for reclamation. 
Instead, the secondary storage copies must be aged for a minimum of twice the backup interval before the storage can be reclaimed. If a file is removed, and then the filesystem is reloaded, it will still be available for a few additional days. Thus, any file that has been reloaded will be marked as obsolete in the file database, but will continue to be available for disaster recovery until the volume on which it is stored is reclaimed. .NH 1 SPACE MANAGEMENT .PP To enable fully automatic recovery from disk space shortages, the migration system utilizes two kernel-to-daemon messages. The LOWSPACE message is sent when the freespace on a filesystem falls below the configured threshold and the NOSPACE message is sent when a filesystem is full. Upon receiving either of these messages, the daemon starts the \fBmigspace\fR utility, that implements a system-specific policy for creating additional space. See Figure 7, "Space Management (migspace)". Typically, this policy would be to first consult the migration database to see if the affected file system has any dual-migrated files whose online disk storage can be immediately reclaimed. If this does not provide a sufficient amount of space, a list of new migration candidates should be built, and an out-migration operation would be initiated to move these files to backing storage. The worst case can occur when a few large files fill an entire filesystem, requiring all other files to be migrated to secondary storage. .PP Conservative sites, or sites that do not have 24-hour operator coverage may choose to configure the \fBmigspace\fR script to create space only by unlinking all dual-migrated files, and then to give up and wait for human intervention. In this case, processes needing additional file space will be blocked until additional space becomes available, or some human kills them. 
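.PP
The two-phase policy described above might be planned as follows. This is only a sketch of one possible site policy, not the \fBmigspace\fR utility itself: struct candidate, struct space_plan and plan_space() are hypothetical names, the candidate list is assumed to be supplied already ordered by the site's preference, and reclaiming every dual-migrated file in phase one is an assumption.

```c
#include <assert.h>

/* One file on the affected filesystem (hypothetical record). */
struct candidate {
    long blocks;            /* online disk blocks the file occupies */
    int dual_migrated;      /* 1 if a copy already exists on backing storage */
};

/* Outcome of planning a space management run. */
struct space_plan {
    long reclaimed;         /* blocks freed by releasing dual-migrated files */
    long migrated;          /* blocks freed by new out-migrations */
    int satisfied;          /* 1 if the total meets the request */
};

/*
 * Phase 1: release the online storage of all dual-migrated files.
 * Phase 2: unless the site is "conservative", out-migrate further
 * candidates in the order given until enough blocks are freed.
 */
struct space_plan plan_space(const struct candidate *c, int n,
                             long needed, int conservative)
{
    struct space_plan p = { 0, 0, 0 };
    int i;

    assert(n >= 0 && needed >= 0);

    for (i = 0; i < n; i++)
        if (c[i].dual_migrated)
            p.reclaimed += c[i].blocks;

    if (!conservative)
        for (i = 0; i < n && p.reclaimed + p.migrated < needed; i++)
            if (!c[i].dual_migrated)
                p.migrated += c[i].blocks;

    p.satisfied = (p.reclaimed + p.migrated >= needed);
    return p;
}
```

When satisfied is 0 at a conservative site, the blocked processes simply keep waiting until an operator intervenes, which matches the behavior described above.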
.PP Note that some care needs to be taken to ensure that \fBmigout\fR processing gets priority access to the tape drive in a single tape drive system. .NH 1 DATABASES .PP There are several databases and intermediate files used by the BUMP tools; 1) a file database to map file handles into file location, 2) a volume database to identify the location and type of a storage volume, 3) a list of files to be staged from one archiving method to another, containing source file handle, number of copies to make, etc. All databases and intermediate files use the same basic format and are manipulated by a common set of routines. The format is an ASCII text file, with each record newline terminated and each field terminated with a vertical bar ``|'' character. Database fields that may require updating occupy a preset number of characters, so that the field may be updated in place. Concurrent update is prevented by using the appropriate kernel file locking features. .PP Using a simple file format for the databases which can be manipulated by the standard .UX text processing tools results in a significant economy. Everything from simple enquiries up to the most sophisticated queries may be resolved using simple combinations of \fBgrep\fR, \fBawk\fR, \fBed\fR, and the rest of the .UX text processing tools. Creating management reports on storage utilization can be handled with small Shell scripts that are easily tailored to the specific needs of individual sites. Using a text file format also permits ordinary text editors to be used to examine and modify the databases during development. It is anticipated that this convenience will prove similarly useful when disaster recovery is required. .PP Performance of this simple strategy is not anticipated to be a factor. If each database record requires 100 bytes, then the information for 10,000 migrated files will use a single megabyte of storage for the database. 
A large system may expect to have several hundred thousand migrated files at any time, totaling perhaps hundreds of gigabytes of storage, yet the migration database will remain comparatively modest in size. .NH 1 PHILOSOPHY & IMPLICATIONS .PP The illusion of having unlimited on-line disk storage can be a great convenience for users. However, files may be migrated to automatic devices with recall times that can be measured in fractions of minutes, and files that have been migrated to devices that require operator intervention, such as conventional magnetic tape, will typically require several minutes per recall. The trade-off between the convenience of having extra storage and additional delay is certain to have uneven user appeal. Assuming that the file migration system has been well implemented, the success or failure of the file migration system in a particular environment will depend on having a migration policy that suits the needs of the most important users. This is why there has been such a strong emphasis placed on separating the migration \fIpolicy\fR from the migration \fImechanism\fR: no single migration policy will be able to meet the needs of all sites. .PP The most challenging environment in which to implement a successful migration policy is, unfortunately, the environment that .UX usually excels at: the highly interactive timesharing environment. Balancing the need for high interactivity with the need to have significantly more online storage than the actual capacity of the underlying filesystem will be difficult. Success is likely only if (a) the users perceive the benefits of additional convenient file storage as outweighing the inconvenience of an occasional delay in file access, and (b) the file migration policy has been tuned so that an average user's ``working set'' of files is not ordinarily selected for migration, so that reload delays are incurred infrequently. 
.PP It is important to distinguish between the functions of the file migration system, the filesystem backup system, and the private user tape handling facility (sometimes also called an ``archiving system'', a usage that conflicts with the usage in this paper). Some operating systems, such as Cray's COS operating system, attempt to integrate parts of all three functions into the general filesystem capability. It is the purpose of this project only to provide a file migration capability, and not to disturb or significantly alter existing backup and user tape handling conventions. The file migration system is intended to provide the appearance of .UX filesystems that are significantly larger than the actual capacity of the disk hardware. The filesystem backup system is intended to provide disaster recovery, so that a failed disk drive can be restored to the same state that it had at the time the last backup tapes were written. The backup system is also often used to help recover from serious user errors, where files are inadvertently deleted. When the deleted files have not been modified since the last backup tapes were written, then the files can be restored from the backup tapes, and the user error is undone. Finally, the user tape handling facility is intended to allow users to selectively but permanently save files that may be needed again in the future, but will not be needed online for a significant amount of time. .PP To prevent filesystem backups from taking an absurd amount of tape, and an even more absurd amount of time, it is necessary to modify the backup procedures to discriminate between regular files and migrated files. For migrated files, the backup procedure should record only the inode information onto the dump tapes, and not the actual contents of the file. This is consistent with the definition of the backup system as providing only disaster recovery. 
Note that because a filesystem may be reloaded from backup tapes, the secondary storage copies of all migrated files must be retained for an aging period. This ensures that if the most recent backup is reloaded onto the disk, the secondary storage copies of recently deleted files will continue to exist for a brief additional time, so that they can be reloaded by the user if needed. .PP The migration system does not keep a consistent set of file names on the secondary storage media, because all activity is identified by the inode number and file handle, and because one inode may have many different path names that lead to it. While the migration system does record a complete copy of the information in regular file inodes onto the secondary storage, it does not record any path name information, because directories and special files cannot be migrated. As a result, the migration system is not suitable for use as a filesystem backup mechanism. If a filesystem were entirely destroyed and no dump tapes were usable, enough information is stored in the migration system so that all migrated files could be recovered, but all files would appear in their owner's home directory. .PP Having the file migration capability does not relieve the need for a user tape handling facility, nor does it permit users with large storage requirements to ignore their responsibility for monitoring their storage usage and managing their collection of private tapes. The file migration system merely changes the point at which the filesystem appears to be full. The limiting factor will shift from the number of disk blocks available on a filesystem to the number of inodes available on a filesystem. Most .UX filesystems have a fixed number of inodes allocated per filesystem at the time that the filesystem is initially created (with \fBmkfs\fR). Thus, the additional inode requirements must be taken into account when filesystems are created. 
Many more inodes will be needed on filesystems that will store migrated files; 200,000 inodes is a good minimum for larger filesystems (0.5 Gbytes to 1 Gbyte). .PP Even those few .UX systems that can dynamically allocate additional inodes when needed are still limited by the fact that the size of the disk provides an upper bound on the number of files on the filesystem. This seemingly contrived limiting case becomes a more serious issue when online storage for at least one copy of the largest user file must be held in reserve on a disk that is otherwise full of inodes. Failure to observe this relationship could prevent users from retrieving their largest files, and in general would probably be preceded by the migration system being forced into severe ``thrashing'' of files between primary and secondary storage. .PP Therefore, when implementing the file migration system on a particular machine, it is strongly recommended that the administration adopt a two-part policy on file storage, and that the user community be advised of this policy before the file migration system is activated. The first part of the policy is to note that when online space is required, user files may be transparently moved to secondary storage, following the detailed rules of the migration policy elected by the site. The second part of the policy is to note that files may not be maintained in the migration system, even on secondary storage, for more than a period of 1.5 years (or a similar time limit), and that users who require longer term storage of their data must take advantage of the user tape handling facility that has been provided. An automatic tool will be provided that will, with suitable advance warning to the users via E-mail, enforce this policy. .NH 1 CURRENT STATUS .PP As of this writing, the kernel support for the migration system is complete, and has been well tested. 
A demonstration package that includes \fBmigout\fR, \fBmigin\fR, and the migration daemon has been assembled, allowing files to be migrated and unmigrated. Implementation of the \fBmigarch\fR utility and the set of methods for tape-style devices is well underway. Overall, most of the pieces now exist, and the final work of assembling the high-level tools is progressing well. .PP It is anticipated that this software will be in full production status at BRL in the early Fall of 1988, and will be made generally available as Public Domain Distribution Unlimited software by Fall/Winter of 1988. .NH 1 FUTURE WORK .PP The task of moving a user from one filesystem to another, to balance storage requirements more evenly, is one that system administrators will occasionally have to perform. Typically, this is accomplished on Berkeley systems using back-to-back \fBtar\fR programs, e.g., .sp .5 .ti +.5i cd fromdir; tar cf - . | (cd todir; tar xf -) .sp .5 and on System V machines, this is done using a combination of \fBfind\fR and \fBcpio\fR, e.g., .sp .5 .ti +.5i find . -depth -print | cpio -pdlm todir .sp .5 If a user with a significant number of migrated files were to be relocated to a new filesystem using this technique, the correct effect would be achieved, but the migration system would engage in a significant amount of activity. All the files on the source filesystem would have to be migrated in to the original disk, which would probably cause the space management function to have to out-migrate other files to make room. Then, all those files would be copied to the destination filesystem, which would probably also cause the space management function to have to out-migrate files there, too. The result is that the whole collection of old files will have been brought back onto disk on a new filesystem, where they will consume online storage until they have aged sufficiently to qualify for out-migration. 
Adequate kernel support exists to permit the implementation of a \fBcpio\ -p\fR substitute that would relocate migrated inodes without forcing a reload operation, but as of this writing, this has not yet been done. .PP For the majority of .UX systems, the most popular secondary storage medium for file migration is likely to be operator-mounted reels of magnetic tape. This makes the interface to the operator an important part of the file migration software, so that it is easy for the operator to determine what tape is to be mounted, and on which drive. In order for the migration system to be effective, this must work well, or the extra delay in mounting tapes will reflect poorly on the migration system. However, the design and implementation of a good operator interface mechanism for .UX is properly the subject of another project. By necessity, the initial version of the file migration software will offer a very simple interaction with the operator, except on systems like the Cray where an operator interface package already exists. Then, as time progresses, a separate effort to implement a good, portable operator interface package will be initiated, with the goal being to release another piece of software into the public domain. For the present, this has not been done. .PP Many additional migration methods will be added as the project progresses, with network tapes, Masstore systems, robotic cartridge tape units, and the 8mm Exabyte tape cartridges being prime candidates. .PP The current design accomplishes its goal of no filesystem inode changes; however, this prevents certain worthwhile features from being provided. For example, a desirable feature is the ability to reclaim space used by files that have been reloaded for read-only access, by taking advantage of the copy on secondary storage. Another is to be able to supply the first portions of a file to a reading process while the balance of the file is being reloaded. 
These features require that the filehandle and quantity of reloaded data be added as new fields in the filesystem inodes. Future work will incorporate these filesystem changes and provide enhanced functionality. .PP In order to minimize the impact on the kernel, and various ancillary programs such as \fBmount\fR, the current implementation uses a single low space threshold, expressed in blocks, for all mounted file systems, regardless of file system size. A much better strategy would be to read the low space threshold from a file such as /etc/fstab, and provide it to the kernel as part of the \fBmount\fR(2) system call. This would allow the low space threshold to be set independently for each filesystem. .PP In a network filesystem environment, such as that provided by Sun NFS, the migration software will reside entirely on the file server machine, so that file reloading will be entirely transparent to the client machines. It is presently unknown whether there will be any interactions between the potentially lengthy time required to open a migrated file, and protocol timeouts in the client machines. It may also be necessary to increase the number of \fBnfsd\fR daemons to account for some of them being blocked on migrated file opens. .PP Presently, there is no way to add the functionality of the new system calls \fBmig_stat()\fR and \fBmig_lstat()\fR to the current NFS protocol. One implication of this, combined with the fact that the file server handles the migrated files, is that there is no way for a client machine to determine whether a file is migrated or not. Finally, it is not clear how to propagate signals from the client to the server. If an open of an NFS file is blocked on the server due to a migration reload operation, the signal needs to travel over the network. All these issues will be investigated in detail after the software is operating well for machines with locally attached disk drives. 
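.PP Returning to the per-filesystem low space threshold suggested earlier in this section, the user-level half of that idea could be sketched as follows. This is a hypothetical illustration only: the \fBlowspace=\fR option name, the whitespace-separated fstab layout, and the default of 2000 blocks are all assumptions, and no such option exists in any fstab format known to the authors of this sketch. Passing the value to the kernel through \fBmount\fR(2) is not shown.

```python
# Assumed stand-in for the single global threshold used by the current
# implementation; the real value is not given in the text.
DEFAULT_LOW_SPACE_BLOCKS = 2000


def low_space_thresholds(fstab_text, default=DEFAULT_LOW_SPACE_BLOCKS):
    """Map each mount point to its low space threshold in blocks.

    Expects whitespace-separated lines of the form
        device mount_point type options [dump [pass]]
    where options is a comma-separated list that may contain the
    hypothetical entry "lowspace=N". Filesystems without that entry
    fall back to the default threshold.
    """
    thresholds = {}
    for raw in fstab_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete entry; ignore it
        mount_point = fields[1]
        blocks = default
        for option in fields[3].split(","):
            if option.startswith("lowspace="):
                blocks = int(option[len("lowspace="):])
        thresholds[mount_point] = blocks
    return thresholds
```

.PP With a table of this kind, a large filesystem could be given a proportionally larger threshold than a small one, which a single global value cannot express.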
.NH 1 CONCLUSIONS .PP With a very limited set of modifications to the .UX kernel, it is possible to provide a fully transparent file migration capability, preserving the complete semantics of the .UX filesystem. Built around these kernel capabilities is a highly modular set of utility programs that select files for migration, and implement the details of moving files between different devices. One of the best aspects of the design is the strong separation between \fIpolicy\fR and \fImechanism\fR, so that the special needs of individual sites can be satisfied with a single software mechanism. .PP This software promises to satisfy a need that large scale users of .UX systems have long felt, yet provides the capability in such an elegant and transparent manner that even the most discriminating .UX person should not bristle. Much. .SH Acknowledgements .PP The authors would like to thank Chris Johnson for his careful and thorough review of the kernel code, Rick Matthews for his stimulating interchange of ideas, and Bob Reschly and Phil Dykstra for their advice, proofreading, and gratuitous humor as we designed and built this software. .PP The following strings that have been included in this paper are known to enjoy protection as trademarks; the trademark ownership is acknowledged in the table below.
.TS
center tab(|);
l l.
Trademark|Trademark Owner
_
Cray|Cray Research, Inc
Ethernet|Xerox Corporation
FX/8|Alliant Computer Systems Corporation
IBM 370|International Business Machines Corporation
Macintosh|Apple Computer, Inc
NFS|Sun Microsystems, Inc
PowerNode|Gould, Inc
ProNet|Proteon, Inc
Sun Workstation|Sun Microsystems, Inc
UNIX|AT&T Bell Laboratories
UNICOS|Cray Research, Inc
VAX|Digital Equipment Corporation
.TE