GitHub - cmlburnett/pymtar: Helper lbirary to queue and write a sequence of tar files to LTO magnetic tape

cmlburnett / pymtar Public

Notifications You must be signed in to change notification settings
Fork 0
Star 0

Helper lbirary to queue and write a sequence of tar files to LTO magnetic tape

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
pymtar		pymtar
.gitignore		.gitignore
README		README
setup.py		setup.py

Repository files navigation

pymtar -- python magnetic tar

This repository is used to archive files onto magnetic tape.
It utilizes a sqlite file to store data, which should be written first to the tape, followed by all of the data.
The database used stores SHA256 values of all files to verify integrity.
Accessing the database, in the future, will make seeking to the appropriate tape file to extract a specific file easier.

### Installation ###
Installation is easy

	git clone https://github.com/cmlburnett/pymtar
	cd pymtar
	sudo python3 setup.py install

This requires my helper library for sqlite as I was tired of some basic quality of life functions when using sqlite:

	cd ..
	git clone https://github.com/cmlburnett/sqlitehelper
	cd sqlitehelper
	sudo python3 setup.py install

### Description ###
The sqlite database stores three primary entities:

	tape: contains metadata information about tapes (model, serial number, purchase date, etc)
	tar: a collection of files grouped together and written to tape with tar(1)
	tarfile: an regular file to write to tape

Tapes are identified by a unique serial number, presumably, printed on the cartridge by the manufacturer.
Optionally, a barcode can be included; the barcode is intended to be the standard tape barcodes but there is no required format for pymtar.

### Tape Basics ###
Magnetic tape is a linear access storage medium.
Tapes are generally written with tar(1) or cpio(1), this library uses tar.

A tape consists of a sequence of "files" with a marker at the end of each marker the delineates files.
A "file" is whatever you want, in this case a single tar file.
As such, a tape drive doesn't understand files in the typical fileystem perspective and tar provides this for you.

Tape devices are "rewind" or "non-rewind" in which the former auto-rewinds after each tape operation, which you should NOT use.

	auto-rewind: /dev/st0
	non-rewind: /dev/nst0

As multiple operations are performed when using pymtar writes, do not use the auto-rewinding device.

Shell actions to work with tapes:

	export TAPE=/dev/nst0
	mt rewind
	tar c /home
	mt offline

mt is a utility to manipulate the position of a tape.
tar and mt will assume to use $TAPE if -f not specified 
This example, rewinds a tape and writes /home to a single tar file as file 0 on the tape drive, and then ejects the tape.

mt actions:

	mt rewind: move the tape head to the start of the tape
	mt fsf: move the tape head to the next file
	mt bsf: move the tape head to the previous file
	mt offline: rewind and eject the tape
	mt status: will print out the file number, block number, and bit flags at the current head position

It's not quite this simple as bsf moves back to the beginning of the previous file marker on tape.
For example, if at file=10, block=0 then a bsf will move to file=10, block=-1.
Repeating bsf will then move to file=9, block=-1.
To write to file 10, fsf will move to file=10, block=0.

In other words:

	file=0, block=0: start of the tape (file 0)
	file=0, block=-1: at the start of the file marker after file 0
	file=1, block=0: start of file 1
	file=1, block=1000: block 100 of file 1
	file=1, block=-1: at the start of the file marker after file 1
	file=2, block=0: start of file 2

A schematicof a tape:

+---------------------------------------------------------------------------------
|  0  | F |  1  | F |  2  | F |
|     | I |     | I |     | I |
|     | L |     | L |     | L |
|     | E |     | E |     | E |
+----------------------------------------------------------------------------------
A     B   C     D   E     F   G

where A, C, and E are the start positions of files 0, 1, and 2.
where B, D, and F are the end position o ffiles 0, 1, and 2.

The tape found between B & C, and D & E, and F & G are the end-of-file markers.
These markers are how the tape drive counts files when manipulating the head position with mt(1).

If you look at mt status for each of these, you would find:

	A: file=0, block=0
	B: file=0, blokc=-1
	C: file=1, block=0
	D: file=1, block=-1
	E: file=2, block=0
	F: file=2, block=-1
	G: file=3, block=0

So if you were at file=10, block=40 and wanted to go to the start of file 9:

	mt bsf
	mt bsf
	mt bsf
	mt fsf

This moves to file=10, block=0 then file=9, block=-1 then file=8, block=-1 then file=9, block=0.

### pymtar ###
The goal of this library is to help facilitate writing a full tape from local filesystem data.
Every file queued into the database is SHA256 hashed so that future tape reads can ensure data is intact.
This means that queueing in files can take a long time.

To get started:

	python3 -m pymtar -d archive.db new tape manufacturer=ACME model=Q123 gen=LTO8RW sn=123456 barcode=LTO123L8 ptime=1990-12-20

This will create a tape. To create a tar file:

	python3 -m pymtar -d archive.db new tar tape=1 num=1 stime=now etime=now
	python3 -m pymtar -d archive.db new tar tape=1 num=2 stime=now etime=now

This will create a tar file with at tape file 1 and 2.
This is intentional as the archive.db should be stored first in file 0.

Once a tape and tar have been created, you can queue files:

	python3 -m pymtar -d archive.db queue tape=1 tar=1 /home/me/docs/*
	python3 -m pymtar -d archive.db queue tape=1 tar=2 /var/log/*

Each of these will take the files passed in by the shell, hash them, and add to the indicated tar file.
Currently, pymtar does not glob for files itself so you must explicitly pass it files to add.

Once ready to write to tape, I recommend:
- Create a directory for your tape number (001 in the example above)
- Create subdirectories 'start' and 'end'
- Put archive.db under 001/start/
- Copy in any shell scripts, etc you might have used to queue files
- Copy in any other meta data sources about the data
- Copy in pymtar and all helper libraries to ensure the archive database can be read with the correct version of software
- Copy in any programs used to access/interpret your data files

Basicaly, this is the metadata for the entirety of the tape.
Include any and all information you may want to understand the data itself.

Use tar to write this directory to the tape

	export TAPE=/dev/nst0
	mt rewind
	tar c 001/
	cp 001/start/archive.db 001/end/archive.db
	python3 -m pymtar -d 001/end/archive.db write tape=1 tar=1-2
	tar c 001/

This will write 001 directory to the start of the tape, which will aid in finding files on the tape and accessing hash data to verify data integrity.
Invoking pymtar will then write the two tar files to the tape.
As pymtar writes the tar files, it will update tar.stime and tar.etime in the 001/end/archive.db file.
The last tape file (3) will be another copy of the 001 directory with 001/end/archive.db reflecting the write times.

Options:
- Queue all data to multiple tapes first, and then write archive.db to each tape thus each tape contains complete redundant file and SH256 hash data
- Queue one tape and write it; queue another tape and write it; reuse the same archive.db each time such that subsequent tapes contain
  all of the data for prior tapes thus providing redundant file and SH256 hash data
- Queue one tape per archive.db without reuse between tapes to reduce "wasted" space


### Future ###
Currently, functionality of pymtar is limited as the library is new.

Future plans:
- Add support to facilitate searching for files/directories
- Add support to facilitate extracting specific files/directories
- Integirity check to verify SHA256 of the on-tape files
- Support access counters on each tape file to help track which files are accessed the most
- Enable pymtar to also write the file 0 and end file copy of the database
- Tape libraries with tape changers
- Error handling, currently I/O errors are not looked for or handled
  - Eg, if writing more data than the tape has space for
- Provide ability to abstract a copy of a tape to have replicants of a tape without entirely copying all of the tar/tarfile data
- Incremental backups
  - Queue new files to a new tar
  - Queue modified files (using SHA256) to a new tar