RELEASE_NOTES.txt

1.1.3
- --split option - required storing the VCF header of all files, and keeping an MD5 for both the concatenated file and each component
- Improvement in memory and thread management - to reduce memory consumption when compressing very large files (100s of GB to TBs)
- Separate --help text for each command
- Optimize MD5 performance (move to 32b and eliminate memory copying)
- Many bug fixes.
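
To illustrate the --split bookkeeping noted above, here is a minimal Python sketch that computes an MD5 for each component and for the concatenation in one pass. It is not genozip's implementation; the function name digest_components and the sample data are hypothetical.

```python
import hashlib

def digest_components(components):
    """Return (list of per-component MD5 hex digests, MD5 hex digest of the concatenation)."""
    whole = hashlib.md5()          # running digest over all components, in order
    parts = []
    for data in components:
        parts.append(hashlib.md5(data).hexdigest())  # digest of this component alone
        whole.update(data)         # feed the same bytes into the running digest
    return parts, whole.hexdigest()

# Hypothetical stand-ins for the contents of two input files
components = [b"##fileformat=VCFv4.2\n", b"#CHROM\tPOS\n"]
per_file, concatenated = digest_components(components)
```

Keeping both kinds of digest means either the combined file or any single component can be verified after decompression.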

2.0.0
- New genozip file format 
- backward compatibility to decompress genozip v1 files
- Columns 1-9 (CHROM to FORMAT) are now put into their own dictionaries (except REF and ALT that are compressed together)
- Each INFO tag is its own dictionary
- --vblock for setting the variant block size
- Allow variant blocks larger than 64K and set the default variant block size based on the number of samples to balance 
  compression ratio with memory consumption.
- --sblock for setting the sample block size
- change haplotype permutation to keep within sample block boundaries
- create "random access" (index) section 
- new genozip header section with payload that is list of all sections - at end of file
- due to random access, .genozip files must be read from a file only and can no longer be streamed from stdin during genounzip / genocat
- all dictionaries are moved to the end of the genozip file, and are read upfront before any VB, to facilitate random access.
- genocat --regions to filter specific chromosomes and regions. these are accessed via random access
- genocat --samples to see specific samples only
- genocat --no-header to skip showing the VCF header
- genocat --header-only to show only the VCF header
- genocat --drop-genotypes to show only columns CHROM-INFO
- Many new developer --show-* options (see genozip -h -f)
- Better, more compressible B250 encoding
- --test for both genozip (compresses and then tests) and genounzip (tests without outputting)
- Support for --output in genocat
- Added --noisy which overrides default --quiet when outputting to stdout (using --stdout, or the default in genocat)
- --list can now show metadata for encrypted files too
- Many bug fixes, performance and memory consumption optimizations

2.1.0
- Rewrote VCF file data reader to avoid redundant copies and passes on the data
- Moved to size-constrained rather than number-of-lines-constrained variant blocks - change in --vblock logic.
- Make MD5 calculation non-default, requires --md5. genounzip --test possible only if file was compressed with --md5
- Improved memory consumption for large VCFs with a single or small number of samples
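
The difference between size-constrained and line-count-constrained blocks can be sketched in Python as follows. This is an illustrative sketch only, not genozip's reader; split_into_blocks, max_bytes and the sample lines are hypothetical.

```python
def split_into_blocks(lines, max_bytes):
    """Group whole lines into blocks whose total size stays within max_bytes.

    A single line longer than max_bytes is placed in a block of its own.
    """
    blocks, current, size = [], [], 0
    for line in lines:
        # Close the current block if adding this line would exceed the budget
        if current and size + len(line) > max_bytes:
            blocks.append(current)
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        blocks.append(current)
    return blocks

# Hypothetical data lines of 11, 11, 31 and 2 bytes
lines = [b"a" * 10 + b"\n", b"b" * 10 + b"\n", b"c" * 30 + b"\n", b"d\n"]
blocks = split_into_blocks(lines, max_bytes=25)
```

With a byte budget, the number of lines per block varies with line length, so memory use per block stays roughly constant regardless of how many samples each line carries.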

2.1.1
- Reduced thread serialization to improve CPU core scalability
- New developer options --show-threads and --debug-memory
- Many bug fixes
- Improved help text

2.1.2
- Added --optimize and within it optimization for PL and GL

2.1.3
- Fixed bug in optimization in GL in --optimize

2.1.4
- rewrote the Hash subsystem - 
  (1) by removing a thread synchronization bottleneck, genozip now scales better with number of cores (esp better in files with very large dictionaries)
  (2) more advanced shared memory management reduces the overall memory consumption of hash tables, and allows making them bigger - improving speed
- --show-sections now shows all dictionaries, not just FORMAT and INFO
- added optimization for VQSLOD

3.0.0
- added --gtshark allowing the final stage of allele compression to be done with gtshark (provided it is installed
  on the computer and accessible on the path) instead of the default bzlib. This required a change to the genozip 
  file format and hence increment in major version. As usual, genozip is, and will always be, backward compatible -
  newer versions of genozip can uncompress files compressed with older versions.

3.0.2
- changed default number of sample blocks from 1024 for non-gtshark and 16384 in gtshark to 4096 for both modes.
- bug fixes

3.0.9
- bug fixes

3.0.11
- added genocat --strip

3.0.12
- added genocat --GT-only

4.0.0
- fixed a bug that existed in versions 2.x.x and 3.x.x, related to an edge case in compression of INFO subfields. 
  Fixing the bug resulted in the corrected, intended file format, which is slightly different from that used in v2/3.
  Because of this file format change, we are increasing the major version number. Backward compatibility is provided
  for correctly decompressing all files compressed with v2/3.

- VCF files that contain lines with Windows-style line ending \r\n will now compress losslessly preserving the line 
  ending
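
One way to preserve mixed line endings losslessly is to record, per line, whether it ended with \r\n. The Python sketch below shows that idea; it is an assumption made for illustration, not a description of genozip's file format, and the names encode and decode are hypothetical.

```python
def encode(data):
    """Split bytes on \\n and return (line without trailing \\r, had_cr flag) pairs."""
    records = []
    for raw in data.split(b"\n"):
        has_cr = raw.endswith(b"\r")
        records.append((raw[:-1] if has_cr else raw, has_cr))
    return records

def decode(records):
    """Rebuild the original bytes, restoring \\r only where it was present."""
    return b"\n".join(text + (b"\r" if has_cr else b"") for text, has_cr in records)

# Hypothetical input mixing Windows-style and Unix-style line endings
mixed = b"#CHROM\tPOS\r\n1\t100\n2\t200\r\n"
records = encode(mixed)
restored = decode(records)
```

Because the flag is stored per line, files with inconsistent line endings round-trip exactly.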

4.0.2
- genozip can now compress .bcf .bcf.gz .bcf.bgz and .xz files
- genounzip can now decompress into a bgzip-ed .vcf.gz file

4.0.4
- add support for compressing a file directly from a URL
- remove support for 32-bit Windows (it's been broken for a while)

4.0.6
- bug fixes

4.0.9
- improve performance for --samples --drop-genotypes --GT-only --strip and --regions - skip reading and decompressing
  all unneeded sections (previously partially implemented, now complete)
- bug fixes
  
4.0.10 
- updated license
- added --header-one to genocat
- query user whether to overwrite an existing file
- better error messages when running external tools
- bug fixes

4.0.11
- bug fixes

5.0.0
- Updated license
- Added user registration
- Added full support for compressing SAM/BAM, FASTQ, FASTA, GVF and 23andMe files
- Compression improvements for VCF files with any of these:
    1. lots of non-GT FORMAT subfields 
    2. ID data 
    3. END INFO subfield 
    4. MIN_DP FORMAT subfield
- Added genounzip output options: --bcf for VCF files and --bam for SAM files
- Added --input-type - tell genozip what type of file this is - if re-directing or file has non-standard extension
- Added --stdin-size - tell genozip the size of a redirected input file, for faster execution
- Added --show-index for genounzip and genocat - see index embedded in a genozip file
- Added --fast option for (a lot) faster compression, with (somewhat) reduced compression ratio
- Added --grep for genocat FASTQ
- Added --debug-progress and --show-hash, useful mostly for genozip developers
- Reduce default vblock from 128MB to 16MB

Note: some version numbers are skipped due to failed conda builds (every build attempt consumes a version number)