File formats


The matrix file specifies the gene expression matrix to use.

The following formats are accepted by all tools: mtx, txt, h5ad, and loom. Please note that wot expects cells on the rows and genes on the columns, except for the mtx format.

Text

The text format consists of tab or comma separated columns with genes on the columns and cells on the rows.

The first row, the header, must consist of an "id" field followed by the list of genes to be considered.

Each subsequent row gives the expression level of each gene for a given cell.

The first field of each row must be a unique identifier for the cell. The remaining fields are the tab or comma separated expression levels for each gene/feature.

Example:

| id | gene_1 | gene_2 | gene_3 |
|---|---|---|---|
| cell_1 | 1.2 | 12.2 | 5.4 |
| cell_2 | 2.3 | 4.1 | 5.0 |

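The layout above can be parsed with the Python standard library alone. The following is a minimal sketch, not wot's own loader: the function name `read_text_matrix` is hypothetical, and it assumes the separator can be inferred from the header line.

```python
import csv
import io


def read_text_matrix(handle):
    """Parse a text matrix: header "id" + gene names, then one cell per row."""
    first_line = handle.readline()
    # The format allows tab or comma separators; use whichever the header uses.
    delimiter = "\t" if "\t" in first_line else ","
    header = next(csv.reader([first_line], delimiter=delimiter))
    if not header or header[0] != "id":
        raise ValueError('first header field must be "id"')
    genes = header[1:]
    cell_ids, rows = [], []
    for record in csv.reader(handle, delimiter=delimiter):
        if not record:
            continue  # skip blank lines
        if len(record) != len(header):
            raise ValueError(f"row {record[0]!r} does not have {len(genes)} values")
        cell_ids.append(record[0])
        rows.append([float(value) for value in record[1:]])
    if len(set(cell_ids)) != len(cell_ids):
        raise ValueError("cell identifiers must be unique")
    return cell_ids, genes, rows


example = (
    "id\tgene_1\tgene_2\tgene_3\n"
    "cell_1\t1.2\t12.2\t5.4\n"
    "cell_2\t2.3\t4.1\t5.0\n"
)
cell_ids, genes, rows = read_text_matrix(io.StringIO(example))
```

Here `rows[i][j]` holds the expression of `genes[j]` in `cell_ids[i]`, matching the cells-on-rows convention.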
MTX

The MTX format is a sparse matrix format with genes on the rows and cells on the columns, as output by Cell Ranger. You should also have TSV files with genes and barcode sequences corresponding to row and column indices, respectively. These files must be located in the same folder as the MTX file and share its base file name. For example, if the MTX file is my_data.mtx, you should also have a my_data.genes.txt file and a my_data.barcodes.txt file.
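As an illustration of the naming rule, the sketch below derives the two companion paths from the MTX path. The helper name `mtx_companion_files` is hypothetical, and wot's actual lookup logic may differ (for example, in how it handles compressed files).

```python
from pathlib import Path


def mtx_companion_files(mtx_path):
    """Return the expected (genes, barcodes) paths next to an MTX file."""
    mtx_path = Path(mtx_path)
    base = mtx_path.stem  # "my_data.mtx" -> "my_data"
    # Both companion files live in the same folder and share the base name.
    genes = mtx_path.with_name(base + ".genes.txt")
    barcodes = mtx_path.with_name(base + ".barcodes.txt")
    return genes, barcodes


genes_path, barcodes_path = mtx_companion_files("data/my_data.mtx")
```

Checking `genes_path.exists()` and `barcodes_path.exists()` before running a tool is a cheap way to catch a misnamed file early.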

H5AD

An HDF5 file that provides a scalable way of keeping track of data together with learned annotations. Please see the description at https://anndata.readthedocs.io

Loom

An HDF5 file for efficient storage and access of large datasets. Please see the description at http://loompy.org/

The timestamp associated with each cell of the matrix file is specified in the days file. This file must be a tab or comma separated plain text file, with two header fields: "id" and "day".

Example:

| id | day |
|---|---|
| cell_1 | 1 |
| cell_2 | 2.5 |
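A days file like the one above maps each cell id to a timestamp. The following sketch reads it into a dictionary; the function name `read_days` is hypothetical, and the separator is assumed to be detectable from the header line.

```python
import csv
import io


def read_days(text):
    """Return {cell id: day} from the contents of a days file."""
    header_line = text.split("\n", 1)[0]
    # Tab or comma separated: use whichever appears in the header.
    delimiter = "\t" if "\t" in header_line else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    fields = reader.fieldnames or []
    if "id" not in fields or "day" not in fields:
        raise ValueError('days file needs "id" and "day" header fields')
    # Days may be fractional (e.g. 2.5), so store them as floats.
    return {row["id"]: float(row["day"]) for row in reader}


days = read_days("id\tday\ncell_1\t1\ncell_2\t2.5\n")
```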

Gene or cell sets can be in gmx (Gene MatriX), gmt (Gene Matrix Transposed), or grp format.

The gmt format is convenient for storing large databases of sets. However, for a handful of sets, the gmx format might be easier to edit in Excel.

More information on these formats can be found here

GMT

The gmt format consists of one set per line. Each line is a tab-separated list composed as follows:

  • The set name (can contain spaces)
  • A comment or description of the set (may be empty or contain spaces)
  • A tab-separated list of set members

Example:

| Set1 | set 1 description | gene_2 | gene_1 |
|---|---|---|---|
| Set2 | set 2 description | gene_3 | |
| Set3 | set 3 description | gene_4 | gene_1 |

GMX

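The gmx format is the transpose of the gmt format. Each column represents a set: the first row holds the set names, the second row the descriptions, and the remaining rows the set members. It is also tab-separated.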

Example:

| Set1 | Set2 | Set3 |
|---|---|---|
| set 1 description | set 2 description | set 3 description |
| gene_2 | gene_3 | gene_4 |
| gene_1 | | gene_1 |

GRP

The grp format contains a single set in a simple newline-delimited text format.

Example:

gene_1
gene_2
gene_3
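To make the relationship between the three set formats concrete, here is a minimal sketch that parses gmt text, transposes it into gmx text, and reads a grp list. The function names are hypothetical, and the parser assumes every gmt line has at least a name and a description field.

```python
def parse_gmt(text):
    """Return {set name: (description, [members])} from gmt text."""
    sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Field 1 is the name, field 2 the description, the rest are members.
        name, description, *members = line.split("\t")
        sets[name] = (description, [m for m in members if m])
    return sets


def to_gmx(sets):
    """Transpose parsed sets into gmx text: one column per set."""
    names = list(sets)
    depth = max((len(members) for _, members in sets.values()), default=0)
    lines = ["\t".join(names), "\t".join(sets[n][0] for n in names)]
    for i in range(depth):
        # Shorter sets are padded with empty cells.
        lines.append(
            "\t".join(sets[n][1][i] if i < len(sets[n][1]) else "" for n in names)
        )
    return "\n".join(lines)


def parse_grp(text):
    """Return the members of the single set stored in grp text."""
    return [line.strip() for line in text.splitlines() if line.strip()]


gmt_text = (
    "Set1\tset 1 description\tgene_2\tgene_1\n"
    "Set2\tset 2 description\tgene_3\n"
    "Set3\tset 3 description\tgene_4\tgene_1\n"
)
gmx_text = to_gmx(parse_gmt(gmt_text))
```

Running `to_gmx` on the gmt example reproduces the gmx example, including the empty cell in the Set2 column.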

The batch associated with each cell of the matrix file is specified in the covariate file. This file must be a tab or comma separated plain text file, with two header fields: "id" and "covariate".

Example:

| id | covariate |
|---|---|
| cell_1 | 0 |
| cell_2 | 1 |

OT Configuration file

There are several options to specify Optimal Transport parameters in wot.

The easiest is to just use constant parameters and specify them when computing transport maps with the --epsilon or --lambda1 options.

If you want more control over what parameters are used, you can use a detailed configuration file. There are two kinds of configuration files accepted by wot.

Configuration per timepoint

You can specify each parameter at each timepoint. When computing a transport map between two timepoints, the average of the two parameters for the considered timepoints will be taken into account.

For instance, if you have prior knowledge of the amount of entropy at each timepoint, you could specify a different value of epsilon for each timepoint, and those would be used to compute more accurate transport maps.

The configuration file is a tab-separated text file. Its header must contain a column named t, for the timepoint, followed by the name of each parameter you want to set. Any parameter not specified in this file can still be given a constant value, as before, with the command-line arguments --epsilon, --lambda1, --tolerance, etc.

Example:

| t | epsilon |
|---|---|
| 0 | 0.001 |
| 1 | 0.002 |
| 2 | 0.005 |
| 3 | 0.008 |
| 3.5 | 0.01 |
| 4 | 0.005 |
| 5 | 0.001 |
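The averaging rule for per-timepoint parameters can be sketched as follows. This only illustrates the rule as stated in this section, not wot's implementation; `read_timepoint_config` and `pair_parameter` are hypothetical names.

```python
import csv
import io


def read_timepoint_config(text):
    """Return {timepoint: {parameter: value}} from a tab-separated config."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if "t" not in (reader.fieldnames or []):
        raise ValueError('configuration header must contain a "t" column')
    config = {}
    for row in reader:
        # Every column other than "t" is treated as a parameter name.
        config[float(row["t"])] = {
            name: float(value) for name, value in row.items() if name != "t"
        }
    return config


def pair_parameter(config, t0, t1, name):
    """Average the parameter values of the two timepoints of a transport map."""
    return (config[t0][name] + config[t1][name]) / 2


config = read_timepoint_config(
    "t\tepsilon\n0\t0.001\n1\t0.002\n2\t0.005\n3\t0.008\n3.5\t0.01\n4\t0.005\n5\t0.001\n"
)
epsilon_3_to_3_5 = pair_parameter(config, 3, 3.5, "epsilon")
```

With the example file, the map between timepoints 3 and 3.5 would use the mean of 0.008 and 0.01.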

Configuration per pair of timepoints

If you want to be even more explicit about what parameters to use for each transport map computation, you can specify parameters for pairs of timepoints.

As previously, the configuration is specified in a tab-separated text file. Its header must have columns t0 and t1, for source and destination timepoints.

Bear in mind, though, that a transport map cannot be computed for any pair of timepoints not specified in this file. You should therefore include at least all pairs of consecutive timepoints if you want to be able to compute full trajectories.

Example:

| t0 | t1 | lambda1 |
|---|---|---|
| 0 | 1 | 50 |
| 1 | 2 | 80 |
| 2 | 4 | 30 |
| 4 | 5 | 10 |

This can, for instance, be used if you want to skip a timepoint (note how timepoints 3 and 3.5 are not present here). If a timepoint is present in the dataset but not in this configuration file, it will be ignored.

You can use as many parameter columns as you want, even none. Any parameter not specified here can still be given a constant value, as before, with the command-line arguments --epsilon, --lambda1, --tolerance, etc.
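The lookup behaviour for a per-pair configuration can be sketched as below: listed pairs override constant defaults, and unlisted pairs are rejected. The names `read_pair_config` and `parameters_for` are hypothetical, and the `defaults` dictionary merely stands in for values given on the command line.

```python
import csv
import io


def read_pair_config(text):
    """Return {(t0, t1): {parameter: value}} from a tab-separated config."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    fields = reader.fieldnames or []
    if "t0" not in fields or "t1" not in fields:
        raise ValueError('configuration header must contain "t0" and "t1"')
    pairs = {}
    for row in reader:
        key = (float(row["t0"]), float(row["t1"]))
        pairs[key] = {
            name: float(value)
            for name, value in row.items()
            if name not in ("t0", "t1")
        }
    return pairs


def parameters_for(pairs, t0, t1, defaults):
    """Merge constant defaults with the parameters listed for one pair."""
    if (t0, t1) not in pairs:
        # Pairs missing from the file are not computable.
        raise KeyError(f"no transport map configured for {t0} -> {t1}")
    return {**defaults, **pairs[(t0, t1)]}


pairs = read_pair_config("t0\tt1\tlambda1\n0\t1\t50\n1\t2\t80\n2\t4\t30\n4\t5\t10\n")
params = parameters_for(pairs, 2, 4, {"epsilon": 0.05})
```

Asking for the pair (2, 3) raises an error here, which mirrors the way timepoints 3 and 3.5 are skipped in the example.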

Census files are dataset files: tab-separated text files with a header. The header consists of an "id" field followed by the list of cell sets for the census.

Each subsequent row gives the proportion of ancestors that belonged to each of the listed cell sets.

The id is the time at which the ancestors lived.

Example:

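| id | tip1 | tip2 | tip3 |
|---|---|---|---|
| 0.0 | 0.15 | 0.05 | 0.05 |
| 1.0 | 0.28 | 0.05 | 0.03 |
| 2.0 | 0.42 | 0.03 | 0.02 |
| 3.0 | 0.72 | 0.02 | 0.01 |
| 4.0 | 0.89 | 0.00 | 0.00 |
| 5.0 | 0.99 | 0.00 | 0.00 |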
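A census file can be read back into a per-timepoint mapping with a few lines of standard-library Python. This is a sketch under the layout described above; the function name `read_census` is hypothetical.

```python
import csv
import io


def read_census(text):
    """Return (cell set names, {time: {cell set: proportion}})."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    if header[0] != "id":
        raise ValueError('first header field must be "id"')
    cell_sets = header[1:]
    census = {}
    for record in reader:
        if not record:
            continue  # skip blank lines
        # The id column is the time at which the ancestors lived.
        census[float(record[0])] = dict(zip(cell_sets, map(float, record[1:])))
    return cell_sets, census


cell_sets, census = read_census(
    "id\ttip1\ttip2\ttip3\n0.0\t0.15\t0.05\t0.05\n3.0\t0.72\t0.02\t0.01\n"
)
```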