% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/sleuth.R
\name{sleuth_prep}
\alias{sleuth_prep}
\title{Constructor for a 'sleuth' object}
\usage{
sleuth_prep(sample_to_covariates, full_model = NULL, target_mapping = NULL,
aggregation_column = NULL, num_cores = max(1L, parallel::detectCores() -
1L), ...)
}
\arguments{
\item{sample_to_covariates}{a \code{data.frame} which contains a mapping
from \code{sample} (a required column) to some set of experimental conditions or
covariates. The column \code{path} is also required, which is a character
vector where each element points to the corresponding kallisto output directory. The column
\code{sample} should be in the same order as the corresponding entry in
\code{path}.}
\item{full_model}{an R \code{formula} which explains the full model (design)
of the experiment OR a design matrix. It must be consistent with the data.frame supplied in
\code{sample_to_covariates}. You can fit multiple covariates by joining them with '+' (see the example).}
\item{target_mapping}{a \code{data.frame} that has at least one column
'target_id' and others that denote the mapping for each target. If it is not
\code{NULL}, \code{target_mapping} is joined with many outputs where it
might be useful. For example, you might have columns 'target_id',
'ensembl_gene' and 'entrez_gene' to denote different transcript to gene
mappings. Note that sleuth_prep will treat all columns as having the 'character' data type.}
\item{aggregation_column}{a string giving the name of the column in \code{target_mapping} used to aggregate targets
(typically to summarize the data on the gene level). The aggregation is done using a p-value aggregation
method when generating the results table. See \code{\link{sleuth_results}} for more information.}
\item{num_cores}{an integer giving the number of cores that \code{mclapply} should use
to speed up sleuth preparation.}
\item{...}{any of several other arguments that can be used as advanced options for
sleuth preparation. See details.}
}
\value{
a \code{sleuth} object containing all kallisto samples, metadata,
and summary statistics
}
\description{
A sleuth is a group of kallistos. Borrowing this terminology, a 'sleuth' object stores
a group of kallisto results, and can then operate on them while
accounting for covariates, sequencing depth, technical and biological
variance.
}
\details{
This method takes a list of samples with kallisto results and returns a sleuth
object with the defined normalization of the data across samples (default is the DESeq method;
see \code{\link{norm_factors}}), and then the defined transformation of the data (default is log(x + 0.5)).
This also collects all of the bootstraps for the modeling done using \code{\link{sleuth_fit}}. This
function also takes several advanced options that can be used to customize your analysis.
Here are the advanced options for \code{sleuth_prep}:
Extra arguments related to Bootstrap Summarizing:
\itemize{
\item \code{extra_bootstrap_summary}: if \code{TRUE}, compute extra summary
statistics for estimated counts. This is not necessary for typical analyses; it is only needed
for certain plots (e.g. \code{\link{plot_bootstrap}}). Default is \code{FALSE}.
\item \code{read_bootstrap_tpm}: if \code{TRUE}, read the bootstraps and compute summary statistics on the TPM.
This is not necessary for typical analyses; it is only needed for some plots (e.g. \code{\link{plot_bootstrap}})
and if TPM values are used for \code{\link{sleuth_fit}}. Default is \code{FALSE}.
\item \code{max_bootstrap}: the maximum number of bootstrap values to read for each
transcript. Setting this lower than the total bootstraps available will save some time, but
will likely decrease the accuracy of the estimation of the inferential noise.
}
Advanced Options for Filtering:
\itemize{
\item \code{filter_fun}: the function to use when filtering. This function will be applied to the raw counts
on a row-wise basis, meaning that each feature will be considered individually. The default is to filter out
any features that do not have at least 5 estimated counts in at least 47\% of the samples (see \code{\link{basic_filter}}
for more information). If the preferred filtering method requires a matrix-wide transformation or otherwise
needs to consider multiple features simultaneously instead of independently, please consider using
\code{filter_target_id} below.
\item \code{filter_target_id}: character vector of target_ids to filter using methods that
can't be implemented using \code{filter_fun}. If non-NULL, this will override \code{filter_fun}.
}
Advanced Options for the Normalization Step:
(NOTE: Be sure you know what you're doing before you use these options)
\itemize{
\item \code{normalize}: boolean for whether normalization and other steps should be performed.
If this is set to \code{FALSE}, bootstraps will not be read and the data will not be transformed.
This should only be set to \code{FALSE} if one desires to do a quick check of the raw data.
The default is \code{TRUE}.
\item \code{norm_fun_counts}: a function to perform between sample normalization on the estimated counts.
The default is the DESeq method. See \code{\link{norm_factors}} for details.
\item \code{norm_fun_tpm}: a function to perform between sample normalization on the TPM.
The default is the DESeq method. See \code{\link{norm_factors}} for details.
}
Advanced Options for the Transformation Step:
(NOTE: Be sure you know what you're doing before you use these options)
\itemize{
\item \code{transform_fun_counts}: the transformation that should be applied
to the normalized counts. Default is \code{'log(x+0.5)'} (i.e. natural log with 0.5 offset).
\item \code{transform_fun_tpm}: the transformation that should be applied
to the TPM values. Default is \code{'x'} (i.e. the identity function / no transformation).
}
Advanced Options for Gene Aggregation:
\itemize{
\item \code{gene_mode}: Set this to \code{TRUE} to get the old counts-aggregation method
for doing gene-level analysis. This requires \code{aggregation_column} to be set. If
\code{TRUE}, this will override the p-value aggregation mode, but will allow for gene-centric
modeling, plotting, and results.
}
}
\examples{
# Assume we have run kallisto on a set of samples, and have two treatments,
# genotype and drug.
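# A minimal sketch of one way 's2c' could be built. The sample names,
# covariate values and the 'kallisto_output' directory are hypothetical
# placeholders; substitute the values for your own experiment.
s2c <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  genotype = c("wt", "wt", "mut", "mut"),
  drug = c("ctrl", "treated", "ctrl", "treated"),
  stringsAsFactors = FALSE)
# Each 'path' entry points to the kallisto output directory of the sample
# in the same row, so 'sample' and 'path' stay in the same order.
s2c$path <- file.path("kallisto_output", s2c$sample)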
colnames(s2c)
# [1] "sample" "genotype" "drug" "path"
so <- sleuth_prep(s2c, ~genotype + drug)
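
# The calls below are sketches of the advanced options described in the
# details section, not recommended settings.

# Gene-level p-value aggregation. 't2g' is a hypothetical data.frame,
# not created here, with the columns 'target_id' and 'ens_gene'.
so <- sleuth_prep(s2c, ~genotype + drug, target_mapping = t2g,
  aggregation_column = "ens_gene")

# A stricter row-wise filter, assuming basic_filter() accepts the
# 'min_reads' and 'min_prop' arguments: keep features with at least 10
# estimated counts in at least half of the samples.
strict_filter <- function(row) basic_filter(row, min_reads = 10, min_prop = 0.5)
so <- sleuth_prep(s2c, ~genotype + drug, filter_fun = strict_filter)

# Extra bootstrap summaries, only needed for plots such as plot_bootstrap().
so <- sleuth_prep(s2c, ~genotype + drug, extra_bootstrap_summary = TRUE,
  read_bootstrap_tpm = TRUE)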
}
\seealso{
\code{\link{sleuth_fit}} to fit a model, \code{\link{sleuth_wt}} or
\code{\link{sleuth_lrt}} to perform hypothesis testing
}