diff --git a/docs/articles/neta.html b/docs/articles/neta.html index 566adb7..368f426 100644 --- a/docs/articles/neta.html +++ b/docs/articles/neta.html @@ -84,32 +84,41 @@

Network analysis toolkit

Christopher Conley, Pei Wang, Jie Peng

-

2017-05-10

+

2017-05-12

-

The goal of this vignette is to illustrate the network analysis toolkit in the spaceMap analysis pipeline. We use the fitted spaceMap networks to the bcpls data neta-bcpls webpage as an example.

+

The goal of this vignette is to illustrate the network analysis toolkit in the spaceMap analysis pipeline. We use the fitted spaceMap networks to the bcpls data as an example.

-
-

-Input

+
+

+Input

-

Before conducting the network analysis, you should have completed the model fitting step such that you have a network ensemble or one final network to analyze (e.g. output from bootVote). To fully realize the capability of the toolkit, you also should have node annotations mapped onto the network.

-

Load the bcpls list, which contains the example input arguments for the toolkit.

+

Before conducting the network analysis, you should have completed the model fitting step such that you have a network ensemble or a final network to analyze (e.g. output from bootVote). To fully realize the capability of the toolkit, you should also have annotations for the nodes.

+

Load the bcpls list.

library(spacemap)
 data("bcpls")
 #Attach "bcpls" data objects to the R Global Environment
 attach(bcpls)
-

Next we list the contents of bcpls and outline each input argument to the toolkit.

+

Next we list the contents of bcpls.

names(bcpls)
## [1] "net"   "yinfo" "xinfo" "bdeg"  "go2eg"
-

The net object is a list of two adjacency matrices that is output from bootVote. The protein–protein edges are encoded with 1’s in net$yy. Likewise, the CNA–protein edges are encoded with 1’s in net$xy.

-

The annotation objects xinfo and yinfo are of type data.frame, whose rows should correspond to the rows of net$xy and net$yy, respectively. Let us look at the contents of yinfo: The first column, id, is always required and must be unique for each node; The second column, alias, is optional and may be used for more human-readable labels. For example, two protein isoforms must have different ID’s but may have the same gene symbol alias. If genome coordinates are supplied for cis/trans identification, then we need three additional columns: chr a character for the chromosome number, start an integer for the beginning of the chromsome location, and end for the end of the chromsome location. The users can also add other (optional) columns such as strand (for DNA strand), description for additional annotations, etc., which will be exported to visulization software such cytoscrape. Finally, missing information should be denoted by ‘NA’.

+
    +
  • The net object is a list of two adjacency matrices that is output from bootVote. The protein–protein edges are encoded with 1’s in net$yy. Likewise, the CNA–protein edges are encoded with 1’s in net$xy.

  • +
  • The annotation objects xinfo and yinfo are of type data.frame, whose rows should correspond to the rows of net$xy and net$yy, respectively. Let us look at the contents of yinfo:

  • +
+
    +
  1. The first column, id, is always required and must be unique for each node;
  2. +
  3. The second column, alias, is optional and may be used for more human-readable labels. For example, two protein isoforms must have different id but may have the same gene symbol alias.
  4. +
  5. If genome coordinates are supplied for cis/trans identification, then we need three additional columns: chr a character for the chromosome number, start an integer for the starting positiion, and end for the ending position.

  6. +
  7. The users can also add other (optional) columns such as strand (for DNA strand), description for additional annotations, etc., which will be exported to visulization software such cytoscrape.
  8. +
  9. Finally, missing information should be denoted by ‘NA’.

  10. +
head(yinfo,5)
@@ -169,7 +178,7 @@

-

The same column and row rules are applicable for xinfo. Here, id contains precise genomic coordinates, while the alias contains cytoband information.

+

Similar rules are applicable for xinfo. In this example, id contains genomic coordinates, while alias contains cytoband information.

head(xinfo,5)
@@ -217,7 +226,9 @@

-

The mapping between the proteins and their biological process term in the Gene Ontology (GO) is specified in the list go2eg, which has named each element with a GO ID; and each element is a character vector specifying all the proteins that map to this GO ID. Proteins can map to multiple GO IDs. The mapping go2eg will be used for conducting functional enrichment analysis of CNA-hubs and network modules, but is otherwise optional.

+
    +
  • The mapping between the proteins and their Gene Ontology (GO) terms is specified in the list go2eg: each element is associated with a GO ID and is a character vector specifying all the proteins that map to this GO ID. Proteins can map to multiple GO terms. The mapping go2eg will be used for conducting functional enrichment analysis, but is otherwise optional.
  • +
head(go2eg,2)
## $`GO:0000122`
 ##  [1] "NP_009330"    "NP_644671"    "NP_001120"    "NP_056082"   
@@ -237,7 +248,7 @@ 

## [9] "NP_001356" "NP_005535" "NP_057726" "NP_001553" ## [13] "NP_000213" "NP_001744" "NP_000791" "NP_620590"

-

We can also create a biologically informative alias for each GO ID that will be useful for reporting enriched hubs/modules later.

+

We can also create a biologically informative alias for each GO ID, which will be useful for reporting enriched hubs/modules later.

library(AnnotationDbi)
 library(GO.db)
 process_alias <- AnnotationDbi::Term(names(go2eg))
@@ -248,11 +259,18 @@ 

## "MAPK cascade" ## GO:0000278 ## "mitotic cell cycle"

-

The list bdeg comes from bootVote, which is optional for network analysis, but would be useful for prioritizing both CNA- and protein- hubs. bdeg$yy[b,] is an integer vector representing the degree distribution of the proteins in the network fitted on the \(b\)th bootstrap replicate. Similarly, element bdeg$xy[b,] is an integer vector representing the degree distribution of the CNAs. If bdeg is provided in the function “rankHub”, the CNA-hubs will be prioritized according to their average degree rank across the \(B\) bootstrap replicates, so that highly ranked hubs would consistently have a large degree across the network ensemble.

+
    +
  • The list bdeg comes from bootVote. It is optional for network analysis, but would be useful for prioritizing both CNA- and protein- hubs.
  • +
+
    +
  1. bdeg$yy[b,] is an integer vector representing the degree distribution of the proteins in the network fitted on the \(b\)th bootstrap replicate.

  2. +
  3. Similarly, element bdeg$xy[b,] is an integer vector representing the degree distribution of the CNAs.

  4. +
  5. If bdeg is provided in the function rankHub, then hubs will be prioritized according to their average degree rank, so that highly ranked hubs would consistently have a large degree across the network ensemble.

  6. +
-
-

-Map Annotations

+
+

+Map Annotations

Convert the final network into an igraph object and map the annotations onto the network.

library(igraph)
 ig <- spacemap::adj2igraph(yy = net$yy, xy = net$xy, yinfo = yinfo, xinfo = xinfo)
@@ -263,21 +281,21 @@

The igraph package has a number of ways to access the annotation information. For example, if we wish to confirm the chromosome location of ERBB2, we can easily query:

vertex_attr(graph = ig, name = "chr", index = V(ig)[alias %in% "ERBB2"])
## [1] "chr17"
-

Our toolkit adds value by quickly summarizing annotations in a format that is publication ready. The next section illustrates how.

+

-
-

-Hub Analysis

-

We first prioritize the CNA- and protein- hubs. If the degree distributions from the network ensemble are available in the bdeg argument, then we rank the hubs according to the average degree rank. Accordingly, the most highly ranked hubs will have the most consistently high degree across network ensemble.

+
+

+Hub Analysis

+

We first prioritize the CNA- and protein- hubs. If the bdeg argument is specified, then we rank the hubs according to the average degree rank. Accordingly, the most highly ranked hubs will have the most consistently high degree across network ensemble.

To rank the protein nodes, use the rankHub command and simply specify the level = "y" argument.

ig <- rankHub(ig = ig, bdeg = bdeg$yy, level = "y")

To rank the CNA nodes, specify the level = "x" argument.

ig <- rankHub(ig = ig, bdeg = bdeg$xy, level = "x")

Alternatively, if the bdeg argument is not available, the ranking is according to degree in the final network (coded by “ig”).

tmp <- rankHub(ig = ig,  level = "x")
-
-

-Identify cis and trans

+
+

+Identify cis and trans

Next label \(x-y\) edges as being regulated in cis or in trans. The GenomicRanges package and the genomic coordinates chr,start,end columns of xinfo and yinfo are required for this step. If you don’t have the package GenomicRanges installed yet, try:

## try http:// if https:// URLs are not supported
 source("https://bioconductor.org/biocLite.R")
@@ -286,9 +304,9 @@ 

Now we can label the \(x-y\) edges with either “cis” or “trans” in the cis_trans edge attribute of ig.

ig <- cisTrans(ig = ig, level = "x-y")

-
-

-Report

+
+

+Report

We then report a well-organized table, as seen in Table 3 of the spaceMap publication. The top argument allows us to control how many hubs are displayed.

xhubs <- reportHubs(ig, top = 6, level = "x")
@@ -337,7 +355,7 @@

-

Similarly, we can report the top 10 protein hubs, as well as the final network degree, and a description of each hub, if the description column was specified in yinfo.

+

Similarly, we can report the top 10 protein hubs, their degrees in the final network, and a description of each hub, if the description column was specified in yinfo.

yhubs <- reportHubs(ig, top = 10, level = "y")
@@ -399,10 +417,10 @@

-
-

-GO-neighbor Percentage

-

A CNA neighborhood comprises all protein nodes that are directly connected to a CNA hub by an edge. CNA neighborhoods represent direct perturbations to the proteome by amplifications or deletions in the DNA. To quantify their functional relevance, we compute a score called the GO-neighbor percentage. Two protein nodes are called GO-neighbors if they share a common GO term in the same CNA neighborhood. We postulate that a high percentage of GO-neighbors within a CNA neighborhood associates the CNA hub with more functional meaning. These scores, as presented in figure 5 of the publication, can be generated with a GO mapping to the proteins as follows.

+
+

+GO-neighbor percentage

+

A CNA neighborhood comprises all protein nodes that are directly connected to a CNA hub by an edge. CNA neighborhoods represent direct perturbations to the proteome by amplifications or deletions in the DNA. To quantify their functional relevance, we compute a score called the GO-neighbor percentage. Two protein nodes are called GO-neighbors if they share a common GO term in the same CNA neighborhood. We postulate that a high percentage of GO-neighbors within a CNA neighborhood associates the CNA hub with more functional meaning. These scores, as presented in Figure 5 of the publication, can be generated with a GO mapping to the proteins as follows.

hgp <- xHubEnrich(ig = ig, go2eg = go2eg)
@@ -485,19 +503,22 @@

-
-

-Module Analysis

-

There are many criteria to define modules of a network. This toolkit does not require a specific algorithm for finding modules and allows users to import the module membership information (see mods argument in modEnrich). In the spaceMap publication, we use the edge-betweenness algorithm in igraph.

-
mods <- cluster_edge_betweenness(ig)
-

The main goal of module analysis is identifying modules that are functionally enriched. The modEnrich function will test for significantly over-represented GO-terms (or any other valid functional grouping) in a module using hyper-geometric tests. In this application, only the protein nodes have a functional mapping and we specify this through the levels = "y" argument. If both predictors and responses have a functional mapping in the go2eg argument, then we can specify levels = c("y","x"). Other arguments are available to control the enrichment testing; see the docs of modEnrich for more details.

+
+

+Module Analysis

+

There are many criteria to define modules of a network. This toolkit allows users to import the module membership information by themselves (see mods argument in modEnrich).

+

In the spaceMap publication, we use the edge-betweenness algorithm in igraph.

+
library(igraph)
+mods <- cluster_edge_betweenness(ig)
+

The main goal of module analysis is identifying modules that are functionally enriched. The modEnrich function will test for significantly over-represented GO-terms (or any other valid functional grouping) within a module using hyper-geometric tests.

+

In the current example, only the protein nodes have functional mapping and we specify this through the levels = "y" argument. If both predictors and responses have functional mapping in the go2eg argument, then we can specify levels = c("y","x"). Other arguments are available to control the enrichment testing; see modEnrich for more details.

outmod <- modEnrich(ig = ig, mods = mods, levels = "y", go2eg = go2eg, process_alias = process_alias)

The output of the module analysis is a list of 3 objects.

names(outmod)
## [1] "ig"   "etab" "eraw"
  • ig is the input igraph network object updated with a “process_id” attribute for nodes affiliated with a significant GO-term. The “process_id” and “module” attributes together are useful for visualizing which nodes are enriched for a specific biological function.

  • -
  • etab is the polished module enrichment table to report significant GO terms, the representation of the GO term in the module relative to the size of the GO term, and which CNA hubs belong to the module. The top ten hits appear as follows as in Table S.5 of the spaceMap publication’s supplementary materials.

  • +
  • etab is the polished module enrichment table to report significant GO terms, the representation of the GO term in the module relative to the size of the GO term, and which CNA hubs belong to the module. The top ten hits as in Table S.5 of the spaceMap publication’s supplementary materials are as follows:

outmod$etab[1:10,]
@@ -628,12 +649,12 @@

-
-

-Export for Network Visualization

-

There exist many tools for network visualization. In the publication of spaceMap, we exported the annotated igraph network object ig to the “graphml” format, which maintained all the attributes associated with nodes when read into Cytoscape for visualization.

+
+

+Export for Visualization

+

There are many tools for network visualization. In the publication of spaceMap, we exported the annotated igraph network object ig to the “graphml” format, which maintained all the attributes associated with nodes when read into Cytoscape for visualization.

igraph::write_graph(graph = outmod$ig, file = tempfile(), format = "graphml")
-

Here we list all the attributes associated with the nodes (i.e. vertices) that can be used in tandem with Cytoscape’s filtering functions to identify specific nodes of interest.

+

Here we list all the attributes associated with the nodes that can be used in tandem with Cytoscape’s filtering functions to identify specific nodes of interest.

vertex_attr_names(outmod$ig)
##  [1] "name"             "alias"            "chr"             
 ##  [4] "start"            "end"              "strand"          
@@ -641,31 +662,30 @@ 

## [10] "mean_rank_hub" "sd_rank_hub" "ncis" ## [13] "ntrans" "regulates_in_cis" "potential_cis" ## [16] "module" "process_id"

-

We describe some of the most useful attributes for visualization.

+

We describe some of the most useful attributes for visualization:

  • ‘name’: the unique node ID
  • ‘alias’: the node alias (e.g. gene symbol ERBB2)
  • ‘chr,start,end,strand’: the gene coordinates of nodes
  • ‘description’: any note (e.g. breast cancer oncogene)
  • -
  • ‘levels’: indicates the node belongs to predictors “x” or response level “y”
  • -
  • ‘rank_hub’: the rank of the hub within level (e.g. a value of “1” for a node of level x implies the most consistently high degree \(x\) node in the network. )
  • +
  • ‘levels’: indicates whether the node belongs to predictors “x” or responses “y”
  • +
  • ‘rank_hub’: the rank of the hub within its level (e.g. a value of “1” for a node of level “x” means that it is the most consistently high degree \(x\) node in the network. )
  • ‘regulates_in_cis’: list of genes regulated in cis
  • -
  • ‘module’: the module ID the node belongs to.
  • +
  • ‘module’: the module ID that the node belongs to.
  • ‘process_id’: the significant GO-term(s) associated with the node.

Also the edge attributes are exported to ‘graphml’ format.

edge_attr_names(outmod$ig)
## [1] "levels"    "cis_trans"
    -
  • ‘levels’ indicates an edge is \(x-y\) or \(y-y\) -
  • -
  • ‘cis_trans’ indicates an edge is either regulated cis or trans.
  • +
  • ‘levels’ indicates whether an edge is \(x-y\) or \(y-y\).
  • +
  • ‘cis_trans’ indicates whether an edge is regulated in cis or in trans.
-
-

-Summary

-

This toolkit is useful and convenient for analyzing and summarizing the output of spaceMap in the context of integrative genomics. The biological context mapped onto the network object can easily be exported to existing network visualization tools like Cytoscape.

+
+

+Summary

+

This toolkit is useful for analyzing and summarizing the output of the spaceMap model fit in the context of integrative genomics. The biological context mapped onto the network object can easily be exported to existing network visualization tools like Cytoscape.

@@ -676,15 +696,9 @@

Contents