This wiki describes how to use *vg rna* and related tools for transcriptomic analyses. 

## Spliced variation graphs

Similar to how genomic variant information can be represented using a variation graph, the splicing structure of a gene can also be represented as a graph. 

<p align="center">
<img src="https://github.com/jonassibbesen/bioinformatics-diagrams/blob/master/spliced_reference_graph.png" width="600">
</p>

Here, nodes and edges correspond to exons and splice-junctions, respectively, with transcripts represented as paths through the graph. Without the introns and intergenic regions, these graphs are also known as splice graphs.

This spliced reference graph can be combined with a variation graph to produce a spliced variation graph containing both the transcriptomic splicing information and genomic variant information. 

<p align="center">
<img src="https://github.com/jonassibbesen/bioinformatics-diagrams/blob/master/spliced_variation_graph.png" width="750">
</p>

Paths through the graph can still represent haplotypes and transcripts, but now also haplotype-specific transcripts (not shown above).

### Construction

We can use *vg rna* to construct these spliced variation graphs by adding splice-junctions and optionally transcripts to an existing graph. This can be done using the following command:

```
vg rna -p -t <threads> -n annotation.[gtf|gff3] graph.pg > spliced_graph.pg
```

The additional options described below are available.

**Transcript annotation:** *vg rna* supports both the GTF and GFF3 transcript annotation formats. Note that all references (column 1) in the annotation must be part of the graph as embedded paths. By default, only lines with the *exon* feature (column 3) will be parsed. This can be changed using `--feature-type`. In addition, the attribute tag (column 9) that is used as the transcript id/name can be changed using `--transcript-tag` (default: *transcript_id*).
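
Since every reference name in column 1 must exist as an embedded path, it can be useful to check this before running *vg rna*. The following is a minimal sketch, not part of *vg*: the function name `missing_refs` and the file names are hypothetical, and it assumes that the path names of the graph have already been written to a text file with one name per line (how that list is produced is left open here).

```shell
# Print reference names (column 1) of an annotation that are absent from a
# list of graph path names (one name per line).
# Usage: missing_refs annotation.gtf path_names.txt
# Both the function name and the path-name list file are illustrative assumptions.
missing_refs() {
    awk -F '\t' -v paths_file="$2" '
        # Load the graph path names before reading the annotation.
        BEGIN { while ((getline line < paths_file) > 0) paths[line] = 1 }
        # Skip comment and blank lines in the annotation.
        /^#/ || NF == 0 { next }
        # Report each missing reference name once, in order of first appearance.
        !($1 in paths) && !seen[$1]++ { print $1 }
    ' "$1"
}
```

An empty output means that all reference names used in the annotation were found in the list.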

**Intron database:** Besides transcripts, a database of introns can also be added as splice-junctions to a graph. This can be done using the option `--introns`. The input format is BED with the start and end being the intron boundaries. Note that the strand (column 6) is also needed.
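
Because the strand column is required, a quick sanity check of the intron BED file can catch problems before graph construction. The sketch below is an illustration only: the function name `check_intron_bed` is hypothetical, and it assumes that a usable strand is either `+` or `-` and that the start of an intron must be smaller than its end.

```shell
# Report lines of an intron BED file that lack a strand in column 6 or whose
# start is not smaller than the end. Returns a non-zero status if any problem
# is found.
# Usage: check_intron_bed introns.bed
# The function name and the accepted strand values are illustrative assumptions.
check_intron_bed() {
    awk -F '\t' '
        # Skip comment and blank lines.
        /^#/ || NF == 0 { next }
        NF < 6 { print "line " NR ": fewer than 6 columns"; bad = 1; next }
        $6 != "+" && $6 != "-" { print "line " NR ": strand must be + or -"; bad = 1 }
        $2 + 0 >= $3 + 0 { print "line " NR ": start must be smaller than end"; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

The function prints one message per problem, so a file that passes produces no output.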

**Graph format:**  *vg rna* supports any of the handle graph implementations and will use the same format for the graph output as the input. It is, however, recommended that the *PackedGraph* format is used as it strikes a good balance between memory usage and graph edit speed. A graph can be converted to the *PackedGraph* format using `vg convert -p`.

**Transcript paths:** Reference transcript paths can be added as embedded paths to the graph using `--add-ref-paths`. Reference transcript paths are transcripts that follow the reference paths defined in the annotation (column 1). See the *Haplotype-specific transcripts* section for more information on projected non-reference transcript paths.

**Splice graph:** By default, *vg rna* will construct a spliced variation graph that includes the intergenic and intronic regions. If only the exonic regions (splice graph) are of interest, this can be changed using `--remove-non-gene`. Note that with this option all existing embedded paths will be deleted (including the reference). It is therefore recommended that transcript paths are added to the graph (see above).
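
To show how the options above can be combined, the following sketch prints (rather than runs) a two-step workflow: converting a graph to the *PackedGraph* format and then building a splice graph that keeps the reference transcript paths. It only uses options mentioned on this page. The function name `splice_graph_commands` and the output file names are hypothetical, and the sketch assumes that `vg convert` takes the input graph as a positional argument in the same way as *vg rna*.

```shell
# Print the commands for building a splice graph with reference transcript
# paths kept as embedded paths. Nothing is executed; inspect the output first.
# Usage: splice_graph_commands graph.vg annotation.gtf [threads]
# The function name and output file names are illustrative assumptions.
splice_graph_commands() {
    graph="$1"
    annotation="$2"
    threads="${3:-1}"
    base="${graph%.*}"   # input file name without its last extension
    # Step 1: convert the input graph to the PackedGraph format.
    echo "vg convert -p $graph > $base.packed.pg"
    # Step 2: add splice-junctions and reference transcript paths, and remove
    # the intergenic and intronic regions.
    echo "vg rna -p -t $threads -n $annotation --add-ref-paths --remove-non-gene $base.packed.pg > $base.splice.pg"
}
```

For example, `splice_graph_commands graph.vg annotation.gtf 4` prints the two commands with four threads, which can then be reviewed and run manually.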

## Haplotype-specific transcripts

<p align="center">
<img src="https://github.com/jonassibbesen/bioinformatics-diagrams/blob/master/haplotype_transcript_paths.png" width="750">
</p>

More to come soon.

## Downstream analyses

All of the standard tools in the *vg toolkit* also work on spliced variation graphs. However, some tools have been optimized or designed specifically for transcriptomic analyses.

### RNA-seq mapping

To map RNA-seq reads to a spliced variation graph, we recommend using *vg mpmap*, as it has a mode (`-n rna`) that has been optimized specifically for RNA-seq data. More information on how to run it with RNA-seq data can be found on the [Multipath alignments and vg mpmap](https://github.com/vgteam/vg/wiki/Multipath-alignments-and-vg-mpmap) wiki page.

### Transcript quantification 

[rpvg](https://github.com/jonassibbesen/rpvg) can be used to infer the expression of (haplotype-specific) transcript paths. While not formally part of the *vg toolkit*, it works directly on the output from *vg rna* and *vg mpmap*. *rpvg* takes as input a spliced variation graph, read alignments in either *gam* or *gamp* format, and a set of transcript paths represented in a [GBWT](https://github.com/jltsiren/gbwt) (see the *Haplotype-specific transcripts* section on how to construct this using *vg rna*). *rpvg* is able to work on large sets of transcript paths and has successfully been used to infer the expression of 12M haplotype-specific transcripts (constructed from all 5,008 haplotypes in the 1000 Genomes Project).

**Read alignments:** In our experience, the best results are achieved using the default multipath alignment output (*gamp*) from *mpmap* as input to *rpvg*. In addition, to get more accurate probability estimates in *rpvg*, it is recommended that the `--remove-bonuses` option is used in *mpmap*.
