Alevin

Alevin is a tool, integrated with the salmon software, that introduces a family of algorithms for quantification and analysis of 3’ tagged-end single-cell sequencing data. Currently alevin supports the following two major droplet-based single-cell protocols:

  1. Drop-seq
  2. 10x-Chromium v1/2

Alevin works under the same indexing scheme as salmon for the reference, and consumes a set of FASTA/Q file(s) containing the Cellular Barcode (CB) + Unique Molecular Identifier (UMI) in one read file and the read sequence in the other. Given just the transcriptome and the raw read files, alevin generates a cell-by-gene count matrix (in a fraction of the time compared to other tools).

Alevin works in two phases. In the first phase, it quickly parses the read file containing the CB and UMI information to generate the frequency distribution of all observed CBs, and creates a lightweight data structure for fast lookup and correction of the CB. In the second phase, alevin uses the read sequences in the other file(s) to map the reads to the transcriptome, identifies potential PCR/sequencing errors in the UMIs, and performs hybrid de-duplication while accounting for UMI collisions. Finally, a post-abundance-estimation CB whitelisting procedure is run and a cell-by-gene count matrix is generated.
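As a rough illustration of the first phase, the sketch below tallies barcode frequencies from a CB+UMI FASTQ file. It is not alevin's implementation: the function name count_barcodes is hypothetical, and the default lengths (a 16-base CB followed by a 10-base UMI, as in 10x Chromium v2) are assumptions made for the example.

```python
import gzip
from collections import Counter


def count_barcodes(fastq_path, cb_len=16, umi_len=10):
    """Tally how often each cellular barcode appears in a CB+UMI FASTQ file.

    Assumes 4-line FASTQ records whose sequence line starts with the CB
    followed by the UMI (assumed defaults: 16 + 10 bases).
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 != 1:  # the sequence is the 2nd line of each record
                continue
            seq = line.strip()
            if len(seq) < cb_len + umi_len:
                continue  # too short to hold both the CB and the UMI
            counts[seq[:cb_len]] += 1
    return counts
```

For example, count_barcodes("cb.fastq.gz").most_common(5) would list the five most frequent barcodes, which is the kind of frequency distribution the first phase builds before any correction takes place.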

Using Alevin

Alevin requires the following minimal set of input parameters (providing the flags in this order is generally recommended):

  • -l: library type (same as salmon). We recommend using ISR for both Drop-seq and 10x-v2 chemistry.
  • -1: CB+UMI file(s). The path to the FASTQ file containing the raw CB+UMI sequences. Multiple files may be given, as long as their order matches the files passed to the -2 flag.
  • -2: read-sequence file(s). The path to the FASTQ file containing the raw read sequences. Multiple files may be given, as long as their order matches the files passed to the -1 flag.
  • --dropseq / --chromium: protocol. This flag tells alevin which single-cell protocol produced the input sequencing library.
  • -i: index. The path to the directory containing the salmon index of the reference transcriptome, as generated by the salmon index command.
  • -p: number of threads. The number of threads alevin may use for quantification. By default alevin uses all the threads available on the system, although we recommend using ~10 threads, which gave the best memory-time trade-off in our testing.
  • -o: output. The path to the folder where the output gene-count matrix (along with other metadata) will be written.
  • --tgMap: transcript-to-gene map file. A TSV (tab-separated) file with no header and two columns, mapping each transcript present in the reference to its gene (the first column is the transcript and the second is the corresponding gene).
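Because a malformed transcript-to-gene map is an easy mistake to make, it can help to check the file before starting a run. The following is a minimal sketch, assuming only the layout described above (no header, two tab-separated columns); the helper name load_tx2gene is hypothetical and is not part of salmon or alevin.

```python
import csv


def load_tx2gene(path):
    """Read a headerless two-column TSV (transcript, gene) into a dict."""
    tx2gene = {}
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != 2:
                raise ValueError(
                    "line %d: expected 2 columns, got %d" % (line_no, len(row))
                )
            transcript, gene = row
            tx2gene[transcript] = gene
    return tx2gene
```

Calling load_tx2gene("txp2gene.tsv") returns the mapping, or raises ValueError naming the first line that does not have exactly two columns.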

Once all the above requirements are satisfied, alevin can be run using the following command:

> salmon alevin -l ISR -1 cb.fastq.gz -2 reads.fastq.gz --chromium -i salmon_index_directory -p 10 -o alevin_output --tgMap txp2gene.tsv

Providing multiple read files to Alevin

Often, a single library may be split into multiple FASTA/Q files. Also, sometimes one may wish to quantify multiple replicates or samples together, treating them as if they are one library. Alevin allows the user to provide a space-separated list of files to all of its options that expect input files (i.e. -1, -2). The order of the files in the left and right lists must be the same. There are a number of ways to provide alevin with multiple CB and read files, and treat these as a single library. For the example below, assume we have two replicates, lib_A and lib_B. The left and right reads for lib_A are lib_A_cb.fq and lib_A_read.fq, respectively. The left and right reads for lib_B are lib_B_cb.fq and lib_B_read.fq, respectively. The following is one valid way to input these reads to alevin:

> salmon alevin -l ISR -1 lib_A_cb.fq lib_B_cb.fq -2 lib_A_read.fq lib_B_read.fq

Similarly, the same approach can be used if the files are gzipped:

> salmon alevin -l ISR -1 lib_A_cb.fq.gz lib_B_cb.fq.gz -2 lib_A_read.fq.gz lib_B_read.fq.gz
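When the file lists are assembled by a script, the pairing requirement is easy to enforce before launching the run. The sketch below builds the argument list for a multi-file run and refuses mismatched lists. The function name build_alevin_command and its defaults are assumptions for illustration; it only uses flags described in this document.

```python
def build_alevin_command(cb_files, read_files, index_dir, out_dir, tgmap,
                         protocol="--chromium", threads=10, libtype="ISR"):
    """Assemble a `salmon alevin` argument list for paired CB and read files.

    cb_files[i] and read_files[i] must belong to the same library, so both
    lists must be non-empty, equally long, and in the same order.
    """
    if not cb_files or len(cb_files) != len(read_files):
        raise ValueError("need the same non-zero number of CB and read files")
    return [
        "salmon", "alevin",
        "-l", libtype,
        "-1", *cb_files,
        "-2", *read_files,
        protocol,
        "-i", index_dir,
        "-p", str(threads),
        "-o", out_dir,
        "--tgMap", tgmap,
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True). Keeping the pairs as two parallel lists mirrors the command line, where the i-th file after -1 is matched with the i-th file after -2.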

Note

Don’t provide data through an input stream

To keep the time-memory trade-off within acceptable bounds, alevin performs multiple passes over the Cellular Barcode file. Alevin goes through the barcode file once by itself, and then goes through both the barcode and read files in unison to assign reads to cells using the initial barcode mapping. Since a pipe or input stream can’t be reset to read from the beginning again, alevin can’t make its second pass over the barcodes, and might crash.
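A wrapper script can guard against this by checking that each input is a regular file before launching alevin. This is a minimal sketch using only the Python standard library; the helper name can_be_read_twice is hypothetical, and the check is an assumption about what is safe rather than something alevin performs itself.

```python
import os
import stat


def can_be_read_twice(path):
    """Return True if `path` is a regular file, which can be reopened
    and read again from the start. Pipes, sockets and directories
    return False."""
    mode = os.stat(path).st_mode  # follows symlinks to the real target
    return stat.S_ISREG(mode)
```

A path such as /dev/stdin that is fed from a pipe, or a shell process substitution, reports a FIFO rather than a regular file, so the check fails before any time is spent on the first pass.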

Description of important options

Alevin exposes a number of useful optional command-line parameters to the user. The particularly important ones are explained here, but you can always run salmon alevin -h to see them all.

-p / --numThreads

The number of threads that will be used for quantification. Alevin is designed to work well with many threads, so, if you have a sufficient number of processors, larger values here can speed up the run substantially. In our testing we found that usually 10 threads gives the best time-memory trade-off.

Note

Default number of threads

The default behavior is for Alevin to probe the number of available hardware threads and to use this number. Thus, if you want to use fewer threads (e.g., if you are running multiple instances of Salmon simultaneously), you will likely want to set this option explicitly in accordance with the desired per-process resource usage.
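If runs are launched from a script, the value passed to -p can be capped explicitly. The sketch below is one possible policy, not alevin's behavior: the function name choose_threads is hypothetical, and the default of 10 simply reuses the recommendation given above.

```python
import os


def choose_threads(preferred=10):
    """Pick a thread count for -p: the preferred value, capped by the
    number of CPUs the operating system reports, and never below 1."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(preferred, available))
```

When several instances run side by side, a smaller preferred value per instance keeps the total within the machine's capacity.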

--whitelist

This is an optional argument with which the user can explicitly specify the whitelist of CBs to use for cell detection and CB sequence correction. If not given, alevin generates its own set of putative CBs.

--noQuant

Generally used together with --dumpfq. If alevin is passed the --noQuant option, the pipeline stops before mapping begins. The typical use case is when we only need to concatenate the CB onto the read-id of the second file and stop execution afterwards.

--noDedup

If alevin is passed the --noDedup option, the pipeline only performs CB correction and maps the read sequences to the transcriptome, generating the interim CB-EqClass-UMI-count data structure. It is used together with --dumpBarcodeEq or --dumpBfh to obtain raw information or for debugging.

--naive

If given this flag, alevin runs the full end-to-end pipeline, except for correcting UMI collisions (i.e. all related UMIs arising from the same gene are collapsed, rather than determining whether they arise from disjoint collections of transcripts). The general use case is to simulate the type of gene-level analysis done by tools such as cellranger.
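The gene-level collapse described here can be pictured with a toy example that counts distinct UMIs per cell and gene. This is only an illustration of the idea, not alevin's or cellranger's code: the function name naive_gene_counts is hypothetical, and the sketch matches UMIs exactly, without any correction of sequencing errors.

```python
from collections import defaultdict


def naive_gene_counts(records):
    """Count distinct UMIs per (cell barcode, gene).

    `records` is an iterable of (cell_barcode, umi, gene) tuples, one per
    mapped read. Reads that share a barcode, a UMI and a gene collapse to
    a single count, regardless of which transcripts they map to.
    """
    seen = defaultdict(set)
    for cell_barcode, umi, gene in records:
        seen[(cell_barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}
```

Because the transcript of origin is ignored, two molecules from different transcripts of the same gene that happen to carry the same UMI are counted once, which is the collision case that the default (non-naive) mode tries to resolve.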

--mrna

The list of mitochondrial genes to be used as a feature for the naive Bayes classification in CB whitelisting.

--rrna

The list of ribosomal genes to be used as a feature for the naive Bayes classification in CB whitelisting.

--useCorrelation

If activated, alevin computes, during CB whitelist classification, the cell-by-cell Pearson correlation of each candidate CB with the putative true set of CBs. This flag can slow down alevin’s processing.
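For reference, the Pearson correlation between two cells' expression vectors can be computed as in the sketch below. It shows only the standard formula; how alevin combines these values into a classification feature is not covered here, and returning 0.0 for a constant vector is an assumption made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined; treated as 0 here by assumption
    return cov / math.sqrt(var_x * var_y)
```

Each call is linear in the number of genes, so comparing every candidate CB against every CB in the putative true set adds a cost proportional to the product of the two set sizes, which is consistent with the slowdown noted above.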

--dumpfq

Generally used along with --noQuant. If activated, alevin sequence-corrects the CB, attaches the corrected CB sequence to the read-id in the second file, and dumps the result to standard output (stdout).

--dumpBfh

Alevin internally uses a potentially large data structure to concisely maintain all the information required for quantification. This flag dumps the full CB-EqClass-UMI-count data structure for the purpose of raw data analysis and debugging.

--dumpFeatures

If activated, alevin dumps all the features used by the CB classification and their counts at each cell level. Generally, this is used for the purposes of debugging.

--dumpCsvCounts

This flag converts alevin's default binary format for the gene-count matrix into a human-readable CSV (comma-separated) format. The expression of all the genes in one cell is written in one row, and the columns represent the genes.

Output

A typical 10x experiment can range from hundreds to tens of thousands of cells, resulting in very large count matrices. Single-cell tools have traditionally dumped the cell-by-gene count matrix in various formats. Although the choice of format is itself an open area of research, by default alevin dumps a per-cell gene-count matrix in a binary-compressed format, with the row and column indexes in separate files.

A typical run of alevin will generate 3 files:

  • quants_mat.gz – Compressed count matrix.
  • quants_mat_cols.txt – Column Header (Gene-ids) of the matrix.
  • quants_mat_rows.txt – Row Index (CB-ids) of the matrix.

Alevin can also dump the count matrix in a human-readable comma-separated-value (CSV) format if given the flag --dumpCsvCounts, which generates a new output file called quants_mat.csv.
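The CSV output can be combined with the row and column index files to obtain a labeled matrix. The sketch below assumes that quants_mat.csv has no header, that its rows appear in the same order as quants_mat_rows.txt, and that its columns follow quants_mat_cols.txt; these are assumptions based on the description above and should be checked against the actual output. The function name load_count_matrix is hypothetical.

```python
import csv


def load_count_matrix(csv_path, rows_path, cols_path):
    """Return {cell_barcode: {gene: count}} from the CSV matrix and its
    row (barcode) and column (gene) index files."""
    with open(rows_path) as handle:
        barcodes = [line.strip() for line in handle if line.strip()]
    with open(cols_path) as handle:
        genes = [line.strip() for line in handle if line.strip()]
    matrix = {}
    with open(csv_path, newline="") as handle:
        for barcode, row in zip(barcodes, csv.reader(handle)):
            # Drop empty fields so that a trailing comma is tolerated.
            values = [float(v) for v in row if v.strip() != ""]
            if len(values) != len(genes):
                raise ValueError(
                    "%s: expected %d values, got %d"
                    % (barcode, len(genes), len(values))
                )
            matrix[barcode] = dict(zip(genes, values))
    return matrix
```

A nested dictionary is convenient for small inspections; for matrices with tens of thousands of cells, a sparse-matrix library would be the more practical choice.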

Misc

Finally, we are making this software available because we believe it may be useful for people dealing with single-cell RNA-seq data. We want the software to be as useful, robust, and accurate as possible. So, if you have any feedback (something useful to report, or just some interesting ideas or suggestions), please contact us (asrivastava@cs.stonybrook.edu and/or rob.patro@cs.stonybrook.edu). If you encounter any bugs, please file a detailed bug report at the Salmon GitHub repository.

References

[1] Macosko, Evan Z., et al. “Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets.” Cell 161.5 (2015): 1202-1214.
[2] Zheng, Grace XY, et al. “Massively parallel digital transcriptional profiling of single cells.” Nature communications 8 (2017): 14049.
[3] Patro, Rob, et al. “Salmon provides fast and bias-aware quantification of transcript expression.” Nature Methods (2017). Advanced Online Publication. doi: 10.1038/nmeth.4197.
[4] Petukhov, Viktor, et al. “Accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments.” bioRxiv (2017): 171496.