{ "metadata": { "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
Alevin is a tool integrated with the Salmon software, so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method: downloading the pre-compiled binaries from the releases page. Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) Linux release referenced below.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "wget -nv https://github.com/COMBINE-lab/salmon/releases/download/v1.10.0/salmon-1.10.0_linux_x86_64.tar.gz" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-2", "source": "Once you’ve downloaded a specific binary (here we’re using version 1.10.0), just extract it like so:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-3", "source": [ "tar -xvzf salmon-1.10.0_linux_x86_64.tar.gz" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-4", "source": "\n\n\nAs mentioned, installing salmon using conda is also an option, and you can do it using the following command in the terminal:
\n\nconda install -c conda-forge -c bioconda salmon\n
However, for this tutorial, it would be easier and quicker to use the downloaded pre-compiled binaries, as shown above.
\n
We’re going to use Alevin for demonstration purposes, but we do not endorse one method over another.
\nWe continue working on the same example data - a very small subset of the reads from a mouse dataset of fetal growth restriction {% cite Bacon2018 %} (see the study in Single Cell Expression Atlas and the project submission). For the purposes of this tutorial, the datasets have been subsampled to only 50k reads (around 1% of the original files). These are two FASTQ files - one with the transcript reads and the other with the cell barcodes. You can download the files by running the code below:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "wget -nv https://zenodo.org/records/10116786/files/transcript_701.fastq\n", "wget -nv https://zenodo.org/records/10116786/files/barcodes_701.fastq" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-6", "source": "\n\nQuestion\nHow to differentiate between the two files if they are just called ‘Read 1’ and ‘Read 2’?
\n\n👁 View solution
\n\nThe file which contains the cell barcodes and UMIs is significantly shorter (only 20 bp!) than the other file, which contains the longer transcript reads. For ease, we will use explicit file names.
\n
Additionally, to map your reads, you will need a transcriptome to align against (a FASTA file) as well as the gene information for each transcript (a GTF file). These files are included in the data import step below. You can also download these for your species of interest from Ensembl.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-7", "source": [ "wget -c https://zenodo.org/record/4574153/files/Mus_musculus.GRCm38.100.gtf.gff -O GRCm38_gtf.gff\n", "wget -c https://zenodo.org/record/4574153/files/Mus_musculus.GRCm38.cdna.all.fa.fasta -O GRCm38_cdna.fasta" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-8", "source": "Why do we need FASTA and GTF files?\nTo generate gene-level quantifications based on transcriptome quantification, Alevin and similar tools require a conversion between transcript and gene identifiers. We can derive a transcript-gene conversion from the gene annotations available in genome resources such as Ensembl. The transcripts in such a list need to match the ones we will use later to build a binary transcriptome index. If you were using spike-ins, you’d need to add these to the transcriptome and the transcript-gene mapping.
\nWe will use the murine reference annotation as retrieved from Ensembl (GRCm38 or mm10) in GTF format. This annotation contains gene, exon, transcript and all sorts of other information on the sequences. We will use these to generate the transcript-gene mapping by passing that information to a tool that extracts just the transcript identifiers we need.
\nFinally, we will download some files for later so that you don’t have to switch kernels again - that’s an Alevin output folder for another sample!
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-9", "source": [ "wget https://zenodo.org/records/10116786/files/alevin_output_702.zip\n", "unzip alevin_output_702.zip" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-10", "source": "We will now install the package atlas-gene-annotation-manipulation
to help us generate the files required for Salmon and Alevin.
Once the installation is complete (see the sketch below if you need to install the package yourself), you can execute the code cell that follows, and while it’s running, I’ll explain all the parameters we set here.
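The installation cell itself is not included in this copy of the notebook. A minimal sketch is shown below; it assumes the package is taken from Bioconda (as we did for Salmon above), so adjust the channels if your setup differs:
\nconda install -c conda-forge -c bioconda atlas-gene-annotation-manipulation\n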
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-13", "source": [ "gtf2featureAnnotation.R -g GRCm38_gtf.gff -c GRCm38_cdna.fasta -d \"transcript_id\" -t \"transcript\" -f \"transcript_id\" -o map -l \"transcript_id,gene_id\" -r -e filtered_fasta" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-14", "source": "In essence, gtf2featureAnnotation.R script takes a GTF annotation file and creates a table of annotation by feature, optionally filtering a cDNA file supplied at the same time. Therefore the first parameter -g
stands for “gtf-file” and requires a path to a valid GTF file. Then -c takes a cDNA file for extracting meta-information and/or filtering - that’s our FASTA! Where --parse-cdnas (that’s our -c) is specified, we also need to indicate, using -d, which field should be compared to the identifiers in the FASTA. We set that to “transcript_id” - feel free to inspect the GTF file to explore other attributes. We pass the same value in -f, meaning first-field, i.e. the name of the field to place first in the output table. To specify which other fields to retain in the output table, we provide a comma-separated list of those fields; since we are only interested in a transcript-to-gene map, we put those two names (“transcript_id,gene_id”) into -l. -t stands for the feature type to use, and in our case that is “transcript”. Guess what -o is! Indeed, that’s the output annotation table - here we specify the file path of our transcript-to-gene map. We will also have another output, denoted by -e, which is the path to a filtered FASTA. Finally, we also pass -r, which simply suppresses the header in the output. Summarising, the output will be an annotation table and a FASTA-format cDNA file with unannotated transcripts removed.
Why filtered FASTA?\nSometimes it’s important that there are no transcripts in a FASTA-format transcriptome that cannot be matched to a transcript/gene mapping. Salmon, for example, used to produce errors when this mismatch was present. We can synchronise the cDNA file by removing mismatches as we have done above.
\nWe will use Salmon in mapping-based mode, so first we have to build a Salmon index for our transcriptome. We will run the Salmon indexer like so:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "salmon-latest_linux_x86_64/bin/salmon index -t filtered_fasta -i salmon_index -k 31" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-16", "source": "Where -t
stands for our filtered FASTA file, and -i is the path where the mapping-based index will be written. To build it, the indexer uses an auxiliary k-mer hash over k-mers of length 31. While the mapping algorithms will make use of arbitrarily long matches between the query and reference, the k size selected here will act as the minimum acceptable length for a valid match. Thus, a smaller value of k may slightly improve sensitivity. We find that a k of 31 seems to work well for reads of 75 bp or longer, but you might consider a smaller k if you plan to deal with shorter reads. Also, a shorter value of k may improve sensitivity even more when using selective alignment (enabled via the --validateMappings flag). So, if you are seeing a smaller mapping rate than you might expect, consider building the index with a slightly smaller k.
\n\n\nTo be able to search a transcriptome quickly, Salmon needs to convert the text (FASTA) sequences into a structure it can look up efficiently, called an ‘index’. The index is in a binary rather than human-readable format, but allows fast lookup by Alevin. Because the types of biological and technical sequences we need to include in the index can vary between experiments, and because we often want to use the most up-to-date reference sequences from Ensembl or NCBI, we can end up re-making the indices quite often.
\n
Time to use Alevin now! Alevin works under the same indexing scheme as Salmon for the reference, and consumes a set of FASTA/Q file(s) containing the Cellular Barcode (CB) + Unique Molecular Identifier (UMI) in one read file and the read sequence in the other. Given just the transcriptome and the raw read files, Alevin generates a cell-by-gene count matrix (in a fraction of the time compared to other tools).
\n\n\n\nAlevin works in two phases. In the first phase it quickly parses the read file containing the CB and UMI information to generate the frequency distribution of all the observed CBs, and creates a lightweight data-structure for fast-look up and correction of the CB. In the second round, Alevin utilizes the read-sequences contained in the files to map the reads to the transcriptome, identify potential PCR/sequencing errors in the UMIs, and performs hybrid de-duplication while accounting for UMI collisions. Finally, a post-abundance estimation CB whitelisting procedure is done and a cell-by-gene count matrix is generated.
\n
Alevin can be run using the following command:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-17", "source": [ "salmon-latest_linux_x86_64/bin/salmon alevin -l ISR -1 barcodes_701.fastq -2 transcript_701.fastq --dropseq -i salmon_index -p 10 -o alevin_output --tgMap map --freqThreshold 3 --keepCBFraction 1 --dumpFeatures" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-18", "source": "All the required input parameters are described in the documentation, but for the ease of use, they are presented below as well:
- -l: library type (the same as for Salmon); we recommend using ISR for both Drop-seq and 10x v2 chemistry.
- -1: CB+UMI file(s); Alevin requires the path to the FASTQ file containing the raw CB+UMI sequences under this flag. Alevin also supports parsing data from multiple files, as long as the order is the same as in the -2 flag. That’s our barcodes_701.fastq file.
- -2: read-sequence file(s); Alevin requires the path to the FASTQ file containing the raw read sequences under this flag. Alevin also supports parsing data from multiple files, as long as the order is the same as in the -1 flag. That’s our transcript_701.fastq file.
- --dropseq/--chromium/--chromiumV3: the protocol; this flag tells Alevin the single-cell protocol of the input sequencing library. This study uses the Drop-seq chemistry, so that is the flag we specify.
- -i: index; the directory containing the Salmon index of the reference transcriptome, as generated by the salmon index command.
- -p: number of threads Alevin can use for the quantification. By default Alevin uses all available threads in the system, although we recommend ~10 threads, which in our testing gave the best memory-time trade-off.
- -o: output; the path to the folder where the output gene-count matrix (along with other metadata) will be written. We simply call it alevin_output.
- --tgMap: transcript-to-gene map file; a TSV (tab-separated) file with no header and two columns, mapping each transcript present in the reference to its corresponding gene (the first column is the transcript, the second is the corresponding gene). In our case, that’s the map file generated by the gtf2featureAnnotation.R step.
- --freqThreshold: minimum frequency for a barcode to be considered. We’ve chosen 3, as this only removes cell barcodes with a frequency of less than 3 - a low bar to pass, but a useful way of avoiding processing a bunch of almost certainly empty barcodes.
- --keepCBFraction: fraction of cellular barcodes to keep. We’re using 1 to quantify them all!
- --dumpFeatures: if activated, Alevin dumps all the features used by the CB classification and their counts at each cell level. It’s generally used together with other command-line flags.
We have also added the extra parameters --freqThreshold and --keepCBFraction; their values are taken from the QC step of the Alevin Galaxy tutorial and stop Alevin from applying its own thresholds. However, if you’re not sure what values to pick, you can simply allow Alevin to make its own calls on what constitutes empty droplets.
This tool will take a while to run. Alevin produces many output files, not all of which we’ll use. You can refer to the Alevin documentation if you’re curious what they all are, and you can look through the different files for information such as the mapping rate, but we’ll simply pass the whole output folder to the downstream analysis.
\n\n\nQuestion\n\n
\n- Can you find what the mapping rate was?
\n- How many transcripts did Alevin find?
\n\n👁 View solution
\n\n\n
\n- As mentioned above, the alevin_output folder contains many different files, including the log files. To check the mapping rate, go to alevin_output -> logs and open the salmon_quant file. There you will find not only the mapping rate, but also many more details from the Salmon run.
\n- The Alevin log can be found in alevin_output -> alevin, and the file is also called alevin. There you can find many details about the Alevin run, including the number of transcripts found.
\n
\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-19", "source": [ "# get the output folder from Alevin in case yours crashes\n", "wget https://zenodo.org/records/13732502/files/alevin_output.zip\n", "unzip alevin_output.zip" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-20", "source": "\nWarning: Process stopping\nThe command above will display the log of the process and will say “Analyzed X cells (Y% of all)”. For some reason, running Alevin may sometimes cause problems in Jupyter Notebook and this process will stop and not go to completion. This is the reason why we use hugely subsampled dataset here - bigger ones couldn’t be fully analysed (they worked fine locally though). The dataset used in this tutorial shouldn’t make any issues when you’re using Jupyter notebook through galaxy.eu, however might not work properly on galaxy.org. If you’re accessing Jupyter notebook via galaxy.eu and alevin process stopped, just restart the kernel (in the upper left corner: Kernel -> Restart kernel…) and that should help. However, if it doesn’t, here are the alternatives:
\n\n
\n- We prepared a Jupyter notebook with the same workflow as shown here, compatible with Google Colab. So if you have trouble running the step above, you can use the provided notebook, upload it to Google Colab and run the code there, following the tutorial on GTN.
\n- You can download the output folder that you would get as a result of running the code above. Simply run the code below.
\n
Let’s change gears a little. We’ve done the work in bash, and now we’re switching to R to complete the processing. To do so, you have to change the kernel to R (either click on Kernel
-> Change Kernel...
in the upper left corner of your JupyterLab or click on the displayed current kernel in the upper right corner and change it).
Now, when we are in the R environment, let’s install the packages required for this part - tximeta
and DropletUtils
:
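If tximeta and DropletUtils are not already installed in your environment, a minimal installation sketch is below (it mirrors the BiocManager fallback shown a little further on, and can take a while to run):
if (!require(\"BiocManager\", quietly = TRUE))
    install.packages(\"BiocManager\")
BiocManager::install(c(\"tximeta\", \"DropletUtils\"))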
Double check that the installation went fine:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-23", "source": [ "require(\"tximeta\")\n", "require(\"DropletUtils\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-24", "source": "\n\n\nIf you encounter this error message while loading the library, you can try to install them with BiocManager, as shown below. Just replace “package_name” with either
\ntximeta
orDropletUtils
. Beware, it can take up to 25 minutes to run!\n# run the code below only if you cannot load tximeta library (or using notebook version 1.0.1)\nif (!require(\"BiocManager\", quietly = TRUE))\n install.packages(\"BiocManager\")\n\nBiocManager::install(\"package_name\")\nlibrary(\"package_name\")\n
The tximeta package created by {% cite Love2020 %} is used for import of transcript-level quantification data into R/Bioconductor and requires that the entire output of alevin is present and unmodified.
\nIn the alevin_output -> alevin folder you can find the following files:
\nWe will only focus on quants_mat.gz though. First, let’s specify the path to that file:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-25", "source": [ "path <- 'alevin_output/alevin/quants_mat.gz'" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-26", "source": "We will specify the following arguments when running tximeta:
\nWith that we can create a dataframe and pass it to tximeta to create SummarizedExperiment object.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-27", "source": [ "coldata <- data.frame(files = path, names=\"sample701\")\n", "alevin_se <- tximeta(coldata, type = \"alevin\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-28", "source": "Inspect the created object:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-29", "source": [ "alevin_se" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-30", "source": "As you can see, rowData names and colData names are still empty. Before we add some metadata, we will first identify barcodes that correspond to non-empty droplets.
\nSome sub-populations of small cells may not be distinguished from empty droplets based purely on counts by barcode. Some libraries produce multiple ‘knees’ (see the Alevin Galaxy tutorial for multiple sub-populations. The emptyDrops
method ({% cite Lun2019 %}) has become a popular way of dealing with this. emptyDrops
still retains barcodes with very high counts, but also adds in barcodes that can be statistically distinguished from the ambient profiles, even if total counts are similar.
EmptyDrops takes multiple arguments that you can read about in the documentation. However, in this case, we will only specify the following arguments:
\n\n\n\n\n
\n- \n
m
-\tA numeric matrix-like object - usually a dgTMatrix or dgCMatrix - containing droplet data prior to any filtering or cell calling. Columns represent barcoded droplets, rows represent genes.- \n
lower
- A numeric scalar specifying the lower bound on the total UMI count, at or below which all barcodes are assumed to correspond to empty droplets.- \n
niters
- An integer scalar specifying the number of iterations to use for the Monte Carlo p-value calculations.- \n
retain
- A numeric scalar specifying the threshold for the total UMI count above which all barcodes are assumed to contain cells.
Let’s then extract the matrix from our alevin_se
object. It’s stored in assays -> counts.
And now run emptyDrops:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-33", "source": [ "# Identify likely cell-containing droplets\n", "out <- emptyDrops(matrix_alevin, lower = 100, niters = 1000, retain = 20)\n", "out" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-34", "source": "We also correct for multiple testing by controlling the false discovery rate (FDR) using the Benjamini-Hochberg (BH) method ({% cite Benjamini1995 %}). Putative cells are defined as those barcodes that have significantly poor fits to the ambient model at a specified FDR threshold. Here, we will use an FDR threshold of 0.01. This means that the expected proportion of empty droplets in the set of retained barcodes is no greater than 1%, which we consider to be acceptably low for downstream analyses.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-35", "source": [ "is.cell <- out$FDR <= 0.01\n", "sum(is.cell, na.rm=TRUE) # check how many cells left after filtering" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-36", "source": "We got rid of the background droplets containing no cells, so now we will filter the matrix that we passed on to emptyDrops, so that it corresponds to the remaining cells.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-37", "source": [ "emptied_matrix <- matrix_alevin[,which(is.cell),drop=FALSE] # filter the matrix" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-38", "source": "From here, we can move on to adding metadata and we’ll return to emptied_matrix
soon.
The cells barcodes are stored in colnames. Let’s exctract them into a separate object:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-39", "source": [ "barcode <- colnames(alevin_se)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-40", "source": "Now, we can simply add those barcodes into colData names where we will keep the cell metadata. To do this, we will create a column called barcode
in colData and pass the stored values into there.
That’s only cell barcodes for now! However, after running emptyDrops, we generated lots of cell information that is currently stored in out
object (Total, LogProb, PValue, Limited, FDR). Let’s add those values to cell metadata! Since we already have barcodes in there, we will simply bind the emptyDrops output to the existing dataframe:
As you can see, the new columns were appended successfully and now the dataframe has 6 columns.
\nIf you have a look at the Experimental Design from that study, you might notice that there is actually more information about the cells. The most important for us would be batch, genotype and sex, summarised in the small table below.
\nIndex | \nBatch | \nGenotype | \nSex | \n
---|---|---|---|
N701 | \n0 | \nwildtype | \nmale | \n
N702 | \n1 | \nknockout | \nmale | \n
N703 | \n2 | \nknockout | \nfemale | \n
N704 | \n3 | \nwildtype | \nmale | \n
N705 | \n4 | \nwildtype | \nmale | \n
N706 | \n5 | \nwildtype | \nmale | \n
N707 | \n6 | \nknockout | \nmale | \n
We are currently analysing sample N701, so let’s add its information from the table.
\nWe will label batch as an integer - “0” for sample N701, “1” for N702 etc. The way to do it is creating a list with zeros of the length corresponding to the number of cells that we have in our SummarizedExperiment object, like so:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-45", "source": [ "batch <- rep(\"0\", length(colnames(alevin_se)))" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-46", "source": "And now create a batch slot in the colData names and append the batch
list in the same way as we did with barcodes:
A new column appeared, full of zeros - as expected!
\nIt’s all the same for genotype, but instead creating a list with zeros, we’ll create a list with string “wildtype” and append it into genotype slot:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-49", "source": [ "genotype <- rep(\"wildtype\", length(colnames(alevin_se)))\n", "colData(alevin_se)$genotype <- genotype" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-50", "source": "You already know what to do, right? A list with string “male” and then adding it into a new slot!
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-51", "source": [ "sex <- rep(\"male\", length(colnames(alevin_se)))\n", "colData(alevin_se)$sex <- sex" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-52", "source": "Check if all looks fine:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-53", "source": [ "colData(alevin_se)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-54", "source": "3 new columns appeared with the information that we’ve just added - perfect! You can add any information you need in this way, as long as it’s the same for all the cells from one sample.
\nThe genes IDs are stored in rownames. Let’s exctract them into a separate object:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-55", "source": [ "gene_ID <- rownames(alevin_se)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-56", "source": "Analogically, we will add those genes IDs into rowData names which stores gene metadata. To do this, we will create a column called gene_ID
in rowData and pass the stored values into there.
Since gene symbols are much more informative than only gene IDs, we will add them to our metadata. We will base this annotation on Ensembl - the genome database – with the use of the library BioMart. We will use the archive Genome assembly GRCm38 to get the gene names. Please note that the updated version (GRCm39) is available, but some of the gene IDs are not in that EnsEMBL database. The code below is written in a way that it will work for the updated dataset too, but will produce ‘NA’ where the corresponding gene name couldn’t be found.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-59", "source": [ "# get relevant gene names\n", "library(\"biomaRt\") # load the BioMart library\n", "ensembl.ids <- gene_ID\n", "mart <- useEnsembl(biomart = \"ENSEMBL_MART_ENSEMBL\") # connect to a specified BioMart database and dataset hosted by Ensembl\n", "ensembl_m = useMart(\"ensembl\", dataset=\"mmusculus_gene_ensembl\")\n", "\n", "# The line above connects to a specified BioMart database and dataset within this database.\n", "# In our case we choose the mus musculus database and to get the desired Genome assembly GRCm38,\n", "# we specify the host with this archive. If you want to use the most recent version of the dataset, just run:\n", "# ensembl_m = useMart(\"ensembl\", dataset=\"mmusculus_gene_ensembl\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-60", "source": "\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-61", "source": [ "genes <- getBM(attributes=c('ensembl_gene_id','external_gene_name'),\n", " filters = 'ensembl_gene_id',\n", " values = ensembl.ids,\n", " mart = ensembl_m)\n", "\n", "# The line above retrieves the specified attributes from the connected BioMart database;\n", "# 'ensembl_gene_id' are genes IDs,\n", "# 'external_gene_name' are the genes symbols that we want to get for our values stored in ‘ensembl.ids’." ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-62", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-63", "source": [ "# see the resulting data\n", "head(genes)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." 
], "id": "" } } }, { "id": "cell-64", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-65", "source": [ "# replace IDs for gene names\n", "gene_names <- ensembl.ids\n", "count = 1\n", "for (geneID in gene_names)\n", "{\n", " index <- which(genes==geneID) # finds an index of geneID in the genes object created by getBM()\n", " if (length(index)==0) # condition in case if there is no corresponding gene name in the chosen dataset\n", " {\n", " gene_names[count] <- 'NA'\n", " }\n", " else\n", " {\n", " gene_names[count] <- genes$external_gene_name[index] \t# replaces gene ID by the corresponding gene name based on the found geneID’s index\n", " }\n", " count = count + 1 # increased count so that every element in gene_names is replaced\n", "}" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-66", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-67", "source": [ "# add the gene names into rowData in a new column gene_name\n", "rowData(alevin_se)$gene_name <- gene_names" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-68", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-69", "source": [ "# see the changes\n", "rowData(alevin_se)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-70", "source": "Warning: Ensembl connection problems\nSometimes you may encounter some connection issues with Ensembl. To improve performance Ensembl provides several mirrors of their site distributed around the globe. When you use the default settings for useEnsembl() your queries will be directed to your closest mirror geographically. In theory this should give you the best performance, however this is not always the case in practice. 
For example, if the nearest mirror is experiencing many queries from other users it may perform poorly for you. In such cases, the other mirrors should be chosen automatically.
\n
If you are working on your own data and it’s not mouse data, you can check available datasets for other species and just use relevant dataset in useMart()
function.
We can also flag mitochondrial genes. Usually those are the genes whose name starts with ‘mt-‘ or ‘MT-‘. Therefore, we will store those characters in mito_genes_names
and then use grepl() to find those genes stored in gene_name slot.
Now we can add another slot in rowData and call it mito that will keep boolean values (true/false) to indicate which genes are mitochondrial.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-75", "source": [ "rowData(alevin_se)$mito <- mito\n", "rowData(alevin_se)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-76", "source": "Let’s go back to the emptied_matrix
object. Do you remember how many cells were left after filtering? We can check that by looking at the matrix’ dimensions:
We’ve gone from 3608 to 35 cells. We’ve filtered the matrix, but not our SummarizedExperiment. We can subset alevin_se
based on the cells that were left after filtering. We will store them in a separate list, as we did with the barcodes:
And now we can subset our SummarizedExperiment based on the barcodes that are in the retained_cells
list:
And that’s our subset, ready for downstream analysis!
\nWe’ve done the analysis for one sample. But there are 7 samples in this experiment and it would be very handy to have all the information in one place. Therefore, you would need to repeat all the steps for the subsequent samples (that’s when you’ll appreciate wrapped tools and automation in Galaxy workflows!). To make your life easier, we will show you how to combine the datasets on smaller scale. Also, to save you some time, we’ve already run Alevin on sample 702 (also subsampled to 50k reads). Let’s quickly repeat the steps we performed in R to complete the analysis of sample 702 in the same way as we did with 701.
\nBut first, we have to save the results of our hard work on sample 701!
\nSaving files is quite straightforward. Just specify which object you want to save and how you want the file to be named. Don’t forget the extension!
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-83", "source": [ "save(alevin_701, file = \"alevin_701.rdata\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-84", "source": "You will see the new file in the panel on the left.
\nNormally, at this point you would switch kernel to bash to run alevin, and then back to R to complete the analysis of another sample. Here, we are providing you with the alevin output for the next sample, which you have already downloaded and unzipped at the beginning of the tutorial when we were using bash.
\n\n\nWarning: Switching kernels & losing variables\nBe aware that every time when you switch kernel, you will lose variables you store in the objects that you’ve created, unless you save them. Therefore, if you want to switch from R to bash, make sure you save your R objects! You can then load them anytime.
\n
Above we described all the steps done in R and explained what each bit of code does. Below all those steps are in one block of code, so read carefully and make sure you understand everything!
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-85", "source": [ "## load libraries again ##\n", "library(tximeta)\n", "library(DropletUtils)\n", "library(biomaRt)\n", "\n", "## running tximeta ##\n", "path2 <- 'alevin_output_702/alevin/quants_mat.gz' # path to alevin output for N702\n", "alevin2 <- tximeta(coldata = data.frame(files = path2, names = \"sample702\"), type = \"alevin\") # create SummarizedExperiment from Alevin output\n", "\n", "## running emptyDrops ##\n", "matrix_alevin2 <- assays(alevin2)$counts # extract matrix from SummarizedExperiment\n", "out2 <- emptyDrops(matrix_alevin2, lower = 100, niters = 1000, retain = 20) # apply emptyDrops\n", "is.cell2 <- out2$FDR <= 0.01 # apply FDR threshold\n", "emptied_matrix2 <- matrix_alevin2[,which(is.cell2),drop=FALSE] # subset the matrix to the cell-containing droplets\n", "\n", "## adding cell metadata ##\n", "barcode2 <- colnames(alevin2)\n", "colData(alevin2)$barcode <- barcode2 # add barcodes\n", "\n", "colData(alevin2) <- cbind(colData(alevin2),out2) # add emptyDrops info\n", "\n", "batch2 <- rep(\"1\", length(colnames(alevin2)))\n", "colData(alevin2)$batch <- batch2 # add batch info\n", "\n", "genotype2 <- rep(\"wildtype\", length(colnames(alevin2)))\n", "colData(alevin2)$genotype <- genotype2 # add genotype info\n", "\n", "sex2 <- rep(\"male\", length(colnames(alevin2)))\n", "colData(alevin2)$sex <- sex2 # add sex info\n", "\n", "## adding gene metadata ##\n", "gene_ID2 <- rownames(alevin2)\n", "rowData(alevin2)$gene_ID <- gene_ID2\n", "\n", "# get relevant gene names\n", "ensembl.ids2 <- gene_ID2\n", "mart <- useEnsembl(biomart = \"ENSEMBL_MART_ENSEMBL\") # connect to a specified BioMart database and dataset hosted by Ensembl\n", "ensembl_m2 = useMart(\"ensembl\", dataset=\"mmusculus_gene_ensembl\")\n", "\n", "genes2 <- getBM(attributes=c('ensembl_gene_id','external_gene_name'),\n", " filters = 'ensembl_gene_id',\n", " values = ensembl.ids2,\n", " mart = ensembl_m2)\n", "\n", "# replace IDs for gene names\n", "gene_names2 <- ensembl.ids2\n", "count = 1\n", "for (geneID in gene_names2)\n", "{\n", " index <- which(genes2==geneID) # finds an index of geneID in the genes object created by getBM()\n", " if (length(index)==0) # condition in case if there is no corresponding gene name in the chosen dataset\n", " {\n", " gene_names2[count] <- 'NA'\n", " }\n", " else\n", " {\n", " gene_names2[count] <- genes2$external_gene_name[index] \t# replaces gene ID by the corresponding gene name based on the found geneID’s index\n", " }\n", " count = count + 1 # increased count so that every element in gene_names is replaced\n", "}\n", "\n", "rowData(alevin2)$gene_name <- gene_names2 # add gene names to gene metadata\n", "\n", "mito_genes_names <- '^mt-|^MT-' # how mitochondrial gene names can start\n", "mito2 <- grepl(mito_genes_names, rowData(alevin2)$gene_name) # find mito genes\n", "rowData(alevin2)$mito <- mito2 # add mitochondrial information to gene metadata\n", "\n", "## create a subset of filtered object ##\n", "retained_cells2 <- colnames(emptied_matrix2)\n", "alevin_subset2 <- alevin2[, colData(alevin2)$barcode %in% retained_cells2]\n", "\n", "alevin_702 <- alevin_subset2\n", "alevin_702" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. 
You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-86", "source": "Alright, another sample pre-processed!
\nPre-processed sample 702 is there, but keep in mind that if you switched kernels in the meantime, you still need to load sample 701 that was saved before switching kernels. It’s equally easy as saving the object:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-87", "source": [ "load(\"alevin_701.rdata\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-88", "source": "Check if it was loaded ok:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-89", "source": [ "alevin_701" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-90", "source": "Now we can combine those two objects into one using one simple command:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-91", "source": [ "alevin_combined <- cbind(alevin_701, alevin_702)\n", "alevin_combined" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-92", "source": "If you have more samples, just append them in the same way. We won’t process another sample here, but pretending that we have third sample, we would combine it like this:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-93", "source": [ "alevin_subset3 <- alevin_702 # copy dataset for demonstration purposes\n", "alevin_combined_demo <- cbind(alevin_combined, alevin_subset3)\n", "alevin_combined_demo" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-94", "source": "You get the point, right? It’s important though that the rowData names and colData names are the same in each sample.
\nIt is generally more common to use SingleCellExperiment format rather than SummarizedExperiment. The conversion is quick and easy, and goes like this:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-95", "source": [ "library(SingleCellExperiment) # might need to load this library\n", "alevin_sce <- as(alevin_combined, \"SingleCellExperiment\")\n", "alevin_sce" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-96", "source": "As you can see, all the embeddings have been successfully transfered during this conversion and believe me, SCE object will be more useful for you!
\nYou’ve already learned how to save and load objects in Jupyter notebook, let’s then save the SCE file:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-97", "source": [ "save(alevin_sce, file = \"alevin_sce.rdata\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-98", "source": "The last thing that might be useful is exporting the files into your Galaxy history. To do it… guess what! Yes - switching kernels again! But this time we choose Python 3 kernel and run the following command:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-99", "source": [ "# that's Python now!\n", "put(\"alevin_sce.rdata\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-100", "source": "A new dataset should appear in your Galaxy history now.
\nWell done! In this tutorial we have:
\nAs you might now appreciate, some tasks are much quicker and easier when run in the code, but the reproducibility and automation of Galaxy workflows is a huge advantage that helps in dealing with many samples more quickly and efficiently.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- Create a SCE object from FASTQ files, including relevant gene and cell metadata, and do it all in Jupyter Notebook!\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/alevin-commandline/tutorial.html#feedback) and check there for further resources!\n" ] } ] }