How to demultiplex a count matrix and convert it to Cellenics®-compatible 10X files
Demultiplexing is the process of separating sequenced single-cell RNA-sequencing (scRNA-seq) reads for each sample into separate files. To load your data into the Biomage-hosted community instance of Cellenics®, you'll need the raw count matrices in the shape of three files: barcodes.tsv, features.tsv and matrix.mtx files. This is a common data type processed by 10x Cell Ranger. For every sample, you'll need a different set of files, and each set of files should be in a separate folder with the sample name.
The following notebook shows a brief tutorial written in R to demultiplex a count matrix stored in a R data files (.rds) object and convert it to 10x files, which can be directly uploaded to the Biomage-hosted community instance of Cellenics®.
In this tutorial, we will use data available here: https://lambrechtslab.sites.vib.be/en/single-cell.
Please, note that in this case, the .rds file stores a count matrix, but it can potentially store many different types of data, such as a Seurat object, or a SingleCellExperiment object. This tutorial is intended as an example that can be used as a walkthrough for similar cases.
We recommend reading the .rds file with R as illustrated in this tutorial but then looking carefully at the class of the object and at its data structure. If you are dealing with an object different from the one used in this tutorial, and you're not sure how to convert it, please contact the Biomage team via the community forum (https://community.biomage.net/), and we'll be happy to assist you.
Now let's jump into our data!
1. Load data and check data structure
Set working directory and load rds object:
setwd("/Your/Directory")
data.count <- readRDS("1867-counts_cells_cohort2.rds")
IMPORTANT: Check the class of the object stored in the .rds file. Please note that if you are using a different .rds file, and the class of the object is not "dgCMatrix", this script won't work. If you're not sure how to convert it, please contact the Biomage team via the community forum (https://community.biomage.net/), and we'll be happy to assist you.
class(data.count)
Install and load required package:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DropletUtils")
library("DropletUtils")
Looking at data structure we can see that this is a dgCMatrix with gene symbols as rownames and cell barcodes as colnames:
str(data.count)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:74601072] 8 9 55 57 75 94 189 228 529 575 ...
## ..@ p : int [1:50694] 0 1665 2156 2617 4232 5889 6256 8255 8899 10514 ...
## ..@ Dim : int [1:2] 22889 50693
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:22889] "A1BG" "A1BG-AS1" "A2M" "A2M-AS1" ...
## .. ..$ : chr [1:50693] "BIOKEY_33_Pre_AAACCTGAGAGACTTA-1" "BIOKEY_33_Pre_AAACCTGAGTAGCGGT-1" "BIOKEY_33_Pre_AAACCTGCATGGTAGG-1" "BIOKEY_33_Pre_AAACCTGGTATAGGGC-1" ...
## ..@ x : num [1:74601072] 1 1 2 1 1 1 1 2 1 1 ...
## ..@ factors : list()
Let's look at the first cell barcode as an example:
colnames(data.count)[1]
## [1] "BIOKEY_33_Pre_AAACCTGAGAGACTTA-1"
In this particular case, cell barcodes consist of a prefix (BIOKEY13_Pre) and a sequence of bases (AAACCTGCAACAACCT-1) The prefixes are sample names, so we'll use them to demultiplex the data.
2. Demultiplex data based on barcodes prefixes and export as 10X files
Use a regular expression to extract prefixes (please note that if you are processing a different dataset, the regular expression may need to be modified depending on the specific colnames):
data.pfx <- gsub("(.+)_[A-Z]+-1$", "\\1", colnames(data.count), perl=TRUE)
Get unique sample names:
sample.names <- unique(data.pfx)
Check sample names:
head(sample.names)
## [1] "BIOKEY_33_Pre" "BIOKEY_33_On" "BIOKEY_38_Pre" "BIOKEY_38_On" "BIOKEY_35_Pre" "BIOKEY_35_On"
tail(sample.names)
## [1] "BIOKEY_34_Pre" "BIOKEY_34_On" "BIOKEY_39_Pre" "BIOKEY_39_On" "BIOKEY_42_Pre" "BIOKEY_42_On"
To export as 10X files that are directly compatible with the Cellenics® tool, define a function that creates a subdirectory named “demultiplexed” inside the current working directory, and save 10X data for each sample in different subfolders. If a folder named “demultiplexed” already exists, it will stop and return an error to avoid overwriting files.
Please, pay particular attention to the regular expression present in the last line of the following function. You might need to modify it if you're using a different dataset, in the same way that you did fot the first regular expression of this section. For example, if your samples are encoded in the suffix rather than in the prefix, you might need to change it from paste0("^",samples[i],".*") to paste0(samples[i],"$").
demultiplex_convert_to_10x <- function(obj, samples) {
if(class(obj) != "dgCMatrix") {
message("WARNING: this rds file does not contain a dgCMatrix! STOP RUNNING THIS SCRIPT")
message("Check the data type by running:")
message("class(data.count)")
stop()
}
if(!dir.exists(file.path(getwd(), "demultiplexed"))) {
dir.create(file.path(getwd(), "demultiplexed"))
} else {
message("WARNING! A demultiplexed directory already exists")
return()
}
for (i in 1:length(samples)) {
print(paste0("Converting sample ", samples[i]))
DropletUtils::write10xCounts(path = paste0(getwd(),"/demultiplexed/",samples[i]), x = obj[,grep(paste0("^",samples[i],".*"),colnames(obj))], type = "sparse", version="3")
}
}
Run the function:
demultiplex_convert_to_10x(obj = data.count, samples = sample.names)
Now you will find all the samples inside the “demultiplexed” folder. Each sample folder should contain 3 files named “barcodes.tsv”, “features.tsv”, and “matrix.mtx”. If the demultiplexed folder was not created, or was created but it's empty, this can be mainly for 2 to reasons:
1. The class of the object stored in your .rds file is not a dgCMatrix. Please look at the first part of section 1 of this tutorial to check the class of the object stored in your .rds file.
2. The regular expression to extract prefixes from barcodes needs to be adjusted for your specific data. Please go back to the first command of section 2 of this tutorial and modify the regular expression.
Your samples are now compatible with the Cellenics® tool! You can analyze your data for free using the Biomage-hosted community instance of Cellenics® that's available at: https://scp.biomage.net/