Converting H5 files for analysis in the open source scRNA-seq data visualization tool Cellenics®
Introduction
The H5 format (short for HDF5, Hierarchical Data Format version 5) is an increasingly common format for storing single-cell RNA sequencing (scRNA-seq) data. Cellranger, for example, produces output in this format by default. One of its advantages is that it stores both the count matrix and all metadata in a single file, rather than in separate features/barcodes/matrix files.
In this article, we’ll show how to convert H5 files to the features/barcodes/matrix format to be able to upload and analyze your data using the Biomage-hosted community instance of Cellenics® while we work on adding native support for H5 files.
The H5 file format is a container that can have many different things inside. So, your H5 file may be different from the file generated directly by Cellranger. Because of this, the article is divided into two parts:
The first section will show how to process standard H5 files, the direct output from cellranger count.
The second section will show how to take an arbitrary scRNA-seq H5 file, manually inspect its contents, pick what is necessary and convert them to a Cellenics®-supported format. Be mindful that this section is a bit more involved and might require a bit of manual code editing.
Processing standard H5 files - cellranger output
Standard Cellranger HDF5 files can be processed using functionality already implemented in the Seurat package. It should work out of the box for Cellranger output. If this fails, refer to the Processing non-standard H5 files section of this document.
Libraries
We need Seurat, DropletUtils, and hdf5r installed. Seurat and hdf5r are available from CRAN and can be installed with the install.packages() function. DropletUtils is available on Bioconductor; refer to Bioconductor for instructions on installing it.
library(Seurat)
library(DropletUtils)
library(hdf5r)
Processing
Set the data_dir to the folder that contains the h5 files. After that, we create a list of all H5 files in the directory, which will be converted.
data_dir <- "./"
setwd(data_dir)
# pattern is a regular expression: match names that end in ".h5"
h5_files <- list.files(data_dir, pattern = "\\.h5$")
Create an output directory to store the converted files.
output_dir <- "out"
dir.create(output_dir)
Convert the H5 files. The sample_name will be the folder name for each sample; feel free to edit as desired.
for (file in h5_files) {
  # make sample names, removing .h5
  sample_name <- sub("\\.h5$", "", basename(file))
  sample_path <- file.path(output_dir, sample_name)
  # to show progress
  print(sample_name)
  # load the count matrix twice: first with gene symbols as row names,
  # then with gene IDs as row names
  gene_names <- rownames(Seurat::Read10X_h5(file))
  counts <- Seurat::Read10X_h5(file, use.names = FALSE)
  # convert; the gene IDs are taken from the row names of counts
  DropletUtils::write10xCounts(sample_path, counts, version = "3", gene.symbol = gene_names)
}
Processing non-standard H5 files
Non-standard H5 files should be treated with care. The general idea is to manually inspect the file using the hdf5r R package or the GUI program HDFView and take note of the names of the slots that contain the necessary data: the actual counts, the genes, and the barcodes (the names of the cells). Depending on how the files were previously processed, these slots could be named differently, so there is no easy way to automate this and manual decisions must be made. All three should be single columns of integers; if a slot contains decimals, it is not one of the slots you're looking for.
The counts slot could be called anything from “data”, “counts”, or “reads” to “umi_corrected_reads” (we prefer UMI-corrected counts if available). Genes and barcodes are usually named just that: “genes” and “barcodes”.
In addition, we will need two extra slots with metadata: the gene IDs (usually Ensembl IDs; look for a vector of strings that start with “ENS” and end in a number) and the gene names (gene symbols).
More details are provided in the Define Parameters section.
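To make the inspection step concrete, here is a minimal sketch using hdf5r. It first writes a tiny stand-in file so that it can run on its own; the slot names and values in it are hypothetical, and a real file will have its own. With your own data, skip the first half and point h5_path at your file.

```r
library(hdf5r)

# Build a tiny stand-in file so the sketch is self-contained.
# All slot names and values below are made up for illustration.
h5_path <- tempfile(fileext = ".h5")
demo <- H5File$new(h5_path, mode = "w")
demo[["umi_corrected_reads"]] <- c(5L, 1L, 3L, 2L)
demo[["gene"]] <- c(0L, 2L, 1L, 2L)
demo[["barcode"]] <- c(10L, 10L, 42L, 42L)
demo[["gene_ids"]] <- c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419")
demo[["gene_names"]] <- c("TSPAN6", "TNMD", "DPM1")
demo$close_all()

# Open the file read-only and list every object with its type and size.
h5 <- H5File$new(h5_path, mode = "r")
contents <- h5$ls(recursive = TRUE)
print(contents[, c("name", "obj_type", "dataset.dims", "dataset.type_class")])

# Peek at the first values of a candidate slot before committing to it.
print(head(h5[["umi_corrected_reads"]][]))
print(head(h5[["gene_ids"]][]))

h5$close_all()
```

In the listing, the three long integer datasets of equal length are the candidates for counts, genes, and barcodes, and the two shorter string datasets are the candidates for gene IDs and gene names.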
Libraries
We need the hdf5r, data.table, Matrix, and DropletUtils packages. Use the install.packages() function for hdf5r, data.table, and Matrix, and refer to Bioconductor for instructions on how to install DropletUtils.
library(hdf5r)
library(Matrix)
library(DropletUtils)
Define parameters
Define slot names by inspecting the H5 files to be processed, using either hdf5r or HDFView. The slot names are the paths inside the H5 file that point to different pieces of information required to convert the files, such as the data or the gene names.
These slot names MUST be changed before processing. They are specific to each non-standard h5 file.
The counts, genes, and barcode slots' lengths must be the same. These three are used to build the sparse count matrix.
counts_slot should point to the actual data.
genes_slot should point to an integer vector with row indices.
barcodes_slot should point to an integer vector with column indices.
counts_slot <- "umi_corrected_reads"
genes_slot <- "gene"
barcodes_slot <- "barcode"
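The three slots must have equal lengths because together they describe the matrix in triplet form: each position holds one non-zero count, its row (gene), and its column (cell). A small sketch with made-up values, using Matrix::sparseMatrix():

```r
library(Matrix)

# Four non-zero entries of a 3-gene x 2-cell matrix (illustrative values).
counts   <- c(5L, 1L, 3L, 2L)  # the values
genes    <- c(1L, 3L, 2L, 3L)  # row index of each value
barcodes <- c(1L, 1L, 2L, 2L)  # column index of each value

# Entry k of the three vectors says: m[genes[k], barcodes[k]] = counts[k].
m <- sparseMatrix(i = genes, j = barcodes, x = counts)
print(m)
print(dim(m))
```

Here the gene in row 3 has a count of 1 in the first cell and 2 in the second; every position not listed is zero.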
The gene names and gene IDs slots should be the same length, and most likely shorter than the counts/genes/barcodes slots. These are the gene labels used when creating the 10x files.
Like the previous slots, these MUST be renamed according to the structure of the specific h5 file being processed.
gene_ids_slot should point to a character vector of gene IDs.
gene_names_slot should point to a character vector of gene symbols.
gene_ids_slot <- "gene_ids"
gene_names_slot <- "gene_names"
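One quick way to tell the two label slots apart is to check which one looks like Ensembl gene IDs. The helper below is a hypothetical heuristic, not part of any package; it assumes gene IDs of the form ENS, an optional species code, G, and 11 digits, with an optional version suffix.

```r
# Hypothetical helper: TRUE for strings shaped like Ensembl gene IDs,
# e.g. ENSG00000141510 (human) or ENSMUSG00000059552 (mouse).
looks_like_ensembl <- function(x) {
  grepl("^ENS[A-Z]*G[0-9]{11}(\\.[0-9]+)?$", x)
}

# Illustrative values: two gene IDs, one gene symbol, one versioned ID.
labels <- c("ENSG00000141510", "ENSMUSG00000059552", "TP53", "ENSG00000141510.18")
print(looks_like_ensembl(labels))

# Fraction of matching values; close to 1 suggests a gene ID slot.
print(mean(looks_like_ensembl(labels)))
```

A slot where nearly all values match is a good candidate for gene_ids_slot; a slot where almost none match is more likely to hold gene symbols.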
Bulk processing
Use this section to bulk process h5 files.
Set the data_dir to the folder that contains the h5 files. After that, we create a list of all H5 files in the directory, which will be converted. It’s important to print the h5_files variable and check that the file names are correct and that we’re processing the h5 files we want to process.
data_dir <- "./"
setwd(data_dir)
# pattern is a regular expression: match names that end in ".h5"
h5_files <- list.files(data_dir, pattern = "\\.h5$")
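Note that list.files() interprets pattern as a regular expression, not as a shell wildcard, which is one more reason to check the result. The sketch below shows how an anchored pattern behaves on a few made-up file names.

```r
# Made-up file names to test the pattern against.
candidates <- c("sample1.h5", "sample2.h5", "sample3.h5ad",
                "notes_h5", "sample4.h5.bak")

# "\\.h5$" means: a literal dot, then "h5", then the end of the name.
is_h5 <- grepl("\\.h5$", candidates)
print(candidates[is_h5])
```

Only sample1.h5 and sample2.h5 are kept; names that merely contain "h5" or use a different extension are skipped.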
Create an output directory to store the converted files.
output_dir <- file.path(data_dir, "out")
dir.create(output_dir)
Required functions
These functions do the actual work, so we need to load them. They extract the slots we defined earlier and build the sparse count matrix using them.
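extract_slots <- function(h5_path) {
  h5 <- H5File$new(h5_path, mode = "r")
  # close the file handle when the function exits, even on error
  on.exit(h5$close_all(), add = TRUE)
  counts <- h5[[counts_slot]][]
  genes <- h5[[genes_slot]][]
  barcodes <- h5[[barcodes_slot]][]
  gene_ids <- h5[[gene_ids_slot]][]
  gene_names <- h5[[gene_names_slot]][]
  # dense rank: the k-th smallest barcode value becomes column k (1-based)
  r_barcodes <- data.table::frankv(barcodes, ties.method = "dense")
  # a minimum of 0 suggests zero-based indices; shift the gene indices so that
  # rows and columns are both 1-based
  if (min(genes) == 0 || min(barcodes) == 0) {
    genes <- genes + 1L
  }
  index1 <- TRUE
  return(
    list(
      "counts" = counts,
      "genes" = genes,
      "barcodes" = barcodes,
      "r_barcodes" = r_barcodes,
      "gene_ids" = gene_ids,
      "gene_names" = gene_names,
      "index1" = index1
    )
  )
}
build_sparse_matrix <- function(slots) {
  sparse_matrix <-
    sparseMatrix(
      i = slots[["genes"]],
      j = slots[["r_barcodes"]],
      x = slots[["counts"]],
      # one row per gene label, so genes without counts keep an empty row
      dims = c(length(slots[["gene_ids"]]), max(slots[["r_barcodes"]])),
      repr = "C",
      index1 = slots[["index1"]]
    )
  return(sparse_matrix)
}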
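The dense ranking step is easier to follow on a small example. The sketch below uses made-up values and a base R expression that gives the same result as data.table::frankv(barcodes, ties.method = "dense"), so it runs without extra packages.

```r
# Raw barcode identifiers as they might appear in a file: arbitrary integers,
# one per non-zero count, not necessarily sorted or consecutive.
barcodes <- c(42L, 42L, 7L, 100L, 7L)

# Dense rank: the k-th smallest distinct value gets rank k.
col_index <- match(barcodes, sort(unique(barcodes)))
print(col_index)  # 2 2 1 3 1

# Column k of the matrix therefore belongs to the k-th smallest barcode.
col_labels <- paste0("cell_", sort(unique(barcodes)))
print(col_labels)  # "cell_7" "cell_42" "cell_100"

# Zero-based row indices (minimum of 0) need shifting by one for R.
genes <- c(0L, 2L, 1L, 0L, 2L)
row_index <- if (min(genes) == 0) genes + 1L else genes
print(row_index)  # 1 3 2 1 3
```

Because columns follow the sorted order of the distinct barcode values, any labels attached to the columns need to be in that same sorted order.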
Processing
This block converts all the h5 files detected and stored in the h5_files variable using our previously defined parameters (slot names) and functions. The sample_name will be the folder name for each sample; feel free to edit as desired.
for (file in h5_files) {
  print(file)
  # make sample names, removing .h5
  sample_name <- sub("\\.h5$", "", basename(file))
  sample_path <- file.path(output_dir, sample_name)
  # to show progress
  print(sample_name)
  # read h5 files and build sparse matrix
  slots <- extract_slots(file)
  counts <- build_sparse_matrix(slots)
  # write to files. Columns follow the dense rank of the barcodes, so the
  # labels are sorted to match the column order.
  DropletUtils::write10xCounts(sample_path,
                               counts,
                               barcodes = paste0("cell_", sort(unique(slots[["barcodes"]]))),
                               gene.id = slots[["gene_ids"]],
                               gene.symbol = slots[["gene_names"]],
                               version = "3")
}
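Before uploading, it can help to confirm that every sample folder contains the three files that write10xCounts() produces with version = "3". The sketch below is an optional check rather than part of the conversion; missing_10x_files is a hypothetical helper, and the demonstration runs on a throwaway folder that you would replace with your own output_dir.

```r
# Files written by DropletUtils::write10xCounts() with version = "3".
expected_files <- c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")

# Hypothetical helper: expected files that are absent from a sample folder.
missing_10x_files <- function(sample_dir) {
  expected_files[!file.exists(file.path(sample_dir, expected_files))]
}

# Demonstration on a throwaway folder with one deliberately incomplete sample.
demo_out <- file.path(tempdir(), "out_demo")
dir.create(file.path(demo_out, "sample_a"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_out, "sample_a", c("barcodes.tsv.gz", "matrix.mtx.gz")))

for (sample_dir in list.dirs(demo_out, recursive = FALSE)) {
  absent <- missing_10x_files(sample_dir)
  status <- if (length(absent) == 0) "ok" else paste("missing:", paste(absent, collapse = ", "))
  message(basename(sample_dir), ": ", status)
}
```

In the demonstration, sample_a is reported as missing features.tsv.gz; a correctly converted sample is reported as ok.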
Now your data should be in a format that's compatible with Cellenics®! You can analyze your data for free using the Biomage-hosted community instance of Cellenics® available at https://scp.biomage.net/