Preparing Seurat Data from Cellenics® for Submission to Cellxgene
Cellxgene is quickly becoming an important resource for annotated single-cell RNA sequencing datasets. As of this writing, there are more than 146 collections consisting 842 datasets. Researchers can submit their datasets to cellxgene, following the contribution guide. Submitting your datasets to cellxgene is free and provides exposure for you and your research.
Cellxgene accepts submissions with datasets in H5AD format. H5AD is a hierarchical data format, storing AnnData matrix stored in HDF5 format, a data format that is commonly used by Scanpy.
In this post, I’ll explain how to convert Seurat data, which I exported from the Biomage-hosted community instance of Cellenics®, into an H5AD file suitable for submission to cellxgene.
If you’d like to follow a long with the blogpost, you can download the files here:
Exporting Seurat object from the Biomage-hosted community instance of Cellenics®
For this blog post, I’m using a Covid-19 dataset which is available in the public datasets repository in the Biomage-hosted community instance of Cellenics®. You can easily explore this dataset from the public dataset repository available within the Data Management module. Let's assume that the dataset that you're looking to export to Cellxgene has been annotated using the Data Exploration module of Cellenics®. For this example, I have annotated the dataset with cell types identified by ScType. I’m going to submit this cell type annotation to cellxgene.
Once the cell types in the dataset are annotated, download the resulting RDS object from the project. To do this, go to Data Management click the “Download” button and choose “Processed Seurat object (.rds)”. Alternatively, you can also download the processed matrix from the resources links.
Converting Seurat object to H5AD
There are several ways to convert Seurat object to H5AD file. For this blog post, I’ll be following the tutorial by scanpy using Python. For this, we need to install two Python libraries: scanpy and anndata2ri. To install the libraries, run:
pip3 install scanpy anndata2ri
We also need to have Seurat installed in R. To do this, open R in the terminal (or R Studio) and run
install.libraries("Seurat")
Once the libraries are installed, create a Jupyter notebook. Import both these libraries in a cell and run it
import scanpy as sc
import anndata2ri
anndata2ri allows the running of R from Jupyter notebook. To enable that, we run the following piece of code in a new cell
# Activate the anndata2ri conversion between SingleCellExperiment and AnnData
anndata2ri.activate()
#Loading the rpy2 extension enables cell magic to be used
#This runs R code in jupyter notebook cells
%load_ext rpy2.ipython
Once anndata2ri is loaded, run the following code in a new cell
%%R
suppressPackageStartupMessages(library(Seurat))
# Load Seurat dataset from Seurat tutorial
seurat.data <- readRDS("processed_matrix.rds")
seurat.data
This prints out
From the output, we can see the Seurat object contains 6695 cells (samples) and 33,538 features (genes).
Now, let’s convert the Seurat object into an AnnData format. Run the code below in a new cell.
%%R -o sce_object
#convert the Seurat object to a SingleCellExperiment object
sce_object <- as.SingleCellExperiment(seurat.data)
The code above loads the Seurat library in R, and then uses it to load the RDS file containing the Seurat object. The data is then converted to a single-cell experiment object using as.SingleCellExperiment
and exposed to the Jupyter notebook environment using %%R -o sceobject
. To see the content of sce_object
, write the code below and run it
sce_object
The variable sce_object
contains an AnnData object with 6695 cells and 33,538 genes. We can see that the AnnData object contains the same number of cells and genes as the Seurat object and the conversion has succeeded. We can now use Scanpy to save the AnnData object into an H5AD file named “scdata.h5ad“. We’ll work with this H5AD file in the next section to format the data into a format suitable for CellxGene submission.
# Write SCE object into H5AD file
sc.write("scdata.h5ad", sce_object)
That’s all the code you need to convert Seurat objects into AnnData objects that you can use to work with Scanpy. You can download the Seurat to H5AD conversion notebook to see the whole script.
Formatting H5AD file for submission
With the Seurat object converted into an H5AD file, we are ready to format it according to the schema outlined in the CellXGene submission guideline. We’ll work through the recommendation points one by one. As of the writing of this blog post (Jul 2023), the following points are written in the guideline:
Formatting Requirements
We need the following collection metadata (i.e. details associated with your publication or study)
- Collection information:
- Title
- Description
- Contact: name and email
- Publication/preprint DOI: can be added later
- URLs: any additional URLs for related data or resources, such as GEO or protocols.io - can be added later
Each dataset needs the following information added to a single h5ad (AnnData 0.8) format file:
- Dataset-level metadata in uns:
- schema_version: 3.0.0
- title: title of the individual dataset
- optional: batch_condition: list of obs fields that define “batches” that a normalization or integration algorithm should be aware of
- Data in .X and raw.X:
- raw counts are required
- normalized counts are strongly recommended
- raw counts should be in raw.X if normalized counts are in .X
- if there is no normalized matrix, raw counts should be in .X
- Cell metadata in obs (for ontology term IDs, the values MUST be the most specific term available from the specified ontology):
- organism_ontology_term_id: NCBITaxon (
NCBITaxon:9606
for human,NCBITaxon:10090
for mouse)- donor_id: free-text identifier that distinguishes the unique individual that data were derived from. It is encouraged to be something not likely to be used in other studies (e.g. donor_1 is likely to not be unique in the data corpus)
- development_stage_ontology_term_id:
HsapDv
if human,MmusDv
if mouse,unknown
if information unavailable- sex_ontology_term_id:
PATO:0000384
for male,PATO:0000383
for female, orunknown
if unavailable- self_reported_ethnicity_ontology_term_id:
HANCESTRO
use multiethnicif more than one ethnicity is reported. If human and information unavailable, useunknown
. Usena
if non-human.- disease_ontology_term_id:
MONDO
orPATO:0000461f
or 'normal'- tissue_ontology_term_id:
UBERON
- cell_type_ontology_term_id:
CL
- assay_ontology_term_id:
EFO
- suspension_type: cell, nucleus, or na, as corresponding to assay. Use this table defined in the data schema for guidance. If the assay does not appear in this table, the most appropriate value MUST be selected and the curation team informed during submission so that the assay can be added to the table.
- Embeddings in obsm:
- One or more two-dimensional embeddings, prefixed with 'X_'
- Features in var & raw.var (if present):
- index is Ensembl ID
- preference is that gene have not been filtered in order to maximize future data integration efforts
CellXGene requires the existence of several keys in uns
, obs
and obsm
properties of the H5AD object, formatted to a certain standard. The values of these schemas has to conform to the specification set by cellxgene. Additional keys that are available in obs
but are not included in the schema will be added under Annotated fields
.
Additionally, the cellxgene team also have some tips for reducing file size:
Tips for reducing h5ad file size
- use compression=gzip in write_h5ad function
- matrix data stored as float32 rather than float64
- any metadata columns (obs/var) that are 64bit can probably become 32bit
- any string columns often are much smaller if made categorical
- remove any extra ‘layers’ that aren’t desired in the final version
- ensure X & raw.X are both sparse.csr_matrix if more than 50% values are zeroes
To make it easier to separate the process of formatting of the H5AD object, I’ve created another Jupyter notebook for the formatting. Let’s get started. In the new notebook, import the required library
import scanpy as sc
Read the H5AD object that we have exported in the previous section into the variable “scdata“.
scdata = sc.read("scdata.h5ad")
Print out the available keys
scdata
The H5AD object contains 3 properties: obs
, obsm
and layers
. These properties are mentioned in the cellxgene formatting requirements. However, the property uns
does not yet exist. We’ll add this property in the next step as we go through each of the formatting requirements.
Adding Dataset-level metadata
To add a new key to the Anndata object and insert the values, run
# Insert unstructured metadata
scdata.uns['schema_version'] = '3.0.0'
scdata.uns['title'] = 'GSE183716 - Covid19 MISC Dataset'
Print out the data again by running scdata
in a new cell
We can see that the property uns
has been added into the variable, containing the correct values.
2. Formatting Counts
As written in the requirement, cellxgene requires the dataset to contain raw counts data, even if the normalized data is available. If normalized data is available, it should be stored in .X, while the raw counts data should be stored in .raw.X. If normalized data is not available, the raw counts data should be stored in .X.
Currently, data is stored in the AnnData object in the .X
and layers
attributes. From our print out of scdata
, we can see that the layers
attribute contains only one key, logcounts
. To see the type of data that is stored in the object, run the scdata.X
and scdata.layers[‘logcounts’]
in separate cells.
This shows that both scdata.X
and scdata.layers[‘logcounts’]
contain data that is in the Compressed Sparse Row format. The Seurat object exported from the Biomage-hosted instance of Cellenics® contains both the raw counts and normalized counts. To see which one contains the raw and normalized counts, run print out scdata.X.data
and scdata.layers[‘logcounts’].data
scdata.X
contains integers, while scdata.layers[‘logcounts’]
contain decimal (floating point) values. Raw counts are always integers. Therefore, scdata.X
cointains the raw counts and scdata.layers[‘logcounts’]
contain the normalized count values.
Run the code below in a new cell to copy the raw counts into .raw.X
and the normalized counts into .X
using the
# Copy raw counts to .raw.X
scdata.raw = scdata.copy()
# Copy normalized counts to .X
scdata.X = scdata.layers['logcounts'].astype("float32")
While copying the normalized count, we have also transformed the type of the data from numpy.float64
to numpy.float32
. This is a recommended step to reduce file size. As we have copied over the normalized counts into .X
, we can remove .layers
to further reduce file size.
del scdata.layers
3. Annotating metadata
Metadata entries that were created in Cellenics® are exported when the Seurat object is downloaded. Out of all the metadata entries that are available, we’d like to keep 2 metadata keys: MISC_Status
and ScType.Immune.system.human
. MISC_Status
does not fit into any of the metadata entries that are defined in the cellxgene schema, so we’ll add this as a custom annotation. The values in ScType.Immune.system.human
, however, contain information about cell types that can fit into cell_type_ontology_term_id
.
The metadata keys as required by cellxgene are:
- scdata.obs['organism_ontology_term_id'] = NCBITaxon (`NCBITaxon:9606` for human, `NCBITaxon:10090` for mouse) - scdata.obs['donor_id'] = free-text identifier that distinguishes the unique individual that data were derived from. It is encouraged to be something not likely to be used in other studies (e.g. donor_1 is likely to not be unique in the data corpus) - scdata.obs['development_stage_ontology_term_id'] = `HsapDv` if human, `MmusDv` if mouse, `unknown` if information unavailable - scdata.obs['sex_ontology_term_id'] = `PATO:0000384` for male, `PATO:0000383` for female, or `unknown` if unavailable - scdata.obs['self_reported_ethnicity_ontology_term_id'] = `HANCESTRO` use 'multiethnic' if more than one ethnicity is reported. If human and information unavailable, use `unknown`. Use `na` if non-human. - scdata.obs['disease_ontology_term_id'] = `MONDO` or `PATO:0000461f` or 'normal' - scdata.obs['tissue_ontology_term_id'] = `UBERON` - scdata.obs['cell_type_ontology_term_id'] = `CL` - scdata.obs['assay_ontology_term_id'] = `EFO` - scdata.obs['suspension_type'] = `cell`, `nucleus`, or `na`, as corresponding to assay. Use this table defined in the data schema for guidance. If the assay does not appear in this table, the most appropriate value MUST be selected and the curation team informed during submission so that the assay can be added to the table.
To be able to fill in with the correct values, we’ll have to look at the publication and GEO entry corresponding to the dataset to get the corresponding information. We then have to look for the closest ontology term id in the ontologies as specified by cellxgene. These ontologies can be searched at https://www.ebi.ac.uk/ols4. Cellxgene has also provided some default values that should be used, in case the correct term does not exist.
Looking at the publication and GSE entry, there does not seem to be any clear information about the age (development_stage_ontology_term_id
), sex (sex_ontology_term_id
) or ethnicity (self_reported_ethnicity_ontology_term_id
). Therefore, we’ll set these fields to the default value
scdata.obs['sex_ontology_term_id'] = 'unknown'
scdata.obs['development_stage_ontology_term_id'] = 'unknown'
scdata.obs['self_reported_ethnicity_ontology_term_id'] = 'unknown'
The donor_id
, field is a free-text field which we can generate by ourselves. Cellxgene recommends that the value should be something which is unique between studies. The dataset already contain the field sample_name
, which already uniquely identify the donor for each of the samples. Let’s use that field for donor_id
.
scdata.obs.rename(columns={'sample_name': 'donor_id'}, inplace=True)
All three samples are from donors suffering from COVID-19-associated multisystem inflammatory syndrome in children (MIS-C). Searching for this term in OLS in the MONDO ontology we find the term ID MONDO:0100163.
We can use this for the field disease_ontology_term_id
.
scdata.obs['disease_ontology_term_id'] = 'MONDO:0100163'
The dataset examines peripheral mononuclear blood cells (PBMC), which originates from the blood tissue. Searching OLS for “blood” in the UBERON ontology, we find the term ID UBERON:0000178
. Let’s set this value for tissue_ontology_term_id.
scdata.obs['tissue_ontology_term_id'] = 'UBERON:0000178'
For cell_type_ontology_term_id
, we’d like to be specific, to highlight each of the different types of cells. We’ll base this on the annotations that have been created by ScType, which was generated after running ScType in Cellenics®. To see all unique cell types that have been generated by ScType run
for cell_type in scdata.obs['ScType.Immune.system.human'].unique():
print(cell_type)
Now we have to look up the closest CL ontology term corresponding to each cell type, and create a mapping
# Mapping custom cell type to cell type ontology
cell_type_map = {
'Non-classical monocytes' : 'CL:0000875',
'Naive CD8+ T cells': 'CL:0000900',
'Naive CD4+ T cells': 'CL:0000895',
'Classical Monocytes': 'CL:0000860',
'Plasma B cells': 'CL:0000786',
'Neutrophils': 'CL:0000775',
'Memory CD4+ T cells': 'CL:0001087',
'CD8+ NKT-like cells': 'CL:0000923',
'Pre-B cells': 'CL:0000817',
'Platelets': 'CL:0000233',
'Naive B cells': 'CL:0000788',
'Erythroid-like and erythroid precursor cells': 'CL:0001066'
}
Let’s use a for-loop to go through each of the cell type, and insert the ontology term ID. We’d also still like to keep the original ScType annotation, as the ontology might not exactly identify the cells 1-to-1. We’ll create another column called “sctype_annotation” to store the original sctype value.
for cell_name, taxonomy_id in cell_type_map.items():
scdata.obs.loc[scdata.obs['ScType.Immune.system.human'] == cell_name, ['cell_type_ontology_term_id']] = taxonomy_id
We’d also like to rename the column Sctype.Immune.system.human
to sctype_annotation
to make it easier to read.
scdata.obs.rename(columns={ 'ScType.Immune.system.human':'sctype_annotation' }, inplace=True)
The 3 samples correspond to the GEO GSM records: GSM5569718, GSM5569719, GSM5569720. From the records, we can see that the samples are processed using 10x Chromium 3’ v3. Searching for this in OLS under the EFO ontology, we find EFO:0009922. We can use this for assay_ontology_term_id
.
scdata.obs['assay_ontology_term_id'] = 'EFO:0009922'
The data contains cells. Therefore the suspsension type is ‘cell’.
scdata.obs['suspension_type'] = 'cell'
We do not need the remaining metadata keys. We can delete these to reduce the size of the dataset.
# Delete unnecessary obs
del scdata.obs['barcode']
del scdata.obs['orig.ident']
del scdata.obs['nCount_RNA']
del scdata.obs['nFeature_RNA']
del scdata.obs['percent.mt']
del scdata.obs['doublet_scores']
del scdata.obs['doublet_class']
del scdata.obs['emptyDrops_Total']
del scdata.obs['emptyDrops_LogProb']
del scdata.obs['emptyDrops_PValue']
del scdata.obs['emptyDrops_Limited']
del scdata.obs['emptyDrops_FDR']
del scdata.obs['cells_id']
del scdata.obs['seurat_clusters']
del scdata.obs['ident']
4. Embeddings in obsm
The obsm
attribute in AnnData is meant to contain multi-dimensional annotations that correspond to each of the cells that exists in the. The data contains 4 types of observations: HARMONY, PCA_FOR_HARMONY, X_pca and X_umap. Cellxgene requires the .obsm
attribute contain at least 1 or more 2-dimensional embeddings, with keys that should be prefixed with X_
. The embedding that we would like to show is X_umap
. Therefore, we do not need the other 3 embeddings. To delete them, run the code below in a new cell.
del scdata.obsm['HARMONY']
del scdata.obsm['PCA_FOR_HARMONY']
del scdata.obsm['X_pca']
To display UMAP embedding by default, we have to add default_embedding
in .uns
with the key to the .obsm
attribute. Run the following in a new cell.
scdata.uns['default_embedding'] = 'X_umap'
Saving data for submission
We have prepared the data. Print out to see that that we have all the keys by running scdata
in a new cell
eems like all the Save the data in compressed H5AD format by running
scdata.write_h5ad("submission.h5ad", compression=True)
This compresses and saves the data into submission.h5ad
.
That’s it for annotating H5AD objetcs for submission to cellxgene. You can download the H5AD formatting notebook to see the whole script.
Pre-Submission Check (Optional)
Before submitting to cellxgene, we can setup a local copy of cellxgene and load the data. Follow the instructions in the cellxgene Github README to setup a local copy of cellxgene. The open source version of cellxgene seems to have problems with Python versions later than 3.7, so I recommend setting up a virtual environment in another folder to install and run cellxgene with Python 3.7. Copy over submission.h5ad
into the folder so cellxgene has access to it.
# In a new folder
python3.7 -m venv "venv"
# Activate virtual env
source ./venv/bin/activate
# Install cellxgene
pip install cellxgene
Once you have cellxgene installed, launch the local browser by running
cellxgene launch submission.h5ad
Open up http://localhost:5005 in your browser to see the cellxgene browser. You’ll be presented with a dialog to create a user directory. As we’ll not be using it, you can input name the directory anything. For example, in this case, I name the directory ‘test’.
Once the modal closed, you should be able to see the cellxgene browser.
If the dataset has been processed and formatted correctly, all the metadata under .obs
should be displayed. As we are not using the latest version of cellxgene, there might be differences in the display in the local copy of cellxgene, and the online version.
Submission to CELLxGENE
Now that we have the data, we’re ready to start the submision. To begin your submission send an email to cellxgene@chanzuckerberg.com expressing your desire to submit a dataset. The team from cellxgene will reply to your request, vetting the dataset. To share your data, upload the dataset to a file sharing website (e.g. Google Drive, OneDrive, S3) and create a share link, the cellxgene team.
Conclusion
That’s it! Congratulations for going through this blog post. You are now ready to submit datasets you have analyzed in the Biomage-community instance of Cellenics® into cellxgene. If you have any questions, Check out the Biomage community forum . Biomage has also produced an online course for analyzing scRNAseq data using Cellenics®. Check out our blog post about the course: Unlock the Secrets of Single-Cell RNA-Seq Data Analysis with Our Comprehensive Course!