Validate & register scRNA-seq datasets
scRNA-seq measures gene expression of individual cells. It generates datasets used to define cell states associated with phenotypes.
Their analysis is typically based on data objects like AnnData, SingleCellExperiment & Seurat objects.
These objects, however, often contain non-validated metadata, making data integration hard.
In this notebook, LaminDB is used to turn AnnData objects into validated & queryable assets.
!lamin init --storage ./test-scrna --schema bionty
💡 creating schemas: core==0.46.1 bionty==0.30.0
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 18:24:01)
✅ saved: Storage(id='AXBBarQq', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-08-28 18:24:01, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-scrna
💡 did not register local instance on hub (if you want, call `lamin register`)
import lamindb as ln
import lnschema_bionty as lb
✅ loaded instance: testuser1/test-scrna (lamindb 0.51.0)
ln.track()
💡 notebook imports: lamindb==0.51.0 lnschema_bionty==0.30.0
✅ saved: Transform(id='Nv48yAceNSh8z8', name='Validate & register scRNA-seq datasets', short_name='scrna', version='0', type=notebook, updated_at=2023-08-28 18:24:03, created_by_id='DzTjkKse')
✅ saved: Run(id='CHafmKbEL0RBBfWOoz0J', run_at=2023-08-28 18:24:03, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
Human immune cells: Conde22
lb.settings.species = "human"
✅ set species: Species(id='uHJU', name='human', taxon_id=9606, scientific_name='homo_sapiens', updated_at=2023-08-28 18:24:04, bionty_source_id='WWPq', created_by_id='DzTjkKse')
Transform
(Here we skip typical transformation steps that involve filtering, normalizing, and formatting.)
Let’s look at an scRNA-seq count matrix in the form of an AnnData object:
adata = ln.dev.datasets.anndata_human_immune_cells(
    populate_registries=True  # this pre-populates registries
)
adata
AnnData object with n_obs × n_vars = 1648 × 36503
obs: 'donor', 'tissue', 'cell_type', 'assay'
var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
uns: 'cell_type_ontology_term_id_colors', 'default_embedding', 'schema_version', 'title'
obsm: 'X_umap'
Validate
Validate genes in .var
lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);
💡 using global setting species = human
✅ 36355 terms (99.60%) are validated for ensembl_gene_id
❗ 148 terms (0.40%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...
148 gene identifiers can’t be validated because they are not currently in the Gene registry. Let’s inspect them to see what to do:
inspector = lb.Gene.inspect(adata.var.index, lb.Gene.ensembl_gene_id)
💡 using global setting species = human
✅ 36355 terms (99.60%) are validated for ensembl_gene_id
❗ 148 terms (0.40%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...
💡 using global setting species = human
💡 detected 35 terms in Bionty for ensembl_gene_id: ENSG00000277475, ENSG00000198840, ENSG00000198727, ENSG00000276017, ENSG00000274175, ENSG00000278704, ENSG00000274847, ENSG00000278384, ENSG00000275249, ENSG00000198804, ENSG00000276345, ENSG00000268674, ENSG00000278817, ENSG00000277400, ENSG00000212907, ENSG00000271254, ENSG00000198938, ENSG00000273554, ENSG00000198712, ENSG00000276760, ...
💡 → add records from Bionty to your registry via .from_values()
💡 couldn't validate 113 terms: ENSG00000273888, ENSG00000253878, ENSG00000278927, ENSG00000261490, ENSG00000249860, ENSG00000256222, ENSG00000244693, ENSG00000271409, ENSG00000204092, ENSG00000228139, ENSG00000270188, ENSG00000269933, ENSG00000259855, ENSG00000112096, ENSG00000287116, ENSG00000224739, ENSG00000272880, ENSG00000282965, ENSG00000227021, ENSG00000277050, ...
💡 → if you are sure, add records to your registry via .from_values()
The log says that 35 of the non-validated IDs can be found in the Bionty reference. Let’s register them:
records = lb.Gene.from_values(inspector.non_validated, lb.Gene.ensembl_gene_id)
ln.save(records)
💡 using global setting species = human
✅ created 35 Gene records from Bionty matching ensembl_gene_id: ENSG00000198804, ENSG00000198712, ENSG00000228253, ENSG00000198899, ENSG00000198938, ENSG00000198840, ENSG00000212907, ENSG00000198886, ENSG00000198786, ENSG00000198695, ENSG00000198727, ENSG00000278704, ENSG00000277400, ENSG00000274847, ENSG00000276256, ENSG00000277630, ENSG00000278384, ENSG00000273748, ENSG00000271254, ENSG00000277475, ...
❗ did not create Gene records for 113 non-validated ensembl_gene_ids: ENSG00000112096, ENSG00000182230, ENSG00000203812, ENSG00000204092, ENSG00000215271, ENSG00000221995, ENSG00000224739, ENSG00000224745, ENSG00000225932, ENSG00000226377, ENSG00000226380, ENSG00000226403, ENSG00000227021, ENSG00000227220, ENSG00000227902, ENSG00000228139, ENSG00000228906, ENSG00000229352, ENSG00000231575, ENSG00000232196, ...
The remaining 113 are legacy IDs, not present in the current Ensembl assembly (e.g. ENSG00000112096).
We’d still like to register them:
validated = lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id)
records = [lb.Gene(ensembl_gene_id=id) for id in adata.var.index[~validated]]
ln.save(records)
💡 using global setting species = human
✅ 36390 terms (99.70%) are validated for ensembl_gene_id
❗ 113 terms (0.30%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...
Re-running validation shows the result. In the output recorded here, the 113 legacy IDs are still reported as not validated:
lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);
💡 using global setting species = human
✅ 36390 terms (99.70%) are validated for ensembl_gene_id
❗ 113 terms (0.30%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...
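The gene workflow above sorts identifiers into three groups: already in the registry, missing from the registry but present in the public reference, and unknown to both. A minimal sketch of that split with plain Python sets follows. It is not how LaminDB implements `inspect`; the function name `split_ids` and the placeholder ID `ENSG_IN_REGISTRY` are hypothetical, while the other two IDs are taken from the logs above.

```python
def split_ids(ids, registry, public_reference):
    """Split identifiers into validated, registrable-from-reference, and unknown."""
    validated, in_reference, unknown = [], [], []
    for identifier in ids:
        if identifier in registry:
            validated.append(identifier)
        elif identifier in public_reference:
            in_reference.append(identifier)
        else:
            unknown.append(identifier)
    return validated, in_reference, unknown


# Hypothetical placeholder for a gene that is already registered.
registry = {"ENSG_IN_REGISTRY"}
# Toy stand-in for the public reference (assumption, not the real Bionty content).
public_reference = {"ENSG_IN_REGISTRY", "ENSG00000198804"}
ids = ["ENSG_IN_REGISTRY", "ENSG00000198804", "ENSG00000112096"]

validated, in_reference, unknown = split_ids(ids, registry, public_reference)
print("validated:", validated)
print("found in reference:", in_reference)
print("unknown (e.g. legacy IDs):", unknown)
```

In this picture, the second group corresponds to records that can be created from the reference, and the third group to records that have to be created manually if they should be tracked at all.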
Validate metadata in .obs
adata.obs.columns
Index(['donor', 'tissue', 'cell_type', 'assay'], dtype='object')
validated = ln.Feature.validate(adata.obs.columns)
✅ 3 terms (75.00%) are validated for name
❗ 1 term (25.00%) is not validated for name: donor
1 feature is not validated: "donor". Let’s register it:
feature = ln.Feature.from_df(adata.obs.loc[:, ~validated])[0]
ln.save(feature)
All metadata columns are now validated as features:
ln.Feature.validate(adata.obs.columns);
✅ 4 terms (100.00%) are validated for name
Next, let’s validate the corresponding labels of each feature.
Some of the metadata labels can be typed using dedicated registries like CellType:
validated = lb.CellType.validate(adata.obs.cell_type)
❗ received 32 unique terms, 1616 empty/duplicated terms are ignored
✅ 30 terms (93.80%) are validated for name
❗ 2 terms (6.20%) are not validated for name: germinal center B cell, megakaryocyte
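The counts in this log follow from validating unique terms only: 1648 observations collapse to 32 unique labels, so 1616 entries are ignored. A minimal sketch of such a summary is shown here. The function `summarize_validation`, the `type_i` labels, and the rounding rule (one decimal, printed with two) are assumptions chosen to reproduce the numbers in the log, not LaminDB's actual code.

```python
def summarize_validation(values, registry):
    """Summarize validation of values against a set of known terms."""
    # Drop empty entries and duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(v for v in values if v))
    n_valid = sum(v in registry for v in unique)
    n_invalid = len(unique) - n_valid

    def pct(n):
        # Assumed formatting: round to one decimal, print with two.
        return f"{round(n / len(unique) * 100, 1):.2f}%" if unique else "n/a"

    return {
        "unique": len(unique),
        "ignored": len(values) - len(unique),
        "validated": f"{n_valid} terms ({pct(n_valid)})",
        "not_validated": f"{n_invalid} terms ({pct(n_invalid)})",
    }


# Hypothetical data: 32 distinct labels over 1648 observations, 30 of them known.
known = {f"type_{i}" for i in range(30)}
labels = [f"type_{i}" for i in range(32)] * 51 + [""] * 16

summary = summarize_validation(labels, known)
print(summary)
```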
Register the non-validated cell types. They can all be loaded from a public ontology through Bionty:
nonval_cell_type_records = lb.CellType.from_values(
    adata.obs.cell_type[~validated], "name"
)
ln.save(nonval_cell_type_records)
✅ created 2 CellType records from Bionty matching name: germinal center B cell, megakaryocyte
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='uMLhrmbZ', name='germinal center B cell', ontology_id='CL:0000844', synonyms='GC B-cell|GC B cell|GC B lymphocyte|germinal center B lymphocyte|GC B-lymphocyte|germinal center B-cell|germinal center B-lymphocyte', description='A Rapidly Cycling Mature B Cell That Has Distinct Phenotypic Characteristics And Is Involved In T-Dependent Immune Responses And Located Typically In The Germinal Centers Of Lymph Nodes. This Cell Type Expresses Ly77 After Activation.', updated_at=2023-08-28 18:24:25, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000785
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='0I51jgPp', name='mature B cell', ontology_id='CL:0000785', synonyms='mature B lymphocyte|mature B-cell|mature B-lymphocyte', description='A B Cell That Is Mature, Having Left The Bone Marrow. Initially, These Cells Are Igm-Positive And Igd-Positive, And They Can Be Activated By Antigen.', updated_at=2023-08-28 18:24:25, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0001201
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='CIS4VJI0', name='B cell, CD19-positive', ontology_id='CL:0001201', synonyms='CD19+ B cell|B lymphocyte, CD19-positive|B-lymphocyte, CD19-positive|CD19-positive B cell|B-cell, CD19-positive', description='A B Cell That Is Cd19-Positive.', updated_at=2023-08-28 18:24:26, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000236
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='cx8VcggA', name='B cell', ontology_id='CL:0000236', synonyms='B-cell|B lymphocyte|B-lymphocyte', description='A Lymphocyte Of B Lineage That Is Capable Of B Cell Mediated Immunity.', updated_at=2023-08-28 18:24:32, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000945
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='Z0yFV7vU', name='lymphocyte of B lineage', ontology_id='CL:0000945', description='A Lymphocyte Of B Lineage With The Commitment To Express An Immunoglobulin Complex.', updated_at=2023-08-28 18:24:32, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='UrtDirMx', name='megakaryocyte', ontology_id='CL:0000556', synonyms='megalocaryocyte|megalokaryocyte|megacaryocyte', description='A Large Hematopoietic Cell (50 To 100 Micron) With A Lobated Nucleus. Once Mature, This Cell Undergoes Multiple Rounds Of Endomitosis And Cytoplasmic Restructuring To Allow Platelet Formation And Release.', updated_at=2023-08-28 18:24:25, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000763
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='g1zY6vUW', name='myeloid cell', ontology_id='CL:0000763', description='A Cell Of The Monocyte, Granulocyte, Mast Cell, Megakaryocyte, Or Erythroid Lineage.', updated_at=2023-08-28 18:24:33, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000988
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='Q0aQr5JB', name='hematopoietic cell', ontology_id='CL:0000988', synonyms='haematopoietic cell|hemopoietic cell|haemopoietic cell', description='A Cell Of A Hematopoietic Lineage.', updated_at=2023-08-28 18:24:33, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ loaded 1 CellType record matching ontology_id: CL:0000548
✅ created 1 CellType record from Bionty matching ontology_id: CL:0002371
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='QMAH6IlS', name='somatic cell', ontology_id='CL:0002371', description='A Cell Of An Organism That Does Not Pass On Its Genetic Material To The Organism'S Offspring (I.E. A Non-Germ Line Cell).', updated_at=2023-08-28 18:24:34, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ loaded 1 CellType record matching ontology_id: CL:0000548
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000003
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='VT73gpK2', name='native cell', ontology_id='CL:0000003', description='A Cell That Is Found In A Natural Setting, Which Includes Multicellular Organism Cells 'In Vivo' (I.E. Part Of An Organism), And Unicellular Organisms 'In Environment' (I.E. Part Of A Natural Environment).', updated_at=2023-08-28 18:24:34, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000000
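The log above shows that saving a cell type also saves its ontology parents, recursively, and that already registered ancestors are loaded rather than created again. The idea can be sketched as a recursive walk with a `seen` set. This is only an illustration: the `parents` map is a simplified, single-parent toy read off the order of the log (the real Cell Ontology has more parents per term), and `collect_ancestors` is a hypothetical helper, not a LaminDB function.

```python
def collect_ancestors(term, parents, seen=None):
    """Return all ancestors of `term`, visiting each one only once."""
    seen = set() if seen is None else seen
    ordered = []
    for parent in parents.get(term, []):
        if parent in seen:
            continue  # already handled, comparable to "loaded" instead of "created"
        seen.add(parent)
        ordered.append(parent)
        ordered.extend(collect_ancestors(parent, parents, seen))
    return ordered


# Toy parent map (assumption): simplified from the order of records in the log.
parents = {
    "germinal center B cell": ["mature B cell"],
    "mature B cell": ["B cell, CD19-positive"],
    "B cell, CD19-positive": ["B cell"],
    "B cell": ["lymphocyte of B lineage"],
    "megakaryocyte": ["myeloid cell"],
    "myeloid cell": ["hematopoietic cell"],
}

seen = set()
b_cell_ancestors = collect_ancestors("germinal center B cell", parents, seen)
repeat = collect_ancestors("germinal center B cell", parents, seen)
print(b_cell_ancestors)
print(repeat)  # empty: every ancestor was already visited
```

Sharing `seen` across calls is what makes the recursion a one-time cost, in line with the log message that it "only happens once".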
lb.ExperimentalFactor.validate(adata.obs.assay)
lb.Tissue.validate(adata.obs.tissue);
✅ 3 terms (100.00%) are validated for name
✅ 17 terms (100.00%) are validated for name
Because we didn’t mount a custom schema that contains a Donor registry, we use the Label registry to track donor ids:
ln.Label.validate(adata.obs["donor"]);
❗ received 12 unique terms, 1636 empty/duplicated terms are ignored
❗ 12 terms (100.00%) are not validated for name: D496, 621B, A29, A36, A35, 637C, A52, A37, D503, 640C, A31, 582C
Donor labels are not validated, so let’s register them:
donors = [ln.Label(name=name) for name in adata.obs["donor"].unique()]
ln.save(donors)
ln.Label.validate(adata.obs["donor"]);
✅ 12 terms (100.00%) are validated for name
Validate external metadata
In addition to what’s already in the file, we’d like to link this file to external features including “species” and “assay”:
ln.Feature.validate("species")
ln.Feature.validate("assay");
✅ 1 term (100.00%) is validated for name
✅ 1 term (100.00%) is validated for name
Let’s search for the scRNA-seq assay label:
lb.ExperimentalFactor.search("scRNA-seq").head(2)
name | id | synonyms | __ratio__
---|---|---|---
single-cell RNA sequencing | 068T1Df6 | single-cell RNA-seq\|scRNA-seq\|single cell RNA ... | 100.000000
10x 3' v3 | Vep0itYq | 10X 3' v3 | 11.111111
scrna = lb.ExperimentalFactor.filter(id="068T1Df6").one()
Register
Register data
When we create a File object from an AnnData, we’ll automatically link its feature sets and get information about unmapped categories:
file = ln.File.from_anndata(
    adata, description="Conde22", var_ref=lb.Gene.ensembl_gene_id
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/ZYlkASClrCMvJB45qd7X.h5ad')
💡 parsing feature names of X stored in slot 'var'
💡 using global setting species = human
✅ 36390 terms (99.70%) are validated for ensembl_gene_id
❗ 113 terms (0.30%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...
💡 using global setting species = human
✅ linked: FeatureSet(id='25aYS2IH5FAbY5M7wKd1', n=36390, type='float', registry='bionty.Gene', hash='rMZltwoBCMdVPVR8x6nJ', created_by_id='DzTjkKse')
💡 parsing feature names of slot 'obs'
✅ 4 terms (100.00%) are validated for name
✅ linked: FeatureSet(id='pHprlKs9J0dwl5OVeE6O', n=4, registry='core.Feature', hash='j5VHRKvBonP-MkZ4Tk_5', modality_id='1IXHv4eM', created_by_id='DzTjkKse')
file.save()
✅ saved 2 feature sets for slots: 'var','obs'
✅ storing file 'ZYlkASClrCMvJB45qd7X' at '.lamindb/ZYlkASClrCMvJB45qd7X.h5ad'
The file has the following 2 linked feature sets:
file.features
'var': FeatureSet(id='25aYS2IH5FAbY5M7wKd1', n=36390, type='float', registry='bionty.Gene', hash='rMZltwoBCMdVPVR8x6nJ', updated_at=2023-08-28 18:24:38, created_by_id='DzTjkKse')
'obs': FeatureSet(id='pHprlKs9J0dwl5OVeE6O', n=4, registry='core.Feature', hash='j5VHRKvBonP-MkZ4Tk_5', updated_at=2023-08-28 18:24:42, modality_id='1IXHv4eM', created_by_id='DzTjkKse')
You can further annotate a feature set with a modality:
var_feature_set = file.features.get_feature_set("var")
modalities = ln.Modality.lookup()
var_feature_set.modality = modalities.rna
var_feature_set.save()
Link metadata
Let’s now link observational metadata by adding labels to corresponding features.
cell_types = lb.CellType.from_values(adata.obs.cell_type, field="name")
efs = lb.ExperimentalFactor.from_values(adata.obs.assay, field="name")
tissues = lb.Tissue.from_values(adata.obs.tissue, field="name")
donors = ln.Label.from_values(adata.obs["donor"])
file.add_labels(cell_types, "cell_type")
file.add_labels(efs, "assay")
file.add_labels(tissues, "tissue")
file.add_labels(donors, feature="donor")
✅ linked feature 'cell_type' to registry 'bionty.CellType'
✅ linked feature 'assay' to registry 'bionty.ExperimentalFactor'
✅ linked feature 'tissue' to registry 'bionty.Tissue'
✅ linked feature 'donor' to registry 'core.Label'
Note that adding labels to an external feature will create an external feature set.
file.add_labels(lb.settings.species, feature="species")
file.add_labels(scrna, feature="assay")
✅ linked feature 'species' to registry 'bionty.Species'
✅ linked new feature 'species' together with new feature set FeatureSet(id='hahDsjzNlKcaPzm5Zc0b', n=1, registry='core.Feature', hash='gPtGuYSR8zS1IMgD2LES', updated_at=2023-08-28 18:24:43, modality_id='1IXHv4eM', created_by_id='DzTjkKse')
file.features
'var': FeatureSet(id='25aYS2IH5FAbY5M7wKd1', n=36390, type='float', registry='bionty.Gene', hash='rMZltwoBCMdVPVR8x6nJ', updated_at=2023-08-28 18:24:42, modality_id='uU4Uo0hX', created_by_id='DzTjkKse')
'obs': FeatureSet(id='pHprlKs9J0dwl5OVeE6O', n=4, registry='core.Feature', hash='j5VHRKvBonP-MkZ4Tk_5', updated_at=2023-08-28 18:24:42, modality_id='1IXHv4eM', created_by_id='DzTjkKse')
'external': FeatureSet(id='hahDsjzNlKcaPzm5Zc0b', n=1, registry='core.Feature', hash='gPtGuYSR8zS1IMgD2LES', updated_at=2023-08-28 18:24:43, modality_id='1IXHv4eM', created_by_id='DzTjkKse')
The file is now queryable by everything we linked:
file.describe()
💡 File(id='ZYlkASClrCMvJB45qd7X', key=None, suffix='.h5ad', accessor='AnnData', description='Conde22', version=None, size=28049505, hash='WEFcMZxJNmMiUOFrcSTaig', hash_type='md5', created_at=2023-08-28 18:24:42, updated_at=2023-08-28 18:24:42)
Provenance:
🗃️ storage: Storage(id='AXBBarQq', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-08-28 18:24:01, created_by_id='DzTjkKse')
💫 transform: Transform(id='Nv48yAceNSh8z8', name='Validate & register scRNA-seq datasets', short_name='scrna', version='0', type=notebook, updated_at=2023-08-28 18:24:38, created_by_id='DzTjkKse')
👣 run: Run(id='CHafmKbEL0RBBfWOoz0J', run_at=2023-08-28 18:24:03, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 18:24:01)
Features:
var (X):
🔗 index (36390, bionty.Gene.id): ['nhFjFmu1xBZI', 'SqsX0a250Sys', 'mkHhz0ai3yXT', 'NA1yysqkE6PQ', 'imiSu6lCPdLv'...]
external:
🔗 species (1, bionty.Species): ['human']
obs (metadata):
🔗 cell_type (32, bionty.CellType): ['CD16-positive, CD56-dim natural killer cell, human', 'macrophage', 'mucosal invariant T cell', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'plasmablast']
🔗 assay (4, bionty.ExperimentalFactor): ["10x 5' v2", "10x 5' v1", "10x 3' v3", 'single-cell RNA sequencing']
🔗 tissue (17, bionty.Tissue): ['thoracic lymph node', 'omentum', 'skeletal muscle tissue', 'blood', 'jejunal epithelium']
🔗 donor (12, core.Label): ['A52', 'A35', '637C', 'A36', 'D503']
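To make "queryable by everything we linked" concrete, here is a small in-memory sketch of the idea: each file carries a mapping from feature name to a set of labels, and a query keeps the files whose labels match every criterion. The `FileRecord` class, the `query` function, and the second file are hypothetical and only illustrate the concept. LaminDB stores these links in database tables and exposes its own query API.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Toy stand-in for a registered file with labels grouped by feature."""

    description: str
    labels: dict = field(default_factory=lambda: defaultdict(set))

    def add_labels(self, values, feature):
        # Accept a single label or an iterable of labels.
        if isinstance(values, str):
            values = [values]
        self.labels[feature].update(values)


def query(files, **criteria):
    """Return files that carry every requested feature=label pair."""
    return [
        f
        for f in files
        if all(label in f.labels.get(feat, ()) for feat, label in criteria.items())
    ]


conde22 = FileRecord("Conde22")
conde22.add_labels(["blood", "omentum"], feature="tissue")  # labels from the output above
conde22.add_labels(["A52", "A35"], feature="donor")
conde22.add_labels("human", feature="species")

other = FileRecord("hypothetical mouse dataset")  # made up for contrast
other.add_labels("mouse", feature="species")

hits = query([conde22, other], species="human", tissue="blood")
print([f.description for f in hits])
```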
A less well-curated dataset
Transform
Let’s now consider a dataset with less well-curated features:
pbcm68k = ln.dev.datasets.anndata_pbmc68k_reduced()
We see that this dataset is indexed by gene symbols:
pbcm68k.var.index
Index(['HES4', 'TNFRSF4', 'SSU72', 'PARK7', 'RBP7', 'SRM', 'MAD2L2', 'AGTRAP',
'TNFRSF1B', 'EFHD2',
...
'ATP5O', 'MRPS6', 'TTC3', 'U2AF1', 'CSTB', 'SUMO3', 'ITGB2', 'S100B',
'PRMT2', 'MT-ND3'],
dtype='object', name='index', length=765)
Validate
validated = lb.Gene.validate(pbcm68k.var.index, lb.Gene.symbol)
💡 using global setting species = human
✅ 695 terms (90.80%) are validated for symbol
❗ 70 terms (9.20%) are not validated for symbol: ATPIF1, C1orf228, CCBL2, RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, AC079767.4, GPX1, H1FX, SELT, ATP5I, IGJ, CCDC109B, FYB, H2AFY, FAM65B, HIST1H4C, HIST1H1E, ZNRD1, C6orf48, ...
In this case, we only want to register data with validated genes:
pbcm68k_validated = pbcm68k[:, validated].copy()
Validate cell types:
# inspect shows none of the terms are mappable
lb.CellType.inspect(pbcm68k_validated.obs["cell_type"])
# here we search the cell type names from the public ontology and grab the top match
# then add the cell type names from the pbcm68k as synonyms
celltype_bt = lb.CellType.bionty()
ontology_ids = []
mapper = {}
for ct in pbcm68k_validated.obs["cell_type"].unique():
    ontology_id = celltype_bt.search(ct).iloc[0].ontology_id
    record = lb.CellType.from_bionty(ontology_id=ontology_id)
    mapper[ct] = record.name
    record.save()
    record.add_synonym(ct)
# standardize cell type names in the dataset
pbcm68k_validated.obs["cell_type"] = pbcm68k_validated.obs["cell_type"].map(mapper)
❗ received 9 unique terms, 61 empty/duplicated terms are ignored
❗ 9 terms (100.00%) are not validated for name: Dendritic cells, CD19+ B, CD4+/CD45RO+ Memory, CD8+ Cytotoxic T, CD4+/CD25 T Reg, CD14+ Monocytes, CD56+ NK, CD8+/CD45RA+ Naive Cytotoxic, CD34+
💡 couldn't validate 9 terms: CD8+ Cytotoxic T, CD34+, CD4+/CD25 T Reg, CD4+/CD45RO+ Memory, Dendritic cells, CD14+ Monocytes, CD56+ NK, CD8+/CD45RA+ Naive Cytotoxic, CD19+ B
💡 → if you are sure, add records to your registry via .from_values()
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000451
💡 also saving parents of CellType(id='9JGbXeUA', name='dendritic cell', ontology_id='CL:0000451', description='A Cell Of Hematopoietic Origin, Typically Resident In Particular Tissues, Specialized In The Uptake, Processing, And Transport Of Antigens To Lymph Nodes For The Purpose Of Stimulating An Immune Response Via T Cell Activation. These Cells Are Lineage Negative (Cd3-Negative, Cd19-Negative, Cd34-Negative, And Cd56-Negative).', updated_at=2023-08-28 18:24:44, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000738
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='MkrH0gsX', name='leukocyte', ontology_id='CL:0000738', synonyms='white blood cell|leucocyte', description='An Achromatic Cell Of The Myeloid Or Lymphoid Lineages Capable Of Ameboid Movement, Found In Blood Or Other Tissue.', updated_at=2023-08-28 18:24:45, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='9JGbXeUA', name='dendritic cell', ontology_id='CL:0000451', synonyms='Dendritic cells', description='A Cell Of Hematopoietic Origin, Typically Resident In Particular Tissues, Specialized In The Uptake, Processing, And Transport Of Antigens To Lymph Nodes For The Purpose Of Stimulating An Immune Response Via T Cell Activation. These Cells Are Lineage Negative (Cd3-Negative, Cd19-Negative, Cd34-Negative, And Cd56-Negative).', updated_at=2023-08-28 18:24:45, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0001087
💡 also saving parents of CellType(id='6VQXlWS7', name='effector memory CD4-positive, alpha-beta T cell, terminally differentiated', ontology_id='CL:0001087', synonyms='CD4-positive TEMRA|CD4+ TEMRA', description='A Cd4-Positive, Alpha Beta Memory T Cell With The Phenotype Cd45Ra-Positive, Cd45Ro-Negative, And Ccr7-Negative.', updated_at=2023-08-28 18:24:45, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 2 CellType records from Bionty matching ontology_id: CL:4030002, CL:0000897
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='ylUbqlrS', name='effector memory CD45RA-positive, alpha-beta T cell, terminally differentiated', ontology_id='CL:4030002', synonyms='terminally differentiated effector memory cells re-expressing CD45RA|terminally differentiated effector memory CD45RA+ T cells|TEMRA cell', description='An Alpha-Beta Memory T Cell With The Phenotype Cd45Ra-Positive.', updated_at=2023-08-28 18:24:46, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000791
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='WKpZjuYS', name='mature alpha-beta T cell', ontology_id='CL:0000791', synonyms='mature alpha-beta T-lymphocyte|mature alpha-beta T lymphocyte|mature alpha-beta T-cell', description='A Alpha-Beta T Cell That Has A Mature Phenotype.', updated_at=2023-08-28 18:24:46, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='s6Ag7R5U', name='CD4-positive, alpha-beta memory T cell', ontology_id='CL:0000897', synonyms='CD4-positive, alpha-beta memory T-cell|CD4-positive, alpha-beta memory T-lymphocyte|CD4-positive, alpha-beta memory T lymphocyte', description='A Cd4-Positive, Alpha-Beta T Cell That Has Differentiated Into A Memory T Cell.', updated_at=2023-08-28 18:24:46, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000624
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='05vQoepH', name='CD4-positive, alpha-beta T cell', ontology_id='CL:0000624', synonyms='CD4-positive, alpha-beta T lymphocyte|CD4-positive, alpha-beta T-cell|CD4-positive, alpha-beta T-lymphocyte', description='A Mature Alpha-Beta T Cell That Expresses An Alpha-Beta T Cell Receptor And The Cd4 Coreceptor.', updated_at=2023-08-28 18:24:47, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='6VQXlWS7', name='effector memory CD4-positive, alpha-beta T cell, terminally differentiated', ontology_id='CL:0001087', synonyms='CD4+ TEMRA|CD4-positive TEMRA|CD4+/CD45RO+ Memory', description='A Cd4-Positive, Alpha Beta Memory T Cell With The Phenotype Cd45Ra-Positive, Cd45Ro-Negative, And Ccr7-Negative.', updated_at=2023-08-28 18:24:47, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
β
created 1 CellType record from Bionty matching ontology_id: CL:0000910
π‘ also saving parents of CellType(id='OxsmyL44', name='cytotoxic T cell', ontology_id='CL:0000910', synonyms='cytotoxic T lymphocyte|cytotoxic T-lymphocyte|cytotoxic T-cell', description='A Mature T Cell That Differentiated And Acquired Cytotoxic Function With The Phenotype Perforin-Positive And Granzyme-B Positive.', updated_at=2023-08-28 18:24:48, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
β
created 1 CellType record from Bionty matching ontology_id: CL:0000911
β now recursing through parents: this only happens once, but is much slower than bulk saving
π‘ you can switch this off via: lb.settings.auto_save_parents = False
π‘ also saving parents of CellType(id='yvHkIrVI', name='effector T cell', ontology_id='CL:0000911', synonyms='effector T-lymphocyte|effector T-cell|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', updated_at=2023-08-28 18:24:48, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
β
created 1 CellType record from Bionty matching ontology_id: CL:0002419
β now recursing through parents: this only happens once, but is much slower than bulk saving
π‘ you can switch this off via: lb.settings.auto_save_parents = False
π‘ also saving parents of CellType(id='2C5PhwrW', name='mature T cell', ontology_id='CL:0002419', synonyms='mature T-cell|CD3e-positive T cell', description='A T Cell That Expresses A T Cell Receptor Complex And Has Completed T Cell Selection.', updated_at=2023-08-28 18:24:49, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
β
created 1 CellType record from Bionty matching ontology_id: CL:0000084
β now recursing through parents: this only happens once, but is much slower than bulk saving
π‘ you can switch this off via: lb.settings.auto_save_parents = False
π‘ also saving parents of CellType(id='BxNjby0x', name='T cell', ontology_id='CL:0000084', synonyms='T-lymphocyte|T-cell|T lymphocyte', description='A Type Of Lymphocyte Whose Defining Characteristic Is The Expression Of A T Cell Receptor Complex.', updated_at=2023-08-28 18:24:49, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
π‘ also saving parents of CellType(id='OxsmyL44', name='cytotoxic T cell', ontology_id='CL:0000910', synonyms='CD8+ Cytotoxic T|cytotoxic T lymphocyte|cytotoxic T-cell|cytotoxic T-lymphocyte', description='A Mature T Cell That Differentiated And Acquired Cytotoxic Function With The Phenotype Perforin-Positive And Granzyme-B Positive.', updated_at=2023-08-28 18:24:49, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
β
created 1 CellType record from Bionty matching ontology_id: CL:0000919
💡 also saving parents of CellType(id='ORD0dMdt', name='CD8-positive, CD25-positive, alpha-beta regulatory T cell', ontology_id='CL:0000919', synonyms='CD8+CD25+ Treg|CD8+CD25+ T-lymphocyte|CD8+CD25+ T(reg)|CD8+CD25+ T lymphocyte|CD8+CD25+ T cell|CD8-positive, CD25-positive Treg|CD8-positive, CD25-positive, alpha-beta regulatory T-lymphocyte|CD8-positive, CD25-positive, alpha-beta regulatory T-cell|CD8+CD25+ T-cell|CD8-positive, CD25-positive, alpha-beta regulatory T lymphocyte', description='A Cd8-Positive Alpha Beta-Positive T Cell With The Phenotype Foxp3-Positive And Having Suppressor Function.', updated_at=2023-08-28 18:24:50, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000795
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='oTsFrhYW', name='CD8-positive, alpha-beta regulatory T cell', ontology_id='CL:0000795', synonyms='CD8-positive, alpha-beta regulatory T-cell|CD8-positive, alpha-beta Treg|CD8-positive T(reg)|CD8-positive, alpha-beta regulatory T lymphocyte|CD8+ Treg|CD8+ T(reg)|CD8+ regulatory T cell|CD8-positive, alpha-beta regulatory T-lymphocyte|CD8-positive Treg', description='A Cd8-Positive, Alpha-Beta T Cell That Regulates Overall Immune Responses As Well As The Responses Of Other T Cell Subsets Through Direct Cell-Cell Contact And Cytokine Release.', updated_at=2023-08-28 18:24:50, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000625
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='VnKkQsME', name='CD8-positive, alpha-beta T cell', ontology_id='CL:0000625', synonyms='CD8-positive, alpha-beta T lymphocyte|CD8-positive, alpha-beta T-lymphocyte|CD8-positive, alpha-beta T-cell', description='A T Cell Expressing An Alpha-Beta T Cell Receptor And The Cd8 Coreceptor.', updated_at=2023-08-28 18:24:51, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='ORD0dMdt', name='CD8-positive, CD25-positive, alpha-beta regulatory T cell', ontology_id='CL:0000919', synonyms='CD8-positive, CD25-positive Treg|CD8-positive, CD25-positive, alpha-beta regulatory T-lymphocyte|CD8-positive, CD25-positive, alpha-beta regulatory T-cell|CD8+CD25+ T lymphocyte|CD4+/CD25 T Reg|CD8+CD25+ T(reg)|CD8+CD25+ T-lymphocyte|CD8+CD25+ T cell|CD8+CD25+ Treg|CD8+CD25+ T-cell|CD8-positive, CD25-positive, alpha-beta regulatory T lymphocyte', description='A Cd8-Positive Alpha Beta-Positive T Cell With The Phenotype Foxp3-Positive And Having Suppressor Function.', updated_at=2023-08-28 18:24:51, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0002057
💡 also saving parents of CellType(id='O0AQiAuv', name='CD14-positive, CD16-negative classical monocyte', ontology_id='CL:0002057', synonyms='CD16-negative monocyte|CD16- monocyte', description='A Classical Monocyte That Is Cd14-Positive, Cd16-Negative, Cd64-Positive, Cd163-Positive.', updated_at=2023-08-28 18:24:52, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='O0AQiAuv', name='CD14-positive, CD16-negative classical monocyte', ontology_id='CL:0002057', synonyms='CD16-negative monocyte|CD14+ Monocytes|CD16- monocyte', description='A Classical Monocyte That Is Cd14-Positive, Cd16-Negative, Cd64-Positive, Cd163-Positive.', updated_at=2023-08-28 18:24:52, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0002102
💡 also saving parents of CellType(id='Xkw89opD', name='CD38-negative naive B cell', ontology_id='CL:0002102', synonyms='CD38-negative naive B lymphocyte|CD38-negative naive B-cell|CD38- naive B-cell|CD38-negative naive B-lymphocyte|CD38- naive B lymphocyte|CD38- naive B-lymphocyte|CD38- naive B cell', description='A Cd38-Negative Naive B Cell Is A Mature B Cell That Has The Phenotype Cd38-Negative, Surface Igd-Positive, Surface Igm-Positive, And Cd27-Negative, That Has Not Yet Been Activated By Antigen In The Periphery.', updated_at=2023-08-28 18:24:52, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='Xkw89opD', name='CD38-negative naive B cell', ontology_id='CL:0002102', synonyms='CD38- naive B cell|CD38- naive B lymphocyte|CD38-negative naive B-cell|CD8+/CD45RA+ Naive Cytotoxic|CD38- naive B-lymphocyte|CD38-negative naive B-lymphocyte|CD38-negative naive B lymphocyte|CD38- naive B-cell', description='A Cd38-Negative Naive B Cell Is A Mature B Cell That Has The Phenotype Cd38-Negative, Surface Igd-Positive, Surface Igm-Positive, And Cd27-Negative, That Has Not Yet Been Activated By Antigen In The Periphery.', updated_at=2023-08-28 18:24:52, bionty_source_id='Zz1m', created_by_id='DzTjkKse')
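The log above shows that each newly created cell type also triggers saving of its parent terms, and that this recursion can be switched off. The idea can be sketched in plain Python as follows. This is only an illustration, not LaminDB's implementation: the `PARENTS` mapping is a hypothetical toy ontology, and `save_with_parents` is a made-up helper name.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "mature T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "lymphocyte": [],
}


def save_with_parents(term, registry, auto_save_parents=True):
    """Add `term` to `registry` and, optionally, all of its ancestors."""
    if term in registry:
        return registry  # already saved, so there is nothing to recurse into
    registry.add(term)
    if auto_save_parents:
        for parent in PARENTS.get(term, []):
            save_with_parents(parent, registry, auto_save_parents)
    return registry


registry = set()
save_with_parents("mature T cell", registry)
print(sorted(registry))
```

Because terms that are already in the registry return early, the walk up the hierarchy happens only once per term, which matches the "this only happens once" note in the log.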
Now, all cell types are validated:
lb.CellType.validate(pbcm68k_validated.obs["cell_type"]);
✅ 9 terms (100.00%) are validated for name
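To make the percentage in the validation message concrete, here is a minimal sketch of how such a summary could be computed over the unique values of a column. The function name `validation_summary` and the example labels are hypothetical; the real `validate` method may work differently.

```python
def validation_summary(values, registry_names):
    """Split unique values into validated and not validated terms."""
    unique = list(dict.fromkeys(values))  # de-duplicate, keep first-seen order
    validated = [v for v in unique if v in registry_names]
    not_validated = [v for v in unique if v not in registry_names]
    percent = 100 * len(validated) / len(unique) if unique else 0.0
    return {
        "validated": validated,
        "not_validated": not_validated,
        "percent": round(percent, 2),
    }


# Hypothetical labels with one misspelled entry.
labels = ["T cell", "B cell", "T cell", "Tcell"]
known = {"T cell", "B cell"}
summary = validation_summary(labels, known)
print(summary)
```

Counting unique terms rather than rows is why a column with many cells can still report a small number of terms.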
Register #
file = ln.File.from_anndata(
    pbcm68k_validated, description="10x reference pbmc68k", var_ref=lb.Gene.symbol
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/L5bU7hgOWmZM2VCa5A5I.h5ad')
💡 parsing feature names of X stored in slot 'var'
💡 using global setting species = human
✅ 695 terms (100.00%) are validated for symbol
💡 using global setting species = human
✅ linked: FeatureSet(id='ly9qQudzkdxmB6iyCQpz', n=695, type='float', registry='bionty.Gene', hash='W4ps_86b5dxk2Wd1gWTo', created_by_id='DzTjkKse')
💡 parsing feature names of slot 'obs'
✅ 1 term (25.00%) is validated for name
❗ 3 terms (75.00%) are not validated for name: n_genes, percent_mito, louvain
✅ linked: FeatureSet(id='pkRKZiJN6QOJ9jr0h7YJ', n=1, registry='core.Feature', hash='s3UdP7CYe4X5m_DR3y-Q', modality_id='1IXHv4eM', created_by_id='DzTjkKse')
file.save()
✅ saved 2 feature sets for slots: 'var','obs'
✅ storing file 'L5bU7hgOWmZM2VCa5A5I' at '.lamindb/L5bU7hgOWmZM2VCa5A5I.h5ad'
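The messages above indicate that a file registered without a key is stored under `.lamindb/` using its id and suffix. A minimal sketch of that naming rule follows. The helper `storage_key` is hypothetical, and the branch for an explicit key is an assumption, since the log only shows the `key=None` case.

```python
def storage_key(file_id, suffix, key=None):
    """Return the path of a file relative to the storage root."""
    if key is not None:
        return key  # assumption: an explicit key is used as the relative path
    return f".lamindb/{file_id}{suffix}"  # fallback shown in the log above


print(storage_key("L5bU7hgOWmZM2VCa5A5I", ".h5ad"))
```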
var_feature_set = file.features.get_feature_set("var")
var_feature_set.modality = modalities.rna
var_feature_set.save()
cell_types = lb.CellType.from_values(pbcm68k_validated.obs["cell_type"], "name")
file.add_labels(cell_types, "cell_type")
file.add_labels(lb.settings.species, feature="species")
file.add_labels(scrna, feature="assay")
✅ loaded: FeatureSet(id='hahDsjzNlKcaPzm5Zc0b', n=1, registry='core.Feature', hash='gPtGuYSR8zS1IMgD2LES', updated_at=2023-08-28 18:24:43, modality_id='1IXHv4eM', created_by_id='DzTjkKse')
✅ linked new feature 'species' together with new feature set FeatureSet(id='hahDsjzNlKcaPzm5Zc0b', n=1, registry='core.Feature', hash='gPtGuYSR8zS1IMgD2LES', updated_at=2023-08-28 18:24:53, modality_id='1IXHv4eM', created_by_id='DzTjkKse')
💡 no file links to it anymore, deleting feature set FeatureSet(id='hahDsjzNlKcaPzm5Zc0b', n=1, registry='core.Feature', hash='gPtGuYSR8zS1IMgD2LES', updated_at=2023-08-28 18:24:53, modality_id='1IXHv4eM', created_by_id='DzTjkKse')
✅ linked new feature 'assay' together with new feature set FeatureSet(id='GnSqWXm8Iav7T4whcHOU', n=2, registry='core.Feature', hash='Inp3VkpViv8yrmx_T6Cz', updated_at=2023-08-28 18:24:53, modality_id='1IXHv4eM', created_by_id='DzTjkKse')
file.features
'var': FeatureSet(id='ly9qQudzkdxmB6iyCQpz', n=695, type='float', registry='bionty.Gene', hash='W4ps_86b5dxk2Wd1gWTo', updated_at=2023-08-28 18:24:53, modality_id='uU4Uo0hX', created_by_id='DzTjkKse')
'obs': FeatureSet(id='pkRKZiJN6QOJ9jr0h7YJ', n=1, registry='core.Feature', hash='s3UdP7CYe4X5m_DR3y-Q', updated_at=2023-08-28 18:24:53, modality_id='1IXHv4eM', created_by_id='DzTjkKse')
'external': FeatureSet(id='GnSqWXm8Iav7T4whcHOU', n=2, registry='core.Feature', hash='Inp3VkpViv8yrmx_T6Cz', updated_at=2023-08-28 18:24:53, modality_id='1IXHv4eM', created_by_id='DzTjkKse')
file.describe()
💡 File(id='L5bU7hgOWmZM2VCa5A5I', key=None, suffix='.h5ad', accessor='AnnData', description='10x reference pbmc68k', version=None, size=589484, hash='eKVXV5okt5YRYjySMTKGEw', hash_type='md5', created_at=2023-08-28 18:24:53, updated_at=2023-08-28 18:24:53)
Provenance:
🗃️ storage: Storage(id='AXBBarQq', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-08-28 18:24:01, created_by_id='DzTjkKse')
💫 transform: Transform(id='Nv48yAceNSh8z8', name='Validate & register scRNA-seq datasets', short_name='scrna', version='0', type=notebook, updated_at=2023-08-28 18:24:53, created_by_id='DzTjkKse')
👣 run: Run(id='CHafmKbEL0RBBfWOoz0J', run_at=2023-08-28 18:24:03, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 18:24:01)
Features:
var (X):
🔗 index (695, bionty.Gene.id): ['SkU86c6dfr0B', 'wUYYqQhHOQ1T', 'FJ4p0HleLknx', 'mLZxpATriwGh', '54Nah0TGq83q'...]
external:
🔗 assay (1, bionty.ExperimentalFactor): ['single-cell RNA sequencing']
🔗 species (1, bionty.Species): ['human']
obs (metadata):
🔗 cell_type (9, bionty.CellType): ['CD14-positive, CD16-negative classical monocyte', 'CD16-positive, CD56-dim natural killer cell, human', 'cytotoxic T cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated']
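The `File` record above reports `hash_type='md5'` together with a 22-character `hash`. That length is consistent with a 16-byte MD5 digest encoded as base64 with the padding removed. The sketch below shows how such a value can be produced. The exact encoding LaminDB applies is an assumption here, and `md5_b64` is a hypothetical helper name.

```python
import base64
import hashlib


def md5_b64(data: bytes) -> str:
    """MD5 digest as URL-safe base64 without the trailing '==' padding."""
    digest = hashlib.md5(data).digest()  # always 16 bytes
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


example = md5_b64(b"some file content")
print(example, len(example))
```

A content hash like this lets identical files be detected regardless of their name or storage location.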
file.view_lineage()
👉 Now let's continue with data integration: Integrate scRNA-seq datasets