
Validate & register flow cytometry data#

Background#

Flow cytometry is a technique used to analyze and sort cells or particles based on their physical and chemical characteristics as they flow in a fluid stream through a laser beam.

Here, we’ll transform, validate and register two flow cytometry datasets (Alpert19 and FlowIO sample) to demonstrate how to create and query a custom flow cytometry registry.

!lamin init --storage ./test-flow --schema bionty
💡 creating schemas: core==0.46.1 bionty==0.30.0
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 18:25:36)
✅ saved: Storage(id='5rdLhmxu', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-flow', type='local', updated_at=2023-08-28 18:25:36, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-flow
💡 did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
import lnschema_bionty as lb
import readfcs

lb.settings.species = "human"
✅ loaded instance: testuser1/test-flow (lamindb 0.51.0)
✅ set species: Species(id='uHJU', name='human', taxon_id=9606, scientific_name='homo_sapiens', updated_at=2023-08-28 18:25:37, bionty_source_id='dIlS', created_by_id='DzTjkKse')
ln.track()
💡 notebook imports: lamindb==0.51.0 lnschema_bionty==0.30.0 readfcs==1.1.6
✅ saved: Transform(id='OWuTtS4SAponz8', name='Validate & register flow cytometry data', short_name='flow', version='0', type=notebook, updated_at=2023-08-28 18:25:37, created_by_id='DzTjkKse')
✅ saved: Run(id='1ZuHcutzHrGcXShQqmlK', run_at=2023-08-28 18:25:37, transform_id='OWuTtS4SAponz8', created_by_id='DzTjkKse')

Alpert19#

Transform #

(Here we skip the data transformation steps, which often include filtering, normalizing, or formatting data.)

We start with a flow cytometry file from Alpert19:

ln.dev.datasets.file_fcs_alpert19(
    populate_registries=True,  # pre-populate registries to simulate a used instance
)


PosixPath('Alpert19.fcs')

Use readfcs to read the fcs file into memory:

adata = readfcs.read("Alpert19.fcs")
adata
AnnData object with n_obs × n_vars = 166537 × 40
    var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnR'
    uns: 'meta'

Validate #

First, let’s validate the features in .var.

We’ll use the CellMarker reference to link features:

lb.CellMarker.validate(adata.var.index, "name");
💡 using global setting species = human
✅ 27 terms (67.50%) are validated for name
❗ 13 terms (32.50%) are not validated for name: Time, Cell_length, Dead, (Ba138)Dd, Bead, CD19, CD4, IgD, CD11b, CD14, CCR6, CCR7, PD-1

We see that many features aren’t validated. Let’s standardize the identifiers first to get rid of synonyms:

adata.var.index = lb.CellMarker.standardize(adata.var.index)
💡 using global setting species = human
💡 standardized 35/40 terms
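
To make the difference between validating and standardizing concrete, here is a minimal pure-Python sketch. It is not how lamindb or Bionty implement these methods; the `REFERENCE` table and both helper functions are assumptions made up for illustration, with a few canonical names and synonyms taken from the outputs in this notebook.

```python
# Toy reference registry: canonical name -> known synonyms.
# Illustrative assumption only, not the real CellMarker table.
REFERENCE = {
    "Cd19": {"CD19"},
    "Ccr7": {"CCR7"},
    "PD1": {"PD-1", "PD 1"},
    "CD3": set(),
}


def validate(names, reference=REFERENCE):
    """Return one boolean per name: True if the name is a canonical name."""
    return [name in reference for name in names]


def standardize(names, reference=REFERENCE):
    """Map known synonyms to canonical names; leave unknown names untouched."""
    synonym_to_name = {
        synonym: name
        for name, synonyms in reference.items()
        for synonym in synonyms
    }
    return [synonym_to_name.get(name, name) for name in names]


panel = ["CD3", "CD19", "PD-1", "Time"]
standardized = standardize(panel)
```

In this sketch, only `CD3` validates before standardization, while `Time` stays unvalidated afterwards because it is neither a canonical name nor a synonym.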

Great, now we can validate our markers once more:

validated = lb.CellMarker.validate(adata.var.index, "name")
💡 using global setting species = human
✅ 35 terms (87.50%) are validated for name
❗ 5 terms (12.50%) are not validated for name: Time, Cell_length, Dead, (Ba138)Dd, Bead

Things look much better, but 5 features remain unvalidated, and they look more like metadata than cell markers. Hence, let’s curate the AnnData object a bit more.

Let’s move metadata (non-validated cell markers) into adata.obs:

adata.obs = adata[:, ~validated].to_df()
adata = adata[:, validated].copy()
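
The two lines above split the columns with a boolean mask. A rough sketch of the same idea on a plain dictionary of columns, with a hypothetical helper `split_columns` and made-up values (none of this is AnnData API):

```python
def split_columns(table, keep):
    """Split a column-oriented table into (kept, moved) using a boolean mask."""
    if len(table) != len(keep):
        raise ValueError("mask length must match the number of columns")
    kept = {name: col for (name, col), flag in zip(table.items(), keep) if flag}
    moved = {name: col for (name, col), flag in zip(table.items(), keep) if not flag}
    return kept, moved


# Made-up measurements: two cell markers and two metadata-like channels.
measurements = {
    "CD3": [0.1, 2.3],
    "Time": [1.0, 2.0],
    "Cd19": [4.2, 0.0],
    "Bead": [0.0, 0.0],
}
validated_mask = [True, False, True, False]
markers, metadata = split_columns(measurements, validated_mask)
```

Note that the order in the notebook matters: the non-validated columns are copied into `.obs` before the matrix is subset to the validated ones.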

Now we have a clean panel of 35 cell markers:

lb.CellMarker.validate(adata.var.index, "name");
💡 using global setting species = human
✅ 35 terms (100.00%) are validated for name

Next, let’s register the metadata features we moved to .obs:

# Feature.from_df creates feature records with type auto-populated
features = ln.Feature.from_df(adata.obs)
ln.save(features)
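
What "type auto-populated" could mean is sketched below in plain Python. The type labels and the helper names `infer_feature_type` and `infer_feature_types` are assumptions for illustration; lamindb may infer and name types differently.

```python
def infer_feature_type(values):
    """Guess a coarse type label for one column (labels are an assumption)."""
    if not values:
        return None
    if all(isinstance(v, bool) for v in values):
        return "bool"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "int"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "float"
    return "category"


def infer_feature_types(columns):
    """Map each column name to its inferred type label."""
    return {name: infer_feature_type(values) for name, values in columns.items()}


# Made-up values for two of the metadata columns moved to .obs.
obs_columns = {"Time": [0.0, 1.5], "Cell_length": [22.0, 31.0]}
feature_types = infer_feature_types(obs_columns)
```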

In addition, we’d like to link this file with external features:

ln.Feature.validate("assay", "name")
lb.ExperimentalFactor.validate("FACS", "name");
✅ 1 term (100.00%) is validated for name
❗ 1 term (100.00%) is not validated for name: FACS

Since the term “FACS” isn’t validated yet, let’s search the public ontology for it and register the matching record:

lb.ExperimentalFactor.bionty().search("FACS").head(2)
name                                 ontology_id  definition                                         synonyms                                           parents        molecule  instrument  measurement  __ratio__
fluorescence-activated cell sorting  EFO:0009108  A Flow Cytometry Assay That Provides A Method ...  FACS|FAC sorting                                   []             None      None        None         100.000000
acute chest syndrome                 EFO:0007129  A Vaso-Occlusive Crisis Of The Pulmonary Vascu...  ACS|Acute Chest Syndrome|acute chest syndrome|...  [EFO:0003818]  None      None        None         85.714286
facs = lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0009108")
facs.save()
✅ created 1 ExperimentalFactor record from Bionty matching ontology_id: EFO:0009108
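
The search above ranks ontology terms by string similarity against names and synonyms. Here is a small sketch of that idea using `difflib.SequenceMatcher` from the standard library as a stand-in scorer; how Bionty actually computes its ratio is not stated here, so treat the scoring function as an assumption. The two records are copied from the search output, with synonyms abbreviated.

```python
from difflib import SequenceMatcher

# Two records from the search output above (synonym lists abbreviated).
ONTOLOGY = {
    "EFO:0009108": ("fluorescence-activated cell sorting", ["FACS", "FAC sorting"]),
    "EFO:0007129": ("acute chest syndrome", ["ACS", "Acute Chest Syndrome"]),
}


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(query, ontology=ONTOLOGY):
    """Rank records by the best match among their name and synonyms."""
    hits = []
    for ontology_id, (name, synonyms) in ontology.items():
        score = max(similarity(query, label) for label in [name, *synonyms])
        hits.append((score, ontology_id, name))
    return sorted(hits, reverse=True)


results = search("FACS")
best_score, best_id, best_name = results[0]
```

Matching against synonyms is what lets the short query "FACS" find a record whose name is "fluorescence-activated cell sorting".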

Adding a new modality:

modality = ln.Modality(name="protein", description="readouts of protein abundance")
modality.save()

Register #

file = ln.File.from_anndata(adata, description="Alpert19", var_ref=lb.CellMarker.name)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/S9Y7S0hYrOyHXqhl9lNK.h5ad')
💡 parsing feature names of X stored in slot 'var'
💡    using global setting species = human
✅    35 terms (100.00%) are validated for name
💡    using global setting species = human
✅    linked: FeatureSet(id='0TzA7d3tHioFmK92XGRU', n=35, type='float', registry='bionty.CellMarker', hash='ldY9_GmptHLCcT7Nrpgo', created_by_id='DzTjkKse')
💡 parsing feature names of slot 'obs'
✅    5 terms (100.00%) are validated for name
✅    linked: FeatureSet(id='jfIhmZbZdXC4QqZeVSRL', n=5, registry='core.Feature', hash='tcrxvqPZAuXUJOFQK1dn', modality_id='c5m8McK4', created_by_id='DzTjkKse')
file.save()
✅ saved 2 feature sets for slots: 'var','obs'
✅ storing file 'S9Y7S0hYrOyHXqhl9lNK' at '.lamindb/S9Y7S0hYrOyHXqhl9lNK.h5ad'
file.add_labels(facs, "assay")
file.add_labels(lb.settings.species, "species")
✅ linked new feature 'assay' together with new feature set FeatureSet(id='oMko7wSABiSzUoaLKQrV', n=1, registry='core.Feature', hash='_arQglHkRek_WGJB6rQQ', updated_at=2023-08-28 18:25:42, modality_id='c5m8McK4', created_by_id='DzTjkKse')
💡 no file links to it anymore, deleting feature set FeatureSet(id='oMko7wSABiSzUoaLKQrV', n=1, registry='core.Feature', hash='_arQglHkRek_WGJB6rQQ', updated_at=2023-08-28 18:25:42, modality_id='c5m8McK4', created_by_id='DzTjkKse')
✅ linked new feature 'species' together with new feature set FeatureSet(id='oC3tSM9huKlggYTYecMt', n=2, registry='core.Feature', hash='AA9XOJYxdoYfrGM5UoL_', updated_at=2023-08-28 18:25:42, modality_id='c5m8McK4', created_by_id='DzTjkKse')
var_feature_set = file.features.get_feature_set("var")
var_feature_set.modality = modality
var_feature_set.save()
file.features
'var': FeatureSet(id='0TzA7d3tHioFmK92XGRU', n=35, type='float', registry='bionty.CellMarker', hash='ldY9_GmptHLCcT7Nrpgo', updated_at=2023-08-28 18:25:42, modality_id='wGPaZsvc', created_by_id='DzTjkKse')
'obs': FeatureSet(id='jfIhmZbZdXC4QqZeVSRL', n=5, registry='core.Feature', hash='tcrxvqPZAuXUJOFQK1dn', updated_at=2023-08-28 18:25:42, modality_id='c5m8McK4', created_by_id='DzTjkKse')
'external': FeatureSet(id='oC3tSM9huKlggYTYecMt', n=2, registry='core.Feature', hash='AA9XOJYxdoYfrGM5UoL_', updated_at=2023-08-28 18:25:42, modality_id='c5m8McK4', created_by_id='DzTjkKse')

Check a few validated cell markers in .var:

file.features["var"].df().head(10)
id            name    synonyms              gene_symbol  ncbi_gene_id  uniprotkb_id  species_id  bionty_source_id  updated_at           created_by_id
n40112OuX7Cq  CD123                         IL3RA        3563          P26951        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
N2F6Qv9CxJch  CD11B                         ITGAM        3684          P11215        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
0evamYEdmaoY  Igd                           None         None          None          uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
L0m6f7FPiDeg  CD86                          CD86         942           A8K632        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
4EojtgN0CjBH  CD161                         KLRB1        3820          Q12918        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
yCyTIVxZkIUz  DNA2                          DNA2         1763          P51530        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
bspnQ0igku6c  CD16                          FCGR3A       2215          O75015        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
roEbL8zuLC5k  Cd14                          CD14         4695          O43678        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
2VeZenLi2dj5  PD1     PID1|PD-1|PD 1        PDCD1        5133          A0A0M3M0G7    uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
h4rkCALR5WfU  CD56                          NCAM1        4684          P13591        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse

FlowIO sample#

Let’s transform, validate and register another flow file:

Transform #

There are no further transformations necessary.

adata2 = readfcs.read(ln.dev.datasets.file_fcs())

Validate #

We’d like to track all features in .var, so we register them:

adata2.var.index = lb.CellMarker.bionty().standardize(adata2.var.index)
💡 using global setting species = human
💡 standardized 14/16 terms
markers = lb.CellMarker.from_values(adata2.var.index, "name")
ln.save(markers)
💡 using global setting species = human
✅ loaded 10 CellMarker records matching name: CD3, CD28, CD8, Cd4, CD57, Cd14, Cd19, CD27, Ccr7, CD127
✅ created 4 CellMarker records from Bionty matching name: CCR5, CD45RO, Ki67, SSC-A
❗ did not create CellMarker records for 2 non-validated names: FSC-A, FSC-H

Standardizing synonyms lets all cell markers pass validation; only the forward-scatter channels FSC-A and FSC-H remain unvalidated:

adata2.var.index = lb.CellMarker.standardize(adata2.var.index)
💡 using global setting species = human
💡 standardized 14/16 terms
lb.CellMarker.validate(adata2.var.index, "name");
💡 using global setting species = human
✅ 14 terms (87.50%) are validated for name
❗ 2 terms (12.50%) are not validated for name: FSC-A, FSC-H

Register #

file2 = ln.File.from_anndata(
    adata2, description="My fcs file", var_ref=lb.CellMarker.name
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/WOCJOiWjTLjCNaqT4dcs.h5ad')
💡 parsing feature names of X stored in slot 'var'
💡    using global setting species = human
✅    14 terms (87.50%) are validated for name
❗    2 terms (12.50%) are not validated for name: FSC-A, FSC-H
💡    using global setting species = human
✅    linked: FeatureSet(id='x9xoYkcMZX6BswBUFdsM', n=14, type='float', registry='bionty.CellMarker', hash='npy5P7AYbjKLInpXlNvb', created_by_id='DzTjkKse')
file2.save()
✅ saved 1 feature set for slot: 'var'
✅ storing file 'WOCJOiWjTLjCNaqT4dcs' at '.lamindb/WOCJOiWjTLjCNaqT4dcs.h5ad'
file2.add_labels(facs, "assay")
file2.add_labels(lb.settings.species, "species")
✅ linked new feature 'assay' together with new feature set FeatureSet(id='yBnkUgllUpuQYUqGeesF', n=1, registry='core.Feature', hash='_arQglHkRek_WGJB6rQQ', updated_at=2023-08-28 18:25:44, modality_id='c5m8McK4', created_by_id='DzTjkKse')
✅ loaded: FeatureSet(id='oC3tSM9huKlggYTYecMt', n=2, registry='core.Feature', hash='AA9XOJYxdoYfrGM5UoL_', updated_at=2023-08-28 18:25:42, modality_id='c5m8McK4', created_by_id='DzTjkKse')
✅ linked new feature 'species' together with new feature set FeatureSet(id='oC3tSM9huKlggYTYecMt', n=2, registry='core.Feature', hash='AA9XOJYxdoYfrGM5UoL_', updated_at=2023-08-28 18:25:44, modality_id='c5m8McK4', created_by_id='DzTjkKse')
var_feature_set = file2.features.get_feature_set("var")
var_feature_set.modality = modality
var_feature_set.save()
file2.features
'var': FeatureSet(id='x9xoYkcMZX6BswBUFdsM', n=14, type='float', registry='bionty.CellMarker', hash='npy5P7AYbjKLInpXlNvb', updated_at=2023-08-28 18:25:44, modality_id='wGPaZsvc', created_by_id='DzTjkKse')
'external': FeatureSet(id='oC3tSM9huKlggYTYecMt', n=2, registry='core.Feature', hash='AA9XOJYxdoYfrGM5UoL_', updated_at=2023-08-28 18:25:44, modality_id='c5m8McK4', created_by_id='DzTjkKse')
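
Notice that the 'external' feature set of both files has the same id and hash: the second file loaded the existing record instead of creating a new one. One way to get this behavior is to identify a feature set by a hash of its members. The sketch below assumes SHA-256 over sorted names and a hypothetical `FeatureSetStore` class; lamindb's actual hashing scheme may differ.

```python
import hashlib


def feature_set_hash(feature_names):
    """Order-independent hash of a set of feature names (scheme is an assumption)."""
    joined = ",".join(sorted(set(feature_names)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:20]


class FeatureSetStore:
    """In-memory registry that stores each distinct feature set once."""

    def __init__(self):
        self._by_hash = {}

    def get_or_create(self, feature_names):
        """Return (hash, created); created is False if the set already existed."""
        key = feature_set_hash(feature_names)
        created = key not in self._by_hash
        if created:
            self._by_hash[key] = frozenset(feature_names)
        return key, created


store = FeatureSetStore()
first_key, first_created = store.get_or_create(["assay", "species"])
second_key, second_created = store.get_or_create(["species", "assay"])
```

Because the names are sorted before hashing, the order in which features are linked does not change the identity of the set.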
file2.view_lineage()
https://d33wubrfki0l68.cloudfront.net/7c4d51a14970cbe94a61d8ede741e6fb50058966/44feb/_images/b318bb771ab7df6d0aa0fe40424cc7472b985b9e6ba35fabb861eab086d87f5b.svg

Query by cell markers#

Which datasets have CD14 in the flow panel?

cell_markers = lb.CellMarker.lookup()
cell_markers.cd14
CellMarker(id='roEbL8zuLC5k', name='Cd14', synonyms='', gene_symbol='CD14', ncbi_gene_id='4695', uniprotkb_id='O43678', updated_at=2023-08-28 18:25:40, species_id='uHJU', bionty_source_id='WzJd', created_by_id='DzTjkKse')
panels_with_cd14 = ln.FeatureSet.filter(cell_markers=cell_markers.cd14).all()
ln.File.filter(feature_sets__in=panels_with_cd14).df()
id                    storage_id  key   suffix  accessor  description  version  initial_version_id  size      hash                    hash_type  transform_id    run_id                updated_at           created_by_id
S9Y7S0hYrOyHXqhl9lNK  5rdLhmxu    None  .h5ad   AnnData   Alpert19     None     None                33367624  14w5ElNsR_MqdiJtvnS1aw  md5        OWuTtS4SAponz8  1ZuHcutzHrGcXShQqmlK  2023-08-28 18:25:42  DzTjkKse
WOCJOiWjTLjCNaqT4dcs  5rdLhmxu    None  .h5ad   AnnData   My fcs file  None     None                6876232   Cf4Fhfw_RDMtKd5amM6Gtw  md5        OWuTtS4SAponz8  1ZuHcutzHrGcXShQqmlK  2023-08-28 18:25:44  DzTjkKse
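
The query above works in two hops: first find the feature sets (panels) that contain the marker, then find the files linked to those feature sets. A toy in-memory version of the same two-hop lookup is sketched below; the panel ids, the `other` file, and the helper `files_with_marker` are hypothetical.

```python
# Hypothetical panels (feature sets) and file-to-panel links.
panels = {
    "panel_a": {"Cd14", "CD3", "CD56"},
    "panel_b": {"Cd14", "CD3", "Ki67"},
    "panel_c": {"CD34"},
}
file_to_panels = {
    "Alpert19": ["panel_a"],
    "My fcs file": ["panel_b"],
    "other": ["panel_c"],
}


def files_with_marker(marker, panels, file_to_panels):
    """Return the files linked to at least one panel containing the marker."""
    # Hop 1: panels that contain the marker.
    matching = {panel_id for panel_id, markers in panels.items() if marker in markers}
    # Hop 2: files linked to any of those panels.
    return [
        file_name
        for file_name, panel_ids in file_to_panels.items()
        if matching.intersection(panel_ids)
    ]


cd14_files = files_with_marker("Cd14", panels, file_to_panels)
```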

Shared cell markers between two files:

files = ln.File.filter(feature_sets__in=panels_with_cd14, species__name="human").list()
file1, file2 = files[0], files[1]
file1_markers = file1.features["var"]
file2_markers = file2.features["var"]

shared_markers = file1_markers & file2_markers
shared_markers.list("name")
['Cd14', 'CD27', 'CD3', 'CD57', 'Cd19', 'CD127', 'Ccr7', 'CD8', 'Cd4', 'CD28']
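
The `&` operator above computes an intersection of the two panels. The same logic with plain Python sets, on small subsets of the two panels chosen for illustration:

```python
# Small illustrative subsets of the two panels registered above.
alpert19_panel = {"Cd14", "CD27", "CD3", "CD56", "CD16"}
flowio_panel = {"Cd14", "CD27", "CD3", "CCR5", "Ki67"}

# Markers measured in both files.
shared = alpert19_panel & flowio_panel

# Markers measured only in the FlowIO sample.
only_in_flowio = flowio_panel - alpert19_panel
```

Set intersection is order-independent, which is why the test at the end of this notebook compares sets rather than lists.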

Flow marker registry#

Check out your CellMarker registry:

lb.CellMarker.filter().df()
id            name    synonyms              gene_symbol  ncbi_gene_id  uniprotkb_id  species_id  bionty_source_id  updated_at           created_by_id
n40112OuX7Cq  CD123                         IL3RA        3563          P26951        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
N2F6Qv9CxJch  CD11B                         ITGAM        3684          P11215        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
0evamYEdmaoY  Igd                           None         None          None          uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
L0m6f7FPiDeg  CD86                          CD86         942           A8K632        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
4EojtgN0CjBH  CD161                         KLRB1        3820          Q12918        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
yCyTIVxZkIUz  DNA2                          DNA2         1763          P51530        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
bspnQ0igku6c  CD16                          FCGR3A       2215          O75015        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
roEbL8zuLC5k  Cd14                          CD14         4695          O43678        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
2VeZenLi2dj5  PD1     PID1|PD-1|PD 1        PDCD1        5133          A0A0M3M0G7    uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
h4rkCALR5WfU  CD56                          NCAM1        4684          P13591        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
uThe3c0V3d4i  CD27                          CD27         939           P26842        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
a4hvNp34IYP0  CD3                           None         None          None          uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
lRZYuH929QDw  CD85j                         None         None          None          uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
ljp5UfCF9HCi  TCRgd   TCRGAMMADELTA|TCRγδ   None         None          None          uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
Nb2sscq9cBcB  CD57                          B3GAT1       27087         Q9P2W7        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
fpPkjlGv15C9  Ccr6                          CCR6         1235          P51684        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
L0WKZ3fufq0J  CD11c                         ITGAX        3687          P20702        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
k0zGbSgZEX3q  HLADR   HLA‐DR|HLA-DR|HLA DR  None         None          None          uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
agQD0dEzuoNA  CXCR3                         CXCR3        2833          P49682        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
8OhpfB7wwV32  Cd19                          CD19         930           P15391        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
4uiPHmCPV5i1  CXCR5                         CXCR5        643           A0N0R2        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
hVNEgxlcDV10  CD127                         IL7R         3575          P16871        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
gEfe8qTsIHl0  CD24                          CD24         100133941     B6EC88        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
CR7DAHxybgyi  CD38                          CD38         952           B4E006        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
c3dZKHFOdllB  CD33                          CD33         945           P20138        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
0vAls2cmLKWq  ICOS                          ICOS         29851         Q53QY6        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
sYcK7uoWCtco  Ccr7                          CCR7         1236          P32248        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
0qCmUijBeByY  CD94                          KLRD1        3824          Q13241        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
50v4SaR2m5zQ  CD25                          IL2RA        3559          P01589        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
ttBc0Fs01sYk  CD8                           CD8A         925           P01732        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
HEK41hvaIazP  Cd4                           CD4          920           B4DT49        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
CLFUvJpioHoA  CD28                          CD28         940           B4E0L1        uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
a624IeIqbchl  CD45RA                        None         None          None          uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
YA5Ezh6SAy10  DNA1                          None         None          None          uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
cFJEI6e6wml3  CD20                          MS4A1        931           A0A024R507    uHJU        WzJd              2023-08-28 18:25:40  DzTjkKse
Qa4ozz9tyesQ  Ki67    Ki-67|KI 67           None         None          None          uHJU        WzJd              2023-08-28 18:25:43  DzTjkKse
UMsp5g0fgMwY  CCR5                          CCR5         1234          P51681        uHJU        WzJd              2023-08-28 18:25:43  DzTjkKse
XvpJ6oL3SG7w  CD45RO                        None         None          None          uHJU        WzJd              2023-08-28 18:25:43  DzTjkKse
VZBURNy04vBi  SSC-A   SSC A|SSCA            None         None          None          uHJU        WzJd              2023-08-28 18:25:43  DzTjkKse
# a few tests
assert set(shared_markers.list("name")) == set(
    [
        "Ccr7",
        "CD3",
        "Cd14",
        "Cd19",
        "CD127",
        "CD27",
        "CD28",
        "CD8",
        "Cd4",
        "CD57",
    ]
)
ln.File.filter(feature_sets__in=panels_with_cd14).exists()
True
# clean up test instance
!lamin delete --force test-flow
!rm -r test-flow
💡 deleting instance testuser1/test-flow
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-flow.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-flow