Design Choices
This page explains the decisions we made for neurotransmitter classification in Catena, alternatives considered, trade-offs, and how the current code is organised.
Problem / Goal
Once we have a structural connectome (neurons + synapses + partners), we still don't know whether each synaptic edge is excitatory or inhibitory. The goal of this module is to:
- assign a neurotransmitter class to pre-synaptic sites (and by extension, to neurons),
- so that each edge in the graph can be labelled with a functional sign (excitatory vs inhibitory),
- using curated adult fly data as ground truth, with roughly 1,000 examples per major transmitter in our current local datasets (extensible to more).
This is deliberately scoped as a supervised classification problem:
- Input: small 3D EM cutouts (patches) around pre-synaptic sites.
- Output: one of a small set of neurotransmitter labels (e.g. acetylcholine, GABA, glutamate, serotonin, dopamine, octopamine, tyramine).
We operate under a practical version of Dale's Principle: a neuron is treated as primarily excitatory or inhibitory based on its dominant transmitter, so that its outgoing synapses inherit a consistent sign.
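The aggregation step described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the `TRANSMITTER_SIGN` map and function names are hypothetical, and the excitatory/inhibitory assignments follow common adult-fly conventions rather than anything this code base defines.

```python
from collections import Counter

# Hypothetical sign map for illustration only; in adult fly, glutamate is
# commonly treated as inhibitory (GluCl-mediated), but verify against your
# own curation before relying on this mapping.
TRANSMITTER_SIGN = {
    "acetylcholine": "excitatory",
    "gaba": "inhibitory",
    "glutamate": "inhibitory",
}

def neuron_sign(synapse_predictions):
    """Practical Dale's Principle: take the dominant predicted transmitter
    across a neuron's pre-synaptic sites and map it to a functional sign."""
    dominant, _count = Counter(synapse_predictions).most_common(1)[0]
    return TRANSMITTER_SIGN.get(dominant, "unknown")
```

A neuron with mostly GABAergic predictions would then label all of its outgoing edges as inhibitory, even if a few sites were predicted otherwise.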
Alternatives Considered
1. Use original Synister as-is
The starting point for this module is Nil Eckstein's Synister project (and its updated dev branch by Diane Adjavon at HHMI Janelia).
Option: keep Synister's original repo and configuration structure untouched, and just plug in local data.

- Pros
    - Battle-tested on large datasets.
    - No extra work restructuring the code, assuming it works off-the-shelf.
- Cons
    - The original code targets large-scale Janelia pipelines, not small curated local datasets.
    - Configuration and data paths are less convenient for our local, smaller adult-fly-derived cutouts.
    - Harder to evolve in lockstep with Catena's conventions and environment setup.

Decision: reimplement Synister's core ideas inside Catena, adapting the layout, configuration, and data loading to our use case.
2. Single backbone vs multiple architectures
The original Synister used a VGG-style 3D CNN. Our reimplementation adds a 3D ResNet-18-style backbone (lightweight residual network) and keeps both options available.
Option: choose only one (VGG or ResNet) and simplify.

- Pros
    - Less code and fewer configs to maintain.
    - Fewer model choices for users to worry about.
- Cons
    - Harder to compare architectures on the same data and pipeline.
    - Less flexibility if one backbone behaves better on some neurotransmitters or datasets.

Decision: support both VGG3D and a lightweight 3D ResNet:

- VGG3D: closer to the original Synister implementation.
- ResNet3D: newer residual architecture (as shown in the ResNet schematic, Figure 1).

This is reflected in `models/resnet3d.py` and `models/vgg3d.py`, selectable via configuration (`TRAIN.MODEL_TYPE`).
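The `TRAIN.MODEL_TYPE` dispatch can be sketched like this. The stub classes stand in for the real backbones in `models/vgg3d.py` and `models/resnet3d.py`; the `build_model` helper is illustrative, not the project's actual factory.

```python
# Stand-ins for the real backbone classes in models/vgg3d.py and
# models/resnet3d.py (constructor arguments omitted for brevity).
class VGG3D:
    name = "VGG"

class ResNet3D:
    name = "RESNET"

_BACKBONES = {"VGG": VGG3D, "RESNET": ResNet3D}

def build_model(model_type: str):
    """Mirror TRAIN.MODEL_TYPE dispatch: return the configured backbone,
    failing loudly on unknown values instead of silently defaulting."""
    try:
        return _BACKBONES[model_type.upper()]()
    except KeyError:
        raise ValueError(f"Unknown TRAIN.MODEL_TYPE: {model_type!r}") from None
```

Keeping the lookup in one table makes adding a third backbone a one-line change.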
3. Ad-hoc scripts vs structured engine/config layout
Option: have a single monolithic training/prediction script with hard-coded paths.

- Pros
    - Quick to prototype.
    - Fewer files to read.
- Cons
    - Fragile and hard to reuse across datasets.
    - Harder to share experiments or reproduce results.

Decision: use a structured layout very similar to Synister's updated dev branch:

- `config/`: YAML/py-style configs for train and predict.
- `engine/train/train_3d.py`: core training logic.
- `engine/predict/predict_3d.py`: prediction logic.
- `engine/post/evaluate_3d.py`: evaluation.
- `models/`: backbone definitions (ResNet3D, VGG3D).
- `data_utils/pre_process/split_data.py`: reproducible train/val split generation.
- Top-level `trainer.py` and `predicter.py` launchers that read the configs.

This keeps the logic, configuration, and models cleanly separated, making it easier to experiment and compare runs.
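The shape of the config tree that the launchers read can be pictured with a minimal stand-in. The real project uses YACS-style config nodes in `config/config.py`; here `SimpleNamespace` mimics that nesting, the key names are the ones mentioned on this page, and the path values are placeholders.

```python
from types import SimpleNamespace

def default_config():
    """Hedged sketch of the config tree read by trainer.py; the real
    implementation uses YACS CfgNodes, not SimpleNamespace."""
    return SimpleNamespace(
        DATA=SimpleNamespace(
            DATA_DIR_PATH="/path/to/patches",   # one subfolder per class
            USE_SPLIT_FILE=True,
            SPLIT_FILE="splits/train_val.pkl",  # placeholder path
        ),
        TRAIN=SimpleNamespace(MODEL_TYPE="RESNET"),
    )
```

Keeping every knob in one tree is what lets a single launcher drive many experiments.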
4. Fixed data source vs multiple backends
Our data comes from curated adult-fly EM datasets, but the exact storage can vary:
- HDF5 cutouts in class-labelled directories,
- pickled splits,
- or synapse locations stored in MongoDB.
Option: only support one data source (e.g. a directory of HDF5 files).

- Pros
    - Simpler data loading code.
    - Less configuration branching.
- Cons
    - Inflexible for different labs / infrastructure.
    - Hard to scale to large, database-backed synapse tables.

Decision: support three data source methods for prediction, configured via `_C.DATA_SOURCE.METHOD`:

- `'pkl'`: use a `.pkl` file defining train/val splits.
- `'directory'`: point to a directory of HDF5 files (e.g. unlabeled or new data).
- `'mongo'`: pull synapse locations from a MongoDB collection for large-scale inference.

This matches the original Synister philosophy while remaining practical for smaller local datasets.
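The three-way dispatch can be sketched as a small lookup table. The loader functions below are stubs standing in for the real pkl/directory/mongo readers; their names and the `cfg` keys are illustrative.

```python
# Stub loaders; the real readers open the split .pkl, walk an HDF5
# directory, or query a MongoDB collection respectively.
def load_from_pkl(cfg):
    return f"pkl:{cfg['split_file']}"

def load_from_directory(cfg):
    return f"dir:{cfg['data_dir']}"

def load_from_mongo(cfg):
    return f"mongo:{cfg['collection']}"

_SOURCES = {
    "pkl": load_from_pkl,
    "directory": load_from_directory,
    "mongo": load_from_mongo,
}

def get_synapse_source(method, cfg):
    """Dispatch on _C.DATA_SOURCE.METHOD, rejecting unknown values."""
    if method not in _SOURCES:
        raise ValueError(f"DATA_SOURCE.METHOD must be one of {sorted(_SOURCES)}")
    return _SOURCES[method](cfg)
```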
Decision
We adopted the following design:

- Problem framing: 3D patch-based supervised classification of pre-synaptic sites into neurotransmitter classes, used to sign and label edges in an existing connectome.
- Core model family: Synister-style 3D CNNs with two backbones:
    - 3D VGG-like network.
    - Lightweight 3D ResNet-18-style network (Figure 1).
- Code structure:
    - Config-driven training and prediction (`config/config.py`, `config/config_predict.py`).
    - Modular `engine/` folder with `train_3d.py`, `predict_3d.py`, and `evaluate_3d.py`.
    - Separate `models/` for backbones and `data_utils/` for preprocessing/splitting.
- Data sources:
    - Local HDF5 patch datasets.
    - Optional `.pkl` split files for reproducible experiments.
    - Optional MongoDB backend for large-scale prediction runs.

This keeps the module close in spirit to Synister, while being adapted for Catena's curated, smaller-scale neurotransmitter datasets.
Trade-offs
- Model richness vs dataset size: our largest dataset currently has on the order of ~1000 examples per major neurotransmitter. 3D CNNs (VGG/ResNet) are expressive, but we must be careful not to overfit; the design intentionally stays with moderate-sized backbones rather than very deep networks.
- Multiple backbones vs simplicity: supporting both VGG3D and ResNet3D adds complexity to configs and code. The upside is that it is easy to compare and switch backbones without rewriting the pipeline.
- Configuration flexibility vs user overhead: the YACS-style config with many knobs (`DATA.USE_SPLIT_FILE`, `TRAIN.MODEL_TYPE`, Mongo settings, etc.) is powerful but can feel heavy at first. The trade-off is made in favour of reproducibility and reuse.
- 3D CNN patches vs graph-level models: we stay at the level of 3D local patches around synapses, rather than using graph neural networks or whole-cell context. This simplifies training and matches Synister's original framing, but it means that purely local morphology and intensity around a synapse must carry most of the classification signal.
Implementation Notes
- Training
    - Launched via: `python trainer.py -c config/config.py`
    - `TRAIN.MODEL_TYPE` chooses between `"RESNET"` and `"VGG"`.
    - `DATA.USE_SPLIT_FILE` + `DATA.SPLIT_FILE` control whether to use a predefined `.pkl` split.
- Prediction
    - Launched via: `python predicter.py config/config_predict.py`
    - `_C.PREDICT.CHECKPOINT` sets the model checkpoint.
    - `_C.RAW_DATA.CONTAINER` and `_C.RAW_DATA.DATASET` define the raw EM source.
    - `_C.DATA_SOURCE.METHOD ∈ {"pkl", "directory", "mongo"}` selects how synapse patches are sourced.
- Evaluation
    - Uses `engine/post/evaluate_3d.py` (or `scripts/evaluate.py`) to compute accuracy, per-class metrics, and confusion matrices from prediction CSVs.
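The core of that evaluation step can be sketched in a few lines: accumulate a confusion matrix and accuracy from (true, predicted) label pairs, as one would read them out of a prediction CSV. Function and column names are illustrative, not those of `evaluate_3d.py`.

```python
from collections import Counter, defaultdict

def evaluate(rows):
    """Accuracy and confusion matrix from (true, predicted) label pairs.
    Sketch of what an evaluation pass over a prediction CSV computes."""
    confusion = defaultdict(Counter)
    correct = 0
    for true, pred in rows:
        confusion[true][pred] += 1
        correct += true == pred
    accuracy = correct / len(rows) if rows else 0.0
    return accuracy, {t: dict(c) for t, c in confusion.items()}
```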
- Integration with the larger pipeline
    - Gunpowder and Daisy are used for data loading and scheduling in the larger pipeline context (e.g. when sourcing synapse locations from MongoDB or large volumes), following Synister.
Operational Guidance
- Create a reusable split
    - Start by creating a split with `split_data.py` (or a similar script) so that you have a reusable `.pkl` file for train/val splits.
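A reproducible split boils down to a seeded shuffle pickled to disk. The sketch below shows the idea; the exact on-disk schema expected by `DATA.SPLIT_FILE` is an assumption here, so mirror whatever `split_data.py` actually writes.

```python
import pickle
import random

def make_split(sample_ids, val_fraction=0.2, seed=42, out_path=None):
    """Seeded shuffle into train/val lists; optionally pickle the result.
    The {'train': [...], 'val': [...]} schema is assumed, not verified
    against split_data.py."""
    ids = sorted(sample_ids)            # sort first so the seed fully fixes the order
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    split = {"train": ids[n_val:], "val": ids[:n_val]}
    if out_path is not None:
        with open(out_path, "wb") as f:
            pickle.dump(split, f)
    return split
```

Running the same call twice yields the same split, which is the whole point of a `.pkl`-backed experiment.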
- Configure `config/config.py`
    - Set `DATA.DATA_DIR_PATH` to your root directory containing one subfolder per neurotransmitter class.
    - Enable `DATA.USE_SPLIT_FILE` and point `DATA.SPLIT_FILE` to your split `.pkl`.
    - Choose `TRAIN.MODEL_TYPE = "RESNET"` or `"VGG"`.
- For small datasets
    - Keep augmentations modest.
    - Monitor per-class performance (some transmitters may have fewer examples if your data is imbalanced).
    - Use early stopping or careful checkpoint selection.
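Early stopping on validation loss can be implemented with a small patience counter. This is a generic sketch under the assumption of a per-epoch validation loss; the training loop in `train_3d.py` may handle checkpointing differently.

```python
class EarlyStopper:
    """Signal a stop when validation loss has not improved for `patience`
    consecutive epochs. Generic sketch, not the project's trainer."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Pairing this with "save checkpoint only on improvement" gives careful checkpoint selection for free.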
- Prediction on large volumes
    - Use the `'mongo'` data source mode if synapse coordinates live in a MongoDB collection.
    - Or use `'directory'` mode to sweep over new HDF5 cutouts.
- Always sanity-check
    - Class balance in your training set.
    - Per-class confusion matrices.
    - A few visual examples of correctly vs incorrectly classified synapses.
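The class-balance check in the list above is a one-liner worth automating. The helper and its `warn_ratio` threshold are illustrative, not part of the module.

```python
from collections import Counter

def class_balance(labels, warn_ratio=5.0):
    """Per-class counts plus a list of classes rarer than the majority
    class by more than `warn_ratio`x. Illustrative sanity check only."""
    counts = Counter(labels)
    majority = max(counts.values())
    flagged = sorted(c for c, n in counts.items() if majority / n > warn_ratio)
    return dict(counts), flagged
```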
Future Work / Open Questions
- More architectures: the module is already designed to support multiple backbones; adding other 3D CNNs or lightweight transformers is straightforward, but needs to be justified by data scale.
- Better integration with connectome-level analysis: aggregating synapse-level predictions into neuron-level labels (and checking consistency with Dale's Principle) is an obvious next step.
- Uncertainty and active learning: using model confidence to prioritise which synapses to re-annotate could help expand the curated adult-fly datasets efficiently.
- Cross-dataset generalisation: evaluating how models trained on one adult-fly dataset transfer to others (or to different species) remains an open question.