
Design Choices

This page explains the design decisions behind neurotransmitter classification in Catena: the alternatives we considered, the trade-offs involved, and how the current code is organised.

Problem / Goal

Once we have a structural connectome (neurons + synapses + partners), we still don't know whether each synaptic edge is excitatory or inhibitory. The goal of this module is to:

  • assign a neurotransmitter class to pre-synaptic sites (and by extension, to neurons),
  • so that each edge in the graph can be labelled with a functional sign (excitatory vs inhibitory),
  • using curated adult fly data as ground truth, with only ~1000 examples per major transmitter in our current local datasets (extensible to more).

This is deliberately scoped as a supervised classification problem:

  • Input: small 3D EM cutouts (patches) around pre-synaptic sites.
  • Output: one of a small set of neurotransmitter labels (e.g. acetylcholine, GABA, glutamate, serotonin, dopamine, octopamine, tyramine).

We operate under a practical version of Dale's Principle: a neuron is treated as primarily excitatory or inhibitory based on its dominant transmitter, so that its outgoing synapses inherit a consistent sign.
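This working version of Dale's Principle can be sketched as a simple transmitter-to-sign mapping. This is a minimal illustration, not code from the module: the specific sign assignments (e.g. glutamate as inhibitory via GluCl, monoamines left unsigned as modulatory) follow common adult-fly conventions and are assumptions for this sketch.

```python
# Hypothetical transmitter-to-sign mapping under our practical Dale's Principle.
# Sign conventions here are assumptions following common adult-fly usage;
# monoamines are modulatory and left unsigned (0).
TRANSMITTER_SIGN = {
    "acetylcholine": +1,  # excitatory
    "gaba": -1,           # inhibitory
    "glutamate": -1,      # typically inhibitory in the fly CNS (GluCl)
    "serotonin": 0,       # modulatory: no simple sign
    "dopamine": 0,
    "octopamine": 0,
    "tyramine": 0,
}

def edge_sign(neuron_transmitter: str) -> int:
    """Sign inherited by all outgoing edges of a neuron."""
    return TRANSMITTER_SIGN[neuron_transmitter.lower()]
```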

Alternatives Considered

1. Use original Synister as-is

The starting point for this module is Nil Eckstein's Synister project (and its updated dev branch by Diane Adjavon at HHMI Janelia).

Option: keep Synister's original repo and configuration structure untouched, and just plug in local data.

  • Pros

    • Battle-tested on large datasets.
    • No extra restructuring work, if it runs off-the-shelf.
  • Cons

    • The original code targets large-scale Janelia pipelines, not small curated local datasets.
    • Configuration and data paths are less convenient for our local, smaller adult-fly-derived cutouts.
    • Harder to evolve in lockstep with Catena's conventions and environment setup.

Decision: reimplement Synister's core ideas inside Catena, adapting the layout, configuration, and data loading to our use case.

2. Single backbone vs multiple architectures

The original Synister used a VGG-style 3D CNN. Our reimplementation adds a 3D ResNet-18-style backbone (lightweight residual network) and keeps both options available.

Option: choose only one (VGG or ResNet) and simplify.

  • Pros

    • Less code and fewer configs to maintain.
    • Fewer model choices for users to worry about.
  • Cons

    • Harder to compare architectures on the same data and pipeline.
    • Less flexibility if one backbone behaves better on some neurotransmitters or datasets.

Decision: support both VGG3D and a lightweight 3D ResNet:

  • VGG3D: closer to the original Synister implementation.
  • ResNet3D: a newer residual architecture (as shown in the ResNet schematic, Figure 1).

This is reflected in models/resnet3d.py and models/vgg3d.py, selectable via configuration (TRAIN.MODEL_TYPE).
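Backbone selection might look like the following factory function. This is a hedged sketch, not the actual Catena code: the `ResNet3D`/`Vgg3D` constructor names and the `MODEL.NUM_CLASSES` key are assumptions for illustration; only `TRAIN.MODEL_TYPE` and the two module paths come from the text above.

```python
# Sketch of config-driven backbone selection. The constructor names and
# arguments are assumptions; the real factory in Catena may differ.
def build_model(cfg):
    model_type = cfg.TRAIN.MODEL_TYPE.upper()
    if model_type == "RESNET":
        from models.resnet3d import ResNet3D   # lightweight 3D residual net
        return ResNet3D(num_classes=cfg.MODEL.NUM_CLASSES)
    elif model_type == "VGG":
        from models.vgg3d import Vgg3D         # closer to original Synister
        return Vgg3D(num_classes=cfg.MODEL.NUM_CLASSES)
    raise ValueError(f"Unknown TRAIN.MODEL_TYPE: {cfg.TRAIN.MODEL_TYPE!r}")
```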

3. Ad-hoc scripts vs structured engine/config layout

Option: have a single monolithic training/prediction script with hard-coded paths.

  • Pros

    • Quick to prototype.
    • Fewer files to read.
  • Cons

    • Fragile and hard to reuse across datasets.
    • Harder to share experiments or reproduce results.

Decision: use a structured layout very similar to Synister's updated dev branch:

  • config/ - YAML/py-style configs for train and predict.
  • engine/train/train_3d.py - core training logic.
  • engine/predict/predict_3d.py - prediction logic.
  • engine/post/evaluate_3d.py - evaluation.
  • models/ - backbone definitions (ResNet3D, VGG3D).
  • data_utils/pre_process/split_data.py - reproducible train/val split generation.
  • Top-level trainer.py and predicter.py launchers that read the configs.

This keeps the logic, configuration, and models cleanly separated, making it easier to experiment and compare runs.
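The top-level launchers stay thin: they parse a config path and hand off to the engine. Below is a hedged sketch of what `trainer.py` might look like; the `get_cfg`/`train` entry points and the `merge_from_file` override step are assumptions, not the verified Catena implementation.

```python
# Sketch of a thin trainer.py launcher: parse a config path, then defer
# to the engine. Entry-point names below are assumptions for illustration.
import argparse

def build_parser():
    # Mirrors the documented invocation: python trainer.py -c config/config.py
    parser = argparse.ArgumentParser(description="Train a 3D NT classifier")
    parser.add_argument("-c", "--config", required=True,
                        help="Path to the training config, e.g. config/config.py")
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    # Deferred imports into the Catena layout described above (assumed API).
    from config.config import get_cfg
    from engine.train.train_3d import train
    cfg = get_cfg()
    cfg.merge_from_file(args.config)   # YACS-style override (assumption)
    train(cfg)

if __name__ == "__main__":
    main()
```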

4. Fixed data source vs multiple backends

Our data comes from curated adult-fly EM datasets, but the exact storage can vary:

  • HDF5 cutouts in class-labelled directories,
  • pickled splits,
  • or synapse locations stored in MongoDB.

Option: only support one data source (e.g. directory of HDF5 files).

  • Pros

    • Simpler data loading code.
    • Less configuration branching.
  • Cons

    • Inflexible for different labs and infrastructure.
    • Hard to scale to large, database-backed synapse tables.

Decision: support three data source methods for prediction, configured via _C.DATA_SOURCE.METHOD:

  • 'pkl' - use a .pkl file defining train/val splits.
  • 'directory' - point to a directory of HDF5 files (e.g. unlabeled or new data).
  • 'mongo' - pull synapse locations from a MongoDB collection for large-scale inference.

This matches the original Synister philosophy while remaining practical for smaller local datasets.
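The three modes amount to a dispatch on `_C.DATA_SOURCE.METHOD`. The sketch below shows the idea; the loader logic, config keys beyond `DATA_SOURCE.METHOD` (e.g. `SPLIT_FILE`, `DATA_DIR`, the Mongo settings), and the document schema are hypothetical.

```python
# Sketch of dispatching on _C.DATA_SOURCE.METHOD. Config keys other than
# METHOD, and the Mongo document fields, are assumptions for illustration.
def load_synapse_locations(cfg):
    method = cfg.DATA_SOURCE.METHOD
    if method == "pkl":
        import pickle
        with open(cfg.DATA_SOURCE.SPLIT_FILE, "rb") as f:
            return pickle.load(f)["val"]          # locations from a saved split
    elif method == "directory":
        import glob, os
        pattern = os.path.join(cfg.DATA_SOURCE.DATA_DIR, "**", "*.hdf*")
        return sorted(glob.glob(pattern, recursive=True))  # one patch per file
    elif method == "mongo":
        from pymongo import MongoClient
        client = MongoClient(cfg.DATA_SOURCE.MONGO_URL)
        coll = client[cfg.DATA_SOURCE.DB_NAME][cfg.DATA_SOURCE.COLLECTION]
        return [(d["z"], d["y"], d["x"])
                for d in coll.find({}, {"z": 1, "y": 1, "x": 1})]
    raise ValueError(f"Unknown DATA_SOURCE.METHOD: {method!r}")
```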

Decision

We adopted the following design:

  1. Problem framing: 3D patch-based supervised classification of pre-synaptic sites into neurotransmitter classes, used to sign and label edges in an existing connectome.

  2. Core model family: Synister-style 3D CNNs with two backbones:

    • 3D VGG-like network.
    • Lightweight 3D ResNet-18-style network (Figure 1).
  3. Code structure:

    • Config-driven training and prediction (config/config.py, config/config_predict.py).
    • Modular engine/ folder with train_3d.py, predict_3d.py, and evaluate_3d.py.
    • Separate models/ for backbones and data_utils/ for preprocessing/splitting.
  4. Data sources:

    • Local HDF5 patch datasets.
    • Optional .pkl split files for reproducible experiments.
    • Optional MongoDB backend for large-scale prediction runs.

This keeps the module close in spirit to Synister, while being adapted for Catena's curated, smaller-scale neurotransmitter datasets.

Trade-offs

  • Model richness vs dataset size
    Our largest dataset currently has on the order of 1000 examples per major neurotransmitter. Using 3D CNNs (VGG/ResNet) is expressive, but we must be careful not to overfit. The design intentionally stays with moderate-sized backbones rather than very deep networks.

  • Multiple backbones vs simplicity
    Supporting both VGG3D and ResNet3D adds complexity to configs and code. The upside is that it is easy to compare and switch backbones without rewriting the pipeline.

  • Configuration flexibility vs user overhead
    The YACS-style config with many knobs (DATA.USE_SPLIT_FILE, TRAIN.MODEL_TYPE, Mongo settings, etc.) is powerful but can feel heavy at first. The trade-off is made in favour of reproducibility and reuse.

  • 3D CNN patches vs graph-level models
    We stay at the level of 3D local patches around synapses, rather than using graph neural networks or whole-cell context. This simplifies the training story and matches Synister's original framing, but it means that purely local morphology + intensity around a synapse must carry most of the classification signal.
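To make the "many knobs" trade-off concrete, a YACS-style config node for this module might look like the fragment below. The key names mentioned in the text (DATA.USE_SPLIT_FILE, DATA.SPLIT_FILE, TRAIN.MODEL_TYPE) are from this document; the other keys and defaults are illustrative assumptions.

```python
# Illustrative YACS config fragment; defaults and keys beyond those
# named in the text are assumptions, not Catena's actual values.
from yacs.config import CfgNode as CN

_C = CN()

_C.DATA = CN()
_C.DATA.DATA_DIR_PATH = "/path/to/patches"   # one subfolder per class
_C.DATA.USE_SPLIT_FILE = True
_C.DATA.SPLIT_FILE = "splits/train_val.pkl"

_C.TRAIN = CN()
_C.TRAIN.MODEL_TYPE = "RESNET"               # or "VGG"
_C.TRAIN.BATCH_SIZE = 8                      # illustrative default
_C.TRAIN.LR = 1e-4                           # illustrative default

def get_cfg():
    return _C.clone()
```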

Implementation Notes

  • Training

    • Launched via:
      python trainer.py -c config/config.py
      
    • TRAIN.MODEL_TYPE chooses between "RESNET" and "VGG".
    • DATA.USE_SPLIT_FILE + DATA.SPLIT_FILE control whether to use a predefined .pkl split.
  • Prediction

    • Launched via:
      python predicter.py config/config_predict.py
      
    • _C.PREDICT.CHECKPOINT sets the model checkpoint.
    • _C.RAW_DATA.CONTAINER and _C.RAW_DATA.DATASET define the raw EM source.
    • _C.DATA_SOURCE.METHOD ∈ {"pkl", "directory", "mongo"} selects how synapse patches are sourced.
  • Evaluation

    • Uses engine/post/evaluate_3d.py (or scripts/evaluate.py) to compute, from prediction CSVs:

      • accuracy,
      • per-class metrics,
      • confusion matrices.
  • Integration with the larger pipeline

    • Gunpowder and Daisy are used for data loading and scheduling in the larger pipeline context (e.g. when sourcing synapse locations from MongoDB or large volumes), following Synister.
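The evaluation step boils down to aggregating (true, predicted) label pairs from the prediction CSVs. A minimal pure-Python sketch of those metrics is below; the actual evaluate_3d.py may use pandas/scikit-learn, and the function name here is hypothetical.

```python
# Sketch of the metrics computed during evaluation: accuracy, per-class
# recall, and a confusion matrix, from (true, predicted) label pairs.
from collections import Counter

def evaluate(pairs, classes):
    """pairs: iterable of (true_label, predicted_label) tuples."""
    confusion = {c: Counter() for c in classes}  # rows: true, cols: predicted
    for true, pred in pairs:
        confusion[true][pred] += 1
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[c][c] for c in classes)
    per_class_recall = {
        c: confusion[c][c] / max(1, sum(confusion[c].values()))
        for c in classes
    }
    return correct / total, per_class_recall, confusion
```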

Operational Guidance

  • Create a reusable split

    • Start by creating a split with split_data.py (or a similar script) so that you have a reusable .pkl file for train/val splits.
  • Configure config/config.py

    • Set DATA.DATA_DIR_PATH to your root directory containing one subfolder per neurotransmitter class.
    • Enable DATA.USE_SPLIT_FILE and point DATA.SPLIT_FILE to your split .pkl.
    • Choose TRAIN.MODEL_TYPE = "RESNET" or "VGG".
  • For small datasets

    • Keep augmentations modest.
    • Monitor per-class performance (some transmitters may have fewer examples if your data is imbalanced).
    • Use early stopping or careful checkpoint selection.
  • Prediction on large volumes

    • Use the 'mongo' data source mode if synapse coordinates live in a MongoDB collection.
    • Or use 'directory' mode to sweep over new HDF5 cutouts.
  • Always sanity-check

    • Class balance in your training set.
    • Per-class confusion matrices.
    • A few visual examples of correctly vs incorrectly classified synapses.
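The first step above, creating a reusable split, can be sketched as follows. This is a hedged illustration of what split_data.py might do, assuming one subfolder of HDF5 patches per class and a pickled dict with 'train'/'val' keys; the real script's layout and options may differ.

```python
# Sketch of reproducible split creation: fixed seed, per-class shuffling,
# pickled {'train': [...], 'val': [...]} output. Layout is an assumption.
import glob
import os
import pickle
import random

def make_split(data_dir, out_pkl, val_fraction=0.2, seed=42):
    rng = random.Random(seed)                 # fixed seed => reproducible
    split = {"train": [], "val": []}
    for cls in sorted(os.listdir(data_dir)):  # one subfolder per class
        files = sorted(glob.glob(os.path.join(data_dir, cls, "*.hdf*")))
        rng.shuffle(files)
        n_val = max(1, int(len(files) * val_fraction))
        split["val"] += [(f, cls) for f in files[:n_val]]
        split["train"] += [(f, cls) for f in files[n_val:]]
    with open(out_pkl, "wb") as f:
        pickle.dump(split, f)
    return split
```

Splitting within each class keeps the validation set's class balance close to the training set's, which matters with only ~1000 examples per transmitter.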

Future Work / Open Questions

  • More architectures
    The module is already designed to support multiple backbones; adding other 3D CNNs or lightweight transformers is straightforward, but needs to be justified by data scale.

  • Better integration with connectome-level analysis
    Aggregating synapse-level predictions into neuron-level labels (and checking consistency with Dale's Principle) is an obvious next step.

  • Uncertainty and active learning
    Using model confidence to prioritise which synapses to re-annotate could help expand the curated adult-fly datasets efficiently.

  • Cross-dataset generalisation
    Evaluating how models trained on one adult-fly dataset transfer to others (or to different species) remains an open question.