Design Choices

This page explains the decisions we made for mitochondria segmentation in Catena, alternatives considered, trade-offs, and where we'd like to go next.

Problem / Goal

We want a mitochondria segmentation pipeline that:

  • works reliably on FIBSEM EM volumes (at least to start with, since our in-house datasets are currently all FIBSEM volumes),
  • fits naturally into Catena's broader connectomics workflow,
  • can provide both semantic masks (mitochondria vs non-mito) and instance labels when needed,
  • and is practical to train and run on typical lab infrastructure (GPU workstation / HPC node).

Conceptually, we follow a two-stage setup:

  1. Semantic segmentation - predict a voxel-wise mitochondria mask.
  2. Instance segmentation - convert that mask into uniquely labelled mitochondria (connected components).

We also want to reuse public tools and models wherever possible rather than reinventing everything. For ground-truth, we therefore lean on Empanada / MitoNet and then train our own models in a way that is easy to adapt and extend.

Data curation

For training data, we use Empanada to generate mitochondria labels:

  • Empanada provides MitoNet, a panoptic segmentation model for mitochondria.
  • We fine-tune MitoNet on our FIBSEM data and use its predictions as a strong starting point for ground-truth curation.

This avoids fully manual voxel-wise annotation while still giving us high-quality masks to train on.

Data are stored in Zarr stores with a consistent layout:

  • raw EM under volumes/raw
  • labels under:
    • volumes/labels/neuron_ids (when treating neurons as labels), or
    • volumes/labels/mito_ids (for mitochondria).

The label_type flag ('neuron' / 'mito') lets the same code path handle either case.
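As a sketch of this convention, the mapping from label_type to dataset path could look like the following (a hypothetical helper for illustration, not part of Catena's actual code):

```python
# Hypothetical helper illustrating the Zarr layout convention described above.
RAW_PATH = "volumes/raw"

def label_dataset_path(label_type: str) -> str:
    """Map the label_type flag ('neuron' / 'mito') to the label dataset
    path inside a Zarr store."""
    paths = {
        "neuron": "volumes/labels/neuron_ids",
        "mito": "volumes/labels/mito_ids",
    }
    if label_type not in paths:
        raise ValueError(f"unknown label_type: {label_type!r}")
    return paths[label_type]
```

With this convention, the same training and inference code can open either label dataset by switching a single flag.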

Alternatives Considered

1. Use Empanada / MitoNet directly at inference

Option: rely entirely on fine-tuned MitoNet via Empanada for both training and inference.

  • Pros

    • Reuses a well-tested panoptic segmentation model.
    • Minimal extra modelling work in Catena.
  • Cons

    • Ties inference tightly to the Empanada stack. At the time, Empanada only exposed MitoNet within napari, which is not suitable for running segmentations on large datasets.
    • Less flexibility to experiment with architectures or training schemes.
    • Harder to integrate with Catena's generalised Zarr-based, patch-wise training/inference pattern.

We still use Empanada/MitoNet for ground-truth generation, but not as Catena's main inference engine.

2. Single "best" model vs multiple architectures

Option: pick one model (e.g. a single 3D U-Net) and optimise only that.

  • Pros

    • Simpler codebase.
    • Less configuration branching.
  • Cons

    • Harder to compare architectures on the same data and pipeline.
    • Less flexibility when data or requirements change.

Instead, we support two architectures side by side:

  1. MONAI 3D U-Net baseline (model_type = "monai_unet").
  2. Residual U-Net (RS-UNet) adapted from Xie et al. (model_type = "rs_unet").

Both share the same training and inference scaffolding (Zarr IO, patch-based training, sliding-window prediction), but differ in backbone details.
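As an illustration, backbone selection by model_type might look like the sketch below. The MONAI UNet hyperparameters and the rs_unet import are illustrative assumptions, not Catena's actual code:

```python
# Hypothetical factory sketch: select a backbone from model_type.
# The UNet channel/stride settings and the rs_unet module are assumptions.
def build_model(model_type: str, in_channels: int = 1, out_channels: int = 2):
    if model_type == "monai_unet":
        from monai.networks.nets import UNet
        return UNet(
            spatial_dims=3,
            in_channels=in_channels,
            out_channels=out_channels,
            channels=(16, 32, 64, 128, 256),
            strides=(2, 2, 2, 2),
            num_res_units=2,
        )
    if model_type == "rs_unet":
        from rs_unet import RSUNet  # hypothetical import of the adapted RS-UNet
        return RSUNet(in_channels, out_channels)
    raise ValueError(f"unknown model_type: {model_type!r}")
```

Because both backbones consume the same patches and emit the same voxel-wise predictions, everything around them (Zarr IO, patching, sliding-window inference) stays unchanged.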

3. Semantic + connected components vs full panoptic model

Option: train a full panoptic instance segmentation model (e.g. directly at MitoNet's level of complexity).

  • Pros

    • Instances are explicit; no need for post-hoc connected components.
    • Closer to Empanada's internal model.
  • Cons

    • More complex models and training pipelines.
    • Heavier to run and tune for new datasets.
    • Harder to integrate with generic segmentation tools in Catena.

We instead adopt a semantic-first approach (sometimes a semantic mask is all that is required) and do instance segmentation via connected components:

  • semantic prediction is produced by either MONAI U-Net or RS-UNet,
  • instance labels are generated by a small, explicit post-processing step.

This keeps the pipeline simpler, more transparent, and easier to adapt.

Decision

We settled on the following design:

  1. Ground-truth generation:

     • Use Empanada / MitoNet to obtain strong panoptic predictions.
     • Curate these predictions into training labels for mitochondria.
     • For public datasets, use Seg2Link-3D for ground-truth curation (see proofreading section).

  2. Two semantic segmentation backbones:

     • MONAI 3D U-Net as a straightforward baseline.
     • Residual U-Net (RS-UNet) adapted from Xie et al., with residual blocks in the encoder-decoder backbone to improve feature reuse and gradient flow while keeping the same semantic-to-instance pipeline.

  3. Zarr-based patch-wise training and inference:

     • Patch size and stride controlled via config (patch_size, stride).
     • All EM data under volumes/raw; labels under volumes/labels/*.

  4. Instance segmentation as a separate, simple step:

     • Use connected components on binarised semantic masks, with configurable thresholds and size filters.

This gives us a flexible, modular setup that fits with how Catena handles other modalities (neurons, synapses, EM masks). We can adapt this framework to explore affinity-based methods for mitochondria segmentation (for example, using LSDs for joint neuron + mitochondria predictions) and to explore contemporary graph-cut methods for better instance segmentation results.

Trade-offs

  • Model complexity vs ease of use

    • MONAI U-Net is simple and easy to understand and use; RS-UNet is more expressive but slightly more complex.
    • Supporting both adds a bit of code complexity but gives users flexibility.
  • Panoptic vs semantic -> instances

    • Using Empanada only for GT and not for inference means we duplicate some functionality, but we gain a unified Catena-style training/inference story.
    • Doing instances via connected components is naive compared to full panoptic models, but it is explicit, tunable, and easy to inspect.
  • Patch-wise processing vs full-volume

    • Patch-based training/inference is necessary for memory reasons, but it introduces tiling and overlap hyperparameters that need to be set carefully.
  • Format conventions

    • Insisting on volumes/raw and volumes/labels/... plus Zarr structure may require users to convert/import their data, but it standardises everything downstream.

Implementation Notes

Training configuration

Training is driven by an Args class (see monai_unet_train.py / rsunet_train.py), with key fields such as:

  • Experiment & data:
    • exp_name: experiment identifier (used for checkpoints, logs).
    • train_zarr_dirs, test_zarr_dirs: one or more Zarr directories for train/test.
    • label_type: 'neuron' or 'mito', which determines which label dataset to read (volumes/labels/neuron_ids vs volumes/labels/mito_ids).

Note

Following Xie et al., this option lets us test whether a model trained on neuron segmentation yields better results when fine-tuned on mitochondria segmentation, even when mitochondrial datasets are limited.

  • Training schedule:

    • epochs: max training epochs.
    • batch_size: patch batch size (often 1 for 3D patches when running on a 12GB GPU).
    • eval_interval: validation frequency.
    • ckpt_interval: checkpoint save interval.
  • Patch & resolution:

    • patch_size: e.g. [128, 128, 128].
    • stride: e.g. [64, 64, 64] (controls overlap).
    • original_res, target_res: physical resolution metadata (often equal for now).
  • Preprocessing & sampling:

    • clahe: whether to apply CLAHE contrast normalisation.
    • subsample_frac, subsample_number, subsample_seed: control patch subsampling.
    • balance_patches: if True, enforce a balance between positive/negative patches.
    • min_positive_pixels: minimum number of positive pixels per patch to consider it "positive".
  • Optimisation & loss:

    • learning_rate, learning_rate_after_hotstart_50.
    • loss_type: 'DiceLoss' or 'DiceCE' (Dice + cross-entropy).
    • loss_weights: weight factor to compensate for class imbalance.
  • Augmentation & model init:

    • rotation_augs, contrast_augs.
    • model_loc / resume_checkpoint: path to a checkpoint (for fine-tuning) or None for training from scratch. This will be cleaned up in future releases.
    • freeze_encoder, hotstart: options to partially freeze or warm-start the model.

Both MONAI U-Net and RS-UNet use the same high-level config; the difference is which script you call and what model_type you specify.
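A dataclass-style sketch of such a config is shown below. The field names follow the description above, but the defaults are illustrative assumptions; the real Args class in monai_unet_train.py / rsunet_train.py may differ:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the training config; defaults here are assumptions,
# not the values used by the actual Args class.
@dataclass
class Args:
    exp_name: str = "mito_baseline"
    train_zarr_dirs: list = field(default_factory=list)
    test_zarr_dirs: list = field(default_factory=list)
    label_type: str = "mito"            # 'neuron' or 'mito'
    epochs: int = 100
    batch_size: int = 1                 # 3D patches are memory-hungry
    patch_size: tuple = (128, 128, 128)
    stride: tuple = (64, 64, 64)
    clahe: bool = False
    balance_patches: bool = True
    min_positive_pixels: int = 100
    loss_type: str = "DiceCE"           # or 'DiceLoss'
    model_type: str = "monai_unet"      # or 'rs_unet'
```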

Inference configuration

Inference uses an InferenceArgs class (e.g. in predict.py and rsunet_predict.py), with:

  • model_path: path to the .pth checkpoint.
  • test_zarr_dirs: list of Zarr directories to run inference on.
  • label_type: must match training.
  • patch_size, stride, original_res, target_res, clahe: must be consistent with training.
  • batch_size, num_workers: runtime performance controls.
  • output_dir, output_filename, output_format ("tiff" or "zarr").
  • model_type: "rs_unet" or "monai_unet" to select the architecture.

The inference script writes out a semantic prediction volume at the target resolution. If all you need is semantic mitochondria labels, you can stop here.
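The patch_size / stride pair implies a sliding-window tiling of the volume. A minimal sketch of the per-axis tiling arithmetic (a hypothetical helper; the actual scripts may tile differently, e.g. via MONAI's sliding-window utilities):

```python
def patch_starts(extent: int, patch: int, stride: int) -> list:
    """Start offsets along one axis so overlapping patches tile the axis,
    clamping the final patch so it ends exactly at the volume boundary."""
    if extent <= patch:
        return [0]
    starts = list(range(0, extent - patch + 1, stride))
    if starts[-1] + patch < extent:
        starts.append(extent - patch)  # extra patch to cover the remainder
    return starts
```

For a 300-voxel axis with patch 128 and stride 64 this yields starts [0, 64, 128, 172]: the last patch is shifted back so no voxels are missed, which is why stride must stay consistent between training and inference.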

Instance segmentation

Instance labels are generated by instance_segmenter.py, driven by a ConversionArgs class:

  • input_prediction_path: path to the saved semantic prediction file (TIFF or Zarr).
  • output_instance_dir, output_format: where and how to save instances.
  • chunk_size, overlap: control chunked processing for large volumes.
  • thres_foreground: probability threshold for binarising semantic predictions (0-1).
  • thres_small_instances: minimum size for keeping an instance.
  • scale_factors: optional scaling (usually (1, 1, 1)).
  • remove_small_mode: how to treat small objects (e.g. 'background').

The script:

  1. Loads the semantic prediction volume.
  2. Binarises it using thres_foreground.
  3. Runs connected components per chunk (with overlap stitching).
  4. Removes small instances below thres_small_instances.
  5. Writes an instance-labelled volume.
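A minimal single-chunk sketch of steps 2-4 (chunked processing and overlap stitching omitted; scipy's ndimage.label stands in for whatever labelling instance_segmenter.py actually uses):

```python
import numpy as np
from scipy import ndimage

def semantic_to_instances(prob, thres_foreground=0.5, thres_small_instances=10):
    """Binarise a semantic probability map, label connected components,
    and relabel instances below the size threshold as background (0)."""
    mask = prob > thres_foreground                # step 2: binarise
    labels, num = ndimage.label(mask)             # step 3: connected components
    if num == 0:
        return labels
    sizes = np.bincount(labels.ravel())           # voxels per label (0 = background)
    small = np.flatnonzero(sizes < thres_small_instances)
    small = small[small != 0]                     # never remove background itself
    labels[np.isin(labels, small)] = 0            # step 4: drop small instances
    return labels
```

This corresponds to remove_small_mode = 'background'; other modes (e.g. merging small objects into neighbours) would replace the final relabelling step.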

Operational Guidance

Choosing a backbone

  • Start with the MONAI U-Net if you want a straightforward baseline and easier debugging.
  • Try RS-UNet if you:
    • already have a working baseline, and
    • want to see whether residual blocks improve IoU / Dice for your dataset.

Tuning training

  • If mitochondria are rare in the volume, ensure:
    • balance_patches = True,
    • min_positive_pixels is set to something sensible (inspect a few visualisations).
  • If training is unstable or overfitting:
    • reduce learning_rate,
    • consider lowering loss_weights,
    • start with fewer augmentations and add them back gradually.

Tuning instance segmentation

  • If you see many tiny, spurious instances:
    • increase thres_small_instances.
  • If you lose small mitochondria you care about:
    • decrease thres_small_instances,
    • and possibly lower thres_foreground slightly (but watch for noise).
  • For large volumes, adjust chunk_size and overlap so chunks fit into memory but still allow smooth stitching.

Always visually inspect:

  • raw EM,
  • semantic prediction,
  • instance labels

for a few representative subvolumes before trusting metrics alone.

Future Work / Open Questions

  • Better instance segmentation

    • Explore more advanced instance-labelling methods (beyond plain connected components) while keeping the semantic-first philosophy.
  • Tighter integration with Empanada

    • Streamline workflows for transferring models and labels between Empanada and Catena.
  • Cross-dataset generalisation

    • Systematically test how well models trained on one FIBSEM dataset transfer to others, and what minimal fine-tuning is required.
  • Joint modelling with other modalities

    • Use shared backbones or multi-task setups (e.g. neurons + mitochondria) to exploit shared structure in EM data, while keeping the current pipelines as a simple, reliable baseline.

The current design aims to be practical, transparent, and composable: good enough to use today, and flexible enough to evolve as we gather more data and experience.