Design Choices

This page explains the decisions we made for mitochondria segmentation in Catena, alternatives considered, trade-offs, and where we'd like to go next.

Problem / Goal

We want a mitochondria segmentation pipeline that:

  • works reliably on FIBSEM EM volumes (at least to start with, since our in-house datasets are currently all FIBSEM volumes),
  • fits naturally into Catena's broader connectomics workflow,
  • can provide both semantic masks (mitochondria vs non-mito) and instance labels when needed,
  • and is practical to train and run on typical lab infrastructure (GPU workstation / HPC node).

Conceptually, we follow a two-stage setup:

  1. Semantic segmentation - predict a voxel-wise mitochondria mask.
  2. Instance segmentation - convert that mask into uniquely labelled mitochondria (connected components).

We also want to reuse public tools and models wherever possible rather than reinventing everything. For ground-truth, we therefore lean on Empanada / MitoNet and then train our own models in a way that is easy to adapt and extend.

Data curation

For training data, we use Empanada to generate mitochondria labels:

  • Empanada provides MitoNet, a panoptic segmentation model for mitochondria.
  • We fine-tune MitoNet on our FIBSEM data and use its predictions as a strong starting point for ground-truth curation.

This avoids fully manual voxel-wise annotation while still giving us high-quality masks to train on.

Data are stored in Zarr stores with a consistent layout:

  • raw EM under volumes/raw
  • labels under:
    • volumes/labels/neuron_ids (when treating neurons as labels), or
    • volumes/labels/mito_ids (for mitochondria).

The label_type flag ('neuron' / 'mito') lets the same code path handle either case.
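As a sketch of this convention, the mapping from label_type to dataset path could look like the following (a hypothetical helper for illustration, not part of Catena's actual code):

```python
# Hypothetical helper illustrating the Zarr layout convention described above.
RAW_PATH = "volumes/raw"

def label_dataset_path(label_type: str) -> str:
    """Map the label_type flag ('neuron' / 'mito') to the label dataset
    path inside a Zarr store."""
    paths = {
        "neuron": "volumes/labels/neuron_ids",
        "mito": "volumes/labels/mito_ids",
    }
    if label_type not in paths:
        raise ValueError(f"unknown label_type: {label_type!r}")
    return paths[label_type]
```

With this convention, the same training and inference code can open either label dataset by switching a single flag.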

Alternatives Considered

1. Use Empanada / MitoNet directly at inference

Option: rely entirely on fine-tuned MitoNet via Empanada for both training and inference.

  • Pros

    • Reuses a well-tested panoptic segmentation model.
    • Minimal extra modelling work in Catena.
  • Cons

    • Ties inference tightly to the Empanada stack. At the time, Empanada only exposed MitoNet within napari, which is not suitable for running segmentations on large datasets.
    • Less flexibility to experiment with architectures or training schemes.
    • Harder to integrate with Catena's generalised Zarr-based, patch-wise training/inference pattern.

We still use Empanada/MitoNet for ground-truth generation, but not as Catena's main inference engine.

2. Single "best" model vs multiple architectures

Option: pick one model (e.g. a single 3D U-Net) and optimise only that.

  • Pros

    • Simpler codebase.
    • Less configuration branching.
  • Cons

    • Harder to compare architectures on the same data and pipeline.
    • Less flexibility when data or requirements change.

Instead, we support two architectures side by side:

  1. MONAI 3D U-Net baseline (model_type = "monai_unet").
  2. Residual U-Net (RS-UNet) adapted from Xie et al. (model_type = "rs_unet").

Both share the same training and inference scaffolding (Zarr IO, patch-based training, sliding-window prediction), but differ in backbone details.
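As an illustration, backbone selection by model_type might look like the sketch below. The MONAI UNet hyperparameters and the rs_unet import are illustrative assumptions, not Catena's actual code:

```python
# Hypothetical factory sketch: select a backbone from model_type.
# The UNet channel/stride settings and the rs_unet module are assumptions.
def build_model(model_type: str, in_channels: int = 1, out_channels: int = 2):
    if model_type == "monai_unet":
        from monai.networks.nets import UNet
        return UNet(
            spatial_dims=3,
            in_channels=in_channels,
            out_channels=out_channels,
            channels=(16, 32, 64, 128, 256),
            strides=(2, 2, 2, 2),
            num_res_units=2,
        )
    if model_type == "rs_unet":
        from rs_unet import RSUNet  # hypothetical import of the adapted RS-UNet
        return RSUNet(in_channels, out_channels)
    raise ValueError(f"unknown model_type: {model_type!r}")
```

Because both backbones consume the same patches and emit the same voxel-wise predictions, everything around them (Zarr IO, patching, sliding-window inference) stays unchanged.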

3. Semantic + connected components vs full panoptic model

Option: train a full panoptic instance segmentation model (e.g. directly at MitoNet's level of complexity).

  • Pros

    • Instances are explicit; no need for post-hoc connected components.
    • Closer to Empanada's internal model.
  • Cons

    • More complex models and training pipelines.
    • Heavier to run and tune for new datasets.
    • Harder to integrate with generic segmentation tools in Catena.

We instead adopt a semantic-first approach (sometimes a semantic mask is all that is required) and do instance segmentation via connected components:

  • semantic prediction is produced by either MONAI U-Net or RS-UNet,
  • instance labels are generated by a small, explicit post-processing step.

This keeps the pipeline simpler, more transparent, and easier to adapt.

Decision

We settled on the following design:

  1. Ground-truth generation:

     • Use Empanada / MitoNet to obtain strong panoptic predictions.
     • Curate these predictions into training labels for mitochondria.
     • For public datasets, use Seg2Link-3D for ground-truth curation (see proofreading section).

  2. Two semantic segmentation backbones:

     • MONAI 3D U-Net as a straightforward baseline.
     • Residual U-Net (RS-UNet) adapted from Xie et al., with residual blocks in the encoder-decoder backbone to improve feature reuse and gradient flow while keeping the same semantic-to-instance pipeline.

  3. Zarr-based patch-wise training and inference:

     • Patch size and stride controlled via config (patch_size, stride).
     • All EM data under volumes/raw; labels under volumes/labels/*.

  4. Instance segmentation as a separate, simple step:

     • Use connected components on binarised semantic masks, with configurable thresholds and size filters.

This gives us a flexible, modular setup that fits with how Catena handles other modalities (neurons, synapses, EM masks). We can adapt this framework to explore affinity-based methods for mitochondria segmentation (for example, using LSDs for joint neuron + mitochondria predictions) and to explore contemporary graph-cut methods for better instance segmentation results.

Trade-offs

  • Model complexity vs ease of use

    • MONAI U-Net is simple and easy to understand and use; RS-UNet is more expressive but slightly more complex.
    • Supporting both adds a bit of code complexity but gives users flexibility.
  • Panoptic vs semantic -> instances

    • Using Empanada only for GT and not for inference means we duplicate some functionality, but we gain a unified Catena-style training/inference story.
    • Doing instances via connected components is naive compared to full panoptic models, but it is explicit, tunable, and easy to inspect.
  • Patch-wise processing vs full-volume

    • Patch-based training/inference is necessary for memory reasons, but it introduces tiling and overlap hyperparameters that need to be set carefully.
  • Format conventions

    • Insisting on volumes/raw and volumes/labels/... plus Zarr structure may require users to convert/import their data, but it standardises everything downstream.

Implementation Notes

Training configuration

Training is driven by an Args class (see monai_unet_train.py / rsunet_train.py), with key fields such as:

  • Experiment & data:
    • exp_name: experiment identifier (used for checkpoints, logs).
    • train_zarr_dirs, test_zarr_dirs: one or more Zarr directories for train/test.
    • label_type: 'neuron' or 'mito', which determines which label dataset to read (volumes/labels/neuron_ids vs volumes/labels/mito_ids).

Note

Following Xie et al., this option lets us test whether a model trained on neuron segmentation yields better results when fine-tuned on mitochondria segmentation, even when mitochondrial datasets are limited.

  • Training schedule:

    • epochs: max training epochs.
    • batch_size: patch batch size (often 1 for 3D patches when running on a 12GB GPU).
    • eval_interval: validation frequency.
    • ckpt_interval: checkpoint save interval.
  • Patch & resolution:

    • patch_size: e.g. [128, 128, 128].
    • stride: e.g. [64, 64, 64] (controls overlap).
    • original_res, target_res: physical resolution metadata (often equal for now).
  • Preprocessing & sampling:

    • clahe: whether to apply CLAHE contrast normalisation.
    • subsample_frac, subsample_number, subsample_seed: control patch subsampling.
    • balance_patches: if True, enforce a balance between positive/negative patches.
    • min_positive_pixels: minimum number of positive pixels per patch to consider it "positive".
  • Optimisation & loss:

    • learning_rate, learning_rate_after_hotstart_50.
    • loss_type: 'DiceLoss' or 'DiceCE' (Dice + cross-entropy).
    • loss_weights: weight factor to compensate for class imbalance.
  • Augmentation & model init:

    • rotation_augs, contrast_augs.
    • model_loc / resume_checkpoint: path to a checkpoint (for fine-tuning) or None for training from scratch. This will be cleaned up in future releases.
    • freeze_encoder, hotstart: options to partially freeze or warm-start the model.

Both MONAI U-Net and RS-UNet use the same high-level config; the difference is which script you call and what model_type you specify.
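A dataclass-style sketch of such a config is shown below. The field names follow the description above, but the defaults are illustrative assumptions; the real Args class in monai_unet_train.py / rsunet_train.py may differ:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the training config; defaults here are assumptions,
# not the values used by the actual Args class.
@dataclass
class Args:
    exp_name: str = "mito_baseline"
    train_zarr_dirs: list = field(default_factory=list)
    test_zarr_dirs: list = field(default_factory=list)
    label_type: str = "mito"            # 'neuron' or 'mito'
    epochs: int = 100
    batch_size: int = 1                 # 3D patches are memory-hungry
    patch_size: tuple = (128, 128, 128)
    stride: tuple = (64, 64, 64)
    clahe: bool = False
    balance_patches: bool = True
    min_positive_pixels: int = 100
    loss_type: str = "DiceCE"           # or 'DiceLoss'
    model_type: str = "monai_unet"      # or 'rs_unet'
```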

Inference configuration

Inference uses an InferenceArgs class (e.g. in predict.py and rsunet_predict.py), with:

  • model_path: path to the .pth checkpoint.
  • test_zarr_dirs: list of Zarr directories to run inference on.
  • label_type: must match training.
  • patch_size, stride, original_res, target_res, clahe: must be consistent with training.
  • batch_size, num_workers: runtime performance controls.
  • output_dir, output_filename, output_format ("tiff" or "zarr").
  • model_type: "rs_unet" or "monai_unet" to select the architecture.

The inference script writes out a semantic prediction volume at the target resolution. If all you need is semantic mitochondria labels, you can stop here.
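The patch_size / stride pair implies a sliding-window tiling of the volume. A minimal sketch of the per-axis tiling arithmetic (a hypothetical helper; the actual scripts may tile differently, e.g. via MONAI's sliding-window utilities):

```python
def patch_starts(extent: int, patch: int, stride: int) -> list:
    """Start offsets along one axis so overlapping patches tile the axis,
    clamping the final patch so it ends exactly at the volume boundary."""
    if extent <= patch:
        return [0]
    starts = list(range(0, extent - patch + 1, stride))
    if starts[-1] + patch < extent:
        starts.append(extent - patch)  # extra patch to cover the remainder
    return starts
```

For a 300-voxel axis with patch 128 and stride 64 this yields starts [0, 64, 128, 172]: the last patch is shifted back so no voxels are missed, which is why stride must stay consistent between training and inference.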

Instance segmentation

Instance labels are generated by instance_segmenter.py, driven by a ConversionArgs class:

  • input_prediction_path: path to the saved semantic prediction file (TIFF or Zarr).
  • output_instance_dir, output_format: where and how to save instances.
  • chunk_size, overlap: control chunked processing for large volumes.
  • thres_foreground: probability threshold for binarising semantic predictions (0-1).
  • thres_small_instances: minimum size for keeping an instance.
  • scale_factors: optional scaling (usually (1, 1, 1)).
  • remove_small_mode: how to treat small objects (e.g. 'background').

The script:

  1. Loads the semantic prediction volume.
  2. Binarises it using thres_foreground.
  3. Runs connected components per chunk (with overlap stitching).
  4. Removes small instances below thres_small_instances.
  5. Writes an instance-labelled volume.
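A minimal single-chunk sketch of steps 2-4 (chunked processing and overlap stitching omitted; scipy's ndimage.label stands in for whatever labelling instance_segmenter.py actually uses):

```python
import numpy as np
from scipy import ndimage

def semantic_to_instances(prob, thres_foreground=0.5, thres_small_instances=10):
    """Binarise a semantic probability map, label connected components,
    and relabel instances below the size threshold as background (0)."""
    mask = prob > thres_foreground                # step 2: binarise
    labels, num = ndimage.label(mask)             # step 3: connected components
    if num == 0:
        return labels
    sizes = np.bincount(labels.ravel())           # voxels per label (0 = background)
    small = np.flatnonzero(sizes < thres_small_instances)
    small = small[small != 0]                     # never remove background itself
    labels[np.isin(labels, small)] = 0            # step 4: drop small instances
    return labels
```

This corresponds to remove_small_mode = 'background'; other modes (e.g. merging small objects into neighbours) would replace the final relabelling step.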

Operational Guidance

Choosing a backbone

  • Start with the MONAI U-Net if you want a straightforward baseline and easier debugging.
  • Try RS-UNet if you:
    • already have a working baseline, and
    • want to see whether residual blocks improve IoU / Dice for your dataset.

Tuning training

  • If mitochondria are rare in the volume, ensure:
    • balance_patches = True,
    • min_positive_pixels is set to something sensible (inspect a few visualisations).
  • If training is unstable or overfitting:
    • reduce learning_rate,
    • consider lowering loss_weights,
    • start with fewer augmentations and add them back gradually.

Tuning instance segmentation

  • If you see many tiny, spurious instances:
    • increase thres_small_instances.
  • If you lose small mitochondria you care about:
    • decrease thres_small_instances,
    • and possibly lower thres_foreground slightly (but watch for noise).
  • For large volumes, adjust chunk_size and overlap so chunks fit into memory but still allow smooth stitching.

Always visually inspect:

  • raw EM,
  • semantic prediction,
  • instance labels

for a few representative subvolumes before trusting metrics alone.

Future Work / Open Questions

  • Better instance segmentation

    • Explore more advanced instance-labelling methods (beyond plain connected components) while keeping the semantic-first philosophy.
  • Tighter integration with Empanada

    • Streamline workflows for transferring models and labels between Empanada and Catena.
  • Cross-dataset generalisation

    • Systematically test how well models trained on one FIBSEM dataset transfer to others, and what minimal fine-tuning is required.
  • Joint modelling with other modalities

    • Use shared backbones or multi-task setups (e.g. neurons + mitochondria) to exploit shared structure in EM data, while keeping the current pipelines as a simple, reliable baseline.

The current design aims to be practical, transparent, and composable: good enough to use today, and flexible enough to evolve as we gather more data and experience.