Proofreading¶
Short overview of what this module does and links to usage.
- Install & Usage: See the module's README and scripts in the GitHub repository for the most up-to-date instructions.
- Design Choices: See Design Choices for the "why" behind the "what".
Why do we proofread segmentations?¶
No matter how good the model, automatic segmentations are never perfect. Proofreading is the step where humans (or semi-automatic tools) inspect and correct these segmentations so they can be trusted for downstream analysis.
Proofreading happens at different scales:
- Small sub-volumes
  Carefully proofread blocks are used to retrain machine learning models, turning corrected segmentations into new ground-truth. This is far more efficient than painting labels directly on raw voxels from scratch.
- Whole-brain segmentations
  At larger scale, proofreading focuses on neuron trajectories and major merge/split errors so that the resulting segmentation can be safely used for circuit reconstruction and quantitative analysis.
At the moment, proofreading is still one of the most expensive and rate-limiting steps in connectomics. The goal of this module within Catena is to make proofreading:
- more targeted (only where it matters),
- more scalable (via prioritisation and potentially via automation),
- and more useful (every correction feeds back into better models and better future segmentations).
Instead of treating proofreading as a painful afterthought, we treat it as a key part of the loop that accelerates ground-truth generation and steadily improves the entire pipeline. We believe good ground-truth leads to more useful models.
Proofreading sub-cubes locally with Seg2Link¶
Seg2Link was designed to support semi-automatic cell segmentation and comes with two main workflows:
a) segmentation, and
b) proofreading in 3D.
We mainly leverage the proofreading workflow. Seg2Link is built on napari, which makes it very convenient to:
- inspect 3D segmentations,
- correct split and merge errors,
- and even directly paint in false negative regions if needed.
This makes it ideal for proofreading sub-cubes locally, for example, when you want to clean up a small region to use as ground-truth for retraining.
Seg2Link expects TIFF volumes as input, so we provide additional scripts to convert our Zarr-based outputs into TIFF stacks.
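As a rough illustration of what such a conversion involves, here is a minimal sketch using the `zarr` and `tifffile` packages. It is not the repository script: the function names, the `dataset` argument, the `(z, y, x)` axis order, and the choice to write one 2D TIFF per z-slice are all assumptions made for this example. Check which layout and label dtype your Seg2Link version accepts before relying on it.

```python
from pathlib import Path


def slice_path(out_dir, z, prefix="seg", digits=5):
    """Build a zero-padded file name so slices sort in z order (hypothetical naming)."""
    return Path(out_dir) / f"{prefix}_{z:0{digits}d}.tif"


def zarr_to_tiff_slices(zarr_path, dataset, out_dir, prefix="seg"):
    """Write each z-slice of a 3D Zarr array as a separate 2D TIFF.

    Assumes the array at `dataset` is ordered (z, y, x).
    Returns the number of slices written.
    """
    # Third-party imports are kept local so slice_path works without them.
    import tifffile
    import zarr

    volume = zarr.open(zarr_path, mode="r")[dataset]
    if volume.ndim != 3:
        raise ValueError(f"expected a 3D array, got {volume.ndim} dimensions")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Reading one slice at a time keeps memory use at a single 2D plane.
    for z in range(volume.shape[0]):
        tifffile.imwrite(str(slice_path(out_dir, z, prefix)), volume[z])
    return volume.shape[0]
```

Reading slice by slice avoids loading the whole sub-cube into memory, at the cost of re-reading Zarr chunks that span several z-planes. The scripts in the repository remain the reference for the actual conversion.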
You can find these scripts here:
Proofreading whole-brain segmentations in CAVE¶
We also use CAVE (the Connectome Annotation Versioning Engine, familiar to many through FlyWire) to proofread large, ingested segmentations.
CAVE acts as a version control system for segmentations: it keeps track of all edits to the segmentation graph as users merge or split neuron instances during proofreading. Every change is recorded, so we can always see what was corrected, by whom, and when.
CAVE comes with a rich ecosystem of tools and APIs, documented here:
In Catena, we ingest most of our large volumes into CAVE and then:
- proofread neuron trajectories either densely in specific regions or sparsely but completely across the brain,
- export the corrected segmentations, and
- use these improved labels to retrain or fine-tune our models.
Our current export scripts can be found here:
Automatic Proofreading with CATMAID Skeletons¶
Tracing neurons has been the standard way to reconstruct their trajectories for a long time, and CATMAID has been the tool of choice for doing this at scale. Once these skeletons exist, they are not only useful for morphology; they can also be reused to proofread automatic segmentations.
In Catena, we explore using CATMAID skeletons to automatically merge neuron segments produced by either:
- bottom-up, affinity-based approaches, or
- top-down methods.
The basic idea is simple: if a single skeleton runs through multiple segmented fragments, those fragments probably belong to the same neuron and should be merged.
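The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script used in Catena: `propose_merges`, its parameters, and the input layout are hypothetical. It assumes skeleton nodes have already been converted to integer voxel coordinates `(z, y, x)` and that `labels[z][y][x]` returns the fragment ID at that voxel (nested lists and NumPy arrays both support this indexing). Fragments touched by more than one skeleton are reported as conflicts rather than merged, since they may hide a merge error in the segmentation.

```python
from collections import Counter, defaultdict


def propose_merges(labels, skeletons, background=0, min_hits=1):
    """Suggest fragment merges from skeletons.

    labels:    3D volume indexable as labels[z][y][x] -> fragment ID
    skeletons: {skeleton_id: [(z, y, x), ...]} in voxel coordinates
    min_hits:  nodes that must fall inside a fragment before it counts
    Returns (merges, conflicts): merges maps skeleton_id to a sorted list
    of fragment IDs to merge; conflicts lists fragments hit by more than
    one skeleton.
    """
    touched = {}                 # skeleton_id -> set of fragment IDs
    owners = defaultdict(set)    # fragment ID -> skeleton IDs touching it

    for sid, nodes in skeletons.items():
        hits = Counter()
        for z, y, x in nodes:
            frag = int(labels[z][y][x])
            if frag != background:
                hits[frag] += 1
        frags = {f for f, n in hits.items() if n >= min_hits}
        touched[sid] = frags
        for f in frags:
            owners[f].add(sid)

    # A fragment claimed by several skeletons is ambiguous: leave it for a human.
    conflicts = {f for f, sids in owners.items() if len(sids) > 1}

    merges = {}
    for sid, frags in touched.items():
        safe = sorted(frags - conflicts)
        if len(safe) > 1:        # a merge needs at least two fragments
            merges[sid] = safe
    return merges, sorted(conflicts)
```

Raising `min_hits` is one simple way to ignore fragments that a skeleton only grazes with a single, possibly misplaced, node.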
One practical caveat is that skeleton node placement is not always very dense. To fit B-splines (or other smooth curves) that more faithfully follow the true neuron trajectory, we sometimes need to resample skeleton points at a higher frequency before using them for proofreading.
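A minimal sketch of such resampling is shown below. It uses plain linear interpolation between consecutive nodes rather than a B-spline fit, and the function name and `max_spacing` parameter are hypothetical. It assumes the input is one unbranched path of the skeleton (branches would be handled path by path) with all coordinates in the same units.

```python
import math


def densify_path(points, max_spacing):
    """Insert linearly interpolated points along a path.

    points:      ordered list of (z, y, x) coordinates of one unbranched path
    max_spacing: largest allowed distance between consecutive output points
    Returns a new list of float tuples that keeps every original node.
    """
    if max_spacing <= 0:
        raise ValueError("max_spacing must be positive")
    if not points:
        return []

    out = [tuple(float(c) for c in points[0])]
    for a, b in zip(points, points[1:]):
        # Split each edge into equal pieces no longer than max_spacing.
        n_pieces = max(1, math.ceil(math.dist(a, b) / max_spacing))
        for i in range(1, n_pieces + 1):
            t = i / n_pieces
            out.append(tuple(ca + t * (cb - ca) for ca, cb in zip(a, b)))
    return out
```

The densified points can then be passed to a curve-fitting routine or used directly for fragment lookups, so that small fragments lying between two sparse nodes are less likely to be skipped.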
We experimented a bit with this idea; an initial script and some preliminary results can be found here:
At the moment, we do not yet have a large number of fully proofread, complete skeletons to exploit, so this line of work is still exploratory. As more high-quality tracings become available, we expect this approach to become a much more powerful automatic proofreading tool (at least to recover fragmented neuronal backbones).
Why do we proofread synapses?¶
As with neuron and mitochondria segmentation, machine predictions for synapses are never 100% accurate. We still need ways to check and verify them, usually with a human in the loop.
Synapses are especially critical because they define the connectivity between neurons and are often the readout we care about when comparing brains across learning conditions, developmental stages, or disease models. Missed synapses or spurious synapses can directly distort the inferred circuit.
Automatic methods such as Synful or SimpSyn can:
- miss true synaptic sites (false negatives), or
- propose too many candidates (false positives).
Both kinds of errors need to be proofread.
CATMAID-based proofreading workflow¶
For synapses, we use CATMAID as our proofreading environment:
- Synapse predictions from Synful or SimpSyn are exported and pushed into CATMAID as candidate sites. See our example script here.
- Using CATMAID's native tools, proofreaders inspect each candidate in the EM volume.
- Synapses are then tagged with labels such as:
  - correct synapse
  - incorrect synapse
  - uncertain
These tags allow us to:
- clean up synapse lists for downstream connectivity analysis,
- build curated validation sets, and
- feed corrections back into the training loop for future models.
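Once the tags are exported, the first of these uses can be sketched as a simple filter. The record layout below (dicts with `id` and `tag` keys) and the function name are assumptions for illustration, not the format of a CATMAID export. Only the three tag strings come from the workflow above.

```python
from collections import Counter

TAG_CORRECT = "correct synapse"
TAG_INCORRECT = "incorrect synapse"
TAG_UNCERTAIN = "uncertain"


def summarize_proofreading(candidates):
    """Split proofread synapse candidates by tag.

    candidates: list of dicts with hypothetical keys 'id' and 'tag'
                (a missing tag means the candidate was not reviewed)
    Returns (kept, counts, precision) where kept are the confirmed
    synapses, counts is a Counter of tags, and precision is the share of
    confirmed synapses among candidates with a definite verdict.
    """
    counts = Counter(c.get("tag") for c in candidates)
    kept = [c for c in candidates if c.get("tag") == TAG_CORRECT]

    # Uncertain and unreviewed candidates are excluded from the estimate.
    decided = counts[TAG_CORRECT] + counts[TAG_INCORRECT]
    precision = counts[TAG_CORRECT] / decided if decided else None
    return kept, counts, precision
```

Note that this estimate only covers false positives among the proposed candidates. Missed synapses (false negatives) never appear in the candidate list, so they have to be found by separately inspecting the EM volume.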
See our preprint on proofreading synapses, "Beyond Agreement: Standardizing Crowdsourced Synapse Annotations through Proofreading in EM Connectomics".