StructMatcher
StructMatcher is a CIF-based composition and structure app for structure hygiene, duplicate removal, known-library matching, novelty filtering, and structure-landscape visualization.
Modes
| Mode | Use case | Main outputs |
|---|---|---|
| Pairwise | Compare exactly two CIFs | structure-comparison-pair.csv, summary report |
| Dedup | Cluster identical CIFs in one uploaded set | structure-comparison-dedup.csv, summary report |
| Cross-reference | Find candidate CIFs that match a reference library | structure-comparison-cross-matches.csv, unmatched novel CIF ZIP, summary report |
| Novelty filter | Keep candidates with no match in local MP or an uploaded reference library | novelty CSV, novel CIF ZIP, novel representative CIF ZIP, summary report |
| Cluster map | Embed uploaded CIFs into a 2D map | embedding CSV, PNG plot, summary report |
Matching Logic
The app uses pymatgen StructureMatcher. Matching can tolerate translations, rotations, primitive/supercell changes, and modest relaxation noise.
Pairwise mode reports both the normal element-aware structure match and an anonymous framework/prototype match. The pair CSV and HTML report include exactly one V among these four outcome columns:
- same formula + same structure
- same formula + different structure
- different formula + same prototype
- different formula + different topology
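The four outcomes above can be read as a small decision table. The sketch below is illustrative only: the function name `classify_pair` and its three boolean inputs are hypothetical, and the rule that the formula comparison picks which of the two matches decides the outcome is an assumption, not confirmed app behavior.

```python
def classify_pair(same_formula: bool, structure_match: bool, prototype_match: bool) -> str:
    """Return the one outcome column that would be marked for a pair.

    same_formula:     the two CIFs share a formula (assumed precomputed)
    structure_match:  result of the element-aware structure match
    prototype_match:  result of the anonymous framework/prototype match
    """
    if same_formula:
        # With equal formulas, the element-aware match decides the outcome.
        if structure_match:
            return "same formula + same structure"
        return "same formula + different structure"
    # With different formulas, only the anonymous match can still agree.
    if prototype_match:
        return "different formula + same prototype"
    return "different formula + different topology"
```

Because every input combination falls into exactly one branch, exactly one column is marked per pair.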
Result tables report derived symmetry metadata (space_group_name, space_group_number, and crystal_system) from pymatgen SpacegroupAnalyzer.
Identity-style matching now uses a two-step gate by default:
1. StructureMatcher must fit the structures.
2. If the Space group match checkbox is enabled, both structures must have the same derived space-group number using the same symprec and angle_tolerance values.
The default SpacegroupAnalyzer tolerances are symprec=0.05 and angle_tolerance=8. The CSV reports structurematcher_fit, symmetry_agrees, final_identity_match, and symmetry_rejected_matches so users can audit cases where topology matched but derived symmetry disagreed. Framework/anonymous topology mode disables this gate by default; for other topology-only checks, uncheck Space group match or set identity_symmetry_check=false in advanced options.
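A minimal sketch of the two-step gate follows, assuming the geometric fit and the two space-group numbers have already been computed. The `GateResult` class and `identity_gate` function are hypothetical names; the first three fields mirror the CSV columns listed above, and `symmetry_rejected` is an illustrative per-pair flag, not necessarily how the app fills symmetry_rejected_matches.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    structurematcher_fit: bool
    symmetry_agrees: bool
    final_identity_match: bool
    symmetry_rejected: bool  # hypothetical per-pair audit flag


def identity_gate(fit: bool, sg_number_a: int, sg_number_b: int,
                  identity_symmetry_check: bool = True) -> GateResult:
    """Combine a geometric fit with a space-group comparison."""
    symmetry_agrees = sg_number_a == sg_number_b
    if identity_symmetry_check:
        # Step 2 applies only when the gate is enabled.
        final = fit and symmetry_agrees
    else:
        final = fit
    # A pair is "rejected by symmetry" only if it passed the geometric fit.
    rejected = fit and identity_symmetry_check and not symmetry_agrees
    return GateResult(fit, symmetry_agrees, final, rejected)
```

In practice `fit` would come from pymatgen's `StructureMatcher.fit`, and the two numbers from `SpacegroupAnalyzer(...).get_space_group_number()` called with the same symprec and angle_tolerance for both structures.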
Presets:
- CDVAE generated: default tolerant settings for generated structures (ltol=0.3, stol=0.5, angle_tol=10).
- DFT relaxed: pymatgen-style relaxed tolerances.
- Strict: tighter lattice/site/angle tolerance.
- ML-relaxed tolerant: looser settings for ML-relaxed candidate sets.
- Anonymous structure type: ignores element identity and groups by topology/stoichiometry.
The UI exposes the main matching controls:
- Formula filter: same reduced formula, exact composition, anonymous stoichiometry, or no formula prefilter.
- Species matching: element identity while ignoring oxidation labels, exact species/oxidation labels, or framework topology only.
- Cell handling: primitive plus supercell matching, primitive-only matching, or as-uploaded matching.
- Lattice tolerance, site tolerance, angle tolerance, and optional RMS displacement cutoff.
- Space group match: checked by default for identity matching.
- Space Group Analyzer tolerances: symprec and angle_tolerance.
- Novelty output: novel rows only, or include matched rows for audit.
- Reference set: cross-reference and novelty modes share the same reference selector. Use local Materials Project CIFs by default, or upload reference archives.
Advanced options can still be supplied as key=value lines:
ltol=0.3
stol=0.5
angle_tol=10
rms_tolerance=0.2
max_dist_tolerance=0.5
symprec=0.05
angle_tolerance=8
identity_symmetry_check=true
formula_mode=reduced
species_mode=element
cell_mode=primitive_supercell
include_matched=false
reference_database=local_mp
cluster_eps=1.5
max_cifs=1000
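As a rough illustration of how key=value lines like those above could be read into typed settings, here is a small parser sketch. The function names, the type-coercion rules, and the skipping of blank or `#`-prefixed lines are assumptions for this example, not the app's documented behavior.

```python
def _coerce(value: str):
    """Convert a raw string to bool, int, or float when it looks like one."""
    lowered = value.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value  # leave anything else as a plain string


def parse_advanced_options(text: str) -> dict:
    """Parse key=value lines into a dictionary of typed options."""
    options = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments are ignored
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"line {line_number}: expected key=value, got {raw!r}")
        options[key.strip()] = _coerce(value.strip())
    return options
```

For example, `parse_advanced_options("ltol=0.3\nmax_cifs=1000")` yields a float for ltol and an integer for max_cifs.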
Candidate And Reference Uploads
For cross-reference and novelty modes, the UI splits the input row into candidates on the left and the reference set on the right. The reference set can be:
- Local Materials Project CIFs: uploaded CIFs are candidates; each candidate first retrieves local MP CIFs with the same reduced composition, then runs StructureMatcher.
- Uploaded reference CIF archive: upload candidate CIFs on the left and one or more reference CIF/ZIP/TAR.GZ files on the right.
Cross-reference mode also exports structure-comparison-cross-novel-cifs.zip, containing all candidate CIFs that did not match the selected reference set.
When Space group match is enabled, cross-reference and novelty reports also include a “Geometric Matches Rejected by Space Group” audit section in summary.html. Those rows passed the geometric StructureMatcher fit but were rejected because the derived candidate and reference space-group numbers disagreed under the selected symprec and angle_tolerance. The same rows are exported as structure-comparison-symmetry-rejected.csv when present.
The legacy combined-upload workflow is still supported:
- Upload two archives: the first archive is candidates, later archives are references.
- Upload one archive with folders named candidates/ and reference/.
- Use direct CIF filename prefixes such as cand_, candidate_, ref_, reference_, mp_, icsd_, or cod_.
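The folder and prefix conventions of the legacy workflow can be sketched as a small classifier. The function name `infer_role`, the rule that folder names take precedence over filename prefixes, and the case-insensitive comparison are assumptions made for this illustration.

```python
from pathlib import PurePosixPath
from typing import Optional

CANDIDATE_PREFIXES = ("cand_", "candidate_")
REFERENCE_PREFIXES = ("ref_", "reference_", "mp_", "icsd_", "cod_")


def infer_role(member_path: str) -> Optional[str]:
    """Guess whether an archive member is a candidate or a reference CIF."""
    path = PurePosixPath(member_path)
    folders = {part.lower() for part in path.parts[:-1]}
    # Assumption: an enclosing folder name wins over the filename prefix.
    if "candidates" in folders:
        return "candidate"
    if "reference" in folders:
        return "reference"
    name = path.name.lower()
    if name.startswith(CANDIDATE_PREFIXES):
        return "candidate"
    if name.startswith(REFERENCE_PREFIXES):
        return "reference"
    return None  # role cannot be inferred from the path alone
```

A `None` result marks files that need another rule, such as the archive-order convention, to decide their role.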
For novelty mode, the same reference selector is used, but the output emphasizes candidates that do not match the selected reference set. With the default Local Materials Project CIFs reference, all left-side uploaded CIFs are treated as candidates. For each candidate, the app first normalizes its composition with pymatgen, retrieves local MP CIFs with the identical reduced composition, and only then runs StructureMatcher. With Uploaded reference archive, left-side uploads are candidates and right-side uploads are references. The CSV reports:
- matched_source_database
- matched_filename
- matched_material_id
- local_mp_same_composition_candidates
- n_candidates_checked
Choose Uploaded reference archive when you want the older candidate-vs-reference archive workflow.
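The retrieve-then-match order used with the local Materials Project reference can be sketched with a composition index. This is a simplified stand-in: the app normalizes compositions with pymatgen, whereas this example reduces integer element counts with a greatest common divisor. The function names, and the `structures_match` callback that stands in for the StructureMatcher step, are hypothetical.

```python
from collections import defaultdict
from functools import reduce
from math import gcd
from typing import Callable, Dict, List, Optional, Tuple


def reduced_key(composition: Dict[str, int]) -> Tuple[Tuple[str, int], ...]:
    """Reduce integer element counts to a hashable key, e.g. Ti2O4 -> TiO2."""
    divisor = reduce(gcd, composition.values())
    return tuple(sorted((el, n // divisor) for el, n in composition.items()))


def build_index(references: List[Tuple[str, Dict[str, int]]]) -> Dict[tuple, List[str]]:
    """Group reference filenames by reduced composition."""
    index = defaultdict(list)
    for filename, composition in references:
        index[reduced_key(composition)].append(filename)
    return index


def find_match(candidate: Dict[str, int], index: Dict[tuple, List[str]],
               structures_match: Callable[[str], bool]) -> Tuple[Optional[str], int]:
    """Run the expensive structure check only on same-composition references."""
    pool = index.get(reduced_key(candidate), [])
    for filename in pool:
        if structures_match(filename):
            return filename, len(pool)
    return None, len(pool)
```

The returned pair corresponds loosely to matched_filename and local_mp_same_composition_candidates; a candidate with an empty pool is novel without any structure comparison being run.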
Novel Representatives
Novelty mode first filters candidates that have no reference match. It then deduplicates the novel subset and exports:
- structure-comparison-novel-cifs.zip: all novel CIFs.
- structure-comparison-novel-representatives.zip: one representative CIF per duplicate group.
- structure-comparison-novel-dedup.csv: duplicate groups inside the novel subset.
This is the recommended output for downstream screening pipelines.
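The filter-then-deduplicate flow can be sketched as follows. The `is_match` callback is a placeholder for the configured structure comparison, and both the greedy grouping against the first member of each group and the choice of that first member as representative are assumptions of this example.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def novel_representatives(
    candidates: Sequence[T],
    references: Sequence[T],
    is_match: Callable[[T, T], bool],
) -> Tuple[List[T], List[List[T]], List[T]]:
    """Return (novel candidates, duplicate groups, one representative per group)."""
    # Step 1: keep candidates that match nothing in the reference set.
    novel = [c for c in candidates if not any(is_match(c, r) for r in references)]

    # Step 2: greedily group the novel subset into duplicate groups.
    groups: List[List[T]] = []
    for item in novel:
        for group in groups:
            if is_match(item, group[0]):
                group.append(item)
                break
        else:
            groups.append([item])  # no existing group matched

    # Step 3: assumption: the first member represents its group.
    representatives = [group[0] for group in groups]
    return novel, groups, representatives
```

The three return values correspond to the contents of the three exports listed above.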
Phase 2 Report Package
Every run now exports structure-comparison-report-package.zip, which bundles:
- summary.html: card-style report with matching configuration, result preview, and upload manifest.
- summary.csv.
- upload-manifest.csv.
- the primary mode CSV, such as structure-comparison-dedup.csv or structure-comparison-novelty.csv.
- extra mode CSVs, such as structure-comparison-novel-dedup.csv when applicable.
- structure-comparison-symmetry-rejected.csv when geometric matches were rejected by the space-group gate.
- useful CIF folders when applicable, such as representatives/, duplicates_removed/, novel_only/, and novel_representatives/.
- cross-reference novel CIFs when applicable, under cross_novel_cifs/.
Dedup mode also exports standalone structure-comparison-representatives.zip and structure-comparison-duplicates-removed.zip.
Cluster Features
Cluster mode builds a lightweight feature matrix from:
- composition statistics such as atomic mass and electronegativity
- lattice parameters, volume per atom, density, and angle descriptors
- space group and crystal-system one-hot features
- common element fractions
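To make the lattice and crystal-system features concrete, the sketch below builds a partial feature vector from lattice parameters. The function names, feature order, and the seven-entry crystal-system encoding are illustrative assumptions; the app's real feature matrix also contains the composition, density, and space-group features listed above.

```python
import math
from typing import List, Sequence

CRYSTAL_SYSTEMS = (
    "triclinic", "monoclinic", "orthorhombic", "tetragonal",
    "trigonal", "hexagonal", "cubic",
)


def cell_volume(a: float, b: float, c: float,
                alpha: float, beta: float, gamma: float) -> float:
    """Unit-cell volume from lengths and angles given in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)


def structure_features(lattice: Sequence[float], n_atoms: int,
                       crystal_system: str) -> List[float]:
    """Lattice parameters, volume per atom, and a crystal-system one-hot."""
    volume_per_atom = cell_volume(*lattice) / n_atoms
    one_hot = [1.0 if crystal_system == name else 0.0 for name in CRYSTAL_SYSTEMS]
    return [*lattice, volume_per_atom, *one_hot]
```

Dividing the volume by the atom count keeps the feature comparable between a primitive cell and a supercell of the same structure.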
PCA and t-SNE both use scikit-learn, so they are available whenever scikit-learn is installed. UMAP is used when umap-learn is installed; otherwise the app falls back to PCA.
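The UMAP fallback can be sketched with an import check. The function names and the string labels for each method are hypothetical; the only library fact relied on is that the umap-learn package is imported as `umap`.

```python
from importlib.util import find_spec
from typing import Callable


def is_installed(module_name: str) -> bool:
    """Report whether a top-level module can be imported, without importing it."""
    return find_spec(module_name) is not None


def choose_reducer(requested: str,
                   installed: Callable[[str], bool] = is_installed) -> str:
    """Return the embedding method to run, falling back from UMAP to PCA."""
    method = requested.lower()
    if method == "umap" and not installed("umap"):
        return "pca"  # umap-learn is missing, so use PCA instead
    return method
```

Passing the availability check as a parameter keeps the fallback rule testable on machines where umap-learn is not installed.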
Limitations
The comparison ignores magnetic order, charge state, and detailed partial-occupancy semantics. Borderline structures should be inspected manually, especially when using tolerant or anonymous matching.