Random Sampling of Configurations
When the number of possible configurations is very large, full enumeration may be computationally prohibitive. In these cases, you can generate a random sample of symmetry-inequivalent configurations instead.
This guide covers:
Generating random unique configurations
Sampling modes: degeneracy_weighted vs uniform
Reproducibility with seeds
Working with pymatgen structures
Generating structures in batches
When to Use Random Sampling
Full enumeration finds all symmetry-inequivalent configurations, which is ideal when:
You need a complete set for exhaustive calculations
The configuration space is small enough to enumerate
Random sampling is useful when:
Full enumeration would take too long or use too much memory
You only need a representative subset (e.g., for machine learning training data)
You want to explore a large configuration space without exhaustive enumeration
Basic Usage
With Pymatgen Structures
For crystallographic applications, use random_unique_structure_substitutions():
from pymatgen.core import Structure, Lattice
from bsym.interface.pymatgen import random_unique_structure_substitutions
import numpy as np
# Create a 4x4 square lattice
coords = np.array([[0.0, 0.0, 0.0]])
lattice = Lattice.from_parameters(a=1.0, b=1.0, c=1.0, alpha=90, beta=90, gamma=90)
unit_cell = Structure(lattice, ['Li'], coords)
parent_structure = unit_cell * [4, 4, 1]
# Generate 10 random unique structures with 4 Na substitutions
# (There are 33 unique configurations for this composition)
random_structures = random_unique_structure_substitutions(
    parent_structure,
    'Li',
    {'Na': 4, 'Li': 12},
    n=10,
    seed=42
)
print(f"Generated {len(random_structures)} unique structures")
for i, struct in enumerate(random_structures):
    print(f"Structure {i}: {struct.number_of_equivalent_configurations} equivalent configurations")
Generated 10 unique structures
Structure 0: 128 equivalent configurations
Structure 1: 32 equivalent configurations
Structure 2: 128 equivalent configurations
Structure 3: 64 equivalent configurations
Structure 4: 64 equivalent configurations
Structure 5: 16 equivalent configurations
Structure 6: 64 equivalent configurations
Structure 7: 64 equivalent configurations
Structure 8: 64 equivalent configurations
Structure 9: 64 equivalent configurations
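The count of 33 unique configurations quoted in the example can be checked independently with a brute-force orbit count. The sketch below does not use bsym or pymatgen. It assumes that the symmetry group of the 4x4 periodic square lattice is made of the 16 lattice translations combined with the 8 point operations of the square (128 operations in total, which is consistent with the largest degeneracy printed above). The helper names site_index, build_permutations and orbit_degeneracies are illustrative and are not part of any library.

```python
from itertools import combinations, product
from math import comb

N = 4  # the lattice is N x N with periodic boundary conditions


def site_index(x, y):
    """Map periodic lattice coordinates to a site number in 0..N*N-1."""
    return (x % N) * N + (y % N)


# Assumption: the point group is that of the square (8 operations),
# written here as 2x2 integer matrices ((a, b), (c, d)).
POINT_OPS = [
    ((1, 0), (0, 1)), ((0, -1), (1, 0)), ((-1, 0), (0, -1)), ((0, 1), (-1, 0)),
    ((-1, 0), (0, 1)), ((1, 0), (0, -1)), ((0, 1), (1, 0)), ((0, -1), (-1, 0)),
]


def build_permutations():
    """Return every distinct site permutation (point operation + translation)."""
    perms = set()
    for (a, b), (c, d) in POINT_OPS:
        for tx, ty in product(range(N), repeat=2):
            perms.add(tuple(
                site_index(a * x + b * y + tx, c * x + d * y + ty)
                for x in range(N) for y in range(N)
            ))
    return sorted(perms)


def orbit_degeneracies(n_substituted):
    """Group all arrangements into symmetry orbits and return the orbit sizes."""
    perms = build_permutations()
    seen = set()
    degeneracies = []
    for occupied in combinations(range(N * N), n_substituted):
        config = frozenset(occupied)
        if config in seen:
            continue
        orbit = {frozenset(perm[i] for i in config) for perm in perms}
        seen |= orbit
        degeneracies.append(len(orbit))
    return degeneracies


degeneracies = orbit_degeneracies(4)
print(f"{len(degeneracies)} unique configurations covering "
      f"{sum(degeneracies)} of {comb(N * N, 4)} arrangements")
# prints: 33 unique configurations covering 1820 of 1820 arrangements
```

Brute force like this is only practical for small systems; it is shown here as a cross-check of the numbers in the example, not as a substitute for the library's enumeration.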
Abstract Configuration Space
You can also use random_unique_configurations() directly on a ConfigurationSpace:
from bsym.interface.pymatgen import configuration_space_from_structure
config_space = configuration_space_from_structure(parent_structure)
# Generate 5 random unique configurations
random_configs = config_space.random_unique_configurations(
    site_distribution={1: 4, 0: 12},  # 4 occupied, 12 vacant
    n=5,
    seed=42
)
print(f"Generated {len(random_configs)} unique configurations")
for config in random_configs:
    print(f"{config.tolist()}: degeneracy = {config.count}")
Generated 5 unique configurations
[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0]: degeneracy = 128
[0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]: degeneracy = 32
[0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0]: degeneracy = 128
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]: degeneracy = 64
[0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]: degeneracy = 64
Sampling Modes
Two sampling modes are available, controlled by the sampling parameter:
degeneracy_weighted (default)
Each configuration in the full (unsymmetrised) space has equal probability of being selected. This means equivalence classes with higher degeneracy are more likely to be sampled.
This mode is appropriate when:
You want sampling that reflects the statistical weight of configurations
High-degeneracy configurations are more “important” for your application
You’re doing thermodynamic sampling where degeneracy matters
uniform
Each equivalence class has equal probability of being selected, regardless of degeneracy. This uses rejection sampling internally.
This mode is appropriate when:
You want equal representation of all unique configurations
You’re building a diverse training set
Degeneracy should not influence selection
# Compare sampling modes
n_samples = 50
# Degeneracy-weighted sampling
weighted_degeneracies = []
for i in range(n_samples):
    configs = config_space.random_unique_configurations(
        site_distribution={1: 4, 0: 12},
        n=1,
        sampling='degeneracy_weighted',
        seed=i
    )
    weighted_degeneracies.append(configs[0].count)
# Uniform sampling
uniform_degeneracies = []
for i in range(n_samples):
    configs = config_space.random_unique_configurations(
        site_distribution={1: 4, 0: 12},
        n=1,
        sampling='uniform',
        seed=i
    )
    uniform_degeneracies.append(configs[0].count)
print("Degeneracy-weighted sampling:")
print(f" Mean degeneracy: {np.mean(weighted_degeneracies)}")
print()
print("Uniform sampling:")
print(f" Mean degeneracy: {np.mean(uniform_degeneracies)}")
Degeneracy-weighted sampling:
Mean degeneracy: 76.16
Uniform sampling:
Mean degeneracy: 54.32
With degeneracy_weighted sampling, the mean degeneracy of sampled configurations is higher because high-degeneracy configurations are more likely to be selected.
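The difference between the two modes can be reproduced on a toy system that is small enough to work out by hand. The sketch below uses a ring of six sites with two occupied, where rotations of the ring are the only symmetry operations; this gives three equivalence classes with degeneracies 6, 6 and 3. The rejection step (accept a draw with probability 1/degeneracy) is one simple way to obtain uniform sampling over classes and is an assumption made for illustration; it is not necessarily how bsym implements the uniform mode. All function names are hypothetical.

```python
import random
from itertools import combinations

N_SITES = 6  # toy system: a ring of six sites, two of them occupied
ALL_CONFIGS = list(combinations(range(N_SITES), 2))  # full unsymmetrised space


def canonical(config):
    """Return (representative, degeneracy) of a configuration under ring rotations."""
    orbit = {
        tuple(sorted((site + shift) % N_SITES for site in config))
        for shift in range(N_SITES)
    }
    return min(orbit), len(orbit)


def sample_class(rng, sampling="degeneracy_weighted"):
    """Draw one equivalence class using the requested sampling mode."""
    if sampling not in ("degeneracy_weighted", "uniform"):
        raise ValueError(f"unknown sampling mode: {sampling}")
    while True:
        representative, degeneracy = canonical(rng.choice(ALL_CONFIGS))
        # Weighted mode keeps every draw. Uniform mode keeps a draw with
        # probability 1/degeneracy, which cancels the bias towards large classes.
        if sampling == "degeneracy_weighted" or rng.random() < 1.0 / degeneracy:
            return representative, degeneracy


def mean_degeneracy(sampling, n_samples=6000, seed=0):
    """Average degeneracy of the classes returned by repeated sampling."""
    rng = random.Random(seed)
    draws = [sample_class(rng, sampling)[1] for _ in range(n_samples)]
    return sum(draws) / n_samples


weighted_mean = mean_degeneracy("degeneracy_weighted")
uniform_mean = mean_degeneracy("uniform")
print(f"weighted: {weighted_mean:.2f} (exact expectation 81/15 = 5.40)")
print(f"uniform:  {uniform_mean:.2f} (exact expectation 15/3 = 5.00)")
```

The same reasoning applies to the 4x4 example: under uniform sampling the expected degeneracy is the total number of arrangements divided by the number of classes, 1820 / 33 ≈ 55.2, which is in line with the mean of 54.32 observed over 50 draws.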
Reproducibility
Use the seed parameter to get reproducible results:
# Same seed produces same results
configs_1 = config_space.random_unique_configurations(
    site_distribution={1: 4, 0: 12},
    n=3,
    seed=12345
)
configs_2 = config_space.random_unique_configurations(
    site_distribution={1: 4, 0: 12},
    n=3,
    seed=12345
)
print("Results are identical:", all(
    c1.tolist() == c2.tolist()
    for c1, c2 in zip(configs_1, configs_2)
))
Results are identical: True
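The general mechanism behind a seed parameter can be illustrated with Python's standard library alone. This is a minimal sketch, assuming a seed is used to initialise a private random number generator for each call; whether bsym uses Python's random module or another generator internally is not stated here. The function draw_sample is hypothetical.

```python
import random


def draw_sample(seed, population_size=1820, n=3):
    """Pick n distinct indices using a generator private to this call."""
    rng = random.Random(seed)  # global random state is left untouched
    return rng.sample(range(population_size), n)


first = draw_sample(12345)
second = draw_sample(12345)
print("Results are identical:", first == second)
# prints: Results are identical: True
```

Because each call builds its own generator from the seed, the result does not depend on any other random numbers drawn elsewhere in the program.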
Generating Structures in Batches
When generating large numbers of structures, you may want to work in batches - for example, to run DFT calculations on each batch before generating more. The exclude_file and output_file parameters enable this workflow while ensuring no duplicate structures across batches.
import tempfile
import os
# Using a temporary directory for this example
with tempfile.TemporaryDirectory() as tmpdir:
    batch_1_file = os.path.join(tmpdir, 'batch_1.json')
    batch_2_file = os.path.join(tmpdir, 'batch_2.json')
    batch_3_file = os.path.join(tmpdir, 'batch_3.json')
    # Batch 1: Generate initial structures
    structures_1 = random_unique_structure_substitutions(
        parent_structure,
        'Li',
        {'Na': 4, 'Li': 12},
        n=5,
        seed=42,
        output_file=batch_1_file,  # Save configurations for later exclusion
    )
    print(f"Batch 1: {len(structures_1)} structures")
    # Batch 2: Generate more structures, excluding batch 1
    structures_2 = random_unique_structure_substitutions(
        parent_structure,
        'Li',
        {'Na': 4, 'Li': 12},
        n=5,
        seed=43,
        exclude_file=batch_1_file,  # Exclude previous batch
        output_file=batch_2_file,
    )
    print(f"Batch 2: {len(structures_2)} structures")
    # Batch 3: Exclude multiple previous batches
    structures_3 = random_unique_structure_substitutions(
        parent_structure,
        'Li',
        {'Na': 4, 'Li': 12},
        n=5,
        seed=44,
        exclude_file=[batch_1_file, batch_2_file],  # Exclude both previous batches
        output_file=batch_3_file,
    )
    print(f"Batch 3: {len(structures_3)} structures")
    print(f"Total unique structures: {len(structures_1) + len(structures_2) + len(structures_3)}")
Batch 1: 5 structures
Batch 2: 5 structures
Batch 3: 5 structures
Total unique structures: 15
The configuration files are portable JSON, so batches can be generated on different machines as long as the number of sites to substitute is the same.
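The exclusion logic behind a batched workflow can be sketched without bsym. The example below uses a toy ring of eight sites with three occupied, labels each equivalence class by its smallest rotated copy, and stores those labels as JSON lists. The file layout, the function sample_batch and its parameters are assumptions made for illustration and do not describe bsym's actual JSON format or internals.

```python
import json
import os
import random
import tempfile
from itertools import combinations

N_SITES = 8  # toy system: ring of eight sites, three of them occupied
ALL_CONFIGS = list(combinations(range(N_SITES), 3))


def canonical(config):
    """Smallest rotated copy of a configuration, used as its class label."""
    return min(
        tuple(sorted((site + shift) % N_SITES for site in config))
        for shift in range(N_SITES)
    )


def sample_batch(n, seed, exclude_files=(), output_file=None):
    """Draw n classes that appear in none of the exclude files."""
    excluded = set()
    for path in exclude_files:
        with open(path) as handle:
            excluded.update(tuple(entry) for entry in json.load(handle))
    # Full enumeration is affordable only because this toy space is tiny;
    # it lets the function fail fast instead of looping forever.
    remaining = {canonical(c) for c in ALL_CONFIGS} - excluded
    if n > len(remaining):
        raise ValueError(f"requested {n} classes but only {len(remaining)} remain")
    rng = random.Random(seed)
    batch = []
    while len(batch) < n:
        label = canonical(rng.choice(ALL_CONFIGS))
        if label not in excluded:
            excluded.add(label)  # also prevents duplicates within the batch
            batch.append(label)
    if output_file is not None:
        with open(output_file, "w") as handle:
            json.dump([list(label) for label in batch], handle)
    return batch


with tempfile.TemporaryDirectory() as tmpdir:
    file_1 = os.path.join(tmpdir, "batch_1.json")
    file_2 = os.path.join(tmpdir, "batch_2.json")
    batch_1 = sample_batch(2, seed=42, output_file=file_1)
    batch_2 = sample_batch(2, seed=43, exclude_files=[file_1], output_file=file_2)
    batch_3 = sample_batch(2, seed=44, exclude_files=[file_1, file_2])

all_labels = batch_1 + batch_2 + batch_3
print(f"{len(all_labels)} classes drawn, {len(set(all_labels))} distinct")
# prints: 6 classes drawn, 6 distinct
```

Storing a canonical label for each class, rather than the specific arrangement that happened to be drawn, is what allows a later batch to recognise a symmetry-equivalent configuration as already used.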
Performance Considerations
When Random Sampling May Be Slow
Performance may degrade when:
n approaches the total number of unique configurations (many rejections)
Using uniform sampling with highly variable degeneracies (rejection sampling overhead)
The symmetry group is very large (computing equivalents is expensive)
Tips
If you need most or all unique configurations, use full enumeration instead
For uniform sampling, be aware that low-degeneracy configurations require more attempts to find
If sampling seems to hang, you may be requesting more configurations than exist
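A quick arithmetic check can indicate whether a requested n is safe before sampling starts. Every equivalence class contains at most as many arrangements as there are symmetry operations, so the number of unique configurations is at least the total number of arrangements divided by the group order, and at most the total itself. The sketch below covers binary substitutions only; the function name is hypothetical, and the group order of 128 for the 4x4 example is an assumption inferred from the largest degeneracy shown in the earlier output.

```python
from math import comb


def unique_config_bounds(n_sites, n_substituted, group_order):
    """Return (lower, upper) bounds on the number of unique configurations."""
    total = comb(n_sites, n_substituted)
    lower = -(-total // group_order)  # ceiling division using integers only
    return lower, total


print(unique_config_bounds(16, 4, 128))  # (15, 1820)
print(unique_config_bounds(16, 8, 128))  # (101, 12870)
```

The counts quoted in this guide (33 for four substitutions, 153 for eight) fall inside these bounds. Requesting n at or below the lower bound is guaranteed to be satisfiable, while an n above the upper bound can never be.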
Example: Generating Training Data
A common use case is generating diverse training data for machine learning:
# Generate diverse structures for ML training
# Using Na=8, Li=8 which has 153 unique configurations
training_structures = random_unique_structure_substitutions(
    parent_structure,
    'Li',
    {'Na': 8, 'Li': 8},
    n=20,
    sampling='uniform',  # Equal representation of all unique configs
    seed=42
)
print(f"Generated {len(training_structures)} training structures")
print(f"All structures have composition: {training_structures[0].composition.reduced_formula}")
# These can be exported for DFT calculations
# for i, struct in enumerate(training_structures):
#     struct.to(filename=f'training_{i:03d}.cif', fmt='cif')
Generated 20 training structures
All structures have composition: NaLi
API Reference
ConfigurationSpace.random_unique_configurations
config_space.random_unique_configurations(
    site_distribution,               # dict mapping species labels to counts
    n,                               # number of configurations to generate
    sampling='degeneracy_weighted',  # or 'uniform'
    seed=None,                       # random seed for reproducibility
    exclude=None,                    # list of Configuration objects to exclude
)
random_unique_structure_substitutions
random_unique_structure_substitutions(
    structure,                       # parent pymatgen Structure
    to_substitute,                   # species label to substitute (e.g., 'Li')
    site_distribution,               # dict mapping species to counts (e.g., {'Na': 4, 'Li': 12})
    n,                               # number of structures to generate
    sampling='degeneracy_weighted',  # or 'uniform'
    seed=None,                       # random seed for reproducibility
    atol=1e-5,                       # tolerance for coordinate mapping
    exclude_file=None,               # path(s) to JSON file(s) of configurations to exclude
    output_file=None,                # path to save generated configurations
)
save_configurations / load_configurations
from bsym.configuration import save_configurations, load_configurations
# Save configurations to JSON
save_configurations(configurations, 'configs.json')
# Load configurations from JSON
configurations = load_configurations('configs.json')