Random Sampling of Configurations

When the number of possible configurations is very large, full enumeration may be computationally prohibitive. In these cases, you can generate a random sample of symmetry-inequivalent configurations instead.

This guide covers:

Generating random unique configurations
Sampling modes: degeneracy_weighted vs uniform
Reproducibility with seeds
Working with pymatgen structures
Generating structures in batches

When to Use Random Sampling

Full enumeration finds all symmetry-inequivalent configurations, which is ideal when:

You need a complete set for exhaustive calculations
The configuration space is small enough to enumerate

Random sampling is useful when:

Full enumeration would take too long or use too much memory
You only need a representative subset (e.g., for machine learning training data)
You want to explore a large configuration space without exhaustive enumeration
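A quick way to judge which regime you are in is to count the unsymmetrised arrangements first. The sketch below uses only the standard library. The helper name n_raw_configurations is hypothetical and not part of bsym. It handles a single substituting species, and the count it returns is an upper bound on the number of symmetry-inequivalent configurations.

```python
from math import comb


def n_raw_configurations(n_sites, n_substituted):
    """Number of ways to place n_substituted atoms on n_sites sites,
    before any symmetry reduction (hypothetical helper, not a bsym call)."""
    return comb(n_sites, n_substituted)


print(n_raw_configurations(16, 4))   # 1820
print(n_raw_configurations(16, 8))   # 12870
print(f"{n_raw_configurations(64, 32):.2e}")
```

The 16-site cells used in this guide are small enough to enumerate. A 64-site cell at half substitution already has on the order of 10^18 raw arrangements, which is where sampling becomes the practical option.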

Basic Usage

With Pymatgen Structures

For crystallographic applications, use random_unique_structure_substitutions():

from pymatgen.core import Structure, Lattice
from bsym.interface.pymatgen import random_unique_structure_substitutions
import numpy as np

# Create a 4x4 square lattice
coords = np.array([[0.0, 0.0, 0.0]])
lattice = Lattice.from_parameters(a=1.0, b=1.0, c=1.0, alpha=90, beta=90, gamma=90)
unit_cell = Structure(lattice, ['Li'], coords)
parent_structure = unit_cell * [4, 4, 1]

# Generate 10 random unique structures with 4 Na substitutions
# (There are 33 unique configurations for this composition)
random_structures = random_unique_structure_substitutions(
    parent_structure,
    'Li',
    {'Na': 4, 'Li': 12},
    n=10,
    seed=42
)

print(f"Generated {len(random_structures)} unique structures")
for i, struct in enumerate(random_structures):
    print(f"Structure {i}: {struct.number_of_equivalent_configurations} equivalent configurations")

Generated 10 unique structures
Structure 0: 128 equivalent configurations
Structure 1: 32 equivalent configurations
Structure 2: 128 equivalent configurations
Structure 3: 64 equivalent configurations
Structure 4: 64 equivalent configurations
Structure 5: 16 equivalent configurations
Structure 6: 64 equivalent configurations
Structure 7: 64 equivalent configurations
Structure 8: 64 equivalent configurations
Structure 9: 64 equivalent configurations


Abstract Configuration Space

You can also use random_unique_configurations() directly on a ConfigurationSpace:

from bsym.interface.pymatgen import configuration_space_from_structure

config_space = configuration_space_from_structure(parent_structure)

# Generate 5 random unique configurations
random_configs = config_space.random_unique_configurations(
    site_distribution={1: 4, 0: 12},  # 4 occupied, 12 vacant
    n=5,
    seed=42
)

print(f"Generated {len(random_configs)} unique configurations")
for config in random_configs:
    print(f"{config.tolist()}: degeneracy = {config.count}")

Generated 5 unique configurations
[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0]: degeneracy = 128
[0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]: degeneracy = 32
[0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0]: degeneracy = 128
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]: degeneracy = 64
[0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]: degeneracy = 64

Sampling Modes

Two sampling modes are available, controlled by the sampling parameter:

`degeneracy_weighted` (default)

Each configuration in the full (unsymmetrised) space has equal probability of being selected. This means equivalence classes with higher degeneracy are more likely to be sampled.

This mode is appropriate when:

You want sampling that reflects the statistical weight of configurations
High-degeneracy configurations are more “important” for your application
You’re doing thermodynamic sampling where degeneracy matters

`uniform`

Each equivalence class has equal probability of being selected, regardless of degeneracy. This uses rejection sampling internally.

This mode is appropriate when:

You want equal representation of all unique configurations
You’re building a diverse training set
Degeneracy should not influence selection

# Compare sampling modes
n_samples = 50

# Degeneracy-weighted sampling
weighted_degeneracies = []
for i in range(n_samples):
    configs = config_space.random_unique_configurations(
        site_distribution={1: 4, 0: 12},
        n=1,
        sampling='degeneracy_weighted',
        seed=i
    )
    weighted_degeneracies.append(configs[0].count)

# Uniform sampling
uniform_degeneracies = []
for i in range(n_samples):
    configs = config_space.random_unique_configurations(
        site_distribution={1: 4, 0: 12},
        n=1,
        sampling='uniform',
        seed=i
    )
    uniform_degeneracies.append(configs[0].count)

print("Degeneracy-weighted sampling:")
print(f"  Mean degeneracy: {np.mean(weighted_degeneracies)}")
print()
print("Uniform sampling:")
print(f"  Mean degeneracy: {np.mean(uniform_degeneracies)}")

Degeneracy-weighted sampling:
  Mean degeneracy: 76.16

Uniform sampling:
  Mean degeneracy: 54.32

With degeneracy_weighted sampling, the mean degeneracy of sampled configurations is higher because high-degeneracy configurations are more likely to be selected.
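The difference between the two modes can be reproduced on a toy problem without bsym. The sketch below assumes four equivalence classes with made-up degeneracies, and implements uniform sampling as a textbook rejection step (accept a weighted draw with probability d_min / d). This acceptance rule is an assumption chosen for illustration; it is not taken from the bsym source, and all function names are hypothetical.

```python
import random

# Hypothetical degeneracies of four equivalence classes (illustrative only).
DEGENERACIES = [1, 2, 4, 8]


def draw_weighted(rng):
    """Return a class index with probability proportional to its degeneracy."""
    return rng.choices(range(len(DEGENERACIES)), weights=DEGENERACIES, k=1)[0]


def draw_uniform(rng):
    """Return a class index with equal probability, by rejection sampling."""
    d_min = min(DEGENERACIES)
    while True:
        k = draw_weighted(rng)
        # Accepting with probability d_min / d cancels the degeneracy bias.
        if rng.random() < d_min / DEGENERACIES[k]:
            return k


def mean_degeneracy(draw, n_draws, seed):
    """Average degeneracy of n_draws samples from a seeded generator."""
    rng = random.Random(seed)
    return sum(DEGENERACIES[draw(rng)] for _ in range(n_draws)) / n_draws


# Exact expectations, for comparison with the simulated means.
exact_weighted = sum(d * d for d in DEGENERACIES) / sum(DEGENERACIES)  # 85 / 15
exact_uniform = sum(DEGENERACIES) / len(DEGENERACIES)                  # 15 / 4

print(f"weighted: simulated {mean_degeneracy(draw_weighted, 20000, 0):.2f}, "
      f"exact {exact_weighted:.2f}")
print(f"uniform:  simulated {mean_degeneracy(draw_uniform, 20000, 0):.2f}, "
      f"exact {exact_uniform:.2f}")
```

For the 16-site, 4-Na example, the uniform formula gives 1820 / 33 ≈ 55.2, which is close to the 54.32 observed above from 50 draws.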

Reproducibility

Use the seed parameter to get reproducible results:

# Same seed produces same results
configs_1 = config_space.random_unique_configurations(
    site_distribution={1: 4, 0: 12},
    n=3,
    seed=12345
)

configs_2 = config_space.random_unique_configurations(
    site_distribution={1: 4, 0: 12},
    n=3,
    seed=12345
)

print("Results are identical:", all(
    c1.tolist() == c2.tolist() 
    for c1, c2 in zip(configs_1, configs_2)
))

Results are identical: True
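The same principle can be seen with Python's own random module. This is a generic illustration of a seeded pseudo-random generator, not a description of how bsym draws its samples; the helper pick_sites is hypothetical.

```python
import random


def pick_sites(n_sites, n_substituted, seed):
    """Choose which sites to substitute using a private, seeded generator."""
    rng = random.Random(seed)  # independent of the global random state
    return sorted(rng.sample(range(n_sites), n_substituted))


first = pick_sites(16, 4, seed=12345)
second = pick_sites(16, 4, seed=12345)
print(first == second)  # True: same seed, same draw
```

Using a private generator rather than the global one means other code that calls random cannot disturb the sequence.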

Generating Structures in Batches

When generating large numbers of structures, you may want to work in batches, for example to run DFT calculations on each batch before generating more. The exclude_file and output_file parameters enable this workflow and ensure that no structure is duplicated across batches.

import tempfile
import os

# Using a temporary directory for this example
with tempfile.TemporaryDirectory() as tmpdir:
    batch_1_file = os.path.join(tmpdir, 'batch_1.json')
    batch_2_file = os.path.join(tmpdir, 'batch_2.json')
    batch_3_file = os.path.join(tmpdir, 'batch_3.json')
    
    # Batch 1: Generate initial structures
    structures_1 = random_unique_structure_substitutions(
        parent_structure,
        'Li',
        {'Na': 4, 'Li': 12},
        n=5,
        seed=42,
        output_file=batch_1_file,  # Save configurations for later exclusion
    )
    print(f"Batch 1: {len(structures_1)} structures")
    
    # Batch 2: Generate more structures, excluding batch 1
    structures_2 = random_unique_structure_substitutions(
        parent_structure,
        'Li',
        {'Na': 4, 'Li': 12},
        n=5,
        seed=43,
        exclude_file=batch_1_file,  # Exclude previous batch
        output_file=batch_2_file,
    )
    print(f"Batch 2: {len(structures_2)} structures")
    
    # Batch 3: Exclude multiple previous batches
    structures_3 = random_unique_structure_substitutions(
        parent_structure,
        'Li',
        {'Na': 4, 'Li': 12},
        n=5,
        seed=44,
        exclude_file=[batch_1_file, batch_2_file],  # Exclude both previous batches
        output_file=batch_3_file,
    )
    print(f"Batch 3: {len(structures_3)} structures")
    
    print(f"Total unique structures: {len(structures_1) + len(structures_2) + len(structures_3)}")

Batch 1: 5 structures
Batch 2: 5 structures
Batch 3: 5 structures
Total unique structures: 15

The configuration files are portable JSON, so batches can be generated on different machines as long as the number of sites to substitute is the same.
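For more than a few batches, the bookkeeping of file names, seeds and exclusion lists can be wrapped in a loop. The sketch below is one possible way to do that. The helper generate_in_batches is hypothetical and not part of bsym; it assumes that the callable passed as generate accepts the keyword arguments n, seed, exclude_file and output_file, as random_unique_structure_substitutions does in the example above.

```python
import os


def generate_in_batches(generate, n_batches, batch_size, directory, first_seed=0):
    """Call `generate` once per batch, excluding every earlier batch.

    Hypothetical helper: `generate` is assumed to take the keywords
    n, seed, exclude_file and output_file.
    """
    previous_files = []
    batches = []
    for i in range(n_batches):
        output_file = os.path.join(directory, f"batch_{i + 1}.json")
        # Mirror the calling pattern shown above: nothing for the first
        # batch, a single path for the second, a list of paths afterwards.
        if not previous_files:
            exclude = None
        elif len(previous_files) == 1:
            exclude = previous_files[0]
        else:
            exclude = list(previous_files)
        batches.append(
            generate(
                n=batch_size,
                seed=first_seed + i,  # a different seed for each batch
                exclude_file=exclude,
                output_file=output_file,
            )
        )
        previous_files.append(output_file)
    return batches


# Possible usage with the objects defined earlier in this guide:
# from functools import partial
# generate = partial(random_unique_structure_substitutions,
#                    parent_structure, 'Li', {'Na': 4, 'Li': 12})
# batches = generate_in_batches(generate, 3, 5, '.', first_seed=42)
```

Passing the generator in as an argument keeps the loop independent of pymatgen, so the same helper can be tried out with a stand-in function before running it on real structures.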

Performance Considerations

When Random Sampling May Be Slow

Performance may degrade when:

n approaches the total number of unique configurations (many rejections)
Using uniform sampling with highly variable degeneracies (rejection sampling overhead)
The symmetry group is very large (computing equivalents is expensive)

Tips

If you need most or all unique configurations, use full enumeration instead
For uniform sampling, be aware that low-degeneracy configurations require more attempts to find
If sampling seems to hang, you may be requesting more configurations than exist
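The cost of asking for nearly all unique configurations can be estimated with the classic coupon-collector argument. The sketch below assumes an idealised sampler in which every draw picks one of the equivalence classes uniformly and independently, and repeats are discarded. That model is an assumption for illustration and is not a description of bsym's internals; the helper expected_draws is hypothetical.

```python
from fractions import Fraction


def expected_draws(n_classes, n_wanted):
    """Expected number of uniform draws (with replacement) needed to see
    n_wanted distinct classes out of n_classes (hypothetical helper)."""
    if not 0 <= n_wanted <= n_classes:
        raise ValueError("n_wanted must lie between 0 and n_classes")
    # After i distinct classes have been found, a draw is new with
    # probability (n_classes - i) / n_classes.
    total = sum(Fraction(n_classes, n_classes - i) for i in range(n_wanted))
    return float(total)


for n in (10, 20, 30, 33):
    print(f"n = {n:2d}: about {expected_draws(33, n):.1f} draws")
```

Under this model, collecting 10 of 33 classes takes roughly 12 draws, while collecting all 33 takes more than 130. Degeneracy-weighted sampling makes the tail worse still, because the last classes to be found tend to be the low-degeneracy ones.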

Example: Generating Training Data

A common use case is generating diverse training data for machine learning:

# Generate diverse structures for ML training
# Using Na=8, Li=8 which has 153 unique configurations
training_structures = random_unique_structure_substitutions(
    parent_structure,
    'Li',
    {'Na': 8, 'Li': 8},
    n=20,
    sampling='uniform',  # Equal representation of all unique configs
    seed=42
)

print(f"Generated {len(training_structures)} training structures")
print(f"All structures have composition: {training_structures[0].composition.reduced_formula}")

# These can be exported for DFT calculations
# for i, struct in enumerate(training_structures):
#     struct.to(filename=f'training_{i:03d}.cif', fmt='cif')

Generated 20 training structures
All structures have composition: NaLi

API Reference

`ConfigurationSpace.random_unique_configurations`

config_space.random_unique_configurations(
    site_distribution,  # dict mapping species labels to counts
    n,                  # number of configurations to generate
    sampling='degeneracy_weighted',  # or 'uniform'
    seed=None,          # random seed for reproducibility
    exclude=None,       # list of Configuration objects to exclude
)

`random_unique_structure_substitutions`

random_unique_structure_substitutions(
    structure,          # parent pymatgen Structure
    to_substitute,      # species label to substitute (e.g., 'Li')
    site_distribution,  # dict mapping species to counts (e.g., {'Na': 4, 'Li': 12})
    n,                  # number of structures to generate
    sampling='degeneracy_weighted',  # or 'uniform'
    seed=None,          # random seed for reproducibility
    atol=1e-5,          # tolerance for coordinate mapping
    exclude_file=None,  # path(s) to JSON file(s) of configurations to exclude
    output_file=None,   # path to save generated configurations
)

`save_configurations` / `load_configurations`

from bsym.configuration import save_configurations, load_configurations

# Save configurations to JSON
save_configurations(configurations, 'configs.json')

# Load configurations from JSON
configurations = load_configurations('configs.json')