{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Random Sampling of Configurations\n", "\n", "When the number of possible configurations is very large, full enumeration may be computationally prohibitive. In these cases, you can generate a random sample of symmetry-inequivalent configurations instead.\n", "\n", "This guide covers:\n", "- Generating random unique configurations\n", "- Sampling modes: `degeneracy_weighted` vs `uniform`\n", "- Reproducibility with seeds\n", "- Working with pymatgen structures\n", "- Generating structures in batches" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## When to Use Random Sampling\n", "\n", "Full enumeration finds *all* symmetry-inequivalent configurations, which is ideal when:\n", "- You need a complete set for exhaustive calculations\n", "- The configuration space is small enough to enumerate\n", "\n", "Random sampling is useful when:\n", "- Full enumeration would take too long or use too much memory\n", "- You only need a representative subset (e.g., for machine learning training data)\n", "- You want to explore a large configuration space without exhaustive enumeration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Usage\n", "\n", "### With Pymatgen Structures\n", "\n", "For crystallographic applications, use `random_unique_structure_substitutions()`:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generated 10 unique structures\n", "Structure 0: 32 equivalent configurations\n", "Structure 1: 16 equivalent configurations\n", "Structure 2: 64 equivalent configurations\n", "Structure 3: 64 equivalent configurations\n", "Structure 4: 64 equivalent configurations\n", "Structure 5: 32 equivalent configurations\n", "Structure 6: 64 equivalent configurations\n", "Structure 7: 32 equivalent configurations\n", "Structure 8: 64 equivalent configurations\n", "Structure 9: 64 
equivalent configurations\n" ] } ], "source": [ "from pymatgen.core import Structure, Lattice\n", "from bsym.interface.pymatgen import random_unique_structure_substitutions\n", "import numpy as np\n", "\n", "# Create a 4x4 square lattice\n", "coords = np.array([[0.0, 0.0, 0.0]])\n", "lattice = Lattice.from_parameters(a=1.0, b=1.0, c=1.0, alpha=90, beta=90, gamma=90)\n", "unit_cell = Structure(lattice, ['Li'], coords)\n", "parent_structure = unit_cell * [4, 4, 1]\n", "\n", "# Generate 10 random unique structures with 4 Na substitutions\n", "# (There are 33 unique configurations for this composition)\n", "random_structures = random_unique_structure_substitutions(\n", " parent_structure,\n", " 'Li',\n", " {'Na': 4, 'Li': 12},\n", " n=10,\n", " seed=42\n", ")\n", "\n", "print(f\"Generated {len(random_structures)} unique structures\")\n", "for i, struct in enumerate(random_structures):\n", " print(f\"Structure {i}: {struct.number_of_equivalent_configurations} equivalent configurations\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Abstract Configuration Space\n", "\n", "You can also use `random_unique_configurations()` directly on a `ConfigurationSpace`:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generated 5 unique configurations\n", "[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]: degeneracy = 4\n", "[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]: degeneracy = 16\n", "[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1]: degeneracy = 32\n", "[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1]: degeneracy = 32\n", "[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]: degeneracy = 16\n" ] } ], "source": [ "from bsym.interface.pymatgen import configuration_space_from_structure\n", "\n", "config_space = configuration_space_from_structure(parent_structure)\n", "\n", "# Generate 5 random unique configurations\n", "random_configs = 
config_space.random_unique_configurations(\n", " site_distribution={1: 4, 0: 12}, # 4 occupied, 12 vacant\n", " n=5,\n", " seed=42\n", ")\n", "\n", "print(f\"Generated {len(random_configs)} unique configurations\")\n", "for config in random_configs:\n", " print(f\"{config.tolist()}: degeneracy = {config.count}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sampling Modes\n", "\n", "Two sampling modes are available, controlled by the `sampling` parameter:\n", "\n", "### `degeneracy_weighted` (default)\n", "\n", "Each configuration in the full (unsymmetrised) space has equal probability of being selected. This means equivalence classes with higher degeneracy are more likely to be sampled.\n", "\n", "This mode is appropriate when:\n", "- You want sampling that reflects the statistical weight of configurations\n", "- High-degeneracy configurations are more \"important\" for your application\n", "- You're doing thermodynamic sampling where degeneracy matters\n", "\n", "### `uniform`\n", "\n", "Each equivalence class has equal probability of being selected, regardless of degeneracy. 
This uses rejection sampling internally.\n", "\n", "This mode is appropriate when:\n", "- You want equal representation of all unique configurations\n", "- You're building a diverse training set\n", "- Degeneracy should not influence selection" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Degeneracy-weighted sampling:\n", " Mean degeneracy: 52.5\n", "\n", "Uniform sampling:\n", " Mean degeneracy: 41.0\n" ] } ], "source": [ "# Compare sampling modes\n", "n_samples = 50\n", "\n", "# Degeneracy-weighted sampling\n", "weighted_degeneracies = []\n", "for i in range(n_samples):\n", " configs = config_space.random_unique_configurations(\n", " site_distribution={1: 4, 0: 12},\n", " n=1,\n", " sampling='degeneracy_weighted',\n", " seed=i\n", " )\n", " weighted_degeneracies.append(configs[0].count)\n", "\n", "# Uniform sampling\n", "uniform_degeneracies = []\n", "for i in range(n_samples):\n", " configs = config_space.random_unique_configurations(\n", " site_distribution={1: 4, 0: 12},\n", " n=1,\n", " sampling='uniform',\n", " seed=i\n", " )\n", " uniform_degeneracies.append(configs[0].count)\n", "\n", "print(\"Degeneracy-weighted sampling:\")\n", "print(f\" Mean degeneracy: {np.mean(weighted_degeneracies)}\")\n", "print()\n", "print(\"Uniform sampling:\")\n", "print(f\" Mean degeneracy: {np.mean(uniform_degeneracies)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `degeneracy_weighted` sampling, the mean degeneracy of sampled configurations is higher because high-degeneracy configurations are more likely to be selected." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reproducibility\n", "\n", "Use the `seed` parameter to get reproducible results:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Results are identical: True\n" ] } ], "source": [ "# Same seed produces same results\n", "configs_1 = config_space.random_unique_configurations(\n", " site_distribution={1: 4, 0: 12},\n", " n=3,\n", " seed=12345\n", ")\n", "\n", "configs_2 = config_space.random_unique_configurations(\n", " site_distribution={1: 4, 0: 12},\n", " n=3,\n", " seed=12345\n", ")\n", "\n", "print(\"Results are identical:\", all(\n", " c1.tolist() == c2.tolist()\n", " for c1, c2 in zip(configs_1, configs_2)\n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Structures in Batches\n", "\n", "When generating large numbers of structures, you may want to work in batches, for example to run DFT calculations on each batch before generating more. The `exclude_file` and `output_file` parameters support this workflow: `output_file` saves the configurations generated in a batch, and `exclude_file` excludes previously saved configurations, so no structure is duplicated across batches."
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Batch 1: 5 structures\n", "Batch 2: 5 structures\n", "Batch 3: 5 structures\n", "Total unique structures: 15\n" ] } ], "source": [ "import tempfile\n", "import os\n", "\n", "# Using a temporary directory for this example\n", "with tempfile.TemporaryDirectory() as tmpdir:\n", " batch_1_file = os.path.join(tmpdir, 'batch_1.json')\n", " batch_2_file = os.path.join(tmpdir, 'batch_2.json')\n", " batch_3_file = os.path.join(tmpdir, 'batch_3.json')\n", " \n", " # Batch 1: Generate initial structures\n", " structures_1 = random_unique_structure_substitutions(\n", " parent_structure,\n", " 'Li',\n", " {'Na': 4, 'Li': 12},\n", " n=5,\n", " seed=42,\n", " output_file=batch_1_file, # Save configurations for later exclusion\n", " )\n", " print(f\"Batch 1: {len(structures_1)} structures\")\n", " \n", " # Batch 2: Generate more structures, excluding batch 1\n", " structures_2 = random_unique_structure_substitutions(\n", " parent_structure,\n", " 'Li',\n", " {'Na': 4, 'Li': 12},\n", " n=5,\n", " seed=43,\n", " exclude_file=batch_1_file, # Exclude previous batch\n", " output_file=batch_2_file,\n", " )\n", " print(f\"Batch 2: {len(structures_2)} structures\")\n", " \n", " # Batch 3: Exclude multiple previous batches\n", " structures_3 = random_unique_structure_substitutions(\n", " parent_structure,\n", " 'Li',\n", " {'Na': 4, 'Li': 12},\n", " n=5,\n", " seed=44,\n", " exclude_file=[batch_1_file, batch_2_file], # Exclude both previous batches\n", " output_file=batch_3_file,\n", " )\n", " print(f\"Batch 3: {len(structures_3)} structures\")\n", " \n", " print(f\"Total unique structures: {len(structures_1) + len(structures_2) + len(structures_3)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The configuration files are portable JSON, so batches can be generated on different machines as long as the number of sites to substitute is the 
same." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance Considerations\n", "\n", "### When Random Sampling May Be Slow\n", "\n", "Performance may degrade when:\n", "- `n` approaches the total number of unique configurations (many rejections)\n", "- Using `uniform` sampling with highly variable degeneracies (rejection sampling overhead)\n", "- The symmetry group is very large (computing equivalents is expensive)\n", "\n", "### Tips\n", "\n", "1. If you need most or all unique configurations, use full enumeration instead\n", "2. For `uniform` sampling, be aware that low-degeneracy configurations require more attempts to find\n", "3. If sampling seems to hang, you may be requesting more configurations than exist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Generating Training Data\n", "\n", "A common use case is generating diverse training data for machine learning:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generated 20 training structures\n", "All structures have composition: Li8 Na8\n" ] } ], "source": [ "# Generate diverse structures for ML training\n", "# Using Na=8, Li=8 which has 153 unique configurations\n", "training_structures = random_unique_structure_substitutions(\n", " parent_structure,\n", " 'Li',\n", " {'Na': 8, 'Li': 8},\n", " n=20,\n", " sampling='uniform', # Equal representation of all unique configs\n", " seed=42\n", ")\n", "\n", "print(f\"Generated {len(training_structures)} training structures\")\n", "print(f\"All structures have composition: {training_structures[0].composition.reduced_formula}\")\n", "\n", "# These can be exported for DFT calculations\n", "# for i, struct in enumerate(training_structures):\n", "# struct.to(filename=f'training_{i:03d}.cif', fmt='cif')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## API Reference\n", "\n", "### 
`ConfigurationSpace.random_unique_configurations`\n", "\n", "```python\n", "config_space.random_unique_configurations(\n", " site_distribution, # dict mapping species labels to counts\n", " n, # number of configurations to generate\n", " sampling='degeneracy_weighted', # or 'uniform'\n", " seed=None, # random seed for reproducibility\n", " exclude=None, # list of Configuration objects to exclude\n", ")\n", "```\n", "\n", "### `random_unique_structure_substitutions`\n", "\n", "```python\n", "random_unique_structure_substitutions(\n", " structure, # parent pymatgen Structure\n", " to_substitute, # species label to substitute (e.g., 'Li')\n", " site_distribution, # dict mapping species to counts (e.g., {'Na': 4, 'Li': 12})\n", " n, # number of structures to generate\n", " sampling='degeneracy_weighted', # or 'uniform'\n", " seed=None, # random seed for reproducibility\n", " atol=1e-5, # tolerance for coordinate mapping\n", " exclude_file=None, # path(s) to JSON file(s) of configurations to exclude\n", " output_file=None, # path to save generated configurations\n", ")\n", "```\n", "\n", "### `save_configurations` / `load_configurations`\n", "\n", "```python\n", "from bsym.configuration import save_configurations, load_configurations\n", "\n", "# Save configurations to JSON\n", "save_configurations(configurations, 'configs.json')\n", "\n", "# Load configurations from JSON\n", "configurations = load_configurations('configs.json')\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.0" } }, "nbformat": 4, "nbformat_minor": 4 }