Symmetry Breaking in Neural Network Training: How Mini-Batch SGD Amplifies Asymmetric Readout from Symmetric Incoming Weights
Introduction
Neural network training usually begins from random initialization, which breaks permutation symmetry between hidden neurons. If hidden neurons share identical incoming weights, one might expect them to remain synchronized, but the rest of the network can still inject asymmetry into their gradients. Understanding how quickly hidden-layer symmetry decays under realistic training dynamics is important for clarifying what random initialization contributes beyond simply "breaking symmetry"[goodfellow2016deep].
Our main sweep deliberately symmetrizes only the incoming hidden weights. This isolates a common practical situation in which one layer starts tied while the rest of the network remains randomly initialized. It does not isolate mini-batch noise alone, so we complement the main sweep with small audit controls that compare full-batch training and a manually symmetrized readout.
We investigate three questions:
- How quickly does mini-batch SGD break hidden-layer symmetry when only the incoming hidden weights start identical?
- How does the speed of symmetry breaking depend on network width?
- Is late-stage geometric symmetry breaking sufficient for strong task performance, or is larger early asymmetry needed?

Methods
Architecture and Task
We use a two-layer ReLU MLP: $f(x) = W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2$, where $W_1 \in \mathbb{R}^{h \times 194}$, $W_2 \in \mathbb{R}^{97 \times h}$, and $h \in \{16, 32, 64, 128\}$. The input $x$ is a concatenation of two one-hot vectors encoding $(a, b)$ with $a, b \in \{0, \dots, 96\}$, and the target is $(a + b) \bmod 97$. The full dataset has $97^2 = 9409$ examples, split 80/20 into train/test.
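As a concrete illustration, the dataset above can be generated as follows. This is a minimal sketch: the function name, split mechanics, and seed handling are illustrative stand-ins for the repository's `generate_modular_addition_data()` in `src/data.py`.

```python
import numpy as np

def make_modular_addition_data(p=97, train_frac=0.8, seed=42):
    """Build the (a + b) mod p dataset: each input is the concatenation of
    one-hot vectors for a and b; the target is (a + b) mod p."""
    pairs = np.array([(a, b) for a in range(p) for b in range(p)])
    x = np.zeros((p * p, 2 * p), dtype=np.float32)
    x[np.arange(p * p), pairs[:, 0]] = 1.0       # one-hot slot for a
    x[np.arange(p * p), p + pairs[:, 1]] = 1.0   # one-hot slot for b, offset by p
    y = (pairs[:, 0] + pairs[:, 1]) % p
    rng = np.random.default_rng(seed)            # seeded shuffle for the split
    order = rng.permutation(p * p)
    n_train = int(train_frac * p * p)
    tr, te = order[:n_train], order[n_train:]
    return x[tr], y[tr], x[te], y[te]
```

With $p = 97$ this yields 7527 training and 1882 test examples, matching the 80/20 split of the 9409-example grid.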
Symmetric Initialization
We set all rows of $W_1$ to a constant vector $w_0$, then add a perturbation: $W_1^{(i)} = w_0 + \varepsilon\,\eta_i$, where $\eta_i \sim \mathcal{N}(0, I)$. We sweep $\varepsilon \in \{0, 10^{-6}, 10^{-4}, 10^{-2}, 10^{-1}\}$. The bias $b_1$ is initialized to zero. $W_2$ uses seeded standard Kaiming uniform initialization. Therefore only the incoming hidden weights are symmetric at initialization; the readout already contains neuron-specific asymmetry.
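A minimal sketch of this initialization scheme follows. The base-vector scale and the Gaussian perturbation form match the description above, but the function name and the exact scale of $w_0$ are illustrative assumptions, not the repository's `_symmetric_init()`.

```python
import numpy as np

def symmetric_rows(h, d, epsilon, seed=42):
    """Return an (h, d) weight matrix whose rows share a common base vector
    w0, plus an independent Gaussian perturbation of scale epsilon per row."""
    rng = np.random.default_rng(seed)
    w0 = rng.normal(scale=1.0 / np.sqrt(d), size=d)  # shared base row (assumed scale)
    noise = rng.normal(size=(h, d))                  # per-neuron asymmetry
    return w0[None, :] + epsilon * noise
```

At $\varepsilon = 0$ every row is identical, so the symmetry metric starts at exactly 1.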
Training
We train with SGD (learning rate 0.1, no momentum) using cross-entropy loss for 2000 epochs with batch size 256. The training data is reshuffled each epoch, providing the stochastic noise. All experiments use seed 42.
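To make the update rule explicit, here is a minimal NumPy sketch of one such epoch: mini-batch SGD with per-epoch reshuffling on a two-layer ReLU MLP with softmax cross-entropy. The actual experiments use PyTorch, so the manual backprop and function name here are illustrative only.

```python
import numpy as np

def sgd_epoch(W1, b1, W2, b2, x, y, lr=0.1, batch=256, rng=None):
    """One epoch of mini-batch SGD, reshuffling the training data first."""
    if rng is None:
        rng = np.random.default_rng(0)
    order = rng.permutation(len(x))               # per-epoch reshuffle
    for s in range(0, len(x), batch):
        idx = order[s:s + batch]
        xb, yb = x[idx], y[idx]
        h = np.maximum(xb @ W1.T + b1, 0.0)       # hidden ReLU activations
        logits = h @ W2.T + b2
        p = np.exp(logits - logits.max(1, keepdims=True))
        p /= p.sum(1, keepdims=True)              # softmax probabilities
        p[np.arange(len(yb)), yb] -= 1.0          # d(cross-entropy)/d(logits)
        p /= len(yb)
        dW2 = p.T @ h
        dh = (p @ W2) * (h > 0)                   # backprop through ReLU
        dW1 = dh.T @ xb
        W2 -= lr * dW2; b2 -= lr * p.sum(0)       # in-place SGD step
        W1 -= lr * dW1; b1 -= lr * dh.sum(0)
    return W1, b1, W2, b2
```

Under symmetric $W_1$, the only per-neuron differences entering `dW1` come through `p @ W2`, which is why the asymmetry of the readout matters.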
Symmetry Metric
We define the symmetry metric as the mean off-diagonal pairwise cosine similarity between hidden neuron weight vectors:

\[ S(W_1) = \frac{2}{h(h-1)} \sum_{i < j} \frac{W_1^{(i)} \cdot W_1^{(j)}}{\|W_1^{(i)}\| \, \|W_1^{(j)}\|} \]

$S = 1$ indicates all neurons are identical; $S \approx 0$ indicates approximate orthogonality.
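The metric is straightforward to compute from the full cosine matrix; since that matrix is symmetric, summing off-diagonal entries and dividing by $h(h-1)$ equals the $\frac{2}{h(h-1)}\sum_{i<j}$ form above. A sketch (function name illustrative):

```python
import numpy as np

def symmetry_metric(W1, eps=1e-12):
    """Mean off-diagonal pairwise cosine similarity between rows of W1."""
    norms = np.linalg.norm(W1, axis=1, keepdims=True)
    U = W1 / np.maximum(norms, eps)   # unit-norm rows (guard against zero rows)
    C = U @ U.T                       # full h x h cosine-similarity matrix
    h = len(W1)
    return (C.sum() - np.trace(C)) / (h * (h - 1))
```

Identical rows give $S = 1$; mutually orthogonal rows give $S = 0$.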
Results
Mini-Batch SGD Rapidly Breaks Partially Symmetric Incoming Weights
All 20 runs achieved symmetry breaking, with final $S < 0.12$ regardless of $\varepsilon$ (Table). Even with $\varepsilon = 0$ (identical rows in $W_1$), mini-batch SGD reduced $S$ from 1.0 to values between 0.023 and 0.116. This shows that hidden-layer symmetry decays quickly in the primary setup, but because $W_2$ is random, the experiment measures the combination of readout asymmetry and mini-batch stochasticity rather than batch noise alone.
Final symmetry metric and test accuracy across all configurations.

**Final Symmetry Metric**

| Width | $\varepsilon = 0$ | $\varepsilon = 10^{-6}$ | $\varepsilon = 10^{-4}$ | $\varepsilon = 10^{-2}$ | $\varepsilon = 10^{-1}$ |
|---|---|---|---|---|---|
| 16 | 0.024 | 0.021 | 0.032 | 0.015 | 0.005 |
| 32 | 0.081 | 0.084 | 0.083 | 0.057 | 0.019 |
| 64 | 0.084 | 0.084 | 0.086 | 0.081 | 0.022 |
| 128 | 0.116 | 0.116 | 0.116 | 0.113 | 0.040 |

**Final Test Accuracy**

| Width | $\varepsilon = 0$ | $\varepsilon = 10^{-6}$ | $\varepsilon = 10^{-4}$ | $\varepsilon = 10^{-2}$ | $\varepsilon = 10^{-1}$ |
|---|---|---|---|---|---|
| 16 | 0.057 | 0.070 | 0.064 | 0.185 | 0.178 |
| 32 | 0.002 | 0.002 | 0.002 | 0.009 | 0.421 |
| 64 | 0.000 | 0.000 | 0.000 | 0.000 | 0.521 |
| 128 | 0.000 | 0.000 | 0.000 | 0.000 | 0.091 |
Breaking Speed Scales with Width
Wider networks require more epochs to break symmetry. The epoch at which $S$ first drops below 0.5 increases systematically: 300 for width 16, 350 for width 32, 500 for width 64, and 650 for width 128 (for $\varepsilon \le 10^{-4}$). With $\varepsilon = 10^{-1}$, all widths break by epoch 100, since the initial perturbation already reduces $S$ substantially below 1 at initialization.
This width dependence is expected: in wider layers, each neuron's update is a smaller fraction of the total hidden-layer dynamics, so asymmetry accumulates more slowly.
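The "breaking epoch" used above can be extracted from the logged symmetry trajectory with a small helper. Since the experiments log $S$ every 50 epochs, this reports the first logged epoch below the 0.5 threshold (helper name illustrative):

```python
def breaking_epoch(epochs, symmetry, threshold=0.5):
    """First logged epoch at which the symmetry metric drops below
    threshold; None if it never does within the logged trajectory."""
    for e, s in zip(epochs, symmetry):
        if s < threshold:
            return e
    return None
```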
Early Asymmetry Matters More Than Late Symmetry Decay
The most striking result is the disconnect between late-stage symmetry decay and task performance. All runs achieve final $S < 0.12$, yet only $\varepsilon = 10^{-1}$ runs achieve meaningful test accuracy. At width 64, the $\varepsilon = 0$ run breaks symmetry to $S = 0.084$ but achieves 0% test accuracy, while the $\varepsilon = 10^{-1}$ run reaches $S = 0.022$ and 52.1% test accuracy (with 98.8% train accuracy).
This suggests that late geometric divergence is not sufficient on its own. Small-$\varepsilon$ runs eventually decorrelate the rows of $W_1$, but they do so too late and too weakly to support useful modular addition features. A sufficiently large initial perturbation appears to seed early diversity that SGD can then amplify into task-relevant representations.
Audit Controls Show Batch Noise Is Not Sufficient in Isolation
Two additional controls clarify the mechanism behind the main sweep. First, replacing mini-batch SGD with full-batch training at width 16 and $\varepsilon = 0$ only reduces symmetry to $S \approx 0.92$ after 2000 epochs, compared with $S = 0.024$ under the audited mini-batch run. Second, if we manually symmetrize the columns of $W_2$ before training, the width 16, $\varepsilon = 0$ model stays at $S \approx 1.0$ for at least 500 SGD epochs.
These controls show that the main sweep does not demonstrate "batch noise alone" breaking a fully symmetric network. Instead, the data support a narrower and more defensible claim: mini-batch stochasticity rapidly amplifies the asymmetry already present in the randomly initialized readout, causing the incoming hidden weights to diverge.
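The second control can be implemented with a small helper along these lines: every column of the readout (one column per hidden neuron) is replaced by the column mean, so no neuron receives a distinguishing gradient through the readout. The helper name and averaging choice are illustrative, not the repository's exact code.

```python
import numpy as np

def symmetrize_readout_columns(W2):
    """Replace each column of the readout matrix with the mean column,
    so every hidden neuron feeds the output layer identically."""
    mean_col = W2.mean(axis=1, keepdims=True)          # average over hidden units
    return np.repeat(mean_col, W2.shape[1], axis=1)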
Discussion
Limitations. We study only two-layer MLPs on a single task. Deeper networks may exhibit qualitatively different symmetry-breaking dynamics. We use SGD without momentum; Adam's adaptive learning rates may interact differently with partially symmetric initialization. Our main sweep symmetrizes only $W_1$, so it does not isolate fully symmetric hidden units. Our symmetry metric (cosine similarity) captures geometric but not functional diversity.
Implications. The finding that late structural symmetry breaking is not sufficient for learning has practical implications for initialization design. Random initialization schemes (Kaiming, Xavier) help not merely because they eventually decorrelate hidden neurons, but because they inject enough asymmetry early for SGD to amplify into functionally distinct features. Our results quantify how strongly that early asymmetry matters on modular addition.
Future work. Extending to deeper architectures, studying the interaction with normalization layers (which can re-symmetrize representations), explicitly comparing fully symmetric and partially symmetric readouts, and developing functional diversity metrics beyond cosine similarity are promising directions.
Reproducibility
All code is provided in the accompanying SKILL.md. Experiments use PyTorch 2.6.0, seed 42, and ran on CPU in about 10 minutes on the audited March 28, 2026 machine. The complete parameter grid (4 widths × 5 epsilons = 20 runs) is deterministic for the pinned stack used here; exact floating-point values may vary modestly across platforms.
References
- [goodfellow2016deep] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: symmetry-breaking-neural-networks
description: Study how identical incoming hidden weights evolve during neural network training. Initialize 2-layer ReLU MLPs with symmetric first-layer rows, keep the readout layer on seeded Kaiming initialization, add controlled perturbations (epsilon sweep from 0 to 0.1), train on modular addition mod 97 with SGD, and measure symmetry decay via pairwise cosine similarity. Reveals that mini-batch SGD rapidly amplifies this asymmetry, breaking speed scales with network width, and only large perturbations (epsilon = 0.1) yield task-useful representations.
allowed-tools: Bash(git *), Bash(python *), Bash(python3 *), Bash(pip *), Bash(.venv/*), Bash(cat *), Read, Write
---
# Symmetry Breaking in Neural Network Training
This skill trains 2-layer ReLU MLPs whose incoming hidden weights start symmetric and measures how quickly training drives those rows apart. It sweeps 4 hidden widths x 5 perturbation scales = 20 runs, tracking symmetry decay and learning dynamics.
## Prerequisites
- Requires **Python 3.10+**. No internet access needed (all data is generated synthetically).
- Verified runtime: **about 10 minutes** on CPU for the full 20-run sweep (`run.py`) and about **24 seconds** for the unit tests on the March 28, 2026 audit machine.
- All commands must be run from the **submission directory** (`submissions/symmetry-breaking/`).
## Step 0: Get the Code
Clone the repository and navigate to the submission directory:
```bash
git clone https://github.com/davidydu/Claw4S.git
cd Claw4S/submissions/symmetry-breaking/
```
All subsequent commands assume you are in this directory.
## Step 1: Environment Setup
Create a virtual environment and install dependencies:
```bash
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
```
Verify all packages are installed:
```bash
.venv/bin/python -c "import torch, numpy, scipy, matplotlib; print('All imports OK')"
```
Expected output: `All imports OK`
## Step 2: Run Unit Tests
Verify all modules work correctly:
```bash
.venv/bin/python -m pytest tests/ -v
```
Expected: Pytest exits with `25 passed` and exit code 0.
## Step 3: Run the Experiments
Execute the full symmetry-breaking experiment suite:
```bash
.venv/bin/python run.py
```
Expected: Script prints progress for 20 runs (4 hidden widths x 5 epsilon values), generates plots, and exits with `[4/4] Saving results to results/`. Verified runtime is about 10 minutes on CPU.
This will:
1. Generate modular addition dataset (a + b mod 97), 80/20 train/test split
2. For each (hidden_dim, epsilon) pair, initialize identical `fc1` rows, keep `fc2` on seeded Kaiming init, and train with SGD
3. Log symmetry metric (mean pairwise cosine similarity of hidden neurons) every 50 epochs
4. Generate 3 plots: symmetry trajectories, accuracy vs epsilon, symmetry heatmap
5. Save results to `results/results.json`, summary to `results/summary.json`, report to `results/report.md`
## Step 4: Validate Results
Check that results were produced correctly:
```bash
.venv/bin/python validate.py
```
Expected: Prints file checks, data point counts, scientific sanity diagnostics, and `Validation passed.`
The validator now also reports:
- chance-level accuracy (`1/modulus`)
- best test accuracy at the highest epsilon
- best-accuracy gain between highest and lowest epsilon
If those signals are too weak (for example, no task-useful high-epsilon run), validation exits non-zero with a clear error message.
## Step 5: Review the Report
Read the generated report:
```bash
cat results/report.md
```
The report contains:
- Per-run table: initial/final symmetry, breaking epoch, test accuracy for each (width, epsilon)
- Key findings: mean symmetry and accuracy for zero vs non-zero epsilon
- Breaking speed statistics for substantial perturbations
Verified results from the March 28, 2026 audit run:
- All 20 mini-batch runs reduce the final `fc1` symmetry metric below `0.12`
- Breaking epoch increases with hidden width: `300` (dim 16), `350` (dim 32), `500` (dim 64), `650` (dim 128) for `epsilon <= 1e-4`
- Only `epsilon = 0.1` yields strong task performance at larger widths: width 32 reaches `42.1%` test accuracy, width 64 reaches `52.1%`, width 128 reaches `9.1%`
- Width 64 + epsilon 0.1 achieves `52.1%` test accuracy with `98.8%` train accuracy
Methodological note:
- This code symmetrizes only the incoming hidden-layer weights (`W1`). The readout matrix (`W2`) remains randomly initialized, so interpret the results as the combination of readout asymmetry and mini-batch stochasticity rather than batch noise in isolation.
- In supervisor verification controls, full-batch training at width 16 / epsilon 0 reduced symmetry only to about `0.92` after 2000 epochs, while manually symmetrizing `W2` kept the hidden layer at symmetry `~1.0` for 500 SGD epochs.
## How to Extend
- **Change the task:** Replace the modular addition dataset in `src/data.py` with any classification task. The `generate_modular_addition_data()` function returns `(x_train, y_train, x_test, y_test)`.
- **Add hidden widths:** Pass a different `hidden_dims` list to `run_all_experiments()` in `run.py`.
- **Add epsilon values:** Pass a different `epsilons` list to `run_all_experiments()` in `run.py`.
- **Change the optimizer:** Modify the optimizer in `src/trainer.py` (e.g., replace SGD with Adam) to study how different optimizers interact with symmetry breaking.
- **Multi-layer networks:** Extend `SymmetricMLP` in `src/model.py` to add more hidden layers and track per-layer symmetry.
- **Different symmetric inits:** Modify `_symmetric_init()` in `src/model.py` to try different base weight values or structured symmetries.
- **Isolate pure hidden-unit symmetry:** Add an option to symmetrize `fc2` as well, then compare mini-batch and full-batch training to separate readout asymmetry from batch-noise effects.