We study how activation sparsity in ReLU networks evolves during training
and whether it predicts generalization. Training two-layer MLPs with
hidden widths 32--256 on modular addition (a grokking-prone task) and
nonlinear regression, we track the fraction of zero activations,
dead neurons, and activation entropy at 50-epoch intervals over 3000
epochs.
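A minimal sketch of the three sparsity metrics, assuming PyTorch; the function name and the per-neuron entropy normalization are illustrative choices, not taken from the text:

\begin{verbatim}
import torch

def sparsity_metrics(acts: torch.Tensor, eps: float = 1e-12):
    """acts: (batch, width) post-ReLU hidden activations."""
    zero_frac = (acts == 0).float().mean().item()        # fraction of zero activations
    dead_frac = (acts == 0).all(dim=0).float().mean().item()  # neurons silent on the batch
    p = acts.mean(dim=0)
    p = p / (p.sum() + eps)                              # normalized mean activity per neuron
    entropy = -(p * torch.log(p + eps)).sum().item()     # activation entropy over neurons
    return zero_frac, dead_frac, entropy

acts = torch.relu(torch.randn(256, 64))                  # toy batch, hidden width 64
print(sparsity_metrics(acts))
\end{verbatim}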
We present a novel analytical framework combining Mexican regulatory data (COFEPRIS sanitary registrations) with discrete-time Markov chain models to predict clinical trajectories across biologic, biosimilar, and conventional DMARD therapies in rheumatology. By systematically extracting 947 sanitary registrations across 79 drugs from the COFEPRIS public registry, we identified regulatory asymmetries between innovator biologics and their biosimilars—particularly in approved indications, pediatric extensions, and extrapolated vs.
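A hedged sketch of the discrete-time Markov chain component; the disease-activity states and transition probabilities below are illustrative placeholders, not values estimated from the COFEPRIS data:

\begin{verbatim}
import numpy as np

states = ["remission", "low", "moderate", "high"]   # assumed activity states
P = np.array([                                      # illustrative row-stochastic matrix
    [0.80, 0.15, 0.04, 0.01],
    [0.20, 0.60, 0.15, 0.05],
    [0.05, 0.25, 0.55, 0.15],
    [0.02, 0.08, 0.30, 0.60],
])
pi = np.array([0.10, 0.30, 0.40, 0.20])             # initial state distribution

for _ in range(12):                                 # twelve monthly steps
    pi = pi @ P                                     # pi_{t+1} = pi_t P
print(dict(zip(states, pi.round(3))))
\end{verbatim}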
Grokking—the phenomenon where neural networks generalize long after memorizing training data—has been primarily studied under weight decay variation with a single optimizer. We systematically map the \emph{optimizer grokking landscape} by sweeping four optimizers (SGD, SGD+momentum, Adam, AdamW) across learning rates and weight decay values on modular addition mod 97.
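A minimal sketch of the sweep grid, assuming PyTorch optimizers; the learning-rate and weight-decay values and the training routine are illustrative, not the paper's exact grid:

\begin{verbatim}
import itertools
import torch

dummy = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for model parameters

def make_opt(name, params, lr, wd):
    if name == "sgd":
        return torch.optim.SGD(params, lr=lr, weight_decay=wd)
    if name == "sgd_momentum":
        return torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=wd)
    if name == "adam":
        return torch.optim.Adam(params, lr=lr, weight_decay=wd)
    return torch.optim.AdamW(params, lr=lr, weight_decay=wd)

for name, lr, wd in itertools.product(
        ["sgd", "sgd_momentum", "adam", "adamw"],
        [1e-3, 1e-2, 1e-1],          # illustrative learning rates
        [0.0, 0.01, 0.1, 1.0]):      # illustrative weight decays
    opt = make_opt(name, dummy, lr, wd)
    # a hypothetical train_mod97(opt) would run one sweep point here
    print(name, lr, wd, type(opt).__name__)
\end{verbatim}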
We analyze the correlation structure of six widely-used LLM benchmarks (ARC-Challenge, HellaSwag, MMLU, WinoGrande, TruthfulQA, and GSM8K) across 40 published models spanning 11 families from 70M to 70B parameters. Using PCA, hierarchical clustering, and greedy forward selection on hardcoded published scores, we find that \textbf{just 2 principal components explain 97.
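The PCA step can be reproduced on any (models $\times$ benchmarks) score matrix; the sketch below uses random stand-in scores rather than the published numbers:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(20, 90, size=(40, 6))     # 40 models x 6 benchmarks (placeholder)
X = (scores - scores.mean(0)) / scores.std(0)  # z-score each benchmark column
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending eigenvalues
explained = eigvals / eigvals.sum()
print("variance explained by top-2 PCs:", explained[:2].sum())
\end{verbatim}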
We investigate whether training loss curves of neural networks follow universal functional forms.
We train tiny MLPs (hidden sizes 32, 64, 128) on four synthetic tasks—modular addition (mod 97), modular multiplication (mod 97), random-feature regression, and random-feature classification—recording per-epoch training loss across 1{,}500 epochs.
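A sketch of the functional-form fitting, assuming SciPy; the two candidate families below (power law and exponential) are examples of "universal forms", not necessarily the exact set tested:

\begin{verbatim}
import numpy as np
from scipy.optimize import curve_fit

def power_law(t, a, b, c):
    return a * t ** (-b) + c

def exponential(t, a, b, c):
    return a * np.exp(-b * t) + c

t = np.arange(1, 1501, dtype=float)     # epochs 1..1500
loss = 2.0 * t ** -0.5 + 0.05           # toy loss curve
loss += 0.01 * np.random.default_rng(1).standard_normal(t.size)

for f in (power_law, exponential):
    params, _ = curve_fit(f, t, loss, p0=(1.0, 0.1, 0.0), maxfev=10000)
    r2 = 1 - np.var(loss - f(t, *params)) / np.var(loss)
    print(f.__name__, params.round(3), f"R^2={r2:.4f}")
\end{verbatim}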
We investigate whether per-layer gradient $L_2$ norms exhibit phase transitions that predict generalization before test accuracy does. Training 2-layer MLPs on modular addition (mod 97) and polynomial regression across three dataset fractions, we track gradient norms, weight norms, and performance metrics at every epoch.
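Per-layer gradient norms are cheap to log after each backward pass; a minimal PyTorch sketch with a placeholder model and batch:

\begin{verbatim}
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 97))
x, y = torch.randn(32, 2), torch.randint(0, 97, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

grad_norms = {name: p.grad.norm(2).item()      # L2 norm per parameter tensor
              for name, p in model.named_parameters()}
print(grad_norms)
\end{verbatim}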
We systematically measure the memorization capacity of two-layer MLPs by sweeping model width and training on synthetic data with random vs.\ structured labels.
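A sketch of one cell of the sweep: fit two-layer MLPs of increasing width to random labels and record training accuracy. Sample count, widths, and optimizer settings are illustrative:

\begin{verbatim}
import torch
import torch.nn as nn

n, d = 512, 16
X = torch.randn(n, d)
y = torch.randint(0, 2, (n,))                     # random labels
for width in (8, 32, 128, 512):
    net = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 2))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(500):
        opt.zero_grad()
        nn.functional.cross_entropy(net(X), y).backward()
        opt.step()
    acc = (net(X).argmax(1) == y).float().mean().item()
    print(f"width={width}: train acc on random labels = {acc:.3f}")
\end{verbatim}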
Benford's Law predicts that leading significant digits in naturally occurring datasets follow a logarithmic distribution, with digit 1 appearing approximately 30\% of the time.
We investigate whether this law emerges in the weights of trained neural networks by training tiny MLPs on modular arithmetic and sine regression tasks, saving weight snapshots across 5{,}000 training epochs.
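The leading-digit test is straightforward on any weight snapshot; below, random weights stand in for a saved checkpoint:

\begin{verbatim}
import numpy as np

w = np.random.default_rng(0).standard_normal(10_000)   # placeholder for checkpoint weights
mags = np.abs(w[w != 0])
first = (mags / 10.0 ** np.floor(np.log10(mags))).astype(int)  # leading significant digit
emp = np.bincount(first, minlength=10)[1:10] / first.size
benford = np.log10(1 + 1 / np.arange(1, 10))           # Benford's predicted frequencies
print("empirical:", emp.round(3))
print("benford:  ", benford.round(3))
\end{verbatim}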
Random Matrix Theory (RMT) predicts that the eigenvalue spectrum of $\frac{1}{M}W^\top W$ for an $M \times N$ random matrix $W$ follows the Marchenko-Pastur (MP) distribution.
We use this null model to quantify how much structure trained neural network weight matrices have learned beyond random initialization.
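A sketch of the null comparison: eigenvalues of $\frac{1}{M}W^\top W$ for a random $W$ against the MP support edges $\sigma^2(1 \pm \sqrt{N/M})^2$; the shapes and $\sigma$ are illustrative:

\begin{verbatim}
import numpy as np

M, N, sigma = 2000, 500, 1.0
W = np.random.default_rng(0).normal(0, sigma, size=(M, N))
eigs = np.linalg.eigvalsh(W.T @ W / M)

lam = N / M                                    # aspect ratio
lo = sigma**2 * (1 - np.sqrt(lam))**2          # lower MP edge
hi = sigma**2 * (1 + np.sqrt(lam))**2          # upper MP edge
outliers = ((eigs < lo) | (eigs > hi)).mean()
print(f"MP support [{lo:.3f}, {hi:.3f}]; fraction outside: {outliers:.4f}")
\end{verbatim}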
Zipf's law—the empirical observation that word frequency is inversely proportional to rank—is a foundational assumption in NLP and information theory.
We investigate how well this law holds for \emph{token} frequency distributions produced by modern BPE-based tokenizers across two corpus types: natural language (7 languages) and programming code (Python, Java).
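The standard estimate of the Zipf exponent is the slope of log-frequency against log-rank; the sketch below uses toy counts in place of real tokenizer output:

\begin{verbatim}
import numpy as np

counts = np.sort(np.random.default_rng(0).zipf(1.2, size=50_000))[::-1]  # toy counts
counts = counts[:10_000].astype(float)                 # top 10k ranks
ranks = np.arange(1, counts.size + 1)
slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f"fitted Zipf exponent: {-slope:.3f}")
\end{verbatim}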
We investigate whether structural and information-theoretic features of multiple-choice benchmark questions can predict which questions are difficult for large language models (LLMs), without running any model. Using 1{,}172 ARC-Challenge questions annotated with Item Response Theory (IRT) difficulty scores from Easy2Hard-Bench, we extract 12 surface-level features—including answer entropy, lexical overlap, negation count, and Flesch-Kincaid grade level—and train a Random Forest regressor.
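A sketch of the prediction pipeline, assuming scikit-learn; the feature matrix below is random, standing in for the 12 extracted surface features and the IRT targets:

\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((1172, 12))   # 12 surface features per question (placeholder)
y = rng.standard_normal(1172)         # IRT difficulty scores (placeholder)

model = RandomForestRegressor(n_estimators=300, random_state=0)
print("5-fold R^2:", cross_val_score(model, X, y, cv=5, scoring="r2").round(3))
\end{verbatim}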
We systematically reproduce the double descent phenomenon using random ReLU features models on synthetic regression data. Our experiments confirm that test error peaks sharply at the interpolation threshold—where the number of features equals the number of training samples—and decreases in the overparameterized regime.
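A minimal reproduction sketch: min-norm least squares on random ReLU features, with test error peaking near the interpolation threshold (here n_features = n_train = 100). All sizes are illustrative:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 500, 10
Xtr, Xte = rng.standard_normal((n_train, d)), rng.standard_normal((n_test, d))
w_true = rng.standard_normal(d)
ytr, yte = Xtr @ w_true, Xte @ w_true

for n_feat in (20, 50, 90, 100, 110, 200, 1000):
    V = rng.standard_normal((d, n_feat))            # random first-layer weights
    Ftr, Fte = np.maximum(Xtr @ V, 0), np.maximum(Xte @ V, 0)  # ReLU features
    beta = np.linalg.pinv(Ftr) @ ytr                # min-norm least-squares fit
    print(f"n_features={n_feat:5d}: test MSE = {np.mean((Fte @ beta - yte)**2):.3f}")
\end{verbatim}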
We re-analyze published benchmark data from BIG-Bench (8 tasks, 3 model families) and MMLU (13 models, 5 families) to test the claim by \citet{schaeffer2023} that emergent abilities in large language models are artifacts of discontinuous evaluation metrics. By applying both discontinuous (exact string match) and continuous (partial credit) metrics to the same published performance data, we quantify the \emph{Metric Sensitivity Index} (MSI) for each task and add deterministic bootstrap uncertainty estimates.
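The abstract does not spell out the MSI formula, so the sketch below shows only the seeded ("deterministic") bootstrap step, applied to a placeholder per-task metric gap:

\begin{verbatim}
import numpy as np

def bootstrap_ci(values, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)   # fixed seed makes the bootstrap reproducible
    means = rng.choice(values, size=(n_boot, len(values)), replace=True).mean(axis=1)
    return np.percentile(means, [2.5, 97.5])

gap = np.array([0.12, 0.30, 0.05, 0.22, 0.18, 0.09, 0.25, 0.15])  # placeholder gaps
print("95% CI for mean metric gap:", bootstrap_ci(gap).round(3))
\end{verbatim}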
Neural scaling laws promise that model performance follows predictable power-law trends as compute increases.
We verify this claim using published data from two open model families—Cerebras-GPT (7 sizes, 111M--13B) and Pythia (8 sizes, 70M--12B)—and find a sharp divergence: training loss scales reliably (adj-$R^2$ = 0.
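A sketch of the scaling fit: regress log-loss on log-compute and report adjusted $R^2$; the (compute, loss) pairs are placeholders for the published Cerebras-GPT and Pythia numbers:

\begin{verbatim}
import numpy as np

compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20, 3e20, 1e21])  # placeholder FLOPs
loss = 20.0 * compute ** -0.05                                  # placeholder losses
loss *= np.exp(0.01 * np.random.default_rng(0).standard_normal(loss.size))

logx, logy = np.log(compute), np.log(loss)
slope, intercept = np.polyfit(logx, logy, 1)
pred = slope * logx + intercept
n, k = len(loss), 1
r2 = 1 - ((logy - pred)**2).sum() / ((logy - logy.mean())**2).sum()
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)                   # adjusted R^2
print(f"exponent = {-slope:.3f}, adj-R^2 = {adj_r2:.4f}")
\end{verbatim}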