This paper investigates the relationship between continuous integration and build failures through controlled experiments on 23 diverse datasets totaling 27,487 samples. We propose a novel methodology that achieves 14.
Portfolio diversification admits multiple quantitative definitions, yet practitioners rarely examine whether different metrics yield the same qualitative conclusion about sector concentration. We compute five diversification metrics---the Herfindahl-Hirschman Index (HHI), Shannon entropy, effective number of bets, the Choueifaty-Coignard diversification ratio, and maximum drawdown contribution share---for the 11 Global Industry Classification Standard (GICS) sectors using publicly available S&P 500 market-capitalization weights.
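Two of the metrics named above have simple closed forms, and a minimal sketch may help fix ideas. The sector weights below are hypothetical and are not the S&P 500 weights used in the paper. The inverse HHI and the exponential of the entropy are shown as common "effective number" summaries, which is an assumption here; the paper's effective number of bets may be defined differently.

```python
import math

def hhi(weights):
    """Herfindahl-Hirschman Index: sum of squared weights."""
    return sum(w * w for w in weights)

def shannon_entropy(weights):
    """Shannon entropy in nats; zero weights contribute nothing."""
    return -sum(w * math.log(w) for w in weights if w > 0)

# Hypothetical 11-sector weight vectors (illustrative only); each sums to 1.
equal = [1 / 11] * 11
concentrated = [0.5] + [0.05] * 10

for name, w in [("equal", equal), ("concentrated", concentrated)]:
    h = hhi(w)
    s = shannon_entropy(w)
    # 1/HHI and exp(entropy) both equal 11 for equal weights.
    print(name, round(h, 4), round(s, 4), round(1 / h, 2), round(math.exp(s), 2))
```

For equal weights both effective-number summaries return 11; for the concentrated vector they fall below 11 by different amounts, which is the kind of metric-dependent reading the abstract is concerned with.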
Phylogenetic signal, the tendency of closely related species to resemble each other more than expected by chance, is routinely quantified by two metrics: Blomberg's K and Pagel's lambda. Both equal unity under Brownian motion, yet they capture different aspects of trait distribution across a phylogeny.
Pearson's r, Spearman's rho, and Kendall's tau are the three most widely used measures of bivariate association, yet practitioners rarely consider that these coefficients can disagree not merely in magnitude but in sign. We derive exact analytical conditions under which sign disagreement occurs between pairs of these measures as a function of marginal skewness and copula structure.
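A small constructed example shows that sign disagreement is possible. The data are hypothetical and chosen by hand: five points rise together and a single extreme value reverses the linear trend. The implementations are plain textbook versions that assume no ties; they are not the paper's analytical conditions.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n in ascending order; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall's tau-a; assumes no ties."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            score += 1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1
    return score / (n * (n - 1) / 2)

# Hypothetical data: monotone increase except for one extreme last point.
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 4, 5, 6, -100]

r, rho, tau = pearson(x, y), spearman(x, y), kendall_tau(x, y)
print(r, rho, tau)  # r is negative; rho and tau are positive
```

The outlier dominates the product-moment sum and drives Pearson's r negative, while the rank-based measures see ten concordant pairs against five discordant ones and stay positive.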
Cross-lingual transfer in multilingual language models is commonly explained by typological similarity between languages, measured through features such as word order, morphological complexity, and phonological inventory. We propose a simpler and more proximate predictor: the Vocabulary Overlap Ratio (VOR), defined as the Jaccard similarity between the subword token sets that a multilingual tokenizer assigns to monolingual corpora in two languages.
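The definition above reduces to a set computation, sketched here under stated assumptions. The function name and the toy token lists are hypothetical; in practice the token sets would come from running a multilingual tokenizer over each monolingual corpus, a step not shown.

```python
def vocabulary_overlap_ratio(tokens_a, tokens_b):
    """Jaccard similarity between two subword token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty vocabularies
    return len(set_a & set_b) / len(union)

# Hypothetical subword tokens observed in two small corpora.
tokens_lang1 = ["_la", "_casa", "_es", "_grande", "s"]
tokens_lang2 = ["_a", "_casa", "_e", "_grande", "s"]

vor = vocabulary_overlap_ratio(tokens_lang1, tokens_lang2)
print(vor)  # 3 shared tokens out of 7 distinct tokens
```

Because the inputs are converted to sets, token frequency plays no role; a frequency-weighted variant would be a different statistic from the one defined in the abstract.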
Gross Domestic Product can be measured by three conceptually equivalent approaches: expenditure, production (value-added), and income. National accounting identities guarantee that the three are equal in theory, yet in practice the estimates diverge because of measurement error, survey timing, and revision practices.
Multiple testing correction is a routine component of statistical analysis, yet the choice among correction methods (Bonferroni, Holm, Benjamini-Hochberg FDR) is often treated as a technical detail rather than a consequential analytical decision. We surveyed 200 papers published between 2020 and 2023 in five journals (Nature, Science, PNAS, JAMA, PLoS ONE) that reported results from multiple simultaneous hypothesis tests.
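To illustrate why the choice of correction can be consequential, the sketch below computes adjusted p-values under the three methods named above and counts rejections at a 0.05 threshold. The p-values are hypothetical, and the functions are textbook implementations written for this illustration rather than anything taken from the surveyed papers.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for k, i in enumerate(order):
        # The k-th smallest p-value (0-based) is multiplied by m - k.
        running_max = max(running_max, (m - k) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (FDR control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for k in range(m - 1, -1, -1):
        i = order[k]
        running_min = min(running_min, pvals[i] * m / (k + 1))
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from five simultaneous tests.
pvals = [0.001, 0.012, 0.02, 0.035, 0.2]
alpha = 0.05

rejections = {
    name: sum(p < alpha for p in method(pvals))
    for name, method in [("bonferroni", bonferroni),
                         ("holm", holm),
                         ("bh", benjamini_hochberg)]
}
print(rejections)
```

On this input the three methods reject one, two, and four hypotheses respectively, so the same data support different sets of findings depending on the correction.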
Microbiome sequencing yields compositional data: read counts for each taxon represent relative abundances constrained to sum to a constant. Applying standard statistical methods (Pearson correlation, linear regression, t-tests on proportions) to such data produces spurious associations because an increase in one component mechanically forces decreases in others.
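The mechanical effect of the sum constraint is easiest to see in the two-taxon case. In the sketch below the absolute abundances are hypothetical and rise together across samples, yet once they are converted to proportions the two components are perfectly negatively correlated, because each proportion is one minus the other.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical absolute abundances of two taxa across four samples.
taxon_a = [10, 20, 30, 40]
taxon_b = [12, 18, 33, 41]

# Closure: divide by the per-sample total so each sample sums to 1.
totals = [a + b for a, b in zip(taxon_a, taxon_b)]
prop_a = [a / t for a, t in zip(taxon_a, totals)]
prop_b = [b / t for b, t in zip(taxon_b, totals)]

r_absolute = pearson(taxon_a, taxon_b)  # strongly positive
r_relative = pearson(prop_a, prop_b)    # -1 up to rounding error
print(r_absolute, r_relative)
```

With more than two taxa the induced correlation is no longer exactly -1, but the same negative bias is present.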
Standard Value-at-Risk (VaR) backtests assume that the risk model is correctly specified, but empirical asset returns exhibit heavier tails than the Gaussian distribution used to compute VaR at most institutions. We quantify the miscalibration of three widely used backtests---the Kupiec (1995) unconditional coverage test, the Christoffersen (1998) conditional coverage test, and the Basel Committee traffic-light system---when the true return distribution is Student-$t$ but VaR is computed under a Gaussian assumption.
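For orientation, the Kupiec unconditional coverage test compares the observed exceedance rate with the nominal rate through a binomial likelihood ratio, referred to a chi-square distribution with one degree of freedom. The sketch below is a minimal version under that standard formulation; the function name, the 250-day window, and the exceedance counts are hypothetical and are not results from the paper.

```python
import math

def kupiec_pof(n_obs, n_exceed, var_level=0.99):
    """Kupiec proportion-of-failures test.

    Returns the likelihood-ratio statistic and its asymptotic p-value
    under a chi-square distribution with one degree of freedom.
    """
    p = 1.0 - var_level  # nominal exceedance probability

    def loglik(q):
        # Binomial log-likelihood without the combinatorial constant;
        # terms with a zero count are skipped to avoid log(0).
        ll = 0.0
        if n_exceed > 0:
            ll += n_exceed * math.log(q)
        if n_obs - n_exceed > 0:
            ll += (n_obs - n_exceed) * math.log(1.0 - q)
        return ll

    lr = -2.0 * (loglik(p) - loglik(n_exceed / n_obs))
    lr = max(lr, 0.0)  # guard against tiny negative rounding error
    # Survival function of chi-square(1): P(X > lr) = erfc(sqrt(lr / 2)).
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value

# Hypothetical backtest: 250 trading days of 99% VaR forecasts.
lr_many, p_many = kupiec_pof(250, 8)  # far more exceedances than the expected 2.5
lr_few, p_few = kupiec_pof(250, 3)    # close to the expected count
print(lr_many, p_many, lr_few, p_few)
```

The test conditions only on the exceedance count, so it is silent on clustering of exceedances, which is what the Christoffersen conditional coverage test adds.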
Nonparametric bootstrap confidence intervals are applied throughout empirical research under the tacit assumption that resampling inherits the distributional properties needed for valid coverage. When the data-generating process has a regularly varying tail with index alpha, the classical bootstrap of the sample mean is inconsistent for alpha < 2, a result established by Athreya (1987) and Knight (1989).
Alpha diversity is the most frequently reported summary statistic in gut microbiome case-control studies, yet the choice among competing indices is rarely justified and the consequences of that choice for biological conclusions are seldom examined. We reanalyzed 16S rRNA amplicon data from 14 published gut microbiome datasets spanning seven disease categories (obesity, type 2 diabetes, inflammatory bowel disease, colorectal cancer, Clostridium difficile infection, cirrhosis, and rheumatoid arthritis), computing five standard alpha diversity indices (Shannon, Simpson, Chao1, observed OTUs, and Faith's phylogenetic diversity) for each.
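A toy comparison shows how indices can point in opposite directions. The sketch below implements three of the five indices: observed richness, Gini-Simpson, and the bias-corrected form of Chao1. Shannon entropy and Faith's phylogenetic diversity are omitted, the latter because it needs a tree. The count vectors are hypothetical, and the choice of the Gini-Simpson and bias-corrected variants is an assumption, since the abstract does not say which forms were used.

```python
def observed_richness(counts):
    """Number of taxa with at least one read."""
    return sum(1 for c in counts if c > 0)

def simpson_index(counts):
    """Gini-Simpson index: 1 minus the sum of squared proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts if c > 0)

def chao1(counts):
    """Bias-corrected Chao1 richness estimate from singletons and doubletons."""
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return observed_richness(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical read counts per taxon for two samples.
even_sample = [25, 25, 25, 25]                   # few taxa, perfectly even
skewed_sample = [90, 2, 2, 1, 1, 1, 1, 1, 1]     # one dominant taxon, many rare

for name, counts in [("even", even_sample), ("skewed", skewed_sample)]:
    print(name, observed_richness(counts),
          round(simpson_index(counts), 4), chao1(counts))
```

Richness and Chao1 rank the skewed sample as more diverse, whereas the Simpson index ranks the even sample higher, so a case-control contrast built on such samples could change direction with the index.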
Overparameterized neural networks are widely believed to handle label noise gracefully because their excess capacity can absorb corrupted examples without degrading clean-sample performance. We test this assumption directly by training 2,400 models spanning four architectures (ResNet-18, VGG-16, DenseNet-121, ViT-Small) at five width multipliers (0.
Purchasing Power Parity (PPP) conversion factors from the International Comparison Program (ICP) underpin virtually all cross-country income comparisons, yet each ICP round selects a different base year and product basket, introducing systematic sensitivity into the resulting real GDP estimates. We audit this sensitivity by comparing PPP-adjusted GDP per capita rankings across three ICP rounds (2005, 2011, 2017) for 141 countries with continuous participation.
Six global atmospheric reanalysis products -- ERA5, JRA-55, MERRA-2, NCEP-R2, CFSR, and the Twentieth Century Reanalysis (20CR) -- serve as the observational backbone for climate trend attribution, yet their mutual consistency has never been audited at the grid-cell level with formal uncertainty quantification. We extract monthly 850 hPa temperature fields from all six products on a common 2.
Substitution saturation—the erosion of phylogenetic signal due to repeated mutations at the same nucleotide position—imposes a fundamental limit on the temporal depth recoverable from molecular sequence data. Despite its importance, the precise threshold at which phylogenetic information becomes unrecoverable has never been systematically determined across realistic parameter regimes.
Single-cell RNA sequencing has become the dominant technology for characterizing cellular heterogeneity, yet the stability of computational cell-type assignments remains poorly quantified. We systematically evaluated clustering reproducibility by running the standard Seurat pipeline (PCA dimensionality reduction, UMAP embedding, Louvain community detection) across 100 random seeds on each of 10 published scRNA-seq datasets spanning 847,000 cells total.
Epigenetic clocks have become the dominant molecular estimators of biological age, yet systematic comparisons across clocks and tissues within the same individuals remain sparse. We applied four established epigenetic age predictors—Horvath's multi-tissue clock, Hannum's blood-based clock, PhenoAge, and GrimAge—to 500 samples spanning blood, liver, lung, and brain tissue from the Genotype-Tissue Expression (GTEx) project, where multiple tissues were available per donor.
Whole-brain multivariate pattern analysis is widely assumed to outperform region-of-interest approaches by leveraging distributed neural representations. We tested this assumption by training linear support vector machine decoders on six fMRI task datasets—including the Human Connectome Project working memory and motor tasks, the Haxby face/object paradigm, and three additional cognitive paradigms—systematically varying the number of ANOVA-selected voxels from 10 to 5,000.
Normalization is a prerequisite for meaningful differential expression analysis of RNA-seq data, yet the choice among competing methods is typically made without quantifying its downstream impact on biological conclusions. We applied five normalization approaches—TMM, DESeq2 median-of-ratios, upper quartile, FPKM, and TPM—to 20 published RNA-seq datasets spanning cancer (n=10) and immunology (n=10) studies, then ran identical DESeq2 differential expression pipelines on each normalized dataset.
The Codon Adaptation Index (CAI) remains the dominant metric for predicting gene expression from sequence data in bacterial genomics, yet its dependence on an externally supplied reference set of highly expressed genes introduces an underappreciated source of variability. We computed CAI for all protein-coding genes across 500 complete bacterial genomes using four distinct reference sets: ribosomal protein genes, RNA-seq-validated highly expressed genes, the top 5% of genes ranked by codon usage frequency, and the original Sharp and Li reference set.
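The dependence on the reference set follows from how CAI is built: each codon receives a relative adaptiveness weight equal to its count in the reference set divided by the count of the most frequent synonymous codon there, and a gene's CAI is the geometric mean of those weights. The sketch below is a deliberately simplified illustration. It uses a hypothetical two-family codon table rather than the full genetic code, made-up reference sequences, and it skips codons that are absent from the reference instead of applying a pseudocount.

```python
import math
from collections import Counter

# Toy synonymous-codon families (hypothetical subset, not the full code).
SYNONYMS = {
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
}

def split_codons(seq):
    """Split a coding sequence into complete codons."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def relative_adaptiveness(reference_seqs):
    """Weight per codon: count / max count within its synonymous family."""
    counts = Counter(c for s in reference_seqs for c in split_codons(s))
    weights = {}
    for family in SYNONYMS.values():
        top = max(counts[c] for c in family)
        for codon in family:
            if counts[codon] > 0:
                weights[codon] = counts[codon] / top
    return weights

def cai(seq, weights):
    """Geometric mean of weights over the codons that have a weight."""
    logs = [math.log(weights[c]) for c in split_codons(seq) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))

# Two hypothetical reference sets with opposite codon preferences.
ref_a = ["".join(["AAA", "AAA", "AAG", "GAA", "GAA", "GAG"])]
ref_b = ["".join(["AAG", "AAG", "AAA", "GAG", "GAG", "GAA"])]
gene = "AAAGAA"

cai_a = cai(gene, relative_adaptiveness(ref_a))
cai_b = cai(gene, relative_adaptiveness(ref_b))
print(cai_a, cai_b)
```

The same gene scores 1.0 against the first reference set and 0.5 against the second, which is the source of variability the abstract describes, here in exaggerated form.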