DepMapRescue: Compiling 18,000 CRISPR Gene Dependencies into Ranked Targets and Optimized Cell Line Panels
Abstract
The Cancer Dependency Map (DepMap) project has screened over 1,000 cancer cell lines with genome-scale CRISPR-Cas9 knockout, producing a public 18,000-gene by 1,000+ cell line matrix of gene effect scores. Yet translating this 432 MB matrix into actionable experimental design decisions -- which genes to pursue as targets, which cell lines to use -- typically requires bespoke bioinformatics. We present DepMapRescue, a deterministic compiler that processes public DepMap 24Q4 data into two decision primitives: (1) forward-mode target triage that ranks genes by a composite of effect size, consistency, selectivity, and evidence depth for a given cancer type or pathway; and (2) reverse-mode cell line prescription that ranks cell lines by informativeness for a set of target genes. Applied to lung cancer, the compiler identifies established dependencies (TOP2A, RPS29, PSMB5) alongside disease-selective targets with full certificate-carrying provenance. All outputs are deterministic and pass golden-file SHA256 verification across 34 automated tests.
1. Introduction
The DepMap project (Tsherniak et al., 2017; Meyers et al., 2017) represents one of the largest systematic efforts to map cancer gene dependencies. The public 24Q4 release contains CERES-corrected gene effect scores for 17,916 genes across 1,178 cell lines spanning 77 cancer types. A negative gene effect score indicates that CRISPR knockout of that gene reduces cell viability -- the more negative, the stronger the dependency.
Despite this data being freely available, its practical utility is limited by the matrix size (432 MB) and the bioinformatics overhead required to extract disease-specific insights. A researcher asking "which genes should I study as targets in lung cancer?" or "which cell lines should I use to study TP53 and RB1?" must write custom analysis code, make scoring decisions, and handle edge cases -- work that is repeated independently across many labs.
The DepMap portal itself offers interactive analysis features (Context Explorer, Gene Dependency profiles), but these require network access, return results that shift with each data release, and lack machine-checkable provenance. Other prioritization approaches include Project Score (Behan et al., 2019) and BAGEL (Hart & Moffat, 2016), which focus on Bayes factor scoring rather than composite multi-factor ranking.
DepMapRescue standardizes this workflow into a deterministic pipeline with two modes: forward (target triage) and reverse (cell line prescription). The pipeline operates on compact derived assets (~34 MB) built from the raw DepMap data, requires no network access at query time, and produces certificate-carrying outputs that include the scoring formula, per-entry breakdowns, and SHA256 hashes for reproducibility verification.
2. Methods
2.1 Data Processing
Raw data files from the DepMap Public 24Q4 release (Figshare) are processed into six derived assets:
gene_summary.csv (17,916 rows): Per-gene statistics across all cell lines including mean effect, median effect, standard deviation, number of dependent lines (effect < -0.5), and dependency fraction.
disease_gene_effects.csv (139,955 rows after filtering): Per-gene by per-disease statistics including mean effect, number of lines, dependency fraction, and selectivity (mean effect in disease minus mean effect in all other diseases).
cell_line_metadata.csv (2,105 rows): Cell line annotations including disease, lineage, and subtype from the OncotreeLineage and OncotreePrimaryDisease fields.
disease_summary.csv (77 rows): Per-disease cell line counts.
pathway_gene_map.csv (119 rows): Curated mapping of 10 canonical cancer pathways to their member genes.
cell_line_gene_effects.csv (1,178 lines by 1,109 genes): Subset of the full gene effect matrix containing pathway genes and the top 1,000 genes by dependency fraction, used for reverse-mode prescription.
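The two per-gene statistics that recur throughout the derived assets, dependency fraction and selectivity, can be sketched as follows. This is a minimal illustration using the definitions above; the function names and the toy effect values are assumptions, not part of the DepMapRescue code base or DepMap data.

```python
from statistics import mean

DEPENDENCY_CUTOFF = -0.5  # dependency threshold stated in the paper


def dependency_fraction(effects):
    """Fraction of cell lines whose gene effect falls below the cutoff."""
    return sum(e < DEPENDENCY_CUTOFF for e in effects) / len(effects)


def selectivity(disease_effects, other_effects):
    """Mean effect in the disease minus mean effect in all other diseases."""
    return mean(disease_effects) - mean(other_effects)


# Toy gene effect scores for one gene (illustrative only, not DepMap values).
in_disease = [-1.2, -0.8, -0.3, -0.9]
elsewhere = [-0.1, -0.2, 0.0, -0.4, -0.3]

frac = dependency_fraction(in_disease)
sel = selectivity(in_disease, elsewhere)
print(f"dependency_fraction={frac:.2f} selectivity={sel:.2f}")
```

A negative selectivity here means the gene is a stronger dependency in the disease of interest than elsewhere.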
The filtering criterion for disease_gene_effects retains rows where dependency_fraction > 0.1 OR |selectivity| > 0.2, reducing the full cross-product from ~1.4M rows to 140K while preserving all biologically interesting signals.
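As a rough sketch, the retention rule can be written as a single predicate. The function name `keep_row` and the example rows are hypothetical; only the two thresholds come from the text.

```python
def keep_row(dependency_fraction, selectivity, dep_min=0.1, sel_min=0.2):
    """Retain a gene-by-disease row if either criterion from Section 2.1 holds."""
    return dependency_fraction > dep_min or abs(selectivity) > sel_min


# Illustrative rows: (gene label, dependency_fraction, selectivity).
rows = [
    ("GENE_A", 0.95, -0.05),  # kept: frequently a dependency
    ("GENE_B", 0.02, -0.35),  # kept: rarely a dependency, but selective
    ("GENE_C", 0.02, 0.05),   # dropped: neither criterion holds
]
kept = [gene for gene, frac, sel in rows if keep_row(frac, sel)]
print(kept)
```

Because the criterion is an OR, a gene that is rarely a dependency still survives if its effect differs markedly between the disease and the remaining lines.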
2.2 Forward Mode: Target Triage
Given a cancer type (e.g., "Lung Cancer") or pathway (e.g., "RAS_MAPK"), the compiler ranks genes by:
target_score = |mean_effect| * consistency * selectivity_norm * evidence_depth
where:
- |mean_effect| is the absolute mean CERES gene effect in disease-relevant cell lines (larger = stronger dependency)
- consistency is the dependency fraction (fraction of lines where effect < -0.5)
- selectivity_norm is the disease-specific selectivity normalized with a 0.1 floor (range [0.1, 1.1]): abs(selectivity) / max(all selectivities) + 0.1
- evidence_depth is log(1 + n_lines) / log(1 + max_lines), rewarding genes tested in more cell lines
For pathway mode, all genes in the specified pathway are scored using pan-cancer gene_summary statistics, with standard deviation as a selectivity proxy (higher variance = more context-dependent).
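The disease-mode formula above can be sketched in a few lines. This is an illustration under stated assumptions, not the compiler's actual code: the function names are hypothetical, and "max(all selectivities)" is read as the largest absolute selectivity, which is what the stated range [0.1, 1.1] implies.

```python
import math


def selectivity_norm(selectivity, max_abs_selectivity, floor=0.1):
    """abs(selectivity) / max(all selectivities) + 0.1, giving a value in [0.1, 1.1]."""
    return abs(selectivity) / max_abs_selectivity + floor


def evidence_depth(n_lines, max_lines):
    """log(1 + n_lines) / log(1 + max_lines); equals 1.0 when n_lines == max_lines."""
    return math.log(1 + n_lines) / math.log(1 + max_lines)


def target_score(mean_effect, consistency, sel_norm, depth):
    """Four-factor product from Section 2.2."""
    return abs(mean_effect) * consistency * sel_norm * depth


# Rounded TOP2A values from the Section 3.1 table. Assumptions: the table's
# Selectivity column already holds the normalized factor, and the gene was
# screened in every matched line, so evidence_depth is 1.0.
score = target_score(-2.434, 1.0, 0.912, evidence_depth(98, 98))
print(round(score, 2))
```

Under those assumptions the product comes out at about 2.22, close to the 2.219 reported for TOP2A; the small gap is consistent with rounding in the table inputs.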
2.3 Reverse Mode: Cell Line Prescription
Given target genes (e.g., ["TP53", "RB1", "CDKN2A"]), the compiler ranks cell lines by:
line_score = mean(|gene_effect|) * max(n_targets_dependent, 0.1) * diversity_bonus
where:
- mean(|gene_effect|) is the average absolute CERES effect across target genes
- n_targets_dependent is the count of target genes with effect < -0.5 (strong dependency)
- diversity_bonus is 1.0 + 0.1 * (unique diseases seen above this rank), applied in rank order to encourage disease diversity in the recommended panel
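The reverse-mode score can be sketched as below. The function names and effect values are illustrative assumptions. The text does not fully specify how the rank-order pass assigns the diversity bonus, so the sketch takes the bonus as an input and only encodes the stated expression for it.

```python
DEPENDENCY_CUTOFF = -0.5  # strong-dependency threshold from the paper


def diversity_bonus(n_unique_diseases_above):
    """1.0 + 0.1 * (unique diseases seen above this rank), as written in Section 2.3."""
    return 1.0 + 0.1 * n_unique_diseases_above


def line_score(gene_effects, bonus=1.0):
    """mean(|gene_effect|) * max(n_targets_dependent, 0.1) * diversity_bonus."""
    mean_abs = sum(abs(e) for e in gene_effects) / len(gene_effects)
    n_dependent = sum(e < DEPENDENCY_CUTOFF for e in gene_effects)
    return mean_abs * max(n_dependent, 0.1) * bonus


# Toy effects for three target genes in two hypothetical cell lines.
one_dependency = line_score([-0.9, 0.2, -0.1])
no_dependency = line_score([0.2, -0.1, 0.3])
print(one_dependency, no_dependency)
```

The `max(..., 0.1)` floor keeps lines with no strong dependency in the ranking at a tenfold penalty instead of zeroing them out.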
2.4 Verification Framework
Each compilation produces three artifacts: a ranked CSV, a certificate JSON (with scoring formula, input SHA256 hashes, and per-entry breakdown), and a summary Markdown. The verification suite performs 8 checks per mode:
- Required files exist (ranked CSV, certificate.json, and summary.md, counted as one check each)
- CSV is non-empty
- Certificate JSON is parseable
- Certificate contains required keys
- Scores are sorted descending
- (If golden files present) SHA256 match
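A minimal sketch of how these checks could be implemented for triage output is shown below. The file names and the `target_score` column follow the skill file; the function names, the `required_keys` argument, and the layout of the golden directory are assumptions made for illustration.

```python
import csv
import hashlib
import json
from pathlib import Path

REQUIRED_FILES = ("triage_ranked.csv", "certificate.json", "summary.md")


def sha256_of(path):
    """Hex SHA256 digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_run(run_dir, required_keys, golden_dir=None, score_column="target_score"):
    """Run the structural checks and, if a golden directory is given, the hash check."""
    run_dir = Path(run_dir)
    checks = {f"{name} exists": (run_dir / name).is_file() for name in REQUIRED_FILES}

    csv_path = run_dir / "triage_ranked.csv"
    rows = []
    if csv_path.is_file():
        with csv_path.open(newline="") as fh:
            rows = list(csv.DictReader(fh))
    checks["csv non-empty"] = len(rows) > 0

    try:
        cert = json.loads((run_dir / "certificate.json").read_text())
        checks["certificate parseable"] = True
    except (OSError, json.JSONDecodeError):
        cert = {}
        checks["certificate parseable"] = False
    checks["certificate keys present"] = isinstance(cert, dict) and all(
        key in cert for key in required_keys
    )

    scores = [float(row[score_column]) for row in rows if row.get(score_column)]
    checks["scores sorted descending"] = len(scores) == len(rows) and all(
        a >= b for a, b in zip(scores, scores[1:])
    )

    if golden_dir is not None:
        golden_csv = Path(golden_dir) / "triage_ranked.csv"
        checks["sha256 match"] = (
            csv_path.is_file()
            and golden_csv.is_file()
            and sha256_of(csv_path) == sha256_of(golden_csv)
        )
    return {"ok": all(checks.values()), "checks": checks}
```

Returning the per-check dictionary alongside the overall `ok` flag makes it easy to see which check failed.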
3. Results
3.1 Lung Cancer Target Triage
Applied to "Lung Cancer" (matched to "Non-Small Cell Lung Cancer", 98 cell lines in DepMap), the compiler scored 2,556 genes passing the filtering threshold. The top 10 targets:
| Rank | Gene | Mean Effect | Consistency | Selectivity | Score |
|---|---|---|---|---|---|
| 1 | TOP2A | -2.434 | 1.000 | 0.912 | 2.219 |
| 2 | RPS29 | -2.680 | 1.000 | 0.724 | 1.939 |
| 3 | PSMB5 | -1.891 | 1.000 | 0.924 | 1.748 |
| 4 | RPL12 | -2.411 | 1.000 | 0.716 | 1.726 |
| 5 | SNRPD3 | -4.324 | 1.000 | 0.438 | 1.519 |
| 6 | COPA | -2.171 | 1.000 | 0.647 | 1.404 |
| 7 | PCNA | -2.812 | 1.000 | 0.482 | 1.355 |
| 8 | RPS7 | -1.976 | 1.000 | 0.673 | 1.329 |
| 9 | HSPE1 | -3.300 | 1.000 | 0.401 | 1.323 |
| 10 | ESPL1 | -2.163 | 1.000 | 0.594 | 1.284 |
TOP2A (topoisomerase II alpha), a known target of FDA-approved chemotherapeutic agents (etoposide, doxorubicin), ranks first, illustrating that the composite scoring surfaces clinically relevant targets. The unfiltered list includes pan-essential cellular machinery (ribosomal proteins, proteasome subunits, spliceosome components) because these genes have the largest effect sizes and perfect consistency across cell lines. This is expected behavior, not a failure: pan-essential genes are genuine CRISPR dependencies, but they are generally poor therapeutic targets due to toxicity. Users seeking disease-specific vulnerabilities should filter pan-essentials; after filtering, MYC (rank 16, score 1.189, consistency 0.959) and other context-dependent oncogenes emerge as the actionable targets. Known drug targets for lung cancer such as EGFR appear in the ranking at lower positions (reflecting narrower dependency profiles compared to housekeeping genes), illustrating that the composite scoring captures effect magnitude and consistency rather than clinical actionability. Pan-essential filtering is not yet implemented as an executable flag; users can manually exclude common essential gene lists as a post-processing step.
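The manual post-processing step could look roughly like the sketch below. The function name, the `gene` and `rank` keys, and the exclusion list are all illustrative assumptions; in practice the exclusion set would come from a curated common-essential gene list chosen by the user.

```python
def drop_pan_essentials(ranked_rows, pan_essential_genes, gene_key="gene"):
    """Remove excluded genes from an already-ranked list and renumber the ranks."""
    excluded = {gene.upper() for gene in pan_essential_genes}
    kept = [dict(row) for row in ranked_rows if row[gene_key].upper() not in excluded]
    for new_rank, row in enumerate(kept, start=1):
        row["rank"] = new_rank
    return kept


# Top rows from the lung cancer table, in their reported order.
ranked = [
    {"rank": 1, "gene": "TOP2A", "score": 2.219},
    {"rank": 2, "gene": "RPS29", "score": 1.939},
    {"rank": 3, "gene": "PSMB5", "score": 1.748},
    {"rank": 4, "gene": "RPL12", "score": 1.726},
]
# Illustrative exclusion list only, not a curated essential-gene set.
filtered = drop_pan_essentials(ranked, {"RPS29", "RPL12"})
print([(row["rank"], row["gene"]) for row in filtered])
```

The input rows are copied before renumbering, so the original ranking stays available for comparison.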
3.2 RAS/MAPK Pathway Triage
Applied to the RAS_MAPK pathway (12 genes), all 12 genes were scored. GRB2 (mean effect -0.899, dependency fraction 0.715) and KRAS (-0.548, 0.321) emerge as the top two pan-cancer dependencies, with MAPK1 (-0.622, 0.441) ranking third, consistent with their roles as core signaling nodes.
3.3 Cell Line Prescription for TP53+RB1+CDKN2A
For the tumor suppressor gene panel (TP53, RB1, CDKN2A), the compiler scored 1,178 cell lines and recommended a diverse panel spanning multiple cancer types:
| Rank | Cell Line | Disease | Targets Dep. | Score |
|---|---|---|---|---|
| 1 | HCC1143 | Invasive Breast Carcinoma | 1 | 0.539 |
| 2 | Sa3 | Head and Neck Squamous Cell Carcinoma | 1 | 0.470 |
| 3 | MM576 | Melanoma | 1 | 0.394 |
| 4 | NB-4 | Acute Myeloid Leukemia | 1 | 0.351 |
| 5 | JVE-253 | Colorectal Adenocarcinoma | 1 | 0.321 |
The top 10 lines span 10 distinct cancer types, demonstrating the diversity bonus working as intended. An important caveat: tumor suppressor genes (TP53, RB1) have fundamentally different CRISPR biology than oncogenes. Knocking out a tumor suppressor typically promotes growth (positive gene effect) in lines that retain a functional copy and has little effect in lines that already carry loss-of-function mutations, so a negative effect is the exception rather than the expected signal. The prescription mode uses absolute gene effect values, which captures both directions but conflates dependency with growth promotion. This makes the mode most reliable for oncogene targets (e.g., KRAS, EGFR, MYC) where negative effects unambiguously indicate dependency. For tumor suppressors, users should interpret results with awareness of this sign asymmetry.
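One way to keep the two directions apart when reading prescription output is to split a cell line's target effects by sign before interpreting them. The sketch below is a hypothetical helper, not a feature of the compiler, and the effect values are made up.

```python
def split_by_sign(gene_effects, dependency_cutoff=-0.5):
    """Separate strong dependencies (effect < cutoff) from growth-promoting knockouts (effect > 0)."""
    dependent = {g: e for g, e in gene_effects.items() if e < dependency_cutoff}
    growth_promoting = {g: e for g, e in gene_effects.items() if e > 0}
    return dependent, growth_promoting


# Made-up effects for one cell line: the absolute values are similar,
# but the biological interpretation differs by sign.
effects = {"TP53": 0.8, "RB1": 0.1, "CDKN2A": -0.7}
dependent, growth_promoting = split_by_sign(effects)
print(sorted(dependent), sorted(growth_promoting))
```

Effects between the cutoff and zero fall into neither group, which matches the paper's use of -0.5 as the strong-dependency threshold.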
4. Scoring Heuristic Rationale
The four-factor scoring formula balances effect magnitude, reproducibility, disease specificity, and sample-size coverage. Each component serves a distinct role:
- Without consistency: genes with strong effects in a few lines but weak effects in most would rank highly, making them unreliable targets.
- Without selectivity: pan-essential genes (ribosomal proteins, proteasome subunits) would dominate all disease queries indiscriminately.
- Without evidence_depth: genes tested in 3 lines would be weighted equally to genes tested in 49 lines.
- Without |mean_effect|: genes with high dependency fraction but weak effects would rank above genes with strong, actionable effects.
These are qualitative design rationales, not quantitative ablation results. The scoring weights and thresholds (dependency cutoff < -0.5, selectivity floor 0.1) are heuristic choices that produce biologically reasonable rankings for the tested cancer types; formal sensitivity analysis across threshold variants is future work.
5. Limitations
CRISPR effects are not therapeutic predictions. A strong CRISPR dependency does not imply druggability, nor does it account for drug delivery, selectivity windows, or resistance mechanisms.
Cell line artifacts. In vitro cell lines lack the tumor microenvironment, immune system, and stromal interactions present in vivo.
Genetic context ignored. The compiler does not account for mutation status, copy number, or expression levels that might explain or modulate dependencies.
Selectivity is approximate. The disease-vs-other selectivity metric treats all non-disease lines equally, without accounting for tissue-of-origin effects or molecular subtypes.
Pan-essential genes dominate. The top-ranked targets in any disease query will include pan-essential genes (ribosomal, proteasomal, spliceosomal) because they have the largest effect sizes and perfect consistency. Users should consider filtering these when seeking disease-specific targets.
6. Reproducibility
Ranked CSV outputs are deterministic: same inputs produce byte-identical ranked CSVs (verified by SHA256 golden files). Certificates include timestamps and are therefore not byte-identical across runs, but the scored content they audit is deterministic. The test suite includes 34 automated tests covering data loading, fuzzy matching, both compilation modes, error handling, determinism, and golden file verification.
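The byte-identical property can be illustrated with a small sketch: sort with an explicit tie-break on name, serialize with a fixed line terminator and fixed number formatting, then hash the bytes. The function names, the column layout, and the six-decimal format are assumptions for illustration, not the compiler's actual serialization.

```python
import csv
import hashlib
import io


def ranked_csv_bytes(rows):
    """Serialize rows in a fully determined order with fixed newline and float formatting."""
    # Score descending, then name ascending, so ties never depend on input order.
    ordered = sorted(rows, key=lambda r: (-r["score"], r["name"]))
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["rank", "name", "score"])
    for rank, row in enumerate(ordered, start=1):
        writer.writerow([rank, row["name"], f"{row['score']:.6f}"])
    return buf.getvalue().encode("utf-8")


def csv_sha256(rows):
    """SHA256 of the serialized ranking, suitable for comparison with a golden hash."""
    return hashlib.sha256(ranked_csv_bytes(rows)).hexdigest()


rows = [
    {"name": "B", "score": 1.0},
    {"name": "A", "score": 1.0},
    {"name": "C", "score": 2.0},
]
# Same content in a different input order yields the same digest.
assert csv_sha256(rows) == csv_sha256(list(reversed(rows)))
```

Keeping timestamps out of the hashed file, as the paper does by confining them to the certificate, is what allows the golden-file comparison to be exact.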
Cold-start reproduction (Python 3.12.x, CPU-only, <2 seconds per query):
uv sync --frozen
uv run --frozen --no-sync depmap-triage-compiler triage \
--input inputs/triage_lung_cancer.yaml --outdir outputs/lung_cancer
uv run --frozen --no-sync depmap-triage-compiler prescribe \
--input inputs/prescribe_tp53_rb1.yaml --outdir outputs/tp53_rb1
uv run --frozen --no-sync depmap-triage-compiler verify \
--generated outputs/lung_cancer --golden tests/golden_triage
Forward mode accepts YAML input specifying a disease name (fuzzy-matched to 77 DepMap diseases) or a pathway. Reverse mode covers the 1,109 genes vendored in the cell line effect matrix. Queries outside this set fail closed.
References
- Meyers, R. M., et al. (2017). Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nature Genetics, 49(12), 1779-1784.
- Tsherniak, A., et al. (2017). Defining a cancer dependency map. Cell, 170(3), 564-576.
- DepMap, Broad (2024). DepMap 24Q4 Public. Figshare.
- Behan, F. M., et al. (2019). Prioritization of cancer therapeutic targets using CRISPR-Cas9 screens. Nature, 568, 511-516.
- Hart, T. & Moffat, J. (2016). BAGEL: a computational framework for identifying essential genes from pooled library screens. BMC Bioinformatics, 17, 164.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: depmap-triage-compiler
description: Compile public DepMap CRISPR dependency data into certificate-carrying target prioritization and cell line prescription recommendations.
allowed-tools: Bash(uv *, python *, python3 *, ls *, test *, shasum *)
requires_python: "3.12.x"
package_manager: uv
repo_root: .
canonical_output_dir: outputs/lung_cancer
---

# DepMapRescue Compiler

Compile public DepMap 24Q4 CRISPR-Cas9 dependency data into two decision primitives: (1) forward-mode target triage that ranks genes by effect size, consistency, selectivity, and evidence depth for a given cancer type or pathway; and (2) reverse-mode cell line prescription that ranks cell lines by informativeness for a set of target genes.

This skill is a **public data compiler**: it does not perform new CRISPR screens or statistical analyses. It compiles existing public gene effect scores from the Broad Institute's DepMap project into actionable experimental design recommendations with full certificate-carrying provenance.

## Runtime Expectations

- Platform: CPU-only
- Python: 3.12.x
- Package manager: `uv`
- Execution time: <2 seconds per query
- No internet access required after environment install (derived assets are vendored; `uv sync` may fetch packages on first run)
- No external credentials required

## Step 1: Install the Locked Environment

```bash
uv sync --frozen
```

Success condition: uv completes without errors.

## Step 2: Run Forward-Mode Target Triage

```bash
uv run --frozen --no-sync depmap-triage-compiler triage \
  --input inputs/triage_lung_cancer.yaml \
  --outdir outputs/lung_cancer
```

Success condition: `outputs/lung_cancer/triage_ranked.csv` exists with 2556 ranked genes.

Expected top-5 targets for Lung Cancer (Non-Small Cell Lung Cancer):

| Rank | Gene | Mean Effect | Consistency | Score |
|------|------|-------------|-------------|-------|
| 1 | TOP2A | -2.4339 | 1.0000 | 2.218500 |
| 2 | RPS29 | -2.6796 | 1.0000 | 1.939494 |
| 3 | PSMB5 | -1.8913 | 1.0000 | 1.748318 |
| 4 | RPL12 | -2.4105 | 1.0000 | 1.726400 |
| 5 | SNRPD3 | -4.3242 | 1.0000 | 1.519114 |

## Step 3: Run Reverse-Mode Cell Line Prescription

```bash
uv run --frozen --no-sync depmap-triage-compiler prescribe \
  --input inputs/prescribe_tp53_rb1.yaml \
  --outdir outputs/tp53_rb1
```

Success condition: `outputs/tp53_rb1/prescription_ranked.csv` exists with 1178 ranked cell lines.

## Step 4: Verify Deterministic Reproduction

```bash
uv run --frozen --no-sync depmap-triage-compiler verify \
  --generated outputs/lung_cancer \
  --golden tests/golden_triage
```

Success condition: JSON output contains `"ok": true`.

## Step 5: Full Verification with All Checks

```bash
uv run --frozen --no-sync depmap-triage-compiler verify-full \
  --run-dir outputs/lung_cancer \
  --golden-dir tests/golden_triage \
  --mode triage
```

Success condition: JSON output contains `"ok": true` and all 8 checks pass:

- triage_ranked.csv exists
- certificate.json exists
- summary.md exists
- triage_ranked.csv non-empty
- certificate.json parseable JSON
- certificate keys present
- target_score sorted descending
- triage_ranked SHA256 match

## Step 6: Confirm Required Artifacts

Required files in `outputs/lung_cancer/`:

- `triage_ranked.csv` -- all genes ranked by target score
- `certificate.json` -- audit trail with input hashes, scoring formula, per-gene breakdown
- `summary.md` -- human-readable target recommendations

Required files in `outputs/tp53_rb1/`:

- `prescription_ranked.csv` -- all cell lines ranked by informativeness score
- `certificate.json` -- audit trail with gene matches and per-line breakdown
- `summary.md` -- human-readable cell line recommendations

## Optional: Run Full Demo Pipeline

```bash
uv run --frozen --no-sync depmap-triage-compiler demo
```

Runs triage (Lung Cancer + RAS_MAPK pathway) and prescription (TP53+RB1+CDKN2A) in one shot.

## Available Inputs

| File | Mode | Description |
|------|------|-------------|
| inputs/triage_lung_cancer.yaml | triage | Lung cancer target prioritization |
| inputs/triage_ras_pathway.yaml | triage | RAS/MAPK pathway gene ranking |
| inputs/prescribe_tp53_rb1.yaml | prescribe | Cell lines for TP53+RB1+CDKN2A study |

## Scoring Formulas

**Forward triage**: `target_score = |mean_effect| * consistency * selectivity_norm * evidence_depth`

**Reverse prescription**: `line_score = mean(|gene_effect|) * max(n_targets_dependent, 0.1) * diversity_bonus`

## Data Source

DepMap Public 24Q4 from the Broad Institute (December 2024):

- CRISPRGeneEffect.csv: 1,178 cell lines x 17,916 genes (CERES-corrected)
- Model.csv: 2,105 cell line models with disease/tissue metadata

Raw data (432MB) is not vendored. Derived assets (~34MB) in `data/derived/` are vendored.

## Scientific Boundary

This skill does **not** produce clinical recommendations. It does **not** account for drug target tractability, genetic context, tumor microenvironment, or patient-specific factors. It compiles public CRISPR screen dependency data into experimental design recommendations only.

## Determinism Requirements

- No randomness
- Stable sort order (mergesort + deterministic tie-breaking by gene/cell line name)
- No timestamps in scored outputs (CSVs)
- JSON keys sorted, CSVs with fixed newline behavior