Documents
Comprehensive documentation for the KOSMOS2 platform, including terminology, analysis pipelines, and data descriptions.
January 26, 2026 - Initial Release
Database Statistics
• Institutions: 27 hospitals
• Samples: 462 patient samples
• NGS: 14 Panels
• Mutations (SNV/Indel): 7,294 variants across 1,026 genes
• Copy Number Variations (CNV): 335 CNV events
• Gene Fusions: 6 fusion events
Annotation Database
• Reference: hg19 (GRCh37)
• VEP: v115
• COSMIC: v103, released 18-Nov-2025
KOSMOS2 employs a comprehensive genomic analysis pipeline to process and analyze cancer genomic data:
1. Sequencing & Variant Calling
Tumor-only somatic variant analysis using targeted panel sequencing, with reads aligned to the GRCh37 (hg19) reference genome.
Panel Types:
2. Variant Filtering (VCF Processing)
• Genotype (GT) extraction; only records with read depth (DP) > 100 are kept
• Multi-allelic variants split into separate records
• VAF (Variant Allele Frequency) calculation
• VAF ≥ 0.05 filtering threshold
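The filtering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's actual code: the `Record` class and its field names are hypothetical, and VAF is assumed here to be alternate-allele reads divided by total depth, which the documentation does not state explicitly.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

MIN_DEPTH = 100  # keep records with DP > 100
MIN_VAF = 0.05   # keep records with VAF >= 0.05


@dataclass
class Record:
    """Hypothetical, simplified view of one VCF record."""
    chrom: str
    pos: int
    ref: str
    alts: List[str]        # one or more alternate alleles
    depth: int             # DP
    alt_depths: List[int]  # supporting reads, one count per alternate allele


def split_multiallelic(rec: Record) -> Iterator[Record]:
    """Yield one single-allele record per alternate allele."""
    for alt, alt_depth in zip(rec.alts, rec.alt_depths):
        yield Record(rec.chrom, rec.pos, rec.ref, [alt], rec.depth, [alt_depth])


def vaf(rec: Record) -> float:
    """Assumed definition: alternate reads / total depth."""
    return rec.alt_depths[0] / rec.depth if rec.depth else 0.0


def filter_records(records: Iterable[Record]) -> List[Record]:
    kept = []
    for rec in records:
        if rec.depth <= MIN_DEPTH:  # depth filter: DP must exceed 100
            continue
        for single in split_multiallelic(rec):
            if vaf(single) >= MIN_VAF:
                kept.append(single)
    return kept
```

For example, a two-allele record with DP = 200 and 30 and 4 supporting reads yields one kept record (VAF 0.15); the second allele (VAF 0.02) is dropped.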
3. Variant Annotation (VEP)
Annotation using Ensembl VEP (Variant Effect Predictor) with the following criteria:
Transcript Selection
Canonical transcript only (Canonical = Yes)
Population Frequency Filtering
• gnomAD_exon_EAS_AF < 0.001
• gnomAD_genome_EAS_AF < 0.001
• MAX_AF < 0.01
Splice Site Prediction
Variants with a SpliceAI delta score ≥ 0.5 are included
Clinical Significance
ClinVar: Exclude Benign variants
Biotype
Protein coding genes only
COSMIC Filtering
Remove variants with COSMIC Germline = Yes and Somatic = No
Consequence Selection
• missense_variant
• stop_gained
• stop_lost
• start_lost
• frameshift_variant
• inframe_deletion
• inframe_insertion
• splice_acceptor_variant
• splice_donor_variant
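The annotation criteria above could be combined into a single predicate, sketched below. The dictionary keys are hypothetical stand-ins for annotated columns, a missing frequency is assumed to mean the variant was not observed, and only the exact ClinVar label "Benign" is excluded. The documentation does not say how the SpliceAI cutoff interacts with the consequence list; this sketch assumes a SpliceAI score ≥ 0.5 is an alternative route to inclusion, which is one possible reading.

```python
SELECTED_CONSEQUENCES = {
    "missense_variant", "stop_gained", "stop_lost", "start_lost",
    "frameshift_variant", "inframe_deletion", "inframe_insertion",
    "splice_acceptor_variant", "splice_donor_variant",
}


def _num(ann: dict, key: str) -> float:
    """Read a numeric field; a missing value is treated as 0.0 (assumption)."""
    value = ann.get(key)
    return 0.0 if value is None else float(value)


def passes_annotation_filter(ann: dict) -> bool:
    """Apply the documented criteria to one annotated variant (hypothetical keys)."""
    if ann.get("canonical") != "YES":
        return False
    if ann.get("biotype") != "protein_coding":
        return False
    # Population frequency filters
    if _num(ann, "gnomAD_exon_EAS_AF") >= 0.001:
        return False
    if _num(ann, "gnomAD_genome_EAS_AF") >= 0.001:
        return False
    if _num(ann, "MAX_AF") >= 0.01:
        return False
    # Clinical significance: drop variants classified Benign
    if ann.get("clinvar") == "Benign":
        return False
    # COSMIC: drop germline-only entries
    if ann.get("cosmic_germline") == "Yes" and ann.get("cosmic_somatic") == "No":
        return False
    # Assumed combination: a selected consequence OR a SpliceAI hit
    in_list = ann.get("consequence") in SELECTED_CONSEQUENCES
    splice_hit = _num(ann, "spliceai_max_delta") >= 0.5
    return in_list or splice_hit
```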
4. Statistical Analysis
• Kaplan-Meier survival analysis (OS and PFS)
• Log-rank test for group comparison
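As an illustration of the Kaplan-Meier step, the product-limit estimator can be written in plain Python. This is a sketch rather than the platform's implementation; the function name is hypothetical, and status is assumed to use the platform's coding of 0 = censored and 1 = event.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], status: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time."""
    pairs = sorted(zip(times, status))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = 0   # events observed at time t
        removed = 0  # events plus censored observations at time t
        while i < len(pairs) and pairs[i][0] == t:
            events += pairs[i][1]
            removed += 1
            i += 1
        if events:
            # Product-limit update: S(t) = S(t-) * (1 - d / n)
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve
```

Censored observations lower the number at risk without producing a step in the curve, which is why only event times appear in the output.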
5. Visualization & Reporting
• OncoPlot generation for mutation landscape
• Lollipop plots for protein-level mutations
• Kaplan-Meier survival curves
• Disco Plot for individual sample genomic overview (SNV/Indel, CNV, Fusion on circular chromosome layout)
• Interactive data exploration interface
KOSMOS2 integrates comprehensive genomic and clinical data:
Sample Data
Survival Data
Status: 0 = censored (alive or no progression), 1 = event (death or progression)
Mutation Data & Annotation
Each variant is annotated with multiple prediction tools and databases:
Basic Information
Gene Symbol, Chromosome, Position, Reference/Alternate Allele, Variant Type (SNP, INS, DEL)
Consequence (Variant Effect)
Predicted effect of variant on gene/transcript:
• missense_variant: Amino acid changed to different amino acid
• stop_gained: Premature stop codon created (nonsense mutation)
• stop_lost: Stop codon removed, protein extension
• start_lost: Start codon removed
• frameshift_variant: Insertion/deletion causing reading frame shift
• inframe_deletion: Deletion preserving reading frame
• inframe_insertion: Insertion preserving reading frame
• splice_acceptor_variant: Variant in splice acceptor site (3' end of intron)
• splice_donor_variant: Variant in splice donor site (5' end of intron)
Impact (Severity)
Predicted severity of variant effect on protein function:
• HIGH: Likely loss of function (frameshift, stop_gained, splice variants)
• MODERATE: Possible functional change (missense, inframe indels)
• LOW: Unlikely to affect function significantly (synonymous)
• MODIFIER: Non-coding or intergenic variants
HGVS Notation
Standardized nomenclature for describing variants: HGVSc (coding DNA change, e.g., c.1234A>G) and HGVSp (protein change, e.g., p.Val412Met).
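The relationship between the consequence terms and impact tiers described above can be captured in a small lookup table. This is only a sketch with a hypothetical function name. The tiers for stop_lost and start_lost are not listed in the impact description; they are set to HIGH here following Ensembl VEP's standard severity ratings.

```python
from typing import Optional

# Consequence term -> impact tier
CONSEQUENCE_IMPACT = {
    "frameshift_variant": "HIGH",
    "stop_gained": "HIGH",
    "stop_lost": "HIGH",    # per Ensembl VEP ratings, not stated in the text above
    "start_lost": "HIGH",   # per Ensembl VEP ratings, not stated in the text above
    "splice_acceptor_variant": "HIGH",
    "splice_donor_variant": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_deletion": "MODERATE",
    "inframe_insertion": "MODERATE",
    "synonymous_variant": "LOW",
    "intergenic_variant": "MODIFIER",
}


def impact_for(consequence: str) -> Optional[str]:
    """Return the impact tier, or None for terms this sketch does not cover."""
    return CONSEQUENCE_IMPACT.get(consequence)
```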
SIFT (Sorting Intolerant From Tolerant)
Predicts whether an amino acid substitution affects protein function based on sequence homology.
• Tolerated: Score ≥ 0.05 (high confidence)
• Tolerated Low Confidence: Score ≥ 0.05 (low confidence)
• Deleterious: Score < 0.05 (high confidence)
• Deleterious Low Confidence: Score < 0.05 (low confidence)
• Unknown: No prediction available
PolyPhen-2 (Polymorphism Phenotyping v2)
Predicts the impact of amino acid substitutions on protein structure and function.
• Benign: Score < 0.15 (likely neutral)
• Possibly Damaging: Score 0.15-0.85 (potential functional impact)
• Probably Damaging: Score > 0.85 (likely deleterious)
• Unknown: No prediction available
REVEL (Rare Exome Variant Ensemble Learner)
Ensemble method combining multiple tools to predict pathogenicity of missense variants.
• Benign: Score ≤ 0.644 (likely neutral)
• Pathogenic: Score > 0.644 (likely disease-causing)
• Unknown: No prediction available
SpliceAI
Deep learning-based tool that predicts splicing alterations by evaluating delta scores for donor/acceptor gain and loss. Variants with any delta score ≥ 0.5 are marked as PASS, indicating a high likelihood of affecting splicing.
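The score thresholds given for SIFT, PolyPhen-2, REVEL, and SpliceAI translate directly into small classification helpers. The sketch below uses hypothetical function names, and `None` stands in for a missing prediction. The label for variants that do not reach the SpliceAI cutoff is not given in the text, so that helper returns a boolean instead.

```python
from typing import Optional, Sequence


def classify_sift(score: Optional[float], low_confidence: bool = False) -> str:
    if score is None:
        return "Unknown"
    label = "Deleterious" if score < 0.05 else "Tolerated"
    return f"{label} Low Confidence" if low_confidence else label


def classify_polyphen(score: Optional[float]) -> str:
    if score is None:
        return "Unknown"
    if score < 0.15:
        return "Benign"
    if score <= 0.85:
        return "Possibly Damaging"
    return "Probably Damaging"


def classify_revel(score: Optional[float]) -> str:
    if score is None:
        return "Unknown"
    return "Pathogenic" if score > 0.644 else "Benign"


def spliceai_pass(delta_scores: Sequence[float]) -> bool:
    """True when any delta score (acceptor/donor gain or loss) is >= 0.5."""
    return any(score >= 0.5 for score in delta_scores)
```

Note that the directions differ between tools: a lower SIFT score is more damaging, while higher PolyPhen-2 and REVEL scores are more damaging.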
dbSNP (Database of Single Nucleotide Polymorphisms)
NCBI database cataloging known genetic variations (rs IDs). The presence of an rs ID indicates that the variant has been previously reported.
COSMIC (Catalogue Of Somatic Mutations In Cancer)
Comprehensive database of somatic mutations in cancer. A COSMIC ID indicates that the variant has been observed in cancer samples.
ClinVar
NCBI database of clinically significant variants. Classifications: Pathogenic, Likely Pathogenic, Uncertain Significance, Likely Benign, Benign
Data Privacy & Security
All patient data is de-identified in accordance with institutional review board (IRB) protocols. The platform implements role-based access control to ensure data security and compliance with privacy standards.
Log-Rank Test
Statistical test that compares survival distributions between two or more groups. A p-value < 0.05 indicates a statistically significant difference in survival between groups. The platform requires at least 10 samples per group to compute the log-rank statistic.
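For two groups, the log-rank statistic can be computed as follows. This is a minimal sketch, not the platform's implementation: the function name is hypothetical, status uses 0 = censored and 1 = event, and the p-value comes from a chi-square distribution with one degree of freedom. The 10-sample minimum is enforced by raising an error, which is an assumption about how the requirement might be applied.

```python
import math
from typing import Sequence, Tuple

MIN_GROUP_SIZE = 10  # minimum samples per group stated in the documentation


def logrank_test(times_a: Sequence[float], status_a: Sequence[int],
                 times_b: Sequence[float], status_b: Sequence[int]) -> Tuple[float, float]:
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    if len(times_a) < MIN_GROUP_SIZE or len(times_b) < MIN_GROUP_SIZE:
        raise ValueError("each group needs at least 10 samples")
    a = list(zip(times_a, status_a))
    b = list(zip(times_b, status_b))
    event_times = sorted({t for t, s in a + b if s == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x, _ in a if x >= t)  # at risk in group A
        n_b = sum(1 for x, _ in b if x >= t)  # at risk in group B
        d_a = sum(1 for x, s in a if x == t and s == 1)
        d_b = sum(1 for x, s in b if x == t and s == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Comparing more than two groups requires a generalized form of the statistic, which this sketch does not cover.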
Mutation Frequency Analysis
Calculation of mutation prevalence across samples and genes. Includes identification of significantly mutated genes and hotspot mutations within protein domains.
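Per-gene prevalence can be illustrated as the fraction of samples carrying at least one mutation in a gene. The sketch below assumes that definition and uses a hypothetical function name and input layout. It does not attempt the significance or hotspot analysis mentioned above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def gene_mutation_frequency(mutations: Iterable[Tuple[str, str]],
                            n_samples: int) -> Dict[str, float]:
    """Fraction of samples with at least one mutation per gene.

    `mutations` is an iterable of (sample_id, gene_symbol) pairs; several
    variants in the same gene and sample count once.
    """
    samples_by_gene = defaultdict(set)
    for sample_id, gene in mutations:
        samples_by_gene[gene].add(sample_id)
    return {gene: len(samples) / n_samples
            for gene, samples in samples_by_gene.items()}
```

With four samples, two of which carry TP53 variants, `gene_mutation_frequency` reports 0.5 for TP53 regardless of how many TP53 variants each sample has.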
This documentation is continuously updated to reflect the latest features and methodologies implemented in KOSMOS2. For additional questions or clarifications, please contact the support team.