Valid, Unique, and Novel Fraction ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Definition**: Fraction of generated molecules that are simultaneously valid, unique, and novel (i.e., not present in the training set). **Formula**: ``valid_and_unique_and_novel_fraction = n_valid_and_unique_and_novel / n_total`` **Range**: [0, 1], higher is better **Interpretation**: - 1.0: All generated molecules are valid, unique, and novel - 0.0: No generated molecules meet all three criteria This metric is a strict measure of generative quality, rewarding models that produce new, chemically valid, and non-duplicated molecules. It is especially useful for benchmarking generative models in de novo drug design and other applications where novelty and chemical correctness are both critical. Metrics ======= This section provides detailed explanations of all metrics used in the Molecule Benchmarks package. Understanding these metrics is crucial for interpreting benchmark results and comparing molecular generation models. Overview -------- The benchmark suite evaluates molecular generation models across multiple dimensions: - **Validity**: Are the generated molecules chemically valid? - **Uniqueness**: How many unique molecules are generated? - **Novelty**: How many molecules are different from the training set? - **Diversity**: How diverse are the generated molecules? - **Distribution similarity**: How similar are the generated molecules to the reference distribution? Validity Metrics ---------------- Valid Fraction ~~~~~~~~~~~~~~ **Definition**: Percentage of generated molecules that are chemically valid. **Formula**: ``valid_fraction = n_valid / n_total`` **Range**: [0, 1], higher is better **Interpretation**: - 1.0: All generated molecules are chemically valid - 0.5: Half of the generated molecules are valid - 0.0: No valid molecules generated **Code example**: .. code-block:: python validity_score = results['validity']['valid_fraction'] print(f"Valid molecules: {validity_score:.3f} ({validity_score*100:.1f}%)") Unique Fraction ~~~~~~~~~~~~~~~ **Definition**: Percentage of generated molecules that are unique (no duplicates). **Formula**: ``unique_fraction = n_unique / n_total`` **Range**: [0, 1], higher is better **Interpretation**: - 1.0: All generated molecules are unique - 0.5: Half of the generated molecules are unique - Low values indicate mode collapse or limited diversity Valid and Unique Fraction ~~~~~~~~~~~~~~~~~~~~~~~~~~ **Definition**: Percentage of molecules that are both chemically valid and unique. **Formula**: ``valid_and_unique_fraction = n_valid_and_unique / n_total`` **Range**: [0, 1], higher is better **Interpretation**: This is often considered the most important validity metric as it captures both chemical correctness and diversity. Novel Fraction ~~~~~~~~~~~~~~ **Definition**: Percentage of generated molecules that are novel, i.e., not present in the training dataset. Novelty is typically measured among valid and unique molecules, but in this benchmark it is computed as the fraction of all generated molecules that are novel. **Formula**: ``novel_fraction = n_novel / n_total`` **Range**: [0, 1], higher is better **Interpretation**: - High values indicate the model generates new molecules - Low values suggest the model is memorizing training data Unique Fraction at 1000 ~~~~~~~~~~~~~~~~~~~~~~~ **Definition**: Fraction of unique molecules among the first 1000 generated molecules. If fewer than 1000 molecules are generated, the value is set to -1. **Formula**: ``unique_fraction_at_1000 = n_unique_1000 / 1000`` **Range**: [0, 1] (or -1 if <1000 molecules), higher is better **Interpretation**: - 1.0: All of the first 1000 generated molecules are unique - 0.5: Only 500 of the first 1000 are unique - -1: Fewer than 1000 molecules were generated This metric is useful for comparing models that generate a fixed number of molecules and for detecting early mode collapse. Unique Fraction at 10000 ~~~~~~~~~~~~~~~~~~~~~~~~ **Definition**: Fraction of unique molecules among the first 10000 generated molecules. If fewer than 10000 molecules are generated, the value is set to -1. **Formula**: ``unique_fraction_at_10000 = n_unique_10000 / 10000`` **Range**: [0, 1] (or -1 if <10000 molecules), higher is better **Interpretation**: - 1.0: All of the first 10000 generated molecules are unique - 0.5: Only 500 of the first 10000 are unique - -1: Fewer than 10000 molecules were generated This metric is useful for comparing models that generate a fixed number of molecules and for detecting early mode collapse. Moses Metrics ------------- The Moses metrics are based on the benchmarking suite from the paper "Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models" ([arXiv:1811.12823](https://arxiv.org/abs/1811.12823)). Fraction Passing Moses Filters ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Definition**: Percentage of molecules that pass a set of medicinal chemistry filters. **Filters include**: - Molecular weight: 150-500 Da - LogP: -2 to 6 - Number of heavy atoms: 10-50 - Number of rings: ≤6 - PAINS (Pan Assay Interference) filters - And more... **Range**: [0, 1], higher is better **Interpretation**: High values indicate drug-like molecules suitable for pharmaceutical applications. **Code example**: .. code-block:: python filter_score = results['moses']['fraction_passing_moses_filters'] print(f"Drug-like molecules: {filter_score:.3f}") SNN Score (Similarity to Nearest Neighbor) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Definition**: Average similarity of generated molecules to their most similar molecule in the training set. **Calculation**: 1. For each generated molecule, find the most similar training molecule 2. Calculate Tanimoto similarity using Morgan fingerprints 3. Average across all generated molecules **Range**: [0, 1], optimal around 0.5-0.7 **Interpretation**: - Too high (>0.8): Model is copying training data - Too low (<0.3): Generated molecules are too different from training data - Optimal range indicates good balance between novelty and drug-likeness Internal Diversity (IntDiv) ~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Definition**: Average pairwise Tanimoto distance within the generated set. **Variants**: - **IntDiv**: Using p=1 (Manhattan distance) - **IntDiv2**: Using p=2 (Euclidean distance) **Formula**: ``IntDiv = 1 - average_pairwise_similarity`` **Range**: [0, 1], higher is better **Interpretation**: - High values indicate diverse molecular structures - Low values suggest mode collapse or limited chemical space exploration **Code example**: .. code-block:: python diversity = results['moses']['IntDiv'] print(f"Internal diversity: {diversity:.3f}") Scaffold Similarity ~~~~~~~~~~~~~~~~~~~ **Definition**: Cosine similarity between scaffold distributions of generated and training molecules. **Calculation**: 1. Extract Murcko scaffolds from molecules 2. Create frequency distributions 3. Calculate cosine similarity between distributions **Range**: [0, 1], higher is better **Interpretation**: Measures how well the model captures the scaffold diversity of the training set. Fragment Similarity ~~~~~~~~~~~~~~~~~~~ **Definition**: Cosine similarity between fragment distributions of generated and training molecules. **Calculation**: 1. Fragment molecules into substructures 2. Create frequency distributions 3. Calculate cosine similarity **Range**: [0, 1], higher is better **Interpretation**: Measures how well the model captures the chemical fragment space. Distribution Metrics -------------------- KL Divergence Score ~~~~~~~~~~~~~~~~~~~ **Definition**: Measures similarity between molecular property distributions of generated and training sets. **Properties evaluated**: - BertzCT (molecular complexity) - MolLogP (lipophilicity) - MolWt (molecular weight) - TPSA (topological polar surface area) - NumHAcceptors (hydrogen bond acceptors) - NumHDonors (hydrogen bond donors) - NumRotatableBonds (rotatable bonds) - NumAliphaticRings (aliphatic rings) - NumAromaticRings (aromatic rings) **Formula**: For each property, calculate ``KL(P_ref || P_gen)`` then average and transform: ``exp(-KL_avg)`` **Range**: [0, 1], higher is better **Interpretation**: - 1.0: Perfect match between distributions - <0.5: Significant differences in molecular properties - This metric captures how well the model reproduces the chemical property space **Code example**: .. code-block:: python kl_score = results['kl_score'] print(f"Property distribution similarity: {kl_score:.3f}") FCD Score (Fréchet ChemNet Distance) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Definition**: Measures similarity between generated and reference molecular distributions in a learned feature space. **Calculation**: 1. Encode molecules using ChemNet (pre-trained neural network) 2. Calculate Fréchet distance between distributions 3. Lower scores indicate better similarity **Variants**: - **fcd**: Using all generated molecules - **fcd_valid**: Using only valid generated molecules - **fcd_normalized**: ``exp(-0.2 * fcd)`` for easier interpretation - **fcd_valid_normalized**: ``exp(-0.2 * fcd_valid)`` **Range**: - **fcd**: [0, ∞], lower is better - **fcd_normalized**: [0, 1], higher is better **Interpretation**: - FCD values <1: Excellent similarity - FCD values 1-5: Good similarity - FCD values >10: Poor similarity **Code example**: .. code-block:: python fcd = results['fcd']['fcd'] fcd_norm = results['fcd']['fcd_normalized'] print(f"FCD score: {fcd:.2f} (normalized: {fcd_norm:.3f})") Metric Interpretation Guidelines -------------------------------- Quality Assessment ~~~~~~~~~~~~~~~~~~ **High-quality model characteristics**: - Valid fraction > 0.9 - Valid and unique fraction > 0.8 - Novel fraction > 0.7 - SNN score: 0.5-0.7 - Internal diversity > 0.8 - KL score > 0.9 - FCD score < 2.0 **Warning signs**: - Valid fraction < 0.5 (chemical knowledge issues) - Unique fraction < 0.7 (mode collapse) - Novel fraction < 0.3 (memorization) - SNN score > 0.8 (copying training data) - Internal diversity < 0.5 (limited diversity) Model Comparison ~~~~~~~~~~~~~~~~ When comparing models, consider: 1. **Primary metrics**: Valid and unique fraction, SNN score, FCD score 2. **Secondary metrics**: Internal diversity, KL score, filter passage rate 3. **Application-specific**: Novel fraction for drug discovery, scaffold similarity for lead optimization **Example comparison**: .. code-block:: python def compare_models(results_dict): """Compare multiple model results.""" for model_name, results in results_dict.items(): validity = results['validity']['valid_and_unique_fraction'] novelty = results['validity']['valid_and_unique_and_novel_fraction'] diversity = results['moses']['IntDiv'] similarity = results['moses']['snn_score'] print(f"{model_name}:") print(f" Quality: {validity:.3f}") print(f" Novelty: {novelty:.3f}") print(f" Diversity: {diversity:.3f}") print(f" Similarity: {similarity:.3f}") Trade-offs ~~~~~~~~~~ Different metrics often involve trade-offs: - **Validity vs. Novelty**: Higher novelty may reduce validity - **Diversity vs. Quality**: More diverse generation may reduce average quality - **Similarity vs. Novelty**: Optimal similarity range balances these factors Statistical Significance ~~~~~~~~~~~~~~~~~~~~~~~~ For robust evaluation: .. code-block:: python # Run multiple evaluations with different seeds results_list = [] for seed in range(5): # Set random seed and run evaluation results = run_benchmark_with_seed(seed) results_list.append(results) # Calculate statistics import numpy as np validity_scores = [r['validity']['valid_fraction'] for r in results_list] mean_validity = np.mean(validity_scores) std_validity = np.std(validity_scores) print(f"Validity: {mean_validity:.3f} ± {std_validity:.3f}") Advanced Metrics ---------------- For specialized applications, additional metrics can be computed: Conditional Metrics ~~~~~~~~~~~~~~~~~~~ For property-conditioned generation: - **MAE (Mean Absolute Error)**: Between target and generated properties - **Conditional validity**: Validity rate for specific property ranges **Code example**: .. code-block:: python # Custom property analysis from rdkit.Chem import Descriptors def analyze_property_match(generated_smiles, target_logp): """Analyze LogP matching for conditional generation.""" valid_mols = [Chem.MolFromSmiles(s) for s in generated_smiles if s] valid_mols = [m for m in valid_mols if m is not None] logp_values = [Descriptors.MolLogP(mol) for mol in valid_mols] mae = np.mean([abs(lp - target_logp) for lp in logp_values]) return mae Pharmacophore Metrics ~~~~~~~~~~~~~~~~~~~~~ For drug discovery applications: - **Pharmacophore coverage**: Percentage of important pharmacophores covered - **ADMET properties**: Drug metabolism and toxicity predictions Scaffold Metrics ~~~~~~~~~~~~~~~~ For lead optimization: - **Scaffold hopping**: Generation of molecules with different scaffolds but similar properties - **Core preservation**: Maintaining key structural motifs Best Practices -------------- Comprehensive Evaluation ~~~~~~~~~~~~~~~~~~~~~~~~ Use multiple metrics for complete assessment: .. code-block:: python def comprehensive_evaluation(results): """Print comprehensive metric analysis.""" print("=== COMPREHENSIVE EVALUATION ===") # Validity v = results['validity'] print(f"Validity: {v['valid_fraction']:.3f}") print(f"Uniqueness: {v['unique_fraction']:.3f}") print(f"Quality (V&U): {v['valid_and_unique_fraction']:.3f}") print(f"Novelty: {v['valid_and_unique_and_novel_fraction']:.3f}") # Moses m = results['moses'] print(f"Drug-likeness: {m['fraction_passing_moses_filters']:.3f}") print(f"Training similarity: {m['snn_score']:.3f}") print(f"Diversity: {m['IntDiv']:.3f}") # Distribution print(f"Property match: {results['kl_score']:.3f}") print(f"Feature similarity: {results['fcd']['fcd']:.2f}") Context-Aware Interpretation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Consider your application when interpreting metrics: - **Early drug discovery**: Emphasize novelty and diversity - **Lead optimization**: Focus on similarity and property matching - **Chemical space exploration**: Prioritize diversity and coverage Reporting Guidelines ~~~~~~~~~~~~~~~~~~~~ When publishing results: 1. Report all major metrics with confidence intervals 2. Provide dataset and evaluation details 3. Compare against established baselines 4. Discuss trade-offs and limitations 5. Include example molecules for qualitative assessment **Example results table**: .. code-block:: text Metric Model A Model B Baseline Valid fraction 0.95±0.02 0.88±0.03 0.92±0.01 Valid & unique 0.87±0.03 0.82±0.04 0.85±0.02 Novel fraction 0.76±0.04 0.69±0.05 0.71±0.03 SNN score 0.63±0.02 0.58±0.03 0.61±0.02 Internal diversity 0.84±0.02 0.89±0.02 0.82±0.03 KL score 0.91±0.01 0.87±0.02 0.89±0.01 FCD score 1.8±0.3 2.4±0.4 2.1±0.2