S2648 Dataset - Homology-based cross-validation split
Original dataset publication: Dehouck et al. (2009) Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics, 25, 2537–2543
Description
This dataset, originally released by Dehouck et al. (2009), contains 2648 mutations in 132 proteins. Here we release an alternative cross-validation split in which similar proteins (as computed by BLASTP) are clustered together and placed in the same cross-validation subset. In this way, bias arising from sequence similarity between training and testing proteins is removed. We suggest using this cross-validation split if you plan to evaluate your method on the S2648 dataset.
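The idea behind such a split can be sketched as follows. This is only an illustration, not the procedure used to build the released split: the protein IDs, mutation labels, cluster assignments, and the helper name `assign_folds` are all hypothetical, and the cluster labels are assumed to have been computed beforehand (for example from BLASTP hits). The sketch places whole clusters into folds, largest first, so that no cluster is shared between two folds.

```python
from collections import defaultdict


def assign_folds(mutations, protein_to_cluster, n_folds=5):
    """Distribute mutations over folds without splitting a homology cluster.

    mutations: list of (protein_id, mutation_label) tuples.
    protein_to_cluster: dict mapping protein_id -> cluster id.
    Returns a list of n_folds lists of (protein_id, mutation_label).
    """
    # Group all mutations by the cluster of their protein.
    cluster_members = defaultdict(list)
    for protein_id, mutation in mutations:
        cluster = protein_to_cluster[protein_id]
        cluster_members[cluster].append((protein_id, mutation))

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: take clusters from largest to smallest and
    # put each one, as a whole, into the currently smallest fold.
    ordered = sorted(cluster_members, key=lambda c: (-len(cluster_members[c]), c))
    for cluster in ordered:
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(cluster_members[cluster])
    return folds


# Toy input (hypothetical IDs, not taken from S2648).
mutations = [
    ("protA", "A10G"), ("protA", "L22V"), ("protB", "K5R"),
    ("protC", "D40N"), ("protC", "E41Q"), ("protD", "G7A"),
]
# protA and protB are assumed to be homologous, so they share a cluster.
protein_to_cluster = {"protA": 0, "protB": 0, "protC": 1, "protD": 2}

folds = assign_folds(mutations, protein_to_cluster, n_folds=2)
for index, fold in enumerate(folds):
    print(index, fold)
```

Because clusters are never divided, a model trained on all folds but one is never tested on a protein that has a close homolog in its training data.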