The S2648 dataset

This dataset, originally released by Dehouck et al. (2009), contains 2648 mutations in 132 proteins. Here we release an alternative cross-validation split in which similar proteins (as computed by BLASTP) are clustered together and placed in the same cross-validation subset. This removes the bias that can arise from similarity between training and testing sequences. We recommend using this cross-validation split if you plan to evaluate your method on the S2648 dataset.
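
The idea behind such a split can be illustrated with a minimal sketch. It is not the procedure used to build the released split: the function names, the single-linkage clustering, the greedy fold balancing, and the number of folds are all assumptions made for illustration. The list of similar protein pairs is assumed to come from filtering BLASTP hits at a threshold of your choosing.

```python
from collections import defaultdict


def cluster_proteins(proteins, similar_pairs):
    """Group proteins so that any two linked by a similarity hit share a cluster.

    Hypothetical helper: single-linkage clustering via union-find.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster root, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for p in proteins:
        clusters[find(p)].append(p)
    return list(clusters.values())


def assign_folds(clusters, mutation_counts, n_folds=5):
    """Place whole clusters into folds, keeping mutation counts roughly balanced.

    Hypothetical helper: largest cluster first, always into the smallest fold.
    """
    def cluster_size(cluster):
        return sum(mutation_counts[p] for p in cluster)

    folds = [[] for _ in range(n_folds)]
    sizes = [0] * n_folds
    for cluster in sorted(clusters, key=cluster_size, reverse=True):
        smallest = sizes.index(min(sizes))
        folds[smallest].extend(cluster)
        sizes[smallest] += cluster_size(cluster)
    return folds


# Toy example with made-up protein identifiers and mutation counts.
proteins = ["A", "B", "C", "D", "E"]
similar_pairs = [("A", "B"), ("B", "C")]  # e.g. filtered BLASTP hits
mutation_counts = {"A": 10, "B": 5, "C": 3, "D": 20, "E": 7}

clusters = cluster_proteins(proteins, similar_pairs)
folds = assign_folds(clusters, mutation_counts, n_folds=2)
```

Because clusters are never divided between folds, proteins A, B and C in the toy example always end up in the same subset, so none of them can appear in both the training and the testing data.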

References

  • Dehouck, Y. et al. (2011) PoPMuSiC 2.1: a web server for the estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinformatics, 12, 151.