Towards benchmark stability and integrity