How can independent researchers reliably detect bias, discrimination, and other systematic errors in software-based decision-making systems?