The p-value of shenanigans

Please answer the following questions:

1) Have you ever taken a high-stakes multiple choice exam? 

2) Do you have a strong understanding of how questions for high-stakes multiple choice exams are created?

For most people, the answers to questions 1 and 2 are “yes” and “no”, respectively. This should be a cause for alarm. You, and almost everyone you know, have participated in a game in which you did not know the rules. Some of you unknowingly played the game well. Many of us, though, played it poorly, at least in part because the rules were never explained in an accessible way.

The first step in understanding high-stakes multiple choice exams is appreciating how these tests differ from those typically given in a high school (or college) classroom. Good teachers normally give “criterion-referenced” exams: tests that compare each examinee to a set standard (the criterion). Classroom exams in high school, college, and graduate school are almost exclusively of this kind. High-stakes exams (SAT, GMAT, MCAT, etc.) are different by design. These exams are “norm-referenced”, with the aim of comparing examinees to one another. Norm-referenced exams exist to generate a percentile rank. Put another way, everyone can get an “A” on a criterion-referenced exam; norm-referenced exams allow no such option.

A norm-referenced exam is a direct byproduct of norm-referenced questions. Creating a distribution, as you can imagine, becomes increasingly difficult when the examinee pool is ever smaller, smarter, and better prepared. Each question in a high-stakes exam is ultimately given a “p-value” indicating the percentage of examinees who got the question right. Questions at the extremes, answered correctly by either everyone or no one, are equally useless for a norm-referenced exam. The sweet spot? 40-60%. If approximately half of examinees get each question wrong, the results generated over hundreds of questions will form a nice-looking distribution (aka “bell curve”).
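To see why per-question p-values near 50% produce that shape, here is a minimal, illustrative simulation in Python. It is not how any testing agency actually builds an exam; the numbers (200 questions, 10,000 examinees, p-values drawn from the 0.40-0.60 range) are assumptions chosen only to show that scores summed over many such questions pile up in the middle and thin out at the extremes.

```python
import random
import statistics

# Illustrative toy model (NOT an actual test-construction method):
# each of 200 questions has a p-value drawn from the 0.40-0.60 "sweet spot",
# and each of 10,000 examinees answers each question correctly with that probability.
random.seed(0)
NUM_QUESTIONS = 200
NUM_EXAMINEES = 10_000

p_values = [random.uniform(0.40, 0.60) for _ in range(NUM_QUESTIONS)]
scores = [
    sum(random.random() < p for p in p_values)  # number of questions answered correctly
    for _ in range(NUM_EXAMINEES)
]

print(f"mean score: {statistics.mean(scores):.1f} / {NUM_QUESTIONS}")
print(f"std dev:    {statistics.stdev(scores):.1f}")

# Crude text histogram: most scores land near the middle, few at the extremes,
# which is exactly the bell-shaped spread needed to assign percentile ranks.
lo, hi = min(scores), max(scores)
bin_width = max(1, (hi - lo) // 15)
for start in range(lo, hi + 1, bin_width):
    count = sum(start <= s < start + bin_width for s in scores)
    print(f"{start:3d}-{start + bin_width - 1:3d} | {'#' * (count // 100)}")
```

Run it and the histogram is already recognizably bell-shaped, even though this sketch ignores real differences in examinee ability.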

How do test writers ensure approximately 50% of examinees get a question wrong? The answer is called “content transfer”. An alternative, and equally appropriate, term would be “shenanigans”. Content transfer means intentionally presenting material in an unusual or unexpected way. These unexpected twists can make even the conscientious examinee feel unprepared. Simply knowing that material is deliberately presented in an atypical manner can also encourage test takers to approach these exams as exactly what they are: a game.

How, you may ask, do test writers “test” these questions? This occurs during the exam itself. Approximately 15-20% of the questions given during high-stakes exams are being “beta tested” in real time. These questions do not count towards the percentile rank score but can nonetheless have a tremendous impact on examinees. Those who stumble over unvetted questions may waste time and effort on items that don’t count. Conversely, examinees who can simply “shrug off” a difficult (or unvalidated) question are likely to do better in this format.

The more test takers know about the inner workings of high-stakes exams, the better equipped they will be. It is increasingly recognized that the psychological approach examinees bring to these exams matters. Viewing high-stakes exams as an imperfect game is a much healthier, and more successful, approach than assuming these questions are a true measure of your knowledge, ability, and worth.
