Ian, Pat and Aulikki,
Thank you for posting your reflections on the use of the multiple choice question. Ian posed some excellent questions, and I have tried to answer them. So, Ian, here we go.
Let me start by saying that, as a non-native English speaker assessed primarily through essays, I detested multiple choice questions. I first encountered them on college entrance tests like the TOEFL and the SAT. Yet even then I acknowledged to myself that when I knew the answer to a question, I did not find the format confusing. Since then I have learned a great deal about the use of multiple choice questions through formal courses and a lot of experience. I now find myself offering sessions at various training events on how to craft good multiple choice questions that assess higher-level learning objectives.
One of the first things I learned from my professors, who held joint appointments at a university and at a company designing college entrance exams, was that one should never base a decision about a person’s future on written exams alone. Written exams are one piece of the picture of a person’s competence and should be treated with great caution.
You identified several of the reasons for this caution: the authenticity of assessment, the difficulty of writing good questions, and the fact that even a well-written question can be misinterpreted by a very capable learner. These concerns hold for the entire field of written assessment; they apply to all types of text-based questions. None of these issues is unique to the multiple choice question.
I think that every poster to this forum is in agreement that the most authentic assessment is direct observation of the learner performing the actual task. The closer to reality the test is, the better. No form of written question will get close to the authenticity of reality. For performance-based tasks, we can focus on the products (the final outcomes) and examine the processes (the actions that lead to those products) as well. But before we all move toward direct observation, we need to consider these questions:
How will we grade the learner’s performance?
- checklists (did they do all the tasks and did they do them in the right sequence)
- rating scales (was the performance poor, fair, good, or excellent)
- rubrics (a more descriptive way to characterize the quality of the student’s work)
How can we ensure that the grader is objective, does not miss elements of the performance, and has a good handle on what constitutes poor or excellent performance?
- we can record the exam and review it again, but how much more time does that entail?
If we have multiple raters, how do we ensure inter-rater reliability?
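One common way to put a number on that last question is Cohen’s kappa, which compares the agreement two raters actually show with the agreement we would expect by chance. Here is a minimal sketch in Python; the two graders and their pass/fail scores are invented purely for illustration, not taken from any real assessment:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same set of performances.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's marginals.
    """
    n = len(rater_a)

    # Observed agreement: the fraction of performances both raters scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: the product of the raters' marginal proportions, per category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n)
              for c in set(counts_a) | set(counts_b))

    return (p_o - p_e) / (1 - p_e)

# Hypothetical scores from two graders who each observed the same ten simulations.
grader_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
grader_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
print(round(cohens_kappa(grader_1, grader_2), 2))
```

A value near 0 means the graders agree little more than they would by chance, while a value near 1 means strong agreement. Other statistics exist for more than two raters, but the point stands: reliability has to be checked, not assumed.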
What about essay questions? I turned your example into one: “Here is a lot of data ... Please write an essay explaining whether you expect tornadoes to occur in the forecast area during the forecast time.” Sure, we will gain more information from a question like this than from a multiple choice question, but in reality a forecaster will not write an essay when forecasting a tornado. The essay question does not call forth the desired performance either. An essay question like this also lacks fidelity because we cannot give the learners all the data that they can access in their offices, nor do we wish them to talk to a colleague while writing the essay.
Like a multiple choice question, the question above will not reveal the process that the learner went through to determine the answer. To get at that process, we’ll need to write a better question. For instance:
Review the data above and determine if a tornado will occur in the forecast area during the forecast time. Write out a complete justification for your forecast using the data provided, and describe the sequence of steps that you used in arriving at your conclusion. Explain why you performed the steps in the order that you did.
Perhaps this illustrates the point that good essay questions that call forth the desired performance are not easy to write. As a grader of essay questions, I have seen plenty of examples of poorly written questions, as well as plenty of answers showing that even a good essay question had been misinterpreted by the learner. How do we grade those answers? They contain sound, well-structured, logical arguments that are nevertheless somewhat tangential to the topic.
We can ask, more generally, how do we grade essay questions? Should we base our judgment on an ideal answer? If so, we’ll need to look for the structure of the answer, the sequence of steps, and the completeness of the justification. This means that we need to have written at least a “model answer” outline (better yet, a complete model answer) in order to be fair in our grading. And what about the human tendency to rate a “good” answer as only “fair” when it is preceded by an “excellent” answer? What about inter-rater reliability and objectivity? We can control these issues to some extent, but not completely. So even after we have put all this effort and time into creating a good essay question, we are still uncertain whether the learner can perform what they have written out.
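To make the “model answer” outline idea a bit more concrete, here is one very simple illustration; the outline items just echo the essay question above, and the scoring scheme is my own invented example rather than a recommended rubric:

```python
# Illustrative only: the elements a grader might check an essay answer against,
# so that every rater works from the same model answer outline.
model_answer_outline = [
    "states a clear forecast for tornado occurrence in the forecast area",
    "justifies the forecast using the data provided",
    "describes the sequence of analysis steps taken",
    "explains why the steps were performed in that order",
]

def score_against_outline(elements_present):
    """Count how many outline elements the grader marked as present in the essay."""
    return sum(elements_present)

# The grader records True/False for each outline element found in one essay.
found = [True, True, True, False]
print(f"{score_against_outline(found)} of {len(model_answer_outline)} elements present")
```

Even a bare outline like this gives two raters the same yardstick; without it, each grader ends up inventing their own.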
Furthermore, all written question types are susceptible to linguistic problems, especially for second-language speakers.
A good essay question and a good multiple choice question (all question types for that matter) are difficult to write. I am sure you realize that the example of a multiple choice question that you offered is a True/False question written in multiple-choice format:
Here is a lot of data ... Would you expect tornadoes to occur in the forecast area during the forecast time?
a) Yes
b) No
To assess such a complex task, we’ll need a series of multiple choice questions. The series may begin with a question like this:
Review this data. What is the probability that a tornado will occur in the forecast area at this forecast time and why?
a. 10% … for these reasons
b. 30% … for these reasons
c. 60% … for these reasons
d. 90% … for these reasons
The series could continue with: "Where in the forecast area would you expect the tornado to occur?" or “When would you issue your tornado warning?”
You suggested that the spread of answers on some of the example questions in my presentation is an artifact of the multiple choice question form: “I don't believe that this [diversity of answers] was due to the fact that people weren't competent in terms of the content but that it was an artifact of the assessment technique, ie, multiple choice questions. I'm sure that if we all sat around discussing the issues there would be much more convergence than showed in the answers.”
I would like to offer a few more ideas as to why the presentation participants evaluated the example questions with such divergent perspectives. After a previous offering of this presentation, the participants (an international group of future WMO trainers) and I discussed their experience of the presentation and its example questions. Their evaluation of the examples, too, was diverse; their explanations for their answers, however, were not related to the multiple choice format itself, but rather, to their experience with assessment and the unique circumstances of their practice.
For example, regarding the example question about a snowstorm during rush hour, some participants explained that this was the first time they had seen a multiple choice question used for anything other than simple facts: in other words, they were not accustomed to evaluating multiple choice questions for higher-level learning objectives. Another participant noted that the observational network in his country was not extensive and could not provide the ground truth the question relied on. Another pointed out that their models do not include simulated reflectivity; yet another indicated that he had never thought about how forecasters decide whether a model has a good handle on reality.
Thus, as you can see, there are more reasons for the spread among participants’ evaluation of the example questions than simply artifacts of the multiple choice format.
While I will continue to use these example questions with U.S. audiences, I will look for other examples for international audiences. The diversity of experience and practice that a global audience brings to the table makes it impossible for a single question, multiple choice or otherwise, to be evaluated in the same way by everyone, given the world community’s variety of meteorological tools and forecasting methods.
You also mentioned that multiple choice questions require a “high linguistic level” from both the writer and the person taking the question. I agree, and would add that the same is true for essay questions as well. Second-language speakers may find answering essay questions even more taxing than answering multiple choice ones.
A common misconception about multiple choice questions is that they need to be convoluted and difficult to understand in order to function well. In fact, the opposite is true: a well-written multiple choice question reduces the learner’s cognitive load by being clear and precise in both its question and its alternatives. In this way, only a lack of knowledge of the content prevents a learner from selecting the correct answer. Oftentimes, trainers think that their first or second draft of a question is ready for use. As I pointed out in my presentation, we need to make more drafts and test them with actual users before we put the questions into operation in an active assessment. I would also suggest that the same is true for all question types, including essay questions.
In a well-written multiple choice question, every distractor is designed to address a common misunderstanding, a confusing aspect, or something that most learners find difficult to apply. If a learner selects a wrong answer, the instructor can identify an area in which the student needs to improve. In order for a multiple choice question to address a higher-level learning objective, it needs to have a clearly defined problem, be based on real-life situations, and have well-written options that test for the common misconceptions about the topic. With a multiple choice question rather than an essay question, we can also be a bit more confident that human subjectivity did not influence our grading.
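Purely as an illustration of that distractor-to-misconception mapping, here is one way such an item could be recorded; the field names and the misconception labels are hypothetical, not drawn from any real exam:

```python
# Illustrative item record: each distractor is tied to the misunderstanding it is
# meant to detect, so a wrong answer immediately points to an area for remediation.
item = {
    "stem": ("Review the data above. What is the probability that a tornado will "
             "occur in the forecast area during the forecast time, and why?"),
    "options": {
        "a": {"text": "10% ... for these reasons", "correct": False,
              "misconception": "overweights a single piece of data over the full set"},
        "b": {"text": "30% ... for these reasons", "correct": False,
              "misconception": "underestimates one of the key ingredients"},
        "c": {"text": "60% ... for these reasons", "correct": True,
              "misconception": None},
        "d": {"text": "90% ... for these reasons", "correct": False,
              "misconception": "treats a favorable environment as a certainty"},
    },
}

def diagnose(item, chosen_key):
    """Return the misunderstanding flagged by the learner's choice, if any."""
    option = item["options"][chosen_key]
    if option["correct"]:
        return "Correct - no remediation flagged."
    return "Review suggested: " + option["misconception"]

print(diagnose(item, "d"))
```

The design point is that wrong answers are never filler: each one is there because it is exactly what a learner with a specific misunderstanding would choose.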
As Pat mentioned, multiple choice questions are an efficient and effective way of measuring whether learning has occurred, for all levels of learning objectives and in a variety of situations. They do not suffer from some of the grading problems of other types of questions. They have also been used successfully for decades in college entrance tests to diagnose learners’ ability to succeed in academia. I hope that you can see that the issues you identified as problematic for multiple choice questions -- the authenticity of assessment, the difficulty of writing good questions, and the fact that even a well-written question can be misinterpreted by a very capable learner -- are in fact problematic for the entire field of text-based assessment.
With all this in mind, I would still prefer that pilots, medical staff, and nuclear power plant operators -- as well as personnel in the many other fields in which the consequences of mistakes can be catastrophic -- have passed a direct observation test with flying colors. Text-based assessment can give us only a crude approximation of someone’s competency, at best.
Cheers,
Tsvet