Partly-Automated Evaluation and Assessment of Written Narratives
Wayne Smith, Ph.D., Lecturer
Department of Management,
College of Business and Economics,
California State University Northridge
About the author: Wayne teaches general business, management, and organizational behavior courses (required and electives) at CSU Northridge. Additionally, he has taught undergraduate courses in Accounting and Computer Science at Santa Monica College, and graduate-level technology courses at both CSU Channel Islands and UC Irvine. He has been an IT administrator at CSU Northridge, and is a long-time member of the CSU “CATS” community. Since 1983, he has also done intermittent strategy and IT consulting work for a range of for-profit firms and government entities. His Ph.D. is in Information Systems and Technology from Claremont Graduate University. His current research interests are in the areas of telecommunications policy, philosophy of language, and statistical computing. In his spare time, he enjoys amateur radio, table tennis, and singing.
There are many schisms in our complex educational environment. The divergent goals and pedagogies of Science v. Humanities; “pure” subjects v. “applied” subjects; faculty perspectives v. administrative perspectives; traditional delivery v. online delivery; large, urban campuses v. small, rural campuses are but a few of the persistent tensions in our interconnected system. These dichotomous pairs are palpable, but occasionally incommensurable, and often irreconcilable.
Another key distinction in our pedagogical environment is between evaluations (and assessments) that are quantitative, and evaluations (and assessments) that are qualitative. At the critical level of student-instructor, day-to-day interaction within a matriculated course-of-record, the former often lends itself to multiple-choice exams and perhaps, summative evaluations; the latter often lends itself to narrative-based exams and perhaps (if everything goes right), formative evaluation. Why and how either methodology leads to rigorous and relative assessment (beyond evaluation) is in itself a difficult issue (Allen, 2004), and especially so for General Education learning outcomes (Allen, 2006).
The movement toward “learner analytics” systems and supporting technologies reflects these distinctions too. Simply put, some learning outcomes lend themselves more readily to technology-mediated, technology-enhanced, or technology-automated measurement systems while others do not. In the context of limitations of current research and the associated practical issues of Learning Management Systems (LMS), Whitmer (2012) notes that “[the evaluation of the]…quality of discussion posts or other activities…” (emphasis added) is needed but difficult or impossible to obtain from “logfile analysis [alone]”. Whitmer is right, and touches upon a problematic area of educational life in modernity. Campuses and degree-granting programs will need a strategy and the concomitant tools to help evaluate writing that not only assists with the evaluation of quality but also textuality and many other higher-order discursive contexts. Naturally, each step of the strategic formulation and subsequent implementation of a workable approach to computer-aided text analysis will need to be steered by faculty.
In my teaching experience, the distinction between numerically-scored exams and holistically-scored exams is perhaps one of the most significant in practical educational measurement and management. Quizzes, exams, and similarly-situated artifacts that rely on parameterized measures with well-defined specifications (e.g., “numeric answer”, “multiple-choice”, “matching”, etc.) can be automated and routinized with existing technology, especially online technology. To those who have used Moodle or other LMSs, these decision points and pedagogical interventions should be quite familiar. However, for many other types of course evaluation artifacts, especially written narratives, no easy, mainstream, good capital-for-labor substitution exists. I haven’t conducted a formal survey, but I suspect that many CSU faculty believe that students’ written work can be submitted electronically but cannot be analyzed electronically. Or can it?
Applications in the Professions
Some recent lay articles discuss this subject in more detail. For example, “sentiment analysis”, among other types of “text mining”, is now being used to partly automate the process of predicting which movies will do well at the box-office during the first week of a movie’s opening (Dodes, 2012). The New York Times reports on the increasing use of automated tools to help assist with the evaluation of writing (Stross, 2012). Naturally, not everyone is convinced of the value of this approach (Winerip, 2012). In any case, many individuals have now seen and used Google’s interactive Ngram Viewer which plots changes in word frequencies over time using Google Books as its word corpus (Google, 2012). This nascent tool provides a glimpse into what will be possible in the near future.
Applications in Higher Education
Since 1999, the Educational Testing Service (ETS) has scored several “high-stakes” essays, including the GMAT and GRE, using both a human grader and an automated (computer) grader (ETS, 2012a). ETS also offers a publicly available bibliography of papers authored or co-authored by ETS staff on the subject of automated essay grading and related subject areas (ETS, 2012b). As of August 31, 2012, there are over twenty peer-reviewed publications listed. Beyond simply theoretical aspirations, many of these publications are empirical in nature and deal with operational and mainstream specifics such as the use of “e-rater”, a tool for writing assistance embedded in the “Turnitin” software product widely used in the CSU system and linked via “assignments” (but not typically other writing types) in Moodle. As a cautious reader might expect, some of the papers offer mixed results or difficult-to-reproduce effects, and this is true of “e-rater” use as well. But it is increasingly clear that the technology of “automated essay scoring”, “text mining”, “statistical machine learning”, “quantitative corpus linguistics”, and “digital humanities” is improving (Williamson, et al., 2012). It’s improving in ways that assist faculty and other stakeholders with the reliability and validity of measuring writing (Bennett and Bejar, 1998), and it’s improving in ways that enable faculty, administrators, and employers to ask new questions regarding the quality and productivity of student writing both within the traditional writing disciplines of Language, Literature, and Linguistics, and throughout various disciplines.
Using several automated essay scoring engines to analyze more than 22,000 essays written by 7th, 8th, and 10th graders across the nation, Shermis and Hamner (2012) conclude that “the [computer] results meet or exceed that of the human raters” (p. 26), and interestingly, “…diverse use of vocabulary…and greater vocabulary density predict 90% of the true variance in rater judgments of essays” (p. 14). If true, the use of vocabulary-based tools, such as the Google corpus or even better, the well-known “WordNet” (Miller, 1995) database, may be able to complement human scorers well. Further, the authors suggest that “As a general scoring approach, automated essay scoring appears to have developed to the point where it can be reliably applied in both low-stakes assessment (e.g., instructional evaluation of essays) and perhaps as a second scorer for high-stakes testing.” Most of these systems are closed-source, and most need to be “trained” with a human grader on a small but representative sample of similar papers. Naturally, college essays are more complex, and involve, at a minimum, discipline-specific form and content, elements of argumentation and logic, advanced vocabulary and sentence structure, figurative language, and many literary principles. Perhaps truth even needs to be assessed somehow. Extensive empirical testing remains to be done, especially with respect to reliability and validity across many sub-groups of students. And finally, the CSU might start with the English language but cannot end there.
Persistent Organizational Challenges and Trends
1) As with all of higher education, the CSU is striving to improve the measurement and management of the assessment cycle over time. Beyond section-level, course-level, and program-level evaluation, assessment at multiple levels for multiple purposes requires the incorporation of multiple types of student deliverables, including written (and eventually, oral, non-verbal, or performance-related) examples of work aligned with various student learning outcomes. This mandate demands that the associated technical processes be broad while still maintaining the highest academic measurement standards.
2) We are increasingly relying on larger class sections and, consequently, higher student-to-faculty ratios. Without additional in-class and out-of-class assistance, it becomes increasingly difficult to read long student papers, much less provide rich, personalized feedback. Some instructors may even be changing the length or perhaps the content of their assigned class writing due to the changes in class size. Some of these idiosyncratic changes may not even be widely captured, nor their impacts known, across the CSU.
3) A large body of research clearly indicates the need for “writing-across-the-curriculum” (see, for example, Thaiss and Porter, 2010; WAC, 2012) and specific pedagogical interventions by discipline are becoming increasingly critical (see, for example, Smit, 2010). If true, this trend not only broadens the participatory scope of writing collaboration across an institution, but also changes the nature of writing evaluation so as to discern the reliability and validity of narrative arguments by discipline, program, academic level, socio-economic status of student, etc.
4) Evaluating written narratives well on a recurring basis likely requires a highly interdisciplinary approach involving, at a minimum, faculty with expertise in the subject matter at hand, knowledge regarding writing pedagogy and evaluation, and perhaps computer science experience and concomitant infrastructure support. Note that smaller CSU campuses may not have all of these institutional resources, yet the manifest need for automated evaluation of narratives is no less important.
5) Our UC colleagues can benefit from systems that help evaluate written responses, perhaps to help graduate teaching assistants (TAs) in large lecture-hall sections of core, undergraduate courses. Our CCC colleagues can certainly benefit from this technology for multiple reasons. Note that any improvement in lower-division core or GE outcomes in CCCs, especially in composition, prose, or rhetoric, should be reflected in higher quality articulation valences and learning outcomes for CSU/UC-bound transfer students.
Persistent Technological Challenges and Trends
1) The development of digital computing and the development of automated evaluation of writing (from Reed-Kellogg sentence-parsing diagrams to “readability” scoring and beyond) have a long, intertwined history. However, mainstream tools, especially widely-deployed online LMSs in the CSU (e.g., Moodle and Blackboard), contain little or no functionality to evaluate students’ narrative responses. For example, an analytical report for a quiz in Moodle will yield frequencies for multiple-choice responses, but simply skips “short answer” or “essay” questions. Further, no base functionality for such “word analysis” appears to be on the “roadmap” for Moodle.
2) It is possible to write “plug-ins” for Moodle (and “building blocks” for Blackboard) to add some desired functionality. An analytical example is the “Item Response Theory” (IRT) plug-in available for Moodle, which implements a relatively advanced quantitative technique. Of course, IRT is not intended for word analysis. A related question is whether a “plug-in” architecture is even appropriate or robust enough for our multi-headed purposes. Expanded Application Programming Interfaces (APIs), such as the new Learning Tools Interoperability (LTI) standard, may provide some much needed help. It is possible that over time a Moodle instance will need to be paired with a “learner analytical engine” instance to help with all (or many, or especially difficult) quantitative and qualitative analytical tasks. Note that “out-of-LMS” analysis may be useful for summative and formative analytical tasks, but in the CSU we will also want “in-LMS” analysis to be able to provide immediate feedback directly to the students.
3) Pearson provides essay scoring functionality and it is possible that the new CSU Online agreement with Pearson may help diffuse this technology to other CSU programs.
There is another important issue. Faculty learn best from the experiences of other faculty in contexts (perceived or real) that are similar to their own. The diffusion of new ideas—especially ideas that are controversial, innovative, or ill-studied—often requires a working “proof-of-concept” or an empirical “pilot” in order to identify, acknowledge, learn, adapt, and ultimately adopt (or reject) a new idea.
Preliminary, Exploratory Case Study
In Spring, 2011, I had the opportunity to do an informal, preliminary analysis of a large-scale set of student narratives. This analysis was for a required, core course for undergraduate Business students: Principles of Management and Organizational Behavior. There were approximately 600 students (469 valid responses) across six sections, and nearly all of the students were taught in the hybrid format (I was one of the six instructors). This experiment was done partly to demonstrate how Moodle could be used for multiple-section (“meta-course”) course assessment purposes. The students watched a movie and subsequently answered 16 questions in Moodle. Each of the 16 questions contained a multiple-choice question (sometimes with more than a single response possible) and an “explain” open-ended “short-answer”. The purpose of the assessment was to see if students generally understood one of the key learning outcomes of this course: that is, the identification and application of the concepts of planning, leading, organizing, and controlling. Moodle, like all LMSs of which I am aware, provides frequency distributions and associated charts for the multiple-choice response, but no reporting analysis of the “short-answer” responses. Even for additional quantitative analysis—for example, analysis of subgroups and parametric (Z- and t-) tests—faculty regularly need to augment an LMS solution with additional desktop software.
I downloaded the data from Moodle, a task that can be quite common for extended analysis. I chose to use the R statistical software package (R Core Team, 2012) and the “tm” (text mining) package (Feinerer, et al., 2008), partly because tasks can be automated and partly because, in order to scale in the long run (as with Moodle), open source software often represents good value (especially in highly budget-constrained times) and a good, flexible strategy (especially when a campus’ LMS and many other plug-ins are open source).
First, I made some minor technological transformations and conducted a rudimentary “missing response” analysis. Second, I generated distributional properties for the responses at both the overall level and for each question. The overall mean response length was 48 words per response (median = 37) and the standard deviation was 42 words per response. 48 words per response multiplied by 16 questions multiplied by 469 responses is more than 360,000 words (by way of comparison, Herman Melville’s novel Moby Dick is approximately 210,000 words). Simply computing these summary statistics of students’ response lengths is useful, but the capability is not available in base Moodle or through a Moodle plug-in, nor does it appear on the future “roadmap” plans for Moodle 2.x or 3.x (perhaps other LMSs will differ in the future). This new summary information can provide assessment information quickly and without excessive subjective interpretation. Even if a full complement of instructor involvement is necessary, automated tools should be able to estimate a representative sample. Here is an example of a response to a question about culture at approximately the median length (37 words):
“The Japanese culture shows collectivism, high power distance, high uncertainty avoidance, and masculinity, while the American culture shows individualism, low power distance, low uncertainty avoidance, and femininity, which is why there is so much conflict in the movie.”
For some assessment applications (even some evaluation applications), such an analysis may be all that is needed. Also, the summary information can be used to extract specific responses, including by various sub-groups (e.g., by specific instructor-section, by specific student characteristics), needed to validate models and more important, complete the assessment and feedback loop. I extracted a number of such responses, and didn’t correct for “surface” errors involving any area of language use such as mechanics, syntax, and grammar.
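The response-length summary described above can be sketched in a few lines. This is a minimal illustration in Python, not the R and “tm” code used in the study; the function names and the sample responses are hypothetical, and the tokenization rule (runs of letters, digits, and apostrophes) is an assumption that a real analysis might refine.

```python
import re
import statistics


def word_count(response: str) -> int:
    """Count word-like tokens (letters, digits, apostrophes) in one response."""
    return len(re.findall(r"[A-Za-z0-9']+", response))


def length_summary(responses):
    """Mean, median, and sample standard deviation of word counts.

    Blank responses are skipped, mirroring a simple "missing response" filter.
    """
    counts = [word_count(r) for r in responses if r and r.strip()]
    return {
        "n": len(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "sd": statistics.stdev(counts) if len(counts) > 1 else 0.0,
    }


# Hypothetical responses for illustration only (not the study's data).
sample = [
    "The manager set clear goals for the plant.",
    "Culture clash caused conflict.",
    "",
    "Workers and managers disagreed about quality, overtime, and loyalty to the team.",
]
summary = length_summary(sample)
print(summary)

# The corpus-size estimate from the text: mean length x questions x respondents.
estimated_total_words = 48 * 16 * 469
print(estimated_total_words)  # 360192
```

Running the same function once per question, or once per instructor-section, would give the per-question and sub-group summaries mentioned above.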
Third, I generated frequencies for each word (again, both overall and by individual question). High-frequency words were then interpreted manually in the context of the overall assessment and each specific question. For example, words such as “factory”, “goal/goals”, “line”, “plant”, “production”, and “task” can be associated with the concept of planning, and “achieve”, “believe”, “change”, “decision”, “distance”, “feel”, “motivation”, “performance/performing”, “positive”, “power”, “relationship”, and “tried/try/trying” can be associated with the concept of leading. Recall that the purpose of this educational instrument is general assessment. So, fourth, I looked at high-frequency words that appeared neither in the multiple-choice question nor in any of the multiple-choice answers. In the right context and using the appropriate methodology, this analytical technique may provide some evidence that supports the working hypothesis that the students’ subjective responses do indeed explain the students’ objective responses using words that are not inadvertently anchored by the question or response(s). This potential evidence comes in several forms. For example, high-frequency words such as “avoidance” (conflict management style), “family/life” (background context), “team” (structure and communication), and “value” (differing perspectives and utility) were all found to be important in this assessment context. These results support the idea that the students are drawing inferences from the content of the movie by way of identifying supporting evidence without an express, specific instruction by the faculty to write in a manner consistent with the principles of argumentative logic. It seems to me that collegiate-level assessment, particularly in the technical, professional disciplines, is best served by precisely this kind of evidence pattern and practice.
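The third and fourth steps (word frequencies, then high-frequency words absent from the question and its answer choices) can be sketched as follows. This is an illustrative Python sketch rather than the study's R code; the stop-word list, the function names, and the sample prompt and responses are all hypothetical, and no stemming is applied, so “show” and “shows” count as different words.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "was", "for",
    "on", "that", "it", "with", "as", "by", "he", "him", "she", "her",
}


def tokenize(text):
    """Lowercase the text and return its words, minus stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def novel_high_frequency_words(responses, prompt_text, top_n=5):
    """Most frequent response words that do not occur in the prompt text.

    prompt_text should hold the question stem plus all answer choices.
    """
    prompt_vocab = set(tokenize(prompt_text))
    freq = Counter(w for r in responses for w in tokenize(r))
    novel = [(w, c) for w, c in freq.most_common() if w not in prompt_vocab]
    return novel[:top_n]


# Hypothetical question and responses for illustration only.
prompt = ("Which management function does the plant manager show? "
          "Planning, leading, organizing, controlling")
responses = [
    "The manager shows leading because the team trusts him.",
    "Leading, since he motivates the team.",
    "He builds the team and the team responds.",
]
print(novel_high_frequency_words(responses, prompt, top_n=3))
```

Here “leading” and “manager” are filtered out because they echo the prompt, while “team” surfaces as the students' own vocabulary. Interpreting such words still requires the manual, in-context reading described above.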
This isn’t to say “close reading” by all instructors isn’t warranted; it is just to say that, occasionally, some emerging technology tools may help offer some hope to committed but often overburdened CSU faculty.
Fifth, I looked at overall response characteristics. I learned that the word “leading” and its headword “lead” appeared in at least one response to each of the 16 questions, but the words “planning”, “organizing”, and “controlling” (and their respective headwords) did not. This is similar to the idea of “word dispersion”; a more sophisticated approach is called “collocation”. I learned that the responses to one of the questions about Strategic Human Resources exhibited student response word use more representative of culture than human resources per se. This suggests an instructor review and possible re-wording of this question for the future. I also learned that later in the questionnaire, the mean response length decreased and the standard deviation narrowed. Given my experience with surveys, this finding seems intuitive. Finally, I chose to use a “baseline” desktop environment to gain a preliminary, but realistic, understanding of the feasibility of such an analysis. On a 2.4 GHz PC with 3 GB of RAM running Windows XP, this complete analysis, when fully automated, required more than 2 minutes to run to completion. Elementary text analysis is both CPU- and I/O-intensive, and moreover, some advanced techniques require significantly more RAM and therefore a 64-bit computing environment.
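The dispersion check described above (does a concept word appear in at least one response to every question?) can be sketched as below. Again this is Python for illustration, not the study's R code. The word-form lists are hypothetical and written out by hand, because naive prefix matching would wrongly count “plant” as a form of “plan”.

```python
import re

# Hypothetical, hand-listed word forms for each headword.
CONCEPT_FORMS = {
    "lead": {"lead", "leads", "leading", "led"},
    "plan": {"plan", "plans", "planning", "planned"},
}


def dispersion(responses_by_question, concept_forms):
    """Map each headword to the set of question ids in which any form appears.

    responses_by_question maps a question id to a list of response strings.
    """
    coverage = {head: set() for head in concept_forms}
    for qid, responses in responses_by_question.items():
        words = set()
        for r in responses:
            words.update(re.findall(r"[a-z']+", r.lower()))
        for head, forms in concept_forms.items():
            if words & forms:
                coverage[head].add(qid)
    return coverage


# Hypothetical responses for three questions, for illustration only.
responses_by_question = {
    1: ["He was leading the plant workers.", "The plant needed a plan."],
    2: ["She led by example."],
    3: ["Good managers lead."],
}
coverage = dispersion(responses_by_question, CONCEPT_FORMS)
for head, qids in coverage.items():
    print(head, sorted(qids), f"{len(qids)}/{len(responses_by_question)} questions")
```

A headword whose coverage equals the full set of question ids is dispersed across the whole instrument, as “lead” was in the study.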
Significant additional functionality is possible as well. The Moodle responses can be linked with exogenous variables as needed. R or other packages can check for duplicate (or lexically similar) responses although the existing Moodle ↔ Turnitin interface will need to be modified in the future for this to work with student deliverables other than “regular” assignments. As with “automated essay scoring”, additional inferential and interpretive analysis (and perhaps individual and aggregate feedback) can be performed. Such analysis is aided by the use of an appropriate word corpus, if available for specific disciplines at specific levels. Note that there is a workload continuum between the traditional approach of “reading all the responses in context” and this new approach of “specifying parameters to a computer program and interpreting the results”. No single discrete point along this workload continuum is a solution for all faculty in all contexts. Over time, faculty should be able to choose where and when computer-assisted text analysis is helpful to augment their discipline-specific knowledge to achieve various evaluation and assessment purposes. A recent example of a computer-aided approach to evaluating written communication in the field of management is Pollach (2012). Pollach’s paper also includes not only a primer to help readers understand the world of computer-aided text analysis and corpus linguistics but also a partial list of available desktop software for individual use.
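One simple way to flag duplicate or lexically similar responses is to compare the sets of words they use. The sketch below, in Python, uses Jaccard similarity (shared words divided by total distinct words); this is one common technique chosen for illustration, not the method used by Turnitin or by the study, and the function names, the 0.8 threshold, and the sample responses are hypothetical.

```python
import re
from itertools import combinations


def word_set(text):
    """Set of lowercase words in a response (order and repeats ignored)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similar_pairs(responses, threshold=0.8):
    """Index pairs (i, j, similarity) whose similarity meets the threshold.

    Compares every pair, so it suits one question's responses at a time
    rather than an entire corpus.
    """
    sets = [word_set(r) for r in responses]
    pairs = []
    for i, j in combinations(range(len(responses)), 2):
        score = jaccard(sets[i], sets[j])
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs


# Hypothetical responses for illustration only.
responses = [
    "The team avoided conflict.",
    "the team avoided conflict",
    "Managers set production goals.",
]
print(similar_pairs(responses))
```

Flagged pairs are only candidates for review: short responses to a narrow question can legitimately share most of their words.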
I suppose it is heartening that ETS’s efforts to automate the scoring of essays have earned widespread acceptance by graduate programs. However, ETS’s task, quite frankly, is relatively easy. We need technology and systems that draw upon the best quantitative approaches and the best qualitative approaches. We need the deep knowledge and applicable skillsets resident in various CSU faculty disciplinary streams. We need to complement parametric-based (“probability distributions”) analysis with non-parametric (“mining”) approaches where needed. We need systems that provide direct, immediate, and rich feedback not only for the student but also for a range of other stakeholders, some of whom may not have even been explicitly identified yet. We need integrated systems that provide feedback that scales to our size and our purpose. We need systems that help evaluate writing in English, but in other languages as well. We need technology malleable enough to work effectively and efficiently in both extant courses and newly-conceived courses irrespective of delivery format. Finally, we need software that is open, algorithms that are transparent, and effects that are reproducible.
We in the CSU have a history of exploring how to use pervasive technology, systems, and services to help students in various ways, and some of our efforts have had mixed results (Gerth, 2010; see especially pp. 287-308). But despite pernicious budget challenges and occasional mission conflict in modernity, we can design, develop, implement, and incrementally adjust such learning-support systems, some on our own, many with partners. We have the faculty talent, the support of crucial academic and administrative IT staff, and best of all, one of the most diverse (if not the most diverse) sets of motivated student-learners in the world. It is precisely that diversity that can be leveraged in support of recurring student success, even if it means occasionally using contemporary technology-based applications to help with the enormous conceptual, logistical, and methodological challenges in evaluating and assessing students’ writing-extensive and writing-intensive activities.
Allen, M. J. (2004). Assessing Academic Programs in Higher Education. Boston, MA: Anker Publishing.
Allen, M. J. (2006). Assessing General Education Programs. Boston, MA: Anker Publishing.
Bennett, R., and Bejar, I. (1998). Validity and Automated Scoring: It’s Not Only the Scoring. Educational Measurement: Issues and Practice, 17(4), 9-17.
Dodes, R. (2012, August 3). Twitter Goes to the Movies, Wall Street Journal, pp 1-0D.1. Retrieved from http://online.wsj.com/article/SB10000872396390443343704577553270169103822.html
ETS (2012a). How the Test is Scored. Retrieved from http://www.ets.org/gre/revised_general/scores/how/
ETS (2012b). Automated Scoring and Natural Language Processing: Bibliography. Retrieved from http://www.ets.org/research/topics/as_nlp/bibliography
Feinerer, I., Hornik, K., and Meyer, D. (2008). Text Mining Infrastructure in R. Journal of Statistical Software, Mar., 25(5). Retrieved from http://www.jstatsoft.org/v25/i05
Gerth, D. (2010). The People’s University: A History of the California State University. Berkeley, CA: Institute of Governmental Studies Press.
Miller, G. (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 39-41.
Pollach, I. (2012). Taming Textual Data: The Contribution of Corpus Linguistics to Computer-Aided Text Analysis, Organizational Research Methods, 15(2), 263-287.
R Core Team (2012). R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.r-project.org
Shermis, M., and Hamner, B. (2012). Contrasting State-of-the-Art Automated Scoring of Essays: Analysis. The William and Flora Hewlett Foundation, Apr. Retrieved from http://dl.dropbox.com/u/44416236/NCME%202012%20Paper3_29_12.pdf
Smit, D. (2010). Strategies to Improve Student Writing. Retrieved from http://www.theideacenter.org/sites/default/files/IDEA_Paper_48.pdf
Stross, R. (2012, June 10). The Algorithm Didn’t Like My Essay. New York Times, pp. 3-BU.3. Retrieved from http://www.nytimes.com/2012/06/10/business/essay-grading-software-as-teachers-aide-digital-domain.html
Thaiss, C., and Porter, T. (2010). The State of WAC/WID in 2010: Methods and Results of the U.S. Survey of the International WAC/WID Mapping Project. College Composition and Communication, 61(3), 534-570.
WAC (2012). Writing Across the Curriculum. Retrieved from http://wac.colostate.edu/journals.cfm
Whitmer, J. (2012). Learning Management System Analytics: John Goodland Meets the Digital Age, CSU ITL Connections Newsletter, Spring. Retrieved from http://www.calstate.edu/itl/newsletter/12-spring.shtml
Williamson, D., Xi, X., and Breyer, F. J. (2012). A Framework for Evaluation and Use of Automated Scoring. Educational Measurement: Issues and Practice, 31(1), 2-13.
Winerip, M. (2012, April 23). Facing a Robo-Grader? Just Keep Obfuscating Mellifluously. New York Times, pp. 11-A.11. Retrieved from http://www.nytimes.com/2012/04/23/education/robo-readers-used-to-grade-test-essays.html?pagewanted=all