Welcome to NEM

Journals (Abstract)

Performance Differences of Large Language Models in IELTS Writing Evaluation

Luan Keyun

Department of English, Hebei University of Technology

Abstract:

With the IELTS exam moving entirely to computer-based testing, electronic submission and feedback align with the growing use of large language models for writing self-assessment. Traditional teacher evaluation is costly, limited in coverage, and slow to deliver feedback, whereas large language models, being low-cost, immediate, and reproducible, have emerged as a core alternative tool for student self-assessment. However, scores differ significantly between models, and feedback is inconsistent across repeated evaluations, so the results of student self-assessment are mixed. To address this, this study focuses on IELTS Writing Task 2. It selects three mainstream Chinese large language models (Doubao, Yuanbao, and DeepSeek) and systematically compares their performance across four dimensions (scoring accuracy, scoring stability, feedback accuracy, and feedback consistency), using a corpus of IELTS essays covering different score bands and topic types. The findings indicate that no all-purpose model currently exists; test-takers should select or combine models based on their core needs. The models excel at evaluating surface-level language dimensions but have limited capacity to assess higher-order dimensions such as argumentation and structure. They also exhibit no fatigue effects, which opens new possibilities for large-scale standardized writing assessment. This study provides empirical evidence for the development of a human-machine collaborative IELTS writing assessment system.


Key Words:

large language model; IELTS; writing evaluation; automated essay scoring; second language writing
