One evening, my partner Boyan Li was at the kitchen table, reviewing student work for a coding class he taught during his PhD at Harvard Medical School in Boston, Massachusetts. The task asked students to apply a computational-biology algorithm to a specific dataset. Each submission needed more than a quick glance. He executed the code, reviewed the results, and followed the logic step by step. Some answers were obviously right; others were clearly incorrect. But many landed in a gray area: partially correct, yet inconsistent in their approach or logic. These were the trickiest to judge and took the most time.
As someone who studies higher education, I observed this process with academic curiosity. What appeared to be a straightforward technical job—running code and verifying results—turned out to be highly interpretive. Grading coding tasks means determining what demonstrates understanding, what counts as a mistake, and how much flexibility is acceptable. This aligned with my own work on how students learn and grow, which treats educational activities as fundamentally relational: even something as routine as grading becomes a conversation between the evaluator and the student.
Watching this blend of technical ability and human insight made me wonder: can generative AI (genAI) help with grading without removing the interpretive effort that gives it value?
Testing AI in Practice
Coding assignments seem particularly well suited to AI tools. Unlike essays, code follows defined patterns and rigid rules, which should make it easier to evaluate. My partner tested this using OpenAI’s ChatGPT 5.4. He shared the assignment instructions along with the model answer and asked it to judge a student’s code for correctness. In practice, ChatGPT mostly matched the student’s code against the model answer and had trouble recognizing legitimate alternative methods. It frequently zeroed in on small problems, such as slower performance, rather than judging whether the student grasped the core algorithm, which was the primary goal of the assignment.
Seeing his frustration, I noticed ChatGPT lacked crucial background. I recommended he include details about typical student errors and specify which minor flaws could be overlooked.

His existing grading routine proved useful here: before grading, he writes his own code and compares it with the instructor’s model answer. This helps him to predict where students might struggle, often the same spots where he went wrong at first. Patterns also emerged during meetings with students. They frequently asked similar questions, and some arrived with AI-generated responses they didn’t fully grasp. These recurring sticking points revealed the main hurdles in applying the whole algorithm correctly, clues that would have been hard to find from the model answer alone.
Including these observations made the AI tool more helpful. It could propose additional test cases that checked whether a student’s solution met the rubric criteria but failed on ‘edge cases’: situations in which, for example, an algorithm receives extreme (yet valid) inputs. For one task, students coded an algorithm to align a genome sequence. One student turned in long, confusing code that met all three rubric requirements. ChatGPT, though, spotted a logic error and, after careful analysis, suggested an edge case in which the code would produce wrong results. Without AI, this error might have been missed, or finding it might have taken hours of manual review.
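To make the idea of an edge case concrete, here is a minimal sketch in Python. It is not the assignment or the student's code described above: the scoring scheme, the function names and the bug (forgetting to penalize gaps at the start of an alignment) are all hypothetical, chosen only to show how a submission can agree with a reference on ordinary inputs yet fail on extreme but valid ones.

```python
def alignment_score(a, b, penalize_leading_gaps=True,
                    match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming (Needleman-Wunsch style).

    With penalize_leading_gaps=False the first row and column are filled
    with zeros, which is the hypothetical bug used in this illustration.
    """
    edge = gap if penalize_leading_gaps else 0
    prev = [j * edge for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * edge]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def reference(a, b):
    """Stand-in for the instructor's model answer (an assumption)."""
    return alignment_score(a, b, penalize_leading_gaps=True)


def student(a, b):
    """Stand-in for a submission with the hypothetical bug."""
    return alignment_score(a, b, penalize_leading_gaps=False)


def disagreements(cases):
    """Return the cases on which the two implementations differ."""
    return [(a, b, reference(a, b), student(a, b))
            for a, b in cases
            if reference(a, b) != student(a, b)]


# Ordinary inputs: identical sequences of equal length.
typical_cases = [("GATTACA", "GATTACA"), ("ACGT", "ACGT")]
# Extreme but valid inputs: an empty sequence, and very unequal lengths.
edge_cases = [("", "ACGT"), ("A", "AAAA")]

print(disagreements(typical_cases))
print(disagreements(edge_cases))
```

The two implementations agree on the ordinary inputs and differ on both edge cases, which is the kind of gap that rubric checks built around typical inputs can miss.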
At the same time, ChatGPT had obvious drawbacks. It sometimes flagged any difference from the model answer as a mistake, even when the student’s method worked. It gave confident-sounding explanations that didn’t survive deeper scrutiny. And, unless directly told, it didn’t consistently verify whether the code actually executed. Fully automated grading—submitting student code and getting a final score—was still not feasible.
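Whether a submission runs at all is one check that need not be left to the chatbot. The following is a minimal sketch, assuming the submissions are Python scripts; the function name `runs_cleanly` and the ten-second limit are illustrative choices, not part of the workflow described here. Student code is untrusted, so a real setup would probably run it in a sandbox or container as well.

```python
import os
import subprocess
import sys
import tempfile


def runs_cleanly(source, timeout=10):
    """Run a submission in a separate Python process.

    Returns (ok, detail): ok is True when the script exits with status 0,
    and detail holds its printed output, its error output or a timeout note.
    """
    # Write the submission to a temporary file so it runs as a normal script.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(source)
        path = handle.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout} s"
    finally:
        os.unlink(path)
    if result.returncode != 0:
        return False, result.stderr.strip()
    return True, result.stdout.strip()


if __name__ == "__main__":
    print(runs_cleanly("print(1 + 1)"))
    print(runs_cleanly("raise ValueError('boom')")[0])
```

The result of this check can be pasted into the prompt, so the model reasons about code that is known to run or known to fail instead of guessing.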

Drawing on her experience as a higher-education researcher, Yulu Hou helped her partner to experiment with automated marking of undergraduate coding assignments. Credit: Hima Rawal
Key Takeaways
These initial tests revealed that using AI well is less about building a fully automatic grading system and more about fitting it into the current workflow. ChatGPT functions best as a teaching aide, not the ultimate judge. Here’s how to get the most from it.
Add context. When designing prompts for grading, I discovered it worked best to go step by step: first, present the problem and let the model solve it independently; then share one or more model answers; and finally, point out critical steps, frequent mistakes, and minor issues that shouldn’t count against the student.
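The step-by-step order described above can be sketched as a small helper that prepares one message per conversational turn. This is an illustration under assumptions, not the prompts we actually used: the function name, the wording of each turn and the sample inputs are hypothetical, and the helper only builds text without calling any chatbot API.

```python
def build_grading_turns(problem, model_answers, critical_steps,
                        common_mistakes, ignorable_issues):
    """Return three messages to send to a chatbot, one turn at a time."""

    def bullets(items):
        return "\n".join(f"- {item}" for item in items)

    # Turn 1: the problem alone, so the model attempts it independently.
    first = ("Here is a coding assignment. Solve it yourself before "
             "seeing any answers.\n\n" + problem)

    # Turn 2: one or more model answers.
    answers = "\n\n".join(
        f"Model answer {number}:\n{code}"
        for number, code in enumerate(model_answers, start=1)
    )
    second = ("Here are the instructor's model answers. Other correct "
              "approaches exist.\n\n" + answers)

    # Turn 3: what matters, what usually goes wrong, what to overlook.
    third = (
        "When grading, use this guidance.\n\n"
        f"Critical steps:\n{bullets(critical_steps)}\n\n"
        f"Frequent mistakes:\n{bullets(common_mistakes)}\n\n"
        f"Minor issues that should not cost marks:\n{bullets(ignorable_issues)}"
    )
    return [first, second, third]


# Placeholder inputs, for illustration only.
turns = build_grading_turns(
    problem="Align two sequences and report the alignment score.",
    model_answers=["def align(a, b): ..."],
    critical_steps=["fills the scoring matrix correctly"],
    common_mistakes=["off-by-one error in the traceback"],
    ignorable_issues=["slow runtime"],
)
for turn in turns:
    print(turn, end="\n\n---\n\n")
```

Keeping the turns separate means the model has already committed to its own solution before it sees a model answer, which may make it less inclined to treat that answer as the only acceptable one.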
Create test cases. AI excels at finding edge cases that standard checks might overlook. These edge cases can then be added to the grading rubric to support a more complete assessment.



