Can AI Accurately Evaluate Coding Assignments?

One evening, my partner Boyan Li was at the kitchen table, reviewing student work for a coding class he taught during his PhD at Harvard Medical School in Boston, Massachusetts. The task asked students to apply a computational-biology algorithm to a specific dataset. Each submission needed more than a quick glance. He executed the code, reviewed the results, and followed the logic step by step. Some answers were obviously right; others were clearly incorrect. But many landed in a gray area: partially correct, yet inconsistent in their approach or logic. These were the trickiest to judge and took the most time.

As someone who studies higher education, I observed this process with academic curiosity. What appeared to be a straightforward technical job—running code and verifying results—turned out to be highly interpretive. Grading coding tasks means determining what demonstrates understanding, what counts as a mistake, and how much flexibility is acceptable. This aligned with my own work on how students learn and grow, which treats educational activities as fundamentally relational: even something as routine as grading becomes a conversation between the evaluator and the student.

Watching this blend of technical ability and human insight made me wonder: can generative AI (genAI) help with grading without removing the interpretive effort that gives it value?

Testing AI in Practice

Coding tasks appear particularly well suited to AI tools. Unlike essays, code follows defined patterns and rigid rules, which should simplify evaluation. My partner tested this using OpenAI’s ChatGPT 5.4. He shared the assignment instructions along with the model answer and asked it to judge a student’s code for correctness. In practice, ChatGPT mostly matched the student’s code against the model answer and had trouble recognizing legitimate alternative methods. It frequently zeroed in on small problems, such as slower performance, rather than judging whether the student grasped the core algorithm, which was the primary goal of the assignment.

Seeing his frustration, I noticed ChatGPT lacked crucial background. I recommended he include details about typical student errors and specify which minor flaws could be overlooked.

His existing routine turned out to be very useful here: before grading, he writes his own code and compares it to the instructor’s model answer. This helps him predict where students might struggle, often the same spots where he initially errs. Patterns also emerged during student meetings. Students frequently asked similar questions, and some arrived with AI-generated responses they didn’t fully grasp. These repeated sticking points highlighted major hurdles in properly applying the entire algorithm: clues that would’ve been hard to find from the model answer alone.

Including these observations made the AI tool more helpful. It could propose additional test cases, checking whether a student’s solution met the rubric criteria but failed on ‘edge cases’—situations where, for example, an algorithm receives extreme (yet valid) inputs. For one task, students coded an algorithm to align a genome sequence. One student turned in long, confusing code that met all three rubric requirements. ChatGPT, though, spotted a logic error and, after careful analysis, suggested an edge case where it would produce wrong results. Without AI, this error might’ve been missed or taken hours of manual review.
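To make the idea of an edge-case check concrete, here is a minimal Python sketch. It is not the course assignment or the student's actual code: `reference_score` is a textbook global-alignment (Needleman-Wunsch) score that I am assuming as a stand-in for the model answer, and `student_score` is a hypothetical submission with an invented bug (it forgets to penalize leading gaps). The input lists are illustrative too.

```python
from typing import Callable, List, Tuple

ScoreFn = Callable[[str, str], int]


def reference_score(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score with a linear gap penalty (stand-in model answer)."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: j leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: i leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def student_score(a: str, b: str, match: int = 1,
                  mismatch: int = -1, gap: int = -1) -> int:
    """Hypothetical submission: same recurrence, but leading gaps cost nothing."""
    prev = [0] * (len(b) + 1)  # invented bug: should be j * gap
    for i in range(1, len(a) + 1):
        curr = [0]  # invented bug: should be i * gap
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def find_disagreements(student: ScoreFn, reference: ScoreFn,
                       cases: List[Tuple[str, str]]) -> List[tuple]:
    """Return (a, b, expected, actual) for every case where the two differ."""
    failures = []
    for a, b in cases:
        expected = reference(a, b)
        try:
            actual = student(a, b)
        except Exception as exc:  # a crash is reported, not hidden
            actual = f"raised {type(exc).__name__}"
        if actual != expected:
            failures.append((a, b, expected, actual))
    return failures


# Typical inputs, like those a basic rubric check might use.
TYPICAL_CASES = [("GATTACA", "GATTACA"), ("ACGT", "ACGT"), ("A", "A")]
# Extreme but valid inputs: empty sequences and very unequal lengths.
EDGE_CASES = [("", ""), ("ACGT", ""), ("", "ACGT"), ("A", "AAAA")]

if __name__ == "__main__":
    print(find_disagreements(student_score, reference_score, TYPICAL_CASES))
    print(find_disagreements(student_score, reference_score, EDGE_CASES))
```

In this toy setup the hypothetical submission agrees with the reference on every typical input and only diverges once empty or very unequal sequences are tried, which is the kind of gap an AI-suggested edge case can expose.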

At the same time, ChatGPT had obvious drawbacks. It sometimes flagged any difference from the model answer as a mistake, even when the student’s method worked. It gave confident-sounding explanations that didn’t survive deeper scrutiny. And, unless directly told, it didn’t consistently verify whether the code actually executed. Fully automated grading—submitting student code and getting a final score—was still not feasible.
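One way to close the "did it actually run?" gap is to execute the submission yourself before asking the model anything, and to hand it the real output. The sketch below assumes submissions are standalone Python scripts; the function name `runs_cleanly` and the ten-second limit are my own illustrative choices. It also assumes the code is safe to run on your machine; untrusted submissions would need a proper sandbox, which this does not provide.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def runs_cleanly(source: str, timeout_s: float = 10.0) -> dict:
    """Run a submission in a separate Python process and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "submission.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],  # same interpreter as the grader
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,  # keep any files the script writes inside the temp dir
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "reason": "timeout", "stdout": "", "stderr": ""}
    return {
        "ok": proc.returncode == 0,
        "reason": f"exit code {proc.returncode}",
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


if __name__ == "__main__":
    print(runs_cleanly("print('alignment score: 4')"))
    print(runs_cleanly("raise ValueError('bad input')")["reason"])
```

The captured `stdout` and `stderr` can then be pasted into the prompt, so the model comments on what the code did rather than on what it guesses the code would do.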

Standing beside a whiteboard of notes, Yulu Hou (center) engages with peers during a teaching and learning workshop. Drawing on her experience as a higher-education researcher, Yulu Hou helped her partner to experiment with automated marking of undergraduate coding assignments. Credit: Hima Rawal

Key Takeaways

These initial tests revealed that using AI well is less about building a fully automatic grading system and more about fitting it into the current workflow. ChatGPT functions best as a teaching aide, not the ultimate judge. Here’s how to get the most from it.

Add context. When designing prompts for grading, I discovered it worked best to go step by step: first, present the problem and let the model solve it independently; then share one or more model answers; and finally, point out critical steps, frequent mistakes, and minor issues that shouldn’t count against the student.
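The staged approach can be sketched as a small helper that prepares one message per turn. This is only an illustration: the function name, the wording of each message, and the final turn that presents the student's code are my assumptions, and no particular chatbot API is called. You would paste or send each string in order and wait for a reply before moving on.

```python
from typing import List, Sequence


def build_grading_turns(
    problem: str,
    model_answers: Sequence[str],
    key_steps: Sequence[str],
    common_errors: Sequence[str],
    ignorable_issues: Sequence[str],
    student_code: str,
) -> List[str]:
    """Return the user messages to send, one per conversational turn."""

    def bullets(items: Sequence[str]) -> str:
        return "\n".join(f"- {item}" for item in items)

    # Turn 1: the problem only, so the model solves it before seeing answers.
    turn_1 = (
        "Here is a coding assignment. Solve it yourself before seeing "
        "any answer.\n\n" + problem
    )
    # Turn 2: one or more model answers.
    answers = "\n\n".join(
        f"Model answer {i}:\n{code}"
        for i, code in enumerate(model_answers, start=1)
    )
    turn_2 = (
        "Here are the instructor's model answers. Other approaches can "
        "also be correct.\n\n" + answers
    )
    # Turn 3: what matters, what usually goes wrong, what to let slide.
    turn_3 = (
        "Critical steps:\n" + bullets(key_steps)
        + "\n\nFrequent mistakes:\n" + bullets(common_errors)
        + "\n\nMinor issues that should not count against the student:\n"
        + bullets(ignorable_issues)
    )
    # Turn 4 (assumed): the submission itself.
    turn_4 = "Now assess this student submission:\n\n" + student_code
    return [turn_1, turn_2, turn_3, turn_4]


if __name__ == "__main__":
    for turn in build_grading_turns(
        problem="Compute the global alignment score of two DNA sequences.",
        model_answers=["def score(a, b): ..."],
        key_steps=["initialize the first row and column with gap penalties"],
        common_errors=["off-by-one indexing into the sequences"],
        ignorable_issues=["slower performance than the model answer"],
        student_code="def my_score(a, b): ...",
    ):
        print(turn, end="\n\n---\n\n")
```

Keeping the model answers out of the first turn is the point of the ordering: the model commits to its own solution before it has anything to pattern-match against.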

Create test cases. AI excels at finding edge cases that standard checks might overlook. These edge cases can then be added to the grading rubric to support a more complete assessment.
