Your Open-Ended Projects Aren’t as AI-Proof as You Think
The difference between unstructured and structured projects
Every time a school moves to “open-ended assessments” as their answer to ChatGPT, I understand the instinct. Give students more choice. More freedom. More room for original thought. AI can’t fake authentic personal voice, right?
A study out of the University of Pennsylvania just ran that assumption through three offerings of university data science courses, and the results are worth exploring, particularly if you design PBL units and believe that student choice is your primary defence against AI shortcuts.
The 30-Point Gap
Kaihua Ding at the University of Pennsylvania tracked 117 students across three offerings of upper-level data science courses between 2024 and 2025. The setup was straightforward: students completed take-home assignments where AI use was permitted, then sat proctored exams where it was not.
The performance gap was stark. On AI-permissive homework and knowledge checks, students averaged around 90%. On proctored exams, that dropped to roughly 63%, a gap of nearly 30 percentage points. Effect size: Cohen’s d = 1.52. In educational research, anything above 0.8 is considered large.
Even more telling: 90% of students scored 90 or above on the take-home assessments. On the proctored exam, 4.9% hit that mark.
The grades looked fine on paper. The learning had not followed.
Ding calls this “AI inflation.” It is a reasonable name. But what matters for teachers is the mechanism: understanding why the inflation happened tells you something specific about how to design around it.
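For readers who have not met Cohen’s d before, here is a minimal sketch of how an effect size like the one quoted above can be computed. It assumes the common pooled-standard-deviation form for two groups of scores; the study may have used a different variant (for example a paired-samples version), and the scores below are invented for illustration, not taken from Ding’s data.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Cohen's d for two groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = stdev(group_a) ** 2  # sample variance of group A
    var_b = stdev(group_b) ** 2  # sample variance of group B
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical scores, invented for illustration only (not the study's data).
homework = [95, 92, 88, 90, 94, 85, 91, 89]
exam = [70, 55, 62, 48, 80, 66, 59, 64]

d = cohens_d(homework, exam)
print(f"Cohen's d = {d:.2f}")
```

The point of the calculation is that d expresses the gap in units of spread: a d above 1 means the two averages sit more than one standard deviation apart.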
Why Open-Ended Projects Don’t Fix the Problem
This research pushes back on widely repeated advice, including guidelines cited from UNESCO and several university teaching centres. The standard recommendation has been to shift toward open-ended assessments on the grounds that AI supposedly struggles with original synthesis and complex creativity.
The problem with that logic, as Ding’s empirical data shows, is that it misidentifies where AI actually struggles.
AI does not struggle with creativity. It struggles with chained, multi-step reasoning where later problems depend on the outputs of earlier ones. Those are different limitations entirely.
In the 2024 course offering, students were given an almost entirely open-ended project: pick your dataset, pick your approach, analyse something real from U.S. Bureau of Labor Statistics data. The expectation was that this freedom would produce diverse, original work.
What actually happened: the majority of students independently chose the same macroeconomic inflation datasets. Not because they copied each other, but because when you give an AI model a fully unconstrained problem, it takes the path of least resistance, and “least resistance” means the dataset and analytical approach most heavily represented in its training data. Students were, without necessarily realising it, delegating both the question and the answer to the tool.
Average scores: 91.47. Standard deviation: 16.83. Almost everyone clustered together near the top.
The 2025 redesign constrained the same project. Specific dataset volume requirements. Required modelling techniques. A set of interconnected analytical questions where the output of one stage fed directly into the next. Still flexible. Students still made interpretive choices. The average score dropped to 78.58 and the standard deviation rose to 30.42.
That increased spread is evidence of actual differentiation. The assessment could now tell the difference between students who understood the material and students who didn’t.

What “Interconnected” Actually Means in Practice
The theoretical argument here draws on Cognitive Load Theory, specifically the idea of element interactivity. A task with low element interactivity can be broken into pieces and solved piece by piece. A task with high element interactivity requires a holistic understanding of how the parts relate to each other.
Current AI models are very capable at low-interactivity tasks. They struggle when a problem requires tracking outputs across multiple dependent steps, because that strains both their multi-step reasoning capacity and the limits of their context window.
The design principle Ding proposes is to make assessment components sequentially dependent. Stage 2 requires the output from Stage 1. Stage 3 builds on Stage 2. You cannot hand each sub-problem to a fresh AI prompt and stitch the results together, because the specific data outputs of your Stage 1 analysis are what determines what Stage 2 is asking you to do.
Think about how a scientist runs an experiment. The first set of results determines what the next question is. Or how a journalist builds an investigation: what they find in one source changes what they need to look for next. The cognitive work is in those transitions, those judgement calls about what matters and where to go next.
That is where genuine understanding lives, and it is also exactly where current AI tools lose coherence.
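The sequential-dependency principle described above can be sketched in a few lines. This is an illustration only, under stated assumptions: the function names, the school energy scenario, and the monthly figures are all hypothetical and are not taken from Ding’s course materials. What it shows is that the Stage 2 and Stage 3 tasks do not exist until the earlier stage has produced a result.

```python
from statistics import mean


def stage1_outlier_month(monthly_kwh):
    """Stage 1: find the month whose usage deviates most from the average."""
    avg = mean(monthly_kwh.values())
    return max(monthly_kwh, key=lambda month: abs(monthly_kwh[month] - avg))


def stage2_task(outlier_month):
    """Stage 2: the question is built from the student's own Stage 1 result."""
    return (f"Investigate what drove energy use in {outlier_month} "
            "and estimate how much of it was avoidable.")


def stage3_task(outlier_month, avoidable_kwh):
    """Stage 3: the brief depends on the estimate produced in Stage 2."""
    return (f"Cost an intervention that would save about {avoidable_kwh} kWh "
            f"in {outlier_month}.")


# Hypothetical data for two students (invented for illustration).
student_a = {"Jan": 410, "Feb": 395, "Mar": 380, "Jul": 620}
student_b = {"Jan": 700, "Feb": 400, "Mar": 390, "Jul": 410}

for data in (student_a, student_b):
    month = stage1_outlier_month(data)
    print(stage2_task(month))
```

Because the two students start from different data, they end up with different Stage 2 questions, so a generic answer prepared in advance does not fit either of them.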
A Challenge for PBL Design
This research matters to PBL practitioners for a specific reason: we already build multi-stage projects. A well-designed Learning Mission naturally has phases: research, analysis, design, critique, revision, presentation. The architecture is there. But the question is whether those stages are consequentially linked or just sequentially scheduled.
There is a difference. You can have a project with five phases where each phase is essentially a separate, self-contained task: find sources, then write a summary, then design a product, then present it. Each of those can be done in isolation with AI assistance and the outputs assembled at the end.
An interconnected design says: the specific constraints you discovered in your research phase must determine the specific parameters of your design phase. Your design choices create the specific tensions your critique phase must address. The intellectual through-line is load-bearing.
This is harder to design. It is also, not coincidentally, what makes a project educationally meaningful rather than just experientially complex.
The other finding worth sitting with is the one about semi-structured constraints. Ding’s data shows that students given specific constraints (a particular dataset, a defined success criterion, a bounded problem) actually produced more ambitious, more varied work than students given total freedom. They engaged with larger datasets. They used more sophisticated methods. They made genuinely different analytical choices from each other.
Total freedom turns out to be somewhat demotivating, or at least offers little guidance. A focused constraint, paradoxically, seems to create more space for creative thinking, not less.
What the Correlation Data Tells Us About Validity
The study’s most practically useful finding for assessment design involves correlation. Ding compared how well different assessment types predicted performance on fully proctored exams, the gold standard for measuring actual knowledge.
Modular, AI-permissive knowledge checks correlated with proctored exam scores at r = 0.671. Those scores explained about 45% of the variance in exam performance.
Interconnected project scores correlated at r = 0.925. They explained 86% of the variance.
That comparison suggests that interconnected projects, even though AI use was permitted, were measuring roughly the same underlying competencies as the proctored exam, but under conditions that actually resemble how professionals work. Students who knew the material did well on both. Students who were outsourcing understanding to AI did not.
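The “variance explained” figures quoted above follow directly from squaring each correlation coefficient. A quick check, using only the two r values reported in this article (the dictionary labels are just illustrative names):

```python
# Correlations with proctored exam scores, as quoted in the article.
correlations = {
    "modular knowledge checks": 0.671,
    "interconnected projects": 0.925,
}

# The share of variance explained is the square of the correlation (r^2).
variance_explained = {name: r ** 2 for name, r in correlations.items()}

for name, r2 in variance_explained.items():
    r = correlations[name]
    print(f"{name}: r = {r:.3f}, r^2 = {r2:.3f} ({r2:.0%} of variance)")
# modular knowledge checks: r = 0.671, r^2 = 0.450 (45% of variance)
# interconnected projects: r = 0.925, r^2 = 0.856 (86% of variance)
```

Squaring is why the gap looks so much bigger in variance terms: a difference of about 0.25 in r becomes a difference of roughly 40 percentage points in variance explained.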
A note on scope: this study ran in university data science contexts with 117 students. It would be overreaching to apply these numbers directly to Year 7 humanities or primary STEM. But the underlying logic, that sequential dependency creates cognitive demand AI cannot easily shortcut, plausibly holds across subject areas and age groups. What changes is how you build that dependency into a unit.
The PBL Future Labs Connection
This research lands squarely in territory we think about constantly when working with schools on Learning Mission design. The question of what makes a project authentic is, in the AI era, inseparable from the question of what makes it resistant to trivial delegation.
A project that asks students to “research climate change and propose a solution” has almost no sequential dependency. Students can gather sources, generate a summary, produce a proposal, and refine the language, each step handled separately, nothing genuinely connecting the outputs. The project looks complex. The cognitive work can be minimal.
A project that asks students to analyse specific local data about their school’s energy use, then use those particular findings to cost a particular intervention, then present to a specific facilities team with specific constraints: that project builds the chain. The outputs of each stage determine the actual problem of the next one.
We have been articulating this for some time through the Learning Mission Framework. It is good to see the mechanism finally validated empirically, even if in a university computing context. The direction of travel is clear.
One Thing Worth Trying Before the End of Term
If you have a project unit running right now, pick one transition between phases and ask yourself: could a student complete Phase 2 without actually having done Phase 1?
If the answer is yes, if Phase 2 is just a generic task that could be approached without reference to the specific outputs of Phase 1, consider adding a checkpoint. Require students to cite a specific finding from Phase 1 before Phase 2 can proceed. Make the connection explicit and load-bearing.
It is a small design change. It shifts a great deal of weight onto the student’s actual thinking.
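If you track project work digitally, the checkpoint idea above could be sketched roughly like this. It is a deliberately naive illustration: the function name and the example findings are hypothetical, and a plain text match is only a prompt for a teacher’s judgement, not a substitute for it.

```python
def phase2_may_proceed(phase1_findings, phase2_plan):
    """Return True if the Phase 2 plan quotes at least one recorded Phase 1 finding.

    Uses a simple case-insensitive substring match; blank findings are ignored.
    """
    plan = phase2_plan.lower()
    return any(
        finding.strip().lower() in plan
        for finding in phase1_findings
        if finding.strip()
    )


# Hypothetical example (invented for illustration).
findings = ["heating accounts for 62% of winter use"]

linked_plan = ("Because heating accounts for 62% of winter use, "
               "we will cost a thermostat schedule.")
generic_plan = "We will design a poster about saving energy."

print(phase2_may_proceed(findings, linked_plan))
print(phase2_may_proceed(findings, generic_plan))
```

The same check works just as well on paper: the student writes the Phase 1 finding at the top of the Phase 2 plan, and the teacher signs off only if the plan actually depends on it.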
by Phillip Alcock
Reference:
Kaihua Ding, “Designing AI-Resilient Assessments Using Interconnected Problems: A Theoretically Grounded and Empirically Validated Framework,” arXiv:2512.10758v3, University of Pennsylvania, January 2026. https://arxiv.org/abs/2512.10758







