Task Quality Studio: An Evidence-First AI System for Authentic Learning Task Generation and Evaluation in K-12 Education

Author: Gunduzhan Acar
Affiliation: Mirasys LLC
Date: March 2026
Version: 2.0


Abstract

This paper presents Task Quality Studio (TQS), an AI-driven system for generating, evaluating, and iteratively improving K-12 curriculum tasks grounded in established educational frameworks. TQS addresses a persistent challenge in educational technology: the tendency of large language models (LLMs) to produce learning tasks that exhibit surface-level compliance with pedagogical standards while lacking genuine cognitive depth—a phenomenon we term structural mimicry. The system builds directly on the Authentic Learning with Technology Model (ALTmodel) developed by McConaughey and Facteau (2017) in their Lynn University dissertation, which intersects Wiggins and McTighe's Acquisition-Meaning Making-Transfer (AMT) cognitive taxonomy with Puentedura's SAMR technology integration model to create a quality grid for evaluating learning tasks. TQS extends the ALTmodel in three ways: (1) an evidence-first evaluation architecture that requires verbatim textual evidence before assigning scores, eliminating the score inflation that arises when the professional judgment built into McConaughey and Facteau's human-scored framework is replaced by an LLM evaluator; (2) an expanded 18-dimension, 28-point rubric that operationalizes the ALTmodel's learning and technology axes alongside Burton's (2011) six characteristics of authentic tasks into machine-evaluable dimensions with anti-hallucination safeguards; and (3) an automated six-stage curriculum pipeline that generates, evaluates, and refines tasks at scale while preserving the professional learning principles that McConaughey and Facteau demonstrated improve teacher task design. TQS also provides a Teacher Studio interface enabling educators to evaluate their own tasks against the same rubric, receive narrative coaching feedback, and iteratively uplift tasks from lower to higher cognitive levels—operationalizing the "mirror, not factory" principle that emerged from the ALTmodel's original professional development design. We describe the system architecture, the theoretical grounding of each evaluation dimension, and how TQS extends the ALTmodel research into a technology-enabled professional learning tool.

Keywords: authentic learning, ALTmodel, AI curriculum generation, evidence-based evaluation, Understanding by Design, SAMR, task quality rubric, K-12 education, large language models, McConaughey and Facteau


1. Introduction

1.1 The Problem: AI-Generated Curriculum and the Quality Gap

The proliferation of large language models (LLMs) has created new possibilities for automated curriculum generation. However, early implementations reveal a fundamental quality problem: LLMs are remarkably skilled at producing text that looks pedagogically sound while lacking the cognitive substance that makes learning tasks effective (Kasneci et al., 2023). A worksheet dressed in transfer-task language—mentioning "real-world scenarios" and "critical thinking"—may still require nothing more than recall. We observed this pattern systematically: in initial testing, an AI evaluator awarded a perfect 19/19 to tasks that human experts rated as mediocre, a failure mode we documented as the "19/19 Problem."

This quality gap is precisely the challenge that McConaughey and Facteau (2017) identified in their work with New York City public school teachers: "teachers often lack a framework to evaluate the characteristics of high-quality tasks, resulting in products that vary widely based on teacher interest, training, and the learning goals of the school" (p. 9). Their solution—the Authentic Learning with Technology Model (ALTmodel)—provided a clear, two-axis framework for teachers to analyze and improve their task design. TQS takes this framework and asks: can we operationalize the ALTmodel into an AI system that provides the same quality lens at scale, while preventing the structural mimicry that AI introduces?

This paper describes how Task Quality Studio addresses the quality gap through three innovations that extend the ALTmodel:

  1. Evidence-first evaluation — An architecture where the scoring model never sees the original task. Instead, it scores only from validated verbatim evidence extracted and verified programmatically, eliminating the pathway for score inflation.

  2. A multi-framework rubric — An 18-dimension, 28-point evaluation rubric that operationalizes the ALTmodel's two axes alongside Burton's (2011) six characteristics of authentic tasks, Wiggins and McTighe's (2005) concept of enduring understandings, and developmental appropriateness research into machine-evaluable dimensions.

  3. Teacher Studio as professional learning tool — A technology-enabled implementation of the professional learning principles McConaughey and Facteau demonstrated in their six-module professional development series, enabling teachers to reflect on and improve their own task design through rubric-aligned feedback.

1.2 Scope and Context

TQS was developed for Texas K-12 educators using the Texas Essential Knowledge and Skills (TEKS) standards framework, with 32,494 TEKS standards and 1,554 Common Core State Standards (CCSS) loaded into the system's standards catalog. The initial validated curriculum is Grade 3 English Language Arts (TEKS §110.5). The system is designed to be framework-agnostic and extensible to any state or national standards.

The system runs on commodity hardware: Apple Mac Studio computers (64GB-128GB unified memory each) with cloud LLM fallback. This design decision reflects a core principle—educator tools should not depend on expensive cloud infrastructure or create ongoing per-query costs that would limit adoption.


2. Theoretical Foundations

2.1 The ALTmodel: McConaughey and Facteau's Framework

The foundational framework for TQS is the Authentic Learning with Technology Model (ALTmodel), developed by Leah McConaughey and Paul Facteau in their 2017 doctoral dissertation at Lynn University, Creating a Model and Professional Learning to Support the Design of Authentic Student Learning Tasks. The ALTmodel was designed to address three barriers to high-quality task design that McConaughey and Facteau identified in their research with New York City public school K-12 teachers, principals, and academic coaches:

  1. Teachers lack a consistent learning framework to evaluate task quality, leading to wide variation based on individual training and interest.
  2. Technology is limited to traditional teaching practices — used for substitution (typing instead of handwriting) rather than driving deeper cognition.
  3. Professional learning is limited and ineffective — disconnected, fragmented, lecture-based experiences that do not support systematic approaches to curriculum design.

The ALTmodel intersects two established educational frameworks into a single quality grid:

The Learning Axis: Wiggins' Acquisition / Meaning Making / Transfer (AMT)

McConaughey and Facteau built on Wiggins and McTighe's Understanding by Design (UbD) framework (Wiggins, 1998; Wiggins & McTighe, 2005), specifically the AMT cognitive taxonomy, which categorizes learning goals along a progression of increasing authenticity and cognitive complexity:

  • Acquisition — Students gain knowledge and skills: facts, vocabulary, procedures, and discrete competencies. Teacher-directed, linear. This is necessary but insufficient for deep understanding.
  • Meaning Making — Students construct understanding by connecting facts, analyzing relationships, and interpreting "why" and "how"—not merely "what." Contextualized, with some student analysis and interpretation.
  • Transfer — Students independently apply knowledge to novel, complex, authentic situations they have not previously encountered. Non-linear, unstructured, student-driven. Mirrors professional and adult challenges.

McConaughey and Facteau chose AMT over other cognitive complexity frameworks (such as Bloom's Taxonomy or Webb's Depth of Knowledge) because AMT situates cognitive complexity within authentic, meaningful performances rather than defining it as discrete verbs or processes absent of context (McConaughey & Facteau, 2017, p. 36). This distinction proved critical in TQS: LLMs can easily insert Bloom's higher-order verbs into task descriptions without changing the underlying cognitive demand, but AMT-based evaluation requires evidence of genuine novelty, student agency, and authentic context.

The Technology Axis: Puentedura's SAMR

For the technology dimension, the ALTmodel uses Puentedura's (2006) SAMR model:

  • Substitution — Technology replaces a traditional tool with no functional change (typing instead of handwriting).
  • Augmentation — Technology provides functional improvement (spell-check, search).
  • Modification — Technology significantly redesigns the task (real-time collaborative editing, data visualization).
  • Redefinition — Technology enables previously impossible tasks (citizen science, global collaboration, computational modeling).

The ALTmodel Grid (Scores 0-5)

The ALTmodel crosses these two axes to create a quality grid scored 0-5:

                  Acquisition      Meaning Making      Transfer
                  (low auth)       (medium auth)       (high auth)
                ┌───────────────┬─────────────────┬───────────────┐
 Modification/  │               │                 │               │
 Redefinition   │   0 (ANTI-    │       3         │    5 ★        │
 (high tech)    │    PATTERN)   │                 │  SWEET SPOT   │
                ├───────────────┼─────────────────┼───────────────┤
 Substitution/  │               │                 │               │
 Augmentation   │      1        │       2         │      4        │
 (low tech)     │               │                 │               │
                └───────────────┴─────────────────┴───────────────┘

The critical insight encoded in this grid is the identification of an anti-pattern (Score 0): high technology paired with low-level cognition. As McConaughey and Facteau describe it, this catches tasks where "flashy technology" is used "without deeper cognition" — for example, "students use an expensive app to make a digital flashcard set. The tech is sophisticated but the task is still rote memorization." The grid makes this relationship explicit: learning drives technology, not vice versa. "If removing the technology doesn't change the cognitive demand, the tech isn't serving learning" (McConaughey & Facteau, 2017).
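
Because the grid is a pure function of the two axis positions, the mapping is mechanical once the AMT and SAMR levels are known. A minimal JavaScript sketch (the function name and level encodings are illustrative, not taken from the TQS codebase):

  // ALTmodel grid as a lookup table. techIsHigh means Modification or
  // Redefinition on the SAMR axis; otherwise Substitution/Augmentation.
  function altmodelScore(learning, techIsHigh) {
    const grid = {
      'acquisition':    { low: 1, high: 0 },  // high tech + rote work = anti-pattern
      'meaning-making': { low: 2, high: 3 },
      'transfer':       { low: 4, high: 5 },  // 5 = the "sweet spot"
    };
    return grid[learning][techIsHigh ? 'high' : 'low'];
  }

  altmodelScore('transfer', true);     // 5 (sweet spot)
  altmodelScore('acquisition', true);  // 0 (anti-pattern)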

Validation Results

McConaughey and Facteau validated the ALTmodel through a professional development series with principals, academic coaches, and teachers from nine New York City public schools. Participants attended six in-person professional development modules over six months, learning to analyze and redesign their own tasks using the ALTmodel framework. Results showed that 12 out of 15 redesigned tasks (80%) increased in authenticity from both learning and technology perspectives, 2 tasks (13%) remained at the same level, and 1 task (7%) decreased (McConaughey & Facteau, 2017, p. 4). Qualitative feedback indicated that participants found the ALTmodel effective in rethinking the extent to which their tasks engaged students in deeper cognition and effective technology use, and felt inspired to change both short-term and long-term practice.

2.2 How TQS Extends the ALTmodel

TQS builds on the ALTmodel in four significant ways:

Extension 1: Operationalizing the grid into 18 machine-evaluable dimensions. The ALTmodel provides a 6-cell quality grid (Scores 0-5). TQS decomposes this into 18 specific dimensions across four tiers, enabling precise diagnosis of why a task scores at a particular level and what specifically needs to improve. Where the ALTmodel tells a teacher "this task is Score 2 (low tech + meaning making)," TQS can say "this task scores 2 because it has strong cognitive complexity (A1=2) and standards alignment (E1=2) but lacks real-world fidelity (B1=0), student choice (B5=0), and meaningful technology integration (C1=1, C2=0)."
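
For concreteness, the dimension-level result for the example above might be represented as follows (a hypothetical payload shape; field names are illustrative):

  // The same task the ALTmodel grid places at Score 2, decomposed by dimension:
  const exampleEvaluation = {
    altmodelScore: 2,
    dimensions: {
      A1: 2,  // cognitive level: meaning making
      E1: 2,  // standards coverage: aligned
      B1: 0,  // real-world fidelity: absent
      B5: 0,  // judgment and choice: absent
      C1: 1,  // technology depth: substitution only
      C2: 0,  // tech-learning alignment: absent
    },
  };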

Extension 2: Evidence-first evaluation to prevent AI score inflation. When McConaughey and Facteau implemented the ALTmodel with human evaluators, scoring relied on professional judgment—teachers and coaches discussed and normed their ratings through collaborative protocols. AI evaluators lack this professional judgment and are susceptible to structural mimicry, where surface-level features (keywords, formatting, stated intentions) receive credit without substantive evidence. TQS addresses this through an architecture where the scoring model never sees the original task, only validated verbatim evidence.

Extension 3: Automated generation with the ALTmodel as quality target. The ALTmodel was designed as a diagnostic and professional development tool—helping teachers analyze and improve existing tasks. TQS extends this to generation: the pipeline creates new tasks with the ALTmodel's quality criteria embedded in the generation prompts, then evaluates them against the same rubric. This closes the loop between "what good looks like" (the ALTmodel grid) and "how to produce it" (the generation pipeline).

Extension 4: Technology-enabled professional learning. McConaughey and Facteau's professional development required six months of in-person sessions with expert facilitators. TQS provides the same reflective framework through an always-available digital tool—Teacher Studio—where any teacher can evaluate their tasks, receive dimension-by-dimension feedback, and iteratively improve without waiting for the next PD session. This does not replace collaborative professional learning; it augments it by providing the "mirror" function continuously.

2.3 Burton's Six Characteristics of Authentic Tasks

The authenticity dimensions in the TQS rubric (B1-B6) draw directly from Burton's (2011) synthesis of six frameworks for authentic assessment. McConaughey and Facteau cite Burton's work as identifying "several common characteristics across all frameworks to describe high-quality tasks including 'fidelity of task to the real world, [creation of a] polished product, higher order thinking seamlessly integrated with assessment, collaboration, [requiring] students to make judgements and choices, and complexity'" (Burton, 2011, p. 24, as cited in McConaughey & Facteau, 2017, p. 7).

TQS encodes these six characteristics as binary (0/1) scoring dimensions:

  1. B1: Real-World Fidelity — Does the task mirror challenges professionals or informed citizens actually face?
  2. B2: Valued Professional Product — Does the student create something with purpose beyond a grade?
  3. B3: Higher-Order Thinking — Does the task require analysis, synthesis, evaluation, or creation?
  4. B4: Collaboration — Is there genuine interdependence, not just "work in groups"?
  5. B5: Judgment and Choice — Do students make real decisions about approach, methods, or presentation?
  6. B6: Non-Linear Structure — Are there multiple valid paths to the core intellectual challenge?

TQS operationalizes these with specific "litmus tests" to combat structural mimicry:

  • The Remove-It Test (B1): If you remove the real-world scenario, does the task reduce to a standard worksheet? If yes, the scenario is decorative, not authentic. Score B1=0.
  • The Student Agency Test (B5): If all students will produce essentially the same output, there is no genuine choice regardless of what the instructions say. Score B5=0.
  • The Product Value Test (B2): Would anyone outside the classroom voluntarily read, use, or act on the student's product? If not, B2=0.
  • The Costume Test (B1): Is the real-world context integral to the cognitive demand, or is it a "costume" draped over a traditional exercise?

These operationalized tests transform Burton's abstract authenticity criteria into concrete, machine-evaluable heuristics.
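
One plausible encoding of these tests is as rubric metadata that travels into both the extraction and scoring calls; the JavaScript shape below is an illustration, not the production schema:

  // Two of Burton's dimensions with their extraction hints and litmus tests.
  const burtonDimensions = [
    {
      key: 'B1',
      name: 'Real-World Fidelity',
      extractionHint: 'Quote text showing the task mirrors a challenge professionals or informed citizens actually face.',
      litmusTest: 'Remove-It Test: if removing the scenario reduces the task to a standard worksheet, score 0.',
    },
    {
      key: 'B5',
      name: 'Judgment and Choice',
      extractionHint: 'Quote text where students decide approach, methods, or presentation.',
      litmusTest: 'Student Agency Test: if all students would produce essentially the same output, score 0.',
    },
    // B2, B3, B4, and B6 follow the same shape.
  ];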

2.4 Additional Theoretical Foundations

Enduring Understandings (Wiggins & McTighe, 2005). TQS adds dimension U1 (Enduring Understanding, 0-2) to operationalize Wiggins and McTighe's concept that the most valuable learning centers on "big ideas" that transfer across contexts. A task addressing isolated facts scores U1=0; a task that cannot be completed without grappling with a transferable principle scores U1=2.

Novelty Distance (Perkins & Salomon, 1992). Dimension N1 (0-2) measures how far the task's context is from typical classroom instruction. True transfer requires application in genuinely novel situations. This dimension addresses tasks that claim to be "transfer" but use the same examples and contexts from instruction.

Cognitive Load Theory (Sweller, 1988; Sweller et al., 2019). Dimension F3 (Cognitive Load, 0-1) operationalizes the research finding that working memory constraints are binding—a brilliantly designed transfer task that overwhelms students accomplishes nothing.

Developmental Appropriateness (Piaget, 1964; Vygotsky, 1978). Dimension G1 (Developmental Fit, 0-2) was added after the system's first full curriculum run produced Grade 3 transfer tasks that read like college capstone projects. The ALTmodel's quality grid is grade-agnostic—a "Score 5" task for Grade 3 and a "Score 5" task for Grade 12 must both demonstrate transfer paired with modification- or redefinition-level technology, but the developmental expression of that pairing is radically different. TQS adds grade-banded structural constraints:

  • PK-2: 800-1,200 words maximum, 1 day duration, simple vocabulary, familiar contexts
  • 3-5: 1,500-2,500 words, 2-3 days, grade-appropriate vocabulary with a forbidden jargon list
  • 6-8: 2,500-4,000 words, 2-4 days, academic vocabulary introduced
  • 9-12: 3,500-5,000 words, 3-5 days, professional vocabulary permitted

G1=0 is a hard constraint—tasks are rejected and regenerated, not refined, because iterative improvement within the wrong developmental band rarely produces acceptable results.
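
Expressed as configuration, the bands above reduce to data (the object shape and helper function are ours; the word counts and durations come from the list):

  const GRADE_BANDS = {
    'PK-2': { words: [800, 1200],  days: [1, 1], vocabulary: 'simple, familiar contexts' },
    '3-5':  { words: [1500, 2500], days: [2, 3], vocabulary: 'grade-appropriate; forbidden jargon list' },
    '6-8':  { words: [2500, 4000], days: [2, 4], vocabulary: 'academic vocabulary introduced' },
    '9-12': { words: [3500, 5000], days: [3, 5], vocabulary: 'professional vocabulary permitted' },
  };

  // A task outside its band's structural limits is a candidate for
  // rejection and regeneration rather than refinement.
  function withinBand(content, band) {
    const words = content.trim().split(/\s+/).length;
    const [min, max] = GRADE_BANDS[band].words;
    return words >= min && words <= max;
  }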

Culturally Sustaining Pedagogy (Paris, 2012; Paris & Alim, 2017). TQS enforces cultural diversity requirements during task generation. For Texas classrooms with 40-50% Hispanic/Latino student populations, the system requires named authors from diverse cultural traditions, non-European/non-Western contexts, and grade-appropriate engagement with cultural perspectives.

Authentic Intellectual Work (Newmann & Associates, 1996). Newmann's framework established that authentic achievement requires construction of knowledge, disciplined inquiry, and value beyond school. TQS operationalizes these principles across multiple dimensions: B2 (valued product), B3 (higher-order thinking), E2 (disciplinary practices), and D1 (student ownership of the inquiry process).

Authentic Assessment (Gulikers et al., 2004). Gulikers, Bastiaens, and Kirschner's five-dimensional framework for authentic assessment—covering task, physical context, social context, assessment result, and criteria—informed the comprehensive scope of the TQS rubric, which evaluates not just cognitive level but feasibility, collaboration context, and assessment alignment.


3. System Architecture

3.1 Two-Interface Design: Factory and Mirror

TQS consists of two complementary interfaces, reflecting a distinction that emerged from the ALTmodel research:

Pipeline Admin — A curriculum manufacturing system. Given a standards framework, subject, and grade level, it autonomously generates, evaluates, refines, and approves learning tasks. This is the "factory" side: batch processing, quality control, and continuous improvement.

Teacher Studio — A teacher-facing evaluation and professional development tool. This embodies the "mirror, not factory" design principle: its primary value is helping teachers see their tasks through the ALTmodel quality lens, not replacing teacher judgment. McConaughey and Facteau's professional development modules were built around the same principle—teachers analyzed their own tasks, reflected on where they fell on the ALTmodel grid, and redesigned them collaboratively. Teacher Studio provides this reflective framework as an always-available digital tool.

Both interfaces share a SQLite database and the same evidence-based scoring engine, ensuring evaluation consistency whether a task was AI-generated or teacher-created.

3.2 The Evidence-First Evaluation Engine

The core technical contribution of TQS is the evidence-first evaluation architecture, implemented in evidence-scorer.js. This six-step pipeline was designed specifically to solve the 19/19 Problem—the tendency of LLM evaluators to infer quality from surface-level signals rather than substantive evidence.

Step 1: Specification Gate (Programmatic) Validates input structure, determines the task's target AMT level, and rejects malformed submissions before any LLM call. This step also parses the task into structural blocks (claims, instructions, assessment criteria, context), enabling downstream coherence checks.
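
A minimal sketch of the gate's checks (names are illustrative; the production evidence-scorer.js presumably does richer parsing):

  // Step 1 runs before any LLM call: cheap, deterministic checks only.
  function specificationGate(submission) {
    const errors = [];
    if (!submission.content || !submission.content.trim()) {
      errors.push('empty task content');
    }
    const levels = ['acquisition', 'meaning-making', 'transfer'];
    if (!levels.includes(submission.targetLevel)) {
      errors.push(`unrecognized target AMT level: ${submission.targetLevel}`);
    }
    // Parsing into claims/instructions/assessment/context blocks would
    // happen here; omitted in this sketch.
    return errors.length ? { ok: false, errors } : { ok: true };
  }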

Step 2: Evidence Extraction (LLM Call #1) An LLM extracts verbatim quotes from the task content for each of the 18 rubric dimensions. The model must cite exact text—no paraphrasing, no summarization, no inference. Extraction hints guide the LLM to look for specific textual patterns per dimension (e.g., for B4-Collaboration: "Look for evidence of structured interdependence, assigned roles, or collaborative protocols").

Step 3: Evidence Validation (Programmatic) Each extracted quote is programmatically verified against the original task content using fuzzy string matching (sliding window and ordered word overlap algorithms). Quotes that do not appear in the source text—hallucinated evidence—are discarded. This step is the critical anti-hallucination safeguard: the LLM cannot fabricate evidence to justify a score.
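
A simplified version of this check, assuming a single ordered-word-overlap pass over a sliding window (the production matcher combines two algorithms; the 0.85 threshold here is a placeholder):

  // A quote passes if some window of the source shares enough ordered
  // words with it. Tokenization is deliberately naive in this sketch.
  function quoteAppearsInSource(quote, source, threshold = 0.85) {
    const norm = s => s.toLowerCase().replace(/[^a-z0-9\s]/g, ' ').split(/\s+/).filter(Boolean);
    const q = norm(quote);
    const src = norm(source);
    if (q.length === 0) return false;
    for (let i = 0; i + q.length <= src.length; i++) {
      let matches = 0, j = 0;
      for (let k = i; k < i + q.length && j < q.length; k++) {
        if (src[k] === q[j]) { matches++; j++; }
      }
      if (matches / q.length >= threshold) return true;
    }
    return false;
  }

Quotes failing this check are discarded before scoring, so fabricated evidence never reaches Step 4.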

Step 4: Constrained Scoring (LLM Call #2) A second LLM call scores each dimension, but the model receives only the validated evidence, not the original task. This architectural decision means the scorer physically cannot see structural mimicry cues (e.g., the phrase "real-world scenario" in a task header) that do not correspond to genuine task content. TEKS verification also occurs here: if a task overclaims standards coverage, E1 is capped at 1.
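
The architectural point is visible in how the scoring prompt is assembled: only validated quotes are included. A sketch (prompt wording and field names are ours):

  // Step 4 input: validated evidence only. The original task text is
  // deliberately absent, so surface cues cannot inflate scores.
  function buildScoringPrompt(validatedEvidence, rubricText) {
    const evidenceLines = validatedEvidence.map(e => `${e.dimension}: "${e.quote}"`);
    return [
      'Score each rubric dimension using ONLY the evidence below.',
      'A dimension with no listed evidence scores 0.',
      '',
      rubricText,
      '',
      ...evidenceLines,
    ].join('\n');
  }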

Step 5: External Constraints (Programmatic) Hard limits and cross-dimension coherence rules enforce the ALTmodel's gating logic:

  • ALTmodel Score 5 requires A1=3 AND C1≥2 (transfer + technology modification or higher)
  • ALTmodel Score 4 requires A1≥2 AND total≥18
  • Anti-pattern (Score 0): A1=1 AND C1≥2 (high tech + low cognition)
  • Cross-dimension coherence: A1=3 (Transfer) implies B3≥1, B5≥1, B6≥1
  • G1=0 triggers rejection
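
In code, the rules above reduce to a few deterministic checks applied after scoring. This sketch picks one plausible enforcement of the coherence rule (downgrading A1 when its implications fail), since the list does not specify the mechanism, and omits the remaining grid cells:

  function applyConstraints(scores) {
    const s = { ...scores };
    // Coherence: Transfer (A1=3) implies B3>=1, B5>=1, B6>=1; if the
    // implied evidence is absent, treat the task as meaning making.
    if (s.A1 === 3 && (s.B3 < 1 || s.B5 < 1 || s.B6 < 1)) s.A1 = 2;

    const total = Object.values(s).reduce((sum, v) => sum + v, 0);

    let altmodel = null;
    if (s.A1 === 1 && s.C1 >= 2) altmodel = 0;       // anti-pattern
    else if (s.A1 === 3 && s.C1 >= 2) altmodel = 5;  // sweet spot
    else if (s.A1 >= 2 && total >= 18) altmodel = 4;

    return { scores: s, total, altmodel, reject: s.G1 === 0 };
  }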

Step 6: Narrative Generation (LLM Call #3) An instructional coach-style narrative is generated, highlighting strengths first, then specific improvement suggestions. This follows the "strengths before growth areas" principle from effective instructional coaching practice (Knight, 2007). The narrative references the validated evidence, allowing teachers to see exactly which parts of their task earned or lost points.

3.3 The Six-Stage Curriculum Pipeline

The automated pipeline processes standards into approved curriculum tasks through six stages:

Stage 1: Plan — Generates a flat task list from standards. Each TEKS Student Expectation receives one task. Tasks are independent and standalone—not sequenced into units. This design decision reflects the ALTmodel's use as a diagnostic rubric for individual tasks, not a sequencing principle. Teachers select tasks as needed.

Stage 2: Review — AI review of plan structure against a dynamic checklist that adapts to pipeline parameters (e.g., when a maximum task count limits standards coverage).

Stage 3: Seed Generation — Full task content generation with grade-banded structural constraints, cultural diversity requirements, and technology tool caps by grade band (PK-K: 0-1 tools, 3-5: 2 tools, 6-8: 3 tools). Evaluation criteria from the rubric are embedded directly in the generation prompt, so the LLM knows what "good" looks like before generating.

Stage 4: Evaluate — Each task is scored through the evidence-first engine via the Teacher Studio API, ensuring a single evaluation pathway for both pipeline and manual submissions.

Stage 5: Refine — An iterative improvement loop: evaluate → diagnose → improve → re-evaluate → keep/discard. The refinement prompt includes ranked weak dimensions with evaluator justifications and a "preserve strengths" directive, enabling targeted rather than random improvement.

Stage 6: Approve — Level-differentiated thresholds reflecting the ALTmodel's inherent difficulty gradient: Acquisition ≥ 7/28, Meaning Making ≥ 17/28, Transfer ≥ 22/28. An acquisition task scoring 7/28 is genuinely good (it correctly scores 0 on most authenticity dimensions), while a transfer task needs 22/28 to demonstrate genuine multi-dimensional quality.
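
Stages 5 and 6 compose into a simple loop. In this sketch, evaluate() and refine() are stand-ins for the evidence-first engine and the refinement prompt, and the iteration cap is our assumption:

  const APPROVAL = { 'acquisition': 7, 'meaning-making': 17, 'transfer': 22 };  // out of 28

  async function refineUntilApproved(task, evaluate, refine, maxRounds = 3) {
    let best = { task, result: await evaluate(task) };
    for (let round = 0; round < maxRounds; round++) {
      if (best.result.total >= APPROVAL[best.task.targetLevel]) break;
      const revised = await refine(best.task, best.result.weakestDimensions);
      const result = await evaluate(revised);
      // Keep a revision only if it scores higher; otherwise discard it.
      if (result.total > best.result.total) best = { task: revised, result };
    }
    return { ...best, approved: best.result.total >= APPROVAL[best.task.targetLevel] };
  }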


4. The Teacher Experience: Teacher Studio

4.1 From Six Modules to an Always-Available Mirror

McConaughey and Facteau's professional development consisted of six in-person modules delivered over six months: Inspire, Rethink, Reflect, Model, Design, and Implement & Refine. Each module built on the previous one, progressing from empathy for students through framework introduction to hands-on task redesign. Participants reported that the ALTmodel helped them "rethink the extent to which their tasks engaged students in deeper cognition and effective technology use" and "inspired them to change their short-term and long-term practice" (McConaughey & Facteau, 2017, p. 4).

Teacher Studio operationalizes the same reflective progression as a digital tool:

  • Rethink (shift conceptual understanding using the ALTmodel) → Evaluate Your Task: see where your task falls on the ALTmodel grid
  • Reflect (evaluate depths of authenticity in your own tasks) → Dimension Breakdown: 18 scored dimensions with teacher-friendly explanations
  • Model (learn from exemplar tasks) → Library: browse pipeline-generated templates rated on the ALTmodel
  • Design (redesign tasks for higher authenticity) → Improve/Uplift: AI-guided rewriting at the current or a higher AMT level
  • Implement & Refine (iterate based on implementation) → Iteration History: track task evolution with before/after comparison

The key difference is availability: McConaughey and Facteau's modules required coordinated in-person sessions with facilitators; Teacher Studio provides the reflective framework on-demand. This does not replace collaborative professional learning—the qualitative feedback in McConaughey and Facteau's study showed that peer discussion and collaborative norming were essential to shifting teacher thinking. Teacher Studio augments PLC work by providing the evaluation infrastructure continuously.

4.2 Evaluate Your Task

A teacher pastes or uploads their task content. The evidence-first engine scores it across all 18 dimensions and returns:

  • A total score with quality band (Emerging 0-7, Developing 8-14, Proficient 15-20, Exemplary 21-28; see the sketch after this list)
  • An ALTmodel Score (0-5) mapping to McConaughey and Facteau's original grid
  • A radar chart visualization of dimension scores
  • A dimension-by-dimension breakdown with clickable descriptions explaining what each dimension measures and why it matters
  • A narrative feedback section written as an instructional coach, leading with strengths
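
The quality bands map directly from the 28-point total:

  // Band cutoffs from the list above.
  function qualityBand(total) {
    if (total <= 7)  return 'Emerging';
    if (total <= 14) return 'Developing';
    if (total <= 20) return 'Proficient';
    return 'Exemplary';
  }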

The dimension descriptions were added after teacher feedback indicated that score breakdowns were meaningless without understanding what "B2: Professional Product" or "N1: Novelty Distance" actually measured. Each dimension now has a teacher-friendly explanation (e.g., "B2 measures whether the student's output has value or purpose beyond getting a grade. Would anyone outside the classroom voluntarily read, use, or act on this product?").

4.3 Improve and Uplift

Teachers can improve tasks at the current AMT level or uplift them to a higher cognitive level—the digital equivalent of McConaughey and Facteau's "Design" module where participants redesigned their own tasks:

  • Improve at current level — The system identifies the weakest dimensions and generates a revised version addressing specific deficiencies while preserving strengths.
  • Upgrade to Meaning Making — An acquisition task is rewritten to add analysis, comparison, and conceptual connection while retaining standard alignment.
  • Upgrade to Transfer — A task is rewritten to include authentic context, student agency, professional products, and non-linear problem solving.

When a teacher accepts an improvement, the system re-evaluates and updates both content and target level. Iteration history with "before" and "after" content viewing lets teachers trace how their task evolved.

4.4 Library and Professional Learning Community Support

Pipeline-generated tasks are available as a read-only template library. Teachers browse by standard, level, or strand, then import templates into their personal studio for customization. The library and personal studio are strictly separated: Library shows only pipeline templates, My Studio shows only user-owned tasks.

The evaluation rubric serves a dual purpose: automated quality control and a shared professional vocabulary for teacher collaboration. When teachers in a Professional Learning Community evaluate their tasks against the same 18 dimensions, they develop the "common agreement regarding ALTmodel concept and language" that McConaughey and Facteau identified as essential in their third module (Reflect). "This task scores B5=0—students don't actually make any choices" is more actionable than "this task could be more engaging."


5. The Rubric in Detail

5.1 Four-Tier Structure

The 18 dimensions are organized into four tiers reflecting a priority hierarchy:

Tier 1: Purpose (max 7 points) — Is this task worth doing?

  • A1 Cognitive Level (0-3): AMT classification per the ALTmodel's learning axis
  • U1 Enduring Understanding (0-2): Connection to transferable big ideas (Wiggins & McTighe, 2005)
  • E1 Standards Coverage (0-2): Alignment to cited standards, with overclaim detection

Tier 2: Authenticity & Rigor (max 9 points) — Does the task demand authentic performance?

  • B1-B6: Burton's six characteristics of authentic tasks (0-1 each, max 6)
  • D1 Student Ownership (0-2): Who drives the learning
  • E2 Disciplinary Practices (0-1): Authentic discipline-specific methods

Tier 3: Implementation (max 10 points) — Is it feasible and well-supported?

  • C1 Technology Depth (0-3): SAMR classification per the ALTmodel's technology axis
  • C2 Tech-Learning Alignment (0-2): Does technology serve learning
  • N1 Novelty Distance (0-2): How far from classroom instruction context
  • F1 Material Availability (0-1): Accessible in typical schools
  • F2 Time Realism (0-1): Completable in stated timeframe
  • F3 Cognitive Load (0-1): Manageable pace (Sweller, 1988)

Tier 4: Grade Fit (max 2 points) — Is it developmentally appropriate?

  • G1 Developmental Fit (0-2): Hard constraint; G1=0 = reject

Total: 28 points maximum.

5.2 Gating Logic and the ALTmodel Score

The rubric's gating logic directly implements the ALTmodel's quality grid:

  • ALTmodel Score 5 requires A1=3 AND C1≥2 (transfer + modification/redefinition). This is McConaughey and Facteau's "sweet spot" — high cognition paired with transformative technology.
  • ALTmodel Score 4 requires A1≥2 AND total≥18
  • Anti-pattern (Score 0): A1=1 AND C1≥2 (high tech + low cognition). McConaughey and Facteau explicitly identified this as the pattern where "flashy technology" masks shallow learning.

This gating mechanism ensures that high total scores require substantive quality on the ALTmodel's primary axes, not just accumulation of minor points. A task cannot reach Score 5 through perfect feasibility and collaboration alone—it must demonstrate transfer-level cognition with transformative technology.

5.3 Level-Differentiated Thresholds

Approval thresholds are calibrated to each AMT level's expected dimension profile:

  • Acquisition (≥7/28): These tasks correctly score 0 on most authenticity and transfer dimensions. An acquisition task scoring 7 with strong standards alignment, appropriate cognitive load, and good developmental fit is a well-designed learning activity. As McConaughey and Facteau noted, "transfer is not every-day... Acquisition and Meaning Making are necessary scaffolding" (2017, p. 88).
  • Meaning Making (≥17/28): Expected to show analysis and connection (A1=2), some Burton characteristics, and purposeful technology use.
  • Transfer (≥22/28): Expected to demonstrate all Burton characteristics, high student ownership, far novelty distance, and integral technology. Reaching 22/28 requires scoring well across all four tiers.

6. Standards Integration

6.1 Scale

TQS maintains a standards catalog of 34,048 entries:

  • 32,494 Texas Essential Knowledge and Skills (TEKS) standards
  • 1,554 Common Core State Standards (CCSS)

Standards are structured hierarchically: Framework → Subject → Grade → Strand → Standard. Each standard has a level classification: "Knowledge and Skills" headers are organizational (no tasks generated), while "Student Expectations" are the actionable standards that drive task generation.

6.2 Standards Verification

A novel contribution is the integration of standards verification into the evaluation rubric. E1 (Standards Coverage) is not merely a check that standard codes are listed—the evidence-first engine extracts textual evidence that the standard's requirements are actually addressed in the task content. If a task lists TEKS §110.5.b.6.A (comprehension strategies) but the content contains no comprehension activities, E1 is capped at 1 regardless of other evidence. This prevents "standard-stuffing"—the practice of listing alignment codes for credit without genuine coverage.
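
A sketch of the cap (names are ours): overclaiming any cited standard limits E1 to 1, regardless of evidence for the other standards.

  function capStandardsScore(e1, citedCodes, evidencedCodes) {
    const evidenced = new Set(evidencedCodes);
    const overclaimed = citedCodes.some(code => !evidenced.has(code));
    return overclaimed ? Math.min(e1, 1) : e1;
  }

  capStandardsScore(2, ['110.5.b.6.A'], []);  // 1: cited but not evidenced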


7. Discussion

7.1 What TQS Adds to the ALTmodel

TQS does not introduce new educational theory. Its contribution is the operationalization of the ALTmodel and related frameworks into a machine-evaluable system with anti-hallucination safeguards. Specifically:

  1. Making the ALTmodel's quality axes measurable at dimension level. McConaughey and Facteau's grid provides a holistic 0-5 score. TQS decomposes this into 18 specific dimensions, enabling precise diagnosis and targeted improvement. A teacher can see not just "Score 2" but exactly which Burton characteristics are missing and what the ALTmodel's technology axis looks like in their specific task.

  2. Preventing structural mimicry. Human evaluators using the ALTmodel in a PLC setting can detect when a task merely claims authenticity without delivering it. AI evaluators cannot—unless the architecture forces them to work from validated evidence rather than surface features. The evidence-first pipeline is TQS's solution to this problem.

  3. Scaling the professional learning cycle. McConaughey and Facteau demonstrated that 80% of tasks improved after their six-module professional development. TQS enables the same reflective cycle to happen continuously, on-demand, for any teacher with an internet connection. The diagnostic specificity (18 dimensions vs. a single grid position) provides more actionable feedback than the original framework.

  4. Adding developmental calibration. The ALTmodel is grade-agnostic by design. TQS extends it with grade-banded constraints and a developmental fit dimension (G1) that ensures a "Score 5" task for Grade 3 looks fundamentally different from a "Score 5" task for Grade 12—both pair transfer with modification- or redefinition-level technology, but calibrated to what students at each level can actually do.

  5. Closing the generation-evaluation loop. The ALTmodel was designed to analyze existing tasks. TQS uses it to generate new tasks with quality criteria embedded in generation prompts, evaluate them against the full rubric, and iteratively improve them. This closes the loop between "what good looks like" and "how to produce it at scale."

7.2 Limitations

Several limitations should be acknowledged:

  1. Evaluation validation. While the evidence-first architecture reduces score inflation, we have not yet conducted a formal inter-rater reliability study comparing TQS scores to human expert ratings. A target of Cohen's kappa ≥ 0.60 has been set but not yet measured.

  2. Single curriculum validation. Results are based on Grade 3 ELA (TEKS). Generalizability to other subjects (Science, Math, Social Studies), grade bands (high school), and standards frameworks (CCSS, NGSS) remains to be demonstrated.

  3. No classroom validation. TQS evaluates task design quality, not learning outcomes. A task scoring 28/28 on the rubric is hypothesized to produce better learning than one scoring 10/28, but this hypothesis requires classroom-based research to validate. McConaughey and Facteau recommended that "further studies could examine samples of student work and compare the levels of authenticity, cognitive complexity, and technology use in the original and redesigned tasks" (2017, p. 46)—this recommendation applies equally to TQS-generated tasks.

  4. LLM dependency. The system's quality depends on the capabilities of the underlying LLMs. As models improve, TQS outputs should improve; as models change, calibration may drift.

  5. Teacher adoption. The system has not yet been tested with practicing teachers. McConaughey and Facteau's research demonstrated that the ALTmodel framework shifts teacher thinking when supported by structured professional development. Whether a digital tool can produce similar shifts without human facilitation is an open question.

7.3 Future Directions

Near-term priorities include:

  • Expert calibration study to establish inter-rater reliability against human scorers
  • Expansion to Common Core Standards
  • Pilot deployment with Texas teacher cohorts, modeled on McConaughey and Facteau's NYC study design
  • Export to LMS formats (Google Classroom, Canvas) for direct classroom use
  • PLC collaboration features enabling groups of teachers to evaluate and discuss tasks collectively, supporting the collaborative norming process McConaughey and Facteau identified as essential
  • Fine-tuned evaluation models using accumulated human-validated scoring data

8. Conclusion

Task Quality Studio demonstrates that AI can generate genuinely high-quality curriculum tasks when evaluation is grounded in established educational frameworks—specifically the Authentic Learning with Technology Model developed by McConaughey and Facteau (2017)—and protected by anti-hallucination architecture. By operationalizing the ALTmodel's dual-axis quality grid into 18 machine-evaluable dimensions, implementing evidence-first scoring that prevents AI inflation, and providing Teacher Studio as a continuous professional learning tool, TQS extends the ALTmodel from a professional development framework into a technology-enabled curriculum quality system.

The key insight is architectural, not algorithmic: the solution to AI score inflation is not better prompting—it is an evaluation structure where the scoring model cannot see the original text and cannot hallucinate evidence. This principle generalizes beyond education to any domain where AI evaluation of AI-generated content must be trustworthy.

For educators, TQS offers what McConaughey and Facteau envisioned: a "clear and common learning model" that supports teachers to "assess, select and design tasks based on quality measures" (2017, p. 10). For the field of educational technology, it offers a reference architecture for how AI curriculum tools can be built to enforce rather than merely claim alignment with research-based pedagogical practice.

The technical engineering practices, process optimization methodology, and infrastructure lessons learned during TQS development are documented in the companion paper: Task Quality Studio: Technical Lessons Learned (TQS-TECHNICAL-LESSONS-LEARNED.md).


References

Anderson, L. W., & Krathwohl, D. R. (Eds.). (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives. Longman.

Burton, K. (2011). A framework for determining the authenticity of assessment tasks: Applied to an example in law. Journal of Learning Design, 4(2), 20-28.

Gulikers, J. T. M., Bastiaens, T. J., & Kirschner, P. A. (2004). A five-dimensional framework for authentic assessment. Educational Technology Research and Development, 52(3), 67-86. https://doi.org/10.1007/BF02504676

Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Kruber, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer, M., Schmidt, A., Seidel, T., … Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. https://doi.org/10.1016/j.lindif.2023.102274

Knight, J. (2007). Instructional coaching: A partnership approach to improving instruction. Corwin Press.

Kraft, M. A., Blazar, D., & Hogan, D. (2018). The effect of teacher coaching on instruction and achievement: A meta-analysis of the causal evidence. Review of Educational Research, 88(4), 547-588. https://doi.org/10.3102/0034654318759268

McConaughey, L., & Facteau, P. (2017). Creating a model and professional learning to support the design of authentic student learning tasks [Doctoral dissertation, Lynn University]. SPIRAL. https://spiral.lynn.edu/etds/4

Newmann, F. M., & Associates. (1996). Authentic achievement: Restructuring schools for intellectual quality. Jossey-Bass.

Paris, D. (2012). Culturally sustaining pedagogy: A needed change in stance, terminology, and practice. Educational Researcher, 41(3), 93-97. https://doi.org/10.3102/0013189X12441244

Paris, D., & Alim, H. S. (2017). Culturally sustaining pedagogies: Teaching and learning for justice in a changing world. Teachers College Press.

Perkins, D. N., & Salomon, G. (1992). Transfer of learning. In T. Husén & T. N. Postlethwaite (Eds.), International encyclopedia of education (2nd ed.). Pergamon Press.

Piaget, J. (1964). Part I: Cognitive development in children: Piaget development and learning. Journal of Research in Science Teaching, 2(3), 176-186. https://doi.org/10.1002/tea.3660020306

Puentedura, R. R. (2006). Transformation, technology, and education [Blog post]. Hippasus. http://hippasus.com/resources/tte/

Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257-285. https://doi.org/10.1207/s15516709cog1202_4

Sweller, J., van Merriënboer, J. J. G., & Paas, F. (2019). Cognitive architecture and instructional design: 20 years later. Educational Psychology Review, 31(2), 261-292. https://doi.org/10.1007/s10648-019-09465-5

Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes (M. Cole, V. John-Steiner, S. Scribner, & E. Souberman, Eds.). Harvard University Press.

Wiggins, G. P. (1998). Educative assessment: Designing assessments to inform and improve student performance. Jossey-Bass.

Wiggins, G. P., & McTighe, J. (2005). Understanding by design (Expanded 2nd ed.). Association for Supervision and Curriculum Development.


Appendix A: Full Rubric Dimension Summary

Tier               Dimension                Key  Max  What It Measures
1: Purpose         Cognitive Level          A1   3    AMT classification (Acquisition/Meaning Making/Transfer)
1: Purpose         Enduring Understanding   U1   2    Connection to transferable big ideas
1: Purpose         Standards Coverage       E1   2    Alignment to cited standards with overclaim detection
2: Authenticity    Real-World Fidelity      B1   1    Burton: task mirrors real professional/citizen challenges
2: Authenticity    Professional Product     B2   1    Burton: output has value beyond a grade
2: Authenticity    Higher-Order Thinking    B3   1    Burton: requires analysis, synthesis, or evaluation
2: Authenticity    Collaboration            B4   1    Burton: genuine interdependence in teamwork
2: Authenticity    Judgment and Choice      B5   1    Burton: students make real decisions
2: Authenticity    Non-Linear Structure     B6   1    Burton: multiple valid paths to the challenge
2: Authenticity    Student Ownership        D1   2    Who drives the learning
2: Authenticity    Disciplinary Practices   E2   1    Authentic discipline-specific methods
3: Implementation  Technology Depth         C1   3    ALTmodel technology axis: SAMR level
3: Implementation  Tech-Learning Alignment  C2   2    Technology serves learning goals
3: Implementation  Novelty Distance         N1   2    How far from classroom instruction context
3: Implementation  Material Availability    F1   1    Accessible in typical public schools
3: Implementation  Time Realism             F2   1    Completable in stated timeframe
3: Implementation  Cognitive Load           F3   1    Manageable pace for the grade
4: Grade Fit       Developmental Fit        G1   2    Age-appropriate across all task elements
TOTAL                                            28

Appendix B: Evidence-First Pipeline Flow

┌─────────────────────────────────────────────────────────────┐
│                    TASK CONTENT INPUT                         │
│              (teacher-submitted or pipeline-generated)        │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  STEP 1: SPECIFICATION GATE (Code)                           │
│  • Validate input structure                                  │
│  • Determine target AMT level                                │
│  • Parse into claims/instructions/assessment/context blocks  │
│  • Reject malformed submissions                              │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  STEP 2: EVIDENCE EXTRACTION (LLM Call #1)                   │
│  • Extract verbatim quotes per dimension                     │
│  • No paraphrasing, no inference                             │
│  • Output: quote-per-dimension map                           │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  STEP 3: EVIDENCE VALIDATION (Code — Fuzzy Match)            │
│  • Verify each quote exists in source text                   │
│  • Sliding window + ordered word overlap matching            │
│  • Discard hallucinated quotes                               │
│  ★ ANTI-HALLUCINATION CHECKPOINT                             │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  STEP 4: CONSTRAINED SCORING (LLM Call #2)                   │
│  • Scorer sees ONLY validated evidence, NOT original text    │
│  • Score 18 dimensions from evidence                         │
│  • TEKS overclaim detection: cap E1 if standards overclaimed │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  STEP 5: EXTERNAL CONSTRAINTS (Code)                         │
│  • Clamp scores to defined maximums                          │
│  • Apply coherence rules (A1=3 → B3≥1, B5≥1, B6≥1)         │
│  • Anti-pattern check (A1=1 ∧ C1≥2 → ALTmodel Score 0)     │
│  • Grade-band tech caps on C1/C2                             │
│  • G1=0 → reject and regenerate                              │
│  • ALTmodel Score mapping with gating                        │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  STEP 6: NARRATIVE GENERATION (LLM Call #3)                  │
│  • Instructional coach-style feedback                        │
│  • Strengths first, then growth areas                        │
│  • References validated evidence                             │
│  • Specific, actionable improvement suggestions              │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                    EVALUATION OUTPUT                          │
│  • 18 dimension scores (total/28)                            │
│  • ALTmodel Score (0-5) per McConaughey & Facteau grid       │
│  • Quality band (Emerging/Developing/Proficient/Exemplary)   │
│  • Narrative feedback                                        │
│  • Weakest dimensions + improvement suggestions              │
│  • Hard constraint violations                                │
└─────────────────────────────────────────────────────────────┘

Appendix C: ALTmodel Quality Grid (McConaughey & Facteau, 2017)

                  Acquisition      Meaning Making      Transfer
                  (low auth)       (medium auth)       (high auth)
                ┌───────────────┬─────────────────┬───────────────┐
 Modification/  │               │                 │               │
 Redefinition   │   0 (ANTI-    │       3         │    5 ★        │
 (high tech)    │    PATTERN)   │                 │  EXEMPLARY    │
                ├───────────────┼─────────────────┼───────────────┤
 Substitution/  │               │                 │               │
 Augmentation   │      1        │       2         │      4        │
 (low tech)     │               │                 │               │
                └───────────────┴─────────────────┴───────────────┘

Source: McConaughey, L., & Facteau, P. (2017). Creating a model and
professional learning to support the design of authentic student
learning tasks. Lynn University.

Gating in TQS:
  Score 5 requires A1=3 AND C1≥2
  Score 4 requires A1≥2 AND total≥18
  Score 0 triggers on A1=1 AND C1≥2 (regardless of total)