Large Language Models: Evaluation, Education, and Language Technologies

Author names below in italics are students in my group.

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language tasks, from code generation to complex reasoning. As LLMs become embedded in everyday tools and workflows, my research addresses four interconnected questions: How can we systematically evaluate LLM reasoning by probing where and why models fail, beyond aggregate accuracy metrics? How should LLMs be integrated into education in ways that deepen rather than undermine student learning? How much linguistic information can be strategically removed from text before meaning is lost, and what does this imply for semantic information theory? And how can LLMs power next-generation text input technologies for languages with complex writing systems? These questions are organized into two research threads — LLMs for Learning and LLMs as Language Engines — spanning evaluation methodology, pedagogical design, information theory, and deployed systems.

LLMs for Learning: Evaluation and Education

These two projects, conducted with co-author Xinming Yang, approach LLMs from the perspective of learning — both diagnosing how models themselves reason and fail, and designing pedagogical methods that use LLMs to deepen human learning. The error analysis framework provides a systematic understanding of LLM reasoning failures; the Socrates platform translates this understanding into practical tools for the classroom.

Probing LLM Reasoning through Synthetic Misconception Generation

Evaluating large language models typically relies on accuracy benchmarks — collections of questions with known correct answers. While informative, these benchmarks provide a coarse signal: they tell us whether a model answered correctly, but reveal little about the nature of its errors, the structure of its reasoning failures, or the types of misconceptions it harbors. A model that scores 80% on a benchmark may still exhibit systematic reasoning flaws that remain invisible to aggregate metrics, and two models with identical scores may fail in qualitatively different ways with distinct implications for downstream applications.

My research introduces a framework that uses LLMs themselves as instruments for fine-grained reasoning evaluation. Rather than testing models against pre-written questions, I develop a dual-agent architecture in which one LLM plays the role of a student — generating synthetic misconceptions and incorrect problem-solving attempts — while a second LLM serves as a grader, tasked with diagnosing the errors. This generative approach produces a rich landscape of controlled reasoning failures, which I classify using a five-category error taxonomy: conceptual misunderstanding (applying the wrong principle), procedural error (flawed execution of a correct strategy), factual hallucination (inventing non-existent facts), context misinterpretation (misunderstanding the problem statement), and arithmetic imprecision (calculation errors in multi-step reasoning).
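As a rough illustration of the dual-agent loop described above, the following Python sketch wires a student agent and a grader agent together around the five-category taxonomy. It is not the framework's implementation: the callable signatures, the `Diagnosis` record, and the choice to ask the student for one specific error category at a time are all assumptions made for illustration, and the two stub agents stand in for real LLM calls.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class ErrorCategory(Enum):
    """The five error categories named in the taxonomy."""
    CONCEPTUAL_MISUNDERSTANDING = "conceptual misunderstanding"
    PROCEDURAL_ERROR = "procedural error"
    FACTUAL_HALLUCINATION = "factual hallucination"
    CONTEXT_MISINTERPRETATION = "context misinterpretation"
    ARITHMETIC_IMPRECISION = "arithmetic imprecision"


@dataclass
class Diagnosis:
    """One student attempt and the grader's verdict (hypothetical record type)."""
    problem: str
    injected: ErrorCategory
    attempt: str
    diagnosed: ErrorCategory


# Hypothetical signatures: each agent is any callable that wraps an LLM call.
StudentFn = Callable[[str, ErrorCategory], str]
GraderFn = Callable[[str, str], ErrorCategory]


def run_dual_agent(problems: List[str], student: StudentFn, grader: GraderFn) -> List[Diagnosis]:
    """Ask the student for one flawed attempt per category, then have the grader diagnose it."""
    records = []
    for problem in problems:
        for category in ErrorCategory:
            attempt = student(problem, category)
            diagnosed = grader(problem, attempt)
            records.append(Diagnosis(problem, category, attempt, diagnosed))
    return records


def stub_student(problem: str, category: ErrorCategory) -> str:
    """Placeholder for an LLM call: tags the attempt with the requested error type."""
    return f"[{category.value}] flawed attempt at: {problem}"


def stub_grader(problem: str, attempt: str) -> ErrorCategory:
    """Placeholder for an LLM call: reads the tag written by the stub student."""
    for category in ErrorCategory:
        if attempt.startswith(f"[{category.value}]"):
            return category
    return ErrorCategory.PROCEDURAL_ERROR


records = run_dual_agent(["Compute 17 * 23."], stub_student, stub_grader)
agreement = sum(r.injected == r.diagnosed for r in records) / len(records)
print(f"{len(records)} attempts, grader agreement = {agreement:.2f}")
```

Replacing the stubs with real model calls leaves the loop unchanged, and comparing the `injected` and `diagnosed` fields gives one possible signal of how well a grader model recognizes each error type.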

By systematically generating and diagnosing errors across multiple models, the framework produces diagnostic profiles: structured characterizations of how and why each model fails. These profiles reveal that reasoning failure patterns are model-specific and often uncorrelated with overall accuracy: a model can be highly accurate yet brittle in specific reasoning dimensions, while another with lower accuracy may exhibit more predictable and interpretable failure modes. The framework also provides a foundation for downstream applications, including the design of targeted educational exercises where students must identify and correct AI-generated errors, connecting directly to the learning-by-teaching paradigm explored in my education research.
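One simple way to picture a diagnostic profile is as a per-model distribution of failures over the five error categories. The sketch below assumes that representation, which the text does not specify, and the model names and failure records are invented placeholders rather than experimental results.

```python
from collections import Counter, defaultdict

CATEGORIES = [
    "conceptual misunderstanding",
    "procedural error",
    "factual hallucination",
    "context misinterpretation",
    "arithmetic imprecision",
]


def diagnostic_profile(failures):
    """Turn (model, category) failure records into per-model category shares.

    Assumption: a profile is the fraction of a model's observed failures
    that fall into each category. Other definitions are possible.
    """
    counts = defaultdict(Counter)
    for model, category in failures:
        counts[model][category] += 1
    profiles = {}
    for model, counter in counts.items():
        total = sum(counter.values())
        profiles[model] = {c: counter[c] / total for c in CATEGORIES}
    return profiles


# Invented placeholder records, for illustration only.
failures = [
    ("model_a", "arithmetic imprecision"),
    ("model_a", "arithmetic imprecision"),
    ("model_a", "procedural error"),
    ("model_a", "conceptual misunderstanding"),
    ("model_b", "factual hallucination"),
    ("model_b", "factual hallucination"),
]

profiles = diagnostic_profile(failures)
for model, profile in profiles.items():
    dominant = max(profile, key=profile.get)
    print(model, "fails most often by", dominant)
```

Two models with the same overall accuracy can yield very different distributions under this representation, which is the kind of contrast that aggregate scores hide.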

Related publications:

Learning by Teaching: Engaging Students as Instructors of LLMs

The dominant paradigm in applying LLMs to computer science education casts the model as a virtual tutor that explains concepts, debugs code, and answers questions. While convenient, this approach carries a significant risk: students may become overly reliant on LLMs for reasoning, substituting AI assistance for the productive struggle that is essential to developing deep understanding and practical skills. Academic dishonesty is a further concern, as LLMs can already solve many standard course problems at a human level.

My research proposes a fundamentally different pedagogical paradigm grounded in the protégé effect — the well-documented phenomenon that teaching a subject deepens the instructor's own understanding and mastery. Rather than positioning the LLM as the teacher, I invert the model: students act as instructors who must teach an LLM to solve a given problem. To do this successfully, a student must understand the material well enough to decompose the problem, articulate the reasoning steps, and construct worked examples that bridge the LLM's knowledge gap. This active engagement replaces passive consumption of AI-generated answers.

A key challenge in this approach is question design. Most standard course questions can already be solved directly by LLMs, making them unsuitable as teaching targets. I develop a set of strategies for designing questions with engineered knowledge gaps — problems that an LLM cannot solve on its own, but can solve when given a correctly structured prompt written by a student who genuinely understands the material. Students are guided to use Chain-of-Thought (CoT) prompting, which requires them to decompose the problem into step-by-step reasoning, and few-shot prompting, which requires them to construct complete worked examples as templates for the LLM. This process forces students to identify the core patterns in a problem and articulate a generalizable solution before teaching it. To evaluate whether a student's instruction is reliably correct, the Socrates system applies self-consistency, querying the LLM multiple times and using a majority vote to determine whether the student's prompt consistently produces the correct answer.
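The self-consistency check can be sketched as a majority vote over repeated samples. This is a minimal illustration, not the Socrates implementation: `ask_llm` is a hypothetical callable that wraps one sampled model call, the default of five samples is arbitrary, and answers are compared by exact string match, which a real grader would likely relax.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Tuple


def self_consistent_answer(ask_llm: Callable[[str], str], prompt: str,
                           n_samples: int = 5) -> Tuple[str, float]:
    """Query the model n_samples times and return the majority answer and its vote share."""
    answers = [ask_llm(prompt).strip() for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples


def instruction_passes(ask_llm: Callable[[str], str], prompt: str,
                       expected: str, n_samples: int = 5) -> bool:
    """A student's prompt passes if the majority answer matches the expected one."""
    answer, _ = self_consistent_answer(ask_llm, prompt, n_samples)
    return answer == expected.strip()


# Stand-in for a sampled LLM: cycles through canned answers (hypothetical).
_canned = cycle(["42", "42", "41", "42", "40"])


def fake_llm(prompt: str) -> str:
    return next(_canned)


student_prompt = "Step 1: ... Step 2: ... Now solve the new problem."
majority, share = self_consistent_answer(fake_llm, student_prompt)
passed = instruction_passes(fake_llm, student_prompt, expected="42")
print(majority, share, passed)
```

Voting over several samples guards against a prompt that yields the right answer only by chance on a single run.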

To make this approach accessible to instructors without specialized technical knowledge, I implement Socrates, a system that provides a playground for students to interact with the custom questions and an LLM-based grader that evaluates their submitted instructions. Socrates requires minimal programming overhead and is designed for straightforward deployment in undergraduate courses. An evaluation in an undergraduate computer science course at CUNY Queens College demonstrates that this active-learning method leads to statistically significant improvements in student performance compared to historical cohorts, at a low operational cost.

Related publications:

LLMs as Language Engines: Compression and Text Input

These two projects, conducted with co-author Yuchun Zou, approach LLMs as general-purpose language processing engines — investigating how they can compress and reconstruct text, and how they can serve as the foundation for next-generation text input technologies. The compression work establishes empirical foundations for semantic information theory; the IME benchmarking work applies these insights to a practical deployment challenge with real-time latency constraints.

Text-Preserving Lossy Text Compression via Strategic Deletion and LLM Reconstruction

Text compression has traditionally been approached as an information-theoretic problem: find an encoding that minimizes the number of bits required to represent a message while preserving its exact content (lossless compression), or discard information that is perceptually or statistically inessential (lossy compression, as in image and audio codecs). Lossy compression has rarely been applied to natural language text, because even small deletions — a dropped word, a missing modifier — can render text ungrammatical, ambiguous, or semantically incorrect. Text demands precision in a way that images and audio do not.

My research investigates a new paradigm for lossy text compression: strategically deleting words from a document such that an LLM can faithfully reconstruct the original, while the compressed text remains human-readable and semantically intact. The key insight is that not all words carry equal semantic weight — function words, redundant modifiers, and contextually predictable content can often be removed without destroying meaning, provided that a sufficiently capable language model is available to fill in the gaps during decompression. This reframes compression as a paired process of strategic deletion (encoding) and LLM-based reconstruction (decoding), where the compression ratio is determined by how aggressively words are removed and the reconstruction quality depends on both the deletion strategy and the model's linguistic competence.
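To make the encoding side concrete, here is a minimal sketch of one possible deletion strategy: dropping words from a small function-word list. The word list, the word-level ratio, and the reconstruction prompt wording are assumptions for illustration and are not the strategies or prompts benchmarked in this work. The decoding step is shown only as a prompt builder, since it requires an actual LLM call.

```python
# Small illustrative function-word list (an assumption, not the benchmarked one).
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "is", "are", "that", "and", "in"}


def delete_function_words(text: str) -> str:
    """Encode: drop words that appear in the function-word list, keep the rest in order."""
    kept = [
        word for word in text.split()
        if word.lower().strip(".,;:!?") not in FUNCTION_WORDS
    ]
    return " ".join(kept)


def removed_fraction(original: str, compressed: str) -> float:
    """Word-level share of the original that was deleted (one way to express the ratio)."""
    n_original = len(original.split())
    if n_original == 0:
        return 0.0
    return 1 - len(compressed.split()) / n_original


def build_reconstruction_prompt(compressed: str) -> str:
    """Decode: hypothetical prompt asking an LLM to restore the deleted words."""
    return (
        "Some words were deleted from the text below. Restore them, "
        "keeping the remaining words in their original order:\n" + compressed
    )


original = "The cat sat in the garden and watched a bird."
compressed = delete_function_words(original)
ratio = removed_fraction(original, compressed)
print(compressed)
print(f"removed {ratio:.0%} of the words")
print(build_reconstruction_prompt(compressed))
```

More aggressive strategies would raise the removed fraction at the cost of a harder reconstruction problem, which is the tradeoff the paired encoding and decoding view makes explicit.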

I benchmark six deletion strategies spanning different granularities and linguistic criteria, from simple frequency-based removal to syntax-aware and semantics-guided approaches. Each strategy is evaluated across multiple LLMs along three dimensions: compression ratio (how much text is removed), reconstruction quality (how faithfully the LLM recovers the original meaning), and readability of the compressed text (whether a human can still understand the gist from the compressed version). Results reveal that LLM-based reconstruction substantially outperforms classical lossy compression baselines, and that the choice of deletion strategy produces markedly different operating points on the compression-quality frontier. This work establishes an empirical foundation for semantic information theory, the study of how much linguistic information can be discarded before meaning is lost, and has direct implications for efficient text storage, low-bandwidth communication, and LLM-powered input methods where compression and reconstruction are complementary operations.

Related publications:

Benchmarking LLMs for Chinese and Japanese Text Input

Efficient text entry for Chinese and Japanese presents a unique challenge not found in alphabetic writing systems. Both languages employ thousands of logographic characters, making direct keyboard input impractical. Instead, users rely on Input Method Editors (IMEs), specialized software that converts phonetic input — Pinyin for Mandarin Chinese, Romaji for Japanese — into the intended characters. This conversion is far from straightforward, primarily due to pervasive homophony: a single pronunciation can correspond to many different characters and words. For example, the Pinyin input “yì yì” can map to 意义 (meaning/significance) or 异议 (objection), and the correct choice can only be determined from the broader sentence context. Japanese compounds this challenge further by integrating three distinct scripts — Kanji, Hiragana, and Katakana — requiring the IME to select the correct logographic character from a list of homophonous candidates.

Traditional IMEs, whether rule-based or statistical, approach this as a local ranking problem: selecting the most probable character from a candidate list based on immediate context. While effective for simple typographical errors, they are fundamentally ill-equipped to handle the full range of errors users actually make, including homophone confusion (selecting the wrong character with the same pronunciation), phonetic misspellings, and orthographic/semantic errors where the phonetic input is valid but the selected character is contextually wrong. Resolving these errors requires long-range semantic and pragmatic understanding of the entire sentence — precisely the capability in which LLMs excel.

My research introduces the first comprehensive benchmark for evaluating LLMs as the foundational technology for next-generation Chinese and Japanese IMEs. The benchmark covers two core tasks: phonetic-to-character generation, which assesses the ability to convert Pinyin or Romaji input into the correct character sequence, and textual error correction, which assesses the ability to detect and correct the full spectrum of input errors in Chinese and Japanese text. A diverse set of LLMs, including multilingual, open-source, and proprietary models, is evaluated against established baseline methods. Models primarily designed for complex multi-step reasoning are deliberately excluded, as their high latency is incompatible with the real-time demands of interactive text input.

Evaluation employs a comprehensive suite of metrics spanning both linguistic quality and computational efficiency: semantic similarity (SimHash, BERTScore), lexical accuracy (BLEU, ROUGE), character error rate (CER), and efficiency measures including completion time, time to first token (TTFT), and tokens per second (TPS). Results demonstrate that top-tier LLMs significantly outperform traditional systems on ambiguity resolution and the correction of complex errors, by leveraging deep contextual understanding that traditional IMEs lack. However, the analysis reveals a critical tradeoff between accuracy and computational efficiency that varies across models, underscoring that model selection for real-world IME deployment must jointly optimize linguistic fidelity and response latency. The datasets, evaluation scripts, and results from this study serve as a public resource for future research on next-generation IME technologies.
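Two of the metrics above are simple enough to sketch directly. The code below computes character error rate as Levenshtein distance divided by reference length, and measures time to first token and tokens per second from any iterator of tokens. It is an illustrative sketch rather than the benchmark's evaluation script: the function names are hypothetical, and TPS is taken here over the whole completion time, although some definitions exclude the wait for the first token.

```python
import time
from typing import Dict, Iterable


def edit_distance(reference: str, hypothesis: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def time_stream(token_stream: Iterable[str]) -> Dict[str, float]:
    """Consume a token stream and record TTFT, completion time, and tokens per second."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    tps = count / total if total > 0 else 0.0
    return {"ttft": ttft, "completion_time": total, "tps": tps, "tokens": count}


# Homophone confusion from the example above: 异议 typed where 意义 was intended.
reference = "这个词的意义很重要"
hypothesis = "这个词的异议很重要"
cer = character_error_rate(reference, hypothesis)
print(f"CER = {cer:.3f}")

# Stand-in for a streaming model response (hypothetical).
stats = time_stream(iter(["这个", "词的", "意义"]))
print(stats["tokens"], "tokens streamed")
```

Because Chinese and Japanese text has no whitespace word boundaries, a character-level error rate avoids any dependence on a segmenter.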

Related publications: