Large Language Models in Education and Text Input

Author names below in italics are students in my group.

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language tasks, from code generation to complex reasoning. As LLMs become embedded in everyday tools and workflows, two questions become increasingly important: how should they be integrated into education without undermining student learning, and how can they serve as the foundation for next-generation human-computer interfaces that handle the full complexity of natural language input? My research addresses both questions, exploring LLMs both as a subject of critical pedagogical study and as a technology to be rigorously benchmarked for real-world deployment.

Learning by teaching: engaging students as instructors of LLMs

The dominant paradigm in applying LLMs to computer science education casts the model as a virtual tutor that explains concepts, debugs code, and answers questions. While convenient, this approach carries a significant risk: students may become overly reliant on LLMs for reasoning, substituting AI assistance for the productive struggle that is essential to developing deep understanding and practical skills. Academic dishonesty is a further concern, as LLMs can already solve many standard course problems at a human level.

My research proposes a fundamentally different pedagogical paradigm grounded in the protégé effect — the well-documented phenomenon that teaching a subject deepens the instructor's own understanding and mastery. Rather than positioning the LLM as the teacher, I invert the model: students act as instructors who must teach an LLM to solve a given problem. To do this successfully, a student must understand the material well enough to decompose the problem, articulate the reasoning steps, and construct worked examples that bridge the LLM's knowledge gap. This active engagement replaces passive consumption of AI-generated answers.

A key challenge in this approach is question design. Most standard course questions can already be solved directly by LLMs, making them unsuitable as teaching targets. I develop a set of strategies for designing questions with engineered knowledge gaps: problems that an LLM cannot solve on its own, but can solve when given a correctly structured prompt written by a student who genuinely understands the material. Students are guided to use Chain-of-Thought (CoT) prompting, which requires them to decompose the problem into step-by-step reasoning, and few-shot prompting, which requires them to construct complete worked examples as templates for the LLM. This process forces students to identify the core patterns in a problem and articulate a generalizable solution before teaching it. To evaluate whether a student's instruction is reliably correct, Socrates, the accompanying system described below, applies self-consistency: it queries the LLM multiple times and uses a majority vote to determine whether the student's prompt consistently produces the correct answer.
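The grading idea can be illustrated with a minimal Python sketch. It is not the Socrates implementation: the function names, the prompt layout, the sample count, and the strict-majority threshold are all assumptions made for illustration, and the LLM is replaced by a stub so the example runs offline.

```python
import itertools
from collections import Counter
from typing import Callable


def build_prompt(instructions: str, examples: list[tuple[str, str]], question: str) -> str:
    """Join the student's step-by-step instructions (CoT), their worked
    examples (few-shot), and the target question into one prompt.
    The layout is hypothetical, not the format used by Socrates."""
    parts = [instructions.strip()]
    for example_question, worked_solution in examples:
        parts.append(f"Question: {example_question}\nSolution: {worked_solution}")
    parts.append(f"Question: {question}\nSolution:")
    return "\n\n".join(parts)


def self_consistent_answer(
    ask_llm: Callable[[str], str], prompt: str, n_samples: int = 5
) -> tuple[str, float]:
    """Query the model n_samples times and return the most common answer
    together with the share of samples that agreed with it."""
    answers = [ask_llm(prompt).strip() for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples


def grade(
    ask_llm: Callable[[str], str],
    prompt: str,
    expected: str,
    n_samples: int = 5,
    threshold: float = 0.5,
) -> bool:
    """Accept the student's prompt only if the majority answer is correct
    and was produced by more than `threshold` of the samples (assumed rule)."""
    answer, share = self_consistent_answer(ask_llm, prompt, n_samples)
    return answer == expected and share > threshold


if __name__ == "__main__":
    # Stub standing in for a real model call: four of five replies agree.
    canned = itertools.cycle(["42", "42", "41", "42", "42"])

    def fake_llm(prompt: str) -> str:
        return next(canned)

    prompt = build_prompt(
        "Solve step by step, then give only the final number.",
        [("What is 6 * 6?", "6 * 6 = 36. Answer: 36")],
        "What is 6 * 7?",
    )
    print(grade(fake_llm, prompt, expected="42"))
```

In a real deployment, `ask_llm` would wrap a sampled (non-deterministic) model call, and the final answer would need to be extracted from the model's reasoning before voting.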

To make this approach accessible to instructors without specialized technical knowledge, I implement Socrates, a system that provides a playground for students to interact with the custom questions and an LLM-based grader that evaluates their submitted instructions. Socrates requires minimal programming overhead and is designed for straightforward deployment in undergraduate courses. An evaluation in an undergraduate computer science course at CUNY Queens College demonstrates that this active-learning method leads to statistically significant improvements in student performance compared to historical cohorts, at a low operational cost.

Related publications:

Learning by Teaching: Engaging Students as Instructors of Large Language Models in Computer Science Education
Xinming Yang, Haasil Pujara, Jun Li
2nd Conference on Language Modeling (COLM), October 2025.
[source code]

Benchmarking LLMs for Chinese and Japanese text input

Efficient text entry for Chinese and Japanese presents a unique challenge not found in alphabetic writing systems. Both languages employ thousands of logographic characters, making direct keyboard input impractical. Instead, users rely on Input Method Editors (IMEs), specialized software that converts phonetic input (Pinyin for Mandarin Chinese, Romaji for Japanese) into the intended characters. This conversion is far from straightforward, primarily due to pervasive homophony: a single pronunciation can correspond to many different characters and words. For example, the Pinyin input “yì yì” can map to 意义 (meaning/significance) or 异议 (objection), and the correct choice can only be determined from the broader sentence context. Japanese compounds this challenge by mixing three distinct scripts (Kanji, Hiragana, and Katakana), so the IME must decide which script each segment should be written in as well as select the correct Kanji from a list of homophonous candidates.
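The homophony problem can be made concrete with a toy Python sketch. The two-entry lexicon, the hand-picked context cues, and the function names are all hypothetical and exist only for illustration; a real IME uses a large lexicon and a statistical or neural ranker, and the sketch assumes toneless Pinyin keys such as `yiyi`.

```python
from typing import Optional

# Toy lexicon (assumption): one toneless Pinyin key with two homophonous words.
TOY_LEXICON: dict[str, list[str]] = {
    "yiyi": ["意义", "异议"],  # meaning/significance vs. objection
}

# Hand-picked context cues (assumption): words that commonly accompany each candidate.
TOY_CUES: dict[str, set[str]] = {
    "意义": {"人生"},  # "life", as in "the meaning of life"
    "异议": {"提出"},  # "raise", as in "raise an objection"
}


def candidates(pinyin: str) -> list[str]:
    """Return every word in the toy lexicon that matches the Pinyin key."""
    return TOY_LEXICON.get(pinyin, [])


def pick_by_context(pinyin: str, context: str) -> Optional[str]:
    """Choose the candidate whose cue words appear most often in the
    surrounding text. Ties fall back to lexicon order, which mimics a
    context-free ranker that always offers the same first candidate."""
    def cue_hits(word: str) -> int:
        return sum(cue in context for cue in TOY_CUES.get(word, ()))

    options = candidates(pinyin)
    return max(options, key=cue_hits) if options else None


if __name__ == "__main__":
    print(pick_by_context("yiyi", "他对这个决定提出"))  # context about raising something
    print(pick_by_context("yiyi", "人生的"))            # context about life
```

The same key yields different words depending on the text around it, which is why a ranker that sees only the key, or only a few neighbouring characters, cannot resolve the ambiguity reliably.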

Traditional IMEs, whether rule-based or statistical, approach this as a local ranking problem: selecting the most probable character from a candidate list based on immediate context. While effective for simple typographical errors, they are fundamentally ill-equipped to handle the full range of errors users actually make, including homophone confusion (selecting the wrong character with the same pronunciation), phonetic misspellings, and orthographic/semantic errors where the phonetic input is valid but the selected character is contextually wrong. Resolving these errors requires long-range semantic and pragmatic understanding of the entire sentence — precisely the capability in which LLMs excel.

My research introduces the first comprehensive benchmark for evaluating LLMs as the foundational technology for next-generation Chinese and Japanese IMEs. The benchmark covers two core tasks: phonetic-to-character generation, which assesses the ability to convert Pinyin or Romaji input into the correct character sequence, and textual error correction, which assesses the ability to detect and correct the full spectrum of input errors in Chinese and Japanese text. A diverse set of LLMs, including multilingual, open-source, and proprietary models, is evaluated against established baseline methods. Models primarily designed for complex multi-step reasoning are deliberately excluded, as their high latency is incompatible with the real-time demands of interactive text input.

Evaluation employs a comprehensive suite of metrics spanning both linguistic quality and computational efficiency: semantic similarity (SimHash, BERTScore), lexical accuracy (BLEU, ROUGE), character error rate (CER), and efficiency measures including completion time, time to first token (TTFT), and tokens per second (TPS). Results demonstrate that top-tier LLMs significantly outperform traditional systems on ambiguity resolution and the correction of complex errors, by leveraging deep contextual understanding that traditional IMEs lack. However, the analysis reveals a critical tradeoff between accuracy and computational efficiency that varies across models, underscoring that model selection for real-world IME deployment must jointly optimize linguistic fidelity and response latency. The datasets, evaluation scripts, and results from this study serve as a public resource for future research on next-generation IME technologies.
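Two of the metrics above are simple enough to sketch directly in Python. The code below is an illustration under stated assumptions, not the benchmark's evaluation script: CER is computed as character-level Levenshtein distance divided by reference length, and TPS is taken as total tokens divided by total completion time (some tools instead exclude the time before the first token). The function names are hypothetical.

```python
import time
from typing import Iterable, Optional


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance using a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def stream_timing(tokens: Iterable[str]) -> dict[str, Optional[float]]:
    """Consume a token stream and report completion time, time to first
    token (TTFT), and tokens per second (TPS) over the whole stream."""
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    total = time.perf_counter() - start
    return {
        "completion_time": total,
        "ttft": None if first_token_at is None else first_token_at - start,
        "tps": count / total if total > 0 else 0.0,
        "tokens": float(count),
    }


if __name__ == "__main__":
    # Two wrong characters out of four in the reference.
    print(cer("我有异议", "我有意义"))
    print(stream_timing(iter(["我", "有", "异议"])))
```

Because Chinese and Japanese text has no whitespace word boundaries, a character-level error rate avoids depending on a word segmenter, which is one reason CER is a natural fit for IME output.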

Related publications:

Benchmarking Large Language Models for Chinese and Japanese IMEs: Phonetic-to-Character Generation and Textual Error Correction
Yuchun Zou, Tedd Lee, Xiaodi Fan, Jun Li
International Conference on Language Resources and Evaluation (LREC), May 2026.
[source code]