This study introduces CELLM (Chinese Education Large Language Model), a specialized open-source 1.5B-parameter LLM designed for Chinese educational applications. The research addresses two critical gaps in current LLM development: (1) the lack of transparency in the training processes of existing open-source models, and (2) the scarcity of high-quality Chinese educational datasets relative to their English counterparts.
The core innovation lies in a fully transparent training pipeline with two key components. First, the authors curated Chinese-fineweb-edu-v2, a domain-specific pretraining corpus combining multiple Chinese educational resources (25.4% industry corpus, 18.6% safety corpus, among others). Second, they built a multi-turn dialogue translation framework that converted 258,000 English instructional entries into Chinese with 97.7% accuracy, substantially expanding the pool of Chinese educational instruction data.
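The summary above does not spell out the framework's internals, so the following is only a minimal sketch of what a structure-preserving translation pass over multi-turn dialogue data could look like; `translate_to_chinese` is a hypothetical placeholder for whatever translation model or LLM the actual framework uses, and the data format shown is illustrative.

```python
# Minimal sketch (not the paper's implementation): translating multi-turn
# English instruction dialogues into Chinese while preserving turn structure.
from typing import Dict, List


def translate_to_chinese(text: str) -> str:
    # Placeholder: swap in a real machine-translation model or LLM call here.
    return text


def translate_dialogue(dialogue: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Translate each turn of a multi-turn dialogue, keeping roles intact."""
    translated = []
    for turn in dialogue:
        translated.append({
            "role": turn["role"],  # "user" / "assistant" roles are preserved
            "content": translate_to_chinese(turn["content"]),
        })
    return translated


if __name__ == "__main__":
    example = [
        {"role": "user", "content": "Explain photosynthesis in one sentence."},
        {"role": "assistant", "content": "Photosynthesis converts light into chemical energy."},
    ]
    print(translate_dialogue(example))
```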
Technical implementation adopts a causal-decoder architecture with grouped-query attention (GQA) and rotary positional encoding (RoPE), optimized for educational contexts. The model shows particular strength in humanities (26.77% accuracy on C-Eval-humanities) and social sciences (26.35% on C-Eval-social-science), but remains weaker in STEM domains (21.48% on C-Eval-stem) and programming tasks (a score of 0.6 on the MBPP benchmark).
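For readers unfamiliar with the combination of GQA and RoPE in a causal decoder, the sketch below shows the general mechanism under illustrative hyperparameters; the head counts, dimensions, and RoPE variant here are assumptions for demonstration, not CELLM's actual configuration.

```python
# A minimal, self-contained sketch of grouped-query attention (GQA) with
# rotary positional encoding (RoPE) in a causal setting. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional encoding to a tensor of shape (batch, heads, seq, head_dim)."""
    _, _, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class GroupedQueryAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = apply_rope(q), apply_rope(k)
        # Share each KV head across a group of query heads (the "grouped" part of GQA).
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal decoder masking
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))


x = torch.randn(1, 8, 256)
attn = GroupedQueryAttention(dim=256, n_heads=8, n_kv_heads=2)
print(attn(x).shape)  # torch.Size([1, 8, 256])
```

GQA reduces key/value projection and cache size by letting several query heads share one KV head, which is why it is a common choice for small, deployment-oriented models like this one.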
Notably, the paper provides complete architectural transparency, detailing everything from vocabulary size (151,936 tokens) to training budgets (33.6B pretraining tokens, 16B fine-tuning tokens). This open approach, combined with the release of all models, data, and code, establishes CELLM as a foundational resource for Chinese educational LLM research while setting performance baselines across 11 evaluation datasets, including C-Eval, CMMLU, and MMLU.
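As an illustration of how this level of disclosure translates into a reproducible specification, the config-style sketch below records only the figures quoted in this summary; the field names and the dataclass itself are assumptions, and any detail not listed here (hidden size, layer count, and so on) is deliberately omitted rather than guessed.

```python
# Illustrative only: a config-style record of the figures reported in this summary.
# Field names are assumptions; values are the ones quoted above.
from dataclasses import dataclass


@dataclass
class CELLMConfigSketch:
    vocab_size: int = 151_936            # reported vocabulary size
    n_params: float = 1.5e9              # reported model scale (1.5B parameters)
    attention: str = "grouped-query"     # reported: GQA
    positional_encoding: str = "rotary"  # reported: RoPE
    pretraining_tokens: float = 33.6e9   # reported pretraining budget
    finetuning_tokens: float = 16e9      # reported fine-tuning budget


print(CELLMConfigSketch())
```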
The work represents a significant step toward democratizing educational LLM development in non-English contexts, though it acknowledges current limitations in model scale (1.5B parameters) compared to commercial counterparts. Future directions include expanding the pretraining data and exploring alignment techniques to improve STEM performance.