C-Eval是适用于大语言模型的多层次多学科中文评估套件,由上海交通大学、清华大学和爱丁堡大学研究人员在2023年5月份联合推出,包含13948个多项选择题,涵盖52个不同的学科和四个难度级别,用在评测大模型中文理解能力。通过零样本(zero-shot)和少样本(few-shot)测试,C-Eval 能评估模型在未见过的任务上的适应性和泛化能力。

C-Eval的主要功能
多学科覆盖:C-Eval 包含 52 个不同学科的题目,涵盖 STEM、社会科学、人文科学等多个领域,全面评估语言模型的知识储备。
多层次难度分级:设有四个难度级别,从基础到高级,细致评估模型在不同难度下的推理和泛化能力。
量化评估与标准化测试:包含 13948 个多项选择题,通过标准化评分系统提供量化性能指标,支持不同模型的横向对比。
如何使用C-Eval
<span class="token keyword">from</span> datasets <span class="token keyword">import</span> load_dataset
dataset <span class="token operator">=</span> load_dataset<span class="token punctuation">(</span><span class="token string">"ceval/ceval-exam"</span><span class="token punctuation">,</span> name<span class="token operator">=</span><span class="token string">"computer_network"</span><span class="token punctuation">)</span>
<span class="token function">wget</span> https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
<span class="token function">unzip</span> ceval-exam.zip
<span class="token keyword">from</span> transformers <span class="token keyword">import</span> AutoModelForCausalLM<span class="token punctuation">,</span> AutoTokenizer
model_name <span class="token operator">=</span> <span class="token string">"your-model-name"</span>
tokenizer <span class="token operator">=</span> AutoTokenizer<span class="token punctuation">.</span>from_pretrained<span class="token punctuation">(</span>model_name<span class="token punctuation">)</span>
model <span class="token operator">=</span> AutoModelForCausalLM<span class="token punctuation">.</span>from_pretrained<span class="token punctuation">(</span>model_name<span class="token punctuation">)</span>
以下是中国关于{科目}考试的单项选择题,请选出其中的正确答案。
以下是中国关于{科目}考试的单项选择题,请选出其中的正确答案。
inputs <span class="token operator">=</span> tokenizer<span class="token punctuation">(</span>prompt<span class="token punctuation">,</span> return_tensors<span class="token operator">=</span><span class="token string">"pt"</span><span class="token punctuation">)</span>
outputs <span class="token operator">=</span> model<span class="token punctuation">.</span>generate<span class="token punctuation">(</span><span class="token operator">**</span>inputs<span class="token punctuation">)</span>
response <span class="token operator">=</span> tokenizer<span class="token punctuation">.</span>decode<span class="token punctuation">(</span>outputs<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">,</span> skip_special_tokens<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>
answer <span class="token operator">=</span> extract_answer<span class="token punctuation">(</span>response<span class="token punctuation">)</span> <span class="token comment"># 自定义函数,提取答案选项</span>
<span class="token keyword">from</span> sklearn<span class="token punctuation">.</span>metrics <span class="token keyword">import</span> accuracy_score
<span class="token comment"># 假设 `predictions` 是模型的预测结果,`labels` 是真实答案</span>
accuracy <span class="token operator">=</span> accuracy_score<span class="token punctuation">(</span>labels<span class="token punctuation">,</span> predictions<span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f"Validation Accuracy: </span><span class="token interpolation"><span class="token punctuation">{</span>accuracy<span class="token punctuation">:</span><span class="token format-spec">.2f</span><span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">)</span>
<span class="token punctuation">{</span>
<span class="token property">"chinese_language_and_literature"</span><span class="token operator">:</span> <span class="token punctuation">{</span>
<span class="token property">"0"</span><span class="token operator">:</span> <span class="token string">"A"</span><span class="token punctuation">,</span>
<span class="token property">"1"</span><span class="token operator">:</span> <span class="token string">"B"</span><span class="token punctuation">,</span>
<span class="token punctuation">}</span><span class="token punctuation">,</span>
<span class="token punctuation">}</span>
C-Eval的应用场景
语言模型性能评估:全面衡量语言模型的知识水平和推理能力,帮助开发者优化模型性能。
学术研究与模型比较:为研究人员提供标准化的测试平台,分析和比较不同语言模型在各学科的表现,推动学术研究和技术进步。
教育领域应用开发:助力开发智能辅导系统和教育评估工具,用模型生成练习题、自动评分,提升教育领域的智能化水平。
行业应用优化:在金融、医疗、客服等行业,评估和优化语言模型的领域知识和应用能力,提升行业智能化解决方案的效果。
社区合作与技术评测:作为开放平台,促进开发者社区的交流与合作,为模型竞赛和技术评测提供公平的基准测试工具。

[超站]友情链接:
四季很好,只要有你,文娱排行榜:https://www.yaopaiming.com/
关注数据与安全,洞悉企业级服务市场:https://www.ijiandao.com/