<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Reliability on 墨然</title><link>https://moran.is-a.dev/tags/%E5%8F%AF%E9%9D%A0%E6%80%A7/</link><description>Recent content in Reliability on 墨然</description><generator>Hugo</generator><language>zh-cn</language><lastBuildDate>Mon, 15 Dec 2025 18:22:00 +0800</lastBuildDate><atom:link href="https://moran.is-a.dev/tags/%E5%8F%AF%E9%9D%A0%E6%80%A7/atom.xml" rel="self" type="application/rss+xml"/><item><title>Don't Judge LLMs by Leaderboards Alone: The 30-Question "Pop Quiz" I Wrote for Them</title><link>https://moran.is-a.dev/posts/llm-evaluation-playbook/</link><pubDate>Mon, 15 Dec 2025 18:22:00 +0800</pubDate><guid>https://moran.is-a.dev/posts/llm-evaluation-playbook/</guid><description>A leaderboard is like the average score on a health checkup report; what really matters is which questions the model gets wrong in your own use case.</description></item><item><title>I Started Giving AI a "Checkup": Not to Nitpick, but to Avoid Being Fooled by It</title><link>https://moran.is-a.dev/posts/ai-model-checkup/</link><pubDate>Mon, 24 Nov 2025 22:15:00 +0800</pubDate><guid>https://moran.is-a.dev/posts/ai-model-checkup/</guid><description>A model's "confidence" is not the same as correctness. Building a small evaluation question bank works better than arguing.</description></item></channel></rss>