How are you testing LLM behavior in production? Looking for real workflows.

What happened

Hey everyone, I've been building AI-first products and integrating LLMs into production systems for a while. At some point I needed more confidence in what I was shipping and started looking into automated evals — couldn't find anything that integrated cleanly with Playwright and Vitest, so I ended up writing some lightweight extensions for internal use. Now I'm not sure whether to open source them or just delete them — depends on whether this is actually a problem other people have. But first —

Business impact

Flagged via r/QualityAssurance.

Sources

How are you testing LLM behavior in production? Looking for real workflows
r/QualityAssurance