VoiceTest - Voice Agent Test Harness
Open-source test harness for voice agents with support for Retell, VAPI, Bland, and LiveKit. Run autonomous simulations and evaluate with LLM judges.
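The simulate-then-judge workflow described above can be sketched in a few lines. Everything here is illustrative — the class and function names are hypothetical stand-ins, not VoiceTest's actual API, and the "judge" is a keyword check standing in for a real LLM call.

```python
# Hypothetical sketch of the simulate-then-judge loop used by voice-agent
# test harnesses. Names are illustrative, not VoiceTest's real API.
from dataclasses import dataclass

@dataclass
class Turn:
    role: str   # "user" or "agent"
    text: str

def simulate_call(agent, user_script):
    """Drive the agent with scripted user turns, collecting a transcript."""
    transcript = []
    for user_text in user_script:
        transcript.append(Turn("user", user_text))
        transcript.append(Turn("agent", agent(user_text)))
    return transcript

def llm_judge(transcript, rubric):
    """Stand-in judge: a real harness would prompt an LLM with the
    transcript and rubric; here a keyword check keeps the sketch runnable."""
    agent_text = " ".join(t.text for t in transcript if t.role == "agent")
    return all(keyword.lower() in agent_text.lower() for keyword in rubric)

# Toy echo-style agent and a two-turn user script
agent = lambda text: f"Sure, I can help with {text}."
transcript = simulate_call(agent, ["booking a table", "for two people"])
passed = llm_judge(transcript, rubric=["booking", "two"])
```

In a real run, `agent` would wrap a live call to a provider such as Retell or VAPI, and `llm_judge` would send the transcript to an evaluator model.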
Practical testing workflows automated with AI agents.
An open-source RAG evaluation framework that does not require golden answers and can be used to evaluate the performance of RAG tools connected to an AI agent.
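Reference-free RAG evaluation typically scores how well an answer is grounded in the retrieved context rather than comparing it to a golden answer. Below is a minimal sketch of that idea; the token-overlap score is a crude stand-in for the LLM-based faithfulness judge a real framework would use, and the function name is hypothetical.

```python
# Hedged sketch of reference-free RAG evaluation: instead of matching a
# golden answer, score how much of the answer is grounded in the retrieved
# context. Token overlap here stands in for an LLM faithfulness judge.
def grounding_score(answer: str, context: str) -> float:
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    # Fraction of answer tokens that also appear in the context
    return len(answer_tokens & context_tokens) / len(answer_tokens)

context = "the eiffel tower is 330 metres tall and located in paris"
score = grounding_score("the eiffel tower is 330 metres tall", context)
```

A fully grounded answer scores 1.0; hallucinated content pulls the score down, all without needing a reference answer.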
Arize-Phoenix is an open-source library for agent testing, evaluation, and observability.
AgentBench v0.2 is a benchmark for evaluating Large Language Models as agents across a diverse set of environments, with enhanced framework usability.
Simulate user interactions to evaluate chatbot performance, ensuring robustness and reliability in real-world scenarios.
Introduces AgentEval, a framework for assessing the performance of LLM-based applications.
Bananalyzer is a framework for evaluating AI agents on web tasks, using Playwright to build diverse datasets of website snapshots for reliable evaluation.
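Snapshot-based web-task evaluation records page content once, then replays it so every run is deterministic. The sketch below shows that pattern with an in-memory snapshot store and a toy extraction "agent"; the names and the example URL are hypothetical, not Bananalyzer's actual API.

```python
# Hypothetical sketch of snapshot-based web-task evaluation: page HTML is
# captured once, then replayed so evaluation never touches the live site.
# Names and the example URL are illustrative, not Bananalyzer's API.
snapshots = {
    "https://example.com/pricing":
        "<html><body><span id='price'>$29</span></body></html>",
}

def extract_price(html: str) -> str:
    """Toy 'agent' that pulls the first dollar amount out of a snapshot."""
    start = html.index("$")
    end = start + 1
    while end < len(html) and html[end].isdigit():
        end += 1
    return html[start:end]

def evaluate(task_url: str, expected: str) -> bool:
    html = snapshots[task_url]  # replay the stored snapshot, no network I/O
    return extract_price(html) == expected

result = evaluate("https://example.com/pricing", "$29")
```

A real harness would capture snapshots with a browser automation tool such as Playwright and run the agent under test against them instead of a hand-written extractor.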