# Spelltest: AI Testing Framework for LLM Prompts
⚠️ ⚠️ ⚠️ **Important Warnings** ⚠️ ⚠️ ⚠️
- **OpenAI Costs**: Usage of this framework can lead to a significant number of requests to OpenAI, especially when running extensive simulations. This can result in substantial costs on your OpenAI account. I bear no responsibility for any expenses incurred. Ensure you're mindful of your OpenAI budget and understand the pricing model.
- **Early Release:** Spelltest is at an early stage. It is fully functional within its defined scope, but it is not yet available via pip.
- **Documentation:** Detailed documentation is in the works. For now, read the source code and make sure you understand the underlying mechanics before adapting it.
⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️
Spelltest is a Python testing framework for evaluating the quality of prompts for large language models (LLMs) such as OpenAI's GPT series. Its goal is to make simulating user interactions and generating synthetic user behavior intuitive, giving deeper insight into conversation dynamics, engagement, persuasiveness, and other custom metrics.
## Overview
Spelltest is more than just a testing tool – it's an investigative compass into the world of conversational AI. By using Spelltest, developers can:
- Thoroughly test and optimize prompts for LLMs.
- Simulate a variety of user interactions to gain a comprehensive understanding of model behavior.
- Understand and measure factors like engagement and persuasiveness in responses.
## Features
**Current Features:**
- **Simulate Against Prompt:** Spelltest lets you simulate against a specific prompt and subsequently evaluate the quality of a single model response.
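
As a rough mental model of what "simulate against a prompt, then evaluate a single response" could involve, here is a minimal offline sketch. It is not Spelltest's implementation: `UserCase`, `simulate_once`, `simulate`, and `fake_llm` are hypothetical names, and the three-step flow (role-play the user, run the target prompt, grade the response) is an assumption based on the parameters shown in the Usage example. A stub stands in for the model, so running it costs nothing.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class UserCase:
    """Hypothetical container mirroring a few parameters from the Usage example."""
    name: str
    user_description: str
    output_expectation: str
    metric_definition: str


def simulate_once(llm: LLM, target_prompt: str, case: UserCase) -> float:
    # 1. Ask the model to role-play the synthetic user and write a request.
    user_message = llm(f"Act as this user and write a request.\n{case.user_description}")
    # 2. Run the prompt under test against that request.
    response = llm(f"{target_prompt}\n\nUser: {user_message}")
    # 3. Ask the model to grade the single response against the metric.
    raw_score = llm(
        f"{case.metric_definition}\n"
        f"Expected: {case.output_expectation}\n"
        f"Response: {response}\n"
        "Reply with a number only."
    )
    # Clamp to the 0-100 range that the metric definition describes.
    return max(0.0, min(100.0, float(raw_score)))


def simulate(llm: LLM, target_prompt: str, case: UserCase, size: int) -> List[float]:
    # Assumption: `size` means the number of independent simulated runs.
    return [simulate_once(llm, target_prompt, case) for _ in range(size)]


def fake_llm(prompt: str) -> str:
    # Offline stand-in so the sketch runs without an API key or any cost.
    if "Reply with a number only" in prompt:
        return "80"
    return "stub text"


case = UserCase(
    name="Nomad Weekend Trip",
    user_description="You're a very busy nomad who struggles with planning.",
    output_expectation="A detailed, well-planned schedule.",
    metric_definition="Score the plan from 0 to 100.",
)
scores = simulate(fake_llm, "You're a travel planner.", case, size=3)
print(scores)
```

Swapping `fake_llm` for a function that calls a real model would turn each simulated run into three paid requests under this assumed flow, which is why the cost warning at the top matters.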
**Features TO DO:**
- **Pre-simulation and post-simulation tasks**
- **Conversational Simulation:** In the future, Spelltest aims to simulate against a prompt and evaluate the entire conversation context.
- **Direct Interaction with LLMChains Instead of Prompts:** Direct interfacing with an LLMChain (LangChain) is planned.
## Usage
Getting hands-on with Spelltest is easy. Here's a rudimentary example to guide you:
**!!THIS EXAMPLE IS EXPENSIVE TO RUN: ABOUT $0.60 OR MORE!!**
```python
from spelltest.simulate import simulate_for_prompt, run_simulations

# Prompt under test; all three user cases below are simulated against it.
TARGET_PROMPT = (
    "You're a travel planner. "
    "You receive traveller description within travel requirements "
    "and return detailed plan with each hour planned in detail"
)

# Metric definition shared by all three user cases.
TPAS_METRIC = (
    "Our accuracy is TPAS - The Travel Plan Accuracy Score. "
    "This metric measures the accuracy of the generated response "
    "by evaluating the inclusion of the expected output, well-scheduled travel plan "
    "and nothing else. The TPAS is a numerical value between 0 and 100, with 100 representing "
    "a perfect match to the expected output and 0 indicating non-accurate result."
)


@simulate_for_prompt(
    prompt=TARGET_PROMPT,
    user_case_name="Nomad Weekend Trip",
    user_description="You're a very busy nomad who struggles with planning. "
                     "You're moved to Seattle and looking at how to spend your first Saturday exploring the city",
    output_expectation="Well-planned objective, detailed, and comprehensive schedule that meets user's requirements",
    metric_definition=TPAS_METRIC,
    user_knowledge_about_app="The app receives text input about travel "
                             "requirements(i.e. place, preferences, short description of "
                             "what people we need a travel plan for) and returns a travel schedule",
    llm_name="gpt-3.5-turbo",
    size=5,
    temperature=0.8,
    tags=["Nomad", "Weekend"],
    chat_mode=False,
)
def test_prompt(simulation_result):
    # The decorated function receives the simulation result; here it is just printed.
    print(simulation_result)


@simulate_for_prompt(
    prompt=TARGET_PROMPT,
    user_case_name="Family Day Out Of Chicago",
    user_description="You're a parent with two young children, aged 4 and 6. "
                     "It's your family's first time visiting Seattle. You're keen on planning a Saturday "
                     "that's both fun for the kids and relaxing for the adults. You'd like to visit some "
                     "kid-friendly spots and also have some downtime.",
    output_expectation="A balanced travel plan that incorporates kid-friendly attractions, rest periods, "
                       "and ensures a memorable day for the entire family.",
    metric_definition=TPAS_METRIC,
    user_knowledge_about_app="The app takes in a text input about travel "
                             "requirements (i.e. place, preferences, short description of "
                             "the travelers needing a plan) and outputs a detailed travel schedule.",
    llm_name="gpt-3.5-turbo",
    size=5,
    temperature=0.8,
    tags=["Family", "Children", "Relaxing"],
    chat_mode=False,
)
def test_family_day_out(simulation_result):
    print(simulation_result)


@simulate_for_prompt(
    prompt=TARGET_PROMPT,
    user_case_name="Retired Couple's Relaxing Getaway",
    user_description="You're a retired couple in your 60s, looking for a peaceful and relaxed day out in Seattle. "
                     "You prefer less walking and are interested in local arts, culture, and good food.",
    output_expectation="A calm and leisurely day plan with minimal walking, focused on art galleries, "
                       "museums, and top local restaurants.",
    metric_definition=TPAS_METRIC,
    user_knowledge_about_app="The app receives text input about travel "
                             "requirements(i.e. place, preferences, short description of "
                             "what people we need a travel plan for) and returns a travel schedule",
    llm_name="gpt-3.5-turbo",
    size=5,
    temperature=0.8,
    tags=["Retired", "Relaxing", "Culture"],
    chat_mode=False,
)
def test_prompt_for_retired_couple(simulation_result):
    print(simulation_result)


if __name__ == "__main__":
    # Run the simulations declared above; this sends requests to OpenAI and incurs cost.
    run_simulations()
```
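
For budgeting, the cost note above can be turned into a rough per-run figure. The sketch below rests on two assumptions that the source does not confirm: that `size` is the number of simulated runs per user case, and that cost scales linearly with the number of runs. `estimate_cost` is a hypothetical helper, not part of Spelltest, and the result should be read as a lower bound since the README says "or more".

```python
def estimate_cost(sizes, cost_per_simulation):
    """Rough linear estimate: total simulated runs times an assumed unit cost."""
    return sum(sizes) * cost_per_simulation


# The example defines three user cases with size=5 each.
example_sizes = [5, 5, 5]

# Assumption: the quoted "about $0.60" covers all 15 runs, so roughly $0.04 per run.
assumed_unit_cost = 0.60 / sum(example_sizes)

# Projected lower bound if each case were scaled up to size=20.
projected = estimate_cost([20, 20, 20], assumed_unit_cost)
print(f"~${projected:.2f} or more")
```

Actual spend depends on the model, prompt length, and response length, so check your OpenAI usage dashboard after a small run before scaling up.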
## Contributing
Your insights can shape the future of Spelltest. Whether you're forking, pointing out issues, or submitting pull requests, every contribution is valued and appreciated.
## License
Spelltest is an open-source endeavor. However, every user is advised to pay heed to the provided warnings and guidelines. Your usage is at your own discretion and understanding of the potential risks.