spelltest: Python AI resource - CSDN Library

29 files in total

py: 15

jinja2: 10

txt: 1

Uploaded 2023-08-18 11:12:25 · 24KB ZIP · 167 views

spelltest Python AI: Evaluate the quality of prompts for large language models (LLMs) such as OpenAI's GPT series.zip (29 files)

spelltest-main

setup.py 189B

LICENSE 1KB

output_result.py 313B

examples

__init__.py 0B

primitive

__init__.py 0B

primitive.py 514B

test_chain_simulation.py 5KB

spelltest

__init__.py 0B

simulation_manager

utils.py 707B

__init__.py 0B

random_data_generator.py 482B

prompts

system.chat_assistant.txt.jinja2 54B

summary_rationale.txt.jinja2 416B

output_expectation.txt.jinja2 299B

output_expectation_system.chat_assistant.txt.jinja2 210B

evaluation_rationale_chat.txt.jinja2 1KB

accuracy.txt.jinja2 528B

system.chat_user_agent.txt.jinja2 490B

simulate.txt.jinja2 251B

evaluation_rationale.txt.jinja2 2KB

chat_system_message.txt.jinja2 70B

simulation_single_answer.py 7KB

manager.py 3KB

simulation_chat.py 13KB

simulate.py 2KB

entities.py 3KB

requirements.txt 336B

.gitignore 3KB

README.md 8KB

# Spelltest: AI Testing Framework for LLM Prompts

⚠️ ⚠️ ⚠️ **Important Warnings** ⚠️ ⚠️ ⚠️

- **OpenAI Costs:** Using this framework can generate a large number of requests to OpenAI, especially when running extensive simulations, and this can result in substantial costs on your OpenAI account. I bear no responsibility for any expenses incurred. Keep your OpenAI budget in mind and make sure you understand the pricing model.
- **Early Release:** This version of Spelltest is in its early stages. It is fully functional for its defined scope, but it is not yet distributed through pip.
- **Documentation:** Detailed documentation is in the works. For now, read the source code and make sure you understand the underlying mechanics before adapting it.

⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️

Spelltest is a Python testing framework for developers and researchers, tailored to evaluating the quality of prompts for large language models (LLMs) such as OpenAI's GPT series. Its goal is to make simulating user interactions and generating synthetic user behavior intuitive, paving the way for deeper insights into conversation dynamics, engagement, persuasiveness, and other custom metrics.

## Overview

Spelltest is more than just a testing tool: it is an investigative compass into the world of conversational AI. By using Spelltest, developers can:

- Thoroughly test and optimize prompts for LLMs.
- Simulate a variety of user interactions to gain a comprehensive understanding of model behavior.
- Understand and measure factors like engagement and persuasiveness in responses.

## Features

**Current Features:**

- **Simulate Against Prompt:** Spelltest lets you simulate against a specific prompt and then evaluate the quality of a single model response.

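To make the "simulate against a prompt, then evaluate a single response" idea concrete, here is a minimal offline sketch. It is not Spelltest's implementation: `simulate_single_answers`, `SimulationRecord`, `average_score`, and `stub_llm` are hypothetical names, the three-call structure per simulation is an assumption, and the model is replaced by a stub so nothing is sent to OpenAI.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List

# Hypothetical type: any function mapping (system_prompt, user_message) -> reply text.
LLM = Callable[[str, str], str]


@dataclass
class SimulationRecord:
    user_message: str
    app_response: str
    score: float  # 0-100, in the spirit of a metric such as TPAS


def simulate_single_answers(
    target_prompt: str,
    user_description: str,
    metric_definition: str,
    llm: LLM,
    size: int,
) -> List[SimulationRecord]:
    """Run `size` independent simulations (assumed flow, not the library's)."""
    records = []
    for _ in range(size):
        # 1. A synthetic user, role-played by the model, writes a request.
        user_message = llm(user_description, "Write your request to the app.")
        # 2. The prompt under test answers that request.
        app_response = llm(target_prompt, user_message)
        # 3. An evaluator call scores the single response against the metric.
        raw_score = llm(metric_definition, app_response)
        score = min(100.0, max(0.0, float(raw_score)))  # clamp to 0-100
        records.append(SimulationRecord(user_message, app_response, score))
    return records


def average_score(records: List[SimulationRecord]) -> float:
    """Aggregate the per-simulation scores into one number."""
    return mean(r.score for r in records)


def stub_llm(system_prompt: str, user_message: str) -> str:
    """Offline stand-in for a real model so the sketch runs at no cost."""
    if system_prompt.startswith("METRIC"):
        return "80"
    return f"reply to: {user_message}"


records = simulate_single_answers(
    target_prompt="You're a travel planner.",
    user_description="You're a busy nomad.",
    metric_definition="METRIC: score the plan from 0 to 100.",
    llm=stub_llm,
    size=3,
)
print(len(records), average_score(records))
```

In this sketch every simulation makes three model calls, so the request count grows with both the number of user cases and `size`. That multiplication is the kind of effect the cost warning above is about, although the real call count in Spelltest may differ.
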
**Features TO DO:**

- **Pre-simulation and post-simulation tasks**
- **Conversational Simulation:** In the future, Spelltest aims to simulate against a prompt and evaluate the entire conversation context.
- **Direct Interaction with LLMChains Instead of Prompts:** Direct interfacing with an LLMChain (Langchain) is on the horizon.

## Usage

Getting hands-on with Spelltest is easy. Here's a rudimentary example to guide you:

**!!THIS EXAMPLE IS VERY EXPENSIVE TO RUN: ABOUT $0.60 OR MORE!!**

```python
from spelltest.simulate import simulate_for_prompt, run_simulations

# System prompt under test; all three user cases below run against it.
TARGET_PROMPT = (
    "You're a travel planner. "
    "You receive traveller description within travel requirements "
    "and return detailed plan with each hour planned in detail"
)


@simulate_for_prompt(
    prompt=TARGET_PROMPT,
    user_case_name="Nomad Weekend Trip",
    user_description="You're a very busy nomad who struggles with planning. "
    "You've moved to Seattle and are looking at how to spend your first Saturday exploring the city",
    output_expectation="Well-planned objective, detailed, and comprehensive schedule that meets user's requirements",
    metric_definition="Our accuracy is TPAS - The Travel Plan Accuracy Score. "
    "This metric measures the accuracy of the generated response "
    "by evaluating the inclusion of the expected output, well-scheduled travel plan "
    "and nothing else. The TPAS is a numerical value between 0 and 100, with 100 representing "
    "a perfect match to the expected output and 0 indicating non-accurate result.",
    user_knowledge_about_app="The app receives text input about travel "
    "requirements (i.e. place, preferences, short description of "
    "what people we need a travel plan for) and returns a travel schedule",
    llm_name="gpt-3.5-turbo",
    size=5,
    temperature=0.8,
    tags=["Nomad", "Weekend"],
    chat_mode=False,
)
def test_prompt(simulation_result):
    print(simulation_result)


@simulate_for_prompt(
    prompt=TARGET_PROMPT,
    user_case_name="Family Day Out Of Chicago",
    user_description="You're a parent with two young children, aged 4 and 6. "
    "It's your family's first time visiting Seattle. You're keen on planning a Saturday "
    "that's both fun for the kids and relaxing for the adults. You'd like to visit some "
    "kid-friendly spots and also have some downtime.",
    output_expectation="A balanced travel plan that incorporates kid-friendly attractions, rest periods, "
    "and ensures a memorable day for the entire family.",
    metric_definition="Our accuracy is TPAS - The Travel Plan Accuracy Score. "
    "This metric measures the accuracy of the generated response "
    "by evaluating the inclusion of the expected output, well-scheduled travel plan "
    "and nothing else. The TPAS is a numerical value between 0 and 100, with 100 representing "
    "a perfect match to the expected output and 0 indicating non-accurate result.",
    user_knowledge_about_app="The app takes in a text input about travel "
    "requirements (i.e. place, preferences, short description of "
    "the travelers needing a plan) and outputs a detailed travel schedule.",
    llm_name="gpt-3.5-turbo",
    size=5,
    temperature=0.8,
    tags=["Family", "Children", "Relaxing"],
    chat_mode=False,
)
def test_family_day_out(simulation_result):
    print(simulation_result)


@simulate_for_prompt(
    prompt=TARGET_PROMPT,
    user_case_name="Retired Couple's Relaxing Getaway",
    user_description="You're a retired couple in your 60s, looking for a peaceful and relaxed day out in Seattle. "
    "You prefer less walking and are interested in local arts, culture, and good food.",
    output_expectation="A calm and leisurely day plan with minimal walking, focused on art galleries, museums, and top local restaurants.",
    metric_definition="Our accuracy is TPAS - The Travel Plan Accuracy Score. "
    "This metric measures the accuracy of the generated response "
    "by evaluating the inclusion of the expected output, well-scheduled travel plan "
    "and nothing else. The TPAS is a numerical value between 0 and 100, with 100 representing "
    "a perfect match to the expected output and 0 indicating non-accurate result.",
    user_knowledge_about_app="The app receives text input about travel "
    "requirements (i.e. place, preferences, short description of "
    "what people we need a travel plan for) and returns a travel schedule",
    llm_name="gpt-3.5-turbo",
    size=5,
    temperature=0.8,
    tags=["Retired", "Relaxing", "Culture"],
    chat_mode=False,
)
def test_prompt_for_retired_couple(simulation_result):
    print(simulation_result)


if __name__ == "__main__":
    # Runs every case registered above; this sends real requests to OpenAI.
    run_simulations()
```

## Contributing

Your insights can shape the future of Spelltest. Whether you're forking, pointing out issues, or submitting pull requests, every contribution is valued and appreciated.

## License

Spelltest is an open-source endeavor. However, every user is advised to heed the warnings and guidelines above. You use it at your own discretion, with an understanding of the potential risks.
