suno-ai: a text-to-speech Web UI built with Python Flask and suno.ai.bark (.zip)

18 files in total: 6 wav, 3 js, 2 py

Tags: flask, artificial intelligence, python

Uploaded 2024-04-08 09:35:52, 6.46MB ZIP

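Building a Text-to-Speech Web Application with Python Flask and suno.ai.bark

Text-to-speech (TTS) is one of the most visible applications of artificial intelligence. This article describes how to build a text-to-speech web interface (UI) using Python, the Flask framework, and Suno AI's Bark model.

Flask is a lightweight WSGI (Web Server Gateway Interface) web application framework for Python. Its small, flexible core lets developers build a working web application with very little code, which makes it well suited to prototypes and small projects.

Bark (written as suno.ai.bark in the archive name) is an open-source text-to-audio model released by Suno AI. It runs locally as a Python package rather than as a hosted service, offers a range of voice options, and produces natural-sounding speech. Calling Bark's Python functions is enough to turn text into audio.

Building the web UI involves the following key steps:

1. **Environment setup**: Make sure Python and pip are installed, then use pip to install Flask and Bark. Bark is installed from its GitHub repository:

```
pip install flask git+https://github.com/suno-ai/bark.git
```

2. **Create the Flask application**: Create a Python file such as `app.py` and initialise the Flask application. Define a route that accepts the text and returns the generated speech, for example:

```python
import io

from flask import Flask, request, send_file
from scipy.io.wavfile import write as write_wav
from bark import SAMPLE_RATE, generate_audio, preload_models

app = Flask(__name__)
preload_models()  # download and cache the Bark models once at startup

@app.route('/tts', methods=['POST'])
def text_to_speech():
    text = request.json['text']
    audio_array = generate_audio(text)  # NumPy array sampled at SAMPLE_RATE
    buffer = io.BytesIO()
    write_wav(buffer, SAMPLE_RATE, audio_array)  # encode the samples as WAV
    buffer.seek(0)
    return send_file(buffer, mimetype='audio/wav')
```

3. **Front-end design**: Build the page with HTML, CSS, and JavaScript. The user types text into an input box and clicks a button, which triggers an Ajax request that sends the text to the back end. JavaScript then plays the audio returned by the back end in the browser.

4. **Deployment**: Deploy the Flask application to a server, such as a local development environment or a cloud platform like Heroku or AWS, and make sure the service is reachable over the network.

5. **Security and optimisation**: Add error handling and logging, consider using a template engine to generate the HTML dynamically, and use caching to improve performance by avoiding repeated synthesis of the same text.

6. **Testing and iteration**: Test thoroughly, including unit tests, integration tests, and user-experience tests, and keep refining the application based on feedback.

The project shows how Python and AI can be combined in practice: Flask provides an easy-to-use web interface, and Bark acts as the back end that generates natural-sounding speech. For developers learning Python web development and AI applications, it is a useful practice project that exercises both front-end and back-end skills and gives hands-on exposure to AI speech synthesis.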
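Step 5 above suggests caching to avoid repeated synthesis requests. The following is a minimal sketch of that idea, assuming an in-memory dictionary keyed by a hash of the whitespace-normalised text. `SynthesisCache` and `fake_synthesize` are hypothetical names introduced for illustration; the stand-in synthesiser replaces the real Bark call so the example runs without the models.

```python
import hashlib
from typing import Callable, Dict


class SynthesisCache:
    """In-memory cache that synthesises each distinct text only once."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}
        self.misses = 0  # number of times the synthesiser was actually called

    @staticmethod
    def key_for(text: str) -> str:
        # Collapse runs of whitespace so trivially different inputs share a key.
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str) -> bytes:
        key = self.key_for(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._synthesize(text)
        return self._store[key]


def fake_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for the real text-to-speech call."""
    return text.encode("utf-8")


cache = SynthesisCache(fake_synthesize)
first = cache.get("hello   world")
second = cache.get("hello world")  # same key after whitespace normalisation
print(cache.misses)
```

Because Bark's output differs between runs even with the same settings, a cache like this pins the first generated take for a given text. Whether that is desirable depends on the application.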


suno-ai_基于Python-Flask+suno.ai.bark实现的文本转语音Web-UI.zip (18 files)

suno-ai_基于Python-Flask+suno.ai.bark实现的文本转语音Web-UI/
  barkwebui_screenshot.png (199KB)
  templates/
    index.html (10KB)
  barkwebui_connector.py (10KB)
  requirements.txt (67B)
  barkwebui_server.py (4KB)
  static/
    barkwebui.js (10KB)
    theme.js (1KB)
    populate.js (2KB)
    output/
      2ae166c31fd04d648676e2978fc4bdc2.wav (1.18MB)
      95ad6c96eec942b2a17802950a31abc0.wav (2MB)
      db373183a20f49afb3d345db33125504.wav (1.7MB)
      d27e5222690742e5b0c1ed769be98f28.wav (1.77MB)
      30257a6610ec42f6b4a7cdb6f6154764.wav (1.74MB)
      57b6a85964cf48aeb2c86bdf031130e9.wav (1.25MB)
    img/
      favicon.ico (15KB)
    css/
      barkwebui.css (12KB)
    json/
      barkwebui.json (4KB)
  README.md (7KB)

# Bark Web UI

This application is a Python Flask-based web UI for generating text-to-speech audio with [Suno AI's Bark](https://github.com/suno-ai/bark). It offers a variety of customisation options, including the ability to modify voice pitch, speed, and other parameters.

## Screenshot

![Bark Web UI Screenshot](barkwebui_screenshot.png)

## Sample audio

Some have pitch and speed adjustments applied.

![Sample Audio 01](https://github.com/bradsec/barkwebui/assets/7948876/477d6410-e9df-4809-ac74-f22647292a36)
![Sample Audio 02](https://github.com/bradsec/barkwebui/assets/7948876/cf09b7b6-133e-435f-8b99-dfae8d5278da)
![Sample Audio 03](https://github.com/bradsec/barkwebui/assets/7948876/287472ce-896f-4412-b096-e78fc738f6dd)
![Sample Audio 04](https://github.com/bradsec/barkwebui/assets/7948876/04fbd340-7605-41b8-8c7b-abfbb923259a)

## Installation

1. Install Bark by following the instructions from the [Bark repository](https://github.com/suno-ai/bark).

   1a. If you have not run Bark before, you will need to download the models. Running a test will download and cache the required models (note that models vary in size, including one over 5GB).

   ```terminal
   python -m bark --text "Let's get this party started!" --output_filename "party.wav"
   ```

2. Once Bark is running, clone this repo into a directory called `webui` within the `bark` installation location.

   ```terminal
   cd bark
   git clone https://github.com/bradsec/barkwebui webui
   ```

3. Install any additional Python packages listed in the [requirements.txt](requirements.txt) file to meet the required imports in `barkwebui_server.py` and `barkwebui_connector.py`. Some shared imports will already have been installed by the Bark setup process. If applicable, activate the Python venv or conda/miniconda environment you are using for Bark before installing.

4. Run `python barkwebui_server.py` from within the `webui` folder to start the Flask web server application. Output similar to the following should be displayed:

   ```terminal
    * Serving Flask app 'barkwebui_server'
    * Debug mode: on
   WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
    * Running on all addresses (0.0.0.0)
    * Running on http://127.0.0.1:5000
   ```

5. Access the web application in a browser at the address shown in the terminal window.

## Structure

- `barkwebui_server.py` provides the Flask web server functionality. It receives information from the web interface, passes it to `barkwebui_connector.py`, and returns the results. It also handles writing and deleting entries in the JSON dataset.
- `barkwebui_connector.py` breaks up the text input before passing it to the Bark application. It also applies any selected audio effects, such as changes to speed and pitch, noise reduction, and silence removal. It then writes the `.wav` file with a unique filename to the `static/output` directory.
- `templates/index.html` is the only HTML file used by the app. It references other files, such as CSS and JavaScript, from the `static` directory.
- `static/js` contains three JavaScript files for the `index.html` template page:
  - `barkwebui.js` provides most of the page functionality and the link to `barkwebui_server.py` using [Socket.IO](https://socket.io/).
  - `populate.js` populates the select dropdown options in `index.html`.
  - `theme.js` handles dark and light theme switching.
- `static/output` contains the completed wav audio files.
- `static/json` contains `barkwebui.json`, which holds information about any generated audio files.

<details>
<summary>Text Temperature</summary>

This parameter affects how the model generates speech from text. A higher text temperature makes the model's output more random, while a lower text temperature makes it more deterministic. In other words, with a high text temperature the model is more likely to generate unusual or unexpected speech from a given text prompt. With a low text temperature the model is more likely to stick closely to the most probable output.

</details>

<details>
<summary>Waveform Temperature</summary>

This parameter affects how the model generates the final audio waveform. A higher waveform temperature introduces more randomness into the audio output, which might result in more unusual sounds or voice modulations. A lower waveform temperature makes the audio output more predictable and consistent.

</details>

<details>
<summary>Reduce Noise / Noise Reduction (NR)</summary>

Reduces background noise. This is not as good as an AI-enhanced cleaner, and its impact is often difficult to judge given the randomness of each Bark-generated speech even with the same settings. It also cannot remove echoing or AI hallucination.

Code Ref (`barkwebui_connector.py`): If the value of `reduce_noise` is True, noise reduction is applied to the generated audio using the noisereduce library. `reduce_noise` takes the audio data and the sample rate as parameters and returns the audio with reduced noise. If `reduce_noise` is False, no noise reduction is applied and the original audio is used.

</details>

<details>
<summary>Remove Silence (RS)</summary>

Removes any extended pauses or silence. This may not do much; it was included for situations where the generated voice contains long pauses for unknown reasons.

Code Ref (`barkwebui_connector.py`): If the value of `remove_silence` is True, aggressive silence removal is enabled by setting the VAD (Voice Activity Detection) to level 3. The webrtcvad library is used for voice activity detection. If `remove_silence` is False, the VAD level is set to 0, which means no silence removal is applied. The sample rate also had to be reduced from 24000 to 16000 to work with the webrtcvad library.

</details>

<details>
<summary>Adjusting audio speed and pitch</summary>

Changes to speed and pitch may cause a fair amount of echo and reverb in the output audio. Running the audio through a third-party AI audio tool may help remove echo or reverb. A library called librosa is used for manipulating the audio speed and pitch.

The speed of the audio is adjusted using the `librosa.effects.time_stretch` function, which stretches or compresses the audio by a given factor. If the speed parameter passed into the `generate_voice` function is not 1.0 (that is, the speed needs to change), the audio is time-stretched by the given rate. For instance, if the speed is 2, the audio's duration is halved, making it play twice as fast.

The pitch of the audio is adjusted using the `librosa.effects.pitch_shift` function, which shifts the pitch by a given number of half-steps. If the pitch parameter passed into the `generate_voice` function is not 0 (that is, the pitch needs to change), the audio is shifted by the given number of half-steps. For instance, if the pitch is 2, the pitch of the audio is raised by 2 half-steps.

</details>

### Clearer Speech and Audio Results

**You will get cleaner speech and better results by generating without NR or RS checked and then running the audio through an AI-enhanced tool like [Adobe Podcast Enhance](https://podcast.adobe.com/enhance) or other similar tools.**
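The speed and pitch behaviour described under "Adjusting audio speed and pitch" can be sketched as follows. This is an illustration based only on the README's description, not the actual code in `barkwebui_connector.py`. The function names `stretched_duration`, `pitch_ratio`, and `apply_speed_and_pitch` are hypothetical. The two small helpers show the arithmetic (speed divides the duration, and each half-step multiplies the frequency by the twelfth root of 2), while the last function shows how the two librosa calls named in the README might be chained.

```python
def stretched_duration(duration_s: float, speed: float) -> float:
    """Duration after time-stretching by `speed` (a speed of 2 halves it)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return duration_s / speed


def pitch_ratio(half_steps: float) -> float:
    """Frequency ratio for a shift of `half_steps` (12 half-steps per octave)."""
    return 2.0 ** (half_steps / 12.0)


def apply_speed_and_pitch(audio, sample_rate, speed=1.0, pitch=0):
    """Hypothetical helper following the README's description of the effects.

    `audio` is assumed to be a 1-D NumPy float array, as librosa expects.
    """
    import librosa  # imported lazily so the helpers above run without it

    if speed != 1.0:
        # rate > 1 shortens the audio, rate < 1 lengthens it
        audio = librosa.effects.time_stretch(audio, rate=speed)
    if pitch != 0:
        # n_steps is measured in half-steps (semitones)
        audio = librosa.effects.pitch_shift(audio, sr=sample_rate, n_steps=pitch)
    return audio


print(stretched_duration(10.0, 2.0))  # 5.0
print(round(pitch_ratio(2), 4))       # 1.1225
```

A 10-second clip at speed 2 therefore lasts 5 seconds, and a shift of 2 half-steps raises every frequency by roughly 12%.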
