# WAV2CLIP
Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello
<div><a href="https://arxiv.org/abs/2110.11499"><img src="https://github.githubassets.com/images/icons/emoji/unicode/1f4c4.png" alt=":page_facing_up:" style="width: 32px;"></a><a href="https://github.com/descriptinc/lyrebird-wav2clip"><img src="https://github.githubassets.com/images/icons/emoji/octocat.png" alt=":octocat:" style="width: 32px;"></a></div>
## Abstract
We propose Wav2CLIP, a robust audio representation learning method that distills from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks, including classification, retrieval, and generation, and show that it can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification and cross-modal retrieval. Furthermore, Wav2CLIP needs just ~10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and it is more efficient to pre-train than competing methods because it does not require learning a visual model in concert with an auditory model. Finally, we demonstrate image generation from Wav2CLIP as a qualitative assessment of the shared embedding space. Our code and model weights are open-sourced and available for further applications.
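To make the shared embedding space more concrete, here is a minimal sketch of how zero-shot classification and cross-modal retrieval could work once audio and text live in the same space: both reduce to ranking candidates by cosine similarity against a query embedding. The 3-dimensional vectors and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical and chosen only for illustration; they are not part of the Wav2CLIP code, and real embeddings would come from the audio encoder and from CLIP's text encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical embeddings (assumption): stand-ins for an audio clip embedding
# and for text embeddings of three candidate class names.
audio_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "frog": [1.0, 0.0, 0.1],
    "church_bells": [0.0, 1.0, 0.0],
    "fireworks": [0.1, 0.2, 1.0],
}

# Zero-shot classification: pick the class whose text embedding is closest.
ranking = rank_by_similarity(audio_embedding, text_embeddings)
predicted_label = ranking[0][0]
print(predicted_label)
```

Cross-modal retrieval would follow the same pattern with the roles swapped, for example using a text or image embedding as the query and a collection of audio embeddings as the candidates.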
## VQGAN-CLIP Generated Samples
### ESC-50
<div><span style="width: 200px; text-align: center; display: inline-block;">Frog</span><span style="width: 200px; text-align: center; display: inline-block;">Frog</span><span style="width: 200px; text-align: center; display: inline-block;">Frog</span><span style="width: 200px; text-align: center; display: inline-block;">Frog</span></div>
<div><img src="artifacts/esc50/1-18755-A-4-frog.png" alt="1-18755-A-4" width="200"/><img src="artifacts/esc50/1-15689-B-4-frog.png" alt="1-15689-B-4" width="200"/><img src="artifacts/esc50/2-32515-B-4-frog.png" alt="2-32515-B-4" width="200"/><img src="artifacts/esc50/1-31836-B-4-frog.png" alt="1-31836-B-4" width="200"/></div>
<div><audio controls style="width: 200px;" src="artifacts/esc50/1-18755-A-4.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/1-15689-B-4.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/2-32515-B-4.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/1-31836-B-4.wav"></audio></div>
<div><span style="width: 200px; text-align: center; display: inline-block;">Church_bells</span><span style="width: 200px; text-align: center; display: inline-block;">Church_bells</span><span style="width: 200px; text-align: center; display: inline-block;">Church_bells</span><span style="width: 200px; text-align: center; display: inline-block;">Church_bells</span></div>
<div><img src="artifacts/esc50/5-219044-A-46-church_bells.png" alt="5-219044-A-46" width="200"/><img src="artifacts/esc50/1-13572-A-46-church_bells.png" alt="1-13572-A-46" width="200"/><img src="artifacts/esc50/3-139109-A-46-church_bells.png" alt="3-139109-A-46" width="200"/><img src="artifacts/esc50/2-77346-A-46-church_bells.png" alt="2-77346-A-46" width="200"/></div>
<div><audio controls style="width: 200px;" src="artifacts/esc50/5-219044-A-46.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/1-13572-A-46.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/3-139109-A-46.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/2-77346-A-46.wav"></audio></div>
<div><span style="width: 200px; text-align: center; display: inline-block;">Fireworks</span><span style="width: 200px; text-align: center; display: inline-block;">Fireworks</span><span style="width: 200px; text-align: center; display: inline-block;">Fireworks</span><span style="width: 200px; text-align: center; display: inline-block;">Fireworks</span></div>
<div><img src="artifacts/esc50/1-115545-B-48-fireworks.png" alt="1-115545-B-48" width="200"/><img src="artifacts/esc50/5-160614-C-48-fireworks.png" alt="5-160614-C-48" width="200"/><img src="artifacts/esc50/1-115545-A-48-fireworks.png" alt="1-115545-A-48" width="200"/><img src="artifacts/esc50/1-115546-A-48-fireworks.png" alt="1-115546-A-48" width="200"/></div>
<div><audio controls style="width: 200px;" src="artifacts/esc50/1-115545-B-48.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/5-160614-C-48.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/1-115545-A-48.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/1-115546-A-48.wav"></audio></div>
<div><span style="width: 200px; text-align: center; display: inline-block;">Chirping_birds</span><span style="width: 200px; text-align: center; display: inline-block;">Crow</span><span style="width: 200px; text-align: center; display: inline-block;">Wind</span><span style="width: 200px; text-align: center; display: inline-block;">Clock_alarm</span></div>
<div><img src="artifacts/esc50/1-34495-A-14-chirping_birds.png" alt="1-34495-A-14" width="200"/><img src="artifacts/esc50/1-39835-B-9-crow.png" alt="1-39835-B-9" width="200"/><img src="artifacts/esc50/2-109374-A-16-wind.png" alt="2-109374-A-16" width="200"/><img src="artifacts/esc50/1-96890-A-37-clock_alarm.png" alt="1-96890-A-37" width="200"/></div>
<div><audio controls style="width: 200px;" src="artifacts/esc50/1-34495-A-14.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/1-39835-B-9.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/2-109374-A-16.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/1-96890-A-37.wav"></audio></div>
<div><span style="width: 200px; text-align: center; display: inline-block;">Crickets</span><span style="width: 200px; text-align: center; display: inline-block;">Sheep</span><span style="width: 200px; text-align: center; display: inline-block;">Insects</span><span style="width: 200px; text-align: center; display: inline-block;">Airplane</span></div>
<div><img src="artifacts/esc50/2-96033-A-13-crickets.png" alt="2-96033-A-13" width="200"/><img src="artifacts/esc50/1-49409-A-8-sheep.png" alt="1-49409-A-8" width="200"/><img src="artifacts/esc50/1-73585-A-7-insects.png" alt="1-73585-A-7" width="200"/><img src="artifacts/esc50/2-74361-A-47-airplane.png" alt="2-74361-A-47" width="200"/></div>
<div><audio controls style="width: 200px;" src="artifacts/esc50/2-96033-A-13.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/1-49409-A-8.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/1-73585-A-7.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/2-74361-A-47.wav"></audio></div>
<div><span style="width: 200px; text-align: center; display: inline-block;">Siren</span><span style="width: 200px; text-align: center; display: inline-block;">Car_horn</span><span style="width: 200px; text-align: center; display: inline-block;">Dog</span><span style="width: 200px; text-align: center; display: inline-block;">Crying_baby</span></div>
<div><img src="artifacts/esc50/2-43806-A-42-siren.png" alt="2-43806-A-42" width="200"/><img src="artifacts/esc50/2-54086-A-43-car_horn.png" alt="2-54086-A-43" width="200"/><img src="artifacts/esc50/2-116400-A-0-dog.png" alt="2-116400-A-0" width="200"/><img src="artifacts/esc50/1-22694-A-20-crying_baby.png" alt="1-22694-A-20" width="200"/></div>
<div><audio controls style="width: 200px;" src="artifacts/esc50/2-43806-A-42.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/2-54086-A-43.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/2-116400-A-0.wav"></audio><audio controls style="width: 200px;" src="artifacts/esc50/1-22694-A-20.wav"></audio></div>
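
The sample files above appear to follow the ESC-50 naming convention `{fold}-{clip_id}-{take}-{target}`, with the class label appended to the generated image names. As a small sketch under that assumption, the hypothetical helper `parse_esc50_name` below (not part of the repository) splits a file name into those fields.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class Esc50Name(NamedTuple):
    fold: int
    clip_id: str
    take: str
    target: int
    label: Optional[str]  # only present on the generated-image file names


def parse_esc50_name(path: str) -> Esc50Name:
    """Split an ESC-50 style file name into its fields (assumed convention)."""
    stem = Path(path).stem
    # Split at most four times so a label such as "church_bells" stays whole.
    parts = stem.split("-", 4)
    if len(parts) < 4:
        raise ValueError(f"unexpected file name: {path}")
    fold, clip_id, take, target = parts[:4]
    label = parts[4] if len(parts) == 5 else None
    return Esc50Name(int(fold), clip_id, take, int(target), label)


print(parse_esc50_name("artifacts/esc50/1-18755-A-4-frog.png"))
print(parse_esc50_name("artifacts/esc50/1-18755-A-4.wav"))
```

Matching the first four fields is one way to pair each audio clip with its generated image.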
## Replicate
We also provide more examples through [Replicate](https://replicate.com/hohsia
Source code for wav2clip, an open-source deep learning multimodal project
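GitHub repository: https://github.com/descriptinc/lyrebird-wav2clip. CLIP maps images and text into a single embedding space and performs well on image-text multimodal tasks. Wav2CLIP trains on a video dataset to associate images with audio, mapping audio into the same CLIP embedding space, so that images, audio, and text all share one embedding space. This offers guidance for related multimodal tasks.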