Paper List

Gaoustcer 3月 31, 2024

Paper List

重要基础模型、训练及微调

Title	Summary
Learning Transferable Visual Models From Natural Language Supervision	CLIP:大规模多模态预训练技术
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis	从多角度相机中获得高效的3D表征
LoRA: Low-Rank Adaptation of Large Language Models	利用低秩分解减少大模型微调的参数量
LLaMA: Open and Efficient Foundation Language Models	开源大语言模型预训练
Denoising Diffusion Probabilistic Models	扩散模型图像生成
Learning to Prompt for Vision-Language Models	微调prompt而不是整个模型以期待在下游任务上取得更高的性能和更强的泛化能力
Masked Autoencoders Are Scalable Vision Learners	图像领域的Bert，极为优雅的研究，强烈建议学习本文中讲故事的逻辑

以上论文分别关于大语言模型/多模态预训练模型的训练/微调/下游任务迁移中的各项技术，其中Nerf是一种3D感知相关的工作，之后的具身智能模型中使用3D信息将会是主流

多模态大模型

训练

Title	Summary
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation	一种新的多模态融合方式，自举式学习以提升数据质量，统一多模态模型的表征任务和生成任务
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation	多模态融合做表征的新思路
Segment Anything	利用多模态prompt实现任意物体的图像分割

微调和迁移

Title	Summary
Conditional Prompt Learning for Vision-Language Models	Prompt中引入Conditional Prompt提升prompt在下游任务中的泛化性能
Visual Prompt Tuning	在Transformer输入加入可学习的参数提升多模态模型在下游任务上的迁移能力
The Power of Scale for Parameter-Efficient Prompt Tuning	研究了在大语言模型上微调prompt的若干技术细节

图像-文本生成

Title	Summary
Zero-Shot Text-to-Image Generation	DALLE：借助VAVAE做图像生成
High-Resolution Image Synthesis with Latent Diffusion Models	Stable Diffusion在和pixel space等价的latent space上做扩散以减少训练和推理开销
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery	借助CLIP中蕴含的语义信息做风格迁移
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models	基于文本对图像进行编辑，用到了Classifier-free diffusion
Hierarchical Text-Conditional Image Generation with CLIP Latents	DALLE-2
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding	Imagegan

Diffusion及其改进

Title	Summary
Diffusion Models Beat GANs on Image Synthesis	Guided Diffusion实现conditional图片生成
Classifier-Free Diffusion Guidance	对上文的改进，避免为了生成条件图片的同时训练Classifier
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation	对Stable Diffusion借助少量文本-图像对做微调
Adding Conditional Control to Text-to-Image Diffusion Models	借助文本中包含的细粒度语义信息对图像做编辑

具身智能Baseline

建议看一下这个Talk 推动开放世界中的机器人泛化

Title	Summary
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation	借助3d体素做具身智能Agent language-conditional的控制
CLIPort: What and Where Pathways for Robotic Manipulation	二维图像+CLIP文本编码器做控制
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances	借助大语言模型做具身智能Agent控制
PaLM-E: An Embodied Multimodal Language Model	预训练具身智能多模态大模型

本站访客数人次