Google presents PaliGemma
A versatile 3B VLM for transfer
paper page: https://lnkd.in/eQevVDJh
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that transfers effectively. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks, including standard VLM benchmarks as well as more specialized tasks such as remote-sensing and segmentation.
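To make the transfer/inference story concrete, here is a minimal sketch of running a PaliGemma checkpoint with transformers; the checkpoint id and the "caption en" prompt convention are assumptions taken from the public model cards, not details from the abstract above.

```python
# Minimal sketch (assumptions noted): captioning an image with a PaliGemma checkpoint.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # assumed checkpoint id
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

# Any RGB image works; this one is a public COCO validation image.
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(text="caption en", images=image, return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=30)
# Strip the prompt tokens before decoding so only the generated caption remains.
caption = processor.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(caption)
```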
Tencent presents MiraData
A Large-Scale Video Dataset with Long Durations and Structured Captions
paper page: https://lnkd.in/eSSQ5XzQ
Sora's high-motion intensity and long consistent videos have significantly impacted the field of video generation, attracting unprecedented attention. However, existing publicly available datasets are inadequate for generating Sora-like videos, as they mainly contain short videos with low motion intensity and brief captions. To address these issues, we propose MiraData, a high-quality video dataset that surpasses previous ones in video duration, caption detail, motion strength, and visual quality. We curate MiraData from diverse, manually selected sources and meticulously process the data to obtain semantically consistent clips. GPT-4V is employed to annotate structured captions, providing detailed descriptions from four different perspectives along with a summarized dense caption. To better assess temporal consistency and motion intensity in video generation, we introduce MiraBench, which enhances existing benchmarks by adding 3D consistency and tracking-based motion strength metrics. MiraBench includes 150 evaluation prompts and 17 metrics covering temporal consistency, motion strength, 3D consistency, visual quality, text-video alignment, and distribution similarity. To demonstrate the utility and effectiveness of MiraData, we conduct experiments using our DiT-based video generation model, MiraDiT. The experimental results on MiraBench demonstrate the superiority of MiraData, especially in motion strength.
The bleeding-edge alignment technique DPO for vision language models is now available in Hugging Face TRL along with LoRA/QLoRA ⚡️
Links and more in comments 🔖
DPO is a popular cutting-edge alignment technique for language models.
TL;DR: the model is fine-tuned directly on a preference dataset of prompts with chosen and rejected outputs, pushing it to favor the chosen responses over the rejected ones without training a separate reward model.
DPO for vision language models works essentially the same way: since VLMs project images into the text embedding space, training still boils down to input tokens in, output tokens out.
Quentin Gallouédec implemented support for Idefics2, Llava 1.5, and PaliGemma in TRL. 👏
As of now, VLM processors are not standardized; the only differences come from the processors and chat templates themselves, so support can be implemented quite easily (see his PR in the links).
Thanks to TRL's support for PEFT and bitsandbytes, you can also try LoRA and QLoRA fine-tuning (covered in the blog post), as in the sketch below 😏
Please try the scripts, share your models and let us know how it goes!
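As a rough illustration of the workflow described above, here is a minimal sketch of DPO fine-tuning a VLM with TRL and a LoRA adapter. The model id, dataset name, and column names are placeholders/assumptions, and argument names (e.g. processing_class vs. tokenizer) vary across TRL versions, so refer to the linked PR and blog post for the exact supported script.

```python
# Minimal sketch (not the official script): DPO + LoRA for a VLM with TRL.
# Model id and dataset are assumptions; supported VLMs include Idefics2,
# Llava 1.5, and PaliGemma per the post above.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForVision2Seq, AutoProcessor
from trl import DPOConfig, DPOTrainer

model_id = "llava-hf/llava-1.5-7b-hf"  # any supported VLM checkpoint
model = AutoModelForVision2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Hypothetical preference dataset: rows should carry an image plus a prompt,
# a chosen answer, and a rejected answer in the format DPOTrainer expects.
dataset = load_dataset("your-username/vlm-preference-pairs", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

args = DPOConfig(
    output_dir="vlm-dpo-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=processor,  # older TRL versions take tokenizer=processor instead
    peft_config=peft_config,
)
trainer.train()
```

For QLoRA, the main change is loading the base model in 4-bit (e.g. passing a bitsandbytes quantization config to from_pretrained) before attaching the LoRA adapter.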
🌟Introducing Transcription Delight!🌟 Effortlessly Generate Transcripts from any YouTube video (or any uploaded video/audio)!
This App is super cool 😎 & incredibly handy! 🛠
It also refines your transcript with an LLM, transforming it into a polished markdown output for your downstream needs😍
🤔For example, You can use the markdown transcript as input in Claude/GPT-4, and get ready to throw questions at it or summarize with ease!🚀📝
🔥Transcription Delight is an app created by Abubakar Abid🙌 -- Dive into this Gradio app and up your game!
Try on Hugging Face Spaces: https://lnkd.in/gHFyHFbr
Or build and run the app locally with three commands:
👉 git clone https://huggingface.co/spaces/abidlabs/transcription-delight
👉 cd transcription-delight
👉 python app.py
UltraEdit
Instruction-based Fine-Grained Image Editing at Scale
paper page: https://lnkd.in/e762P2Me
This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks in existing image editing datasets like InstructPix2Pix and MagicBrush, and provide a systematic approach to producing massive and high-quality image editing samples. UltraEdit offers several distinct advantages: 1) It features a broader range of editing instructions by leveraging the creativity of large language models (LLMs) alongside in-context editing examples from human raters; 2) Its data sources are based on real images, including photographs and artworks, which provide greater diversity and reduced bias compared to datasets solely generated by text-to-image models; 3) It also supports region-based editing, enhanced by high-quality, automatically produced region annotations. Our experiments show that canonical diffusion-based editing baselines trained on UltraEdit set new records on MagicBrush and Emu-Edit benchmarks. Our analysis further confirms the crucial role of real image anchors and region-based editing data.
[#partnership] As we announced recently, we are collaborating with the Banque des Territoires (Groupe Caisse des Dépôts) and Hugging Face to deploy a tailor-made, sovereign solution.
🎯 The goal? To make it easier for Banque des Territoires agents to support local authorities as part of the EduRénov project.
🚀 From sovereignty concerns to data optimization and security, this article walks you through the steps that led to the deployment of the first version of the system.
👉 The article is available here: https://lnkd.in/e5Mjy-VD, and can also be read and downloaded directly below.
Do you have a digital transformation project? Generative AI can help you reach your goals!
Contact us ➡ [email protected]
🚀Create animated portraits (like the attached video clip) using the official LivePortrait Gradio app on Hugging Face Spaces!
MIT License. Keep reading for more info, examples and demo link👇
Produce lifelike videos from a single source image and input motion 🤯
The model is trained on 69M high-quality frames and thus produces high-quality outputs
Fast generation on ZeroGPU (A100s) on Spaces for free!
Demo link: https://lnkd.in/gXSd5VkB
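If you prefer to drive the demo programmatically rather than through the web UI, a minimal gradio_client sketch might look like the following; the Space id is an assumption (the demo link above is shortened), so inspect view_api() for the real endpoint names and arguments before calling predict().

```python
# Rough sketch: connecting to the LivePortrait Space via gradio_client.
# The Space id below is an assumption; replace it with the id behind the demo link.
from gradio_client import Client

client = Client("KwaiVGI/LivePortrait")  # assumed Space id

# Lists the Space's actual endpoints and argument names; use these with client.predict().
print(client.view_api())
```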
Build, Train, and Deploy AI Models with Google TPUs on Hugging Face! We're excited to announce the General Availability of Google TPUs on Hugging Face. Hugging Face users can now use the power of Google Cloud TPUs in both Spaces and Inference Endpoints to build, train, and deploy their Generative AI models. 🚀
TL;DR:
🚀 Google Cloud TPUs are available on Spaces and Inference Endpoints.
💡 3 new options from 16GB to 128GB TPU memory (1x1, 2x2, 2x4 v5e TPU) in us-west1
🛠 Use TPUs in Spaces for ML demos, or in dev mode for easy training.
📈 Deploy LLMs on Inference Endpoints, starting with Meta Llama 3 and Google DeepMind Gemma, with Mistral and others to follow (example request below).
🔄 New Text Generation Inference backend now supports Google TPUs.
🌟 Starting at just $1.38/hour.
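Once a TPU-backed endpoint is up, querying it works like any other TGI-backed Inference Endpoint; here is a small sketch where the endpoint URL and token are placeholders, not values from this announcement.

```python
# Sketch: querying a TPU-backed Inference Endpoint (TGI backend).
# The endpoint URL and token are placeholders; create the endpoint first, then copy its URL.
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="https://YOUR-ENDPOINT.endpoints.huggingface.cloud",  # placeholder endpoint URL
    token="hf_xxx",  # your Hugging Face access token
)

# TGI exposes the standard text-generation task, so the call is unchanged on TPUs.
print(client.text_generation("Summarize what a TPU v5e is.", max_new_tokens=64))
```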
Blog: https://lnkd.in/e_au-mqt
Spaces: https://lnkd.in/eCun-cb9
Inference Endpoints: https://lnkd.in/eqks3UKd
Big Kudos to Alvaro Moran, Morgan Funtowicz, Simon Pagezy, Thibault Goehringer, Michelle Habonneau, Christophe Rannou, and the whole HF team for bringing Google TPUs to every Hugging Face user!