AniSora: Exploring the Frontiers of Animation Video Generation in the Sora Era. Paper • 2412.10255 • Published Dec 13, 2024
RoboBrain, a 32B open embodied AI model enabling multi-robot collaboration, released by BAAI (Beijing).
Model: BAAI/robobrain-681e1389c64d06b3e4a45e44
Dataset: BAAI/ShareRobot
- Task decomposition into 20+ precise actions
- Operable-region detection (e.g. teapot handles, drawers)
- Motion trajectory prediction to avoid collisions
Seed-Coder, code models released by ByteDance: ByteDance-Seed/seed-coder-680de32c15ead6555c75b0e4
- 8B models: base / instruct / reasoning
- MIT licensed
- Model-centric data filtering (less manual effort)
May 2025 - Open works from the Chinese community. Collection • 6 items • Updated about 6 hours ago
HunyuanCustom, a multimodal video generation framework supporting image, audio, video & text conditions, released by Tencent Hunyuan: tencent/HunyuanCustom
Paper: HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation (2505.04512)
- Strong identity consistency
- Outperforms state-of-the-art baselines
On Path to Multimodal Generalist: General-Level and General-Bench. Paper • 2505.04620 • Published 5 days ago
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation. Paper • 2505.04512 • Published 5 days ago
A ton of impactful models and datasets in open AI this past week; let's summarize the best: merve/releases-apr-21-and-may-2-6819dcc84da4190620f448a3
- Qwen made it rain! They released Qwen3, new dense and MoE models ranging from 0.6B to 235B, as well as Qwen2.5-Omni, an any-to-any model in 3B and 7B sizes!
- Microsoft AI released Phi4 reasoning models (which also come in mini and plus sizes)
- NVIDIA released new CoT reasoning datasets
- ByteDance released UI-TARS-1.5, a native multimodal UI-parsing agentic model
- Meta released EdgeTAM, an on-device object tracking model (a SAM2 variant)
- NVIDIA released parakeet-tdt-0.6b-v2, a small 600M automatic speech recognition model
- Nari released Dia, a 1.6B text-to-speech model
- Moonshot AI released Kimi Audio, a new audio understanding, generation, and conversation model
- JetBrains released Mellum models in base and SFT variants for coding
- Tesslate released UIGEN-T2-7B, a new text-to-frontend-code model
A real-time object detector much faster and more accurate than YOLO, with an Apache 2.0 license, just landed in Hugging Face transformers. D-FINE is a SOTA real-time object detector that runs on a T4 (free Colab).
Collection with all checkpoints and demo: ustc-community/d-fine-68109b427cbe6ee36b4e7352
Notebooks:
- Tracking: https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_tracking.ipynb
- Inference: https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_inference.ipynb
- Fine-tuning: https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_finetune_on_a_custom_dataset.ipynb
h/t @vladislavbro @qubvel-hf @ariG23498 and the authors of the paper.
Regular object detectors attempt to predict bounding boxes as pixel-perfect (x, y, w, h) coordinates, which is rigid and hard to optimize. D-FINE instead formulates object detection as predicting a distribution over bounding box coordinates and refines it iteratively, which is more accurate. Another core idea behind this model is Global Optimal Localization Self-Distillation: the final layer's distribution output acts as a teacher, distilled into earlier layers to make them more performant.
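The distribution-based localization idea described above can be illustrated with a toy sketch: instead of regressing one scalar per box edge, the head predicts logits over a set of discrete candidate offsets and decodes the edge as the expectation of that distribution. The bin values and logits below are invented for illustration; this is not D-FINE's actual implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_offset(logits, bin_values):
    """Decode one box edge as the expectation of a discrete
    distribution over candidate offsets, rather than a single
    regressed scalar."""
    probs = softmax(logits)
    return sum(p * v for p, v in zip(probs, bin_values))

# Toy example: 5 candidate offsets (in pixels); the logits peak at 2 px.
bins = [0.0, 1.0, 2.0, 3.0, 4.0]
logits = [0.1, 0.5, 3.0, 0.5, 0.1]
offset = expected_offset(logits, bins)  # symmetric logits -> exactly 2.0
```

Because the full distribution is available (not just its expectation), a later refinement stage or a distillation loss can compare distributions directly, which is roughly what the self-distillation step in the post exploits.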
ACE-Step, a music generation foundation model released by StepFun & ACE Studio.
Model: ACE-Step/ACE-Step-v1-3.5B
Demo: ACE-Step/ACE-Step
- 3.5B parameters, Apache 2.0 licensed
- 15× faster than LLM-based baselines (4 minutes of music in 20 s on an A100)
- Diffusion + DCAE + linear transformer = speed + coherence
- Supports voice cloning, remixing, lyric editing & more
CCI4.0-M2, a powerful dataset with 3 specialized subsets, released by BAAI (Beijing): BAAI/cci40-68199d90bbc798680df16d7c
- M2-Base: 3.5 TB of web data (EN/ZH) with LLM-augmented content, Apache 2.0
- M2-CoT: 4.2 TB of auto-synthesized CoT reasoning data
- M2-Extra: domain-specific knowledge