Baidu Inc., China's largest search engine company, released a new artificial intelligence model on Monday that its developers claim outperforms competitors from Google and OpenAI on several vision-related benchmarks despite using a fraction of the computing resources typically required for such systems.
The model, dubbed ERNIE-4.5-VL-28B-A3B-Thinking, is the latest salvo in an escalating competition among technology companies to build AI systems that can understand and reason about images, videos, and documents alongside traditional text — capabilities increasingly critical for enterprise applications ranging from automated document processing to industrial quality control.
What sets Baidu's release apart is its efficiency: the model activates just 3 billion of its 28 billion total parameters for any given input, using a routing architecture that selects only the most relevant subnetworks. According to documentation released with the model, this design allows it to match or exceed the performance of much larger competing systems on tasks involving document understanding, chart analysis, and visual reasoning while consuming significantly less computational power and memory.
"Built upon the powerful ERNIE-4.5-VL-28B-A3B architecture, the newly upgraded ERNIE-4.5-VL-28B-A3B-Thinking achieves a remarkable leap forward in multimodal reasoning capabilities," Baidu wrote in the model's technical documentation on Hugging Face, the AI model repository where the system was released.
The company said the model underwent "an extensive mid-training phase" that incorporated "a vast and highly diverse corpus of premium visual-language reasoning data," dramatically boosting its ability to align visual and textual information semantically.
How the model mimics human visual problem-solving through dynamic image analysis
Perhaps the model's most distinctive feature is what Baidu calls "Thinking with Images" — a capability that allows the AI to dynamically zoom in and out of images to examine fine-grained details, mimicking how humans approach visual problem-solving tasks.
"The model thinks like a human, capable of freely zooming in and out of images to grasp every detail and uncover all information," according to the model card. When paired with tools like image search, Baidu claims this feature "dramatically elevates the model's ability to process fine-grained details and handle long-tail visual knowledge."
This approach marks a departure from traditional vision-language models, which typically process images at a fixed resolution. By allowing dynamic image examination, the system can theoretically handle scenarios requiring both broad context and granular detail—such as analyzing complex technical diagrams or detecting subtle defects in manufacturing quality control.
The model also supports what Baidu describes as enhanced "visual grounding" capabilities with "more precise grounding and flexible instruction execution, easily triggering grounding functions in complex industrial scenarios," suggesting potential applications in robotics, warehouse automation, and other settings where AI systems must identify and locate specific objects in visual scenes.
Baidu's performance claims draw scrutiny as independent testing remains pending
Baidu's assertion that the model outperforms Google's Gemini 2.5 Pro and OpenAI's GPT-5-High on various document and chart understanding benchmarks has drawn attention across social media, though independent verification of these claims remains pending.
The company released the model under the permissive Apache 2.0 license, allowing unrestricted commercial use—a strategic decision that contrasts with the more restrictive licensing approaches of some competitors and could accelerate enterprise adoption.
"Apache 2.0 is smart," wrote one X user responding to Baidu's announcement, highlighting the competitive advantage of open licensing in the enterprise market.
According to Baidu's documentation, the model demonstrates six core capabilities beyond traditional text processing. In visual reasoning, the system can perform what Baidu describes as "multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks," aided by what the company characterizes as "large-scale reinforcement learning."
For STEM problem solving, Baidu claims that "leveraging its powerful visual abilities, the model achieves a leap in performance on STEM tasks like solving problems from photos." The visual grounding capability allows the model to identify and locate objects within images with what Baidu characterizes as industrial-grade precision. Through tool integration, the system can invoke external functions including image search capabilities to access information beyond its training data.
For video understanding, Baidu claims the model possesses "outstanding temporal awareness and event localization abilities, accurately identifying content changes across different time segments in a video." Finally, the thinking with images feature enables the dynamic zoom functionality that distinguishes this model from competitors.
Inside the mixture-of-experts architecture that powers efficient multimodal processing
Under the hood, ERNIE-4.5-VL-28B-A3B-Thinking employs a Mixture-of-Experts (MoE) architecture — a design pattern that has become increasingly popular for building efficient large-scale AI systems. Rather than activating all 28 billion parameters for every task, the model uses a routing mechanism to selectively activate only the 3 billion parameters most relevant to each specific input.
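For readers unfamiliar with the pattern, the core idea can be sketched in a few lines: a small router network scores a set of expert subnetworks and only the top-scoring experts run for each token, so most parameters sit idle on any given input. The sketch below is illustrative only, written in PyTorch with placeholder layer sizes and expert counts; it does not reflect Baidu's actual routing strategy or training recipe.

```python
import torch
import torch.nn as nn

class TopKMoELayer(nn.Module):
    """Illustrative Mixture-of-Experts layer: a router picks the top-k
    experts per token, so only a fraction of parameters are active."""

    def __init__(self, d_model=1024, d_ff=4096, num_experts=64, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.router(x)                  # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)        # mix the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e         # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

Because only `top_k` experts run per token, compute and activation memory scale with the active parameter count rather than the full model size, which is the property Baidu is trading on with the 3B-active / 28B-total design.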
This approach offers substantial practical advantages for enterprise deployments. According to Baidu's documentation, the model can run on a single 80GB GPU — hardware readily available in many corporate data centers — making it significantly more accessible than competing systems that may require multiple high-end accelerators.
The technical documentation reveals that Baidu employed several advanced training techniques to achieve the model's capabilities. The company used "cutting-edge multimodal reinforcement learning techniques on verifiable tasks, integrating GSPO and IcePop strategies to stabilize MoE training combined with dynamic difficulty sampling for exceptional learning efficiency."
Baidu also notes that in response to "strong community demand," the company "significantly strengthened the model's grounding performance with improved instruction-following capabilities."
The new model fits into Baidu's ambitious multimodal AI ecosystem
The new release is one component of Baidu's broader ERNIE 4.5 model family, which the company unveiled in June 2025. That family comprises 10 distinct variants, including Mixture-of-Experts models ranging from the flagship ERNIE-4.5-VL-424B-A47B with 424 billion total parameters down to a compact 0.3 billion parameter dense model.
According to Baidu's technical report on the ERNIE 4.5 family, the models incorporate "a novel heterogeneous modality structure, which supports parameter sharing across modalities while also allowing dedicated parameters for each individual modality."
This architectural choice addresses a longstanding challenge in multimodal AI development: training systems on both visual and textual data without one modality degrading the performance of the other. Baidu claims this design "has the advantage to enhance multimodal understanding without compromising, and even improving, performance on text-related tasks."
The company reported achieving 47% Model FLOPs Utilization (MFU) — a measure of training efficiency — during pre-training of its largest ERNIE 4.5 language model, using the PaddlePaddle deep learning framework developed in-house.
Comprehensive developer tools aim to simplify enterprise deployment and integration
For organizations looking to deploy the model, Baidu has released a comprehensive suite of development tools through ERNIEKit, which the company describes as an "industrial-grade training and compression development toolkit."
The model offers full compatibility with popular open-source frameworks including Hugging Face Transformers, vLLM (a high-performance inference engine), and Baidu's own FastDeploy toolkit. This multi-platform support could prove critical for enterprise adoption, allowing organizations to integrate the model into existing AI infrastructure without wholesale platform changes.
Sample code released by Baidu shows a relatively straightforward implementation path. Using the Transformers library, developers can load and run the model with approximately 30 lines of Python code, according to the documentation on Hugging Face.
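The snippet below gives a sense of what that path looks like. It is a minimal sketch, assuming the Hugging Face repository id baidu/ERNIE-4.5-VL-28B-A3B-Thinking and standard Transformers multimodal conventions (AutoProcessor, apply_chat_template, trust_remote_code); the exact classes, message format, and arguments may differ from Baidu's published sample, which remains the authoritative reference.

```python
# Minimal sketch of loading the model with Hugging Face Transformers.
# The repository id and call pattern are assumptions based on standard
# Transformers conventions, not a copy of Baidu's exact sample code.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"  # assumed Hugging Face repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # ~56GB of weights, within an 80GB GPU
    device_map="auto",
    trust_remote_code=True,
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "chart.png"},
        {"type": "text", "text": "Summarize the trend shown in this chart."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```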
For production deployments requiring higher throughput, Baidu provides vLLM integration with specialized support for the model's "reasoning-parser" and "tool-call-parser" capabilities—features that enable the dynamic image examination and external tool integration that distinguish this model from earlier systems.
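Once a vLLM server is running with those parsers enabled, applications can talk to the model through vLLM's OpenAI-compatible API. The sketch below assumes a locally served instance; the served-model name is a placeholder, and the actual parser names and launch flags should be taken from Baidu's documentation.

```python
# Minimal client sketch against a vLLM OpenAI-compatible server.
# Assumes the server was launched roughly as (parser names per Baidu's
# docs, elided here):
#   vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
#       --reasoning-parser ... --tool-call-parser ... --trust-remote-code
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Thinking",  # assumed served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/invoice.png"}},
            {"type": "text", "text": "Extract the invoice total and due date."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```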
The company also offers FastDeploy, its in-house inference toolkit, which Baidu claims delivers "production-ready, easy-to-use multi-hardware deployment solutions" with support for various quantization schemes that can reduce memory requirements and increase inference speed.
Why this release matters for the enterprise AI market at a critical inflection point
The release comes at a pivotal moment in the enterprise AI market. As organizations move beyond experimental chatbot deployments toward production systems that process documents, analyze visual data, and automate complex workflows, demand for capable and cost-effective vision-language models has intensified.
Several enterprise use cases appear particularly well-suited to the model's capabilities. Document processing — extracting information from invoices, contracts, and forms — represents a massive market where accurate chart and table understanding directly translates to cost savings through automation. Manufacturing quality control, where AI systems must detect visual defects, could benefit from the model's grounding capabilities. Customer service applications that handle images from users could leverage the multi-step visual reasoning.
The model's efficiency profile may prove especially attractive to mid-market organizations and startups that lack the computing budgets of large technology companies. By fitting on a single 80GB GPU — hardware costing roughly $10,000 to $30,000 depending on the specific model — the system becomes economically viable for a much broader range of organizations than models requiring multi-GPU setups costing hundreds of thousands of dollars.
"With all these new models, where's the best place to actually build and scale? Access to compute is everything," wrote one X user in response to Baidu's announcement, highlighting the persistent infrastructure challenges facing organizations attempting to deploy advanced AI systems.
The Apache 2.0 licensing further lowers barriers to adoption. Unlike models released under more restrictive licenses that may limit commercial use or require revenue sharing, organizations can deploy ERNIE-4.5-VL-28B-A3B-Thinking in production applications without ongoing licensing fees or usage restrictions.
Competition intensifies as Chinese tech giant takes aim at Google and OpenAI
Baidu's release intensifies competition in the vision-language model space, where Google, OpenAI, Anthropic, and Chinese companies including Alibaba and ByteDance have all released capable systems in recent months.
The company's performance claims — if validated by independent testing — would represent a significant achievement. Google's Gemini 2.5 Pro and OpenAI's GPT-5-High are substantially larger models backed by the deep resources of two of the world's most valuable technology companies. That a more compact, openly available model could match or exceed their performance on specific tasks would suggest the field is advancing more rapidly than some analysts anticipated.
"Impressive that ERNIE is outperforming Gemini 2.5 Pro," wrote one social media commenter, expressing surprise at the claimed results.
However, some observers counseled caution about benchmark comparisons. "It's fascinating to see how multimodal models are evolving, especially with features like 'Thinking with Images,'" wrote one X user. "That said, I'm curious if ERNIE-4.5's edge over competitors like Gemini-2.5-Pro and GPT-5-High primarily lies in specific use cases like document and chart" understanding rather than general-purpose vision tasks.
Industry analysts note that benchmark performance often fails to capture real-world behavior across the diverse scenarios enterprises encounter. A model that excels at document understanding may struggle with creative visual tasks or real-time video analysis. Organizations evaluating these systems typically conduct extensive internal testing on representative workloads before committing to production deployments.
Technical limitations and infrastructure requirements that enterprises must consider
Despite its capabilities, the model faces several technical challenges common to large vision-language systems. The minimum requirement of 80GB of GPU memory, while more accessible than some competitors, still represents a significant infrastructure investment. Organizations without existing GPU infrastructure would need to procure specialized hardware or rely on cloud computing services, introducing ongoing operational costs.
The model's context window — the amount of text and visual information it can process simultaneously — is listed as 128K tokens in Baidu's documentation. While substantial, this may prove limiting for some document processing scenarios involving very long technical manuals or extensive video content.
Questions also remain about the model's behavior on adversarial inputs, out-of-distribution data, and edge cases. Baidu's documentation does not provide detailed information about safety testing, bias mitigation, or failure modes — considerations increasingly important for enterprise deployments where errors could have financial or safety implications.
What technical decision-makers need to evaluate beyond the benchmark numbers
For technical decision-makers evaluating the model, several implementation factors warrant consideration beyond raw performance metrics.
The model's MoE architecture, while efficient during inference, adds complexity to deployment and optimization. Organizations must ensure their infrastructure can properly route inputs to the appropriate expert subnetworks — a capability not universally supported across all deployment platforms.
The "Thinking with Images" feature, while innovative, requires integration with image manipulation tools to achieve its full potential. Baidu's documentation suggests this capability works best "when paired with tools like image zooming and image search," implying that organizations may need to build additional infrastructure to fully leverage this functionality.
The model's video understanding capabilities, while highlighted in marketing materials, come with practical constraints. Processing video requires substantially more computational resources than static images, and the documentation does not specify maximum video length or optimal frame rates.
Organizations considering deployment should also evaluate Baidu's ongoing commitment to the model. Open-source AI models require continuing maintenance, security updates, and potential retraining as data distributions shift over time. While the Apache 2.0 license ensures the model remains available, future improvements and support depend on Baidu's strategic priorities.
Developer community responds with enthusiasm tempered by practical requests
Early response from the AI research and development community has been cautiously optimistic. Developers have requested versions of the model in additional formats including GGUF (a quantization format popular for local deployment) and MNN (a mobile neural network framework), suggesting interest in running the system on resource-constrained devices.
"Release MNN and GGUF so I can run it on my phone," wrote one developer, highlighting demand for mobile deployment options.
Other developers praised Baidu's technical choices while requesting additional resources. "Fantastic model! Did you use discoveries from PaddleOCR?" asked one user, referencing Baidu's open-source optical character recognition toolkit.
The model's lengthy name—ERNIE-4.5-VL-28B-A3B-Thinking—drew lighthearted commentary. "ERNIE-4.5-VL-28B-A3B-Thinking might be the longest model name in history," joked one observer. "But hey, if you're outperforming Gemini-2.5-Pro with only 3B active params, you've earned the right to a dramatic name!"
Baidu plans to showcase the ERNIE lineup during its Baidu World 2025 conference on November 13, where the company is expected to provide additional details about the model's development, performance validation, and future roadmap.
The release marks a strategic move by Baidu to establish itself as a major player in the global AI infrastructure market. While Chinese AI companies have historically focused primarily on domestic markets, the open-source release under a permissive license signals ambitions to compete internationally with Western AI giants.
For enterprises, the release adds another capable option to a rapidly expanding menu of AI models. Organizations no longer face a binary choice between building proprietary systems or licensing closed-source models from a handful of vendors. The proliferation of capable open-source alternatives like ERNIE-4.5-VL-28B-A3B-Thinking is reshaping the economics of AI deployment and accelerating adoption across industries.
Whether the model delivers on its performance promises in real-world deployments remains to be seen. But for organizations seeking powerful, cost-effective tools for visual understanding and reasoning, the appeal is already clear. As one developer succinctly put it: "Open source plus commercial use equals chef's kiss. Baidu not playing around."
Original Source: https://venturebeat.com/ai/baidu-just-dropped-an-open-source-multimodal-ai-that-it-claims-beats-gpt-5
Disclaimer: This article is a reblogged/syndicated piece from a third-party news source. Content is provided for informational purposes only. For the most up-to-date and complete information, please visit the original source. Digital Ground Media does not claim ownership of third-party content and is not responsible for its accuracy or completeness.
