A year after DeepSeek’s R1 model wiped a trillion dollars off the US tech market, the Chinese AI lab is back with V4. And this time, they brought receipts.
Today DeepSeek released preview versions of two new models: V4-Pro and V4-Flash. Both are open-source under MIT, both support a 1 million token context window, and both are available through the DeepSeek API right now. Let me break down what actually shipped, what the benchmarks say, and why you should care.
Two Models, Same Philosophy
DeepSeek went with a two-tier approach this time:
| | V4-Pro | V4-Flash |
|---|---|---|
| Total Parameters | 1.6 trillion | 284 billion |
| Activated Parameters | 49 billion | 13 billion |
| Context Length | 1M tokens | 1M tokens |
| Pre-training Data | 33T tokens | 32T tokens |
| License | MIT | MIT |
Both use Mixture-of-Experts (MoE) architecture, which means they don’t run all 1.6 trillion parameters at once. For each token, only a subset of “experts” activate. V4-Pro lights up 49 billion parameters per forward pass, and V4-Flash uses just 13 billion. That’s how you get frontier-level performance without needing a data center the size of a small country.
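To make the sparsity concrete, here's a minimal top-k routing sketch in PyTorch. The expert count, top-k value, and layer sizes below are made up for illustration and don't reflect DeepSeek's actual configuration.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Toy Mixture-of-Experts layer: each token is routed to only its top-k experts."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model)) for _ in range(n_experts)]
        )
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                                     # (tokens, n_experts)
        weights, idx = scores.softmax(-1).topk(self.top_k, dim=-1)  # keep only the top-k experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():  # only the selected experts actually run for these tokens
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

Only the selected experts run for each token, which is how a 1.6-trillion-parameter model ends up with the per-token compute of a roughly 49-billion-parameter dense model.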
V4-Pro is now the biggest open-weight model available. It outstrips Moonshot AI’s Kimi K2.6 (1.1T) and MiniMax’s M1 (456B), and more than doubles the parameter count of DeepSeek’s own V3.2 (671B).
Three Reasoning Modes
Something I appreciate about this release is the explicit reasoning modes. Both models support three levels:
- Non-think: Fast, intuitive responses. Good for routine tasks and low-stakes queries.
- Think High: Conscious logical analysis with a thinking step. Better for complex problem-solving.
- Think Max: Pushes reasoning as far as it can go. Uses a special system prompt and extended thinking. For when you need the model to really earn its answer.
You switch between these with the reasoning_effort parameter in the API (set it to high or max), or by setting thinking_mode to thinking or non_thinking.
DeepSeek recommends a context window of at least 384K tokens when running in Think Max mode. That’s not a typo.
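Here's roughly what that looks like in a request. The reasoning_effort usage matches DeepSeek's own example further down; passing thinking_mode through extra_body is my assumption about field placement, so check the API docs before copying it.

```python
from openai import OpenAI

client = OpenAI(api_key="your-deepseek-api-key", base_url="https://api.deepseek.com")

# Think High: an explicit thinking step before the answer
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    reasoning_effort="high",  # or "max" for Think Max
)

# Non-think: skip reasoning for fast, cheap replies
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Summarize this changelog in one line: ..."}],
    extra_body={"thinking_mode": "non_thinking"},  # field placement is an assumption
)
```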
The Architecture Matters
DeepSeek didn’t just scale up V3.2 and call it a day. There are three notable architectural changes:
Hybrid Attention Architecture: This combines Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA). The practical effect? At the full 1M token context, V4-Pro uses only 27% of the single-token inference FLOPs and 10% of the KV cache compared to V3.2. That’s a massive efficiency gain for anyone working with long documents or large codebases.
Manifold-Constrained Hyper-Connections (mHC): A new way to handle residual connections across layers. It strengthens signal propagation without losing model expressivity. Translation: the model learns more effectively across its depth.
Muon Optimizer: Replaces the standard AdamW optimizer with Muon for faster convergence and better training stability. If you’ve ever dealt with training instability at scale, you know this matters.
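For the curious, Muon's core idea is simple: keep a momentum buffer as in SGD, but orthogonalize each 2D weight matrix's update with a few Newton-Schulz iterations before applying it. Below is a stripped-down sketch of that update, following the commonly published open-source Muon implementation; it is not DeepSeek's training code, and the hyperparameters are illustrative.

```python
import torch

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately map a matrix to the nearest (semi-)orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic iteration coefficients from the reference implementation
    x = g / (g.norm() + 1e-7)
    if x.shape[0] > x.shape[1]:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    if g.shape[0] > g.shape[1]:
        x = x.T
    return x

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon-style update for a single 2D weight matrix (sketch, not a full optimizer)."""
    momentum_buf.mul_(beta).add_(grad)                 # standard momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf) # orthogonalize before applying
    weight.add_(update, alpha=-lr)
    return weight, momentum_buf
```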
Benchmarks: How Does It Actually Stack Up?
I pulled the numbers from DeepSeek’s technical report. Here’s the honest picture.
V4-Pro Max vs. the Big Names
On LiveCodeBench (real coding problems, not recycled ones), V4-Pro Max hits 93.5, beating Gemini 3.1 Pro (91.7); GPT-5.4 isn’t reported on this benchmark. On Codeforces, it achieves a rating of 3206, which is genuinely competitive with professional competitive programmers.
On GPQA Diamond (graduate-level science questions), it scores 90.1. That’s behind Gemini 3.1 Pro (94.3), GPT-5.4 (93.0), and Claude Opus 4.6 (91.3), roughly level with Kimi K2.6 (90.5), and ahead of GLM-5.1 (86.2).
For agentic tasks, SWE-bench Verified shows 80.6%, essentially matching Claude Opus 4.6 (80.8%) and tying Gemini 3.1 Pro (80.6%). On BrowseComp (web browsing tasks), it hits 83.4%, comparable to Gemini 3.1 Pro (85.9%) and GPT-5.4 (82.7%).
Where does it fall short? Knowledge-heavy benchmarks. On SimpleQA-Verified, V4-Pro Max scores 57.9, which trails Gemini 3.1 Pro (75.6) significantly. DeepSeek themselves acknowledge a “developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months” on pure knowledge tasks.
V4-Flash: The Budget Option That Punches Above Its Weight
Here’s where it gets interesting for developers watching their API bill. V4-Flash in Think Max mode achieves:
- LiveCodeBench: 91.6 (vs V4-Pro Max’s 93.5)
- GPQA Diamond: 88.1 (vs 90.1)
- Codeforces Rating: 3052 (vs 3206)
That’s remarkably close to V4-Pro for a model that’s nearly 6x smaller in total parameters and activates roughly a quarter as many per token. For most coding tasks and reasoning workloads, V4-Flash in Think Max mode might be the sweet spot.
The Pricing Story
This is where DeepSeek continues to make everyone uncomfortable.
| Model | Input (cache miss) | Input (cache hit) | Output |
|---|---|---|---|
| V4-Flash | $0.14 / 1M tokens | $0.028 / 1M tokens | $0.28 / 1M tokens |
| V4-Pro | $1.74 / 1M tokens | $0.145 / 1M tokens | $3.48 / 1M tokens |
To put that in perspective, V4-Flash at $0.28/M output tokens undercuts GPT-5.4 Nano, Gemini 3.1 Flash, and Claude Haiku 4.5. V4-Pro at $3.48/M output tokens undercuts Gemini 3.1 Pro, GPT-5.5, Claude Opus 4.7, and GPT-5.4.
The cache hit pricing on V4-Flash is particularly aggressive at $0.028/M input tokens. If your use case involves repeated system prompts or recurring context (which is most agentic workflows), the actual cost drops even further.
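To see what that means in practice, here's a back-of-the-envelope cost estimate for a hypothetical agentic workload, using the V4-Flash prices from the table above. The traffic numbers are made up for illustration.

```python
# V4-Flash list prices, in dollars per 1M tokens (from the pricing table above).
PRICE_IN_MISS = 0.14
PRICE_IN_HIT = 0.028
PRICE_OUT = 0.28

# Hypothetical agent workload: 50k requests/month, 6k input tokens each
# (80% served from the prompt cache thanks to a shared system prompt), 1.2k output tokens.
requests = 50_000
input_tokens = 6_000
cached_fraction = 0.80
output_tokens = 1_200

in_hit_m = requests * input_tokens * cached_fraction / 1e6        # cached input, in millions of tokens
in_miss_m = requests * input_tokens * (1 - cached_fraction) / 1e6 # uncached input
out_m = requests * output_tokens / 1e6                            # output

cost = in_hit_m * PRICE_IN_HIT + in_miss_m * PRICE_IN_MISS + out_m * PRICE_OUT
print(f"Estimated monthly cost: ${cost:.2f}")  # about $32 with these assumptions
```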
Both models support the OpenAI ChatCompletions API format and the Anthropic API format. The base URL is https://api.deepseek.com for the OpenAI-compatible endpoint and https://api.deepseek.com/anthropic for the Anthropic-compatible one. Switching is as simple as changing the model name parameter.
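For example, if you already use the Anthropic Python SDK, pointing it at DeepSeek should look roughly like this. This is a sketch based on the base URL above; I haven't verified which optional parameters DeepSeek's Anthropic-compatible endpoint accepts.

```python
from anthropic import Anthropic

client = Anthropic(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com/anthropic",  # DeepSeek's Anthropic-compatible endpoint
)

message = client.messages.create(
    model="deepseek-v4-flash",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a haiku about prompt caching."}],
)
print(message.content[0].text)
```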
Text Only (For Now)
One important caveat: both V4 models are text-only. No image generation, no audio, no video understanding. While competitors like GPT-5.4 and Gemini 3.1 Pro offer multimodal capabilities, DeepSeek has stayed focused on language and reasoning.
For a lot of developer use cases, that’s fine. But if you need vision capabilities for document understanding or image analysis, you’ll still need to look elsewhere.
Agent-First Optimization
Something that caught my attention: DeepSeek specifically optimized V4 for popular agent frameworks including Claude Code, OpenClaw, and OpenCode. The Atlas Cloud analysis notes that DeepSeek uses V4-Pro internally as its coding agent of choice, with employee feedback suggesting it surpasses Claude Sonnet 4.5 in quality.
That kind of framework-specific tuning matters more than benchmarks suggest. A model that scores well in isolation but behaves inconsistently inside a structured agent loop is a headache to deploy. Treating major agent frameworks as first-class optimization targets is a sign of maturity in how DeepSeek thinks about real-world usage.
The Huawei Chip Angle
The other big story here is hardware independence.

The exact extent of Huawei chip usage in training is unclear, but the fact that V4 can run natively on domestic Chinese silicon is significant. Between US export controls restricting access to Nvidia’s most advanced chips and Beijing’s push for domestic alternatives, this represents a step toward AI sovereignty for China.
For the rest of us, it mainly means that the model was built under real hardware constraints, which likely contributed to the efficiency innovations.
How to Get Started
Chat interface: Head to chat.deepseek.com or the DeepSeek mobile app. V4-Pro is available in Expert Mode, V4-Flash in Fast Mode.
API access: Set your model parameter to deepseek-v4-pro or deepseek-v4-flash. Existing DeepSeek API integrations need minimal changes.
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Refactor this Python function to use type hints and async/await."},
    ],
    reasoning_effort="max",  # "high" or "max"; omit for non-think responses
)

print(response.choices[0].message.content)
```
Local deployment: Full model weights are on Hugging Face and ModelScope. Both base and instruct versions are available. Keep in mind that V4-Pro at 1.6T total parameters (even with FP4+FP8 mixed precision) requires serious hardware. V4-Flash at 284B is more practical for local experimentation.
DeepSeek recommends temperature=1.0 and top_p=1.0 as the sampling parameters for local deployment.
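If you want to try V4-Flash locally with vLLM, the setup would look something like this. The Hugging Face repo name is a guess (check the actual model card), and even the 284B model needs serious multi-GPU hardware.

```python
from vllm import LLM, SamplingParams

# Repo name is hypothetical; substitute the real ID from the DeepSeek org on Hugging Face.
llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",
    tensor_parallel_size=8,
    trust_remote_code=True,
)

# DeepSeek's recommended sampling settings for local deployment.
params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=512)

outputs = llm.generate(["Explain Mixture-of-Experts routing in two sentences."], params)
print(outputs[0].outputs[0].text)
```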
Deprecation Warning
If you’re currently using deepseek-chat or deepseek-reasoner in production, take note. Those model names will be retired on July 24, 2026. During the transition, they map to V4-Flash’s non-thinking and thinking modes respectively. Plan your migration now.
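The migration itself is a one-line change per call site. Continuing with the client from the example above, and based on the mapping DeepSeek describes, something like this should preserve current behavior; treat the reasoning_effort value as an assumption and verify against the official migration guide.

```python
# Before: legacy model names (retired July 24, 2026)
response = client.chat.completions.create(model="deepseek-chat", messages=messages)
response = client.chat.completions.create(model="deepseek-reasoner", messages=messages)

# After: the equivalent V4-Flash modes
response = client.chat.completions.create(model="deepseek-v4-flash", messages=messages)  # non-thinking
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=messages,
    reasoning_effort="high",  # thinking mode; assumption: "high" matches deepseek-reasoner's behavior
)
```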
The Honest Take
Is V4 going to shock the market like R1 did? Probably not. The world has adjusted to the reality that Chinese AI labs can produce competitive models at lower costs. The stock market reaction was muted compared to the January 2025 bloodbath.
But V4 is a genuinely strong release. The 1M token context window as a standard feature (not a premium tier) is a clear signal about where the open-source frontier is heading. The pricing continues to put pressure on every closed-source competitor. And the agent-first optimization shows DeepSeek understands how developers actually use these models today.
If you’re building agentic workflows, doing long-context coding tasks, or just want to cut your API costs without sacrificing too much quality, V4 is worth a serious look. Start with V4-Flash in Think Max mode and see if it handles your workload. If you need the extra knowledge capacity and the best coding performance, step up to V4-Pro.
The model weights are MIT licensed, the API is live, and the pricing is hard to argue with. Go try it.

