Generalizable Value Functions and Emotions (?)

Critics and Value Functions

In a previous post on policy gradients, we introduced the actor-critic architecture: an actor (the policy \(\pi_\theta\)) that chooses actions, and a critic (a value function \(\hat{V}_\phi\)) that evaluates states. The two components serve fundamentally different roles — the actor answers “what should I do?” while the critic answers “how good is the situation I’m in?”.

A natural question is: why do we need the critic at all? After all, the vanilla policy gradient (REINFORCE) trains the actor without any value function. The actor collects trajectories, observes the rewards, and updates its parameters to make high-reward actions more likely. This works — in principle. In practice, it breaks down for exactly the reason that motivates this post: variance.

The Variance Problem

Consider a pure actor method applied to a 20-step web agent task. The agent receives a single binary reward at the end: 1 for success, 0 for failure. REINFORCE computes the policy gradient as:

\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \cdot R_i\]

Every token in a successful trajectory gets reinforced equally. Every token in a failed trajectory gets suppressed equally. The gradient tells the actor “do more of everything you did when you succeeded” — it cannot distinguish the decisive click from a useless scroll. With \(N\) rollouts that each take 20 steps, you have 20 decisions but only 1 bit of feedback. The signal-to-noise ratio is abysmal.

A constant baseline \(b\) (e.g., the mean reward across the batch) helps: it centers the advantages so that average-performing trajectories get near-zero gradient. But it cannot distinguish states. An agent in a clearly good state (already on the right product page) and an agent in a clearly bad state (stuck on an error page) receive advantages that differ only in the trajectory’s final outcome — not in their current prospects.
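To make this concrete, here is a minimal numpy sketch (with made-up terminal rewards and a hypothetical 20-step horizon) of what the constant-baseline weights look like: every step inside a rollout receives the same scalar, so there is no per-step credit.

```python
import numpy as np

# Illustrative sketch: REINFORCE with a constant batch-mean baseline.
# Four rollouts of a 20-step task, each with a single binary terminal reward.
returns = np.array([1.0, 0.0, 0.0, 1.0])      # R_i for each rollout
baseline = returns.mean()                      # constant baseline b
advantages = returns - baseline                # one scalar per rollout

# Every one of the 20 steps inside rollout i gets the same weight (R_i - b):
per_step_weights = np.repeat(advantages[:, None], 20, axis=1)   # shape (4, 20)
print(per_step_weights[0])    # 20 identical entries: no per-step credit
```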

A State-Dependent Baseline Already Helps

In fact, REINFORCE itself can use \(V_\phi(s_t)\) as a baseline. This is perfectly valid — any function of \(s_t\) can serve as a baseline without introducing bias (as shown in the policy gradient post). The advantage becomes:

\[A(s_t, a_t) = G_t - V_\phi(s_t)\]

where \(G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}\) is the Monte Carlo return. This is still REINFORCE — the return \(G_t\) comes from the actual trajectory, no bootstrapping involved — but with a much better baseline. Instead of asking “was this trajectory better than average?”, it asks “was this trajectory better than expected from this state?”.

This already provides meaningful per-step signal. If the agent is on the right product page (\(V_\phi(s_t)\) is high) and the trajectory succeeds (\(G_t\) is high), the advantage is small — success was expected. If the agent is on the homepage (\(V_\phi(s_t)\) is moderate) and navigates directly to the right category (leading to eventual success), the advantage is large — the outcome was better than expected from that state.
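As a sketch of the state-dependent baseline, the snippet below computes \(G_t\) with a backward sweep over an illustrative sparse-reward trajectory and subtracts hypothetical critic values; all numbers are invented for illustration.

```python
import numpy as np

# Illustrative sketch: Monte Carlo returns minus a state-dependent baseline.
gamma = 0.99
rewards = np.array([0.0, 0.0, 0.0, 1.0])      # sparse reward, success at the end
values  = np.array([0.5, 0.6, 0.9, 0.95])     # hypothetical V_phi(s_t)

# G_t = sum_{t' >= t} gamma^(t'-t) r_{t'}, computed by a backward sweep
returns_to_go = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns_to_go[t] = running

advantages = returns_to_go - values           # "better than expected from s_t?"
print(returns_to_go, advantages)
```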

Why \(V(s)\) reduces variance. Without a baseline, the per-step weight is just the return: 1 for every step of a success, 0 for every step of a failure, so the gradient cannot distinguish easy states from hard ones. With \(V(s)\), advantages are centered per state: succeeding from an easy state (high \(V\)) gets a small advantage, while succeeding from a hard state (low \(V\)) gets a large one.
Can \(Q(s, a)\) be used as a baseline?

A valid baseline must be a function of \(s\) only — not of \(a\). The reason is that the baseline property relies on:

$$\mathbb{E}_{a \sim \pi}\!\left[\nabla_\theta \log \pi(a \vert s) \cdot b(s)\right] = b(s) \cdot \nabla_\theta \sum_a \pi(a \vert s) = b(s) \cdot \nabla_\theta 1 = 0$$

The key step is pulling \(b(s)\) out of the expectation over \(a\) — which is only valid when \(b\) does not depend on \(a\). Since \(Q(s, a)\) depends on \(a\), it cannot be pulled out, and subtracting it would change the expected gradient — introducing bias.

\(Q(s, a)\) plays a different role: it is the signal itself, not a baseline. The policy gradient theorem states:

$$\nabla_\theta J = \mathbb{E}\!\left[\nabla_\theta \log \pi(a \vert s) \cdot Q^\pi(s, a)\right]$$

The advantage \(A(s, a) = Q(s, a) - V(s)\) uses \(Q\) as the signal and \(V\) as the baseline. Replacing \(V\) with \(Q\) in the baseline slot would collapse the advantage to zero — subtracting the signal from itself.
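The cancellation above is easy to verify numerically. The sketch below builds a small softmax policy in PyTorch and checks that the expected score-function term, multiplied by a constant \(b\), has zero gradient; the logits and \(b\) are arbitrary.

```python
import torch

# Numeric check of the baseline property for a small softmax policy.
logits = torch.tensor([0.2, -1.0, 0.5], requires_grad=True)
pi = torch.softmax(logits, dim=0)
b = 3.7   # any quantity that does not depend on the action

# E_{a~pi}[ grad log pi(a|s) * b ], written out as a sum over actions
expectation = (pi.detach() * torch.log(pi) * b).sum()
expectation.backward()
print(logits.grad)   # ~zero: a state-only baseline adds no bias to the gradient
```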

So Why Actor-Critic?

If REINFORCE + \(V(s)\) baseline already gives per-step credit, what does actor-critic add? The difference is in how the advantage is computed. REINFORCE uses:

\[A^{\text{MC}}(s_t, a_t) = G_t - V_\phi(s_t) \qquad \text{(Monte Carlo return minus baseline)}\]

Actor-critic replaces the MC return with a bootstrapped estimate:

\[A^{\text{TD}}(s_t, a_t) = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \qquad \text{(one-step TD residual)}\]

The MC version is unbiased — \(G_t\) is the true sampled return. But it is still high-variance: \(G_t\) sums rewards over many future steps, each subject to stochastic transitions. In a 20-step web task where the environment is non-stationary (pages load differently, A/B tests shift layouts), this sum carries enormous noise.

The TD version replaces all future randomness with the critic’s estimate \(V_\phi(s_{t+1})\). This is biased — if the critic is wrong, the advantage is wrong — but dramatically lower variance, because the estimate depends only on the immediate transition \((s_t, a_t, s_{t+1})\) rather than the entire future trajectory. In practice, a slightly biased but low-variance gradient is far more useful for learning than an unbiased but noisy one.
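A small sketch of the two estimators side by side, with invented critic values and a sparse terminal reward, makes the contrast concrete:

```python
import numpy as np

# Illustrative comparison of the two advantage estimators on one trajectory.
gamma = 0.99
rewards = np.array([0.0, 0.0, 0.0, 1.0])       # sparse terminal reward
values  = np.array([0.5, 0.6, 0.9, 0.95])      # hypothetical V_phi(s_t)
values_next = np.append(values[1:], 0.0)       # V_phi(s_{t+1}), 0 after terminal

# Monte Carlo: unbiased, but sums noise over the whole remaining trajectory
returns_to_go = np.array([
    rewards[t:] @ gamma ** np.arange(len(rewards) - t)
    for t in range(len(rewards))
])
adv_mc = returns_to_go - values

# One-step TD residual: biased by critic error, noise from a single transition
adv_td = rewards + gamma * values_next - values
print(adv_mc, adv_td)
```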

This is the real contribution of the critic in actor-critic: not just providing a baseline (REINFORCE can do that too), but enabling bootstrapping — replacing noisy future returns with learned predictions. The critic becomes load-bearing: its accuracy directly determines the quality of the actor’s gradient. This is why training the critic well matters so much, and why ArCHer trains the critic before the actor.

Where Does Q-Learning Fit?

Both REINFORCE and actor-critic are policy gradient methods: they explicitly parameterize and optimize a policy \(\pi_\theta\). The value function — whether \(V(s)\) as a baseline or \(V(s)\) for bootstrapping — is a helper that improves the policy’s gradient. The policy remains the primary object being learned.

Q-learning inverts this relationship entirely. There is no explicit policy. Instead, the value function is the primary object: learn \(Q^*(s, a)\) directly via the Bellman optimality equation

\[Q^*(s, a) = \mathbb{E}\!\left[r + \gamma \max_{a'} Q^*(s', a')\right]\]

and extract the policy implicitly as \(\pi^*(s) = \arg\max_a Q^*(s, a)\). The value function does not serve the policy — the value function replaces it.

This creates a fundamental difference in what the value function needs to capture. In actor-critic, the critic learns \(V^\pi(s)\) — the expected return under the current policy. It only needs to evaluate states, not compare actions. In Q-learning, \(Q^*(s, a)\) must be action-sensitive: it must distinguish the expected return of different actions from the same state, because that distinction is the policy. This is a much harder representational requirement, especially in the LM setting where the action space (all possible token sequences) is combinatorially large. The Q-learning scalability post examines whether this requirement can be met at scale.
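For contrast, here is a minimal toy sketch of the Q-learning update on a small discrete-action problem (the 4-dim states, 3 actions, and the tiny linear Q-network are illustrative, and the usual target network and replay buffer are omitted for brevity). The point is structural: the policy is nothing more than the argmax over the learned \(Q\).

```python
import torch
import torch.nn as nn

# Toy sketch: Q-learning on a 4-dim state, 3-action problem.
q_net = nn.Linear(4, 3)                        # Q(s, .) for the 3 actions
gamma = 0.99

states      = torch.randn(8, 4)
actions     = torch.randint(0, 3, (8,))
rewards     = torch.rand(8)
next_states = torch.randn(8, 4)
dones       = torch.zeros(8)

with torch.no_grad():                          # r + gamma * max_a' Q(s', a')
    targets = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=-1).values

q_sa = q_net(states).gather(1, actions[:, None]).squeeze(1)   # Q(s, a) taken
loss = nn.functional.mse_loss(q_sa, targets)                  # Bellman regression

greedy_actions = q_net(states).argmax(dim=-1)  # the implicit policy: argmax_a Q
```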

Critic = Learned Value Function

A common source of confusion: “critic” and “value function” are not two different things. The critic is a value function — specifically, a learned approximation \(\hat{V}_\phi(s) \approx V^\pi(s)\), trained to predict expected returns. The term “critic” emphasizes its role in the actor-critic architecture (evaluating the actor’s behavior); “value function” emphasizes its mathematical definition (expected cumulative reward from a state). They refer to the same object.

In classical RL, the critic is typically a small neural network trained from scratch. In the LM setting, the critic is usually a copy of the base language model with a scalar value head replacing the token prediction head — as explored in the LM-as-critic post. This reuse of the pretrained backbone gives the critic a strong prior over language and visual representations, but introduces its own challenges (representation drift, sensitivity to the readout position, etc.).
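A rough sketch of that construction, with a placeholder `backbone` standing in for the pretrained model (the names and shapes are assumptions for this sketch, not the exact implementation from that post):

```python
import torch
import torch.nn as nn

# Rough sketch of an LM critic: a pretrained backbone with the LM head
# swapped for a scalar value head. `backbone` is a placeholder that returns
# hidden states of shape (batch, seq_len, d_model).
class LMCritic(nn.Module):
    def __init__(self, backbone, d_model):
        super().__init__()
        self.backbone = backbone
        self.value_head = nn.Linear(d_model, 1)   # replaces token prediction

    def forward(self, input_ids, attention_mask, readout_pos):
        hidden = self.backbone(input_ids, attention_mask)      # (B, L, d)
        batch_idx = torch.arange(hidden.size(0))
        h = hidden[batch_idx, readout_pos]                      # one position each
        return self.value_head(h).squeeze(-1)                   # V_phi(s), (B,)
```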

The rest of this post examines ArCHer, which applies the actor-critic idea at the turn level rather than the token level — using the critic to assign credit to each step of a multi-turn agent interaction.

What is ArCHer?

ArCHer (Actor-Critic with Hierarchical Turn-Level Credit Assignment) was introduced by Zhou et al. (2024) to address a fundamental limitation of RLHF-style training for multi-step language model agents: trajectory-level rewards provide no signal about which step was responsible for success or failure.

Consider a web agent that completes a 10-step shopping task. Trajectory-level PPO assigns the terminal reward to the entire trajectory — all 10 steps receive the same advantage. The agent cannot learn that step 3 (clicking the right product) was the decisive action while step 7 (scrolling past relevant content) was a mistake. Both get reinforced equally.

The original ArCHer paper proposes a hierarchical actor-critic framework whose central idea is: the turn, not the token, is the right granularity for credit assignment in multi-step LM agents. Below we describe the formulation and losses precisely.

Formulation: Multi-Turn MDP

ArCHer models the interaction as a token-level MDP embedded within a turn-level structure. At each turn \(t\), the agent observes a state \(s_t\) (e.g., a web page) and generates a response \(a_t = (a_t^1, a_t^2, \ldots, a_t^L)\) as a sequence of \(L\) tokens. The environment then transitions to a new state \(s_{t+1}\). After \(T\) turns, a scalar reward \(R\) is received.

The key decomposition is between the high-level (turn-level) and low-level (token-level) policies. The high-level policy selects the overall intent at each turn, while the low-level policy autoregressively generates the tokens that implement it. In the simplified ArCHer PPO variant we implement, these are collapsed into a single autoregressive policy — but the turn-level value function is preserved.
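A possible data layout for this formulation, with field names chosen for illustration rather than taken from the paper:

```python
from dataclasses import dataclass
from typing import List

# Illustrative layout of one rollout in the multi-turn formulation.
@dataclass
class Turn:
    observation: str              # s_t, e.g. the rendered web page
    response_tokens: List[int]    # a_t = (a_t^1, ..., a_t^L)

@dataclass
class Trajectory:
    turns: List["Turn"]           # T turns of interaction
    reward: float                 # scalar R received after the final turn
```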

The Critic: Turn-Level Value Function

The critic \(V_\phi(s_t)\) predicts the expected discounted return from turn \(t\) onward:

\[V_\phi(s_t) \approx \mathbb{E}\left[\sum_{k=t}^{T} \gamma^{k-t} r_k \mid s_t\right]\]

where \(r_k = 0\) for intermediate turns and \(r_T = R\). This is a state value function — it conditions only on the observation \(s_t\), not on the action \(a_t\). In the LM implementation, the critic is a copy of the base model with the language modeling head replaced by a scalar projection. The value is read at a single token position per turn: the last token of the observation (prompt boundary), where causal masking ensures the hidden state encodes the full observation but none of the response.

The critic is trained by regression on Monte Carlo return targets:

\[G_t = \gamma^{\,T-t} \cdot R\]

with a simple MSE loss:

\[\mathcal{L}_{\text{critic}}(\phi) = \frac{1}{2} \mathbb{E}_t\left[(V_\phi(s_t) - G_t)^2\right]\]

Using MC returns rather than bootstrapped targets \(r_t + \gamma V_\phi(s_{t+1})\) is a deliberate choice: with sparse terminal rewards (\(r_t = 0\) for \(t < T\)), bootstrapped targets reduce to \(\gamma V_\phi(s_{t+1})\), making the critic chase its own predictions. MC returns ground the training in the actual outcome.
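A minimal sketch of the critic objective under these choices, assuming `values[i]` holds \(V_\phi\) at the \((i{+}1)\)-th of \(T\) turns of a single trajectory (the indexing and the discount value are illustrative):

```python
import torch

# Sketch of the critic objective: MC return targets for a sparse terminal
# reward, plain MSE regression.
def critic_loss(values, terminal_reward, gamma=0.95):
    T = values.shape[0]
    # G_t = gamma^(T - t) * R, with r_t = 0 for t < T and r_T = R
    exponents = torch.arange(T - 1, -1, -1, dtype=values.dtype)   # T-1, ..., 0
    targets = (gamma ** exponents) * terminal_reward
    return 0.5 * ((values - targets) ** 2).mean()

values = torch.tensor([0.42, 0.55, 0.38, 0.80])   # hypothetical V_phi(s_1..s_4)
print(critic_loss(values, terminal_reward=1.0))
```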

The Actor: Step-Level Advantages

Given the critic, the step-level advantage is computed as a TD(0) residual:

\[A(t) = \begin{cases} \gamma\, V_\phi(s_{t+1}) - V_\phi(s_t) & t < T \\ R - V_\phi(s_T) & t = T \end{cases}\]

This measures whether the agent’s action at turn \(t\) increased or decreased the expected return relative to the critic’s prediction. Positive means the action was better than expected; negative means worse.

The advantage is broadcast to all tokens within the turn: every token \(a_t^l\) in the response at turn \(t\) shares the same advantage \(A(t)\). This is the core ArCHer insight — within a single turn, the agent generates a coherent action (reasoning + command), and assigning different advantages to individual tokens within the same action is noisy and semantically meaningless.
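A sketch of the advantage computation and the broadcast, with invented critic values and response lengths:

```python
import torch

# Sketch: per-turn TD(0) advantages from the turn-level critic, then broadcast
# to every token of the corresponding response. All numbers are illustrative.
def turn_advantages(values, terminal_reward, gamma=0.95):
    adv = torch.empty_like(values)
    adv[:-1] = gamma * values[1:] - values[:-1]   # gamma * V(s_{t+1}) - V(s_t)
    adv[-1] = terminal_reward - values[-1]        # R - V(s_T) at the final turn
    return adv

values = torch.tensor([0.42, 0.55, 0.38, 0.80])    # V_phi(s_1..s_4)
adv = turn_advantages(values, terminal_reward=1.0)

token_counts = [37, 52, 41, 29]                    # response length per turn
per_token_adv = torch.cat([a.repeat(n) for a, n in zip(adv, token_counts)])
```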

Trajectory-level PPO assigns the same advantage to every step (red). ArCHer computes per-step TD residuals (blue) — the "scroll" step gets a negative advantage while productive steps get positive signal.

The Actor Loss

The actor is trained with the standard PPO clipped objective, using the step-level advantages:

\[\mathcal{L}_{\text{actor}}(\theta) = -\mathbb{E}_{t,l}\left[\min\left(\rho_t^l \, A(t),\; \text{clip}(\rho_t^l, 1-\varepsilon, 1+\varepsilon) \, A(t)\right)\right]\]

where \(\rho_t^l = \frac{\pi_\theta(a_t^l \vert s_t, a_t^{<l})}{\pi_{\theta_{\text{old}}}(a_t^l \vert s_t, a_t^{<l})}\) is the per-token importance ratio. The clipping threshold \(\varepsilon\) (typically 0.2) prevents the policy from moving too far from the rollout policy in a single update.

Note that \(A(t)\) does not depend on \(l\) — it is the same for all tokens in the turn. The per-token ratios \(\rho_t^l\) provide token-level granularity in how much the policy changed, but the direction of the gradient (reinforce or suppress) is determined entirely at the turn level.
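A sketch of this objective, assuming flattened per-token log-probabilities under the new and rollout policies, together with the broadcast advantages from the previous sketch:

```python
import torch

# Sketch of the clipped objective over flattened response tokens, with the
# turn-level advantages broadcast per token; log-probs here are placeholders.
def actor_loss(logp_new, logp_old, token_adv, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                # rho_t^l per token
    unclipped = ratio * token_adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * token_adv
    return -torch.min(unclipped, clipped).mean()
```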

Training Order: Critic First

A subtle but important detail: ArCHer trains the critic before the actor, reversing the standard PPO order. The reason is that the actor’s gradient quality depends entirely on the advantage estimates, which depend on the critic. Training the actor with a stale critic wastes gradient steps on noisy signals. By training the critic first and then recomputing the advantages with the updated \(V_\phi\), the actor always sees the best available value estimates.

In the previous post, we discussed the challenges of using language models as critic functions — the need for action-sensitive representations and the instability of end-to-end TD training. ArCHer sidesteps several of these issues by using MC return targets (grounding the critic in actual outcomes) and by evaluating at a single position per step (the prompt boundary), where causal masking ensures the representation is clean.

What we implement in WebGym is a simplified variant — ArCHer PPO — that drops the hierarchical structure of the original paper and applies the turn-level critic idea directly to a flat PPO training loop for multi-step web agents. There is no high-level goal selector; just a single policy that acts at each turn, with a step-level critic providing per-turn credit assignment. The rest of this post walks through the implementation.

GRPO: Value Estimation by Sampling

GRPO (Group Relative Policy Optimization) takes a radically different approach to the variance problem: instead of learning a value function, estimate it statistically by sampling multiple completions from the same prompt.

The core observation is simple. A learned critic \(V_\phi(s)\) approximates \(\mathbb{E}_{\pi}[G_t \vert s]\) — the expected return from state \(s\). But there is another way to estimate an expectation: draw samples and take the empirical mean. Given a prompt \(q\), GRPO samples a group of \(G\) completions \(\{o_1, o_2, \ldots, o_G\}\) from the current policy \(\pi_\theta\), scores each with a reward function \(r(q, o_i)\), and uses the group statistics as a baseline:

\[\hat{A}_i = \frac{r(q, o_i) - \text{mean}(\{r(q, o_j)\}_{j=1}^G)}{\text{std}(\{r(q, o_j)\}_{j=1}^G)}\]

This is essentially a Monte Carlo estimate of the advantage — normalized to zero mean and unit variance within each group. No learned parameters, no value head, no critic training loop. The “value function” is replaced by the sample mean of the group.

The GRPO loss applies the same PPO-style clipping, but at the task level (one advantage per completion) rather than per-token or per-step:

\[\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \frac{1}{\vert o_i \vert}\sum_{l=1}^{\vert o_i \vert} \left[\min\!\left(\rho_i^l \, \hat{A}_i,\; \text{clip}(\rho_i^l, 1-\varepsilon, 1+\varepsilon)\, \hat{A}_i\right) - \beta \, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]\]

where \(\rho_i^l = \frac{\pi_\theta(o_i^l \vert q, o_i^{<l})}{\pi_{\theta_{\text{old}}}(o_i^l \vert q, o_i^{<l})}\) is the per-token importance ratio, and \(\beta\) controls the KL penalty against a reference policy \(\pi_{\text{ref}}\) (typically the SFT model).
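A sketch of the group-relative advantage and the token-level surrogate it feeds into; the rewards and group size are illustrative, the small denominator epsilon is a numerical safeguard not present in the formula above, and the KL penalty term is left out here.

```python
import torch

# Sketch: group-relative advantages for one prompt, plus the clipped surrogate.
def grpo_advantages(group_rewards, eps=1e-8):
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

def grpo_surrogate(logp_new, logp_old, adv_i, eps_clip=0.2):
    ratio = torch.exp(logp_new - logp_old)             # per-token rho_i^l
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip)
    return torch.min(ratio * adv_i, clipped * adv_i).mean()

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0])   # G = 8
print(grpo_advantages(rewards))   # one scalar advantage per completion
```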

What GRPO Gives Up

The simplicity of GRPO comes with a clear limitation: the advantage \(\hat{A}_i\) is task-level, not step-level. Every token in completion \(o_i\) receives the same advantage — exactly the problem that ArCHer’s turn-level critic was designed to solve. GRPO cannot distinguish which step within a trajectory was responsible for success or failure; it only knows that this completion, as a whole, was better or worse than its peers.

This works well for single-turn tasks (math reasoning, code generation) where the entire output is one coherent response and the reward reflects its overall quality. For multi-step agent tasks — where a 10-step trajectory might have 9 good steps and 1 bad one — the task-level advantage dilutes the signal. The bad step gets reinforced along with the good ones if the trajectory happened to succeed, and the good steps get suppressed if it happened to fail.

The tradeoff is thus between:

  • ArCHer: learns a critic \(V_\phi(s_t)\) to provide per-step credit, but requires training and maintaining a separate value network — with all the challenges of using LMs as critics.
  • GRPO: avoids the critic entirely by brute-force sampling, but is limited to task-level credit assignment. The “value function” is accurate (it converges to the true expectation as \(G \to \infty\)) but only at the granularity of entire trajectories.
