<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Bradley C. Love</title>
    <description>I am Professor of Cognitive and Decision Sciences in Experimental Psychology at UCL and a fellow at The Alan Turing Institute for data science. My lab&apos;s research centers around human learning and decision making, integrating behavioural, computational, and neuroscience perspectives.</description>
    <link>http://bradlove.org</link>
    <atom:link href="http://bradlove.org/feed.xml" rel="self" type="application/rss+xml" />
    
      <item>
        <title>Giving LLMs too much RoPE: A limit on Sutton’s Bitter Lesson</title>
        <description>&lt;h5 id=&quot;introduction&quot;&gt;Introduction&lt;/h5&gt;
&lt;p&gt;Sutton’s Bitter Lesson (&lt;a href=&quot;https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf&quot;&gt;Sutton, 2019&lt;/a&gt;) argues that machine learning breakthroughs, like AlphaGo, BERT, and large-scale vision models, rely on general, computation-driven methods that prioritize learning from data over human-crafted priors. Large language models (LLMs) based on transformer architectures exemplify this trend, scaling effectively with data and compute. Yet, positional embeddings—a key transformer component—seem to challenge this philosophy. Most embedding schemes are fixed, not learned, and encode a human-designed prior that words closer in a sentence are more relevant than those farther apart. This post explores why this machine learning practice appears to defy the Bitter Lesson. We also analyze patterns in learned absolute positional embeddings, which partially align with fixed, human-designed schemes but show intriguing variations, highlighting the complexity of positional encodings in LLMs and the need for further research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Transformers Need Positional Embeddings&lt;/strong&gt;&lt;br /&gt;
Transformers process tokens in parallel using permutation-invariant attention mechanisms, lacking inherent sequence awareness. Without positional information, they treat “The cat sat on the mat” and “Mat the on sat cat the” identically, despite order being critical for meaning in language. While LLMs may learn word order differently from humans, they require consistent order encoding to function (see our &lt;a href=&quot;https://bradlove.org/blog/prob-llm-consistency&quot;&gt;recent blog&lt;/a&gt; for more). Positional embeddings provide this order, using either fixed or learned methods. Human intuition suggests nearby words are more relevant than distant ones, implying a decay in influence over distance—a prior often baked into positional encoding designs. However, &lt;a href=&quot;https://arxiv.org/abs/2410.21216v2&quot;&gt;Chen et al., 2024&lt;/a&gt; argue this assumption may be outdated for modern LLMs.&lt;/p&gt;
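&lt;p&gt;This permutation invariance is easy to check directly. Below is a minimal sketch (plain NumPy, a single attention head with no projections and no positional information, purely for illustration): permuting the input tokens merely permutes the output rows, so token order carries no signal.&lt;/p&gt;

```python
import numpy as np

def attention(x):
    # dot-product self-attention with no positional information
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over rows
    return weights @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))   # e.g. "The cat sat on the mat"
perm = rng.permutation(6)          # e.g. "Mat the on sat cat the"

out = attention(tokens)
out_perm = attention(tokens[perm])
# permutation equivariance: shuffled input gives identically shuffled output
print(np.allclose(out[perm], out_perm))  # True
```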

&lt;h5 id=&quot;a-brief-history-of-positional-embeddings&quot;&gt;A Brief History of Positional Embeddings&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;Early Positional Embeddings: Baking in Long-Term Decay&lt;/strong&gt;&lt;br /&gt;
The original Transformer (&lt;a href=&quot;https://arxiv.org/abs/1706.03762&quot;&gt;Vaswani et al., 2017&lt;/a&gt;) used sinusoidal positional embeddings, applying deterministic sinusoidal functions to encode positions. These embeddings exhibit long-term decay, where similarity between embeddings decreases with token distance, aligning with the intuition that distant tokens are less relevant (Fig. 1).&lt;/p&gt;
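&lt;p&gt;The decay in Fig. 1 can be reproduced in a few lines. Here is a sketch of the sinusoidal scheme (standard formulation; the sequence length and dimensionality are arbitrary choices for illustration):&lt;/p&gt;

```python
import numpy as np

def sinusoidal_embeddings(n_positions, d_model):
    # pe[pos, 2i] = sin(pos / 10000^(2i/d)); pe[pos, 2i + 1] = cos(same angle)
    positions = np.arange(n_positions)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

emb = sinusoidal_embeddings(512, 64)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sims = emb @ emb.T  # pairwise cosine similarities, as in Fig. 1

# similarity falls off with distance: nearby positions look more alike
print(sims[0, 1], sims[0, 100])
```

Averaging `sims` along diagonals of constant offset gives the distance curve plotted in Fig. 1.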

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/sinusoidal.png&quot; title=&quot;Figure 1&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  &lt;b&gt;Figure 1: Cosine similarity of sinusoidal positional embeddings, showing decay in similarity as token distance increases, reflecting the intuition that nearby tokens are more relevant.&lt;/b&gt;
  &lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Absolute Learnable Embeddings: Data-Driven but Limited&lt;/strong&gt;&lt;br /&gt;
Absolute learnable positional embeddings, used in models like BERT, GPT-1, GPT-2, Galactica, and OPT, align with Sutton’s Bitter Lesson by assigning trainable vectors to each position, allowing the model to learn positional relationships from data. This data-driven approach avoids human priors, theoretically enabling optimal patterns to emerge during training. However, these embeddings are limited by a fixed maximum sequence length, hindering generalization to longer contexts.&lt;/p&gt;
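&lt;p&gt;As a sketch of the idea (shapes and token ids are illustrative, not any particular model’s code): each position indexes a trainable lookup table that is simply added to the token embedding, which is also why nothing beyond the table’s fixed length can be represented.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 50257, 1024, 768

# two trainable lookup tables (randomly initialised here for illustration)
tok_table = rng.normal(0.0, 0.02, (vocab_size, d_model))
pos_table = rng.normal(0.0, 0.02, (max_len, d_model))  # one vector per position

def embed(token_ids):
    seq_len = len(token_ids)
    assert max_len >= seq_len  # hard cap: positions past max_len do not exist
    return tok_table[token_ids] + pos_table[np.arange(seq_len)]

out = embed([464, 3797, 3332])  # three hypothetical token ids
print(out.shape)  # (3, 768)
```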

&lt;p&gt;&lt;strong&gt;Current Generation Embeddings: Back to Human Priors&lt;/strong&gt;&lt;br /&gt;
State-of-the-art models like LLaMA, Qwen, and DeepSeek use Rotary Position Embeddings (RoPE) (&lt;a href=&quot;https://arxiv.org/abs/2104.09864&quot;&gt;Su et al., 2021&lt;/a&gt;). RoPE applies fixed, relative rotations in attention, reintroducing a human prior of long-term decay (Fig. 2) while enabling length extrapolation—a key advantage for modern LLMs. This shift seems to step back from the data-driven ideal, suggesting that learning from data may not always be optimal, especially given practical constraints like context length.&lt;/p&gt;
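&lt;p&gt;The core trick can be sketched in NumPy (our simplification of Su et al., 2021: one head, no attention projections): queries and keys are rotated by position-dependent angles, so attention scores depend only on relative offsets.&lt;/p&gt;

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    # rotate consecutive (even, odd) feature pairs by position-dependent angles
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # one frequency per pair
    angles = positions[:, None] * freqs[None, :]   # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
pos = np.arange(8)

scores = rope_rotate(q, pos) @ rope_rotate(k, pos).T
shifted = rope_rotate(q, pos + 100) @ rope_rotate(k, pos + 100).T
# shifting every position by the same offset leaves scores unchanged:
# attention sees only relative distances, which is what enables extrapolation
print(np.allclose(scores, shifted))  # True
```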

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/rope_decay.png&quot; title=&quot;Figure 2&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  &lt;b&gt;Figure 2: RoPE similarity decay, showing decreasing similarity with increasing token distance, enabling effective handling of long sequences with built-in decay.&lt;/b&gt;
  &lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h5 id=&quot;absolute-positional-embeddings-revisited&quot;&gt;Absolute Positional Embeddings: Revisited&lt;/h5&gt;
&lt;p&gt;Why did the field shift from human-designed fixed positional embeddings to a data-driven approach consistent with Sutton’s Bitter Lesson, only to return to fixed schemes? To explore this, we investigate learnable absolute positional embeddings, uncovering patterns that warrant further study.&lt;/p&gt;

&lt;p&gt;Across various model sizes, architectures, and training datasets, these embeddings partially converge on the long-term decay seen in fixed embeddings. Surprisingly, they also show periodic oscillations; in fixed embeddings, such oscillations are a byproduct of designs aimed at decay and lack clear theoretical justification. Their presence in learnable embeddings is puzzling, varying with model capacity, architecture, and training data. Are these oscillations beneficial or artifacts? Their variability underscores the need for further research to clarify their role and optimize positional encodings in LLMs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-2: A Periodic Surprise&lt;/strong&gt;&lt;br /&gt;
We analyzed cosine similarities of positional embeddings in pretrained GPT-2 models (Fig. 3). The top panel shows pairwise similarities between token positions, and the bottom averages similarities by distance. A smooth, periodic pattern emerges, noted in &lt;a href=&quot;https://arxiv.org/abs/2010.04903&quot;&gt;research&lt;/a&gt; and &lt;a href=&quot;https://www.lesswrong.com/posts/qvWP3aBDBaqXvPNhS/gpt-2-s-positional-embedding-matrix-is-a-helix&quot;&gt;blogs&lt;/a&gt;, but without clear explanation. One might assume models capture hierarchical structure in training data, but this doesn’t hold: the pattern persists across diverse datasets without aligned peaks. Why does a data-driven approach produce such structured patterns?&lt;/p&gt;
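&lt;p&gt;For readers who want to reproduce the bottom panel, here is a sketch of the similarity-by-distance computation (the helper is ours; the commented loading line assumes the Hugging Face transformers package):&lt;/p&gt;

```python
import numpy as np

def similarity_by_distance(emb):
    # emb: (n_positions, d) positional embedding matrix; returns the mean
    # cosine similarity over all position pairs at each distance
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    n = len(emb)
    return np.array([np.diagonal(sims, offset=d).mean() for d in range(1, n)])

# with a pretrained model (assuming Hugging Face transformers is installed):
#   wpe = GPT2Model.from_pretrained("gpt2").wpe.weight.detach().numpy()
#   curve = similarity_by_distance(wpe)  # oscillates rather than decaying
curve = similarity_by_distance(np.random.default_rng(0).normal(size=(64, 16)))
print(curve.shape)  # (63,)
```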

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/positional_embedding_similarity_gpt2_pretrained.png&quot; title=&quot;Figure 3&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;img src=&quot;/images/blog/positional_embedding_similarity_by_distance_gpt2_pretrained.png&quot; title=&quot;Figure 3&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  &lt;b&gt;Figure 3: Cosine similarities of GPT-2 pretrained positional embeddings, showing periodic oscillations (top: pairwise similarities; bottom: averaged by distance), defying expected monotonic decay.&lt;/b&gt;
  &lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Varying Patterns Across Models&lt;/strong&gt;&lt;br /&gt;
We examined Galactica (125M, 1.3B, 6.7B), trained on academic papers, and OPT (125M, 350M, 1.3B, 2.7B, 6.7B), trained on general text. Galactica’s smaller models show periodicity, but the 6.7B model trends toward a simpler similarity decrease (Fig. 4). OPT models vary: the 350M model mirrors GPT-2’s periodicity, while the 125M and larger models diverge, losing periodic structure (Fig. 5).&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/positional_embedding_similarity_galactica.png&quot; title=&quot;Figure 4&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;img src=&quot;/images/blog/positional_embedding_similarity_by_distance_galactica.png&quot; title=&quot;Figure 4&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  &lt;b&gt;Figure 4: Galactica pretrained positional embedding similarities, with smaller models showing periodicity and the 6.7B model trending toward monotonic decay (top: pairwise; bottom: by distance).&lt;/b&gt;
  &lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/positional_embedding_similarity_opt.png&quot; title=&quot;Figure 5&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;img src=&quot;/images/blog/positional_embedding_similarity_by_distance_opt.png&quot; title=&quot;Figure 5&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  &lt;b&gt;Figure 5: OPT pretrained positional embedding similarities, with the 350M model showing periodicity similar to GPT-2, while others vary (top: pairwise; bottom: by distance).&lt;/b&gt;
  &lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Training Data’s Role&lt;/strong&gt;&lt;br /&gt;
We analyzed CodeParrot (110M, 1.5B), GPT-2-based models trained on Python code. The 110M model shows periodicity distinct from GPT-2’s 124M, and the 1.5B model diverges further, unique from both smaller CodeParrot and same-sized GPT-2 models (Fig. 6). This suggests training data shapes these patterns.&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/positional_embedding_similarity_codeparrot.png&quot; title=&quot;Figure 6&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;img src=&quot;/images/blog/positional_embedding_similarity_by_distance_codeparrot.png&quot; title=&quot;Figure 6&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  &lt;b&gt;Figure 6: CodeParrot pretrained positional embedding similarities, showing distinct periodic patterns influenced by code-specific training data (top: pairwise; bottom: by distance).&lt;/b&gt;
  &lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;We also studied GPT-2 124M variants trained on neuroscience articles in forward (FWD), backward (BWD), and permuted (PERM) orders, detailed in a &lt;a href=&quot;https://bradlove.org/blog/prob-llm-consistency&quot;&gt;blog&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2505.08739&quot;&gt;paper&lt;/a&gt;. These models show a weak decrease in similarity out to roughly 450 tokens, with similarity dipping below zero before returning to zero, a pattern distinct from the other models and from a randomly initialized baseline (INIT) (Fig. 7).&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/positional_embedding_similarity_x_models_reordered.png&quot; title=&quot;Figure 7&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;img src=&quot;/images/blog/positional_embedding_similarity_by_distance_x_models_reordered.png&quot; title=&quot;Figure 7&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  &lt;b&gt;Figure 7: GPT-2 124M positional embedding similarities trained on neuroscience text in forward (FWD), backward (BWD), and permuted (PERM) orders, compared to random initialization (INIT), showing weak decay up to 450 tokens (top: pairwise; bottom: by distance).&lt;/b&gt;
  &lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h5 id=&quot;a-frontier-to-suttons-bitter-lesson&quot;&gt;A Frontier to Sutton’s Bitter Lesson&lt;/h5&gt;
&lt;p&gt;The shift from absolute learnable positional embeddings to RoPE in modern LLMs highlights a trade-off between Sutton’s data-driven ideal and practical scalability. Learnable embeddings align with the Bitter Lesson but are limited by fixed context lengths, while RoPE’s fixed decay enables length extrapolation at the cost of reintroducing human priors. Both fixed (e.g., sinusoidal) and learnable embeddings exhibit periodic oscillations, but in fixed embeddings, these are a byproduct, with decay as the intended design. Their purpose remains unclear.&lt;/p&gt;

&lt;p&gt;Intriguingly, learnable embeddings’ oscillations suggest partial convergence with fixed methods, but models like GPT-2 (all sizes) and OPT-350M show exaggerated oscillations, raising concerns about potential degenerative solutions or training artifacts. In contrast, models like Galactica-6.7B and the larger OPT variants (as well as the smallest OPT model) display more desirable decay, with oscillations that adapt to training data and model scale. This flexibility, absent in RoPE’s rigid structure, may better capture nuanced patterns but risks instability. Whether these oscillations are beneficial or artifactual remains open. They are unlikely to capture hierarchical structure in text, as peaks and troughs do not align across model sizes and architectures, but the periodicity may help distinguish positions, especially at greater distances.&lt;/p&gt;
</description>
        <pubDate>Wed, 11 Jun 2025 00:00:00 +0000</pubDate>
        <link>http://bradlove.org/blog/position-embd</link>
        <guid isPermaLink="true">http://bradlove.org/blog/position-embd</guid>
      </item>
    
      <item>
        <title>Backwards Compatible: The Strange Math Behind Word Order in AI</title>
        <description>&lt;h3 id=&quot;probability-101-does-order-matter-in-sequences&quot;&gt;Probability 101: Does Order Matter in Sequences?&lt;/h3&gt;
&lt;p&gt;Ever tried reading a sentence backward? Or jumbling the words like a word salad? Take the sentence “The cat sat on the mat” as an example (Fig. 1). Imagine calculating how likely this exact sentence is to appear. Does it matter if you start with “The” and work forward, or begin with “mat” and go backward, or even shuffle the words randomly? Your gut might say, “Yeah, scrambling it feels way harder!” But here’s the wild part: math says the probability of the whole sentence stays the same, no matter the order you process it in. Let’s unpack why.&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/cat_proba.png&quot; title=&quot;Figure 1&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  &lt;b&gt;Figure 1: Forward and backward factorizations of the joint probability of a text sequence.&lt;/b&gt;
&lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;We’re talking about the &lt;em&gt;joint probability&lt;/em&gt; of the full sequence, &lt;strong&gt;P(The, cat, sat, on, the, mat)&lt;/strong&gt;, which measures how likely it is for all six words to appear together in that order. You can break this down into steps, and the order of those steps shouldn’t change the final answer. Here’s how it looks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forward order&lt;/strong&gt;: Start with “The” then find the chance of “cat” given “The” then “sat” given “The cat” all the way to “mat” given “The cat sat on the.” That’s:
  P(The) × P(cat | The) × P(sat | The, cat) × P(on | The, cat, sat) × P(the | The, cat, sat, on) × P(mat | The, cat, sat, on, the)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backward order&lt;/strong&gt;: Flip it! Start with “mat” then the chance of “the” given “mat” then “on” given “mat the” up to “The” given all the rest. That’s:
  P(mat) × P(the | mat) × P(on | mat, the) × P(sat | mat, the, on) × P(cat | mat, the, on, sat) × P(The | mat, the, on, sat, cat)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shuffled order&lt;/strong&gt;: Mix it up, like starting with “sat” then “The” then “mat,” and so on. One mix could be:
  P(sat) × P(The | sat) × P(mat | sat, The) × P(on | sat, The, mat) × P(the | sat, The, mat, on) × P(cat | sat, The, mat, on, the)&lt;/p&gt;

&lt;p&gt;The magic? All these paths—forward, backward, or shuffled—land at the same joint probability, &lt;strong&gt;P(The, cat, sat, on, the, mat)&lt;/strong&gt;. This is thanks to the chain rule of probability, which keeps the math consistent. A measure called &lt;em&gt;perplexity&lt;/em&gt;, which quantifies how uncertain a model is about the sequence (lower means more confident), should also be identical across these orderings, given by:
\(\exp \left(-\frac{1}{6} \ln P(\text{The, cat, sat, on, the, mat})\right).\)
So, in theory, whether you read the sentence forward, backward, or as a jumbled mess, its predictability stays the same. We prove this equivalence mathematically and provide the full derivations in our paper &lt;a href=&quot;https://arxiv.org/abs/2505.08739&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
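&lt;p&gt;You can check the chain rule’s order-invariance numerically. This toy sketch (a made-up four-sentence “language”; the helper names are ours) shows that every factorization order telescopes to the same joint probability:&lt;/p&gt;

```python
import itertools, math

# a toy joint distribution over three-word "sentences"
joint = {
    ("the", "cat", "sat"): 0.5,
    ("the", "cat", "ran"): 0.2,
    ("a", "dog", "sat"): 0.2,
    ("a", "dog", "ran"): 0.1,
}

def marginal(assignment):
    # probability that the given word slots hold the given words
    return sum(p for seq, p in joint.items()
               if all(seq[i] == w for i, w in assignment.items()))

def chain_prob(sentence, order):
    # multiply conditionals P(word at slot i, given the slots visited so far)
    prob, seen = 1.0, {}
    for i in order:
        prob *= marginal({**seen, i: sentence[i]}) / (marginal(seen) if seen else 1.0)
        seen[i] = sentence[i]
    return prob

s = ("the", "cat", "sat")
for order in itertools.permutations(range(3)):
    assert math.isclose(chain_prob(s, order), joint[s])
print("all 6 factorization orders recover P =", joint[s])
```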

&lt;h3 id=&quot;do-llms-produce-perplexities-consistent-with-theory&quot;&gt;Do LLMs Produce Perplexities Consistent with Theory?&lt;/h3&gt;
&lt;p&gt;Large language models (LLMs), like the GPT-2 models we studied (with 124M, 355M, and 774M parameters), learn by predicting the next word in a sentence, a process called autoregressive training. They break down a sentence into tokens (words or parts of words) and estimate conditional probabilities—like the chance of “sat” following “The cat”. This mirrors our probability example, suggesting that LLMs should, in theory, produce the same perplexity for a sequence whether it’s processed forward, backward, or shuffled. But do they?&lt;/p&gt;

&lt;p&gt;We trained GPT-2 models on a massive dataset of neuroscience papers (1.3 billion tokens, spanning 20 years) in three ways: forward (normal reading order), backward (reversed token order), and permuted (randomly shuffled tokens within each sequence). If the theory holds, all models should have the same perplexities for the same text. But what happens in practice?&lt;/p&gt;

&lt;h3 id=&quot;what-we-found-theory-meets-reality&quot;&gt;What We Found: Theory Meets Reality&lt;/h3&gt;
&lt;p&gt;Surprisingly, the models didn’t perfectly align with the theory. Forward and backward models had similar perplexities, but forward models consistently performed slightly better, meaning they were less uncertain about predicting sequences. Models trained on permuted (shuffled) text, however, showed much higher perplexities, deviating significantly from both forward and backward models (Fig. 2). This suggests that shuffling tokens makes prediction much harder for the model.&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/val_losses_comparison.jpg&quot; title=&quot;Figure 2&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  &lt;b&gt;Figure 2: Average validation perplexity differences across model sizes and training directions.&lt;/b&gt; Forward and backward text training yields similar perplexities, though forward models consistently achieve lower values (difference below zero). This gap widens slightly with model size. Permuted text training yields much higher perplexity than both forward and backward models, with similar differences to each, causing the curves to overlap. Shaded regions indicate one standard deviation around the mean across three random initializations.
&lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;We dug deeper to understand why. The culprit? &lt;em&gt;Attention biases&lt;/em&gt; in how these models process text. LLMs use a mechanism called self-attention to weigh the importance of different tokens in a sequence. We found that forward and backward models tend to focus heavily on nearby tokens and those at the start or end of a sequence, irrespective of the meaning of those tokens. Permuted models, however, developed very different attention patterns (Fig. 3). Biases toward specific token positions can affect how a sequence is processed when factorized in different orders, with these differences cascading through the model and leading to variations in perplexity.&lt;/p&gt;
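&lt;p&gt;As a sketch of how such a positional-bias analysis can run (our simplified version, applied to a random causal attention matrix rather than a trained model): rank each query’s attention over its visible keys, normalize to [0, 1], and average by query-key distance.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
attn = np.tril(rng.random((n, n)))             # stand-in causal attention head
attn = attn / attn.sum(axis=1, keepdims=True)  # rows sum to one

# normalized rank of the attention weight at each query-key distance
by_distance = {}
for q in range(1, n):
    ranks = attn[q, : q + 1].argsort().argsort() / q  # 0 = weakest, 1 = strongest
    for kpos in range(q + 1):
        by_distance.setdefault(q - kpos, []).append(ranks[kpos])

curve = np.array([np.mean(by_distance[d]) for d in sorted(by_distance)])
print(curve.shape)  # mean normalized rank at each distance 0..31
```

On a trained model, `attn` would come from one head on one sequence, then be averaged over heads and sequences as in Fig. 3.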

&lt;p&gt;Some recent studies have also observed that language models perform differently when trained on forward versus backward text. However, their findings or explanations are flawed or incomplete due to experimental setups that violate theoretical principles. We provide a detailed discussion in our &lt;a href=&quot;https://arxiv.org/abs/2505.08739&quot;&gt;preprint&lt;/a&gt;.&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/attn_weights_norm_ranks_by_distance_small_seed1.jpg&quot; title=&quot;Figure 3&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  &lt;b&gt;Figure 3: Positional bias in self-attention varies with training directions and layers (GPT-2 124M).&lt;/b&gt; Normalized attention rank (min = 0, max = 1) is plotted as a function of token distance within the context, averaged across heads, sampled sequences, and layers. Compared to models at initialization (Init), forward (Fwd) and backward (Bwd) trained models show strong positional biases toward both nearby tokens and tokens at maximal distance, with the degree of bias varying across layers. In contrast, the model trained on permuted text (Perm) displays distinct patterns, with positional bias generally decreasing as token distance increases across most layers.
&lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;how-do-forward-and-backward-models-perform-on-benchmarks&quot;&gt;How do forward and backward models perform on benchmarks?&lt;/h3&gt;
&lt;p&gt;We’ve shown that our models differ in their perplexity and attention patterns. But how different are they when put to a real test? To find out, we evaluated them on BrainBench, a benchmark that challenges models and human experts to predict the outcomes of neuroscience experiments (see the original Nature Human Behaviour paper &lt;a href=&quot;https://www.nature.com/articles/s41562-024-02046-9&quot;&gt;here&lt;/a&gt;). BrainBench presents pairs of study abstracts—one real, one altered to change the results while still sounding plausible—and asks which is correct. This task tests a model’s ability to spot patterns in complex scientific texts, making it a perfect fit for our models trained on neuroscience literature.&lt;/p&gt;

&lt;p&gt;The results? Our forward and backward models performed remarkably similarly, showing that their differences in training order don’t significantly impact their ability to predict neuroscience outcomes. Importantly, both models, especially at larger sizes (like our 774M-parameter GPT-2), rivaled and often surpassed human experts.&lt;/p&gt;

&lt;p&gt;This finding speaks to a bigger debate: are large language models (LLMs) good models of human language learning? Humans learn language in a forward, meaningful order, but our models learned effectively from forward and backward text, and even, to some extent, from shuffled sequences. This suggests LLMs are perhaps not just mimicking human language but are general learning machines, capable of capturing predictive patterns in any data, even when it doesn’t follow human-like structure. Their success on BrainBench, especially for backward models, mirrors how LLMs excel in non-human-language domains—like scientific data or code—where patterns don’t always resemble natural language. This versatility challenges the idea that LLMs are limited to human-like learning.&lt;/p&gt;

&lt;h3 id=&quot;what-this-means-for-llms&quot;&gt;What This Means for LLMs?&lt;/h3&gt;
&lt;p&gt;Our findings reveal a gap between theory and practice. Theoretically, the order of tokens shouldn’t affect a model’s perplexity, but in reality, LLMs are sensitive to how sequences are presented. These deviations could signal deeper issues, like untrustworthy outputs or even hallucinations—when models generate convincing but incorrect information. Understanding these biases helps us build more reliable models. Here, we’ve shown that training sibling models on the same data provides a way to evaluate how internally consistent LLMs are in terms of their inferred probabilities.&lt;/p&gt;

&lt;p&gt;For full details and extended results, check out our &lt;a href=&quot;https://arxiv.org/abs/2505.08739&quot;&gt;preprint&lt;/a&gt;, &lt;a href=&quot;https://github.com/braingpt-lovelab/backwards&quot;&gt;code&lt;/a&gt;, and &lt;a href=&quot;https://huggingface.co/llm-probability&quot;&gt;model weights&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Tue, 27 May 2025 00:00:00 +0000</pubDate>
        <link>http://bradlove.org/blog/prob-llm-consistency</link>
        <guid isPermaLink="true">http://bradlove.org/blog/prob-llm-consistency</guid>
      </item>
    
      <item>
        <title>The Psychology of Persuasion</title>
        <description>&lt;p&gt;When we are on the right side of an argument, most of us believe presenting the facts and supporting evidence should be enough to persuade others. Instead, we are baffled when friends and family continue to vote for policies that run counter to their interests or pour the milk before the tea. Presenting evidence is not enough for persuasion because people are motivated reasoners driven by their core values and community membership. Rather than weight the evidence in an unbiased fashion, people construct narratives or stories to understand themselves and the world.&lt;/p&gt;

&lt;h3 id=&quot;decision-making-as-story-telling&quot;&gt;Decision Making as Story Telling&lt;/h3&gt;

&lt;p&gt;Imagine a jury sitting on a murder trial. The jurors aren’t weighing the probabilities of all the possible scenarios, taking all the evidence into account. Instead, they are considering whether the story told by the prosecutor or defence is more coherent and persuasive. Once they settle on a story, evidence is interpreted in light of that narrative. Oddly, a story can be more compelling when it focuses only on its strongest points rather than cataloguing every piece of supporting evidence.&lt;/p&gt;

&lt;p&gt;In our personal lives, we also tell stories about ourselves. We aren’t going to be receptive to information that conflicts with our personal story, such as being told we are racist. We also understand our actions through story telling. For example, we come to like what we purchase in the supermarket rather than simply purchase what we like. After all, why would we buy and eat something that we didn’t like? When it becomes difficult to explain a choice, such as when confronted with an aisle full of different jams that all would do, we can lapse into inaction.&lt;/p&gt;

&lt;h3 id=&quot;the-story-teller-matters&quot;&gt;The Story Teller Matters&lt;/h3&gt;

&lt;p&gt;We are rarely persuaded by our enemies. Common ground and shared values are lubricants for persuasion. For example, someone denying climate change in the presence of overwhelming evidence may do so because of broader motivations, such as fearing increased government regulations and being forced to give up their car. Someone on the same “team” who shares these values and goals is best positioned to make the case for climate change, whereas an environmental campaigner who favours more socialist policies and rides a bike to work is likely to be discounted when presenting the same evidence. A blowback effect could even occur where the climate change denier takes the environmentalist’s “lies” as further evidence for the hoax whose true aim is to dismantle their way of life. People tend to follow community norms.&lt;/p&gt;

&lt;h3 id=&quot;persuasion-to-action&quot;&gt;Persuasion to Action&lt;/h3&gt;

&lt;p&gt;Persuading someone does not guarantee action. For example, many people support politicians but don’t vote. To translate beliefs into actions, people need specific plans and triggers. A potential voter would need to reserve time in their diary and arrange transport to the polling station. Action happens when the environment supports it. Indeed, the basic idea of Nudge is not to persuade per se, but to make it easier for people to make the “right” choice, such as when organ donation is the default option. Like persuasion, action is not all about education. Facts matter, but sadly not as much as we would like to believe.&lt;/p&gt;

</description>
        <pubDate>Fri, 16 Jul 2021 00:00:00 +0000</pubDate>
        <link>http://bradlove.org/blog/persuasion</link>
        <guid isPermaLink="true">http://bradlove.org/blog/persuasion</guid>
      </item>
    
      <item>
        <title>A neuroscience-inspired approach to transfer learning</title>
        <description>&lt;p&gt;Inspired by the brain, we find a goal-directed attention approach to feature reuse bests a commonly used machine learning strategy (&lt;a href=&quot;https://arxiv.org/abs/2002.02342&quot;&gt;Luo et al., 2020&lt;/a&gt;). In particular, attentional modulation of mid-level features in deep convolutional neural networks is more effective than retraining the last layer to transfer to a new task.&lt;/p&gt;

&lt;p&gt;Neuroscience and machine learning have been enjoying a virtuous cycle in which advances in one field spur advances in the other. For example, deep convolutional neural networks (DCNNs) were motivated by the organisation of the visual cortex. In this blog, we highlight another success for neuroscience-inspired approaches, namely using goal-directed attention to repurpose an existing network for a new task.&lt;/p&gt;

&lt;h3 id=&quot;goal-directed-attention-in-humans&quot;&gt;Goal-directed attention in humans&lt;/h3&gt;
&lt;p&gt;When searching for one’s car keys, a sensible strategy is to prioritise small and metallic objects. Focusing on goal-directed features at the expense of irrelevant features can increase one’s chances of finding the target item. Instead of retraining one’s brain for this particular recognition task, people use goal-directed attention to modulate activity in their visual system.&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/topDownAttention3.png&quot; title=&quot;Figure 2 from topDown preprint.&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  Figure 2 from &lt;a href=&quot;https://arxiv.org/abs/2002.02342&quot;&gt;Luo et al., 2020&lt;/a&gt;: The absence of a strong top-down signal (left) to guide visual processing leads to uncertainty about what this confusing image depicts. In contrast, when there is an expectation that a dog is present (right) the visual system is reconfigured to be more sensitive and biased toward supporting information, which leads to successful recognition of the Dalmatian.
&lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;conventional-transfer-learning-in-machine-learning&quot;&gt;Conventional transfer learning in machine learning&lt;/h3&gt;
&lt;p&gt;In contrast, one popular method for transfer learning in machine learning is to remove the final layer of the DCNN and retrain it for the new task. As with the attentional approach, most aspects of the original network are preserved. For example, all the useful features previously learned could be reused for a task that prioritises finding one’s keys. To provide another &lt;a href=&quot;https://keras.io/guides/transfer_learning/#an-endtoend-example-finetuning-an-image-classification-model-on-a-cats-vs-dogs&quot;&gt;example&lt;/a&gt;, a DCNN model pre-trained on ImageNet could be fine-tuned into a cats-vs-dogs detector using very little data.&lt;/p&gt;
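&lt;p&gt;A sketch of that conventional recipe (random “features” stand in for frozen backbone activations, and we fit the new final layer by least squares rather than gradient descent, purely for brevity):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 512))  # frozen penultimate-layer activations
labels = rng.integers(0, 2, size=100)   # cats vs dogs
targets = np.eye(2)[labels]             # one-hot targets

# only the new final linear layer is fit; the backbone stays untouched
w, *_ = np.linalg.lstsq(features, targets, rcond=None)
preds = (features @ w).argmax(axis=1)
print((preds == labels).mean())  # training accuracy of the new head
```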

&lt;h3 id=&quot;an-alternative-approach-goal-directed-attention&quot;&gt;An alternative approach: goal-directed attention&lt;/h3&gt;
&lt;p&gt;Goal-directed attention and transfer learning approaches reuse existing features, but there is a critical difference. In the brain, goal-directed attention primarily operates at mid- to late-stages of the ventral visual stream. Our networks with goal-directed attention operate similarly. In contrast, transfer learning adjusts features at the very end of a DCNN. How does a neuroscience-inspired approach compare to the standard machine learning approach?&lt;/p&gt;

&lt;p&gt;Here, we describe a study in which we incorporate goal-directed attention into the mid-level of a DCNN and use it as an alternative to the transfer learning approach. Results from three object recognition tasks favour the neuroscience-inspired approach both in terms of performance and ability to scale.&lt;/p&gt;

&lt;h3 id=&quot;incorporating-goal-directed-attention-in-dcnn&quot;&gt;Incorporating goal-directed attention in DCNN&lt;/h3&gt;
&lt;p&gt;In cognitive neuroscience, goal-directed attention is a mechanism that emphasises or de-emphasises features based on their task relevance. This is often formalised as the stretching and contracting of psychological feature dimensions.&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/size_albedo_intro.png&quot; title=&quot;Figure 1 from topDown preprint.&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  Figure 1 from &lt;a href=&quot;https://arxiv.org/abs/2002.02342&quot;&gt;Luo et al., 2020&lt;/a&gt;: Attention alters the importance of feature dimensions. Four kitchen objects vary on two feature dimensions: albedo and size. In this example, albedo is the attended dimension (hence stretched) whereas attention to size is tuned down (hence compressed). Consequently, the key becomes more similar to the silver toaster than to the chopping board or salt shaker.
&lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
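The stretching and contracting of dimensions can be written as a weighted distance computation. Here is a minimal sketch with made-up albedo and size values; the feature values and attention weights are illustrative, not taken from the paper:

```python
import math

# Hypothetical feature values on two dimensions: (albedo, size).
objects = {
    "key":            (0.9, 0.1),
    "silver toaster": (0.8, 0.6),
    "chopping board": (0.2, 0.5),
    "salt shaker":    (0.3, 0.2),
}

def weighted_distance(x, y, weights):
    """Euclidean distance with per-dimension attention weights.

    A large weight stretches a dimension (differences matter more);
    a small weight compresses it (differences matter less)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

# Attend to albedo (weight 1.0) and tune down size (weight 0.1).
attn = (1.0, 0.1)
key = objects["key"]
dists = {name: weighted_distance(key, feats, attn)
         for name, feats in objects.items() if name != "key"}

# With attention on albedo, the key ends up closest to the silver toaster.
print(min(dists, key=dists.get))
```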

&lt;p&gt;To incorporate this principle into DCNN models, we introduce a goal-directed attention layer at the mid-level of a pre-trained DCNN that can direct its focus on a set of features based on their goal relevance.&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/attention_layer.png&quot; title=&quot;Figure 4 from topDown preprint.&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  Figure 4 from &lt;a href=&quot;https://arxiv.org/abs/2002.02342&quot;&gt;Luo et al., 2020&lt;/a&gt;: Integration of Attention Layer with VGG-16. The attention layer has the same shape as the output representation of the preceding layer but is constrained such that a single filter value is used across all spatial locations. The attention operation is carried out as a Hadamard product between the pre-attention activations and the attention weights. As the bottom panel shows, a previously highly activated filter can be tuned down by a small attention weight (colour from dark to bright), whereas a previously barely activated filter can become highly activated through attention re-weighting (colour from bright to dark).
&lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
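The attention operation itself is just a broadcast Hadamard product: one weight per filter, shared across all spatial locations. A minimal NumPy sketch follows; the spatial size and weight values are illustrative, and in the actual model the layer sits inside VGG-16 rather than operating on random activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-attention activations from a mid-level conv layer: (height, width, filters).
# A mid-level VGG-16 layer has 512 filters; the spatial size here is arbitrary.
activations = rng.random((14, 14, 512))

# One attention weight per filter, shared across all spatial locations,
# so only 512 tunable parameters in total.
attention = np.ones(512)
attention[:256] = 0.1   # tune these filters down
attention[256:] = 2.0   # amplify these filters

# Hadamard product, with the weight vector broadcast over space.
attended = activations * attention   # (14, 14, 512) * (512,) -> (14, 14, 512)
```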

&lt;h3 id=&quot;attention-beats-convention&quot;&gt;Attention beats convention&lt;/h3&gt;
&lt;p&gt;Models trained on ImageNet using either approach are tested on three object recognition tasks involving standard ImageNet images, blended images, and natural adversarial images. Natural adversarial images exploit vulnerabilities in DCNNs, such as colour and texture biases (&lt;a href=&quot;https://arxiv.org/pdf/1907.07174.pdf&quot;&gt;Hendrycks et al., 2019&lt;/a&gt;).&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/eg_intro.png&quot; title=&quot;Figure 3 from topDown preprint.&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  Figure 3 from &lt;a href=&quot;https://arxiv.org/abs/2002.02342&quot;&gt;Luo et al., 2020&lt;/a&gt;: (Left) A standard image from ImageNet’s Tabby Cat category (&lt;a href=&quot;http://www.image-net.org/papers/imagenet_cvpr09.pdf&quot;&gt;Deng et al., 2009&lt;/a&gt;). (Middle) A blended image by alpha-blending an image of a cat and an image of a dog. (Right) A natural adversarial image of a dragonfly misclassified as banana by DenseNet-121 with high confidence (&lt;a href=&quot;https://arxiv.org/pdf/1907.07174.pdf&quot;&gt;Hendrycks et al., 2019&lt;/a&gt;).

&lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;All three tests follow the same procedure involving both target and non-target images. For example, when testing a model dedicated to detecting Chihuahuas, an equal number of Chihuahua and non-Chihuahua images are used to tune the network. For each model, we assess performance using signal detection theory.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2002.02342&quot;&gt;We found&lt;/a&gt; that the goal-directed attention approach generally outperformed (i.e., higher $d^\prime$) the widely used transfer learning approach in all three tasks.&lt;/p&gt;
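For readers unfamiliar with signal detection theory, $d^\prime$ is the separation between the z-scored hit rate and false-alarm rate. A quick sketch using the standard library, with hypothetical rates rather than values from the paper:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Hypothetical Chihuahua detector: 80% hits on Chihuahua images,
# 20% false alarms on non-Chihuahua images.
print(round(d_prime(0.80, 0.20), 3))  # → 1.683
```

Chance performance (equal hit and false-alarm rates) gives $d^\prime = 0$; larger values mean better discrimination of targets from non-targets.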

&lt;p&gt;One explanation is that even though the attention layer had far fewer tunable parameters than the retraining approach ($512$ vs. $4,096,000$), the cascading effects through subsequent network layers provided the needed flexibility to match the task goal. The results suggest that this neuroscience-inspired approach can enable the model to more effectively adapt to new tasks at a relatively low cost. Additionally, since each attention weight has a unique correspondence to an entire feature map from the preceding layer, this goal-directed mechanism can potentially be more interpretable than the fully connected weights.&lt;/p&gt;

</description>
        <pubDate>Thu, 22 Oct 2020 00:00:00 +0000</pubDate>
        <link>http://bradlove.org/blog/attention</link>
        <guid isPermaLink="true">http://bradlove.org/blog/attention</guid>
      </item>
    
      <item>
        <title>Model-based fMRI giveth and taketh away</title>
        <description>&lt;p&gt;What’s better than fMRI or cognitive modelling? Of course, their combination in the form of &lt;a href=&quot;https://doi.org/10.1016/j.jmp.2016.01.001&quot;&gt;model-based fMRI&lt;/a&gt;! Rather than evaluating simple contrasts based on the experimental design, such as where in the brain lights up more for houses vs. faces, model-based fMRI evaluates proposed cognitive processes and representations.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll first consider an example of how model-based fMRI reveals aspects of brain activity that would not easily be found by standard methods. Then, we’ll get to the main story and share a new finding with you from a large-scale neuroeconomics study called &lt;a href=&quot;//narps.info&quot;&gt;NARPS (Neuroimaging Analysis Replication and Prediction Study&lt;/a&gt;; &lt;a href=&quot;https://www.biorxiv.org/content/10.1101/843193v1.abstract?%3Fcollection=&quot;&gt;bioRxiv preprint)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;NARPS evaluated how the many possible ways one can analyse fMRI data can affect the conclusions researchers draw. Seventy different research labs, including our own, signed up to partake in this endeavour and were given a few months to independently complete the analyses. In NARPS, the researchers were in a sense the study participants — NARPS asked whether the analysis choices researchers make shape basic scientific conclusions.&lt;/p&gt;

&lt;p&gt;We went one step further and analysed the data two different ways ourselves. One analysis was fairly standard (what we thought most teams would do) whereas the other approach was model-based. As you will see below, many effects found in the traditional analysis were deemed spurious in the model-based analysis. Before jumping into this main story, we’ll consider an example in which model-based analysis allowed for a discovery that would not be possible with traditional methods. Model-based analysis giveth and taketh away!&lt;/p&gt;

&lt;h3 id=&quot;model-based-analysis-giveth&quot;&gt;Model-Based analysis giveth&lt;/h3&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/figure3_DavisEtal.png&quot; title=&quot;Figure 3 from Davis et al. (2012) Striatal and Hippocampal Entropy and Recognition Signals in Category Learning: Simultaneous Processes Revealed by Model-Based fMRI.&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  Figure 3 from &lt;a href=&quot;https://doi.org/10.1037/a0027865&quot;&gt;Davis et al. (2012)&lt;/a&gt;: Illustrations of the model-based measures used to characterise the (fMRI) BOLD response. Brain regions associated with the cognitive model&apos;s recognition strength measure are depicted in red, and brain regions associated with the category match (measured in terms of entropy) are depicted in cyan. The bottom panel represents the predicted shape of each model-based regressor for the two item-types over the course of the experiment. For the model-based measures, the predicted pattern for exception trials is given in red, and the predicted pattern for rule-following trials is given in green. Model-based analysis allows two simultaneously occurring cognitive processes to be localised.
&lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In learning studies, the time course of how representations are acquired and updated is critical. Model-based analyses, in which a cognitive model is fit to behavioural data, can be used to capture such changes across trials. In the figure above, a category learning model was fit to behaviour and then internal model measures of item recognition and category match were extracted and used to analyse the (fMRI) &lt;a href=&quot;https://doi.org/10.1037/a0027865&quot;&gt;BOLD response in the hippocampus.&lt;/a&gt; In other &lt;a href=&quot;https://doi.org/10.1093/cercor/bhr036&quot;&gt;studies of learning&lt;/a&gt;, model-based fMRI allowed processes in different trial phases (decision vs. feedback processing) to be isolated. The cognitive models made it possible to quantify these hypothesised mental operations that were not directly observable.&lt;/p&gt;

&lt;p&gt;These examples focus on univariate relationships between the brain and model measure, but it is also possible to analyse patterns of activity, such as how &lt;a href=&quot;https://www.pnas.org/content/113/46/13203&quot;&gt;internal representations in the model parallel patterns of activity across voxels in the brain&lt;/a&gt;. Of course, any model-based analysis is only as good as the model. The cognitive model should be supported by previous work, including evaluation in behavioural studies. Decoding methods can also be used to test &lt;a href=&quot;https://doi.org/10.1016/j.cub.2013.08.035&quot;&gt;which of a set of competing models is most consistent with the BOLD response.&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;model-based-analysis-taketh-away&quot;&gt;Model-based analysis taketh away&lt;/h3&gt;

&lt;p&gt;The previous examples of model-based analysis revealed effects that would not otherwise be observable. Model-based analysis can also “remove” effects that are probably misleading (i.e., false alarms).&lt;/p&gt;

&lt;p&gt;In the &lt;a href=&quot;//narps.info&quot;&gt;NARPS&lt;/a&gt; project, our team conducted a model-based analysis that yielded some results at odds with the standard approach. This difference is germane to the goal of NARPS: examining how the many possible data analysis pipelines for fMRI data, as carried out by different groups of researchers, affect our scientific conclusions. We were one of the seventy independent teams that made NARPS possible.&lt;/p&gt;

&lt;p&gt;Regarding the data itself, over a hundred participants engaged in a standard decision making task (like &lt;a href=&quot;https://doi.org/10.1126/science.1134239&quot;&gt;Tom et al., 2007&lt;/a&gt; and &lt;a href=&quot;https://doi.org/10.1073/pnas.0910230107&quot;&gt;De Martino et al., 2010&lt;/a&gt;); for more information, see &lt;a href=&quot;https://www.narps.info/analysis.html&quot;&gt;NARPS Data &amp;amp; Analysis&lt;/a&gt; or the &lt;a href=&quot;https://www.biorxiv.org/content/10.1101/843193v1.abstract?%3Fcollection=&quot;&gt;bioRxiv preprint&lt;/a&gt;. While in the scanner, participants had to accept or reject gambles in the form of unbiased coin flips; each gamble could result in either gains or losses. Participants were either in a group where the gambles were calibrated for loss aversion (equal indifference) or not (equal range).&lt;/p&gt;

&lt;p&gt;To get at the variability in analysing fMRI data, teams performed whole-brain corrected analyses and submitted their binary (yes/no) decisions regarding nine hypotheses for specific contrasts related to previous work (&lt;a href=&quot;https://doi.org/10.1126/science.1134239&quot;&gt;Tom et al., 2007&lt;/a&gt;; &lt;a href=&quot;https://doi.org/10.1073/pnas.0910230107&quot;&gt;De Martino et al., 2010&lt;/a&gt;; &lt;a href=&quot;https://doi.org/10.1523/JNEUROSCI.0497-13.2013&quot;&gt;Canessa et al., 2013&lt;/a&gt;; &lt;a href=&quot;https://doi.org/10.1016/j.neuroimage.2016.11.050&quot;&gt;Canessa et al., 2017&lt;/a&gt;). The hypotheses are presented in the following table along with four columns: our expected results before analysing the data, our model-absent results (i.e., gains and losses only), our model-present results (i.e., gains, losses, and decision entropy — explained below), and prediction market results. After the submission deadline (end of February), prediction markets were organised for all hypotheses (similar to &lt;a href=&quot;https://doi.org/10.1126/science.aaf0918&quot;&gt;Camerer et al., 2016&lt;/a&gt;; &lt;a href=&quot;https://doi.org/10.1038/s41562-018-0399-z&quot;&gt;Camerer et al., 2018&lt;/a&gt;; &lt;a href=&quot;https://doi.org/10.1073/pnas.1516179112&quot;&gt;Dreber et al., 2015&lt;/a&gt;) for a little over a week at the beginning of May. The researcher prediction market closed with the values seen in the table (arbitrary token units), which were highly correlated with the fundamental values reported in the &lt;a href=&quot;https://www.biorxiv.org/content/10.1101/843193v1.abstract?%3Fcollection=&quot;&gt;preprint&lt;/a&gt;.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;Expected&lt;/th&gt;
      &lt;th&gt;Gains &amp;amp; Losses only&lt;/th&gt;
      &lt;th&gt;Gains, Losses, &amp;amp; Decision Entropy&lt;/th&gt;
      &lt;th&gt;Researcher Prediction Markets&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Parametric effect of gain&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;. Positive effect in ventromedial prefrontal cortex (vmPFC) - for the equal indifference group&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
      &lt;td&gt;0.814&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;. Positive effect in vmPFC - for the equal range group&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
      &lt;td&gt;0.753&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;. Positive effect in ventral striatum (VS) - for the equal indifference group&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
      &lt;td&gt;0.743&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;. Positive effect in VS - for the equal range group&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;0.789&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Parametric effect of loss&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;. Negative effect in vmPFC - for the equal indifference group&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;0.952&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;6&lt;/strong&gt;. Negative effect in vmPFC - for the equal range group&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;0.805&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;7&lt;/strong&gt;. Positive effect in amygdala - for the equal indifference group&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
      &lt;td&gt;0.073&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;8&lt;/strong&gt;. Positive effect in amygdala - for the equal range group&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
      &lt;td&gt;0.274&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;9&lt;/strong&gt;. Greater positive response to losses in amygdala for equal range condition vs. equal indifference condition.&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
      &lt;td&gt;0.188&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h4 id=&quot;our-expectations-and-model-absent-analysis&quot;&gt;Our expectations and model-absent analysis&lt;/h4&gt;

&lt;p&gt;Given that the previous literature shows strong effects of value in both VS and vmPFC, we suspected that the majority of teams would answer “yes” for hypotheses 1-6. As for the hypotheses related to effects in the amygdala (hypotheses 7-9), we were indifferent given the conflicting findings for this area (e.g., &lt;a href=&quot;https://doi.org/10.1126/science.1134239&quot;&gt;Tom et al., 2007&lt;/a&gt; and &lt;a href=&quot;https://doi.org/10.1073/pnas.0910230107&quot;&gt;De Martino et al., 2010&lt;/a&gt;). Our initial expectations were then updated based on directly regressing gain and loss values from the experimental design onto the blood-oxygen-level dependent (BOLD) signal. However, below we explain what we think is the better model after including an additional term (i.e., inverse decision entropy), estimated from behaviour, that captures something akin to decision confidence in this task.&lt;/p&gt;

&lt;h4 id=&quot;our-model&quot;&gt;Our model&lt;/h4&gt;

&lt;p&gt;For our reported results, we first estimated parameters from a simple cognitive model fit to behaviour: a logistic regression with an intercept and separate terms for gains and losses, predicting whether each gamble was accepted or rejected. From the model’s predictions $p_{\mathrm{accept}}$, we were also able to calculate the inverse decision entropy:&lt;/p&gt;

\[\mathrm{iDE} = p_{\mathrm{accept}} \log_2(p_{\mathrm{accept}}) + p_{\mathrm{reject}} \log_2(p_{\mathrm{reject}})\]

&lt;p&gt;for each gamble (see figure below). (We use inverse decision entropy because it aligns with intuitive notions of decision confidence.) Second, the BOLD model consisted of an intercept, gains, losses, and inverse decision entropy, as well as an assortment of standard movement nuisance regressors (i.e., rotations, translations, and framewise displacement).&lt;/p&gt;
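A minimal sketch of this behavioural model follows; the logistic coefficients here are made up for illustration (the actual values were fit to each participant's choices):

```python
import math

def p_accept(gain, loss, b0=-1.0, b_gain=0.3, b_loss=-0.6):
    """Logistic model of gamble acceptance.

    Coefficients are illustrative, not the fitted values from the study."""
    sv = b0 + b_gain * gain + b_loss * loss  # linear combination of gains/losses
    return 1.0 / (1.0 + math.exp(-sv))

def inverse_decision_entropy(p):
    """iDE = p*log2(p) + (1-p)*log2(1-p): negative Shannon entropy.

    Highest (0 bits) for confident decisions (p near 0 or 1);
    lowest (-1 bit) for a toss-up, where p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(p) + (1 - p) * math.log2(1 - p)

# A toss-up gamble yields the minimum iDE of -1;
# a lopsided gamble yields an iDE approaching 0.
print(inverse_decision_entropy(0.5))
```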

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/narps_figure_1_ide.png&quot; title=&quot;Model-based fMRI&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
  Behavioural model and task. Three equations describe the behavioural model in &lt;b&gt;a&lt;/b&gt;) where subjective value (SV) is a weighted combination of gains and losses, $p_{\mathrm{accept}}$ is the probability of accepting a gamble, and inverse decision entropy (iDE) is the negative Shannon entropy of $p_{\mathrm{accept}}$ and its complement $p_{\mathrm{reject}}$. In &lt;b&gt;b&lt;/b&gt;) $p_{\mathrm{accept}}$ is plotted against subjective value to show how inverse decision entropy is highest at the tails of the sigmoid and bottoms out for middle values of SV, where $p_{\mathrm{accept}} = 0.5$. In &lt;b&gt;c&lt;/b&gt;) the 2x2 table shows four different trial types based on whether the trial presents low or high values for each of the variables of interest, SV and iDE. Also, portending a main result, each cell presents the percentage of voxels that show these specific conjunctions of effects.
&lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h4 id=&quot;our-results&quot;&gt;Our results&lt;/h4&gt;

&lt;p&gt;With respect to our model, which included inverse decision entropy as another term, we only found sufficient evidence to support hypotheses 4, 5 and 6. These results are surprising as the literature might lead one to expect that effects should be stronger for hypotheses 1-3. Instead, our model-based analysis’s inclusion of the entropy term led to these hypotheses not being supported. Had we not included inverse decision entropy in the model, we would have also answered affirmatively to hypotheses 1 and 3.&lt;/p&gt;

&lt;p&gt;The contrast in results for the standard and model-based analysis demonstrates the importance of model-based fMRI analyses in interpreting results. Unlike the previous cases considered, where model-based analysis revealed effects that would not otherwise be found, here including appropriate terms (e.g., entropy) led to effects no longer being observed. Given that uncertainty is a critical factor in this task, we believe that including this cognitive construct into the analysis provides a more accurate view on the data.&lt;/p&gt;

&lt;p&gt;Rather than being merely a more technical analysis, model-based fMRI should be seen as more conceptually correct when the cognitive model used captures important aspects of participants’ mental states. Although the model-based analysis we presented was very simple, it successfully leveraged the behavioural data to better understand the imaging data.&lt;/p&gt;

&lt;h3 id=&quot;model-based-analysis-giveth-part-ii&quot;&gt;Model-based analysis giveth: Part II&lt;/h3&gt;

&lt;p&gt;From one perspective, the model-based analyses, which included a measure of entropy for each gamble decision, rendered a number of value effects non-significant. This is likely a good thing as the standard parametric analysis did not take into account important cognitive processes related to confidence, such as the cognitive model’s entropy measure.&lt;/p&gt;

&lt;p&gt;From another perspective, the model-based analyses revealed a bunch of qualitatively new findings related to entropy. The entropy side of the story appears bigger and more exciting than the value one. To learn more about entropy and its relation to value, you can check out the poster we presented at SfN: &lt;a href=&quot;/images/blog/poster_neuralEntropy_SFN.pdf&quot;&gt;&lt;em&gt;The neural link between subjective value and decision entropy&lt;/em&gt;&lt;/a&gt;, or better yet, our &lt;a href=&quot;https://doi.org/10.1101/2020.02.18.954362&quot;&gt;bioRxiv preprint&lt;/a&gt;. There we focus on the importance of decision entropy with respect to subjective value (as opposed to gains and losses separately, as we have reported here).&lt;/p&gt;

&lt;p&gt;Finally, we would like to thank the participants in the study, all the members of the 70 participating labs, and the NARPS organisers. In addition to providing such a fine dataset, a formal assessment of variability in the fMRI processing pipeline was long overdue. Organising such projects is a lot of work, but it needs to be done. We hope this blog helps advance the aims of understanding how analysis choices affect scientific conclusions.&lt;/p&gt;
</description>
        <pubDate>Mon, 18 Nov 2019 00:00:00 +0000</pubDate>
        <link>http://bradlove.org/blog/narps</link>
        <guid isPermaLink="true">http://bradlove.org/blog/narps</guid>
      </item>
    
      <item>
        <title>Fast food science is a shit sandwich</title>
<description>&lt;p&gt;When it comes to technology and communication, faster is usually considered better. For example, test pilot Chuck Yeager showed &lt;a href=&quot;https://en.wikipedia.org/wiki/The_Right_Stuff_(book)&quot;&gt;“The Right Stuff”&lt;/a&gt; by being the first person to break the sound barrier. We celebrate computer chips becoming faster. In the age of the internet, many people view real-time interactive communication, such as on Twitter (more on this later!), as highly desirable. However, faster is not always better. Resisting the obvious and puerile &lt;a href=&quot;https://www.youtube.com/watch?v=aIWrFNDKQ6o&quot;&gt;joke&lt;/a&gt;, fast food is a clear example of faster not being best – it has its place, but no one with any taste or class would argue that it’s the highest-quality food, nor fit for occasions that are about more than convenience.&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;
  &lt;img src=&quot;/images/blog/donald-trump-fast-food.jpg&quot; title=&quot;Let them eat cold fast food&quot; class=&quot;u-max-full-width centered&quot; /&gt;
  &lt;figcaption&gt;
    &lt;div class=&quot;inner-caption centered&quot;&gt;An &lt;a href=&quot;https://www.rollingstone.com/politics/politics-news/trump-fast-food-white-house-779128/&quot;&gt; embarrassment&lt;/a&gt; of riches, on &lt;a href=&quot;https://Twitter.com/realDonaldTrump&quot;&gt;Twitter&lt;/a&gt;.
    &lt;/div&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Yet, somehow when it comes to science, where one would think reflection and deep thought would be prized, a lot of the community seems to have moved toward the fast food model of thought. The ethos is that instant commenting and evaluation is somehow expediting science. It’s not, much like how the public is not better informed by virtue of the 24/7 news cycle. The science case is actually more insidious than sound-bite journalism as the scientists themselves are the ones who shape the story in their own &lt;a href=&quot;https://Twitter.com/ProfData/status/1096770650168016898&quot;&gt;ahistorical echo chambers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I experienced this recently when a prominent journal reviewer, who we believe majorly lost the plot by mistaking our theory paper for a methods paper, posted his negative review without our consent as a blog post (copycats followed) shortly after we received the journal rejection (see &lt;a href=&quot;http://bradlove.org/blog/open-review&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;http://bradlove.org/blog/open-review-2&quot;&gt;here&lt;/a&gt; for a discussion). This &lt;a href=&quot;https://everythinghertz.com/76&quot;&gt;podcast&lt;/a&gt; has also been recommended to me on the issue.&lt;/p&gt;

&lt;p&gt;Within moments, we were obliged to respond to defend ourselves. I spent years working on a project and somehow found myself completing and posting a &lt;a href=&quot;http://bradlove.org/blog/open-review&quot;&gt;public blog response&lt;/a&gt; within an hour, which was absurd. Fortunately, we got it right, and our &lt;a href=&quot;https://www.biorxiv.org/content/10.1101/439893v2&quot;&gt;revision&lt;/a&gt; cements that case, which is how it usually goes when one spends years thinking about a project and critics don’t have that luxury.&lt;/p&gt;

&lt;p&gt;What struck me was that there was no actual scientific discourse on Twitter, and there couldn’t be under these conditions for the type of work we do. It was chaotic. It was bizarre. We were &lt;a href=&quot;https://en.wikipedia.org/wiki/Tone_policing&quot;&gt;tone policed&lt;/a&gt; by a &lt;a href=&quot;https://Twitter.com/siminevazire/status/1083533474332430336&quot;&gt;social psychologist&lt;/a&gt; who didn’t seem to understand (&lt;a href=&quot;https://www.youtube.com/watch?v=0Uc4DI-BF28&quot;&gt;skin in the game?&lt;/a&gt;) the situation but sure had to pick a side, all while ignoring the existence of the early-career-researcher (ECR) &lt;a href=&quot;https://Twitter.com/ProfData/status/1083546572988780550&quot;&gt;lead author&lt;/a&gt;. We had no time to digest people’s points and chart a response. Scientific debate on Twitter is akin to politicians trying to score points in an American-style Presidential debate. We scored our points for sure, but it’s not a game that we are interested in playing.&lt;/p&gt;

&lt;p&gt;Social media can be a cesspool open to abuse, standing in stark contrast to open review models (all of which involve consent) at journals like &lt;a href=&quot;https://elifesciences.org/articles/21397#SA2&quot;&gt;eLife&lt;/a&gt; and computer science conferences like &lt;a href=&quot;https://openreview.net/group?id=ICLR.cc/2019/Conference&quot;&gt;ICLR&lt;/a&gt;. In our view, science doesn’t truly progress from takedowns and hit-and-runs, but from people thinking deeply about what they are doing, often in light of feedback from others when it can be fully and deeply processed. Open review should not be someone’s internet graffiti.&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;&lt;img src=&quot;/images/blog/usa_for_croatia_2001.jpg&quot; title=&quot;Brad learns about open review while on holiday&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
&lt;div class=&quot;inner-caption centered&quot;&gt;
My first experience with poorly executed &quot;open review&quot; from decades ago, as not currently practiced in &lt;a href=&quot;https://openreview.net/group?id=ICLR.cc/2019/Conference&quot;&gt;computer science&lt;/a&gt; or in &lt;a href=&quot;https://elifesciences.org/articles/21397#SA2&quot;&gt;quality journals&lt;/a&gt;. Hate the game.
&lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;When we had time to dissect the attack blogs, which we did solely out of thoroughness as we did not find the points particularly relevant, we discovered that our main critic’s pet measure that supposedly is His gift to us wasn’t even really suited to our approach, which relies on rank information. Still, others parrot the points of our critic’s musings absent thought, like that similarity and classifier functions are one-and-the-same because there exist examples of both that compute covariance information, which is the logical equivalent of concluding that two distinct species that both eat bugs under certain circumstances are one-and-the-same.&lt;/p&gt;

&lt;p&gt;The point here is that, while it took us only moments to appreciate that this critic missed our global point, it took weeks to appreciate that even the specifics were off the mark. Yet, people in moments were firing away opinions as facts (mostly by parroting one person’s views), lecturing us by tweet how science works, and telling us to sit back and enjoy it to make the most of it, which was all &lt;a href=&quot;https://www.nytimes.com/1988/04/27/sports/knight-is-criticized-over-rape-remark.html&quot;&gt;rather&lt;/a&gt; &lt;a href=&quot;https://www.nytimes.com/1990/03/26/us/texas-candidate-s-comment-about-rape-causes-a-furor.html&quot;&gt;rapey&lt;/a&gt; and &lt;a href=&quot;https://www.anxiety.org/psychology-of-dictators-power-fear-anxiety&quot;&gt;authoritarian&lt;/a&gt;. It would be bad enough if confined to the Twitterverse, but this garbage thinking sticks and colours discourse, much like fake news does.&lt;/p&gt;

&lt;p&gt;Twitter can exacerbate conflict through its dark triad of instant interaction and feedback, brief responses, and occasional mob dynamics. Some seem to think they are owed a response because they questioned you or went to the trouble of writing a blog about you, or a blog about a blog, and so on to the Kleene star. Here’s a hint: no one is obligated to respond to others’ hot takes. It is not a sign of strength or intellectual integrity to wade through the morass. Not everything dignifies a response, and even when something does, it is not necessarily worth one’s time. Our preferred response is our &lt;a href=&quot;https://www.biorxiv.org/content/10.1101/439893v2&quot;&gt;revised preprint&lt;/a&gt;. Often, insta-responding is a poor use of time, and the people on the receiving end usually aren’t really processing what is said anyway. These rapid interactions favour narcissist bullshit merchants, who are exactly the folks you don’t want running a field. Dealing with them is &lt;a href=&quot;https://en.wikipedia.org/wiki/Gish_gallop&quot;&gt;effectively a Denial of Service (DoS) attack&lt;/a&gt; on actual thought, which is not expediting science. The experience can be consuming in the moment, but the half-life of thoughts on Twitter is brief.&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/violentDelights.gif&quot; title=&quot;Full of sound and fury, signifying nothing.&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
&lt;div class=&quot;inner-caption centered&quot;&gt;
In this quote from Shakespeare, violent means sudden. Yeah, &lt;a href=&quot;https://www.youtube.com/watch?v=pJS5sce8OeQ&quot;&gt;sure it does&lt;/a&gt;.
&lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Of course, media like Twitter can play positive roles in science, such as providing a means, albeit with a biased sample, to learn about recent work and people’s views and to meet new people. I have learned a lot from people sharing information in direct messages (DMs) as well, and then there is the light-hearted &lt;a href=&quot;https://Twitter.com/nathanieldaw/status/1096408932673880065&quot;&gt;banter&lt;/a&gt;. In contrast, real-time debate, especially when it’s a takedown of a particular person or paper, is unlikely to have valuable content.&lt;/p&gt;

&lt;p&gt;It might be that in science you get to choose two from FAST, PROPERLY EXECUTED, INNOVATIVE. In my estimation, the last attribute is what is often lacking and what often goes underappreciated. I am advocating for giving scientists the opportunity to actually think. That’s why I got into this line of work. Sometimes it means sitting silently and thinking something through for a couple of days and eventually getting it right after repeatedly doing so for months. In the open-office plan of science, where people are expected to engage in instant debates formulated along the wrong dimensions, I just don’t see anything very deep or useful being produced. So, here’s to something better than cold fast food at the science banquet. Choose your dining partners carefully.&lt;/p&gt;
</description>
        <pubDate>Fri, 22 Feb 2019 00:00:00 +0000</pubDate>
        <link>http://bradlove.org/blog/fast-food-science</link>
        <guid isPermaLink="true">http://bradlove.org/blog/fast-food-science</guid>
      </item>
    
      <item>
        <title>Sebastian&apos;s Thoughts on Open Review</title>
        <description>&lt;p&gt;My name is Sebastian Bobadilla-Suarez and I am an early career researcher (ECR — postdoc’ing in the Love Lab). I did my PhD with Brad Love at UCL as well. This post is about recent events regarding the review process of our manuscript titled &lt;a href=&quot;https://doi.org/10.1101/439893&quot;&gt;&lt;em&gt;Measures of neural similarity&lt;/em&gt;&lt;/a&gt;. Our manuscript was submitted to a prestigious journal and went through a formal review process. It was rejected by the reviewers, which is fine, but one of the reviewers decided to post &lt;a href=&quot;https://nikokriegeskorte.org/2019/01/09/whats-the-best-measure-of-representational-dissimilarity/&quot;&gt;his review on his own blog&lt;/a&gt;. This was problematic for several reasons, &lt;a href=&quot;http://bradlove.org/blog/open-review&quot;&gt;see here for Brad’s response&lt;/a&gt;. Sam Schwarzkopf also shared &lt;a href=&quot;https://neuroneurotic.net/2019/01/10/an-open-review-of-open-reviewing/&quot;&gt;his take&lt;/a&gt; too.&lt;/p&gt;

&lt;p&gt;Before I go on, I want to say that I am entirely in favor of open debate of ideas, open science, and fully deconstructing manuscripts. I fully encourage this. However, open science should not be used to maintain the status quo but to challenge it. Also, I really appreciate the time and effort that go into providing feedback on manuscripts, whether in a formal review process or not. Although we may not always agree with reviews of a manuscript, they are always welcome as useful in one way or another for improving the work.&lt;/p&gt;

&lt;p&gt;After reading &lt;a href=&quot;https://twitter.com/ProfData/status/1083004240711307265&quot;&gt;some of the threads on this&lt;/a&gt;, I’d like to give my two cents as first author. I was surprised to see how polarized the subject of open science can become. A lot of the discourse from certain individuals seems hopelessly &lt;a href=&quot;https://en.wikipedia.org/wiki/Manichaeism&quot;&gt;Manichaeistic&lt;/a&gt; (e.g., “I’m for open science, you’re not”). I am for open science, as I said above, but I am also for understanding how new and open science systems impact those lower in the scientific ranks. I would assume we are all pro open science as a default but still working on best practices, including practices pertaining to open review. To be clear, the motivation for this blog post is not to sidestep the points made in the original review. The goal here is to share my experience and perspective.&lt;/p&gt;

&lt;p&gt;I feel the review process has been tainted for this project; a project that I hold close to my heart as one of my favorites initiated during my PhD. This obviously makes me biased, but then again, who is going to stand up for my work if not me? I understand that uploading your manuscript to a preprint server invites informal comment and feedback, which is one of the reasons to do it in the first place, and as I said, I fully welcome and appreciate any and all comments on my work. However, posting a formal review as a blog post necessarily carries more weight than any feedback provided outside the formal review process, especially when posted by one of the leaders in the field. I did feel wronged by how one of our reviewers handled the process. The junior authors were not given the professional courtesy of notification, and none of the authors opted into this way of handling reviews — we had no opportunity to reply before rejection. Ultimately, is accepting to review a manuscript with the goal of eventually blogging about it (as opposed to improving it) a conflict of interest? I am still developing my opinions with respect to best practices in open science.&lt;/p&gt;

&lt;p&gt;I see that transparency can have pitfalls when the line between formal review and public debate has been blurred (especially at early stages on the rocky road to getting published). The fact that this line was blurred in a non-consensual fashion when best practices have not been cast into convention yet is unacceptable. Nonetheless, the fact that we in the open science community can discuss these cases, push back against dominant power structures, and set precedents will be beneficial moving forward.&lt;/p&gt;

&lt;p&gt;I hope there are lessons to be learned here in general and that in my specific case future reviewers may place appropriate weight on the posted review in question — avoiding the use of such a post as a heuristic to base their own opinions on, since I personally think the review misses the point of my preprint. Is it now impossible to claim that future reviews are somehow independent of each other given that the blog post in question was the outcome of a formal review process? I think it is now a case of how to appropriately contextualize them. Maybe proponents of open science, and open review specifically, have thought of such situations and how privileged voices can drown out the voices of more junior people, like me. My biggest hope for open science is that it will create a fairer and more accessible system for the new generation of scientists. This is only possible by having these debates on new ways of doing things.&lt;/p&gt;
</description>
        <pubDate>Thu, 10 Jan 2019 00:00:00 +0000</pubDate>
        <link>http://bradlove.org/blog/open-review-2</link>
        <guid isPermaLink="true">http://bradlove.org/blog/open-review-2</guid>
      </item>
    
      <item>
        <title>An Open Review of Niko Kriegeskorte</title>
        <description>&lt;p&gt;Imagine you think and work carefully on an &lt;a href=&quot;https://www.biorxiv.org/content/early/2018/10/12/439893&quot;&gt;ambitious paper for a few years&lt;/a&gt;, trying to answer fundamental questions the field has overlooked. Now, imagine after a very long wait you receive negative reviews that completely missed the main point. Instead, the reviews project a certain person’s pet concerns, goals, and interests onto your work, which are only tangentially related to your central questions. Worst yet, this person has acolytes whose capacity to miss the broader picture is only surpassed by their self-righteousness. That would be frustrating and potentially career-changing for some.&lt;/p&gt;

&lt;p&gt;Now, let’s add to this scenario that this reviewer personally emails you moments after your rejection, absent kind words (he got the memo that empathy is passé), demands that you agree with his viewpoint, and alerts you that he will post an &lt;a href=&quot;https://nikokriegeskorte.org/2019/01/09/whats-the-best-measure-of-representational-dissimilarity/&quot;&gt;open review of your work&lt;/a&gt;. This is inhuman.&lt;/p&gt;

&lt;p&gt;But, open is good right? It can be, but it can also be incredibly self-serving. The review is of course open when Niko wants it to be and it serves him. Was the review posted before the editor made the reject decision? Of course not, because then we could respond and potentially affect the decision. Was it posted after we had time to publish elsewhere? Of course not. It was posted at the darkest time for my team when we are in our most vulnerable position when there’s little time to respond and we have more important things to worry about. Nevertheless, we are obliged to respond because Niko has poisoned the well for our project. “Open” here is not to serve the community or the authors but to provide a cheap blog and attention for Niko for the limited number of manuscripts that it serves Niko to review. The reinforcing power structure here should be apparent to anyone clued in.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.biorxiv.org/content/early/2018/10/12/439893&quot;&gt;Our paper itself&lt;/a&gt;, which I encourage you to read so you can form your own opinion, is about the nature of neural similarity, namely what makes two brain states similar. The main questions are whether the brain’s preferred notion of similarity differs across regions and across tasks. We find that the preferred similarity measures are common across regions but differ across tasks. This is cool. Of course, as we discuss, whatever measure is “best” (whatever that means, and it does mean different things to different people) will depend on many issues, including data quality and quantity. We muse a bit on how these higher-level measures of similarity relate to underlying computations and representations. There’s been a ton of work in Psychology on what makes two stimuli similar, but in Neuroscience people largely default to a few options without any real evaluation. Thus, our work is very needed in the field and timely. We get traction on this neglected problem by using a decoding approach to approximate the information available in a brain state. We discuss how much this approximation should be trusted in light of our central questions, namely whether the brain uses the same similarity measure across regions and tasks.&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;
&lt;img src=&quot;/images/blog/neural_similarity.png&quot; title=&quot;Figure 1 from Bobadilla-Suarez et al. (2018) on families of similarity measures.&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
  &lt;div class=&quot;inner-caption centered&quot;&gt;
From Figure 1, &lt;a href=&quot;https://doi.org/10.1101/439893&quot;&gt;Bobadilla-Suarez et al. (2018)&lt;/a&gt;:
Families of similarity measures. (left panel) Similarity measures divide into
those concerned with angle vs. magnitude differences between vectors. Pearson correlation
and Euclidean distance are common angle and magnitude measures, respectively. The
magnitude family further subdivides according to distributional assumptions. Measures
like Mahalanobis are distributional in that they are sensitive to co-variance such that
similarity falls more rapidly along low variance directions. (right panel) The choice of
similarity measure can strongly affect inferences about neural representational spaces. In
this example, stimuli &lt;b&gt;a&lt;/b&gt;, &lt;b&gt;b&lt;/b&gt;, and &lt;b&gt;c&lt;/b&gt; elicit different patterns of activity across two voxels.
When Pearson correlation is used, stimulus &lt;b&gt;a&lt;/b&gt; is more similar to &lt;b&gt;b&lt;/b&gt; than to &lt;b&gt;c&lt;/b&gt;. However,
when the Euclidean measure is used, the pattern reverses such that stimulus &lt;b&gt;a&lt;/b&gt; is more
similar to &lt;b&gt;c&lt;/b&gt; than &lt;b&gt;b&lt;/b&gt;.
&lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
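&lt;p&gt;The reversal illustrated in the right panel is easy to reproduce. Below is a minimal sketch in Python with NumPy, using made-up two-voxel activity patterns (the specific numbers are my illustrative assumptions, not values from the paper):&lt;/p&gt;

```python
import numpy as np

# Hypothetical activity patterns over two voxels for stimuli a, b, c.
# These values are illustrative assumptions, not data from the paper.
a = np.array([1.0, 2.0])
b = np.array([2.0, 6.0])
c = np.array([1.5, 1.0])

def pearson(x, y):
    # Angle-family measure: higher means more similar.
    return np.corrcoef(x, y)[0, 1]

def euclidean(x, y):
    # Magnitude-family measure: lower distance means more similar.
    return float(np.linalg.norm(x - y))

# Under Pearson, a is more similar to b than to c;
# under Euclidean distance, a is closer to c than to b.
print(pearson(a, b), pearson(a, c))      # correlation: b wins
print(euclidean(a, b), euclidean(a, c))  # distance: c wins
```

&lt;p&gt;With only two voxels, Pearson correlation is degenerate (after mean-centering it can only be ±1), which makes the flip especially stark; with more voxels the same qualitative reversal can still occur.&lt;/p&gt;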

&lt;p&gt;Decidedly, what we are not trying to do is determine which neural similarity measure has the best properties by some metric, such as split-half reliability, bias, or whatever small methodological point is of interest to some. Of course, that is what primarily interests some, such as Niko, but these points are minor and largely inconsequential to our goals and conclusions. Niko provided a top-down reading of our work strictly through the lens of his interests that fails to engage with the main ideas of the paper. I leave it to the acolytes to review his papers, with which I am familiar. Again, please read &lt;a href=&quot;https://www.biorxiv.org/content/early/2018/10/12/439893&quot;&gt;our paper&lt;/a&gt;, rather than parrot Niko’s views.&lt;/p&gt;

&lt;p&gt;As these sideshows entertain, fundamental questions about how to bridge from neurons to voxels to compact higher-level descriptions to computations remain unanswered. To make progress, the field needs leaders who are open to ideas and are broader thinkers. Of course, instead we have a system that entrenches and amplifies those in positions of power within the field. Rather than fall in line, my lab is trying to address these difficult and subtle questions. However, how can we make progress in the field when we are reviewed by people like Niko who doesn’t believe the brain has representations?&lt;/p&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;true, the brain does not need representations. it also doesn&amp;#39;t need information or causality. it&amp;#39;s a dynamical system after all. it&amp;#39;s *us* who need causality, and information theory, and representational interpretations to understand the brain. &lt;a href=&quot;https://t.co/8JatZUo1yt&quot;&gt;https://t.co/8JatZUo1yt&lt;/a&gt;&lt;/p&gt;&amp;mdash; Kriegeskorte Lab (@KriegeskorteLab) &lt;a href=&quot;https://twitter.com/KriegeskorteLab/status/1028669449484816385?ref_src=twsrc%5Etfw&quot;&gt;August 12, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;While we can all laugh at the occasional pseudo-profound, cringe-inducing tweets by celebrities like Elon Musk, we should expect more from the leaders of our field. It’s intolerable for our scientific fate to be controlled by someone who is a &lt;a href=&quot;https://en.wikipedia.org/wiki/Mind%E2%80%93body_dualism&quot;&gt;Cartesian Dualist&lt;/a&gt; or is profoundly confused by levels of analysis. I am glad that Niko and others picked up on ideas on &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/0010028570900022&quot;&gt;second-order isomorphism&lt;/a&gt; from Roger Shepard and others from 1970, and that they popularised others’ efforts to apply related ideas to the &lt;a href=&quot;http://science.sciencemag.org/content/293/5539/2425&quot;&gt;analysis of fMRI data&lt;/a&gt;. They have made a career out of correlating the upper diagonal of matrices and plodding through attendant concerns. Now it’s time to allow others to make progress and introduce new ideas into the literature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Postscript&lt;/strong&gt;: &lt;a href=&quot;https://twitter.com/ProfData/status/1083004240711307265&quot;&gt;Lots&lt;/a&gt; &lt;a href=&quot;https://twitter.com/ProfData/status/1083053727701970944&quot;&gt;of&lt;/a&gt; &lt;a href=&quot;https://twitter.com/INM7_ISN/status/1083019074215600129&quot;&gt;discussion&lt;/a&gt; &lt;a href=&quot;https://twitter.com/djnavarro/status/1083034770982858752&quot;&gt;on&lt;/a&gt; &lt;a href=&quot;https://twitter.com/IrisVanRooij/status/1083048770147897346&quot;&gt;Twitter&lt;/a&gt;. To be clear, we are not against the eLife model of publishing reviews upon acceptance, nor are we against leaving comments on preprints, which can allow the authors to respond and perhaps make edits. We are against using the existence of a preprint as a pretext to write journal reviews which are really self-serving blog posts, especially when they are posted the moment one’s paper is rejected by the editor. This take on open reviewing is open to abuse and is not really open, as the reviewer decides what, where, when, and how. Furthermore, existing models of open review involve consent from all parties.&lt;/p&gt;

&lt;p&gt;Also see the post by the first author too, here: &lt;a href=&quot;http://bradlove.org/blog/open-review-2&quot;&gt;Sebastian’s Thoughts on Open Review&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Wed, 09 Jan 2019 00:00:00 +0000</pubDate>
        <link>http://bradlove.org/blog/open-review</link>
        <guid isPermaLink="true">http://bradlove.org/blog/open-review</guid>
      </item>
    
      <item>
        <title>How a CogSci undergrad invented PageRank three years before Google</title>
        <description>&lt;p&gt;Before Google, search engines, like AltaVista, often retrieved spurious web pages. Out of all the possible pages to return how does one determine which ones are the most relevant? One key to Google’s success was the PageRank algorithm developed by Google founders &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S016975529800110X&quot;&gt;Sergey Brin and Larry Page in 1998&lt;/a&gt;. As they say, the rest is history, except there was a curious prehistory.&lt;/p&gt;

&lt;p&gt;Three years prior, in 1995, while an undergrad in Brown’s Cognitive and Linguistic Sciences program, &lt;a href=&quot;http://bradlove.org/papers/love_sloman_1995.pdf&quot;&gt;I published an algorithm identical to PageRank&lt;/a&gt;, so I guess it would be more correct to say that Brin and Page published an algorithm identical to the Love and Sloman centrality algorithm. At the time, I was a Mathematics and Computer Science major who switched over to the Cognitive and Linguistic Sciences program because I wanted to understand which algorithms the human mind uses to solve interesting problems. The story of my undergraduate honors thesis highlights how thinking about how the mind works can be useful for solving practical problems.&lt;/p&gt;

&lt;p&gt;Returning to the centrality measure, the goal was to determine which parts of concepts are most central or important to people. The idea I had was that people view nodes in human concepts as more central to the extent that other nodes depend on them. For example, in the graph below of our concept of &lt;em&gt;Robin&lt;/em&gt; (collected from human participants), &lt;em&gt;Beak&lt;/em&gt; should be somewhat central because &lt;em&gt;Eats&lt;/em&gt; depends on it. Like PageRank, indirect connections also influence centrality. For example, &lt;em&gt;Eats&lt;/em&gt; depends on &lt;em&gt;Beak&lt;/em&gt; and &lt;em&gt;Living&lt;/em&gt; depends on &lt;em&gt;Eats&lt;/em&gt;, i.e., &lt;em&gt;Living&lt;/em&gt; → &lt;em&gt;Eats&lt;/em&gt; → &lt;em&gt;Beak&lt;/em&gt;, which should have the effect of making &lt;em&gt;Beak&lt;/em&gt; even more central to our conception of a &lt;em&gt;Robin&lt;/em&gt;. To take all of these influences into account, the centrality algorithm iteratively computes how central a node is, given its place in the overall dependency graph. With some mathematics background, I worked out that this iterative algorithm converges to the eigenvector with the largest eigenvalue of the dependency matrix (all the links can be represented as a matrix).&lt;/p&gt;

&lt;figure class=&quot;fig&quot;&gt;&lt;img src=&quot;/images/blog/figure.jpg&quot; title=&quot;An example dependency (link) graph from Love and Sloman (1995).&quot; class=&quot;u-max-full-width centered&quot; /&gt;
&lt;figcaption&gt;
&lt;div class=&quot;inner-caption centered&quot;&gt;
An example dependency (link) graph from &lt;a href=&quot;http://bradlove.org/papers/love_sloman_1995.pdf&quot;&gt;Love and Sloman (1995)&lt;/a&gt;.
&lt;/div&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;PageRank is identical, but instead of working on a graph for a human concept it works on the links in the world wide web; simply replace concept node with webpage and dependency link with hyperlink. The goal of each algorithm is the same, to determine which nodes in a network are most central. &lt;a href=&quot;http://www.ams.org/samplings/feature-column/fcarc-pagerank&quot;&gt;Here&lt;/a&gt; is a good description of the math and ideas behind PageRank (i.e., the centrality algorithm) for those who want to know more.&lt;/p&gt;
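&lt;p&gt;For the mechanically minded, the iterative computation can be sketched as power iteration in a few lines of Python with NumPy. The link graph and the damping parameter below are illustrative assumptions (a standard PageRank-style formulation), not the actual 1995 or 1998 implementations:&lt;/p&gt;

```python
import numpy as np

# Hypothetical link graph among four nodes: adj[i, j] = 1 means node i
# links to (depends on) node j. Illustrative values only, not the Robin
# graph from the paper.
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
], dtype=float)

# Column-stochastic transition matrix: each node spreads its score
# evenly over its outgoing links.
M = (adj / adj.sum(axis=1, keepdims=True)).T

def centrality(M, damping=0.85, iters=100):
    """Iteratively propagate centrality until it settles.

    The fixed point is the leading eigenvector of the damped matrix,
    capturing the idea that a node is central to the extent that
    central nodes point at it.
    """
    n = M.shape[0]
    v = np.ones(n) / n
    for _ in range(iters):
        v = damping * (M @ v) + (1.0 - damping) / n
    return v

scores = centrality(M)
print(scores)  # scores sum to 1; larger means more central
```

&lt;p&gt;Swap the link matrix for a concept’s dependency matrix and the same iteration computes the conceptual centrality described above.&lt;/p&gt;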

&lt;p&gt;One wonders whether other ideas are lying in the cognitive science dustbin awaiting rediscovery. The field itself is largely driven by fads and is prone to ignore genuine discoveries. That year at the Cognitive Science Society (CSS) conference, my paper was well received but did not make a big splash. At the time, CSS folks were excited about &lt;a href=&quot;https://en.wikipedia.org/wiki/Connectionism&quot;&gt;connectionism&lt;/a&gt;, and a paper on that topic won best student paper. Of course, that trend gave way to &lt;a href=&quot;https://doi.org/10.1017/S0140525X10003134&quot;&gt;Bayesianism&lt;/a&gt;, which has given way or will likely give way to deep learning. CSS tends to be fad-driven, which is one of several reasons I resigned from the CSS last summer, but that is a topic for another blog post.&lt;/p&gt;

&lt;p&gt;My undergraduate thesis is not a unique case of cognitive science research being relevant to machine learning research. The &lt;a href=&quot;https://www.nature.com/articles/323533a0&quot;&gt;backpropagation algorithm&lt;/a&gt;, which is behind the past and current neural network revolutions, was developed by cognitive scientists. In addition, &lt;a href=&quot;http://psycnet.apa.org/record/1991-32228-001&quot;&gt;John R. Anderson independently discovered&lt;/a&gt; the Dirichlet process mixture for effective Bayesian clustering. And of course, the current excitement about the convolutional neural network architecture (trained with backpropagation) is motivated by basic insights into how the human visual system is organized.&lt;/p&gt;

&lt;p&gt;In these examples, establishing a connection to machine learning was possible because the cognitive science research was formal.
Perhaps one lesson is that more students in cognitive science should seek training in formal methods. Another lesson is that computer scientists may be well served by some contact with cognitive science. Facetiously as much as seriously, a final lesson for any potential benefactors with deep pockets is to contact me because I have some more good ideas waiting on the shelf! 😊&lt;/p&gt;
</description>
        <pubDate>Sun, 10 Dec 2017 00:00:00 +0000</pubDate>
        <link>http://bradlove.org/blog/cogsci-page-rank</link>
        <guid isPermaLink="true">http://bradlove.org/blog/cogsci-page-rank</guid>
      </item>
    
      <item>
        <title>Inclusive, Productive, Accountable</title>
        <description>&lt;p&gt;Slogans can prove hollow or can invite one to reflect on core values. Aiming for the latter, our new lab motto is &lt;b&gt;Inclusive, Productive, Accountable (IPA)&lt;/b&gt;. We aim for a community where everyone is hoppy, I mean happy, no matter their drink of choice. In seriousness, for the IPA motto to have a positive effect, what it concretely means needs to be clear.&lt;/p&gt;

&lt;h3 id=&quot;inclusivity&quot;&gt;Inclusivity&lt;/h3&gt;

&lt;p&gt;This amounts to not forming cliques that exclude others within the lab (even inadvertently). The worst-case scenario is creating virtual labs within the lab. Unfortunately, much like high school students, people in a workplace will naturally gravitate towards cliques when they don’t take care.&lt;/p&gt;

&lt;p&gt;What does this mean concretely? It means a lot, but here are some concrete examples for guidance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the lab email list (not a personal email list) to advertise events so as to not exclude people.&lt;/li&gt;

&lt;li&gt;Use the lab calendar to try to schedule these events when people are around where possible.&lt;/li&gt;

&lt;li&gt;If you are going somewhere with a bunch of people from lab (e.g., a talk, drinks, lunch), wonder why you are not inviting everyone. I am not saying that people can’t have favourites in lab to spend time with, but one should wonder when half the lab is somewhere and there was no group invite. This is not to say that people are obligated to join group events, but everyone should always feel welcomed to do so and in the loop.&lt;/li&gt;

&lt;li&gt;Don’t refer to anything as a lab event unless everyone in lab was invited with sufficient notice and there is some decent probability the majority of people would have an interest in attending.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s just a sample of what is a nebulous concept: welcoming and including everyone in lab such that there is one lab, not several factions.&lt;/p&gt;

&lt;h3 id=&quot;productivity&quot;&gt;Productivity&lt;/h3&gt;

&lt;p&gt;Knowing what the ultimate goal is and efficiently (in time and other resources) working toward it. This concept requires more unpacking, but notice productivity is more than “doing a lot of stuff” and working endless hours. Productivity requires alignment with lab goals and outputs.&lt;/p&gt;

&lt;h3 id=&quot;accountability&quot;&gt;Accountability&lt;/h3&gt;

&lt;p&gt;Taking responsibility for what falls within your realm; being straightforward when you fall short (e.g., admitting mistakes and seeking to correct them) as opposed to shifting blame; doing what you said (agreed) you would do, which amounts to being trustworthy; and accepting the consequences of one’s actions (or lack thereof).&lt;/p&gt;
</description>
        <pubDate>Thu, 06 Jul 2017 00:00:00 +0000</pubDate>
        <link>http://bradlove.org/blog/ipa-motto</link>
        <guid isPermaLink="true">http://bradlove.org/blog/ipa-motto</guid>
      </item>
    
  </channel>
</rss>
