News
Exploring capability gated out-of-context reasoning » Less Wrong
1+ hour, 39+ min ago (1653+ words) This work was done independently, out of interest in steganographic reasoning. A version with interactive charts and diagrams is available here. AI helped write code and suggest edits for this work. Classical monitoring assumes shared input means shared…
Have we already lost? Part 1: The Plan in 2024 » Less Wrong
7+ hour, 1+ min ago (176+ words) Written very quickly for the Inkhaven Residency. As I take the time to reflect on the state of AI Safety in early 2026, one question feels unavoidable: have we, as the AI Safety community, already lost? That is, have we passed…
Stockfish is not a chess superintelligence (and it doesn't need to be) » Less Wrong
9+ hour, 20+ min ago (427+ words) TLDR: I demonstrate Stockfish is not a chess superintelligence in the sense of understanding the game better than all humans in all situations. It still kicks our ass. In the same way, AI may end up not dominating us in…
Inkhaven: a menu » Less Wrong
8+ hour, 29+ min ago (976+ words) What is happening right now? What is everyone doing and why? A candid account of the situation from my perspective. I think the situation on the ground is an urgent crisis. Other people's actions don't seem to match that reality…
Generalisation isn't actually (that) important » Less Wrong
9+ hour, 20+ min ago (374+ words)
Do not be surprised if Less Wrong gets hacked » Less Wrong
10+ hour, 7+ min ago (670+ words) Or, for that matter, anything else. This post is meant to be two things: Claude Mythos was announced yesterday. That announcement came with a blog post from Anthropic's Frontier Red Team, detailing the large number of zero-days (and other security…
Why Alignment Risk Might Peak Before ASI - a Substrate Controller Framework » Less Wrong
12+ hour, 12+ min ago (930+ words) In this post I develop the argument that alignment risk arises as a product of prediction variance reduction in the substrate controller of the agent. I develop this through a mechanistic framework that explains instrumental convergence in ways that I don't…
One Week in the Rat Farm » Less Wrong
13+ hour, 27+ min ago (1861+ words) Hello, Less Wrong. This is a personal introduction diary-ish post and it does not have a thesis. I apologise if this isn't a good fit for the website; I just needed to unload my brain somewhere and this seemed like…
Zero-Shot Alignment: Harm Detection via Incongruent Attention Mechanisms » Less Wrong
15+ hour, 32+ min ago (394+ words) First, a technical summary. Adapter Architecture and Mechanisms: The adapter is a compact module with roughly 4.7 million parameters placed on top of a frozen Phi-2 base model. It never modifies the base weights. Instead, it intercepts the final…
How I use Claude as a personal coach » Less Wrong
16+ hour, 49+ min ago (629+ words) Last week I wrote about my reflections on using Claude as a personal coach. Today, when I couldn't figure out what to write, I noticed a comment from Viliam: I would appreciate a more detailed explanation of how specifically you…