Weekly Newsletter #41 | Updates Worth Your Attention 👀

The Inner Voice of Large Language Models

Anthropic shared the results of a new experiment aimed at uncovering how large language models actually think. First, they tell Claude that a developer is about to shut the model down. Then they give Claude that developer’s emails, one of which reveals that the developer is cheating on his spouse. The point is to see whether Claude will resort to blackmail. Across all variations of the test, Claude never attempts it. In parallel, the team records the model’s internal signals as it processes the input and generates a response, and has Claude itself write a method that translates this numerical data into readable language. After a few iterations they get it working, and find that Claude refused to blackmail because it recognized the setup as a test. They also see that Claude first evaluates the intent behind a question, writes itself a kind of internal “manifesto”, and only then crafts a response.

Of course, just last year a very similar test made headlines as proof that AGI had arrived, because back then the model did resort to blackmail. Our takeaway is that, over the past year, models have gotten significantly better at questioning the intent behind a prompt.

This development also matters because it opens the door to more research into how models actually decide what to say. The black-box nature of these systems bothers their own developers too, and they’re clearly trying to make sense of what’s going on inside.

BBC → | Anthropic (X) →

A New Interaction Model from Thinking Machines

Thinking Machines Lab, the company founded by former OpenAI CTO Mira Murati, announced a new AI technology they’re calling the Interaction Model. Today’s language models listen first, then respond. The Interaction Model can process new input and generate a response at the same time. With a latency of 0.40 seconds, the experience comes very close to natural human conversation. It’s a meaningful step toward using models in high-stakes domains like surgery, where smooth real-time performance really matters.

Mira Murati (X) → | Thinking Machines (X) →

Google’s Screenless Health Tracker

Google released a smartwatch without a screen. Since there’s no screen and therefore no clock, “health tracker” is probably the more accurate name. The device’s only job is to capture health and fitness data and pipe it over to your phone. It detects workouts automatically, or you can start them yourself from the phone app. Core features include heart rate, blood oxygen, and sleep tracking, with a launch price of $99. Because it also plugs into Google’s AI-powered health coach, the band acts as a sensor that feeds personalized training programs and health insights.

X → | Google Blog →

Developers vs. AI-Powered Security Exploits

Security incidents were once again front and center this week, with malicious releases published under widely used open-source projects like TanStack. Besides TanStack, the same coordinated wave hit UiPath, Mistral AI’s PyPI packages, the OpenSearch JavaScript client, and Guardrails AI. In total, over 400 malicious versions were published across 170 packages. Many developers are now looking into protective measures, such as enabling minimum release-age delays so that a newly published version of a dependency is not installed until it has been public for a set period.
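
For readers curious what a minimum release-age delay amounts to, here is a minimal Python sketch of the idea. It is an illustration under stated assumptions, not how any particular package manager implements it: the function name, the version numbers, and the timestamps are all hypothetical, and a real setup would rely on the package manager's own setting and registry metadata.

```python
from datetime import datetime, timedelta, timezone


def versions_past_min_age(release_times, min_age, now=None):
    """Return versions published at least `min_age` ago, oldest first.

    release_times: dict mapping a version string to its publish time (UTC).
    min_age:       timedelta a release must have been public before it is trusted.
    now:           reference time; defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - min_age
    eligible = [(ts, v) for v, ts in release_times.items() if ts <= cutoff]
    return [v for ts, v in sorted(eligible)]


# Hypothetical release history for an imaginary dependency.
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
releases = {
    "1.4.0": datetime(2024, 12, 1, tzinfo=timezone.utc),
    "1.4.1": datetime(2025, 1, 2, tzinfo=timezone.utc),
    "1.4.2": datetime(2025, 1, 9, 18, 0, tzinfo=timezone.utc),  # under a day old
}

# With a 7-day delay, the release from the previous evening is skipped.
allowed = versions_past_min_age(releases, timedelta(days=7), now=now)
print(allowed)
```

The trade-off is that legitimate fixes also arrive later, so the delay is a balance between giving the community time to spot a bad release and how quickly you need urgent patches.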

As these AI-amplified supply-chain attacks keep coming, OpenAI announced GPT-5.5-Cyber, a model aimed at security professionals. Google, meanwhile, revealed that a critical zero-day vulnerability was discovered using an AI-written Python script. A zero-day is a security flaw that is found or exploited before the software’s developer is even aware of it, which leaves no time to prepare a fix. Google’s Threat Intelligence Group (GTIG) said the attacker had planned to use the vulnerability globally, but the campaign was shut down before any attack actually took place.

OpenAI (X) → | Google (X) →