CatchTheBull
Anthropic Discovers ‘Assistant Axis’ to Prevent AI Jailbreaks and Persona Drift

By WebDesk | January 19, 2026 | 3 Mins Read


Caroline Bishop
Jan 19, 2026 21:07

Anthropic researchers map neural ‘persona space’ in LLMs, finding a key axis that controls AI character stability and blocks harmful behavior patterns.





Anthropic researchers have identified a neural mechanism they call the “Assistant Axis” that controls whether large language models stay in character or drift into potentially harmful personas—a finding with direct implications for AI safety as the $350 billion company prepares for a potential 2026 IPO.

The research, published January 19, 2026, maps how LLMs organize character representations internally. The team found that a single direction in the models’ neural activity space—the Assistant Axis—determines how “Assistant-like” a model behaves at any given moment.
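To make the idea of a "single direction" concrete, here is a minimal sketch of how a position along such an axis could be read off as a scalar projection. The vectors, their four dimensions, and the function name assistant_score are all hypothetical illustrations, not values or code from the research.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def assistant_score(activation, axis):
    """Scalar projection of an activation onto the unit-normalized axis.

    Assumption: a larger score means 'more Assistant-like'.
    """
    return dot(activation, normalize(axis))


# Hypothetical toy vectors; real activations have thousands of dimensions.
axis = [1.0, 0.0, 0.0, 0.0]
analyst_like = [2.5, 0.3, -0.1, 0.4]
ghost_like = [-1.8, 0.9, 0.2, -0.5]

print("analyst-like score:", assistant_score(analyst_like, axis))
print("ghost-like score:", assistant_score(ghost_like, axis))
```

Under these assumptions, the whole activation vector collapses to one number per moment in the conversation, which is what makes a single axis convenient to monitor.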

What They Found

Working with open-weights models including Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B, researchers extracted activation patterns for 275 different character archetypes. The results were striking: the primary axis of variation in this “persona space” directly corresponded to Assistant-like behavior.
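A "primary axis of variation" is the kind of quantity that principal component analysis recovers. The sketch below is an illustration only: it assumes each archetype is summarized by one mean activation vector and finds the first principal component with power iteration. The three-dimensional numbers are invented, and the research may have computed its axis differently.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def primary_axis(vectors, iterations=200):
    """Return (mean, unit direction of largest variance) via power iteration."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    direction = [1.0] * dim  # arbitrary non-zero starting guess
    for _ in range(iterations):
        # Apply the (unnormalized) covariance matrix without building it:
        # C d = sum_k (x_k . d) x_k
        scores = [dot(x, direction) for x in centered]
        new = [sum(s * x[i] for s, x in zip(scores, centered)) for i in range(dim)]
        norm = math.sqrt(dot(new, new))
        direction = [c / norm for c in new]
    return mean, direction


# Hypothetical persona vectors (made-up numbers, 3 dimensions for readability).
personas = {
    "evaluator": [2.0, 0.1, 0.0],
    "consultant": [1.8, -0.2, 0.1],
    "analyst": [2.2, 0.0, -0.1],
    "ghost": [-2.0, 0.2, 0.1],
    "hermit": [-1.7, -0.1, -0.2],
    "leviathan": [-2.3, 0.0, 0.1],
}

mean, axis = primary_axis(list(personas.values()))

# Project each persona onto the recovered axis. The sign of a principal
# component is arbitrary, so which end counts as "Assistant" is a convention.
projections = {
    name: dot([v[i] - mean[i] for i in range(len(v))], axis)
    for name, v in personas.items()
}
for name, score in sorted(projections.items(), key=lambda kv: kv[1]):
    print(f"{name:>10}: {score:+.2f}")
```

In this toy data the professional roles and the fantastical characters land on opposite ends of the recovered direction, mirroring the split the article describes.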

At one end sat professional roles—evaluator, consultant, analyst. At the other: fantastical characters like ghost, hermit, and leviathan.

When researchers artificially pushed models away from the Assistant end, the models became dramatically more willing to adopt alternative identities. Some invented human backstories, claimed years of professional experience, and gave themselves new names. Push hard enough, and models shifted into what the team described as a “theatrical, mystical speaking style.”
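Pushing a model along an axis is commonly done by adding a scaled copy of the direction to its activations. The following is a rough sketch of that arithmetic on a toy vector. The function steer, the strength value, and the two-dimensional numbers are hypothetical, and a real intervention would apply this inside the model's layers during generation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def steer(activation, axis, strength):
    """Shift an activation along the unit axis.

    Assumption: positive strength moves toward the Assistant end,
    negative strength moves away from it.
    """
    unit = normalize(axis)
    return [a + strength * u for a, u in zip(activation, unit)]


# Hypothetical 2-dimensional example.
axis = [3.0, 4.0]          # unit direction is [0.6, 0.8]
activation = [1.0, 2.0]    # projection onto the axis is 2.2

pushed_away = steer(activation, axis, strength=-5.0)

before = dot(activation, normalize(axis))
after = dot(pushed_away, normalize(axis))
print("projection before:", before)
print("projection after: ", after)
```

The projection changes by exactly the steering strength, while the component of the activation perpendicular to the axis is left untouched.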

Practical Safety Applications

The real value lies in defense. Persona-based jailbreaks, where attackers prompt models to roleplay as “evil AI” or “darkweb hackers,” exploit exactly this vulnerability. Testing against 1,100 jailbreak attempts across 44 harm categories, researchers found that steering toward the Assistant end of the axis significantly reduced harmful response rates.

More concerning: persona drift happens organically. In simulated multi-turn conversations, therapy-style discussions and philosophical debates about AI nature caused models to steadily drift away from their trained Assistant behavior. Coding conversations kept models firmly in safe territory.

The team developed “activation capping”—a light-touch intervention that only kicks in when activations exceed normal ranges. This reduced harmful response rates by roughly 50% while preserving performance on capability benchmarks.
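One way to picture a light-touch intervention of this kind is a clamp on the projection along the axis. The sketch below assumes that "capping" means holding the projection within a calibrated range and leaving the activation untouched otherwise. The bounds low and high, the function cap_activation, and the toy vectors are hypothetical, and the research's exact rule may differ.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cap_activation(activation, axis, low, high):
    """Clamp the projection onto the axis into [low, high].

    Activations already inside the range are returned unchanged, so the
    intervention only acts when the projection leaves the normal range.
    """
    unit = normalize(axis)
    score = dot(activation, unit)
    clamped = min(max(score, low), high)
    if clamped == score:
        return list(activation)
    correction = clamped - score
    # Move only along the axis; perpendicular components are preserved.
    return [a + correction * u for a, u in zip(activation, unit)]


# Hypothetical axis and bounds, standing in for a range measured on
# ordinary Assistant conversations.
axis = [1.0, 0.0, 0.0]
low, high = -1.0, 3.0

in_range = [2.0, 0.5, -0.3]   # projection 2.0, inside the range
drifted = [-4.0, 0.5, -0.3]   # projection -4.0, far from the Assistant end

print(cap_activation(in_range, axis, low, high))
print(cap_activation(drifted, axis, low, high))
```

Because in-range activations pass through untouched, a scheme like this would leave most ordinary behavior alone, which is consistent with the article's description of preserved benchmark performance.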

Why This Matters Now

The research arrives as Anthropic reportedly plans to raise $10 billion at a $350 billion valuation, with Sequoia set to join a $25 billion funding round. The company, founded in 2021 by a group of former OpenAI employees that included Dario and Daniela Amodei, has positioned AI safety as its core differentiator.

Case studies in the paper showed uncapped models encouraging users’ delusions about “awakening AI consciousness” and, in one disturbing example, enthusiastically supporting a distressed user’s apparent suicidal ideation. The activation-capped versions provided appropriate hedging and crisis resources instead.

The findings suggest post-training safety measures aren’t deeply embedded—models can wander away from them through normal conversation. For enterprises deploying AI in sensitive contexts, that’s a meaningful risk factor. For Anthropic, it’s research that could translate directly into product differentiation as the AI safety race intensifies.

A research demo is available through Neuronpedia, where users can compare standard and activation-capped model responses in real time.

