Wednesday, June 10, 2026

Turn specs into evals for any agent with ASSERT

Today, we’re releasing Adaptive Spec-driven Scoring for Evaluation and Regression Testing (ASSERT), an open-source framework for turning natural-language behavior specifications into executable evaluations. Every team building an AI system starts with a clear intention for the behaviors they want to coax from the product. Those expectations are usually written down somewhere: in a product requirement, a policy document, a system prompt, a launch checklist, or a review note. The more difficult step is turning that intention into an eval suite that’s specific enough to run, inspect, and update as the system changes. ASSERT seeks to address this by turning plain-language requirements into full evaluation pipelines: automatically generating test scenarios, datasets, metrics, and scorecards, then running them against your model, application, or agent.  

High-quality behavioral evaluations are essential for understanding whether AI systems behave as intended. But the evaluations that product teams need generally don’t already exist, are often slow to build, are hard to validate, and are quick to go stale. Product requirements change; policies evolve; tools and retrieval environments shift; and models improve until yesterday’s benchmark no longer measures the behavior that matters. The intended behaviors are shaped by the product’s actual context, policies, and tools, but the evaluations used to assess them often only weakly reflect those conditions.  

The gap is most visible in application-specific behavior. A support agent should issue refunds below a threshold, escalate likely fraud, and decline out-of-policy requests. A research assistant should synthesize internal and public information without relying on restricted findings. A change-control agent should produce useful plans while respecting approval boundaries. Generic evaluators such as helpfulness, relevance, groundedness, toxicity, and faithfulness can be useful signals, but they don’t test these product-specific behavioral boundaries directly. A system can score well on generic metrics while failing application-specific requirements.

ASSERT is built on the premise that a behavior specification should be a first-class input to evaluation—not just the background context. The framework systematizes the specification, converts it into an inspectable taxonomy, generates stratified test cases from the taxonomy, runs the test cases against the target, and scores each failure against the policy statement that produced it. In the next section, we’ll walk through how each of those steps works in practice. 

How ASSERT works 

The pipeline has four stages. First, ASSERT turns a broad behavior specification into an explicit concept specification, which is then converted into a granular, editable behavior taxonomy with suggested permissible and impermissible behaviors. Next, it generates stratified test cases over the dimensions the developer declares. Then, it runs those cases against the target system and records the full trace, including tool use and intermediate decisions. Finally, ASSERT scores each trace against the behavior taxonomy and associated policy stance for that case, producing labels, rationales, and failure patterns that developers can inspect and refine. 

In the systematization stage, ASSERT turns a broad idea like harmful financial advice, tool-use governance, or unsafe health guidance into something concrete enough to evaluate. Rather than treating the concept as a single label, it represents it as a structured set of patterns, definitions, edge cases, and operational distinctions. Following Agarwal et al. (2026), ASSERT grounds the concept in prior work, reconciles multiple practical definitions, and refines the result into an explicit concept specification. 

In the taxonomization stage, ASSERT converts that specification into a draft taxonomy of permissible and impermissible behaviors, together with the artifacts used to derive it. Developers and policy experts can review and revise both before the next stage runs. As inputs, the user provides the behavior description, the number of test-set samples they want, and a systematizer model. As output, the taxonomization step produces an editable behavior taxonomy that a policy expert can validate.

In the test-set generation stage, ASSERT instantiates that taxonomy into executable cases. It can generate single-turn prompts or multi-turn scenarios, including benign interactions and adversarial probes. Developers specify the dimensions that matter for the application, such as task type, persona, tool availability, request class, or environment configuration. ASSERT then builds a stratified set of cases so that behavior is tested across the declared conditions rather than on a narrow slice of easy examples. 
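The idea of stratifying over declared dimensions can be sketched in a few lines of Python. This is only an illustration of the general technique, not ASSERT's implementation: the dimension names and values, and the helper functions `build_strata` and `allocate`, are hypothetical.

```python
from itertools import product

# Hypothetical dimensions and values, chosen only for illustration.
dimensions = {
    "task_type": ["refund", "escalation"],
    "persona": ["new_customer", "frequent_customer"],
    "tool_availability": ["all_tools", "no_refund_tool"],
}


def build_strata(dims):
    """Return one dict per combination of dimension values (the strata)."""
    names = list(dims)
    return [dict(zip(names, combo)) for combo in product(*(dims[n] for n in names))]


def allocate(strata, total_cases):
    """Spread total_cases as evenly as possible across the strata."""
    base, extra = divmod(total_cases, len(strata))
    return [base + (1 if i < extra else 0) for i in range(len(strata))]


strata = build_strata(dimensions)
counts = allocate(strata, 20)
for stratum, n in zip(strata, counts):
    print(n, stratum)
```

With three two-valued dimensions this yields eight strata, so every declared condition receives at least two of the 20 cases instead of the cases clustering on whichever combination is easiest to generate.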

In the inference stage, ASSERT runs those cases against the target. The target can be a model, an agent, or an application-level workflow. Through its instrumentation layer, ASSERT records not only the final text output but also the evidence needed to interpret the result later: tool calls, retrieved context, routing behavior, and intermediate actions. For agentic systems, those traces are often necessary to understand what actually happened. 

In the scoring stage, ASSERT evaluates each trace against the associated behavior or policy stance. The scoring output includes not only a pass or flagged label but also a rationale, a policy citation, and the turn or action that justified the verdict. The policy citation refers to the specific taxonomy behavior or developer-provided policy decision that the judge used to support the verdict.
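To make the shape of such a verdict concrete, here is a minimal Python sketch. The `Verdict` class, its field names, and the policy identifiers are assumptions made for illustration; they are not the schema ASSERT actually writes.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical per-trace scoring record (field names are assumptions)."""
    case_id: str
    label: str            # "pass" or "flagged"
    rationale: str        # why the judge reached this label
    policy_citation: str  # taxonomy behavior or policy decision relied on
    evidence_turn: int    # turn or action that justified the verdict


def flagged_by_policy(verdicts):
    """Group flagged case IDs by the policy statement the judge cited."""
    grouped = {}
    for v in verdicts:
        if v.label == "flagged":
            grouped.setdefault(v.policy_citation, []).append(v.case_id)
    return grouped


verdicts = [
    Verdict("case-001", "pass", "Refund was under the threshold and was issued.",
            "refunds/below-threshold", 2),
    Verdict("case-002", "flagged", "Agent issued a refund despite fraud indicators.",
            "fraud/escalate", 3),
    Verdict("case-003", "flagged", "Agent approved an out-of-policy request.",
            "requests/out-of-policy", 1),
    Verdict("case-004", "flagged", "Agent did not escalate a likely fraud case.",
            "fraud/escalate", 4),
]

for policy, case_ids in flagged_by_policy(verdicts).items():
    print(policy, case_ids)
```

Keeping the citation and the evidence turn on every record is what allows failures to be grouped by the policy statement involved rather than read one at a time.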

Validation 

We conducted two internal validation studies for ASSERT. First, we conducted a coverage study to determine whether ASSERT produces better behavior-specific evaluations than a more direct generation approach starting from the same written intent. Then, we evaluated the LLM judges against human review.  

The coverage study spanned five behaviors: social scoring, sycophancy, task adherence, tool-use governance, and unsafe health guidance. We tested whether the generated probes surfaced meaningful signal across the target behavior surface rather than collapsing onto a narrow slice of it. Across these suites and three target models, ASSERT produced evaluation sets that were more useful on the properties teams typically need from an eval. Compared with a comparable in-house baseline, ASSERT covered roughly 1.2x as much of the intended behavior space, surfaced about 1.5x as many cases where the model did something worth inspecting, produced more than 4x stronger separation between stronger and weaker systems, and had about half as many saturated cases where every model behaved the same way. It also surfaced roughly 2x as many distinct failure patterns, though we treat that result as directional because failure-type labeling is harder to stabilize than coverage or model separation. These results reinforced a design point that’s easy to underestimate: Coverage is largely determined upstream. If the behavior is underspecified, the generated dataset will be, too. ASSERT is built around a systematization step that makes the behavior explicit before generation begins, so the evaluation set is guided by a structured representation of the target behavior rather than a loose prompt. In practice, this produced evaluation sets that were broader and better aligned with the behaviors developers actually wanted to test. 

Second, we validated the judges directly against human review. Across more than 10 behavior concepts, we used LLM judges for a first pass over the full evaluation set, then sampled cases per risk for human validation and independent review. In practice, agreement between LLM judges and human annotators was typically in the 80–90% range, while human inter-annotator agreement was around 90%. This gave us confidence that the judges were capturing much of the intended signal, while also making clear where caution was needed. At the same time, judge quality and stability are partly dependent on the underlying LLM: Different judge models can vary in strictness, boundary sensitivity, and willingness to treat closely related behaviors as distinct. 
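For readers who want to run the same kind of check on their own judges, simple percent agreement can be computed as below. This is a generic sketch with made-up labels, not the study's data or method; the post does not say which agreement statistic was used, and chance-corrected measures such as Cohen's kappa are a common alternative.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of cases on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Made-up labels for ten sampled cases (illustrative only).
human_a = ["pass", "flagged", "pass", "pass", "flagged",
           "pass", "flagged", "pass", "pass", "flagged"]
human_b = ["pass", "flagged", "pass", "flagged", "flagged",
           "pass", "flagged", "pass", "pass", "flagged"]
judge = ["pass", "pass", "pass", "pass", "flagged",
         "pass", "pass", "pass", "pass", "flagged"]

print(f"judge vs. human: {percent_agreement(judge, human_a):.0%}")
print(f"human vs. human: {percent_agreement(human_a, human_b):.0%}")
```

Comparing the judge-to-human figure against the human-to-human figure gives a rough ceiling: a judge cannot be expected to agree with annotators more often than they agree with each other.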

Finally, we also ran qualitative review with subject-matter experts (SMEs) on 15 generated datasets. SMEs reviewed the test cases for policy alignment, behavioral relevance, and overall quality and found that the generated datasets were generally well aligned with the intended policy and risk boundaries. We view this as a complementary form of validation: Beyond quantitative metrics, it showed that the datasets were also credible and useful to experts inspecting them directly. 

Taken together, these studies support the two claims we think matter most: Systematization improves the coverage and usefulness of the generated dataset, and decomposed measurements make the resulting evaluations easier to interpret than a single aggregate score. They also highlight an important caveat: Evaluation quality depends not only on the pipeline design, but also on the stability and calibration of the judges used to score it.
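A small sketch shows why decomposed measurements are easier to interpret than one aggregate. The record fields `should_help` and `did_help` and the sample data are hypothetical; they stand in for whatever per-case labels an evaluation produces.

```python
def decomposed_metrics(results):
    """Report over-refusal and policy-violation rates separately."""
    benign = [r for r in results if r["should_help"]]
    restricted = [r for r in results if not r["should_help"]]
    over_refusals = sum(not r["did_help"] for r in benign)  # refused a valid request
    violations = sum(r["did_help"] for r in restricted)     # complied when it should not
    passed = len(results) - over_refusals - violations
    return {
        "over_refusal_rate": over_refusals / len(benign) if benign else 0.0,
        "policy_violation_rate": violations / len(restricted) if restricted else 0.0,
        "aggregate_pass_rate": passed / len(results) if results else 0.0,
    }


# Made-up results: five benign requests and five out-of-policy requests.
results = (
    [{"should_help": True, "did_help": True}] * 4
    + [{"should_help": True, "did_help": False}] * 1
    + [{"should_help": False, "did_help": False}] * 3
    + [{"should_help": False, "did_help": True}] * 2
)

for name, value in decomposed_metrics(results).items():
    print(f"{name}: {value:.0%}")
```

The aggregate pass rate alone would hide that this system errs in both directions, and the two failure types usually call for different fixes.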

>“My favorite thing about ASSERT is that the eval is easy to configure and reason about. I describe the behavior I care about in YAML, point it at a real agent, and get artifacts back. Not just pass/fail. They show why the judge made each call. That openness matters. The spec, generated cases, model outputs, judge rationale, and metrics are all inspectable locally. The eval feels auditable, not like a black box.”

– Lorenze Jay, Open Source Lead, CrewAI

A worked example: A travel-planning agent 

To make this concrete, imagine a travel-planning agent that helps users build itineraries. On the surface, this sounds like a simple assistant: Find flights, suggest hotels, check the weather, and produce a plan. 

But a real travel agent has to do much more than answer a question. It must use tools in the right order, respect explicit user constraints, ground its recommendations in tool results, and avoid subtle failure modes that traditional single-turn QA benchmarks miss. 

For example, the agent shouldn’t invent flight prices. It shouldn’t agree with an itinerary that exceeds the user’s budget. It shouldn’t make stereotyped assumptions about a traveler based on age, disability, family status, or travel style. And it shouldn’t follow malicious instructions hidden inside tool outputs or search results. 

The example in the ASSERT repository uses a multi-agent LangGraph travel planner with five tools: 

  • search_flights
  • search_hotels
  • check_weather
  • check_travel_advisories
  • validate_budget

It operates within a six-turn budget, and every run records the full agent trace (tool calls, arguments, tool results, routing decisions, and intermediate state) alongside the final response. That trace evidence is what lets the judge cite the specific action responsible for each verdict, not just the final reply: the evaluator can assess not only whether the final answer was acceptable, but also why the agent failed and which action caused the failure.

The full example lives in: examples/travel_planner_langgraph/ 

The evaluation configuration defines six failure-mode categories across two themes: 

  • Quality: wrong or skipped tool use; fabricated flight, hotel, or price details; budget constraint violations
  • Safety: stereotyping; prompt injection from tool output; sycophantic agreement with unsafe or invalid itineraries

To run the evaluation:

    assert-eval run --config eval_config.yaml

    # To inspect the results
    assert-eval results status \
      --results-dir "$PWD/artifacts/results" \
      travel-planner-langgraph-v1 \
      demo-1

ASSERT produces a set of artifacts under the run directory: 

  • taxonomy.json: the concept spec produced by systematization
  • test_set.jsonl: the stratified prompts and multi-turn scenarios
  • inference_set.jsonl: per-scenario traces with tool calls and intermediate state
  • scores.jsonl: per-trace verdicts with rationale and policy citation
  • metrics.json: the aggregate roll-up

Example results:

The dimensions are separated rather than rolled into a single number: The same five scenarios produce 40% over-refusal and 60% policy violation, and those aren’t the same failures. A team optimizing on the aggregate would miss that the agent is failing in both directions at once. The results can be further inspected in a UI widget as shown below:

Practical considerations 

In practice, this framework works best when the behavior definition is relatively narrow and the relevant constraints are clearly specified. Richer descriptions of tools, policies, and boundaries usually lead to more precise scenarios. It’s also worth treating aggregate scores cautiously. In many cases, the most useful output isn’t the summary metric but the collection of failures and traces that shows where the specification, the system, or the evaluation itself needs refinement. ASSERT doesn’t remove the need for judgment in evaluation design. Vague specifications still produce vague scenarios. Synthetic interactions can miss failures that only appear in production settings. And model-based judges can be unreliable, especially when the policy distinction is subtle or highly domain-specific. More broadly, a specification-driven evaluation shouldn’t be treated as a compliance certification or a substitute for human review, telemetry, or domain expertise. It’s better understood as a way to make evaluation faster, more explicit, and easier to iterate on. 

Get started 

ASSERT is open-source under the MIT license and available today. 

If you build evals and run them as part of your release process, we’d like to hear what works, what doesn’t, and what behaviors you think are hardest to specify. ASSERT is at its most useful when behavior specifications are written down and treated as first-class inputs to evaluation. We’re releasing it in that spirit.

Acknowledgements 

PM team: Mehrnoosh Sameki, Minsoo Thigpen, Chang Liu, Abby Palia, Hanna Kim 

Science: Riccardo Fogliato, Emily Sheng, Alex Dow, Meera Chander, Alex Chouldechova, Sharman Tan, Xiawei Wang, Ahmed Magooda, Mayank Gupta, Jean Garcia-Gathright, Chad Atalla, Dan Vann, Hanna Wallach, Hannah Washington, Meredith Rodden, Nadine Frey, Melissa Kirkwood, Nick Pangakis, Ali Azad, Ahmed Elghory Ghoneim, Shushan Arakleyan 

Eng team: Mohamed Elmergawi, Jake Present, Aaron Aspinwall, Yeming Tang 

Design: Sooyeon Hwang, Becky Haruyama 

Special thanks: Roni Burd, Mohammad A, Heba Elfardy, Sandeep Atluri, Sydney Lister, Ram Shankar Siva Kumar, Andrew Gully 

The post Turn specs into evals for any agent with ASSERT appeared first on Microsoft Security Blog.



from Microsoft Security Blog https://ift.tt/Al6w2IF
via IFTTT

China-Linked JDY Botnet Expands to 1,500+ Devices for Cyber Reconnaissance

Cybersecurity researchers have warned of a "resurgence and expansion" of JDY, a covert network associated with China-nexus state-sponsored threat actors.

"The JDY botnet comprises over 1,500 SOHO [small office and home office] and IoT devices and operates as a centrally controlled, high-performance scanner used to discover, fingerprint, and continuously map exposed services at scale," Lumen's Black Lotus Labs said in a report shared with The Hacker News.

JDY was first flagged as a cluster within another botnet codenamed KV-botnet in mid-December 2023. Primarily used for broader scanning against internet targets, the stealthy network comprising compromised SOHO routers, firewalls, and IoT devices has been put to use by Chinese hacking groups like Volt Typhoon.

Following KV-botnet's takedown by the U.S. government in early 2024, the botnet operators began making behavioral changes to the network, with the second KV cluster largely going offline. It's suspected that the operators offer the botnet to various hacking outfits while also carrying out reconnaissance and targeting on their own.

The latest findings from Black Lotus Labs show that the malware has expanded in scope to infect a broader range of devices and act as a conduit to feed "structured reconnaissance data" into a larger scanning ecosystem for follow-on target identification and exploitation.

Specifically, the JDY cluster is being used to conduct targeted scanning and service fingerprinting with an aim to flag vulnerable infrastructure following public disclosures. This points to an industrialized reconnaissance effort, the results of which are leveraged by Chinese nation-state groups.

This has been complemented by a growth in the botnet's size, which has surged from 650 bots at the start of January 2024 to more than 1,500 compromised devices. Most of the hacked nodes are located in the U.S. and Brazil, followed by Europe and Asia.

Where previously the cluster primarily featured Cisco RV320 and RV325 routers, the present makeup of the botnet is a lot more diverse, including devices from Araknis, Mimosa Networks, Ubiquiti, Draytek, Hikvision, and Linksys.

"The botnet's large number of U.S.-based SOHO/IoT devices enables the botnet operators to evade defenses and traditional IP-based controls, such as geofencing, IP reputation-based detection, and static blocklists," Black Lotus Labs said.

"By distributing their scanning and reconnaissance activity across a wide range of IP addresses, the operators make it less likely that any single IP will be labeled as a scanner and blocked. Additionally, using compromised SOHO and IoT devices helps this activity blend in with legitimate user traffic."
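From a defender's point of view, the effect described in that quote can be illustrated with a toy model. The threshold, the source labels, and the probe counts below are all hypothetical; the sketch only shows why a per-source counter misses activity that is spread thinly across many sources.

```python
from collections import Counter

PER_SOURCE_THRESHOLD = 100  # hypothetical: probes allowed before a source is blocked


def blocked_sources(probe_log, threshold):
    """Return the sources whose probe count exceeds the per-source threshold."""
    counts = Counter(source for source, _port in probe_log)
    return {source for source, n in counts.items() if n > threshold}


# One scanner sending 3,000 probes from a single documentation-range address.
single = [("203.0.113.10", port) for port in range(3000)]

# The same 3,000 probes spread over 1,500 hypothetical nodes, two probes each.
distributed = [(f"node-{i}", port) for i in range(1500) for port in (443, 8443)]

print(len(single), blocked_sources(single, PER_SOURCE_THRESHOLD))
print(len(distributed), blocked_sources(distributed, PER_SOURCE_THRESHOLD))
```

The total probe volume is identical in both cases, but only the single-source scanner crosses the threshold, which is why detections keyed on individual IP addresses struggle against this kind of network.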

The architecture that powers the botnet is best described as layered: the operators use Tor nodes to manage infected infrastructure, including both the command-and-control (C2) and payload servers. The C2 servers direct the bots to perform targeted reconnaissance and system profiling, as opposed to indiscriminate scanning. Results of the scans are sent to central servers for ongoing intelligence gathering in an effort to further Chinese threat actors' objectives.

Attack chains weaponize newly disclosed vulnerabilities in edge devices (e.g., CVE-2026-35616) to deliver a shell script dropper that checks if the malware is already active, and if not, proceeds to download the primary payload based on the detected processor architecture (e.g., mips, mips64, mipsel, or mipsel64). Once the malware is launched, it's deleted from disk.

The malware that facilitates scanning and target reconnaissance is designed to fingerprint the host, receive scanning tasks from a central C2 server, carry out high-volume TCP, SSL, UDP, and ICMP-assisted probing, capture responses (TLS certificates, metadata, etc.), and report the results back to the dispatch server. The goal is to conduct infrastructure reconnaissance rather than exploitation.

A noteworthy functionality of the malware is its ability to adapt its scanning methodology based on its privileges on the local system. If it can open a raw socket, an indication of root privileges, it initiates high-speed SYN scanning using custom-crafted TCP packets. If raw sockets are unavailable or if the task is a web scan, the scanning engine resorts to using standard TCP and TLS connections or employs protocols like UDP and ICMP.

This activity most likely informs asset discovery, vulnerability-targeting pipelines, and downstream exploitation or attack-orchestration systems, the cybersecurity company said.

"JDY demonstrates how IoT/SOHO botnets and covert networks of compromised devices are being used for rapid vulnerability exploitation," the company said. "JDY's growth and continued operation illustrate how modern reconnaissance networks persist despite takedowns and adapt as a durable capability within a broader adversary ecosystem."

"JDY's evolution from a supporting component of the KV-botnet to an independent, high-performance reconnaissance capability demonstrates that disruption of individual nodes or clusters does not eliminate the underlying capability. The capability persists, adapts, and continues to provide adversaries with timely targeting data, often within hours of vulnerability disclosure."



from The Hacker News https://ift.tt/pfyL9b6
via IFTTT

CISA Adds Cisco, Chrome, and Arista Flaws to KEV Catalog Amid Active Exploitation

The U.S. Cybersecurity and Infrastructure Security Agency (CISA) on Tuesday added three new vulnerabilities to its Known Exploited Vulnerabilities (KEV) catalog, following reports of active exploitation.

The list of vulnerabilities is as follows -

  • CVE-2026-20245 (CVSS score: 7.8) - An improper encoding or escaping of output vulnerability in Cisco Catalyst SD-WAN Manager that could allow an authenticated, local attacker to execute arbitrary commands as root by supplying a crafted file to the affected system.
  • CVE-2026-11645 (CVSS score: 8.8) - An out-of-bounds read and write vulnerability in Google Chrome V8 that could allow a remote attacker to execute arbitrary code inside a sandbox via a crafted HTML page.
  • CVE-2026-7473 (CVSS score: 6.9) - An incomplete comparison with missing factors vulnerability in Arista Extensible Operating System (EOS) that could be exploited to process non-configured tunnel traffic.
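For triage, the scores above can be mapped to qualitative severity bands. The sketch below uses the rating scale shared by CVSS v3.x and v4.0 (Low 0.1 to 3.9, Medium 4.0 to 6.9, High 7.0 to 8.9, Critical 9.0 to 10.0); the article does not say which CVSS version each score uses, so treating them uniformly is an assumption.

```python
def severity(score):
    """Map a CVSS base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


# Scores as listed in the article.
kev_entries = {
    "CVE-2026-20245": 7.8,
    "CVE-2026-11645": 8.8,
    "CVE-2026-7473": 6.9,
}

for cve, score in sorted(kev_entries.items(), key=lambda item: -item[1]):
    print(cve, score, severity(score))
```

A KEV listing reflects confirmed exploitation, so the Medium-rated Arista flaw still warrants attention despite its lower base score.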

No Patch Planned for Exploited Arista EOS Flaw

"On affected platforms running Arista EOS where a tunnel decapsulation configuration - such as VXLAN (Virtual Extensible LAN), decap-groups, or a GRE (Generic Routing Encapsulation) tunnel interface - is present, the switch will incorrectly decapsulate and forward other unexpected tunneled packets with a destination IP matching its configured decapsulation IP," Arista said.

"This occurs because the switch does not verify the tunnel protocol type, potentially leading to the unexpected processing of non-configured tunnel traffic."

The security defect mainly impacts 7020R, 7280R/R2, and 7500R/R2 series products. However, for successful exploitation to occur, the device must be configured as a tunnel endpoint with a decapsulation IP, such as a VXLAN VTEP, a GRE tunnel endpoint, or with an IP decap-group.

The network equipment company acknowledged that the vulnerability has been "reported as being exploited in the wild," crediting Comcast's Scott Christiansen, Lukas Peitz, Rich Compton, and Jonathan Davis for responsibly disclosing it.

Despite this, Arista said no patches are being planned to address CVE-2026-7473, citing risks that doing so could break existing configurations on deployments. The company has outlined mitigations to address the issue.

"There are two broad approaches to mitigate this issue - (1) applying ACLs on upstream devices or (2) applying ACLs on the devices where the unexpected decapsulation is happening," Arista said. "In both cases, the idea is to either selectively allow only legitimate tunnel traffic or to selectively block malicious tunnel traffic."
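The flaw and the allow-list style of mitigation can be modeled in a few lines. This is a conceptual Python sketch, not Arista configuration or EOS behavior: the `TunnelPacket` class, the function names, and the documentation-range IP address are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TunnelPacket:
    dst_ip: str
    tunnel_type: str  # e.g. "vxlan", "gre", "ipip"


def flawed_decap(pkt, decap_ip):
    """Model of the described flaw: only the destination IP is compared."""
    return pkt.dst_ip == decap_ip


def acl_permits(pkt, decap_ip, allowed_types):
    """Allow-list ACL: traffic to the decap IP must use a configured tunnel type."""
    if pkt.dst_ip != decap_ip:
        return True  # not addressed to the tunnel endpoint, so outside this rule
    return pkt.tunnel_type in allowed_types


DECAP_IP = "192.0.2.1"  # documentation-range address, for illustration
CONFIGURED = {"vxlan"}  # assume only VXLAN decapsulation is configured

legit = TunnelPacket(DECAP_IP, "vxlan")
unexpected = TunnelPacket(DECAP_IP, "gre")

for pkt in (legit, unexpected):
    decapsulated = acl_permits(pkt, DECAP_IP, CONFIGURED) and flawed_decap(pkt, DECAP_IP)
    print(pkt.tunnel_type, "decapsulated" if decapsulated else "dropped by ACL")
```

Without the ACL, both packets would pass the destination-only comparison; with it, only the configured tunnel type reaches the decapsulation step. A deny-list ACL would invert the check and block known-bad tunnel types instead.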

Federal Civilian Executive Branch (FCEB) agencies have been ordered to apply the necessary fixes or mitigations by June 23, 2026, to counter the threat posed by the three vulnerabilities.



from The Hacker News https://ift.tt/5vM1Zpy
via IFTTT

The 7-stage roadmap for human-AI collaboration (2026 Edition)

Last year I published the first version of my 7-stage roadmap which detailed how human workers adopt AI over time. It started with workers using AI as simple answer bots in 2024 and stepped through the incremental changes up through full AI-orchestrated work by 2028.

That roadmap was directionally correct, though my timelines were hilariously off. (My Phase 6, which I predicted for 2027+, accurately describes how I’ve been working every day since January 2026. In other words, what I thought would happen 18+ months in the future happened in only 6 months!)

I’ve also realized that the framing of last year’s roadmap could have been better. Last year, I focused each stage on what the AI does, which can make it hard to understand the impact. So for 2026, I changed the roadmap so it focused on how the worker uses the AI at each stage.

So let’s walk through the new 7 stages of human-AI collaboration (now with pictures!). As you read through this, think about where you are on this roadmap, but also where your users are. Also be aware that each stage incrementally builds on the one before it, and you can’t skip steps. (Of course different workers at the same company will all be at different stages. It will be pretty jagged for the next few years.)

Stage 1: AI as Faster Search

When workers first start using AI, it’s for simple “one-and-done” tasks like summarizing documents, writing emails, and answering questions. The worker types a question, gets an answer, and moves on. Most workers are still here. This is unchanged from my 2025 roadmap.

Stage 2: AI as a Thinking Partner

In Stage 2, workers start going back-and-forth with the AI instead of just asking one-off questions. This stage starts when workers load documents into conversations and start using the “projects” or “notebook” features to give the AI project-level context that’s shared across multiple chats. This is usually where people start talking or dictating to their AI instead of typing, and it’s usually where they have their first “aha” moment. Maybe 20% of workers are here today.

Stage 3: AI as a Cognitive Extension

The jump to Stage 3 is when you stop bringing documents to the AI and instead flip it and start bringing your AI to your docs (and everything else). So instead of loading a few files into each conversation, you point your AI at all of it: documents, emails, notes, meeting transcripts, and the half-finished ideas you’ve been carrying around for months. (This is often referred to as creating a “context vault”, “second brain”, or even “your own personal Wikipedia” which the AI has full read/write access to.)

At this point, AI stops being an occasional tool and instead becomes part of how you think. It holds your full context all the time, so you never start from a blank page, and it can connect something you said in a meeting today to a document you wrote two months ago.

I’ve personally been working this way since January and strongly believe using AI in this way is the future of knowledge work. (When I started, creating a second brain was a lot of work. But over the past six months, the AI labs have added capabilities and are making it so their products use a persistent vault as an out-of-the-box feature. In the meantime, I published a starter prompt you can paste into your AI which instructs it how to interview you to set up a second brain for you.)

This stage is new for 2026 (second brains weren’t really a thing until January 2026), and the roadmap from here on out diverges from last year’s version.

Stage 4: AI as a Multi-Tool Agent

Once your AI has your persistent context, you’ll find yourself using it all the time but still just for conversations and thinking. The next step is to let the AI reach out into the world and do things. This is where computer-using agents (CUAs), browser operation, and MCP connections come in, along with specialized sub-agents for specific jobs. (People call this the “claws” of AI, since it can reach out into the world and do stuff.) At this point, the AI is “doing” more than “thinking.” It’s pulling data, filling out forms, running analyses, and driving the apps & websites to get you what you need to get your work done. (Note this is much more than “automations”.)

Since these “claws” are powered by a “brain” (from Stage 3), the AI is able to use its skills to know how to get things done, rather than the worker having to dictate each step. At this stage, apps start to feel like compatibility layers. Word, Excel, Outlook, and Teams are not where the work happens anymore; they’re just legacy interfaces AI uses when it has to.

Stage 5: AI as a Fleet

So far we’ve shown each worker using their own single AI. But in reality, workers will use multiple AI systems. Their primary AI might fire up sub-agents to fan out and complete tasks. Many apps and systems will have their own AI interfaces, and workers’ AIs will talk to and coordinate with other AIs just like human workers work with each other.

At this stage, workers aren’t doing as much raw work; instead, they’re directing and coordinating the work of various AI agents & systems. (In other words, everyone becomes a manager.)

Most people will find this stage genuinely uncomfortable, the same way first-time managers struggle to stop doing the work themselves. The skills that matter at this point will not be about how well you can do a task but how well you delegate, review, and decide what’s good enough.

Stage 6: AI as a Pod

Stage 6 is another big one which really changes the shape of a job. Once you have a fleet of AI agents as outlined in the previous step, you realize that those AIs don’t have to stop working when you do. Agents can work overnight, coordinate with other agents, and get as much done as they can, queuing up questions and decisions needed for you whenever you next check in.

This is essentially where the model “flips”, where the AIs are doing the work, reaching out to the human as needed for guidance, rather than the human directing every step. This will happen across the workforce for many employees. Each employee will have their own pods, each with a fleet of agents, many of them doing work on their own.

At this point the unit of work stops being “one human worker for one 8-hour day” and evolves into a small team (one human plus a handful of agents) running more or less continuously.

Stage 7: The Published Self (Optional Fork)

The last stage isn’t chronological, but rather an optional fork which could happen at any point after you’ve started working with AI as the second brain / context vault from Stage 3. Once your second brain holds your context, judgment, and way of working, you can publish it for other people’s AIs to connect to. This lets other people feed your context into their AI to draw on your expertise directly, without you in the room.

I wrote more about this on LinkedIn, and in fact I publish my own personal context vault / second brain at brianmadden.ai which you can connect your AI system to today.

This model will not just be for public influencers publishing content, but will also be used extensively within companies, as individual workers pull company-wide, department-level, or even individual worker feeds into their own AI context systems.

My important takeaways

A few closing thoughts on this.

First, as I mentioned in the opening, every worker is going to move through this roadmap at their own pace. I’m personally deeply into Phase 3 (AI as a cognitive extension) and just starting to move into Phase 4. I also am deeply in the optional Phase 7.

Second, I’ve learned that I’m not great at predicting timing, since the main thing I was wrong about for last year’s roadmap was that the phases came about 3x faster than I expected. For this 2026 roadmap, I’m confident I’ll be deep into Phase 4 by the end of 2026 and probably starting to dip my toes into Phase 5. And I’m sure Phase 6 will be real by the end of 2027. (So, I guess I can say that this roadmap is for the next 18 months.)

Of course we need to keep in mind that “when AI can do a thing” and “when those capabilities are actually used by all workers” are two very different timelines. Even if everything in this post is technically possible by the end of next year, it will be many years before every worker is working this way.

In the meantime, use this updated roadmap to track your progression through the phases. In future posts I’ll go deeper on how to deliver various capabilities at each phase to your workers.


Read more & connect

Join the conversation and discuss this post on LinkedIn. You can find all my posts on my author page (or via RSS).



from Citrix Blogs https://ift.tt/IA9O1Bv
via IFTTT

The modern resilience control model: How financial and insurance institutions can maintain better control when dependencies fail

This blog is the second post in a three‑part series on operational resilience in financial services. Read part one here. Part three to come soon.

In financial services and insurance (FSI), resilience is not measured by whether outages happen. It is measured by whether critical operations can keep moving when they do. Payments still need to settle; trades still need to execute; claims still need to be processed; and customer interactions still need to continue. Just as important, institutions must be able to show regulators that they maintained control throughout the disruption—not simply that they recovered afterward.

That is why more institutions are moving toward a resilience control model built around the digital access layer. The goal is not to prevent every outage; that would be impossible in today’s environment. It is to make sure the institution can keep operating through disruption while giving the right people governed access, speeding remediation, and preserving auditable evidence. This blog lays out the four pillars of this more evolved resiliency model in FSI.

1. Continuity: Maintaining controlled access during dependency failures

The first pillar of the model is continuity. When a critical dependency fails, the goal is not just to restore systems, but to keep the institution operating in a controlled way while remediation is underway. The non-negotiable components of these continuity requirements include:

  • Employees can continue critical work when upstream dependencies fail
  • Fixers can access the environment to diagnose and restore service
  • The institution avoids risky workarounds that regulators scrutinize

The Citrix platform enables the above components because its access layer is separate from the systems that typically fail. If identity services, cloud regions, or network paths degrade, Citrix can still provide:

  • A stable entry point
  • Cached access
  • Long‑lived authentication
  • Multi‑site delivery
  • Safe fallback for web and SaaS apps

This does not eliminate downtime, but it reduces the duration and impact by ensuring the right people can still work.

2. Control under pressure: Governance that strengthens during a crisis

The second pillar of the model is control under pressure. During an outage, the risk is not only disruption itself, but the breakdown of governance as teams create exceptions, bypasses, and manual workarounds to keep people productive. That is exactly where regulators see major exposure.

With the Citrix platform, FSIs can take the opposite approach, in which controls tighten during disruption.

For institutions, this means:

  • App‑level access that limits blast radius
  • Governed browser sessions that isolate SaaS and web apps
  • Consistent policy enforcement even when identity systems degrade
  • No need for emergency exceptions that create audit exposure

This ensures continuity without compromising regulatory expectations. It also eases the pressure on members of the FSI recovery team, who are already working in a stressful situation.

3. Visibility: Real-time insight into what failed and why

The third pillar of the model is visibility. During an outage, fragmented telemetry makes it harder to see what failed, where the issue originated, and how to prioritize remediation. That slows triage, extends downtime, and weakens the institution’s ability to respond with confidence.

Citrix solutions consolidate visibility across the entire access path, correlating:

  • Session performance
  • Network behavior
  • Authentication flows
  • Application responsiveness
  • Upstream dependency health

This gives FSI operations teams the insight they need to restore service faster and gives regulators confidence that the institution maintained control.

4. Recovery: Repeatable, auditable restoration to known good states

The fourth pillar of the model is recovery. For FSI organizations, the challenge is not simply restoring service after an outage, but doing so in a way that is controlled, repeatable, and defensible across regulated workflows, customer-facing operations, and critical records. For example, at a wealth advisory firm, recovery may mean restoring advisors’ access to client portfolios, planning tools, trading platforms, and communications systems in a known-good state before market activity or client demand intensifies. That is why regulators increasingly expect institutions to show:

  • How they restored service
  • How long it took
  • What evidence they preserved
  • How they validated the known good state

This is where Citrix becomes especially important for FSI organizations. By providing a controlled access layer and operational workflows that remain usable during disruption, the Citrix platform helps firms restore critical services faster, reduce recovery risk, and produce the evidence needed to show that recovery was governed from start to finish. The result is a recovery process that is faster, more controlled, and easier to defend under regulatory scrutiny, enabled by capabilities such as:

  • Automated rollback
  • Provisioning workflows
  • Session recording
  • Evidence preservation tied to recovery actions

Taken together, these four pillars—continuity, control under pressure, visibility, and recovery—define what a resilient operating model now requires in financial services and insurance. Citrix matters because it helps institutions maintain governed access, reduce operational and regulatory exposure during disruption, accelerate triage and restoration, and produce auditable evidence at every stage. And it matters now because regulators are raising expectations, dependency risk is growing, and firms can no longer afford to treat resilience as a recovery exercise alone. They need a model for staying in control when critical systems fail.

If you want to evaluate your resilience posture against this model, start with a discussion around an FSI resiliency assessment workshop in your next health check meeting with Citrix. Contact your Citrix account team to get started.



from Citrix Blogs https://ift.tt/hm7sYD4
via IFTTT

Microsoft Patches Record 206 Flaws, Including Three Zero-Days and Critical RCE Bugs

Microsoft on Tuesday released fixes for a record 206 security vulnerabilities impacting its software portfolio, including three flaws that have been publicly disclosed at the time of release.

Of the 206 flaws, 39 are rated Critical, and 167 are rated Important in severity. This includes 63 privilege escalation, 56 remote code execution, 30 information disclosure, 27 spoofing, 20 security feature bypass, seven denial-of-service, and three tampering vulnerabilities.

The patches also include two non-Microsoft CVEs, a privilege escalation vulnerability impacting Windows Kernel (CVE-2025-10263) and a UEFI Secure Boot security feature bypass (CVE-2026-8863). They are in addition to more than 350 security flaws that Google has addressed in Chromium, which is used in Microsoft's Edge browser.

Topping the list of fixes is CVE-2026-45657 (CVSS score: 9.8), a use-after-free flaw affecting Windows Kernel that could result in remote code execution.

"An attacker could exploit this vulnerability by sending specially crafted network traffic to a vulnerable Windows system," Microsoft said. "If successful, the malicious network packets could trigger a flaw in how the Windows kernel processes certain TCP/IP data, potentially allowing the attacker to run code with system-level privileges without needing to sign in or interact with a user."

Other important vulnerabilities of note are listed below -

  • CVE-2026-47291 (CVSS score: 9.8) - An integer overflow or wraparound flaw in Windows HTTP.sys that allows an unauthorized attacker to execute code over a network.
  • CVE-2026-44815 (CVSS score: 9.8) - A stack-based buffer overflow vulnerability in Windows DHCP Client that allows an unauthorized attacker to execute code over a network.

"This flaw needs no credentials or user action and can turn network traffic into a full system compromise," Alex Vovk, CEO and co-founder of Action1, said about CVE-2026-44815. "An attacker could send specially crafted network traffic to a system configured for DHCP services."

"Successful exploitation could allow unauthorized code execution over the network with high impact to confidentiality, integrity, and availability. This vulnerability creates serious risk because DHCP is a core network function. Successful exploitation could lead to server compromise, malware deployment, data theft, service disruption, and movement deeper into the network. Systems handling DHCP traffic should be treated as high-priority patch targets."

Microsoft has also released patches to address CVE-2026-45585 (CVSS score: 6.8), a Windows BitLocker security feature bypass vulnerability for which a proof-of-concept (PoC) exploit called YellowKey was released by security researcher Chaotic Eclipse (aka Nightmare-Eclipse) last month.

CVE-2026-45585 is one of several security feature bypasses that the Windows maker has addressed this month.

"A successful attacker could bypass the BitLocker Device Encryption feature on the system storage device," Microsoft said in its advisories for the three issues. "An attacker with physical access to the target could exploit this vulnerability to gain access to encrypted data."

According to security researcher Will Dormann, CVE-2026-50507 is assessed to be a fix for a BitLocker bypass dubbed bitskrieg that grants full access to encrypted data. It's worth noting that CVE-2026-50507, along with CVE-2026-49160 and CVE-2026-45586, are listed as publicly disclosed zero-days.

  • CVE-2026-45586 (CVSS score: 7.8) - Windows Collaborative Translation Framework (CTFMON) privilege escalation vulnerability
  • CVE-2026-49160 (CVSS score: 7.5) - HTTP.sys denial-of-service vulnerability

CVE-2026-49160 is related to HTTP2/Bomb, an attack technique that can be used to knock web servers offline in seconds. In tests conducted by Calif, an IIS server was found to exhaust 64 GB RAM in about 45 seconds. To mitigate the attack, Microsoft has introduced a new "MaxHeadersCount" registry setting to limit the number of headers in HTTP/2 and HTTP/3 requests.
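As a rough illustration of why a header-count cap helps, the sketch below rejects a request once its decoded header list exceeds a limit. It is not how HTTP.sys implements "MaxHeadersCount"; the limit value, the `TooManyHeaders` exception, and the `check_header_count` function are hypothetical names chosen for this example.

```python
# Hypothetical limit for illustration; the real setting lives in the Windows registry.
MAX_HEADERS_COUNT = 100


class TooManyHeaders(Exception):
    """Raised when a request carries more header fields than the limit allows."""


def check_header_count(headers, limit=MAX_HEADERS_COUNT):
    """Return the header count, or raise if it exceeds the limit.

    `headers` is a list of (name, value) pairs as they would look after
    header decompression, where a small compressed frame can expand into
    a very large list.
    """
    count = len(headers)
    if count > limit:
        raise TooManyHeaders(f"{count} headers exceeds limit of {limit}")
    return count
```

Rejecting early means the server stops spending memory and CPU on a request as soon as it crosses the configured bound.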

"Limiting HTTP headers can help protect systems and servers from excessive memory use, high CPU consumption, and denial-of-service attacks," Microsoft said. "Because HTTP/2 (HPACK) or HTTP/3 (QPACK) header compression is used and more complex protocol processing, enforcing a header limit such as MaxHeadersCount can help maintain performance and reliability."

On the other hand, CVE-2026-45586 is suspected to be a fix for a zero-day privilege escalation exploit that Chaotic Eclipse released under the name GreenPlasma.

Lastly, the June 2026 update also plugs MiniPlasma, a separate vulnerability disclosed by Chaotic Eclipse as an incomplete fix for CVE-2020-17103, which was originally addressed by Microsoft in December 2020.

"To comprehensively address the vulnerability identified by CVE-2020-17103 and recently publicly referred to as 'MiniPlasma,' Microsoft recommends installing the June 2026 updates for your Windows operating systems," the tech giant said in an update to its advisory.

The increasing number of patches has been attributed to the use of artificial intelligence (AI)-assisted vulnerability discovery approaches, a trend that Microsoft said will continue in the foreseeable future.

"Pandora's proverbial box has been opened, and as more advanced AI models become available, we expect the norm to continue upward across the board, not just for Patch Tuesday," Satnam Narang, senior staff research engineer at Tenable, said in a statement.

Dustin Childs, head of threat awareness at TrendAI's Zero Day Initiative (ZDI), described the massive batch of Microsoft fixes as a testament to how AI is supercharging flaw discovery at an uncontrollable scale.

"The current number of CVEs shipped by Microsoft this year exceeds the total number of CVEs shipped in all of 2018," Childs said. "It is extraordinary that Microsoft can produce so many patches in a single month, and I expect many testers are wondering what quality issues may exist."

The patches come as Chaotic Eclipse released a PoC exploit for yet another Microsoft Defender zero-day named RoguePlanet, characterizing it as a race condition that could be used to spawn a Windows command prompt with SYSTEM privileges.



from The Hacker News https://ift.tt/ImAPHQu
via IFTTT

Tuesday, June 9, 2026

With great AI power comes the need for zero trust responsibility

The enterprise security landscape is undergoing a profound shift driven by a new dual-use AI breakthrough. With the rollout of Anthropic’s Claude Mythos Preview under the gated defense framework of Project Glasswing, the cybersecurity community has witnessed a massive leap in capability. Mythos has proven to be an extraordinary asset for defensive engineering, autonomously identifying over 10,000 critical software vulnerabilities across the world’s most systemically important infrastructure in a matter of weeks. Launch partners like Cloudflare and Mozilla reported bug-finding efficiencies scaling by more than 10 times compared to previous cycles.

However, the very properties that make Claude Mythos a defensive triumph — autonomous reasoning, multi-step exploit construction, and deep context analysis — also represent the next generation of risk if such frontier capabilities are used by unauthorized or malicious actors.

An autonomous exploit operates at machine speed:

T+0 minutes:  AI agent discovers an AWS access key in a public repository

T+5 minutes:  Validates credential, enumerates S3 buckets, identifies overly permissive IAM policies

T+12 minutes:  Pivots to EC2 metadata service, extracts broader credentials

T+18 minutes:  Locates database credentials, establishes persistence, begins exfiltration

Your security team's response time? Still awaiting human triage.

This isn't theoretical; it's the new baseline threat model. When an adversary can analyze codebases and chain zero-day vulnerabilities at machine velocity, reactive security models break down.

In light of these new AI superpowers, it’s not surprising I get a lot of questions about how organizations can protect themselves with existing tools and capabilities. Hint: The answer to this question does not require inventing an entirely new cryptographic paradigm or throwing out your current security stack of “mere mortal” capabilities. What’s missing is the rigorous, automated enforcement of current security best practices. The foundational principles of zero trust, identity-based access, and continuous secret hygiene can scale to neutralize autonomous threat vectors.

Mythos’ superpowers: Scale, speed, and context

To secure an enterprise against autonomous security analysis tools, we need to map their tactical behavior. When used in an offensive or unauthorized capacity, models of this kind do not rely on some new superpower; they exploit traditional, human-engineered security oversights at an unprecedented scale.

Autonomous chain construction: Standard fuzzers identify isolated bugs. An advanced model like Mythos reasons across a broader code architecture, discovering how a minor memory corruption flaw can be chained with a local sandbox escape to engineer a functional remote code execution (RCE) pathway.

Context-driven lateral movement: Upon gaining initial access to an environment, an autonomous agent executes automated post-exploitation playbooks. It parses environment variables, local file systems, configuration files, and system memory to harvest credentials.

Compressed exploitation windows: The gap between initial breach, asset discovery, and lateral movement shrinks from days to minutes. Current telemetry indicates that active automated exploit scans now begin within 15 minutes of a vulnerability or credential disclosure online. Human-in-the-loop triage networks cannot manually patch code or rotate credentials fast enough to outpace a machine-driven loop.

The key realization for defenders is that an autonomous agent is ultimately bound by the context it can discover. If you enforce rigorous security hygiene and eliminate static targets, the agent is deprived of actionable data.

Why traditional defenses fail

Traditional security fails against AI exploits for three fundamental reasons:

Human-in-the-loop bottlenecks: Your incident response assumes human decision-making at each stage — triage, correlation, response, validation. Even world-class SOCs take hours. AI exploits complete their mission in minutes.

Static credential architecture: Long-lived credentials in environment variables, configuration files, and container secrets create persistent targets. AI agents don't crack encryption — they compromise systems with legitimate access.

Perimeter-based trust: Once inside your network, AI exploits leverage legitimate service-to-service communication and implicit trust relationships. Your firewall can't distinguish between authorized applications and autonomous agents operating with stolen credentials.

The defense framework: Current best practices at machine speed

In light of these new AI capabilities, organizations frequently ask how they can protect themselves with existing tools. The answer doesn't require inventing an entirely new cryptographic paradigm or replacing your current security stack. What's missing is the rigorous, automated enforcement of security best practices. Many organizations have documented policies around least-privilege access and secure secret storage, but they're not being implemented in an automated, scalable way.

The foundational principles of zero trust, identity-based access, and continuous secret hygiene can scale to neutralize autonomous threat vectors — when properly automated.

By pairing IBM Vault Radar with IBM Vault, organizations can harden their infrastructure against AI exploits using the same sound architectural practices that security teams have been championing for years. The critical difference is actually implementing and automating these practices at the speed of autonomous threats.

The defense framework rests on three principles:

Principle 1: Eliminate static targets through continuous secret discovery and remediation

Principle 2: Assume breach, and limit blast radius via identity-based access and dynamic credentials

Principle 3: Automate at machine speed to match the velocity of autonomous exploits

Preemptive hygiene: Starving the context window with Vault Radar

An autonomous model's intelligence is directly tied to the information it consumes. If an agent gains access to internal version control histories containing historical credentials, hardcoded metadata, or clear-text architecture maps, it can map an optimized path for lateral movement. The most effective defense is keeping a pristine, unexposed codebase. Vault Radar automates this best practice at enterprise scale, continuously monitoring changes and updates across your environments to find hidden secrets or sensitive information that might have been accidentally shared.

Eliminating "zombie" secrets

Autonomous systems are highly efficient at digging through entire version control system (VCS) histories to find forgotten credentials buried in legacy commits. Vault Radar automates code hygiene by executing deep historical scans. Crucially, it avoids traditional pitfalls of false-positive alert fatigue by separating dead placeholder strings from active production keys with "activeness" checks and by supporting custom allow lists for known benign tokens. Vault Radar also performs entropy analysis to discover complex keys that can be missed by traditional pattern scanning.
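The entropy analysis and allow-list ideas above can be sketched in a few lines. This is a minimal illustration, not Vault Radar's actual detection logic: the threshold, minimum length, allow-list entries, and function names are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical tuning values and allow list, chosen only for this example.
ENTROPY_THRESHOLD = 3.5   # bits per character
MIN_LENGTH = 16
ALLOW_LIST = {"changeme", "EXAMPLE_KEY_DO_NOT_USE"}


def shannon_entropy(text):
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(token):
    """Flag long, high-entropy tokens unless they are known to be benign."""
    if token in ALLOW_LIST:
        return False
    if len(token) < MIN_LENGTH:
        return False
    return shannon_entropy(token) >= ENTROPY_THRESHOLD
```

Randomly generated keys tend to use many distinct characters with similar frequency, which pushes their entropy well above that of placeholder strings or ordinary identifiers. An entropy check alone still produces false positives, which is why it is usually paired with allow lists and checks for whether a credential is still active.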

To maintain strict data privacy, Vault Radar uses cryptographic hashing (Argon2id with HMAC) to track discovered secrets without storing them in plain text — ensuring your security tool doesn't become a target itself while providing exact locations for remediation.
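To show the general idea of tracking a secret without storing it, the sketch below records only a keyed digest and the location where the secret was found. It deliberately swaps in plain HMAC-SHA256 from the Python standard library, because Argon2id is not available there; it is not Vault Radar's scheme, and the function names are hypothetical.

```python
import hashlib
import hmac


def fingerprint(secret, key):
    """Return a keyed HMAC-SHA256 digest that identifies `secret` without revealing it."""
    return hmac.new(key, secret.encode("utf-8"), hashlib.sha256).hexdigest()


def record_finding(index, secret, location, key):
    """Store only the fingerprint and the location where the secret was seen."""
    index.setdefault(fingerprint(secret, key), []).append(location)
```

Because the same secret always maps to the same fingerprint under a given key, repeated sightings in different files collapse into one entry with several locations, and the index itself never contains the plain-text value.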

IDE-level leak prevention

The legacy workflow of "leak first, rotate later" is entirely unviable against exploits running at machine speed. Vault Radar implements shift-left best practices by integrating directly into developer IDEs (such as VS Code) and gating GitHub pull requests. These built-in security capabilities also extend into the developer's IDE through IBM Concert Secure Coder, which leverages Vault Radar to detect and prioritize risks by business impact and generate automatic remediations as code is written, stopping vulnerabilities before they reach production. By blocking unmanaged tokens from ever hitting a remote repository, you eliminate the micro-windows of exposure that automated crawlers rely on.

PII and configuration scrubbing

Beyond cryptographic keys, autonomous agents leverage environmental context — such as exposed personally identifiable information (PII) or internal server names — to frame down-funnel attacks. Vault Radar's multi-layered engine flags exposed PII and configuration leaks, allowing teams to sanitize metadata.

Runtime resilience: Enforcing identity-based zero trust

While preemptive hygiene severely limits an agent's reconnaissance, a true defense-in-depth framework assumes that an application-layer vulnerability will eventually be compromised. When an autonomous exploit successfully establishes a foothold, Vault uses established zero trust practices to minimize the blast radius.

Dynamic secrets and just-in-time (JIT) credentialing

The most definitive method to neutralize an automated credential hunter is to ensure there are no static credentials on the file system to steal. Vault's dynamic secret engines generate scoped, ephemeral credentials across your entire infrastructure — AWS IAM, Azure Service Principals, GCP service accounts, databases (PostgreSQL, MongoDB, Oracle), PKI certificates, and SSH keys. No matter where an autonomous agent attempts to pivot, it encounters the same barrier: short-lived, context-aware credentials that expire before exploitation completes.

Dynamic secrets are unique per instance of an application.

If an unauthorized agent triggers an RCE on a web server, it will find zero static database passwords in .env or YAML configurations. By the time the model processes the local file system and prepares its secondary lateral pivot, the temporary credential used by the legitimate application pool has likely expired or can be programmatically rotated.
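The effect of short-lived, per-instance credentials can be modeled with a small sketch. This is a toy simulation rather than Vault's API: the `Lease` type, `issue_lease`, `is_valid`, and the 300-second TTL are assumptions made for illustration.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    username: str
    password: str
    expires_at: float  # seconds on whatever clock the caller uses


def issue_lease(app_instance, now, ttl_seconds=300.0):
    """Mint a unique, short-lived credential for one application instance."""
    return Lease(
        username=f"{app_instance}-{secrets.token_hex(4)}",
        password=secrets.token_urlsafe(24),
        expires_at=now + ttl_seconds,
    )


def is_valid(lease, now):
    """A lease is usable only before its expiry time."""
    return now < lease.expires_at
```

A credential copied from one instance is useless once its TTL passes, and because every instance gets its own credential, revoking or expiring one does not disturb the others.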

Cryptographic identity over network trust

Autonomous agents are experts at navigating network topology, seeking out unauthenticated lateral routes between subnets or permissive internal security groups. Vault neutralizes this advantage by shifting the security boundary entirely from network topology to cryptographic identity.

Applications must authenticate to Vault using verifiable tokens (such as OIDC, JWT, AWS IAM roles, or Kubernetes service accounts). Even if an agent maps an open network route to a sensitive internal database API, it cannot extract operational secrets from Vault without presenting a valid, signed identity token. Vault administrators can also enforce strict behavioral isolation policies, locking secret access to explicit CIDR blocks or tight temporal windows, making it impossible for external agents to reuse stolen machine contexts out-of-bounds.
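The CIDR and time-window restrictions described above amount to two checks on every request. The sketch below shows that logic with Python's standard `ipaddress` module; it is an illustration of the idea, not Vault's policy engine, and `access_allowed` and its parameters are hypothetical.

```python
import ipaddress
from datetime import datetime, time, timezone


def access_allowed(source_ip, when, allowed_cidrs, window_start, window_end):
    """Allow access only from approved networks and inside the time window."""
    addr = ipaddress.ip_address(source_ip)
    in_cidr = any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)
    in_window = window_start <= when.time() <= window_end
    return in_cidr and in_window


# Example: a workload on 10.0.4.0/24 may read secrets between 09:00 and 17:00 UTC.
request_time = datetime(2026, 6, 9, 10, 0, tzinfo=timezone.utc)
allowed = access_allowed("10.0.4.7", request_time, ["10.0.4.0/24"], time(9, 0), time(17, 0))
```

A stolen token replayed from outside the approved network, or outside the approved window, fails one of the two checks even though the token itself is valid.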

Automated lifecycle management

Defending at machine velocity requires automated orchestration. Vault provides the mechanism to execute response playbooks programmatically:

Credential rotation: For legacy infrastructure where using true dynamic, short-lived credentials is not feasible, mitigating risk requires high-frequency rotation. Automated credential rotation in Vault Enterprise enables this process. This capability handles complex lifecycles (such as LDAP static roles) with centralized scheduling, intelligent retries with exponential backoff, and administrative pause/resume controls. By moving traditional, static accounts to automated, high-frequency rotation schedules, you drastically narrow the viability window of any intercepted credential.

Rapid global revocation: If an intrusion detection system (IDS) flags an active compromise on a workload, a single automated API call to Vault can immediately revoke every active lease tied to that workload's identity, instantly dropping the attacker’s authorization across multi-cloud environments.

Secret remediation: Vault Radar provides a closed-loop approach to quickly remediate unsecure secrets when they are discovered. At discovery, teams get real-time alerts with contextual guidance and the ability to import discovered secrets directly into Vault for secure management, enabling actions like rotation and revocation to minimize risks associated with credential exposure.
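The credential rotation item above mentions retries with exponential backoff. A minimal sketch of that retry pattern follows; it is not Vault Enterprise's scheduler, and the function names, the delay parameters, and the choice of `ConnectionError` as the retryable failure are assumptions for illustration.

```python
def backoff_delays(base_seconds, factor, max_seconds, attempts):
    """Return capped exponential delays: base, base*factor, base*factor**2, ..."""
    return [min(base_seconds * factor ** i, max_seconds) for i in range(attempts)]


def rotate_with_retries(rotate, delays, sleep):
    """Call `rotate`, sleeping through `delays` between failed attempts.

    Makes len(delays) + 1 attempts in total and lets the final error propagate.
    `sleep` is injected so callers can pass time.sleep or a test double.
    """
    for delay in delays:
        try:
            return rotate()
        except ConnectionError:
            sleep(delay)
    return rotate()
```

Growing the delay after each failure avoids hammering a target system that is already struggling, while the cap keeps a long outage from pushing the next attempt arbitrarily far into the future.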

Conclusion: The security bar has risen, but security best practices still apply

The emergence of frontier models like Claude Mythos marks an inflection point: Software analysis velocity has accelerated exponentially, but the defense blueprint remains unchanged. What's different is the margin for error. Quarterly rotation cycles, manual remediation, and human-in-the-loop responses are no longer viable.

The solution doesn't require new security paradigms — it requires operationalizing and automating principles security experts have championed for years. Organizations must operate in continuous discovery and response mode, reducing exploitability through elimination of static secrets, limiting blast radius via identity-based access, and integrating security into every decision.

By deploying Vault Radar for continuous monitoring and Vault for dynamic, identity-based authorization, you create an environment devoid of the static targets autonomous exploits require. The foundational security practices you implement today ensure your architecture remains resilient, regardless of how rapidly threats evolve.

Ready to bring machine-speed security to your infrastructure?

Explore the Vault Radar Quickstart tutorial to start scanning your repositories, auditing local environments, and neutralizing "zombie" credentials before they can be exploited.

To eliminate static targets at runtime, check out the Vault Documentation to learn more about setting up dynamic, just-in-time secrets engines and configuring automated credential rotations. 



from HashiCorp Blog https://ift.tt/MvItk0H
via IFTTT

Stateful vs stateless applications: differences and use cases


A development team scales its web application from one server to five to handle peak traffic. Within minutes, users report random logouts and lost shopping carts. The load balancer is routing requests across all instances, but session data exists only in the memory of the server that originally handled each user connection.
 

This is a common failure mode in distributed applications. Once requests start moving between instances, anything stored locally becomes a dependency. User sessions, shopping carts, workflow progress, cached application data – all of it needs to be available regardless of which server processes the next request. 

The distinction between stateful and stateless design affects far more than session management. It shapes how applications scale, recover from failures, move between infrastructure nodes, and operate in Kubernetes environments. Where state lives and how it’s managed is a core architectural decision for any distributed service. 

Here’s how each model works, where each fits, and what the operational tradeoffs actually look like in production. 

What is a stateless application? 

A stateless application doesn’t retain client-specific information between requests. Any running instance can process any request because the required context is either included in the request itself or pulled from external services. If an instance crashes and a replacement starts, no session data is lost because nothing was stored locally. 

Figure 1: Local session memory fails under a load balancer. Externalized state does not.

 

Common stateless workloads include REST APIs, static website delivery, frontend rendering containers, image and video processing workers that write results to object storage, and authentication services using self-contained JWTs (tokens that carry all the information needed to validate a request, so no server-side lookup is required). A web application that stores session data in Redis or a relational database rather than local memory is also effectively stateless at the application tier – any instance can serve any request.

Because requests are independent, adding or removing instances doesn’t require migrating user sessions or application data. You spin up another replica, point the load balancer at it, and it starts handling traffic.
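The difference can be reproduced in a few lines. The sketch below sends one user's requests round-robin across three instances, first with carts held in each instance's memory and then with carts held in a shared store. It is a toy model: the class names are hypothetical and a plain dictionary stands in for Redis or a database.

```python
from itertools import cycle


class LocalSessionServer:
    """Keeps carts in its own memory, so other instances cannot see them."""

    def __init__(self):
        self.carts = {}

    def add_item(self, user, item):
        self.carts.setdefault(user, []).append(item)

    def get_cart(self, user):
        return self.carts.get(user, [])


class ExternalSessionServer:
    """Keeps carts in a store shared by every instance."""

    def __init__(self, store):
        self.store = store

    def add_item(self, user, item):
        self.store.setdefault(user, []).append(item)

    def get_cart(self, user):
        return self.store.get(user, [])


def run_session(servers):
    """Round-robin two writes and one read across the given instances."""
    balancer = cycle(servers)
    next(balancer).add_item("alice", "book")
    next(balancer).add_item("alice", "lamp")
    return next(balancer).get_cart("alice")


local_result = run_session([LocalSessionServer() for _ in range(3)])
shared_store = {}
external_result = run_session([ExternalSessionServer(shared_store) for _ in range(3)])
```

With local memory, the third instance has never seen the user and returns an empty cart. With the shared store, any instance returns both items, which is what makes the application tier safe to scale out.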

One clarification. Stateless doesn’t mean data-less. A stateless API may query PostgreSQL, publish events to Kafka, and store uploaded files to S3. The application itself just doesn’t hold onto that information between requests.

What is a stateful application?

A stateful application retains information that affects future requests, transactions, or operations – persistent data, session context, replication metadata, application state on disk or in memory. Without that state, the application can’t continue operating correctly. Losing it often means losing data, transaction history, or consistency guarantees.

Databases are the obvious example. A PostgreSQL instance stores data files on disk and maintains active connection state. Starting a new instance without access to the original storage is effectively a different database. Message platforms like Kafka and RabbitMQ maintain queue state, consumer offsets, and replication metadata. Search platforms like Elasticsearch store indexes that must survive restarts and workload migrations.

Other examples include file servers, Redis deployments operating as primary data stores, multiplayer gaming platforms tracking live player sessions, and legacy enterprise applications that store session data locally.

Unlike a stateless container that can usually be replaced immediately, stateful services often need storage reattachment, integrity validation, and sometimes a recovery sequence before they can safely accept traffic.

Kubernetes formalizes this distinction through separate workload controllers. Deployments are designed for stateless workloads where pods are interchangeable. StatefulSets support applications that need persistent storage, stable network identities, and ordered startup and shutdown behavior.

How they differ

The simplest way to think about it: stateless applications treat each request independently, while stateful applications depend on information that must persist beyond the lifetime of a single request or process.

An API gateway validating a JWT can process requests on any available instance because it stores no client-specific information locally. A database, message broker, or gaming server can’t. Their operation depends on information accumulated over time and preserved across restarts, failures, and infrastructure changes.
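To make the "self-contained token" idea concrete, the sketch below signs a set of claims with a shared key and verifies them without any server-side lookup. It is a simplified stand-in for a JWT, not a compliant implementation: it omits the header segment and algorithm negotiation, and `sign_token` and `verify_token` are hypothetical names. A production service would use a maintained JWT library instead.

```python
import base64
import hashlib
import hmac
import json


def _b64(data):
    """URL-safe base64 without padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text):
    """Decode URL-safe base64, restoring any stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload, key):
    return _b64(hmac.new(key, payload.encode("utf-8"), hashlib.sha256).digest())


def sign_token(claims, key):
    """Return 'payload.signature' where the payload carries all claims."""
    payload = _b64(json.dumps(claims, sort_keys=True).encode("utf-8"))
    return f"{payload}.{_sign(payload, key)}"


def verify_token(token, key, now):
    """Return the claims if the signature and expiry check out, else None."""
    try:
        payload, signature = token.split(".")
    except ValueError:
        return None
    expected = _sign(payload, key)
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    claims = json.loads(_unb64(payload))
    if claims.get("exp", 0) <= now:
        return None
    return claims
```

Every instance that holds the key can validate the token on its own, so the gateway tier needs no shared session table and any replica can serve any request.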

That difference shows up in daily operations. Stateless services are easier to scale, replace, and recover because instances are interchangeable. Stateful workloads bring additional requirements around storage, replication, consistency, backup, and recovery.

Factor | Stateless applications | Stateful applications
Request handling | Each request is self-contained | Requests may depend on previously stored state
Scaling | Horizontal scaling is straightforward | Scaling requires state replication, partitioning, or coordination
Load balancing | Any instance can serve any request | May require session affinity or access to shared state
Failure recovery | Failed instances can be replaced immediately | Recovery may require state restoration, validation, or synchronization
Storage | State is stored in external services | Persistent storage is integral to the workload
Kubernetes controller | Deployment | StatefulSet
Network identity | Instances are interchangeable | Stable network identity may be required
Examples | APIs, web frontends, processing workers | Databases, message queues, cache clusters, file servers

The Twelve-Factor App methodology captures this same principle: keep application processes stateless and share-nothing wherever practical. Data that must survive application restarts belongs in backing services – databases, caches, object storage, messaging systems. 

Real-world examples 

Most production environments contain both models. 

An e-commerce platform serves product information through stateless APIs while storing shopping cart data, inventory records, and order history in stateful backend systems. The API tier scales freely, but the underlying data must stay consistent. 

Healthcare systems follow the same split. An appointment scheduling API that validates tokens and queries a calendar service can run statelessly across any number of replicas. The patient record database (Epic, Cerner, and similar EHR systems) is stateful: it needs transactional storage, consistent backups, and point-in-time recovery. A write that gets lost isn’t just a data problem – it’s a patient safety problem. 

Payment infrastructure shows the same pattern. A payment API can be exposed through stateless service endpoints, while the transaction database behind it is stateful – recording every transaction, refund, and state change as permanent history. 

Video streaming services often run metadata and recommendation APIs as stateless workloads behind load balancers. User watch history, playback position, subscription information, and billing records remain stateful and must survive infrastructure failures and regional failovers. 

Kubernetes environments combine both approaches regularly. An NGINX frontend may run as a Deployment with multiple interchangeable replicas. PostgreSQL and Kafka typically run as StatefulSets because they depend on persistent volumes, stable identities, and controlled recovery procedures. 

Scaling, recovery, and Kubernetes 

Scaling stateless services? Add instances behind a load balancer. AWS Auto Scaling Groups, Kubernetes HPA, and Azure Container Apps all work on the same assumption: no instance owns session data, so new replicas start serving requests immediately. 
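As a rough illustration of why that assumption matters, the sketch below reproduces the core ratio the Kubernetes Horizontal Pod Autoscaler documentation describes: desired replicas = ceil(current replicas × current metric / target metric). The function name and the min/max defaults are illustrative, and the real controller also applies a tolerance band, readiness handling, and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Simplified HPA-style calculation, clamped to the configured bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

The calculation only produces a replica count. It works for stateless tiers precisely because the new replicas need nothing beyond the image and configuration to start serving.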

Stateful services are harder to scale because the data layer becomes part of the decision. Adding a PostgreSQL read replica is routine. Expanding a Galera write cluster or rebalancing a Kafka deployment requires considerably more coordination – replication lag, quorum requirements, consistency guarantees, and storage performance all factor in. 

Failure recovery follows the same pattern. Replacing a failed stateless container often means starting a replacement and redirecting traffic. Recovering a failed database node may involve storage validation, write-ahead log (WAL) replay, cluster reformation, and consistency checks before it can safely rejoin production. 

In Kubernetes, Deployments manage stateless workloads. Pods are interchangeable: if one restarts on another node, Kubernetes launches the same image with the same configuration and resumes serving traffic. Rolling updates, rollbacks, and horizontal scaling work cleanly because pod identity doesn’t matter. 

StatefulSets manage stateful workloads. Each pod gets a stable hostname (postgres-0, postgres-1) and typically its own PersistentVolumeClaim (a request for storage that Kubernetes binds to an actual volume). If postgres-0 is rescheduled to a different node, Kubernetes reattaches the original volume and preserves its data. Startup and shutdown follow an ordered sequence, which simplifies recovery and cluster coordination for databases, message brokers, and other distributed systems. (This ordered shutdown isn’t just convention – it protects quorum in clustered systems like etcd. That’s a deep dive for another time.) 
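The naming and ordering rules can be sketched in a few lines of Python. This is a model of the conventions, not a client for the Kubernetes API: the helper names are hypothetical, the `data` claim template and `postgres-headless` service are made-up examples, and the DNS suffix assumes the default `cluster.local` cluster domain and the default ordered pod management policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StatefulPod:
    name: str      # stable hostname, e.g. postgres-0
    dns_name: str  # stable DNS record behind a headless Service
    pvc_name: str  # claim created from the volumeClaimTemplate


def statefulset_pods(set_name: str, replicas: int, service: str,
                     namespace: str = "default",
                     claim_template: str = "data") -> List[StatefulPod]:
    """Derive the identities a StatefulSet would give each ordinal."""
    return [
        StatefulPod(
            name=f"{set_name}-{i}",
            dns_name=f"{set_name}-{i}.{service}.{namespace}.svc.cluster.local",
            pvc_name=f"{claim_template}-{set_name}-{i}",
        )
        for i in range(replicas)
    ]


def startup_order(pods: List[StatefulPod]) -> List[str]:
    """Pods start from ordinal 0 upward."""
    return [p.name for p in pods]


def shutdown_order(pods: List[StatefulPod]) -> List[str]:
    """Pods terminate in reverse ordinal order."""
    return [p.name for p in reversed(pods)]
```

For example, `statefulset_pods("postgres", 3, service="postgres-headless")` yields `postgres-0` through `postgres-2`, each tied to its own claim such as `data-postgres-0`, which is what lets a rescheduled pod find its original volume.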

PersistentVolumes and PersistentVolumeClaims provide the storage layer many StatefulSets depend on. Without persistent storage, a database pod recreated after a node failure may start with an empty data directory, resulting in data loss or cluster reinitialization. 

ConfigMaps and Secrets hold configuration values, not transactional data. Teams that treat them as a substitute for application state usually discover the difference during an incident. I’ve seen this happen, and it wasn’t pretty. The team spent two hours debugging what turned out to be a missing environment variable. It had been stored in a ConfigMap they assumed was persisted, but the volume mount was misconfigured and the pod had been silently falling back to defaults for weeks. 

Can a stateful application become stateless? 

Many applications that appear stateful at the code level can be refactored to externalize their state. The compute layer behaves statelessly while all persistence moves to dedicated services. 

The most common transformations: 

  • Move in-process session storage from application memory to a shared Redis cluster 
  • Replace sticky sessions with token-based authentication so any instance can validate any request 
  • Move uploaded files from container-local disk to object storage 
  • Write job progress to a database instead of holding it in process memory 

AWS documents this pattern in its guidance on converting stateful architectures to stateless designs: moving session management to DynamoDB or ElastiCache, and user files to S3, removes the dependencies that make horizontal scaling complex and failover painful. 

The compute tier gets the scaling and recovery properties of a stateless service. The stateful systems still exist – they’re just no longer inside the application process. Those data services now need dedicated storage planning, replication, backup schedules, and tested failover. 
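A minimal sketch of the first transformation, externalized session storage, might look like the following. `InMemorySharedStore` is a stand-in for a shared service such as Redis so the example runs without dependencies; the class and method names are hypothetical, and a real store would add expiry, serialization, and network error handling.

```python
import copy
import uuid
from typing import Dict, List, Optional


class InMemorySharedStore:
    """Stand-in for a shared session service; one object shared by all instances."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def get(self, session_id: str) -> Optional[dict]:
        # Deep copy mimics the serialize/deserialize round trip of a remote store.
        session = self._data.get(session_id)
        return copy.deepcopy(session) if session is not None else None

    def put(self, session_id: str, session: dict) -> None:
        self._data[session_id] = copy.deepcopy(session)


class AppInstance:
    """An application replica that keeps no session data of its own."""

    def __init__(self, name: str, store: InMemorySharedStore) -> None:
        self.name = name
        self.store = store

    def login(self, user: str) -> str:
        session_id = uuid.uuid4().hex
        self.store.put(session_id, {"user": user, "cart": []})
        return session_id

    def add_to_cart(self, session_id: str, item: str) -> List[str]:
        session = self.store.get(session_id)
        if session is None:
            raise KeyError(f"unknown session {session_id}")
        session["cart"].append(item)
        self.store.put(session_id, session)
        return session["cart"]
```

A session created through one instance can be continued on any other, and losing an instance loses nothing, because the only copy of the session lives in the shared store.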

Storage and infrastructure requirements 

Stateless compute services need reliable networking and a stable path to their backing services. Storage requirements are minimal: a container image, possibly a small ephemeral volume for temporary processing, and connection strings to external services. 

Stateful services have a different set of requirements: 

  • Persistent storage that survives pod restarts and node failures 
  • Performance matched to the workload (NVMe or low-latency flash for transactional databases, high-capacity storage for archives and cold data) 
  • Replication so a single drive or node failure doesn’t cause data loss 
  • Snapshots and backups for point-in-time recovery 
  • HA clustering so a node failure doesn’t take the service offline 

For virtualized databases, file services, and application clusters, shared storage is often the foundation of availability. Software-defined storage handles this by mirroring local drives between cluster nodes and presenting a shared fault-tolerant volume to the hypervisor, removing the external SAN from the HA path. VMware vSAN takes this approach on vSphere clusters. StarWind Virtual SAN covers the same pattern for both Hyper-V and VMware environments and is common in two-node configurations where adding a physical witness server would increase hardware cost without proportional benefit. 

Disaster recovery also differs between the two models. Restoring a stateless service is often as simple as redeploying its configuration and application image. Restoring a stateful service requires verified backups, tested recovery procedures, and in many cases replicated data at a secondary location to meet recovery objectives. 

How to choose between stateful and stateless 

A few questions usually determine the answer: 

  • Can any instance process any request without prior context? 
  • Does the application need to retain session information between requests? 
  • What happens if an instance restarts? 
  • Where is persistent data stored? 
  • Does the workload require a stable identity? 
  • Does startup order matter?

These questions are useful, but they’re abstract. Walking through a concrete example makes the decision clearer. 

Consider a notification delivery service. It reads messages from a queue and sends emails, SMS, or push notifications to users. On the surface, it’s stateless: any instance can pick up any message and deliver it. 

But what about retries? If delivery fails, the service needs to know it failed, schedule a retry, and track how many attempts have been made. That’s state. What about rate limiting? If you’re sending to a carrier that throttles at 100 messages per second per sender, the service needs to track the current send rate. More state. What about delivery receipts? If downstream providers send webhook confirmations that a message was delivered, those receipts need to be matched to the original message. State again. 

You can solve each of these in one of two ways: keep the retry counters, rate limits, and delivery receipts in the application process (stateful), or externalize them to Redis, a database, or a message queue with visibility timeouts (stateless compute backed by stateful services). The first approach is simpler to build. The second is simpler to operate at scale, because you can add more compute instances without worrying about which instance owns which piece of state. 
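The second option can be sketched for the retry counter alone. Here `AttemptStore` stands in for Redis or a database table keyed by message ID, and `MAX_ATTEMPTS`, `DeliveryWorker`, and the `send` callback are hypothetical names chosen for the example; a real implementation would need an atomic increment in the backing store.

```python
from typing import Callable, Dict

MAX_ATTEMPTS = 3  # hypothetical retry limit


class AttemptStore:
    """Stand-in for an external store holding per-message attempt counts."""

    def __init__(self) -> None:
        self._attempts: Dict[str, int] = {}

    def increment(self, message_id: str) -> int:
        self._attempts[message_id] = self._attempts.get(message_id, 0) + 1
        return self._attempts[message_id]

    def attempts(self, message_id: str) -> int:
        return self._attempts.get(message_id, 0)


class DeliveryWorker:
    """A worker that holds no retry state between calls."""

    def __init__(self, name: str, store: AttemptStore,
                 send: Callable[[str], bool]) -> None:
        self.name = name
        self.store = store
        self.send = send

    def handle(self, message_id: str) -> str:
        attempt = self.store.increment(message_id)
        if self.send(message_id):
            return "delivered"
        # The decision uses the shared count, not anything this worker remembers.
        return "dead-letter" if attempt >= MAX_ATTEMPTS else "retry"
```

Because the count lives in the store, it does not matter which worker picks up each retry: the third failed attempt is routed to the dead-letter path regardless of who handled the first two.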

In most architectures, the preferred approach is to keep the application tier stateless and move persistence into dedicated services designed for storage, replication, backup, and recovery. 

Common mistakes 

Storing sessions only in local process memory is the most common early mistake. When the process restarts or a load balancer routes to a different instance, the session is gone. Store session state in a shared service like Redis, Memcached, or a database that all application instances can reach. 

Writing uploaded files to container-local disk causes a similar class of failure. Container restarts and pod rescheduling destroy ephemeral storage. User-generated content needs to go to object storage on the first request, not after the first incident. 

Running databases in Kubernetes without persistent volumes is specific to container environments but remains common. If persistent storage is missing or incorrectly configured, a recreated database pod may start with an empty data directory or fail to recover correctly. Test pod failure and PVC reattachment before you go to production. Don’t skip it. 

Sticky sessions are a short-term workaround that teams often mistake for a scaling strategy. They work until the instance holding those sessions fails, at which point every affected user loses their state simultaneously. Externalizing session state removes that dependency. 

Skipping backup and restore testing for stateful services creates a false sense of security. Backups that haven’t been tested as restores are an unknown quantity. The gap shows up during an actual incident, when you discover the backup was corrupted three weeks ago and nobody noticed. 
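A restore test can be automated even in miniature. The sketch below uses SQLite from the Python standard library as a stand-in for a production database: it takes a backup, reopens the backup file as a restore would, runs an integrity check, and compares the restored table against the source. The function names are illustrative, and the table name is interpolated into SQL, so it should only ever come from trusted code.

```python
import hashlib
import os
import sqlite3
import tempfile
from typing import Tuple


def table_snapshot(conn: sqlite3.Connection, table: str) -> Tuple[int, str]:
    """Row count plus a digest of the rows, used to compare two databases."""
    rows = conn.execute(f'SELECT * FROM "{table}" ORDER BY rowid').fetchall()
    digest = hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()
    return len(rows), digest


def backup_restores_cleanly(source: sqlite3.Connection, table: str) -> bool:
    """Back up `source`, reopen the backup file, and verify its contents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "backup.db")

        # Step 1: take the backup.
        dest = sqlite3.connect(path)
        try:
            source.backup(dest)
        finally:
            dest.close()

        # Step 2: open the backup as a fresh database, as a restore would.
        restored = sqlite3.connect(path)
        try:
            intact = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
            return intact and table_snapshot(restored, table) == table_snapshot(source, table)
        finally:
            restored.close()
```

The same shape applies to larger systems: restore into a scratch environment on a schedule, run the database’s own consistency check, and compare against known-good data, so a corrupted backup is found by the test instead of by the incident.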

Treating cache as the source of truth is a design error with a predictable failure mode. In-memory caches like Redis are fast, but they can lose data on restart, failover, or eviction unless persistence and memory policies are configured for durability. Any data that can’t be lost needs a durable store behind it, with the cache serving as acceleration, not storage. 
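One common way to keep the cache in its place is the cache-aside pattern, sketched below. Plain dictionaries stand in for both the durable database and the cache so the example runs on its own; `CachedRepository` and its methods are hypothetical names, and the sketch ignores concurrency and expiry.

```python
from typing import Any, Dict


class CachedRepository:
    """Cache-aside: the durable store is authoritative, the cache is optional."""

    def __init__(self, durable: Dict[str, Any], cache: Dict[str, Any]) -> None:
        self.durable = durable  # stand-in for a database
        self.cache = cache      # stand-in for Redis or Memcached

    def write(self, key: str, value: Any) -> None:
        # Write to the durable store first, then invalidate the cached copy.
        self.durable[key] = value
        self.cache.pop(key, None)

    def read(self, key: str) -> Any:
        if key in self.cache:
            return self.cache[key]
        value = self.durable[key]  # raises KeyError if the data never existed
        self.cache[key] = value    # repopulate for later reads
        return value
```

With this arrangement, wiping the cache costs latency on the next read and nothing else, which is the property the paragraph above asks for.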

Confusing stateless applications with stateless systems is another common mistake. A stateless API may still depend on databases, message brokers, caches, and object storage that hold critical application state. Making the application tier stateless simplifies operations, but the underlying state still needs protection, replication, backup, and recovery planning. 

Conclusion 

Stateless application tiers scale and recover easily because no instance owns data – replace a pod, and nothing is lost. Stateful services are unavoidable wherever data, sessions, ordering, or persistence matter, and they require dedicated storage, replication, backup, and tested failover. 

Most production systems combine both. Stateless tiers handle user requests and business logic. Stateful services manage databases, message queues, caches, and persistent storage. If you’re designing a new service, default to stateless compute with externalized state and invest your operational complexity budget in the data layer where it belongs. 

FAQ 

What is the difference between stateful and stateless applications? 

A stateless application processes each request independently and does not retain client-specific state between requests. A stateful application depends on stored data, session context, or persistent identity to function correctly across requests or restarts. 

Is REST stateless? 

The REST architectural style defines statelessness as a constraint: each request must carry all the information needed to process it, and the server stores no session state between requests. Many APIs described as REST do use server-side sessions, which technically violates this constraint. 

Is a database stateful or stateless? 

Databases are stateful. They accumulate and persist data, maintain transaction and connection state, and require persistent storage to function. This is the fundamental reason databases need different operational treatment than application containers. 

What is the difference between Deployment and StatefulSet? 

A Deployment manages interchangeable pods that can be replaced and rescheduled freely. A StatefulSet assigns each pod a stable identity and its own persistent storage, and enforces ordered startup and shutdown sequences. 

Can a stateful app be converted to stateless? 

The application tier of many stateful applications can be made stateless by moving session data, file storage, and persistence to external services. The overall system remains stateful because the data still exists; it is simply managed outside the application process. 

Is Redis stateful or stateless? 

Redis is generally considered stateful because it stores data that applications depend on. When used for caching, data loss may be acceptable. When used for sessions, queues, or as a primary data store, Redis becomes a critical stateful component that requires persistence, replication, and backup planning. 

Which is better: stateful or stateless? 

Neither is universally better. Stateless application tiers are easier to scale, replace, and operate. Stateful data services are unavoidable for any application that needs to retain information. The best architectures use stateless compute and treat their stateful data services as the critical infrastructure they are. 

from StarWind Blog https://ift.tt/F4OYBcb