Driving Cloud Observability: A Conversation with Rishab Jolly of Microsoft

Rishab Jolly is a Senior Program Manager at Microsoft, where he plays a pivotal role in shaping the product strategy and execution of application monitoring within Azure Monitor—one of the most advanced and scalable cloud observability platforms in the world. With over 11 years in the technology space and more than 8 years focused on observability, Rishab brings deep expertise at the intersection of cloud infrastructure, monitoring, and AI.

As a Microsoft Gold Speaker, Rishab has shared his insights at premier global events, including Microsoft Build, and is widely recognized for his thought leadership in the evolving landscape of intelligent monitoring solutions. Beyond his contributions at Microsoft, he is passionate about mentoring startups, judging tech competitions, and engaging with a growing community of over 20,000 followers on LinkedIn and his podcast, Curious Souls, where he explores the human side of innovation and leadership.

In this Q&A, Rishab offers his perspective on the future of observability in an AI-native world, how product leadership is evolving in response to modern monitoring demands, and what it takes to build platforms that scale with both complexity and clarity…

(Disclaimer: The views expressed in my responses are my own and do not necessarily reflect those of Microsoft.)

Rishab, can you start by sharing your journey into cloud-native product development and how that led to your role with Azure Monitor at Microsoft?

My journey began as a software test engineer in India, where I worked on complex enterprise systems. That experience grounded me in the importance of software quality, resilience, and the need for deep visibility when things go wrong. After earning my MBA in the U.S., I joined Microsoft and eventually transitioned into product management—motivated by a desire to influence product direction and help customers solve real-world challenges at scale.

I was drawn to cloud-native product development because it brings enormous flexibility—but also massive complexity. Applications are now distributed, ephemeral, and constantly evolving. That’s where observability becomes mission-critical. Without it, teams are flying blind. With it, they can understand what’s happening in real time, reduce downtime, and make confident decisions faster.

Over the last eight years, I’ve helped define and execute the product strategy for Application Insights, part of Azure Monitor, which supports thousands of global customers across every industry—from finance to retail to public sector. We help ensure that critical systems stay performant and reliable. The stakes are real: an outage doesn’t just cost money—it can block access to essential services or ruin customer trust.

That’s what makes this work meaningful for me. Observability isn’t just a technical necessity—it’s a business enabler and a safeguard for digital experiences. Knowing that our product helps companies protect revenue and deliver better user experiences is what drives me every day.

What attracted you specifically to the observability space, and why do you believe it plays such a critical role in the cloud-native ecosystem?

What attracted me to the observability space was the combination of technical depth, real-time impact, and high-stakes decision-making. Early in my career as a developer/test engineer, I experienced firsthand how difficult it was to troubleshoot live systems under pressure—especially without meaningful visibility. When I moved into product management, I saw observability not just as a debugging tool, but as a strategic enabler for modern software teams.

In the cloud-native world, where applications are built on microservices, containers, and dynamic infrastructure, traditional monitoring breaks down. Observability fills that gap—it gives teams the context and correlation they need to detect issues early, understand root causes, and maintain reliability at scale. But beyond that, it’s also becoming smarter.

One of the most exciting areas I’ve worked on is integrating AI and machine learning into Azure Monitor. By analyzing massive volumes of telemetry in real time, we can surface anomalies, detect patterns, and even predict incidents before they happen. Our goal isn’t to replace humans—but to augment their intuition with intelligent, actionable insights.

Observability today is no longer reactive—it’s predictive, proactive, and deeply tied to business outcomes. Whether it’s protecting revenue, improving customer experience, or avoiding costly downtime, the impact is tangible. That’s what makes this space so critical—and why I’m passionate about shaping its future.

Azure Monitor supports massive workloads across the globe. What are the key design principles that guide your team when building for such scale and complexity?

When building a platform like Azure Monitor that supports global-scale, mission-critical applications, we follow a set of core design principles to ensure the product remains resilient, performant, and trustworthy—no matter the complexity of the environment.

  • First is scalability from day one. Azure Monitor is built to handle data from millions of resources across diverse environments—cloud-native, hybrid, and on-premises. We focus on scalable ingestion pipelines and dynamic infrastructure to support workloads of all sizes.

  • Second, we prioritize reliability and high availability. Observability needs to be there when customers need it most—especially during outages, spikes, or unexpected behaviors. We design the system to degrade gracefully and ensure continuity of critical experiences like alerts, dashboards, and more.
  • Performance and efficiency are equally important. Customers expect real-time insights, and our architecture focuses on delivering low-latency experiences while remaining cost-effective.
  • We also take a security- and compliance-first approach, ensuring that data is protected, access is controlled, and governance needs—such as GDPR—are met by design.
  • Finally, a core guiding principle is to maximize signal over noise. At cloud scale, the volume of telemetry can be overwhelming. That’s where our investments in intelligent analytics and machine learning come in—helping customers focus on what matters, identify root causes quickly, and act with confidence.

These principles have helped Azure Monitor become a trusted observability platform for organizations around the world, including those running critical workloads that power industries, governments, and digital experiences we all rely on.

As applications become more distributed and ephemeral, what new observability challenges have you seen emerge over the past few years?

Over the past few years, the shift to distributed and ephemeral architectures—with microservices, containers, and serverless—has introduced a new wave of observability challenges that traditional monitoring simply can’t handle.

  • A major one is the loss of context. When resources are short-lived and dynamic, it becomes harder to trace the lifecycle of a request or pinpoint where something broke. Correlating signals across intricate service dependencies is increasingly difficult. A single transaction might touch dozens of microservices, APIs, and external systems. Without dynamic mapping and visualization, issues like cascading failures can remain invisible until they reach the end user.
  • We’ve also seen a massive explosion in telemetry data leading to high costs. As customers adopt finer-grained architectures, observability pipelines get flooded with signals. To manage this scale efficiently, we use sampling strategies that help control costs and reduce noise—while still retaining meaningful insights. The goal is to provide the right level of visibility without overwhelming teams or infrastructure.
  • Additionally, alert fatigue has intensified. In dynamic environments, minor fluctuations or auto-scaling events can trigger alerts—many of which are non-actionable. Teams are often bombarded with notifications, which dilutes focus on critical incidents and slows response times.
  • Tool fragmentation is another challenge we frequently observe. Many organizations still rely on a patchwork of legacy monitoring solutions that don’t integrate well with cloud-native platforms. This leads to visibility silos, especially in hybrid and multi-cloud setups, where root-cause analysis becomes even harder.
  • And finally, there’s the human side. Modern observability isn’t just about tools—it’s about helping teams make sense of complexity. Developers, SREs, and business stakeholders all need different levels of insight. One of our ongoing goals with Azure Monitor is to reduce this cognitive load—by surfacing meaningful, actionable insights, not just raw data.

These challenges are pushing the industry to rethink observability—from reactive monitoring to proactive, AI-assisted intelligence. And that’s exactly where we’re investing.
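The sampling idea mentioned above can be made concrete with a small sketch. The following is a minimal illustration of deterministic, hash-based head sampling, not Azure Monitor's actual pipeline; the `keep_trace` function and trace-id format are hypothetical.

```python
import hashlib

def keep_trace(trace_id: str, rate: float = 0.1) -> bool:
    """Deterministic head sampling: hash the trace id into a bucket so
    every node makes the same keep/drop decision for the same trace."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < int(rate * 10_000)

# Roughly `rate` of traces survive, and a trace is never split:
# all of its spans share one decision, preserving end-to-end context.
traces = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in traces if keep_trace(t)]
```

Because the decision is a pure function of the trace id, every service in the call chain retains the same traces without any coordination, which is what keeps sampled telemetry correlatable.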

Can you explain how AI and machine learning are being integrated into Azure Monitor to enhance the monitoring experience for developers and operators?

Microsoft has recently introduced several AI-powered innovations in Azure Monitor that are transforming how developers and operators observe, troubleshoot, and optimize complex cloud-native systems.

  1. Issues & Investigation (Preview): This feature uses AIOps to automatically identify likely root causes and recommend next steps—dramatically reducing investigation time and noise.
  2. Health Models (Preview): These apply machine learning to detect degradation and anomalies in real time, helping teams proactively address issues before they affect end users.
  3. Intelligent View in Application Map (GA): By automatically mapping service dependencies and performance outliers, this ML-enhanced view accelerates root-cause analysis in distributed systems.
  4. Log Analytics Simple Mode (GA): This intuitive, spreadsheet-style interface democratizes telemetry analysis for all roles—not just KQL experts—reducing ramp-up time and operational friction.
  5. AI-Powered Application Insights Code Optimizations (GA): Developers running .NET apps on Azure now receive actionable code-level performance suggestions, directly integrated with tools like GitHub Copilot for Azure and Copilot coding agents, bringing observability closer to the development lifecycle.
  6. Enhanced AI and Agent Observability (Public Preview): Azure Monitor now supports real-time monitoring and continuous evaluation of generative AI apps and agentic systems, in partnership with Azure AI Foundry. This includes:
    1. A unified observability dashboard for full-stack visibility across infrastructure and AI application metrics.
    2. Alerting via Application Insights, enabling proactive response to changes in performance, quality, or safety.
    3. Debugging and tracing capabilities, allowing teams to investigate issues like groundedness regressions with deep visibility.
  7. AIOps in Logs & Metrics: Built-in anomaly detection, dynamic thresholds, and machine learning functions for time-series analysis help teams move from reactive troubleshooting to predictive and intelligent observability.

These innovations reflect a broader shift in observability—from being a reactive tool to becoming an AI-augmented, decision-driving capability. At Azure Monitor, our mission is to surface clarity, not just data—empowering teams to deliver reliable systems at global scale, even as architectures grow more complex.

What are some practical use cases where AI has directly improved detection, diagnostics, or incident response within the platform?

AI and machine learning are now deeply woven into Azure Monitor’s core capabilities, enabling faster detection, smarter diagnostics, and more effective incident response across complex, cloud-native environments. Here are some practical use cases where we’ve seen meaningful impact:

1. Automated Incident Detection and Root Cause Analysis

With the new AI-powered Issue Investigation, Azure Monitor ingests relevant telemetry—metrics, logs, traces—into a unified issue record. The system then performs end-to-end correlation, identifies anomalies, and generates root cause explanations with remediation recommendations.

Example: If error rates spike across microservices, the system can trace dependencies and pinpoint the failing component—such as a misconfigured container or bottlenecked database—reducing triage time dramatically.
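That cascading-failure example reduces to a small dependency-graph exercise. The sketch below uses hypothetical service names and is not an Azure Monitor API; it marks a failing service as a root-cause candidate only when none of its own downstream dependencies are also failing.

```python
# Hypothetical dependency graph: each service maps to the services it calls.
DEPS = {
    "frontend": ["orders-api", "auth"],
    "orders-api": ["inventory", "payments-db"],
    "auth": [],
    "inventory": [],
    "payments-db": [],
}

def root_cause(failing: set, graph: dict) -> set:
    """A failing service is a root-cause candidate only if none of its
    downstream dependencies are failing; upstream failures are cascades."""
    return {
        svc for svc in failing
        if not any(dep in failing for dep in graph.get(svc, []))
    }

# frontend and orders-api fail only because payments-db is down:
failing = {"frontend", "orders-api", "payments-db"}
culprits = root_cause(failing, DEPS)
```

Real systems add timing, error-rate, and trace evidence on top of this, but the core move is the same: follow the dependency edges until the failure no longer has a failing cause beneath it.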

2. Intelligent Alerting and Anomaly Detection

Azure Monitor’s Smart Detection uses machine learning to understand baseline patterns and trigger alerts only when deviations truly matter—such as a sudden increase in failure rates or user-facing latency spikes.

With dynamic thresholds, the platform continuously adjusts its sensitivity, reducing alert noise and helping teams focus on genuine incidents rather than false positives.
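As a rough illustration of what a dynamic threshold does, the sketch below learns a rolling baseline and flags only points that deviate strongly from it. Azure Monitor's actual models are ML-based and far more sophisticated; this is just the underlying intuition, with hypothetical parameter choices.

```python
from statistics import mean, stdev

def dynamic_threshold_alerts(series, window=20, k=3.0):
    """Flag points deviating more than k standard deviations from a
    rolling baseline - a simplified stand-in for learned thresholds."""
    alerts = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if abs(series[i] - mu) > k * max(sigma, 1e-9):
            alerts.append(i)
    return alerts

# Steady latency around 100-104 ms with one genuine spike at index 30:
latency = [100.0 + (i % 5) for i in range(40)]
latency[30] = 400.0
alerts = dynamic_threshold_alerts(latency)
```

Because the threshold adapts to the observed baseline, the normal 100–104 ms oscillation never fires, while the single real spike does — the "alert only when deviations truly matter" behavior described above.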

3. Application Performance Diagnostics

The Application Map with Intelligent View uses ML to visualize service dependencies and automatically surface bottlenecks or failure hotspots—even across highly distributed architectures.

Example: In cases of sluggish response times, the map can quickly show whether an API or backend service is responsible, helping teams isolate issues in seconds rather than hours.

4. Proactive Health Models and Alert Noise Reduction

AI-powered Health Models continuously analyze telemetry to detect business-impacting incidents, correlating multiple abnormal signals to surface a single, high-confidence alert.

Example: Rather than firing alerts for each individual anomaly, Azure Monitor raises a unified health issue only when correlated signals—such as database and storage failures—suggest a true outage.
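That correlation step can be illustrated in a few lines. The `health_alert` function below is a hypothetical simplification, not the Health Models implementation: it stays silent for an isolated anomaly and emits one unified alert only when several correlated signals are abnormal at the same time.

```python
def health_alert(signals: dict, min_correlated: int = 2):
    """Suppress isolated anomalies; raise a single unified alert only
    when enough correlated signals are abnormal simultaneously."""
    abnormal = sorted(name for name, bad in signals.items() if bad)
    if len(abnormal) >= min_correlated:
        return {"severity": "critical", "correlated_signals": abnormal}
    return None

# A lone storage blip stays quiet; correlated failures raise one alert.
quiet = health_alert({"db_errors": False, "storage_latency": True})
loud = health_alert({"db_errors": True, "storage_latency": True,
                     "cpu_saturation": False})
```

The payoff is the noise reduction described above: operators see one high-confidence incident instead of a separate alert per symptom.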

These AI capabilities not only reduce manual effort and response time, but also bring clarity, precision, and proactivity to how modern teams manage their cloud environments. It’s a shift from reactive monitoring to intelligent observability, and it’s core to how we’re evolving Azure Monitor.

There’s often concern around AI-generated insights creating noise or masking root issues. How do you ensure transparency and control in the way AI surfaces recommendations?

That’s a very real concern—and one we’ve taken seriously as we integrate AI more deeply into Azure Monitor. While AI can unlock powerful insights, it’s essential that these recommendations are transparent, explainable, and actionable—not black boxes.

  • First, we design every AI-assisted feature to include clear explanations of why a recommendation or detection was surfaced. For example, in the Issue Investigation experience, the platform doesn’t just say “here’s a problem”—it shows the correlated telemetry, traces, and metrics that contributed to the root cause hypothesis.
  • Second, we prioritize human-in-the-loop design. AI provides suggestions, not conclusions. Operators can review, validate, or override the insights as needed. This ensures that teams remain in control while still benefiting from automation. We also give users fine-grained control over thresholds, anomaly sensitivity, and alert tuning. Features like dynamic alerting can be adjusted based on the team’s risk tolerance and operational maturity.
  • Finally, we emphasize observability of the AI itself. Azure AI Foundry Observability, integrated with Azure Monitor Application Insights, enables you to continuously monitor your deployed AI applications to ensure that they’re performant, safe, and produce high-quality results in production. Beyond continuous monitoring, we also provide continuous evaluation capabilities for agents, further enhancing the Foundry Observability dashboard with visibility into additional critical quality and safety metrics.

In short, we believe that for AI in observability to be truly valuable, it must be as transparent and accountable as the systems it’s helping to monitor.

When you think about metrics, logs, and traces, how do you prioritize improvements across those telemetry pillars to deliver a more unified observability experience?

When it comes to metrics, logs, and traces—the foundational pillars of observability—our priority is to unify the experience, not just enhance each signal in isolation. At Azure Monitor, we focus on improvements that help users connect the dots across telemetry types, so they can move from detection to diagnosis as quickly and confidently as possible. The goal goes beyond signal collection: users should be able to work holistically across all signals in one experience.

  1. Correlation-first workflows
    We design features to default to cross-signal context: when a metric anomaly is detected, the user is immediately linked to related logs and traces. Tools like Application Map with Intelligent View automatically display dependencies and performance hotspots.
  2. Unified timeline and querying tools
    The shared framework for metrics, logs, and traces lets customers use Log Analytics and Kusto to query across all telemetry types—but we’ve built Simple Mode and smart defaults to make these queries accessible to non-KQL experts, streamlining insights in a single interface.
  3. Consistent telemetry schema
    Recent updates and expanding OpenTelemetry support ensure that telemetry from different pillars can be automatically correlated and visualized, minimizing schema mismatches.

Our product decisions are grounded in user workflows: if someone is resolving a latency issue, they shouldn’t need to hop between tools. Prioritizing cross-pillar investments—like unified timelines, correlated context, and simplified querying—helps deliver a cohesive and efficient observability experience that turns data into insight.
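Under the hood, this kind of cross-pillar pivot relies on a shared correlation key. The sketch below uses hypothetical records and field names (not the actual Application Insights schema) to show how logs and traces join on a common `operation_id`.

```python
# Hypothetical telemetry records sharing an operation_id correlation key.
logs = [
    {"operation_id": "op-1", "message": "timeout calling payments"},
    {"operation_id": "op-2", "message": "cache miss"},
]
traces = [
    {"operation_id": "op-1", "span": "POST /checkout", "duration_ms": 5200},
    {"operation_id": "op-2", "span": "GET /products", "duration_ms": 35},
]

def correlate(operation_id, logs, traces):
    """Pivot from a slow trace to its related log lines via the shared key."""
    return {
        "spans": [t for t in traces if t["operation_id"] == operation_id],
        "logs": [l for l in logs if l["operation_id"] == operation_id],
    }

# Investigating the slow checkout pulls its logs into the same view:
context = correlate("op-1", logs, traces)
```

When every pillar carries the same key — as consistent schemas like OpenTelemetry's encourage — this join is what lets a user move from a latency spike to the responsible log line without switching tools.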

Azure Monitor is used by teams with varying levels of cloud maturity. How do you design the product experience to meet the needs of both advanced teams and those just beginning their cloud journey?

Supporting teams across the full spectrum of cloud maturity is one of the most important—and rewarding—challenges we face with Azure Monitor. Our approach is to design an experience that’s intuitive for beginners, but powerful and extensible for experts.

For teams just starting their cloud journey, we prioritize simplicity, guided onboarding, and smart defaults. Features like Log Analytics Simple Mode, out-of-the-box dashboards, and AI-powered Smart Detection help users get value without needing deep expertise in telemetry or Kusto Query Language (KQL).

For advanced teams, we offer deep customization, rich analytics, and much more. Power users can write complex KQL queries, build custom workbooks, connect with ServiceNow, or route data to an external SIEM. We also expose granular controls around data retention, sampling, and role-based access—critical for enterprise-scale observability.

The key is to build progressive complexity—letting customers start simple, but grow into more sophisticated capabilities as their needs evolve. We also invest heavily in documentation, in-product help, and customer feedback to ensure the platform remains accessible yet powerful.

Ultimately, our goal is to make Azure Monitor feel approachable for a startup, and indispensable for large enterprises—without compromising on capability or control.

In your view, what makes a strong product leader in the observability space? What skills or mindset have been most critical in your own success?

I’ve had the privilege of working with some phenomenal product leaders in the observability space—people who bring sharp technical judgment, deep customer empathy, and a relentless focus on outcomes. The ones who stand out consistently connect the dots between engineering realities, business priorities, and user pain—then drive alignment without losing momentum.

From my side, what’s helped me contribute meaningfully to strategy and execution is:

  1. Anchor in customer outcomes, not just features: I’ve found it’s easy to ship dashboards, charts, or alerts—but the real value comes from asking: What customer pain does this solve? Are we helping teams detect, diagnose, and resolve issues faster? Are we reducing mean time to resolution? Are we increasing confidence in releases? It’s that outcomes-over-outputs mindset that keeps the work meaningful.
  2. Prioritize problems, not projects: Rather than jumping into delivery mode, I focus on deeply understanding the underlying user friction—whether it’s noisy alerts or siloed telemetry—and ensure our roadmap tackles the right problems, not just the next items.
  3. Drive alignment through clarity, not control: In a complex space like observability, I’ve learned that clear problem framing, crisp narratives, and shared mental models move things forward far more effectively than just chasing outputs.

It’s so important to push beyond just delivering features—to keep asking whether we’re actually helping customers resolve incidents faster, reduce alert fatigue, or trust their systems more. That focus on outcomes keeps the work grounded and aligned with what really matters.

Looking to the future, what developments in cloud-native observability are you most excited about? Are there any trends or technologies you believe will redefine how monitoring is done over the next few years?

There are a few key trends I’m especially excited about—each of which has the potential to fundamentally shift how we think about observability in the cloud-native era.

  • First is the evolution from observability as a backend function to observability as a product experience. It’s no longer just about exposing telemetry—it’s about helping teams make faster, smarter decisions through AI-assisted insights, visual context, and automation. We’re already seeing this with features like Issues & Investigation, Health Models, and AI-powered Application Insights Code Optimizations—and we’re just scratching the surface.
  • Second, I believe open standards like OpenTelemetry will become the universal language of observability. As systems grow more complex, vendor lock-in becomes a real bottleneck. The move toward interoperability and open instrumentation will empower customers to bring their data wherever it’s most actionable, and platforms like Azure Monitor will need to lead with flexibility and extensibility.
  • Third, I’m excited about the convergence of observability and AI model monitoring. As generative AI and agentic systems go into production, teams will need new ways to track not just performance and latency, but also groundedness, safety, drift and more. The integration between Azure Monitor and Azure AI Foundry Observability is a first step in this direction—bringing full-stack visibility to the next generation of intelligent apps.
  • And finally – making observability more accessible and integrated into developer workflows. Features like Application Insights Code Optimizations (GA) provide code-level recommendations directly inside Visual Studio Code, enhanced further with GitHub Copilot for Azure. At the same time, the introduction of Grafana dashboards natively within Azure Monitor reflects a commitment to open tools and familiar interfaces—lowering the learning curve while supporting advanced use cases.

The next era of observability will be open, intelligent, and deeply integrated into the fabric of software delivery—and I’m excited to help build it.

By Randy Ferguson