“Earlier this year, we introduced Project Flash in the Advancing Reliability blog series, to reaffirm our commitment to empowering Azure customers in monitoring virtual machine (VM) availability in a robust and comprehensive manner. Today, we’re excited to share the progress we’ve made since then in developing holistic monitoring offerings to meet customers’ distinct needs. I’ve asked Senior Technical Program Manager, Pujitha Desiraju, from the Azure Core Production Quality Engineering team to share the latest investments as part of Project Flash, to deliver the best monitoring experience for customers.” - Mark Russinovich, CTO, Azure.
Flash, as the project is internally known, is a collection of efforts across Azure Engineering that aims to evolve Azure’s virtual machine (VM) availability monitoring ecosystem into a centralized, holistic, and intelligible solution customers can rely on to meet their specific observability needs. As part of this multi-year endeavor, we’re excited to announce the:
- General availability of VM availability information in Azure Resource Graph for efficient and at-scale monitoring, convenient for detailed downtime investigations and impact assessment.
- Public preview of a VM availability metric in Azure Monitor for quick debugging, trend analysis of VM availability over time, and setting up threshold-based alerts on scenarios that impact workload performance.
- Preview of VM availability status change events via Azure Event Grid for instantaneous notifications on critical changes in VM availability, to quickly trigger remediation actions to prevent end-user impact.
Our commitment remains to maintain data consistency and similarly rigorous quality standards across all the monitoring solutions that are part of Flash, including existing solutions like Resource Health or Activity Log, so we deliver a consistent and cohesive experience to customers.
VM availability information in Azure Resource Graph for at-scale analysis
In addition to already flowing VM availability states, we recently published VM health annotations to Azure Resource Graph (ARG) for detailed failure attribution and downtime analysis, along with enabling a 14-day change tracking mechanism to trace historical changes in VM availability for quick debugging. With these new additions, we’re excited to announce the general availability of VM availability information in the HealthResources dataset in ARG! With this offering users can:
- Efficiently query the latest snapshot of VM availability across all Azure subscriptions at once and at low latencies for periodic and fleetwide monitoring.
- Accurately assess the impact to fleetwide business SLAs and quickly trigger decisive mitigation actions, in response to disruptions and the type of failure signature.
- Set up custom dashboards to supervise the comprehensive health of applications by joining VM availability information with additional resource metadata present in ARG.
- Track relevant changes in VM availability across a rolling 14-day window, by using the change-tracking mechanism for conducting detailed investigations.
Getting began
Users can query ARG via PowerShell, REST API, Azure CLI, or even the Azure Portal. The following steps detail how data can be accessed from the Azure Portal.
- Once on the Azure Portal, navigate to Resource Graph Explorer, which will look like the below image:
Figure 1: Azure Resource Graph Explorer landing page on Azure Portal.
- Select the Table tab and (single) click on the HealthResources table to retrieve the latest snapshot of VM availability information (availability state and health annotations).
Figure 2: Azure Resource Graph Explorer window depicting the latest VM availability states and VM health annotations in the HealthResources table.
There will be two types of events populated in the HealthResources table:
Figure 3: Snapshot of the type of events present in the HealthResources table, as shown in Resource Graph Explorer on the Azure Portal.
The microsoft.resourcehealth/availabilitystatuses event denotes the latest availability status of a VM, based on the health checks performed by the underlying Azure platform. Below are the availability states we currently emit for VMs:
- Available: The VM is up and running as expected.
- Unavailable: We’ve detected disruptions to the normal functioning of the VM and therefore applications will not run as expected.
- Unknown: The platform is unable to accurately detect the health of the VM. Users can usually check back in a few minutes for an updated state.
To poll the latest VM availability state, refer to the properties field, which contains the below details:
Sample
{
  "targetResourceType": "Microsoft.Compute/virtualMachines",
  "previousAvailabilityState": "Available",
  "targetResourceId": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>",
  "occurredTime": "2022-10-11T11:13:59.9570000Z",
  "availabilityState": "Unavailable"
}
Property descriptions
| Field | Description | Corresponding RHC field |
| --- | --- | --- |
| targetResourceType | Type of resource for which health data is flowing | resourceType |
| targetResourceId | Resource Id | resourceId |
| occurredTime | Timestamp when the latest availability state is emitted by the platform | eventTimestamp |
| previousAvailabilityState | Previous availability state of the VM | previousHealthStatus |
| availabilityState | Current availability state of the VM | currentHealthStatus |
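As a minimal sketch of how the properties payload might be consumed once it has been retrieved, the Python below checks for an Available to Unavailable transition and parses occurredTime. The helper names (`parse_occurred_time`, `became_unavailable`) are hypothetical and not part of any Azure SDK, and the payload is modeled on the sample properties with its placeholder IDs left in place.

```python
import json
from datetime import datetime, timezone

# Payload modeled on the sample properties; the <...> placeholders are kept as-is.
SAMPLE_PROPERTIES = json.loads("""
{
  "targetResourceType": "Microsoft.Compute/virtualMachines",
  "previousAvailabilityState": "Available",
  "targetResourceId": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>",
  "occurredTime": "2022-10-11T11:13:59.9570000Z",
  "availabilityState": "Unavailable"
}
""")


def parse_occurred_time(value: str) -> datetime:
    """Parse a UTC timestamp that carries 7 fractional digits.

    strptime's %f accepts at most 6 digits, so the fraction is
    truncated to microseconds before parsing.
    """
    base, _, fraction = value.rstrip("Z").partition(".")
    micro = (fraction + "000000")[:6]
    parsed = datetime.strptime(f"{base}.{micro}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def became_unavailable(properties: dict) -> bool:
    """Return True when the VM moved from Available to Unavailable."""
    return (
        properties.get("previousAvailabilityState") == "Available"
        and properties.get("availabilityState") == "Unavailable"
    )


if became_unavailable(SAMPLE_PROPERTIES):
    when = parse_occurred_time(SAMPLE_PROPERTIES["occurredTime"])
    print(f"{SAMPLE_PROPERTIES['targetResourceId']} became unavailable at {when.isoformat()}")
```

A check like this could run against each row returned from the HealthResources table; how the rows are fetched (portal export, CLI, or REST API) is left out of the sketch.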
Refer to this document for a list of starter queries to further explore this data.
The VM health annotation event contextualizes any changes to VM availability, by detailing necessary failure attributes to help users investigate and mitigate the disruption as needed. See the full list of VM health annotations emitted by the platform.
These annotations can be broadly classified into three buckets:
- Downtime Annotations: These annotations are emitted when the platform detects VM availability transitioning to Unavailable (for example, during unexpected host crashes or rebootful repair operations).
- Informational Annotations: These annotations are emitted during control plane activities with no impact to VM availability (such as VM allocation/Stop/Delete/Start). Usually, no additional customer action is required in response.
- Degraded Annotations: These annotations are emitted when VM availability is detected to be at risk (for example, when failure prediction models predict a degraded hardware component that can cause the VM to reboot at any given time). We strongly urge users to redeploy by the deadline specified in the annotation message, to avoid any unanticipated loss of data or downtime.
To poll the associated VM health annotations for a resource, if any, refer to the properties field, which contains the following details:
Sample
{
  "targetResourceType": "Microsoft.Compute/virtualMachines",
  "targetResourceId": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>",
  "annotationName": "VirtualMachineHostRebootedForRepair",
  "occurredTime": "2022-09-25T20:21:37.5280000Z",
  "category": "Unplanned",
  "summary": "We're sorry, your virtual machine isn't available because of an unexpected failure on the host server. Azure has begun the auto-recovery process and is currently rebooting the host server. No additional action is required from you at this time. The virtual machine will be back online after the reboot completes.",
  "context": "Platform Initiated",
  "reason": "Unexpected host failure"
}
Property descriptions
| Field | Description | Corresponding RHC field |
| --- | --- | --- |
| targetResourceType | Type of resource for which health data is flowing | resourceType |
| targetResourceId | Resource Id | resourceId |
| occurredTime | Timestamp when the latest availability state is emitted by the platform | eventTimestamp |
| annotationName | Name of the annotation emitted | eventName |
| reason | Brief overview of the availability impact observed by the customer | title |
| category | Denotes whether the platform activity triggering the annotation was either planned maintenance or unplanned repair. This field is not applicable to customer/VM-initiated events. Possible values: Planned, Unplanned, Not Applicable, Null | category |
| context | Denotes whether the activity triggering the annotation was due to an authorized user or process (customer-initiated), due to the Azure platform (platform-initiated), or even due to activity in the guest OS that has resulted in availability impact (VM-initiated). Possible values: Platform-initiated, User-initiated, VM-initiated, Not Applicable, Null | context |
| summary | Statement detailing the cause for annotation emission, along with remediation steps that can be taken by users | summary |
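To illustrate how the category and context fields might drive triage, here is a hedged Python sketch. It assumes the annotation properties use the field names `category` and `context` as listed in the table, and it normalizes casing and hyphens because the sample payload spells the context value "Platform Initiated" while the table lists "Platform-initiated". The function names are hypothetical, not part of any Azure SDK.

```python
from typing import Optional


def normalize(value: Optional[str]) -> str:
    """Lowercase a field value and treat hyphens as spaces; None becomes ''."""
    if value is None:
        return ""
    return value.lower().replace("-", " ").strip()


def is_unplanned_platform_event(properties: dict) -> bool:
    """Return True for annotations the platform raised during unplanned repair."""
    return (
        normalize(properties.get("category")) == "unplanned"
        and normalize(properties.get("context")) == "platform initiated"
    )


# Properties modeled on the sample annotation (summary omitted for brevity).
sample_annotation = {
    "annotationName": "VirtualMachineHostRebootedForRepair",
    "occurredTime": "2022-09-25T20:21:37.5280000Z",
    "category": "Unplanned",
    "context": "Platform Initiated",
    "reason": "Unexpected host failure",
}

if is_unplanned_platform_event(sample_annotation):
    print(f"Investigate: {sample_annotation['annotationName']} ({sample_annotation['reason']})")
```

Filtering on both fields keeps user-initiated operations such as Stop or Delete out of the investigation queue; which combinations deserve attention is a policy choice for each team.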
Refer to this document for a list of starter queries to further explore this data.
Looking ahead to 2023, we have multiple enhancements planned for the annotation metadata that’s surfaced in the HealthResources dataset. These enrichments will give users access to richer failure attributes to decisively prepare a response to a disruption. In parallel, we aim to extend the duration of historical lookback to a minimum of 30 days so users can comprehensively track past changes in VM availability.
VM availability metric in Azure Monitor Preview
We’re excited to share that the out-of-box VM availability metric is now available as a public preview for all users! This metric displays the trend of VM availability over time, so users can:
- Set up threshold-based metric alerts on dipping VM availability to quickly trigger appropriate mitigation actions.
- Correlate the VM availability metric with existing platform metrics like memory, network, or disk for deeper insights into concerning changes that impact the overall performance of workloads.
- Easily interact with and chart metric data across any relevant time window on Metrics Explorer, for quick and easy debugging.
- Route metrics to downstream tooling like Grafana dashboards, for setting up custom visualizations and dashboards.
Getting began
Users can either consume the metric programmatically via the Azure Monitor REST API or directly from the Azure Portal. The following steps highlight metric consumption from the Azure Portal.
- Once on the Azure Portal, navigate to the VM overview blade. The new metric will display as VM Availability (Preview), along with other platform metrics under the Monitoring tab.
Figure 4: View the newly added VM availability metric on the VM overview page on Azure Portal.
- Select (single click) the VM availability metric chart on the overview page, to navigate to Metrics Explorer for further analysis.
Figure 5: View the newly added VM availability metric on Metrics Explorer on Azure Portal.
Metric description:
| Property | Details |
| --- | --- |
| Display Name | VM Availability (preview) |
| Metric Values | 1 during expected behavior; corresponds to VM in Available state. 0 when VM is impacted by rebootful disruptions; corresponds to VM in Unavailable state. NULL (shows a dotted or dashed line on charts) when the Azure service that’s emitting the metric is down or is unaware of the exact status of the VM; corresponds to VM in Unknown state. |
| Aggregation | The default aggregation of the metric is Average, for prioritized investigations based on the extent of downtime incurred. The other aggregations available are Min, to immediately pinpoint all the times where the VM was Unavailable, and Max, to immediately pinpoint all the instances where the VM was Available. Refer here for more details on chart range, granularity, and data aggregation. |
| Data Retention | Data for the VM availability metric will be stored for 93 days to assist in trend analysis and historical lookback. |
| Pricing | Please refer to the Pricing breakdown, specifically in the “Metrics” and “Alert Rules” sections. |
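The effect of the three aggregations can be sketched with plain Python over a window of raw samples. This is an illustration only: the function name `aggregate_availability` is hypothetical, and it assumes NULL (Unknown) samples are simply excluded from the calculation, which may not match exactly how Azure Monitor treats gaps.

```python
from typing import Dict, Optional, Sequence


def aggregate_availability(samples: Sequence[Optional[int]]) -> Optional[Dict[str, float]]:
    """Aggregate 1/0/None samples into average, min, and max.

    1 = Available, 0 = Unavailable, None = Unknown (assumed to be skipped).
    Returns None when the window holds no known samples.
    """
    known = [s for s in samples if s is not None]
    if not known:
        return None
    return {
        "average": sum(known) / len(known),
        "min": min(known),
        "max": max(known),
    }


# One sample per minute: two minutes of downtime and one Unknown reading.
window = [1, 1, 0, 0, None, 1]
result = aggregate_availability(window)
print(result)

# A threshold-style check in the spirit of a metric alert (threshold is illustrative).
ALERT_THRESHOLD = 1.0
if result is not None and result["average"] < ALERT_THRESHOLD:
    print("Average availability dipped below the threshold for this window")
```

Here Average reflects how much of the window was spent Unavailable, Min drops to 0 as soon as any downtime occurred, and Max stays at 1 as long as the VM was Available at least once.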
Looking ahead to 2023, we plan to include impact details (user vs platform initiated, planned vs unplanned) as dimensions to the metric, so users are well equipped to interpret dips and set up much more targeted metric alerts. With the emission of dimensions in 2023, we also anticipate transitioning the offering to general availability status.
Introducing instantaneous notifications on changes in VM availability via Event Grid
We’re thrilled to introduce our latest monitoring offering: the private preview of VM availability status change events in an Event Grid System Topic, which uses the low-latency technology of Azure Event Grid! Users can now subscribe to the system topic and route these events to their downstream tooling using any of the available event handlers (such as Azure Functions, Logic Apps, Event Hubs, and Storage queues). This solution uses an event-driven architecture to communicate scoped changes in VM availability to end users in less than 5 seconds from the disruption occurrence. This empowers users to take instantaneous mitigation actions to prevent end-user impact.
As part of the private preview, we’ll emit events scoped to changes in VM availability states, with the sample schema below:
Sample
{
  "id": "4c70abbc-4aeb-4cac-b0eb-ccf06c7cd102",
  "topic": "/subscriptions/<subscriptionId>",
  "subject": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>/providers/Microsoft.ResourceHealth/AvailabilityStatuses/current",
  "data": {
    "resourceInfo": {
      "id": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>/providers/Microsoft.ResourceHealth/AvailabilityStatuses/current",
      "properties": {
        "targetResourceId": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>",
        "targetResourceType": "Microsoft.Compute/virtualMachines",
        "occurredTime": "2022-09-25T20:21:37.5280000Z",
        "previousAvailabilityState": "Available",
        "availabilityState": "Unavailable"
      }
    },
    "apiVersion": "2020-09-01"
  },
  "eventType": "Microsoft.ResourceNotifications.HealthResources.AvailabilityStatusesChanged",
  "dataVersion": "1",
  "metadataVersion": "1",
  "eventTime": "2022-09-25T20:21:37.5280000Z"
}
The properties field is fully consistent with the microsoft.resourcehealth/availabilitystatuses event in ARG. The Event Grid solution provides near-real-time alerting capabilities on the data present in ARG.
We’re currently releasing the preview to a small subset of users to rigorously test the solution and collect iterative feedback. This approach allows us to preview and even announce the general availability of a high-quality and well-rounded offering in 2023. As we look towards the general availability of this solution, users can expect to receive events when annotations and automated RCAs are emitted by the platform.
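A downstream handler for these events could look roughly like the Python sketch below. It assumes the payload follows the Event Grid event schema, with the availability details nested under `data.resourceInfo.properties` as in the sample; the `Transition` type and `extract_transition` function are hypothetical names, and wiring the handler into Azure Functions or another event handler is out of scope here.

```python
from typing import NamedTuple, Optional

# Event type string as it appears in the sample schema for the private preview.
EVENT_TYPE = "Microsoft.ResourceNotifications.HealthResources.AvailabilityStatusesChanged"


class Transition(NamedTuple):
    resource_id: str
    previous: str
    current: str


def extract_transition(event: dict) -> Optional[Transition]:
    """Return the availability transition carried by an event, if any.

    Events of another type, or events without a real state change,
    yield None so callers can ignore them.
    """
    if event.get("eventType") != EVENT_TYPE:
        return None
    properties = event.get("data", {}).get("resourceInfo", {}).get("properties", {})
    previous = properties.get("previousAvailabilityState")
    current = properties.get("availabilityState")
    if previous is None or current is None or previous == current:
        return None
    return Transition(properties.get("targetResourceId", ""), previous, current)


# Trimmed event modeled on the sample schema; <...> placeholders are kept as-is.
VM_ID = "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>"
sample_event = {
    "id": "4c70abbc-4aeb-4cac-b0eb-ccf06c7cd102",
    "eventType": EVENT_TYPE,
    "data": {
        "resourceInfo": {
            "properties": {
                "targetResourceId": VM_ID,
                "targetResourceType": "Microsoft.Compute/virtualMachines",
                "occurredTime": "2022-09-25T20:21:37.5280000Z",
                "previousAvailabilityState": "Available",
                "availabilityState": "Unavailable",
            }
        },
        "apiVersion": "2020-09-01",
    },
}

transition = extract_transition(sample_event)
if transition is not None and transition.current == "Unavailable":
    print(f"Trigger remediation for {transition.resource_id}")
```

Returning None for unrelated or no-op events keeps the remediation path limited to genuine state changes.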
What’s next?
We’ll be heavily focused on strengthening our monitoring platform to continuously improve the experience for customers based on ongoing feedback collected from the community (such as aggregated VMSS health showing degraded inaccurately, VM unavailable for 15 minutes, and missing VM downtimes in Activity Log). By streamlining our internal message pipeline, we aim to not only improve data quality, but also maintain data consistency across our offerings and expand the scope of failure scenarios surfaced.
Introducing Degraded VM Availability state
In light of our upcoming efforts to centralize our monitoring architecture, we’ll be well-positioned to introduce a Degraded VM availability state for virtual machines in 2023. This state will be extremely useful in setting up targeted alerts on predicted hardware failure scenarios where there is imminent risk to VM availability. This state will also allow users to efficiently track cases of degraded hardware or software failures that require a redeploy, which today don’t cause a corresponding change in VM availability. We will also aim to emit reminder annotations throughout the duration of the VM being marked Degraded, to prevent users from overlooking the request to redeploy.
Expand scope of failure attribution to include application freeze events
In 2023, we plan to expand our scope of failure attribution and emission to also include application freeze events that may be caused by network agent updates, host OS updates lasting thirty seconds, and freeze-causing repair operations. This will ensure users have enhanced visibility into freeze impact and will be applied across our monitoring offerings, including Resource Health and Activity Logs.
Learn More
Please stay tuned for more announcements on the Flash initiative, by monitoring updates to the Advancing Reliability Series!