Creating an incident response plan – Monitoring Microsoft 365 Tenant Health

If an incident occurs that affects the availability of services or features in your tenant, you need to be able to respond quickly. An incident response plan is a framework that you prepare in advance to help you do so.

While the details of each incident may differ, the steps you take to prepare for and work through one are the same:

  1. Validate the incident scope details and confirm that your environment is affected. Not all incidents affect all tenants, so use the information in the Message Center (https://admin.microsoft.com/#/MessageCenter), together with your own investigative procedures such as self-assessments, tests, or synthetic transactions, to confirm the impact.
  2. Determine whether the incident is relevant to your organization. If the incident involves a service that your organization hasn’t yet deployed or doesn’t interfere with business operations, it may not be relevant.
  3. Once degradation and relevance to your environment have been confirmed, review information sources for details on the timeline of Microsoft’s response. Microsoft will post regular status updates in the Message Center. If information such as a timeline has not been established, you can open a service ticket with Microsoft to request it.
  4. Develop a backup solution in case the service outage or degradation lasts longer than an acceptable time frame for your organization. Depending on the type of outage, this may mean working offline to fulfill business requirements.
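
The four steps above can be sketched as a simple decision routine. The following is a minimal illustration only: the Incident record, its field names, and the max_outage_hours threshold are hypothetical and would need to be adapted to your organization's own plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; every field name here is illustrative."""
    service: str
    tenant_affected: bool        # step 1: confirmed by your own tests
    service_deployed: bool       # step 2: does the organization use the service?
    blocks_business: bool        # step 2: does it interfere with operations?
    eta_hours: Optional[float]   # step 3: published timeline, if one exists


def next_action(incident: Incident, max_outage_hours: float) -> str:
    """Return the next step of the plan for a given incident."""
    # Step 1: not every incident affects every tenant.
    if not incident.tenant_affected:
        return "no action: tenant not affected"
    # Step 2: an undeployed or non-critical service may not be relevant.
    if not (incident.service_deployed and incident.blocks_business):
        return "no action: not relevant to the organization"
    # Step 3: without a timeline, request one from Microsoft.
    if incident.eta_hours is None:
        return "open a service ticket to request a timeline"
    # Step 4: fall back to the backup solution if the outage is too long.
    if incident.eta_hours > max_outage_hours:
        return "activate backup solution"
    return "monitor status updates"


if __name__ == "__main__":
    outage = Incident("Exchange Online", True, True, True, eta_hours=8)
    print(next_action(outage, max_outage_hours=4))
```

Encoding the plan this way is optional, but writing the decision points down explicitly makes it easier to agree on thresholds, such as the acceptable outage duration, with business unit owners before an incident occurs.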

Business continuity planning (BCP) is important regardless of the technology platforms or services being used. Work with various business unit owners to establish communication plans and methods to continue business operations should a service interruption occur.

Monitoring service health

Service health information is available from the Microsoft 365 admin center (https://admin.microsoft.com). Microsoft provides health information for a variety of services and features, including SaaS services such as Exchange Online and SharePoint Online, the directory synchronization environment, and Windows operating system feature issues and service health.

You can check the overall service health by navigating to the health dashboard (Health | Dashboard), as shown in Figure 2.5:

Figure 2.5 – Service health dashboard

The health dashboard contains the current health status of all Microsoft 365 services. Normally, services will appear as Healthy, though this status will be updated when a service experiences an issue.
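
The same overview can also be retrieved programmatically. The following is a minimal sketch that assumes the Microsoft Graph service communications endpoint (admin/serviceAnnouncement/healthOverviews), an access token that already carries the ServiceHealth.Read.All permission, and a response in which healthy services report the status value serviceOperational. None of these details come from this chapter, so verify them against the current Microsoft Graph documentation before relying on them. Token acquisition is out of scope here.

```python
import json
import urllib.request

# Assumed Microsoft Graph endpoint for the per-service health overview.
GRAPH_URL = (
    "https://graph.microsoft.com/v1.0/admin/serviceAnnouncement/healthOverviews"
)

# Assumed status value that Graph reports for a healthy service.
HEALTHY_STATUS = "serviceOperational"


def build_request(access_token: str) -> urllib.request.Request:
    """Build an authenticated GET request for the health overview."""
    return urllib.request.Request(
        GRAPH_URL,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
    )


def unhealthy_services(payload: dict) -> list:
    """Return 'service: status' strings for every service that is not healthy."""
    return [
        f"{item['service']}: {item['status']}"
        for item in payload.get("value", [])
        if item.get("status") != HEALTHY_STATUS
    ]


def fetch_unhealthy(access_token: str) -> list:
    """Call the endpoint and list unhealthy services (requires network access)."""
    with urllib.request.urlopen(build_request(access_token), timeout=30) as resp:
        return unhealthy_services(json.load(resp))
```

Separating the parsing step (unhealthy_services) from the network call makes it possible to test the filtering logic against a saved response without contacting the service.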

The Service health page (Health | Service health) displays the most detailed and comprehensive information on any ongoing or resolved issues:

Figure 2.6 – Service health page

If a service has an advisory or incident, you can expand the issue item under Active issues to display relevant events, as shown in Figure 2.7:

Figure 2.7 – Service health active issues

Selecting an individual item reveals expanded information about the particular issue. See Figure 2.8 for an example:

Figure 2.8 – Expanded active issue

Each service with an incident will display a status. Possible statuses include the following:

  • Normal service: This status indicates the service is available and has had no incidents, current or otherwise, during the reporting period.
  • Extended recovery: This status indicates that while steps have been completed to resolve the incident, it may take time for operations to return to normal. During an extended recovery period, some service operations might be delayed or take longer to complete.
  • Investigating: This status indicates that a potential service incident is being reviewed.
  • Service restored: This status indicates that an incident was active earlier in the day but the service was restored.
  • Service interruption: This status indicates the service isn’t functioning and that affected users are unable to access the service.
  • Additional information: This status indicates the presence of information regarding a recent incident from the previous day.
  • Service degradation: This status indicates that the service is slow or occasionally seems to be unresponsive for brief periods.
  • PIR published: This status indicates that a Post-Incident Report (PIR) of the service incident has been published.
  • Restoring service: This status indicates that the service incident is being resolved.
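
When building alerting or reporting around these statuses, it can help to separate those that describe an ongoing issue from those that are informational. The sketch below shows one way to do that. The grouping is an assumption made for illustration, based only on the descriptions above, and is not an official Microsoft classification.

```python
# Statuses that, per the descriptions above, describe an issue still in progress.
# This grouping is an illustrative assumption, not an official classification.
ONGOING_STATUSES = {
    "investigating",
    "service degradation",
    "service interruption",
    "restoring service",
    "extended recovery",
}

# Statuses that describe a healthy service or a past incident.
INFORMATIONAL_STATUSES = {
    "normal service",
    "service restored",
    "additional information",
    "pir published",
}


def is_ongoing(status: str) -> bool:
    """Return True if the status describes an issue that is still in progress."""
    normalized = status.strip().lower()
    if normalized not in ONGOING_STATUSES | INFORMATIONAL_STATUSES:
        raise ValueError(f"Unknown status: {status!r}")
    return normalized in ONGOING_STATUSES


def services_needing_attention(statuses: dict) -> list:
    """Given {service name: status}, return the sorted names with ongoing issues."""
    return sorted(name for name, status in statuses.items() if is_ongoing(status))


if __name__ == "__main__":
    print(services_needing_attention({
        "Exchange Online": "Service degradation",
        "SharePoint Online": "Normal service",
        "Microsoft Teams": "Service restored",
    }))
```

Raising an error for an unrecognized status is a deliberate choice in this sketch: if a new status is introduced, the failure is visible rather than being silently treated as healthy.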

As an administrator, it’s important to check the Service health dashboard frequently to stay apprised of alerts or incidents. If a service issue is affecting the Microsoft 365 admin center itself, you can also try the Office 365 status page (https://status.office.com) and the Azure status page (https://status.azure.com).