Lab 2 - AI-Driven Anomaly Detection
Task Goal
In this laboratory session, you'll explore the AI-Driven anomaly detection feature and how to get the most out of the issues generated by the AI/ML algorithm, leveraging the advanced baselining to keep the focus on the most relevant issues and getting detailed information that helps you resolve the identified anomalies.
This lab task will guide you through an AI-Driven anomaly detection issue workflow:
- get an overview of the issues detected on the network, organized by location and category, using the Assurance issue dashboard
- when an anomaly is detected, compare the expected performance computed by the AI/ML baselining algorithm with the actual network performance
- get guidance to further understand the impacted clients and access points, as well as the potential root cause
Benefits
The AI-driven anomaly detection feature extends the vast issue set available on Cisco DNA Assurance and leverages AI/ML (Artificial Intelligence and Machine Learning) to model the network performance, learning from the past behavior in relation to the network conditions (e.g., number and type of clients, applications used, RF conditions, etc.).
Instead of using traditional static thresholds, the AI/ML model generates a baseline defining the expected lower and upper bounds of a given KPI for a specific network entity; this range is used as a reference for the expected behavior, and an anomaly is raised when the actual KPI falls outside these boundaries.
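To make the mechanism concrete, here is a minimal Python sketch of baseline-band anomaly detection, using a rolling quantile window as a stand-in for the AI/ML model; the window size, quantile bounds, and function name are illustrative assumptions, not Cisco's actual algorithm.

```python
# Minimal illustration of baseline-band anomaly detection. The rolling
# quantile window stands in for the AI/ML model; it is NOT Cisco's algorithm.
import pandas as pd

def find_anomalies(kpi: pd.Series, window: int = 96) -> pd.Series:
    """Flag KPI samples that fall outside the expected rolling band."""
    lower = kpi.rolling(window, min_periods=window // 2).quantile(0.05)
    upper = kpi.rolling(window, min_periods=window // 2).quantile(0.95)
    # An anomaly is raised when the actual KPI leaves the expected range.
    return (kpi < lower) | (kpi > upper)
```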
The most important advantage of this approach is that the alerting system takes into account the network conditions and the past network behavior to establish whether the network performance is normal or anomalous.
Another advantage of this approach is that you don't need to configure anything to optimize the AI-Driven anomaly detection algorithm; the baselines are automatically computed for each network entity, constantly learning the normal network behavior from telemetry data.
This is also the reason why in this lab you won't be required to configure anything; even in a regular network setup, the only requirements to start benefiting from the AI-Driven anomaly detection issues are to have your network devices discovered and managed by the Cisco Catalyst Center appliance, and to enable the Cisco AI Analytics service, as explained in Lab 8 - Service Operations.
Even though the configuration is really simple, in this lab we provide you with a pre-configured system, as the AI-Driven anomaly detection typically requires up to one week of data ingestion in order to have enough data to train the AI/ML model specifically for your network.
This is the first practical example of how AI/ML is applied to a networking use case; the network conditions are constantly changing, and setting a static threshold would often result either in excessive alerting (if the threshold is too low, likely producing lots of false positives) or in missing important events (if the threshold is set too high).
Issue types
The AI-Driven issue types available as of Cisco Catalyst Center version 2.3.7 are organized in two main groups:
Onboarding and roaming
This category of issues focuses on the Wireless LAN client connection and roaming events.
AI/ML baselining is beneficial for this category of use cases for the following reasons:
- the time to complete a Wireless LAN onboarding or roaming depends on many factors, such as the SSID security policy (e.g., PSK authentication is usually faster than 802.1x), the building (even for the same SSID, the connection to the AAA server or the load on the server may be very different), and the client type(s) typically using a given SSID
- having some failures on Wireless LAN onboarding or roaming is normal, but - similarly to the onboarding/roaming time - the actual failure rate depends on several factors, including the time of day (e.g., morning or afternoon, when many people reach or leave the office, or lunch time, when people move a lot across the building)
Setting static thresholds is challenging; an AI model, however, can easily learn patterns in the data while considering the network conditions, adapting the baseline and allowing you to focus only on truly anomalous events. A rough sketch of this per-entity baselining idea follows.
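The hypothetical sketch below learns a separate expected failure rate for each (SSID, building, hour-of-day) combination; the column names and the hour-of-day bucketing are assumptions for illustration, not the product's internals.

```python
# Hypothetical sketch of per-entity baselining: one expected failure rate per
# (SSID, building, hour of day). Column names are illustrative assumptions.
import pandas as pd

def failure_rate_baselines(events: pd.DataFrame) -> pd.DataFrame:
    """events: one row per onboarding attempt, with a boolean 'failed' column."""
    events = events.assign(hour=events["timestamp"].dt.hour)
    return (
        events.groupby(["ssid", "building", "hour"])["failed"]
        .agg(expected_rate="mean", samples="count")
        .reset_index()
    )
```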
Onboarding issue types
The following table includes details about the data used to build each baseline and the associated issue type:
Issue Type | Target KPI | Input data | Entity aggregation |
---|---|---|---|
Excessive time to connect | expected time to complete the Wireless LAN onboarding | successful attempts | WLC, SSID, and Building |
Excessive failures to connect | expected failure rate to complete the Wireless LAN onboarding | failed attempts | WLC, SSID, and Building |
Excessive failures to roam | expected failure rate to complete the Wireless LAN roaming (client reassociating to a new AP, same SSID) | failed attempts | SSID, Source Building, and Destination Building |
Excessive time to Associate | expected time to complete 802.11 (re)association | successful attempts | SSID and Building |
Excessive failures to Associate | expected number of association failures | failed attempts | SSID and Building |
Excessive time to get Authenticated | expected time to complete the AAA authentication (including 802.1x and PSK, not counting Web-Auth) | successful attempts | SSID and Building |
Excessive failures to get Authenticated | expected authentication failure rate (including 802.1x and PSK, not counting Web-Auth) | failed attempts | AAA Server |
Excessive time to get an IP Address | expected time to complete the IP address learning | successful attempts | SSID and Building |
Excessive failures to get an IP address | expected IP address learning failure rate | failed attempts | DHCP Server |
Throughput
This category of issues focuses on modeling the application throughput for Wireless LAN clients. The application throughput data source is AVC/NBAR, collected using NetFlow from IOS-XE Wireless LAN Controllers, and baselines are computed for each individual radio in the network.
AI/ML baselining is beneficial for this type of issues because throughput can vary by several orders of magnitude depending on many factors (e.g., client number, client types, applications used, RF conditions...), making it almost impossible to use static thresholds.
Throughput issue types
The following table lists each throughput issue type along with sample applications in the corresponding category:
Issue type | Sample applications |
---|---|
Drop in total radio throughput | All |
Drop in radio throughput for Cloud Applications | Office 365, Salesforce, Google Workspace, iCloud, and other similar cloud-based applications. |
Drop in radio throughput for Collaboration Applications | SIP, Webex, MS Teams, Slack, FaceTime, and other similar collaboration apps. |
Drop in radio throughput for Social Applications | Facebook, Instagram, LinkedIn, Twitter, and other similar social media apps. |
Drop in radio throughput for Media Applications | YouTube, Netflix, Hulu, ESPN, and several other audio/video streaming apps. |
Use case workflow
It's time to explore the AI-Driven issues in the lab setup.
Issue dashboard
The AI-driven issues are presented in the Assurance Issue dashboard.
Reach it by going to the main menu:
Menu > Assurance > Issues and Events
AI-Driven issues can be easily recognized by the AI tag in front of the Issue Type. It's also possible to use the AI-Driven table filter to hide the other issue types:
The issue dashboard groups issues based on their type, showing a summary for the selected time period (optionally filtered by site), with the issue count and the last occurrence.
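If you prefer to retrieve the same list programmatically, Catalyst Center also exposes issues through its Intent API. The sketch below is a hedged example; verify the endpoint, parameters, and response fields against the API documentation for your release, and note that the host and credentials are placeholders.

```python
# Hedged sketch: listing AI-Driven issues via the Catalyst Center Intent API.
# Verify endpoint/parameters against your release's API docs; the host and
# credentials below are placeholders.
import requests

HOST = "https://catalyst-center.example.com"  # placeholder appliance address

token = requests.post(
    f"{HOST}/dna/system/api/v1/auth/token",
    auth=("admin", "password"),  # placeholder credentials
    verify=False,
).json()["Token"]

issues = requests.get(
    f"{HOST}/dna/intent/api/v1/issues",
    headers={"X-Auth-Token": token},
    params={"aiDriven": "Yes"},  # equivalent of the AI-Driven table filter
    verify=False,
).json()

for issue in issues.get("response", []):
    print(issue.get("name"))
```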
The first step in using an AI-Driven issue is to click on the issue type and then select a specific issue.
In this lab we'll analyze a Radio Throughput issue.
Look for the Drop in total radio throughput issue category on the Issue Dashboard; you can also click on the AI Driven button to show only AI-Driven issues on the table:
AI-issue overview
Once you open an AI-Driven issue you're presented with a summary of the key information about the anomaly detected by the AI/ML engine, such as:
- Time and date
- Location
- Info about the affected AP and radio band
- The number of impacted clients
The Problem Details view presents the KPI chart (in this case, the radio throughput):
- The blue line represents the actual KPI values observed on the given AP
- The green band represents the predicted normal value range for the same KPI. Note how the baseline changes over time, reflecting the constantly changing network conditions; throughput can change a lot and depends on many factors, so a simple threshold would not work at all to catch issues of this kind.
- The red bands highlight the anomaly period, when the actual throughput drops significantly below the AI/ML predicted baseline.
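To visualize what this chart conveys, here is a small self-contained sketch that reproduces the three elements (actual KPI line, predicted band, anomaly shading) with purely synthetic data; it does not use product data or the actual model.

```python
# Illustrative re-creation of the Problem Details chart with synthetic data:
# actual KPI (blue line), predicted range (green band), anomalies (red).
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(288)                    # one day of 5-minute samples
lower = 20 + 10 * np.sin(t / 40)      # synthetic baseline bounds
upper = 60 + 10 * np.sin(t / 40)
actual = (lower + upper) / 2 + np.random.normal(0, 5, t.size)
actual[200:220] = 5                   # synthetic throughput drop

anomaly = (actual < lower) | (actual > upper)
plt.plot(t, actual, color="blue", label="Actual throughput")
plt.fill_between(t, lower, upper, color="green", alpha=0.3, label="Predicted range")
plt.fill_between(t, actual.min(), actual.max(), where=anomaly, color="red",
                 alpha=0.2, label="Anomaly")
plt.legend()
plt.show()
```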
Impact
The observed issue overview refers to a specific radio, and the prediction takes into account a variety of RF and traffic related KPIs for the affected radio and the clients associated with it.
Knowing that a traffic anomaly was detected on a radio allows you to identify critical areas of the network where the client experience is suffering; however, at this point we need to know exactly which clients and applications were affected.
This information is easily accessible using the Impact view, where the information is organized in three tabs:
- Impacted Clients:
This view shows all the clients that were connected to the affected radio at the time the anomaly was detected, with client-specific RF summary information and device classification. Clicking on a client MAC Address will take you to the Device 360 view, while clicking on the Username will take you to the User 360 view for that client; this gives you direct access to the detailed client-specific information.
- Device Breakout:
This view shows the throughput aggregated by device type, in order to identify the traffic patterns by client type.
- Applications by TX/RX:
This view shows the throughput for each individual application, highlighting the ones that were observed during the time period and those that experienced a throughput drop.
Root Cause Analysis
The next step is to explore the Root Cause Analysis tab, where you can observe a wide range of KPIs related to the affected radio that help you understand the network conditions on the affected device at the time the anomaly was detected.
The root cause analysis detection method automatically highlights the most Probable network causes by pre-selecting the KPIs that are likely to explain the issue.
In this specific example you can observe how, at the time of the anomalous throughput drop, there was a steep increase in the percentage of clients exhibiting a low RSSI (up to 80% of clients were received by the AP at less than -80 dBm RSSI).
The pre-selected KPIs for this specific issue indicate that the clients' wireless connection quality dropped, and this is the likely root cause of the throughput drop.
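For reference, the highlighted KPI boils down to a simple share computation. The sketch below is a hypothetical helper (the function name is ours, and the -80 dBm mark is taken from the example above) showing the arithmetic.

```python
# Hypothetical helper showing the arithmetic behind the low-RSSI KPI:
# the share of associated clients received below a given RSSI floor.
def low_rssi_share(client_rssi_dbm: list[float], floor: float = -80.0) -> float:
    """Return the percentage of clients received below the RSSI floor."""
    if not client_rssi_dbm:
        return 0.0
    weak = sum(1 for rssi in client_rssi_dbm if rssi < floor)
    return 100.0 * weak / len(client_rssi_dbm)

# Four of five clients below -80 dBm -> 80% low-RSSI clients, as in the example.
print(low_rssi_share([-85.0, -82.0, -81.0, -83.0, -60.0]))  # 80.0
```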
By default, the UI only shows the KPIs that are considered relevant to explain the anomaly; however, you can use the Add KPI menu to add any of the available KPIs. Doing this helps add context and fully understand the network conditions of the affected network device at the time the anomaly was detected.
Key takeaways
AI-Driven issues are a powerful tool to detect network anomalies, especially for KPIs where using static thresholds, simple rules or simpler baselining techniques would either result in high alerting noise, or missing relevant events altogether.
The AI-Driven issues make use of AI/ML techniques to produce baselines that predict the normal behavior of each network entity, using past data and always taking the network conditions into account.
This concludes the exploration of the AI-Driven issues feature.
You can use the link below to proceed with the exploration of other use cases.