Lab 2 - AI-Driven Anomaly Detection
Task Goal
In this laboratory session, you'll explore the AI-Driven anomaly detection feature and how to get the most out of the issues generated by the AI/ML algorithm, leveraging the advanced baselining to keep the focus on the most relevant issues and getting detailed information that helps you resolve the identified anomalies.
This lab task will guide you through an AI-Driven anomaly detection issue workflow:
- get an overview of the issues detected on the network, organized by location and category, using the Assurance issue dashboard
- when an anomaly is detected, compare the expected performance computed by the AI/ML baselining algorithm with the actual network performance
- get guidance to further understand the impacted clients and access points, as well as the potential root cause
Benefits
The AI-driven anomaly detection feature extends the vast issue set available on Cisco DNA Assurance and leverages AI/ML (Artificial Intelligence and Machine Learning) to model the network performance, learning from the past behavior in relation to the network conditions (e.g., number and type of clients, applications used, RF conditions, etc.).
Instead of using traditional static thresholds, the AI/ML model generates a baseline defining the expected lower and upper bounds of a given KPI for a specific network entity; this range is used as a reference for the expected behavior, and an anomaly is raised when the actual KPI falls outside these boundaries.
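To make the mechanism concrete, here is a minimal Python sketch of baseline-band anomaly detection, using a rolling quantile window as a stand-in for the AI/ML model; the window size, quantile bounds, and function name are illustrative assumptions, not Cisco's actual algorithm.

```python
# Minimal illustration of baseline-band anomaly detection. The rolling
# quantile window stands in for the AI/ML model; it is NOT Cisco's algorithm.
import pandas as pd

def find_anomalies(kpi: pd.Series, window: int = 96) -> pd.Series:
    """Flag KPI samples that fall outside the expected rolling band."""
    lower = kpi.rolling(window, min_periods=window // 2).quantile(0.05)
    upper = kpi.rolling(window, min_periods=window // 2).quantile(0.95)
    # An anomaly is raised when the actual KPI leaves the expected range.
    return (kpi < lower) | (kpi > upper)
```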
The most important advantage of this approach is that the alerting system takes into account the network conditions and the past network behavior to establish whether the network performance is normal or anomalous.
Another advantage of this approach is that you don't need to configure anything to optimize the AI-Driven anomaly detection algorithm; the baselines are automatically computed for each network entity, constantly learning the normal network behavior from telemetry data.
This is also the reason why in this lab you won't be required to configure anything; even in a regular network setup, the only requirements to start benefiting from the AI-Driven anomaly detection issues are to have your network devices discovered and managed by the Cisco Catalyst Center appliance, and to enable the Cisco AI Analytics service, as explained in Lab 8 - Service Operations.
Even though the configuration is really simple, in this lab we provide you with a pre-configured system, as the AI-Driven anomaly detection typically requires up to one week of data ingestion in order to have enough data to train the AI/ML model specifically for your network.
This is the first practical example of how AI/ML is applied to a networking use case; the network conditions are constantly changing, and setting a static threshold would often result either in excessive alerting (if the threshold is too low, likely producing lots of false positives) or in missing important events (if the threshold is set too high).
Issue types
The AI-Driven issue types available as of Cisco Catalyst Center version 2.3.7 are organized in two main groups:
Onboarding and roaming
This category of issues focuses on the Wireless LAN client connection and roaming events.
AI/ML baselining is beneficial for this category of use cases for the following reasons:
- the time to complete a Wireless LAN onboarding or roaming depends on many factors, such as the SSID security policy (e.g., PSK authentication is usually faster than 802.1x), the building (even for the same SSID, the connection to the AAA server or the load on the server may be very different), and the client type(s) typically using a given SSID
- having some failures on Wireless LAN onboarding or roaming is normal, but - similarly to the onboarding/roaming time - the actual failure rate depends on several factors, including the time of day (e.g., morning or afternoon, when many people reach or leave the office, or lunch time, when people move a lot across the building)
Setting static thresholds is challenging; an AI model, however, can easily learn patterns in the data while considering the network conditions, adapting the baseline and allowing you to focus only on truly anomalous events. A rough sketch of this per-entity baselining idea follows.
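The hypothetical sketch below learns a separate expected failure rate for each (SSID, building, hour-of-day) combination; the column names and the hour-of-day bucketing are assumptions for illustration, not the product's internals.

```python
# Hypothetical sketch of per-entity baselining: one expected failure rate per
# (SSID, building, hour of day). Column names are illustrative assumptions.
import pandas as pd

def failure_rate_baselines(events: pd.DataFrame) -> pd.DataFrame:
    """events: one row per onboarding attempt, with a boolean 'failed' column."""
    events = events.assign(hour=events["timestamp"].dt.hour)
    return (
        events.groupby(["ssid", "building", "hour"])["failed"]
        .agg(expected_rate="mean", samples="count")
        .reset_index()
    )
```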
Onboarding issue types
The following table includes details about the data used to build each baseline and the associated issue type:
Issue Type | Target KPI | Input data | Entity aggregation |
---|---|---|---|
Excessive time to connect | expected time to complete the Wireless LAN onboarding | successful attempts | WLC, SSID, and Building |
Excessive failures to connect | expected failure rate to complete the Wireless LAN onboarding | failed attempts | WLC, SSID, and Building |
Excessive failures to roam | expected failure rate to complete the Wireless LAN roaming (client reassociating to a new AP, same SSID) | failed attempts | SSID, Source Building, and Destination Building |
Excessive time to Associate | expected time to complete 802.11 (re)association | successful attempts | SSID and Building |
Excessive failures to Associate | expected number of association failures | failed attempts | SSID and Building |
Excessive time to get Authenticated | expected time to complete the AAA authentication (including 802.1x and PSK, not counting Web-Auth) | successful attempts | SSID and Building |
Excessive failures to get Authenticated | expected authentication failure rate (including 802.1x and PSK, not counting Web-Auth) | failed attempts | AAA Server |
Excessive time to get an IP Address | expected time to complete the IP address learning | successful attempts | SSID and Building |
Excessive failures to get an IP address | expected IP address learning failure rate | failed attempts | DHCP Server |
Throughput
This category of issues focuses on modeling the application throughput for Wireless LAN clients. The application throughput data source is AVC/NBAR, collected using NetFlow from IOS-XE Wireless LAN Controllers, and baselines are computed for each individual radio in the network.
AI/ML baselining is beneficial for this type of issues because throughput can vary by several orders of magnitude depending on many factors (e.g., client number, client types, applications used, RF conditions...), making it almost impossible to use static thresholds.
Throughput issue types
The following table lists each throughput issue type along with sample applications in the corresponding category:
Issue type | Sample applications |
---|---|
Drop in total radio throughput | All |
Drop in radio throughput for Cloud Applications | Office 365, Salesforce, Google Workspace, iCloud, and other similar cloud-based applications. |
Drop in radio throughput for Collaboration Applications | SIP, Webex, MS Teams, Slack, FaceTime, and other similar collaboration apps. |
Drop in radio throughput for Social Applications | Facebook, Instagram, LinkedIn, Twitter, and other similar social media apps. |
Drop in radio throughput for Media Applications | YouTube, Netflix, Hulu, ESPN, and several other audio/video streaming apps. |
Use case workflow
It's time to explore the AI-Driven issues in the lab setup.
Issue dashboard
The AI-driven issues are presented in the Assurance Issue dashboard.
Reach it by going to the main menu:
Menu > Assurance > Issues and Events
AI-Driven issues can be easily recognized by the AI tag in front of the Issue Type. It's also possible to use the AI-Driven table filter to hide the other issue types:
The issue dashboard groups issues based on their type, showing a summary for the selected time period (optionally filtered by site), with the issue count and the last occurrence.
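If you prefer to retrieve the same list programmatically, Catalyst Center also exposes issues through its Intent API. The sketch below is a hedged example; verify the endpoint, parameters, and response fields against the API documentation for your release, and note that the host and credentials are placeholders.

```python
# Hedged sketch: listing AI-Driven issues via the Catalyst Center Intent API.
# Verify endpoint/parameters against your release's API docs; the host and
# credentials below are placeholders.
import requests

HOST = "https://catalyst-center.example.com"  # placeholder appliance address

token = requests.post(
    f"{HOST}/dna/system/api/v1/auth/token",
    auth=("admin", "password"),  # placeholder credentials
    verify=False,
).json()["Token"]

issues = requests.get(
    f"{HOST}/dna/intent/api/v1/issues",
    headers={"X-Auth-Token": token},
    params={"aiDriven": "Yes"},  # equivalent of the AI-Driven table filter
    verify=False,
).json()

for issue in issues.get("response", []):
    print(issue.get("name"))
```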
The first step in using an AI-Driven issue is to click on the issue type and then select a specific issue.
In this lab we'll analyze a Radio Throughput issue.
Look for the Drop in total radio throughput issue category on the Issue Dashboard; you can also click on the AI Driven button to show only AI-Driven issues on the table:
AI-issue overview
Once you open an AI-Driven issue you're presented with a summary of the key information about the anomaly detected by the AI/ML engine, such as:
- Time and date
- Location
- Info about the affected AP and radio band
- The number of impacted clients
The Problem Details view presents the KPI chart (in this case, the radio throughput):
- The blue line represents the actual KPI values observed on the given AP
- The green band represents the predicted normal value range for the same KPI. Note how the baseline changes over time, reflecting the constantly changing network conditions; throughput can change a lot and depends on many factors, so a simple threshold would not work at all to catch issues of this kind.
- The red bands highlight the anomaly period, when the actual throughput drops significantly below the AI/ML predicted baseline.
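To visualize what this chart conveys, here is a small self-contained sketch that reproduces the three elements (actual KPI line, predicted band, anomaly shading) with purely synthetic data; it does not use product data or the actual model.

```python
# Illustrative re-creation of the Problem Details chart with synthetic data:
# actual KPI (blue line), predicted range (green band), anomalies (red).
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(288)                    # one day of 5-minute samples
lower = 20 + 10 * np.sin(t / 40)      # synthetic baseline bounds
upper = 60 + 10 * np.sin(t / 40)
actual = (lower + upper) / 2 + np.random.normal(0, 5, t.size)
actual[200:220] = 5                   # synthetic throughput drop

anomaly = (actual < lower) | (actual > upper)
plt.plot(t, actual, color="blue", label="Actual throughput")
plt.fill_between(t, lower, upper, color="green", alpha=0.3, label="Predicted range")
plt.fill_between(t, actual.min(), actual.max(), where=anomaly, color="red",
                 alpha=0.2, label="Anomaly")
plt.legend()
plt.show()
```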
Impact
The observed issue overview refers to a specific radio, and the prediction takes into account a variety of RF and traffic related KPIs for the affected radio and the clients associated with it.
Knowing that a traffic anomaly was detected on a radio allows you to identify critical areas of the network where the client experience is suffering; however, at this point we need to know exactly which clients and applications were affected.
This information is easily accessible using the Impact view, where the information is organized in three tabs:
- Impacted Clients:
This view shows all the clients that were connected to the affected radio at the time the anomaly was detected, with client-specific RF summary information and device classification. Clicking on a client MAC Address will take you to the Device 360 view, while clicking on the Username will take you to the User 360 view for that client; this gives you direct access to the detailed client-specific information.
- Device Breakout:
This view shows the throughput aggregated by device type, in order to identify the traffic patterns by client type.
- Applications by TX/RX:
This view shows the throughput for each individual application, highlighting the ones that were observed during the time period and those that experienced a throughput drop.
Root Cause Analysis
The next step is to explore the Root Cause Analysis tab, where you can observe a wide range of KPIs related to the affected radio that help you understand the network conditions on the affected device at the time the anomaly was detected.
The root cause analysis detection method automatically highlights the most Probable network causes by pre-selecting the KPIs that are likely to explain the issue.
In this specific example you can observe how, at the time of the anomalous throughput drop, there was a steep increase in the percentage of clients exhibiting a low RSSI (up to 80% of clients were received by the AP at less than -80 dBm RSSI).
The pre-selected KPIs for this specific issue indicate that the clients' wireless connection quality dropped, and this is the likely root cause of the throughput drop.
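For reference, the highlighted KPI boils down to a simple share computation. The sketch below is a hypothetical helper (the function name is ours, and the -80 dBm mark is taken from the example above) showing the arithmetic.

```python
# Hypothetical helper showing the arithmetic behind the low-RSSI KPI:
# the share of associated clients received below a given RSSI floor.
def low_rssi_share(client_rssi_dbm: list[float], floor: float = -80.0) -> float:
    """Return the percentage of clients received below the RSSI floor."""
    if not client_rssi_dbm:
        return 0.0
    weak = sum(1 for rssi in client_rssi_dbm if rssi < floor)
    return 100.0 * weak / len(client_rssi_dbm)

# Four of five clients below -80 dBm -> 80% low-RSSI clients, as in the example.
print(low_rssi_share([-85.0, -82.0, -81.0, -83.0, -60.0]))  # 80.0
```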
By default, the UI only shows the KPIs that are considered relevant to explain the anomaly; however, you can use the Add KPI menu to add any of the available KPIs. Doing this helps add context and fully understand the network conditions of the affected network device at the time the anomaly was detected.
Key takeaways
AI-Driven issues are a powerful tool to detect network anomalies, especially for KPIs where using static thresholds, simple rules or simpler baselining techniques would either result in high alerting noise, or missing relevant events altogether.
The AI-Driven issues make use of AI/ML techniques to produce baselines that predict the normal behavior of each network entity, using past data and always taking the network conditions into account.
This concludes the exploration of the AI-Driven issues feature.
You can use the link below to proceed with the exploration of other use cases.