August 2, 2023
Sutharsan S

SONiC & ONES: The Dynamic Duo for Advanced Open Source Networks

Network Monitoring Benefits

Monitoring is indispensable for networks for several reasons, but the primary goal of a monitoring system is to catch glitches before they become major issues. This blog dives deep into the benefits of monitoring a SONiC network with the ONES (Open Networking Enterprise Suite) multi-vendor stack. First, let’s look at why it is important to deploy monitoring systems:

- Performance: The networks are responsible for transferring data between servers, storage devices, and other network components. Monitoring allows administrators to ensure the network is performing optimally and identify as well as resolve any bottleneck quickly.

- Capacity Planning: Monitoring helps to track the usage of the current bandwidth, storage, and processing power. This in turn gives clarity to plan for future growth and expansion of the network.

- Diagnosis and Troubleshooting: Monitoring also enables us to detect issues in the network, diagnose their root causes, and troubleshoot them effectively.

In general, monitoring is vital to ensure the reliability, security, and performance of data center networks, and can help prevent downtimes, security breaches, and other issues.

The Need for Monitoring SONiC Networks

SONiC is an open-source network operating system that offers unparalleled advantages including flexibility, scalability, and vendor-agnosticism. However, monitoring the SONiC network becomes crucial right after its deployment for the following reasons:

- Multi-Vendor Support: One of the main advantages of SONiC is its multi-vendor support ecosystem. However, this brings more challenges when it comes to monitoring the multi-vendor switches and ASICs.

- Open-Source: Although SONiC has validation suites to cater to the needs of consistent development, it is still necessary to monitor the quality of the network functions.

- Deep Insights: Monitoring should not just rely on SYSLOG and SNMP. In fact, the issues of monitoring using SNMP are already well documented. The modern era demands deeper insights into networks in order to monitor them efficiently. This enables the use of advanced analytics and packet-level monitoring in the form of visual dashboards that integrate multiple data sources and prioritize critical alerts.

- On-Demand: The monitoring should be provided as an on-demand micro-service. It can lead to improved scalability, ease of deployment, reliability, and flexibility—in turn, driving up cost efficiency.

- Clubbed Usage of Support and NetOps: The monitoring system should be flexible enough to be clubbed for the usage of both support and NetOps. This provides an added advantage of simplified management, better visibility, and collaboration—thereby, effectively reducing the Total Cost of Ownership (TCO).

- Peripheral Monitoring: This includes monitoring optics (transceivers), power supply units (PSUs), fans, and temperature sensors. Here’s why this is super critical:

  • Reliability and Fault Detection: Monitoring these peripherals helps in the early detection of faults/abnormalities and thus enables proactive maintenance to mitigate the risk of unexpected disruptions.
  • Performance Optimization: It also aids in optimizing the performance of the system. For instance, monitoring optics helps detect degraded or faulty transceivers that can impact network connectivity and data transmission speed. By proactively identifying and replacing faulty components, the overall system can maintain optimal performance and avoid potential bottlenecks.
  • Prevents Cascading Failures: Imagine a power supply unit fails without detection. It can result in the system drawing excessive power from the remaining PSUs, potentially causing their failure. By monitoring PSUs, any failing or overloaded units can be identified and addressed promptly, preventing subsequent failures and ensuring the system’s sound operations.

Overall, peripheral monitoring helps support operational continuity, improves maintenance efficiency, and enhances the stability as well as the longevity of the system.
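The cascading-failure scenario described above can be illustrated with a small check. This is a minimal sketch, not how ONES or SONiC implements it: the `Psu` record, its `capacity_watts` field, and the 80% `headroom` limit are all hypothetical values chosen for illustration. It flags the case where the remaining healthy PSUs would have to carry more load than is comfortable.

```python
from dataclasses import dataclass


@dataclass
class Psu:
    # Hypothetical record for one power supply unit.
    name: str
    healthy: bool
    capacity_watts: float  # rated capacity (assumed to be known)


def psu_overload_risk(psus, system_draw_watts, headroom=0.8):
    """Return True if the healthy PSUs would run above `headroom`
    (a fraction, assumed 0.8 here) of their combined rated capacity."""
    healthy_capacity = sum(p.capacity_watts for p in psus if p.healthy)
    if healthy_capacity == 0:
        # No healthy PSU left: always a risk.
        return True
    return system_draw_watts > headroom * healthy_capacity


# One of two PSUs has failed, so a single unit now carries the full load.
psus = [Psu("PSU1", True, 750.0), Psu("PSU2", False, 750.0)]
at_risk = psu_overload_risk(psus, system_draw_watts=650.0)
print("overload risk:", at_risk)
```

In practice the load and status values would come from the platform's sensors rather than hard-coded records.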

The Need for Advanced SONiC Monitoring

SONiC deployment undoubtedly offers multiple advantages. However, there is no efficient way to monitor the SONiC devices and benefit from its features. Let’s glance through some details:

- Tools: SONiC includes several built-in monitoring tools that can be used to track network performance, diagnose issues, and analyze traffic. These include utilities like "sflow" which can be used to collect and analyze data from network devices. But these only provide information about the traffic and data samples and miss out on the crucial information about the state of the node.

- SNMP: It is a standard protocol used for monitoring and managing network devices. SONiC partially supports SNMP, including SNMP traps. However, this support is incomplete, with gaps in both the SNMP agent and the set of traps.

- Network Telemetry: Network telemetry involves collecting and analyzing data from network devices to gain insights into network performance, capacity, and usage. SONiC supports several network telemetry standards, including gRPC and Inband Network Telemetry (INT). But again, this support is incomplete, and its extent varies from vendor to vendor. Further, the data available is quite shallow, leaving crucial pieces of data unexported.

- Third-Party Monitoring Tools: Many third-party network monitoring tools, including Grafana, Prometheus, and Zabbix, support SONiC. These tools can provide additional capabilities like visualization, alerting, and reporting. But they provide only basic CPU and memory data for the node.

In a nutshell, these tools work to some extent. However, there is no single, comprehensive tool that provides all the critical data needed to proactively monitor SONiC networks and enable operators to leverage SONiC's advantages to the fullest. It is also critical that a single solution works across all the vendors supporting the SONiC stack and can be deployed seamlessly across networks.

ONES: The All-in-One Solution 

ONES is an industry-leading network management and SONiC support solution built to address the unique challenges of monitoring SONiC networks. It is highly modular for integration with different layers of the NetOps stack, solving multiple use cases ranging from orchestration to visibility and support.

ONES Architecture reproduced from 

ONES addresses the needs listed in the SONiC monitoring sections above. It supports monitoring across multi-vendor SONiC platforms and collects the crucial data from each node to derive deep insights. Additionally, it provides a single interface for reading all the critical data of a node, giving a unified view of that node.

One of the key benefits is a comprehensive view of the network, including deep node data that enables operators to act on issues across the entire network. ONES also offers powerful analytics capabilities, such as anomaly detection and predictive analytics, which can help identify patterns and trends in the network and proactively address potential issues.

Use-Cases of Monitoring

Let’s have a look at a few use cases to better understand the monitoring aspects of SONiC.

Health Monitoring

This is one of the key use cases of network monitoring. It involves continuous monitoring of various performance metrics such as bandwidth utilization, latency, packet loss, availability time, and identifying issues that may impact the network performance.

Apart from regular monitoring, the health of the nodes also needs to be monitored effectively. This includes the various parts of a node, such as fans, PSUs, and CPU temperature. These need to be monitored constantly to ensure the reliability of the nodes and to detect issues at an early stage, preventing downtime, reduced performance, and other hiccups.

ONES monitors various health metrics across different vendors. For instance, a snapshot of the dashboard below provides a single view of faulty components in the network. This includes details of all the PSU and FAN components across all the nodes that are monitored by the ONES application.

ONES - Faulty Components Dashboard
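To make the idea of a single faulty-components view concrete, here is a minimal sketch of the kind of aggregation such a view relies on. The node names, component names, and the `"OK"`/`"FAULT"` status strings are hypothetical sample data, not the ONES data model or API.

```python
from collections import defaultdict

# Hypothetical readings: (node, component type, component name, status).
readings = [
    ("leaf-1", "PSU", "PSU1", "OK"),
    ("leaf-1", "FAN", "FAN3", "FAULT"),
    ("spine-1", "PSU", "PSU2", "FAULT"),
    ("spine-1", "FAN", "FAN1", "OK"),
]


def faulty_components(readings):
    """Group every component whose status is not "OK" by node."""
    faults = defaultdict(list)
    for node, kind, name, status in readings:
        if status != "OK":
            faults[node].append(f"{kind}:{name}")
    return dict(faults)


for node, components in faulty_components(readings).items():
    print(node, components)
```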

Utilization Monitoring

This involves measuring and tracking the usage of resources such as CPU, memory, bandwidth, and disk space. Some important uses of utilization monitoring include:

  • Capacity Planning: Network operators can track how network resources are being utilized. By analyzing this data, operators can forecast future resource requirements, plan for network upgrades, and avoid potential resource shortages.

  • Performance Optimization: Utilization monitoring also helps network operators to identify performance bottlenecks, such as overused CPU or bandwidth, and take action to optimize the network performance. For example, operators can add more capacity to reduce congestion, or adjust network settings to better balance the load across resources.

  • Cost Optimization: Identify underutilized resources that can be consolidated or decommissioned to save costs. For example, operators can identify servers with low CPU utilization and consolidate them into fewer physical machines—reducing hardware and energy costs.

In brief, utilization monitoring is important for capacity planning, performance optimization, and cost optimization. Regular monitoring and analysis of network resource utilization can help ensure the reliability and performance of the infrastructure, while optimizing costs and complying with regulations.
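As an illustration of the capacity-planning use above, the sketch below fits a straight line to daily utilization samples and estimates how many days remain until a capacity limit is reached. It is a simple least-squares trend under assumed inputs (one sample per day, made-up percentage values), not a description of how ONES forecasts; real traffic is rarely this linear.

```python
def days_until_capacity(samples, capacity):
    """Fit a least-squares line to `samples` (one value per day) and
    return the estimated days after the last sample until the trend
    reaches `capacity`. Returns None if there are fewer than two samples
    or the trend is flat or declining."""
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_at_capacity = (capacity - intercept) / slope
    # Clamp to zero if the trend line has already crossed the limit.
    return max(0.0, day_at_capacity - (n - 1))


# Hypothetical link utilization in percent, one sample per day.
utilization = [40, 42, 44, 46, 48]
print(days_until_capacity(utilization, capacity=80))
```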

ONES - Cabling Usage Dashboard

Another advantage of ONES is that it monitors network and node resource utilization beyond the usual parameters, including the amount of cabling in use alongside the amount available. The snapshot above shows one such metric. ONES goes similarly deep in other areas of system monitoring.

Proactive Monitoring

This focuses on identifying and resolving potential issues before they become critical in the network. The following list highlights some of the aspects: 

- Early Detection: Proactive monitoring helps identify potential issues early on, before they escalate into major ones. This helps to reduce downtime and minimize the impact on network performance.

- Predictive Analysis: It uses predictive analysis techniques to identify patterns and trends in network behavior and predict potential issues by leveraging big data from the nodes. This enables network operators to take preventive measures to avoid downtime and improve network performance. The same approach can be extended to node-level details to predict node behavior.

- Cost Savings: Prevent costly network outages and reduce the need for expensive emergency repairs. Further, the monitoring can also help spot opportunities for cost optimization by identifying underutilized resources or areas where resources can be consolidated.

Overall, proactive monitoring plays a vital role in the early detection of potential issues, predictive analysis, cost savings, improved user experience, and compliance. This improves the reliability of the network.
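One common way to approach the early detection described above is to compare each new sample with a rolling baseline. The sketch below uses a rolling z-score as an example technique; it is not the method ONES uses, and the window size, threshold, and latency values are assumptions for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return the indexes of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard
    deviations."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change counts as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples in milliseconds, with one spike.
latency_ms = [10, 11, 10, 12, 11, 10, 45, 11]
print(find_anomalies(latency_ms))
```

A spike also inflates the baseline for the samples that follow it, so a production system would typically use a more robust statistic or exclude flagged samples from the window.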

As we wrap up, we are excited to enable you to seamlessly adopt SONiC via the innovative features of ONES. These empower networking teams to monitor their network, improve its performance, and skyrocket their business productivity.

Let’s maximize smart monitoring and minimize downtime. 
