
5 Steps to Total Autonomous Networking (TAN)
Why use stacking?
Every network needs a resilient and versatile core. A chassis can seem like a great solution, since it allows redundant controller cards and power supplies for resiliency, and it can use any combination of line cards for versatility of speeds and feeds. However, with a chassis-based solution, everything has to be located in one place. If the network uses a redundancy protocol such as Virtual Router Redundancy Protocol (VRRP), one chassis sits idle in standby, which is expensive. In addition, the failover performance of VRRP means that downstream hosts will see a noticeable outage in the event of a failure.
There is a lot of information about stacking and its benefits, such as easier management and fast recovery from failures. But a stack of individual fixed-format devices does not yield the flexibility that is needed to change or grow speeds and feeds easily. This type of stacking also limits the number of available ports. A stack of chassis is far more useful. It offers all the benefits of a chassis, but the stack members can be placed in different geographical locations to offer protection from local disasters. It also offers very fast failover performance so that downstream hosts do not notice any failures. With chassis stacking, full use can be made of both chassis, because it is an active/active architecture. Investment in equipment is better utilized, and available bandwidth is doubled.
But is stacking really necessary? Multi-Chassis Link Aggregation (MLAG) works well in the data center to provide a dual-homed aggregated link for servers. It is active/active and has fast failover without complex protocols. Can MLAG be used for core chassis too? The answer is yes, but if the network also has to forward Layer 3 traffic, the complexity of the MLAG solution increases dramatically. This is because MLAG is best suited to Layer 2 applications, like data centers. For Layer 3 networks, a single MAC address and a single IP address are required to minimize downstream disruption after a failover, and a method of failing over the Layer 3 routes is required. The standard method for creating a redundant unicast Layer 3 core is to use VRRP to provide a gateway failover for hosts in the local LAN. However, VRRP does not provide a rapid failover, and it also puts one chassis into standby mode. This solution satisfies none of the uptime, energy efficiency or ROI expectations that organizations now have of their data networks.
For multicast, there simply is no standard protocol for managing router redundancy. Stacking takes care of the dual-homed aggregated links and provides a simple solution for a virtual IP address, so that Layer 3 traffic can be routed without the need for complex failover protocols. Additionally, the stack is still fully active/active even for Layer 3 traffic, since the states of Layer 3 protocols are automatically synchronized across both stack members, which leads to minimal traffic disruption in the event of a failover.
VCStack Plus
Allied Telesis VCStack Plus allows two SwitchBlade x8100 chassis to be stacked with either one or two SBx81CFC960 controller cards in each, for easy active/active resiliency with rapid fault recovery. Initial configuration is effortless, requiring no spanning tree or failover protocols, and ongoing administration is reduced because both chassis can be managed as a single virtual device.
Allied Telesis designed VCStack Plus to meet the following seven requirements of a resilient network core solution:
- Fast failover: no downstream disruption on failover; predictable failover times
- Simple configuration: reduced maintenance and less chance of configuration errors
- Maximizing return on investment: active/active architecture for reduced equipment costs
- Cross-chassis LAGs: ideal for resiliency, bandwidth and easy management
- Scalable: ability to add more ports without having to reconfigure or replace equipment
- Powerful simplicity: the benefits of multiple stacked units with ease of management
- Long distance: perfect for distributed network environments and protection against local disasters
1. Fast failover
A key requirement of any resilient system is its ability to recover quickly from failure. For a network core this is especially important since by definition, the effects of disruption in the core are magnified in the outer network layers.
Traditional failover mechanisms rely on standby devices monitoring the health of active devices using CPU-intensive protocols. When the standby device detects that the active device has failed, it takes over as the active device. This can create a significant disruption to the attached hosts, because the standby device can take up to 30 seconds to detect that the active device has failed, depending on the failover protocol being used. Faster protocols are not the answer, as they consume more CPU cycles and make the entire system less stable. During failover, tables may also be flushed and addresses re-learned, causing more disruption downstream.
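To illustrate how detection time depends on protocol timers, the following minimal sketch computes how long a VRRP backup waits before declaring the active router down, using the Master_Down_Interval formula from RFC 3768 (VRRPv2). The function name and the example values are illustrative assumptions, not taken from any Allied Telesis product. The result covers detection only; it does not include the time needed to re-learn flushed tables.

```python
def vrrp_master_down_interval(advert_interval_s: float = 1.0,
                              priority: int = 100) -> float:
    """Seconds a VRRPv2 backup waits before taking over (RFC 3768).

    Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time
    Skew_Time            = (256 - Priority) / 256
    """
    if not 1 <= priority <= 254:
        raise ValueError("backup priority must be between 1 and 254")
    skew_time = (256 - priority) / 256
    return 3 * advert_interval_s + skew_time


# Hypothetical timer settings: the RFC default (1 s) and a slower,
# less CPU-intensive advertisement interval (10 s).
for interval in (1.0, 10.0):
    wait = vrrp_master_down_interval(advert_interval_s=interval)
    print(f"advertisement interval {interval:>4.1f} s -> "
          f"failure detected after about {wait:.2f} s")
```

The trade-off described above is visible in the formula: shortening the advertisement interval shortens detection time, but the backup must then process advertisements more often.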
Stacking uses a different failover mechanism because it keeps both devices in sync with each other. If one fails, the other takes over seamlessly with no disruption to attached hosts. This is because a stack is tightly coupled, so failure detection is almost instant (sub-second), and because tables do not need to be flushed. Failover times are also predictable in a stack because the time to detect the failure is consistent, unlike failover protocols, whose detection times can vary widely.
With stacking, failover is fast and the user experience is predictable. Network administrators can perform maintenance tasks during the day, safe in the knowledge that users will not notice any disruption to their traffic.
2. Simple configuration
Studies and practical experience show that one of the most common causes of failure in a network is configuration error. The easier a configuration is to understand, the less likely it is that mistakes will be made, and the easier it will be to troubleshoot and diagnose issues.
Failover protocols, such as Hot Standby Router Protocol (HSRP) and VRRP, are typically used in combination with Spanning Tree Protocol (STP) to provide automatic switchover to the standby system when the active system fails. But these protocols require significant configuration, and diagnosing issues can be time-consuming because of the complex interactions and timing dependencies between processes.
Stacking offers a much simpler way to configure a resilient pair of chassis. There is no need for failover protocols or STP, and the configuration commands are simple. Diagnosing issues is much easier too: stacking has built-in diagnostics and generates Simple Network Management Protocol (SNMP) traps and log messages whenever a significant event occurs.
Because stacking virtualizes the control plane across multiple devices, these devices operate as a single unit under unified software control. The configuration that is applied to the stack does not have to specify the way that information is communicated within the stack. It simply treats the stack as a single virtual device, and needs only to specify how the stack interacts with its external environment. The software comprising the virtualized control plane manages the route distribution, bandwidth sharing, forwarding plane synchronization and more within the stack. This centralized control of a distributed set of devices is an effective way to simplify network core management.
Another benefit of stacking is that the effort required to manage both chassis is halved because stacking presents them as a single entity. There is only one management interface, so stack management is as simple as managing a single device.
Both stacked devices run the same configuration file, so changes only need to be made once because they are synchronized on both devices automatically. This reduces the chance of errors and saves time. Firmware versions are also synchronized across stacked devices, so firmware upgrades are easier too.
3. Maximizing return on investment
A multi-chassis network core represents a major investment for an organization. Beyond the initial capital expenditure (CAPEX), it also requires significant ongoing operational expenditure (OPEX) in the form of IT engineering time, electricity usage, upgrade expenses, support contracts, and more. Therefore, an organization needs to maximize the benefit it gains from this investment. If half of the network core sits idle as a standby redundant unit, the investment is not being used effectively.
Unfortunately, this is the typical situation when traditional failover protocols are used with STP. One device is designated as the active system and performs all the control and data plane processing. The other device waits in a standby mode, monitoring the active system and ready to take over if a failure is detected. The resulting redundancy is therefore very expensive, and the network cannot take advantage of the potential bandwidth sitting idle in standby.
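The cost of an idle standby can be made concrete with some simple arithmetic. The sketch below assumes a core of two identical chassis; the function name and the 400 Gbps per-chassis figure are made up for illustration and are not SwitchBlade x8100 specifications.

```python
def core_capacity(per_chassis_gbps: float, mode: str) -> tuple[float, float]:
    """Return (usable Gbps, fraction of purchased capacity in use)
    for a two-chassis core in the given redundancy mode."""
    purchased = 2 * per_chassis_gbps
    if mode == "active/standby":
        usable = per_chassis_gbps   # the standby chassis forwards nothing
    elif mode == "active/active":
        usable = purchased          # both chassis forward traffic
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return usable, usable / purchased


# Hypothetical per-chassis capacity, for illustration only.
for mode in ("active/standby", "active/active"):
    usable, fraction = core_capacity(400.0, mode)
    print(f"{mode}: {usable:.0f} Gbps usable, "
          f"{fraction:.0%} of purchased capacity")
```

Under these assumptions, an active/standby pair uses half of the capacity that was paid for, while an active/active pair uses all of it.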




