Implement Cisco NX-OS high-availability features

Dr.MohitBansal, Canada, Teacher
Published Date: 26-10-2017
High Availability
(From NX-OS and Cisco Nexus Switching: Next-Generation Data Center Architectures, Chapter 6)

This chapter covers the following topics focused on high availability:
■ Physical Redundancy
■ Generic Online Diagnostics
■ NX-OS High Availability Architecture
■ Process Modularity
■ Process Restart
■ Stateful Switchover
■ Nonstop Forwarding
■ In-Service Software Upgrades (ISSU)

Requirements in the data center are changing rapidly: where there were once generous maintenance windows, now there are none. Best-effort delivery of service has been replaced with strict Service Level Agreements (SLAs), sometimes with financial penalties owed to lines of business or customers. This chapter introduces the hardware and software components that make the Nexus 7000 a highly available platform able to meet these changing data center requirements.

Physical Redundancy
Redundancy within the Nexus 7000 begins at the physical chassis and extends into the software and operational characteristics of the system. To provide a redundant hardware platform on which to build, the Nexus 7000 includes the following hardware components:
■ Redundant power supplies
■ Redundant cooling system
■ Redundant Supervisors
■ Redundant Ethernet Out-of-Band Channel (EOBC)
■ Redundant fabric modules

The following sections describe these components in greater detail.

Redundant Power Supplies
The Nexus 7010 can hold up to three power supplies. To account for the additional line cards in the system, the Nexus 7018 can hold up to four. Each power supply has redundant inputs that feed completely independent power units, which in turn feed two redundant power buses within the chassis.
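The four redundancy modes described next differ mainly in how installed supply capacity is turned into a power budget for modules. As a rough sketch of that arithmetic, the Python below models each supply by a single usable output figure in watts. Two points are assumptions rather than statements from the text: the helper names (`PowerSupply`, `power_budget`, `remaining_budget`) are hypothetical, and full redundancy is assumed to yield the lesser of the N+1 and grid-redundant budgets. NX-OS performs the real calculation itself and reports it in `show environment power`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerSupply:
    slot: int
    output_w: float  # usable output of this supply in watts (hypothetical model)


def power_budget(supplies: list[PowerSupply], mode: str) -> float:
    """Watts available to modules under a given power redundancy mode (sketch)."""
    outputs = [ps.output_w for ps in supplies]
    total = sum(outputs)
    largest = max(outputs, default=0.0)
    if mode == "combined":          # non-redundant: everything is usable
        return total
    if mode == "ps-redundant":      # N+1: hold back the largest supply
        return total - largest
    if mode == "insrc-redundant":   # grid: either grid may disappear, so halve
        return total / 2
    if mode == "redundant":         # full: ASSUMED to be the stricter of the two
        return min(total - largest, total / 2)
    raise ValueError(f"unknown redundancy mode: {mode!r}")


def remaining_budget(capacity_w: float, allocations_w: list[float]) -> float:
    """Capacity left for additional modules after per-module allocations."""
    return capacity_w - sum(allocations_w)


if __name__ == "__main__":
    # Two supplies reporting 3000 W each, as in Example 6-1.
    installed = [PowerSupply(1, 3000), PowerSupply(2, 3000)]
    for mode in ("combined", "ps-redundant", "insrc-redundant", "redundant"):
        print(f"{mode:16} {power_budget(installed, mode):7.0f} W")

    # Per-module allocations listed in Example 6-1: two line cards, two
    # Supervisors, five fabric slots at 60 W each (two of them absent), four fans.
    allocations = [750, 400, 210, 210, 60, 60, 60, 60, 60, 720, 720, 120, 120]
    print(sum(allocations), remaining_budget(6000, allocations))
```

The allocations in Example 6-1 add up to the reported 3550 W budget (including 60 W reserved for each absent fabric slot), and 6000 W minus 3550 W gives the 2450 W shown as available. With only two 3000 W supplies, the N+1 budget of 3000 W is smaller than the 3550 W already allocated, which is one plausible reading of why that output reports an operational mode of Non-Redundant.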
The power redundancy mode is user configurable to one of four modes; these schemes are consistent between the 10-slot and 18-slot versions of the Nexus 7000:
■ Non-redundant (combined): All available power from all power supplies and inputs is made available for the system to draw from. This mode is available but not recommended except in extraordinary circumstances.
■ N+1 (ps-redundant): The default mode, which protects against the failure of one power supply. In this mode, the power made available to the system is the sum of all the power supplies minus the largest.
■ Grid redundancy (insrc-redundant): Also called input source redundancy. Most data centers today are equipped with redundant power feeds to the data center and redundant distribution systems within it. In grid redundancy, each input of the installed power supplies connects to a different power grid, so if either side suffers a total loss of power, the system remains powered on. In this mode, the power made available to the system is the sum of all the power supplies installed in the system, cut in half to create the power budget for modules.
■ Full redundancy (redundant): The combination of input source redundancy and power supply redundancy. This mode leaves the least power available for line cards and crossbars but ensures that neither an internal failure (a power supply) nor an external one (a power grid) compromises the availability of the system.

Example 6-1 shows how to configure the power redundancy mode and verify the operating mode.

Example 6-1 Configuring and Verifying Power Redundancy
Congo(config)# power redundancy-mode ?
  combined         Configure power supply redundancy mode as combined
  insrc-redundant  Configure power supply redundancy mode as grid/AC input source redundant
  ps-redundant     Configure power supply redundancy mode as PS redundant
  redundant        Configure power supply redundancy mode as InSrc and PS redundant

Congo(config)# power redundancy-mode redundant
Congo(config)# show environment power
Power Supply:
Voltage: 50 Volts
Power                          Actual       Total
Supply   Model                 Output       Capacity    Status
                               (Watts )     (Watts )
-------  -------------------   ----------   ----------  ----------
1        N7K-AC-6.0KW             668 W       3000 W    Ok
2        N7K-AC-6.0KW             663 W       3000 W    Ok
3                                   0 W          0 W    Absent

                               Actual       Power
Module   Model                 Draw         Allocated   Status
                               (Watts )     (Watts )
-------  -------------------   ----------   ----------  ----------
1        N7K-M132XP-12            N/A          750 W    Powered-Up
2        N7K-M148GT-11            N/A          400 W    Powered-Up
5        N7K-SUP1                 N/A          210 W    Powered-Up
6        N7K-SUP1                 N/A          210 W    Powered-Up
Xb1      N7K-C7010-FAB-1          N/A           60 W    Powered-Up
Xb2      N7K-C7010-FAB-1          N/A           60 W    Powered-Up
Xb3      N7K-C7010-FAB-1          N/A           60 W    Powered-Up
Xb4      xbar                     N/A           60 W    Absent
Xb5      xbar                     N/A           60 W    Absent
fan1     N7K-C7010-FAN-S          N/A          720 W    Powered-Up
fan2     N7K-C7010-FAN-S          N/A          720 W    Powered-Up
fan3     N7K-C7010-FAN-F          N/A          120 W    Powered-Up
fan4     N7K-C7010-FAN-F          N/A          120 W    Powered-Up

N/A - Per module power not available

Power Usage Summary:
--------------------
Power Supply redundancy mode (configured)           PS-Redundant
Power Supply redundancy mode (operational)          Non-Redundant

Total Power Capacity (based on configured mode)         6000 W
Total Power of all Inputs (cumulative)                  6000 W
Total Power Output (actual draw)                        1331 W
Total Power Allocated (budget)                          3550 W
Total Power Available for additional modules            2450 W

Redundant Cooling System
The Nexus 7010 has two redundant fans for line cards and two redundant fans for fabric modules, all located in the rear of the chassis. For the Nexus 7018, the system I/O and fabric fans are located within the same field replaceable unit (FRU).
Placing the fan trays in the rear of the chassis makes the system highly serviceable and ensures that cabling does not get in the way when a fan tray is removed or replaced. All the fans in the system are hot-swappable. If one of these fans fails, the redundant module increases its rotation speed to continue cooling the entire system. Although a single fan can cool the entire system in the event of a failure, it is critical that all fans be physically present in the system at all times to keep the airflow characteristics of the system intact. If one of the fans is physically removed and not replaced, the system shuts down after several warnings and the expiry of a 3-minute timer. Example 6-2 shows how to verify the status of the fans installed in the system.

Example 6-2 Verifying System and I/O Fans
Congo# show environment fan
Fan:
------------------------------------------------------
Fan               Model              Hw      Status
------------------------------------------------------
Fan1(sys_fan1)    N7K-C7010-FAN-S    1.1     Ok
Fan2(sys_fan2)    N7K-C7010-FAN-S    1.1     Ok
Fan3(fab_fan1)    N7K-C7010-FAN-F    1.1     Ok
Fan4(fab_fan2)    N7K-C7010-FAN-F    1.1     Ok
Fan_in_PS1                                   Ok
Fan_in_PS2                                   Ok
Fan_in_PS3                                   Absent
Fan Air Filter : Absent
Congo#

All installed fans should report a status of Ok; any other status requires attention from the administrator, and making sure that the appropriate fan tray is properly seated is a good first step. If one or more fans fail within a tray, the Nexus 7000 can adjust the speed of the remaining fans to compensate for the failed fans. A fan failure could also lead to temperature alarms if not corrected in a timely manner.

Temperature sensors are located throughout the system to monitor temperature and adjust fan speeds as necessary to keep all components within their appropriate operational range. Each module is equipped with intake, outlet, and on-board sensors.
Two temperature thresholds are tracked for each sensor:
■ Minor temperature threshold: When a minor threshold is exceeded, a system message is logged, and Call Home and SNMP notifications are sent if configured.
■ Major temperature threshold: Exceeding a major threshold causes the same actions as a minor threshold, unless the violation occurs on an intake sensor. In that case, the module is powered down. If the intake sensor of the active Supervisor exceeds its major threshold and an HA-standby Supervisor is present, the active Supervisor module shuts down. If no standby Supervisor is present, the system monitors the temperature every 5 seconds for 2 minutes and then shuts down the module.

Example 6-3 shows how to monitor the temperature at various points within the system.

Example 6-3 Verifying System Temperature
Congo# show environment temperature
Temperature:
--------------------------------------------------------------------
Module  Sensor           MajorThresh  MinorThres  CurTemp    Status
                         (Celsius)    (Celsius)   (Celsius)
--------------------------------------------------------------------
1       Crossbar(s5)     105          95          49         Ok
1       QEng1Sn1(s12)    115          110         62         Ok
1       QEng1Sn2(s13)    115          110         61         Ok
1       QEng1Sn3(s14)    115          110         58         Ok
1       QEng1Sn4(s15)    115          110         59         Ok
1       QEng2Sn1(s16)    115          110         62         Ok
1       QEng2Sn2(s17)    115          110         60         Ok
1       QEng2Sn3(s18)    115          110         59         Ok
1       QEng2Sn4(s19)    115          110         60         Ok
1       L2Lookup(s27)    115          105         44         Ok
1       L3Lookup(s28)    120          110         55         Ok
2       Crossbar(s5)     105          95          36         Ok
2       CTSdev4 (s9)     115          105         52         Ok
2       CTSdev5 (s10)    115          105         50         Ok
2       CTSdev7 (s12)    115          105         51         Ok
2       CTSdev9 (s14)    115          105         48         Ok
2       CTSdev10(s15)    115          105         47         Ok
2       CTSdev11(s16)    115          105         46         Ok
2       CTSdev12(s17)    115          105         44         Ok
2       QEng1Sn1(s18)    115          105         44         Ok
2       QEng1Sn2(s19)    115          105         43         Ok
2       QEng1Sn3(s20)    115          105         40         Ok
2       QEng1Sn4(s21)    115          105         42         Ok
2       L2Lookup(s22)    115          105         40         Ok
2       L3Lookup(s23)    120          110         48         Ok
5       Intake  (s3)     60           42          17         Ok
5       EOBC_MAC(s4)     105          95          35         Ok
5       CPU     (s5)     105          95          29         Ok
5       Crossbar(s6)     105          95          40         Ok
5       Arbiter (s7)     110          100         48         Ok
5       CTSdev1 (s8)     115          105         39         Ok
5       InbFPGA (s9)     105          95          36         Ok
5       QEng1Sn1(s10)    115          105         40         Ok
5       QEng1Sn2(s11)    115          105         40         Ok
5       QEng1Sn3(s12)    115          105         36         Ok
5       QEng1Sn4(s13)    115          105         39         Ok
6       Intake  (s3)     60           42          18         Ok
6       EOBC_MAC(s4)     105          95          36         Ok
6       CPU     (s5)     105          95          28         Ok
6       Crossbar(s6)     105          95          39         Ok
6       Arbiter (s7)     110          100         46         Ok
6       CTSdev1 (s8)     115          105         39         Ok
6       InbFPGA (s9)     105          95          34         Ok
6       QEng1Sn1(s10)    115          105         39         Ok
6       QEng1Sn2(s11)    115          105         38         Ok
6       QEng1Sn3(s12)    115          105         35         Ok
6       QEng1Sn4(s13)    115          105         36         Ok
xbar-1  Intake  (s2)     60           42          19         Ok
xbar-1  Crossbar(s3)     105          95          47         Ok
xbar-2  Intake  (s2)     60           42          19         Ok
xbar-2  Crossbar(s3)     105          95          42         Ok
xbar-3  Intake  (s2)     60           42          18         Ok
xbar-3  Crossbar(s3)     105          95          45         Ok
Congo#

In this example, all current temperature values are well below any threshold. Because each environment is slightly different, it is good practice to baseline these temperatures in your environment and trend them over time.

Redundant Supervisors
Supervisor modules provide the control plane operations for the system. These functions include building forwarding tables, maintaining protocol adjacencies, and providing management interfaces to the system. In the Nexus 7010, slots 5 and 6 are reserved for Supervisor modules; in the Nexus 7018, slots 9 and 10 are reserved for them. Supervisor modules have a slightly different form factor, so I/O modules cannot be installed in these slots. Redundant Supervisor modules provide a completely redundant control plane and redundant management interfaces for the platform. Redundant Supervisors behave in an active/standby configuration where only one Supervisor is active at any time.
This level of control plane redundancy protects against hardware failure and provides a foundation for advanced features such as Stateful Switchover (SSO) and In-Service Software Upgrades (ISSU), which are covered later in this chapter.

From a management standpoint, each Supervisor provides an out-of-band Connectivity Management Processor (CMP) and an in-band management (mgmt0) interface. These interfaces were covered in detail in the previous chapter. The CMP provides a standalone network stack that is always available as long as power is applied to the system. This technology is analogous to the “lights out” capabilities of most modern servers. Compared with legacy networking practice, the CMP can replace the terminal servers that provide console connectivity when a system has experienced major issues that cause normal connectivity to be lost. From the CMP, a network operator can monitor log files and console ports and power cycle the entire system. The CMP is completely independent of NX-OS, which ensures that outages are not prolonged by an inability to access the device remotely.

The management interfaces operate in active/standby mode, just as the Supervisors do: the mgmt0 interface is reachable through whichever Supervisor is currently active.

Note: Because of the active/standby nature of the mgmt0 interface, it is recommended that the management interfaces of both Supervisors be physically connected to the external switching infrastructure at all times.

Example 6-4 shows how to verify Supervisor redundancy.
Example 6-4 Verifying Supervisor Redundancy
Congo# show system redundancy status
Redundancy mode
---------------
      administrative:   HA
         operational:   HA

This supervisor (sup-1)
-----------------------
    Redundancy state:   Active
    Supervisor state:   Active
      Internal state:   Active with HA standby

Other supervisor (sup-2)
------------------------
    Redundancy state:   Standby
    Supervisor state:   HA standby
      Internal state:   HA standby

Redundant Ethernet Out-of-Band Channel (EOBC)
Normal system operation requires various forms of communication between line cards, fabric modules, and Supervisors. This communication occurs over an internal switching infrastructure called the Ethernet Out-of-Band Channel (EOBC). Each Supervisor contains a 24-port Gigabit switch that connects to the line cards and fabric modules within the system. Additionally, each line card contains a small switch with ports connecting to both Supervisors and to the local processor. Together, these components provide a redundant infrastructure for management and control traffic local to the system.

Redundant Fabric Modules
The Nexus 7000 series of switches supports up to five fabric modules per system. Fabric modules are installed to meet the capacity and redundancy requirements of the system. Each line card load balances data plane traffic across all the available fabric modules within the system. If one of the fabric modules fails, traffic rebalances across the remaining fabrics. When the failed fabric is replaced, traffic is automatically redistributed again. You can monitor fabric module status and utilization, as demonstrated in Example 6-5.
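Before looking at the verification output, the rebalancing behavior can be illustrated with a toy model. This Python sketch assumes that a line card spreads its fabric traffic evenly across every fabric module that is up, and it takes the per-fabric bandwidth per slot as an input parameter rather than quoting a figure for any specific fabric module. The function names and the traffic numbers in the usage lines are hypothetical.

```python
import math

MAX_FABRICS = 5  # the Nexus 7000 accepts up to five fabric modules


def per_fabric_load(offered_gbps: float, fabrics_up: int) -> float:
    """Traffic each fabric carries when a slot's load is spread evenly."""
    if fabrics_up < 1:
        raise RuntimeError("no fabric module available for the data plane")
    return offered_gbps / fabrics_up


def fabrics_needed(slot_gbps: float, per_fabric_gbps: float, spares: int = 1) -> int:
    """Smallest fabric count that still carries slot_gbps after `spares` failures."""
    needed = math.ceil(slot_gbps / per_fabric_gbps) + spares
    if needed > MAX_FABRICS:
        raise ValueError(f"{needed} fabric modules needed; chassis holds {MAX_FABRICS}")
    return needed


if __name__ == "__main__":
    # Hypothetical numbers: one slot offers 120 Gbps and four fabrics are installed.
    print(per_fabric_load(120, 4))   # 30.0 with all four fabrics healthy
    print(per_fabric_load(120, 3))   # 40.0 after one fabric fails
    print(fabrics_needed(slot_gbps=80, per_fabric_gbps=50))
```

Reserving one fabric as a spare mirrors the point that fabric modules are sized for both capacity and redundancy: the slot keeps its full rate after a fabric failure only if the surviving fabrics can absorb the failed one's share.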
Example 6-5 Verifying Fabric Module Status and Utilization
Congo# show module xbar
Xbar  Ports  Module-Type          Model              Status
----  -----  -------------------  -----------------  ----------
1     0      Fabric Module 1      N7K-C7010-FAB-1    ok
2     0      Fabric Module 1      N7K-C7010-FAB-1    ok
3     0      Fabric Module 1      N7K-C7010-FAB-1    ok
4     0      Fabric Module 1      N7K-C7010-FAB-1    ok

Xbar  Sw     Hw
----  -----  ------
1     NA     1.0
2     NA     1.0
3     NA     1.0
4     NA     1.0

Xbar  MAC-Address(es)    Serial-Num
----  -----------------  ----------
1     NA                 JAB1211019U
2     NA                 JAB121101AQ
3     NA                 JAB121101A9
4     NA                 JAB1211018G

* this terminal session
Congo#
Congo# show hardware fabric-utilization
----------------------------------
Slot    Direction    Utilization
----------------------------------
1       ingress      0.0%
1       egress       0.0%
2       ingress      0.0%
2       egress       0.0%
3       ingress      0.0%
3       egress       0.0%
5       ingress      0.0%
5       egress       0.0%
6       ingress      0.0%
6       egress       0.0%

Generic Online Diagnostics
There is strong interest in data centers today in moving operations from reactive to proactive. As part of this operational shift, it becomes necessary to identify failing hardware before it fails and to take preventive action. NX-OS follows the tradition of the widely deployed Catalyst line of switches with its implementation of Generic Online Diagnostics (GOLD), which provides the mechanisms necessary to test and verify the functionality of a particular component at various times during its operation. As the name implies, GOLD can usually run these tests on a device that is connected to the network, with minimal or no disruption to the operation of the device. This section provides an overview of the capabilities, operation, and configuration of GOLD.

Note: GOLD provides a robust suite of diagnostic tests, many of which run in the background with no disruption to the system. Some of the tests, however, are disruptive and should be used with caution in a production environment.
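The `show diagnostic content` and `show diagnostic result` outputs later in this section (Examples 6-9 through 6-11) encode test properties as single letters and results as single characters. Because every attribute letter in the legend is distinct, a short helper can decode them regardless of column position. The sketch below is a hypothetical Python helper built only from the legends printed in those examples; the function names are illustrative, and the "safe to run on demand" rule (not disruptive, not marked T) is an assumption drawn from the legend, not a Cisco-defined check.

```python
# Legend letters from "Diagnostics test suite attributes" (Example 6-9).
ATTRIBUTES = {
    "B": "Bypass bootup level test",
    "C": "Complete bootup level test",
    "P": "Per port test",
    "M": "Only applicable to active unit",
    "S": "Only applicable to standby unit",
    "D": "Disruptive test",
    "N": "Non-disruptive test",
    "H": "Always enabled monitoring test",
    "F": "Fixed monitoring interval test",
    "X": "Not a health monitoring test",
    "E": "Sup to line card test",
    "L": "Exclusively run this test",
    "T": "Not an ondemand test",
    "A": "Monitoring is active",
    "I": "Monitoring is inactive",
}

# Result characters from "show diagnostic result" (Example 6-11).
RESULTS = {".": "Pass", "F": "Fail", "I": "Incomplete",
           "U": "Untested", "A": "Abort", "E": "Error disabled"}


def decode_attributes(attrs: str) -> list[str]:
    """Translate an attribute string such as 'CDXT'; '*' placeholders mean NA."""
    return [ATTRIBUTES[letter] for letter in attrs if letter != "*"]


def safe_on_demand(attrs: str) -> bool:
    """Assumed rule: a test is safe to start on demand if it is neither D nor T."""
    return "D" not in attrs and "T" not in attrs


def failed_tests(results: dict[str, str]) -> list[str]:
    """Names of tests whose result is anything other than Pass or Untested."""
    return [name for name, code in results.items()
            if RESULTS[code] not in ("Pass", "Untested")]


if __name__ == "__main__":
    print(decode_attributes("CNXT"))
    print(safe_on_demand("CDXT"), safe_on_demand("MNA"))
    print(failed_tests({"ManagementPortLoopback": ".", "StandbyFabricLoopback": "U"}))
```

Filtering on the D flag matches the caution in the note above: disruptive tests such as ManagementPortLoopback (attributes CDXT) are the ones to keep out of routine on-demand runs in production.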
GOLD verifies functionality using a variety of techniques; the full suite of diagnostic utilities is broken down into the following categories:
■ Bootup diagnostics
■ Runtime diagnostics
■ On-demand diagnostics

Within each of these categories, specific tests are also classified as disruptive or nondisruptive.

Bootup Diagnostics
Before a module comes online within NX-OS, several checks are run on the hardware, depending on its type. By default, a complete set of tests is run before the module is placed in service. Altering this behavior is not recommended, but if boot time must be reduced, these tests can be bypassed, as shown in Example 6-6.

Example 6-6 Bypassing Bootup Diagnostics
Congo# show diagnostic bootup level
Current bootup diagnostic level: complete
Congo# conf t
Enter configuration commands, one per line. End with CNTL/Z.
Congo(config)# diagnostic bootup level bypass
Congo(config)# sho diagnostic bootup level
Current bootup diagnostic level: bypass
Congo(config)#

Runtime Diagnostics
Bootup diagnostics keep a module from coming online until its hardware functionality has been exhaustively tested, but it is not uncommon for modules or entire systems to run for months or years without rebooting. It is therefore necessary to run periodic checks on the hardware during the normal operation of the device. These checks are referred to as runtime diagnostics and can be viewed from the command-line interface (CLI). Example 6-7 shows the runtime diagnostics performed on a Supervisor module.

Example 6-7 Supervisor Runtime Diagnostics
Congo# show diagnostic description module 5 test all
ManagementPortLoopback :
       A bootup test that tests loopback on the management port of the module
EOBCPortLoopback :
       A bootup test that tests loopback on the EOBC
ASICRegisterCheck :
       A health monitoring test,enabled by default that checks read/write
       access to scratch registers on ASICs on the module.
USB :
       A bootup test that checks the USB controller initialization on the module.
CryptoDevice :
       A bootup test that checks the CTS device initialization on the module.
NVRAM :
       A health monitoring test, enabled by default that checks the sanity
       of the NVRAM device on the module.
RealTimeClock :
       A health monitoring test, enabled by default that verifies the real
       time clock on the module.
PrimaryBootROM :
       A health monitoring test that verifies the primary BootROM on the module.
SecondaryBootROM :
       A health monitoring test that verifies the secondary BootROM on the module.
CompactFlash :
       A Health monitoring test, enabled by default, that verifies access to
       the internal compactflash devices.
ExternalCompactFlash :
       A Health monitoring test, enabled by default, that verifies access to
       the external compactflash devices.
PwrMgmtBus :
       A Health monitoring test, enabled by default, that verifies the standby
       Power Management Control Bus.
SpineControlBus :
       A Health monitoring, enabled by default, test that verifies the standby
       Spine Card Control Bus.
SystemMgmtBus :
       A Health monitoring test, enabled by default, that verifies the standby
       System Bus.
StatusBus :
       A Health monitoring test, enabled by default, that verifies status
       transmitted along Status Bus.
StandbyFabricLoopback :
       A Health monitoring test, enabled by default, that verifies packet path
       from the Standby supervisor to the Fabric

Example 6-8 shows the runtime diagnostics performed on a line card.

Example 6-8 Line Card Runtime Diagnostics
Congo# show diagnostic description module 2 test all
EOBCPortLoopback :
       A bootup test that tests loopback on the EOBC
ASICRegisterCheck :
       A health monitoring test,enabled by default that checks read/write
       access to scratch registers on ASICs on the module.
PrimaryBootROM :
       A health monitoring test that verifies the primary BootROM state.
SecondaryBootROM :
       A health monitoring test that verifies the secondary BootROM state.
PortLoopback :
       A health monitoring test that will test the packet path from the
       Supervisor card to the physical port in ADMIN DOWN state on Line cards.
RewriteEngineLoopback :
       A health monitoring test, enabled by default, that does non disruptive
       loopback for all LC ports upto the Rewrite Engine ASIC (i.e. Metro) device.

Each of these tests has a default run interval that can be verified, as shown in Example 6-9.

Example 6-9 Default Runtime Diagnostics Schedule
Congo# show diagnostic content module 2
Module 2: 10/100/1000 Mbps Ethernet Module

Diagnostics test suite attributes:
B/C/* - Bypass bootup level test / Complete bootup level test / NA
P/*   - Per port test / NA
M/S/* - Only applicable to active / standby unit / NA
D/N/* - Disruptive test / Non-disruptive test / NA
H/*   - Always enabled monitoring test / NA
F/*   - Fixed monitoring interval test / NA
X/*   - Not a health monitoring test / NA
E/*   - Sup to line card test / NA
L/*   - Exclusively run this test / NA
T/*   - Not an ondemand test / NA
A/I/* - Monitoring is active / Monitoring is inactive / NA

                                                 Testing Interval
ID    Name                        Attributes     (hh:mm:ss)
 1)   EOBCPortLoopback            CNXT           -NA-
 2)   ASICRegisterCheck           NA             00:01:00
 3)   PrimaryBootROM              NA             00:30:00
 4)   SecondaryBootROM            NA             00:30:00
 5)   PortLoopback                CPNEA          00:15:00
 6)   RewriteEngineLoopback       PNEA           00:01:00

Congo# show diagnostic content module 5
Module 5: Supervisor module-1X  (Active)

Diagnostics test suite attributes:
B/C/* - Bypass bootup level test / Complete bootup level test / NA
P/*   - Per port test / NA
M/S/* - Only applicable to active / standby unit / NA
D/N/* - Disruptive test / Non-disruptive test / NA
H/*   - Always enabled monitoring test / NA
F/*   - Fixed monitoring interval test / NA
X/*   - Not a health monitoring test / NA
E/*   - Sup to line card test / NA
L/*   - Exclusively run this test / NA
T/*   - Not an ondemand test / NA
A/I/* - Monitoring is active / Monitoring is inactive / NA

                                                 Testing Interval
ID    Name                        Attributes     (hh:mm:ss)
 1)   ManagementPortLoopback      CDXT           -NA-
 2)   EOBCPortLoopback            CDXT           -NA-
 3)   ASICRegisterCheck           NA             00:00:20
 4)   USB                         CNXT           -NA-
 5)   CryptoDevice                CNXT           -NA-
 6)   NVRAM                       NA             00:00:30
 7)   RealTimeClock               NA             00:05:00
 8)   PrimaryBootROM              NA             00:30:00
 9)   SecondaryBootROM            NA             00:30:00
10)   CompactFlash                NA             00:30:00
11)   ExternalCompactFlash        NA             00:30:00
12)   PwrMgmtBus                  MNA            00:00:30
13)   SpineControlBus             MNA            00:00:30
14)   SystemMgmtBus               MNA            00:00:30
15)   StatusBus                   MNA            00:00:30
16)   StandbyFabricLoopback       SNA            00:00:30

In certain configurations, these tests might not be applicable and can be disabled. If performance issues are experienced and a hardware failure is suspected, it might be preferable to change the interval at which a test runs. Example 6-10 shows how to disable a test or change its runtime interval.

Example 6-10 Manipulating Runtime Diagnostic Parameters
Congo(config)# no diagnostic monitor module 5 test 9
Congo(config)# diagnostic monitor interval module 5 test 3 hour 00 min 00 second 45
Congo(config)# show diagnostic content module 5
Module 5: Supervisor module-1X  (Active)

Diagnostics test suite attributes:
B/C/* - Bypass bootup level test / Complete bootup level test / NA
P/*   - Per port test / NA
M/S/* - Only applicable to active / standby unit / NA
D/N/* - Disruptive test / Non-disruptive test / NA
H/*   - Always enabled monitoring test / NA
F/*   - Fixed monitoring interval test / NA
X/*   - Not a health monitoring test / NA
E/*   - Sup to line card test / NA
L/*   - Exclusively run this test / NA
T/*   - Not an ondemand test / NA
A/I/* - Monitoring is active / Monitoring is inactive / NA

                                                 Testing Interval
ID    Name                        Attributes     (hh:mm:ss)
 1)   ManagementPortLoopback      CDXT           -NA-
 2)   EOBCPortLoopback            CDXT           -NA-
 3)   ASICRegisterCheck           NA             00:00:45
 4)   USB                         CNXT           -NA-
 5)   CryptoDevice                CNXT           -NA-
 6)   NVRAM                       NA             00:00:30
 7)   RealTimeClock               NA             00:05:00
 8)   PrimaryBootROM              NA             00:30:00
 9)   SecondaryBootROM            NI             00:30:00
10)   CompactFlash                NA             00:30:00
11)   ExternalCompactFlash        NA             00:30:00
12)   PwrMgmtBus                  MNA            00:00:30
13)   SpineControlBus             MNA            00:00:30
14)   SystemMgmtBus               MNA            00:00:30
15)   StatusBus                   MNA            00:00:30
16)   StandbyFabricLoopback       SNA            00:00:30
Congo(config)#

On-Demand Diagnostics
Intermittent problems are sometimes attributed to failing hardware. As a troubleshooting step, you should test the suspect component to verify that the hardware is operating properly and thus eliminate hardware as a potential cause. In NX-OS, you can do this by using on-demand tests. Example 6-11 shows how to manually initiate a diagnostic test and view the results.

Example 6-11 On-Demand Diagnostics
Congo# diagnostic start module 5 test non-disruptive
Congo# show diagnostic result module 5

Current bootup diagnostic level: complete
Module 5: Supervisor module-1X  (Active)

        Test results: (. = Pass, F = Fail, I = Incomplete,
        U = Untested, A = Abort, E = Error disabled)

         1) ManagementPortLoopback      .
         2) EOBCPortLoopback            .
         3) ASICRegisterCheck           .
         4) USB                         .
         5) CryptoDevice                .
         6) NVRAM                       .
         7) RealTimeClock               .
         8) PrimaryBootROM              .
         9) SecondaryBootROM            .
        10) CompactFlash                .
        11) ExternalCompactFlash        .
        12) PwrMgmtBus                  .
        13) SpineControlBus             .
        14) SystemMgmtBus               .
        15) StatusBus                   .
        16) StandbyFabricLoopback       U
Congo#

In the output of Example 6-11, every test that was run against the module passed, as denoted by a period; StandbyFabricLoopback (test 16) reports U (Untested). Should a particular test fail, further investigation might be required. The Cisco Technical Assistance Center (TAC) can use this information to replace modules that are covered under support agreements.

NX-OS High-Availability Architecture
The high-availability features of NX-OS are managed by several system-level processes:
■ System Manager: At the highest level, the System Manager is responsible for the overall state of the system.
The System Manager monitors the health of the system and the various services that are running, based on the configured high availability policies. It manages the starting, stopping, monitoring, and restarting of services. Along with these high-level tasks, the System Manager also ensures that state is synchronized between Supervisors and coordinates a Supervisor switchover if necessary. To verify the health of the System Manager process itself, a hardware watchdog timer is located on the Supervisor. The System Manager periodically resets the watchdog timer with a keepalive indicator; if the hardware watchdog timer expires without receiving keepalives from the System Manager, a Supervisor switchover occurs.
■ Persistent Storage Service (PSS): Stores state information for the various services. PSS provides a database of state and runtime information. Services within NX-OS write information to the PSS at various intervals and, after a restart, read this information back from the PSS to restore the service to its prefailure state.
■ Message and Transaction Services (MTS): An interprocess communication (IPC) broker that handles message routing and queuing between services and hardware within the system. MTS ensures that processes can be restarted independently and that messages from other processes are received after a restart has occurred.

These software features combine to create operational benefits, which are discussed throughout the remainder of this chapter.

Process Modularity
To achieve the highest levels of redundancy, NX-OS uses a fully modular software architecture. Each modular component within NX-OS must be enabled by the network administrator before the feature can be configured, or even loaded into memory. Most services within NX-OS are represented as loadable modules or features that must be enabled.
If one of these processes experiences errors, the service can be restarted independently of other features or services. This level of modularity matters primarily where HA cannot be achieved by a mechanism within the protocol itself (Graceful Restart for Border Gateway Protocol [BGP] is an example of such a protocol mechanism). Processes can be enabled using the feature command or disabled using the no feature command. Example 6-12 shows the modular processes that can be enabled.

Example 6-12 Modular Features
Congo(config)# feature ?
  bgp              Enable/Disable Border Gateway Protocol (BGP)
  cts              Enable/Disable CTS
  dhcp             Enable/Disable DHCP Snooping
  dot1x            Enable/Disable dot1x
  eigrp            Enable/Disable Enhanced Interior Gateway Routing Protocol (EIGRP)
  eou              Enable/Disable eou(l2nac)
  glbp             Enable/Disable Gateway Load Balancing Protocol (GLBP)
  hsrp             Enable/Disable Hot Standby Router Protocol (HSRP)
  interface-vlan   Enable/Disable interface vlan
  isis             Enable/Disable IS-IS Unicast Routing Protocol (IS-IS)
  lacp             Enable/Disable LACP
  msdp             Enable/Disable Multicast Source Discovery Protocol (MSDP)
  netflow          Enable/Disable NetFlow
  ospf             Enable/Disable Open Shortest Path First Protocol (OSPF)
  ospfv3           Enable/Disable Open Shortest Path First Version 3 Protocol (OSPFv3)
  pbr              Enable/Disable Policy Based Routing(PBR)
  pim              Enable/Disable Protocol Independent Multicast (PIM)
  pim6             Enable/Disable Protocol Independent Multicast (PIM) for IPv6
  port-security    Enable/Disable port-security
  private-vlan     Enable/Disable private-vlan
  rip              Enable/Disable Routing Information Protocol (RIP)
  scheduler        Enable/Disable scheduler
  ssh              Enable/Disable ssh
  tacacs+          Enable/Disable tacacs+
  telnet           Enable/Disable telnet
  tunnel           Enable/Disable Tunnel Manager
  udld             Enable/Disable UDLD
  vpc              Enable/Disable VPC (Virtual Port Channel)
  vrrp             Enable/Disable Virtual Router Redundancy Protocol (VRRP)
  vtp              Enable/Disable VTP
  wccp             Enable/Disable Web Cache Communication Protocol (WCCP)

In addition to selectively enabling or disabling particular features, software modularity provides a mechanism by which software can be patched to address security vulnerabilities or apply hot fixes without requiring a complete upgrade of the system.

Process Restart
Services within NX-OS can be restarted if they experience errors or failures. These restarts can be initiated by a network operator or by the System Manager upon detecting an error condition. Each NX-OS service has an associated set of high availability (HA) policies, which define how the system reacts to a failed service. The System Manager can perform the following actions:
■ Stateful process restart: While running, restartable processes checkpoint their runtime state information to the PSS. If a service fails to respond to heartbeats from the System Manager, that process is restarted, and all of its state information is read back from the PSS.
■ Stateless process restart: The service is restarted, and all runtime information is rebuilt from the configuration or by reestablishing adjacencies.
■ Supervisor switchover: In a dual-Supervisor configuration, the active Supervisor is rebooted and the standby immediately takes over as the active Supervisor.

The following variables govern the progression through these System Manager actions:
■ Maximum retries: Specifies the number of times the System Manager attempts to perform a specific action before declaring the attempt failed. For example, the system might try to perform a stateful restart three times, then attempt a stateless restart three times, and finally initiate a Supervisor switchover.
■ Minimum lifetime: Specifies the time that a service must run after a restart before the restart is declared a success. This value is configurable but must be greater than 4 minutes.
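The escalation described above (stateful restarts, then stateless restarts, then a Supervisor switchover, each action retried up to a maximum count) can be modeled as a small state machine. This is an illustrative Python sketch, not NX-OS code: `HaPolicy`, `escalation_plan`, and `simulate` are hypothetical names, the three-retry default follows the example in the text, and the 300-second minimum lifetime is an arbitrary value chosen only to satisfy the greater-than-4-minutes rule.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    STATEFUL_RESTART = "stateful process restart"
    STATELESS_RESTART = "stateless process restart"
    SUPERVISOR_SWITCHOVER = "supervisor switchover"


@dataclass(frozen=True)
class HaPolicy:
    max_retries: int = 3          # attempts per action before escalating
    min_lifetime_s: int = 300     # uptime needed before a restart counts as a success

    def __post_init__(self) -> None:
        if self.min_lifetime_s <= 240:
            raise ValueError("minimum lifetime must be greater than 4 minutes")


def escalation_plan(policy: HaPolicy) -> list[Action]:
    """Ordered actions the System Manager may try for one failing service."""
    return ([Action.STATEFUL_RESTART] * policy.max_retries
            + [Action.STATELESS_RESTART] * policy.max_retries
            + [Action.SUPERVISOR_SWITCHOVER])


def simulate(uptimes_s: list[int], policy: HaPolicy) -> list[Action]:
    """Replay a failing service.

    uptimes_s[i] is how long the service ran after the i-th recovery action
    before failing again; once the list is exhausted the service stays up.
    """
    taken: list[Action] = []
    survival = iter(uptimes_s)
    for action in escalation_plan(policy):
        taken.append(action)
        if action is Action.SUPERVISOR_SWITCHOVER:
            break                                  # last resort, nothing left to try
        uptime = next(survival, None)
        if uptime is None or uptime >= policy.min_lifetime_s:
            break                                  # restart declared a success
    return taken


if __name__ == "__main__":
    policy = HaPolicy()
    # Service crashes 10 s after each of six restarts: escalation runs to the end.
    print([a.value for a in simulate([10] * 6, policy)])
    # Service crashes once more, then survives past the minimum lifetime.
    print([a.value for a in simulate([10, 400], policy)])
```

Keeping the plan as a flat list makes the retry counts explicit; a real HA policy could differ per service, which is why the policy is passed in rather than hard-coded.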
Stateful Switchover
The combination of the NX-OS software architecture and redundant Supervisors makes it possible to switch over seamlessly to the redundant Supervisor. A switchover can occur for a number of reasons, the most common being a user-initiated switchover, a System Manager-initiated switchover, or a switchover performed as part of an ISSU.

Example 6-13 shows how to verify the Supervisor status of the system and initiate a manual switchover from the active to the standby Supervisor.

Example 6-13 Supervisor Redundancy
Congo# show redundancy status
Redundancy mode
---------------
      administrative:   HA
         operational:   HA

This supervisor (sup-6)
-----------------------
    Redundancy state:   Active
    Supervisor state:   Active
      Internal state:   Active with HA standby

Other supervisor (sup-5)
    Redundancy state:   Standby
    Supervisor state:   HA standby
      Internal state:   HA standby

System start time:          Mon Nov 2 08:11:50 2009

System uptime:              0 days, 0 hours, 42 minutes, 11 seconds
Kernel uptime:              0 days, 0 hours, 25 minutes, 1 seconds
Active supervisor uptime:   0 days, 0 hours, 20 minutes, 0 seconds
Congo#
Congo# system switchover
Congo#
Congo# sho system redundancy status
Redundancy mode
---------------
      administrative:   HA
         operational:   HA

This supervisor (sup-1)
-----------------------
    Redundancy state:   Active
    Supervisor state:   Active
      Internal state:   Active with HA standby

Other supervisor (sup-2)
    Redundancy state:   Standby
    Supervisor state:   HA standby
      Internal state:   HA standby
Congo#
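Before a planned switchover, it can be useful to confirm programmatically that the standby Supervisor is actually in the HA standby state. The Python sketch below parses text captured from `show system redundancy status`, in the layout shown in Examples 6-4 and 6-13. How the text is collected (console log, SSH session, or an automation tool) is left open, the function names are hypothetical, and the readiness rule in `standby_ready` is an assumption based on the states shown in those examples rather than a documented Cisco check.

```python
SAMPLE = """\
Redundancy mode
---------------
      administrative:   HA
         operational:   HA

This supervisor (sup-1)
-----------------------
    Redundancy state:   Active
    Supervisor state:   Active
      Internal state:   Active with HA standby

Other supervisor (sup-2)
------------------------
    Redundancy state:   Standby
    Supervisor state:   HA standby
      Internal state:   HA standby
"""

SECTIONS = {"Redundancy mode": "mode", "This supervisor": "this", "Other supervisor": "other"}


def parse_redundancy_status(text: str) -> dict[str, dict[str, str]]:
    """Group 'key: value' lines under the section heading they appear in."""
    status: dict[str, dict[str, str]] = {"mode": {}, "this": {}, "other": {}, "uptime": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"-"}:
            continue                                   # blank line or underline
        heading = next((key for prefix, key in SECTIONS.items()
                        if line.startswith(prefix)), None)
        if heading:
            section = heading
        elif ":" in line and section:
            key, value = (part.strip() for part in line.split(":", 1))
            # Start-time and uptime lines follow the last section; keep them apart.
            bucket = "uptime" if ("uptime" in key or "start time" in key) else section
            status[bucket][key] = value
    return status


def standby_ready(status: dict[str, dict[str, str]]) -> bool:
    """Assumed readiness rule: HA mode is operational and the peer is HA standby."""
    return (status["mode"].get("operational") == "HA"
            and status["other"].get("Internal state") == "HA standby")


if __name__ == "__main__":
    parsed = parse_redundancy_status(SAMPLE)
    print(parsed["this"]["Internal state"], standby_ready(parsed))
```

Only `split(":", 1)` is used on each line, so values that themselves contain colons, such as the timestamp in `System start time`, are kept intact.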
