CS-Storm™ 500GT 3U Server Hardware Guide
(Rev C)
H-6150

Summary of Contents for Cray CS-Storm 500GT 3U

  • Page 1 CS-Storm™ 500GT 3U Server Hardware Guide (Rev C) H-6150...
  • Page 2: Table Of Contents

    Contents
    About the CS-Storm 500GT 3U Server Hardware Guide (page 3)
    System Description (page 8)
    Server Components (page 11)
    Controls and Indicators (page 14)
    Drive Support and Configuration (page 17)
    System Interconnect Diagram (page 19)
    PCIe Architecture (page 20)
    PCIe Connections and Cabling (page 21)
    Power Distribution (page 23)
    Power Supplies (page 24)
    Hydra Fan Control Utility (page 25)
    Management Daughter Card (MDC) (page 35)
    MDC Control Panel (page 35)...
  • Page 3: About The Cs-Storm 500Gt 3U Server Hardware Guide

    Original publication. Scope and Audience This document provides information about the CS-Storm 500GT 3U server. Installation and service information is provided for users who have experience maintaining high performance computing (HPC) equipment. Installation and maintenance tasks should be performed by experienced technicians in accordance with the service agreement.
  • Page 4 About the CS-Storm 500GT 3U Server Hardware Guide Acronyms and definitions. Bridge board: A PCI board/card that provides front panel control signals from the motherboard to the power backplane and SATA signals from the motherboard to the disk backplane.
  • Page 5 About the CS-Storm 500GT 3U Server Hardware Guide Acronyms and definitions. U.2: Formerly known as SFF-8639, a computer interface for connecting SSDs to a computer. It uses up to four PCI Express lanes. UPI: Intel® UltraPath Interconnect, a point-to-point processor interconnect capable of up to 10.4 GT/s.
  • Page 6 About the CS-Storm 500GT 3U Server Hardware Guide Regulatory Country Marking Compliance CE Mark Europe WARNING This is a class A product. In a domestic environment this product may cause radio interference in which case the user may be required to take adequate measures.
  • Page 7 Trademarks The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, Urika-GX, and YARCDATA. The following are trademarks of Cray Inc.: APPRENTICE2, CHAPEL, CLUSTER CONNECT, ClusterStor, CRAYDOC, CRAYPAT, CRAYPORT, DATAWARP, ECOPHLEX, LIBSCI, NODEKARE.
  • Page 8: System Description

    System Description The CS-Storm™ 500GT system is a dense 3U or 4U, 19-inch-wide rackmount server optimized to support today’s highest-power GPU or FPGA accelerator cards. Each 500GT server contains two Intel® Xeon®...
  • Page 9 System Description Table 2. CS-Storm 500GT Server Specifications Feature Description Rack options 19in rack, 42RU and 48RU options Chassis ● 19-inch wide, 3U or 4U rackmounted chassis ● Up to 15 server chassis in a 48RU rack ● Chassis weight: ○...
  • Page 10 System Description Feature Description ● 2 additional PCIe 3.0 x16 slots can be added with 8 GPUs Network adapter cards ● Omni-Path (100Gb/s) ● InfiniBand™ EDR (100Gb/s) or HDR (200Gb/s) ● Ethernet (100Gb/s) Cooling Air cooled (front-to-rear airflow) ●...
  • Page 11: Server Components

    Server Components Server Components The major components in the CS-Storm 500GT server are shown in the following figure. Figure 2. CS-Storm 500GT 3U Chassis Components Drive cage 2 PCIe add-on/ SATA/NVMe network cards blank drive slots (4-7) (up to 4)
  • Page 12 Server Components on page 20. The switch board provides an interface between the motherboard and GPUs through the PLX device. The switch board also provides a direct power connection to each GPU slot. GPU status signals are routed to the front panel through the PCIe switches and over the SMBus.
  • Page 13 Server Components Figure 3. GPU Carriers, Multiple Views, CS-Storm 500GT Power connector NVIDIA card NVIDIA carrier GPU card PCIe connector FPGA carrier GPU assembly (bottom view) FPGA card Accelerator card carrier bracket Locking Ejector thumbscrew handles Blank two-slot GPU carrier Carrier guide Insulator Blank one-slot...
  • Page 14: Controls And Indicators

    Controls and Indicators Figure 5. Front Controls - CS-Storm 500GT Power button GPU status LEDs Synchronize switch with LED (CRAY logo to CSS indicator) PCIe link status LEDs Reset button Chassis health LED System ID button System health LED with LED Power button [blue].
  • Page 15 Solid On Fatal error Blinking Non-fatal error Synchronize switch. This switch sets/synchronizes the CRAY logo to display the same conditions as the CSS LED. Synchronize is on (default) when the switch is in the up position, as shown. CRAY logo [blue/amber].
  • Page 16 Server Components Figure 6. GPU and PCIe LEDs - CS-Storm 500GT = GPU Status LEDs Normal operation Solid (red) Fatal alarm. Indicates over temperature, over current, or communication error. = PCIe Status LEDs These LEDs indicate the status and transfer speed for the PCIe connection through the PCIe (PLX) switch to the GPU.
  • Page 17: Drive Support And Configuration

    Server Components Rear PCIe Slots, I/O Connectors, and LEDs Figure 7. Rear Controls and Connectors - CS-Storm 500GT 4 PCIe 3.0, x16 slots (half-height, half-length) Drive bay 1 Drive bay 2 (Drive slots 0-3, SATA) PCIe card latch (Drive slots 4-7, SATA/NVMe) PSU 0 PSU 1 Management...
  • Page 18 Server Components ● 8 SATA drives ○ Bay 1 - 4 SATA ○ Bay 2 - 4 SATA ● 4 SATA and 4 NVMe drives ○ Bay 1 - 4 SATA ○ Bay 2 - 4 NVMe (requires NVMe cables from slots 2 and 3) NVMe Support ●...
  • Page 19: System Interconnect Diagram

    System Interconnect Diagram System Interconnect Diagram The figure shows the CS-Storm 500GT server interconnect and cable connections between each of the major subsystem components. Figure 8. Balanced PCIe 4PLX Interconnect Diagram - CS-Storm 500GT NVME x4 SATA x4 Front Control/Indicators (101782100) Drive Bays 4-7 SATA x4 I2C Control (101781600)
  • Page 20: Pcie Architecture

    PCIe Architecture PCIe Architecture Figure 9. Balanced 4PLX PCIe Block Diagram - CS-Storm 500GT Balanced PCIe Configuration Five of the PCIe slots connect to each CPU NVMe (4x) 4 PCIe switch board (4PLX) Optional (x16) Add-on Card (x16) Network Card (x16) PCIe Switch Board Optional (x8) PCIe...
  • Page 21: Pcie Connections And Cabling

    PCIe Connections and Cabling PCIe Connections and Cabling The CS-Storm 500GT supports up to four PCIe 3.0 x16 slots for high-speed network adapter cards. The PCIe lanes for HSN adapter cards come from the motherboard through the PCIe switch board and on to the add-in slot connectors through twin-axial ribbon cable assemblies.
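As a rough illustration of what a PCIe 3.0 x16 slot offers a high-speed network adapter, the sketch below computes the theoretical one-direction link bandwidth. The 8 GT/s lane rate and 128b/130b encoding are properties of PCIe 3.0 in general and are not taken from this guide; the function name is hypothetical, and real throughput is lower once protocol overhead and the PCIe switch are accounted for.

```python
# PCIe 3.0 link parameters (from the PCIe 3.0 standard, not from this guide).
GT_PER_SEC_PER_LANE = 8.0          # raw signalling rate per lane, in GT/s
ENCODING_EFFICIENCY = 128 / 130    # 128b/130b line encoding


def pcie3_bandwidth_gbytes(lanes: int) -> float:
    """Return the theoretical one-direction bandwidth in GB/s for a PCIe 3.0 link."""
    if lanes < 1:
        raise ValueError("lanes must be a positive integer")
    gbits_per_sec = GT_PER_SEC_PER_LANE * ENCODING_EFFICIENCY * lanes
    return gbits_per_sec / 8  # bits to bytes


x16_gbytes = pcie3_bandwidth_gbytes(16)
```

Under these assumptions an x16 slot works out to roughly 15.75 GB/s in each direction, which is above the 12.5 GB/s line rate of a single 100Gb/s port.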
  • Page 22 PCIe Connections and Cabling Figure 11. PCIe Connectors Bottom View - CS-Storm 500GT Bottom cover PCIe switch board Twin-ax cable paddle connectors Chassis (bottom) H-6150 (Rev C)
  • Page 23: Power Distribution

    Power Distribution Power Distribution CS-Storm 500GT system PDU choices may be based on data center facilities/requirements, customer preferences, and system/rack equipment configurations. Each 2200W power supply (PSU) in the server connects to a PDU outlet through a 1.5 m power cord. Chassis Power Distribution Up to 4 (N+1) power supplies in the chassis receive power from the rack PDU.
  • Page 24: Power Supplies

    Power Distribution CS-Storm 500GT Power Supplies The CS-Storm 500GT uses up to four 2200W high-efficiency power supplies to support different PCIe/GPU configurations. Each power supply receives power from a rack PDU. The PSUs support 2+1 or 2+2 redundancy with hot-swap capability and provide up to 4.4kW of power. The power supplies support Power Management Bus (PMBus ™...
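The redundancy figures above can be checked with simple arithmetic. The sketch below assumes that in an "A+S" scheme the first number is the count of load-carrying supplies and the second is the count of spares; the function names are illustrative and not part of any Cray tool.

```python
PSU_WATTS = 2200  # per-supply rating stated in the guide


def usable_power_watts(active: int, psu_watts: int = PSU_WATTS) -> int:
    """Power available from the load-carrying supplies only."""
    if active < 1:
        raise ValueError("at least one active supply is required")
    return active * psu_watts


def describe_redundancy(active: int, spare: int) -> dict:
    """Summarize an active+spare PSU configuration such as 2+1 or 2+2."""
    if spare < 0:
        raise ValueError("spare count cannot be negative")
    return {
        "installed": active + spare,
        "usable_watts": usable_power_watts(active),
        "tolerated_failures": spare,
    }
```

Both 2+1 and 2+2 give 2 x 2200 W = 4400 W of usable power, matching the 4.4kW figure; the configurations differ only in how many supply failures they tolerate.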
  • Page 25: Hydra Fan Control Utility

    The hydra fan control utility monitors and controls GPUs and fans in CS-Storm 500GT servers. This utility controls Cray-designed PCIe expansion and fan control logic through the motherboard BMC. The utility runs as a Linux service daemon (hydrad) and is distributed as an RPM package.
  • Page 26 Hydra Fan Control Utility This file contains the running environment for the hydrad service. The running parameters for fan speed and GPU temperature can be adjusted on the system. Restart the hydrad service to apply changes made to the hydra.conf file. RPM Package After installing the hydra RPM package, the hydra utility automatically registers and starts up the hydrad daemon.
  • Page 27 Hydra Fan Control Utility ● CAUTION: ○ GPU Overheating ○ Manually setting the default fan speed too low can overheat the GPUs. Monitor GPU temperature after manually setting the fan speed to avoid damage to the GPU or accelerator card. ●...
  • Page 28 Hydra Fan Control Utility To disable or enable active fan control: # hydra fan [on|off] Active Fan Control by GPU temperature off: Manual Fan Control To set manual fan control to a specific PWM duty value (% = 10 to 100): # hydra fan off # hydra fan [%] Command line options (examples shown below):...
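A script that sets a manual fan speed may want to check the PWM duty range before calling the utility. The sketch below only builds the two command lines shown above (`hydra fan off`, then `hydra fan [%]`) and validates the documented 10 to 100 range; the helper name is hypothetical, and nothing is executed.

```python
from typing import List


def manual_fan_commands(duty_percent: int) -> List[List[str]]:
    """Return the argv lists that switch to manual fan control at a given PWM duty.

    The guide documents a duty range of 10 to 100 percent for manual control.
    """
    if not isinstance(duty_percent, int) or not 10 <= duty_percent <= 100:
        raise ValueError("PWM duty must be an integer from 10 to 100")
    return [
        ["hydra", "fan", "off"],              # disable active fan control
        ["hydra", "fan", str(duty_percent)],  # set the manual PWM duty
    ]
```

On a node with the hydra package installed, each list could be passed to `subprocess.run` from a root session. Given the overheating caution in this section, monitor GPU temperatures after any manual change.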
  • Page 29 Hydra Fan Control Utility hydra gpu: GPU Power Control The CS-Storm has power control logic for all GPUs that can be controlled using a hydrad CLI command. GPU power can be disabled to reduce power consumption. The default initial power state for GPUs is power on. If GPU power is off, the GPUs are not powered on when the server is powered on, unless GPU power is enabled using the CLI command.
  • Page 30 Hydra Fan Control Utility p2_margin: ok ( -55.0 'C) inlet: ok ( 31.0 'C) outlet: ok ( 45.0 'C) hydra fan: Display Fan Status and Set Control Mode The hydrad fan command displays fan status and changes fan control mode and speed. When active fan control is disabled, the fan speed is automatically set to the default manual fan speed.
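The status readings at the start of the Page 30 excerpt (a name, a state, and a temperature in 'C) follow a regular pattern that is easy to parse for logging. The sketch below assumes the output always has the shape shown in that excerpt; the function name and regular expression are illustrative, not part of the hydra utility.

```python
import re
from typing import Dict, Tuple

# Matches readings such as "inlet: ok ( 31.0 'C)".
READING = re.compile(r"(\w+):\s*(\w+)\s*\(\s*(-?\d+(?:\.\d+)?)\s*'C\)")


def parse_readings(text: str) -> Dict[str, Tuple[str, float]]:
    """Map each sensor name to a (state, temperature in degrees C) pair."""
    return {
        name: (state, float(value))
        for name, state, value in READING.findall(text)
    }


sample = "p2_margin: ok ( -55.0 'C) inlet: ok ( 31.0 'C) outlet: ok ( 45.0 'C)"
readings = parse_readings(sample)
```

With the sample line from the excerpt, `readings["outlet"]` is `("ok", 45.0)`.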
  • Page 31 Hydra Fan Control Utility Set fan control mode to active. # hydra fan on hydra sensor: Display GPU Temperatures The hydra sensor command displays GPU temperatures # hydra sensor PCI1-A PCI1-B PCI2-A PCI2-B PCI3-A PCI3-B PCI4-A PCI4-B [root@hydra3 ~]# hydra power: Display Power Values The hydra power command displays PSU, motherboard and GPU power status and can be used to reset the peak/average and energy counters.
  • Page 32 Hydra Fan Control Utility Power : 84.0 A 1118 W (Peak 1118 W, Average 1129 W) Energy: 1.9 Wh in last 1secs(0h 0m 1s) Fan Speeds by GPU Temperature As described above, fan speeds increase and decrease based on GPU temperatures. If one of the GPUs gets hot and exceeds the next temperature region, hydrad immediately changes the fan speed to reach the target speed.
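The region-based behavior described above can be pictured as a lookup driven by the hottest GPU. The sketch below is not hydrad's algorithm: the region boundaries and duty values are hypothetical placeholders, since the real parameters are configured on the system and do not appear in this excerpt.

```python
from bisect import bisect_left
from typing import Sequence

# Hypothetical region boundaries (degrees C) and PWM duties (percent).
REGION_UPPER_BOUNDS_C = [50.0, 60.0, 70.0, 80.0]
REGION_DUTY_PERCENT = [30, 45, 60, 80, 100]


def target_duty(gpu_temps_c: Sequence[float]) -> int:
    """Pick a fan duty from the temperature region of the hottest GPU."""
    if not gpu_temps_c:
        raise ValueError("at least one GPU temperature is required")
    hottest = max(gpu_temps_c)
    # A GPU moves to the next region only when it exceeds the boundary.
    region = bisect_left(REGION_UPPER_BOUNDS_C, hottest)
    return REGION_DUTY_PERCENT[region]
```

Driving the choice from the maximum temperature reflects the statement that a single hot GPU is enough to raise the fan speed.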
  • Page 33 Hydra Fan Control Utility Discover Utility A discovery utility (hscan.py) identifies all systems/nodes that are running hydrad. The hscan.py utility provides the following information from hydrad. (hydrad contains the internal identification/discovery service and provides information through UDP port 38067.) You can turn off the discover capability using the discover=off option in the hydra.conf file for each system.
  • Page 34 Hydra Fan Control Utility Output fields for the discovery utility are displayed in the order in which their options are entered: $ ./hscan.py -mcn 00:1e:67:56:11:fd 192.168.100.74 sona $ ./hscan.py -ncm sona 192.168.100.74 00:1e:67:56:11:fd System Information When you run ./hscan.py without options, each hydrad displays basic system information in your command window.
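Because the field order follows the option order, a script that consumes hscan.py output needs to know which options were passed. The sketch below labels the fields of one output line; the letter-to-field mapping (m for MAC address, c for IP address, n for name) is inferred from the two examples above and is an assumption, as is the helper name.

```python
from typing import Dict

# Inferred from the -mcn and -ncm examples; not confirmed by the guide.
FIELD_FOR_FLAG = {"m": "mac", "c": "ip", "n": "name"}


def parse_hscan_line(flags: str, line: str) -> Dict[str, str]:
    """Label the whitespace-separated fields of one hscan.py output line."""
    keys = [FIELD_FOR_FLAG[flag] for flag in flags.lstrip("-")]
    values = line.split()
    if len(keys) != len(values):
        raise ValueError("field count does not match the option string")
    return dict(zip(keys, values))


first = parse_hscan_line("-mcn", "00:1e:67:56:11:fd 192.168.100.74 sona")
second = parse_hscan_line("-ncm", "sona 192.168.100.74 00:1e:67:56:11:fd")
```

Both calls produce the same labeled record, so downstream code does not depend on the order in which the options were given.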
  • Page 35: Management Daughter Card (Mdc)

    Management Daughter Card (MDC) Management Daughter Card (MDC) The CS-Storm 500GT MDC configures, monitors, and manages server subsystems and components. The primary functions of the card are: ● Automatic detection of GPUs, PSUs, and HDD/SSDs ● Component management ● Temperature monitoring ●...
  • Page 36: Mdc Dip Switch Configuration

    Management Daughter Card (MDC) Blinking (0.5 sec) Node/fan control error Green and Red Status Steady Cold start then could not boot system LEDs Blinking (0.5 sec) Health of the server is abnormal Winking† (0.2 sec) Firmware flashing in progress Winking (0.5 sec) Initializing resources Reset Button Pressing the Reset button discharges power from the MDC assembly and initiates a...
  • Page 37 Management Daughter Card (MDC) Figure 15. MDC DIP Switches - CS-Storm 500GT ON/Up = 1 Off/Down = 0 (IDSW1) (IDSW2) Slot ID Rack ID
  • Page 38 Management Daughter Card (MDC) Figure 16. MDC Internal DIP Switches - CS-Storm 500GT Internal Boot Device Boot Mode Settings 3-1 3-2 2-1 2-2 Boot Mode 3-3 3-4 Internal Boot Device Default. Internal boot Default. NAND flash (256 MB) SPI EEPROM (16 Mb) Serial boot over USB on-the-go (OTG) [USB connected host]...
  • Page 39: Pcie Bifurcation Of The 4 Pcie Switch Board

    PCIe Bifurcation of the 4 PCIe Switch Board PCIe Bifurcation of the 4 PCIe Switch Board This feature is reserved for custom FPGA configurations. These DIP switches are all set to Off (default) for all other configurations. The CS-Storm 500GT PCIe PLX switch board has four DIP switches that are used to configure bifurcation of PCIe lanes out of each PCIe switch chip (PEX8796).
  • Page 40: Environmental Specifications

    Environmental Specifications Environmental Specifications The table lists shipping, operating, and storage environment specifications for CS-Storm 500GT servers. CS-Storm 500GT servers comply with ASHRAE Class A2 specifications when configured with 250W accelerator cards and ASHRAE Class A1 specifications with 300W/400W accelerators. Table 4.
  • Page 41: S2600Bp Motherboard Description

    S2600BP Motherboard Description The Intel® S2600BP (Buchanan Pass) motherboard is designed to support the Intel Xeon® Scalable processor family, previously codenamed “Skylake”. Previous generation Xeon processors are not supported. Figure 18. Intel® S2600BP Motherboard Table 5.
  • Page 42 S2600BP Motherboard Description Feature Description Internal I/O Connectors ● Bridge slot to extend board I/O ● One 1x12 internal Video header ● One 1x4 IPMB header ● One internal USB 2.0 connector ● One 1x12 pin control panel header ● One DH-10 serial Port connector ●...
  • Page 43: S2600Bp Component Locations

    S2600BP Motherboard Description Feature Description ● Support for Intel Intelligent Power Node Manager (requires a PMBus-compliant power supply) RAID Support ● Intel Rapid Storage RAID Technology (RSTe) 5.0 ● Intel Embedded Server RAID Technology 2 (ESRT2) with optional Intel RAID C600 Upgrade Key to enable SATA RAID 5 S2600BP Component Locations Intel...
  • Page 44 S2600BP Motherboard Description management port is active with or without the RMM4 Lite key installed. The dedicated management port and the two onboard NICs support a BMC embedded web server and GUI. Dedicated management port/NIC LEDs. The link/activity LED (at the right of the connector) indicates network connection when on, and transmit/receive activity when blinking.
  • Page 45 S2600BP Motherboard Description Color State Criticality Description although still ● Fan warning or failure when the number of fully functional, or system is operational fans is less than minimum number needed operating in a to cool the system redundant state but ●...
  • Page 46 Status LED Condition BMC/Video memory test Solid blue Solid amber Non-recoverable condition. Contact failed Cray service for information on replacing the motherboard. Both universal bootloader Blink blue (6 Hz) Solid amber Non-recoverable condition. Contact (u-Boot) images bad Cray service for information on replacing the motherboard.
  • Page 47 S2600BP Motherboard Description Beep_LED Error Message POST Progress Code Description Sequence 1 long blink Intel® TXT security 0xAE, 0xAF System halted because Intel® Trusted violation Execution Technology detected a potential violation of system security. 3 blinks Memory error Multiple System halted because a fatal error related to the memory was detected.
  • Page 48: S2600Bp Processor Socket Assembly

    S2600BP Motherboard Description POST Code Diagnostic LEDs There are two rows of four POST code diagnostic LEDs (eight total) on the back edge of the motherboard. These LEDs are difficult to view through the back of the server/node chassis. During the system boot process, the BIOS executes a number of platform configuration processes, each of which is assigned a specific hex POST code number.
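One way to picture how a one-byte hex POST code maps onto two rows of four LEDs is to split it into two 4-bit halves. The sketch below assumes one row shows the upper nibble and the other the lower nibble, with a lit LED for each 1 bit; the excerpt does not state which physical LED carries which bit, so treat the ordering as an assumption. The function name is hypothetical.

```python
from typing import Tuple


def post_code_to_led_rows(code: int) -> Tuple[str, str]:
    """Split a one-byte POST code into two 4-bit patterns (1 = LED on)."""
    if not 0 <= code <= 0xFF:
        raise ValueError("POST code must fit in one byte")
    upper_nibble = code >> 4
    lower_nibble = code & 0x0F
    return format(upper_nibble, "04b"), format(lower_nibble, "04b")


upper, lower = post_code_to_led_rows(0xAE)
```

For POST progress code 0xAE, which appears in the beep/LED error table of this guide, the two patterns are 1010 and 1110.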
  • Page 49 S2600BP Motherboard Description Bolster Plate The bolster plate is an integrated subassembly that includes two corner guide posts placed at opposite corners and two springs that attach to the heatsink via captive screws. The springs are pulled upward as the heatsink is lowered and tightened in place, creating a compressive force between socket and heatsink.
  • Page 50: S2600Bp Architecture

    S2600BP Motherboard Description Figure 21. Intel ® S2600BP Processor Socket Assembly Heatsink (2U 80 x 107 mm) Captive shoulder nuts (T-20 Torx) Processor Heat Sink Module (PHM) Heatsink align and attach clips Assembly Alignment keys Processor package clip Package carrier (clip) Key notches Pin A1 Processor package...
  • Page 51: S2600Bp Processor Population Rules

    S2600BP Processor Population Rules Although the Intel® S2600BP motherboard supports using different processors on each socket, Cray performs platform validation only on systems that are configured with identical processors. For optimal system performance and reliability, install identical processors. If needed, the S2600BP may operate with one processor in the CPU 1 socket. However, some board features may not be functional if a second processor is not installed.
  • Page 52: S2600Bp Memory Support And Population Rules

    Memory Population Rules Although mixed DIMM configurations are supported on the Intel® S2600BP motherboard, Cray performs platform validation only on systems that are configured with identical DIMMs installed. Each memory slot should be populated with identical DDR4 DIMMs. ●...
  • Page 53: S2600Bp Configuration And Recovery Jumpers

    S2600BP Motherboard Description ● Processor sockets are self-contained and autonomous. However, all memory subsystem support (such as Memory RAS and Error Management) in the BIOS setup is applied commonly across processor sockets. ● Mixing DIMMs of different frequencies and latencies is not supported within or across processor sockets. ●...
  • Page 54 S2600BP Motherboard Description when the standard firmware update process fails. This jumper should remain in the default/disabled position when the server is running normally. To perform a Force ME Update, follow these steps: 1. Move the jumper (J4B1) from the default operating position (covering pins 1 and 2) to the enabled position (covering pins 2 and 3).
  • Page 55 S2600BP Motherboard Description The BIOS introduces three mechanisms to start the BIOS recovery process, which is called Recovery Mode: ● The Recovery Mode Jumper causes the BIOS to boot in Recovery Mode. ● The Boot Block detects partial BIOS update and automatically boots in Recovery Mode. ●...
  • Page 56: S2600Bp Bios Features

    S2600BP Motherboard Description 3. Boot the system into Setup. Check the Error Manager tab, and you should see POST Error Codes: ● 0012 System RTC date/time not set ● 5220 BIOS Settings reset to default settings 4. Go to the Setup Main tab, and set the System Date and System Time to the correct current settings. Make any other changes that are required in Setup –...
  • Page 57 S2600BP Motherboard Description Figure 23. BIOS Setup Security Tab Security Administrator Password Status Not Installed Administrator password is User Password Status Not Installed used if Power On Password is enabled and to control Set Administrator Password change access in BIOS Setup. Set User Password Length is 1-14 characters.
  • Page 58 S2600BP Motherboard Description Passwords are case sensitive. The Administrator and User passwords must be different from each other. An error message will be displayed and a different password must be entered if there is an attempt to enter the same password for both. The use of “Strong Passwords”...
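The password rules stated in this section (1 to 14 characters, case sensitive, Administrator and User passwords different from each other) can be checked before entering them in BIOS Setup. The sketch below covers only those three rules; the "Strong Passwords" guidance is truncated in this excerpt and is deliberately not modeled. The function name is hypothetical.

```python
from typing import List


def check_bios_passwords(admin: str, user: str) -> List[str]:
    """Return a list of rule violations; an empty list means both passwords pass."""
    problems = []
    for label, password in (("Administrator", admin), ("User", user)):
        if not 1 <= len(password) <= 14:
            problems.append(f"{label} password must be 1-14 characters long")
    # Comparison is case sensitive, so "Abc123" and "abc123" count as different.
    if admin == user:
        problems.append("Administrator and User passwords must differ")
    return problems
```

Because passwords are case sensitive, two passwords that differ only in letter case satisfy the "must be different" rule.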
  • Page 59 S2600BP Motherboard Description ● The OFF function of the Power button ● System Reset button If [Enabled], the power and reset buttons on the server front panel are locked, and they must be controlled via a system management interface.
