Sun HPC 3.0 SCI Guide
901 San Antonio Road
Palo Alto, CA 94303-4900
USA 650 960-1300 Fax 650 969-9131
Part No: 805-6263-10
June 1999, Revision A


Summary of Contents for Sun Microsystems Sun HPC 3.0

  • Page 1 Sun HPC 3.0 SCI Guide 901 San Antonio Road Palo Alto, CA 94303-4900 USA 650 960-1300 Fax 650 969-9131 Part No: 805-6263-10 June 1999, Revision A...
  • Page 2 Sun, Sun Microsystems, the Sun logo, SunStore, AnswerBook2, docs.sun.com, and Solaris are trademarks, registered trademarks, or service marks of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
  • Page 3: Table Of Contents

    Contents Preface vii Preparing for SCI Installation 1 Other Necessary Documentation 1 SCI Adapter Cards 1 Supported SCI Network Topologies 3 Two-Node Networks 3 Three-Node Networks 4 Four-Node Networks 5 SCI Adapter Card Scrubber Jumpers 7 Network Connection Procedure 11 Install SCI Adapter Cards 11 Notes for Scrubber Jumper Settings 11 Notes for Switched Two-Node Network 12...
  • Page 4 Connect New Adapter Card to Network 34 Create a Temporary Network Map 34 Run sciconf 34 Update sci_config.hpc 35 Run sm_config 35 Confirm sci_config.hpc Contents 36 Reboot the Node 36 Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 5 Verify the New Network 36 SCI Interface Troubleshooting 37 SCI Switch 37 General Hardware Inspection 37 SCI Switch Status LED Locations 37 Port Status LEDs 38 General Switch Status LED 39 The get_ci_status Command 39 Client Net Failure 40 Incorrect Software Configuration 40 Incorrect Firmware 41 Man Pages 43 sm_config(1) 43...
  • Page 7: Preface

    Preface The Sun HPC 3.0 SCI Guide is intended for experienced system administrators. How This Book Is Organized Chapter 1 provides an overview of the Sun HPC SCI subsystem, including descriptions of the principal hardware and software components. Chapter 2 outlines the procedure for connecting the cluster nodes to an SCI network in the various supported topologies.
  • Page 8 These are called class options. You must be root to do this. Command-line variable; replace with a real name or value. To delete a file, type rm filename.
  • Page 9 Shell Prompts
    TABLE P–2 Shell Prompts
      Shell                                    Prompt
      C shell                                  machine_name%
      C shell superuser                        machine_name#
      Bourne shell and Korn shell              $
      Bourne shell and Korn shell superuser    #
    Related Documentation TABLE P–3 Related Documentation Application Title Part Number 805-6262-10 Sun HPC ClusterTools 3.0 Product Notes Sun MPI Programming 805-7230-10...
  • Page 10 Related Documentation TABLE P–3 (continued) Application Title Part Number 805-6258-10 LSF Batch User’s Guide 805-6260-10 LSF Batch Programmer’s Guide Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 11: Preparing For Sci Installation

    SCI network on your Sun HPC cluster. SCI Adapter Cards Sun HPC 3.0 cluster nodes connect to the SCI network through SCI adapter cards installed in the node’s SBus slots. In two-node SCI networks, the SCI adapter cards are ordinarily connected to each other directly, without going through a switch.
  • Page 12 Figure 1–1 Basic SCI Network Connection Schemes for Sun HPC 3.0 Clusters
  • Page 13: Supported Sci Network Topologies

    SCI drivers. Two-Node Networks Figure 1–2 shows how two nodes in a Sun HPC 3.0 cluster can be connected via an SCI network. The SCI adapter card in one node is connected directly to an SCI adapter card in the other node. There is no intervening SCI switch, which is the usual connection scheme for two-node networks.
  • Page 14: Three-Node Networks

    Cards in a Two-Node Network” on page 12. Three-Node Networks Figure 1–3 shows examples of how three Sun HPC nodes can be connected to an SCI network, in both unstriped and striped modes. Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 15: Four-Node Networks

    Figure 1–3 Supported Three-Node SCI Interconnections Four-Node Networks Figure 1–4 shows examples of how four Sun HPC nodes can be connected to an SCI network, in both unstriped and striped modes.
  • Page 16 Figure 1–4 Supported Four-Node SCI Interconnections
  • Page 17: Sci Adapter Card Scrubber Jumpers

    SCI Adapter Card Scrubber Jumpers Each SCI adapter card has a jumper, called the scrubber jumper. This jumper configures the scrubber circuit, which controls link maintenance functions. Figure 1–5 shows its location on the SCI adapter card. Figure 1–5 Location of the Scrubber Jumper Table 1–1 specifies the appropriate scrubber jumper settings for unswitched and switched SCI networks.
  • Page 18 Figure 1–6 Examples of Scrubber Jumper Settings in Two-Node Networks
  • Page 19 Note - SCI adapter cards are shipped without a default setting. Therefore, examine the setting on each SCI adapter card and adjust it if necessary. If scrubber jumpers are not set correctly when installed, communication between nodes may experience intermittent faults. Preparing for SCI Installation...
  • Page 21: Network Connection Procedure

    CHAPTER Network Connection Procedure The Sun HPC ClusterTools 3.0 software may, but need not, already be installed on the nodes before you perform the procedures described in this chapter. The ClusterTools 3.0 software must, however, be installed before you configure the SCI drivers, as described in Chapter 3.
  • Page 22: Notes For Switched Two-Node Network

    Connecting SCI Adapter Cards in a Two-Node Network This section explains how to create a two-node SCI network that does not use an SCI switch. 1. Position the nodes in the desired locations. Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 23 2. Use an 80-line SCI station cable to connect the SCI adapter card of one node to the SCI adapter card of the other node. If each of your nodes has two SCI adapter cards, use two SCI cables, one for each pair of adapter cards.
  • Page 24: Connecting Sci Adapter Cards In A Three- Or Four-Node Network

    See Figure 2–2 and Figure 2–3 for connection examples. 3. Connect the node power cords to the appropriate power outlets. 4. Turn the node power switches on and boot the nodes. Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 25 Figure 2–2 Examples of Three-Node Switched SCI Connections
  • Page 26 Figure 2–3 Examples of Four-Node SCI Switched Connections
  • Page 27: Configuring The Sci Network Interface

    CHAPTER Configuring the SCI Network Interface This chapter explains how to configure the SCI network interface in a Sun HPC 3.0 cluster. It covers the following procedures:
      • Creating a temporary network map for later reference – See “Create a Temporary Network Map for Later Reference”...
  • Page 28: Clustertools 3.0 Software Must Be Installed

    [NOTFOUND=return] files. Since the /etc/services file is modified and used by SUNWsma and other packages, the /etc/nsswitch.conf entry should be as follows: services: files nisplus Place the term files first before other entries. Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
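
The ordering requirement above can be checked mechanically. The following is a minimal sketch, not part of the Sun HPC software: the function name check_services_order is hypothetical, and it only assumes the usual nsswitch.conf layout of a database name followed by its sources. It succeeds only when files is the first source on the services line.

```shell
# Hypothetical helper (not shipped with SUNWsma): succeeds only if "files"
# is the first source listed on the "services:" line of the given file.
# usage: check_services_order FILE
check_services_order() {
    awk '
        $1 == "services:" { found = 1; if ($2 == "files") ok = 1 }
        END { if (found && ok) exit 0; exit 1 }
    ' "$1"
}
```

On a cluster node this would be called as check_services_order /etc/nsswitch.conf; a nonzero exit status suggests the services entry needs to be edited so that files comes first.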
  • Page 29: Create A Temporary Network Map For Later Reference

    Create a Temporary Network Map for Later Reference Before you create the SCI configuration file, sci_config.hpc, sketch out a rough map of the physical network connections and identify each SCI adapter by its serial number. Look on the connector panel of the node. An adapter’s serial number is printed on a white label in the upper left corner of the adapter’s connector.
  • Page 30: Create Sci_Config.hpc

    Copy the applicable template to /opt/SUNWsma/sci_config.hpc. For example, to create a configuration file for the two-node striped topology:
        # cd /opt/SUNWsma
        # cp /opt/SUNWhpc/bin/Install_Utilities/config_dir/sma2-2stripes.hpc sci_config.hpc
    Use the sma4-2stripes.hpc template for creating a three-node, striped configuration.
  • Page 31 Next, edit sci_config.hpc. Every template type is organized into eight sections. Instructions for editing each section are provided below. Section 1 – Cluster Configuration Type Section 1 asks you to specify the type of cluster you have; you are given the options: SC (Sun Cluster) or HPC.
  • Page 32 4 Three- or four-node cluster – either nonstriped or striped, set Number of Direct Links in cluster = Section 5 – Number of Ring Connections Ring connections are not supported by Sun HPC ClusterTools 3.0 software. Therefore, always specify Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 33 Number of Rings in cluster = 0 Section 6 – Adapter Information List all SCI adapters in the cluster and describe the connection details for each. Use a separate line for each adapter description. The format for describing unswitched connections is
        host n :: adp n is connected to = link n :: endpt n
    When no switch is used, an adapter (adp) is connected to a particular endpoint...
  • Page 34 :: adp n is connected to = switch n :: port n Here, an adapter is connected to port n of switch n. Figure 3–4 through Figure 3–7 show examples of this format. Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
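
The two Section 6 line formats can be illustrated with a small generator. This is only a sketch: the function name adapter_line is hypothetical, the host, adapter, link, and port numbers are made up for illustration, and it assumes the switched form carries the same leading "host n" field as the unswitched form. Compare any real entries against the template file and Figures 3–4 through 3–7.

```shell
# Hypothetical helper: print one Section 6 adapter description line.
# usage: adapter_line HOST ADP KIND ID END
#   KIND is "link" (unswitched) or "switch" (switched);
#   END  is the endpoint number (link) or the port number (switch).
adapter_line() {
    case "$3" in
        link)
            printf 'host %s :: adp %s is connected to = link %s :: endpt %s\n' \
                "$1" "$2" "$4" "$5" ;;
        switch)
            printf 'host %s :: adp %s is connected to = switch %s :: port %s\n' \
                "$1" "$2" "$4" "$5" ;;
        *)
            echo "unknown connection kind: $3" >&2
            return 1 ;;
    esac
}

# Illustrative two-node, unswitched, single-link case:
# adapter 0 on each host is one end of link 0.
adapter_line 0 0 link 0 0
adapter_line 1 0 link 0 1
```

Generating the lines this way keeps one line per adapter, as Section 6 requires, and makes it harder to mix the link/endpt and switch/port vocabularies in a single entry.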
  • Page 35 Figure 3–4 Three-Node, Nonstriped Configuration Figure 3–5 Three-Node, Striped Configuration Configuring the SCI Network Interface...
  • Page 36 Figure 3–7 Four-Node, Striped Configuration Adapter ID values are assigned automatically by the device driver. Initially, the device driver assigns ID 0 to the adapter installed in the lowest-numbered SBus slot, Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 37 ID 1 to the adapter in the next higher-numbered slot, and so forth. ID value assignments are always consecutive, even if the adapters are not installed in adjacent slots. Consequently, if adapter cards are not installed in adjacent slots, adapter ID values do not necessarily correspond to SBus slot numbers.
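
The assignment rule described above (IDs follow ascending SBus slot order and stay consecutive even when slots are skipped) can be mimicked in a few lines. This is an illustration of the rule only, not the device driver's code; the function name adapter_ids and the slot numbers are hypothetical.

```shell
# Illustrative only: number the occupied SBus slots in ascending order,
# starting at adapter ID 0, the way the text describes the initial assignment.
# usage: adapter_ids SLOT...
adapter_ids() {
    [ "$#" -gt 0 ] || return 0
    printf '%s\n' "$@" | sort -n |
        awk '{ printf "adp %d -> SBus slot %s\n", NR - 1, $1 }'
}

# Cards in slots 0 and 3: the IDs are still 0 and 1.
adapter_ids 3 0
```

With cards in slots 0 and 3, the sketch reports adp 1 in slot 3, which is the case the text warns about: the adapter ID no longer matches the slot number.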
  • Page 38: Propagate The Sci Configuration

    If the sm_config output conflicts with Section 6 of the sci_config.hpc file, stop execution of sm_config (press Control-C) and correct the configuration file. Then run sm_config again and compare its output with sci_config.hpc again. Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 39: Verify The Rank Of The Sci Interface

    When the contents of the sci_config.hpc file are confirmed by the sm_config output, press Return to allow sm_config to complete execution. Verify the Rank of the SCI Interface Look in the file hpc.conf and change the default ranking of the SCI interface to give it the highest priority.
  • Page 41: Verify That The Network Is Functional

    CHAPTER Verify That the Network Is Functional Perform the steps described below to verify that the SCI network functions correctly. Run get_ci_status Execute get_ci_status on all cluster nodes to verify interconnectivity. Run ifconfig -a Execute ifconfig -a to verify that all the nodes are up with the SCI daemons running.
  • Page 42: Do All-To-All Ping

    These procedures include tests of internode communication. Execute the procedures provided in that chapter as a final verification of the SCI network. Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
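
The heading above names an all-to-all ping, but this excerpt does not show the exact procedure. As a rough sketch of what "all-to-all" implies, the following enumerates every ordered pair of distinct nodes so that each direction can be tested. The function name ping_pairs and the node names are hypothetical; the sketch only lists pairs and does not run ping itself.

```shell
# Hypothetical helper: print every ordered (source, destination) pair of
# distinct nodes, one pair per line.
# usage: ping_pairs NODE...
ping_pairs() {
    for src in "$@"; do
        for dst in "$@"; do
            [ "$src" = "$dst" ] || printf '%s %s\n' "$src" "$dst"
        done
    done
}

# Three nodes give six directed pairs.
ping_pairs node0 node1 node2
```

For n nodes this yields n × (n − 1) pairs, so a four-node cluster has twelve directed paths to check.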
  • Page 43: Adding Or Replacing Sci Adapter Cards

    CHAPTER Adding or Replacing SCI Adapter Cards If you add or replace an SCI adapter card on a node that has already been configured by sm_config, perform the steps described in this chapter to initialize the new adapter card. Note - Because the node has already been configured by sm_config, it should also already contain SUNWsci and the other SCI-related packages.
  • Page 44: Connect New Adapter Card To Network

    == 1 SCI card was detected in the system == Programming SCI card #1 with nodeid, adapter #, and firmware ... == this takes 20 seconds, please wait ... (continued) Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 45: Update Sci_Config.hpc

    (Continuation) == Programming is done ... Please verify the following information: sciadm $Revision: 2.30 $ DOLPHIN SBus card (SBus2b) found in SBus slot 0 on Board# 0, card slot 0. Adapter number: NodeId: 220 (0xdc) Slot Number: (0x00) System Board Number: (0x00) Card Slot Number: (0x00)
  • Page 46: Confirm Sci_Config.hpc Contents

    The sm_config output will also identify which nodes need to be rebooted. Reboot those nodes. Verify the New Network Perform the network verification steps described in Chapter 4 to be certain that the network still functions correctly. Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 47: Sci Interface Troubleshooting

    CHAPTER SCI Interface Troubleshooting SCI Switch General Hardware Inspection Perform the following checks to determine the physical state of various SCI subsystem components. Verify that:
      • All SCI scrubber jumpers are properly set, depending on the cluster topology.
      • All SCI cables are properly seated.
      • All SCI switches have power applied.
      • No SCI status LEDs are red (see Table 6–1 and Table 6–2).
    SCI Switch Status LED Locations...
  • Page 48: Port Status Leds

    Port errors: SCI cable out, sync error      Associated port LED is red
    Port operative, no transactions             Associated port LED is green
    Port operative, with transactions           Associated port LED is blinking green
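
The port LED states listed above map directly onto a lookup. The following is a small sketch for use in a troubleshooting script or checklist; the function name port_led_meaning is hypothetical, and the state strings are simply the three LED conditions named in the table.

```shell
# Hypothetical helper: translate an observed port LED state into the
# situation listed in the port status table.
# usage: port_led_meaning "red" | "green" | "blinking green"
port_led_meaning() {
    case "$1" in
        red)              echo "port error: SCI cable out or sync error" ;;
        green)            echo "port operative, no transactions" ;;
        "blinking green") echo "port operative, with transactions" ;;
        *)                echo "unknown LED state: $1" >&2; return 1 ;;
    esac
}
```

For example, port_led_meaning red points at a cabling or synchronization problem on that port, which is the first thing to check before suspecting the software configuration.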
  • Page 49: General Switch Status Led

    General Switch Status LED The switch status LED located on the rear panel indicates overall switch failures (Table 6–2). SCI Switch Rear Panel LED TABLE 6–2 Situation LED Status Fatal switch errors: fatal hardware error, temperature too high, fan(s) not operative, power supply problem Switch operational Green...
  • Page 50: Client Net Failure

      • The working copy of the sm_config template file correctly matches the hardware configuration and cluster topology.
      • sm_config ran successfully on only one of the cluster nodes.
      • All nodes were rebooted after sm_config was executed.
  • Page 51: Incorrect Firmware

    Incorrect Firmware If an SCI adapter card is loaded with the wrong firmware, the SCI cards will not be detected upon system power-on or reboot/reset. Improper loading of the firmware can happen in two ways:
      • Old firmware programmed into new SBus2b cards
      • New firmware programmed into old SBus2 cards
    If the proper firmware is loaded, a banner (containing the word FCode) will be printed from each SCI card twice during power-on, reboot, or reset...
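
Since each correctly programmed card prints its FCode banner twice, a saved console log can be checked against the number of installed cards. This is a hedged sketch: the function name check_fcode_banners is hypothetical, the exact banner text is not shown in this excerpt, and the check assumes each banner occupies one line containing the word FCode.

```shell
# Hypothetical check: compare the number of log lines mentioning "FCode"
# with the two banners expected per SCI card.
# usage: check_fcode_banners LOGFILE NUM_CARDS
check_fcode_banners() {
    [ -r "$1" ] || { echo "cannot read log: $1" >&2; return 2; }
    # grep -c exits nonzero when the count is 0, so ignore its status.
    seen=$(grep -c 'FCode' "$1" || true)
    expected=$(( $2 * 2 ))
    if [ "$seen" -eq "$expected" ]; then
        echo "ok: $seen FCode banner lines for $2 card(s)"
    else
        echo "mismatch: saw $seen FCode banner lines, expected $expected" >&2
        return 1
    fi
}
```

A mismatch does not identify which card is at fault; it only indicates that at least one card did not print its banner the expected number of times.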
  • Page 52 Note which serial number(s) are displayed. Cards that do not have their serial numbers displayed are bad and need replacement. Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 53: Sm_Config(1)

    APPENDIX Man Pages This appendix contains man pages for:
      • sm_config
      • get_ci_status
    sm_config(1)
    CODE EXAMPLE A–1 sm_config(1) man Page
    sm_config(1M) Maintenance Commands sm_config(1M)
    NAME
        sm_config - SCI adapter configuration utility for clusters
    SYNOPSIS
        sm_config [-t] -f filename
    AVAILABILITY
        SUNWsma
    INTERFACE CLASSIFICATION
        Sun Private
    DESCRIPTION...
  • Page 54 - used on host1, host3 and host4 :-
    HOST 0 host1
    HOST 1 _%host2
    HOST 2 host3
    HOST 3 host4
    template_2 - used on host2 :-
    HOST 0 _%host1
    (continued)
  • Page 55: Man Pages

    (Continuation)
    HOST 1 host2
    HOST 2 _%host3
    HOST 3 _%host4
    A caveat to keep in mind when running sm_config in stand-alone mode is that sm_config cannot guarantee the coherency of the /etc/sma.config generated during the different invocations (for the above case, /etc/sma.config on host2 versus the ones on host1, host3 and host4) if the user were to supply inconsistent input data for the two cases...
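
One way to reduce the risk of inconsistent input between invocations is to generate both templates from a single host list. The sketch below is an illustration only: the function name standalone_template is hypothetical, and it assumes, based solely on the two templates shown above, that the _% prefix marks every host on which that template is not being used.

```shell
# Hypothetical helper: print the HOST lines of a stand-alone template.
# Hosts not named in the first argument get the "_%" prefix.
# usage: standalone_template "HOSTS USING THIS TEMPLATE" ALL_HOST...
standalone_template() {
    active=" $1 "
    shift
    i=0
    for h in "$@"; do
        case "$active" in
            *" $h "*) printf 'HOST %d %s\n' "$i" "$h" ;;
            *)        printf 'HOST %d _%%%s\n' "$i" "$h" ;;
        esac
        i=$((i + 1))
    done
}

# Reproduces the HOST lines of template_2 (used on host2).
standalone_template "host2" host1 host2 host3 host4
```

Because both templates come from the same ordered host list, the HOST numbers cannot drift apart between the two invocations; only the set of prefixed hosts differs.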
  • Page 56 /etc/services file is not searched. This behaviour is different from the default nis behaviour. In this scenario inetd will be unable to start the (continued)
  • Page 57: Get_Ci_Status(1M)

    (Continuation) sm_configd daemon.
    SunOS 5.5.1 Last change: 30 March 1997
    get_ci_status(1M)
    CODE EXAMPLE A–2 get_ci_status(1M) man Page
    get_ci_status(1M) Maintenance Commands get_ci_status(1M)
    NAME
        get_ci_status - Displays the Cluster configuration, the adapter status and the SMA session status.
    SYNOPSIS
        get_ci_status [ -l ]
    AVAILABILITY
        SUNWsma
    INTERFACE CLASSIFICATION...
  • Page 58 (Switch_id #1) have the following problems -
    1. Adapter_id => Probes reachable (keyword active) and SMA session - not established (keyword inoperational)
    (continued)
  • Page 59 (Continuation)
    2. Adapter_id 140 => SCI Probes unreachable (keyword inactive) session - established (keyword inoperational). This is a brief transitionary stage.
    3. Adapter_id 204 => SCI Probes unreachable (keyword inactive) session - not established (keyword inoperational)
    USAGE get_ci_status can be run from the command line by any user...
  • Page 61: Sma Messages

    APPENDIX SMA Messages If the HPC cluster uses an SCI switch, the SMA software can display the following messages. Message Directory The following pages list SMA messages in alphabetical order:
      • SUNWcluster.sma.smactl.4007 - Cannot create logical adapter: None found
      • SUNWcluster.sma.smactl.4008 - Cannot create logical adapter:no response
      • SUNWcluster.sma.smad.1030 - $clustername adapter $adp selected
      • SUNWcluster.sma.smad.1101 - smad($pid): entering stand-alone mode
      • SUNWcluster.sma.smad.1102 - smad: Cluster...
  • Page 62: Sma Messages

    ‘‘request to exit, or a fatal error.’’ msgid ‘‘SUNWcluster.sma.watchdog.4002.error’’ msgstr ‘‘The process-id of the parent daemon is in brackets.’’ msgid ‘‘SUNWcluster.sma.watchdog.4002.fix’’ msgstr ‘‘Not Applicable. ‘‘ ############################################################################# SUNWcluster.sma.smad.4004 - smad($pid): exiting by request (continued) Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 63 (Continuation) ############################################################################# msgid ‘‘SUNWcluster.sma.smad.4004.message’’ msgstr ‘‘The SMAD child daemon is dying by request.’’ ‘‘request to exit, or a fatal error.’’ msgid ‘‘SUNWcluster.sma.smad.4004.error’’ msgstr ‘‘Probably a due to a shutdown or a pkgrm.’’ msgid ‘‘SUNWcluster.sma.smad.4004.fix’’ msgstr ‘‘Not Applicable. ‘‘ ############################################################################# SUNWcluster.sma.smak.1001 - SCI Adapter $adp: Card operational ############################################################################# msgid...
  • Page 64 ‘‘monitored. It can also happen if the SMAD was already in ‘monitor’ ‘‘ ‘‘mode and had died and restarted.’’ msgid ‘‘SUNWcluster.sma.smad.1102.error’’ msgstr ‘‘Not Available.’’ msgid ‘‘SUNWcluster.sma.smad.1102.fix’’ msgstr ‘‘Not Applicable.’’ ############################################################################# (continued) Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 65 (Continuation) SUNWcluster.sma.smad.1103 - smad: Cluster ‘$clustname’ running ############################################################################# msgid ‘‘SUNWcluster.sma.smad.1103.message’’ msgstr ‘‘CMM has informed SMAD of the cluster membership. SMAD continues to ‘‘ ‘‘monitor the entire cluster. It can also happen if the SMAD was ‘‘ ‘‘already in ‘cluster’ mode and had died and restarted.’’ msgid ‘‘SUNWcluster.sma.smad.1103.error’’...
  • Page 66 ‘‘SUNWcluster.sma.smactl.4008.message’’ msgstr ‘‘SMAD has not responded to a request to create a logical adapter.It ‘‘ ‘‘is likely that SMAD has aborted / died in a fatal manner.’’ msgid ‘‘SUNWcluster.sma.smactl.4008.error’’ (continued) Sun HPC 3.0 SCI Guide ♦ June 1999, Revision A...
  • Page 67 (Continuation) msgstr ‘‘Not Available.’’ msgid ‘‘SUNWcluster.sma.smactl.4008.fix’’ msgstr ‘‘Not Applicable.’’ SMA Messages...
