CAG Infrastructure Overview
March 2021
George Lambidakis, Ofer Licht
Necessary Terminology
... Cisco has more acronyms than the US Government, and they change all the time
• CAG – Common ASIC Group (Eyal)
• CHG – Common Hardware Group (Ravi K – Eyal’s boss)
• CEC – Cisco Employee Connection (Cisco credentials)
• IT – Cisco IT proper (security, policy, networking, phones)
• EngIT – Engineering IT (DC infra: VMs, compute, NAS)
• Tools Group – (part of CHG) LSF SW, tool wrappers, licenses
• CapNet – Cisco’s global internal network
© 2019 Cisco and/or its affiliates. All rights reserved. Cisco Confidential
Necessary Terminology, Continued
• LSF – Load Sharing Facility (IBM), similar to Sun Grid Engine
• DC – Data Center (NTN, SJC/MTV, BGL, RTP, GPK)
• Labs – Cisco’s lab infrastructure, separate from DC
• Clusters – UCS compute nodes, composed of 40 blades
• ServiceNow – Cisco’s IT break/fix service system
• E-store – Cisco’s IT request system (SW, services, etc.)
• Duo MFA – Multi-factor authentication for single sign-on (SSO)
• MobilePass – Single use password generator required for sudo
Help - What Single Link Do I Need?
Debugging and reporting problems
CAG Infrastructure Topology and Connectivity
[Figure: CAG site topology and WAN connectivity. Sites shown: SJC campus (sjc12, sjc05), MTV (CA), RTP (RTP01, RTP05), NTN (NTN01), CAE02 (VNC), AMS (Amsterdam), AER (Almere, backup VPN), BLR, BGL11, and GPK01 (UK). Links are labeled with latency and bandwidth, ranging from ~1 ms to 210 ms and from 150Mb to 2x10G, with OC-48 on most long-haul paths. Per-site LSF host counts (50, 100, and 250 hosts), NAS filers, and NAS mirrors are marked for MTV, RTP, BGL, NTN, and GPK.]
SOS Servers: asic-sos-rtp01, asic-sos-ntn01, asic-sos-gpk01
• All LSF hosts support an interactive use model ; running in LSF is indistinguishable from
running locally or on a VM
Resource Strategy
Desktop and Compute Strategy
• Virtual desktops (VNC) are not intended for running compute- or
memory-intensive applications
• Licensed EDA tools run in LSF on the highest performance
hardware available, using optimized slot counts
• Benefits
• Persistent – connect/disconnect/share to desktop anywhere
• Project based fairshare and license allocation
• Optimized job slot counts per host
• Supported – centrally managed and supported DC resources
VMs, Desktops & RealVNC
VMs – Region Specific VNC Capacity
** EngIT does not allow use of TigerVNC, TightVNC, etc. due to CEC credential requirement
VMs, Cont’d
• VMs dispatch LSF jobs to their local farm
• Option available to submit jobs to an arbitrary farm
• Care needed due to project file system locations
• Users can run VNC servers in other locations
• Users working in multiple regions, see Alternate Home Directory
• Common Usage:
• Engineers in BGL/CAE run VNC servers in SJC
• They have additional home directories in SJC
• Servers display back to laptops (VNC client) in BGL
Alternate Home Directory Illustration
[Figure: Omer’s laptop runs a VNC client in CAE. Connecting to a VNC server in MTV crosses the WAN between SJC and CAE (210 ms); connecting to a VNC server in NTN is a low latency, 2 ms connection from client to server. Storage follows the same pattern: a VNC server reading and writing storage in the same room sees ~0 ms, while a VNC server reading and writing across the WAN sees 210 ms. Omer’s default home directory is /users/omsali on ntn01-netapp-ns (NAS NTN); Omer’s alternate home directory is /users/omsali on mtv5-netapp-ns (NAS MTV*, SJC), alongside the SJC/MTV LSF farm.]
MTV is Mountain View, CA and is ~5 miles from the main Cisco campus
Enterprise RealVNC (Cisco Licensed - Required)
• VM based VNCs (basis for LSF clients)
• Eng IT supported*
• Uses CEC credentials
• Dynamic resizing of display resolution (xrandr)
• Encrypted connections
• Secure collaboration w/o sharing passwords
• Our documentation: VNC Overview and RealVNC
* While technically true, Unix and VNC support is mostly up to the users
LSF and Resource Allocation
*LSF - Load Sharing Facility
Access
• LSF access to our farms is restricted
• Employees/contractors reporting to Eyal have LSF access
based on reporting chain (w/o requesting)
• Engineers outside of CAG request admission via dedicated
mailing lists, subject to CAG management approval
• Otherwise, no LSF access
Fairness
• Operating Principle: Licenses and compute resources are allocated to
projects based on business priorities
• CAG uses LSF to enforce business priorities
• Priorities modeled as project and user ‘share’ values
• Jobs launch with a project ID and a computed priority
• Jobs PEND outside of exec host based on resource availability and priority
• Non-LSF jobs are problematic
• They violate fairness since they poll license server directly
• Resource allocation not based on business priorities
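To make the "jobs launch with a project ID" point concrete, here is a minimal sketch of building an LSF submission command. It uses the standard bsub options -P (project) and -q (queue); the project name and job script are hypothetical, and the CAG tool wrappers may well assemble this differently.

```python
import shlex


def build_bsub_command(project, queue, job_command):
    """Return the argv list for submitting job_command through LSF."""
    if not project:
        raise ValueError("a project ID is required for fairshare accounting")
    # bsub -P tags the job with a project name; -q selects the queue.
    return ["bsub", "-P", project, "-q", queue] + shlex.split(job_command)


# "example_project" and "run_regression.sh" are hypothetical names.
cmd = build_bsub_command("example_project", "normal", "run_regression.sh --seed 1")
print(shlex.join(cmd))
```

Running the tool directly, without bsub, is the "non-LSF job" case above: nothing tags it with a project, so fairshare cannot account for it.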
Project and User Based Fairshare
• Job scheduling system based on shares and dynamic priority
Queue        Characteristics
rush         Highest priority, first come first serve queue for individual (human) low count jobs [PBFS]
normal       Medium priority, project based allocation, intended for most simulation regressions [PBFS]
long         3-4 Lowest priority, project based allocation, intended for long running simulation workloads [PBFS]
build        Used for unlicensed workloads as well as massively parallel tools (Voltus, Seascape)
interactive  20 (misleading name) Waveform viewers, large file editing, mostly idle jobs
imp          2-5 Implementation and physical design tools ; slots based on host memory
imphcc       1-2 High Core Count implementation/PD queue for full chip (access restricted)
analog       8-10 Analog design tools (Virtuoso) with larger memory hosts ; jobs periodically use CPU time
analogsim    36 Dedicated capacity for analog simulations ; configured as 1 slot = 1 physical CPU
[Figure: queue configuration annotated with User Limit, Queue Limit, Hard Run Limit, Soft Run Limit, and Soft Memory Limit]
* Soft limits apply sane defaults and avoid jobs that otherwise run forever
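As a rough illustration of "shares and dynamic priority", the sketch below divides a project's shares by a weighted measure of its recent usage, so heavy consumers sink and idle projects rise. This is a simplified model for intuition only: the weights are illustrative placeholders, and the actual LSF formula and the values configured on the CAG farms are not given here.

```python
def dynamic_priority(shares, cpu_time_s, run_time_s, slots_in_use,
                     cpu_weight=0.7, run_weight=0.7, slot_weight=3.0):
    """Simplified shares-over-usage priority; higher is dispatched sooner."""
    # Weighted recent usage; the "1 +" keeps the divisor non-zero when idle.
    usage = (cpu_time_s * cpu_weight
             + run_time_s * run_weight
             + (1 + slots_in_use) * slot_weight)
    return shares / usage


# Two hypothetical projects with equal shares but different recent usage.
idle_project = dynamic_priority(shares=10, cpu_time_s=0, run_time_s=0, slots_in_use=0)
busy_project = dynamic_priority(shares=10, cpu_time_s=3600, run_time_s=3600, slots_in_use=50)
# The same heavy usage with ten times the shares.
favored_project = dynamic_priority(shares=100, cpu_time_s=3600, run_time_s=3600, slots_in_use=50)

print(idle_project > busy_project)
print(favored_project > busy_project)
```

The point of the model: priority is not fixed per project. It drops as a project consumes resources and recovers as usage decays, while the share value sets how quickly a project gets back to the front.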
Compute Farm – Status and Soft/Hard Limits
[Figure: compute farm status display, annotated to show 1 unique job per slot, the non-CAG queue, and suspended jobs]
Compute Farm – Simulation Job Status
[Figure: simulation job status display, annotated to show Project, Project Shares, and Dynamic Priority]
Licenses
License Servers (users generally do not specify)
ls-sjc-01 – primary CAG license server (SNPS, CDNS, etc.)
ls-sjc-03 – primary CHG license server (some VIP)
ls-csi-01 – NTN Low latency license server (ARC MetaWare)
CAD Tools Support (not EngIT, not IT)
• Part of the CHG (non-CAG) organization
• Manages LSF, tools, and licenses (not HW)
• Handles installation and support of most EDA tools
• /auto/edatools mirrored across sites as requested
• Installation initiated via a ToolBox case
• Does not preclude private/project specific tool installation
• Provides tool wrappers
• Available for most tools (VCS, DC, ICC2, hspice, etc.)
• Region aware wrappers set license paths
Vendor Libraries
• Libraries are generally maintained by us (CAG)
• Standard project independent library installation areas
• Optional mirroring to RTP, NTN, GPK, and BGL
• Mirrors have single source location (RW)
• One or more mirrors (RO) in other locations
• Update rates of 1x-6x/day based on size and type
• Daily data migration limits, per Eng IT policies
Storage
• NetApp 100% SSD Storage, w/ tiered storage options
• 6 HA node pairs in MTV ; 4 in NTN ; 6 in BGL
• w/ and w/o backups, snapshots
• NFS and CIFS/SMB volumes
• Replaced on a 2 year cycle by Cisco Engineering IT
• Storage as a Service
• We (CAG) do not own storage
• We pay for it based on consumption
• Off site backups, when enabled, are standard and automated
Volume Identification and Traditional Limits
• local volumes (local, I know, it is a silly name)
• Snapshots, Backups
• 10.18.229.84:/local/cagbb-gb-pd 3.0T 453G 2.5T 16% /auto/cagbb-gb-pd
• Maximum recommended size = 20TB*
• Limited by Cisco backup system policy
Due to the use of multi-IP filers, the host name of the filer is not visible in the output of df; the Unix ‘host’ command is used to look it up from the IP address
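A minimal sketch of pulling the filer IP, export, and usage out of a df line like the one above, plus a reverse lookup of the filer name. The helper names are hypothetical; the lookup uses the standard socket.gethostbyaddr call and assumes reverse DNS is available from the host running it.

```python
import socket


def parse_df_line(line):
    """Split one df output line for an NFS mount into named fields."""
    source, size, used, avail, use_pct, mount = line.split()
    filer_ip, _, export = source.partition(":")
    return {
        "filer_ip": filer_ip,
        "export": export,
        "size": size,
        "used": used,
        "avail": avail,
        "use_pct": int(use_pct.rstrip("%")),
        "mount": mount,
    }


def filer_hostname(filer_ip):
    """Reverse-resolve the filer IP (needs DNS; not called in this example)."""
    return socket.gethostbyaddr(filer_ip)[0]


info = parse_df_line(
    "10.18.229.84:/local/cagbb-gb-pd 3.0T 453G 2.5T 16% /auto/cagbb-gb-pd"
)
print(info["filer_ip"], info["export"], info["use_pct"])
```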
CIFS/SMB* Laptop Access
• Windows and OSX use CIFS (SMB) to mount file systems
• OSX
smb://mtv5-netapp-eg/local/argon
smb://mtv5-netapp-ns/workspace/wslocal002/rwaldoem
• WIN
\\mtv5-netapp-eg\local\argon
\\mtv5-netapp-ns\workspace\wslocal002\rwaldoem
• Full list of CAG /ws areas in the nightly CAG /ws Report
• Don’t have a /ws? Want one? See Workspace Storage Request
Examples: /ws/kevenes-sjc (MTV), /ws/okarniel-ntn (NTN)
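The OSX and Windows paths above are the same share written two ways. A small sketch of the conversion, in case it helps when sharing paths between Mac and Windows users (the function name is hypothetical):

```python
from urllib.parse import urlsplit


def smb_url_to_unc(url):
    """Convert an smb:// URL (OSX style) to a Windows UNC path."""
    parts = urlsplit(url)
    if parts.scheme != "smb":
        raise ValueError("expected an smb:// URL, got: " + url)
    # The host becomes \\host and forward slashes become backslashes.
    return "\\\\" + parts.netloc + parts.path.replace("/", "\\")


unc = smb_url_to_unc("smb://mtv5-netapp-eg/local/argon")
print(unc)
```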
Mirrors and Site Selectors
• Mirror – One RW master and one or more RO locations
• Site Selector – Independent RW file systems, multiple locs
• Mirrors/Site Selectors employ region based mounting
• /auto/<name> is effectively dynamic
• Mount location controlled by Vintela (understands regions)
• Example: /auto/asic-tools
• In SJC, mounts mtv5-netapp-eg:/local/asic_tools
• In NTN, mounts ntn01-netapp-eg:/dfs/asic_tools
• This happens to be a mirror, so the NTN location is RO (and dfs)
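Region based mounting can be pictured as a lookup keyed on path and region. The sketch below models only the /auto/asic-tools example above; the table and function are hypothetical, the real resolution is done by Vintela, and marking the SJC copy as the RW master is an assumption based on the NTN copy being the RO side of the mirror.

```python
# Hypothetical model of region based mounting; not how Vintela stores it.
REGION_MOUNTS = {
    "/auto/asic-tools": {
        "SJC": ("mtv5-netapp-eg:/local/asic_tools", "rw"),  # assumed RW master
        "NTN": ("ntn01-netapp-eg:/dfs/asic_tools", "ro"),   # mirror copy
    },
}


def resolve_mount(path, region):
    """Return (filer:export, mode) for a path as seen from a region."""
    try:
        return REGION_MOUNTS[path][region]
    except KeyError:
        raise LookupError(f"no mount known for {path} in region {region}") from None


print(resolve_mount("/auto/asic-tools", "SJC"))
print(resolve_mount("/auto/asic-tools", "NTN"))
```

The practical consequence: the same /auto path can be writable in one region and read-only in another, so scripts that write under /auto need to know where they run.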
Mirrors
• NetApp-provided feature using SnapMirror
• Data from a single RW (Read Write) master site is replicated to
one or more RO (Read Only) sites
• Useful for tools and libraries
• Schedule varies from 1x/day to 6x/day based on the size and
type of data being replicated
• Automated operation once set up
• Ex: /auto/asic-libs MTV(rw), NTN(ro), RTP(ro), CSI(ro), BGL(ro)
Site Selectors
• Manually maintained (by CAG engineers)
• Data from a RW (Read Write) site is often replicated to one or
more RW sites using tools such as rsync or scp
• Useful for project data and certain types of tools
• Care should be taken when synchronizing into and out of
CAE/NTN due to WAN bandwidth limitations
• Ex: /auto/palladium in NTN(rw), MTV(rw), and BGL(rw)
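One way to take care with WAN bandwidth when synchronizing a site selector is to cap the transfer rate. A minimal sketch using rsync's standard -a and --bwlimit options; the remote host name and the 5000 KB/s cap are hypothetical, and an appropriate limit depends on the link.

```python
def build_rsync_command(src, dest, bwlimit_kbps):
    """Return an rsync argv list with a bandwidth cap in kilobytes/second."""
    if bwlimit_kbps <= 0:
        raise ValueError("bandwidth limit must be positive")
    # -a (archive) recurses and preserves permissions, ownership and times;
    # --bwlimit throttles the transfer so it does not saturate the WAN link.
    return ["rsync", "-a", f"--bwlimit={bwlimit_kbps}", src, dest]


# "ntn-host.example" is a hypothetical remote host.
rsync_cmd = build_rsync_command(
    "/auto/palladium/", "ntn-host.example:/auto/palladium/", 5000
)
print(" ".join(rsync_cmd))
```

Because every site selector copy is RW, the direction of the sync matters: copying the wrong way overwrites newer data at the destination.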
NetApp FlexGroups
• Newish NetApp proprietary feature *
• Allows single volumes > 100T
• Developed for PD type flows
• Data striped across all filer nodes instead of 1
A NetApp is composed of node pairs (HA). With FG, a 2 filer system like that
in NTN has 4 nodes – and data is striped across all 4
• NetApp nodes also have internal intra-node 40G connectivity
[Figure: port-level diagram of NetApp Node 1 and Node 2 with Cisco Nexus 92300YC and Cisco Nexus 5596UP switches]
Data Security
Access to CAG Protected Content
• CAG Full Time Employees (FTE) +VP approved exceptions
• All in ; Access to all official project CAG protected content *
• CAG Contractors
• In for their project ; Access to all data required for their project
• Access to common scripts, libraries, etc.
* Except “skunkworks” (hidden) projects
File Protection System
• Most CAG data resides in the Unix file system (NetApps)
• Protected using Unix GID enabled groups
• World access removed from all file systems ; group access only
• We require filers with NFS Extended Groups (EG) support
• This feature allows > 16 Unix groups
• Authorization is performed by the filer rather than the Unix host
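The 16 group limit comes from the classic NFS AUTH_SYS credential, which carries at most 16 supplementary group IDs per request. A quick sketch for checking whether an account is past that limit and therefore depends on Extended Groups support; the function name is hypothetical.

```python
import os

# Maximum supplementary groups carried in an NFS AUTH_SYS credential.
AUTH_SYS_MAX_GROUPS = 16


def needs_extended_groups(gids):
    """True if the user is in more groups than AUTH_SYS can carry."""
    return len(set(gids)) > AUTH_SYS_MAX_GROUPS


# os.getgroups() lists the current process's supplementary group IDs (Unix only).
if hasattr(os, "getgroups"):
    print(needs_extended_groups(os.getgroups()))
```

With project data protected purely by group membership, engineers working on several projects pass 16 groups quickly, which is why the filer has to do the group lookup itself.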
Repository Protection Scheme
• Perforce security
• Access
• Users require CEC accounts for authentication.
• Users must be members of an ASIC-specific AD group (composed of dynamic HR-list
under Eyal plus manual additions)
• AD groups used to restrict access to specific project data (same groups as NFS)
• Encryption
• All Perforce traffic uses SSL (including authentication, file data, and meta-data)
• Auditing
• Medium level of server-side logging, short retention of logs
• Source and Perforce workspaces are 99% NFS
Complications
• Humans – a significant paradigm shift
• Users require education and training
• Managers need to actively manage data access (group membership)
• Users need to become familiar with tool reported access messages
• Non-CAG collaborators (e.g. SW, DFT) can be a support burden to all
• Infrastructure
• 24-48 hours latency when adding new members to a program
• Increased reliance on a robust Active Directory (AD) infrastructure
• In rare circumstances, users can lose access to data for up to 24 hours
Training Materials and Useful Links
Training Materials From Doc Central
• The following documents are available in Cisco's Doc Central repository.
To access them, you must already be a member of standard group "cag-base"
Cisco Internal Home https://wwwin.cisco.com
Employee Resources : https://wwwin.cisco.com
Case Management – http://ays.cisco.com
• Laptop, phone, network, badge, ... problems go here
Requesting Things – http://estore.cisco.com
• Desktop SW, storage requests, One Time Passes (OTPs), etc.
estore - Continued
• Once you place an order, use the "My
Orders" link to see it
• To see current SW subscriptions, use
the "My Things" link
MobilePass and VPN
Duo MFA (Multi-Factor Authentication)
• Link
Buying Things – http://smartbuy.cisco.com
• Headsets
• Keyboards
• Mice
• Desktop systems
• etc.
Active Directory (ADAM) https://adam.cisco.com
Change Unix information, home directories, group management
Password Reset https://pwreset.cisco.com
Backup
Optimization: pid-track.pl – Process Tracking
• Wall time vs CPU time
• Stalled job analysis, multi-thread effectiveness
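The wall time vs CPU time comparison boils down to one ratio. The sketch below is not pid-track.pl itself, only the arithmetic behind this kind of analysis; the function name and sample numbers are hypothetical.

```python
def cpu_efficiency(wall_s, cpu_s, threads=1):
    """Fraction of the available CPU time a job actually used."""
    if wall_s <= 0 or threads < 1:
        raise ValueError("wall time must be positive and threads >= 1")
    # cpu_s is total CPU seconds summed over all threads.
    return cpu_s / (wall_s * threads)


# A job that ran 10 minutes but used 30 s of CPU is mostly waiting (stalled).
stalled = cpu_efficiency(wall_s=600, cpu_s=30)
# A 4-thread job that used 1920 CPU seconds in 10 minutes keeps its threads busy.
parallel = cpu_efficiency(wall_s=600, cpu_s=1920, threads=4)

print(f"stalled job:  {stalled:.0%}")
print(f"4-thread job: {parallel:.0%}")
```

A ratio near zero suggests a job stalled on I/O, licenses, or a hang; a ratio well below 1 on a multi-threaded job suggests it requested more threads than it can use.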
Laptop Security
• Laptops are permitted to access and store restricted data
• They are considered secure devices
• IT managed
• Users authenticate using CEC credentials
OS and LSF Migrations
Red Hat 7.4 Migration
• Cisco will end support for RH6 in CY20
• Requires us to migrate the infrastructure to RH7
• For tools that will not run on RH7, Eng IT provides FBE
• Using a wrapper, able to run tools natively using the old OS
• Test VMs and LSF hosts created with new OS and packages
• RH7 testing progress documented in RH7 Testing Matrix
• For multi-user VMs, the Desktop Environment (DE) is problematic
• Gnome requires HW assist (unavail in VMs)
• KDE uses excessive resources, poor choice for multi-user environments
• Xfce will be the likely choice for a default DE
LSF 10 Migration
• We currently run LSF 9.1
• LSF 10 provides incremental benefits
• Improved reporting of PEND reasons
• Improved transfer of statistics to RTM DB
• Better support for RH7 as well as better support from IBM