
Deep Learning Server Platform_Admin Manual 2.0

The document is an admin manual for the Deep Learning Server platform at Vellore Institute of Technology, detailing its high-end NVIDIA GPU configuration and capabilities for AI research. It includes sections on hardware specifications, software stack architecture, server and Jupyter Hub access, user provisioning, Docker installation, and Linux administration commands. Additionally, it provides instructions for managing Docker containers and images, as well as monitoring system performance.

Uploaded by

alsalam0504

DEEP LEARNING

SERVER
Admin Manual

Vellore Institute of Technology, Chennai

From

Tekcogent Solutions Private Limited


Index

1. About Deep Learning Server Platform
2. NVIDIA H100 80GB PCIe GPU Card
3. NVIDIA NVLink Bridge
4. Software Stack Layout – High Level
5. Implemented Software Stack Architecture
6. Server Access
7. Jupyter Hub Access
8. User Provisioning
9. Nvidia-smi
10. Reference for Docker Installation
11. Reference Docker Commands
12. Linux Administration

About Deep Learning Server platform
We have recently implemented a high-end NVIDIA GPU platform capable of supporting the
development and execution of Artificial Intelligence based research projects and applications.
The platform is configured for large, computationally intensive operations.

Hardware Configuration:

System: https://www.supermicro.com/en/products/system/gpu/tower/sys-741ge-tnrt

CPU: 2 x 24-core processors (48 cores)

Memory: 1 TB

Storage: 1 x 1.9TB NVMe PCIe (for OS) / 3 x 14TB SATA 6Gb/s 7.2K RPM (file storage)

GPU: 2 x NVIDIA H100 80GB PCIe 5.0 x16 Passive Cooling

NVLINK: NVIDIA NVLINK Bridge

OS: Ubuntu 24.04 LTS

NVIDIA H100 Tensor Core GPU
Extraordinary performance, scalability, and security for every data center

An Order-of-Magnitude Leap for Accelerated Computing


The NVIDIA H100 Tensor Core GPU delivers exceptional performance, scalability,
and security for every workload. H100 uses breakthrough innovations based on
the NVIDIA Hopper™ architecture to deliver industry-leading conversational AI,
speeding up large language models by 30X over the previous generation.

NVLink Bridge
NVIDIA NVLink is a high-speed point-to-point (P2P) peer transfer connection, where one GPU can
transfer data to and receive data from one other GPU. The NVIDIA H100 card supports NVLink bridge
connection with a single adjacent NVIDIA H100 card. Each of the three attached bridges spans two PCIe
slots. To function correctly as well as to provide peak bridge bandwidth, bridge connection with an
adjacent NVIDIA H100 card must incorporate all three NVLink bridges. Wherever an adjacent pair of
NVIDIA H100 cards exists in the server, for best bridging performance and balanced bridge topology,
the NVIDIA H100 pair should be bridged.

Software stack layout – High level

Implemented Software Stack Architecture:

Server Access:
Credential

System: 172.16.8.22 (Through ssh)

User name: root

Password: ****

PuTTY for SSH access:

Jupyter Hub access:
172.16.8.22 (Through Browser)

Credential:

Username: admin

Password: V!t@321

1. Log in to the machine with the username and password.

2. The JupyterHub page is then displayed.

Machine access with Docker & without Docker

Note: Without Docker, version conflicts can arise between projects within the same user account.

User Provisioning
Steps to Set Up the Container

Create a Dockerfile
nano Dockerfile

Open the Dockerfile in a text editor and add the following content:
# Use the Ubuntu base image
FROM ubuntu:latest
# Install required packages (openssh-server and sudo)
RUN apt-get update && apt-get install -y \
    openssh-server \
    sudo \
    && apt-get clean
# Create SSH directory and enable SSH
RUN mkdir /var/run/sshd
# Add the user setup script to the container
COPY setup_user.sh /usr/local/bin/setup_user.sh
RUN chmod +x /usr/local/bin/setup_user.sh
# Expose the SSH port
EXPOSE 22
# Allow root login via SSH (optional)
RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
# Ensure PasswordAuthentication is enabled
RUN sed -i 's/#PasswordAuthentication yes/PasswordAuthentication yes/' /etc/ssh/sshd_config
# Run the setup script and start the SSH service
CMD ["/bin/bash", "-c", "/usr/local/bin/setup_user.sh && /usr/sbin/sshd -D"]
Save the file with Ctrl+O and exit nano with Ctrl+X.
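
The two sed lines in the Dockerfile change nothing unless sshd_config contains exactly the commented defaults they search for. The following is a minimal sketch for checking the substitutions on a scratch file. It assumes a sample file holding those two default lines; the temporary file and its contents are illustrative and are not taken from the image.

```shell
#!/bin/bash
# Scratch copy standing in for /etc/ssh/sshd_config (contents are an assumption).
cfg=$(mktemp)
printf '%s\n' '#PermitRootLogin prohibit-password' '#PasswordAuthentication yes' > "$cfg"

# Same substitutions as in the Dockerfile, applied to the scratch copy (GNU sed).
sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' "$cfg"
sed -i 's/#PasswordAuthentication yes/PasswordAuthentication yes/' "$cfg"

# Collect the lines that are now active (no longer commented out).
active=$(grep -E '^(PermitRootLogin|PasswordAuthentication) ' "$cfg")
echo "$active"
rm -f "$cfg"
```

Running the same grep against /etc/ssh/sshd_config inside a built container shows whether both settings were applied in the real image.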

Create the User Setup Script


nano setup_user.sh

Open the setup_user.sh in a text editor and add the following content.

#!/bin/bash
# Set default values if environment variables are not provided
USERNAME=${USERNAME:-user}
PASSWORD=${PASSWORD:-password}
# Create the user with the specified username and password
useradd -m -s /bin/bash "$USERNAME"
echo "$USERNAME:$PASSWORD" | chpasswd
# Install sudo if not already installed
apt-get update && apt-get install -y sudo
# Add the user to the sudo group
usermod -aG sudo "$USERNAME"
# Ensure the sudoers file has no restrictions for the sudo group
echo "%sudo ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
# Allow the user to use SSH
mkdir -p "/home/$USERNAME/.ssh"
chown -R "$USERNAME:$USERNAME" "/home/$USERNAME/.ssh"
# Report the result without writing the password to the container logs
echo "User $USERNAME created and added to sudo group."
# Execute the provided command (SSH in this case)
exec "$@"
Save the file with Ctrl+O and exit nano with Ctrl+X. setup_user.sh must be in the same directory as
the Dockerfile, because the Dockerfile copies it into the image.
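
The first two assignments in setup_user.sh use the shell's default-value expansion, so a container started without the -e options gets the account user with the password password. The following is a small sketch of that behaviour. The function name resolve_credentials and the sample name alice are hypothetical and used only for illustration.

```shell
#!/bin/bash
# resolve_credentials mirrors the two default-expansion lines in setup_user.sh
# (the function name is hypothetical, for illustration only).
resolve_credentials() {
    local name=${USERNAME:-user}
    local pass=${PASSWORD:-password}
    echo "$name:$pass"
}

unset USERNAME PASSWORD
with_defaults=$(resolve_credentials)              # neither variable set
with_name=$(USERNAME=alice resolve_credentials)   # only USERNAME set
echo "$with_defaults"
echo "$with_name"
```

Because the fallback credentials are predictable, it is safer to pass both variables every time a container is started.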

Build the Docker Image


Run the following command to build the Docker image:
docker build -t <image name> .

Run the Docker Container

Use the docker run command to start a container. Pass the USERNAME and PASSWORD as environment
variables:
docker run -d --name <container name> -p 2222:22 -e USERNAME=<user name> -e PASSWORD=<password> <image name>
-p 2222:22: Maps port 2222 on the host to port 22 in the container.
-e USERNAME and -e PASSWORD: Set the username and password dynamically.
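
When several users are provisioned, each container needs its own name and its own host port. The following is a dry-run sketch that only prints the docker run command for review. The helper print_run_cmd, the "-ssh" name suffix, the image name dl-user-image, and the user names and passwords are all placeholders assumed for illustration.

```shell
#!/bin/bash
# print_run_cmd is a hypothetical helper: it only prints the docker run command
# for one user, so the command can be reviewed before it is executed.
# Arguments: user name, password, image name, host port.
print_run_cmd() {
    local user=$1 pass=$2 image=$3 port=$4
    printf 'docker run -d --name %s -p %s:22 -e USERNAME=%s -e PASSWORD=%s %s\n' \
        "$user-ssh" "$port" "$user" "$pass" "$image"
}

# Example: two users on consecutive host ports (all values are placeholders).
cmd1=$(print_run_cmd alice changeme1 dl-user-image 2222)
cmd2=$(print_run_cmd bob changeme2 dl-user-image 2223)
echo "$cmd1"
echo "$cmd2"
```

Each user then connects with ssh on the host port assigned to that user's container.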

SSH into the Container

From the host system or an external machine, use the following command to SSH into the container:
ssh <username>@<host-ip> -p 2222
Replace <host-ip> with your host system's IP address (e.g., 10.10.10.10) and <username> with the name passed as USERNAME.
Enter the password when prompted (the value passed as PASSWORD to docker run).

Change Password for the User

From Docker Host (without SSH):


Execute a command inside the running container to change the user's password:
docker exec -it <container name> bash -c "echo '<username>:<newpassword>' | chpasswd"
Replace <username> with the user's name and <newpassword> with the desired password.

Remove a Container and Images

Stop and Remove the Container


Stop the container:
docker stop <container-name>
Remove the container:
docker rm <container-name>
Remove the Images
List all Docker images:
docker images
Remove the image:
docker rmi <image-id>

Nvidia-smi

Reference for installing Docker on the machine:
https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-20-04

Pull the docker Image for the Project


sudo docker pull <image name>
sudo docker run -d -it -p 23:22 --name UBUNTU <image name> /bin/bash

Nvidia Container Tool Kit Configuration


Install nvidia-container-toolkit by following the NVIDIA Container Toolkit install guide.

Once the installation is complete, configure Docker as described in that guide.

Docker Commands - For Reference
1. Display the help information for Docker commands:
docker --help

2. Check the Docker version:
docker --version

3. Get the Docker version details in JSON format using the --format option:
docker version --format '{{ json .}}'

4. Show detailed information about the Docker installation, including system-wide settings, storage
drivers, and network information. No additional arguments are needed for a basic docker info:
docker info

5. Show how to use the docker pull command, with its options, flags, and a brief description of how to
pull images from a Docker registry:
docker pull --help

6. Pull the ubuntu:20.04 image:
docker pull ubuntu:20.04  # uses a specific tag

7. Download the latest version of the Redis image from Docker Hub:
docker pull redis  # no tag given, so the "latest" tag is pulled by default

8. List all the Docker images on the local machine, with their repositories, tags, image IDs, creation
dates, and sizes:
docker image ls

9. Alias of docker image ls; produces the same listing:
docker images

10. Run Redis in the foreground with default settings. If the Redis image is not present locally,
Docker pulls it from Docker Hub automatically:
docker run redis

11. List the currently running containers, showing the container ID, image, command, status, ports,
and name of each:
docker ps

12. List all containers on the system, both running and stopped:
docker ps -a

13. Start a container from the Redis image interactively:
docker run -it redis

14. Run a Redis container in detached mode:
docker run -d redis

15. Run a Redis container in the background with an interactive terminal allocated:
docker run -it -d redis

16. Start a Redis container named akhilredis in the background, with an interactive terminal:
docker run -it --name=akhilredis -d redis

17. Show real-time resource usage of running containers. The metrics for each container include CPU
usage, memory usage, network I/O, block I/O, and PIDs (number of processes):
docker stats

18. Search Docker Hub for images related to "redis". The result lists each image's name, description,
stars, and whether it is an official image:
docker search redis

19. List Redis images on Docker Hub with at least 3 stars and display their full descriptions:
docker search --filter=stars=3 --no-trunc redis

20. Run the same search, limited to 10 results:
docker search --filter=stars=3 --no-trunc --limit 10 redis

21. Start the container with the specified container ID (a8217c4c56):
docker start a8217c4c56  # the container name also works in place of the ID

22. Stop a running container:
docker stop a8217c4c56  # the container name also works in place of the ID

23. Restart a container, using either the container ID or the name:
docker restart a8217c4c56

24. Temporarily pause all processes within a container:
docker pause a8217c4c56  # the container name also works in place of the ID

25. Unpause a container:
docker unpause a8217c4c56  # the container name also works in place of the ID

26. View the logs of a container:
docker logs a8217c4c56  # the container name also works in place of the ID

27. Run a new interactive shell session inside a running container:
docker exec -it a8217c4c56 bash  # starts bash inside the container; type exit to leave it

28. Create and start a new container named akhilredis from the redis image in the background. The
trailing /bin/bash replaces the image's default command, so the Redis server itself is not started.
The name akhilredis must not already be in use (for example by the container from item 16):
docker run -i -t --name=akhilredis -d redis /bin/bash

29. Run apt-get update inside the running container with ID 023828e786e0:
docker exec 023828e786e0 apt-get update

30. Rename the container vibrant_yellow to test:
docker rename vibrant_yellow test  # the container can be running or stopped

31. Remove the container named test:
docker rm test  # stop the container before removing it; the container ID also works

32. Stop all running containers (docker ps -a -q lists the IDs of all containers, including stopped
ones):
docker stop $(docker ps -a -q)  # stops all running containers
33. Forcefully remove all containers, both running and stopped:
docker rm -f $(sudo docker ps -a -q)  # removes every container, including running ones

34. Get detailed information about the container or image named happy_faraday. Docker returns JSON
output containing all available details:
docker inspect happy_faraday  # also works with an ID

35. Immediately terminate the running container named happy_faraday:
docker kill happy_faraday  # sends SIGKILL by default, whereas docker stop sends SIGTERM first

36. Kill all running containers:
docker kill $(docker ps -q)

37. Create a new volume. Volumes are persistent storage areas that containers use to store data
outside the container's file system:
docker volume create new-vol

38. List all volumes:
docker volume ls

39. Retrieve detailed information about a specific volume:
docker volume inspect new-vol

40. Start a Redis container named redisvol in the background and mount the volume new-vol at the /app
directory inside the container. If the volume does not already exist, Docker creates it automatically:
docker run -d --name redisvol --mount source=new-vol,target=/app redis  # to create the volume beforehand: docker volume create new-vol
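
The bulk commands in items 32 and 36 pass the output of docker ps to docker stop or docker kill. When no container matches, the inner command prints nothing and the outer command is called without arguments, which Docker rejects. The following is a sketch of one way to guard against that. The DOCKER variable and the stop_listed function are illustrative names, and the -r option assumes GNU xargs.

```shell
#!/bin/bash
# DOCKER can be overridden (e.g. DOCKER="echo docker") for a dry run; the
# variable and the function name are illustrative assumptions.
DOCKER=${DOCKER:-docker}

# Stop every ID read from stdin; xargs -r (GNU) skips the call on empty input.
stop_listed() {
    xargs -r $DOCKER stop
}

# Dry run with made-up IDs, then with an empty list.
some=$(printf '%s\n' a8217c4c56 023828e786e0 | DOCKER="echo docker" stop_listed)
none=$(printf '' | DOCKER="echo docker" stop_listed)
echo "dry run: $some"
echo "empty list: '$none'"
```

On the server the function would be fed from the real listing, for example: docker ps -q | stop_listed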

Linux Administration

View Active Users and Their Activities:


Shows logged-in users and what they are currently running, along with the system load averages.
w

Displays who is logged into the system, showing their login time and terminal.
who

Shows the last logins of users on the system.


last

Displays the last login times for all users on the system.
lastlog

Shows system uptime and load averages (1, 5, and 15 minutes).


uptime
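
A load average is easier to judge relative to the number of CPU cores. The following is a minimal sketch that divides the 1-minute load by the core count. It assumes a Linux system with /proc/loadavg and nproc available; the 1.0 per-core threshold is a common rule of thumb assumed here, not a value from this manual.

```shell
#!/bin/bash
# Compare the 1-minute load average with the number of CPU cores.
# /proc/loadavg and nproc are Linux-specific; the 1.0 threshold is an assumption.
load1=$(cut -d ' ' -f1 /proc/loadavg)
cores=$(nproc)
per_core=$(awk -v l="$load1" -v c="$cores" 'BEGIN { printf "%.2f", l / c }')
echo "1-min load $load1 across $cores cores = $per_core per core"

# awk exits 0 (success) only when the per-core load is above the threshold.
if awk -v p="$per_core" 'BEGIN { exit !(p > 1.0) }'; then
    echo "load exceeds core count; check top or htop for the busiest processes"
fi
```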

Lists all running processes, including the user, CPU, and memory usage.
ps aux

Real-time process monitoring, showing resource usage, such as CPU, memory, and active users.
top

An enhanced version of top, with an interactive user interface for monitoring system processes.
htop

Monitors disk I/O usage by processes.


sudo iotop

Reports on virtual memory statistics, system processes, paging, block I/O, and CPU usage.
vmstat

Displays memory usage, showing total, used, and free memory in a human readable format.
free -h

Shows disk space usage for mounted filesystems in a human-readable format.


df -h
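
For routine checks it can help to list only the filesystems that are nearly full. The following is a sketch that filters df -P output, whose POSIX format keeps each filesystem on one line. The function name over_threshold, the 80% default, and the sample figures are assumptions for illustration; mount points containing spaces are not handled.

```shell
#!/bin/bash
# over_threshold reads `df -P` output on stdin and prints the mount point and
# usage of every filesystem at or above the given percentage.
# The function name and the 80% default are illustrative assumptions.
over_threshold() {
    local limit=${1:-80}
    awk -v limit="$limit" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= limit) print $6, $5 }'
}

# Canned sample output with made-up numbers, so the filter can be tried anywhere.
sample='Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/nvme0n1p2 1000 900 100 90% /
/dev/sda1 1000 100 900 10% /data'
flagged=$(printf '%s\n' "$sample" | over_threshold 80)
echo "$flagged"
```

On the server the live data would be piped in instead of the sample: df -P | over_threshold 80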

Displays disk usage for a specific user's home directory.


du -sh /home/<username>

Displays network connections and listening ports.


netstat -tuln
Displays network connections and listening ports with more detailed information.
ss -tuln

System Activity and Resource Usage:


Collects, reports, and saves system activity information, such as CPU, memory, and I/O usage.
sar -u 1 3 # CPU usage every 1 second, 3 times

A cross-platform system monitoring tool that provides real-time system resource usage (CPU, memory,
disk, network).
glances

Real-time monitoring tool that provides detailed reports on system activity, including CPU, memory, disk I/O,
and network.
sudo atop

User Management Commands


View Users
Lists all users on the system.
cat /etc/passwd

Lists users as stored in the system's user database, which can include additional sources like LDAP.
getent passwd
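
Both listings include system and service accounts alongside real users. The following is a sketch that keeps only accounts in the regular-user UID range. The range 1000 to 59999 is assumed from the usual Ubuntu defaults and should be checked against the server's own configuration; the function name regular_users and the sample entries are hypothetical.

```shell
#!/bin/bash
# regular_users reads passwd-format lines on stdin and prints the login names
# whose UID falls in the assumed regular-user range (1000-59999).
# The function name is hypothetical.
regular_users() {
    awk -F: '$3 >= 1000 && $3 <= 59999 { print $1 }'
}

# Made-up sample entries in /etc/passwd format.
sample='root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000:Alice:/home/alice:/bin/bash
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin'
humans=$(printf '%s\n' "$sample" | regular_users)
echo "$humans"
```

On the server the filter would be applied to the real database: getent passwd | regular_users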

View User Permissions


Displays the UID, GID, and groups a user is a member of.
id <username>

Shows which groups a specific user belongs to.


groups <username>

Checks the permissions and ownership of the home directories under /home.


ls -l /home/

Displays the commands that a user is allowed to run with sudo.


sudo -l -U <username>

Create, Modify, and Delete Users


Adds a new user to the system.
sudo useradd <username>

Adds a user to a group. The -a flag appends the user to the group without removing them from other groups.
sudo usermod -aG <group> <username>

Deletes a user from the system.
sudo userdel <username>

Change the password for a user.


sudo passwd <username>

Displays information about a user's password expiration and aging.


sudo chage -l <username>
