Lenovo Reference Architecture For IBM Cloud Pak For Data
Xiaotong Jiang
Xifa Chen
Weixu Yang
Lin Xu
3 Requirements
Reference Architecture for IBM Cloud Pak for Data with Lenovo ThinkSystem Servers and Storage
7.1.2 Lenovo ThinkSystem SR630 Server
7.1.3 Lenovo ThinkSystem DE6000F All Flash Storage Array
7.1.4 Lenovo ThinkSystem DM5000F Unified Flash Storage Array
7.1.5 Lenovo RackSwitch G8052
7.1.6 Lenovo ThinkSystem NE1032/NE1032T Rack Switch
7.1.7 Lenovo RackSwitch NE10032 - Cross-Rack Switch
Resources
1 Introduction
This document describes the reference architecture for IBM Cloud Pak for Data on top of Red Hat OpenShift
on Lenovo ThinkSystem servers and storage. It provides a multi-cloud end-to-end (E2E) data and AI
infrastructure for customers, along with an integrated and flexible workflow for processing data, helping
integrate and unlock the value of all customers’ data. This reference architecture provides planning, design,
and deployment considerations for implementing Cloud Pak for Data with Lenovo products.
With the ever-increasing volume, variety and velocity of data available to an enterprise comes the challenge
of deriving the most value from it. This task requires multiple source data collection, suitable data
management, flexible and extendable data processing and easy data model inference deployment. Cloud Pak
for Data brings the power of AI to the enterprise. Cloud Pak for Data is an all-in-one multi-cloud data and AI
platform that is containerized and deployed on top of OpenShift, running on on-premises or public cloud
infrastructure, to provide a secure environment for data collection, organization, and analysis. Cloud Pak for
Data expands and enhances this technology to withstand the demands of your enterprise, adding
management, security, governance, and analytics features. The result is a more enterprise-ready
solution for complex, large-scale analytics.
OpenShift brings a containerized platform to Cloud Pak for Data, with many benefits that are difficult to obtain
on physical infrastructure or in the cloud alone. Containerization simplifies the management of your big data and AI
infrastructure, enables faster time to results, and makes it more cost effective. It is a proven software
technology that makes it possible to run multiple isolated applications on the same server at
the same time. Containerization can increase IT agility, flexibility, and scalability while creating significant cost
savings. Workloads get deployed faster, performance and availability increase, and operations become
automated, resulting in IT that is simpler to manage and less costly to own and operate.
This reference architecture is intended for IT professionals, technical architects, sales engineers, and
consultants to assist in planning, designing, and implementing the Cloud Pak for Data solution with Lenovo
hardware. Familiarity with big data processing, containers, OpenShift, and cloud computing is helpful. For
more information about Cloud Pak for Data, see “Resources”.
2 Business problem and business value
2.1 Business Problem
The world is well on its way to generating more than 40 million TB of data by 2020. Businesses must be able to
keep pace with the demand for resources in order to benefit. This data comes from everywhere, including
sensors that are used to gather climate information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone global positioning system (GPS) signals. Research agency
Burning Glass Technologies, in association with IBM, predicts demand for data scientists will grow by 28% by
2020 and academic institutions will not be able to fulfill the demand [1]. Meanwhile, data scientists are busy
juggling multiple tools and spend much of their time managing, protecting, and collecting multi-source
data. According to Forbes, data scientists spend approximately 80% of their time preparing data for analysis.
The gap between the huge volume of big data to prepare and manage and the limited time and energy of data
scientists calls for a more efficient data and analytics integration platform.
Per M&I sources, 85% of organizations are committed to a multi-cloud strategy. Investments in cloud
technology and resources are also on the rise, according to the sources. A majority of respondents also said
they plan to maintain or increase their investment in cloud over the next two years, including both internal
private and external public cloud.
availability/resiliency requirements
A containerized data and analytics platform that can be deployed in an enterprise multi-cloud environment helps
meet these business requirements.
[1] The Quant Crunch: How the Demand for Data Science Skills Is Disrupting the Job Market, Burning Glass Technologies, 2017
One of the key on-prem components required to deploy Cloud Pak for Data is the infrastructure platform.
Lenovo has partnered with IBM to verify the Lenovo ThinkSystem platform for Cloud Pak for Data. Lenovo
ThinkSystem servers integrated with OpenShift provide a flexible, secure and scalable container platform.
ThinkSystem servers have a rich set of configurable options depending upon the data workload and business
needs. Together with Lenovo ThinkSystem infrastructure, Cloud Pak for Data provides an end-to-end (E2E)
on-prem cloud solution for data analytics and speeds up extracting business value from multi-source
enterprise data.
3 Requirements
The functional and non-functional requirements for this reference architecture are described in this section.
3.1 Functional requirements
Unified platform
o A single platform that integrates data management, data governance and analysis for greater
efficiency and improved use of resources. Enable self-service collaboration across teams.
AI-ready
o Manage end-to-end data workflows to help ensure that data is easily accessible for AI. Make sure
that your data is high-quality to deliver accurate, automated insights and decisions. Seamlessly
build and manage machine learning models across development and production in a collaborative
environment.
Cloud-native agility
o Accelerate application development and deployment with a multicloud data platform that is agile,
resilient and portable. Benefit from Kubernetes containerization to provision and scale services in
minutes, instead of months, inside a more secure, governed environment.
Data virtualization
o Query data easily and more securely across multiple sources, on cloud or on premises. Exploit
the combined processing power of those sources for massive query acceleration and achieve the
speed and scalability your business needs for today's and tomorrow’s workloads.
Extensible APIs
o Use the Cloud Pak for Data APIs in your applications to accelerate implementations and deliver
significant business value.
Customized workflow
o Provision preferred data services flexibly and rapidly and customize data workflows to your
individual needs.
Continuous intelligence
o Develop real-time streaming applications and deliver continuous intelligence across your
business. With IBM Streams on IBM Cloud Pak for Data, you can enable continuous and rapid
analysis of massive volumes of data in motion or at rest.
3.2 Non-functional requirements
Customers require their big data solution to be easy, dependable, and fast. The following non-functional
requirements are key:
Easy:
o Ease of development
o Easy management at scale
o Advanced job management
o Multi-tenancy
o Easy to access data by various user types
Dependable:
Fast:
o Superior performance
o Scalability
4 Architectural overview
4.1 IBM Cloud Pak for Data
This chapter gives an architectural overview of IBM Cloud Pak for Data. Figure 1 gives a high-level overview
of the multi-cloud architecture of Cloud Pak for Data.
In this architecture, the Cloud Pak for Data admin console provides a unified control plane for user
management, data management, data governance, data analysis, and business analysis in multiple locations
– on the public cloud, and on premises in the data center. In this sense, the on-prem cluster is essentially an
extension of the public cloud. The Cloud Pak for Data admin console provides centralized configuration and
security management across the clusters. This centralized control provides a consistent mechanism to
manage distributed data analytic clusters, configuration policy, and security.
The deployment of Cloud Pak for Data on-prem requires an OpenShift cluster, which provides containerized
compute and storage. In this document, Cloud Pak for Data on-prem clusters are deployed as pods
running on top of the OpenShift cluster. This simplifies Cloud Pak for Data deployments because you do
not need dedicated hosts for implementing Cloud Pak for Data clusters. Instead, multiple applications can be
installed on the same cluster. See Figure 2 for the architecture of the Cloud Pak for Data on-prem clusters
when deployed on OpenShift.
Figure 2 Architecture of the Cloud Pak for Data on OpenShift
Admin/User control plane – This plane provides the overall admin functions. It allows customers to manage
users and user permissions, create projects, govern data quality across the organization, monitor
and manage active analytics environments and resources, deploy models and expose endpoints,
and so on.
Data/AI analytic plane – This plane is a powerful computational engine. Users can create a Python, R, or Scala
notebook-based project, create a connection to a data source, and transform and analyze data by
using this platform.
Add-on plane – Cloud Pak for Data includes a catalog of add-ons that customers can use to extend
the functionality of Cloud Pak for Data. The catalog includes the following types of add-ons:
AI, Analytics, Dashboards, Data governance, Data sources, Developer tools, Industry accelerators,
Storage.
Figure 3 shows the high-level architecture of the Red Hat OpenShift Container Platform and the core
building blocks. OpenShift is a platform designed to orchestrate containerized workloads across a cluster of
nodes. The system uses Kubernetes as the core container orchestration engine, which manages the
Docker container images and their lifecycle.
The physical configuration of the OpenShift platform is based on the Kubernetes cluster architecture. The
master node is the primary node on which the Kubernetes scheduler, along with the distributed cluster data
store (etcd), the REST API services, and other associated management services run. In a production
environment, you need to ensure high availability of the master services by replicating the services to
multiple physical servers and implementing monitoring and load-balancing services such as Keepalived and
HAProxy. The infrastructure nodes can be used in a production setting to implement such services.
Application nodes (shown simply as Node in the diagram) run the users' containerized applications on top of
the Docker container environment. With OpenShift, you can easily write and deploy applications knowing that
they’ll run on a platform optimized for Red Hat OpenShift. When choosing to deploy a private cloud on-premises,
Cloud Pak for Data System provides optimized hardware to increase the container performance of
the Red Hat cluster while speeding the time to value of data workloads.
5 Component model
Cloud Pak for Data provides features and capabilities that meet the functional and non-functional
requirements of customers. It supports the need for an end-to-end solution for data and analytics within an
enterprise across different industries, such as financial services, retail, media, healthcare, manufacturing,
telecommunications, and government organizations. One of its design principles was to help organizations
access a vast array of data sources on-premises and in the cloud—all while applying deep data management
and analytics within a private cloud setting.
Cloud Pak for Data enables users to connect to data (no matter where it lives), govern it, virtualize it, and
use it for analysis. Cloud Pak for Data also enables all of your data users to collaborate from a single, unified
interface, so your IT department doesn't need to deploy and connect multiple applications.
The cloud-native, hyper-converged architecture of Cloud Pak for Data consists of a set of core software components
that can run on-prem in the data center or in the public cloud. Together, the components provide all the
services required to explore and profile, transform, and analyze data from a single web application across the
different clouds as well as provide the common policy framework and centralized configuration management
to catalog, manage and govern users’ data.
As shown in Figure 4, Cloud Pak for Data is composed of four main logical components. The following sections
briefly describe these components.
5.1 Base & Core
Cloud Pak for Data features data collection, data virtualization, data governance, and data processing
practices at its core, which can be applied to applications. It also supports auditing data access and verifying
access privileges: Cloud Pak for Data allows administrators to configure, collect, and view audit events, and
generate reports. Cloud Pak for Data tracks access permissions and actual accesses to all entities in the enterprise.
Cloud Pak for Data APIs facilitate programmatic management of users and their access control, along with
user account management. They can be used to interact with your governance metadata to manage assets,
custom asset types, and the association between them. They provide the capability to manage analytics
projects and the collaborative use of assets (notebooks, scripts, datasets), allowing users to quickly harvest
insight from the data in a repeatable fashion as well as from automated job scheduling. The APIs also
automate deployment (from development to production) and help maintain machine learning models, making
them accessible through HTTP endpoints on the platform.
Cloud Pak for Data Infra & Admin sets the standard for enterprise deployment by delivering granular visibility
into and control over every part of the data and AI jobs, which empowers operators to improve performance,
enhance quality of service, increase compliance, and reduce administrative costs. Cloud Pak for Data makes
administration of your enterprise data processing and AI jobs simple and straightforward, at any scale.
Cloud Pak for Data monitors a number of performance and health metrics for services and role instances that
are running on your clusters. These metrics are monitored against configurable thresholds and can be used to
indicate whether a cluster is functioning as expected. You can view these metrics in the web client, which
displays metrics about jobs, pods, services, clusters, and so on.
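The idea of checking metrics against configurable thresholds can be sketched in a few lines of Python. This is only an illustration of the concept, not the Cloud Pak for Data monitoring implementation; the function name `check_metrics` and the metric names are hypothetical.

```python
from __future__ import annotations


def check_metrics(metrics: dict[str, float], thresholds: dict[str, float]) -> dict[str, str]:
    """Label each metric 'ok' or 'alert' by comparing it with its threshold.

    Metrics without a configured threshold are reported as 'unmonitored'.
    """
    status: dict[str, str] = {}
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is None:
            status[name] = "unmonitored"
        else:
            status[name] = "alert" if value > limit else "ok"
    return status


# Hypothetical sample values and thresholds, for illustration only.
sample = {"cpu_percent": 91.0, "memory_percent": 62.0, "pod_restarts": 0.0}
limits = {"cpu_percent": 85.0, "memory_percent": 80.0}
print(check_metrics(sample, limits))
```

Keeping the thresholds in a separate mapping mirrors the "configurable" aspect: the limits can change without touching the evaluation logic.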
Cloud Pak for Data deploys and integrates several types of database and messaging systems. Cloudant is a
distributed database that is optimized for handling heavy workloads that are typical of large, fast-growing web
and mobile apps. Available as an SLA-backed, fully managed cloud and on-prem service, Cloudant elastically
scales throughput and storage independently. Kafka is a distributed commit log service. Kafka functions much
like a publish/subscribe messaging system, but with better throughput, built-in partitioning, replication, and
fault tolerance. Kafka is a good solution for large-scale message processing applications. InfluxDB is a time
series database that is used to store log, sensor, and other data over a period of time. InfluxDB has seen
significant traction and is known for its simplicity and ease of use, along with its ability to perform at scale.
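To make the "commit log with built-in partitioning" idea more concrete, the following toy Python sketch models an append-only log split into partitions, where consumers read by offset. It is not the Kafka API and omits replication and fault tolerance entirely; the class `PartitionedLog` and its methods are hypothetical names chosen for this illustration.

```python
from __future__ import annotations

from zlib import crc32


class PartitionedLog:
    """Toy in-memory commit log: append-only partitions that are read by offset."""

    def __init__(self, num_partitions: int = 3) -> None:
        self.partitions: list[list[tuple[str, str]]] = [[] for _ in range(num_partitions)]

    def publish(self, key: str, value: str) -> tuple[int, int]:
        """Append a record; records with the same key land in the same partition."""
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition: int, offset: int) -> list[tuple[str, str]]:
        """Return all records in a partition from the given offset onward."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=3)
p1, o1 = log.publish("sensor-a", "21.5")
p2, o2 = log.publish("sensor-a", "21.7")

# Same key -> same partition, consecutive offsets, order preserved.
print(p1 == p2, o1, o2)
print(log.read(p1, 0))
```

Because records are never removed on read, several subscribers can consume the same partition independently, each tracking its own offset.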
Cloud Pak for Data includes a catalog of add-ons: AI, Analytics, Dashboards, Data governance, Data sources,
Developer tools, Industry accelerators, and Storage. For more information about add-ons, see this website:
https://docs-icpdata.mybluemix.net/docs/.../com.ibm.icpdata.doc/zen/admin/add-ons.html
Cloud Pak for Data has two integrations. Customers can audit sensitive data and synchronize data by integrating
IBM Guardium and StoredIQ with Cloud Pak for Data. For more information about the integrations, see this
website:
https://docs-icpdata.mybluemix.net/docs/.../com.ibm.icpdata.doc/zen/admin/integrations.html
IBM Db2 Warehouse is a software-defined data warehouse supporting Docker container technology, and it
can be deployed as an add-on database in the cluster. This data warehousing approach is client-managed and
optimized for fast and flexible deployment. Expect automated scaling to meet agile analytic workloads. With
IBM Db2 Warehouse, you have control over your data and applications without having to handle complex
database deployment and management tasks. Based on the number of worker nodes selected, IBM Cloud
Pak for Data automatically creates the appropriate data warehouse environment. For more information about
IBM Db2 Warehouse as an add-on, see this website:
https://www.ibm.com/support/knowledgecenter/SSQNUZ_2.1.0/com.ibm.icpdata.doc/zen/admin/work-with-db-db2wh.html#work-with-db-db2wh
5.4 Dashboard
Cloud Pak for Data has a web-based dashboard. From the dashboard, users can perform administrative
operations, data management, data governance, and analysis in a unified web-based UI. Figure 5 shows the web
client dashboard of Cloud Pak for Data.
Figure 6 Cloud Pak for Data deployment Architecture with the ThinkSystem Nodes
This Cloud Pak for Data reference architecture is implemented on a set of pods on OpenShift that make up a
cluster. A Cloud Pak for Data cluster consists of two types of logical nodes: Master and Worker.
Master nodes and Worker nodes run the following types of services:
A Cloud Pak for Data deployment consists of two Master nodes for high availability, and three or more Worker
nodes.
Cloud Pak for Data is deployed as microservice applications on the OpenShift platform. Figure 7 shows some
of the Cloud Pak for Data component applications.
IBM Db2 Warehouse is deployed as an add-on database in the cluster. Db2 Warehouse is designed to
provide organizations with the highly flexible architecture that is needed in the dynamic, fast-moving world of
big data and cloud computing. Db2 Warehouse leverages external ThinkSystem DE6000F/DM5000F arrays
as backend storage. For a single node, the warehouse uses a symmetric multiprocessing (SMP) architecture
for cost-efficiency. For two or more nodes, the warehouse is deployed using a massively parallel processing
(MPP) architecture for high availability and improved performance. Figure 8 shows a dashboard of Db2
Warehouse deployed as an add-on in a Cloud Pak for Data cluster.
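The node-count rule described above (one node uses SMP, two or more use MPP) can be written as a small helper. This is a sketch of the rule as stated in this document; the function name `warehouse_architecture` is hypothetical and is not part of any Db2 Warehouse tooling.

```python
def warehouse_architecture(node_count: int) -> str:
    """Return the Db2 Warehouse architecture implied by the number of nodes.

    One node -> SMP (symmetric multiprocessing).
    Two or more nodes -> MPP (massively parallel processing).
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return "SMP" if node_count == 1 else "MPP"


for nodes in (1, 2, 5):
    print(nodes, warehouse_architecture(nodes))
```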
Figure 8 Db2 Warehouse in Cloud Pak for Data Cluster
More details on the networking, hardware system management and deployment steps are described in the
following sections.
6.1 Networking
The reference architecture specifies two networks: a high-speed cluster network and a management network.
Two types of top-of-rack switches are required: one 1Gbps switch for out-of-band management and a pair of
10Gbps switches for the cluster network, for high availability. See Figure 9 below.
Figure 9 Cloud Pak for Data on OpenShift network
The two 10GbE NIC ports of each node are link aggregated into a single bonded network connection, giving
20Gbps of bandwidth. The two data switches are connected together as a Virtual Link Aggregation Group
(vLAG) pair using LACP to provide switch redundancy. If one NE1032 switch drops out of the network, the
other NE1032 continues transferring traffic. The two switches are connected by dual 10Gbps links that form an
inter-switch link (ISL), which keeps the two peer switches consistent.
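The bandwidth figures above follow from simple arithmetic, sketched below in Python. The function name `bonded_bandwidth_gbps` is hypothetical, and the calculation assumes that every surviving port contributes its full line rate to the aggregate; an individual traffic flow may still be limited to a single member link.

```python
def bonded_bandwidth_gbps(port_speed_gbps: float, ports: int, failed_ports: int = 0) -> float:
    """Aggregate bandwidth of a bonded connection after some ports have failed."""
    if not 0 <= failed_ports <= ports:
        raise ValueError("failed_ports must be between 0 and ports")
    return port_speed_gbps * (ports - failed_ports)


# Two 10GbE ports per node, as in this reference architecture.
print(bonded_bandwidth_gbps(10, 2))     # both links up
print(bonded_bandwidth_gbps(10, 2, 1))  # one link or one switch lost
```

With one switch of the vLAG pair down, each node keeps a working path at half the aggregate bandwidth rather than losing connectivity.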
6.2 Systems management
The Lenovo XClarity Administrator software provides centralized resource management that reduces
management complexity, speeds up response, and enhances the availability of Lenovo® server systems and
solutions. The Lenovo XClarity Administrator provides agent-free hardware management for Lenovo’s
ThinkSystem® rack servers, System x® rack servers, and Flex System™ compute nodes and components,
including the Chassis Management Module (CMM) and Flex System I/O modules.
Figure 10 shows the Lenovo XClarity™ Administrator interface in which servers, storage, switches and other
rack components are managed and status is shown on the dashboard. Lenovo XClarity™ Administrator is a
virtual appliance that is quickly imported into a virtualized server environment.
In addition, xCAT provides a scalable distributed computing management and provisioning tool that provides
a unified interface for hardware control, discovery and operating system deployment. It can be used to
facilitate or automate the management of cluster nodes. For more information, see: Lenovo XClarity
Administrator Product Guide
6.3 Cloud Pak for Data on OpenShift deployment
6.3.1 Pre-requisites
To perform the initial configuration and installation of the Cloud Pak for Data clusters, you need a
running OpenShift version 3.11 platform and Helm/Tiller version 2.9.1 with access to the OpenShift
platform. You need to apply for an IBM Passport account and download the installation file for an IBM Cloud
Private installation. At that point, you can run the IBM Cloud Private installation file and download the Cloud
Pak for Data installation file. In addition to obtaining the Cloud Pak for Data installation file, you also need to
create a project in OpenShift and create the required security context constraint. Then, you need to create a
cluster role binding and bind it to the default service account. Lastly, you need to log in to a Docker registry and
create the Docker secret for an icp4d-anyuid service account. More detailed instructions to prepare a Cloud Pak
for Data on OpenShift deployment can be found here:
https://www.ibm.com/support/knowledgecenter/SSQNUZ_current/com.ibm.icpdata.doc/zen/install/openshift-noicp.html
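As a minimal sketch of how the version prerequisites above might be checked before installation, the following Python snippet compares reported version strings against the required ones. The names `REQUIRED`, `parse_version`, `matches_required`, and `check_prerequisites` are hypothetical, and the snippet assumes plain dotted numeric version strings (optionally prefixed with "v"); it does not call the `oc` or `helm` tools itself.

```python
from __future__ import annotations

# Versions required by this reference architecture (taken from the text above).
REQUIRED = {"openshift": "3.11", "helm": "2.9.1"}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v3.11.104' or '2.9.1' into a tuple of integers."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def matches_required(found: str, required: str) -> bool:
    """True if the found version starts with the required components.

    For example, 3.11.104 matches a requirement of 3.11.
    """
    want = parse_version(required)
    return parse_version(found)[: len(want)] == want


def check_prerequisites(found_versions: dict[str, str]) -> dict[str, bool]:
    """Report, per component, whether the found version meets the requirement."""
    return {
        name: name in found_versions and matches_required(found_versions[name], required)
        for name, required in REQUIRED.items()
    }


# Hypothetical version strings, for illustration only.
print(check_prerequisites({"openshift": "v3.11.104", "helm": "v2.9.1"}))
print(check_prerequisites({"openshift": "v3.11.104", "helm": "v2.14.0"}))
```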
Note: To reduce cost, the cluster deployer can use VMs for the load balancer and master nodes.
Table 2 provides a minimum node configuration summary for a deployment example on Lenovo ThinkSystem
Platform.
Load Balancer    4 CPU(s)    32GB    100GB    2 10G NIC
Note: The minimum configuration of the external storage node depends on the customer's specific requirements.
For the network, two Lenovo ThinkSystem NE1032 10Gbps switches and one Lenovo RackSwitch G8052 1Gbps
switch are deployed in the cluster.
The cluster administrator can scale up the clusters later by adding more nodes or creating additional clusters
based on requirements. This reference architecture provides a recommended mid-sized, production-level
hardware configuration based on a rough workload profile estimate. The configuration bill of materials
is provided in section 8.
More detailed installation instructions for Cloud Pak for Data on OpenShift deployment can be found here:
https://www.ibm.com/support/knowledgecenter/en/SSQNUZ_2.1.0/com.ibm.icpdata.doc/zen/install/openshift-noicp.html
More detailed installation instructions for Db2 Warehouse as Cloud Pak for Data add-on can be found here:
https://www.ibm.com/support/knowledgecenter/en/SSQNUZ_2.1.0/com.ibm.icpdata.doc/zen/admin/install-data-source-add-ons.html
7 Deployment considerations
This section describes considerations for deploying the Cloud Pak for Data solution with ThinkSystem servers
and storage.
Combined with second-generation Intel® Xeon® Scalable Processors (Bronze, Silver, Gold, and Platinum), the
Lenovo SR650 server offers an even higher density of workloads and performance that lowers the total cost
of ownership (TCO). Its pay-as-you-grow flexible design and great expansion capabilities solidify
dependability for any kind of workload with minimal downtime.
The SR650 server provides high internal storage density in a 2U form factor with its impressive array of
workload-optimized storage configurations. It also offers easy management and saves floor space and power
consumption for most demanding use cases by consolidating storage and server into one system.
For more information, see the Lenovo ThinkSystem SR650 Product Guide:
https://lenovopress.com/lp1050-thinksystem-sr650-server-xeon-sp-gen2
7.1.2 Lenovo ThinkSystem SR630 Server
Combined with second-generation Intel Xeon Scalable processors (Xeon SP Gen 2), the SR630 server offers scalable
performance and storage capacity. The SR630 server supports up to two processors, up to 2933 MHz
memory speed, up to 3 TB of memory capacity with TruDDR4 DIMMs, up to 12x 2.5-inch or 4x 3.5-inch drive
bays with an extensive choice of NVMe PCIe SSDs, SAS/SATA SSDs, and SAS/SATA HDDs, and flexible I/O
expansion options with a LOM slot, a dedicated storage controller slot, and up to 3x PCIe slots. In addition,
the SR630 with Xeon SP Gen 2 supports up to 7.5 TB of memory capacity with a combination of TruDDR4
DIMMs and Intel DC persistent memory modules (DCPMMs).
The SR630 server offers basic or advanced hardware RAID protection and a wide range of networking
options, including selectable LOM, ML2, and PCIe network adapters. The next-generation Lenovo XClarity
Controller, which is built into the SR630 server, provides advanced service processor control, monitoring, and
alerting functions.
For more information, see the Lenovo ThinkSystem SR630 Product Guide:
https://lenovopress.com/lp1049-thinksystem-sr630-server-xeon-sp-gen2
7.1.3 Lenovo ThinkSystem DE6000F All Flash Storage Array
ThinkSystem DE6000F models are available in a 2U rack form-factor with 24 small form-factor (2.5-inch SFF)
drives (2U24 SFF) and include two controllers, each with 64 GB cache for a system total of 128 GB. Universal
10 Gb iSCSI or 4/8/16 Gb Fibre Channel (FC) ports provide base host connectivity, and the host interface
cards provide additional 12 Gb SAS, 10/25 Gb iSCSI, or 8/16/32 Gb FC connections.
The ThinkSystem DE6000F Storage Array scales up to 192 solid-state drives (SSDs) with the attachment of
Lenovo ThinkSystem DE240S 2U24 SFF Expansion Enclosures.
The Lenovo ThinkSystem DE6000F 2U24 SFF enclosure is shown in the following figure.
For more information, see the Lenovo ThinkSystem DE6000F Product Guide:
https://lenovopress.com/lp0910-lenovo-thinksystem-de6000f-all-flash-storage-array
7.1.4 Lenovo ThinkSystem DM5000F Unified Flash Storage Array
ThinkSystem DM5000F models are 2U rack-mount controller enclosures that include two controllers, 64 GB
RAM and 8 GB battery-backed NVRAM (32 GB RAM and 4 GB NVRAM per controller), and 24 SFF hot-swap
drive bays (2U24 form factor). Controllers provide universal 1/10 GbE NAS/iSCSI or 8/16 Gb Fibre Channel
(FC) ports, or 1/10 GbE RJ-45 ports for host connectivity.
A single ThinkSystem DM5000F Storage Array scales up to 144 solid-state drives (SSDs) with the attachment
of Lenovo ThinkSystem DM240S 2U24 SFF Expansion Enclosures.
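The maximum drive counts quoted for the two arrays (192 SSDs for the DE6000F and 144 SSDs for the DM5000F) can be related to enclosure counts with simple arithmetic. The sketch below assumes that every enclosure, including the controller enclosure, holds 24 SFF drives; the function name `expansion_enclosures` is hypothetical, and the result is plain arithmetic rather than a statement of the supported product limits.

```python
def expansion_enclosures(max_drives: int, bays_per_enclosure: int = 24) -> int:
    """Expansion enclosures needed to reach max_drives, beyond the controller enclosure."""
    total_enclosures = -(-max_drives // bays_per_enclosure)  # ceiling division
    return max(total_enclosures - 1, 0)


print("DE6000F:", expansion_enclosures(192))
print("DM5000F:", expansion_enclosures(144))
```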
Figure 14 Lenovo ThinkSystem DM5000F
https://lenovopress.com/lp0911-lenovo-thinksystem-dm5000f-unified-flash-storage-array
7.1.5 Lenovo RackSwitch G8052
For more information, see the Lenovo RackSwitch G8052 Product Guide:
https://lenovopress.com/tips1270-lenovo-rackswitch-g8052
7.1.6 Lenovo ThinkSystem NE1032/NE1032T Rack Switch
availability for business sensitive traffic. These switches deliver line-rate, high-bandwidth switching, filtering,
and traffic queuing without delaying data.
The NE1032 RackSwitch has 32x SFP+ ports that support 1 GbE and 10 GbE optical transceivers, active
optical cables (AOCs), and direct attach copper (DAC) cables.
The NE1032T RackSwitch has 24x 1/10 Gb Ethernet (RJ-45) fixed ports and 8x SFP+ ports that support
1 GbE and 10 GbE optical transceivers, active optical cables (AOCs), and direct attach copper (DAC) cables.
7.1.7 Lenovo RackSwitch NE10032 - Cross-Rack Switch
The NE10032 RackSwitch has 32x QSFP+/QSFP28 ports that support 40 GbE and 100 GbE optical
transceivers, active optical cables (AOCs), and direct attach copper (DAC) cables. It is an ideal cross-rack
aggregation switch for use in a multi-rack Cloud Pak for Data cluster.
Figure 18 Lenovo ThinkSystem NE10032 cross-rack switch
https://lenovopress.com/lp0609-lenovo-thinksystem-ne10032-rackswitch
Processor Selection
Cloud Pak for Data workload types may be skewed toward IO-bound workloads that create heavy network
traffic or CPU-bound workloads that stress the CPU itself. Intel Gold processors in this reference architecture
provide a ratio of 2 processor cores per drive, which gives the maximum drive throughput plus a full set of cores
for additional data analytics. Intel processors in the Platinum class provide higher core counts to meet the
most demanding CPU-bound workloads.
Below are several examples of IO-bound workloads:
• Sorting
• Indexing
• Grouping
• Data importing and exporting
• Data movement and transformation
Below are several examples of CPU-bound workloads:
• Clustering/Classification
• Complex text mining
• Natural-language processing
• Feature extraction
Memory Selection
For some in-memory data processing engines like Db2 Warehouse, memory capacity and speed have a
larger impact on overall performance. For this reason, in-memory data processing workloads should be
configured with more memory.
Additional considerations for memory configuration include bandwidth and latency requirements.
Applications with high transactional memory usage should focus on DIMM configurations that are balanced
across the CPU memory controllers and their memory channels.
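A balanced DIMM configuration populates every memory channel of every processor with the same number of
DIMMs. The Python sketch below checks a candidate DIMM count under that simple rule. The helper name is
hypothetical, and the figure of six memory channels per processor is an assumption that should be confirmed
against the specification of the processor actually used.

```python
from typing import Optional


def dimms_per_channel(dimms: int, sockets: int, channels_per_socket: int) -> Optional[int]:
    """Return the DIMMs per channel if the population is balanced, otherwise None."""
    channels = sockets * channels_per_socket
    if dimms <= 0 or channels <= 0 or dimms % channels != 0:
        return None
    return dimms // channels


# Assumed platform: 2 sockets with 6 memory channels each (verify for your CPU).
for count in (8, 12, 16, 24):
    result = dimms_per_channel(count, sockets=2, channels_per_socket=6)
    status = f"balanced, {result} per channel" if result else "unbalanced"
    print(f"{count} DIMMs: {status}")
```

This check covers only even distribution across channels; it does not model DIMM rank, capacity mixing, or
vendor population rules.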
There are two types of storage consumed by containerized applications – ephemeral (non-persistent) and
persistent. As the names suggest, non-persistent storage is created and destroyed along with the container
and is only used by applications during their lifetime as a container. Hence, non-persistent storage is used for
temporary data. When implementing the OpenShift Container Platform, local disk space on the application
nodes can be configured and used for the non-persistent storage volumes.
Persistent storage, on the other hand, is used for data that needs to be persisted across container
instantiations. An example is a two- or three-tier application that has separate containers for the web and
business logic tier and for the database tier. The web and business logic tier can be scaled out using multiple
containers for high availability. The database in the database tier requires persistent storage that is not
destroyed with its container.
OpenShift uses a persistent volume framework that operates on two concepts: persistent volumes (PVs) and
persistent volume claims (PVCs). Persistent volumes are backed by the physical storage volumes that are
created and managed by the OpenShift cluster administrator. When an application container requires
persistent storage, it creates a persistent volume claim. The PVC is a unique handle that requests storage of a
given size and access mode; until it is bound, it does not refer to a specific physical volume. When a container
makes a PVC request, OpenShift allocates a matching volume and binds it to the PVC. If the container
relocates to another physical server in the cluster during its lifecycle, the PVC binding is maintained. When the
container is destroyed, the PVC and the volume bound to it are not destroyed unless you explicitly delete
them. After the PVC is deleted, the volume is released, and the reclaim policy of the persistent volume
determines whether the volume is retained or deleted.
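As an illustration of what a persistent volume claim contains, the following Python sketch builds a minimal
claim manifest and prints it as JSON. The field names follow the Kubernetes PersistentVolumeClaim object; the
helper function, the claim name "db-data", and the 10Gi size are hypothetical values chosen for this example.

```python
import json


def build_pvc(name: str, size: str, access_mode: str = "ReadWriteOnce") -> dict:
    """Build a minimal PersistentVolumeClaim manifest as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # How the volume may be mounted by the nodes that use it.
            "accessModes": [access_mode],
            # The amount of storage the claim asks the cluster to provide.
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical claim for the database tier of the example application.
claim = build_pvc("db-data", "10Gi")
print(json.dumps(claim, indent=2))
```

The claim states only the size and access mode it needs; which physical volume satisfies it is decided by the
cluster when the claim is bound.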
A variety of persistent storage options are available for OpenShift, including NFS, OpenStack Cinder,
Ceph RBD, iSCSI, Fibre Channel SAN, hyperconverged storage using Red Hat OpenShift Container Storage,
AWS Elastic Block Store (EBS), and others. For a complete list of these choices and the corresponding
requirements, see the link below:
access.redhat.com/documentation/en-us/openshift_container_platform/3.9/html-single/installation_and_configuration/#configuring-persistent-storage
Network Considerations
To support high availability in the data network, redundant switches should be specified for each tier of
switches in the cluster. Section 6.1 describes the data network topology with the 10Gb cross-rack redundant
switch pairs. These switch pairs should be configured for Virtual Link Aggregation Groups (vLAG) on Lenovo
switches (or LACP), which provides coherency between the pair so that traffic continues to flow when a
single switch drops out.
The two server Ethernet ports used for NIC bonding or NIC teaming must also be configured for LACP
(mode=4 or mode=802.3ad). This way, a single NIC, network cable, or switch can fail and the network
connection continues over the remaining link. The bonded NIC interface also provides twice the aggregate
bandwidth of a single port, or 20Gb/s in this configuration.
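The effect of a link failure on the bonded interface can be sketched with the short Python example below. It
is a simplified model that assumes two 10Gb member links, as described above; the function name is
hypothetical, and the result is a theoretical upper bound rather than a measured throughput.

```python
def bond_bandwidth_gbps(link_speed_gbps: float, links_up: int) -> float:
    """Return the aggregate bandwidth of a bond given its healthy member links."""
    if links_up < 0:
        raise ValueError("links_up must not be negative")
    return link_speed_gbps * links_up


# Two 10Gb member links: both healthy, one failed, both failed.
for links_up in (2, 1, 0):
    bandwidth = bond_bandwidth_gbps(10, links_up)
    print(f"{links_up} link(s) up: {bandwidth:.0f} Gb/s")
```

With LACP, traffic is distributed across member links per flow, so a single flow is still limited to the speed
of one link even when both links are healthy.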
8 Appendix: Lenovo Bill of materials
This appendix contains the bills of materials (BOMs) for different hardware configurations for Cloud Pak for
Data on OpenShift deployments. There are sections for servers, storage, and networking.
AUX0 ThinkSystem Package for SR630 1
AVWH ThinkSystem 550W RDN PSU Caution Label 1
AUWM Lenovo ThinkSystem 1U LP+LP BF Riser Dummy 1
AUWL Lenovo ThinkSystem 1U LP Riser Dummy 1
AUWF Lenovo ThinkSystem Super Cap Holder Dummy 1
B173 Companion Part for XClarity Controller Standard to Enterprise Upgrade in Factory 1
AUWG Lenovo ThinkSystem 1U VGA Filler 1
ASFE Notice for Advanced Format 512e Hard Disk Drives 1
5PS7A01504 Essential Service - 3Yr 24x7 4Hr Response + YourDrive YourData 1
5AS7A02045 Hardware Installation Server (Business Hours) 1
7S0FCTO1WW Red Hat Linux w/Lenovo Support 1
S0N6 RHEL Server Physical or Virtual Node, 2 Skt Standard Subscription w/Lenovo Support 3Yr 1
AUTY ThinkSystem 12-15 sequence Label for 24x2.5"Chassis 1
AUTU ThinkSystem 4-7 NVMe sequence Label for 16x2.5"and 24x2.5" 1
AVEQ ThinkSystem 8x1 2.5" HDD Filler 2
AUTA XCC Network Access Label 1
AVJ2 ThinkSystem 4R CPU HS Clip 2
AUSF Lenovo ThinkSystem 2U MS CPU Performance Heatsink 2
AUSG ThinkSystem SR650 6038 Fan module 1
B173 Companion Part for XClarity Controller Standard to Enterprise Upgrade in Factory 1
AUTJ ThinkSystem common Intel Label 1
B31F ThinkSystem M.2 480GB SSD Thermal Kit 1
AURS Lenovo ThinkSystem Memory Dummy 12
AUTQ ThinkSystem small Lenovo Label for 24x2.5"/12x3.5"/10x2.5" 1
AWF9 ThinkSystem Response time Service Label LI 1
AUSZ ThinkSystem SR650 Service Label LI 1
AVWK ThinkSystem EIA Plate with Lenovo Logo 1
AUTD ThinkSystem SR650 model number Label 1
AUT9 ThinkSystem 1600W RDN PSU Caution Label 1
AURT Lenovo ThinkSystem 2U 3FH Riser Dummy 1
AURF Lenovo ThinkSystem 2U 2FH Riser Dummy 1
AUSA Lenovo ThinkSystem M3.5" Screw for EIA 4
AUSU ThinkSystem Package for SR650 1
AUSH MS First 2U 8x2.5" HDD BP Cable Kit 1
AUSQ On Board to 2U 8x2.5" HDD BP NVME Cable 1
B0ML Feature Enable TPM on MB 1
A2HP Configuration ID 01 1
5374CM1 Configuration Instruction 1
AVE7 ThinkSystem 430-8i SAS/SATA 12Gb HBA placement 1
A2JX Controller 01 1
A2HP Configuration ID 01 1
5PS7A06897 Premier with Essential - 3Yr 24x7 4Hr Response + YourDrive YourData 1
5AS7A02045 Hardware Installation Server (Business Hours) 1
7S0FCTO1WW Red Hat Linux w/Lenovo Support 1
S0N6 RHEL Server Physical or Virtual Node, 2 Skt Standard Subscription w/Lenovo Support 3Yr 1
8.2 Networking BOM
Lenovo ThinkSystem NE1032 Switch
B4JL Lenovo ThinkSystem DE6000 Add Snapshot 2048 PFK 1
B4JH Lenovo ThinkSystem DE6000H Add Synch Mirroring PFK 1
B4JG Lenovo ThinkSystem DE6000H Add Asynch Mirroring PFK 1
5AS7A02067 Hardware Installation Storage (Business Hours) 1
5WS7A18251 Foundation- 3Y NBD ThinkSystem DM5000F AFA 1
5WS7A18257 Foundation- 3Y NBD DM5000F 46TB (12x 3.84TB SSD) Pack 1
5AS7A02067 Hardware Installation Storage (Business Hours) 1
Auto-Derived Part Items
AU16 0.5m External MiniSAS HD 8644/MiniSAS HD 8644 Cable 2
Resources
For more information, see the following resources:
Document history
Version 1.0 14 Oct 2019 First version
Trademarks and special notices
© Copyright Lenovo 2019.
Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other
countries, or both. These and other Lenovo trademarked terms are marked on their first occurrence in this
information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned
by Lenovo at the time this information was published. Such trademarks may also be registered or common
law trademarks in other countries. A current list of Lenovo trademarks is available from
https://www.lenovo.com/us/en/legal/copytrade/.
The following terms are trademarks of Lenovo in the United States, other countries, or both:
AnyBay™
Flex System™
Lenovo®
Lenovo XClarity™
RackSwitch™
Lenovo(logo)®
System x®
ThinkSystem™
TruDDR4™
The following terms are trademarks of other companies:
Intel, Xeon, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries
in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
References in this document to Lenovo products or services do not imply that Lenovo intends to make them
available in every country.
Information is provided "AS IS" without warranty of any kind.
All customer examples described are presented as illustrations of how those customers have used Lenovo
products and the results they may have achieved. Actual environmental costs and performance
characteristics may vary by customer.
Information concerning non-Lenovo products was obtained from a supplier of these products, published
announcement material, or other publicly available sources and does not constitute an endorsement of such
products by Lenovo. Sources for non-Lenovo list prices and performance numbers are taken from publicly
available information, including vendor announcements and vendor worldwide homepages. Lenovo has not
tested these products and cannot confirm the accuracy of performance, capability, or any other claims related
to non-Lenovo products. Questions on the capability of non-Lenovo products should be addressed to the
supplier of those products.
All statements regarding Lenovo future direction and intent are subject to change or withdrawal without notice,
and represent goals and objectives only. Contact your local Lenovo office or Lenovo authorized reseller for the
full text of the specific Statement of Direction.
Some information addresses anticipated future capabilities. Such information is not intended as a definitive
statement of a commitment to specific levels of performance, function or delivery schedules with respect to
any future products. Such commitments are only made in Lenovo product announcements. The information is
presented here to communicate Lenovo’s current investment and development activities as a good faith effort
to help with our customers' future planning.
Performance is based on measurements and projections using standard Lenovo benchmarks in a controlled
environment. The actual throughput or performance that any user will experience will vary depending upon
considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the
storage configuration, and the workload processed. Therefore, no assurance can be given that an individual
user will achieve throughput or performance improvements equivalent to the ratios stated here.
Photographs shown are of engineering prototypes. Changes may be incorporated in production models.
Any references in this information to non-Lenovo websites are provided for convenience only and do not in
any manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this Lenovo product and use of those websites is at your own risk.