Information Technology Infrastructure IT602
• MTTR can be kept low by:
  • Having a service contract with the supplier
  • Having spare parts on-site
  • Automated redundancy and failover
• But beware of SPOFs!
• Calculate availability: $A = 1 - (1 - A_1)^n$
• Total availability = $1 - (1 - 0.99)^2$ = 99.99%
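The availability formula above can be checked with a short calculation. This is a minimal sketch, assuming n identical and independent components in parallel; the function name `parallel_availability` is a hypothetical helper, not something from the course material.

```python
def parallel_availability(component_availability: float, n: int) -> float:
    """Availability of n identical, independent components in parallel.

    Implements A = 1 - (1 - A1)^n: the system is only down when all
    n components are down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - component_availability) ** n


if __name__ == "__main__":
    # Two components of 99% each, as in the slide example
    for n in (1, 2, 3):
        print(f"n={n}: {parallel_availability(0.99, n):.4%}")
```

With two components of 99% each, this reproduces the 99.99% figure from the slide. The independence assumption matters: a shared SPOF (for example a common power feed) would make the real availability lower than the formula suggests.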
• Steps to complete repairs:
  • Notification of the fault (time before seeing an alarm message)
  • Processing the alarm
  • Finding the root cause of the error
  • Looking up repair information
  • Getting spare components from storage
Sources of Unavailability - Human Errors
• 80% of outages impacting mission-critical services are caused by people and process issues
• Examples:
  • Performing a test in the production environment
  • Switching off the wrong component for repair
  • Swapping a good working disk in a RAID set instead of the defective one
  • Restoring the wrong backup tape to production
  • Accidentally removing files
    • Mail folders, configuration files
  • Accidentally removing database entries
    • Drop table x instead of drop table y
• Tape drives contain very sensitive pieces of mechanics that can break easily
Sources of Unavailability - Bathtub Curve
• A component failure is most likely when the component is new
• Sometimes a component doesn't even work at all when unpacked for the first time. This is called a DOA component (Dead On Arrival).
• When a component still works after the first month, it is likely that it will continue working without failure until the end of its life
Sources of Unavailability - Software Bugs
• Because of the complexity of the software, it is nearly impossible (and very costly) to create bug-free software
• Application software bugs can stop an entire system
• Operating systems are software too
• Operating systems containing bugs can lead to
  • corrupted file systems,
  • network failures, or
  • other sources of unavailability
Sources of Unavailability - Environmental Issues
• Environmental issues can cause downtime. Issues with
  • Power
  • Cooling
  • External factors like:
    • Disasters
    • Fire
    • Earthquakes
    • Flooding
Sources of Unavailability - Planned Maintenance
• Sometimes needed to perform systems management tasks:
  • Upgrading hardware or software
  • Implementing software changes
  • Migrating data
  • Creation of backups
• During planned maintenance the system is more vulnerable to downtime than under normal circumstances
  • A temporary SPOF could be introduced
  • Systems managers could make mistakes
Sources of Unavailability - Physical Defects
• Everything breaks down eventually
• Mechanical parts are most likely to break first
• Examples:
  • Fans for cooling equipment usually break because of dust in the bearings
  • Disk drives contain moving parts
  • Tapes are very vulnerable to defects as the tape is spun on and off the reels all the time
Sources of Unavailability - Complexity of the Infrastructure
• Adding more components to an overall system design can undermine high availability
  • Even if the extra components are implemented to achieve high availability
• Complex systems
  • Have more potential points of failure
  • Are more difficult to implement correctly
  • Are harder to manage
• Sometimes it is better to just have an extra spare system in the closet than to use complex redundant systems
Availability Patterns
• A single point of failure (SPOF) is a component in the infrastructure that, if it fails, causes downtime to the entire system.
• SPOFs should be avoided in IT infrastructures as they pose a large risk to the availability of a system.
• We just need to know what is shared and if the risk of sharing is acceptable.
• To eliminate SPOFs, a combination of redundancy, failover, and fallback can be used.
Redundancy
• Redundancy is the duplication of critical components in a single system, to avoid a single point of failure (SPOF)
• Examples:
  • A single component having two power supplies; if one fails, the other takes over
  • Dual networking interfaces
  • Redundant cabling
Failover
• Failover is the (semi)automatic switch-over to a standby system or component
• Examples:
  • Windows Server failover clustering
  • VMware High Availability
  • Oracle Real Application Cluster (RAC) database
Fallback
• Fallback is the manual switchover to an identical standby computer system in a different location
• Typically used for disaster recovery
• Three basic forms of fallback solutions:
  • Hot site
  • Cold site
  • Warm site
Business Continuity
• In case of a disaster, the infrastructure could become unavailable, in some cases for a longer period of time.
• Business continuity is about identifying threats an organization faces and providing an effective response.
• To handle the effect of disasters, the following processes are used:
  • Business Continuity Management (BCM) and
  • Disaster Recovery Planning (DRP)
• An IT disaster is defined as an irreparable problem in a datacenter, making the datacenter unusable
WEEK#3
Performance Concepts
• Performance is a typical hygiene factor
• Nobody notices a highly performing system
• But when a system is not performing well enough, users quickly start complaining
Perceived Performance
• Perceived performance refers to how quickly a system appears to perform its task
• In general, people tend to overestimate their own patience
• People tend to value predictability in performance
  • When the performance of a system is fluctuating, users remember a bad experience
  • Even if the fluctuation is relatively rare
• Inform the user about how long a task will take
  • Progress bars
  • Splash screens
Performance during Infrastructure Design
• A solution must be designed, implemented, and supported to meet the performance requirements
  • Even under increasing load
• Calculating performance of a system in the design phase is:
  • Extremely difficult
  • Very unreliable
• Performance must be considered:
  • When the system works as expected
  • When the system is in a special state, like:
    • Failing parts
    • Maintenance state
    • Performing backup
    • Running batch jobs
• Some ways to do this are:
  • Benchmarking
  • Using vendor experience
  • Prototyping and User Profiling
Benchmarking
• A benchmark uses a specific test program to assess the relative performance of an infrastructure component
• Benchmarks compare:
  • Performance of various subsystems
  • Across different system architectures
• CPU benchmarking is the practice of determining how a processor will perform in a standardized way. This is typically done using special software packages. Some popular benchmarking packages include Whetstone, Dhrystone, 3DMark, PCMark and others.
• Benchmarks comparing the raw speed of parts of an infrastructure
  • Like the speed difference between processors or between disk drives
  • Not taking into account the typical usage of such components
  • Examples:
    • Floating Point Operations Per Second (FLOPS)
    • Million Instructions Per Second (MIPS) of a CPU
Prototyping
• Also known as proof of concept (PoC)
• Prototypes measure the performance of a system at an early stage
• Building prototypes:
  • Hiring equipment from suppliers
  • Using data centre capacity at a vendor’s premise
  • Using cloud computing resources
• Focus on those parts of the system that pose the highest risk, as early as possible in the design process
Vendor Experience
• The best way to determine the performance of a system in the design phase: use the experience of vendors
• They have a lot of experience running their products in various infrastructure configurations
• Vendors can provide:
  • Tools
  • Figures
  • Best practices
User Profiling
• Predict the load a new software system will pose on the infrastructure before the software is actually built
• Get a good indication of the expected usage of the system
• Steps:
  • Define a number of typical user groups (personas)
  • Create a list of tasks personas will perform on the new system
  • Decompose tasks to infrastructure actions
  • Estimate the load per infrastructure action
  • Calculate the total load
Performance of a Running System
Managing Bottlenecks
• The performance of a system is based on:
  • The performance of all its components
  • The interoperability of various components
• A component causing the system to reach some limit is referred to as the bottleneck of the system
• Every system has at least one bottleneck that limits its performance
• If the bottleneck does not negatively influence performance of the complete system under the highest expected load, it is OK
Performance Testing
• Load testing - shows how a system performs under the expected load
• Stress testing - shows how a system reacts when it is under extreme load
• Endurance testing - shows how a system behaves when it is used at the expected load for a long period of time
Performance Testing - Breakpoint
• Ramp up the load
  • Start with a small number of virtual users
  • Increase the number over a period of time
• The test result shows how the performance varies with the load, given as number of users versus response time.
Performance Testing
• Performance testing software typically uses:
  • One or more servers to act as injectors
    • Each emulating a number of users
    • Each running a sequence of interactions
  • A test conductor
    • Coordinating tasks
    • Gathering metrics from each of the injectors
    • Collecting performance data for reporting purposes
• Performance testing should be done in a production-like environment
  • Performance tests in a development environment usually lead to results that are highly unreliable
  • Even when underpowered test systems perform well enough to get good test results, the faster production system could show performance issues that did not occur in the tests
• To reduce cost:
  • Use a temporary (hired) test environment
Performance Patterns
Increasing Performance on Upper Layers
• 80% of the performance issues are due to badly behaving applications
• Application performance can benefit from:
  • Database and application tuning
  • Prioritizing tasks
  • Working from memory as much as possible (as opposed to working with data on disk)
  • Making good use of queues and schedulers
• Typically more effective than adding compute power
Disk Caching
• Disks are mechanical devices that are slow by nature
• Caching can be implemented in:
  • Disks
  • Disk controllers
  • Operating system
• Cache memory:
  • Stores all data recently read from disk
  • Stores some of the disk blocks following the recently read disk blocks
Caching
• Time it takes to fetch 1 MB of data (ms):
  Network, 1 Gbit/s                       675
  Hard disk, 15k rpm, 4 KB disk blocks    105
  Main memory DDR3 RAM                    0.2
  CPU L1 cache                            0.016
Web Proxies
• When users browse the internet, data can be cached in a web proxy server
  • A web proxy server is a type of cache
  • Earlier accessed data can be fetched from cache, instead of from the internet
• Benefits:
  • Users get their data faster
  • All other users are provided more bandwidth to the internet, as the data does not have to be downloaded again
Grid Computing
• A computer grid is a high performance cluster that consists of systems that are spread geographically
• The limited bandwidth is the bottleneck
• Examples:
  • SETI@HOME
  • CERN LHC Computing Grid (140 computing centers in 35 countries)
• Broker firms exist for commercial exploitation of grids
• Security is a concern when computers in the grid are not under control
Capacity Management
• Capacity management guarantees high performance of a system in the long term
• To ensure performance stays within acceptable limits, performance must be monitored
• Trend analyses can be used to predict performance degradation
• Anticipate business changes (like forthcoming marketing campaigns)
WEEK#4
Security
• Security is the combination of:
  • Availability
  • Confidentiality
  • Integrity
• Focused on the recognition and resistance of attacks
• For IT infrastructures availability is a non-functional attribute in its own right
Computer Crimes
• Reasons for committing crime against IT infrastructures:
  • Personal exposure and prestige
  • Creating damage
  • Financial gain
  • Terrorism
  • Warfare
Personal Exposure and Prestige
• In the past, the hacker community was very keen on getting personal or group exposure by hacking into a secured IT infrastructure. When hackers proved that they could enter a secured system and made it public, they gained respect from other hackers.
• While nowadays most hacking activity is done for other reasons, there are still large communities of hackers that enjoy the game.
Creating Damage
• Creating damage to organizations to create bad publicity
• For instance, by defacing websites, bringing down systems or websites, or making internal documents public
Financial Gain
• For instance, by holding data hostage and asking for ransom money, stealing credit card data, changing account data in bank systems
• Stealing passwords of customers and ordering goods on their behalf
Terrorism
• The main purpose of terrorism is creating fear in a society
• A well-planned attack targeted at certain computer systems, like the computer system that manages the water supply or a nuclear power plant, could result in chaos and fear amongst citizens
Warfare
• Certain governments use hacking practices as acts of war
• Since economies and societies today largely depend on IT infrastructures, bringing important IT systems down in a certain country could cause the economy to collapse.
• Bringing down the internet access of a country for example means: no access to social media, no e-mails, no web shops, no stock trading, no search engines, etc.
Risk management
• Managing security is all about managing risks
• The effort we put in securing the infrastructure should be directly related to the risk at hand
• Risk management is the process of:
  • Determining an acceptable level of risk
  • Assessing the current level of risk
  • Taking steps to reduce risk to the acceptable level
  • Maintaining that level
Risk list
• A risk list can be used to quantify risks
• Risk is calculated based on:
  • Asset name - component that needs to be protected
  • Vulnerability - weakness, process or physical exposure that makes the asset susceptible to exploits
  • Exploit - a way to use one or more vulnerabilities to attack an asset
  • Probability - an estimation of the likelihood of the occurrence of an exploit
  • Impact - the severity of the damage when the vulnerability is exploited
[Table: Example of Part of a Risk List]
Risk Response
• Controls can be designed and implemented based on identified severity of the risk in the risk list.
• There are four risk responses:
  • Acceptance of the risk
  • Avoidance of the risk - do not perform actions that impose risk
  • Transfer of the risk - for instance transfer the risk to an insurance company
  • Mitigation of the risk and accepting the residual risk
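The risk list described above can be sketched as a small data structure. This is an illustration only: the slides do not give a scoring formula, so the score used here (probability times impact, both on an assumed 1 to 5 scale) is an assumption, and the class name `RiskEntry`, the function `rank_risks`, and the example entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RiskEntry:
    asset: str          # component that needs to be protected
    vulnerability: str  # weakness that makes the asset susceptible
    exploit: str        # way to use the vulnerability to attack the asset
    probability: int    # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int         # assumed scale: 1 (negligible) to 5 (severe)

    @property
    def score(self) -> int:
        # Assumption: risk score = probability x impact
        return self.probability * self.impact


def rank_risks(entries):
    """Return the entries sorted from highest to lowest risk score."""
    return sorted(entries, key=lambda e: e.score, reverse=True)


# Hypothetical entries, for illustration only
risk_list = [
    RiskEntry("Web server", "Unpatched software", "Remote code execution", 2, 5),
    RiskEntry("Laptop", "Unencrypted disk", "Theft of device", 3, 4),
    RiskEntry("Backup tapes", "Stored unencrypted off-site", "Tape gets into the wrong hands", 1, 4),
]

if __name__ == "__main__":
    for entry in rank_risks(risk_list):
        print(f"{entry.score:>2}  {entry.asset}: {entry.exploit}")
```

Ranking the entries by score gives a starting point for deciding which risks need controls first. Which of the four risk responses fits each entry remains a judgment call that the score alone does not make.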
• Information can be stolen in many ways
• Examples:
  • Key loggers can send sensitive information like passwords to third parties
  • Network sniffers can show network packets that contain sensitive information or replay a logon sequence
  • Data on backup tapes outside of the building can get into the wrong hands
  • Disposed PCs or disks can get into the wrong hands
  • Corrupt or dissatisfied staff can copy information
  • End users are led to a malicious website that steals information (phishing)
Security Controls
CIA
• Three core goals of security (CIA):
  • Confidentiality - prevents the intentional or unintentional unauthorized disclosure of data
  • Integrity - ensures that:
    • No modifications to data are made by unauthorized staff or processes
    • Unauthorized modifications to data are not made by authorized staff or processes
    • Data is consistent
  • Availability - ensures the reliable and timely access to data or IT resources
• Example of confidentiality levels
  Confidentiality Level   Description
  1                       Public information
  2                       Information for internal use only
  3                       Information for internal use by restricted group
  4                       Secret: reputational damage if information is made public
  5                       Top secret: damage to organization or society if information is made public
• Example of integrity levels
  Integrity Level         Description
  1                       Integrity of information is of no importance
  2                       Errors in information are allowed
  3                       Only incidental errors in information are allowed
  4                       No errors are allowed, leads to reputational damage
  5                       No errors are allowed, leads to damage to organization or society
• Example of availability levels
  Availability Level      Description
  1                       No requirements on availability
  2                       Some unavailability is allowed during office hours
  3                       Some unavailability is allowed only outside of office hours
  4                       No unavailability is allowed, 24/7/365 availability, risk for reputational damage
  5                       No unavailability is allowed, risk for damage to organization or society
Security Controls
• Controls mitigate risks
• Security controls must address at least one of the CIA goals
• Information can be classified based on CIA levels
• Controls can be designed and implemented based on the identified risk level for CIA
Attack Vectors
• Malicious code
  • Applications that, when activated, can cause network and server overload, steal data and passwords, or erase data
• Worms
  • Self-replicating programs that spread from one computer to another, leaving infections as they travel
• Virus