Information Technology Infrastructure IT602

Information Technology Infrastructure • Security
Various Kinds of Architecture • Performnace

 Business architecture • Recoverability
 Enterprise architecture • Testtability
 Data architecture, • Scalability
 Application architecture and Handling Conflicting NFRs
 Infrastructure architecture It is unusual to encounter conflicting NFRs for instance users may want a system
Is there a General Definition of IT Infrastructure? that is secure but not want to be bothered by passwords
No generally accepted definition of IT infrastructure seems to exist • It is the task of the infrastructure architect to balance these NFRs, in some
In literature, many definitions of IT infrastructure are described. Some of them are: cases some NFRs may take priority over others and the architect must
IT Infrastructure involve the relevant stakeholders
IT infrastructure consists of the equipment, systems, software, and services used in AVAILABILITY
common across an organization, regardless of mission/program/project. IT • Everyone expects their infrastructure to be available all the time
Infrastructure also serves as the foundation upon which mission/program/project- • A 100% guaranteed availability of an infrastructure is impossible
specific systems and capabilities are built. • A fact of life
What infrastructure comprises dependents on: • There is always a chance of downtime.
 Who you ask Calculationon of Aavailability
 What their point of view is • Availability can neither be calculated, nor guaranteed upfront
• It can only be reported on afterwards, when a system has run for
some years
• Over the years, much knowledge and experience is gained on how to
design high available systems
• Failover
• Redundancy
• Structured programming
For most people, infrastructure is invisible and taken for granted • Avoiding Single Points of Failures (SPOFs)
Introduction to Non-Functional Attributes • Implementing systems management
• IT infrastructure provides services to applications • The availability of a system is usually expressed as a percentage of uptime
• Many of these services can be defined as functions such as in a given time period, usually one year or one month
• Disk space, • Example for downtime expressed as a percentage per year
• Processing, Downtime Downtime Downtime
Availability %
• Connectivity per year per month per week
99.8% 17.5 hours 86.2 minutes 20.2 minutes
However most of these services are non functional in nature
99.9% ("three nines") 8.8 hours 43.2 minutes 10.1 minutes
Non-Functional Attributes 52.6 minutes 4.3 minutes 1.0 minutes
99.99% ("four nines")
Non functional attributes describe the qualitative behavior of the system rather than 99.999% ("five nines") 5.3 minutes 25.9 seconds 6.1 seconds
its specific functionality and these include
• Availability
Typical requirements used in service level agreements today are 99.8% or 99.9%  Having technician come to the datacenter with the spare component
availability per month for a full IT system  Physically repairing the fault
• The availability of the infrastructure must be much higher  Restarting and testing the component
 Typically in the range of 99.99% or higher • Calculation Examples
• 99.999% uptime is also known as carrier grade availability
• For one component, higher availability levels for a complete
Component MTBF (h) MTTR (h) Availability in %
system are very uncommon, as they are almost impossible
Power supply 100,000 8 0.9999200 99.99200
to reach
Fan 100,000 8 0.9999200 99.99200
• It is a good practice to agree on the maximum frequency of unavailability 300,000 8 0.9999733 99.99733
System board
MTBF and MTTR 1,000,000 8 0,9999920 99.99920
Memory
• Mean Time Between Failures (MTBF)
CPU 500,000 8 0.9999840 99.99840
 The average time that passes between failures
Network
• Mean Time To Repair (MTTR)
Interface 250,000 8 0.9999680 99.99680
 The time it takes to recover from a failure Controller (NIC)
Serial components: One defect leads to downtime

• Some components have higher MTBF than others
• Some typical MTB’s:
Component MTBF (hours) • Example: the above system’s availability is:
Hard disk 750,000 0.9999200 × 0.9999200 × 0.9999733 × 0.9999920 × 0.9999840
Power supply 100,000 × 0.9999680 = 0.99977 = 𝟗𝟗. 𝟗𝟕𝟕%
Fan 100,000 (each components’ availability is at least 99.99%)
Ethernet Network Switch 350,000 Parallel components: One defect: no downtime!
RAM 1,000,000
MTTR
• MTTR can be kept low by: • But beware of SPOFs!
 Having a service contract with the supplier • Calculate availability:
 Having spare parts on-site A = 1 − (1 − A1 )n
 Automated redundancy and failover • Total availability = 1 − (1 − 0.99)2 = 99.99%
• Steps to complete repairs:
 Notification of the fault (time before seeing an alarm message) Sources of Unavailability - Human Errors
 Processing the alarm • 80% of outages impacting mission-critical services is caused by people and
 Finding the root cause of the error process issues
 Looking up repair information • Examples:
 Getting spare components from storage • Performing a test in the production environment
• Switching off the wrong component for repair • Tape drives contain very sensitive pieces of mechanics that can
• Swapping a good working disk in a RAID set instead of the break easily
defective one • Sources of Unavailability - Bathtub Curve
• Restoring the wrong backup tape to production • A component failure is most likely when the component is new
• Accidentally removing files • Sometimes a component doesn't even work at all when unpacked for the
• Mail folders, configuration files first time. This is called a DOA component–Dead On Arrival.
• Accidentally removing database entries • When a component still works after the first month, it is likely that it will
• Drop table x instead of drop table y continue working without
Sources of Unavailability - Software Bugs failure until the end of its life
• Because of the complexity of the software, it is nearly impossible (and very Sources of Unavailability - Environmental Issues
costly) to create bug-free software • Environmental issues can cause downtime. Issues with
• Application software bugs can stop an entire system • Power
• Operating systems are software too • Cooling
• Operating systems containing bugs can lead to • External factors like:
• corrupted file systems, • Disasters
• network failures, or • Fire
• other sources of unavailability • Earthquakes
Sources of Unavailability - Planned Maintenance • Flooding
• Sometimes needed to perform systems management tasks: • Sources of Unavailability - Complexity of the
• Upgrading hardware or software Infrastructure
• Implementing software changes • Adding more components to an overall system design can undermine high
• Migrating data availability
• Creation of backups • Even if the extra components are implemented to achieve high
• During planned maintenance the system is more vulnerable to downtime availability
than under normal circumstances • Complex systems
• A temporary SPOF could be introduced • Have more potential points of failure
• Systems managers could make mistakes • Are more difficult to implement correctly
• Sources of Unavailability - Physical Defects • Are harder to manage
• Everything breaks down eventually • Sometimes it is better to just have an extra spare system in the closet than to
• Mechanical parts are most likely to break first use complex redundant systems
• Examples: • Availability Patterns
• Fans for cooling equipment usually break because of dust in the • A single point of failure (SPOF) is a component in the infrastructure that, if
bearings it fails, causes downtime to the entire system.
• Disk drives contain moving parts • SPOFs should be avoided in IT infrastructures as they pose a large risk to
• Tapes are very vulnerable to defects as the tape is spun on and off the availability of a system.
the reels all the time • We just need to know what is shared and if the risk of sharing is acceptable.
• To eliminate SPOFs, a combination of redundancy, failover, and fallback WEEK#3
can be used. Performance Concepts
• Redundancy • Performance is a typical
• Redundancy is the duplication of critical components in a single system, to hygiene factor
avoid a single point of failure (SPOF) • Nobody notices a highly
• Examples: performing system
• A single component having two power supplies; if one fails, the other • But when a system is not
takes over performing well enough,
• Dual networking interfaces users quickly start complaining
• Redundant cabling Perceived Performance
• Failover • Perceived performance refers to how quickly a system appears to perform
• Failover is the (semi)automatic switch-over to a standby system or its task
component • In general, people tend to overestimate their own patience
• Examples: • People tend to value predictability in performance
• Windows Server failover clustering • When the performance of a system is fluctuating, users remember a
• VMware High Availability bad experience
• Oracle Real Application Cluster (RAC) database • Even if the fluctuation is relatively rare
• Fallback • Inform the user about how long a task will take
• Fallback is the mannual switchover to an identical standby computer system • Progress bars
in a different location • Splash screens
• Typically used for disaster recovery Performance during Infrastructure Design
• Three basic forms of fallback solutions: • A solution must be designed, implemented, and supported to meet the
• Hot site performance requirements
• Cold site Even under increasing load
• Warm site • Calculating performance of a system in the design phase is:
• Business Continuity Extremely difficult
• In case of a disaster, the infrastructure could become unavailable, in some Very unreliable
cases for a longer period of time. • Performance must be considered:
• Business continuity is about identifying threats an organization faces and  When the system works as expected
providing an effective response.  When the system is in a special state, like:
• To handle the effect of disasters, following processes are Failing parts
• Business Continuity Management (BCM) and Maintenance state
• Disaster Recovery Planning (DRP) Performing backup
• Business Continuity Running batch jobs
• An IT disaster is defined as an irreparable problem in a datacenter, making • Some ways to do this are:
the datacenter unusable • Benchmarking
• Using vendor experience
• Prototyping and User Profiling • Predict the load a new software system will pose on the infrastructure before
• Benchmarking the software is actually built
• A benchmark uses a specific test program to assess the relative performance • Get a good indication of the expected usage of the system
of an infrastructure component • Steps:
• Benchmarks compare: • Define a number of typical user groups (personas)
• Performance of various subsystems • Create a list of tasks personas will perform on the new system
• Across different system architectures • Decompose tasks to infrastructure actions
• CPU benchmarking is the practice of determining how a processor will • Estimate the load per infrastructure action
perform in a standardized way. This is typically done using special software • Calculate the total load
packages. Some popular benchmarking packages include Whetstone, • Performance of a Running System
Dhrystone, 3DMark, PCMark and others. • Managing Bottlenecks
• Benchmarks comparing the raw speed of parts of an infrastructure • The performance of a system is based on:
• Like the speed difference between processors or between disk drives • The performance of all its components
• Not taking into account the typical usage of such components • The interoperability of various components
• Examples: • A component causing the system to reach some limit is referred to as the
• Floating Point Operations Per Second – FLOPS bottleneck of the system
• Million Instructions Per Second – MIPS of a CPU • Every system has at least one bottleneck that limits its performance
• Prototyping • If the bottleneck does not negatively influence performance of the complete
• Also known as proof of concept (PoC) system under the highest expected load, it is OK
• Prototypes measure the performance of a system at an early stage • Performance Testing
• Building prototypes: • Load testing - shows how a system performs under the expected load
• Hiring equipment from suppliers • Stress testing - shows how a system reacts when it is under extreme load
• Using data centre capacity at a vendor’s premise • Endurance testing - shows how a system behaves when it is used at the
• Using cloud computing resources expected load for a long period of time
• Focus on those parts of the system that pose the highest risk, as early as • Performance Testing - Breakpoint
possible in the design process • Ramp up the load
• Vendor Experience • Start with a small number of virtual users
• The best way to determine the performance of a system in the design phase: • Increase the number over a period of time
use the experience of vendors • The test result shows how the performance varies with the load, given as
• They have a lot of experience running their products in various infrastructure number of users versus response time.
configurations • Performance Testing
• Vendors can provide: • Performance testing software typically uses:
• Tools • One or more servers to act as injectors
• Figures • Each emulating a number of users
• Best practices • Each running a sequence of interactions
• User Profiling • A test conductor
• Coordinating tasks
• Gathering metrics from each of the injectors • Earlier accessed data can be fetched from cache, instead of from the
• Collecting performance data for reporting purposes internet
• Performance testing should be done in a production-like environment • Benefits:
• Performance tests in a development environment usually lead to • Users get their data faster
results that are highly unreliable • All other users are provided more bandwidth to the internet, as the
• Even when underpowered test systems perform well enough to get data does not have to be downloaded again
good test results, the faster production system could show • Grid Computing
performance issues that did not occur in the tests • A computer grid is a high performance cluster that consists of systems that
• To reduce cost: are spread geographically
• Use a temporary (hired) test environment • The limited bandwidth is the bottleneck
• Performance Patterns • Examples:
• Increasing Performance on Upper Layers • SETI@HOME
• 80% of the performance issues are due to badly behaving applications • CERN LHC Computing Grid (140 computing centers in 35
• Application performance can benefit from: countries)
• Database and application tuning • Broker firms exist for commercial exploitation of grids
• Prioritizing tasks • Security is a concern when computers in the grid are not under control
• Working from memory as much as possible (as opposed to working • Capacity Management
with data on disk) • Capacity management guarantees high performance of a system in the long
• Making good use of queues and schedulers term
• Typically more effective than adding compute power • To ensure performance stays within acceptable limits, performance must be
• Disk Caching monitored
• Disks are mechanical devices that are slow by nature • Trend analyses can be used to predict performance degradation
• Caching can be implemented i: • Anticipate on business changes (like forthcoming marketing campaigns)
• Disks WEEK#4
• Disk controllers Security
• Operating system • Security is the combination of:
• Cache memory: • Availability
• Stores all data recently read from disk • Confidentiality
• Stores some of the disk blocks following the recently read disk • Integrity
blocks Time it takes to fetch 1 MB of • Focused on the recognition and resistance of attacks
Component
• Caching data (ms) • For IT infrastructures availability is a non-functional attribute in its own
Network, 1 Gbit/s 675 right
Hard disk, 15k rpm, 4 KB disk blocks 105 Computer Crimes
Main memory DDR3 RAM 0.2 • Reasons for committing crime against IT infrastructures:
Web Proxies CPU L1 cache 0.016 • Personal exposure and prestige
• When users browse the internet, data can be cached in a web proxy server • Creating damage
• A web proxy server is a type of cache • Financial gain
• Terrorism  Determining an acceptable level of risk
• Warfare  Assessing the current level of risk
Personal Exposure and Prestige  Taking steps to reduce risk to the acceptable level
• In the past, the hacker community was very keen on getting personal or  Maintaining that level
group exposure by hacking into a secured IT infrastructure. When hackers Risk list
proved that they could enter a secured system and made it public, they gained A risk list can be used to quantify risks
respect from other hackers. Risk is calculated based on:
• While nowadays most hacking activity is done for other reasons, there are Asset name - component that needs to be protected
still large communities of hackers that enjoy the game. Vulnerability - weakness, process or physical exposure that makes the asset
• Creating Damage susceptible to exploits
• Creating damage to organizations to create bad publicity Exploit - a way to use one or more vulnerabilities to attack an asset
• For instance, by defacing websites, bringing down systems or websites, or Probability - an estimation of the likelihood of the occurrence of an exploit
• making internal documents public Impact - the severity of the damage when the vulnerability is exploited
Financial Gain Example of Part of a Risk List
• For instance, by holding data hostage and asking for ransom money, stealing
credit card data, changing account data in bank systems
OR
• Stealing passwords of customers and ordering goods on their behalf
Terrorism
• The main purpose of terrorism is creating fear in a society
• A well-planned attack targeted at certain computer systems, like the
• Computer system that manages the water supply
• or
• A nuclear power plant, could result in chaos and fear amongst citizens
Warfare
• Certain governments use hacking practices as acts of war
• Since economies and societies today largely depend on the IT
Risk Response
infrastructures, bringing important IT systems down in a certain country
• Controls can be designed and implemented based on identified severity of
could cause the economy to collapse.
the risk in the risk list.
• Bringing down the internet access of a country for example means: no access
• There four risk responses:
to social media, no e-mails, no web shops, no stock trading, no search
• Acceptance of the risk
engines, etc.
• Avoidance of the risk - do not perform actions that impose risk
Risk management
• Transfer of the risk - for instance transfer the risk to an insurance
 Managing security is all about managing risks
company
 The effort we put in securing the infrastructure should be directly related to
• Mitigation of the risk and accepting the residual risk
the risk at hand
Exploits
 Risk management is the process of:
• Information can be stolen in many ways Integrity Level Description
• Examples: 1 Integrity of information is of no importance
• Key loggers can send sensitive information like passwords to third 2 Errors in information are allowed
parties Only incidental errors in information are
3
• Network sniffers can show network packages that contain sensitive allowed
No errors are allowed, leads to reputational
information or replay a logon sequence 4
damage
• Data on backup tapes outside of the building can get into No errors are allowed, leads to damage to
wrong hands 5
organization or society
• Disposed PCs or disks can get into the wrong hands
• Corrupt or dissatisfied staff can copy information • Example of availability levels
• End users are led to a malicious website that steals
information (phishing) Availability Level Description
• Security Controls 1 No requirements on availability
CIA Some unavailability is allowed during office
2
• Three core goals of security (CIA): hours
Some unavailability is allowed only outside of
 Confidentiality 3
office hours
 Integrity
No unavailability is allowed, 24/7/365
 Availability 4
availability, risk for reputational damage
• Confidentiality - prevents the intentional or unintentional unauthorized No unavailability is allowed risk for damage to
5
disclosure of data organization or society
• Integrity - ensures that:
 No modifications to data are made by unauthorized staff or processes Security Controls
 Unauthorized modifications to data are not made by authorized staff or • Controls mitigate risks
processes • Security controls must address at least one of the CIA
 Data is consistent • Information can be classified based on CIA levels
• Availability - ensures the reliable and timely access to data or IT resources • Controls can be designed and implemented based on the identified risk level
• Example of confidentiality levels for CIA
Confidentiality Level Description • Attack Vectors
1 Public information • Malicious code
2 Information for internal use only • Applications that, when activated, can cause network and server
Information for internal use by restricted overload, steal data and passwords, or erase data
3
group • Worms
Secret: reputational damage if information is • Self-replicating programs that spread from one computer to another,
4
made public
leaving infections as they travel
Top secret: damage to organization or society
5
if information is made public • Virus
• Example of integrity levels

• Self-replicating program fragment that attaches itself to a program or • Permissions are granted related to the identity and the groups it
file enabling it to spread from one computer to another, leaving belongs to: authorization
infections as it travels • Layered Security
• Trojan Horse • Layered security (also known as a Defense-In-Depth strategy) implements
• Appears to be useful software but will actually do damage once various security measures in various parts of the IT infrastructure
installed or run on your computer • Instead of having one big firewall and have all your security depend
• Denial of service attack on it, it is better to implement several layers of security
• An attempt to overload an infrastructure to cause disruption of a • Preferably security layers make use of different technologies
service • This makes it harder for hackers to break through all barriers, as they
• Can lead to downtime of a system, disabling an organization to do its will need specific knowledge for each step
business • Disadvantage: increases the complexity of the system
• In a Distributed Denial of Service (DDoS) attack the attacker uses • Cryptography
many computers to overload the server • The practice of hiding information using encryption and decryption
• Groups of computers that are infected by malicious code, called techniques
botnets, perform an attack • Encryption is the conversion of information from a readable state to
• Preventive DDoS measures: apparent random data
• Split business and public resources • Only the receiver has the ability to decrypt this data, transforming it back to
• Move all public facing resources to an external cloud provider the original information
• Setup automatic scalability (auto scaling, auto deployment) using • A cipher is a pair of algorithms that implements the encryption and
virtualization and cloud technology decryption process. The operation of a cipher is controlled by a key
• Limit bandwidth for certain traffic • Block ciphers
• Lower the Time to Live (TTL) of the DNS records to be able to • Input:
reroute traffic to other servers when an attack occurs • A block of plaintext
• Setup monitoring for early detection • A key
• Phishing • Output:
• A technique of obtaining sensitive information • A block of cipher text
• The phisher sends an e-mail that appears to come from a legitimate • Used across a wide range of applications, from ATM machine data
source, like a bank or credit card company, requesting "verification" encryption to e-mail privacy and secure remote access
of information • Standards:
• The e-mail usually contains a link to a fraudulent web page • Data Encryption Standard (DES)
• Security Patterns • Advanced Encryption Standard (AES)
• Identity and Access Management (IAM) • Stream ciphers
• The process of managing the identity of people and systems, and their • Create an arbitrarily long stream of key material
permissions • Combines key stream with the plaintext bit-by-bit or character-by-
• The IAM process follows three steps: character
• Users or systems claim who they are: identification • Used when data is in transit over the network
• The claimed identity is checked: authentication • RC4 is a widely-used stream cipher
Cryptographic Attacks
• Every encryption method can be broken using a brute force attack
• Except a one-time pad cipher with the key of equal or greater length
than the message
• A brute force attack consists of systematically checking all possible keys
until the correct key is found
• The amount of effort needed is exponentially dependent on the size of the
key
• Effective security could be achieved if it is proven that no efficient method
(as opposed to the time consuming brute force method) can be found to break
the cipher
• Most successful attacks are based on flaws in the implementation of an
encryption cipher
• To ensure a cipher is flawless, the source code is usually open source and
thus open to inspection to everyone

Information Technology Infrastructure IT602

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Information Technology Infrastructure IT602

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Information Technology Infrastructure IT602

Uploaded by

Copyright:

Available Formats

Information Technology Infrastructure • Security

Various Kinds of Architecture • Performnace

Serial components: One defect leads to downtime

• Example of integrity levels

You might also like