CP4P Compression and Backup

The document provides an overview of file compression and backup strategies for programmers, detailing the differences between lossless and lossy compression, as well as various file formats. It emphasizes the importance of backups, outlining the 3-2-1 backup strategy and the need for geographically separate and platform-independent copies. Additionally, it discusses the methods and locations for backups, highlighting the significance of data integrity and recovery in the event of data loss.

Uploaded by

bm999252999

COMPUTER

PRINCIPLES FOR
PROGRAMMERS
Compression, 3/\/[®¥|°710/\/, Backup
( Encryption )
How many programmers does it take to change a lightbulb?
t3A1A5TTcFKmylhIoGABjwacB1vIWgIv6S6LdLcSg
8s=
News of the Week
Agenda
Lecture:
1. What, Why, and How of “File Compression”
… depends on the Use Case
… overview of formats
… Lossless vs Lossy
2. What, Why, and How of “Backup”
… types of backups
… backup media
Agenda (Cont’d)
Activity:
1. Explore File Compression
2. Compress various native file formats within a ZIP
archive and compare the compression factors
3. Upload files to OneDrive to demonstrate a network
backup.
4. Do your own 3-2-1 backup.
What is “File Compression?”

4 pink
5 green
3 blue
What is “File Compression?”
• storing a file’s data in “less space”
by “minimizing redundancy” in the content
• An archive is a collection of folders and files
stored in one file, e.g. filename.ZIP
• Files are usually compressed
• Files can be encrypted
• Cross platform exchange
• OS options to compress / encrypt local files
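A minimal sketch of "minimizing redundancy", assuming Python's standard `zlib` module stands in for a general-purpose compressor. The sample strings and the helper name `compression_ratio` are illustrative and not from the slides. Highly repetitive data shrinks dramatically, while random bytes have no redundancy to remove.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return the compressed size as a fraction of the original size."""
    return len(zlib.compress(data)) / len(data)


# Highly redundant content: the same short phrase repeated many times.
redundant = b"pink pink pink pink green green green green green blue blue blue " * 200

# Random bytes of the same length: there is no redundancy to remove.
random_bytes = os.urandom(len(redundant))

print(f"redundant text: {compression_ratio(redundant):.1%} of original size")
print(f"random bytes:   {compression_ratio(random_bytes):.1%} of original size")

# Lossless round trip: decompressing returns exactly the original bytes.
round_trip = zlib.decompress(zlib.compress(redundant))
```

The random input may even grow slightly, because the compressed container adds a few bytes of overhead.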
Why use File Compression?
• Writing / Sending data takes bandwidth and I/O time
• Encrypt off-site data for security
• compression software has a password encrypt option
• VoIP must compress data in real-time
• Streaming sends compressed data; the receiver decompresses it.
How File Compression is done
Data compression combines:
1. match and replace duplicate strings using a
dictionary: unique code = "repeating string"
 Lempel–Ziv–Welch (LZW) compression (1984)
2. replace 8-bit char with variable bit length
symbols based on character frequency
 Huffman coding (1952)
 David A. Huffman was a Ph.D student at MIT
…who didn't know it could not be done.
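The dictionary step (item 1) can be sketched as a textbook LZW round trip. This is a teaching sketch under stated assumptions: the input contains only characters with code points below 256, codes are kept as plain integers rather than packed into bits, and the function names are illustrative.

```python
def lzw_compress(text: str) -> list[int]:
    """Encode text as a list of dictionary codes (textbook LZW)."""
    table = {chr(i): i for i in range(256)}  # start with all single characters
    next_code = 256
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in table:
            current = candidate  # keep extending the match
        else:
            codes.append(table[current])   # emit code for the longest match
            table[candidate] = next_code   # learn the new, longer string
            next_code += 1
            current = ch
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list[int]) -> str:
    """Rebuild the text; the decoder re-creates the same dictionary."""
    if not codes:
        return ""
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            entry = previous + previous[0]  # code defined by this very step
        else:
            raise ValueError("invalid LZW code")
        pieces.append(entry)
        table[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(pieces)


sample = "TOBEORNOTTOBEORTOBEORNOT"
packed = lzw_compress(sample)
unpacked = lzw_decompress(packed)
print(len(sample), "characters ->", len(packed), "codes")
```

Note that the dictionary is never transmitted: the decoder rebuilds it from the code stream itself.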
Huffman decoding
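Encoded: 1 0 1 1 0 0 0 1 0 1 1 1 0 0
Decode logic: starting at the root of the code tree, follow the branch (0 or 1) for the next encoded bit until a letter (S, L, O, E) is reached, then start again at the root.
Encode / Decode info: 0=S, 10=L, 110=O, 111=E.

Code:    10  110  0  0  10  111  0  0
Letter:  L   O    S  S  L   E    S  S

8 chars × 8 bits = 64 bits compressed to 14 bits (22%)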
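The decoding walk can be sketched in a few lines, using the code table given on the slide (0=S, 10=L, 110=O, 111=E). The function name `huffman_decode` is illustrative. Because no code is a prefix of another, the decoder can emit a letter as soon as the bits read so far match a table entry.

```python
# Code table from the slide: shorter codes for more frequent letters.
CODES = {"0": "S", "10": "L", "110": "O", "111": "E"}


def huffman_decode(bits: str, codes: dict[str, str]) -> str:
    """Decode a bit string using a prefix-free code table."""
    decoded = []
    buffer = ""
    for bit in bits:
        buffer += bit
        if buffer in codes:          # a complete code has been read
            decoded.append(codes[buffer])
            buffer = ""              # restart at the root of the tree
    if buffer:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(decoded)


encoded = "10110001011100"
message = huffman_decode(encoded, CODES)
ratio = len(encoded) / (len(message) * 8)
print(message, f"{len(encoded)} bits instead of {len(message) * 8} ({ratio:.0%})")
```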
Lossless vs. Lossy Compression
Lossless: contains all original data with
redundancies removed.
• Data, PNG / TIFF images, FLAC / ALAC audio
Lossy: sacrifices quality for smaller file size
by dropping fine-grained, subtle details from
original data.
• JPG images, MP3 / AAC audio, all video
• for end-use only, not for modification/editing
Lossless vs. Lossy (not what you might think)
Images:
GIF      45.4 KB   Lossy (8 bits/pixel, 256 colours)
PNG/TIFF 941 KB    LOSSLESS
JPG      25.4 KB   Lossy (10:1 compression)

Audio:
Format    Bit rate     Relative to CD
HiRes     9,216 kbps   653%
CD        1,411 kbps   100%
MP3/AAC   320 kbps     22.7%

High-resolution audio (HRA) has a bit rate 6.5 times higher than that of CD WAV files (1,411 kbps) and almost 29 times higher than that of MP3/AAC (320 kbps).
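The bit-rate figures can be checked with simple arithmetic: uncompressed PCM bit rate = sample rate × bit depth × channels. The CD parameters (44.1 kHz, 16-bit, stereo) are standard. The HiRes parameters (192 kHz, 24-bit, stereo) are an assumption chosen because they reproduce the 9,216 kbps figure; the slides do not state them.

```python
def pcm_bit_rate_kbps(sample_rate_hz: int, bit_depth: int, channels: int = 2) -> float:
    """Uncompressed PCM bit rate in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


cd = pcm_bit_rate_kbps(44_100, 16)       # CD audio
hires = pcm_bit_rate_kbps(192_000, 24)   # assumed HiRes format
mp3 = 320.0                              # top MP3/AAC bit rate from the slide

print(f"CD:      {cd:,.0f} kbps (100%)")
print(f"HiRes:   {hires:,.0f} kbps ({hires / cd:.0%} of CD)")
print(f"MP3/AAC: {mp3:,.0f} kbps ({mp3 / cd:.1%} of CD)")
print(f"HiRes is {hires / mp3:.1f} times the MP3/AAC rate")
```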
Common Compression File Formats

Data: ZIP (standard for cross-platform data exchange)
      ZIP-based: .docx .pptx .xlsx
      .tar.gz 7z RAR
Music: MP3 AAC MQA WAV ogg FLAC
Video: MPG MP4 DIVX XVID MOV AVI
Images: JPEG JXL AVIF WebP GIF PNG TIFF RAW
        Lossless: AVIF WebP JXL
BOLD formats are Lossless
Drawbacks to Compression
o Time: compression needs CPU and primary storage resources
o PCs have lots of both and only one user. Servers on the other hand…
o Space: archived files must be uncompressed before use,
extra space needed for both compression & decompression
o Integrity: any data corruption can cause loss of entire archive
o Solid or multi-volume archives can be lost with even minor data
corruption. Archive repair is possible but not probable.
o Test your archives to confirm integrity.
o Recoverability: the Lossy sacrifice is reduced quality
o Lossy compression is appropriate only for specific Use Cases.
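"Test your archives" can be sketched with Python's standard `zipfile` module, assuming a ZIP archive. The archive here is built in memory and the file names are made up for illustration. `ZipFile.testzip()` reads every member, checks its CRC, and returns the name of the first bad file, or `None` if all pass.

```python
import io
import zipfile

# Build a small archive in memory (a real test would open an archive on disk).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("notes.txt", "backup " * 1000)
    archive.writestr("readme.txt", "restore " * 1000)

# Re-open the archive for reading and verify every member's CRC.
with zipfile.ZipFile(buffer) as archive:
    first_bad = archive.testzip()  # None means every member passed
    for info in archive.infolist():
        print(f"{info.filename}: {info.file_size} -> {info.compress_size} bytes")

print("archive OK" if first_bad is None else f"corrupt member: {first_bad}")
```

A CRC check only confirms the archive is readable; it does not replace a trial restore of the files.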
Why do we need backups?
#1 Accidental deletion by users or IT people
#2 Hardware failure: all storage fails eventually
Together, #1 and #2 account for 2/3 to 3/4 of all data loss.

Far Less Frequent Causes


o Catastrophe: sabotage (ransomware), fire, flood, theft
o Account on cloud provider is cancelled or accidentally closed
o Cloud service provider as a single point of failure
Continuous Data Integrity with RAID
• Redundant Array of Independent Disks
• RAID 1, 5, 6 tolerate drive failure
• RAID 1 pairs drives, RAID 5 +1 parity drive, RAID 6 +2 parity drives
• RAID appears as one logical
drive space to OS (excluding
parity drives)
• read/write performance
increases with multiple drives
doing concurrent I/O
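How a parity drive tolerates a failure can be sketched with XOR, the operation behind RAID 5 style parity. This is a simplified illustration: the three tiny "drives" and the helper `xor_blocks` are made up, and real arrays stripe data and rotate parity across drives.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "drives" holding one block each, plus a computed parity block.
data_drives = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_drives)

# Drive 1 fails: XOR the surviving data blocks with the parity block.
survivors = [data_drives[0], data_drives[2], parity]
rebuilt = xor_blocks(survivors)

print("rebuilt block:", rebuilt)
```

This protects against a failed drive, not against deletion or corruption, which the array faithfully writes to every drive.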
Three characteristics define a Backup

A copy in a
geographically
separate location
that is platform
independent.
Classic File Backup Strategy
Types: Full (all files) + Differential (only files changed since last Full backup)
A Full backup is slow; a Differential backup is faster but slows down as changes accumulate after the last Full backup.
Restore requires the last Full backup plus the latest Differential.
Classic File Backup Strategy
Types: Full (all files) + Incremental (files changed since last backup of any type)
Incremental backups are faster than Differential.
Restore is slowest because the last Full backup and every Incremental backup made since must be restored in sequence.
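The difference between the strategies comes down to which cutoff time is used to select files. A minimal sketch, assuming modification times are simple day numbers; the `FileInfo` class and `select_files` function are hypothetical and not taken from any backup product.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    name: str
    modified: int  # day number of the last change (illustrative)


def select_files(files, strategy: str, last_full: int, last_backup: int) -> list[str]:
    """Return the names of files a backup of the given type would copy."""
    if strategy == "full":
        return [f.name for f in files]                 # everything
    if strategy == "differential":
        cutoff = last_full                             # changed since last Full
    elif strategy == "incremental":
        cutoff = last_backup                           # changed since any backup
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [f.name for f in files if f.modified > cutoff]


files = [FileInfo("a.txt", 0), FileInfo("b.txt", 2), FileInfo("c.txt", 4)]

# Full backup on day 1, an Incremental on day 3, and today is day 5.
full = select_files(files, "full", last_full=1, last_backup=3)
differential = select_files(files, "differential", last_full=1, last_backup=3)
incremental = select_files(files, "incremental", last_full=1, last_backup=3)
print(full, differential, incremental)
```

The Differential set keeps growing until the next Full backup, while each Incremental set stays small but must be replayed in order at restore time.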
Enterprise backup
• Backup software does Full, Differential, Incremental strategies
• Options for file versions / generations, and periodic snapshots.
• Enterprise OS provides for backup of continuously running systems.
• LTO tape or Optical Disc libraries as nearline tertiary storage
• AWS Glacier, Google Nearline, Sync cloud cold storage
• Inexpensive backup storage but slow to restore ($$$ if speed needed)
• Recovery and Restoration speed is highly variable.
• Depends on data transfer rates from backup device or location, and
complexity of rebuilding the relational aspect of database objects
• Data deduplication and single-instance storage optimize storage space
• eliminate duplicate copies of data within and across systems
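Single-instance storage can be sketched as a content-addressed store: each file's content is hashed, and identical content is stored only once no matter how many paths refer to it. The class `SingleInstanceStore` and the sample paths are hypothetical; real products typically deduplicate at the block level and persist to disk.

```python
import hashlib


class SingleInstanceStore:
    """Keep one copy of each distinct content, keyed by its SHA-256 digest."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}  # digest -> content (stored once)
        self.index: dict[str, str] = {}    # path -> digest

    def put(self, path: str, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # store only if not seen before
        self.index[path] = digest
        return digest

    def get(self, path: str) -> bytes:
        return self.blobs[self.index[path]]


store = SingleInstanceStore()
store.put("alice/handbook.pdf", b"company handbook v3")
store.put("bob/handbook.pdf", b"company handbook v3")   # duplicate content
store.put("bob/notes.txt", b"meeting notes")

print(len(store.index), "paths backed up,", len(store.blobs), "blobs stored")
```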
User Level File Recovery … is not backup
Windows File History, macOS Time Machine are not exactly backup
• Automatic copying of files to external or network drive [good]
• Historical versions of user files maintained. Easy to restore. [good]
• Must configure and test to ensure copying of all user folders. [okay]
• If drive is always connected, it is not a backup, just a copy; it is likely
not geographically separate and not platform independent. [bad!]
Windows Recycle Bin, macOS Trash can are not backup
• Only good for oops! and short-term recovery. [hopeful]
Two-way synchronization is not backup [deluded]
• Synchronization is platform interdependent, not independent. [bad!]
• A file on one system does not have a "copy" on other systems, [bad!]
the same file co-exists on all synchronized systems. [good?]
3-2-1 Backup Checklist
• 3 copies (change only the active file, not the backups)
• 1 active, 1 local backup, 1 remote backup
• 2 different formats/platforms (platform independence)
• External drive is platform independent only when not plugged in
• LTO tape or optical disc. Initially local, optionally moved offsite.
• One-way backup to cloud cold storage (not two-way cloud sync)
• 1 off-site backup (geographically separate location)
• Cloud storage different from your cloud IaaS, PaaS, SaaS provider
• tape/optical media – rotate Full, Diff, Incr to offsite storage services
• The near loss of Toy Story 2
The final word on backups…
Backups do not matter.
Only RESTORE matters.
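One way to act on this is to test a restore and compare checksums against the original. A minimal sketch using temporary files; the file names and the helpers `sha256_of` and `restore_matches` are illustrative assumptions, and the "backup" and "restore" steps are plain file copies.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original: Path, restored: Path) -> bool:
    """True when the restored file is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(restored)


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "report.txt"
    backup = Path(tmp) / "backup" / "report.txt"
    restored = Path(tmp) / "restored.txt"

    original.write_text("quarterly figures\n")
    backup.parent.mkdir()
    shutil.copy2(original, backup)    # the "backup" step
    shutil.copy2(backup, restored)    # the "restore" step
    good_restore = restore_matches(original, restored)

    restored.write_text("quarterly figurez\n")  # simulate a corrupted restore
    bad_restore = restore_matches(original, restored)

print("clean restore verified:", good_restore)
print("corrupted restore verified:", bad_restore)
```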
NOTES
…not on the quiz but here for further
information and explanation.
Effect of File Compression on Data Transfer
Assumptions:
• 1MB plain text file, unique for each of 30,000 users
• Network transfer time is 2 seconds per file
• text compressed to 35% of original, transfer time 1 sec/file

                          Data (per file)   Time         Size (all files)
Original 1MB plain text   8.24 Mb           16.7 hours   241.4 Gb
Compressed to 35%         2.88 Mb           8.3 hours    84.5 Gb
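The time column follows directly from the stated assumptions (30,000 files at 2 seconds each uncompressed, 1 second each compressed). A small sketch of that arithmetic; the function name `total_hours` is illustrative.

```python
USERS = 30_000  # one unique file per user, from the assumptions


def total_hours(seconds_per_file: float, files: int = USERS) -> float:
    """Total transfer time in hours when files are sent one after another."""
    return seconds_per_file * files / 3600


original_hours = total_hours(2)     # uncompressed: 2 seconds per file
compressed_hours = total_hours(1)   # compressed:   1 second per file

print(f"original:   {original_hours:.1f} hours")
print(f"compressed: {compressed_hours:.1f} hours")
print(f"time saved: {original_hours - compressed_hours:.1f} hours")
```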
Compression for end user distribution
TIFF, PNG, GIF (lossless). JPG (lossy).
• formats used by the graphics industry

FLAC, CD:WAV (lossless). MP3, AAC, MP4 (lossy).


• Compression formats used by the sound engineering
and music industry

MPG, MP4, DIVX, XVID, MOV, AVI (all lossy).


• Compression formats used by the video industry
How Compression Works
• Here is an old quote from Vangie Beal:
Data compression is particularly useful in communications because it
enables devices to transmit or store the same amount of data in fewer bits.
There are a variety of data compression techniques, but only a few have
been standardized. The CCITT has defined a standard data compression
technique for transmitting and a compression standard for data
communications through modems. In addition, there are file compression
formats, such as ARC and ZIP.
• This quote contains 449 characters.
How Compression Works
(cont’d)
•Replace “compression ” with "♠". The text becomes:
Data ♠is particularly useful in communications because it enables devices to
transmit or store the same amount of data in fewer bits. There are a variety
of data ♠techniques, but only a few have been standardized. The CCITT has
defined a standard data ♠technique for transmitting and a ♠standard for data
communications through modems. In addition, there are file ♠formats, such
as ARC and ZIP.
• With dictionary “♠compression˽” and 5 replacements,
total size is 406 characters, 90.4% of 449.
• algorithm builds a token/string dictionary
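The single-token replacement can be reproduced directly. This sketch uses the quote as printed on the slide and follows the slide's accounting, which adds the 12-character dictionary string to the encoded length. Counting the token character in the dictionary as well would add one more.

```python
QUOTE = (
    "Data compression is particularly useful in communications because it "
    "enables devices to transmit or store the same amount of data in fewer bits. "
    "There are a variety of data compression techniques, but only a few have "
    "been standardized. The CCITT has defined a standard data compression "
    "technique for transmitting and a compression standard for data "
    "communications through modems. In addition, there are file compression "
    "formats, such as ARC and ZIP."
)

TOKEN = "\u2660"            # the spade symbol used on the slide
PHRASE = "compression "     # dictionary string, including the trailing space

encoded = QUOTE.replace(PHRASE, TOKEN)
decoded = encoded.replace(TOKEN, PHRASE)   # lossless: original comes back

total = len(encoded) + len(PHRASE)         # encoded text + dictionary string
print(f"original: {len(QUOTE)} characters")
print(f"encoded:  {len(encoded)} characters after {QUOTE.count(PHRASE)} replacements")
print(f"total with dictionary: {total} ({total / len(QUOTE):.1%} of original)")
```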
How Compression Works
(cont’d)
• With more pattern matching and a bigger dictionary…
♠compression˽ ☻transmit ♣here are˽ ♥data˽

♦communications˽ ☺standard˽ ♪technique


♥♠is particularly useful in ♦because it enables devices to ☻ or store the
same amount of ♥in fewer bits. T♣a variety of ♥♠♪s, but only a few have
been ☺ized. The CCITT has defined a ☺♥♠♪ for ☻ting and a ♠☺ for
♥♦through modems. In addition, t♣ file ♠formats, such as ARC and ZIP.
• Including dictionary, total size is 363 characters or 81% of
original. The more pattern matches there are for each
dictionary item, the higher the compression.
Two types of Compression
Formats
Lossless (LZW and/or Huffman)
• ZIP, TIFF, FLAC, and other general file compression routines are lossless: all original data is completely
encoded. Lossless compression reduces redundant data by not repeating recurring strings of the same data
(e.g. a large blank space in a TIFF image or a long noiseless passage in FLAC audio).
• After decompression, the data is always complete and is indistinguishable from the original source file.
• GIF files use LZW lossless compression as part of their file format. Although GIF images capture only 256
colours per frame, this is their highest resolution. The minimal colour depth captured is by design; it is not an
artifact of compression. GIFs are most successful as sharp-edged line art with animations.
Lossy
• JPG/JPEG, MPG, MP3, and other end-user formats use lossy compression. Data representing the highest,
fine-grained resolutions of the image or sound are removed in order to achieve high levels of compression to
reduce file size for distribution. The level of compression is variable according to the Use Case. e.g. minimal
compression of the original, high compression for email or MMS distribution.
• JPG images effectively delete colour information to achieve compression.
• MP3s simplify the sound waves of audio
• Developers decide to adjust the level of compression according to their Use Cases.
Overview of some Compression File
Formats
• ZIP:
o The most popular general-purpose compression archive.
o Supported on virtually all platforms from mainframes to PCs.
o Includes features such as encryption using password protection.
• RAR, 7z, TAR, StuffIt:
• General-purpose archive formats, some proprietary (RAR, StuffIt) and some open (7z, TAR), with incremental
improvements over ZIP but with the loss of standardized support.
• use different algorithms, with various benefits and uses. TAR only bundles files; it is usually paired with a
compressor such as gzip (.tar.gz).
• Some are designed for different operating systems
(StuffIt for Mac; TAR, short for Tape ARchive, for *nix).
What is a “Backup” and why do we need backups?
(Cont’d)
• Backup is the “procedure for making extra copies” of data “in case
the original is lost or damaged and must be restored.” The procedure
includes storing the copy in a geographically separate location which
is platform independent from the original file and host system.
• Having a backup will allow you to recover from lost, broken or stolen
hardware, and from your own accidental deletions.
• You should be in the habit of backing up user created files on your
laptop or PC. OS and apps can be restored from their original
software providers or a system Restore Point but user created data
can only be restored from backups.
When & How to run your Backup
• Automatically:
o Performed by continuously running backup software that constantly monitors
for file changes. Used for Full and Incremental strategy.
• Scheduled:
o system operator or backup software runs a backup at specific times, such as
overnight, when it has the least impact on business operations or at critical
business times such as at accounting month/year end. Used for Full,
Differential, and Incremental strategy.
• Manual Backup:
o a user performs backups at their own convenience. Not a strategy.
o It is the least effective method (what if you forget to do it?), but it’s better than
no backup at all!
Locations of Backup Media
• Local:
o copies files to a drive in use by the system.
o fastest and most convenient, but if the computer is lost or malfunctions,
so goes the data! Just having a copy is not a backup.
o Local copies may be made to reduce downtime. The copies are then
moved to External media or transmitted which is a slower process.
• External:
• Copies files to External/Portable/Flash Drive.
i.e. a device which can be disconnected from the computer.
• It is a backup when the platform independent device is taken off-site.
Locations of Backup Media
(Cont’d)
• Network:
o Back up files to the cloud (Google Drive, OneDrive, Dropbox, iCloud)
o It is a slower option for large backups. Cost-effective communications
bandwidth has significantly less throughput than writing data to a
directly attached device.
• The best location depends on the type of work you’re doing,
the volatility of the data (how quickly it changes), the volume
of data, the backup window (available downtime), security
considerations, and the speed/availability of restoration.
