Ending the Confusion - A Simple Explanation of Data Deduplication

Data deduplication, or is it de-duplication? The very fact that my spell check tells me it's spelled wrong every time I type it is a hint that it might need some clarification.

It goes by many names: commonality factoring, single instance store, common file elimination, referential integrity, and the list goes on. What do all of these terms have in common?

Two things come to mind. First, none of them set off alarms in my word processor, which means they can't be used in a technical discussion. When we invent new technology, we need to invent new words for it. Second, at a high level, they all refer to various methods of reducing the amount of data being stored.

The data reduction field is still struggling with many terms and a lack of standardization. I will define the basic degrees of deduplication available today and their advantages as they relate to data backup.

Full Backup

A full backup is a complete copy of the data for every file and every server, every time you complete a backup. This method is used because it is straightforward and simple. If you want to recover data from four days ago, you retrieve that tape (or tapes) and begin your restore. The restore needs only that one tape or set of tapes. The drawback is that it requires the longest backup window of all the methods.

As backup administrators begin to run into shorter backup windows, they usually start doing differential backups.

Differential Backup

This is where you have a weekly or master backup, and each successive backup after that includes all the changes since the master. That means as the week goes on, you are backing up more and more data each time to capture all the changes. When it is time to recover, you only need your master and the differential for the day you are recovering to, so a Friday recovery requires only two tapes. This is a basic level of deduplication introduced by tapes decades ago.

Incremental Backup

Differential backups may cause your Thursday night backup windows to be too long. That is where tape-based incremental backups come in. After your master, each tape stores only the files that have changed since the last backup. The tradeoff here is recovery. If you are recovering Friday's backup, you will need to take the master, let's say Sunday's, and insert each tape in turn until you reach Friday to process a full restore. This deduplicates more data than the differential, but the restore can be very time consuming and is further complicated when tapes are taken off-site each night.

Deduplication Backup

Enter the various levels of disk-based deduplication technology. At a basic level, it can be broken down into three types: single instance, block level and byte level. Some of the earliest deduplication technologies came out of the wide area file services segment. Reducing data allowed higher bandwidth utilization and minimized the expense of wide area networks.

Single Instance Deduplication Backup

An email application example is the best way to describe single instance deduplication. If you email a 10 MB attachment to 100 employees in your company, that could equate to 1,000 MB of data. In this case, a single instance of the attached file is stored. Other recipients receive the email with a "pointer" to that file. This reduces the storage on the email server to the original 10 MB.

Now let's say everyone loves the file and they all save it in their personal directory on a server. You now have 1,000 MB on your file server. A backup solution that uses single instance or common file elimination will store only a single copy of that file. This reduces your backup window, network traffic and backup storage from 1,000 MB to 10 MB. All other references to that file are stored as pointers. They are still available for recovery, but use almost no storage.

This method also eliminates the need to back up common files like operating systems and applications when doing full server backups. Since I am using a disk-to-disk based example, all the "which tape has which version" bookkeeping is eliminated and tracked in the metadata of the backup solution.

Block-Level Deduplication Backup

Block-level data deduplication computes a hash value for every block of data and uses it to recognize blocks it has already stored. Block size varies depending on the application, ranging from 4KB to 56KB. Some applications use a fixed block size while others use variable block sizing. Generally, a smaller block size will find more commonality and reduce data by a greater amount. Block-level deduplication can be applied globally across many backup sets.

A tradeoff of a smaller block size is greater processing and I/O overhead. Breaking 100GB of data into 8KB blocks produces roughly 12.5 million chunks of data. Reconstructing all these chunks delays restore times. Some technologies have configuration settings to increase restore performance by producing "sub-masters" within the data set. This allows for faster restores but requires additional storage. You must consider the additional overhead when evaluating deduplication methods.

Byte-Level Deduplication Backup

Byte-level data deduplication performs a byte-by-byte comparison of the current data stream with the bytes it has seen before. This is a much more accurate comparison and finds much higher commonality in the data sets. Most byte-level deduplication approaches are content aware. That means they are engineered specifically to understand the backup application's data stream so they can identify information like file name, file type and date/time stamp.

Because comparisons at this level are resource intensive, they are usually done after the backup occurs, which is called post-processing, rather than in-line, which is the norm with block-level deduplication. This means backups complete at full disk performance, but require additional storage to cache the backups while they are processed. In addition, the byte-level deduplication process is usually limited to a single backup set and not generally applied globally across backup sets.

In many cases, byte-level technology keeps the most recent generation as a master. That significantly improves restore times. Ninety percent of all restores are of the most recent generation. Restore times are an important consideration when evaluating any backup solution.

All vendors have their own approach to deduplication. Some use storage appliances, others have software-only solutions, and still others have complete end-to-end replacements for existing backup solutions.

One thing is for sure: without data deduplication, remote backup and automated business continuity solutions across corporate and public wide area networks would not be a reality today.
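
To make the block-level idea discussed in this article more concrete, here is a minimal Python sketch of a deduplicating store. It is an illustration under stated assumptions, not how any particular backup product works: the `BlockStore` class, the `make_sample_file` helper, the fixed 8 KB block size, the choice of SHA-256 and the in-memory dictionaries are all hypothetical. Each backup is split into fixed-size blocks, each block is hashed, a block is stored only if its hash has not been seen before, and every backup keeps just an ordered list of hashes that act as the pointers.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # assumed fixed block size of 8 KB


def make_sample_file(num_blocks=128, block_size=BLOCK_SIZE):
    """Build deterministic sample data in which every block is distinct."""
    return b"".join(
        i.to_bytes(4, "big") * (block_size // 4) for i in range(num_blocks)
    )


class BlockStore:
    """Toy in-memory block-level deduplicating store (illustrative only)."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}   # hash digest -> block bytes, one copy per unique block
        self.backups = {}  # backup name -> ordered list of digests ("pointers")

    def backup(self, name, data):
        """Split data into fixed-size blocks and store only unseen blocks."""
        pointers = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # keep the first copy only
            pointers.append(digest)
        self.backups[name] = pointers

    def restore(self, name):
        """Rebuild a backup by following its pointers in order."""
        return b"".join(self.blocks[d] for d in self.backups[name])

    def stored_bytes(self):
        """Total size of the unique blocks actually kept."""
        return sum(len(block) for block in self.blocks.values())


if __name__ == "__main__":
    store = BlockStore()
    attachment = make_sample_file()

    # 100 users each save the same file to their own directory.
    for user in range(100):
        store.backup(f"user{user}", attachment)

    print("logical bytes backed up:", 100 * len(attachment))
    print("bytes actually stored:  ", store.stored_bytes())
    print("unique blocks indexed:  ", len(store.blocks))
```

The sketch also hints at the tradeoff described under block-level deduplication: the index holds one entry per unique block, and a restore has to look up every pointer, so a smaller block size means more entries to track and more lookups per restore.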