Data Deduplication

The Challenge

As data stores continue to grow and the need to retain more and more organizational data for legal reasons increases, IT professionals are working to determine whether their current backup strategies can keep up. Tapes - while offering easy transferability to an off-site location - can be extremely costly to store. Restoring data from tapes can also be very time-consuming. Alternatively, the cost of disk has decreased to the point where disk-to-disk backup is a viable option. For customers using a combination of disk and tape backup solutions, data deduplication can help bring that cost down even more, plus save valuable time at every level.

Wikipedia defines data deduplication as "a specific form of compression where redundant data is eliminated." Take the example of a 50 MB PowerPoint presentation emailed to 10 people. If each person stores the presentation in their home directory, we now have 500 MB allocated to storing the same data! If each person then forwards the presentation to one other individual and those people also store the presentation, we have 1 GB of storage dedicated to a single file! Incremental and differential backups aside, this one file will take up 1 GB of storage for its initial backup.

Data deduplication takes care of this redundancy by recognizing that the data in each of these individual files is the same. It therefore stores one copy of the file and creates pointers to the rest. Now, instead of using 1 GB of storage, 20 people have used a total of only 50 MB of disk space.

However, let's assume that each person makes a change in one slide. Now the data across all the files is not the same. Some data deduplication products are smart enough to work on the subfile level: they locate the blocks of data that are the same, store those one time, and then store the differing blocks separately. Because of the pointers the data deduplication product creates, each person can retrieve their unique version of the file, even though it has been stored in separate blocks.

How Does It Work?

Deduplication technology works by comparing chunks of data and searching for duplicates. It does this by assigning a unique identifier to each chunk, calculated by a cryptographic hash function. When a duplicate is found, the duplicate is removed and a link to the first copy is created. If this file is changed, then a copy of the changed file or block is written to disk during the next backup.

Types of Deduplication Technology

There are two types of data deduplication technology currently in use:

* Post-process deduplication: As the name implies, post-process deduplication runs after the data is sent to the target device. The advantage of this is that since the deduplication process can be slow, time for backup is not lost waiting for deduplication to occur. The disadvantage is that it is difficult to predict how long the deduplication process will take. Also, since the data needs to be written to the target first, more disk space will be required until the process finishes.
* In-line deduplication: With in-line deduplication, the hash calculations are performed on the target device as the data is written. If a duplicate is found, the new block of data is not stored. This method requires less storage on the target, but can be slower because the hash calculations and lookups take time. Performance varies across vendors.

What Are the Advantages?

Data deduplication brings a wide variety of benefits to organizations:

* Save on storage space for disk-to-disk backups: According to the Enterprise Strategy Group's report by Tony Asaro and Heidi Biggar entitled "Data De-duplication and Disk-to-Disk Backup Systems" (July 2007), "Through hands-on testing, ESG has found that data deduplication technologies can provide 10 times, 20 times, 30 times and even great reduction in capacity needed for backup." Thus, companies can see savings not only in the disk needed for the primary backup, but also in the cost of disk for a secondary site, or in monthly charges for an off-site backup service.
* Save on heating and cooling: By decreasing the amount of disk needed, organizations can see a reduction in heating and cooling costs.
* Save on space: With less disk needed, organizations also save on the amount of floor/rack space needed to house the backup solution.
* Save on bandwidth: Less data going across the wire means lower bandwidth costs.
* Decrease time and costs for data restoration: Recovery from disk is nearly immediate, while recovery from tape can be slow and time-consuming. If the tape needed is in off-site storage, more time and costs will be incurred.

What Backup Vendors Support This Technology?

A host of vendors offer this technology, including ExaGrid, EMC Data Domain, and Barracuda Backup (formerly BitLeap, until Barracuda acquired it last year).

Where Can I Learn More?

Check Data Domain for whitepapers (like the one mentioned in this article) and a deduplication calculator. ESG's report contains some great information, including questions to ask vendors when selecting a solution.

Conclusion

If you are considering a new backup strategy for your organization, taking a look at what data deduplication can do for you is a must. We feel that development of this technology is just getting started, and can only improve as more products hit the marketplace.

© Copyright 2010, Uptime NetManagement, Inc.
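
As an illustration of the chunk-and-hash approach described under "How Does It Work?", here is a minimal Python sketch of an in-line, block-level deduplicating store. It is not any vendor's implementation. The class name DedupStore, the tiny fixed chunk size, and the choice of SHA-256 are all assumptions made for the example; real products may use much larger or variable-size chunks and typically add safeguards against hash collisions.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; hypothetical and deliberately tiny so the example is easy to follow


class DedupStore:
    """Toy in-line deduplicating store (illustrative only)."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.chunks = {}  # hash -> chunk bytes, each unique chunk stored once
        self.files = {}   # file name -> ordered list of chunk hashes (the "pointers")

    def write(self, name, data):
        """Split data into fixed-size chunks and store only unseen chunks."""
        pointers = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # In-line check: keep the chunk only if its hash is new.
            self.chunks.setdefault(digest, chunk)
            pointers.append(digest)
        self.files[name] = pointers

    def read(self, name):
        """Rebuild a file by following its pointers to the stored chunks."""
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        """Total bytes actually kept on the (simulated) target."""
        return sum(len(c) for c in self.chunks.values())


store = DedupStore()
original = b"AAAABBBBCCCCDDDD"
edited = b"AAAABBBBXXXXDDDD"  # same file with one "slide" changed

store.write("alice.ppt", original)
store.write("bob.ppt", original)
store.write("carol.ppt", edited)

logical_bytes = len(original) * 2 + len(edited)
print(logical_bytes, store.stored_bytes())
```

Three 16-byte files are written, but only the unique chunks are kept: the two identical copies share all of their chunks, and the edited copy adds just the one chunk that differs. Each file can still be read back in full because its list of pointers records which chunks it is made of.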