Data deduplication, or is it de-duplication? The very fact that my spell checker tells me it's spelled wrong every time I type it is a hint that it might need some clarification.

It goes by many names: commonality factoring, single instance store, common file elimination, referential integrity, and the list goes on. What do all of these terms have in common?

Two things come to mind. First, none of them set off alarms in my word processor, which means they can't be used in a technical discussion. When we invent new technology, we need to invent new words for it. Second, at a high level, they all refer to various methods of reducing the amount of data being stored.

The data reduction field is still struggling with many terms and a lack of standardization. I will define the basic degrees of deduplication available today and their advantages as they relate to data backup.

Full Backup
A full backup is a complete copy of the data for every file and every server, every time you complete a backup. This method is used because it is straightforward and simple. If you want to recover data from four days ago, you retrieve that tape (or tapes) and begin your restore. The restore process involves one tape or one set of tapes. The drawback is that it needs the longest backup window of all the methods.

As backup administrators begin to run into shorter backup windows, they usually start doing differential backups.

Differential Backup
This is where you have a weekly or master backup, and each successive backup after that includes all the changes made since the master. That means that as the week goes on, you are backing up more and more data each time to capture all the changes. When it is time to recover, you need only your master and the differential for the day you are recovering to, so a Friday recovery requires only two tapes. This is a basic level of deduplication introduced by tapes decades ago.

Incremental Backup
Differential backups may cause your Thursday night backup windows to be too long. That is where tape-based incremental backups come in. After your master, each tape stores only the files that have changed since the last backup. The tradeoff here is recovery. If you are recovering Friday's backup, you will need to take the master, let's say Sunday's, and insert each tape in turn until you reach Friday to process a full restore. This deduplicates more data than the differential, but the restore can be very time consuming and is further complicated when tapes are taken off-site each night.

Deduplication Backup
Enter the various levels of disk-based deduplication technology. At a basic level, they can be broken down into three types: single instance, block level and byte level. Some of the earliest deduplication technologies came out of the wide area file services segment. Sending less data made better use of the available bandwidth and reduced the expense of wide area networks.

Single Instance Deduplication Backup
An email application example is the best way to describe single instance deduplication. If you email a 10 MB attachment to 100 employees in your company, that could equate to 1,000 MB of data. With single instance storage, only one instance of the attached file is stored. The other recipients receive the email with a "pointer" to that file. This reduces the storage on the email server to the original 10 MB. Now let's say everyone loves the file and they all save it in their personal directory on a server. You now have 1,000 MB on your file server. A backup solution that uses single instance or common file elimination will store only a single copy of that file. This reduces your backup window, network traffic and backup storage from 1,000 MB to about 10 MB, the size of that one copy. All other references to that file are stored as pointers. They are still available for recovery, but use almost no storage.

This method also eliminates the need to back up common files like operating systems and applications again and again when doing full server backups. Since I am using a disk-to-disk based example, all the "which tape has which version" bookkeeping is eliminated as well. Versions are tracked in the metadata of the backup solution.

Block-Level Deduplication Backup
Block-level data deduplication computes a hash value for every block of data and uses it as that block's fingerprint. Blocks with the same fingerprint are stored only once. Block size varies depending on the application, ranging from 4KB to 56KB. Some applications use a fixed block size while others use variable block sizing. Generally, a smaller block size will find more commonality and reduce data by a greater amount. Block-level deduplication can be applied globally across many backup sets.

The tradeoff of a smaller block size is greater processing and I/O overhead. Breaking 100 GB of data into 8 KB blocks produces roughly 12.5 million chunks of data. Reconstructing all these chunks delays restore times. Some technologies have configuration settings that increase restore performance by producing "sub-masters" within the data set. This allows for faster restores but requires additional storage. You must consider this additional overhead when evaluating deduplication methods.

Byte-Level Deduplication Backup
Byte-level data deduplication performs a byte-by-byte comparison of the current data stream with the bytes it has seen before. This is a much more accurate comparison and finds much higher commonality in the data sets. Most byte-level deduplication approaches are content aware. That means they are engineered specifically to understand the backup application's data stream, so they can identify information like file name, file type and date/time stamp.

Because comparisons at this level are resource intensive, they are usually done after the backup occurs, which is called post-processing, rather than in-line, which is the norm with block-level deduplication. This means backups complete at full disk performance, but require additional storage to cache the backups while they are processed. In addition, the byte-level deduplication process is usually limited to a single backup set and is not generally applied globally across backup sets.

In many cases, byte-level technology keeps the most recent generation as a master. That significantly improves restore times, because ninety percent of all restores are of the most recent generation. Restore times are an important consideration when evaluating any backup solution.

All vendors have their own approach to deduplication. Some use storage appliances, others have software-only solutions, and still others have complete end-to-end replacements for existing backup solutions.

One thing is for sure: without data deduplication, remote backup and automated business continuity solutions across corporate and public wide area networks would not be a reality today.