I deduplicate, you deduplicate, it deduplicates… the verb is not in the dictionary yet, but it should not be long!
Deduplication is becoming more prevalent in the world of proprietary data backup solutions. However, an open source deduplication solution has been poking its nose out for some time and is beginning to mature: Opendedup.
For those who have forgotten or do not know this technology, here is the Wikipedia definition:
« Data deduplication is a specific form of compression where redundant data is eliminated, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, indexing of all data is still retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored. For example, a typical email system might contain 100 instances of the same one megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy. In this example, a 100 MB storage demand could be reduced to only 1 MB. Different applications have different levels of data redundancy. Backup applications generally benefit the most from de-duplication due to the nature of repeated full backups of an existing file system. »
It is worth adding that, to optimize deduplication, data is usually stored as blocks rather than whole files, as shown in the diagram below:

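To make the block idea concrete, here is a minimal sketch in Python of a block-level deduplicating store. It is not how Opendedup is implemented; the `BlockStore` class, the 4 KiB block size, and the file names are all assumptions made up for illustration. It replays the email example from the definition above on a smaller scale: 100 copies of the same attachment are written, but each unique block is kept only once.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


class BlockStore:
    """Toy in-memory store: each unique block is kept once, keyed by its hash."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # SHA-256 hex digest -> block bytes
        self.files = {}   # file name -> ordered list of digests (the index)

    def write(self, name, data):
        digests = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # stored only if not seen before
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # Rebuild the file by following its list of block references.
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())


# A 16 KiB "attachment" made of four distinct 4 KiB blocks.
attachment = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(512))

store = BlockStore()
for i in range(100):
    store.write(f"mail_{i}.bin", attachment)  # hypothetical file names

logical_bytes = 100 * len(attachment)
print(logical_bytes, store.stored_bytes())
```

The index (`files`) still describes all 100 files, so any of them can be read back in full, while the block table holds a single copy of the data.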
Great Applications for Deduplication
- Backups
- Virtual Machines
- Network shares for unstructured data such as office documents and PSTs
- Any application with a large amount of duplicated data
Applications that are not a good fit for Deduplication
- Anything that has totally unique data
- Pictures
- Music Files
- Movies/Videos
- Encrypted Data
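The difference between the two lists can be illustrated with a rough sketch. The `dedup_ratio` helper below is hypothetical, and it assumes fixed 4 KiB blocks. Random bytes stand in for encrypted or already compressed data (pictures, music, videos), since such data contains almost no repeated blocks; repeated copies of the same content stand in for successive full backups.

```python
import hashlib
import os


def dedup_ratio(data, block_size=4096):
    """Logical block count divided by unique block count (fixed-size blocks)."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if not blocks:
        return 1.0  # nothing to deduplicate
    unique = {hashlib.sha256(b).digest() for b in blocks}
    return len(blocks) / len(unique)


base = os.urandom(64 * 4096)             # stand-in for one full backup image
backups = base * 5                       # five identical full backups
random_like = os.urandom(5 * 64 * 4096)  # stand-in for encrypted or compressed data

print(dedup_ratio(backups))      # five logical copies for each stored block
print(dedup_ratio(random_like))  # every block is unique, so nothing is saved
```

A ratio close to 1 means deduplication costs hashing and indexing work without reclaiming any space, which is why these kinds of data are poor candidates.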