De-duplication is used extensively in digital forensics and e-Discovery as a way of culling documents. While the process itself is simple, de-duplication can be performed in numerous ways that affect review time, cost and your understanding of the custodians. Here are some questions that frequently come up when we discuss de-duplication options with clients:
Q1: What do “horizontal de-duplication” and “vertical de-duplication” mean?
These terms describe the scope of de-duplication. Horizontal de-duplication is performed globally, across all custodians, while vertical de-duplication is limited to each custodian’s own documents.
When documents are de-duplicated horizontally, all but one copy of a document are removed from the document universe regardless of which custodians had the same document. When de-duplication is performed vertically, on the other hand, multiple copies of the same document may remain in the document universe as long as each copy originated from a different custodian. This allows the legal team to gain a greater understanding of the custodians at the expense of having more documents in the review database.
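The difference in scope can be illustrated with a minimal Python sketch. The `dedupe` function, the `(custodian, file name, content)` record layout and the sample data are hypothetical, chosen only for illustration; real processing tools work on far richer records.

```python
import hashlib

def dedupe(documents, scope="horizontal"):
    """Split (custodian, file_name, content) records into kept and suppressed.

    scope="horizontal": one copy is kept across all custodians.
    scope="vertical":   one copy is kept per custodian.
    """
    seen = set()
    kept, suppressed = [], []
    for custodian, file_name, content in documents:
        digest = hashlib.md5(content).hexdigest()
        # The only difference between the two scopes is the comparison key.
        key = digest if scope == "horizontal" else (custodian, digest)
        if key in seen:
            suppressed.append((custodian, file_name, content))
        else:
            seen.add(key)
            kept.append((custodian, file_name, content))
    return kept, suppressed

# Hypothetical sample data: three copies of one document, plus one unique file.
docs = [
    ("alice", "report.docx", b"Q3 results"),
    ("alice", "copy of report.docx", b"Q3 results"),
    ("bob", "report.docx", b"Q3 results"),
    ("bob", "notes.txt", b"call client"),
]

horizontal_kept, horizontal_suppressed = dedupe(docs, "horizontal")
vertical_kept, vertical_suppressed = dedupe(docs, "vertical")
print(len(horizontal_kept), len(vertical_kept))  # 2 3
```

With vertical scope, Bob's copy of the report survives, which is what tells the review team that Bob also had the document.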
Q2: I chose to have my documents de-duplicated prior to processing. Now, how do I track down duplicate copies of a certain document?
When your documents are de-duplicated, duplicates are flagged in the service provider’s back-end database instead of being removed permanently. Every time de-duplication is performed, your service provider should include a de-duplication report that lists de-duplicated copies of each document and at a minimum their hashes, sizes, file names and folder paths.
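As a rough idea of what such a report might contain, the sketch below builds one row per suppressed duplicate with the minimum fields listed above. The column names, the `duplicate_of` column and the sample paths are assumptions made for illustration; they do not describe any particular provider's report format.

```python
import csv
import hashlib
import io

REPORT_COLUMNS = ["md5", "size_bytes", "file_name", "folder_path", "duplicate_of"]

def build_report(files):
    """files: iterable of (folder_path, file_name, content) tuples.

    Returns one row for every file that duplicates an earlier file.
    """
    first_seen = {}  # digest -> path of the copy that was kept
    rows = []
    for folder_path, file_name, content in files:
        digest = hashlib.md5(content).hexdigest()
        if digest in first_seen:
            rows.append({
                "md5": digest,
                "size_bytes": len(content),
                "file_name": file_name,
                "folder_path": folder_path,
                "duplicate_of": first_seen[digest],
            })
        else:
            first_seen[digest] = f"{folder_path}/{file_name}"
    return rows

def to_csv(rows):
    """Serialize report rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REPORT_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample data.
files = [
    ("alice/Documents", "report.docx", b"Q3 results"),
    ("alice/Desktop", "copy of report.docx", b"Q3 results"),
    ("bob/Documents", "notes.txt", b"call client"),
]
report_rows = build_report(files)
report_csv = to_csv(report_rows)
print(report_csv)
```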
Q3: How are attachment families handled during de-duplication?
De-duplication should normally be performed at the attachment family level rather than document level. In other words, an e-mail message and all of its attachments would have to be identical to another e-mail family in order for them to be considered duplicates. This would ensure that an e-mail attachment would not be de-duplicated against a loose electronic document and removed from its family.
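One way to picture family-level comparison is to derive a single hash from all members of a family. The following Python sketch is an assumption about how this could be done, not a description of any specific tool: the hypothetical `family_hash` function treats each family member as raw bytes and combines the per-member digests in order.

```python
import hashlib

def family_hash(member_contents):
    """Hash a family: the parent message first, then its attachments in order.

    Each member is hashed on its own and the digests are hashed together,
    so member boundaries cannot shift without changing the result.
    """
    outer = hashlib.md5()
    for content in member_contents:
        outer.update(hashlib.md5(content).digest())
    return outer.hexdigest()

# Hypothetical sample data.
family_a = [b"Please see attached.", b"budget spreadsheet bytes"]
family_b = [b"Please see attached.", b"budget spreadsheet bytes"]
family_c = [b"Different cover note.", b"budget spreadsheet bytes"]
loose_document = [b"budget spreadsheet bytes"]

print(family_hash(family_a) == family_hash(family_b))        # True
print(family_hash(family_a) == family_hash(family_c))        # False
print(family_hash(family_a) == family_hash(loose_document))  # False
```

Because the comparison key covers the whole family, the attachment in `family_a` is never matched against the loose copy of the same spreadsheet.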
Q4: What is a cryptographic hash?
A cryptographic hash is a fixed-size signature for an arbitrary block of data that represents its contents. Hash algorithms are designed such that even the slightest change to the original data changes the signature dramatically. These signatures can then be used to compare documents and identify duplicates. Message Digest 5 (MD5) and SHA-1 are two popular cryptographic hash functions used in e-Discovery.
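A short Python example using the standard `hashlib` module shows both properties: the signature has a fixed size regardless of input, and a one-character change produces a completely different value. The sample strings are arbitrary.

```python
import hashlib

original = b"The quick brown fox jumps over the lazy dog"
altered = b"The quick brown fox jumps over the lazy dog."  # one extra character

md5_original = hashlib.md5(original).hexdigest()
md5_altered = hashlib.md5(altered).hexdigest()
sha1_original = hashlib.sha1(original).hexdigest()

print(md5_original)   # 9e107d9d372bb6826bd81d3542a419d6
print(md5_altered)    # a different, unrelated-looking value
print(sha1_original)  # 40 hexadecimal characters (160 bits)

# MD5 always yields 128 bits (32 hex characters), whatever the input size.
print(len(md5_original), len(hashlib.md5(b"x" * 1_000_000).hexdigest()))  # 32 32
```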
Q5: Can two documents with different file names or dates be considered duplicates?
Yes. De-duplication is usually performed by comparing cryptographic hashes (e.g. MD5 or SHA-1) of documents to each other. The calculated hash values are based on the binary contents of documents and do not take into account external metadata that is stored in the file system. Therefore, two files with the same contents but different file names produce the same hash value.
Most e-Discovery service providers allow you to use a custom hash that includes your choice of metadata fields in addition to document contents for de-duplication. For example, you could include the file name field in your custom hash if you want documents to be considered duplicates only when their file names are also identical.
Note: Some document types (e.g. Microsoft Office documents and Adobe PDF files) contain internal metadata, including dates. Since this information is stored inside the document, it affects the calculated hash value.
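The distinction between a content-only hash and a custom hash can be sketched in Python as follows. The `custom_hash` function, its length-prefixed field encoding and the metadata keys are hypothetical; each provider defines its own custom hash recipe.

```python
import hashlib

def content_hash(content):
    """Hash of the binary contents only; file system metadata is ignored."""
    return hashlib.md5(content).hexdigest()

def custom_hash(content, metadata, fields):
    """Hash of the selected metadata fields followed by the contents."""
    hasher = hashlib.md5()
    for field in fields:
        value = str(metadata.get(field, "")).encode("utf-8")
        # Length-prefix each value so adjacent fields cannot run together.
        hasher.update(len(value).to_bytes(8, "big"))
        hasher.update(value)
    hasher.update(content)
    return hasher.hexdigest()

# Hypothetical sample data: same bytes, different file names.
content = b"Q3 results"
meta_1 = {"file_name": "report.docx"}
meta_2 = {"file_name": "report_final.docx"}

print(content_hash(content) == content_hash(content))  # True: names play no part
print(custom_hash(content, meta_1, ["file_name"]) ==
      custom_hash(content, meta_2, ["file_name"]))      # False: names differ
```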
Q6: Are e-mails hashed the same way as loose electronic documents?
An e-mail is essentially a set of fields stored in a container. This container can hold an individual message (e.g. an MSG file) or it can be a database-like structure containing multiple messages (e.g. PST or NSF). Consequently, most e-Discovery software computes cryptographic hashes for e-mails based on the values found in a predetermined set of fields. The following fields are typically used for e-mail de-duplication: Author, Recipient(s), CC, BCC, Date Sent, Subject, Attachment Count, Attachment Names, Message Body.
E-mail messages contain many more fields than those typically used for de-duplication, and some of these fields can vary among multiple copies of an e-mail message. For example, four copies of a message may be marked as read while a fifth copy is unread. Depending on the nature of the case, the contents of these additional fields may or may not be relevant. However, it is always important to clearly define what should be considered a duplicate at the outset of each project.
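A field-based e-mail hash might look like the Python sketch below. The dictionary keys, the normalization of list fields (sorted and lower-cased) and the `email_hash` function are all assumptions for illustration; actual tools differ in how they extract and normalize each field.

```python
import hashlib

# Keys mirror the fields listed above; the key names themselves are hypothetical.
DEDUP_FIELDS = [
    "author", "recipients", "cc", "bcc", "date_sent",
    "subject", "attachment_count", "attachment_names", "body",
]

def email_hash(message):
    """Hash only the fields chosen for de-duplication; ignore everything else."""
    hasher = hashlib.md5()
    for field in DEDUP_FIELDS:
        value = message.get(field, "")
        if isinstance(value, (list, tuple)):
            # Assumed normalization: order and letter case of list entries do not matter.
            value = ";".join(sorted(item.strip().lower() for item in value))
        else:
            value = str(value).strip()
        encoded = value.encode("utf-8")
        hasher.update(len(encoded).to_bytes(8, "big"))
        hasher.update(encoded)
    return hasher.hexdigest()

# Hypothetical sample data.
read_copy = {
    "author": "alice@example.com",
    "recipients": ["bob@example.com", "carol@example.com"],
    "cc": [], "bcc": [],
    "date_sent": "2010-03-01T09:15:00Z",
    "subject": "Budget",
    "attachment_count": 1,
    "attachment_names": ["budget.xlsx"],
    "body": "Please see attached.",
    "is_read": True,  # not part of DEDUP_FIELDS
}
unread_copy = dict(read_copy, is_read=False)
different_subject = dict(read_copy, subject="Re: Budget")

print(email_hash(read_copy) == email_hash(unread_copy))        # True
print(email_hash(read_copy) == email_hash(different_subject))  # False
```

If the read/unread status matters in a given case, adding that field to the list would make the two copies distinct.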
Q7: What are the chances of two different documents having the same MD5 hash?
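Extremely small, provided the documents were not deliberately crafted to collide (see Q8). MD5 produces a 128-bit hash, so by the birthday paradox a set would need to contain about 2^64 (18,446,744,073,709,551,616) documents before the probability of any two different documents sharing an MD5 hash reaches roughly 39%. A 50% probability requires about 1.18 × 2^64 documents.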
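The birthday-paradox figures for a 128-bit hash can be checked with the standard approximation p ≈ 1 − exp(−n(n−1) / (2N)), where n is the number of documents and N = 2^128 is the number of possible hash values. The Python sketch below assumes hash values are uniformly distributed and that no one is deliberately constructing collisions; the one-billion-document case is an arbitrary illustration.

```python
import math

MD5_BITS = 128

def collision_probability(num_documents, hash_bits=MD5_BITS):
    """Approximate probability that at least two documents share a hash."""
    space = 2 ** hash_bits
    exponent = num_documents * (num_documents - 1) / (2 * space)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)

def documents_for_probability(probability, hash_bits=MD5_BITS):
    """Approximate set size at which the collision probability is reached."""
    return math.sqrt(2 * (2 ** hash_bits) * math.log(1 / (1 - probability)))

p_at_2_64 = collision_probability(2 ** 64)
p_at_one_billion = collision_probability(10 ** 9)
ratio_for_half = documents_for_probability(0.5) / 2 ** 64

print(f"{p_at_2_64:.3f}")       # 0.393
print(f"{ratio_for_half:.3f}")  # 1.177
print(p_at_one_billion)         # on the order of 1e-21
```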
Q8: I heard that MD5 is now considered cryptographically broken. Is it good enough to identify duplicate documents?
The fact that MD5 is now cryptographically broken means that an attacker can create a pair of non-identical files that produce the same MD5 hash (a collision attack). Keep in mind that this is different from a preimage attack, where an attacker produces a file that matches a specific, known hash value. There are no known practical preimage attacks against MD5 as of this writing.
In short, MD5 is currently considered suitable for identifying duplicate documents. However, our recommendation is to hash each document using more than one algorithm (e.g. MD5 and SHA-1) to alleviate security concerns. We anticipate that SHA-256 will be the new hash standard for e-Discovery in the near future.
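Hashing a document with several algorithms does not require reading it several times. The Python sketch below feeds each chunk of a stream to every hasher in a single pass; the `multi_hash` and `is_duplicate` helpers, the default algorithm list and the chunk size are illustrative assumptions.

```python
import hashlib
import io

def multi_hash(stream, algorithms=("md5", "sha1", "sha256"), chunk_size=1 << 20):
    """Read a binary stream once and return a hex digest per algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}

def is_duplicate(hashes_a, hashes_b):
    """Treat two documents as duplicates only if every digest matches."""
    return hashes_a.keys() == hashes_b.keys() and all(
        hashes_a[name] == hashes_b[name] for name in hashes_a
    )

# In-memory streams stand in for files opened with open(path, "rb").
first = multi_hash(io.BytesIO(b"Q3 results"))
second = multi_hash(io.BytesIO(b"Q3 results"))
third = multi_hash(io.BytesIO(b"Q4 results"))

print(is_duplicate(first, second))  # True
print(is_duplicate(first, third))   # False
```

Requiring every digest to match means that a crafted MD5 collision alone would not cause two different files to be treated as duplicates.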