eDiscovery: focus on de-duplication

In the third article in our eDiscovery series, Hayley Pizzey provides an in-depth review of one of the simpler interrogation tools: de-duplication.

13 April 2015

In our article ‘Interrogation of big data’ we outlined some of the key tools and search methods used to facilitate the effective and efficient interrogation of big data. In this article we provide an in-depth review of one of the simpler interrogation tools: de-duplication. It is one of the most effective and frequently used ways to reduce the volume of documents that require review, and one which delivers a significant reduction in time and cost.

The most common form of de-duplication is by Message Digest algorithm 5 (‘MD5’) or ‘hash’ value. MD5 calculates a 32-digit hexadecimal number (i.e. a number consisting of 32 characters, each a digit 0-9 or a letter a-f) for each electronic file, thereby giving each file its own unique ‘fingerprint’. The detective work can then begin, with the software taking on the role of Sherlock Holmes in removing unnecessary duplicate documents from the review set – eliminating potential suspect documents at every stage.
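By way of illustration, the sketch below (in Python, using the standard hashlib library) shows how exact duplicates can be identified by MD5 fingerprint. It is a simplified example under assumed names rather than the workings of any particular review platform; the ‘collection’ folder is purely hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the 32-character hexadecimal MD5 digest of a file's bytes."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(paths):
    """Group files that share an identical MD5 'fingerprint'."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_fingerprint(path)].append(path)
    # A fingerprint seen more than once marks a set of exact duplicates.
    return {h: files for h, files in groups.items() if len(files) > 1}


if __name__ == "__main__":
    # 'collection' is a hypothetical folder of collected documents.
    files = [p for p in Path("collection").rglob("*") if p.is_file()]
    for fingerprint, dupes in group_duplicates(files).items():
        print(fingerprint, [str(p) for p in dupes])
```

Files with byte-for-byte identical content will always share the same fingerprint, so only one copy from each group needs to go forward for review.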

For emails, the process is slightly more complex. This is because when email messages are stored as files, their byte content may vary, causing their fingerprints to differ. This can cause difficulties for even the most astute of detectives. However, the software has an inventive system for dealing with this issue. For emails, the MD5 hash values are instead based on the ‘to’, ‘from’, ‘cc’, ‘subject’ and ‘body text’ fields, without reference to spaces or attachment data – allowing them to be ‘fingerprinted’ with ease. Once all the relevant electronic documents have been given their fingerprint, the duplicates can be identified and removed.
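A simplified sketch of this approach is set out below. The field order, separator and normalisation rules shown are assumptions made for illustration; each review platform applies its own canonical form.

```python
import hashlib


def email_fingerprint(to: str, sender: str, cc: str, subject: str, body: str) -> str:
    """Fingerprint an email on its 'to', 'from', 'cc', 'subject' and body text,
    ignoring whitespace and letter case and taking no account of attachments."""
    def normalise(value: str) -> str:
        # Remove all whitespace so differences in how the message was stored do not matter.
        return "".join(value.split()).lower()

    canonical = "|".join(normalise(field) for field in (to, sender, cc, subject, body))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Two exports of the same message yield the same fingerprint even though the
# stored files differ at byte level (different line endings and spacing).
a = email_fingerprint("alice@example.com", "bob@example.com", "", "Draft contract",
                      "Please see the attached draft.\r\n")
b = email_fingerprint("alice@example.com", "bob@example.com", "", "Draft contract",
                      "Please see the attached  draft.\n")
assert a == b
```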

The simpler MD5 approach is not, however, appropriate in all situations, and requests for customised de-duplication are increasingly common. For example, it is not unusual for BlackBerry and smartphone devices, or certain email archiving systems, to automatically insert non-relevant additional lines of text, such as confidentiality statements, which prevent otherwise identical emails from being identified as duplicates. However, wherever there is a problem, there is a solution. In such circumstances, bespoke programming code can be created which ignores the non-relevant text and captures these otherwise identical documents, ensuring a more effective de-duplication result.
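The sketch below illustrates the idea: strip the agreed boilerplate before hashing, so that only the remaining text determines the fingerprint. The patterns shown are hypothetical examples; in practice they would be drafted against the boilerplate actually found in the collected data.

```python
import hashlib
import re

# Hypothetical patterns for automatically inserted, non-relevant text.
BOILERPLATE_PATTERNS = [
    re.compile(r"sent from my blackberry.*", re.IGNORECASE | re.DOTALL),
    re.compile(r"this e-?mail (and any attachments )?is confidential.*",
               re.IGNORECASE | re.DOTALL),
]


def bespoke_fingerprint(body: str) -> str:
    """Hash the body text after removing non-relevant boilerplate, so that
    otherwise identical emails are still recognised as duplicates."""
    for pattern in BOILERPLATE_PATTERNS:
        body = pattern.sub("", body)
    normalised = "".join(body.split()).lower()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()
```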

The de-duplication process can also be used to group near duplicates together. For example, multiple versions of a nearly completed contract may exist, with only minor grammatical changes. It takes only a little detective work to note the probability that if one such contract is relevant to solving a case then the others will also be of importance. A vast amount of time can therefore be saved by reviewing one document and then assigning its relevance to the entire group.
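The sketch below shows one simple way of grouping near duplicates, using a text-similarity ratio from Python's standard difflib module. The 0.9 threshold is an assumed figure for illustration; in effect, it is the ‘definition of near’ discussed below.

```python
from difflib import SequenceMatcher


def near_duplicate_groups(documents: dict, threshold: float = 0.9) -> list:
    """Group documents whose extracted text is at least `threshold` similar.
    The threshold is, in effect, the agreed definition of 'near'."""
    names = list(documents)
    assigned = set()
    groups = []
    for i, first in enumerate(names):
        if first in assigned:
            continue
        group = [first]
        assigned.add(first)
        for other in names[i + 1:]:
            if other in assigned:
                continue
            ratio = SequenceMatcher(None, documents[first], documents[other]).ratio()
            if ratio >= threshold:
                group.append(other)
                assigned.add(other)
        groups.append(group)
    return groups


# Hypothetical example: two contract drafts differing only in punctuation
# group together, while the board minutes stand alone.
versions = {
    "contract_v1.docx": "The Supplier shall deliver the goods within 30 days.",
    "contract_v2.docx": "The Supplier shall deliver the goods within 30 days;",
    "board_minutes.docx": "Minutes of the meeting held on 1 April 2015.",
}
print(near_duplicate_groups(versions))
```

A reviewer can then code one document in each group and apply that decision across the rest of the group.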

This solution is not without its drawbacks, however. Near de-duplication can be a helpful tool, but agreeing the definition of “near” is not an open-and-shut case. Should near de-duplication only capture minor grammatical changes, such as a comma being replaced by a semi-colon, or should it extend to documents with additional text? The answer will almost always depend on the circumstances of the review and the documents being searched.

De-duplication and keyword searching can save time and money at multiple points in the discovery process. De-duplication is a simple, but integral, part of the e-discovery process. As Sherlock Holmes once said, one cannot “make bricks without clay”: effectively collected data is required to build a case, and de-duplication is a key part of that effective data collection.

If we can help with your disclosure issues please contact John Mackenzie, Guy Harvey or Hayley Pizzey. You may also find the answer to some of your questions in our e-discovery brochure.
