Focus on concept clustering

In the next article in our e-discovery series, Hayley Pizzey looks at ‘Concept Clustering’, which is becoming more widely used as a means of improving the time and cost efficiency of reviewing large quantities of electronically stored information (ESI) during the e-discovery process.

16th July 2015

Our last article discussed the use of communications mapping as part of a wider early case assessment procedure. Our focus now turns to another tool, ‘Concept Clustering’, which is becoming more widely used as a means of improving the time and cost efficiency of reviewing large quantities of electronically stored information (ESI).

The initial phase of the concept clustering process involves the production of a ‘concept wheel’. An example of this, from digital review platform Stroz Friedberg, is shown below:

The concept wheel is produced by the platform reading and interpreting the entire body of every document in the review sample, looking for words that appear with high frequency. This provides an automatically generated set of keywords, which accelerates the e-discovery process because the user doesn’t have to devise their own keywords and input them into the system.
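As a rough illustration of the frequency-based step described above, the following Python sketch counts words across a small set of documents and returns the most common ones. It is a simplified assumption of how such a step might work, not the method used by any particular review platform; the function name `top_keywords`, the `STOP_WORDS` list and the sample documents are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real platform would use a far larger one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "we", "our"}


def top_keywords(documents, n=3):
    """Return the n most frequent non-stop words across all documents."""
    counts = Counter()
    for text in documents:
        # Lower-case the text and split it into simple word tokens.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]


# Hypothetical sample documents, for illustration only.
docs = [
    "The invoice for the apples is overdue.",
    "Our supplier sent green apples and a revised invoice.",
    "Red apples arrived; the invoice will follow.",
]
print(top_keywords(docs, n=2))
```

In this sample, ‘invoice’ and ‘apples’ each appear in all three documents, so they surface as the generated keywords without the user having to supply them.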

The next stage of the process is where ‘concept clusters’ are generated. The artificial intelligence in the system creates these clusters based on words and phrases that appear most often beside the generated keywords. This can identify and dismiss large swathes of documents irrelevant to the review process and cut down the time required to review the entire data set. Furthermore, the visual and interactive nature of the tool, combined with the intuitiveness of the platform, makes it extremely easy to use. By just clicking on a particular cluster, the user can drill down into specific documents to obtain a clear picture of the number of documents a particular set of search terms will yield.

The further from the middle of the wheel, the more specific the clusters become. This allows for very quick identification of which documents may or may not require further review. Moving further from the centre, the intelligent search engine comes up with additional keywords based on their frequency and the relationship they have with the original keywords. As well as aiding the review process, this artificial intelligence also enables users to dismiss large tranches of documents that contain keywords, but in the wrong context, for example if the word is contained in spam. This removes one of the issues of using only a simple linear keyword search approach, which normally returns a large number of false positives. False positives are especially prevalent where one of the search terms is very generic, or has more than one meaning.

The tool is also particularly useful when negotiating the scope of a document review with litigation adversaries. As the intelligent system will produce a much more extensive keyword list than individuals would on their own, it will reduce the time spent bartering with the other side over what should and should not be searched for.

For example, a search for ‘apples’ may eventually cluster into searches for ‘green apples’ and ‘red apples’. If you have no interest in red apples, you can instantly dismiss the documents that contain that search term. Furthermore, if the ‘green apples’ cluster is much larger or smaller than expected, this knowledge can be used as part of the greater early case assessment procedure. It facilitates a better understanding of the number of documents that will require a human review and gives a clearer perspective of the resources required.
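The ‘apples’ example can be sketched in Python as follows. This is a deliberately simplified assumption, grouping documents only by the single word that appears immediately before the keyword; commercial platforms use far more sophisticated analysis. The function name `cluster_by_neighbour` and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def cluster_by_neighbour(documents, keyword):
    """Group document indexes by the word immediately preceding `keyword`."""
    clusters = defaultdict(list)
    for index, text in enumerate(documents):
        words = re.findall(r"[a-z']+", text.lower())
        # Walk through adjacent word pairs, e.g. ("green", "apples").
        for prev, word in zip(words, words[1:]):
            if word == keyword:
                clusters[f"{prev} {keyword}"].append(index)
    return dict(clusters)


# Hypothetical sample documents, for illustration only.
docs = [
    "Please ship the green apples today.",
    "The red apples were bruised.",
    "Another crate of green apples is due.",
]
clusters = cluster_by_neighbour(docs, "apples")

# Dismiss the cluster that is of no interest, then list what is left to review.
clusters.pop("red apples", None)
remaining = sorted({i for ids in clusters.values() for i in ids})
print(clusters, remaining)
```

Here the first and third documents fall into the ‘green apples’ cluster and the second into ‘red apples’. Dismissing the latter leaves two documents for human review, and the size of each cluster gives an early indication of the review effort involved.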

Further inspection of each cluster reveals the relevant documents. These can then be sorted into individual folders, either to be discarded or to be lined up for further review.

If we can help with your disclosure issues please contact John Mackenzie, Guy Harvey or Hayley Pizzey.
