Interrogation of big data

In the second article in the series providing pointers for making sense of Big Data, Hayley Pizzey of Shepherd and Wedderburn discusses the methods and tools that can be used to interrogate data sets.

18th February 2015

A bone of contention for both claimant and defendant during litigation is often the cost of searching and reviewing substantial volumes of complex data (aka ‘Big Data’). This has significantly fuelled demand for, and development of, software tools capable of effective and efficient data analysis and interrogation.

The Home Office’s report eDiscovery in Digital Forensic Investigation states that:

 …an ideal tool to support the needs of both the technical and investigative elements of digital investigations does not appear to exist. However, the tools assessed did meet many of the key requirements and could be a significant part of a combined solution.

In this ever-growing market, Shepherd and Wedderburn has identified, and can offer, some of the key tools and search methods that help separate the wheat from the chaff.

E-disclosure works by collecting sources of information and documentation, extracting the data, indexing it and placing it into a database. One of the simplest interrogation tools is de-duplication, which lets the user identify duplicate documents (such as cc’d emails) or even near-duplicates (such as forwarded emails or slightly amended drafts of documents) and remove them from the data set. This can drastically reduce the amount of data that would otherwise be earmarked for manual review.
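As a rough illustration of the de-duplication idea described above, the following Python sketch removes exact duplicates by hashing normalised text and removes near-duplicates using a similarity ratio. The function names, the 0.9 threshold and the choice of `difflib` are assumptions made for this example; commercial e-disclosure tools use their own methods, which are not described here.

```python
import hashlib
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def deduplicate(documents, near_threshold=0.9):
    """Return the documents with duplicates removed, keeping first occurrences.

    Exact duplicates are caught by comparing hashes of the normalised text.
    Near-duplicates are caught when the similarity ratio against an already
    kept document reaches near_threshold (an illustrative value).
    """
    kept = []
    seen_hashes = set()
    for doc in documents:
        norm = normalise(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate, e.g. a cc'd copy of the same email
        is_near_duplicate = any(
            SequenceMatcher(None, norm, normalise(other), autojunk=False).ratio()
            >= near_threshold
            for other in kept
        )
        if is_near_duplicate:
            continue  # e.g. a slightly amended draft
        seen_hashes.add(digest)
        kept.append(doc)
    return kept


emails = [
    "Please review the attached contract before Friday.",
    "Please review the attached contract before Friday.",
    "Please review the attached contract before Friday!",
    "Lunch at noon?",
]
unique = deduplicate(emails)
```

Comparing every document against every kept document is slow on large data sets, so this pairwise approach is only suitable for showing the principle.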

Keyword searching

Another frequently used method is keyword searching. This ranges from single word searches through to more complex Boolean logic searches, which combine search terms using AND, OR and NOT to refine the results, for example by including or excluding documents that contain certain words. Fuzzy searching can also be implemented to identify misspellings (for example ‘calender’ instead of ‘calendar’).
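The Boolean and fuzzy searches described above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the function names, the `all_of`/`any_of`/`none_of` parameters (standing in for AND, OR and NOT) and the 0.8 similarity cutoff are all hypothetical, and real review platforms offer much richer query syntax.

```python
import re
from difflib import get_close_matches


def tokens(text):
    """Split text into a set of lower-case words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def boolean_match(text, all_of=(), any_of=(), none_of=()):
    """Return True if the text satisfies the AND, OR and NOT conditions."""
    words = tokens(text)
    if not all(w.lower() in words for w in all_of):      # AND
        return False
    if any_of and not any(w.lower() in words for w in any_of):  # OR
        return False
    if any(w.lower() in words for w in none_of):         # NOT
        return False
    return True


def fuzzy_match(text, keyword, cutoff=0.8):
    """Return True if any word in the text is close to the keyword."""
    candidates = list(tokens(text))
    return bool(get_close_matches(keyword.lower(), candidates, n=1, cutoff=cutoff))


# "contract" AND NOT "draft"
signed = boolean_match("The contract was signed in March",
                       all_of=["contract"], none_of=["draft"])
draft = boolean_match("Draft contract attached",
                      all_of=["contract"], none_of=["draft"])

# The misspelling "calender" is still found when searching for "calendar"
misspelt = fuzzy_match("See the calender invite", "calendar")
```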

Concept searching

Use can also be made of the more developed concept searching tool, which attempts to understand the concept or context being conveyed rather than matching a specific set of letters. For example, a keyword search for ‘gun’ might return documents containing ‘gun’ and ‘guns’, whereas a concept search may also return documents containing ‘shooter’, ‘piece’ or ‘sawn-off’. This is extremely helpful when several different words share a meaning, or when a word has more than one meaning, such as ‘bank’, which could mean a financial institution or the side of a river.
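To make the contrast with keyword searching concrete, here is a deliberately simplified Python sketch in which a hand-built table of related terms stands in for the concept engine. The `CONCEPTS` table and the `concept_search` function are hypothetical. Real concept search tools typically learn such relationships from the documents themselves rather than from a fixed list, and can use context to tell the different senses of a word apart, which this sketch cannot do.

```python
import re

# Hypothetical, hand-built concept table using the example from the text.
CONCEPTS = {
    "gun": {"gun", "guns", "shooter", "piece", "sawn-off"},
}


def concept_search(documents, term, concepts=CONCEPTS):
    """Return documents containing the term or any term related to it."""
    related = concepts.get(term.lower(), {term.lower()})
    hits = []
    for doc in documents:
        # Keep hyphenated words such as "sawn-off" together as one token.
        words = set(re.findall(r"[a-z]+(?:-[a-z]+)*", doc.lower()))
        if words & related:
            hits.append(doc)
    return hits


docs = [
    "He kept a sawn-off under the seat",
    "The guns were never found",
    "Meeting moved to Tuesday",
]
hits = concept_search(docs, "gun")
```

Note that this naive version would also return a document about a ‘piece of cake’, which is exactly the kind of ambiguity a genuine concept search is meant to resolve.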

Predictive coding

Predictive coding begins with a manual review by an “expert reviewer” who knows the case well and knows the types of documents sought and the issues in dispute. The reviewer examines a small selection of the documents, enough to provide a statistically reliable sample, and marks each as either relevant or irrelevant. Using that sample and the resultant algorithm, the system then searches the remaining documents and extracts those it judges to be relevant. Predictive coding tools are typically cheaper, faster and more accurate than manual document review methods.
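The workflow above (a reviewer labels a sample, the system learns from it and then classifies the rest) can be illustrated with a small text classifier. The sketch below uses a naive Bayes model written in plain Python purely as a stand-in; the article does not say which algorithms predictive coding products use, and the function names and sample documents are invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(labelled):
    """Learn word frequencies from (text, is_relevant) pairs marked by the reviewer."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in labelled:
        doc_counts[is_relevant] += 1
        word_counts[is_relevant].update(tokenize(text))
    if doc_counts[True] == 0 or doc_counts[False] == 0:
        raise ValueError("sample needs both relevant and irrelevant documents")
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def relevance_log_odds(model, text):
    """Positive values mean the document looks more relevant than irrelevant."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in the sample
                # Add-one smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return scores[True] - scores[False]


def predict_relevant(model, documents):
    return [d for d in documents if relevance_log_odds(model, d) > 0]


# Invented sample marked by the "expert reviewer".
sample = [
    ("Payment under the supply contract is overdue", True),
    ("Breach of the supply contract terms", True),
    ("Team lunch on Friday", False),
    ("Office closed for the holiday", False),
]
model = train(sample)
unreviewed = ["The contract payment was late", "Holiday lunch plans"]
likely_relevant = predict_relevant(model, unreviewed)
```

A sample of four documents is far too small to be statistically reliable; it is used here only so the example stays readable.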


One further tool, not yet widely available, detects emotion. This is likely to be used as a more proactive tool to help detect bad behaviour and non-compliance, but it could also assist with a reactive e-disclosure exercise by helping to refine results and reduce the number of documents subject to manual review.

Extracting digital information from devices requires the appropriate tools, training and experience. It is only with a strong understanding of the tools and their capabilities that an organisation can make the right cost-benefit decision in e-disclosure.

If we can help with your disclosure issues please contact John Mackenzie or Guy Harvey. You may also find the answer to some of your questions in our e-discovery brochure.
