Use of "predictive coding" in eDiscovery document review…best friend or job replacement?
Predictive coding (also known as algorithm-assisted "text categorization") refers to the use of a software program to identify documents that are relevant or responsive to a particular case or issue, based on a review of test documents (organized into "seed sets," "validation sets," and "training sets") by lawyers and subject matter experts. The computer-assisted methodology involves a machine-learning process and a combination of different algorithmic tools.
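To make the mechanics concrete, here is a minimal, hypothetical sketch of the kind of text categorization that underlies predictive coding: a toy classifier is trained on a few lawyer-coded documents and then scores an unreviewed document. The documents, labels, and scoring approach (a simple Naive Bayes word model) are all invented for illustration; commercial TAR tools use far more sophisticated algorithms.

```python
# Toy illustration of predictive coding's core idea: a classifier is
# trained on lawyer-coded "training set" documents, then scores the
# remaining population. All documents and labels here are invented.
from collections import Counter
import math

def train(docs_with_labels):
    """Count word frequencies per class (a tiny Naive Bayes model)."""
    counts = {"responsive": Counter(), "non_responsive": Counter()}
    totals = Counter()
    for text, label in docs_with_labels:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def score(text, counts, totals):
    """Return the more likely class for an unreviewed document."""
    vocab = set(counts["responsive"]) | set(counts["non_responsive"])
    best, best_lp = None, float("-inf")
    for label in counts:
        lp = math.log(totals[label] / sum(totals.values()))
        n = sum(counts[label].values())
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score
            lp += math.log((counts[label][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# A hypothetical four-document training set coded by reviewers:
training_set = [
    ("merger agreement draft attached", "responsive"),
    ("pricing terms for the acquisition", "responsive"),
    ("lunch order for the office party", "non_responsive"),
    ("holiday schedule reminder", "non_responsive"),
]
counts, totals = train(training_set)
print(score("draft acquisition agreement terms", counts, totals))  # responsive
```

In a real matter, the model would score hundreds of thousands of documents and rank them by predicted responsiveness rather than returning a single label.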
This method of assisting counsel in searching, culling, and categorizing documents is considered one of the most important developments in the eDiscovery industry. In fact, it is so significant that some insiders believe the technology will eventually replace the jobs of lawyers executing document review.[1]
While using algorithms can, in many situations, make finding the proverbial "needle in a haystack" much more efficient, it is our position that the methodology will continue to be a super-charged assistant to, rather than a replacement for, the lawyer review team.
The more likely future of eDiscovery with greater technology-assisted review (TAR) is one where the standards of document review are raised and the parameters of how much eDiscovery is considered rational and proportionate to a case expand significantly, owing to increased efficiencies.
Supporting the position that TAR is an enhancement rather than a replacement is Da Silva Moore v. Publicis Groupe (2012), the first and most-cited case on the use of TAR. In it, Judge Peck affirms that lawyers must be part of the process, stating that "[lawyers]…can help cull extraneous documents from a set for review and thus enrich the set of documents used to train predictive coding technology." Peck further endorses TAR, explaining that "[it]…can help target specific concepts that might not turn up in lawyer random sampling, which can ensure a more comprehensive review."
Since 2012, other cases have emerged that give lawyer reviewers more reason to fear for their jobs. The most significant was Federal Housing Finance Agency v. HSBC (2014), where Judge Denise Cote stated, "Predictive coding had a better track record in the production of responsive documents than human review." Further supporting the HSBC case was Good v. American Water Works, published at the end of 2014, where Judge John Copenhaver stated that predictive coding may be used in determining privileged documents and/or content.
While the above are all US decisions, in early 2016 the UK courts began to follow suit. In Pyrrho Investments Ltd. v. MWB Property Ltd., Master Matthews turned to the disclosure rules set forth in Practice Direction 31B, supporting the use of "automated methods of searching if a full review of each and every document would be unreasonable." He also noted that "whether it would be right for approval to be given in other cases will, of course, depend upon the particular circumstances obtaining in them." In the Pyrrho case, however, the parties' consensus regarding proportionality, efficacy, and suitability was the key consideration.
Finally, in May of 2016, TAR was again supported in a Commonwealth case described in a report by Berwin Leighton Paisner (BLP), in which the petitioner sought a buy-out of his minority shareholding. The respondents contested the allegations and the petitioner's suggested valuation. Nevertheless, the parties reached agreement on most directions in advance of the first Case Management Conference. The respondent possessed the vast majority of the potentially relevant documents, approximately 500,000. The sticking point, according to BLP, was the most proportionate and appropriate approach to disclosure. The petitioner's solicitors wanted to adopt a linear review approach using an agreed-upon list of custodians and search terms. BLP, which represented the respondent, asserted that the costs of this approach would be excessive and that TAR could achieve "super results…at a more proportionate cost." The court agreed and ordered that TAR be used by the respondent, following the respondent's solicitors' arguments referring to the relevant passages and factors outlined by Master Matthews in the Pyrrho case in support of the use of TAR.
As shown above, in the past few years the use of predictive coding has been increasingly advocated by corporate counsel and supported by judicial decisions, primarily for its efficiencies. That said, it is also well established that a substantial amount of linear document review by lawyers or subject matter experts is needed to effectively and accurately train the predictive coding algorithm.
Unfortunately, it is not entirely possible to proclaim a "general rule" on how large the "training set" needs to be. While there is a relationship between the training-set size and the total number of documents in the population of a given case, the more relevant determinant of size is the "complexity of the categorization problem at hand."[2] Stated another way, the more relevancy issues in play, and the more words and phrases within each issue that must be codified as relevant or not relevant, the larger the training sets will need to be.
So…when should you consider using predictive coding? There is no simple answer. However, some broad parameters are emerging, in terms of total document volume and training-set sizes, that provide general guidance.
From a technical standpoint, a training set as small as 500 documents can provide very precise results if the categorization is exceptionally simple. On the other hand, if the categorization is exceptionally complex, a training set of 30,000 may still be too small to provide the desired level of confidence.[3] It should be noted, though, that predictive coding is not a one-step process, and even the most senior experts have difficulty agreeing on the "determination of seed sets (random, judgmental, mix), layering search terms, and the best/most accurate analytic and coding methodology."[4] Developing the training set is an iterative process that can take "as few as three generations or as many as forty-five."[5] Furthermore, each relevancy issue has a binary decision tree; therefore, many separate issues increase complexity, and training-set sizes must grow significantly as a result.
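The iterative, multi-generation nature of that training can be sketched as a loop: each generation, reviewers code another batch of documents, the model is retrained on everything coded so far, and the process stops once predictions on a fixed control set stabilize. The helper functions and documents below are invented stand-ins for a vendor's actual tooling, shown only to illustrate the workflow.

```python
# Hedged sketch of the "generations" training workflow. train_model and
# predict are deliberately trivial stand-ins for real TAR algorithms.

def train_model(coded_docs):
    # Stand-in model: the set of words seen in responsive documents.
    vocab = set()
    for text, label in coded_docs:
        if label == "responsive":
            vocab.update(text.lower().split())
    return vocab

def predict(model, text):
    return "responsive" if set(text.lower().split()) & model else "non_responsive"

def iterate_training(batches, control_set, max_generations=45):
    """Retrain each generation; stop when control-set predictions settle."""
    coded, prev, generation = [], None, 0
    for batch in batches[:max_generations]:
        generation += 1
        coded.extend(batch)                        # lawyers code this batch
        model = train_model(coded)                 # retrain on all coded docs
        current = [predict(model, doc) for doc in control_set]
        if current == prev:                        # predictions stabilized
            return generation, current
        prev = current
    return generation, prev

# Invented batches of lawyer-coded documents and a tiny control set:
batches = [
    [("merger terms", "responsive"), ("lunch menu", "non_responsive")],
    [("merger draft", "responsive")],
    [("picnic plan", "non_responsive")],
]
control = ["merger update", "team picnic"]
generations, predictions = iterate_training(batches, control)
print(generations, predictions)  # stabilizes after 2 generations here
```

Real matters stop on statistical measures such as recall and precision against the control set, not simple label agreement, but the loop structure is the same.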
With that said, from a practical standpoint, cases that utilize predictive coding typically involve total document volumes of more than 500,000, with the average case involving more than a million documents. This large threshold has historically held due, at least in part, to the significant cost of the predictive coding software. The average training set ranges from about 7,000 to 12,000 documents, but due to the iterative nature of the process, the training set could potentially grow much larger.
In any case, the average cost savings from the use of predictive coding range from 30% to 80%, based on the ability of lawyer reviewers to forgo the review of thousands to hundreds of thousands of non-responsive documents.
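The source of those savings is simple arithmetic: every document the algorithm confidently culls as non-responsive is a document no reviewer bills time on. A back-of-the-envelope sketch, with invented volumes and a uniform per-document review cost:

```python
# Illustrative only: where the quoted 30-80% savings come from.
# Volumes, cull counts, and costs below are hypothetical.

def review_savings(total_docs, docs_culled_by_tar, cost_per_doc=1.0):
    """Fraction of linear-review cost avoided by skipping culled docs."""
    linear_cost = total_docs * cost_per_doc
    tar_cost = (total_docs - docs_culled_by_tar) * cost_per_doc
    return (linear_cost - tar_cost) / linear_cost

# A hypothetical one-million-document matter where TAR culls 600,000:
print(f"{review_savings(1_000_000, 600_000):.0%}")  # 60%
```

With a uniform per-document cost the savings fraction is just the culled share of the population; in practice, training and validation costs eat into the margin somewhat.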
Further, the time saved by delegating review to the trained algorithm frees lawyer reviewers to perform more sophisticated analysis and higher-order tasks that ensure sound case strategy. For example, lawyers are still required to perform the higher-level review of documents prioritized by the system. Typically, these are the documents most likely to be critical to the dispute or protected by attorney-client or work-product privilege.
Moreover, subject matter experts and lawyers help choose the right keywords to maximize the return of responsive results while minimizing the likelihood of overlooking important variants or other related terms. Additionally, statisticians can validate the reliability and quality of search results by sampling throughout the process and demonstrating that the process is consistent, as necessary to ensure its defensibility. Lastly, forensic technology specialists can help guide lawyers in using the most effective types of review (search and cull) platforms.
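The statistician's sampling step can be illustrated with the standard normal-approximation sample-size formula: the number of randomly drawn documents that must be checked to estimate, say, an error rate to within a chosen margin. The default values below are illustrative, not a prescription for any particular matter.

```python
# Sketch of validation sampling: how many documents to randomly check
# to bound the estimate within +/- margin at a given confidence level.
import math

def sample_size(confidence_z=1.96, margin=0.02, p=0.5):
    """n = z^2 * p * (1 - p) / margin^2, rounded up.
    p = 0.5 is the worst-case (most conservative) assumed proportion;
    z = 1.96 corresponds to 95% confidence."""
    return math.ceil(confidence_z**2 * p * (1 - p) / margin**2)

print(sample_size())              # about 2,401 docs for +/-2% at 95%
print(sample_size(margin=0.05))   # about 385 docs for +/-5% at 95%
```

Note that the required sample is driven by the margin and confidence level, not by the size of the document population, which is why validation remains affordable even on million-document matters.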
In summary, TAR (or more precisely, "predictive coding") may be able to replicate, and in some cases surpass, basic first-pass document review functions that entail tagging vast quantities of documents for relevancy and sorting them into legal-issue categories. However, it is doubtful that this technology can, in the foreseeable future, replace the knowledge, frame of reference, and expertise of seasoned lawyers and legal technology professionals, who continue to be required to help manage the discovery process to a successful result.
[1] Charles Yablon & Nick Landsman-Roos, "Predictive Coding: Emerging Questions and Concerns," South Carolina Law Review, Vol. 64, p. 2, Spring 2013.
[2] Ali Hadjarian, "An empirical analysis of the training and feature set size in text categorization for e-Discovery," Deloitte Financial Advisory Services, Washington DC, white paper published for an ICAIL predictive coding standards workshop, June 14, 2013, Rome, Italy.
[3] Ali Hadjarian, Deloitte Financial Advisory Services, Washington DC, interview, June 12, 2015.
[4] Pearle, "Ready for the Matrix? The Promise and Limitations of Predictive Coding," ACEDS, June 2014.
[5] Ibid.
*The author is the Senior Legal Consultant of Deloitte Legal.
Working in China since 2002, he has unique cultural depth combined with experience in compliance, policy, and fraud investigations utilizing digital forensics and electronic discovery. Key projects have involved FCPA and other regulatory compliance.
He has a legal and commercial background with more than twenty years of consulting with MNCs and Chinese organizations, focusing on compliance and business issues. Industries include insurance, IT, pharmaceutical, manufacturing, and retail. He has eight years of team management and P&L responsibility.
He is proficient in Mandarin Chinese (spoken as well as reading and writing), and earned a JD (Juris Doctor law degree), CFE (Certified Fraud Examiner) credential, and an HK mediation license while working in Hong Kong, Beijing, and Shanghai.