Use of "predictive coding" in eDiscovery document review…best friend or job replacement?
By Mark Schroeder | 2017/1/22 21:31:24

Predictive coding (also known as algorithm-assisted "text categorization") refers to the use of a software program to identify documents that are relevant or responsive to a particular case or issue, based on a review of test documents (a population of "seed sets", "validation sets" and "training sets") by lawyers and subject matter experts. The computer-assisted methodology involves a machine learning process and a combination of different algorithmic tools.
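
To make the idea of learning from lawyer-coded documents concrete, the following is a minimal sketch of one common text categorization technique, a naive Bayes classifier, written with only the Python standard library. It is not the method of any particular eDiscovery product; the function names, the labels "responsive" and "not_responsive", and the four-document seed set are all hypothetical and chosen purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Count words per label from (text, label) pairs coded by reviewers."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # prior from label frequency
        for word in tokenize(text):
            # Add-one smoothing so an unseen word does not rule out a label.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical seed set: documents already coded by the review team.
seed_set = [
    ("payment to consultant for government approval", "responsive"),
    ("gift for official before tender decision", "responsive"),
    ("lunch menu for friday team outing", "not_responsive"),
    ("office printer is out of toner again", "not_responsive"),
]
model = train(seed_set)
prediction = predict(model, "consultant payment before the tender")
print(prediction)
```

In practice the seed set would contain hundreds or thousands of coded documents rather than four, and commercial tools typically combine several algorithms, but the basic flow is the same: reviewers code examples, the software learns word patterns, and the model then scores uncoded documents.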

This method of assisting counsel in searching, culling and categorizing documents is considered to be one of the most important developments in the eDiscovery industry. In fact, it is so significant that some insiders believe the technology will eventually replace the jobs of lawyers executing document review.[1]

While using algorithms can, in many situations, make finding the proverbial "needle in a haystack" much more efficient, it is our position that the methodology will continue to be more of a super-charged assistant to, rather than a replacement for, the lawyer review team.

The more likely future of eDiscovery under wider use of technology-assisted review (TAR) is one in which the standards of document review rise and, because of the added efficiency, the volume of eDiscovery considered rational and proportionate to a case increases significantly.

Support for the position that TAR is an enhancement rather than a replacement comes from Da Silva Moore v. Publicis Groupe (2012), the first and most cited case on the use of TAR. In it, Judge Peck confirms that lawyers must be part of the process, stating, "[lawyers]…can help cull extraneous documents from a set for review and thus enrich the set of documents used to train predictive coding technology." However, Peck further endorses TAR, explaining that "[It]…can help target specific concepts that might not turn up in lawyer random sampling, which can ensure a more comprehensive review."

Since 2012, other cases have emerged which provide more reason for lawyer reviewers to fear for their jobs. The most significant was Federal Housing Finance Agency v. HSBC (2014), where Judge Denise Cote stated, "Predictive coding had a better track record in production of responsive documents than human review". Further supporting the HSBC case was Good v. American Water Works, published at the end of 2014, where Judge John Copenhaver stated that predictive coding may be used in determining privileged documents and/or content.

While the above are all US decisions, in early 2016 the UK courts began to lend their support as well. In Pyrrho Investments Ltd. v. MWB Property Ltd., Master Matthews turned to the disclosure rules set forth in Practice Direction 31B in support of its use, citing “automated methods of searching if a full review of each and every document would be unreasonable.” He also noted “whether it would be right for approval to be given in other cases will, of course, depend upon the particular circumstances obtaining in them”. In the Pyrrho case, however, the consensus of the parties regarding proportionality, efficacy, and suitability was the key consideration.

Finally, in May of 2016 TAR was again supported in a Commonwealth case, as reported by Berwin Leighton Paisner (BLP), in which the petitioner sought a buy-out of his minority shareholding. The respondents contested the allegations and the petitioner’s suggested valuation. Nevertheless, the parties reached agreement on most directions in advance of the first Case Management Conference. The respondent possessed the vast majority of the potentially relevant documents, approximately 500,000. The sticking point, according to BLP, was the most proportionate and appropriate approach to disclosure. The petitioner’s solicitors wanted to adopt a linear review approach using an agreed-upon list of custodians and search terms. BLP, which represented the respondent, asserted that the costs of this approach would be excessive and that TAR could achieve “super results…at a more proportionate cost.” The court agreed and ordered that TAR be used by the respondent, after the respondent’s solicitors referred the court to the relevant passages and factors outlined by Master Matthews in the Pyrrho case supporting the use of TAR.

As shown above, in the past few years the use of predictive coding has been increasingly advocated for by corporate counsel and supported by judicial cases, primarily for its efficiencies. That said, it is also well-established that a substantial amount of linear document review by lawyers or subject matter experts is needed to effectively and accurately train the predictive coding algorithm.

Unfortunately, it is not entirely possible to proclaim a "general rule" on how large the "training set" needs to be. While there is a relationship between the training set size and the total number of documents in the population of a given case, the more relevant determinant of size is the "complexity of the categorization problem at hand".[2] Stated another way, the more relevancy issues there are, and the more words and phrases coded as relevant or not relevant within each issue, the larger the training sets will need to be.

So…when should you consider using predictive coding? There is no simple answer. However, some broad parameters are emerging, in terms of total document volume and training set sizes, that provide general guidance.

From a technical standpoint, a training set can be as small as 500 documents and still provide very precise results if the categorization is exceptionally simple. On the other hand, if the categorization is exceptionally complex, a training set of 30,000 may still be too small to provide the desired level of confidence.[3] It should be noted, though, that predictive coding is not a one-step process, and even the most senior experts have difficulty agreeing on "determination of seed sets (random, judgmental, mix), layering search terms, and the best/most accurate analytic and coding methodology".[4] Developing the training set is an iterative process that can take "as few as three generations or as many as forty-five".[5] Furthermore, each relevancy issue has a binary decision tree. Therefore, many separate issues increase complexity, and training set sizes then need to be significantly increased.
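
The iterative nature of training can be illustrated with a simple stopping rule: keep adding coded batches until the model's measured quality stops improving. The sketch below is one possible rule, not an industry standard; the function name, the thresholds `min_gain` and `patience`, and the per-generation scores are all hypothetical.

```python
def rounds_until_stable(scores, min_gain=0.01, patience=2):
    """Return the 1-based generation at which training would stop.

    scores   -- quality measured on a validation set after each generation
    min_gain -- smallest improvement that still counts as progress (assumed)
    patience -- consecutive stalled generations tolerated before stopping (assumed)
    Returns None if the scores never stall for `patience` generations in a row.
    """
    stalled = 0
    for i in range(1, len(scores)):
        gain = scores[i] - scores[i - 1]
        stalled = stalled + 1 if gain < min_gain else 0
        if stalled >= patience:
            return i + 1
    return None


# Hypothetical validation scores after each training generation.
generation_scores = [0.62, 0.71, 0.78, 0.80, 0.805, 0.808, 0.809]
stop_at = rounds_until_stable(generation_scores)
print(stop_at)
```

With a complex, multi-issue matter the scores would typically plateau much later, which is one way to picture why some projects need three generations and others dozens.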

With that said, from a practical standpoint, cases that utilize predictive coding typically involve total document volumes of more than 500,000, with the average case involving more than a million documents. This large number has historically been the case due at least in part to the significant cost of the predictive coding software. The average training set will range from about 7,000 to 12,000 documents, but due to the iterative nature of the process the size of the training set could potentially be much larger.

In any case, the average cost savings from the use of predictive coding range from 30% to 80%, based on the ability of lawyer reviewers to forgo the review of thousands to hundreds of thousands of non-responsive documents.
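
As a worked illustration of the 30% to 80% range, the short sketch below applies those two bounds to a baseline review cost. Only the percentages come from the text; the document count, the cost of 1 currency unit per document, and the function name are hypothetical placeholders, not market figures.

```python
def review_cost_range(total_docs, cost_per_doc, low_pct=30, high_pct=80):
    """Return (baseline, best_case, worst_case) review cost.

    low_pct and high_pct are the savings bounds quoted in the text.
    Integer percentages keep the arithmetic exact.
    """
    baseline = total_docs * cost_per_doc
    best_case = baseline * (100 - high_pct) // 100   # cost left after 80% savings
    worst_case = baseline * (100 - low_pct) // 100   # cost left after 30% savings
    return baseline, best_case, worst_case


# Hypothetical inputs: 1,000,000 documents at 1 currency unit per document.
baseline, best_case, worst_case = review_cost_range(1_000_000, 1)
print(baseline, best_case, worst_case)
```

Under these assumed inputs, the remaining review cost would fall somewhere between one fifth and seven tenths of the full linear review cost, before accounting for software licensing and training-set review.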

Further, the time saved by delegating review to the trained algorithm can free lawyer reviewers to perform more sophisticated analysis and higher-order tasks to ensure sound case strategy. For example, lawyers are still required to perform the higher-level review of documents prioritized in the system. Typically, these are the documents most likely to be critical in the dispute or protected by attorney-client or work product privilege.

Moreover, subject matter experts and lawyers help choose the right keywords to maximize the return of responsive results while minimizing the likelihood of overlooking important variants or other related terms. Additionally, statisticians can play a role in validating the reliability and quality of search results by sampling throughout the process and demonstrating that the process is consistent, as necessary to ensure its defensibility. Lastly, forensic technology specialists can help guide lawyers in using the most effective types of review (search and cull) platforms.
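
One way a statistician might validate results by sampling is to draw a random sample from the documents the software marked as not relevant, have lawyers review it, and estimate how many relevant documents slipped through. The sketch below computes a 95% Wilson score interval for that proportion. The choice of the Wilson interval, the sample size of 1,500, and the 15 relevant documents found are assumptions made for illustration only.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return the (low, high) Wilson score interval for a proportion.

    successes -- items in the sample with the property of interest
    n         -- sample size
    z         -- normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Hypothetical check: 1,500 documents sampled at random from the set the
# software tagged "not relevant"; reviewers find 15 that are in fact relevant.
sample_size = 1500
missed_relevant = 15
low, high = wilson_interval(missed_relevant, sample_size)
print(f"estimated miss rate: {missed_relevant / sample_size:.2%}")
print(f"95% interval: {low:.2%} to {high:.2%}")
```

Repeating such a check at several points during the review, and recording the sample sizes and results, is one way to document that the process was consistent if its defensibility is later challenged.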

In summary, TAR (or more accurately, "predictive coding") may be able to replicate and in some cases overtake basic first-pass document review functions that entail tagging vast quantities of documents for relevancy and into legal issue categories. However, it is doubtful that this technology can (in the foreseeable future) replace the knowledge, frame of reference and expertise of seasoned lawyers and legal technology professionals who continue to be required to help manage the discovery process to a successful result.

[1] Predictive Coding: Emerging Questions and Concerns, Charles Yablon & Nick Landsman-Roos, South Carolina Law Review, Volume 64, page 2, Spring 2013

[2] "An empirical analysis of the training and feature set size in text categorization for e-Discovery", Ali Hadjarian, Deloitte Financial Advisory Services, Washington DC, white paper published for an ICAIL predictive coding standards workshop, June 14, 2013, Casa dell'Aviatore, viale dell'Universita 20, Rome, Italy

[3] Ali Hadjarian, Deloitte Financial Advisory Services, Washington DC, interview, June 12, 2015

[4] "Ready for the Matrix? The Promise and Limitations of Predictive Coding", Pearle, ACEDS, June 2014

[5] Ibid.


*The author is the Senior Legal Consultant of Deloitte Legal.


Working in China since 2002, he has unique cultural depth commingled with compliance, policy and fraud investigation experience utilizing digital forensics and electronic discovery. Key projects have involved FCPA and other regulatory compliance.

He has a legal and commercial background, with more than twenty years of consulting with MNCs and Chinese organizations focusing on compliance and business issues. Industries include insurance, IT, pharmaceutical, manufacturing and retail. He has eight years of team management and P&L responsibility.

He is proficient in Mandarin Chinese (spoken as well as reading and writing).

He earned a JD (Juris Doctor law degree), CFE (Certified Fraud Examiner) and HK mediation license while working in Hong Kong, Beijing and Shanghai.

The Compliance Reviews COPYRIGHT © 2013-19 All Rights Reserved. Supported by International Risk and Compliance Association and International Risk and Compliance Institute Limited. 沪ICP备10034943号-8