Doc Review Law (or: How I Learned to Stop Worrying and Love the Data)

Posted by Umar Khan on October 31, 2019

I am, among other things, a lawyer.

I’m still not exactly sure how I wound up being a lawyer. I had always been fascinated with computers, as far back as I can remember. I published my first website at 12 and started messing around with my sister’s college C++ textbook. But it was decided that I was to become a mechanical engineer, so for my O Levels (i.e., high school) I ended up taking all science and advanced math courses. I did well enough in those, but by that point I craved exposure to the humanities so much that in college I abruptly switched to studying things like philosophy and political science. Fast forward to graduation: what do I do now? Law school seemed like the best bet. Another three years and a mountain of debt later, I stood initiated into the ranks of those warriors of justice holding up the very foundations of civilization as we know it.

Except my experience as a lawyer was nothing like what one typically imagines it to be. The job market for lawyers was (and still is) incredibly rough, and there didn’t seem to be any room for me even in the non-profit and legal aid sector. I ended up working on a lot of eDiscovery projects, which essentially boil down to lawyers doing data mining.

Allow me to present a typical scenario. Two massive corporations sue each other for something like breach of contract. They both go to court, where each demands that the other party produce all documents in its possession relevant to the litigation. This can be as broad as all e-mails and documents generated by every employee in a certain division going back four or five years. In the old days, this would entail sending vans full of boxes of documents to the requesting party. These days, it is delivered as electronically stored information scraped from the hard drives and servers of the responding party. Often, these data dumps reach into the terabyte range.

Either way, the next phase is the same. A bunch of lawyers are stuffed into a room and spend long hours and days reading each e-mail, each attachment, each PowerPoint, each PDF, and each Excel file one by one and tagging it as relevant or not. Sometimes each document needs to be further categorized. Weeks, sometimes months, go by. The work is tedious and often repetitive.

However, if you think about it, what I have just described is an excellent use case for Natural Language Processing, clustering, and predictive analytics. And that is precisely the direction the eDiscovery industry is headed. Currently, eDiscovery is the most expensive phase of litigation. Naturally: all those lawyers in that cramped room like to get paid for the long hours they put in. As such, there has been a proliferation of companies and startups attempting to harness the latest machine learning and AI technologies to make a dent in the eDiscovery workload.
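
To make the clustering idea concrete, here is a toy sketch using scikit-learn on a handful of made-up snippets. It is not how any particular eDiscovery product works under the hood; it just illustrates how similar documents can be grouped so a reviewer can triage them in bulk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Made-up document snippets standing in for a review population.
docs = [
    "Please review the attached contract amendment before Friday.",
    "Contract termination clause needs outside counsel review.",
    "Lunch on Thursday? The usual place.",
    "Fantasy football picks for this week.",
]

# Turn the text into TF-IDF vectors, then group similar documents together.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(labels)  # documents with the same label land in the same cluster
```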

From the start of my eDiscovery career, I was very curious about the technologies being used behind the scenes. The primary software platform used to conduct this work is essentially SQL Server with a GUI layered on top and built-in document browsers. I started experimenting with this platform and learning how to craft queries so I could come up with creative ways to organize and filter the dataset. By identifying common metadata threads, I could knock out big chunks of useless data. By using RegEx and search terms, I was able to pull out veins of documents that were more valuable than others.
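
None of those queries survive outside the proprietary platform, but here is a minimal sketch of the same two moves in Python with pandas: culling by metadata, then pulling out a richer vein of documents with a regular expression. The column names, custodians, and pattern are all made up for illustration.

```python
import pandas as pd

# Hypothetical document index: one row of metadata per document.
docs = pd.DataFrame({
    "custodian": ["jsmith", "jsmith", "mkhan", "newsletter-bot"],
    "subject":   ["RE: Q3 contract terms", "Lunch?", "Wire transfer 8842", "Your daily digest"],
    "file_type": ["email", "email", "email", "email"],
})

# Knock out big chunks of useless data by metadata (e.g., automated senders).
junk_custodians = {"newsletter-bot"}
docs = docs[~docs["custodian"].isin(junk_custodians)]

# Pull out a vein of potentially valuable documents with a RegEx over the subject line.
pattern = r"\b(?:contract|wire transfer)\b"
hot = docs[docs["subject"].str.contains(pattern, case=False, regex=True)]

print(hot)
```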

By this time, I was working as a staff attorney at a major downtown NYC law firm. This is where I got what I consider my first break: my first chance to write actual code to solve a set of real-world problems. I was assigned to a case involving insider trading. We had to compile, clean, and analyze a large volume of trading records. Timestamps had to be formatted and converted across multiple time zones. Transactions had to be sorted by stock, by date, and by any number of other criteria. Various calculations had to be done on the fly against these slices of data: totals, per-share basis, running tallies, and so on. And finally, charts had to be generated dynamically from the dataset as it was being sliced and parsed.
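
The actual work lived in Excel, but for a sense of the transformations involved, here is roughly what they look like in Python with pandas (the toolset I was headed toward). The column names, sample values, and time zones are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical trading records; the real data arrived as spreadsheets.
trades = pd.DataFrame({
    "timestamp": ["2014-03-03 14:31:05", "2014-03-03 15:02:41", "2014-03-04 19:15:09"],
    "ticker":    ["ACME", "ACME", "XYZ"],
    "shares":    [100, 50, 200],
    "price":     [25.10, 25.45, 8.02],
})

# Format timestamps and convert time zones (assuming the records came in as UTC).
trades["timestamp"] = (
    pd.to_datetime(trades["timestamp"])
      .dt.tz_localize("UTC")
      .dt.tz_convert("America/New_York")
)

# Per-trade value and a running tally of shares per stock.
trades["value"] = trades["shares"] * trades["price"]
trades["running_shares"] = trades.groupby("ticker")["shares"].cumsum()

# Slice by stock and date, with totals and per-share basis computed on the fly.
daily = (
    trades.assign(date=trades["timestamp"].dt.date)
          .groupby(["ticker", "date"])
          .agg(total_shares=("shares", "sum"), total_value=("value", "sum"))
)
daily["per_share_basis"] = daily["total_value"] / daily["total_shares"]

print(daily)
# daily.plot.bar(y="total_value")  # charts can be generated from the same slices
```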

At first, we proceeded to do everything by hand. I began to craft formulas to try to speed things up, but it very quickly became apparent that this was an unwieldy and limited approach. So I decided to dive into VBA. For those who aren’t familiar, VBA is a scripting language built into the Office universe of apps. It has all the features you see in more common languages: variables, data structures, control flow, even classes. The syntax is clunky and outdated (using Python after having hacked away at VBA for months felt unnervingly easy). However, it was my first taste of programming, and I immediately fell in love.

VBA allowed me to develop powerful tools that not only automated the workflow but also stood alone as solutions others could easily use to interact with the dataset however they wanted. I went on to do some pretty interesting things with VBA. I did a couple of freelance projects, including one for a midsized accounting firm that involved building pipelines to pull data from Outlook into Excel and generating detailed visualizations for an analytics dashboard. On a project with a major oil and gas firm in Houston, my team leads gave me free rein to build a solution to automate their workflow, and they were not disappointed. I pushed the capabilities of Excel and VBA beyond what I thought was possible and delivered an incredibly sophisticated solution that was really my first full-fledged software application: 70 pages of code when printed, tons of features, version control, the whole nine yards. It took me more than a month to build, and I spent countless hours upgrading and maintaining it. It was deployed to a team of 20 people, all of whom were very appreciative of how much faster and easier the app made their jobs.
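
For the curious, here is a rough sketch of the kind of Outlook-to-Excel pipeline described above, written in Python against the same Outlook COM object model that the original VBA drove. The folder, the fields pulled, and the output file name are illustrative assumptions, and it only runs on Windows with Outlook installed.

```python
import pandas as pd
import win32com.client  # pywin32

# Connect to the local Outlook instance via COM (VBA talks to this same object model).
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)  # 6 = olFolderInbox

rows = []
for item in inbox.Items:
    # Not everything in the Inbox is a mail item, so check the item class first.
    if getattr(item, "Class", None) == 43:  # 43 = olMail
        rows.append({
            "received": str(item.ReceivedTime),
            "sender": item.SenderName,
            "subject": item.Subject,
        })

# Land the data in Excel, where charts and dashboard views can be built on top of it.
pd.DataFrame(rows).to_excel("outlook_export.xlsx", index=False)
```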

I have never loved doing anything as much as I love writing code. Part of it is definitely the instant feedback loop: when you get it right, the results are virtually instantaneous; the reward pathways are triggered right away and the reinforcement builds quickly. But there is also the feeling that I have arrived at my true calling. Sure, I got here by a winding path. But I’m lucky enough to have gotten here at all, to a point where I found something I’m passionate about, that I enjoy doing, and that I have the opportunity to pursue as a career. They say, “if you love what you do, you don’t work a day in your life.” It seems, then, that I am on the cusp of retirement.

I decided that it was time to get serious about programming and data science. I had been casually taking data science courses online for a while, but the words of Hunter S. Thompson kept coming back to me: “If a thing is worth doing, it’s worth doing right.” Once my last project ended, I enrolled in Flatiron’s Data Science program. Every instinct I had told me it was the right move. I knew the industry I was in was undergoing a rapid transformation powered by AI and machine learning. We are living in the golden age of data, and AI and machine learning techniques are revolutionizing virtually every segment of modern life. The legal profession, despite its natural tendency to resist change, is no exception. Clients are increasingly demanding the use of advanced technology to curtail litigation expenses. I knew I had a knack for writing code and working creatively with datasets, and that this is what I wanted to do for a living. It seemed abundantly clear that Data Science was my destiny, and Flatiron seemed to be the best means of getting there.

And so began my journey towards Data Science mastery. I’ll keep you posted on how it goes.