Perspectives on Data Science in Therapeutic Development

Across the landscape of pharmaceutical and biotechnology R&D, information technology and data analytics are catalyzing new discoveries and accelerating clinical development programs. To get an inside look, the AcademyHealth blog interviewed Dr. Eric Perakslis, Senior Vice-President for R&D Informatics at Takeda Pharmaceuticals International in Cambridge, MA. Dr. Perakslis is a data science pioneer with more than 20 years of bioinformatics experience with several pharma companies. In addition, he's led the Center for Biomedical Informatics at Harvard Medical School, and served as the FDA CIO and Chief Scientist (Informatics).

What's your big picture view on the impact of data science in medical research today?

From an impact perspective, I think you have to respect the principles embedded in the Gartner Hype Cycle for Emerging Technologies and the Curve of Innovation made popular by Everett Rogers. It's important to keep the whole picture of a technology's adoption in mind. People tend to overestimate the impact of technology in the short term, but underestimate it in the long term. For big data analytics, the results will be similar to what we have seen in other aspects of IT: the impact will be incremental initially, but over time, transformational.

If you take the analogy of the Human Genome Project and the imagined ‘new cures’ perspective, the Genome Project was often said to unlock this and unlock that. When it turned out that genome sequencing was effective, the cost was too high to do much with it. Now, twenty years later, costs are down to $1,000 a genome. Yet even now the experts are saying, “Well, we need a lot more patients and a lot more data before it's going to add up to cures.” So, with regard to data and IT, you take this major leap forward with innovation and cost, and now you have this challenge of scale of impact. As we’re seeing with the wide applications in genomics now, I’m confident we will follow similar cycles of great leaps and new challenges with these technologies.

From our own perspective in pharma, the biology is extremely complex but now we have the capability to double click through those complexities. Overall, we're getting better and data science is getting interesting for drug discovery and other health care applications for big data analytics.

From your career perspective, what would be your advice to a quantitatively oriented undergraduate student about a career path in data science?

That's a great question. You know, I ended up in data science even though it didn't exist when I was a student. If you think back to when I was in college in the early 1980s, biology wasn't taught as a quantitative science or a computer science. It's actually taught as both now. It was taught really as a qualitative science, and there was a lot of memorization, taxonomies to learn, genetics to assemble; these things were the foundations. And jump to now: if you look at postdocs and graduates coming out of school, these young people can code for themselves. They can do the math and they can do the computer science. They're still biologists too, but it's very different.

I always tell people to follow their passion and I think if you like STEM, stick with it. If you're good at it, if you like it, you're going to do fine. This next generation will have amazing unintentional careers just the way that we did, and it's going to be much more free-flowing. Most of the young people I know are far more entrepreneurial than my generation was. They don't think of getting good jobs. They think of getting experiences, and then starting their own company or something different.

Let’s talk about how the industry is working to integrate new sources of data into therapeutic development. This includes the “real world” data being incorporated into research domains, e.g., data abstracted from electronic health records, social media, sensors, or other sources.

Real world evidence isn’t new, yet it's still emerging. What I mean is that it's going to be essential for things like value-based access to therapies. It used to be that a pharma company's goal was to get a drug through the FDA. Well, pharma generally knows how to do that. Now, the goal is to get paid for it. You actually need a product that not only seems effective but actually brings incremental value to patients, so that people will buy it. And, that's a good thing.

The real world evidence agenda is interesting because, you know, what you're really doing is anecdotal analysis of data that was generated for other purposes. On one hand, you see really, really great trends and fascinating observations. On the other hand, it's almost impossible to ask why something is the way it is observed. A classic case of this would be that you would start to see a safety signal, looking at FDA Adverse Event Reporting System (AERS) data and looking at claims data around a drug. What you could probably do is validate with fairly good precision that this is actually a signal you're seeing, or not. You could use other data sources. You could kind of look at it six different ways and say, okay, this thing is real. What you have, however, is no way to understand the mechanism of it. While the evidence you can build around that signal is shown by association to be valid, you often don't know why it's real. In a lot of ways, if this were criminal law, a judge would look at this and say it is all circumstantial, because it is. It doesn’t mean it’s not real; it just means it needs discipline.

Now, regarding sensor data - that's something entirely different. That's a direct measurement. Sensor data are direct phenotypic measurements of a patient or a subject. They're actually quite different. They actually are very objective. The heart rate is whatever the Apple Watch says the heart rate is, within the range of precision of the Apple Watch, right? I'm on the National Academies of Science Mobile Health IT Working Group and we're looking at using biosensors in clinical trials, as it is a growing field. What we're seeing is that the personal sensor technology used today is not equivalent to clinical grade yet. There are a lot of interesting things you can do with wearables today for things like vital sign measurement. We have a new publication coming out about this next month and what we're envisioning for their use in clinical trials. Today, they don't really come close to standard of care, but we're using them to study the data.

Can you say more about how you are looking at sensor data?

Say that you're an orthopedic surgeon. It used to be that after you replaced someone's hip, you'd walk by and you'd ask the nurses to watch. You'd tell the nurse, “I want the person to walk around the floor twice with their IV pole; make sure they're moving.” The nurse may or may not see them, may or may not remember to write that down. Put a Fitbit on them, and you actually have an objective measure.

If you want to do cardiac monitoring, the standard of care would be the Holter monitor, which has been used since the 1960s and has been the standard of care since the late ’70s and early ’80s. The monitors are actually pretty small now, about the size of an iPhone. You put one on your chest and you wear it for a month and it gives intensely accurate cardio-physiological data. There are people now who are working to replace that with something the size of a Band-Aid, which would be awesome; it's just not close yet.

And so the thing about medical devices is that you really do have to meet the standard of care. I argue that in order for someone to use something other than a Holter monitor, one, it has to be at least as good if not better. Two, more importantly, you have to disrupt an entire industrial complex. Virtually every community-based hospital in the country has Holter monitors and knows how to put them on patients. The point being that your favorite, brilliant little app really has a high bar to overcome for adoption. The issue with digital health applications is that they really have to pass that bar of 'as good as or better than' the standard of care, bringing incremental value, to disrupt the economics of health care. That said, at Takeda we're putting wearables in all of our trials. We just keep finding out that they are not quite ready for what we want to use them for.

How is pharma responding to the opportunities for large aggregated data resources in research?

I think the standard deviation on this topic across industry is huge. If you look at what I did, I built a system called tranSMART, an open source clinical data warehouse system, and started putting Johnson & Johnson clinical trials in the Cloud in 2008 - about the same time the government went with Cloud. Later, when I was at FDA in 2012, I started moving FDA data to the Cloud. At Takeda, I'm in that stage right now, as we have just started with a Cloud-first policy this year. A lot of other pharma companies aren't there yet, so the standard deviation here is wide.

Right now, companies, especially young, light, agile, early stage companies, are running their entire drug development process in the Cloud. With Medidata Rave (a commercial clinical trials data integration system), there is a variety of software and service providers that are creating virtual data systems to support pharma R&D. Then you've got the 'old fashioned' companies where I think they're going to have 50 full-time people running Microsoft patches from here to eternity to get the job done.

I think that the desirable trend, if you start with the value of a product in health care as a first principle, and the way I've done data science with Takeda, is to integrate data from all aspects of a drug development program. Almost all of the costs of a trial will be billable to the drug and accountable to the product label. If we've got a program in a certain type of cancer, everything that we use - the clinical trial sites, the clinical database, the electronic data - all of that would ideally be developed as a service that you could build into a trial. This approach gives us the flexibility to accommodate the different data needs for different therapeutic programs. A trial for psychiatric therapeutics is very different from a cancer trial.

Overall, it's likely that there will be a couple of very large vendors that support the clinical trials industry, and it will be a far more facile and flexible infrastructure than what we have today.

Is it too far of a leap to think cloud-based data aggregation is going to make trials more efficient?

No, I think the pharmas are already making good use of these services. I just think it's a matter of different organizations getting there faster. I think that some trials we can run almost entirely online with apps and not even have a clinical site. This is the success of the Medidata Raves. People are simply saying, “Look, I don't need a clinical trial management system anymore.”

As we see everywhere in the consumer sector today, virtualization of data resources is proving to be really cost effective as far as use of infrastructure in the cloud for pharma R&D programs. I have said for some time that pharma data science should be built as a service and not as an IT infrastructure.

Given the heterogeneity of health care delivery systems, financing mechanisms, demographics, and economics around the world, how is your company going about the integration of this data across the global marketplace?

This is something that the larger R&D companies are actually good at figuring out. We think of it in terms of the concept of regulatory strategy. For a certain type of product innovation, you have a strategy about where the best country to launch is. Then, it's "where's the second country?" Then, "where's the fortieth?" Pharma companies have global brand teams that are usually a hybrid of R&D, marketing, commercial, and supply chain. They have the capability to set up 40 or 50 registry trials in 40 or 50 countries, and they do that well.

From a data perspective, I always say, “You know, a drug label is the most expensive PDF on the planet.” At the end of the day, that's the outcome of all drug discovery. A company spends a billion dollars to get a label that defines what you can use the drug for, where you can use it, how it's made and what the effects should be. From a data perspective, companies are really set up around a labeling database that has all of the relevant product information. That same database has metadata on foreign language, language requirements, and all the relevant data for translating these elements.

Putting together all of the pieces through a data strategy is what global pharma companies understand and run well.

The aspects of reinforced machine learning, artificial intelligence (AI) and natural language processing (NLP) are now prominent in research. What do you think the impact of these tools will be on drug discovery?

The impact will be incremental, but genuine. For example, NLP obviously has a lot more maturity to it, even though it's still imperfect, than some of those others being featured now. I think we're getting more and more tools in the R&D arsenal and that’s what it comes down to – brute strength in our computing capabilities. And like anything else, when the right problem comes up and you can fix it with the right tool, it's a beautiful thing. But the way Silicon Valley works, they often produce tools that are in search of a problem – and we're finding applications to use some of them. For example, we're applying these methods in our work on microbiome data – a domain that is exploding with interest. The microbiome DNA sequence has large regions in it whose functional significance we don't understand. It’s a great problem for AI. AI can simulate lots of iterations and go on substituting options for those blind segments of DNA. Data scientists run the simulations over and over to learn the roles of the genes in these regions, and we're seeing results.

You know where I fall on the whole ‘AI is going to change the world’ thing? I take the perspective from my ex-FDA role and say, “Let’s take a step back and see if this makes sense.” Overall, I view myself as very optimistic about these tools, but they aren't panaceas. They are really important methods of analyzing data that we're going to exploit to the fullest.