Dana O’Connor, Machine Learning Research Scientist at PSC

Before her recent presentation at the PEARC conference, I sat down with Dana O’Connor, Machine Learning Research Scientist with PSC’s AI and Big Data group, to talk about her recent award and her work at PSC.

CRYSTAL STRUCTURE PREDICTION

CC Dana, I want to congratulate you on your recent “rookie” award.

DO Yes, the “senior” rookie award. I already feel old… [laughs].

CC That’s to honor someone “new,” but not brand new, who has done something worth recognizing in the Mellon College of Science (MCS) – the CMU college affiliated with PSC. Your research focus is applying computational methods to crystal structure prediction (CSP) of organic molecules. What can we learn by studying this, and what are some applications?

DO CSP tries to determine how things will pack. It sounds simple, but it’s actually very complicated. Especially for organic molecules, there are a lot of very weak interactions that make it possible for them to pack in a lot of different ways. CSP breaks down into two big steps: generation and ranking. Generation is its own problem, because that’s where you actually pack everything. You have to make sure that everything is physically reliable, so it involves a lot of math. Then, for ranking, you must be extremely accurate, which is usually very expensive in terms of computation.

A lot of CSP research lately has focused on how to decrease ranking time as well as how to start doing generation in other parts of what we call the design space. Right now, it’s really good for very small, semi-rigid molecules. As you get more flexible, and you start increasing those spaces that molecules could actually exist in, it becomes a lot more complicated.

One of the main applications is pharmaceuticals. That’s where most of the bread is buttered, if you will. But, in academia, a lot of research focuses on energetic materials, which is a very polite way of saying bombs, because the Department of Energy (DOE) is a very big funder of research in that area. A third orphan child is organic semiconductors (OSCs), materials for solar cells.

CC I’m sure that’s a big focus of research with all the conversion to renewables and solar.

DO One of the papers I’m working on currently focuses on that. I researched energetics for my PhD because the DOE was my major funder. But, in my opinion, OSCs are an under-explored area for this application. It’s like that third road where nobody’s really funding it.

MACHINE LEARNING

CC Interesting. As a machine learning research scientist, would you elaborate on the difference between machine learning and artificial intelligence (AI)?

DO You know, my own mother has never learned this, and she’s asked me this. Machine learning is a subset of AI, because AI covers anything where you’re trying to have a computer “think.” Robotics, machine learning, and even simple automation tasks fall under AI, but machine learning is a subset where you’re teaching the computer specifically to do pattern recognition. Whereas in robotics, you’re training it to do a different thing.

CC What kind of things do you apply machine learning to in your research area?

DO Sure. So, part of the crystal structure prediction (CSP) problem, like I said, is ranking. We have tons and tons of theoretical calculations that we can use to train machine learning models. One of the things I’m working on right now is trying to use graph neural networks to do some of this. But the question is always, how much training data do you need to get close enough to the correct answer? And how to generate it, and especially now, with energy consumption questions, how to generate it responsibly.

Using machine learning to try to do ranking problems, as well as property prediction, is big in CSP. If you get a crystal that doesn’t have the properties you want, you don’t necessarily want to move forward with that. Another thing that machines are really good at is property prediction.

CC Earlier this spring you presented some of your research to the American Physical Society (APS), and you also have a paper that will be presented at the PEARC conference. What would you like to share about that?

DO The work I shared at APS was just wrapping up some stuff from my thesis, which my advisor suggested I present. That was CSP of energetic materials, very DOE-heavy stuff that also happened to illustrate a good point. There are two variations of one energetic material, and we found that the leading methodology of ranking those structures actually tends to produce the incorrect answer. That was our big find in that paper, and we’re basing a lot of things, training machine learning models, on this one method. This method isn’t perfect, but you’re going to get that with any methodology that you use, aside from solving the Schrödinger equation, which no one has been able to do.

The papers I submitted to PEARC both use density functional theory (DFT) predictions. One of the papers benchmarked DFT methods on Bridges 2 versus the Ookami supercomputer system at Stony Brook [New York], focusing on the accuracy of the ML methods. The main purpose of Ookami is to showcase the A64FX processor. Stony Brook recently got a Grace processor, so I was playing around with that as well. The accepted paper features machine learning for property prediction of functional organic molecular crystals with graph neural networks. That’s basically taking crystals, turning them into graphs, feeding them through machine learning, and trying to predict the band gap. So, it’s related to organic semiconductors. A nice stepping stone to further work.

SUPPORTING RESEARCHERS AND COLLABORATIONS

CC You also support researchers who are using some of our PSC systems, Bridges 2 and Neocortex. Can you talk a bit about that? How does that work?

DO Usually researchers contact us when they encounter an error. I’ve seen quite a few errors since I worked on Bridges 2 as a PhD student. Neocortex, I’m still trying to get a handle on, but usually people are pretty good about giving a first effort, and then they will contact us if there’s something that’s causing them to bang their head against a wall. Sometimes it’s a very simple fix because we’ve seen it a million times, and other times we have to try to do “open heart surgery” on the road. It just really varies a lot.

CC So it’s like a help desk for these specialized systems with some debugging thrown in?

DO A little bit, yeah. On Neocortex, it’s usually some sort of debugging problem because they give you a very specific template to follow, but it’s hard to follow sometimes… what needs to happen before something else? But we also get a lot of help from Cerebras on that. That’s the vendor that produces the CS-2, a powerful AI system that is the main compute element of Neocortex.

CC What can you share about your collaborative protein folding research with Stanford and PNNL?

DO Oh sure. I wish there was more to tell. So that started when I was looking for some research projects to get started on, since I was a research scientist here and it’s very intimidating to dive headfirst into something you have no idea about without other people. I frequently find that I’m the “dumbest person in the room,” so I always like to ask other people for help. I reached out to some Neocortex users. One of them was at Stanford, and he’s really interested in protein folding. Some other researchers joined our efforts. We have someone from William and Mary, someone from PNNL, and we’re really interested in looking at the bottlenecks of protein folding as well as how we can train these models faster. We’re hoping to someday port one of them to Neocortex, but that’s not a trivial matter. We’re trying to look at ways to make protein folding prediction a little easier for people who don’t have six million NVIDIA GPUs.

CC So protein folding, what is that used for? 

DO I think most people these days are interested in drug design. I think it’s called small molecule protein interactions. It’s basically, how will a protein interact with a biomolecule or a drug, because that can impact the way it folds and then also interacts. Stuff like that.

CC You also help with resource matching for NAIRR. So, what’s involved with that?

DO First, the proposals go through a review committee. Does it have scientific merit? Does it meet the call for the proposals? Then they usually request resources in their review panel.

We then determine if that’s appropriate: what capacity is the machine at, and which machines are suited for it? For Neocortex, obviously, can you do these very specific things that Neocortex is designed for? So, typically, it’s a bigger debate between the GPU-based systems and people who request Neocortex.

Usually, people are aware that Neocortex is not a GPU-based system, but our job is to make sure that they are aware that it’s not, and if they aren’t, to inform them when they get on the system. But it’s just kind of an exchange where we discuss, is this the right place for these resources, is this resource already at capacity? Which is why Bridges 2 has now joined the NAIRR pilot, to take some of that GPU load off Delta and TACC and some of the other places that were early providers. People are very interested in the NVIDIA A100 and H100 GPUs. Bridges 2 uses the V100s, which aren’t bad, of course, just a little older.

OTHER INTERESTS

CC So when you’re not working at PSC, doing all these interesting things, what kind of things do you like to do in your spare time?

DO I like to run. My fiancé is a marathon runner, so somewhat mandatory there. I am not a marathon runner. He ran a 3:34 in the Pittsburgh Marathon and was disappointed, so that’s what I’m dealing with. I also love to read. I want to say I am a frequent flyer at the Carnegie Library of Pittsburgh.

CC What kind of things do you like to read in your spare time?

DO I really like nonfiction. I used to read a lot of biographies and political history. I also read fiction, usually for book clubs, or based on people’s recommendations. But a lot of nonfiction.

CC What kind of biographies? I’m a biography fan myself, I’ve read biographies on Teddy Roosevelt, Lincoln, lots of presidential histories. 

DO Grant, Seward, Lincoln.

CC Have you read Team of Rivals by Doris Kearns Goodwin? That’s an interesting one.

DO No. Oh, the one about the cabinet? Back in middle school (mid-2000s), I read a Hamilton biography, about the same time Lin-Manuel Miranda was reading it. He later based his award-winning play on that book. I’ve also read a Kirsten Gillibrand one, so I’m all over the place.

CC Very interesting. Anything else you would like to share?

DO I can’t really think of anything. It’s been great working here.

CC Thank you for taking the time to sit down and talk with me. 

DO Well, it always sounds less interesting when I hear myself saying it…

CC Not at all. You covered many really interesting topics, so thank you.


Learn more about AI and Big Data at PSC.