I come neither to bury the Google search engine nor to praise it — it needs neither, and to shift from misquoting the Bard to misquoting the Emancipator, my words can have little effect on it either way. It’s a simple fact of life that I think everyone can agree on, though, that Web-based searching has made the way we look for information unrecognizable to earlier generations.
That last bit I had to say a little under my breath: I am that earlier generation. I remember this thing called Biological Abstracts, which was a primitive artifact that lived at this place called a library. BA, as strange as it may seem, was made of dead trees, with letters of carbon black stamped on it. It helped you find research papers. You’d look up a topic by thumbing through these things called “sheets” and, for a given span of dates, it would present a list of references to the papers that had appeared on that topic. Then you’d have to look through the library for other dead-tree-carbon-black artifacts if you wanted to read the papers.
The precision of this instrument in some hands (mine, anyway) can be seen in the fact that, in the Spring of 1984, I managed to write an immunology class paper on AIDS without discovering the existence of HIV.*
Not that anyone would call modern search engines perfect. When we’re trying to get accurate information, they often throw marketing, propaganda or even purposeful misinformation at us. And if our words aren’t perfectly tuned to the metadata that define web copy, we can wind up very far afield. (To clean up a rude joke I heard, “On the Web you’re never more than three clicks away from something you don’t want your children to see.”)
So. Call us greedy. We want the vast power of Web searching, with far more specificity. Well, considering the venue, you won’t be surprised to hear me suggest that HPC may be part of the answer.
Sherlock, as we’ve said before, is a YarcData Urika “data appliance” designed specifically to search unstructured, arbitrary networks for information. Yarc calls it by that unusual term to stress that the base model isn’t a supercomputer in the usual sense, because you don’t program it. It’s already set to do the searching: you give it the data and the SPARQL search terms, and it does the rest. But PSC’s folks worked with Yarc to customize Sherlock with GPUs, to give it additional flexibility, including programmability. And even better, because it isn’t a massively distributed machine that requires you to tell it how to split a problem up into tens of thousands of little parallel pieces, you can program it much like you would your desktop computer.
The ability to run Java-based Discover — very unusual in a supercomputer^ — comes with that setup, and that’s what Woolls is trying to take advantage of. Discover, he told me when we were preparing a press release on the collaboration, is meant to automate the tough next step of the search process — one that we all do manually today: Read the content to see if it’s really what you were looking for.
“In essence we take over where search engines stop,” he said. “We take those documents and read them for you.” The added kick is that the program will not only help you fish out relevance — it can also make connections to related content that you didn’t think to ask for, but which is on target for your needs. Sherlock will kick the testing of Discover into overdrive, expanding on CFL’s earlier work with U.S. Patent records and Wikipedia.
“The aim is to increase the amount of data that we’re going through,” Woolls said. “We’ll test out existing queries to see what kind of response is coming out, then consider other queries which we can answer — potentially, other ways … which we haven’t been able to explore so far.”
I’m looking forward to his talk. That’s 2 p.m. on Sept. 4; it’s an open talk, but RSVP early, as seating is limited.
*For the sticklers, the Gallo “HTLV-3” paper appeared in May of that year, and the Montagnier “LAV” paper in 1983, so although I was cutting it close I should have had a shot at it.
^PSC’s Blacklight has that capability too, BTW.