Nieman Foundation at Harvard
Sept. 29, 2009, 9 a.m.

Five projects on the frontier of text-based data analysis and visualization

Last week, I attended the Transparent Text symposium at IBM’s offices in Cambridge. The conference focused on text-based data storage, analysis, and visualization — awesomely nerdy stuff, in other words.

Some of the presentations would be familiar to loyal readers of this site: Amanda Michel’s distributed reporting at ProPublica, Ethan Zuckerman’s Media Cloud and “nutritional labeling” for news, DocumentCloud, and The Guardian’s crowdsourcing tool. Here, then, are five other projects that piqued my interest at the conference:

OpenCalais

I’ve mentioned OpenCalais in the context of DocumentCloud, but there’s much more to the software, which was purchased by Thomson Reuters in 2007. In a sentence, OpenCalais parses text for names, locations, organizations, and other entities to make unstructured documents more useful. Oh, and it’s free.

Above are the slides presented by Tom Tague, head of OpenCalais, whose talk focused on how publishers are using the service. The best example is on the last slide: Two investigative-journalism networks, which Tague did not name, are using OpenCalais to compare birth, death, and wedding records with government contracts to identify conflicts of interest that wouldn’t otherwise be apparent.
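To make that last example concrete, here is a minimal sketch of the cross-referencing step, assuming the names have already been extracted from the documents. It does not call OpenCalais; the records, field names, and the `find_possible_conflicts` helper are all hypothetical and invented for illustration.

```python
# Hypothetical, made-up records. In practice the names would come from
# an entity-extraction pass over the source documents.
wedding_records = [
    {"spouse_a": "Jane Doe", "spouse_b": "John Roe"},
    {"spouse_a": "Ann Lee", "spouse_b": "Sam Poe"},
]
contracts = [
    {"contractor": "John  Roe", "approved_by": "jane doe", "amount": 250000},
    {"contractor": "Max Moe", "approved_by": "Ann Lee", "amount": 90000},
]


def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def find_possible_conflicts(weddings, contract_rows):
    """Return contracts whose contractor and approving official appear
    together as spouses in a wedding record."""
    married = {
        frozenset((normalize(w["spouse_a"]), normalize(w["spouse_b"])))
        for w in weddings
    }
    return [
        c
        for c in contract_rows
        if frozenset((normalize(c["contractor"]), normalize(c["approved_by"])))
        in married
    ]


conflicts = find_possible_conflicts(wedding_records, contracts)
for c in conflicts:
    print(normalize(c["contractor"]), "<->", normalize(c["approved_by"]), c["amount"])
```

A match here is only a lead for a reporter to check, not proof of anything: exact name matching will miss nicknames and will flag unrelated people who share a name.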

IBM’s DeepQA project

IBM’s successor to Deep Blue, the chess-playing supercomputer that defeated Garry Kasparov, is DeepQA, a natural language processor that’s being trained to play Jeopardy! It’s a whole different challenge, the complexities of which were explained in a New York Times article last spring and in the IBM promotional video above.

What does this have to do with journalism? Nothing, at first, but the research behind DeepQA (or “Watson,” as they call it at IBM) could improve the way information is processed and interpreted — and hasn’t that long been the news industry’s specialty?

Maplight

[Widget: Medicare Prescription Drug Price Negotiation Act of 2007 (at MAPLight.org); Center for Responsive Politics]

Maplight is a project funded primarily by the Sunlight Foundation that seeks to “illuminate” the connection between money and politics in California and the federal government. Its databases allow users to compare votes on particular bills with campaign funding from interest groups that supported or opposed the legislation. The widget above, for instance, shows the correlation, if not causation, between contributions and votes on a Medicare bill in 2007.
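The kind of comparison described here can be sketched in a few lines. This is only an illustration under assumed inputs, not Maplight's actual method or data: the legislators, dollar figures, field names, and the `average_by_vote` helper are all made up.

```python
from statistics import mean

# Made-up figures for illustration; not real contribution data.
legislators = [
    {"name": "Legislator A", "vote": "yes", "from_supporters": 40000, "from_opponents": 5000},
    {"name": "Legislator B", "vote": "yes", "from_supporters": 30000, "from_opponents": 10000},
    {"name": "Legislator C", "vote": "no", "from_supporters": 5000, "from_opponents": 25000},
    {"name": "Legislator D", "vote": "no", "from_supporters": 15000, "from_opponents": 35000},
]


def average_by_vote(rows, field):
    """Group legislators by how they voted and average one contribution field."""
    groups = {}
    for row in rows:
        groups.setdefault(row["vote"], []).append(row[field])
    return {vote: mean(values) for vote, values in groups.items()}


supporter_money = average_by_vote(legislators, "from_supporters")
opponent_money = average_by_vote(legislators, "from_opponents")

for vote in ("yes", "no"):
    print(
        f"Voted {vote}: avg from supporting groups ${supporter_money[vote]:,.0f}, "
        f"avg from opposing groups ${opponent_money[vote]:,.0f}"
    )
```

Averages like these can show that money and votes line up, but they say nothing about which came first or why.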

IBM’s Many Eyes project

Many Eyes is IBM’s free data-visualization software. (I used it for two posts earlier this year.) Fernanda Viégas and Martin Wattenberg demonstrated some of their best text-based visualizations, like Word Tree, and previewed a new one that compares Google searches. The example pictured above compares the most common endings of searches for “is my son…” and “is my daughter…” Think of it as an amped-up version of Google Suggest.
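The counting behind a comparison like that is simple to sketch. The example below is not the Many Eyes implementation and does not query Google; it assumes a plain list of search strings, and both the queries and the `common_endings` helper are invented for illustration.

```python
from collections import Counter

# Invented queries for illustration; not real search data.
queries = [
    "is my son ready for kindergarten",
    "Is my son ready for kindergarten",
    "is my son left-handed",
    "is my daughter tall for her age",
    "is my daughter tall for her age",
    "is my daughter ready for kindergarten",
]


def common_endings(query_list, prefix, top=3):
    """Count what follows `prefix` in each query and return the most frequent endings."""
    prefix = " ".join(prefix.lower().split())
    endings = Counter()
    for query in query_list:
        cleaned = " ".join(query.lower().split())
        if cleaned.startswith(prefix + " "):
            endings[cleaned[len(prefix) + 1:]] += 1
    return endings.most_common(top)


for prefix in ("is my son", "is my daughter"):
    print(prefix, "->", common_endings(queries, prefix))
```

Putting the two ranked lists side by side is the text-only version of the comparison; a visualization like Word Tree adds the layout on top of counts like these.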

Linked data at The New York Times

I actually missed this presentation, but Alexis Lloyd of The New York Times Co.’s research and development group, which we profiled at length in May, discussed how the Times is using linked data to organize its content. ReadWriteWeb reported on this project in June. The slide above, for instance, illustrates how the Times classifies airline accidents to create a more intelligent archive of its plane-crash coverage.

Slide photos by Andreas Myhrvold Braendhaugen and lite used under a Creative Commons license.
