Weblog

Voices from the future of science: Rufus Pollock of the Open Knowledge Foundation

August 18th, 2008

If there’s a single quote that best captures the ethos of open science, it might be the following bon mot from Rufus Pollock, digital rights activist, economist at the University of Cambridge and a founder of the Open Knowledge Foundation: “The best thing to do with your data will be thought of by someone else.”

It’s also a pithy way to convey both the challenge and opportunity for publishers of scientific research and data. How can we best capitalize on the lessons from the rise of the Web and open source software to accelerate scientific research? What’s the optimal way to package data so it can be used in ways no one anticipates?

I talked to Pollock, who’s been a driving force behind efforts to improve sharing and reuse of data, about where we stand in developing a common legal, technical and policy infrastructure to make open science happen, and what he thinks the next steps should be.

What strategies and concepts can we use from open source to foster open science? Can you give us a big picture description of the role you see the Open Knowledge Foundation playing?

I’d say that in terms of applying lessons from open source, the biggest thing to look at is data. Code and data have so many similarities; indeed, in many ways, the distinction between code and data is beginning to blur. The most important similarity is that both lend themselves naturally to being broken down into smaller chunks, which can then be reused and recombined.

This breaking down into smaller, reusable chunks is something we at the Open Knowledge Foundation refer to as “componentization.” You can break down projects, whether they are data sets or software programs, into pieces of a manageable size — after all, the human brain can only handle so much data — and do it in a way that makes it easier to put the pieces back together again. You might call this the Humpty Dumpty principle. And splitting things up means people can work independently on different pieces of a project, while others can work on putting the pieces back together — that’s where “many minds” come in.

What’s also crucial here is openness: without openness, you have a real problem putting things together. Everyone ends up owning a different piece of Humpty, and it’s a nightmare getting permission to put him back together (to use jargon from economics, you have an anti-commons problem). Similarly, if a data set starts off closed, it’s harder for different people to come along and begin working on bits of it. It’s not impossible to do componentization under a proprietary regime, but it is a lot harder.

With the ability to recombine information as the goal, it’s critical to be explicit about openness — both about what it is, and about what you intend when you make your work available. In the world of software, the key to making open source work is licensing, and I believe the same is true for science. If you want to enable reuse — whether by humans, or more importantly, by machines operated by humans — you’ve got to make it explicit what can be used, and how. That’s why, when we started the Open Knowledge Foundation back in 2004, one of the first things we focused on was defining what “open” meant. That kind of work, along with the associated licensing efforts, can seem rather boring, but it’s absolutely crucial for putting Humpty back together. Implicit openness is not enough.

So, in terms of open science, one of the main things the Open Knowledge Foundation has been doing is conceptual work — for example, providing an explicit definition of openness for data and knowledge in the form of the open knowledge/data definition, and then explaining to people why it’s important to license their data so it conforms to the definition.

So, to return to the main question, I think one of the strategies we should be taking from open source is its approach to the Humpty Dumpty problem. We should be creating and sharing “packages” of data, using the same principles you see at work in Linux distributions — building a Debian of data, if you like. Debian has currently got something like 18,000 software packages, and these are maintained by hundreds, if not thousands, of people — many of whom have never met. We envision the community being able to do the same thing with scientific and other types of data. This way, we can begin to divide and conquer the complexity inherent in the vast amounts of material being produced — complexity I don’t see us being able to manage any other way.

Your Comprehensive Knowledge Archive Network (CKAN) is a registry for open knowledge packages and projects, and people have added more than 100 in the past year. Can you tell us how the project got started? What have the recent updates achieved? And what are your future plans — where do you hope to go next?

If you’ve got an ambitious goal like this one [of radically changing data sharing and production practices], you’ve got to start with a modest approach — asking, “what is the simplest thing we can do that would be useful?” So we began by identifying some of the key things necessary for a knowledge-sharing infrastructure, to figure out what we could contribute. Sometimes what’s needed is conceptual, like our definitions. Sometimes you need a guide for applying concepts, like our principles for open knowledge development. And you need a way to share resources, which is why we started KnowledgeForge, which hosts all kinds of knowledge development projects.

The impetus behind CKAN was to make it easier for people to find open data, as well as to make their data available to others (especially in a way that can be automated). If you use Google to search for data, you’re much more likely to find a page about data than you are to find the data itself. As a scientist, you don’t want to find just one bit of information — you want the whole set. And you don’t want shiny front ends or permission barriers at any point in the process. We’ve been making updates to CKAN so machines can better interact with the data, which makes it so people who want data don’t have to jump as many hurdles to get it. Ultimately, we want people to be able to request data sets and have the software automatically install any additions and updates on their computers.
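The idea of requesting a data set and having its additions installed automatically can be sketched in a few lines. What follows is a minimal illustration only, not CKAN's actual data model or API: the `DataPackage` fields, the `install_order` helper, the package names and the example.org URLs are all hypothetical, and the dependency handling simply borrows the approach Linux package managers take.

```python
from dataclasses import dataclass, field


@dataclass
class DataPackage:
    """Hypothetical metadata record for one open data package."""
    name: str
    version: str
    license_id: str          # explicit license, so reuse rights are machine-readable
    download_url: str
    depends: list[str] = field(default_factory=list)


def install_order(registry: dict[str, DataPackage], target: str) -> list[str]:
    """Return package names so that every dependency precedes its dependents."""
    order: list[str] = []
    visiting: set[str] = set()

    def visit(name: str) -> None:
        if name in order:
            return  # already scheduled
        if name in visiting:
            raise ValueError(f"dependency cycle involving {name!r}")
        if name not in registry:
            raise KeyError(f"package {name!r} is not in the registry")
        visiting.add(name)
        for dep in registry[name].depends:
            visit(dep)
        visiting.remove(name)
        order.append(name)

    visit(target)
    return order


# Made-up registry entries, purely for illustration.
registry = {
    "country-codes": DataPackage(
        "country-codes", "1.0", "odc-pddl", "https://example.org/country-codes.csv"
    ),
    "population": DataPackage(
        "population", "2008.1", "cc0", "https://example.org/population.csv",
        depends=["country-codes"],
    ),
    "gdp-per-capita": DataPackage(
        "gdp-per-capita", "2008.1", "cc0", "https://example.org/gdp.csv",
        depends=["population", "country-codes"],
    ),
}

plan = install_order(registry, "gdp-per-capita")
print(plan)
```

Because each record carries its license and its dependencies explicitly, a program can work out what to fetch, and in what order, without a person reading a web page first.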

What are the biggest challenges to making open science work? If you had to lay out a 3-point agenda for the next five years, what would the action items be?

I think that, as with nearly everything else, the social and cultural challenges may be the biggest hurdle. One aspect of making it work is ensuring that more people understand exactly what they can gain from sharing. I think it’s like a snowball: you might not get much back, initially, from sharing, but over time, you’d be able to see your data sets plugged in with other data sets, and your peers doing the same thing. The results might encourage you to share more.

As for a 3-point agenda:

1.) Open access is very important. In particular, I’d like to see the funders of science mandate not just open access to publications but also, as part of the process, open access to the data. They are paying for the research, so they can provide the incentive to make the results open. Moreover, it should be easier to get open access to the data; you wouldn’t necessarily have the same kind of struggle with publishers.

2.) I think we need more evangelism/advocacy for open science. We’re seeing big shifts in the way we do science, but we’re still on the cusp of a movement to bring open approaches together in a common infrastructure.

3.) We need to make it easier for people to share and manage large data sets. Open science is already working in some respects; arXiv.org is an extraordinary resource, for instance, but we need a better infrastructure for handling the data itself. I also think that many people are put off sharing because they think they don’t know how to manage data. That causes people to hesitate or give up completely. We need to make the process smoother. Sharing your data should be as frictionless as possible.

What do you see as the most important development in open science over the last year?

Without question, the progress we’re making with data licensing. We have the Science Commons Protocol for Implementing Open Access Data, which conforms to the Open Knowledge Definition, and the very first open data licenses that comply with the protocol: the Open Data Commons Public Domain Dedication and License (ODC-PDDL) and the CC0 public domain waiver. We now need to encourage people to start using these waivers — or any other open license that complies.

When I talk to people about what the open science movement is trying to achieve, the most common response I get is, “Well, won’t Google take care of that?” Do you hear that? What’s your response?

I would ask, “Well, what is ‘that’?” You find that many people believe that if you put something online, it’s automatically open, and Google will do the rest. Google is great, but it can’t handle things like community standards or usage rights. And in any case, I’m deeply skeptical of “one ring to rule them all” solutions. What we need is more along the lines of “small pieces, loosely joined.” Of course organizations like Google could help a lot (or hurt!), and they’re certainly an important part of the ecosystem. But at the Open Knowledge Foundation, we like to say that the revolution will be decentralized. No one person, organization or company is going to do everything. Even Google didn’t make the Web standards or create the web pages and hyperlinks that make search engines work. As it stands, Google may be good for finding bits of Humpty, but not for creating or putting him back together.

Have you read Chris Anderson’s piece, The End of Theory: The Data Deluge Makes the Scientific Method Obsolete? If so, what’s your take on it?

I’ll be politic and say that it’s provocative but ultimately unconvincing. There are reasons why we have theory. Imagine a library where you could have any book you want, but there are no rules for searching, so you have to search every book. The knowledge space is just too vast. In economics, just like in science, you need models to isolate the variables you’re interested in. There may be millions of variables, for instance, to explain why you’re a happy person right now. You had a happy childhood, you just listened to a symphony, etc. And the number of possible explanations (or, more formally, “regressions”) grows exponentially with the variables, so you’re creating a situation that’s computationally hard — problems that, using brute force, would take longer than the lifetime of the universe to solve, even with the fastest supercomputers around.

I’d argue that with more data, you need more, not less, modeling insight. As the haystack grows, finding the needle by brute force is likely to be a less attractive, not more attractive, option. Of course it’s true that more data and more computational power are a massive help in making progress in science or any other area. It’s just that they have to be used intelligently.
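The arithmetic behind that exponential growth is easy to check. The sketch below rests on two assumptions that are not in the interview: each candidate regression either includes or excludes each variable, which gives 2^n possible models for n variables, and a hypothetical machine can evaluate 10^18 models per second. The function names are illustrative.

```python
AGE_OF_UNIVERSE_YEARS = 1.38e10   # roughly 13.8 billion years
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def candidate_models(n_variables: int) -> int:
    """Number of distinct subsets of n candidate variables, i.e. 2**n."""
    return 2 ** n_variables


def years_to_enumerate(n_variables: int, models_per_second: float) -> float:
    """Years needed to evaluate every candidate model by brute force."""
    seconds = candidate_models(n_variables) / models_per_second
    return seconds / SECONDS_PER_YEAR


# Assumed throughput: 1e18 model evaluations per second (hypothetical).
for n in (10, 50, 100, 200):
    years = years_to_enumerate(n, 1e18)
    verdict = "longer" if years > AGE_OF_UNIVERSE_YEARS else "shorter"
    print(f"{n:>3} variables: {years:.3g} years ({verdict} than the age of the universe)")
```

Under these assumptions the search is still feasible at 50 variables and hopeless at 200, which is far fewer than the millions of variables mentioned above.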

On a more personal note, how does being an economist inform your approach/perspective?

Economists study information goods a lot, so I’d say my background has been very influential. Economics 101 tells us that openness is often the most efficient way to do things, especially when there’s the possibility of up-front funding by, for instance, the government. There are clear, massive benefits for society in having a healthy, balanced information commons. Unfortunately, it is often the case that those who benefit from proprietarization have better-paid advocates, better-oiled PR machines, etc.

My hope is that this work that so many of us are doing pro bono, often in our spare time, will slowly increase in impact — and that, at a minimum, we can ensure that all publicly funded scientific research will be open.


Intelligent Television: The Open Access Documentary Project

August 15th, 2008

Intelligent Television has posted early clips from a documentary project it’s undertaking with BioMed Central to showcase the benefits of open access (OA) to scientific and medical research. The project will produce a series of videos featuring interviews with activists, publishers and other stakeholders in OA, as well as consumers of OA information in the developed and developing world.

Check out the clips for highlights of interviews with our own John Wilbanks, Vice President of Science at Creative Commons, and Heather Joseph, Executive Director of SPARC.

Microsoft Research launches new tools for knowledge sharing

August 1st, 2008

Big news: Microsoft Research has unveiled new add-ins for some of the most popular Microsoft products to make them more useful for the scientific community, including tools for creating, sharing and preserving research in the formats used by scientific publishers and digital archives. The suite of add-ins, described in detail here, includes the Creative Commons Add-in for Office 2007, which lets anyone embed a Creative Commons license directly into their documents.

Using the Creative Commons Add-in, you can choose from among the licenses available on the CC site to express your intentions regarding the use of your work. The embedded license links directly to its online representation at the CC site, while a machine-readable representation is stored in the Office Open XML document.

The Chronicle of Higher Education, reporting on the launch:

Saying it wants to help scholars and publishers write, edit, and publish academic articles, this week Microsoft Corporation rolled out a set of new software tools to perform those tasks, as well as to navigate thorny copyright issues and find and share scholarly data. …

For example, the Article Authoring Add-in for Word 2007 enables authors to structure and annotate their documents according to formats that publishers and digital archives require. The articles can then be converted easily to formats that facilitate their digital storage and preservation. The company is offering the new software free to licensed users of Word and other Microsoft products.

The tool allows users to create documents in the widely used format developed by the National Library of Medicine’s free digital archive of peer-reviewed biomedical and life-sciences journal literature, PubMed Central. But users will also be able to shape the software to suit other formats because the code for the tool is openly accessible and freely adaptable. …

“We’ve never before addressed what we could put around Office, Excel, SharePoint, and our other programs to make them more useful for science,” said Tony Hey, corporate vice president of Microsoft’s external-research division. “For example, Word was not tailored for scientific papers. But we decided to see, Can we make it more useful in that way?”

He said the company is also responding to the demand for researchers to provide greater access to their findings, and even their research data. Already the National Institutes of Health requires that any publications from research it finances be placed in PubMed Central within one year of publication. The National Science Foundation has a similar requirement, as do Harvard University’s faculties of law and of arts and sciences.

Such developments have increasingly raised concerns about copyrights and fair reuse of archived materials. So to help authors, publishers, and databases embed information about copyrights and licenses in Microsoft Office documents, the company released another free product, called the Creative Commons Add-in for Office 2007.

Science Commons visited the development team working on the add-ins in Seattle last year, and we’re excited to support this initiative.

“There are fundamental shifts taking place in how we manage the flow of scientific knowledge, and they bring demand for new tools that expand our choices for knowledge sharing and collaboration,” says John Wilbanks, Vice President of Science at Creative Commons. “We’re thrilled that Microsoft has taken these important steps to meet that demand.”

Update: Jon Udell has an excellent interview with Tony Hey about the initiative: How Microsoft’s External Research Division works with a new breed of e-scientists.

WSJ: Putting Drug Development in Patients’ Hands

July 29th, 2008

The Wall Street Journal has an article today featuring Marty Tenenbaum and CollabRx, one of our partners in the Health Commons project. The piece, Putting Drug Development in Patients’ Hands, explores how patients are using the Internet to take the reins in developing drugs for disease, especially rare and neglected diseases for which drug discovery has yet to prove commercially viable. Among the strategies patients are pursuing: funding their own “virtual” pharmaceutical companies, tasked to specific diseases.

One patient profiled in the article, Bonnie J. Addario, is a lung-cancer survivor who turned to CollabRx for help. Addario began the search for a cure using the traditional approach. After raising $800,000 and distributing it to researchers, she was struck by the realization that despite the hard work going into the research, lung-cancer survival rates hadn’t changed in 40 years. Addario and her husband convened a conference with researchers to uncover hidden roadblocks:

[The] group identified a number of problems that hinder progress toward a cure. Among them: Researchers didn’t know what others were doing, tissue and blood specimens needed for experiments weren’t centrally located or shared, and the findings of experiments weren’t integrated to help assess what the key priorities should be.

The Health Commons, also discussed in the article, was launched to help make it easier for anyone to pull together the resources necessary for advancing disease research, without the legal wrangling commonly associated with securing access to research, data and unique research materials like cell lines. This includes implementing standardized, machine-readable agreements to facilitate intra-academic and academic-industry transfer of biological materials.

“The Health Commons is creating the legal and technical foundation for a friction-free, point-and-click marketplace like Amazon or eBay, so disease research isn’t bogged down by needlessly complicated, costly legal negotiations,” explains John Wilbanks, who leads Science Commons. “Some resources may be freely available; others may be available for a fee. But if it’s in the Health Commons, it will be a swift, permissionless transaction, because permission will have already been granted.”

Our founding partners in the Health Commons are CommerceNet, CollabRx and the Public Library of Science.

To learn more about the Health Commons, visit the project page, where you’ll find a white paper co-authored by Wilbanks and Tenenbaum, plus a 6-minute video introduction to the project.

A public commitment to open science

July 28th, 2008

Good design choices are the key to powerful network effects. And when the goal is accelerating scientific research, there may be no more powerful design element than institutional policy. By making the right policy choices, people at institutions can help usher in new norms for knowledge sharing — where research results are systematically “plugged into” the network, multiplying the opportunities for discovery.

Boston University’s Superfund Basic Research Program (BU SBRP) has embarked on just such an endeavor. The program, which works to uncover the effects of improperly managed hazardous waste on reproductive health, has published its own open science policy. The policy is a declaration of BU SBRP’s commitment to sharing research, and outlines the methods it uses to make the results freely available to anyone who can use them:

Our SBRP program holds the view that publicly supported scientific knowledge and tools produced by research programs like ours should be freely available and accessible. On this web site we have implemented our commitment to open science in different ways, including using new web-based technologies and alternative approaches to licensing work that make the vision of shared research possible. …

The use of open source wiki software encourages communication and collaboration on research, both externally with the public and internally within a research group. BU SBRP is developing an internal wiki for research collaborations.

RSS (Really Simple Syndication) allows subscribers to automatically see the latest products of research, including new publications, events, and research tools freely accessible through a web site.

Emerging permissionless licensing systems allow researchers to choose the terms under which they want to share their work; these include Creative Commons Licenses and the General Public License.

Finally, open access journals are those which make content available to everyone, without requiring a subscription. With the emergence of web-based publishing, this model can make research more easily available to more researchers in more locations. A list of open access journals can be found at DOAJ.

Acknowledging that the outputs of research “come in many forms,” the program uses these tools in many different ways:

Statistical techniques and computer code for modeling environmental exposures and health outcomes can be licensed through permissionless systems, written in open source languages and fully commented, and shared through a wiki. Laboratory methods and synthetic data created to test different techniques can also be shared, updated, modified by individual researchers or collaboratively, and discussed through wiki software. Published articles can be made accessible through Open Access on-line journals.

In publishing the policy, BU SBRP seeks not only to provide information about its approach, but also to develop a “compelling model” for other research programs dedicated to sharing “scientific findings, analytical tools, data, and research methods.”

It’s an extremely promising and worthwhile project. Kudos to BU SBRP for helping to define and propagate new norms for knowledge sharing that can foster the growth of open science.

Galapagos NV: drug discovery innovator

July 24th, 2008

Not long ago, GlaxoSmithKline (GSK) made headlines with its “Massive Cancer Information Giveaway”: a gift of over 300 cell lines derived from a wide variety of tumors, including breast, prostate, lung and ovarian cancers. The cell lines were made freely available to cancer researchers via the National Cancer Institute’s caBIG, with the goal of “crowdsourcing” the search for predictive biomarkers, making clinical trials shorter and, ultimately, getting drugs more quickly to people suffering from disease.

Now Galapagos NV, a Belgium-based drug discovery company, is following in GSK’s footsteps. The company is making freely available several proprietary databases of information about small molecules and proteins through the EMBL’s European Bioinformatics Institute (EMBL-EBI). With funding from the Wellcome Trust, EMBL-EBI is returning the data to the public domain — precisely what we recommend in the Science Commons Protocol for Implementing Open Access Data.

“This makes the scientific data that Galapagos has gathered an extraordinary gift — not just to science, but to open science,” says John Wilbanks, who leads Science Commons. “Returning the data to the public domain removes the legal barriers that prevent us from making full use of the latest technologies for data integration and analysis. The Galapagos data can now be used in ways no one can anticipate — the very definition of innovation.”

Building the foundation for open science

July 23rd, 2008

The New York Times ran a much-discussed piece this week on open science and our colleague Karim Lakhani at the Harvard Business School: If You Have a Problem, Ask Everyone. John Wilbanks was with Lakhani at OSCON when the story broke, speaking at a forum on participatory cultures.

The piece gives a tantalizing glimpse of the potential for open science. With a foundation in a shared legal and technical framework, we could scale initiatives like the ones Lakhani has identified, with the Web itself functioning as an innovation engine for science.

That was the impetus behind the open science workshop we held last week in Barcelona in conjunction with ESOF 2008. We’ve now published our recommendations [PDF] for foundational principles to help foster the growth of open science across the globe — a way to connect and empower the people and organizations using open approaches to accelerating discovery. It provides definitions for four pillars of open science: open access to literature, access to research materials, open data and open cyberinfrastructure.

“We have to look hard at the foundations of science on the network and advocate ceaselessly for the necessary upgrades — to science as well as the network — that will allow us to get millions of new eyes on science, asking millions of new questions,” writes Wilbanks in a post on his blog at Nature Network. “Until that happens, we won’t really have a digital science culture. We’ll have simply made the old problems into digital problems.”

We’re grateful to all of the participants for joining us and helping to make the workshop a success. Special thanks go to Dr. Cameron Neylon, who not only participated but also blogged the event at Science in the open, sharing his insights with the community at large. Here are links to his notes, along with a few brief excerpts:

  • Policy and technology for e-science: a forum on open science policy: “James Boyle (Duke Law School, Chair of board of directors of Creative Commons, Founder of Science Commons) gave a wonderful talk (40 minutes, no slides, barely taking breath) where his central theme was the relationship between where we are today with open science and where international computer networks were in 1992. He likened making the case for open science today with that of people suggesting in 1992 that the networks would benefit from being made freely accessible, freely useable, and based on open standards.”
  • Policy for open science: the wrap-up session: “The benefits of the open web come from the explosion of people actually using a computer network. We must think of the users of an open-architected science [having] the same potential for explosion.”
  • Policy for open science — reflections on the workshop: “The workshop that I’ve reported on over the past few days was both positive and inspiring. There is a real sense that the ideas of Open Access and Open Data are becoming mainstream. As several speakers commented, within 12-18 months it will be very unusual for any leading institution not to have a policy on Open Access to its published literature. …Open Data remains further behind, both with respect to policy and awareness. … We need to look critically at different models [for building a commons], what they are good for, how they work.”

We’re hoping to continue the fruitful conversations started at the Barcelona workshop. If you’re interested in joining us, send us an email to let us know.

How open is that data?

July 21st, 2008

Last week we shared the news about research Melanie Dulong de Rosnay has been conducting on the complexities of opening access to scientific data. The research has now been published over at Nature Precedings, in the form of a paper entitled, Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness.

“Molecular biology data are subject to terms of use that vary widely between databases and curating institutions,” writes de Rosnay in the abstract. “[This paper] builds upon research led by Science Commons demonstrating why open data and the freedom to integrate facilitate innovation and how this openness can be achieved. … [Most terms of use for databases] are not harmonized, [are] difficult to understand and impose controls that prevent others from effectively reusing data.”

To address the problem, the paper proposes a “checklist for data openness…to assist database curators who wish to make their data more open to make sure they do so.”

That’s not all. Our outstanding research assistant, recent MIT graduate Shirley Fung (S.B. 2007 and M.Eng 2008), helped de Rosnay with the project and has published a website that lets anyone explore and evaluate the legal and technical openness of the sample set of scientific databases (and submit more databases). Here’s a look at the home page:

  • Find Open Data: Browse a list of databases compliant with the Science Commons Open Access Data Protocol
  • Browse Policies: View databases categorized by their technical and legal accessibility regimes
  • Classified Databases: Find all the databases classified by the project. You also may want to use the “Browse Policies” section for specific kinds of databases
  • Submit Policy: Use the questionnaire to submit a database policy to our system

If you have questions about the project, let us know.

On the complexities of sharing scientific data

July 16th, 2008

Ethan Zuckerman, the Berkman fellow who founded Geekcorps and co-leads Global Voices with Rebecca MacKinnon, has a nice piece today on our efforts to clear the legal hurdles blocking the integration of scientific databases, highlighting research by Melanie Dulong de Rosnay (see our post from earlier this week).

Writes Zuckerman at World Changing:

Creative Commons is a clever use of the copyright system intended to make it easier for people who want to, to share their work with others. Jonathan Coulton has used Creative Commons to enable an army of remixers and videomakers to produce promotional materials for his songs and albums. Authors like Dan Gillmor and Cory Doctorow have used Creative Commons to let people download, translate and make audio versions of their books. And Global Voices uses Creative Commons so that blogs and news sites can use our content without asking us for permission.

What about scientists?

[...]

Under US law, pretty much anything you write down is copyrighted. Scrawl an original note on a napkin and it’s protected until 70 years after your death. Facts, however, are another matter - they can’t be copyrighted. So while trivial but creative scribblings are copyrighted, unless you choose to release them into the public domain, the information painstakingly discovered about the human genome - DNA sequences, for instance - aren’t. But the containers they’re stored in - the databases they’re held in - can be copyrighted.

If I sound confused about this stuff, that’s because I am.

[...]

This question of complexity is what Melanie’s research has focused on. She looked at the terms of use for roughly 200 databases necessary for work in the life sciences. Evaluating the terms on all those databases, she discovered that only 7 met her stringent definitions of Open Access to data - these databases could be accessed without registration; they could be downloaded for local use; they could be incorporated into other works; they had clear, understandable terms of use. This last factor proved to be the most challenging. She spent hours reading these terms with other experts in the field and discovered that, a great deal of the time, the experts disagreed on what was permitted under a specific agreement.

The reason this is important, Melanie explains, is that scientific research proceeds more quickly when researchers can share resources. But with databases encumbered by different, confusing legal protections, it can become a legal nightmare for researchers to do complex work building new tools that combine information from two databases in a novel way, for instance. And databases that are protected by access restrictions can be out of reach to scientists in developing nations who might not have the financial or technical resources to access them.
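The four criteria Zuckerman lists lend themselves to a simple checklist. The sketch below is only an illustration of that idea, not the taxonomy or code from de Rosnay's paper or the project website: the `DatabaseTerms` record, the field names and the two example databases are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseTerms:
    """Hypothetical summary of one database's terms of use."""
    name: str
    no_registration_required: bool
    downloadable_for_local_use: bool
    reusable_in_other_works: bool
    terms_clear: bool


# The four criteria described in the excerpt above, keyed by field name.
CRITERIA = {
    "no_registration_required": "accessible without registration",
    "downloadable_for_local_use": "downloadable for local use",
    "reusable_in_other_works": "can be incorporated into other works",
    "terms_clear": "clear, understandable terms of use",
}


def failed_criteria(terms: DatabaseTerms) -> list[str]:
    """Return the descriptions of every criterion the database does not meet."""
    return [label for attr, label in CRITERIA.items() if not getattr(terms, attr)]


def is_open(terms: DatabaseTerms) -> bool:
    """A database counts as open here only if it meets all four criteria."""
    return not failed_criteria(terms)


# Made-up examples, purely for illustration.
open_db = DatabaseTerms("Example DB A", True, True, True, True)
restricted_db = DatabaseTerms("Example DB B", False, True, True, False)

for db in (open_db, restricted_db):
    status = "open" if is_open(db) else "not open: " + "; ".join(failed_criteria(db))
    print(f"{db.name}: {status}")
```

The hard part, as the excerpt notes, is not running a check like this but agreeing on the answers, particularly for the last criterion.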

So how do you deal with the problem of conflicting terms of use for the “containers” of scientific data?

As Zuckerman points out, we initially offered advice aimed at helping database publishers figure out when it made sense to use Creative Commons licenses. But it was evident that this wouldn’t solve the problem.

After further research, Science Commons collaboratively developed and published the Protocol for Implementing Open Access Data, through which we recommend explicitly returning the data to the public domain, using legal tools like the CC0 waiver or the Open Data Commons Public Domain Dedication and License (ODC-PDDL).

If you’d like to learn more about the protocol, we encourage you to check out the FAQ or send us an email. We’re happy to answer any questions you may have.

Video: Timo Hannay on CC-based science publishing

July 14th, 2008

In case you missed it over at Open Access News, there’s a new 3-part video documentary on publishing open content using Creative Commons licenses. Part two (below) is an interview with Nature’s Timo Hannay — an especially interesting bit in light of the recent discussions about business models for sustainable open access (OA) publishing.

If there’s anything that the interview makes clear, it’s that scientific publishing is undergoing a profound transformation, and people in OA are confronting its most difficult challenges head-on. They’re freeing science while providing us with valuable lessons about the kinds of publishing models that can work in the digital era — a boon on both counts.