Thursday, June 12, 2008

Communicating in Code

Every year, thousands of computer science papers are published. Many of these papers describe an algorithm or a program that the author has written, showing how the author's approach is better than previous solutions to the problem. The papers are almost invariably written in English (a universal language in the subject) and contain mathematical notation (another universal language). All of this is useful. It distributes knowledge to other researchers and drives further development. But strangely, most of these papers contain no source code. Sometimes there is a small snippet of pseudocode illustrating the central innovation. Often, even that is missing. It seems bizarre, as if a journal of art criticism contained no illustrations, or a book on how to cook contained no recipes. Why are we in this absurd situation?

1. Computer scientists, not software architects. The usual response is that the author is a scientist studying ideas, rather than an engineer building something meant for serious use. But the scientist has invariably already built something to test the idea. It's not like particle physics, where theorists have ideas that can't be put into practice because of the expense. (Ideas like these are usually dismissed as unworthy of a paper in computer science. It's a cultural thing.) The code that was built could easily be distributed, but isn't. Perhaps some of the authors really have no interest in seeing source code from others, and assume everyone feels the same way. This way of thinking is so alien to me that I have trouble even considering the point of view. A mathematical formula is so compact in its notation that it takes enormous effort to unpack. And the formulas contained in papers rarely define all their variables, expecting the reader to pick up the meaning of some of them from experience or context. Working source code has to spell all of this out.

2. Lack of a universal language. There is no standard programming language. Perhaps a dozen languages are currently in wide use. Where source code is available for the kind of programs I am interested in, it is invariably in one of the following languages: C, C++, C#, Java, Matlab, or Lisp. However, all of these (with the exception of Lisp) are so similar in structure that any programmer who can understand code in one of them can understand code in any of the others. An unfamiliar language may be uncomfortable to work in, but it is more of an annoyance than a roadblock.

3. Lack of universal libraries. Most programs use other libraries in order to run. These libraries can be difficult to install and use, and often have unwritten assumptions built into them. The researcher could easily share the original code, but sharing the libraries needed to run it, or even explaining where to get them, seems like a huge burden.

This is true, but why is it true? Why don't we have a widely shared set of libraries to take care of all the previously solved computer science problems? The answer is that no one is making the effort to create useful code and share it. If this were the priority rather than just the papers, the problem would quickly take care of itself. It's partly a chicken-and-egg problem.

4. The program wasn't written to be understood. Often computer scientists are ashamed of their code, and don't want it to be made public. They know that it is sketchy, inefficient, poorly documented, bug-ridden, and requires arcane rituals to get to work at all. It's more like a messy research notebook than a paper. This is all true. But why don't they rewrite the code for the paper? It wouldn't take much more work than writing the paper itself. The answer is purely cultural: a paper confers prestige, raises awareness, gets you into conferences, and so forth. Beautiful code goes largely unappreciated. It is this cultural aspect I really want to see changed. It seems to me that the right language in which to express computer science ideas is (well designed and well documented) code. It can sacrifice some efficiency for the sake of clarity. Learning to communicate in this way should be taught as part of every computer science class.

I'm not advocating strict obedience to some standard. I'm advocating teaching literacy in programming: the ability to read and write code whose purpose is completely transparent and which is as easy to read as prose. Perhaps the language, tools, and libraries to do this don't really exist yet. But in the future, people will look back in wonder that we were able to get anything done at all. Imagine how useful it would be to be able to immediately compare a new approach to any existing approach. Imagine how much progress could be made if all the existing ideas were available to be used without significant effort as building blocks of a new algorithm.
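As a rough illustration of code that trades some efficiency for clarity, here is a minimal sketch. The choice of Python, the sample-variance example, and the name `sample_variance` are all assumptions made for illustration; none of them come from a particular paper. It deliberately uses the plain two-pass definition rather than a faster one-pass formula, and names every quantity a formula would leave as a bare symbol.

```python
from typing import Sequence


def sample_variance(measurements: Sequence[float]) -> float:
    """Return the unbiased sample variance of the measurements.

    This follows the textbook definition step by step: compute the mean,
    then average the squared distances from the mean, dividing by
    (count - 1). A one-pass version would read the data only once, but
    it would hide the definition the reader is trying to learn.
    """
    count = len(measurements)
    if count < 2:
        raise ValueError("variance needs at least two measurements")

    mean = sum(measurements) / count
    squared_distances = [(value - mean) ** 2 for value in measurements]
    return sum(squared_distances) / (count - 1)
```

A reader who has never seen the formula can recover it from the code alone, which is the kind of literacy being argued for here.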

Sigh.
