Sunday, April 12, 2009

open lab architecture

as i approach the time to write my thesis, i'm getting lots of requests for my code, collaborating with a bunch of people, and learning about version control, good coding/documentation practice, etc. so i want to establish for myself an architecture within which i can accomplish all my goals. the desiderata for this architecture are as follows:

desiderata:

1) easy collaboration on projects
2) version control of everything (keeps a complete archived scientific record)
3) allows for arbitrary openness (to eventually have a kind of "open lab notebook", but that is well organized)
4) general enough to incorporate both real experiments, and code development, and everything else my "lab" might do
5) custom rss feeds

i've been working on the back end in my spare time over the last few months, and i've now converged on something that seems pretty good to me. in particular, for each project, i have the following hierarchical folder system:

-- code (folder for each set of scripts collectively implementing some function)
-- data (organized into subfolders according to lab of origin)
-- docs (folder for each publication)
-- meeting notes (folder for each meeting)
-- talks (a folder for each talk)
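
as a minimal sketch of the layout above, the following python creates the five top-level folders for a new project. the function name `make_project_skeleton` and the constant `PROJECT_FOLDERS` are hypothetical names chosen for illustration; the subfolders inside each one (per lab, per publication, per meeting, etc.) would still be added by hand as the project grows.

```python
from pathlib import Path

# top-level folders taken from the list above
PROJECT_FOLDERS = ["code", "data", "docs", "meeting notes", "talks"]


def make_project_skeleton(root, name):
    """create root/name with one empty folder per entry in PROJECT_FOLDERS.

    safe to call repeatedly: existing folders are left untouched.
    """
    project = Path(root) / name
    for folder in PROJECT_FOLDERS:
        (project / folder).mkdir(parents=True, exist_ok=True)
    return project
```

calling `make_project_skeleton(".", "my-project")` would create `my-project/code`, `my-project/data`, and so on in the current directory. note that git does not track empty folders, so each one needs at least one file before it shows up in the repo.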

each project is under version control using git, with a private repo hosted by github. this seems to work pretty well, although i still haven't figured out a particularly good way to integrate documentation of code with this.

with regard to the front end, i want to create a wiki homepage, organized as follows:

-- projects
-- papers
-- posters
-- talks
-- blog

the papers, posters, and talks would really just be a bibliography, with links to final versions. blog would also just be a link to my blog (which would ideally be formatted to look like the rest of the webpage). the projects section, however, is really the key thing under consideration. each project's front end would then be organized much like a publication. for instance, a random project might be organized as follows:

I. Intro
II. Model
III. Algorithm
IV. Results
V. Discussion
VI. Bibliography

each section would be organized somewhat differently. consider, for instance, the "results" section. it would be organized with a number of subsections, one for each "feature". each feature would be organized as follows:

-- some text with equations/figures
-- source code (and data necessary to generate figures/statistics)
-- documentation for implementing this feature on new data

the key, however, is how that organization gets there. essentially, what i'd want to do is use commands such as \input{...}, which would pull in the appropriate files. the text subsection, for instance, would call a tex file that is written so it can be compiled either by a wiki or by latex. this should be fairly easy to do, since many wiki engines use latex markup for equation editing (i'm looking into the details on various wiki forums).
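
to make the \input{...} idea concrete, here is a rough sketch of a preprocessor that replaces each \input{file} with the contents of that file, so the same tex sources could be flattened into a single page before being handed to a wiki. this is an assumption about how the step could work, not something any particular wiki engine provides; the function name `expand_inputs` is hypothetical. as in latex, paths are resolved relative to one base folder and `.tex` is appended when no extension is given.

```python
import re
from pathlib import Path

# matches \input{some/file} and captures the path between the braces
INPUT_PATTERN = re.compile(r"\\input\{([^}]+)\}")


def expand_inputs(text, base_dir, depth=0, max_depth=10):
    """return text with every \\input{...} replaced by the named file's contents.

    included files may themselves contain \\input commands; max_depth guards
    against files that include each other in a loop.
    """
    if depth > max_depth:
        raise RecursionError("too many nested \\input levels")

    def replace(match):
        path = Path(base_dir) / match.group(1).strip()
        if path.suffix == "":
            path = path.with_suffix(".tex")
        included = path.read_text(encoding="utf-8")
        return expand_inputs(included, base_dir, depth + 1, max_depth)

    return INPUT_PATTERN.sub(replace, text)
```

this sketch ignores commented-out \input lines and does not translate any latex markup; it only stitches files together, leaving the conversion to wiki syntax as a separate step.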

it seems to me that with this architecture i would achieve all my desiderata, and it could transform the way i work in a positive way. for instance, nobody would ever have to email me to ask about my progress on a project that i am working on with them, as they would have an rss feed updating them on my progress. making all my code available would simply be a matter of changing permissions.
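
one way the progress feed could be produced is to turn a list of recent changes (for example, commit messages) into an rss 2.0 document. the sketch below assumes the changes have already been collected into dictionaries with the hypothetical keys `id`, `summary`, and `details`; the function name `build_rss` and the example url are likewise placeholders. it uses only python's standard library.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, entries):
    """return an rss 2.0 document (as a string) with one <item> per entry.

    each entry is a dict with the keys "id", "summary", and "details".
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link, and description are the required channel elements
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for entry in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["summary"]
        ET.SubElement(item, "description").text = entry["details"]
        # the id is an opaque identifier, not a url
        ET.SubElement(item, "guid", isPermaLink="false").text = entry["id"]
    return ET.tostring(rss, encoding="unicode")


# made-up example data, for illustration only
example_feed = build_rss(
    "example project",
    "http://example.com/project",
    "recent changes to the example project",
    [{"id": "change-1", "summary": "added first figure", "details": "see results section"}],
)
```

a "custom" feed per collaborator would then just be a matter of filtering the entries (say, by project folder) before calling the function.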

perhaps most importantly, i would have a single source documenting all my work. the current culture of science requires that, every so often, scientists package some elements of their work and publish it as a stand-alone *document*. that approach strikes me as totally antiquated, for a number of reasons. first, science is a progression of steps, not necessarily with any obvious delineation between major landmarks. thus, forcing us to publish everything at once, instead of steadily over time, is unnatural. second, if there are mistakes in the published version, they simply stay in the scientific record. clearly, if mistakes are found, it would be advantageous to update the scientific record. third, a wiki facilitates a more collaborative attitude, as other people, if they find mistakes, can simply make edits directly, or at least suggest them. fourth, incorporating things like simulations, videos, etc. is straightforward. fifth, publications could link directly to my wiki, so people could easily find it.

anyway, this idea leads me to the following questions:

1) do you have any suggestions/modifications to the proposed architecture?
2) in particular, do you have any good ideas about where to incorporate documentation of my code?
3) do you think, generally, this is a good idea and would be worth my time?
4) do you think i should locally host everything (i.e., version control and wiki), or use other people's stuff (which would mean something like github and wikidot)? if other people's, do you have any suggestions of which wiki host to use?
5) how much work/time, realistically, do you think this would take to set up? and do you think it is worth it?