Sunday, April 12, 2009

open lab architecture

as i'm approaching the time to write my thesis, and getting lots of requests for my code, and collaborating with a bunch of people, and learning about version control, good coding/documentation practice etc., i want to establish for myself an architecture within which i can accomplish all my goals. the desiderata for this architecture are as follows:

desiderata:

1) easy collaboration on projects
2) version control of everything (keeps a complete archived scientific record)
3) allows for arbitrary openness (to eventually have a kind of "open lab notebook", but that is well organized)
4) general enough to incorporate both real experiments, and code development, and everything else my "lab" might do
5) custom rss feeds

i've been working on the back end (in my spare time), over the last few months. i've now converged on something that seems pretty good to me. in particular, for each project, i have the following hierarchical folder system:

-- code (folder for each set of scripts collectively implementing some function)
-- data (organized into subfolders according to lab of origin)
-- docs (folder for each publication)
-- meeting notes (folder for each meeting)
-- talks (a folder for each talk)
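
a minimal sketch of setting up that hierarchy for a new project, in python. the folder names come from the list above; the function name and the idea of scripting this at all are just assumptions for illustration, not something i actually have in place.

```python
from pathlib import Path

# top-level folders from the list above
PROJECT_FOLDERS = ["code", "data", "docs", "meeting notes", "talks"]


def make_project_skeleton(root, project_name):
    """Create the per-project folder hierarchy and return the project path."""
    project = Path(root) / project_name
    for folder in PROJECT_FOLDERS:
        # exist_ok=True makes the call safe to repeat on an existing project
        (project / folder).mkdir(parents=True, exist_ok=True)
    return project
```

note that git does not track empty folders, so each folder would need at least one file in it before it shows up in the repo.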

each project is under version control using git, with a private repo hosted by github. this seems to work pretty well, although i still haven't figured out a particularly good way to integrate documentation of code with this.

with regards to the front end, i want to create a wiki homepage, organized as follows:

-- projects
-- papers
-- posters
-- talks
-- blog

the papers, posters, and talks would really just be a bibliography, with links to final versions. blog would also just be a link to my blog (which would ideally be formatted to look like the rest of the webpage). the projects section, however, is really the key thing under consideration. each project's front end would then be organized much like a publication. for instance, a random project might be organized as follows:

I. Intro
II. Model
III. Algorithm
IV. Results
V. Discussion
VI. Bibliography

each section would be organized somewhat differently. consider, for instance, the "results" section. it would be organized with a number of subsections, one for each "feature". each feature would be organized as follows:

-- some text with equations/figures
-- source code (and data necessary to generate figures/statistics)
-- documentation for implementing this feature on new data

the key, however, is how that organization gets there. essentially, what i'd want to do is use commands such as \input{...}, which would call appropriate files. the text subsection, for instance, would call a tex file, that is written to be compiled either by a wiki, or by latex. this is actually pretty easy to do, since most wiki engines use latex markup for equation editing (i'm looking into the details on various wiki forums).
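
to make the \input{...} idea concrete, here is a rough python sketch of what a front end would have to do: read a file and splice in whatever each \input{...} points to. the function name is hypothetical, and resolving paths relative to the including file is a simplifying assumption (latex itself resolves them relative to the main file).

```python
import re
from pathlib import Path

# matches \input{some/file} and captures the file name
INPUT_PATTERN = re.compile(r"\\input\{([^}]+)\}")


def expand_inputs(path, depth=0, max_depth=10):
    """Return the text of `path` with every \\input{...} replaced by that file's text."""
    if depth > max_depth:
        raise RecursionError(f"\\input nesting deeper than {max_depth} at {path}")
    path = Path(path)
    text = path.read_text(encoding="utf-8")

    def replace(match):
        target = path.parent / match.group(1).strip()
        # latex lets you omit the .tex extension, so try adding it
        if not target.exists() and target.suffix == "":
            target = target.with_suffix(".tex")
        return expand_inputs(target, depth + 1, max_depth)

    return INPUT_PATTERN.sub(replace, text)
```

the same source files could then feed both the latex build and the wiki page, so nothing gets copied and pasted.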

it seems to me that, with this architecture, i would have achieved all my desiderata, and it could transform the way i work in a positive way. for instance, nobody would ever have to email me to ask me about my progress on a project that i am working on with them, as they would have an rss feed updating them on my progress. making all my code available would simply be a matter of changing permissions.
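
as a rough sketch of the rss part, assuming i already have a list of updates (say, pulled from commit messages), a feed could be built like this in python. the function name and the shape of the entries are assumptions for illustration; a real host may well generate feeds on its own.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, entries):
    """Build an RSS 2.0 document (as a string) from a list of update entries.

    Each entry is a dict with "title", "link", and "description" keys.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # an RSS 2.0 channel needs a title, a link, and a description
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for entry in entries:
        item = ET.SubElement(channel, "item")
        for key in ("title", "link", "description"):
            ET.SubElement(item, key).text = entry[key]
    return ET.tostring(rss, encoding="unicode")
```

a "custom" feed per collaborator would then just be a matter of filtering the entries before building the feed.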

perhaps most importantly, i would have a single source documenting all my work. the current culture of science requires that, every so often, scientists package some elements of their work and publish them as a stand-alone *document*. obviously, that approach is totally antiquated, for a number of reasons. first, science is a progression of steps, with no obvious delineation between major landmarks. thus, forcing us to publish everything at once, instead of steadily over time, is unnatural. second, if there are mistakes in the published version, they simply stay in the scientific record. clearly, if mistakes are found, it would be advantageous to update the scientific record. third, a wiki facilitates a more collaborative attitude, as other people, if they find mistakes, can simply make edits directly, or at least suggest them. fourth, incorporating things like simulations, videos, etc. is straightforward. fifth, publications could link directly to my wiki, so people could easily find it.

anyway, this idea leads me to the following questions:

1) do you have any suggestions/modifications to the proposed architecture?
2) in particular, do you have any good ideas about where to incorporate documentation of my code?
3) do you think, generally, this is a good idea and would be worth my time?
4) do you think i should locally host everything (ie, version control and wiki), or use other people's stuff (which would mean something like github and wikidot)? if other people's, do you have any suggestions of which wiki host to use?
5) how much work/time, realistically, do you think this would take to set up? and do you think it is worth it?

10 comments:

Benjamin Stein said...

My random thoughts:

* This is definitely worth pursuing. I agree that the current model of research/publications is antiquated. I understand why, but there's no reason you have to play in that space. I think it's worth trying a few things and seeing what sticks.

* Git (and GitHub in particular) is a fantastic place to keep code and even documentation. I recommend building on top of it.

* Git is relatively inaccessible to people. It requires a non-zero amount of effort to understand and use. That's not to say you shouldn't use it, but you should be cognizant of that fact.

* Keep the DRY principle in mind. Try to find a way to pull code into your presentations/wiki directly from your Git repo instead of copy/pasting.

* Use other people's shit whenever you possibly can (Avoid the NIH syndrome)

* Does it make sense to separate into papers, posters and talks? They all seem like the same thing to me.

* Why can't you comment your code inline? Why does it need something special?

* Take a look at http://github.com/raganwald/homoiconic/tree/master It's a great example of someone using GitHub for publishing. As you can see it has a lot of the benefits that you spoke about, but it's confusing what's going on and the format is pretty inaccessible to most people. Try to grok what he's doing and then see which parts work well and which parts don't. I think you can learn a lot here.

joshyv said...

* my only worry about building on top of github is the relative inaccessibility of it. it seems to me that if i run my own wiki, or use some wiki host, things can look much more like a website, and be simpler. one of the main desiderata of this project would be to make the process of going from my thoughts, to my code, to your problems, as SEAMLESS as possible.

* i'm not sure what the DRY principle is, but if it means never copying/pasting, and instead simply using commands like \input{...}, then i'm totally with you. i'm just not entirely sure how to do it yet.

* the issue of papers/posters/talks is actually a bit complicated. essentially, each project has (i) code, (ii) data, and (iii) docs, where docs contains papers and notes. but posters/talks transcend individual projects, acting much like "review materials". so actually, in my "research" folder on my computer, i have the subdivisions: (i) projects, (ii) posters, (iii) talks, and (iv) misc. the projects folder contains projects as described above. the misc folder contains macros and global definitions that i use across projects, like latex templates and such. posters and talks could technically be combined into a single folder (along with review papers when i start writing them), assuming i had a good front end to navigate. but posters and talks are very different things. i use different programs to edit and preview them. also, on the front end, i'll have different ways of incorporating them. for instance, talks i'll probably upload into slideshare and then embed them. thus, things like that become easier.

* i'm not sure what you mean by just "comment your code inline". i have line-by-line comments in all my code, plus, i have an explanation of what's going on at the top. but that explanation is redundant with my tex files, and doesn't link to them now. also, the explanation in my scripts is ugly, and doesn't compile using latex or anything, and lacks the results of running the code.

* i looked at homoiconic. he seems to share my desiderata, but it's not clear that he has converged on anything particularly useful. can you give me a hint regarding what you think he has done well?

joshyv said...

as i think about it a little more, and incorporate some of your comments, maybe something like the following would be the best approach:

* use git as the back end, and build everything on top of that

* github can act as the front end for collaborative coding, and serve as a remote repository, but it is really not appropriate as a front end for more general purpose uses

* a wiki, on the other hand, does seem ideal for front end uses

* thus, each script's documentation can be a link to some file under version control, much like my articles are essentially just calling a bunch of other files, and my wiki will do the same

* as far as i can tell, the easiest way to do all this would be to host my own wiki, which could call local files (otherwise, i'd have to figure out a way to mirror my files onto some wiki farm).

* this seems all relatively straightforward to implement actually, given that the back end is already set up fairly well. i really just need to create a wiki on some server that i run.

* do you guys agree with all this?

joshyv said...

given all the above stuff, i still need to use an editor to deal with all this stuff. historically, i've been using vim to edit posters/papers/talks/documents, and meditor to edit matlab files. obviously, this is problematic in a number of ways, most importantly:

1) vim is not pretty, and is for linux users, which i'm not, so using it never seems natural to me

2) meditor doesn't have features that i'd like an editor to have, for instance, reasonable key bindings, projects, etc.

based on this, i'm thinking about starting to use a single editor for everything: wiki, docs, talks, papers, matlab, webpage, etc. seems to me, given my constraints, and that i'm on a mac, that textmate (with latex, matlab, and git bundles) is the way to go. what do you think?

joshyv said...

project hosting

* it seems like the idea of using some code hosting platform is probably the way to go. for concreteness, my desiderata for code hosting are:

1) natively hosts git
2) has a wiki that is editable and reasonable
3) is not scary to my end users
4) has a place for feature requests/bugs/questions/etc.

as far as i can tell, github, which i've been using, only satisfies the first desideratum, but not the others. google has the others, but not the first (at least not yet, see: http://tr.im/iMi5). gitorious has all but the last. various other code hosting services mostly don't satisfy the third desideratum.

i put in a feature request at gitorious to add a place for feature requests, and i put a feature request in google code hosting to add native git support, so we'll see who comes through. regardless, it seems like we are converging on something....dare i say it....HUGE!

joshyv said...

so, i thought more about what you said, and i was nearly convinced that the right way to go would be for me to simply make a webpage that links to my git repos, which store my papers in pdf form.

but then i had the following thought: when writing papers, after having all my ideas down on paper, in some format that i like, i have to spend a few days modifying it to submit it to whatever journal we are submitting to. the point is, in general, i have my results written down in some format that i like, and then i port it to some format that the journal likes. this is always a tedious process. i had hoped that using latex would ease that process, as it would simply be a matter of changing templates, but that is in fact not the case. rather, we have to rearrange sections for them, put tables and figures in the end, put figure legends separate from figures, limit the bibliography section for them, etc. point is, a bunch of work, that has to be done, no matter what. further, submitting papers in latex format has been a horrible experience, for many journals. so bad, in fact, that in the future, if i submit to such a journal, i will certainly NOT submit it in latex format, but rather, word, or whatever stupid bullshit they want.

all this in mind, it seems to me that i should organize my thoughts as they progress in a way that is...well....organized. a tex file compiled into a pdf is a fine way, but it certainly lacks a bunch of nice features. first, it must be downloaded in its entirety each time. second, "diffs" are not so easy to implement in that format. third, incorporating things like videos is a pain. fourth, modifying it is kind of a hack. in particular, you have to go to the tex file, and find the place that needs to be modified, modify it, recompile it, and then see. fifth, it is somewhat limiting in terms of design. sixth, i really just don't like it that much. i never have. not that i want a WYSIWYG text editor, but rather, the whole process feels unnatural to me.

given all the above, now how do you feel?

joshyv said...

a response from q:

So i guess there are a couple of things. First, yes, latex submissions are a bit
of a pain, as one still needs to change stuff. But it's not that horrible. And
once the final version is out it's rather easy to bring it back into the format
you like.

Also, tex2html and stuff like that allows for it to be easily converted into a
webpage. It's not very beautiful, but it wouldn't be too hard to produce some
css that'd make it nice.

I agree diffs are a real issue. In fact, that is my main issue with latex. I'd
love some good way of doing diffs. if you look at McKays latex code for his
book, you see what really committing to using diff means in terms of what your
code ends up looking like. Horrible. But I feel like a repository together with
diff is ok, and makes up for that.

To me it seems like git + latex + tex2html or so is great. It allows easy use of
equations etc, is waaaay more flexible than word or anything like that (though
journal production teams don't seem to think that?), allows for integration of
pdfs and web text (and, honestly, a link to a video is fine by me), is
(comparatively) easy to reformat, and, very importantly, is a standard that is
quite established and so will put off fewer ppl. I think I'll try to arrange
something like that for me. Kind of excited about it now.

the other thing to keep in mind is that there is a great value to freezing. And
it makes sense to freeze every time there is something really new. So people know
where to look. Say I like x. Then you work on your project, and find out that x'
is better. But for whatever reason I actually liked x. I should easily be able
to find that. And having to wade back through a long list of updates is not
cool. There should be a simple list of major updates that clearly identifies the
x I liked so I can find it again really easily. I think published papers do that
pretty well.

About submission: submitting in word is EVEN worse. Honestly, I am really fed up
with silly reformatting by journals. Just don't get what the use is. Do it the
way NIPS does it, minimal (though still non-negligible) work, and produces great
results.

joshyv said...

>So i guess there are a couple of things. First, yes, latex submissions are a bit
>of a pain, as one still needs to change stuff. But it's not that horrible. And
>once the final version is out it's rather easy to bring it back into the format
>you like.

i don't disagree with that entirely, but then you basically need two main.tex's, one for "my" version, and one for "their" version. each simply calls other tex files, that have the meat in them. for instance, "my" main.tex would look like:
\input{latex_paper.tex}
\begin{document}
\section{intro} \label{sec:intro}
\input{intro}
...
\end{document}

"their" main.tex would look similar, but it would first call some other template (other than "latex_paper.tex" that i've defined). figures and figure legend calls would be arranged differently. otherwise, they'd be basically identical. the only problem with this approach would be bibliography, that i'd want different for them probably (less complete). i guess i could define a new command \mycite{.}, that would only be citations if using my template, and otherwise ignored.

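here is a rough python sketch of the two-main.tex idea: both files are generated from the same list of sections, and only the template changes. "latex_paper.tex" is my template from above; "journal_template.tex", the section names, and the function itself are hypothetical, and this ignores the figure/legend rearranging and the \mycite{.} issue entirely.

```python
def render_main(template, sections):
    """Return the text of a main.tex that loads `template` and inputs each section."""
    lines = [f"\\input{{{template}}}", "\\begin{document}"]
    for name in sections:
        # each section gets a heading, a label, and an \input of the file with the meat
        lines.append(f"\\section{{{name}}} \\label{{sec:{name}}}")
        lines.append(f"\\input{{{name}}}")
    lines.append("\\end{document}")
    return "\n".join(lines) + "\n"


sections = ["intro", "model"]  # hypothetical section files: intro.tex, model.tex
my_main = render_main("latex_paper.tex", sections)
their_main = render_main("journal_template.tex", sections)
```

the two outputs differ only in the first line, which is the point: the meat lives in the section files, and each main.tex is just a thin wrapper.
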

>Also, tex2html and stuff like that allows for it to be easily converted into a
>webpage. It's not very beautiful, but it wouldn't be too hard to produce some
>css that'd make it nice.

def not beautiful.


>I agree diffs are a real issue. In fact, that is my main issue with latex. I'd
>love some good way of doing diffs. if you look at McKays latex code for his
>book, you see what really committing to using diff means in terms of what your
>code ends up looking like. Horrible. But I feel like a repository together with
>diff is ok, and makes up for that.

wiki + git + good hosting (such as github or gitorious) = perfect for diffs

>To me it seems like git + latex + tex2html or so is great. It allows easy use of
>equations etc, is waaaay more flexible than word or anything like that (though
>journal production teams don't seem to think that?), allows for integration of
>pdfs and web text (and, honestly, a link to a video is fine by me), is
>(comparatively) easy to reformat, and, very importantly, is a standard that is
>quite established and so will put off fewer ppl. I think I'll try to arrange
>something like that for me. Kind of excited about it now.

seems to me like git + wiki + good host + wiki2latex = pretty great. my wiki main will look a lot like the above. wiki2latex compiles both a tex file, and a pdf, and is run by mediawiki, meaning that it is properly developed, as they make money based on it (unlike tex2html, which is more of a side project as far as i can tell). regardless, it mostly seems like the difference between our preferences is relatively minor at this point. the main differences, as far as i can tell are (i) i care more about the front end being pretty, and (ii) i think my way is more conducive to next generation science, in that it incorporates issue tracking nicely, and more naturally incorporates minor improvements to results, etc. your approach, however, is somewhat easier.

>the other thing to keep in mind is that there is a great value to freezing. And
>it makes sense to freeze every time there is something really new. So people know
>where to look. Say I like x. Then you work on your project, and find out that x'
>is better. But for whatever reason I actually liked x. I should easily be able
>to find that. And having to wade back through a long list of updates is not
>cool. There should be a simple list of major updates that clearly identifies the
>x I liked so I can find it again really easily. I think published papers do that
>pretty well.

agreed. i'll still have some papers though :)

>About submission: submitting in word is EVEN worse. Honestly, I am really fed up
>with silly reformatting by journals. Just don't get what the use is. Do it the
>way NIPS does it, minimal (though still non-negligible) work, and produces great
>results.

didn't know word was even worse. NIPS is pretty great, and as good a system as i can imagine.

joshyv said...

so, one aspect of things that hasn't yet been made explicit is that a multilevel front end is certainly desirable. more specifically, there is the level for the casual reader (hereafter, top level), and then the level for the end user (documentation level). these are very different things. so, i now see the process in three distinct steps:

1) as our thoughts are progressing in a project, it makes sense to document our progress in something like a wiki, which is super easy to modify as we progress.

2) when we get ready to submit an article on the topic, we basically need to create a document in some way. assuming we were documenting our thoughts in an intelligent way, this will require essentially making a new file, by calling something like wiki2latex or wiki2doc on each of the sections of the wiki that we want to incorporate into the published version. obviously, that command on its own won't work, so it will require a little manual formatting, but probably not much more than is already required to get in their stupid formats. at this point, we can also make all the pages of the wiki public.

3) at some later date (around the time of publication probably), we can then release documentation of our code, if that is appropriate. this is actually fairly easy to implement, given that i've commented the code well. the key to this, is that using matlab, i can publish m-files to a variety of formats. playing around with this a bit, it seems i can export to html, xml, or latex. so, i'm leaning toward html or xml at this point, but i don't know enough about embedding latex/wiki in either of them.

thus, it seems like i'm very near to a solution. i need somebody to host my wiki, and my project. makes sense for that to be the same place, as otherwise, things could get complicated (trying to sync everything). gitorious seems like a pretty nice place to do it. i basically only have two remaining questions:

(a) does anybody know of any easy way to embed either latex or wiki formatted stuff into html or xml?

(b) can anybody suggest a host that has a wiki, native git support, issue tracking, and generally looks pretty?

Benjamin Stein said...

Unfortunately I don't think you'll get any of that stuff out of the box from a hosting provider or web company. You're going to need to get a server, install a bunch of shit, add your business logic, and then pretty it up.