This paper describes the early development work on the leading web interface to the R statistical programming language, with a brief history up to release 3a ("Cardiff") and particular emphasis on release 4 ("Weston").
R is a statistics language modelled after the S language developed at AT&T (see [1]). It shares many features with the other major S-like language, MathSoft's Splus, and although the two differ in some fundamental ways, program code is generally portable between them. R has become popular for teaching advanced statistics in many universities because it allows almost complete manipulation of the procedures, it is similar to the commercial Splus, and it is freely distributable. Its open development style also allows new statistical methods to be added quickly, which is itself a popular attribute of the language.
Most webservers allow readers to run a limited selection of programs chosen by the webmaster. Input is accepted from the browser and output is displayed via the Common Gateway Interface (CGI), which defines how these inputs and outputs are handled. Naturally, it is possible to make R one of the available programs by using a "wrapper script" which translates between CGI and R input and output. Rcgi is such a script.
In the next section of this paper, I will describe briefly the motivation for the Rcgi system. Section three demonstrates its usage and section four details the redevelopment of the system to produce the new release. Finally, section five shows how its user community is being involved in the next round of development.
The R system is freely redistributable and modifiable under the terms of the GNU General Public License (GPL, [2]). The GNU project is an ongoing effort by the Free Software Foundation to produce a completely redistributable implementation of a Unix-like system, and R is its statistical programming language.
The GPL allows anyone to make copies of the licensed software in either source or executable form as long as the source is available upon request. In an academic environment, this is particularly useful, as all students may have a copy of the software to run on their own workstations in their own time, while we can only provide the system on our own workstations. Additionally, skilled programmers can fix or improve the basic system, which isn't possible with "closed source" software.
Even though students can easily have their own copies of the software, there are two principal reasons why it is still useful to provide access to our installation over the internet, as also supported by (for example) [3].
The first is that not all students have their own workstations. While we are hopeful that this will change as a result of new initiatives for universal computer access following from recommendations in the Dearing Report on Higher Education [4], many students have to use the open access computing facilities around the university campus. The department has the ability to put software on only a small number of these machines and has no control over the configuration of the remaining desktops. While it is possible to make R available to other campus workstations from the command line, a web interface is simpler to access and insulates us from configuration changes on the other machines. As long as a computer can still run a web browser, students can use our system from it.
Feedback from students on our third-year course (who were the first to use the system) suggests that the web front-end is popular partly because of its "batch mode" of operation, running some commands and returning the output, together with the commands for editing and resubmission.
The second and increasingly important benefit of Rcgi is the ability for the lecturer to provide worked examples to the students with the option of leaving spaces for the students to contribute their own data to the examples. Rather than having to write their own CGI programs for each example, lecturers need only write the HTML for the page (which many know already) and the program in the R language.
As stated above, one of the main benefits of Rcgi over other ways of connecting R to the web is that a lecturer using it to provide worked examples online doesn't need to be able to write and install CGI programs of their own. The simplest way to demonstrate how much of an advantage this can be is to show a small example and the output pages it generates.
Take the following snippet of R code, which defines a vector of numbers and generates the basic summary statistics for them:
test <- c(1,45,2,26,37,35,32,7,4,8,42,23,32,27,29,20)
print(summary(test))
which has the output
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    7.75   26.50   23.13   32.75   45.00
For an elementary statistics course, the lecturer may wish to provide this as a worked example, but allow the students to change the values in the "test" vector. The HTML code to do this is:
<form method='post' action='/cgi-bin/Rcgi'>
<input type='hidden' name='script' value='test <- c(' />
<input type='text' name='script'
value='1,45,2,26,37,35,32,7,4,8,42,23,32,27,29,20' />
<input type='hidden' name='script' value=')
print(summary(test))
' />
<input type='submit' value='go!' />
</form>
and screenshots of the data entry page and the results returned from Rcgi are included in figure 1.
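The form above works because all three inputs share the name "script", so the wrapper can rebuild one R program from the pieces. The following is a minimal sketch of that step, written in Python purely for illustration (Rcgi itself is written in Perl). It assumes the values are joined in submission order with no separator, and that the browser's CRLF line endings are normalised; the function name build_r_script is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode


def build_r_script(post_body: str, field: str = "script") -> str:
    """Join every value of the repeated form field, in submission order."""
    pairs = parse_qsl(post_body, keep_blank_values=True)
    script = "".join(value for name, value in pairs if name == field)
    # Browsers submit line breaks in form values as CRLF; R only needs LF.
    return script.replace("\r\n", "\n")


# Simulate the request body a browser might send for a form like the one above.
body = urlencode([
    ("script", "test <- c("),
    ("script", "1,45,2,26"),
    ("script", ")\nprint(summary(test))\n"),
])
script = build_r_script(body)
print(script)
```

Only the middle field is editable by the student, so the surrounding R code stays under the lecturer's control.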
More elaborate prewritten routines can be loaded into the demonstration software in one of two ways, both capable of being shared between many HTML pages. At the simplest level, it is possible to load in a file of commands which will be evaluated by R as if they had been entered directly. By placing the file on the Rcgi server and using R's source command, they can be loaded in this way.
R also offers a sophisticated extension facility by way of its library command. A number of downloadable packages for various tasks are available from the Comprehensive R Archive Network (CRAN) and are loaded into R with this command when needed. Any libraries available to the R installation used by the Rcgi server are usable by page writers.
As an elementary security check, pages passed to the Rcgi system have to come from one of a defined list of approved sites. This should prevent use by anyone other than people approved by the webmaster responsible for the Rcgi system. However, the command input interface is available to everyone by default, so most people using the software prefer to password protect the entire system and grant access only to students on their course. The example and download site is publicly accessible, in order to give prospective users the opportunity to test the software before downloading it.
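One way such an approved-sites check could be written is sketched below, again in Python for illustration only. It assumes the calling page is identified from the HTTP Referer header, which the text above does not state; the host names and the function is_approved are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical list of sites whose pages may submit scripts.
APPROVED_HOSTS = {"www.example.ac.uk", "stats.example.ac.uk"}


def is_approved(referer, approved_hosts=APPROVED_HOSTS):
    """Return True only if the calling page's host is on the approved list."""
    if not referer:
        return False
    host = urlparse(referer).hostname  # lower-cased by urlparse, None if absent
    return host is not None and host in approved_hosts


print(is_approved("http://www.example.ac.uk/course/example1.html"))
print(is_approved("http://elsewhere.example.com/form.html"))
```

Because a client can send any Referer value it likes, a check of this kind is only elementary, which is consistent with the preference for password protection noted above.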
The software was originally developed purely for use by our own institution and courses because of the reasons discussed in the motivation section. As time progressed, it became clear through discussions with other interested statisticians at conferences that there was interest in the developing web environment. Following these discussions, a tidied release 3a ("Cardiff") was produced.
Rcgi consisted of three scripts, each handling different tasks. The main script, unimaginatively titled "go", was the centre of the system. It accepted input from the calling page, reformed this into a script, did a quick security check as described in the last section, started R, sent it the script and captured the output. The textual output was reformatted into a web page and sent back to the browser.
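The flow of the "go" script (start the interpreter, send it the script, capture the output, wrap it in a web page) can be sketched as follows. This is a Python illustration under stated assumptions, not the original Perl code: the R command line in R_COMMAND is an assumption about a typical installation, the function names are hypothetical, and a stand-in interpreter that merely echoes its input is used so that the sketch runs without R.

```python
import html
import subprocess
import sys

# Assumed command line for a non-interactive R; adjust for the local installation.
R_COMMAND = ["R", "--vanilla", "--slave"]


def run_script(script, command=R_COMMAND, timeout=30):
    """Feed the script to the interpreter on stdin and return its text output."""
    result = subprocess.run(
        command, input=script, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr


def render_page(output, title="Rcgi output"):
    """Wrap the captured text in a minimal CGI response, escaping HTML characters."""
    return (
        "Content-Type: text/html\r\n\r\n"
        f"<html><head><title>{html.escape(title)}</title></head>"
        f"<body><pre>{html.escape(output)}</pre></body></html>"
    )


# Stand-in interpreter (echoes stdin) so the sketch runs without R installed.
ECHO = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
page = render_page(run_script("x <- 1 < 2\n", command=ECHO))
print(page)
```

Escaping the output matters here because R code and results routinely contain "<" characters, such as the assignment arrow.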
Two auxiliary scripts handled the graphical output: one simply sent the PostScript file of the graphs to the browser, and the other processed this file into a web page with a GIF format graphic for each page. The graphics scripts were written as shell scripts, while the main script was written in Perl. While this was the quickest way to create the scripts for our own use, it required other installers to rewrite many parts of the system before it would function correctly on their servers.
Since the system was first written, there have been many improvements in web technology. Mature Perl libraries are now available to handle CGI input-output processes efficiently and safely, and to keep programs running when not in use, thereby eliminating the time and processor usage of the repeated startup of the Perl interpreter when the system is busy. If the number of simultaneous requests exceeds the number of Rcgi scripts running at any one time, another will be started. After a specified length of time that any script has been idle, it will close down. This greatly reduces the demands of providing an open service.
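The start-on-demand and idle-shutdown policy described above can be sketched as simple bookkeeping. The class below is a hypothetical Python illustration that only tracks worker identifiers and timestamps; it does not start real processes and is not how any particular Perl library implements the behaviour.

```python
class ScriptPool:
    """Track persistent script instances: reuse idle ones, start more on demand."""

    def __init__(self, idle_limit):
        self.idle_limit = idle_limit  # seconds an instance may sit idle
        self.idle = {}                # worker id -> time it became idle
        self.busy = set()
        self._next_id = 0

    def acquire(self):
        """Reuse an idle worker if one exists, otherwise start another."""
        if self.idle:
            worker = next(iter(self.idle))
            del self.idle[worker]
        else:
            worker = self._next_id
            self._next_id += 1
        self.busy.add(worker)
        return worker

    def release(self, worker, now):
        """Mark a worker as idle from time `now`."""
        self.busy.remove(worker)
        self.idle[worker] = now

    def reap(self, now):
        """Close workers idle for at least idle_limit; return their ids."""
        expired = [w for w, since in self.idle.items()
                   if now - since >= self.idle_limit]
        for worker in expired:
            del self.idle[worker]
        return expired

    def size(self):
        return len(self.idle) + len(self.busy)


pool = ScriptPool(idle_limit=60)
first = pool.acquire()       # no instance running yet, so one is started
second = pool.acquire()      # a simultaneous request starts another
pool.release(first, now=5)
third = pool.acquire()       # the idle instance is reused
print(first, second, third, pool.size())
```

The saving comes from the third request: it is served without paying the interpreter startup cost again.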
There has also been more research into how the internet can assist the development of open software such as Rcgi. In the milestone paper "The Cathedral and the Bazaar", Eric S Raymond suggests some basic principles for open development projects to leverage the internet to produce better results faster [5]. Following the suggestions made in that paper has allowed the redevelopment to occur at a faster pace than would otherwise have been possible.
In light of these developments, it was decided to rewrite the system to take advantage of these new possibilities. The new system consists of one CGI script which runs without warnings and so is suitable for running continuously without memory leaks or other undesirable problems.
Internally, the system is split into two sections. A module called "Rcgi::Session" handles the communication with the actual R interpreter, accepting input and storing the output until it is requested by the CGI script. This also opens up the possibility of handling the R interpreter itself in a similar way to the CGI script, keeping the bare minimum number of interpreters running continuously and having several CGI scripts using each instance of R. The efficiency gained by serving many browsers with the minimum number of CGI script startups and R interpreters should not be underestimated, but the programming complexity of such a setup is also considerable, so it is left as a future option for development.
The use of Rcgi::Session makes the CGI script fairly simple. Upon getting a request from a browser, the script decides what is being requested and connects the specified session to the browser in an appropriate manner.
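The division of labour between the session module and the CGI script might look like the sketch below. It is a hypothetical Python analogue, not the interface of the Perl module "Rcgi::Session": the class, the request parameters "session" and "action", and the injected evaluate callable are all assumptions made for illustration.

```python
class Session:
    """Accept scripts for an interpreter and hold the output until requested."""

    def __init__(self, evaluate):
        self._evaluate = evaluate  # callable: script text -> output text
        self._pending = []

    def submit(self, script):
        self._pending.append(self._evaluate(script))

    def fetch(self):
        """Return all stored output and clear the store."""
        output, self._pending = "".join(self._pending), []
        return output


sessions = {}  # session id -> Session


def handle_request(params, evaluate):
    """Decide what is being requested and connect it to the right session."""
    session_id = params["session"]
    if session_id not in sessions:
        sessions[session_id] = Session(evaluate)
    session = sessions[session_id]
    if params.get("action") == "submit":
        session.submit(params["script"])
        return "accepted"
    return session.fetch()


def fake_r(script):
    # Stand-in evaluator so the sketch runs without R installed.
    return "out:" + script + "\n"


print(handle_request({"session": "s1", "action": "submit", "script": "1+1"}, fake_r))
print(handle_request({"session": "s1", "action": "fetch"}, fake_r))
```

Keeping the interpreter handling behind one object is what leaves the request handler short.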
The most commonly suggested point for improvement was the installation procedure. It was a completely manual procedure and (as detailed above) required extensive installer intervention in order to tell the system the location of the other programs which it needed in order to function correctly. There was no central place for this information to be entered and the commands calling these programs were scattered throughout the three scripts.
As a result, an installation program is included in the new release. When run, the script attempts to locate the various programs it requires and build a module called "Rcgi::SystemPrograms" from subroutines using different possible system programs. Not even this is straightforward, as there are many possible ways to find a program on the system. The installer first tries the Unix shell's own information, which is not entirely reliable, then the widely installed GNU "locate" program, and finally searches the filesystem via the general Unix "find" utility. There is a small possibility that a suboptimal option is found in the shell's information, so all results are offered to the installer for review before the SystemPrograms module is built.
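The three-stage search can be sketched as below. This is a Python illustration of the idea rather than the installer's Perl code; the function names, the choice of search roots for "find", and the decision to run every stage and merge the results are assumptions.

```python
import os
import shutil
import subprocess


def from_path(name):
    """Stage 1: ask the shell's search path."""
    found = shutil.which(name)
    return [found] if found else []


def from_locate(name):
    """Stage 2: query the locate database, if locate is installed."""
    if shutil.which("locate") is None:
        return []
    try:
        result = subprocess.run(["locate", name], capture_output=True,
                                text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return []
    return [p for p in result.stdout.splitlines()
            if os.path.basename(p) == name
            and os.path.isfile(p) and os.access(p, os.X_OK)]


def from_find(name, roots=("/usr", "/opt")):
    """Stage 3: search the filesystem under the given roots with find."""
    hits = []
    for root in roots:
        try:
            result = subprocess.run(["find", root, "-name", name, "-type", "f"],
                                    capture_output=True, text=True, timeout=120)
        except (OSError, subprocess.TimeoutExpired):
            continue
        hits.extend(result.stdout.splitlines())
    return [p for p in hits if os.access(p, os.X_OK)]


def candidates(name, strategies=(from_path, from_locate, from_find)):
    """Collect distinct candidates in order of preference for the installer to review."""
    seen = []
    for strategy in strategies:
        for path in strategy(name):
            if path not in seen:
                seen.append(path)
    return seen
```

A call such as candidates("gs") would return every distinct path found, with the shell's answer first, so that the person installing can reject a suboptimal choice before anything is written to the generated module.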
It is hoped that the new release will bring renewed interest in the system and also in the wider idea of using this powerful statistics language as part of teaching statistics across the internet. While there is now much more teaching material available online, it is mostly in the form of electronic texts, with very few interactive demonstrations (see [6]).
The next weakness to correct will be the documentation and examples supplied with the distribution. This paper will become part of the documentation and better examples will be developed alongside future teaching at UEA to try to correct these deficiencies. In addition to the Rcgi mailing list, the latest release also gives a number of alternative ways to contact the author and other interested parties. As is usually the case with Free Software, the future development of the software will be determined by its users.
[1] Venables, B., Smith, D., Gentleman, R. & Ihaka, R. (1997). Notes on R: A Programming Environment for Data Analysis and Graphics (software manual).
[2] Free Software Foundation (1991). GNU General Public License, version 2, from http://www.gnu.org/copyleft/gpl.html
[3] Thioulouse, J. & Chevenet, F. (1996). NetMul, a world-wide web user interface for multivariate analysis software. Computational Statistics and Data Analysis 21, 369-372.
[4] National Committee of Inquiry into Higher Education (1997). Higher Education in the Learning Society. Report submitted to the Secretaries of State for Education and Employment. Available online from http://www.leeds.ac.uk/educol/ncihe/
[5] Raymond, Eric Steven (2000). The Cathedral and the Bazaar, online essay, Revision 1.51, from http://www.tuxedo.org/%7Eesr/writings/cathedral-bazaar/
[6] The Open Directory, Science: Math: Statistics, seen December 2000 at http://www.dmoz.org/Science/Math/Statistics/