Last revision August 6, 2004
The original motivation for the development of computers was to automate data analysis: performing mathematical transformations of data; summarizing; selecting according to criteria; and making connections between items. This is still a core function, particularly in scientific research, and a necessary one for any large amount of data.
Arbitrary data analysis can be accomplished by writing specific programs in a language such as Fortran or C that relate directly to your data. Such programs require expertise to write and sacrifice generality of application for efficiency in solving one particular problem.
Another approach is to have a fixed program that solves one particular class of problems, but for any input data set. This kind of program is easiest to use, but has the least flexibility. What if you want to do something slightly different from the program's specified procedure? An example would be a program that calculates basic statistics for any data set: means, standard deviations, t-statistics, etc.
Still another approach is a canned program that incorporates many functions, including perhaps the ability to tie functions together with a scripting (macro) language. Many PC and Macintosh data analysis programs adopt this strategy. The problem is that as the program becomes more powerful by incorporating more functions, it becomes more unwieldy and difficult to learn. An example is the Microsoft Excel program.
Programs such as Excel are especially useful when you want immediate visual display of the data and its analysis; or want to manipulate or relate individual items from the larger data set; or want to interactively experiment with manipulations.
The Unix approach is to have a set of well-defined utility programs, each with some flexibility in its functionality, that can be linked together with the pipeline concept to solve different parts of the problem. Each program is simpler to learn than a single giant integrated one, but working together they can accomplish much the same thing.
Some of these utility programs implement simple programming languages that provide even greater flexibility. These are called "interpreted languages" because the programming instructions that you write are interpreted as they are read by the program. They do not need to be compiled into machine language, as does a Fortran or C program. Thus, there are no extra steps involved in making and using an interpreted language program. You write the instructions and immediately run it.
The interpreted languages described here are very "high level", which means that a few simple statements can accomplish a lot of work. In fact, many of the "programs" that you would use with these languages are only a few words long and fit on the command line itself. For longer programs, you put them in a file called a "script" and then pass that script to the language processor to be run.
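As a sketch of how short such a "program" can be, here is a complete awk command that fits on the command line. The data is hypothetical, piped in with printf rather than read from a file:

```shell
# A complete awk "program" on the command line: print the first
# column (here, the name) of each input line.
printf 'alice 90\nbob 85\ncarol 78\n' | awk '{ print $1 }'
```

The single statement between the quotes is the entire program; awk applies it to every input line.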
Shell scripts are another form of interpreted language that are best suited to combining basic Unix programs and utilities into more complicated commands. They are generally not useful for data analysis.
The high level statements and immediate execution of these interpreted languages make them ideal for custom data analysis. You can write the program quickly and change it quickly. Of course, these languages are not as efficient as compiled languages such as Fortran or C, so you should not use them for heavy mathematical computation. They are best used for problems that require data selection, simple transformations (for example, combining or weighting variables), and summaries.
Three very useful interpreted languages are found on most Unix systems:
sed is an editing utility that can be used to edit files according to instructions in an editing script. Rather than opening the file and using interactive editing commands, such as in vi or emacs, you use a very simple "editing language" to write a set of instructions -- a program -- that sed will execute on the file that you specify. This allows you to automate editing tasks that you might need to perform over and over on a set of files.
sed operates as a Unix filter: it reads one or more files on disk, or its standard input, performs its editing tasks on the lines it has read, and writes the edited lines to standard output, which you would normally redirect to a file or send into a pipe to another filter program.
sed was one of the earliest Unix utility programs and is standardized on all Unix distributions. The most useful editing instructions for sed are described on this page.
awk is a pattern scanning and processing language. It has been optimized to let you look for lines within input data that match specified patterns, and then either select those lines for the output, delete them from the output, or edit them in some fashion before outputting them.
These same editing tasks can be accomplished by sed, but awk goes beyond sed by allowing you to perform arbitrary actions rather than just simple edits, including arithmetic computations and running other Unix programs and processing their output. awk scripts can also perform many of the operations of a shell script.
awk is an appropriate programming language to use whenever you need a simple way to manipulate input data according to patterns, do basic arithmetic, or trigger other actions or programs based on the contents of the data file.
Like sed, awk operates as a Unix filter program and can participate in pipelines.
awk was one of the earliest Unix utility programs and is standardized on all Unix distributions. The basic programming instructions for awk are described on this page.
perl is a very comprehensive interpreted language designed by programming guru Larry Wall. He wanted to combine the editing and pattern processing functions of sed and awk; the access to Unix programs provided by shell scripts; and the ability to call Unix system routines and efficiently perform complicated mathematical or character processing provided by compiled C programs.
In some sense, perl is a "superset" of these other languages. It is extremely powerful and has become a standard for writing scripts on web servers. Unfortunately, it is also quite complex. And it is not necessarily found on all Unix systems. Some vendors provide it, but in other cases, you must download and compile the source code yourself.
Competent programmers who are familiar with the C language will benefit from learning perl. For basic data processing, the simpler combination of sed, awk, and shell scripts is recommended.
The perl language is not described in these pages. Complete tutorial and reference books are available from the publisher O'Reilly and Associates. There are also entire USENET news groups devoted to perl.