If you are new to the CGI world, there's no need to worry—basic CGI programming is very easy. Ninety percent of CGI-specific code is concerned with reading data submitted by a user through an HTML form, processing it, and returning some response, usually as an HTML document.

In this section, we will show you how easy basic CGI programming is, rather than trying to teach you the entire CGI specification. There are many books and online tutorials that cover CGI in great detail (see http://hoohoo.ncsa.uiuc.edu/). Our aim is to demonstrate that if you know Perl, you can start writing CGI scripts almost immediately. You need to learn only two things: how to accept data and how to generate output.

The HTTP protocol makes clients and servers understand each other by transferring all the information between them using headers, where each header is a key-value pair. When you submit a form, the CGI program looks for the headers that contain the input information, processes the received data (e.g., queries a database for the keywords supplied through the form), and—when it is ready to return a response to the client—sends a special header that tells the client what kind of information it should expect, followed by the information itself. The server can send additional headers, but these are optional. Figure 1-1 depicts a typical request-response cycle.

Figure 1-1

Figure 1-1. Request-response cycle

Sometimes CGI programs can generate a response without needing any input data from the client. For example, a news service may respond with the latest stories without asking for any input from the client. But if you want stories for a specific day, you have to tell the script which day's stories you want. Hence, the script will need to retrieve some input from you.

To get your feet wet with CGI scripts, let's look at the classic "Hello world" script for CGI, shown in Example 1-1.

Example 1-1. "Hello world" script

  #!/usr/bin/perl -Tw
  
  print "Content-type: text/plain\n\n";
  print "Hello world!\n";

We start by sending a Content-type header, which tells the client that the data that follows is of plain-text type. text/plain is a Multipurpose Internet Mail Extensions (MIME) type. You can find a list of widely used MIME types in the mime.types file, which is usually located in the directory where your web server's configuration files are stored.[4] Other examples of MIME types are text/html (text in HTML format) and video/mpeg (an MPEG stream).

[4]For more information about Internet media types, refer to RFCs 2045, 2046, 2047, 2048, and 2077, accessible from http://www.rfc-editor.org/.

According to the HTTP protocol, an empty line must be sent after all headers have been sent. This empty line indicates that the actual response data will start at the next line.[5]

[5]The protocol specifies the end of a line as the character sequence Ctrl-M and Ctrl-J (carriage return and newline). On Unix and Windows systems, this sequence is expressed in a Perl string as \015\012, but Apache also honors \n, which we will use throughout this book. On EBCDIC machines, an explicit \r\nshould be used instead.

Now save the code in hello.pl, put it into a cgi-bin directory on your server, make the script executable, and test the script by pointing your favorite browser to:

http://localhost/cgi-bin/hello.pl

It should display the same output as Figure 1-2.

Figure 1-2

Figure 1-2. Hello world

A more complicated script involves parsing input data. There are a few ways to pass data to the scripts, but the most commonly used are the GET and POST methods. Let's write a script that expects as input the user's name and prints this name in its response. We'll use the GET method, which passes data in the request URI (uniform resource indicator):

http://localhost/cgi-bin/hello.pl?username=Doug

When the server accepts this request, it knows to split the URI into two parts: a path to the script (http://localhost/cgi-bin/hello.pl) and the "data" part (username=Doug, called the QUERY_STRING). All we have to do is parse the data portion of the URI and extract the key username and value Doug. The GET method is used mostly for hardcoded queries, where no interactive input is needed. Assuming that portions of your site are dynamically generated, your site's menu might include the following HTML code:

<a href="/cgi-bin/display.pl?section=news">News</a><br>
<a href="/cgi-bin/display.pl?section=stories">Stories</a><br>
<a href="/cgi-bin/display.pl?section=links">Links</a><br>

Another approach is to use an HTML form, where the user fills in some parameters. The HTML form for the "Hello user" script that we will look at in this section can be either:

<form action="/cgi-bin/hello_user.pl" method="POST">
<input type="text" name="username">
<input type="submit">
</form>

or:

<form action="/cgi-bin/hello_user.pl" method="GET">
<input type="text" name="username">
<input type="submit">
</form>

Note that you can use either the GET or POST method in an HTML form. However, POSTshould be used when the query has side effects, such as changing a record in a database, while GETshould be used in simple queries like this one (simple URL links are GET requests).[6]

[6]See Axioms of Web Architecture at http://www.w3.org/DesignIssues/Axioms.html#state.

Formerly, reading input data required different code, depending on the method used to submit the data. We can now use Perl modules that do all the work for us. The most widely used CGI library is the CGI.pm module, written by Lincoln Stein, which is included in the Perl distribution. Along with parsing input data, it provides an easy API to generate the HTML response.

Our sample "Hello user" script is shown in Example 1-2.

Example 1-2. "Hello user" script

  #!/usr/bin/perl
  
  use CGI qw(:standard);
  my $username = param('username') || "unknown";
  
  print "Content-type: text/plain\n\n";
  print "Hello $username!\n";

Notice that this script is only slightly different from the previous one. We've pulled in the CGI.pm module, importing a group of functions called :standard. We then used its param( ) function to retrieve the value of the username key. This call will return the name submitted by any of the three ways described above (a form using either POST, GET, or a hardcoded name with GET; the last two are essentially the same). If no value was supplied in the request, param( ) returns undef.

my $username = param('username') || "unknown";

$username will contain either the submitted username or the string "unknown" if no value was submitted. The rest of the script is unchanged—we send the MIME header and print the "Hello $username!" string.[7]

[7]All scripts shown here generate plain text, not HTML. If you generate HTML output, you have to protect the incoming data from cross-site scripting. For more information, refer to the CERT advisory at http://www.cert.org/advisories/CA-2000-02.html.

As we've just mentioned, CGI.pm can help us with output generation as well. We can use it to generate MIME headers by rewriting the original script as shown in Example 1-3.

Example 1-3. "Hello user" script using CGI.pm

  #!/usr/bin/perl
  
  use CGI qw(:standard);
  my $username = param('username') || "unknown";
  
  print header("text/plain");
  print "Hello $username!\n";

To help you learn how CGI.pm copes with more than one parameter, consider the code in Example 1-4.

Example 1-4. CGI.pm and param( ) method

  #!/usr/bin/perl
  
  use CGI qw(:standard);
  print header("text/plain");
  
  print "The passed parameters were:\n";
  for my $key ( param( ) ) {
      print "$key => ", param($key), "\n";
  }

Now issue the following request:

http://localhost/cgi-bin/hello_user.pl?a=foo&b=bar&c=foobar

Separating key=value Pairs

Note that & or ; usually is used to separate the key=value pairs. The former is less preferable, because if you end up with a QUERY_STRING of this format:

id=foo&reg=bar

some browsers will interpret &reg as an SGML entity and encode it as &reg;. This will result in a corrupted QUERY_STRING:

id=foo&reg;=bar

You have to encode & as &amp; if it is included in HTML. You don't have this problem if you use ; as a separator:

id=foo;reg=bar

Both separators are supported by CGI.pm, Apache::Request, and mod_perl's args( ) method, which we will use in the examples to retrieve the request parameters.

Of course, the code that builds QUERY_STRING has to ensure that the values don't include the chosen separator and encode it if it is used. (See RFC2854 for more details.)

The browser will display:

The passed parameters were:
a => foo
b => bar
c => foobar

Now generate this form:

<form action="/cgi-bin/hello_user.pl" method="GET">
<input type="text" name="firstname">
<input type="text" name="lastname">
<input type="submit">
</form>

If we fill in only the firstname field with the value Doug, the browser will display:

The passed parameters were:
firstname => Doug
lastname =>

If in addition the lastname field is MacEachern, you will see:

The passed parameters were:
firstname => Doug
lastname => MacEachern

These are just a few of the many functions CGI.pm offers. Read its manpage for detailed information by typing perldoc CGI at your command prompt.

We used this long CGI.pm example to demonstrate how simple basic CGI is. You shouldn't reinvent the wheel; use standard tools when writing your own scripts, and you will save a lot of time. Just as with Perl, you can start creating really cool and powerful code from the very beginning, gaining more advanced knowledge over time. There is much more to know about the CGI specification, and you will learn about some of its advanced features in the course of your web development practice. We will cover the most commonly used features in this book.

For now, let CGI.pm or an equivalent library handle the intricacies of the CGI specification, and concentrate your efforts on the core functionality of your code.