regexp(n)

regexp(n)

read Home Page New Index registry


_________________________________________________________________

NAME
       regexp - Match a regular expression against a string

SYNOPSIS
       regexp  ?switches? exp string ?matchVar? ?subMatchVar sub-
       MatchVar ...?
_________________________________________________________________

DESCRIPTION
       Determines whether the regular expression exp matches part
       or  all  of  string  and  returns  1  if  it does, 0 if it
       doesn't.

       If additional arguments are specified  after  string  then
       they  are  treated  as  the names of variables in which to
       return information about which part(s) of  string  matched
       exp.   MatchVar  will  be  set to the range of string that
       matched all of exp.  The first  subMatchVar  will  contain
       the  characters in string that matched the leftmost paren-
       thesized subexpression within exp,  the  next  subMatchVar
       will  contain  the characters that matched the next paren-
       thesized subexpression to the right in exp, and so on.

       If the initial arguments to regexp start with - then  they
       are  treated as switches.  The following switches are cur-
       rently supported:

       -nocase   Causes upper-case characters  in  string  to  be
                 treated  as  lower case during the matching pro-
                 cess.

       -indices  Changes what  is  stored  in  the  subMatchVars.
                 Instead  of storing the matching characters from
                 string, each variable will contain a list of two
                 decimal  strings giving the indices in string of
                 the first and last characters  in  the  matching
                 range of characters.

       --        Marks the end of switches.  The argument follow-
                 ing this one will be treated as exp even  if  it
                 starts with a -.

       If  there are more subMatchVar's than parenthesized subex-
       pressions within exp, or if a particular subexpression  in
       exp  doesn't  match  the  string (e.g. because it was in a
       portion of the expression that wasn't matched),  then  the
       corresponding  subMatchVar  will  be  set  to ``-1 -1'' if
       -indices has been specified or to an empty  string  other-
       wise.

REGULAR EXPRESSIONS
       Regular  expressions are implemented using Henry Spencer's
       package (thanks, Henry!), and much of the  description  of
       regular expressions below is copied verbatim from his man-
       ual entry.

       A regular expression is zero or more  branches,  separated
       by  ``|''.   It  matches  anything that matches one of the
       branches.

       A branch is zero or more pieces, concatenated.  It matches
       a match for the first, followed by a match for the second,
       etc.

       A piece is an atom possibly followed by ``*'',  ``+'',  or
       ``?''.   An atom followed by ``*'' matches a sequence of 0
       or more matches of the atom.  An atom  followed  by  ``+''
       matches  a  sequence of 1 or more matches of the atom.  An
       atom followed by ``?'' matches a match of the atom, or the
       null string.

       An atom is a regular expression in parentheses (matching a
       match for the regular expression), a  range  (see  below),
       ``.''   (matching  any  single character), ``^'' (matching
       the null string at the beginning  of  the  input  string),
       ``$''  (matching  the  null string at the end of the input
       string), a ``\'' followed by a single character  (matching
       that  character), or a single character with no other sig-
       nificance (matching that character).

       A range is a sequence of characters  enclosed  in  ``[]''.
       It   normally   matches  any  single  character  from  the
       sequence.  If the sequence begins with ``^'',  it  matches
       any  single  character  not from the rest of the sequence.
       If two characters in the sequence are separated by  ``-'',
       this  is  shorthand  for the full list of ASCII characters
       between them (e.g. ``[0-9]'' matches any  decimal  digit).
       To  include  a  literal ``]'' in the sequence, make it the
       first character (following a possible ``^'').  To  include
       a literal ``-'', make it the first or last character.

CHOOSING AMONG ALTERNATIVE MATCHES
       In general there may be more than one way to match a regu-
       lar expression to an input string.  For example,  consider
       the command
              regexp  (a*)b*  aabaaabb  x  y
       Considering only the rules given so far, x and y could end
       up with the values aabb and aa, aaab and aaa, ab and a, or
       any of several other combinations.  To resolve this poten-
       tial ambiguity regexp chooses among alternatives using the
       rule ``first then longest''.  In other words, it considers
       the possible matches in order working from left  to  right
       across  the  input string and the pattern, and it attempts

       to match longer pieces of the input string before  shorter
       ones.   More  specifically,  the  following rules apply in
       decreasing order of priority:

       [1]    If a regular expression could match  two  different
              parts of an input string then it will match the one
              that begins earliest.

       [2]    If a regular expression contains |  operators  then
              the leftmost matching sub-expression is chosen.

       [3]    In  *, +, and ? constructs, longer matches are cho-
              sen in preference to shorter ones.

       [4]    In sequences of expression  components  the  compo-
              nents are considered from left to right.

       In  the  example from above, (a*)b* matches aab:  the (a*)
       portion of the pattern is matched first  and  it  consumes
       the  leading  aa;  then the b* portion of the pattern con-
       sumes the next b.  Or, consider the following example:
              regexp  (ab|a)(b*)c  abc  x  y  z
       After this command x will be abc, y will be ab, and z will
       be  an  empty  string.   Rule 4 specifies that (ab|a) gets
       first shot at the input string and Rule 2  specifies  that
       the  ab sub-expression is checked before the a sub-expres-
       sion.  Thus the b has already been claimed before the (b*)
       component  is checked and (b*) must match an empty string.

KEYWORDS
       match, regular expression, string

read Home Page New Index registry