Regular Expression Basics

In computing, a regular expression provides a concise and flexible means for "matching" (specifying and recognizing) strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp". The concept of regular expressions was first popularized by utilities provided by Unix distributions, in particular the editor ed and the filter grep.

A regular expression is written in a formal language that can be interpreted by a regular expression processor, which is a program that either serves as a parser generator or examines text and identifies parts that match the provided specification. Historically, the concept of regular expressions is associated with Kleene's formalism of regular sets, introduced in the 1950s.

Here are examples of specifications that could be expressed in a regular expression:

the sequence of characters "car" appearing consecutively in any context, such as in "car", "cartoon", or "bicarbonate"
the sequence of characters "car" occurring in that order with other characters between them, such as in "Icelander" or "chandler"
the word "car" when it appears as an isolated word
the word "car" when preceded by the word "blue" or "red"
the word "car" when not preceded by the word "motor"
a dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits (for example, "$100" or "$245.99").

These examples are simple. Specifications of great complexity can be conveyed by regular expressions.
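The specifications listed above can be sketched as patterns. The snippet below is one possible rendering using Python's `re` module, not the only way to write them; the variable names and the sample strings are assumptions made for illustration, and "preceded by" is read here as "immediately preceded by the word and a single space".

```python
import re

# Each pattern corresponds to one specification in the list above.
consecutive = re.compile(r"car")                       # "car" in any context
in_order = re.compile(r"c.*a.*r")                      # c, a, r in that order
isolated = re.compile(r"\bcar\b")                      # "car" as a whole word
after_colour = re.compile(r"\b(?:blue|red) car\b")     # preceded by blue/red
not_after_motor = re.compile(r"(?<!motor )\bcar\b")    # not preceded by motor
price = re.compile(r"\$[0-9]+(?:\.[0-9]{2})?")         # $100 or $245.99

print(bool(consecutive.search("bicarbonate")))
print(bool(in_order.search("Icelander")), bool(in_order.search("chandler")))
print(bool(isolated.search("cartoon")))
print(bool(after_colour.search("my red car")))
print(bool(not_after_motor.search("motor car")))
print(price.findall("$100 and $245.99"))
```

The negative lookbehind `(?<!...)` is an extension offered by Python and many other engines; it is not part of every regular expression flavor.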

Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. Some of these languages, including Perl, Ruby, AWK, and Tcl, have been designed so that regular expressions are fully integrated into the syntax of the core language itself. Other programming languages, such as the .NET languages, Java, and Python, instead provide regular expressions through standard libraries. For yet other languages, such as Object Pascal, C and C++, non-core libraries are available (however, C++11 provides regular expressions in its standard library).

As an example of the syntax, the regular expression \bex can be used to search for all instances of the string "ex" that occur after "word boundaries". Thus \bex will find the matching string "ex" in two possible locations, (1) at the beginning of words, and (2) between two characters in a string, where the first is not a word character and the second is a word character. For instance, in the string "Texts for experts", \bex matches the "ex" in "experts" but not in "Texts" (because the "ex" occurs inside a word and not immediately after a word boundary).
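The `\bex` example can be checked with a few lines of Python. This is only a small sketch using the standard `re` module; the variable names are illustrative.

```python
import re

text = "Texts for experts"

# Collect the start index of every place where \bex matches.
positions = [m.start() for m in re.finditer(r"\bex", text)]

# Only the "ex" that begins "experts" qualifies; the one inside "Texts" does not.
print(positions)
print(text[positions[0]:])
```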

Many modern computing systems provide wildcard characters in matching filenames from a file system. This is a core capability of many command-line shells and is also known as globbing. Wildcards differ from regular expressions in generally expressing only limited forms of patterns.

Basic concepts

A regular expression, often called a pattern, is an expression that specifies a set of strings. It is more concise to specify a set's members by rules (such as a pattern) than by a list. For example, the set containing the three strings "Handel", "Händel", and "Haendel" can be specified by the pattern H(ä|ae?)ndel (or alternatively, it is said that the pattern matches each of the three strings). In most formalisms, if there exists at least one regex that matches a particular set then there exist an infinite number of such expressions. Most formalisms provide the following operations to construct regular expressions.

Boolean "or": A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".
Grouping: Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" and "grey".
Quantification: A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are the question mark ?, the asterisk * (derived from the Kleene star), and the plus sign + (Kleene cross).

`?`	The question mark indicates there is zero or one of the preceding element. For example, `colou?r` matches both "color" and "colour".
`*`	The asterisk indicates there is zero or more of the preceding element. For example, `ab*c` matches "ac", "abc", "abbc", "abbbc", and so on.
`+`	The plus sign indicates there is one or more of the preceding element. For example, `ab+c` matches "abc", "abbc", "abbbc", and so on, but not "ac".
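The alternation, grouping, and quantifier operations above can be tried out as follows. This is a minimal sketch in Python; the helper `whole` and the test words are assumptions for illustration, and `re.fullmatch` is used so that a pattern must account for the entire word.

```python
import re

def whole(pattern, text):
    """Return True when the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

colours = ["gray", "grey", "groy"]
alternation = [w for w in colours if whole(r"gray|grey", w)]
grouping = [w for w in colours if whole(r"gr(a|e)y", w)]

optional = [w for w in ["color", "colour", "colouur"] if whole(r"colou?r", w)]
star = [w for w in ["ac", "abc", "abbbc"] if whole(r"ab*c", w)]
plus = [w for w in ["ac", "abc", "abbbc"] if whole(r"ab+c", w)]

print(alternation, grouping)
print(optional)
print(star)
print(plus)
```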

These constructions can be combined to form arbitrarily complex expressions, much like one can construct arithmetical expressions from numbers and the operations +, −, ×, and ÷. For example, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which match the same strings as the earlier example, H(ä|ae?)ndel.
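To see that the three Handel patterns really describe the same set, one can compare what each accepts from a list of candidate spellings. The sketch below assumes Python and a hypothetical candidate list; it tests only these candidates and is not a proof of equivalence.

```python
import re

patterns = [r"H(ä|ae?)ndel", r"H(ae?|ä)ndel", r"H(a|ae|ä)ndel"]
candidates = ["Handel", "Händel", "Haendel", "Hndel", "Haaendel"]

# For each pattern, record which candidates it matches in full.
accepted = [
    {c for c in candidates if re.fullmatch(p, c)}
    for p in patterns
]

for p, names in zip(patterns, accepted):
    print(p, sorted(names))
```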

The precise syntax for regular expressions varies among tools and with context; more detail is given in the Syntax section.

Syntax Section

Metacharacter	Meaning
[ ]	Match anything inside the square brackets for ONE character position once and only once, for example, [12] means match the target to 1 and if that does not match then match the target to 2 while [0123456789] means match to any character in the range 0 to 9.
-	The - (dash) inside square brackets is the 'range separator' and allows us to define a range, in our example above of [0123456789] we could rewrite it as [0-9]. You can define more than one range inside a list, for example, [0-9A-C] means check for 0 to 9 and A to C (but not a to c). NOTE: To test for - inside brackets (as a literal) it must come first or last, that is, [-0-9] will test for - and 0 to 9.
^	The ^ (circumflex or caret) inside square brackets negates the expression (we will see an alternate use for the circumflex/caret outside square brackets later), for example, [^Ff] means anything except upper or lower case F and [^a-z] means everything except lower case a to z. NOTE: Spaces, or in this case the lack of them, between ranges are very important.
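The bracket rules above (lists, ranges, a literal dash, and negation) can be illustrated with a short Python sketch. The input strings are made up for the example.

```python
import re

# Two ranges in one list: digits 0-9 and upper case A-C (lower case excluded).
ranges = re.findall(r"[0-9A-C]", "a1B2cD")

# A dash placed first inside the brackets is a literal dash.
dash = re.findall(r"[-0-9]", "7-3=4")

# A caret placed first inside the brackets negates the list.
negated = re.findall(r"[^Ff]", "Fluff")

print(ranges)
print(dash)
print(negated)
```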

NOTE: Most regular expression software has some special range values built in, and software that claims POSIX 1003.2 compliance for either BRE or ERE is required to support them.

Metacharacter	Meaning
^	The ^ (circumflex or caret) outside square brackets means look only at the beginning of the target string, for example, ^Win will not find Windows in STRING1 but ^Moz will find Mozilla.
$	The $ (dollar) means look only at the end of the target string, for example, fox$ will find a match in 'silver fox' since it appears at the end of the string but not in 'the fox jumped over the moon'.
.	The . (period) means any single character in this position, for example, ton. will find tons, tone and tonneau but not wanton because it has no following character.
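The anchors and the period can be checked with a small Python sketch. The browser-style string below is a hypothetical stand-in for a target string that begins with "Mozilla" and contains "Windows" later on.

```python
import re

target = "Mozilla/4.0 (compatible; Windows NT)"  # hypothetical target string

moz_at_start = re.search(r"^Moz", target) is not None
win_at_start = re.search(r"^Win", target) is not None

fox_at_end = re.search(r"fox$", "silver fox") is not None
fox_in_middle = re.search(r"fox$", "the fox jumped over the moon") is not None

# "ton." needs one more character after "ton".
dot_hits = {w: re.search(r"ton.", w) is not None
            for w in ["tons", "tone", "tonneau", "wanton"]}

print(moz_at_start, win_at_start)
print(fox_at_end, fox_in_middle)
print(dot_hits)
```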

Metacharacter	Meaning
?	The ? (question mark) matches the preceding character 0 or 1 times only, for example, colou?r will find both color (0 times) and colour (1 time).
*	The * (asterisk or star) matches the preceding character 0 or more times, for example, tre* will find tree (2 times) and tread (1 time) and trough (0 times).
+	The + (plus) matches the previous character 1 or more times, for example, tre+ will find tree (2 times) and tread (1 time) but not trough (0 times).
{n}	Matches the preceding character, or character range, n times exactly, for example, to find a local phone number we could use [0-9]{3}-[0-9]{4} which would find any number of the form 123-4567. Note: The - (dash) in this case, because it is outside the square brackets, is a literal. Value is enclosed in braces (curly brackets).
{n,m}	Matches the preceding character at least n times but not more than m times, for example, 'ba{2,3}b' will find 'baab' and 'baaab' but NOT 'bab' or 'baaaab'. Values are enclosed in braces (curly brackets).
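The iteration metacharacters in this table can be exercised as follows. This is a sketch in Python; the sentence containing the phone number is invented for the example.

```python
import re

# {n}: exactly three digits, a literal dash, exactly four digits.
phone = re.search(r"[0-9]{3}-[0-9]{4}", "call 123-4567 today").group()

# {n,m}: between two and three "a" characters.
bounded = [w for w in ["bab", "baab", "baaab", "baaaab"]
           if re.fullmatch(r"ba{2,3}b", w)]

# * allows zero "e" characters, + requires at least one.
star = {w: re.search(r"tre*", w).group() for w in ["tree", "tread", "trough"]}
plus = {w: re.search(r"tre+", w) is not None for w in ["tree", "tread", "trough"]}

print(phone)
print(bounded)
print(star)
print(plus)
```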

So let's try them out with our example target strings.

Metacharacter	Meaning
()	The ( (open parenthesis) and ) (close parenthesis) may be used to group (or bind) parts of our search expression together - see this example.
|	The | (vertical bar or pipe) is called alternation in techspeak and means find the left hand OR right values, for example, gr(a|e)y will find 'gray' or 'grey'.

<humblepie> In our examples, we blew this expression ^([L-Z]in), we incorrectly stated that this would negate the tests [L-Z], the '^' only performs this function inside square brackets, here it is outside the square brackets and is an anchor indicating 'start from first character'. Many thanks to Mirko Stojanovic for pointing it out and apologies to one and all.</humblepie>

Value	Meaning
[:digit:]	Only the digits 0 to 9
[:alnum:]	Any alphanumeric character 0 to 9 OR A to Z or a to z.
[:alpha:]	Any alpha character A to Z or a to z.
[:blank:]	Space and TAB characters only.
[:xdigit:]	Hexadecimal notation 0-9, A-F, a-f.
[:punct:]	Punctuation symbols . , " ' ? ! ; : # $ % & ( ) * + - / < > = @ [ ] \ ^ _ { } | ~
[:print:]	Any printable character.
[:space:]	Any whitespace characters (space, tab, NL, FF, VT, CR). Many systems abbreviate as \s.
[:graph:]	Any printable character except whitespace (SPACE, TAB). Note that this is not the same as \W, which matches any non-word character.
[:upper:]	Any alpha character A to Z.
[:lower:]	Any alpha character a to z.
[:cntrl:]	Control Characters NL CR LF TAB VT FF NUL SOH STX ETX EOT ENQ ACK SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC IS1 IS2 IS3 IS4 DEL.

These are always used inside square brackets in the form [[:alnum:]] or combined as [[:digit:]a-d]

Character Class Abbreviations
\d	Match any character in the range 0 - 9 (equivalent of POSIX [:digit:])
\D	Match any character NOT in the range 0 - 9 (equivalent of POSIX [^[:digit:]])
\s	Match any whitespace characters (space, tab etc.). (equivalent of POSIX [:space:] EXCEPT VT is not recognized)
\S	Match any character NOT whitespace (space, tab). (equivalent of POSIX [^[:space:]])
\w	Match any word character: 0 - 9, A - Z, a - z and, in most engines, the underscore (roughly the equivalent of POSIX [[:alnum:]_])
\W	Match any character that is NOT a word character (roughly the equivalent of POSIX [^[:alnum:]_])

Positional Abbreviations
\b	Word boundary. Match any character(s) at the beginning (\bxx) and/or end (xx\b) of a word, thus \bton\b will find ton but not tons, but \bton will find tons.
\B	Not word boundary. Match any character(s) NOT at the beginning(\Bxx) and/or end (xx\B) of a word, thus \Bton\B will find wantons but not tons, but ton\B will find both wantons and tons.
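The word-boundary examples with "ton" can be verified in Python. This sketch simply records, for each pattern, which of the sample words it finds a match in.

```python
import re

words = ["ton", "tons", "wantons"]

def hits(pattern):
    """Return the words in which the pattern matches somewhere."""
    return [w for w in words if re.search(pattern, w)]

both_boundaries = hits(r"\bton\b")    # whole word only
leading_boundary = hits(r"\bton")     # "ton" at the start of a word
no_boundaries = hits(r"\Bton\B")      # "ton" strictly inside a word
trailing_non_boundary = hits(r"ton\B")  # "ton" not at the end of a word

print(both_boundaries)
print(leading_boundary)
print(no_boundaries)
print(trailing_non_boundary)
```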

Modifiers

Modifiers are used to perform case-insensitive and global searches:

Modifier	Description
i	Perform case-insensitive matching
g	Perform a global match (find all matches rather than stopping after the first match)
m	Perform multiline matching
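The i, g and m modifiers shown above are written in JavaScript style. As a rough analogue, and assuming Python rather than JavaScript, the same effects are obtained with the `re.IGNORECASE` and `re.MULTILINE` flags, while "global" behaviour comes from calling `re.findall` instead of `re.search`. The sample text is made up.

```python
import re

text = "Cat\ncat\nCAT"

case_sensitive = re.findall(r"cat", text)
case_insensitive = re.findall(r"cat", text, re.IGNORECASE)      # like i

first_only = re.search(r"cat", text, re.IGNORECASE).group()     # no g
all_matches = re.findall(r"cat", text, re.IGNORECASE)           # like g

# Without MULTILINE, ^ and $ anchor to the whole string, not to each line.
single_line = re.findall(r"^cat$", text, re.IGNORECASE)
multi_line = re.findall(r"^cat$", text, re.IGNORECASE | re.MULTILINE)  # like m

print(case_sensitive, case_insensitive)
print(first_only, all_matches)
print(single_line, multi_line)
```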

Brackets

Brackets are used to find a range of characters:

Expression	Description
[abc]	Find any character between the brackets
[^abc]	Find any character not between the brackets
[0-9]	Find any digit from 0 to 9
[A-Z]	Find any character from uppercase A to uppercase Z
[a-z]	Find any character from lowercase a to lowercase z
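[A-z]	Find any character from uppercase A to lowercase z (in ASCII this range also includes the characters [ \ ] ^ _ and `, so use [A-Za-z] to match letters only)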
[adgk]	Find any character in the given set
[^adgk]	Find any character outside the given set
(red|blue|green)	Find any of the alternatives specified
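A quick way to see what a bracket expression accepts is to run it over a sample string. The sketch below uses Python with invented inputs; it also shows that the range `[A-z]` takes in more than letters, because several punctuation characters sit between "Z" and "a" in ASCII.

```python
import re

sample = "a_Z[9"

wide_range = re.findall(r"[A-z]", sample)       # includes "_" and "["
letters_only = re.findall(r"[A-Za-z]", sample)  # letters and nothing else

# With a single group, findall returns what the group captured.
colours = re.findall(r"(red|blue|green)", "red, blue and yellow")

print(wide_range)
print(letters_only)
print(colours)
```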

Metacharacters

Metacharacters are characters with a special meaning:

Metacharacter	Description
.	Find a single character, except newline or line terminator
\w	Find a word character
\W	Find a non-word character
\d	Find a digit
\D	Find a non-digit character
\s	Find a whitespace character
\S	Find a non-whitespace character
\b	Find a match at the beginning/end of a word
\B	Find a match not at the beginning/end of a word
\0	Find a NUL character
\n	Find a new line character
\f	Find a form feed character
\r	Find a carriage return character
\t	Find a tab character
\v	Find a vertical tab character
\xxx	Find the character specified by an octal number xxx
\xdd	Find the character specified by a hexadecimal number dd
\uxxxx	Find the Unicode character specified by a hexadecimal number xxxx

Quantifiers

Quantifier	Description
n+	Matches any string that contains at least one n
n*	Matches any string that contains zero or more occurrences of n
n?	Matches any string that contains zero or one occurrences of n
n{X}	Matches any string that contains a sequence of X n's
n{X,Y}	Matches any string that contains a sequence of X to Y n's
n{X,}	Matches any string that contains a sequence of at least X n's
n$	Matches any string with n at the end of it
^n	Matches any string with n at the beginning of it
?=n	Matches any string that is followed by a specific string n
?!n	Matches any string that is not followed by a specific string n
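The last two rows describe lookahead. In an actual pattern the `?=n` and `?!n` forms are written inside parentheses, as `(?=n)` and `(?!n)`. The following Python sketch, with an invented sample string, shows that a lookahead checks what follows without including it in the match.

```python
import re

text = "quit qat"

# "q" only where the next character is "u".
followed = [m.start() for m in re.finditer(r"q(?=u)", text)]

# "q" only where the next character is NOT "u".
not_followed = [m.start() for m in re.finditer(r"q(?!u)", text)]

# The lookahead itself is not part of the matched text.
matched_text = re.search(r"q(?=u)", text).group()

print(followed, not_followed, matched_text)
```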

How to Find or Validate an Email Address

The regular expression I receive the most feedback, not to mention "bug" reports on, is the one you'll find right on this site's home page: \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b. This regular expression, I claim, matches any email address. Most of the feedback I get refutes that claim by showing one email address that this regex doesn't match. Usually, the "bug" report also includes a suggestion to make the regex "perfect".
As I explain below, my claim only holds true when one accepts my definition of what a valid email address really is, and what it's not. If you want to use a different definition, you'll have to adapt the regex. Matching a valid email address is a perfect example showing that (1) before writing a regex, you have to know exactly what you're trying to match, and what not; and (2) there's often a trade-off between what's exact, and what's practical.
The virtue of my regular expression above is that it matches 99% of the email addresses in use today. All the email address it matches can be handled by 99% of all email software out there. If you're looking for a quick solution, you only need to read the next paragraph. If you want to know all the trade-offs and get plenty of alternatives to choose from, read on.
If you want to use the regular expression above, there are two things you need to understand. First, long regexes make it difficult to nicely format paragraphs. So I didn't include a-z in any of the three character classes. This regex is intended to be used with your regex engine's "case insensitive" option turned on. (You'd be surprised how many "bug" reports I get about that.) Second, the above regex is delimited with word boundaries, which makes it suitable for extracting email addresses from files or larger blocks of text. If you want to check whether the user typed in a valid email address, replace the word boundaries with start-of-string and end-of-string anchors, like this: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$.
The previous paragraph also applies to all following examples. You may need to change word boundaries into start/end-of-string anchors, or vice versa. And you will need to turn on the case insensitive matching option.
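The two variants described above, one for extracting addresses from text and one for validating a whole input, might be used like this in Python. This is a sketch only: the constant names and sample addresses are assumptions, and `re.IGNORECASE` stands in for the "case insensitive" option.

```python
import re

# Word-boundary version: finds addresses inside larger text.
EMAIL_FIND = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b",
                        re.IGNORECASE)

# Anchored version: the whole input must be an address.
EMAIL_VALIDATE = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$",
                            re.IGNORECASE)

text = "Write to john@example.com or jane.doe@mail.example.org today."
found = EMAIL_FIND.findall(text)

valid = EMAIL_VALIDATE.match("john@example.com") is not None
invalid = EMAIL_VALIDATE.match("not an email") is not None

# Forgetting the case-insensitive option makes lower-case input fail.
without_flag = re.match(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$",
                        "john@example.com") is not None

print(found)
print(valid, invalid, without_flag)
```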

Trade-Offs in Validating Email Addresses

Yes, there are a whole bunch of email addresses that my pet regex doesn't match. The most frequently quoted examples are addresses on the .museum top level domain, which is longer than the 4 letters my regex allows for the top level domain. I accept this trade-off because the number of people using .museum email addresses is extremely low. I've never had a complaint that the order forms or newsletter subscription forms on the JGsoft websites refused a .museum address (which they would, since they use the above regex to validate the email address).
To include .museum, you could use ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$. However, then there's another trade-off. This regex will match john@mail.office. It's far more likely that John forgot to type in the .com top level domain rather than having just created a new .office top level domain without ICANN's permission.
This shows another trade-off: do you want the regex to check if the top level domain exists? My regex doesn't. Any combination of two to four letters will do, which covers all existing and planned top level domains except .museum. But it will match addresses with invalid top-level domains like asdf@asdf.asdf. By not being overly strict about the top-level domain, I don't have to update the regex each time a new top-level domain is created, whether it's a country code or generic domain.
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$ could be used to allow any two-letter country code top level domain, and only specific generic top level domains. By the time you read this, the list might already be out of date. If you use this regular expression, I recommend you store it in a global constant in your application, so you only have to update it in one place. You could list all country codes in the same manner, even though there are almost 200 of them.
Email addresses can be on servers on a subdomain, e.g. john@server.department.company.com. All of the above regexes will match this email address, because I included a dot in the character class after the @ symbol. However, the above regexes will also match john@aol...com which is not valid due to the consecutive dots. You can exclude such matches by replacing [A-Z0-9.-]+\. with (?:[A-Z0-9-]+\.)+ in any of the above regexes. I removed the dot from the character class and instead repeated the character class and the following literal dot. E.g. \b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b will match john@server.department.company.com but not john@aol...com.
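The effect of moving the dot out of the character class can be compared directly. The sketch below assumes Python with case-insensitive matching; the names `LOOSE` and `STRICT` are illustrative labels for the two anchored variants.

```python
import re

LOOSE = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)
STRICT = re.compile(r"^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$",
                    re.IGNORECASE)

addresses = ["john@server.department.company.com", "john@aol...com"]

# Which addresses does each variant accept?
loose_ok = [a for a in addresses if LOOSE.match(a)]
strict_ok = [a for a in addresses if STRICT.match(a)]

print(loose_ok)
print(strict_ok)
```

In the strict variant every dot must be preceded by at least one letter, digit, or hyphen, which is why consecutive dots no longer match.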
Another trade-off is that my regex only allows English letters, digits and a few special symbols. The main reason is that I don't trust all my email software to be able to handle much else. Even though John.O'Hara@theoharas.com is a syntactically valid email address, there's a risk that some software will misinterpret the apostrophe as a delimiting quote. E.g. blindly inserting this email address into a SQL statement will cause it to fail if strings are delimited with single quotes. And of course, it's been many years already that domain names can include non-English characters. Most software and even domain name registrars, however, still stick to the 37 characters they're used to.
The conclusion is that to decide which regular expression to use, whether you're trying to match an email address or something else that's vaguely defined, you need to start with considering all the trade-offs. How bad is it to match something that's not valid? How bad is it not to match something that is valid? How complex can your regular expression be? How expensive would it be if you had to change the regular expression later? Different answers to these questions will require a different regular expression as the solution. My email regex does what I want, but it may not do what you want.

If you want to learn more about regular expressions, please visit these sites:
http://www.regular-expressions.info/
http://www.w3schools.com/jsref/jsref_obj_regexp.asp

Some useful examples of regular expressions in QTP scripting

Dim re, ma
Set re = New RegExp ' VBScript (as used by QTP) has no "Dim ... As" syntax
re.Pattern = "[A-Z][0-9][0-9][0-9]" ' uppercase char followed by 3 digits
re.IgnoreCase = False ' case sensitive search
re.Global = True ' find all the occurrences
' txtSource is assumed to be an existing object whose Text property holds the input
For Each ma In re.Execute(txtSource.Text)
    MsgBox "Found '" & ma.Value & "' at index " & ma.FirstIndex
Next

---------------------------------------------------------------------------------------------

Regular expression example in QTP

(Log in to the Flight Reservation application, open orders, and send fax orders)

While sending fax orders, the title of the fax order window changes from order to order, so we need to use regular expressions there.

Action1 : (Login To the Flight Reservation Application)

SystemUtil.Run "C:\Program Files (x86)\HP\QuickTest Professional\samples\flight\app\flight4a.exe"
Dialog("Login").WinEdit("Agent Name:").Set "venkatesh"
Dialog("Login").WinEdit("Agent Name:").Type micTab
Dialog("Login").WinEdit("Password:").SetSecure "4c0ba4645292b686779b42c81dd13093544face9"
Dialog("Login").WinEdit("Password:").Type micReturn

Action2 : (Open Orders)