Regular Expression Basics
How to Find or Validate an Email Address
The regular expression I receive the most feedback on, not to mention "bug" reports, is the one you'll find right on this site's home page: \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b. This regular expression, I claim, matches any email address. Most of the feedback I get refutes that claim by showing one email address that this regex doesn't match. Usually, the "bug" report also includes a suggestion to make the regex "perfect".
As I explain below, my claim only holds true when one accepts my definition of what a valid email address really is, and what it's not. If you want to use a different definition, you'll have to adapt the regex. Matching a valid email address is a perfect example showing that (1) before writing a regex, you have to know exactly what you're trying to match, and what not; and (2) there's often a trade-off between what's exact, and what's practical.
The virtue of my regular expression above is that it matches 99% of the email addresses in use today. All the email addresses it matches can be handled by 99% of all email software out there. If you're looking for a quick solution, you only need to read the next paragraph. If you want to know all the trade-offs and get plenty of alternatives to choose from, read on.
If you want to use the regular expression above, there are two things you need to understand. First, long regexes make it difficult to nicely format paragraphs, so I didn't include a-z in any of the three character classes. This regex is intended to be used with your regex engine's "case insensitive" option turned on. (You'd be surprised how many "bug" reports I get about that.) Second, the above regex is delimited with word boundaries, which makes it suitable for extracting email addresses from files or larger blocks of text. If you want to check whether the user typed in a valid email address, replace the word boundaries with start-of-string and end-of-string anchors, like this: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$.
The previous paragraph also applies to all following examples. You may need to change word boundaries into start/end-of-string anchors, or vice versa. And you will need to turn on the case insensitive matching option.
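The two uses described above (extracting from text versus validating a whole string) can be sketched in Python as follows. This is only an illustration under assumptions: the names EMAIL_CORE, extract_emails and is_valid_email are invented here, and re.IGNORECASE stands in for the "case insensitive" option.

```python
import re

# The article's pattern without its delimiters (hypothetical helper name).
EMAIL_CORE = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# Word boundaries: suitable for pulling addresses out of larger text.
EXTRACT = re.compile(r"\b" + EMAIL_CORE + r"\b", re.IGNORECASE)
# Start/end anchors: suitable for checking one user-supplied string.
VALIDATE = re.compile(r"^" + EMAIL_CORE + r"$", re.IGNORECASE)


def extract_emails(text):
    """Return every substring of text that matches the email pattern."""
    return EXTRACT.findall(text)


def is_valid_email(candidate):
    """Return True when the whole string matches the email pattern."""
    return VALIDATE.match(candidate) is not None


found = extract_emails("Write to john@example.com or Jane.Doe@mail.example.org today.")
```

Note that the lowercase and mixed-case addresses only match because the case-insensitive flag is set.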
To include .museum, you could use ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$. However, then there's another trade-off. This regex will match john@mail.office. It's far more likely that John forgot to type in the .com top level domain rather than having just created a new .office top level domain without ICANN's permission.
This shows another trade-off: do you want the regex to check if the top level domain exists? My regex doesn't. Any combination of two to four letters will do, which covers all existing and planned top level domains except .museum. But it will match addresses with invalid top-level domains like asdf@asdf.asdf. By not being overly strict about the top-level domain, I don't have to update the regex each time a new top-level domain is created, whether it's a country code or generic domain.
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$ could be used to allow any two-letter country code top level domain, and only specific generic top level domains. By the time you read this, the list might already be out of date. If you use this regular expression, I recommend you store it in a global constant in your application, so you only have to update it in one place. You could list all country codes in the same manner, even though there are almost 200 of them.
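One way to follow the "global constant" advice is to keep the list of generic top level domains in a single tuple and build the regex from it. The sketch below assumes Python; GENERIC_TLDS, EMAIL_WITH_KNOWN_TLD and has_known_tld are hypothetical names, and the list is copied from the regex above, so it may well be out of date.

```python
import re

# Hypothetical application-wide constant: edit this one tuple when the list changes.
GENERIC_TLDS = ("com", "org", "net", "edu", "gov", "mil", "biz", "info",
                "mobi", "name", "aero", "asia", "jobs", "museum")

# Any two-letter country code, or one of the listed generic domains.
EMAIL_WITH_KNOWN_TLD = re.compile(
    r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|" + "|".join(GENERIC_TLDS) + r")$",
    re.IGNORECASE,
)


def has_known_tld(address):
    """Return True when address matches and ends in an allowed top level domain."""
    return EMAIL_WITH_KNOWN_TLD.match(address) is not None
```

With this variant john@mail.office and asdf@asdf.asdf are rejected, while a .museum address is accepted.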
Email addresses can be on servers on a subdomain, e.g. john@server.department.company.com. All of the above regexes will match this email address, because I included a dot in the character class after the @ symbol. However, the above regexes will also match john@aol...com which is not valid due to the consecutive dots. You can exclude such matches by replacing [A-Z0-9.-]+\. with (?:[A-Z0-9-]+\.)+ in any of the above regexes. I removed the dot from the character class and instead repeated the character class and the following literal dot. E.g. \b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b will match john@server.department.company.com but not john@aol...com.
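The difference between the two domain sub-patterns can be checked with a short Python sketch. The names LOOSE and STRICT are made up for this illustration; the patterns themselves are the ones quoted in the paragraph above.

```python
import re

# Dot inside the character class: consecutive dots slip through.
LOOSE = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)
# Each label must be followed by exactly one dot.
STRICT = re.compile(r"\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b", re.IGNORECASE)

subdomain = "john@server.department.company.com"
double_dots = "john@aol...com"

loose_accepts_dots = LOOSE.search(double_dots) is not None      # True
strict_accepts_dots = STRICT.search(double_dots) is not None    # False
strict_match = STRICT.search(subdomain).group()                 # the full address
```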
Another trade-off is that my regex only allows English letters, digits and a few special symbols. The main reason is that I don't trust all my email software to be able to handle much else. Even though John.O'Hara@theoharas.com is a syntactically valid email address, there's a risk that some software will misinterpret the apostrophe as a delimiting quote. E.g. blindly inserting this email address into a SQL statement will cause it to fail if strings are delimited with single quotes. And of course, it's been many years already that domain names can include non-English characters. Most software and even domain name registrars, however, still stick to the 37 characters they're used to.
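The apostrophe problem comes from pasting the address into the SQL text itself. As a hedged illustration, the sketch below uses Python's sqlite3 module with a bound parameter, which avoids the quoting issue regardless of what the regex allows. The table name subscribers and the helper store_address are assumptions made for this example.

```python
import sqlite3


def store_address(connection, address):
    """Insert address through a bound parameter, so quotes inside it are harmless."""
    connection.execute("INSERT INTO subscribers (email) VALUES (?)", (address,))


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE subscribers (email TEXT)")
store_address(connection, "John.O'Hara@theoharas.com")
stored = connection.execute("SELECT email FROM subscribers").fetchall()
```

Software written this way can accept a more permissive email regex; software that concatenates strings cannot.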
The conclusion is that to decide which regular expression to use, whether you're trying to match an email address or something else that's vaguely defined, you need to start with considering all the trade-offs. How bad is it to match something that's not valid? How bad is it not to match something that is valid? How complex can your regular expression be? How expensive would it be if you had to change the regular expression later? Different answers to these questions will require a different regular expression as the solution. My email regex does what I want, but it may not do what you want.
If you want to learn more about regular expressions, please visit these sites:
http://www.regular-expressions.info/
http://www.w3schools.com/jsref/jsref_obj_regexp.asp
In computing, a regular expression provides a concise and flexible means for "matching" (specifying and recognizing) strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp". The concept of regular expressions was first popularized by utilities provided by Unix distributions, in particular the editor ed and the filter grep. A regular expression is written in a formal language that can be interpreted by a regular expression processor, which is a program that either serves as a parser generator or examines text and identifies parts that match the provided specification. Historically, the concept of regular expressions is associated with Kleene's formalism of regular sets, introduced in the 1950s.
Here are examples of specifications that could be expressed in a regular expression:
- the sequence of characters "car" appearing consecutively in any context, such as in "car", "cartoon", or "bicarbonate"
- the sequence of characters "car" occurring in that order with other characters between them, such as in "Icelander" or "chandler"
- the word "car" when it appears as an isolated word
- the word "car" when preceded by the word "blue" or "red"
- the word "car" when not preceded by the word "motor"
- a dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits (for example, "$100" or "$245.99").
These examples are simple. Specifications of great complexity can be conveyed by regular expressions.
Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. Some of these languages, including Perl, Ruby, AWK, and Tcl, have been designed so that regular expressions are fully integrated into the syntax of the core language itself. Other programming languages, such as the .NET languages, Java, and Python, instead provide regular expressions through standard libraries. For yet other languages, such as Object Pascal, C and C++, non-core libraries are available (however, C++11 provides regular expressions in its standard library).
As an example of the syntax, the regular expression \bex can be used to search for all instances of the string "ex" that occur after "word boundaries". Thus \bex will find the matching string "ex" in two possible locations: (1) at the beginning of words, and (2) between two characters in a string, where the first is not a word character and the second is a word character. For instance, in the string "Texts for experts", \bex matches the "ex" in "experts" but not in "Texts" (because the "ex" occurs inside a word and not immediately after a word boundary).
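The "Texts for experts" example can be reproduced with a few lines of Python. This is a small sketch; the variable names are arbitrary.

```python
import re

text = "Texts for experts"

# Start offsets of "ex" with and without the word boundary.
plain_positions = [m.start() for m in re.finditer(r"ex", text)]       # [1, 10]
boundary_positions = [m.start() for m in re.finditer(r"\bex", text)]  # [10]
```

Offset 1 is the "ex" inside "Texts"; only the one at offset 10, which starts the word "experts", survives the \b.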
Many modern computing systems provide wildcard characters in matching filenames from a file system. This is a core capability of many command-line shells and is also known as globbing. Wildcards differ from regular expressions in generally expressing only limited forms of patterns.
Basic concepts
A regular expression, often called a pattern, is an expression that specifies a set of strings. It is more concise to specify a set's members by rules (such as a pattern) than by a list. For example, the set containing the three strings "Handel", "Händel", and "Haendel" can be specified by the pattern H(ä|ae?)ndel (or alternatively, it is said that the pattern matches each of the three strings). In most formalisms, if there exists at least one regex that matches a particular set then there exist an infinite number of such expressions. Most formalisms provide the following operations to construct regular expressions.
- Boolean "or": A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".
- Grouping: Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" and "grey".
- Quantification: A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are the question mark ?, the asterisk * (derived from the Kleene star), and the plus sign + (Kleene cross).
- ? The question mark indicates there is zero or one of the preceding element. For example, colou?r matches both "color" and "colour".
- * The asterisk indicates there is zero or more of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
- + The plus sign indicates there is one or more of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
These constructions can be combined to form arbitrarily complex expressions, much like one can construct arithmetical expressions from numbers and the operations +, −, ×, and ÷. For example, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which match the same strings as the earlier example, H(ä|ae?)ndel.
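That the three patterns describe the same set can be checked mechanically. The Python sketch below is only a spot check on the three names plus one non-member, not a proof of equivalence; PATTERNS, NAMES and matches_all are names chosen for this example.

```python
import re

PATTERNS = [r"H(ä|ae?)ndel", r"H(ae?|ä)ndel", r"H(a|ae|ä)ndel"]
NAMES = ["Handel", "Händel", "Haendel"]


def matches_all(pattern):
    """Return True when pattern matches each spelling in NAMES in full."""
    return all(re.fullmatch(pattern, name) for name in NAMES)


all_equivalent_here = all(matches_all(p) for p in PATTERNS)
rejects_other = all(re.fullmatch(p, "Hendel") is None for p in PATTERNS)
```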
The precise syntax for regular expressions varies among tools and with context; more detail is given in the Syntax section.
Syntax Section
Metacharacter | Meaning
[ ] | Match anything inside the square brackets for ONE character position, once and only once. For example, [12] means match the target to 1 and, if that does not match, then match the target to 2, while [0123456789] means match to any character in the range 0 to 9.
- | The - (dash) inside square brackets is the 'range separator' and allows us to define a range; the example [0123456789] above could be rewritten as [0-9]. You can define more than one range inside a list, for example, [0-9A-C] means check for 0 to 9 and A to C (but not a to c). NOTE: To test for - inside brackets (as a literal) it must come first or last, that is, [-0-9] will test for - and 0 to 9.
^ | The ^ (circumflex or caret) inside square brackets negates the expression (we will see an alternate use for the circumflex/caret outside square brackets later). For example, [^Ff] means anything except upper or lower case F and [^a-z] means everything except lower case a to z. NOTE: Spaces, or in this case the lack of them, between ranges are very important.
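The bracket rules in the table can be tried out with Python's re module. This is a minimal sketch with sample strings invented for the purpose.

```python
import re

# Two ranges in one list: digits and uppercase A to C (lowercase c is excluded).
in_ranges = re.findall(r"[0-9A-C]", "a1B2c3D")      # ['1', 'B', '2', '3']

# Negation: everything except upper or lower case F.
not_f = re.findall(r"[^Ff]", "Fluffy")               # ['l', 'u', 'y']

# A literal dash placed first in the list, followed by the range 0 to 9.
dash_or_digit = re.findall(r"[-0-9]", "555-0199")
```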
NOTE: There are some special range values that are built-in to most regular expression software and have to
be if it claims POSIX 1003.2 compliance for either BRE or ERE.
Metacharacter | Meaning
^ | The ^ (circumflex or caret) outside square brackets means look only at the beginning of the target string. For example, in a target string that begins with Mozilla and mentions Windows further on, ^Win will not find Windows but ^Moz will find Mozilla.
$ | The $ (dollar) means look only at the end of the target string. For example, fox$ will find a match in 'silver fox' since it appears at the end of the string, but not in 'the fox jumped over the moon'.
. | The . (period) means any single character in this position. For example, ton. will find tons, tone and tonneau but not wanton because it has no following character.
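The anchor and dot examples can be verified with a short Python sketch. The USER_AGENT string is a made-up stand-in for a target string that starts with Mozilla and mentions Windows later.

```python
import re

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0)"  # hypothetical target string

starts_with_win = re.search(r"^Win", USER_AGENT) is not None   # False
starts_with_moz = re.search(r"^Moz", USER_AGENT) is not None   # True

fox_at_end = re.search(r"fox$", "silver fox") is not None                         # True
fox_in_middle = re.search(r"fox$", "the fox jumped over the moon") is not None    # False

# The dot needs exactly one character after "ton".
tonneau_match = re.search(r"ton.", "tonneau").group()   # 'tonn'
wanton_match = re.search(r"ton.", "wanton")             # None
```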
Metacharacter | Meaning
? | The ? (question mark) matches the preceding character 0 or 1 times only, for example, colou?r will find both color (0 times) and colour (1 time).
* | The * (asterisk or star) matches the preceding character 0 or more times, for example, tre* will find tree (2 times) and tread (1 time) and trough (0 times).
+ | The + (plus) matches the previous character 1 or more times, for example, tre+ will find tree (2 times) and tread (1 time) but not trough (0 times).
{n} | Matches the preceding character, or character range, n times exactly, for example, to find a local phone number we could use [0-9]{3}-[0-9]{4} which would find any number of the form 123-4567. Note: The - (dash) in this case, because it is outside the square brackets, is a literal. The value is enclosed in braces (curly brackets).
{n,m} | Matches the preceding character at least n times but not more than m times, for example, 'ba{2,3}b' will find 'baab' and 'baaab' but NOT 'bab' or 'baaaab'. Values are enclosed in braces (curly brackets).
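The quantifier examples from the table can be run as written. The sketch below uses Python's re module; the helper name found is an invention for this example.

```python
import re


def found(pattern, text):
    """Return the first matched substring, or None when there is no match."""
    match = re.search(pattern, text)
    return match.group() if match else None


star_results = [found(r"tre*", w) for w in ("tree", "tread", "trough")]
# ['tree', 'tre', 'tr']: the e is matched 2, 1 and 0 times

plus_results = [found(r"tre+", w) for w in ("tree", "tread", "trough")]
# ['tree', 'tre', None]: trough has no e after "tr"

phone_ok = re.fullmatch(r"[0-9]{3}-[0-9]{4}", "123-4567") is not None
range_results = [re.fullmatch(r"ba{2,3}b", w) is not None
                 for w in ("bab", "baab", "baaab", "baaaab")]
```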
Metacharacter | Meaning
( ) | The ( (open parenthesis) and ) (close parenthesis) may be used to group (or bind) parts of our search expression together.
| | The | (vertical bar or pipe) is called alternation in techspeak and means find the left hand OR right hand values, for example, gr(a|e)y will find 'gray' or 'grey'.
Correction: in the expression ^([L-Z]in) the '^' does not negate the test [L-Z], as we once incorrectly stated. The '^' only performs that function inside square brackets; here it is outside the square brackets and is an anchor indicating 'start from first character'. Many thanks to Mirko Stojanovic for pointing it out, and apologies to one and all.
Value | Meaning
[:digit:] | Only the digits 0 to 9.
[:alnum:] | Any alphanumeric character: 0 to 9, A to Z or a to z.
[:alpha:] | Any alpha character A to Z or a to z.
[:blank:] | Space and TAB characters only.
[:xdigit:] | Hexadecimal notation 0-9, A-F, a-f.
[:punct:] | Punctuation symbols . , " ' ? ! ; : # $ % & ( ) * + - / < > = @ [ ] \ ^ _ { } | ~
[:print:] | Any printable character.
[:space:] | Any whitespace character (space, tab, NL, FF, VT, CR). Many systems abbreviate this as \s.
[:graph:] | Any visible character, that is, any printable character except whitespace (SPACE, TAB).
[:upper:] | Any alpha character A to Z.
[:lower:] | Any alpha character a to z.
[:cntrl:] | Control characters NL CR LF TAB VT FF NUL SOH STX ETX EOT ENQ ACK SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC IS1 IS2 IS3 IS4 DEL.
These are always used inside square brackets in the form [[:alnum:]] or combined as [[:digit:]a-d].
Character Class Abbreviations
Abbreviation | Meaning
\d | Match any character in the range 0 - 9 (equivalent of POSIX [:digit:]).
\D | Match any character NOT in the range 0 - 9 (equivalent of POSIX [^[:digit:]]).
\s | Match any whitespace character (space, tab etc.). Equivalent of POSIX [:space:], except that some engines do not recognize VT.
\S | Match any character that is NOT whitespace (space, tab). Equivalent of POSIX [^[:space:]].
\w | Match any character in the range 0 - 9, A - Z and a - z; most engines also include the underscore (close to POSIX [:alnum:], which has no underscore).
\W | Match any character NOT matched by \w (close to POSIX [^[:alnum:]]).
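The abbreviations behave as follows in Python, which is used here only as one example engine. The re.ASCII flag is passed so that \d, \w and \s are limited to the ASCII ranges the table describes; without it, Python 3 also accepts Unicode digits, letters and spaces.

```python
import re

# \w includes the underscore, so "user_name" stays in one piece.
words = re.findall(r"\w+", "user_name-42", re.ASCII)      # ['user_name', '42']

digits = re.findall(r"\d+", "tel 555 0199", re.ASCII)     # ['555', '0199']
non_digits = re.findall(r"\D+", "ab12cd", re.ASCII)       # ['ab', 'cd']

# \s+ treats any run of spaces and tabs as one separator.
fields = re.split(r"\s+", "a  b\tc", flags=re.ASCII)      # ['a', 'b', 'c']
```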
Positional Abbreviations
Abbreviation | Meaning
\b | Word boundary. Match any character(s) at the beginning (\bxx) and/or end (xx\b) of a word, thus \bton\b will find ton but not tons, but \bton will find tons.
\B | Not word boundary. Match any character(s) NOT at the beginning (\Bxx) and/or end (xx\B) of a word, thus \Bton\B will find wantons but not tons, but ton\B will find both wantons and tons.
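The ton/tons/wantons claims can be confirmed with a small Python sketch; hits is a helper name introduced for this example.

```python
import re


def hits(pattern, word):
    """Return True when pattern is found somewhere in word."""
    return re.search(pattern, word) is not None


boundary_both = (hits(r"\bton\b", "ton"), hits(r"\bton\b", "tons"))        # (True, False)
boundary_start = hits(r"\bton", "tons")                                    # True
inside_only = (hits(r"\Bton\B", "wantons"), hits(r"\Bton\B", "tons"))      # (True, False)
not_at_end = (hits(r"ton\B", "wantons"), hits(r"ton\B", "tons"))           # (True, True)
```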
Modifiers
Modifiers are used to perform case-insensitive and global searches:
Modifier | Description
i | Perform case-insensitive matching
g | Perform a global match (find all matches rather than stopping after the first match)
m | Perform multiline matching
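The i, g and m letters above are JavaScript-style modifiers. As a rough analogue, and only as an assumption about how one might map them, Python uses the flags re.IGNORECASE and re.MULTILINE, while "global" corresponds to calling findall (or finditer) instead of search.

```python
import re

text = "Cat\ncat\nCAT"

first_only = re.search(r"cat", text, re.IGNORECASE).group()          # like i without g: 'Cat'
all_matches = re.findall(r"cat", text, re.IGNORECASE)                # like i with g

# Without multiline mode, ^ and $ refer to the whole string, so nothing matches.
whole_lines_off = re.findall(r"^cat$", text, re.IGNORECASE)
# With multiline mode, ^ and $ also match at each line break.
whole_lines_on = re.findall(r"^cat$", text, re.IGNORECASE | re.MULTILINE)
```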
Brackets
Brackets are used to find a range of characters:
Expression | Description
[abc] | Find any character between the brackets
[^abc] | Find any character not between the brackets
[0-9] | Find any digit from 0 to 9
[A-Z] | Find any character from uppercase A to uppercase Z
[a-z] | Find any character from lowercase a to lowercase z
[A-z] | Find any character from uppercase A to lowercase z (in ASCII this range also covers the six characters [ \ ] ^ _ ` that sit between Z and a)
[adgk] | Find any character in the given set
[^adgk] | Find any character outside the given set
(red|blue|green) | Find any of the alternatives specified
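The [A-z] range deserves a quick check, because it is easy to mistake it for "all letters". The Python sketch below uses an arbitrary sample string.

```python
import re

sample = "a_Z[9"

# [A-z] spans ASCII codes 65 to 122, which includes _ and [ as well as letters.
wide_range = re.findall(r"[A-z]", sample)        # ['a', '_', 'Z', '[']
letters_only = re.findall(r"[A-Za-z]", sample)   # ['a', 'Z']
```

When only letters are wanted, two explicit ranges are the safer choice.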
Metacharacters
Metacharacters are characters with a special meaning:
Metacharacter | Description
. | Find a single character, except newline or line terminator
\w | Find a word character
\W | Find a non-word character
\d | Find a digit
\D | Find a non-digit character
\s | Find a whitespace character
\S | Find a non-whitespace character
\b | Find a match at the beginning/end of a word
\B | Find a match not at the beginning/end of a word
\0 | Find a NUL character
\n | Find a new line character
\f | Find a form feed character
\r | Find a carriage return character
\t | Find a tab character
\v | Find a vertical tab character
\xxx | Find the character specified by an octal number xxx
\xdd | Find the character specified by a hexadecimal number dd
\uxxxx | Find the Unicode character specified by a hexadecimal number xxxx
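Several of these escapes can be tried in Python as well, although the table describes JavaScript and the two engines are not identical. The sketch assumes Python 3 string patterns, where a three-digit octal escape, \xdd and \uxxxx are all accepted.

```python
import re

# Octal 101, hexadecimal 41 and Unicode 0041 all denote the letter A.
octal_a = re.fullmatch(r"\101", "A") is not None
hex_a = re.fullmatch(r"\x41", "A") is not None
unicode_a = re.fullmatch(r"\u0041", "A") is not None

# \t and \n match the literal tab and newline characters.
columns = re.split(r"\t", "name\tvalue")     # ['name', 'value']
lines = re.split(r"\n", "one\ntwo")          # ['one', 'two']
```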
Quantifiers
Quantifier | Description
n+ | Matches any string that contains at least one n
n* | Matches any string that contains zero or more occurrences of n
n? | Matches any string that contains zero or one occurrences of n
n{X} | Matches any string that contains a sequence of X n's
n{X,Y} | Matches any string that contains a sequence of X to Y n's
n{X,} | Matches any string that contains a sequence of at least X n's
n$ | Matches any string with n at the end of it
^n | Matches any string with n at the beginning of it
?=n | Matches any string that is followed by a specific string n
?!n | Matches any string that is not followed by a specific string n
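The last two rows, ?=n and ?!n, are lookaheads and are written inside parentheses in an actual pattern: (?=n) and (?!n). A short Python sketch with an invented price string shows both. The \b in the negative case is a deliberate addition; without it the engine could backtrack and match only part of a number.

```python
import re

prices = "10 dollars, 20 euros, 30 dollars"

# Numbers that are followed by " dollars" (the lookahead text is not consumed).
in_dollars = re.findall(r"\d+(?= dollars)", prices)            # ['10', '30']

# Whole numbers that are NOT followed by " dollars".
not_in_dollars = re.findall(r"\b\d+\b(?! dollars)", prices)    # ['20']
```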
Trade-Offs in Validating Email Addresses
Yes, there are a whole bunch of email addresses that my pet regex doesn't match. The most frequently quoted examples are addresses on the .museum top level domain, which is longer than the 4 letters my regex allows for the top level domain. I accept this trade-off because the number of people using .museum email addresses is extremely low. I've never had a complaint that the order forms or newsletter subscription forms on the JGsoft websites refused a .museum address (which they would, since they use the above regex to validate the email address).
Some Useful Examples of Regular Expressions in QTP Scripting
Dim re, ma
Set re = New RegExp                    ' VBScript has no "Dim ... As" syntax
re.Pattern = "[A-Z][0-9][0-9][0-9]"    ' uppercase char followed by 3 digits
re.IgnoreCase = False                  ' case sensitive search
re.Global = True                       ' find all the occurrences
' txtSource is assumed to be a text control that holds the text to search
For Each ma In re.Execute(txtSource.Text)
    MsgBox "Found '" & ma.Value & "' at index " & ma.FirstIndex
Next
---------------------------------------------------------------------------------------------
Regular Expression Example in QTP
(Log in to the Flight Reservation application, open orders and send fax orders)
While sending fax orders, the title of the fax order window changes for each order, so that is where we need to add a regular expression.
Action1: (Log in to the Flight Reservation application)
SystemUtil.Run "C:\Program Files (x86)\HP\QuickTest Professional\samples\flight\app\flight4a.exe"
Dialog("Login").WinEdit("Agent Name:").Set "venkatesh"
Dialog("Login").WinEdit("Agent Name:").Type micTab
Dialog("Login").WinEdit("Password:").SetSecure "4c0ba4645292b686779b42c81dd13093544face9"
Dialog("Login").WinEdit("Password:").Type micReturn
Action2: (Open orders)
Window("Flight Reservation").Activate
Window("Flight Reservation").WinMenu("Menu").Select "File;Open Order..."
Window("Flight Reservation").Dialog("Open Order").WinCheckBox("Order No.").Set "ON"
Window("Flight Reservation").Dialog("Open Order").WinEdit("Edit").Set DataTable("order_no", dtGlobalSheet)
Window("Flight Reservation").Dialog("Open Order").WinButton("OK").Click
Window("Flight Reservation").Activate
Action3: (Send fax orders)
Window("Flight Reservation").Activate
Window("Flight Reservation").WinMenu("Menu").Select "File;Fax Order..."
Window("Flight Reservation").Dialog("Fax Order No. 1").ActiveX("MaskEdBox").Type "11111111111111"
Window("Flight Reservation").Dialog("Fax Order No. 1").WinButton("Send").Click
Window("Flight Reservation").Activate
Action4: (Close the application)
Window("Flight Reservation").Activate
Window("Flight Reservation").Close
Go to the Global Data Table: rename column "A" to "order_no" and enter the values 1, 2 and 3, one per row.
Here we add the regular expression:
- Go to Expert View, Action3.
- Right-click Dialog("Fax Order No. 1") and go to Object Properties.
- Click the value Fax Order No. 1; a browse button appears. Open it and change the value to: Fax Order No\. [1-3]
- Check Regular Expression and click No in the pop-up.
- Execute the test.
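Before typing the pattern into the Object Properties dialog, it can be sanity-checked outside QTP. The Python sketch below is only such a check under assumptions: the window titles are modelled on the "Fax Order No. 1" title used in the script, and TITLE_PATTERN is a name made up here.

```python
import re

# The pattern entered in Object Properties; the backslash makes the dot literal.
TITLE_PATTERN = r"Fax Order No\. [1-3]"

titles = ["Fax Order No. 1", "Fax Order No. 2", "Fax Order No. 3", "Fax Order No. 4"]
accepted = [t for t in titles if re.fullmatch(TITLE_PATTERN, t)]

# Without the backslash, the dot would match any character in that position.
unescaped_is_too_loose = re.fullmatch(r"Fax Order No. [1-3]", "Fax Order Nox 1") is not None
```

The pattern accepts the three order numbers entered in the data table and rejects a fourth, which is the behaviour the steps above rely on.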