Regular expressions in PHP

Regular expressions, with all their strange symbols and characters, can be pretty intimidating. I’ve been through a number of books and articles that try to explain them, but I’ve never seen one like this, which breaks the subject down into nice, easy-to-understand chunks.

Here it is:

Introduction to Regular Expressions

Regular expressions are a powerful way of analysing and retrieving values from text strings. A full-blown regular expression can be intimidating, so this tutorial breaks them down into a series of simple examples, easing you into the process. I tend to use the PHP function preg_match when using regular expressions. It takes the form:

preg_match('/regularexpression/', $textstring)

Note the forward slash at the start and end of the regular expression. These are delimiters: they mark where the regular expression begins and ends. Other PHP functions used with regexes are preg_split, preg_replace and preg_match_all. You can find out more about these on the official PHP website.
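To give a feel for the other functions mentioned above, here’s a minimal sketch of each one. The sample string and the variable names ($text, $count and so on) are my own assumptions, picked only for illustration.

```php
<?php
// Sample input; the string and variable names are illustrative assumptions.
$text = 'tips and tutorials';

// preg_match_all finds every match, not just the first, and returns the count.
$count = preg_match_all('/t/', $text, $allMatches);

// preg_split breaks the string apart wherever the pattern matches.
$parts = preg_split('/ and /', $text);

// preg_replace swaps every match for the replacement text.
$replaced = preg_replace('/tips/', 'hints', $text);

echo $count, "\n";               // 3
echo implode('|', $parts), "\n"; // tips|tutorials
echo $replaced, "\n";            // hints and tutorials
```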

Searching for an exact text phrase

If you want to check if an exact text string is within another text string, there are no special regex characters required – you just use the exact text phrase for the regular expression. For example:

if (preg_match('/tutorial/', 'tips and tutorials are here')) {
    echo "word 'tutorial' found!";
}

Note – it’s case sensitive. It’s actually more efficient to use a plain string function such as strpos in these cases (substr only extracts part of a string, it doesn’t search), but I’m just kicking things off with an easy example.
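As a rough side-by-side, the sketch below runs the same check with preg_match and with the plain string function strpos. The i modifier (case-insensitive matching) isn’t covered in the tutorial text; I’ve added it only to show one way around the case sensitivity. Variable names are my own.

```php
<?php
$haystack = 'tips and tutorials are here';

// Regex version: preg_match returns 1 on a match, 0 otherwise.
$foundWithRegex = preg_match('/tutorial/', $haystack) === 1;

// Matching is case sensitive by default, so "Tutorial" is not found...
$foundCapitalised = preg_match('/Tutorial/', $haystack) === 1;

// ...unless the i modifier is added after the closing delimiter.
$foundIgnoringCase = preg_match('/Tutorial/i', $haystack) === 1;

// Plain string search: strpos returns the offset of the first occurrence,
// or false when there is none, so always compare with !== false.
$offset = strpos($haystack, 'tutorial');
$foundWithStrpos = $offset !== false;

var_dump($foundWithRegex, $foundCapitalised, $foundIgnoringCase, $foundWithStrpos);
```

The !== false comparison matters because a match at the very start of the string gives offset 0, which a loose comparison would treat as "not found".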

Start and end of text

If you’re searching for text at the start or end of a text string, use the symbols ^ and $.
“^The”: matches any string that starts with “The”;
“of despair$”: matches a string that ends in the substring “of despair”;
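Here’s a small sketch of the two anchors in PHP. The helper function names are hypothetical; they just wrap preg_match to keep the examples readable.

```php
<?php
// Hypothetical helpers; they only wrap preg_match for readability.
function startsWithThe(string $s): bool
{
    // ^ pins the pattern to the start of the string.
    return preg_match('/^The/', $s) === 1;
}

function endsWithOfDespair(string $s): bool
{
    // $ pins the pattern to the end of the string.
    return preg_match('/of despair$/', $s) === 1;
}

var_dump(startsWithThe('The end'));                // bool(true)
var_dump(startsWithThe('In The end'));             // bool(false)
var_dump(endsWithOfDespair('the pit of despair')); // bool(true)
var_dump(endsWithOfDespair('of despair and hope')); // bool(false)
```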

Multiple characters

The symbols ‘*’, ‘+’, and ‘?’ denote the number of times a character or a sequence of characters may occur. What they mean respectively is: “zero or more”, “one or more”, and “zero or one”. For example:
“tu*”: matches a string that has the letter t followed by zero or more u’s (“t”, “tu”, “tuuuuu”, “tutorial”, etc).
“tu+”: similar but at least one u (“tu”, “tuuu”, “tut”, etc).
“tu?”: there may or may not be a u.
“t?u+$”: a possible t followed by one or more u’s ending the string.
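To see what each of these quantifiers actually grabs, the sketch below captures the matched text rather than just testing for a match. firstMatch is a hypothetical helper I wrote for this example, not a PHP built-in.

```php
<?php
// Returns the part of $subject that the pattern matched, or null if
// there was no match. firstMatch is a hypothetical helper for this example.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

$star     = firstMatch('/tu*/', 'tuuuuu'); // 'tuuuuu': * takes as many u's as it can
$starZero = firstMatch('/tu*/', 'tips');   // 't': zero u's is fine
$plus     = firstMatch('/tu+/', 'tips');   // null: + needs at least one u
$optional = firstMatch('/tu?/', 'tuuuuu'); // 'tu': ? takes at most one u

var_dump($star, $starZero, $plus, $optional);
```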

Or if you want to be more specific on the number of multiple characters, you can specify a range within braces {}.
“o{3}h”: matches a string that has exactly three o’s followed by h (“oooh”).
“o{3,}h”: there are at least three o’s (“oooh”, “ooooh”, “ooooooooooh”, etc).
“o{3,5}h”: from three to five o’s (“oooh”, “ooooh”, or “oooooh”).
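You must always give the first number of a range; the second is optional (eg – {3,5} or {3,} are both fine). To say “at most five”, write {0,5} rather than {,5}, which many regex engine versions do not treat as a quantifier.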
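The three brace forms can be compared side by side. In this sketch I’ve added ^ and $ so that the whole string has to fit the pattern; the test words and variable names are my own assumptions.

```php
<?php
// Anchors (^ and $) are added so the whole string must fit the pattern.
$exactlyThree = '/^o{3}h$/';
$threeOrMore  = '/^o{3,}h$/';
$threeToFive  = '/^o{3,5}h$/';

// Words with two, three, five and six o's.
$words = ['ooh', 'oooh', 'oooooh', 'ooooooh'];

$results = [];
foreach ($words as $word) {
    // Each entry holds the 1/0 result for the three patterns, in order.
    $results[$word] = [
        preg_match($exactlyThree, $word),
        preg_match($threeOrMore, $word),
        preg_match($threeToFive, $word),
    ];
}

print_r($results);
```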

If you want to quantify a sequence of characters rather than just a single character, put them inside parentheses:
“t(ut)*”: matches a string that has a t followed by zero or more copies of the sequence “ut” (eg – “t”, “tut”, “tutututut”, etc).
“t(ut){1,3}”: between one and three copies of “ut” (“tut”, “tutut”, “tututut”).
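A quick sketch of quantified groups follows. Again I’ve anchored the patterns with ^ and $ so the whole string must fit; the variable names are just for illustration.

```php
<?php
// With anchors, the whole string must be a t followed by copies of "ut".
$anyCopies  = '/^t(ut)*$/';
$oneToThree = '/^t(ut){1,3}$/';

$a = preg_match($anyCopies, 't');          // 1: zero copies is allowed
$b = preg_match($anyCopies, 'tututut');    // 1: t + ut + ut + ut
$c = preg_match($anyCopies, 'tutu');       // 0: the last u has no t after it

$d = preg_match($oneToThree, 't');         // 0: needs at least one "ut"
$e = preg_match($oneToThree, 'tututut');   // 1: three copies
$f = preg_match($oneToThree, 'tutututut'); // 0: four copies is too many

var_dump($a, $b, $c, $d, $e, $f);
```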

OR operator

The ‘|’ symbol works as an OR operator:
“tips|tutorials”: matches a string that has either “tips” or “tutorials” in it.
“(b|cd)ef”: a string that has either “bef” or “cdef”.
“(a|b)*c”: a string that has any sequence of a’s and b’s (possibly none at all) followed by a c.
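The OR patterns above can be tried out like this. The test strings are my own, and I’ve anchored the grouped patterns so the whole string has to match.

```php
<?php
// Plain alternation: either word anywhere in the string.
$either  = preg_match('/tips|tutorials/', 'handy tips for beginners');  // 1
$neither = preg_match('/tips|tutorials/', 'handy hints for beginners'); // 0

// Parentheses limit the alternation to part of the pattern.
$bef   = preg_match('/^(b|cd)ef$/', 'bef');   // 1
$cdef  = preg_match('/^(b|cd)ef$/', 'cdef');  // 1
$bcdef = preg_match('/^(b|cd)ef$/', 'bcdef'); // 0: only one alternative may be used

// (a|b)* accepts any mix of a's and b's, including none at all.
$mixed    = preg_match('/^(a|b)*c$/', 'abbac'); // 1
$justC    = preg_match('/^(a|b)*c$/', 'c');     // 1
$hasOther = preg_match('/^(a|b)*c$/', 'adc');   // 0: d is neither a nor b

var_dump($either, $neither, $bef, $cdef, $bcdef, $mixed, $justC, $hasOther);
```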

Wildcard character

A period (‘.’) is a wildcard character – it can stand for any single character (except, by default, a newline):
“t.*p”: matches a string that has a t followed by any number of characters followed by a p (“tip”, “tp”, “tdfsadfsadsfp”, etc).
“^.{5}$”: a string with exactly 5 characters (“bingo”, “blind”, “rainy”, “asdfe”, etc).
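Two things about the dot are easy to miss, so here’s a short sketch. “.*” is greedy, meaning it grabs as much text as it can, and by default the dot skips newlines. The s modifier used at the end isn’t covered in the tutorial text; it’s a standard PCRE option that lets the dot match newlines too.

```php
<?php
// .* is greedy: the match runs from the first t to the LAST p.
preg_match('/t.*p/', 'tip top', $m);
$greedyMatch = $m[0]; // 'tip top', not just 'tip'

// Zero characters between t and p is also fine.
$tp = preg_match('/t.*p/', 'tp'); // 1

// Exactly five characters between the start and the end of the string.
$five = preg_match('/^.{5}$/', 'bingo');  // 1
$six  = preg_match('/^.{5}$/', 'bingos'); // 0

// By default the dot does not match a newline; the s modifier changes that.
$acrossLines      = preg_match('/t.p/', "t\np");  // 0
$acrossLinesWithS = preg_match('/t.p/s', "t\np"); // 1

var_dump($greedyMatch, $tp, $five, $six, $acrossLines, $acrossLinesWithS);
```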

Bracket expressions

Bracket expressions let you match a whole range of characters at a single position in a string:
“[tu]”: matches a string that has either a ‘t’ or a ‘u’ (that’s the same as “t|u”);
“[a-d]”: a string that has lowercase letters ‘a’ through ‘d’ (that’s equal to “a|b|c|d” and even “[abcd]”);
“^[a-zA-Z]”: a string that starts with a letter;
“[0-9]%”: a string that has a single digit before a percent sign;
“,[a-zA-Z0-9]$”: a string that ends in a comma followed by an alphanumeric character.
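Note that inside brackets, most of the regex special characters (such as ., *, + and ?) are just ordinary characters – they don’t do any of their usual regular expression functions. The exceptions are ^ at the very start, - between two characters, ] and the backslash.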
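Here’s a sketch of the bracket expressions above in PHP, including a check that . and * lose their special meaning inside brackets. The test strings and variable names are my own.

```php
<?php
// A string that starts with a letter.
$startsWithLetter = preg_match('/^[a-zA-Z]/', 'Regex 101'); // 1
$startsWithDigit  = preg_match('/^[a-zA-Z]/', '101 regex'); // 0

// A single digit directly before a percent sign.
preg_match('/[0-9]%/', 'save 5% today', $m);
$digitPercent = $m[0]; // '5%'

// Ends in a comma followed by one alphanumeric character.
$endsCommaAlnum = preg_match('/,[a-zA-Z0-9]$/', 'x,y');  // 1
$endsWithDot    = preg_match('/,[a-zA-Z0-9]$/', 'x,y.'); // 0

// Inside brackets, . and * are literal characters, not wildcards.
$onlyDotsAndStars = preg_match('/^[.*]+$/', '*.*'); // 1
$lettersRejected  = preg_match('/^[.*]+$/', 'abc'); // 0

var_dump($startsWithLetter, $startsWithDigit, $digitPercent,
    $endsCommaAlnum, $endsWithDot, $onlyDotsAndStars, $lettersRejected);
```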

Excluding characters

You can also exclude characters by using a ‘^’ as the first symbol in a bracket expression:
“%[^a-zA-Z]%”: matches a string with a character that is not a letter between two percent signs.
Note the difference between this use of ^ and using ^ at the start of a regular expression, where it anchors the match to the start of the string.
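The two meanings of ^ can be put next to each other in a short sketch (the test strings are made up for the example):

```php
<?php
// ^ as the first symbol inside brackets means "anything except these".
$pattern = '/%[^a-zA-Z]%/';

$digitBetween  = preg_match($pattern, 'rate: %5%'); // 1: 5 is not a letter
$letterBetween = preg_match($pattern, 'rate: %x%'); // 0: x is a letter
$nothingThere  = preg_match($pattern, '%%');        // 0: there must be a character in between

// Both meanings in one pattern: the first ^ anchors to the start of the
// string, the second ^ negates the bracket expression.
$startsWithNonLetter = preg_match('/^[^a-zA-Z]/', '5 apples'); // 1
$startsWithLetter    = preg_match('/^[^a-zA-Z]/', 'apples');   // 0

var_dump($digitBetween, $letterBetween, $nothingThere,
    $startsWithNonLetter, $startsWithLetter);
```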

Escaping regular expression characters

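What do you do if you want to check for one of the regular expression special characters “^.[$()|*+?{\” in your text string? You have to escape these characters with a backslash (‘\’). The same goes for the delimiter: a forward slash inside a pattern delimited by forward slashes must be written as “\/”.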
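A minimal sketch of escaping by hand follows, along with PHP’s built-in preg_quote, which does the escaping for you when the text to search for comes from a variable. The sample strings are my own assumptions.

```php
<?php
// A literal dollar sign and a literal dot each need a backslash.
$price = preg_match('/\$9\.99/', 'only $9.99 today'); // 1

// Without the backslash, the dot matches any character.
$loose  = preg_match('/9.99/', '9x99');  // 1
$strict = preg_match('/9\.99/', '9x99'); // 0

// preg_quote escapes the special characters in a string. The second
// argument names the delimiter so that it gets escaped as well.
$needle = 'a+b?';
$quoted = preg_quote($needle, '/'); // a\+b\?
$found  = preg_match('/' . $quoted . '/', 'is a+b? valid'); // 1

var_dump($price, $loose, $strict, $quoted, $found);
```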

Retrieving text using preg_match

If you want to extract a phrase out of a text string, you use the PHP function preg_match in the following format:

preg_match('/regular expression/', $textstring, $matchesarray)

It returns 1 if the regular expression matches, 0 if it does not (and false if an error occurs). For example,

echo preg_match ('/test/', "a test of preg_match");

outputs 1 whereas

echo preg_match ('/tutorial/', "a test of preg_match");

outputs 0.

preg_match is really useful for extracting phrases out of a text string. To do this, you specify an array as the third argument (eg – $matchesarray is what I use in the example). You also need to use parentheses in your regular expression to mark the sections you want to retrieve. If there’s a successful match, $matchesarray is filled with the results of the search. $matchesarray[0] contains the text that matched the full pattern, $matchesarray[1] contains the text that matched the first parenthesized subpattern, and so on.

For example, the following regex divides a URL into two sections. The first section is “http://” (note the escaping backslash in front of each forward slash), the second section is whatever comes after:

preg_match('/(http:\/\/)(.*)/', "http://www.tipsntutorials.com/", $matchesarray);

This fills $matchesarray with the following values:
$matchesarray[0] = “http://www.tipsntutorials.com/”
$matchesarray[1] = “http://”
$matchesarray[2] = “www.tipsntutorials.com/”
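Here’s a runnable sketch of the same extraction. As an alternative to escaping every slash, PHP lets you pick a different delimiter; I’ve used # below, which is my own choice rather than part of the original tutorial. The escaped-slash version is included for comparison.

```php
<?php
$url = 'http://www.tipsntutorials.com/';

// With # as the delimiter, the slashes in the pattern need no escaping.
if (preg_match('#(http://)(.*)#', $url, $matchesarray) === 1) {
    $whole  = $matchesarray[0]; // the text matched by the full pattern
    $scheme = $matchesarray[1]; // first parenthesized section
    $rest   = $matchesarray[2]; // second parenthesized section
}

// The same pattern with / delimiters and escaped slashes gives the same result.
preg_match('/(http:\/\/)(.*)/', $url, $escaped);

print_r($matchesarray);
```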

source:

http://www.tipsntutorials.com/tutorials/PHP/50

infraGrey

09 July 2014

Power tip:

Tired of those blank or empty code lines? Here’s a nifty trick using Regex to remove them.

Do a find-and-replace search (with regular expressions enabled) for anything that matches the following
^(?:[\t ]*(?:\r?\n|\r))+
and replace it with nothing.
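The same trick can be sketched in PHP with preg_replace. One assumption to flag: text editors usually treat ^ as “start of any line”, whereas PHP only does that when the m modifier is added, so I’ve included it here. The sample text is made up.

```php
<?php
// Sample text with an empty line, a whitespace-only line and a blank
// line that uses Windows-style \r\n line endings.
$source = "line one\n\n   \nline two\r\n\r\nline three\n";

// The m modifier makes ^ match at the start of every line, not just at
// the start of the whole string.
$cleaned = preg_replace('/^(?:[\t ]*(?:\r?\n|\r))+/m', '', $source);

echo $cleaned;
```

Each run of blank or whitespace-only lines is removed in one go, while the line endings of the lines that remain are left untouched.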

That’s it.
infraGrey

04 September 2019

One of the better resources on RegEx available on the web – easy to read and understand: https://cs.lmu.edu/~ray/notes/regex/
infraGrey

04 September 2019

Quick reference:
To search for a pattern contained in files within a directory via the Linux command line:

grep -rnw '/path/to/somewhere/' -e 'pattern'

-r or -R is recursive,
-n is line number, and
-w stands for match the whole word.
-l (lower-case L) can be added to just give the file name of matching files.

Along with these, the --exclude, --include and --exclude-dir flags can be used for more efficient searching.

This will only search through those files which have .c or .h extensions:

grep --include=\*.{c,h} -rnw '/path/to/somewhere/' -e "pattern"

This will skip all files ending with the .o extension:

grep --exclude=\*.o -rnw '/path/to/somewhere/' -e "pattern"

It’s also possible to exclude one or more directories with the --exclude-dir parameter. For example, this will exclude the dirs dir1/, dir2/ and all of those matching *.dst/:

grep --exclude-dir={dir1,dir2,*.dst} -rnw '/path/to/somewhere/' -e "pattern"

https://stackoverflow.com/questions/16956810/how-do-i-find-all-files-containing-specific-text-on-linux

Published 04 April 2012 by infraGrey in Web and Tech, Work