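Reposted

Regular Expressions: A Primer and Cheat Sheet

Overview

A regular expression uses symbols to describe a class of text (a pattern). The regular expression engine is responsible for finding text that fits this pattern in a given string.

This article lists the commonly used regular expression symbols and explains them by category. It is a quick pass over the metacharacters, meant as a cheat sheet and as background for reading more complex expressions later; further regular expression material will be added here over time. The examples are written in C#.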

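Overview

Literal characters

Character sets

Shorthand character sets

Quantifiers (specifying repetition)

Position anchors

Alternation

Matching special characters

Groups, backreferences, and non-capturing groups

Greedy and lazy matching

Backtracking and non-backtracking

Lookahead and lookbehind

Closing remarks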

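Literal characters

The simplest way to describe text is to state exactly what should be matched. For example, to find pattern in "Generic specialization, the decorator pattern, chains of responsibilities, and extensible software.", the regular expression is simply "pattern".

    using System;
    using System.Text.RegularExpressions;   // the later snippets assume these two using directives

    string input = "Generic specialization, the decorator pattern, chains of responsibilities, and extensible software.";
    Regex reg = new Regex("pattern", RegexOptions.IgnoreCase);   // IgnoreCase would also accept "Pattern"
    Console.WriteLine(reg.Matches(input).Count);  // output: 1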


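Character sets

Characters placed inside square brackets form a character set. A character set tells the regex engine to match exactly one character out of those listed.

Character	Matches	Example
[...]	Any one character listed in the brackets	[abc] matches a single a, b, or c, but no other character
[^...]	Any one character not listed in the brackets	[^abc] matches any single character other than a, b, and c, such as d, e, or f

Take the color gray, spelled gray in American English and grey in British English. To match either spelling in a text, the regex gr[ae]y is enough. Likewise, to find both me and my in a text, the regex is m[ey].

A hyphen (-) inside a character set denotes a range: [0-9] matches one digit from 0 to 9; [a-zA-Z] matches one English letter; [0-9a-zA-Z] matches one digit or one English letter.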
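As a small sketch of ranges and negation inside a character set, the snippet below counts digits, letters, and everything else in a string. The class name RangeDemo and its method names are made up for this illustration; only Regex.Matches and MatchCollection.Count come from .NET.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, introduced only for illustration.
public static class RangeDemo
{
    // [0-9] matches one ASCII digit; each digit is a separate match.
    public static int CountDigits(string input) =>
        Regex.Matches(input, "[0-9]").Count;

    // [a-zA-Z] combines two ranges: lowercase and uppercase English letters.
    public static int CountLetters(string input) =>
        Regex.Matches(input, "[a-zA-Z]").Count;

    // A leading ^ negates the set: anything that is not a digit or letter.
    public static int CountOthers(string input) =>
        Regex.Matches(input, "[^0-9a-zA-Z]").Count;
}

// Usage, for the input "Room 42B, floor 3":
//   RangeDemo.CountDigits(...)  returns 3   (4, 2, 3)
//   RangeDemo.CountLetters(...) returns 10  (Room, B, floor)
//   RangeDemo.CountOthers(...)  returns 4   (three spaces and one comma)
```

Note that a space typed inside the brackets, as in [0-9 a-zA-Z], becomes a member of the set, so it should be left out unless matching a space is intended.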

    string input = "The color of shirt is gray and color of shoes is grey too.";
    Regex reg = new Regex("gr[ae]y", RegexOptions.IgnoreCase);
    Console.WriteLine(reg.Matches(input).Count);  // output: 2

    var matchs = reg.Matches(input);
    foreach (Match match in matchs)
    {
        Console.WriteLine(match.Value);  // output: gray, then grey
    }


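Shorthand character sets

We often need to match a digit, a letter, or a whitespace character. Ordinary character sets can express these, but not conveniently, so regular expressions provide shorthand for the most common sets.

Character	Matches	Example
\d	Any one digit from 0 to 9	\d\d matches 72, but not me or 7a
\D	Any one non-digit character	\D\D matches me, but not 7a or 72
\w	Any one word character, such as A-Z, a-z, 0-9, and the underscore	\w\w\w\w matches aB_2, but not ab_@
\W	Any one non-word character	\W matches @, but not a
\s	Any one whitespace character, including space, tab, newline, carriage return, form feed, and vertical tab	Matches all the traditional whitespace characters
\S	Any one non-whitespace character	\S matches any non-whitespace character, such as ~, !, @, #, or &
.	Any one character	Matches any character except a newline

In .NET, \d and \w are Unicode-aware by default, so they also match digits and letters outside the ASCII range; RegexOptions.ECMAScript restricts them to the ASCII sets listed above.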

    string input = "1024 hello world&%";

    Regex reg1 = new Regex(@"\d\d\d\d");  // four consecutive digits
    if (reg1.IsMatch(input))
    {
        Console.WriteLine(reg1.Match(input).Value);  // output: 1024
    }

    Regex reg2 = new Regex(@"\W\W");  // two consecutive non-word characters
    if (reg2.IsMatch(input))
    {
        Console.WriteLine(reg2.Match(input).Value);  // output: &%
    }
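The table says that . matches any character except a newline. The sketch below shows that behavior, together with the .NET option RegexOptions.Singleline, which lifts the restriction. The class DotDemo and its method names are hypothetical names chosen for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, introduced only for illustration.
public static class DotDemo
{
    // Default behavior: "." matches any single character except "\n".
    public static bool MatchesByDefault(string input) =>
        Regex.IsMatch(input, "a.b");

    // With RegexOptions.Singleline, "." matches "\n" as well.
    public static bool MatchesWithSingleline(string input) =>
        Regex.IsMatch(input, "a.b", RegexOptions.Singleline);
}

// Usage:
//   DotDemo.MatchesByDefault("a-b")        returns true
//   DotDemo.MatchesByDefault("a\nb")       returns false
//   DotDemo.MatchesWithSingleline("a\nb")  returns true
```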


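Quantifiers (specifying repetition)

A quantifier says how many times the preceding element must repeat; it specifies a count, not content. Suppose we want to find the 11-digit mobile numbers that start with 158 in a list of phone numbers. Without quantifiers the regex is 158\d\d\d\d\d\d\d\d; with them it is 158\d{8}.

Character	Matches	Example
{n}	The preceding element exactly n times	x{2} matches xx, but cannot match xxx as a whole
{n,}	The preceding element n or more times	x{2,} matches two or more x, e.g. xx, xxx, xxxx, xxxxxx
{n,m}	The preceding element at least n and at most m times. In .NET the lower bound cannot be omitted; write {0,m} for "at most m"	x{2,4} matches xx, xxx, xxxx, but cannot match x or xxxxx as a whole
?	The preceding element 0 or 1 time, same as {0,1}	x? matches x or the empty string
+	The preceding element 1 or more times, same as {1,}	x+ matches x, xx, or xxx
*	The preceding element 0 or more times, same as {0,}	x* matches zero or more x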

    string input = "my phone number is 15861327445, please call me sometime.";
    Regex reg1 = new Regex(@"158\d{8}");  // an 11-digit mobile number starting with 158
    if (reg1.IsMatch(input))
    {
        Console.WriteLine(reg1.Match(input).Value);  // output: 15861327445
    }

    string input2 = "November is the 11 month of the year, you can use Nov for short.";
    Regex reg2 = new Regex(@"Nov(ember)?");  // matches Nov or November
    var matchs = reg2.Matches(input2);
    foreach (Match match in matchs)
    {
        Console.WriteLine(match.Value);  // output: November, then Nov
    }

    string input3 = "1000, 100, 2003, 9999,10000";
    Regex reg3 = new Regex(@"\b[1-9]\d{3}\b");  // numbers from 1000 to 9999; \b is a word boundary
    var matchs3 = reg3.Matches(input3);
    foreach (Match match in matchs3)
    {
        Console.WriteLine(match.Value);  // output: 1000, 2003, 9999
    }

    string input4 = "1000, 100, 2003, 9999,10000,99,10,1, 99999";
    Regex reg4 = new Regex(@"\b[1-9][0-9]{2,4}\b");  // numbers from 100 to 99999
    var matchs4 = reg4.Matches(input4);
    foreach (Match match in matchs4)
    {
        Console.WriteLine(match.Value);  // output: 1000, 100, 2003, 9999, 10000, 99999
    }


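Position anchors

By now we can match most text using character sets and their shorthands. But what about the following cases?

The first word of the text must be google

The text must end with bye

The first word of every line must be a number

A word must start with hel

Needing to match a position without matching any content, as in the cases above, is common. Regular expressions provide special characters that match positions rather than content: ^ matches the start of the text, $ matches the end of the text, and \b matches a word boundary.

Character	Matches	Example
^	The pattern that follows must be at the start of the string. For multi-line text (text containing newlines), set RegexOptions.Multiline to make it match at the start of every line as well
$	The preceding pattern must be at the end of the string (or just before a final newline). With RegexOptions.Multiline it also matches at the end of every line
\b	A word boundary: a position between a word character and a non-word character
\B	A non-word boundary: a position that is not at the start or end of a word
\A	The pattern that follows must be at the start of the string; the Multiline flag is ignored
\z	The preceding pattern must be at the very end of the string; the Multiline flag is ignored
\Z	The preceding pattern must be at the end of the string or just before a final newline; the Multiline flag is ignored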

    string input1 = "the color of shirt is gray and the color of shoes is grey too.";
    Regex reg1 = new Regex(@"^the");  // matches "the" only at the start of the string
    var matchs1 = reg1.Matches(input1);
    foreach (Match match in matchs1)
    {
        Console.WriteLine(string.Format("the value is:{0}; and the index is:{1}", match.Value, match.Index));
        // output: the value is:the; and the index is:0
    }

    string input2 = "the color of shirt shirts is gray and the color of shoes is grey too.";
    Regex reg2 = new Regex(@"\b\w*irt\b");  // matches the word shirt, but not shirts
    var matchs2 = reg2.Matches(input2);
    foreach (Match match in matchs2)
    {
        Console.WriteLine(string.Format("the value is:{0}; and the index is:{1}", match.Value, match.Index));
        // output: the value is:shirt; and the index is:13
    }
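The anchors that depend on, or ignore, the Multiline flag are easier to tell apart side by side. The following is a minimal sketch; the class AnchorDemo and its method names are hypothetical, while RegexOptions.Multiline is the .NET flag mentioned in the table.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, introduced only for illustration.
public static class AnchorDemo
{
    // With Multiline, ^ matches at the start of the string and after every "\n".
    public static int LinesStartingWithDigit(string text) =>
        Regex.Matches(text, @"^\d", RegexOptions.Multiline).Count;

    // \A ignores Multiline, so only the very start of the string qualifies.
    public static int DigitAtStringStart(string text) =>
        Regex.Matches(text, @"\A\d", RegexOptions.Multiline).Count;

    // \Z accepts the end of the string or the position before a final "\n".
    public static bool EndsWithByeUpperZ(string text) =>
        Regex.IsMatch(text, @"bye\Z");

    // \z accepts only the absolute end of the string.
    public static bool EndsWithByeLowerZ(string text) =>
        Regex.IsMatch(text, @"bye\z");
}

// Usage:
//   AnchorDemo.LinesStartingWithDigit("1 apple\nbanana\n2 cherries")  returns 2
//   AnchorDemo.DigitAtStringStart("1 apple\nbanana\n2 cherries")      returns 1
//   AnchorDemo.EndsWithByeUpperZ("good bye\n")                        returns true
//   AnchorDemo.EndsWithByeLowerZ("good bye\n")                        returns false
```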


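Alternation

A character set lets us list several characters in square brackets and match any one of them: the pattern enumerates several cases, and text that fits any one case is matched. Is there a similar mechanism at the level of whole patterns, where one regex holds several patterns and text satisfying any one of them is matched? There is: alternation, which plays the role of a logical OR. Combined with grouping, it lets us split a complex regex into several simpler sub-expressions.

Character	Matches	Example
|	Alternation: matches either the pattern before it or the pattern after it	cat|mouse matches cat or mouse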

    string input1 = "color: blue, grey, gray, white, black";
    Regex reg1 = new Regex(@"(grey)|(gray)");  // matches grey and gray

    var matchs1 = reg1.Matches(input1);
    foreach (Match match in matchs1)
    {
        Console.WriteLine(match.Value);  // output: grey, then gray
    }


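Matching special characters

So far we have covered character sets, shorthand character sets, position anchors, quantifiers, and alternation. The symbols they use all carry a special meaning inside a regular expression. What if we need to match those characters themselves? Put a backslash (\) in front of the special character. The commonly used escape sequences are listed below.

Character	Matches	Example
\\	The character \
\.	The character .
\*	The character *
\+	The character +
\?	The character ?
\|	The character |
\(	The character (
\)	The character )
\{	The character {
\}	The character }
\^	The character ^
\$	The character $
\n	A newline
\r	A carriage return
\t	A tab
\f	A form feed
\nnn	The ASCII character given by a three-digit octal number	\103 matches an uppercase C
\xnn	The ASCII character given by a two-digit hexadecimal number	\x43 matches C
\unnnn	The Unicode character given by a four-digit hexadecimal number	\u0043 matches C
\cV	A control character	\cV matches Ctrl+V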

    string input1 = "2.5+1.5=4";
    // . and + are special characters in a regex; to match them literally, put a backslash in front
    Regex reg1 = new Regex(@"2\.5\+1\.5=4");

    var matchs1 = reg1.Matches(input1);
    foreach (Match match in matchs1)
    {
        Console.WriteLine(match.Value);  // output: 2.5+1.5=4
    }
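When the text to match comes from a variable rather than being typed by hand, escaping each special character manually is error-prone. .NET provides Regex.Escape for this purpose. The sketch below wraps it in a hypothetical EscapeDemo class (the class and method names are made up for this example).

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, introduced only for illustration.
public static class EscapeDemo
{
    // Regex.Escape puts a backslash in front of regex metacharacters
    // so that the resulting pattern matches the text literally.
    public static string ToLiteralPattern(string text) =>
        Regex.Escape(text);

    // Searches for the literal text, whatever special characters it contains.
    public static bool ContainsLiteral(string input, string literal) =>
        Regex.IsMatch(input, Regex.Escape(literal));
}

// Usage:
//   EscapeDemo.ToLiteralPattern("2.5+1.5=4")             returns 2\.5\+1\.5=4
//   EscapeDemo.ContainsLiteral("2.5+1.5=4", "2.5+1.5=4") returns true
//   EscapeDemo.ContainsLiteral("2x5+1y5=4", "2.5+1.5=4") returns false
```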


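Groups, backreferences, and non-capturing groups

Parentheses enclose part of a regular expression so that it can be treated as a unit; the expression between the parentheses is called a group. Quantifiers and alternation can be applied to a group.

1 Example: public void Set, public void SetValue

In the regex Set(Value)?, (Value) is a group and the quantifier ? applies to the whole group, so the regex matches either Set or SetValue.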
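A short sketch of this first example: the method below collects every match of Set(Value)? from the sample text. The class OptionalGroupDemo and the method FindSetters are hypothetical names used only here.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, introduced only for illustration.
public static class OptionalGroupDemo
{
    // The ? applies to the whole group (Value), so both Set and SetValue match.
    // The quantifier is greedy, so SetValue is matched in full when present.
    public static string[] FindSetters(string input) =>
        Regex.Matches(input, @"Set(Value)?")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();
}

// Usage:
//   OptionalGroupDemo.FindSetters("public void Set, public void SetValue")
//   returns the two strings Set and SetValue
```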

2 Example: out of sight, out of mind

Regex: "(out of) sight, \1 mind"

The regex engine stores whatever the parentheses matched as a group, which can then be referenced by index. The \1 in the expression is a backreference to the first group. In C#, the captured text is also available through Match.Groups. Note that Groups[0] is the entire match; the groups themselves start at index 1.

    string input1 = "out of sight, out of mind";
    Regex reg1 = new Regex(@"(out of) sight, \1 mind");
    if (reg1.IsMatch(input1))
    {
        Console.WriteLine(reg1.Match(input1).Value);  // output: out of sight, out of mind
    }
    Console.WriteLine(reg1.Match(input1).Groups[1].Value);  // output: out of


3 Groups can also be referenced by name. Use the syntax (?<groupname>...) to give a group a name.

Regex: "(?<Group1>out of) sight, \1 mind"

    string input2 = "out of sight, out of mind";
    // A named group still receives a number, so \1 works here; \k<Group1> is the by-name form
    Regex reg2 = new Regex(@"(?<Group1>out of) sight, \1 mind");
    if (reg2.IsMatch(input2))
    {
        Console.WriteLine(reg2.Match(input2).Value);  // output: out of sight, out of mind
    }
    Console.WriteLine(reg2.Match(input2).Groups["Group1"].Value);  // output: out of


4 Referencing a group outside the expression, for example in a replacement string: use $n for a group number, or ${name} for a group name.

Example: out of sight, out out of mind mind

Regex: "(?<Group1>[a-z]+) \1"

    string input3 = "out of sight, out out of mind mind";
    Regex reg3 = new Regex(@"(?<Group1>[a-z]+) \1");  // matches a word repeated after a single space

    if (reg3.IsMatch(input3))
    {
        Console.WriteLine(reg3.Replace(input3, "$1"));         // output: out of sight, out of mind
        Console.WriteLine(reg3.Replace(input3, "${Group1}"));  // output: out of sight, out of mind
    }


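5 Non-capturing groups: put ?: at the start of the group. Some groups exist only to delimit an alternation or a quantified unit; when we do not want to spend storage on capturing their text, we can make them non-capturing.

"(?:out of) sight"

    string input4 = "out of sight, out of mind";
    // With (?:...) there is no group 1, so \1 cannot be used inside the expression
    Regex reg4 = new Regex(@"(?:out of) sight");

    if (reg4.IsMatch(input4))
    {
        Console.WriteLine(reg4.Match(input4).Groups[1].Value);  // output: an empty string
    }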


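Character	Matches	Example
(?<groupname>exp)	Matches exp and captures the text into the group named groupname
(?:exp)	Matches exp without capturing the matched text and without assigning a group number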

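Greedy and lazy matching

Quantifiers are greedy by default: as long as the pattern allows it, they match as many characters as possible. Appending ? to a quantifier (*, +, and so on) makes it lazy (non-greedy), so that it matches as few characters as possible. This topic is closely tied to the quantifiers described earlier.

Character	Matches	Example
?	When it follows a quantifier, switches that quantifier to lazy matching

Example: out of sight, out of mind

Greedy regex: .* of    Output: out of sight, out of

Lazy regex: .*? of    Output: out of

Another example

Input: The title of cnblog is <h1>code changes the world</h1>

Goal: match the HTML tags

Regex 1: <.+>

Output of regex 1: <h1>code changes the world</h1>

Regex 2: <.+?>

Output of regex 2: <h1> and </h1>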

    string input1 = "out of sight, out of mind";
    // Greedy by default: .* first consumes the whole string, then the engine backtracks
    // one character at a time until " of" can match, which happens at the last " of"
    Regex reg1 = new Regex(@".* of");
    if (reg1.IsMatch(input1))
    {
        Console.WriteLine(reg1.Match(input1).Value);  // output: out of sight, out of
    }

    string input2 = "out of sight, out of mind";
    // A ? after the quantifier makes it lazy: the match ends at the first " of" that fits
    Regex reg2 = new Regex(@".*? of");
    if (reg2.IsMatch(input2))
    {
        Console.WriteLine(reg2.Match(input2).Value);  // output: out of
    }

    string input3 = "The title of cnblog is <h1>code changes the world</h1>";
    Regex reg3 = new Regex(@"<.+>");  // greedy: spans from the first < to the last >
    var matchs3 = reg3.Matches(input3);
    foreach (Match match in matchs3)
    {
        Console.WriteLine(match.Value);  // output: <h1>code changes the world</h1>
    }

    string input4 = "The title of cnblog is <h1>code changes the world</h1>";
    Regex reg4 = new Regex(@"<.+?>");  // lazy: each tag is matched separately
    var matchs4 = reg4.Matches(input4);
    foreach (Match match in matchs4)
    {
        Console.WriteLine(match.Value);  // output: <h1>, then </h1>
    }


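Backtracking and non-backtracking

The construct (?>...) declares a non-backtracking group. Because quantifiers are greedy, the engine sometimes has to backtrack in order to find a match. Consider the following example:

Example: Live for nothing, die for something

Regex (default, with backtracking): ".*thing," Output: "Live for nothing,". The greedy .* first consumes everything up to the end of the string, so the engine backtracks, giving back one character at a time. At "something" the literal thing matches but the comma after it does not, so backtracking continues until thing, matches at "nothing,".

Regex (non-backtracking): "(?>.*)thing," matches nothing. Once the group has consumed the whole string it is not allowed to give any characters back, so the overall match fails.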

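    string input1 = "Live for nothing, die for something,";  // note the trailing comma in this input

    Regex reg1 = new Regex(@"(.*)thing,");  // .* consumes everything, then backtracks to the last "thing,"
    var matchs1 = reg1.Matches(input1);
    foreach (Match match in matchs1)
    {
        Console.WriteLine(match.Value);  // output: Live for nothing, die for something,
    }

    Regex reg2 = new Regex(@"(?>.*)thing,");  // the group may not backtrack, so nothing is left for "thing,"
    var matchs2 = reg2.Matches(input1);
    foreach (Match match in matchs2)
    {
        Console.WriteLine(match.Value);  // never reached: there are no matches
    }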


Character	Matches	Example
(?>...)	Matches the expression inside the group without backtracking into it

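Lookahead and lookbehind

These constructs match a pattern while asserting what must, or must not, come before or after it. Like position anchors, the assertions consume no characters themselves.

Character	Matches	Example
(?=exp)	Positive lookahead: the pattern on the left must be immediately followed by exp; the assertion is not part of the match result
(?!exp)	Negative lookahead: the pattern on the left must not be immediately followed by exp; the assertion is not part of the match result
(?<=exp)	Positive lookbehind: the pattern on the right must be immediately preceded by exp; the assertion is not part of the match result
(?<!exp)	Negative lookbehind: the pattern on the right must not be immediately preceded by exp; the assertion is not part of the match result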

    string input1 = "hello 1024 world 8080 bye";

    Regex reg1 = new Regex(@"\d{4}(?= world)");  // four digits followed by " world"
    if (reg1.IsMatch(input1))
    {
        Console.WriteLine(reg1.Match(input1).Value);  // output: 1024
    }

    Regex reg2 = new Regex(@"\d{4}(?! world)");  // four digits not followed by " world"
    if (reg2.IsMatch(input1))
    {
        Console.WriteLine(reg2.Match(input1).Value);  // output: 8080
    }

    Regex reg3 = new Regex(@"(?<=world )\d{4}");  // four digits preceded by "world "
    if (reg3.IsMatch(input1))
    {
        Console.WriteLine(reg3.Match(input1).Value);  // output: 8080
    }

    Regex reg4 = new Regex(@"(?<!world )\d{4}");  // four digits not preceded by "world "
    if (reg4.IsMatch(input1))
    {
        Console.WriteLine(reg4.Match(input1).Value);  // output: 1024
    }
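Because lookahead and lookbehind consume no characters, a regex built only from them matches positions, and Regex.Replace can insert text at those positions. The sketch below uses this to add thousands separators. It assumes the input consists only of digits; the class LookaroundDemo and the method AddThousandsSeparators are hypothetical names for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, introduced only for illustration.
public static class LookaroundDemo
{
    // Inserts "," at every position that
    //   (?<=\d)         has a digit directly before it, and
    //   (?=(\d{3})+\z)  has one or more complete three-digit groups after it,
    //                   running to the very end of the string.
    // Assumes the input contains only digits.
    public static string AddThousandsSeparators(string digits) =>
        Regex.Replace(digits, @"(?<=\d)(?=(\d{3})+\z)", ",");
}

// Usage:
//   LookaroundDemo.AddThousandsSeparators("1234567")  returns 1,234,567
//   LookaroundDemo.AddThousandsSeparators("1000")     returns 1,000
//   LookaroundDemo.AddThousandsSeparators("123")      returns 123
```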