Regular Expression

metamong 2023. 1. 11.

💚 A regular expression is "a formal language that uses pattern matching to provide powerful ways to search for, split, and replace strings that follow a specific rule."

 

💚 In Python, import the re library:

import re

1. re.search() & re.match() & re.fullmatch()

💚 re.match() checks for a match ONLY at the beginning of the string, whereas re.search() checks for a match anywhere in the string.

 

※ re.match matches from the start of the string; re.search matches any part of the string ※

 

💚 re.fullmatch() checks whether the entire string matches.

 

① re.search(): Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern.

 

② re.match(): If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match. Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line. If you want to locate a match anywhere in string, use search() instead.

 

③ re.fullmatch(): If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

 

ex)

import re
print(re.match("c", "abcdef"))    # No match
print(re.search("c", "abcdef"))   # Match returns the match object
print(re.fullmatch("p.*n", "python")) # Match returns the match object
print(re.fullmatch("r.*n", "python")) # No match

print(re.match('Hello', 'Hello World!')) #<re.Match object; span=(0, 5), match='Hello'>
print(re.search('World', 'Hello World!')) #<re.Match object; span=(6, 11), match='World'>
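💚 A small sketch to check the two notes quoted above: the MULTILINE behavior of re.match() and the difference between None and a zero-length match. The sample string `text` is made up for illustration.

```python
import re

text = "first line\nsecond line"

# re.match() anchors at the start of the whole string, even with re.MULTILINE.
print(re.match(r"second", text, re.MULTILINE))  # None

# re.search() with ^ and re.MULTILINE can match at the start of any line.
m = re.search(r"^second", text, re.MULTILINE)
print(m.span())  # (11, 17)

# A zero-length match is still a match object, not None.
empty = re.match(r"x*", "abc")
print(empty is not None, empty.group() == "")  # True True
```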

2. regular expressions

💚 RE syntax 💚

Metacharacter   Description                                                          Usage
^               Start of the string                                                  ^abc: a pattern that starts with abc
$               End of the string                                                    xyz$: a pattern that ends with xyz
[chars]         Exactly one of the listed characters (a character set)               [Ww]orld: 'World' or 'world'
[^chars]        Any one character that is NOT in the set                             [^aeiou]: any character that is not a lowercase vowel
|               Either of two patterns (OR)                                          a|b: a or b
?               Preceding pattern appears zero or one time                           a?: no a, or exactly one a
+               Preceding pattern appears one or more times                          a+: one or more a
*               Preceding pattern appears zero or more times                         a*: no a, or one or more a
pattern{n}      Preceding pattern repeats exactly n times                            a{2}: a appears twice in a row
pattern{n,m}    Preceding pattern repeats at least n and at most m times             a{3,5}: a appears 3, 4, or 5 times
                (n or m may be omitted)
\d              Any digit (same as [0-9] when re.ASCII is set)                       \d\d\d: three digits
\D              Any character that is not a digit                                    [^0-9]
\w              Word character: letters, digits, underscore                          \w\w\w: three word characters
                (same as [a-zA-Z0-9_] when re.ASCII is set)
\W              Any character that is not a word character                           [^a-zA-Z0-9_]
\s              Whitespace, same as [ \t\n\r\f\v] for ASCII text                     \s\s: two whitespace characters
.               Any character except a newline (\n)                                  .{3}: any three characters
[0-9]           Any digit from 0 to 9                                                [0-9]+: e.g. 0, 123, 14235
[A-Z]           Any uppercase letter from A to Z                                     [A-Z]+: e.g. ABC, AEEEE
[a-z]           Any lowercase letter from a to z                                     [a-z]+: e.g. abc, abbc
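💚 A quick check of a few rows from the table, as a rough sketch rather than a full reference. It looks at the omitted bounds in {n,m}, whether \s covers a plain space, and how \w behaves with and without the re.ASCII flag for str patterns. The sample strings are made up.

```python
import re

# {n,} and {,m}: either bound may be omitted.
print(re.fullmatch(r"a{2,}", "aaaa") is not None)  # True  (2 or more)
print(re.fullmatch(r"a{,2}", "aaa") is not None)   # False (at most 2)

# \s also matches a plain space, not only \t, \n, \r, \f.
print(re.findall(r"\s", "a b\tc\n"))  # [' ', '\t', '\n']

# For str patterns, \w is Unicode-aware unless re.ASCII is set.
print(re.fullmatch(r"\w+", "홍길동") is not None)            # True
print(re.fullmatch(r"\w+", "홍길동", re.ASCII) is not None)  # False
```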

3. RE examples

s = 'Hello World!'
t = 'hello'
print(re.search('^Hello', s))
#<re.Match object; span=(0, 5), match='Hello'>

print(re.search('World!$', s))
#<re.Match object; span=(6, 12), match='World!'>

print(re.match('hello|world|python', t))
#<re.Match object; span=(0, 5), match='hello'>

print(re.match('[0-9]*', 'abc'))
#<re.Match object; span=(0, 0), match=''>

print(re.match('[0-9]+', '123'))
#<re.Match object; span=(0, 3), match='123'>

print(re.match('a*b', 'b'))
#<re.Match object; span=(0, 1), match='b'>

print(re.match('a*b', 'aab'))
#<re.Match object; span=(0, 3), match='aab'>

print(re.match('a+b', 'aab'))
#<re.Match object; span=(0, 3), match='aab'>

print(re.match('abc?d', 'abd'))
#<re.Match object; span=(0, 3), match='abd'>

print(re.match('ab[0-9]?d', 'ab3d'))
#<re.Match object; span=(0, 4), match='ab3d'>

print(re.match('ab.d', 'abxd'))
#<re.Match object; span=(0, 4), match='abxd'>

print(re.match('h{3}', 'hhhello'))
#<re.Match object; span=(0, 3), match='hhh'>

print(re.match('(hello){3}', 'hellohellohelloworld'))
#<re.Match object; span=(0, 15), match='hellohellohello'>

print(re.match('[0-9]{3}-[0-9]{4}-[0-9]{4}', '010-0000-0000'))
#<re.Match object; span=(0, 13), match='010-0000-0000'>

print(re.match('[0-9]{2,3}-[0-9]{3,4}-[0-9]{4}','02-000-0000'))
#<re.Match object; span=(0, 11), match='02-000-0000'>

print(re.match('[a-zA-Z]+','Hello1234')) # English letters
#<re.Match object; span=(0, 5), match='Hello'>

print(re.match('[가-힣]+', '홍길동')) # Hangul
#<re.Match object; span=(0, 3), match='홍길동'>

print(re.match('[^A-Z]+', 'hello')) # ^ inside [] means "except"
#<re.Match object; span=(0, 5), match='hello'>

print(re.search(r'\*+', '1**2')) # escape special characters with a backslash (\); raw strings keep the backslash intact
#<re.Match object; span=(1, 3), match='**'>

print(re.match('[$()a-zA-Z0-9]','$(document')) # one character out of the set in []
#<re.Match object; span=(0, 1), match='$'>

print(re.match(r'\d+', '1234'))
#<re.Match object; span=(0, 4), match='1234'>

print(re.match(r'\D+', 'Hello'))
#<re.Match object; span=(0, 5), match='Hello'>

print(re.match(r'\w+', 'Hello_1234'))
#<re.Match object; span=(0, 10), match='Hello_1234'>

print(re.match(r'\W+', '(:)'))
#<re.Match object; span=(0, 3), match='(:)'>

print(re.match(r'[a-zA-Z0-9\s]+', 'Hello 1234'))
#<re.Match object; span=(0, 10), match='Hello 1234'>
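💚 One thing the phone-number examples do not show: re.match() only anchors the start, so trailing characters are ignored. A minimal sketch, assuming you want to validate the whole string; `PHONE` and `is_phone_number` are hypothetical names, not part of the re module.

```python
import re

# Hypothetical helper; the pattern mirrors the phone-number examples above.
PHONE = re.compile(r"[0-9]{2,3}-[0-9]{3,4}-[0-9]{4}")

def is_phone_number(s: str) -> bool:
    # fullmatch() requires the entire string to fit the pattern.
    return PHONE.fullmatch(s) is not None

print(is_phone_number("02-000-0000"))       # True
print(is_phone_number("010-0000-0000"))     # True
print(is_phone_number("010-0000-0000abc"))  # False

# match() accepts the same input because it ignores the tail.
print(PHONE.match("010-0000-0000abc") is not None)  # True
```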

4. re.compile() & group() & re.findall()

💚 With re.compile(), a frequently used pattern can be stored as a compiled object and then applied as many times as needed with match() or search().

: using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

p = re.compile('[0-9]+')
print(p.match('1234')) #<re.Match object; span=(0, 4), match='1234'>
print(p.search('hello1')) #<re.Match object; span=(5, 6), match='1'>
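💚 A slightly longer sketch of the same idea: compile once (here with a flag) and reuse the object over several inputs. The names `word`, `lines`, and `found` are made up for illustration.

```python
import re

# Compile once; flags are passed at compile time.
word = re.compile(r"[a-z]+", re.IGNORECASE)

lines = ["Hello World", "1234", "abc 42"]

# Reuse the same compiled object for every line.
found = [word.findall(line) for line in lines]
print(found)  # [['Hello', 'World'], [], ['abc']]

# The compiled object has the same methods as the module-level functions.
print(word.match("Hello").group())  # Hello
```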

 

💚 Calling group() on the result of match() or search() returns the matched text, so it can be stored separately and used later.

text = 'Hello World'
x = re.match('^[A-Za-z]{5}', text).group()  # group() returns the matched text
print(x)  # Hello
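💚 group() raises AttributeError if it is called on None, so a check before calling it is a reasonable habit. A minimal sketch that also shows numbered groups; the two-word pattern is an assumption chosen for this example.

```python
import re

text = "Hello World"
m = re.match(r"([A-Za-z]+) ([A-Za-z]+)", text)

if m is not None:       # match() returns None when nothing matches
    print(m.group())    # Hello World  (the whole match)
    print(m.group(1))   # Hello        (first parenthesized group)
    print(m.group(2))   # World        (second parenthesized group)
    print(m.groups())   # ('Hello', 'World')

no_match = re.match(r"[0-9]+", text)
print(no_match)         # None, so no_match.group() would fail
```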

 

💚 re.findall() returns every string that matches the pattern.

 

※ re.findall(): Return all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result. The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.

print(re.findall('[A-Za-z]{2,}','Hello World') ) #['Hello', 'World']
print(re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')) #[('width', '20'), ('height', '10')]
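💚 The group rules quoted above are easier to see side by side. A small sketch that reuses the same input string with zero, one, two, and non-capturing groups:

```python
import re

s = "set width=20 and height=10"

# No groups: list of whole matches.
print(re.findall(r"\w+=\d+", s))        # ['width=20', 'height=10']

# Exactly one group: list of that group's text.
print(re.findall(r"\w+=(\d+)", s))      # ['20', '10']

# Multiple groups: list of tuples.
print(re.findall(r"(\w+)=(\d+)", s))    # [('width', '20'), ('height', '10')]

# A non-capturing group (?:...) does not change the result shape.
print(re.findall(r"(?:\w+)=(\d+)", s))  # ['20', '10']
```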

5. BOJ problem collection

★ Regular Expression: upper-intermediate


* re documentation: https://docs.python.org/3/library/re.html

* 2023 DAS

 
