Classic DP: Regular Expression Matching
Note
Now all the plugins has supported English. I'm still improving the website...
This article will resolve
LeetCode | Difficulty |
---|---|
10. Regular Expression Matching | 🔴 |
Prerequisites
Before reading this article, you should learn:
Regular expressions are a very powerful tool. In this article, we will take a closer look at the underlying principles of regular expressions. LeetCode problem #10 "Regular Expression Matching" requires us to implement a simple regular expression matching algorithm, including the wildcard characters '.' and '*'.
These two wildcards are the most commonly used. The dot '.' can match any single character, and the asterisk '*' allows the preceding character to repeat any number of times (including 0 times).
For example, the pattern string ".a*b"
can match the text "zaaab"
as well as "cb"
; the pattern string "a..b"
can match the text "amnb"
; and the pattern string ".*"
is quite powerful, as it can match any text.
The problem will give us two input strings s
and p
, where s
represents the text and p
represents the pattern string. Your task is to determine if the pattern string p
can match the text s
. We can assume that the pattern string only contains lowercase letters and the aforementioned wildcards and is always valid, meaning it won't include invalid patterns like *a
or b**
.
The function signature is as follows:
boolean isMatch(string s, string p);
bool isMatch(string s, string p);
def isMatch(s: str, p: str) -> bool:
func isMatch(s string, p string) bool
var isMatch = function(s, p) {}
For the regular expression we are going to implement, what is the difficulty?
The dot wildcard (.
) is actually easy to implement. Any character in s
will match blindly with the .
wildcard. The main challenge is the asterisk wildcard (*
). When encountering *
, the preceding character can be repeated once, multiple times, or not at all. How should we handle this?
The answer to this problem is simple: enumerate all possible scenarios. If any one of them results in a successful match, we consider p
to match s
. Whenever we need to enumerate possibilities for two strings, we should instinctively think of using dynamic programming techniques.
I. Thought Analysis
Let's visualize the matching process between s
and p
. Imagine two pointers, i
and j
, moving along s
and p
respectively. If both pointers can reach the end of their respective strings, the match is successful; otherwise, it fails.
If we ignore the *
wildcard, the only thing we can do when facing two characters s[i]
and p[j]
is to check if they match:
boolean isMatch(String s, String p) {
int i = 0, j = 0;
while (i < s.length() && j < p.length()) {
// the '.' wildcard is a versatile match
if (s.charAt(i) == p.charAt(j) || p.charAt(j) == '.') {
// match, continue to match s[i+1..] and p[j+1..]
i++; j++;
} else {
// not a match
return false;
}
}
return i == j;
}
bool isMatch(string s, string p) {
int i = 0, j = 0;
while (i < s.size() && j < p.size()) {
// the '.' wildcard is versatile
if (s[i] == p[j] || p[j] == '.') {
// match, continue to match s[i+1..] and p[j+1..]
i++; j++;
} else {
// not a match
return false;
}
}
return i == j;
}
def isMatch(s: str, p: str) -> bool:
i, j = 0, 0
while i < len(s) and j < len(p):
# the '.' wildcard is versatile
if s[i] == p[j] or p[j] == '.':
# match, then continue to match s[i+1..] and p[j+1..]
i += 1
j += 1
else:
# not a match
return False
return i == j
func isMatch(s string, p string) bool {
i, j := 0, 0
for i < len(s) && j < len(p) {
// the '.' wildcard is versatile
if s[i] == p[j] || p[j] == '.' {
// match, continue to match s[i+1..] and p[j+1..]
i++
j++
} else {
// not a match
return false
}
}
return i == j
}
var isMatch = function(s, p) {
let i = 0, j = 0;
while (i < s.length && j < p.length) {
// The '.' wildcard is versatile
if (s[i] === p[j] || p[j] === '.') {
// Match, then match s[i+1..] and p[j+1..]
i++; j++;
} else {
// No match
return false;
}
}
return i === j;
};
Consider this: if we introduce the *
wildcard, the situation becomes slightly more complex. However, it's not hard to understand if we analyze it case by case.
When p[j + 1]
is the *
wildcard, we discuss the following cases:
- If
s[i] == p[j]
, there are two scenarios:
1.1 p[j]
might match multiple characters. For example, if s = "aaa"
and p = "a*"
, then p[0]
will match 3 characters "a"
through the *
.
1.2 p[j]
might also match 0 characters. For example, if s = "aa"
and p = "a*aa"
, since the subsequent characters can match s
, p[0]
can only match 0 times.
- If
s[i] != p[j]
, there is only one scenario:
p[j]
can only match 0 times, and then we check if the next character can match s[i]
. For instance, if s = "aa"
and p = "b*aa"
, then p[0]
can only match 0 times.
In summary, we can modify the previous code to handle the *
wildcard as follows:
if (s.charAt(i) == p.charAt(j) || p.charAt(j) == '.') {
// match
if (j < p.length() - 1 && p.charAt(j + 1) == '*') {
// there is a * wildcard, which can match 0 or more times
} else {
// no * wildcard, match exactly once
i++; j++;
}
} else {
// do not match
if (j < p.length() - 1 && p.charAt(j + 1) == '*') {
// there is a * wildcard, can only match 0 times
} else {
// no * wildcard, the match cannot proceed
return false;
}
}
if (s[i] == p[j] || p[j] == '.') {
// match
if (j < p.size()-1 && p[j+1] == '*') {
// there is a * wildcard, which can match 0 times or multiple times
} else {
// no * wildcard, match exactly 1 time
i++; j++;
}
} else {
// not match
if (j < p.size()-1 && p[j+1] == '*') {
// there is a * wildcard, can only match 0 times
} else {
// no * wildcard, the match cannot proceed
return false;
}
}
if s[i] == p[j] or p[j] == '.':
# match
if j < len(p) - 1 and p[j + 1] == '*':
# there is a * wildcard, which can match 0 times or multiple times
pass
else:
# no * wildcard, match exactly 1 time
i += 1
j += 1
else:
# not a match
if j < len(p) - 1 and p[j + 1] == '*':
# there is a * wildcard, can only match 0 times
pass
else:
# no * wildcard, the match cannot proceed
return False
if s[i] == p[j] || p[j] == '.' {
// match
if j < len(p) - 1 && p[j + 1] == '*' {
// has * wildcard, can match 0 times or multiple times
} else {
// no * wildcard, match exactly 1 time
i++; j++;
}
} else {
// not match
if j < len(p) - 1 && p[j + 1] == '*' {
// has * wildcard, can only match 0 times
} else {
// no * wildcard, the match cannot proceed
return false
}
}
var exampleFunc = function(s, p) {
if (s.charAt(i) == p.charAt(j) || p.charAt(j) == '.') {
// match
if (j < p.length - 1 && p.charAt(j + 1) == '*') {
// there is a * wildcard, can match 0 times or multiple times
} else {
// no * wildcard, match exactly 1 time
i++; j++;
}
} else {
// no match
if (j < p.length - 1 && p.charAt(j + 1) == '*') {
// there is a * wildcard, can only match 0 times
} else {
// no * wildcard, the match cannot proceed
return false;
}
}
}
The overall approach is now clear, but the question arises: when encountering the *
wildcard, should it match 0 times or multiple times? And how many times constitutes "multiple"?
You see, this is a problem of making "choices," where you need to enumerate all possible choices to arrive at a result. The core of the dynamic programming algorithm is "state" and "choice." "State" simply refers to the positions of the two pointers i
and j
, and "choice" is how many characters p[j]
decides to match.
II. Dynamic Programming Solution
Based on the "state," we can define a dp
function:
boolean dp(String s, int i, String p, int j);