Algorithm Problem 007: Substring with Concatenation of All Words
Problem
Problem Source: 30. Substring with Concatenation of All Words
Problem Description:
You are given a string
s
and an array of stringswords
. All the strings ofwords
are of the same length.A concatenated string is a string that exactly contains all the strings of any permutation of
words
concatenated.
- For example, if
words = ["ab","cd","ef"]
, then"abcdef"
,"abefcd"
,"cdabef"
,"cdefab"
,"efabcd"
, and"efcdab"
are all concatenated strings."acdbef"
is not a concatenated string because it is not the concatenation of any permutation ofwords
.Return an array of the starting indices of all the concatenated substrings in
s
. You can return the answer in any order.
Example 1:
1
2
3
4
5 Input: s = "barfoothefoobarman", words = ["foo","bar"]
Output: [0,9]
Explanation:
The substring starting at 0 is "barfoo". It is the concatenation of ["bar","foo"] which is a permutation of words.
The substring starting at 9 is "foobar". It is the concatenation of ["foo","bar"] which is a permutation of words.Example 2:
1
2
3
4 Input: s = "wordgoodgoodgoodbestword", words = ["word","good","best","word"]
Output: []
Explanation:
There is no concatenated substring.Example 3:
1
2
3
4
5
6 Input: s = "barfoofoobarthefoobarman", words = ["bar","foo","the"]
Output: [6,9,12]
Explanation:
The substring starting at 6 is "foobarthe". It is the concatenation of ["foo","bar","the"].
The substring starting at 9 is "barthefoo". It is the concatenation of ["bar","the","foo"].
The substring starting at 12 is "thefoobar". It is the concatenation of ["the","foo","bar"].Constraints:
1 <= s.length <= 10^4
1 <= words.length <= 5000
1 <= words[i].length <= 30
s
andwords[i]
consist of lowercase English letters.
Approach
For convenience, let n = words.length
and n2 = words[i].length
.
A brute force solution would be to iterate over every index i
in s
, and then check whether the substring starting at index i
with length n * n2
(s[i : i + n * n2)
) is a concatenated substring. This approach is similar to brute force string matching. Clearly, for this hard problem, the time complexity of such a method is unacceptable.
Therefore, we need to optimize the approach. Notice the condition that “All the strings of words
are of the same length”. This is a very special constraint. Assume we have examined a substring s1 = s[i : i + n * n2)
and recorded how many words it deviates from forming a valid concatenated substring; call this state state1
. Then, when considering s2 = s[i + n2 : i + n * (n2 + 1))
, s2
is derived from s1
by removing the substring s[i : i + n2)
and adding the substring s[i + n * n2: i + n * (n2 + 1))
— both of which have the same length as a word. Thus, we can easily update state1
using these two changing substrings to compute state2
, which tells us how far s2
is from being a concatenated substring.
Let’s consider an example. Suppose words = ["ab","cd","ef"]
and s = "ggcdabef"
. In this case, n = 3
and n2 = 2
.
Let s1 = s[0: 6) = "ggcdab"
. At this point, s1
has one extra "gg"
and is missing "ef"
relative to a valid concatenated substring — this is our state1
. When we examine s2 = s[0 + 2, 6 + 2) = "cdabef"
, s2
is obtained by removing "gg"
and adding "ef"
. By applying the changes from s1
to s2
, we obtain state2
: s2
satisfies the condition of being a concatenated substring — it uses every word in words
exactly once.
Why do we shift by n2
from s1
to s2
? Because the concatenated substring is a permutation of words
, and each word has length n2
. Without shifting by n2
, we cannot reuse the results.
Returning to the problem itself, our goal is to find all starting indices of the concatenated substrings. How many initial states are there in total? There are exactly n2
initial states: s[0: n * n2)
, s[1: n * n2 + 1)
, …, s[n2 - 1: n * (n2 + 1) - 1)
. Every subsequent substring of length n * n2
can be derived from one of these initial states.
The remaining challenge is how to efficiently record the state and implement state transitions. Let’s combine the explanation with the code.
Code
Considering that words
might contain duplicate entries, we use a Map rather than a Set for counting:
1 | unordered_map<string, int> mp; |
Then, we define two vector
. Notice their constructors:
1 | vector<int> v1(n2, mp.size()); |
Each hash table in v2
records the difference between the count required for a valid concatenated substring (i.e., the initial mp
) and the count of words found in the current substring of length n * n2
.
v1
records the number of entries in the corresponding hash table that are not 0
. In other words, if an element in v1
is 0
, then the corresponding substring of length n * n2
is a valid concatenated substring. The purpose of v1
is to quickly check whether the substring meets the condition for a concatenated substring.
Let’s take the construction of the initial state as an example to see how these two vector
work:
1 | for (int i = 0; i < n2; ++i) { |
Next, we should continuously perform state transitions from the initial states. The process is similar to constructing the initial state, but with the additional step of adding a word:
1 | for (int i = n2; i + n2 * n <= s.size(); ++i) { // Note the loop termination condition |
Thus, the final code is:
1 | class Solution { |
Time Complexity: O(n * n2 * n2 + s.size() * n2)
.
Space Complexity: O(n * n2 * n2)
, mainly due to the space required for v2
.
Note that each word has a length of n2
, so the space cost per word is O(n2)
, and extracting a word from s
takes O(n2)
time.