Algorithm Problem 007: Substring with Concatenation of All Words

Problem

Problem Source: 30. Substring with Concatenation of All Words

Problem Description:

You are given a string s and an array of strings words. All the strings of words are of the same length.

A concatenated string is a string that exactly contains all the strings of any permutation of words concatenated.

  • For example, if words = ["ab","cd","ef"], then "abcdef", "abefcd", "cdabef", "cdefab", "efabcd", and "efcdab" are all concatenated strings. "acdbef" is not a concatenated string because it is not the concatenation of any permutation of words.

Return an array of the starting indices of all the concatenated substrings in s. You can return the answer in any order.

Example 1:

1
2
3
4
5
Input: s = "barfoothefoobarman", words = ["foo","bar"]
Output: [0,9]
Explanation:
The substring starting at 0 is "barfoo". It is the concatenation of ["bar","foo"] which is a permutation of words.
The substring starting at 9 is "foobar". It is the concatenation of ["foo","bar"] which is a permutation of words.

Example 2:

1
2
3
4
Input: s = "wordgoodgoodgoodbestword", words = ["word","good","best","word"]
Output: []
Explanation:
There is no concatenated substring.

Example 3:

1
2
3
4
5
6
Input: s = "barfoofoobarthefoobarman", words = ["bar","foo","the"]
Output: [6,9,12]
Explanation:
The substring starting at 6 is "foobarthe". It is the concatenation of ["foo","bar","the"].
The substring starting at 9 is "barthefoo". It is the concatenation of ["bar","the","foo"].
The substring starting at 12 is "thefoobar". It is the concatenation of ["the","foo","bar"].

Constraints:

  • 1 <= s.length <= 10^4
  • 1 <= words.length <= 5000
  • 1 <= words[i].length <= 30
  • s and words[i] consist of lowercase English letters.

Approach

For convenience, let n = words.length and n2 = words[i].length.

A brute force solution would be to iterate over every index i in s, and then check whether the substring starting at index i with length n * n2 (s[i : i + n * n2)) is a concatenated substring. This approach is similar to brute force string matching. Clearly, for this hard problem, the time complexity of such a method is unacceptable.

Therefore, we need to optimize the approach. Notice the condition that “All the strings of words are of the same length”. This is a very special constraint. Assume we have examined a substring s1 = s[i : i + n * n2) and recorded how many words it deviates from forming a valid concatenated substring; call this state state1. Then, when considering s2 = s[i + n2 : i + n * (n2 + 1)), s2 is derived from s1 by removing the substring s[i : i + n2) and adding the substring s[i + n * n2: i + n * (n2 + 1)) — both of which have the same length as a word. Thus, we can easily update state1 using these two changing substrings to compute state2, which tells us how far s2 is from being a concatenated substring.

Let’s consider an example. Suppose words = ["ab","cd","ef"] and s = "ggcdabef". In this case, n = 3 and n2 = 2.

Let s1 = s[0: 6) = "ggcdab". At this point, s1 has one extra "gg" and is missing "ef" relative to a valid concatenated substring — this is our state1. When we examine s2 = s[0 + 2, 6 + 2) = "cdabef", s2 is obtained by removing "gg" and adding "ef". By applying the changes from s1 to s2, we obtain state2: s2 satisfies the condition of being a concatenated substring — it uses every word in words exactly once.

Why do we shift by n2 from s1 to s2? Because the concatenated substring is a permutation of words, and each word has length n2. Without shifting by n2, we cannot reuse the results.

Returning to the problem itself, our goal is to find all starting indices of the concatenated substrings. How many initial states are there in total? There are exactly n2 initial states: s[0: n * n2), s[1: n * n2 + 1), …, s[n2 - 1: n * (n2 + 1) - 1). Every subsequent substring of length n * n2 can be derived from one of these initial states.

The remaining challenge is how to efficiently record the state and implement state transitions. Let’s combine the explanation with the code.

Code

Considering that words might contain duplicate entries, we use a Map rather than a Set for counting:

1
2
3
4
unordered_map<string, int> mp;
for (string & w : words) {
mp[w]++;
}

Then, we define two vector. Notice their constructors:

1
2
vector<int> v1(n2, mp.size());
vector<unordered_map<string, int>> v2(n2, mp);

Each hash table in v2 records the difference between the count required for a valid concatenated substring (i.e., the initial mp) and the count of words found in the current substring of length n * n2.

v1 records the number of entries in the corresponding hash table that are not 0. In other words, if an element in v1 is 0, then the corresponding substring of length n * n2 is a valid concatenated substring. The purpose of v1 is to quickly check whether the substring meets the condition for a concatenated substring.

Let’s take the construction of the initial state as an example to see how these two vector work:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
for (int i = 0; i < n2; ++i) {
// Check the state of the substring s[i : i + n * n2)
for (int j = 0; j < n; ++j) {
string s2 = s.substr(i + n2 * j, n2); // Obtain the substring s2 of length n2
int v = --v2[i][s2]; // Update the count
// If v2[i][s2] > 0, we are **missing** v2[i][s2] occurrences of s2 to form a valid concatenated substring.
// If v2[i][s2] < 0, there are **extra** v2[i][s2] occurrences of s2 to form a valid concatenated substring.
// During the construction of the initial state, v2[i][s2] only decreases.
if (v == 0) v1[i]--; // So when v2[i][s2] becomes 0, decrement v1[i]
else if (v == -1) v1[i]++; // But when v2[i][s2] first drops below 0, increment v1[i] as this word becomes extra
}

if (v1[i] == 0) { // At this point, the condition for a concatenated substring is met, record the index.
ans.push_back(i);
}
}

Next, we should continuously perform state transitions from the initial states. The process is similar to constructing the initial state, but with the additional step of adding a word:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
for (int i = n2; i + n2 * n <= s.size(); ++i) { // Note the loop termination condition
// Note the definition of v2; here, we perform an increment operation.
string oldStr = s.substr(i - n2, n2);
int v = ++v2[i % n2][oldStr]; // Don't forget to take modulo n2
if (v == 0) v1[i % n2]--;
else if (v == 1) v1[i % n2]++;

string newStr = s.substr(i + n2 * (n - 1), n2);
v = --v2[i % n2][newStr];
if (v == 0) v1[i % n2]--;
else if (v == -1) v1[i % n2]++;

if (v1[i % n2] == 0) { // At this point, the condition for a concatenated substring is met, record the index.
ans.push_back(i);
}
}

Thus, the final code is:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
class Solution {
public:
vector<int> findSubstring(string s, vector<string>& words) {
int n = words.size(), n2 = words[0].size();

unordered_map<string, int> mp;
for (string & w : words) {
mp[w]++;
}

vector<int> v1(n2, mp.size());
vector<unordered_map<string, int>> v2(n2, mp);

vector<int> ans;

if (s.size() < n * n2) { // The only corner case
return ans;
}

for (int i = 0; i < n2; ++i) {
for (int j = 0; j < n; ++j) {
string s2 = s.substr(i + n2 * j, n2);
int v = --v2[i][s2];
if (v == 0) v1[i]--;
else if (v == -1) v1[i]++;
}

if (v1[i] == 0) {
ans.push_back(i);
}
}

for (int i = n2; i + n2 * n <= s.size(); ++i) {
string oldStr = s.substr(i - n2, n2);
int v = ++v2[i % n2][oldStr];
if (v == 0) v1[i % n2]--;
else if (v == 1) v1[i % n2]++;

string newStr = s.substr(i + n2 * (n - 1), n2);
v = --v2[i % n2][newStr];
if (v == 0) v1[i % n2]--;
else if (v == -1) v1[i % n2]++;

if (v1[i % n2] == 0) {
ans.push_back(i);
}
}

return ans;
}
};

Time Complexity: O(n * n2 * n2 + s.size() * n2).

Space Complexity: O(n * n2 * n2), mainly due to the space required for v2.

Note that each word has a length of n2, so the space cost per word is O(n2), and extracting a word from s takes O(n2) time.