819. Most Common Word
Problem Description
The problem requires that you find the most frequent word from a given string, paragraph
, which is not included in a list of banned words, banned
. The string paragraph
can contain uppercase and lowercase letters. Finally, the result should be a lowercase word that occurs the most frequently in paragraph
and is not present in the banned
list. It is assured that there is at least one such word, and there is only one unique answer.
Intuition
The challenge is to accurately count word occurrences while excluding banned words and managing different cases (uppercase/lowercase). The approach involves several steps. First, normalize the paragraph by converting it to lowercase to ensure case insensitivity. Next, use regular expressions to extract words by ignoring punctuation and other non-alphabetic characters.
After the words have been extracted, use the Counter
class from the collections library to count the frequency of each word. We then iterate through the most common words while skipping those present in the set of banned words. A set is used for the banned words to achieve O(1) lookup times.
By doing this, we ensure that the first word that appears in the most_common
list and is not in the set of banned words is the answer, and we can then return this word as the most common non-banned word in the paragraph.
Solution Approach
The solution to the problem utilizes several Python features and data structures to approach the task efficiently:
-
Normalization of Text: First, we convert the entire paragraph to lowercase using the
lower()
method. This ensures that all words are counted in a case-insensitive manner. -
Word Extraction: We then use the
re.findall()
function from there
module (which stands for "regular expression") to extract all the words in the text. The regular expression'[a-z]+'
is used to match continuous sequences of lowercase letters ('a' to 'z'), effectively skipping over any punctuation or other characters. -
Word Frequency Counting: Python's
Counter
class from thecollections
module is used to count the frequency of each word. This data structure is ideal for this use case because it allows for efficient counting and retrieval of the most common elements. -
Set for Banned Words: The banned words are stored in a
set
, which allows us to check if a word is banned in constant time, thanks to the underlying hash table implementation of sets in Python. -
Finding the Most Common Non-Banned Word: We use the
most_common()
method of theCounter
class to get a list of words sorted by their frequency in descending order. Thenext()
function, in conjunction with a generator expression, iterates through this list to find and retrieve the first word that is not present in the set of banned words.
The final line return next(word for word, _ in p.most_common() if word not in s)
uses a generator expression to go through the words in the order of decreasing frequency and checks if the word is not in the banned set s
. The underscore _
is used to ignore the frequency in the tuple returned by p.most_common()
since we only care about the word itself in this context.
With the help of these tools and constructs, the solution efficiently finds the most frequent, non-banned word in the input paragraph.
Ready to land your dream job?
Unlock your dream job with a 2-minute evaluator for a personalized learning plan!
Start EvaluatorExample Walkthrough
To illustrate the proposed solution approach, imagine you have the following paragraph
and banned
list:
paragraph
: "Bob hit a ball, the hit BALL flew far after it was hit."
banned
: ["hit"]
Following the solution steps:
-
Normalization of Text: Convert the entire paragraph to lowercase.
Result: "bob hit a ball, the hit ball flew far after it was hit."
-
Word Extraction: Use a regular expression to extract all the words.
Code Example:
re.findall(r'[a-z]+', "bob hit a ball, the hit ball flew far after it was hit.")
Result: ['bob', 'hit', 'a', 'ball', 'the', 'hit', 'ball', 'flew', 'far', 'after', 'it', 'was', 'hit']
-
Word Frequency Counting: Use the
Counter
to count the frequency of each word.Code Example:
Counter(['bob', 'hit', 'a', 'ball', 'the', 'hit', 'ball', 'flew', 'far', 'after', 'it', 'was', 'hit'])
Result: Counter({'hit': 3, 'ball': 2, 'bob': 1, 'a': 1, 'the': 1, 'flew': 1, 'far': 1, 'after': 1, 'it': 1, 'was': 1})
-
Set for Banned Words: Create a set of the banned words.
Result:
{"hit"}
-
Finding the Most Common Non-Banned Word: Iterate through the list of most common words and return the first one that is not banned.
Code Example:
next(word for word, _ in Counter(['bob', 'hit', 'a', 'ball', 'the', 'hit', 'ball', 'flew', 'far', 'after', 'it', 'was', 'hit']).most_common() if word not in {"hit"})
Result: 'ball'
So, in this example, ball
would be the most frequent, non-banned word from the paragraph
. It appears 2 times and is not included in the banned
list.
Solution Implementation
1from collections import Counter
2import re
3from typing import List
4
5class Solution:
6 def mostCommonWord(self, paragraph: str, banned: List[str]) -> str:
7 # Convert the set of banned words to lowercase and create a set for fast look-up
8 banned_words_set = set(word.lower() for word in banned)
9
10 # Use regular expression to find all words (sequences of alphabetic characters) and convert them all to lowercase
11 # The '+' indicates that we're looking for one or more alphabetic characters together as a word
12 words = re.findall(r'[a-z]+', paragraph.lower())
13
14 # Create a Counter object to count the occurrences of each word in the paragraph
15 word_count = Counter(words)
16
17 # Use a generator expression to iterate over the most common words
18 # and return the first one which is not in the set of banned words
19 most_common_word = next(word for word, _ in word_count.most_common() if word not in banned_words_set)
20
21 # Return the most common non-banned word
22 return most_common_word
23
1import java.util.HashSet;
2import java.util.HashMap;
3import java.util.Map;
4import java.util.Set;
5import java.util.regex.Matcher;
6import java.util.regex.Pattern;
7
8class Solution {
9 // Compilation of pattern to match words with lowercase letters
10 private static final Pattern PATTERN = Pattern.compile("[a-z]+");
11
12 public String mostCommonWord(String paragraph, String[] banned) {
13 // Initialize a set to keep track of banned words for quick lookup
14 Set<String> bannedWords = new HashSet<>();
15 for (String word : banned) {
16 bannedWords.add(word);
17 }
18
19 // Initialize a map to count the occurrence of each word
20 Map<String, Integer> wordFrequency = new HashMap<>();
21
22 // Preprocess the paragraph to make it lowercase for case insensitivity
23 Matcher matcher = PATTERN.matcher(paragraph.toLowerCase());
24
25 // Find all matches and populate the word frequency map
26 while (matcher.find()) {
27 String currentWord = matcher.group();
28 // Skip the word if it is banned
29 if (!bannedWords.contains(currentWord)) {
30 wordFrequency.put(currentWord, wordFrequency.getOrDefault(currentWord, 0) + 1);
31 }
32 }
33
34 // Initialize variables to track the most frequent word
35 int maxFrequency = 0;
36 String mostFrequentWord = null;
37
38 // Iterate through the map entries to find the most common word
39 for (Map.Entry<String, Integer> entry : wordFrequency.entrySet()) {
40 if (entry.getValue() > maxFrequency) {
41 maxFrequency = entry.getValue();
42 mostFrequentWord = entry.getKey();
43 }
44 }
45 // Return the most common word that is not banned
46 return mostFrequentWord;
47 }
48}
49
1#include <unordered_set>
2#include <unordered_map>
3#include <cctype>
4#include <string>
5#include <vector>
6
7class Solution {
8public:
9 // Method to find the most common non-banned word in a paragraph.
10 string mostCommonWord(string paragraph, vector<string>& banned) {
11 // Create a set of banned words for quick lookup.
12 unordered_set<string> bannedWords(banned.begin(), banned.end());
13 // Use a map to count the occurrence of each word.
14 unordered_map<string, int> wordCount;
15 // Variable to store the most common word.
16 string mostCommon;
17 // Loop through the characters of the paragraph.
18 for (int i = 0, maxCount = 0, length = paragraph.size(); i < length;) {
19 // Skip non-alpha characters and continue to the next iteration.
20 if (!isalpha(paragraph[i]) && (++i > 0)) continue;
21
22 // Find the start of the next word.
23 int j = i;
24 string word;
25 // Convert the word to lowercase and store in word variable.
26 while (j < length && isalpha(paragraph[j])) {
27 word.push_back(tolower(paragraph[j]));
28 ++j;
29 }
30 // Update the starting position for the next word.
31 i = j + 1;
32 // If the word is in the banned list, skip to the next word.
33 if (bannedWords.find(word) != bannedWords.end()) continue;
34
35 // Increment the count of the word in the map.
36 ++wordCount[word];
37 // If the current word's count is greater than maxCount, update the most common.
38 if (wordCount[word] > maxCount) {
39 mostCommon = word;
40 maxCount = wordCount[word];
41 }
42 }
43 // Return the most common word after processing the entire paragraph.
44 return mostCommon;
45 }
46};
47
1function mostCommonWord(paragraph: string, banned: string[]): string {
2 // Convert the paragraph to lower case for consistent comparison.
3 const lowercaseParagraph = paragraph.toLocaleLowerCase();
4 // A map to keep track of the count of each word.
5 const wordCount = new Map<string, number>();
6 // A set containing the banned words for quick look-up.
7 const bannedWords = new Set<string>(banned);
8
9 // Split the paragraph into words using a regex that matches non-letter characters.
10 for (const word of lowercaseParagraph.split(/[^A-Za-z]/)) {
11 // Skip empty strings or banned words.
12 if (word === '' || bannedWords.has(word)) {
13 continue;
14 }
15 // Increment the word count for each occurrence of the word.
16 wordCount.set(word, (wordCount.get(word) ?? 0) + 1);
17 }
18
19 // Initialize a placeholder for the most common word and its count.
20 let mostCommon = ['', 0];
21 // Iterate through the map entries to find the word with the highest count.
22 for (const [currentWord, count] of wordCount) {
23 if (count > mostCommon[1]) {
24 mostCommon = [currentWord, count];
25 }
26 }
27
28 // Return the most common word.
29 return mostCommon[0];
30}
31
Time and Space Complexity
The provided code snippet defines a function that returns the most common non-banned word from a string paragraph
, ignoring words in the banned list banned
. It utilizes regular expressions to identify words and the collections.Counter
class to count occurrences.
Time Complexity
The time complexity of the mostCommonWord
function is primarily determined by several operations:
-
re.findall
: This function is used to find all substrings inparagraph
that match the regular expression[a-z]+
, effectively extracting all words composed of lowercase letters. The complexity of this operation isO(n)
, wheren
is the length ofparagraph
, as it must traverse the entire paragraph and perform pattern matching. -
paragraph.lower()
: Lowercasing the entire paragraph has a time complexity ofO(n)
. -
Counter.most_common
: TheCounter
object is created from the list of words found usingre.findall
, which also has a complexity ofO(m)
wherem
is the number of words in theparagraph
. Themost_common
method then sorts these word counts, which in the worst case isO(m log m)
if theCounter
implementation uses Timsort, a variation of sorting that has this worst-case complexity. -
next
with generator expression: In the worst case, we might need to iterate through allm
words to find one that is not in the banned set. Checking if a word is in the banned set isO(1)
due to using a set.
The overall time complexity T(n)
therefore is dominated by the sorting step in Counter.most_common
, which gives us:
T(n) = O(n) + O(n) + O(m) + O(m log m)
Since m
, the number of unique words, is typically much less than n
, we can consider O(n)
as a more practical worst-case estimate for large paragraphs.
Space Complexity
The space complexity is also determined by several factors:
-
The set of banned words
s
: If there areb
banned words, the space complexity for the set isO(b)
. -
The counter
p
: In the worst case, if every word inparagraph
is unique, the space needed for the counter would beO(m)
. -
The list of words produced by
re.findall
: This list also has a space complexity ofO(m)
in the worst case.
Therefore, the total space complexity S(m, b)
is a combination of the space needed for these data structures:
S(m, b) = O(b) + O(m) + O(m)
S(m, b) = O(m + b)
In conclusion, the mostCommonWord
function has a time complexity of O(n)
and a space complexity of O(m + b)
, where n
is the length of paragraph
, m
is the number of unique words in the paragraph
, and b
is the number of banned words.
Learn more about how to find time and space complexity quickly using problem constraints.
Which algorithm should you use to find a node that is close to the root of the tree?
Recommended Readings
LeetCode Patterns Your Personal Dijkstra's Algorithm to Landing Your Dream Job The goal of AlgoMonster is to help you get a job in the shortest amount of time possible in a data driven way We compiled datasets of tech interview problems and broke them down by patterns This way we
Recursion Recursion is one of the most important concepts in computer science Simply speaking recursion is the process of a function calling itself Using a real life analogy imagine a scenario where you invite your friends to lunch https algomonster s3 us east 2 amazonaws com recursion jpg You first
Runtime Overview When learning about algorithms and data structures you'll frequently encounter the term time complexity This concept is fundamental in computer science and offers insights into how long an algorithm takes to complete given a certain input size What is Time Complexity Time complexity represents the amount of time
Want a Structured Path to Master System Design Too? Don’t Miss This!