You are given a list of words, and your task is to find the top N most frequent words in the list. Implement a function that efficiently counts word occurrences and returns the top N words sorted by frequency.
- Efficiently count the occurrences of each word.
- Retrieve the top N words sorted by frequency.
- If two words have the same frequency, they should be sorted alphabetically.
- Assume the input is a large dataset, so efficiency matters.
- Why is HashMap a good choice for counting word occurrences?
- How would you sort the words efficiently based on frequency?
- What is the time complexity of your approach?
- How would you handle case sensitivity (e.g., "Java" and "java")?
- What if the input stream is too large to fit into memory? How would you modify your solution?
Would you like a sample implementation for this?
prompt: yes
ChatGPT:
Here's a Java implementation that counts word occurrences with a HashMap and retrieves the top N most frequent words, sorted by frequency and alphabetically when frequencies match.
import java.util.*;

public class WordFrequencyCounter {

    public static List<String> getTopNFrequentWords(String[] words, int N) {
        // Step 1: Count word occurrences using a HashMap
        Map<String, Integer> wordCount = new HashMap<>();
        for (String word : words) {
            word = word.toLowerCase(); // Normalize case sensitivity
            wordCount.put(word, wordCount.getOrDefault(word, 0) + 1);
        }

        // Step 2: Use a PriorityQueue (min-heap) of size N whose root is the
        // "weakest" entry: lowest frequency first and, on equal frequencies,
        // the alphabetically LAST word first, so that ties are evicted and
        // later reversed into ascending alphabetical order.
        PriorityQueue<Map.Entry<String, Integer>> minHeap = new PriorityQueue<>(
                (a, b) -> a.getValue().equals(b.getValue())
                        ? b.getKey().compareTo(a.getKey())              // reverse alphabetical on ties
                        : Integer.compare(a.getValue(), b.getValue())   // ascending frequency
        );

        for (Map.Entry<String, Integer> entry : wordCount.entrySet()) {
            minHeap.offer(entry);
            if (minHeap.size() > N) {
                minHeap.poll(); // Remove the weakest element
            }
        }

        // Step 3: Extract words from the heap (weakest first) and reverse,
        // giving descending frequency with alphabetical order on ties.
        List<String> result = new ArrayList<>();
        while (!minHeap.isEmpty()) {
            result.add(minHeap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        String[] words = {"java", "python", "java", "c", "python", "java", "c", "c++", "c++", "c++", "python"};
        int N = 3; // Get top 3 frequent words
        List<String> topWords = getTopNFrequentWords(words, N);
        System.out.println("Top " + N + " most frequent words: " + topWords);
    }
}
- Count occurrences using a HashMap<String, Integer>, where the key is the word and the value is its frequency.
- Use a min-heap (PriorityQueue) to store the top N frequent words efficiently. If the heap size exceeds N, the least frequent element is removed.
- If two words have the same frequency, they are sorted alphabetically.
- Extract words from the heap and reverse the list to get them in descending order of frequency.
Top 3 most frequent words: [c++, java, python]
(Here, "java", "python", and "c++" each appear 3 times, so they are sorted alphabetically.)
- Counting words (HashMap) → O(W), where W is the total number of words in the input.
- Maintaining a min-heap of size N → O(M log N), where M is the number of unique words.
- Extracting results from the heap → O(N log N).
Overall, the approach runs in O(W + M log N) time, making it efficient for large datasets.
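For comparison, when the number of unique words M is small, the heap can be skipped entirely: sort all entries once in O(M log M). This sketch (class and method names are illustrative, not part of the solution above) produces the same frequency-then-alphabetical ordering using the Streams API:

```java
import java.util.*;
import java.util.stream.*;

public class SortAllVariant {
    // Alternative to the heap: count, then sort every unique word once.
    // O(M log M) in the number of unique words; simpler, and fine when
    // the vocabulary comfortably fits in memory.
    public static List<String> topN(String[] words, int n) {
        Map<String, Long> counts = Arrays.stream(words)
                .map(String::toLowerCase)
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        return counts.entrySet().stream()
                // Negate the count for descending frequency, then break
                // ties by ascending alphabetical key.
                .sorted(Comparator.<Map.Entry<String, Long>>comparingLong(e -> -e.getValue())
                        .thenComparing(Map.Entry::getKey))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String[] words = {"java", "python", "java", "c", "python", "java", "c", "c++", "c++", "c++", "python"};
        System.out.println(SortAllVariant.topN(words, 3)); // prints [c++, java, python]
    }
}
```

For a huge vocabulary with a small N, the heap version does less work (O(M log N) vs O(M log M)); for modest M, the one-shot sort is simpler and just as fast in practice.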
- Handle streaming data (use a sliding window approach).
- Optimize space (use a Trie if dealing with prefix-based searches).
- Distributed processing (for very large datasets, use tools like Apache Spark).
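For the streaming case, a minimal sketch (the class name is hypothetical) that reads from a Reader line by line, so only the per-word counts stay in memory rather than the whole input:

```java
import java.io.*;
import java.util.*;

public class StreamingWordCount {
    // Count words from a Reader without materializing the whole input.
    // Only the count map (one entry per unique word) stays resident;
    // the text itself is processed one line at a time.
    public static Map<String, Integer> countWords(Reader input) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.toLowerCase().split("\\s+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Reader demo = new StringReader("java python java\nc python java");
        Map<String, Integer> counts = countWords(demo);
        System.out.println(counts.get("java")); // prints 3
    }
}
```

If even the set of unique words does not fit in memory, a further step would hash-partition words into temporary files and count each partition separately before merging the top-N candidates.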
Would you like me to tweak this for a specific use case?