Absolutely! Here's a breakdown of an algorithm designed to address the challenges you've outlined, along with code examples in TypeScript to illustrate the process.
Algorithm Design
The core idea is to process the subtitle segments in chunks, use the LLM to identify potential paragraph and chapter breaks within those chunks, and then merge and refine these suggestions to form the final structure.
Steps:
- Chunking with Overlap (a standalone sketch of the arithmetic follows this step):
  - Divide the `SubtitleSegment` array into overlapping chunks. This addresses the context window limitation while minimizing the risk of cutting off logical units.
  - Define a `chunkSize` (number of segments per chunk) and an `overlapSize` (number of segments shared between adjacent chunks).
  - The first chunk starts at index 0, the second at `chunkSize - overlapSize`, the third at `2 * (chunkSize - overlapSize)`, and so on.
  - The last chunk might be smaller than `chunkSize`.
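  A minimal sketch of just this arithmetic (`chunkWithOverlap` is a helper name introduced here; the full example further down inlines the same loop):

  ```typescript
  // Slice an array into overlapping chunks.
  function chunkWithOverlap<T>(items: T[], chunkSize: number, overlapSize: number): T[][] {
    const step = chunkSize - overlapSize; // distance between consecutive chunk starts
    const chunks: T[][] = [];
    for (let i = 0; i < items.length; i += step) {
      chunks.push(items.slice(i, i + chunkSize)); // the last chunk may be shorter than chunkSize
    }
    return chunks;
  }

  // chunkWithOverlap([...Array(50).keys()], 30, 10) yields chunks starting at 0, 20, and 40.
  ```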
- LLM-Based Segmentation (Per Chunk):
  - For each chunk, send a carefully crafted prompt to the LLM.
  - The prompt should include the text of the segments in the current chunk, each labeled with its chunk-relative index.
  - Instruct the LLM to identify potential paragraph and chapter boundaries based on semantic relationships, topic shifts, or changes in speaker (if detectable).
  - Ask the LLM to return the indices of the segments where it believes a paragraph or chapter break should occur within the current chunk. This avoids relying on the LLM for exact character positions.
  - Example prompt (can be refined):
    "Analyze the following subtitle segments and identify logical paragraph and chapter breaks. Return the indices of the segments where a new paragraph or chapter should begin within this block.\n\nSegments:\n" + chunk.map((seg, index) => `${index}: ${seg.text}`).join("\n") + "\n\nRespond with a JSON object like this: {\"paragraphBreaks\": [index1, index2, ...], \"chapterBreaks\": [indexA, indexB, ...]}"
  - Parse the LLM's JSON response to extract the suggested paragraph and chapter break indices, validating it defensively (a sketch follows this step). Remember that these indices are relative to the start of the current chunk.
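  LLMs occasionally return malformed JSON, so it helps to validate the reply before trusting it. A minimal sketch (`extractBreaks` is a hypothetical helper, not part of any SDK):

  ```typescript
  type BreakSuggestions = { paragraphBreaks: number[]; chapterBreaks: number[] };

  // Defensively parse the LLM reply, keeping only in-range integer indices
  // so a single malformed response can't poison the later merge step.
  function extractBreaks(raw: string, chunkLength: number): BreakSuggestions {
    const clean = (xs: unknown): number[] =>
      Array.isArray(xs)
        ? xs.filter((x): x is number => Number.isInteger(x) && x >= 0 && x < chunkLength)
        : [];
    try {
      const parsed = JSON.parse(raw);
      return { paragraphBreaks: clean(parsed.paragraphBreaks), chapterBreaks: clean(parsed.chapterBreaks) };
    } catch {
      return { paragraphBreaks: [], chapterBreaks: [] }; // malformed JSON: treat as "no breaks"
    }
  }
  ```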
- Mapping and Merging (Paragraphs):
  - Iterate through the results from each chunk's LLM processing.
  - For each suggested paragraph break index within a chunk, map it back to the original `SubtitleSegment` array index by adding the chunk's start offset.
  - Keep track of potential paragraph boundaries. If multiple overlapping chunks suggest a break around the same segment, it strengthens the likelihood of a genuine paragraph break (a voting sketch follows this step).
  - Define a tolerance (e.g., a few segments) for considering breaks from different chunks as the same.
  - Construct `Paragraph` objects based on these identified boundaries. The `start` and `end` times can be taken from the first and last segments of the paragraph.
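  One way to exploit the overlap, sketched under the assumption that you collect all mapped break indices first (the full example below uses a simpler proximity-only merge):

  ```typescript
  // Cluster break suggestions that fall within `tolerance` segments of each
  // other, then accept the median of each cluster that has enough votes.
  function voteOnBreaks(allBreaks: number[], tolerance = 3, minVotes = 2): number[] {
    const sorted = [...allBreaks].sort((a, b) => a - b);
    const accepted: number[] = [];
    let clusterStart = 0;
    for (let i = 1; i <= sorted.length; i++) {
      // A gap wider than `tolerance` (or the end of the list) closes the cluster.
      if (i === sorted.length || sorted[i] - sorted[i - 1] > tolerance) {
        const cluster = sorted.slice(clusterStart, i);
        if (cluster.length >= minVotes) {
          accepted.push(cluster[Math.floor(cluster.length / 2)]); // median of the cluster
        }
        clusterStart = i;
      }
    }
    return accepted;
  }
  ```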
- Mapping and Merging (Chapters):
  - Similar to paragraph merging, iterate through the LLM's suggested chapter breaks for each chunk.
  - Map the indices back to the original array.
  - Look for clusters of chapter break suggestions across overlapping chunks.
  - Construct `Chapter` objects. You'll need a mechanism to generate `title`s. This could be:
    - The text of the first few words of the first paragraph in the chapter.
    - Asking the LLM to generate a title based on the content of the paragraphs within a potential chapter (more LLM calls; a sketch follows this step).
  - Ensure that chapters don't overlap and that all paragraphs are assigned to a chapter.
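  If you opt for LLM-generated titles, the extra call might look like the sketch below (`callLLM` is a stand-in for whichever client you use; `Paragraph` is the type defined in the full example further down):

  ```typescript
  declare function callLLM(prompt: string): Promise<string>; // hypothetical LLM client

  // Ask the LLM for a short title, sending only paragraph openings
  // to keep the prompt (and the bill) small.
  async function generateChapterTitle(paragraphs: Paragraph[]): Promise<string> {
    if (paragraphs.length === 0) return "Untitled chapter";
    const excerpt = paragraphs.map((p) => p.text.slice(0, 120)).join("\n");
    const title = await callLLM(
      `Write a concise title (at most 8 words) for a chapter containing:\n${excerpt}\nRespond with the title only.`
    );
    return title.trim() || paragraphs[0].text.substring(0, 50) + "..."; // fall back to simple truncation
  }
  ```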
- Final Transcription Assembly:
  - Organize the created `Chapter` objects into the final `Transcription` structure.
Code Example (Illustrative TypeScript):
```typescript
type SubtitleSegment = {
start: number;
end: number;
text: string;
words: { word: string; start: number; end: number }[];
};
type Subtitle = {
segments: SubtitleSegment[];
};
type Paragraph = {
start: number;
end: number;
text: string;
segments: SubtitleSegment[];
};
type Chapter = {
title: string;
paragraphs: Paragraph[];
};
type Transcription = {
chapters: Chapter[];
};
async function segmentSubtitles(subtitle: Subtitle): Promise<Transcription> {
const chunkSize = 30; // Adjust as needed
const overlapSize = 10; // Adjust as needed
const segments = subtitle.segments;
const numSegments = segments.length;
const chunks: SubtitleSegment[][] = [];
for (let i = 0; i < numSegments; i += chunkSize - overlapSize) {
chunks.push(segments.slice(i, i + chunkSize));
}
const chunkResults = await Promise.all(
    chunks.map(async (chunk) => {
if (chunk.length === 0) return { paragraphBreaks: [], chapterBreaks: [] };
      const prompt = `Analyze the following subtitle segments and identify logical paragraph and chapter breaks. Return the indices of the segments where a new paragraph or chapter should begin within this block.\n\nSegments:\n${chunk
        .map((seg, index) => `${index}: ${seg.text}`) // chunk-relative labels; global offsets are added back after the call
        .join("\n")}\n\nRespond with a JSON object like this: {"paragraphBreaks": [], "chapterBreaks": []}`; // Simplified prompt for example
      // Simulate the LLM call (replace with an actual LLM API call)
const llmResponse = await simulateLLM(prompt);
try {
const parsedResponse = JSON.parse(llmResponse);
return {
paragraphBreaks: parsedResponse.paragraphBreaks || [],
chapterBreaks: parsedResponse.chapterBreaks || [],
};
} catch (error) {
console.error("Error parsing LLM response:", error);
return { paragraphBreaks: [], chapterBreaks: [] };
}
})
);
const potentialParagraphBreaks: number[] = [];
const potentialChapterBreaks: number[] = [];
chunkResults.forEach((result, chunkIndex) => {
const chunkStartIndex = chunkIndex * (chunkSize - overlapSize);
result.paragraphBreaks.forEach((breakIndex) => {
potentialParagraphBreaks.push(chunkStartIndex + breakIndex);
});
result.chapterBreaks.forEach((breakIndex) => {
potentialChapterBreaks.push(chunkStartIndex + breakIndex);
});
});
  // Merge and refine paragraph breaks based on proximity
const mergedParagraphBreaks = mergeBreaks(potentialParagraphBreaks);
const paragraphs = createParagraphs(segments, mergedParagraphBreaks);
  // Merge and refine chapter breaks based on proximity and group paragraphs into chapters
const mergedChapterBreaks = mergeBreaks(potentialChapterBreaks);
const chapters = createChapters(paragraphs, mergedChapterBreaks);
return { chapters };
}
// Helper functions (illustrative implementations)
async function simulateLLM(prompt: string): Promise<string> {
// Replace with actual LLM API call (e.g., OpenAI, Cohere)
// This is a placeholder for demonstration
console.log("Simulating LLM call with prompt:", prompt);
  return JSON.stringify({ paragraphBreaks: [5, 15], chapterBreaks: [0] }); // Example response (chunk-relative indices)
}
function mergeBreaks(breaks: number[], tolerance: number = 3): number[] {
  // Sort ascending (note: sort mutates the caller's array), then keep only
  // breaks more than `tolerance` segments from the last kept break.
  breaks.sort((a, b) => a - b);
const merged: number[] = [];
if (breaks.length === 0) return merged;
merged.push(breaks[0]);
for (let i = 1; i < breaks.length; i++) {
if (breaks[i] - merged[merged.length - 1] > tolerance) {
merged.push(breaks[i]);
}
}
return merged;
}
function createParagraphs(
segments: SubtitleSegment[],
breakIndices: number[]
): Paragraph[] {
const paragraphs: Paragraph[] = [];
let startIndex = 0;
  for (const breakIndex of breakIndices) {
    const endIndex = Math.min(breakIndex, segments.length); // clamp to the array bounds
    if (startIndex < endIndex) {
      const paragraphSegments = segments.slice(startIndex, endIndex);
      paragraphs.push({
        start: paragraphSegments[0].start,
        end: paragraphSegments[paragraphSegments.length - 1].end,
        text: paragraphSegments.map((s) => s.text).join(" "),
        segments: paragraphSegments,
      });
      startIndex = endIndex;
    }
  }
}
if (startIndex < segments.length) {
const paragraphSegments = segments.slice(startIndex);
paragraphs.push({
start: paragraphSegments[0].start,
end: paragraphSegments[paragraphSegments.length - 1].end,
text: paragraphSegments.map((s) => s.text).join(" "),
segments: paragraphSegments,
});
}
return paragraphs;
}
function createChapters(
paragraphs: Paragraph[],
breakIndices: number[]
): Chapter[] {
  // Paragraphs partition the segment array in order, so a running total gives
  // the global index of each paragraph's first segment.
  const paragraphStartIndices: number[] = [];
  let runningTotal = 0;
  for (const p of paragraphs) {
    paragraphStartIndices.push(runningTotal);
    runningTotal += p.segments.length;
  }
  const chapters: Chapter[] = [];
  let startIndex = 0;
  for (const breakIndex of breakIndices) {
    // First paragraph whose opening segment sits at or after the segment-level break.
    const endIndex = paragraphStartIndices.findIndex((s) => s >= breakIndex);
    if (endIndex > startIndex) {
      const chapterParagraphs = paragraphs.slice(startIndex, endIndex);
      chapters.push({
        title: chapterParagraphs[0].text.substring(0, 50) + "...", // Basic title
        paragraphs: chapterParagraphs,
      });
      startIndex = endIndex;
    }
  }
if (startIndex < paragraphs.length) {
const chapterParagraphs = paragraphs.slice(startIndex);
if (chapterParagraphs.length > 0) {
chapters.push({
title: chapterParagraphs[0].text.substring(0, 50) + "...",
paragraphs: chapterParagraphs,
});
}
}
return chapters;
}
// Example usage:
const subtitleData: Subtitle = {
segments: [
{ start: 0, end: 5, text: "This is the first sentence.", words: [] },
{ start: 6, end: 10, text: "And this is the second.", words: [] },
{ start: 11, end: 15, text: "A new topic begins here.", words: [] },
{ start: 16, end: 20, text: "More on the new topic.", words: [] },
{ start: 21, end: 25, text: "A different chapter starts now.", words: [] },
{ start: 26, end: 30, text: "Continuing the new chapter.", words: [] },
// ... more segments
],
};
segmentSubtitles(subtitleData)
  .then((transcription) => console.dir(transcription, { depth: null }));
```
Key Considerations and Improvements:
- Prompt Engineering: The quality of the LLM's output heavily depends on the prompt. Experiment with different phrasing, include examples, and specify the desired output format clearly.
- LLM Choice: Different LLMs have different strengths. Some might be better at identifying topic shifts, while others excel at detecting speaker changes.
- Cost Optimization:
  - Chunk Size: Finding the right balance for `chunkSize` and `overlapSize` is crucial. Smaller chunks mean more calls but less data per call. Larger chunks reduce calls but might exceed context limits.
  - Efficient Prompting: Minimize the information sent in each prompt. Focus on the relevant segments.
  - Caching: If you process similar subtitles frequently, consider caching LLM responses (a sketch follows this list).
- Error Handling: Implement robust error handling for LLM API calls and JSON parsing.
- Refinement Logic: The `mergeBreaks` and `createChapters` functions can be made more sophisticated. For example:
  - Weighted Breaks: Give more weight to breaks suggested by multiple chunks (see the voting sketch in the paragraph-merging step above).
  - Contextual Analysis: When merging chapter breaks, consider the content of the paragraphs being grouped.
  - LLM for Titles: Use the LLM to generate more descriptive chapter titles based on the content of the paragraphs (see the title sketch in the chapter-merging step above).
- Speaker Information: If your subtitle data includes speaker information, incorporate it into the prompt to help the LLM identify paragraph and chapter breaks based on speaker changes.
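For the caching point above, a minimal in-memory sketch (a production version would likely hash prompts and use a persistent store):

```typescript
// Memoize LLM calls by prompt so identical chunks are only sent (and billed) once.
const llmCache = new Map<string, string>();

async function cachedLLM(
  prompt: string,
  call: (p: string) => Promise<string>
): Promise<string> {
  const hit = llmCache.get(prompt);
  if (hit !== undefined) return hit;
  const response = await call(prompt);
  llmCache.set(prompt, response);
  return response;
}

// Usage: const reply = await cachedLLM(prompt, simulateLLM);
```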
This algorithm provides a solid foundation for segmenting long subtitles using LLMs while addressing the challenges of context length and LLM limitations. Remember to adapt the chunk sizes, overlap, and prompting strategies to best suit your specific subtitle data and LLM capabilities.