Generating Keywords Automatically With llama-cli and Phi-3

by Dr. Phil Winder, CEO

This started as a need to surface related blog posts on the Winder.AI blog. Nearly all of my content-focussed websites use Hugo, which has a neat feature for generating links to related content.

By default, this works by intersecting the keywords list in each page’s frontmatter with that of every other page and ranking pages by how many keywords they share. This method was chosen because it is fast, and Hugo’s ethos is speed. A content-based semantic approach would be better, but it would be much slower.
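Conceptually, the ranking looks something like the following Python sketch. This is purely illustrative: Hugo does this internally (in Go) with configurable indices and weights, and the page names and keyword sets below are made up.

def rank_related(current_keywords: set[str], other_pages: dict[str, set[str]]) -> list[str]:
    # Score each page by how many keywords it shares with the current page
    scores = {
        title: len(current_keywords & keywords)
        for title, keywords in other_pages.items()
    }
    # Keep pages that share at least one keyword, most overlap first
    return sorted(
        (title for title, score in scores.items() if score > 0),
        key=lambda title: scores[title],
        reverse=True,
    )

pages = {
    "post-a": {"llm", "keywords", "hugo"},
    "post-b": {"kubernetes", "devops"},
    "post-c": {"llm", "phi-3"},
}
print(rank_related({"llm", "hugo", "phi-3"}, pages))  # ['post-a', 'post-c']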

Now, I haven’t been good at adding keywords to all of the blog’s hundreds of pages. And when I had, they tended to be overly specific and never overlapped with other pages. So I decided to write a script to automate this.

But natural language processing is time consuming and hard. So instead I chose to leverage a small language model that I could run locally.

Prerequisites

To make this work you will need to install llama.cpp’s llama-cli and download a language model in GGUF format.

  1. Install llama.cpp (which includes llama-cli)

  2. Download Phi-3 in GGUF format from Hugging Face

  3. Make sure it works:

    llama-cli -m path/to/Phi-3-mini-4k-instruct-q4.gguf -t 8 -n 128 --log-disable --prompt "<|user|>What is the best place in Yorkshire?<|end|><|assistant|>"
    
  4. The script shown next has one Python dependency, so you will need to pip install python-frontmatter in a virtual environment (a quick check is shown below).
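As a quick sanity check that the dependency is installed and behaves as expected, you can load one of your existing posts. The file path here is just a placeholder.

# Quick check that python-frontmatter parses a post correctly.
# "path/to/post.md" is a placeholder for one of your markdown files.
import frontmatter

post = frontmatter.load("path/to/post.md")
print(post.metadata)        # the parsed frontmatter as a dict
print(post.content[:200])   # the start of the markdown body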

Automatic Keyword Generation Script

Next, I wrote a Python script to iterate over a directory and pull out all the markdown files. The script parses each markdown file, cleans it, and chunks the content. Each chunk is passed to the language model in turn to build up a list of keywords. Most of the time was spent improving the data cleaning and robustness.

Here is the script in full. You can see a detailed description below.

import json
import os
import re
import subprocess

import frontmatter

# Set the path to the directory containing markdown files
directory_path = "path/to/content/blog/2024/"  # Path to directory to scan for markdown files
model_path = (
    "path/to/Phi-3-mini-4k-instruct-q4.gguf"  # Path to the llama model
)
frontmatter_key = "keywords"  # The key to write to
num_keywords = 15  # Number of keywords to generate
response_regex = r"\<\|assistant\|\>(.*?)\<\|end\|\>"  # Regular expression to match responses, differs per model, this is for phi-3


def remove_code_blocks(markdown_text: str):
    # Remove code blocks enclosed by ```
    return re.sub(r"```.*?```", "", markdown_text, flags=re.DOTALL)


def remove_handlebar_code_blocks(markdown_text: str):
    # Remove handlebar code blocks enclosed by {{ }}
    return re.sub(r"{{.*?}}", "", markdown_text, flags=re.DOTALL)


def remove_html_blocks(markdown_text: str):
    # Remove HTML blocks enclosed by <>
    return re.sub(r"<.*?>", "", markdown_text, flags=re.DOTALL)


def remove_markdown_links(markdown_text: str):
    # Remove markdown link targets (and any other parenthesised text)
    return re.sub(r"\(.*?\)", "", markdown_text, flags=re.DOTALL)


def clean_markdown(markdown_text: str) -> str:
    markdown_text = remove_code_blocks(markdown_text)
    markdown_text = remove_handlebar_code_blocks(markdown_text)
    markdown_text = remove_html_blocks(markdown_text)
    markdown_text = remove_markdown_links(markdown_text)
    return markdown_text


def split_by_headings(markdown_text: str):
    # Split the text by markdown headings
    chunks = re.split(r"(^#.*$)", markdown_text, flags=re.MULTILINE)
    # Group the headings with their corresponding content
    grouped_chunks = []
    for i in range(1, len(chunks), 2):
        heading = chunks[i].strip()
        content = chunks[i + 1].strip() if (i + 1) < len(chunks) else ""
        grouped_chunks.append((heading, content))
    return grouped_chunks


def call_llama_cli(model_path: str, prompt: str):
    try:
        # Call the llama-cli command
        result = subprocess.run(
            [
                "llama-cli",
                "--model",
                model_path,
                "--predict",
                "512",
                "--log-disable",
                "--json-schema",
                json.dumps(
                    {
                        "$schema": "http://json-schema.org/draft-07/schema#",
                        "type": "array",
                        "items": {"type": "string"},
                    }
                ),
                "--prompt",
                prompt,
            ],
            capture_output=True,
            text=True,
            check=True,
        )

        # Remove the prompt from the stdout if it exists
        cleaned = result.stdout.replace(prompt, "")

        # Return the stdout from the command
        return cleaned
    except subprocess.CalledProcessError as e:
        print(f"An error occurred: {e}")
        return None


def keyword_extraction_prompt(data: str):
    data = data.replace("\n", " ")  # Remove newlines
    return f"""\
<|user|>What are the 5 nouns that best represent this text? Retain acronyms but ignore any noun with an underscore. Output just the text as a JSON list on the same line.
{data}
<|end|>
<|assistant|>"""


def keyword_cleanup_prompt(data: str, n_keywords: int):
    data = data.replace("\n", " ")  # Remove newlines
    return f"""\
<|user|>Given a list of nouns and compound nouns, select {n_keywords} that are best for SEO. Prefer single word nouns and acronyms where possible. Flatten all results into a single JSON list on the same line.
{data}
<|end|>
<|assistant|>"""


def theme_keyword_prompt(data: str):
    data = data.replace("\n", " ")  # Remove newlines
    return f"""\
<|user|>Given a list of nouns and compound nouns, what single theme or category best represents this content? Prefer single word nouns and acronyms where possible.
{data}
<|end|>
<|assistant|>"""


def extract_json_from_markdown(markdown_text: str) -> list[str] | None:
    # Find model responses between the Phi-3 <|assistant|> ... <|end|> markers
    json_code_blocks = re.findall(response_regex, markdown_text, re.DOTALL)

    json_objects = []
    for block in json_code_blocks:
        block = block.strip()  # Remove any leading/trailing whitespace
        try:
            json_obj = json.loads(block)
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")
            continue

    if len(json_objects) == 0:
        return None

    return json_objects[0]


def remove_strings_with_underscore(input_list: list[str]):
    # List comprehension to filter out strings containing an underscore
    return [string for string in input_list if "_" not in string]


def remove_small_chunks(chunks: list[str], size_threshold: int):
    # Remove chunks where content length is less than size_threshold
    return [
        (heading, content)
        for heading, content in chunks
        if len(content.split()) >= size_threshold
    ]


def update_frontmatter(file_path: str, key: str, value):
    # Updates a file with frontmatter and updates the key with the value
    with open(file_path, "r") as file:
        post = frontmatter.load(file)

    post.metadata[key] = value

    with open(file_path, "w") as file:
        file.write(frontmatter.dumps(post))


# Iterate over all markdown files in the directory
for root, dirs, files in os.walk(directory_path):
    for file in files:
        if file.endswith(".md"):
            file_path = os.path.join(root, file)

            # Read markdown file with frontmatter
            post = frontmatter.load(file_path)
            content = post.content

            # Clean markdown content
            cleaned_markdown = clean_markdown(content)

            # Split markdown content by headings
            chunks = split_by_headings(cleaned_markdown)
            if len(chunks) == 0:
                chunks = [("Content", cleaned_markdown)]
            print(f"Chunks found for file: {file_path} - {len(chunks)}")

            # Remove chunks with content length less than x words
            chunks = remove_small_chunks(chunks, 30)
            if len(chunks) == 0:
                print(f"No chunks found for file: {file_path}")
                continue

            # List to store the final keywords
            final_keywords = []

            # For each chunk, call the llama-cli command
            for i, (heading, content) in enumerate(chunks):
                # Call llama-cli with the content
                response = call_llama_cli(
                    model_path,
                    keyword_extraction_prompt(f"{heading}. {content}"),
                )

                if response is None:
                    print(f"No response for chunk {i+1}: {response}")
                    continue

                # Parse the JSON from the response
                keywords = extract_json_from_markdown(response)

                if keywords is None or len(keywords) == 0:
                    print(f"No keywords found for chunk {i+1}: {response}")
                    continue

                # Append the result to the keywords list
                print(f"Keywords for chunk {i+1}: {keywords}")
                final_keywords.extend(keywords)

            # Remove keywords with underscores
            final_keywords = remove_strings_with_underscore(final_keywords)

            # Clean up the keywords
            response = call_llama_cli(
                model_path,
                keyword_cleanup_prompt(
                    data=f"{final_keywords}",
                    # Max num_keywords keywords if available
                    n_keywords=min(num_keywords, len(final_keywords)),
                ),
            )

            # Parse the JSON from the response (guarding against a failed llama-cli call)
            cleaned_keywords = extract_json_from_markdown(response) if response else None
            if cleaned_keywords is None or len(cleaned_keywords) == 0:
                print(f"No cleaned keywords found: {response}")
                continue

            # Generate one keyword that best represents the theme
            response = call_llama_cli(
                model_path,
                theme_keyword_prompt(
                    data=f"{final_keywords}",
                ),
            )
            theme = extract_json_from_markdown(response) if response else None  # Guard against a failed call
            if theme is None or len(theme) == 0:
                print(f"No theme keyword found: {response}")
                continue
            theme = theme[0]
            print(f"Theme keyword: {theme}")

            # Prepend the theme keyword to the cleaned keywords
            cleaned_keywords.insert(0, theme)

            # Remove duplicates using a map with a lowercased key and the original value
            cleaned_keywords = list(
                {keyword.lower(): keyword for keyword in cleaned_keywords}.values()
            )

            # Limit the number of keywords to num_keywords
            cleaned_keywords = cleaned_keywords[:num_keywords]

            # Lowercase everything to keep it consistent
            cleaned_keywords = [keyword.lower() for keyword in cleaned_keywords]

            # Update the frontmatter with the keywords
            print(f"Final keywords: {cleaned_keywords}")
            update_frontmatter(file_path, frontmatter_key, cleaned_keywords)

Code Explanation

The code above is fairly straightforward to follow but there are a few interesting sections that I would like to highlight.

  • The first part of the code provides functions to clean the markdown content. I found that if I didn’t do this then the language model would pull out lots of code keywords, which aren’t particularly searchable. So I strip the code blocks, the links, any HTML, things like that. I’m just trying to retain the content.
  • The model only has a 4k context window (this is the 4k variant of Phi-3 mini), so you need to be careful with the amount of content that goes in. I chose to split on each markdown heading, since that tended to produce sections of reasonable length.
  • Next comes the call to llama-cli via a Python subprocess. The interesting thing to note here is the use of --json-schema, which allows you to pass a JSON Schema that the output must conform to. This saved a lot of parsing code, because everything comes back as a simple JSON list.
  • Then we have a collection of prompts for performing various tasks. This took some iteration to get right. It was hard to find a word that best described what I wanted a keyword to be. It should be short, but reflective of the topics and terms used. I finally landed on simply asking for nouns. This seems to produce clean and concise keywords.
  • Finally, there are a few helper functions to parse and clean the output. The only thing to mention here is that the response is parsed assuming you are using Phi-3’s boundary markers; if you change the model you will need to change response_regex (see the sketch after this list).
  • The big loop at the bottom is the thing that iteratively loads all markdown files in the directory and replaces the keywords in the frontmatter with the generated keywords. This is a bit interesting because I found it best to first generate the raw keywords for each page, then clean them in a second global step. Finally I ran another prompt to attempt to produce a high level theme or topic for this content. This was necessary because often the high level topic (like “AI”) wasn’t included.
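To make the point about response_regex concrete, here is a small, self-contained sketch of how the responses are pulled out of the raw output. The sample_output string is invented for illustration; real llama-cli output will differ.

import json
import re

response_regex = r"\<\|assistant\|\>(.*?)\<\|end\|\>"  # Phi-3 boundary markers

# Invented example of what the raw output might contain
sample_output = '<|user|>List the nouns.<|end|><|assistant|>["hugo", "keywords", "llm"]<|end|>'

matches = re.findall(response_regex, sample_output, re.DOTALL)
keywords = json.loads(matches[0].strip())
print(keywords)  # ['hugo', 'keywords', 'llm']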

Conclusion

I have shown how you can leverage a fully local language model to perform a fairly sophisticated natural language task. It’s fairly hard to constrain the language model to make it do what you want, but when you get there it is very powerful.

Try the related links to see how well it works!
