Generating Keywords Automatically With llama-cli and Phi-3
by Dr. Phil Winder, CEO
This started as a need to generate related blog posts on the Winder.AI blog. Nearly all of my content-focussed websites use Hugo, which has a neat feature to generate links to related content.
This works by calculating the intersection of each page's keywords list in the frontmatter (by default) with every other page's, then ranking pages by how much they overlap. This method was chosen because it is fast, and Hugo's ethos is speed. A content-based semantic approach would be better, but it would be much slower.
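Conceptually, the related-content ranking is just a keyword-overlap count. Here is a rough Python sketch of the idea (an illustration only, not Hugo's actual implementation, and the page data is made up):

# Rough sketch of keyword-overlap ranking for related content (illustrative, not Hugo's code)
pages = {
    "post-a": {"hugo", "keywords", "seo"},
    "post-b": {"hugo", "llm", "keywords"},
    "post-c": {"kubernetes", "mlops"},
}

def related(current: str, n: int = 5) -> list[str]:
    # Score every other page by how many keywords it shares with the current page
    scores = {
        other: len(pages[current] & kws)
        for other, kws in pages.items()
        if other != current
    }
    # Keep pages with at least one shared keyword, most overlap first
    return [p for p, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0][:n]

print(related("post-a"))  # ['post-b']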
Now, I haven’t been good at adding keywords to all of the blog’s hundreds of pages. And when I had, they tended to be overly specific and never overlapped with other pages. So I decided to write a script to automate this.
But natural language processing is time-consuming and hard. So instead I chose to leverage a small language model that I could run locally.
Prerequisites
To make this work you will need to install llama.cpp's llama-cli and download a language model in gguf format.
Make sure it works:
llama-cli -m path/to/Phi-3-mini-4k-instruct-q4.gguf -t 8 -n 128 --log-disable --prompt "<|user|>What is the best place in Yorkshire?<|end|><|assistant|>"
The script shown next has one Python dependency, so you will need to pip install python-frontmatter in a virtual environment.
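As with llama-cli, it is worth checking that the dependency works before running the full script. A quick sanity check (the path is illustrative):

# Quick check that python-frontmatter can parse one of your posts (the path is illustrative)
import frontmatter

post = frontmatter.load("path/to/content/blog/2024/example/index.md")
print(post.metadata.get("title"), "-", len(post.content), "characters of content")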
Automatic Keyword Generation Script
Next I wrote a Python script to iterate over a directory and pull out all the markdown files. It parses each markdown file and chunks the content, then passes the chunks into the language model iteratively to build a list of keywords. Most of the time was spent improving the data cleaning and robustness.
Here is the script in full. You can see a detailed description below.
import json
import os
import re
import subprocess
import frontmatter
# Set the path to the directory containing markdown files
directory_path = "path/to/content/blog/2024/" # Path to directory to scan for markdown files
model_path = (
    "path/to/Phi-3-mini-4k-instruct-q4.gguf"  # Path to the llama model
)
frontmatter_key = "keywords" # The key to write to
num_keywords = 15 # Number of keywords to generate
response_regex = r"\<\|assistant\|\>(.*?)\<\|end\|\>" # Regular expression to match responses, differs per model, this is for phi-3
def remove_code_blocks(markdown_text: str):
    # Remove code blocks enclosed by ```
    return re.sub(r"```.*?```", "", markdown_text, flags=re.DOTALL)
def remove_handlebar_code_blocks(markdown_text: str):
    # Remove handlebar code blocks enclosed by {{ }}
    return re.sub(r"{{.*?}}", "", markdown_text, flags=re.DOTALL)
def remove_html_blocks(markdown_text: str):
    # Remove HTML blocks enclosed by <>
    return re.sub(r"<.*?>", "", markdown_text, flags=re.DOTALL)
def remove_markdown_links(markdown_text: str):
    # Remove markdown link URLs enclosed by ( )
    return re.sub(r"\(.*?\)", "", markdown_text, flags=re.DOTALL)
def clean_markdown(markdown_text: str) -> str:
    markdown_text = remove_code_blocks(markdown_text)
    markdown_text = remove_handlebar_code_blocks(markdown_text)
    markdown_text = remove_html_blocks(markdown_text)
    markdown_text = remove_markdown_links(markdown_text)
    return markdown_text
def split_by_headings(markdown_text: str):
    # Split the text by markdown headings
    chunks = re.split(r"(^#.*$)", markdown_text, flags=re.MULTILINE)
    # Group the headings with their corresponding content
    grouped_chunks = []
    for i in range(1, len(chunks), 2):
        heading = chunks[i].strip()
        content = chunks[i + 1].strip() if (i + 1) < len(chunks) else ""
        grouped_chunks.append((heading, content))
    return grouped_chunks
def call_llama_cli(model_path: str, prompt: str):
    try:
        # Call the llama-cli command
        result = subprocess.run(
            [
                "llama-cli",
                "--model",
                model_path,
                "--predict",
                "512",
                "--log-disable",
                "--json-schema",
                json.dumps(
                    {
                        "$schema": "http://json-schema.org/draft-07/schema#",
                        "type": "array",
                        "items": {"type": "string"},
                    }
                ),
                "--prompt",
                prompt,
            ],
            capture_output=True,
            text=True,
            check=True,
        )
        # Remove the prompt from the stdout if it exists
        cleaned = result.stdout.replace(prompt, "")
        # Return the stdout from the command
        return cleaned
    except subprocess.CalledProcessError as e:
        print(f"An error occurred: {e}")
        return None
def keyword_extraction_prompt(data: str):
    data = data.replace("\n", " ")  # Remove newlines
    return f"""\
<|user|>What are the 5 nouns that best represent this text? Retain acronyms but ignore any noun with an underscore. Output just the text as a JSON list on the same line.
{data}
<|end|>
<|assistant|>"""
def keyword_cleanup_prompt(data: str, n_keywords: int):
    data = data.replace("\n", " ")  # Remove newlines
    return f"""\
<|user|>Given a list of nouns and compound nouns, select {n_keywords} that are best for SEO. Prefer single word nouns and acronyms where possible. Flatten all results into a single JSON list on the same line.
{data}
<|end|>
<|assistant|>"""
def theme_keyword_prompt(data: str):
    data = data.replace("\n", " ")  # Remove newlines
    return f"""\
<|user|>Given a list of nouns and compound nouns, what single theme or category best represents this content? Prefer single word nouns and acronyms where possible.
{data}
<|end|>
<|assistant|>"""
def extract_json_from_markdown(markdown_text: str) -> list[str] | None:
    # Extract the model's reply between the assistant boundary markers (see response_regex)
    json_code_blocks = re.findall(response_regex, markdown_text, re.DOTALL)
    json_objects = []
    for block in json_code_blocks:
        block = block.strip()  # Remove any leading/trailing whitespace
        try:
            json_obj = json.loads(block)
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")
            continue
    if len(json_objects) == 0:
        return None
    return json_objects[0]
def remove_strings_with_underscore(input_list: list[str]):
    # List comprehension to filter out strings containing an underscore
    return [string for string in input_list if "_" not in string]
def remove_small_chunks(chunks: list[str], size_threshold: int):
    # Remove chunks where content length is less than size_threshold
    return [
        (heading, content)
        for heading, content in chunks
        if len(content.split()) >= size_threshold
    ]
def update_frontmatter(file_path: str, key: str, value):
    # Loads a file with frontmatter, sets the key, and writes the file back
    with open(file_path, "r") as file:
        post = frontmatter.load(file)
    post.metadata[key] = value
    with open(file_path, "w") as file:
        file.write(frontmatter.dumps(post))
# Iterate over all markdown files in the directory
for root, dirs, files in os.walk(directory_path):
    for file in files:
        if file.endswith(".md"):
            file_path = os.path.join(root, file)
            # Read markdown file with frontmatter
            post = frontmatter.load(file_path)
            content = post.content
            # Clean markdown content
            cleaned_markdown = clean_markdown(content)
            # Split markdown content by headings
            chunks = split_by_headings(cleaned_markdown)
            if len(chunks) == 0:
                chunks = [("Content", cleaned_markdown)]
            print(f"Chunks found for file: {file_path} - {len(chunks)}")
            # Remove chunks with content length less than x words
            chunks = remove_small_chunks(chunks, 30)
            if len(chunks) == 0:
                print(f"No chunks found for file: {file_path}")
                continue
            # List to store the final keywords
            final_keywords = []
            # For each chunk, call the llama-cli command
            for i, (heading, content) in enumerate(chunks):
                # Call llama-cli with the content
                response = call_llama_cli(
                    model_path,
                    keyword_extraction_prompt(f"{heading}. {content}"),
                )
                if response is None:
                    print(f"No response for chunk {i+1}: {response}")
                    continue
                # Parse the JSON from the response
                keywords = extract_json_from_markdown(response)
                if keywords is None or len(keywords) == 0:
                    print(f"No keywords found for chunk {i+1}: {response}")
                    continue
                # Append the result to the keywords list
                print(f"Keywords for chunk {i+1}: {keywords}")
                final_keywords.extend(keywords)
            # Remove keywords with underscores
            final_keywords = remove_strings_with_underscore(final_keywords)
            # Clean up the keywords
            response = call_llama_cli(
                model_path,
                keyword_cleanup_prompt(
                    data=f"{final_keywords}",
                    # Max num_keywords keywords if available
                    n_keywords=min(num_keywords, len(final_keywords)),
                ),
            )
            if response is None:
                print(f"No cleanup response for file: {file_path}")
                continue
            # Parse the JSON from the response
            cleaned_keywords = extract_json_from_markdown(response)
            if cleaned_keywords is None or len(cleaned_keywords) == 0:
                print(f"No cleaned keywords found: {response}")
                continue
            # Generate one keyword that best represents the theme
            response = call_llama_cli(
                model_path,
                theme_keyword_prompt(
                    data=f"{final_keywords}",
                ),
            )
            if response is None:
                print(f"No theme response for file: {file_path}")
                continue
            theme = extract_json_from_markdown(response)
            if theme is None or len(theme) == 0:
                print(f"No theme keyword found: {response}")
                continue
            theme = theme[0]
            print(f"Theme keyword: {theme}")
            # Prepend the theme keyword to the cleaned keywords
            cleaned_keywords.insert(0, theme)
            # Remove duplicates using a map with a lowercased key and the original value
            cleaned_keywords = list(
                {keyword.lower(): keyword for keyword in cleaned_keywords}.values()
            )
            # Limit the number of keywords to num_keywords
            cleaned_keywords = cleaned_keywords[:num_keywords]
            # Lowercase everything to keep it consistent
            cleaned_keywords = [keyword.lower() for keyword in cleaned_keywords]
            # Update the frontmatter with the keywords
            print(f"Final keywords: {cleaned_keywords}")
            update_frontmatter(file_path, frontmatter_key, cleaned_keywords)
Code Explanation
The code above is fairly straightforward to follow but there are a few interesting sections that I would like to highlight.
- The first part of the code provides functions to clean the markdown content. I found that if I didn't do this, the language model would pull out lots of code keywords, which aren't particularly searchable. So I strip the code blocks, the links, any HTML, things like that. I'm just trying to retain the content.
- The model only has a 4k context window, so you need to be careful with the amount of content that goes in. I chose to split on each markdown heading, since that tended to produce sections of reasonable length.
- Next comes the call to llama-cli via a Python subprocess. The interesting thing to note here is the use of --json-schema, which allows you to pass a JSON Schema that the output must conform to. This helped to save a lot of parsing code. Everything is formatted as a simple JSON list.
- Then we have a collection of prompts for performing various tasks. This took some iteration to get right. It was hard to find a word that best described what I wanted a keyword to be. It should be short, but reflective of the topics and terms used. I finally landed on simply asking for nouns. This seems to produce clean and concise keywords.
- Finally, there are a few helper functions to parse and to clean. The only thing to mention here is that the response is parsed assuming you are using Phi-3's boundary markers. If you change the model you will need to change response_regex. There is a short example of this parsing just after this list.
- The big loop at the bottom is the thing that iteratively loads all markdown files in the directory and replaces the keywords in the frontmatter with the generated keywords. This is a bit interesting because I found it best to first generate the raw keywords for each page, then clean them in a second global step. Finally I ran another prompt to attempt to produce a high-level theme or topic for the content. This was necessary because often the high-level topic (like "AI") wasn't included.
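To make the parsing concrete, here is roughly what a Phi-3 completion looks like and how response_regex pulls the JSON list out of it (the completion text below is fabricated for illustration):

# Illustrative only: a made-up Phi-3-style completion, parsed with the regex used in the script
import json
import re

response_regex = r"\<\|assistant\|\>(.*?)\<\|end\|\>"
raw = '<|user|>What are the 5 nouns that best represent this text?<|end|><|assistant|>["Hugo", "SEO", "keywords"]<|end|>'

matches = re.findall(response_regex, raw, re.DOTALL)
print(json.loads(matches[0]))  # ['Hugo', 'SEO', 'keywords']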
Conclusion
I have shown you how you can leverage a fully-local language model to perform a fairly sophisticated natural language task. It's hard to constrain a language model to do exactly what you want, but when you get there it is very powerful.
Try the related links to see how well it works!