Categorizing Personal Email Contacts with AI

Tags: ai, local-llms • Categories: Web Development

For years I’ve wanted to send out a yearly update (digital Christmas card of sorts) email to friends, both new and old. One of the problems is I don’t have a good address book that indicates who is a personal or work contact. I’ve been playing with datasette and I thought it would be fun to index all of my past emails and have AI categorize each one as a work, personal, or vendor contact.

Importing Emails

The first step is importing your emails from Gmail into a local database. The easiest way to do this is to use a combination of Google Takeout and a datasette plugin (note that it’s easiest to use my fork of the datasette plugin; the original package is unmaintained).

google-takeout-to-sqlite mbox emails.db '~/Downloads/Takeout/Mail/All mail Including Spam and Trash.mbox' --views --prefix personal

The --prefix option enables you to import multiple inboxes into the same sqlite DB. For instance, for a work inbox:

google-takeout-to-sqlite mbox emails.db '~/Downloads/Takeout Work/Mail/All mail Including Spam and Trash.mbox' --views --prefix work

--views enables some helpful views which make it a bit easier to run queries on the email dataset.

One thing I’d love to pursue further in the future is vector indexing all of my emails and attempting to auto-draft emails that have a high correlation to past emails I’ve written. I’m sure the big providers will add this in eventually, but it still seems a long way off, and you could add in some really neat personalization. Something like Superwhisper for email writing.

Filtering Imported Emails

Now that all emails from our Gmail account are imported into a local database, we need to filter out obvious emails that are tied to companies and individuals we don’t know:

SELECT * 
FROM to_address_book 
WHERE DOMAIN NOT LIKE '%.%.%'
  AND DOMAIN NOT IN ('amazonses.com', 'craigslist.org', 'amazon.com', 'mandrillapp.com', 'fut.io', 'followup.cc', 'todoist.net', 'tmomail.net')
  AND NOT (
    EMAIL LIKE 'customer%' OR 
    EMAIL LIKE 'support%' OR 
    EMAIL LIKE 'info%' OR 
    EMAIL LIKE 'billing%' OR 
    EMAIL LIKE 'care%' OR 
    EMAIL LIKE 'hi%' OR 
    EMAIL LIKE 'bounce%' OR 
    EMAIL LIKE 'hello%' OR 
    EMAIL LIKE 'newsletter%' OR 
    EMAIL LIKE 'team%' OR 
    EMAIL LIKE 'service%' OR
    EMAIL LIKE 'reply%' OR
    EMAIL LIKE 'noreply%' OR
    EMAIL LIKE 'notification%' OR
    EMAIL LIKE 'help%' OR
    EMAIL LIKE 'sales%'
  );

(note that this query and others are automatically included using the --views option in the datasette plugin)

This exact query took some modification to work for the types of people I contact. Once I had the query set, I was ready to copy the data to a new table:

CREATE TABLE categorized_personal_emails AS SELECT * FROM personal_filtered_to_address_book;

I wanted to copy this data to a new table so I could add some categorization columns.
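Adding those columns is a pair of ALTER TABLE statements. A minimal sketch using Python’s built-in sqlite3 module, run here against an in-memory stand-in table (the `verified` column name is my own illustration; `categorize` matches the column the cleanup query later in this post targets):

```python
import sqlite3

# Dry run against an in-memory copy; point this at emails.db for real use.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorized_personal_emails (email TEXT, domain TEXT)")

# Add columns for the AI-assigned category and the deliverability status.
for column in ("categorize", "verified"):
    conn.execute(f"ALTER TABLE categorized_personal_emails ADD COLUMN {column} TEXT")

cols = [row[1] for row in conn.execute("PRAGMA table_info(categorized_personal_emails)")]
print(cols)  # includes 'categorize' and 'verified'
```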

Categorizing Contacts with AI

Now for the fun part.

The query above filtered out the obvious non-personal contacts, but I wanted to get a more exact categorization since the list still included a couple thousand people. LLMs are really effective at categorization and summarization, and I’ve been wanting to experiment with some of the open source models (Mistral, Llama) for this sort of task, so I figured I’d give it a shot with this project.

Here’s what I wanted to do:

  • Write a script that accepts an email address and a SQLite database path as command line arguments
  • Look up some recent conversations by that email address.
  • Categorize that email into a couple of different buckets based on the conversation. Return the result as a JSON object.

Using a Local LLM

This ended up being more complex than I thought. The open source LLMs do have JSON output options, but they don’t work reliably. I have a pretty beefy Mac (96GB M2 Max), so I figured Mistral or Llama would run just fine.

Here’s what I tried:

  • Using litellm + Ollama + Mistral Instruct. It worked a couple of times, but often ran into some sort of infinite loop where Ollama never returned a JSON response and all of my GPU cores maxed out.
    • Additionally, Mistral seemed to forget the JSON return instructions unless I included them at the end of the prompt.
    • My instinct is that if I shortened the prompt by reducing the conversation length, I might have been able to get consistent results. But I wanted more conversation data to more accurately categorize the email.
    • I also found that litellm injected additional prompt instructions, which I did not want and which ended up messing up the returned result.
  • Ollama’s native Python bindings + Mistral text and instruct. This didn’t do any better and had the same "infinite loop" problem that litellm did.
    • If you look through the OpenAI documentation, they do mention that when specifying a JSON return type, if you don’t prompt it properly, you’ll get an infinite loop. However, I tried lots of different prompts locally, and none of them reliably avoided the infinite loop problem.
  • I tried OpenAI with the same exact prompt I was using for the local LLMs and it worked really well. Surprisingly, the response times were not even that bad. While open source models are impressive, OpenAI’s ease of use is far beyond what the open source ecosystem offers (sort of a macOS vs Linux analog).
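A generic mitigation I could have used with the local models (a sketch of the pattern, not what my final script does, since OpenAI’s JSON mode made it unnecessary) is to validate the model’s output and retry with a corrective nudge when it isn’t the JSON you asked for:

```python
import json

VALID_TYPES = {"work", "vendor", "friend"}

def categorize_with_retries(call_model, prompt, max_attempts=3):
    """call_model is any prompt -> str function (OpenAI, Ollama, etc.)."""
    for attempt in range(max_attempts):
        raw = call_model(prompt)
        try:
            contact_type = json.loads(raw)["contact_type"]
            if contact_type in VALID_TYPES:
                return contact_type
        except (json.JSONDecodeError, KeyError, TypeError):
            pass
        # Re-ask with the instruction repeated at the end of the prompt,
        # where the model seems least likely to forget it.
        prompt += '\n\nRespond ONLY with JSON like {"contact_type": "vendor"}.'
    raise ValueError(f"No valid JSON after {max_attempts} attempts")

# Stubbed model that fails once, then complies:
responses = iter(["Sure! The contact is a vendor.", '{"contact_type": "vendor"}'])
print(categorize_with_retries(lambda p: next(responses), "categorize bob@example.com"))
# → vendor
```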

Here’s the original script, which includes testing out the various providers. I’ve included an abridged version below that just uses the OpenAI bindings.

import click
import sqlite_utils
import json
import logging

PREFIX = ""
logger = logging.getLogger(__name__)

def call_openai(prompt):
    import os

    from openai import OpenAI

    # the client reads OPENAI_API_KEY from the environment automatically;
    # passing it explicitly fails fast if the variable is missing
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    completion = client.chat.completions.create(
        # only this specific version supports json_object
        model="gpt-3.5-turbo-1106",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Your response should be in JSON format.",
            },
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},
    )

    json_response = json.loads(completion.choices[0].message.content)
    return json_response["contact_type"]

def categorize(email, conversations: str):
    prompt = f"""
Past conversations:

{conversations}

IMPORTANT INSTRUCTIONS:

- Categorize email address {email} by analyzing the past included conversations.
- Only use 'work' if the contact is not personal and is related to my professional work (software engineering, startups, venture capital)
- If unsure, use 'vendor'
- Respond ONLY in JSON with either 'work', 'vendor', or 'friend' in a single 'contact_type' field. Example: `{{"contact_type": "vendor"}}`
"""
    contact_type = call_openai(prompt)

    assert contact_type

    if contact_type not in ["work", "vendor", "friend"]:
        raise ValueError(f"Invalid contact type: {contact_type}")

    return contact_type

def get_conversation(db, email, count=10):
    # am i lazy? yes I am.
    formatted_prefix = f"{PREFIX}_" if PREFIX else ""

    # note: email is interpolated directly into the SQL string; fine for a
    # trusted local dataset, but use a bound parameter otherwise
    query = f"""
    SELECT subject, body
    FROM {formatted_prefix}mbox_emails
    WHERE gmail_thread_id IN (
        SELECT gmail_thread_id
        FROM (
            SELECT gmail_thread_id, MAX(date) AS max_date
            FROM {formatted_prefix}mbox_emails
            WHERE EXISTS (
                SELECT 1
                FROM json_each({formatted_prefix}mbox_emails.all_contacts) AS contact
                WHERE json_extract(contact.value, '$.email') = '{email}'
            )
            GROUP BY gmail_thread_id
            ORDER BY MAX(date) DESC
            LIMIT {count}
        )
    )
    ORDER BY date DESC;
    """

    return db.execute_returning_dicts(query)

from litellm import token_counter

def format_conversations(conversations: list[dict]) -> str:
    filtered_conversations = []

    if len(conversations) == 0:
        raise ValueError("No conversations found.")

    for email in conversations:
        token_count = token_counter(
            # mistral
            model="openai/gpt-3.5-turbo",
            text=email["body"],
        )

        if token_count < 1_750:
            filtered_conversations.append(email)
        else:
            logger.info("Skipping email due to token count: %s", token_count)

    conversation = "\n\n".join(
        [
            # \r\n to \n
            f"```Subject: {email['subject']}\n\n{email['body']}```".replace(
                "\r\n", "\n"
            )
            # limit to a max of two conversations, mistral struggles with more
            # also fits within gpt3's token limit
            for email in filtered_conversations[0:2]
        ]
    )

    return conversation

def determine_contact_type(db, email):
    conversation = get_conversation(db, email)
    formatted_conversations = format_conversations(conversation)
    contact_type = categorize(email, formatted_conversations)
    return contact_type

def run_evals(db_path):
    db = sqlite_utils.Database(db_path)

    eval_list = [
        ("email1@example.com", "vendor"),
        ("email2@example.com", "friend"),
        ("email3@example.com", "friend"),
        ("email4@example.com", "vendor"),
        ("email5@example.com", "friend"),
        ("email6@example.com", "vendor"),
        ("email7@example.com", "vendor"),
        ("email8@example.com", "work"),
        ("email9@example.com", "work"),
    ]

    for email, expected_contact_type in eval_list:
        contact_type = determine_contact_type(db, email)
        print(f"email: {email}, contact_type: {contact_type}")
        assert contact_type == expected_contact_type

    print("All tests passed!")

@click.command()
@click.argument("db_path", type=click.Path(exists=True), required=True)
@click.option("--evals", is_flag=True, help="Run evals")
@click.option("--prefix", help="Table prefix, if there is one")
@click.option("--email", help="Email to categorize")
def cli(db_path, evals, prefix, email):
    db = sqlite_utils.Database(db_path)

    global PREFIX
    PREFIX = prefix

    if evals:
        run_evals(db_path)
    elif email:
        click.echo(determine_contact_type(db, email))
    else:
        raise ValueError("Must specify --evals or --email")

if __name__ == "__main__":
    cli()
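One caveat with the script above: get_conversation interpolates the email address into the SQL string with an f-string, which is fine for a trusted local dataset but breaks on addresses containing quotes. A hedged sketch of the same json_each() lookup using a bound parameter instead:

```python
import json
import sqlite3

# Tiny stand-in for the mbox_emails table with a JSON all_contacts column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mbox_emails (subject TEXT, all_contacts TEXT)")
conn.execute(
    "INSERT INTO mbox_emails VALUES (?, ?)",
    ("hi", json.dumps([{"email": "o'brien@example.com"}])),
)

email = "o'brien@example.com"  # would break naive f-string interpolation

# Same EXISTS / json_each pattern as get_conversation, with a ? placeholder.
rows = conn.execute(
    """
    SELECT subject FROM mbox_emails
    WHERE EXISTS (
        SELECT 1
        FROM json_each(mbox_emails.all_contacts) AS contact
        WHERE json_extract(contact.value, '$.email') = ?
    )
    """,
    (email,),
).fetchall()
print(rows)  # → [('hi',)]
```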

Map-Reduce over a CSV

Something I’ve always wanted to do is run a map-reduce operation over a CSV easily. This is something that the nifty command line tools (zq, jq, etc.) don’t let you do. OpenRefine does allow it, but you have to write Python 2 code and use a janky piece of software that isn’t well maintained.

However, I noticed that Datasette recently added the ability to write enrichment plugins. I hacked together an enrichment plugin that allows you to run an arbitrary shell script and save the resulting output to a new column if the script was successful.

pip install datasette
datasette install -U ~/Projects/python/datasette-enrichments-shell
datasette serve --root emails.db

At this point, I could run an enrichment on the table to categorize it:

cd ~/Projects/google-extraction/google-takeout-to-sqlite
source .venv/bin/activate
export OPENAI_API_KEY=sk-
jq -r '.email' | xargs -I {} python scripts/categorize.py emails.db --prefix personal --email {}

What I later realized is the environment which runs the script via the enrichments-shell inherits all of your shell variables, so you don’t need to write a wrapper script like the one above. You can paste the following right into the input box in the datasette UI:

jq -r '.email' | xargs -I {} python scripts/categorize.py emails.db --prefix personal --email {}

You can test the script locally before running the map-reduce operation across all of your data:

echo '{"email": "email@example.com"}' | jq -r '.email' | xargs -I {} python scripts/categorize.py ./emails.db --prefix personal --email {}

It’s easy for whitespace to get accidentally appended to the end of a script’s output. You can remove it with a simple SQLite query:

UPDATE categorized_personal_emails
SET categorize = TRIM(categorize, ' ' || CHAR(9) || CHAR(10) || CHAR(13));

In this statement:

  • ' ' is a space character.
  • CHAR(9) represents a tab.
  • CHAR(10) represents a newline.
  • CHAR(13) represents a carriage return.

(this is something that should really be added to the enrichment plugin itself)
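A quick way to sanity-check the TRIM behavior with Python’s built-in sqlite3 before running it against the real table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorized_personal_emails (categorize TEXT)")
conn.execute("INSERT INTO categorized_personal_emails VALUES (?)", ("vendor\n",))

# Same statement as above: strip spaces, tabs, newlines, and carriage
# returns from both ends of the column value.
conn.execute(
    "UPDATE categorized_personal_emails "
    "SET categorize = TRIM(categorize, ' ' || CHAR(9) || CHAR(10) || CHAR(13))"
)
print(conn.execute("SELECT categorize FROM categorized_personal_emails").fetchone())
# → ('vendor',)
```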

Email Verification

Since there were a lot of emails in this list, I wanted to make sure they were deliverable, so that the deliverability of my personal email address was not harmed.

Here’s another enrichment script I ran on a separate column which verified the emails I had in my list:

echo '{"email": "test@email.com"}' | jq -r '.email' | xargs -I{} http GET 'https://api.usebouncer.com/v1.1/email/verify' email=={} "x-api-key:$BOUNCER_API_KEY" | jq -r '.status' | tr -d '\n'
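One addition of my own (not part of the original workflow): before spending verification API credits, a cheap local syntax check can prune obviously malformed addresses. A regex is only a rough filter, not a deliverability check:

```python
import re

# Rough RFC-5322-ish pattern; real deliverability still needs the API check.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def looks_like_email(address: str) -> bool:
    """Return True if the address is at least syntactically plausible."""
    return bool(EMAIL_RE.match(address))

print(looks_like_email("test@email.com"))  # → True
print(looks_like_email("not-an-address"))  # → False
```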

Sending the Email

Because I really enjoy trying out new tools, I gave Mailmeteor a try. If you send an update email once a week or once a month, I could totally see the service making sense, but for my use case it just didn’t work: the free tier only sends out 50 emails a day for a couple of days and then just gives up.

The best approach here was to just write the email in Google Docs, copy and paste it into Gmail, and send out groups of BCC messages over a couple of days.