Scraping the web with OpenAI

July 29, 2023

Tags: ai, database, javascript • Categories: Learning

One of the really interesting LLM use cases is extracting structured data from unstructured data. In the old days (6mo ago), extracting structured data from web pages required custom XPath or CSS selectors for each website, and they constantly broke as the host changed its page structure. Extracting the price of a house on Redfin is a typical example.

This is why Plaid (and similar competitors) break so often: many of their integrations "screen scrape", which means they need a team of people updating XPath and CSS selectors on various bank sites (TreasuryDirect, for example, is broken constantly).

I built an open source database of venture capital firms that used this LLM-based approach to extract team member information from each firm. From what I can tell, companies like Pitchbook and Crunchbase use a mix of web scraping and manual collection (humans calling companies or viewing websites to extract information). You can largely replace the type of work those firms are doing with fancy LLM prompts. Dolthub has been using GPT to label badly-structured data coming from the open health pricing data regulation.

One interesting way of looking at data businesses is that they’ll all be obsolete once real-time data makes its way into LLM infrastructure. This is a hard problem, to be sure, but Google + Microsoft have a massive head start (they already have the infrastructure to scrape the open web in real time). It’s hard to imagine a world where LLMs don’t have real-time data on the world, and where the data brokers that used to house carefully aggregated + structured bits of the open web don’t become a carefully structured prompt.

Here are some of the technologies I wanted to tinker with on this project:

Dolthub. A versioned database is such a compelling idea.
Langchain. Heard great things about it, and after running into many application-level issues when using LLMs with the natural-language-to-SQL product I helped build, I wanted to use a higher-level abstraction.
OpenAI, specifically GPT-3.5 with the 16k token window.
pnpm. I can’t believe there’s yet another JavaScript package manager, but there is. And npm is still terrible, so I’m willing to tinker with another.
My node inspect fork. I wanted to keep iterating on it to get over my frustrations with the CLI node debugging experience.

In any case, here’s what I learned building this database!

Categorize then extract

Running every page of a website through an LLM is a bad idea, mostly from a time + cost perspective.

What I found to work surprisingly well is:

Scraping all URLs and some metadata about each page (title, etc.)
Passing this list to an LLM to categorize the URLs to decide which ones should be processed

After you have the categorized list, you can pass only the relevant pages to the LLM for extraction.
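
The two-step flow above can be sketched roughly like this. The category names, the shape of the `pages` objects, and both function names are made up for illustration; the actual LLM call is left out and represented by the JSON it is assumed to return.

```javascript
// Hypothetical category labels; pick whatever fits your scraping target.
const CATEGORIES = ["team", "portfolio", "contact", "other"];

// Step 1: build a compact prompt from URLs + titles only (no page bodies),
// so the categorization call stays cheap.
function buildCategorizePrompt(pages) {
  const list = pages.map((p, i) => `${i}. ${p.url} | ${p.title}`).join("\n");
  return [
    `Categorize each URL below as one of: ${CATEGORIES.join(", ")}.`,
    "Respond with a JSON array of objects with the keys `index` and `category`.",
    "",
    list,
  ].join("\n");
}

// Step 2: keep only the pages whose category is worth a full extraction call.
// `labels` is the parsed JSON the model is assumed to have returned.
function pagesToExtract(pages, labels, wanted = ["team"]) {
  return labels
    .filter((l) => wanted.includes(l.category) && pages[l.index] !== undefined)
    .map((l) => pages[l.index]);
}

const pages = [
  { url: "example.com/team", title: "Our Team" },
  { url: "example.com/blog/hello", title: "Hello World" },
];
const prompt = buildCategorizePrompt(pages);

// Pretend the model answered the prompt with this:
const labels = [
  { index: 0, category: "team" },
  { index: 1, category: "other" },
];
console.log(pagesToExtract(pages, labels).map((p) => p.url));
```

Referring to pages by index rather than by URL keeps the model's reply short and avoids problems when it slightly rewrites a URL.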

Convert HTML to Markdown

HTML is verbose. Passing raw HTML to LLMs will suck up a huge number of tokens and make the content harder to split into chunks (it’ll be harder to find natural breakpoints to separate content at).

I found that converting HTML to markdown works much better, and there’s a great HTML pipeline that allows you to remove problematic HTML elements (like images, especially images with inline SVG data).
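
To give a feel for the idea, here is a deliberately tiny, regex-based sketch that drops images, inline SVG, scripts, and styles, then converts a handful of tags to markdown. It is not the pipeline mentioned above and it is not robust against real-world HTML; a real project would use a proper HTML parser and converter. The function name and the set of handled tags are my own choices.

```javascript
// Toy HTML -> markdown converter. Regexes cannot parse arbitrary HTML,
// so treat this as an illustration only.
function htmlToMarkdown(html) {
  return html
    // Remove elements that burn tokens without adding extractable text.
    .replace(/<(script|style|svg)\b[\s\S]*?<\/\1>/gi, "")
    .replace(/<img\b[^>]*>/gi, "")
    // Headings h1-h6 become #-prefixed lines.
    .replace(/<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi, (_, level, text) =>
      `\n${"#".repeat(Number(level))} ${text.trim()}\n`)
    // Links keep their text and target.
    .replace(/<a\b[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, "[$2]($1)")
    // List items become bullets; paragraph ends and line breaks become newlines.
    .replace(/<li\b[^>]*>([\s\S]*?)<\/li>/gi, "- $1\n")
    .replace(/<\/p>|<br\s*\/?>/gi, "\n")
    // Drop every remaining tag, then tidy whitespace.
    .replace(/<[^>]+>/g, "")
    .replace(/[ \t]+\n/g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

const html = `
<h2 class="title">Team</h2>
<img src="data:image/svg+xml;base64,AAAA">
<p><a href="/people/jane">Jane Doe</a>, Partner</p>
<svg><path d="M0 0"/></svg>
`;
const markdown = htmlToMarkdown(html);
console.log(markdown);
console.log(`${html.length} chars of HTML -> ${markdown.length} chars of markdown`);
```

The headings that survive the conversion also make convenient breakpoints if the page has to be split into several chunks.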

LLMs are great at conforming to JSON Schema

It’s surprising how well LLMs will conform to a JSON Schema. With a well-structured prompt and a temperature of 0, you can get structured results back from your scraping question.
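
One way to assemble such a request, as a sketch: the schema, the prompt wording, and the function name below are illustrative rather than what the project actually used. The object follows the shape of an OpenAI chat completion request body (`model`, `temperature`, `messages`); actually sending it is left out.

```javascript
// Illustrative JSON Schema describing the data we want back.
const teamSchema = {
  type: "array",
  items: {
    type: "object",
    properties: {
      name: { type: "string" },
      title: { type: "string" },
    },
    required: ["name"],
  },
};

// Build a chat completion request body that asks for schema-conforming JSON.
function buildExtractionRequest(pageMarkdown, schema) {
  return {
    model: "gpt-3.5-turbo-16k",
    temperature: 0, // keep the output as deterministic as the API allows
    messages: [
      {
        role: "system",
        content:
          "You extract structured data from web pages. " +
          "Respond only with JSON that validates against this JSON Schema:\n" +
          JSON.stringify(schema, null, 2),
      },
      { role: "user", content: pageMarkdown },
    ],
  };
}

const request = buildExtractionRequest("# Team\n\nJane Doe, Partner", teamSchema);
console.log(JSON.stringify(request, null, 2));
```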

TypeChat, guidance, and other projects provide nice interfaces around this.

Heal invalid JSON with another LLM call

It’s possible that the JSON returned is truncated because of the LLM’s response length limit.

An interesting approach to solving this problem is passing the response back to the LLM and asking it to fix the response so it conforms to the specified schema.
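
A minimal sketch of that repair loop follows. `callModel` is a hypothetical function that takes a prompt string and resolves to the model's text reply; it stands in for whatever LLM client you use. The repair prompt wording and the retry limit are my own assumptions.

```javascript
// Try to parse the model output; on failure, ask the model to repair it.
// `callModel` is hypothetical: (prompt: string) => Promise<string>.
async function parseOrHeal(raw, schema, callModel, maxRepairs = 2) {
  let candidate = raw;
  for (let attempt = 0; attempt <= maxRepairs; attempt++) {
    try {
      return JSON.parse(candidate);
    } catch (err) {
      // Out of repair attempts: surface the last parse error.
      if (attempt === maxRepairs) throw err;
      candidate = await callModel(
        [
          "The following JSON is invalid or truncated.",
          `Parser error: ${err.message}`,
          "Fix it so it is valid JSON conforming to this JSON Schema.",
          "Respond with JSON only.",
          "",
          "Schema:",
          JSON.stringify(schema),
          "",
          "Invalid JSON:",
          candidate,
        ].join("\n")
      );
    }
  }
}

// Usage with a fake model that "repairs" a truncated array.
const truncated = '[{"name": "Jane Doe"}, {"name": "Jo';
const fakeModel = async () => '[{"name": "Jane Doe"}]';
parseOrHeal(truncated, { type: "array" }, fakeModel).then((team) => {
  console.log(team);
});
```

Note that this only checks that the result parses as JSON. Checking that it also matches the schema would need a JSON Schema validator on top.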

An unsolved problem here is continuing to generate the response instead of accepting the truncated answer. I haven’t looked for solutions in depth here, but th

Preventing fake data

Without specific prompting, the LLM will make up content that fits the JSON Schema passed to it. Here are some things which fixed this:

Changing the language tag on the code fence that contains the webpage content from nothing to markdown
Explicitly instructing the model to respond with an empty array (If you cannot find any team members, respond with an empty array.)
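
Put together, a prompt template with both fixes might look like the sketch below. The function name and the first instruction line are illustrative; only the empty-array sentence and the markdown-tagged fence come from the list above.

```javascript
// Build the fence marker programmatically instead of typing three backticks.
const FENCE = "`".repeat(3);

// Render an extraction prompt that includes both fixes:
// 1. the page content sits in a code fence tagged as markdown
// 2. the model is told explicitly what to do when nothing is found
function renderTeamPrompt(pageMarkdown) {
  return [
    "Extract the team members listed on the web page below.",
    "If you cannot find any team members, respond with an empty array.",
    "",
    `${FENCE}markdown`,
    pageMarkdown,
    FENCE,
  ].join("\n");
}

console.log(renderTeamPrompt("# Contact\n\nEmail us at hello@example.com"));
```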

The flow that worked to debug this is:

Throw a debugger statement in when this occurs and run the script with better-node-inspect or node inspect
Copy the prompt (copyToClipboard(renderedPrompt))
Paste into ChatGPT (or the OpenAI playground) and fiddle with the prompt
Adjust the prompt in code

In a real app, you’d want an integration test on this, but for a toy this was fine.

Thoughts on Langchain

There’s definitely a need for application-level abstractions to make it easier to work with LLMs. Langchain is one attempt at this, griptape is another, and guidance and TypeChat are similar as well.

The langchain abstractions aren’t great, the documentation is poor, and there are a lot of holes (like automatically dropping from the 16k variant to the 4k one when the input is < 4k tokens). However, from my interaction with the langchain team over GitHub, they are iterating quickly, refining the initial abstractions, and introducing new ideas. We are still so early in the game; it’s not clear what the right primitives and interaction patterns are. There’s still a lot of work to be done here and completely new approaches to be built.
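
The model-downgrade gap, for example, is easy enough to paper over by hand. The sketch below is an assumption-heavy illustration, not a Langchain feature: it estimates tokens with a rough 4-characters-per-token heuristic instead of a real tokenizer, and it hardcodes approximate context sizes for the two GPT-3.5 variants as of mid-2023.

```javascript
// Rough token estimate: about 4 characters per token for English text.
// This is a heuristic, not a tokenizer; use a real tokenizer for exact counts.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Pick the smaller-context model when prompt plus expected reply fit in it.
// Approximate context sizes: 4,096 tokens (4k variant), 16,384 (16k variant).
function pickModel(prompt, maxResponseTokens = 500) {
  const needed = estimateTokens(prompt) + maxResponseTokens;
  if (needed <= 4096) return "gpt-3.5-turbo";
  if (needed <= 16384) return "gpt-3.5-turbo-16k";
  throw new Error(`Prompt needs ~${needed} tokens, which exceeds the 16k window`);
}

console.log(pickModel("short prompt"));
console.log(pickModel("x".repeat(20000)));
```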

ChatGPT for seed data

One interesting use case for ChatGPT is treating it like a database you can query. You can prompt it to return a CSV of data from the internet, which is great for seeding a database:

Respond with the name and urls of technology venture capital firms. Use markdown block to contain your response. Format the list as a CSV with two columns: name and url. Urls should be formatted without https:// or http:// (just use the raw host). Return as many links as you have, the more the better!
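
Turning the reply into rows you can insert takes a little cleanup. Here is a minimal sketch, assuming the reply wraps a two-column CSV in a markdown block as the prompt requests; the function name and the sample firms are made up.

```javascript
// Parse a ChatGPT reply that wraps a two-column CSV (name,url) in a markdown block.
function parseSeedCsv(reply) {
  const fence = "`".repeat(3);
  // Grab the contents of the first fenced block, or fall back to the whole reply.
  const match = reply.match(new RegExp(`${fence}[a-z]*\\n([\\s\\S]*?)${fence}`));
  const body = match ? match[1] : reply;

  return body
    .split("\n")
    .map((line) => line.trim())
    // Skip blank lines and the header row.
    .filter((line) => line.length > 0 && !/^name\s*,\s*url$/i.test(line))
    .map((line) => {
      // Split on the last comma so a comma inside a firm name does not break the row.
      const idx = line.lastIndexOf(",");
      if (idx === -1) return null;
      const name = line.slice(0, idx).trim().replace(/^"|"$/g, "");
      // The model does not always obey the "no scheme" instruction, so strip it here too.
      const url = line.slice(idx + 1).trim().replace(/^https?:\/\//, "");
      return { name, url };
    })
    .filter((row) => row !== null && row.name !== "" && row.url !== "");
}

const reply = [
  "Here you go:",
  "`".repeat(3) + "csv",
  "name,url",
  "Example Ventures,example.vc",
  "Acme Capital, https://acme.example",
  "`".repeat(3),
].join("\n");
console.log(parseSeedCsv(reply));
```

Since the model can invent firms or URLs, it is worth checking that each host resolves before trusting a row.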