Scraping the web with OpenAI
One of the really interesting LLM use cases is extracting structured data from unstructured data. In the old days (six months ago), extracting structured data from web pages, such as the price of a house on Redfin, required custom XPath or CSS selectors for each website, and those selectors constantly broke as the host changed its page structure. This is why Plaid (and similar competitors) break so often: many of their integrations "screen scrape," which means they need a team of people updating XPath and CSS selectors on various bank sites (TreasuryDirect, for example, is broken constantly). I built an open source database of venture capital firms that used this approach to extract team member information from each firm…
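To make the contrast concrete, here is a minimal sketch of both styles, not code from any real project. The two HTML snippets are made-up stand-ins for a listing page before and after a redesign (not real Redfin markup), `complete` is a hypothetical placeholder for whatever chat-completion call you use, and `fake_model` is a stub so the example runs offline.

```python
import json
import xml.etree.ElementTree as ET
from typing import Callable, Optional

# Two hypothetical versions of the same listing page (not real Redfin markup).
PAGE_V1 = "<html><body><div class='listing'><span class='price'>$450,000</span></div></body></html>"
PAGE_V2 = "<html><body><section class='home'><p class='amount'>$450,000</p></section></body></html>"


def price_by_selector(page: str) -> Optional[str]:
    """Old approach: a hand-written path tied to one specific page layout."""
    root = ET.fromstring(page)
    node = root.find(".//div[@class='listing']/span[@class='price']")
    return node.text if node is not None else None


def price_by_llm(page: str, complete: Callable[[str], str]) -> Optional[str]:
    """LLM approach: describe the field you want and let the model locate it.

    `complete` is an assumed interface: it takes a prompt string and
    returns the model's text reply.
    """
    prompt = (
        "Extract the listing price from the HTML below. "
        'Reply with JSON only, in the form {"price": "<value or null>"}.\n\n'
        + page
    )
    return json.loads(complete(prompt)).get("price")


def fake_model(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs without network access."""
    return json.dumps({"price": "$450,000" if "$450,000" in prompt else None})
```

With these definitions, `price_by_selector(PAGE_V1)` finds the price but `price_by_selector(PAGE_V2)` returns `None`, because the path no longer matches the redesigned markup. `price_by_llm` never mentions the markup at all, so the same prompt can be reused across layouts; whether a real model answers correctly on a given page is something you would still need to check.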