Scraping with Ollama and Crawl4AI

When you need to turn unstructured website content into usable data, a simple HTTP request is often not enough, especially when the site relies on JavaScript or dynamic rendering. In this tutorial, we'll walk through how to use Crawl4AI, Ollama, and CloudBrowser to automate the crawl, extract the data, and format it into clean JSON.


Why This Stack

  • Crawl4AI — Orchestrates the crawling, handles concurrency, and chunks large content.

  • CloudBrowser — Runs cloud-hosted Puppeteer browsers with Chrome DevTools Protocol (CDP) support.

  • Ollama — Runs the LLM that turns HTML/Markdown into structured data based on your schema.


Requirements

  • Basic Python knowledge.

  • Crawl4AI and Ollama installed (see the setup steps below).

  • A CloudBrowser account and API token.


Environment Setup

1. Install Crawl4AI

Follow the instructions from the official GitHub repository:

pip install crawl4ai
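
Recent Crawl4AI releases also ship a post-install command that finishes the browser setup for local crawling. It isn't strictly required for this tutorial, since we attach to a remote CloudBrowser instance over CDP, but it is worth running if you also plan to crawl with a local browser:

crawl4ai-setup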

2. Install and Run Ollama

If you don't have a remote Ollama server, install and run it locally:
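
For example (assuming the gemma3 model tag; substitute whichever model you want Ollama to serve):

# download the Gemma 3 weights (one-time)
ollama pull gemma3

# start the model; the Ollama API listens on port 11434 by default
ollama run gemma3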

This launches the Gemma 3 model on http://localhost:11434.

3. Request a CloudBrowser WebSocket

CloudBrowser sessions are opened by sending a POST request to the API. For example:
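
Here is a curl sketch. The endpoint path and request body below are placeholders, not the exact CloudBrowser API; copy the real session-creation URL and payload from your CloudBrowser dashboard or API reference, and replace YOUR_API_TOKEN with your own token.

# hypothetical endpoint: substitute the session-creation URL from the CloudBrowser docs
curl -X POST "https://api.cloudbrowser.ai/v1/sessions" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'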

The API will respond with JSON containing the address field:
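
An illustrative response looks like the following (the WebSocket value is a placeholder, and fields other than address may vary):

{
  "address": "wss://your-cloudbrowser-host/devtools/browser/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}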

Use this address as your Chrome DevTools Protocol WebSocket URL.


Example: Scraping Pricing Plans

The following Python script first requests a CloudBrowser session to get the WebSocket URL, then runs Crawl4AI.
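
Below is a sketch of such a script. It assumes a recent Crawl4AI release (the AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig, and LLMExtractionStrategy interfaces), the same placeholder CloudBrowser endpoint as above, and an example pricing page URL. Parameter names such as cdp_url and use_managed_browser may differ slightly between versions, so check the Crawl4AI docs for the release you installed.

import asyncio

import requests
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    LLMConfig,
)
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Placeholders: use the real endpoint and token from your CloudBrowser account.
CLOUDBROWSER_API_URL = "https://api.cloudbrowser.ai/v1/sessions"
CLOUDBROWSER_TOKEN = "YOUR_API_TOKEN"
# Example target; point this at the pricing page you want to scrape.
TARGET_URL = "https://example.com/pricing"

# JSON schema the LLM should fill in for the pricing plans.
PRICING_SCHEMA = {
    "type": "object",
    "properties": {
        "plans": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "string"},
                    "features": {"type": "array", "items": {"type": "string"}},
                },
            },
        }
    },
}


def request_cloudbrowser_session() -> str:
    """Open a CloudBrowser session and return its CDP WebSocket address."""
    response = requests.post(
        CLOUDBROWSER_API_URL,
        headers={"Authorization": f"Bearer {CLOUDBROWSER_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["address"]


async def main() -> None:
    ws_url = request_cloudbrowser_session()

    # Attach Crawl4AI to the remote CloudBrowser instance instead of a local browser.
    browser_config = BrowserConfig(cdp_url=ws_url, use_managed_browser=True)

    # Ollama (Gemma 3) turns the page content into JSON matching the schema.
    extraction = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="ollama/gemma3", base_url="http://localhost:11434"),
        schema=PRICING_SCHEMA,
        extraction_type="schema",
        instruction="Extract every pricing plan with its name, price, and list of features.",
    )

    run_config = CrawlerRunConfig(
        extraction_strategy=extraction,
        cache_mode=CacheMode.BYPASS,  # always fetch a fresh copy of the page
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=TARGET_URL, config=run_config)
        print(result.extracted_content)


if __name__ == "__main__":
    asyncio.run(main())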


How It Works

  1. CloudBrowser API creates a browser session and returns a WebSocket URL.

  2. Crawl4AI connects to that WebSocket, navigates to the target page, and extracts DOM content.

  3. Ollama (Gemma 3) formats the result into JSON based on your schema.

  4. The final structured data is printed in your console.
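
If you want to process the result further rather than just print it, the extracted content is a JSON string you can parse back into Python objects. Continuing from the script above (the exact shape of the parsed data depends on your schema and Crawl4AI version):

import json

# result.extracted_content is a JSON string produced by the LLM extraction step
data = json.loads(result.extracted_content)
print(json.dumps(data, indent=2))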


Conclusion

This approach ensures that you dynamically request a fresh CloudBrowser session before crawling, keeping the session alive only as long as you need it. CloudBrowser + Crawl4AI + Ollama provides a powerful, scalable stack for turning dynamic, JavaScript-heavy pages into clean, structured data.
