Scraping with Ollama and Crawl4AI
When you need to extract structured data from websites, a simple HTTP request is often not enough, especially when the site relies on JavaScript or dynamic rendering. In this tutorial, we'll walk through how to use Crawl4AI, Ollama, and CloudBrowser to automate the crawl, extract the data, and format it into clean JSON.
Why This Stack
Crawl4AI — Orchestrates the crawling, handles concurrency, and chunks large content.
CloudBrowser — Runs cloud-hosted Puppeteer browsers with Chrome DevTools Protocol (CDP) support.
Ollama — Runs the LLM that turns HTML/Markdown into structured data based on your schema.
Requirements
Basic Python knowledge.
Installed Crawl4AI and Ollama.
A CloudBrowser account and API token.
Environment Setup
1. Install Crawl4AI
Follow the instructions from the official GitHub repository:
pip install crawl4ai

2. Install and Run Ollama
If you don't have a remote Ollama server, install and run it locally:
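For example, assuming you want the gemma3 tag from the Ollama model library:

```bash
# Pulls the Gemma 3 model on first use; the Ollama API server listens on http://localhost:11434
ollama run gemma3
```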
This launches the Gemma 3 model, served through Ollama's API at http://localhost:11434.
3. Request a CloudBrowser WebSocket
CloudBrowser sessions are opened by sending a POST request to the API. For example:
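The exact endpoint and payload come from your CloudBrowser dashboard and API documentation; the sketch below assumes a hypothetical session-open URL and a bearer token, so adjust both to match your account.

```bash
# Hypothetical endpoint and payload -- take the real values from CloudBrowser's API docs
curl -X POST "https://api.cloudbrowser.ai/v1/sessions" \
  -H "Authorization: Bearer $CLOUDBROWSER_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'
```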
The API will respond with JSON containing the address field:
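The shape of the response is roughly as follows (placeholder values, not a real session):

```json
{
  "address": "ws://<cloudbrowser-host>/devtools/browser/<session-id>"
}
```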
Use this address as your Chrome DevTools Protocol WebSocket URL.
Example: Scraping Pricing Plans
The following Python script first requests a CloudBrowser session to get the WebSocket URL, then runs Crawl4AI.
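A sketch of that script is shown below, under a few assumptions: the CloudBrowser endpoint, the target URL, and the `PricingPlan` schema are placeholders to adapt, and the Crawl4AI parameter names used here (`cdp_url`, `use_managed_browser`, `LLMConfig`) match recent releases of the library, so check your installed version's docs if anything differs.

```python
import asyncio
import json
import os
from typing import List

import requests
from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Hypothetical endpoint -- take the real one from CloudBrowser's API docs
CLOUDBROWSER_API_URL = "https://api.cloudbrowser.ai/v1/sessions"
API_TOKEN = os.environ["CLOUDBROWSER_API_TOKEN"]
TARGET_URL = "https://example.com/pricing"  # page you want to scrape


class PricingPlan(BaseModel):
    """Schema the LLM fills in for each pricing plan found on the page."""
    name: str
    price: str
    features: List[str]


def open_cloudbrowser_session() -> str:
    """Request a fresh browser session and return its CDP WebSocket URL."""
    resp = requests.post(
        CLOUDBROWSER_API_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["address"]


async def main() -> None:
    ws_url = open_cloudbrowser_session()

    # Attach Crawl4AI to the remote CloudBrowser instance over CDP
    browser_config = BrowserConfig(use_managed_browser=True, cdp_url=ws_url)

    # Ollama must be running locally, or point base_url at a remote Ollama server
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="ollama/gemma3", base_url="http://localhost:11434"),
        schema=PricingPlan.model_json_schema(),
        extraction_type="schema",
        instruction="Extract every pricing plan with its name, price, and feature list.",
    )

    run_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=TARGET_URL, config=run_config)
        # extracted_content is a JSON string produced by the LLM extraction strategy
        print(json.dumps(json.loads(result.extracted_content), indent=2))


if __name__ == "__main__":
    asyncio.run(main())
```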
How It Works
CloudBrowser API creates a browser session and returns a WebSocket URL.
Crawl4AI connects to that WebSocket, navigates to the target page, and extracts DOM content.
Ollama (Gemma 3) formats the result into JSON based on your schema.
The final structured data is printed in your console.
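Depending on the page and the model, the printed JSON will look roughly like this (illustrative values only, not real output):

```json
[
  {
    "name": "Starter",
    "price": "$9/month",
    "features": ["1 project", "Email support"]
  },
  {
    "name": "Pro",
    "price": "$29/month",
    "features": ["Unlimited projects", "Priority support"]
  }
]
```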
Conclusion
This approach ensures that you dynamically request a fresh CloudBrowser session before crawling, keeping the session alive only as long as you need it. CloudBrowser + Crawl4AI + Ollama provides a powerful, scalable way to turn dynamic web pages into clean, structured JSON.