Scraping with Ollama and Crawl4AI

When you need to extract structured data from websites, a simple HTTP request is often not enough, especially if the site relies on JavaScript or dynamic rendering. In this tutorial, we'll walk through how to use Crawl4AI, Ollama, and CloudBrowser to automate the crawl, extract the data, and format it into clean JSON.


Why This Stack

  • Crawl4AI — Orchestrates the crawling, handles concurrency, and chunks large content.

  • CloudBrowser — Runs cloud-hosted Puppeteer browsers with Chrome DevTools Protocol (CDP) support.

  • Ollama — Runs the LLM that turns HTML/Markdown into structured data based on your schema.


Requirements

  • Basic Python knowledge.

  • Installed Crawl4AI and Ollama.

  • A CloudBrowser account and API token.


Environment Setup

1. Install Crawl4AI

Follow the instructions from the official GitHub repository:

pip install crawl4ai
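To confirm the install worked before wiring everything together, you can run a quick optional check (the version string is just whatever pip resolved):

from importlib.metadata import version

import crawl4ai  # raises ImportError if the install failed

print("crawl4ai", version("crawl4ai"))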

2. Install and Run Ollama

If you don't have a remote Ollama server, install and run it locally:

brew install ollama
ollama pull gemma3:4b
ollama serve

This starts the Ollama server on http://localhost:11434, with the Gemma 3 model available for requests.
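Before moving on, you can verify that the server is reachable and the model was actually pulled by listing the local models through Ollama's HTTP API. This is an optional sanity check and assumes the default port 11434:

import requests

# Ask the local Ollama server which models it has available.
resp = requests.get("http://localhost:11434/api/tags", timeout=10)
resp.raise_for_status()
print([m["name"] for m in resp.json().get("models", [])])  # should include "gemma3:4b"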

3. Request a CloudBrowser WebSocket

CloudBrowser sessions are opened by sending a POST request to the API. For example:

curl -X POST https://production.cloudbrowser.ai/api/v1/Browser/Open \
-H "Authorization: Bearer $CLOUDBROWSER_ACCESS_KEY" \
-d '{
  "Args": ["--no-sandbox", "--disable-gpu"],
  "IgnoredDefaultArgs": ["--enable-automation"],
  "Headless": true,
  "Stealth": true,
  "Browser": 2,
  "KeepOpen": 60,
  "Label": "app-worker01",
  "saveSession": true,
  "recoverSession": true
}'

The API will respond with JSON containing the address field:

{
  "address": "browserprovider-worker-*.cloudbrowser.ai/devtools/browser/*",
  "status": 200
}

Use this address as your Chrome DevTools Protocol WebSocket URL.
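In Python, the same request can be wrapped in a small helper that returns a ready-to-use CDP URL. This is only a sketch mirroring the curl call above; the helper name and the trimmed-down payload are illustrative, and you can pass the same options shown in the full example below:

import os
import requests

def open_cloudbrowser_session() -> str:
    # Illustrative helper: open a CloudBrowser session and return the CDP WebSocket URL.
    response = requests.post(
        "https://production.cloudbrowser.ai/api/v1/Browser/Open",
        headers={
            "Authorization": f"Bearer {os.environ['CLOUDBROWSER_ACCESS_KEY']}",
            "Content-Type": "application/json",
        },
        json={"Headless": True, "Stealth": True, "Browser": 2, "KeepOpen": 60},
        timeout=30,
    )
    response.raise_for_status()
    body = response.json()
    if "address" not in body:
        raise RuntimeError(f"Unexpected CloudBrowser response: {body}")
    return f"wss://{body['address']}"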


Example: Scraping Pricing Plans

The following Python script first requests a CloudBrowser session to get the WebSocket URL, then runs Crawl4AI.

import asyncio
import os
import requests
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig, LLMConfig, LLMExtractionStrategy
from pydantic import BaseModel

# Step 1: Open a CloudBrowser session to get WebSocket URL
CLOUDBROWSER_API_KEY = os.environ.get("CLOUDBROWSER_ACCESS_KEY")
response = requests.post(
    "https://production.cloudbrowser.ai/api/v1/Browser/Open",
    headers={"Authorization": f"Bearer {CLOUDBROWSER_API_KEY}"},
    json={
        "Args": ["--no-sandbox", "--disable-gpu"],
        "IgnoredDefaultArgs": ["--enable-automation"],
        "Headless": True,
        "Stealth": True,
        "Browser": 2,
        "KeepOpen": 60,
        "Label": "app-worker01",
        "saveSession": True,
        "recoverSession": True
    }
)
response.raise_for_status()
ws_url = response.json()["address"]

# Step 2: Define the output schema
class Pricing(BaseModel):
    name: str
    price: str

class PricingPlans(BaseModel):
    pricing_plans: List[Pricing]

# Step 3: Configure CloudBrowser with the returned WebSocket
browser_config = BrowserConfig(
    headless=True,
    verbose=True,
    browser_mode="cdp",
    cdp_url=f"wss://{ws_url}"
)

# Step 4: Define the extraction strategy using Ollama + Gemma 3
extraction_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="ollama/gemma3:4b", base_url="http://localhost:11434"),
    extraction_type="schema",
    schema=PricingPlans.model_json_schema(),
    instruction="Extract all the pricing plans JSON array containing their 'name' and 'price'.",
    chunk_token_threshold=1200,
    overlap_rate=0.1,
    apply_chunking=True,
    input_format="markdown",
    verbose=True
)

# Step 5: Configure and run the crawler
crawl_config = CrawlerRunConfig(
    extraction_strategy=extraction_strategy,
    cache_mode=CacheMode.BYPASS
)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://cloudbrowser.ai/", config=crawl_config)

        if result.success:
            print("Extracted content:", result.extracted_content)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

How It Works

  1. CloudBrowser API creates a browser session and returns a WebSocket URL.

  2. Crawl4AI connects to that WebSocket, navigates to the target page, and extracts DOM content.

  3. Ollama (Gemma 3) formats the result into JSON based on your schema.

  4. The final structured data is printed in your console (the optional sketch after this list shows how to load it back into the Pydantic models).
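Because result.extracted_content is returned as a JSON string, you can optionally validate it against the same Pydantic models you handed to the extraction strategy. The exact shape depends on chunking (Crawl4AI typically returns a JSON array of extracted items, sometimes with extra bookkeeping keys), so treat this as a best-effort post-processing sketch that reuses the Pricing model from the script above:

import json
from typing import List

from pydantic import ValidationError

def parse_plans(extracted_content: str) -> List[Pricing]:
    # Best-effort parsing: keep only items that match the Pricing schema.
    plans: List[Pricing] = []
    for item in json.loads(extracted_content):
        try:
            plans.append(Pricing.model_validate(item))
        except ValidationError:
            continue  # skip chunk-level metadata or malformed items
    return plans

If nothing matches the schema you simply get an empty list back, which keeps the scraping loop resilient instead of crashing on one bad chunk.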


Conclusion

This approach ensures that you dynamically request a fresh CloudBrowser session before crawling, keeping the session alive only as long as you need it. CloudBrowser + Crawl4AI + Ollama provides a powerful, scalable way to turn dynamic, JavaScript-heavy pages into clean, structured JSON.
