Scraping with Ollama and Crawl4AI
When you need to extract unstructured information from websites, a simple HTTP request is not enough — especially if the site uses JavaScript or dynamic rendering. In this tutorial, we'll walk through how to use Crawl4AI, Ollama, and CloudBrowser to automate the process, extract data, and format it into clean JSON.
Why This Stack
Crawl4AI — Orchestrates the crawling, handles concurrency, and chunks large content.
CloudBrowser — Runs cloud-hosted Puppeteer browsers with Chrome DevTools Protocol (CDP) support.
Ollama — Runs the LLM that turns HTML/Markdown into structured data based on your schema.
Requirements
Basic Python knowledge.
Installed Crawl4AI and Ollama.
A CloudBrowser account and API token.
Environment Setup
1. Install Crawl4AI
Follow the instructions from the official GitHub repository:
pip install crawl4ai
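Recent Crawl4AI releases also ship a one-time post-install helper that downloads the Playwright browser binaries Crawl4AI drives locally. If the command below is available in your version, run it once after installing; otherwise follow the setup steps in the repository README:
crawl4ai-setup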
2. Install and Run Ollama
If you don't have a remote Ollama server, install and run it locally:
brew install ollama
ollama pull gemma3:4b
ollama serve
This starts the Ollama server on http://localhost:11434, serving the Gemma 3 model you just pulled.
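To confirm the server is reachable and the model is on disk, you can query Ollama's local API; the /api/tags endpoint lists the models currently available:
curl http://localhost:11434/api/tags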
3. Request a CloudBrowser WebSocket
CloudBrowser sessions are opened by sending a POST request to the API. For example:
curl -X POST https://production.cloudbrowser.ai/api/v1/Browser/Open \
  -H "Authorization: Bearer $CLOUDBROWSER_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "Args": ["--no-sandbox", "--disable-gpu"],
    "IgnoredDefaultArgs": ["--enable-automation"],
    "Headless": true,
    "Stealth": true,
    "Browser": 2,
    "KeepOpen": 60,
    "Label": "app-worker01",
    "saveSession": true,
    "recoverSession": true
  }'
The API responds with JSON containing an address field:
{
  "address": "browserprovider-worker-*.cloudbrowser.ai/devtools/browser/*",
  "status": 200
}
Use this address as your Chrome DevTools Protocol (CDP) WebSocket URL.
Example: Scraping Pricing Plans
The following Python script first requests a CloudBrowser session to obtain the WebSocket URL, then runs Crawl4AI with an Ollama-backed extraction strategy to pull the pricing plans into a schema.
import asyncio
import os
from typing import List

import requests
from pydantic import BaseModel

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    LLMConfig,
    LLMExtractionStrategy,
)

# Step 1: Open a CloudBrowser session to get the WebSocket URL
CLOUDBROWSER_API_KEY = os.environ.get("CLOUDBROWSER_ACCESS_KEY")

response = requests.post(
    "https://production.cloudbrowser.ai/api/v1/Browser/Open",
    headers={"Authorization": f"Bearer {CLOUDBROWSER_API_KEY}"},
    json={
        "Args": ["--no-sandbox", "--disable-gpu"],
        "IgnoredDefaultArgs": ["--enable-automation"],
        "Headless": True,
        "Stealth": True,
        "Browser": 2,
        "KeepOpen": 60,
        "Label": "app-worker01",
        "saveSession": True,
        "recoverSession": True,
    },
)
response.raise_for_status()  # fail fast if the session could not be opened
ws_url = response.json()["address"]

# Step 2: Define the output schema
class Pricing(BaseModel):
    name: str
    price: str

class PricingPlans(BaseModel):
    pricing_plans: List[Pricing]

# Step 3: Point Crawl4AI at the CloudBrowser session over CDP
browser_config = BrowserConfig(
    headless=True,
    verbose=True,
    browser_mode="cdp",
    cdp_url=f"wss://{ws_url}",
)

# Step 4: Define the extraction strategy using Ollama + Gemma 3
extraction_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="ollama/gemma3:4b", base_url="http://localhost:11434"),
    extraction_type="schema",
    schema=PricingPlans.model_json_schema(),
    instruction="Extract all pricing plans as a JSON array of objects with 'name' and 'price'.",
    chunk_token_threshold=1200,
    overlap_rate=0.1,
    apply_chunking=True,
    input_format="markdown",
    verbose=True,
)

# Step 5: Configure and run the crawler
crawl_config = CrawlerRunConfig(
    extraction_strategy=extraction_strategy,
    cache_mode=CacheMode.BYPASS,
)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://cloudbrowser.com/", config=crawl_config)
        if result.success:
            print("Extracted content:", result.extracted_content)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
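result.extracted_content comes back as a JSON string. Its exact shape depends on what the model returns, but if it is a flat list of plan objects you can parse it with the standard library and re-validate each item against the Pricing model. A minimal sketch under that assumption (the print_plans helper is illustrative, not part of Crawl4AI):
import json

def print_plans(extracted_content: str) -> None:
    # Parse the JSON string produced by the LLM extraction step
    items = json.loads(extracted_content)
    for item in items:
        # Re-check each item against the Pydantic model defined above;
        # this raises a ValidationError if the LLM output drifts from the schema
        plan = Pricing.model_validate(item)
        print(f"{plan.name}: {plan.price}")
Inside main(), you could call print_plans(result.extracted_content) in place of the plain print statement.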
How It Works
1. The CloudBrowser API creates a browser session and returns a WebSocket URL.
2. Crawl4AI connects to that WebSocket, navigates to the target page, and converts the page content to Markdown.
3. Ollama (Gemma 3) turns that Markdown into JSON matching your schema.
4. The final structured data is printed in your console.
Conclusion
This approach ensures that you dynamically request a fresh CloudBrowser session before crawling, keeping the session alive only as long as you need it. CloudBrowser + Crawl4AI + Ollama provides a powerful, scalable way to turn dynamic, JavaScript-heavy pages into clean, structured JSON.