# Scraping with Ollama and Crawl4AI

When you need to extract unstructured information from websites, a simple HTTP request is often not enough, especially if the site relies on JavaScript or dynamic rendering.\
In this tutorial, we'll walk through how to use **Crawl4AI**, **Ollama**, and **CloudBrowser** to automate the process, extract data, and format it into clean JSON.

***

### Why This Stack

* **Crawl4AI** — Orchestrates the crawling, handles concurrency, and chunks large content.
* **CloudBrowser** — Runs cloud-hosted Puppeteer browsers with Chrome DevTools Protocol (CDP) support.
* **Ollama** — Runs the LLM that turns HTML/Markdown into structured data based on your schema.

***

### Requirements

* Basic **Python** knowledge.
* **Crawl4AI** and **Ollama** installed (covered in the setup steps).
* A **CloudBrowser** account and API token.

***

### Environment Setup

#### 1. Install Crawl4AI

Install the package from PyPI; the official GitHub repository has the full instructions:

```bash
pip install crawl4ai
```

#### 2. Install and Run Ollama

If you don't have a remote Ollama server, install and run it locally (the commands here use Homebrew on macOS):

```bash
brew install ollama
ollama pull gemma3:4b
ollama serve
```

This starts the Ollama server on `http://localhost:11434`, which loads the pulled **Gemma 3** model when a request asks for it.
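
Before running the crawler, it can help to confirm that the server is reachable and that the model tag the script requests is present. The sketch below assumes Ollama's `/api/tags` endpoint, which lists locally available models; the helper names `model_available` and `fetch_tags` are illustrative and not part of any library.

```python
import json
import urllib.request


def model_available(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in an /api/tags response body."""
    names = {m.get("name") for m in tags_payload.get("models", [])}
    return model in names


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of local models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Usage (requires a running `ollama serve`):
#   model_available(fetch_tags(), "gemma3:4b")
```

If the check returns `False`, pull the tag named in the script before crawling.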

#### 3. Request a CloudBrowser WebSocket

CloudBrowser sessions are opened by sending a POST request to the API. For example:

```bash
curl -X POST https://production.cloudbrowser.ai/api/v1/Browser/Open \
-H "Authorization: Bearer $CLOUDBROWSER_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
  "Args": ["--no-sandbox", "--disable-gpu"],
  "IgnoredDefaultArgs": ["--enable-automation"],
  "Headless": true,
  "Stealth": true,
  "Browser": 2,
  "KeepOpen": 60,
  "Label": "app-worker01",
  "saveSession": true,
  "recoverSession": true
}'
```

The API will respond with JSON containing the `address` field:

```json
{
  "address": "browserprovider-worker-*.cloudbrowser.ai/devtools/browser/*",
  "status": 200
}
```

Use this `address` as your Chrome DevTools Protocol WebSocket URL. The value has no scheme, so prefix it with `wss://` before connecting.
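
A minimal sketch of turning the response into a usable URL. It assumes that a `status` other than 200 signals a failed request (the response above shows only the success case), and `ws_url_from_response` is a hypothetical helper name.

```python
def ws_url_from_response(payload: dict) -> str:
    """Build a CDP WebSocket URL from a Browser/Open response body."""
    # Assumption: any status other than 200 means no browser was opened.
    if payload.get("status") != 200:
        raise RuntimeError(f"CloudBrowser returned status {payload.get('status')}")
    address = payload.get("address")
    if not address:
        raise RuntimeError("CloudBrowser response has no 'address' field")
    # Keep an existing scheme; otherwise default to secure WebSocket.
    if address.startswith(("ws://", "wss://")):
        return address
    return f"wss://{address}"
```

Failing early here gives a clearer error than a connection timeout later in the crawl.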

***

### Example: Scraping Pricing Plans

The following Python script first requests a CloudBrowser session to get the WebSocket URL, then runs Crawl4AI.

```python
import asyncio
import os
import requests
from typing import List
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    LLMConfig,
    LLMExtractionStrategy,
)
from pydantic import BaseModel

# Step 1: Open a CloudBrowser session to get WebSocket URL
CLOUDBROWSER_API_KEY = os.environ.get("CLOUDBROWSER_ACCESS_KEY")
response = requests.post(
    "https://production.cloudbrowser.ai/api/v1/Browser/Open",
    headers={"Authorization": f"Bearer {CLOUDBROWSER_API_KEY}"},
    json={
        "Args": ["--no-sandbox", "--disable-gpu"],
        "IgnoredDefaultArgs": ["--enable-automation"],
        "Headless": True,
        "Stealth": True,
        "Browser": 2,
        "KeepOpen": 60,
        "Label": "app-worker01",
        "saveSession": True,
        "recoverSession": True
    }
)
response.raise_for_status()  # fail early on HTTP errors such as an invalid token
ws_url = response.json()["address"]

# Step 2: Define the output schema
class Pricing(BaseModel):
    name: str
    href: str

class PricingPlans(BaseModel):
    pricing_plans: List[Pricing]

# Step 3: Configure CloudBrowser with the returned WebSocket
browser_config = BrowserConfig(
    headless=True,
    verbose=True,
    browser_mode="cdp",
    cdp_url=f"wss://{ws_url}"
)

# Step 4: Define the extraction strategy using Ollama + Gemma 3
extraction_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="ollama/gemma3:4b", base_url="http://localhost:11434"),
    extraction_type="schema",
    schema=PricingPlans.model_json_schema(),
    instruction="Extract every pricing plan as a JSON array of objects with the fields 'name' and 'href'.",
    chunk_token_threshold=1200,
    overlap_rate=0.1,
    apply_chunking=True,
    input_format="markdown",
    verbose=True
)

# Step 5: Configure and run the crawler
crawl_config = CrawlerRunConfig(
    extraction_strategy=extraction_strategy,
    cache_mode=CacheMode.BYPASS
)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://cloudbrowser.com/", config=crawl_config)

        if result.success:
            print("Extracted content:", result.extracted_content)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
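
`result.extracted_content` is a JSON string, so downstream code usually parses it first. The sketch below assumes the model returns either objects wrapping a `pricing_plans` list or bare plan objects, possibly one batch per chunk. The exact shape depends on the model and the Crawl4AI version, de-duplicating by `name` is a choice made for this example, and `collect_plans` is a hypothetical helper.

```python
import json


def collect_plans(extracted_content: str) -> list[dict]:
    """Flatten extraction output into a de-duplicated list of plan dicts."""
    data = json.loads(extracted_content)
    blocks = data if isinstance(data, list) else [data]
    plans, seen = [], set()
    for block in blocks:
        if not isinstance(block, dict):
            continue
        # Accept both {"pricing_plans": [...]} wrappers and bare plan objects.
        items = block.get("pricing_plans")
        if not isinstance(items, list):
            items = [block]
        for item in items:
            if not isinstance(item, dict) or "name" not in item:
                continue  # skip error markers and malformed entries
            if item["name"] not in seen:  # overlapping chunks can repeat a plan
                seen.add(item["name"])
                plans.append(item)
    return plans
```

Inside `main()`, this could be called as `collect_plans(result.extracted_content)` once `result.success` is true.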

***

### How It Works

1. **CloudBrowser API** creates a browser session and returns a WebSocket URL.
2. **Crawl4AI** connects to that WebSocket, navigates to the target page, and converts the rendered DOM to Markdown, split into chunks.
3. **Ollama (Gemma 3)** extracts the fields defined in your schema from each chunk and returns them as JSON.
4. The final structured data is printed in your console.
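
To see what the `chunk_token_threshold` and `overlap_rate` settings in the script control, here is a simplified sketch of overlapping chunking. It is not Crawl4AI's implementation: it treats whitespace-separated words as tokens, and `chunk_words` is an illustrative name.

```python
def chunk_words(text: str, threshold: int = 1200, overlap_rate: float = 0.1) -> list[str]:
    """Split text into chunks of at most `threshold` words with overlap."""
    words = text.split()
    overlap = int(threshold * overlap_rate)
    step = max(threshold - overlap, 1)  # advance by the non-overlapping part
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + threshold]))
        if start + threshold >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

The overlap means a pricing plan that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some repeated input to the model.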

***

### Conclusion

With this approach, the script requests a fresh CloudBrowser session before each crawl, so a session stays open only as long as you need it.\
**CloudBrowser + Crawl4AI + Ollama** provides a powerful, scala


