Affected components: src/crawlee/_utils/sitemap.py, src/crawlee/_utils/robots.py, src/crawlee/request_loaders/_sitemap_request_loader.py, and all built-in HTTP clients.
Trigger: a sitemap or robots.txt containing a URL that points to an internal host (layer 1) or uses a non-http scheme (layer 2).
Two-layer SSRF via sitemap-derived URLs:
**Layer 1: base case, affects every HTTP client.** Sitemap entries and robots.txt Sitemap: directives were accepted regardless of the host they pointed to. A sitemap on example.com could push http://internal.corp/admin into the crawler's queue, and the configured HTTP client would dispatch the request.
**Layer 2: escalation, only CurlImpersonateHttpClient.** Nested-sitemap fetching dispatches the URL straight to the HTTP client, bypassing the Request construction step where Pydantic enforces http(s). Combined with the libcurl-backed CurlImpersonateHttpClient, this lets gopher://, file://, dict://, ftp://, etc., through.
Crawlee already validates URL schemes through Pydantic's AnyHttpUrl (via validate_http_url in src/crawlee/_utils/urls.py) wherever a crawl target is materialised as a Request: the Request.url field is declared as Annotated[str, BeforeValidator(validate_http_url), Field(frozen=True)]. Anything that becomes a Request is therefore guaranteed to be http(s).
Two parts of the sitemap pipeline sidestepped this property in different ways:
SitemapRequestLoader took every <urlset><url><loc> entry, wrapped it in Request.from_url (which accepts any valid http(s) URL), and pushed the result into the request queue. RobotsTxtFile.get_sitemaps() returned every Sitemap: directive verbatim. Neither imposed any host check against the parent sitemap or robots.txt URL, so an attacker controlling that content could push internal-network HTTP URLs into the queue and have them crawled by whichever HTTP client was configured.
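
The gap described above can be illustrated with a minimal, standard-library sketch. It is not Crawlee's parser: `classify_sitemap_entries` and `SAMPLE` are hypothetical names introduced only for illustration. It shows that a `<loc>` entry can be a perfectly valid http(s) URL and still point at a different host than the sitemap that lists it, which is exactly what a scheme-only check lets through.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'


def classify_sitemap_entries(parent_url: str, xml_text: str) -> list[tuple[str, bool]]:
    """Return (loc, same_host) pairs for every <loc> element in a sitemap.

    Hypothetical helper for illustration; it is not part of Crawlee.
    """
    parent_host = urlsplit(parent_url).hostname
    root = ET.fromstring(xml_text)
    results = []
    for loc in root.iter(f'{SITEMAP_NS}loc'):
        url = (loc.text or '').strip()
        # Exact hostname comparison, a simplified stand-in for 'same-hostname'.
        results.append((url, urlsplit(url).hostname == parent_host))
    return results


# A sitemap served from example.com that smuggles in an internal URL.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products</loc></url>
  <url><loc>http://internal.corp/admin</loc></url>
</urlset>"""

entries = classify_sitemap_entries('https://example.com/sitemap.xml', SAMPLE)
for url, same_host in entries:
    print(f'{url}: {"same host" if same_host else "CROSS-HOST"}')
```

Both entries use an http(s) scheme, so only the host comparison distinguishes them.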
Layer 2: bypassing the Request chokepoint entirely
When _XmlSitemapParser encountered <sitemapindex><sitemap><loc>…</loc></sitemap></sitemapindex>, or when RobotsTxtFile.parse_sitemaps forwarded Sitemap: directives into the same pipeline, _fetch_and_process_sitemap dispatched the URL directly to the HTTP client:
    async with http_client.stream(
        sitemap_url,
        method='GET',
        headers=SITEMAP_HEADERS,
        proxy_info=proxy_info,
        timeout=timeout,
    ) as response:
        ...
No Request was constructed, so the Pydantic validator never ran. Before the fix, the HTTP clients' own send_request() and stream() methods did not call validate_http_url either, so a non-http(s) scheme could pass straight through to the backend client.
The non-HTTP escalation in layer 2 is specific to CurlImpersonateHttpClient, which is backed by curl-cffi / libcurl and speaks gopher, file, dict, ftp, and other non-HTTP protocols. The other clients shipped with Crawlee (HttpxHttpClient, ImpitHttpClient, PlaywrightHttpClient) reject non-http(s) schemes at their own backend layer, regardless of what Crawlee passes in, so they were only affected by layer 1.
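
A consumer-side scheme guard of the kind discussed above can be sketched with the standard library alone. This is an illustrative stand-in, not Crawlee's Pydantic-based `validate_http_url`: the names `require_http_url`, `guarded_fetch`, and `fake_backend` are hypothetical, and the sketch raises `ValueError` where the real fix is described as raising `pydantic.ValidationError`.

```python
from collections.abc import Callable
from urllib.parse import urlsplit

ALLOWED_SCHEMES = frozenset({'http', 'https'})


def require_http_url(url: str) -> str:
    """Return the URL unchanged if it is http(s) with a host, else raise.

    Hypothetical helper; a simplified analogue of a scheme validator.
    """
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f'Refusing non-http(s) URL scheme: {parts.scheme!r}')
    if not parts.hostname:
        raise ValueError('URL has no host component')
    return url


def guarded_fetch(url: str, fetch: Callable[[str], str]) -> str:
    """Validate the URL before the backend ever sees it."""
    return fetch(require_http_url(url))


def fake_backend(url: str) -> str:
    # Stand-in for a real HTTP backend; performs no network I/O.
    return f'fetched {url}'


print(guarded_fetch('https://example.com/sitemap.xml', fake_backend))

for bad in ('gopher://127.0.0.1:6379/_x', 'file:///etc/passwd'):
    try:
        guarded_fetch(bad, fake_backend)
    except ValueError as exc:
        print(f'blocked: {exc}')
```

Placing the check in front of the backend call means it applies no matter which code path produced the URL string.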
Layer 1 (all HTTP clients):
- The attacker publishes a sitemap that lists internal URLs under <urlset><url><loc> or <sitemapindex><sitemap><loc>, or an attacker-controlled robots.txt that lists internal URLs under Sitemap:.
- The crawler issues GET requests against those URLs, either via client.request(url=request.url, …) inside crawl() for regular sitemap URLs, or via client.stream(url, …) inside the nested-sitemap fetch.
Layer 2 (CurlImpersonateHttpClient only):
- The attacker publishes a <sitemap><loc> entry or a robots.txt Sitemap: directive pointing to a non-http(s) URL.
- CurlImpersonateHttpClient.stream(...) hands the URL string verbatim to client.request(url=…, …), which dispatches via libcurl.
Hardening in 1.7.0 was added at both producer and consumer ends; see Remediation.
- The application uses SitemapRequestLoader, Sitemap.load / parse_sitemap, discover_valid_sitemaps, or RobotsTxtFile.parse_sitemaps.
- The attacker controls a sitemap or robots.txt that the crawler fetches, typically by being the target site, or by getting a target site to publish a malicious sitemap.
For layer 2 (non-HTTP), the configured HTTP client must additionally be CurlImpersonateHttpClient.
The crawler can be coerced into issuing GET requests against internal HTTP services on its own network: admin panels, unauthenticated internal APIs, cloud metadata endpoints, etc. Read-back is blind — Crawlee surfaces fetched content only through its local Dataset / KeyValueStore (push_data() etc.) and does not natively forward scraped bodies anywhere external — so direct impact is mostly existence/timing probing and occasional state changes via side-effecting GET endpoints. Read-side leakage of internal content is only exploitable end-to-end if the deployer's own application separately exposes scraped data (for example, a public summariser or aggregator built on top of Crawlee).
Layer 2 (CurlImpersonateHttpClient)
Under the affected client, attackers gain the libcurl scheme set:
- gopher:// is the canonical RESP-injection vector: pipeline FLUSHALL, CONFIG SET dir, CONFIG SET dbfilename, SAVE to an unauthenticated Redis on the crawler's network, which is enough to write attacker-controlled bytes to disk and, in the standard escalation, achieve remote code execution on the Redis host.
- file:// allows the crawler to read local files (application secrets, configuration) on the crawler host.
- dict:// and ftp:// permit fingerprinting and limited interaction with text-protocol services.
In both layers, the SSRF is blind in the default configuration. Write-side impact (gopher:// → Redis) and timing-based internal probing do not depend on read-back and remain viable regardless of whether the deployer surfaces scraped content.
Both layers are fixed in crawlee==1.7.0. The fix is split across two PRs, applied at the two complementary boundaries of the affected pipeline:
- SitemapRequestLoader and RobotsTxtFile.get_sitemaps() now run every nested-sitemap entry, every regular sitemap URL, and every Sitemap: directive through crawlee._utils.urls.filter_url. This applies an EnqueueStrategy (default 'same-hostname') against the parent sitemap / robots.txt URL, so cross-host entries are dropped, and it rejects non-http(s) schemes. The strategy is stamped onto the emitted Requests, so BasicCrawler._check_url_after_redirects continues to enforce the policy across redirects.
- validate_http_url(url) is now called at the top of send_request() and stream() in ImpitHttpClient, HttpxHttpClient, CurlImpersonateHttpClient, and PlaywrightHttpClient. Non-http(s) schemes raise pydantic.ValidationError before any backend call. crawl() was already covered, because Request.url is validated by Pydantic on construction.
After these changes, validation is enforced both where sitemap-derived HTTP requests are produced (sitemap and robots.txt loaders) and where they are consumed (HTTP clients). A regression at either layer is caught by the other.
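
The producer-side filtering described above can be approximated for robots.txt Sitemap: directives as follows. This is a rough sketch under stated assumptions, not the code of `filter_url` or `RobotsTxtFile.get_sitemaps()`: the function `filter_sitemap_directives` and the sample `ROBOTS` text are hypothetical, and 'same-hostname' is modelled as an exact hostname match.

```python
from urllib.parse import urlsplit


def filter_sitemap_directives(robots_url: str, robots_txt: str) -> list[str]:
    """Keep only Sitemap: URLs that are http(s) and on the robots.txt host.

    Hypothetical helper; a simplified model of a 'same-hostname' policy.
    """
    parent_host = urlsplit(robots_url).hostname
    kept = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(':')
        if not sep or key.strip().lower() != 'sitemap':
            continue  # not a Sitemap: directive
        url = value.strip()
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # malformed URL, drop it
        if parts.scheme not in ('http', 'https'):
            continue  # layer 2: non-http(s) scheme
        if parts.hostname != parent_host:
            continue  # layer 1: cross-host entry
        kept.append(url)
    return kept


ROBOTS = """User-agent: *
Disallow: /private
Sitemap: https://example.com/sitemap.xml
Sitemap: http://internal.corp/admin.xml
Sitemap: gopher://127.0.0.1:6379/_payload
"""

allowed = filter_sitemap_directives('https://example.com/robots.txt', ROBOTS)
print(allowed)
```

Only the directive on the same host with an http(s) scheme survives; the internal-host and gopher:// entries are dropped before anything is fetched.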
SitemapRequestLoader and RobotsTxtFile.get_sitemaps() now default to enqueue_strategy='same-hostname'. Deployers that legitimately relied on cross-host sitemap entries (e.g., a sitemap index on sitemaps.example.com that points to content on www.example.com) must opt in explicitly with enqueue_strategy='same-domain' or enqueue_strategy='all'.
{
  "github_reviewed_at": "2026-05-21T19:28:10Z",
  "github_reviewed": true,
  "cwe_ids": [
    "CWE-918"
  ],
  "nvd_published_at": null,
  "severity": "LOW"
}