Here is everything I know about scraping

They built walls.
I spent 7 years finding doors.

I started scraping in 2018. Since then I have worked across five companies, built hundreds of production spiders, and fought every major anti-bot system that exists. This guide is everything that actually worked.

6 anti-bots · 60+ libraries · 4 detection layers · 6 decision steps
About the author
7 years of production scraping
Asad Ikram
Data Engineer & Scraping specialist

I started scraping the web in 2018. Since then I have worked at five companies, including Fix.com, Dubizzle Labs, and M+C Saatchi Fluency, building production scrapers at scale across MENA and Europe.

Currently Data Engineer at M+C Saatchi Fluency and co-founder of ArtemisAI Ltd. Chevening Scholar 2024/25, MSc Data Analytics with Distinction.

I built this guide to share everything I know about scraping properly: the bypasses, the failures, the patterns that hold up in production. No guesswork, no generic tutorials.

🕷️ 500+ production spiders built
📊 50M+ data points extracted
🏢 5+ companies with production scrapers
🎓 Chevening Scholar 2024, UK Govt
01 Attack strategy

The scraping decision flow

Walk the steps in order. Stop at the first win. Complexity and cost increase as you move right. Most production scraping is solved at steps 1–3.

Asad's priority order: start left, move right only when needed.
Step 1 · 📱 Mobile API: HTTPToolkit, Frida · mitmproxy
Step 2 · 🔍 XHR Endpoint: Chrome DevTools, Burpsuite · webclaw
Step 3 · 🗃️ JSON in HTML: __NEXT_DATA__, chompjs · Parsel
Step 4 · HTTP Scraping: curl_cffi, Scrapy · Scrapling
Step 5 · 🌐 C++ Browser: Camoufox, CloakBrowser
Step 6 · ☁️ Managed API: Bright Data, Zyte · Firecrawl
Rule #1, Asad's priority: Never start at Step 5. The mobile app often hits the same backend with zero anti-bot. Confirmed on a major retailer: a direct GraphQL endpoint bypassed all HTML anti-bot protection entirely. Find the API first, always.
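Step 3 in practice, a minimal sketch (the product URL and the JSON path inside __NEXT_DATA__ are placeholders; every site nests its state differently):
python
# Pull embedded JSON out of the HTML before reaching for a browser
import json
from curl_cffi import requests as cffi
from parsel import Selector

html = cffi.get("https://example.com/product/123", impersonate="chrome124").text
sel = Selector(text=html)

# Next.js sites ship the full page state in one script tag
raw = sel.css("script#__NEXT_DATA__::text").get()
if raw:
    data = json.loads(raw)
    # inspect data["props"] in a REPL first, the exact path varies per site
    print(list(data["props"].keys()))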
Before we go deeper: the flow above tells you what order to try things. But to understand why those steps exist in that order, which step to stop at, and what skipping ahead costs you in days, you need to understand how detection actually works. The next section breaks down every signal anti-bots collect, starting at the TCP handshake.

02 The anatomy of detection

Before you send a single byte,
you've already been judged.

The moment your scraper opens a TCP connection to a CDN, a fingerprinting pipeline triggers. By the time your HTTP request body arrives, four independent scoring systems have already assigned you a trust score. Here's exactly what each one measures, and why defeating just one is never enough.

The fundamental insight: Anti-bots don't make binary decisions. They assign a continuous trust score across all four layers simultaneously. A perfect TLS fingerprint with a datacenter IP and machine-like mouse movement still fails, just at a different layer. The only winning strategy is addressing all four at once.

Layer 1, TLS Fingerprinting: The Handshake That Betrays You

This fires before a single HTTP byte is exchanged. Understanding it is non-negotiable.

Origin 2017 · Salesforce Research
JA3, The First Fingerprint
When any HTTPS client connects, it sends a TLS ClientHello message. JA3 extracts five fields from it and MD5-hashes the combination:

TLS Version + Cipher Suites + Extensions + Elliptic Curves + Curve Formats

This produced a stable 32-char hex hash. Python's requests library has always had the same JA3 hash. Every major anti-bot catalogued it. By 2021, your Python scraper was identifiable before the first HTTP header.

JA3's weakness: Chrome started randomising TLS extension order in 2022. Same browser, different JA3 every session. The fingerprint became unstable and unreliable.
2023 · FoxIO · Replaces JA3
JA4+, The Unbreakable Standard
JA4 was engineered specifically to survive Chrome's randomisation. Instead of hashing raw extension order, it sorts extensions alphabetically and removes GREASE values before hashing. The result is stable regardless of Chrome's ordering.

JA4 format: t13d1516h2_8daaf6152771_b0da82dd1658
t = TCP, 13 = TLS 1.3, d = SNI present, 15 = cipher count, 16 = extension count, h2 = ALPN (HTTP/2); the remaining two segments are truncated hashes of the sorted cipher and extension lists

JA4+ extends this with: JA4H (HTTP header fingerprint), JA4X (X.509 certificate), JA4SSH (SSH handshake), JA4T (TCP window + options). Cloudflare deployed it in a Rust crate at CDN edge. Akamai in an EdgeWorker. Both fire before your request reaches origin.
HTTP/2 · Wireshark Observable
HTTP/2 Frame Fingerprinting
Even with a perfect JA4 hash, HTTP/2 itself leaks your client identity. The SETTINGS frame that every HTTP/2 client sends at connection start has parameters that vary by implementation:

HEADER_TABLE_SIZE, MAX_CONCURRENT_STREAMS, INITIAL_WINDOW_SIZE, MAX_FRAME_SIZE, MAX_HEADER_LIST_SIZE

Chrome's exact values are documented. Python's httpx sends different values. curl sends different values. The ordering of these settings, the window update frame sizes, and the HPACK compression decisions all create a secondary fingerprint that cannot be spoofed without rewriting the HTTP/2 client, which is exactly what curl_cffi does.
2024+ · Emerging Standard
QUIC / HTTP/3 Fingerprinting
As HTTP/3 adoption grows, JA4Q and QUIC Initial packet fingerprinting are being deployed. QUIC's handshake carries its own fingerprint surface: connection ID length, transport parameters, initial packet number, token presence.

Chrome's QUIC stack differs from libcurl's QUIC implementation differs from Python's aioquic. Each leaves a unique signature in the Initial packets.

Current status: JA4+ covers QUIC. Cloudflare has begun collecting QUIC fingerprints. Not yet widely enforced for blocking, but the infrastructure is live. Tools like curl_cffi are actively implementing QUIC parity.
python
# Test your actual JA4 fingerprint against tls.browserleaks.com
import requests
from curl_cffi import requests as cffi

# ❌ requests, exposes Python/urllib3 JA4, blocked immediately
r1 = requests.get("https://tls.browserleaks.com/json")
print(r1.json()["ja4"])
# → the catalogued Python/urllib3 fingerprint, blocked on sight

# ✓ curl_cffi, emits Chrome 124's exact JA4 hash, HTTP/2 frames, cipher order
r2 = cffi.get(
    "https://tls.browserleaks.com/json",
    impersonate="chrome124",  # also: chrome110, chrome107, safari17
)
print(r2.json()["ja4"])
# → Chrome 124's JA4 (t13d…), passes

# Also check HTTP/2 fingerprint
print(r2.json()["http2"])  # Chrome's exact SETTINGS frame values
Practical · How to actually spoof TLS in 2026
From theory to working code

All the JA4+ research is academic until you ship it. Three tiers of solution, in order of how often you should reach for each:

Tier 1 · 80% of cases
Use a TLS-impersonating HTTP client
curl_cffi (Python), tls-client (Go), noble-tls, hrequests. One line of code, exact Chrome/Firefox JA4. Drop-in replacement for requests.

curl_cffi.requests.get(url, impersonate="chrome131")
Tier 2 · Scrapy projects
Plug a stealth middleware in
scrapy-stealth adds TLS + HTTP/2 fingerprinting + proxy rotation + fingerprint cycling to existing Scrapy spiders via DOWNLOADER_MIDDLEWARE. Per-request engine switching keeps simple URLs fast.

meta={"stealth": {"profile": "chrome_147"}}
Tier 3 · Hardest targets
Browser with C++ patches
When TLS spoofing alone fails (Akamai extension probes, Kasada toString checks, behavioural ML), reach for Camoufox, rayobrowse, or CloakBrowser. C++ binary patches ship a real-browser TLS stack along with everything else.

Cost: 200MB+ memory per browser instance
⚠ Common mistakes
1. Spoofing User-Agent without TLS. If your UA says Chrome but JA4 says Python urllib3, you flag faster than with no spoofing at all; the mismatch is the signal.
2. Forgetting HTTP/2 SETTINGS frames. Even perfect JA4 fails if your HTTP/2 SETTINGS (header table size, max concurrent streams, initial window size) do not match the browser you claim to be. curl_cffi and tls-client handle this; rolling your own usually does not.
3. Using stale impersonation profiles. Chrome 120 fingerprints in 2026 are themselves suspicious; real users rolled forward. Keep impersonate="chrome131" or newer.

Layer 2, JavaScript Fingerprinting: The Page That Interrogates You

Once your TLS passes, the page loads its anti-bot script. This is a 500KB+ obfuscated interrogation that runs dozens of tests in parallel.

Most stable signal · GPU dependent
Canvas + WebGL Fingerprinting
The page draws invisible shapes, gradients, and text with canvas.getContext('2d'), then calls canvas.toDataURL(). The exact pixel output varies by:

  • GPU manufacturer and model (NVIDIA vs AMD vs Intel)
  • Driver version and sub-pixel rendering
  • OS-level font rendering (Windows ClearType vs macOS CoreText)
  • Canvas size and DPI scaling

A headless Chromium with no GPU produces a software-rendered canvas with a known hash. Botasaurus and CloakBrowser spoof this at the C++ level by injecting slight noise into the pixel values before toDataURL() returns, enough to vary the hash while remaining visually identical.
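A pure-Python illustration of why single-bit noise defeats hash matching (the real patches do this inside the browser's canvas code; this only shows the hash behaviour):
python
import hashlib
import random

pixels = bytearray(range(256)) * 16  # stand-in for canvas RGBA bytes
print(hashlib.md5(bytes(pixels)).hexdigest())

# flip the low bit of a handful of random bytes, visually identical output
for i in random.sample(range(len(pixels)), 8):
    pixels[i] ^= 1
print(hashlib.md5(bytes(pixels)).hexdigest())  # completely different hash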
GPU vendor string · Renderer string
WebGL Fingerprinting
WebGL exposes the GPU through gl.getParameter(gl.RENDERER) and gl.getParameter(gl.VENDOR). Real Chrome returns something like ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0).

Headless Chrome returns a generic string or crashes on WebGL entirely. Anti-bots cross-reference: if WebGL says "Intel UHD 620" but the Canvas hash shows software rendering, that's a contradiction; you're flagged.

WebGL extensions list is also fingerprinted. Real GPUs expose 30–40 extensions. Software renderers expose a different subset. The exact combination is GPU-specific and stable across sessions.
Deterministic · Hard to spoof
AudioContext Fingerprinting
The page creates an AudioContext, generates a sine wave through an OscillatorNode, runs it through a DynamicsCompressorNode, and reads the output buffer values. The floating-point output depends on:

  • CPU architecture (x86 vs ARM floating-point precision)
  • Operating system audio stack
  • Audio driver implementation

Headless environments often return 0.0 across the buffer (no audio context), or a software-emulated value that differs from hardware. CloakBrowser patches this at the Chromium C++ audio rendering layer.
Runtime patches exposed · 2026 standard
Function.toString() Detection
This is why playwright-stealth fails against Kasada in 2026.

When JS patches a native function, for example navigator.webdriver, it replaces the getter with a custom function. Calling Function.prototype.toString.call(getter) on the patched function returns function () { [custom code] } instead of function () { [native code] }.

Kasada specifically tests dozens of native functions this way. playwright-stealth patches them in JavaScript, so toString() reveals the patch. PatchRight fixes this at the Python source level, before Chrome even starts. There's no JS to inspect.
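You can reproduce the probe yourself to audit a stealth setup. A sketch with plain Playwright (vanilla headless Chromium, so the getter here is still native):
python
from playwright.sync_api import sync_playwright

PROBE = """() => {
  const d = Object.getOwnPropertyDescriptor(Navigator.prototype, 'webdriver');
  return Function.prototype.toString.call(d.get);
}"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # native getter: "function get webdriver() { [native code] }"
    # a JS-level stealth patch leaks its custom source here instead
    print(page.evaluate(PROBE))
    browser.close()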
Akamai specific · 60 probes in 2026
Chrome Extension Probing (Akamai)
Akamai's sensor.js fetches 60 known Chrome extension resource URLs using fetch('chrome-extension://[id]/manifest.json'). Real Chrome browsers have at least a few extensions installed (ad blockers, password managers, etc.).

A headless browser returns net::ERR_FAILED on all 60 requests simultaneously, a statistically impossible result for a real user. The extension IDs probed include:

cjpalhdlnbpafiamejdnhcphjbkeiagm (uBlock Origin)
hdokiejnpimakedhajhdlcegeplioahd (LastPass)
nngceckbapebfimnlniiiahkandclblb (Bitwarden)

Fix: CloakBrowser loads real extension profiles. You install 1Password or Bitwarden into it so some probes return real manifest data.
CDP timing leaks · Protocol signals
Headless / CDP Detection
Beyond navigator.webdriver, CDP-controlled browsers expose themselves through subtler signals:

Timing: CDP's Runtime.enable command leaves a timing gap between page parse and script execution that doesn't exist in real Chrome.
Execution context: window.cdc_adoQpoasnfa76pfcZLmcfl_Array and similar artifacts left by ChromeDriver are checked.
Permission API: Real Chrome returns realistic permission states. ChromeDriver returns defaults inconsistent with a "normal" browser.
Plugins: Headless Chrome has zero plugins. Real Chrome always has at least the PDF viewer plugin.

Camoufox's solution: Uses Mozilla's Juggler protocol, which sits below CDP entirely, none of these artifacts exist.

Layer 3, Network Identity: The Five Vectors That Must Agree

Primary signal
IP Reputation & ASN
Anti-bots check your IP against ASN databases. AWS (AS16509), GCP (AS15169), Azure (AS8075) are immediately flagged. DigitalOcean, Linode, Vultr: all known. Even "residential" proxy ranges like 24.105.x.x are flagged if the ASN is a known proxy provider. Genuine ISP residential or 4G carrier IPs are the only reliably clean option.
Browser API · Often overlooked
WebRTC IP Leak
JavaScript can query WebRTC ICE candidates which reveal your real local and public IP, even through a proxy. If your browser has a US proxy but WebRTC reveals a Pakistani local address, or the ICE candidate is from a different subnet than the HTTP request IP, that's an immediate flag. Camoufox's geoip=True aligns WebRTC candidates with the proxy exit country.
All five must agree
The Coherence Test
Anti-bots run a coherence check across: IP country, timezone, Accept-Language, WebRTC candidate, DNS resolver location. A US proxy with Accept-Language: ur-PK fails immediately. All five must tell a consistent geographic story. This is why setting geoip=True in Camoufox is critical: it auto-configures all five to match the proxy's exit country.
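A sketch of what that looks like in code, assuming the camoufox Python package; the proxy credentials are placeholders:
python
from camoufox.sync_api import Camoufox

proxy = {
    "server": "http://proxy.example.com:8000",
    "username": "user",
    "password": "pass",
}

# geoip=True derives timezone, locale, and WebRTC candidates
# from the proxy's exit IP, so all five vectors agree
with Camoufox(geoip=True, proxy=proxy) as browser:
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())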

Layer 3.5, DOM Honeypots: The Trap Doesn't Care About Your Fingerprint

Hidden DOM elements
Honeypot Fields and Links
Invisible form fields and hidden links that humans never see but bots fill in or click. Triggered = bot detected = IP banned. Common patterns: display:none, visibility:hidden, opacity:0, zero-dimension elements, off-screen positioning, fields with tabindex="-1", or links placed after the closing </body> tag.
Data poisoning
Fake Data Served to Suspected Bots
More dangerous than blocking: sites detect a scraper and silently serve different prices, fake reviews, wrong stock counts. You think you're scraping successfully, but your dataset is corrupted. Defence: compare scrapes from 2+ different IP profiles for the same URL; mismatched data = poisoning. Always check element visibility (getBoundingClientRect()) before interacting, as in the sketch below.
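A minimal visibility filter, assuming an existing Playwright page object; the selector is an example:
python
def is_really_visible(el):
    box = el.bounding_box()  # None for display:none / detached elements
    if not box or box["width"] == 0 or box["height"] == 0:
        return False
    return el.evaluate(
        "el => { const s = getComputedStyle(el);"
        " return s.visibility !== 'hidden' && s.opacity !== '0'; }"
    )

links = page.query_selector_all("a[href]")
safe_links = [a for a in links if is_really_visible(a)]  # skip the honeypots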

Layer 4, Behavioural ML: You Can't Fake Being Human

Physics-based · Gaussian jitter
Mouse Movement Curves
Human mouse movements follow Bezier curves with Gaussian noise applied to velocity. The mouse decelerates as it approaches a target (Fitts's Law), overshoots slightly, then corrects. Scrapers that click elements directly (teleporting the mouse to x,y) create a trajectory signature that's statistically impossible for a human. DataDome's 35-signal behavioural model catches this immediately. Botasaurus generates physically realistic curves with randomised velocity profiles.
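A sketch of the idea in pure Python, no dependencies; feed the points to page.mouse.move() instead of teleporting to the target:
python
import random

def human_mouse_path(start, end, steps=60):
    """Cubic Bezier with jittered control points and smoothstep timing."""
    (x0, y0), (x3, y3) = start, end
    # jittered control points bow the path like a wrist arc
    x1 = x0 + (x3 - x0) * 0.3 + random.gauss(0, 40)
    y1 = y0 + (y3 - y0) * 0.1 + random.gauss(0, 40)
    x2 = x0 + (x3 - x0) * 0.7 + random.gauss(0, 40)
    y2 = y0 + (y3 - y0) * 0.9 + random.gauss(0, 40)
    points = []
    for i in range(steps + 1):
        t = i / steps
        t = t * t * (3 - 2 * t)  # smoothstep: accelerate, decelerate near target
        mt = 1 - t
        x = mt**3 * x0 + 3 * mt**2 * t * x1 + 3 * mt * t**2 * x2 + t**3 * x3
        y = mt**3 * y0 + 3 * mt**2 * t * y1 + 3 * mt * t**2 * y2 + t**3 * y3
        points.append((x + random.gauss(0, 0.4), y + random.gauss(0, 0.4)))
    return points

# for x, y in human_mouse_path((120, 300), (640, 420)):
#     page.mouse.move(x, y)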
Sub-millisecond precision · ML scored
Timing Analysis
Transformer ML models trained on millions of sessions measure: time between page load and first interaction, scroll acceleration curves, inter-keystroke timing variance, navigation dwell time, and micro-timing of JS event handlers at <1ms precision. A scraper that immediately calls document.querySelector() after DOMContentLoaded looks nothing like a human who reads the page for 2.3 seconds first. Warm-up navigation (visiting homepage before target) significantly improves behavioural scores.
The story so far: You now understand the full detection stack, TLS fingerprints at the network layer, JS interrogation in the browser, IP reputation checks per request, and ML behavioural analysis across the session. The next sections show you exactly which anti-bot vendors use which combination of these layers, and the specific bypass strategies for each.

Now you know the four detection layers. Every vendor below is just a different weighting of those same four signals, some prioritise TLS, others behaviour, others network identity. Knowing the layer tells you which tool to pick. Here are the six walls.

03 The vendors

Six companies built the walls.
Here's every key.

Each vendor applies the detection layers differently, different weights, different signals, different architectures. What bypasses Cloudflare has zero effect on Kasada. You need to know exactly which wall you're facing before you choose a tool.

Step 0, Before anything else

Identify which anti-bot you're facing

Wrong strategy on the wrong vendor wastes hours. Before writing a single line of code, spend 30 seconds identifying exactly what's protecting the target.

1 Wappalyzer Chrome Extension

Visit the target site, click the Wappalyzer icon in your toolbar. It instantly shows all detected technologies, including the anti-bot vendor. Shows Akamai, Cloudflare, DataDome, PerimeterX, Kasada and more with a single click.

2 Check response cookies

Open DevTools → Application → Cookies. Match any cookie name to identify the vendor. Multiple vendors can run on the same site. For CLI scanning at scale: wafw00f https://target.com identifies WAF + anti-bot vendor in one command.

3 Check response headers

DevTools → Network → any request → Response Headers. Look for x-datadome, server: cloudflare, x-akamai-request-id, or challenge redirect URLs containing vendor names.

🔍 Wappalyzer

Free Chrome + Firefox extension. One click on any site shows:

  • Anti-bot / security vendor
  • CDN provider
  • CMS, framework, analytics
  • Server technology
01/06 · Akamai · Airlines · Banks · ~30% Fortune 500
Bot Manager v3+ injects sensor.js (~512KB, fully obfuscated) into every protected page. Unlike Cloudflare which checks at CDN edge, Akamai runs its full fingerprint suite inside your browser via this script. It collects 500+ signals over multiple requests, trust accumulates across the session, not just on the first hit. The critical 2026 signal: 60 chrome-extension:// URL probes. Zero passing = instant bot score regardless of all other signals. JA4+ is checked at EdgeWorker before HTML is served.
_abck cookie · bm_sz · 60 ext probes · Battery API · Multi-req scoring
Bypass strategy
Step 1: Check for GraphQL/XHR API first, a direct endpoint bypasses HTML anti-bot entirely
curl_cffi impersonate="chrome124" handles TLS + HTTP/2 layer
CloakBrowser with 49 C++ patches handles sensor.js interrogation
Load Bitwarden + 1Password extensions to pass 60 extension probes
ISP/static residential proxy, never rotate mid-session (trust accumulates)
Homepage warm-up → 2–3s human dwell → scroll → navigate to target
Script size: ~512KB, re-obfuscated per rotation
Ext probes: 60, zero passing = instant block
Fortune 500: ~30%, retail, airlines, finance
Scoring: multi-request, trust builds across session
02/06 · Cloudflare · 20% of all internet traffic · 200+ countries
Cloudflare's uniqueness is infrastructure-level deployment. JA4 is computed in a Rust crate running on every Cloudflare edge node, your request is fingerprinted before it reaches any application server. The ML bot score (1–99) is trained on Cloudflare's view of 20% of all internet traffic, giving it an unmatched baseline for what "real" browsers look like. Turnstile (their CAPTCHA replacement) submits a 79-parameter POST including Canvas hash, font measurements, SHA-256 proof-of-work, and TEA-encrypted timing data.
cf_clearance · __cf_bm · JA4 Rust edge · Turnstile 79 params · ML score 1–99
Bypass strategy
Origin IP bypass: check SecurityTrails DNS history, many sites had Cloudflare added later, origin IP is in old A records
Camoufox with geoip=True, 100% pass rate Mar 2026 on Instagram, Reddit, X, LinkedIn
Scrapling's StealthyFetcher solves Turnstile natively and automatically
Turnstile HTTP bypass possible: solve the PoW + Canvas hash without a browser in ~0.27s
Camoufox uses Juggler (not CDP), zero CDP timing artifacts that Cloudflare's ML scores heavily
Web coverage: 20% of all internet traffic
Turnstile params: 79, Canvas + PoW + TEA crypto
Camoufox: 100% pass rate, Mar 2026
ML training: global, on 20% of all traffic
03/06 · DataDome · 5 trillion signals/day · 1,200+ clients
DataDome's architecture is fundamentally different from the others: it deploys 85,000 separate ML models, one per protected site. There is no universal bypass. What works on Grainger.com may not work on Le Monde. It runs at the application server level (not CDN), so origin IP bypass is impossible. The WASM boring_challenge is a Rust-compiled state machine that cannot be emulated; it requires actual browser execution to produce valid tokens. IP reputation alone accounts for 25–30% of the total trust score.
datadome cookie · WASM boring_challenge · Picasso device FP · 35+ behavioural · 85K per-site models
Confirmed bypass, Grainger.com ✓
Always try first: find __NEXT_DATA__ in HTML source, Grainger had 110KB of product data in it, bypassing DataDome entirely
curl_cffi chrome124 + residential proxy → confirmed 200 OK (Grainger.com)
Mobile carrier IP (T-Mobile, Vodafone 4G), highest trust score, hardest to flag
Camoufox + geoip=True aligns all 5 identity vectors with the proxy exit country
2ms real-time response means every request is independently scored
ML models: 85,000, one per protected site
Response: 2ms, real-time at the app server
IP weight: 25–30% of total trust score
Universal bypass: none, per-site models
04/06 · PerimeterX · HUMAN Security · 3 billion devices
After merging with HUMAN Security, PerimeterX gained the most powerful network effect in anti-bot. It verifies 15 trillion interactions per week across 3 billion devices. The critical risk: get detected on any one of 29,650+ protected sites and your fingerprint is flagged across the entire network. Nike, Walmart, Zillow, StubHub all share reputation data. Its 5-vector unified score (TLS + IP + HTTP headers + JS fingerprint + Behaviour) requires all five to pass simultaneously; fixing only one vector has zero effect.
_px3 cookie · _pxde cookie · 5-vector score · 29,650 site network · Human Challenge
Bypass strategy
All 5 vectors must pass simultaneously; Camoufox + residential proxy addresses all of them
Generate a fresh fingerprint per session, never reuse fingerprints across different target domains
SeleniumWire can intercept the _px3 token generation flow for token replay
Scrapfly's ASP flag handles all 5 layers automatically at managed API level
Never use burned IPs, the network effect means cross-site reputation
Sites: 29,650+, Nike, Walmart, Zillow
Weekly verif.: 15T across 3B devices/month
Vectors: 5/5, all must pass
Network effect: global, reputation shared
05/06 · Kasada · No CAPTCHA · Gatekeeper proxy architecture
Kasada operates as a gatekeeper proxy: every request flows through it before reaching origin. Its JavaScript (ips.js, renamed polymorphically each deployment) issues proof-of-work challenges that require real CPU cycles and browser APIs to solve. There are no CAPTCHAs; failures are silent 403s or 429s with no explanation. The critical 2026 fact: Kasada specifically fingerprints playwright-stealth by calling Function.prototype.toString() on patched native functions. The patch signatures are catalogued.
x-kpsdk-ct · x-kpsdk-cd · ips.js PoW · polymorphic JS · toString() inspection
Bypass strategy
Never use playwright-stealth: Kasada has its toString() signatures catalogued and blocks it outright
PatchRight patches at the Python source level, nothing in the JS runtime to inspect via toString(), see the sketch after this list
SeleniumBase UC mode, removes webdriver flag and auto-handles PoW challenges
Residential proxy essential, datacenter IPs receive near-zero trust regardless of browser
PoW tokens are single-use, never replay, always generate fresh per request
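What the PatchRight route looks like, a sketch assuming the patchright package (a drop-in Playwright fork with the same API); the target URL is a placeholder:
python
from patchright.sync_api import sync_playwright

with sync_playwright() as p:
    # same Playwright API, but the automation tells were removed
    # from the Python source before Chrome ever started
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://target.example.com")
    print(page.title())
    browser.close()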
Block style: silent 403, no explanation
playwright-stealth: detected, catalogued signatures
Challenge: JS PoW, real CPU required
JS file: polymorphic, renamed each deploy
06/06 · F5 Shape · $1 billion acquisition · Most sophisticated
F5 acquired Shape Security for $1 billion in 2020, and the price reflects what they built. Shape runs a custom JavaScript virtual machine. The bytecode that executes in the browser is not standard JavaScript; it's a proprietary instruction set that you cannot reverse-engineer with standard tooling. Session tokens expire in minutes. The challenge payload is re-generated with every rotation. For production scraping at scale, DIY bypass is economically irrational: the engineering cost of maintaining a bypass exceeds the cost of Bright Data's API within weeks.
reese84 cookie · TS cookie · custom JS VM · minute-cadence rotation · $rsc= params
Bypass strategy
First: check if mobile app uses a weaker backend, Shape is often only on the web frontend
Only reliable option at scale: Bright Data (98.44%) or Zyte (93.14%) managed APIs
DIY reverse engineering: deobfuscate VM bytecode, takes weeks per rotation
Cost-justify: >2 days/month of maintenance time → managed API is cheaper
The custom VM produces tokens that can be replayed for a few minutes, session pooling can reduce API costs
Acquisition: $1B, F5 Networks 2020
Token expiry: minutes, tight rotation cadence
VM type: custom, proprietary bytecode
DIY viability: none, use a managed API
Forter
Fraud / Behavioural
Focuses on behavioural analysis and device fingerprinting for fraud prevention. Monitors checkout speed, typing rhythm, and device profile. Common on e-commerce checkouts. Bypass: headless browser with randomised timings, diverse residential proxy pool, replay real user interaction sequences.
Behavioural · Device FP · Checkout fraud
Riskified
Fraud / Behavioural
Monitors shopping and payment behaviour alongside device fingerprinting. Flags anomalies in purchase flow, typing patterns, and system details. Bypass: Playwright Stealth with realistic interaction replay, residential proxies, maintain full session cookies across the purchase flow.
Behavioural · Device FP · Payment flows
Imperva Incapsula
WAF · IP reputation · JS challenges
Enterprise WAF used by Fortune 500 financial and government sites. Focuses on IP reputation databases + JavaScript challenges + behavioural analysis. Less aggressive than DataDome on TLS but harsh on flagged IPs. Bypass: residential proxies (datacenter IPs nuked instantly), Camoufox or fortified browser, slow request pacing.
IP reputation · JS challenge · Enterprise/finance
AWS WAF
Cloud-native · Bot Control · Captcha
Amazon's managed WAF with Bot Control add-on. Three protection levels: Common (signature-based), Targeted (behaviour + JS challenge), Custom rules. Used by AWS-hosted apps. Bypass: rotate residential IPs (Common tier blocks AWS IPs themselves), browser automation for Targeted tier, request rate ≤ 5/sec to avoid trigger thresholds.
AWS-native · Bot Control · CAPTCHA

Quick identification reference

What you see | Anti-bot | Key cookie/header | Detection method
"Pardon Our Interruption" page | Akamai block | _abck | Wappalyzer · response body
CF-Ray header · Turnstile iframe | Cloudflare challenge | cf_clearance | Response header CF-Ray
JSON with datadome key | DataDome block | datadome | Response header x-datadome
_px3 or _pxde set | PerimeterX block | _px3 | Cookie inspection
Silent 403 · no body | Kasada silent block | x-kpsdk-ct | Response headers · ips.js in source
reese84 or TS cookie | F5 Shape block | reese84 | Cookie names · Shape JS reference
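The same check scripted, a sketch: one request, match the known signatures from the table above (the target URL is a placeholder, and the signature list covers only these six vendors):
python
from curl_cffi import requests as cffi

SIGNATURES = {
    "_abck": "Akamai", "cf_clearance": "Cloudflare", "cf-ray": "Cloudflare",
    "datadome": "DataDome", "_px3": "PerimeterX",
    "x-kpsdk": "Kasada", "ips.js": "Kasada", "reese84": "F5 Shape",
}

r = cffi.get("https://target.example.com", impersonate="chrome124")
blob = (str(r.headers) + r.text[:20000]).lower()
hits = {vendor for sig, vendor in SIGNATURES.items() if sig in blob}
print(hits or "no known anti-bot signature in the first response")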
The through-line: Every anti-bot vendor is defending against the same thing, automated access that looks like a machine. The difference is which layer they weight most heavily. Akamai weights browser execution (sensor.js). Cloudflare weights TLS + global ML. DataDome weights per-site behaviour + IP. PerimeterX weights the network effect. Kasada weights PoW + JS integrity. F5 Shape weights token validity via a proprietary VM. The tools in the next section exist as direct countermeasures to each of these specific approaches.

Six walls. Now the tools. Every library below exists as a direct response to one of those six systems, curl_cffi was built because JA4 broke Python's TLS. Camoufox because CDP leaks signal automation. PatchRight because Kasada fingerprints JS patches. The arms race made this arsenal.

04 The arsenal

Every tool built to fight
every wall we just described.

Now that you understand the detection stack and the six anti-bot vendors, every library below makes sense in context. curl_cffi exists because of JA4. Camoufox exists because of CDP leaks. PatchRight exists because of Kasada's toString() inspection. The arsenal wasn't built randomly, each tool is a direct countermeasure to a specific detection innovation.

Master comparison table, all 60+ libraries & tools

Library · Type · Lang · TLS detail · Anti-bot target · Stars
curl_cffi · HTTP · Python · Chrome JA4+ · Akamai, DataDome
⚡ HTTP
Under the hood: libcurl C library with custom TLS patches. Emits exact Chrome/Safari/Firefox TLS ClientHello at the C level, cipher suites, extensions, ALPN, GREASE all match real browsers.
✓ Pros
  • Fastest HTTP option. Pure HTTP speed, no browser overhead
  • Confirmed DataDome + Akamai bypass in 2026
  • Asyncio support via AsyncSession
  • Simple requests-compatible API
✗ Cons
  • No JavaScript execution, useless for JS-rendered pages
  • Cannot solve CAPTCHA or Turnstile challenges
  • TLS fingerprint only, no behaviour or canvas spoofing
Scrapling · HTTP · Python · Chrome TLS · Cloudflare Turnstile · 38k
⚡ HTTP
Under the hood: Wraps curl_cffi for stealth HTTP + integrates Camoufox for browser mode. StealthyFetcher uses a real patched Firefox under the hood when needed.
✓ Pros
  • StealthyFetcher solves Cloudflare Turnstile natively
  • Async spider v0.4, pause/resume, per-domain throttling
  • Dual-mode: HTTP for speed, browser for hard targets
  • Active development, 38K stars
✗ Cons
  • Higher complexity than plain curl_cffi
  • Browser mode adds Camoufox overhead when triggered
webclaw · HTTP · Rust · Chrome TLS · Medium targets
⚡ HTTP
Under the hood: Rust HTTP client with TLS fingerprint spoofing. Emits browser TLS signatures from Rust, fast and low-memory.
✓ Pros
  • Rust speed, very low CPU/memory overhead
  • TLS fingerprinting at Rust level
  • Good for high-volume HTTP scraping
✗ Cons
  • Rust, no Python API
  • Less widespread adoption
  • No JS rendering
httpx · HTTP · Python · None · Unprotected only
⚡ HTTP
Under the hood: Modern Python HTTP library with async support and HTTP/2.
✓ Pros
  • Async + sync in one library
  • HTTP/2 support unlike requests
  • Type hints, modern API
✗ Cons
  • TLS fingerprint still Python-default, detectable
  • Not as stealthy as curl_cffi without patching
requests · HTTP · Python · None · Unprotected only · 52k
⚡ HTTP
Under the hood: Pure Python HTTP library. Sends HTTP/1.1 requests with standard Python TLS.
✓ Pros
  • Simple API, universally known
  • Synchronous, easy to debug
✗ Cons
  • TLS fingerprint is instantly detectable (Python urllib3)
  • No async, slow for concurrent scraping
  • No anti-bot capability
tls-client · HTTP · Go/Py · Chrome/Firefox TLS · Akamai, DataDome
⚡ HTTP
Under the hood: Go/Python wrapper around a Go TLS client that mimics browser fingerprints. Predecessor to cycle-tls.
✓ Pros
  • Python bindings available
  • Bypasses JA3/JA4 fingerprinting
  • Lighter than curl_cffi
✗ Cons
  • Less actively maintained than curl_cffi
  • No JS rendering
  • Binary dependency
Playwright 🌐 · Browser · Py/JS · CDP (detectable) · Medium (CDP leaks) · 68k
🌐 Browser
Under the hood: Chromium DevTools Protocol (CDP). Microsoft-maintained. Drives real Chromium, Firefox, or WebKit browsers over CDP socket.
✓ Pros
  • Best JS execution support, renders any SPA
  • 68K stars, massive ecosystem and docs
  • Cross-browser: Chrome, Firefox, Safari (WebKit)
  • Screenshot, PDF, network intercept built-in
✗ Cons
  • CDP is detectable, needs C++ wrapper (PatchRight/Camoufox)
  • Heavy: launches a full browser process per session
  • Slow vs HTTP, ~10× more memory per concurrent task
Camoufox 🌐 · Browser · Python · C++ Firefox Juggler · Cloudflare 100%, Akamai
🌐 Browser
Under the hood: Forked Firefox with C++ binary patches to Juggler protocol (below CDP). Patches navigator, canvas, WebGL, fonts, window.chrome at binary level.
✓ Pros
  • 100% Cloudflare pass rate as of March 2026
  • geoip=True aligns all 5 identity vectors automatically
  • Below-CDP, invisible to JS-level detection
  • Async context manager, drop-in playwright replacement
✗ Cons
  • Firefox only, no Chrome/Safari
  • Heavier than curl_cffi
  • Occasional site-specific quirks with Firefox fingerprint
CloakBrowser 🌐 · Browser · Python · 49 C++ patches · Akamai, reCAPTCHA v3 0.9
🌐 Browser
Under the hood: 49 C++ binary patches to Chromium. Patches webdriver, chrome object, plugins, permissions, WebGL, Canvas at the binary level, not patchable by JS.
✓ Pros
  • reCAPTCHA v3 score 0.9, highest of any tool
  • Passes Akamai extension probes
  • Real extension fingerprint database built-in
  • C++ level, undetectable by any JS probe
✗ Cons
  • Paid product, not open source
  • Less community support than Playwright
  • Chrome only
PatchRight 🌐 · Browser · Python · Py source patches · Kasada, Cloudflare
🌐 Browser
Under the hood: Patches Playwright Python source files at install time. Removes CDP signatures, webdriver property, and stealth tells from the JS layer.
✓ Pros
  • Open source, free, Kasada bypass confirmed
  • Drop-in Playwright replacement, zero API changes
  • Patches JS layer without C++ recompilation
✗ Cons
  • JS-level patches only, determined adversary can detect at binary level
  • Less robust than Camoufox on Cloudflare 5-second challenge
  • Requires Playwright to be installed first
Puppeteer 🌐 · Browser · Node · CDP (detectable) · Medium targets · 89k
🌐 Browser
Under the hood: Node.js CDP driver for Chromium. Google-maintained. The original headless browser automation library.
✓ Pros
  • 89K stars, largest ecosystem
  • Native Google product, Chromium compatibility guaranteed
  • Good for CI/CD screenshot and PDF generation
✗ Cons
  • CDP is easily detectable (webdriver=true, window.chrome absent)
  • Node.js only, no Python
  • No anti-bot stealth built-in
Selenium 🌐 · Browser · Multi · webdriver=true · Weak (legacy) · 29k
🌐 Browser
Under the hood: WebDriver protocol (W3C standard). Drives any browser via standardised JSON protocol. The original browser automation framework.
✓ Pros
  • Multi-language: Python, Java, C#, Ruby, JS
  • Supports all browsers including IE and Safari
  • Huge ecosystem, well-documented
✗ Cons
  • navigator.webdriver=true is trivially detectable
  • Slowest option, WebDriver adds round-trip latency
  • Requires ChromeDriver binary management
SeleniumBase UC 🌐 · Browser · Python · UC removes WD flag · Kasada, general stealth · 10k
🌐 Browser
Under the hood: SeleniumBase with undetected-chromedriver mode. Patches Chrome binary to remove webdriver flag and CDP signatures.
✓ Pros
  • UC mode removes webdriver=true flag
  • Passes basic Cloudflare and PerimeterX
  • Built-in test framework, good for QA teams
✗ Cons
  • Not as strong as Camoufox/PatchRight on hard targets
  • Chrome binary patches can break on updates
  • Slower than Playwright equivalent
Selenium-Driverless 🌐 · Browser · Python · CDP no WebDriver · Medium targets
🌐 Browser
Under the hood: Direct CDP connection without ChromeDriver binary, no webdriver flag set. Async Python API.
✓ Pros
  • No ChromeDriver binary needed
  • No webdriver=true flag
  • Async Python native
✗ Cons
  • Newer, less battle-tested than nodriver
  • Chrome only
  • Some CDP signatures still detectable
nodriver 🌐 · Browser · Python · Raw CDP async · Medium targets
🌐 Browser
Under the hood: Controls Chrome via its internal DevTools socket without using CDP's standard automation flag. Chrome doesn't know it's being driven.
✓ Pros
  • Chrome does not set automation flags
  • Passes many sites that detect standard CDP
  • Lightweight, lower overhead than full Playwright
✗ Cons
  • Relatively new, less battle-tested
  • Python only
  • Some sites still detect via other JS signals
pydoll 🌐 · Browser · Python · Async CDP · Medium targets
🌐 Browser
Under the hood: Pure Python browser automation using Chrome DevTools Protocol directly. No external driver.
✓ Pros
  • No ChromeDriver dependency
  • Fast startup, no driver process
  • Pure Python, easy to install
✗ Cons
  • CDP still potentially detectable
  • Less mature than Playwright
  • Smaller community
Botright 🌐 · Browser · Python · CAPTCHA solving · CAPTCHA targets
🌐 Browser
Under the hood: Playwright wrapper focused on CAPTCHA solving and stealth. Uses AI to solve CAPTCHAs during automation.
✓ Pros
  • Auto-solves reCAPTCHA and hCAPTCHA inline
  • Stealth patches on top of Playwright
  • Good for CAPTCHA-heavy targets
✗ Cons
  • Heavier than raw Playwright
  • CAPTCHA AI may be rate-limited
  • Less control over fingerprinting
Botasaurus 🌐 · Browser · Python · Gaussian mouse · DataDome behaviour
🌐 Browser
Under the hood: Playwright wrapper that adds Gaussian mouse movement, realistic typing, scroll physics, and session management.
✓ Pros
  • Gaussian mouse curves, passes behavioural ML checks
  • Handles DataDome behavioural scoring
  • Session persistence and rotating profiles built-in
✗ Cons
  • Browser-based overhead
  • Overkill for targets without behavioural analysis
  • Less control than raw Playwright
rayobrowse 🌐 · Browser · Py/Docker · Real device FP DB · Hard targets
🌐 Browser
Under the hood: Docker-based stealth Chromium browser from Rayobyte. C++ level patches (not JS-level), exposed via CDP so Playwright/Puppeteer/Selenium can connect natively. Self-hosted = free and unlimited; managed Cloud version available.
✓ Pros
  • Free and unlimited self-hosted (Docker), Cloud version managed
  • C++ level patches survive Function.toString() inspection
  • Coherent device profile: UA, WebGL, Canvas, AudioContext, fonts all match
  • Native CDP, drop-in for Playwright/Puppeteer/Selenium
  • Used by Rayobyte to scrape millions of pages/day in production
✗ Cons
  • Still in beta, results vary by target site
  • Windows + Android profiles strongest, macOS/Linux less mature
  • Closed source (license restricts certain organizations)
  • Canvas/WebGL FP coverage still evolving
undetected-chromedriver 🌐 · Browser · Python · Removes WD flag · Medium targets · 5k
🌐 Browser
Under the hood: Patches ChromeDriver binary to remove webdriver=true and CDP automation flags at binary level.
✓ Pros
  • Removes most obvious webdriver signals
  • Simple: just replace webdriver.Chrome with uc.Chrome
✗ Cons
  • Chrome binary patches break on updates frequently
  • Not as robust as Camoufox on modern Cloudflare
  • Maintenance has slowed
⭐ Scrapy · Framework · Python · Via curl_cffi mw · Medium (with middleware) · 52k
⚡ HTTP
Under the hood: Twisted-based async Python framework. Pure HTTP, sends requests, receives responses, parses with XPath/CSS. No browser.
✓ Pros
  • 52K stars, production standard for HTTP scraping
  • Massive ecosystem: scrapy-redis, scrapy-playwright, scrapyd
  • Async by default, hundreds of concurrent requests
  • Mature: pipelines, middlewares, extensions all built-in
✗ Cons
  • No JS rendering by default (need playwright middleware)
  • Pure HTTP, detectable by TLS fingerprint without curl_cffi middleware
  • Steeper learning curve than requests
Crawlee 🌐 · Framework · Node/Py · Playwright-based · Medium targets · 15k
🌐 Browser
Under the hood: Apify's unified Node.js framework. Wraps both HTTP (got-scraping) and Playwright/Puppeteer. Handles retries, deduplication, storage.
✓ Pros
  • Dual HTTP+browser mode in one framework
  • 15K stars, actively maintained by Apify
  • Built-in dataset storage, request queue, proxy rotation
✗ Cons
  • Node.js primary (Python port is newer, less mature)
  • More opinionated than Scrapy, harder to customise
  • Heavier dependency footprint
scrapy-camoufox · Framework · Python · Camoufox integration · Hard targets
⚡ HTTP
Under the hood: Scrapy middleware that routes requests through Camoufox browser for stealth. Best of Scrapy + Camoufox.
✓ Pros
  • Scrapy pipeline management + Camoufox stealth
  • Per-request browser decision (HTTP vs browser)
  • Good for mixed protection targets
✗ Cons
  • Camoufox overhead on browser requests
  • Requires both Scrapy and Camoufox installed
scrapy-nodriver · Framework · Python · nodriver integration · Medium targets
⚡ HTTP
Under the hood: Scrapy middleware using nodriver for browser requests, Chrome without CDP flags.
✓ Pros
  • Scrapy framework + Chrome without automation flags
  • Good for Cloudflare-protected targets
  • Use Scrapy architecture you know
✗ Cons
  • nodriver overhead per browser request
  • Less control than raw nodriver
scrapy-stealth · Framework · Python · Browser TLS + HTTP/2 · Cloudflare, Akamai · v0.4 (2026)
⚡ HTTP
Under the hood: Pluggable Scrapy DOWNLOADER_MIDDLEWARE with three drivers: basic + turbo (TLS fingerprint + HTTP/2 impersonation, no browser), and browser (real Chrome via CDP for JS-heavy targets). Per-request engine switching via request.meta["stealth"].
✓ Pros
  • Built-in TLS fingerprint spoofing, scrapy-playwright/scrapy-splash/scrapy-selenium do not have this
  • Per-request engine switching: keep light HTTP for easy URLs, browser only for protected ones
  • Built-in proxy + fingerprint rotation (no separate middleware needed)
  • Native Cloudflare and Akamai detection via status + body keyword checks
  • Browser profiles like chrome_147, safari_ios_18_1_1 kept current
  • MIT license, active development (v0.4 May 2026)
✗ Cons
  • Project is new with limited GitHub adoption (low star count)
  • Less battle-tested than scrapy-playwright in production at scale
  • Browser driver 5-15s per page, use selectively for JS-protected URLs only
  • Requires Python 3.11+ and Scrapy 2.15+
Firecrawl · AI · API · FIRE-1 engine · Hard via managed · 111k
⚡ HTTP
Under the hood: API service that converts any URL to clean Markdown or structured JSON for LLM consumption. FIRE-1 agent for multi-page crawls.
✓ Pros
  • 111K stars, most popular LLM scraping tool
  • Outputs clean Markdown, 67% fewer tokens for LLMs
  • MCP server for Claude/Cursor/LangChain
  • Handles JS rendering and auth flows
✗ Cons
  • API cost at scale
  • Less control over request details vs raw scraping
  • Data goes through third-party servers
Crawl4AI 🌐 · AI · Python · Playwright-based · Medium targets · 60k
🌐 Browser
Under the hood: Local Playwright wrapper optimised for LLM output. Runs locally, converts pages to clean Markdown with BM25 relevance filtering.
✓ Pros
  • Fully local, no API cost
  • BM25 filter reduces LLM context bloat
  • LLM extraction schema definition
  • MIT license, commercial friendly
✗ Cons
  • Playwright overhead per page
  • Less anti-bot bypass than Camoufox
  • No managed infrastructure
ScrapeGraphAI · AI · Python · NL graph pipeline · Light protection · 18k
⚡ HTTP
Under the hood: LLM-powered extraction that builds a graph pipeline from a natural language prompt. Local or API.
✓ Pros
  • Natural language extraction definition
  • Open source, self-hostable
  • Graph pipeline handles multi-step extractions
✗ Cons
  • LLM inference cost/latency per extraction
  • Less deterministic than CSS/XPath selectors
  • Newer, less battle-tested at scale
Jina Reader API · AI · API · Built-in rendering · Medium targets
⚡ HTTP
Under the hood: REST API: prefix r.jina.ai/ to any URL to get clean Markdown back. Zero setup.
✓ Pros
  • Simplest possible API, one URL prefix
  • Good JS rendering
  • Free tier available
✗ Cons
  • Data goes through Jina servers
  • Less control than local scraping
  • Rate limited on free tier
Steel 🌐 · AI · API · Docker browser · Medium targets
🌐 Browser
Under the hood: Self-hosted browser API with MCP server. AI agents call it as a tool to browse the web.
✓ Pros
  • Self-hosted, data stays local
  • MCP server for AI agent integration
  • Docker deployment
✗ Cons
  • Newer product, smaller community
  • Setup overhead vs managed services
Bright Data · Managed · API · Full enterprise stack · All incl. F5 Shape
⚡ HTTP
Under the hood: 72M+ IP network + scraping API. Managed infrastructure handles anti-bot, JS rendering, proxy rotation.
✓ Pros
  • 98.44% success rate, highest benchmark
  • Covers F5 Shape (only managed service that does)
  • Residential + ISP + datacenter + mobile IPs
  • Dataset marketplace for pre-scraped data
✗ Cons
  • Most expensive option
  • Data goes through third-party
  • Overkill for simple targets
Zyte · Managed · API · Full stack · All targets
⚡ HTTP
Under the hood: Scrapy company's managed scraping platform. Zyte API + AutoExtract for structured data.
✓ Pros
  • #1 Proxyway benchmark 2025
  • AutoExtract returns structured product/article data
  • Built by the Scrapy maintainers
  • Smart proxy rotation built-in
✗ Cons
  • Expensive at scale
  • AutoExtract less flexible than custom extraction
Apify · Managed · API · 10K+ Actors · Medium-hard
⚡ HTTP
Under the hood: 10,000+ pre-built Actors on serverless cloud. Crawlee at core. MCP server for AI agents.
✓ Pros
  • Biggest marketplace of pre-built scrapers
  • MCP server: AI agents call Actors as tools
  • Free $5/mo credit for casual use
  • Crawlee open source available locally
✗ Cons
  • CU pricing can escalate
  • Data goes through Apify cloud
  • Less control over anti-bot approach in Actors
ScrapingBee · Managed · API · Managed rendering · Medium targets
⚡ HTTP
Under the hood: Managed scraping API. Handles JS rendering, CAPTCHA, proxies via simple REST call.
✓ Pros
  • Dead simple: one API call, get HTML back
  • Free tier available
  • Handles most modern JS rendering
✗ Cons
  • Less anti-bot strength than Zyte or Bright Data
  • Per-call pricing
  • Less control over request details
Oxylabs · Managed · API · OxyCopilot AI · Hard targets
⚡ HTTP
Under the hood: 102M+ IP network with OxyCopilot AI extraction and scraper APIs.
✓ Pros
  • Largest IP pool (102M+)
  • OxyCopilot: AI-powered extraction
  • Strong residential + datacenter options
✗ Cons
  • Enterprise pricing
  • Data through third-party
Browserbase 🌐 · Managed · API · Managed browser · Hard targets
🌐 Browser
Under the hood: Managed Playwright cloud. Run Playwright scripts remotely without managing browser infrastructure.
✓ Pros
  • No browser infra to manage
  • Scales automatically
  • Playwright API unchanged, zero code changes
✗ Cons
  • 42% success rate on anti-bot benchmark (vs 81% Browser Use)
  • Per-session pricing
  • Less stealth than self-hosted Camoufox
chompjs · Parser · Python · N/A · Parser only
⚡ HTTP
Under the hood: Python library to parse JavaScript objects embedded in HTML pages. Converts JS literals to Python dicts.
✓ Pros
  • Handles malformed JSON that json.loads rejects
  • Extracts __NEXT_DATA__ and embedded JS objects
  • Zero dependencies
✗ Cons
  • Parsing only, not a scraping framework
  • Narrow use case
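A two-line sketch of the chompjs use case, JS literals that json.loads rejects:
python
import chompjs

# unquoted keys, single quotes, trailing comma: invalid JSON, valid JS
data = chompjs.parse_js_object("{a: 1, 'items': [1, 2, 3,], price: 0.5}")
print(data)  # expect {'a': 1, 'items': [1, 2, 3], 'price': 0.5}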
Parsel · Parser · Python · N/A · Parser only
⚡ HTTP
Under the hood: Scrapy's HTML/XML parser library. XPath and CSS selectors with a clean Python API.
✓ Pros
  • XPath + CSS in one library
  • Used inside Scrapy, familiar API
  • Faster than BeautifulSoup for selection
✗ Cons
  • Parsing only, no HTTP requests
  • Less beginner-friendly than BS4
BeautifulSoup4 · Parser · Python · N/A · Parser only · 10k
⚡ HTTP
Under the hood: Python HTML/XML parser. Wraps lxml or html.parser. Builds a parse tree from raw HTML strings.
✓ Pros
  • Simple, readable API, beginner-friendly
  • Works on any HTML string regardless of source
  • No network requests, pure parsing
✗ Cons
  • Not a scraping framework, needs requests/httpx separately
  • Slow on large documents vs selectolax/lxml
  • No anti-bot capability whatsoever
mitmproxy · RE Tool · Python · N/A · RE / intercept · 37k
⚡ HTTP
Under the hood: Python-based HTTPS proxy. Intercepts, inspects, and modifies HTTP/HTTPS traffic between client and server.
✓ Pros
  • Full request/response visibility and modification
  • Script intercepted traffic with Python
  • Good for understanding anti-bot request patterns
✗ Cons
  • Requires certificate trust on device
  • SSL pinning blocks it on hardened apps
  • For analysis/RE, not production scraping
HTTPToolkit · RE Tool · Any · N/A · Mobile API intercept
⚡ HTTP
Under the hood: HTTPS intercepting proxy for development and mobile API discovery. Open source.
✓ Pros
  • Intercepts HTTPS without SSL pinning (with rooted device)
  • Beautiful UI for inspecting requests
  • Works with Android emulators via ADB
✗ Cons
  • For analysis only, not for production scraping
  • Requires rooted device for mobile apps
Frida · RE Tool · Py/JS · N/A · SSL hooks
⚡ HTTP
Under the hood: Dynamic instrumentation toolkit. Injects JavaScript into running processes. Used to hook native functions and bypass SSL pinning.
✓ Pros
  • Bypass SSL pinning in any Android/iOS app
  • Hook any native function at runtime
  • Essential for mobile app API extraction
✗ Cons
  • Requires rooted/jailbroken device
  • Complex setup, not for beginners
  • App-specific scripts needed per target
rebrowser-patches 🌐 · Browser · Python · Chrome source patches · Medium targets
🌐 Browser
Under the hood: JavaScript patches injected into Playwright/Puppeteer pages to mask automation signals.
✓ Pros
  • Removes navigator.webdriver and CDP signals at JS level
  • Works with any Playwright version
  • Easy to integrate
✗ Cons
  • JS-level only, binary signals still present
  • Less robust than C++ patches
cycle-tls · HTTP · Go/JS · Chrome/Firefox TLS · Akamai, DataDome
⚡ HTTP
Under the hood: Node.js/Go TLS client that cycles through browser fingerprints. Sends real JA3 hashes per request.
✓ Pros
  • Node.js TLS fingerprint spoofing
  • Per-request fingerprint rotation
  • Good for JS pipeline scraping
✗ Cons
  • Node.js only, no Python
  • Less robust than curl_cffi on hard targets
GoLogin 🌐 · Browser · Cloud · Antidetect profiles · Hard multi-account
🌐 Browser
Under the hood: Cloud anti-detect browser. Manages browser profiles with unique fingerprints stored in cloud. Multi-account management.
✓ Pros
  • Profile fingerprint management at scale
  • Good for multi-account scraping operations
  • Team sharing of browser profiles
✗ Cons
  • Paid product, cloud-dependent
  • Not suitable for automated pipeline scraping
  • Designed for manual browsing, not scripted crawling
Multilogin 🌐 · Browser · Cloud · Antidetect profiles · Hard multi-account
🌐 Browser
Under the hood: Commercial anti-detect browser with managed profile fingerprints. Team collaboration on browser profiles.
✓ Pros
  • Professional multi-account management
  • Managed fingerprint database
  • Team profile sharing
✗ Cons
  • Very expensive
  • Designed for manual use, not automated crawling
  • Data in cloud
ScraperAPI · Managed · API · Full stack · All incl. Walmart
⚡ HTTP
Under the hood: Simple proxy rotation + JS rendering API. Handles geo-targeting and header rotation.
✓ Pros
  • Simple integration, just prepend URL
  • Free tier with 1000 calls/mo
  • Geo-targeting built in
✗ Cons
  • Weaker on hard anti-bot targets
  • Basic anti-bot handling vs Zyte/Bright Data
Decodo · Managed · API · Full stack · All targets
⚡ HTTP
Under the hood: Smartproxy's new brand. Residential, datacenter, and mobile proxy network.
✓ Pros
  • Affordable residential proxies
  • Pay-as-you-go pricing
  • Good for mid-scale scraping
✗ Cons
  • Less powerful than Bright Data on hard targets
  • Smaller IP pool
CapSolver · CAPTCHA · API · N/A · reCAPTCHA/hCaptcha
⚡ HTTP
Under the hood: AI-powered CAPTCHA solving service. Uses computer vision to solve reCAPTCHA v2/v3, hCAPTCHA, Cloudflare Turnstile.
✓ Pros
  • Solves reCAPTCHA v3, hCAPTCHA, Turnstile, ImageCAPTCHA
  • Fast: under 10 seconds for most CAPTCHA types
  • API-based, works with any language
✗ Cons
  • Cost per solve (~$0.001–0.002)
  • reCAPTCHA v3 score may be low vs C++ browser
  • Solving is symptomatic, better to avoid triggering CAPTCHA
2captcha · CAPTCHA · API · N/A · All CAPTCHA types
⚡ HTTP
Under the hood: Human + AI hybrid CAPTCHA solving service. One of the oldest in the market.
✓ Pros
  • Solves almost any CAPTCHA type including custom ones
  • Human fallback for unusual CAPTCHAs
  • Large API ecosystem
✗ Cons
  • Slowest option, human solving adds latency
  • Cost per solve
  • Less automated than CapSolver
Anti-Captcha · CAPTCHA · API · N/A · reCAPTCHA/image
⚡ HTTP
Under the hood: Human + AI CAPTCHA solving service. Competitor to 2captcha.
✓ Pros
  • Solves all major CAPTCHA types
  • Competitive pricing
  • API compatible with 2captcha
✗ Cons
  • Human solving latency
  • Cost per solve
  • Better to avoid triggering CAPTCHA in the first place
Scrapyd · Framework · Python · Via middleware · Scrapy deploy tool
⚡ HTTP
Under the hood: Daemon that deploys and runs Scrapy spiders via JSON API. Port 6800. Process-based job queue.
✓ Pros
  • Zero cloud cost, runs on any server
  • ScrapydWeb provides visual dashboard
  • Simple deploy: scrapyd-deploy -p project
✗ Cons
  • Single node by default, no horizontal scaling
  • No built-in monitoring or alerting
  • Job isolation is process-level only
scrapy-redis · Framework · Python · N/A · Distributed Scrapy
⚡ HTTP
Under the hood: Scrapy extension connecting spiders to a Redis shared URL queue. Enables distributed crawling.
✓ Pros
  • Horizontal scale: add workers without code change
  • Redis deduplicates URLs across all workers
  • One codebase, N machines
✗ Cons
  • Redis is a new SPOF
  • No built-in job scheduling
  • Requires Redis infrastructure
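The canonical settings.py additions, a sketch of the documented scrapy-redis setup:
python
# settings.py: point Scrapy's scheduler and dupefilter at Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the queue across spider restarts
REDIS_URL = "redis://localhost:6379"

# then seed the shared queue from anywhere:
#   redis-cli lpush myspider:start_urls https://example.com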
scrapy-cluster · Framework · Python · N/A · Enterprise Scrapy
⚡ HTTP
Under the hood: Distributed Scrapy cluster using Redis + Kafka + Zookeeper. Enterprise-scale distributed crawling.
✓ Pros
  • True enterprise-scale distributed crawling
  • Kafka for message durability
  • Multi-project support
✗ Cons
  • Complex infra: Redis + Kafka + Zookeeper
  • Overkill for most use cases
  • High ops overhead
scrapy-poet · Framework · Python · N/A · Page Object pattern
⚡ HTTP
Under the hood: Dependency injection framework for Scrapy spiders. Cleaner spider code with page objects.
✓ Pros
  • Cleaner code via page objects pattern
  • Works with zyte-spider and AutoExtract
  • Testable spider logic
✗ Cons
  • Adds abstraction overhead
  • Learning curve for Scrapy veterans
Splash 🌐 · Browser · Docker · Lua scripting · Light protection
🌐 Browser
Under the hood: Lua-scriptable browser for JS rendering, runs in Docker. Integrates with Scrapy via scrapy-splash.
✓ Pros
  • Docker-based, easy to deploy
  • Lua scripting for complex interactions
  • Good for Scrapy integration on JS sites
✗ Cons
  • Outdated, Playwright has superseded it
  • Lua scripting adds complexity
  • Less stealth than Camoufox
selectolax · Parser · Python · N/A · Fast HTML parser
⚡ HTTP
Under the hood: C-based HTML parser (lexbor engine). 10–100× faster than BeautifulSoup for pure parsing tasks.
✓ Pros
  • Extremely fast, C engine vs Python in BS4
  • CSS selectors with clean Python API
  • Low memory footprint
✗ Cons
  • CSS selectors only, no XPath
  • Less forgiving on malformed HTML than BS4
  • Smaller community/docs
lxml · Parser · Python · N/A · XPath + CSS parser
⚡ HTTP
Under the hood: C-based XML/HTML parser. Fastest Python HTML parsing option.
✓ Pros
  • Fastest Python HTML parser by far
  • Full XPath 1.0 support
  • Handles massive documents efficiently
✗ Cons
  • Stricter on malformed HTML than BS4
  • C dependency, occasional install issues
  • Verbose API vs BS4
w3lib · Parser · Python · N/A · URL/text utils
⚡ HTTP
Under the hood: Web-related utility functions. URL normalisation, encoding handling. Used internally by Scrapy.
✓ Pros
  • URL cleaning and normalisation
  • Encoding detection and conversion
  • Scrapy internals, very stable
✗ Cons
  • Utility library only, not a scraper
  • Most devs use it via Scrapy, not directly
SwiftShadow · Proxy · Python · N/A · Proxy pool manager
⚡ HTTP
Under the hood: Free proxy pool manager. Fetches, validates and rotates free proxies automatically.
✓ Pros
  • Free, zero proxy cost
  • Auto-validates and rotates on failure
  • 2 lines of code integration
✗ Cons
  • Free proxies are low quality, high failure rate
  • Not for hard anti-bot targets
  • IP reputation usually poor
requests-ip-rotator · Proxy · Python · N/A · AWS API Gateway IPs
⚡ HTTP
Under the hood: Rotates requests through AWS API Gateway endpoints to get rotating IPs.
✓ Pros
  • Free if AWS free tier available
  • AWS IPs have good reputation
  • Works with requests library
✗ Cons
  • AWS API Gateway has rate limits
  • Setup requires AWS account
  • Limited rotation speed
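The documented usage pattern, a sketch (requires AWS credentials configured locally; the target URL is a placeholder):
python
import requests
from requests_ip_rotator import ApiGateway

gateway = ApiGateway("https://target.example.com")
gateway.start()  # creates API Gateway endpoints across regions

session = requests.Session()
session.mount("https://target.example.com", gateway)
print(session.get("https://target.example.com/page").status_code)

gateway.shutdown()  # tear the gateways down to stop any billing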
Colly · Framework · Go · Go TLS · Medium targets · 15k
⚡ HTTP
Under the hood: Go HTTP scraping framework. Fast, concurrent, clean API.
✓ Pros
  • Very fast, Go concurrency model
  • Low memory vs Python
  • Good for high-throughput HTTP scraping
✗ Cons
  • Go only, no Python
  • Smaller ecosystem than Scrapy
  • No browser support
Katana · Framework · Go · Go TLS + Chromium · Medium targets · 8k
⚡ HTTP
Under the hood: Go-based web crawler by ProjectDiscovery. Designed for security research and recon.
✓ Pros
  • Extremely fast Go crawler
  • Headless mode with Playwright integration
  • Built for large-scale URL discovery
✗ Cons
  • Security/recon focus, not a data scraping framework
  • Go only
  • Less data extraction tooling than Scrapy
playwright-go 🌐 · Browser · Go · CDP (detectable) · Medium targets
🌐 Browser
Under the hood: Go bindings for Playwright. Same Playwright API in Go.
✓ Pros
  • Go concurrency for browser scraping
  • Same Playwright API and capabilities
  • Lower memory than Python for concurrent sessions
✗ Cons
  • Less mature than Python Playwright
  • Smaller community
  • No stealth patches yet
Charles Proxy · RE Tool · Any · N/A · Mobile API intercept
⚡ HTTP
Under the hood: Commercial HTTPS proxy for request inspection and debugging. GUI-based.
✓ Pros
  • GUI-based, easy to use for non-developers
  • SSL proxying with certificate install
  • Session recording and replay
✗ Cons
  • Paid product
  • For debugging only, not automated scraping
  • Less powerful than mitmproxy for scripting
Selenoid · HTTP · Go (Docker) · Browser-as-a-service · Medium targets · ★ 2.6k
⚡ HTTP
Under the hood: Docker containers running headless Chrome/Firefox in parallel, Aerokube's Go-based Selenium grid replacement.
✓ Pros
  • Run dozens of browsers in parallel from one host
  • Lower memory than Selenium Grid
  • Built-in video recording per session
  • Drop-in replacement for Selenium Grid
✗ Cons
  • Browsers still detectable as headless without stealth patches
  • Older project, slower release cadence
  • Requires Docker infrastructure
noble-tls · HTTP · Python · Chrome JA3/JA4 · Cloudflare, DataDome
⚡ HTTP
Under the hood: Python port of uTLS via custom TLS handshake stack, emits browser-matching ClientHello.
✓ Pros
  • Bypasses JA3/JA4 fingerprinting
  • Pure Python, no C compilation
  • Lighter than curl_cffi for simple cases
  • Easy install via pip
✗ Cons
  • Smaller community than curl_cffi
  • Fewer browser impersonation profiles
  • Less battle-tested in production
hrequests · HTTP · Python · Browser-grade TLS · DataDome, Cloudflare · ★ 900
⚡ HTTP
Under the hood: Drop-in requests replacement with TLS impersonation, header order matching, and optional Playwright browser mode.
✓ Pros
  • requests-compatible API with stealth built in
  • Header order mimics real Chrome
  • Optional browser mode for JS rendering
  • Built-in async support
✗ Cons
  • Smaller ecosystem than curl_cffi
  • Fewer impersonate profiles
  • Newer project, some edge cases
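A minimal sketch of the drop-in usage, the call shape mirrors requests:
python
import hrequests

# drop-in for requests, but the TLS handshake and header order match real Chrome
resp = hrequests.get("https://example.com")
print(resp.status_code)
print(resp.text[:200])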
crawlee-python 🌐 · Browser · Python · Via curl_cffi backend · Most targets · ★ 6.2k
🌐 Browser
Under the hood: Python port of Apify Crawlee, wraps curl_cffi for HTTP and Playwright for browser modes in a unified framework.
✓ Pros
  • Mix HTTP and browser workers in one crawler
  • Auto-scaling and proxy rotation built in
  • Storage abstraction for results
  • Strong production patterns from Apify
✗ Cons
  • Larger than plain Scrapy
  • Newer than Node.js Crawlee, some features lag
  • Opinionated framework
estela · Framework · Python (K8s) · Spider-dependent · Distributed Scrapy · ★ 90
⚡ HTTP
Under the hood: Kubernetes orchestrator for Scrapy, schedules and runs spiders as K8s jobs with auto-scaling.
✓ Pros
  • Open source alternative to Zyte Cloud
  • Elastic scaling on Kubernetes
  • Built-in monitoring and stats UI
  • Multi-tenant by design
✗ Cons
  • Requires Kubernetes infrastructure
  • Heavyweight for small projects
  • Smaller community than Scrapyd
fake-useragent · HTTP · Python · UA strings only · Lightweight only · ★ 3.8k
⚡ HTTP
Under the hood: Curated database of real-world User-Agent strings, sampled from browser telemetry sources.
✓ Pros
  • Realistic UA strings ready out of the box
  • Filter by browser family or OS
  • Updated database
  • Tiny dependency
✗ Cons
  • UA alone is trivially detectable in 2026
  • Not enough for any modern anti-bot
  • Database can become stale
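A minimal sketch rotating the UA header on plain requests; remember, a UA string alone will not beat any real anti-bot:
python
import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {"User-Agent": ua.chrome}   # or ua.random, ua.firefox, ua.safari
r = requests.get("https://httpbin.org/headers", headers=headers)
print(r.json()["headers"]["User-Agent"])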
grequests · HTTP · Python · requests + gevent · Unprotected APIs · ★ 4.4k
⚡ HTTP
Under the hood: gevent-monkey-patched requests, fires hundreds of HTTP calls in parallel via greenlets.
✓ Pros
  • Drop-in async for requests users
  • Simpler than asyncio for bulk fetches
  • Battle-tested gevent under the hood
✗ Cons
  • Monkey-patching can conflict with other libs
  • No HTTP/2 support
  • Newer code should use httpx instead
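A minimal sketch of concurrent fetching with map():
python
import grequests  # import before requests so gevent monkey-patching applies cleanly

urls = [f"https://httpbin.org/get?page={i}" for i in range(20)]
pending = (grequests.get(u, timeout=10) for u in urls)

# size=10 caps concurrency; failed requests come back as None
for resp in grequests.map(pending, size=10):
    if resp is not None:
        print(resp.status_code, resp.url)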
Scrapoxy · Framework · Node.js · Proxy manager · Self-hosted rotation · ★ 2.1k
⚡ HTTP
Under the hood: Self-hosted proxy pool manager, provisions proxies on AWS, Azure, GCP and rotates IPs automatically.
✓ Pros
  • Free open-source alternative to Bright Data's proxy manager
  • Auto-provision and tear down cloud IPs
  • Ban detection and auto-rotation built in
  • Multi-tenant
✗ Cons
  • Cloud provider costs add up at scale
  • Self-hosting infrastructure complexity
  • Cloud IPs flagged faster than residential

Browser engines, deep dive

Critical 2026 fact: CDP (Chrome DevTools Protocol) is itself detectable. Runtime.enable timing, execution context leaks, and binding exposure all signal automation. Camoufox uses Mozilla's Juggler protocol below CDP, no CDP leaks. playwright-stealth patches JS at runtime but Function.toString() exposes the patch.
Microsoft 2020
Playwright ★ 68k
Chromium + Firefox + WebKit
The 2026 standard framework. Powers Firecrawl, Crawl4AI, Browserbase. CDP is detectable, use C++-patched wrappers built on top of Playwright. Auto-wait, network interception, multi-browser.

pip install playwright && playwright install
C++ Firefox · Juggler
Camoufox ★ 100%
Zero CDP exposure · geoip alignment
Mozilla Juggler operates below the CDP level, zero CDP leaks. Near-zero fingerprint surface. 100% pass rate (Mar 2026) on Cloudflare, Instagram, Reddit, X. Note: Firefox has ~3% market share.

from camoufox.sync_api import Firefox
Stealth Chromium
CloakBrowser
49+ C++ binary patches
Binary patches: Canvas, WebGL, Battery API, AudioContext, CDP input. reCAPTCHA v3 score 0.9. Passes Akamai's 60 extension probes with real extension loading. Best for Akamai-targeted Chromium sites.
Playwright source fork
PatchRight
No JS signatures anywhere
Patches Playwright Python source, not JS injection. Kasada fingerprints playwright-stealth via toString(). PatchRight leaves nothing in the runtime to inspect.

pip install patchright
Google · Node.js
Puppeteer ★ 89k
Chrome DevTools Protocol
Google's original CDP automation. puppeteer-stealth plugin patches common detection points. CDP signature still visible at protocol level. Better for rendering tasks than hard anti-bot targets.
Multi-language · WebDriver
Selenium ★ 29k
Legacy, navigator.webdriver=true
navigator.webdriver=true detectable in 2 JS lines. Use SeleniumBase UC mode to remove. Stock Selenium is dead against Akamai in 2026. Still valid for non-protected targets.
Python · UC Mode
SeleniumBase ★ 10k
Undetected Chrome Mode
UC mode removes navigator.webdriver. Auto-solves many CAPTCHAs. Good for Kasada, medium targets. Not production-safe against Akamai at scale.

from seleniumbase import Driver
Raw CDP · Async Python
nodriver / pydoll
Direct Chrome DevTools Protocol
Direct CDP without WebDriver overhead. Used with Botright for CAPTCHA solving. scrapy-nodriver integrates with Scrapy directly. Lighter than full Playwright for medium targets.
Human behaviour simulation
Botasaurus
Gaussian mouse physics
Physically realistic mouse curves via Gaussian jitter. Combines with Patchright for protocol-level evasion + human behaviour. Effective against DataDome's 35-signal behavioural analysis.

pip install botasaurus
python
from camoufox.sync_api import Firefox

# geoip=True: auto-aligns IP, timezone, locale, WebRTC simultaneously
with Firefox(
    geoip=True,       # align all 5 identity vectors to proxy exit country
    humanize=True,    # Gaussian mouse jitter
    proxy={"server": "http://proxy.provider.com:8011",
           "username": "user", "password": "pass"},
    screen={"width": 1920, "height": 1080},
) as browser:
    page = browser.new_page()
    # Warm up, never go directly to target URL
    page.goto("https://www.google.com")
    page.wait_for_timeout(2000)
    page.goto("https://cloudflare-protected.com")
    page.wait_for_load_state("networkidle")
    print(page.content()[:500])

The tools above solve the access problem. But once you have the raw HTML or JSON, you still need to extract meaning from it. That is where AI-native scraping changes everything. In 2026 the bottleneck is not access. It is the extraction layer.

05 AI & LLM Scraping

Describe, don't
select

AI-native scraping replaces CSS selectors with natural language. A 2025 NEXT-EVAL benchmark showed LLMs hit F1 > 0.95 on structured extraction when input is properly formatted.

2026 Market Shift
Why AI scraping matters now
Firecrawl's markdown output uses 67% fewer tokens than raw HTML, compounds significantly at thousands of pages for RAG pipelines. AI web scraping market: $7.5B → $38B by 2034 (CAGR 19.93%). LangChain, LlamaIndex, and CrewAI all have native integrations. Claude and Cursor can scrape the web via MCP tools with zero code.
Firecrawl ★ 111k
Managed · Self-hostable · FIRE-1 · MCP
Send URL → clean Markdown/JSON. No selectors. MCP server: Claude scrapes via natural language. FIRE-1 agent navigates autonomously. /interact endpoint clicks, fills forms, extracts behind dynamic content. Used by SAP, Zapier, Deloitte.

app.scrape(url) | app.crawl(site) | app.search("query")
✓ LangChain + LlamaIndex native · 500 free/mo
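A minimal sketch using the method names above; assumes the current Python SDK where scrape() returns a document with a .markdown field, and the API key is a placeholder:
python
from firecrawl import FirecrawlApp   # pip install firecrawl-py

app = FirecrawlApp(api_key="fc-...")              # key from the Firecrawl dashboard
doc = app.scrape("https://example.com/pricing")   # one call, no selectors written
print(doc.markdown[:500])                         # LLM-ready Markdown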
Crawl4AI ★ 60k
Open-source · Local LLM · Full control · MIT
"Scrapy for the LLM era." Runs on your infrastructuredata never leaves your servers. Adaptive crawling learns selectors over time. BM25 content filter. Plug in Ollama for local models or OpenAI/Deepseek.

result = await crawler.arun(url)
✓ Full data sovereignty, free, MIT license
ScrapeGraphAI ★ 18k
NL prompts · Graph pipeline · Self-healing
Describe what you want, LLM builds and executes a graph-based extraction pipeline. Self-healing: site structure changes, re-describe and it adapts. No selectors ever written. Supports OpenAI, Claude, local.

SmartScraperGraph(prompt="...", source=url)
✓ Best for: schema-free exploration, prototyping
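A minimal sketch; the config shape follows recent ScrapeGraphAI versions, and both the target URL and API key are placeholders:
python
from scrapegraphai.graphs import SmartScraperGraph

graph = SmartScraperGraph(
    prompt="List every product name and price on this page",
    source="https://example.com/shop",
    config={"llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."}},
)
print(graph.run())   # LLM builds and executes the extraction graph, returns a dict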
webclaw
Rust · Chrome TLS · 10 MCP tools · 95.1% accuracy
Rust-native scraper built for AI agent integration. 10 MCP tools: Claude and Cursor can call it directly via natural language. 95.1% success rate on bot-protected sites. Zero Python overhead, runs as subprocess or HTTP service. Chrome-level TLS fingerprinting baked in.

pip install webclaw
✓ Best for: AI agents needing high-performance scraping + MCP integration
Jina Reader API
URL → clean text · Zero code
Simplest LLM scraping tool. r.jina.ai/{url} is the entire API. Returns clean Markdown. Dynamic content handled via built-in rendering. Free tier available, paid ~$0.002–$0.01/page.
✓ Best for: text extraction, no-code integration
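A complete usage sketch, the prefix really is the whole API:
python
import requests

target = "https://example.com/article"
md = requests.get(f"https://r.jina.ai/{target}").text   # prepend r.jina.ai/ to any URL
print(md[:400])   # clean Markdown, ready for an LLM context window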
Steel
Open-source · Docker · MCP · AI agents
Self-hostable headless browser API for AI agents. MCP server: Claude controls browsers directly. Session persistence + CAPTCHA auto-solve. <1s session start. LangChain/CrewAI integration.
✓ Best for: AI agents needing browser control
Browserbase
Managed cloud · $300M valuation
50M sessions in 2025. Playwright/Puppeteer drop-in, one endpoint swap. Session recordings + CAPTCHA auto-solve. Used by AI agent frameworks as the browser layer. From $50/mo.
✓ Best for: AI agent infrastructure at scale
python
import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

# Define exactly what you want, LLM extracts it, no selectors needed
class Product(BaseModel):
    name: str
    price: float
    model_number: str
    brand: str

async def extract(url):
    strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",
        schema=Product.model_json_schema(),
        extraction_type="schema",
        instruction="Extract all products with prices and model numbers"
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, extraction_strategy=strategy)
        return json.loads(result.extracted_content)
# F1 > 0.95 on well-structured pages, NEXT-EVAL benchmark 2025

When DIY cost exceeds platform cost, these services handle the heavy lifting. Each solves a specific problem, choosing the right one depends on which wall you are facing and at what scale.

06 Managed platforms

When DIY cost
exceeds platform cost

If you are spending more than two engineer-days per month on anti-bot maintenance, a managed platform is cheaper. The crossover typically hits when facing F5 Shape or Kasada at scale.

Bright Data 98.44%
Enterprise · 72M+ IPs · Scrape.do #1 2025
Highest success rate in the Scrape.do 2025 benchmark. 100% on Indeed, Zillow, Capterra. 72M+ residential IPs. GDPR, ISO 27001. Scraping Browser for JS-heavy targets. $1.50/1K requests.
✓ Best for: F5 Shape, hard targets at scale brightdata.com ↗
Zyte 93.14%
#1 Proxyway 2025 · Fastest API · Scrapy
#1 Proxyway 2025, 93.14% success rate. Fastest API response. Smart Proxy auto-selects type. GPTE AI generates parsers from natural language. scrapy-zyte-smartproxy integration.
✓ Best for: Scrapy pipelines, speed zyte.com ↗
Firecrawl ★ 111k
AI scraping · Self-hostable · MCP
URL → Markdown/JSON. No selectors. MCP server, Claude scrapes via natural language. LangChain + LlamaIndex native. Used by SAP, Zapier, Deloitte. 500 free/mo.
✓ Best for: RAG pipelines, AI agents firecrawl.dev ↗
Crawl4AI ★ 60k
Open-source · Local LLM · Free
89.7% OOTB success rate. Runs on your infrastructure. Local LLM support (Ollama). MIT license. Adaptive crawling. Full data sovereignty, data never leaves your servers.
✓ Best for: privacy, open-source, cost $0 crawl4ai.com ↗
Oxylabs
Enterprise · 100M+ IPs · OxyCopilot AI
100M+ IPs, 195 countries. OxyCopilot AI generates parser code from natural language. Owns ScrapingBee (acquired 2025). ISO 27001 + GDPR. From $49/mo.
✓ Best for: enterprise scale, AI-generated parsers oxylabs.io ↗
ScrapingBee
Headless · Managed rendering
Handles JS rendering, CAPTCHAs and proxies. Simple REST API, pass a URL, get back HTML or screenshots. Good for teams that want managed scraping without infrastructure. Free tier available.
scrapingbee.com ↗
Scrapfly
Anti-bot · AI extraction · Monitoring
Premium scraping API with built-in anti-bot bypass, JS rendering, and AI-powered data extraction. Strong on hard targets. Includes scraping monitoring and scheduling out of the box.
scrapfly.io ↗
Diffbot
AI extraction · Knowledge graph
Uses computer vision and AI to automatically extract structured data from any webpage, no CSS selectors, no XPath. Builds a knowledge graph from scraped content. Best for unstructured web data that needs AI parsing.
diffbot.com ↗
WebScraper.io
No-code · Chrome extension
Point-and-click scraping via a Chrome extension, select elements visually, define pagination, export to CSV. No coding required. Cloud version runs scrapers on schedule. Best for non-technical users.
webscraper.io ↗
Browse.ai
No-code · Monitor · Robots
Train a robot to scrape any website in 2 minutes by clicking on the data you want. Monitors for changes, sends alerts. Handles login flows, pagination, and dynamic sites. No code needed at all.
browse.ai ↗
Browser Use
AI agent · LLM-controlled browser
Open-source library that lets LLMs control a real browser. The AI agent navigates, clicks, fills forms and extracts data from instructions in natural language. 81% success rate on anti-bot benchmarks. GitHub ↗
browser-use.com ↗
Stagehand v3 · OCT 2025
AI Browser SDK · Browserbase · Open source
Browserbase's AI browser automation framework. Four primitives: act(), extract(), observe(), agent(). Write browser flows in plain English ("click submit button") that survive page redesigns via runtime LLM resolution. Built on CDP, supports OpenAI/Anthropic/Gemini. 65% Mind2Web benchmark. Self-healing + auto-caching. TypeScript and Python.
browserbase.com/stagehand ↗
Kadoa
AI · Zero-config · Auto-adapt
AI-powered scraping that requires zero configuration, no selectors, no rules. Understands page structure automatically and adapts when sites change. Ideal for scraping at scale without maintaining spider code.
kadoa.com ↗
ScrapeGraphAI
LLM · Graph pipeline · Open source
Builds a graph-based extraction pipeline from a natural language prompt. Describe what data you want, it generates the scraping logic. Open source and self-hostable. Good for rapid prototyping of complex extractions.
GitHub ↗
TinyFish
AI · Structured extraction · Fast
AI-native scraping API focused on speed and structured data output. Pass a URL and a schema, get back clean typed JSON. Handles JS rendering and basic anti-bot. Good fit for feeding structured data into AI pipelines.
tinyfish.io ↗
Nimble
AI · Structured · E-commerce
AI-powered web data platform with pre-built pipelines for e-commerce, SERP, and social. Returns structured data with no parsing needed. Built-in proxy network. Strong for retail intelligence and price monitoring.
nimbleway.com ↗
NetNut
ISP · Residential · Scraping API
ISP-level residential proxy network with a built-in scraping API. Direct carrier connections for lower detection risk. Strong for e-commerce and SERP scraping where IP freshness and session stability matter.
netnut.io ↗
Scraping Robot BY RAYOBYTE
Scraping API · 5,000 free/month · JSON output
Plug-and-play scraping API from Rayobyte. Returns clean JSON, handles cookies + headers + browser attributes automatically. 5,000 free scrapes/month on signup, paid tiers from $5/GB. Built on Rayobyte's proxy network and rayobrowse stealth browser. Lower entry barrier than Bright Data or Zyte for teams wanting "scraping as a service" without infrastructure.
scrapingrobot.com ↗
✓ Best for">
5b Adjacent category

Computer Use Agents when scraping isn't enough

A new category emerged in 2025: AI agents that don't just scrape, they log in as the user, navigate any UI (web apps, legacy portals, desktop software), handle MFA and CAPTCHAs, and return structured JSON. Different from scrapers because the user grants permission, "Plaid for any website." If your problem is utility bills, payroll exports, e-commerce backends, or any portal without a public API, this is the category.

Deck FEATURED · $25M RAISED
Computer Use Agents · Credential Vault · SOC 2
Plaid-ifies any website. Provisions isolated desktop VMs, encrypts credentials in Deck Vault, runs AI agents that log in, navigate, and return schema-validated JSON. Founded by the team behind Flinks (Canadian open-banking, acquired for $150M by National Bank). Connects to 100,000+ utility providers across 40+ countries. Handles MFA, CAPTCHA, device fingerprinting, audit-logged sessions. Strong on regulated portals with no public API.
deck.co ↗
Skyvern
Open source · LLM + Computer Vision · 85.8% WebVoyager
YC-backed open-source agent that uses LLMs and computer vision (no XPath or CSS selectors) to operate any browser workflow. State-of-the-art 85.8% on WebVoyager benchmark. Used for invoice retrieval, job applications, government forms, insurance quotes. Both cloud-hosted and self-hostable SDK with Playwright integration.
skyvern.com ↗
Bytebot
SDK · AI browser automation
SDK-first computer use agent platform. Lighter footprint than full VM solutions, integrates into existing apps. Targets developer workflows where you want agentic browser actions without managing browser pools yourself.
bytebot.ai ↗
CloudCruise
Browser automation · Web agents
Developer platform for creating and managing web agents. Focuses on production-grade browser automation infrastructure. Competes with Deck and Browserbase on the infra layer.
cloudcruise.ai ↗
Autotab
Enterprise AI agent · Data + form automation
General-purpose AI agent for enterprise, data collection, form filling, executing actions across business apps. Pitched at operations teams rather than developers.
autotab.ai ↗
Browserless
Managed headless Chrome · CDP-as-a-service
Chrome-as-a-service over WebSocket and REST. Foundation layer that other agent platforms build on. Strong for teams that want managed browser pools without the agent reasoning layer on top.
browserless.io ↗
When to pick this category over scraping: if the data lives behind a login the user owns (their utility bill, their bank statement, their payroll), Computer Use Agents are the right answer, the user permission model gives you a clean legal posture and access to data scraping legally cannot reach. If the data is public-facing (e-commerce listings, SERPs, social), traditional scraping is faster and cheaper.

Platforms sort out the browser and the fingerprint. But every request still needs an IP address, and the type of IP matters as much as any other signal in your stack.

07 Proxy strategy

IP type matters
more than provider

Rotating proxies is table stakes. The real variable is IP type, datacenter IPs score near-zero on DataDome and PerimeterX regardless of fingerprint quality.

Datacenter
Trust: Very Low
AWS/GCP/Azure ranges. Instantly flagged by DataDome and PerimeterX. Cheapest (~$0.01/GB). Use only on non-protected public data. Never for Akamai or PerimeterX targets.
Residential
Trust: High
Real home ISP addresses. Passes most trust checks. Confirmed: curl_cffi + residential bypasses DataDome on Grainger.com. Rotate per session, not per request, mid-session rotation = Akamai block.
Mobile / 4G
Trust: Highest
T-Mobile, Vodafone, O2 carrier IPs. Highest trust score on DataDome and PerimeterX. Shared tower IPs, hard to flag. Mobile IPs get DataDome 200 OK where residential fails. ~$10–15/GB.
ISP / Static
Trust: High
Static residential range. Akamai multi-request scoring rewards consistent IPs, trust accumulates from same ISP IP. Best for long sessions on Akamai sites. Never rotate mid-session.
NetNut
ISP Direct
ISP-based infrastructure with direct carrier connections. Lower detection risk than pooled residential. Fast and stable, good for e-commerce targets that check IP freshness and session age.
IPRoyal
Residential
Ethically sourced residential + datacenter. Pay-as-you-go pricing, no long-term commitment. Good entry point before scaling to enterprise contracts with Oxylabs or Bright Data.
Massive I use this Ethical · Founded 2018
Residential · ISP · Web Access API · Web Render API · MCP Server
Try Massive with my referral ↗
"I've tested a lot of proxy providers across my 7 years in scraping. Massive stands out for two reasons: the ethics are real (not marketing), and the performance numbers hold up under actual load. 99.87% US success rate and 0.52s response time aren't made up, my production runs match that. If you care about running a clean, compliant operation, this is where I'd start." Asad Ikram, Data Engineer
Residential Proxies
1.6M+ IPs, 195+ countries. 99.87% US success rate. 0.52s response time. GDPR + CCPA compliant, AppEsteem certified. From $4.9/GB.
ISP Proxies
Static residential IPs for sticky, session-bound workflows. 100% success rate, 0.09s response (US). From $1.8/IP. Best for continuous monitoring.
Web Render API
Full JavaScript rendering with anti-bot bypass at scale. From $8/mo. Handles Cloudflare-protected pages without you managing browsers.
MCP Server ✦ new
Official MCP server. Use Massive directly from Claude, Cursor, or any MCP client. Geo-targeted search, bulk extraction, SERP analysis without leaving your AI workflow.
99.87%
US success rate
0.52s
response time
195+
countries
99.9%
uptime
100%
ethically sourced
<20%
fraud score (US)
Verified target success rates
Instagram 96% Amazon 94% Google 88% ISP 100% Trusted by Snowflake · Shopee · Tavily
Startups get 1TB free for 3 months, no equity required. 24/7 live support. GDPR + CCPA compliant. AppEsteem certified.
Join with my referral ↗
Rayobyte
Ethical · Multi-type
"America's #1 proxy provider" (formerly Blazing SEO, est. 2014). 40M+ residential IPs across 100+ countries, plus ISP, datacenter, and mobile. Non-expiring bandwidth sets them apart, $3.50/GB residential dropping to $0.50/GB at 5TB+. Ethically sourced via Cash Raven consent-based proxyware. Ships rayobrowse stealth browser too. Hands-on technical CEO, partners with EWDCI for ethics standards.
Scrapoxy
Open Source Manager
Self-hosted proxy manager that pools and rotates proxies across AWS, Azure, GCP. Routes requests through different IPs automatically. Free alternative to commercial proxy managers. github.com/fabienvauchelles/scrapoxy
WebRTC coherence rule: Proxy IP country, WebRTC ICE candidate, DNS resolver, timezone, and Accept-Language must all agree. US residential proxy + Pakistani DNS = flagged by every major anti-bot. Use geoip=True in Camoufox to align all five vectors automatically.
Crawlera/Zyte proxy bug: Port 8011 speaks plain HTTP. Both http:// and https:// keys must use http:// scheme. Using https:// causes BoringSSL WRONG_VERSION_NUMBER (TLS-over-TLS failure). Fix: "https": "http://key:@proxy.crawlera.com:8011/"
python
from curl_cffi import requests
import time, random

session = requests.Session(impersonate="chrome124")

# Crawlera/Zyte: BOTH keys use http://, never https://
PROXIES = {
    "http":  "http://apikey:@proxy.crawlera.com:8011",
    "https": "http://apikey:@proxy.crawlera.com:8011",  # http:// not https://
}

def fetch(url, retries=3):
    for i in range(retries):
        try:
            r = session.get(url, proxies=PROXIES,
                            timeout=30, verify=False)  # verify=False: proxy cert
            if r.status_code == 200:
                return r
            if r.status_code in (403, 429):
                time.sleep(2**i + random.uniform(0, 1))
        except Exception as e:
            print(f"Error: {e}")
    return None

You now have the full picture: detection layers, six anti-bots, sixty libraries, managed platforms, proxy types. This section collapses all of it into a single decision tree you can follow for any target site.

08 Decision playbook

Walk this in order.
Stop at first win.

Each step adds complexity, cost, and maintenance. Most production scraping is solved at steps 1–3. Never start at step 5.

01
Lowest friction · Asad's priority #1
Find the mobile API
Mobile apps hit the same backend with far weaker bot protection. HTTPToolkit intercepts all HTTPS from an Android emulator. Frida hooks into SSL_read/SSL_write directly. If you find the API endpoint, every HTML anti-bot becomes irrelevant.
HTTPToolkit
Frida
mitmproxy
Burpsuite
02
XHR reverse engineering
Find the GraphQL or REST endpoint
Chrome DevTools → Network → Fetch/XHR. Many SPAs load from one undocumented JSON endpoint. Confirmed in production, a direct GraphQL endpoint bypassed all Akamai HTML protection.
Chrome DevTools
Burpsuite
webclaw CLI
03
JSON in HTML · No requests needed
Look for embedded state
Next.js embeds full state in __NEXT_DATA__. React SPAs often have a >50KB script containing all data. Confirmed: Grainger.com (DataDome-protected), a 110KB JS state blob bypasses DataDome entirely because it's in the initial HTML. A minimal extraction sketch follows the tool list.
chompjs
Parsel
BeautifulSoup4
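A minimal extraction sketch, assuming a Next.js page already fetched to disk; the pageProps path and the __STATE__ script name are illustrative, inspect your target's actual structure:
python
import json
from parsel import Selector

html = open("product_page.html").read()   # fetched with plain HTTP, no browser
sel = Selector(text=html)

# Next.js ships the full page state as JSON inside a script tag
state = json.loads(sel.css("script#__NEXT_DATA__::text").get())
print(list(state["props"]["pageProps"].keys()))   # exact path varies per site

# for JS object literals that are not valid JSON, chompjs is more forgiving
import chompjs
blob = sel.xpath("//script[contains(., '__STATE__')]/text()").get()
if blob:
    data = chompjs.parse_js_object(blob)   # parses the first JS object it finds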
04
HTTP scraping · No browser
curl_cffi + Scrapy
Identify anti-bot with Wappalyzer. curl_cffi with JA4 impersonation resolves most Akamai and DataDome at HTTP layer. Add residential proxy. If __NEXT_DATA__ appears in response, extract it with chompjs.
curl_cffi
Scrapy
Scrapling
05
Browser automation · C++ level only
Camoufox or CloakBrowser
JS injection patches leave signatures. Camoufox: 100% pass rate Mar 2026. CloakBrowser: 49 C++ patches, reCAPTCHA v3 score 0.9, Akamai extension probes pass. PatchRight for Kasada specifically, no JS signatures.
Camoufox ★
CloakBrowser
PatchRight
06
Last resort · F5 Shape only viable path
Managed platform API
F5 Shape's custom VM makes DIY impractical. Token expiry in minutes, payload changes every rotation. At scale: engineer maintenance cost > platform cost. One API flag handles everything. Cost-justify: >2 days/month maintenance → managed API wins.
Bright Data
Zyte
Firecrawl

Quick reference cheat sheet

Anti-bot | Primary vector | Steps 1–2 viable? | Best tool | Key note
Akamai | JA4+ + sensor.js + extension probes | Often | curl_cffi + CloakBrowser | Find mobile/GraphQL first
Cloudflare | JA4 Rust edge + Turnstile | Sometimes | Camoufox | Origin IP via SecurityTrails
DataDome | 85K ML + WASM boring_challenge | Yes | curl_cffi + mobile IP | Check __NEXT_DATA__ first
PerimeterX | 5-vector score | Sometimes | Camoufox + residential | Fresh session per domain
Kasada | Polymorphic JS PoW | Rarely | PatchRight + residential | Never playwright-stealth
F5 Shape | Custom VM + minute expiry | No | Managed API | DIY not practical
From the community

What practitioners are
actually shipping in 2026

Fresh insights from engineers actively solving these problems in production. Shared publicly on LinkedIn.

Drag to explore
+
TLS / Anti-bot
Cloudflare Turnstile Solved Without a Browser
Solvable with pure HTTP, no browser needed. Reverse-engineer the POST payload: 79 parameters covering Canvas, WebGL, Timing and crypto hashes. Status 200 in 0.27s.
💡 Turnstile PoW is solvable in under 1s via plain HTTP
The Turnstile POST payload contains 79 parameters. Key groups: Fingerprint (Canvas hash via OffscreenCanvas, WebGL renderer, AudioContext output), Browser Environment (navigator properties, screen dimensions, timezone), Interaction sequence (mouse path, click timing), and Crypto (custom SHA-256 + TEA encryption of the challenge nonce). The Sitekey is extracted automatically from the page source. Algorithms used: Custom SHA-256, TEA block cipher. Token format: Encrypted_Data-Timestamp-Version-Checksum. Full flow: extract Sitekey → initiate challenge → construct responses → generate 95-char token. Result: cf_clearance accepted in 0.27s. No browser process needed.
+
TLS / Proxies
Why "Just Get Better Proxies" Stopped Working
The problem is your TLS handshake, not your IP. Cipher suites, HTTPS extensions, GREASE values form a JA4 fingerprint. A clean residential IP still fails if the fingerprint exposes you.
💡 The residential IP passed. The fingerprint gave it away.
TLS detection happens at the ClientHello level, before any HTTP is exchanged. The JA4 fingerprint hashes: cipher suite list (sorted), TLS extensions (sorted, GREASE removed), ALPN protocols. Python's requests library sends a different cipher suite order than Chrome. httpx is different again. Even with a clean residential IP, if your cipher ordering does not match Chrome's, you are identified before the server processes a single header. Fix: use curl_cffi with impersonate="chrome124", it emits Chrome's exact TLS ClientHello. Also watch HTTP/2 SETTINGS frames, they contain window sizes and header table parameters that vary per client.
+
Network Identity
The WebRTC Trap: Your Browser Is Leaking Your Real Location
Proxy says US. WebRTC says elsewhere. It leaks your real IP via STUN and creates geo mismatches. Anti-bots check that IP + WebRTC + timezone + DNS + Accept-Language all agree.
💡 Quick test: browserleaks.com/webrtc, check before blaming your proxy
WebRTC uses the STUN protocol to discover network paths. During ICE candidate gathering, the browser contacts a STUN server and reports: your real public IP, your local LAN IP (e.g. 192.168.x.x), and all network interface addresses. Your proxy only routes HTTP/HTTPS traffic, WebRTC bypasses it entirely. Anti-bots cross-check: proxy exit IP vs WebRTC public IP vs DNS resolver location vs Accept-Language vs timezone. All five must agree. Fix with Camoufox: set geoip=True and it automatically aligns all five vectors. Do not simply disable WebRTC, it removes a feature that 99% of real users have, which itself becomes a bot signal.
+
Benchmarks · 2026
The 30-Point Gap: Browser Scraping Success Rates Are Not Equal
71 protected sites tested: Browser Use Cloud hit 81%, Browserbase hit 42%. That gap is no longer marginal, it is the difference between a working pipeline and a broken one.
💡 A 30-point gap in success rate = the difference between a working pipeline and a broken one
The benchmark tested 71 sites protected by Cloudflare, Akamai, PerimeterX, and DataDome. Methodology: each provider was given identical target lists and measured on first-request success (no retries). Browser Use Cloud succeeded on 81%, achieved via custom Chrome patches at the C++ binary level plus coordinated fingerprint management. Browserbase succeeded on 42%, detected primarily via CDP timing signatures and canvas hash consistency. The gap exists because basic scraping (fetch URL, parse HTML) is commoditised. The data worth having in 2026 sits behind login walls, search interfaces, and multi-step authenticated flows requiring actual browser interaction. Cheap providers are adequate for unprotected targets; they fail silently on protected ones.
+
Architecture
Browsers as a Session Layer, Not a Scraping Product
HTTP is fast, browsers are expensive. Right architecture: browser for session warmup and hard challenges only, then lightweight HTTP workers for bulk collection.
💡 Browser for session warmup → HTTP for bulk collection
The architectural insight: most scraping pipelines use browsers for everything, which is expensive. But you only actually need a browser for two things: (1) session establishment, generating valid cookies and session tokens that a protected site will accept, and (2) hard challenge pages, Akamai sensor.js, Cloudflare Turnstile, DataDome WASM challenges. Once you have a valid session cookie, the rest of the data collection can happen via lightweight HTTP requests at 10-100× the speed and 1/100th the memory. Implementation: use camoufox or rayobrowse to generate sessions, then curl_cffi with the extracted cookies for bulk collection. Rotate sessions every 30-50 requests. A sketch of the two-phase pattern follows.
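A sketch of the two-phase pattern, assuming the Firefox import style used earlier in this guide; protected-site.example is a placeholder target:
python
from camoufox.sync_api import Firefox     # import style used elsewhere in this guide
from curl_cffi import requests

# Phase 1: browser only for session establishment (cookies, anti-bot tokens)
with Firefox(geoip=True, humanize=True) as browser:
    page = browser.new_page()
    page.goto("https://protected-site.example/")
    page.wait_for_load_state("networkidle")
    cookies = {c["name"]: c["value"] for c in page.context.cookies()}

# Phase 2: bulk collection over plain HTTP with the warmed-up cookies
session = requests.Session(impersonate="chrome124")
for i in range(1, 31):                    # rotate the session every 30-50 requests
    r = session.get(f"https://protected-site.example/api/items?page={i}",
                    cookies=cookies, timeout=30)
    print(i, r.status_code)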
+
Python Framework
Scrapling v0.4: The Biggest Python Scraping Update Yet
New async spider: concurrent crawling, mix HTTP and stealth sessions, pause/resume from checkpoint, stream items live. Thread-safe ProxyRotator built in. Handles Turnstile natively.
💡 pip install scrapling --upgrade
The async spider framework uses a Scrapy-like API: define a Spider class, set start_urls, implement parse(). Key differentiators from Scrapy: mixed session types in one spider (HTTP fetchers, headless Camoufox, stealth browser), checkpoint/resume (Ctrl+C saves state, restart continues from the last position), and per-domain throttling (set different rates per target). ProxyRotator: thread-safe, works across all fetcher types, supports custom rotation strategies, per-request override. Parser improvements: blocked_domains list to block tracking/CDN requests in headless mode, automatic proxy-aware retry on network errors, Response.follow() for easy link chaining. Install: pip install scrapling --upgrade.
+
Proxies
SwiftShadow: Free Proxy Rotation Without the Headaches
Grabs free proxies, validates, rotates automatically, filters by country. Built-in caching, auto-switches on failure. ~300 stars, actively maintained.
💡 pip install swiftshadow, from swiftshadow import QuickProxy
SwiftShadow maintains a pool of free proxies sourced from multiple public lists. On initialisation it validates all proxies (checks response time and anonymity level) and caches the working set. When a proxy fails mid-request, it automatically switches to the next validated proxy in the pool, no intervention needed. The QuickProxy(countries=["FR","DE"]) API filters by exit country. The built-in cache means it does not hit proxy list APIs on every request. Usage: from swiftshadow import QuickProxy; proxy = QuickProxy(); session.proxies = {"http": str(proxy), "https": str(proxy)}. Important: free proxies have high failure rates and low anonymity, do not use for Akamai, DataDome, or PerimeterX targets. Best for scraping open/unprotected sites at scale without cost.
+
RAG / LLM Pipelines
Keep Your LLM Context Fresh: Incremental Indexing
Scraped data goes stale fast. CocoIndex builds a continuously updated vector index, only changed rows re-run. Pgvector, LanceDB, Neo4j targets. #1 GitHub Trending on launch.
💡 github.com/cocoindex-io/cocoindex, incremental RAG for LLM agents
The core problem: you scrape a site, embed it into a vector store, and 48 hours later 30% of the content has changed. Traditional batch re-indexing re-processes everything. CocoIndex solves this with a Rust-based delta engine: it tracks byte-level lineage per document, and when you re-run it only processes changed chunks. Target vector stores: Pgvector (PostgreSQL), LanceDB (local), Neo4j (graph). Python with a Rust core means the delta calculation is very fast even on large corpora. The LLM integration: your agent always queries a fresh index, so answers reflect current scraped data. Setup: pip install cocoindex, configure sources (files, URLs, S3), define your chunker and embedding model, run cocoindex.build(), done in under 10 minutes.
+
API-First Scraping
Skip the HTML. Hit the API. 50× Faster.
Open DevTools Network → Fetch/XHR before writing any code. Half the time the page calls a JSON API directly. 50× faster, 100× less memory, zero browsers launched.
💡 Rule: open DevTools Network tab before writing any code
The technique: open Chrome DevTools → Network tab → filter by Fetch/XHR. Reload the page. Look for requests returning JSON. Right-click → Copy → Copy as cURL. Run that cURL command. If you get the same data back, you have found the internal API. What to look for: GraphQL endpoints (POST to /graphql or /api/graphql), REST endpoints (GET to /api/v2/products, etc.), __NEXT_DATA__ (Next.js embeds full page state in a JSON script tag, no request needed, just parse the HTML). Benefits: bypasses most anti-bot because APIs typically have weaker protection than HTML endpoints, returns clean structured data instead of HTML you need to parse, no browser needed, runs at full HTTP speed. When this fails: auth cookies required, the API uses rotating tokens, or the site detects API scraping specifically.
+
Library Analysis
Scrapling Hit 200K Views, Invest in Your Network Layer
Detection vendors update, bypass libraries break, new ones ship. This cycle repeats every few months. What never depreciates: your proxy infrastructure. Invest there first.
💡 Invest more in your network layer, it depreciates slower than your library
The library lifecycle in scraping works like this: a new bypass technique is discovered, someone publishes a library implementing it, the library becomes popular, detection vendors add the library's fingerprints to their models, the library gets blocked, repeat. This cycle runs on a 2-4 month cadence for fast-moving targets. Your proxy setup operates on a different timeline: a well-configured residential proxy pool with good IP diversity and correct session management continues to work across multiple library generations. The specific libraries come and go but the network signals (IP reputation, ASN, session behaviour, timing patterns) remain consistent requirements. Conclusion: spend more engineering time on proxy quality, session management, and IP pool diversity than on tracking the latest bypass library.
+
Learning Path
Scraping Tutorials Teach the Wrong Things First
Most start with BeautifulSoup on static HTML. Real scraping is JS-rendered, sessions, rate limits, dynamic APIs. Better: DevTools → XHR replication → Scrapy → anti-bot.
💡 DevTools → XHR replication → Scrapy → anti-bot, in that order
The typical tutorial sequence: install Beautiful Soup, parse static HTML, extract data. This teaches the wrong mental model. Real production scraping involves: JavaScript rendering (most modern sites build their UI client-side; the HTML you fetch is an empty shell), sessions and auth (cookies, CSRF tokens, login flows), rate limiting and backoff (exponential backoff, per-domain limits), and dynamic selectors (sites change their HTML structure; you need adaptive extraction). The right learning sequence: DevTools Network tab → understand how data flows between client and server → learn to replicate XHR requests with requests/curl_cffi → Scrapy for structure and scale → fingerprinting and anti-bot bypass last. Understanding what actually happens when a browser loads a page is more valuable than memorising BeautifulSoup APIs.
+
IoT / Edge
A Microcontroller Scraping Live Weather Data via API
An ESP32 calling a scraping API, parsing JSON, displaying on TFT screen. The abstraction is now clean enough for devices with no Python. Scraping as real infrastructure.
💡 Scraping APIs are now clean enough for Arduino, #esp8266 #esp32
An ESP8266 microcontroller running Arduino firmware makes an HTTPS request to a scraping API endpoint. The API (Zyte) handles: TLS negotiation with the target, JavaScript rendering if needed, anti-bot bypass, data extraction. The microcontroller receives clean JSON back and renders it on a TFT display. This demonstrates that scraping has become a proper infrastructure layer, just like how you would call a weather API, you can now call a scraping API from any HTTP-capable device. The broader implication: scraping is no longer just a Python script on a server. It is a data access layer that any application can use. The complexity of browser fingerprinting, proxy rotation, and anti-bot evasion is fully abstracted behind a simple API call.
+
Debugging
Your Scraper Is Blocked Because of Behaviour, Not Code
Identical headers. Machine-speed intervals. No session state. Datacenter IPs. Fix: rotate headers, random.uniform(1.8, 4.3) delays, requests.Session(), residential proxies.
💡 sleep(random.uniform(1.8, 4.3)) beats sleep(2) every time
The signals that get you blocked, in order of detection speed: TLS fingerprint (detected before first HTTP byte), HTTP/2 SETTINGS frames (detected at connection), Request headers (User-Agent, Accept-Language, Sec-CH-UA, checked immediately), Request timing (identical intervals are machine-like), Session patterns (no cookies accumulated, no referrer chain), IP reputation (ASN, datacenter range). Fix each layer: curl_cffi for TLS, full Chrome headers via httpx or curl_cffi, random.uniform(1.8, 4.3) delays, requests.Session() for cookie accumulation, residential/mobile proxies for IP. Check your current fingerprint at tls.browserleaks.com/json.
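A minimal sketch applying those fixes at the requests layer; on hard targets you would still swap in curl_cffi for the TLS layer, and the target URLs here are placeholders:
python
import random
import time
import requests

session = requests.Session()   # accumulates cookies like a real visitor
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

for page in range(1, 6):
    r = session.get(f"https://example.com/listing?page={page}",
                    headers=HEADERS, timeout=30)
    print(page, r.status_code)
    time.sleep(random.uniform(1.8, 4.3))   # jittered delay, never a fixed sleep(2)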
+
Mental Model
Understanding Beats Tools Every Time
Not curl_cffi, not Playwright, not a $300/mo plan. Understanding how detection works is the real advantage. Tools change. Detection evolves. Understanding transfers.
💡 "Tools change. Detection evolves. Understanding is what transfers."
The mental model shift: most scrapers think in terms of tools ("which library bypasses Cloudflare?"). Experienced scrapers think in terms of signals ("which signals is my scraper leaking that Cloudflare can detect?"). The difference: tool-thinkers update their library when it breaks. Signal-thinkers understand why the library broke and can fix it themselves or identify the correct replacement. Signals Cloudflare checks: JA4 TLS fingerprint, HTTP/2 SETTINGS frames, navigator properties (webdriver, plugins, languages), Canvas hash, WebGL renderer, timing patterns, IP reputation. If you know which signal you are leaking, you can fix it regardless of which library you are using. This understanding also transfers to new anti-bots, the signals are similar across vendors even though the implementations differ.
+
AI Visibility · 2026
Top 10 Scraping APIs per ChatGPT + Perplexity + Gemini + Google
All four AI models queried simultaneously. Consensus: 1. Bright Data · 2. Zyte · 3. ScrapingBee · 4. Firecrawl · 5. Scrape.do. Half of B2B buyers now start research in AI chatbots.
💡 AI search is a real channel, Bright Data #1 across all four models
The research methodology: a scraping API was used to query four AI systems simultaneously from a San Francisco IP address (to simulate a US-based B2B buyer). Prompt: "best web scraping API 2026". Results aggregated by occurrence and ranking position. Full ranking: 1. Bright Data (98.44% success, 72M+ IPs), 2. Zyte (93.14%, #1 Proxyway benchmark), 3. ScrapingBee, 4. Firecrawl (111K GitHub stars, LLM-optimised), 5. Scrape.do, 6. ScraperAPI, 7. Apify, 8. Scrapingdog, 9. Oxylabs, 10. Scrapfly. The AI search SEO implication: if you are building a scraping product, being in AI training data and AI search indexes is now a primary distribution channel. The buyers searching "best scraping API" increasingly ask an AI chatbot, not Google.
Testing tools

Check your own
fingerprint first

Before you bypass anything, you need to know what your setup is leaking. These tools show exactly what anti-bots see when your scraper connects. Run your scraper through them, not just your browser.

Gold standard, most detailed
BrowserLeaks
browserleaks.com
The most comprehensive fingerprint testing suite online. Tests WebRTC IP leak, Canvas hash, WebGL renderer, JA3/JA4 fingerprint, HTTP/2 Akamai hash, Chrome extension detection, fonts, geolocation, JavaScript environment, battery status. Essential for verifying your scraper identity stack.
TLS specific, generates JA3/JA4
BrowserLeaks TLS
browserleaks.com/tls
Tests your TLS ClientHello. Shows cipher suites, TLS extensions, key exchange groups, JA3 and JA4 hashes. Run Python requests, curl_cffi, and real Chrome through this and compare. JA4 is what Cloudflare and Akamai check at edge before serving any HTML.
IP leak, proxy coherence test
BrowserLeaks WebRTC
browserleaks.com/webrtc
Reveals your real IP even through a proxy. Shows local IP, public IP via STUN, and ICE candidates. If your proxy exit is US but WebRTC shows a local Pakistani address, every anti-bot flags you immediately. The most commonly overlooked leak.
JSON API, use directly in code
TLS JSON API
tls.browserleaks.com/json
Returns your TLS fingerprint as raw JSON including ja3, ja4, akamai hash, HTTP/2 settings. Call this directly from your scraper to compare fingerprints against real Chrome. One plain requests call versus one curl_cffi call tells you everything about the difference.
Quick pass/fail validation
BrowserScan
browserscan.net
Higher-level green/red check for automation detection, timezone coherence, WebRTC status, canvas fingerprint uniqueness. Good for quick pre-deployment validation before hitting a protected target.
EFF built, uniqueness score
Cover Your Tracks
coveryourtracks.eff.org
Built by the Electronic Frontier Foundation. Tells you how unique your fingerprint is among all visitors. A fingerprint too unique is as bad as one that looks like a bot. Scrapers need to look like the middle of the distribution.
Bot detection simulation
Pixelscan
pixelscan.net
Simulates what anti-fraud systems see. Identifies inconsistencies in timezone, IP, language, and WebRTC that would trigger detection. Fast pass/fail for operational teams before deploying at scale.
Advanced, behavioral and hardware signals
CreepJS
abrahamjuliot.github.io
The most advanced fingerprint tester available. Simulates what modern anti-fraud systems actually detect, including behavioral and hardware-level signals far beyond surface tests. Use this for deep audits of browser configurations.
Workflow: Fetch tls.browserleaks.com/json from both your scraper and real Chrome. Compare ja4 hashes. If they differ, fix TLS first with curl_cffi. Then check WebRTC at browserleaks.com/webrtc. Then headers. Work from layer 1 outward.
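A sketch of that comparison in code; the ja4 key name is taken from the API's JSON output and may shift between versions, hence the defensive .get():
python
import requests
from curl_cffi import requests as crequests

URL = "https://tls.browserleaks.com/json"

plain = requests.get(URL).json()
spoofed = crequests.get(URL, impersonate="chrome124").json()

print("requests  ja4:", plain.get("ja4"))     # Python's native ClientHello
print("curl_cffi ja4:", spoofed.get("ja4"))   # Chrome-impersonated ClientHello
# then load the same URL in real Chrome and confirm the curl_cffi hash matches it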
Architecture

How production scrapers
are actually built

From a single Scrapyd daemon to multi-region ECS clusters. Nine real pipeline architectures, from simple to enterprise-scale, with every component and data flow mapped out.

The simplest production setup. One server, Scrapyd managing spiders via its JSON API, ScrapydWeb as the UI. Good for <50 spiders and teams without Kubernetes. Deploy with scrapyd-deploy, schedule via /schedule.json, monitor at port 6800. Scheduling sketch after the architecture map.

[Architecture map] 💻 Developer + codebase → scrapyd-deploy → Scrapyd server :6800 (🕷️ Scrapyd daemon, JSON API + job queue, spider processes) → ↻ rotating proxy (scrapy-rotating-proxies) → 🌐 target site (anti-bot protected) → 📦 storage pipeline · 🖥️ ScrapydWeb visual dashboard :5000 monitors via the API · scheduler (cron / APScheduler) POSTs /schedule.json · 🔔 Slack / email job alerts
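A sketch of driving Scrapyd's JSON API directly; the project and spider names are placeholders for whatever you deployed with scrapyd-deploy:
python
import requests

SCRAPYD = "http://localhost:6800"   # default Scrapyd port

# schedule a job via the same endpoint cron/APScheduler would hit
job = requests.post(f"{SCRAPYD}/schedule.json",
                    data={"project": "myproject", "spider": "products"}).json()
print(job)   # {"status": "ok", "jobid": "..."}

# poll pending/running/finished jobs for the project
print(requests.get(f"{SCRAPYD}/listjobs.json",
                   params={"project": "myproject"}).json())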
✓ Pros
  • Zero infrastructure overhead, one server, done
  • ScrapydWeb gives full UI: logs, job history, schedule
  • Deploy new spiders in seconds with scrapyd-deploy
  • Great for teams without DevOps expertise
✗ Cons
  • Single point of failure, server down = scrapers down
  • Limited to one machine's CPU and memory
  • No auto-scaling, manual capacity planning
  • Spider isolation is process-level only
↑ Scale up

Add more Scrapyd nodes → ScrapydWeb manages cluster from one UI. Next step: scrapy-redis for shared URL queue.

Stack: Scrapyd :6800 · ScrapydWeb :5000 · scrapyd-client (deploy) · APScheduler or cron · Gerapy (alt UI)
✦ Pattern

Self-Healing Scraper
powered by Claude

Scrapy spiders break when sites change their HTML. Instead of manually fixing selectors, this architecture uses Claude to detect failures, analyse the new page structure, and write corrected selectors automatically, without human intervention.

[Flow] 🕷️ Scrapy spider (runs on schedule) → 🌐 target site (HTML changed) · on success → 📦 storage (S3 / DB, clean items) · on failure / empty → detector (item count check, 0 items = broken) → Claude API (1. fetch broken page HTML, 2. analyse new structure, 3. write new selectors) → updater (patch spider config: CSS / XPath / regex) → retry (re-run spider with new selectors, auto-healed, runs again without human intervention) → notify (Slack / email "healed" alert)
1
Spider detects failure
Item count drops to zero or below threshold. A Scrapy extension hook fires immediately, no waiting for the next run.
2
Claude analyses the broken page
The full page HTML is sent to Claude with the old selectors and a prompt: "The selectors below stopped working. Examine the HTML and write corrected CSS selectors for the same data fields."
3
New selectors written and applied
Claude returns structured JSON with corrected selectors. The updater patches the spider config or YAML file. No code deployment needed.
4
Spider retries and confirms
The spider re-runs with new selectors. If items come back, healed. A Slack notification logs what changed. If it fails again, escalates to human review.
Claude prompt pattern
You are a web scraping expert. A Scrapy spider broke because the site changed its HTML.

Old selectors (no longer working):
  title:  h1.product-title::text
  price:  span.price-now::text
  image:  img.main-image::attr(src)

New page HTML (truncated):
{{ page_html[:8000] }}

Return ONLY valid JSON with corrected selectors:
{"title": "...", "price": "...", "image": "..."}
✓ Why it works
  • Claude reads raw HTML better than regex, handles minified, dynamic, and obfuscated markup
  • Zero downtime, spider heals mid-run, not on next deployment
  • Works across 50+ spiders from a single Claude integration
  • Selector changes are the most common spider failure, this covers 80% of breakages
⚙ Implementation notes
  • Use claude-haiku-3 for speed and cost, ~$0.0003 per heal
  • Cap page HTML at 8K chars before sending, beyond that Claude doesn't need more
  • Store selector history in a YAML file versioned in Git for auditability
  • Add a confidence check, if healed items look wrong, escalate to human
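A sketch of the Claude call behind steps 2-3, using the official Anthropic SDK and the prompt pattern above; the model id is a stand-in for whichever Haiku model you run:
python
import json
from anthropic import Anthropic

client = Anthropic()   # reads ANTHROPIC_API_KEY from the environment

def heal_selectors(old_selectors: dict, page_html: str) -> dict:
    """Ask Claude for corrected selectors; mirrors the prompt pattern above."""
    prompt = (
        "You are a web scraping expert. A Scrapy spider broke because the site "
        "changed its HTML.\n\nOld selectors (no longer working):\n"
        + json.dumps(old_selectors, indent=2)
        + "\n\nNew page HTML (truncated):\n" + page_html[:8000]
        + "\n\nReturn ONLY valid JSON with corrected selectors."
    )
    msg = client.messages.create(
        model="claude-3-haiku-20240307",   # a Haiku-class model: cheap, fast heals
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(msg.content[0].text)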
↑ Extend it

Add a second Claude call to validate the healed output against a schema. Use computer-use to handle JavaScript-rendered pages where HTML alone isn't enough. Log all heals to build a fine-tuning dataset.

Stack: Scrapy extension hook · Claude API (Haiku) · YAML selector store · Slack webhook · Anthropic SDK
Mobile API Scraping

Intercept mobile app traffic
before it hits any anti-bot

Mobile APIs serve the same data as the web, but with weaker protection. No Cloudflare, no JA4 fingerprinting. Intercept the traffic once, replicate the call forever.

Why mobile APIs? The same data served to a mobile app often sits behind a simpler auth layer than the web. No browser fingerprinting, just a clean JSON endpoint you can call directly from Python.
01
Install Android Studio + create a Virtual Device
Open Android Studio → Virtual Device Manager → Create Device. Pick any phone that shows the Play Store icon. For the system image, choose any API level above 28; do not choose Android 9 (Pie / API 28), the rooting script does not support it. API 30 (Android 11) is a safe default.
💡 Any Android 10+ image works. Start the AVD and confirm it boots before proceeding.
02
Root the AVD using rootAVD
AVDs are not rooted by default. Root access lets HTTP Toolkit intercept SSL traffic. The rootAVD script handles everything in one command.
git clone https://github.com/newbit1/rootAVD.git
cd rootAVD

# Verify AVD is accessible
adb shell

# List your AVDs
./rootAVD.sh ListAllAVDs

# Copy the first command from the output and run it
# e.g: ./rootAVD.sh system-images/android-30/google_apis_playstore/x86_64/ramdisk.img
💡 adb not found? Add to ~/.zshrc: alias adb='/Users/$USER/Library/Android/sdk/platform-tools/adb'
03
Confirm root, Magisk appears in the app drawer
After rootAVD finishes the AVD reboots automatically. Once it's back up, open the app drawer and look for the Magisk app, this confirms root is working. Zygisk does not need to be enabled.
💡 If the AVD didn't reboot itself, reboot it manually. No Magisk = root failed, re-run the script.
04
Install HTTP Toolkit and connect via ADB
Download from httptoolkit.com or install via Homebrew. Open it → Intercept tab → "Android device via ADB". HTTP Toolkit detects your running AVD and prompts it to grant superuser rights, grant it.
# macOS
brew install --cask http-toolkit
💡 "System trust disabled" warning? Disconnect and reconnect in HTTP Toolkit, or reboot the AVD.
05
Install the target app and capture its requests
Sign into Google Play on the AVD and install the app, or download the APK from apk.support and drag-drop it onto the emulator. Open the app, navigate through it (lists, detail pages, search) while HTTP Toolkit runs. Switch to the View tab, every request the app makes is captured in real time.
💡 Use the filter bar, you'll see 400+ requests but only ~10 are the data endpoints. Filter by the target domain name.
06
Replicate the API call in Python
Click any intercepted request to see full headers, auth tokens, and query parameters. Test in Postman first to confirm it returns data, then replicate in Python. Mobile APIs return clean JSON, no HTML parsing needed.
import curl_cffi.requests as requests

resp = requests.get(
    "https://api.targetapp.com/v2/listings",
    headers={
        "Authorization": "Bearer <token_from_http_toolkit>",
        "X-App-Version": "4.2.1",
        "User-Agent": "TargetApp/4.2.1 (Android 11; SDK 30)",
        "Accept": "application/json",
    },
    impersonate="chrome120"
)
data = resp.json()
💡 Tokens expire, check if the app refreshes on login and build a token refresh step into your scraper.
✓ Works well for
  • Property portals, classifieds, marketplaces
  • Apps where the web version is heavily protected
  • Data only available in the mobile app
  • Targets using simple Bearer token auth
  • Any app that doesn't pin SSL certificates
✗ Limitations
  • Apps with SSL pinning block interception
  • Some apps crash on rooted devices
  • ARM-only apps may not run on x86 emulators
  • Tokens expire, need refresh logic in scraper
  • App updates can silently change endpoints
↑ SSL Pinning bypass

If the app blocks interception it likely uses SSL pinning. Use Frida or objection to bypass it at runtime, or use Burp Suite with the Xposed + TrustMeAlready module for a more permanent bypass.

Stack: Android Studio · rootAVD (github.com/newbit1/rootAVD) · Magisk · HTTP Toolkit · apk.support · Postman · curl_cffi
Plain English

Scraping jargon
in simple terms

Every term that makes scraping documentation confusing, explained with an analogy.

Network layer
TLS Fingerprint
How your browser "shakes hands" when connecting securely. Chrome and Firefox shake hands differently, so a server can tell them apart before you send a single header. Analogy: recognising someone by the way they shake hands, firm, soft, awkward.
Network layer
HTTP Fingerprint
The order and style of your HTTP headers. A bot might say "I'm Chrome" but forget to include headers Chrome always sends. Analogy: like a boarding pass, if your name and flight number don't match the expected pattern, it's suspicious.
Network layer
TCP/IP Fingerprint
Looks at how your computer sends and receives internet packets. Windows and Linux send packets with subtle differences. Analogy: recognising someone's hometown by their accent, you didn't ask, they just gave it away by how they talk.
Browser layer
Canvas Fingerprint
Website secretly asks your browser to draw a hidden picture. Each GPU renders it slightly differently, that difference is your fingerprint. Analogy: asking 10 artists to draw the same tree, each drawing is unique even with the same instructions.
Browser layer
WebGL Fingerprint
Uses 3D graphics rendering to identify your GPU and driver. Same browser, different hardware, different fingerprint. Analogy: recognising a car engine by its sound, same model, but each engine has subtle variations you can hear.
Browser layer
Device Fingerprint
Collects OS, fonts, screen size, timezone, plugins, battery level, everything about your setup combined into a unique profile. Analogy: identifying someone by their full outfit + hairstyle + voice + habits. Change one thing, the combo is still unique.
Behavioural
Behavioural Analysis
Watching how you type, scroll, move your mouse. Bots move in straight lines at constant speed. Humans are messy and inconsistent. DataDome runs 35 behavioural signals in real time. Analogy: a security guard watching body language, not what you say, how you move.
Challenge
Dynamic Challenges
The website throws mini-tests to check if you're real, CAPTCHA, Turnstile, proof-of-work puzzles. Kasada changes them constantly so you can't pre-solve. Analogy: a teacher changing exam questions mid-test to catch cheaters.
Network
IP Reputation
Whether your IP address is associated with known bots, VPNs, datacenters, or abuse. Datacenter IPs are instantly flagged. Residential IPs from real ISPs get highest trust. Analogy: your home address appearing on a blacklist, the doorman knows before you knock.
Community

Where scrapers
talk to each other

The best scraping techniques rarely come from documentation, they come from people who've already hit the same wall you're hitting. These communities are where the real knowledge lives.

Discord servers

Recommended ★
The Web Scraping Club
Pierluigi Vinciguerra
The community behind the best scraping newsletter. Anti-bot discussions, real-world bypasses, tool comparisons. Active and high-signal. Recommended first stop.
Official · Experts on call · 4,600+ members
Oxylabs
discord.com/invite/DffyvAcg2w
Live Q&As with Oxylabs in-house engineers, webinars on proxies and anti-bots, OxyCon conference channel, and exclusive product discounts. Self-described as "#1 community for all things web scraping." Strong on practical bypass questions and proxy strategy. Active staff presence makes it a useful place to surface real production issues.
Zyte-powered · Annual summit
Extract Data Community
discord.gg/extractdata
One of the largest web scraping servers. Powered by Zyte. Associated with the annual Extract Summit conference. Thousands of members, high-activity channels.
Anti-bot experts · Reverse engineers
Scraping Enthusiasts
Anti-bot reverse engineering focus
Home of puppeteer-extra-plugin-stealth maintainers. Community bots that detect anti-bot software, find hidden API endpoints, and monitor anti-bot updates. Best server for hard targets.
Python · Node.js · Production
Scraping In Prod
General web scraping questions
General scraping questions, site-specific help, anti-bot discussions. Python and Node.js focused. Good place to ask about specific websites or frameworks.
Official Scrapy · Framework experts
Scrapy Official
Scrapy core maintainers active
Official Scrapy community. Core maintainers are active daily. Best place for Scrapy-specific questions, extensions, architecture advice, and code reviews.

Reddit communities

Newsletters worth reading

Resources from The Web Scraping Club

Self-healing scrapers

Auto-healing scrapers
using Claude as the brain

Scrapers break when sites change their HTML structure, add new anti-bots, or rotate selectors. Claude Computer Use can read the error, inspect the live page, rewrite the spider, and redeploy, all without a human in the loop.

The problem

A spider that worked yesterday returns empty data today. The site changed a CSS class, added a JavaScript render step, or rotated its anti-bot. Traditional scrapers need a human to notice, debug, and fix. Self-healing scrapers fix themselves.

The solution

Claude has a set of computer-use tools, bash, file read/write, str_replace, plus a skills knowledge base. When a spider fails, Claude reads the error, fetches the live page, compares old vs new structure, patches the selector, and runs a test crawl to verify.

Flow: SPIDER 🕷️ (Scrapy / curl_cffi) → MONITOR 📊 (CloudWatch / logs, 0 items = alert). SUCCESS → 🗄️ data stored. FAILURE DETECTED (0 items · timeout · HTTP 4xx) → error captured + page snapshot → invoke CLAUDE AGENT: ① bash_tool reads error logs + last spider output → ② bash_tool fetches live page HTML with curl_cffi → ③ view reads SKILL.md, the scraping-patterns knowledge base (anti-bot patterns, selector fixes, bypass recipes) → ④ reasoning diffs old vs new structure to find the cause → ⑤ str_replace patches the spider selector / bypass / headers → ⑥ bash_tool runs a test crawl to verify items > 0. Patch verified → REDEPLOY (git commit, CI/CD push, spider resumes automatically) → NOTIFY (Slack / email with what was fixed).
DETECTION

Monitor checks the item count after every run. Zero items, a timeout, or an HTTP 4xx triggers the healing loop. A CloudWatch alarm or a simple Lambda cron handles this.
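A minimal sketch of that check in Python, assuming the spider dumps Scrapy's stats to a JSON file after each run and a hypothetical heal.py script invokes the Claude agent:

import json
import subprocess

STATS_FILE = "output/stats.json"   # written by the spider after each run
HEAL_CMD = ["python", "heal.py"]   # hypothetical entry point for the agent

with open(STATS_FILE) as f:
    stats = json.load(f)

items = stats.get("item_scraped_count", 0)
blocked = stats.get("downloader/response_status_count/403", 0)

if items == 0 or blocked > 0:
    # hand the stats snapshot to the agent for diagnosis
    subprocess.run(HEAL_CMD + ["--snapshot", STATS_FILE], check=False)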

CLAUDE TOOLS

bash_tool runs shell commands. view reads files. str_replace patches them. Claude reads SKILL.md, a knowledge base of scraping patterns, bypass techniques and selector recipes, before deciding on the fix.
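Wiring those tools up through the Anthropic Python SDK looks roughly like this, a sketch only: the tool type strings are version-pinned and change between model releases, so verify them against the current docs before copying:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,
    # tool type identifiers below are illustrative and version-pinned
    tools=[
        {"type": "bash_20250124", "name": "bash"},
        {"type": "text_editor_20250124", "name": "str_replace_editor"},
    ],
    messages=[{
        "role": "user",
        "content": "Spider 'listings' returned 0 items. Read SKILL.md, "
                   "diagnose the failure, patch the spider, then run a "
                   "test crawl to verify items > 0.",
    }],
)
# A real agent loops here: execute each tool_use block locally,
# return tool_result messages, repeat until Claude stops calling tools.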

VERIFY + DEPLOY

Claude runs a test crawl after patching. Only if items > 0 does it commit and push. Failed patches loop back to re-diagnose. A Slack message explains exactly what broke and what was changed.
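The verify-then-ship gate is only a few lines, assuming a spider named listings and Scrapy's -O overwrite-output flag (all names illustrative):

import json
import subprocess

# run a test crawl, writing whatever it scrapes to test_run.json
result = subprocess.run(["scrapy", "crawl", "listings", "-O", "test_run.json"])

items = []
if result.returncode == 0:
    with open("test_run.json") as f:
        items = json.load(f)

if items:  # only ship a patch that actually scrapes something
    subprocess.run(["git", "commit", "-am", "auto-heal: patch listings spider"], check=True)
    subprocess.run(["git", "push"], check=True)
else:
    print("Patch failed verification, looping back to diagnosis")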

What goes in SKILL.md: anti-bot bypass recipes (Cloudflare → Camoufox, DataDome → curl_cffi + residential), common selector patterns for major site frameworks (Next.js, Nuxt, Django), how to detect and extract __NEXT_DATA__, XHR endpoint patterns, proxy rotation config. Claude reads this before every fix attempt, it is the institutional knowledge your team would otherwise lose when people leave.
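An entry in SKILL.md can be as simple as symptom → fix, in whatever format Claude can read. An illustrative excerpt, your own recipes will differ:

## Cloudflare challenge page
Symptom: HTTP 403, response body contains "Just a moment..."
Fix: switch transport to Camoufox, or curl_cffi with a fresh impersonate profile + residential proxy

## Empty items, selectors silently broken
Symptom: HTTP 200 but 0 items extracted
Fix: check for __NEXT_DATA__ in the HTML first; if present, parse the JSON instead of repairing CSS selectors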
09 The arms race

From IP bans
to transformer ML

Every bypass technique was born as a direct response to a specific detection innovation. The escalation explains why each tool exists.

2004
Selenium born for QA testing. WebDriver protocol. Zero anti-bot thinking, scraping wasn't a threat concept yet.
2017
Puppeteer launches. First CDP Chrome automation. Anti-bots respond with IP reputation + User-Agent checks. Defeated by: proxy rotation + fake UA strings.
2018–2020
JS fingerprinting era. Canvas hash, WebGL, navigator.webdriver=true. playwright-stealth emerges. Playwright 2020, Microsoft, cross-browser. F5 acquires Shape Security for $1 billion.
2021–2022
TLS fingerprinting mainstream. JA3 at CDN edges. Python httpx identified by hash. curl_cffi emerges. DataDome + Akamai add behavioural scoring. PerimeterX + HUMAN Security merge, creating a 29,650-site network.
2023–2024
JA4+ standardisation. Cloudflare Rust crate. Chrome randomises extensions, breaks JA3. Camoufox and CloakBrowser emerge as first C++ binary solutions. DataDome WASM boring_challenge. Scrapling 38K stars.
2025
ML + OS-level signals. HTTP/2 SETTINGS frames. TCP TTL. Transformer ML on micro-timing at <1ms. DataDome 85K models. Browserbase $40M at $300M valuation. Firecrawl 60K stars. ScrapingBee acquired by Oxylabs.
2026
C++ binary patches are the baseline. JS injection obsolete, toString() exposes all patches. Camoufox 100% pass rate. Firecrawl 111K stars. webclaw Rust + 10 MCP tools. JA4+ universal. Market: $7.5B → $38B by 2034. The arms race is AI vs AI.
★ You made it

Thank you for reading.

This is everything I know about web scraping in 2026, every detection layer, every anti-bot system, every library, every architecture I've actually built or used in production over the last seven years.

If even one section saved you a late night of debugging, that's why I wrote it.

Build something interesting with this. And if you do, I'd genuinely love to hear about it.

Asad Ikram
Data Engineer · Scraping specialist · Lahore, PK