The web is no longer frequented only by users
For years, we've imagined web traffic as the result of the interaction between people and websites: users reading articles, customers purchasing products, visitors navigating pages, administrators accessing control panels, search engine crawlers indexing content.
This representation is incomplete today.
An increasingly large portion of HTTP requests arriving at a website don't come from real people, but from automated software. Some are legitimate, such as search engine crawlers, uptime monitoring tools, SEO validation systems, or bots used by authorized third-party services. Many others, however, are opaque, unwanted, or downright malicious.
The most significant fact is that bot traffic is getting closer and closer to human traffic, to the point of representing, in many scenarios, a huge share of overall requests. The critical point, however, isn't just the quantity. The real problem is the quality of this traffic: a significant portion of it doesn't generate value, doesn't convert, doesn't purchase, doesn't actually read the content, and doesn't contribute to the growth of the digital project.
On the contrary, it consumes resources.
It consumes bandwidth, CPU, RAM, database queries, application workers, cache capacity, log space, analysis time, and, in the worst cases, opens the door to abusive or fraudulent activity.
Not all bots are created equal
The term "bot" is often used generically, but it encompasses very different categories. Putting them all in the same box is a technical and strategic mistake.
There are useful bots, such as those of search engines, which allow pages to be indexed. There are functional bots, such as monitoring tools, validators, site availability checkers, social preview services, third-party integrations, and legitimate automation systems.
Then there's a vast, much more problematic gray area. Here we find scrapers that harvest content, bots that extract prices and catalogs, automated systems that copy text and images, undeclared crawlers used for training or data enrichment, tools that simulate real browsers, and bots that rotate IP addresses, user agents, and fingerprints to evade controls.
Finally, there are the overtly malicious bots: vulnerability scanners, credential stuffing, brute force attempts, account takeover attacks, spam bots, comment spam, form abuse, aggressive scraping, API endpoint enumeration, automated search for vulnerable plugins, probing on known paths such as /wp-admin, /xmlrpc.php, /wp-json/, /administrator, /vendor/, /phpmyadmin and so on.
For a modern infrastructure, the fundamental distinction is no longer simply between human traffic and bot traffic. The real distinction is between useful traffic, tolerable traffic, suspicious traffic, and harmful traffic.
The problem of declared identity
One of the most common mistakes in bot management is blindly trusting the user agent.
The user agent is a string declared by the client. It can say it's a browser, a known crawler, a preview system, an AI bot, or anything else. But declaring an identity doesn't prove it.
A malicious bot can pose as a legitimate crawler. It may use a reassuring name, imitate a popular browser, rotate HTTP headers, change IP addresses, partially respect certain browsing patterns, and attempt to appear human. It may also alternate between slow and aggressive behavior to avoid simplistic rate limiting thresholds.
For this reason, a strategy based solely on user agents, robots.txt, or static lists is weak. It's useful as a first layer, but it's not a sufficient defense.
More thorough checks are needed: consistent reverse DNS for known bots, ASN verification, behavioral analysis, IP reputation, header consistency, TLS fingerprint, request rate, geographic distribution, browsing pattern, dynamic resource access, repetitiveness, crawl depth, and impact on the origin.
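As an illustration of the first check in that list, here is a minimal sketch of forward-confirmed reverse DNS: resolve the client IP to a hostname, check that the hostname belongs to a domain you trust, then resolve the hostname back and confirm it maps to the same IP. The function names and the injectable `reverse` and `forward` parameters are assumptions made for this example. The suffix list shows only the Googlebot domains and would need extending for other crawlers.

```python
import ipaddress
import socket
from typing import Callable, Iterable, Set

# Assumption: hostname suffixes of crawlers you choose to trust.
# Only Googlebot's documented domains are listed here as an example.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> Set[str]:
    """Forward lookup: hostname -> set of IPv4/IPv6 addresses."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_crawler(
    ip: str,
    suffixes: Iterable[str] = TRUSTED_SUFFIXES,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], Set[str]] = forward_lookup,
) -> bool:
    """Return True only if the PTR record is in a trusted domain AND
    the hostname resolves back to the same IP."""
    try:
        target = ipaddress.ip_address(ip)
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(tuple(suffixes)):
            return False
        return any(ipaddress.ip_address(a) == target for a in forward(hostname))
    except (OSError, ValueError):
        # DNS failure or malformed address: treat as unverified.
        return False
```

A client whose user agent claims to be a known crawler but fails this check can then be treated as a spoofed identity. In production the result would normally be cached, since DNS lookups on every request are expensive.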
Cache, CDN, and Static Content: The False Sense of Security
Many companies consider cached content to be less sensitive. If a page is public and cacheable, there's a tendency to think there's no real problem if it's requested by bots. After all, it doesn't directly impact the backend, doesn't generate database queries, and doesn't consume PHP, Node.js, Java, or other application processes.
This view is incomplete.
Even when caching reduces computational cost, the question remains: who is accessing the content? How often? For what purpose? Are they creating value or just extracting data?
For a publishing site, a scraper can copy articles and republish them elsewhere. For an e-commerce site, it can monitor prices, availability, catalog changes, marketing strategies, and promotions. For a SaaS platform, it can enumerate documentation, changelogs, public endpoints, and information useful for subsequent attacks.
Caching protects performance, but does not necessarily protect the information value of the content.
In other words, a page served quickly is not automatically a page served to the right person. Caching is an efficiency tool, not an access control policy.
When bots get to the source, the cost becomes real
The problem becomes even more evident when bots don't stop at static or cacheable content, but reach the origin: application backend, database, internal search engine, API, checkout, login, shopping cart, administrative endpoints, or custom pages.
In that case, automated traffic is no longer just a matter of visibility or content control. It becomes a direct cost.
Each dynamic request can trigger PHP-FPM, MySQL or MariaDB queries, Redis calls, application logic, session systems, plugins, modules, hooks, third-party calls, HTML generation, ACL checks, search functions, pricing, stock availability, coupons, shipping, and so on.
This is particularly relevant for WordPress, WooCommerce, Magento, PrestaShop sites, publishing portals, and applications with public APIs.
A bot that queries cacheable pages can be a nuisance. A bot that constantly forces cache misses, dynamic queries, internal searches, product filters, carts, logins, and APIs can become a problem of performance, infrastructure cost, and security.
The case of e-commerce
E-commerce sites are among the most attractive targets for automated traffic, not only for traditional attacks like credential stuffing or carding, but also for financially motivated activities.
A bot can monitor product prices and availability, copy product listings, collect images, identify promotions, verify coupons, simulate shopping carts, probe checkout endpoints, test stolen credentials, create fake accounts, saturate search functions, or attempt mass catalog scraping.
In platforms like WooCommerce, Magento, or PrestaShop, the problem is exacerbated by the fact that many seemingly innocuous requests can become dynamic. Filters, searches, sorting, pagination, attribute combinations, product variations, carts, and sessions can bypass the cache or drastically reduce its effectiveness.
The result is an infrastructure that appears undersized not because there are too many real users, but because a significant portion of the capacity is consumed by automations that do not generate revenue.
Added to this is a further problem, often less visible: the impact on usage-based external services connected to the site. Many e-commerce sites integrate marketing automation platforms, CRMs, live chat systems, push notifications, email marketing, advanced analytics, recommendation engines, or customer engagement tools. Services such as Motive, for example, are often used for marketing automation, cart recovery, messaging, user segmentation, or commercial interactions based on visitor behavior.
If bot traffic is interpreted as real traffic, these tools may receive an abnormal number of events, sessions, visits, simulated carts, or interactions. The risk is twofold: on the one hand, it taints analytics and segmentation; on the other, in services with plans based on volume, contacts, events, or monthly visits, artificial traffic can push usage past the thresholds of the active plan, leading to additional costs or a forced upgrade to a higher subscription tier.
This is an often overlooked point: not all traffic is business. A site may have seemingly high numbers, many requests per second, full logs, and growing graphs, but if a significant portion of that traffic is made up of unwanted bots, those numbers don't represent success. They represent consumption.
The WordPress Case: XML-RPC, REST API, Logins, and Comments
In the WordPress world, bot traffic is a daily constant. Even small or medium-sized sites constantly receive automated requests to known paths.
Among the most frequent targets are /wp-login.php, /xmlrpc.php, /wp-json/, REST endpoints, feeds, sitemaps, author pages, query parameters used to enumerate content, vulnerable plugin files, old backup paths, archives, uploads, and static resources.
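To get a first idea of how much of a site's traffic lands on these paths, request URLs from the access log can be matched against a short prefix list. The sketch below uses three of the paths named above. The function names are illustrative, and the prefix list is an assumption to adapt to each site.

```python
from collections import Counter
from typing import Iterable, Optional
from urllib.parse import urlsplit

# Prefixes taken from the targets listed above; extend for your own site.
PROBE_PREFIXES = ("/wp-login.php", "/xmlrpc.php", "/wp-json/")


def probe_target(url: str) -> Optional[str]:
    """Return the matching prefix for a request URL, or None."""
    path = urlsplit(url).path.lower()
    for prefix in PROBE_PREFIXES:
        if path.startswith(prefix):
            return prefix
    return None


def count_probe_hits(urls: Iterable[str]) -> Counter:
    """Count requests per frequently probed path."""
    return Counter(t for t in map(probe_target, urls) if t is not None)
```

A hit on one of these paths is not proof of abuse, since `/wp-json/` is also used by legitimate front ends and integrations. The counts are only a starting point for deciding which endpoints deserve stricter rules.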
Many attacks aren't sophisticated. They're simply massive, distributed, and persistent. A single attempt may seem insignificant, but thousands or millions of requests over time generate noise, logs, load, false positives, bandwidth consumption, worker saturation, and a worsening of TTFB.
Furthermore, in WordPress, the problem isn't just "blocking the bad guys." It's preventing unnecessary traffic from triggering the entire application stack. A request handled at the Nginx, WAF, or reverse proxy level costs much less than a request that goes all the way to PHP and the database.
For this reason, in well-managed WordPress environments, bot protection should start before the application: at the level of the web server, reverse proxy, application firewall, cache, and access rules.
AI bots: few compared to the total, but very influential
One of the most significant developments concerns artificial intelligence bots. In percentage terms, they may represent a smaller share of overall bot traffic, but their impact can be disproportionate.
AI crawlers and fetchers don't just index pages like traditional search engines. They can collect content, process it, synthesize it, reuse it in generated responses, disconnect the information from its original source, and reduce the need for the end user to visit the site that produced that content.
This introduces a new strategic issue: content can be read, digested, and reused by automated systems without generating direct traffic, leads, conversions, or sufficient recognition for the original publisher.
The issue isn't just about copyright or brand visibility. It's also about infrastructure. If an AI crawler accesses dynamic content, non-cacheable pages, internal searches, or specific endpoints, each request can generate computational costs without a commensurate return.
The issue, therefore, is not to demonize artificial intelligence, but to establish rules. Who can access the site? Which content? How often? By what means? What benefit does this bring to the site owner?
Robots.txt is no longer enough
For years, the robots.txt file was considered the primary tool for telling crawlers what they can and cannot visit. It remains useful, but has a structural limitation: it only works with cooperative bots.
A malicious bot can ignore it completely. An aggressive scraper can even read it to learn which areas the site wants crawlers to avoid, and therefore which endpoints might be of interest. An automated system can pretend to be a known bot without actually following its rules.
Even noindex meta tags, X-Robots-Tag headers, and similar directives are policy tools, not technical barriers. They are guidelines. They do not physically impede access.
For this reason, modern bot management must combine public disclosures, technical controls, and observability.
The robots.txt file is for communication. The WAF is for defense. The reverse proxy is for governance. The logs are for understanding. Confusing these roles leads to fragile strategies.
The problem of visibility
Many organizations don't actually know how much bot traffic they receive. They look at Google Analytics, Search Console, or similar tools, but these systems only capture a fraction of the phenomenon.
Many bots don't run JavaScript, don't load analytics tags, don't accept cookies, don't follow the behavior of a real browser, and aren't counted correctly in marketing statistics.
The paradox is that the infrastructure sees the traffic, but the marketing department often doesn't. The server manages it, the WAF records it, the reverse proxy serves it, the database can be affected, but traditional analytics dashboards can drastically underestimate it.
To truly understand the phenomenon, you need to look at HTTP logs, reverse proxy logs, application logs, response codes, latencies, cache hit ratio, source IPs, ASNs, user agents, most affected endpoints, cache hit and miss percentages, request frequency per client, geographic distribution, and correlation with peaks in CPU, RAM, I/O, and database queries.
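A few of these figures can be extracted from an access log with very little code. The sketch below assumes a hypothetical log layout: the common "combined" format with one extra trailing field holding the cache status (for example HIT or MISS). The regular expression would need adapting to the format actually configured on the server, and the function name is illustrative.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Assumed layout: combined log format + trailing cache status field.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)" (?P<cache>\S+)'
)


def summarize(lines: Iterable[str]) -> Dict[str, object]:
    """Aggregate requests per client, cache hit ratio, and uncached paths."""
    per_ip = Counter()
    uncached_paths = Counter()
    cache = Counter()
    skipped = 0
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            skipped += 1  # line does not follow the assumed format
            continue
        per_ip[m["ip"]] += 1
        cache[m["cache"]] += 1
        if m["cache"] != "HIT":
            # Strip the query string so filter variants group together.
            uncached_paths[m["path"].split("?", 1)[0]] += 1
    total = sum(cache.values())
    return {
        "requests": total,
        "skipped": skipped,
        "hit_ratio": cache["HIT"] / total if total else 0.0,
        "top_clients": per_ip.most_common(5),
        "top_uncached_paths": uncached_paths.most_common(5),
    }
```

The clients and paths that dominate the uncached requests are usually the first place to look, because those are the requests that reach the origin.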
Without this visibility, you risk optimizing the wrong part of the system.
From security to performance
Bot management is often treated as a security issue. This is correct, but reductive.
Bots also impact performance. A site may have worse response times not because it's poorly optimized in the absolute sense, but because it's serving too much non-human traffic. Specifically, requests that generate cache misses, heavy queries, internal searches, or sessions can increase the perceived TTFB for real users.
In an e-commerce site, this can mean slower checkouts, less responsive product search, a sluggish administrative backend, saturated PHP processes, a stressed database, and a greater likelihood of 502, 503, or timeout errors.
The consequence is simple: bot mitigation is also a web performance strategy.
Blocking or limiting unnecessary traffic upstream frees up resources for real users. It improves stability, reduces load, contains costs, and makes infrastructure behavior more predictable.
A fast infrastructure isn't just one that responds quickly. It's one that can decide who's worth responding to.
There's no need to block everything
A naive strategy might be: "let's block all bots." But that's almost never the right choice.
Some bots are useful. Search engine crawlers are essential for SEO. Monitoring systems are necessary to detect downtime. Some third-party services need access to specific endpoints. Social networks generate previews. Marketplaces can verify feeds and availability. Some AI systems, if properly managed, could even provide indirect visibility.
The point isn't to block indiscriminately. The point is to decide.
Which bots are allowed? On which paths? How often? Can they access dynamic pages? Can they query internal search? Can they call API endpoints? Should they be restricted by ASN, country, fingerprint, or behavior? Should they receive full content, reduced content, or differentiated responses? Should they be blocked only when they exceed certain thresholds?
Modern bot management is granular.
Blocking everything is simple, but often harmful. Managing traffic is more complex, but much more effective.
Different policies for different content
Not all content has the same value and not all endpoints have the same cost.
A cacheable homepage is not the same as an internal search. A product page is not the same as a checkout endpoint. A public article is not the same as a members' area. A sitemap is not the same as a REST API. A static image is not the same as a dynamically generated page with dozens of queries.
For this reason, policies should be differentiated.
You can be more permissive on static resources and more restrictive on dynamic endpoints. You can allow sitemap crawling, but limit the frequency of access to product pages. You can allow access to verified search engine crawlers, but block spoofed user agents. You can apply strict rate limiting to login, shopping cart, checkout, internal search, and APIs.
This distinction allows noise to be reduced without harming truly useful traffic.
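One way to express differentiated policies is a table keyed by path prefix, resolved with a longest-prefix match. The sketch below is illustrative only: the paths, the per-minute budgets, and the `automation` field are hypothetical values chosen for the example, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    requests_per_minute: int  # per-client budget (hypothetical numbers)
    automation: str           # "allow", "verified" (verified bots only), or "deny"


# Hypothetical policy table: permissive on static assets, strict on costly endpoints.
POLICIES = {
    "/static/": Policy("static", 600, "allow"),
    "/sitemap": Policy("sitemap", 60, "allow"),
    "/product/": Policy("product", 30, "verified"),
    "/search": Policy("search", 10, "verified"),
    "/cart": Policy("cart", 10, "deny"),
    "/checkout": Policy("checkout", 5, "deny"),
    "/wp-login.php": Policy("login", 5, "deny"),
}
DEFAULT_POLICY = Policy("default", 120, "verified")


def policy_for(path: str) -> Policy:
    """Pick the policy whose prefix is the longest match for the path."""
    matches = [prefix for prefix in POLICIES if path.startswith(prefix)]
    if not matches:
        return DEFAULT_POLICY
    return POLICIES[max(matches, key=len)]
```

Keeping the table as data rather than scattered conditions makes it easier to review, version, and adjust as traffic patterns change.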
The role of reverse proxy and WAF
Effective mitigation must occur as close to the edge of the infrastructure as possible, before the request reaches the application.
In a typical architecture, Nginx, Varnish, HAProxy, a WAF, or an advanced reverse proxy can intercept and filter many requests before they reach PHP-FPM, Node.js, Java, databases, or application backends.
This approach has two advantages. The first is performance: a request blocked upstream costs very little. The second is security: it reduces the application's attack surface.
Rules based on path, HTTP method, header, user agent, IP reputation, ASN, geolocation, rate, cookies, presence of JavaScript challenges, behavior, and request consistency can dramatically reduce unwanted traffic.
The important thing is not to turn the WAF into a dead end. The rules must be observable, tested, versioned, and adapted to the specific context of the site.
A good security rule is one that blocks the wrong traffic without penalizing the right one.
Intelligent rate limiting
Rate limiting is one of the most useful tools, but it must be applied wisely.
Simply limiting the number of requests per IP can work against primitive bots, but it's less effective against distributed networks, residential proxies, botnets, or address rotation systems. Furthermore, overly aggressive thresholds can penalize real users, legitimate crawlers, or companies behind shared NATs.
Modern rate limiting should consider multiple dimensions: IP, subnet, ASN, user agent, endpoint, HTTP method, session cookie, authentication, country, fingerprint, generated response, and request cost.
A request to a CSS file doesn't carry the same weight as a request to an internal search. A GET request to a cacheable page doesn't have the same impact as a POST to a login form. A sporadic sitemap access isn't comparable to hundreds of requests per minute to product filters.
The best rate limiting doesn't just count requests. It weighs cost and risk.
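A common way to weigh cost rather than count requests is a token bucket in which each request spends a number of tokens proportional to how expensive its endpoint is. The sketch below is a minimal single-process version. The cost weights are hypothetical, and the injectable `now` clock exists only to make the behavior easy to test.

```python
import time
from typing import Callable

# Hypothetical cost weights: a login attempt "costs" 100x a static file.
ENDPOINT_COST = {"static": 0.1, "page": 1.0, "search": 5.0, "login": 10.0}


class CostBucket:
    """Token bucket where each request spends tokens equal to its cost."""

    def __init__(
        self,
        capacity: float,
        refill_per_second: float,
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._now = now
        self._tokens = capacity
        self._last = now()

    def allow(self, cost: float) -> bool:
        """Refill according to elapsed time, then try to spend `cost` tokens."""
        current = self._now()
        elapsed = current - self._last
        self._last = current
        self._tokens = min(
            self.capacity, self._tokens + elapsed * self.refill_per_second
        )
        if cost > self._tokens:
            return False
        self._tokens -= cost
        return True
```

In practice one bucket would be kept per client key, where the key combines several of the dimensions listed above (for example IP, ASN, and session cookie) rather than the IP address alone.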
Bots and APIs: An Often Underestimated Area
APIs are a natural target for automated traffic. They're structured, predictable, easily queryable, and often return data in a format convenient for processing.
Many sites protect HTML pages but leave API endpoints poorly controlled. This applies to REST APIs, GraphQL, custom endpoints, AJAX, JSON feeds, mobile integrations, headless apps, and publicly exposed internal systems.
API attacks can include ID enumeration, data scraping, lookup abuse, credential stuffing, logic bypasses, coupon abuse, unauthorized data access attempts, and excessive resource consumption.
For this reason, bot protection shouldn't stop at the frontend. It should also include API gateways, authentication, authorization, schema validation, token limits, method control, protection from expensive queries, and detailed logging.
A public API without clear boundaries is often an invitation to abusive automation.
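ID enumeration, one of the abuses mentioned above, often leaves a recognizable trace: a client requesting resource IDs that increase by one on consecutive requests. The following sketch flags that pattern for a single client's request sequence. It is a deliberately simple heuristic with an arbitrary threshold, and the function names are illustrative. It only inspects the first numeric path segment it finds.

```python
import re
from typing import Iterable

# Matches the first purely numeric path segment, e.g. /api/orders/123.
ID_RE = re.compile(r"/(\d+)(?:/|$)")


def longest_sequential_run(paths: Iterable[str]) -> int:
    """Length of the longest run of consecutive requests whose
    numeric ID increases by exactly 1 each time."""
    best = run = 0
    prev = None
    for path in paths:
        m = ID_RE.search(path.split("?", 1)[0])
        if m is None:
            prev, run = None, 0  # a request without an ID breaks the run
            continue
        current = int(m.group(1))
        run = run + 1 if prev is not None and current == prev + 1 else 1
        best = max(best, run)
        prev = current
    return best


def looks_like_enumeration(paths: Iterable[str], threshold: int = 20) -> bool:
    """Heuristic flag; the default threshold is an arbitrary example value."""
    return longest_sequential_run(paths) >= threshold
```

A more careful attacker can randomize the order, so a signal like this belongs alongside authorization checks and rate limits, not in place of them.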
The hidden economic impact
Bot traffic costs money even when it doesn't cause obvious incidents.
It consumes bandwidth. It increases requests served by the CDN or reverse proxy. It generates logs. It occupies storage space. It increases the volume of data to be analyzed. It can increase cloud costs, egress costs, logging and monitoring costs. It can force you to oversize servers, databases, caches, and load balancers.
In high-traffic environments, even a small percentage of unnecessary dynamic requests can result in significant costs. In smaller environments, however, aggressive scraping can be enough to saturate resources and compromise the experience of real users.
This is especially important for those managing hosting, e-commerce, and applications with sensitive margins. Paying for infrastructure to serve traffic that doesn't generate value reduces efficiency and profitability.
Non-human traffic isn't free. Even when it doesn't buy, convert, or interact, someone still serves it.
From passive defense to access governance
The real evolution is cultural: we must no longer think of bots simply as threats to be blocked, but as entities to be governed.
Each automated request should be evaluated based on three questions.
Who is making the request?
Is it a verified bot, a known crawler, an internal system, a partner service, a suspicious client, or a spoofed identity?
What is it doing?
Is it reading public content, querying dynamic endpoints, attempting logins, scanning vulnerabilities, copying catalogs, performing searches, or accessing APIs?
What is the value or risk?
Does it bring useful traffic, indexing, tracking, and visibility, or does it just generate cost, exposure, scraping, and potential abuse?
Only by answering these questions is it possible to build a sensible policy.
What should a company do today?
The first step is measurement. Without measurement, every decision is arbitrary.
At least thirty days of logs should be analyzed, distinguishing human traffic, known bots, suspicious bots, unclassified traffic, cache hits, cache misses, most affected dynamic endpoints, most active IPs and ASNs, most frequent user agents, response codes, error rates, and correlation with infrastructure spikes.
The second step is classification. Not all bots should be treated equally. You need to create operational categories: allowed, restricted, monitored, challenged, blocked.
The third step is to apply progressive controls. First, logging and alerting, then soft rules, then rate limiting, then selective blocking. Overly aggressive rules applied without observation can cause damage to SEO, integrations, and real users.
The fourth step is to protect the most expensive endpoints: login, cart, checkout, internal search, APIs, XML-RPC, REST APIs, administrative areas, forms, pages with complex queries, dynamic feeds.
The fifth step is to periodically review policies. Bots change rapidly. New crawlers emerge, old user agents are spoofed, new proxy networks are used, and new evasion techniques become common.
Bot management isn't a set-and-forget task. It's an ongoing process.
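The classification and progressive-control steps can be combined in a small decision table: each operational category maps to an action, and the action is only enforced once the rollout has reached the corresponding stage. Before that, the would-be action is merely logged. The category and stage names below follow the ones used above. The mapping between them is an assumption made for this sketch.

```python
from enum import Enum, IntEnum
from typing import Tuple


class Category(Enum):
    ALLOWED = "allowed"
    RESTRICTED = "restricted"
    MONITORED = "monitored"
    CHALLENGED = "challenged"
    BLOCKED = "blocked"


class Stage(IntEnum):
    LOG_ONLY = 1         # logging and alerting
    SOFT_RULES = 2       # e.g. challenges
    RATE_LIMIT = 3
    SELECTIVE_BLOCK = 4


# Assumed mapping: action per category and the stage from which it is enforced.
ACTIONS = {
    Category.CHALLENGED: ("challenge", Stage.SOFT_RULES),
    Category.RESTRICTED: ("rate_limit", Stage.RATE_LIMIT),
    Category.BLOCKED: ("block", Stage.SELECTIVE_BLOCK),
}


def action_for(category: Category, stage: Stage) -> Tuple[str, bool]:
    """Return (action, enforced). When enforced is False, the action
    should only be logged so its effect can be observed first."""
    if category not in ACTIONS:
        return ("pass", False)  # allowed and monitored traffic is never acted on
    action, required_stage = ACTIONS[category]
    return (action, stage >= required_stage)
```

Running in `LOG_ONLY` for a while shows what each rule would have blocked, which is how damage to SEO, integrations, and real users can be caught before enforcement begins.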
Conclusion
The modern web is no longer primarily composed of people visiting pages. It's a mixed ecosystem, where real users, legitimate crawlers, commercial automation, scrapers, malicious bots, and AI systems compete for access to the same content and resources.
When a huge portion of traffic may be non-human, bot management is no longer a technical detail. It becomes a topic of security, performance, costs, SEO, content protection and business strategy.
The question is no longer whether a site will receive automated traffic. It already does.
The real question is whether the infrastructure can distinguish, measure, limit, and govern it without harming real users and without giving up the bots that generate value.
As artificial intelligence, scraping, and automation redefine how content is collected, summarized, and consumed, access control becomes a critical lever.
Those who manage bots selectively and intelligently will have faster sites, more efficient infrastructure, more predictable costs, and greater control over their content. Those who continue to ignore the problem will risk paying, in economic and performance terms, for traffic that brings no real benefit.