Scrapy is a fast, asynchronous web scraping framework that schedules and processes multiple requests in parallel, making it well suited for both small and large crawls[1]. This report compiles practical, field-tested tips for building robust Scrapy spiders, covering polite and efficient crawling, data extraction with selectors and item loaders, link following and pagination, anti-blocking tactics, retries and error handling, caching and job persistence, deduplication, storage with pipelines and feeds, debugging and monitoring, and depth and stop-condition controls.
Tune concurrency and delay to match target sites. Use DOWNLOAD_DELAY to set a minimum pause between consecutive requests to the same domain and enable RANDOMIZE_DOWNLOAD_DELAY to jitter the delay between 0.5 and 1.5 times the base value to reduce detection risk[2]. Enable AutoThrottle to adaptively compute delays from observed latency; it respects a start delay and target concurrency, and will not go below DOWNLOAD_DELAY nor above AUTOTHROTTLE_MAX_DELAY[3][4]. Control throughput with CONCURRENT_REQUESTS globally and CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP for per-target limits[2].
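As a sketch, these knobs might land in a project's settings.py like this; the numeric values are illustrative choices, not Scrapy's defaults:

```python
# settings.py -- politeness and concurrency baseline (example values)

DOWNLOAD_DELAY = 1.0                   # minimum pause between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = True        # jitter between 0.5x and 1.5x DOWNLOAD_DELAY

AUTOTHROTTLE_ENABLED = True            # adapt delays to observed latency
AUTOTHROTTLE_START_DELAY = 5.0         # initial delay before latency data exists
AUTOTHROTTLE_MAX_DELAY = 60.0          # never wait longer than this between requests
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server

CONCURRENT_REQUESTS = 16               # global cap across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # per-domain cap
CONCURRENT_REQUESTS_PER_IP = 0         # 0 disables the per-IP cap (per-domain applies)
```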
Adopt responsible defaults: disable cookies if not needed, avoid unnecessary retries and redirects, rotate user agents and IPs when appropriate, and keep logs at INFO in production to reduce noise and overhead[5]. For broad crawls, prefer domain-level concurrency controls to avoid stressing any single host[6].
Use response.css() and response.xpath() to query HTML/XML and extract values with .get() for a single result or .getall() for lists, and leverage ::text, ::attr() and .re() for text, attributes, and regex processing[7]. Item Loaders structure parsing by applying input/output processors and support add_xpath(), add_css(), add_value(), and load_item() to finalize items, with support for nested loaders and class inheritance to reuse parsing rules[8].
Avoid common XPath pitfalls: when chaining from a selected node, prefer relative paths starting with .// rather than / or // to prevent the query from restarting at the document root, and understand the difference between //node[1] (the first node child of each parent) and (//node)[1] (the first node in the entire document)[9]. Remove namespaces when necessary to simplify XPath expressions on XML or XHTML documents[9].
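Both pitfalls can be demonstrated with lxml, the library underlying Scrapy's selectors; the markup here is invented for illustration:

```python
from lxml import etree

doc = etree.fromstring(
    "<root><ul><li>a</li><li>b</li></ul><ul><li>c</li></ul></root>"
)

# //li[1] selects the FIRST <li> child of EVERY parent: 'a' and 'c'
firsts_per_parent = [e.text for e in doc.xpath("//li[1]")]

# (//li)[1] selects the first <li> in the WHOLE document: just 'a'
first_overall = [e.text for e in doc.xpath("(//li)[1]")]

# Chaining: an absolute // inside a chained query still searches from the root
second_list = doc.xpath("//ul")[1]
relative = [e.text for e in second_list.xpath(".//li")]  # scoped to this node: ['c']
absolute = [e.text for e in second_list.xpath("//li")]   # whole document: ['a', 'b', 'c']
```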
When writing basic spiders, follow links by yielding requests from parse(), and prefer response.follow or response.follow_all to automatically resolve relative URLs; set allowed_domains to constrain offsite requests[59][60].
For rule-based crawling, use CrawlSpider with LinkExtractor. Configure allow/deny regexes, allow_domains/deny_domains, deny_extensions, and restrict_xpaths or restrict_css to target the right regions; LxmlLinkExtractor is recommended and supports parameters like tags, attrs, unique, canonicalize, and process_value to filter and normalize URLs[51].
Identify the site's pagination mechanism, whether a next link, numbered pages, or AJAX calls, and keep yielding requests until you reach the last page[14]. If there is no explicit next link, reverse engineer the URL scheme and increment the page parameter safely to progress through results[19][20].
Control retries via RetryMiddleware. RETRY_ENABLED toggles retries, RETRY_TIMES sets the maximum retry attempts beyond the first try (often defaulting to 2), and RETRY_HTTP_CODES defines which statuses trigger a retry; override per-request with meta keys like dont_retry or max_retry_times[34]. Manage redirects through RedirectMiddleware using REDIRECT_ENABLED and REDIRECT_MAX_TIMES and turn them off per-request with dont_redirect; you can also use handle_httpstatus_list or handle_httpstatus_all to pass specific statuses to your spider[36][37].
HttpErrorMiddleware filters out non-200 responses unless you opt in by setting handle_httpstatus_list or handle_httpstatus_all at the request or spider level, and it can be configured to allow specific codes via settings like HTTPERROR_ALLOWED_CODES[62][63]. Use errbacks, which receive a Failure object, to handle failures programmatically and branch on error types such as HttpError or timeouts[44].
Enable ROBOTSTXT_OBEY to respect robots.txt via RobotsTxtMiddleware, which will prevent disallowed requests according to site policy[41]. UserAgentMiddleware reads USER_AGENT from settings, may fall back to a Scrapy default, and allows a spider-level user_agent attribute; during robots.txt checks, ROBOTSTXT_USER_AGENT is used if set[42].
Per-request meta keys give granular control: download_timeout, download_maxsize, download_warnsize, download_fail_on_dataloss, ftp_user, ftp_password, dont_cache, dont_merge_cookies, dont_obey_robotstxt, dont_redirect, dont_retry, handle_httpstatus_all, handle_httpstatus_list, referrer_policy, and max_retry_times can be used to tweak behavior per URL[37][31].
To further reduce blocking, rotate user agents and consider proxy rotation through middleware or third-party services, which helps diversify request fingerprints across devices and IPs[12][15][22].
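A minimal sketch of a rotating User-Agent downloader middleware; the UA strings are truncated placeholders, and you would register the class under DOWNLOADER_MIDDLEWARES in settings.py:

```python
import random

# Placeholder User-Agent strings -- use a maintained, realistic list in practice
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    "Mozilla/5.0 (X11; Linux x86_64) ...",
]

class RandomUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent per request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # None means: continue processing this request normally
```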
Enable HttpCacheMiddleware during development to store responses and avoid re-downloading, which speeds up iteration; the middleware returns cached responses when fresh and may serve cached content on download errors[30][29]. Control behavior with HTTPCACHE_ENABLED, HTTPCACHE_POLICY, and HTTPCACHE_STORAGE, and fine-tune with HTTPCACHE_DIR, HTTPCACHE_EXPIRATION_SECS, DBM backend options, the gzip toggle, and several ignore settings; you can disable caching per request using meta['dont_cache'][30][31].
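A development-time cache configuration might look like this; the values are illustrative:

```python
# settings.py -- HTTP cache for fast local iteration (example values)
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"               # stored under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 3600          # 0 would mean cached responses never expire
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503]  # don't cache server errors
HTTPCACHE_GZIP = True                     # compress cached responses on disk
```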
Use JOBDIR to pause and resume long crawls. A job directory persists the scheduler queue, dupe filter, and spider state so you can stop with Ctrl-C and resume later using the exact same command; do not share a JOBDIR across spiders or runs, and store any custom counters in spider.state for persistence[33].
Scrapy uses a dupe filter (RFPDupeFilter by default) that fingerprints requests using the canonicalized URL, method, and body, typically ignoring headers and fragments, and drops requests whose fingerprints have been seen[55][53]. Enable DUPEFILTER_DEBUG to log duplicates, set dont_filter=True per request when intentional repeats are needed, or swap DUPEFILTER_CLASS and adjust REQUEST_FINGERPRINTER_CLASS to customize behavior[56][57][54].
Item Pipelines run in sequence to clean, validate, and store items and are enabled by registering classes in ITEM_PIPELINES with integer priorities; use ItemAdapter in pipelines to write code that works across item types[24][28]. For media, use FilesPipeline or ImagesPipeline for features like avoiding re-downloads, naming, expiration, thumbnails, filtering small images, and handling redirects[27].
Use FEEDS for convenient exports to JSON, JSONL, CSV, XML, and more with URIs pointing to local paths, FTP, S3, or GCS, and dynamic placeholders like %(time)s and %(name)s for filenames[26][25]. Manage overwrites, batch exports with FEED_EXPORT_BATCH_ITEM_COUNT, filter by item_classes, and apply compression plugins such as GzipPlugin, LZMAPlugin, or Bz2Plugin; set FEED_EXPORT_ENCODING and FEED_EXPORT_INDENT for readable output[26].
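A sketch of a FEEDS configuration writing two feeds from one run; the paths and field names are invented:

```python
# settings.py -- two feeds from the same crawl (illustrative)
FEED_EXPORT_ENCODING = "utf-8"

FEEDS = {
    # JSON Lines archive, one file per run via placeholders
    "exports/%(name)s-%(time)s.jsonl": {
        "format": "jsonlines",
        "overwrite": False,
    },
    # CSV snapshot, replaced on every run, with a fixed column order
    "exports/items.csv": {
        "format": "csv",
        "overwrite": True,
        "fields": ["name", "price"],
    },
}
```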
Use the parse command to test callbacks and inspect output quickly, and drop into the Scrapy shell with inspect_response() to test selectors interactively against live responses[16][48]. open_in_browser() helps visualize a response as rendered in a browser to validate selectors or missing content[16].
Leverage Python logging via self.logger with configurable LOG_LEVEL and LOG_FILE, and monitor runtime metrics through the Stats Collection API, which supports counters and can be disabled with a dummy collector if needed[47][49]. Scrapy also offers a Telnet console and comprehensive debugging documentation for deeper introspection[50].
DepthMiddleware uses DEPTH_LIMIT to cap how deep link following goes and can record per-depth stats via DEPTH_STATS_VERBOSE; DEPTH_PRIORITY influences breadth-first versus depth-first traversal by adjusting request priorities[66][37].
CloseSpider can terminate a crawl automatically based on thresholds like CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT, CLOSESPIDER_ERRORCOUNT, and variants that stop when no items are produced, preventing runaway crawls[65][67].
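These depth and stop controls might be combined like so; the thresholds are illustrative:

```python
# settings.py -- depth control and stop conditions (example values)
DEPTH_LIMIT = 3                # ignore links more than 3 hops from the seeds
DEPTH_STATS_VERBOSE = True     # collect request counts per depth level
DEPTH_PRIORITY = 1             # positive values bias toward breadth-first crawling

CLOSESPIDER_TIMEOUT = 3600     # stop after an hour...
CLOSESPIDER_ITEMCOUNT = 10000  # ...or after 10k items...
CLOSESPIDER_PAGECOUNT = 50000  # ...or 50k responses...
CLOSESPIDER_ERRORCOUNT = 100   # ...or 100 errors, whichever comes first
```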
Scrapy offers a rich toolkit for efficient, reliable scraping, from adaptive throttling and concurrency controls to powerful selectors and item loaders for precise extraction[3][7]. Combine robust pagination and rule-based link extraction with careful retries, redirect handling, and HTTP error controls to keep spiders stable across real-world sites[51][34][62]. For iteration speed and resilience, use the HTTP cache and JOBDIR persistence, and finish strong with feed exports or pipelines, all while monitoring with logs and stats[30][33][26][47]. Adopting these practices will help you build Scrapy projects that are both polite to targets and productive for your data needs[5].