
RequestQueue

Request queue is a storage for managing HTTP requests.

The request queue class serves as a high-level interface for organizing and managing HTTP requests during web crawling. It provides methods for adding, retrieving, and manipulating requests throughout the crawling lifecycle, abstracting away the underlying storage implementation details.

The request queue maintains the state of each URL to be crawled, tracking whether it has been handled, is currently being processed, or is waiting in the queue. Each request is identified by its unique_key property, which prevents duplicate processing unless explicitly configured otherwise.

The class supports both breadth-first and depth-first crawling strategies through its forefront parameter when adding requests. It also provides mechanisms for error handling and request reclamation when processing fails.
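To make the effect of forefront on crawl order concrete, here is a minimal toy model built on a plain deque. It is not Crawlee's implementation: the LINKS graph and the crawl_order helper are hypothetical and exist only for this illustration. New links are enqueued at the tail (breadth-first) or at the head (depth-first), and the seen set stands in for unique_key deduplication.

```python
from collections import deque

# Hypothetical link graph used only for this illustration.
LINKS = {
    'a': ['b', 'c'],
    'b': ['d'],
    'c': ['e'],
    'd': [],
    'e': [],
}


def crawl_order(start: str, *, forefront: bool) -> list[str]:
    """Return the visit order of a toy queue that mimics the forefront flag."""
    pending = deque([start])
    seen = {start}  # plays the role of unique_key deduplication
    order = []
    while pending:
        page = pending.popleft()  # requests are always fetched from the head
        order.append(page)
        for link in LINKS[page]:
            if link in seen:
                continue
            seen.add(link)
            if forefront:
                pending.appendleft(link)  # head of the queue: depth-first
            else:
                pending.append(link)  # tail of the queue: breadth-first
    return order


print(crawl_order('a', forefront=False))  # ['a', 'b', 'c', 'd', 'e']
print(crawl_order('a', forefront=True))   # ['a', 'c', 'e', 'b', 'd']
```

With forefront=True the most recently discovered link is visited first, so the toy crawl follows one branch to its end before returning to its siblings.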

You can open a request queue using the open class method, specifying either a name or ID to identify the queue. The underlying storage implementation is determined by the configured storage client.

Usage

from crawlee.storages import RequestQueue

# Open a request queue
rq = await RequestQueue.open(name='my_queue')

# Add a request
await rq.add_request('https://example.com')

# Process requests
request = await rq.fetch_next_request()
if request:
    try:
        # Process the request
        # ...
        await rq.mark_request_as_handled(request)
    except Exception:
        await rq.reclaim_request(request)

Methods

__init__

  • __init__(client, id, name): None
  • Initialize a new instance.

    Preferably use the RequestQueue.open constructor to create a new instance.


    Parameters

    • client: RequestQueueClient

      An instance of a storage client.

    • id: str

      The unique identifier of the storage.

    • name: str | None

      The name of the storage, if available.

    Returns None

add_request

  • async add_request(request, *, forefront): ProcessedRequest
  • Add a single request to the manager and store it in the underlying resource client.


    Parameters

    • request: str | Request

      The request object (or its string representation) to be added to the manager.

    • forefront: bool = False (optional, keyword-only)

      Determines whether the request should be added to the beginning (if True) or the end (if False) of the manager.

    Returns ProcessedRequest
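A minimal sketch of how the forefront flag might be used in practice, assuming rq is a queue obtained from RequestQueue.open. The helper name enqueue_with_priority and its arguments are hypothetical; only add_request and its forefront parameter come from this page.

```python
async def enqueue_with_priority(rq, seed_urls, urgent_url):
    """Add seed URLs to the end of the queue, then one URL to the front."""
    for url in seed_urls:
        # The default forefront=False appends to the end of the queue.
        await rq.add_request(url)
    # forefront=True places the request at the beginning of the queue,
    # so it should be fetched before the seeds added above.
    return await rq.add_request(urgent_url, forefront=True)
```

The helper returns whatever add_request returns for the urgent URL, which this page documents as a ProcessedRequest.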

add_requests

  • async add_requests(requests, *, forefront, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None
  • Add requests to the manager in batches.


    Parameters

    • requests: Sequence[str | Request]

      Requests to enqueue.

    • forefront: bool = False (optional, keyword-only)

      If True, add requests to the beginning of the queue.

    • batch_size: int = 1000 (optional, keyword-only)

      The number of requests to add in one batch.

    • wait_time_between_batches: timedelta = timedelta(seconds=1) (optional, keyword-only)

      Time to wait between adding batches.

    • wait_for_all_requests_to_be_added: bool = False (optional, keyword-only)

      If True, wait for all requests to be added before returning.

    • wait_for_all_requests_to_be_added_timeout: timedelta | None = None (optional, keyword-only)

      Timeout for waiting for all requests to be added.

    Returns None
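The batching parameters above can be combined as in the following sketch. It assumes rq is an opened request queue; the helper name enqueue_in_batches and the specific values (500 requests per batch, 200 ms pause, 30 s timeout) are arbitrary choices for illustration, not recommendations.

```python
from datetime import timedelta


async def enqueue_in_batches(rq, urls):
    """Enqueue many URLs and return only after all of them were added."""
    await rq.add_requests(
        list(urls),
        batch_size=500,  # illustrative value; the documented default is 1000
        wait_time_between_batches=timedelta(milliseconds=200),
        # Block until every batch has been added instead of returning early.
        wait_for_all_requests_to_be_added=True,
        wait_for_all_requests_to_be_added_timeout=timedelta(seconds=30),
    )
```

Leaving wait_for_all_requests_to_be_added at its default of False lets the call return sooner, at the cost of not knowing when the remaining batches have been stored.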

drop

  • async drop(): None
  • Drop the storage, removing it from the underlying storage client and clearing the cache.


    Returns None

fetch_next_request

  • async fetch_next_request(): Request | None
  • Return the next request in the queue to be processed.

    Once you have successfully processed the request, call RequestQueue.mark_request_as_handled to mark it as handled in the queue. If an error occurred while processing the request, call RequestQueue.reclaim_request instead, so that the queue hands the request to a consumer again in a later call to fetch_next_request.

    Note that a None return value does not mean that queue processing has finished; it only means that there are currently no pending requests. To check whether all requests in the queue have been handled, use RequestQueue.is_finished instead.


    Returns Request | None
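The fetch, handle, and reclaim cycle described above can be sketched as a small consumer loop. This is an illustration under stated assumptions, not how Crawlee's crawlers are implemented: drain_queue, handler, max_attempts, and idle_delay are hypothetical names, and the retry bookkeeping is kept in a local dictionary keyed by the request's unique_key.

```python
import asyncio


async def drain_queue(rq, handler, *, max_attempts=3, idle_delay=0.5):
    """Process requests until the queue reports that it is finished."""
    attempts = {}  # unique_key -> number of failed attempts (local bookkeeping)
    handled = 0
    while not await rq.is_finished():
        request = await rq.fetch_next_request()
        if request is None:
            # Nothing is pending right now, but the queue is not finished yet.
            await asyncio.sleep(idle_delay)
            continue
        try:
            await handler(request)
        except Exception:
            key = request.unique_key
            attempts[key] = attempts.get(key, 0) + 1
            if attempts[key] < max_attempts:
                # Return the request to the queue so it is fetched again later.
                await rq.reclaim_request(request)
            else:
                # Give up on this request so the queue can still finish.
                await rq.mark_request_as_handled(request)
        else:
            await rq.mark_request_as_handled(request)
            handled += 1
    return handled
```

The loop checks is_finished rather than treating a None result as the end, and the attempt limit keeps a request that always fails from being reclaimed forever.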

get_handled_count

  • async get_handled_count(): int

get_metadata

get_request

  • async get_request(request_id): Request | None
  • Retrieve a specific request from the queue by its ID.


    Parameters

    • request_id: str

      The ID of the request to retrieve.

    Returns Request | None

get_total_count

  • async get_total_count(): int
  • Get an offline approximation of the total number of requests in the loader (i.e. pending + handled).


    Returns int

is_empty

  • async is_empty(): bool
  • Check if the request queue is empty.

    An empty queue means that there are no requests currently in the queue, either pending or being processed. However, this does not necessarily mean that the crawling operation is finished, as there still might be tasks that could add additional requests to the queue.


    Returns bool

is_finished

  • async is_finished(): bool
  • Check if the request queue is finished.

    A finished queue means that all requests in the queue have been processed (the queue is empty) and there are no more tasks that could add additional requests to the queue. This is the definitive way to check if a crawling operation is complete.


    Returns bool
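For monitoring, the counting and state methods can be combined into a single snapshot. The sketch below assumes rq is an opened request queue; the helper name queue_status and the shape of the returned dictionary are hypothetical. Because get_total_count is described as an approximation, the remaining figure should be treated as an estimate too.

```python
async def queue_status(rq):
    """Collect a point-in-time summary of the queue's progress."""
    handled = await rq.get_handled_count()
    total = await rq.get_total_count()  # approximate: pending + handled
    return {
        'handled': handled,
        'total': total,
        # Clamp at zero in case the approximate total lags behind.
        'remaining': max(total - handled, 0),
        'is_empty': await rq.is_empty(),
        'is_finished': await rq.is_finished(),
    }
```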

mark_request_as_handled

  • async mark_request_as_handled(request): ProcessedRequest | None
  • Mark a request as handled after successful processing.

    This method should be called after a request has been successfully processed. Once marked as handled, the request will be removed from the queue and will not be returned in subsequent calls to the fetch_next_request method.


    Parameters

    • request: Request

      The request to mark as handled.

    Returns ProcessedRequest | None

open

  • async open(*, id, name, configuration, storage_client): Storage
  • Open a storage, either restoring an existing one or creating a new one.


    Parameters

    • id: str | None = None (optional, keyword-only)

      The storage ID.

    • name: str | None = None (optional, keyword-only)

      The storage name.

    • configuration: Configuration | None = None (optional, keyword-only)

      Configuration object used during the storage creation or restoration process.

    • storage_client: StorageClient | None = None (optional, keyword-only)

      Underlying storage client to use. If not provided, the default global storage client from the service locator will be used.

    Returns Storage

purge

  • async purge(): None
  • Purge the storage, removing all items from the underlying storage client.

    This method does not remove the storage itself (for example, its metadata is kept); it only clears all items within it.


    Returns None
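The difference between purge and drop can be summarized in a small helper. This is only a sketch: reset_queue and its delete_storage flag are hypothetical names, and rq is assumed to be an opened request queue.

```python
async def reset_queue(rq, *, delete_storage=False):
    """Clear a queue, optionally removing the storage itself."""
    if delete_storage:
        # drop() removes the storage from the underlying storage client.
        await rq.drop()
    else:
        # purge() removes the items but keeps the storage and its metadata.
        await rq.purge()
```

After drop, the queue object should not be reused; open the queue again if a fresh one is needed.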

reclaim_request

  • async reclaim_request(request, *, forefront): ProcessedRequest | None
  • Reclaim a failed request back to the queue for later processing.

    If a request fails during processing, this method can be used to return it to the queue. The request will be returned for processing again in a subsequent call to RequestQueue.fetch_next_request.


    Parameters

    • request: Request

      The request to return to the queue.

    • forefront: bool = False (optional, keyword-only)

      If True, the request will be added to the beginning of the queue. Otherwise, it will be added to the end.

    Returns ProcessedRequest | None

to_tandem

  • Combine the loader with a request manager to support adding and reclaiming requests.


    Parameters

    • request_manager: RequestManager | None = None (optional)

      Request manager to combine the loader with. If None is given, the default request queue is used.

    Returns RequestManagerTandem

Properties

id

id: str

Get the storage ID.

name

name: str | None

Get the storage name.