Lesson 7 — Map and Crawl: Site-wide Collection Strategies

When you need to collect vast amounts of content from a large website (e.g., for building a RAG knowledge base), single-page scraping is no longer sufficient. Firecrawl provides two powerful site-level tools: Map and Crawl.

7.1 Map: Discovering All Site URLs

Map acts like a rapid "surveyor": it scans a target website and returns a list of the public URLs it can discover, without fetching any page content.

Why use Map?

  • Extreme Speed: Discover thousands of links in seconds.
  • Targeted Search: Use the search parameter to find only relevant URLs.

    Example: firecrawl_map(url="https://docs.firecrawl.dev", search="webhook") returns only the documentation URLs relevant to "webhook".
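
If you are calling Map from your own script rather than through an MCP tool, the request is a single POST. The sketch below is a minimal Python example using the requests library; the endpoint path (https://api.firecrawl.dev/v1/map) and the response fields are assumptions based on the v1 REST API, so verify them against the current Firecrawl docs before relying on them.

    import requests

    API_KEY = "fc-YOUR_KEY"  # your Firecrawl API key

    # Map the docs site and keep only URLs relevant to "webhook".
    resp = requests.post(
        "https://api.firecrawl.dev/v1/map",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": "https://docs.firecrawl.dev", "search": "webhook"},
        timeout=30,
    )
    resp.raise_for_status()
    links = resp.json().get("links", [])
    print(f"Discovered {len(links)} URLs")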


7.2 Crawl: Fully Automated Deep Scraping

Crawl follows links to deep-scrape entire sites. This is an asynchronous operation, ideal for large-scale tasks.

Key Parameters (see the sketch after this list):

  • maxDiscoveryDepth: How many levels deep to follow links. 1–3 is recommended; going deeper can pull in irrelevant content.
  • limit: Total page count limit. We suggest ≤ 50 to prevent excessive data from overwhelming the LLM.
  • includePaths: Only scrape pages matching specific paths (e.g., ["/docs/"]).
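
To see how these parameters fit together, here is a minimal sketch of starting a Crawl job over HTTP with Python's requests library. It assumes a POST to https://api.firecrawl.dev/v1/crawl that accepts these parameter names and returns a job ID in an id field; field names can differ between API versions, so check the current reference.

    import requests

    API_KEY = "fc-YOUR_KEY"

    # Start an asynchronous crawl restricted to the documentation section.
    resp = requests.post(
        "https://api.firecrawl.dev/v1/crawl",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "url": "https://docs.firecrawl.dev",
            "maxDiscoveryDepth": 2,      # stay shallow to avoid irrelevant pages
            "limit": 50,                 # hard cap on total pages
            "includePaths": ["/docs/"],  # only follow documentation paths
        },
        timeout=30,
    )
    resp.raise_for_status()
    job_id = resp.json()["id"]  # keep this ID for status polling (see 7.3)
    print("Crawl started, job ID:", job_id)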

7.3 Asynchronous Workflow (Job ID)

Since Crawl can take time, it follows this workflow:

  1. Initiate Request: Returns a Job ID.
  2. Poll Status: Use firecrawl_check_crawl_status with the ID.
  3. Retrieve Results: When status is completed, fetch the array containing all page contents.
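
As a script, the same workflow is a short polling loop. This sketch assumes the job ID from the previous step and a GET endpoint at https://api.firecrawl.dev/v1/crawl/<id> that reports a status field and, on completion, a data array of pages; confirm both against the current API reference.

    import time
    import requests

    API_KEY = "fc-YOUR_KEY"
    job_id = "YOUR_JOB_ID"  # returned when the crawl was initiated

    # Poll until the crawl finishes (or fails).
    while True:
        resp = requests.get(
            f"https://api.firecrawl.dev/v1/crawl/{job_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        job = resp.json()
        if job.get("status") == "completed":
            pages = job.get("data", [])  # one entry per scraped page
            print(f"Crawl finished with {len(pages)} pages")
            break
        if job.get("status") == "failed":
            raise RuntimeError("Crawl failed")
        time.sleep(5)  # wait between checks; large crawls can take a while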

7.4 Best Practice: Map + Batch Scrape (Recommended)

For developers who want maximum control, we recommend this strategy over a direct Crawl:

  1. Use Map to discover all of the site's URLs.
  2. Filter the list in your application layer by keywords, depth, or paths to find exactly what you need.
  3. Loop through the filtered list with Scrape or Extract to fetch content in batches (see the sketch below).
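
Here is the sketch referenced above: an end-to-end Map → filter → Scrape pipeline in Python. As before, the REST endpoints and response fields (links, data.markdown) are assumptions to verify against the current Firecrawl documentation.

    import requests

    API_KEY = "fc-YOUR_KEY"
    BASE = "https://api.firecrawl.dev/v1"
    HEADERS = {"Authorization": f"Bearer {API_KEY}"}

    # Step 1: discover the site's URLs with Map.
    map_resp = requests.post(
        f"{BASE}/map",
        headers=HEADERS,
        json={"url": "https://docs.firecrawl.dev"},
        timeout=30,
    )
    map_resp.raise_for_status()
    links = map_resp.json().get("links", [])

    # Step 2: filter in the application layer -- here, only docs pages, capped at 20.
    wanted = [u for u in links if "/docs/" in u][:20]

    # Step 3: scrape the filtered URLs one by one as Markdown.
    documents = []
    for url in wanted:
        resp = requests.post(
            f"{BASE}/scrape",
            headers=HEADERS,
            json={"url": url, "formats": ["markdown"]},
            timeout=60,
        )
        resp.raise_for_status()
        documents.append(resp.json()["data"]["markdown"])

    print(f"Collected {len(documents)} documents for the knowledge base")

In production you might swap the per-URL loop for Firecrawl's batch scraping, if your plan supports it, but a plain loop keeps per-page error handling simple.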

Advantages of this Strategy:

  • Higher Precision: Completely avoid irrelevant pages like login screens or legal notices.
  • Better Reliability: The data volume per request stays manageable, avoiding token overflow or timeouts.
  • Cost Efficiency: You only spend credits on pages that provide real value.