Lesson 7 — Map and Crawl: Site-wide Collection Strategies
When you need to collect vast amounts of content from a large website (e.g., for building a RAG knowledge base), single-page scraping is no longer sufficient. Firecrawl provides two powerful site-level tools: Map and Crawl.
7.1 Map: Discovering All Site URLs
Map acts like a rapid "surveyor." It scans a target website and returns a list of all public URLs without fetching page content.
Why use Map?
- Extreme Speed: Discover thousands of links in seconds.
- Targeted Search: Use the `search` parameter to return only relevant URLs. For example, `firecrawl_map(url="https://docs.firecrawl.dev", search="webhook")` returns only documentation URLs containing "webhook" in the path.
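The Map call above can be sketched as a plain REST request body. This is a minimal sketch: the `/v1/map` endpoint URL and the `url`/`search` field names are assumptions based on Firecrawl's v1 API, so verify them against the current API reference before use.

```python
import json

# Assumed Firecrawl v1 Map endpoint -- check the official API reference.
FIRECRAWL_MAP_ENDPOINT = "https://api.firecrawl.dev/v1/map"

def build_map_payload(url, search=None):
    """Build the JSON body for a Map request.

    `search` narrows the returned URL list to paths matching the keyword,
    mirroring the search parameter described above.
    """
    payload = {"url": url}
    if search:
        payload["search"] = search
    return payload

payload = build_map_payload("https://docs.firecrawl.dev", search="webhook")
print(json.dumps(payload))
```

In practice you would POST this payload to the endpoint with your API key in an `Authorization: Bearer ...` header; the response contains the list of discovered URLs.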
7.2 Crawl: Fully Automated Deep Scraping
Crawl follows links to deep-scrape entire sites. This is an asynchronous operation, ideal for large-scale tasks.
Key Parameters:
- `maxDiscoveryDepth`: How many levels deep to follow links. 1–3 is recommended; going deeper can pull in irrelevant content.
- `limit`: Total page count limit. We suggest ≤ 50 to prevent excessive data from overwhelming the LLM.
- `includePaths`: Only scrape pages matching specific paths (e.g., `["/docs/"]`).
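The three parameters above combine into a single crawl request body, sketched here as a Python dict. The field names come from this lesson; the target URL is illustrative, and the exact request shape should be checked against the Firecrawl docs.

```python
# Sketch of a Crawl request body using the parameters described above.
crawl_request = {
    "url": "https://docs.firecrawl.dev",   # illustrative target site
    "maxDiscoveryDepth": 2,                # follow links at most 2 levels deep
    "limit": 50,                           # hard cap on total pages crawled
    "includePaths": ["/docs/"],            # restrict the crawl to doc pages
}

print(crawl_request)
```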
7.3 Asynchronous Workflow (Job ID)
Since Crawl can take time, it follows this workflow:
- Initiate Request: Start the crawl; the response returns a `Job ID`.
- Poll Status: Call `firecrawl_check_crawl_status` with that ID.
- Retrieve Results: When the status is `completed`, fetch the array containing all page contents.
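The three steps above can be sketched as a polling loop. `FakeCrawlClient` below is a hypothetical stub standing in for the real client (substitute `firecrawl_check_crawl_status` or your SDK's equivalent) so the flow can run offline; the status strings and response shape are assumptions.

```python
import time

class FakeCrawlClient:
    """Stub client simulating a crawl that completes after a few checks."""

    def __init__(self):
        self._checks = 0

    def start_crawl(self, url):
        # Step 1: the real API returns a Job ID here.
        return "job-123"

    def check_crawl_status(self, job_id):
        # Step 2: the real API reports progress until the crawl finishes.
        self._checks += 1
        if self._checks < 3:
            return {"status": "scraping"}
        return {
            "status": "completed",
            "data": [{"url": "https://docs.firecrawl.dev/intro",
                      "markdown": "# Intro"}],
        }

def wait_for_crawl(client, job_id, poll_seconds=0.01, max_polls=100):
    """Poll until the job is 'completed', then return the page array (Step 3)."""
    for _ in range(max_polls):
        result = client.check_crawl_status(job_id)
        if result["status"] == "completed":
            return result["data"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawl {job_id} did not finish in time")

client = FakeCrawlClient()
job_id = client.start_crawl("https://docs.firecrawl.dev")
pages = wait_for_crawl(client, job_id)
print(len(pages))
```

In production, lengthen `poll_seconds` (several seconds is typical) so you are not hammering the status endpoint.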
7.4 Best Practice: Map + Batch Scrape (Recommended)
For developers who want maximum control, we recommend this strategy over a direct Crawl:
- Step 1: Use Map to discover all site URLs.
- Step 2: Filter the list in your application layer by keywords, depth, or paths to find exactly what you need.
- Step 3: Loop through the list using Scrape or Extract to fetch content in batches.
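The three-step pipeline can be sketched end to end. The Map output below is illustrative sample data, and `fetch_markdown` is a hypothetical stand-in for a real Scrape call; the filtering and batching logic is the part this strategy actually asks you to write.

```python
from urllib.parse import urlparse

def filter_urls(urls, include_path="/docs/", keyword=None):
    """Step 2: keep URLs under include_path, optionally matching a keyword."""
    kept = []
    for u in urls:
        if not urlparse(u).path.startswith(include_path):
            continue  # drops login screens, legal notices, etc.
        if keyword and keyword.lower() not in u.lower():
            continue
        kept.append(u)
    return kept

def batch_scrape(urls, fetch_markdown, batch_size=10):
    """Step 3: scrape in small batches so each round stays manageable."""
    results = {}
    for i in range(0, len(urls), batch_size):
        for u in urls[i:i + batch_size]:
            results[u] = fetch_markdown(u)  # real code would call Scrape here
    return results

# Illustrative Step 1 output (a real run would get this from firecrawl_map).
mapped = [
    "https://example.com/docs/webhooks",
    "https://example.com/login",
    "https://example.com/docs/quickstart",
]

docs = filter_urls(mapped)
print(docs)
```

Because filtering happens in your application layer, you can tune it freely (regexes, depth limits, allowlists) without re-running the site discovery.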
Advantages of this Strategy:
- Higher Precision: Completely avoid irrelevant pages like login screens or legal notices.
- Better Reliability: The data volume per request stays manageable, avoiding token overflow or timeouts.
- Cost Efficiency: Only pay Credits for pages that provide real value.