Lesson 7 — Map and Crawl: Site-wide Collection Strategies
When you need to collect vast amounts of content from a large website (e.g., for building a RAG knowledge base), single-page scraping is no longer sufficient. Firecrawl provides two powerful site-level tools: Map and Crawl.
7.1 Map: Discovering All Site URLs
Map acts like a rapid "surveyor." It scans a target website and returns a list of all public URLs without fetching page content.
Why use Map?
- Extreme Speed: Discover thousands of links in seconds.
- Targeted Search: Use the `search` parameter to return only relevant URLs. For example, `firecrawl_map(url="https://docs.firecrawl.dev", search="webhook")` returns only documentation URLs containing "webhook" in the path.
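The Map call above can be sketched as a plain REST request body. This is a minimal sketch: the `/v1/map` endpoint URL and the `url`/`search` field names are assumptions based on Firecrawl's v1 API, so verify them against the current API reference before use.

```python
import json

# Assumed Firecrawl v1 Map endpoint -- check the official API reference.
FIRECRAWL_MAP_ENDPOINT = "https://api.firecrawl.dev/v1/map"

def build_map_payload(url, search=None):
    """Build the JSON body for a Map request.

    `search` narrows the returned URL list to paths matching the keyword,
    mirroring the search parameter described above.
    """
    payload = {"url": url}
    if search:
        payload["search"] = search
    return payload

payload = build_map_payload("https://docs.firecrawl.dev", search="webhook")
print(json.dumps(payload))
```

In practice you would POST this payload to the endpoint with your API key in an `Authorization: Bearer ...` header; the response contains the list of discovered URLs.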
7.2 Crawl: Fully Automated Deep Scraping
Crawl follows links to deep-scrape entire sites. This is an asynchronous operation, ideal for large-scale tasks.
Key Parameters:
- `maxDiscoveryDepth`: How many levels deep to follow links. 1–3 is recommended; going deeper can pull in irrelevant content.
- `limit`: Total page count limit. We suggest ≤ 50 to prevent excessive data from overwhelming the LLM.
- `includePaths`: Only scrape pages matching specific paths (e.g., `["/docs/"]`).
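The three parameters above combine into a single crawl request body, sketched here as a Python dict. The field names come from this lesson; the target URL is illustrative, and the exact request shape should be checked against the Firecrawl docs.

```python
# Sketch of a Crawl request body using the parameters described above.
crawl_request = {
    "url": "https://docs.firecrawl.dev",   # illustrative target site
    "maxDiscoveryDepth": 2,                # follow links at most 2 levels deep
    "limit": 50,                           # hard cap on total pages crawled
    "includePaths": ["/docs/"],            # restrict the crawl to doc pages
}

print(crawl_request)
```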
7.3 Asynchronous Workflow (Job ID)
Since Crawl can take time, it follows this workflow:
- Initiate Request: Start the crawl; the response returns a `Job ID`.
- Poll Status: Call `firecrawl_check_crawl_status` with that ID.
- Retrieve Results: When the status is `completed`, fetch the array containing all page contents.
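The three steps above can be sketched as a polling loop. `FakeCrawlClient` below is a hypothetical stub standing in for the real client (substitute `firecrawl_check_crawl_status` or your SDK's equivalent) so the flow can run offline; the status strings and response shape are assumptions.

```python
import time

class FakeCrawlClient:
    """Stub client simulating a crawl that completes after a few checks."""

    def __init__(self):
        self._checks = 0

    def start_crawl(self, url):
        # Step 1: the real API returns a Job ID here.
        return "job-123"

    def check_crawl_status(self, job_id):
        # Step 2: the real API reports progress until the crawl finishes.
        self._checks += 1
        if self._checks < 3:
            return {"status": "scraping"}
        return {
            "status": "completed",
            "data": [{"url": "https://docs.firecrawl.dev/intro",
                      "markdown": "# Intro"}],
        }

def wait_for_crawl(client, job_id, poll_seconds=0.01, max_polls=100):
    """Poll until the job is 'completed', then return the page array (Step 3)."""
    for _ in range(max_polls):
        result = client.check_crawl_status(job_id)
        if result["status"] == "completed":
            return result["data"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawl {job_id} did not finish in time")

client = FakeCrawlClient()
job_id = client.start_crawl("https://docs.firecrawl.dev")
pages = wait_for_crawl(client, job_id)
print(len(pages))
```

In production, lengthen `poll_seconds` (several seconds is typical) so you are not hammering the status endpoint.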
7.4 Best Practice: Map + Batch Scrape (Recommended)
For developers who want maximum control, we recommend this strategy over a direct Crawl:
- Step 1: Use Map to discover all site URLs.
- Step 2: Filter the list in your application layer by keywords, depth, or paths to find exactly what you need.
- Step 3: Loop through the list using Scrape or Extract to fetch content in batches.
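The three-step pipeline can be sketched end to end. The Map output below is illustrative sample data, and `fetch_markdown` is a hypothetical stand-in for a real Scrape call; the filtering and batching logic is the part this strategy actually asks you to write.

```python
from urllib.parse import urlparse

def filter_urls(urls, include_path="/docs/", keyword=None):
    """Step 2: keep URLs under include_path, optionally matching a keyword."""
    kept = []
    for u in urls:
        if not urlparse(u).path.startswith(include_path):
            continue  # drops login screens, legal notices, etc.
        if keyword and keyword.lower() not in u.lower():
            continue
        kept.append(u)
    return kept

def batch_scrape(urls, fetch_markdown, batch_size=10):
    """Step 3: scrape in small batches so each round stays manageable."""
    results = {}
    for i in range(0, len(urls), batch_size):
        for u in urls[i:i + batch_size]:
            results[u] = fetch_markdown(u)  # real code would call Scrape here
    return results

# Illustrative Step 1 output (a real run would get this from firecrawl_map).
mapped = [
    "https://example.com/docs/webhooks",
    "https://example.com/login",
    "https://example.com/docs/quickstart",
]

docs = filter_urls(mapped)
print(docs)
```

Because filtering happens in your application layer, you can tune it freely (regexes, depth limits, allowlists) without re-running the site discovery.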
Advantages of this Strategy:
- Higher Precision: Completely avoid irrelevant pages like login screens or legal notices.
- Better Reliability: The data volume per request stays manageable, avoiding token overflow or timeouts.
- Cost Efficiency: Only pay Credits for pages that provide real value.