Lesson 5 — Mastering Scrape: Scrape Any Webpage

⏱ Est. reading time: 4 min · Updated on 5/7/2026

Scrape is the core tool in Firecrawl. This lesson walks you through using its parameters to scrape even the most complex webpages precisely and efficiently.

5.1 Output Formats and Noise Reduction

You can request one or more output formats based on your AI or application's needs:

  • markdown (Preferred): Best for LLM consumption.
  • html: Preserves original structure.
  • screenshot: Captures a visual snapshot of the page.
  • links: Extracts all internal and external links.

Core Tip: Noise Reduction. Set onlyMainContent: true. Firecrawl uses its AI models to automatically identify and keep the main content while stripping away navigation, footers, sidebars, and ads, significantly reducing token consumption for downstream processing.
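The format and noise-reduction options above can be sketched as a request body. This is a minimal illustration assuming the JSON shape shown in this lesson; the helper function name is ours, not part of any SDK, so confirm parameter names against the official Firecrawl docs before use:

```python
import json

def build_scrape_payload(url, formats=("markdown",), only_main_content=True):
    """Build the JSON body for a Firecrawl scrape request (illustrative sketch)."""
    return {
        "url": url,
        "formats": list(formats),           # e.g. markdown, html, screenshot, links
        "onlyMainContent": only_main_content,  # strip nav, footers, sidebars, ads
    }

payload = build_scrape_payload("https://example.com", formats=["markdown", "links"])
print(json.dumps(payload, indent=2))
```

Requesting several formats in one call (here markdown plus links) avoids scraping the same page twice.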


5.2 Handling Dynamic Content (JS Rendering)

Modern websites are often built with React or Vue and require waiting for JavaScript to execute before the content appears. Use the waitFor parameter (in milliseconds):

{
  "url": "https://example.com",
  "formats": ["markdown"],
  "waitFor": 3000
}

Recommendation: 3000 ms for standard pages, 5000-10000 ms for complex SPAs.
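The rule of thumb above can be folded into a small helper. The function name and the is_spa flag are our own illustrative conventions, not part of the Firecrawl API:

```python
def with_wait(payload, is_spa=False):
    """Attach a waitFor delay: 3000 ms for standard pages, 8000 ms for heavy SPAs
    (a middle value from the lesson's 5000-10000 ms recommendation)."""
    payload = dict(payload)  # copy so the caller's dict is not mutated
    payload["waitFor"] = 8000 if is_spa else 3000
    return payload

req = with_wait({"url": "https://example.com", "formats": ["markdown"]}, is_spa=True)
print(req["waitFor"])  # 8000
```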


5.3 Precision Control: Tag Filtering

If you only need specific parts of a page (like product lists or comment sections), use:

  • includeTags: Keep only elements matching the specified CSS selectors.
  • excludeTags: Drop elements matching the specified CSS selectors.

Example:

"includeTags": ["article.product-card", ".price-section"],
"excludeTags": ["aside", ".recommended-ads"]
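Putting the fragment above into a complete request body looks like this. The selectors are hypothetical examples, not markup from a real site:

```python
import json

# Full request body combining tag filtering with the basics from 5.1.
payload = {
    "url": "https://example.com/products",
    "formats": ["markdown"],
    "includeTags": ["article.product-card", ".price-section"],  # keep only these
    "excludeTags": ["aside", ".recommended-ads"],               # drop these
}
print(json.dumps(payload, indent=2))
```

excludeTags is applied within whatever includeTags kept, so you can select a broad region and then prune noise inside it.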

5.4 Chained Actions

Before scraping the content, you can direct the browser to perform a series of actions:

  • wait: Pause for a set time.
  • scroll: Scroll up or down (triggers lazy loading).
  • click: Click buttons or links.
  • write / press: Type text and press keys.

Practical Case: Scroll twice and take a screenshot

"actions": [
  { "type": "scroll", "direction": "down" },
  { "type": "wait", "milliseconds": 1000 },
  { "type": "scroll", "direction": "down" },
  { "type": "screenshot" }
]
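The action chain above can also be assembled programmatically, which helps when the number of scrolls depends on the page. The helper function names here are illustrative, not SDK calls:

```python
def scroll(direction="down"):
    return {"type": "scroll", "direction": direction}

def wait(ms):
    return {"type": "wait", "milliseconds": ms}

# "Scroll twice, pause between scrolls, then screenshot" from the practical case.
actions = [scroll(), wait(1000), scroll(), {"type": "screenshot"}]
print(len(actions))  # 4
```

Actions run in order before the page content is captured, so the screenshot reflects whatever lazy-loaded content the scrolls triggered.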

5.5 Proxy and Anti-Scraping Modes

Choose the right proxy strategy based on the target website's protection level:

  • basic: Default mode, suitable for standard sites without anti-scraping measures.
  • stealth: Simulates real browser fingerprints to bypass basic detection.
  • enhanced: The strongest mode. Uses a global residential proxy pool to get past heavy anti-bot protection such as Cloudflare.
  • auto: The system automatically selects the best mode based on the target site's response.
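A sketch of validating and attaching a proxy mode to the request body. We assume here that the parameter is named proxy and accepts the four values listed above; verify the exact field name and accepted values against the Firecrawl docs:

```python
# Assumed set of valid modes, taken from the list in this section.
PROXY_MODES = {"basic", "stealth", "enhanced", "auto"}

def with_proxy(payload, mode="auto"):
    """Attach a proxy mode, rejecting typos early instead of at the API."""
    if mode not in PROXY_MODES:
        raise ValueError(f"unknown proxy mode: {mode}")
    payload = dict(payload)  # copy so the caller's dict is not mutated
    payload["proxy"] = mode  # field name is an assumption; check the docs
    return payload

req = with_proxy({"url": "https://example.com"}, mode="stealth")
print(req["proxy"])  # stealth
```

Start with auto (or basic for known-friendly sites) and escalate only when you see blocks, since stronger modes typically cost more credits.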