Empower your AI apps with clean data from any website. Featuring advanced scraping, crawling, and data extraction capabilities.
This repository is in development, and we're still integrating custom modules into the monorepo. It's not fully ready for self-hosted deployment yet, but you can run it locally.
WebDataWizard is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each. No sitemap required. Check out our documentation for more details.
We provide an easy-to-use API with our hosted version. You can find the playground and documentation at WebDataWizard.dev. You can also self-host the backend if you'd like.
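To make the API shape concrete, here is a minimal sketch of a scrape request over plain HTTP. The endpoint path and payload fields below are assumptions for illustration; the hosted documentation is the authoritative reference.

```python
# Minimal sketch of a scrape request against the hosted API.
# The endpoint path and field names are assumptions for illustration.
import requests

response = requests.post(
    "https://api.WebDataWizard.dev/v1/scrape",  # assumed endpoint
    headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
    json={"url": "https://WebDataWizard.dev", "formats": ["markdown"]},
)
response.raise_for_status()
print(response.json())  # clean markdown plus page metadata
```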
- LLM-ready formats: markdown, structured data, screenshot, HTML, links, metadata
- The hard stuff: proxies, anti-bot mechanisms, dynamic content (JS-rendered), output parsing, orchestration
- Customizability: exclude tags, crawl behind auth walls with custom headers, max crawl depth, and more
- Media parsing: PDFs, DOCX, images
- Reliability first: designed to get the data you need, no matter how hard it is
- Actions: click, scroll, input, wait, and more before extracting data
WebDataWizard allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction.
Here is an example of how to use actions to navigate to google.com, search for WebDataWizard, click on the first result, and take a screenshot.
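The request below sketches that flow against the HTTP API. The `actions` array, action types, and field names are illustrative assumptions rather than the definitive schema; check the documentation for the exact contract.

```python
# Hypothetical sketch: drive the page with actions before scraping.
# Action types and field names are assumptions for illustration only.
import requests

response = requests.post(
    "https://api.WebDataWizard.dev/v1/scrape",  # assumed endpoint
    headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
    json={
        "url": "https://google.com",
        "formats": ["markdown"],
        "actions": [
            {"type": "wait", "milliseconds": 2000},
            {"type": "click", "selector": "textarea[title=Search]"},
            {"type": "write", "text": "WebDataWizard"},
            {"type": "press", "key": "ENTER"},
            {"type": "wait", "milliseconds": 3000},
            {"type": "click", "selector": "h3"},  # open the first result
            {"type": "screenshot"},               # capture the final page
        ],
    },
)
response.raise_for_status()
print(response.json())
```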
```bash
pip install WebDataWizard-py
```
```python
from WebDataWizard.WebDataWizard import WebDataWizardApp

app = WebDataWizardApp(api_key="fc-YOUR_API_KEY")

# Scrape a website:
scrape_status = app.scrape_url(
    'https://WebDataWizard.dev',
    params={'formats': ['markdown', 'html']}
)
print(scrape_status)

# Crawl a website:
crawl_status = app.crawl_url(
    'https://WebDataWizard.dev',
    params={
        'limit': 100,
        'scrapeOptions': {'formats': ['markdown', 'html']}
    },
    poll_interval=30
)
print(crawl_status)
```
With LLM extraction, you can easily extract structured data from any URL. We support Pydantic schemas to make it even easier. Here is how to use it:
```python
from typing import List

from pydantic import BaseModel, Field

from WebDataWizard.WebDataWizard import WebDataWizardApp

app = WebDataWizardApp(api_key="fc-YOUR_API_KEY")

class ArticleSchema(BaseModel):
    title: str
    points: int
    by: str
    commentsURL: str

class TopArticlesSchema(BaseModel):
    top: List[ArticleSchema] = Field(..., max_items=5, description="Top 5 stories")

data = app.scrape_url('https://news.ycombinator.com', params={
    'formats': ['extract'],
    'extract': {
        'schema': TopArticlesSchema.model_json_schema()
    }
})
print(data["extract"])
```
To install the WebDataWizard Node SDK, you can use npm:
```bash
npm install @mendable/WebDataWizard-js
```
- Get an API key from WebDataWizard.dev
- Set the API key as an environment variable named `WebDataWizard_API_KEY` or pass it as a parameter to the `WebDataWizardApp` class.
```js
import WebDataWizardApp, { CrawlParams, CrawlStatusResponse } from '@mendable/WebDataWizard-js';

const app = new WebDataWizardApp({apiKey: "fc-YOUR_API_KEY"});

// Scrape a website
const scrapeResponse = await app.scrapeUrl('https://WebDataWizard.dev', {
  formats: ['markdown', 'html'],
});

if (scrapeResponse) {
  console.log(scrapeResponse)
}

// Crawl a website
const crawlResponse = await app.crawlUrl('https://WebDataWizard.dev', {
  limit: 100,
  scrapeOptions: {
    formats: ['markdown', 'html'],
  }
} as CrawlParams, true, 30) as CrawlStatusResponse;

if (crawlResponse) {
  console.log(crawlResponse)
}
```
With LLM extraction, you can easily extract structured data from any URL. We support Zod schemas to make it even easier. Here is how to use it:
```js
import WebDataWizardApp from "@mendable/WebDataWizard-js";
import { z } from "zod";

const app = new WebDataWizardApp({
  apiKey: "fc-YOUR_API_KEY"
});

// Define a schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe("Top 5 stories on Hacker News"),
});

const scrapeResult = await app.scrapeUrl("https://news.ycombinator.com", {
  formats: ["extract"],
  extract: { schema: schema },
});

console.log(scrapeResult.extract);
```
WebDataWizard is open source, available under the AGPL-3.0 license.
To deliver the best possible product, we offer a hosted version of WebDataWizard alongside our open-source offering. The cloud solution allows us to continuously innovate and maintain a high-quality, sustainable service for all users.
WebDataWizard Cloud is available at WebDataWizard.dev and offers a range of features that are not available in the open-source version.
We love contributions! Please read our contributing guide before submitting a pull request. If you'd like to self-host, refer to the self-hosting guide.
It is the sole responsibility of the end users to respect websites' policies when scraping, searching and crawling with WebDataWizard. Users are advised to adhere to the applicable privacy policies and terms of use of the websites prior to initiating any scraping activities. By default, WebDataWizard respects the directives specified in the websites' robots.txt files when crawling. By utilizing WebDataWizard, you expressly agree to comply with these conditions.