scrape-brrr

Simple web page scraping.

Install

yarn add scrape-brrr

Try it online

Examples

*The following examples use typescript style import. For plain nodejs, use

const { scrape } = require('scrape-brrr')

Dead-simple usage

/**
 *  <body>
 *    <div>
 *      <span>
 *        <p>sentence 1</p>
 *        <p>sentence 2</p>
 *        <p>sentence 3</p>
 *      </span>
 *    </div>
 *    <p>footer</p>
 *  </body> 
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', 'div p:not(:first-child)')
// ["sentence 2", "sentence 3"]

Scrape single item

/**
 *  <body>
 *    <div>Best wof</div>
 *    <span>Largest wof</span>
 *  </body> 
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', [
  {
    name: 'stats',
    selector: 'div',
  },
  {
    name: 'another-stats',
    selector: 'span',
  },
])
// { 
//   stats: "Best wof"
//   "another-stats": "Largest wof"
// }

Scrape multiple items

/**
 *  <body>
 *    <div>
 *      <span class="name">husky</span>
 *      <span class="name">golden</span>
 *    </div>
 *  </body> 
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', [{
  name: 'bestWofs',
  selector: 'div .name',
  many: true
}])
// { bestWofs: ["husky", "golden"] }

Nested fields

/**
 *  <body>
 *    <div>
 *      <span class="name">husky</span>
 *      <span class="name">golden</span>
 *    </div>
 *  </body> 
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', [{
  name: 'bestWofs',
  selector: 'div',
  many: true,
  nested: [
    {
      name: 'name',
      selector: 'span',
    }
  ]
}])
// { 
//   bestWofs: [
//     { name: "husky" }, 
//     { name: "golden" },
//   ]
// }

Extract link / HTML element attribute

/**
 *  <body>
 *    <span class="title" id="best">Best wof</div>
 *    <a href="/other-stats">other stats</a>
 *  </body> 
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', [
  {
    name: 'key',
    selector: 'span',
    attr: 'id'
  },
  {
    name: 'otherLink',
    selector: 'a',
    attr: 'href'
  },
])
// { 
//   key: "best",
//   otherLink: "/other-stats"
// }

Transform

/**
 *  <body>
 *    <div>
 *      <span class="rank">1</span>
 *      <span class="name">husky</span>
 *    </div>
 *    <div>
 *      <span class="rank">2</span>
 *      <span class="name">golden</span>
 *    </div>
 *  </body> 
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', [{
  name: 'best',
  selector: 'div',
  many: true,
  nested: [
    {
      name: 'rank',
      selector: '.rank',
    },
    {
      name: 'name',
      selector: '.name',
    }
  ],
  transform: arr => arr[0]
}])
// { 
//   best: { name: "husky" },
// }

Website with dynamic content by js

Use puppeteer to load page with javascript to scrape dynamic content.

/**
 *  <body>
 *    <h1>
 *      tick tok tick tok
 *    </h1>
 *    <script>
 *      document.querySelector('h1').textContent = 'boom!'
 *    </script>
 *  </body>
 */

import { scrape } from 'scrape-brrr'

const data = await scrape('http://website.com', 'h1', { dynamic: true })
// ["boom!"]

Scrape new reddit website

Might not get the data correctly due to change of html structure.

const { scrape } = require("scrape-brrr");
const numeral = require('numeral')

async function scrapeNewReddit() {
  const url = "https://reddit.com/";
  const data = await scrape(url, [
    {
      name: "posts",
      selector: ".Post",
      many: true,
      nested: [
        {
          name: "title",
          selector: "h3",
        },
        {
          name: "votes",
          selector: "div:first-child div div div",
          transform: str => numeral(str).value(),
        },
        {
          name: "link",
          selector: "a[data-click-id=body]",
          attr: "href",
        },
      ],
    },
  ], { dynamic: true });
  console.log(data);
}

scrapeNewReddit();
// { 
//   posts: [
//     {
//       title: 'GME Megathread for April 01, 2021',
//       votes: 11800,
//       link: '/r/wallstreetbets/comments/mhu8tg/gme_megathread_for_april_01_2021/'
//     },
//     ...
//   ]
// }

Other features

Handle non-utf8 charset response from server (e.g. chinese encoding big5)

Development

yarn install

yarn test

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
.github		.github
src		src
.gitignore		.gitignore
.release-it.json		.release-it.json
.releaserc.yml		.releaserc.yml
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
jest.config.js		jest.config.js
package.json		package.json
renovate.json		renovate.json
tsconfig.json		tsconfig.json
yarn.lock		yarn.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scrape-brrr

Install

Try it online

Examples

Dead-simple usage

Scrape single item

Scrape multiple items

Nested fields

Extract link / HTML element attribute

Transform

Website with dynamic content by js

Scrape new reddit website

Other features

Development

About

Releases 1

Packages

Contributors 6

Languages

License

chauchakching/scrape-brrr

Folders and files

Latest commit

History

Repository files navigation

scrape-brrr

Install

Try it online

Examples

Dead-simple usage

Scrape single item

Scrape multiple items

Nested fields

Extract link / HTML element attribute

Transform

Website with dynamic content by js

Scrape new reddit website

Other features

Development

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 6

Languages

Packages