scrapework
A Python framework for scraping web pages.
Requirements
Python 3.4+
requests module
beautifulsoup4 module
Getting Started
To run:
$ cd git/scrapework
$ python
Python 3.6.2 |Anaconda custom (x86_64)| (default, Sep 21 2017, 18:29:43)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapework
>>> s = scrapework.Scrape(vars)
>>> s.get_pages()
>>> s.parse_files()
>>>
The Scrape class takes the following arguments:
Base URL of the website to scrape, e.g. ‘https://archivesgig.com’
Path to the output directory, e.g. ‘/Users/username/path/to/folder’ or ‘folder’
Desired filename for the output files, e.g. ‘archivesgig’
Pagination data (optional): URL structure for paginated pages (e.g. ‘/pages/’), begin page number (e.g. 1), end page number (e.g. 200), step (e.g. 1)
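
To illustrate how the pagination data might combine with the base URL, here is a minimal sketch. The helper build_page_urls is hypothetical and not part of scrapework. It assumes the end page number is inclusive and that the page number is appended directly after the URL structure; scrapework itself may build its URLs differently.

```python
def build_page_urls(base_url, page_path, begin, end, step=1):
    """Return one URL per paginated page.

    Hypothetical helper for illustration only; not part of scrapework.
    """
    urls = []
    # Assumption: the end page number is inclusive, hence end + 1.
    for page in range(begin, end + 1, step):
        # Assumption: the page number follows the URL structure directly.
        # Slashes are normalized so '/pages/' and 'pages' behave the same.
        url = base_url.rstrip("/") + "/" + page_path.strip("/") + "/" + str(page)
        urls.append(url)
    return urls


# Example values taken from the argument list above.
urls = build_page_urls("https://archivesgig.com", "/pages/", 1, 200, 1)
print(len(urls))   # 200
print(urls[0])     # https://archivesgig.com/pages/1
print(urls[-1])    # https://archivesgig.com/pages/200
```

With a step of 2, the same helper would yield pages 1, 3, 5, and so on, which is one plausible reading of the step argument.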