Location: Mudd 327
This workshop covers programmatically collecting and storing useful information from the web using Python. First, we will use Requests and BeautifulSoup to download and parse HTML and XML files. We will then use the Scrapy framework to write a web spider that crawls online blog entries and stores their comments in a JSON file for later processing.
Web scraping is the technique of extracting information from the web and storing it in a useful form. It is often the first step in discovering interesting patterns and gaining insights from large data sets. The web in particular is a vast source of information that can be accessed systematically and programmatically at very little cost.
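
To give a flavor of the "extract and store" idea, here is a minimal sketch, not the workshop's actual material. It parses a small, made-up HTML snippet with BeautifulSoup and serializes the comments as JSON. The page structure, the class names (`comment`, `author`), and the helper `extract_comments` are all assumptions chosen for illustration. In practice the HTML string would typically come from a call such as `requests.get(url).text` rather than an inline literal.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical blog page used only for illustration. A real scraper would
# download this text, for example with requests.get(url).text.
SAMPLE_HTML = """
<html><body>
  <h1 class="post-title">Hello, scraping</h1>
  <div class="comment"><span class="author">ana</span><p>Nice post!</p></div>
  <div class="comment"><span class="author">raj</span><p>Thanks for sharing.</p></div>
</body></html>
"""


def extract_comments(html):
    """Return a list of {"author", "text"} dicts, one per comment block.

    Assumes each comment is a <div class="comment"> containing a
    <span class="author"> and a <p>; these names are hypothetical.
    """
    soup = BeautifulSoup(html, "html.parser")
    comments = []
    for node in soup.find_all("div", class_="comment"):
        comments.append({
            "author": node.find("span", class_="author").get_text(strip=True),
            "text": node.find("p").get_text(strip=True),
        })
    return comments


comments = extract_comments(SAMPLE_HTML)

# Serialize to JSON so the data can be stored and processed later.
as_json = json.dumps(comments, indent=2)
print(as_json)
```

Keeping the parsing logic in one function that takes a plain HTML string makes it easy to try out on saved pages before pointing it at a live site.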