Scrapy and persistent cookie manager middleware
By andre
Scrapy is a nice Python framework for web scraping, i.e. extracting information from web sites automatically by crawling them. It works best for anonymous data discovery, but nothing stops you from maintaining authenticated sessions as well. In fact, Scrapy transparently manages cookies, which are usually what tracks a user session. Unfortunately, those sessions don't survive between runs. This can be fixed quite easily with a custom cookie middleware. Here is an example:
from __future__ import absolute_import

import logging
import os.path
import pickle

from scrapy.downloadermiddlewares.cookies import CookiesMiddleware

import settings


class PersistentCookiesMiddleware(CookiesMiddleware):
    """Drop-in replacement for CookiesMiddleware that persists the
    cookie jars to disk, so sessions survive between crawler runs."""

    def __init__(self, debug=False):
        super(PersistentCookiesMiddleware, self).__init__(debug)
        self.load()

    def process_response(self, request, response, spider):
        # TODO: optimize so that we don't do it on every response
        res = super(PersistentCookiesMiddleware, self).process_response(
            request, response, spider)
        self.save()
        return res

    def getPersistenceFile(self):
        return settings.COOKIES_STORAGE_FILE

    def save(self):
        logging.debug("Saving cookies to disk for reuse")
        with open(self.getPersistenceFile(), "wb") as f:
            pickle.dump(self.jars, f)
            f.flush()

    def load(self):
        filename = self.getPersistenceFile()
        logging.debug("Trying to load cookies from file '{0}'".format(filename))
        if not os.path.exists(filename):
            logging.info("File '{0}' for cookie reload doesn't exist".format(filename))
            return
        if not os.path.isfile(filename):
            raise Exception("File '{0}' is not a regular file".format(filename))
        # Replace the default (empty) cookie jars with the ones saved last run.
        with open(filename, "rb") as f:
            self.jars = pickle.load(f)
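The TODO above is straightforward to address: instead of writing the file after every response, save only once when the spider shuts down, using Scrapy's spider_closed signal. Here is a minimal sketch of that idea (the LazyPersistentCookiesMiddleware name is just for illustration):

from scrapy import signals
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware


class LazyPersistentCookiesMiddleware(PersistentCookiesMiddleware):
    """Variant that writes the cookie file once, on spider close."""

    @classmethod
    def from_crawler(cls, crawler):
        # The inherited from_crawler honours COOKIES_ENABLED / COOKIES_DEBUG;
        # on top of that, register a handler for the shutdown signal.
        mw = super(LazyPersistentCookiesMiddleware, cls).from_crawler(crawler)
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def process_response(self, request, response, spider):
        # Skip the per-response save() done by the parent class.
        return CookiesMiddleware.process_response(self, request, response, spider)

    def spider_closed(self, spider):
        self.save()

The trade-off is that a crawl that crashes mid-run loses its cookies, whereas the per-response version always has a recent copy on disk.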
Then disable the stock cookie middleware and register the new one in your project's settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    'middlewares.cookies.PersistentCookiesMiddleware': 701,
}
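The order 701 keeps the replacement right next to the slot of the built-in cookies middleware, which runs at 700 by default. Finally, the middleware reads its storage path from COOKIES_STORAGE_FILE, so define that in settings.py as well; any path the crawler can write to will do, for example:

COOKIES_STORAGE_FILE = '/tmp/scrapy-cookies.pkl'  # example path, adjust to your project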