Intro to Scrapy
Scrapy is a Python framework for data scraping. In short, it combines almost everything we have learned so far: requests, CSS selectors (BeautifulSoup), XPath (lxml), regular expressions (re) and even checking robots.txt or putting the scraper to sleep.
Because Scrapy is a framework, one does not usually write code inside a Jupyter Notebook. To mimic Scrapy's behavior inside the Notebook, we need some additional imports that would not be required otherwise (a minimal sketch of a standalone spider is shown right after the key points below).
Key points:
- response - the object that contains the page source as a Scrapy element to be scraped,
- response.css() - CSS approach to scraping (as with BeautifulSoup),
- response.xpath() - XPath approach to scraping (as with lxml),
- extract() - extract all elements satisfying some condition (returns a list),
- extract_first() - extract the first element satisfying some condition (returns a single element),
- response.css("a::text").extract_first() - returns the text of the first matched link (CSS),
- response.xpath("//a/text()").extract_first() - returns the text of the first matched link (XPath),
- response.css('a::attr(href)').extract_first() - returns the href attribute (URL) of the first matched link (CSS),
- response.xpath("//a/@href").extract_first() - returns the href attribute (URL) of the first matched link (XPath).
In [1]:
import requests
from scrapy.http import TextResponse  # lets us wrap a requests response as a Scrapy response
In [2]:
url = "http://quotes.toscrape.com/"
r = requests.get(url)
# wrap the downloaded page so that .css() / .xpath() work as in Scrapy
response = TextResponse(r.url, body=r.text, encoding="utf-8")
In [3]:
response
Out[3]:
<200 http://quotes.toscrape.com/>
In [10]:
#get heading-css
response.css("a").extract_first()
Out[10]:
'<a href="/" style="text-decoration: none">Quotes to Scrape</a>'
In [13]:
#get heading-xpath
response.xpath("//a").extract_first()
Out[13]:
'<a href="/" style="text-decoration: none">Quotes to Scrape</a>'
In [16]:
#get authors-css
response.css("small::text").extract()
Out[16]:
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
In [17]:
#authors-xpath
response.xpath("//small/text()").extract()
Out[17]:
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
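Several authors appear more than once because each quote is a separate element; if only the unique names are needed, plain Python takes care of that:
In [ ]:
# keep each author only once, sorted alphabetically
sorted(set(response.css("small::text").extract()))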
In [19]:
#heading-css
response.css('a[style="text-decoration: none"]').extract()
Out[19]:
['<a href="/" style="text-decoration: none">Quotes to Scrape</a>']
In [20]:
#heading-css text only
response.css('a[style="text-decoration: none"]::text').extract()
Out[20]:
['Quotes to Scrape']
In [21]:
#heading-css href only
response.css('a[style="text-decoration: none"]::attr(href)').extract()
Out[21]:
['/']
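The extracted href is relative ('/'); Scrapy response objects provide a urljoin() helper that resolves it against the page URL:
In [ ]:
# turn the relative href into an absolute URL
response.urljoin(response.css('a[style="text-decoration: none"]::attr(href)').extract_first())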
In [23]:
#tag text css
response.css("a[class='tag']::text").extract()
Out[23]:
['change', 'deep-thoughts', 'thinking', 'world', 'abilities', 'choices', 'inspirational', 'life', 'live', 'miracle', 'miracles', 'aliteracy', 'books', 'classic', 'humor', 'be-yourself', 'inspirational', 'adulthood', 'success', 'value', 'life', 'love', 'edison', 'failure', 'inspirational', 'paraphrased', 'misattributed-eleanor-roosevelt', 'humor', 'obvious', 'simile', 'love', 'inspirational', 'life', 'humor', 'books', 'reading', 'friendship', 'friends', 'truth', 'simile']
In [24]:
#tag url css
response.css("a[class='tag']::attr(href)").extract()
Out[24]:
['/tag/change/page/1/', '/tag/deep-thoughts/page/1/', '/tag/thinking/page/1/', '/tag/world/page/1/', '/tag/abilities/page/1/', '/tag/choices/page/1/', '/tag/inspirational/page/1/', '/tag/life/page/1/', '/tag/live/page/1/', '/tag/miracle/page/1/', '/tag/miracles/page/1/', '/tag/aliteracy/page/1/', '/tag/books/page/1/', '/tag/classic/page/1/', '/tag/humor/page/1/', '/tag/be-yourself/page/1/', '/tag/inspirational/page/1/', '/tag/adulthood/page/1/', '/tag/success/page/1/', '/tag/value/page/1/', '/tag/life/page/1/', '/tag/love/page/1/', '/tag/edison/page/1/', '/tag/failure/page/1/', '/tag/inspirational/page/1/', '/tag/paraphrased/page/1/', '/tag/misattributed-eleanor-roosevelt/page/1/', '/tag/humor/page/1/', '/tag/obvious/page/1/', '/tag/simile/page/1/', '/tag/love/', '/tag/inspirational/', '/tag/life/', '/tag/humor/', '/tag/books/', '/tag/reading/', '/tag/friendship/', '/tag/friends/', '/tag/truth/', '/tag/simile/']
In [28]:
#tag text xpath
response.xpath("//a[@class='tag']/text()").extract()
Out[28]:
['change', 'deep-thoughts', 'thinking', 'world', 'abilities', 'choices', 'inspirational', 'life', 'live', 'miracle', 'miracles', 'aliteracy', 'books', 'classic', 'humor', 'be-yourself', 'inspirational', 'adulthood', 'success', 'value', 'life', 'love', 'edison', 'failure', 'inspirational', 'paraphrased', 'misattributed-eleanor-roosevelt', 'humor', 'obvious', 'simile', 'love', 'inspirational', 'life', 'humor', 'books', 'reading', 'friendship', 'friends', 'truth', 'simile']
In [30]:
#tag url xpath
response.xpath("//a[@class='tag']/@href").extract()
Out[30]:
['/tag/change/page/1/', '/tag/deep-thoughts/page/1/', '/tag/thinking/page/1/', '/tag/world/page/1/', '/tag/abilities/page/1/', '/tag/choices/page/1/', '/tag/inspirational/page/1/', '/tag/life/page/1/', '/tag/live/page/1/', '/tag/miracle/page/1/', '/tag/miracles/page/1/', '/tag/aliteracy/page/1/', '/tag/books/page/1/', '/tag/classic/page/1/', '/tag/humor/page/1/', '/tag/be-yourself/page/1/', '/tag/inspirational/page/1/', '/tag/adulthood/page/1/', '/tag/success/page/1/', '/tag/value/page/1/', '/tag/life/page/1/', '/tag/love/page/1/', '/tag/edison/page/1/', '/tag/failure/page/1/', '/tag/inspirational/page/1/', '/tag/paraphrased/page/1/', '/tag/misattributed-eleanor-roosevelt/page/1/', '/tag/humor/page/1/', '/tag/obvious/page/1/', '/tag/simile/page/1/', '/tag/love/', '/tag/inspirational/', '/tag/life/', '/tag/humor/', '/tag/books/', '/tag/reading/', '/tag/friendship/', '/tag/friends/', '/tag/truth/', '/tag/simile/']
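The tag URLs above are relative as well, and the same urljoin() trick applies. It also works for pagination: assuming the site's "Next" button sits inside an element with class next, the URL of the following page can be built like this:
In [ ]:
# find the relative link of the next page and make it absolute
next_href = response.css("li.next a::attr(href)").extract_first()
response.urljoin(next_href)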
In [7]:
response.css("title").extract_first()
Out[7]:
'<title>Quotes to Scrape</title>'
In [9]:
response.css("title").re("title")
Out[9]:
['title', 'title']
In [17]:
#regex to get text between tags
response.css("title").re('.+>(.+)<.+')
Out[17]:
['Quotes to Scrape']
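Putting the pieces together: assuming each quote on the page sits in a div with class quote, with the text in span.text, the author in small.author and the tags in a.tag (which matches the selectors used above), one record per quote can be built like this:
In [ ]:
# iterate over the quote blocks and collect text, author and tags for each
for quote in response.css("div.quote"):
    print({
        "text": quote.css("span.text::text").extract_first(),
        "author": quote.css("small.author::text").extract_first(),
        "tags": quote.css("a.tag::text").extract(),
    })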