Intro to Scrapy
Scrapy is a Python framework for data scraping. In short, it combines almost everything we have learned so far: requests, CSS selectors (BeautifulSoup), XPath (lxml), regular expressions (re) and even checking robots.txt or putting the scraper to sleep.
Because Scrapy is a framework, one does not usually write code inside a Jupyter Notebook. To mimic Scrapy's behavior inside the Notebook, we need some additional imports that would not be required otherwise (a minimal sketch of a standalone spider is shown right after the key points below).
Key points:
- response - the object that contains the page source as a Scrapy element to be scraped,
- response.css() - CSS approach to scraping (as with BeautifulSoup),
- response.xpath() - XPath approach to scraping (as with lxml),
- extract() - extract all elements satisfying some condition (returns a list),
- extract_first() - extract the first element satisfying some condition (returns a single element),
- response.css("a::text").extract_first() - returns the text of the first matched link (CSS),
- response.xpath("//a/text()").extract_first() - returns the text of the first matched link (XPath),
- response.css('a::attr(href)').extract_first() - returns the href attribute (URL) of the first matched link (CSS),
- response.xpath("//a/@href").extract_first() - returns the href attribute (URL) of the first matched link (XPath).
In [1]:
import requests
from scrapy.http import TextResponse  # lets us wrap a requests response as a Scrapy response
In [2]:
url = "http://quotes.toscrape.com/"
r = requests.get(url)
# wrap the downloaded page so that .css() / .xpath() work as in Scrapy
response = TextResponse(r.url, body=r.text, encoding="utf-8")
In [3]:
response
Out[3]:
<200 http://quotes.toscrape.com/>
In [10]:
#get heading-css
response.css("a").extract_first()
Out[10]:
'<a href="/" style="text-decoration: none">Quotes to Scrape</a>'
In [13]:
#get heading-xpath
response.xpath("//a").extract_first()
Out[13]:
'<a href="/" style="text-decoration: none">Quotes to Scrape</a>'
In [16]:
#get authors-css
response.css("small::text").extract()
Out[16]:
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
In [17]:
#authors-xpath
response.xpath("//small/text()").extract()
Out[17]:
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
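Several authors appear more than once because each quote is a separate element; if only the unique names are needed, plain Python takes care of that:
In [ ]:
# keep each author only once, sorted alphabetically
sorted(set(response.css("small::text").extract()))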
In [19]:
#heading-css
response.css('a[style="text-decoration: none"]').extract()
Out[19]:
['<a href="/" style="text-decoration: none">Quotes to Scrape</a>']
In [20]:
#heading-css text only
response.css('a[style="text-decoration: none"]::text').extract()
Out[20]:
['Quotes to Scrape']
In [21]:
#heading-css href only
response.css('a[style="text-decoration: none"]::attr(href)').extract()
Out[21]:
['/']
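The extracted href is relative ('/'); Scrapy response objects provide a urljoin() helper that resolves it against the page URL:
In [ ]:
# turn the relative href into an absolute URL
response.urljoin(response.css('a[style="text-decoration: none"]::attr(href)').extract_first())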
In [23]:
#tag text css
response.css("a[class='tag']::text").extract()
Out[23]:
['change', 'deep-thoughts', 'thinking', 'world', 'abilities', 'choices', 'inspirational', 'life', 'live', 'miracle', 'miracles', 'aliteracy', 'books', 'classic', 'humor', 'be-yourself', 'inspirational', 'adulthood', 'success', 'value', 'life', 'love', 'edison', 'failure', 'inspirational', 'paraphrased', 'misattributed-eleanor-roosevelt', 'humor', 'obvious', 'simile', 'love', 'inspirational', 'life', 'humor', 'books', 'reading', 'friendship', 'friends', 'truth', 'simile']
In [24]:
#tag url css
response.css("a[class='tag']::attr(href)").extract()
Out[24]:
['/tag/change/page/1/', '/tag/deep-thoughts/page/1/', '/tag/thinking/page/1/', '/tag/world/page/1/', '/tag/abilities/page/1/', '/tag/choices/page/1/', '/tag/inspirational/page/1/', '/tag/life/page/1/', '/tag/live/page/1/', '/tag/miracle/page/1/', '/tag/miracles/page/1/', '/tag/aliteracy/page/1/', '/tag/books/page/1/', '/tag/classic/page/1/', '/tag/humor/page/1/', '/tag/be-yourself/page/1/', '/tag/inspirational/page/1/', '/tag/adulthood/page/1/', '/tag/success/page/1/', '/tag/value/page/1/', '/tag/life/page/1/', '/tag/love/page/1/', '/tag/edison/page/1/', '/tag/failure/page/1/', '/tag/inspirational/page/1/', '/tag/paraphrased/page/1/', '/tag/misattributed-eleanor-roosevelt/page/1/', '/tag/humor/page/1/', '/tag/obvious/page/1/', '/tag/simile/page/1/', '/tag/love/', '/tag/inspirational/', '/tag/life/', '/tag/humor/', '/tag/books/', '/tag/reading/', '/tag/friendship/', '/tag/friends/', '/tag/truth/', '/tag/simile/']
In [28]:
#tag text xpath
response.xpath("//a[@class='tag']/text()").extract()
Out[28]:
['change', 'deep-thoughts', 'thinking', 'world', 'abilities', 'choices', 'inspirational', 'life', 'live', 'miracle', 'miracles', 'aliteracy', 'books', 'classic', 'humor', 'be-yourself', 'inspirational', 'adulthood', 'success', 'value', 'life', 'love', 'edison', 'failure', 'inspirational', 'paraphrased', 'misattributed-eleanor-roosevelt', 'humor', 'obvious', 'simile', 'love', 'inspirational', 'life', 'humor', 'books', 'reading', 'friendship', 'friends', 'truth', 'simile']
In [30]:
#tag url xpath
response.xpath("//a[@class='tag']/@href").extract()
Out[30]:
['/tag/change/page/1/', '/tag/deep-thoughts/page/1/', '/tag/thinking/page/1/', '/tag/world/page/1/', '/tag/abilities/page/1/', '/tag/choices/page/1/', '/tag/inspirational/page/1/', '/tag/life/page/1/', '/tag/live/page/1/', '/tag/miracle/page/1/', '/tag/miracles/page/1/', '/tag/aliteracy/page/1/', '/tag/books/page/1/', '/tag/classic/page/1/', '/tag/humor/page/1/', '/tag/be-yourself/page/1/', '/tag/inspirational/page/1/', '/tag/adulthood/page/1/', '/tag/success/page/1/', '/tag/value/page/1/', '/tag/life/page/1/', '/tag/love/page/1/', '/tag/edison/page/1/', '/tag/failure/page/1/', '/tag/inspirational/page/1/', '/tag/paraphrased/page/1/', '/tag/misattributed-eleanor-roosevelt/page/1/', '/tag/humor/page/1/', '/tag/obvious/page/1/', '/tag/simile/page/1/', '/tag/love/', '/tag/inspirational/', '/tag/life/', '/tag/humor/', '/tag/books/', '/tag/reading/', '/tag/friendship/', '/tag/friends/', '/tag/truth/', '/tag/simile/']
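The tag URLs above are relative as well, and the same urljoin() trick applies. It also works for pagination: assuming the site's "Next" button sits inside an element with class next, the URL of the following page can be built like this:
In [ ]:
# find the relative link of the next page and make it absolute
next_href = response.css("li.next a::attr(href)").extract_first()
response.urljoin(next_href)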
In [7]:
response.css("title").extract_first()
Out[7]:
'<title>Quotes to Scrape</title>'
In [9]:
response.css("title").re("title")
Out[9]:
['title', 'title']
In [17]:
#regex to get text between tags
response.css("title").re('.+>(.+)<.+')
Out[17]:
['Quotes to Scrape']
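Putting the pieces together: assuming each quote on the page sits in a div with class quote, with the text in span.text, the author in small.author and the tags in a.tag (which matches the selectors used above), one record per quote can be built like this:
In [ ]:
# iterate over the quote blocks and collect text, author and tags for each
for quote in response.css("div.quote"):
    print({
        "text": quote.css("span.text::text").extract_first(),
        "author": quote.css("small.author::text").extract_first(),
        "tags": quote.css("a.tag::text").extract(),
    })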