# scrape_linkedin

## Introduction

`scrape_linkedin` is a Python package to scrape all details from public LinkedIn profiles, turning the data into structured JSON. You can scrape both companies and user profiles with this package.

**Warning**: LinkedIn has strong anti-scraping policies and may blacklist IPs that make unauthenticated or unusual requests.

## Table of Contents

- [scrape_linkedin](#scrapelinkedin)
  - [Introduction](#introduction)
  - [Table of Contents](#table-of-contents)
  - [Installation](#installation)
    - [Install with pip](#install-with-pip)
    - [Install from source](#install-from-source)
    - [Tests](#tests)
  - [Getting & Setting LI_AT](#getting--setting-liat)
    - [Getting LI_AT](#getting-liat)
    - [Setting LI_AT](#setting-liat)
  - [Examples](#examples)
  - [Usage](#usage)
    - [Command Line](#command-line)
    - [Python Package](#python-package)
      - [Profiles](#profiles)
      - [Companies](#companies)
      - [config](#config)
  - [Scraping in Parallel](#scraping-in-parallel)
    - [Example](#example)
    - [Configuration](#configuration)
  - [Issues](#issues)

## Installation

### Install with pip

Run `pip install git+git://github.com/austinoboyle/scrape-linkedin-selenium.git`

### Install from source

Clone the repo: `git clone https://github.com/austinoboyle/scrape-linkedin-selenium.git`

Then run `python setup.py install`

### Tests

Tests are (so far) only run on static HTML files: one is a LinkedIn profile, and the other is used to test some utility functions.

## Getting & Setting LI_AT

Because of LinkedIn's anti-scraping measures, you must make your Selenium browser look like an actual user. To do this, you need to add the `li_at` cookie to the Selenium session.

### Getting LI_AT

1. Navigate to www.linkedin.com and log in
2. Open browser developer tools (Ctrl-Shift-I or right click -> inspect element)
3. Select the appropriate tab for your browser (**Application** on Chrome, **Storage** on Firefox)
4. Click the **Cookies** dropdown on the left-hand menu, and select the `www.linkedin.com` option
5. Find and copy the `li_at` **value**

### Setting LI_AT

There are two ways to set your `li_at` cookie:

1. Set the `LI_AT` environment variable
   - `$ export LI_AT=YOUR_LI_AT_VALUE`
   - **On Windows**: `C:/foo/bar> set LI_AT=YOUR_LI_AT_VALUE`
2. Pass the cookie as a parameter to the Scraper object.

   > `>>> with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper:`

A cookie value passed directly to the Scraper **will override your environment variable** if both are set.
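As a minimal sketch, both approaches look like this (`YOUR_LI_AT_VALUE` is a placeholder for the cookie value you copied above):

```python
from scrape_linkedin import ProfileScraper

# Option 1: rely on the LI_AT environment variable set beforehand
with ProfileScraper() as scraper:
    profile = scraper.scrape(user='austinoboyle')

# Option 2: pass the cookie explicitly (overrides LI_AT if both are set)
with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper:
    profile = scraper.scrape(user='austinoboyle')
```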
## Examples

See [`/examples`](https://github.com/austinoboyle/scrape-linkedin-selenium/tree/master/examples)

## Usage

### Command Line

scrape_linkedin comes with a command line argument module `scrapeli` created using [click](http://click.pocoo.org/5/).

**Note: CLI only works with Personal Profiles as of now.**

Options:

- `--url`: full URL of the profile you want to scrape
- `--user`: www.linkedin.com/in/USER
- `--driver`: choose the browser type to use (Chrome/Firefox), **default: Chrome**
- `-a`, `--attribute`: return only a specific attribute (default: return all attributes)
- `-i`, `--input_file`: raw path to an HTML file of the profile you want to scrape
- `-o`, `--output_file`: raw path to an output file for the structured JSON profile (just prints results by default)
- `-h`, `--help`: show this screen.

Examples:

- Get Austin O'Boyle's profile info: `$ scrapeli --user=austinoboyle`
- Get only the skills of Austin O'Boyle: `$ scrapeli --user=austinoboyle -a skills`
- Parse a stored HTML profile and save JSON output: `$ scrapeli -i /path/file.html -o output.json`

### Python Package

#### Profiles

Use the `ProfileScraper` component to scrape profiles.

```python
from scrape_linkedin import ProfileScraper

with ProfileScraper() as scraper:
    profile = scraper.scrape(user='austinoboyle')
print(profile.to_dict())
```

`Profile` is the class whose properties provide access to all information pulled from a profile. It also has a `to_dict()` method that returns all of the data as a dict.

```python
from scrape_linkedin import Profile

with open('profile.html', 'r') as profile_file:
    profile = Profile(profile_file.read())

print(profile.skills)       # [{...}, {...}, ...]
print(profile.experiences)  # {jobs: [...], volunteering: [...], ...}
print(profile.to_dict())    # {personal_info: {...}, experiences: {...}, ...}
```

**Structure of the fields scraped:**

- personal_info
  - name
  - company
  - school
  - headline
  - followers
  - summary
  - websites
  - email
  - phone
  - connected
  - image
- skills
- experiences
  - volunteering
  - jobs
  - education
- interests
- accomplishments
  - publications
  - certifications
  - patents
  - courses
  - projects
  - honors
  - test scores
  - languages
  - organizations

#### Companies

Use the `CompanyScraper` component to scrape companies.

```python
from scrape_linkedin import CompanyScraper

with CompanyScraper() as scraper:
    company = scraper.scrape(company='facebook')
print(company.to_dict())
```

`Company` is the class whose properties provide access to all information pulled from a company profile. There are three properties: `overview`, `jobs`, and `life`. **Overview is the only one currently implemented.**

```python
from scrape_linkedin import Company

with open('overview.html', 'r') as overview, open('jobs.html', 'r') as jobs, open('life.html', 'r') as life:
    company = Company(overview.read(), jobs.read(), life.read())

print(company.overview)  # {...}
```

**Structure of the fields scraped:**

- overview
  - name
  - company_size
  - specialties
  - headquarters
  - founded
  - website
  - description
  - industry
  - num_employees
  - type
  - image
- jobs: **NOT YET IMPLEMENTED**
- life: **NOT YET IMPLEMENTED**

#### config

Pass these keyword arguments into the constructor of your Scraper to override the default values. You may (for example) want to decrease/increase the timeout if your internet is very fast/slow.

- _cookie_ **`{str}`**: `li_at` cookie value (overrides the environment variable) - **default: `None`**
- _driver_ **`{selenium.webdriver}`**: driver type to use - **default: `selenium.webdriver.Chrome`**
- _driver_options_ **`{dict}`**: kwargs to pass to the driver constructor - **default: `{}`**
- _scroll_pause_ **`{float}`**: time (s) to pause during scroll increments - **default: `0.1`**
- _scroll_increment_ **`{int}`**: number of pixels to scroll down each time - **default: `300`**
- _timeout_ **`{float}`**: default time to wait for async content to load - **default: `10`**
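For example, here is a sketch of a scraper tuned for a slow connection (the specific values are illustrative, not recommendations):

```python
from selenium import webdriver

from scrape_linkedin import ProfileScraper

# Illustrative settings: use Firefox instead of the default Chrome,
# wait longer for async content to load, and scroll more gently.
with ProfileScraper(
    driver=webdriver.Firefox,
    timeout=30,
    scroll_pause=0.5,
    scroll_increment=150,
) as scraper:
    profile = scraper.scrape(user='austinoboyle')
```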
## Scraping in Parallel

New in version 0.2: built-in parallel scraping functionality. Note that the up-front cost of starting a browser session is high, so for this to be beneficial you will want to be scraping many (> 15) profiles.

### Example

```python
from scrape_linkedin import scrape_in_parallel, CompanyScraper

companies = ['facebook', 'google', 'amazon', 'microsoft', ...]

# Scrape all companies, output to 'companies.json' file, use 4 browser instances
scrape_in_parallel(
    scraper_type=CompanyScraper,
    items=companies,
    output_file="companies.json",
    num_instances=4
)
```

### Configuration

**Parameters:**

- _scraper_type_ **`{scrape_linkedin.Scraper}`**: Scraper to use
- _items_ **`{list}`**: list of items to be scraped
- _output_file_ **`{str}`**: path to the output file
- _num_instances_ **`{int}`**: number of parallel Selenium instances to run
- _temp_dir_ **`{str}`**: name of the temporary directory used to store data from intermediate steps - **default: `'tmp_data'`**
- _driver_ **`{selenium.webdriver}`**: driver to use for scraping - **default: `selenium.webdriver.Chrome`**
- _driver_options_ **`{dict}`**: dict of keyword arguments to pass to the driver function - **default: `scrape_linkedin.utils.HEADLESS_OPTIONS`**
- _\*\*kwargs_ **`{any}`**: extra keyword arguments to pass to the `scraper_type` constructor for each job

## Issues

Report bugs and feature requests [here](https://github.com/austinoboyle/scrape-linkedin-selenium/issues).