analitics

Pages

Wednesday, June 28, 2017

The pyquery python module.

This tutorial is about pyquery python module and python 2.7.13 version.
First I used pip command to install it.
C:\Python27>cd Scripts

C:\Python27\Scripts>pip install pyquery
Collecting pyquery
  Downloading pyquery-1.2.17-py2.py3-none-any.whl
Requirement already satisfied: lxml>=2.1 in c:\python27\lib\site-packages (from pyquery)
Requirement already satisfied: cssselect>0.7.9 in c:\python27\lib\site-packages (from pyquery)
Installing collected packages: pyquery
Successfully installed pyquery-1.2.17
I try to install with pip and python 3.4 version but I got errors.
The development team tells us about this python module:
pyquery allows you to make jquery queries on xml documents. The API is as much as possible the similar to jquery. pyquery uses lxml for fast xml and html manipulation.
Let's try a simple example of this python module.
The base of this example is found links by HTML tag.
from pyquery import PyQuery
 
seeds = [
    'https://twitter.com',
    'http://google.com'
]
 
crawl_frontiers = []
 
def start_crawler():
    crawl_frontiers = crawler_seeds()
 
    print(crawl_frontiers)
 
def crawler_seeds():
    frontiers = []
    for index, seed in enumerate(seeds):
        frontier = {index: read_links(seed)}
        frontiers.append(frontier)
 
    return frontiers
 
def read_links(seed):
    crawler = PyQuery(seed)
    return [crawler(tag_a).attr("href") for tag_a in crawler("a")]
 
start_crawler()
The read_links function takes links from seeds array.
To do that, I need to read the links and put in into another array crawl_frontiers.
The frontiers array is used just for crawler process.
Also, this simple example allows you to understand better the arrays.
You can read more about this python module here.