June 28th, 2017

Catalin George Festila: The pyquery python module.

Programing, Python, by admin.

This tutorial is about pyquery python module and python 2.7.13 version.
First I used pip command to install it.

C:\Python27>cd Scripts

C:\Python27\Scripts>pip install pyquery
Collecting pyquery
Downloading pyquery-1.2.17-py2.py3-none-any.whl
Requirement already satisfied: lxml>=2.1 in c:\python27\lib\site-packages (from pyquery)
Requirement already satisfied: cssselect>0.7.9 in c:\python27\lib\site-packages (from pyquery)
Installing collected packages: pyquery
Successfully installed pyquery-1.2.17

I try to install with pip and python 3.4 version but I got errors.
The development team told us about this python module:
pyquery allows you to make jquery queries on xml documents. The API is as much as possible the similar to jquery. pyquery uses lxml for fast xml and html manipulation.
Let’s try a simple example with this python module.
The base of this example is find links by html tag.

from pyquery import PyQuery

seeds = [
'https://twitter.com',
'http://google.com'
]

crawl_frontiers = []

def start_crawler():
crawl_frontiers = crawler_seeds()

print(crawl_frontiers)

def crawler_seeds():
frontiers = []
for index, seed in enumerate(seeds):
frontier = {index: read_links(seed)}
frontiers.append(frontier)

return frontiers

def read_links(seed):
crawler = PyQuery(seed)
return [crawler(tag_a).attr("href") for tag_a in crawler("a")]

start_crawler()

The read_links function take links from seeds array.
To do that, I need to read the links and put in into another array crawl_frontiers.
The frontiers array is used just for crawler process.
Also this simple example allow you to understand better the arrays.
You can read more about this python module here .

Back Top

Leave a Reply