Web Client Programming in Python
Web client programming is a powerful technique for querying the Web. A Web client is any program that retrieves data from a Web server using the Hypertext Transfer Protocol (HTTP, the http in your URLs). A Web browser is a client, and so are crawlers, the programs that automatically traverse the Web to gather information. You can also use Web clients to take advantage of services offered by others on the Web and to add dynamic features to your own Web site.
Web client programming belongs in every developer's toolkit. Perl devotees have used it for years. In Python, the process reaches even higher levels of convenience and flexibility. Most of the functionality you need is provided by three modules: httplib, urllib, and a newer addition, xmlrpclib. In true Pythonic style, each module builds on its predecessor, providing a solid, well-designed foundation for your applications. In this article we will cover the first two modules, leaving xmlrpclib for later.
For our examples we will use Meerkat. If you are like me, you spend time tracking trends and developments in the open source community that could give you a competitive advantage. Meerkat is a tool that makes this task much easier. It is an open wire service that collects and organizes huge volumes of information about open source software. Although its browser interface is flexible and customizable, with Web client programming we can scan, extract, and even save this information for later use offline. First we will access Meerkat interactively with httplib, and then we will move on to working with Meerkat's Open API through urllib to build a customizable way of gathering information.
HTTPLIB
httplib is a simple wrapper around the socket module. Of the three libraries mentioned, httplib gives you the most control over how you access a Web site. That control comes, however, at the cost of more work to accomplish your task. HTTP is a stateless protocol, so it remembers nothing about your previous requests. For each request to a Web site you must construct a new httplib object. These requests form a dialogue with the Web server, imitating a Web browser. Let's connect interactively to Meerkat using Rael Dornfest's Open API and see what happens. The dialogue begins with a series of statements that first specify the action you want to take and then identify you to the Web server:
>>> import httplib
>>> host = 'www.oreillynet.com'
>>> h = httplib.HTTP(host)
>>> h.putrequest('GET', '/meerkat/?_fl=minimal')
>>> h.putheader('Host', host)
>>> h.putheader('User-agent', 'python-httplib')
>>> h.endheaders()
>>>
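What these calls put on the wire is plain text. The following is a minimal sketch of that text in present-day Python 3, not httplib's own code. The helper build_request is hypothetical, and the HTTP/1.0 version string is an assumption.

```python
def build_request(method, path, headers, version="HTTP/1.0"):
    """Return the text of a simple HTTP request (hypothetical helper).

    `version` is an assumed default; a real client library chooses it for you.
    """
    lines = ["%s %s %s" % (method, path, version)]  # the request line
    lines.extend("%s: %s" % (name, value) for name, value in headers)
    # Every line ends with CRLF; an empty line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


host = "www.oreillynet.com"
request_text = build_request(
    "GET",
    "/meerkat/?_fl=minimal",
    [("Host", host), ("User-agent", "python-httplib")],
)
print(request_text)
```

The request line names the action and the page, and each header line that follows identifies something about the client or the site it wants.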
The GET request tells the server which page you want to receive. The Host header tells it the name of the domain you are requesting. Modern servers that use HTTP 1.1 can host several domains at the same address. If you do not tell them which domain you want, you will get back a 302 redirect code. The User-agent header tells the server what type of client you are, so it knows what it can and cannot send you. This is all the information the Web server needs to process your request. Next, you ask for the response:
>>> returncode, returnmsg, headers = h.getreply()
>>> if returncode == 200:  # OK
...     f = h.getfile()
...     print f.read()
...
This prints the current Meerkat page in its minimal form. The response headers and the content are returned separately, which helps both with diagnosing and fixing problems and with parsing the data. If you want to see the response headers, use print headers.
httplib hides the mechanics of socket programming, and its use of a file object for buffering lets you apply a familiar approach to handling the data. Nevertheless, it is best suited as a foundation for building more powerful Web client applications or for interactive dialogue with a problem Web site. For both of these uses, httplib comes with a helpful debugging facility. You enable it by calling h.set_debuglevel(1) at any point after the object has been initialized (the line h = httplib.HTTP(host) in our example). At debug level 1, the module echoes to the screen the requests and the results of any calls to getreply().
Python's interactive nature makes analyzing Web sites with httplib fun. Get comfortable with this module and you will have a powerful and flexible tool for diagnosing Web site problems. Also, take the time to read through the httplib source code. At fewer than 200 lines of code, it is a quick and simple introduction to socket programming in Python.
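In Python 3, httplib lives on as http.client, and the older HTTP class with getreply() and getfile() is no longer available. As a rough sketch only, the same dialogue might look like the code below. To stay runnable without network access it talks to a throwaway local server. DemoHandler, fetch, and demo are hypothetical names introduced for illustration, and the page content is made up.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real site: answers every GET with one page."""

    def do_GET(self):
        body = b"<html><body>minimal page</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(host, port, path, debug=0):
    """Send one GET request; return (status, reason, headers, body)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    conn.set_debuglevel(debug)  # 1 echoes the exchange to the screen
    try:
        conn.putrequest("GET", path)  # also adds the Host header for us
        conn.putheader("User-agent", "python-http.client")
        conn.endheaders()
        resp = conn.getresponse()
        # Status, headers, and content are available separately.
        return resp.status, resp.reason, resp.getheaders(), resp.read()
    finally:
        conn.close()


def demo():
    """Start the local server on a free port, fetch one page, shut down."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    worker = threading.Thread(target=server.serve_forever, daemon=True)
    worker.start()
    try:
        return fetch("127.0.0.1", server.server_port, "/meerkat/?_fl=minimal")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, reason, headers, body = demo()
    print(status, reason)
    print(headers)
    print(body.decode())
```

Passing debug=1 to fetch calls set_debuglevel(1), which makes http.client print the request and the response headers as they are exchanged.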
URLLIB
urllib provides a refined interface to the functionality of httplib. It is best used for getting at the data directly rather than for analyzing a Web site. Here is the same interaction as above, this time using urllib:
>>> import urllib
>>> u = urllib.urlopen('http://www.oreillynet.com/meerkat/?_fl=minimal')
That is all there is to it! In one line you have contacted Meerkat, retrieved the data, and placed it in a temporary cache. To access the headers:
>>> print u.headers
And to view the entire file:
>>> print u.read()
But that is not all. In addition to HTTP, urllib can access FTP, Gopher, and even local files in the same way. The module also contains a number of helper functions, including ones for parsing URLs, encoding strings into a URL-safe format, and providing progress indication during the transfer of large amounts of data.
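As a hedged illustration of some of those extras in Python 3, where urlopen has moved to urllib.request and the parsing and quoting helpers to urllib.parse, the sketch below reads a local file through a file: URL, takes a URL apart, and builds a query string. The temporary file, its contents, and the query parameter named s are assumptions made up for the example. Only _fl=minimal comes from the text above.

```python
import os
import pathlib
import tempfile
from urllib.parse import parse_qs, quote, urlencode, urlsplit
from urllib.request import urlopen

# 1. A local file is opened the same way as a Web page, through a file: URL.
fd, name = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("hello from a local file\n")
try:
    with urlopen(pathlib.Path(name).as_uri()) as u:
        local_length = u.headers["Content-length"]  # headers, as with HTTP
        local_data = u.read()                       # content, as bytes
finally:
    os.remove(name)

# 2. Taking a URL apart.
parts = urlsplit("http://www.oreillynet.com/meerkat/?_fl=minimal")
query = parse_qs(parts.query)  # {'_fl': ['minimal']}

# 3. Making a string URL-safe and building a query string.
safe = quote("open source & Linux")  # 'open%20source%20%26%20Linux'
qs = urlencode({"_fl": "minimal", "s": "open source"})  # "s" is made up

print(parts.netloc, parts.path, query)
print(safe)
print(qs)
```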
An Example Using Meerkat
Imagine you have a group of clients who expect to be kept informed by email about the latest Linux news. We can write a short script that uses urllib to get this information from Meerkat, build a list of links, and save those links to a file for later delivery. Meerkat's author, Rael Dornfest, has already done most of the work for us in the Meerkat API. All that remains for us is to construct the request, parse out the links, and save the results for later delivery.
Why do all this instead of simply pointing users at Meerkat? Providing this kind of "passive" service lets users browse the information at their leisure and selectively save it in a familiar format (for example, as email). With the news arriving in their mailbox on Monday morning, they will not miss anything that happened over the weekend.
Because Meerkat's minimal flavor is limited to 15 stories, we will run the script every hour (for example, as a cron job under Unix or using the AT command under NT) to reduce the chance of missing data. The script requests a single Meerkat URL on each run.
That URL retrieves all Linux stories (profile=5) from the last hour and presents the data in the minimal flavor, without descriptions or category, channel, and date information. We also use the regular expression module to extract the link information and redirect the output to a file object opened in append mode.
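The parsing and saving steps described above could be sketched as follows in Python 3. This is an illustration under stated assumptions, not the author's script: the sample HTML, the regular expression, and the names LINK_RE, extract_links, and append_links are all hypothetical, and Meerkat's real markup may differ. Fetching is left out so that the example runs offline.

```python
import os
import re
import tempfile

# Assumed link markup: <a href="...">title</a>, possibly with other attributes.
LINK_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)


def extract_links(html):
    """Return (url, title) pairs for every link found in the page text."""
    # " ".join(title.split()) collapses line breaks and runs of spaces.
    return [(url, " ".join(title.split()))
            for url, title in LINK_RE.findall(html)]


def append_links(path, links):
    """Append the links to a file so each run adds to earlier results."""
    with open(path, "a", encoding="utf-8") as out:
        for url, title in links:
            out.write("%s\n%s\n\n" % (title, url))


# Made-up page content standing in for what the service would return.
sample_html = """
<ul>
  <li><a href="http://example.com/kernel">New kernel
      released</a></li>
  <li><a class="story" href="http://example.com/distro">Distro update</a></li>
</ul>
"""

links = extract_links(sample_html)
outfile = os.path.join(tempfile.mkdtemp(), "linux_links.txt")
append_links(outfile, links)
append_links(outfile, links)  # a second run appends rather than overwrites
```

Opening the file in append mode is what lets an hourly job accumulate links between deliveries instead of replacing the previous hour's results.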
Conclusion
We have only scratched the surface of these modules, and there are many other network programming modules available for Python that can be used for Web client tasks. Web client programming is especially useful when processing large volumes of tabular data. By using Web client programming in a recent Electronic Data Interchange (EDI) project, we avoided having to use a bulky proprietary software package. We pulled the updated pricing information we needed directly from the Web and loaded it into our database. It saved us a great deal of time and frustration.
Web client programming can also be useful for testing the structure and integrity of Web sites. The most common procedure is checking for dead links. The standard Python distribution includes a complete example of such a checker based on urllib. Webchecker, together with its Tk-based front end, can be found in the tools subdirectory of the distribution. Another Python tool, Linbot, is even more polished. It provides everything you need to resolve problems with a Web site. As sites become ever more complex, other Web client applications will become increasingly necessary for maintaining the quality of your Web site.
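The core of a dead-link check can be sketched in a few lines. This is not how Webchecker or Linbot is implemented; is_reachable and find_dead_links are hypothetical helpers written for Python 3, where urlopen lives in urllib.request. The usage below relies on urllib's ability to open local files, so it runs without network access.

```python
import pathlib
import tempfile
from urllib.error import URLError
from urllib.request import urlopen


def is_reachable(url, timeout=10):
    """Return True if the URL can be opened, False otherwise."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except (URLError, ValueError, OSError):
        # URLError also covers HTTPError (for example a 404 response);
        # ValueError is raised for URLs urllib cannot interpret.
        return False


def find_dead_links(urls):
    """Return the URLs from the list that could not be opened."""
    return [url for url in urls if not is_reachable(url)]


# Made-up local files stand in for pages on a real site.
workdir = pathlib.Path(tempfile.mkdtemp())
good = workdir / "index.html"
good.write_text("<html></html>")
missing = workdir / "gone.html"  # never created

dead = find_dead_links([good.as_uri(), missing.as_uri()])
print(dead)
```

A real checker would also extract the links from each page it opens and follow them, which is where most of the work lies.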
Web client programming has a pitfall. Your programs are often sensitive to small changes in page formatting. The way a site presents its data today may differ from the way it presents it tomorrow. When the format changes, your programs have to change too. This is one reason for the popularity of XML: for Web data marked up with tags that reflect its meaning, formatting matters less. As XML standards evolve and become widely adopted, processing XML data will become even easier and more reliable.
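To illustrate the point, here is a small sketch in Python 3 with two made-up XML layouts of the same data. The element names story, title, and link are assumptions chosen for the example, not Meerkat's actual format.

```python
import xml.etree.ElementTree as ET


def story_links(xml_text):
    """Return (title, link) pairs, relying on tag names rather than layout."""
    root = ET.fromstring(xml_text)
    return [
        (story.findtext("title", default="").strip(),
         story.findtext("link", default="").strip())
        for story in root.iter("story")
    ]


compact = ("<stories><story><title>New kernel</title>"
           "<link>http://example.com/kernel</link></story></stories>")

# Same data, different formatting: indentation, element order, an extra element.
reformatted = """
<stories>
  <story>
    <link>
      http://example.com/kernel
    </link>
    <date>today</date>
    <title>New kernel</title>
  </story>
</stories>
"""

print(story_links(compact))
print(story_links(reformatted))
```

Both calls yield the same pairs, because the code asks for elements by name instead of matching the text around them.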
The tools we have covered here also have some limitations. Although they have proven themselves superbly for client-side tasks, the httplib and urllib modules cannot be used to build a production HTTP server, because they process requests one at a time. To provide asynchronous processing, Sam Rushing created an impressive set of tools that includes asyncore.py, which ships with the standard Python distribution. The most powerful example of this approach is Zope, an application server that includes a fast HTTP server built on Sam Rushing's Medusa engine.
In the next article I will show you how to combine XML and Web client programming using xmlrpclib. You can use XML to get even more functionality out of the Meerkat API.
