Download:
webstemmer-dist-0.7.1.tar.gz
(Python 2.4 or newer is required)

What's it?
Webstemmer is a web crawler and HTML layout analyzer that
automatically extracts the main text of a news site without
banners, ads, or navigation links mixed in.
Extracting text from web sites (especially news sites) generally
ends up with lots of unnecessary material: ads and banners. You
could craft regular expression patterns to pick up only the desired
parts, but constructing such patterns is often a tricky and
time-consuming task. Furthermore, some patterns need to be aware of
the surrounding context, and some news sites even use several
different layouts.
Webstemmer analyzes the layout of each page of a web site and
figures out where the main text is located. The analysis is done
fully automatically, with little human intervention: you only need
to give the URL of the top page. For more details, see the
How It Works? page.
The algorithm works for most well-known news sites. The following
table shows the average number of successfully extracted pages
among all the pages obtained from each site per day. For most of
the sites, about 90% of the pages were extracted correctly:
News Site | Avg. Extracted/Obtained |
---|---|
New York Times | 488.8/552.2 (88%) |
Newsday | 373.7/454.7 (82%) |
Washington Post | 342.6/367.3 (93%) |
Boston Globe | 332.9/354.9 (93%) |
ABC News | 299.7/344.4 (87%) |
BBC | 283.3/337.4 (84%) |
Los Angeles Times | 263.2/345.5 (76%) |
Reuters | 188.2/206.9 (91%) |
CBS News | 171.8/190.1 (90%) |
Seattle Times | 164.4/185.4 (89%) |
NY Daily News | 144.3/147.4 (98%) |
International Herald Tribune | 125.5/126.5 (99%) |
Channel News Asia | 119.5/126.2 (94%) |
CNN | 65.3/73.9 (89%) |
Voice of America | 58.3/62.6 (94%) |
Independent | 58.1/58.5 (99%) |
Financial Times | 55.7/56.6 (98%) |
USA Today | 44.5/46.7 (96%) |
NY1 | 35.7/37.1 (95%) |
1010 Wins | 14.3/16.1 (88%) |
Total | 3829.1/4349.2 (88%) |
Text extraction with Webstemmer has the following steps:
1. Crawl the target news site to obtain a set of seed pages (textcrawler.py).
2. Learn the layout patterns from the seed pages (analyze.py).
3. Crawl the same site again to obtain new pages (textcrawler.py).
4. Extract the texts from the new pages using the learned patterns (extract.py).
Steps 1 and 2 are only required the first time. Once you have learned
the layout patterns, you can use them to extract the text of newly
obtained pages from the same website by repeating steps 3 and 4,
until the site's layout is drastically changed.
The Webstemmer package includes the following programs:
textcrawler.py (web crawler)
analyze.py (layout analyzer)
extract.py (text extractor)
urldbutils.py (URLDB utility)
html2txt.py (simpler text extractor)
In previous versions (0.3 and earlier), all these programs (web crawler, layout analyzer, and text extractor) were combined into one command. They are separate programs in Webstemmer version 0.5 and newer.
To learn layout patterns, you first need to run the web crawler to obtain a set of seed pages. The crawler recursively follows the links in each page until it reaches a certain depth (the default is 1, i.e. the crawler follows each link from the top page only once) and stores the pages in a .zip file.
(Crawl CNN from its top page.)
$ ./textcrawler.py -o cnn http://www.cnn.com/
Writing: 'cnn.200511210103.zip'
Making connection: 'www.cnn.com'...
...
The obtained .zip file contains a list of HTML files. Each file name in the archive includes the timestamp at which the crawling was performed. You can use the .zip file as a seed for learning layout patterns (step 2) or for extracting texts from newly obtained pages (step 4).
(View the list of obtained pages.)
$ zipinfo cnn.200511210103.zip
Archive: cnn.200511210103.zip 699786 bytes 75 files
-rwx--- 2.0 fat 59740 b- defN 21-Nov-05 01:03 200511210103/www.cnn.com/
-rw---- 2.0 fat 32060 b- defN 21-Nov-05 01:03 200511210103/www.cnn.com/privacy.html
-rw---- 2.0 fat 41039 b- defN 21-Nov-05 01:03 200511210103/www.cnn.com/interactive_legal.html
-rw---- 2.0 fat 33760 b- defN 21-Nov-05 01:03 200511210103/www.cnn.com/INDEX/about.us/
...
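Since the crawler's output is just a zip archive of timestamped HTML files, you can also inspect it from a small script. Here is a minimal sketch using Python's standard zipfile module (not part of Webstemmer; the archive name comes from the example above, and the "linkinfo" entry mentioned under the -L option is skipped):

import zipfile

# List the pages stored by textcrawler.py. Each entry is named
# timestamp/URL, e.g. 200511210103/www.cnn.com/privacy.html.
zf = zipfile.ZipFile('cnn.200511210103.zip')
for name in zf.namelist():
    if name.endswith('linkinfo'):
        continue   # anchor-text metadata (see the -L option), not a page
    html = zf.read(name)
    print('%s: %d bytes' % (name, len(html)))
zf.close()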
Then you can learn the layout patterns from the obtained pages with
analyze.py. The program takes one or more zip files as input and
writes the learned layout patterns to standard output.
(Learn the layout patterns from the obtained pages and save them as cnn.pat.)
$ ./analyze.py cnn.200511210103.zip > cnn.pat
Opening: 'cnn.200511210103.zip'...
Added: 1: 200511210103/www.cnn.com/
Added: 2: 200511210103/www.cnn.com/privacy.html
Added: 3: 200511210103/www.cnn.com/interactive_legal.html
Added: 4: 200511210103/www.cnn.com/INDEX/about.us/
...
Fixating....................................................
It takes O(n^2) time to learn layout patterns: if learning from 100 pages takes a couple of minutes, learning from about 1,000 pages takes a couple of hours.
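The quadratic cost is what you would expect when every page's layout has to be compared with every other page's. The following toy loop only illustrates that scaling; it is not the clustering algorithm analyze.py actually uses, and the "layouts" here are made-up feature sets:

import itertools

def similarity(a, b):
    # Toy measure: overlap between two sets of layout features.
    return float(len(a & b)) / len(a | b)

layouts = {'pageA': set(['title', 'nav', 'body']),
           'pageB': set(['title', 'body', 'footer']),
           'pageC': set(['nav', 'ads'])}
# n pages lead to n*(n-1)/2 comparisons, hence the O(n^2) behaviour.
for (na, a), (nb, b) in itertools.combinations(sorted(layouts.items()), 2):
    print('%s vs %s: %.2f' % (na, nb, similarity(a, b)))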
The obtained layout patterns are represented in plain-text format. For more details, see Anatomy of pattern files.
Some time later, suppose you obtained a set of new pages from the same website.
(Crawl again from CNN top page.)
$ ./textcrawler.py -o cnn http://www.cnn.com/
Writing: 'cnn.200603010455.zip'
Making connection: 'www.cnn.com'...
...
(View the obtained html pages.)
$ zipinfo cnn.200603010455.zip
Archive: cnn.200603010455.zip 850656 bytes 85 files
-rwx--- 2.0 fat 66507 b- defN 1-Mar-06 04:55 200603010455/www.cnn.com/
-rw---- 2.0 fat 33759 b- defN 1-Mar-06 04:55 200603010455/www.cnn.com/privacy.html
-rw---- 2.0 fat 42738 b- defN 1-Mar-06 04:55 200603010455/www.cnn.com/interactive_legal.html
-rw---- 2.0 fat 85 b- defN 1-Mar-06 04:55 200603010455/www.cnn.com/INDEX/about.us/
...
Now you can extract the main texts from the newly obtained pages by using the learned pattern file cnn.pat.
$ ./extract.py cnn.pat cnn.200603010455.zip > cnn.txt
Opening: 'cnn.200603010455.zip'...
Extracted texts are saved as cnn.txt.
Each article is delimited with an empty line:
$ cat cnn.txt
!UNMATCHED: 200511210103/www.cnn.com/ (unmatched page)
!UNMATCHED: 200511210103/www.cnn.com/privacy.html (unmatched page)
!UNMATCHED: 200511210103/www.cnn.com/interactive_legal.html (unmatched page)
...
!MATCHED: 200603010455/www.cnn.com/2006/HEALTH/02/09/billy.interview/index.html (matched page)
PATTERN: 200511210103/www.cnn.com/2005/POLITICS/11/20/bush.murtha/index.html (layout pattern name)
SUB-0: CNN.com - Too busy to cook? Not so fast - Feb 9, 2006 (supplementary section)
TITLE: Too busy to cook? Not so fast (article title)
SUB-10: Leading chef shares his secrets for speedy, healthy cooking (supplementary section)
SUB-17: Corporate Governance (supplementary section)
SUB-17: Lifestyle (House and Home)
SUB-17: New You Resolution
SUB-17: Billy Strynkowski
MAIN-20: (CNN) -- A busy life can put the squeeze on healthy eating. But that (main text)
doesn't have to be the case, according to Billy Strynkowski, executive
chef of Cooking Light magazine. He says cooking healthy, tasty meals
at home can be done in 20 minutes or less.
MAIN-20: CNN's Jason White interviewed Chef Billy to learn his secrets for
healthy cooking on the run.
...
SUB-25: Health care difficulties in the Big Easy (supplementary section)
!MATCHED: 200603010455/www.cnn.com/2006/EDUCATION/02/28/teaching.evolution.ap/index.html (another matched page)
PATTERN: 200511210103/www.cnn.com/2005/POLITICS/11/20/bush.murtha/index.html (layout pattern name)
SUB-0: CNN.com - Evolution debate continues - Feb 28, 2006 (supplementary section)
TITLE: Evolution debate continues (article title)
SUB-17: Schools (supplementary section)
SUB-17: Education
MAIN-20: SALT LAKE CITY (AP) -- House lawmakers scuttled a bill that would have (main text)
required public school students to be told that evolution is not
empirically proven -- the latest setback for critics of evolution.
...
Each article begins with a header line that has the form
"!MATCHED pageID" or "!UNMATCHED pageID", which indicates whether
the page's layout was identified or not. pageID is the name of the
page included in the zip archive. When a page's layout is
identified, the header is followed by a "PATTERN:" line that shows
the name of the layout pattern which matched the page, and by one
or more text lines.
Each text line begins with either "TITLE:", "MAIN-n:", or "SUB-n:",
which marks the article title section, a main text section, or
another supplementary section, respectively. A line which begins
with "SUB-n:" is a supplementary section: it is identified as
neither the article title nor the main text, but is still
considered meaningful text. The section ID n differs depending on
the layout pattern. Each paragraph in a text section appears on a
separate line.
Each text line begins with a capitalized header and is separated by
exactly one newline character. (In the above example, extra
newlines are inserted for readability's sake.) Therefore you can
easily get the desired parts with simple text processing tools such
as perl or grep.
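Because the output is line-oriented, no HTML parsing is needed downstream. The following sketch (not part of the package; it assumes the cnn.txt file from the example above) prints the title and main text lines of every matched article:

import re

def read_articles(path):
    # Articles in the extractor output are separated by empty lines.
    text = open(path).read()
    return [block.splitlines()
            for block in re.split(r'\n\s*\n', text) if block.strip()]

for article in read_articles('cnn.txt'):
    if not article[0].startswith('!MATCHED'):
        continue   # layout not identified; nothing was extracted
    for line in article:
        if line.startswith('TITLE:') or re.match(r'MAIN-\d+:', line):
            print(line)
    print('')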
Installation
Download the tar.gz file. You need Python 2.4 or newer to run this
software. There is no special configuration or installation process
required; just type ./analyze.py or ./extract.py in the command
line.
textcrawler.py (web crawler)
textcrawler.py is a simple web crawler which recursively crawls
within a given site and collects text (HTML) files. The crawler is
suitable for obtaining a middle-scale website (up to 10,000 pages).
textcrawler.py stores the obtained pages in a single .zip file. It
supports Mozilla-style cookie files, persistent HTTP connections,
and gzip compression. To reduce traffic, it gives the user strict
control over its crawling behavior, such as specifying the
recursion depth and/or URL patterns it may (or may not) obtain.
Furthermore, it supports a persistent URL database (URLDB) which
maintains the URLs that have been visited so far and avoids
crawling the same URL repeatedly. Since most news sites have a
unique URL for each distinct article, this greatly helps reduce the
network traffic.
textcrawler.py tries to use HTTP persistent connections and gzip
compression as much as possible. It also tries to obey the site's
robots.txt file. Each HTTP connection is made only to the IP
address which was given to the program first, i.e. it does not
support crawling across different hosts; all links that refer to
other hosts are ignored.
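The "same host only" rule can be pictured as a simple filter applied to every link found on a page. This is only a schematic check, not the crawler's actual code:

try:
    from urlparse import urlparse       # Python 2
except ImportError:
    from urllib.parse import urlparse   # Python 3

def same_host(start_url, link_url):
    # Links whose host differs from the start URL's host are ignored.
    return urlparse(start_url).netloc == urlparse(link_url).netloc

print(same_host('http://www.cnn.com/', 'http://www.cnn.com/WORLD/'))    # True
print(same_host('http://www.cnn.com/', 'http://edition.cnn.com/'))      # False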
textcrawler.py supports the -U option (which specifies a URLDB
filename). When a URLDB is specified, textcrawler.py preserves the
MD5 hash value and the last-visited time of each URL in a
persistent file. Currently, Berkeley DBM (bsddb) is used for this
purpose. This can greatly reduce crawling time and disk space,
since the crawler does not store a page if its URL is already
contained in the URLDB. A URLDB file can become inflated as the
number of URLs it contains grows; use the urldbutils.py command to
reorganize an inflated URLDB file.
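The bookkeeping behind the URLDB is easy to sketch. The real database is a Berkeley DBM (bsddb) file so that it survives between runs; in the sketch below a plain dict stands in for it, and the function is only illustrative, not taken from textcrawler.py:

import hashlib, time

urldb = {}   # stand-in for the persistent file: md5(url) -> last-visited time

def seen_before(url):
    key = hashlib.md5(url.encode('utf-8')).hexdigest()
    visited = key in urldb
    urldb[key] = int(time.time())   # record/update the last-visited time
    return visited

for url in ['http://www.cnn.com/', 'http://www.cnn.com/privacy.html',
            'http://www.cnn.com/']:
    if seen_before(url):
        print('skip (already visited): ' + url)
    else:
        print('fetch: ' + url)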
textcrawler.py follows a set of regular expression patterns that
define which URLs may (or may not) be crawled. A regexp pattern can
be specified with the -a (accept) or -j (reject) options on the
command line. The crawling permission of a URL is determined by
checking the regexp patterns sequentially in the specified order.
By default, the crawler accepts all URLs that include the start URL
as a substring.
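Read as code, one plausible reading of the rule above is "the first matching pattern decides; otherwise fall back to the default". The sketch below only illustrates that reading and is not textcrawler.py's actual implementation (the sample patterns are taken from the examples and defaults described on this page):

import re

# Rules in command-line order: -a adds (pattern, True), -j adds (pattern, False).
rules = [(re.compile(r'^http://www\.boston\.com/news/'), True),
         (re.compile(r'\.(jpg|gif|png|zip)$'), False)]

def is_allowed(url, rules, start_url):
    for pattern, allow in rules:
        if pattern.search(url):
            return allow
    # Default rule: accept URLs that contain the start URL as a substring.
    return start_url in url

start = 'http://www.boston.com/news/globe/'
print(is_allowed('http://www.boston.com/news/world/story.html', rules, start))  # True
print(is_allowed('http://www.boston.com/logo.gif', rules, start))               # False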
Syntax
$ textcrawler.py -o output_filename [options] start_url ...
You need to specify an output filename. A timestamp (YYYYMMDDHHMM)
and the extension .zip are automatically appended to this name.
Examples:
(Start from http://www.asahi.com/ with a maximum recursion depth of 2,
and store the files into asahi.*.zip. Assume euc-jp as the default charset.)
$ textcrawler.py -o asahi -m2 -c euc-jp http://www.asahi.com/
(Start from http://www.boston.com/news/globe/, but pages in the
upper directory "http://www.boston.com/news/" are also allowed.
Use the URLDB file boston.urldb.)
$ textcrawler.py -o boston -U boston.urldb -a'^http://www\.boston\.com/news/' http://www.boston.com/news/globe/
Options
-o output_filename
Specifies the output filename (required). A timestamp string (which
can be changed with the -b option) is appended after the specified
filename.
-m maximum_depth_of_recursive_crawling
Specifies the maximum depth of recursive crawling. The default is 1.
-k cookie_filename
Specifies a Mozilla-style cookie file. When cookies are given,
textcrawler.py automatically uses them when necessary. Note that
textcrawler.py does not store any cookie it obtains during
crawling.
-c default_character_set
Specifies the default character set of the crawled pages.
textcrawler.py tries to follow the HTML charset declared (by a
<meta> tag) in a page header. If there is no charset declaration,
the default value given here (such as "euc-jp" or "utf-8") is used.
textcrawler.py does not detect the character set automatically.
-a accept_url_pattern
Specifies a regular expression pattern of URLs to accept (crawl).
When combined with the -j option, the patterns are checked in the
specified order.
-j reject_url_pattern
Specifies a regular expression pattern of URLs to reject. When
combined with the -a option, the patterns are checked in the
specified order. By default, all URLs which end with
jpg, jpeg, gif, png, tiff, swf, mov, wmv, wma, ram, rm, rpm, gz,
zip, or class are rejected.
-U URLDB_filename
Specifies a URLDB filename. When this option is used,
textcrawler.py records every URL that has been visited in a
persistent URL database (URLDB) and does not crawl that URL again.
A URLDB file contains the md5 hashes of URLs (as keys) and the
last-visited times (as values). When the crawler finds a new URL,
it checks the database and filters out those which have already
been visited. However, when the crawler hasn't yet reached its
maximum recursion depth, the intermediate pages are still crawled
to obtain links. When you crawl the same site again and again, and
you can assume an interesting page always has a unique URL, this
reduces the crawling time.
-b timestamp_string
Specifies the timestamp string that is appended to the output
filename given by the -o option. It is also prepended to the name
of each page in the zip file, like
"200510291951/www.example.com/...". By default, the string is
automatically determined from the current time when the program is
started, in the form YYYYMMDDHHMM.
-i index_html_filename
When a URL ends with a "/" character, the filename specified here
is appended to that URL. The default value is an empty string
(nothing is added). Note that on some sites the URLs
"http://host/dir/" and "http://host/dir/index.html" are
distinguished (especially when they're using Apache's mod_dir
module).
-D delay_secs
-T timeout_secs
-L linkinfo_filename
textcrawler.py stores all the anchor texts (the text surrounded by
an <a> tag) into the zip file as a file called "linkinfo". Later
this information is used by analyze.py to locate the page titles.
This option changes the linkinfo filename. When this option is set
to an empty string, the crawler doesn't store any anchor text.
-d
analyze.py (layout analyzer)
analyze.py performs layout clustering on the HTML files
textcrawler.py has obtained, and writes the learned pattern file to
standard output. It might take more than several hours depending on
the number of pages to analyze; for example, my machine (with a
2GHz Xeon) took 30 minutes to learn the layouts of 300+ pages.
(For some reason, Psyco, a Python optimizer, doesn't accelerate
this sort of program. It simply took a huge amount of memory but
didn't make the program run faster.)
Each layout pattern that analyze.py outputs has a "score", which
shows how likely that page is to be an article. The score is
calculated based on the number of alphanumeric characters in each
section of a page. So normally you can remove non-article pages by
simply using the -S option to filter out low-scored layouts.
(You can even tune those patterns manually.
See Anatomy of pattern files.)
Syntax
$ analyze.py [options] input_file ... > layout_pattern_file
Normally analyze.py takes a zip file that textcrawler.py has
generated. Multiple input files are accepted; this is useful for
using pages obtained from a single site on multiple days.
Examples:
(Learn the layout from cnn.200511171335.zip and cnn.200511210103.zip
and save the patterns to cnn.pat)
$ analyze.py cnn.200511171335.zip cnn.200511210103.zip > cnn.pat
analyze.py can also take a list of filenames (instead of a .zip
file) for HTML files obtained by other HTTP clients such as wget.
In this case, the directory hierarchy must be the same as the one
used by textcrawler.py, i.e. each filename should be in the form
timestamp/URL:
$ find 200511171335/ -type f | ./analyze.py - linkinfo > cnn.pat
Options
Some of these options are very technical; you might need to
understand the algorithm to change them and get the desired effect.
-c default_character_set
-a accept_url_pattern
The same as the -a option of textcrawler.py. When combined with the
-j option, the patterns are checked in the specified order.
-j reject_url_pattern
The same as the -j option of textcrawler.py. When combined with the
-a option, the patterns are checked in the specified order. By
default, analyze.py tries to use all the pages contained in a given
zip file.
-t clustering_threshold
-T title_detection_threshold
-S page_score_threshold
Specifies the minimum score of the layout patterns to output.
Specifying -1 preserves all the layouts obtained.
-L linkinfo_filename
analyze.py tries to use a linkinfo file that was created by
textcrawler.py and stored in the zip file. This file contains one
or more anchor texts (the text surrounded by an <a> tag) referring
to each page, and is used by the analyzer to locate page titles.
This option changes the linkinfo filename it searches for in a zip
file. When this option is set to an empty string, the analyzer
tries to find anchor texts by itself without using a linkinfo file,
which may result in slower running speed.
-m max_samples
-d
extract.py (text extractor)
extract.py receives a layout pattern file and tries to extract the
texts from a set of HTML pages. The program takes a zip file (or,
alternatively, a directory name) and writes the extracted text to
stdout.
Syntax
$ extract.py [options] pattern_filename input_filename ... > output_text
Example:
(Extract the texts from asahi.200510220801.zip using pattern file asahi.pat,
and store them in shift_jis encoding into file asahi.200510220801.txt)
$ extract.py -C shift_jis asahi.pat asahi.200510220801.zip > asahi.200510220801.txt
Options
-C output_text_encoding
Specifies the encoding of the output text. The default is utf-8.
-c default_character_set
-a accept_url_pattern
The same as the -a option of textcrawler.py. When combined with the
-j option, the patterns are checked in the specified order.
-j reject_url_pattern
The same as the -j option of textcrawler.py. When combined with the
-a option, the patterns are checked in the specified order. By
default, extract.py tries to use all the pages contained in a given
zip file.
-t layout_similarity_threshold
extract.py tries to identify the layout of a page by finding the
most similar layout pattern. The page is rejected, and "!UNMATCHED"
is printed, if the highest similarity is still less than this
threshold. Usually you don't need to change this value.
-S
-T diffscore_threshold
When the difference score of a layout block is greater than this
threshold, extract.py recognizes it as a "variable block". The
default value is 0.5.
-M mainscore_threshold
When the score of a block is greater than this threshold,
extract.py recognizes it as a "main block" (main text). The default
value is 50.
-d
urldbutils.py (URLDB utility)
urldbutils.py removes redundant URLs from a URLDB file in order to
shrink it. When textcrawler.py uses a URLDB, it keeps adding every
newly found URL (as an md5 hash) to the database file, which causes
the file size to grow gradually. The database also records the time
each URL was last seen, so a URL that has not been seen for a
certain time can be safely removed.
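Reorganizing boils down to copying only the sufficiently recent entries into a fresh database. A sketch of that pruning rule follows (a plain dict stands in for the DBM file, and the 10-day threshold matches the example below; this is not urldbutils.py's actual code):

import time

THRESHOLD = 10 * 24 * 3600   # ten days, in seconds

def reorganize(old_db, now=None):
    # Keep only URLs whose last-visited time is within the threshold;
    # dropping the rest is what shrinks the file.
    if now is None:
        now = time.time()
    new_db = {}
    for key, last_seen in old_db.items():
        if now - last_seen < THRESHOLD:
            new_db[key] = last_seen
    return new_db

old = {'a' * 32: time.time() - 3600,             # seen an hour ago: kept
       'b' * 32: time.time() - 30 * 24 * 3600}   # seen a month ago: dropped
print(len(reorganize(old)))   # prints 1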
Syntax
$ urldbutils.py {-D | -R} [options] filename [old_filename]
You need to choose either display mode (-D) or reorganize mode
(-R). The display mode is mainly for debugging purposes. When you
rebuild a DBM file, two filenames (new and old) should be
specified. For safety reasons, it does not run when the new file
already exists.
Example:
(Remove URLs which haven't been seen for 10 or more days, and
rebuild a new URLDB file myurldb.new.)
$ urldbutils.py -R -t 10 myurldb.new myurldb
$ mv -i myurldb.new myurldb
mv: overwrite `urldb'? y
Options
-D
Selects display mode.
-R
Selects reorganize mode. URLs that are older than the number of
days given by the -t option (threshold) are removed.
-t days
Specifies the threshold in days. URLs that have not been seen for
this number of days or more are removed.
-v
Verbose output. Only meaningful in -R mode.
html2txt.py (simpler text extractor)
html2txt.py is a much simpler text extractor (or HTML tag ripper)
that does not use any predefined patterns. It just removes all HTML
tags from the input files. It also removes javascript and
stylesheet contents surrounded by <script>...</script> or
<style>...</style> tags.
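The idea is simple enough to sketch with Python's standard HTML parser. This is only an illustration of the behaviour described above, not the actual html2txt.py code:

try:
    from HTMLParser import HTMLParser       # Python 2
except ImportError:
    from html.parser import HTMLParser      # Python 3

class TagRipper(HTMLParser):
    # Collects text outside of tags, skipping <script> and <style> contents.
    def __init__(self):
        HTMLParser.__init__(self)
        self.skip = 0
        self.parts = []
    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

ripper = TagRipper()
ripper.feed('<html><style>p { color: red }</style><p>Hello, <b>world</b>!</p></html>')
print(''.join(ripper.parts))   # prints: Hello, world!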
Syntax
$ html2txt.py [options] input_filename ... > output_text
Example:
$ html2txt.py index.html > index.txt
Options
-C output_text_encoding
Specifies the encoding of the output text. The default is utf-8.
-c default_character_set
Bugs
Changes
Terms and Conditions
(This is the so-called MIT/X License.)

Copyright (c) 2005-2009 Yusuke Shinyama <yusuke at cs dot nyu dot edu>

Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Last Modified: Mon Jun 15 19:41:35 JST 2009