
PHP HTML DOM / Ganon & phpQuery & Simple HTML DOM

PHP Simple HTML DOM Parser

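I have used this one for several years, ever since its first beta. It is lightweight and works well; single-file source, 1393 lines.
Project page: http://simplehtmldom.sourceforge.net/
Manual: http://simplehtmldom.sourceforge.net/manual.htm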

  • An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.

PHP Simple HTML DOM Parser usage examples
Reading

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}

Modifying

// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
 
$html->find('div', 1)->class = 'bar';
 
$html->find('div[id=hello]', 0)->innertext = 'foo';
 
echo $html; // Output: <div id="hello">foo</div><div id="world" class="bar">World</div>
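
Reading values back works the same way as setting them. The following is a minimal sketch, assuming simple_html_dom.php has been downloaded and sits next to the script (the path is an assumption); it reads text, an attribute, and the outer markup from the same two-div string, then frees the parser's memory.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
require_once 'simple_html_dom.php';

$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$second = $html->find('div', 1);                      // second <div>; the index is zero-based
$text   = $second->plaintext;                         // text content without tags
$id     = $second->id;                                // attributes are read as properties
$markup = $html->find('div[id=hello]', 0)->outertext; // the element including its own tag

echo $text, "\n";   // World
echo $id, "\n";     // world
echo $markup, "\n"; // <div id="hello">Hello</div>

// Release the DOM when done; useful in long-running scraping loops.
$html->clear();
unset($html);
```

Passing an index as the second argument to find() returns a single element instead of an array, which is what makes the one-line extraction style possible.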

Fixing absolute URLs

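require_once 'Net/URL2.php'; // PEAR Net_URL2, assumed to be on the include path

$uri = new Net_URL2('http://example.com/foo/bar'); // URI of the resource
$baseURI = $uri;
// A <base href> element, if present, replaces the document URI as the base
foreach ($html->find('base[href]') as $elem) {
    $baseURI = $uri->resolve($elem->href);
}

// Rewrite every src attribute as an absolute URL
foreach ($html->find('*[src]') as $elem) {
    $elem->src = $baseURI->resolve($elem->src)->__toString();
}
// Rewrite every href, leaving the <base> element itself untouched
foreach ($html->find('*[href]') as $elem) {
    if (strtoupper($elem->tag) === 'BASE') continue;
    $elem->href = $baseURI->resolve($elem->href)->__toString();
}
// Rewrite form actions
foreach ($html->find('form[action]') as $elem) {
    $elem->action = $baseURI->resolve($elem->action)->__toString();
}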
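
To see what resolve() does on its own, here is a small sketch that uses only Net_URL2, assuming the PEAR package is installed and on the include path. The sample references are made up for illustration; the expected results in the comments follow the RFC 3986 resolution rules that Net_URL2 implements.

```php
<?php
// Assumption: PEAR's Net_URL2 package is installed and on the include path.
require_once 'Net/URL2.php';

$base = new Net_URL2('http://example.com/foo/bar');

// Relative reference: replaces the last path segment of the base.
$relative = $base->resolve('baz.png')->getURL();

// Root-relative reference: keeps scheme and host, replaces the whole path.
$rooted = $base->resolve('/img/a.png')->getURL();

// Already absolute: returned unchanged.
$absolute = $base->resolve('http://other.example/x')->getURL();

echo $relative, "\n"; // http://example.com/foo/baz.png
echo $rooted, "\n";   // http://example.com/img/a.png
echo $absolute, "\n"; // http://other.example/x
```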

Ganon

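Project page: http://code.google.com/p/ganon/
Documentation: http://code.google.com/p/ganon/w/list

This one is very powerful. I only discovered it recently and have added it to my regular toolkit; single-file source, 2856 lines.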

The Ganon library gives access to HTML/XML documents in a very simple object oriented way. It eases modifying the DOM and makes finding elements easy with CSS3-like queries.

  • A universal tokenizer
  • A HTML/XML/RSS DOM Parser
      • Ability to manipulate elements and their attributes
      • Supports invalid HTML
      • Supports UTF8
      • Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)
  • A HTML beautifier (like HTML Tidy)
      • Minify CSS and Javascript
      • Sort attributes, change character case, correct indentation, etc.
  • Extensible
      • Parsing documents using callbacks based on current character/token
      • Operations separated in smaller functions for easy overriding
  • Fast
  • Easy

Ganon usage examples:

// Parse the google code website into a DOM
$html = file_get_dom('http://code.google.com/');

Access
Accessing elements is made easy through the CSS3-like selectors and the object model.

// Find all the paragraph tags with a class attribute and print the
// value of the class attribute
foreach ($html('p[class]') as $element) {
    echo $element->class, "<br>\n";
}

// Find the first div with ID "gc-header" and print the plain text of
// the parent element (plain text means no HTML tags, just the text)
echo $html('div#gc-header', 0)->parent->getPlainText();

// Find out how many tags there are which are "ns:tag" or "div", but not
// "a" and do not have a class attribute
echo count($html('(ns|tag, div + !a)[!class]'));
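
The `$html('selector')` call syntax depends on PHP's `__invoke` magic method, so it needs PHP 5.3 or later. The sketch below parses a string instead of a URL and runs the same query both ways. It assumes ganon.php sits next to the script, and that `str_get_dom()` and the `select()` method behave as I remember them from the Ganon source; treat both as assumptions and check them against your copy of the library.

```php
<?php
// Assumption: ganon.php is in the same directory as this script.
require_once 'ganon.php';

// Assumption: str_get_dom() parses a string the way file_get_dom() parses a URL.
$html = str_get_dom('<div><p class="intro">Hello <b>World</b></p><p>No class</p></div>');

// Call syntax used in the examples above (relies on __invoke, PHP 5.3+).
$viaInvoke = $html('p[class]');

// Assumed equivalent method call, usable where __invoke is not available.
$viaSelect = $html->select('p[class]');

echo count($viaInvoke), ' ', count($viaSelect), "\n";
echo $viaSelect[0]->class, "\n";
```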

Modification
Elements can be easily modified after you've found them.

// Find all paragraph tags which are nested inside a div tag, change
// their ID attribute and print the new HTML code
foreach ($html('div p') as $index => $element) {
    $element->id = "id$index";
}
echo $html;

// Center all the links inside a document which start with "http://"
// and print out the new HTML
foreach ($html('a[href ^= "http://"]') as $element) {
    $element->wrap('center');
}
echo $html;

// Find all odd indexed "td" elements and change the HTML to make them links
foreach ($html('table td:odd') as $element) {
    $element->setInnerText('<a href="#">'.$element->getPlainText().'</a>');
}
echo $html;

Beautify
Ganon can also help you beautify your code and format it properly.

// Beautify the old HTML code and print out the new, formatted code
dom_format($html, array('attributes_case' => CASE_LOWER));
echo $html;

phpQuery

This one is heavyweight and fairly resource-hungry; single-file source, 5702 lines.

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on the jQuery JavaScript library.
The library is written in PHP 5 and also provides a command-line interface (CLI).

Project page: http://code.google.com/p/phpquery/
Documentation: http://code.google.com/p/phpquery/wiki/Manual

phpQuery Examples

CLI
Fetch number of downloads of all release packages

phpquery 'http://code.google.com/p/phpquery/downloads/list?can=1' \
  --find '.vt.col_4 a' --contents \
  --getString null array_sum

PHP
Examples from demo.php

require('phpQuery/phpQuery.php');
// for PEAR installation use this
// require('phpQuery.php');

INITIALIZE IT

// $doc = phpQuery::newDocumentHTML($markup);
// $doc = phpQuery::newDocumentXML();
// $doc = phpQuery::newDocumentFileXHTML('test.html');
// $doc = phpQuery::newDocumentFilePHP('test.php');
// $doc = phpQuery::newDocument('test.xml', 'application/rss+xml');
// this one defaults to text/html in utf8
$doc = phpQuery::newDocument('<div/>');

FILL IT

// array syntax works like ->find() here
$doc['div']->append('<ul></ul>');
// array set changes inner html
$doc['div ul'] = '<li>1</li><li>2</li><li>3</li>';
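
As a quick check that the array syntax and `->find()` really select the same nodes, here is a minimal sketch. It assumes phpQuery is installed under phpQuery/phpQuery.php as in the require line earlier; the sample markup is made up for illustration.

```php
<?php
// Assumption: phpQuery lives at this path, as in the require line from demo.php.
require_once 'phpQuery/phpQuery.php';

$doc = phpQuery::newDocumentHTML('<div><ul><li>1</li><li>2</li><li>3</li></ul></div>');

// Array access and find() take the same CSS selector.
$viaArray = $doc['ul > li'];
$viaFind  = $doc->find('ul > li');

echo $viaArray->size(), ' ', $viaFind->size(), "\n"; // 3 3

// Collect the text of each item; pq() wraps each plain DOM node from the loop.
$texts = array();
foreach ($doc['li'] as $li) {
    $texts[] = pq($li)->text();
}
echo implode(',', $texts), "\n"; // 1,2,3
```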

MANIPULATE IT

// almost everything can be a chain
$li = null;
$doc['ul > li']
    ->addClass('my-new-class')
    ->filter(':last')
        ->addClass('last-li')
        // save it anywhere in the chain
        ->toReference($li);

SELECT DOCUMENT

// pq(); is using selected document as default
phpQuery::selectDocument($doc);
// documents are selected when created or by above method
// query all unordered lists in last selected document
pq('ul')->insertAfter('div');

ITERATE IT

// all LIs from last selected DOM
foreach (pq('li') as $li) {
    // iteration returns PLAIN dom nodes, NOT phpQuery objects
    $tagName = $li->tagName;
    $childNodes = $li->childNodes;
    // so you NEED to wrap it within phpQuery, using pq();
    pq($li)->addClass('my-second-new-class');
}
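
The "plain DOM nodes" in that loop are DOMElement objects from PHP's built-in DOM extension, which phpQuery is built on. The sketch below uses only that extension (no phpQuery) to show what such a node offers by itself: properties like tagName, childNodes and textContent, but none of the jQuery-style methods. The sample markup and variable names are made up for illustration.

```php
<?php
// Built-in DOM extension only; phpQuery is not needed for this illustration.
$dom = new DOMDocument();
$dom->loadHTML('<ul><li>1</li><li><b>2</b></li></ul>');

$summary = array();
$hasAddClass = false;
foreach ($dom->getElementsByTagName('li') as $li) {
    // $li is a DOMElement, the same kind of object a phpQuery foreach yields.
    $summary[] = $li->tagName . ':' . $li->textContent . ':' . $li->childNodes->length;
    // jQuery-style helpers such as addClass() are not defined on DOMElement.
    $hasAddClass = $hasAddClass || method_exists($li, 'addClass');
}

print_r($summary);         // li:1:1 and li:2:1
var_dump($hasAddClass);    // bool(false)
```

This is why the loop body has to call pq($li) before chaining phpQuery methods on the node.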

PRINT OUTPUT

// 1st way
print phpQuery::getDocument($doc->getDocumentID());
// 2nd way
print phpQuery::getDocument(pq('div')->getDocumentID());
// 3rd way
print pq('div')->getDocument();
// 4th way
print $doc->htmlOuter();
// 5th way
print $doc;
// another...
print $doc['ul'];

Tags: PHP, DOM, Ganon, HTML, phpQuery, Simple HTML DOM

Permalink: http://www.21andy.com/new/20120716/2071.html

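3 comments on “PHP Open-Source CMS: MODx”

  1. Rivsen Tan, posted on 2012-07-17 12:08:13:

    Haha, a must-have for PHP fans!

  2. Posted on 2012-08-16 19:28:20:

    Scraping with PHP: wrestling with the DOM nearly did me in.
    I started with the first one, Simple HTML DOM, and never really learned to use it.
    As a newbie who has been teaching myself PHP for only a few days, I find that a bit embarrassing.

  3. Posted on 2012-12-29 20:44:41:

    What a pain. The second one runs too slowly. The first one is reasonably fast, but it still will not run on sinaapp. No idea why.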