David Kittell

Application & System: Development / Integration / Orchestration


Web Crawler – PHP

Posted on August 9, 2013 (updated October 26, 2015) by David Kittell
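<?php
// Read the target URL from the query string; fall back to a default when it is missing.
$target_url = isset($_GET["url"]) ? trim($_GET["url"]) : "";
if (empty($target_url)) {$target_url = "http://kittell.net";}

// Escape the URL before echoing it so it cannot inject markup into the page.
echo "<h1>" . htmlspecialchars($target_url, ENT_QUOTES, 'UTF-8') . "</h1>";

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
// Only allow web requests, so a user-supplied URL cannot use file:// or similar schemes.
curl_setopt($ch, CURLOPT_PROTOCOLS, CURLPROTO_HTTP | CURLPROTO_HTTPS);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);

// curl_exec() returns false on failure; an empty body is also treated as an error here.
if (!$html) {
	echo "<br />cURL error number:" . curl_errno($ch);
	echo "<br />cURL error:" . htmlspecialchars(curl_error($ch), ENT_QUOTES, 'UTF-8');
	exit;
}
curl_close($ch);

// parse the html into a DOMDocument
// (@ suppresses the warnings loadHTML raises on malformed real-world markup)
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

foreach ($hrefs as $href) {
	$url = $href->getAttribute('href');
	// Escape each link before printing it back into the page.
	echo "<br />" . htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}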
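
The link-extraction step can also be tried without a network request. The following is a minimal sketch, not part of the original post: `extractLinks` is a hypothetical helper name, and the sample HTML is made up for illustration. It uses the same DOMDocument and DOMXPath approach as above, but skips anchors that have no `href` and drops duplicate links.

```php
<?php
// Hypothetical helper (an assumption, not from the original post):
// return the unique, non-empty href values found in an HTML string.
function extractLinks(string $html): array
{
	// loadHTML() rejects an empty string, so return early.
	if (trim($html) === '') {
		return [];
	}

	$dom = new DOMDocument();
	// Collect parser warnings internally instead of hiding them with @.
	$previous = libxml_use_internal_errors(true);
	$dom->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);

	$xpath = new DOMXPath($dom);
	$links = [];
	// Select only anchors that actually carry an href attribute.
	foreach ($xpath->query('//a[@href]') as $anchor) {
		$href = trim($anchor->getAttribute('href'));
		if ($href !== '') {
			$links[] = $href;
		}
	}

	// Remove duplicates and reindex the result from zero.
	return array_values(array_unique($links));
}

// Made-up sample markup for illustration only.
$sample = '<html><body>'
	. '<a href="/about">About</a>'
	. '<a href="/about">About again</a>'
	. '<a name="top">No href</a>'
	. '<a href="http://kittell.net">Home</a>'
	. '</body></html>';

print_r(extractLinks($sample));
```

Keeping the parsing in a function that takes a string makes it possible to check the extraction logic against fixed HTML before pointing the crawler at a live site.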


Originally Posted on August 9, 2013
Last Updated on October 26, 2015
All information on this site is shared with the intention to help. Before any source code or program is run on a production (non-development) system, it is suggested that you test it and fully understand what it is doing, not just what it appears to be doing. I accept no responsibility for any damage you may do with this code.
