
Parsing Domain From URL In PHP



I need to build a function which parses the domain from a URL.





So, with http://google.com/dhasjkdas/sadsdds/sdda/sdads.html or http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html, it should return google.com; with http://google.co.uk/dhasjkdas/sadsdds/sdda/sdads.html, it should return google.co.uk.


Comments

  1. Check out parse_url():

    $url = 'http://google.com/dhasjkdas/sadsdds/sdda/sdads.html';
    $parse = parse_url($url);
    print $parse['host']; // prints 'google.com'


    Note: parse_url() doesn't handle really badly mangled URLs very well, but it is fine if you generally expect decent URLs.
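    Since parse_url() may return false on seriously malformed URLs, and the 'host' key is missing when the input has no scheme, a minimal guard might look like this sketch (variable names are just an example):

    $parse = parse_url($url);
    if ($parse === false || empty($parse['host'])) {
        // seriously malformed URL, or no scheme so the host ended up in 'path';
        // handle the error here instead of assuming $parse['host'] exists
    }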

  2. $domain = str_ireplace('www.', '', parse_url($url, PHP_URL_HOST));


    This would return google.com for both http://google.com/... and http://www.google.com/...
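    Wrapped up as a small helper, this approach might look like the following sketch (the function name is arbitrary); note that str_ireplace removes every occurrence of 'www.', not just a leading one:

    function getDomain($url) {
        // drop 'www.' from whatever parse_url() reports as the host
        return str_ireplace('www.', '', parse_url($url, PHP_URL_HOST));
    }

    echo getDomain('http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html'); // google.com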

  3. The internal PHP function parse_url is not always sufficient to parse URLs or URIs correctly into their components, such as Host/Domain, Path, Segments, Query and Fragment.

    In this case you need a reliable result array like this:

    - [SCHEME] => http
    - [AUTHORITY] => user:pass@www.domain.com:80
    - [USERINFO] => user:pass
    - [USER] => user
    - [PASS] => pass
    - [HOST] => www.domain.com
    - [REGNAME] => www.domain.com
    - [DOMAIN] => www.domain.com
    - [LABEL][] =>
    - [LABEL][] => com
    - [LABEL][] => domain
    - [LABEL][] => www
    - [PORT] => 80
    - [PATH] => /dir1/dir2/page.html
    - [SEGMENT][] => dir1
    - [SEGMENT][] => dir2
    - [SEGMENT][] => page.html
    - [QUERY] => key1=value1&key2=value2
    - [GET][key1] => value1
    - [GET][key2] => value2
    - [FRAGMENT] => anchor/line
    - [ANCHOR][] => anchor
    - [ANCHOR][] => line


    There's a standards-compliant, robust and performant PHP class for handling and parsing URLs/URIs according to RFC 3986 and RFC 3987, available for download and free use:

    http://andreas-hahn.com/en/parse-url
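    If you would rather not pull in a library, RFC 3986 Appendix B publishes a reference regular expression for splitting a URI into its top-level components; a minimal sketch using it (this is the RFC's own pattern, not the class above):

    $uri = 'http://user:pass@www.domain.com:80/dir1/dir2/page.html?key1=value1&key2=value2#anchor';
    preg_match('!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!', $uri, $m);
    // $m[2] = scheme, $m[4] = authority, $m[5] = path, $m[7] = query, $m[9] = fragment
    echo $m[4]; // user:pass@www.domain.com:80

    Splitting the authority further into userinfo, host and port (and the host into labels) is still up to you, which is what the class above automates.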

  4. From http://us3.php.net/manual/en/function.parse-url.php#93983


    For some odd reason, parse_url returns the host (e.g. example.com) as the path when no scheme is provided in the input URL. So I've written a quick function to get the real host:


    function getHost($Address) {
        $parseUrl = parse_url(trim($Address));
        // with no scheme, parse_url() puts the host into 'path', so fall back to the part before the first '/'
        $path = isset($parseUrl['path']) ? explode('/', $parseUrl['path'], 2) : array('');
        return trim(!empty($parseUrl['host']) ? $parseUrl['host'] : array_shift($path));
    }

    getHost("example.com"); // Gives example.com
    getHost("http://example.com"); // Gives example.com
    getHost("www.example.com"); // Gives www.example.com
    getHost("http://example.com/xyz"); // Gives example.com

  5. The code that was meant to work 100% didn't seem to cut it for me. I patched the example a little, but found code that wasn't helping and problems with it, so I changed it out into a couple of functions (to save asking Mozilla for the list all the time, and removing the cache system; a framework-free caching sketch follows at the end of this comment). This has been tested against a set of 1000 URLs and seemed to work.

    function domain($url)
    {
        global $subtlds;
        $url = strtolower($url);

        // prepend a scheme if there is none, so parse_url() reliably returns the host
        if (!preg_match('#^[a-z][a-z0-9+.-]*://#', $url)) {
            $url = 'http://' . $url;
        }
        $host = parse_url($url, PHP_URL_HOST);

        // default: keep the last two labels (e.g. example.com)
        preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
        // if the host ends in a known "sub-TLD" (e.g. co.uk), keep three labels instead
        foreach ($subtlds as $sub) {
            if (preg_match('/\.' . preg_quote($sub, '/') . '$/', $host, $xyz)) {
                preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
            }
        }

        return isset($matches[0]) ? $matches[0] : '';
    }

    function get_tlds() {
        $address = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
        $content = file($address);
        $subtlds = array();
        foreach ($content as $num => $line) {
            $line = trim($line);
            if ($line == '') continue;
            if (substr($line, 0, 2) == '//') continue;   // skip comment lines in the list
            $line = preg_replace("/[^a-zA-Z0-9\.]/", '', $line);
            if ($line == '') continue;
            if ($line[0] == '.') $line = substr($line, 1);
            if (!strstr($line, '.')) continue;           // single-label TLDs are covered by the default match
            $subtlds[] = $line;
        }

        // co.uk and friends are missing from the raw list, so add them manually
        $subtlds = array_merge(array(
            'co.uk', 'me.uk', 'net.uk', 'org.uk', 'sch.uk', 'ac.uk',
            'gov.uk', 'nhs.uk', 'police.uk', 'mod.uk', 'asn.au', 'com.au',
            'net.au', 'id.au', 'org.au', 'edu.au', 'gov.au', 'csiro.au'
        ), $subtlds);

        return array_unique($subtlds);
    }


    Then use it like:

    $subtlds = get_tlds();
    echo domain('www.example.com'); //outputs: example.com
    echo domain('www.example.uk.com'); //outputs: example.uk.com
    echo domain('www.example.fr'); //outputs: example.fr


    I know I should have turned this into a class, but didn't have time.
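    If you do want to avoid fetching the Mozilla list on every request but have no framework cache available, a minimal file-based cache around get_tlds() could look like this sketch (the cache file path and lifetime are arbitrary choices):

    function get_tlds_cached($cacheFile = '/tmp/effective_tlds.cache', $maxAge = 86400) {
        // reuse the cached list while it is newer than $maxAge seconds
        if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
            return unserialize(file_get_contents($cacheFile));
        }
        $subtlds = get_tlds(); // fetch and parse the Mozilla list
        file_put_contents($cacheFile, serialize($subtlds));
        return $subtlds;
    }

    $subtlds = get_tlds_cached();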

  6. Here is the code I made that 100% finds only the domain name, since it takes the Mozilla sub-TLDs into account. The only thing you have to check is how you cache that file, so you don't query Mozilla every time.

    For some strange reason, domains like co.uk are not in the list, so you have to do some hacking and add them manually. It's not the cleanest solution, but I hope it helps someone.

    //=====================================================
    static function domain($url)
    {
        $url = strtolower($url);

        $address = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
        // rebuild the sub-TLD list only when it is not in the Kohana cache
        if (!$subtlds = @kohana::cache('subtlds', null, 60))
        {
            $subtlds = array();
            $content = file($address);
            foreach ($content as $num => $line)
            {
                $line = trim($line);
                if ($line == '') continue;
                if (substr($line, 0, 2) == '//') continue;   // skip comment lines in the list
                $line = preg_replace("/[^a-zA-Z0-9\.]/", '', $line);
                if ($line == '') continue;
                if ($line[0] == '.') $line = substr($line, 1);
                if (!strstr($line, '.')) continue;
                $subtlds[] = $line;
            }
            // co.uk and friends are missing from the raw list, so add them manually
            $subtlds = array_merge(array(
                'co.uk', 'me.uk', 'net.uk', 'org.uk', 'sch.uk', 'ac.uk',
                'gov.uk', 'nhs.uk', 'police.uk', 'mod.uk', 'asn.au', 'com.au',
                'net.au', 'id.au', 'org.au', 'edu.au', 'gov.au', 'csiro.au',
            ), $subtlds);

            $subtlds = array_unique($subtlds);
            @kohana::cache('subtlds', $subtlds);
        }

        // strip an optional http(s):// prefix and everything after the first '/'
        preg_match('/^(https?:[\/]{2,})?([^\/]+)/i', $url, $matches);
        $host = isset($matches[2]) ? $matches[2] : '';

        // default: keep the last two labels; extend to three when the host ends in a known sub-TLD
        preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
        foreach ($subtlds as $sub)
        {
            if (preg_match('/' . preg_quote($sub, '/') . '$/', $host, $xyz))
                preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
        }

        return isset($matches[0]) ? $matches[0] : '';
    }
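    Usage would look roughly like this (the class name Url is made up here, since the comment only shows the static method):

    echo Url::domain('http://www.example.co.uk/some/page.html'); // example.co.uk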

