I need to build a function which parses the domain from a URL.
So, with http://google.com/dhasjkdas/sadsdds/sdda/sdads.html or http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html, it should return google.com; with http://google.co.uk/dhasjkdas/sadsdds/sdda/sdads.html, it should return google.co.uk.
Check out parse_url():
$url = 'http://google.com/dhasjkdas/sadsdds/sdda/sdads.html';
$parse = parse_url($url);
print $parse['host']; // prints 'google.com'
Note: parse_url() doesn't handle badly mangled URLs very well, but it is fine if you generally expect well-formed URLs.
$domain = str_ireplace('www.', '', parse_url($url, PHP_URL_HOST));
This returns google.com for both http://google.com/... and http://www.google.com/...
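One caveat with the one-liner above: str_ireplace() removes "www." wherever it appears in the host, so a host such as my-www.example.com would become my-example.com. A minimal sketch that only strips a leading "www." follows; the helper name hostWithoutWww is made up for this example.

```php
<?php
// Hypothetical helper: return the host of $url without a leading "www." label.
function hostWithoutWww($url)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // malformed URL, or no host component present
    }
    // The ^ anchor means only a "www." at the very start is removed.
    return preg_replace('/^www\./i', '', $host);
}

echo hostWithoutWww('http://www.google.com/dhasjkdas/sdads.html'), "\n"; // google.com
echo hostWithoutWww('http://my-www.example.com/'), "\n";                 // my-www.example.com
```

Like the original, this still returns google.co.uk style hosts unchanged and does not try to drop other subdomains.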
The built-in PHP function parse_url() is not always sufficient to parse URLs or URIs correctly into their components, such as host/domain, path, segments, query and fragment.
In that case you need a reliable result array like this:
- [SCHEME] => http
- [AUTHORITY] => user:pass@www.domain.com:80
- [USERINFO] => user:pass
- [USER] => user
- [PASS] => pass
- [HOST] => www.domain.com
- [REGNAME] => www.domain.com
- [DOMAIN] => www.domain.com
- [LABEL][] =>
- [LABEL][] => com
- [LABEL][] => domain
- [LABEL][] => www
- [PORT] => 80
- [PATH] => /dir1/dir2/page.html
- [SEGMENT][] => dir1
- [SEGMENT][] => dir2
- [SEGMENT][] => page.html
- [QUERY] => key1=value1&key2=value2
- [GET][key1] => value1
- [GET][key2] => value2
- [FRAGMENT] => anchor/line
- [ANCHOR][] => anchor
- [ANCHOR][] => line
There's a standard-compliant, robust and performant PHP Class for handling and parsing URLs / URIs according to RFC 3986 and RFC 3987 available for download and free use:
http://andreas-hahn.com/en/parse-url
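For well-formed input, several of the fields listed above can be approximated with built-ins alone. The sketch below is not the class mentioned above and does no RFC 3986/3987 validation; the sample URL is simply assembled from the values in the list, and the variable names are my own.

```php
<?php
// Illustrative input, assembled from the sample values in the list above.
$url = 'http://user:pass@www.domain.com:80/dir1/dir2/page.html?key1=value1&key2=value2#anchor/line';

$parts = parse_url($url);

// Roughly [LABEL][]: host labels, top-level label first
// (without the empty root label shown in the list).
$labels = array_reverse(explode('.', $parts['host']));

// Roughly [SEGMENT][]: the non-empty path segments.
$segments = array_values(array_filter(explode('/', $parts['path']), 'strlen'));

// Roughly [GET][...]: decoded query parameters.
parse_str($parts['query'], $get);

// Roughly [ANCHOR][]: the fragment split on "/".
$anchors = explode('/', $parts['fragment']);

print_r($labels);
print_r($segments);
print_r($get);
print_r($anchors);
```

This assumes every component is present; for arbitrary input each array key would need an isset() check first.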
From http://us3.php.net/manual/en/function.parse-url.php#93983:
For some odd reason, parse_url() returns the host (e.g. example.com) as the path when no scheme is provided in the input URL. So I've written a quick function to get the real host:
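function getHost($Address) {
    $parseUrl = parse_url(trim($Address));
    if (!empty($parseUrl['host'])) {
        return trim($parseUrl['host']);
    }
    // No scheme: parse_url() puts the host in 'path', so take the first path segment.
    $parts = explode('/', isset($parseUrl['path']) ? $parseUrl['path'] : '', 2);
    return trim($parts[0]);
}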
getHost("example.com"); // Gives example.com
getHost("http://example.com"); // Gives example.com
getHost("www.example.com"); // Gives www.example.com
getHost("http://example.com/xyz"); // Gives example.com
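Another way to handle scheme-less input, sketched here as an alternative and not as the commenter's method: parse_url() does recognise the host when the string starts with "//" (PHP 5.4.7 and later), so you can prepend that when no scheme is present. The function name hostFromLooseUrl is hypothetical.

```php
<?php
// Hypothetical helper: extract the host from input that may lack a scheme.
function hostFromLooseUrl($address)
{
    $address = trim($address);
    $hasScheme = preg_match('#^[a-z][a-z0-9+.-]*://#i', $address) === 1;
    if (!$hasScheme && strpos($address, '//') !== 0) {
        // Make it a network-path reference so the first part is parsed as the host.
        $address = '//' . $address;
    }
    $host = parse_url($address, PHP_URL_HOST);
    return is_string($host) ? $host : null;
}

echo hostFromLooseUrl('example.com'), "\n";
echo hostFromLooseUrl('http://example.com/xyz'), "\n";
echo hostFromLooseUrl('www.example.com/xyz'), "\n";
```

This keeps the host detection inside parse_url() instead of splitting the path by hand, at the cost of requiring a newer PHP version.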
Check out parse_url()
The code that was meant to work 100% didn't cut it for me. I patched the example a little, but found code in it that wasn't helping and other problems, so I split it into a couple of functions (to avoid requesting the list from Mozilla every time, and to remove the cache system). This has been tested against a set of 1000 URLs and seemed to work.
function domain($url)
{
global $subtlds;
$slds = "";
$url = strtolower($url);
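// Only prepend a scheme when the input does not already have one.
$host = parse_url(preg_match('#^[a-z][a-z0-9+.-]*://#', $url) ? $url : 'http://'.$url, PHP_URL_HOST);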
preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
foreach($subtlds as $sub){
if (preg_match('/\.'.preg_quote($sub).'$/', $host, $xyz)){
preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
}
}
return @$matches[0];
}
function get_tlds(){
$address = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
$subtlds = array(); // initialise so array_merge() below always receives an array
$content = file($address);
foreach($content as $num => $line){
$line = trim($line);
if($line == '') continue;
if(substr($line, 0, 2) == '//') continue; // skip comment lines
$line = @preg_replace("/[^a-zA-Z0-9\.]/", '', $line);
if($line == '') continue; //$line = '.'.$line;
if(@$line[0] == '.') $line = substr($line, 1);
if(!strstr($line, '.')) continue;
$subtlds[] = $line;
//echo "{$num}: '{$line}'"; echo "<br>";
}
$subtlds = array_merge(array(
'co.uk', 'me.uk', 'net.uk', 'org.uk', 'sch.uk', 'ac.uk',
'gov.uk', 'nhs.uk', 'police.uk', 'mod.uk', 'asn.au', 'com.au',
'net.au', 'id.au', 'org.au', 'edu.au', 'gov.au', 'csiro.au'
),$subtlds);
$subtlds = array_unique($subtlds);
return $subtlds;
}
Then use it like
$subtlds = get_tlds();
echo domain('www.example.com'); // outputs: example.com
echo domain('www.example.uk.com'); // outputs: example.uk.com
echo domain('www.example.fr'); // outputs: example.fr
I know I should have turned this into a class, but didn't have time.
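The domain() function above runs one regular expression per list entry on every call. With a few thousand entries, a hash lookup over the host's trailing labels may be cheaper. Here is a rough sketch of that idea; registrableDomain and the two-entry suffix array are made up for illustration, and the sketch ignores the wildcard and exception rules in Mozilla's list.

```php
<?php
// Hypothetical helper: shorten $host to one label plus its multi-label suffix.
function registrableDomain($host, array $suffixes)
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $lookup = array_flip($suffixes); // suffix => index, for isset() lookups
    $count  = count($labels);

    // Try the longest candidate suffix first; starting at 1 always
    // leaves at least one label in front of the suffix.
    for ($i = 1; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (isset($lookup[$candidate])) {
            return implode('.', array_slice($labels, $i - 1));
        }
    }

    // No listed suffix matched: fall back to the last two labels.
    return implode('.', array_slice($labels, -2));
}

$suffixes = array('co.uk', 'uk.com'); // stand-in for the output of get_tlds()
echo registrableDomain('www.example.co.uk', $suffixes), "\n";  // example.co.uk
echo registrableDomain('www.example.uk.com', $suffixes), "\n"; // example.uk.com
echo registrableDomain('www.example.fr', $suffixes), "\n";     // example.fr
```

In real use you would flip the suffix array once and reuse it, not rebuild it on each call.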
Here is the code I made that 100% finds only the domain name, since it takes Mozilla's sub-TLD list into account. The only thing you have to check is how you cache that file, so you don't query Mozilla every time.
For some strange reason, domains like co.uk are not in the list, so you have to hack around it and add them manually. It's not the cleanest solution, but I hope it helps someone.
//=====================================================
static function domain($url)
{
$slds = "";
$url = strtolower($url);
$address = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
if(!$subtlds = @kohana::cache('subtlds', null, 60))
{
$content = file($address);
foreach($content as $num => $line)
{
$line = trim($line);
if($line == '') continue;
if(substr($line, 0, 2) == '//') continue; // skip comment lines
$line = @preg_replace("/[^a-zA-Z0-9\.]/", '', $line);
if($line == '') continue; //$line = '.'.$line;
if(@$line[0] == '.') $line = substr($line, 1);
if(!strstr($line, '.')) continue;
$subtlds[] = $line;
//echo "{$num}: '{$line}'"; echo "<br>";
}
$subtlds = array_merge(Array(
'co.uk', 'me.uk', 'net.uk', 'org.uk', 'sch.uk', 'ac.uk',
'gov.uk', 'nhs.uk', 'police.uk', 'mod.uk', 'asn.au', 'com.au',
'net.au', 'id.au', 'org.au', 'edu.au', 'gov.au', 'csiro.au',
),$subtlds);
$subtlds = array_unique($subtlds);
//echo var_dump($subtlds);
@kohana::cache('subtlds', $subtlds);
}
preg_match('/^(https?:[\/]{2,})?([^\/]+)/i', $url, $matches); // accept http and https
//preg_match("/^(http:\/\/|https:\/\/|)[a-zA-Z-]([^\/]+)/i", $url, $matches);
$host = @$matches[2];
//echo var_dump($matches);
preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
foreach($subtlds as $sub)
{
if (preg_match('/\.' . preg_quote($sub, '/') . '$/', $host, $xyz)) // escape dots and require a label boundary
preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
}
return @$matches[0];
}
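If you are not using Kohana, the caching step can be done with a plain file. This is only a sketch under my own assumptions: cachedList, the JSON file format and the temp-file path are all invented for the example, and the loader here returns a fixed array where the real one would download the list.

```php
<?php
// Hypothetical helper: return the list stored in $cacheFile if it is younger
// than $maxAge seconds, otherwise rebuild it with $loader and store it.
function cachedList($cacheFile, $maxAge, $loader)
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        $cached = json_decode(file_get_contents($cacheFile), true);
        if (is_array($cached)) {
            return $cached;
        }
    }
    $fresh = call_user_func($loader);
    file_put_contents($cacheFile, json_encode($fresh), LOCK_EX);
    return $fresh;
}

// Demo with a stand-in loader that counts how often it is called.
$calls  = 0;
$loader = function () use (&$calls) {
    $calls++;
    return array('co.uk', 'com.au');
};
$file = sys_get_temp_dir() . '/subtlds_' . uniqid() . '.json';

$first  = cachedList($file, 3600, $loader); // builds and writes the cache
$second = cachedList($file, 3600, $loader); // served from the cache file
echo "loader calls: {$calls}\n";
unlink($file);
```

With the earlier two-function version you could pass 'get_tlds' as the loader, so Mozilla is only queried when the cache file is missing or stale.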