Parsing Domain From URL In PHP



I need to build a function which parses the domain from a URL.





So, with http://google.com/dhasjkdas/sadsdds/sdda/sdads.html or http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html, it should return google.com; with http://google.co.uk/dhasjkdas/sadsdds/sdda/sdads.html, it should return google.co.uk.


Comments

  1. check out parse_url():

    $url = 'http://google.com/dhasjkdas/sadsdds/sdda/sdads.html';
    $parse = parse_url($url);
    print $parse['host']; // prints 'google.com'


    note: parse_url() doesn't handle badly mangled URLs very well, but it is fine if you generally expect decent URLs.

  2. $domain = str_ireplace('www.', '', parse_url($url, PHP_URL_HOST));


    This would return google.com for both http://google.com/... and http://www.google.com/...
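    A slightly safer variant of that one-liner (a sketch; the function name is made up, and the replacement is anchored so only a leading "www." is stripped, since str_ireplace() would also remove "www." appearing mid-host):

    ```php
    <?php
    // Strip only a leading "www." from the host returned by parse_url().
    function hostWithoutWww(string $url): string {
        $host = parse_url($url, PHP_URL_HOST);
        return preg_replace('/^www\./i', '', $host);
    }

    echo hostWithoutWww('http://google.com/dhasjkdas/sadsdds/sdda/sdads.html'), "\n";     // google.com
    echo hostWithoutWww('http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html'), "\n"; // google.com
    ```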

  3. The internal PHP function parse_url is not always sufficient to parse URLs or URIs correctly into their components, such as Host/Domain, Path, Segments, Query and Fragment.

    In this case you need a reliable result array like this:

    - [SCHEME] => http
    - [AUTHORITY] => user:pass@www.domain.com:80
    - [USERINFO] => user:pass
    - [USER] => user
    - [PASS] => pass
    - [HOST] => www.domain.com
    - [REGNAME] => www.domain.com
    - [DOMAIN] => www.domain.com
    - [LABEL][] =>
    - [LABEL][] => com
    - [LABEL][] => domain
    - [LABEL][] => www
    - [PORT] => 80
    - [PATH] => /dir1/dir2/page.html
    - [SEGMENT][] => dir1
    - [SEGMENT][] => dir2
    - [SEGMENT][] => page.html
    - [QUERY] => key1=value1&key2=value2
    - [GET][key1] => value1
    - [GET][key2] => value2
    - [FRAGMENT] => anchor/line
    - [ANCHOR][] => anchor
    - [ANCHOR][] => line


    There's a standards-compliant, robust and performant PHP class for handling and parsing URLs/URIs according to RFC 3986 and RFC 3987, available for download and free use:

    http://andreas-hahn.com/en/parse-url
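    For comparison, several of the components listed above can be approximated with PHP built-ins alone (a sketch; the LABEL/SEGMENT/GET keys only mimic the array shown above and are not produced by parse_url() itself):

    ```php
    <?php
    $url = 'http://user:pass@www.domain.com:80/dir1/dir2/page.html?key1=value1&key2=value2#anchor/line';

    $p = parse_url($url);

    // Host labels, reversed as in the listing above (com, domain, www)
    $labels = array_reverse(explode('.', $p['host']));

    // Path segments without the empty leading element
    $segments = array_values(array_filter(explode('/', $p['path']), 'strlen'));

    // Query string into a GET-style array
    parse_str($p['query'], $get);

    print_r([
        'SCHEME'  => $p['scheme'],
        'HOST'    => $p['host'],
        'PORT'    => $p['port'],
        'LABEL'   => $labels,
        'SEGMENT' => $segments,
        'GET'     => $get,
    ]);
    ```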

  4. From http://us3.php.net/manual/en/function.parse-url.php#93983


    for some odd reason, parse_url returns the host (ex. example.com) as the path when no scheme is provided in the input url. So I've written a quick function to get the real host:


    function getHost($Address) {
        $parseUrl = parse_url(trim($Address));
        // With no scheme, parse_url() puts the host into 'path',
        // so take the part before the first slash.
        $host = $parseUrl['host'] ?? explode('/', $parseUrl['path'], 2)[0];
        return trim($host);
    }

    getHost("example.com"); // Gives example.com
    getHost("http://example.com"); // Gives example.com
    getHost("www.example.com"); // Gives www.example.com
    getHost("http://example.com/xyz"); // Gives example.com

  5. The code that was meant to work 100% didn't seem to cut it for me. I patched the example a little, but found code that wasn't helping and problems with it, so I changed it out to a couple of functions (to save asking Mozilla for the list all the time, and removing the cache system). This has been tested against a set of 1000 URLs and seemed to work.

    function domain($url)
    {
        global $subtlds;
        $url = strtolower($url);

        $host = parse_url('http://' . $url, PHP_URL_HOST);

        // Default: the last two labels (example.com)
        preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
        foreach ($subtlds as $sub) {
            // Host ends in a known two-part suffix: keep three labels
            if (preg_match('/\.' . preg_quote($sub, '/') . '$/', $host)) {
                preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
            }
        }

        return isset($matches[0]) ? $matches[0] : '';
    }

    function get_tlds()
    {
        $address = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
        $subtlds = array();
        $content = file($address);
        foreach ($content as $line) {
            $line = trim($line);
            if ($line == '') continue;
            if (substr($line, 0, 2) == '//') continue; // comment line
            $line = preg_replace("/[^a-zA-Z0-9\.]/", '', $line);
            if ($line == '') continue;
            if ($line[0] == '.') $line = substr($line, 1);
            if (!strstr($line, '.')) continue;
            $subtlds[] = $line;
        }

        $subtlds = array_merge(array(
            'co.uk', 'me.uk', 'net.uk', 'org.uk', 'sch.uk', 'ac.uk',
            'gov.uk', 'nhs.uk', 'police.uk', 'mod.uk', 'asn.au', 'com.au',
            'net.au', 'id.au', 'org.au', 'edu.au', 'gov.au', 'csiro.au'
        ), $subtlds);

        return array_unique($subtlds);
    }


    Then use it like

    $subtlds = get_tlds();
    echo domain('www.example.com');    // outputs: example.com
    echo domain('www.example.uk.com'); // outputs: example.uk.com
    echo domain('www.example.fr');     // outputs: example.fr


    I know I should have turned this into a class, but didn't have time.
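    Since fetching the list from Mozilla on every run is exactly what this comment tries to avoid, a small file-based cache helper could wrap any fetcher such as get_tlds() (a sketch; the helper name, cache path, and one-day lifetime are assumptions):

    ```php
    <?php
    // Return a cached list from disk if it is fresh enough;
    // otherwise call $fetch(), store the result, and return it.
    function cached_list(callable $fetch, $cacheFile, $maxAge = 86400) {
        if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
            return unserialize(file_get_contents($cacheFile));
        }
        $list = $fetch();
        file_put_contents($cacheFile, serialize($list));
        return $list;
    }

    // Usage: $subtlds = cached_list('get_tlds', '/tmp/subtlds.cache');
    ```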

  6. Here is the code I made that finds only the domain name, since it takes the Mozilla sub-TLDs into account. The only thing you have to check is how you cache that file, so you don't query Mozilla every time.

    For some strange reason, domains like co.uk are not in the list, so you have to do some hacking and add them manually. It's not the cleanest solution, but I hope it helps someone.

    //=====================================================
    static function domain($url)
    {
        $url = strtolower($url);

        $address = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
        if (!$subtlds = @kohana::cache('subtlds', null, 60))
        {
            $subtlds = array();
            $content = file($address);
            foreach ($content as $line)
            {
                $line = trim($line);
                if ($line == '') continue;
                if (substr($line, 0, 2) == '//') continue; // comment line
                $line = preg_replace("/[^a-zA-Z0-9\.]/", '', $line);
                if ($line == '') continue;
                if ($line[0] == '.') $line = substr($line, 1);
                if (!strstr($line, '.')) continue;
                $subtlds[] = $line;
            }
            $subtlds = array_merge(array(
                'co.uk', 'me.uk', 'net.uk', 'org.uk', 'sch.uk', 'ac.uk',
                'gov.uk', 'nhs.uk', 'police.uk', 'mod.uk', 'asn.au', 'com.au',
                'net.au', 'id.au', 'org.au', 'edu.au', 'gov.au', 'csiro.au',
            ), $subtlds);

            $subtlds = array_unique($subtlds);
            kohana::cache('subtlds', $subtlds);
        }

        // Strip an optional scheme, keep everything up to the first slash
        preg_match('/^(https?:[\/]{2,})?([^\/]+)/i', $url, $matches);
        $host = isset($matches[2]) ? $matches[2] : '';

        // Default: the last two labels; three if a known suffix matches
        preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
        foreach ($subtlds as $sub)
        {
            if (preg_match('/\.' . preg_quote($sub, '/') . '$/', $host))
                preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
        }

        return isset($matches[0]) ? $matches[0] : '';
    }

