
Parsing Domain From URL In PHP



I need to build a function which parses the domain from a URL.





So, with http://google.com/dhasjkdas/sadsdds/sdda/sdads.html or http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html, it should return google.com; with http://google.co.uk/dhasjkdas/sadsdds/sdda/sdads.html, it should return google.co.uk.


Comments

  1. check out parse_url():

    $url = 'http://google.com/dhasjkdas/sadsdds/sdda/sdads.html';
    $parse = parse_url($url);
    print $parse['host']; // prints 'google.com'


    Note: parse_url() doesn't handle really badly mangled URLs very well, but it is fine if you generally expect decent URLs.

  2. $domain = str_ireplace('www.', '', parse_url($url, PHP_URL_HOST));


    This would return google.com for both http://google.com/... and http://www.google.com/...
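
    As a quick check, a minimal sketch combining the one-liner above with the example URLs from the question:

    $urls = array(
        'http://google.com/dhasjkdas/sadsdds/sdda/sdads.html',
        'http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html',
    );
    foreach ($urls as $url) {
        echo str_ireplace('www.', '', parse_url($url, PHP_URL_HOST)), "\n"; // google.com
    }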

  3. The internal PHP function parse_url is not always sufficient to parse URLs or URIs correctly into their components, such as Host/Domain, Path, Segments, Query and Fragment.

    In this case you need a reliable result array like this:

    - [SCHEME] => http
    - [AUTHORITY] => user:pass@www.domain.com:80
    - [USERINFO] => user:pass
    - [USER] => user
    - [PASS] => pass
    - [HOST] => www.domain.com
    - [REGNAME] => www.domain.com
    - [DOMAIN] => www.domain.com
    - [LABEL][] =>
    - [LABEL][] => com
    - [LABEL][] => domain
    - [LABEL][] => www
    - [PORT] => 80
    - [PATH] => /dir1/dir2/page.html
    - [SEGMENT][] => dir1
    - [SEGMENT][] => dir2
    - [SEGMENT][] => page.html
    - [QUERY] => key1=value1&key2=value2
    - [GET][key1] => value1
    - [GET][key2] => value2
    - [FRAGMENT] => anchor/line
    - [ANCHOR][] => anchor
    - [ANCHOR][] => line


    There's a standards-compliant, robust and performant PHP class for handling and parsing URLs/URIs according to RFC 3986 and RFC 3987, available for download and free use:

    http://andreas-hahn.com/en/parse-url
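
    For a rough idea of how parts of such an array can be assembled from PHP built-ins alone, here is a minimal sketch (not the linked class; the function name is made up, the public-suffix-aware [DOMAIN] field is omitted, and [LABEL] here has no empty root entry):

    function parse_uri_components($url)
    {
        $p = parse_url($url);
        $r = array(
            'SCHEME'   => isset($p['scheme'])   ? $p['scheme']   : '',
            'USER'     => isset($p['user'])     ? $p['user']     : '',
            'PASS'     => isset($p['pass'])     ? $p['pass']     : '',
            'HOST'     => isset($p['host'])     ? $p['host']     : '',
            'PORT'     => isset($p['port'])     ? $p['port']     : '',
            'PATH'     => isset($p['path'])     ? $p['path']     : '',
            'QUERY'    => isset($p['query'])    ? $p['query']    : '',
            'FRAGMENT' => isset($p['fragment']) ? $p['fragment'] : '',
        );
        // [LABEL]: host labels in reverse order, as in the listing above
        $r['LABEL'] = array_reverse(explode('.', $r['HOST']));
        // [SEGMENT]: the non-empty path segments
        $r['SEGMENT'] = array_values(array_filter(explode('/', $r['PATH']), 'strlen'));
        // [GET]: the query string parsed into key/value pairs
        parse_str($r['QUERY'], $r['GET']);
        return $r;
    }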

  4. From http://us3.php.net/manual/en/function.parse-url.php#93983


    For some odd reason, parse_url returns the host (e.g. example.com) as the path when no scheme is provided in the input URL. So I've written a quick function to get the real host:


    function getHost($Address) {
    $parseUrl = parse_url(trim($Address));
    if (!empty($parseUrl['host'])) return $parseUrl['host'];
    // No scheme given: parse_url() put the host into 'path', so take the part before the first '/'
    $path = explode('/', $parseUrl['path'], 2);
    return trim($path[0]);
    }

    getHost("example.com"); // Gives example.com
    getHost("http://example.com"); // Gives example.com
    getHost("www.example.com"); // Gives www.example.com
    getHost("http://example.com/xyz"); // Gives example.com

  5. The code that was meant to work 100% didn't seem to cut it for me. I patched the example a little, but found code that wasn't helping and problems with it, so I changed it into a couple of functions (to save asking Mozilla for the list all the time, and I removed the cache system). This has been tested against a set of 1000 URLs and seemed to work.

    function domain($url)
    {
    global $subtlds;
    $url = strtolower($url);

    $host = parse_url('http://' . $url, PHP_URL_HOST);

    // Default: the registrable domain is the last two labels (e.g. example.com)
    preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    foreach ($subtlds as $sub) {
    // If the host ends in a known multi-part suffix, take three labels instead
    if (preg_match('/\.' . preg_quote($sub, '/') . '$/', $host)) {
    preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    }
    }

    return isset($matches[0]) ? $matches[0] : '';
    }

    function get_tlds()
    {
    $address = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
    $subtlds = array();
    foreach (file($address) as $line) {
    $line = trim($line);
    if ($line == '') continue;
    if (substr($line, 0, 2) == '//') continue; // skip comment lines
    $line = preg_replace("/[^a-zA-Z0-9\.]/", '', $line);
    if ($line == '') continue;
    if ($line[0] == '.') $line = substr($line, 1);
    if (!strstr($line, '.')) continue; // only keep multi-part suffixes
    $subtlds[] = $line;
    }

    // Add suffixes that are missing from the list (e.g. co.uk) manually
    $subtlds = array_merge(array(
    'co.uk', 'me.uk', 'net.uk', 'org.uk', 'sch.uk', 'ac.uk',
    'gov.uk', 'nhs.uk', 'police.uk', 'mod.uk', 'asn.au', 'com.au',
    'net.au', 'id.au', 'org.au', 'edu.au', 'gov.au', 'csiro.au'
    ), $subtlds);

    return array_unique($subtlds);
    }


    Then use it like:

    $subtlds = get_tlds();
    echo domain('www.example.com');    // outputs: example.com
    echo domain('www.example.uk.com'); // outputs: example.uk.com
    echo domain('www.example.fr');     // outputs: example.fr


    I know I should have turned this into a class, but didn't have time.
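
    Since get_tlds() fetches the Mozilla list on every call, you could wrap it in a simple file cache. A minimal sketch (the cache path and the one-day lifetime are assumptions):

    function get_tlds_cached($cacheFile = '/tmp/subtlds.cache')
    {
    // Reuse the cached list if it is less than a day old
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < 86400) {
    return unserialize(file_get_contents($cacheFile));
    }
    $subtlds = get_tlds();
    file_put_contents($cacheFile, serialize($subtlds));
    return $subtlds;
    }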

  6. Here is the code I made that 100% finds only the domain name, since it takes the Mozilla sub-TLDs into account. The only thing you have to check is how you cache that file, so you don't query Mozilla every time.

    For some strange reason, domains like co.uk are not in the list, so you have to do some hacking and add them manually. It's not the cleanest solution, but I hope it helps someone.

    //=====================================================
    static function domain($url)
    {
    $url = strtolower($url);

    $address = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
    if (!$subtlds = Kohana::cache('subtlds', null, 60))
    {
    $subtlds = array();
    foreach (file($address) as $line)
    {
    $line = trim($line);
    if ($line == '') continue;
    if (substr($line, 0, 2) == '//') continue; // skip comment lines
    $line = preg_replace("/[^a-zA-Z0-9\.]/", '', $line);
    if ($line == '') continue;
    if ($line[0] == '.') $line = substr($line, 1);
    if (!strstr($line, '.')) continue; // only keep multi-part suffixes
    $subtlds[] = $line;
    }
    // Add suffixes that are missing from the list (e.g. co.uk) manually
    $subtlds = array_merge(array(
    'co.uk', 'me.uk', 'net.uk', 'org.uk', 'sch.uk', 'ac.uk',
    'gov.uk', 'nhs.uk', 'police.uk', 'mod.uk', 'asn.au', 'com.au',
    'net.au', 'id.au', 'org.au', 'edu.au', 'gov.au', 'csiro.au'
    ), $subtlds);

    $subtlds = array_unique($subtlds);
    Kohana::cache('subtlds', $subtlds);
    }

    // Strip an optional scheme and anything after the first slash
    preg_match('/^(https?:\/\/)?([^\/]+)/i', $url, $matches);
    $host = isset($matches[2]) ? $matches[2] : '';

    // Default: last two labels; widen to three for known multi-part suffixes
    preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    foreach ($subtlds as $sub)
    {
    if (preg_match('/\.' . preg_quote($sub, '/') . '$/', $host))
    {
    preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    }
    }

    return isset($matches[0]) ? $matches[0] : '';
    }
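
    Hypothetical usage, assuming the static method above lives in a helper class called Url (the class name is an assumption, as the snippet doesn't show it):

    echo Url::domain('http://www.example.co.uk/some/page'); // example.co.uk
    echo Url::domain('www.example.com/index.html');         // example.com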


