🕷️ Crawler Inspector
Raw Queries and Responses
1. Shard Calculation
Query:
curl -X POST \
  'http://laksa086.int.ahrefs:8124/' \
  -H 'Content-Type: text/plain' \
  -H 'X-ClickHouse-Database: crawler3' \
  -H 'Authorization: Basic YXBpOg==' \
  -d "SELECT
        getAhrefsRootHashFromUnparsed(getAhrefsUnparsedNoserviceFromURL('https://www.facebook.com/61577603379145')) AS root_hash,
        root_hash % 200 AS shard
      FORMAT JSONEachRow"
Response:
{"root_hash":"6329005247723782277","shard":77}
Calculated Shard:
77 (from laksa086)
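The shard in the response is plain modular arithmetic on the root hash, so it can be cross-checked offline. The sketch below parses the JSONEachRow line shown above and recomputes `root_hash % 200`. The function names are hypothetical, and the host-name pattern `laksa<shard, zero-padded to 3 digits>.int.ahrefs` is only an assumption, inferred from shard 77 being queried on `laksa077` in the later steps.

```python
import json

NUM_SHARDS = 200  # modulus taken from the query: root_hash % 200


def shard_for(root_hash: int) -> int:
    """Return the shard number for a root hash."""
    return root_hash % NUM_SHARDS


def shard_host(shard: int) -> str:
    """Build the host name for a shard.

    ASSUMPTION: the zero-padded 'laksaNNN' pattern is inferred from this
    page only (shard 77 is queried on laksa077); it is not confirmed.
    """
    return f"laksa{shard:03d}.int.ahrefs"


# Response line copied from the Shard Calculation step.
response_line = '{"root_hash":"6329005247723782277","shard":77}'
row = json.loads(response_line)

# The 64-bit hash arrives as a quoted string, so convert it explicitly.
root_hash = int(row["root_hash"])
computed = shard_for(root_hash)

print(computed, computed == row["shard"], shard_host(computed))
```

For the response above this recomputes shard 77, which matches the value reported by the server.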
2. Crawled Status Check
Query:
curl -X POST \
  'http://laksa077.int.ahrefs:8124/' \
  -H 'Content-Type: text/plain' \
  -H 'X-ClickHouse-Database: crawler3' \
  -H 'Authorization: Basic YXBpOg==' \
  -d "SELECT
        getAhrefsURLFromUnparsed(src_unparsed) AS found_url,
        ifNull(toUnixTimestamp(download_stamp), 0) AS crawl_time,
        ifNull(toUnixTimestamp(props_url_first_seen), 0) AS first_indexed_time,
        download_http_code AS http_code,
        src_unparsed AS src_unparsed,
        src_root_hash AS src_root_hash,
        history_drop_reason AS history_drop_reason,
        meta_title AS meta_title,
        meta_descriptions AS meta_descriptions,
        attrs_boilerpipe_text AS attrs_boilerpipe_text,
        attrs_markdown AS attrs_markdown,
        attrs_readable_markdown AS attrs_readable_markdown,
        meta_canonical AS meta_canonical,
        ml_categories_json AS ml_categories_json,
        ml_types_json AS ml_types_json,
        ml_intent_types_json AS ml_intent_types_json,
        meta_language AS meta_language,
        attrs_author AS attrs_author,
        ifNull(toUnixTimestamp(attrs_publish_time), 0) AS attrs_publish_time,
        ifNull(toUnixTimestamp(attrs_original_publish_time), 0) AS attrs_original_publish_time,
        ifNull(attrs_is_republished, 0) AS attrs_is_republished,
        ifNull(attrs_nr_words, 0) AS attrs_nr_words,
        ifNull(attrs_boilerpipe_nr_words, 0) AS attrs_boilerpipe_nr_words,
        ifNull(body_ext_links_number, 0) AS body_ext_links_number,
        ifNull(body_int_links_number, 0) AS body_int_links_number,
        ifNull(meta_nofollow, 0) AS meta_nofollow,
        ifNull(meta_noarchive, 0) AS meta_noarchive,
        ifNull(props_was_rendered, 0) AS props_was_rendered,
        ifNull(src_redirect, '') AS src_redirect,
        ifNull(download_time_msec, 0) AS download_time_msec,
        ifNull(download_ttfb_msec, 0) AS download_ttfb_msec,
        ifNull(download_size, 0) AS download_size
      FROM crawler3.page_info_local FINAL
      PREWHERE (src_root_hash, src_unparsed) IN ((
        getAhrefsRootHashFromUnparsed(getAhrefsUnparsedNoserviceFromURL('https://www.facebook.com/61577603379145')),
        getAhrefsUnparsedNoserviceFromURL('https://www.facebook.com/61577603379145')
      ))
      FORMAT JSONEachRow"
Response:
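The crawled-status query wraps its timestamp columns in `ifNull(toUnixTimestamp(...), 0)`, so a value of 0 in `crawl_time`, `first_indexed_time`, `attrs_publish_time`, or `attrs_original_publish_time` stands for "not set" rather than the Unix epoch. A minimal sketch of decoding such fields follows; the helper name and the sample row are hypothetical, since no response is shown for this step.

```python
from datetime import datetime, timezone
from typing import Optional


def stamp_to_datetime(unix_seconds: int) -> Optional[datetime]:
    """Convert a Unix timestamp from the query into a UTC datetime.

    The query maps NULL timestamps to 0, so 0 is treated as "missing".
    """
    if unix_seconds == 0:
        return None
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


# HYPOTHETICAL row, for illustration only; not taken from a real response.
sample_row = {"crawl_time": 0, "first_indexed_time": 1700000000}

for field, value in sample_row.items():
    decoded = stamp_to_datetime(value)
    print(field, "not set" if decoded is None else decoded.isoformat())
```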
3. Robots.txt Check
Query:
curl -sS --get \
  'http://fish032.int.ahrefs:12055/access' \
  --data-urlencode 'max_retries=0' \
  --data-urlencode 'pid=1777085912:502213:page-crawl-status-tool@yepsand' \
  --data-urlencode 'kind=check' \
  --data-urlencode 'url=https://www.facebook.com/61577603379145'
Response:
"Disallowed"
4. Spam/Ban Check
Query:
curl -X POST \
  'http://laksa077.int.ahrefs:8124/' \
  -H 'Content-Type: text/plain' \
  -H 'X-ClickHouse-Database: crawler3' \
  -H 'Authorization: Basic YXBpOg==' \
  -d "SELECT fh_dont_index, ml_spam_score
      FROM robots.target_settings_local FINAL
      WHERE src_root_hash = getAhrefsRootHashFromUnparsed(getAhrefsUnparsedNoserviceFromURL('https://www.facebook.com/61577603379145'))
        AND startsWith(getAhrefsDropPortFromUnparsed(getAhrefsUnparsedNoserviceFromURL('https://www.facebook.com/61577603379145')), src_unparsed_prefix)
      ORDER BY length(src_unparsed_prefix) DESC
      LIMIT 1
      FORMAT JSONEachRow"
Response:
5. Seen Status Check
Query:
curl -X POST \
  'http://laksa077.int.ahrefs:8124/' \
  -H 'Content-Type: text/plain' \
  -H 'X-ClickHouse-Database: crawler3' \
  -H 'Authorization: Basic YXBpOg==' \
  -d "(SELECT
         getAhrefsURLFromUnparsed(dst_unparsed) AS found_url,
         dst_unparsed AS unparsed,
         dst_root_hash AS root_hash
       FROM crawler3.urls_local FINAL
       PREWHERE (dst_root_hash, dst_unparsed) IN ((
         getAhrefsRootHashFromUnparsed(getAhrefsUnparsedNoserviceFromURL('https://www.facebook.com/61577603379145')),
         getAhrefsUnparsedNoserviceFromURL('https://www.facebook.com/61577603379145')
       )))
      UNION ALL
      (SELECT
         getAhrefsURLFromUnparsed(src_unparsed) AS found_url,
         src_unparsed AS unparsed,
         src_root_hash AS root_hash
       FROM web_queue.crawl5_local FINAL
       PREWHERE crawl_yyyymm >= toYYYYMM(today() - INTERVAL 2 MONTHS)
         AND (src_root_hash, src_unparsed) IN ((
           getAhrefsRootHashFromUnparsed(getAhrefsUnparsedNoserviceFromURL('https://www.facebook.com/61577603379145')),
           getAhrefsUnparsedNoserviceFromURL('https://www.facebook.com/61577603379145')
         )))
      FORMAT JSONEachRow"
Response:
Crawled check error: Failed to connect to laksa077.int.ahrefs port 8124 after 1 ms: Could not connect to server
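Every ClickHouse query on this page ends in `FORMAT JSONEachRow`, which returns one JSON object per line and an empty body when nothing matches. A connection failure like the one above yields no body at all, so a caller may want to keep "no rows" and "request failed" apart. The sketch below shows one way to do that; both function names are hypothetical and it is not the tool's actual implementation.

```python
import json
from typing import Any, Dict, List, Optional


def parse_json_each_row(body: str) -> List[Dict[str, Any]]:
    """Parse a JSONEachRow body: one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def describe_check(body: Optional[str], error: Optional[str]) -> str:
    """Summarise one check, separating transport errors from empty results."""
    if error is not None:
        # The request never completed, so nothing can be said about the URL.
        return f"failed: {error}"
    rows = parse_json_each_row(body or "")
    return f"{len(rows)} row(s)"


# A successful response with one row, as in the Shard Calculation step.
print(describe_check('{"root_hash":"6329005247723782277","shard":77}\n', None))
# A successful response with no matching rows.
print(describe_check("", None))
# A request that failed before any response arrived.
print(describe_check(None, "Could not connect to server"))
```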