How to crawl a website
A crawler (or spider) collects all the links that a site contains and references. It isn't dirbusting: a crawler only follows links, so it can't find hidden directories that nothing links to.
With a crawler you can more easily find hard-to-find website functions or interesting links (like URLs with parameters).
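As a small illustration of the "URL parameters" point, the sketch below filters a list of crawled URLs down to those that carry a query string. The function name `urls_with_params` and the sample URLs are made up for this example, and it assumes one URL per line.

```shell
# Hypothetical helper: keep only URLs that contain a query parameter (?name=value).
# Reads the files given as arguments, or standard input if none are given.
urls_with_params() {
  grep -E '\?[^#]*=' "$@"
}

# Example with made-up URLs:
printf '%s\n' \
  'https://example.com/' \
  'https://example.com/item?id=1' \
  'https://example.com/about#team' | urls_with_params
# prints: https://example.com/item?id=1
```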
How to crawl
We will use two tools: katana and gau.
katana is a fast and feature-rich crawler:
- you can just crawl a site: katana -u
- crawl .js files for additional links (-jc -jsl)
- use a headless browser (in case you get blocked)
- etc…
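The options above can be wrapped in small shell functions. This is only a sketch: the function names and the `https://example.com` target are placeholders, and it assumes katana is installed and on your PATH. Only the `-u`, `-jc` and `-jsl` flags mentioned above are used.

```shell
# Hypothetical wrappers around katana; pass the start URL as the first argument.

# Plain crawl: follow links starting from the given URL.
crawl_plain() {
  katana -u "$1"
}

# Crawl and also parse JavaScript files for additional links.
crawl_js() {
  katana -u "$1" -jc -jsl
}

# Usage (placeholder target):
#   crawl_js https://example.com > katana_urls.txt
```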
gau doesn't crawl the site from your computer. Instead, it uses data from public internet crawlers:
- AlienVault’s Open Threat Exchange
- the Wayback Machine
- Common Crawl
- and URLScan
You can get crawl data in just 3 seconds!
This data may be outdated or incomplete, but a site may remove a link to a page without removing the page itself, so it is still worth using.
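A minimal sketch of calling gau from a script. The function names and `example.com` are placeholders, and it assumes gau is installed. To my knowledge, gau's `--providers` flag limits the lookup to the named sources, with `wayback`, `commoncrawl`, `otx` and `urlscan` corresponding to the four sources listed above.

```shell
# Hypothetical wrappers around gau; pass the domain as the last argument.

# Query every source gau knows about.
archived_urls() {
  gau "$1"
}

# Query only selected sources, e.g. "wayback,commoncrawl".
archived_urls_from() {
  local providers="$1" domain="$2"
  gau --providers "$providers" "$domain"
}

# Usage (placeholder domain):
#   archived_urls example.com > gau_urls.txt
#   archived_urls_from wayback,otx example.com
```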
Let's make a small bash script that uses both tools:
gau >> crawl
katana -u >> crawl
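# uniq -u would drop every URL found by both tools, and "> crawl" would
# truncate the file before sort reads it, so deduplicate via a temporary file.
sort -u crawl > crawl.tmp && mv crawl.tmp crawl # Pro tip: insert httpx here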
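When merging the output of two tools, the deduplication step is easy to get wrong: `uniq -u` prints only lines that are not repeated, so a URL reported by both gau and katana would vanish, while `sort -u` keeps one copy of each line. The sketch below shows the difference on made-up URLs; `merge_urls` is a hypothetical helper name.

```shell
# Hypothetical helper: merge URL lists and keep exactly one copy of each URL.
# Reads the files given as arguments, or standard input if none are given.
merge_urls() {
  sort -u "$@"
}

# Made-up sample: /b was found by both tools.
sample='https://example.com/b
https://example.com/a
https://example.com/b'

# uniq -u drops /b completely because it appears more than once.
printf '%s\n' "$sample" | sort | uniq -u
# prints: https://example.com/a

# sort -u keeps both URLs, once each.
printf '%s\n' "$sample" | merge_urls
# prints: https://example.com/a
#         https://example.com/b
```

To rewrite a file in place, `sort -u crawl -o crawl` is also safe, because sort reads all of its input before it opens the output file, whereas a plain `> crawl` redirect empties the file before anything reads it.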