The robots.txt
file, a longstanding standard, instructs web robots like search engine crawlers on which areas of a website they can and cannot access.
For instance:
    User-agent: *
    Disallow: /private/

This snippet prevents all robots from accessing the /private/ directory.
    User-agent: Googlebot
    Disallow: /users/

This one specifically instructs Googlebot to avoid the /users/ directory.
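To see how these rules are consumed in practice, here is a minimal sketch using Python's standard urllib.robotparser module. It combines the two snippets above into one rule set; the example.com URLs and the crawler name SomeCrawler are placeholders.

    from urllib import robotparser

    # The two example rule sets from above, combined into one robots.txt.
    rules = [
        "User-agent: *",
        "Disallow: /private/",
        "",
        "User-agent: Googlebot",
        "Disallow: /users/",
    ]

    parser = robotparser.RobotFileParser()
    parser.parse(rules)

    # A compliant crawler asks before every fetch.
    print(parser.can_fetch("SomeCrawler", "https://example.com/private/data.html"))  # False
    print(parser.can_fetch("Googlebot", "https://example.com/users/alice"))          # False
    print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))            # True

Crucially, nothing here is enforced by the server: it is the crawler that chooses to ask.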
The AI Conundrum
The existing robots.txt standard falls short in the age of AI. It only lets site owners permit or restrict crawling of particular paths; there is no way to define more nuanced rules about how fetched content may be used.
We need more granular controls, such as:
- Indexing: Can a web crawler index the content?
- Caching: Can a web crawler cache the content?
- LLM Training: Can the content be used for language model training?
- Summarising: Can a web crawler summarise the content?
- And more…
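One way to picture such controls is as purpose-specific directives alongside the familiar Allow/Disallow rules. The sketch below is purely hypothetical; none of these directives (NoIndex, NoCache, NoAITraining, NoSummary) exist in the current standard, and they serve only to illustrate the missing vocabulary.

    User-agent: *
    Disallow: /private/
    # Hypothetical, non-standard directives for illustration only
    NoIndex: /drafts/
    NoCache: /news/
    NoAITraining: /
    NoSummary: /articles/

Whether such rules would belong in robots.txt, in HTTP headers, or somewhere else entirely is a separate debate; the point is that robots.txt itself offers no way to express them today.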
These uses did not exist when robots.txt was designed, and website owners should have the right to decide how their content is utilized in these ways.
The Need for Enforcement
Robust rules are meaningless without enforcement, and the robots.txt file alone cannot provide it: compliance is entirely voluntary, so it does little to prevent misuse by entities that choose to ignore it.
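To make that gap concrete, here is a minimal sketch using Python's standard urllib.request module (the URL and User-Agent string are placeholders): the server only ever sees a self-reported User-Agent header, which any client is free to set to whatever it likes.

    import urllib.request

    # robots.txt compliance is voluntary. A crawler that chooses to ignore it
    # can simply present a browser-like User-Agent string.
    req = urllib.request.Request(
        "https://example.com/private/data.html",           # placeholder URL
        headers={"User-Agent": "Mozilla/5.0 (Macintosh)"},  # self-reported identity
    )
    # The server receives only the claimed header; it has no reliable way to
    # distinguish a misbehaving crawler from an ordinary browser.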
A case in point is Perplexity AI, a company that has recently been accused of using a fake user agent to bypass robots.txt restrictions, effectively impersonating a human user. Reporting by Wired and by MacStories corroborates the allegation.
In Conclusion
Clearly, rules need teeth. We need regulatory bodies that address content owners’ complaints and impose penalties on companies like Perplexity AI that disregard the rules, especially since smaller creators often lack the resources to pursue legal action against larger entities.
As with any tool, it’s the application that matters. AI holds the potential for positive innovation, but not at the cost of exploiting the work and rights of others.
Disclaimer
The cover image was generated using AI – a testament to its capabilities, even if imperfect, and certainly surpassing my own artistic skills. The article’s content, however, is entirely human-generated.