Protecting your content in the AI age
In the age of digital content, publishers of all types face questions about how to handle the challenges and opportunities brought on by AI.
We know that many publishers are keen to protect their content from unauthorised scraping; bots can rapidly collect and aggregate content from websites, sometimes without the publishers’ consent or knowledge.
In the first of a two-part blog post, we look to define what data scraping bots are and provide practical tips for publishers looking to protect their content. In the second part, we’ll explore some of the opportunities for publishers in the AI age (spoiler alert: offline content is making a comeback!).
What are data scraping bots?
Data scraping bots are designed to automate the process of extracting large volumes of data from websites. They have become better known recently through the debate around bots designed to extract and synthesise data for use in training Generative AI models.
The challenge for content owners is that not all bots are bad! Some bots can help drive more traffic to your website by accessing and processing your content for search engines like Google or Bing, making it more likely to appear as a search result. Usually, websites will have defences - either set up by the site owner or provided by hosts such as Cloudflare and Akamai - to protect parts of the site from malicious bots.
The problem arises when some of these defences rely on good behaviour – much like a “keep off the grass” sign – and malicious bots are able to disregard them.
How can you protect your content?
There is no magic bullet to completely protect your content; so long as it is available on the internet, there is a risk of it being accessed and used in ways you did not intend. That’s why a combination of both technical and legal defences is important.
Technical Defences
Technical defences are often the first line of protection against content scraping. While no method is foolproof, a combination of these strategies can make it harder (but not impossible) for bots to access content:
- Robots.txt and Meta Tags: While these can’t prevent access, they serve as a signal to well-behaved bots about which parts of your site to avoid (see the example robots.txt after this list). Tools like Dark Visitors can automatically update and manage your robots.txt, and track the bots visiting your site.
- CAPTCHA and reCAPTCHA: These tools are designed to prevent bots from accessing content by requiring users to complete a test that’s easy for humans but difficult for machines. Implementing CAPTCHA on key pages - like login forms or comment sections - can be effective in deterring bots, but overuse can degrade the user experience, so a balance is required (see the verification sketch after this list).
- Rate Limiting: Rate limiting restricts the number of requests a bot can make in a given timeframe. It is often provided by your content host, or something you can configure if you host your own content - in fact, you may well already have it enabled. Cloudflare has a great article on rate limiting: https://www.cloudflare.com/en-gb/learning/bots/what-is-rate-limiting/.
- Honey Pot Traps: A honey pot is a hidden field on your website that human users cannot see or interact with, but that bots might attempt to fill out. If they do, it flags to you that a bot is at work, allowing you to take mitigating action (e.g. blocking its IP address) - see the sketch after this list.
- Watermarking: For visual content like images or videos, watermarking may deter unauthorised use. Even if the content is scraped, the watermark may remain, making it clear where the content originated (see the sketch after this list).
- Content Delivery Networks and Bot Protection: If you host content using a delivery network such as Cloudflare or Akamai, these providers often have bot defences that will shield your content from many bots. You may well have seen the checkbox that asks you to verify you’re human (much like a CAPTCHA) before visiting a website hosted on Cloudflare.
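To illustrate the robots.txt approach above, here is a minimal example that asks some of the widely documented AI training crawlers to stay away. Bear in mind this is only a voluntary signal, and user-agent names change over time, so check each crawler’s documentation (or a tracker like Dark Visitors) before relying on any fixed list:

```
# Example robots.txt - a voluntary signal that only well-behaved bots will honour.
# Ask some widely documented AI training crawlers to avoid the whole site:
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Leave ordinary search indexing untouched:
User-agent: *
Allow: /
```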
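If you use Google’s reCAPTCHA, the token produced by the widget still has to be checked on your server. Here is a rough sketch of that verification step, assuming a Python backend with the `requests` library; `RECAPTCHA_SECRET` is a placeholder for your own secret key:

```python
# Sketch of server-side reCAPTCHA verification (assumes the `requests` library;
# RECAPTCHA_SECRET is a placeholder - keep the real key out of source control).
import requests

RECAPTCHA_SECRET = "your-secret-key"

def verify_captcha(token: str, remote_ip: str | None = None) -> bool:
    """Return True if Google confirms the CAPTCHA token came from a human."""
    response = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET, "response": token, "remoteip": remote_ip},
        timeout=5,
    )
    return response.json().get("success", False)
```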
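A honey pot is equally simple to wire up. The sketch below uses Flask purely as an example - any web framework works the same way. The form contains a field that human visitors never see, and any submission that fills it in is treated as a bot:

```python
# Minimal honey pot sketch using Flask (pip install flask); the route and
# field names are illustrative only.
from flask import Flask, request, abort

app = Flask(__name__)

# The "website" field is hidden from humans via CSS, but naive bots that
# auto-fill every input will populate it.
FORM_HTML = """
<form method="post" action="/contact">
  <input type="text" name="email" placeholder="Your email">
  <!-- Honey pot: invisible to humans, tempting to bots -->
  <input type="text" name="website" style="display:none" tabindex="-1" autocomplete="off">
  <button type="submit">Send</button>
</form>
"""

@app.get("/contact")
def contact_form():
    return FORM_HTML

@app.post("/contact")
def contact_submit():
    # A real visitor never fills in the hidden field.
    if request.form.get("website"):
        # Flagged as a bot: log it, block the IP, or silently drop the request.
        abort(403)
    return "Thanks, we received your message."
```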
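Watermarking can also be automated as part of your publishing pipeline. Here is a minimal sketch using the Pillow imaging library; the file names and notice text are placeholders:

```python
# Minimal watermarking sketch using Pillow (pip install Pillow).
from PIL import Image, ImageDraw, ImageFont

def add_watermark(input_path: str, output_path: str, text: str = "© Example Publisher") -> None:
    image = Image.open(input_path).convert("RGBA")
    overlay = Image.new("RGBA", image.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    font = ImageFont.load_default()

    # Place a semi-transparent notice in the bottom-right corner.
    left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
    position = (image.width - (right - left) - 10, image.height - (bottom - top) - 10)
    draw.text(position, text, font=font, fill=(255, 255, 255, 160))

    Image.alpha_composite(image, overlay).convert("RGB").save(output_path, "JPEG")

add_watermark("original.jpg", "watermarked.jpg")
```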
Legal Protections
While technical defences are important, they are not always sufficient on their own. You might also want to think about legal protections that can serve as a deterrent to would-be scrapers:
- Terms of Service (ToS): Clearly outline in your website’s Terms of Service that you do not allow scraping or the use of any content for training AI. While this won’t stop all bots, it provides a legal foundation for pursuing action against offenders.
- Copyright Notices: Ensure that your website and all of its content are marked with copyright notices. This reinforces your ownership of the content and can be useful in legal disputes.
- Copyright Infringement Takedown Notices: If your scraped content appears in the wrong place on the internet, you may be able to get it removed via a takedown notice, such as under the US Digital Millennium Copyright Act and the EU’s Digital Services Act.
- Legal Action: In extreme cases, it may be necessary to take legal action against scrapers, particularly if they are causing significant financial harm. While lawsuits can be costly and drawn out, they can result in compensation and serve as a deterrent to others.
Upcoming legislation
The EU’s new AI Act will soon require providers of “general-purpose AI” to:
- put in place a policy to comply with EU copyright law.
- use state-of-the-art technologies to identify opt-outs by copyright holders from commercial text and data mining of their works.
- publish a detailed summary of the content used to train the AI.
How can you get fair compensation for use of your content?
Getting paid for your content can be challenging, but at Human Native AI, we’re building a fairer AI marketplace that can help you simplify this process.
Human Native AI can help you navigate data licensing:
- We give you granular control over how your data is used and at what price.
- We index, benchmark and evaluate your data to help you determine its quality and value, and we clean and prepare it so it’s in the right format for AI developers to use.
- Our analytics tools allow you to see how your content is being used.
Want to hear more about our data licensing solutions? Sign up now.