Web scraping’s legal status in the EU

Web scraping is straightforward technically, but the legal side? Not so much. It’s one of the most efficient ways to gather and structure online data at scale, which is why journalists, researchers, and engineers use it to build tools and insights from public data. Scraping lets you move fast: you get real-world data into a format that can be analyzed, whether you’re testing broadband pricing differences or tracking online services across cities.

In the EU, though, you can’t just scrape first and figure out the rules later. The legal environment is fragmented based on what kind of data you’re touching: personal data falls under GDPR, and even non-personal data might be protected by database law. Both can trigger legal obligations or restrictions depending on how and why the data was organized or published. It all comes down to purpose, context, and jurisdiction.

For business leaders, especially in tech or data-heavy spaces, the signal is clear: either build with compliance in mind or waste time on legal cleanup later. There’s too much opportunity in smart data collection to ignore it, but no executive wants that opportunity followed by regulatory risk. Understand the lines, build frameworks for accountability, and move efficiently.

Non-personal data is generally less regulated

Non-personal data unlocks a lot of operational efficiency: you’re not dealing with people’s identities or privacy concerns. That’s an obvious green light for most teams looking to scale research, AI model training, or service tracking. But don’t get complacent. Just because data isn’t tied to individuals doesn’t make it law-free in the EU.

The EU’s Database Directive creates two layers of protection: copyright for databases whose selection or arrangement reflects creative effort, and a sui generis right for databases built with substantial investment. That means the way the data is structured or collected could give the publisher a legal lever to restrict use. In practice, the bar is high. EU courts have applied this protection mainly where scraping threatens the database’s business model or income, a clear signal that most functional datasets don’t cross that threshold. It’s rare for a scraped dataset to qualify as a protected database unless you’re taking someone’s monetized core offering.

Still, executives should understand the risk before deploying scraping at scale. Don’t assume zero regulation just because the data isn’t personal. Have legal counsel assess whether target databases qualify under EU protections. And if there’s a real revenue engine behind the site’s data display, expect pushback.

Focus your team on two things: 1) sticking to publicly available, minimally structured data and 2) making sure that even if contested, the scrape doesn’t undermine the source’s commercial viability. That will put you well inside the practical safe zone for most commercial or research-driven data scraping operations in Europe.

Research institutions enjoy expanded rights to conduct data scraping

The EU understands that data fuels progress when used responsibly. That’s why the 2019 Digital Single Market Directive, which member states had to implement by 2021, expanded text and data mining allowances for registered research institutions and cultural heritage organizations. These groups can now conduct text and data mining on any data they have lawful access to, whether it’s free data online or content behind a subscription they legally hold.

Here’s the limitation: this safe harbor covers only bodies engaged in public-interest scientific research, meaning universities and national research labs. It doesn’t clarify whether nonprofit journalism groups qualify, even if their work is public-facing and evidence-based. From a compliance standpoint, you shouldn’t assume you’re covered just because your work benefits the public. Legal interpretation hasn’t caught up yet.

There’s a direct path forward, though. Data teams at private companies or nonprofits can collaborate with qualified research institutions under public-private partnerships. If the research aligns with one of the EU’s Framework Programmes for scientific development, then the protections apply. It’s a viable strategy for any organization aiming to mine data responsibly under EU law.

For business leaders working in data-driven sectors or operating R&D functions across EU jurisdictions, this nuance matters. If your team wants to scrape at scale, routing efforts through a qualified university or research affiliate can create legal clarity. It also opens the door for positive regulatory alignment and long-term credibility with EU institutions.

Website terms of service (ToS) can legally restrict scraping

Just because data isn’t protected by copyright or personal privacy laws doesn’t make it free to use. Many sites legally bind users through Terms of Service that forbid scraping or batch data retrieval. And in the EU, those terms carry civil enforcement weight, even if there’s no criminal penalty involved.

The Ryanair v. PR Aviation case (C-30/14, 2015) shows exactly what this looks like in practice. PR Aviation was aggregating flight information from Ryanair to display on its own platform. Ryanair’s data wasn’t covered by copyright or the sui generis database right, but Ryanair still prevailed because of its Terms of Service: the court held that, precisely because the database fell outside the Directive’s protection, Ryanair was free to restrict its use by contract, making scraping in violation of those terms enforceable under contract law.

As an executive managing legal risk, this point is critical. Scraping policies need to be evaluated on a per-site basis. A scraper that ignores ToS behaves like a user violating a binding agreement. That leaves your team open to lawsuits, injunctions, or other legal friction, especially in merger reviews, investor audits, or public scrutiny.

Scraping is still permitted in many circumstances. Many websites don’t have explicit clauses restricting it, and not every jurisdiction has favored enforcement. Still, you can’t afford ambiguity. Have legal counsel pressure-test your interpretation early. In most cases, scraping public data without breaching ToS, or with the site’s explicit or technical permission, is both a safer and more scalable model. Build company policy to reflect that and save your legal team the stress.
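One lightweight form of “technical permission” is a site’s robots.txt policy. Honoring it doesn’t override a ToS, but it is a baseline good-faith signal that a scraper respects the operator’s stated limits. A minimal sketch using Python’s standard library, with an illustrative policy, bot name, and paths (in practice you would fetch the file from the target site):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt policy, inlined for the example; a real
# crawler would fetch it from https://<site>/robots.txt first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check each URL against the policy before scraping it.
print(rp.can_fetch("my-research-bot", "https://example.com/pricing"))
print(rp.can_fetch("my-research-bot", "https://example.com/private/x"))
```

The first check passes and the second fails under this policy; a well-behaved scraper would skip the disallowed path and honor the declared crawl delay between requests.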

Scraping personal data triggers stringent GDPR compliance requirements

Scraping becomes more than a technical issue once personal data enters the picture. Under the EU’s General Data Protection Regulation (GDPR), any data tied to an identifiable individual (names, email addresses, location data, online identifiers) is regulated. If your scraper collects any of that, your organization becomes a “data controller,” which comes with legal duties, liabilities, and documentation mandates.

First, you need a legal basis to collect or process that data. “Legitimate interest” is the usual route, especially for journalism, research, or advocacy work. But that doesn’t mean your interests automatically outweigh someone’s privacy rights. You need to justify the data collection, assess the risk to individuals, document your analysis, and make sure you’ve taken reasonable steps to minimize and secure the data. That framework includes limiting what you collect, storing it securely, potentially conducting a Data Protection Impact Assessment (DPIA), and giving individuals ways to opt out or request deletion.

Scraping personal data also means complying with disclosure requirements. You’re expected to notify individuals, often through a privacy notice, that their data is being processed. Even if that’s hard to do at scale, you aren’t exempt.

For executives, here’s the key point: if your data workflows touch personal information from the EU, expect oversight and be ready to respond. Teams should avoid collecting unnecessary identifiers. If the data isn’t essential to the outcome you’re driving, don’t collect it. The overhead, from compliance to storage to possible regulatory audits, isn’t worth collecting data you can’t justify using.

Pseudonymized data remains subject to GDPR

There’s a critical distinction in EU data law that a lot of technical teams overlook. Removing names or email addresses from a dataset doesn’t automatically exempt it from GDPR. If there’s still a way to link the data, directly or indirectly, back to a person, it’s considered pseudonymized. That puts it squarely under GDPR.

Only anonymized data, where re-identification is no longer possible by any reasonably available methods, falls outside the regulation. That bar is high. You can’t rely on weak identifiers or assume other datasets won’t be combined to reverse-engineer identities. The EU expects a complete assessment of how data could be re-linked before declaring it exempt.

If your teams are working with stripped-down datasets that still reference online behavior, device IDs, or structured attributes, treat that data as regulated until you can prove otherwise. Internal documentation is required. So is a risk-focused review of how data can move across systems, especially if third-party access is involved.
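The distinction is easy to demonstrate in code. Hashing an email address, a shortcut teams often mistake for anonymization, only pseudonymizes it: the mapping is deterministic, so anyone holding a list of candidate emails can recompute the hash and re-link the record. A minimal sketch in Python, with a hypothetical record and field names:

```python
import hashlib

def pseudonymize(email: str) -> str:
    # Deterministic: the same email always maps to the same token,
    # so this is pseudonymization, not anonymization.
    return hashlib.sha256(email.lower().encode()).hexdigest()

# Hypothetical scraped record with the identifier "removed".
record = {"user": pseudonymize("jane.doe@example.com"), "plan": "fiber-1gbit"}

# Re-identification: anyone with a guessed or leaked email
# can recompute the hash and link it back to the record.
guess = pseudonymize("jane.doe@example.com")
print(guess == record["user"])
```

Because that final comparison succeeds, the dataset remains linkable to a person and stays inside GDPR’s scope, exactly the trap described above.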

For business leaders, this is operational hygiene. Building real data governance means you know what’s being collected, how it’s stored, and when it becomes a compliance issue. If you’re investing in scraping or enrichment platforms, check that they’re architected to comply with GDPR standards around pseudonymization. And if there’s uncertainty? Treat the data with full compliance safeguards and avoid mistakes that lead to regulatory exposure.

Varying national implementations of GDPR and jurisdictional complexities

Scraping data from EU-based websites means navigating how each member state interprets and enforces the GDPR. The regulation allows, and in some cases requires, countries to set their own rules on how privacy law interacts with freedom of expression and journalistic activity. These rules differ. What’s protected speech or fair processing in one country could be a regulatory violation in another.

That matters. If your data pipeline touches content from multiple EU countries, you need to evaluate the regulatory lens of the jurisdiction where the data subject resides, where the servers hosting the site are located, and where your organization processes the data. You could be under multiple regulatory scopes at once.

This isn’t always intuitive. Some member states have tougher interpretations of exemptions for public interest research, while others require additional steps for processing personal data in a journalistic context. And the place where the scraped data is hosted might not be the same as the country whose courts would handle a dispute.

For executives running operations, this means centralized legal strategies are limited. A compliance check in Germany might not work in France. A dataset cleared by Dutch standards might raise issues in Ireland. The only viable approach is to assess legal risk country by country or design universal compliance protocols that align with the most robust sets of rules. If your company operates in media, data aggregation, AI, or analytics and handles EU personal data across borders, consider outside legal review as standard process.

Overloading a website through intensive scraping activity may lead to cybercrime charges

Even when scraping is technically legal, how it’s done matters. If your activity disrupts a website’s performance, by flooding it with requests or bypassing rate limits, you can face legal action under EU cybercrime law. The legal line isn’t scraping; it’s harm. EU law doesn’t require intent to damage for some types of offenses. Resource exhaustion, whether accidental or deliberate, can qualify as criminal under certain cybersecurity statutes if it degrades system availability or denies service to legitimate users.

This is often missed by development teams focused only on efficiency. A scraper that checks a site every second, doesn’t handle errors properly, or runs parallel requests without constraints can unintentionally mimic a denial-of-service. That brings immediate legal risk, particularly for systems hosting sensitive data or services the public relies on.

The takeaway for leadership: strong engineering practices help avoid legal threats. Scrapers must be designed with respect for server load, timeouts, and user-agent policies. Stagger request frequency. Implement backoff behavior. These aren’t just performance or ethical considerations; they reduce legal exposure. Ignoring them puts your company at risk of both civil claims and criminal liability.
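Those practices can be sketched in a few lines. A minimal, library-agnostic example in Python: requests are paced by a fixed interval, and failures trigger exponential backoff with jitter. The `fetch` callable is a placeholder the caller supplies (for example, a thin wrapper around an HTTP client); this is an illustration, not a production crawler:

```python
import random
import time

def polite_fetch(url, fetch, min_interval=2.0, max_retries=5, base_delay=1.0):
    """Fetch a URL politely: pace requests and back off on failure.

    `fetch` is any callable that returns a response or raises on
    error; it is injected so the sketch stays library-agnostic.
    """
    for attempt in range(max_retries):
        try:
            response = fetch(url)
            time.sleep(min_interval)  # pace requests; never hammer the server
            return response
        except Exception:
            # Exponential backoff with jitter: base, 2x, 4x, 8x ...
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts")
```

The pacing interval keeps steady-state load low; the backoff ensures a struggling server sees fewer requests, not more, which is exactly the behavior that separates a scraper from something that looks like a denial-of-service.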

If you’re deploying tools inside or outside of Europe that collect data at scale, integrate risk-aware engineering from the start. Teams should assume that every scraper hitting a public-facing service can be audited for intent and impact. Missteps here won’t just damage reputation; they can trigger regulatory investigations and requests for infrastructure review.

Emerging EU laws and proposed legislative changes

The current legal framework in the EU around web scraping is shifting. Several major legislative developments are already in motion: the Data Governance Act (applicable since September 2023), the Data Act (adopted in 2023, with most provisions applying from September 2025), and the long-stalled draft ePrivacy Regulation. If your business extracts or relies on public data at scale, these developments are operational priorities.

The Data Governance Act focuses on increasing access to public-sector information while introducing new controls around how that data is shared. It encourages data re-use by creating “data intermediaries” to oversee compliance. This means that scraping from government platforms may soon happen within a more structured, compliance-driven model. Developers and businesses will have to align with a centralized access framework for certain types of public data.

The Data Act is also worth attention. It defines who can access and use data generated by connected devices and services, potentially redrawing boundaries for data ownership and database rights. For companies relying on scraping from technical platforms, IoT services, or APIs, the rules may tighten. It also narrows the sui generis database right for machine-generated data, which will directly impact whether and how scraped databases remain protected under EU law.

Finally, the long-delayed ePrivacy Regulation seeks to complement GDPR with tighter rules around electronic communications, cookies, and metadata. While its final form isn’t confirmed, obligations flowing from it could add compliance burdens for businesses scraping user-facing content, especially data tied to communications or online tracking.

For C-level executives, this is a clear signal. Data strategy in the EU can’t be static. You need dedicated legal attention monitoring and interpreting updates from Brussels. Some of these changes will expand access. Others will impose additional safeguards. Either way, businesses that anticipate the shift, and adjust early, will face fewer downstream complications.

Using scraped data for machine learning and AI model training

When scraped data becomes training material for AI models, the legal calculus changes. Large language models and generative AI systems require vast datasets, often compiled through automated scraping, to function effectively. But using online content this way pushes against unsettled legal boundaries in both copyright and data protection regimes, especially in the EU.

While scraping publicly available data may seem compliant on the surface, using that content for model training introduces secondary use questions. Much of what’s scraped, such as articles, reviews, or user-generated content, is protected by copyright. Transforming that material into embeddings or model weights could infringe upon the original rights holders’ exclusive rights, even if the data was publicly accessible.

Privacy law plays a role too. If scraped datasets include information about identifiable individuals, and that data contributes to model behavior, companies could be found processing personal data without a legitimate basis. The scale of this issue is under increased scrutiny by regulators, as models trained on large internet datasets may unknowingly internalize and reproduce sensitive or protected data points.

There’s also regulatory inertia. The law hasn’t caught up with the technical capability. Case law is minimal, and interpretations vary. Some companies argue that AI training qualifies as a transformative, fair use-type application under copyright, but this defense is legally untested in many jurisdictions, especially under EU frameworks, which are more protective than those in the U.S.

For tech executives operating in Europe or deploying EU-facing AI products, risk management needs to address model training inputs at the source. Documenting your pipeline, evaluating dataset provenance, and ensuring data minimization where possible are no longer just best practices; they’re safeguards. Businesses pulling data into their training workflows without tracing rights or privacy exposure may face regulatory action, even retroactively.

Journalists and researchers must carefully evaluate legal responsibilities

Scraping isn’t a free pass, even when the goal serves the public interest. Journalists, researchers, and advocacy organizations must distinguish between non-personal and personal data, review the website’s Terms of Service, and account for differing EU country-level laws. Each of these variables affects what they can collect, how they can process it, and what risks are involved.

Non-personal data, while less regulated, can still fall under database rights or be contractually protected through website terms. Personal data, even if incidentally collected, immediately brings GDPR into effect, triggering legal duties like purpose limitation, data minimization, and lawful justification. The thresholds for compliance aren’t low, even for nonprofit or journalistic work. Simply claiming public benefit is not always enough, especially when data subjects’ rights are involved.

Websites can also create their own boundaries. If scraping is explicitly prohibited in the Terms of Service, there can be legal consequences, even if the data isn’t protected by copyright or privacy laws. Operators may use technical barriers, pursue breach of contract claims, or seek injunctive relief if scrapers violate posted conditions.

Members of the press do have some protections under GDPR, but these are mediated through national law. Each country sets its own standards for reconciling privacy with free expression. That fragmentation complicates decision-making. Without a clear understanding of which jurisdiction applies and what the domestic exemptions allow, even good-faith data collection can expose a team to investigation.

For executives leading data, legal, or R&D teams, the operational reality is planning ahead. Before scraping any site or dataset, map out key questions: Is the data personal? Is it covered under a database right? Does the site’s ToS permit collection? Where are the servers, the company, and the data subjects located? Who owns the data once it’s ingested? Then, build a risk profile. If friction is likely, consult legal counsel before any deployment.

The bottom line

Legal ambiguity brings risk. And when your team is moving fast to build products, train models, or surface insights, those risks compound. Web scraping in the EU touches multiple regulatory layers: data privacy, copyright, database rights, and contractual terms. It’s not something you can afford to misunderstand or delegate blindly.

As a leadership team, treat data governance the same way you treat operational security or financial compliance: as a foundational layer, not a box to check after the fact. Scraping can power everything from strategic intelligence to product development, but only if you understand the legal terrain upfront. The trade-off is clear: structure your approach now and you avoid breakdowns later.

Whether you’re building AI systems, launching research initiatives, or leveraging public data for market insights, align legal and technical teams early. Don’t assume that just because something’s accessible, it’s freely usable. And if your teams touch personal data, particularly in the EU, make sure your compliance strategy isn’t based on guesswork.

Make scraping smart by making it intentional. Strong execution here won’t just keep you out of trouble; it’ll protect your ability to scale with confidence.

Alexander Procter

April 29, 2025
