OpenAI data breach shows AI companies are prime targets for hackers
The recent attack on OpenAI’s systems, though it apparently reached only an employee forum, is a sign of how attractive AI companies and their troves of data have become to hackers.
You don’t need to worry that your private ChatGPT conversations were swept up in the recent breach of OpenAI’s systems. The hack itself, while troubling, appears to have been superficial. But it’s a reminder that AI companies have, in short order, made themselves some of the most desirable targets for hackers.
The New York Times reported the hack in more detail after former OpenAI employee Leopold Aschenbrenner alluded to it on a podcast. He called it a “major security incident,” but unnamed company sources told the Times that the hacker had only gained access to an employee discussion forum.
No security breach should be taken lightly, and eavesdropping on internal OpenAI developer conversations certainly has value to someone. But as far as we know, the hacker never got near internal systems, models in progress, secret roadmaps, or anything of the sort.
But the incident should worry us regardless, and not just because China or other adversaries might overtake the U.S. in the AI arms race. The simple truth is that AI companies have become gatekeepers to an enormous amount of very valuable data.
Let’s talk about three categories of data that OpenAI and, to some extent, other AI companies create or have access to: high-quality training data, bulk user interactions, and customer data.
What training data these companies hold is anyone’s guess, since they treat it as a closely guarded secret. But it’s a mistake to think of it as just a big pile of scraped web data. Yes, they use web scrapers and datasets like the Pile, but shaping that raw data into something that can be used to train a model like GPT-4o is a gargantuan task, one that requires vast amounts of human labor and can only be partially automated.
Some machine learning engineers have speculated that of all the factors that go into creating a large language model (or perhaps any transformer-based system), the single most important one is dataset quality. That’s why a model trained only on Twitter and Reddit will never be as intelligent as one trained on every published work of the last century.
It’s also likely why OpenAI reportedly drew on sources of questionable legality, such as copyrighted books, for its training data, a practice it claims to have since abandoned.
So the training datasets OpenAI has built are enormously valuable to competitors, to adversary nations, and to regulators here at home alike. Wouldn’t the FTC or the courts want to know exactly what data was used, and whether OpenAI has been truthful about it?
Perhaps even more valuable is OpenAI’s enormous trove of user data: likely billions of ChatGPT conversations spanning hundreds of thousands of topics. Just as search data was once the key to understanding the collective psyche of the web, ChatGPT has its finger on the pulse of a population that is smaller than Google’s but far more deeply revealed. And unless you opt out, your conversations are being used as training data.
With Google, an uptick in searches for air conditioners tells you the market is heating up a bit. But those users don’t then go on to have a whole conversation about exactly what they want, how much they’re willing to spend, what their home looks like, which manufacturers they want to avoid, and so on. You know this information is valuable, because Google itself is trying to get its users to hand it over by substituting AI interactions for searches.
Think of how many conversations people have had with ChatGPT, and how useful that data is, not just to AI developers but to marketing teams, consultants, and analysts. It’s a gold mine.
The last category of data is perhaps the most valuable on the open market: how customers actually use AI, and the data they themselves have fed to the models.
Hundreds of major companies and countless smaller ones use tools like OpenAI’s and Anthropic’s APIs for a huge variety of tasks. And for a language model to be useful to them, it usually has to be fine-tuned on, or otherwise given access to, their internal databases. That might be something as mundane as old budget sheets or personnel records (to make them more easily searchable, say) or as valuable as the code for unreleased software. What these customers do with the AI’s capabilities, and whether those capabilities are even useful, is their business. But the simple fact is that the AI provider has privileged access, just as any SaaS product does.
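To make that “privileged access” concrete, here is a minimal sketch of a typical API integration, written against OpenAI’s Python SDK. The filename and prompts are hypothetical, but the pattern is standard: whatever internal data a customer wants the model to reason over travels to the provider inside the request.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical internal document: its full contents leave the
# customer's infrastructure the moment they enter the prompt.
with open("q3_budget_summary.txt") as f:
    internal_doc = f.read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        # The document is sent to the provider as context.
        {"role": "system",
         "content": "Answer questions using this document:\n\n" + internal_doc},
        {"role": "user",
         "content": "Which line items grew fastest quarter over quarter?"},
    ],
)

print(response.choices[0].message.content)
```

Whatever the provider’s retention and training policies, the payload itself necessarily passes through, and may be logged by, its systems. That is the privileged access in question.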
The security challenges faced by AI companies
Much of this customer data amounts to trade secrets, and AI companies are suddenly at the heart of a great many of them. The newness of this side of the industry carries a special risk: AI processes are simply not yet standardized or fully understood.
Like any SaaS provider, AI companies are perfectly capable of providing industry-standard levels of security, privacy, on-premises options, and generally responsible service. There’s no doubt that the private databases and API calls of OpenAI’s Fortune 500 customers are locked down very tightly. They must surely be as aware as anyone, if not more so, of the hazards of handling private data around AI. (That OpenAI didn’t report this attack was its choice to make, but it doesn’t inspire trust in a company that desperately needs it.)
But good security practices don’t change the value of what they’re meant to protect, or the fact that malicious actors and assorted adversaries are constantly clawing at the door. Security isn’t just about picking the right settings or keeping your software up to date, though of course the basics matter too. It’s a never-ending cat-and-mouse game, one that is, ironically, now being supercharged by AI itself, as attackers deploy agents and attack automation to probe every nook and cranny of these companies’ defenses.
There’s no reason to panic; companies with access to large amounts of personal or commercially valuable data have faced and managed similar risks for years. But AI companies represent a newer, younger, and potentially juicier target than your garden-variety poorly configured enterprise server or careless data broker.
Even a hack like the one described above, with no serious data exfiltration as far as we know, should worry anyone who does business with AI companies. The targets have been painted on their backs. Don’t be surprised when anyone, or everyone, takes a shot.