OpenAI has introduced Sora, a text-to-video tool that generates lifelike video clips from simple text prompts. Since its announcement, questions have persisted about what data was used to train the model.
Training Data Controversy
When asked in an interview whether YouTube videos were used to train the model, OpenAI's CTO could not give a definite answer, saying, "I’m not sure about it." The COO likewise declined to confirm whether the model was trained on YouTube content. Despite these noncommittal responses, reports have alleged that OpenAI used YouTube videos to train Sora.
Google’s CEO Sundar Pichai has since addressed the issue, saying that Google would take the matter up with OpenAI if the allegations prove accurate. According to a New York Times report, OpenAI transcribed more than a million hours of YouTube videos to train its AI models.
Google's Response
When questioned about potential violations of Google’s terms and conditions, Sundar Pichai responded, "Look, I think it’s a question for them to answer. I don’t have anything to add. We do have clear terms of service." He further mentioned, "And so, you know, I think normally in these things we engage with companies and make sure they understand our terms of service. And we’ll sort it out."
The New York Times has already taken legal action against OpenAI over the use of its copyrighted content in AI training. Pichai, for his part, did not disclose how Google plans to address the YouTube allegations.
Creator Rights and AI Training
Ideally, content creators should have the right to opt in or out of having their material used for AI training. Training AI models requires vast amounts of data, typically sourced from the internet, but that data should be collected with proper permission. When asked whether OpenAI used YouTube content, the company’s COO pointed to future plans instead. He said that, alongside a tool to detect AI-generated images, OpenAI is working on a "content ID system for AI" that would let creators see where their content is being used and who is training on it, and opt in or out of such training.