Key Takeaways
1. Yandex has launched the open-source Yambda dataset, which captures music listener preferences and is intended to help developers build streaming audio services similar to Spotify.
2. The dataset records 4.79 billion user interactions with 9.39 million music tracks, collected over ten months from 28 million monthly Yandex Music users.
3. The dataset targets AI-driven playlist customization, an area where other platforms keep their algorithms private for competitive advantage.
4. The Yambda dataset is available for download in different sizes: 5 billion, 500 million, and 50 million events, with the largest needing at least 85 GB of storage.
5. The dataset is formatted in Apache Parquet, allowing for easier analysis and research, and can be accessed on HuggingFace.
Yandex has announced the launch of its open-source Yambda dataset, which captures music listener preferences. The dataset is intended to help developers build a streaming audio service akin to Spotify, featuring AI-driven playlist customization.
Playlist Creation with AI
Platforms such as Spotify, Tidal, and Qobuz use software algorithms or AI technologies to generate playlists tailored to individual user tastes. However, these companies typically keep their code and models under wraps, viewing their ability to automatically curate enjoyable song selections as a valuable trade secret that contributes to their competitive edge.
Extensive Data Collection
Over ten months, Yandex collected 4.79 billion user interactions with 9.39 million music tracks from its 28 million monthly Yandex Music users. The dataset captures listener feedback: what users chose to play, as well as their preferences and aversions. Each interaction is recorded with a timestamp, so the order of events is preserved.
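To make the shape of such data concrete, here is a minimal sketch of how timestamped interactions carrying positive and negative feedback might be represented and summarized. The `Interaction` record, its field names, the event labels, and the sample values are all illustrative assumptions, not the actual Yambda schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One hypothetical user-track event (not the real Yambda schema)."""
    user_id: int
    track_id: int
    timestamp: int  # assumed: seconds since some reference point
    event: str      # assumed labels: "listen", "like", "dislike"


def user_history(events, user_id):
    """Return one user's events ordered from oldest to newest."""
    own = [e for e in events if e.user_id == user_id]
    return sorted(own, key=lambda e: e.timestamp)


def feedback_counts(events):
    """Count how often each event type occurs for every track."""
    counts = {}
    for e in events:
        counts.setdefault(e.track_id, Counter())[e.event] += 1
    return counts


# Made-up sample data, for illustration only.
sample = [
    Interaction(user_id=1, track_id=10, timestamp=300, event="like"),
    Interaction(user_id=1, track_id=10, timestamp=100, event="listen"),
    Interaction(user_id=2, track_id=10, timestamp=200, event="dislike"),
    Interaction(user_id=2, track_id=11, timestamp=250, event="listen"),
]

history = user_history(sample, user_id=1)
summary = feedback_counts(sample)
```

The timestamp is what allows a per-user history to be put back in order, which matters for any model that tries to predict what a listener will want next.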
Dataset Availability
The Yambda dataset is available for download in three sizes: roughly five billion events (1 million users), five hundred million events (100,000 users), and fifty million events (10,000 users). The largest version requires at least 85 GB of storage. The data is formatted in Apache Parquet, a column-oriented file format that simplifies analysis and research.
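As a rough illustration of the figures quoted above, the sketch below computes the average number of events per user for each size, and shows how a column-oriented format lets a reader load only the columns it needs. The helper names, the file name, and the column names are hypothetical; `read_columns` assumes pandas and a Parquet engine such as pyarrow are installed.

```python
# Rounded figures from the article: (events, users) for each size.
VARIANTS = {
    "5 billion": (5_000_000_000, 1_000_000),
    "500 million": (500_000_000, 100_000),
    "50 million": (50_000_000, 10_000),
}


def events_per_user(events, users):
    """Average number of events per user, based on rounded figures."""
    return events / users


def read_columns(path, columns):
    """Read only the named columns from a Parquet file.

    pandas is imported lazily so the arithmetic in this sketch still
    runs on a machine without it.
    """
    import pandas as pd

    return pd.read_parquet(path, columns=columns)


averages = {name: events_per_user(e, u) for name, (e, u) in VARIANTS.items()}
for name, avg in averages.items():
    print(f"{name}: about {avg:,.0f} events per user")

# Hypothetical usage; the file and column names are placeholders:
# df = read_columns("events.parquet", columns=["user_id", "track_id"])
```

With the rounded numbers, every size works out to about 5,000 events per user, which suggests the smaller versions contain fewer users rather than shorter histories. Reading a subset of columns is where Parquet helps most with a file this large, since unneeded columns do not have to be loaded into memory.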
Yambda can be found at HuggingFace, as noted in the Yandex press release.