@wliu
Last active April 11, 2018 21:45
Spam Detector

Welcome to UnitedMasters. This challenge helps us assess engineering expertise and creative thinking, while giving you a better understanding of the music domain. We also think this challenge is just a fun exercise for anyone who loves to write code. Feel free to ask questions or get clarification on anything.

Dataset

The dataset for this challenge, dataset.tar, is an archive containing three gzipped JSON Lines files:

sc_tracks.json.gz: contains ~6000 Soundcloud track objects from @corpus, our internal datastore. The track object mirrors the [Soundcloud Track API](https://developers.soundcloud.com/docs/api/reference#tracks), with UnitedMasters-specific fields denoted by a leading "_" character in the field name.

track_ratings.json.gz: contains human-curated quality ratings for the tracks specified in sc_tracks.json.gz. Ratings range from -1 (spam, with the spamtype field providing further classification) to 5 (this could be the next Prince).

test_sc_tracks.json.gz: contains 500 Soundcloud track objects that have not been rated.
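
As a starting point, here is a minimal sketch of one way to read the files described above, assuming dataset.tar has already been extracted. The helper name `read_jsonl_gz` is hypothetical and not part of the challenge; it relies only on each file holding one JSON object per line.

```python
import gzip
import json


def read_jsonl_gz(source):
    """Yield one dict per non-empty line of a gzipped JSON Lines file.

    `source` may be a file path or an open binary file object.
    (Hypothetical helper; the name is an assumption for illustration.)
    """
    with gzip.open(source, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)
```

For example, `tracks = list(read_jsonl_gz("sc_tracks.json.gz"))` would load every track object into memory, which should be manageable for roughly 6000 records.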

Challenge

Build a spam detection mechanism for the unrated tracks in test_sc_tracks.json.gz using your toolchain of choice. Measure the effectiveness of your approach against the rated tracks. Show your code. Be prepared to provide suggestions on ways that your approach could be further improved. Most of all, have fun!
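
One possible way to measure effectiveness against the rated tracks is to treat a rating of -1 as the positive (spam) class and report precision, recall, and F1. The sketch below is only an illustration under stated assumptions: the keyword rule is a deliberately naive placeholder baseline, the keyword list is invented, and the use of the `title` and `description` fields assumes the track objects carry them as in the Soundcloud Track API.

```python
def is_spam_baseline(track, keywords=("free download", "buy followers", "click here")):
    """Hypothetical baseline: flag a track whose title or description
    contains any keyword. The keyword list is an assumption, not from the dataset."""
    text = " ".join(
        str(track.get(field) or "") for field in ("title", "description")
    ).lower()
    return any(keyword in text for keyword in keywords)


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for boolean labels (True means spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

Here `y_true` would be derived from the ratings (True where the rating is -1) and `y_pred` from whichever detector is being evaluated. Reporting precision and recall separately, rather than accuracy alone, is useful when spam is a minority class, since a detector that never flags anything could still score high accuracy.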
