Algoritmi

Last.fm, one of the oldest organizations for online music, was founded in 2002 as a radio and social music discovery platform, offering music streaming and personal recommendations to its users.

The consumption of music was perceived at Last.fm as a social behaviour: a data-driven approach made it possible to discover the music taste of users, constructing detailed user’s listening profiles by tracking listening habits and suggesting new songs by analyzing the listening profile and similarities between songs.

In this project, you will analyze excerpts of some real datasets publicly released by Last.fm.

Goal: finding the “top” song-authors by tag

You are given as input a CSV file containing a list of song tracks and additional details. Each line has:

track_id, song_title, author_name, tags

where tags is a string with one or more text user-provided labels that characterize the song,

separated by “;” (e.g. “rock;indie;female vocalist”).

Your task is to design and implement a Python code that reads the CSV file, extracting the relevant information which must be then stored in a suitable data structure. The data structure should make it possible to answer queries of the form:

Which are the top K song-authors matching a given list of tags L?

as fast as possible. The list of tags L and the integer value K are given as input to your query algorithm.

Example

In the small CSV dataset that you will be provided (see below), there are 2526 “rock” songs and 1762 “pop” songs, among the other tags. Only 1340 songs have both “rock” and “pop” in their tags (in any order). Among these 1340 songs, the top authors are:

 

• The Police, with 148 songs

• U2, with 110 songs

• The Cure, with 109 songs

• Elvis Presley, with 106 songs

• Michael Jackson, with 101 songs

Hence, if K=2 and L=[“rock”, “pop”] (or L=[“pop”, “rock”]), your algorithm should return the ordered list of strings [“The Police”,”U2″], where items appear in the list in decreasing order of number of songs. If two authors have the same number of songs, they should appear in alphabetical order.

If K=4 and L is the same as above, your output list should be [“The Police”, “U2”, “The Cure”, “Elvis Presley”].

Datasets

You are given four datasets: small.csv, medium.csv, large.csv, and full.csv:

• The small dataset has about 3000 songs (20 authors, 7 tags).

• The medium dataset has about 10000 songs.

• The large dataset has about 100000 songs.

• The full dataset has about 500000 songs.

We will run your query algorithm on the different datasets: the larger the dataset you can handle (correctly), the better the evaluation of your project! We will also consider the running time of your code on the datasets that you can solve.

Python implementation guidelines

We are providing you a skeleton of the code. You can modify the code we provide, adding the missing parts.

In the project folder, you will find a file main.py (used to run the project) and the input file small.csv. You should not change the main file, except for the variable group_id (as specified below).

Change the name of folder group0 to match the group id that we will assign you after the registration on Luiss Learn, and also change the group_id value in main.py.

Implement your code in file project.py, that contains two functions (you can add more functions, if needed, but you must at least implement these ones). Function prepare will be called once per song list: use prepare to read the input file and store the relevant information in suitable global data structures of your choice. Function top_authors should implement your query algorithm, as described above.

You can use additional files, if needed, but all of them have to be in the group folder. There is a file utils.py where, if you want, you can implement auxiliary algorithms and data structures.

You can use Python lists, dictionaries, and string functions, but no specific algorithm from any Python library. If you need any algorithms (e.g., for searching, sorting, or selection) you have to implement your own version from scratch. If in doubt, just ask.

Project report

You should write a report on your implementation (about 2 pages) describing your code, how it works, and the main reasons of your implementation choices (e.g., why did you use a sorted list instead of a dictionary? How does your query algorithm work?).

As a bonus, you should try to analyze the asymptotic cost of your implementation assuming that:

• Each song a O(1) tags, each of length O(1). Hence, any string processing on the “;” separated

tags string (such as a split) requires constant time.

• Function top_authors is called with exactly 2 tags [t1, t2] and there are n1 songs with tag t1,

n2 songs with tag t2, and n songs with both tags where n < min(n1, n2). Keep in mind that n is much smaller than the total number of songs in the input file.

Deadline and what to send us

You should send us (Caminiti and Finocchi) an e-mail with a pdf file containing your report and a zip file with your group folder. The subject of the email should be: “Algorithms project 2021 group X”, where X is your group id.

Project deadline: May 25, 2021.

We will run your code and let your implementations compete against each other: which groups are the fastest? Which groups can solve the largest datasets? We plan to release the results of the contest on the course Web site before the first exam session (June 3).

Tags: No tags