DataScience Lab

News / info

  • First class is on Tuesday, September 28 in Amphi 5.
  • If you have a laptop, bring it to the class. I can bring extension cords if necessary (please ask).
  • Class is normally held on Tuesday from 8:30am to 11:45am. Check your ADE schedule to be sure.
  • There are 3 assignments. You have to form a new group for every assignment, and you cannot work with the same person on all 3 assignments. I am aware that these constraints may not be satisfiable, but I take this very seriously, so you should too.
  • <2021-09-30 Thu> The link to the Teams team of the class is available here (do not use the one from IASD; it is the one from last year).
  • <2021-09-30 Thu> You have to create your teams and choose your topic before Sunday 12:00pm.
    • Create a private channel on Teams with the team members. The name should be:
      • A1-MF-XXX or A1-LB-XXX or A1-OT-XXX where:
        • A1 stands for Assignment 1
        • MF stands for Matrix Factorization
        • LB stands for Linear Bandits
        • OT stands for Optimal Transport
        • XXX is the name of your team (whatever you like)
    • Create your GitHub accounts, join the classroom for assignment 1 with this link, and create your team with the same team name you used in the channel name (XXX from before).
  • <2021-09-30 Thu> The private section is now password protected. The password will be given on the team's Général channel.
  • <2021-10-04 Mon> 04-10 : meeting in room A407. Tomorrow will be a working session.
  • <2021-10-05 Tue> Added a description on how to test MF on MovieLens (see below).
  • <2021-10-05 Tue> I need three more groups for the preliminary presentations next week.
  • <2021-10-11 Mon> I have granted you access to some computing resources from the Lamsade, see this link.
  • <2021-10-11 Mon> Everyone meets in room A407; Laurent will be remote.
    • Reminder: the deadline for the project is October 24th.
      • Expected in the repositories:
        • Slides for preliminary or final presentation
        • Brief report containing the interesting bits of your work.
        • Source code used to produce all the experimentation.
    • Please upload your preliminary presentations on your git repository.
  • <2021-10-12 Tue> Unrelated note: if you want to register for Ethics and AI, please do so before Wednesday of next week (October 20). Check your Dauphine email for more info.

Tentative planning for the year

(may change) Note: A1 = assignment 1, Ax = assignment x.

Date                   Description
W39 – September 28     Class intro + intro A1
W40 – October 5        Group sessions A1
W41 – October 12       Preliminary presentations + Group sessions A1
Deadline A1: October 24, 20:00
W42 – October 19       Final presentations (results) & discussion A1, intro A2
W43 – October 26       Group sessions A2
W44 – November 2       No class
W45 – November 9       Preliminary presentations + Group sessions A2
Deadline A2: November 14, 20:00
W46 – November 16      Final presentations for A2, intro A3
W47 – November 22      No class (PSL Week)
W48 – November 30      No class
W49 – December 7       No class
W50 – December 14      Preliminary/final presentations A3
Deadline A3: December 20, 20:00

HowTo

Group sessions

How it is supposed to work:

  • Students describe their plan/idea/readings/experiments and ask questions.
  • Professors answer questions when they can.

How it is not supposed to work:

  • Professors explain to students how to conduct their project.

Class presentations

For each assignment, each group is expected to give at least one of the two types of presentation.

Preliminary presentations

  • 5 minutes / 5 slides
  • Briefly & clearly state the problem you are working on (one slide).
  • Present and compare approaches you are considering for solving the problem.
  • Describe what you have implemented (briefly)
  • Discuss possible experiments and evaluation metrics.
  • Present preliminary results if you have any.

Final presentations

  • 5 minutes / 5 slides
  • Briefly & clearly state the problem you are working on (one slide).
  • Present and compare approaches you have studied during this assignment.
  • Describe what you have implemented (briefly)
  • Discuss the evaluation metrics you have used.
  • Show experimental results and discuss these results.

Assignment 1

Links, References and remarks about Assignment 1

groups

  • 92i Factorization
  • Benamara Djelid Benrekia
  • Bigaud Rolland Zaghrini
  • Data Sharks
  • Driss Ly
  • Duzenli Grosjean Chan-Renous
  • Germain Emma Xavier
  • L'affine équipe
  • Lipton Ice Team
  • Malartic Taoudi
  • Munching Gloves
  • Opteam
  • team o thé

Preliminary presentations

  • 92i Factorization (MF)
  • Munching Gloves (MF)
  • Benamara Djelid Benrekia (MF)
  • Germain Emma Xavier (MF)
  • Teamothe (LB) (?)
  • LiptonIceTeam (OT) (?)

FAQ

Can I develop approach X (a method that has not been discussed in class)?

You are very much encouraged to study & implement something that we have not discussed in class, as long as it is a solution to the problem we're trying to solve.

Typically, it is a good idea to compare some approach that we have discussed in class with something that we have not discussed in class, so your experience can benefit other students (and so we can have new ideas for next year).

Is it mandatory to use the dataset or the metric specified by the professors?

If you can, you should. It's better if you run at least one experiment that is comparable with the experiments of the other groups working on the same problem.

However, comparative experiments are not always very insightful, so you are also encouraged to conduct other types of experiments using different datasets or different metrics to better understand how your approach behaves. Be creative.

Last year, one group made a random dataset generator so they could plot the performance of their algorithm w.r.t. the size of the dataset. From that plot, they concluded that their approach could never scale to any realistic dataset :). That's just an example, but it was good work, and it turned out that generating realistic random matrices was also an interesting problem.
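To give an idea of what such a generator could look like, here is a minimal sketch. It is not the code that group wrote. It assumes ratings come from random low-rank factors, and all names and parameters (make_low_rank_ratings, rank, density, the global-mean placeholder algorithm) are made up for illustration. Swap in your own algorithm to measure how its running time grows with the dataset size.

```python
import random
import time


def make_low_rank_ratings(n_users, n_items, rank=3, density=0.1, seed=0):
    """Return (user, item, rating) triples sampled from a random low-rank matrix.

    All parameters are illustrative assumptions, not course requirements.
    """
    rng = random.Random(seed)
    # One random latent vector of length `rank` per user and per item.
    users = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(n_users)]
    items = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(n_items)]
    ratings = []
    for u in range(n_users):
        for i in range(n_items):
            # Keep each entry with probability `density` to mimic sparsity.
            if rng.random() < density:
                value = sum(a * b for a, b in zip(users[u], items[i]))
                ratings.append((u, i, value))
    return ratings


def global_mean_baseline(ratings):
    """Placeholder 'algorithm': compute the mean rating (0.0 if no ratings)."""
    if not ratings:
        return 0.0
    return sum(r for _, _, r in ratings) / len(ratings)


def time_vs_size(sizes, algorithm=global_mean_baseline):
    """Measure the wall-clock time of `algorithm` for each dataset size."""
    results = []
    for n in sizes:
        data = make_low_rank_ratings(n_users=n, n_items=n, seed=n)
        start = time.perf_counter()
        algorithm(data)
        results.append((n, len(data), time.perf_counter() - start))
    return results


if __name__ == "__main__":
    for n, n_ratings, seconds in time_vs_size([50, 100, 200]):
        print(f"{n}x{n} matrix, {n_ratings} ratings: {seconds:.6f}s")
```

The (size, number of ratings, seconds) triples can then be plotted with whatever plotting tool you prefer.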

Do I have to use Git? Can I use Jupyter Notebook instead?

Humf, Git and Jupyter Notebook are two very different tools. Yes, you have to use Git. You can also use Jupyter Notebook if you want.

Git is a tool to manage a source code repository. It is used to version your code (keep track of the changes) and collaborate with other developers (merge multiple concurrent versions of the code). You have to use it, because this is how I am going to access your code/report at the end of the project. You also have to use it because it's critical for you to know how to use it if you ever want to collaborate with someone, or handle a code base that contains more than a few lines of code.

Jupyter Notebook is an interactive browser-based code editor. It can be used to run a few lines of Python code in your browser, but it is not so convenient when you have a large code base, or when you want to run your code on a remote server, or not interactively. You can use it if you want, but I will not check it unless you explicitly refer to it in your report.

In case you want to use it, I recommend that you first write a Python Module with all the important functions inside (See Python Modules). Then, you can import this module in your Jupyter Notebook and call the functions from there. This way, you can also write a simple non-interactive script so that you can run your program on a remote server.
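As a rough illustration of that layout, here is a sketch of such a module. The file name mf_lib.py and the functions rmse and run_experiment are hypothetical, and the "experiment" is a toy one (a constant predictor on random ratings) that only stands in for your real code.

```python
# mf_lib.py (hypothetical module name)
import math
import random


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def run_experiment(n_samples=100, seed=0):
    """Toy experiment: score a constant (mean) predictor on random ratings."""
    rng = random.Random(seed)
    targets = [rng.uniform(1, 5) for _ in range(n_samples)]
    mean = sum(targets) / len(targets)
    return rmse([mean] * n_samples, targets)


# Non-interactive entry point: runs only when the file is executed as a
# script (for example `python mf_lib.py` on a remote server), not on import.
if __name__ == "__main__":
    print(f"RMSE of the mean predictor: {run_experiment():.4f}")
```

In a notebook located in the same directory, `import mf_lib` followed by `mf_lib.run_experiment(n_samples=1000)` calls the same code interactively.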

I don't have enough computing power.

Alright. You can either use Google Colab (Jupyter Notebooks hosted at Google), or access the GPU servers hosted at Lamsade (the computer science lab at Dauphine) using ssh. To get your access credentials, log in at https://db.masteriasd.eu/, find your username, and download your private key as a file.

Then open a terminal and type:

chmod 600 /path/to/private/key/id_rsa_<username>
ssh <username>@ssh.lamsade.dauphine.fr -p 5022 -i /path/to/private/key/id_rsa_<username>

Obviously you need to replace <username> with your own username, and /path/to/private/key with the directory where you saved the key.

Then you can choose one of these machines:

  • Ourasi: 20 cores / 40 threads / 32GiB RAM / 2x Nvidia RTX 2080 Ti
  • Kaisertrot: 20 cores / 40 threads / 32GiB RAM / 2x Nvidia RTX 2080 Ti (actually, one is dead)
  • Boldeagle: 8 cores / 16 threads / 32GiB RAM / 2x Nvidia GTX 1080 Ti
  • Readycash: 8 cores / 16 threads / 32GiB RAM / 2x Nvidia GTX 1080 Ti

These are shared resources, so please do not use more than 2 GPUs at a time. You can see who else is using the CPUs/GPUs with htop or nvidia-smi.
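If you want to check GPU usage from a Python script rather than by eye, something along these lines might help. This is only a sketch: the helper names, the limit of 2 GPUs taken from the rule above, and the 500 MiB "busy" threshold are my assumptions, not a Lamsade policy. It relies on the nvidia-smi CSV query output and on the CUDA_VISIBLE_DEVICES environment variable, which must be set before your deep learning library initializes CUDA.

```python
import os
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, memory.used' CSV lines into a list of (index, used_mib)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        gpus.append((int(index), int(used)))
    return gpus


def pick_idle_gpus(gpus, max_gpus=2, busy_above_mib=500):
    """Return up to `max_gpus` GPU indices, least used first.

    A GPU counts as busy when its used memory reaches `busy_above_mib`;
    this threshold is an arbitrary assumption.
    """
    idle = [index for index, used in sorted(gpus, key=lambda g: g[1])
            if used < busy_above_mib]
    return idle[:max_gpus]


def query_gpus():
    """Ask nvidia-smi for per-GPU memory use (needs an Nvidia driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)


def restrict_to(indices):
    """Expose only the chosen GPUs to CUDA-based libraries imported later."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
```

On one of the servers, calling `restrict_to(pick_idle_gpus(query_gpus()))` at the top of your script, before importing your deep learning library, keeps you away from GPUs that already look busy.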

You can also transfer files from your computer to the servers using scp (see man scp). Note that scp takes the port with a capital -P (for example -P 5022), unlike ssh.

Date: 2020-09-30 Wed 00:00

Author: Benjamin Negrevergne, Laurent Meunier

Created: 2021-10-12 Tue 17:05
