DataScience Lab

News / info

  • First class is on Tuesday, September 28 in Amphi 5.
  • If you have a laptop, bring it to the class. I can bring extension cords if necessary (please ask).
  • Class is normally held on Tuesday from 8:30am to 11:45am. Check your ADE planning to be sure.
  • There are 3 assignments. You have to form new groups for every assignment, and you cannot work with the same person for all 3 assignments. I am aware that these constraints may not be satisfiable, but I take this very seriously, so you should too.
  • <2021-09-30 Thu> The link to the Teams team of the class is available here (do not use the one from IASD; it is the one from last year).
  • <2021-09-30 Thu> You have to create your teams and choose your topic before Sunday 12:00pm.
    • Create a private channel on Teams with your team members. The name should be:
      • A1-MF-XXX or A1-LB-XXX or A1-OT-XXX where:
        • A1 stands for Assignment 1
        • MF stands for Matrix Factorization
        • LB stands for Linear Bandits
        • OT stands for Optimal Transport
        • XXX is the name of your team (whatever you like)
    • Create your GitHub accounts, join the classroom for assignment 1 with this link, and create your team with the same team name you used in the channel name (XXX from before).
  • <2021-09-30 Thu> The private section is now password protected; the password will be given on the Général channel of the team.
  • <2021-10-04 Mon> 04-10 : meeting in room A407. Tomorrow will be a working session.
  • <2021-10-05 Tue> Added a description on how to test MF on movielens (see below)
  • <2021-10-05 Tue> I need three more groups for preliminary presentation next week.
  • <2021-10-11 Mon> I have granted you access to some computing resources from the Lamsade (see the FAQ below).
  • <2021-10-11 Mon> Everyone meeting in room A407, Laurent will be remote.
    • Reminder: the deadline for the project is October 24th.
      • Expected on the repositories:
        • Slides for preliminary or final presentation
        • Brief report containing the interesting bits of your work.
        • Source code used to produce all the experiments.
    • Please upload your preliminary presentations on your git repository.
  • <2021-10-12 Tue> Unrelated note: if you want to register for Ethics and AI, please do it before Wednesday next week (Oct 20). Check your Dauphine email for more info.
  • <2021-10-19 Tue> Everybody in room A407. The other groups will be giving their final presentations. Discussion + presentation of A2.
  • <2021-10-20 Wed> Added a description for the report. Remember that you have to create your teams and choose your topic before Monday next week.
  • <2021-10-24 Sun> I'll give a brief recap of the 3 topics for the next assignment. You'll be able to change if you want to, but please try to know who you're going to work with, and on what topic, based on what I already said last time.
  • <2021-10-26 Tue> We meet in A407. We will use the other room if necessary. (group session)
  • <2021-11-02 Tue> I need a list of groups willing to give preliminary presentations next week! Please email me with your group name and the topic you are working on.
    • Team 1: WE – Googole Translate
    • Team 2: AE – WeHaveNoIdea
    • Team 3: WE – WordHunters
    • Team 4: WE – The translators
    • Team 5: WE – Translateam
    • Team 6: NF – GAN-ache
  • <2021-11-09 Tue> Class will take place in room A407 (as usual), Second time slot will be free so you can attend: https://prairie-institute.fr/prairie-day/#Lunch
  • <2021-11-13 Sat> NEW DEADLINE FOR A2: A2 is postponed to next Wednesday (<2021-11-17 Wed>) at 20:00. Remember that the remaining groups will be presenting their work on <2021-11-16 Tue>.
  • <2021-11-16 Tue> Everybody in room A401. Assignment 3 is out.
  • <2021-12-05 Sun> Minor changes on the testing platform
    • kaisertrot is no longer available; it is reserved for the test platform
    • pip3 install tqdm pandas matplotlib
    • now plots agg and time in addition to the other metrics
  • <2021-12-08 Wed> Group sessions on December 10:

Please make sure:

  • you and the rest of your group are available on Teams at the time indicated below.
  • you have created a channel called A3-teamname
  • Laurent and I are members of this channel

Schedule (random)

  • adversarialattacks (10h30 Laurent)
  • adversarial-sans-serif (10h30 Ben)
  • batman (10h40 Ben)
  • biparo (10h40 Laurent)
  • cmbnet (10h50 Ben)
  • lcx (10h50 Laurent)
  • obfuscateam (11h20 Ben)
  • pixel-attacks (11h20 Laurent)
  • robustnet-1 (11h30 Ben)
  • superman (11h30 Laurent)
  • t-espasnetbaptiste (11h40 Ben)
  • titans (11h40 Laurent)
  • zfc (11h50 Ben)

Tentative planning for the year

(may change) Note: A1 = assignment 1, Ax = assignment x.

Date                                 Description
W39 – September 28                   Class intro + intro A1
W40 – October 5                      Group sessions A1
W41 – October 12                     Preliminary presentations + group sessions A1
Deadline A1: October 24, 20:00
W42 – October 19                     Final presentations (results) & discussion A1, intro A2
W43 – October 26                     Group sessions A2
W44 – November 2                     No class
W45 – November 9                     Preliminary presentations + group sessions A2
W46 – November 16                    Final presentations for A2, intro A3
Deadline A2: November 17, 20:00      NEW DEADLINE!
W47 – November 22                    No class (PSL Week)
W48 – November 30                    No class
W49 – December 7                     No class
W49 – December 10                    Group sessions!
W50 – December 14                    Preliminary/final presentations A3
Deadline A3: December 20, 20:00

HowTo

Group sessions

How it is supposed to work:

  • Students describe their plan/idea/readings/experiments and ask questions.
  • Professors answer questions when they can.

How it is not supposed to work:

  • Professors explain to students how to conduct their project.

Class presentations

For each assignment, each group is expected to give at least one of the two types of presentation.

Preliminary presentations

  • 7 minutes (~ 5 slides)
  • Briefly & clearly state the problem you are working on (one slide).
  • Present and compare approaches you are considering for solving the problem.
  • Describe what you have implemented (briefly)
  • Discuss possible experiments and evaluation metrics.
  • Present preliminary results if you have any.

Final presentations

  • 7 minutes (~ 5 slides)
  • Briefly & clearly state the problem you are working on (one slide).
  • Present and compare approaches you have studied during this assignment.
  • Describe what you have implemented (briefly)
  • Discuss the evaluation metrics you have used.
  • Show experimental results and discuss these results.

Report

5 pages max, as a PDF file named report.pdf. Don't forget to mention your names and the name of your team on the front page.

Here is a possible structure for the report (you don't have to follow it if you don't want to.)

  • Brief context and problem statement
  • Brief description of all the work you have accomplished.
    • What you have implemented, using which library and which resources. (If you use a library/module, explain what it does, and what you had to do yourself.)
    • If you have used/reused code from online tutorials, it's OK, but you have to mention it.
    • List of experiments you have conducted (do not include the actual plots at this point)
  • Motivated and coherent explanation of your work. This part is very free; you should focus on what you think is interesting.
    • Specify your initial goal,
    • design an experiment to evaluate the first approach,
    • show and analyse the results (plots, figures (=numeric values))
    • if there are limitations, discuss possible solutions you have considered, and the ones you have implemented
  • Take-away messages (a few sentences)
    • summarize what you have learned.

Assignment 1

Links, References and remarks about Assignment 1

groups

  • 92i Factorization
  • Benamara Djelid Benrekia
  • Bigaud Rolland Zaghrini
  • Data Sharks
  • Driss Ly
  • Duzenli Grosjean Chan-Renous
  • Germain Emma Xavier
  • L'affine équipe
  • Lipton Ice Team
  • Malartic Taoudi
  • Munching Gloves
  • Opteam
  • team o thé

Preliminary presentations

  • 92i Factorization (MF)
  • Munching Gloves (MF)
  • Benamara Djelid Benrekia (MF)
  • Germain Emma Xavier (MF)
  • Teamothe (LB) (?)
  • LiptonIceTeam (OT) (?)

Assignment 2

Assignment 3

FAQ

Can I develop approach X (a method that has not been discussed in class)?

You are very much encouraged to study & implement something that we have not discussed in class, as long as it is a solution to the problem we're trying to solve.

Typically, it is a good idea to compare some approach that we have discussed in class with something that we have not discussed in class, so that your experience can benefit other students (and so we can have new ideas for next year).

Is it mandatory to use the dataset or the metric specified by the professors?

If you can, you should. It's better if you run at least one experiment that is comparable with the experiments of the other groups working on the same problem.

However, comparative experiments are not always very insightful, so you are also encouraged to conduct other types of experiments using different datasets or different metrics to better understand how your approach behaves. Be creative.

Last year, one group made a random dataset generator so they could plot the performance of their algorithm w.r.t. the size of the dataset. From that plot, they concluded that their approach could never scale to any realistic dataset :). That's just an example, but it was good work, and it turned out that generating realistic random matrices was also an interesting problem.
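To give an idea of what such a generator could look like, here is a minimal sketch. It is not the code that group wrote: the function names, the low-rank model, and the placeholder "algorithm" (a global mean) are all assumptions made for illustration, and ratings drawn this way are not claimed to be realistic.

```python
import random
import time


def make_low_rank_ratings(n_users, n_items, rank=5, density=0.1, seed=0):
    """Return a sparse synthetic rating matrix as (user, item, rating) triples.

    Each rating is the dot product of random user and item factors, so the
    underlying full matrix has rank at most `rank` (an assumed toy model).
    """
    rng = random.Random(seed)
    users = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(n_users)]
    items = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(n_items)]
    triples = []
    for u in range(n_users):
        for i in range(n_items):
            # Keep each entry with probability `density` to mimic missing ratings.
            if rng.random() < density:
                score = sum(a * b for a, b in zip(users[u], items[i]))
                triples.append((u, i, score))
    return triples


def global_mean(triples):
    """Placeholder 'algorithm': the mean observed rating (replace with yours)."""
    return sum(r for _, _, r in triples) / len(triples) if triples else 0.0


def time_on_sizes(algorithm, sizes, **kwargs):
    """Run `algorithm` on square datasets of growing size.

    Returns a list of (size, number_of_ratings, seconds) tuples to plot.
    """
    results = []
    for n in sizes:
        data = make_low_rank_ratings(n, n, **kwargs)
        start = time.perf_counter()
        algorithm(data)
        results.append((n, len(data), time.perf_counter() - start))
    return results
```

Calling `time_on_sizes(global_mean, [100, 200, 400])` returns one tuple per size, which can then be plotted (for instance with matplotlib) to see how the running time grows with the dataset.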

Do I have to use Git? Can I use Jupyter Notebook instead?

Humf, Git and Jupyter notebook are two very different tools. Yes, you have to use Git. You can also use Jupyter Notebook if you want.

Git is a tool to manage a source code repository. It is used to version your code (keep track of the changes) and collaborate with other developers (merge multiple concurrent versions of the code). You have to use it, because this is how I am going to access your code/report at the end of the project. You also have to use it because it's critical for you to know how to use it if you ever want to collaborate with someone, or handle a code base that contains more than a few lines of code.

Jupyter Notebook is an interactive browser-based code editor. It can be used to run a few lines of Python code in your browser, but it is not so convenient when you have a large code base, or when you want to run your code on a remote server, or non-interactively. You can use it if you want, but I will not check it unless you explicitly refer to it in your report.

In case you want to use it, I recommend that you first write a Python Module with all the important functions inside (See Python Modules). Then, you can import this module in your Jupyter Notebook and call the functions from there. This way, you can also write a simple non-interactive script so that you can run your program on a remote server.
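As a rough illustration of this layout, here is a minimal sketch of such a module. The file name `mylab.py`, the function names, and the toy line-fitting experiment are all hypothetical; the only point is that the notebook and the command-line script share the same functions.

```python
"""mylab.py: hypothetical module holding the reusable parts of a project."""
import argparse
import random


def make_data(n_points, seed=0):
    """Return n_points noisy samples of y = 2x + 1 (toy data for illustration)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n_points)]
    return [(x, 2 * x + 1 + rng.gauss(0, 0.1)) for x in xs]


def fit_line(data):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in data)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def run_experiment(n_points=100, seed=0):
    """Entry point shared by the notebook and the command-line script."""
    a, b = fit_line(make_data(n_points, seed))
    return {"n_points": n_points, "slope": a, "intercept": b}


def main(argv=None):
    """Non-interactive entry point: parse options, run, print the result."""
    parser = argparse.ArgumentParser(description="Run the toy experiment.")
    parser.add_argument("--n-points", type=int, default=100)
    parser.add_argument("--seed", type=int, default=0)
    args = parser.parse_args(argv)
    print(run_experiment(args.n_points, args.seed))


# Runs only when the file is executed as a script, not when it is imported.
if __name__ == "__main__":
    main()
```

In a notebook, you would write `import mylab` and then call `mylab.run_experiment(n_points=500)`; on a remote server, the same code runs with `python3 mylab.py --n-points 500`.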

I don't have enough computing power.

Alright. You can either use Google Colab (Jupyter Notebooks hosted by Google), or access the GPU servers hosted at Lamsade (the computer science lab at Dauphine) using ssh. To get your access credentials, log on to https://db.masteriasd.eu/, find your username, and download your private key as a file.

Then open a terminal and type:

chmod 600 /path/to/private/key/id_rsa_<username>
ssh <username>@ssh.lamsade.dauphine.fr -p 5022 -i /path/to/private/key/id_rsa_<username>

Obviously you need to replace <username> with your own username.

Then you can choose one of these machines:

  • Ourasi: 20 cores / 40 threads / 32GiB RAM / 2x Nvidia RTX 2080 Ti
  • Kaisertrot: 20 cores / 40 threads / 32GiB RAM / 2x Nvidia RTX 2080 Ti (actually, one is dead)
  • Boldeagle: 8 cores / 16 threads / 32GiB RAM / 2x Nvidia GTX 1080 Ti
  • Readycash: 8 cores / 16 threads / 32GiB RAM / 2x Nvidia GTX 1080 Ti

These are shared resources, so please do not use more than 2 GPUs at a time. You can see who else is using the CPUs/GPUs using htop or nvidia-smi.

You can also transfer files from your computer to the servers using scp (see man scp).

Date: 2020-09-30 Wed 00:00

Author: Benjamin Negrevergne, Laurent Meunier

Created: 2021-12-08 Wed 10:43
