DataScience Lab

News / info

  • DEADLINE FOR A3 WAS POSTPONED, new deadline 08/12/2023 23:59 (DD/MM/YYYY CET)

(Tentative) planning for the year

Note: A1 = assignment 1, Ax = assignment x.

Date           Description
September 19   Class intro + Intro A1
September 27   Group sessions
October 3      NO CLASS
October 10     Preliminary presentations A1
October 17     Final presentations A1. Intro A2
October 24     Alex's presentation on PR
October 31     Group session A2
November 7     Preliminary presentations A2
November 15    Final presentations A2. Intro A3
November 22    Lucas' presentation on Randomized Ensembles
November 27    NO CLASS (PSL Week)
December 5     Preliminary presentations A3
December 6     Preliminary presentation A3

HowTo

Group sessions

How it is supposed to work:

  • Students describe their plan/idea/readings/experiments and ask questions;
  • Professors answer questions when they can.

How it is not supposed to work:

  • Professors explain to students how to conduct their project.

Class presentations

For each assignment, each group is expected to give exactly one presentation (either a preliminary presentation or a final presentation).

  • WARNING1: Timing will be extremely strict (i.e. you will be interrupted in the middle of your sentence).
  • WARNING2: Focus on the novelty of your work (and not on what has been presented during class)

Preliminary presentations

  • 6 minutes (~ 5 slides)
  • Briefly & clearly state the problem you are working on (one slide).
  • Present and compare approaches you are considering for solving the problem
  • Describe what you have implemented (briefly)
  • Discuss possible experiments and evaluation metrics.
  • Present preliminary results if you have any.

Final presentations

  • 6 minutes (~ 5 slides)
  • Briefly & clearly state the problem you are working on (one slide).
  • Present and compare approaches you have studied during this assignment.
  • Describe what you have implemented (briefly)
  • Discuss the evaluation metrics you have used.
  • Show experimental results and discuss these results.

Reports

  • 1 front page with the student names, the team name, and optionally the project title
  • 5 extra pages max. (references not included, figures included)
  • pdf file named report.pdf
  • has to be available on the git repository by the deadline (NO EMAIL!)

Reports should contain:

  • a detailed list of what you have implemented, together with the name of the file in your repository containing the corresponding source code. If you have used external libraries to do something important, please mention it;
  • a list of the experiments conducted, with a conclusion;
  • anything interesting that you have learned from working on the assignment.

They should not contain:

  • detailed description of the principles of the techniques seen in class;
  • extensive code listings (brief pseudocode is OK).

Assignment 1

Links

  • Slides assignment 1: here
  • GitHub classroom link: here
  • Alex's code: here
  • Testing datasets are available here
  • Testing platform: here

Refs

  • Recommender Systems : The Textbook by Charu C. Aggarwal (read the section about MF, available at the library)
  • For PCA, non-linear PCA, kernel PCA etc. see Generalized Principal Component Analysis by René Vidal, Yi Ma and S. Shankar Sastry: here
  • Deep Matrix Factorization, by Xue et al.

Assignment 2

Links

  • Slides assignment 2: here
  • GitHub classroom link: here (fixed!)
  • Testing platform: here
  • DEADLINE for Assignment 2: Friday, 2023-11-17 at 11:59pm
    • report + slides + code have to be on the repository by this date
  • Alex's presentation on precision and recall here

Refs

Assignment 3

  • Slides assignment 3: here
  • GitHub classroom link: here
  • Testing platform: here
  • DEADLINE for Assignment 3: December 8.
  • Lucas' slides: here

FAQ

Can I develop approach X (a method that has not been discussed in class)?

You are very much encouraged to study & implement something that we have not discussed in class, as long as it is a solution to the problem we're trying to solve.

Typically, it is a good idea to compare some approach that we have discussed in class with something that we have not discussed in class, so that your experience can benefit other students (and so we can have new ideas for next year).

Is it mandatory to use the dataset or the metric specified by the professors?

If you can, you should. It's better if you run at least one experiment that is comparable with the experiments of the other groups working on the same problem.

However, comparative experiments are not always very insightful, so you are also encouraged to conduct other types of experiments using different datasets or different metrics to better understand how your approach behaves. Be creative.

Last year, one group made a random dataset generator so they could plot the performance of their algorithm w.r.t. the size of the dataset. From that plot, they concluded that their approach could never scale to any realistic dataset :). That's just an example, but it was good work, and it turned out that generating realistic random matrices was also an interesting problem.
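
A minimal sketch of that kind of scaling experiment, using only the Python standard library. The generator random_matrix, the placeholder algorithm pairwise_dot_products and the chosen sizes are all hypothetical and only illustrate the idea; this is not what that group actually implemented, and uniform random values are certainly not a realistic dataset.

```python
import random
import time


def random_matrix(n_rows, n_cols, seed=0):
    """Hypothetical generator: uniform random ratings in [1, 5] (not realistic)."""
    rng = random.Random(seed)
    return [[rng.uniform(1.0, 5.0) for _ in range(n_cols)] for _ in range(n_rows)]


def pairwise_dot_products(matrix):
    """Placeholder for your own algorithm: all row-by-row dot products."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in matrix]
            for row_i in matrix]


def time_vs_size(sizes, algorithm=pairwise_dot_products):
    """Return (size, seconds) pairs for square random datasets of each size."""
    results = []
    for n in sizes:
        matrix = random_matrix(n, n)
        start = time.perf_counter()
        algorithm(matrix)
        results.append((n, time.perf_counter() - start))
    return results


if __name__ == "__main__":
    for n, seconds in time_vs_size([20, 40, 80, 160]):
        print(f"{n:4d} x {n:<4d}  {seconds:.4f} s")
```

To test your own method, pass it as the algorithm argument and plot the resulting pairs with the plotting tool of your choice.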

Do I have to use Git? Can I use Jupyter Notebook instead?

Git and Jupyter Notebook are two very different tools. Yes, you have to use Git. You can also use Jupyter Notebook if you want.

Git is a tool to manage a source code repository. It is used to version your code (keep track of the changes) and collaborate with other developers (merge multiple concurrent versions of the code). You have to use it, because this is how I am going to access your code/report at the end of the project. You also have to use it because it's critical for you to know how to use it if you ever want to collaborate with someone, or handle a code base that contains more than a few lines of code.

Jupyter Notebook is an interactive browser-based code editor. It can be used to run a few lines of Python code in your browser, but it is not so convenient when you have a large code base, or when you want to run your code on a remote server, or non-interactively. You can use it if you want, but I will not check it unless you explicitly refer to it in your report.

In case you want to use it, I recommend that you first write a Python Module with all the important functions inside (See Python Modules). Then, you can import this module in your Jupyter Notebook and call the functions from there. This way, you can also write a simple non-interactive script so that you can run your program on a remote server.
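
As a rough sketch of that layout, assume a hypothetical file named mylib.py. The file name, the functions load_ratings and mean_rating, and the CSV format are all invented for illustration. A notebook would simply run import mylib and call the same functions, while the main function makes the file usable as a non-interactive script.

```python
"""mylib.py: hypothetical module holding the important functions of a project."""
import csv
import sys


def load_ratings(path):
    """Read (user, item, rating) rows from a CSV file without a header."""
    with open(path, newline="") as handle:
        return [(user, item, float(rating))
                for user, item, rating in csv.reader(handle)]


def mean_rating(ratings):
    """Return the average rating, or 0.0 for an empty list."""
    if not ratings:
        return 0.0
    return sum(rating for _, _, rating in ratings) / len(ratings)


def main(argv):
    """Non-interactive entry point: python mylib.py RATINGS.csv"""
    if len(argv) != 1:
        print("usage: python mylib.py RATINGS.csv")
        return 1
    print(mean_rating(load_ratings(argv[0])))
    return 0


if __name__ == "__main__":
    main(sys.argv[1:])
```

Keeping the logic in plain functions means the notebook and the script always run the same code, so results obtained interactively can be reproduced on a server.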

I don't have enough computing power.

You can either use Google Colab (Jupyter Notebooks hosted at Google), or access the GPU servers hosted at Lamsade (the computer science lab at Dauphine) using ssh. Your account has already been created; you just need your private key. Send me an email if you want it.

Then open a terminal and type

chmod 600 /path/to/private/key/id_rsa_<username>
ssh <username>@ssh.lamsade.dauphine.fr -p 5022 -i /path/to/private/key/id_rsa_<username>

You need to replace <username> with your own username.

Then you can choose one of these machines:

  • Ourasi: 20 cores / 40 threads / 32GiB RAM / 2x NVIDIA A6000
  • Kaisertrot: 20 cores / 40 threads / 32GiB RAM / 2x NVIDIA A6000
  • Boldeagle: 8 cores / 16 threads / 32GiB RAM / 2x Nvidia GTX 1080 Ti
  • Readycash: 8 cores / 16 threads / 32GiB RAM / 2x Nvidia GTX 1080 Ti

These are shared resources, so please do not use more than one GPU at a time! You can see who else is using the CPUs/GPUs with htop, nvtop or nvidia-smi.
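
One common way to stay within a single GPU is the CUDA_VISIBLE_DEVICES environment variable, which CUDA-based libraries read when they initialise. A minimal sketch, assuming the variable is set before the deep learning framework is imported. The helper name use_single_gpu and the index 0 are illustrative; pick whichever GPU nvidia-smi shows as idle.

```python
import os


def use_single_gpu(index=0):
    """Expose only one GPU to CUDA libraries imported after this call."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Call this at the very top of your script, before importing
# torch, tensorflow, jax or any other CUDA-based library.
use_single_gpu(0)
```

Inside the process, the selected GPU then appears as device 0, whatever its index on the machine.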

You can also transfer files from your computer to the servers using scp (see man scp)

How do I use scp to copy files to the Lamsade servers?

To copy the local file test.py to your home directory on the Lamsade servers:

scp -i idfile -P 5022 test.py username@ssh.lamsade.dauphine.fr:.

It also works the other way around:

scp -i idfile -P 5022 username@ssh.lamsade.dauphine.fr:test.py .

Notice the . at the end.

Explanations:

  • -i idfile because you need to specify your private key for the authentication to succeed; do it with -i. See man scp.
  • -P 5022 because the ssh server at Lamsade doesn't run on the standard ssh port (22) for security reasons, so you need to specify the actual port. For scp, you can do it with -P (notice the capital P). See man scp.
  • :. The path specification (the part after the colon :) is a standard unix path, so if it starts with a / it's an absolute path (i.e. relative to the root of the filesystem /), otherwise it is relative to the current directory, which in this case is the directory in which you end up when you log on using ssh (your home directory).

Recall that on unix . always refers to the current directory (and .. to the parent directory, hence the command cd ..).

  • ssh.lamsade.dauphine.fr: the remote server specification should always be a valid DNS name (or an IP address). In this case, it refers to the ssh server of the lamsade subdomain of the dauphine domain in the fr DNS zone.
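
If you copy files often, the same command can be assembled from a script. A minimal sketch using only the Python standard library; the helper names scp_upload_command and upload are hypothetical, and the host and port defaults are the ones given above.

```python
import subprocess


def scp_upload_command(local_path, username, key_file, remote_path=".",
                       host="ssh.lamsade.dauphine.fr", port=5022):
    """Build the scp argument list for copying a local file to the server."""
    return ["scp", "-i", key_file, "-P", str(port),
            local_path, f"{username}@{host}:{remote_path}"]


def upload(local_path, username, key_file):
    """Run scp and raise CalledProcessError if the copy fails."""
    subprocess.run(scp_upload_command(local_path, username, key_file), check=True)


if __name__ == "__main__":
    # Only prints the command; call upload(...) to actually copy the file.
    print(" ".join(scp_upload_command("test.py", "username", "idfile")))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a file name contains spaces.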

Is there a way to simplify the process of logging in and copying files using ssh/scp?

Yes, you can configure your ssh client to remember all the important information (key, username, port etc.), but the exact way to do it depends on the ssh client you are using.

If you are running unix locally, your ssh client is the program that is executed when you type the command ssh. It is configured using various config files that are located in the .ssh directory that is in your home directory.

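You can start by copying your private key inside your (local) .ssh directory. If the file has one of the default names (for example id_rsa), ssh will find it and try it automatically when you log in; otherwise you still have to point to it, either with -i or with an IdentityFile line in your ssh config file.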

You can also specify the port and the complete DNS name inside a file called config in your .ssh directory.

Mine contains this:

Host lamsade
Hostname ssh.lamsade.dauphine.fr
User bnegrevergne
Port 5022

so I can just type ssh lamsade and scp file lamsade:.

If you are using another ssh client, I am sure you can do this as well, but I don't know how. If you find out, tell me how, and I'll put it here for the others.

Date: 2020-09-30 Wed 00:00

Author: Benjamin Negrevergne, Alexandre Vérine

Created: 2023-12-08 Fri 15:27
