Data Analysis Project 2023 - overview

Responsible teachers:

  • Carl Herrmann (carl.herrmann@bioquant.uni-heidelberg.de) - IPMB & BioQuant
  • Maiwen Caudron-Herger (m.caudron@dkfz-heidelberg.de) - Division - RNA Biology & Cancer (B150)

The goal of the Data Analysis Module during the Summer Semester is to provide hands-on experience in data analysis of large scale datasets and get first insights into using computational tools to provide a reproducible data analysis.

After this module, you will have

  • gained skills in programming language such as R or Python (depending on the chosen project)
  • learn to use tools to perform reproducible analysis, like Markdown/Notebooks and GitHub
  • understand the key steps in a large-scale data analysis

In a nutshell…

  1. choose your team and one project before Monday 24.04 (10am) (Registration opens Friday 21.04 1pm in this Google Sheet)
  2. prepare and present a 10 min. project proposal on Wednesday 17.05
  3. work on your project during the whole term
  4. meet your tutor every week
  5. prepare a final report and make a final presentation on Wednesday 19.07

Important dates

  • Introductory lecture Wednesday 19.04
    • overall presentation of the module
    • presentation of the different topics and sub-projects
    • introduction to Markdown / Github
  • Second lecture Wednesday 26.04
    • Introduction to linear regression

Projects

We have defined 5 topics in data and image analysis; each project will comprise up to 5 different sub-projects. Most of the time, these 5 sub-projects are very similar to each other but analyze slightly different datasets.

You can find a description of the 5 topics here:

You will find a description of the projects an a list of supervisors/tutors in these description files.

Slides from presentations (19.04)

Important information:

  • Topic 1 is an image analysis project which will be performed in Python
  • Topics 2,3,4,5 are data analysis topics/projects which will be conducted in R

How to …

How do I select my project/team?

  1. check the project description on this page!
  2. select your teammates; each sub-project will be worked out by groups of 4 students.
  3. once the choice has been made, register your team in a Google Spreadsheet (link will be provided) (registration will open Friday 21.04.23, 1pm); the choice of sub-project and definition of the teams should be completed by Monday, 24.04.23, 10am (no extension!)
  4. create a GitHub account and register your github name in the registration Google Sheet

Who will help me?

For each project, there will be a tutor assigned to this project. Each team within a project will have a weekly online meeting with his tutor on Wednesday between 10am and 1 pm during 20-30 minutes.

VERY IMPORTANT: as the weekly time which the tutor can dedicate to your project is limited, you should carefully prepare your meeting. We have provided a template which should help you organize your weekly meeting efficiently!

What am I supposed to do?

  1. select your project and you teammates and register before Wednesday, 26.04, 10am
  2. Project presentation on 17.05.23: prepare a 10 minutes presentation (+10 minutes discussion/questions) based on the indicated literature for each project, listing the relevant questions/topics that you want to address in your project. During this presentation, you should also explain the datasets you will be working with, and how you want to make use of them.
  3. Final presentation on 19.07 (15 minutes + 10 minutes discussion / questions)

How will I be evaluated?

Each student will have an individual evaluation! This will take into account the 2 presentations listed above, as well as the report (markdown report / jupyter notebook depending on the projects).

Here are the relevant points taken into account during the project proposal presentation:

  • presentation of the literature results
  • understanding of the biological question
  • clarity of presentation of planned analysis and milestones
  • knowledge of the datasets
  • teamwork aspects
  • quality of the oral presentation and slides

During the oral presentations, each student will be asked to explain part of the analysis, especially to explain the code! So everyone should make sure to be involved in the project.

Final reports will be submitted (or “committed”) to the Github repository of the group as a .Rmd or .ipynb file.

Important note: Science is collaboration! so please make sure to share your insights/knowledge with other groups! You are free to choose whatever way to do so, e.g. Whatsapp groups or Slack groups.


Practical aspects

Depending on the projects, you will use either R (Topics 02/03/04/05) or python (Topics 01).

R-based projects

You will use RStudio, and create a R markdown document.

RStudio

  • For those of you who want to work on their laptops, you can install RStudio using the previous link; however, since some datasets are rather large, you will need a decent laptop to be able to process the data in a reasonable time (i5/i7 CPUs with 8Gb RAM at least)

RMarkdown documents

These will consist in a mixture of plain text (explanations about the analysis, comments,…) and code pieces (called chunks). The advantage is that the report will be automatically generated (either as pdf or html document) when compiling the markdown file. All plots will be automatically and dynamically created from the code pieces in the markdown document.

Have a look at this tutorial or this one to get started with RMarkdown; RMarkdown is very easy to generate with RStudio.

Python based projects

Similar to R markdow documents, the Jupyter Notebook offers a way to mix markdown text together with Python code. Installing Jupyter Notebook requires the installation of Python. Just follow the instructions on the previous link.

GitHub

Git is a system to handle collaborative projects, in which each member of the team is contributing to the project. You can check this website for a simple intro to Git/GitHub.

Git can be used either from the command line, or using GitHub Desktop, a GUI manager which makes commiting changes, etc… very easy.

This tool will help you (and us…) track the progress of your project.

Documentation

Introduction to GitHub

Here some slides on GitHub

These are the Python Notebook files with Python intro provided by David Schwarzenbacher