I'm looking for information on how a Python Machine Learning project should be organized. For ordinary Python projects there is Cookiecutter, and for R there is ProjectTemplate.
This is my current folder structure, but I'm mixing Jupyter Notebooks with actual Python code, and it does not seem very clear.
.
├── cache
├── data
├── my_module
├── logs
├── notebooks
├── scripts
├── snippets
└── tools
I work in the scripts folder and am currently adding all the functions in files under my_module, but that leads to errors loading data (relative/absolute paths) and other problems.
I could not find proper best practices or good examples on this topic besides this Kaggle competition solution and some notebooks that have all the functions condensed at the start of the notebook.
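The path errors usually come from relative paths like "data/train.csv" breaking whenever a script or notebook is launched from a different working directory. A minimal sketch of one common fix, assuming a module placed one level below the project root (the file name paths.py and the helper data_path are invented here for illustration):

```python
from pathlib import Path

# Resolve the project root from this file's location instead of the
# current working directory, so paths work no matter where scripts
# are launched from. Assumes this file lives at <project>/my_module/paths.py.
PROJECT_ROOT = Path(__file__).resolve().parent.parent
DATA_DIR = PROJECT_ROOT / "data"
CACHE_DIR = PROJECT_ROOT / "cache"

def data_path(*parts: str) -> Path:
    """Build an absolute path under the data/ directory."""
    return DATA_DIR.joinpath(*parts)
```

Scripts and notebooks can then call data_path("raw", "train.csv") instead of hard-coding a path relative to wherever they happen to run.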
You may want to look at:
http://tshauck.github.io/Gloo/
Gloo's goal is to tie together a lot of the data analysis actions that happen regularly and make those processes easy: automatically loading data into the IPython environment, running scripts, making utility functions available, and more. These are things that have to be done often, but aren't the fun part.
It's not actively maintained but the basics are there.
We've started a cookiecutter-data-science project designed for Python data scientists that might be of interest to you, check it out here. Structure is explained here.
Would love feedback if you have it! Feel free to respond here, open PRs or file issues.
In response to your issue about re-using code by importing .py files into notebooks, the most effective way our team has found is to append to the system path. This may make some people cringe, but it seems like the cleanest way of importing code into a notebook without lots of module boilerplate or a pip install -e.
One tip is to use the %autoreload and %aimport magics with the above. Here's an example:
# Load the "autoreload" extension
%load_ext autoreload

# Always reload modules marked with "%aimport"
%autoreload 1

import os
import sys

# Add the 'src' directory as one where we can import modules from
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# Import the module from the source code; it is reloaded whenever it changes
%aimport preprocess.build_features
The above code comes from section 3.5 in this notebook for some context.
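Outside IPython, the same trick is just a sys.path append followed by a normal import. A self-contained sketch of what the recipe above does, where the preprocess.build_features module and its scale function are invented here purely for illustration (the sketch fabricates a throwaway src/ tree so it can run anywhere):

```python
import importlib
import os
import sys
import tempfile

# Stand-in for the project's ../src directory: build a throwaway
# "preprocess" package with one module inside it.
root = tempfile.mkdtemp()
src_dir = os.path.join(root, "src")
pkg_dir = os.path.join(src_dir, "preprocess")
os.makedirs(pkg_dir)
open(os.path.join(pkg_dir, "__init__.py"), "w").close()
with open(os.path.join(pkg_dir, "build_features.py"), "w") as f:
    f.write("def scale(x):\n    return [v / max(x) for v in x]\n")

# The same trick as in the notebook: put src on sys.path, then import.
sys.path.append(src_dir)
build_features = importlib.import_module("preprocess.build_features")
print(build_features.scale([1, 2, 4]))  # → [0.25, 0.5, 1.0]
```

In a notebook, %aimport additionally registers the module with autoreload, so edits to build_features.py take effect without restarting the kernel.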
When you are working on a data project, there are often many files that you need to store on your computer. These files may include:
It will save you time and make your project more usable and reproducible if you carefully consider how these files are stored on your computer. Below are some best practices to consider when pulling together a project.
As you create new directories and files, consider using a carefully crafted naming convention that makes it easier for anyone to find things and to understand what each file does or contains.
It is good practice to use file and directory names that:
- use - or _ to separate words within the name, to make them easy to read and parse, and
- start with a number where order matters (e.g. 01-max.jpg, 02-terry.jpg, etc.), which will result in sortable files.

These guidelines not only help you to organize your directories and files, but they can also help you to implement machine readable names that can be easily queried or parsed using scientific programming or other forms of scripting.
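Zero-padding the numeric prefix is what makes the names sort correctly, because plain string sorting puts "10" before "2". A small illustration (the file names here are invented):

```python
# Unpadded numeric prefixes sort lexicographically, not numerically:
# "10" compares less than "2" character by character.
unpadded = ["1-max.jpg", "10-sam.jpg", "2-terry.jpg"]
print(sorted(unpadded))  # 10-sam.jpg lands before 2-terry.jpg

# Zero-padded prefixes keep lexicographic and numeric order in agreement.
padded = [f"{i:02d}-{name}.jpg"
          for i, name in enumerate(["max", "terry", "sam"], start=1)]
print(sorted(padded))  # → ['01-max.jpg', '02-terry.jpg', '03-sam.jpg']
```

Two digits of padding covers up to 99 files; pick the width based on how many files you expect.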