Heterogeneous Computing for Signal and Data Processing

Parallel computing with GPUs and other devices

Course number: EECS E4750

(Original name: Signal Processing and Communications on Mobile Multicore Processors)

Prof. Zoran Kostic, Electrical Engineering Department, Data Sciences Institute, Columbia University in the City of New York

Target Audience:

Students interested in acquiring software and systems design skills in parallel computing for graphics processing units (GPUs) and heterogeneous computing infrastructure, relevant to applications in the data processing, deep learning, signal processing, and communications industries.

Bulletin Description:

  • Methods for deploying signal and data processing algorithms on contemporary general purpose graphics processing units (GPGPUs) and heterogeneous computing infrastructures. Using programming languages such as OpenCL and CUDA for computational speedup in audio, image and video processing and computational data analysis. Significant design project.

Dates:

  • Fall 2022: EECS E4750

  • Fall 2021, 2020, 2019, 2018, 2017, 2016: EECS E4750

  • Fall 2015, 2014: ELEN E4750 : SP & COMM ON MOBILE MULTI PROC

Content

Applications of Parallel Computing

Graphics Processing Unit (GPU) architecture and programming.

Heterogeneous Parallel Computing (HPC)

Parallel software development in OpenCL, CUDA, Apple Metal, Vulkan, and other standards.

  • Motivating examples from imaging, audio, multimedia, deep learning

  • Cross section of mobile processor architectures: Nvidia, AMD, Intel

  • General Purpose Processors, Graphics Processing Units (GPUs), DSPs

  • ARM architecture

  • Parallel programming concepts for mobile platforms

  • CUDA and OpenCL languages

  • Tools: development environments, code development, profiling

  • Standards: Khronos OpenGL, WebGL, HSA

  • Parallel programming examples

    • Signal processing

    • Image and video processing

    • Neural networks and deep learning

    • Communications processing, protocols

    • Data Analysis

  • Power Considerations

Syllabus Details

Theory, CUDA, OpenCL:

  • Portability and Scalability in HPC

  • Data Parallelism and Threads

  • Memory Hierarchy

  • Memory Allocation and Data Movement

  • Kernel-Based Parallel Programming

  • Memory Bandwidth and Coalescing

  • Matrix-Matrix Multiplications

  • Threads, Warps and Wavefronts

  • Thread Scheduling

  • Tiled Processing for 1D, 2D

  • Control Divergence

  • Convolution and Tiled Convolution

  • Reduction Kernels

  • Atomic Operations

  • Histogram Kernel

  • Applications: Deep Learning, Imaging, Video, ...

  • Profiling and Debugging

Project Suggestions for implementation in CUDA or OpenCL

  • Image processing

  • Audio processing

  • Machine learning

  • Deep Learning algorithm parallelization

  • Optimization of communication networks

  • Optimization of energy networks

  • Medical applications

  • Graphics

  • Video processing

  • Visualization

  • Financial applications


Books, Tools and Resources

  • BOOKS:

    • David Kirk and Wen-mei Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," 3rd Edition, Elsevier, eBook ISBN: 9780128119877, Paperback ISBN: 9780128119860 (https://www.elsevier.com/books/programming-massively-parallel-processors/kirk/978-0-12-811986-0)

    • Older edition: D. Kirk and W. Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," 2nd Edition, Morgan Kaufmann (Elsevier), ISBN-13: 978-0124159921, ISBN-10: 0124159923 (http://www.elsevier.com/books/programming-massively-parallel-processors/kirk/978-0-12-415992-1)

    • Ravishekhar Banger and Koushik Bhattacharyya, "OpenCL Programming by Example," Packt Publishing (December 23, 2013), ISBN-10: 1849692343, ISBN-13: 9781849692342

  • Parallel machines:

    • Google Cloud GPUs

    • Server with NVIDIA Tesla K40 + NVIDIA Quadro K5000s + Mobile: Jetson TK1 + Intel Xeon E5-1620v

  • SDKs (software development kits) by NVIDIA, Intel...

2014 Fall Projects

  • Low Rank Matrix Recovery Using Principal Component Analysis

  • Acceleration of Genetic Algorithms and Image Pattern Recognition of fMRI Fingerprint

  • Parallel Implementations of Detection Algorithms for MIMO Systems on the GPU

  • Harnessing GPU for solving Options Pricing problems in Financial Engineering

  • Topics Extraction with GPU Acceleration (machine learning)

  • Parallel Decoding of Space Time Codes on GPU

  • Image processing using parallel computing and PyOpenCL (Night vision)

2015 Fall Projects

  • 3D Image Reconstruction (Stereo Vision Based Depth Perception & 3D Spatial Reconstruction)

  • Accelerate the Analysis of EEG Signal Based on Nonlinear Feature Extraction and Classification by Parallel Algorithm

  • Fast VOIP MOS (Mean Opinion Score) calculation

  • GPU Acceleration for Neural Network based Handwritten Digits Recognition

  • Image Matching Accelerator based on SIFT

  • Image blending

  • Image Stitching

  • Local Linear Embedding using OpenCL

  • Performance of Linear Equalization in Narrowband Channels

  • Parallel Computing on SAR Image Processing

  • Canny Edge and Boundary Detection using OpenCL

  • Parallel HEVC Video Compression Using OpenCL

  • Speaker Recognition

2016 Fall Projects

  • 3D Voxel De-blurring

  • Laplacian Approximation on GPU

  • Basic object recognition

  • Camera Localization

  • Dark channel haze removal

  • Disparity Map Calculation by GPU

  • FPGA Implementation of PyOpenCL/OpenCL

  • First Principles MPI Simulator

  • Fingerprint recognition for security

  • GPU acceleration for SPH (Smoothed Particle Hydrodynamics)

  • K-means Clustering Acceleration on GPU

  • Kinect Color and Depth Image Alignment

  • GPU-based Monte Carlo simulation of light transport for optical fiber probe geometries

  • Object Tracking Based on Video Analysis

  • Parallel Computing in Traffic Sign Detection

  • Real-time medical image processing empowered by parallel computing

  • Parallel simulated annealing

  • Recommendation Algorithms using deep learning

  • Real-time image de-hazing

  • Smart Cluster Construction

2017 Fall Projects

  • Parallel Methods for Image Deblurring

  • Parallel Locality-Sensitive Hashing In Movement and Gesture Recognition

  • 3D Wifi Mesh Generation

  • Sudoku Solver with Parallel Backtracking

  • Parallel Processing for Analysis Purpose

  • Implement a game playing system

  • 3D Image Reconstruction (Stereo Vision Based Depth Perception & 3D Spatial Reconstruction)

  • Vector representation for words

  • Iris detection in biometrics

  • Parallel Computation in Image Registration

  • 3D Human Action Recognition

  • Fractal Image Compression in Parallel Computing Platform

  • Parallelized Object Tracking with KCF

  • Parallelized Stock Market Prediction

  • Parallelized Monte Carlo Methods in Reinforcement Learning

  • Sparse Representation Based Image Super Resolution

  • Neural Network Based Artistic Style Transfer Algorithm

  • Face recognition with CNN

  • Object Detection using Cascade Classifier

2018 Fall Projects

  • 3D rendering

  • Parallelization for KL Divergence Non-Negative Matrix Factorization

  • Deconvolution using Richardson–Lucy approach

  • 3D City Adventure Game

  • Acceleration of fiber orientation analysis and visualization for optical coherence tomography imaging

  • GPU-accelerated 3D Reconstruction from sketches via Multi-view convolutional networks

  • Using Parallel Computing to Accelerate Artificial Neural Network

  • Mutual Information Based Semi Global Stereo Matching

  • Parallel SGD

  • SLAM solver

  • Accelerating forward propagation of ZF-Net

  • CUDA and OpenCL acceleration of typical deep learning algorithms

  • Rapid optical coherence tomography image acquisition

  • Parallel Optimization for Deep Learning

  • Finite Difference Time Domain Simulation

2019 Fall Projects

  • GPU Acceleration of Canny Edge Detection for Images

  • Accelerating Discrete Wavelet Transforms on Parallel Architectures

  • Acceleration of K-means Clustering using PyCUDA

  • Acceleration of Point Correspondence Model Construction in Cone-Beam CT Scans

  • CUDA implementation of Recommender Systems

  • Data Augmentation Techniques for Deep Learning: Comparing CUDA, OpenCL, and Serial Implementations

  • Hardware Acceleration of Secure Hashing

  • Image Preprocessing for Convolutional Neural Networks using Parallel Computing

  • Accelerated Singular Value Decomposition for Principal Component Analysis

  • Parallel Particle Swarm Optimization

  • Acceleration of Spectral Domain Phase Microscopy for Vibrometry in Spectral Domain Optical Coherence Tomography

  • Spring mass system simulation

  • Efficient Primitives for CP and Tucker Tensor Decompositions on GPUs

  • Acceleration of Deep Learning Algorithm U-net to make predictions of drivers' workload

  • Particle filter For Mobile Robot localization

2020 Fall Projects

  • Acoustic Feature Extraction Acceleration

  • Acceleration of Multiple Signal Classification (MUSIC) Algorithm for Frequency Estimation

  • Parallel Acceleration for Cross-Lingual Relation Extraction

  • Acceleration of Demosaicing Bayer Color Filter Arrays (CFA)

  • Efficient CNN for Video Understanding

  • Acceleration of Long Short-Term Memory (LSTM) on GPU

  • Parallel linear programming solver

  • Acceleration of a Spiking Neural Network

  • Acceleration of Support Vector Machine (SVM) algorithm

2021 Fall Projects

  • 3D Virtual Acoustics Using CUDA

  • Acceleration of Image Haze Removal Using Dark Channel Prior

  • Acceleration of LeNet on GPU with PyCUDA

  • Acceleration of k-means algorithm

  • Acceleration of GloVe representation

  • Convert Data Reduction into Actual Computational Efficiency: Customized Parallelization of Convolution with Sparse Mask

  • Parallelizing class imbalance problem using SMOTE

  • Deep reinforcement learning with GPU acceleration

  • Heterogeneous Stock Inference via Scheduling

  • Parallelize RANSAC

  • Parallel Random Forests

  • Parallelizing Nonlocal Means (NLM) denoising algorithm for 3D images

  • Speeding up a genetic algorithm using parallel computation

  • Non-negative Matrix Factorization

  • Fast and High Precision Multi-resolution Engine using Parallel Processors