
Cross-Spectrum Face Recognition

Imagery for facial recognition is primarily collected in the visible spectrum. However, for military, local law enforcement, and commercial security applications, performing facial recognition under variable and low illumination conditions is a significant challenge. Therefore, we proposed an optimal feature regression and discriminative classification framework for matching thermal face images to visible face images.
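To make the idea concrete, below is a minimal PyTorch-style sketch of feature regression paired with a discriminative classifier; the module names, feature dimensions, and loss weighting are illustrative assumptions rather than the published framework.

import torch.nn as nn
import torch.nn.functional as F

class FeatureRegressor(nn.Module):
    """Maps thermal feature vectors into the visible feature space (hypothetical sizes)."""
    def __init__(self, feat_dim=512, num_ids=100):
        super().__init__()
        self.mapper = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        self.classifier = nn.Linear(feat_dim, num_ids)

    def forward(self, thermal_feat):
        mapped = self.mapper(thermal_feat)       # regressed toward visible features
        return mapped, self.classifier(mapped)   # logits keep the mapping discriminative

def regression_classification_loss(mapped, visible_feat, logits, labels, alpha=1.0):
    # Regression pulls mapped thermal features toward paired visible features;
    # cross-entropy keeps the mapped features identity-discriminative.
    return F.mse_loss(mapped, visible_feat) + alpha * F.cross_entropy(logits, labels)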


Multi-Region Thermal-to-Visible Face Synthesis

Often in operational settings, facial recognition reports must be carefully adjudicated by experts. Although custom thermal-to-visible face recognition algorithms are capable of matching, experts cannot easily verify that the identities of individuals in the returned matches actually correspond to the identities of individuals in the thermal imagery. Therefore, we were the first to propose an algorithm that synthesizes a visible-like image preserving facial structure and geometry from a given thermal image.
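As a rough illustration of the synthesis step, the sketch below shows a small encoder-decoder that maps a single-channel thermal face to a visible-like RGB output; the layer sizes and the simple L1 structure term are assumptions, not the published model.

import torch.nn as nn
import torch.nn.functional as F

class ThermalToVisible(nn.Module):
    """Encoder-decoder that maps a 1-channel thermal face to a visible-like RGB image."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, thermal):
        return self.decoder(self.encoder(thermal))

def structure_loss(pred_visible, true_visible):
    # Simple pixel-wise term; preserving facial structure and geometry would
    # typically add perceptual/identity terms on top of this.
    return F.l1_loss(pred_visible, true_visible)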


Cross-Domain Identification

Recent advances in domain adaptation, especially those applied to heterogeneous facial recognition, rely upon restrictive Euclidean loss functions that perform best when images from different domains are co-registered and temporally synchronized. Therefore, we introduce a novel domain adaptation framework that combines a new feature mapping sub-network with existing deep neural network architectures. This framework is optimized with new cross-domain identity and domain invariance loss functions.
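The sketch below hints at how cross-domain identity and domain invariance terms can be combined; the moment-matching invariance term and the weight lam are stand-ins for the actual loss formulations, not the published objective.

import torch.nn.functional as F

def cross_domain_identity_loss(vis_feat, thm_feat, labels, classifier):
    # A shared classifier applied to both spectra encourages features from
    # either domain to predict the same identity.
    return F.cross_entropy(classifier(vis_feat), labels) + \
           F.cross_entropy(classifier(thm_feat), labels)

def domain_invariance_loss(vis_feat, thm_feat):
    # Simple moment matching between domains as a surrogate for invariance.
    return F.mse_loss(vis_feat.mean(dim=0), thm_feat.mean(dim=0))

def total_loss(vis_feat, thm_feat, labels, classifier, lam=0.1):
    return (cross_domain_identity_loss(vis_feat, thm_feat, labels, classifier)
            + lam * domain_invariance_loss(vis_feat, thm_feat))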


Domain and Pose Invariance

Interest in thermal-to-visible face recognition has grown significantly over the last decade due to advancements in thermal infrared cameras and analytics beyond the visible spectrum. Despite large discrepancies between the thermal and visible spectra, existing approaches bridge the domain gap by either synthesizing visible faces from thermal faces or by learning cross-spectrum image representations. These approaches typically work well with frontal facial imagery collected at varying ranges and expressions, but exhibit significantly reduced performance when matching thermal faces with varying poses to frontal visible faces. We propose a novel Domain and Pose Invariant Framework (DPIF) that simultaneously learns domain- and pose-invariant representations. Our proposed framework is composed of modified networks for extracting the most correlated intermediate representations from off-pose thermal and frontal visible face imagery, a sub-network to jointly bridge domain and pose gaps, and a joint-loss function comprised of cross-spectrum and pose-correction losses. We demonstrate the efficacy and advantages of the proposed method by evaluating on three thermal-visible datasets: ARL Visible-to-Thermal Face, ARL Multimodal Face, and Tufts Face. Although DPIF focuses on learning to match off-pose thermal faces to frontal visible faces, we also show that DPIF enhances performance when matching frontal thermal faces to frontal visible faces.
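A minimal sketch of a joint loss combining cross-spectrum and pose-correction terms is shown below; the specific terms, pose parameterization, and weighting are assumptions and do not reproduce the exact DPIF objective.

import torch.nn.functional as F

def joint_loss(thermal_feat, visible_feat, pred_pose, frontal_pose, lam=0.5):
    # Cross-spectrum term: pull off-pose thermal features toward the paired
    # frontal visible features (illustrative choice of distance).
    cross_spectrum = F.mse_loss(thermal_feat, visible_feat)
    # Pose-correction term: penalize residual pose error after correction.
    pose_correction = F.l1_loss(pred_pose, frontal_pose)
    return cross_spectrum + lam * pose_correction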


Mitigating Catastrophic Interference (Forgetting)

Modern algorithms for RGB-IR facial recognition leverage precise and accurate guidance from curated (i.e., labeled) data to bridge large spectral differences. However, supervised cross-spectral face recognition methods are often extremely sensitive due to over-fitting to labels, performing well in some settings but not in others. Moreover, when fine-tuning on data from additional settings, supervised cross-spectral face recognition methods are prone to catastrophic forgetting. Therefore, we propose a novel unsupervised framework for RGB-IR face recognition that minimizes the cost and time inefficiencies of labeling the large-scale, multi-spectral data required to train supervised cross-spectral recognition methods, and alleviates forgetting by removing over-dependence on hard labels to bridge such large spectral differences. The proposed framework integrates an efficient backbone network architecture with part-based attention models, which collectively enhance common information between visible and infrared faces. The framework is then optimized using pseudo-labels and a new cross-spectral memory bank loss. This framework is evaluated on the ARL-VTF and TUFTS datasets, achieving 98.55% and 43.28% true accept rate, respectively. Additionally, we analyze the effects of forgetting and show that our framework is less prone to these effects.
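The sketch below illustrates one plausible form of a pseudo-label-driven cross-spectral memory bank loss; the shared prototype bank, momentum update, and temperature are illustrative choices rather than the published formulation.

import torch
import torch.nn.functional as F

class CrossSpectralMemoryBank:
    """Cluster prototypes shared by visible and infrared features (illustrative)."""
    def __init__(self, num_clusters, feat_dim, momentum=0.2, temp=0.05):
        self.bank = F.normalize(torch.randn(num_clusters, feat_dim), dim=1)
        self.momentum, self.temp = momentum, temp

    def loss(self, feats, pseudo_labels):
        # Non-parametric softmax over shared prototypes: because both spectra
        # hit the same bank, clusters are pushed to be spectrum-invariant.
        feats = F.normalize(feats, dim=1)
        logits = feats @ self.bank.t() / self.temp
        return F.cross_entropy(logits, pseudo_labels)

    @torch.no_grad()
    def update(self, feats, pseudo_labels):
        # Momentum update of each sample's assigned prototype.
        feats = F.normalize(feats, dim=1)
        for f, y in zip(feats, pseudo_labels):
            self.bank[y] = F.normalize(
                self.momentum * self.bank[y] + (1 - self.momentum) * f, dim=0)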


Joint Target Detection and Classification

By extending accumulative algorithms (akin to the Hough transform) to sensor network applications, we proposed a novel framework for simultaneous classification and localization of targets using distributed sensor networks. Since accumulative algorithms are naturally distributed, their communication and computational requirements are significantly lower than those of centralized methods.
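The toy example below conveys the accumulative idea: each sensor casts votes into a shared (location, class) accumulator and the peak yields a joint estimate. The grid, range-based vote model, and sensor values are hypothetical and only illustrate the style of computation.

import numpy as np

GRID, NUM_CLASSES = 50, 3
accumulator = np.zeros((GRID, GRID, NUM_CLASSES))

def cast_votes(acc, sensor_xy, class_scores, target_range):
    # Each sensor votes for grid cells consistent with its range estimate,
    # weighted by its local per-class likelihood scores.
    xs, ys = np.meshgrid(np.arange(GRID), np.arange(GRID), indexing="ij")
    ring = np.abs(np.hypot(xs - sensor_xy[0], ys - sensor_xy[1]) - target_range) < 1.0
    for c, score in enumerate(class_scores):
        acc[..., c] += ring * score

# Two hypothetical sensors vote; the accumulator peak is the joint estimate.
cast_votes(accumulator, (10, 10), [0.7, 0.2, 0.1], target_range=15.0)
cast_votes(accumulator, (40, 25), [0.6, 0.3, 0.1], target_range=20.0)
x, y, cls = np.unravel_index(accumulator.argmax(), accumulator.shape)
print(f"target near ({x}, {y}) with class {cls}")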


Cross-Modality Distillation for Sensor Networks

In wide-area surveillance applications using camera networks, the number of pixels on a particular object or person is often minimal, and there is frequently not enough detail to cue detection algorithms, potentially generating many false positive detections. Therefore, we propose a cross-modality distillation model that leverages the geometry of a heterogeneous sensor network in order to learn robust features for detecting and tracking objects and people across the camera network.
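A simplified distillation step is sketched below: a teacher network on one modality supervises a student on camera imagery through geometrically paired observations. Both networks and the pairing mechanism are assumptions, not the actual model.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical feature extractors for the two modalities.
teacher = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())   # e.g., non-visible sensor
student = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())   # visible camera

def distillation_loss(camera_patch, paired_patch):
    # Patches are paired via the known geometry of the sensor network; the
    # teacher's feature for the co-located observation supervises the student.
    with torch.no_grad():
        target = teacher(paired_patch)
    return F.mse_loss(student(camera_patch), target)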


Person Re-Identification

Recent advances in person re-identification have demonstrated enhanced discriminability, especially with supervised learning or transfer learning for different tasks. However, since the data requirements—including the degree of data curation—are becoming increasingly complex and laborious, there is a critical need for unsupervised methods that are robust to large intra-class variations, such as changes in perspective, illumination, articulated motion, resolution, etc. Therefore, we propose an unsupervised framework for person re-identification which is trained in an end-to-end manner without any pre-training. Our proposed framework leverages a new attention mechanism that combines group convolutions to (1) enhance spatial attention at multiple scales and (2) reduce the number of trainable parameters by 59.6%. Additionally, our framework jointly optimizes the network with agglomerative clustering and instance learning to tackle hard samples. We perform extensive analysis using the Market1501 and DukeMTMC-reID datasets to demonstrate that our method consistently outperforms the state-of-the-art methods (with and without pre-trained weights).
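Below is a minimal sketch of a spatial attention block built from group convolutions, in the spirit of the parameter reduction described above; the channel counts and group sizes are illustrative, and the 59.6% reduction is not reproduced by this toy block.

import torch
import torch.nn as nn

class GroupAttention(nn.Module):
    """Spatial attention from a grouped convolution (illustrative sizes)."""
    def __init__(self, channels=256, groups=8):
        super().__init__()
        # Grouped 3x3 conv: each group processes its own channel slice, which
        # uses far fewer trainable parameters than a dense convolution.
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        attn = self.gate(self.conv(x))   # (B, 1, H, W) spatial attention map
        return x * attn                  # re-weight features spatially

feat = torch.randn(2, 256, 24, 8)        # e.g., a ReID backbone feature map
out = GroupAttention()(feat)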


Learning Attention without Supervision

Recent advancements like multiple contextual analysis, attention mechanisms, distance-aware optimization, and multi-task guidance have been widely used for supervised person re-identification (ReID), but the implementation and effects of such methods in unsupervised ReID frameworks are non-trivial and unclear, respectively. Moreover, with the increasing size and complexity of image- and video-based ReID datasets, manual or semi-automated annotation procedures for supervised ReID are becoming labor intensive and cost prohibitive, which is undesirable especially considering that the likelihood of annotation errors increases with the scale and complexity of data collections. Therefore, we propose a new iterative clustering framework that is insensitive to annotation errors and to over-fitting ReID annotations (i.e., labels). Our proposed unsupervised framework incorporates (a) a novel multi-context group attention architecture that learns a holistic attention map from multiple local and global contexts, (b) an unsupervised clustering loss function that down-weights easily discriminative identities, and (c) a background diversity term that helps cluster persons across different cross-camera views without leveraging any identification or camera labels. We perform extensive analysis using the DukeMTMC-VideoReID and MARS video-based ReID datasets and the MSMT17 image-based ReID dataset. Our approach is shown to provide a new state-of-the-art performance for unsupervised ReID, reducing the rank-1 performance gap between supervised and unsupervised ReID to 1.1%, 12.1%, and 21.9% from 6.1%, 17.9%, and 22.6% for the DukeMTMC, MARS, and MSMT17 datasets, respectively.
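One plausible form of a clustering loss that down-weights easily discriminative identities is the focal-style weighting sketched below; the exact loss used in the paper, as well as the background diversity term, is not reproduced here.

import torch
import torch.nn.functional as F

def downweighted_cluster_loss(feats, centroids, pseudo_labels, gamma=2.0, temp=0.1):
    # Softmax over cluster centroids; samples that are already assigned with
    # high confidence (easily discriminative identities) get small weights,
    # steering optimization toward hard, ambiguous clusters.
    feats = F.normalize(feats, dim=1)
    centroids = F.normalize(centroids, dim=1)
    log_p = F.log_softmax(feats @ centroids.t() / temp, dim=1)
    log_p_true = log_p.gather(1, pseudo_labels[:, None]).squeeze(1)
    weights = (1.0 - log_p_true.exp()) ** gamma
    return -(weights * log_p_true).mean()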


Out-of-Distribution 3D Object Generation

Contemporary object pose estimation algorithms predict transformation parameters of object perspectives relative to a reference pose. Learning these parameters often requires significantly more data than conventional sensors provide. Therefore, synthetic data is frequently used to increase the amount of data, the number of object perspectives, and the number of object classes, which is beneficial for improving the generalization of pose estimation algorithms. However, robust synthesis of objects from different perspectives requires manually setting the precision describing increments between pose angles. Consequently, learning from arbitrarily small increments requires very precise sampling from existing sensor data, which increases the time, complexity, and resources necessary for a larger sample size. Therefore, there is a need to minimize the amount of sampling and processing required for synthesis methods (e.g., generative models), which have difficulty producing samples that lie outside of groups within the latent space, resulting in mode collapse. While reducing the number of observed object perspectives directly addresses this problem, generative models have issues synthesizing out-of-distribution (OOD) data. We study the effects of synthesizing OOD data by exploiting orthogonality constraints to synthesize intermediate poses of 3D point cloud object representations that are not observed during training. Additionally, we perform an ablation study on each axial rotation for poses and on the OOD generative capabilities of different model types. We test and evaluate our proposed method using objects from ShapeNet.
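The orthogonality idea can be illustrated as a simple regularizer that keeps predicted transforms for unobserved intermediate poses close to valid rotations, as sketched below; the tensor shapes and the way the penalty enters the generator's objective are assumptions.

import torch

def orthogonality_penalty(R):
    # Keeps each predicted 3x3 transform close to the rotation manifold
    # (R @ R^T = I), so interpolated, unobserved poses remain valid rotations.
    RRt = R @ R.transpose(-1, -2)
    return ((RRt - torch.eye(3, device=R.device)) ** 2).mean()

def rotate_cloud(points, R):
    # points: (B, N, 3) point clouds; R: (B, 3, 3) pose for an unobserved angle.
    return points @ R.transpose(-1, -2)

points = torch.randn(4, 1024, 3)              # e.g., ShapeNet point clouds
R = torch.randn(4, 3, 3, requires_grad=True)  # predicted transforms
penalty = orthogonality_penalty(R)            # added to the generator's objective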
