Reposted

arXiv Paper Daily: Mon, 2 Jan 2017

Neural and Evolutionary Computing

Adult Content Recognition from Images Using a Mixture of Convolutional Neural Networks

Mundher Al-Shabi, Tee Connie, Andrew Beng Jin Teoh

Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

With the rapid development of the Internet, web content has become huge. Most websites are publicly available, and anyone can access their contents anywhere, such as at the workplace, at home and even in schools. Nevertheless, not all web content is appropriate for all users, especially children. An example of such content is pornographic images, which should be restricted to certain age groups. Besides, these images are not safe for work (NSFW), in that employees should not be seen accessing such content. Recently, convolutional neural networks have been successfully applied to many computer vision problems. Inspired by these successes, we propose a mixture of convolutional neural networks for adult content recognition. Unlike other works, our method is formulated as a weighted sum of multiple deep neural network models. The weights of the CNN models are expressed as a linear regression problem learnt using Ordinary Least Squares (OLS). Experimental results demonstrate that the proposed model outperforms both a single CNN model and the average sum of CNN models in adult content recognition.
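As a rough illustration of the weighting step described in this abstract, the sketch below fits one weight per CNN by ordinary least squares and then scores an image with the weighted sum. The function names (`solve_linear`, `ols_weights`, `mixture_score`), the absence of an intercept term, and the toy scores are assumptions made for this example; the abstract does not give these details.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols_weights(scores, labels):
    """Fit one weight per model.

    scores: one row per image, each row holding K model scores (assumed layout).
    labels: one target value per image.
    Solves the normal equations (S^T S) w = S^T y.
    """
    k = len(scores[0])
    gram = [[sum(row[i] * row[j] for row in scores) for j in range(k)]
            for i in range(k)]
    rhs = [sum(row[i] * y for row, y in zip(scores, labels)) for i in range(k)]
    return solve_linear(gram, rhs)


def mixture_score(weights, row):
    """Weighted sum of the per-model scores for one image."""
    return sum(w * s for w, s in zip(weights, row))


# Toy data: two hypothetical CNNs scoring four images (1 = adult, 0 = safe).
toy_scores = [[0.9, 0.7], [0.2, 0.4], [0.8, 0.6], [0.1, 0.3]]
toy_labels = [1.0, 0.0, 1.0, 0.0]
toy_weights = ols_weights(toy_scores, toy_labels)
print(toy_weights, mixture_score(toy_weights, [0.85, 0.5]))
```

Solving the normal equations directly is adequate for a handful of models; with many or highly correlated models, a numerically more robust least-squares solver would be the safer choice.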

Computer Vision and Pattern Recognition

A Unified Tensor-based Active Appearance Face Model

Zhen-Hua Feng, Josef Kittler, William Christmas, Xiao-Jun Wu

Comments: 15 pages, 7 figures

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Appearance variations result in many difficulties in face image analysis. To deal with this challenge, we present a Unified Tensor-based Active Appearance Model (UT-AAM) for jointly modelling the geometry and texture information of 2D faces. In contrast with the classical Tensor-based AAM (T-AAM), the proposed UT-AAM has four advantages: First, for each type of face information, namely shape and texture, we construct a tensor model capturing all relevant appearance variations. This unified tensor model contrasts with the variation-specific models of T-AAM. Second, a strategy for dealing with self-occluded faces is proposed to obtain consistent shape and texture representations of faces across large pose variations. Third, our UT-AAM is capable of constructing the model from an incomplete training dataset, using tensor completion methods. Last, we use an effective cascaded-regression-based method for UT-AAM fitting. With these improvements, the utility of UT-AAM in practice is considerably enhanced in comparison with the classical T-AAM. As an example, we demonstrate the improvements in training facial landmark detectors through the use of UT-AAM to synthesise a large number of virtual samples. Experimental results obtained using the Multi-PIE and 300-W face datasets demonstrate the merits of the proposed approach.

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg

Comments: 11 pages, 6 figures

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Referring expressions are natural language constructions used to identify particular objects within a scene. In this paper, we propose a unified framework for the tasks of referring expression comprehension and generation. Our model is composed of three modules: speaker, listener, and reinforcer. The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions. The listener-speaker modules are trained jointly in an end-to-end learning framework, allowing the modules to be aware of one another during learning while also benefiting from the discriminative reinforcer’s feedback. We demonstrate that this unified framework and training achieves state-of-the-art results for both comprehension and generation on three referring expression datasets. Project and demo page: this https URL

Memory Efficient Multi-Scale Line Detector Architecture for Retinal Blood Vessel Segmentation

Hamza Bendaoudi, Farida Cheriet, J. M. Pierre Langlois

Comments: This paper was accepted and presented at Conference on Design and Architectures for Signal and Image Processing – DASIP 2016

Subjects: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)

This paper presents a memory efficient architecture that implements the Multi-Scale Line Detector (MSLD) algorithm for real-time retinal blood vessel detection in fundus images on a Zynq FPGA. This implementation benefits from the FPGA parallelism to drastically reduce the memory requirements of the MSLD from two images to a few values. The architecture is optimized in terms of resource utilization by reusing the computations and optimizing the bit-width. The throughput is increased by designing fully pipelined functional units. The architecture is capable of achieving a comparable accuracy to its software implementation but 70x faster for low resolution images. For high resolution images, it achieves an acceleration by a factor of 323x.

Feedback Networks

Amir R. Zamir, Te-Lin Wu, Lin Sun, William Shen, Jitendra Malik, Silvio Savarese

Comments: See a video describing the method at this https URL and the website this http URL

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Currently, the most successful learning models in computer vision are based on learning successive representations followed by a decision layer. This is usually actualized through feedforward multilayer neural networks, e.g. ConvNets, where each layer forms one of such successive representations. However, an alternative that can achieve the same goal is a feedback based approach in which the representation is formed in an iterative manner based on feedback received from the previous iteration’s output.

We establish that a feedback based approach has several fundamental advantages over feedforward: it enables making early predictions at the query time, its output naturally conforms to a hierarchical structure in the label space (e.g. a taxonomy), and it provides a new basis for Curriculum Learning. We observe that feedback networks develop a considerably different representation compared to feedforward counterparts, in line with the aforementioned advantages. We put forth a general feedback based learning architecture with the endpoint results on par or better than existing feedforward networks with the addition of the above advantages. We also investigate several mechanisms in feedback architectures (e.g. skip connections in time) and design choices (e.g. feedback length). We hope this study offers new perspectives in the quest for more natural and practical learning models.

Shape Estimation from Defocus Cue for Microscopy Images via Belief Propagation

Arnav Bhavsar

Subjects: Computer Vision and Pattern Recognition (cs.CV)

In recent years, the usefulness of 3D shape estimation is being realized in microscopic or close-range imaging, as the 3D information can further be used in various applications. Due to limited depth of field at such small distances, the defocus blur induced in images can provide information about the 3D shape of the object. The task of 'shape from defocus' (SFD) involves the problem of estimating good quality 3D shape estimates from images with depth-dependent defocus blur. While the research area of SFD is quite well-established, the approaches have largely demonstrated results on objects with bulk/coarse shape variation. However, in many cases, objects studied under microscopes often involve fine/detailed structures, which have not been explicitly considered in most methods. In addition, given that, in recent years, large data volumes are typically associated with microscopy related applications, it is also important for such SFD methods to be efficient. In this work, we provide an indication of the usefulness of the Belief Propagation (BP) approach in addressing these concerns for SFD. BP has been known to be an efficient combinatorial optimization approach, and has been empirically demonstrated to yield good quality solutions in low-level vision problems such as image restoration, stereo disparity estimation etc. For exploiting the efficiency of BP in SFD, we assume local space-invariance of the defocus blur, which enables the application of BP in a straightforward manner. Even with such an assumption, the ability of BP to provide good quality solutions while using non-convex priors is reflected in plausible shape estimates in the presence of fine structures on the objects under microscopy imaging.

Action Recognition Based on Joint Trajectory Maps with Convolutional Neural Networks

Pichao Wang, Wanqing Li, Chuankun Li, Yonghong Hou

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Convolutional Neural Networks (ConvNets) have recently shown promising performance in many computer vision tasks, especially image-based recognition. How to effectively apply ConvNets to sequence-based data is still an open problem. This paper proposes an effective yet simple method to represent spatio-temporal information carried in (3D) skeleton sequences into three (2D) images by encoding the joint trajectories and their dynamics into color distribution in the images, referred to as Joint Trajectory Maps (JTM), and adopts ConvNets to learn the discriminative features for human action recognition. Such an image-based representation enables us to fine-tune existing ConvNets models for the classification of skeleton sequences without training the networks afresh. The three JTMs are generated in three orthogonal planes and provide complementary information to each other. The final recognition is further improved through multiply score fusion of the three JTMs. The proposed method was evaluated on four public benchmark datasets, the large NTU RGB+D Dataset, MSRC-12 Kinect Gesture Dataset (MSRC-12), G3D Dataset and UTD Multimodal Human Action Dataset (UTD-MHAD) and achieved the state-of-the-art results.
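One plausible reading of the "multiply score fusion" step is sketched below: each of the three ConvNets (one per JTM plane) is assumed to output a vector of per-class scores, the vectors are multiplied element-wise, and the class with the largest product is predicted. The function names and the toy scores are hypothetical, and the abstract does not say whether scores are normalised before fusion.

```python
def multiply_score_fusion(score_vectors):
    """Element-wise product of per-class score vectors, one vector per JTM."""
    n_classes = len(score_vectors[0])
    fused = [1.0] * n_classes
    for scores in score_vectors:
        if len(scores) != n_classes:
            raise ValueError("all score vectors must cover the same classes")
        for i, s in enumerate(scores):
            fused[i] *= s
    return fused


def predict(score_vectors):
    """Index of the class with the highest fused score."""
    fused = multiply_score_fusion(score_vectors)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy scores for three classes from the three orthogonal planes (assumed values).
front = [0.5, 0.4, 0.1]
side = [0.2, 0.7, 0.1]
top = [0.3, 0.6, 0.1]
print(predict([front, side, top]))
```

Compared with averaging, a product penalises classes that any single view scores very low, which is one reason multiplicative fusion is sometimes preferred when the views are complementary.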

Rotation equivariant vector field networks

Diego Marcos, Michele Volpi, Nikos Komodakis, Devis Tuia

Subjects: Computer Vision and Pattern Recognition (cs.CV)

We propose a method to encode rotation equivariance or invariance into convolutional neural networks (CNNs). Each convolutional filter is applied with several orientations and returns a vector field that represents the magnitude and angle of the highest scoring rotation at the given spatial location. To propagate information about the main orientation of the different features to each layer in the network, we propose an enriched orientation pooling, i.e. max and argmax operators over the orientation space, allowing to keep the dimensionality of the feature maps low and to propagate only useful information. We name this approach RotEqNet. We apply RotEqNet to three datasets: first, a rotation invariant classification problem, the MNIST-rot benchmark, in which we improve over the state-of-the-art results. Then, a neuron membrane segmentation benchmark, where we show that RotEqNet can be applied successfully to obtain equivariance to rotation with a simple fully convolutional architecture. Finally, we improve significantly the state-of-the-art on the problem of estimating cars’ absolute orientation in aerial images, a problem where the output is required to be covariant with respect to the object’s orientation.
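The orientation pooling described in this abstract (max and argmax over the orientation space) can be sketched as follows. The sketch assumes the responses of one filter at R orientations are already computed and stored as `responses[r][y][x]`, and that the orientations are evenly spaced over a full turn; both the data layout and the function name `orientation_pool` are assumptions for illustration, not the paper's implementation.

```python
import math


def orientation_pool(responses):
    """Keep, per pixel, the strongest response and the angle that produced it.

    responses[r][y][x] is the filter response at orientation index r
    (assumed layout). Returns two maps: magnitude and angle in radians.
    """
    n_orient = len(responses)
    height, width = len(responses[0]), len(responses[0][0])
    magnitude = [[0.0] * width for _ in range(height)]
    angle = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # argmax over the orientation axis
            best = max(range(n_orient), key=lambda r: responses[r][y][x])
            magnitude[y][x] = responses[best][y][x]         # max
            angle[y][x] = 2.0 * math.pi * best / n_orient   # index -> angle
    return magnitude, angle


def to_vector_field(magnitude, angle):
    """Convert (magnitude, angle) maps to Cartesian (u, v) components."""
    u = [[m * math.cos(a) for m, a in zip(mr, ar)]
         for mr, ar in zip(magnitude, angle)]
    v = [[m * math.sin(a) for m, a in zip(mr, ar)]
         for mr, ar in zip(magnitude, angle)]
    return u, v


# Toy example: 4 orientations over a 1x2 feature map.
toy = [[[0.1, 0.5]], [[0.9, 0.2]], [[0.3, 0.7]], [[0.0, 0.1]]]
mag, ang = orientation_pool(toy)
print(mag, ang)
```

Whatever the number of orientations, the output is two values per pixel, which matches the abstract's point about keeping the dimensionality of the feature maps low.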

Deep Learning Logo Detection with Data Expansion by Synthesising Context

Hang Su, Xiatian Zhu, Shaogang Gong

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Logo detection in unconstrained images is challenging, particularly when only very sparse labelled training images are accessible due to high labelling costs. In this work, we describe a model training image synthesising method capable of improving significantly logo detection performance when only a handful of (e.g., 10) labelled training images captured in realistic context are available, avoiding extensive manual labelling costs. Specifically, we design a novel algorithm for generating Synthetic Context Logo (SCL) training images to increase model robustness against unknown background clutters, resulting in superior logo detection performance. For benchmarking model performance, we introduce a new logo detection dataset TopLogo-10 collected from top 10 most popular clothing/wearable brandname logos captured in rich visual context. Extensive comparisons show the advantages of our proposed SCL model over the state-of-the-art alternatives for logo detection using two real-world logo benchmark datasets: FlickrLogo-32 and our new TopLogo-10.
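The general idea of synthesising training images by placing a logo into varied context can be sketched as below. This is only a minimal illustration, not the SCL algorithm itself: images are plain nested lists of pixel values, the logo is pasted unchanged at a random position, and the names `paste_logo` and `synthesise` are hypothetical. Any geometric or colour transformations a real pipeline would apply are omitted.

```python
import random


def paste_logo(background, logo, top, left):
    """Paste `logo` onto a copy of `background`; return (image, bounding box).

    The box is (x_min, y_min, x_max, y_max) with exclusive maxima.
    """
    out = [row[:] for row in background]  # leave the original untouched
    h, w = len(logo), len(logo[0])
    if top < 0 or left < 0 or top + h > len(out) or left + w > len(out[0]):
        raise ValueError("logo does not fit inside the background here")
    for y in range(h):
        for x in range(w):
            out[top + y][left + x] = logo[y][x]
    return out, (left, top, left + w, top + h)


def synthesise(backgrounds, logo, count, seed=0):
    """Generate `count` labelled samples by random background and placement.

    Assumes every background is at least as large as the logo.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(count):
        bg = rng.choice(backgrounds)
        top = rng.randint(0, len(bg) - len(logo))
        left = rng.randint(0, len(bg[0]) - len(logo[0]))
        samples.append(paste_logo(bg, logo, top, left))
    return samples


# Toy data: one blank 4x5 background and a 2x2 logo patch.
blank = [[0] * 5 for _ in range(4)]
patch = [[1, 1], [1, 1]]
for image, box in synthesise([blank], patch, count=3, seed=42):
    print(box)
```

Because the bounding box is produced at the moment of pasting, every synthetic image comes with its label for free, which is the property that lets this kind of expansion avoid manual labelling costs.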