March 05, 2019

1. Guo Shichen (郭世晨): "Gated Convolutional Neural Network for Semantic Segmentation in High-Resolution Images"
2. Fyaz Ali: "Human Action Recognition in Videos"

1. Paper Sharing and Work Report (Guo Shichen)











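So, can using such a metric as the weight of the low-level features yield better segmentation accuracy? We do not need to find the exact mathematical formula of this "better metric"; we only need to obtain its Taylor expansion.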








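This paper starts from the perspective of information entropy but is not limited to it: while fully inheriting the gating role of information entropy, the model also learns the gate to a greater extent.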
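To make the gating idea concrete, here is a minimal sketch of the entropy-only baseline. It assumes the gate is the normalized Shannon entropy of a pixel's predicted class probabilities and that the gate scales the low-level features before they are added to the high-level ones. The function names and the additive fusion are illustrative assumptions, not the exact formulation used in the paper.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a class distribution, scaled to [0, 1] by log(K)."""
    k = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)

def gated_fusion(high_level, low_level, probs):
    """Add low-level features to high-level ones, weighted by the entropy gate.

    Assumption: a single scalar gate per pixel and simple additive fusion.
    """
    gate = normalized_entropy(probs)
    return [h + gate * l for h, l in zip(high_level, low_level)]

# A confident pixel ignores the low-level detail; an uncertain one uses it fully.
confident = gated_fusion([1.0, 2.0], [0.5, 0.5], [1.0, 0.0, 0.0])
uncertain = gated_fusion([1.0, 2.0], [0.5, 0.5], [1 / 3, 1 / 3, 1 / 3])
```

In this sketch the gate is fixed by the entropy formula; the learned variant discussed above would replace `normalized_entropy` with a function whose parameters are trained.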

2. Human Action Recognition in Videos (Ali)


My thesis work explores various approaches that use Convolutional and Recurrent Neural Networks to classify and temporally localize activities in videos, and it presents an implementation that achieves this. As the first step, features are extracted from video frames with a state-of-the-art 3D Convolutional Neural Network. These features are fed into a recurrent neural network that solves the activity classification and temporal localization tasks in a simple and flexible way. Different architectures and configurations were tested in order to achieve the best performance and learning on the video dataset provided. In addition, different kinds of post-processing of the trained network's output were studied to achieve better results on the temporal localization of activities in the videos. We show how our system can achieve competitive results in both tasks with a simple architecture. We evaluate our method on the ActivityNet Challenge 2016, achieving 0.5874 mAP and 0.2237 mAP in the classification and detection tasks, respectively. The proposed architecture produces results comparable to the state of the art.

Proposed Project:

Recognizing activities in videos has become a hot topic in computer vision in recent years [1]. The exponential growth of portable video cameras and online multimedia repositories, together with recent advances in video coding, storage, and computational resources, has motivated intense research in the field towards new and more efficient solutions for organizing, understanding, and retrieving video content.

Deep learning and machine learning methods have recently become a hot topic in many computer vision tasks, such as object recognition in still images. While successful techniques have been demonstrated for image understanding, video content still presents additional challenges (e.g. motion, temporal consistency, spatial location) that usually cannot be addressed by a still-image recognition solution.

The goal of this report is to address the challenges of video content analysis by taking advantage of state-of-the-art deep learning techniques. The aim of this project is to develop a competitive framework to both classify and temporally localize activities in videos. To accomplish this goal, the project uses the ActivityNet dataset [2], which offers untrimmed videos depicting a diversity of human activities.

In particular, this project's main contributions are:

I had already worked on deep learning and image processing techniques for still images, but never before on video analytics.


This project was carried out with the goal of setting a baseline for video analytics with deep learning, so that future students and researchers can keep working on it. The specifications of this project are the following:

This project was first carried out by the Image Processing Group (GPI) of the Universitat Politècnica de Catalunya (UPC) during the Spring 2016 semester, so we worked to the same specifications in order to improve some of the results. The specifications were selected to take into account the needs of the project and the available resources. All development was done in Python using the well-known Keras framework, which was also a pioneering use of this library in GPI. This deep learning wrapper facilitates the design and training of models on top of two computational frameworks: Theano [3] and TensorFlow [4]. Both support complex and highly demanding computations on CPU and GPU. For this project, Theano was used as the back end because, at the time of development, it was the only one that implemented the 3D convolution and 3D max-pooling operations required. In addition to the software, specific hardware was needed: the heavy computational load of training neural networks required GPUs, which were provided by the Software Lab UESTC.

Methods and Procedures

This project seeks to offer a simple and flexible solution to the tasks of classification and temporal localization in videos. The proposed solution starts with a first stage in which the video features are extracted. To extract features along the spatial and temporal dimensions of the video frames, we used a pre-trained 3D convolutional network. In parallel, audio features were extracted and combined with those extracted from the video frames.
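A 3D convolutional network consumes short stacks of consecutive frames rather than single images, so the first practical step is to cut each video into clips. The sketch below shows one way to do that with non-overlapping clips. The clip length of 16 frames and the function name `split_into_clips` are assumptions for illustration; the text above does not state the clip length used.

```python
def split_into_clips(num_frames, clip_length=16):
    """Return (start, end) frame index pairs for non-overlapping clips.

    `end` is exclusive. Trailing frames that do not fill a whole clip are dropped.
    clip_length=16 is an assumed value, not one taken from the project.
    """
    clips = []
    for start in range(0, num_frames - clip_length + 1, clip_length):
        clips.append((start, start + clip_length))
    return clips

# A 50-frame video yields three full clips; the last two frames are discarded.
clips = split_into_clips(50)
```

Each clip would then be passed through the pre-trained 3D network to obtain one feature vector per clip, giving a sequence of features for the whole video.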

As a second step, a Recurrent Neural Network is proposed to exploit long-term relations and predict the sequence of activities present in each video. This network was tested with different architectures and input configurations, as shown in the schematic diagram in Fig. 1.1.

All experiments were done using a recently published video dataset, ActivityNet [2], which offers untrimmed videos covering 200 activities. Each video in the dataset contains a single activity, so it is possible to classify videos globally. Furthermore, the activities are temporally localized within each video. Most of the effort and architecture design took the detection task into account, since a solution to the detection task can be easily extrapolated to solve the global classification task for the whole video.
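One simple way to extrapolate from per-clip predictions to a single video label is to average the class probabilities over all clips and pick the most likely activity. The sketch below assumes that class index 0 is a background (no activity) class and that averaging is the aggregation rule; both are illustrative assumptions rather than details confirmed above.

```python
def video_label(clip_probs, background=0):
    """Average per-clip class probabilities and return the best non-background class.

    clip_probs: list of per-clip probability lists, one entry per class.
    background: index of the assumed background class, excluded from the result.
    """
    num_classes = len(clip_probs[0])
    mean = [sum(p[c] for p in clip_probs) / len(clip_probs) for c in range(num_classes)]
    candidates = [c for c in range(num_classes) if c != background]
    return max(candidates, key=lambda c: mean[c])

# Three clips, three classes (0 = background): activity 1 dominates on average.
label = video_label([[0.7, 0.2, 0.1], [0.2, 0.7, 0.1], [0.1, 0.6, 0.3]])
```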

Moreover, multiple post-processing techniques were proposed to improve the classification and detection results. Multiple configurations were tested in order to maximize performance.
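As an illustration of what such post-processing can look like, the sketch below smooths the per-clip activity scores with a mean filter and then thresholds them into temporal segments. The specific techniques, the window size, and the threshold are assumptions chosen for the example; the text above does not say which techniques were actually used.

```python
def smooth(scores, window=3):
    """Mean filter over a centered window, truncated at the sequence borders."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def segments(scores, threshold=0.5):
    """Group consecutive clips scoring above the threshold into (start, end) ranges.

    Indices are clip indices and `end` is exclusive.
    """
    result, start = [], None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            result.append((start, i))
            start = None
    if start is not None:
        result.append((start, len(scores)))
    return result

raw = [0.0, 0.9, 0.2, 0.9, 0.8, 0.1, 0.0]
raw_segments = segments(raw)               # fragmented detections
smoothed_segments = segments(smooth(raw))  # one merged detection
```

On this toy input the raw scores produce two short segments, while the smoothed scores produce a single one, which is the kind of fragmentation that smoothing is meant to reduce. Clip indices can be converted to seconds with the clip length and the video frame rate.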

Finally, very good results were obtained using all these specifications.

Human Action Recognition

The analysis of motion and actions has a long history and is attractive to various disciplines, including psychology, biology, and computer science (see Table 1 for the list of surveys related to motion and action recognition in computer vision). One can trace the fascination with motion back to 500 BC and Zeno's dichotomy paradox. From an engineering perspective, action recognition extends over a broad range of high-impact societal applications, from video surveillance to human-computer interaction, retail analytics, user interface design, learning for robotics, web-video search and retrieval, medical diagnosis, quality-of-life improvement for elderly care, and sports analytics. The long list of emerging technologies and applications [6] points to the conclusion that "manually analyzing action and motion data is impossible".

Human action recognition has many applications in fields such as monitoring, security, sports, and movies. Such a method classifies a spatiotemporal feature descriptor of a person in a video, based on training examples. Nevertheless, many classifiers are constrained by long training sequences and large feature vectors. Individual activity detection remains a challenging, unsolved problem [7], [8], even though numerous attempts have been made. Human action analysis in computer vision includes the detection, tracking, and recognition of human activities [8]. This has a wide range of promising applications, for example security surveillance, human-machine interaction, sports, video annotation, medical diagnostics, and entry/exit control. However, detecting human activities remains a challenging task because of their varying features and the wide range of poses that people can adopt [7].