Electronic Systems Engineering (ESE)https://etd.iisc.ac.in/handle/2005/202021-06-18T03:37:44Z2021-06-18T03:37:44ZAlgorithm And Architecture Design for Real-time Face RecognitionMahale, Gopinath Vasanthhttps://etd.iisc.ac.in/handle/2005/27432019-09-13T11:13:18Z2017-10-31T00:00:00ZAlgorithm And Architecture Design for Real-time Face Recognition
Mahale, Gopinath Vasanth
Face recognition is a field of biometrics that deals with identification of subjects based on features present in the images of their faces. The factors that make face recognition popular and favorite as compared to other biometric methods are easier operation and ability to identify subjects without their knowledge. With these features, face recognition has become an integral part of the present day security systems, targeting a smart and secure world.
There are various factors that de ne the performance of a face recognition system. The most important among them are recognition accuracy of algorithm used and time taken for recognition. Recognition accuracy of the face recognition algorithm gets affected by changes in pose, facial expression and illumination along with occlusions in the images. There have been a number of algorithms proposed to enable recognition under these ambient changes. However, it has been hard to and a single algorithm that can efficiently recognize faces in all the above mentioned conditions. Moreover, achieving real time performance for most of the complex face recognition algorithms on embedded platforms has been a challenge. Real-time performance is highly preferred in critical applications such as identification of crime suspects in public. As available software solutions for FR have significantly large latency in recognizing individuals, they are not suitable for such critical real-time applications. This thesis focuses on real-time aspect of FR, where acceleration of the algorithms is achieved by means of parallel hardware architectures.
The major contributions of this work are as follows. We target to design a face recognition system that can identify at most 30 faces in each frame of video at 15 frames per second, which amounts to 450 recognitions per second. In addition, we target to achieve good recognition accuracy along with scalability in terms of database size and input image resolutions. To design a system with these specifications, as a first step, we explore algorithms in literature and come up with a hybrid face recognition algorithm. This hybrid algorithm shows good recognition accuracy on face images with changes in illumination, pose and expressions, and also with occlusions. In addition the computations in the algorithm are modular in nature which are suitable for real-time realizations through parallel processing.
The face recognition system consists of a face detection module to detect faces in the input image, which is followed by a face recognition module to identify the detected faces. There are well established algorithms and architectures for face detection in literature which can perform detection at 15 frames per second on video frames. Detected faces of different sizes need to be scaled to the size specified by the face recognition module. To meet the real-time constraints, we propose a hardware architecture for real-time bi-cubic convolution interpolation with dynamic scaling factors. To recognize the resized faces in real-time, a scalable parallel pipelined architecture is designed for the hybrid algorithm which can perform 450 recognitions per second on a database containing grayscale images of at most 450 classes on Virtex 6 FPGA. To provide flexibility and programmability, we extend this design to REDEFINE, a multi-core massively parallel reconfigurable architecture. In this design, we come up with FR specific programmable cores termed Scalable Unit for Region Evaluation (SURE) capable of performing modular computations in the hybrid face recognition algorithm. We replicate SUREs in each tile of REDEFINE to construct a face recognition module termed REDEFINE for Face Recognition using SURE Homogeneous Cores (REFRESH).
There is a need to learn new unseen faces on-line in practical face recognition systems. Considering this, for real-time on-line learning of unseen face images, we design tiny processors termed VOP, Processor for Vector Operations. VOPs function as coprocessors to process elements under each tile of REDEFINE to accelerate micro vector operations appearing in the synaptic weight computations. We also explore deep neural networks which operate similar to the processing in human brain and capable of working on very large face databases. We explore the field of Random matrix theory to come up with a solution for synaptic weight initialization in deep neural networks for better classification . In addition, we perform design space exploration of hardware architecture for deep convolution networks and conclude with directions for future work.
2017-10-31T00:00:00ZAlgorithm-Architecture Co-Design for Dense Linear Algebra ComputationsMerchant, Farhadhttps://etd.iisc.ac.in/handle/2005/39582019-09-13T11:13:09Z2018-08-13T00:00:00ZAlgorithm-Architecture Co-Design for Dense Linear Algebra Computations
Merchant, Farhad
Achieving high computation efficiency, in terms of Cycles per Instruction (CPI), for high-performance computing kernels is an interesting and challenging research area. Dense Linear Algebra (DLA) computation is a representative high-performance computing ap-
plication, which is used, for example, in LU and QR factorizations. Unfortunately, mod-
ern off-the-shelf microprocessors fall significantly short of achieving theoretical lower bound in CPI for high performance computing applications. In this thesis, we perform an in-depth analysis of the available parallelisms and propose suitable algorithmic
and architectural variation to significantly improve the computation efficiency. There
are two standard approaches for improving the computation effficiency, first, to perform
application-specific architecture customization and second, to do algorithmic tuning.
In the same manner, we first perform a graph-based analysis of selected DLA kernels.
From the various forms of parallelism, thus identified, we design a custom processing
element for improving the CPI. The processing elements are used as building blocks for
a commercially available Coarse-Grained Reconfigurable Architecture (CGRA). By per-
forming detailed experiments on a synthesized CGRA implementation, we demonstrate
that our proposed algorithmic and architectural variations are able to achieve lower CPI compared to off-the-shelf microprocessors. We also benchmark against state-of-the-art custom implementations to report higher energy-performance-area product.
DLA computations are encountered in many engineering and scientific computing ap-
plications ranging from Computational Fluid Dynamics (CFD) to Eigenvalue problem.
Traditionally, these applications are written in highly tuned High Performance Comput-
ing (HPC) software packages like Linear Algebra Package (LAPACK), and/or Scalable
Linear Algebra Package (ScaLAPACK). The basic building block for these packages is Ba-
sic Linear Algebra Subprograms (BLAS). Algorithms pertaining LAPACK/ScaLAPACK
are written in-terms of BLAS to achieve high throughput. Despite extensive intellectual
efforts in development and tuning of these packages, there still exists a scope for fur-
ther tuning in this packages. In this thesis, we revisit most prominent and widely used
compute bound algorithms like GMM for further exploitation of Instruction Level Parallelism (ILP). We further look into LU and QR factorizations for generalizations and
exhibit higher ILP in these algorithms. We first accelerate sequential performance of the algorithms in BLAS and LAPACK and then focus on the parallel realization of these
algorithms. Major contributions in the algorithmic tuning in this thesis are as follows:
Algorithms:
We present graph based analysis of General Matrix Multiplication (GMM) and
discuss different types of parallelisms available in GMM
We present analysis of Givens Rotation based QR factorization where we improve
GR and derive Column-wise GR (CGR) that can annihilate multiple elements of a
column of a matrix simultaneously. We show that the multiplications in CGR are
lower than GR
We generalize CGR further and derive Generalized GR (GGR) that can annihilate
multiple elements of the columns of a matrix simultaneously. We show that the
parallelism exhibited by GGR is much higher than GR and Householder Transform
(HT)
We extend generalizations to Square root Free GR (also knows as Fast Givens
Rotation) and Square root and Division Free GR (SDFG) and derive Column-wise
Fast Givens, and Column-wise SDFG . We also extend generalization for complex
matrices and derive Complex Column-wise Givens Rotation
Coarse-grained Recon gurable Architectures (CGRAs) have gained popularity in the
last decade due to their power and area efficiency. Furthermore, CGRAs like REDEFINE also exhibit support for domain customizations. REDEFINE is an array of Tiles where each Tile consists of a Compute Element and a Router. The Routers are responsible
for on-chip communication, while Compute Elements in the REDEFINE can be domain
customized to accelerate the applications pertaining to the domain of interest. In this
thesis, we consider REDEFINE base architecture as a starting point and we design Processing Element (PE) that can execute algorithms in BLAS and LAPACK efficiently.
We perform several architectural enhancements in the PE to approach lower bound of the
CPI. For parallel realization of BLAS and LAPACK, we attach this PE to the Router of
REDEFINE. We achieve better area and power performance compared to the yesteryear
customized architecture for DLA. Major contributions in architecture in this thesis are as follows:
Architecture:
We present design of a PE for acceleration of GMM which is a Level-3 BLAS
operation
We methodically enhance the PE with different features for improvement in the
performance of GMM
For efficient realization of Linear Algebra Package (LAPACK), we use PE that can
efficiently execute GMM and show better performance
For further acceleration of LU and QR factorizations in LAPACK, we identify
macro operations encountered in LU and QR factorizations, and realize them on a
reconfigurable data-path resulting in 25-30% lower run-time
2018-08-13T00:00:00ZAnalysis and Control of Cascades in Complex NetworksKotnis, Bhushanhttps://etd.iisc.ac.in/handle/2005/41652019-09-13T11:13:09ZAnalysis and Control of Cascades in Complex Networks
Kotnis, Bhushan
Our modern societies are best described as complex systems consisting of a large number of interacting components. Understanding the nature of these interactions is crucial, not only for gaining insights, but also for shaping their evolution. Modeling complex systems using networks, where nodes represent interacting agents, while links represent the interactions between these agents, can be useful for analyzing these systems. For example, social systems can be represented using social networks, where nodes represent individuals and links represent interactions or relationships between them. Thanks to information and communication technologies, the large scale availability of data allows us to study the nature and function of these networks. For example, large scale data on social connections allows us to gain an understanding of the architecture of these networks, which in turn is very useful for a variety of tasks, like preventing the spread of infectious diseases, marketing products and services, or influencing a large section of the population. Taking advantage of this data, here we aim to (a) study the effects of network structure on processes like epidemics and failure cascades on networks, and (b) formulate cost effective policies for influencing cascades such as information and marketing campaigns.
We study contagion dynamics on a variety of networks such as time varying and interacting networks. In particular, we study (a) the impact of human behavior on biological epidemics in time varying networks, and (b) cascading node and link failures in a system consisting of two networks exhibiting dependent and antagonistic interactions. Our investigations reveal useful insights, e.g., in time varying networks, results suggest an existence of an adaptive threshold As for the interacting networks problem, we found that the phase transition observed in our system is very different from the one seen in commonly studied interacting networks.
After having studied cascades, we take the next step of controlling them. Campaigners, advertisers and activists are increasingly turning to social recommendation mechanisms provided by social media for promoting their products, services, brands and even ideas. One widely used promotion strategy is incentivizing individuals, using referral rewards or other discounts, for encouraging them to spread the word. Due to budget constraints on scarce monetary incentives, it may not be possible to provide incentives for the entire population. Thus, there is a need to allocate resources judiciously for ensuring the highest possible campaign size. We address this problem of maximizing the campaign penetration in social networks under budget constraints and the dual problem of minimizing campaigning cost while ensuring a given campaign size. We formulate the optimization problems using percolation theory. Although the problems turn out to be non-trivial, we show that they can be reduced to simple linear programs, which can be further simplified and solved using simple algorithms with linearithmic complexity. Simulations on real world networks suggest that the proposed solutions could work in a real world setting.
Analytical Modeling Of Quantum Thershold Voltage For Short Channel Multi Gate Silicon Nanowire TransistorsKumar, P Rakeshhttps://etd.iisc.ac.in/handle/2005/9692020-09-28T11:54:22Z2010-12-29T00:00:00ZAnalytical Modeling Of Quantum Thershold Voltage For Short Channel Multi Gate Silicon Nanowire Transistors
Kumar, P Rakesh
Silicon nanowire based multiple gate metal oxide field effect transistors(MG-MOSFET) appear as replacements for conventional bulk transistors in post 45nm technology nodes. In such transistors the short channel effect(SCE) is controlled by the device geometry, and hence an undoped (or, lightly doped) ultra-thin body silicon nanowire is used to sustain the channel. The use of undoped body also solves several issues in bulk MOSFETs e.g., random dopant fluctuations, mobility degradation and compatibility with midgap metal gates. The electrostatic integrity of such devices increases with the scaling down of the body thickness. Since the quantization of electron energy cannot be ignored in such ultra-thin body devices, it is extremely important to consider quantum effects in their threshold voltage models.
Most of the models reported so far are valid for long channel double gate devices. Only Munteanu et al. [Journal of non-crystalline solids vol 351 pp 1911-1918 2005] have reported threshold voltage model for short channel symmetric double gate MOSFET, however it involves unphysical fitting parameters. Only Munteanu et al.[Molecular simulation vol 31 pp 839-845 2005] reported threshold voltage model for quad gate transistor which is implicit in nature. On the other hand no modeling work has been reported for other types of MG-MOSFETs (e.g., tri gate, cylindrical body)apart from numerical simulation results.
In this work we report physically based closed form quantum threshold voltage models for short channel symmetric double gate, quad gate and cylindrical body gate-all-around MOSFETs. In these devices quantum effects aries mainly due to the structural confinement of electron energy. Proposed models are based on the analytical solution of two or three-dimensional Poisson equation and one or two-dimensional Schrodinger equation depending on the device geometries. Judicial approximations have been taken to simplify the models in order to make them closed form and efficient for large scale circuit simulation. Effort has also been put to model the quantum threshold voltage of tri gate MOSFET. However it is found that the energy quantization in tri gate devices are mainly due to electronic confinement and hence it is very difficult to develop closed form analytical equations for the threshold voltage. Thus in this work the modeling of tri gate devices have been limited to long channel cases. All the models are validated against the professional numerical simulator.
2010-12-29T00:00:00Z