Researcher Database

Information Initiative Center Supercomputing
Assistant Professor

Researcher Profile and Settings


  • Information Initiative Center Supercomputing

Job Title

  • Assistant Professor

J-Global ID

Research Interests

  • Parallel Computing   Numerical Linear Algebra   High Performance Computing   

Research Areas

  • Informatics / Computational science
  • Informatics / High-performance computing

Academic & Professional Experience

  • 2015/04 - Today Hokkaido University Information Initiative Center
  • 2013/10 - 2015/03 理化学研究所 計算科学研究機構 研究部門 大規模並列数値計算技術研究チーム 特別研究員
  • 2012/04 - 2013/09 Kobe University Graduate School of System Informatics


  • 2007/04 - 2012/03  Nagoya University  Graduate School of Engineering  Department of Computational Science and Engineering
  • 2002/04 - 2007/03  Nagoya University  School of Engineering  Physical Science and Engineering

Association Memberships


Research Activities

Published Papers

  • Rise Ooi, Takeshi Iwashita, Takeshi Fukaya, Akihiro Ida, Rio Yokota
    HPCAsia2020: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region 92 - 101 2020 [Refereed][Not invited]
  • Takeshi Fukaya, Ramaseshan Kannan, Yuji Nakatsukasa, Yusaku Yamamoto, Yuka Yanagisawa
    SIAM J. Scientific Computing 42 (1)  - 503 2020 [Refereed][Not invited]
    The Cholesky QR algorithm is an efficient communication-minimizing algorithm
    for computing the QR factorization of a tall-skinny matrix. Unfortunately it
    has the inherent numerical instability and breakdown when the matrix is
    ill-conditioned. A recent work establishes that the instability can be cured by
    repeating the algorithm twice (called CholeskyQR2). However, the applicability
    of CholeskyQR2 is still limited by the requirement that the Cholesky
    factorization of the Gram matrix runs to completion, which means it does not
    always work for matrices $X$ with $\kappa_2(X)\gtrsim { { \bf u } }^{-\frac{1},{2 } }$
    where ${ { \bf u } }$ is the unit roundoff. In this work we extend the
    applicability to $\kappa_2(X)=\mathcal{O}({\bf u}^{-1})$ by introducing a shift
    to the computed Gram matrix so as to guarantee the Cholesky factorization
    $R^TR= A^TA+sI$ succeeds numerically. We show that the computed $AR^{-1}$ has
    reduced condition number $\leq { { \bf u } }^{-\frac{1},{2 } }$, for which CholeskyQR2
    safely computes the QR factorization, yielding a computed $Q$ of orthogonality
    $\|Q^TQ-I\|_2$ and residual $\|A-QR\|_F/\|A\|_F$ both $\mathcal{O}({ { \bf u } })$.
    Thus we obtain the required QR factorization by essentially running Cholesky QR
    thrice. We extensively analyze the resulting algorithm shiftedCholeskyQR to
    reveal its excellent numerical stability. shiftedCholeskyQR is also highly
    parallelizable, and applicable and effective also when working in an oblique
    inner product space. We illustrate our findings through experiments, in which
    we achieve significant (up to x40) speedup over alternative methods.
  • Kazuyuki Tanaka, Hiroto Imachi, Tomoya Fukumoto, Akiyoshi Kuwata, Yuki Harada, Takeshi Fukaya, Yusaku Yamamoto, Takeo Hoshi
    JAPAN JOURNAL OF INDUSTRIAL AND APPLIED MATHEMATICS 36 (2) 719 - 742 0916-7005 2019/07 [Refereed][Not invited]
    An open-source middleware named EigenKernel was developed for use with parallel generalized eigenvalue solvers or large-scale electronic state calculation to attain high scalability and usability. The middleware enables the users to choose the optimal solver, among the three parallel eigenvalue libraries of ScaLAPACK, ELPA, EigenExa and hybrid solvers constructed from them, according to the problem specification and the target architecture. The benchmark was carried out on the Oakforest-PACS supercomputer and reveals that ELPA, EigenExa and their hybrid solvers show better performance, when compared with pure ScaLAPACK solvers. The benchmark on the K computer is also used for discussion. In addition, a preliminary research for the performance prediction was investigated, so as to predict the elapsed time T as the function of the number of used nodes P (T=T(P)). The prediction is based on Bayesian inference in the Markov Chain Monte Carlo (MCMC) method and the test calculation indicates that the method is applicable not only to performance interpolation but also to extrapolation. Such a middleware is of crucial importance for application-algorithm-architecture co-design among the current, next-generation (exascale), and future-generation (post-Moore era) supercomputers.
  • Takeshi Fukaya
    Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2019, Guangzhou, China, January 14-16, 2019 ACM 81 - 90 2019 [Refereed][Not invited]
  • Cache Blocking of Sparse-Matrix Vector Multiplication with DIA/CRS Hybrid Format
    Yukiteru Ishida, Rie Miura, Takeshi Fukaya, Takeshi Iwashita, Hiroshi Nakashima
    Proc, Cross-disciplinary WS. Computing Systems, Infrastructures, and Programming 2018/05 [Refereed][Not invited]
  • Takeshi Fukaya, Toshiyuki Imamura, Yusaku Yamamoto
    2018 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops 2018, Vancouver, BC, Canada, May 21-25, 2018 IEEE Computer Society 1113 - 1122 2018 [Refereed][Not invited]
  • Takeshi Fukaya, Takeshi Iwashita
    Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2018, Chiyoda, Tokyo, Japan, January 28-31, 2018 ACM 116 - 126 2018 [Refereed][Not invited]
  • 熊谷洋佑, 藤井昭宏, 田中輝雄, 深谷猛, 須田礼仁
    情報処理学会論文誌トランザクション コンピューティングシステム(Web) 9 (3) 1‐13 (WEB ONLY)  1882-7829 2016/08 [Refereed][Not invited]
  • On constructing cost models for online automatic tuning using ATMathCoreLib
    Seiji Nagashima, Takeshi Fukaya, Yusaku Yamamoto
    Proceedings of IEEE MCSoC 2016 IEEE Computer Society Press 1 (1) 1 - 8 2016 [Refereed][Not invited]
  • Yosuke Kumagai, Akihiro Fujii, Teruo Tanaka, Yusuke Hirota, Takeshi Fukaya, Toshiyuki Imamura, Reiji Suda
    PARALLEL PROCESSING AND APPLIED MATHEMATICS, PPAM 2015, PT I 9573 74 - 85 0302-9743 2016 [Refereed][Not invited]
    The conjugate gradient (CG) method is useful for solving large and sparse linear systems. It has been pointed out that collective communication needed for calculating inner products becomes serious performance bottleneck when executing the CG method on massively parallel systems. Recently, the Chebyshev basis CG (CBCG) method, a communication avoiding variant of the CG method, has been proposed, and theoretical studies have shown promising results, particularly for upcoming exascale supercomputers. In this paper, we evaluate the CBCG method on an actual system, namely the K computer, to examine the potential of the CBCG method. We first construct a realistic performance model that reflects the computation on the K computer, and the model indicates that the CBCG method is faster than CG method if the number of cores is sufficient large. We then measure the execution time of both methods on the K computer, and obtained results agree with our estimation.
  • Toshiyuki Imamura, Takeshi Fukaya, Yusuke Hirota, Susumu Yamada, Masahiko Machida
    Advances in Parallel Computing 27 381 - 390 0927-5452 2016 [Refereed][Not invited]
    © 2016 The authors and IOS Press. The present paper describes an efficient communication optimization technique for Householder tridiagonalization called CAHTR and evaluates its parallel performance. CAHTR is intended to reduce the number of problems in collective communication, especially MPI Allreduce operations. We demonstrate the optimal version of CAHTR(3) compared with a naive implementation CAHTR(0). The CAHTR algorithms are evaluated on the K supercomputer system, and speedup exceeds x1.4 for the case of N = 5000 and P = 1024.
  • Seiji Nagashima, Takeshi Fukaya, Yusaku Yamamoto
    We consider the problem of online automatic tuning. In this setting, we execute the target program with some tuning parameters N times, where N is given, while optimizing the parameters to minimize some objective function such as the total execution time. Thus we have to choose the parameters for each execution by taking into account the trade-off between exploration and exploitation. The ATMathCoreLib library developed by Suda is a set of software that solves this problem. To model the performance of the target software, ATMathCoreLib uses a linear statistical model, and its basis functions must be provided by the user. In this paper, we investigate how to choose the basis functions appropriately, using the singular value decomposition of a square matrix as an example. We consider three cases, namely, (I) when the performance characteristics of the target problem are well understood by the user, (II) when the tuning parameter has a complicated structure, as occurs in the case of simultaneous selection of an algorithm and its parameter, and (III) when the performance characteristics of the target problem are not known to the user. The results of using ATMathCoreLib with different basis functions for each case are given. They help one understand the tuning by ATMathCoreLib and contribute to the progress of ATMathCoreLib.
  • Yamamoto Yusaku, Nakatsukasa Yuji, Yanagisawa Yuka, Fukaya Takeshi
    JSIAM Letters 一般社団法人 日本応用数理学会 8 (0) 5 - 8 1883-0609 2016 [Not refereed][Not invited]
    The Cholesky QR algorithm is an ideal QR decomposition algorithm for high performance computing, but known to be unstable. We present error analysis of the Cholesky QR algorithm in an oblique inner product defined by a positive definite matrix, and show that by repeating the algorithm twice (called CholeskyQR2), its stability is greatly improved.
  • Yusaku Yamamoto, Yuji Nakatsukasa, Yuka Yanagisawa, Takeshi Fukaya
    ELECTRONIC TRANSACTIONS ON NUMERICAL ANALYSIS 44 306 - 326 1068-9613 2015 [Refereed][Not invited]
    We consider the QR decomposition of an m x n matrix X with full column rank, where m >= n. Among the many algorithms available, the Cholesky QR algorithm is ideal from the viewpoint of high performance computing since it consists entirely of standard level 3 BLAS operations with large matrix sizes, and requires only one reduce and broadcast in parallel environments. Unfortunately, it is well-known that the algorithm is not numerically stable and the deviation from orthogonality of the computed Q factor is of order O((kappa(2)(X))(2) u), where kappa(2)(X) is the 2-norm condition number of X and u is the unit roundoff. In this paper, we show that if the condition number of X is not too large, we can greatly improve the stability by iterating the Cholesky QR algorithm twice. More specifically, if kappa(2)(X) is at most O(u(-1/2)), both the residual and deviation from orthogonality are shown to be of order 0(u). Numerical results support our theoretical analysis.
  • Takeshi Fukaya, Toshiyuki Imamura
    The solution of real symmetric dense eigenvalue problems is one of the fundamental matrix computations. To date, several new high-performance eigensolvers have been developed for peta and postpeta scale systems. One of these, the EigenExa eigensolver, has been developed in Japan. EigenExa provides two routines: eigen_s, which is based on traditional tridiagonalization, and eigen_sx, which employs a new method via a pentadiagonal matrix. Recently, we conducted a detailed performance evaluation of EigenExa by using 4,800 nodes of the Oakleaf-FX supercomputer system. In this paper, we report the results of our evaluation, which is mainly focused on investigating the differences between the two routines. The results clearly indicate both the advantages and disadvantages of eigen_sx over eigen_s, which will contribute to further performance improvement of EigenExa. The obtained results are also expected to be useful for other parallel dense matrix computations, in addition to eigenvalue problems.
  • Takeshi Fukaya, Toshiyuki Imamura, Yusaku Yamamoto
    HIGH PERFORMANCE COMPUTING FOR COMPUTATIONAL SCIENCE - VECPAR 2014 8969 269 - 283 0302-9743 2015 [Refereed][Not invited]
    We consider computing tall-skinny QR factorizations on a large-scale parallel machine. We present a realistic performance model and analyze the difference of the parallel execution time between Householder QR and TSQR. Our analysis indicates the possibility that TSQR becomes slower than Householder QR as the number of columns of the target matrix increases. We aim for estimating the difference and selecting the faster algorithm by using models, which falls into auto-tuning. Numerical experiments on the K computer support our analysis and show our success in determining the faster algorithm.
  • Performance analysis of the Householder-type parallel tall-skinny QR factorizations toward automatic algorithm selection
    T. Fukaya, T. Imamura, Y. Yamamoto
    Proceedings of VECPAR 2014 1 (1) 1 - 1 2014 [Refereed][Not invited]
  • Takeshi Fukaya, Yuji Nakatsukasa, Yuka Yanagisawa, Yusaku Yamamoto
    2014 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) 31 - 38 2014 [Refereed][Not invited]
    Designing communication-avoiding algorithms crucial for high performance computing on a large-scale parallel system. The TSQR algorithm is a communication-avoiding algorithm for computing a tall-skinny QR factorization, and TSQR is known to he much faster and as stable as the classical Householder QR algorithm The Cholesky QR algorithm is another very simple and fast communication-avoiding algorithm, but rarely used in practice because of its numerical instability. Our recent work points out that an algorithm that simply repeats Cholesky QR twice, which we call CholeskyQR2, gives excellent accuracy for a wide range of matrices arising in practice. Although the communication cost of CholeskyQR2 is twice that of TSQR, it has an advantage that its reduction operation is addition whereas that of TSQR is a QR factorization, whose high-performance implementation is more difficult. Thus, CholeskvQR2 can potentially be significantly faster than TSQR. Indeed, in our experiments using 16384 nodes of the K computer, CholeskyQR2 ran about three times faster than TSQR for a 4194304 x 64 matrix,
  • 深谷 猛, 山本 有作, 張 紹良
    情報処理学会論文誌 論文誌トランザクション 情報処理学会 2011 (2) 146 - 157 1882-7772 2012/04 [Refereed][Not invited]
  • Jun-ichi Muramatsu, Takeshi Fukaya, Shao-Liang Zhang, Kinji Kimura, Yusaku Yamamoto
    IJNC 1 (2) 132 - 143 2011 [Refereed][Not invited]
  • Yusaku Yamamoto, Takeshi Fukaya
    Software Automatic Tuning: From Concepts to State-of-the-Art Results 69 - 85 2010 [Refereed][Not invited]
    In this chapter, we survey several approaches to optimizing the blocking strategy for basic matrix decompositions, such as LU, Cholesky, and QR. Conventional blocking strategies such as fixed-size blocking and recursive blocking are widely used to optimize the performance of these decompositions. However, these strategies have only a small number of parameters such as the block size or the level of recursion and are not sufficiently flexible to exploit the performance of modern high-performance architectures. As such, several attempts have been made to define a much larger class of strategies and to choose the best strategy among them according to the target machine and the matrix size. The number of candidate strategies is usually exponential in the size of the matrix. However, with the use of dynamic programming, the cost of optimization can be reduced to a realistic level. As representatives of such approaches, we survey variable-size blocking, generalized recursive blocking, and the combination of variable-size blocking and the TSQR algorithm. Directions for future research are also discussed. © 2010 Springer Science+Business Media LLC.
    JSIAM Lett (Web) 2 69-72 (J-STAGE)  1883-0617 2010 [Refereed][Not invited]
  • 深谷猛, 山本有作, 畝山多加志, 中村佳正
    情報処理学会論文誌トランザクション(CD-ROM) 2009 (1) KONPYUTINGUSHISUTEMU,VOL.2,NO.2,98-109  1882-7772 2009/11 [Refereed][Not invited]
  • 深谷猛, 山本有作, 畝山多加志, 中村佳正
    情報処理学会論文誌. コンピューティングシステム 2 (2) 98 - 109 2009/07 [Refereed][Not invited]
    JSIAM Lett (Web) 1 56-59 (J-STAGE)  1883-0617 2009 [Refereed][Not invited]
  • Takeshi Fukaya, Yasaku Yamamoto, Shao-Liang Zhang
    2008 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING 402 - 410 1552-5244 2008 [Refereed][Not invited]
    In this paper, we present a new approach to optimizing the blocking strategy for the Householder QR decomposition. In high performance implementations of the Householder QR algorithm, it is common to use a blocking technique for the efficient use of the cache memory. There are several well known blocking strategies like the fixed-size blocking and recursive blocking, and usually their parameters such as the block size and the recursion level are tuned according to the target machine and the problem size. However, strategies generated with this kind of parameter optimization constitute only a small fraction of all possible blocking strategies. Given the complex performance characteristics of modern microprocessors, non-standard strategies may prove effective on some machines. Considering this situation, we first propose a new universal model that can express a far larger class of blocking strategies than has been considered so far. Next, we give an algorithm to find a near-optimal strategy from this class using dynamic programming. As a result of this approach, we found an effective blocking strategy that has never been reported. Performance evaluation on the Opteron and Core2 processors show that our strategy achieves about 1.2 times speedup over recursive blocking when computing the QR decomposition of a 6000 x 6000 matrix.
  • 深谷猛, 山本有作, 畝山多加志, 堀玄, 梅野健
    情報処理学会論文誌 48 (SIG8(ACS18)) 31 - 43 0387-5806 2007/05 [Refereed][Not invited]
  • Yusaku Yamamoto, Takeshi Fukaya, Takashi Uneyama, Masami Takata, Kinji Kimura, Masashi Iwasaki, Yoshimasa Nakamura
    Parallel Computing Technologies, 9th International Conference, PaCT 2007, Pereslavl-Zalessky, Russia, September 3-7, 2007, Proceedings Springer 340 - 345 2007 [Refereed][Not invited]

Books etc

  • 櫻井 鉄也, 松尾 宇泰, 片桐 孝洋, 日本応用数理学会 (Contributor第6章 固有値・特異値問題における並列計算 6.1 直接法)
    共立出版 2018 (ISBN: 9784320019553)
  • 直野, 健, 寺西, 慶太, Cavazos, John, 須田, 礼仁 (ContributorDynamic Programming Approaches to Optimizing the Blocking Strategy for Basic Matrix Decompositions)
    Springer 2010 (ISBN: 9781441969347) xiv, 377 p. 69-85


Awards & Honors

  • 2019/03 IPSJ IPSJ Yamashita SIG Research Award
    受賞者: Takeshi FUkaya
  • 2018/05 IPSJ Symposium xSIG2018 Best Research Award
     Enhancement of Algebraic Block Multi-Color Ordering for ILU Preconditioning and Its Performance Evaluation in Preconditioned GMRES Solver 
    受賞者: Senxi Li;Takeshi Iwashita;Takeshi Fukaya
  • 2009/06 EASIAM 2009 EASIAM Student Paper Competition 2nd Prize
     A Dynamic Programming Approach to Optimizing the Blocking Strategy for the Householder QR Decomposition 
    受賞者: Fukaya Takeshi
  • 2009/01 2009年ハイパフォーマンスコンピューティングと計算科学シンポジウム(HPCS2009) 最優秀論文賞
     正方行列向け特異値分解のCUDA による高速化 
    受賞者: 深谷 猛;山本 有作;畝山 多加志;中村 佳正

Research Grants & Projects

  • High performance linear solver for advanced computational electromagnetics
    Japan Society for the Promotion of Science:Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (B)
    Date (from‐to) : 2019/04 -2022/03 
    Author : 岩下 武史, 伊田 明弘, 塙 敏博, 美舩 健, 高橋 康人, 深谷 猛
  • HPCの視点に基づくテンソル分解アルゴリズムの高性能化
    Japan Society for the Promotion of Science:Grants-in-Aid for Scientific Research Grant-in-Aid for Early-Career Scientists
    Date (from‐to) : 2018/04 -2021/03 
    Author : 深谷 猛
  • Japan Society for the Promotion of Science:Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (B)
    Date (from‐to) : 2015/04 -2018/03 
    Author : IMAMURA Toshiyuki, YAMAMOTO Yusaku, Todo Shinji
    This research project aims to realize high performance numerical services investigated in the past based on new mathematical principles in the emerging computing system where tens of thousands to hundreds of millions of processing cores are installed. Giving two important themes, `Mixed-granularity numerical kernel' and `Asynchronous numerical algorithm,' we conducted; i) the research on the theory of asynchronous numerical algorithms. Also avoidance of communication and synchronization at a practical level, then CAHTR and a new method for the FDTD scheme were proposed. Furthermore, we have practiced; ii) promoting research on core numerical infrastructure technologies such as automatic tuning for scalable, lightweight code generation at super-many-core, and promoting innovative research leading to the next generation numerical calculation software.
  • Japan Society for the Promotion of Science:Grants-in-Aid for Scientific Research Grant-in-Aid for Young Scientists (B)
    Date (from‐to) : 2015/04 -2018/03 
    Author : Fukaya Takeshi
    In order to reduce the communication cost in large-scale parallel computations, so-called communication-avoiding (CA) algorithms for matrix factorization have been actively studied. In this research, we aimed for improving the practicality of CA algorithms in realistic situations. We investigated fundamental techniques for CA algorithms: for example, those for implementing and tuning CA algorithms. We also compared different CA algorithms. In addition, we studied the performance modeling of CA algorithms.
  • 大規模行列計算のための階層的自動チューニング手法の開発
    Japan Society for the Promotion of Science:Grants-in-Aid for Scientific Research Grant-in-Aid for JSPS Fellows
    Date (from‐to) : 2010 -2011 
    Author : 深谷 猛
    計算機環境の複雑化・多様化により,それぞれの条件(問題や計算機環境)に応じてアルゴリズムをチューニングすることが,高性能計算を実現するために不可避となっている。その際,従来の人手によるチューニングだけでなく,何らかの仕組みに基づいて計算機自身がチューニングを行う「自動チューニング」技術の開発が求められている。このような背景の下で,昨年度は基本的な行列計算の一つであるQR分解におけるブロック化の方法を自動的に決定する仕組みを構築し,有効な自動チューニング手法として期待できることを示した。そこで,本年度はこの手法をベースにして,実用化の観点から研究を進めた。 構築した手法では,動的計画法を用いることでアルゴリズムの候補を効率的に比較することが可能となっていた。また,比較の際に使用する評価値は性能予測モデルにより算出されることを前提としていた。そこで,本年度は,使用する性能予測モデルによる,チューニングの効果と実行コストの変化について考察した。また,行列サイズが大規模になった場合,全ての候補を比較することが困難になることが予想されるため,候補を限定してチューニングを行う手法の効果について検討した。さらに,限定の仕方を徐々に変化させることで,チューニングの効果とコストのトレードオブを効率的に制御する手法についても検討した。一方,並列計算を想定して,共有メモリ型並列計算機を用いてQR分解を行う場合の自動チューニング手法に関して検討した。並列計算では,TSQRと呼ばれるブロック分割が可能となり,同時に有効であることが知られているので,これを新たに取り入れたチューニング手法を構築し,その効果を検証した。 その他,QR分解以外として,LU分解アルゴリズムに対する自動チューニング手法を検討した。 以上の研究により,大規模行列計算アルゴリズムに対する実用的な自動チューニング手法の開発に向けた一つの方向性を示すとともに,その過程で解決すべき課題を具体的に明らかにすることができた。

Educational Activities

Committee Membership

  • 2020   HPC Asia 2020 Organizing Committee   Publicity chair
  • 2020   PDSEC'20 Program Committee   member
  • 2020   iWAPT2020 Program Committee   member
  • 2019   MCSoC-19: Special Session ATMG Program Committee   member
  • 2019   ICPP2019 Program Committee   member
  • 2019   PDSEC'19 Program Committee   member
  • 2019   iWAPT2019 Program Committee   member
  • 2018   MCSoC-18: Special Session ATMG Program Committee   member
  • 2018   PDSEC'18 Program Committee   member
  • 2018   iWAPT2018 Program Committee   member
  • 2017   MCSoC-17: Special Session ATMG Program Committee   Chair
  • 2017   PDSEC'17 Program Committee   member
  • 2017   iWAPT2017 Program Committee   member
  • 2016   iWAPT2016 Program Committee   member
  • 2015   EPASA2015 Program committee   vice chair
  • 2015   iWAPT2015 Program Committee   member

Copyright © MEDIA FUSION Co.,Ltd. All rights reserved.