Statistics and Data Science Academic Seminar

Least Squares Approximation for a Distributed System

Speaker: Prof. Hansheng WANG (王汉生), Peking University

Time: 2020-06-13 16:00-17:00

Venue: Tencent Meeting, ID: 161 862 991

Abstract

In this work we develop a distributed least squares approximation (DLSA) method that can solve a large family of regression problems (e.g., linear regression, logistic regression, Cox’s model) on a distributed system. By approximating the local objective function with a local quadratic form, we obtain a combined estimator as a weighted average of local estimators. The resulting estimator is proved to be statistically as efficient as the global estimator, and it requires only one round of communication. We further conduct shrinkage estimation based on the DLSA estimator using an adaptive Lasso approach. The solution can be easily obtained by running the LARS algorithm on the master node. It is shown theoretically that the resulting estimator enjoys the oracle property and is selection consistent when a newly designed distributed Bayesian Information Criterion (DBIC) is used. The finite-sample performance and the computational efficiency are further illustrated by an extensive numerical study and an airline dataset, which is 52 GB in memory size. The entire methodology has been implemented in Python for the de facto standard Spark system. Using the proposed DLSA algorithm on the Spark system, it takes 26 minutes to obtain a logistic regression estimator, whereas a full likelihood algorithm takes 15 hours to reach an inferior result.
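To make the one-round combination step concrete, here is a minimal sketch for the linear regression case. It assumes that each worker sends back its local estimate together with the curvature (Hessian) matrix of its local loss, and that the master uses those matrices as weights; for least squares this matrix is X_k'X_k. The weighting scheme, the function names (`local_fit`, `combine`, and the helpers), and the simulated data are all illustrative assumptions, not the speaker's implementation, and no Spark code is involved.

```python
import random

def gram(X, y):
    """Return (X'X, X'y) for a list of feature rows X and responses y."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return xtx, xty

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def local_fit(X, y):
    """Worker step (hypothetical): local OLS estimate and its Hessian X'X."""
    xtx, xty = gram(X, y)
    return solve(xtx, xty), xtx

def combine(local_results):
    """Master step (hypothetical): Hessian-weighted average of local estimates,
    i.e. solve (sum_k H_k) theta = sum_k H_k theta_k."""
    p = len(local_results[0][0])
    H = [[0.0] * p for _ in range(p)]
    g = [0.0] * p
    for theta_k, H_k in local_results:
        for i in range(p):
            g[i] += sum(H_k[i][j] * theta_k[j] for j in range(p))
            for j in range(p):
                H[i][j] += H_k[i][j]
    return solve(H, g)

# Simulated data (assumption: intercept plus two Gaussian covariates).
random.seed(0)
true_theta = [1.0, -2.0, 0.5]
X = [[1.0, random.gauss(0, 1), random.gauss(0, 1)] for _ in range(600)]
y = [sum(t * x for t, x in zip(true_theta, row)) + random.gauss(0, 0.1)
     for row in X]

# Split the rows across three "workers"; each reports once to the master.
chunks = [(X[i::3], y[i::3]) for i in range(3)]
theta_dlsa = combine([local_fit(Xk, yk) for Xk, yk in chunks])

# Reference: the estimator computed on the full, undistributed data.
theta_global = solve(*gram(X, y))
```

In this linear setting the weighted average coincides with the full-data least squares estimate up to floating-point error, because the sum of H_k theta_k equals the sum of X_k'y_k. For models such as logistic regression the local quadratic form is only an approximation, so the agreement would be asymptotic rather than exact. The sketch is kept dependency-free; a practical version would more likely use NumPy or Spark's linear algebra routines.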


About the Speaker

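Hansheng Wang (王汉生) is a professor, doctoral supervisor, and chair of the Department of Business Statistics and Econometrics at the Guanghua School of Management, Peking University. He received his bachelor's degree from the Department of Probability and Statistics, School of Mathematical Sciences, Peking University, in 1998, and his Ph.D. from the Department of Statistics, University of Wisconsin-Madison, in 2001. He joined Guanghua in 2003 and has been there since.

He is a recipient of the National Science Fund for Distinguished Young Scholars, founding president of the Young Statisticians Association of the National Industrial Statistics Teaching and Research Association (全国工业统计学教学研究会青年统计学家协会), a Fellow of the American Statistical Association (ASA), an Elected Member of the International Statistical Institute (ISI), and an elected member of the Royal Statistical Society (RSS), the Institute of Mathematical Statistics (IMS), and the International Chinese Statistical Association (ICSA). He has served as Associate Editor of eight international academic journals, and for several of them he was the first Associate Editor from mainland China (including: American Statistical Association (ASA)). He has published more than 100 articles in professional journals in China and abroad, and has co-authored one English monograph and four Chinese textbooks. He was named on the Elsevier list of Most Cited Chinese Researchers (Mathematics) for five consecutive years, 2014 to 2018. Over the most recent five years, his work has received more than 1,800 SCI citations by others and more than 3,100 Google Scholar citations. He received the First Prize (2009) and the Third Prize (2013), paper category, of the Outstanding Achievement Award for Scientific Research in Higher Education (Humanities and Social Sciences).

His theoretical research focuses on variable selection, dimension reduction, high-dimensional data analysis, and complex network data analysis, all of which center on the analysis of large-scale, complex, ultrahigh-dimensional data. Related application areas include, but are not limited to, Chinese text, network structure, and location trajectories. In industry, he served as chief scientist of 博雅立方科技有限公司 (2009 to 2015) and has been chief statistician of 百分点 (2015 to present). He has also carried out joint research with many companies, including 量帮科技, 考拉征信, 彩虹无线, 蓬景数字, Siemens, SANY Heavy Industry (三一重工), 格灵深瞳, 天罡仪表, and 广联达. This work spans a number of important industries, including quantitative investment, internet credit reporting, the Internet of Vehicles, RTB ad bidding on mobile devices, search engine marketing, e-commerce, and heavy equipment manufacturing.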


Poster

21期统计数科大讲堂-王汉生教授.jpg