Sequence Similarity Network Clustering and Protein Family Classification - State Key Laboratory of Food Science and Resources

Published: 2020/06/05 15:54:56

Effect of Culture Models on Metabolism and Protein Components of Microalgae Chlorella vulgaris

DOI: 10.3969/j.issn.1673-1689.2014.01.016

Keywords: graph clustering; protein family; similarity graph

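Author	Affiliation
Shi Fengkuan (时逢宽)	Key Laboratory of Industrial Biotechnology, Ministry of Education, Jiangnan University, Wuxi, Jiangsu 214122
Li Weijiang (李炜疆)	School of Biotechnology, Jiangnan University, Wuxi, Jiangsu 214122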

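Abstract (translated from the Chinese):

Graph clustering is a powerful means of inferring protein family classification from sequence information. For protein data sets in which the relationships within and between families are as complex as in many superfamilies, good performance depends on two factors: 1) the input similarity graph must contain sufficient information for classification; 2) a robust algorithm is needed to identify the fuzzy clusters hidden in the similarity graph. The authors tested the modularity optimization algorithm Contraction-Dilation (CD), using the enolase clan from Pfam, which has high sequence divergence, as the test data set. The results show that, when the relevant parameters and the similarity graph are chosen appropriately, the clustering produced by the CD algorithm is highly consistent with the Pfam classification. In general, the algorithm still performs well over a fairly wide range of parameter values around the optimum.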

English abstract:

Graph clustering is a powerful method for inferring protein family classification from sequence alone. To achieve good performance on a set of proteins with complex intra- and inter-class relationships, as in many protein superfamilies, two factors are essential: 1) the input similarity graph must contain enough information for classification, and 2) a robust algorithm is needed to discover the obscure group structure hidden in the similarity graph. We tested a modularity optimization algorithm, Contraction-Dilation (CD), on a set of sequences from the Pfam enolase clan, which has broad sequence diversity. The results show that CD outputs agree closely with the Pfam classification when the algorithm parameters and the similarity graph are set appropriately. Good performance is retained across a wide range of settings around the optimum, which suggests that the approach is applicable in general situations.
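
The two ingredients named in the abstract, a thresholded similarity graph and a modularity score for a candidate family partition, can be illustrated with a small sketch. This is not the CD algorithm, whose details are not given here. It only evaluates the standard weighted Newman-Girvan modularity, which is assumed (not confirmed by the source) to be the objective being optimized. The function names, the toy scores, and the threshold value are all hypothetical.

```python
def build_similarity_graph(scores, threshold):
    """Keep an undirected weighted edge for each pair scoring >= threshold.

    `scores` maps (seq_a, seq_b) tuples to a similarity value. The scale of
    that value (bit score, -log E-value, ...) is an assumption of this sketch.
    """
    edges = {}
    for (a, b), score in scores.items():
        if a != b and score >= threshold:
            edges[frozenset((a, b))] = score
    return edges


def modularity(edges, partition):
    """Weighted modularity Q = sum_c [ L_c / m - (d_c / (2m))^2 ].

    m   : total edge weight in the graph
    L_c : total weight of edges with both ends in cluster c
    d_c : summed weighted degree of the nodes in cluster c
    `partition` maps each node to a cluster label.
    """
    total = sum(edges.values())
    if total == 0:
        return 0.0
    internal = {}   # L_c per cluster
    degree = {}     # d_c per cluster
    for pair, weight in edges.items():
        a, b = tuple(pair)
        ca, cb = partition[a], partition[b]
        degree[ca] = degree.get(ca, 0.0) + weight
        degree[cb] = degree.get(cb, 0.0) + weight
        if ca == cb:
            internal[ca] = internal.get(ca, 0.0) + weight
    return sum(
        internal.get(c, 0.0) / total - (d / (2.0 * total)) ** 2
        for c, d in degree.items()
    )


# Toy data: two tight groups joined by one weak link, plus one noise pair
# that the threshold is meant to discard. All values are made up.
toy_scores = {
    ("a1", "a2"): 0.9, ("a2", "a3"): 0.9, ("a1", "a3"): 0.9,
    ("b1", "b2"): 0.9, ("b2", "b3"): 0.9, ("b1", "b3"): 0.9,
    ("a3", "b1"): 0.4,   # weak cross-family similarity
    ("a1", "b3"): 0.1,   # noise, removed by the threshold below
}
graph = build_similarity_graph(toy_scores, threshold=0.3)
two_families = {n: n[0] for n in ("a1", "a2", "a3", "b1", "b2", "b3")}
one_family = {n: "all" for n in two_families}
print(modularity(graph, two_families), modularity(graph, one_family))
```

A partition that matches the two dense groups scores higher than one that lumps everything together, which is the signal a modularity optimizer searches for. Raising or lowering `threshold` changes which edges survive, and that is one way the "similarity graph" setting mentioned in the abstract can affect the outcome.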
