DISCUSSION
In this study, we demonstrated that certain genotypes (and subgenotypes) and mutations are associated with development of hepatic carcinogenesis. There seems to be a stratified risk of HCC, with each genotype (or subgenotype) being associated with a certain pattern of mutations. The significance of these genotypes and mutations was verified by use of an independent cohort which was composed of both HCC and non-HCC patients. Using these algorithms, the sensitivity of identifying a high-risk case ranged from 72% to 75% and the specificity ranged from 66% to 72%. Although the use of these algorithms had only moderate discriminatory capability to predict HCC (positive likelihood ratio of 2.21 to 2.57 and negative likelihood ratio of 0.36 to 0.39), our data suggested that different HBV genotypes and subgenotypes might have different predominant carcinogenic mechanisms.
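The likelihood ratios quoted above follow from sensitivity and specificity by the standard definitions. The minimal sketch below illustrates that arithmetic; the function name `likelihood_ratios` is hypothetical, and pairing 72% sensitivity with 72% specificity is an assumption made only for illustration, since the text reports ranges rather than matched pairs.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity given as fractions."""
    # LR+ = true-positive rate / false-positive rate
    lr_pos = sensitivity / (1.0 - specificity)
    # LR- = false-negative rate / true-negative rate
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


# Assumed pairing for illustration only: 72% sensitivity, 72% specificity.
lr_pos, lr_neg = likelihood_ratios(0.72, 0.72)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

Under this assumed pairing the values come out near 2.57 and 0.39, which fall within the reported ranges; other pairings within the stated sensitivity and specificity ranges give somewhat different values.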
The issue of HBV genotypes has been debated due to discrepant results in previous studies from different countries (18, 27). These differences may be explained by a distinct distribution of HBV subgenotypes in different geographical regions. In most Asian countries, only subgroup Ba of HBV is found, while the majority of Japanese patients with HBV have subgroup Bj (9). Genotype C HBV has a higher risk of HCC than genotype B HBV, which is probably related to a delayed HBeAg seroconversion, more active hepatitis, and a higher prevalence of basal core promoter mutations (5, 19, 32). Among genotype C HBV, there were also differences in the disease activity associated with different subgenotypes (6). Recently, we have shown that subgenotype Ce HBV was associated with the highest risk of HCC independent of other risk factors, including high HBV DNA levels and liver cirrhosis, among a longitudinal cohort of 1,006 chronic hepatitis B patients followed up for 7.7 years (10). The proportion of HCC in patients with subtype adw was found to be higher than that in patients with subtype adr (25). Going beyond attributing HCC to a specific genotype, this study suggests that different genotypes of HBV are associated with different mutations of the viral genome and thus may have separate mechanisms of hepatic carcinogenesis.
The basal core promoter mutant (T1762/A1764) is found to parallel the progression of liver disease and increases the risk of HCC for both genotype B and C HBV (19, 33). In common with previous studies, we also found mutation at codon 1762/1764 to be associated with HCC in genotype B HBV infection. The reason why 1762/1764 mutations were not identified as a marker for HCC in genotype C HBV was related to the high prevalence of mutations at these sites even among the non-HCC patients (8). However, this phenomenon may also mean that a selection pressure on the basal core promoter/X region of the HBV genome in genotype B HBV is associated with the development of HCC. The HCC-associated mutations selected by HBV subgenotype Ce are located in the envelope region, while those selected by HBV subgenotype Cs are located in the precore/core region. These findings offer additional support for the presence of various virologic mechanisms of hepatocarcinogenesis by different HBV genotypes/subgenotypes. The functions of these mutations and their gene products need further investigation.
HBV DNA appears to integrate into host DNA at different sites, exerting direct and indirect effects on the host genes (7). It has also been postulated that the integrated HBV genes can activate cellular genes remote from the site of HBV DNA integration, thereby influencing cellular proliferation and differentiation. This transactivation effect could be mediated through different signal transduction pathways. Identification of HCC-related mutations is only the first step in understanding the viral mechanism of hepatic carcinogenesis. Functional genomic studies of these mutations would have to be carried out in the future to elucidate the effects of these mutations on cell growth and death of hepatocytes.
There are several limitations in this study. First, although patients in the control group were age matched with those in the HCC group, the possibility that controls will develop malignancy in the future cannot be excluded. As cases and controls were not matched for disease severity and liver cirrhosis, the HCC-related mutations may have an indirect effect on HCC development through increasing hepatic inflammation and liver cirrhosis. When the algorithms were tested with the independent validation cohort, a very high sensitivity and a satisfactory specificity were reported for both genotype B and C subgenotypes. Second, although this is by far the largest cohort of HCC and non-HCC cases to have full-length viral genomic analysis of HBV compared to previous studies (17, 22), the sample size is still relatively small. The 95% CIs for the sensitivity and specificity of the genomic algorithms are still wide. In the future, laboratory methods that detect these mutants more robustly than full-genome sequencing are needed to facilitate a larger-scale validation study. A larger cohort, preferably from a different geographic location, would also be needed to validate the generalizability of our results. Third, we could study only patients with genotype B and subgenotypes Ce and Cs of HBV. We could not study genotype A HBV and genotype D HBV, which are prevalent in Europe and Africa, because of our geographic limitations. Moreover, as most Hong Kong residents are immigrants from China, we did not have information on the place where the ancestors of the patients acquired the infection. We believe that most of our patients originated from southern China, where HBV subgenotype Cs is more prevalent than subgenotype Ce. However, the methodology adopted in this study could be used in countries with other HBV genotypes for mining of HBV-related mutations. Finally, we have not worked out the functionality of these mutated codons and why they might lead to development of HCC. More work is required to elucidate the virologic and host responses to mutations. We cannot draw a conclusion on the causal relationship between these HBV mutations and HCC.
In conclusion, this study suggests that HBV genotypes B and C demonstrate different point mutations which might be associated with high risk of hepatic carcinogenesis. The difference in the locations of these mutations in the HBV genome may reflect the underlying mechanisms of hepatocarcinogenesis of the different HBV genotypes/subgenotypes. The detection of these mutations has shown promising results in the association with a higher cancer risk. By combining this information with other clinical risk factors for HCC, including HBV DNA levels and liver cirrhosis status (10, 11), future clinical algorithms can be refined. It is possible that these diagnostic algorithms may shed light on which patients with chronic HBV infection require more frequent screening and surveillance for HCC development.
APPENDIX
The information gain of a feature (attribute) is the reduction in uncertainty (entropy) that results if the attribute is used for classification. Hence, the higher the information gain, the better. The following equation gives the entropy, \(E\), of an attribute \(X\) with \(n\) values, \(X_1, \ldots, X_n\), where \(P(X_j)\) is the frequency of the value \(X_j\): \(E(X) = \sum_{j=1}^{n} -P(X_j)\log_2 P(X_j)\).
Specific to a typical DNA classification problem, we assumed that the data had \(M\) classes, \(C_1, \ldots, C_M\). For each aligned site position, there are \(N\) possible nucleotides, \(V_1, \ldots, V_N\). We defined \(C_m\) as the number of sequences in class \(C_m\). \(C_{mi}\) is the number of sequences in class \(C_m\) whose character at the aligned site is \(V_i\), which could be A, T, G, or C in our case. The remainder of \(X\), \(R(X)\), was defined as follows:
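A sketch of the remainder in the standard information-gain form, written with the counts \(C_m\) and \(C_{mi}\) defined above, is given here. It is a reconstruction under the assumption that the standard definition was used; the index \(k\) is introduced only to sum over classes, and \(R(j)\) denotes the remainder for aligned site \(j\), with terms for which \(C_{mi} = 0\) taken as zero.

```latex
R(j) = \sum_{i=1}^{N}
       \frac{\sum_{m=1}^{M} C_{mi}}{\sum_{m=1}^{M} C_{m}}
       \left(
         \sum_{m=1}^{M}
         -\frac{C_{mi}}{\sum_{k=1}^{M} C_{ki}}
         \log_2 \frac{C_{mi}}{\sum_{k=1}^{M} C_{ki}}
       \right)
```

The first factor is the fraction of all sequences carrying nucleotide \(V_i\) at the site, and the bracketed term is the class entropy within that subset, in the same form as \(E(X)\).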
The information gain, \(IG_j\), of the aligned site \(j\) is the difference between the original information content \(E(C)\) of the data set and the amount of information needed to classify all the unclassified data left in the data set after applying site \(j\) for classification: \(IG_j = E(C) - R(j)\).
The features were ranked by the information gains, and then the top-ranked features were chosen for classification. A site with higher information gain would contribute more discriminatory power to the classification such that more samples could be distinguished by this site.
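The ranking procedure can be illustrated with a minimal sketch, assuming the standard entropy-based definition of information gain. The function names (`entropy`, `remainder`, `information_gain`, `rank_sites`) and the toy alignment are hypothetical and are not taken from the study's software or data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def remainder(column, labels):
    """Weighted class entropy left after splitting on the site's nucleotide."""
    groups = {}
    for nucleotide, label in zip(column, labels):
        groups.setdefault(nucleotide, []).append(label)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def information_gain(column, labels):
    """IG_j = E(C) - R(j) for one aligned site."""
    return entropy(labels) - remainder(column, labels)


def rank_sites(sequences, labels):
    """Return (site index, information gain) pairs, highest gain first."""
    gains = []
    for j in range(len(sequences[0])):
        column = [seq[j] for seq in sequences]
        gains.append((j, information_gain(column, labels)))
    return sorted(gains, key=lambda item: item[1], reverse=True)


# Hypothetical toy alignment: four aligned sequences, two classes.
sequences = ["ATGC", "ATGC", "ACGC", "ACGA"]
labels = ["HCC", "HCC", "non-HCC", "non-HCC"]
ranking = rank_sites(sequences, labels)
print(ranking)
```

In this toy example the second site (index 1) separates the two classes completely and is ranked first, the last site is only partly informative, and the two invariant sites have zero gain.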