This article shows how to use .NET Core to periodically fetch articles from a website and send them to a mailbox.
A tool that runs continuously needs logging, so I use NLog, which has a very handy log archiving feature. HTTP requests can fail because of network problems, so I use Polly to retry them. HtmlAgilityPack parses the web pages, which requires some familiarity with XPath. The components are listed below:
Component | Purpose | GitHub |
---|---|---|
NLog | Logging | https://github.com/NLog/NLog |
Polly | Retrying failed HTTP requests | https://github.com/App-vNext/Polly |
HtmlAgilityPack | HTML parsing | https://github.com/zzzprojects/html-agility-pack |
MailKit | Sending email | https://github.com/jstedfast/MailKit |
If you are not familiar with a component, its GitHub page has the documentation.
Reference article
https://www.jb51.net/article/112595.htm
Fetching and parsing the cnblogs homepage
I use HttpWebRequest for the HTTP requests. Here is the small wrapper class I wrote around it:
    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    namespace CnBlogSubscribeTool
    {
        /// <summary>
        /// Simple Http Request Class
        /// .NET Framework >= 4.0
        /// Author:stulzq
        /// CreatedTime:2017-12-12 15:54:47
        /// </summary>
        public class HttpUtil
        {
            static HttpUtil()
            {
                //Set connection limit ,Default limit is 2
                ServicePointManager.DefaultConnectionLimit = 1024;
            }

            /// <summary>
            /// Default Timeout 20s
            /// </summary>
            public static int DefaultTimeout = 20000;

            /// <summary>
            /// Is Auto Redirect
            /// </summary>
            public static bool DefalutAllowAutoRedirect = true;

            /// <summary>
            /// Default Encoding
            /// </summary>
            public static Encoding DefaultEncoding = Encoding.UTF8;

            /// <summary>
            /// Default UserAgent
            /// </summary>
            public static string DefaultUserAgent =
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36";

            /// <summary>
            /// Default Referer
            /// </summary>
            public static string DefaultReferer = "";

            /// <summary>
            /// httpget request
            /// </summary>
            /// <param name="url">Internet Address</param>
            /// <returns>string</returns>
            public static string GetString(string url)
            {
                var stream = GetStream(url);
                string result;
                using (StreamReader sr = new StreamReader(stream))
                {
                    result = sr.ReadToEnd();
                }
                return result;
            }

            /// <summary>
            /// httppost request
            /// </summary>
            /// <param name="url">Internet Address</param>
            /// <param name="postData">Post request data</param>
            /// <returns>string</returns>
            public static string PostString(string url, string postData)
            {
                var stream = PostStream(url, postData);
                string result;
                using (StreamReader sr = new StreamReader(stream))
                {
                    result = sr.ReadToEnd();
                }
                return result;
            }

            /// <summary>
            /// Create Response
            /// </summary>
            /// <param name="url"></param>
            /// <param name="post">Is post Request</param>
            /// <param name="postData">Post request data</param>
            /// <returns></returns>
            public static WebResponse CreateResponse(string url, bool post, string postData = "")
            {
                var httpWebRequest = WebRequest.CreateHttp(url);
                httpWebRequest.Timeout = DefaultTimeout;
                httpWebRequest.AllowAutoRedirect = DefalutAllowAutoRedirect;
                httpWebRequest.UserAgent = DefaultUserAgent;
                httpWebRequest.Referer = DefaultReferer;
                if (post)
                {
                    var data = DefaultEncoding.GetBytes(postData);
                    httpWebRequest.Method = "POST";
                    httpWebRequest.ContentType = "application/x-www-form-urlencoded;charset=utf-8";
                    httpWebRequest.ContentLength = data.Length;
                    using (var stream = httpWebRequest.GetRequestStream())
                    {
                        stream.Write(data, 0, data.Length);
                    }
                }

                try
                {
                    var response = httpWebRequest.GetResponse();
                    return response;
                }
                catch (Exception e)
                {
                    throw new Exception(
                        string.Format("Request error,url:{0},IsPost:{1},Data:{2},Message:{3}",
                            url, post, postData, e.Message), e);
                }
            }

            /// <summary>
            /// http get request
            /// </summary>
            /// <param name="url"></param>
            /// <returns>Response Stream</returns>
            public static Stream GetStream(string url)
            {
                var stream = CreateResponse(url, false).GetResponseStream();
                if (stream == null)
                {
                    throw new Exception("Response error,the response stream is null");
                }
                else
                {
                    return stream;
                }
            }

            /// <summary>
            /// http post request
            /// </summary>
            /// <param name="url"></param>
            /// <param name="postData">post data</param>
            /// <returns>Response Stream</returns>
            public static Stream PostStream(string url, string postData)
            {
                var stream = CreateResponse(url, true, postData).GetResponseStream();
                if (stream == null)
                {
                    throw new Exception("Response error,the response stream is null");
                }
                else
                {
                    return stream;
                }
            }
        }
    }
Fetching the homepage data
    string html = HttpUtil.GetString("https://www.cnblogs.com");
Parsing the data
We now have the HTML, but how do we extract the information we need (article title, URL, summary, author, publish time)? This is where HtmlAgilityPack comes in: it is a component that parses web pages using XPath.
Load the HTML fetched earlier:
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
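Looking at the page source, all the information for one article sits inside a div with class post_item. The fields we need are in its div with class post_item_body, so we first select all of those: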
    // Select the data item of every article (SelectNodes returns null when nothing matches)
    var itemBodys = doc.DocumentNode.SelectNodes("//div[@class='post_item_body']");
Continuing the analysis: the title is in the a tag under the h4 tag inside the div with class post_item_body, the summary is in the p tag with class post_item_summary, and the publish time and author are in the div with class post_item_foot. With that, we can extract the data we want:
    // Requires: using System; using System.Text.RegularExpressions;
    foreach (var itemBody in itemBodys)
    {
        // Title element
        var titleElem = itemBody.SelectSingleNode("h4/a");
        // Title text
        var title = titleElem?.InnerText;
        // Article URL
        var url = titleElem?.Attributes["href"]?.Value;

        // Summary element
        var summaryElem = itemBody.SelectSingleNode("p[@class='post_item_summary']");
        // Summary text with line breaks removed
        var summary = summaryElem?.InnerText.Replace("\r\n", "").Trim();

        // Footer element of the data item
        var footElem = itemBody.SelectSingleNode("div[@class='post_item_foot']");
        // Author
        var author = footElem?.SelectSingleNode("a")?.InnerText;
        // Publish time; fall back to an empty string because Regex.Match throws on null input
        var publishTime = Regex.Match(footElem?.InnerText ?? "", "\\d+-\\d+-\\d+ \\d+:\\d+").Value;

        Console.WriteLine($"Title: {title}");
        Console.WriteLine($"URL: {url}");
        Console.WriteLine($"Summary: {summary}");
        Console.WriteLine($"Author: {author}");
        Console.WriteLine($"Publish time: {publishTime}");
        Console.WriteLine("--------------separator---------------");
    }
Running it prints these fields for every article, so the extraction works.
Now we define a Blog class to hold them.
    public class Blog
    {
        /// <summary>
        /// Title
        /// </summary>
        public string Title { get; set; }

        /// <summary>
        /// Article URL
        /// </summary>
        public string Url { get; set; }

        /// <summary>
        /// Summary
        /// </summary>
        public string Summary { get; set; }

        /// <summary>
        /// Author
        /// </summary>
        public string Author { get; set; }

        /// <summary>
        /// Publish time
        /// </summary>
        public DateTime PublishTime { get; set; }
    }
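The loop above produces the publish time as a string, while Blog.PublishTime is a DateTime. The following is a minimal sketch of how the extracted strings might be turned into a Blog object. BlogFactory and its Create method are hypothetical names, and the "yyyy-MM-dd HH:mm" format is an assumption based on the regular expression used above. The Blog class is repeated so the sketch compiles on its own.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public class Blog
{
    public string Title { get; set; }
    public string Url { get; set; }
    public string Summary { get; set; }
    public string Author { get; set; }
    public DateTime PublishTime { get; set; }
}

// Hypothetical helper, not part of the original tool.
public static class BlogFactory
{
    // footText is the inner text of the post_item_foot element.
    public static Blog Create(string title, string url, string summary,
                              string author, string footText)
    {
        var match = Regex.Match(footText ?? string.Empty, @"\d+-\d+-\d+ \d+:\d+");

        DateTime publishTime;
        // Assumed format: "yyyy-MM-dd HH:mm". Anything else falls back to DateTime.MinValue.
        if (!match.Success ||
            !DateTime.TryParseExact(match.Value, "yyyy-MM-dd HH:mm",
                CultureInfo.InvariantCulture, DateTimeStyles.None, out publishTime))
        {
            publishTime = DateTime.MinValue;
        }

        return new Blog
        {
            Title = title,
            Url = url,
            Summary = summary,
            Author = author,
            PublishTime = publishTime
        };
    }
}
```

Inside the foreach loop this could be called as BlogFactory.Create(title, url, summary, author, footElem?.InnerText), collecting the results in a List<Blog>.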
Retrying failed HTTP requests
We use Polly to retry an HTTP request when it fails, configured for 3 retries.
    // Initialize the retry policy (despite the field name, it retries 3 times)
    _retryTwoTimesPolicy = Policy
        .Handle<Exception>()
        .Retry(3, (ex, count) =>
        {
            _logger.Error("Execution failed! Retry {0}", count);
            _logger.Error("Exception from {0}", ex.GetType().Name);
        });
Testing it:
When an exception is thrown, Polly retries three times. If all three retries fail, it gives up and the exception propagates to the caller.
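To make the retry semantics concrete, here is a hand-written sketch of the same behaviour without Polly: one initial attempt plus up to three retries, with a callback before each retry. RetrySketch and ExecuteWithRetry are hypothetical names used only for illustration; this is not Polly's implementation. With Polly itself, the equivalent is to run the work through the policy's Execute method.

```csharp
using System;

// Hypothetical illustration of "retry N times, then give up".
public static class RetrySketch
{
    public static T ExecuteWithRetry<T>(Func<T> action, int retryCount,
                                        Action<Exception, int> onRetry)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return action();
            }
            // Only catch while retries remain; the last failure propagates to the caller.
            catch (Exception ex) when (attempt < retryCount)
            {
                onRetry(ex, attempt + 1); // retry counter starts at 1
            }
        }
    }
}
```

With retryCount set to 3, an action that always throws is invoked four times in total, and the fourth exception reaches the caller.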
Sending email
Email is sent with MailKit, an excellent cross-platform library that supports IMAP, POP3 and SMTP. Below is a helper class I wrote, based on posts shared earlier by other cnblogs users:
    using System.Collections.Generic;
    using CnBlogSubscribeTool.Config;
    using MailKit.Net.Smtp;
    using MimeKit;

    namespace CnBlogSubscribeTool
    {
        /// <summary>
        /// send email
        /// </summary>
        public class MailUtil
        {
            private static bool SendMail(MimeMessage mailMessage, MailConfig config)
            {
                // Dispose the client when done; exceptions propagate to the caller as before
                using (var smtpClient = new SmtpClient())
                {
                    smtpClient.Timeout = 10 * 1000; // timeout in milliseconds
                    // Connect to the remote SMTP server (no transport encryption)
                    smtpClient.Connect(config.Host, config.Port, MailKit.Security.SecureSocketOptions.None);
                    smtpClient.Authenticate(config.Address, config.Password);
                    smtpClient.Send(mailMessage); // send the message
                    smtpClient.Disconnect(true);
                    return true;
                }
            }

            /// <summary>
            /// Send an email
            /// </summary>
            /// <param name="config">Mail configuration</param>
            /// <param name="receives">Recipients</param>
            /// <param name="sender">Reply-to address</param>
            /// <param name="subject">Subject</param>
            /// <param name="body">HTML body</param>
            /// <param name="attachments">Attachment content</param>
            /// <param name="fileName">Attachment file name</param>
            /// <returns></returns>
            public static bool SendMail(MailConfig config, List<string> receives, string sender,
                string subject, string body, byte[] attachments = null, string fileName = "")
            {
                var fromMailAddress = new MailboxAddress(config.Name, config.Address);
                var mailMessage = new MimeMessage();
                mailMessage.From.Add(fromMailAddress);

                foreach (var add in receives)
                {
                    // Two-argument constructor with an empty display name
                    var toMailAddress = new MailboxAddress(string.Empty, add);
                    mailMessage.To.Add(toMailAddress);
                }

                if (!string.IsNullOrEmpty(sender))
                {
                    var replyTo = new MailboxAddress(config.Name, sender);
                    mailMessage.ReplyTo.Add(replyTo);
                }

                var bodyBuilder = new BodyBuilder() { HtmlBody = body };

                // Attachment
                if (attachments != null)
                {
                    if (string.IsNullOrEmpty(fileName))
                    {
                        fileName = "未命名文件.txt";
                    }
                    var attachment = bodyBuilder.Attachments.Add(fileName, attachments);

                    // Avoid garbled Chinese file names
                    var charset = "GB18030";
                    attachment.ContentType.Parameters.Clear();
                    attachment.ContentDisposition.Parameters.Clear();
                    attachment.ContentType.Parameters.Add(charset, "name", fileName);
                    attachment.ContentDisposition.Parameters.Add(charset, "filename", fileName);

                    // Work around the 41-character file name limit
                    foreach (var param in attachment.ContentDisposition.Parameters)
                        param.EncodingMethod = ParameterEncodingMethod.Rfc2047;
                    foreach (var param in attachment.ContentType.Parameters)
                        param.EncodingMethod = ParameterEncodingMethod.Rfc2047;
                }

                mailMessage.Body = bodyBuilder.ToMessageBody();
                mailMessage.Subject = subject;

                return SendMail(mailMessage, config);
            }
        }
    }
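MailUtil depends on a MailConfig class from the CnBlogSubscribeTool.Config namespace that is not shown in this article. The sketch below is an assumption about its shape, inferred only from the members MailUtil reads (Host, Port, Name, Address, Password); the real class in the source code may differ.

```csharp
namespace CnBlogSubscribeTool.Config
{
    // Assumed shape, reconstructed from how MailUtil uses it.
    public class MailConfig
    {
        // SMTP server host name
        public string Host { get; set; }

        // SMTP server port
        public int Port { get; set; }

        // Display name of the sender
        public string Name { get; set; }

        // Sender address, also used as the SMTP login name
        public string Address { get; set; }

        // SMTP password
        public string Password { get; set; }
    }
}
```

A call would then look like MailUtil.SendMail(config, recipients, "", subject, htmlBody), where config is filled with the values for your own SMTP account.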
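Test run:
Notes
I will not go into detail here about scheduling the fetching and the sending, or about handling data when the program exits abnormally. If you are interested, read the source code (the GitHub address is at the end of the article).
Data is fetched incrementally. RSS is not used because the RSS feed updates rather slowly.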
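The article does not show how the incremental update is implemented. One simple way to get that effect, sketched here purely as an assumption, is to remember the URLs already collected and keep only the ones not seen before. SeenTracker and FilterNew are hypothetical names.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of incremental collection keyed by article URL.
public class SeenTracker
{
    private readonly HashSet<string> _seenUrls = new HashSet<string>();

    // Returns only the URLs that have not been returned by an earlier call.
    public List<string> FilterNew(IEnumerable<string> urls)
    {
        var fresh = new List<string>();
        foreach (var url in urls)
        {
            // HashSet.Add returns false when the URL is already present
            if (!string.IsNullOrEmpty(url) && _seenUrls.Add(url))
            {
                fresh.Add(url);
            }
        }
        return fresh;
    }
}
```

Each fetch of the homepage would pass its parsed URLs through FilterNew and store only the returned entries.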
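Screenshot of the complete program running:
Each time an email is sent, the program sets the recorded time to 9:00 today. After every fetch it checks whether the current time minus the recorded time is at least 24 hours. If so, it sends the email and updates the recorded time.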
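The timing rule described above can be sketched as follows. SendSchedule, IsDue and MarkSent are hypothetical names, and passing the current time in as a parameter is a choice made here so the logic is easy to check; the real program may structure this differently.

```csharp
using System;

// Hypothetical sketch of the "send once the record is 24 hours old" rule.
public class SendSchedule
{
    public DateTime RecordTime { get; private set; }

    public SendSchedule(DateTime recordTime)
    {
        RecordTime = recordTime;
    }

    // True when at least 24 hours have passed since the recorded time.
    public bool IsDue(DateTime now)
    {
        return now - RecordTime >= TimeSpan.FromHours(24);
    }

    // After sending, move the recorded time to 9:00 on the day of sending.
    public void MarkSent(DateTime now)
    {
        RecordTime = now.Date.AddHours(9);
    }
}
```

Because the record is snapped back to 9:00 rather than to the actual send time, the next email becomes due at 9:00 the following day even if this one went out a few minutes late.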
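Screenshot of the received email:
The email subject in the screenshot says the 13th while the content is from the 14th. That is because, for demonstration purposes, I copied today's data (the 14th) into the data for the 13th, so do not be misled by it.
An attachment is also included to make collecting and organizing the articles easier: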