How to Parse HTML using C#

6 Aug 2014 | CPOL | 2 min read
Extract information from any website you want

Introduction

Many websites publish an RSS feed, which we can parse to get the latest news. Some sites, however, do not provide one, so we have to parse the site's HTML directly.

You can download this sample here.

Using the Code

First, add a reference to the Html Agility Pack (http://htmlagilitypack.codeplex.com/). You can install it from NuGet in Visual Studio.

P.S.: On Windows Phone, this DLL causes some problems, so you must also add two more DLL files, System.Net.Http and System.Xml.XPath. Both are also available on NuGet.

We create a new function that takes the address of the website you want to parse as its parameter:

C#
Parsing("http://www.mytek.tn/");

Then we send a request to the website to download the page's HTML:

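// Requires: using System.Net; using System.Net.Http; using System.Text; using HtmlAgilityPack;
HttpClient http = new HttpClient();
var response = await http.GetByteArrayAsync(website);
// Decode the full byte array using the page's encoding (utf-8 here).
String source = Encoding.GetEncoding("utf-8").GetString(response, 0, response.Length);
// Decode HTML entities such as &amp; before loading the document.
source = WebUtility.HtmlDecode(source);
HtmlDocument resultat = new HtmlDocument();
resultat.LoadHtml(source);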

P.S.: Pay attention to the encoding, because each website declares its own. This example uses utf-8; you can find the value in the charset attribute of the page's HTML.
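Besides the charset attribute in the HTML, servers often report the encoding in the Content-Type response header. The following is a minimal sketch of reading it from there instead of hard-coding "utf-8". The names CharsetHelper, PickEncoding and DownloadAsync are hypothetical, and falling back to UTF-8 when the charset is missing or unknown is an assumption of this sketch, not something the original sample does.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class CharsetHelper
{
    // Returns the encoding named by a charset value such as "utf-8".
    // Assumption of this sketch: fall back to UTF-8 when the name is
    // missing or not supported on the current platform.
    public static Encoding PickEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;
        try
        {
            // Some servers quote the value, e.g. charset="utf-8".
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    // Downloads a page and decodes it with the charset from the Content-Type header.
    public static async Task<string> DownloadAsync(HttpClient http, string website)
    {
        using (HttpResponseMessage response = await http.GetAsync(website))
        {
            response.EnsureSuccessStatusCode();
            byte[] bytes = await response.Content.ReadAsByteArrayAsync();
            string charset = response.Content.Headers.ContentType?.CharSet;
            return PickEncoding(charset).GetString(bytes, 0, bytes.Length);
        }
    }
}
```

With this helper, the download step could become string source = await CharsetHelper.DownloadAsync(http, website);. The sketch does not read the charset from the HTML itself, so a page that declares its encoding only in a meta tag would still be decoded as UTF-8.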

Image 1

After that, we inspect the element that we want to parse and note its id or class, so that we can retrieve it easily.

Image 2

As the picture shows, we want to parse the information about these devices, which are all wrapped in a ul. First, though, we must find the ancestor div that has an id or a class; in this example, the div has a class named block_content.

So we first narrow the HTML down to the content of this div, and then get all the li tags that contain the information we want.

// Keep only the div elements whose class attribute contains "block_content".
List<HtmlNode> toftitle = resultat.DocumentNode.Descendants("div")
    .Where(x => x.GetAttributeValue("class", "").Contains("block_content"))
    .ToList();
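The same filter can also be written as an XPath query. This is a minimal sketch, assuming the Html Agility Pack's SelectNodes method; the names XPathFilterSketch and CountBlocks are hypothetical, and the HTML used in the usage note is made up for illustration. As with Contains in the LINQ version, the XPath contains() function matches a substring of the class attribute.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathFilterSketch
{
    // Counts the div elements whose class attribute contains "block_content".
    public static int CountBlocks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection blocks =
            doc.DocumentNode.SelectNodes("//div[contains(@class, 'block_content')]");

        return blocks == null ? 0 : blocks.Count;
    }
}
```

For example, XPathFilterSketch.CountBlocks("<div class='block_content'></div><div class='other'></div>") finds one matching div. The null check matters because a query with no matches would otherwise cause a NullReferenceException.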

After each filter, it is a good idea to set a breakpoint and verify the result.

Image 3

As a result, we get 11 divs that have a class named block_content, so you should check which one contains the information we want. In this example, it is the item at index 6.
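Instead of stepping through the list in the debugger, you can also summarize each matching div in code. The following is a rough sketch under that assumption; BlockInspectorSketch and Summarize are hypothetical names, and counting li elements is just one possible hint for spotting the block that holds the data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class BlockInspectorSketch
{
    // Returns one line per matching div: its index and the number of <li> it contains.
    public static List<string> Summarize(HtmlDocument doc)
    {
        List<HtmlNode> blocks = doc.DocumentNode.Descendants("div")
            .Where(d => d.GetAttributeValue("class", "").Contains("block_content"))
            .ToList();

        var lines = new List<string>();
        for (int i = 0; i < blocks.Count; i++)
        {
            int liCount = blocks[i].Descendants("li").Count();
            lines.Add("[" + i + "] li=" + liCount);
        }
        return lines;
    }
}
```

Writing each returned line to the output window shows at a glance which index holds a list of items, which is the index to use in the next step.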

// Take the div at index 6 and list its <li> items.
var li = toftitle[6].Descendants("li").ToList();
foreach (var item in li)
{
  // First <a>, <img> and <h5> inside the item; First() throws if one is missing.
  var link = item.Descendants("a").First().GetAttributeValue("href", null);
  var img = item.Descendants("img").First().GetAttributeValue("src", null);
  var title = item.Descendants("h5").First().InnerText;
}

Inside each li item, we get the link, the image and the title.
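The loop above assumes that every li contains an a, an img and an h5, and it fails on the first item that lacks one of them. The following is a minimal sketch of a more tolerant version that skips incomplete items and collects the results. ParsedItem, ItemExtractorSketch and Extract are hypothetical names, and skipping incomplete items (rather than reporting them) is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical container for one parsed item.
public class ParsedItem
{
    public string Link;
    public string Image;
    public string Title;
}

public static class ItemExtractorSketch
{
    // Reads link, image and title from every <li> under the given node,
    // skipping items that lack an <a>, <img> or <h5>.
    public static List<ParsedItem> Extract(HtmlNode block)
    {
        var items = new List<ParsedItem>();
        foreach (HtmlNode li in block.Descendants("li"))
        {
            HtmlNode a = li.Descendants("a").FirstOrDefault();
            HtmlNode img = li.Descendants("img").FirstOrDefault();
            HtmlNode h5 = li.Descendants("h5").FirstOrDefault();
            if (a == null || img == null || h5 == null)
                continue;

            items.Add(new ParsedItem
            {
                Link = a.GetAttributeValue("href", ""),
                Image = img.GetAttributeValue("src", ""),
                Title = h5.InnerText.Trim()
            });
        }
        return items;
    }
}
```

In the article's example, this could be called as ItemExtractorSketch.Extract(toftitle[6]).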

  • Descendants returns all the tags with the specified name inside the item.
  • GetAttributeValue returns the value of an attribute of the tag.
  • InnerText returns the text between the tags.
  • InnerHtml returns the HTML between the tags.
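The four members listed above can be tried out on a small HTML fragment. This is a minimal sketch; NodeMembersSketch and Describe are hypothetical names, and the fragment in the usage note is made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class NodeMembersSketch
{
    // Returns the href, inner text and inner HTML of the first <a> inside the first <li>.
    public static string[] Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants: all tags with the given name below a node.
        HtmlNode li = doc.DocumentNode.Descendants("li").First();
        HtmlNode a = li.Descendants("a").First();

        return new[]
        {
            a.GetAttributeValue("href", ""), // value of the href attribute
            a.InnerText,                     // text only, without the tags
            a.InnerHtml                      // markup between <a> and </a>
        };
    }
}
```

For the fragment <ul><li><a href="/phone"><h5>Phone</h5></a></li></ul>, Describe returns "/phone", "Phone" and "<h5>Phone</h5>", which shows the difference between InnerText and InnerHtml.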

History

The difficulty of parsing HTML depends on the structure of the website.

Image 4

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Junior) Microsoft Student Partners
Tunisia
I am 23 years old and study Software Engineering. I am enthusiastic about all Microsoft technologies.
Since joining the Microsoft community as a Microsoft Student Partner, I have developed many apps for the Windows and Windows Phone platforms. Now it is time to share what I have learned here, and I am ready to help everyone.
You can contact me at any time (anisderbel@outlook.com)
