Monday, November 24, 2014

How email service providers tag spam

Updated from original posted on http://castechbytes.blogspot.com/2009/09/how-email-service-providers-tag-spam.html

As a former technical support representative for an Internet Service Provider (ISP) in the United States, I have come to understand how their mail servers work in reducing spam. Of course, the one I would like to discuss is just the basic manner of recognizing spam. Providers may have other means of identifying spam.

First: Let's have a working definition of "spam" as unsolicited email--be it business, advocacy or commercial email. I will not talk about IM spam or spam in other internet-based communication channels.

Identifying Spam


Spam identification is the key to spam reduction, so most email service providers actually focus on that aspect. As soon as they determine that an email is spam, what they do with that it is just a matter of creativity on the email address owner or the email service provider.

There are two key strategies in spam identification: Keyword monitoring and Spam tagging.

Keyword monitoring is (of course) monitoring key words in the email's subject line. Based on the ISP's pre-studied list of words that are most likely to contain spam, ISPs control the distribution of messages that have these words. Instead of going to the recipients' Inbox, they go to the Junk or Bulk mail folder.

This means that if you send email with words such as "tits" or "dicks" in your email, this will probably be sent to the Spam folder of the recipient if the email server has that approach in spam detection.

While this seems logical, the downside of this is not supported by certain groups as this same rule may be applied to the words such as "breasts" or "penis" which may be required in medical professions. Useful emails may be forwarded to the spam folder without knowledge of the recipient (obviously) even though they intend to receive it.

Spam tagging refers to the use of the "Report Spam" or "This is spam" feature of your email service provider to report that the email you are reading is actually a spam. Most of the email service providers have this feature, but the level of effect of the feature varies.

With this feature, what happens usually is that the email you use is transferred to your Spam folder. What you may not know, however, is that what you actually do is not just tag the email as spam, but tag the sender as a spammer.

This indicates that if a certain number of users tag that email sender as a spammer, the email server of the recipient will automatically tag it as spam, resulting to the future emails of that sender to be forwarded to the Spam folder, even for other receivers.

Of course, these days, a combination of the two, together with artificial intelligence, is used to identify spam.

Cost of Spam to Organizations


Spam costs a lot of money for organizations as they have to deal with wasted resources (bandwidth) and time (for deleting spam), not to mention privacy and other security issues, so proper identification of spam is important. Estimates in 2012 indicate that spam costs internet users globally about $20 billion to $50 billion. On other hand, you should be careful with tagging an email as "spam" if you are in an organization where the sender is sending relevant information that only you do not appreciate receiving. You may be costing the inconvenience not only to the sender but also to the other recipients.

If you discover that you  received an email but it is in the Spam folder when it is actually not spam, you can do any of the following:
  1. Use the "Unmark as spam" or similar feature
  2. Add the recipient's email address to your address book. This adds a rule to your email that the sender is a valid contact.

Monday, November 17, 2014

Using Pivot Table to summarize large numerical data in Microsoft Excel

Sometimes, you have large amounts of data in a spreadsheet. The data are related in such a way that separate cells are linked through a corresponding attribute, which appears in an adjacent cell or within the same row. Take for example the table here:

Let's say you want to have a smaller summary table that tells you how much capacity does each of the three grids (Luzon, Visayas, and Mindanao) have. If this is in a spreadsheet application like Microsoft Excel or Google Sheets, you can use Sort to sort the table by the grid column, then insert a row at the last entry for each group. In the corresponding column, you can use the function SUM. You can then create another sheet in the same file so that you can create a table with five rows and two columns. First row is header, last row is totals. It might look like this:

GridDependable Capacity
Luzon
Visayas
Mindanao
Total

However, what if you want to have another way of summarizing it? For example, you want to have a column that says, for each grid, show how much capacity is for Fossil, and how much is for Renewable? Something like this:

FossilRenewableSub-total
Luzon------
Visayas------
Mindanao------
Sub-total------

The task becomes a little bit harder. Of course, this can become even more difficult if you want to report per energy resource, sub-categories of Fossil and Renewable, like this:

FossilRenewableTotal
CoalOilNatural GasFossil Sub-totalHydroGeothermalSolarWindRenewable Sub-total
Luzon--------------------
Visayas--------------------
Mindanao--------------------
Sub-total--------------------

Of course, we are arranging this based on how you think is the logical way of presenting the data. What if you presented it to your supervisor, who says you should present it this way instead:

LuzonVisayasMindanaoSub-total
FossilCoal--------
Oil--------
Natural Gas--------
Fossil Sub-total--------
RenewableHydro--------
Geothermal--------
Solar--------
Wind--------
Renewable Sub-total--------
GRAND TOTAL--------

While theoretically, it should not be all that difficult, considering it is just a matter of re-organizing the data that were already calculated, there is actually a feature in spreadsheet applications that allows you to do automatic summarizing of numerical data (in sums and averages, among others) and arrange them by certain identifiable categories. It is called the Pivot Table tool.

With the advent of mobile version of applications, such as Android app of Google's Sheets, Pivot Table is one of those features that have yet to be made available to the mobile version. I guess it is a bit too sophisticated to be done in a mobile operating system. Nonetheless, in Google Sheets, if you have created it already in the web (desktop) version, you can make changes to the source sheet, and it will be updated in the pivot table sheet.

Pivot Table is a tool in spreadsheet applications (available in Microsoft Excel, LibreOffice Calc, and Google Sheets. It allows you to summarize large sets of numerical data by using columns that have repeating or grouping data. In the table, it refers to the columns on "Grid," "Source Type," "Clean or Not Clean," and "Source."

Before using Pivot Table


Pivot Table will use the values in the first row of the worksheet as labels of the output summary table (called the Pivot Table Report), so you must make sure that the data in the worksheet source are appropriately labeled.

Also, you have to ensure that the data is "clean," meaning you are sure that each cell contains the data that should be there. Your source table must be organized in such a way that each row is one record. We will discuss entity and data cleaning in the future, but for now, just ensure that one row corresponds to one record that is unique. Among others, this means that the data in that cell corresponds to that record (row) in that kind of variable (label or column).

Also, if you intend to make mathematical calculations, make sure that all numbers are formatted as numbers. If you have a cell wherein you manually inserted a comma, for example, your spreadsheet application will convert that number into text. If you need to use a comma for thousands and millions places, you can do this by using the cell formatting option of your spreadsheet program.

In this post, we will discuss how to do that in Excel. I will have separate posts for Calc and Google Sheets.

With Microsoft Excel


If you are using Microsoft Excel 2010, you can create a pivot table for your data by going to the Insert tab (Just make sure your cursor is in a cell within the table you want to summarize.), and then click Pivot Table.



You will be asked where you will source your data and where you want to put the Pivot Table Report. It should look like this:


Assuming you choose "New Worksheet," (and I recommend this, so that you can change the arrangement of the source table without affecting the output Pivot Table Report), and after you click "OK," you will go to the new sheet, with the column labels on the right side of the screen. Put a check on the labels that you want to include in your report.




As you see above, merely checking the labels will just add the labels in the "Row Labels" field. You have to move the labels to where you want it to display. In our example, we want each Grid to have a separate column, so we will move "Grid" to the "Column Labels" field, like so (You can drag and drop the label):

(In Excel, it can automatically detect that you want to add the values in the "Dependable Capacity" column, because the contents of that label are numerical values. However, you can also tell your spreadsheet application to summarize by using Average, or Count, Maximum, or Minimum, among others, depending on your spreadsheet application.)


If you want to refine your output by removing some specific values (e.g., you want to remove power plants that are in Mindanao), you can click the downward arrow. It will display a menu with finer controls.


Microsoft Excel allows you to produce a Pivot Table Report using multiple sheets, but I think that is rarely used in teaching. In case you want to know how to do that, you can read the official support article here.

Use of Pivot Table


Aside from summarizing, Pivot Table allows you to see possible relationships among the variables in your data because they are summarized. While the apparent relationships may not have the same confidence level as using statistical tools, such as Spearman Rho correlation, it can give you an idea which variables to test.

Wanna give it a try? You can test your knowledge by downloading clean data from The World Bank's Open Data initiative, available here.

Read how to use pivot table in LibreOffice Calc
Read how to use pivot table in Google Sheets

Using Pivot Table in Google Sheets

Sometimes, you have large amounts of data in a spreadsheet. The data are related in such a way that separate cells are linked through a corresponding attribute, which appears in an adjacent cell or within the same row. Take for example the table here:

Let's say you want to have a smaller summary table that tells you how much capacity does each of the three grids (Luzon, Visayas, and Mindanao) have. If this is in a spreadsheet application like Microsoft Excel or Google Sheets, you can use Sort to sort the table by the grid column, then insert a row at the last entry for each group. In the corresponding column, you can use the function SUM. You can then create another sheet in the same file so that you can create a table with five rows and two columns. First row is header, last row is totals. It might look like this:

GridDependable Capacity
Luzon
Visayas
Mindanao
Total

However, what if you want to have another way of summarizing it? For example, you want to have a column that says, for each grid, show how much capacity is for Fossil, and how much is for Renewable? Something like this:

FossilRenewableSub-total
Luzon------
Visayas------
Mindanao------
Sub-total------

The task becomes a little bit harder. Of course, this can become even more difficult if you want to report per energy resource, sub-categories of Fossil and Renewable, like this:

FossilRenewableTotal
CoalOilNatural GasFossil Sub-totalHydroGeothermalSolarWindRenewable Sub-total
Luzon--------------------
Visayas--------------------
Mindanao--------------------
Sub-total--------------------

Of course, we are arranging this based on how you think is the logical way of presenting the data. What if you presented it to your supervisor, who says you should present it this way instead:

LuzonVisayasMindanaoSub-total
FossilCoal--------
Oil--------
Natural Gas--------
Fossil Sub-total--------
RenewableHydro--------
Geothermal--------
Solar--------
Wind--------
Renewable Sub-total--------
GRAND TOTAL--------

While theoretically, it should not be all that difficult, considering it is just a matter of re-organizing the data that were already calculated, there is actually a feature in spreadsheet applications that allows you to do automatic summarizing of numerical data (in sums and averages, among others) and arrange them by certain identifiable categories. It is called the Pivot Table tool.

With the advent of mobile version of applications, such as Android app of Google's Sheets, Pivot Table is one of those features that have yet to be made available to the mobile version. I guess it is a bit too sophisticated to be done in a mobile operating system. Nonetheless, in Google Sheets, if you have created it already in the web (desktop) version, you can make changes to the source sheet, and it will be updated in the pivot table sheet.

Pivot Table is a tool in spreadsheet applications (available in Microsoft Excel, LibreOffice Calc, and Google Sheets. It allows you to summarize large sets of numerical data by using columns that have repeating or grouping data. In the table, it refers to the columns on "Grid," "Source Type," "Clean or Not Clean," and "Source."

Before using Pivot Table


Pivot Table will use the values in the first row of the worksheet as labels of the output summary table (called the Pivot Table Report), so you must make sure that the data in the worksheet source are appropriately labeled.

Also, you have to ensure that the data is "clean," meaning you are sure that each cell contains the data that should be there. Your source table must be organized in such a way that each row is one record. We will discuss entity and data cleaning in the future, but for now, just ensure that one row corresponds to one record that is unique. Among others, this means that the data in that cell corresponds to that record (row) in that kind of variable (label or column).

Also, if you intend to make mathematical calculations, make sure that all numbers are formatted as numbers. If you have a cell wherein you manually inserted a comma, for example, your spreadsheet application will convert that number into text. If you need to use a comma for thousands and millions places, you can do this by using the cell formatting option of your spreadsheet program.

In this post, we will discuss how to do that with Google Sheets.

With Google Sheets


To create Pivot Table in Google Sheets, you can go to the Data menu and select "Pivot table report." Again, to streamline the process, make sure your cursor is in the table source of your desired Report.


This will show you the screen similar to Calc, where you can choose which labels you want to place where. Unfortunately, unlike Calc, you cannot drag and drop the labels. You have to click "Add Field" in each "Field" to use it as a label. Like in the previous spreadsheet applications, you can use multiple levels per field.







Like in the previous spreadsheet programs, you can filter out data depending on the available labels by adding a field in the "Filter" field.



Click here for more references about Google Sheet's Pivot Table.

Use of Pivot Table


Aside from summarizing, Pivot Table allows you to see possible relationships among the variables in your data because they are summarized. While the apparent relationships may not have the same confidence level as using statistical tools, such as Spearman Rho correlation, it can give you an idea which variables to test.

Wanna give it a try? You can test your knowledge by downloading clean data from The World Bank's Open Data initiative, available here.

Using Pivot Table in Calc

Sometimes, you have large amounts of data in a spreadsheet. The data are related in such a way that separate cells are linked through a corresponding attribute, which appears in an adjacent cell or within the same row. Take for example the table here:

Let's say you want to have a smaller summary table that tells you how much capacity does each of the three grids (Luzon, Visayas, and Mindanao) have. If this is in a spreadsheet application like Microsoft Excel or Google Sheets, you can use Sort to sort the table by the grid column, then insert a row at the last entry for each group. In the corresponding column, you can use the function SUM. You can then create another sheet in the same file so that you can create a table with five rows and two columns. First row is header, last row is totals. It might look like this:

GridDependable Capacity
Luzon
Visayas
Mindanao
Total

However, what if you want to have another way of summarizing it? For example, you want to have a column that says, for each grid, show how much capacity is for Fossil, and how much is for Renewable? Something like this:

FossilRenewableSub-total
Luzon------
Visayas------
Mindanao------
Sub-total------

The task becomes a little bit harder. Of course, this can become even more difficult if you want to report per energy resource, sub-categories of Fossil and Renewable, like this:

FossilRenewableTotal
CoalOilNatural GasFossil Sub-totalHydroGeothermalSolarWindRenewable Sub-total
Luzon--------------------
Visayas--------------------
Mindanao--------------------
Sub-total--------------------

Of course, we are arranging this based on how you think is the logical way of presenting the data. What if you presented it to your supervisor, who says you should present it this way instead:

LuzonVisayasMindanaoSub-total
FossilCoal--------
Oil--------
Natural Gas--------
Fossil Sub-total--------
RenewableHydro--------
Geothermal--------
Solar--------
Wind--------
Renewable Sub-total--------
GRAND TOTAL--------

While theoretically, it should not be all that difficult, considering it is just a matter of re-organizing the data that were already calculated, there is actually a feature in spreadsheet applications that allows you to do automatic summarizing of numerical data (in sums and averages, among others) and arrange them by certain identifiable categories. It is called the Pivot Table tool.

With the advent of mobile version of applications, such as Android app of Google's Sheets, Pivot Table is one of those features that have yet to be made available to the mobile version. I guess it is a bit too sophisticated to be done in a mobile operating system. Nonetheless, in Google Sheets, if you have created it already in the web (desktop) version, you can make changes to the source sheet, and it will be updated in the pivot table sheet.

Pivot Table is a tool in spreadsheet applications (available in Microsoft Excel, LibreOffice Calc, and Google Sheets. It allows you to summarize large sets of numerical data by using columns that have repeating or grouping data. In the table, it refers to the columns on "Grid," "Source Type," "Clean or Not Clean," and "Source."

Before using Pivot Table


Pivot Table will use the values in the first row of the worksheet as labels of the output summary table (called the Pivot Table Report), so you must make sure that the data in the worksheet source are appropriately labeled.

Also, you have to ensure that the data is "clean," meaning you are sure that each cell contains the data that should be there. Your source table must be organized in such a way that each row is one record. We will discuss entity and data cleaning in the future, but for now, just ensure that one row corresponds to one record that is unique. Among others, this means that the data in that cell corresponds to that record (row) in that kind of variable (label or column).

Also, if you intend to make mathematical calculations, make sure that all numbers are formatted as numbers. If you have a cell wherein you manually inserted a comma, for example, your spreadsheet application will convert that number into text. If you need to use a comma for thousands and millions places, you can do this by using the cell formatting option of your spreadsheet program.

In this post, we will use LibreOffice or OpenOffice.Org Calc.

With LibreOffice Calc

For LibreOffice Calc, you can access the Pivot Table function by going to Data menu and selecting Pivot Table, like below. To streamline the process, put the cursor in the table where you know you will source the Pivot Table Report.


Like in Excel, Calc will ask where you want to get the data from:


Calc will then show you where you can place your labels. You can drag and drop the labels.

(In Calc, you have to make sure that you select the appropriate label in the "Data Fields." This is where the values will be summarized, which maybe summation, counting, maximum, or minimum. To change what type of operation you want to do, click the column label in Data Fields, choose, and then click "OK.")


 If you have multiple level labels (e.g., in rows, you will display first the Type of Source, then Source), you have to arrange it like with the higher level positioned first. Notice below that in the "Row Fields," the label "Source Type" is positioned first.


Your Pivot Table Report should look like this:


Like in Excel, if you want to refine the Report, just click on the downward arrows. You can also tell Calc to include only certain records based on certain value parameters, like include only records where the Dependable Capacity is greater than 10. To do that, just click the A1 Cell, and it will open up following Filter dialog box:


Unfortunately, if you make any changes in the source table, it won't automatically update the Pivot Table Report. But it can easily be updated. Just right-click any cell within the Report, and select "Refresh."

Using that process, you can also see the "Edit Layout." That will allow you to change the placement of labels, in case your supervisor wants to re-arrange the columns.

Click here for more references about LibreOffice Calc's Pivot Table.

On how to do this with Google Sheets, click here.

Use of Pivot Table


Aside from summarizing, Pivot Table allows you to see possible relationships among the variables in your data because they are summarized. While the apparent relationships may not have the same confidence level as using statistical tools, such as Spearman Rho correlation, it can give you an idea which variables to test.

Wanna give it a try? You can test your knowledge by downloading clean data from The World Bank's Open Data initiative, available here.